Fine-Tuning GPT-4 and Claude: When to Fine-Tune vs RAG 2026
Make the right architectural decision: fine-tuning or RAG for your LLM application
Fine-Tuning vs RAG: The 2026 Decision Guide
One of the most common architectural decisions in AI applications is whether to fine-tune a model or use retrieval-augmented generation (RAG). Here's a framework with working examples.
The Core Difference
RAG: Retrieve relevant documents at query time and inject them into the context
Fine-tuning: Bake knowledge and behavior directly into model weights
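To make the "inject into context" half of RAG concrete, here is a minimal sketch of prompt assembly. The function name `build_rag_prompt`, the character budget, and the instruction wording are illustrative assumptions, not part of any library; retrieval itself is assumed to have already happened.

```python
def build_rag_prompt(question: str, documents: list[str], max_chars: int = 2000) -> str:
    """Inject retrieved documents into the prompt, numbered so answers can cite them."""
    parts = []
    used = 0
    for i, doc in enumerate(documents, start=1):
        entry = f"[{i}] {doc}"
        if used + len(entry) > max_chars:
            break  # stop once the (assumed) context budget is exhausted
        parts.append(entry)
        used += len(entry)
    context = "\n".join(parts)
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


docs = [
    "Refunds are available within 30 days.",
    "Password reset links expire in 24 hours.",
]
prompt = build_rag_prompt("What is the refund window?", docs)
print(prompt)
```

Nothing here touches model weights: updating the knowledge means changing `docs`, which is why RAG suits fast-moving data.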
Decision Framework
Use RAG when:
✅ Knowledge updates frequently (daily/weekly)
✅ Need to cite sources
✅ Data is confidential (you don't want it baked into model weights)
✅ You need to implement and iterate quickly
✅ Knowledge base is large (>1M tokens)
Use Fine-tuning when:
✅ Need specific output format/style consistently
✅ Knowledge is stable (legal codes, product catalog)
✅ Need to reduce prompt length (fewer in-prompt examples needed)
✅ Want to remove or reduce the model's default behaviors
✅ Latency is critical (no retrieval step)
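The two checklists above can be sketched as a simple vote count. This is only a rough heuristic: the field names, the equal weighting of every criterion, and the "hybrid" outcome when both lists apply are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Signals that favor RAG
    knowledge_changes_often: bool = False
    needs_citations: bool = False
    data_is_confidential: bool = False
    needs_fast_iteration: bool = False
    large_knowledge_base: bool = False
    # Signals that favor fine-tuning
    needs_consistent_format: bool = False
    knowledge_is_stable: bool = False
    needs_short_prompts: bool = False
    wants_fewer_default_behaviors: bool = False
    latency_critical: bool = False


def recommend(req: Requirements) -> str:
    """Count the checklist items on each side and suggest an approach."""
    rag_votes = sum([
        req.knowledge_changes_often, req.needs_citations,
        req.data_is_confidential, req.needs_fast_iteration,
        req.large_knowledge_base,
    ])
    ft_votes = sum([
        req.needs_consistent_format, req.knowledge_is_stable,
        req.needs_short_prompts, req.wants_fewer_default_behaviors,
        req.latency_critical,
    ])
    if rag_votes and ft_votes:
        return "hybrid"
    if ft_votes:
        return "fine-tuning"
    return "rag"  # default when nothing points to fine-tuning


print(recommend(Requirements(knowledge_changes_often=True, needs_citations=True)))
print(recommend(Requirements(needs_consistent_format=True, needs_citations=True)))
```

In practice some criteria will dominate (confidentiality, for example), so treat the output as a conversation starter rather than a verdict.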
Fine-Tuning GPT-4o Mini
python
import json
import time

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Prepare training data (JSONL format: one chat example per line)
training_examples = [
    {
        'messages': [
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'How do I reset my password?'},
            {'role': 'assistant', 'content': 'Go to Settings > Security > Reset Password. Enter your email and check your inbox for the reset link. Link expires in 24 hours.'}
        ]
    },
    {
        'messages': [
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'What is your refund policy?'},
            {'role': 'assistant', 'content': 'We offer 30-day refunds for all plans. Contact support@techcorp.com with your order number. Refunds process in 3-5 business days.'}
        ]
    }
    # ... add more: the API rejects files with fewer than 10 examples,
    # and 50-100 is a sensible starting point
]

# Save training data
with open('training_data.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

# Upload training file
with open('training_data.jsonl', 'rb') as f:
    response = client.files.create(file=f, purpose='fine-tune')
file_id = response.id
print(f'File uploaded: {file_id}')

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model='gpt-4o-mini',  # A dated snapshot such as gpt-4o-mini-2024-07-18 may be required
    hyperparameters={
        'n_epochs': 3,
        'batch_size': 4,
        'learning_rate_multiplier': 1.8
    },
    suffix='customer-support-v1'
)
print(f'Job created: {job.id}')

# Monitor training (poll until the job reaches a terminal state)
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f'Status: {status.status}')
    if status.status in ['succeeded', 'failed', 'cancelled']:
        break
    time.sleep(30)

if status.status == 'succeeded':
    print(f'Model ID: {status.fine_tuned_model}')
    # Use fine-tuned model with the same system prompt it was trained on
    response = client.chat.completions.create(
        model=status.fine_tuned_model,
        messages=[
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'How do I cancel my subscription?'}
        ]
    )
    print(response.choices[0].message.content)
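Because a malformed training file only fails after upload, it can be worth checking the JSONL locally first. The validator below is a minimal sketch: the function name, the set of accepted roles (plain chat only, no tool calls), and the default minimum of 10 examples are assumptions to adjust for your data.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumption: plain chat data only


def validate_training_lines(lines, min_examples=10):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing non-empty 'messages' list")
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append(f"line {number}: every message must be an object")
            continue
        roles = [m.get("role") for m in messages]
        if any(role not in ALLOWED_ROLES for role in roles):
            problems.append(f"line {number}: unexpected role in {roles}")
        if "assistant" not in roles:
            problems.append(f"line {number}: no assistant message to learn from")
        if any(not isinstance(m.get("content"), str) for m in messages):
            problems.append(f"line {number}: every message needs string 'content'")
    if count < min_examples:
        problems.append(f"only {count} examples; at least {min_examples} expected")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security > Reset Password."},
]})
print(validate_training_lines([good] * 10))
print(validate_training_lines([good, "{not json"]))
```

To check a real file, pass the result of `open('training_data.jsonl').readlines()` and upload only when the returned list is empty.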
LoRA Fine-Tuning with Hugging Face
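python
import torch
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Load base model
MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Padding is needed for batching
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank - higher = more trainable parameters
    lora_alpha=32,  # Scaling factor
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on these four projections this prints roughly:
# trainable params: 13.6M || all params: 8.0B || trainable%: 0.17

# Prepare dataset
# Each example needs 'system', 'input', and 'output' keys
# (a different shape from the OpenAI 'messages' format)
training_examples = [
    {
        'system': 'You are a customer support agent for TechCorp. Always be concise.',
        'input': 'What is your refund policy?',
        'output': 'We offer 30-day refunds for all plans. Contact support@techcorp.com with your order number. Refunds process in 3-5 business days.'
    },
    # ... more examples
]

def format_prompt(example):
    # Llama 3 chat format: a blank line follows each role header
    return {'text': f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{example['system']}<|eot_id|><|start_header_id|>user<|end_header_id|>

{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""}

dataset = Dataset.from_list(training_examples).map(format_prompt)

# Training
# Note: this call matches older TRL releases. Newer releases move
# dataset_text_field into SFTConfig and rename tokenizer to processing_class.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field='text',
    args=TrainingArguments(
        output_dir='./lora-model',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # Effective batch size: 4 x 4 = 16 per device
        learning_rate=2e-4,
        bf16=True,
        save_strategy='epoch'
    )
)
trainer.train()

# Save the LoRA adapter (adapter weights only, not the base model)
model.save_pretrained('./lora-adapter')
# For inference, load the base model and attach the adapter, e.g. with
# peft.PeftModel.from_pretrained(base_model, './lora-adapter'). Calling
# merge_and_unload() on the result folds the adapter into the base weights.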
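The "higher rank = more parameters" comment can be checked with a little arithmetic. LoRA adds two small matrices per adapted weight, A (rank × d_in) and B (d_out × rank), so each adapted matrix contributes rank × (d_in + d_out) trainable parameters. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128, so `k_proj` and `v_proj` project to 1024); the function name is illustrative.

```python
def lora_param_count(rank: int, num_layers: int, shapes: dict) -> int:
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# (d_in, d_out) for each targeted projection; assumed from the model's config
LLAMA_3_1_8B_ATTENTION = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),  # grouped-query attention: 8 KV heads x 128 dims
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

for rank in (8, 16, 32):
    count = lora_param_count(rank, num_layers=32, shapes=LLAMA_3_1_8B_ATTENTION)
    print(f"r={rank}: {count:,} trainable parameters")
```

For r=16 this gives 13,631,488 parameters, well under 1% of the roughly 8B base weights, and the count scales linearly with the rank.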
Cost Comparison
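A rough way to compare the two approaches is to model monthly spend as a fixed cost plus a per-query token cost. The sketch below is a back-of-the-envelope calculator, not real pricing: every number in the example is a made-up placeholder, and the `Option` class and its fields are illustrative assumptions. Substitute your provider's current rates and your own traffic.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    prompt_tokens: int          # average input tokens per query
    output_tokens: int          # average output tokens per query
    input_price_per_m: float    # USD per 1M input tokens (placeholder)
    output_price_per_m: float   # USD per 1M output tokens (placeholder)
    fixed_monthly: float = 0.0  # e.g. vector store hosting or amortized training

    def monthly_cost(self, queries: int) -> float:
        per_query = (
            self.prompt_tokens * self.input_price_per_m
            + self.output_tokens * self.output_price_per_m
        ) / 1_000_000
        return self.fixed_monthly + queries * per_query


# Placeholder figures only: RAG sends long prompts (retrieved context),
# while the fine-tuned model sends short prompts at a higher token price.
rag = Option("rag", prompt_tokens=3000, output_tokens=200,
             input_price_per_m=1.0, output_price_per_m=2.0, fixed_monthly=50.0)
fine_tuned = Option("fine-tuned", prompt_tokens=300, output_tokens=200,
                    input_price_per_m=2.0, output_price_per_m=4.0, fixed_monthly=100.0)

for queries in (1_000, 100_000):
    for option in (rag, fine_tuned):
        print(f"{queries:>7} queries  {option.name:<10} ${option.monthly_cost(queries):,.2f}")
```

The general shape matters more than the figures: fixed costs dominate at low volume, and per-query prompt length dominates at high volume.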
Hybrid Approach (2026 Best Practice)
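python
from openai import OpenAI

client = OpenAI()

def call_openai(model: str, system: str, user: str) -> str:
    """Send one system + user turn and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{'role': 'system', 'content': system},
                  {'role': 'user', 'content': user}]
    )
    return response.choices[0].message.content

class HybridRAGFinetuned:
    """
    Fine-tuned model for style/format + RAG for dynamic knowledge
    Best of both worlds: consistent output style + current information
    """
    def __init__(self, fine_tuned_model_id: str, vector_store):
        self.model = fine_tuned_model_id  # Fine-tuned for output style
        # For dynamic knowledge: any object with search(query, k),
        # assumed here to return a list of text chunks
        self.vector_store = vector_store

    def query(self, question: str) -> str:
        # Retrieve relevant context (RAG)
        chunks = self.vector_store.search(question, k=5)
        context = '\n\n'.join(chunks)
        # Use fine-tuned model (consistent output format)
        return call_openai(
            model=self.model,  # Fine-tuned model
            system='You are a TechCorp support agent.',
            user=f'Context: {context}\n\nQuestion: {question}'
        )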
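The hybrid class above relies on a vector store with a `search(question, k)` method. As a stand-in for experimenting locally, here is a minimal in-memory sketch. It is an assumption-heavy toy: it ranks documents by bag-of-words cosine similarity rather than learned embeddings, and the class and helper names are hypothetical. A production system would use an embedding model and a real vector database.

```python
import math
import re
from collections import Counter


def _vectorize(text: str) -> Counter:
    """Bag-of-words term counts (a crude substitute for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self):
        self._docs = []  # list of (text, vector) pairs

    def add(self, text: str) -> None:
        self._docs.append((text, _vectorize(text)))

    def search(self, query: str, k: int = 5) -> list[str]:
        """Return up to k documents, most similar first, skipping zero-score matches."""
        query_vec = _vectorize(query)
        scored = [(_cosine(query_vec, vec), text) for text, vec in self._docs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:k] if score > 0]


store = InMemoryVectorStore()
store.add("Go to Settings > Security > Reset Password to reset your password.")
store.add("We offer 30-day refunds for all plans.")
store.add("Refunds process in 3-5 business days.")
print(store.search("How do I reset my password?", k=1))
```

Because knowledge lives in the store rather than the weights, adding or replacing a document takes effect on the next query, with no retraining.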
Conclusion
For most 2026 applications, start with RAG: it's faster to implement and update. Add fine-tuning when you need a consistent output format or tone, or when you've identified specific behaviors to reinforce. The hybrid approach combines the two, using a fine-tuned model for style and retrieval for current knowledge.