Fine-Tuning GPT-4 and Claude: When to Fine-Tune vs RAG 2026
Make the right architectural decision: fine-tuning or RAG for your LLM application
Comprehensive guide to deciding between fine-tuning and RAG for LLM applications. Covers fine-tuning GPT-4o mini, LoRA training with Hugging Face, cost comparison, and use case decision framework.
Fine-Tuning vs RAG: The 2026 Decision Guide
One of the most common architectural decisions in AI applications is whether to fine-tune a model or use retrieval-augmented generation (RAG). Here's a framework with real examples.
The Core Difference
RAG: Retrieve relevant documents at query time and inject them into the context.
Fine-tuning: Bake knowledge and behavior directly into model weights.
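To make the RAG half of that contrast concrete, here is a minimal sketch of "retrieve at query time, inject into context". The word-overlap scoring and the names `retrieve`, `build_rag_prompt`, and `DOCS` are illustrative assumptions; a real system would typically use embeddings and a vector database instead.

```python
def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by how many query words they share (toy scoring)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc.lower().split())), doc)
        for doc in documents
    ]
    # Keep only documents that share at least one word, best match first.
    ranked = sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [doc for _, doc in ranked[:k]]


def build_rag_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Inject the retrieved documents into the prompt at query time."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base; updating it changes answers with no retraining.
DOCS = [
    "Refunds are available within 30 days of purchase.",
    "Password resets are done under Settings > Security.",
    "Support hours are 9am to 5pm on weekdays.",
]

prompt = build_rag_prompt("How do refunds work?", DOCS, k=1)
print(prompt)
```

The model weights never change here: the knowledge lives in `DOCS`, which is why RAG suits data that updates often.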
Decision Framework
Use RAG when:
✅ Knowledge updates frequently (daily/weekly)
✅ Need to cite sources
✅ Data is confidential (you don't want it baked into model weights)
✅ Need to implement and iterate quickly
✅ Knowledge base is large (>1M tokens)
Use fine-tuning when:
✅ Need specific output format/style consistently
✅ Knowledge is stable (legal codes, product catalog)
✅ Need to reduce prompt length (fewer in-prompt examples needed)
✅ Want to remove/reduce model's default behaviors
✅ Latency is critical (no retrieval step)
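The two checklists above can be sketched as a small helper. The field names and the tie-breaking rule are assumptions made for illustration: with no fine-tuning signal it defaults to RAG, and when both lists have hits it suggests combining the two. Treat it as a conversation starter, not a substitute for judgment.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Signals that favor RAG
    knowledge_changes_often: bool = False
    needs_citations: bool = False
    data_is_confidential: bool = False
    needs_fast_iteration: bool = False
    knowledge_base_over_1m_tokens: bool = False
    # Signals that favor fine-tuning
    needs_consistent_format: bool = False
    knowledge_is_stable: bool = False
    needs_shorter_prompts: bool = False
    needs_behavior_change: bool = False
    latency_critical: bool = False


RAG_SIGNALS = (
    "knowledge_changes_often",
    "needs_citations",
    "data_is_confidential",
    "needs_fast_iteration",
    "knowledge_base_over_1m_tokens",
)
FINE_TUNE_SIGNALS = (
    "needs_consistent_format",
    "knowledge_is_stable",
    "needs_shorter_prompts",
    "needs_behavior_change",
    "latency_critical",
)


def recommend(req: Requirements) -> str:
    """Return 'rag', 'fine-tuning', or 'hybrid' from the checklist hits."""
    rag_votes = sum(getattr(req, name) for name in RAG_SIGNALS)
    ft_votes = sum(getattr(req, name) for name in FINE_TUNE_SIGNALS)
    if ft_votes == 0:
        return "rag"  # Assumed default when nothing calls for fine-tuning
    if rag_votes == 0:
        return "fine-tuning"
    return "hybrid"


print(recommend(Requirements(needs_citations=True, latency_critical=True)))
```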
Fine-Tuning GPT-4o Mini
python
from openai import OpenAI
import json

client = OpenAI()
# Prepare training data (JSONL format)
training_examples = [
    {
        'messages': [
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'How do I reset my password?'},
            {'role': 'assistant', 'content': 'Go to Settings > Security > Reset Password. Enter your email and check your inbox for the reset link. Link expires in 24 hours.'}
        ]
    },
    {
        'messages': [
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'What is your refund policy?'},
            {'role': 'assistant', 'content': 'We offer 30-day refunds for all plans. Contact support@techcorp.com with your order number. Refunds process in 3-5 business days.'}
        ]
    }
    # ... add more examples: the API requires at least 10, and 50-100 is a good starting point
]

# Save training data
with open('training_data.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

# Upload training file
with open('training_data.jsonl', 'rb') as f:
    response = client.files.create(file=f, purpose='fine-tune')
file_id = response.id
print(f'File uploaded: {file_id}')
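# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    # Fine-tuning generally expects a dated snapshot rather than the bare alias;
    # check the current list of fine-tunable models before running this
    model='gpt-4o-mini-2024-07-18',
    hyperparameters={
        'n_epochs': 3,
        'batch_size': 4,
        'learning_rate_multiplier': 1.8
    },
    suffix='customer-support-v1'
)
print(f'Job created: {job.id}')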
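# Monitor training
import time

while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f'Status: {status.status}')
    # 'cancelled' is also a terminal state; without it the loop would never exit
    if status.status in ['succeeded', 'failed', 'cancelled']:
        break
    time.sleep(30)

if status.status == 'succeeded':
    print(f'Model ID: {status.fine_tuned_model}')
    # Use fine-tuned model
    response = client.chat.completions.create(
        model=status.fine_tuned_model,
        messages=[
            # Reuse the training system prompt so inference matches training
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'How do I cancel my subscription?'}
        ]
    )
    print(response.choices[0].message.content)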
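Upload errors are easier to debug locally, so it can help to check the JSONL file before sending it. The sketch below validates the simple chat format used in the training examples above. The helper name `validate_chat_jsonl` and the restriction to `system`/`user`/`assistant` roles with string content are assumptions for this simple case; datasets with tool calls would need looser checks.

```python
import json

# Assumption: plain chat data only (no tool or function messages).
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing or empty 'messages' list")
            continue
        for message in messages:
            if not isinstance(message, dict) or message.get("role") not in VALID_ROLES:
                problems.append(f"line {number}: unexpected role in {message!r}")
            elif not isinstance(message.get("content"), str):
                problems.append(f"line {number}: 'content' must be a string")
        # Each training example should end with the reply the model should learn.
        last = messages[-1]
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append(f"line {number}: last message should come from the assistant")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "How do I reset my password?"},
    {"role": "assistant", "content": "Go to Settings > Security > Reset Password."},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "Hello?"}]})

for problem in validate_chat_jsonl([good, bad, "not json"]):
    print(problem)
```

To check a file on disk, pass `open('training_data.jsonl').read().splitlines()` as `lines`.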
LoRA Fine-Tuning with Hugging Face
python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import Dataset
import torch

# Load base model
MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,              # Rank: higher = more trainable parameters
    lora_alpha=32,     # Scaling factor
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on the four attention projections, expect roughly 13.6M trainable
# parameters out of ~8B (about 0.17%)

# Prepare dataset
# Expects dicts with 'system', 'input', 'output' keys (not the OpenAI 'messages' format)
def format_prompt(example):
    return {'text': f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{example['system']}<|eot_id|><|start_header_id|>user<|end_header_id|>

{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{example['output']}<|eot_id|>"""}

dataset = Dataset.from_list(training_examples).map(format_prompt)
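
# Training
# Note: these SFTTrainer arguments match older TRL releases; newer releases move
# dataset_text_field into SFTConfig and rename tokenizer to processing_class
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field='text',
    args=TrainingArguments(
        output_dir='./lora-model',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        save_strategy='epoch'
    )
)

trainer.train()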
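
# Save the LoRA adapter (adapter weights only; nothing is merged at this point)
model.save_pretrained('./lora-adapter')
# For inference, load the base model and attach the adapter with
# peft.PeftModel.from_pretrained(base_model, './lora-adapter');
# call merge_and_unload() on the result to fold the adapter into the base weights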
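To see how `r` and `target_modules` drive the trainable-parameter count, the arithmetic can be sketched directly: each adapted weight of shape (d_in, d_out) gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The layer shapes below are assumptions based on the published Llama 3.1 8B architecture (hidden size 4096, 32 layers, 8 key/value heads of dimension 128, MLP size 14336). The helper name `lora_trainable_params` is made up for this example.

```python
def lora_trainable_params(
    rank: int,
    module_shapes: dict[str, tuple[int, int]],
    num_layers: int,
) -> int:
    """Count LoRA parameters: A is (d_in x r) and B is (r x d_out) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())
    return per_layer * num_layers


# Assumed shapes (d_in, d_out) for one Llama 3.1 8B transformer layer.
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),  # Grouped-query attention: 8 KV heads x 128 dims
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
MLP_SHAPES = {
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

attention_only = lora_trainable_params(16, ATTENTION_SHAPES, num_layers=32)
all_linear = lora_trainable_params(32, {**ATTENTION_SHAPES, **MLP_SHAPES}, num_layers=32)

print(f"r=16, attention only: {attention_only:,}")  # 13,631,488
print(f"r=32, all linear:     {all_linear:,}")      # 83,886,080
```

Under these assumptions, doubling the rank and adapting the MLP projections as well grows the adapter about sixfold, which is worth weighing against your GPU memory budget.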
Cost Comparison
Hybrid Approach (2026 Best Practice)
python
# Note: VectorStore and call_openai are not defined in this snippet;
# substitute your own retrieval backend and chat-completions wrapper
class HybridRAGFinetuned:
    """
    Fine-tuned model for style/format + RAG for dynamic knowledge
    Best of both worlds: consistent output style + current information
    """

    def __init__(self, fine_tuned_model_id: str):
        self.model = fine_tuned_model_id    # Fine-tuned for output style
        self.vector_store = VectorStore()   # For dynamic knowledge

    def query(self, question: str) -> str:
        # Retrieve relevant context (RAG)
        context = self.vector_store.search(question, k=5)
        # Use fine-tuned model (consistent output format)
        return call_openai(
            model=self.model,  # Fine-tuned model
            system='You are a TechCorp support agent.',
            user=f'Context: {context}\n\nQuestion: {question}'
        )
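The class above relies on `VectorStore` and `call_openai`, which the snippet does not define. One way to make the same idea runnable and testable is to pass both pieces in as plain callables. Everything named below (`HybridPipeline`, `fake_retrieve`, `fake_complete`, the `ft:example-model` ID) is a hypothetical stand-in, not part of any library.

```python
from typing import Callable

# (question, k) -> list of retrieved passages
Retriever = Callable[[str, int], list[str]]
# (model_id, system_prompt, user_prompt) -> model reply
Completer = Callable[[str, str, str], str]


class HybridPipeline:
    """Fine-tuned model for style, retrieval for current knowledge."""

    def __init__(self, model_id: str, retrieve: Retriever, complete: Completer):
        self.model_id = model_id
        self.retrieve = retrieve
        self.complete = complete

    def query(self, question: str, k: int = 5) -> str:
        # Retrieval step: fetch up-to-date passages at query time.
        passages = self.retrieve(question, k)
        context = "\n".join(passages)
        # Generation step: the fine-tuned model supplies tone and format.
        return self.complete(
            self.model_id,
            "You are a TechCorp support agent.",
            f"Context: {context}\n\nQuestion: {question}",
        )


# Stand-ins so the sketch runs without a vector database or an API key.
def fake_retrieve(question: str, k: int) -> list[str]:
    return ["Refunds are available within 30 days."][:k]


def fake_complete(model_id: str, system: str, user: str) -> str:
    return f"[{model_id}] {user}"


pipeline = HybridPipeline("ft:example-model", fake_retrieve, fake_complete)
answer = pipeline.query("What is the refund policy?")
print(answer)
```

In production you would swap `fake_retrieve` for a real vector search and `fake_complete` for a call to the chat completions API with your fine-tuned model ID; the pipeline itself stays unchanged.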
Conclusion
For most 2026 applications, start with RAG — it's faster to implement and update. Add fine-tuning when you need consistent output format, tone, or when you've identified specific behaviors to reinforce. The hybrid approach gives you the best of both worlds.