Fine-Tuning GPT-4 and Claude: When to Fine-Tune vs RAG 2026

Make the right architectural decision: fine-tuning or RAG for your LLM application

Fine-Tuning vs RAG: The 2026 Decision Guide

One of the most common architectural decisions in AI applications is whether to fine-tune a model or use retrieval-augmented generation (RAG). Here's a framework with worked examples.

The Core Difference

RAG: Retrieve relevant documents at query time and inject them into the context.
Fine-tuning: Bake knowledge and behavior directly into the model weights.

Decision Framework


Use RAG when:
✅ Knowledge updates frequently (daily/weekly)
✅ Need to cite sources
✅ Data is confidential (you don't want it in model weights)
✅ Need to implement and iterate quickly
✅ Knowledge base is large (>1M tokens)

Use Fine-tuning when:
✅ Need specific output format/style consistently
✅ Knowledge is stable (legal codes, product catalog)
✅ Need to reduce prompt length (fewer examples needed)
✅ Want to remove/reduce model's default behaviors
✅ Latency is critical (no retrieval step)
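
The checklist above can be turned into a rough scoring heuristic. The sketch below is only an illustration: the `Requirements` fields, the `recommend` function, and the rule that any mix of signals points to a hybrid setup are assumptions made for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical flags mirroring the checklist; all names are illustrative."""
    # Signals that lean toward RAG
    knowledge_changes_often: bool = False
    needs_citations: bool = False
    data_is_confidential: bool = False
    needs_fast_iteration: bool = False
    large_knowledge_base: bool = False
    # Signals that lean toward fine-tuning
    needs_consistent_format: bool = False
    knowledge_is_stable: bool = False
    needs_shorter_prompts: bool = False
    needs_behavior_change: bool = False
    latency_critical: bool = False


RAG_SIGNALS = (
    "knowledge_changes_often", "needs_citations", "data_is_confidential",
    "needs_fast_iteration", "large_knowledge_base",
)
FINE_TUNING_SIGNALS = (
    "needs_consistent_format", "knowledge_is_stable", "needs_shorter_prompts",
    "needs_behavior_change", "latency_critical",
)


def recommend(req: Requirements) -> str:
    """Return 'rag', 'fine-tuning', or 'hybrid' from the checked boxes."""
    rag_score = sum(getattr(req, name) for name in RAG_SIGNALS)
    ft_score = sum(getattr(req, name) for name in FINE_TUNING_SIGNALS)
    if rag_score and ft_score:
        return "hybrid"       # signals on both sides
    if ft_score:
        return "fine-tuning"  # only fine-tuning signals
    return "rag"              # only RAG signals, or nothing checked yet


print(recommend(Requirements(needs_citations=True, needs_consistent_format=True)))
```

Defaulting to RAG when nothing is checked reflects the article's advice to start with RAG; a real decision would weigh the signals rather than count them equally.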

Fine-Tuning GPT-4o Mini

python
from openai import OpenAI
import json
client = OpenAI()
# Prepare training data (JSONL format)
training_examples = [
    {
        'messages': [
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'How do I reset my password?'},
            {'role': 'assistant', 'content': 'Go to Settings > Security > Reset Password. Enter your email and check your inbox for the reset link. Link expires in 24 hours.'}
        ]
    },
    {
        'messages': [
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'What is your refund policy?'},
            {'role': 'assistant', 'content': 'We offer 30-day refunds for all plans. Contact support@techcorp.com with your order number. Refunds process in 3-5 business days.'}
        ]
    }
    # ... more examples (the API accepts as few as 10; 50-100 is a commonly recommended starting point)
]
# Save training data
with open('training_data.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')
# Upload training file
with open('training_data.jsonl', 'rb') as f:
    response = client.files.create(file=f, purpose='fine-tune')
    file_id = response.id
print(f'File uploaded: {file_id}')
# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model='gpt-4o-mini',  # the API may require a dated snapshot, e.g. 'gpt-4o-mini-2024-07-18'
    hyperparameters={
        'n_epochs': 3,
        'batch_size': 4,
        'learning_rate_multiplier': 1.8
    },
    suffix='customer-support-v1'
)
print(f'Job created: {job.id}')
# Monitor training
import time
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f'Status: {status.status}')
    # Stop polling on any terminal state, including a cancelled job
    if status.status in ['succeeded', 'failed', 'cancelled']:
        break
    time.sleep(30)

if status.status == 'succeeded':
    print(f'Model ID: {status.fine_tuned_model}')
    # Use fine-tuned model
    response = client.chat.completions.create(
        model=status.fine_tuned_model,
        messages=[{'role': 'user', 'content': 'How do I cancel my subscription?'}]
    )
    print(response.choices[0].message.content)
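
Before uploading, it can help to sanity-check the JSONL file locally. The sketch below is a hypothetical helper, not the API's own validator: the function names and the rules it applies (valid JSON per line, a non-empty `messages` list, one of three simple roles, non-empty string content, an assistant reply as the final turn) are assumptions that fit plain chat examples like the ones above.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def check_record(record):
    """Return a problem description for one parsed record, or None if it looks fine."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return "missing or empty 'messages' list"
    for msg in messages:
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            return "message with missing or unknown role"
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            return "message with empty or non-string content"
    if messages[-1]["role"] != "assistant":
        return "last message is not from the assistant"
    return None


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no issues found."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        problem = check_record(record)
        if problem:
            problems.append((lineno, problem))
    return problems
```

A file object can be passed directly, for example `validate_chat_jsonl(open('training_data.jsonl'))`, since iterating over it yields one line at a time.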

LoRA Fine-Tuning with Hugging Face

python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import Dataset
import torch
# Load base model
MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
# LoRA configuration
lora_config = LoraConfig(
    r=16,                       # Rank - higher = more parameters
    lora_alpha=32,              # Scaling factor
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
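# Prints trainable vs. total parameter counts. With r=16 on these four attention
# projections of an 8B Llama model, expect roughly 13.6M trainable parameters (about 0.17%).
# Prepare dataset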
def format_prompt(example):
    return {'text': f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{example['system']}<|eot_id|><|start_header_id|>user<|end_header_id|>
{example['input']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{example['output']}<|eot_id|>"""}
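# training_examples is assumed here to be a list of dicts with 'system', 'input',
# and 'output' keys (not the OpenAI 'messages' format used earlier)
dataset = Dataset.from_list(training_examples).map(format_prompt)
# Training
# Note: the tokenizer= and dataset_text_field= arguments follow older TRL releases;
# newer releases move dataset_text_field into SFTConfig and rename tokenizer to processing_class
trainer = SFTTrainer(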
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field='text',
    args=TrainingArguments(
        output_dir='./lora-model',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        save_strategy='epoch'
    )
)
trainer.train()
# Save the LoRA adapter (merging it into the base weights is a separate, optional step)
model.save_pretrained('./lora-adapter')
# For inference, load the base model and apply the adapter
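
The trainable-parameter count can be estimated by hand. Each adapted linear layer with input width d_in and output width d_out gains two low-rank matrices, which adds r * (d_in + d_out) parameters. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, 32 layers, and key/value projections of width 1024 because of grouped-query attention); the function and dictionary names are illustrative, and the shapes should be checked against the model's config before relying on the result.

```python
def lora_params(r, shapes, num_layers):
    """Count LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes for the four attention projections targeted above
LLAMA_8B_ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

trainable = lora_params(r=16, shapes=LLAMA_8B_ATTENTION_SHAPES, num_layers=32)
print(f"{trainable:,}")  # 13,631,488
```

Doubling `r` doubles the count, which is why rank is the main lever for adapter size; adding the MLP projections to `target_modules` would raise it further.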

Cost Comparison

| Method | Setup Cost | Per-Query Cost | Update Cost |
|---|---|---|---|
| RAG (Pinecone) | $200-500 | $0.003-0.01 | $0-50 |
| Fine-tune GPT-4o-mini | $50-200 | $0.001-0.005 | $50-200 |
| LoRA (Llama 3) | GPU rental $50-200 | $0 (self-hosted) | $50-200 |
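
To compare options for a specific workload, the three cost columns can be combined into a single total. The sketch below is a rough model under stated assumptions: it takes the midpoint of each range in the table, treats costs as linear in the number of queries and updates, and uses a hypothetical `CostProfile` class introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class CostProfile:
    setup: float       # one-time setup cost in USD
    per_query: float   # cost per query in USD
    per_update: float  # cost of one knowledge or model refresh in USD

    def total(self, queries: int, updates: int) -> float:
        """Linear cost model: setup + query volume + number of refreshes."""
        return self.setup + self.per_query * queries + self.per_update * updates


# Midpoints of the table's ranges (an assumption made for this example)
rag = CostProfile(setup=350, per_query=0.0065, per_update=25)
fine_tuned = CostProfile(setup=125, per_query=0.003, per_update=125)

for queries, updates in [(10_000, 4), (100_000, 1)]:
    rag_cost = rag.total(queries, updates)
    ft_cost = fine_tuned.total(queries, updates)
    cheaper = "RAG" if rag_cost < ft_cost else "fine-tuning"
    print(f"{queries:>7} queries, {updates} updates: "
          f"RAG ${rag_cost:,.2f} vs fine-tuned ${ft_cost:,.2f} -> {cheaper}")
```

With these assumed midpoints, frequent updates at modest volume favor RAG, while high query volume with rare updates favors the fine-tuned model. Plugging in your own quotes matters more than the specific numbers here.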

Hybrid Approach (2026 Best Practice)

python
class HybridRAGFinetuned:
    """
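    Fine-tuned model for style/format + RAG for dynamic knowledge.
    Best of both worlds: consistent output style + current information.
    VectorStore and call_openai are placeholders for your own retrieval
    and chat-completion helpers; they are not defined in this snippet.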
    """
    
    def __init__(self, fine_tuned_model_id: str):
        self.model = fine_tuned_model_id  # Fine-tuned for output style
        self.vector_store = VectorStore()  # For dynamic knowledge
    
    def query(self, question: str) -> str:
        # Retrieve relevant context (RAG)
        context = self.vector_store.search(question, k=5)
        
        # Use fine-tuned model (consistent output format)
        return call_openai(
            model=self.model,  # Fine-tuned model
            system='You are a TechCorp support agent.',
            user=f'Context: {context}\n\nQuestion: {question}'
        )
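
The class above relies on two names it does not define, `VectorStore` and `call_openai`. One possible set of stand-ins is sketched below so the snippet can be exercised end to end. Both are assumptions for illustration: the in-memory store ranks documents by simple word overlap rather than embeddings, purely to stay self-contained, and `call_openai` is assumed to be a thin wrapper around the Chat Completions call used earlier in this article.

```python
import re


class VectorStore:
    """Toy in-memory store: ranks documents by word overlap, not embeddings."""

    def __init__(self, documents=None):
        self.documents = list(documents or [])

    def add(self, text):
        self.documents.append(text)

    @staticmethod
    def _tokens(text):
        # Lowercase alphanumeric words as a crude stand-in for an embedding
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def search(self, query, k=5):
        query_tokens = self._tokens(query)
        scored = [(len(query_tokens & self._tokens(doc)), doc) for doc in self.documents]
        # Keep only documents sharing at least one word, best matches first
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc for _, doc in ranked[:k]]


def call_openai(model, system, user):
    """Hypothetical helper: send one system + user turn and return the reply text."""
    from openai import OpenAI  # imported lazily so the store can be used offline

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content
```

Because `search` returns a list, joining the results with newlines before placing them in the prompt will usually read better than interpolating the list directly. In production the toy store would be replaced by an embedding-based vector database.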

Conclusion

For most 2026 applications, start with RAG — it's faster to implement and update. Add fine-tuning when you need consistent output format, tone, or when you've identified specific behaviors to reinforce. The hybrid approach gives you the best of both worlds.
