Fine-Tuning GPT-4 and Claude: When to Fine-Tune vs RAG 2026

Make the right architectural decision: fine-tuning or RAG for your LLM application

A guide to deciding between fine-tuning and RAG for LLM applications. It covers fine-tuning GPT-4o mini, LoRA training with Hugging Face, a cost comparison, and a use-case decision framework.

Fine-Tuning vs RAG: The 2026 Decision Guide

One of the most common architectural decisions in AI applications is whether to fine-tune a model or use retrieval-augmented generation (RAG). Here's a framework with real examples.

The Core Difference

RAG: Retrieve relevant documents at query time and inject them into the context.
Fine-tuning: Bake knowledge and behavior directly into the model weights.
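To make the contrast concrete, here is a minimal sketch of how the request differs between the two approaches. The function names and the citation instruction are illustrative assumptions, not part of any library: with RAG the retrieved text travels in the prompt on every call, while a fine-tuned model receives only the question because the behavior already lives in its weights.

```python
from typing import Dict, List


def build_rag_messages(question: str, retrieved_docs: List[str]) -> List[Dict[str, str]]:
    """RAG: retrieved documents are injected into the context at query time."""
    # Number the documents so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return [
        {
            "role": "system",
            "content": "Answer using only the context below and cite sources by number.\n\n"
            + context,
        },
        {"role": "user", "content": question},
    ]


def build_finetuned_messages(question: str) -> List[Dict[str, str]]:
    """Fine-tuning: nothing is retrieved, so the prompt is just the question."""
    return [{"role": "user", "content": question}]
```

The RAG prompt grows with every retrieved document, which is where its extra per-query tokens and latency come from.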

Decision Framework


Use RAG when:
✅ Knowledge updates frequently (daily/weekly)
✅ You need to cite sources
✅ Data is confidential and should stay out of model weights
✅ You need to implement and iterate quickly
✅ Knowledge base is large (>1M tokens)

Use Fine-tuning when:
✅ You need a specific output format or style consistently
✅ Knowledge is stable (legal codes, product catalog)
✅ You need shorter prompts (fewer in-prompt examples)
✅ You want to remove or reduce the model's default behaviors
✅ Latency is critical (no retrieval step)
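The two checklists can be encoded as a small helper. This is a sketch under stated assumptions: the field names are invented for illustration, every signal carries equal weight, and defaulting to RAG when nothing applies is a simplification rather than a rule from any library.

```python
from dataclasses import dataclass

# Field names are hypothetical; each mirrors one checklist item.
RAG_SIGNALS = (
    "knowledge_changes_often",
    "needs_citations",
    "confidential_data",
    "fast_iteration",
    "large_knowledge_base",
)
FINE_TUNE_SIGNALS = (
    "strict_output_format",
    "stable_knowledge",
    "short_prompts",
    "suppress_default_behaviors",
    "latency_critical",
)


@dataclass(frozen=True)
class Requirements:
    knowledge_changes_often: bool = False
    needs_citations: bool = False
    confidential_data: bool = False
    fast_iteration: bool = False
    large_knowledge_base: bool = False
    strict_output_format: bool = False
    stable_knowledge: bool = False
    short_prompts: bool = False
    suppress_default_behaviors: bool = False
    latency_critical: bool = False


def recommend(req: Requirements) -> str:
    """Return 'rag', 'fine-tuning', or 'hybrid' from the checklist signals."""
    rag_hits = sum(getattr(req, name) for name in RAG_SIGNALS)
    ft_hits = sum(getattr(req, name) for name in FINE_TUNE_SIGNALS)
    if rag_hits and ft_hits:
        return "hybrid"  # signals on both sides: combine the approaches
    if ft_hits:
        return "fine-tuning"
    return "rag"  # assumed default when only RAG signals (or none) apply
```

For example, `recommend(Requirements(needs_citations=True, strict_output_format=True))` returns `'hybrid'`. In practice some signals will matter more than others, so treat the result as a starting point for discussion.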

Fine-Tuning GPT-4o Mini

python
import json
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = 'You are a customer support agent for TechCorp. Always be concise.'

# Prepare training data (JSONL format: one chat example per line)
training_examples = [
    {
        'messages': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': 'How do I reset my password?'},
            {'role': 'assistant', 'content': 'Go to Settings > Security > Reset Password. Enter your email and check your inbox for the reset link. Link expires in 24 hours.'},
        ]
    },
    {
        'messages': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': 'What is your refund policy?'},
            {'role': 'assistant', 'content': 'We offer 30-day refunds for all plans. Contact support@techcorp.com with your order number. Refunds process in 3-5 business days.'},
        ]
    },
    # ... add more examples: aim for 50-100 (the API rejects files with fewer than 10)
]

# Save training data
with open('training_data.jsonl', 'w', encoding='utf-8') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

# Upload training file
with open('training_data.jsonl', 'rb') as f:
    response = client.files.create(file=f, purpose='fine-tune')
file_id = response.id

print(f'File uploaded: {file_id}')

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model='gpt-4o-mini',  # the API may require a dated snapshot such as 'gpt-4o-mini-2024-07-18'
    hyperparameters={
        'n_epochs': 3,
        'batch_size': 4,
        'learning_rate_multiplier': 1.8,
    },
    suffix='customer-support-v1',
)

print(f'Job created: {job.id}')

# Monitor training until the job reaches a terminal state
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f'Status: {status.status}')
    if status.status in ('succeeded', 'failed', 'cancelled'):
        break
    time.sleep(30)

if status.status == 'succeeded':
    print(f'Model ID: {status.fine_tuned_model}')
    # Use the fine-tuned model with the same system prompt it was trained on
    response = client.chat.completions.create(
        model=status.fine_tuned_model,
        messages=[
            {'role': 'system', 'content': 'You are a customer support agent for TechCorp. Always be concise.'},
            {'role': 'user', 'content': 'How do I cancel my subscription?'},
        ],
    )
    print(response.choices[0].message.content)
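Because a malformed training file only fails after upload, it can help to check the JSONL locally first. The sketch below is an assumption-laden pre-flight check, not OpenAI's own validator: `validate_chat_jsonl` is a hypothetical helper, the allowed roles cover only the simple chat format used in this tutorial, and the default minimum of 10 examples reflects the documented API minimum at the time of writing.

```python
import json
from typing import Iterable, List

# Roles used by the simple chat format in this tutorial (tool roles are not covered).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines: Iterable[str], min_examples: int = 10) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []
    count = 0
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {line_no}: missing non-empty 'messages' list")
            continue
        well_formed = all(
            isinstance(m, dict)
            and m.get("role") in ALLOWED_ROLES
            and isinstance(m.get("content"), str)
            for m in messages
        )
        if not well_formed:
            problems.append(f"line {line_no}: each message needs a known role and string content")
        elif messages[-1]["role"] != "assistant":
            problems.append(f"line {line_no}: last message should come from the assistant")
    if count < min_examples:
        problems.append(f"only {count} examples, expected at least {min_examples}")
    return problems
```

A file object can be passed directly, for example `validate_chat_jsonl(open('training_data.jsonl', encoding='utf-8'))`, and the upload step skipped whenever the returned list is non-empty.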

LoRA Fine-Tuning with Hugging Face

python
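from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
import torch

# Load base model
MODEL = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank: higher = more trainable parameters
    lora_alpha=32,  # Scaling factor
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~13.6M || all params: ~8.04B || trainable%: ~0.17
# (r=16 on the four attention projections only)

# Prepare dataset: each record needs 'system', 'input', and 'output' keys
lora_examples = [
    {
        'system': 'You are a customer support agent for TechCorp. Always be concise.',
        'input': 'How do I reset my password?',
        'output': 'Go to Settings > Security > Reset Password. Enter your email and check your inbox for the reset link. Link expires in 24 hours.',
    },
    # ... more examples in the same shape
]

def format_prompt(example):
    # Llama 3 chat format. If your tokenizer already prepends <|begin_of_text|>,
    # drop it here to avoid a duplicate BOS token.
    return {'text': (
        '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n'
        f"{example['system']}<|eot_id|>"
        '<|start_header_id|>user<|end_header_id|>\n\n'
        f"{example['input']}<|eot_id|>"
        '<|start_header_id|>assistant<|end_header_id|>\n\n'
        f"{example['output']}<|eot_id|>"
    )}

dataset = Dataset.from_list(lora_examples).map(format_prompt)

# Training
# Note: newer TRL releases move dataset_text_field into SFTConfig and rename
# tokenizer to processing_class; adjust to match your installed version.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field='text',
    args=TrainingArguments(
        output_dir='./lora-model',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        save_strategy='epoch',
    ),
)

trainer.train()

# Save the adapter weights only (the base model is not duplicated)
model.save_pretrained('./lora-adapter')

# For inference, load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map='auto'
)
inference_model = PeftModel.from_pretrained(base_model, './lora-adapter')

# Optional: merge the adapter into the base weights for a standalone checkpoint
merged_model = inference_model.merge_and_unload()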
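The trainable-parameter count that `print_trainable_parameters()` reports can be estimated by hand, which is a useful sanity check on a LoRA config. Each adapted weight of shape (d_in, d_out) gains two low-rank matrices, A (d_in x r) and B (r x d_out), so it adds r * (d_in + d_out) parameters. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, 32 layers, key/value projection width 1024, MLP width 14336); `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
from typing import Dict, Tuple


def lora_trainable_params(
    rank: int, layer_shapes: Dict[str, Tuple[int, int]], num_layers: int
) -> int:
    """LoRA adds rank * (d_in + d_out) parameters per adapted weight matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# Assumed Llama 3.1 8B shapes as (d_in, d_out); k/v are narrower due to grouped-query attention.
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
MLP_SHAPES = {
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

# r=16 on the attention projections only, as in the config above
attention_only = lora_trainable_params(16, ATTENTION_SHAPES, num_layers=32)
print(f"{attention_only:,}")  # 13,631,488

# For comparison: r=32 on all seven linear projections
all_linear = lora_trainable_params(32, {**ATTENTION_SHAPES, **MLP_SHAPES}, num_layers=32)
print(f"{all_linear:,}")  # 83,886,080
```

Against roughly 8.03B base parameters, the attention-only r=16 setup trains about 0.17% of the model, while r=32 across all linear layers trains about 1%.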

Cost Comparison

| Method | Setup Cost | Per-Query Cost | Update Cost |
|---|---|---|---|
| RAG (Pinecone) | $200-500 | $0.003-0.01 | $0-50 |
| Fine-tune GPT-4o-mini | $50-200 | $0.001-0.005 | $50-200 |
| LoRA (Llama 3) | GPU rental $50-200 | $0 (self-hosted) | $50-200 |
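To see how these ranges play out for a given workload, total cost can be modeled as setup + queries x per-query cost + updates x update cost. The sketch below uses the midpoint of each range in the table as an illustrative assumption; `CostModel` and `cheapest` are hypothetical names, and the LoRA row ignores the hosting cost of serving a self-hosted model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    name: str
    setup: float       # one-time cost in USD
    per_query: float   # cost per query in USD
    per_update: float  # cost of each knowledge or model refresh in USD

    def total(self, queries: int, updates: int) -> float:
        return self.setup + queries * self.per_query + updates * self.per_update


# Midpoints of the table's ranges (assumed for illustration only)
OPTIONS = [
    CostModel("RAG (Pinecone)", setup=350.0, per_query=0.0065, per_update=25.0),
    CostModel("Fine-tune GPT-4o-mini", setup=125.0, per_query=0.003, per_update=125.0),
    # Per-query cost is $0 in the table; real self-hosting adds GPU serving costs.
    CostModel("LoRA (Llama 3)", setup=125.0, per_query=0.0, per_update=125.0),
]


def cheapest(queries: int, updates: int) -> CostModel:
    """Pick the option with the lowest total cost for the given workload."""
    return min(OPTIONS, key=lambda option: option.total(queries, updates))
```

With these midpoints, 100,000 queries and 12 refreshes cost about $1,300 for RAG versus about $1,925 for the fine-tuned GPT-4o-mini, because each refresh means retraining. With a single refresh the ordering flips and the fine-tuned options come out cheaper.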

Hybrid Approach (2026 Best Practice)

python
from openai import OpenAI


class HybridRAGFinetuned:
    """
    Fine-tuned model for style/format + RAG for dynamic knowledge.
    Best of both worlds: consistent output style + current information.
    """

    def __init__(self, fine_tuned_model_id: str, vector_store):
        self.client = OpenAI()
        self.model = fine_tuned_model_id  # Fine-tuned for output style
        # Any object exposing search(question, k) -> list of text chunks
        self.vector_store = vector_store  # For dynamic knowledge

    def query(self, question: str) -> str:
        # Retrieve relevant context (RAG)
        chunks = self.vector_store.search(question, k=5)
        context = '\n\n'.join(str(chunk) for chunk in chunks)

        # Use fine-tuned model (consistent output format)
        response = self.client.chat.completions.create(
            model=self.model,  # Fine-tuned model
            messages=[
                {'role': 'system', 'content': 'You are a TechCorp support agent.'},
                {'role': 'user', 'content': f'Context: {context}\n\nQuestion: {question}'},
            ],
        )
        return response.choices[0].message.content
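The class above depends on a vector store with a `search(question, k)` method. For local experiments, a stand-in like the one below may be enough. It is a hypothetical toy, not a real vector database: it ranks documents by bag-of-words cosine similarity instead of learned embeddings, so it only matches on shared words.

```python
import math
import re
from collections import Counter
from typing import Iterable, List


class InMemoryVectorStore:
    """Toy retriever: bag-of-words cosine similarity over an in-memory list."""

    def __init__(self, documents: Iterable[str]):
        self.documents = list(documents)
        self._vectors = [self._embed(doc) for doc in self.documents]

    @staticmethod
    def _embed(text: str) -> Counter:
        # Lowercased word counts stand in for an embedding vector.
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    @staticmethod
    def _cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values())
        )
        return dot / norm if norm else 0.0

    def search(self, question: str, k: int = 5) -> List[str]:
        """Return up to k documents, most similar to the question first."""
        query_vector = self._embed(question)
        ranked = sorted(
            zip(self._vectors, self.documents),
            key=lambda pair: self._cosine(query_vector, pair[0]),
            reverse=True,
        )
        return [doc for _, doc in ranked[:k]]
```

A production setup would swap this for an embedding model plus a vector database, keeping the same `search` interface so the calling code does not change.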

Conclusion

For most 2026 applications, start with RAG — it's faster to implement and update. Add fine-tuning when you need consistent output format, tone, or when you've identified specific behaviors to reinforce. The hybrid approach gives you the best of both worlds.
