LLM Fine-Tuning in 2025: When to Fine-Tune vs. RAG vs. Prompting (With Cost Analysis)

Senior AI engineers explain the decision framework for choosing between fine-tuning, RAG, and prompt engineering

Advanced · 16 minutes

Decision framework and technical guide for LLM customization — comparing fine-tuning vs. RAG vs. prompting for different use cases, with real cost analysis and step-by-step fine-tuning with OpenAI and LoRA.

Tags: fine-tuning, llm, rag, openai, lora

LLM Fine-Tuning vs. RAG vs. Prompting: The Decision Framework

The Core Question

When should you fine-tune? Teams often reach for fine-tuning when prompt engineering would have worked, which wastes time and money. This guide gives you a decision framework for choosing among fine-tuning, RAG, and prompting.

Decision Tree


Do you need real-time/current information?
├── Yes → RAG (not fine-tuning)
└── No ↓

Is it a style/format/tone issue?
├── Yes → Prompt engineering first
└── No ↓

Do you have 500+ labeled examples?
├── No → More prompt engineering, generate synthetic data
└── Yes ↓

Is inference cost/speed critical?
├── Yes → Fine-tuning (smaller model)
└── Maybe → Fine-tuning for consistency
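The decision tree can also be written down as a small function, which is handy for documenting the choice in a design review. This is only a sketch of the tree above: the `UseCase` fields and the `recommend` name are illustrative assumptions, and the 500-example threshold is the rule of thumb from the tree, not a hard limit.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_current_info: bool        # real-time or frequently changing facts
    is_style_or_format_issue: bool  # tone, format, or style problem
    labeled_examples: int           # number of labeled examples available
    cost_or_speed_critical: bool    # inference cost/latency is a hard constraint


def recommend(case: UseCase) -> str:
    """Walk the decision tree from top to bottom and return the first match."""
    if case.needs_current_info:
        return "RAG"
    if case.is_style_or_format_issue:
        return "Prompt engineering first"
    if case.labeled_examples < 500:
        return "More prompt engineering; generate synthetic data"
    if case.cost_or_speed_critical:
        return "Fine-tuning (smaller model)"
    return "Fine-tuning for consistency"


print(recommend(UseCase(False, False, 2000, True)))
```

Because the checks run in order, a use case that needs current information is routed to RAG regardless of how many labeled examples exist, which mirrors the tree.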

When Fine-Tuning Wins

  • Consistent output format: Always return JSON with specific schema
  • Domain-specific language: Medical, legal, financial terminology
  • Reduced latency and cost: A fine-tuned smaller model (e.g. GPT-3.5) can match or outperform GPT-4 on narrow tasks at roughly a tenth of the cost
  • Reduced prompt length: Bake instructions into model, save tokens
  • Classification tasks: Multi-label classification with your categories

When RAG Wins

  • Factual accuracy on large knowledge base: Can't fit in context window
  • Frequently updating information: Retraining too slow
  • Source citations needed: Users need to verify claims
  • Multiple distinct knowledge bases: Different product lines, regions

When Prompting Wins

  • Prototyping: Always start here
  • Reasoning tasks: Chain-of-thought benefits from frontier models
  • Novel/rare tasks: Not enough examples for fine-tuning
  • Flexible requirements: Task definition still evolving

Fine-Tuning with the OpenAI API

    Data Preparation

    python
    # Format: JSONL with messages structure
    import json

    training_data = [
        {
            "messages": [
                {"role": "system", "content": "You are a customer support agent for Acme Corp."},
                {"role": "user", "content": "How do I reset my password?"},
                {"role": "assistant", "content": "To reset your password, go to Settings > Security > Reset Password. You'll receive an email within 2 minutes."},
            ]
        }
    ]

    # Write one JSON object per line
    with open("training.jsonl", "w") as f:
        for item in training_data:
            f.write(json.dumps(item) + "\n")
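Before uploading, it can help to sanity-check the JSONL file locally. The sketch below is an illustrative validator, not the API's own validation: `validate_line`, `validate_file`, and `ALLOWED_ROLES` are hypothetical names, and the checks assume plain chat examples (they are stricter than the API in places, for instance they do not handle tool calls).

```python
import json

# Assumption: plain chat examples only (no tool/function messages).
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems; an empty dict means no issues found."""
    report = {}
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.strip():
                problems = validate_line(line)
                if problems:
                    report[lineno] = problems
    return report
```

Calling `validate_file("training.jsonl")` on the file written above should return an empty dict; any non-empty result points at the offending line numbers.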

    Training

    python
    from openai import OpenAI

    client = OpenAI()  # Reads OPENAI_API_KEY from the environment

    # Upload training file (the context manager closes the handle afterwards)
    with open("training.jsonl", "rb") as f:
        file = client.files.create(file=f, purpose="fine-tune")

    # Create fine-tuning job
    job = client.fine_tuning.jobs.create(
        training_file=file.id,
        model="gpt-4o-mini-2024-07-18",
        hyperparameters={"n_epochs": 3},
    )

    print(f"Job ID: {job.id}")
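Creating the job only queues it, so the fine-tuned model name is not available until the job finishes. A minimal polling sketch follows, assuming the job object exposes a `status` field that ends in `succeeded`, `failed`, or `cancelled`. `wait_for_job`, `poll_seconds`, and the injectable `sleep` argument are hypothetical names introduced here for illustration.

```python
import time

# Assumption: these are the terminal states reported for a fine-tuning job.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id, poll_seconds=30.0, sleep=time.sleep):
    """Poll the fine-tuning job until it reaches a terminal status, then return it."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)  # wait before asking again


# Usage with the client and job from the training step:
#   finished = wait_for_job(client, job.id)
#   if finished.status == "succeeded":
#       print(finished.fine_tuned_model)  # pass this name as `model` at inference
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting; in production you would likely also want a timeout.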

    Cost Estimation

    
    gpt-4o-mini fine-tuning:
    
  • Training: $0.003/1K tokens
  • Inference: $0.0003/1K input + $0.0012/1K output
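  • Training example: 1,000 examples × 500 tokens avg = 500K tokens ≈ $1.50 per epoch (about $4.50 for the 3 epochs configured above, since training is billed per token per epoch)
  • Inference comparison: GPT-4o input at $0.005/1K vs. $0.0003/1K for fine-tuned gpt-4o-mini, i.e. more than 10x cheaper inference after fine-tuning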
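The arithmetic behind these figures is easy to reproduce. The sketch below uses only the per-1K prices quoted in this section; the function and constant names are illustrative, and the `epochs` parameter reflects the assumption that training tokens are billed once per epoch.

```python
# Prices per 1K tokens, as quoted in this section (check current pricing before relying on them).
TRAIN_PER_1K = 0.003
FT_INPUT_PER_1K = 0.0003
FT_OUTPUT_PER_1K = 0.0012
GPT4O_INPUT_PER_1K = 0.005


def training_cost(examples: int, avg_tokens: int, epochs: int = 1) -> float:
    """Total training cost in dollars: tokens in the file times epochs times price."""
    return examples * avg_tokens * epochs / 1000 * TRAIN_PER_1K


def inference_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars of one request against the fine-tuned gpt-4o-mini model."""
    return input_tokens / 1000 * FT_INPUT_PER_1K + output_tokens / 1000 * FT_OUTPUT_PER_1K


print(f"1 epoch:  ${training_cost(1000, 500):.2f}")
print(f"3 epochs: ${training_cost(1000, 500, epochs=3):.2f}")
print(f"Input price ratio vs. GPT-4o: {GPT4O_INPUT_PER_1K / FT_INPUT_PER_1K:.1f}x")
```

At these rates the training cost is small next to inference at scale, so the per-token inference ratio is usually the number that decides the business case.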

    LoRA Fine-Tuning with Open Source Models

    When to Use LoRA

  • Self-hosted for privacy requirements
  • Very frequent retraining needed
  • Highly specialized domain

    python
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Load the base model
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

    lora_config = LoraConfig(
        r=16,                                 # Rank
        lora_alpha=32,                        # Scale
        target_modules=["q_proj", "v_proj"],  # Attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    # Wrap the model so only the LoRA adapter weights are trainable
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
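As a rough cross-check on trainable parameter counts, LoRA adds r × (d_in + d_out) parameters for each adapted weight matrix. The sketch below applies that formula under assumed Llama-3.1-8B shapes (hidden size 4096, 32 layers, and a 1024-wide `v_proj` output because of grouped-query attention); `lora_params` is an illustrative helper, not part of PEFT.

```python
def lora_params(r: int, shapes: list[tuple[int, int]], num_layers: int) -> int:
    """Count LoRA parameters: each adapted (d_in, d_out) matrix adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed shapes for one Llama-3.1-8B layer: q_proj is 4096 -> 4096, v_proj is 4096 -> 1024.
shapes = [(4096, 4096), (4096, 1024)]
count = lora_params(r=16, shapes=shapes, num_layers=32)
print(f"{count:,}")  # 6,815,744
```

Doubling `r` doubles the adapter size, and adding more `target_modules` grows it in proportion to their shapes, so this formula is a quick way to budget memory before loading the model.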

    Evaluation Best Practices

  • Hold-out test set: Never evaluate on training data
  • Human evaluation: For generation quality, human judgment still needed
  • LLM-as-judge: Use a stronger model (e.g. GPT-4) to grade the outputs of the fine-tuned model (e.g. fine-tuned GPT-3.5)
  • Business metrics: Track actual outcome (support resolution, not just BLEU score)
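The first practice, a hold-out test set, plus a simple format metric can be sketched in a few lines. This is a minimal illustration rather than a full evaluation harness: `split_holdout` and `json_validity_rate` are hypothetical helpers, and the JSON check assumes the fine-tuned model is expected to return an object with known keys.

```python
import json
import random


def split_holdout(examples, test_fraction=0.2, seed=42):
    """Shuffle deterministically and return (train, test) with no overlap."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def json_validity_rate(outputs, required_keys):
    """Fraction of model outputs that parse as a JSON object containing all required keys."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            valid += 1
    return valid / len(outputs)
```

Split once, before generating any synthetic data or tuning prompts, and write only the train portion to the training file; the test portion is then used for both the base and fine-tuned model so the comparison is fair.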

Related Tools

    OpenAI, Hugging Face, LangChain, Weights & Biases