LLM Fine-Tuning for Production: LoRA, QLoRA & RLHF in 2025
Adapt foundation models to your domain efficiently with parameter-efficient fine-tuning techniques
Fine-tuning LLMs allows adapting powerful foundation models to specific domains without training from scratch. This guide covers LoRA and QLoRA for parameter-efficient fine-tuning, dataset preparation and quality filtering, instruction tuning format, RLHF and DPO for alignment, fine-tuning on consumer GPUs with quantization, evaluation with domain benchmarks, and deploying fine-tuned models with vLLM or TGI for production serving.
When to Fine-Tune vs. Prompt Engineer
Use prompt engineering when: GPT-4/Claude/Gemini work well with examples, task is varied, cost per request is acceptable, speed of iteration matters.
Use fine-tuning when: consistent output format is critical (structured JSON extraction), domain-specific terminology and knowledge required, latency or cost requires smaller model, privacy requires running on-premise, instruction following quality needs improvement.
Parameter-Efficient Fine-Tuning
LoRA (Low-Rank Adaptation)
Fine-tune only a small number of parameters by adding low-rank decomposition matrices alongside frozen model weights. Instead of updating all 7B parameters, LoRA updates ~4M parameters (about 0.06% of the total), often reaching performance comparable to full fine-tuning at substantially lower memory and compute cost, since gradients and optimizer states are kept only for the adapter weights.
Architecture: freeze the original weight matrix W (d×k) and add trainable matrices B (d×r) and A (r×k), where r << min(d, k). During the forward pass: h = Wx + (alpha/r) * BAx. Only A and B are trained. After training, merge: W_merged = W + (alpha/r) * BA.
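The shapes and the merge step can be checked with a small sketch. It uses plain Python lists rather than a tensor library, and the 7B layout (32 layers, hidden size 4096, rank 8, adapters on two square projections per layer) is an assumption chosen to show where a figure of roughly 4M trainable parameters can come from; it is not taken from a specific model card.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters for one adapted d x k weight: B (d x r) plus A (r x k)."""
    return r * (d + k)


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply matrices X and Y (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_forward(W, A, B, x, alpha, r):
    """h = Wx + (alpha / r) * B(Ax); W stays frozen, only A and B would be trained."""
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_merge(W, A, B, alpha, r):
    """W_merged = W + (alpha / r) * BA, so inference needs no extra matmul."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]


# Assumed layout: 32 layers, hidden size 4096, rank 8, adapters on 2 projections per layer.
trainable = 32 * 2 * lora_param_count(4096, 4096, 8)

# Toy example with d=2, k=3, r=1 (values are arbitrary).
W = [[1, 0, 2], [0, 1, -1]]
B = [[1], [2]]
A = [[3, 0, 1]]
x = [1, 2, 3]
h_adapter = lora_forward(W, A, B, x, alpha=2, r=1)
h_merged = matvec(lora_merge(W, A, B, alpha=2, r=1), x)
print(trainable, h_adapter, h_merged)
```

Under these assumptions the adapter count comes to 4,194,304, and the adapter path and the merged weight give the same output, which is why merged adapters add no inference overhead.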
QLoRA
Combines 4-bit quantization with LoRA: load the base model in 4-bit (NF4 quantization) and add LoRA adapters in 16-bit precision (bf16). Only the adapters are fine-tuned. This enables fine-tuning a 65B parameter model on a single 48GB GPU.
Key components: bitsandbytes for 4-bit quantization, the PEFT library for LoRA adapters, Hugging Face transformers for model loading, and TRL for instruction fine-tuning.
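A rough back-of-the-envelope calculation shows why 4-bit loading makes the 65B case plausible on a 48GB card. This sketch counts weight storage only; it deliberately ignores quantization constants, adapter weights, optimizer states, and activations, so real usage is higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


params_65b = 65e9
fp16_gb = weight_memory_gb(params_65b, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(params_65b, 4)    # 4-bit weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
print("4-bit weights fit in 48 GB:", nf4_gb < 48)
```

The 16-bit weights alone need about 130 GB, while the 4-bit weights need about 32.5 GB, leaving headroom on a 48GB GPU for adapters and activations.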
Dataset Preparation
Instruction Tuning Format
Format: {"system": "You are a helpful medical documentation assistant...", "user": "Summarize this clinical note...", "assistant": "Patient: 45-year-old male..."}.
Quality filtering: remove duplicates, filter low-quality responses (too short, contains errors), ensure diversity across topics and styles, balance classes for classification tasks.
Dataset size: 1,000-10,000 high-quality instruction examples often outperform 100,000 low-quality examples. For fine-tuning, quality matters far more than quantity.
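Two of the filtering steps above, duplicate removal and dropping very short responses, can be sketched with the standard library. The 20-character threshold and the whitespace/case normalization are assumptions for illustration; error detection, diversity, and class balance need task-specific logic that is not shown here.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def filter_examples(examples, min_assistant_chars=20):
    """Keep examples with non-empty fields, a long-enough answer, and no duplicate."""
    seen = set()
    kept = []
    for ex in examples:
        user = ex.get("user", "").strip()
        assistant = ex.get("assistant", "").strip()
        if not user or not assistant:
            continue  # missing or empty field
        if len(assistant) < min_assistant_chars:
            continue  # response too short to be useful
        key = (normalize(user), normalize(assistant))
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(ex)
    return kept


# Hypothetical sample records in the system/user/assistant format.
raw = [
    {"system": "s", "user": "Summarize note A", "assistant": "Patient: 45-year-old male with chest pain."},
    {"system": "s", "user": "summarize  note a", "assistant": "patient: 45-year-old male with chest pain."},
    {"system": "s", "user": "Summarize note B", "assistant": "OK."},
    {"system": "s", "user": "Summarize note C", "assistant": ""},
    {"system": "s", "user": "Summarize note D", "assistant": "Patient: 60-year-old female, routine follow-up."},
]
clean = filter_examples(raw)
print(len(raw), "->", len(clean))
```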
Data Augmentation
LLM-generated augmentation: use GPT-4 to rephrase existing examples, generate new examples from seed data, and create adversarial examples to improve robustness. Validate the quality of generated data before including it in training.
Fine-Tuning Setup with PEFT + TRL
Load the base model with 4-bit quantization: call from_pretrained with a BitsAndBytesConfig that sets load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4". Apply the LoRA config: r=16, lora_alpha=32, target_modules=["q_proj","v_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM".
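A minimal sketch of this setup follows, assuming torch, transformers, peft, and bitsandbytes are installed and a CUDA GPU is available. The model identifier is a hypothetical placeholder, and the q_proj/v_proj module names only apply to architectures that use those names.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

BASE_MODEL = "your-org/your-base-model"  # hypothetical placeholder, not a real checkpoint

# 4-bit NF4 quantization settings for the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA hyperparameters from the text above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)


def load_qlora_model(model_name: str = BASE_MODEL):
    """Load the quantized base model and wrap it with trainable LoRA adapters."""
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # reports trainable vs. total parameter counts
    return model
```

Calling load_qlora_model() downloads the checkpoint, so the function is defined here but not invoked.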
Use SFTTrainer from TRL for supervised fine-tuning: set max_seq_length=2048, per_device_train_batch_size=4, gradient_accumulation_steps=4 (effective batch size of 16 per GPU), learning_rate=2e-4, num_train_epochs=3, warmup_ratio=0.03.
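To see what these hyperparameters imply for the length of a run, the arithmetic can be sketched as below. The 10,000-example dataset size and the single GPU are assumptions, and trainers may round steps slightly differently, so treat the numbers as approximate.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_gpus=1):
    """Estimate optimizer steps and warmup steps from the batch settings."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Settings from the text; the dataset size is an assumed example value.
schedule = training_schedule(
    num_examples=10_000,
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_epochs=3,
    warmup_ratio=0.03,
)
print(schedule)
```

With these inputs the run has 625 optimizer steps per epoch, 1,875 in total, and about 57 warmup steps.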
Alignment: RLHF and DPO
RLHF (Reinforcement Learning from Human Feedback)
Three phases: supervised fine-tuning (SFT) on high-quality demonstrations, reward model training (human preferences between response pairs), and PPO reinforcement learning (optimize the LLM to maximize the reward model score).
RLHF is complex to implement: it requires reward model training and PPO stability tuning.
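The reward model phase is often trained with a pairwise (Bradley-Terry style) objective that pushes the score of the preferred response above the rejected one. The text does not name a specific loss, so the sketch below is one common choice, written for a single pair of scalar scores.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for one preference pair.
loss_correct_order = pairwise_reward_loss(2.0, 0.0)  # chosen scored higher
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)    # rejected scored higher
loss_tie = pairwise_reward_loss(1.0, 1.0)            # no preference expressed
print(loss_correct_order, loss_wrong_order, loss_tie)
```

The loss is small when the reward model ranks the pair the way the annotator did and grows as the ranking flips; a tie gives log 2.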
DPO (Direct Preference Optimization)
Simpler alternative to RLHF: directly optimize from preference data (chosen vs. rejected response pairs) without a separate reward model. It often reaches quality comparable to RLHF with a simpler training pipeline. The DPO loss function directly adjusts model weights to prefer chosen responses over rejected ones.
Dataset format: {"prompt": "...", "chosen": "high quality response...", "rejected": "low quality response..."}. Use DPOTrainer from TRL.
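For a single preference pair, the DPO objective can be sketched as follows. It compares the policy's log-probabilities against those of a frozen reference model (typically the SFT checkpoint); beta=0.1 and the log-probability values are assumptions for illustration, and DPOTrainer performs the batched equivalent internally.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))) for one pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical summed token log-probabilities for one (chosen, rejected) pair.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_improved = dpo_loss(-8.0, -15.0, -10.0, -12.0)   # policy shifted toward chosen
print(loss_at_start, loss_improved)
```

At initialization the policy equals the reference and the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.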
Evaluation
Domain benchmarks: create 200-500 test examples representative of production queries. Evaluate: task accuracy (exact match, ROUGE, BLEU), output format compliance (valid JSON, required fields), safety (refusal rate for harmful queries), latency and throughput.
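Two of these checks, exact match and output format compliance, are simple enough to sketch with the standard library. The field names and sample outputs are hypothetical; ROUGE, BLEU, safety, and latency measurements need separate tooling.

```python
import json


def is_valid_json_with_fields(output: str, required_fields) -> bool:
    """True if output parses as a JSON object containing every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(f in parsed for f in required_fields)


def exact_match(prediction: str, reference: str) -> bool:
    """String equality after trimming surrounding whitespace."""
    return prediction.strip() == reference.strip()


def evaluate(predictions, references, required_fields):
    """Return the exact-match rate and the format-compliance rate."""
    n = len(predictions)
    matches = sum(exact_match(p, r) for p, r in zip(predictions, references))
    valid = sum(is_valid_json_with_fields(p, required_fields) for p in predictions)
    return {"exact_match": matches / n, "format_compliance": valid / n}


# Hypothetical model outputs and gold references.
predictions = [
    '{"patient_age": 45, "sex": "male"}',
    '{"patient_age": 30, "sex": "female"}',
    '{"patient_age": 52}',
    "Patient is 60, male",
]
references = [
    '{"patient_age": 45, "sex": "male"}',
    '{"patient_age": 31, "sex": "female"}',
    '{"patient_age": 52, "sex": "male"}',
    '{"patient_age": 60, "sex": "male"}',
]
scores = evaluate(predictions, references, required_fields=["patient_age", "sex"])
print(scores)
```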
Regression testing: ensure fine-tuned model doesn't degrade on general capabilities (MMLU, HellaSwag). Use LLM-as-judge for qualitative evaluation at scale.
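A regression gate can be as simple as comparing benchmark scores before and after fine-tuning against a tolerance. The scores below are made-up illustrative values, not measured results, and the 0.02 tolerance is an assumption to tune per project.

```python
def find_regressions(baseline: dict, finetuned: dict, max_drop: float = 0.02) -> dict:
    """Return benchmarks whose score dropped by more than max_drop."""
    regressions = {}
    for name, base_score in baseline.items():
        drop = base_score - finetuned[name]
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return regressions


# Made-up scores for illustration only.
baseline_scores = {"mmlu": 0.62, "hellaswag": 0.78}
finetuned_scores = {"mmlu": 0.61, "hellaswag": 0.73}

regressions = find_regressions(baseline_scores, finetuned_scores)
print(regressions or "no regressions beyond tolerance")
```

Running a check like this in CI after each fine-tuning run makes a capability drop visible before the model is deployed.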
Production Serving
vLLM for High-Throughput Serving
vLLM reports 2-24x higher throughput than naive serving through PagedAttention (efficient KV cache management). To serve LoRA adapters directly, load the base model using vllm.LLM with enable_lora=True. Alternatively, merge the adapters into the base weights before deployment for the simplest serving setup.
Text Generation Inference (TGI) by Hugging Face
Production-ready serving with continuous batching, quantization support, streaming responses, and OpenAI-compatible API. Deploy with Docker: docker run --gpus all with model volume and port mapping.
Quantization for Production
Post-training quantization reduces model size 2-4x relative to 16-bit weights with minimal quality loss: GPTQ (4-bit, GPU inference), AWQ (4-bit, activation-aware, often faster than GPTQ), GGUF (llama.cpp format, CPU or mixed CPU/GPU inference). Choose based on inference hardware and latency requirements.
Fine-tuning combined with production serving on vLLM can deliver enterprise-grade LLM performance at a fraction of GPT-4 API cost for high-volume use cases.