
LLM Fine-tuning with LoRA: Complete Developer Guide 2026

Master LLM fine-tuning with LoRA through practical examples and production patterns


LoRA (Low-Rank Adaptation) is the standard way to fine-tune large models cheaply. Instead of updating all of a model's billions of weights, LoRA freezes them and trains small low-rank "adapter" matrices attached to selected weight matrices (typically the attention projections). You train roughly 0.1–1% of the parameters, often on a single GPU and in hours rather than days. QLoRA goes further by training those adapters on top of a 4-bit quantized base, which fits even larger models on consumer hardware.
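To see why the 4-bit base matters, here is a back-of-the-envelope sketch of the memory taken by the base weights alone. The 8-billion parameter count is only an illustrative figure, and the estimate deliberately ignores activations, gradients, optimizer state, adapter weights, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 8e9  # illustrative parameter count, not a measured value

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base weights
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit quantized base weights (QLoRA-style)

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

The fourfold reduction in weight storage is what leaves room for adapters and activations on a single consumer card.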

Why LoRA instead of full fine-tuning

Fully fine-tuning a 7B+ model typically needs several high-memory GPUs and produces a full-size copy of the weights for every task. LoRA:

  • Trains a few small low-rank matrices (rank r, e.g. 8–64) added to existing weights.
  • Keeps the base frozen, so memory and compute drop dramatically.
  • Produces a tiny adapter file (MBs) you can swap per task and merge at inference.
  • To compress the base further before/after, see Model Quantization: GPTQ/AWQ.
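The parameter savings from the list above can be checked with simple arithmetic. The sketch below assumes the usual LoRA factorization, where a frozen d_out × d_in weight gets two trainable factors, A (r × d_in) and B (d_out × r). The hidden size of 4096 is a hypothetical value chosen for illustration.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen d_out x d_in weight matrix itself."""
    return d_in * d_out


d = 4096  # hypothetical hidden size, for illustration only

for r in (8, 16, 64):
    added = lora_params(d, d, r)
    fraction = added / full_params(d, d)
    print(f"r={r}: {added:,} adapter params, {fraction:.4f} of the frozen matrix")
```

This is a per-matrix fraction. Because adapters are usually attached to only some of a model's matrices, the fraction of the whole model that is trainable ends up smaller still.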

    Minimal training loop (PEFT)

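    python

    # Install first (shell, not Python):
    #   pip install transformers peft datasets bitsandbytes accelerate

    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-3.1-8B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # QLoRA: load the frozen base in 4-bit via a quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)  # standard prep for k-bit training

    # Attach rank-16 adapters to the attention query and value projections
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # trainable vs. total counts; well under 1% here

    # ... then train with Trainer / SFTTrainer on your dataset, and save only the adapter:
    # model.save_pretrained("adapter/")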

    When to fine-tune at all

    Fine-tune when you need a consistent output format or style, domain behavior the base model lacks, lower cost at high volume (a small fine-tuned model can replace a big one), or faster, cheaper inference. Don't fine-tune when a good system prompt with few-shot examples, or RAG, would do; those options are cheaper and more flexible. For a hosted alternative with no infrastructure to manage, see the GPT-4o mini Fine-Tuning Guide.

    Data is the real work

    LoRA mechanics are easy; the dataset is what determines success. A few hundred to a few thousand high-quality, consistent examples usually beat tens of thousands of noisy ones. Format them exactly the way you will prompt the model at inference.
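One way to keep examples consistent is to push every example through a single template and drop empty or duplicate rows before writing JSONL. The sketch below is only an illustration: the template text, the `prompt`/`completion` field names, and the sample pairs are all hypothetical, so use whatever format your training tooling and inference prompts actually expect.

```python
import json

# Hypothetical template; the same string should be used to build prompts at inference.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_record(instruction: str, response: str) -> dict:
    """Render one training example with the shared template."""
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction.strip()),
        "completion": response.strip(),
    }


def to_jsonl(pairs) -> str:
    """Skip empty rows, drop exact duplicates, and emit one JSON object per line."""
    seen = set()
    lines = []
    for instruction, response in pairs:
        if not instruction.strip() or not response.strip():
            continue  # incomplete example
        record = build_record(instruction, response)
        key = (record["prompt"], record["completion"])
        if key in seen:
            continue  # duplicate after whitespace normalization
        seen.add(key)
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Made-up sample data: one good row, one whitespace-only duplicate, one empty prompt.
raw_pairs = [
    ("Summarize: LoRA freezes the base model.", "LoRA trains small adapters on a frozen base."),
    ("Summarize: LoRA freezes the base model.  ", "LoRA trains small adapters on a frozen base."),
    ("", "orphan response"),
]

jsonl = to_jsonl(raw_pairs)
print(jsonl)
```

Keeping the template in one constant means the training data and the inference prompt cannot drift apart silently.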

    FAQ

    LoRA vs QLoRA? QLoRA is LoRA on a 4-bit quantized base: same idea, less memory, slight quality trade-off.

    How big a dataset? Often hundreds to thousands of clean examples; quality and consistency beat raw size.

    Do I serve the adapter separately? You can load base + adapter at inference, or merge them into one model.

    Fine-tune or RAG? RAG injects knowledge at query time; fine-tuning bakes in behavior/format. Use RAG for facts, fine-tuning for style/skills.
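On serving the adapter separately versus merging: the two options compute the same function. The toy sketch below uses plain Python lists and made-up numbers. It assumes the common LoRA formulation in which the adapter contributes (alpha / r) · B · A on top of the frozen weight W. It is a numerical illustration, not how a library performs the merge internally.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(X, Y):
    """Multiply matrix X by matrix Y (both lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


# Toy sizes: d_out=2, d_in=3, rank r=1. All values are made up.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (d_out x d_in)
B = [[0.5], [-1.0]]                      # adapter factor (d_out x r)
A = [[1.0, 2.0, 0.0]]                    # adapter factor (r x d_in)
alpha, r = 2.0, 1
scale = alpha / r

x = [1.0, 1.0, 1.0]

# Option 1: keep the adapter separate and add its output to the base output.
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
y_separate = [b + scale * a for b, a in zip(base_out, adapter_out)]

# Option 2: fold the adapter into the weight once, then use a single matrix.
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

print(y_separate, y_merged)
```

Merging removes the extra adapter computation at inference, while keeping adapters separate lets one base model serve several tasks by swapping small files.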

    Summary

    LoRA makes fine-tuning accessible: freeze the base, train small adapters, swap or merge them per task. QLoRA pushes it onto consumer GPUs. Spend your effort on a clean, consistent dataset, and reach for fine-tuning only when prompting and RAG aren't enough.


    *Last updated: June 2026. Verify APIs against the PEFT/Transformers docs.*
