
LLM Fine-Tuning Practical Guide 2026: From Data Preparation to Deployment, a Complete Model Customization Workflow

When Fine-Tuning Is Worth It and When Prompt Engineering Is Enough

Many people ask, "Can I fine-tune a model to make it understand my business better?" The answer is usually "Yes, but you probably don't need to."

First, let's clarify when you should fine-tune and when prompt engineering is sufficient.

1. Fine-Tuning vs Prompt Engineering: How to Choose

When Fine-Tuning Is Needed

  • Specific domain output format: You need the model to consistently output in a very specific format (e.g., medical record format, specific code style)
  • Large amounts of repetitive context: The system prompt exceeds 2000 tokens and must be sent with every request
  • Specialized terminology and knowledge: The model needs to learn your industry jargon and proprietary knowledge base
  • Latency and cost: You need faster responses or lower inference costs, for example by replacing a large model plus a long prompt with a smaller fine-tuned model

When Fine-Tuning Is Not Needed

  • You only want to change behavior: Most behavioral changes can be achieved through prompt engineering
  • Frequent knowledge updates: A fine-tuned model cannot be updated in real time; RAG is more suitable
  • Limited budget: Fine-tuning costs GPU compute and engineering time
  • You are still validating the idea: First validate feasibility with RAG or prompt engineering
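
To make the "repetitive context" criterion concrete, the cost of resending a long system prompt can be estimated with simple arithmetic. This is a minimal sketch: the function name, the traffic volume, and the per-token price are illustrative assumptions, not figures from any provider.

```python
def monthly_prompt_overhead(system_prompt_tokens: int,
                            requests_per_day: int,
                            usd_per_million_input_tokens: float,
                            days: int = 30) -> float:
    """Estimate what it costs to resend the same system prompt on every request."""
    total_tokens = system_prompt_tokens * requests_per_day * days
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


# Hypothetical workload: 2000-token prompt, 10,000 requests/day,
# $1 per million input tokens (an assumed price, check your provider)
cost = monthly_prompt_overhead(2000, 10_000, 1.0)
print(f"Repeated-prompt overhead: ${cost:.2f} per month")
```

If that overhead is small compared with the cost of preparing data and running training, prompt engineering is probably still the cheaper option.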

2. Efficient Fine-Tuning: Unsloth + LoRA

    A widely used fine-tuning setup in 2026 combines Unsloth (training acceleration) with LoRA (parameter-efficient fine-tuning, which trains small low-rank adapter matrices instead of the full weights).

    2.1 Environment Setup

    bash
    # Recommended environment: NVIDIA GPU with 16GB+ VRAM, or Google Colab A100
    pip install unsloth transformers datasets trl accelerate

    # Or use Unsloth's one-click install
    pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
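
Before starting a long run, it can help to confirm that a suitable GPU is actually visible. The sketch below assumes PyTorch is the backend and treats the VRAM recommendation above as a soft guideline. The helper name `describe_gpu` and the 14 GB default are choices made for this example; cards sold as 16 GB usually report somewhat less than that to the driver.

```python
def describe_gpu(min_vram_gb: float = 14.0) -> dict:
    """Report whether a CUDA GPU is visible and whether it meets a VRAM guideline."""
    try:
        import torch  # imported lazily so the check also runs where torch is missing
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    if not torch.cuda.is_available():
        return {"torch_installed": True, "cuda_available": False}

    props = torch.cuda.get_device_properties(0)  # first visible GPU
    vram_gb = props.total_memory / 1024**3
    return {
        "torch_installed": True,
        "cuda_available": True,
        "name": props.name,
        "vram_gb": round(vram_gb, 1),
        "meets_guideline": vram_gb >= min_vram_gb,
    }


info = describe_gpu()
print(info)
```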

    2.2 Data Preparation (The Most Critical Step)

    Fine-tuning quality = data quality, not data quantity.

    python
    # Data format: ShareGPT format (recommended)
    # ShareGPT conventionally labels the user turn "human" and the model turn "gpt"
    training_data = [
        {
            "conversations": [
                {"from": "human", "value": "User question"},
                {"from": "gpt", "value": "Expected model response"},
            ]
        },
        # ... more samples
    ]

    # Minimum data needed:
    # - Format/style fine-tuning: 100-500 samples
    # - Domain knowledge injection: 500-2000 samples
    # - Complete behavior change: 2000+ samples

    # Data quality checklist
    # ✅ Is every sample high quality? (Fewer, better samples beat many mediocre ones)
    # ✅ Is the data distribution balanced? (Don't overrepresent one type of question)
    # ✅ Are there any contradictory samples? (Different answers to the same type of question)
    # ✅ Is there any data leakage? (Don't use the test set as training data)
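
Parts of the checklist above can be automated. The following is a minimal sketch, assuming single-turn ShareGPT-style samples whose first turn is the question and second turn is the answer. The function name `check_samples` and the exact-match comparison (case-insensitive) are simplifications for illustration; near-duplicate or paraphrased questions would need fuzzier matching.

```python
from collections import defaultdict

VALID_ROLES = {"human", "gpt", "assistant", "system"}


def check_samples(samples, held_out_questions=()):
    """Return a list of human-readable problems found in ShareGPT-style samples."""
    problems = []
    answers_by_question = defaultdict(set)
    held_out = {q.strip().lower() for q in held_out_questions}

    for i, sample in enumerate(samples):
        turns = sample.get("conversations", [])
        if len(turns) < 2:
            problems.append(f"sample {i}: needs at least one question and one answer")
            continue
        for turn in turns:
            if turn.get("from") not in VALID_ROLES or not str(turn.get("value", "")).strip():
                problems.append(f"sample {i}: bad role or empty text")
                break
        else:  # runs only when every turn passed the structural check
            question = str(turns[0]["value"]).strip().lower()
            answers_by_question[question].add(str(turns[1]["value"]).strip())
            if question in held_out:
                problems.append(f"sample {i}: question also appears in the test set")

    for question, answers in answers_by_question.items():
        if len(answers) > 1:
            problems.append(f"contradiction: {len(answers)} different answers to {question!r}")
    return problems


# Hypothetical samples: a contradiction, a leaked test question, and a malformed entry
samples = [
    {"conversations": [{"from": "human", "value": "What is LoRA?"},
                       {"from": "gpt", "value": "A low-rank adapter method."}]},
    {"conversations": [{"from": "human", "value": "what is lora?"},
                       {"from": "gpt", "value": "A type of radio protocol."}]},
    {"conversations": [{"from": "human", "value": ""}]},
]
problems = check_samples(samples, held_out_questions=["What is LoRA?"])
for problem in problems:
    print(problem)
```

Checks like these catch only mechanical problems; judging whether each answer is actually good still requires reading the samples.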

    2.3 Unsloth Fine-Tuning Code

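    python
    from unsloth import FastLanguageModel, is_bfloat16_supported
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import Dataset

    # Load base model (choose the size that fits your needs)
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen2.5-7B-Instruct",  # 7B is good for beginners
        max_seq_length=2048,
        load_in_4bit=True,  # 4-bit quantization to save memory
    )

    # Add LoRA adapter
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,  # LoRA rank: higher = more capacity, more memory
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
    )

    # Prepare dataset: render each ShareGPT conversation into a single "text"
    # column with the model's chat template, because the trainer reads that column
    ROLE_MAP = {"human": "user", "gpt": "assistant", "assistant": "assistant"}

    def to_text(example):
        messages = [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
        return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

    dataset = Dataset.from_list(training_data).map(to_text)

    # Start training
    # Note: argument names follow the older TRL API; newer TRL releases move some
    # of these into SFTConfig, so adjust to the version you have installed.
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            num_train_epochs=3,  # Number of epochs; too many can cause overfitting
            learning_rate=2e-4,
            fp16=not is_bfloat16_supported(),  # fall back to fp16 on older GPUs
            bf16=is_bfloat16_supported(),  # prefer bf16 where supported (e.g. A100)
            output_dir="./output",
            save_strategy="epoch",
        ),
    )

    trainer.train()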
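
The batch-size and epoch settings above determine how many optimizer updates the run performs, which is useful to know before choosing warmup and epoch counts. Here is a rough sketch of that arithmetic, assuming a single GPU and a hypothetical dataset of 500 samples. The helper name is made up for this example, and the trainer's own rounding can differ from this estimate by a step.

```python
import math


def training_schedule(num_samples: int,
                      per_device_batch_size: int = 2,
                      gradient_accumulation_steps: int = 4,
                      num_epochs: int = 3,
                      num_devices: int = 1) -> dict:
    """Derive the effective batch size and approximate optimizer-step count."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * num_epochs,
    }


# Hypothetical dataset of 500 samples with the hyperparameters used above
schedule = training_schedule(500)
print(schedule)
```

With these numbers the effective batch size is 8 and the run lasts roughly 190 optimizer steps, so 5 warmup steps cover under 3% of training.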

    3. Evaluating Fine-Tuning Results

    3.1 Quantitative Evaluation

    python
    from evaluate import load

    # For generation quality evaluation
    # model_outputs and reference_outputs are equal-length lists of strings
    rouge = load("rouge")
    results = rouge.compute(
        predictions=model_outputs,
        references=reference_outputs,
    )
    print(results)  # ROUGE-1, ROUGE-2, ROUGE-L scores

    # For specific tasks:
    # classification accuracy, F1 score, etc.
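
For classification-style tasks, accuracy and F1 can be computed directly once the model's outputs have been parsed into labels. The sketch below is a plain-Python illustration with made-up labels. In practice a library routine such as scikit-learn's `f1_score` with `average="macro"` does the same job.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels parsed from model outputs on a tiny test set
gold = ["refund", "refund", "shipping", "other"]
pred = ["refund", "shipping", "shipping", "other"]
print(f"accuracy={accuracy(gold, pred):.2f} macro_f1={macro_f1(gold, pred):.2f}")
```

Running the same function on the base model's predictions gives the baseline the fine-tuned model has to beat.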

    3.2 Qualitative Evaluation (More Important)

    Create a human evaluation set (50-100 typical questions) and compare:

  • Original model's responses
  • Fine-tuned model's responses
  • Ideal answers

    Score each dimension (accuracy, format compliance, relevance) to see whether fine-tuning improved performance.
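
One way to turn the human ratings into a comparison is to average each dimension per model and look at the difference. This is a small sketch; the 1-5 scale, the field names, and the model labels `base` and `finetuned` are assumptions made for the example.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "format_compliance", "relevance")


def summarize(ratings):
    """Average each dimension's scores per model.

    ratings: list of dicts such as
        {"model": "base", "accuracy": 3, "format_compliance": 2, "relevance": 4}
    """
    by_model = {}
    for row in ratings:
        by_model.setdefault(row["model"], []).append(row)
    return {
        model: {dim: mean(r[dim] for r in rows) for dim in DIMENSIONS}
        for model, rows in by_model.items()
    }


# Hypothetical 1-5 scores from one reviewer on two questions
ratings = [
    {"model": "base", "accuracy": 3, "format_compliance": 2, "relevance": 4},
    {"model": "base", "accuracy": 4, "format_compliance": 2, "relevance": 4},
    {"model": "finetuned", "accuracy": 4, "format_compliance": 5, "relevance": 4},
    {"model": "finetuned", "accuracy": 4, "format_compliance": 4, "relevance": 5},
]
summary = summarize(ratings)
for dim in DIMENSIONS:
    delta = summary["finetuned"][dim] - summary["base"][dim]
    print(f"{dim}: base={summary['base'][dim]} finetuned={summary['finetuned'][dim]} delta={delta:+}")
```

A dimension where the delta is flat or negative is a signal to revisit the training data before training for more epochs.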

    4. Deploying the Fine-Tuned Model

    4.1 Saving and Loading

    python
    # Save LoRA weights (very small, usually < 100MB)
    model.save_pretrained("my-finetuned-model")
    tokenizer.save_pretrained("my-finetuned-model")

    # Merge weights (optional, for deployment)
    model.save_pretrained_merged(
        "merged-model",
        tokenizer,
        save_method="merged_16bit",
    )
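
The small adapter size follows from how LoRA works: for each adapted weight matrix of shape d_out x d_in it stores only two thin matrices, adding rank x (d_in + d_out) parameters. The sketch below works through that arithmetic. The 32 layers and 4096-wide square projections are illustrative assumptions, not the dimensions of any particular model.

```python
def lora_param_count(shapes, rank, num_layers):
    """Count trainable LoRA parameters.

    shapes: list of (d_in, d_out) for each adapted weight matrix in one layer.
    Each adapter adds A (rank x d_in) and B (d_out x rank).
    """
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Illustrative assumption: 32 layers, four square 4096x4096 attention projections
shapes = [(4096, 4096)] * 4  # q_proj, k_proj, v_proj, o_proj
params = lora_param_count(shapes, rank=16, num_layers=32)
size_mb = params * 4 / 1024**2  # 4 bytes per parameter if stored as float32
print(f"{params:,} parameters, about {size_mb:.0f} MB")
```

Under these assumptions the adapter has about 16.8 million parameters and occupies roughly 64 MB. Models that use grouped-query attention have smaller k_proj and v_proj matrices, so their adapters come out smaller still.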

    4.2 Deployment Options

    Local Inference (Ollama):

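    bash
    # Convert to GGUF format
    # (recent llama.cpp checkouts ship convert_hf_to_gguf.py; older ones used convert.py)
    python llama.cpp/convert_hf_to_gguf.py merged-model --outtype f16 --outfile my-model.gguf

    # Import into Ollama (the Modelfile must point at the GGUF file, e.g. FROM ./my-model.gguf)
    ollama create my-model -f Modelfile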
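
The `ollama create` command reads a Modelfile, which the steps above do not show. As a minimal sketch, the script below writes one that points at a local GGUF file. The file name `my-model.gguf`, the system prompt, and the temperature are placeholders, and the helper `build_modelfile` is made up for this example.

```python
from pathlib import Path


def build_modelfile(gguf_path: str, system_prompt: str = "", temperature: float = 0.7) -> str:
    """Render a minimal Ollama Modelfile that points at a local GGUF file."""
    lines = [f"FROM {gguf_path}", f"PARAMETER temperature {temperature}"]
    if system_prompt:
        lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"


# Placeholder path: use whatever file the GGUF conversion step produced
modelfile = build_modelfile("./my-model.gguf", system_prompt="You are a support assistant.")
Path("Modelfile").write_text(modelfile, encoding="utf-8")
print(modelfile)
```

Depending on the model, a `TEMPLATE` directive matching its chat format may also be needed; check how the imported model responds before relying on the defaults.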

    Cloud API (Together AI / Fireworks AI): Both platforms support uploading custom models and provide OpenAI-compatible APIs, suitable for production deployment.
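
Because the hosted endpoints are OpenAI-compatible, a request is a JSON POST to a `/chat/completions` route with a bearer token. The sketch below builds such a request with the standard library and does not send it. The base URL, API key, and model ID are placeholders to be replaced with the values your provider gives you.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str,
                       user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: substitute your provider's base URL, key, and model ID
request = build_chat_request(
    "https://api.example.com/v1", "YOUR_API_KEY", "my-finetuned-model", "Hello"
)
print(request.full_url)
# To send it:
# reply = json.load(urllib.request.urlopen(request))
# print(reply["choices"][0]["message"]["content"])
```

Keeping the base URL and model ID as parameters makes it easy to point the same client code at a local server or a different provider.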

