Deployment of Fine-tuned Models: Hands-On Tutorial

Serving custom fine-tuned models with vLLM and TGI — step-by-step implementation guide

Advanced, about 20 minutes

Tags: fine-tuning, llm, vllm, deployment, deep-learning

Deployment of Fine-tuned Models

Overview

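This tutorial covers preparing a custom fine-tuned model for serving with vLLM and TGI. It walks through dataset preparation, QLoRA fine-tuning, inference with the trained adapter, and evaluation, with runnable code for each step.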

Prerequisites

bash
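# Install required packages
pip install transformers datasets peft trl accelerate bitsandbytes
pip install vllm
# Packages used later for evaluation and experiment tracking
pip install evaluate rouge_score mlflow

# Verify GPU access (prints True when a CUDA device is visible)
python -c "import torch; print(torch.cuda.is_available())"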

Dataset Preparation

python
from datasets import Dataset, load_dataset


def prepare_dataset(examples: list[dict]) -> Dataset:
    """
    Prepare an instruction dataset for fine-tuning before deployment.

    Expected format:
    [{"instruction": "...", "input": "...", "output": "..."}]
    """

    def format_example(example: dict) -> dict:
        instruction = example.get("instruction", "")
        input_text = example.get("input", "")
        output = example.get("output", "")

        # Include the "### Input:" section only when the record has an input
        if input_text:
            prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
        else:
            prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"

        return {"text": prompt}

    formatted = [format_example(ex) for ex in examples]
    return Dataset.from_list(formatted)


# Load or create your dataset

# Example: load from the Hugging Face Hub
# ("your-org/your-dataset" is a placeholder; uncomment and substitute a real repo ID)
# dataset = load_dataset("your-org/your-dataset", split="train")

# Or create from your own data
examples = [
    {
        "instruction": "Classify this text",
        "input": "Sample text here",
        "output": "Category: Positive"
    }
]
dataset = prepare_dataset(examples)
print(f"Dataset size: {len(dataset)}")
print(f"Sample: {dataset[0]['text'][:200]}")
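
The evaluation step later in this tutorial needs examples the model has not trained on. The sketch below holds out part of the records before formatting them. It uses only the standard library; `split_examples` and the sample records are hypothetical names introduced here for illustration, not part of the original tutorial.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records in the same instruction/input/output format
records = [
    {"instruction": "Classify this text", "input": f"Sample text {i}", "output": "Category: Positive"}
    for i in range(10)
]
train_examples, test_examples = split_examples(records)
print(len(train_examples), len(test_examples))
```

The training portion can then be passed to `prepare_dataset`, while the held-out portion stays in its raw dictionary form for evaluation.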

Model Setup with QLoRA

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Model configuration
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # or your base model
OUTPUT_DIR = "./fine-tuned-model"

# Load tokenizer; reuse EOS for padding in case the tokenizer defines no pad token
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# QLoRA: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load base model with 4-bit weights
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Recommended PEFT step before attaching LoRA adapters to a k-bit quantized model
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank: higher means more trainable parameters
    lora_alpha=32,                 # Scaling factor
    target_modules=[               # Which layers to adapt
        "q_proj", "v_proj",
        "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; the exact numbers depend on the
# base model, the rank, and the target modules
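
To see where the trainable-parameter count comes from, the arithmetic can be sketched by hand: LoRA adds two low-rank matrices to each adapted projection, which costs `r * (in_features + out_features)` parameters. The layer shapes below are assumptions for a Llama-3.2-1B-style model and should be checked against the model's `config.json`; `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
# Assumed shapes for a Llama-3.2-1B-style model; verify against config.json
HIDDEN = 2048        # hidden_size (assumption)
INTERMEDIATE = 8192  # intermediate_size (assumption)
KV_DIM = 512         # num_key_value_heads * head_dim = 8 * 64 (assumption)
NUM_LAYERS = 16      # num_hidden_layers (assumption)

# (in_features, out_features) of each adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank, projections=PROJECTIONS, num_layers=NUM_LAYERS):
    """LoRA adds A (rank x in) and B (out x rank) for every adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


count = lora_trainable_params(rank=16)
print(f"{count:,} trainable LoRA parameters")
```

Because the count is linear in the rank, halving `r` halves the number of trainable parameters. The figure reported by `print_trainable_parameters()` for a real checkpoint is the authoritative one.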

Training Configuration

python
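from transformers import TrainingArguments
from trl import SFTTrainer

# Training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # Effective batch size = 4 x 4 = 16 per device
    gradient_checkpointing=True,        # Trade extra compute for lower activation memory
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.001,
    max_grad_norm=0.3,
    logging_steps=25,
    save_steps=500,
    eval_steps=500,                     # Only takes effect if an eval dataset and eval strategy are set
    fp16=True,                          # Use bf16=True instead on Ampere or newer GPUs
    report_to="mlflow",                 # Track with MLflow (requires the mlflow package)
    run_name="fine-tuning-run-1",
)

# Initialize trainer
# These keyword arguments follow older TRL releases. Newer releases are expected to take
# dataset_text_field, the sequence-length limit, and packing through SFTConfig, and the
# tokenizer through processing_class; check the TRL docs for your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=False,
)

# Train
trainer.train()

# Save the fine-tuned adapter (LoRA weights only, not the full base model)
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")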
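
The batch size, accumulation, warmup, and scheduler settings above interact, so it can help to work the numbers out before launching a run. The sketch below approximates the number of optimizer steps and the learning rate under linear warmup followed by cosine decay. The dataset size of 10,000 is a made-up example, the helper names are hypothetical, and the step count is an approximation rather than the trainer's exact bookkeeping.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate effective batch size and number of optimizer updates."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def lr_at_step(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical dataset of 10,000 examples with the settings from the training arguments
effective_batch, total_steps = optimizer_steps(
    10_000, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch: {effective_batch}, optimizer steps: {total_steps}")
for step in (0, total_steps // 4, total_steps // 2, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With a warmup ratio of 0.03 only a small slice of the run is spent ramping up, so short runs on small datasets may see just a handful of warmup steps.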

Inference with Fine-tuned Model

python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reuses MODEL_ID, OUTPUT_DIR, and tokenizer from the model setup step

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.float16
)
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
model.eval()


def generate(instruction: str, input_text: str = "") -> str:
    """Generate with fine-tuned model."""
    # Same template as training, stopping where the response should begin
    if input_text:
        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"

    # Move inputs to the model's device instead of hard-coding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.1,
            do_sample=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return response.strip()


# Test
response = generate("Explain deployment in simple terms")
print(response)
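
Once the model is behind a server, the same prompt template has to be sent over HTTP. The sketch below builds request bodies for two common endpoints: the OpenAI-compatible `/v1/completions` route that vLLM exposes and TGI's `/generate` route. It assumes the servers are started separately; the URLs, the served model name `fine-tuned-model`, and all helper names are assumptions for illustration. Only the standard library is used, and no request is sent unless `post_json` is called.

```python
import json
import urllib.request


def build_prompt(instruction, input_text=""):
    """Same template used for training, ending where the response should begin."""
    if input_text:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def vllm_completion_payload(prompt, model_name, max_tokens=512, temperature=0.1):
    """Body for POST /v1/completions on an OpenAI-compatible server."""
    return {
        "model": model_name,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": ["### Instruction:"],  # stop before the model starts a new turn
    }


def tgi_generate_payload(prompt, max_new_tokens=512, temperature=0.1):
    """Body for POST /generate on a TGI server."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": temperature},
    }


def post_json(url, payload, timeout=60):
    """Send a JSON POST request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


prompt = build_prompt("Explain deployment in simple terms")
vllm_body = vllm_completion_payload(prompt, model_name="fine-tuned-model")  # hypothetical served name
tgi_body = tgi_generate_payload(prompt)
print(json.dumps(vllm_body, indent=2))

# With servers running (URLs are assumptions; adjust to your deployment):
# text = post_json("http://localhost:8000/v1/completions", vllm_body)["choices"][0]["text"]
# text = post_json("http://localhost:8080/generate", tgi_body)["generated_text"]
```

Keeping the prompt builder in one place avoids the most common serving bug with instruction-tuned adapters, which is a template at inference time that differs from the one used in training.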

Evaluation

python
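from evaluate import load

# Load evaluation metrics
rouge = load("rouge")
bleu = load("bleu")


def evaluate_model(test_examples: list[dict], model_fn) -> dict:
    """Evaluate fine-tuned model quality."""
    predictions = []
    references = []

    for ex in test_examples:
        pred = model_fn(ex["instruction"], ex.get("input", ""))
        predictions.append(pred)
        references.append(ex["output"])

    rouge_scores = rouge.compute(predictions=predictions, references=references)
    # BLEU takes a list of reference lists, one list per prediction
    bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

    return {
        "rouge1": rouge_scores["rouge1"],
        "rouge2": rouge_scores["rouge2"],
        "rougeL": rouge_scores["rougeL"],
        "bleu": bleu_scores["bleu"],
        "num_examples": len(predictions)
    }


# Held-out examples in the same instruction/input/output format as the training data
# (placeholder record; replace with your own test set)
test_examples = [
    {"instruction": "Classify this text", "input": "Sample text here", "output": "Category: Positive"}
]
results = evaluate_model(test_examples, generate)
print(f"Evaluation results: {results}")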
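
To make the ROUGE-1 number easier to interpret, here is a simplified version of what it measures: the F1 score of unigram overlap between a prediction and its reference. This is a sketch with lowercasing and whitespace tokenization only, so its values will not match the `rouge` metric exactly; `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat down")
print(round(score, 3))  # 0.857: precision 1.0, recall 0.75
```

Overlap metrics reward matching wording, so for short classification-style outputs an exact-match check may be a more direct signal.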

GPU Memory Requirements

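| Model Size | Technique | VRAM Needed |
|------------|-----------|-------------|
| 1B params  | Full FT   | ~6GB        |
| 1B params  | QLoRA     | ~3GB        |
| 7B params  | QLoRA     | ~10GB       |
| 13B params | QLoRA     | ~14GB       |
| 70B params | QLoRA     | ~45GB       |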
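
A rough lower bound for these figures is the memory taken by the weights alone, which is the parameter count times the bits per parameter. The sketch below computes only that floor; it deliberately ignores activations, optimizer state, adapter weights, and framework overhead, which account for the rest of the VRAM. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


for size in (1, 7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    nf4 = weight_memory_gb(size, 4)
    print(f"{size}B params: fp16 weights ~{fp16:.1f} GB, 4-bit weights ~{nf4:.1f} GB")
```

For example, a 70B model quantized to 4 bits needs about 35 GB for weights before any training overhead is added.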

Best Practices

Start small: test on a 1B model before scaling to 70B.

Data quality > quantity: 1,000 high-quality examples often beat 10K noisy ones.

Validate constantly: check model outputs every N steps during training.

Use gradient checkpointing: it is essential for fitting large models in memory.

Monitor loss curves: stopping early when validation loss stops improving helps prevent overfitting.
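
The early-stopping advice can be sketched as a patience rule: stop once the validation loss has failed to improve for a set number of consecutive evaluations. The function and the loss values below are hypothetical and independent of any training library.

```python
def early_stopping_step(eval_losses, patience=3, min_delta=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss      # new best loss: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1   # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Hypothetical validation losses, one per evaluation
losses = [1.20, 0.95, 0.90, 0.91, 0.93, 0.96, 1.02]
stop_at = early_stopping_step(losses, patience=3)
print(stop_at)  # 5: three evaluations in a row without beating 0.90
```

The `transformers` library ships an `EarlyStoppingCallback` that applies the same idea inside the trainer, provided an evaluation dataset and evaluation strategy are configured.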

Resources

HuggingFace TRL docs: https://huggingface.co/docs/trl

LoRA paper: https://arxiv.org/abs/2106.09685

Unsloth for 2x speed: https://github.com/unslothai/unsloth
