使用 LoRA 和 QLoRA 微调大语言模型：2026 完全指南

在单张消费级 GPU（显存低于 24GB）上，利用 LoRA/QLoRA 微调技术，从 Llama 3 和 Mistral 训练定制 AI 模型

高级约 50 分钟

使用 LoRA 和 QLoRA 微调大语言模型：2026 完全指南

在单张消费级 GPU（显存低于 24GB）上，利用 LoRA/QLoRA 微调技术，从 Llama 3 和 Mistral 训练定制 AI 模型

2026 年使用 LoRA 和 QLoRA 技术微调大语言模型的完整指南。涵盖数据集准备、训练配置、硬件需求、评估指标以及将微调模型部署到生产环境。

fine-tuning lora qlora llama python machine-learning

使用 LoRA 和 QLoRA 微调大语言模型：2026 完全指南

对 7B 大语言模型进行全参数微调需要 8 张 A100 GPU，每次训练成本超过 500 美元。而 LoRA（低秩适配）技术可以在单张消费级 GPU 上，以不到 5 美元的成本在 2-4 小时内完成相同模型的微调。这项技术让定制 AI 模型的开发变得大众化。

什么是 LoRA？

LoRA 不更新所有模型权重，而是在特定层添加可训练的小矩阵。只有这些小矩阵（约占参数的 1%）被训练和存储。推理时，它们会合并回原始权重。

结果： 微调 7B 模型只需：

1 张 GPU（RTX 4090 或 A100）

10-20GB 显存（全参数微调需要 140GB+）

2-6 小时训练时间

何时微调 vs 提示工程

方法适用场景

提示工程通过指令改变行为 RAG访问外部知识微调教授新风格/格式/领域微调将提示长度减少 80% 微调专业词汇/术语

环境搭建

bash
pip install transformers datasets peft trl bitsandbytes accelerate wandb

第一步：准备数据集

python
from datasets import Dataset
import json
指令微调的训练数据格式
training_examples = [
    {
        "instruction": "从这份新闻稿中提取公司名称、金额和日期。",
        "input": "Acme Corp 今天宣布以 4500 万美元收购 StarTech，交易于 2026 年 3 月 15 日完成。",
        "output": '{"company": "Acme Corp", "acquisition_target": "StarTech", "amount": "$45M", "date": "March 15, 2026"}'
    },
    # 添加 500-5000 个示例以获得良好效果
]
def format_instruction(sample):
    """格式化为 Alpaca 风格的提示。"""
    if sample["input"]:
        return f"""Below is an instruction that describes a task, paired with an input. Write a response.
Instruction:
{sample['instruction']}
Input:
{sample['input']}
Response:
{sample['output']}"""
    else:
        return f"""Below is an instruction. Write a response.
Instruction:
{sample['instruction']}
Response:
{sample['output']}"""
dataset = Dataset.from_list(training_examples)
dataset = dataset.map(lambda x: {"text": format_instruction(x)})
dataset = dataset.train_test_split(test_size=0.1)print(f"训练集: {len(dataset['train'])} | 测试集: {len(dataset['test'])}")

第二步：配置 QLoRA 训练

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
4 位量化以节省内存（QLoRA）
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4 - 最佳质量
    bnb_4bit_compute_dtype=torch.bfloat16
)
以 4 位加载模型
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    token="hf_your_token"
)
model.config.use_cache = False
model.config.pretraining_tp = 1
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token="hf_your_token")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
为 k 位训练准备模型
model = prepare_model_for_kbit_training(model)

第三步：LoRA 配置

python
LoRA 配置 - 这些设置适用于指令微调
peft_config = LoraConfig(
    r=64,              # 秩：越高参数越多，质量越好但内存占用更大
    lora_alpha=16,     # 缩放因子（通常 lora_alpha = r/4）
    target_modules=[   # 要微调的层
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
输出: trainable params: 83,886,080 || all params: 8,114,933,760 || trainable%: 1.03

第四步：训练

python
training_args = TrainingArguments(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # 有效批量大小 = 4 * 4 = 16
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=25,
    save_strategy="epoch",
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    report_to="wandb",
    evaluation_strategy="epoch"
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_seq_length=2048
)trainer.train()
trainer.save_model("./llama3-finetuned-final")
print("训练完成！")

第五步：合并与导出

python
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch
以全精度加载基础模型用于合并
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
加载 LoRA 适配器
model = PeftModel.from_pretrained(base_model, "./llama3-finetuned-final")
将适配器权重合并到基础模型
merged_model = model.merge_and_unload()
保存合并后的模型
merged_model.save_pretrained("./llama3-finetuned-merged")
tokenizer.save_pretrained("./llama3-finetuned-merged")
print("合并模型已保存！")

第六步：评估

python
def evaluate_model(model, tokenizer, test_cases: list) -> dict:
    results = []
    
    for case in test_cases:
        prompt = format_instruction({"instruction": case["instruction"], "input": case.get("input", ""), "output": ""})
        
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                temperature=0.1,
                do_sample=True
            )
        
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        response = generated.split("### Response:")[-1].strip()
        
        results.append({
            "instruction": case["instruction"],
            "expected": case["expected_output"],
            "actual": response,
            "match": response.strip() == case["expected_output"].strip()
        })
    
    accuracy = sum(r["match"] for r in results) / len(results)
    print(f"准确率: {accuracy:.1%}")
    return {"accuracy": accuracy, "results": results}

硬件需求

模型LoRA 秩所需显存训练时间

Llama 3.1 8Br=6418GB (4-bit)2-4 小时 Mistral 7Br=6416GB (4-bit)2-3 小时 Llama 3.1 70Br=1648GB (4-bit)12-24 小时

推荐 GPU：NVIDIA RTX 4090（24GB）用于 7-8B 模型

实际应用效果

使用 LoRA 微调的生产案例：

客户服务：通过领域特定训练，升级率降低 40%

法律文档提取：结构化数据提取准确率 94%，零样本仅 71%

医疗编码：ICD-10 编码错误减少 60%

结论

LoRA 微调让定制 AI 模型开发变得触手可及。一个包含 1000 个示例的数据集和一张 RTX 4090，就能生成在特定领域任务上远超 GPT-4 的模型。关键在于数据集质量——精心整理干净、多样化的示例，代表你的实际使用场景。

Getting Started

Learn how to get started with this application.

Learn more

Installation Guide

使用 LoRA 和 QLoRA 微调大语言模型：2026 完全指南

使用 LoRA 和 QLoRA 微调大语言模型：2026 完全指南

什么是 LoRA？

何时微调 vs 提示工程

环境搭建

第一步：准备数据集

指令微调的训练数据格式

Instruction:

Input:

Response:

Instruction:

Response:

第二步：配置 QLoRA 训练

4 位量化以节省内存（QLoRA）

以 4 位加载模型

为 k 位训练准备模型

第三步：LoRA 配置

LoRA 配置 - 这些设置适用于指令微调

输出: trainable params: 83,886,080 || all params: 8,114,933,760 || trainable%: 1.03

第四步：训练

第五步：合并与导出

以全精度加载基础模型用于合并

加载 LoRA 适配器

将适配器权重合并到基础模型

保存合并后的模型

第六步：评估

硬件需求

实际应用效果

结论

Documentation

Getting Started

Learn more