AI Model Compression: Pruning, Quantization, and Knowledge Distillation

Deploy smaller, faster AI models with minimal accuracy loss

Advanced, 42 minutes

Learn model compression techniques that can make AI models up to roughly 10x smaller and faster. Covers weight pruning, quantization (INT8, INT4), knowledge distillation, and deployment on edge devices.

Tags: model-compression, quantization, pruning, distillation, edge-ai

AI Model Compression Techniques

Why Compress Models?

  • Mobile/edge deployment
  • Reduce inference costs
  • Lower latency requirements
  • Memory-constrained environments

    1. Quantization

    Post-Training Quantization (PTQ)

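    python
    import io
    import torch
    from torch.quantization import quantize_dynamic

    def get_model_size(model):
        # Helper (not part of PyTorch): size of the serialized state_dict in bytes
        buffer = io.BytesIO()
        torch.save(model.state_dict(), buffer)
        return buffer.getbuffer().nbytes

    # Dynamic quantization of an already-trained `model` (easiest, small accuracy loss)
    model_quantized = quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8  # quantize only Linear layers
    )

    # Check size reduction
    original_size = get_model_size(model)
    quantized_size = get_model_size(model_quantized)
    print(f"Size reduction: {original_size / quantized_size:.1f}x")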
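
To make concrete what INT8 quantization does to a weight tensor, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is an illustration only, not PyTorch's internal implementation, and the names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.003, 0.9, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"scale={scale:.4f}, max round-trip error={max_error:.5f}")
```

Each weight is stored as one 8-bit integer plus a single shared scale, and the round-trip error of any in-range value is at most half the scale. Very small weights (such as 0.003 here) collapse to zero, which is one source of the accuracy loss.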

    Quantization-Aware Training (QAT)

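    QAT usually preserves more accuracy than PTQ because it simulates quantization during training, letting the weights adapt to the rounding error:
    python
    import torch

    # Eager-mode QAT: `model` is expected to wrap its forward pass in QuantStub/DeQuantStub
    model.train()
    model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')  # x86 server backend
    model_prepared = torch.quantization.prepare_qat(model)

    # Fine-tune with quantization simulation (`train` is your own training loop)
    train(model_prepared, ...)

    # Convert to a real INT8 model for inference
    model_prepared.eval()
    model_quantized = torch.quantization.convert(model_prepared)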
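
The "quantization simulation" in QAT is commonly implemented as a fake-quantize step: values are rounded onto the integer grid and immediately converted back to floats, so the network trains against the rounding error. Below is a minimal pure-Python sketch, assuming an affine unsigned 8-bit scheme whose range comes from the observed min/max. `observe_range` and `fake_quantize` are hypothetical names, and real QAT also passes gradients through the rounding step (a straight-through estimator), which is omitted here.

```python
def observe_range(values, qmin=0, qmax=255):
    """Derive scale and zero point from the observed min/max (range always includes 0)."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize then dequantize, so the result is a float that lies on the integer grid."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))  # values outside the observed range saturate
    return (q - zero_point) * scale


activations = [-0.5, 0.0, 0.37, 1.2, 2.05]
scale, zero_point = observe_range(activations)
simulated = [fake_quantize(a, scale, zero_point) for a in activations]

for original, sim in zip(activations, simulated):
    print(f"{original:+.3f} -> {sim:+.5f}")
```

Including zero in the range means that 0.0 maps exactly onto an integer code, which matters for padding and ReLU outputs.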

    2. Weight Pruning

    Remove the least important weights, here those with the smallest absolute value:
    python
    import torch
    import torch.nn.utils.prune as prune

    # Prune 30% of weights in linear layers (smallest L1 magnitude first)
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=0.3)

    # Make pruning permanent (folds the mask into `weight`; the tensor stays dense)
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.remove(module, 'weight')
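
The selection rule behind magnitude pruning can be sketched in a few lines of pure Python. This is an illustration under simplifying assumptions (a flat list of weights, a hypothetical function name `prune_by_magnitude`), not the PyTorch implementation.

```python
def prune_by_magnitude(weights, amount):
    """Zero out the `amount` fraction of weights with the smallest absolute value."""
    n_prune = round(amount * len(weights))
    # Indices sorted from smallest to largest magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    mask = [0 if i in to_zero else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.8, -0.05, 0.3, -1.1, 0.02, 0.6, -0.4, 0.01, 0.9, -0.2]
pruned, mask = prune_by_magnitude(weights, amount=0.3)
sparsity = mask.count(0) / len(mask)

print(pruned)
print(f"sparsity: {sparsity:.0%}")
```

The pruned list has the same length as the original. The zeros only save memory or compute when the model is stored in a sparse format or run on kernels that skip them.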

    3. Knowledge Distillation

    Train a smaller "student" model to mimic a larger "teacher":
    python
    import torch
    import torch.nn as nn

    def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
        # Soft targets from teacher; 'batchmean' matches the mathematical KL divergence
        soft_loss = nn.KLDivLoss(reduction='batchmean')(
            torch.log_softmax(student_logits / temperature, dim=1),
            torch.softmax(teacher_logits / temperature, dim=1)
        ) * (temperature ** 2)  # T^2 keeps soft-target gradients comparable across temperatures

        # Hard targets from ground truth
        hard_loss = nn.CrossEntropyLoss()(student_logits, labels)

        return alpha * soft_loss + (1 - alpha) * hard_loss
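
To see why the temperature matters, the soft-target term can be computed by hand for a single example. The sketch below uses only the standard library. The logits are made-up numbers, and `softmax` and `kl_divergence` are hypothetical helper names.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [6.0, 2.0, 1.0]
student_logits = [4.0, 3.0, 0.5]

for t in (1.0, 4.0):
    teacher_probs = softmax(teacher_logits, t)
    student_probs = softmax(student_logits, t)
    soft_loss = kl_divergence(teacher_probs, student_probs) * t ** 2
    print(f"T={t}: teacher={[round(p, 3) for p in teacher_probs]}, soft_loss={soft_loss:.4f}")
```

At a higher temperature the teacher's distribution is flatter, so the relative probabilities of the non-top classes become visible to the student instead of being rounded away next to the top class.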
    

    4. LLM-Specific: GPTQ and AWQ

    python
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    # 4-bit weights, with one set of quantization parameters per group of 128 weights
    quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
    model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
    model.quantize(examples)  # `examples`: tokenized calibration samples
    model.save_quantized(save_path)
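
The `bits` and `group_size` settings can be illustrated with a small pure-Python sketch of group-wise quantization, where each group of weights gets its own scale and minimum. This is plain round-to-nearest quantization, not the GPTQ algorithm itself, which additionally adjusts the remaining weights to compensate for quantization error. The function names are hypothetical, and a group size of 4 is used only to keep the output readable.

```python
def quantize_groupwise(weights, bits=4, group_size=4):
    """Quantize each group of weights with its own scale and minimum (asymmetric)."""
    levels = 2 ** bits - 1  # 15 steps for 4-bit codes 0..15
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / levels or 1.0
        codes = [min(levels, max(0, round((w - lo) / scale))) for w in chunk]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate weights from per-group codes, scales, and minimums."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]


weights = [0.10, 0.12, 0.08, 0.11, 2.0, -1.5, 0.7, 1.1]
groups = quantize_groupwise(weights, bits=4, group_size=4)
restored = dequantize_groupwise(groups)

for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.4f}")
```

Because the first group spans a narrow range, its scale is much finer than that of the second group. With a single scale for the whole tensor, the small weights would all land on the same one or two codes.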

    Results Summary

    Technique               Size Reduction   Accuracy Loss
    INT8 Quantization       4x               0.5-2%
    INT4 Quantization       8x               1-3%
    50% Pruning             2x               1-5%
    Knowledge Distillation  5-10x            2-5%
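
The 4x and 8x figures follow directly from the bit widths if the baseline is assumed to be 32-bit floating point. A quick back-of-the-envelope check, using a hypothetical 100M-parameter model and ignoring the small overhead of scales and zero points:

```python
def size_mb(num_params, bits_per_weight):
    """Storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


num_params = 100_000_000  # hypothetical 100M-parameter model
fp32 = size_mb(num_params, 32)

for name, bits in [("FP32", 32), ("INT8", 8), ("INT4", 4)]:
    size = size_mb(num_params, bits)
    print(f"{name}: {size:.0f} MB ({fp32 / size:.0f}x smaller than FP32)")

# 50% pruning halves storage only if the zeros are not stored
# and the bookkeeping for the sparse format is ignored.
pruned_params = num_params // 2
print(f"50% pruned FP32: {size_mb(pruned_params, 32):.0f} MB")
```

The accuracy-loss column cannot be derived this way. It depends on the model, the task, and the calibration or fine-tuning data, so it should be measured on your own evaluation set.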

    Related Tools

    pytorch, onnx, tflite, auto-gptq