
AI Model Compression: Pruning, Quantization, and Knowledge Distillation

Deploy smaller, faster AI models with minimal accuracy loss

AI Model Compression Techniques

Why Compress Models?

  • Deploy to mobile and edge devices
  • Reduce inference costs
  • Meet low-latency requirements
  • Fit memory-constrained environments

1. Quantization

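Post-Training Quantization (PTQ)

python
import io

import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (easiest, small accuracy loss).
# `model` is assumed to be an already-trained FP32 nn.Module.
model_quantized = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Layers to quantize
    dtype=torch.qint8,
)

# get_model_size is not part of PyTorch; this is one possible helper
# that measures the serialized state_dict in bytes.
def get_model_size(m):
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes

# Check size reduction
original_size = get_model_size(model)
quantized_size = get_model_size(model_quantized)
print(f"Size reduction: {original_size / quantized_size:.1f}x")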
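To make the size and accuracy trade-off concrete, here is a minimal sketch of what INT8 quantization does to a list of weights, written in plain Python so the arithmetic is visible. It assumes a symmetric, per-tensor scheme; the names `quantize_int8` and `dequantize_int8` and the sample weights are illustrative, and PyTorch's own kernels are more involved.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


weights = [0.42, -1.30, 0.07, 0.95, -0.61]  # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integer codes:", q)
print(f"scale: {scale:.5f}, max rounding error: {max_error:.5f}")
```

Each weight now needs 1 byte instead of 4, which is where the roughly 4x figure for INT8 comes from. The rounding error, at most half a scale step per weight, is the source of the accuracy loss.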

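Quantization-Aware Training (QAT)

QAT usually preserves accuracy better than PTQ because quantization is simulated during training:
python
import torch

# Note: eager-mode quantization expects the model to mark its quantized
# region with QuantStub/DeQuantStub.
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model.train())

# Fine-tune with quantization simulation (train is your own training loop)
train(model_prepared, ...)
model_quantized = torch.quantization.convert(model_prepared.eval())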
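What "simulating quantization" means can be shown on a single weight. In this sketch the forward pass uses a rounded ("fake-quantized") weight while the update is applied to a full-precision copy, treating the rounding as the identity in the backward pass (the straight-through estimator). The scale, learning rate, and toy data are assumptions chosen for illustration, not values taken from PyTorch.

```python
def fake_quantize(w, scale, qmin=-128, qmax=127):
    """Round w to the INT8 grid and map it straight back to a float."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def train_with_fake_quant(x, y, scale=0.1, lr=0.05, steps=200):
    """Fit w so that w * x is close to y, with a quantized forward pass."""
    w = 0.0  # full-precision "shadow" weight that receives the updates
    for _ in range(steps):
        w_q = fake_quantize(w, scale)  # forward pass sees the quantized value
        error = w_q * x - y
        grad = 2 * error * x  # straight-through: d(w_q)/dw treated as 1
        w -= lr * grad
    return w, fake_quantize(w, scale)


w_float, w_quant = train_with_fake_quant(x=1.0, y=0.73)
print(f"shadow weight: {w_float:.3f}, deployed INT8 weight: {w_quant:.1f}")
```

Because the loss is computed on the quantized weight, training settles on a value that still works after conversion, rather than on a full-precision optimum that is rounded afterwards.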

2. Weight Pruning

Remove the weights that contribute least, here those with the smallest absolute values:
python
import torch
import torch.nn.utils.prune as prune

# Prune 30% of weights in linear layers
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.3)

# Make pruning permanent
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, 'weight')
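The selection rule behind L1 unstructured pruning can be sketched in plain Python: rank the weights by absolute value and zero out the smallest fraction. The helper name `magnitude_prune` and the sample weights are illustrative; PyTorch implements the same idea with a mask stored alongside the original tensor.

```python
def magnitude_prune(weights, amount):
    """Zero the `amount` fraction of weights with the smallest magnitude."""
    n_prune = round(amount * len(weights))
    # Indices sorted from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    mask = [0 if i in pruned else 1 for i in range(len(weights))]
    return [w if m else 0.0 for w, m in zip(weights, mask)], mask


weights = [0.5, -0.1, 0.05, 2.0, -0.3, 0.9, -0.02, 0.7, 1.1, -0.4]
pruned_weights, mask = magnitude_prune(weights, amount=0.3)
sparsity = mask.count(0) / len(mask)

print(pruned_weights)
print(f"sparsity: {sparsity:.0%}")
```

Note that zeroed weights still occupy memory in a dense tensor. The file only shrinks once the model is stored in a sparse or compressed format, or when whole channels are removed with structured pruning.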

3. Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher":
python
import torch
import torch.nn as nn

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Soft targets from teacher; 'batchmean' averages the KL divergence per sample
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        torch.log_softmax(student_logits / temperature, dim=1),
        torch.softmax(teacher_logits / temperature, dim=1)
    ) * (temperature ** 2)

    # Hard targets from ground truth
    hard_loss = nn.CrossEntropyLoss()(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
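The role of `temperature` is easiest to see on a single example. The sketch below, in plain Python with made-up logits, shows how dividing the logits by a temperature above 1 flattens the teacher's distribution, so the student also learns from the relative probabilities of the wrong classes. The helper names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher_logits = [6.0, 2.0, 1.0]  # made-up values
student_logits = [4.0, 3.0, 0.5]

hard = softmax(teacher_logits)                   # temperature = 1
soft = softmax(teacher_logits, temperature=4.0)  # softened targets
student_soft = softmax(student_logits, temperature=4.0)

print("teacher, T=1:", [round(p, 3) for p in hard])
print("teacher, T=4:", [round(p, 3) for p in soft])
print(f"KL(teacher || student) at T=4: {kl_divergence(soft, student_soft):.4f}")
```

At a temperature of 1 almost all probability mass sits on the top class, which gives the student little more than the hard label. At 4 the second and third classes carry a visible share, and that extra signal is what the soft loss transfers.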
    

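4. LLM-Specific: GPTQ and AWQ

GPTQ and AWQ are post-training, weight-only quantization methods for large language models. The example below uses GPTQ via the auto_gptq package:
python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# model_path, examples (calibration samples), and save_path are supplied by you
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
model.quantize(examples)
model.save_quantized(save_path)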
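The `bits` and `group_size` parameters can be illustrated with a plain round-to-nearest, group-wise quantizer. This is not the GPTQ algorithm, which additionally adjusts the remaining weights to compensate for rounding error; the sketch only shows how each group of weights gets its own scale and offset. All names and values below are illustrative.

```python
def quantize_groupwise(weights, bits=4, group_size=4):
    """Quantize each group of weights to `bits`-bit codes with its own scale."""
    qmax = 2 ** bits - 1  # 15 for 4-bit codes
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for flat groups
        codes = [min(qmax, max(0, round((w - lo) / scale))) for w in group]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate weights from per-group codes, scales, and offsets."""
    return [c * scale + lo for codes, scale, lo in groups for c in codes]


# Two groups with very different magnitudes (made-up values)
weights = [0.01, -0.02, 0.03, 0.00, 1.5, -2.0, 0.7, 2.5]
groups = quantize_groupwise(weights, bits=4, group_size=4)
restored = dequantize_groupwise(groups)

for codes, scale, lo in groups:
    print(f"codes={codes} scale={scale:.4f} offset={lo}")
```

With a single scale for the whole tensor, the small weights in the first group would all collapse onto one or two codes. Smaller groups track local ranges more closely, at the cost of storing more scales and offsets.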

Results Summary

Technique                 Size Reduction   Accuracy Loss
INT8 Quantization         4x               0.5-2%
INT4 Quantization         8x               1-3%
50% Pruning               2x               1-5%
Knowledge Distillation    5-10x            2-5%
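The size reductions for the two quantization rows follow from the number of bits per weight, if the baseline is FP32 (an assumption; the table does not state its baseline). A quick check, using a hypothetical 7-billion-parameter model:

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7_000_000_000  # hypothetical model size
fp32 = model_size_gb(num_params, 32)
int8 = model_size_gb(num_params, 8)
int4 = model_size_gb(num_params, 4)

print(f"FP32: {fp32:.1f} GB, INT8: {int8:.1f} GB, INT4: {int4:.1f} GB")
print(f"INT8 reduction: {fp32 / int8:.0f}x, INT4 reduction: {fp32 / int4:.0f}x")
```

Real files come out slightly larger than this estimate, because scales, zero points, and any layers left unquantized are stored as well.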
