AI Model Compression: Pruning, Quantization, and Knowledge Distillation
Deploy smaller, faster AI models with little loss in accuracy
AI Model Compression Techniques
Why Compress Models?
1. Quantization
Post-Training Quantization (PTQ)
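```python
import io

import torch
from torch.quantization import quantize_dynamic


def get_model_size(m):
    # Assumed helper (not defined in the original): serialized state_dict size in bytes
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# `model` is your trained float32 nn.Module.
# Dynamic quantization (easiest, small accuracy loss)
model_quantized = quantize_dynamic(
    model,
    {torch.nn.Linear},  # Layers to quantize
    dtype=torch.qint8,
)

# Check size reduction
original_size = get_model_size(model)
quantized_size = get_model_size(model_quantized)
print(f"Size reduction: {original_size / quantized_size:.1f}x")
```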
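To make the int8 mapping concrete, here is a minimal pure-Python sketch of per-tensor affine quantization. The function names and the sample weights are made up for illustration, and PyTorch's own observers and kernels differ in detail, so treat this as the arithmetic only, not as the library's implementation.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one scale and zero point for the tensor."""
    # The representable range must include 0 so that 0.0 maps to an exact code.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.2, -0.4, 0.0, 0.05, 0.9, 2.3]  # illustrative values
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max error={max_error:.4f}")
```

Each weight is stored in 1 byte instead of 4, which is where the roughly 4x size reduction for quantized layers comes from. The rounding error per weight is at most half of `scale`.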
Quantization-Aware Training (QAT)
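QAT usually recovers more accuracy than PTQ because it simulates quantization during training:

```python
import torch

# Eager-mode QAT expects the model to wrap its quantized region
# in QuantStub / DeQuantStub modules.
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model.train())

# Fine-tune with quantization simulation
train(model_prepared, ...)  # placeholder for your training loop

model_quantized = torch.quantization.convert(model_prepared.eval())
```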
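The idea of "simulating quantization during training" can be sketched without any framework. In the toy example below, the forward pass uses a weight snapped to a quantization grid, while the gradient updates a full-precision copy (the straight-through estimator). The grid step of 0.1, the target slope 0.73, and all names are assumptions chosen for illustration; real QAT also learns or observes the scale.

```python
def fake_quantize(w, scale=0.1, qmin=-128, qmax=127):
    """Quantize then immediately dequantize, so w lands on the int8 grid."""
    code = max(qmin, min(qmax, round(w / scale)))
    return code * scale


# Toy task: learn y = 0.73 * x with a single weight.
data = [(x, 0.73 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w, lr = 0.0, 0.05  # w is the full-precision "latent" weight

for step in range(200):
    w_q = fake_quantize(w)  # the forward pass sees the quantized weight
    # Gradient of the mean squared error with respect to w_q ...
    grad = sum(2 * (w_q * x - y) * x for x, y in data) / len(data)
    # ... applied directly to the latent weight (straight-through estimator).
    w -= lr * grad

print(f"latent w={w:.3f}, deployed w_q={fake_quantize(w):.1f}")
```

The deployed weight ends up on a grid point adjacent to the target, which is the best a grid with step 0.1 can do to within one step. The model has been trained while already seeing that rounding.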
2. Weight Pruning
Remove the least important weights, here those with the smallest magnitude:

```python
import torch
import torch.nn.utils.prune as prune

# Prune 30% of weights in linear layers
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name='weight', amount=0.3)

# Make pruning permanent
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, 'weight')
```
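As a rough illustration of what L1 unstructured pruning selects, the sketch below builds a binary mask that zeroes the smallest-magnitude 30% of a flat weight list. The function name and the sample weights are hypothetical. The library works on tensors and keeps the mask as a buffer until `prune.remove` is called.

```python
def magnitude_mask(weights, amount):
    """Return a 0/1 mask that drops the `amount` fraction with smallest |w|."""
    n_prune = round(amount * len(weights))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6, -0.02, 0.1]  # illustrative
mask = magnitude_mask(weights, amount=0.3)
pruned_weights = [w if m else 0.0 for w, m in zip(weights, mask)]
sparsity = mask.count(0) / len(mask)
print(mask, pruned_weights, f"sparsity={sparsity:.0%}")
```

Unstructured pruning leaves the zeros inside dense tensors, so the saved model does not shrink and inference does not speed up unless the weights are stored in a sparse format or run on kernels that exploit sparsity.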
3. Knowledge Distillation
Train a smaller "student" model to mimic a larger "teacher":

```python
import torch
import torch.nn as nn


def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Soft targets from teacher.
    # reduction="batchmean" gives the true per-sample KL divergence;
    # the default "mean" would also divide by the number of classes.
    soft_loss = nn.KLDivLoss(reduction="batchmean")(
        torch.log_softmax(student_logits / temperature, dim=1),
        torch.softmax(teacher_logits / temperature, dim=1),
    ) * (temperature ** 2)
    # Hard targets from ground truth
    hard_loss = nn.CrossEntropyLoss()(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
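The role of the temperature is easier to see with numbers. The following pure-Python sketch computes temperature-scaled softmax and the KL term for one sample. The logits are made-up values and the helper names are illustrative only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


teacher_logits = [6.0, 2.0, 1.0]  # made-up values
student_logits = [4.0, 3.0, 0.5]  # made-up values

for t in (1.0, 4.0):
    teacher_probs = softmax(teacher_logits, t)
    student_probs = softmax(student_logits, t)
    # Same T^2 scaling as the soft-target term in the loss.
    soft_loss = kl_divergence(teacher_probs, student_probs) * t ** 2
    rounded = [round(p, 3) for p in teacher_probs]
    print(f"T={t}: teacher={rounded}, soft loss={soft_loss:.4f}")
```

A higher temperature flattens the teacher distribution, so the student also learns how the teacher ranks the wrong classes. Multiplying by `temperature ** 2` keeps the soft-target gradients on a scale comparable to the hard-label term as the temperature changes.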
4. LLM-Specific: GPTQ and AWQ
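```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# model_path, examples, and save_path are placeholders you supply;
# examples is a list of tokenized calibration samples.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
model.quantize(examples)
model.save_quantized(save_path)
```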
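The `bits` and `group_size` settings describe a storage layout: weights are split into groups, and each group gets its own scale and offset. The sketch below shows that layout with plain round-to-nearest quantization. It is not GPTQ itself, which additionally compensates rounding error using second-order information from the calibration data. The fp16 scale and 4-bit zero point in `bits_per_weight` are assumptions for the arithmetic, and all names are hypothetical.

```python
def quantize_groupwise(weights, bits=4, group_size=4):
    """Round-to-nearest quantization with one scale and offset per group."""
    levels = 2 ** bits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        codes = [round((w - lo) / scale) for w in group]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate weights from per-group codes, scales, and offsets."""
    return [lo + c * scale for codes, scale, lo in groups for c in codes]


def bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Storage cost once per-group metadata is amortized over the group."""
    return bits + (scale_bits + zero_bits) / group_size


weights = [0.02, -0.01, 0.03, 0.00, 1.5, -2.0, 0.7, 0.1]  # illustrative values
groups = quantize_groupwise(weights, bits=4, group_size=4)
restored = dequantize_groupwise(groups)
print([codes for codes, _, _ in groups])
print(f"{bits_per_weight():.3f} bits per weight at group_size=128")
```

Because the first group only spans small values, it gets a much finer scale than the second. That is the motivation for grouping: smaller groups track local weight ranges more closely, at the cost of more metadata per weight.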
Results Summary