AI Model Compression: Pruning, Quantization, and Knowledge Distillation

Deploy smaller, faster AI models without sacrificing accuracy

By AI Skill Navigation Editorial Team. Published May 25, 2026

Deploying large AI models (like LLMs or vision transformers) to production often hits a wall: the model is too big, too slow, or consumes too much memory for your target hardware. Model compression solves this by reducing the model's size and computational cost while preserving as much accuracy as possible. The three main pillars are quantization, pruning, and knowledge distillation. Each has its own trade-offs, tooling, and pitfalls.

This tutorial covers the principles, real-world tools, and practical considerations for each technique. We'll focus on qualitative trade-offs—no fabricated benchmark numbers—and highlight common mistakes.

1. Quantization

Quantization reduces the numerical precision of model weights and activations. Instead of storing every parameter as a 32-bit floating point (FP32), you use fewer bits: 16-bit (FP16/BF16), 8-bit (INT8), or even 4-bit (INT4). This directly shrinks memory footprint and can speed up inference, especially on hardware with dedicated integer units (e.g., NVIDIA Tensor Cores, Apple Neural Engine, Qualcomm Hexagon).

How It Works

Post-Training Quantization (PTQ): Convert a pre-trained FP32 model to lower precision without retraining. You need a small calibration dataset to determine optimal scaling factors (min/max ranges) for each tensor. This is the most common approach for LLMs.
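As a rough illustration of the calibration step, the sketch below derives a scale and zero point from the min/max of a few calibration batches and then quantizes a tensor to INT8. The function names (`calibrate_minmax`, `quantize`, `dequantize`) and the toy data are assumptions made for this example; real toolkits use more robust range estimators and per-channel or per-group scales.

```python
def calibrate_minmax(batches, qmin=-128, qmax=127):
    """Derive an affine INT8 mapping from the observed value range."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    # Include zero in the range so that 0.0 is represented exactly.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers, clamping anything outside the calibrated range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]

# Hypothetical calibration data standing in for real model activations.
calibration_batches = [[-1.0, 0.5, 2.0], [0.1, -0.3, 1.5]]
scale, zero_point = calibrate_minmax(calibration_batches)

x = [-1.0, -0.25, 0.0, 0.6, 2.0]
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
```

Values inside the calibrated range come back with an error of at most half a quantization step; values outside it are clamped, which is why unrepresentative calibration data hurts.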

Quantization-Aware Training (QAT): Simulate quantization during training (or fine-tuning) so the model learns to compensate for precision loss. QAT usually yields better accuracy but requires access to training data and compute.
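One common way to simulate quantization during training is "fake quantization" with a straight-through estimator: the forward pass uses the rounded weight, while the gradient is applied to a full-precision latent copy as if rounding were the identity. The toy below fits a single weight this way with hand-written gradients. The grid step, learning rate, and data are illustrative assumptions, not settings from any particular framework.

```python
def fake_quant(w, step=0.25, qmin=-8, qmax=7):
    """Quantize then dequantize: the result is a float that lies on the grid."""
    q = max(qmin, min(qmax, round(w / step)))
    return q * step

def loss_and_grad(w_q, xs, ys):
    """Mean squared error of y = w_q * x and its gradient with respect to w_q."""
    n = len(xs)
    loss = sum((w_q * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad

xs = [1.0, 2.0, 3.0]
ys = [0.8 * x for x in xs]   # the ideal weight 0.8 is not on the 0.25 grid

w = 0.0                      # latent full-precision weight
lr = 0.01
initial_loss, _ = loss_and_grad(fake_quant(w), xs, ys)

for _ in range(200):
    w_q = fake_quant(w)                      # forward pass sees the quantized weight
    _, grad_wq = loss_and_grad(w_q, xs, ys)
    w -= lr * grad_wq                        # straight-through: treat d(w_q)/d(w) as 1

final_loss, _ = loss_and_grad(fake_quant(w), xs, ys)
```

In this toy the latent weight ends up hovering near a rounding boundary between two grid points, so the quantized weight can flip between neighbours from step to step.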

Key Tools

| Tool | Supported Formats | Typical Use Case |
| --- | --- | --- |
| GPTQ | 4-bit, 3-bit (GPU-optimized) | LLM inference on NVIDIA GPUs; layer-wise quantization with Hessian-based error compensation. |
| AWQ | 4-bit (GPU-optimized) | Similar to GPTQ but uses activation-aware scaling; often better accuracy at the same bit-width. |
| GGUF | 2-bit to 8-bit (CPU/GPU hybrid) | Running LLMs on consumer hardware (CPU + GPU offloading); popular in the llama.cpp ecosystem. |
| bitsandbytes | 8-bit, 4-bit (GPU) | Easy integration with Hugging Face Transformers; 8-bit via LLM.int8(), 4-bit via block-wise quantization. |
| TensorRT / ONNX Runtime | INT8, FP16 | Production deployment for vision and NLP models; INT8 typically requires a calibration dataset. |

Practical Trade-offs

Memory: Going from FP32 to INT8 cuts memory by 4×; to 4-bit by 8×. For a 7B parameter model, that's ~28 GB → ~3.5 GB (4-bit).

Speed: On GPUs with hardware INT8 support, INT8 inference can be roughly 2–4× faster than FP32 thanks to reduced memory traffic and faster integer math, though the gain depends on the model, batch size, and kernels. Weight-only 4-bit is slower per token than 8-bit on many GPUs because weights must be dequantized on the fly, but it is often still faster than FP16 for memory-bound tasks.

Accuracy: For most models, 8-bit quantization causes negligible accuracy loss (<1% on standard benchmarks). 4-bit can degrade 1–3% on complex reasoning tasks; careful calibration (e.g., GPTQ with 128 samples) helps. QAT often recovers most of the loss.
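The memory figures in the trade-offs above follow from simple arithmetic on bits per parameter. A minimal sketch, assuming only the weights are counted (no quantization scales, activations, or KV cache); `weight_memory_gb` is a helper name invented for this example:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

formats = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}
sizes_7b = {name: weight_memory_gb(7e9, bits) for name, bits in formats.items()}

for name, gb in sizes_7b.items():
    print(f"{name:10s} {gb:5.1f} GB")
```

Real quantized checkpoints are slightly larger than this estimate because they also store per-group scales and, in some formats, zero points.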

Common Pitfalls

Using the wrong calibration data: If your calibration set doesn't represent real inputs (e.g., only English text for a multilingual model), quantization errors spike.

Ignoring activation quantization: Many tools only quantize weights. For full speedup, you must also quantize activations (INT8). This is harder and may require hardware support.

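Overlooking outlier channels: LLMs often have a few hidden-state feature dimensions (channels) with extreme activation values, which stretch the quantization range and crush precision for everything else. Bitsandbytes' LLM.int8() handles this by keeping those outlier dimensions in FP16 and quantizing the rest to INT8; GPTQ uses Hessian-based error compensation, and AWQ uses activation statistics to rescale the weight channels that matter most.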
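The effect of a single extreme value is easy to reproduce. The sketch below quantizes a small vector with one shared absmax scale, then repeats the exercise while keeping the outlier in full precision, loosely in the spirit of the mixed-precision idea described above. It is not the bitsandbytes implementation; the threshold of 6.0 and the sample values are assumptions chosen for illustration.

```python
def absmax_quant_dequant(values, bits=8):
    """Symmetric quantization with one scale for the whole vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) * scale for v in values]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

values = [0.02, -0.05, 0.11, 0.07, -0.09, 60.0]   # one outlier dominates the range

# Naive: the outlier sets the scale, so every small value rounds to zero.
naive = absmax_quant_dequant(values)

# Mixed precision: keep outliers as-is, quantize only the remaining values.
threshold = 6.0   # illustrative cut-off
is_outlier = [abs(v) > threshold for v in values]
regular = [v for v, out in zip(values, is_outlier) if not out]
regular_q = iter(absmax_quant_dequant(regular))
mixed = [v if out else next(regular_q) for v, out in zip(values, is_outlier)]

naive_error = mean_abs_error(values, naive)
mixed_error = mean_abs_error(values, mixed)
```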

2. Pruning

Pruning removes redundant parameters (weights, neurons, or layers) from a model. The goal is to reduce model size and computation while maintaining accuracy. There are two main flavors: structured and unstructured.

Structured vs. Unstructured Pruning

Unstructured Pruning: Sets individual weights to zero, typically those with the smallest absolute values. The resulting weight matrix is sparse (many zeros). This can theoretically reduce memory and computation, but most hardware (GPUs, CPUs) doesn't accelerate arbitrary sparse matrices efficiently unless sparsity is very high (>90%) or the zeros follow a hardware-supported pattern, such as the 2:4 semi-structured sparsity on NVIDIA Ampere GPUs, where two of every four consecutive weights are zero.
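The 2:4 pattern mentioned above is simple to state in code: in every group of four consecutive weights, keep the two with the largest magnitude and zero the rest. This is a minimal sketch of the pattern only; `prune_2_of_4` is a name invented here, and actual speedups require sparse kernels and a compressed storage format.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Positions of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
pruned = prune_2_of_4(row)
```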

Structured Pruning: Removes entire neurons, channels, or attention heads. The model becomes smaller but stays dense, so it runs faster on standard hardware. The downside: removing whole units is coarser than removing individual weights, so accuracy usually drops more at the same pruning ratio.
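To make "smaller but dense" concrete, the sketch below removes hidden neurons from a two-layer MLP: it drops rows of the first weight matrix and the matching columns of the second. Ranking neurons by the L2 norm of their incoming weights is an assumption made for this example; other importance criteria are common.

```python
def prune_hidden_neurons(w1, b1, w2, keep_ratio):
    """Shapes as nested lists: w1 is [hidden][in], b1 is [hidden], w2 is [out][hidden]."""
    hidden = len(w1)
    n_keep = max(1, int(hidden * keep_ratio))
    norms = [sum(v * v for v in row) ** 0.5 for row in w1]
    strongest = sorted(range(hidden), key=lambda i: norms[i], reverse=True)[:n_keep]
    keep = sorted(strongest)                          # preserve original neuron order
    new_w1 = [w1[i] for i in keep]                    # drop rows (neurons)
    new_b1 = [b1[i] for i in keep]
    new_w2 = [[row[i] for i in keep] for row in w2]   # drop matching input columns
    return new_w1, new_b1, new_w2

w1 = [[0.5, -0.4], [0.01, 0.02], [-0.9, 0.3], [0.03, -0.01]]
b1 = [0.1, 0.2, 0.3, 0.4]
w2 = [[1.0, 2.0, 3.0, 4.0]]

small_w1, small_b1, small_w2 = prune_hidden_neurons(w1, b1, w2, keep_ratio=0.5)
```

The pruned layers are ordinary dense matrices with half the hidden width, so no sparse kernels are needed to benefit.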

Common Approaches

Magnitude Pruning: Remove weights with the smallest absolute values. Simple, but ignores interactions between weights. Works well for small pruning ratios (10–30%).

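Movement Pruning: During fine-tuning, track the direction in which each weight moves. Weights that are moving toward zero are pruned, and weights that are moving away from zero are kept, regardless of their current magnitude. Often better than magnitude pruning for transformers, especially at high sparsity.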
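In the published formulation of movement pruning, a weight's importance score accumulates the negative product of its gradient and its value, so a weight being pushed toward zero scores low even when it is still large. The toy below uses constant, made-up gradients to show that this criterion and magnitude pruning can pick different weights. The real method learns the scores jointly with a top-k mask during fine-tuning, which this sketch does not attempt.

```python
weights = [0.5, -0.5, 0.05, 0.6]
grads = [0.2, 0.2, -0.3, -0.1]    # held constant across steps, for illustration only
scores = [0.0] * len(weights)
lr = 0.1

for _ in range(5):
    # Accumulate -grad * weight: negative when SGD is shrinking the weight.
    scores = [s - g * w for s, g, w in zip(scores, grads, weights)]
    # Plain SGD update.
    weights = [w - lr * g for w, g in zip(weights, grads)]

indices = range(len(weights))
movement_drop = min(indices, key=lambda i: scores[i])          # lowest movement score
magnitude_drop = min(indices, key=lambda i: abs(weights[i]))   # smallest magnitude
```

Here the movement criterion drops the first weight, which is large but shrinking, while magnitude pruning would drop the third, which is small but growing.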

SparseGPT / Wanda: Recent methods for one-shot pruning of LLMs. SparseGPT uses Hessian information to prune weights while minimizing output error; Wanda uses weight magnitudes and activation norms. Both can achieve 50–60% sparsity with minimal accuracy loss.
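The Wanda criterion described above can be sketched in a few lines: each weight is scored by its magnitude times the L2 norm of the input feature it multiplies, and the lowest-scoring weights in each output row are zeroed. The helper names and the tiny calibration set are assumptions made for this example, not the reference implementation.

```python
def feature_l2_norms(calib_inputs):
    """L2 norm of each input feature across the calibration samples."""
    n_features = len(calib_inputs[0])
    return [sum(x[j] ** 2 for x in calib_inputs) ** 0.5 for j in range(n_features)]

def prune_row_by_score(weights, act_norms, sparsity):
    """Zero the lowest-scoring fraction of one output row; score = |w| * activation norm."""
    n_prune = int(len(weights) * sparsity)
    scores = [abs(w) * a for w, a in zip(weights, act_norms)]
    drop = set(sorted(range(len(weights)), key=lambda j: scores[j])[:n_prune])
    return [0.0 if j in drop else w for j, w in enumerate(weights)]

weights = [0.8, 0.1, -0.5, 0.3]                            # one output row
calib_inputs = [[0.1, 5.0, 0.2, 1.0], [0.1, 4.0, 0.1, 1.0]]

norms = feature_l2_norms(calib_inputs)
wanda_pruned = prune_row_by_score(weights, norms, sparsity=0.5)
# With all activation norms set to 1, the same function is plain magnitude pruning.
magnitude_pruned = prune_row_by_score(weights, [1.0] * 4, sparsity=0.5)
```

The small weight 0.1 survives under the activation-aware score because its input feature is consistently large, whereas magnitude pruning removes it.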

Tools

PyTorch Pruning (torch.nn.utils.prune): Supports unstructured magnitude pruning. Easy to prototype, but not optimized for inference.

Neural Magic's DeepSparse: Optimized CPU inference for sparse models. Works with unstructured sparsity (up to 95%) and provides pruning recipes.

SparseGPT / Wanda: Standalone implementations for LLMs. Can be applied post-training.

TensorFlow Model Optimization Toolkit: Supports both structured and unstructured pruning with QAT integration.

Practical Trade-offs

Unstructured pruning: Good for academic research or if you have sparse hardware (e.g., NVIDIA's 2:4 pattern). For most users, the speedup is disappointing because GPUs don't accelerate arbitrary sparsity.

Structured pruning: Directly reduces FLOPs and memory bandwidth. Pruning 50% of a transformer's feed-forward layers can approach a 2× speedup for those layers on GPU (less end to end, because attention and other layers are untouched), but accuracy may drop 2–5% on complex tasks.

Combining with quantization: Pruning + quantization often compounds accuracy loss. Apply pruning first, then quantize, and use QAT to recover.

Common Pitfalls

Pruning too aggressively: Removing 90% of weights often destroys model coherence. Start with 20–30% and evaluate on your specific task.

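Not fine-tuning after pruning: Classic magnitude pruning usually needs a few epochs of fine-tuning to recover accuracy, and at higher pruning ratios skipping it can degrade the model severely. One-shot methods such as SparseGPT and Wanda are designed to reduce or avoid this step.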

Ignoring layer importance: Some layers (e.g., early embedding layers, final classification head) are more sensitive to pruning. Use sensitivity analysis to prune different layers at different rates.

3. Knowledge Distillation

Knowledge distillation (KD) trains a smaller "student" model to mimic the behavior of a larger "teacher" model. The student learns not just the hard labels (e.g., "cat"), but the teacher's soft probability distribution (derived from its logits), which contains richer information about class similarities.

How It Works

Train a large teacher model (or use a pre-trained one).

Define a smaller student model (e.g., a 7B parameter model distilled from a 70B teacher).

Train the student on a dataset where the loss is a combination of:

- Hard loss: Cross-entropy with ground-truth labels.
- Soft loss: KL divergence between the teacher's and student's output distributions, both softened by a temperature parameter.

Optionally, use intermediate layer representations (feature distillation) for better transfer.
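The combined hard and soft loss from the steps above can be written out for a single example as follows. The mixing weight `alpha` and the multiplication of the soft term by the squared temperature are conventional choices assumed here rather than taken from the text; the logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    # Hard loss: cross-entropy with the ground-truth label (temperature 1).
    hard = -math.log(softmax(student_logits)[label])
    # Soft loss: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Scaling by T^2 keeps the soft term's gradient magnitude comparable across temperatures.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

teacher = [4.0, 1.0, 0.5]
matched = distillation_loss(teacher, teacher, label=0)
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher, label=0)
hard_only = -math.log(softmax(teacher)[0])
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard loss remains.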

Real-World Examples

DistilBERT: 40% smaller than BERT-base, 60% faster, retains 97% of performance on GLUE.

TinyBERT: 7.5× smaller, 9.4× faster than BERT-base, with ~96% performance.

LLM Distillation: Distilling a 70B LLaMA into a 7B model is an active research area. It's harder because LLMs have emergent abilities (reasoning, in-context learning) that may not transfer well.

Tools

Hugging Face Transformers: No dedicated KD API, but distillation is straightforward to implement by subclassing Trainer and overriding compute_loss with a custom loss function.

Textbooks Are All You Need (Phi-1/Phi-2): Microsoft used synthetic data generated by a teacher LLM to train a small student, achieving strong code generation.

Distil-Whisper: Distillation of Whisper speech recognition models using large-scale pseudo-labelling combined with a KL-divergence loss on the teacher's output distribution.

Practical Trade-offs

Student size: A student that's too small (e.g., 10× smaller) may not have enough capacity to learn the teacher's knowledge. Rule of thumb: 2–4× smaller is safe; 10× requires careful architecture design.

Data requirements: KD often needs a large, diverse dataset (or synthetic data from the teacher). Without it, the student may overfit to the teacher's biases.

Teacher quality: A poorly trained teacher distills bad habits. Always use the best available teacher for your domain.

Common Pitfalls

Using only hard labels: The whole point of KD is the soft distribution. Without it, you're just training a small model from scratch.

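Ignoring temperature tuning: Temperature controls how "soft" the teacher's distribution is. Too high (e.g., 20) flattens it toward uniform and washes out information; at 1 the distribution is unchanged, which for a confident teacher is often nearly one-hot and carries little more than the hard label. Start with 4–8 and tune.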

Distilling from a quantized teacher: A quantized teacher may have degraded logits, leading to a worse student. Use FP16/BF16 teachers for distillation.
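For the temperature pitfall above, it helps to look at how one set of teacher logits changes as the temperature varies. The logits below are invented for illustration; entropy is used as a simple measure of how soft the distribution is.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [8.0, 3.0, 1.0]        # a confident teacher
temperatures = [0.1, 1.0, 4.0, 20.0]

distributions = {t: softmax(teacher_logits, t) for t in temperatures}
entropies = [entropy(distributions[t]) for t in temperatures]

for t in temperatures:
    print(t, [round(p, 3) for p in distributions[t]])
```

At low temperature almost all probability sits on the top class; at very high temperature the three classes become nearly indistinguishable. The useful range is in between, where the relative ranking of the non-top classes is still visible.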

Putting It All Together: A Compression Pipeline

For a production deployment, you often combine techniques:

Start with a pre-trained model (e.g., LLaMA-2 7B).

Apply structured pruning (e.g., remove 30% of attention heads) if you have the compute for fine-tuning.

Quantize to 4-bit using GPTQ or AWQ.

Optionally, distill from a larger model (e.g., 70B) into the pruned+quantized 7B to recover accuracy.

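As a rough guide, this pipeline can reduce weight memory by about 8× relative to FP32 and speed up inference by 3–5×, with accuracy within 1–2% of the original. Actual results vary with the model, task, and hardware, so measure on your own workload.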

FAQ

Q: Can I quantize a pruned model?
A: Yes, but accuracy loss may compound. Apply pruning first, then quantize, and use quantization-aware training (QAT) to recover. Some tools (e.g., Neural Magic's DeepSparse) support joint pruning+quantization.

Q: Which quantization method is best for CPU inference?
A: GGUF (via llama.cpp) is the most mature for CPU + GPU offloading. For pure CPU, ONNX Runtime with INT8 quantization (using static quantization) works well for vision and smaller NLP models.

Q: Does knowledge distillation require the teacher to be the same architecture?
A: No, but it helps. You can distill a transformer into a CNN or a smaller transformer. The key is that the student must be able to represent the teacher's output distribution. Feature distillation (matching intermediate layers) works best when architectures are similar.

Q: How much data do I need for distillation?
A: At least as much as you'd use to fine-tune the student directly. For LLMs, synthetic data from the teacher (e.g., 100k–1M examples) is common. Quality matters more than quantity.

Q: Is pruning still useful if I'm already quantizing to 4-bit?
A: Yes, because pruning reduces the number of parameters, while quantization reduces the bits per parameter. Combined, you can fit a 7B model into 2–3 GB. However, the accuracy trade-off is steeper, so test on your specific task.

*Last updated: July 2026. Always verify against each tool's official docs.*

