AI Model Quantization (GPTQ, AWQ): Complete Developer Guide 2026
Master AI Model Quantization (GPTQ, AWQ) with practical examples and production patterns
Quantization shrinks a model by storing its weights in fewer bits (e.g. 4-bit instead of 16-bit), cutting memory and often speeding inference — with a small, usually acceptable, quality loss. It's what lets a 70B model fit on a single GPU or a 7B model run on a laptop. This guide covers the two dominant post-training methods, GPTQ and AWQ, plus when to use each.
Why quantize
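A model in FP16 needs ~2 bytes per parameter: an 8B model ≈ 16GB, a 70B ≈ 140GB. Quantize to 4-bit and the weights alone drop to ~4GB and ~35GB (per-group scales, zero-points, and the KV cache add some overhead on top). That moves an 8B model comfortably onto a 24GB consumer card or a laptop, and a 70B model from a multi-GPU node to a single 48GB or 80GB GPU. Smaller weights also mean less memory traffic per generated token, which often improves throughput because decoding is typically limited by memory bandwidth. To actually serve quantized models, see Ollama vs vLLM.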
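The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the per-group overhead (a 16-bit scale plus a 4-bit zero-point for every 128 weights) is an assumed, commonly seen layout rather than a fixed standard; KV cache and activation memory are not counted.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(bits: int, group_size: int, scale_bits: int = 16, zero_bits: int = 4) -> float:
    """Bits per weight once each group's scale and zero-point are amortized over the group.

    The scale and zero-point sizes are assumptions for illustration.
    """
    return bits + (scale_bits + zero_bits) / group_size


for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    int4_g128 = weight_memory_gb(n_params, effective_bits(4, 128))
    print(f"{name}: FP16 {fp16:.1f} GB, ideal 4-bit {int4:.1f} GB, 4-bit with group size 128 {int4_g128:.1f} GB")
```

Under these assumptions the grouped 4-bit figures land slightly above the ideal 4 GB and 35 GB, which is why real checkpoints are a little larger than the back-of-envelope estimate.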
GPTQ vs AWQ
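GPTQ quantizes weights one layer at a time. Using a small calibration set and approximate second-order (Hessian) information, it adjusts the not-yet-quantized weights in a layer to compensate for the rounding error already introduced, minimizing the error in that layer's output. It's well-established with broad ecosystem support.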
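To make "rounding error" concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization for one group of weights. This is the naive baseline that methods like GPTQ improve on, not GPTQ itself; the function names and the sample weights are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Returns (codes, scale, zero_point); codes are ints in [0, 2**bits - 1].
    """
    qmax = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights for a single group (real groups are often 64 or 128 wide).
weights = [0.42, -0.17, 0.05, -0.61, 0.33, 0.08, -0.02, 0.27]
codes, scale, zero_point = quantize_group(weights)
restored = dequantize_group(codes, scale, zero_point)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max_err={max_err:.4f}")
```

Each weight lands within half a step (`scale / 2`) of its original value here. GPTQ keeps the same storage format but chooses the rounded values so the errors partly cancel in the layer's output rather than treating each weight independently.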
AWQ (Activation-aware Weight Quantization) observes that a small fraction of weight channels matter most (those multiplying large activations) and protects them by scaling them up before quantization, with the inverse scale folded into the preceding operation. This tends to preserve accuracy better at 4-bit and pairs with fast inference kernels.
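The activation-aware idea can be illustrated with a toy dot product. This is a sketch of the scaling trick only, not the AWQ algorithm: the weights, activations, and the scale factor of 2 are hand-picked hypothetical values chosen to make the effect visible, whereas AWQ searches for per-channel scales on calibration data.

```python
def quantize_symmetric(weights, bits=4):
    """Symmetric round-to-nearest: snap each weight to a multiple of a shared step."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy values: channel 1 sees a much larger activation than the others.
weights = [0.9, 0.19, -0.5, 0.26]
activations = [0.1, 8.0, 0.1, 0.1]
exact = dot(weights, activations)

# Plain quantization: the salient weight 0.19 is rounded as coarsely as every other weight.
plain_err = abs(dot(quantize_symmetric(weights), activations) - exact)

# Activation-aware variant: scale the salient weight up before rounding and divide
# its activation by the same factor, so the exact product is unchanged.
scales = [1.0, 2.0, 1.0, 1.0]  # hand-picked for this toy example
scaled_w = quantize_symmetric([w * s for w, s in zip(weights, scales)])
scaled_x = [x / s for x, s in zip(activations, scales)]
aware_err = abs(dot(scaled_w, scaled_x) - exact)

print(f"plain error: {plain_err:.4f}, activation-aware error: {aware_err:.4f}")
```

In this toy setup the output error falls from roughly 0.49 to roughly 0.02, because the scaled-up weight is rounded on a relatively finer grid while the group's step size stays the same. Real models will not show the same ratio.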
There's also bitsandbytes (on-the-fly 8-bit/4-bit, easiest for training/QLoRA) and GGUF (the format Ollama/llama.cpp use for CPU/Apple Silicon).
Practical use
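```python
# Serve a pre-quantized AWQ model with vLLM (run this in a shell first):
#   vllm serve TheBloke/Llama-3.1-8B-Instruct-AWQ --quantization awq
# The repo ID is illustrative; confirm the checkpoint exists on Hugging Face.
from openai import OpenAI

MODEL = "TheBloke/Llama-3.1-8B-Instruct-AWQ"  # must match the name passed to `vllm serve`

# vLLM exposes an OpenAI-compatible API; any placeholder key works
# when the server was started without --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```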
For most people the move is to download an already-quantized checkpoint (Hugging Face has GPTQ/AWQ/GGUF variants of popular models) rather than quantizing yourself. Quantize your own only when you have a fine-tuned model to compress, such as one produced by LoRA fine-tuning.
Choosing bits
FAQ
Does quantization hurt quality? A little. At 4-bit the drop is usually small and acceptable; below 4-bit it grows.
GPTQ or AWQ? AWQ often edges ahead on 4-bit accuracy and speed; GPTQ has the widest tooling. Try both on your task.
What about GGUF? That's the format for CPU/Apple-Silicon via llama.cpp/Ollama; see local LLM comparison.
Can I quantize a LoRA fine-tune? Yes: merge the adapter, then quantize, or use QLoRA, which trains on a quantized base.
Summary
Quantization is the lever that makes large models runnable on modest hardware. GPTQ and AWQ are the two leading 4-bit post-training methods; AWQ tends to win on accuracy/speed, GPTQ on ecosystem breadth. In practice, grab a pre-quantized checkpoint and serve it with vLLM or Ollama.
*Last updated: June 2026. Verify kernel/format support against vLLM, AutoAWQ, and AutoGPTQ docs.*