
AI Model Quantization (GPTQ, AWQ): Complete Developer Guide 2026

Master AI Model Quantization (GPTQ, AWQ) with practical examples and production patterns


Quantization shrinks a model by storing its weights in fewer bits (e.g. 4-bit instead of 16-bit), cutting memory and often speeding inference — with a small, usually acceptable, quality loss. It's what lets a 70B model fit on a single GPU or a 7B model run on a laptop. This guide covers the two dominant post-training methods, GPTQ and AWQ, plus when to use each.

Why quantize

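A model in FP16 needs ~2 bytes per parameter: an 8B model ≈ 16GB, a 70B ≈ 140GB. Quantize to 4-bit and those drop to ~4GB and ~35GB. For the 8B model that is the difference between filling most of a 24GB card and running comfortably on a laptop GPU; for the 70B model it is the difference between a multi-GPU node and a single 48GB or 80GB card. These figures cover weights only: quantization scales, the KV cache, and activations add overhead on top. Smaller weights also mean less memory traffic per generated token, which often improves throughput. To actually serve quantized models, see Ollama vs vLLM.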
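The arithmetic above can be sketched in a few lines. This is a rough weights-only estimate, assuming 1GB = 10^9 bytes and ignoring the per-group scales and zero-points that real quantized formats store; the function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare the two model sizes discussed above at common bit widths.
for name, params in [("8B", 8e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{size:.0f} GB")
```

Real checkpoints come out slightly larger than this estimate, so treat it as a lower bound when sizing hardware.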

GPTQ vs AWQ

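|              | GPTQ                                      | AWQ                                         |
|--------------|-------------------------------------------|---------------------------------------------|
| Approach     | Layer-wise error-minimizing quantization  | Activation-aware: protect salient weights   |
| Typical bits | 3–4 bit                                   | 4 bit                                       |
| Strength     | Mature, widely supported                  | Often better quality at 4-bit, fast kernels |
| Use when     | Broad tooling/compatibility               | You want best 4-bit accuracy + speed        |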

GPTQ quantizes weights one layer at a time, minimizing the output error introduced by rounding. It's well-established with broad ecosystem support.
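To make "output error introduced by rounding" concrete, here is a toy sketch of the quantity being minimized. It is not GPTQ itself: it uses plain round-to-nearest 4-bit quantization with one scale per row and only measures the resulting layer-output error on a few made-up calibration inputs. GPTQ goes further by adjusting the remaining weights to compensate for each rounding step. All names and numbers here are illustrative assumptions.

```python
def quantize_row(row, bits=4):
    """Round-to-nearest symmetric quantization of one weight row (baseline, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in row) / qmax or 1.0  # fall back to 1.0 for an all-zero row
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]


def layer_output(weights, x):
    """Output of a linear layer (no bias) for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def output_error(weights, q_weights, calib_inputs):
    """Sum of squared differences between original and quantized layer outputs."""
    total = 0.0
    for x in calib_inputs:
        y = layer_output(weights, x)
        y_q = layer_output(q_weights, x)
        total += sum((a - b) ** 2 for a, b in zip(y, y_q))
    return total


# Hypothetical 2x4 layer and two calibration inputs.
W = [[0.9, 0.2, -0.5, 0.33], [-0.1, 0.7, 0.25, -0.6]]
X = [[1.0, 0.5, -1.0, 2.0], [0.0, 1.5, 1.0, -0.5]]

W_q = [quantize_row(row) for row in W]
err = output_error(W, W_q, X)
print(f"layer output error after 4-bit rounding: {err:.4f}")
```

The point of measuring error at the layer output rather than on the weights themselves is that some rounding errors cancel out and others are amplified by the inputs, so the same weight error can matter very differently.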

AWQ (Activation-aware Weight Quantization) observes that a small fraction of weights matter most (those multiplying large activations) and protects them, which tends to preserve accuracy better at 4-bit and pairs with fast inference kernels.
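A toy illustration of the "protect salient weights" idea, under heavy assumptions: one weight row, hand-picked numbers, and a hand-picked scale of 2 for the channel that sees large activations. The real method searches per-channel scales using calibration data; this sketch only shows why scaling a salient weight up before quantization (and its activation down by the same factor) can shrink the output error.

```python
def quantize(weights, bits=4):
    """Round-to-nearest symmetric quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.9, 0.2, -0.5, 0.33]
activations = [1.0, 20.0, 1.0, 1.0]  # channel 1 sees large activations, so it is "salient"
exact = dot(weights, activations)

# Plain quantization: every channel is rounded on the same grid.
plain = dot(quantize(weights), activations)

# Activation-aware variant: scale the salient weight up and its activation down.
# The product w * x is unchanged before rounding, but the weight now occupies
# more of the quantization grid, so its relative rounding error is smaller.
s = [1.0, 2.0, 1.0, 1.0]  # hand-picked for illustration, not searched
scaled_w = [w * si for w, si in zip(weights, s)]
scaled_x = [x / si for x, si in zip(activations, s)]
protected = dot(quantize(scaled_w), scaled_x)

err_plain = abs(plain - exact)
err_protected = abs(protected - exact)
print(f"error without scaling: {err_plain:.3f}")
print(f"error with scaling:    {err_protected:.3f}")
```

With these particular numbers the scaled version has the smaller error; with other weights or scales the gain can vanish, which is why the scales are chosen from calibration data rather than by hand.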

There's also bitsandbytes (on-the-fly 8-bit/4-bit, easiest for training/QLoRA) and GGUF (the format Ollama/llama.cpp use for CPU/Apple Silicon).

Practical use

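```python
# Loading a pre-quantized AWQ model with vLLM.
# Start the server first (shell):
#   vllm serve TheBloke/Llama-3.1-8B-Instruct-AWQ --quantization awq

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is a placeholder for a local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

response = client.chat.completions.create(
    model="TheBloke/Llama-3.1-8B-Instruct-AWQ",  # must match the name passed to `vllm serve`
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```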

For most people the move is to download an already-quantized checkpoint (Hugging Face has GPTQ/AWQ/GGUF variants of popular models) rather than quantizing yourself. Quantize your own only when you have a fine-tuned model to compress — and pair it with LoRA fine-tuning.

Choosing bits

  • 8-bit: near-lossless, modest savings — safe default when you have some headroom.
  • 4-bit (AWQ/GPTQ): big savings, small quality hit — the sweet spot for local/consumer GPUs.
  • 3-bit and below: noticeable degradation; use only when desperate for memory.
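The rules of thumb above can be turned into a small helper. This is a sketch under stated assumptions: it prefers the highest precision whose weights fit, reserves a fixed 20% of VRAM for the KV cache and activations (an arbitrary placeholder, not a measured figure), and uses a weights-only size estimate. The name `pick_bits` and its parameters are hypothetical.

```python
def pick_bits(num_params, vram_gb, headroom=0.2, candidates=(16, 8, 4, 3)):
    """Return the widest bit width whose weights fit in the VRAM budget, else None."""
    budget_gb = vram_gb * (1 - headroom)  # leave room for KV cache and activations
    for bits in candidates:               # candidates are ordered widest first
        weights_gb = num_params * bits / 8 / 1e9
        if weights_gb <= budget_gb:
            return bits
    return None


for params, vram in [(8e9, 24), (8e9, 8), (70e9, 48), (70e9, 24)]:
    choice = pick_bits(params, vram)
    label = f"{choice}-bit" if choice else "does not fit"
    print(f"{params / 1e9:.0f}B on {vram}GB: {label}")
```

Long contexts or large batch sizes need more headroom than this default, so the budget is worth tuning to the actual workload.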
FAQ

Does quantization hurt quality?
A little. 4-bit is usually a small, acceptable drop; below 4-bit it grows.

GPTQ or AWQ?
AWQ often edges ahead on 4-bit accuracy and speed; GPTQ has the widest tooling. Try both on your task.

What about GGUF?
That's the format for CPU/Apple-Silicon via llama.cpp/Ollama; see local LLM comparison.

Can I quantize a LoRA fine-tune?
Yes: merge the adapter, then quantize, or use QLoRA, which trains on a quantized base.

Summary

Quantization is the lever that makes large models runnable on modest hardware. GPTQ and AWQ are the two leading 4-bit post-training methods; AWQ tends to win on accuracy/speed, GPTQ on ecosystem breadth. In practice, grab a pre-quantized checkpoint and serve it with vLLM or Ollama.

*Last updated: June 2026. Verify kernel/format support against vLLM, AutoAWQ, and AutoGPTQ docs.*
