KV Cache Optimization: Technical Deep Dive
How key-value caching accelerates autoregressive generation
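KV cache optimization deep dive (2026): the throughput bottleneck is the cache, not the weights. Covers the bytes-per-token formula with a worked calculation (8B model at 8K context ≈ 1 GB), PagedAttention, choosing GQA models, FP8 quantization, prefix caching and stable-prefix prompt design, and a prioritized action checklist.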
If you serve LLMs, the KV cache — not model weights — is usually what limits your throughput. Weights are a fixed cost; the KV cache grows with every token of every concurrent request, and at scale it's why your GPU runs out of memory at 30 concurrent users when the weights only fill half the card. This guide covers what the cache is, how to size it, and the optimizations that actually move production numbers.
What the KV cache is
Autoregressive generation produces one token at a time, and each new token attends to all previous tokens. Without caching you'd recompute the attention keys (K) and values (V) for the whole prefix at every step — O(n²) recomputation. The KV cache stores each token's K and V tensors once, making generation O(n) per token. It's the single most important inference optimization, and every serving stack uses it.
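To make the saving concrete, here is a minimal counting sketch. It only counts how many per-token K/V projections are performed while generating a sequence; it does not model the attention reads themselves, and the function names are illustrative rather than taken from any library.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """Count K/V projections when nothing is cached.

    At step t the model re-projects K and V for all t tokens seen so far,
    so the total over the sequence is 1 + 2 + ... + n.
    """
    return sum(range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """Count K/V projections when each token's K and V are stored once."""
    return num_tokens


if __name__ == "__main__":
    for n in (10, 1_000, 8_192):
        without = kv_projections_without_cache(n)
        cached = kv_projections_with_cache(n)
        print(f"{n:>5} tokens: {without:>12,} without cache, {cached:>6,} with cache")
```

The uncached count grows quadratically with sequence length while the cached count grows linearly, which is the gap the cache closes at the price of memory.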
The cost: memory. Per token, the cache holds K and V for every layer and KV head:
bytes_per_token = 2 (K and V) × layers × kv_heads × head_dim × bytes_per_element
Worked example: Llama-3.1-8B (32 layers, 8 KV heads, head_dim 128, FP16, so 2 bytes per element) gives 2 × 32 × 8 × 128 × 2 = 131,072 bytes, about 131 KB per token. One request with an 8K (8,192-token) context: ~1 GB. Twenty of those: ~20 GB, more than the weights of the 8B model itself (~16 GB in FP16). That's the problem in one paragraph.
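The same arithmetic as a small script, so you can plug in another model's numbers. The function and variable names are illustrative; the model parameters are the ones quoted in the worked example.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies: K and V, for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


# Parameters from the worked example (Llama-3.1-8B, FP16 cache).
LLAMA_3_1_8B_FP16 = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_element=2)

per_token = kv_bytes_per_token(**LLAMA_3_1_8B_FP16)
per_request = per_token * 8192          # one request with an 8K context
twenty_requests = per_request * 20      # twenty concurrent 8K requests

if __name__ == "__main__":
    print(f"per token:       {per_token:,} bytes")
    print(f"per 8K request:  {per_request / 1e9:.2f} GB")
    print(f"twenty requests: {twenty_requests / 1e9:.2f} GB")
```

Swapping in the `layers`, `kv_heads`, and `head_dim` of another model gives its per-request budget directly.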
Optimization 1: PagedAttention (use a real serving engine)
The naive approach pre-allocates a max-length contiguous buffer per request; with most requests far shorter than max length, real-world utilization can drop badly — the vLLM paper measured 60–80% of cache memory wasted in naive serving. PagedAttention (vLLM's core idea) manages the cache like virtual memory: fixed-size blocks allocated on demand, no contiguity requirement, near-zero fragmentation. This is the main reason vLLM-class engines deliver several-fold throughput gains over naive HuggingFace serving — and the first lesson of KV optimization: *don't hand-roll serving; run vLLM, TensorRT-LLM, or SGLang*. Comparison in our inference optimization guide.
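The block-table idea can be sketched with a toy allocator. This is not vLLM's implementation: the class, its method names, and the block size of 16 tokens are assumptions chosen for illustration. It only shows that blocks are handed out on demand, need not be contiguous, and leave at most one partly filled block per request.

```python
class BlockAllocator:
    """Toy on-demand KV block allocator (illustrative, not a real engine)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request id -> physical block ids
        self.lengths = {}                           # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, taking a new block only when needed."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % self.block_size == 0:           # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def reserved_slots(self) -> int:
        return sum(len(t) for t in self.block_tables.values()) * self.block_size

    def used_slots(self) -> int:
        return sum(self.lengths.values())


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8, block_size=16)
    for _ in range(20):
        allocator.append_token("a")
    for _ in range(5):
        allocator.append_token("b")
    print("block tables:", allocator.block_tables)
    print("reserved:", allocator.reserved_slots(), "used:", allocator.used_slots())
```

A pre-allocating scheme would instead reserve the maximum length for both requests up front, whatever their actual lengths turn out to be.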
Optimization 2: GQA — fewer KV heads (model-level)
Look at the formula: kv_heads multiplies everything. Grouped-Query Attention shares each K/V head across a group of query heads — Llama-3's 32 query heads share 8 KV heads, a 4× cache reduction versus classic multi-head attention, with minor quality cost. Modern open models (Llama 3, Qwen 2.5, Mistral) ship with GQA already; the practical takeaway is that *model choice sets your cache budget* — check num_key_value_heads in the config before capacity planning. (MQA — a single KV head — is the extreme version; DeepSeek's MLA compresses KV into latent vectors for even bigger savings.)
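A quick way to read the cache factor off a model config is sketched below. It assumes a Hugging Face style config dictionary with `num_attention_heads` and `num_key_value_heads` keys; the helper name is hypothetical, and a config without `num_key_value_heads` is treated as classic multi-head attention.

```python
def kv_cache_reduction_vs_mha(config: dict) -> float:
    """How many times smaller the KV cache is than with one KV head per query head."""
    query_heads = config["num_attention_heads"]
    # Assumption: a missing key means classic multi-head attention.
    kv_heads = config.get("num_key_value_heads", query_heads)
    if query_heads % kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    return query_heads / kv_heads


if __name__ == "__main__":
    examples = {
        "GQA, 32 query / 8 KV heads": {"num_attention_heads": 32, "num_key_value_heads": 8},
        "MQA, 32 query / 1 KV head": {"num_attention_heads": 32, "num_key_value_heads": 1},
        "MHA, 32 query heads": {"num_attention_heads": 32},
    }
    for name, cfg in examples.items():
        print(f"{name}: {kv_cache_reduction_vs_mha(cfg):.0f}x smaller cache than MHA")
```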
Optimization 3: KV cache quantization
Weights-quantized-but-cache-FP16 is leaving memory on the table. vLLM supports FP8 KV cache (--kv-cache-dtype fp8), halving cache memory versus FP16 — meaning ~2× the concurrent requests or 2× the context per GPU — with small accuracy impact for most workloads (benchmark yours; long multi-turn chains are the sensitive case). INT4 KV schemes exist in research and llama.cpp, with more visible quality loss.
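A rough capacity sketch of what halving the element size buys. The 20 GiB cache budget is a made-up figure for illustration, the model parameters are the Llama-3.1-8B values used earlier, and the calculation ignores any extra storage a quantization scheme needs for scaling factors.

```python
def max_cached_tokens(cache_budget_bytes: int, layers: int, kv_heads: int,
                      head_dim: int, bytes_per_element: int) -> int:
    """Tokens that fit in the cache budget at a given element size."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_element
    return cache_budget_bytes // per_token


CACHE_BUDGET = 20 * 2 ** 30   # hypothetical: 20 GiB left for the cache after weights
MODEL = dict(layers=32, kv_heads=8, head_dim=128)
CONTEXT = 8192                # tokens per request

fp16_tokens = max_cached_tokens(CACHE_BUDGET, bytes_per_element=2, **MODEL)
fp8_tokens = max_cached_tokens(CACHE_BUDGET, bytes_per_element=1, **MODEL)

if __name__ == "__main__":
    print(f"FP16: {fp16_tokens:,} tokens, {fp16_tokens // CONTEXT} requests at 8K")
    print(f"FP8:  {fp8_tokens:,} tokens, {fp8_tokens // CONTEXT} requests at 8K")
```

The token count doubles exactly because only `bytes_per_element` changes; whether accuracy holds is something the arithmetic cannot tell you, which is why the benchmark step matters.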
Optimization 4: Prefix caching — stop recomputing the system prompt
Production traffic shares prefixes constantly: the same system prompt + few-shot block heads every request; multi-turn chats re-send growing histories. Automatic prefix caching (vLLM --enable-prefix-caching, SGLang's RadixAttention) stores KV blocks keyed by content hash and reuses them across requests — the shared 2K-token preamble is computed once, and subsequent requests skip straight to the new tokens. Time-to-first-token drops dramatically for prompt-heavy, output-light workloads. Hosted APIs expose the same idea as "prompt caching" with discounted cached-input pricing — and the same design rule applies everywhere: put stable content (system prompt, examples, documents) first and volatile content (user question, timestamp) last, because caching is prefix matching and one changed byte invalidates everything after it.
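The prefix-matching rule can be illustrated with a toy content-hashed block cache. This is a simplified sketch, not how vLLM or SGLang implement it: the class, the four-token block size, and the hashing scheme are assumptions. Each block's hash includes the hash of everything before it, so a block is reusable only if the entire prefix up to it is identical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; tiny so the example is easy to follow


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of all preceding blocks."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixCache:
    """Toy prefix cache that only tracks which block hashes have been stored."""

    def __init__(self):
        self.stored = set()

    def process(self, token_ids):
        """Return (reused_tokens, computed_tokens) for one request."""
        hashes = block_hashes(token_ids)
        reused_blocks = 0
        for h in hashes:
            if h not in self.stored:
                break                      # first miss ends the reusable prefix
            reused_blocks += 1
        self.stored.update(hashes)
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused


if __name__ == "__main__":
    system_prompt = list(range(100, 112))            # 12 stable tokens
    cache = PrefixCache()
    print(cache.process(system_prompt + [1, 2, 3]))          # cold cache
    print(cache.process(system_prompt + [7, 8, 9, 10, 11]))  # stable prefix first
    print(cache.process([999] + system_prompt + [1, 2, 3]))  # volatile token first
```

In the third request a single volatile token placed ahead of the system prompt shifts every block, so nothing is reused even though most of the content is unchanged.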
Optimization 5: Bound what enters the cache
Set max_model_len to what your workload actually needs; capacity planning against a 128K theoretical context you never use wastes the whole budget.
What to do, in order
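1. Run a real serving engine (vLLM, TensorRT-LLM, or SGLang) rather than hand-rolling serving.
2. Pick a GQA model and check its num_key_value_heads.
3. Quantize the KV cache to FP8 and benchmark accuracy on your workload.
4. Enable prefix caching and put stable prompt content first.
5. Set max_model_len.
FAQ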
Does the KV cache change outputs? No — it stores exact values that would otherwise be recomputed. (Quantizing it does introduce approximation.)
Why is the first token slow and the rest fast? Prefill computes KV for the whole prompt (compute-bound); decode then generates token-by-token reading the cache (memory-bandwidth-bound). Prefix caching attacks the first; everything else here attacks the second.
Does batching share the cache? Scheduling is shared (continuous batching interleaves requests) but each request's KV entries are its own — only prefix caching deduplicates identical content.
*Last updated: June 2026. Numbers derive from model configs; benchmark on your own workload.*