← Back to tutorials

LLM Inference Optimization: vLLM, TensorRT-LLM, and Serving at Scale

PagedAttention, continuous batching, quantization, and production serving strategies

LLM Inference Optimization: vLLM, TensorRT-LLM & Serving at Scale (2026)

Inference, not training, is where most LLM money is spent in production — so squeezing more tokens per second per GPU dollar is the core optimization problem. This guide covers the key techniques (KV-cache management, batching, quantization) and the two leading serving engines, vLLM and TensorRT-LLM.

The bottleneck: the KV cache

Transformer inference caches the key/value tensors of every processed token so it doesn't recompute them — that's the KV cache. It grows with sequence length and dominates memory during serving. Managing it well is most of the battle.

  • PagedAttention (vLLM): manages the KV cache like an OS manages virtual memory — in pages — eliminating fragmentation and letting you pack far more concurrent sequences into the same VRAM. This is vLLM's core innovation.
  • Continuous batching: instead of waiting for a batch to finish, new requests join the running batch immediately, keeping the GPU saturated. This is the single biggest throughput win under concurrency.
  • vLLM vs TensorRT-LLM

    vLLMTensorRT-LLM

    StrengthEase + high throughput, PagedAttentionMaximum NVIDIA-GPU performance Setuppip install vllm, simpleHeavier (compile engines) Best forMost production servingSqueezing peak latency/throughput on NVIDIA

    vLLM is the pragmatic default — easy to run, OpenAI-compatible, excellent throughput. TensorRT-LLM can go faster still on NVIDIA hardware by compiling optimized engines, at the cost of more setup. Start with vLLM; reach for TensorRT-LLM when you've proven you need the last increment of performance. For the Ollama-vs-vLLM framing, see Ollama vs vLLM.

    bash
    

    vLLM: one command to a high-throughput OpenAI-compatible server

    vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

    Other levers

  • Quantization (4-bit AWQ/GPTQ) cuts memory and bandwidth — see 模型量化指南.
  • Speculative decoding uses a small draft model to propose tokens a big model verifies, speeding generation.
  • Prefix caching reuses the KV cache for shared prompt prefixes (system prompts, few-shot examples).
  • Right-size the model. The cheapest optimization is using a smaller model for easy requests — see GPT-4o mini vs Claude Haiku.
  • FAQ

    Why is throughput low under load? Likely no continuous batching. vLLM fixes this out of the box. vLLM or TensorRT-LLM? vLLM for ease + great throughput; TensorRT-LLM for peak NVIDIA performance when you need it. Biggest single win? Continuous batching, then quantization, then speculative/prefix caching. Does quantization slow things down? Usually the opposite — less memory bandwidth per token often speeds inference.

    Summary

    Optimize inference by managing the KV cache (PagedAttention), keeping the GPU busy (continuous batching), and shrinking the model (quantization). vLLM delivers most of this with a single command; TensorRT-LLM squeezes out the last increment on NVIDIA. Measure tokens/sec on your own hardware and stack the wins.


    *Last updated: June 2026. Verify against the vLLM and TensorRT-LLM docs.*

    Also available in 中文.