LLM Inference Optimization: vLLM, TensorRT-LLM, and Serving at Scale
PagedAttention, continuous batching, quantization, and production serving strategies
LLM Inference Optimization: vLLM, TensorRT-LLM & Serving at Scale (2026)
Inference, not training, is where most LLM money is spent in production — so squeezing more tokens per second per GPU dollar is the core optimization problem. This guide covers the key techniques (KV-cache management, batching, quantization) and the two leading serving engines, vLLM and TensorRT-LLM.
The bottleneck: the KV cache
Transformer inference caches the key/value tensors of every processed token so it doesn't recompute them — that's the KV cache. It grows with sequence length and dominates memory during serving. Managing it well is most of the battle.
vLLM vs TensorRT-LLM
pip install vllm, simplevLLM is the pragmatic default — easy to run, OpenAI-compatible, excellent throughput. TensorRT-LLM can go faster still on NVIDIA hardware by compiling optimized engines, at the cost of more setup. Start with vLLM; reach for TensorRT-LLM when you've proven you need the last increment of performance. For the Ollama-vs-vLLM framing, see Ollama vs vLLM.
bash
vLLM: one command to a high-throughput OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
Other levers
FAQ
Why is throughput low under load? Likely no continuous batching. vLLM fixes this out of the box. vLLM or TensorRT-LLM? vLLM for ease + great throughput; TensorRT-LLM for peak NVIDIA performance when you need it. Biggest single win? Continuous batching, then quantization, then speculative/prefix caching. Does quantization slow things down? Usually the opposite — less memory bandwidth per token often speeds inference.
Summary
Optimize inference by managing the KV cache (PagedAttention), keeping the GPU busy (continuous batching), and shrinking the model (quantization). vLLM delivers most of this with a single command; TensorRT-LLM squeezes out the last increment on NVIDIA. Measure tokens/sec on your own hardware and stack the wins.
*Last updated: June 2026. Verify against the vLLM and TensorRT-LLM docs.*
Also available in 中文.