LLM Inference Optimization: vLLM, TensorRT-LLM & Quantization in 2025

Achieve 10-50x throughput improvements for LLM serving through batching, quantization, and GPU optimization

返回教程列表
高级23 分钟

LLM Inference Optimization: vLLM, TensorRT-LLM & Quantization in 2025

Achieve 10-50x throughput improvements for LLM serving through batching, quantization, and GPU optimization

Serving LLMs in production requires careful optimization to achieve cost-effective performance at scale. This guide covers continuous batching with vLLM, NVIDIA TensorRT-LLM for GPU-optimized inference, speculative decoding, flash attention, KV cache optimization, INT4/INT8 quantization with AWQ and GPTQ, and benchmarking LLM serving systems to find the right performance/cost tradeoff.

LLM InferencevLLMTensorRT-LLMQuantizationAWQFlash Attention

LLM Inference Optimization: vLLM, TensorRT-LLM & Quantization

The LLM Serving Challenge

A 70B parameter model in FP16 requires 140GB VRAM—2× A100 80GB GPUs. Naive serving: 1 request at a time, GPU mostly idle between tokens. Production requirements: hundreds of concurrent users, latency SLAs (first token < 500ms, streaming), cost efficiency ($ per 1M tokens).

Optimization levers: batching (process multiple requests together), memory management (efficient KV cache), quantization (smaller data types), attention optimization (FlashAttention), speculative decoding (draft model + target model).

vLLM: PagedAttention for High Throughput

Why vLLM?

Traditional serving pre-allocates memory for max sequence length, wasting 60-80% of KV cache memory. vLLM's PagedAttention manages KV cache like virtual memory in OS—allocates in pages, enables copy-on-write for beam search and parallel sampling.

Results: 2-24x higher throughput than HuggingFace Transformers with same GPU. Sub-millisecond scheduling overhead. Supports tensor parallelism across multiple GPUs.

vLLM Deployment

Serve Llama 3 70B with tensor parallelism across 2 GPUs: python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-70b-instruct --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --max-num-seqs 256.

This exposes an OpenAI-compatible API. Drop-in replacement for OpenAI client: change base_url to your vLLM endpoint.

Continuous Batching

Naive static batching: wait for N requests, process together, release. Continuous batching: as soon as one sequence finishes, immediately slot in next waiting request. Achieves 3-5x higher GPU utilization, especially for variable-length generations.

NVIDIA TensorRT-LLM

For NVIDIA GPUs, TensorRT-LLM achieves the best raw performance through: kernel fusion (combine multiple operations), in-flight batching (continuous batching optimized for TRT), FP8 precision support, multi-head attention optimization, and Triton Inference Server integration.

Performance comparison on A100: Llama 2 70B TensorRT-LLM throughput = 2.3x vLLM. But more complex setup requiring model conversion and NVIDIA-specific deployment.

Convert model to TensorRT engine: trtllm-build --checkpoint_dir llama-70b-fp8 --output_dir engine_dir --gemm_plugin float16 --max_batch_size 64 --max_input_len 2048 --max_output_len 512.

Quantization Deep Dive

AWQ (Activation-aware Weight Quantization)

4-bit weight quantization with better quality than GPTQ for most models. Key insight: not all weights are equally important—protect salient weights (high activation magnitude) with more bits.

AWQ quantized model loads at 1/4 the memory of FP16 with <1% quality degradation on most benchmarks. Use autoawq library: AutoAWQForCausalLM.from_quantized with fuse_layers=True for maximum performance.

GPTQ (Generative Pre-trained Transformer Quantization)

Layer-wise quantization using second-order information. Slightly lower quality than AWQ for the same bit-width, but more mature tooling and broader model support.

auto-gptq library: GPTQConfig(bits=4, dataset=calibration_data, tokenizer=tokenizer), model.quantize(). Save quantized model with model.save_quantized("model-gptq").

INT8 Inference with bitsandbytes

8-bit inference with LLM.int8() technique: mix-precision (outlier features in FP16, rest in INT8). 2x memory reduction with ~1% quality loss. Compatible with standard HuggingFace: load_in_8bit=True.

Choosing Quantization

For highest quality: FP8 (requires Hopper GPU, A100/H100). For balance: AWQ INT4 (good quality, works on any GPU). For CPU inference: GGUF Q4_K_M (llama.cpp, good for small models). For lowest latency on NVIDIA: TensorRT-LLM FP8.

Speculative Decoding

Use small "draft" model (e.g., 7B) to generate K candidate tokens speculatively, then verify with large "target" model (70B) in a single forward pass. If draft tokens are correct, accept all. If not, accept up to the first incorrect token.

Speedup: 2-3x for tasks where draft model predictions are often correct (code completion, factual responses). No quality degradation—mathematically equivalent to target model output.

Flash Attention

Standard attention is O(n²) in memory. FlashAttention computes attention in tiles that fit in SRAM, avoiding memory bandwidth bottleneck. FlashAttention-2 achieves 2-4x speedup over standard attention, supports causal masking, and handles arbitrary sequence lengths.

Enable in Transformers: use_flash_attention_2=True. Required for long-context inference (32K+ tokens) to be practical.

Benchmarking LLM Serving

Key metrics: time to first token (TTFT, p50/p95), inter-token latency (streaming), throughput (tokens/second), cost per 1M tokens, maximum concurrent users at latency SLA.

Use lm-evaluation-harness for quality benchmarks. Use the benchmarking scripts in vLLM repository for throughput testing. Test at realistic concurrency levels matching your production load.

Optimization hierarchy: 1) Use FlashAttention (free speedup), 2) Use continuous batching (vLLM), 3) Apply quantization (AWQ INT4 for 4x memory reduction), 4) Enable speculative decoding for suitable tasks, 5) Consider TensorRT-LLM for maximum GPU utilization.

相关工具

vLLMTensorRT-LLMautoawqbitsandbytesllama.cpp