LLM Inference Optimization: vLLM, TensorRT-LLM, and Serving at Scale

PagedAttention, continuous batching, quantization, and production serving strategies

Advanced · 35 minutes

Optimize LLM inference throughput and latency using vLLM, TensorRT-LLM, and other frameworks. Covers PagedAttention, continuous batching, model quantization, and multi-GPU serving.

LLM-inference, vLLM, TensorRT, optimization, serving

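LLM inference optimization is critical for cost-effective production deployment. Key concepts:

1) KV cache: during generation, a transformer stores the key and value tensors of every token it has already processed, so they are not recomputed at each decoding step. PagedAttention (vLLM) manages this cache the way an operating system manages virtual memory: the cache is split into fixed-size blocks that do not have to be contiguous, which nearly eliminates fragmentation. The commonly cited result is an improvement in GPU memory utilization from about 60% to about 97%, though the gain depends on the workload.

2) Continuous batching: instead of waiting for every sequence in a batch to finish, the scheduler adds new requests to the running batch as soon as other sequences complete. Throughput gains of up to 23x over static batching have been reported; the actual gain depends on how much output lengths vary.

3) Speculative decoding: a small draft model (e.g., 3B) proposes several tokens, and the large target model (e.g., 70B) verifies them in a single forward pass, accepting or rejecting each one. This can reduce latency by 2-3x on many generation tasks, depending on how often the draft tokens are accepted.

4) Quantization: INT8 halves weight memory relative to FP16 with minimal quality loss. AWQ (Activation-aware Weight Quantization) protects the weights that matter most to the activations, achieving better quality than naive round-to-nearest quantization at the same bit width. GPTQ enables 4-bit weights with acceptable loss.

5) Tensor parallelism: split the weight matrices inside each layer across GPUs (e.g., Llama 70B across 4x A100 80GB). Pipeline parallelism: assign consecutive groups of layers (stages) to different GPUs, with activations passed from one stage to the next.

vLLM setup: pip install vllm; python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2 --dtype float16. Note that --tensor-parallel-size 2 requires two GPUs.

TensorRT-LLM can offer a 3-5x improvement over raw PyTorch on production NVIDIA hardware.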
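To make the paged KV cache idea concrete, here is a minimal sketch in plain Python. It is not vLLM's implementation: the names `static_utilization`, `paged_utilization`, and `BlockAllocator` are hypothetical, and the block size of 16 tokens and maximum length of 2048 are assumptions chosen only for illustration. It compares how much reserved cache space is actually used when each request reserves a contiguous maximum-length region versus when it holds only as many fixed-size blocks as it needs.

```python
import math


def static_utilization(seq_lens, max_len):
    """Used fraction of KV slots when every request reserves a
    contiguous region of max_len token slots up front."""
    reserved = len(seq_lens) * max_len
    return sum(seq_lens) / reserved


def paged_utilization(seq_lens, block_size):
    """Used fraction of KV slots when each request holds only
    ceil(len / block_size) fixed-size blocks."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    return sum(seq_lens) / (blocks * block_size)


class BlockAllocator:
    """Toy block table mapping each request id to physical block ids.

    Blocks are handed out on demand, one at a time, and need not be
    contiguous. This is an illustrative assumption, not vLLM's API.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve cache space for one more token of req_id."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # no block yet, or last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return all blocks of a finished request to the free pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


if __name__ == "__main__":
    lens = [100, 700, 35]  # hypothetical sequence lengths in tokens
    print(f"static: {static_utilization(lens, max_len=2048):.1%}")
    print(f"paged:  {paged_utilization(lens, block_size=16):.1%}")
```

With paging, the only wasted space is the unfilled tail of each request's last block, so utilization stays high regardless of how far sequence lengths fall short of the maximum. Released blocks return to a shared pool, which is also what lets a scheduler admit new requests as soon as others finish.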