Transformer Architecture Deep Dive: Attention Mechanisms and Modern Variants
From vanilla attention to Flash Attention, Grouped Query Attention, and Mamba
A technical deep dive into the transformer architecture, covering self-attention, multi-head attention, positional encoding, and the efficiency improvements used in modern large language models such as Llama.
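The transformer architecture, introduced in "Attention Is All You Need" (2017), remains the foundation of most modern large language models.

Self-attention computes relationships between all pairs of positions in a sequence. The input X is projected into queries, keys, and values, Q = XW_Q, K = XW_K, V = XW_V, and the output is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Dividing by sqrt(d_k) keeps the scale of the dot products roughly constant as the key dimension d_k grows; without it, large scores would push the softmax into saturation, where gradients become very small.

Multi-head attention runs h attention heads in parallel, each with its own learned projections, then concatenates the head outputs and applies a final linear projection. This lets the model attend to information from different representation subspaces at the same time.

Modern efficiency improvements target memory traffic and cache size. Flash Attention computes exact attention but reorganizes the work to minimize reads and writes to GPU high-bandwidth memory (HBM): it processes Q, K, and V in tiles and recomputes intermediate values in the backward pass instead of storing the full n x n attention matrix, with reported speedups of roughly 2-4x. Grouped Query Attention (GQA, used in the larger Llama 2 models and in Llama 3) shares each K,V head across a group of Q heads. The KV cache shrinks by the ratio of query heads to K,V heads (8x with 64 query heads and 8 K,V heads), with minimal quality loss. Multi-Query Attention (MQA) is the extreme case, with a single K,V head shared by all query heads.

Positional encodings come in several forms. Absolute encodings were used in the original transformer; learned absolute embeddings cannot represent positions beyond the training length. RoPE (rotary position embeddings, used in GPT-NeoX and Llama) encodes relative position by rotating queries and keys and is amenable to context-length extension, although going far beyond the training length usually requires additional scaling techniques. ALiBi (used in BLOOM) adds a linear, distance-based bias to the attention scores and was designed for length extrapolation.

State space models such as Mamba are an alternative to attention. Mamba uses selective state spaces and scales as O(n) in sequence length, compared with O(n^2) for full self-attention, with reported quality that is competitive on long sequences.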
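
To make the formulas above concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, together with the K,V head sharing that GQA and MQA rely on. It is an illustration only, not how any production model implements attention: it uses plain Python lists rather than a tensor library, omits masking, the learned projections W_Q, W_K, W_V, and the output projection, and the function names (softmax, attention, grouped_query_attention) are hypothetical names chosen for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (one row per position).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def grouped_query_attention(Q_heads, K_heads, V_heads):
    """Assumed GQA layout: h_q query heads share h_kv K,V heads.

    Query head i uses K,V head i // (h_q // h_kv).
    With h_kv == 1 this reduces to MQA; with h_kv == h_q it is
    ordinary multi-head attention.
    """
    h_q, h_kv = len(Q_heads), len(K_heads)
    if h_kv == 0 or h_q % h_kv != 0:
        raise ValueError("number of query heads must be a multiple of K,V heads")
    group = h_q // h_kv  # also the factor by which the KV cache shrinks
    return [attention(Q_heads[i], K_heads[i // group], V_heads[i // group])
            for i in range(h_q)]


# A zero query scores every key equally, so the output is the mean of V.
print(attention([[0.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]]))  # prints [[2.0, 3.0]]
```

In this sketch, only the K_heads and V_heads would need to be cached during generation, which is why the cache size depends on h_kv rather than h_q.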