Transformer Architecture Deep Dive: Attention Mechanisms and Modern Variants

From vanilla attention to Flash Attention, Grouped Query Attention, and Mamba

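The transformer architecture, introduced in "Attention Is All You Need" (2017), remains the foundation of most modern large language models. Self-attention computes relationships between all pairs of positions. The input X is projected into queries, keys, and values, Q=XW_Q, K=XW_K, V=XW_V, and then Attention(Q,K,V) = softmax(QK^T/sqrt(d_k))V. Without the sqrt(d_k) scaling, the dot products grow in magnitude with the key dimension d_k and push the softmax into saturated regions where gradients are very small. Multi-head attention runs h attention heads in parallel, each with its own learned projections, then concatenates the head outputs and applies a final projection. This lets the model attend to information from different representation subspaces.

Modern efficiency improvements target memory traffic and cache size. Flash Attention computes exact attention but restructures it to minimize reads and writes to GPU high-bandwidth memory (HBM): it tiles the computation so that blocks fit in on-chip memory and recomputes intermediates in the backward pass instead of storing the full n x n attention matrix, with reported speedups of 2-4x over standard implementations. Grouped Query Attention (GQA, used in the larger Llama 2 models and in Llama 3) shares each K,V head across a group of Q heads. The KV cache shrinks by the ratio of query heads to K,V heads, for example 8x with 64 query heads and 8 K,V heads, with minimal quality loss. Multi-Query Attention (MQA) is the extreme case with a single shared K,V head.

Positional encodings have moved from absolute encodings (the original approach, which generalizes poorly beyond the training length) to relative schemes. RoPE (rotary embeddings, used in GPT-NeoX and Llama) encodes relative position by rotating queries and keys, and in practice is usually paired with a scaling method to reach contexts longer than those seen in training. ALiBi (used in BLOOM) adds a linear, distance-based bias to the attention scores and was designed for length extrapolation.

State space models such as Mamba are an alternative to attention built on selective state spaces. They scale as O(n) in sequence length rather than O(n^2), with competitive quality at longer sequences.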
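The attention formula and the K,V head sharing in GQA can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not a reference implementation: the weights are random, there is no causal mask or output projection, and the names (`attention`, `grouped_query_attention`, `h_q`, `h_kv`) and sizes are chosen here for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    # q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v)
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def grouped_query_attention(q, k, v):
    # q: (h_q, n, d); k, v: (h_kv, n, d), with h_q divisible by h_kv.
    h_q, h_kv = q.shape[0], k.shape[0]
    assert h_q % h_kv == 0, "query heads must split evenly into groups"
    group = h_q // h_kv
    # Each K,V head serves `group` consecutive query heads.
    k_rep = np.repeat(k, group, axis=0)
    v_rep = np.repeat(v, group, axis=0)
    out, _ = attention(q, k_rep, v_rep)
    return out

# Illustrative sizes (assumptions): 6 tokens, 8 query heads, 2 K,V heads.
rng = np.random.default_rng(0)
n, d_model, h_q, h_kv = 6, 32, 8, 2
d_head = d_model // h_q

x = rng.standard_normal((n, d_model))
w_q = rng.standard_normal((d_model, h_q * d_head))
w_k = rng.standard_normal((d_model, h_kv * d_head))
w_v = rng.standard_normal((d_model, h_kv * d_head))

# Project, then split the feature axis into heads: (heads, n, d_head).
q = (x @ w_q).reshape(n, h_q, d_head).transpose(1, 0, 2)
k = (x @ w_k).reshape(n, h_kv, d_head).transpose(1, 0, 2)
v = (x @ w_v).reshape(n, h_kv, d_head).transpose(1, 0, 2)

out = grouped_query_attention(q, k, v)                    # (h_q, n, d_head)
concat = out.transpose(1, 0, 2).reshape(n, h_q * d_head)  # heads concatenated

# Only k and v are cached during decoding, so the saving over
# standard multi-head attention is the ratio h_q / h_kv.
kv_cache_ratio = h_q / h_kv
print(concat.shape, kv_cache_ratio)
```

Setting `h_kv = h_q` in this sketch recovers standard multi-head attention, and `h_kv = 1` gives MQA. The `np.repeat` call is only for clarity; a real implementation would avoid materializing the repeated K,V tensors.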
