Diffusion Models Explained: From DDPM to Stable Diffusion and FLUX
Technical walkthrough of denoising diffusion, latent spaces, and conditioning mechanisms
Diffusion models generate images by learning to reverse a gradual noising process: take real images, destroy them step by step with Gaussian noise, and train a network to undo each step. Sampling then starts from pure noise and denoises iteratively into an image. This guide walks the full technical chain — DDPM math, latent diffusion (Stable Diffusion), classifier-free guidance, and the flow-matching generation (SD3, FLUX) — with just enough formalism to read the papers.
The forward process: controlled destruction
The forward process is a fixed Markov chain that adds Gaussian noise over T steps (typically T=1000):
q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) · x_{t-1}, β_t · I)
where β_t is the noise schedule (small at early timesteps, larger later). The forward process has a closed-form marginal, which is what makes training practical: with α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, the reparameterization trick lets you jump to any timestep directly:
x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, ε ~ N(0, I)
There is no need to simulate t steps: sample a random t and mix the clean image with noise in one shot. By t=T, ᾱ_T is close to zero, so x_T is practically indistinguishable from pure Gaussian noise.
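The one-shot jump can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the linear schedule endpoints (1e-4 to 0.02) are an assumption borrowed from common DDPM setups, the function names are hypothetical, and the "image" is a toy list of four values.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas. The endpoints are assumed, DDPM-style defaults."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all s up to t (0-indexed)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in one shot. Returns (x_t, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

betas = linear_beta_schedule()
alpha_bar = cumulative_alpha_bar(betas)
rng = random.Random(0)

x0 = [0.5, -0.25, 1.0, 0.0]  # toy 4-value "image"
x_mid, _ = q_sample(x0, 500, alpha_bar, rng)   # partly noised
x_last, _ = q_sample(x0, 999, alpha_bar, rng)  # almost pure noise

# The surviving signal fraction shrinks from nearly 1 toward 0.
print(alpha_bar[0], alpha_bar[500], alpha_bar[999])
```

Because `alpha_bar` is precomputed once, noising a sample at any timestep costs the same as noising it at the first one.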
The reverse process: learning to denoise
The generative model learns p_θ(x_{t-1} | x_t). DDPM's key simplification (Ho et al., 2020): instead of predicting the denoised image directly, train a network ε_θ(x_t, t) to predict the noise that was added, with a plain MSE loss:
L = E_{x_0, ε, t} [ ‖ε − ε_θ(x_t, t)‖² ]
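A single Monte Carlo sample of this loss might look like the sketch below. Everything named here is illustrative: `zero_model` is a hypothetical stand-in for a trained network, the schedule values are assumed, and the squared error is averaged per element rather than summed, which only rescales the loss by a constant.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def ddpm_loss(eps_model, x0, alpha_bar, rng):
    """One sample of the loss: pick t, noise x0, score the noise prediction."""
    t = rng.randrange(len(alpha_bar))            # uniform random timestep
    noise = [rng.gauss(0.0, 1.0) for _ in x0]    # the regression target
    a = math.sqrt(alpha_bar[t])
    b = math.sqrt(1.0 - alpha_bar[t])
    x_t = [a * x + b * e for x, e in zip(x0, noise)]
    predicted = eps_model(x_t, t)
    # Mean squared error between true and predicted noise.
    return sum((e - p) ** 2 for e, p in zip(noise, predicted)) / len(x0)

def zero_model(x_t, t):
    """Hypothetical placeholder for a network: always predicts zero noise."""
    return [0.0] * len(x_t)

alpha_bar = make_alpha_bar()
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]

losses = [ddpm_loss(zero_model, x0, alpha_bar, rng) for _ in range(2000)]
avg_loss = sum(losses) / len(losses)
# A predictor that always outputs zero scores about 1.0 on average,
# because its error is exactly the unit-variance noise.
print(avg_loss)
```

In real training, `eps_model` would be the network, `x0` a batch of images, and the loss would be backpropagated instead of printed.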
Given the predicted noise you can recover an estimate of x_0 and take one denoising step. The network is classically a UNet — downsampling and upsampling paths with skip connections, plus attention layers — with the timestep t injected via sinusoidal embeddings so one network handles all noise levels. (Newer models replace the UNet with transformers — see DiT below.)
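Recovering x_0 and taking one denoising step can be sketched as follows. The x_0 estimate is just the forward formula solved for x_0. The step uses the ancestral update with variance σ_t² = β_t, which is one of the choices discussed in the DDPM paper and is an assumption here, as are the schedule values and function names.

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bar) for an assumed linear schedule."""
    betas = [beta_start + i * (beta_end - beta_start) / (T - 1) for i in range(T)]
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return betas, alpha_bar

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]

def ddpm_step(x_t, eps_hat, t, betas, alpha_bar, rng):
    """One reverse step x_t -> x_{t-1}, using sigma_t^2 = beta_t (assumed)."""
    beta_t = betas[t]
    coef = beta_t / math.sqrt(1.0 - alpha_bar[t])
    mean = [(x - coef * e) / math.sqrt(1.0 - beta_t) for x, e in zip(x_t, eps_hat)]
    if t == 0:
        return mean  # the final step adds no fresh noise
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

betas, alpha_bar = make_schedule()
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
t = 300
noise = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [math.sqrt(alpha_bar[t]) * x + math.sqrt(1.0 - alpha_bar[t]) * e
       for x, e in zip(x0, noise)]

# With the true noise standing in for a perfect prediction, x0 comes back.
x0_hat = predict_x0(x_t, noise, alpha_bar[t])
x_prev = ddpm_step(x_t, noise, t, betas, alpha_bar, rng)
print(x0_hat)
```

A trained network only approximates the noise, so in practice `x0_hat` is blurry at high t and sharpens as sampling proceeds.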
Why predicting ε beats predicting x_0 directly: the target has unit variance at every timestep, which makes optimization stable across the whole schedule.
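The sinusoidal timestep embedding mentioned above can be sketched like this. The layout (all sines, then all cosines) and the `max_period` of 10000 are assumptions; implementations differ in ordering and scaling, and the function name is hypothetical.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Map a scalar timestep to a dim-length vector of sines and cosines.

    Frequencies are spaced geometrically from 1 down toward 1 / max_period,
    so nearby timesteps get similar vectors and distant ones get distinct ones.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_early = timestep_embedding(10, 8)
emb_late = timestep_embedding(900, 8)
print(emb_early)
print(emb_late)
```

The network typically passes this vector through a small MLP and adds the result to its intermediate features, which is how one set of weights can serve every noise level.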
Sampling: from 1000 steps to 20
Naive DDPM sampling runs all T steps — slow. Practical samplers cut this dramatically:
The sampler dropdown in ComfyUI/A1111 is exactly this choice: solver + step count + noise schedule.
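As one concrete instance of a reduced-step sampler, here is a sketch of deterministic DDIM sampling (Song et al., 2021) over 20 evenly strided timesteps. DDIM is chosen purely for illustration; the text above does not single it out. The schedule values are assumed, and `make_oracle` is a hypothetical stand-in for a trained network that happens to know the clean sample, which makes the loop easy to check.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_start + i * (beta_end - beta_start) / (T - 1))
        alpha_bar.append(running)
    return alpha_bar

def ddim_sample(eps_model, x_T, alpha_bar, num_steps=20):
    """Deterministic DDIM (eta = 0) over a strided subset of timesteps."""
    T = len(alpha_bar)
    # Evenly spaced timesteps, visited from noisiest (T - 1) down to 0.
    timesteps = [round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    x = list(x_T)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        # After the last step we land on the clean sample (alpha_bar = 1).
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps_hat = eps_model(x, t)
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for xi, e in zip(x, eps_hat)]
        # Re-noise the x0 estimate to the next, lower noise level.
        x = [math.sqrt(ab_prev) * c + math.sqrt(1.0 - ab_prev) * e
             for c, e in zip(x0_hat, eps_hat)]
    return x

def make_oracle(x0_true, alpha_bar):
    """Hypothetical perfect predictor: returns the noise implied by x0_true."""
    def eps_model(x_t, t):
        ab = alpha_bar[t]
        return [(x - math.sqrt(ab) * c) / math.sqrt(1.0 - ab)
                for x, c in zip(x_t, x0_true)]
    return eps_model

alpha_bar = make_alpha_bar()
rng = random.Random(0)
target = [0.5, -0.25, 1.0, 0.0]
x_T = [rng.gauss(0.0, 1.0) for _ in target]

sample = ddim_sample(make_oracle(target, alpha_bar), x_T, alpha_bar, num_steps=20)
print(sample)  # matches target up to floating-point error
```

The step count is just the length of the timestep list, which is why the same trained network can be sampled at 20, 50, or 1000 steps without retraining.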
Latent diffusion: the Stable Diffusion trick
Running diffusion on 1024×1024×3 pixels is brutally expensive. Latent Diffusion (Rombach et al., 2022) first trains a VAE that compresses images ~8× per side into a latent space (e.g. 128×128×4 for a 1024² image), runs the entire diffusion process in that latent space — roughly 64× fewer spatial elements — then decodes the final latent with the VAE decoder. That efficiency is what made image generation consumer-hardware-viable, and "Stable Diffusion" is precisely this architecture: VAE + UNet-in-latent-space + text conditioning.
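The arithmetic behind those numbers is easy to check. The helper below is a hypothetical convenience function; the 8× downsampling factor and 4 latent channels are taken from the example in the text.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent for a given image size (height, width, channels)."""
    return (height // downsample, width // downsample, latent_channels)

pixel_shape = (1024, 1024, 3)
latent = latent_shape(1024, 1024)

pixel_positions = pixel_shape[0] * pixel_shape[1]
latent_positions = latent[0] * latent[1]

spatial_ratio = pixel_positions / latent_positions
value_ratio = (pixel_positions * pixel_shape[2]) / (latent_positions * latent[2])

print(latent)         # (128, 128, 4)
print(spatial_ratio)  # 64.0
print(value_ratio)    # 48.0
```

The 64× figure counts spatial positions, which is what drives the cost of convolution and attention. Counting raw values, the latent is 48× smaller, because it has 4 channels instead of 3.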
When you see "VAE" as a downloadable file in SD tooling, it's this autoencoder (text-to-image generation uses only its decoder half). A mismatched VAE gives washed-out or oversaturated outputs.
Text conditioning and classifier-free guidance
Text enters through cross-attention: a frozen text encoder (CLIP in SD1.5/SDXL; SD3 and FLUX add T5 for better prompt comprehension) embeds the prompt, and each UNet/transformer block attends to those embeddings.
Classifier-free guidance (CFG) is the dial that makes prompts actually bind. During training, the text conditioning is randomly dropped ~10% of the time, so the same network learns both conditional and unconditional denoising. At sampling time, extrapolate between them:
ε̂ = ε_uncond + s · (ε_cond − ε_uncond)
The guidance scale s (the "CFG" slider, typically 5–8) amplifies the direction "toward the prompt." Too low → ignores the prompt; too high → oversaturated, fried-looking images. This one equation explains the most important slider in every generation UI.
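Both halves of CFG fit in a few lines. This is a sketch on plain Python lists: the function names are hypothetical, `drop_prob=0.1` mirrors the ~10% figure above, and the placeholder strings stand in for real embedding tensors.

```python
import random

def maybe_drop_prompt(prompt_embedding, null_embedding, rng, drop_prob=0.1):
    """Training side: swap the prompt for a null embedding some of the time."""
    return null_embedding if rng.random() < drop_prob else prompt_embedding

def cfg_combine(eps_uncond, eps_cond, scale):
    """Sampling side: extrapolate from the unconditional prediction
    toward (and past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]

guided = cfg_combine(eps_uncond, eps_cond, 7.5)
print(guided)  # [7.5, 1.0]

rng = random.Random(0)
picked = [maybe_drop_prompt("prompt", "null", rng) for _ in range(10000)]
drop_fraction = picked.count("null") / len(picked)
print(drop_fraction)  # close to 0.1
```

With `scale=1.0` the result is the plain conditional prediction, and with `scale=0.0` it is the unconditional one. Values above 1 push past the conditional prediction, which is where both the stronger prompt adherence and the "fried" look at high settings come from. Each guided step needs two network evaluations, one with the prompt and one without.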
The current generation: DiT and flow matching
Two shifts define post-2023 architectures:
For the practical model landscape and which model to use for what, see Stable Diffusion vs FLUX and Midjourney vs DALL-E vs SD. For running them locally, see the SD 3.5 local deployment guide.
How the ecosystem maps to the theory
FAQ
Why do diffusion models beat GANs? Stable training (no adversarial game), better mode coverage (diversity), and a tractable way to trade compute for quality at inference (more steps). GANs still win on single-step speed — which is why distilled diffusion is converging toward GAN-like step counts.
Are video models the same idea? Largely yes — DiT-based diffusion/flow models over spatiotemporal latents, with the same conditioning and guidance machinery. See Runway vs Kling vs Hailuo.
What should I read first? Ho et al. 2020 (DDPM) → Rombach et al. 2022 (latent diffusion) → the SD3 paper (rectified flow + DiT, 2024). The math above is the working vocabulary for all three.
*Last updated: June 2026.*