Diffusion Models Explained: From DDPM to Stable Diffusion and FLUX

Technical walkthrough of denoising diffusion, latent spaces, and conditioning mechanisms

Diffusion models generate images by learning to reverse a gradual noising process: take real images, destroy them step by step with Gaussian noise, and train a network to undo each step. Sampling then starts from pure noise and denoises iteratively into an image. This guide walks the full technical chain — DDPM math, latent diffusion (Stable Diffusion), classifier-free guidance, and the flow-matching generation (SD3, FLUX) — with just enough formalism to read the papers.

The forward process: controlled destruction

The forward process is a fixed Markov chain that adds Gaussian noise over T steps (typically T=1000):

q(x_t | x_{t-1}) = N(x_t; sqrt(1 - β_t) · x_{t-1}, β_t · I)

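where β_t is the noise schedule (small at the first step and growing toward t=T in the standard schedules). Because consecutive Gaussian steps compose into a single Gaussian, the forward process has a closed form, and that is what makes training practical: with α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, you can jump to any timestep directly: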

x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε,   ε ~ N(0, I)

No need to simulate t steps — sample a random t, mix the clean image with noise in one shot. By t=T, x_T is indistinguishable from pure Gaussian noise.
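The closed form above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the linear schedule from Ho et al. (β from 1e-4 to 0.02 over T=1000). The helper names `linear_betas`, `alpha_bars`, and `q_sample` are made up for this example, and a flat list of floats stands in for an image.

```python
import math
import random


def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise schedule beta_1..beta_T (assumed values)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0) in one shot; t is a 1-based timestep."""
    signal = math.sqrt(abar[t - 1])
    noise_scale = math.sqrt(1.0 - abar[t - 1])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps


abar = alpha_bars(linear_betas())
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]                  # toy "image" with four pixels
x_early, _ = q_sample(x0, 10, abar, rng)     # still close to x0
x_late, _ = q_sample(x0, 1000, abar, rng)    # almost pure noise
```

With this schedule, sqrt(ᾱ_T) is below 0.01, so almost none of the original signal survives at the last step.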

The reverse process: learning to denoise

The generative model learns p_θ(x_{t-1} | x_t). DDPM's key simplification (Ho et al., 2020): instead of predicting the denoised image directly, train a network ε_θ(x_t, t) to predict the noise that was added, with a plain MSE loss:

L = E_{x_0, ε, t} [ ‖ε − ε_θ(x_t, t)‖² ]
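As a rough sketch of how this loss is estimated for one training example: pick a random timestep, noise the clean sample in closed form, and compare the network's output with the noise that was actually used. The `model(x_t, t)` callable is a stand-in for ε_θ, and `ddpm_loss`, `make_alpha_bars`, and the schedule values are assumptions for illustration (a real implementation would use a tensor library and batches).

```python
import math
import random


def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    abar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        abar.append(running)
    return abar


def ddpm_loss(model, x0, abar, rng):
    """One-sample estimate of the noise-prediction loss, averaged over elements."""
    t = rng.randint(1, len(abar))                 # uniform timestep in 1..T
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # the noise to be predicted
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    pred = model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(x0)


def zero_model(x_t, t):
    """Untrained stand-in that always predicts zero noise."""
    return [0.0] * len(x_t)


abar = make_alpha_bars()
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
losses = [ddpm_loss(zero_model, x0, abar, rng) for _ in range(2000)]
mean_loss = sum(losses) / len(losses)   # close to 1.0, the variance of eps
```

A predictor that always outputs zero sits at a loss of about 1; training pushes the network below that baseline at every noise level.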

Given the predicted noise you can recover an estimate of x_0 and take one denoising step. The network is classically a UNet — downsampling and upsampling paths with skip connections, plus attention layers — with the timestep t injected via sinusoidal embeddings so one network handles all noise levels. (Newer models replace the UNet with transformers — see DiT below.)
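The timestep embedding mentioned above can be sketched as follows. It mirrors the transformer-style sinusoidal encoding; the embedding size, the `max_period` of 10000, and the sines-then-cosines layout are assumptions here, since implementations differ in those details.

```python
import math


def timestep_embedding(t, dim=8, max_period=10000.0):
    """Map a scalar timestep to `dim` sines and cosines (dim must be even).

    Frequencies are geometrically spaced, starting at 1 and decaying
    toward 1 / max_period.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(t * f) for f in freqs]
    cosines = [math.cos(t * f) for f in freqs]
    return sines + cosines


emb_0 = timestep_embedding(0)      # all sines are 0, all cosines are 1
emb_500 = timestep_embedding(500)  # a distinct vector for a mid-schedule step
```

In a real denoiser this vector is usually passed through a small MLP and added to (or used to modulate) the feature maps in each block.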

Why predicting ε beats predicting x_0 directly: the target has unit variance at every timestep, which makes optimization stable across the whole schedule.

Sampling: from 1000 steps to 20

Naive DDPM sampling runs all T steps — slow. Practical samplers cut this dramatically:

  • DDIM (Song et al.): makes the reverse process deterministic and allows skipping steps; roughly 50 steps approaches the quality of 1000-step DDPM sampling.
  • Higher-order ODE solvers (DPM-Solver++, UniPC): treat denoising as solving an ODE; 15–25 steps is a common default in current UIs.
  • Distillation (LCM, SDXL-Turbo, FLUX schnell): train a student model to approximate the teacher's many-step result in 1–8 steps. This is how "real-time" generation works; the trade-off is some loss of quality and diversity.

The sampler dropdown in ComfyUI/A1111 is exactly this choice: solver + step count + noise schedule.
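To make the step-skipping concrete, here is a hedged sketch of deterministic DDIM sampling (η = 0) over a strided subset of timesteps. The `eps_model` argument stands in for a trained ε_θ. The toy oracle below, which assumes every training image equals one fixed vector `target`, is purely illustrative and lets the loop run without a network. Function names and the linear schedule are assumptions.

```python
import math
import random


def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    abar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        abar.append(running)
    return abar


def ddim_sample(eps_model, x_T, abar, num_steps=20):
    """Deterministic DDIM over evenly strided timesteps from T downward."""
    T = len(abar)
    stride = T // num_steps
    timesteps = list(range(T, 0, -stride))   # for T=1000, 20 steps: 1000, 950, ..., 50
    x = list(x_T)
    for i, t in enumerate(timesteps):
        a_t = abar[t - 1]
        # alpha_bar of the next, less noisy step; 1.0 means "fully clean".
        a_prev = abar[timesteps[i + 1] - 1] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        # Estimate the clean sample, then re-noise it to the lower noise level.
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0h + math.sqrt(1.0 - a_prev) * e
             for x0h, e in zip(x0_hat, eps)]
    return x


abar = make_alpha_bars()
target = [0.5, -0.25, 1.0, 0.0]


def oracle_eps(x_t, t):
    """Exact noise when all data equals `target` (toy assumption, not a network)."""
    a = abar[t - 1]
    return [(xi - math.sqrt(a) * x0) / math.sqrt(1.0 - a)
            for xi, x0 in zip(x_t, target)]


rng = random.Random(0)
x_T = [rng.gauss(0.0, 1.0) for _ in target]
sample = ddim_sample(oracle_eps, x_T, abar, num_steps=20)
```

With a perfect noise predictor the step count hardly matters; with a real network, fewer steps mean larger discretization error, which is the gap that higher-order solvers and distillation try to close.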

Latent diffusion: the Stable Diffusion trick

Running diffusion directly on 1024×1024×3 pixels is very expensive. Latent Diffusion (Rombach et al., 2022) first trains a VAE that compresses images about 8× per side into a latent space (e.g. 128×128×4 for a 1024² image), runs the entire diffusion process in that latent space, with 64× fewer spatial positions and about 48× fewer values overall, then decodes the final latent with the VAE decoder. That efficiency is what made image generation viable on consumer hardware, and "Stable Diffusion" is precisely this architecture: VAE + UNet-in-latent-space + text conditioning.
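The size arithmetic is easy to check. The sketch below assumes the 8× downsampling and 4 latent channels quoted above (newer VAEs may use more channels); `latent_shape` and `count` are hypothetical helpers.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Shape (H, W, C) of the VAE latent for an image of the given size."""
    return (height // downsample, width // downsample, channels)


def count(shape):
    """Number of scalar values in a tensor of this shape."""
    h, w, c = shape
    return h * w * c


pixels = (1024, 1024, 3)
latent = latent_shape(1024, 1024)                                    # (128, 128, 4)
spatial_ratio = (pixels[0] * pixels[1]) // (latent[0] * latent[1])   # 64
value_ratio = count(pixels) // count(latent)                         # 48
```

The 64× figure counts spatial positions; counting every stored value, including channels, the reduction is 48×.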

When you see "VAE" as a downloadable file in SD tooling, it's this autoencoder; the decoder half is what turns the final latent into pixels, so a mismatched VAE gives washed-out or oversaturated outputs.

Text conditioning and classifier-free guidance

Text enters through attention: a frozen text encoder (CLIP in SD1.5/SDXL; SD3 and FLUX add T5 for better prompt comprehension) embeds the prompt, and the denoiser attends to those embeddings, via cross-attention in each UNet block or, in the SD3/FLUX transformers, via joint attention over text and image tokens.

Classifier-free guidance (CFG) is the dial that makes prompts actually bind. During training, the text conditioning is randomly dropped ~10% of the time, so the same network learns both conditional and unconditional denoising. At sampling time, extrapolate from the unconditional prediction through the conditional one:

ε̂ = ε_uncond + s · (ε_cond − ε_uncond)

The guidance scale s (the "CFG" slider, typically 5–8) amplifies the direction "toward the prompt": s = 1 is plain conditional sampling, and larger values extrapolate past it. Too low → ignores the prompt; too high → oversaturated, fried-looking images. This one equation explains what is arguably the most important slider in most generation UIs.
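Both halves of CFG fit in a few lines. The following is a minimal sketch, not any particular library's API: `maybe_drop_prompt` illustrates the training-time dropout (with the ~10% rate as the default), and `cfg_combine` applies the guidance equation element-wise to two noise predictions, here represented as plain lists.

```python
import random


def maybe_drop_prompt(prompt_emb, null_emb, rng, p_drop=0.1):
    """Training side: use the "empty prompt" embedding with probability p_drop."""
    return null_emb if rng.random() < p_drop else prompt_emb


def cfg_combine(eps_uncond, eps_cond, scale):
    """Sampling side: eps_hat = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

no_guidance = cfg_combine(eps_uncond, eps_cond, 1.0)   # equals eps_cond
guided = cfg_combine(eps_uncond, eps_cond, 7.0)        # pushed well past eps_cond
```

Note that guidance costs two denoiser evaluations per step (conditional and unconditional), which is one reason guidance-distilled models are attractive.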

The current generation: DiT and flow matching

Two shifts define post-2023 architectures:

  • Diffusion Transformers (DiT) — replace the UNet with a transformer over latent patches; scales better with compute. SD3, FLUX, and most frontier image/video models (including Sora-class video) are DiT-family.
  • Rectified flow / flow matching — instead of the curved denoising trajectory of DDPM, train the model to follow (near-)straight paths from noise to data, predicting a velocity field rather than noise. Straighter paths → fewer sampling steps for the same quality. SD3 and FLUX both train this way — which is why FLUX feels different in step-count behavior than SD1.5-era models.

For the practical model landscape and which model to use for what, see Stable Diffusion vs FLUX and Midjourney vs DALL-E vs SD; for running them locally, see the SD 3.5 local deployment guide.
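A minimal sketch of the rectified-flow idea described above, assuming the common convention x_t = (1 − t)·x_0 + t·ε with t in [0, 1] and velocity target ε − x_0 (papers differ on the direction of time). `velocity_model` stands in for the trained network. The toy oracle assumes all data equals one fixed vector, so the paths are exactly straight and a single Euler step suffices, which real models only approximate.

```python
import random


def interpolate(x0, eps, t):
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def velocity_target(x0, eps):
    """Regression target for the network: d x_t / d t = eps - x0."""
    return [e - a for a, e in zip(x0, eps)]


def euler_sample(velocity_model, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (data)."""
    x, dt = list(noise), 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity_model(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -0.25, 1.0, 0.0]


def oracle_velocity(x_t, t):
    """Exact velocity when every data point equals `target` (toy assumption)."""
    return [(xi - a) / t for xi, a in zip(x_t, target)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
one_step = euler_sample(oracle_velocity, noise, num_steps=1)
four_steps = euler_sample(oracle_velocity, noise, num_steps=4)
```

Because the velocity is constant along a straight path, Euler integration is exact here regardless of step count; curvature in a learned velocity field is what forces real samplers to take more steps.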

How the ecosystem maps to the theory

| Tool/term | What it is in theory terms |
| --- | --- |
| LoRA | Low-rank fine-tune of (typically) the attention weights; cheap style/subject specialization |
| ControlNet | A parallel network injecting spatial conditioning (pose, depth, edges) into the denoiser |
| Inpainting | Diffusion where known pixels are re-noised and clamped at each step, so only the masked region is generated |
| img2img / "denoising strength" | Start sampling from a partially noised version of your input instead of pure noise; strength = how far up the noise schedule you go |
| Negative prompt | The unconditional branch of CFG replaced with "what to steer away from" |
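As one example of this mapping, img2img can be sketched as: convert the strength into a starting timestep, noise the input to that level in closed form, and run the usual sampler from there. The `img2img_start` helper, the linear schedule, and the rounding rule are assumptions for illustration; real tools differ in how they discretize strength.

```python
import math
import random


def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for an assumed linear beta schedule."""
    abar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        abar.append(running)
    return abar


def img2img_start(x_input, strength, abar, rng):
    """Return (t_start, x_t): where sampling starts and the noised input.

    strength=0 keeps the input untouched; strength=1 starts from t=T,
    where the input is almost entirely replaced by noise.
    """
    T = len(abar)
    t_start = min(T, max(0, round(strength * T)))
    if t_start == 0:
        return 0, list(x_input)
    a = abar[t_start - 1]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
           for x in x_input]
    return t_start, x_t


abar = make_alpha_bars()
rng = random.Random(0)
image = [0.5, -0.25, 1.0, 0.0]
t_half, noised_half = img2img_start(image, 0.5, abar, rng)   # start mid-schedule
t_none, untouched = img2img_start(image, 0.0, abar, rng)     # nothing to denoise
```

The sampler then runs only from `t_start` down to 0, which is why low strengths both preserve more of the input and finish faster.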

FAQ

Why do diffusion models beat GANs? Stable training (no adversarial game), better mode coverage (diversity), and a tractable way to trade compute for quality at inference (more steps). GANs still win on single-step speed, which is why distilled diffusion is converging toward GAN-like step counts.

Are video models the same idea? Largely yes: DiT-based diffusion/flow models over spatiotemporal latents, with the same conditioning and guidance machinery. See Runway vs Kling vs Hailuo.

What should I read first? Ho et al. 2020 (DDPM) → Rombach et al. 2022 (latent diffusion) → the SD3 paper (rectified flow + DiT, 2024). The math above is the working vocabulary for all three.


*Last updated: June 2026.*
