Diffusion Models Explained: From DDPM to Stable Diffusion and FLUX
Technical walkthrough of denoising diffusion, latent spaces, and conditioning mechanisms
Technical deep dive into diffusion models including the diffusion process, denoising networks, classifier-free guidance, latent diffusion, and the architecture of Stable Diffusion and FLUX.
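Diffusion models progressively destroy data with Gaussian noise (the forward process) and learn to undo that corruption step by step (the reverse process).
Forward process: q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I), where beta_t in (0, 1) is the noise schedule. With alpha_t = 1 - beta_t and alpha_bar_t defined as the product of alpha_s for s = 1..t, the reparameterization x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, with epsilon ~ N(0, I), gives x_t at any timestep directly from x_0, without simulating the intermediate steps.
Reverse process: learn p_theta(x_{t-1} | x_t) by training a UNet to predict the added noise, epsilon_theta(x_t, t). DDPM trains with the MSE loss ||epsilon - epsilon_theta(x_t, t)||^2, taken over randomly sampled timesteps t and noise draws epsilon.
Stable Diffusion uses latent diffusion: a VAE encoder compresses the image by 8x along each spatial dimension (for example, a 512x512 image becomes a 64x64 latent), diffusion runs in that latent space, and the VAE decoder maps the result back to pixels. The denoiser therefore processes 64x fewer spatial positions, which is the source of the efficiency gain. A CLIP text encoder provides text conditioning via cross-attention layers in the UNet.
Classifier-free guidance (CFG): a single model is trained both with and without conditioning by randomly dropping the condition during training. At inference the two predictions are combined as epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond). Setting w = 1 recovers the conditional prediction; higher w gives stronger adherence to the prompt but less diversity.
FLUX (Black Forest Labs) replaces the UNet with a DiT (Diffusion Transformer) architecture, trains with a flow matching objective instead of the DDPM noise-prediction loss, and uses rectified flows for faster sampling. At the time of writing it is widely regarded as among the state of the art for text-to-image generation.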
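
To make the DDPM and CFG formulas above concrete, here is a minimal sketch in plain Python. It uses only the standard library so that it runs without any ML framework. Several things are assumptions for illustration rather than part of the text: the linear schedule from 1e-4 to 0.02 over 1000 steps, the function names (linear_beta_schedule, cumulative_alphas, q_sample, mse, cfg_combine), and the all-zero "prediction", which is only a placeholder for a trained epsilon_theta network.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_t values (range and length are assumed defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0 via the closed form; returns (x_t, epsilon)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps


def mse(eps, eps_pred):
    """Mean squared error between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Toy 4-dimensional "image"; t is a 0-based index into the schedule.
x0 = [0.5, -1.0, 0.25, 2.0]
x_t, eps = q_sample(x0, 500, alpha_bars, rng)

# Placeholder predictor that always outputs zeros (NOT a trained network).
loss_zero_pred = mse(eps, [0.0] * len(eps))

# Guidance on toy noise predictions with guidance scale w = 7.5.
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], w=7.5)
```

In a real training loop, t would be drawn at random for every example and the placeholder would be replaced by the network output epsilon_theta(x_t, t). In cfg_combine, w = 0 returns the unconditional prediction, w = 1 returns the conditional one, and larger values extrapolate past it.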