Diffusion Models Explained: From DDPM to Stable Diffusion and FLUX

Technical walkthrough of denoising diffusion, latent spaces, and conditioning mechanisms

Advanced, 42 minutes

Technical deep dive into diffusion models including the diffusion process, denoising networks, classifier-free guidance, latent diffusion, and the architecture of Stable Diffusion and FLUX.

diffusion-models, Stable-Diffusion, FLUX, generative-AI, image-generation

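Diffusion models progressively destroy data with Gaussian noise (the forward process) and learn to undo that corruption (the reverse process).

Forward process: q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I), where beta_t is the noise schedule. Define alpha_t = 1 - beta_t and alpha_bar_t = alpha_1 * alpha_2 * ... * alpha_t. The reparameterization x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, with epsilon ~ N(0, I), then lets you sample x_t at any timestep directly from x_0, without simulating the intermediate steps.

Reverse process: learn p_theta(x_{t-1} | x_t) by training a UNet to predict the added noise, epsilon_theta(x_t, t). DDPM trains with an MSE loss: ||epsilon - epsilon_theta(x_t, t)||^2.

Stable Diffusion uses latent diffusion. A VAE encoder compresses the image by 8x along each spatial dimension (for example, a 512x512 image becomes a 64x64 latent), diffusion runs in that latent space, and the VAE decoder maps the result back to pixels. Because 8x per side means 64x fewer spatial positions, each denoising step is far cheaper than it would be in pixel space. A CLIP text encoder provides text conditioning, which the UNet consumes through cross-attention.

Classifier-free guidance (CFG): during training the conditioning is randomly dropped, so a single network learns both the conditional and the unconditional noise prediction. At inference the two are combined: epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond). A higher w gives stronger adherence to the prompt but less diversity.

FLUX (Black Forest Labs) replaces the UNet with a DiT (Diffusion Transformer) architecture and is trained with a flow matching objective instead of the DDPM noise-prediction loss. Its rectified flows aim for straighter sampling paths, which allows faster sampling with fewer steps. It is among the strongest text-to-image models at the time of writing.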
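The formulas above can be made concrete with a small sketch. The following pure-Python example is illustrative only: it treats a short list of floats as the "image", and the helper names (`linear_beta_schedule`, `cumulative_alpha_bar`, `q_sample`, `noise_mse`, `cfg_combine`), the linear schedule endpoints, and the 1000-step count are assumptions chosen for the example, not values taken from this tutorial. The zero-noise predictor is a placeholder standing in for a trained UNet.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_t values (endpoints are illustrative assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one step; returns (x_t, epsilon)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    x_t = [a * xi + b * ei for xi, ei in zip(x0, eps)]
    return x_t, eps


def noise_mse(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def cfg_combine(eps_uncond, eps_cond, w):
    """epsilon_guided = epsilon_uncond + w * (epsilon_cond - epsilon_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alpha_bar(betas)

# Toy "image" with four values; real inputs would be image or latent tensors.
x0 = [1.0, -0.5, 0.25, 0.0]
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)

# Placeholder predictor that always outputs zero noise (not a trained model).
eps_pred = [0.0] * len(x0)
loss = noise_mse(eps, eps_pred)

# Guidance on two toy predictions: w > 1 extrapolates past the conditional one.
guided = cfg_combine([0.0, 0.0], [1.0, -1.0], w=7.5)
print(alpha_bars[0], alpha_bars[-1], loss, guided)
```

Note that alpha_bar_t shrinks toward zero as t grows, so x_t at late timesteps is almost pure noise. In `cfg_combine`, w = 0 returns the unconditional prediction, w = 1 returns the conditional one, and larger values push further in the direction of the prompt.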