A diffusion model is a type of generative AI that learns to create new data — typically images, audio, or video — by reversing a step-by-step noising process. During training, the model sees real examples gradually corrupted with Gaussian noise across many small steps, and a neural network is trained to predict the noise that was added at each step. Once trained, the model can start from pure random noise and iteratively "denoise" it into a coherent new sample, such as a photorealistic image guided by a text prompt.
How a diffusion model works
Training pairs two processes. The forward process is fixed rather than learned: it takes a clean training image and adds small amounts of Gaussian noise over a fixed number of timesteps (often 1,000) until the image is indistinguishable from static. The reverse process is learned: a neural network (often a U-Net or, in newer systems, a transformer) is trained to estimate the noise that was added at a given timestep, so that the noise can be subtracted to step back toward a clean image.
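The forward process and the training target can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the linear schedule from 1e-4 to 0.02 is an assumed (though commonly used) setting, the "image" is a made-up list of four numbers, and the "network" is a placeholder that always predicts zero. It relies on the standard closed-form shortcut that lets a training step jump directly to any timestep instead of adding noise one step at a time.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t: running product of (1 - beta_s) for all s up to t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)

rng = random.Random(0)
betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)

x0 = [0.5, -0.25, 1.0, 0.0]      # made-up 4-"pixel" image
t = rng.randrange(len(betas))    # random timestep, as drawn in each training step
xt, eps = add_noise(x0, t, alpha_bars, rng)

# Placeholder for the network: always predicts zero noise.
# A real model would take (xt, t) and be updated to reduce this loss.
predicted = [0.0 for _ in xt]
loss = noise_prediction_loss(predicted, eps)
```

By the last timestep `alpha_bars[-1]` is close to zero, so almost none of the original signal remains in `xt`, which is the "indistinguishable from static" endpoint described above.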
At inference, generation starts from a sample of pure Gaussian noise. The model iteratively denoises it, step by step, until a clean image emerges. To make generation conditional (for example, turning the prompt "a corgi on a skateboard" into an image), a text encoder such as a CLIP or T5 model embeds the prompt, and the diffusion network is trained to denoise while attending to that embedding. Classifier-free guidance, introduced by Ho and Salimans in a 2021 workshop paper, trains the same network to denoise both with and without the prompt by randomly dropping the prompt during training. At inference, the sampler computes both predictions and extrapolates from the unconditional one past the conditional one, which strengthens how closely the output follows the prompt.
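The sampling loop and the guidance step can be sketched together. This is a toy under loud assumptions: `toy_eps_model` is a hypothetical stand-in for the trained network (it returns the exact noise estimate for a dataset in which every example equals a single number), the "image" is one scalar, the schedule values are assumed, and `guidance_scale` follows a common implementation convention rather than the notation of the original paper.

```python
import math
import random

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus cumulative products alpha_bar_t (assumed values)."""
    betas = [beta_start + i * (beta_end - beta_start) / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def toy_eps_model(x_t, a_bar, target):
    """Hypothetical stand-in for the network: the exact noise estimate when
    all training data equals `target`. A real model learns this from data."""
    return (x_t - math.sqrt(a_bar) * target) / math.sqrt(1.0 - a_bar)

def guided_eps(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward (for scale > 1, past) the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def sample(cond_target, uncond_target=0.0, guidance_scale=1.0, seed=0):
    """DDPM-style ancestral sampling on a single scalar 'pixel'."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule()
    x = rng.gauss(0.0, 1.0)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_c = toy_eps_model(x, alpha_bars[t], cond_target)
        eps_u = toy_eps_model(x, alpha_bars[t], uncond_target)
        eps = guided_eps(eps_u, eps_c, guidance_scale)
        # Remove the estimated noise for this step.
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps) / math.sqrt(1.0 - betas[t])
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)
    return x

plain = sample(cond_target=1.0, guidance_scale=1.0)
strong = sample(cond_target=1.0, guidance_scale=3.0)
```

In this toy setting a scale of 1 reproduces the conditional target, while a scale of 3 lands well beyond it, away from the unconditional target. That overshoot is the extrapolation at work, and it hints at why very high guidance scales can produce exaggerated outputs in practice.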
Why it matters
Diffusion models are the backbone of today's leading text-to-image systems, including Stable Diffusion, DALL·E 3, Midjourney, and Google's Imagen. They tend to produce higher-fidelity and more diverse samples than earlier generative approaches such as GANs, and because the denoising network can take extra inputs at every step, they are straightforward to condition on signals like text, depth maps, or sketches. Beyond images, the same recipe powers models for audio (e.g. DiffSinger), video, protein structure generation (e.g. RFdiffusion), and 3D shape generation, making diffusion one of the most versatile generative frameworks in modern AI.
Key types of diffusion models
- Denoising Diffusion Probabilistic Models (DDPMs): the foundational formulation by Ho et al. (2020) that frames generation as iterative denoising starting from Gaussian noise.
- Denoising Diffusion Implicit Models (DDIMs): a sampler that reuses a DDPM-trained network but takes non-Markovian, optionally deterministic steps, so it can skip timesteps and cut inference time without retraining.
- Latent Diffusion Models (LDMs): popularized by Stable Diffusion; these run the diffusion process in the compressed latent space of an autoencoder instead of pixel space, substantially reducing compute.
- Score-based generative models: a continuous-time view that connects diffusion to score matching and stochastic differential equations (SDEs), enabling flexible samplers.
- Rectified flow and flow matching: newer, closely related formulations that learn straighter noise-to-data paths, allowing generation in fewer steps.
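To make the DDIM entry above concrete, here is a minimal sketch of deterministic DDIM sampling that visits only 50 of 1,000 training timesteps. The names are hypothetical, the schedule values are assumed, and `toy_eps_model` is again a stand-in for a trained network (exact only for a dataset in which every example equals `target`), so the sketch shows the mechanics of skipping timesteps rather than realistic sample quality.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fractions alpha_bar_t for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + i * (beta_end - beta_start) / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def toy_eps_model(x_t, a_bar, target=1.0):
    """Hypothetical stand-in for a trained network
    (exact when all training data equals `target`)."""
    return (x_t - math.sqrt(a_bar) * target) / math.sqrt(1.0 - a_bar)

def ddim_sample(num_inference_steps=50, seed=0):
    """Deterministic DDIM sampling (eta = 0) over a strided subset of timesteps."""
    alpha_bars = alpha_bar_schedule()
    stride = len(alpha_bars) // num_inference_steps
    timesteps = list(range(len(alpha_bars) - 1, -1, -stride))  # 999, 979, ..., 19
    x = random.Random(seed).gauss(0.0, 1.0)  # the only randomness: the starting noise
    for i, t in enumerate(timesteps):
        a_bar = alpha_bars[t]
        # alpha_bar of the next, less noisy timestep; 1.0 means fully clean.
        a_bar_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = toy_eps_model(x, a_bar)
        # Estimate the clean sample implied by the current noise prediction.
        x0_pred = (x - math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a_bar)
        # Re-noise that estimate to the next noise level, reusing the same eps.
        x = math.sqrt(a_bar_prev) * x0_pred + math.sqrt(1.0 - a_bar_prev) * eps
    return x, len(timesteps)

sample_value, steps_used = ddim_sample()
```

Because the toy predictor is exact, the result lands on the target regardless of how many steps are used. With a real network, the step count is a trade-off between speed and sample quality.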
For a deeper technical treatment, the original DDPM paper by Ho, Jain and Abbeel (2020) and the latent diffusion paper by Rombach et al. (2022) are the standard starting points. In short, diffusion models turn generation into many small, learned denoising steps, a simple idea that has reshaped creative AI.