What are Diffusion Models?
July 11, 2021 · 26 min · Lilian Weng
Table of Contents
[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)].
[Updated on 2022-08-27: Added classifier-free guidance, GLIDE, unCLIP and Imagen.
[Updated on 2022-08-31: Added latent diffusion model.
So far, I’ve written about three types of generative models, GAN, VAE, and Flow-based models. They have shown great success in generating high-quality samples, but each has some limitations of its own. GAN models are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAE relies on a surrogate loss. Flow models have to use specialized architectures to construct reversible transform.
Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).
Fig. 1. Overview of different types of generative models.# What are Diffusion Models?
Several diffusion-based generative models have been proposed with similar ideas underneath, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score network (NCSN; Yang & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al. 2020).
Forward diffusion process
Given a data point sampled from a real data distribution x0∼q(x), let us define a forward diffusion process in which we add small amount of Gaussian noise to the sample in T steps, producing a sequence of noisy samples x1,…,xT. The step sizes are controlled by a variance schedule {βt∈(0,1)}t=1T.
q(xt|xt−1)=N(xt;1−βtxt−1,βtI)q(x1:T|x0)=∏t=1Tq(xt|xt−1)
$$
q(\mathbf{x}t \vert \mathbf{x}{t-1}) = \mathcal{N}(\mathbf{x}t; \sqrt{1 - \beta_t} \mathbf{x}{t-1}, \beta_t\mathbf{I}) \quad
q(\mathbf{x}_{1:T} \vert \mathbf{x}0) = \prod^T{t=1} q(\mathbf{x}t \vert \mathbf{x}{t-1})
$$
The data sample x0 gradually loses its distinguishable features as the step t becomes larger. Eventually when T→∞, xT is equivalent to an isotropic Gaussian distribution.
Fig. 2. The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations)A nice property of the above process is that we can sample xt at any arbitrary time step t in a closed form using reparameterization trick. Let αt=1−βt and α¯t=∏i=1tαi:
xt=αtxt−1+1−αtϵt−1 ;where ϵt−1,ϵt−2,⋯∼N(0,I)=αtαt−1xt−2+1−αtαt−1ϵ¯t−2 ;where ϵ¯t−2 merges two Gaussians (*).=…=α¯tx0+1−α¯tϵq(xt|x0)=N(xt;α¯tx0,(1−α¯t)I)
(*) Recall that when we merge two Gaussians with different variance, N(0,σ12I) and N(0,σ22I), the new distribution is N(0,(σ12+σ22)I). Here the merged standard deviation is (1−αt)+αt(1−αt−1)=1−αtαt−1.
Usually, we can afford a larger update step when the sample gets noisier, so β1<β2<⋯<βT and therefore α¯1>⋯>α¯T.
Connection with stochastic gradient Langevin dynamics
Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, stochastic gradient Langevin dynamics (Welling & Teh 2011) can produce samples from a probability density p(x) using only the gradients ∇xlogp(x) in a Markov chain of updates:
xt=xt−1+δ2∇xlogq(xt−1)+δϵt,where ϵt∼N(0,I)
where δ is the step size. When T→∞,ϵ→0, xT equals to the true probability density p(x).
Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapses into local minima.
Reverse diffusion process
If we can reverse the above process and sample from q(xt−1|xt), we will be able to recreate the true sample from a Gaussian noise input, xT∼N(0,I). Note that if βt is small enough, q(xt−1|xt) will also be Gaussian. Unfortunately, we cannot easily estimate q(xt−1|xt) because it needs to use the entire dataset and therefore we need to learn a model pθ to approximate these conditional probabilities in order to run the reverse diffusion process.
pθ(x0:T)=p(xT)∏t=1Tpθ(xt−1|xt)pθ(xt−1|xt)=N(xt−1;μθ(xt,t),Σθ(xt,t))
Fig. 3. An example of training a diffusion model for modeling a 2D swiss roll data. (Image source: Sohl-Dickstein et al., 2015)It is noteworthy that the reverse conditional probability is tractable when conditioned on x0:
q(xt−1|xt,x0)=N(xt−1;μ~(xt,x0),β~tI)
Using Bayes' rule, we have:
q(xt−1|xt,x0)=q(xt|xt−1,x0)q(xt−1|x0)q(xt|x0)∝exp(−12((xt−αtxt−1)2βt+(xt−1−α¯t−1x0)21−α¯t−1−(xt−α¯tx0)21−α¯t))=exp(−12(xt2−2αtxtxt−1+αtxt−12βt+xt−12−2α¯t−1x0xt−1+α¯t−1x021−α¯t−1−(xt−α¯tx0)21−α¯t))=exp(−12((αtβt+11−α¯t−1)xt−12−(2αtβtxt+2α¯t−11−α¯t−1x0)xt−1+C(xt,x0)))
where C(xt,x0) is some function not involving xt−1 and details are omitted. Following the standard Gaussian density function, the mean and variance can be parameterized as follows (recall that αt=1−βt and α¯t=∏i=1Tαi):
βt=1/(αtβt+11−α¯t−1)=1/(αt−α¯t+βtβt(1−α¯t−1))=1−α¯t−11−α¯t⋅βtμt(xt,x0)=(αtβtxt+α¯t−11−α¯t−1x0)/(αtβt+11−α¯t−1)=(αtβtxt+α¯t−11−α¯t−1x0)1−α¯t−11−α¯t⋅βt=αt(1−α¯t−1)1−α¯txt+α¯t−1βt1−α¯tx0
Thanks to the nice property, we can represent x0=1α¯t(xt−1−α¯tϵt) and plug it into the above equation and obtain:
μ~t=αt(1−α¯t−1)1−α¯txt+α¯t−1βt1−α¯t1α¯t(xt−1−α¯tϵt)=1αt(xt−1−αt1−α¯tϵt)
As demonstrated in Fig. 2., such a setup is very similar to VAE and thus we can use the variational lower bound to optimize the negative log-likelihood.
−logpθ(x0)≤−logpθ(x0)+DKL(q(x1:T|x0)‖pθ(x1:T|x0))=−logpθ(x0)+Ex1:T∼q(x1:T|x0)[logq(x1:T|x0)pθ(x0:T)/pθ(x0)]=−logpθ(x0)+Eq[logq(x1:T|x0)pθ(x0:T)+logpθ(x0)]=Eq[logq(x1:T|x0)pθ(x0:T)]Let LVLB=Eq(x0:T)[logq(x1:T|x0)pθ(x0:T)]≥−Eq(x0)logpθ(x0)
It is also straightforward to get the same result using Jensen’s inequality. Say we want to minimize the cross entropy as the learning objective,
LCE=−Eq(x0)logpθ(x0)=−Eq(x0)log(∫pθ(x0:T)dx1:T)=−Eq(x0)log(∫q(x1:T|x0)pθ(x0:T)q(x1:T|x0)dx1:T)=−Eq(x0)log(Eq(x1:T|x0)pθ(x0:T)q(x1:T|x0))≤−Eq(x0:T)logpθ(x0:T)q(x1:T|x0)=Eq(x0:T)[logq(x1:T|x0)pθ(x0:T)]=LVLB
To convert each term in the equation to be analytically computable, the objective can be further rewritten to be a combination of several KL-divergence and entropy terms (See the detailed step-by-step process in Appendix B in Sohl-Dickstein et al., 2015):
LVLB=Eq(x0:T)[logq(x1:T|x0)pθ(x0:T)]=Eq[log∏t=1Tq(xt|xt−1)pθ(xT)∏t=1Tpθ(xt−1|xt)]=Eq[−logpθ(xT)+∑t=1Tlogq(xt|xt−1)pθ(xt−1|xt)]=Eq[−logpθ(xT)+∑t=2Tlogq(xt|xt−1)pθ(xt−1|xt)+logq(x1|x0)pθ(x0|x1)]=Eq[−logpθ(xT)+∑t=2Tlog(q(xt−1|xt,x0)pθ(xt−1|xt)⋅q(xt|x0)q(xt−1|x0))+logq(x1|x0)pθ(x0|x1)]=Eq[−logpθ(xT)+∑t=2Tlogq(xt−1|xt,x0)pθ(xt−1|xt)+∑t=2Tlogq(xt|x0)q(xt−1|x0)+logq(x1|x0)pθ(x0|x1)]=Eq[−logpθ(xT)+∑t=2Tlogq(xt−1|xt,x0)pθ(xt−1|xt)+logq(xT|x0)q(x1|x0)+logq(x1|x0)pθ(x0|x1)]=Eq[logq(xT|x0)pθ(xT)+∑t=2Tlogq(xt−1|xt,x0)pθ(xt−1|xt)−logpθ(x0|x1)]=Eq[DKL(q(xT|x0)∥pθ(xT))⏟LT+∑t=2TDKL(q(xt−1|xt,x0)∥pθ(xt−1|xt))⏟Lt−1−logpθ(x0|x1)⏟L0]
Let’s label each component in the variational lower bound loss separately:
LVLB=LT+LT−1+⋯+L0where LT=DKL(q(xT|x0)∥pθ(xT))Lt=DKL(q(xt|xt+1,x0)∥pθ(xt|xt+1)) for 1≤t≤T−1L0=−logpθ(x0|x1)
Every KL term in LVLB (except for L0) compares two Gaussian distributions and therefore they can be computed in closed form. LT is constant and can be ignored during training because q has no learnable parameters and xT is a Gaussian noise. Ho et al. 2020 models L0 using a separate discrete decoder derived from N(x0;μθ(x1,1),Σθ(x1,1)).
Parameterization of Lt for Training Loss
Recall that we need to learn a neural network to approximate the conditioned probability distributions in the reverse diffusion process, pθ(xt−1|xt)=N(xt−1;μθ(xt,t),Σθ(xt,t)). We would like to train μθ to predict μ~t=1αt(xt−1−αt1−α¯tϵt). Because xt is available as input at training time, we can reparameterize the Gaussian noise term instead to make it predict ϵt from the input xt at time step t:
μθ(xt,t)=1αt(xt−1−αt1−α¯tϵθ(xt,t))Thus xt−1=N(xt−1;1αt(xt−1−αt1−α¯tϵθ(xt,t)),Σθ(xt,t))
The loss term Lt is parameterized to minimize the difference from μ~ :
Lt=Ex0,ϵ[12‖Σθ(xt,t)‖22‖μ~t(xt,x0)−μθ(xt,t)‖2]=Ex0,ϵ[12‖Σθ‖22‖1αt(xt−1−αt1−α¯tϵt)−1αt(xt−1−αt1−α¯tϵθ(xt,t))‖2]=Ex0,ϵ[(1−αt)22αt(1−α¯t)‖Σθ‖22‖ϵt−ϵθ(xt,t)‖2]=Ex0,ϵ[(1−αt)22αt(1−α¯t)‖Σθ‖22‖ϵt−ϵθ(α¯tx0+1−α¯tϵt,t)‖2]
Simplification
Empirically, Ho et al. (2020) found that training the diffusion model works better with a simplified objective that ignores the weighting term:
Ltsimple=Et∼[1,T],x0,ϵt[‖ϵt−ϵθ(xt,t)‖2]=Et∼[1,T],x0,ϵt[‖ϵt−ϵθ(α¯tx0+1−α¯tϵt,t)‖2]
The final simple objective is:
Lsimple=Ltsimple+C
where C is a constant not depending on θ.
Fig. 4. The training and sampling algorithms in DDPM (Image source: Ho et al. 2020)### Connection with noise-conditioned score networks (NCSN)
Song & Ermon (2019) proposed a score-based generative modeling method where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. The score of each sample x’s density probability is defined as its gradient ∇xlogq(x). A score network sθ:RD→RD is trained to estimate it, sθ(x)≈∇xlogq(x).
To make it scalable with high-dimensional data in the deep learning setting, they proposed to use either denoising score matching (Vincent, 2011) or sliced score matching (use random projections; Song et al., 2019). Denosing score matching adds a pre-specified small noise to the data q(x~|x) and estimates q(x~) with score matching.
Recall that Langevin dynamics can sample data points from a probability density distribution using only the score ∇xlogq(x) in an iterative process.
However, according to the manifold hypothesis, most of the data is expected to concentrate in a low dimensional manifold, even though the observed data might look only arbitrarily high-dimensional. It brings a negative effect on score estimation since the data points cannot cover the whole space. In regions where data density is low, the score estimation is less reliable. After adding a small Gaussian noise to make the perturbed data distribution cover the full space RD, the training of the score estimator network becomes more stable. Song & Ermon (2019) improved it by perturbing the data with the noise of different levels and train a noise-conditioned score network to jointly estimate the scores of all the perturbed data at different noise levels.
The schedule of increasing noise levels resembles the forward diffusion process. If we use the diffusion process annotation, the score approximates sθ(xt,t)≈∇xtlogq(xt). Given a Gaussian distribution x∼N(μ,σ2I), we can write the derivative of the logarithm of its density function as ∇xlogp(x)=∇x(−12σ2(x−μ)2)=−x−μσ2=−ϵσ where ϵ∼N(0,I). Recall that q(xt|x0)∼N(α¯tx0,(1−α¯t)I) and therefore,
sθ(xt,t)≈∇xtlogq(xt)=Eq(x0)[∇xtq(xt|x0)]=Eq(x0)[−ϵθ(xt,t)1−α¯t]=−ϵθ(xt,t)1−α¯t
Parameterization of βt
The forward variances are set to be a sequence of linearly increasing constants in Ho et al. (2020), from β1=10−4 to βT=0.02. They are relatively small compared to the normalized image pixel values between [−1,1]. Diffusion models in their experiments showed high-quality samples but still could not achieve competitive model log-likelihood as other generative models.
Nichol & Dhariwal (2021) proposed several improvement techniques to help diffusion models to obtain lower NLL. One of the improvements is to use a cosine-based variance schedule. The choice of the scheduling function can be arbitrary, as long as it provides a near-linear drop in the middle of the training process and subtle changes around t=0 and t=T.
βt=clip(1−α¯tα¯t−1,0.999)α¯t=f(t)f(0)where f(t)=cos(t/T+s1+s⋅π2)
where the small offset s is to prevent βt from being too small when close to t=0.
Fig. 5. Comparison of linear and cosine-based scheduling of β_t during training. (Image source: Nichol & Dhariwal, 2021)## Parameterization of reverse process variance Σθ
Ho et al. (2020) chose to fix βt as constants instead of making them learnable and set Σθ(xt,t)=σt2I , where σt is not learned but set to βt or β~t=1−α¯t−11−α¯t⋅βt. Because they found that learning a diagonal variance Σθ leads to unstable training and poorer sample quality.
Nichol & Dhariwal (2021) proposed to learn Σθ(xt,t) as an interpolation between βt and β~t by model predicting a mixing vector v :
Σθ(xt,t)=exp(vlogβt+(1−v)logβ~t)
However, the simple objective Lsimple does not depend on Σθ . To add the dependency, they constructed a hybrid objective Lhybrid=Lsimple+λLVLB where λ=0.001 is small and stop gradient on μθ in the LVLB term such that LVLB only guides the learning of Σθ. Empirically they observed that LVLB is pretty challenging to optimize likely due to noisy gradients, so they proposed to use a time-averaging smoothed version of LVLB with importance sampling.
Fig. 6. Comparison of negative log-likelihood of improved DDPM with other likelihood-based generative models. NLL is reported in the unit of bits/dim. (Image source: Nichol & Dhariwal, 2021)# Speed up Diffusion Model Sampling
It is very slow to generate a sample from DDPM by following the Markov chain of the reverse diffusion process, as T can be up to one or a few thousand steps. One data point from Song et al. 2020: “For example, it takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU.”
One simple way is to run a strided sampling schedule (Nichol & Dhariwal, 2021) by taking the sampling update every ⌈T/S⌉ steps to reduce the process from T to S steps. The new sampling schedule for generation is {τ1,…,τS} where τ1<τ2<⋯<τS∈[1,T] and S<T.
For another approach, let’s rewrite qσ(xt−1|xt,x0) to be parameterized by a desired standard deviation σt according to the nice property:
xt−1=α¯t−1x0+1−α¯t−1ϵt−1=α¯t−1x0+1−α¯t−1−σt2ϵt+σtϵ=α¯t−1x0+1−α¯t−1−σt2xt−α¯tx01−α¯t+σtϵqσ(xt−1|xt,x0)=N(xt−1;α¯t−1x0+1−α¯t−1−σt2xt−α¯tx01−α¯t,σt2I)
Recall that in q(xt−1|xt,x0)=N(xt−1;μ~(xt,x0),β~tI), therefore we have:
β~t=σt2=1−α¯t−11−α¯t⋅βt
Let σt2=η⋅β~t such that we can adjust η∈R+ as a hyperparameter to control the sampling stochasticity. The special case of η=0 makes the sampling process *deterministic*. Such a model is named the *denoising diffusion implicit model* (**DDIM**; Song et al., 2020). DDIM has the same marginal noise distribution but deterministically maps noise back to the original data samples.
During generation, we only sample a subset of S diffusion steps {τ1,…,τS} and the inference process becomes:
qσ,τ(xτi−1|xτt,x0)=N(xτi−1;α¯t−1x0+1−α¯t−1−σt2xτi−α¯tx01−α¯t,σt2I)
While all the models are trained with T=1000 diffusion steps in the experiments, they observed that DDIM (η=0) can produce the best quality samples when S is small, while DDPM (η=1) performs much worse on small S. DDPM does perform better when we can afford to run the full reverse Markov diffusion steps (S=T=1000). With DDIM, it is possible to train the diffusion model up to any arbitrary number of forward steps but only sample from a subset of steps in the generative process.
Fig. 7. FID scores on CIFAR10 and CelebA datasets by diffusion models of different settings, including DDIM (η=0) and DDPM (σ^). (Image source: Song et al., 2020)Compared to DDPM, DDIM is able to:
- Generate higher-quality samples using a much fewer number of steps.
- Have “consistency” property since the generative process is deterministic, meaning that multiple samples conditioned on the same latent variable should have similar high-level features.
- Because of the consistency, DDIM can do semantically meaningful interpolation in the latent variable.
Latent diffusion model (LDM; Rombach & Blattmann, et al. 2022) runs the diffusion process in the latent space instead of pixel space, making training cost lower and inference speed faster. It is motivated by the observation that most bits of an image contribute to perceptual details and the semantic and conceptual composition still remains after aggressive compression. LDM loosely decomposes the perceptual compression and semantic compression with generative modeling learning by first trimming off pixel-level redundancy with autoencoder and then manipulate/generate semantic concepts with diffusion process on learned latent.
Fig. 8. The plot for tradeoff between compression rate and distortion, illustrating two-stage compressions - perceptural and semantic comparession. (Image source: Rombach & Blattmann, et al. 2022)The perceptual compression process relies on an autoencoder model. An encoder E is used to compress the input image x∈RH×W×3 to a smaller 2D latent vector z=E(x)∈Rh×w×c , where the downsampling rate f=H/h=W/w=2m,m∈N. Then an decoder D reconstructs the images from the latent vector, x~=D(z). The paper explored two types of regularization in autoencoder training to avoid arbitrarily high-variance in the latent spaces.
- KL-reg: A small KL penalty towards a standard normal distribution over the learned latent, similar to VAE.
- VQ-reg: Uses a