Input Perturbation Reduces Exposure Bias in Diffusion Models
输入扰动降低扩散模型中的曝光偏差
Abstract
摘要
Denoising Diffusion Probabilistic Models have shown an impressive generation quality although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in auto regressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regu lari z ation, consisting in perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA $64\times64$ , we achieve a new state-of-theart FID score of 1.27, while saving $37.5%$ of the training time. The code is available at https: //github.com/forever208/DDPM-IP.
去噪扩散概率模型 (Denoising Diffusion Probabilistic Models) 虽然采样链较长导致计算成本高昂,但其生成质量令人印象深刻。本文发现长采样链还会引发误差累积现象,这与自回归文本生成中的曝光偏差问题类似。具体而言,我们注意到训练和测试之间存在差异:训练时以真实样本为条件,而测试时则以先前生成结果为条件。为缓解该问题,我们提出了一种简单有效的训练正则化方法,通过对真实样本添加扰动来模拟推理阶段的预测误差。实验表明,在不影响召回率和精确度的前提下,这种输入扰动能显著提升样本质量,同时减少训练和推理时间。例如在CelebA $64\times64$ 数据集上,我们以1.27的FID分数刷新了当前最佳记录,并节省了 $37.5%$ 的训练时间。代码已开源:https://github.com/forever208/DDPM-IP。
1. Introduction
1. 引言
Denoising Diffusion Probabilistic Models (DDPMs) (SohlDickstein et al., 2015; Ho et al., 2020) are a new generative paradigm which is attracting a growing interest due to its very high-quality sample generation capabilities (Dhariwal & Nichol, 2021; Nichol et al., 2022; Ramesh et al., 2022). Differently from most existing generative methods which synthesize a new sample in a single step, DDPMs resemble the Langevin dynamics (Welling & Teh, 2011) and the generation process is based on a sequence of denoising steps, in which a synthetic sample is created starting from pure noise and auto regressive ly reducing the noise component. In more detail, during training, a real sample $\pmb{x}{0}$ is progressively destroyed in $T$ steps adding Gaussian noise (forward process). The sequence ${\pmb x}{0},...,{\pmb x}{t},...,{\pmb x}{T}$ so obtained, is used to train a deep denoising auto encoder $(\mu(\cdot))$ to invert the forward process: $\hat{\mathbf{\pmbx}}{t-1}=\mu(\mathbf{\pmbx}{t},t)$ . At inference time, the generation process is auto regressive because it depends on the previously generated samples: $\hat{\mathbf{x}}{t-1}=\mu(\hat{\mathbf{x}}_{t},t)$ (Sec. 3).
去噪扩散概率模型 (DDPMs) (SohlDickstein et al., 2015; Ho et al., 2020) 是一种新兴的生成范式,因其卓越的样本生成质量而受到广泛关注 (Dhariwal & Nichol, 2021; Nichol et al., 2022; Ramesh et al., 2022)。与大多数现有单步生成方法不同,DDPMs 借鉴了朗之万动力学 (Welling & Teh, 2011) 思想,其生成过程基于一系列去噪步骤:从纯噪声出发,通过自回归方式逐步消除噪声分量来合成样本。具体而言,在训练阶段,真实样本 $\pmb{x}{0}$ 会通过 $T$ 步高斯噪声添加过程(前向过程)被逐步破坏。由此获得的序列 ${\pmb x}{0},...,{\pmb x}{t},...,{\pmb x}{T}$ 用于训练深度去噪自编码器 $(\mu(\cdot))$ 以逆转前向过程:$\hat{\mathbf{\pmbx}}{t-1}=\mu(\mathbf{\pmbx}{t},t)$。在推理阶段,生成过程具有自回归特性,因其依赖于前序生成样本:$\hat{\mathbf{x}}{t-1}=\mu(\hat{\mathbf{x}}_{t},t)$(第3节)。
Despite the large success of DDPMs in different generative fields (Sec. 2), one of the main drawbacks of these models is their very long computational time, which depends on the large number of steps $T$ required at both the training and the inference stage. As recently emphasised in (Xiao et al., 2022), the fundamental reason why $T$ needs to be large is that each denoising step is assumed to be Gaussian, and this assumption holds only for small step sizes. Conversely, with larger step sizes, the prediction network $(\mu(\cdot))$ needs to solve a harder problem and it becomes progressively less accurate (Xiao et al., 2022). However, in this paper, we observe that there is a second phenomenon, related to the sampling chain, but partially in contrast with the first, which is the accumulation of these errors over the $T$ inference sampling steps. This is basically due to the discrepancy between the training and the inference stage, in which the latter generates a sequence of samples based on the results of the previous steps, hence possibly accumulating errors. In fact, at training time, $\mu(\cdot)$ is trained with a ground truth pair $({\pmb x}{t},{\pmb x}{t-1})$ and, given $\pmb{x}{t}$ , it learns to reconstruct $\pmb{x}{t-1}$ $(\mu(\pmb{x}{t},t))$ . However, at inference time, $\mu(\cdot)$ has no access to the “real” $\pmb{x}{t}$ , and its prediction depends on the previously generated $\hat{\pmb x}{t}$ $(\mu(\hat{\pmb x}{t},t))$ . This input mismatch between $\mu(\pmb{x}{t},t)$ , used during training, and $\mu(\hat{\mathbf{\boldsymbol{x}}}_{t},t)$ , used during testing, is similar to the exposure bias problem (Ranzato et al., 2016; Schmidt, 2019) shared by other auto regressive generative methods. For example, Rennie et al. (2017) argue that training a network to maximize the likelihood of the next ground-truth word given the previous ground-truth word (called “Teacher-Forcing” (Bengio et al., 2015)) results in error accumulation at inference time, since the model has never been exposed to its own predictions.
尽管DDPM (Denoising Diffusion Probabilistic Models) 在不同生成领域取得了巨大成功(第2节),这些模型的主要缺点之一是其计算时间非常长,这取决于训练和推理阶段所需的大量步骤$T$。正如(Xiao et al., 2022)最近强调的,$T$需要很大的根本原因是假设每个去噪步骤都是高斯的,而这个假设仅适用于小步长。相反,随着步长增大,预测网络$(\mu(\cdot))$需要解决更困难的问题,其准确性会逐渐降低(Xiao et al., 2022)。然而,在本文中,我们观察到与采样链相关的第二个现象——这些误差在$T$个推理采样步骤中的累积,这与第一个现象部分相反。这主要是由于训练和推理阶段之间的差异造成的,后者基于前几步的结果生成一系列样本,因此可能累积误差。实际上,在训练时,$\mu(\cdot)$使用真实值对$({\pmb x}{t},{\pmb x}{t-1})$进行训练,给定$\pmb{x}{t}$,它学习重建$\pmb{x}{t-1}$$(\mu(\pmb{x}{t},t))$。但在推理时,$\mu(\cdot)$无法访问"真实"的$\pmb{x}{t}$,其预测依赖于先前生成的$\hat{\pmb x}{t}$$(\mu(\hat{\pmb x}{t},t))$。训练时使用的$\mu(\pmb{x}{t},t)$与测试时使用的$\mu(\hat{\mathbf{\boldsymbol{x}}}_{t},t)$之间的输入不匹配,类似于其他自回归生成方法共有的曝光偏差问题(Ranzato et al., 2016; Schmidt, 2019)。例如,Rennie et al. (2017)指出,训练网络在给定前一个真实单词的情况下最大化下一个真实单词的可能性(称为"教师强制"(Bengio et al., 2015))会导致推理时的误差累积,因为模型从未接触过自己的预测。
In this paper, we first empirically analyze this accumulation error phenomenon. For instance, we show that a standard DDPM (Dhariwal & Nichol, 2021), trained with $T$ steps, can generate better results using a number of inference steps $T^{\prime}<T$ (Sec. 6.2). A similar phenomenon was also observed by Nichol & Dhariwal (2021), but the authors did not provide an explanation for that. We believe that the reason for this apparently contrasting result is that while, on the one hand, longer chains can better satisfy the Gaussian assumption in the reverse diffusion process, on the other hand, they lead to a larger accumulation of errors.
本文首先对这一误差累积现象进行了实证分析。例如,我们发现标准DDPM (Dhariwal & Nichol, 2021) 在训练时使用 $T$ 步,却能在推理阶段通过更少的步数 $T^{\prime}<T$ 生成更好的结果 (第6.2节)。Nichol & Dhariwal (2021) 也观察到类似现象,但未给出合理解释。我们认为这种看似矛盾的结果源于:更长的扩散链虽能更好地满足逆向扩散过程的高斯假设,但同时也导致了更大的误差累积。
Second, in order to alleviate the exposure bias problem, we propose a surprisingly simple yet very effective method, which consists in explicitly modelling the prediction error during training. Specifically, at training time, we perturb $\pmb{x}{t}$ and we feed $\mu(\cdot)$ with a noisy version of $\pmb{x}{t}$ , this way simulating the training-inference discrepancy, and forcing the learned network to take into account possible inferencetime prediction errors. Note that our perturbation is different from the content-destroying forward process, because the new noise is not used in the ground truth prediction target (Sec. 5.2). The proposed method is a training regular iz ation which forces the network to smooth its prediction function: to solve the proposed task, two spatially close points $\pmb{x}{1}$ and $\pmb{x}{2}$ should lead to similar predictions $\mu(\pmb{x}{1},t)$ and $\mu(\pmb{x}_{2},t)$ This regular iz ation approach is similar to Mixup (Zhang et al., 2018) and the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2000), where a neighborhood around each sample in the training data is defined and then used to perturb that sample keeping fixed its target class label.
其次,为缓解曝光偏差问题,我们提出了一种简单却极其有效的方法:在训练期间显式建模预测误差。具体而言,训练时对$\pmb{x}{t}$施加扰动,将带噪版本的$\pmb{x}{t}$输入$\mu(\cdot)$,从而模拟训练-推理差异,迫使学习网络考虑推理阶段可能出现的预测误差。需注意我们的扰动不同于破坏性前向过程,因为新增噪声不会用于真实预测目标(见5.2节)。该方法通过训练正则化强制网络平滑其预测函数:为实现该目标,空间邻近点$\pmb{x}{1}$与$\pmb{x}{2}$应产生相似预测$\mu(\pmb{x}{1},t)$和$\mu(\pmb{x}_{2},t)$。此正则化思路类似于Mixup (Zhang et al., 2018) 和邻域风险最小化 (VRM) 原则 (Chapelle et al., 2000) ,两者均在训练数据中定义样本邻域,通过保持目标类别标签不变的方式扰动样本。
Third, we propose alternative solutions to the exposure bias problem for diffusion models, in which, rather than using input perturbation, we obtain a smoother prediction function $\mu(\cdot)$ by explicitly encouraging $\mu(\cdot)$ to be Lipschitz continuous (Sec. 5.4). The rationale behind this is that a Lipschitz continuous function $\mu(\cdot)$ generates small prediction differences between neighbouring points in its domain, leading to a DDPM which is more robust to the inference-time errors.
第三,我们针对扩散模型的曝光偏差问题提出了替代解决方案。不同于输入扰动方法,我们通过显式鼓励预测函数 $\mu(\cdot)$ 满足Lipschitz连续性(见第5.4节)来获得更平滑的预测函数。其原理在于:Lipschitz连续函数 $\mu(\cdot)$ 会在其定义域内相邻点之间产生较小的预测差异,从而使DDPM对推理阶段的误差更具鲁棒性。
Finally, we empirically analyse all the proposed solutions and we show that, despite being all effective for improving the final generation quality, input perturbation is both more efficient and more effective than the explicit minimization of the Lipschitz constant in DDPMs (Sec. 6.1). Moreover, directly perturbing the network input at training time has no additional training overhead and this solution is very easy to be reproduced and plugged into existing DDPM frameworks: it can be obtained with just two lines of code without any change in the network architecture or the loss function. We call our method Denoising Diffusion Probabilistic Models with Input Perturbation (DDPM-IP) and we show that it can significantly improve the generation quality of state-of-theart DDPMs (Dhariwal & Nichol, 2021; Song et al., 2021a) and speed up the inference-time sampling. For instance, on the CIFAR10 (Krizhevsky et al., 2009), the ImageNet $32\times32$ (Chrabaszcz et al., 2017), the LSUN $64\times64$ (Yu et al., 2015) and the FFHQ $128\times128$ (Karras et al., 2019) datasets, DDPM-IP, with only 80 sampling steps, generates lower FID scores than the state-of-the-art ADM (Dhariwal & Nichol, 2021) with 1,000 steps, corresponding to a more than $12.5\times$ sampling acceleration.
最后,我们实证分析了所有提出的解决方案,并表明尽管这些方法都能有效提升最终生成质量,但在去噪扩散概率模型 (DDPM) 中,输入扰动 (input perturbation) 比显式最小化 Lipschitz 常数更高效且更有效 (第 6.1 节)。此外,在训练时直接扰动网络输入不会产生额外的训练开销,该方案极易复现并嵌入现有 DDPM 框架:仅需两行代码即可实现,无需改变网络架构或损失函数。我们将该方法命名为带输入扰动的去噪扩散概率模型 (DDPM-IP),并证明其能显著提升前沿 DDPM (Dhariwal & Nichol, 2021; Song et al., 2021a) 的生成质量,同时加速推理时采样。例如在 CIFAR10 (Krizhevsky et al., 2009)、ImageNet $32\times32$ (Chrabaszcz et al., 2017)、LSUN $64\times64$ (Yu et al., 2015) 和 FFHQ $128\times128$ (Karras et al., 2019) 数据集上,DDPM-IP 仅用 80 次采样步骤就实现了比前沿 ADM (Dhariwal & Nichol, 2021) 1,000 步采样更低的 FID 分数,相当于超过 $12.5\times$ 的采样加速。
In summary, our contributions are:
总之,我们的贡献包括:
• We show that there is an exposure bias problem in DDPMs which has not been investigated so far. • To alleviate this problem, we propose different regularization methods whose common goal is to smooth the prediction function, and we specifically suggest input perturbation (DDPM-IP) as the best and the simplest of such solutions. • Using common benchmarks, we show that DDPM-IP can significantly improve the generation quality and drastically speed up both training and inference.
• 我们发现DDPM中存在一个尚未被研究的曝光偏差问题。
• 为缓解该问题,我们提出了多种正则化方法,其共同目标是平滑预测函数,并特别推荐输入扰动 (DDPM-IP) 作为其中最优且最简单的解决方案。
• 通过常用基准测试,我们证明DDPM-IP能显著提升生成质量,并大幅加速训练和推理过程。
2. Related Work
2. 相关工作
Diffusion models were introduced by Sohl-Dickstein et al. (2015) and later improved in (Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021b; Nichol & Dhariwal, 2021). More recently, Dhariwal & Nichol (2021) have shown that DDPMs can yield higher-quality images than Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Brock et al., 2018). Similarly to GANs, the generation process in DDPMs can be both unconditional and conditioned. For instance, GLIDE (Nichol et al., 2022) learns to generate images according to an input textual sentence. Differently from GLIDE, where the diffusion model is defined on the image space, DALL E-2 (Ramesh et al. (2022)) uses a DDPM to learn a prior distribution on the CLIP (Radford et al., 2021) space. Text-to-image generation is explored also in Stable Diffusion (Rombach et al., 2021) and Imagen (Saharia et al., 2022). Apart from images, DDPMs can also be used with categorical distributions (Hoogeboom et al., 2021; Gu et al., 2021), in an audio domain (Mittal et al., 2021; Chen et al., 2021), in time series forecasting (Rasul et al., 2021) and in other generative tasks (Yang et al., 2022; Croitoru et al., 2022). Differently from previous work, our goal is not to propose an application-specific prediction network, but rather to investigate the training-testing discrepancy of the DDPMs and propose a solution which can be used in different application fields and jointly with different denoising architectures.
扩散模型由Sohl-Dickstein等人 (2015) 提出,后经 (Song & Ermon, 2019; Ho等人, 2020; Song等人, 2021b; Nichol & Dhariwal, 2021) 改进。Dhariwal & Nichol (2021) 近期研究表明,DDPMs能生成比生成对抗网络 (GANs) (Goodfellow等人, 2014; Brock等人, 2018) 更高质量的图像。与GANs类似,DDPMs的生成过程可无条件或条件触发。例如GLIDE (Nichol等人, 2022) 能根据输入文本生成图像。不同于在图像空间定义扩散模型的GLIDE,DALL E-2 (Ramesh等人, 2022) 使用DDPM学习CLIP (Radford等人, 2021) 空间的先验分布。Stable Diffusion (Rombach等人, 2021) 和Imagen (Saharia等人, 2022) 也探索了文本到图像生成。除图像领域外,DDPMs还可应用于分类分布 (Hoogeboom等人, 2021; Gu等人, 2021)、音频领域 (Mittal等人, 2021; Chen等人, 2021)、时间序列预测 (Rasul等人, 2021) 及其他生成任务 (Yang等人, 2022; Croitoru等人, 2022)。与之前工作不同,我们的目标不是提出特定应用的预测网络,而是研究DDPMs训练与测试的差异,并提出适用于不同应用领域且能与多种去噪架构协同的解决方案。
Accelerating the DDPM training or reducing the number of sampling steps $T$ (Sec. 1) have been thoroughly investigated due to their practical implications. For instance, Song et al. (2021a) propose Denoising Diffusion Implicit Models (DDIMs), based on a non-Markovian diffusion process, which can use a number of inference sampling steps smaller than those used at training time, without retraining the network. Salimans & Ho (2022) propose to distil the prediction network into new networks which progressively reduce the number of sampling steps. However, the disadvantage is the need of training multiple networks. Rombach et al. (2021) speed up sampling by splitting the process into a compression stage and a generation stage, and applying the DDPM on the compressed (latent) space. Hoogeboom et al. (2022) present an order-agnostic DDPM, inspired by XLNet (Yang et al., 2019), in which the sequence $\pmb{x}{0},...,\pmb{x}_{T}$ is randomly permuted at training time, leading to a partially parallel i zed sampling process. Chen et al. (2021) found that, instead of conditioning the prediction network $(\mu(\cdot))$ on a discrete diffusion step $t$ , it is beneficial to condition $\mu(\cdot)$ on a continuous noise level. Similarly, Kong & Ping (2021) introduce continuous diffusion steps, resulting in a unified framework for fast sampling. In order to use larger size sampling steps and a non-Gaussian reverse process (Sec. 1) Xiao et al. (2022) include an adversarial loss in DDPMs and propose Denoising Diffusion GANs. Karras et al. (2022) suggest using Heun’s second-order deterministic sampling method, leading to high quality results and fast sampling. Xu et al. (2022) accelerate the generation process of continuous normalizing flow using a Poisson flow generative model. Our approach is orthogonal to these previous works, and it can potentially be used jointly with most of them.
加速DDPM训练或减少采样步数$T$(第1节)因其实际意义而被深入研究。例如,Song等人(2021a)提出了基于非马尔可夫扩散过程的去噪扩散隐式模型(DDIM),该模型可以在不重新训练网络的情况下,使用比训练时更少的推理采样步数。Salimans和Ho(2022)提出将预测网络蒸馏到新网络中,逐步减少采样步数,但缺点是需要训练多个网络。Rombach等人(2021)通过将过程分为压缩阶段和生成阶段,并在压缩(潜在)空间上应用DDPM来加速采样。Hoogeboom等人(2022)受XLNet(Yang等人,2019)启发,提出了一种顺序无关的DDPM,其中序列$\pmb{x}{0},...,\pmb{x}_{T}$在训练时随机排列,从而实现部分并行化的采样过程。Chen等人(2021)发现,将预测网络$(\mu(\cdot))$的条件设定为连续噪声水平而非离散扩散步$t$是有益的。类似地,Kong和Ping(2021)引入了连续扩散步,形成了一个快速采样的统一框架。为了使用更大尺寸的采样步和非高斯反向过程(第1节),Xiao等人(2022)在DDPM中加入了对抗损失,并提出了去噪扩散GAN。Karras等人(2022)建议使用Heun的二阶确定性采样方法,从而获得高质量结果和快速采样。Xu等人(2022)使用泊松流生成模型加速了连续归一化流的生成过程。我们的方法与这些先前工作是正交的,并且可以与它们中的大多数联合使用。
3. Background
3. 背景
Without loss of generality, we assume an image domain and we focus on DDPMs which define a diffusion process on the input space. Following (Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021), we assume that each pixel value is linearly scaled into $[-1,1]$ . Given a sample $\pmb{x}{0}$ from the data distribution $q(\pmb{x}{0})$ and a prefixed noise schedule $(\beta_{1},...,\beta_{T})$ , a DDPM defines the forward process as a Markov chain which starts from a real image ${\pmb x}{0}\sim q({\pmb x}_{0})$ and iterative ly adds Gaussian noise for $T$ diffusion steps:
不失一般性,我们假设一个图像域,并专注于在输入空间上定义扩散过程的DDPMs。根据 (Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021) 的研究,我们假设每个像素值被线性缩放至 $[-1,1]$。给定来自数据分布 $q(\pmb{x}{0})$ 的样本 $\pmb{x}{0}$ 和预定义的噪声调度 $(\beta_{1},...,\beta_{T})$,DDPM将前向过程定义为一条马尔可夫链,该链从真实图像 ${\pmb x}{0}\sim q({\pmb x}_{0})$ 开始,并迭代地添加高斯噪声进行 $T$ 步扩散:
$$
q(\pmb{x}{t}|\pmb{x}{t-1})=\mathcal{N}(\pmb{x}{t};\sqrt{1-\beta_{t}}\pmb{x}{t-1},\beta_{t}\pmb{I}),
$$
$$
q(\pmb{x}{t}|\pmb{x}{t-1})=\mathcal{N}(\pmb{x}{t};\sqrt{1-\beta_{t}}\pmb{x}{t-1},\beta_{t}\pmb{I}),
$$
$$
q(\pmb{x}{1:T}|\pmb{x}{0})=\prod_{t=1}^{T}q(\pmb{x}{t}|\pmb{x}_{t-1}),
$$
$$
q(\pmb{x}{1:T}|\pmb{x}{0})=\prod_{t=1}^{T}q(\pmb{x}{t}|\pmb{x}_{t-1}),
$$
until obtaining a completely noisy image $\pmb{x}_{T}\sim\mathcal{N}(\mathbf{0},I)$ . On the other hand, the reverse process is defined by transition probabilities parameterized by $\pmb\theta$ :
在获得完全噪声图像 $\pmb{x}_{T}\sim\mathcal{N}(\mathbf{0},I)$ 之前。另一方面,反向过程由参数化转移概率 $\pmb\theta$ 定义:
$$
p_{\pmb\theta}(\pmb x_{t-1}|\pmb x_{t})=\mathcal N(\pmb x_{t-1};\mu_{\pmb\theta}(\pmb x_{t},t),\sigma_{t}\pmb I),
$$
$$
p_{\pmb\theta}(\pmb x_{t-1}|\pmb x_{t})=\mathcal N(\pmb x_{t-1};\mu_{\pmb\theta}(\pmb x_{t},t),\sigma_{t}\pmb I),
$$
where σt = 1βt withat =I=a anda =1-β Given $\mathbf{\boldsymbol{\mathbf{\mathit{x}}}}{0},\mathbf{\boldsymbol{\mathbf{\mathit{x}}}}_{t}$ can be obtained (H o et al., 2020) by:
给定 $\mathbf{\boldsymbol{\mathbf{\mathit{x}}}}{0},\mathbf{\boldsymbol{\mathbf{\mathit{x}}}}_{t}$ 可通过以下方式获得 (Ho et al., 2020) :
其中 σt = 1βt ,且 at =I=a ,a =1-β
$$
\begin{array}{r}{\pmb{x}{t}=\sqrt{\bar{\alpha}{t}}\pmb{x}{0}+\sqrt{1-\bar{\alpha}_{t}}\pmb{\epsilon},}\end{array}
$$
$$
\begin{array}{r}{\pmb{x}{t}=\sqrt{\bar{\alpha}{t}}\pmb{x}{0}+\sqrt{1-\bar{\alpha}_{t}}\pmb{\epsilon},}\end{array}
$$
where $\epsilon$ is a noise vector $(\epsilon\sim\mathcal{N}(\mathbf{0},I))$ . Instead of predicting the mean of the forward process posterior (i.e., $\hat{\pmb x}{t-1}=\mu_{\pmb\theta}(\pmb x_{t},t))$ , Ho et al. (2020) propose to use a network $\epsilon_{\pmb{\theta}}(\cdot)$ which predicts the noise vector (e). Using $\epsilon_{\pmb{\theta}}(\cdot)$ and a simple $L_{2}$ loss function, the training objective becomes:
其中 $\epsilon$ 是噪声向量 $(\epsilon\sim\mathcal{N}(\mathbf{0},I))$。Ho 等人 (2020) 没有选择预测前向过程后验分布的均值 (即 $\hat{\pmb x}{t-1}=\mu_{\pmb\theta}(\pmb x_{t},t))$,而是提出使用网络 $\epsilon_{\pmb{\theta}}(\cdot)$ 来预测噪声向量 (e)。通过采用 $\epsilon_{\pmb{\theta}}(\cdot)$ 和简单的 $L_{2}$ 损失函数,训练目标变为:
$$
\begin{array}{r}{L(\pmb\theta)=\mathbb{E}{\pmb x_{0}\sim q(\pmb x_{0}),\pmb\epsilon\sim\mathcal{N}(\pmb0,I),t\sim\mathbb{U}({1,...,T})}[||\pmb\epsilon-\pmb\epsilon_{\theta}(\pmb x_{t},t)||^{2}].}\end{array}
$$
$$
\begin{array}{r}{L(\pmb\theta)=\mathbb{E}{\pmb x_{0}\sim q(\pmb x_{0}),\pmb\epsilon\sim\mathcal{N}(\pmb0,I),t\sim\mathbb{U}({1,...,T})}[||\pmb\epsilon-\pmb\epsilon_{\theta}(\pmb x_{t},t)||^{2}].}\end{array}
$$
Note that, in Eq. 5, $\pmb{x}{t}$ and $\epsilon$ are ground-truth terms, while $\epsilon_{\pmb{\theta}}(\pmb{x}_{t},t)$ is the network prediction. Using Eq. 5, the training and the sampling algorithms are described in Alg. 1-2, respectively.
请注意,在公式5中,$\pmb{x}{t}$和$\epsilon$是真实值项,而$\epsilon_{\pmb{\theta}}(\pmb{x}_{t},t)$是网络预测值。根据公式5,训练和采样算法分别在算法1-2中描述。
Algorithm 1 DDPM Standard Training
算法 1 DDPM 标准训练
Algorithm 2 DDPM Standard Sampling
算法 2: DDPM标准采样
4. Exposure Bias Problem in Diffusion Models
4. 扩散模型中的曝光偏差问题
Comparing line 4 of Alg. 1 with line 4 of Alg. 2, we note that the inputs of the prediction network $\epsilon_{\theta}(\cdot)$ are different between the training and the inference phase. Concretely, at training time, standard DDPMs use $\epsilon_{\pmb{\theta}}(\pmb{x}{t},t)$ , where $\pmb{x}{t}$ is a ground truth sample (Eq. 4). In contrast, at inference time, they use $\pmb{\epsilon}{\pmb{\theta}}(\hat{\pmb{x}}{t},t))$ , where $\hat{\pmb x}{t}$ is computed based on the output of $\epsilon_{\theta}(\cdot)$ at the previous sampling step ${\mathbf{t}+1}$ . As mentioned in Sec. 1, this leads to a training-inference discrepancy, which is similar to the exposure bias problem observed, e.g., in text generation models, in which the training generation is conditioned on a ground-truth sentence, while the testing generation is conditioned on the previously generated words (Ranzato et al., 2016; Schmidt, 2019; Ren- nie et al., 2017; Bengio et al., 2015). In order to quantify the error accumulation with respect to the number of inference sampling steps, we use a simple experiment in which we start from a (randomly selected) real image $\pmb{x}{0}$ , we compute $\pmb{x}_{t}$ using Eq. 4, and then apply the reverse process (Alg. 2)
将算法1的第4行与算法2的第4行进行对比时,我们注意到预测网络 $\epsilon_{\theta}(\cdot)$ 的输入在训练阶段和推理阶段存在差异。具体而言,在训练时,标准DDPM使用 $\epsilon_{\pmb{\theta}}(\pmb{x}{t},t)$ ,其中 $\pmb{x}{t}$ 是真实样本(公式4);而在推理时,它们使用 $\pmb{\epsilon}{\pmb{\theta}}(\hat{\pmb{x}}{t},t))$ ,其中 $\hat{\pmb x}{t}$ 是根据前一个采样步 ${\mathbf{t}+1}$ 中 $\epsilon_{\theta}(\cdot)$ 的输出计算得到的。如第1节所述,这会导致训练与推理之间的不一致性,类似于文本生成模型中观察到的曝光偏差问题(Ranzato等人,2016;Schmidt,2019;Rennie等人,2017;Bengio等人,2015),即训练生成基于真实句子,而测试生成基于先前生成的词语。为了量化推理采样步数相关的误差累积,我们设计了一个简单实验:从(随机选择的)真实图像 $\pmb{x}{0}$ 出发,通过公式4计算 $\pmb{x}_{t}$ ,再执行逆向过程(算法2)。
starting from $\pmb{x}{t}$ instead of a random ${\pmb x}{T}$ . This way, when $t$ is small enough, the network should be able to “recover” the path to $\pmb{x}{0}$ (the denoising task is easier). We quantify the total error accumulated in $t$ reverse diffusion steps by comparing the difference between the ground truth distribution $q(\pmb{x}{0})$ and the predicted distribution $q(\hat{\pmb x}_{0})$ using the FID scores in Tab. 1. The experiment was done using ADM (Dhariwal & Nichol, 2021) (trained with $T=1,000;$ ) and ImageNet $32\times32$ , and we compute the FID scores using $50\mathrm{k\Omega}$ samples. Tab. 1 (first row) shows that the longer the reverse process, the higher the FID scores, indicating the existence of an error accumulation which is larger with larger values of $t$ . In Appendix 5, we repeat this experiment using deterministic sampling, which quantifies the error accumulation removing the randomness from the sampling process.
从 $\pmb{x}{t}$ 而非随机 ${\pmb x}{T}$ 开始。这样当 $t$ 足够小时,网络应能"恢复"到 $\pmb{x}{0}$ 的路径(去噪任务更容易)。我们通过比较真实分布 $q(\pmb{x}{0})$ 与预测分布 $q(\hat{\pmb x}_{0})$ 之间的差异,使用表1中的FID分数量化了 $t$ 步反向扩散过程中累积的总误差。实验采用ADM (Dhariwal & Nichol, 2021)(训练时 $T=1,000;$)和ImageNet $32\times32$ 数据集,使用 $50\mathrm{k\Omega}$ 样本计算FID分数。表1(首行)显示反向过程越长,FID分数越高,表明存在误差累积现象,且 $t$ 值越大误差累积越显著。附录5中我们使用确定性采样重复该实验,通过消除采样过程中的随机性来量化误差累积。
Table 1. An empirical estimate of the exposure bias on ImageNet $32\times32$ .
表 1: ImageNet $32\times32$ 上曝光偏差的经验估计
模型 | 反向扩散步数 | ||||
---|---|---|---|---|---|
100 | 300 | 500 | 700 | 1,000 | |
ADM | 0.983 | 1.808 | 2.587 | 3.105 | 3.544 |
ADM-IP (ours) | 0.972 | 1.594 | 2.198 | 2.539 | 2.742 |
Finally, in Tab. 3 we will report the FID scores of ADM on different datasets, which show that most of the best results are obtained in the range from 100 to 300 sampling steps, despite all the models have been trained with 1,000 diffusion steps. These results confirm previous similar observations (Nichol & Dhariwal, 2021), and we believe that the reason for this apparently counter intuitive phenomenon, in which fewer sampling steps lead to a better generation quality, is due to the exposure bias problem. Indeed, while more sampling steps correspond to a reverse process which can be more easily approximated with a Gaussian distribution (Sec. 1), longer sampling trajectories produce a larger accumulation of the prediction errors. Hence, the range [100, 300] leads to a better generation quality because it presumably trades off these two opposing aspects.
最后,在表 3 中我们将报告 ADM 在不同数据集上的 FID 分数,结果显示尽管所有模型都经过 1,000 次扩散步骤训练,但最佳结果大多出现在 100 到 300 次采样步骤的范围内。这些结果验证了先前类似的观察 [20],我们认为这种看似反直觉的现象(即更少的采样步骤反而带来更好的生成质量)是由曝光偏差问题导致的。实际上,虽然更多采样步骤对应着一个更容易用高斯分布近似的逆向过程(第 1 节),但更长的采样轨迹会导致预测误差的更大累积。因此,[100, 300] 这个范围能实现更好的生成质量,可能是因为它在这两个对立因素之间取得了平衡。
5. Method
5. 方法
5.1. Regular iz ation with Input Perturbation
5.1. 基于输入扰动的正则化
The solution we propose to alleviate the exposure bias problem is very simple: we explicitly model the prediction error using a Gaussian input perturbation at training time. More specifically, we assume that the error of the prediction network in the reverse process at time $t+1$ is normally distributed with respect to the ground-truth input $\pmb{x}{t}$ (see Sec. 5.3). This is simulated using a second, dedicated random noise vector ${\pmb\xi}\sim\mathcal{N}({\bf0},{\pmb I})$ , using which, we create a perturbed version $({\pmb y}{t})$ of $\pmb{x}_{t}$ :
我们提出的缓解曝光偏差问题的解决方案非常简单:在训练时使用高斯输入扰动显式建模预测误差。具体来说,我们假设反向过程中时间步$t+1$的预测网络误差相对于真实输入$\pmb{x}{t}$呈正态分布(参见第5.3节)。这通过第二个专用随机噪声向量${\pmb\xi}\sim\mathcal{N}({\bf0},{\pmb I})$进行模拟,并由此生成$\pmb{x}{t}$的扰动版本$({\pmb y}_{t})$:
$$
\begin{array}{r}{\pmb{y}{t}=\sqrt{\bar{\alpha}{t}}\pmb{x}{0}+\sqrt{1-\bar{\alpha}{t}}(\pmb{\epsilon}+\gamma_{t}\pmb{\xi}).}\end{array}
$$
$$
\begin{array}{r}{\pmb{y}{t}=\sqrt{\bar{\alpha}{t}}\pmb{x}{0}+\sqrt{1-\bar{\alpha}{t}}(\pmb{\epsilon}+\gamma_{t}\pmb{\xi}).}\end{array}
$$
For simplicity, we use a uniform noise schedule for $\xi$ by setting $\gamma_{0}=...=\gamma_{T}=\gamma $ . In fact, although selecting the best noise schedule $(\beta_{1},...,\beta_{T})$ in DDPMs is usually very important to get high-quality results (Ho et al., 2020; Chen et al., 2021), it is nevertheless an expensive hyper parameter tuning operation (Chen et al., 2021). Therefore, to avoid adding a second noise schedule $(\gamma_{0},...,\gamma_{T})$ to the training procedure, we opted for a simpler (although most likely suboptimal) solution, in which $\gamma_{t}$ does not vary depending on $t$ (more details in Sec. 5.3). In Alg. 3 we show the proposed training algorithm, in which $\pmb{x}_{t}$ is replaced by $\pmb{y}_{t}$ . In contrast, at inference time, we use Alg. 2 without any change.
为简化操作,我们对$\xi$采用统一的噪声调度策略,设定$\gamma_{0}=...=\gamma_{T}=\gamma$。事实上,虽然在DDPM中选择最优噪声调度$(\beta_{1},...,\beta_{T})$对获得高质量结果至关重要 (Ho et al., 2020; Chen et al., 2021) ,但这通常需要耗费大量资源进行超参数调优 (Chen et al., 2021) 。为避免在训练过程中引入第二个噪声调度$(\gamma_{0},...,\gamma_{T})$,我们选择了一种更简单(尽管很可能非最优)的方案,即$\gamma_{t}$不随$t$变化(详见第5.3节)。算法3展示了改进后的训练流程,其中$\pmb{x}_{t}$被替换为$\pmb{y}_{t}$。而在推理阶段,我们直接使用未经修改的算法2。
Algorithm 3 DDPM-IP: Training with input perturbation
算法 3 DDPM-IP: 带输入扰动的训练
| 1: 重复 |
| 2: o ~ q(xo), t ~ U({1, ., T}) |
| 3: ∈ ~ N(0, 1), ≤ ~ N(0, I) |
| 4: 使用公式 6 计算 yt |
| 5: 对 Velle - eo(yt, t)I|2 执行梯度下降步骤 |
| 6: 直到收敛 |
5.2. Discussion
5.2. 讨论
In this section, we analyze the difference between Alg. 3 and Alg. 1. Specifically, in line 5 of Alg. 3, we use $\pmb{y}{t}$ as the input of the prediction network $\epsilon_{\theta}(\cdot)$ but we keep using $\epsilon$ as the regression target. In other words, the new noise term $(\pmb{\xi})$ we introduce is used asymmetrically, because it is applied to the input but not to the prediction target (e). For this reason, Alg. 3 is not equivalent to choose a different value of $\epsilon$ in Alg. 1, where e is instead used symmetrically both in the forward process (Eq. 4) and as the target of the prediction network (line 4 of Alg. 1).
在本节中,我们分析算法3与算法1的区别。具体而言,在算法3的第5行,我们使用 $\pmb{y}{t}$ 作为预测网络 $\epsilon_{\theta}(\cdot)$ 的输入,但依然将 $\epsilon$ 作为回归目标。换句话说,我们引入的新噪声项 $(\pmb{\xi})$ 是以非对称方式使用的,因为它仅作用于输入端而非预测目标端 (e)。因此,算法3并不等同于在算法1中选择不同的 $\epsilon$ 值——在算法1中,e 在前向过程 (公式4) 和预测网络目标端 (算法1第4行) 是被对称使用的。
This difference is schematically illustrated in Fig. 1, where, for both Alg. 1 (i.e., DDPM) and Alg. 3 (DDPM-IP), we show the corresponding pairs of input and target vectors of the prediction network (respectively, $(\pmb{x}{t},\pmb{\epsilon})$ and $\left(\pmb{y}{t},\pmb{\epsilon}\right),$ ). In the same figure, we also show a second version of Alg. 1 (called DDPM $y$ ), where we use the standard training protocol (Alg. 1) but change the noise variance in order to adhere to the same distribution generating $\pmb{y}{t}$ . In fact, it can be easy shown that ${\pmb y}_{t}$ in Alg. 3 is generated using the following distribution (see Appendix A.2 for a proof):
图 1: 展示了这种差异的示意图,其中对于算法 1 (即 DDPM) 和算法 3 (DDPM-IP),我们分别展示了预测网络对应的输入和目标向量对 (分别为 $(\pmb{x}{t},\pmb{\epsilon})$ 和 $\left(\pmb{y}{t},\pmb{\epsilon}\right)$ )。在同一图中,我们还展示了算法 1 的第二个版本 (称为 DDPM $y$ ),其中我们使用标准训练协议 (算法 1) 但改变了噪声方差以遵循生成 $\pmb{y}{t}$ 的相同分布。实际上,可以很容易地证明算法 3 中的 ${\pmb y}_{t}$ 是使用以下分布生成的 (证明见附录 A.2):
$$
q({\pmb y}{t}|{\pmb x}{0})=\mathcal{N}({\pmb y}{t};\sqrt{\bar{\alpha}{t}}{\pmb x}{0},(1-\bar{\alpha}_{t})(1+\gamma^{2}){\pmb I}).
$$
$$
q({\pmb y}{t}|{\pmb x}{0})=\mathcal{N}({\pmb y}{t};\sqrt{\bar{\alpha}{t}}{\pmb x}{0},(1-\bar{\alpha}_{t})(1+\gamma^{2}){\pmb I}).
$$
Hence, we can obtain the same input noise distribution of Alg. 3 in Alg. 1 using $\epsilon^{\prime}\sim\mathcal{N}(0,I)$ and:
因此,我们可以通过使用 $\epsilon^{\prime}\sim\mathcal{N}(0,I)$ 以及以下方式,在算法 1 中获得与算法 3 相同的输入噪声分布:
$$
\begin{array}{r}{{\pmb y}{t}=\sqrt{\bar{\alpha}{t}}{\pmb x}{0}+\sqrt{1-\bar{\alpha}_{t}}\sqrt{1+\gamma^{2}}{\pmb\epsilon}^{\prime}.}\end{array}
$$
$$
\begin{array}{r}{{\pmb y}{t}=\sqrt{\bar{\alpha}{t}}{\pmb x}{0}+\sqrt{1-\bar{\alpha}_{t}}\sqrt{1+\gamma^{2}}{\pmb\epsilon}^{\prime}.}\end{array}
$$
We call DDPM $\cdot y$ the version of Alg. 1 with this new noise distribution. DDPM $y$ is obtained from Alg. 1 using Eq. 8 in line 3 and replacing $\pmb{x}{t}$ with $\pmb{y}{t}$ and $\epsilon$ with $\epsilon^{\prime}$ in line 4. However, note that, for a given $\pmb{y}{t}$ , if $\pmb{\xi}\neq\mathbf{0}$ , then $\epsilon\neq\epsilon^{\prime}$ (see Fig. 1), thus, DDPM-IP and DDPM $y$ share the same input to $\epsilon_{\pmb{\theta}}(\cdot)$ , but they use different targets. In Appendix A.3, we empirically show that DDPM $_y$ is even worse than the standard DDPM.
我们将采用这种新噪声分布的算法1版本称为DDPM $\cdot y$。DDPM $y$ 是通过在算法1第3行使用公式8,并在第4行将 $\pmb{x}{t}$ 替换为 $\pmb{y}{t}$、$\epsilon$ 替换为 $\epsilon^{\prime}$ 得到的。但需注意,对于给定的 $\pmb{y}{t}$,若 $\pmb{\xi}\neq\mathbf{0}$,则 $\epsilon\neq\epsilon^{\prime}$ (见图1)。因此,DDPM-IP与DDPM $y$ 虽然共享 $\epsilon_{\pmb{\theta}}(\cdot)$ 的相同输入,但二者采用不同的训练目标。附录A.3通过实验表明,DDPM $_y$ 的性能甚至逊于标准DDPM。
Intuitively, the proposed training protocol, DDPM-IP, decouples the noise vector $\epsilon^{\prime}$ actually generating $\pmb{y}{t}$ from the ground truth target vector $\epsilon$ which is asked to be predicted by $\epsilon_{\theta}(\cdot)$ . In order to solve this problem, $\epsilon_{\theta}(\cdot)$ needs to smooth its prediction function, reducing the difference between $\epsilon_{\pmb{\theta}}(\pmb{x}{t},t)$ and $\epsilon_{\pmb{\theta}}({\pmb y}_{t},t)$ , and this leads to a training regular iz ation which is similar to VRM (Sec. 1).
直观上,所提出的训练协议 DDPM-IP 将实际生成 $\pmb{y}{t}$ 的噪声向量 $\epsilon^{\prime}$ 与真实目标向量 $\epsilon$ 解耦,后者需要由 $\epsilon_{\theta}(\cdot)$ 预测。为解决这一问题,$\epsilon_{\theta}(\cdot)$ 需要平滑其预测函数,减小 $\epsilon_{\pmb{\theta}}(\pmb{x}{t},t)$ 与 $\epsilon_{\pmb{\theta}}({\pmb y}_{t},t)$ 之间的差异,这会产生类似于 VRM (第1节) 的训练正则化效果。
Figure 1. The inputs and the prediction targets are different in vanilla DDPM, DDPM-IP and DDPM- $\cdot y$ .
图 1: 原始 DDPM、DDPM-IP 和 DDPM- $\cdot y$ 的输入与预测目标存在差异。
5.3. Estimating the Prediction Error
5.3. 预测误差估计
In this section, we analyze the actual prediction error of $\epsilon_{\theta}(\cdot)$ and we use this analysis to choose the value of $\gamma$ in Eq. 6. Analogously to Sec. 4, we use ADM, trained using the standard algorithm Alg. 1 and two datasets: CIFAR10 and ImageNet $32\times32$ . At testing time, for a given $t$ and $\hat{\pmb{\epsilon}}=\pmb{\epsilon}{\pmb{\theta}}(\hat{\pmb{x}}{t},t)$ , we replace e with e in Eq. 4 and we compute the predicted $\hat{\pmb x}{0}$ . Finally, the prediction error at time $t$ is $\pmb{e}{t}=\hat{\pmb x}{0}-\pmb x_{0}$ . Note that using $\hat{\pmb x}{0}$ and $\pmb{x}{0}$ to estimate the error instead of comparing $\hat{\pmb x}{t}$ and $\pmb{x}{t}$ , has the advantage that the former is independent of scaling factors $(\sqrt{1-\bar{\alpha}{t}})$ and, thus, it makes the statistical analysis easier. Using different values of $t$ , uniformly selected in ${1,...,T}$ , we empirically verified that, for a given $t$ , $e_{t}$ is normally distributed: $e_{t}\sim$ $\mathcal{N}(\mathbf{0},\nu_{t}^{2}I)$ , with standard deviation $\nu_{t}$ (see Appendix A.5).
在本节中,我们分析 $\epsilon_{\theta}(\cdot)$ 的实际预测误差,并利用该分析为公式6中的 $\gamma$ 取值。与第4节类似,我们采用标准算法Alg.1训练的ADM模型,在CIFAR10和ImageNet $32\times32$ 两个数据集上进行测试。给定时间步 $t$ 和噪声估计 $\hat{\pmb{\epsilon}}=\pmb{\epsilon}{\pmb{\theta}}(\hat{\pmb{x}}{t},t)$ 时,我们将公式4中的e替换为e来计算预测值 $\hat{\pmb x}{0}$ ,最终得到时间步 $t$ 的预测误差 $\pmb{e}{t}=\hat{\pmb x}{0}-\pmb x_{0}$ 。需要注意的是,相较于直接比较 $\hat{\pmb x}{t}$ 和 $\pmb{x}{t}$ ,采用 $\hat{\pmb x}{0}$ 与 $\pmb{x}{0}$ 计算误差的优势在于前者不受缩放因子 $(\sqrt{1-\bar{\alpha}{t}})$ 影响,从而简化统计分析。通过在 ${1,...,T}$ 中均匀采样不同 $t$ 值进行实证验证,我们发现对于特定 $t$ 值,误差 $e_{t}$ 服从正态分布: $e_{t}\sim$ $\mathcal{N}(\mathbf{0},\nu_{t}^{2}I)$ ,其标准差为 $\nu_{t}$ (详见附录A.5)。
In Fig. 2 we plot the value of $\nu_{t}$ with respect to $t$ . The two curves corresponding to the two datasets are surprisingly close to each other. In principle, we could use this empirical analysis and set $\gamma_{t}=\nu_{t}$ in Eq. 6. In this way, when we perturb the input to $\epsilon_{\theta}(\cdot)$ , we empirically imitate its actual prediction error which is the base of the exposure bias problem. However, this choice would require a two-step training: first, using Alg. 1 to train the base model and empirically estimate $\nu_{t}$ for different $t$ . Then, using Alg. 3 with the estimated $\gamma_{t}$ schedule to retrain the model from scratch. To avoid this and make the whole procedure as simple as possible, we simply use a constant value $\gamma$ , independently of $t$ . This value was empirically set using a grid search on both CIFAR10 and ImageNet $32\times32$ on a small range of values covering the last half of the sampling trajectory. Specifically, we investigated the range $\nu_{t}\in[0,\mathbb{E}{t}[\nu_{t}]]=[0,0.2]$ (see Fig. 2), which was chosen following Karras et al. (2022), who showed that the last part of the inference trajectory has usually the largest impact on the Diffusion Model performance. We finally set $\gamma=0.1$ and, in the rest of this paper, we always use a constant $\gamma=0.1$ , regardless of the dataset and the baseline DDPM. Although a DDPM-specific $\gamma$ value would most likely lead to better quality results, we prefer to emphasise the ease of use of our proposal which does not depend on any other hyper parameter.
在图 2 中,我们绘制了 $\nu_{t}$ 随 $t$ 变化的值。两条对应不同数据集的曲线出奇地接近。原则上,我们可以利用这一实证分析,在公式 6 中设 $\gamma_{t}=\nu_{t}$。这样,当我们对 $\epsilon_{\theta}(\cdot)$ 的输入施加扰动时,就能通过经验模拟其实际预测误差——这正是曝光偏差问题的根源。但该方案需要分两步训练:首先使用算法 1 训练基础模型并经验性估算不同 $t$ 对应的 $\nu_{t}$;随后采用算法 3 配合估算出的 $\gamma_{t}$ 调度表从头开始重新训练模型。为简化流程,我们直接采用与 $t$ 无关的常量 $\gamma$。该值通过在 CIFAR10 和 ImageNet $32\times32$ 数据集上对采样轨迹后半段进行小范围网格搜索确定,具体研究区间为 $\nu_{t}\in[0,\mathbb{E}{t}[\nu_{t}]]=[0,0.2]$ (见图 2)。该区间选择遵循了 Karras 等人 (2022) 的研究结论——推理轨迹后半段通常对扩散模型性能影响最大。我们最终设定 $\gamma=0.1$,并在本文后续部分始终采用该常量值,不随数据集和基线 DDPM 变化。虽然针对特定 DDPM 调整 $\gamma$ 可能获得更优结果,但我们更强调本方案的易用性——它不依赖任何其他超参数。
Figure 2. The inference time standard deviation $\nu_{t}$ of the prediction error of a pre-trained network with respect to the sampling step $t$ . The mean of the blue and the orange curve is 0.20 and 0.19, respectively.
图 2: 预训练网络预测误差的推理时间标准差 $\nu_{t}$ 随采样步长 $t$ 的变化关系。蓝色和橙色曲线的均值分别为 0.20 和 0.19。
5.4. Regular iz ation based on Lipschitz Continuous Functions
5.4. 基于Lipschitz连续函数的正则化
In this section, we propose two alternative solutions to the exposure bias problem which can help to better investigate the phenomenon. The goal is the same as in Sec. 5.1, i.e., we want to smooth the prediction function $\epsilon_{\theta}(\pmb{x}{t},t)$ to make it more robust with respect to local variations of $\pmb{x}{t}$ which are due to the inference-time prediction errors. To do so, instead of using input perturbation, we explicitly encourage $\epsilon_{\theta}(\cdot)$ to be Lipschitz continuous, i.e. to satisfy:
在本节中,我们针对曝光偏差问题提出了两种替代解决方案,以帮助更好地研究该现象。目标与第5.1节相同,即我们希望平滑预测函数 $\epsilon_{\theta}(\pmb{x}{t},t)$ ,使其对由推理时预测误差引起的 $\pmb{x}{t}$ 局部变化更具鲁棒性。为此,我们不再使用输入扰动,而是显式地鼓励 $\epsilon_{\theta}(\cdot)$ 满足Lipschitz连续性条件,即满足:
$$
||\epsilon_{\pmb\theta}({\pmb x},t)-\epsilon_{\pmb\theta}({\pmb y},t)||\leq K||{\pmb x}-{\pmb y}||,\forall({\pmb x},{\pmb y})
$$
$$
||\epsilon_{\pmb\theta}({\pmb x},t)-\epsilon_{\pmb\theta}({\pmb y},t)||\leq K||{\pmb x}-{\pmb y}||,\forall({\pmb x},{\pmb y})
$$
for a small constant $K$ . We implement this idea using two standard Lipschitz constant minimization methods: gradient penalty (Rifai et al., 2011; Gulrajani et al., 2017) and weight decay (Krogh & Hertz, 1991; Miyato et al., 2018). In both cases we do not perturb the input of $\epsilon_{\pmb{\theta}}(\cdot)$ , and we use the original training algorithm (Alg. 1), with the only difference being the loss function used in line 4, where the $L_{2}$ loss is used jointly with a regular iz ation term described below.
对于一个小的常数 $K$。我们通过两种标准的Lipschitz常数最小化方法实现这一思路:梯度惩罚 (Rifai et al., 2011; Gulrajani et al., 2017) 和权重衰减 (Krogh & Hertz, 1991; Miyato et al., 2018)。在这两种情况下,我们都不扰动 $\epsilon_{\pmb{\theta}}(\cdot)$ 的输入,并且使用原始训练算法 (算法1),唯一的区别在于第4行使用的损失函数,其中 $L_{2}$ 损失与下述的正则化项联合使用。
Gradient penalty. In this case, the regular iz ation is based on the Frobenius norm of the Jacobian matrix (Rifai et al., 2011; Goodfellow et al., 2016), and the final loss is:
梯度惩罚 (gradient penalty)。在这种情况下,正则化基于雅可比矩阵 (Jacobian matrix) 的 Frobenius 范数 (Rifai et al., 2011; Goodfellow et al., 2016),最终损失函数为:
$$
L_{G P}(\pmb\theta)=||\epsilon-\epsilon_{\pmb\theta}({\pmb x}{t},t)||^{2}+\lambda_{G P}\left|\frac{\partial\epsilon_{\pmb\theta}({\pmb x}{t},t)}{\partial{\pmb x}}\right|_{F}^{2},
$$
$$
L_{G P}(\pmb\theta)=||\epsilon-\epsilon_{\pmb\theta}({\pmb x}{t},t)||^{2}+\lambda_{G P}\left|\frac{\partial\epsilon_{\pmb\theta}({\pmb x}{t},t)}{\partial{\pmb x}}\right|_{F}^{2},
$$
where $\lambda_{G P}$ is the weight of the gradient penalty term. However, a gradient penalty regular iz ation is very slow (Yoshida & Miyato, 2017) because it involves one forward and two backward passes for each training step.
其中 $\lambda_{G P}$ 是梯度惩罚项的权重。然而,梯度惩罚正则化非常慢 (Yoshida & Miyato, 2017) ,因为每个训练步骤需要一次前向传播和两次反向传播。
Weight decay. As shown in (Liu et al., 2022), Lipschitz continuity can also be encouraged using a weight decay regular iz ation (see Appendix A.6 for more details). In this case, the final loss is:
权重衰减。如 (Liu et al., 2022) 所示,利用权重衰减正则化也能促进 Lipschitz 连续性(详见附录 A.6)。此时,最终损失函数为:
$$
\begin{array}{r}{L_{W D}(\pmb{\theta})=||\pmb{\epsilon}-\pmb{\epsilon}{\pmb{\theta}}(\pmb{x}{t},t)||^{2}+\lambda_{W D}||\pmb{\theta}||^{2},}\end{array}
$$
$$
\begin{array}{r}{L_{W D}(\pmb{\theta})=||\pmb{\epsilon}-\pmb{\epsilon}{\pmb{\theta}}(\pmb{x}{t},t)||^{2}+\lambda_{W D}||\pmb{\theta}||^{2},}\end{array}
$$
where $\lambda_{W D}$ is the weight of the regular iz ation term.
其中 $\lambda_{W D}$ 是正则化项的权重。
6. Results
6. 结果
In this section, we evaluate the generation quality of the proposed solutions and we compare them with state-of-the-art DDPMs. We use unconditional image generation tasks on different datasets and standard metrics: the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Spatial Fréchet Inception Distance (sFID) (Nash et al., 2021). As a variant of FID, sFID uses spatial features rather than the standard pooled features to better capture spatial relationships, rewarding image distributions with a coherent high-level structure. As mentioned in Sec. 5.3, in all our experiments we use $\gamma=0.1$ without any dataset or baseline specific tuning of our only hyper parameter.
在本节中,我们评估了所提出方案的生成质量,并将其与最先进的DDPM进行了比较。我们在不同数据集上使用无条件图像生成任务和标准指标:Fréchet Inception Distance (FID) (Heusel et al., 2017) 和 Spatial Fréchet Inception Distance (sFID) (Nash et al., 2021)。作为FID的变体,sFID使用空间特征而非标准池化特征,以更好地捕捉空间关系,奖励具有连贯高层结构的图像分布。如第5.3节所述,在所有实验中,我们使用 $\gamma=0.1$,且未针对任何数据集或基线对我们的唯一超参数进行特定调整。
6.1. Evaluation of the Different Proposed Solutions
6.1. 不同提出方案的评估
In this section, we empirically compare to each other the three regular iz ation methods proposed in Sec. 5 to alleviate the exposure bias problem. For all three approaches, we use the state-of-the-art diffusion model ADM (Dhariwal & Nichol, 2021) (without classifier guidance) as the baseline, and we call: (1) “ADM-IP” the version of ADM trained using Alg. 3, (2) “ADM-GP” the version of ADM trained using the gradient penalty, and (3) “ADM-WD” for the weight decay (Sec. 5.4). We use $\lambda_{G P}=1e-6$ and $\lambda_{W D}=0.03$ as the loss weights for ADM-GP and ADM-WD, respectively.
在本节中,我们通过实验对比第5章提出的三种缓解曝光偏差(Exposure Bias)的正则化方法。所有实验均以最先进的扩散模型ADM (Dhariwal & Nichol, 2021) (无分类器引导)为基线,并定义:(1) "ADM-IP"为采用算法3训练的ADM版本,(2) "ADM-GP"为采用梯度惩罚训练的ADM版本,(3) "ADM-WD"为权重衰减版本(第5.4节)。实验中设置ADM-GP的损失权重$\lambda_{G P}=1e-6$,ADM-WD的损失权重$\lambda_{W D}=0.03$。
For this experiment, we use CIFAR10 because ADM-GP is too time-consuming to be trained on larger datasets. The results in Tab. 2 show that all three models outperform the baseline in image quality, demonstrating the effectiveness of smoothing the prediction function using the proposed regular iz ation methods. However, training ADM-GP is too slow and cannot be scaled to larger datasets, thus we do not recommend this solution. Moreover, ADM-IP gets the best FID and sFID scores, thus, in the rest of this paper, we use the input perturbation approach described in Sec. 5.1 as our basic solution.
在本实验中,我们选用CIFAR10数据集,因为ADM-GP模型在更大规模数据集上的训练耗时过高。表2结果显示,三种模型在图像质量指标上均超越基线,验证了通过所提正则化方法平滑预测函数的有效性。但ADM-GP训练速度过慢且难以扩展至更大数据集,因此我们不推荐该方案。此外,ADM-IP取得了最优的FID和sFID分数,故本文后续将采用第5.1节所述的输入扰动方法作为基础解决方案。
Table 2. Comparison of different regular iz ation methods. All the models are tested using $T=1,000$ sampling steps.
表 2: 不同正则化方法的对比。所有模型均使用 $T=1,000$ 采样步数进行测试。
Model | CIFAR10 32×32 |
---|---|
FID sFID | |
ADM (baseline) | 2.99 4.76 |
ADM-GP | 2.80 4.41 |
ADM-WD | 2.82 4.61 |
ADM-IP | 2.76 4.05 |
Finally, we use ADM-IP to quantify the reduction in the exposure bias following the protocol described in Sec. 4. The results reported in Tab. 1 show that ADM-IP leads to a significantly lower exposure bias than ADM, and this difference is larger with longer sampling sequences.
最后,我们按照第4节所述的协议,使用ADM-IP来量化曝光偏差的减少。表1中的结果显示,ADM-IP导致的曝光偏差显著低于ADM,且随着采样序列长度的增加,这种差异更为明显。
6.2. Main results
6.2. 主要结果
Comparison with DDPMs. We compare ADM-IP with ADM using CIFAR10, ImageNet $32\times32$ , LSUN tower $64\times64$ , CelebA $64\times64$ (Liu et al., 2015) and FFHQ $128\times128$ . Following prior work (Ho et al., 2020; Nichol & Dhariwal, 2021), we generate 50K samples for each trained model and we use the full training set to compute the reference distribution statistics, except for LSUN tower where (again following (Ho et al., 2020; Nichol & Dhariwal, 2021)) we use 50K training samples as the reference data. When training, we always use $T=1,000$ steps for all the models. At inference time, the results reported with $T^{\prime}<T$ sampling steps have been obtained using the respacing technique (Nichol & Dhariwal, 2021). As previously mentioned (see Sec. 5.3) we keep fixed $\gamma=0.1$ in all the experiments and the datasets. We refer to Appendix A.7 for the complete list of hyper parameters (e.g. the learning rate, the batch size, etc.) and network architecture settings, which are the same for both ADM and ADM-IP.
与DDPMs的对比。我们将ADM-IP与ADM在CIFAR10、ImageNet $32\times32$、LSUN tower $64\times64$、CelebA $64\times64$ (Liu et al., 2015) 和 FFHQ $128\times128$ 数据集上进行对比。遵循先前工作 (Ho et al., 2020; Nichol & Dhariwal, 2021),我们为每个训练模型生成5万个样本,并使用完整训练集计算参考分布统计量,但LSUN tower数据集除外(同样遵循 (Ho et al., 2020; Nichol & Dhariwal, 2021)),我们使用5万个训练样本作为参考数据。训练时,所有模型均采用 $T=1,000$ 步。推理阶段,报告结果中 $T^{\prime}<T$ 采样步数是通过重间距技术 (Nichol & Dhariwal, 2021) 获得的。如前所述(见第5.3节),我们在所有实验和数据集中固定 $\gamma=0.1$。完整超参数列表(如学习率、批次大小等)及网络架构设置详见附录A.7,这些参数对ADM和ADM-IP均相同。
The results reported in Tab. 3 show that, independently of the dataset and the number of sampling steps $(T^{\prime}\leq T)$ , ADM-IP is always better than ADM in terms of both the FID and sFID metrics, sometimes drastically better. For instance, on LSUN, with $T^{\prime}=80$ , we have a more than 5 sFID score improvement with respect to ADM. On FFHQ $128\times128$ , with $T^{\prime}=1,000$ , we have almost 7 points of improvement compared to both the FID and the sFID scores. In addition to the experiments shown in Tab. 3, we used $T^{\prime}=900$ sampling steps and our ADM-IP on CelebA $64\times64$ , achieving a result of 1.27 FID, which is the new state-of-the-art performance for unconditional generation on this dataset.
表 3 中的结果显示,无论数据集和采样步数 $(T^{\prime}\leq T)$ 如何,ADM-IP 在 FID 和 sFID 指标上始终优于 ADM,有时甚至显著领先。例如,在 LSUN 数据集上,当 $T^{\prime}=80$ 时,ADM-IP 的 sFID 分数比 ADM 提高了 5 分以上。在 FFHQ $128\times128$ 数据集上,当 $T^{\prime}=1,000$ 时,FID 和 sFID 分数均提升了近 7 分。除表 3 所示的实验外,我们还在 CelebA $64\times64$ 数据集上使用 $T^{\prime}=900$ 采样步数和 ADM-IP 方法,取得了 1.27 的 FID 分数,这是该数据集无条件生成任务的新最优性能。
Table 3. Comparison between ADM and ADM-IP using models trained with $T=1,000$ sampling steps and tested with $T^{\prime}\leq T$ steps.
表 3. 使用 $T=1,000$ 采样步长训练并以 $T^{\prime}\leq T$ 步长测试的 ADM 与 ADM-IP 模型对比。
采样步长 (T') | 模型 | CIFAR10 | ImageNet 32 | LSUN tower 64 | CelebA 64 | FFHQ 128 | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
FID | sFID | FID | sFID | FID | sFID | FID | sFID | FID | sFID | ||
1,000 | ADM (baseline) | 2.99 | 4.76 | 3.60 | 3.30 | 3.39 | 7.96 | 1.60 | 3.80 | 9.65 | 12.53 |
ADM-IP (ours) | 2.76 | 4.05 | 2.87 | 2.39 | 2.68 | 6.04 | 1.31 | 3.38 | 2.98 | 5.59 | |
300 | ADM | 2.95 | 4.95 | 3.58 | 3.48 | 3.31 | 8.39 | 1.82 | 4.25 | 9.55 | 12.60 |
ADM-IP | 2.67 | 4.14 | 2.74 | 2.58 | 2.60 | 5.98 | 1.43 | 3.36 | 3.74 | 5.97 | |
100 | ADM | 3.37 | 5.66 | 4.26 | 4.48 | 3.50 | 11.10 | 3.02 | 5.76 | 14.52 | 16.02 |
ADM-IP | 2.70 | 4.51 | 3.24 | 3.13 | 2.79 | 6.56 | 2.21 | 4.33 | 5.94 | 7.90 | |
80 | ADM | 3.63 | 5.97 | 4.61 | 4.76 | 4.17 | 12.60 | 3.75 | 6.80 | 17.00 | 18.02 |
ADM-IP | 2.93 | 4.69 | 3.57 | 3.33 | 2.95 | 6.93 | 2.67 | 4.69 | 6.89 | 8.79 |
Note that, for most datasets, both the baseline (ADM) and ADM-IP reach the best results with $T^{\prime}<T$ (specifically, with $T^{\prime}\in[100,300])$ . As mentioned in Sec. 4, this is most likely a confirmation of the exposure bias problem: a shorter sampling trajectory accumulates a smaller prediction error.
需要注意的是,对于大多数数据集,基线方法 (ADM) 和 ADM-IP 都在 $T^{\prime}<T$ (具体而言 $T^{\prime}\in[100,300]$ )时达到最佳结果。如第4节所述,这很可能印证了曝光偏差问题:较短的采样轨迹会累积更小的预测误差。
Besides generating significantly better images, ADM-IP converges much faster than the baseline during training in all the five datasets (see Fig. 3 and 4). For instance, on LSUN tower and CelebA, ADM-IP converges at 220K and 300K training iterations while ADM saturates around 300K and 480K iterations, respectively. Fig. 3 shows also that, even before convergence, ADM-IP quickly beats the ADM results obtained when the latter has converged. For instance, on CelebA, ADM-IP gets FID 1.51 at 120K training iterations, whereas ADM gets FID 1.6 at convergence (480K iterations), exhibiting a $4\mathbf{x}$ training speed-up. On the larger resolution FFHQ dataset, ADM receives FID 14.52 at convergence (420K iterations), while ADM-IP achieves a FID score of 8.81 with only 60K iterations: an improvement of 5.71 points with a $7\mathbf{x}$ training speed-up. Fig. 4 shows a similar trend for the CIFAR10 dataset. In this figure, we also plot the results of ADM-IP with different $\gamma$ values (Sec. 5.3).
除了生成质量显著提升的图像外,ADM-IP在全部五个数据集的训练过程中收敛速度远超基线模型(见图3和4)。例如在LSUN tower和CelebA数据集上,ADM-IP分别在22万次和30万次训练迭代时收敛,而ADM模型需要约30万次和48万次迭代才达到饱和。图3还显示,即使在收敛前阶段,ADM-IP也能快速超越已收敛的ADM模型表现。以CelebA数据集为例,ADM-IP在12万次迭代时就取得FID 1.51,而ADM模型收敛时(48万次迭代)FID为1.6,实现了4倍的训练加速。在更高分辨率的FFHQ数据集上,ADM模型收敛时(42万次迭代)FID为14.52,而ADM-IP仅用6万次迭代就达到FID 8.81,指标提升5.71分的同时实现7倍加速。图4展示了CIFAR10数据集上的相似趋势,图中还绘制了不同γ参数下ADM-IP的表现(见5.3节)。
The training iterations until convergence for each model are summarized in Tab. 4. The much faster convergence of our method is most likely due to the regular iz ation effect of the input perturbation. In fact, as commonly happens with regularization techniques (Zhang et al., 2018; Liu et al., 2021;
各模型直至收敛的训练迭代次数总结于表4。我们方法的更快收敛很可能是由于输入扰动(perturbation)的正则化(regularization)效应。事实上,正如正则化技术常见的情况(Zhang et al., 2018; Liu et al., 2021;
Bales trier o et al., 2022), the proposed input perturbation also introduces an inductive bias in training. In our case, it is: close points in the domain of the prediction function should lead to similar outcomes. Our empirical results show that this bias helps the DDPM training.
Bales trier o et al., 2022) 提出的输入扰动方法也在训练中引入了归纳偏置。在我们的案例中表现为:预测函数定义域内相邻的点应产生相似输出。实证结果表明,这种偏置有助于DDPM (Denoising Diffusion Probabilistic Models) 训练。
Tab. 4 also shows that ADM-IP can drastically accelerate the inference process, i.e. obtaining better results than the baseline with shorter sampling trajectories. For example, with only 60 or 80 steps, ADM-IP gets a better or an equivalent FID than ADM (tested with the standard 1,000 sampling steps) on all datasets, except for CelebA, where ADM-IP needs 200 sampling steps to reach the same result. This comparison shows a remarkable 5x to $16.7\mathbf{X}$ speed-up of the inference stage, which is particularly significant for the larger resolution FFHQ dataset.
表 4 还显示,ADM-IP 能大幅加速推理过程,即在更短的采样轨迹下获得优于基线的结果。例如,在除 CelebA 外的所有数据集上,ADM-IP 仅需 60 或 80 步即可取得优于或等同于 ADM (标准 1,000采样步数测试) 的 FID 值;对于 CelebA 数据集,ADM-IP 需要 200 采样步才能达到同等效果。这一对比表明推理阶段实现了惊人的 5 倍至 $16.7\mathbf{X}$ 加速,这在更高分辨率的 FFHQ 数据集上尤为显著。
Finally, we measure the recall and precision for the generated samples using the method in Ky nk a an niemi et al. (2019). The results show that the recall and precision achieved by ADM and ADM-IP have no significant difference, which indicates that our input perturbation does not affect the sample diversity (see Appendix A.4).
最后,我们采用Ky nk a an niemi等人(2019)的方法测量生成样本的召回率与精确度。结果表明ADM与ADM-IP实现的召回率和精确度无显著差异,这说明我们的输入扰动不会影响样本多样性(详见附录A.4)。
Comparison with DDIMs. In order to show the generality of our proposal, we use Alg. 3 with the Denoising Diffusion Implicit Models (DDIMs) proposed by Song et al. (2021a) (Sec. 2). We train both the baseline (DDIM) and our method (DDIM-IP) on CIFAR10 using the public code provided by Song et al. (2021a). Since training with DDIM is particularly slow, we use only CIFAR10 for this comparison. We use the default hyper parameters settings (e.g. $T=1,000;$ ) in their code and train both models for 1,600K iterations with batch size 128. We test the performance of the two models with both $\eta=0$ and $\eta=0.5$ , where $\eta$ is the coefficient of stochastic it y sampling in DDIMs. Also in this