Input Perturbation Reduces Exposure Bias in Diffusion Models
Abstract
Denoising Diffusion Probabilistic Models have shown impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, which consists of perturbing the ground truth samples to simulate the inference-time prediction errors. We empirically show that, without affecting the recall and precision, the proposed input perturbation leads to a significant improvement in the sample quality while reducing both the training and the inference times. For instance, on CelebA $64\times64$, we achieve a new state-of-the-art FID score of 1.27, while saving $37.5\%$ of the training time. The code is available at https://github.com/forever208/DDPM-IP.
1. Introduction
Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a new generative paradigm which is attracting growing interest due to its very high-quality sample generation capabilities (Dhariwal & Nichol, 2021; Nichol et al., 2022; Ramesh et al., 2022). Unlike most existing generative methods, which synthesize a new sample in a single step, DDPMs resemble Langevin dynamics (Welling & Teh, 2011): the generation process is based on a sequence of denoising steps, in which a synthetic sample is created starting from pure noise and autoregressively reducing the noise component. In more detail, during training, a real sample $\pmb{x}_{0}$ is progressively destroyed in $T$ steps by adding Gaussian noise (forward process). The sequence $\pmb{x}_{0},...,\pmb{x}_{t},...,\pmb{x}_{T}$ so obtained is used to train a deep denoising autoencoder $(\mu(\cdot))$ to invert the forward process: $\hat{\pmb{x}}_{t-1}=\mu(\pmb{x}_{t},t)$. At inference time, the generation process is autoregressive because it depends on the previously generated samples: $\hat{\pmb{x}}_{t-1}=\mu(\hat{\pmb{x}}_{t},t)$ (Sec. 3).
Despite the large success of DDPMs in different generative fields (Sec. 2), one of the main drawbacks of these models is their very long computational time, which depends on the large number of steps $T$ required at both the training and the inference stage. As recently emphasised by Xiao et al. (2022), the fundamental reason why $T$ needs to be large is that each denoising step is assumed to be Gaussian, and this assumption holds only for small step sizes. Conversely, with larger step sizes, the prediction network $(\mu(\cdot))$ needs to solve a harder problem and it becomes progressively less accurate (Xiao et al., 2022). However, in this paper, we observe that there is a second phenomenon, related to the sampling chain but partially in contrast with the first: the accumulation of these errors over the $T$ inference sampling steps. This is due to the discrepancy between the training and the inference stage, since the latter generates a sequence of samples based on the results of the previous steps and can therefore accumulate errors. In fact, at training time, $\mu(\cdot)$ is trained with a ground truth pair $(\pmb{x}_{t},\pmb{x}_{t-1})$ and, given $\pmb{x}_{t}$, it learns to reconstruct $\pmb{x}_{t-1}$ $(\mu(\pmb{x}_{t},t))$. However, at inference time, $\mu(\cdot)$ has no access to the "real" $\pmb{x}_{t}$, and its prediction depends on the previously generated $\hat{\pmb{x}}_{t}$ $(\mu(\hat{\pmb{x}}_{t},t))$. This input mismatch between $\mu(\pmb{x}_{t},t)$, used during training, and $\mu(\hat{\pmb{x}}_{t},t)$, used during testing, is similar to the exposure bias problem (Ranzato et al., 2016; Schmidt, 2019) shared by other autoregressive generative methods. For example, Rennie et al. (2017) argue that training a network to maximize the likelihood of the next ground-truth word given the previous ground-truth word (called "Teacher-Forcing" (Bengio et al., 2015)) results in error accumulation at inference time, since the model has never been exposed to its own predictions.
In this paper, we first empirically analyze this error accumulation phenomenon. For instance, we show that a standard DDPM (Dhariwal & Nichol, 2021), trained with $T$ steps, can generate better results using a number of inference steps $T^{\prime}<T$ (Sec. 6.2). A similar phenomenon was also observed by Nichol & Dhariwal (2021), but the authors did not provide an explanation for it. We believe that the reason for this apparently contradictory result is that while, on the one hand, longer chains can better satisfy the Gaussian assumption in the reverse diffusion process, on the other hand, they lead to a larger accumulation of errors.
Second, in order to alleviate the exposure bias problem, we propose a surprisingly simple yet very effective method, which consists of explicitly modelling the prediction error during training. Specifically, at training time, we perturb $\pmb{x}_{t}$ and we feed $\mu(\cdot)$ with a noisy version of $\pmb{x}_{t}$, thereby simulating the training-inference discrepancy and forcing the learned network to take into account possible inference-time prediction errors. Note that our perturbation is different from the content-destroying forward process, because the new noise is not used in the ground truth prediction target (Sec. 5.2). The proposed method is a training regularization which forces the network to smooth its prediction function: to solve the proposed task, two spatially close points $\pmb{x}_{1}$ and $\pmb{x}_{2}$ should lead to similar predictions $\mu(\pmb{x}_{1},t)$ and $\mu(\pmb{x}_{2},t)$. This regularization approach is similar to Mixup (Zhang et al., 2018) and the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2000), where a neighborhood around each sample in the training data is defined and then used to perturb that sample while keeping its target class label fixed.
Third, we propose alternative solutions to the exposure bias problem for diffusion models, in which, rather than using input perturbation, we obtain a smoother prediction function $\mu(\cdot)$ by explicitly encouraging $\mu(\cdot)$ to be Lipschitz continuous (Sec. 5.4). The rationale behind this is that a Lipschitz continuous function $\mu(\cdot)$ generates small prediction differences between neighbouring points in its domain, leading to a DDPM which is more robust to the inference-time errors.
Finally, we empirically analyse all the proposed solutions and we show that, although they are all effective in improving the final generation quality, input perturbation is both more efficient and more effective than the explicit minimization of the Lipschitz constant in DDPMs (Sec. 6.1). Moreover, directly perturbing the network input at training time has no additional training overhead, and this solution is very easy to reproduce and plug into existing DDPM frameworks: it can be obtained with just two lines of code without any change in the network architecture or the loss function. We call our method Denoising Diffusion Probabilistic Models with Input Perturbation (DDPM-IP) and we show that it can significantly improve the generation quality of state-of-the-art DDPMs (Dhariwal & Nichol, 2021; Song et al., 2021a) and speed up the inference-time sampling. For instance, on the CIFAR10 (Krizhevsky et al., 2009), the ImageNet $32\times32$ (Chrabaszcz et al., 2017), the LSUN $64\times64$ (Yu et al., 2015) and the FFHQ $128\times128$ (Karras et al., 2019) datasets, DDPM-IP, with only 80 sampling steps, obtains lower FID scores than the state-of-the-art ADM (Dhariwal & Nichol, 2021) with 1,000 steps, corresponding to a sampling acceleration of $12.5\times$.
In summary, our contributions are:
• We show that there is an exposure bias problem in DDPMs which has not been investigated so far.
• To alleviate this problem, we propose different regularization methods whose common goal is to smooth the prediction function, and we specifically suggest input perturbation (DDPM-IP) as the best and the simplest of such solutions.
• Using common benchmarks, we show that DDPM-IP can significantly improve the generation quality and drastically speed up both training and inference.
2. Related Work
Diffusion models were introduced by Sohl-Dickstein et al. (2015) and later improved in (Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021b; Nichol & Dhariwal, 2021). More recently, Dhariwal & Nichol (2021) have shown that DDPMs can yield higher-quality images than Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Brock et al., 2018). Similarly to GANs, the generation process in DDPMs can be either unconditional or conditioned. For instance, GLIDE (Nichol et al., 2022) learns to generate images according to an input textual sentence. Unlike GLIDE, where the diffusion model is defined on the image space, DALL·E 2 (Ramesh et al., 2022) uses a DDPM to learn a prior distribution on the CLIP (Radford et al., 2021) space. Text-to-image generation is also explored in Stable Diffusion (Rombach et al., 2021) and Imagen (Saharia et al., 2022). Apart from images, DDPMs can also be used with categorical distributions (Hoogeboom et al., 2021; Gu et al., 2021), in the audio domain (Mittal et al., 2021; Chen et al., 2021), in time series forecasting (Rasul et al., 2021) and in other generative tasks (Yang et al., 2022; Croitoru et al., 2022). Unlike previous work, our goal is not to propose an application-specific prediction network, but rather to investigate the training-testing discrepancy of DDPMs and propose a solution which can be used in different application fields and jointly with different denoising architectures.
Accelerating DDPM training and reducing the number of sampling steps $T$ (Sec. 1) have been thoroughly investigated due to their practical implications. For instance, Song et al. (2021a) propose Denoising Diffusion Implicit Models (DDIMs), based on a non-Markovian diffusion process, which can use fewer inference sampling steps than those used at training time, without retraining the network. Salimans & Ho (2022) propose to distil the prediction network into new networks which progressively reduce the number of sampling steps. However, the disadvantage is the need to train multiple networks. Rombach et al. (2021) speed up sampling by splitting the process into a compression stage and a generation stage, and applying the DDPM on the compressed (latent) space. Hoogeboom et al. (2022) present an order-agnostic DDPM, inspired by XLNet (Yang et al., 2019), in which the sequence $\pmb{x}_{0},...,\pmb{x}_{T}$ is randomly permuted at training time, leading to a partially parallelized sampling process. Chen et al. (2021) found that, instead of conditioning the prediction network $(\mu(\cdot))$ on a discrete diffusion step $t$, it is beneficial to condition $\mu(\cdot)$ on a continuous noise level. Similarly, Kong & Ping (2021) introduce continuous diffusion steps, resulting in a unified framework for fast sampling. In order to use larger sampling steps and a non-Gaussian reverse process (Sec. 1), Xiao et al. (2022) include an adversarial loss in DDPMs and propose Denoising Diffusion GANs. Karras et al. (2022) suggest using Heun's second-order deterministic sampling method, leading to high-quality results and fast sampling. Xu et al. (2022) accelerate the generation process of continuous normalizing flows using a Poisson flow generative model. Our approach is orthogonal to these previous works, and it can potentially be used jointly with most of them.
3. Background
Without loss of generality, we assume an image domain and we focus on DDPMs which define a diffusion process on the input space. Following (Nichol & Dhariwal, 2021; Dhariwal & Nichol, 2021), we assume that each pixel value is linearly scaled into $[-1,1]$. Given a sample $\pmb{x}_{0}$ from the data distribution $q(\pmb{x}_{0})$ and a predefined noise schedule $(\beta_{1},...,\beta_{T})$, a DDPM defines the forward process as a Markov chain which starts from a real image $\pmb{x}_{0}\sim q(\pmb{x}_{0})$ and iteratively adds Gaussian noise for $T$ diffusion steps:
$$
q(\pmb{x}_{t}|\pmb{x}_{t-1})=\mathcal{N}(\pmb{x}_{t};\sqrt{1-\beta_{t}}\pmb{x}_{t-1},\beta_{t}\pmb{I}), \tag{1}
$$
$$
q(\pmb{x}_{1:T}|\pmb{x}_{0})=\prod_{t=1}^{T}q(\pmb{x}_{t}|\pmb{x}_{t-1}), \tag{2}
$$
until obtaining a completely noisy image $\pmb{x}_{T}\sim\mathcal{N}(\mathbf{0},I)$ . On the other hand, the reverse process is defined by transition probabilities parameterized by $\pmb\theta$ :
$$
p_{\pmb\theta}(\pmb{x}_{t-1}|\pmb{x}_{t})=\mathcal{N}(\pmb{x}_{t-1};\mu_{\pmb\theta}(\pmb{x}_{t},t),\sigma_{t}\pmb{I}), \tag{3}
$$
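where $\sigma_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$, $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$. With this notation, $\pmb{x}_{t}$ can be written directly in terms of $\pmb{x}_{0}$: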
$$
\pmb{x}_{t}=\sqrt{\bar{\alpha}_{t}}\pmb{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\pmb{\epsilon}, \tag{4}
$$
where $\pmb{\epsilon}$ is a noise vector ($\pmb{\epsilon}\sim\mathcal{N}(\mathbf{0},\pmb{I})$). Instead of predicting the mean of the forward process posterior (i.e., $\hat{\pmb{x}}_{t-1}=\mu_{\pmb\theta}(\pmb{x}_{t},t)$), Ho et al. (2020) propose to use a network $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ which predicts the noise vector ($\pmb{\epsilon}$). Using $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ and a simple $L_{2}$ loss function, the training objective becomes:
$$
L(\pmb\theta)=\mathbb{E}_{\pmb{x}_{0}\sim q(\pmb{x}_{0}),\,\pmb{\epsilon}\sim\mathcal{N}(\mathbf{0},\pmb{I}),\,t\sim\mathcal{U}(\{1,...,T\})}\left[||\pmb{\epsilon}-\pmb{\epsilon}_{\pmb\theta}(\pmb{x}_{t},t)||^{2}\right]. \tag{5}
$$
Note that, in Eq. 5, $\pmb{x}_{t}$ and $\pmb{\epsilon}$ are ground-truth terms, while $\pmb{\epsilon}_{\pmb\theta}(\pmb{x}_{t},t)$ is the network prediction. Using Eq. 5, the training and the sampling algorithms are described in Alg. 1 and Alg. 2, respectively.
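As an illustration of Eq. 4 and Eq. 5, the following is a minimal sketch in plain Python, not the authors' implementation. The linear schedule in `linear_beta_schedule`, the helper names, and the placeholder predictor `toy_eps_model` (which always outputs zero) are assumptions introduced only for this example; a real DDPM would use a trained network and tensors instead of lists.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule (beta_1, ..., beta_T)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0, t, abar, eps):
    """Eq. 4: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps (t is 1-indexed)."""
    a = abar[t - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def l2_loss(eps, eps_pred):
    """Squared L2 distance between target and predicted noise (Eq. 5, one sample)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def toy_eps_model(x_t, t):
    """Placeholder for eps_theta (an assumption): always predicts zero noise."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
T = 1000
abar = alpha_bars(linear_beta_schedule(T))
x0 = [0.5, -0.25, 1.0, -1.0]              # a tiny "image" scaled into [-1, 1]
t = rng.randint(1, T)                     # t ~ U({1, ..., T})
eps = [rng.gauss(0.0, 1.0) for _ in x0]   # eps ~ N(0, I)
x_t = q_sample(x0, t, abar, eps)          # ground-truth network input
loss = l2_loss(eps, toy_eps_model(x_t, t))
```

Here both the input `x_t` and the target `eps` come from the same noise draw, which is the symmetric use of $\pmb{\epsilon}$ that Sec. 5.2 contrasts with input perturbation.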
Algorithm 1 DDPM Standard Training
Algorithm 2 DDPM Standard Sampling
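The sampling loop can be sketched as follows. This is a hedged illustration of standard ancestral sampling in the style of Ho et al. (2020), not a transcription of Alg. 2: the linear schedule, the posterior-variance choice for $\sigma_{t}$, the helper names, and the zero-output predictor used in the usage line are all assumptions. The point of the sketch is the feedback loop: at each step the predictor receives the previously generated $\hat{\pmb{x}}_{t}$, never a ground-truth $\pmb{x}_{t}$.

```python
import math
import random


def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule; returns (betas, alpha_bars), both of length T."""
    betas = [beta_start + (beta_end - beta_start) * i / max(T - 1, 1) for i in range(T)]
    abar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        abar.append(running)
    return betas, abar


def reverse_step(x_t, t, eps_hat, betas, abar, rng):
    """One reverse step x_t -> x_{t-1} given the predicted noise eps_hat (t is 1-indexed)."""
    beta_t = betas[t - 1]
    abar_t = abar[t - 1]
    abar_prev = abar[t - 2] if t > 1 else 1.0
    coef = beta_t / math.sqrt(1.0 - abar_t)
    mean = [(x - coef * e) / math.sqrt(1.0 - beta_t) for x, e in zip(x_t, eps_hat)]
    if t == 1:
        return mean  # no noise is added at the final step
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t  # assumed posterior variance
    return [m + math.sqrt(var) * rng.gauss(0.0, 1.0) for m in mean]


def sample(eps_model, dim, T, rng):
    """Run the full reverse chain starting from pure Gaussian noise."""
    betas, abar = make_schedule(T)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(T, 0, -1):
        eps_hat = eps_model(x, t)  # the input is the model's own previous output
        x = reverse_step(x, t, eps_hat, betas, abar, rng)
    return x


# Usage with a placeholder predictor that always returns zero noise.
generated = sample(lambda x, t: [0.0] * len(x), dim=3, T=50, rng=random.Random(0))
```

Any error in `eps_hat` at step $t$ changes the input seen at step $t-1$, which is how prediction errors can accumulate along the chain.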
4. Exposure Bias Problem in Diffusion Models
Comparing line 4 of Alg. 1 with line 4 of Alg. 2, we note that the inputs of the prediction network $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ are different between the training and the inference phase. Concretely, at training time, standard DDPMs use $\pmb{\epsilon}_{\pmb\theta}(\pmb{x}_{t},t)$, where $\pmb{x}_{t}$ is a ground truth sample (Eq. 4). In contrast, at inference time, they use $\pmb{\epsilon}_{\pmb\theta}(\hat{\pmb{x}}_{t},t)$, where $\hat{\pmb{x}}_{t}$ is computed based on the output of $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ at the previous sampling step $t+1$. As mentioned in Sec. 1, this leads to a training-inference discrepancy, which is similar to the exposure bias problem observed, e.g., in text generation models, in which the training generation is conditioned on a ground-truth sentence, while the testing generation is conditioned on the previously generated words (Ranzato et al., 2016; Schmidt, 2019; Rennie et al., 2017; Bengio et al., 2015). In order to quantify the error accumulation with respect to the number of inference sampling steps, we use a simple experiment in which we start from a (randomly selected) real image $\pmb{x}_{0}$, we compute $\pmb{x}_{t}$ using Eq. 4, and then apply the reverse process (Alg. 2) starting from $\pmb{x}_{t}$ instead of a random $\pmb{x}_{T}$. This way, when $t$ is small enough, the network should be able to "recover" the path to $\pmb{x}_{0}$ (the denoising task is easier). We quantify the total error accumulated in $t$ reverse diffusion steps by comparing the ground truth distribution $q(\pmb{x}_{0})$ and the predicted distribution $q(\hat{\pmb{x}}_{0})$ using the FID scores in Tab. 1. The experiment was done using ADM (Dhariwal & Nichol, 2021) (trained with $T=1{,}000$) and ImageNet $32\times32$, and we compute the FID scores using 50k samples. Tab. 1 (first row) shows that the longer the reverse process, the higher the FID scores, indicating the existence of an error accumulation which is larger with larger values of $t$. In Appendix 5, we repeat this experiment using deterministic sampling, which quantifies the error accumulation while removing the randomness from the sampling process.
Table 1. An empirical estimate of the exposure bias on ImageNet $32\times32$. Columns give the number of reverse diffusion steps; entries are FID scores.
| Model | 100 | 300 | 500 | 700 | 1,000 |
|---|---|---|---|---|---|
| ADM | 0.983 | 1.808 | 2.587 | 3.105 | 3.544 |
| ADM-IP (ours) | 0.972 | 1.594 | 2.198 | 2.539 | 2.742 |
Finally, in Tab. 3 we report the FID scores of ADM on different datasets, which show that most of the best results are obtained in the range from 100 to 300 sampling steps, even though all the models were trained with 1,000 diffusion steps. These results confirm previous similar observations (Nichol & Dhariwal, 2021), and we believe that this apparently counterintuitive phenomenon, in which fewer sampling steps lead to a better generation quality, is due to the exposure bias problem. Indeed, while more sampling steps correspond to a reverse process which can be more easily approximated with a Gaussian distribution (Sec. 1), longer sampling trajectories produce a larger accumulation of the prediction errors. Hence, the range [100, 300] leads to a better generation quality because it presumably trades off these two opposing aspects.
5. Method
5.1. Regularization with Input Perturbation
The solution we propose to alleviate the exposure bias problem is very simple: we explicitly model the prediction error using a Gaussian input perturbation at training time. More specifically, we assume that the error of the prediction network in the reverse process at time $t+1$ is normally distributed with respect to the ground-truth input $\pmb{x}_{t}$ (see Sec. 5.3). This is simulated using a second, dedicated random noise vector $\pmb{\xi}\sim\mathcal{N}(\mathbf{0},\pmb{I})$, with which we create a perturbed version $(\pmb{y}_{t})$ of $\pmb{x}_{t}$:
$$
\pmb{y}_{t}=\sqrt{\bar{\alpha}_{t}}\pmb{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}(\pmb{\epsilon}+\gamma_{t}\pmb{\xi}). \tag{6}
$$
For simplicity, we use a uniform noise schedule for $\pmb{\xi}$ by setting $\gamma_{0}=...=\gamma_{T}=\gamma$. In fact, although selecting the best noise schedule $(\beta_{1},...,\beta_{T})$ in DDPMs is usually very important to get high-quality results (Ho et al., 2020; Chen et al., 2021), it is nevertheless an expensive hyperparameter tuning operation (Chen et al., 2021). Therefore, to avoid adding a second noise schedule $(\gamma_{0},...,\gamma_{T})$ to the training procedure, we opted for a simpler (although most likely suboptimal) solution, in which $\gamma_{t}$ does not vary depending on $t$ (more details in Sec. 5.3). In Alg. 3 we show the proposed training algorithm, in which $\pmb{x}_{t}$ is replaced by $\pmb{y}_{t}$. In contrast, at inference time, we use Alg. 2 without any change.
Algorithm 3 DDPM-IP: Training with input perturbation
1: repeat
2:   $\pmb{x}_{0}\sim q(\pmb{x}_{0})$, $t\sim\mathcal{U}(\{1,...,T\})$
3:   $\pmb{\epsilon}\sim\mathcal{N}(\mathbf{0},\pmb{I})$, $\pmb{\xi}\sim\mathcal{N}(\mathbf{0},\pmb{I})$
4:   compute $\pmb{y}_{t}$ using Eq. 6
5:   take a gradient descent step on $\nabla_{\pmb\theta}||\pmb{\epsilon}-\pmb{\epsilon}_{\pmb\theta}(\pmb{y}_{t},t)||^{2}$
6: until converged
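The perturbation in Eq. 6 and the loss in line 5 of Alg. 3 can be sketched in a few lines. The snippet below is an illustrative sketch in plain Python rather than the released code; the function names and the list-based representation are assumptions, and `eps_model` stands for any callable playing the role of $\pmb{\epsilon}_{\pmb\theta}(\cdot)$.

```python
import math
import random

GAMMA = 0.1  # constant perturbation strength reported in Sec. 5.3


def perturbed_input(x0, abar_t, eps, xi, gamma=GAMMA):
    """Eq. 6: y_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * (eps + gamma * xi)."""
    s, n = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [s * x + n * (e + gamma * z) for x, e, z in zip(x0, eps, xi)]


def ip_loss(eps_model, x0, t, abar_t, rng, gamma=GAMMA):
    """Per-sample loss of Alg. 3: the input is perturbed, the target stays eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # target noise
    xi = [rng.gauss(0.0, 1.0) for _ in x0]    # extra noise, used only in the input
    y_t = perturbed_input(x0, abar_t, eps, xi, gamma)
    eps_pred = eps_model(y_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


# Worked example with abar_t = 0.36, i.e. sqrt(abar_t) = 0.6 and sqrt(1 - abar_t) = 0.8.
y_example = perturbed_input([0.5], 0.36, [1.0], [2.0], gamma=0.1)
x_example = perturbed_input([0.5], 0.36, [1.0], [2.0], gamma=0.0)  # reduces to Eq. 4
```

Setting `gamma=0.0` recovers the standard input of Eq. 4, so the only change relative to standard training is the extra `gamma * xi` term in the network input.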
5.2. Discussion
In this section, we analyze the difference between Alg. 3 and Alg. 1. Specifically, in line 5 of Alg. 3, we use $\pmb{y}_{t}$ as the input of the prediction network $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ but we keep using $\pmb{\epsilon}$ as the regression target. In other words, the new noise term $(\pmb{\xi})$ we introduce is used asymmetrically, because it is applied to the input but not to the prediction target ($\pmb{\epsilon}$). For this reason, Alg. 3 is not equivalent to choosing a different value of $\pmb{\epsilon}$ in Alg. 1, where $\pmb{\epsilon}$ is instead used symmetrically, both in the forward process (Eq. 4) and as the target of the prediction network (line 4 of Alg. 1).
This difference is schematically illustrated in Fig. 1, where, for both Alg. 1 (i.e., DDPM) and Alg. 3 (DDPM-IP), we show the corresponding pairs of input and target vectors of the prediction network (respectively, $(\pmb{x}_{t},\pmb{\epsilon})$ and $(\pmb{y}_{t},\pmb{\epsilon})$). In the same figure, we also show a second version of Alg. 1 (called DDPM-$y$), where we use the standard training protocol (Alg. 1) but change the noise variance in order to adhere to the same distribution generating $\pmb{y}_{t}$. In fact, it can easily be shown that $\pmb{y}_{t}$ in Alg. 3 is generated using the following distribution (see Appendix A.2 for a proof):
$$
q(\pmb{y}_{t}|\pmb{x}_{0})=\mathcal{N}(\pmb{y}_{t};\sqrt{\bar{\alpha}_{t}}\pmb{x}_{0},(1-\bar{\alpha}_{t})(1+\gamma^{2})\pmb{I}). \tag{7}
$$
Hence, we can obtain the same input noise distribution of Alg. 3 in Alg. 1 using $\pmb{\epsilon}^{\prime}\sim\mathcal{N}(\mathbf{0},\pmb{I})$ and:
$$
\pmb{y}_{t}=\sqrt{\bar{\alpha}_{t}}\pmb{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\sqrt{1+\gamma^{2}}\pmb{\epsilon}^{\prime}. \tag{8}
$$
We call DDPM-$y$ the version of Alg. 1 with this new noise distribution. DDPM-$y$ is obtained from Alg. 1 using Eq. 8 in line 3 and replacing $\pmb{x}_{t}$ with $\pmb{y}_{t}$ and $\pmb{\epsilon}$ with $\pmb{\epsilon}^{\prime}$ in line 4. However, note that, for a given $\pmb{y}_{t}$, if $\pmb{\xi}\neq\mathbf{0}$, then $\pmb{\epsilon}\neq\pmb{\epsilon}^{\prime}$ (see Fig. 1). Thus, DDPM-IP and DDPM-$y$ share the same input to $\pmb{\epsilon}_{\pmb\theta}(\cdot)$, but they use different targets. In Appendix A.3, we empirically show that DDPM-$y$ is even worse than the standard DDPM.
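The relation between DDPM-IP and DDPM-$y$ can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original method: the function names are hypothetical, and $\pmb{\epsilon}^{\prime}$ is taken to be $(\pmb{\epsilon}+\gamma\pmb{\xi})/\sqrt{1+\gamma^{2}}$, which is the unit-variance noise that reproduces the same $\pmb{y}_{t}$ through Eq. 8. A small Monte Carlo estimate also compares the variance of $\pmb{\epsilon}+\gamma\pmb{\xi}$ with the factor $1+\gamma^{2}$ in Eq. 7.

```python
import math
import random


def same_input_two_targets(x0, abar_t, eps, xi, gamma):
    """Return (y_t via Eq. 6, y_t via Eq. 8, DDPM-IP target, DDPM-y target)."""
    s, n = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    scale = math.sqrt(1.0 + gamma ** 2)
    # eps' is the rescaled combined noise that generates the same y_t through Eq. 8.
    eps_prime = [(e + gamma * z) / scale for e, z in zip(eps, xi)]
    y_eq6 = [s * x + n * (e + gamma * z) for x, e, z in zip(x0, eps, xi)]
    y_eq8 = [s * x + n * scale * ep for x, ep in zip(x0, eps_prime)]
    return y_eq6, y_eq8, eps, eps_prime


def empirical_noise_variance(gamma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of Var[eps + gamma * xi] for independent unit Gaussians."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, 1.0) + gamma * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    mean = sum(draws) / n_samples
    return sum((d - mean) ** 2 for d in draws) / n_samples


y6, y8, target_ip, target_y = same_input_two_targets(
    x0=[0.5, -0.5], abar_t=0.36, eps=[1.0, -1.0], xi=[2.0, 0.5], gamma=0.1
)
variance_estimate = empirical_noise_variance(0.1)  # should lie close to 1 + 0.1 ** 2
```

In this example `y6` and `y8` coincide up to floating-point error while `target_ip` and `target_y` differ, which mirrors the statement that the two protocols share the network input but regress different targets.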
Intuitively, the proposed training protocol, DDPM-IP, decouples the noise vector $\pmb{\epsilon}^{\prime}$ actually generating $\pmb{y}_{t}$ from the ground truth target vector $\pmb{\epsilon}$ which $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ is asked to predict. In order to solve this problem, $\pmb{\epsilon}_{\pmb\theta}(\cdot)$ needs to smooth its prediction function, reducing the difference between $\pmb{\epsilon}_{\pmb\theta}(\pmb{x}_{t},t)$ and $\pmb{\epsilon}_{\pmb\theta}(\pmb{y}_{t},t)$, and this leads to a training regularization which is similar to VRM (Sec. 1).
Figure 1. The inputs and the prediction targets are different in vanilla DDPM, DDPM-IP and DDPM-$y$.
5.3. Estimating the Prediction Error
In this section, we analyze the actual prediction error of $\pmb{\epsilon}_{\pmb{\theta}}(\cdot)$ and we use this analysis to choose the value of $\gamma$ in Eq. 6. Analogously to Sec. 4, we use ADM, trained using the standard algorithm (Alg. 1), and two datasets: CIFAR10 and ImageNet $32\times32$. At testing time, for a given $t$ and $\hat{\pmb{\epsilon}}=\pmb{\epsilon}_{\pmb{\theta}}(\hat{\pmb{x}}_{t},t)$, we replace $\pmb{\epsilon}$ with $\hat{\pmb{\epsilon}}$ in Eq. 4 and we compute the predicted $\hat{\pmb{x}}_{0}$. Finally, the prediction error at time $t$ is $\pmb{e}_{t}=\hat{\pmb{x}}_{0}-\pmb{x}_{0}$. Note that using $\hat{\pmb{x}}_{0}$ and $\pmb{x}_{0}$ to estimate the error, instead of comparing $\hat{\pmb{x}}_{t}$ and $\pmb{x}_{t}$, has the advantage that the former is independent of the scaling factors $(\sqrt{1-\bar{\alpha}_{t}})$ and thus makes the statistical analysis easier. Using different values of $t$, uniformly selected in $\{1,\ldots,T\}$, we empirically verified that, for a given $t$, $\pmb{e}_{t}$ is normally distributed: $\pmb{e}_{t}\sim\mathcal{N}(\mathbf{0},\nu_{t}^{2}I)$, with standard deviation $\nu_{t}$ (see Appendix A.5).
In Fig. 2 we plot the value of $\nu_{t}$ with respect to $t$. The two curves corresponding to the two datasets are surprisingly close to each other. In principle, we could use this empirical analysis and set $\gamma_{t}=\nu_{t}$ in Eq. 6. In this way, when we perturb the input to $\pmb{\epsilon}_{\pmb{\theta}}(\cdot)$, we would empirically imitate its actual prediction error, which is at the root of the exposure bias problem. However, this choice would require a two-step training: first, using Alg. 1 to train the base model and empirically estimate $\nu_{t}$ for different $t$; then, using Alg. 3 with the estimated $\gamma_{t}$ schedule to retrain the model from scratch. To avoid this and make the whole procedure as simple as possible, we use a constant value $\gamma$, independent of $t$. This value was set empirically using a grid search on both CIFAR10 and ImageNet $32\times32$ over a small range of values covering the last half of the sampling trajectory. Specifically, we investigated the range $\nu_{t}\in[0,\mathbb{E}_{t}[\nu_{t}]]=[0,0.2]$ (see Fig. 2), which was chosen following Karras et al. (2022), who showed that the last part of the inference trajectory usually has the largest impact on the Diffusion Model performance. We finally set $\gamma=0.1$ and, in the rest of this paper, we always use a constant $\gamma=0.1$, regardless of the dataset and the baseline DDPM. Although a DDPM-specific $\gamma$ value would most likely lead to better quality results, we prefer to emphasise the ease of use of our proposal, which does not depend on any other hyperparameter.
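The error estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for Fig. 2: it assumes that Eq. 4 is the usual closed form $\pmb{x}_{t}=\sqrt{\bar{\alpha}_{t}}\pmb{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\pmb{\epsilon}$, it replaces the trained ADM network with a hypothetical noisy predictor, and it draws the input directly from the forward process rather than from the sampling chain. The names `predict_x0` and `error_std` are illustrative.

```python
import math
import random
import statistics


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Solve x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0, using eps_hat as eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


def error_std(x0_hat, x0):
    """Empirical standard deviation of e_t = x0_hat - x0 over all components."""
    return statistics.pstdev([p - q for p, q in zip(x0_hat, x0)])


# Toy usage with a stand-in predictor: true noise plus a small Gaussian error.
rng = random.Random(0)
alpha_bar_t = 0.64
x0 = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
       for x, e in zip(x0, eps)]
eps_hat = [e + 0.05 * rng.gauss(0.0, 1.0) for e in eps]  # hypothetical eps_theta output
nu_t = error_std(predict_x0(x_t, eps_hat, alpha_bar_t), x0)
```

Repeating the last lines for several values of $t$ would give one $\nu_{t}$ estimate per step, which is the kind of curve plotted in Fig. 2.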
Figure 2. The inference time standard deviation $\nu_{t}$ of the prediction error of a pre-trained network with respect to the sampling step $t$ . The mean of the blue and the orange curve is 0.20 and 0.19, respectively.
5.4. Regularization based on Lipschitz Continuous Functions
In this section, we propose two alternative solutions to the exposure bias problem which can help to better investigate the phenomenon. The goal is the same as in Sec. 5.1, i.e., we want to smooth the prediction function $\pmb{\epsilon}_{\pmb{\theta}}(\pmb{x}_{t},t)$ to make it more robust with respect to local variations of $\pmb{x}_{t}$ which are due to the inference-time prediction errors. To do so, instead of using input perturbation, we explicitly encourage $\pmb{\epsilon}_{\pmb{\theta}}(\cdot)$ to be Lipschitz continuous, i.e. to satisfy:
$$
||\epsilon_{\pmb\theta}({\pmb x},t)-\epsilon_{\pmb\theta}({\pmb y},t)||\leq K||{\pmb x}-{\pmb y}||,\forall({\pmb x},{\pmb y})
$$
for a small constant $K$. We implement this idea using two standard Lipschitz constant minimization methods: gradient penalty (Rifai et al., 2011; Gulrajani et al., 2017) and weight decay (Krogh & Hertz, 1991; Miyato et al., 2018). In both cases we do not perturb the input of $\pmb{\epsilon}_{\pmb{\theta}}(\cdot)$, and we use the original training algorithm (Alg. 1), with the only difference being the loss function used in line 4, where the $L_{2}$ loss is used jointly with a regularization term described below.
Gradient penalty. In this case, the regularization is based on the Frobenius norm of the Jacobian matrix (Rifai et al., 2011; Goodfellow et al., 2016), and the final loss is:
$$
L_{GP}(\pmb{\theta})=||\pmb{\epsilon}-\pmb{\epsilon}_{\pmb{\theta}}(\pmb{x}_{t},t)||^{2}+\lambda_{GP}\left\|\frac{\partial\pmb{\epsilon}_{\pmb{\theta}}(\pmb{x}_{t},t)}{\partial\pmb{x}}\right\|_{F}^{2},
$$
where $\lambda_{GP}$ is the weight of the gradient penalty term. However, a gradient penalty regularization is very slow (Yoshida & Miyato, 2017) because it involves one forward and two backward passes for each training step.
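To make the regularization term of $L_{GP}$ concrete, the following minimal sketch estimates the squared Frobenius norm of the Jacobian with central finite differences. Finite differences are a swapped-in technique chosen only to keep the example self-contained; a practical implementation would use automatic differentiation, which is where the additional backward pass comes from. The names `jacobian_frobenius_sq` and `gp_loss` and the toy linear predictor are hypothetical.

```python
def jacobian_frobenius_sq(f, x, h=1e-5):
    """Central finite-difference estimate of ||d f(x) / d x||_F^2 for f: R^n -> R^m."""
    total = 0.0
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        # Column j of the Jacobian: derivative of every output w.r.t. x[j].
        for f_p, f_m in zip(f(x_plus), f(x_minus)):
            total += ((f_p - f_m) / (2.0 * h)) ** 2
    return total


def gp_loss(eps, eps_pred_fn, x_t, lambda_gp):
    """Squared L2 prediction error plus the weighted Jacobian penalty."""
    pred = eps_pred_fn(x_t)
    l2 = sum((e - p) ** 2 for e, p in zip(eps, pred))
    return l2 + lambda_gp * jacobian_frobenius_sq(eps_pred_fn, x_t)


# Toy usage: for a linear predictor f(x) = W x the Jacobian is W itself.
W = [[1.0, 2.0], [3.0, 4.0]]


def linear_predictor(x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


penalty = jacobian_frobenius_sq(linear_predictor, [0.3, -0.5])
```

The finite-difference loop calls the predictor $2n$ times for an $n$-dimensional input, so it is only meant to illustrate the quantity being penalized, not to be used on image-sized inputs.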
Weight decay. As shown by Liu et al. (2022), Lipschitz continuity can also be encouraged using a weight decay regularization (see Appendix A.6 for more details). In this case, the final loss is:
$$
L_{WD}(\pmb{\theta})=||\pmb{\epsilon}-\pmb{\epsilon}_{\pmb{\theta}}(\pmb{x}_{t},t)||^{2}+\lambda_{WD}||\pmb{\theta}||^{2},
$$
where $\lambda_{WD}$ is the weight of the regularization term.
6. Results
In this section, we evaluate the generation quality of the proposed solutions and we compare them with state-of-the-art DDPMs. We use unconditional image generation tasks on different datasets and standard metrics: the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Spatial Fréchet Inception Distance (sFID) (Nash et al., 2021). As a variant of FID, sFID uses spatial features rather than the standard pooled features to better capture spatial relationships, rewarding image distributions with a coherent high-level structure. As mentioned in Sec. 5.3, in all our experiments we use $\gamma=0.1$, without any dataset-specific or baseline-specific tuning of our only hyperparameter.
6.1. Evaluation of the Different Proposed Solutions
In this section, we empirically compare the three regularization methods proposed in Sec. 5 to alleviate the exposure bias problem. For all three approaches, we use the state-of-the-art diffusion model ADM (Dhariwal & Nichol, 2021) (without classifier guidance) as the baseline, and we call: (1) “ADM-IP” the version of ADM trained using Alg. 3, (2) “ADM-GP” the version of ADM trained using the gradient penalty, and (3) “ADM-WD” the version of ADM trained using weight decay (Sec. 5.4). We use $\lambda_{GP}=10^{-6}$ and $\lambda_{WD}=0.03$ as the loss weights for ADM-GP and ADM-WD, respectively.
For this experiment, we use CIFAR10 because ADM-GP is too time-consuming to be trained on larger datasets. The results in Tab. 2 show that all three models outperform the baseline in image quality, demonstrating the effectiveness of smoothing the prediction function using the proposed regularization methods. However, training ADM-GP is too slow and cannot be scaled to larger datasets, so we do not recommend this solution. Moreover, ADM-IP gets the best FID and sFID scores; thus, in the rest of this paper, we use the input perturbation approach described in Sec. 5.1 as our basic solution.
Table 2. Comparison of different regularization methods on CIFAR10 $32\times32$. All the models are tested using $T=1,000$ sampling steps.
Model | FID | sFID |
---|---|---|
ADM (baseline) | 2.99 | 4.76 |
ADM-GP | 2.80 | 4.41 |
ADM-WD | 2.82 | 4.61 |
ADM-IP | 2.76 | 4.05 |
Finally, we use ADM-IP to quantify the reduction in the exposure bias following the protocol described in Sec. 4. The results reported in Tab. 1 show that ADM-IP leads to a significantly lower exposure bias than ADM, and this difference is larger with longer sampling sequences.
6.2. Main Results
Comparison with DDPMs. We compare ADM-IP with ADM using CIFAR10, ImageNet $32\times32$, LSUN tower $64\times64$, CelebA $64\times64$ (Liu et al., 2015) and FFHQ $128\times128$. Following prior work (Ho et al., 2020; Nichol & Dhariwal, 2021), we generate 50K samples for each trained model and we use the full training set to compute the reference distribution statistics, except for LSUN tower where (again following Ho et al., 2020 and Nichol & Dhariwal, 2021) we use 50K training samples as the reference data. When training, we always use $T=1,000$ steps for all the models. At inference time, the results reported with $T^{\prime}<T$ sampling steps have been obtained using the respacing technique (Nichol & Dhariwal, 2021). As previously mentioned (see Sec. 5.3), we keep $\gamma=0.1$ fixed across all experiments and datasets. We refer to Appendix A.7 for the complete list of hyperparameters (e.g. the learning rate, the batch size, etc.) and network architecture settings, which are the same for both ADM and ADM-IP.
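For readers unfamiliar with respacing, the following minimal sketch shows one common way to sample with $T^{\prime}<T$ steps from a model trained with $T$ steps: keep a roughly evenly spaced subset of the timesteps and recompute the noise schedule so that the cumulative products $\bar{\alpha}$ at the kept timesteps are preserved. It is written in the spirit of Nichol & Dhariwal (2021) but is not their code; the 0-based indexing, the linear schedule and all function names are assumptions made for illustration.

```python
def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s <= t} (1 - beta_s) for every timestep."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def space_timesteps(T, T_prime):
    """Pick T_prime roughly evenly spaced indices in {0, ..., T - 1}, ends included."""
    if not 2 <= T_prime <= T:
        raise ValueError("T_prime must satisfy 2 <= T_prime <= T")
    stride = (T - 1) / (T_prime - 1)
    return [int(i * stride + 0.5) for i in range(T_prime)]


def respaced_betas(alphas_cumprod, kept):
    """Betas of the shortened chain that preserve alpha_bar at the kept timesteps."""
    betas, last = [], 1.0
    for t in kept:
        betas.append(1.0 - alphas_cumprod[t] / last)
        last = alphas_cumprod[t]
    return betas


# Toy usage: a linear schedule with T = 1,000 steps (assumed), respaced to 80 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas_cumprod = cumulative_alphas(betas)
kept = space_timesteps(T, 80)
short_betas = respaced_betas(alphas_cumprod, kept)
```

The shortened chain reaches the same final noise level as the full chain because the products of $(1-\beta)$ telescope to $\bar{\alpha}$ at the last kept timestep.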
The results reported in Tab. 3 show that, independently of the dataset and the number of sampling steps $(T^{\prime}\leq T)$, ADM-IP is always better than ADM in terms of both the FID and sFID metrics, sometimes drastically better. For instance, on LSUN tower, with $T^{\prime}=80$, ADM-IP improves the sFID score by more than 5 points with respect to ADM. On FFHQ $128\times128$, with $T^{\prime}=1,000$, we have almost 7 points of improvement in both the FID and the sFID scores. In addition to the experiments shown in Tab. 3, we used $T^{\prime}=900$ sampling steps and our ADM-IP on CelebA $64\times64$, achieving a result of 1.27 FID, which is the new state-of-the-art performance for unconditional generation on this dataset.
Table 3. Comparison between ADM and ADM-IP using models trained with $T=1,000$ sampling steps and tested with $T^{\prime}\leq T$ steps.
Sampling steps ($T^{\prime}$) | Model | CIFAR10 FID | CIFAR10 sFID | ImageNet 32 FID | ImageNet 32 sFID | LSUN tower 64 FID | LSUN tower 64 sFID | CelebA 64 FID | CelebA 64 sFID | FFHQ 128 FID | FFHQ 128 sFID |
---|---|---|---|---|---|---|---|---|---|---|---|
1,000 | ADM (baseline) | 2.99 | 4.76 | 3.60 | 3.30 | 3.39 | 7.96 | 1.60 | 3.80 | 9.65 | 12.53 |
1,000 | ADM-IP (ours) | 2.76 | 4.05 | 2.87 | 2.39 | 2.68 | 6.04 | 1.31 | 3.38 | 2.98 | 5.59 |
300 | ADM | 2.95 | 4.95 | 3.58 | 3.48 | 3.31 | 8.39 | 1.82 | 4.25 | 9.55 | 12.60 |
300 | ADM-IP | 2.67 | 4.14 | 2.74 | 2.58 | 2.60 | 5.98 | 1.43 | 3.36 | 3.74 | 5.97 |
100 | ADM | 3.37 | 5.66 | 4.26 | 4.48 | 3.50 | 11.10 | 3.02 | 5.76 | 14.52 | 16.02 |
100 | ADM-IP | 2.70 | 4.51 | 3.24 | 3.13 | 2.79 | 6.56 | 2.21 | 4.33 | 5.94 | 7.90 |
80 | ADM | 3.63 | 5.97 | 4.61 | 4.76 | 4.17 | 12.60 | 3.75 | 6.80 | 17.00 | 18.02 |
80 | ADM-IP | 2.93 | 4.69 | 3.57 | 3.33 | 2.95 | 6.93 | 2.67 | 4.69 | 6.89 | 8.79 |
Note that, for most datasets, both the baseline (ADM) and ADM-IP reach the best results with $T^{\prime}<T$ (specifically, with $T^{\prime}\in[100,300]$). As mentioned in Sec. 4, this is most likely a confirmation of the exposure bias problem: a shorter sampling trajectory accumulates a smaller prediction error.
Besides generating significantly better images, ADM-IP converges much faster than the baseline during training on all five datasets (see Figs. 3 and 4). For instance, on LSUN tower and CelebA, ADM-IP converges at 220K and 300K training iterations, while ADM saturates around 300K and 480K iterations, respectively. Fig. 3 also shows that, even before convergence, ADM-IP quickly beats the ADM results obtained when the latter has converged. For instance, on CelebA, ADM-IP gets FID 1.51 at 120K training iterations, whereas ADM gets FID 1.6 at convergence (480K iterations), exhibiting a $4\times$ training speed-up. On the larger resolution FFHQ dataset, ADM reaches FID 14.52 at convergence (420K iterations), while ADM-IP achieves a FID score of 8.81 with only 60K iterations: an improvement of 5.71 points with a $7\times$ training speed-up. Fig. 4 shows a similar trend for the CIFAR10 dataset. In this figure, we also plot the results of ADM-IP with different $\gamma$ values (Sec. 5.3).
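The speed-up figures quoted above follow directly from the iteration counts. The short worked example below recomputes them; it assumes that ADM and ADM-IP have the same cost per training iteration, so that iteration counts can stand in for training time, and the helper names are hypothetical.

```python
def speedup(baseline_iters, ours_iters):
    """How many times fewer iterations are needed compared with the baseline."""
    return baseline_iters / ours_iters


def time_saved(baseline_iters, ours_iters):
    """Fraction of the baseline training iterations that is saved."""
    return 1.0 - ours_iters / baseline_iters


# CelebA: ADM-IP at 120K iterations already beats ADM at convergence (480K).
celeba_speedup = speedup(480_000, 120_000)
# FFHQ: ADM-IP at 60K iterations vs. ADM at convergence (420K).
ffhq_speedup = speedup(420_000, 60_000)
ffhq_fid_gain = 14.52 - 8.81
# CelebA, convergence to convergence: 300K (ADM-IP) vs. 480K (ADM).
celeba_saved = time_saved(480_000, 300_000)
```

Under this assumption, the convergence-to-convergence comparison on CelebA (300K versus 480K iterations) corresponds to the 37.5% training-time saving quoted in the abstract.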
The training iterations until convergence for each model are summarized in Tab. 4. The much faster convergence of our method is most likely due to the regularization effect of the input perturbation. In fact, as commonly happens with regularization techniques (Zhang et al., 2018; Liu et al., 2021; Balestriero et al., 2022), the proposed input perturbation also introduces an inductive bias in training. In our case, it is: close points in the domain of the prediction function should lead to similar outcomes. Our empirical results show that this bias helps the DDPM training.
Tab. 4 also shows that ADM-IP can drastically accelerate the inference process, i.e. it obtains better results than the baseline with shorter sampling trajectories. For example, with only 60 or 80 steps, ADM-IP gets a better or an equivalent FID than ADM (tested with the standard 1,000 sampling steps) on all datasets except CelebA, where ADM-IP needs 200 sampling steps to reach the same result. This comparison shows a remarkable $5\times$ to $16.7\times$ speed-up of the inference stage, which is particularly significant for the larger resolution FFHQ dataset.
Finally, we measure the recall and precision for the generated samples using the method of Kynkäänniemi et al. (2019). The results show that the recall and precision achieved by ADM and ADM-IP have no significant difference, which indicates that our input perturbation does not affect the sample diversity (see Appendix A.4).
Comparison with DDIMs. In order to show the generality of our proposal, we use Alg. 3 with the Denoising Diffusion Implicit Models (DDIMs) proposed by Song et al. (2021a) (Sec. 2). We train both the baseline (DDIM) and our method (DDIM-IP) on CIFAR10 using the public code provided by Song et al. (2021a). Since training with DDIM is particularly slow, we use only CIFAR10 for this comparis