Diffusion Models without Classifier-free Guidance
无分类器引导的扩散模型
Zhicong Tang 1 Jianmin Bao 2 Dong Chen 2 Baining Guo 2
Zhicong Tang 1 Jianmin Bao 2 Dong Chen 2 Baining Guo 2
Abstract
摘要
This paper presents Model-guidance (MG), a novel objective for training diffusion model that addresses and removes of the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely data distribution to incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieve exceptional quality that parallel and even surpass concurrent diffusion models with CFG. Extensive experiments demonstrate the effectiveness, efficiency, s cal ability on different models and datasets. Finally, we establish state-of-theart performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at github.com/tzco/Diffusion-wo-CFG.
本文提出了模型引导(Model-guidance, MG),一种用于训练扩散模型的新目标,旨在解决并取代常用的无分类器引导(Classifier-free guidance, CFG)。我们的创新方法超越了仅对数据分布进行标准建模的范畴,将条件后验概率纳入其中。该技术源于CFG的思想,简单而有效,使其成为现有模型的即插即用模块。我们的方法显著加速了训练过程,使推理速度翻倍,并实现了与甚至超越使用CFG的并发扩散模型的卓越质量。大量实验证明了该方法在不同模型和数据集上的有效性、效率和可扩展性。最后,我们在ImageNet 256基准测试中取得了最先进的性能,FID为1.34。我们的代码可在github.com/tzco/Diffusion-wo-CFG获取。
Figure 1: We propose Model-guidance (MG), removing Classifier-free guidance (CFG) for diffusion models and achieving state-of-the-art on ImageNet with FID of 1.34. (a) Instead of running models twice during inference (green and red), MG directly learns the final distribution (blue). (b) MG requires only one line of code modification while providing excellent improvements. (c) Comparing to concurrent methods, MG yields lowest FID even without CFG.
图 1: 我们提出了模型引导 (Model-guidance, MG),去除了扩散模型中的无分类器引导 (Classifier-free guidance, CFG) ,并在 ImageNet 上实现了 1.34 的 FID 值,达到了最先进水平。(a) 在推理过程中,MG 不再运行两次模型 (绿色和红色),而是直接学习最终分布 (蓝色)。(b) MG 只需修改一行代码即可提供显著的改进。(c) 与同时期的方法相比,即使没有 CFG,MG 也能实现最低的 FID。
1. Introduction
1. 引言
Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021a;b) have become the cornerstone of many successful generative models, e.g. image generation (Dhariwal & Nichol, 2021; Nichol et al., 2022; Rombach et al., 2022; Podell et al., 2024; Chen et al., 2024) and video generation (Ho et al., 2022; Blattmann et al., 2023; Gupta et al., 2025; Polyak et al., 2024; Wang et al., 2024) tasks. However, diffusion models also struggle to generate “low temperature” samples (Ho & Salimans, 2021; Karras et al., 2024) due to the nature of training objectives, and techniques such as Classifier guidance (Dhariwal & Nichol, 2021) and Classifier-free guidance (CFG) (Ho & Salimans, 2021) are proposed to improve performances.
扩散模型 (Diffusion models) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021a;b) 已成为许多成功的生成模型的基石,例如图像生成 (Dhariwal & Nichol, 2021; Nichol et al., 2022; Rombach et al., 2022; Podell et al., 2024; Chen et al., 2024) 和视频生成 (Ho et al., 2022; Blattmann et al., 2023; Gupta et al., 2025; Polyak et al., 2024; Wang et al., 2024) 任务。然而,由于训练目标的本质,扩散模型也难以生成“低温”样本 (Ho & Salimans, 2021; Karras et al., 2024),因此提出了分类器引导 (Classifier guidance) (Dhariwal & Nichol, 2021) 和无分类器引导 (Classifier-free guidance, CFG) (Ho & Salimans, 2021) 等技术来提高性能。
Despite its advantage and ubiquity, CFG has several drawbacks (Karras et al., 2024) and poses challenges to effective implementations (Kynk¨a¨anniemi et al., 2024) of diffusion models. One critical limitation is the simultaneous training of unconditional model apart from the main diffusion model. The unconditional model is typically implemented by randomly dropping the condition of training pairs and replacing with an manually defined empty label. The introduction of additional tasks may reduce network capabilities and lead to skewed sampling distributions (Karras et al., 2024; Kynka¨a¨nniemi et al., 2024). Furthermore, CFG requires two forward passes per denoising step during inference, one for the conditioned and another for the un conditioned model, thereby significantly escalating the computational costs.
尽管 CFG 具有优势和普遍性,但它也存在一些缺点 (Karras et al., 2024),并对扩散模型的有效实现提出了挑战 (Kynk¨a¨anniemi et al., 2024)。一个关键的限制是除了主扩散模型外,还需要同时训练无条件模型。无条件模型通常通过随机丢弃训练对的条件并替换为手动定义的空标签来实现。引入额外的任务可能会降低网络能力,并导致采样分布偏差 (Karras et al., 2024; Kynka¨a¨nniemi et al., 2024)。此外,CFG 在推理过程中每个去噪步骤需要两次前向传递,一次用于条件模型,另一次用于无条件模型,从而显著增加了计算成本。
In this work, we propose Model-guidance (MG), an innovative method for diffusion models to effectively circumvent CFG and boost performances, thereby eliminating the limitations above. We propose a novel objective that transcends from simply modeling the data distribution to incorporating the posterior probability of conditions. Specifically, we leverage the model itself as an implicit classifier and directly learn the score of calibrated distribution during training.
在本工作中,我们提出了模型引导 (Model-guidance, MG),这是一种创新方法,能够有效规避 CFG 并提升扩散模型的表现,从而消除上述限制。我们提出了一种新颖的目标,从简单地建模数据分布扩展到包含条件的后验概率。具体来说,我们利用模型本身作为隐式分类器,在训练过程中直接学习校准分布的评分。
As depicted in Figure 1, our proposed method confers multiple substantial breakthroughs. First, it significantly refines generation quality and accelerates training processes, with experiments showcasing $\mathrm{a}\geq,6.5\times$ convergence speedup than vanilla diffusion models with excellent quality. Second, the inference speed is doubled with our method, as each denoising step needs only one network forward in contrast to two in CFG. Besides, it is easy to implement and requires only one line of code modification, making it a plug-and-play module of existing diffusion models with instant improvements. Finally, it is an end-to-end method that excels traditional two-stage distillation-based approaches and even outperforms CFG in generation performances.
如图 1 所示,我们提出的方法带来了多项重大突破。首先,它显著提高了生成质量并加速了训练过程,实验表明与普通扩散模型相比,其收敛速度提高了 $\mathrm{a}\geq,6.5\times$,且质量优异。其次,使用我们的方法后,推理速度提高了一倍,因为每个去噪步骤只需要一次网络前向传播,而 CFG 需要两次。此外,它易于实现,只需修改一行代码,即可作为现有扩散模型的即插即用模块,实现即时改进。最后,它是一种端到端的方法,优于传统的基于蒸馏的两阶段方法,甚至在生成性能上超过了 CFG。
We conduct comprehensive experiments on the prevalent Imagenet (Deng et al., 2009; Russ a kov sky et al., 2015) benchmarks with $256!\times!256$ and $512!\times!512$ resolution and compare with a wide variates of concurrent models to attest the effec ti ve ness of our proposed method. The evaluation results demonstrate that our method not only parallels and even outperforms other approaches with CFG, but also scales to different models and datasets, making it a promising enhancement for diffusion models. In conclusion, we make the following contribution in this work:
我们在流行的 Imagenet (Deng et al., 2009; Russakovsky et al., 2015) 基准上进行了全面的实验,使用 $256!\times!256$ 和 $512!\times!512$ 分辨率,并与多种并发模型进行了比较,以证明我们提出的方法的有效性。评估结果表明,我们的方法不仅与使用 CFG 的其他方法相当甚至超越了它们,还能扩展到不同的模型和数据集,使其成为扩散模型的一个有前景的增强方法。总之,我们在这项工作中做出了以下贡献:
• We proposed a novel and effective method, Modelguidance (MG), for training diffusion models. • MG removes CFG for diffusion models and greatly accelerates both training and inference process. • Extensive experiments with SOTA results on ImageNet demonstrate the usefulness and advantages of MG.
• 我们提出了一种新颖且有效的方法——模型引导 (Modelguidance, MG),用于训练扩散模型。
• MG 去除了扩散模型中的分类器自由引导 (Classifier-Free Guidance, CFG),并大幅加速了训练和推理过程。
• 在 ImageNet 上进行的大量实验以及取得的 SOTA 结果证明了 MG 的实用性和优势。
2. Background
2. 背景
2.1. Diffusion and Flow Models
2.1. 扩散和流模型
Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021a;b) are a class of generative models that utilize forward and reverse stochastic processes to model complex data distributions.
扩散模型 (Diffusion models) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021a;b) 是一类生成式模型,利用前向和反向随机过程来建模复杂的数据分布。
The forward process adds noise and transforms data samples into Gaussian distributions as
前向过程通过添加噪声将数据样本转换为高斯分布
where $x_{t}$ represents the noised data at timestep $t$ and $\bar{\alpha}{t}=$ $\textstyle\prod{s=1}^{t}\alpha_{s}$ is the noise schedule.
其中 $x_{t}$ 表示在时间步 $t$ 的噪声数据,$\bar{\alpha}{t}=$ $\textstyle\prod{s=1}^{t}\alpha_{s}$ 是噪声调度。
Conversely, the reverse process learns to denoise and finally recover the original data distribution, which aims to reconstruct score (Sohl-Dickstein et al., 2015; Song et al., 2021b) from the noisy samples $x_{t}$ by learning
相反,逆向过程学习去噪并最终恢复原始数据分布,其目标是通过学习从噪声样本 $x_{t}$ 中重建得分 (Sohl-Dickstein et al., 2015; Song et al., 2021b) 。
where $\mu_{\theta}$ and $\Sigma_{\theta}$ are mean and variance and commonly predicted by neural networks.
其中 $\mu_{\theta}$ 和 $\Sigma_{\theta}$ 是均值和方差,通常由神经网络预测。
In common implementations, the training of diffusion mod
在常见实现中,扩散模型的训练
els leverages a re-parameterized objective that directly pre- dicts the noise at each step (Ho et al., 2020)
els 利用了一个重新参数化的目标函数,直接预测每一步的噪声 (Ho et al., 2020)。
where $x_{t}$ is derived from the forward process in Equation (1) with $x_{0}$ and $\epsilon$ drawn from dataset and Gaussian noises.
其中 $x_{t}$ 是从方程 (1) 中的前向过程推导得出的,$x_{0}$ 和 $\epsilon$ 分别来自数据集和高斯噪声。
Conditional diffusion models allow users to generate samples aligned with specified demands and precisely control the contents of samples. In this case, the generation process is manipulated with give conditions $c$ , such as class labels or text prompts, where network functions are $\epsilon_{\theta}(x_{t},t,c)$ .
条件扩散模型允许用户生成符合指定需求的样本,并精确控制样本内容。在这种情况下,生成过程通过给定条件 $c$ 进行操控,例如类标签或文本提示,其中网络函数为 $\epsilon_{\theta}(x_{t},t,c)$ 。
Flow Models (Lipman et al., 2023; Liu et al., 2023; Albergo et al., 2023; Tong et al., 2024) are another emerging type of generative models similar to diffusion models. Flow models utilize the concept of Ordinary Differential Equations (ODEs) to bridge the source and target distribution and learn the directions from noise pointing to ground-truth data.
流模型 (Flow Models) (Lipman et al., 2023; Liu et al., 2023; Albergo et al., 2023; Tong et al., 2024) 是另一种新兴的生成模型,与扩散模型类似。流模型利用常微分方程 (Ordinary Differential Equations, ODEs) 的概念来连接源分布和目标分布,并学习从噪声指向真实数据的方向。
The forward process of flow models is defined as an Optimal Transport (OT) interpol ant (McCann, 1997)
流模型的前向过程被定义为最优传输(Optimal Transport, OT)插值 (McCann, 1997)
and the loss function takes the form (Lipman et al., 2023)
损失函数的形式如下 (Lipman et al., 2023)
where the ground-truth conditional flow is given by
其中真实条件流由
2.2. Classifier-Free Guidance
2.2. 无分类器引导
Classifier-free guidance (CFG) (Ho & Salimans, 2021) is a widely adopted technique in conditional diffusion models to enhance generation performance and alignment to conditions. It provides an explicit control of the focus on conditioning variables and avoids to sample within the “low temperature” regions with low quality.
Classifier-free guidance (CFG) (Ho & Salimans, 2021) 是一种在条件扩散模型中广泛采用的技术,用于增强生成性能和对条件的对齐。它提供了对条件变量关注的显式控制,并避免在低质量的“低温”区域进行采样。
The key design of CFG is to combine the posterior probability and utilize Bayes’ rule during inference time. To facilitate this, it is required to train both conditional and unconditional diffusion models. In particular, CFG trains the models to predict
CFG的关键设计是在推理时结合后验概率并利用贝叶斯规则。为了实现这一点,需要同时训练条件扩散模型和无条件扩散模型。具体来说,CFG训练模型以进行预测
where is an additional empty class introduced in common practices. During training, the model switches between the two modes with a ratio $\lambda$ .
其中 $\lambda$ 是训练过程中模型在两种模式之间切换的比例。
For inference, the model combines the conditional and unconditional scores and guides the denoising process as
在推理过程中,模型结合条件和无条件分数,并指导去噪过程
where $w$ is the guidance scale that controls the focus on conditional scores and the trade-off between generation performance and sampling diversity. CFG has become an widely adopted protocol in most of diffusion models for tasks, such as image generation and video generation.
其中 $w$ 是指导尺度,用于控制对条件得分的关注以及在生成性能和采样多样性之间的权衡。CFG 已成为大多数扩散模型中广泛采用的协议,用于图像生成和视频生成等任务。
2.3. Distillation-based Methods
Besides acceleration (Song et al., 2023), researchers (Sauer et al., 2024) also adopt distillation on diffusion models with CFG to improve sampling quality. Rectified Flow (Liu et al., 2023) disentangles generation trajectories and streamline learning difficulty by alternatively using offline model to provide training pairs for online models. Distillation is also used to learn a smaller one-step model to match the generation performance of larger multi-step models (Meng et al., 2023). Pioneering diffusion models (Black-Forest-Labs, 2024; Stability-AI, 2024) are released with a distillate d version, where CFG scale is viewed as an additional embedding to provide accurate control. However, these approaches involve two-stage learning and require extra computation and storage for offline teacher models.
除了加速 (Song et al., 2023),研究人员 (Sauer et al., 2024) 还采用了对带有 CFG 的扩散模型进行蒸馏的方法,以提高采样质量。Rectified Flow (Liu et al., 2023) 通过交替使用离线模型为在线模型提供训练对,解耦了生成轨迹并简化了学习难度。蒸馏也被用于学习一个更小的一步模型,以匹配更大的多步模型的生成性能 (Meng et al., 2023)。开创性的扩散模型 (Black-Forest-Labs, 2024; Stability-AI, 2024) 发布了蒸馏版本,其中 CFG 比例被视为额外的嵌入,以提供精确的控制。然而,这些方法涉及两阶段学习,并且需要额外的计算和存储来支持离线教师模型。
3. Method
3. 方法
3.1. Rethinking Classifier-free guidance
3.1. 重新思考无分类器引导
Due to the complex nature of visual datasets, diffusion mod- els often struggle whether to recover real image distribution or engage in the alignment to conditions. Classifier-free guidance (CFG) is then proposed and has become an indispensable ingredient of modern diffusion models (Nichol & Dhariwal, 2021; Karras et al., 2022; Saharia et al., 2022;
由于视觉数据集的复杂性,扩散模型往往难以在恢复真实图像分布与条件对齐之间进行抉择。因此,无分类器引导 (CFG) 被提出,并已成为现代扩散模型不可或缺的组成部分 (Nichol & Dhariwal, 2021; Karras et al., 2022; Saharia et al., 2022)。
Figure 2: We use a grid 2D distribution with two classes, marked with orange and gray regions, as example and train diffusion models on it. We plot the generated samples, trajectories, and probability density function (PDF) of conditional, unconditional, CFG-guided model, and our approach. (a) The first row indicates that although CFG improves quality by eliminating outliers, the samples concentrate in the center of data distributions, resulting the loss of diversity. In contrast, our method yields less outliers than the conditional model and a better coverage of data than CFG. (b) In the second row, the trajectories of CFG show sharp turns at the beginning, e.g. samples inside the red box, while our method directly drives the samples to the closet data distributions. (c) The PDF plots of the last row also suggest that our method predicts more symmetric contours than CFG, balancing both quality and diversity.
图 2: 我们使用一个带有两个类的二维网格分布,用橙色和灰色区域标记,作为示例,并在其上训练扩散模型。我们绘制了生成样本、轨迹以及条件模型、无条件模型、CFG引导模型和我们方法的条件概率密度函数 (PDF)。(a) 第一行表明,尽管CFG通过消除异常值提高了质量,但样本集中在数据分布的中心,导致多样性丧失。相比之下,我们的方法产生的异常值少于条件模型,并且比CFG更好地覆盖了数据。(b) 第二行中,CFG的轨迹在开始时显示出急剧的转弯,例如红色框内的样本,而我们的方法则直接将样本驱动到最接近的数据分布。(c) 最后一行的PDF图也表明,我们的方法预测的轮廓比CFG更对称,平衡了质量和多样性。
Hoogeboom et al., 2023). It drives the sample towards the regions with higher likelihood of conditions with Equation (9), where the images are more canonical and better modeled by networks (Karras et al., 2024).
Hoogeboom et al., 2023)。它通过方程(9)将样本推向条件概率较高的区域,这些区域的图像更规范且更容易被网络建模 (Karras et al., 2024)。
However, CFG has with several disadvantages (Karras et al., 2024; Kynk¨a¨anniemi et al., 2024), such as the multitask learning of both conditional and unconditional generation, and the doubled number of function evaluations (NFEs) during inference. Moreover, the tempting property that solving the denoising process according to Equation (9) eventually recovers data distribution does not hold, as the joint distribution does not represent a valid heat diffusion of the ground-truth (Zheng & Lan, 2024). This results in exaggerated truncation and mode dropping similar to (Karras et al., 2018; Brock et al., 2019; Sauer et al., 2022), since the samples are blindly pushed towards the regions with higher posterior probability. The generation trajectories are distorted in Section 1, the images are often over-saturated in color, and the content of samples is overly simplified.
然而,CFG (Class-Free Guidance) 存在一些缺点 (Karras et al., 2024; Kynk¨a¨anniemi et al., 2024),例如需要同时进行条件生成和无条件生成的多任务学习,以及在推理过程中函数评估次数 (NFEs) 翻倍。此外,根据公式 (9) 解决去噪过程最终恢复数据分布的诱人特性并不成立,因为联合分布并不代表真实数据的热扩散 (Zheng & Lan, 2024)。这导致了类似于 (Karras et al., 2018; Brock et al., 2019; Sauer et al., 2022) 的过度截断和模式丢失,因为样本被盲目推向具有更高后验概率的区域。在第1部分中,生成轨迹被扭曲,图像颜色通常过饱和,且样本内容过于简化。
CFG originates from the classifier-guidance (Dhariwal & Nichol, 2021) that incorporates an auxiliary classifier model $p_{\theta}\big(c|x_{t}\big)$ to modify the sampling distribution as
CFG 源自分类器引导 (classifier-guidance) (Dhariwal & Nichol, 2021),它结合了一个辅助分类器模型 $p_{\theta}\big(c|x_{t}\big)$ 来修改采样分布。
and estimates the posterior probability term with Bayes’ rule
并使用贝叶斯规则估计后验概率项
where $p_{\theta}(\boldsymbol{x}{t}|\boldsymbol{c})$ and $p{\theta}{\left(x_{t}\right)}$ are conditional and unconditional distributions, respectively.
其中 $p_{\theta}(\boldsymbol{x}{t}|\boldsymbol{c})$ 和 $p{\theta}{\left(x_{t}\right)}$ 分别表示条件分布和无条件分布。
The unconditional model is usually implemented by randomly replacing labels by an empty class with a ratio $\lambda$ During inference, each sample is typically forwarded twice, one with and one without conditions. The finding naturally leads us to the question: can we fuse the auxiliary classifier into diffusion models in a more efficient and elegant way?
无条件模型通常通过以比例 $\lambda$ 随机将标签替换为空类来实现。在推理过程中,通常每个样本会前向传递两次,一次有条件和一次无条件。这一发现自然引出了一个问题:我们能否以更高效和优雅的方式将辅助分类器融入扩散模型中?
Figure 3: Illustration of our method. (a) The green and red arrow point towards the centroids of data distributions, as the training pairs $(x_{0},\epsilon)$ are randomly sampled. (b) While CFG provides accurate directions by subtracting the two vectors, our method directly learns the blue arrow, $\nabla\log\tilde{p}{\theta}(x{t}|c)$ .
图 3: 我们的方法示意图。(a) 绿色和红色箭头指向数据分布的质心,因为训练对 $(x_{0},\epsilon)$ 是随机采样的。(b) 虽然 CFG 通过减去两个向量提供了准确的方向,但我们的方法直接学习蓝色箭头 $\nabla\log\tilde{p}{\theta}(x{t}|c)$。
3.2. Model-guidance Loss
3.2. 模型引导损失
Conditional diffusion models optimize the conditional probability $p_{\theta}(\boldsymbol{x}{t}|\boldsymbol{c})$ by Equation (3), where $x{t}$ is the noisy data and $c$ is the condition, e.g. , labels and prompts. However, the models tend to ignore the condition in common practices and CFG (Ho et al., 2020) is proposed as an explicit bias.
条件扩散模型通过公式 (3) 优化条件概率 $p_{\theta}(\boldsymbol{x}{t}|\boldsymbol{c})$ ,其中 $x{t}$ 是噪声数据,$c$ 是条件,例如标签和提示。然而,这些模型在实践中往往忽略条件,因此 CFG (Ho et al., 2020) 被提出作为一种显式偏差。
To enhance both generation quality and alignment to conditions, we propose to take into account the posterior probability $p_{\theta}\big(\boldsymbol{c}|\boldsymbol{x}{t}\big)$ . This leads to the joint optimization of $\tilde{p}{\theta}(x_{t}|c),=,p_{\theta}(x_{t}|c)p_{\theta}(c|x_{t})^{w}$ , where $w$ is the weighting factor of posterior probability. The score of the joint distribution is formulated as
为了提升生成质量和对条件的对齐性,我们提出考虑后验概率 $p_{\theta}\big(\boldsymbol{c}|\boldsymbol{x}{t}\big)$。这导致了联合优化 $\tilde{p}{\theta}(x_{t}|c),=,p_{\theta}(x_{t}|c)p_{\theta}(c|x_{t})^{w}$,其中 $w$ 是后验概率的权重因子。联合分布的得分公式为
The first term corresponds to the standard diffusion objective in Equation (3). However, the second term represents the score of posterior probability $p_{\theta}\big(c|x_{t}\big)$ and cannot be directly obtained, since an explicit classifier of noisy samples is unavailable. Inspired by Equation (11), we transform the diffusion model into an implicit classifier and let it guide itself. Specifically, we employ Bayes’ rule to estimate
第一项对应于公式 (3) 中的标准扩散目标。然而,第二项表示后验概率 $p_{\theta}\big(c|x_{t}\big)$ 的得分,且无法直接获得,因为不存在明确的噪声样本分类器。受公式 (11) 启发,我们将扩散模型转化为隐式分类器,并让其自我引导。具体而言,我们采用贝叶斯规则进行估计。
Next, we use the diffusion model to approximate the scores
接下来,我们使用扩散模型来近似分数
where $\sigma_{t}$ is the variance of the noise added to $x_{t}$ at timestep $t$ , $\mathcal{Q}$ is the empty class, and $\epsilon_{\theta}(\cdot)$ is the diffusion model. Substituting Equations (14) and (15) into Equation (13) yields the score of posterior probability
其中 $\sigma_{t}$ 是在时间步 $t$ 添加到 $x_{t}$ 的噪声的方差,$\mathcal{Q}$ 是空类,$\epsilon_{\theta}(\cdot)$ 是扩散模型。将公式 (14) 和 (15) 代入公式 (13) 后,得到后验概率的得分
Then, our method applies the Bayes’ estimation in Equation (13) online and trains a conditional diffusion model to directly predict the score in Equation (12), instead of separately learning Equations (14) and (15) in the form of CFG. A straight-forward implementation is to adopt the objective in Equation (3) with a modified optimization target
随后,我们的方法在线应用了贝叶斯估计公式 (13),并训练了一个条件扩散模型来直接预测公式 (12) 中的得分,而不是以 CFG 的形式分别学习公式 (14) 和 (15)。一种直接的实现方式是采用公式 (3) 中的目标,并修改优化目标。
We apply the stop gradient operation, $\operatorname{sg}(\cdot)$ , which is a common practice of avoiding model collapse (Grill et al., 2020). We also use the Exponential Mean Average (EMA) counterpart of the online model, $\tilde{\epsilon}_{\theta}(\cdot)$ , to stabilize the training process and provide accurate estimations. For flow-based models, we have the similar objective
我们应用了停止梯度操作$\operatorname{sg}(\cdot)$,这是一种避免模型崩溃的常见做法 (Grill et al., 2020)。我们还使用了在线模型的指数移动平均 (EMA) 对应项$\tilde{\epsilon}_{\theta}(\cdot)$,以稳定训练过程并提供准确的估计。对于基于流的模型,我们有类似的目标
where $u$ is the ground-truth flow in Equation (6).
其中 $u$ 是方程 (6) 中的真实光流。
During training, we randomly drop the condition $c$ in Equations (17) and (19) to $\mathcal{Q}$ with a ratio of $\lambda$ . These formulations transform the model itself into an implicit classifier and adjust the standard training objective of diffusion model in a self-supervised manner, allowing the joint optimization of generation quality and condition alignment with the minimum modification of existing pipelines.
在训练过程中,我们以 $\lambda$ 的比例随机丢弃方程 (17) 和 (19) 中的条件 $c$ 到 $\mathcal{Q}$。这些公式将模型本身转化为隐式分类器,并以自监督的方式调整扩散模型的标准训练目标,从而在最小化现有管道修改的情况下,实现生成质量与条件对齐的联合优化。
3.3. Implementation Details
3.3. 实现细节
With the MG formulation in Equations (17) and (19), we have adequate options in the detailed implementations, such as incorporating an additional input of the guidance scale $w$ into networks, replacing the usage of empty class with the law of total probability, and whether to manual or automatically adjust the hyper-parameters.
在公式 (17) 和 (19) 中的 MG 公式化中,我们在具体实现中有多种选择,例如将引导尺度 $w$ 作为额外输入引入网络,用全概率定律替换空类的使用,以及手动或自动调整超参数。
Scale-aware networks. Similar to other distillation-based methods (Frans et al., 2024), the guidance scale $w$ can be fed into the network as an additional condition. When augmented with $w$ -input, our models offer flexible choices of the balance between image quality and sample diversity during inference time. Note that our models require only one forward per step for all values of $w$ , while standard CFG needs two forwards, e.g. , one with condition and one without condition. In particular, we sample guidance scale from an specified interval, and the loss function are modified into the following form
尺度感知网络。与其他基于蒸馏的方法类似 (Frans et al., 2024),引导尺度 $w$ 可以作为附加条件输入到网络中。当增加 $w$ 输入时,我们的模型在推理过程中提供了图像质量和样本多样性之间平衡的灵活选择。值得注意的是,我们的模型对于所有 $w$ 值只需要每步进行一次前向传播,而标准 CFG 需要两次前向传播,例如一次有条件和一次无条件。特别是,我们从指定的区间中采样引导尺度,并将损失函数修改为以下形式:
Removing the empty class. Another option is whether to perform multitask learning of both conditional and unconditional generation with the same model. In CFG, the estimator in Equation (11) requires to train an unconditional model. However, the multitask learning can distract and hinder model capability. Using the law of total probability
移除空类别。另一个选择是是否使用同一模型对条件生成和无条件生成进行多任务学习。在CFG中,公式(11)中的估计器需要训练一个无条件模型。然而,多任务学习可能会分散和阻碍模型的能力。使用全概率定律
where $N$ different labels are used to estimate the unconditional score, our models focus on the conditional prediction and avoid the introduction of additional empty class.
其中使用 $N$ 个不同的标签来估计无条件分数,我们的模型专注于条件预测,并避免引入额外的空类。
Automatic adjustment of the hyper-parameter $w$ . While the scale $w$ in Equations (18) and (20) plays an important role, it is tedious and costly to perform manual search during training. Therefore, we introduce an automatic scheme to adjust $w$ . We begin with $w;=;0$ that corresponds to vanilla diffusion models, then update the value with EMA according to intermediate evaluation results. The value of $w$ is raised when quality decreases and suppressed otherwise, leading to an optimums when the training converged.
超参数 $w$ 的自动调整。虽然方程 (18) 和 (20) 中的缩放因子 $w$ 起着重要作用,但在训练过程中进行手动搜索既繁琐又耗时。因此,我们引入了一种自动调整 $w$ 的方案。我们从 $w;=;0$ 开始,这对应于原始的扩散模型,然后根据中间评估结果使用 EMA 更新 $w$ 的值。当质量下降时,$w$ 的值会提高,否则会降低,从而在训练收敛时达到最优值。
4. Experiment
4. 实验
We first present a system-level comparison with state-of-theart models on ImageNet $256\times256$ conditional generation. Then we conduct ablation experiments to investigate the detained designs of our method. Especially, we emphasize on the following questions:
我们首先在 ImageNet $256\times256$ 条件生成任务上进行了与最先进模型的系统级比较。然后,我们进行了消融实验,以研究我们方法的详细设计。特别地,我们重点探讨了以下问题:
• How far can MG push the performances of existing diffusion models? (Tables 1 and 2, Section 4.2) • How does implementation details influence the gain of proposed method? (Tables 3 to 6, Section 4.3) • Can MG scales to larger models and datasets with efficiency? (Tables 7 and 8, Figures 4 to 6, Section 4.3)
• MG 能在多大程度上提升现有扩散模型的性能? (表 1 和表 2,第 4.2 节)
• 实现细节如何影响所提出方法的增益? (表 3 到表 6,第 4.3 节)
• MG 能否高效地扩展到更大的模型和数据集? (表 7 和表 8,图 4 到图 6,第 4.3 节)
Table 1: Experiments on ImageNet 256 without CFG. By deploying our method, the performances of both DiT-XL/2 and SiT-XL/2 are greatly boosted, achieving state-of-the-art.
表 1: 在 ImageNet 256 上不使用 CFG 的实验。通过部署我们的方法,DiT-XL/2 和 SiT-XL/2 的性能都得到了显著提升,达到了最先进的水平。
MODEL | FID↓ | SFID↓ | IS↑ | PRE.↑ | REC.↑ | IMG/S↑ |
---|---|---|---|---|---|---|
ADM VDM++ | 10.9 | 2.40 | 101.0 | 0.69 | 0.63 | |
LDM-4 U-V1T-H | 10.5 | 103.5 | 0.71 | 0.62 | ||
8.97 | 136.7 | 0.69 | 0.63 | |||
MDTv2 | 5.06 | - | 155.6 | 0.72 | 0.66 | 0.2 |
REPA | 5.90 | 6.33 | 162.1 | 0.71 | 0.56 | 0.76 |
L-D1T | 2.17 | 4.36 | 205.6 | 0.77 | 0.65 | 0.06 |
VARd30 | 2.16 | 288.7 | 0.81 | 0.61 | 11.2 | |
RARxXL | 3.83 | 274.5 | 0.79 | 0.61 | 3.9 | |
MAR-H | 2.35 | 227.8 | 0.79 | 0.62 | 0.6 | |
D1T-XL/2 | ||||||
+MG(ours) | 9.62 | 6.85 | 121.5 | 0.67 | 0.67 | 0.2 |
IMPROVE | 2.03 | 4.36 | 292.1 | 0.81 | 0.66 | 0.2 |
78.9% | 36.4% | 140% | 20.9% | 1.49% | 0.0% | |
SIT-XL/2 | 8.61 | 6.32 | 131.7 | 0.68 | 0.67 | 0.76 |
+MG(ours) | 1.34 | 4.58 | 321.5 | 0.81 | 0.65 | 0.76 |
IMPROVE | 84.4% | 27.5% | 144% | 19.1% | 2.99% | 0.0% |
Table 2: Experiments on ImageNet 256 with CFG. Comparing to models with CFG, our method still obtains excellent results and surpasses others without efficiency loss.
表 2: 在 ImageNet 256 上使用 CFG 的实验。与使用 CFG 的模型相比,我们的方法仍然取得了出色的结果,并且在效率上没有损失。
MODEL | FID↓ | SFID↓ | IS↑ | PRE.↑ | REC.↑ | IMG/S↑ |
---|---|---|---|---|---|---|
ADM VDM++ | 4.59 2.12 | 5.25 | 186.7 267.7 | 0.82 0.81 | 0.52 0.65 | |
LDM U-VIT-H MDTv2 REPA L-D1T | 3.60 2.29 1.58 1.42 1.35 | 5.68 4.52 4.70 4.15 | 247.7 263.9 314.7 305.7 295.3 | 0.87 0.82 0.79 0.80 0.79 | 0.48 0.57 0.65 0.65 0.65 | = 0.1 0.39 0.03 |
VARd30 RARxXL MAR-H D1T-XL/2 | 1.73 1.48 1.55 2.27 | - 4.60 | 350.2 326.0 303.7 278.2 | 0.82 0.80 0.81 0.83 | 0.60 0.63 0.62 0.57 | 6.3 2.1 0.3 0.1 |
+MG(ours) IMPROVE SIT-XL/2 +MG(ours) IMPROVE | 2.03 10.6% 2.06 1.34 35.0% | 4.36 5.22% 4.49 4.58 2.00% | 292.1 5.00% 277.5 321.5 15.9% | 0.81 2.41% 0.83 0.81 2.41% | 0.66 15.8% 0.59 0.65 10.2% | 0.2 100% 0.39 0.76 94.9% |
4.1. Setup
4.1. 设置
Implementation and dataset. We follow the experiment pipelines in DiT (Peebles & Xie, 2023) and SiT (Ma et al., 2024). We use ImageNet (Deng et al., 2009; Russ a kov sky et al., 2015) dataset and the Stable Diffusion (Rombach et al., 2022) VAE to encode $256\times256$ images into the latent space of $\mathbb{R}^{32\times32\times4}$ . We conduct ablation experiments with the $\scriptstyle\mathrm{B}/2$ variant of DiT and SiT models and train for 400K iterations. During training, we use AdamW (Kingma, 2014; Loshchilov, 2019) optimizer and a batch size of 256 in consistent with DiT (Peebles & Xie, 2023) and SiT (Ma et al., 2024) for fair comparisons. For inference, we use 1000 sampling steps for DiT models and Euler-Maruyama sampler with 250 steps for SiT.
Baseline Models. We compare with several state-of-theart image generation models, including both diffusionbased and AR-based methods, which can be classified into the following three classes: (a) Pixel-space diffusion:
Table 3: Experiments on scale w. Diffusion Models without Classifier-Free Guidance
表 3: 无分类器引导的扩散模型规模实验
MODEL | FID↓ | SFID↓ | IS↑ | PRE.↑ | REC.↑ | |
---|---|---|---|---|---|---|
D1T-B/2 | 1.00 | 43.5 | 36.7 | 39.23 | 0.62 | 0.34 |
+MG(ours) | 1.25 | 9.86 | 8.87 | 176.1 | 0.81 | 0.37 |
+MG(ours) | 1.50 | 7.24 | 5.56 | 189.2 | 0.84 | 0.38 |
+MG(ours) | 1.75 | 8.21 | 6.63 | 197.2 | 0.86 | 0.38 |
+MG(ours) | 2.00 | 9.66 | 7.90 | 224.7 | 0.85 | 0.39 |
+MG(ours) | AUTO | 7.60 | 6.29 | 192.4 | 0.85 | 0.38 |
SIT-B/2 | 1.00 | 33.0 | 27.8 | 65.24 | 0.68 | 0.35 |
+MG(ours) | 1.25 | 8.94 | 7.87 | 194.3 | 0.83 | 0.38 |
+MG(ours) | 1.50 | 6.49 | 5.69 | 212.3 | 0.86 | 0.38 |
+MG(ours) | 1.75 | 8.03 | 6.91 | 221.0 | 0.86 | 0.39 |
+MG(ours) | 2.00 | 9.14 | 7.99 | 236.7 | 0.88 | 0.40 |
+MG(ours) | AUTO | 6.86 | 5.88 | 219.1 | 0.87 | 0.38 |
Table 4: Experiments on drop ratio $\lambda$ .
表 4: 关于丢弃比例 $\lambda$ 的实验
MODEL | λ | FID↓ | sFID↓ | IS↑ | PRE.↑ | REC.↑ |
---|---|---|---|---|---|---|
D1T-B/2 | 1.00 | 43.5 | 36.7 | 39.23 | 0.62 | 0.34 |
+MG(ours) | 0.05 | 11.7 | 9.90 | 156.7 | 0.78 | 0.33 |
+MG(ours) | 0.10 | 7.24 | 5.56 | 189.2 | 0.84 | 0.38 |
+MG(ours) | 0.15 | 7.62 | 5.99 | 183.4 | 0.83 | 0.38 |
+MG(ours) | 0.20 | 9.01 | 7.04 | 171.7 | 0.81 | 0.36 |
SIT-B/2 | 1.00 | 33.0 | 27.8 | 65.24 | 0.68 | 0.35 |
+MG(ours) | 0.05 | 10.8 | 9.25 | 168.8 | 0.80 | 0.34 |
+MG(ours) | 0.10 | 6.49 | 5.69 | 212.3 | 0.86 | 0.38 |
+MG(ours) | 0.15 | 6.77 | 5.89 | 207.4 | 0.85 | 0.37 |
+MG(ours) | 0.20 | 8.87 | 8.06 | 199.6 | 0.84 | 0.37 |
ADM (Dhariwal & Nichol, 2021), $\mathrm{VDM++}$ (Kingma & Gao, 2023); (b) Latent-space diffusion: LDM (Rombach et al., 2022), U-ViT (Bao et al., 2023), MDTv2 (Gao et al., 2023), REPA (Yu et al., 2024b), Lightning D iT(L-DiT) (Yao & Wang, 2025), DiT (Peebles & Xie, 2023), SiT (Ma et al., 2024); (c) Auto-regressive models: VAR (Tian et al., 2024), RAR (Yu et al., 2024a), MAR (Li et al., 2024). These models consist of strong baselines and demonstrate the advantages of our method. Although our method does not requires CFG during inference, we still compare with these baselines under two settings, with and without CFG, for thoroughly investigations.
ADM (Dhariwal & Nichol, 2021), $\mathrm{VDM++}$ (Kingma & Gao, 2023); (b) 潜在空间扩散: LDM (Rombach et al., 2022), U-ViT (Bao et al., 2023), MDTv2 (Gao et al., 2023), REPA (Yu et al., 2024b), Lightning D iT (L-DiT) (Yao & Wang, 2025), DiT (Peebles & Xie, 2023), SiT (Ma et al., 2024); (c) 自回归模型: VAR (Tian et al., 2024), RAR (Yu et al., 2024a), MAR (Li et al., 2024)。这些模型构成了强大的基线,并展示了我们方法的优势。尽管我们的方法在推理过程中不需要CFG,但我们仍然在两种设置下(使用和不使用CFG)与这些基线进行比较,以便进行全面的研究。
Evaluation metrics. We report the commonly used Frechet inception distance (Heusel et al., 2017) with 50,000 samples (FID-50K). In addition, we report sFID (Nash et al., 2021), Inception Score (IS) (Salimans et al., 2016), Precision (Pre.), and Recall (Rec.) (Kynk¨a¨anniemi et al., 2019) as supplementary metrics. We also report the time to generate one sample of each model in seconds to measure the trade-off between generation quality and computation budget.
评估指标。我们报告了常用的 Frechet inception distance (Heusel 等人,2017),使用了 50,000 个样本 (FID-50K)。此外,我们还报告了 sFID (Nash 等人,2021)、Inception Score (IS) (Salimans 等人,2016)、Precision (Pre.) 和 Recall (Rec.) (Kynk¨a¨anniemi 等人,2019) 作为补充指标。我们还报告了每个模型生成一个样本所需的时间(以秒为单位),以衡量生成质量与计算预算之间的权衡。
4.2. Overall Performances
4.2. 整体性能
First of all, we present a through system-level comparison with recent state-of-the-art image generation approaches on ImageNet $256\times256$ dataset in Tables 1 and 2. As shown in Table 1, both DiT-XL/2 and SiT-XL/2 models greatly benefit from our method, ac