Magic 1-For-1: Generating One Minute Video Clips within One Minute
Hongwei Yi2 Shitong Shao2 Tian Ye2 Jiantong Zhao2 Qingyu Yin2
Michael Lingelbach2 Li Yuan1 Yonghong Tian1 Enze Xie3 Daquan Zhou1†∗
1 Peking University 2 Hedra Inc. 3 Nvidia
https://magic-141.github.io/Magic-141/
Abstract
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speedup by applying adversarial step distillation; and 3) inference memory cost reduction with parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second on average to generate each 1-second video clip. We conduct a series of preliminary explorations to find the optimal trade-off between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
Figure 1: The comparative experimental results on General VBench highlight the strong performance of Magic 1-For-1. Our model surpasses other open-source TI2V models, including CogVideoX-I2V-SAT, I2VGen-XL, SEINE-512x320, VideoCrafter-I2V, and SVD-XT-1.0, in terms of both performance and efficiency.
Figure 2: Magic 1-For-1 can generate video clips with an optimized efficiency-quality trade-off.
1 Introduction
Recently, diffusion models have demonstrated superior performance in generating high-quality images and videos, with significantly wider diversity than traditional generative adversarial network (GAN)-based methods [4]. However, conventional diffusion models require hundreds or thousands of steps to gradually transform noise into structured data, making them computationally expensive and limiting their practical applications. For instance, the recent open-source video generation model [18] requires around 8 GPUs and 15 minutes to generate a single 5-second video clip without extra optimization.
Recent works on diffusion step distillation [28, 31, 26] aim to accelerate the generative process by reducing the number of inference steps while maintaining high sample quality. However, most of these works focus on image generation. Despite their promise, these methods often compromise video quality and require carefully designed architectures to achieve satisfactory results. As mentioned in SeaweedAPT [20], when transferred to video generation, most one-step video generation algorithms still suffer from significant quality degradation, for example in motion dynamics and structure.
In this report, we propose to simplify the video generation task by factorizing it into two separate subtasks: text-to-image generation and image-to-video generation. Different from Emu Video [10], which focuses on improving video generation quality, we focus on reducing the number of diffusion inference steps. The text-to-image task benefits from a wide range of prior research, making it inherently simpler and requiring fewer diffusion steps.
We experimentally observe that diffusion step distillation algorithms converge significantly faster on text-to-image generation while also improving the final generation quality. To further reinforce the generative prior, we incorporate multi-modal inputs by augmenting the text encoder with visual inputs. By leveraging quantization techniques, we reduce memory consumption from 40 GB to 28 GB and cut the required diffusion steps from 50 to just 4, achieving optimized performance with minimal degradation in quality. In summary, our contributions are:
• Generative Prior Injection: We introduce a novel acceleration strategy for video generation by injecting stronger generative priors through task factorization.
• Multi-Modal Guidance: We propose a multi-modal guidance mechanism that accelerates model convergence by incorporating visual inputs alongside text inputs.
• Comprehensive Optimization: By combining a suite of model acceleration techniques, we achieve a state-of-the-art trade-off between generation quality and inference efficiency.
2 Related Works
Video generation has been a rapidly evolving research area, leveraging advancements in deep learning and generative models. Early approaches relied on autoregressive models [17], which generate videos frame-by-frame based on past frames, but suffered from compounding errors over time. Variational Autoencoders (VAEs) [2] and Generative Adversarial Networks (GANs) [34, 7] were later introduced to improve video synthesis, offering better visual fidelity but struggling with temporal consistency over longer sequences.
More recent breakthroughs have been driven by Transformer-based architectures and diffusion models. VideoGPT [39] adapts GPT-like architectures for video tokenization and synthesis, capturing long-term dependencies effectively. Diffusion models have also demonstrated promising results, with works like Video Diffusion Models (VDMs) [12] extending the success of image diffusion models to videos by modeling spatiotemporal dynamics through iterative denoising. Additionally, latent space-based methods such as Imagen Video [13] and Make-A-Video [30] focus on efficient high-resolution synthesis by operating in a compressed representation space. These methods have significantly advanced the field, enabling the generation of realistic, temporally coherent videos with high fidelity.
Recent advancements in diffusion step distillation have introduced various methods to accelerate the sampling process of diffusion models, aiming to reduce computational costs while maintaining high-quality outputs. Salimans and Ho (2022) proposed Progressive Distillation, which iteratively halves the number of sampling steps required by training student models to match the output of teacher models with more steps, thereby achieving faster sampling without significant loss in image quality [28]. Meng et al. (2023) developed a distillation approach for classifier-free guided diffusion models, significantly speeding up inference by reducing the number of sampling steps required for high-fidelity image generation [21]. Geng et al. (2024) introduced a method that distills diffusion models into single-step generator models using deep equilibrium architectures, achieving efficient sampling with minimal loss in image quality [9]. Watson et al. (2024) proposed EM Distillation, a maximum likelihood-based approach that distills a diffusion model into a one-step generator model, effectively reducing the number of sampling steps while preserving perceptual quality [36]. Zhou et al. (2024) presented Simple and Fast Distillation (SFD), which simplifies existing distillation paradigms and reduces fine-tuning time by up to 1000 times while achieving high-quality image generation [42]. Yin et al. (2024) introduced Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality [40]. These developments contribute to making diffusion-based generative models more practical for real-world applications by substantially decreasing computational demands.
3 Method
Image Prior Generation We use both diffusion-based and retrieval-based methods to obtain the image prior. We define a unified function $\mathcal{G}$ that combines both diffusion-based generation and retrieval-based augmentation:
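A plausible form of this unified function, written with the symbols defined below and treating the retrieved images as an additional conditioning input (the notation $\hat{\mathbf{x}}_{0}$ for the resulting image prior is ours, not the paper's), is:

$$\hat{\mathbf{x}}_{0}=\mathcal{G}\big(\mathbf{x}_{T},\,y,\,\mathcal{R}(y);\,\theta\big),\qquad\mathbf{x}_{T}\sim\mathcal{N}(0,I).$$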
where $\mathbf{x}_{T}\sim\mathcal{N}(0,I)$ represents the initial noise, $y$ is the textual input, $\mathcal{R}(y)$ is the retrieved set of relevant images, and $\theta$ represents the model parameters. The retrieval function $\mathcal{R}$ is formally defined as:
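One plausible instantiation, assuming top-$k$ nearest-neighbor retrieval under the similarity function described below (the value of $k$ is an assumption), is:

$$\mathcal{R}(y)=\operatorname*{arg\,max}_{\mathcal{S}\subseteq\mathcal{D},\,|\mathcal{S}|=k}\;\sum_{\mathbf{x}\in\mathcal{S}}\mathrm{sim}\big(\phi(y),\psi(\mathbf{x})\big),$$

i.e., the $k$ images in the corpus whose embeddings are most similar to the text embedding.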
where $\mathcal{D}$ is the image corpus, $\phi(y)$ and $\psi(\mathbf{x})$ are text and image embeddings, respectively, and $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g., cosine similarity). The retrieved images serve as additional conditioning signals in the denoising process:
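A plausible form of the retrieval-conditioned denoising step, matching the roles described below for $\epsilon_{\theta}$, the step size $\alpha_{t}$, the noise variance $\sigma_{t}$, and $\mathbf{z}$ (the exact sampler coefficients may differ), is:

$$\mathbf{x}_{t-1}=\mathbf{x}_{t}-\alpha_{t}\,\epsilon_{\theta}\big(\mathbf{x}_{t},t,y,\mathcal{R}(y)\big)+\sigma_{t}\,\mathbf{z},\qquad\mathbf{z}\sim\mathcal{N}(0,I).$$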
where $\epsilon_{\theta}$ is the learned noise prediction model, $\alpha_{t}$ is the step size, $\sigma_{t}$ is the noise variance, and $\mathbf{z}\sim\mathcal{N}(0,I)$ . This retrieval-augmented diffusion process ensures that the generated image maintains both high fidelity and factual accuracy. Recent methods such as Retrieval-Augmented Diffusion Models (RDM) [27] and kNN-Diffusion [29] have demonstrated the effectiveness of this approach, significantly improving the realism and contextual alignment of generated images.
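To make the retrieval component concrete, the sketch below implements cosine-similarity top-$k$ retrieval over a precomputed image-embedding corpus. The function name, embedding dimensionality, and the random stand-in embeddings are illustrative assumptions rather than the actual Magic 1-For-1 pipeline.

```python
import torch
import torch.nn.functional as F

def retrieve_image_priors(text_emb: torch.Tensor,
                          corpus_embs: torch.Tensor,
                          k: int = 4) -> torch.Tensor:
    """Return the indices of the top-k corpus images by cosine similarity.

    text_emb:    (d,)   text embedding phi(y).
    corpus_embs: (N, d) precomputed image embeddings psi(x) for the corpus D.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    corpus_embs = F.normalize(corpus_embs, dim=-1)
    sims = corpus_embs @ text_emb          # (N,) cosine similarities sim(phi(y), psi(x))
    return sims.topk(k).indices            # indices of the k most similar images

# Usage with random stand-ins for real embeddings:
top_idx = retrieve_image_priors(torch.randn(512), torch.randn(10_000, 512), k=4)
```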
Figure 4: The overview of model acceleration techniques, including DMD2 and CFG distillation.
Image Prior Injection and Multi-modal Guidance Design The Image-to-Video (I2V) task involves generating a video that aligns with a given text description, using an input image as the first frame. Specifically, the T2V model takes as input a latent tensor $X$ of shape $T\times C\times H\times W$ , where $T,C,H$ , and $W$ correspond to the number of frames, channels, height, and width of the compressed video, respectively.
Similar to Emu Video [10], to incorporate the image condition $I$, we treat $I$ as the first frame of the video and apply zero-padding to construct a tensor $I_{o}$ of dimensions $T\times C\times H\times W$, as shown in Fig. 3. Additionally, we introduce a binary mask $m$ with shape $T\times1\times H\times W$, where the first temporal position is set to 1, and all subsequent positions are set to zero.
Figure 3: Overall Architecture of Magic 1-For-1.
The latent tensor $X$, the padded tensor $I_{o}$, and the mask $m$ are then concatenated along the channel dimension to form the input to the model.
Since the input tensor’s channel dimension increases from $C$ to $2C+1$, as shown in Fig. 3, we adjust the parameters of the model’s first convolutional module from $\phi=(C_{in}(=C),C_{out},s_{h},s_{w})$ to $\phi^{\prime}=(C_{in}^{\prime}(=2C+1),C_{out},s_{h},s_{w})$. Here, $C_{in}$ and $C_{in}^{\prime}$ represent the input channels before and after modification, $C_{out}$ is the number of output channels, and $s_{h}$ and $s_{w}$ correspond to the height and width of the convolutional kernel, respectively. To preserve the representational capability of the T2V model, the first $C$ input channels of $\phi^{\prime}$ are copied from $\phi$, while the additional channels are initialized to zero. The I2V model is pre-trained on the same dataset as the T2V model to ensure consistency.
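The following sketch illustrates this conditioning scheme end to end: zero-padding the first-frame latent to build $I_{o}$, constructing the binary mask $m$, concatenating along the channel dimension, and inflating the first convolution from $C$ to $2C+1$ input channels with the original weights copied and the extra channels zero-initialized. The 2D convolution and tensor layout are simplifying assumptions, not the exact Magic 1-For-1 modules.

```python
import torch
import torch.nn as nn

def build_i2v_input(x: torch.Tensor, first_frame_latent: torch.Tensor) -> torch.Tensor:
    """Concatenate the video latent X, the zero-padded image condition I_o,
    and the binary mask m along the channel dimension.

    x:                  (T, C, H, W) video latent.
    first_frame_latent: (C, H, W)    latent of the conditioning image I.
    Returns:            (T, 2C + 1, H, W)
    """
    T, C, H, W = x.shape
    i_o = torch.zeros_like(x)
    i_o[0] = first_frame_latent                              # image condition as frame 0
    m = torch.zeros(T, 1, H, W, dtype=x.dtype, device=x.device)
    m[0] = 1.0                                               # mark the conditioned frame
    return torch.cat([x, i_o, m], dim=1)

def inflate_first_conv(conv: nn.Conv2d) -> nn.Conv2d:
    """Expand the first convolution from C to 2C + 1 input channels: copy the
    pre-trained weights into the first C channels, zero-init the rest."""
    C = conv.in_channels
    new_conv = nn.Conv2d(2 * C + 1, conv.out_channels,
                         kernel_size=conv.kernel_size, stride=conv.stride,
                         padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :C] = conv.weight                 # preserve T2V capability
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv
```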
To further enhance the semantic alignment of reference images, we extract their embeddings via the vision branch of the VLM text encoder and concatenate them with the text embeddings, as illustrated in Fig. 3. This integration improves the model’s ability to generate videos that better capture the context provided by both the image and the textual description.
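A minimal sketch of this multi-modal guidance, assuming the vision-branch tokens are linearly projected to the text-embedding width before being concatenated along the token dimension (the projection layer and all shapes are hypothetical):

```python
import torch
import torch.nn as nn

class MultiModalGuidance(nn.Module):
    """Concatenate projected vision-branch embeddings with text embeddings."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)  # map image tokens to the text width

    def forward(self, text_emb: torch.Tensor, vision_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:   (B, L_text, text_dim)   tokens from the text encoder
        # vision_emb: (B, L_img, vision_dim)  tokens from the VLM vision branch
        return torch.cat([text_emb, self.proj(vision_emb)], dim=1)

# Usage with random stand-ins for the two embedding streams:
guidance = MultiModalGuidance(vision_dim=1024, text_dim=4096)
fused = guidance(torch.randn(1, 77, 4096), torch.randn(1, 256, 1024))  # (1, 333, 4096)
```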
3.1 Diffusion Distillation
The iterative nature of diffusion model inference, characterized by its multi-step sampling procedure, introduces a significant bottleneck to inference speed. This issue is particularly exacerbated in large-scale models such as our 13B parameter diffusion model, Magic 1-For-1, where the computational cost of each individual sampling step is substantial. As depicted in Fig. 4, we address this challenge by implementing a dual distillation approach, incorporating both step distillation and CFG distillation to achieve faster sampling. For step distillation, we leverage DMD2, a state-of-the-art algorithm engineered for efficient distribution alignment and accelerated sampling. Inspired by score distillation sampling (SDS) [25], DMD2 facilitates step distillation through a coordinated training paradigm involving three distinct models. These include: the one/four-step generator $G_{\phi}$, whose parameters are iteratively optimized; the real video model $u_{\theta}^{\mathrm{real}}$, tasked with approximating the underlying real data distribution $p_{\mathrm{real}}$; and the fake video model $u_{\theta}^{\mathrm{fake}}$, which estimates the generated (fake) data distribution $p_{\mathrm{fake}}$. Crucially, all three models are initialized from the same pre-trained model, ensuring coherence and streamlining the training process. The distribution matching objective for step distillation can be mathematically expressed as:
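A plausible form of this objective, following the standard DMD-style distribution-matching gradient written with the velocity models $u_{\theta}^{\mathrm{real}}$ and $u_{\theta}^{\mathrm{fake}}$ and the few-step generator output $\hat{z}_{0}=G_{\phi}(\cdot)$ (the timestep weighting $w_{t}$ and the sign convention used in Magic 1-For-1 may differ), is:

$$\nabla_{\phi}\mathcal{L}_{\mathrm{DMD}}\;\approx\;\mathbb{E}_{t,\,\mathbf{z}_{t}}\Big[w_{t}\,\big(u_{\theta}^{\mathrm{fake}}(\mathbf{z}_{t},t)-u_{\theta}^{\mathrm{real}}(\mathbf{z}_{t},t)\big)\,\frac{\partial\hat{z}_{0}}{\partial\phi}\Big].$$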
Here, $\mathbf{z}_{t}$ stands for the video latent at the timestep $t$, and $\mathbf{z}_{t}=\sigma_{t}z_{1}+(1-\sigma_{t})\hat{z}_{0}$, with $\hat{z}_{0}$ representing the output synthesized by the few-step generator, and $\sigma_{t}$ denoting the noise schedule. This formulation reframes the conventional score function-based distribution matching (inherent in standard DMD2) into a novel approach that focuses on distribution alignment at time step $t=0$. This adaptation is crucial to ensure consistency with the training methodology employed in Magic 1-For-1. Furthermore, DMD2 necessitates the real-time updating of $u_{\theta}^{\mathrm{fake}}$ to guarantee an accurate approximation of the fake data distribution, $p_{\mathrm{fake}}$. This update is governed by the following loss function:
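A plausible form of this update is the same flow-matching regression used during pre-training, applied to latents built from the few-step generator's outputs $\hat{z}_{0}$ (with $\mathbf{z}_{t}=\sigma_{t}z_{1}+(1-\sigma_{t})\hat{z}_{0}$ and $z_{1}\sim\mathcal{N}(0,I)$ as above; the authors' exact loss may differ):

$$\mathcal{L}_{\mathrm{fake}}=\mathbb{E}_{t,\,\hat{z}_{0},\,z_{1}}\Big[\big\|u_{\theta}^{\mathrm{fake}}(\mathbf{z}_{t},t)-\big(z_{1}-\hat{z}_{0}\big)\big\|_{2}^{2}\Big].$$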
In the practical implementation, training DMD2 requires the simultaneous use of three models, which makes it infeasible to fit them into memory for standard training, even under a ZeRO-3 configuration with 2 nodes of 8 GPUs, each having 80 GB of memory. To address this limitation, we propose leveraging LoRA to handle the parameter updates of the fake model. Additionally, we observed that directly training with the standard DMD2 approach often leads to training collapse. This issue arises because the input to the real model is derived from the output of the few-step generator, whose data distribution differs significantly from the training data distribution used during the pre-training phase. To mitigate this problem, we adopt a straightforward yet effective solution: slightly shifting the parameters of the real model towards those of the fake model. This is achieved by adjusting the weight factor associated with the low-rank branch. This adjustment helps align the data distributions, ensuring stable training.
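The sketch below illustrates both tricks on a single linear layer: the fake model is updated only through a trainable low-rank (LoRA) branch, and the real model is nudged towards the fake model by merging a down-weighted copy of that branch into its weights. The class, the rank, and the weight factor are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank (LoRA) branch."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # only the LoRA branch is trained
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

@torch.no_grad()
def shift_real_towards_fake(real_linear: nn.Linear,
                            fake_layer: LoRALinear,
                            weight_factor: float = 0.1) -> None:
    """Nudge the real model towards the fake model by merging a down-weighted
    copy of the fake model's low-rank update into the real weights."""
    delta = fake_layer.lora_b @ fake_layer.lora_a        # (out_features, in_features)
    real_linear.weight += weight_factor * fake_layer.scaling * delta
```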
During the inference phase of diffusion models, Classifier-Free Diffusion Guidance (CFG) [14, 6] is frequently employed at each sampling step. CFG enhances the fidelity of the generated results with respect to the specified conditions by performing additional computations with the conditioning dropped out. To eliminate this computational overhead and boost inference speed, we implement CFG distillation [21]. We define a distillation objective that trains the student model, $u_{\theta}^{s}$, to directly produce the guided outputs. Specifically, we minimize the following expectations over timesteps and guidance strengths:
where the guided target represents the linearly interpolated prediction between conditional and unconditional outputs, and $T_{s}$ stands for the text prompt. In this formulation, $p_{w}(w)=\mathcal{U}\left[w_{\mathrm{min}},w_{\mathrm{max}}\right]$ dictates that the guidance strength parameters are sampled uniformly during training, which empowers the distilled model to effectively handle a wide range of guidance scales without necessitating retraining. To integrate the guidance weight $w$, we supply it as an extra input to our student model. This distillation process effectively condenses the traditional CFG calculation into a single, streamlined forward pass. We construct the overall distillation objective, $L_{\mathrm{distillation}}$, as a weighted sum of two loss terms.
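For concreteness, a plausible instantiation of the guided target and of the CFG distillation loss described above, following the standard classifier-free guidance formulation [14, 21] with the model's prediction $u_{\theta}$ (the exact parameterization used in Magic 1-For-1 may differ), is:

$$\tilde{u}_{\theta}(\mathbf{z}_{t},t,T_{s},w)=(1+w)\,u_{\theta}(\mathbf{z}_{t},t,T_{s})-w\,u_{\theta}(\mathbf{z}_{t},t,\varnothing),$$

$$\mathcal{L}_{\mathrm{CFG}}=\mathbb{E}_{t,\,w\sim p_{w}(w),\,\mathbf{z}_{t}}\Big[\big\|u_{\theta}^{s}(\mathbf{z}_{t},t,T_{s},w)-\tilde{u}_{\theta}(\mathbf{z}_{t},t,T_{s},w)\big\|_{2}^{2}\Big],$$

where $\varnothing$ denotes the dropped (null) text condition and $\tilde{u}_{\theta}$ is the guided teacher prediction.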