[Paper Translation] Magic 1-For-1: Generating One Minute Video Clips within One Minute


Original paper: https://arxiv.org/pdf/2502.07701v1


Magic 1-For-1: Generating One Minute Video Clips within One Minute

Hongwei Yi2 Shitong Shao2 Tian Ye2 Jiantong Zhao2 Qingyu Yin2

Michael Lingelbach2 Li Yuan1 Yonghong Tian1 Enze Xie3 Daquan Zhou1†∗

1 Peking University 2 Hedra Inc. 3 Nvidia


https://magic-141.github.io/Magic-141/


Abstract


In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) model convergence speedup via multi-modal prior condition injection; 2) inference latency reduction via adversarial step distillation; and 3) inference memory cost reduction via parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second on average to generate each 1-second clip. We conduct a series of preliminary explorations to find the optimal trade-off between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source exploration. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.



Figure 1: The comparative experimental results on General VBench highlight the strong performance of Magic 1-For-1. Our model surpasses other open-source TI2V models, including CogVideoX-I2V (SAT), I2Vgen-XL, SEINE-512x320, VideoCrafter-I2V, and SVD-XT-1.0, in terms of both performance and efficiency.



Figure 2: Magic 1-For-1 can generate video clips with optimized efficiency-quality trade-off.


1 Introduction


Recently, diffusion models have demonstrated superior performance in generating high-quality images and videos, with significantly wider diversity than traditional generative adversarial network (GAN)-based methods [4]. However, conventional diffusion models require hundreds or thousands of steps to gradually transform noise into structured data, making them computationally expensive and limiting their practical applications. For instance, a recent open-source video generation model [18] requires around 8 GPUs and 15 minutes to generate a single 5-second video clip without extra optimization.


Recent works on diffusion step distillation [28, 31, 26] aim to accelerate the generative process by reducing the number of inference steps while maintaining high sample quality. However, most of this research focuses on image generation. Despite their promise, these methods often compromise video quality and require carefully designed architectures to achieve satisfactory results. As mentioned in SeaweedAPT [20], when transferred to video generation, most one-step video generation algorithms still suffer from significant quality degradation in aspects such as motion dynamics and structure.


In this report, we propose to simplify the video generation task by factorizing it into two separate subtasks: text-to-image generation and image-to-video generation. Different from Emu Video [10], which focuses on improving video generation quality, we focus on reducing the number of diffusion inference steps. The text-to-image task benefits from its relative simplicity and a wide range of prior research, making it inherently easier and requiring fewer diffusion steps.


We experimentally observe that diffusion step distillation algorithms converge significantly faster in text-to-image generation while also improving the final generation quality. To further reinforce the generative prior, we incorporate multi-modal inputs by augmenting the text encoder with visual inputs. By leveraging quantization techniques, we reduce memory consumption from 40G to 28G and cut the required diffusion steps from 50 to just 4, achieving optimized performance with minimal degradation in quality. In summary, our contributions are:


• Generative Prior Injection: We introduce a novel acceleration strategy for video generation by injecting stronger generative priors through task factorization.
• Multi-Modal Guidance: We propose a multi-modal guidance mechanism that accelerates model convergence by incorporating visual inputs alongside text inputs.
• Comprehensive Optimization: By combining a suite of model acceleration techniques, we achieve a state-of-the-art trade-off between generation quality and inference efficiency.

2 Related Works


Video generation has been a rapidly evolving research area, leveraging advancements in deep learning and generative models. Early approaches relied on autoregressive models [17], which generate videos frame-by-frame based on past frames, but suffered from compounding errors over time. Variational Autoencoders (VAEs) [2] and Generative Adversarial Networks (GANs) [34, 7] were later introduced to improve video synthesis, offering better visual fidelity but struggling with temporal consistency over longer sequences.


More recent breakthroughs have been driven by Transformer-based architectures and diffusion models. VideoGPT [39] adapts GPT-like architectures for video tokenization and synthesis, capturing long-term dependencies effectively. Diffusion models have also demonstrated promising results, with works like Video Diffusion Models (VDMs) [12] extending the success of image diffusion models to videos by modeling spatiotemporal dynamics through iterative denoising. Additionally, latent space-based methods such as Imagen Video [13] and Make-A-Video [30] focus on efficient high-resolution synthesis by operating in a compressed representation space. These methods have significantly advanced the field, enabling the generation of realistic, temporally coherent videos with high fidelity.


Recent advancements in diffusion step distillation have introduced various methods to accelerate the sampling process of diffusion models, aiming to reduce computational costs while maintaining high-quality outputs. Salimans and Ho (2022) proposed Progressive Distillation, which iteratively halves the number of sampling steps required by training student models to match the output of teacher models with more steps, thereby achieving faster sampling without significant loss in image quality [28]. Meng et al. (2023) developed a distillation approach for classifier-free guided diffusion models, significantly speeding up inference by reducing the number of sampling steps required for high-fidelity image generation [21]. Geng et al. (2024) introduced a method that distills diffusion models into single-step generator models using deep equilibrium architectures, achieving efficient sampling with minimal loss in image quality [9]. Watson et al. (2024) proposed EM Distillation, a maximum likelihood-based approach that distills a diffusion model into a one-step generator model, effectively reducing the number of sampling steps while preserving perceptual quality [36]. Zhou et al. (2024) presented Simple and Fast Distillation (SFD), which simplifies existing distillation paradigms and reduces fine-tuning time by up to 1000 times while achieving high-quality image generation [42]. Yin et al. (2024) introduced Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality [40]. These developments contribute to making diffusion-based generative models more practical for real-world applications by substantially decreasing computational demands.


3 Method


Image Prior Generation We use both diffusion-based and retrieval-based methods to obtain the prior images. We define a unified function $\mathcal{G}$ that combines diffusion-based generation and retrieval-based augmentation:

$$\mathbf{x}_{0}=\mathcal{G}\left(\mathbf{x}_{T},\,y,\,\mathcal{R}(y);\,\theta\right)$$

where $\mathbf{x}_{T}\sim\mathcal{N}(0,I)$ represents the initial noise, $y$ is the textual input, $\mathcal{R}(y)$ is the retrieved set of relevant images, and $\theta$ represents the model parameters. The retrieval function $\mathcal{R}$ is formally defined as:


$$\mathcal{R}(y)=\underset{\mathbf{x}\in\mathcal{D}}{\mathrm{top}\text{-}k}\ \mathrm{sim}\left(\phi(y),\,\psi(\mathbf{x})\right)$$

where $\mathcal{D}$ is the image corpus, $\phi(y)$ and $\psi(\mathbf{x})$ are text and image embeddings, respectively, and $\mathrm{sim}(\cdot,\cdot)$ is a similarity function (e.g., cosine similarity). The retrieved images serve as additional conditioning signals in the denoising process:


$$\mathbf{x}_{t-1}=\mathbf{x}_{t}-\alpha_{t}\,\epsilon_{\theta}\left(\mathbf{x}_{t},t,y,\mathcal{R}(y)\right)+\sigma_{t}\,\mathbf{z}$$

where $\epsilon_{\theta}$ is the learned noise prediction model, $\alpha_{t}$ is the step size, $\sigma_{t}$ is the noise variance, and $\mathbf{z}\sim\mathcal{N}(0,I)$ . This retrieval-augmented diffusion process ensures that the generated image maintains both high fidelity and factual accuracy. Recent methods such as Retrieval-Augmented Diffusion Models (RDM) [27] and kNN-Diffusion [29] have demonstrated the effectiveness of this approach, significantly improving the realism and contextual alignment of generated images.
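As a concrete sketch of the retrieval function $\mathcal{R}$, the snippet below ranks an image corpus by cosine similarity between a text embedding $\phi(y)$ and image embeddings $\psi(\mathbf{x})$. The embedding encoders themselves (e.g., a CLIP-style model) are assumed and not shown; the function name and shapes are illustrative, not the paper's released code.

```python
import numpy as np

def retrieve(text_emb, image_embs, k=3):
    """Return indices of the top-k corpus images by cosine similarity.

    text_emb:   (d,)  embedding phi(y) of the text prompt
    image_embs: (n, d) embeddings psi(x) for each image in the corpus D
    """
    # Normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb)
    ims = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = ims @ t                   # (n,) cosine similarities
    return np.argsort(-sims)[:k]     # indices of the k most similar images

# Toy corpus: the query is exactly aligned with row 1.
corpus = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.6, 0.8])
print(retrieve(query, corpus, k=2))  # → [1 2]
```

The retrieved indices select the images that are then fed to the denoiser as additional conditioning signals.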


Figure 4: The overview of model acceleration techniques, including DMD2 and CFG distillation.


Image Prior Injection and Multi-modal Guidance Design The Image-to-Video (I2V) task involves generating a video that aligns with a given text description, using an input image as the first frame. Specifically, the T2V model takes as input a latent tensor $X$ of shape $T\times C\times H\times W$ , where $T,C,H$ , and $W$ correspond to the number of frames, channels, height, and width of the compressed video, respectively.


Similar to Emu Video [10], to incorporate the image condition $I$, we treat $I$ as the first frame of the video and apply zero-padding to construct a tensor $I_{o}$ of dimensions $T\times C\times H\times W$, as shown in Fig. 3. Additionally, we introduce a binary mask $m$ with shape $T\times1\times H\times W$, where the first temporal position is set to 1 and all subsequent positions are set to 0. The latent tensor $X$, the padded tensor $I_{o}$, and the mask $m$ are then concatenated along the channel dimension to form the input to the model.

Figure 3: Overall Architecture of Magic 1-For-1.
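A minimal sketch of this conditioning construction, with shapes following the $T\times C\times H\times W$ convention from the text; the helper name and NumPy stand-ins are illustrative, not the released implementation:

```python
import numpy as np

def build_i2v_input(x, first_frame_latent):
    """Concatenate [X, I_o, m] along the channel axis.

    x:                  (T, C, H, W) noisy video latent X
    first_frame_latent: (C, H, W)    latent of the conditioning image I
    returns:            (T, 2C+1, H, W)
    """
    T, C, H, W = x.shape
    # I_o: the image latent at frame 0, zero-padded elsewhere.
    i_o = np.zeros_like(x)
    i_o[0] = first_frame_latent
    # m: 1 at the first temporal position, 0 afterwards.
    m = np.zeros((T, 1, H, W), dtype=x.dtype)
    m[0] = 1.0
    return np.concatenate([x, i_o, m], axis=1)

x = np.random.randn(5, 16, 8, 8).astype(np.float32)
cond = np.random.randn(16, 8, 8).astype(np.float32)
out = build_i2v_input(x, cond)
print(out.shape)  # (5, 33, 8, 8), i.e. 2C+1 = 33 channels
```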

Since the input tensor’s channel dimension increases from $C$ to $2C+1$, as shown in Fig. 3, we adjust the parameters of the model’s first convolutional module from $\phi=(C_{\mathrm{in}}(=C),\,C_{\mathrm{out}},\,s_{h},\,s_{w})$ to $\phi^{\prime}=(C_{\mathrm{in}}^{\prime}(=2C+1),\,C_{\mathrm{out}},\,s_{h},\,s_{w})$. Here, $C_{\mathrm{in}}$ and $C_{\mathrm{in}}^{\prime}$ represent the input channels before and after modification, $C_{\mathrm{out}}$ is the number of output channels, and $s_{h}$ and $s_{w}$ correspond to the height and width of the convolutional kernel, respectively. To preserve the representational capability of the T2V model, the first $C$ input channels of $\phi^{\prime}$ are copied from $\phi$, while the additional channels are initialized to zero. The I2V model is pre-trained on the same dataset as the T2V model to ensure consistency.
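The expansion from $\phi$ to $\phi^{\prime}$ amounts to widening the first convolution's weight tensor along its input-channel axis, copying the pre-trained channels and zero-initializing the rest. A framework-agnostic sketch (the helper is hypothetical; real code would operate on the framework's convolution module):

```python
import numpy as np

def expand_conv_weight(w, extra_in):
    """Widen a conv kernel from C_in to C_in + extra_in input channels.

    w: (C_out, C_in, s_h, s_w) pre-trained kernel of the T2V model.
    The first C_in channels are copied; the new channels start at zero,
    so the modified layer initially behaves exactly like the original.
    """
    c_out, c_in, s_h, s_w = w.shape
    w_new = np.zeros((c_out, c_in + extra_in, s_h, s_w), dtype=w.dtype)
    w_new[:, :c_in] = w
    return w_new

C = 16
w = np.random.randn(64, C, 3, 3).astype(np.float32)
w_prime = expand_conv_weight(w, extra_in=C + 1)  # C_in' = 2C + 1
print(w_prime.shape)  # (64, 33, 3, 3)
```

Because the extra channels are zero, the I2V model starts from the T2V model's exact function and only learns to use the new conditioning channels during fine-tuning.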


To further enhance the semantic alignment of reference images, we extract their embeddings via the vision branch of the VLM text encoder and concatenate them with the text embeddings, as illustrated in Fig. 3. This integration improves the model’s ability to generate videos that better capture the context provided by both the image and the textual description.


3.1 Diffusion Distillation


The iterative nature of diffusion model inference, characterized by its multi-step sampling procedure, introduces a significant bottleneck to inference speed. This issue is particularly exacerbated in large-scale models such as our 13B parameter diffusion model, Magic 1-For-1, where the computational cost of each individual sampling step is substantial. As depicted in Fig. 4, we address this challenge by implementing a dual distillation approach, incorporating both step distillation and CFG distillation to achieve faster sampling. For step distillation, we leverage DMD2, a state-of-the-art algorithm engineered for efficient distribution alignment and accelerated sampling. Inspired by score distillation sampling (SDS) [25], DMD2 facilitates step distillation through a coordinated training paradigm involving three distinct models: the one/four-step generator $G_{\phi}$, whose parameters are iteratively optimized; the real video model $u_{\theta}^{\mathrm{real}}$, tasked with approximating the underlying real data distribution $p_{\mathrm{real}}$; and the fake video model $u_{\theta}^{\mathrm{fake}}$, which estimates the generated (fake) data distribution $p_{\mathrm{fake}}$. Crucially, all three models are initialized from the same pre-trained model, ensuring coherence and streamlining the training process. The distribution matching objective for step distillation can be mathematically expressed as:


$$\nabla_{\phi}\mathcal{L}_{\mathrm{DMD}}=\mathbb{E}_{t}\left[\left(u_{\theta}^{\mathrm{fake}}\left(\mathbf{z}_{t},t\right)-u_{\theta}^{\mathrm{real}}\left(\mathbf{z}_{t},t\right)\right)\frac{\partial G_{\phi}}{\partial\phi}\right]$$

Here, $\mathbf{z}_{t}$ stands for the video latent at timestep $t$, and $\mathbf{z}_{t}=\sigma_{t}\mathbf{z}_{1}+(1-\sigma_{t})\hat{\mathbf{z}}_{0}$, with $\hat{\mathbf{z}}_{0}$ representing the output synthesized by the few-step generator and $\sigma_{t}$ denoting the noise schedule. This formulation reframes the conventional score-function-based distribution matching (inherent in standard DMD2) into a novel approach that focuses on distribution alignment at time step $t=0$. This adaptation is crucial to ensure consistency with the training methodology employed in Magic 1-For-1. Furthermore, DMD2 necessitates the real-time updating of $u_{\theta}^{\mathrm{fake}}$ to guarantee an accurate approximation of the fake data distribution $p_{\mathrm{fake}}$. This update is governed by the following loss function:


$$\mathcal{L}_{\mathrm{fake}}=\mathbb{E}_{t}\left[\left\|u_{\theta}^{\mathrm{fake}}\left(\mathbf{z}_{t},t\right)-\left(\mathbf{z}_{1}-\hat{\mathbf{z}}_{0}\right)\right\|_{2}^{2}\right]$$
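A schematic of how a DMD-style distribution-matching gradient is typically applied to the few-step generator: the generator's sample is re-noised, both denoisers evaluate the noised point, and their disagreement becomes the gradient on the sample. The stand-in functions below replace the 13B video networks; this is a sketch of the general DMD2 mechanism, not the paper's exact training code.

```python
import numpy as np

def dmd_generator_grad(z_hat0, z1, sigma_t, u_real, u_fake):
    """Gradient of the distribution-matching loss w.r.t. the generator output.

    z_hat0:  generator output (the few-step sample)
    z1:      Gaussian noise
    sigma_t: noise level in [0, 1]
    u_real / u_fake: models approximating p_real and p_fake
    The update pushes z_hat0 in the direction that shrinks the gap
    between the fake and real model predictions at the noised point z_t.
    """
    z_t = sigma_t * z1 + (1.0 - sigma_t) * z_hat0  # forward interpolation
    return u_fake(z_t) - u_real(z_t)               # ~ (score_fake - score_real)

# Sanity check: when both models agree, the generator receives zero gradient.
u = lambda z: 0.5 * z
g = dmd_generator_grad(np.ones(4), np.zeros(4), 0.3, u, u)
print(np.abs(g).max())  # 0.0
```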

In the practical implementation, training DMD2 requires the simultaneous use of three models, which makes it infeasible to fit them in memory for standard training, even under a ZeRO-3 configuration with 2 nodes of 8 GPUs, each with 80GB of memory. To address this limitation, we propose leveraging LoRA to handle parameter updates of the fake model. Additionally, we observed that directly training with the standard DMD2 approach often leads to training collapse. This issue arises because the input to the real model is derived from the output of the few-step generator, whose data distribution differs significantly from the training data distribution used during the pre-training phase. To mitigate this problem, we adopt a straightforward yet effective solution: slightly shifting the parameters of the real model towards those of the fake model. This is achieved by adjusting the weight factor associated with the low-rank branch. This adjustment helps align the data distributions, ensuring stable training.
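The "slight shift" of the real model towards the fake model can be realized by scaling the shared low-rank (LoRA) branch differently for the two models; a minimal sketch with made-up shapes (the weight factors 0.25 and 1 are the values reported later in the implementation details):

```python
import numpy as np

def effective_weight(w_base, lora_a, lora_b, alpha):
    """Effective weight W + alpha * (B @ A) of a LoRA-augmented layer.

    w_base: (d_out, d_in) frozen pre-trained weight
    lora_a: (r, d_in), lora_b: (d_out, r) low-rank branch (the fake model's)
    alpha:  weight factor on the low-rank branch
    """
    return w_base + alpha * (lora_b @ lora_a)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
a, b = rng.standard_normal((2, 8)), rng.standard_normal((8, 2))
w_fake = effective_weight(w, a, b, alpha=1.0)   # fake model: full LoRA branch
w_real = effective_weight(w, a, b, alpha=0.25)  # real model: shifted 25% of the way
# The real model lies on the segment between the base and fake weights.
print(np.allclose(w_real, w + 0.25 * (w_fake - w)))  # True
```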


During the inference phase of diffusion models, Classifier-Free Diffusion Guidance (CFG) [14, 6] is frequently employed at each sampling step. CFG enhances the fidelity of the generated results with respect to the specified conditions by performing an additional computation with the condition dropped out. To eliminate this computational overhead and boost inference speed, we implemented CFG distillation [21]. We define a distillation objective that trains the student model, $u_{\theta}^{s}$, to directly produce the guided outputs. Specifically, we minimize the following expectation over timesteps and guidance strengths:


$$\mathcal{L}_{\mathrm{cfg}}=\mathbb{E}_{t,\,w\sim p_{w}(w)}\left[\left\|u_{\theta}^{s}\left(\mathbf{z}_{t},t,T_{s},w\right)-\tilde{u}_{\theta}^{w}\left(\mathbf{z}_{t},t,T_{s}\right)\right\|_{2}^{2}\right]$$

where

$$\tilde{u}_{\theta}^{w}\left(\mathbf{z}_{t},t,T_{s}\right)=(1+w)\,u_{\theta}\left(\mathbf{z}_{t},t,T_{s}\right)-w\,u_{\theta}\left(\mathbf{z}_{t},t,\varnothing\right)$$

represents the linearly interpolated prediction between conditional and unconditional outputs. $T_{s}$ stands for the text prompt. In this formulation, $p_{w}(w)=\mathcal{U}\left[w_{\mathrm{min}},w_{\mathrm{max}}\right]$ dictates that the guidance strength parameters are sampled uniformly during training, which empowers the distilled model to effectively handle a wide range of guidance scales without necessitating retraining. To integrate the guidance weight, $w$, we supply it as an extra input to our student model. This distillation process effectively condenses the traditional CFG calculation into a single, streamlined forward pass. We construct the overall distillation objective, $L_{\mathrm{distillation}}$, as a weighted sum of two loss terms. The CFG distillation loss serves to align the student model’s outputs with the teacher’s guided predictions, while a base prediction loss ensures the student maintains the teacher’s underlying generation capabilities. The complete distillation loss is therefore given by:
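The guided teacher target and the CFG-distillation loss can be sketched as follows. The conditional/unconditional predictions stand in for two forward passes of the pre-trained teacher; function names and the guidance range are illustrative:

```python
import numpy as np

def cfg_target(u_cond, u_uncond, w):
    """Classifier-free-guided teacher prediction the student learns to match."""
    return (1.0 + w) * u_cond - w * u_uncond

def cfg_distill_loss(student_pred, u_cond, u_uncond, w):
    """MSE between the student's single pass and the guided teacher output."""
    target = cfg_target(u_cond, u_uncond, w)
    return float(np.mean((student_pred - target) ** 2))

# w is sampled uniformly from [w_min, w_max] during training, so the
# student generalizes across guidance scales without retraining.
w = np.random.uniform(1.0, 7.5)
u_c, u_u = np.ones(4), np.zeros(4)
print(cfg_distill_loss(cfg_target(u_c, u_u, w), u_c, u_u, w))  # 0.0 for a perfect student
```

At inference, the student takes $w$ as an extra input and replaces the two teacher passes with a single forward pass.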


Table 1: Quantitative comparison using Magic 1-For-1 on our customized VBench. Following VBench [10], we synthesize 5 videos for each sample to reduce evaluation error.


| #Steps | Approach | i2v Subject (↑) | Subject Consistency (↑) | Motion Smoothness (↑) | Dynamic Degree (↑) | Aesthetic Quality (↑) | Imaging Quality (↑) | Temporal Flickering (↑) | Average (↑) |
|---|---|---|---|---|---|---|---|---|---|
| 56-step | Euler (baseline) | 0.9804 | 0.9603 | 0.9954 | 0.2103 | 0.5884 | 0.6896 | 0.9937 | 0.7740 |
| 28-step | Euler (baseline) | 0.9274 | 0.9397 | 0.9953 | 0.2448 | 0.5687 | 0.6671 | 0.9935 | 0.7623 |
| 16-step | Euler (baseline) | 0.9750 | 0.9366 | 0.9957 | 0.1241 | 0.5590 | 0.6238 | 0.9946 | 0.7441 |
| 8-step | Euler (baseline) | 0.9787 | 0.9000 | 0.9957 | 0.0068 | 0.4994 | 0.5013 | 0.9961 | 0.7441 |
| 8-step | DMD2 | 0.9677 | 0.9634 | 0.9962 | 0.3207 | 0.5993 | 0.7125 | 0.9921 | 0.7928 |
| 4-step | Euler (baseline) | 0.9803 | 0.8593 | 0.9965 | 0.0034 | 0.4440 | 0.3693 | 0.9972 | 0.6642 |
| 4-step | DMD2 | 0.9812 | 0.9762 | 0.9934 | 0.4123 | 0.6123 | 0.7234 | 0.9950 | 0.8134 |

$$\mathcal{L}_{\mathrm{distillation}}=\lambda_{\mathrm{cfg}}\,\mathcal{L}_{\mathrm{cfg}}+\mathcal{L}_{\mathrm{base}}$$

where

$$\mathcal{L}_{\mathrm{base}}=\mathbb{E}_{t}\left[\left\|u_{\theta}^{s}\left(\mathbf{z}_{t},t,T_{s},w\right)-u_{\theta}\left(\mathbf{z}_{t},t,T_{s}\right)\right\|_{2}^{2}\right]$$

Here, $\lambda_{\mathrm{cfg}}$ is the balancing coefficient in the overall distillation loss.


3.2 Model Quantization


We leverage the optimum-quanto framework for model quantization, employing int8 weight-only quantization to minimize the model’s memory footprint. The quantization strategy specifically targets the denoising network, encompassing the transformer blocks, the text encoder, and the VLM encoder. The quantization process maps the original bfloat16 weights to int8 values. A common approach involves scaling the bfloat16 values to a suitable range before conversion. For instance, one might determine the maximum absolute value within a tensor of weights, scale all weights such that this maximum value corresponds to the maximum representable int8 value (127 or -128), and then perform the conversion. A simplified illustration of this scaling could be represented as:


$$w_{\mathrm{int8}}=\mathrm{round}\left(\frac{w_{\mathrm{bf16}}}{\operatorname*{max}\left(\left|w_{\mathrm{bf16}}\right|\right)}\times127\right)$$

where $w_{\mathrm{bf16}}$ represents the original bfloat16 weight, $w_{\mathrm{int8}}$ the quantized int8 weight, and $\operatorname*{max}(\left|w_{\mathrm{bf16}}\right|)$ the maximum absolute value within the weight tensor. In practice, more sophisticated methods such as per-channel quantization or quantization-aware training can be used for better performance. To mitigate potential CUDA errors and ensure numerical stability during inference, all linear layers within the model first convert their inputs to bfloat16 before performing the matrix multiplication with the quantized int8 weights. This bfloat16-int8 multiplication helps maintain accuracy while still benefiting from the reduced memory footprint of the int8 weights.
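The absolute-maximum scaling described above can be sketched as plain per-tensor symmetric weight quantization. This standalone NumPy version only illustrates the arithmetic (float32 stands in for bfloat16); the actual model uses the optimum-quanto framework:

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor symmetric int8 quantization: returns (q, scale)."""
    scale = np.abs(w).max() / 127.0            # map max |w| to 127
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original high-precision weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(q.dtype, err <= s / 2 + 1e-6)  # int8 True (error bounded by half a quantization step)
```

Storing `q` plus one scale per tensor halves the weight footprint relative to bfloat16, matching the roughly 32GB to 16GB reduction reported below.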


Before quantization, the model’s weights occupy approximately 32GB. After applying int8 quantization, the model size is reduced to about 16GB. During runtime, the peak memory usage is around 30GB. This optimized model is then capable of running on consumer-grade and inference-focused GPUs, including the RTX 5090, A10, and L20.


4 Experiments


4.1 Experimental Setup


Foundation Model Selection The text-to-image generation task is better explored than the image-to-video generation task. Thus, we use a pool of pre-trained T2I models [19, 23, 37, 1] or user-provided images directly. For the I2V task, we use the pre-trained HunyuanVideo 13B T2V model [18] as the base model and modify it accordingly.


4.2 Implementation Details


We use 128 GPUs with a batch size of 64 and train the model for two weeks. The initial learning rate is set to $5\times10^{-5}$ and progressively decreased to $1\times10^{-5}$. We also apply an exponential moving average (EMA) [22] of the weights for stable training. The model is trained on a subset of 1.6M data samples collected from WebVid-10M [3], Panda-70M [5], Koala-36M [35], and Internet data. For step distillation, we explore integrating CFG distillation into the DMD2 training process, aiming to produce a few-step generator within a single training stage. DMD2 is trained with a batch size of 16 and a fixed learning rate of $2\times10^{-6}$. During training, the few-step generator is updated after every five updates of the fake model. Additionally, in DMD2 training, the weight factor for the low-rank branch is set to 0.25 for the real model and 1 for the fake model, respectively.
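The EMA used for training stability maintains a shadow copy of the weights updated as $\theta_{\mathrm{ema}}\leftarrow\beta\,\theta_{\mathrm{ema}}+(1-\beta)\,\theta$; a minimal sketch (the decay value here is illustrative, as the paper does not report it):

```python
import numpy as np

class EMA:
    """Exponential moving average of model parameters."""

    def __init__(self, params, beta=0.999):
        self.beta = beta
        self.shadow = [p.copy() for p in params]  # decoupled shadow copies

    def update(self, params):
        # theta_ema <- beta * theta_ema + (1 - beta) * theta, in place.
        for s, p in zip(self.shadow, params):
            s *= self.beta
            s += (1.0 - self.beta) * p

# After many steps on constant weights, the shadow converges to them.
target = [np.full(3, 2.0)]
ema = EMA([np.zeros(3)], beta=0.9)
for _ in range(100):
    ema.update(target)
print(np.allclose(ema.shadow[0], 2.0, atol=1e-3))  # True
```

At evaluation time, the shadow parameters are used in place of the raw training parameters, smoothing out per-step noise.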



Figure 5: Model performance progression during training. Interestingly, T2V Magic 1-For-1 exhibits considerably slower convergence in step distillation compared to TI2V Magic 1-For-1.


4.3 Benchmarks


We evaluate the model using our customized VBench, General VBench, and traditional metrics such as FID, FVD, and LPIPS. However, due to resource constraints, we do not run the model with an extensive number of inference steps, as is done in recent state-of-the-art models like MovieGen [24]; instead, our focus is directed toward the efficiency of the model. In this report, we measure the performance of the base model using 4, 8, 16, 28, and 56 steps. For the few-step generator, we evaluate its performance using 4 and 8 steps.


Our Customized VBench. We use I2V VBench for portrait video synthesis evaluation [16]. We first collect 58 high-quality widescreen images and leverage InternVL-26B to generate corresponding prompts for these images. For each sample, we then synthesize five videos, resulting in a total of 290 videos, to reduce potential testing errors. The synthesized videos are subsequently evaluated using VBench. The primary evaluation metrics include i2v subject, subject consistency, motion smoothness, dynamic degree, aesthetic quality, imaging quality, and temporal flickering.


General VBench. We also evaluate Magic 1-For-1’s performance on the General VBench [15] dataset, which consists of the official prompts paired with reference images. Similar to the I2V VBench evaluation, this benchmark assesses i2v subject, subject consistency, motion smoothness, dynamic degree, aesthetic quality, imaging quality, and temporal flickering. Notably, this benchmark includes 1,118 prompts, each paired with a reference image, to provide a diverse and comprehensive evaluation.


FID, FVD and LPIPS. In portrait video synthesis, key metrics for evaluating video quality include FID [11], FVD [33], and LPIPS [41]. Following the approaches outlined in EMO [32] and Hallo [8], we randomly sample 100 video clips from the VFHQ [38] dataset. Each selected clip contains 129 frames, with resolution and frame rate standardized to $540\times960$ and 24 FPS, respectively. To evaluate performance, we calculate the FID and FVD scores by comparing the synthesized videos against their corresponding original videos.


4.4 Experiment Results


Our experiments contain three primary components. First, ablation studies reveal that the 4-step DMD2 training achieves optimal performance, and that TI2V step distillation is significantly less challenging to train compared to T2V step distillation. Next, we evaluate the performance differences between the few-step generator and the base model on both portrait video synthesis and general video synthesis tasks. Lastly, we compare Magic 1-For-1 with other state-of-the-art TI2V models to demonstrate the superiority of our proposed algorithm.

我们的实验包含三个主要部分。首先,消融研究表明,4步DMD2训练实现了最佳性能,并且与T2V步骤蒸馏相比,TI2V步骤蒸馏的训练难度显著降低。接下来,我们评估了少步生成器和基础模型在肖像视频合成和通用视频合成任务上的性能差异。最后,我们将Magic 1-For-1与其他最先进的TI2V模型进行比较,以展示我们提出的算法的优越性。

Convergence Speed Comparison. Thanks to the powerful generative capability of the Magic 1-For-1 base model, fine-tuning this large-scale model for high-quality video synthesis with minimal sampling steps incurs relatively low computational overhead. As shown in Fig. 5, applying DMD2 for step distillation on the TI2V base model achieves near convergence within 100 iterations. Specifically, the optimal 4-step generator is obtained at 200 iterations, while the optimal 8-step generator is achieved at merely 100 iterations. Notably, whereas DMD2 demonstrates rapid convergence on the TI2V few-step synthesis task, its performance on the T2V few-step synthesis task remains far from convergence even after 1,000 iterations and significantly underperforms its TI2V counterpart.

收敛速度对比。由于 Magic 1-For-1 基础模型强大的生成能力,微调这一大规模模型以实现高质量合成视频生成所需的采样步骤较少,计算开销相对较低。如图 5 所示,在 TI2V 基础模型上应用 DMD2 进行步骤蒸馏,在 100 次迭代内实现了接近收敛的效果。具体来说,最优的 4 步生成器在 200 次迭代时获得,而最优的 8 步生成器仅需 100 次迭代即可完成。值得注意的是,尽管 DMD2 在 TI2V 少步合成任务上表现出快速收敛,但在 T2V 少步合成任务上,即使经过 1,000 次迭代,其性能仍远未达到收敛,且与 TI2V 相比显著落后。
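To build intuition for why distribution-matching step distillation converges, here is a toy one-dimensional sketch (our illustration, not the paper's DMD2 implementation): a shift-only generator is pulled toward a target Gaussian by following the difference between the fake and real score functions, which is the core DMD update direction.

```python
import numpy as np


def dmd_toy(mu_real=2.0, theta=-1.0, steps=200, lr=0.5, batch=256, seed=0):
    """Toy distribution-matching update on 1D Gaussians.

    The generator is x = z + theta with z ~ N(0, 1). Both the real and the
    fake distributions are unit-variance Gaussians, so their score functions
    are s(x) = mu - x, and the DMD gradient on samples is s_fake - s_real.
    """
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x = rng.standard_normal(batch) + theta   # fake samples from the generator
        mu_fake = x.mean()                       # closed-form "fake score" fit
        grad_x = (mu_fake - x) - (mu_real - x)   # s_fake(x) - s_real(x)
        theta -= lr * grad_x.mean()              # chain rule: dx/dtheta = 1
    return theta
```

After training, `theta` sits close to `mu_real`: the generator's output distribution has been matched to the target without ever running a multi-step sampler, which is the intuition behind the few-step generators distilled in this report.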


Figure 6: Qualitative comparison of Magic 1-For-1 with recent state-of-the-art open-source image-to-video generation models.

图 6: Magic 1-For-1 与近期最先进的开源图像到视频生成模型的定性对比。

Performance Comparison between Few-Step Generators. We evaluate the performance of our few-step generator and base model on both our customized VBench and the General VBench. Our customized VBench is specifically designed for portrait video synthesis tasks, while the General VBench targets generic video generation. The comparative results are presented in Table 1 and Fig. 1, respectively. Table 1 shows that the base model’s performance gradually improves as the number of sampling steps increases. Notably, the 50-step base model underperforms the 4-step generator on metrics such as motion dynamics, visual quality, and semantic faithfulness, indicating that our improved DMD2 effectively mitigates certain detrimental biases inherent in the base model. Moreover, Magic 1-For-1 (DMD2) demonstrates strong performance on the General VBench benchmark. As illustrated in Fig. 1, Magic 1-For-1 (DMD2) outperforms competing methods, including SVD-XT-1.0, VideoCrafter-I2V, SEINE-512x320, I2Vgen-XL, and CogVideoX-I2V, across multiple evaluation dimensions such as motion dynamics and visual fidelity. This success underscores the superiority of Magic 1-For-1 on TI2V generation tasks.

少步生成器的性能对比。我们在定制的 VBench 和通用 VBench 上评估了我们的少步生成器和基础模型的性能。定制的 VBench 专为肖像视频合成任务设计,而通用 VBench 则针对通用视频生成。对比结果分别展示在表 1 和图 1 中。从表 1 可以看出,随着采样步数的增加,基础模型的性能逐渐提升。值得注意的是,50 步的基础模型在运动动态、视觉质量和语义忠实度等指标上表现不如 4 步生成器,这表明我们改进的 DMD2 有效缓解了基础模型中某些固有的有害偏差。此外,Magic 1-For-1 (DMD2) 在通用 VBench 基准上表现出色。如图 1 所示,Magic 1-For-1 (DMD2) 在运动动态和视觉保真度等多个评估维度上超越了包括 SVD-XT-1.0、VideoCrafter-I2V、SEINE-512x320、I2Vgen-XL 和 CogVideoX-I2V 在内的竞争方法。这一成功凸显了 Magic 1-For-1 在 TI2V 生成任务中的优越性。

Performance Comparison With SOTA Models. We also compare the visual quality of our model with recent open-source image-to-video generation models. As shown in Fig. 6, our model demonstrates significantly superior video quality, particularly in terms of visual clarity and motion smoothness.

与SOTA模型的性能对比。我们还将我们的模型与最近开源的图像到视频生成模型进行了视觉质量对比。如图6所示,我们的模型在视频质量上表现出显著优势,特别是在视觉清晰度和运动流畅性方面。

5 Conclusion

5 结论

In this report, we present a comprehensive approach to enhance the efficiency of video generation model training and inference. We decompose the text-to-video generation task into two sequential subtasks: image generation and image-to-video generation. Our findings demonstrate that the commonly used diffusion step distillation algorithm achieves significantly faster convergence when applied to image-to-video generation compared to the full text-to-video generation pipeline. Furthermore, by incorporating quantization techniques, we develop a highly efficient video generation model while maintaining an acceptable level of generation quality. Through this report, we aim to highlight that by leveraging generation-prior information, the diffusion process can be substantially accelerated, making video generation both faster and more practical.

在本报告中,我们提出了一种全面的方法来提高视频生成模型训练和推理的效率。我们将文本到视频生成任务分解为两个顺序的子任务:图像生成和图像到视频生成。我们的研究结果表明,与完整的文本到视频生成流程相比,常用的扩散步骤蒸馏算法在图像到视频生成中能够显著加快收敛速度。此外,通过引入量化技术,我们开发了一种高效的视频生成模型,同时保持了可接受的生成质量。通过本报告,我们旨在强调,通过利用生成先验信息,可以大幅加速扩散过程,使视频生成既更快又更实用。
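The quantization mentioned above can be illustrated with a generic symmetric per-tensor int8 weight quantizer. This is a common scheme, offered here only as a hedged sketch; the report does not specify the exact quantization method used, and the function names are our own.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: int8 weights plus one fp scale."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale
```

Storing int8 values plus a single scale cuts weight memory to roughly a quarter of fp32, with a worst-case per-element reconstruction error of half the scale, which is the memory/quality trade-off this report exploits at inference time.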

6 Limitation

6 局限性

Due to resource limitations, we trained the model on $540\!\times\!960$ videos using a small, unbalanced dataset. The dataset contains more human-centric subjects and movie-style video clips than the other categories, leading to a noticeable bias in the model’s performance toward these categories. We believe that training on a more balanced dataset with greater diversity would further enhance the model’s generalization ability.

由于资源限制,我们在一个小型且不平衡的数据集上训练了模型,使用分辨率为 $540\!\times\!960$ 的视频。该数据集由更多以人为中心的主题和电影风格的视频片段组成,导致模型在这些类别上的表现存在明显偏差。我们相信,在一个更加平衡且多样性更高的数据集上进行训练,将进一步提升模型的泛化能力。

References

参考文献
