[论文翻译]Magic Mirror: 视频扩散Transformer中的ID保留视频生成


原文地址:https://arxiv.org/pdf/2501.03931


Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers

Magic Mirror: 视频扩散Transformer中的ID保留视频生成

Yuechen Zhang1* Yaoyang Liu2* Bin Xia1 Bohao Peng1 Zexin Yan4 Eric Lo1 Jiaya Jia1,2,3 1CUHK 2HKUST 3SmartMore 4CMU

Yuechen Zhang1* Yaoyang Liu2* Bin Xia1 Bohao Peng1 Zexin Yan4 Eric Lo1 Jiaya Jia1,2,3 1香港中文大学 2香港科技大学 3SmartMore 4卡内基梅隆大学

*https://julianjuaner.github.io/projects/MagicMirror/


Figure 1. Magic Mirror generates text-to-video results given the ID reference image. Each video pair shows 24 frames (from a total of 49) with its corresponding face reference displayed in the bottom-left corner. Please use Adobe Acrobat Reader for video playback to get the optimal viewing experience. Complete videos are available on the project page.

图 1: Magic Mirror 根据 ID 参考图像生成文本到视频的结果。每个视频对显示 24 帧(总共 49 帧),其对应的面部参考显示在左下角。请使用 Adobe Acrobat Reader 进行视频播放以获得最佳观看体验。完整视频可在项目页面上查看。

Abstract

摘要

We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/.

我们提出了 Magic Mirror,一个用于生成具有电影级质量和动态动作的身份保持视频的框架。尽管最近的视频扩散模型在文本到视频生成方面展示了令人印象深刻的能力,但在生成自然动作的同时保持身份一致性仍然具有挑战性。先前的方法要么需要针对特定人物进行微调,要么难以在身份保持与动作多样性之间取得平衡。基于视频扩散 Transformer,我们的方法引入了三个关键组件:(1) 一个双分支面部特征提取器,能够捕捉身份和结构特征,(2) 一个轻量级的跨模态适配器,带有条件自适应归一化,以实现高效的身份集成,(3) 一个结合合成身份对和视频数据的两阶段训练策略。大量实验表明,Magic Mirror 在身份一致性与自然动作之间取得了有效平衡,在多个指标上优于现有方法,同时仅需添加最少的参数。代码和模型将在以下地址公开:https://github.com/dvlab-research/MagicMirror/。

1. Introduction

1. 引言

Human-centered content generation has been a focal point in computer vision research. Recent advancements in image generation, particularly through Diffusion Models [1–4], have propelled personalized content creation to the forefront of the field. While significant progress has been made in preserving personal identity (ID) in image generation [5–10], achieving comparable fidelity in video generation remains challenging.

以人为中心的内容生成一直是计算机视觉研究的重点。近年来,图像生成领域取得了显著进展,尤其是通过扩散模型 (Diffusion Models) [1–4],个性化内容创作已被推向该领域的前沿。虽然在图像生成中保留个人身份 (ID) 方面取得了显著进展 [5–10],但在视频生成中实现类似的保真度仍然具有挑战性。

Existing ID-preserving video generation methods show promise but face limitations. Approaches like MagicMe and ID-Animator [11, 12], utilizing inflated UNets [13] for fine-tuning or adapter training, demonstrate some success in maintaining identity across frames. However, they are ultimately restricted by the generation model’s inherent capabilities, often failing to produce highly dynamic videos (see Fig. 2). These approaches tend to perform a static copy-and-paste instead of generating dynamic facial motions. Another branch of methods combines image personalization methods with Image-to-Video (I2V) generation [14–16]. While these two-stage solutions preserve ID to some extent, they often struggle with stability in longer sequences and require a separate image generation step. To address current shortcomings, we present Magic Mirror, a single-stage framework designed to generate high-quality videos while maintaining strong ID consistency and dynamic, natural facial motions. Our approach leverages native video diffusion models [14] to generate ID-specific videos, aiming to empower individuals as protagonists in their virtual narratives and bridge the gap between personalized ID generation and high-quality video synthesis.

现有的身份保持视频生成方法显示出潜力,但也面临局限性。像 MagicMe 和 ID-Animator [11, 12] 这样的方法,利用膨胀的 UNet [13] 进行微调或适配器训练,在跨帧保持身份方面取得了一些成功。然而,它们最终受到生成模型固有能力的限制,往往无法生成高度动态的视频(见图 2)。这些方法更像是静态的复制粘贴,而不是生成动态的面部动作。另一类方法将图像个性化方法与图像到视频(I2V)生成 [14–16] 结合起来。虽然这些两阶段解决方案在一定程度上保持了身份,但它们通常在较长序列中难以保持稳定性,并且需要单独的图像生成步骤。为了解决当前的不足,我们提出了 Magic Mirror,这是一个单阶段框架,旨在生成高质量视频,同时保持强大的身份一致性和动态自然的面部动作。我们的方法利用原生视频扩散模型 [14] 生成特定身份的视频,旨在让个人成为其虚拟叙事的主角,并弥合个性化身份生成与高质量视频合成之间的差距。

The generation of high-fidelity identity-conditioned videos poses several technical challenges. A primary challenge stems from the architectural disparity between image and video generation paradigms. State-of-the-art video generation models, built upon full-attention Diffusion Transformer (DiT) architectures [14, 17], are not directly compatible with conventional cross-attention conditioning methods. To bridge this gap, we introduce a lightweight identity-conditioning adapter integrated into the CogVideoX [14] framework. Specifically, we propose a dual-branch facial embedding that simultaneously preserves high-level identity features and reference-specific structural information. Our analysis reveals that current video foundation models optimize for text-video alignment, often at the cost of spatial fidelity and generation quality. This trade-off manifests in reduced image quality metrics on benchmarks such as VBench [18], particularly affecting the preservation of fine-grained identity features. To address this, we develop a Conditioned Adaptive Normalization (CAN) that effectively incorporates identity conditions into the pre-trained foundation model. This module, combined with our dual facial guidance mechanism, enables identity conditioning through attention guidance and feature distribution guidance.

生成高保真身份条件视频面临多项技术挑战。主要挑战源于图像和视频生成范式之间的架构差异。基于全注意力扩散Transformer (DiT) 架构 [14, 17] 的最先进视频生成模型,与传统交叉注意力条件方法并不直接兼容。为了弥合这一差距,我们在 CogVideoX [14] 框架中引入了一个轻量级身份条件适配器。具体而言,我们提出了一种双分支面部嵌入,同时保留高级身份特征和特定参考结构信息。我们的分析表明,当前的视频基础模型优化了文本-视频对齐,但往往以牺牲空间保真度和生成质量为代价。这种权衡体现在 VBench [18] 等基准测试中的图像质量指标下降,特别是影响细粒度身份特征的保留。为了解决这一问题,我们开发了一种条件自适应归一化 (CAN),有效地将身份条件整合到预训练的基础模型中。该模块与我们的双面部引导机制相结合,通过注意力引导和特征分布引导实现身份条件。

Another significant challenge lies in the acquisition of high-quality training data. While paired image data with consistent identity is relatively abundant, high-fidelity image-video pairs that maintain identity consistency remain scarce. To address this limitation, we develop a strategic data synthesis pipeline that leverages identity preservation models [5] to generate paired training data. Our training methodology employs a progressive approach: initially pre-training on image data to learn robust identity representations, followed by video-specific fine-tuning. This two-stage strategy enables effective learning of identity features while ensuring temporal consistency in facial expressions across video sequences.

另一个重大挑战在于高质量训练数据的获取。虽然具有一致身份的配对图像数据相对丰富,但获取保持身份一致性的高保真图像-视频对仍然稀缺。为了解决这一限制,我们开发了一种战略性的数据合成流程,利用身份保持模型 [5] 生成配对训练数据。我们的训练方法采用渐进式策略:首先在图像数据上进行预训练以学习鲁棒的身份表示,然后进行视频特定的微调。这种两阶段策略能够有效学习身份特征,同时确保视频序列中面部表情的时间一致性。


Figure 2. Magic Mirror generates dynamic facial motion. ID-Animator [11] and Video Ocean [19] exhibit limited motion range due to a strong identity-preservation constraint. Magic Mirror achieves more dynamic facial expressions while maintaining reference identity fidelity.


图 2: Magic Mirror 生成动态面部运动。ID-Animator [11] 和 Video Ocean [19] 由于强烈的身份保持约束,表现出有限的运动范围。Magic Mirror 在保持参考身份保真度的同时,实现了更动态的面部表情。

We evaluate our method on multiple general metrics by constructing a human-centric video generation test set and comparing it with the aforementioned competitive ID-preserved video generation methods. Extensive experimental and visual evaluations demonstrate that our approach successfully generates high-quality videos with dynamic content and strong facial consistency, as illustrated in Fig. 1. Magic Mirror represents the first approach to achieve customized video generation using Video DiT without requiring person-specific fine-tuning. Our work marks an advancement in personalized video generation, paving the way for enhanced creative expression in the digital domain.

我们通过构建一个以人为中心的视频生成测试集,并与上述具有竞争力的ID保留视频生成方法进行比较,从多个通用指标上评估了我们的方法。广泛的实验和视觉评估表明,我们的方法成功生成了具有动态内容和强面部一致性的高质量视频,如图1所示。Magic Mirror是首个无需针对特定人物进行微调即可使用Video DiT实现定制化视频生成的方法。我们的工作标志着个性化视频生成领域的进步,为数字领域中的创造性表达铺平了道路。

Our main contributions are: (1) We introduce Magic Mirror, a novel fine-tuning-free framework for generating ID-preserving videos; (2) We design a lightweight adapter with conditioned adaptive normalization for effective integration of face embeddings in full-attention Diffusion Transformer architectures; (3) We develop a dataset construction method that combines synthetic data generation with a progressive training strategy to address data scarcity challenges in personalized video generation.

我们的主要贡献是:(1) 我们引入了 Magic Mirror,一种无需微调的新框架,用于生成保持身份的视频;(2) 我们设计了一个轻量级的适配器,带有条件自适应归一化,用于在全注意力 Diffusion Transformer 架构中有效集成面部嵌入;(3) 我们开发了一种数据集构建方法,结合了合成数据生成和渐进式训练策略,以解决个性化视频生成中的数据稀缺挑战。

2. Related Works

2. 相关工作

Diffusion Models. Since the introduction of DDPM [3], diffusion models have demonstrated remarkable capabilities across diverse domains, spanning NLP [20, 21], medical imaging [22, 23], and molecular modeling [24, 25]. In computer vision, following initial success in image generation [26, 27], Latent Diffusion Models (LDM) [1] significantly reduced computational requirements while maintaining generation quality. Subsequent developments in conditional architectures [28, 29] enabled fine-grained concept customization over the generation process.

扩散模型 (Diffusion Models)。自 DDPM [3] 提出以来,扩散模型在多个领域展现了卓越的能力,涵盖自然语言处理 (NLP) [20, 21]、医学影像 [22, 23] 和分子建模 [24, 25]。在计算机视觉领域,继图像生成 [26, 27] 的初步成功后,潜在扩散模型 (LDM) [1] 在保持生成质量的同时显著降低了计算需求。随后的条件架构发展 [28, 29] 使得生成过程能够实现细粒度的概念定制。

Video Generation via Diffusion Models. Following the emergence of diffusion models, their superior controllability and diversity in image generation [30] have led to their prominence over traditional approaches based on GANs [31–33] and auto-regressive Transformers [34–36]. The Video Diffusion Model (VDM) [37] pioneered video generation using diffusion models by extending the traditional U-Net [38] architecture to process temporal information. Subsequently, LVDM [39] demonstrated the effectiveness of latent space operations, while AnimateDiff [13] adapted text-to-image models for personalized video synthesis. A significant advancement came with the Diffusion Transformer (DiT) [17], which successfully merged Transformer architectures [40, 41] with diffusion models. Building on this foundation, Latte [42] emerged as the first open-source text-to-video model based on DiT. Following the breakthrough of SORA [43], several open-source initiatives including Open-Sora-Plan [44], Open-Sora [45], and CogVideoX [14] have advanced video generation through DiT architectures. While current research predominantly focuses on image-to-video translation [15, 16, 46] and motion control [47, 48], the critical challenge of ID-preserving video generation remains relatively unexplored.

基于扩散模型的视频生成。随着扩散模型的出现,其在图像生成中的卓越可控性和多样性 [30] 使其超越了基于 GAN [31–33] 和自回归 Transformer [34–36] 的传统方法。视频扩散模型 (VDM) [37] 通过扩展传统的 U-Net [38] 架构以处理时间信息,率先使用扩散模型进行视频生成。随后,LVDM [39] 展示了潜在空间操作的有效性,而 AnimateDiff [13] 则调整了文本到图像模型以用于个性化视频合成。扩散 Transformer (DiT) [17] 的出现标志着重大进展,它成功地将 Transformer 架构 [40, 41] 与扩散模型相结合。在此基础上,Latte [42] 成为首个基于 DiT 的开源文本到视频模型。随着 SORA [43] 的突破,包括 Open-Sora-Plan [44]、Open-Sora [45] 和 CogVideoX [14] 在内的多个开源项目通过 DiT 架构推动了视频生成的发展。尽管当前研究主要集中在图像到视频转换 [15, 16, 46] 和运动控制 [47, 48] 上,但 ID 保持视频生成这一关键挑战仍相对未被探索。

ID-Preserved Generation. ID-preserved generation, or identity customization, aims to maintain specific identity characteristics in generated images or videos. Initially developed in the GAN era [31] with significant advances in face generation [33, 49, 50], this field has evolved substantially with diffusion models, demonstrating enhanced capabilities in novel image synthesis [28, 51]. Current approaches to ID-preserved image generation fall into two main categories:

ID 保留生成。ID 保留生成,或身份定制,旨在在生成的图像或视频中保持特定的身份特征。最初在 GAN 时代 [31] 开发,并在人脸生成方面取得了显著进展 [33, 49, 50],该领域随着扩散模型的发展而大幅演进,展示了在新图像合成方面的增强能力 [28, 51]。当前 ID 保留图像生成的方法主要分为两大类:

Tuning-based Models: These approaches fine-tune models using one or more reference images to generate identityconsistent outputs. Notable examples include Textual Inversion [51] and Dreambooth [28].

基于调优的模型:这些方法通过使用一张或多张参考图像对模型进行微调,以生成身份一致的输出。典型的例子包括 Textual Inversion [51] 和 Dreambooth [28]。

Tuning-free Models: Addressing the computational overhead of tuning-based approaches, these models maintain high ID fidelity through additional conditioning and trainable parameters. Starting with IP-adapter [52], various methods like InstantID, PhotoMaker [5–10] have emerged to enable efficient, high-quality personalized generation.

免调优模型:为了解决基于调优方法的计算开销问题,这些模型通过额外的条件设置和可训练参数来保持高ID保真度。从IP-adapter [52]开始,出现了各种方法,如InstantID、PhotoMaker [5-10],以实现高效、高质量的个性化生成。

ID-preserved video generation introduces additional complexities, particularly in synthesizing realistic facial movements from static references while maintaining identity consistency. Current approaches include MagicMe [12], a tuning-based method requiring per-identity optimization, and ID-Animator [11], a tuning-free approach utilizing face adapters and decoupled Cross-Attention [52]. However, these methods face challenges in maintaining dynamic expressions while preserving identity, and are constrained by base model limitations in video quality, duration, and prompt adherence. The integration of Diffusion Transformers presents promising opportunities for advancing ID-preserved video generation.

ID 保留视频生成引入了额外的复杂性,特别是在从静态参考中合成逼真的面部运动的同时保持身份一致性。当前的方法包括 MagicMe [12],这是一种基于调优的方法,需要对每个身份进行优化,以及 ID-Animator [11],这是一种无需调优的方法,利用面部适配器和解耦的 Cross-Attention [52]。然而,这些方法在保持动态表情的同时保留身份方面面临挑战,并且受到基础模型在视频质量、时长和提示遵循方面的限制。Diffusion Transformers 的集成为推进 ID 保留视频生成提供了有前景的机会。

3. Magic Mirror

3. Magic Mirror

An overview of Magic Mirror is illustrated in Fig. 3. This dual-branch framework (Sec. 3.2) extracts facial identity features from one or more reference images $r$ . These embeddings are subsequently processed through a DiT backbone augmented with a lightweight cross-modal adapter, incorporating conditioned adaptive normalization (Sec. 3.3). This architecture enables Magic Mirror to synthesize identity-preserved text-to-video outputs. The following sections elaborate on the preliminaries of diffusion models (Sec. 3.1) and each component of our method.

Magic Mirror 的概述如图 3 所示。这个双分支框架(第 3.2 节)从一张或多张参考图像 $r$ 中提取面部身份特征。这些嵌入随后通过一个带有轻量级跨模态适配器的 DiT 主干进行处理,并结合条件自适应归一化(第 3.3 节)。这种架构使 Magic Mirror 能够合成保留身份的文本到视频输出。以下部分详细介绍了扩散模型的预备知识(第 3.1 节)和我们方法的每个组件。

3.1. Preliminaries

3.1. 预备知识

Latent Diffusion Models (LDMs) generate data by iteratively reversing a noise corruption process, converting random noise into structured samples. At time step $t \in \{0,\ldots,T\}$, the model predicts latent state $x_{t}$ conditioned on $x_{t+1}$:

潜在扩散模型 (Latent Diffusion Models, LDMs) 通过迭代逆转噪声污染过程,将随机噪声转换为结构化样本。在时间步 $t \in \{0,\ldots,T\}$ 时,模型根据 $x_{t+1}$ 预测潜在状态 $x_{t}$:

$$
p_{\theta}(x_{t}|x_{t+1})=\mathcal{N}(x_{t};\widetilde{\mu}_{t},\widetilde{\beta}_{t}I),
$$

where $\theta$ represents the model parameters, ${\widetilde{\mu}}_{t}$ denotes the predicted mean, and ${\widetilde{\beta}}_{t}$ is the variance schedule.

其中 $\theta$ 表示模型参数,${\widetilde{\mu}}_{t}$ 表示预测均值,${\widetilde{\beta}}_{t}$ 是方差调度。

The training objective typically employs a mean squared error (MSE) loss on the noise prediction $\hat{\epsilon}_{\theta}(x_{t},t,c_{\mathrm{txt}})$:

训练目标通常采用均方误差 (MSE) 损失对噪声预测 $\hat{\epsilon}_{\theta}(x_{t},t,c_{\mathrm{txt}})$ 进行优化:

$$
\mathcal{L}_{\mathrm{noise}}=w\cdot\mathbb{E}_{t,c_{\mathrm{txt}},\epsilon\sim\mathcal{N}(0,1)}\Big[\|\epsilon-\epsilon_{\theta}(x_{t},t,c_{\mathrm{txt}})\|^{2}\Big],
$$

where $c_{\mathrm{txt}}$ denotes the text condition.

其中 $c_{\mathrm{txt}}$ 表示文本条件。

Recent studies on controllable generation [7, 52–54] extend this framework by incorporating additional control signals, such as image condition $c_{\mathrm{img}}$ . This is achieved through a feature extractor $\tau_{\mathrm{img}}$ that processes a reference image $r$ : $c_{\mathrm{img}}=\tau_{\mathrm{img}}(r)$ . Consequently, the noise prediction function in Eq. (2) becomes $\epsilon_{\theta}(x_{t},t,c_{\mathrm{txt}},c_{\mathrm{img}})$ .

最近关于可控生成的研究 [7, 52–54] 通过引入额外的控制信号(如图像条件 $c_{\mathrm{img}}$)扩展了这一框架。这是通过一个特征提取器 $\tau_{\mathrm{img}}$ 实现的,该提取器处理参考图像 $r$:$c_{\mathrm{img}}=\tau_{\mathrm{img}}(r)$。因此,公式 (2) 中的噪声预测函数变为 $\epsilon_{\theta}(x_{t},t,c_{\mathrm{txt}},c_{\mathrm{img}})$。
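To make the preliminaries concrete, below is a minimal PyTorch sketch of the conditioned denoising objective in Eq. (2), extended with an image condition $c_{\mathrm{img}}$ as described above. The `denoiser` interface and the noise-schedule handling are illustrative assumptions, not the actual CogVideoX training code.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, x0, t, c_txt, c_img, alphas_cumprod, w=1.0):
    """Conditioned denoising loss sketch: corrupt x0 at step t, then regress
    the added noise with both text and image conditions."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward corruption
    eps_pred = denoiser(x_t, t, c_txt=c_txt, c_img=c_img)  # eps_theta(x_t, t, c_txt, c_img)
    return w * F.mse_loss(eps_pred, eps)
```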

3.2. Decoupled Facial Feature Extraction

3.2. 解耦的面部特征提取

The facial feature extraction component of Magic Mirror is illustrated in the left part of Fig. 3. Given an ID reference image $r \in \mathbb{R}^{h\times w\times 3}$, our model extracts facial condition embeddings $c_{\mathrm{face}}=\{x_{\mathrm{face}},x_{\mathrm{fuse}}\}$ using hybrid query-based feature perceivers. These embeddings capture both high-level identity features and facial structural information:

Magic Mirror 的面部特征提取组件如图 3 左侧所示。给定一个 ID 参考图像 $r \in \mathbb{R}^{h\times w\times 3}$,我们的模型使用基于混合查询的特征感知器提取面部条件嵌入 $c_{\mathrm{face}}=\{x_{\mathrm{face}},x_{\mathrm{fuse}}\}$。这些嵌入捕捉了高层次的身份特征和面部结构信息:


Figure 3. Overview of Magic Mirror. The framework employs a dual-branch feature extraction system with ID and face perceivers, followed by a cross-modal adapter (illustrated in Fig. 4) for DiT-based video generation. By optimizing trainable modules marked by the flame, our method efficiently integrates facial features for controlled video synthesis while maintaining model efficiency.

图 3: Magic Mirror 概述。该框架采用双分支特征提取系统,包含 ID 和面部感知器,随后通过跨模态适配器(如图 4 所示)进行基于 DiT 的视频生成。通过优化标有火焰的可训练模块,我们的方法有效地整合了面部特征以实现可控的视频合成,同时保持模型效率。

$$
x_{\mathrm{face}}=\tau_{\mathrm{face}}(q_{\mathrm{face}},\mathbf{f})
$$

where $\mathbf{f}=F_{\mathrm{face}}(r)$ represents dense feature maps extracted from the pre-trained CLIP ViT $F_{\mathrm{face}}$ [55]. Both perceivers $\tau_{\mathrm{face}}$ and $\tau_{\mathrm{id}}$ utilize the standard Q-Former [56] architecture with distinct query conditions $q_{\mathrm{face}}$ and $q_{\mathrm{id}}$. Here, $q_{\mathrm{face}}$ is a learnable embedding for facial structure extraction, while $q_{\mathrm{id}}=F_{\mathrm{id}}(r)$ represents high-level facial features extracted by a face encoder $F_{\mathrm{id}}$ [5, 57]. Each perceiver employs cross-attention between iteratively updated queries and dense features to obtain compressed feature embeddings.

其中 $\mathbf{f}=F_{\mathrm{face}}(r)$ 表示从预训练的 CLIP ViT $F_{\mathrm{face}}$ [55] 中提取的密集特征图。两个感知器 $\tau_{\mathrm{face}}$ 和 $\tau_{\mathrm{id}}$ 都使用标准的 Q-Former [56] 架构,并具有不同的查询条件 $q_{\mathrm{face}}$ 和 $q_{\mathrm{id}}$。这里,$q_{\mathrm{face}}$ 是一个可学习的嵌入,用于提取面部结构,而 $q_{\mathrm{id}}=F_{\mathrm{id}}(r)$ 表示由面部编码器 $F_{\mathrm{id}}$ [5, 57] 提取的高级面部特征。每个感知器通过在迭代更新的查询和密集特征之间进行交叉注意力来获得压缩的特征嵌入。

The compressed embeddings are integrated through a decoupled mechanism. Following recent approaches in new-concept customization [5, 28], we fuse facial embeddings with text embeddings at identity-relevant tokens (e.g., man, woman) in the input prompt $q_{\mathrm{txt}}$, as expressed in Eq. (4). A fusion MLP $F_{\mathrm{fuse}}$ projects $x_{\mathrm{id}}$ into the text embedding space. The final text embedding for DiT input is computed as $\hat{x}_{\mathrm{txt}}=\mathbf{m}\,x_{\mathrm{fuse}}+(1-\mathbf{m})\,x_{\mathrm{txt}}$, where $\mathbf{m}$ denotes a token-level binary mask indicating the fused token positions in $q_{\mathrm{txt}}$.

压缩后的嵌入通过解耦机制进行整合。借鉴新概念定制的最新方法 [5, 28],我们将面部嵌入与输入提示 $q_{\mathrm{txt}}$ 中与身份相关的 Token(例如,男人、女人)处的文本嵌入进行融合,如公式 (4) 所示。融合 MLP $F_{\mathrm{fuse}}$ 将 $x_{\mathrm{id}}$ 投影到文本嵌入空间。DiT 输入的最终文本嵌入计算为 $\hat{x}_{\mathrm{txt}}=\mathbf{m}\,x_{\mathrm{fuse}}+(1-\mathbf{m})\,x_{\mathrm{txt}}$,其中 $\mathbf{m}$ 表示指示 $q_{\mathrm{txt}}$ 中融合 Token 位置的 Token 级二进制掩码。
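The following PyTorch sketch illustrates the idea behind the query-based perceivers and the token-level identity fusion described above. The module sizes, the mean-pooling of ID tokens, and the class/function names are illustrative assumptions, not the released Magic Mirror implementation.

```python
import torch
import torch.nn as nn

class FacePerceiver(nn.Module):
    """Q-Former-style perceiver sketch: a small set of queries repeatedly
    cross-attends to dense CLIP features f and returns compressed embeddings."""
    def __init__(self, dim, n_heads=8, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)
        )

    def forward(self, q, f):          # q: (B, Nq, D), f: (B, Nf, D)
        for attn in self.layers:
            q = q + attn(q, f, f)[0]  # iteratively refine queries against dense features
        return q

def fuse_identity_tokens(x_txt, x_id, mask, fusion_mlp):
    """Blend projected ID features into the text embedding at identity-relevant
    token positions: x_hat_txt = m * x_fuse + (1 - m) * x_txt."""
    x_fuse = fusion_mlp(x_id.mean(dim=1, keepdim=True)).expand_as(x_txt)
    m = mask.unsqueeze(-1).to(x_txt.dtype)   # (B, Ntxt, 1) token-level binary mask
    return m * x_fuse + (1.0 - m) * x_txt
```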

3.3. Conditioned Adaptive Normalization

3.3. 条件自适应归一化

Having obtained the decoupled ID-aware conditions $c_{\mathrm{face}}$, we address the challenge of their efficient integration into the video diffusion transformer. Traditional Latent Diffusion Models, exemplified by Stable Diffusion [1], utilize isolated cross-attention mechanisms for condition injection, facilitating straightforward adaptation to new conditions through decoupled cross-attention [6, 52]. This approach is enabled by uniform conditional inputs (specifically, the text condition $c_{\mathrm{txt}}$) across all cross-attention layers. However, our framework builds upon CogVideoX [14], which implements a cross-modal full-attention paradigm with layer-wise distribution modulation experts. This architectural choice introduces additional complexity in adapting to new conditions beyond simple cross-attention augmentation.

在获得解耦的ID感知条件 $c_{\mathrm{face}}$ 后,我们解决了将其高效集成到视频扩散Transformer中的挑战。传统的潜在扩散模型(Latent Diffusion Models),以Stable Diffusion [1] 为例,使用孤立的交叉注意力机制进行条件注入,通过解耦的交叉注意力 [6, 52] 实现了对新条件的直接适应。这种方法依赖于所有交叉注意力层中统一的输入条件(特别是文本条件 $c_{\mathrm{txt}}$)。然而,我们的框架基于CogVideoX [14],它实现了跨模态的全注意力范式,并带有分层分布调制专家。这种架构选择在适应新条件时引入了额外的复杂性,超出了简单的交叉注意力增强。

Leveraging CogVideoX’s layer-wise modulation [14], we propose a lightweight architecture that incorporates additional conditions while preserving the model’s spatiotemporal relationship modeling capacity. As illustrated in Fig. 4, the facial embedding $x_{\mathrm{face}}$ is concatenated with the text and video features ($x_{\mathrm{txt}}$ and $x_{\mathrm{vid}}$) for joint full self-attention. CogVideoX employs modal-specific modulation, where factors $m_{\mathrm{vid}}$ and $m_{\mathrm{txt}}$ are applied to their respective features through adaptive normalization modules $\varphi_{\mathrm{txt,vid}}$. To accommodate the facial modality, we introduce a dedicated adaptive normalization module $\varphi_{\mathrm{face}}$, normalizing facial features preceding the self-attention and feed-forward network (FFN). The corresponding set of modulation factors $m_{\mathrm{face}}$ is computed as:

利用 CogVideoX 的分层调制 [14],我们提出了一种轻量级架构,该架构在保留模型时空关系建模能力的同时,引入了额外的条件。如图 4 所示,面部嵌入 $x_{\mathrm{face}}$ 通过全自注意力机制与文本和视频特征($x_{\mathrm{txt}}$ 和 $x_{\mathrm{vid}}$)进行拼接。CogVideoX 采用模态特定的调制,其中因子 $m_{\mathrm{vid}}$ 和 $m_{\mathrm{txt}}$ 通过自适应归一化模块 $\varphi_{\mathrm{txt,vid}}$ 应用于各自的特征。为了适应面部模态,我们引入了一个专用的自适应归一化模块 $\varphi_{\mathrm{face}}$,在自注意力和前馈网络 (FFN) 之前对面部特征进行归一化。相应的调制因子集 $m_{\mathrm{face}}$ 计算如下:

$$
m_{\mathrm{face}}=\{\mu_{\mathrm{face}}^{1},\sigma_{\mathrm{face}}^{1},\gamma_{\mathrm{face}}^{1},\mu_{\mathrm{face}}^{2},\sigma_{\mathrm{face}}^{2},\gamma_{\mathrm{face}}^{2}\}=\varphi_{\mathrm{face}}(\mathbf{t},l),
$$

where $\mathbf{t}$ denotes the time embedding and $l$ represents the layer index. Let $F^{n}$ denote the in-block operations, where $F^{1}$ represents attention and $F^{2}$ represents the FFN. The feature transformation around operation $n$ applies a scale $\sigma$, shift $\mu$, and gating $\gamma$: $\bar{x}^{n}=x^{n-1}*(1+\sigma^{n})+\mu^{n}$, then $x^{n}={\bar{x}}^{n}+\gamma^{n}F^{n}({\bar{x}}^{n})$, with modality-specific subscripts omitted for brevity.

其中 $\mathbf{t}$ 表示时间嵌入,$l$ 表示层索引。令 $F^{n}$ 表示块内操作,其中 $F^{1}$ 表示注意力机制,$F^{2}$ 表示前馈神经网络 (FFN)。操作 $n$ 后的特征变换通过尺度 $\sigma$、偏移 $\mu$ 和门控 $\gamma$ 计算,表示为:$\bar{x}^{n}=x^{n-1}*(1+\sigma^{n})+\mu^{n}$,然后 $x^{n}={\bar{x}}^{n}+\gamma^{n}F^{n}({\bar{x}}^{n})$,为简洁起见省略了模态特定的下标。
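A minimal sketch of this scale/shift/gate modulation, following the formula exactly as written above (in practice an adaptive layer norm usually precedes the modulation; that detail is omitted here):

```python
import torch

def modulated_op(x, op, mu, sigma, gamma):
    """x_bar = x * (1 + sigma) + mu ;  x_out = x_bar + gamma * op(x_bar).
    `op` is the in-block operation F^n (attention or the FFN)."""
    x_bar = x * (1.0 + sigma) + mu
    return x_bar + gamma * op(x_bar)

# One DiT block applies this twice per modality: once with op = attention,
# once with op = feed-forward, using that modality's modulation factors.
```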

Furthermore, to enhance the distribution learning capability of text and video latents from specific reference IDs, we introduce Conditioned Adaptive Normalization (CAN), inspired by class-conditioned DiT [17] and StyleGAN’s [33] approach to condition control. CAN predicts distribution shifts for video and text modalities:

此外,为了增强从特定参考ID中学习文本和视频潜在分布的能力,我们引入了条件自适应归一化 (Conditioned Adaptive Normalization, CAN),其灵感来源于类条件DiT [17] 和 StyleGAN [33] 的条件控制方法。CAN 预测视频和文本模态的分布偏移:

$$
\hat{m}_{\mathrm{vid}},\hat{m}_{\mathrm{txt}}=\varphi_{\mathrm{cond}}(\mathbf{t},l,\mu_{\mathrm{vid}}^{1},x_{\mathrm{id}}).
$$

Here, $\mu_{\mathrm{vid}}^{1}$ acts as a distribution identifier for a better initialization of the CAN module, and $x_{\mathrm{id}}$ from Eq. (4) represents the facial embedding prior to the fusion MLP. The final modulation factors are computed through residual addition: $m_{\mathrm{vid}}=\hat{m}_{\mathrm{vid}}+\varphi_{\mathrm{vid}}(\mathbf{t},l)$, $m_{\mathrm{txt}}=\hat{m}_{\mathrm{txt}}+\varphi_{\mathrm{txt}}(\mathbf{t},l)$.

这里,$\mu_{\mathrm{vid}}^{1}$ 作为分布标识符,用于更好地初始化 CAN 模块,而来自公式 (4) 的 $x_{\mathrm{id}}$ 表示融合 MLP 之前的面部嵌入。最终的调制因子通过残差加法计算得出:$m_{\mathrm{vid}}=\hat{m}_{\mathrm{vid}}+\varphi_{\mathrm{vid}}(\mathbf{t},l)$,$m_{\mathrm{txt}}=\hat{m}_{\mathrm{txt}}+\varphi_{\mathrm{txt}}(\mathbf{t},l)$。
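As a rough illustration, $\varphi_{\mathrm{cond}}$ can be realized as a small MLP mapping the concatenated conditioning signals to residual modulation factors. The hidden sizes, the maximum layer count, and the pooling of $x_{\mathrm{id}}$ below are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditionedAdaptiveNorm(nn.Module):
    """CAN sketch: predict residual modulation factors for the video and text
    branches from (time embedding, layer index, mu_vid^1, x_id)."""
    def __init__(self, dim, n_factors=6, max_layers=64):
        super().__init__()
        self.layer_embed = nn.Embedding(max_layers, dim)
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, dim), nn.SiLU(),
            nn.Linear(dim, 2 * n_factors * dim),   # deltas for video + text
        )

    def forward(self, t_emb, layer_idx, mu_vid_1, x_id):
        h = torch.cat(
            [t_emb, self.layer_embed(layer_idx), mu_vid_1, x_id.mean(dim=1)], dim=-1
        )
        delta_vid, delta_txt = self.mlp(h).chunk(2, dim=-1)
        # final factors: m_vid = delta_vid + phi_vid(t, l), m_txt = delta_txt + phi_txt(t, l)
        return delta_vid, delta_txt
```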


Figure 4. Cross-modal adapter in DiT blocks, featuring Conditioned Adaptive Normalization (CAN) for modal-specific feature modulation and decoupled attention integration.

图 4: DiT 块中的跨模态适配器,采用条件自适应归一化 (Conditioned Adaptive Normalization, CAN) 进行模态特定特征调制和解耦注意力集成。

We find that this conditional shift prediction $\varphi_{\mathrm{cond}}$ is well suited to a simple MLP implementation.

我们发现这种条件转移预测 $\varphi_{\mathrm{cond}}$ 适合用 MLP 实现。

Complementing the conditioned normalization, we augment the joint full self-attention $T_{\mathrm{SA}}$ with a cross-attention mechanism $T_{\mathrm{CA}}$ [11, 52] to enhance ID modality feature aggregation. The attention output $x_{\mathrm{out}}$ is computed as:

为了补充条件归一化,我们通过交叉注意力机制 $T_{\mathrm{CA}}$ [11, 52] 增强了联合全自注意力 $T_{\mathrm{SA}}$,以增强 ID 模态特征聚合。注意力输出 $x_{\mathrm{out}}$ 计算如下:

$$
x_{\mathrm{out}}=T_{\mathrm{SA}}\big(W_{qkv}(x_{\mathrm{full}})\big)+T_{\mathrm{CA}}\big(W_{q}(x_{\mathrm{full}}),W_{kv}(x_{\mathrm{face}})\big),
$$

where $T_{\mathrm{SA}}$ and $T_{\mathrm{CA}}$ utilize the same query projection $W_{q}(x_{\mathrm{full}})$, while the key-value projections $W_{kv}$ in the cross-attention branch are re-initialized and trainable.

其中 $T_{\mathrm{SA}}$ 和 $T_{\mathrm{CA}}$ 使用相同的查询投影 $W_{q}(x_{\mathrm{full}})$,而交叉注意力中的键值投影 $W_{k v}$ 被重新初始化并可训练。
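A self-contained sketch of this decoupled attention, assuming a single fused projection $W_{qkv}$ whose query slice is reused by the face cross-attention branch; the head count and the shared output projection are illustrative choices, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledFaceAttention(nn.Module):
    """Joint full self-attention over [text; video; face] tokens plus a face
    cross-attention branch that reuses the same query projection."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.w_qkv = nn.Linear(dim, 3 * dim)      # shared QKV for self-attention
        self.w_kv_face = nn.Linear(dim, 2 * dim)  # re-initialized, trainable K/V for x_face
        self.proj = nn.Linear(dim, dim)

    def _split(self, x):
        b, n, d = x.shape
        return x.view(b, n, self.n_heads, d // self.n_heads).transpose(1, 2)

    def forward(self, x_full, x_face):
        q, k, v = self.w_qkv(x_full).chunk(3, dim=-1)
        k_f, v_f = self.w_kv_face(x_face).chunk(2, dim=-1)
        out_sa = F.scaled_dot_product_attention(self._split(q), self._split(k), self._split(v))
        out_ca = F.scaled_dot_product_attention(self._split(q), self._split(k_f), self._split(v_f))
        out = (out_sa + out_ca).transpose(1, 2).reshape(x_full.shape)
        return self.proj(out)
```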

3.4. Data and Training

3.4. 数据与训练

Training a zero-shot customization adapter presents unique data challenges compared to fine-tuning approaches, like Magic-Me [12]. Our model’s full-attention architecture, which integrates spatial and temporal components inseparably, necessitates a two-stage training strategy. As shown in Fig. 5, we begin by training on diverse, high-quality datasets to develop robust identity preservation capabilities.

训练零样本定制适配器与微调方法(如 Magic-Me [12])相比,面临着独特的数据挑战。我们的模型采用全注意力架构,将空间和时间组件不可分割地结合在一起,因此需要采用两阶段训练策略。如图 5 所示,我们首先在多样化的高质量数据集上进行训练,以开发强大的身份保持能力。

Our progressive training pipeline leverages diverse datasets to enhance model performance, particularly in identity preservation. For image pre-training, we first utilize the LAION-Face [58] dataset, which contains web-scale real images and provides a rich source for generating self-reference images. To further increase identity diversity, we utilize the SFHQ [59] dataset, which applies self-reference techniques with standard text prompts. To prevent overfitting and promote the generation of diverse face-head motion, we use the FFHQ [33] dataset as a base. From this, we randomly sample text prompts from a prompt pool of human image captions and synthesize ID-conditioned image pairs using PhotoMaker-V2 [5], ensuring both identity similarity and diversity through careful filtering.

我们的渐进式训练管道利用多样化的数据集来增强模型性能,特别是在身份保持方面。对于图像预训练,我们首先使用 LAION-Face [58] 数据集,该数据集包含网络规模的真实图像,并为生成自参考图像提供了丰富的资源。为了进一步增加身份多样性,我们使用 SFHQ [59] 数据集,该数据集通过标准文本提示应用自参考技术。为了防止过拟合并促进多样化的人脸-头部运动生成,我们使用 FFHQ [33] 数据集作为基础。在此基础上,我们从人类图像描述的提示池中随机采样文本提示,并使用 PhotoMaker-V2 [5] 合成 ID 条件图像对,通过仔细筛选确保身份相似性和多样性。


Figure 5. Overview of our training datasets. The pipeline includes image pre-training data (A-D) and video post-training data (D). We utilize both self-reference data (A, B) and filtered synthesized pairs with the same identity (C, D). Numbers of (images + synthesized images) are reported.

图 5: 我们的训练数据集概览。该流程包括图像预训练数据 (A-D) 和视频后训练数据 (D)。我们使用了自参考数据 (A, B) 以及经过筛选的具有相同身份的合成数据对 (C, D)。报告了 (图像 + 合成图像) 的数量。

For video post-training, we leverage the high-quality Pexels and Mixkit datasets, along with a small collection of self-collected videos from the web. Similarly, synthesized images corresponding to the face reference in each keyframe are generated as references. The combined dataset offers rich visual content for training the model across images and videos.

对于视频的后训练,我们利用了高质量的 Pexels 和 Mixkit 数据集,以及一小部分从网络上自行收集的视频。同样,生成了与关键帧的每个面部参考相对应的合成图像数据作为参考。这个组合数据集为跨图像和视频训练模型提供了丰富的视觉内容。

The objective function combines identity-aware and general denoising losses: $\mathcal{L}=\mathcal{L}_{\mathrm{noise}}+\lambda\left(1-\cos(q_{\mathrm{face}},D(x_{0}))\right)$, where $D(\cdot)$ represents the latent decoder for the denoised latent $x_{0}$, and $\lambda$ is the balance factor. Following PhotoMaker [5], we compute the denoising loss specifically within the face area for $50\%$ of random training samples.

目标函数结合了身份感知和通用去噪损失:$\mathcal{L}=\mathcal{L}_{\mathrm{noise}}+\lambda\left(1-\cos(q_{\mathrm{face}},D(x_{0}))\right)$,其中 $D(\cdot)$ 表示去噪潜在变量 $x_{0}$ 的潜在解码器,$\lambda$ 是平衡因子。根据 PhotoMaker [5],我们在随机训练样本的 $50\%$ 中特别计算了面部区域的去噪损失。
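A hedged sketch of this combined objective is given below. The text writes the identity term as $\cos(q_{\mathrm{face}}, D(x_{0}))$; here we assume a face encoder is applied to the decoded frames so the two embeddings are comparable, and the value of $\lambda$ and the face-mask handling are placeholders.

```python
import torch
import torch.nn.functional as F

def combined_id_loss(noise_pred, noise, q_face, x0_pred, decoder, face_encoder,
                     lam=0.1, face_mask=None):
    """Denoising MSE plus an identity term 1 - cos(q_face, F_id(D(x0)))."""
    if face_mask is not None:                  # optionally restrict the MSE to the face area
        noise_pred, noise = noise_pred * face_mask, noise * face_mask
    l_noise = F.mse_loss(noise_pred, noise)
    frames = decoder(x0_pred)                  # D(x0): decode the denoised latent
    id_emb = face_encoder(frames)              # assumed face-embedding extractor
    l_id = 1.0 - F.cosine_similarity(id_emb, q_face, dim=-1).mean()
    return l_noise + lam * l_id
```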

4. Experiments

4. 实验

4.1. Implementation Details

4.1. 实现细节

Dataset preparation. As illustrated in Fig. 5, our training pipeline leverages both self-referenced and synthetically paired image data [33, 58, 59] for identity-preserving alignment in the initial training phase. For synthetic data pairs (denoted as C and D in Fig. 5), we employ ArcFace [57] for facial recognition and detection to extract key attributes including age, bounding box coordinates, gender, and facial embeddings. Reference frames are then generated using PhotoMaker-V2 [5]. We implement a quality control process by filtering image pairs $\{a,b\}$ based on their facial embedding cosine similarity, retaining pairs where $d(q_{\mathrm{face}}^{a},q_{\mathrm{face}}^{b})>0.65$. For text conditioning, we utilize MiniGemini-8B [60] to caption all video data, forming a diverse prompt pool containing 29K prompts, while CogVLM [61] provides video descriptions in the second training stage. Detailed data collection procedures are provided in Appendix A.1.

数据集准备。如图 5 所示,我们的训练流程在初始训练阶段利用了自参考和合成配对的图像数据 [33, 58, 59] 来进行身份保持对齐。对于合成数据对(在图 5 中表示为 C 和 D),我们使用 ArcFace [57] 进行面部识别和检测,以提取包括年龄、边界框坐标、性别和面部嵌入在内的关键属性。然后使用 PhotoMaker-V2 [5] 生成参考帧。我们通过基于面部嵌入余弦相似度过滤图像对 $\{a,b\}$ 来实现质量控制,保留 $d(q_{\mathrm{face}}^{a},q_{\mathrm{face}}^{b})>0.65$ 的对。对于文本条件,我们使用 MiniGemini-8B [60] 为所有视频数据生成字幕,形成一个包含 29K 提示的多样化提示池,而 CogVLM [61] 在第二个训练阶段提供视频描述。详细的数据收集过程见附录 A.1。
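The pair-filtering step can be sketched as follows; `face_embed` stands in for an ArcFace-style embedding function, and the 0.65 threshold follows the criterion stated above.

```python
import numpy as np

def filter_identity_pairs(pairs, face_embed, threshold=0.65):
    """Keep synthesized pairs {a, b} whose facial-embedding cosine similarity
    exceeds the threshold, i.e. d(q_face^a, q_face^b) > 0.65."""
    kept = []
    for a, b in pairs:
        ea, eb = face_embed(a), face_embed(b)
        cos = float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-8))
        if cos > threshold:
            kept.append((a, b))
    return kept
```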

Training Details. Our Magic Mirror framework extends CogVideoX-5B [14] by integrating facial-specific modal adapters into alternating DiT layers (i.e., adapters in all layers with even index $l$). We adopt the feature extractor $F_{\mathrm{face}}$ and ID perceiver $\tau_{\mathrm{id}}$ from a pre-trained PhotoMaker-V2 [5]. In the image pre-training stage, we optimize the adapter components for 30K iterations using a global batch size of 64. Subsequently, we perform video fine-tuning for 5K iterations with a batch size of 8 to enhance temporal consistency in video generation. Both phases employ a decayed learning rate starting from $10^{-5}$. All experiments were conducted on a single compute node with 8 NVIDIA A800 GPUs.

训练细节。我们的 Magic Mirror 框架通过将面部特定模态适配器集成到交替的 DiT 层(即所有偶数索引 $l$ 的层)中,扩展了 CogVideoX-5B [14]。我们采用了预训练的 PhotoMaker-V2 [5] 中的特征提取器 $F_{\mathrm{face}}$ 和 ID 感知器 $\tau_{\mathrm{id}}$。在图像预训练阶段,我们使用全局批量大小为 64 优化适配器组件,进行 30K 次迭代。随后,我们以批量大小为 8 进行 5K 次迭代的视频微调,以增强视频生成中的时间一致性。两个阶段均采用从 $10^{-5}$ 开始的衰减学习率。所有实验均在配备 8 个 NVIDIA A800 GPU 的单个计算节点上进行。
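A hypothetical sketch of how adapters could be attached only to even-indexed DiT blocks while the backbone stays frozen; the attribute name `face_adapter` and the `make_adapter` factory are illustrative, not the released training code.

```python
import torch.nn as nn

def attach_adapters(dit_blocks, make_adapter):
    """Insert a cross-modal adapter into every even-indexed DiT block."""
    adapters = nn.ModuleList()
    for l, block in enumerate(dit_blocks):
        if l % 2 == 0:
            adapter = make_adapter(block)
            block.face_adapter = adapter   # assumed attribute consumed inside the block
            adapters.append(adapter)
    return adapters

def freeze_backbone_train_adapters(model, adapters):
    """Freeze the video backbone; optimize only adapter parameters."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in adapters.parameters():
        p.requires_grad_(True)
    return [p for p in adapters.parameters() if p.requires_grad]
```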

Evaluation and Comparisons. We evaluate our approach against the state-of-the-art ID-consistent video generation model ID-Animator [11] and leading Image-to-Video (I2V) frameworks, including DynamiCrafter [15], CogVideoX [14], and EasyAnimate [16]. Our evaluation leverages the standardized VBench [18] for video generation assessment, which measures motion quality and text-motion alignment. For identity preservation, we utilize facial recognition embedding similarity [62] and facial motion metrics. Our evaluation dataset consists of 40 single-character prompts from VBench, ensuring demographic diversity, and 40 action-specific prompts for motion assessment. Identity references are sampled from 50 face identities from PubFig [63], generating four personalized videos per identity across varied prompts.

评估与比较。我们将我们的方法与最先进的ID一致性视频生成模型ID-Animator [11]以及领先的图像到视频(I2V)框架(包括DynamiCrafter [15]、CogVideoX [14]和EasyAnimate [16])进行了比较。我们的评估利用了标准化的VBench [18],用于视频生成评估,衡量运动质量和文本-运动对齐。对于身份保留,我们使用了面部识别嵌入相似度 [62] 和面部运动指标。我们的评估数据集包括来自VBench的40个单角色提示,确保人口多样性,以及40个动作特定的提示用于运动评估。身份参考从PubFig [63] 的50个面部身份中采样,每个身份在不同提示下生成四个个性化视频。

4.2. Quantitative Evaluation

4.2. 定量评估

The quantitative results are summarized in Tab. 1. We evaluate generated videos using VBench’s and EvalCrafter’s general metrics [18, 64], including dynamic degree, text-prompt consistency, and Inception Score [65] for video quality assessment. For identity preservation, we introduce Average Similarity instead of similarity with the reference image, measuring the distance between generated faces and the average similarity of reference images for each identity. This prevents models from achieving artificially high scores through naive copy-paste behavior, as illustrated in Fig. 2. Face motion is quantified using two metrics: $\mathrm{FM}_{\mathrm{ref}}$ (relative distance to the reference face) and $\mathrm{FM}_{\mathrm{inter}}$ (inter-frame distance), both computed from RetinaFace [66] landmarks after position alignment, with the L2 distance between normalized coordinates reported as the metric.

定量结果总结在表 1 中。我们使用 VBench 和 EvalCrafter 的通用指标 [18, 64] 评估生成的视频,包括动态程度、文本提示一致性和用于视频质量评估的 Inception Score [65]。对于身份保持,我们引入了平均相似度 (Average Similarity) 而不是与参考图像的相似度,测量生成的面部与每个身份参考图像的平均相似度之间的距离。这防止了模型通过简单的复制粘贴行为获得人为高分,如图 2 所示。面部运动使用两个指标进行量化:$\mathrm{FM}_{\mathrm{ref}}$(与参考面部的相对距离)和 $\mathrm{FM}_{\mathrm{inter}}$(帧间距离),使用 RetinaFace [66] 地标在位置对齐后计算,归一化坐标之间的 L2 距离作为指标报告。
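An illustrative implementation of the two face-motion metrics, under the assumption that per-frame landmarks have already been detected (e.g. with RetinaFace); the normalization scheme here is a plausible stand-in rather than the paper's exact protocol.

```python
import numpy as np

def face_motion_metrics(frame_landmarks, ref_landmarks):
    """FM_ref: mean L2 distance of each frame's aligned landmarks to the reference.
    FM_inter: mean L2 distance between consecutive frames' aligned landmarks."""
    def align(lm):
        lm = lm - lm.mean(axis=0, keepdims=True)       # position alignment (centering)
        return lm / (np.abs(lm).max() + 1e-8)          # normalize coordinates
    ref = align(np.asarray(ref_landmarks))
    frames = [align(np.asarray(lm)) for lm in frame_landmarks]
    fm_ref = float(np.mean([np.linalg.norm(f - ref) for f in frames]))
    fm_inter = float(np.mean([np.linalg.norm(frames[i + 1] - frames[i])
                              for i in range(len(frames) - 1)]))
    return fm_ref, fm_inter
```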

Table 1. Quantitative comparisons. We report results for Image-to-Video and ID-preserved models. ID similarities are evaluated on the corresponding face-enhanced prompts to avoid missing faces caused by complex prompts.

Table 2. User study results.

表 1. 定量比较。我们报告了图像到视频和ID保留模型的结果。ID相似度在相应的面部增强提示上进行评估,以避免复杂提示导致的面部缺失。

表 2. 用户研究结果。

| 模型 | 视觉质量 | 文本对齐 | 动态程度 | ID相似度 |
| --- | --- | --- | --- | --- |
| DynamiCrafter [15] | 6.03 | 7.29 | 4.85 | 5.87 |
| EasyAnimate-I2V [16] | 6.62 | 8.21 | 5.57 | 6.01 |
| CogVideoX-I2V [14] | 6.86 | 8.31 | 6.55 | 6.22 |
| ID-Animator [11] | 5.63 | 6.37 | 4.06 | 6.70 |
| Magic Mirror | 6.97 | 8.88 | 7.02 | 6.39 |

Our method achieves superior facial similarity scores compared to I2V approaches while maintaining competitive performance relative to ID-Animator. We demonstrate strong text alignment, video quality, and dynamic performance, attributed to our decoupled facial feature extraction and the cross-modal adapter with conditioned adaptive normalization.

我们的方法在保持与 ID-Animator 竞争性能的同时,相比 I2V 方法实现了更高的面部相似度分数。我们展示了强大的文本对齐、视频质量和动态性能,这归功于我们解耦的面部特征提取和带有条件自适应归一化的跨模态适配器。

Besides, we analyze the facial similarity drop across uniformly sampled frames from each video to assess temporal identity consistency, reported as the similarity decay term in Tab. 1. Standard I2V models (CogVideoX-I2V [14], EasyAnimate [16]) exhibit significant temporal decay in identity preservation. While DynamiCrafter [15] shows better stability due to its random reference strategy, it compromises fidelity. Both ID-Animator [11] and our Magic Mirror maintain consistent identity preservation throughout the video duration.

此外,我们分析了从每个视频中均匀采样的帧之间的面部相似度下降情况,以评估时间上的身份一致性,并在表 1 中报告为相似度衰减项。标准的 I2V 模型 (CogVideoX-I2V [14], EasyAnimate [16]) 在身份保持方面表现出显著的时间衰减。而 DynamiCrafter [15] 由于其随机参考策略表现出更好的稳定性,但牺牲了保真度。ID-Animator [11] 和我们的 Magic Mirror 在整个视频持续时间内都保持了一致的身份保持。


Figure 6. Qualitative comparisons. Captions and reference identity images are presented in the top-left corner for each case.

图 6: 定性比较。每个案例的左上角展示了标题和参考身份图像。

4.3. Qualitative Evaluation

4.3. 定性评估

Beyond the examples shown in Fig. 1, we present comparative results in Fig. 6. Our method maintains high text coherence, motion dynamics, and video quality compared to conventional CogVideoX inference. When compared to existing image-to-video approaches [5, 14–16], Magic Mirror demonstrates superior identity consistency across frames while preserving natural motion. Our method also achieves enhanced dynamic range and text alignment compared to ID-Animator [11], which exhibits limitations in motion variety and prompt adherence.

除了图 1 中展示的示例外,我们在图 6 中展示了对比结果。与传统 CogVideoX 推理相比,我们的方法保持了较高的文本连贯性、运动动态性和视频质量。与现有的图像到视频方法 [5, 14–16] 相比,Magic Mirror 在保持自然运动的同时,展示了跨帧的优越身份一致性。与 ID-Animator [11] 相比,我们的方法还实现了增强的动态范围和文本对齐,而 ID-Animator 在运动多样性和提示遵循方面存在局限性。

To complement our quantitative metrics, we conducted a comprehensive user study to evaluate the perceptual quality of generated results. The study involved 173 participants who assessed the outputs across four key aspects: motion dynamics, text-motion alignment, video quality, and identity consistency. Participants rated each aspect on a scale of 1-10, with results summarized in Tab. 2. As shown in the overall preference scores in Tab. 1, Magic Mirror consistently outperforms baseline methods across all evaluated dimensions, demonstrating its superior perceptual quality in human assessment.

为了补充我们的定量指标,我们进行了一项全面的用户研究,以评估生成结果的感知质量。该研究涉及173名参与者,他们从四个关键方面评估输出结果:运动动态、文本-运动对齐、视频质量和身份一致性。参与者对每个方面进行1-10分的评分,结果总结在表2中。如表1中的总体偏好分数所示,Magic Mirror在所有评估维度上始终优于基线方法,展示了其在人类评估中的卓越感知质量。


Figure 7. Examples for ablation studies. Left: Ablation on modules. Right: Ablation on training strategies.

图 7: 消融研究示例。左:模块消融。右:训练策略消融。

4.4. Ablation Studies

4.4. 消融研究

Condition-related Modules. We evaluate our key architectural components through ablation studies, shown in the left section of Fig. 7. Without the reference feature embedding branch, the model loses crucial high-level attention guidance, significantly degrading identity fidelity. The conditioned adaptive normalization (CAN) proves vital for distribution alignment, enhancing identity preservation across frames. The effectiveness of CAN for facial condition injection is further demonstrated in Fig. 8, showing improved training convergence for identity information capture during the image pre-training stage.

条件相关模块。我们通过消融研究评估了关键架构组件,如图 7 左侧部分所示。在没有参考特征嵌入分支的情况下,模型失去了关键的高级注意力引导,显著降低了身份保真度。条件自适应归一化 (CAN) 被证明对于分布对齐至关重要,增强了跨帧的身份保持。图 8 进一步展示了 CAN 在面部条件注入中的有效性,显示了在图像预训练阶段身份信息捕获的训练收敛性得到改善。

Training Strategy. The right section of Fig. 7 illustrates the impact of different training strategies. Image pre-training is essential for robust identity preservation, while video post-training ensures temporal consistency. However, training exclusively on image data leads to color-shift artifacts during video inference. This artifact is caused by modulation-factor inconsistencies between the training stages. Our two-stage training approach achieves optimal results by leveraging the advantages of both phases, generating videos with high ID fidelity and dynamic facial motions.

训练策略。图 7 的右侧部分展示了不同训练策略的影响。图像预训练对于保持身份特征的鲁棒性至关重要,而视频后训练则确保了时间一致性。然而,仅使用图像数据进行训练会导致视频推理时出现色彩偏移伪影。这种伪影是由于不同训练阶段中调制因子不一致引起的。我们的两阶段训练方法通过结合两个阶段的优势,实现了最佳效果,生成了具有高身份保真度和动态面部动作的视频。
