[Paper Translation] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers


Original paper: https://arxiv.org/pdf/2501.03931


Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers


Yuechen Zhang1* Yaoyang Liu2* Bin Xia1 Bohao Peng1 Zexin Yan4 Eric Lo1 Jiaya Jia1,2,3 1CUHK 2HKUST 3SmartMore 4CMU


*https://julianjuaner.github.io/projects/MagicMirror/


Figure 1. Magic Mirror generates text-to-video results given the ID reference image. Each video pair shows 24 frames (from a total of 49) with its corresponding face reference displayed in the bottom-left corner. Please use Adobe Acrobat Reader to play the videos for an optimal viewing experience. Complete videos are available on the project page.


Abstract


We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/.


1. Introduction


Human-centered content generation has been a focal point in computer vision research. Recent advancements in image generation, particularly through Diffusion Models [1–4], have propelled personalized content creation to the forefront of computer vision research. While significant progress has been made in preserving personal identity (ID) in image generation [5–10], achieving comparable fidelity in video generation remains challenging.

Existing ID-preserving video generation methods show promise but face limitations. Approaches like MagicMe and ID-Animator [11, 12], utilizing inflated UNets [13] for fine-tuning or adapter training, demonstrate some success in maintaining identity across frames. However, they are ultimately restricted by the generation model’s inherent capabilities, often failing to produce highly dynamic videos (see Fig. 2). These approaches make a static copy-and-paste instead of generating dynamic facial motions. Another branch of methods combines image personalization methods with Image-to-Video (I2V) generation [14–16]. While these two-stage solutions preserve ID to some extent, they often struggle with stability in longer sequences and require a separate image generation step. To address current shortcomings, we present Magic Mirror, a single-stage framework designed to generate high-quality videos while maintaining strong ID consistency and dynamic natural facial motions. Our approach leverages native video diffusion models [14] to generate ID-specific videos, aiming to empower individuals as protagonists in their virtual narratives, and bridge the gap between personalized ID generation and high-quality video synthesis.


The generation of high-fidelity identity-conditioned videos poses several technical challenges. A primary challenge stems from the architectural disparity between image and video generation paradigms. State-of-the-art video generation models, built upon full-attention Diffusion Transformer (DiT) architectures [14, 17], are not directly compatible with conventional cross-attention conditioning methods. To bridge this gap, we introduce a lightweight identity-conditioning adapter integrated into the CogVideoX [14] framework. Specifically, we propose a dual-branch facial embedding that simultaneously preserves high-level identity features and reference-specific structural information. Our analysis reveals that current video foundation models optimize for text-video alignment, often at the cost of spatial fidelity and generation quality. This trade-off manifests in reduced image quality metrics on benchmarks such as VBench [18], particularly affecting the preservation of fine-grained identity features. To address this, we develop a Conditioned Adaptive Normalization (CAN) that effectively incorporates identity conditions into the pre-trained foundation model. This module, combined with our dual facial guidance mechanism, enables identity conditioning through attention guidance and feature distribution guidance.


Another significant challenge lies in the acquisition of high-quality training data. While paired image data with consistent identity is relatively abundant, high-fidelity image-video pairs that maintain identity consistency remain scarce. To address this limitation, we develop a strategic data synthesis pipeline that leverages identity preservation models [5] to generate paired training data. Our training methodology employs a progressive approach: initially pre-training on image data to learn robust identity representations, followed by video-specific fine-tuning. This two-stage strategy enables effective learning of identity features while ensuring temporal consistency in facial expressions across video sequences.



Figure 2. Magic Mirror generates dynamic facial motion. ID-Animator [11] and Video Ocean [19] exhibit limited motion range due to a strong identity-preservation constraint. Magic Mirror achieves more dynamic facial expressions while maintaining reference identity fidelity.



We evaluate our method on multiple general metrics by constructing a human-centric video generation test set and comparing it with the aforementioned competitive ID-preserved video generation methods. Extensive experimental and visual evaluations demonstrate that our approach successfully generates high-quality videos with dynamic content and strong facial consistency, as illustrated in Fig. 1. Magic Mirror represents the first approach to achieve customized video generation using Video DiT without requiring person-specific fine-tuning. Our work marks an advancement in personalized video generation, paving the way for enhanced creative expression in the digital domain.


Our main contributions are: (1) We introduce Magic Mirror, a novel fine-tuning-free framework for generating ID-preserving videos; (2) We design a lightweight adapter with Conditioned Adaptive Normalization for effective integration of face embeddings in full-attention Diffusion Transformer architectures; (3) We develop a dataset construction method that combines synthetic data generation with a progressive training strategy to address data scarcity challenges in personalized video generation.


2. Related Works


Diffusion Models. Since the introduction of DDPM [3], diffusion models have demonstrated remarkable capabilities across diverse domains, spanning NLP [20, 21], medical imaging [22, 23], and molecular modeling [24, 25]. In computer vision, following initial success in image generation [26, 27], Latent Diffusion Models (LDM) [1] significantly reduced computational requirements while maintaining generation quality. Subsequent developments in conditional architectures [28, 29] enabled fine-grained concept customization over the generation process.


Video Generation via Diffusion Models. Following the emergence of diffusion models, their superior controllability and diversity in image generation [30] have led to their prominence over traditional approaches based on GANs [31–33] and auto-regressive Transformers [34–36]. The Video Diffusion Model (VDM) [37] pioneered video generation using diffusion models by extending the traditional U-Net [38] architecture to process temporal information. Subsequently, LVDM [39] demonstrated the effectiveness of latent space operations, while AnimateDiff [13] adapted text-to-image models for personalized video synthesis. A significant advancement came with the Diffusion Transformer (DiT) [17], which successfully merged Transformer architectures [40, 41] with diffusion models. Building on this foundation, Latte [42] emerged as the first open-source text-to-video model based on DiT. Following the breakthrough of SORA [43], several open-source initiatives including Open-Sora-Plan [44], Open-Sora [45], and CogVideoX [14] have advanced video generation through DiT architectures. While current research predominantly focuses on image-to-video translation [15, 16, 46] and motion control [47, 48], the critical challenge of ID-preserving video generation remains relatively unexplored.


ID-Preserved Generation. ID-preserved generation, or identity customization, aims to maintain specific identity characteristics in generated images or videos. Initially developed in the GAN era [31] with significant advances in face generation [33, 49, 50], this field has evolved substantially with diffusion models, demonstrating enhanced capabilities in novel image synthesis [28, 51]. Current approaches to ID-preserved image generation fall into two main categories:


Tuning-based Models: These approaches fine-tune models using one or more reference images to generate identity-consistent outputs. Notable examples include Textual Inversion [51] and Dreambooth [28].


Tuning-free Models: Addressing the computational overhead of tuning-based approaches, these models maintain high ID fidelity through additional conditioning and trainable parameters. Starting with IP-Adapter [52], various methods such as InstantID and PhotoMaker [5–10] have emerged to enable efficient, high-quality personalized generation.


ID-preserved video generation introduces additional complexities, particularly in synthesizing realistic facial movements from static references while maintaining identity consistency. Current approaches include MagicMe [12], a tuning-based method requiring per-identity optimization, and ID-Animator [11], a tuning-free approach utilizing face adapters and decoupled Cross-Attention [52]. However, these methods face challenges in maintaining dynamic expressions while preserving identity, and are constrained by base model limitations in video quality, duration, and prompt adherence. The integration of Diffusion Transformers presents promising opportunities for advancing ID-preserved video generation.


3. Magic Mirror


An overview of Magic Mirror is illustrated in Fig. 3. This dual-branch framework (Sec. 3.2) extracts facial identity features from one or more reference images $r$ . These embeddings are subsequently processed through a DiT backbone augmented with a lightweight cross-modal adapter, incorporating conditioned adaptive normalization (Sec. 3.3). This architecture enables Magic Mirror to synthesize identity-preserved text-to-video outputs. The following sections elaborate on the preliminaries of diffusion models (Sec. 3.1) and each component of our method.


3.1. Preliminaries


Latent Diffusion Models (LDMs) generate data by iteratively reversing a noise corruption process, converting random noise into structured samples. At time step $t \in \{0,\ldots,T\}$, the model predicts latent state $x_{t}$ conditioned on $x_{t+1}$:


$$
p_{\theta}(x_{t}|x_{t+1})=\mathcal{N}(x_{t};\widetilde{\mu}_{t},\widetilde{\beta}_{t}I),
$$

where $\theta$ represents the model parameters, $\widetilde{\mu}_{t}$ denotes the predicted mean, and $\widetilde{\beta}_{t}$ is the variance schedule.


The training objective typically employs a mean squared error (MSE) loss on the noise prediction $\hat{\epsilon}_{\theta}(x_{t},t,c_{\mathrm{txt}})$:


$$
\mathcal{L}_{\mathrm{noise}}=w\cdot\mathbb{E}_{t,c_{\mathrm{txt}},\epsilon\sim\mathcal{N}(0,1)}\Big[\|\epsilon-\epsilon_{\theta}(x_{t},t,c_{\mathrm{txt}})\|^{2}\Big],
$$

where $c_{\mathrm{txt}}$ denotes the text condition.

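As a concrete illustration, one training step of this denoising objective can be sketched as follows. The `model` callable, the cosine noise schedule, and all tensor shapes here are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def noise_prediction_loss(model, x0, t, c_txt, w=1.0, T=1000):
    """One training step of the denoising objective.

    model : placeholder diffusion backbone, called as model(x_t, t, c_txt)
    x0    : clean latent, e.g. shape (B, C, L)
    t     : integer timesteps, shape (B,)
    c_txt : text condition (passed through unchanged)
    """
    eps = torch.randn_like(x0)                    # epsilon ~ N(0, I)
    # An illustrative cosine noise schedule (any schedule would do here).
    alpha_bar = torch.cos(t.float() / T * torch.pi / 2) ** 2
    while alpha_bar.dim() < x0.dim():
        alpha_bar = alpha_bar.unsqueeze(-1)       # broadcast over latent dims
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps  # forward noising
    eps_pred = model(x_t, t, c_txt)               # eps_theta(x_t, t, c_txt)
    return w * F.mse_loss(eps_pred, eps)          # w * E[||eps - eps_theta||^2]
```

In practice the expectation over $t$, $c_{\mathrm{txt}}$, and $\epsilon$ is approximated by sampling one noise draw and timestep per batch element, as above.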

Recent studies on controllable generation [7, 52–54] extend this framework by incorporating additional control signals, such as an image condition $c_{\mathrm{img}}$. This is achieved through a feature extractor $\tau_{\mathrm{img}}$ that processes a reference image $r$: $c_{\mathrm{img}}=\tau_{\mathrm{img}}(r)$. Consequently, the noise prediction function in Eq. (2) becomes $\epsilon_{\theta}(x_{t},t,c_{\mathrm{txt}},c_{\mathrm{img}})$.


3.2. Decoupled Facial Feature Extraction


The facial feature extraction component of Magic Mirror is illustrated in the left part of Fig. 3. Given an ID reference image $r \in \mathbb{R}^{h\times w\times 3}$, our model extracts facial condition embeddings $c_{\mathrm{face}}=\{x_{\mathrm{face}},x_{\mathrm{fuse}}\}$ using hybrid query-based feature perceivers. These embeddings capture both high-level identity features and facial structural information:



Figure 3. Overview of Magic Mirror. The framework employs a dual-branch feature extraction system with ID and face perceivers, followed by a cross-modal adapter (illustrated in Fig. 4) for DiT-based video generation. By optimizing trainable modules marked by the flame, our method efficiently integrates facial features for controlled video synthesis while maintaining model efficiency.


$$
x_{\mathrm{face}}=\tau_{\mathrm{face}}(q_{\mathrm{face}},\mathbf{f}),
$$

where $\mathbf{f}=F_{\mathrm{face}}(r)$ represents dense feature maps extracted from a pre-trained CLIP ViT $F_{\mathrm{face}}$ [55]. Both perceivers $\tau_{\mathrm{face}}$ and $\tau_{\mathrm{id}}$ utilize the standard Q-Former [56] architecture with distinct query conditions $q_{\mathrm{face}}$ and $q_{\mathrm{id}}$. Here, $q_{\mathrm{face}}$ is a learnable embedding for facial structure extraction, while $q_{\mathrm{id}}=F_{\mathrm{id}}(r)$ represents high-level facial features extracted by a face encoder $F_{\mathrm{id}}$ [5, 57]. Each perceiver employs cross-attention between iteratively updated queries and dense features to obtain compressed feature embeddings.


The compressed embeddings are integrated through a decoupled mechanism. Following recent approaches in new-concept customization [5, 28], we fuse facial embeddings with text embeddings at identity-relevant tokens (e.g., man, woman) in the input prompt $q_{\mathrm{txt}}$, as expressed in Eq. (4). A fusion MLP $F_{\mathrm{fuse}}$ projects $x_{\mathrm{id}}$ into the text embedding space. The final text embedding for DiT input is computed as $\hat{x}_{\mathrm{txt}}=\mathbf{m}\,x_{\mathrm{fuse}}+(1-\mathbf{m})\,x_{\mathrm{txt}}$, where $\mathbf{m}$ denotes a token-level binary mask indicating the fused token positions in $q_{\mathrm{txt}}$.

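The masked token-level fusion amounts to a per-token blend; a minimal sketch follows, where `x_fuse` is assumed to be already projected by the fusion MLP and scattered to its token positions (an assumption, since the scattering step is not spelled out in the text):

```python
import torch

def fuse_id_into_text(x_txt, x_fuse, m):
    """Token-level fusion: x_hat = m * x_fuse + (1 - m) * x_txt.

    x_txt  : (B, L, D) text token embeddings
    x_fuse : (B, L, D) identity features, assumed already projected by the
             fusion MLP and placed at their token positions
    m      : (B, L) binary mask, 1 at identity-relevant tokens (e.g. "man")
    """
    m = m.unsqueeze(-1).to(x_txt.dtype)   # (B, L, 1), broadcast over channels
    return m * x_fuse + (1 - m) * x_txt
```

Tokens outside the mask keep their original text embeddings, so the prompt semantics are untouched except at the identity-relevant positions.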

3.3. Conditioned Adaptive Normalization


Having obtained the decoupled ID-aware conditions $c_{\mathrm{face}}$, we address the challenge of their efficient integration into the video diffusion transformer. Traditional Latent Diffusion Models, exemplified by Stable Diffusion [1], utilize isolated cross-attention mechanisms for condition injection, facilitating straightforward adaptation to new conditions through decoupled cross-attention [6, 52]. This approach is enabled by uniform conditional inputs (specifically, the text condition $c_{\mathrm{txt}}$) across all cross-attention layers. However, our framework builds upon CogVideoX [14], which implements a cross-modal full-attention paradigm with layer-wise distribution modulation experts. This architectural choice introduces additional complexity in adapting to new conditions beyond simple cross-attention augmentation.


Leveraging CogVideoX’s layer-wise modulation [14], we propose a lightweight architecture that incorporates additional conditions while preserving the model’s spatiotemporal relationship modeling capacity. As illustrated in Fig. 4, the facial embedding $x_{\mathrm{face}}$ is concatenated with the text and video features ($x_{\mathrm{txt}}$ and $x_{\mathrm{vid}}$) for full self-attention. CogVideoX employs modal-specific modulation, where factors $m_{\mathrm{vid}}$ and $m_{\mathrm{txt}}$ are applied to their respective features through adaptive normalization modules $\varphi_{\mathrm{txt,vid}}$. To accommodate the facial modality, we introduce a dedicated adaptive normalization module $\varphi_{\mathrm{face}}$, normalizing facial features preceding the self-attention and feed-forward network (FFN). The corresponding set of modulation factors $m_{\mathrm{face}}$ is computed as:


$$
m_{\mathrm{face}}=\{\mu_{\mathrm{face}}^{1},\sigma_{\mathrm{face}}^{1},\gamma_{\mathrm{face}}^{1},\mu_{\mathrm{face}}^{2},\sigma_{\mathrm{face}}^{2},\gamma_{\mathrm{face}}^{2}\}=\varphi_{\mathrm{face}}(\mathbf{t},l),
$$

where $\mathbf{t}$ denotes the time embedding and $l$ represents the layer index. Let $F^{n}$ denote the in-block operations, where $F^{1}$ represents attention and $F^{2}$ represents the FFN. The feature transformation around operation $n$ applies a scale $\sigma$, shift $\mu$, and gating $\gamma$: $\bar{x}^{n}=x^{n-1}*(1+\sigma^{n})+\mu^{n}$, then $x^{n}=\bar{x}^{n}+\gamma^{n}F^{n}(\bar{x}^{n})$, with modality-specific subscripts omitted for brevity.

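The scale-shift-gate transformation follows directly from the two formulas above; in this sketch, `op` stands in for either the attention or the FFN block:

```python
import torch

def modulated_op(x, op, mu, sigma, gamma):
    """One modulated in-block operation F^n (attention or FFN):

    x_bar = x * (1 + sigma) + mu       (scale and shift)
    x_out = x_bar + gamma * op(x_bar)  (gated residual)

    mu, sigma, gamma are the per-modality factors produced by the
    adaptive normalization module phi(t, l)."""
    x_bar = x * (1 + sigma) + mu
    return x_bar + gamma * op(x_bar)
```

Running this twice per block, first with the attention operation and then with the FFN, reproduces the transformation described above, with each modality (video, text, face) using its own factor set.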

Furthermore, to enhance the distribution learning capability of text and video latents from specific reference IDs, we introduce Conditioned Adaptive Normalization (CAN), inspired by class-conditioned DiT [17] and StyleGAN’s [33] approach to condition control. CAN predicts distribution shifts for video and text modalities:


$$
\hat{m}_{\mathrm{vid}},\hat{m}_{\mathrm{txt}}=\varphi_{\mathrm{cond}}(\mathbf{t},l,\mu_{\mathrm{vid}}^{1},x_{\mathrm{id}}).
$$

Here, $\mu_{\mathrm{vid}}^{1}$ acts as a distribution identifier for a better initialization of the CAN module, and $x_{\mathrm{id}}$ from Eq. (4) represents the facial embedding prior to the fusion MLP. The final modulation factors are computed through residual addition: $m_{\mathrm{vid}}=\hat{m}_{\mathrm{vid}}+\varphi_{\mathrm{vid}}(\mathbf{t},l)$ and $m_{\mathrm{txt}}=\hat{m}_{\mathrm{txt}}+\varphi_{\mathrm{txt}}(\mathbf{t},l)$.

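A minimal sketch of this residual-shift idea follows, with $\varphi_{\mathrm{cond}}$ implemented as an MLP; all dimensions, the concatenation layout, and the hidden width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConditionedAdaptiveNorm(nn.Module):
    """Sketch of CAN: an MLP predicts residual modulation shifts from the
    time embedding, the video distribution identifier mu_vid, and the
    identity embedding x_id; the shifts are added to the base factors.
    Dimensions and concatenation layout are illustrative assumptions."""

    def __init__(self, t_dim, id_dim, m_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(t_dim + id_dim + m_dim, 4 * m_dim),
            nn.SiLU(),
            nn.Linear(4 * m_dim, 2 * m_dim),  # shifts for video and text
        )

    def forward(self, t_emb, mu_vid, x_id, m_vid_base, m_txt_base):
        h = torch.cat([t_emb, mu_vid, x_id], dim=-1)
        d_vid, d_txt = self.mlp(h).chunk(2, dim=-1)
        # Residual addition: m = m_hat + phi(t, l)
        return m_vid_base + d_vid, m_txt_base + d_txt
```

Because the shifts are residual, zero-initializing the last linear layer would leave the pre-trained modulation untouched at the start of training, which is a common choice for this kind of adapter.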


Figure 4. Cross-modal adapter in DiT blocks, featuring Conditioned Adaptive Normalization (CAN) for modal-specific feature modulation and decoupled attention integration.


We find that this conditional shift prediction $\varphi_{\mathrm{cond}}$ is well suited to an MLP implementation.


Complementing the conditioned normalization, we augment the joint full self-attention $T_{\mathrm{SA}}$ with a cross-attention mechanism $T_{\mathrm{CA}}$ [11, 52] to enhance ID modality feature aggregation. The attention output $x_{\mathrm{out}}$ is computed as:


$$
x_{\mathrm{out}}=T_{\mathrm{SA}}\big(W_{qkv}(x_{\mathrm{full}})\big)+T_{\mathrm{CA}}\big(W_{q}(x_{\mathrm{full}}),W_{kv}(x_{\mathrm{face}})\big),
$$

where $T_{\mathrm{SA}}$ and $T_{\mathrm{CA}}$ utilize the same query projection $W_{q}(x_{\mathrm{full}})$, while the key-value projections $W_{kv}$ in cross-attention are re-initialized and trainable.

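The decoupled attention above can be sketched as follows; this is a single-head simplification with the projection modules passed in as callables, and all shapes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decoupled_attention(x_full, x_face, W_qkv, W_kv_face):
    """x_out = SA(W_qkv(x_full)) + CA(W_q(x_full), W_kv(x_face)).

    The self-attention query is reused as the cross-attention query;
    only the face key/value projection W_kv_face is newly trainable.
    Single-head for clarity."""
    q, k, v = W_qkv(x_full).chunk(3, dim=-1)           # shared Q, plus SA K/V
    sa = F.scaled_dot_product_attention(q, k, v)       # full self-attention
    k_f, v_f = W_kv_face(x_face).chunk(2, dim=-1)      # face-branch K/V
    ca = F.scaled_dot_product_attention(q, k_f, v_f)   # ID cross-attention
    return sa + ca
```

Reusing the query projection keeps the added parameter count to the face key/value weights alone, matching the paper's emphasis on a lightweight adapter.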

3.4. Data and Training


Training a zero-shot customization adapter presents unique data challenges compared to fine-tuning approaches, like Magic-Me [12]. Our model’s full-attention architecture, which integrates spatial and temporal components inseparably, necessitates a two-stage training strategy. As shown in Fig. 5, we begin by training on diverse, high-quality datasets to develop robust identity preservation capabilities.


Our progressive training pipeline leverages diverse datasets to enhance model performance, particularly in identity preservation. For image pre-training, we first utilize the LAION-Face [58] dataset, which contains web-scale real images and provides a rich source for generating self-reference images. To further increase identity diversity, we utilize the SFHQ [59] dataset, which applies self-reference techniques with standard text prompts. To prevent overfitting and promote the generation of diverse face-head motion, we use the FFHQ [33] dataset as a base. From this, we randomly sample text prompts from a prompt pool of human image captions and synthesize ID-conditioned image pairs using PhotoMaker-V2 [5], ensuring both identity similarity and diversity through careful filtering.



Figure 5. Overview of our training datasets. The pipeline includes image pre-training data (A-D) and video post-training data (D). We utilize both self-reference data (A, B) and filtered synthesized pairs with the same identity (C, D). Numbers of (images + synthesized images) are reported.


For video post-training, we leverage the high-quality Pexels and Mixkit datasets, along with a small collection of self-collected videos from the web. Similarly, synthesized image data corresponding to each face reference of keyframes are generated as references. The combined dataset offers rich visual content for training the model across images and videos.


The objective function combines identity-aware and general denoising loss: $\mathcal{L}=\mathcal{L}_{\mathrm{noise}}+\lambda\left(1-\cos(q_{\mathrm{face}},D(x_{0}))\right)$, where $D(\cdot)$ represents the latent decoder for the denoised latent $x_{0}$, and $\lambda$ is the balance factor. Following PhotoMaker [5], we compute the denoising loss specifically within the face area for $50\%$ of random training samples.

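The combined objective can be sketched as below. The face-embedding extraction from the decoded latent is abstracted into a pre-computed input, and the value of $\lambda$ is an assumed placeholder, since the paper does not state it here:

```python
import torch
import torch.nn.functional as F

def combined_loss(noise_loss, q_face, q_face_decoded, lam=0.1):
    """L = L_noise + lambda * (1 - cos(q_face, q_decoded)).

    q_face         : reference face embedding
    q_face_decoded : face embedding extracted from the decoded latent D(x0)
    lam            : balance factor lambda (0.1 is an assumed value)
    """
    cos = F.cosine_similarity(q_face, q_face_decoded, dim=-1).mean()
    return noise_loss + lam * (1.0 - cos)
```

When the decoded face embedding matches the reference exactly, the identity term vanishes and only the denoising loss remains.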

4. Experiments


4.1. Implementation Details


Dataset preparation. As illustrated in Fig. 5, our training pipeline leverages both self-referenced and synthetically paired image data [33, 58, 59] for identity-preserving alignment in the initial training phase. For synthetic data pairs (denoted as C and D in Fig. 5), we employ ArcFace [57] for facial recognition and detection to extract key attributes including age, bounding box coordinates, gender, and facial embeddings. Reference frames are then generated using PhotoMaker-V2 [5]. We implement a quality control process by filtering image pairs $\{a,b\}$ based on their facial embedding cosine similarity, retaining pairs where $d(q_{\mathrm{face}}^{a},q_{\mathrm{face}}^{b})>0.65$. For text conditioning, we utilize MiniGemini-8B [60] to caption all video data, forming a diverse prompt pool containing 29K prompts, while CogVLM [61] provides video descriptions in the second training stage. Detailed data collection procedures are provided in Appendix A.1.

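The cosine-similarity filter might look like this minimal sketch; the embedding format (plain vectors) and the function name are illustrative:

```python
import numpy as np

def keep_pair(q_face_a, q_face_b, thresh=0.65):
    """Quality filter for synthetic pairs: keep {a, b} only when the cosine
    similarity of their face embeddings exceeds the 0.65 threshold."""
    a = np.asarray(q_face_a, dtype=np.float64)
    b = np.asarray(q_face_b, dtype=np.float64)
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sim > thresh
```

Pairs rejected by this check are dropped from the synthetic training set, trading data volume for identity fidelity.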

Training Details. Our Magic Mirror framework extends CogVideoX-5B [14] by integrating facial-specific modal adapters into alternating DiT layers (i.e., adapters in all layers with even index $l$). We adopt the feature extractor $F_{\mathrm{face}}$ and ID perceiver $\tau_{\mathrm{id}}$ from a pre-trained PhotoMaker V2 [5]. In the image pre-training stage, we optimize the adapter components for 30K iterations using a global batch size of 64. Subsequently, we perform video fine-tuning for 5K iterations with a batch size of 8 to enhance temporal consistency in video generation. Both phases employ a decayed learning rate starting from $10^{-5}$. All experiments were conducted on a single compute node with 8 NVIDIA A800 GPUs.

训练细节。我们的 Magic Mirror 框架通过将面部特定模态适配器集成到交替的 DiT 层(即所有偶数索引 $l$ 的层中的适配器)中,扩展了 CogVideoX-5B [14]。我们采用了预训练的 PhotoMaker V2 [5] 中的特征提取器 $F_{\mathrm{face}}$ 和 ID 感知器 $\tau_{\mathrm{id}}$。在图像预训练阶段,我们使用全局批量大小为 64 优化适配器组件,进行 30K 次迭代。随后,我们以批量大小为 8 进行 5K 次迭代的视频微调,以增强视频生成中的时间一致性。两个阶段均采用从 $10^{-5}$ 开始的衰减学习率。所有实验均在配备 8 个 NVIDIA A800 GPU 的单个计算节点上进行。
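Two small helpers sketch these choices. The even-index adapter placement is stated in the paper; the cosine decay shape is an assumption, since the paper only says the learning rate decays from $10^{-5}$:

```python
import math

def adapter_layer_indices(num_layers):
    # facial modal adapters sit at every even-indexed DiT layer (l = 0, 2, 4, ...)
    return [l for l in range(num_layers) if l % 2 == 0]

def decayed_lr(step, total_steps, base_lr=1e-5):
    # one plausible decay schedule (cosine); the exact schedule is not specified
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```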

Evaluation and Comparisons. We evaluate our approach against the state-of-the-art ID-consistent video generation model ID-Animator [11] and leading Image-to-Video (I2V) frameworks, including DynamiCrafter [15], CogVideoX [14], and EasyAnimate [16]. Our evaluation leverages the standardized VBench [18] for video generation assessment, which measures motion quality and text-motion alignment. For identity preservation, we utilize facial recognition embedding similarity [62] and facial motion metrics. Our evaluation dataset consists of 40 single-character prompts from VBench, ensuring demographic diversity, and 40 action-specific prompts for motion assessment. Identity references are sampled from 50 face identities from PubFig [63], generating four personalized videos per identity across varied prompts.

评估与比较。我们将我们的方法与最先进的ID一致性视频生成模型ID-Animator [11]以及领先的图像到视频(I2V)框架(包括DynamiCrafter [15]、CogVideoX [14]和EasyAnimate [16])进行了比较。我们的评估利用了标准化的VBench [18],用于视频生成评估,衡量运动质量和文本-运动对齐。对于身份保留,我们使用了面部识别嵌入相似度 [62] 和面部运动指标。我们的评估数据集包括来自VBench的40个单角色提示,确保人口多样性,以及40个动作特定的提示用于运动评估。身份参考从PubFig [63] 的50个面部身份中采样,每个身份在不同提示下生成四个个性化视频。

4.2. Quantitative Evaluation

4.2. 定量评估

The quantitative results are summarized in Tab. 1. We evaluate generated videos using VBench's and EvalCrafter's general metrics [18, 64], including dynamic degree, text-prompt consistency, and Inception Score [65] for video quality assessment. For identity preservation, we introduce Average Similarity instead of similarity with the reference image, measuring the distance between generated faces and the average similarity over the reference images of each identity. This prevents models from achieving artificially high scores through naive copy-paste behavior, as illustrated in Fig. 2. Face motion is quantified using two metrics: $\mathrm{FM}_{\mathrm{ref}}$ (relative distance to the reference face) and $\mathrm{FM}_{\mathrm{inter}}$ (inter-frame distance), both computed from RetinaFace [66] landmarks after position alignment; the L2 distance between the normalized coordinates is reported as the metric.

定量结果总结在表 1 中。我们使用 VBench 和 EvalCrafter 的通用指标 [18, 64] 评估生成的视频,包括动态程度、文本提示一致性和用于视频质量评估的 Inception Score [65]。对于身份保持,我们引入了平均相似度 (Average Similarity) 而不是与参考图像的相似度,测量生成的面部与每个身份参考图像的平均相似度之间的距离。这防止了模型通过简单的复制粘贴行为获得人为高分,如图 2 所示。面部运动使用两个指标进行量化:$\mathrm{FM}_{\mathrm{ref}}$(与参考面部的相对距离)和 $\mathrm{FM}_{\mathrm{inter}}$(帧间距离),使用 RetinaFace [66] 地标在位置对齐后计算,归一化坐标之间的 L2 距离作为指标报告。
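The Average Similarity metric can be sketched as follows; scoring against all reference images of an identity rather than the single conditioning image is one plausible reading, and the exact aggregation used in the paper may differ:

```python
import numpy as np

def _cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_similarity(gen_embed, ref_embeds):
    # score a generated face against ALL reference embeddings of the identity,
    # which penalizes naive copy-paste of the single conditioning image
    return float(np.mean([_cos(gen_embed, r) for r in ref_embeds]))
```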

Table 1. Quantitative comparisons. We report results with Image-to-Video and ID-preserved models. ID similarities are evaluated on the corresponding face-enhanced prompts to avoid face missing caused by complex prompts.

Table 2. User study results.

表 1. 定量比较。我们报告了图像到视频和ID保留模型的结果。ID相似度在相应的面部增强提示上进行评估,以避免复杂提示导致的面部缺失。

表 2. 用户研究结果。

模型 视觉质量 文本对齐 动态程度 ID相似度
DynamiCrafter[15] 6.03 7.29 4.85 5.87
EasyAnimate-I2V[16] 6.62 8.21 5.57 6.01
CogVideoX-I2V[14] 6.86 8.31 6.55 6.22
ID-Animator[11] 5.63 6.37 4.06 6.70
Magic Mirror 6.97 8.88 7.02 6.39

Our method achieves superior facial similarity scores compared to I2V approaches while maintaining competitive performance with ID-Animator. We demonstrate strong text alignment, video quality, and dynamic performance, attributed to our decoupled facial feature extraction and cross-modal adapter with conditioned adaptive normalizations.

我们的方法在保持与 ID-Animator 竞争性能的同时,相比 I2V 方法实现了更高的面部相似度分数。我们展示了强大的文本对齐、视频质量和动态性能,这归功于我们解耦的面部特征提取和带有条件自适应归一化的跨模态适配器。

In addition, we analyze the facial similarity drop across uniformly sampled frames from each video to assess temporal identity consistency, reported as the similarity decay term in Tab. 1. Standard I2V models (CogVideoX-I2V [14], EasyAnimate [16]) exhibit significant temporal decay in identity preservation. While DynamiCrafter [15] shows better stability due to its random reference strategy, it compromises fidelity. Both ID-Animator [11] and our Magic Mirror maintain consistent identity preservation throughout the video duration.

此外,我们分析了从每个视频中均匀采样的帧之间的面部相似度下降情况,以评估时间上的身份一致性,并在表 1 中报告为相似度衰减项。标准的 I2V 模型 (CogVideoX-I2V [14], EasyAnimate [16]) 在身份保持方面表现出显著的时间衰减。而 DynamiCrafter [15] 由于其随机参考策略表现出更好的稳定性,但牺牲了保真度。ID-Animator [11] 和我们的 Magic Mirror 在整个视频持续时间内都保持了一致的身份保持。
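A simple reading of the similarity decay term is the drop in face similarity between the first and last uniformly sampled frames (the paper's exact decay definition may differ):

```python
import numpy as np

def similarity_decay(frame_similarities):
    # frame_similarities: per-frame face similarity to the identity,
    # ordered over uniformly sampled frames; larger decay = worse
    # temporal identity consistency
    sims = np.asarray(frame_similarities, dtype=float)
    return float(sims[0] - sims[-1])
```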


Figure 6. Qualitative comparisons. Captions and reference identity images are presented in the top-left corner for each case.

图 6: 定性比较。每个案例的左上角展示了标题和参考身份图像。

4.3. Qualitative Evaluation

4.3. 定性评估

Beyond the examples shown in Fig. 1, we present comparative results in Fig. 6. Our method maintains high text coherence, motion dynamics, and video quality compared to conventional CogVideoX inference. When compared to existing image-to-video approaches [5, 14–16], Magic Mirror demonstrates superior identity consistency across frames while preserving natural motion. Our method also achieves enhanced dynamic range and text alignment compared to ID-Animator [11], which exhibits limitations in motion variety and prompt adherence.

除了图 1 中展示的示例外,我们在图 6 中展示了对比结果。与传统 CogVideoX 推理相比,我们的方法保持了较高的文本连贯性、运动动态性和视频质量。与现有的图像到视频方法 [5, 14–16] 相比,Magic Mirror 在保持自然运动的同时,展示了跨帧的优越身份一致性。与 ID-Animator [11] 相比,我们的方法还实现了增强的动态范围和文本对齐,而 ID-Animator 在运动多样性和提示遵循方面存在局限性。

To complement our quantitative metrics, we conducted a comprehensive user study to evaluate the perceptual quality of generated results. The study involved 173 participants who assessed the outputs across four key aspects: motion dynamics, text-motion alignment, video quality, and identity consistency. Participants rated each aspect on a scale of 1-10, with results summarized in Tab. 2. As shown in the overall preference scores in Tab. 1, Magic Mirror consistently outperforms baseline methods across all evaluated dimensions, demonstrating its superior perceptual quality in human assessment.

为了补充我们的定量指标,我们进行了一项全面的用户研究,以评估生成结果的感知质量。该研究涉及173名参与者,他们从四个关键方面评估输出结果:运动动态、文本-运动对齐、视频质量和身份一致性。参与者对每个方面进行1-10分的评分,结果总结在表2中。如表1中的总体偏好分数所示,Magic Mirror在所有评估维度上始终优于基线方法,展示了其在人类评估中的卓越感知质量。


Figure 7. Examples for ablation studies. Left: Ablation on modules. Right: Ablation on training strategies.

图 7: 消融研究示例。左:模块消融。右:训练策略消融。

4.4. Ablation Studies

4.4. 消融研究

Condition-related Modules. We evaluate our key architectural components through ablation studies, shown in the left section of Fig. 7. Without the reference feature embedding branch, the model loses crucial high-level attention guidance, significantly degrading identity fidelity. The conditioned adaptive normalization (CAN) proves vital for distribution alignment, enhancing identity preservation across frames. The effectiveness of CAN for facial condition injection is further demonstrated in Fig. 8, showing improved training convergence for identity information capture during the image pre-train stage.

条件相关模块。我们通过消融研究评估了关键架构组件,如图 7 左侧部分所示。在没有参考特征嵌入分支的情况下,模型失去了关键的高级注意力引导,显著降低了身份保真度。条件自适应归一化 (CAN) 被证明对于分布对齐至关重要,增强了跨帧的身份保持。图 8 进一步展示了 CAN 在面部条件注入中的有效性,显示了在图像预训练阶段身份信息捕获的训练收敛性得到改善。

Training Strategy. The right section of Fig. 7 illustrates the impact of different training strategies. Image pre-training is essential for robust identity preservation, while video post-training ensures temporal consistency. However, training exclusively on image data leads to color-shift artifacts during video inference. This artifact is caused by modulation factor inconsistencies across the training stages. Our two-stage training approach achieves optimal results by leveraging the advantages of both phases, generating videos with high ID fidelity and dynamic facial motions.

训练策略。图 7 的右侧部分展示了不同训练策略的影响。图像预训练对于保持身份特征的鲁棒性至关重要,而视频后训练则确保了时间一致性。然而,仅使用图像数据进行训练会导致视频推理时出现色彩偏移伪影。这种伪影是由于不同训练阶段中调制因子不一致引起的。我们的两阶段训练方法通过结合两个阶段的优势,实现了最佳效果,生成了具有高身份保真度和动态面部动作的视频。


Figure 8. CAN speeds up convergence. Without the Conditioned Adaptive Normalization, the model cannot fit simple appearance features such as hairstyle in the image pre-training stage.

图 8: CAN 加速了收敛。在没有条件自适应归一化 (Conditioned Adaptive Normalization) 的情况下,模型无法在图像预训练阶段拟合简单的外观特征,如发型。

Table 3. Computation overhead of Magic Mirror.

表 3: Magic Mirror 的计算开销

模型 内存 参数 时间
CogVideoX-5B [14] 24.9 GiB 10.5B 204s
Magic Mirror 28.6 GiB 12.8B 209s

4.5. Discussions

4.5. 讨论

Computation Overhead. Compared with the baseline, we analyze the computational requirements in GPU memory utilization, parameter count, and inference latency for generating a 49-frame 480P video. Most additional parameters are concentrated in the embedding extraction stage, which requires only a single forward pass. Consequently, as shown in Tab. 3, Magic Mirror introduces minimal computational overhead in both GPU memory consumption and inference time compared to the base model.

计算开销。与基线相比,我们分析了生成 49 帧 480P 视频时在 GPU 内存利用率、参数数量和推理延迟方面的计算需求。大多数额外的参数集中在嵌入提取阶段,该阶段只需要一次前向传播。因此,如表 3 所示,与基础模型相比,Magic Mirror 在 GPU 内存消耗和推理时间方面引入的计算开销极小。

Feature Distribution Analysis. To validate our conditioned adaptive normalization mechanism, we visualize the predicted modulation scale factors $\sigma$ using t-SNE [67] in Fig. 9. The analysis reveals distinct distribution patterns across transformer layers and is not sensitive to the timestep input. The face modality exhibits its own characteristic distribution. The conditioned residual $\hat{\sigma}$ introduces targeted distribution shifts from the baseline, which empirically improves model convergence with ID conditions.

特征分布分析。为了验证我们的条件自适应归一化机制,我们使用 t-SNE [67] 在图 9 中可视化了预测的调制比例因子 $\sigma$。分析揭示了跨 Transformer 层的不同分布模式,并且对时间步输入不敏感。面部模态展示了其特有的分布。条件残差 $\hat{\sigma}$ 引入了从基线的目标分布偏移,这经验上改善了带有 ID 条件的模型收敛性。


Figure 9. Different modalities’ scale distribution using t-SNE. Each point represents the scale with a unique timestep-layer index. We also illustrate a shift variant on text and video’s adaptive scale using different colors.

图 9: 使用 t-SNE 的不同模态尺度分布。每个点代表具有唯一时间步-层索引的尺度。我们还使用不同颜色展示了文本和视频自适应尺度的变化。

Limitations and Future Work. While Magic Mirror demonstrates strong performance in ID-consistent video generation, several challenges remain. First, the current framework lacks support for multi-identity customized generation. Second, our method primarily focuses on facial features, leaving room for improvement in preserving fine-grained attributes such as clothing and accessories. Extending identity consistency to these broader visual elements represents a promising direction for practical multi-shot customized video generation.

局限性与未来工作。尽管 Magic Mirror 在身份一致(ID-consistent)的视频生成中表现出色,但仍存在一些挑战。首先,当前框架缺乏对多身份定制生成的支持。其次,我们的方法主要关注面部特征,在保留服装和配饰等细粒度属性方面仍有改进空间。将身份一致性扩展到这些更广泛的视觉元素,是实现实用的多镜头定制视频生成的一个有前景的方向。

5. Conclusion

5. 结论

In this work, we presented Magic Mirror, a zero-shot framework for identity-preserved video generation. Magic Mirror incorporates dual facial embeddings and Conditional Adaptive Normalization (CAN) into DiT-based architectures. Our approach enables robust identity preservation and stable training convergence. Extensive experiments demonstrate that Magic Mirror generates high-quality personalized videos while maintaining identity consistency from a single reference image, outperforming existing methods across multiple benchmarks and human evaluations.

在本工作中,我们提出了 Magic Mirror,一个用于身份保持视频生成的零样本框架。Magic Mirror 将双重面部嵌入和条件自适应归一化 (Conditional Adaptive Normalization, CAN) 结合到基于 DiT 的架构中。我们的方法能够实现稳健的身份保持和稳定的训练收敛。大量实验表明,Magic Mirror 在从单张参考图像生成高质量个性化视频的同时,保持了身份一致性,在多个基准测试和人类评估中优于现有方法。

Acknowledgment. This work was supported in part by the Research Grants Council under the Areas of Excellence scheme grant AoE/E-601/22-R and the Shenzhen Science and Technology Program under No. KQTD20210811090149095.

致谢

本工作部分得到了研究资助局卓越学科领域计划资助 AoE/E-601/22-R 和深圳市科技计划项目 KQTD20210811090149095 的支持。

Appendix

附录

This appendix provides comprehensive technical details and additional results for Magic Mirror, encompassing dataset preparation, architectural specifications, implementation, and extensive experimental validations. We include additional qualitative results and in-depth analyses to support our main findings. We strongly encourage readers to examine https://julianjuaner.github.io/projects/MagicMirror/ for dynamic video demonstrations.

本附录提供了 Magic Mirror 的全面技术细节和额外结果,涵盖数据集准备、架构规格、实现以及广泛的实验验证。我们包含了额外的定性结果和深入分析,以支持我们的主要发现。我们强烈建议读者访问 https://julianjuaner.github.io/projects/MagicMirror/ 查看动态视频演示。

A. Experiment Details

A. 实验细节

A.1. Training Data Preparation

A.1. 训练数据准备

Our training dataset is constructed through a rigorous preprocessing pipeline, as illustrated in Fig. 10. We start with LAION-face [58], downloading 5 million images, which undergo strict quality filtering based on face detection confidence scores and resolution requirements. The filtered subset of 107K images is then processed through an image captioner [60], where we exclude images containing textual elements. This results in a curated set of 50K high-quality face image-text pairs. To enhance identity diversity, we incorporate the synthetic SFHQ dataset [59]. To match the model's output format, we standardize these images by adding black borders and pairing them with a consistent prompt template: "A squared ID photo of ..., with pure black on two sides." This preprocessing ensures uniformity while maintaining the dataset's diverse identity characteristics.

我们的训练数据集通过严格的预处理流程构建,如图 10 所示。我们从 LAION-face [58] 开始,下载了 500 万张图像,这些图像经过基于人脸检测置信度分数和分辨率要求的严格质量过滤。过滤后的 107K 图像子集随后通过图像标注器 [60] 进行处理,我们排除了包含文本元素的图像。最终得到了 50K 张高质量的人脸图像-文本对。为了增强身份多样性,我们引入了合成的 SFHQ 数据集 [59]。为了适应模型输出,我们通过添加黑色边框并将这些图像与一致的提示模板配对来标准化这些图像:“... 的方形证件照,两侧为纯黑色。” 这种预处理确保了数据集的统一性,同时保持了其多样化的身份特征。

For FFHQ [33], we leverage the state-of-the-art identity-preserving prior PhotoMaker V2 [5] to generate synthetic images. We filter redundant identities using pairwise facial similarity metrics, with prompts sampled from our 50K video keyframe captions. We utilize the Pexels-400K and

对于 FFHQ [33],我们利用最先进的身份保留先验 PhotoMaker V2 [5] 来生成合成图像。我们使用成对面部相似性指标过滤冗余身份,并从我们的 50K 视频关键帧字幕中采样提示。我们利用 Pexels-400K 和


Figure 10. Detailed training data processing pipeline. Building upon Fig. 5, we illustrate comprehensive filtering criteria, prompt examples, and processing specifications. The data flow is indicated by blue arrows, while filtering rules leading to data exclusion are marked with red arrows.

图 10: 详细的训练数据处理流程。在图 5 的基础上,我们展示了全面的过滤标准、提示示例和处理规范。数据流由蓝色箭头表示,而导致数据排除的过滤规则用红色箭头标记。

Mixkit datasets from [44] for image-video pairs. The videos undergo a systematic preprocessing pipeline, including face detection and motion-based filtering to ensure high-quality dynamic content. We generate video descriptions using the CogVLM video captioner [61]. Following our FFHQ processing strategy, we employ PhotoMaker V2 to synthesize identity-consistent images from the detected faces, followed by quality-based filtering.

使用来自 [44] 的 Mixkit 数据集进行图像-视频对处理。视频经过系统化的预处理流程,包括人脸检测和基于运动的过滤,以确保高质量的动态内容。我们使用 CogVLM 视频字幕生成器 [61] 生成视频描述。遵循我们的 FFHQ 处理策略,我们使用 PhotoMaker V2 从检测到的人脸中合成身份一致的图像,然后进行基于质量的过滤。

A.2. Test Data Preparation

A.2. 测试数据准备

Face Images Preparation. We construct a comprehensive evaluation set for identity preservation assessment across video generation models. Our dataset comprises 50 distinct identities across seven demographic categories: man, woman, elderly man, elderly woman, boy, girl, and baby. The majority of faces are sourced from the PubFig dataset [63], supplemented with public domain images for younger categories. Each identity is represented by 1-4 reference images to capture variations in pose and expression.

面部图像准备
我们构建了一个全面的评估集,用于评估视频生成模型中的身份保持能力。我们的数据集包含七个不同人口类别中的50个独特身份:男性、女性、老年男性、老年女性、男孩、女孩和婴儿。大多数面部图像来自PubFig数据集 [63],并为年轻类别补充了公共领域的图像。每个身份由1-4张参考图像表示,以捕捉姿势和表情的变化。


Figure 11. Impact of prompt length on image-to-video generation. We demonstrate how image-to-video models perform differently with concise versus enhanced prompts. Frames with large artifacts are marked in red. First frame images are generated from enhanced prompts.

图 11: 提示词长度对图像到视频生成的影响。我们展示了图像到视频模型在使用简洁提示词和增强提示词时的不同表现。带有明显伪影的帧用红色标记。第一帧图像由增强提示词生成。

Prompt Preparation. Our test prompts are derived from VBench [18], focusing on human-centric actions. For detailed descriptions, we sample from the initial 200 GPT-4-enhanced prompts and select 77 single-person scenarios. Each prompt is standardized with consistent subject descriptors and augmented with the img trigger word for model compatibility. We assign four category-appropriate prompts to each identity, ensuring demographic alignment. For the "baby" category, which lacks representation in VBench, we craft four custom prompts to maintain evaluation consistency across all categories.

提示词准备

我们的测试提示词源自 VBench [18],重点关注以人为中心的动作。为了详细描述,我们从最初的 200 个 GPT4 增强提示词中采样,并选择了 77 个单人场景。每个提示词都经过标准化处理,使用一致的主体描述符,并添加了 img 触发词以确保模型兼容性。我们为每个身份分配了四个与类别相符的提示词,确保人口统计特征的一致性。对于 VBench 中缺乏代表性的“婴儿”类别,我们制作了四个自定义提示词,以保持所有类别评估的一致性。

A.3. Comparisons

A.3. 对比

ID-Animator [11] We utilize enhanced long prompts for evaluation, although some undergo partial truncation due to CLIP’s 77-token input constraint.

ID-Animator [11] 我们利用增强的长提示进行评估,尽管由于 CLIP 的 77-token 输入限制,部分提示会被截断。

CogVideoX-5B-I2V [14] For this image-to-video variant, we first generate reference images using PhotoMaker V2 [5] for each prompt-identity pair. These images, combined with enhanced long prompts, serve as input for video generation.

CogVideoX-5B-I2V [14] 对于这个图像到视频的变体,我们首先使用 PhotoMaker V2 [5] 为每个提示-身份对生成参考图像。这些图像与增强的长提示相结合,作为视频生成的输入。

EasyAnimate [16] We evaluate using the same PhotoMaker-V2-generated reference images as in our CogVideoX-5B-I2V experiments.

EasyAnimate [16] 我们使用与 CogVideoX-5B-I2V 实验中相同的 PhotoMaker V2 生成的参考图像进行评估。

DynamiCrafter [15] Due to model-specific resolution requirements, we create a dedicated set of reference images using PhotoMaker V2 that conform to the model's specifications.

DynamiCrafter [15] 由于模型特定的分辨率要求,我们使用 PhotoMaker V2 创建了一组符合模型规格的专用参考图像。

For the image-to-video baselines, although the reference images are generated with enhanced prompts, we deliberately use the original short, concise prompts for video generation. This choice stems from our empirical observation that image-to-video models exhibit a strong semantic bias when processing lengthy prompts. Specifically, these models tend to prioritize text alignment over reference image fidelity, leading to degraded video quality and compromised identity preservation. This trade-off is particularly problematic for our face preservation objectives. We provide visual evidence of this phenomenon in Fig. 11.

在图像到视频的基线中,通过增强提示生成的参考图像,我们特意使用原始的简短提示进行视频生成。这一选择源于我们的经验观察,即图像到视频模型在处理长提示时表现出强烈的语义偏差。具体来说,这些模型往往优先考虑文本对齐,而忽视了参考图像的保真度,导致视频质量下降和身份保存受损。这种权衡对我们的面部保存目标尤其成问题。我们在图 11 中提供了这一现象的视觉证据。


Figure 12. Face Motion (FM) calculation. $\mathrm{FM}_{\mathrm{inter}}$ follows a similar computation across consecutive video frames.

图 12: 面部运动 (Face Motion, FM) 计算。$\mathrm{FM}_{\mathrm{inter}}$ 在连续视频帧之间遵循类似的计算方式。

A.4. Evaluation Metrics

A.4. 评估指标

Our evaluation framework combines standard video metrics with face-specific measurements. From VBench [18], we utilize Dynamic Degree for motion naturality and Overall Consistency for text-video alignment. Video quality is assessed using the Inception Score from EvalCrafter [64]. For facial fidelity, we measure identity preservation using facial recognition embedding similarity [62] and temporal stability through frame-wise similarity decay.

我们的评估框架结合了标准视频指标和面部特定测量。从 VBench [18] 中,我们利用动态程度 (Dynamic Degree) 来评估运动的自然性,并使用整体一致性 (Overall Consistency) 来评估文本-视频对齐。视频质量通过 EvalCrafter [64] 中的 Inception Score 进行评估。对于面部保真度,我们使用面部识别嵌入相似度 [62] 来测量身份保持,并通过帧间相似度衰减来评估时间稳定性。

We propose a novel facial dynamics metric to address the limitation of static face generation in existing methods. As shown in Fig. 12, we extract five facial landmarks using RetinaFace [66] and compute two motion scores: $\mathrm{FM}_{\mathrm{ref}}$ measures facial motion relative to the reference image (computed on aspect-ratio-normalized frames to eliminate positional bias), while $\mathrm{FM}_{\mathrm{inter}}$ quantifies maximized inter-frame facial motion (computed on original frames to preserve translational movements). This dual-score approach enables a comprehensive assessment of facial dynamics.

我们提出了一种新颖的面部动态度量方法,以解决现有方法在静态人脸生成方面的局限性。如图 12 所示,我们使用 RetinaFace [66] 提取五个面部关键点,并计算两个运动分数:FMref 衡量相对于参考图像的面部运动(在长宽比归一化的帧上计算,以消除位置偏差),而 FMinter 量化最大化的帧间面部运动(在原始帧上计算,以保留平移运动)。这种双分数方法能够对面部动态进行全面评估。
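Given the five landmarks per frame, the two scores can be sketched as follows; the per-landmark averaging and the max-over-frames aggregation for the inter-frame score are assumptions consistent with, but not guaranteed by, the description above:

```python
import numpy as np

def fm_ref(frame_landmarks, ref_landmarks):
    # frame_landmarks: (T, 5, 2) normalized landmark coordinates per frame
    # mean L2 distance of each frame's landmarks to the reference landmarks
    d = np.linalg.norm(frame_landmarks - ref_landmarks[None], axis=-1)  # (T, 5)
    return float(d.mean())

def fm_inter(frame_landmarks):
    # maximized inter-frame landmark displacement across consecutive frames
    d = np.linalg.norm(np.diff(frame_landmarks, axis=0), axis=-1)  # (T-1, 5)
    return float(d.mean(axis=1).max())
```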

A.5. Implementation Details

A.5. 实现细节

Decoupled Facial Embeddings. Our architecture employs two complementary branches: an ID embedding branch based on the pre-trained PhotoMaker V2 [5] with a two-token ID-embedding query $q_{\mathrm{id}}$, and a facial structural embedding branch that extracts detailed features from the same ViT's penultimate layer. The latter initializes 32 token embeddings as the facial query $q_{\mathrm{face}}$ input. We use a projection layer to align facial modalities before diffusion model input.

解耦的面部嵌入。我们的架构采用了两个互补的分支:一个基于预训练的 PhotoMaker V2 [5] 的 ID 嵌入分支,带有双 Token 的 ID 嵌入查询 $q_{\mathrm{id}}$,以及一个面部结构嵌入分支,从同一 ViT 的倒数第二层提取详细特征。后者初始化了 32 个 Token 嵌入作为面部查询 $q_{\mathrm{face}}$ 的输入。在输入扩散模型之前,我们使用投影层来对齐面部模态。

Conditioned Adaptive Normalization. This paragraph elaborates on the design details of the Conditioned Adaptive Normalization (CAN) module, complementing the overview provided in Sec. 3.3 and Fig. 4. For predicting the facial modulation factors $m_{\mathrm{face}}$, we employ a two-layer MLP architecture, following the implementation structure of the original normalization modules $\varphi_{\mathrm{txt,vid}}$. The detailed implementation of CAN is illustrated in Fig. 13. Given the facial ID embedding $x_{\mathrm{id}}\in\mathbb{R}^{2\times c}$ containing two tokens, we first apply one global projection layer for dimensionality reduction, mapping it to dimension $c_{1}$. Subsequently, in each adapted layer, we concatenate this projected embedding with the time embedding $\mathbf{t}$ and the predicted shift factor $\mu_{\mathrm{vid}}^{1}$ along the channel dimension. An MLP then processes this concatenated representation to produce the final modulation factors. To ensure stable training, all newly introduced modulation predictors are initialized to zero.

条件自适应归一化。本段详细阐述了条件自适应归一化 (Conditioned Adaptive Normalization, CAN) 模块的设计细节,补充了第 3.3 节和图 4 中的概述。为了预测面部调制因子 $m_{\mathrm{face}}$,我们采用了两层 MLP 架构,遵循原始归一化模块 $\varphi_{\mathrm{txt,vid}}$ 的实现结构。CAN 的详细实现如图 13 所示。给定包含两个 Token 的面部 ID 嵌入 $x_{\mathrm{id}}\in\mathbb{R}^{2\times c}$,我们首先应用一个全局投影层进行降维,将其映射到维度 $c_{1}$。随后,在每个适配层中,我们将此投影嵌入与时间嵌入 $\mathbf{t}$ 和预测的偏移因子 $\mu_{\mathrm{vid}}^{1}$ 沿通道维度连接。然后,MLP 处理此连接表示以生成最终的调制因子。为了确保训练稳定,所有新引入的调制预测器均初始化为零。
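A numpy sketch of this zero-initialized residual predictor (the hidden width and activation are illustrative assumptions) makes the stability property concrete: at initialization the residual is exactly zero, so the conditioned scale equals the base scale:

```python
import numpy as np

class ZeroInitResidualMLP:
    # two-layer MLP whose output layer is zero-initialized, so the predicted
    # modulation residual is exactly zero at the start of training
    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = np.zeros((d_hidden, d_out))  # zero init for training stability
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # activation choice is assumed
        return h @ self.w2 + self.b2

def conditioned_scale(sigma_base, id_embed, t_embed, mu_vid, mlp):
    # sigma_hat = sigma + residual predicted from [projected ID, time, shift]
    cond = np.concatenate([id_embed, t_embed, mu_vid])
    return sigma_base + mlp(cond)
```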


Figure 13. Detailed implementation of Conditioned Adaptive Normalization. We present the expanded architecture of $\varphi_{\mathrm{cond}}$ (illustrated in the unmasked region above) with comprehensive annotations of input-output tensor dimensions at each transformation.

图 13: 条件自适应归一化的详细实现。我们展示了 $\varphi_{\mathrm{cond}}$ 的扩展架构(在上方未遮罩区域中展示),并在每个变换处详细标注了输入输出张量的维度。

B. Additional Discussions

B. 附加讨论

B.1. Why Distribution is Important

B.1. 为什么分布很重要

Besides the cross-modal modulation factor distribution plots presented in Fig. 9, we investigate the critical role of distribution in the convergence of the Magic Mirror network from another perspective. Our experiments reveal that modality-aware data distributions significantly impact generation quality. Using two distinct datasets, CelebV-Text [68] and our video data from Pexels, we fine-tuned only the normalization layers $\varphi_{\mathrm{vid,txt}}$ on the CogVideoX base model. The results demonstrate that distribution-specific fine-tuning substantially influences the spatial fidelity of the generated videos, as evidenced in Fig. 14. This finding not only underscores the importance of distribution alignment but also validates the high quality of our collected video dataset.

除了图 9 中展示的跨模态调制因子分布图外,我们还从另一个角度研究了分布对 Magic Mirror 网络收敛的关键作用。我们的实验表明,模态感知的数据分布显著影响生成质量。我们使用 CelebV-Text [68] 和从 Pexels 收集的视频数据这两个不同的数据集,仅在 CogVideoX 基础模型上微调了归一化层 $\varphi_{\mathrm{vid,txt}}$。结果表明,特定分布的微调显著影响了生成视频的空间保真度,如图 14 所示。这一发现不仅强调了分布对齐的重要性,还验证了我们收集的视频数据集的高质量。
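Selecting only the normalization layers for fine-tuning can be sketched as name-based filtering over the model's named parameters; the keyword convention here is illustrative, not the actual CogVideoX parameter naming:

```python
def select_trainable(named_params, keywords=("norm", "modulation")):
    # freeze everything except the adaptive normalization / modulation weights
    # (keyword matching on parameter names is an assumed convention)
    return [name for name, _ in named_params
            if any(k in name for k in keywords)]
```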


Figure 14. Modulation layers reflect data distribution. Finetuning solely the modulation layer weights demonstrates adaptation to distinct data distributions, affecting both spatial fidelity and temporal dynamics.

图 14: 调制层反映数据分布。仅微调调制层权重展示了适应不同数据分布的能力,影响空间保真度和时间动态。

B.2. Limitation Analysis

B.2. 限制分析

As discussed in Sec. 4.5, our approach faces several limitations, particularly in handling multi-person scenarios and preserving fine-grained features. Fig. 15 illustrates two representative failure cases: incomplete transfer of reference character details (such as accessories) and motion artifacts caused by the base model. These limitations highlight critical areas for future research in controllable personalized video generation, particularly in maintaining temporal consistency and fine detail preservation.

正如第4.5节所讨论的,我们的方法面临一些限制,特别是在处理多人场景和保留细粒度特征方面。图15展示了两类典型的失败案例:参考角色细节(如配饰)的不完全转移,以及由基础模型引起的运动伪影。这些限制突显了可控个性化视频生成未来研究的关键领域,特别是在保持时间一致性和细粒度细节保留方面。

C. Additional Results & Applications

C. 附加结果与应用

C.1. Additional Applications

C.1. 其他应用

Fig. 16 demonstrates two extended capabilities of Magic Mirror. First, beyond realistic customized video generation, our framework effectively handles stylized prompts, leveraging CogVideoX’s diverse generative capabilities to produce identity-preserved outputs across various artistic styles and visual effects. Furthermore, we show that our method can generate high-quality, temporally consistent multi-shot sequences when maintaining coherent character and style descriptions. We believe these capabilities have significant implications for automated video content creation.

图 16 展示了 Magic Mirror 的两个扩展能力。首先,除了生成逼真的定制视频外,我们的框架还能有效处理风格化提示,利用 CogVideoX 的多样化生成能力,在各种艺术风格和视觉效果中生成保持身份的输出。此外,我们还展示了在保持连贯的角色和风格描述时,我们的方法能够生成高质量、时间一致的多镜头序列。我们相信这些能力对自动化视频内容创作具有重要意义。


Figure 15. Limitations of Magic Mirror. (a) Fine-grained feature preservation failure in facial details and accessories. (b) Motion artifacts in generated videos showing temporal inconsistencies.

图 15: Magic Mirror 的局限性。(a) 面部细节和配饰的细粒度特征保留失败。(b) 生成的视频中出现运动伪影,显示时间不一致性。

C.2. Image Generation Results

C.2. 图像生成结果

Magic Mirror demonstrates strong capability in ID-preserved image generation after the image pre-training stage. Notably, it achieves even superior facial identity fidelity compared to video-finetuned variants, primarily due to optimization constraints in video training (e.g., limited batch sizes and dataset scope). Representative examples are presented in Fig. 17.

Magic Mirror 在图像预训练阶段展示了强大的身份保持图像生成能力。值得注意的是,与视频微调变体相比,它甚至实现了更优越的面部身份保真度,这主要归因于视频训练中的优化限制(例如,批量大小和数据范围的限制)。代表性的示例见图 17。

C.3. Video Generation Results

C.3. 视频生成结果

Additional video generation results and comparative analyses are provided in Figs. 18 and 19, highlighting our method’s advantages. Fig. 18 specifically demonstrates the benefits of our one-stage approach over I2V, including superior handling of occluded initial frames, enhanced dynamic range, and improved temporal consistency during facial rotations. In Fig. 19, we provide more results with human faces on different scales.

更多视频生成结果和对比分析见图 18 和图 19,展示了我们方法的优势。图 18 特别展示了我们的一阶段方法相较于 I2V 的优势,包括更好地处理遮挡的初始帧、增强的动态范围以及在面部旋转时改进的时间一致性。图 19 中,我们提供了更多不同尺度的人脸生成结果。

D. Acknowledgments

D. 致谢

Social Impact. Magic Mirror is designed to facilitate creative and research-oriented video content generation while preserving individual identities. We advocate for responsible use in media production and scientific research, explicitly discouraging the creation of misleading content or violation of portrait rights. As our framework builds upon the

社会影响。Magic Mirror 旨在促进创意和研究导向的视频内容生成,同时保护个人身份。我们倡导在媒体制作和科学研究中负责任地使用,明确反对创建误导性内容或侵犯肖像权。由于我们的框架基于


(a) Videos with style-specific prompts (e.g., "Art nouveau, organic curves, floral patterns style, a male police officer talking on the radio")

(a) 带有特定风格提示的视频


(b) multi-shot videos with consistent character prompts

(b) 多镜头视频与一致的角色提示

Figure 16. Additional applications of Magic Mirror. We can generate identity-preserved videos across artistic styles and can generate multi-shot videos with consistent characters. More results are presented in the project page.

图 16: Magic Mirror 的额外应用。我们可以生成跨艺术风格的身份保留视频,并生成具有一致角色的多镜头视频。更多结果请参见项目页面。

DiT foundation model, existing diffusion-based AI-content detection methods remain applicable.

DiT基础模型,现有的基于扩散的AI内容检测方法仍然适用。

Data Usage. The training data we used is almost entirely sourced from known public datasets, including all image data and most video data. All video data was downloaded and processed through proper channels (i.e., download requests). We implement strict NSFW filtering during the training process to ensure content appropriateness. Figures 17-19 are presented on the following pages ↓

数据使用。我们使用的训练数据几乎全部来自已知的公共数据集,包括所有图像数据和大部分视频数据。所有视频数据均通过合法渠道(即下载请求)下载和处理。我们在训练过程中实施了严格的 NSFW 过滤,以确保内容的适当性。图 17-19 将在接下来的页面中展示 ↓


Figure 17. Image generation using Magic Mirror. Model in the image pre-train stage captures ID embeddings of the reference ID (Ref-ID), yet over-fits on some low-level distributions such as image quality, style, and background.

图 17: 使用 Magic Mirror 进行图像生成。图像预训练阶段的模型捕捉了参考 ID (Ref-ID) 的 ID 嵌入,但在一些低级别分布(如图像质量、风格和背景)上过度拟合。


Figure 18. Advantages over I2V generation. Magic Mirror successfully handles challenging scenarios including partially occluded initial frames and maintains identity consistency through complex facial dynamics, addressing limitations of traditional I2V approaches.

图 18: 相较于 I2V 生成的优势。Magic Mirror 成功处理了包括部分遮挡初始帧在内的挑战性场景,并通过复杂的面部动态保持身份一致性,解决了传统 I2V 方法的局限性。


Figure 19. Video generation results. We demonstrate Magic Mirror’s capability across varying facial scales and compositions. Additional examples and comparative analyses are available in the project page.

图 19: 视频生成结果。我们展示了 Magic Mirror 在不同面部比例和构图下的能力。更多示例和对比分析可在项目页面查看。

References

参考文献
