[Paper Translation] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation (single image to video, Alibaba, 2023)


Paper: https://arxiv.org/pdf/2311.17117.pdf

Code: https://github.com/HumanAIGC/AnimateAnyone

Project page: https://humanaigc.github.io/animate-anyone/

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo
Institute for Intelligent Computing, Alibaba Group
https://humanaigc.github.io/animate-anyone/

ABSTRACT

Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with the detailed appearance of the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve the consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

Figure 1: Consistent and controllable character animation results given a reference image (the leftmost image in each group). Our approach is capable of animating arbitrary characters, generating clear and temporally stable video results while maintaining consistency with the appearance details of the reference character. More video results are available in the supplementary material.

1 INTRODUCTION

Character Animation is the task of animating source character images into realistic videos according to desired posture sequences, with many potential applications such as online retail, entertainment videos, artistic creation, and virtual characters. Beginning with the advent of GANs[gan, wgan, stylegan], numerous studies have delved into the realms of image animation and pose transfer[fomm, mraa, ren2020deep, tpsmm, siarohin2019animating, zhang2022exploring, bidirectionally, everybody]. However, the generated images or videos still exhibit issues such as local distortion, blurred details, semantic inconsistency, and temporal instability, which impede the widespread application of these methods.

In recent years, diffusion models[denoising] have showcased their superiority in producing high-quality images and videos. Researchers have begun exploring human image-to-video tasks by leveraging the architecture of diffusion models and their pretrained robust generative capabilities. DreamPose[dreampose] focuses on fashion image-to-video synthesis, extending Stable Diffusion[ldm] and proposing an adapter module to integrate CLIP[clip] and VAE[vae] features from images. However, DreamPose requires finetuning on input samples to ensure consistent results, leading to suboptimal operational efficiency. DisCo[disco] explores human dance generation, similarly modifying Stable Diffusion, integrating character features through CLIP, and incorporating background features through ControlNet[controlnet]. However, it exhibits deficiencies in preserving character details and suffers from inter-frame jitter issues.

Furthermore, current research on character animation predominantly focuses on specific tasks and benchmarks, resulting in limited generalization capability. Recently, benefiting from advancements in text-to-image research[dalle2, glide, imagen, ldm, composer, ediffi], video generation (e.g., text-to-video, video editing)[animatediff, cogvideo, fatezero, imagenvideo, text2videozero, tuneavideo, videocomposer, align, gen1, makeavideo, vdm] has also achieved notable progress in terms of visual quality and diversity. Several studies extend text-to-video methodologies to image-to-video[videocomposer, videocrafter1, i2vgen, animatediff]. However, these methods fall short of capturing intricate details from images, providing more diversity but lacking precision, particularly when applied to character animation, leading to temporal variations in the fine-grained details of the character's appearance. Moreover, when dealing with substantial character movements, these approaches struggle to generate a consistently stable and continuous process. Currently, no existing character animation method simultaneously achieves generalizability and consistency.

In this paper, we present Animate Anyone, a method capable of transforming character images into animated videos controlled by desired pose sequences. We inherit the network design and pretrained weights from Stable Diffusion (SD) and modify the denoising UNet[unet] to accommodate multi-frame inputs. To address the challenge of maintaining appearance consistency, we introduce ReferenceNet, specifically designed as a symmetrical UNet structure to capture spatial details of the reference image. At each corresponding layer of the UNet blocks, we integrate features from ReferenceNet into the denoising UNet using spatial attention[attention]. This architecture enables the model to comprehensively learn the relationship with the reference image in a consistent feature space, which significantly contributes to the improvement of appearance detail preservation. To ensure pose controllability, we devise a lightweight pose guider to efficiently integrate pose control signals into the denoising process. For temporal stability, we introduce a temporal layer to model relationships across multiple frames, which preserves high-resolution details in visual quality while simulating a continuous and smooth temporal motion process.

Our model is trained on an internal dataset of 5K character video clips. Fig. 1 shows the animation results for various characters. Compared to previous methods, our approach presents several notable advantages. Firstly, it effectively maintains the spatial and temporal consistency of character appearance in videos. Secondly, it produces high-definition videos without issues such as temporal jitter or flickering. Thirdly, it is capable of animating any character image into a video, unconstrained by specific domains. We evaluate our method on two specific human video synthesis benchmarks (UBC fashion video dataset[dwnet] and TikTok dataset[tiktok]), using only the corresponding training datasets for each benchmark in the experiments. Our approach achieves state-of-the-art results. We also compare our method with general image-to-video approaches trained on large-scale data, and our approach demonstrates superior capabilities in character animation. We envision that Animate Anyone could serve as a foundational solution for character video creation, inspiring the development of more innovative and creative applications.

Figure 2: The overview of our method. The pose sequence is initially encoded using the Pose Guider and fused with multi-frame noise, followed by the Denoising UNet conducting the denoising process for video generation. The computational block of the Denoising UNet consists of Spatial-Attention, Cross-Attention, and Temporal-Attention, as illustrated in the dashed box on the right. The integration of the reference image involves two aspects. Firstly, detailed features are extracted through ReferenceNet and utilized for Spatial-Attention. Secondly, semantic features are extracted through the CLIP image encoder for Cross-Attention. Temporal-Attention operates in the temporal dimension. Finally, the VAE decoder decodes the result into a video clip.

2 RELATED WORKS

2.1 DIFFUSION MODEL FOR IMAGE GENERATION

In text-to-image research, diffusion-based methods[dalle2, imagen, ldm, glide, ediffi, composer] have achieved significantly superior generation results, becoming the mainstream of research. To reduce computational complexity, the Latent Diffusion Model[ldm] proposes denoising in the latent space, striking a balance between effectiveness and efficiency. ControlNet[controlnet] and T2I-Adapter[t2iadapter] delve into the controllability of visual generation by incorporating additional encoding layers, facilitating controlled generation under various conditions such as pose, mask, edge and depth. Some studies further investigate image generation under given image conditions. IP-Adapter[ip] enables diffusion models to generate image results that incorporate the content specified by a given image prompt. ObjectStitch[objectstitch] and Paint-by-Example[paint] leverage CLIP[clip] and propose diffusion-based image editing methods given an image condition. TryonDiffusion[tryondiffusion] applies diffusion models to the virtual apparel try-on task and introduces the Parallel-UNet structure.

2.2 DIFFUSION MODEL FOR VIDEO GENERATION

With the success of diffusion models in text-to-image applications, research in text-to-video has extensively drawn inspiration from text-to-image models in terms of model structure. Many studies[text2videozero, fatezero, cogvideo, tuneavideo, rerender, gen1, followyourpose, makeavideo, vdm] explore the augmentation of inter-frame attention modeling on the foundation of text-to-image (T2I) models to achieve video generation. Some works turn pretrained T2I models into video generators by inserting temporal layers. Video LDM[align] proposes to first pretrain the model on images only and then train temporal layers on videos. AnimateDiff[animatediff] presents a motion module trained on large video data which could be injected into most personalized T2I models without specific tuning. Our approach draws inspiration from such methods for temporal modeling.

Some studies extend text-to-video capabilities to image-to-video. VideoComposer[videocomposer] incorporates images into the diffusion input during training as a conditional control. AnimateDiff[animatediff] performs weighted mixing of image latent and random noise during denoising. VideoCrafter[videocrafter1] incorporates textual and visual features from CLIP as the input for cross-attention. However, these approaches still face challenges in achieving stable human video generation, and the exploration of incorporating image condition input remains an area requiring further investigation.

2.3 DIFFUSION MODEL FOR HUMAN IMAGE ANIMATION

Image animation[fomm, mraa, ren2020deep, tpsmm, siarohin2019animating, zhang2022exploring, bidirectionally, everybody] aims to generate images or videos based on one or more input images. In recent research, the superior generation quality and stable controllability offered by diffusion models have led to their integration into human image animation. PIDM[pidm] proposes texture diffusion blocks to inject the desired texture patterns into denoising for human pose transfer. LFDM[LFDM] synthesizes an optical flow sequence in the latent space, warping the input image based on given conditions. LEO[leo] represents motion as a sequence of flow maps and employs a diffusion model to synthesize sequences of motion codes. DreamPose[dreampose] utilizes the pretrained Stable Diffusion model and proposes an adapter to model the CLIP and VAE image embeddings. DisCo[disco] draws inspiration from ControlNet, decoupling the control of pose and background. Despite the incorporation of diffusion models to enhance generation quality, these methods still grapple with issues such as texture inconsistency and temporal instability in their results. Moreover, no existing method investigates or demonstrates a more generalized capability in character animation.

3 METHODS

We target pose-guided image-to-video synthesis for character animation. Given a reference image describing the appearance of a character and a pose sequence, our model generates an animated video of the character. The pipeline of our method is illustrated in Fig. 2. In this section, we first provide a concise introduction to Stable Diffusion in Sec 3.1, which lays the foundational framework and network structure for our method. Then we provide a detailed explanation of the design specifics in Sec 3.2. Finally, we present the training process in Sec 3.3.

3.1 PRELIMINARY: STABLE DIFFUSION

Our method is an extension of Stable Diffusion (SD), which is developed from the Latent Diffusion Model (LDM). To reduce computational complexity, it models feature distributions in the latent space. SD employs an autoencoder[vae, vqvae] to establish an implicit representation of images, consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. Given an image $x$, the encoder first maps it to a latent representation $z = \mathcal{E}(x)$, and the decoder then reconstructs it as $x_{recon} = \mathcal{D}(z)$.

SD learns to denoise a normally-distributed noise $\epsilon$ into a realistic latent $z$. During training, the image latent $z$ is diffused over $t$ timesteps to produce the noisy latent $z_t$, and a denoising UNet is trained to predict the applied noise. The optimization process is defined by the following objective:

$$
L = \mathbb{E}_{z_t,\, c,\, \epsilon,\, t}\left(\left\| \epsilon - \epsilon_\theta(z_t, c, t) \right\|_2^2\right) \tag{1}
$$

where $\epsilon_\theta$ represents the function of the denoising UNet and $c$ represents the embeddings of the conditional information. In the original SD, a CLIP ViT-L/14[vit] text encoder is applied to represent the text prompt as token embeddings for text-to-image generation. The denoising UNet consists of four downsample layers, one middle layer, and four upsample layers. A typical block within a layer includes three types of computations: 2D convolution, self-attention[attention], and cross-attention (termed the Res-Trans block). Cross-attention is conducted between the text embedding and the corresponding network feature.
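
To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch of one training step under the $\epsilon$-prediction parameterization; `denoising_unet`, the noise schedule `alphas_cumprod`, and the conditioning tensor `c` are illustrative placeholders, not the authors' released code.

```python
import torch

def training_step(denoising_unet, z0, c, alphas_cumprod):
    """One noise-prediction training step implementing Eq. (1) (sketch)."""
    b = z0.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)            # random timestep per sample
    eps = torch.randn_like(z0)                                  # target noise
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)                  # cumulative alpha at timestep t
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps          # forward diffusion q(z_t | z_0)
    eps_pred = denoising_unet(z_t, t, c)                        # predict the applied noise
    loss = torch.nn.functional.mse_loss(eps_pred, eps)          # || eps - eps_theta(z_t, c, t) ||_2^2
    return loss
```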

At inference, $z_T$ is sampled from a random Gaussian distribution at the initial timestep $T$ and is progressively denoised and restored to $z_0$ via an iterative sampling process (e.g., DDPM[denoising], DDIM[ddim]). In each iteration, the denoising UNet predicts the noise on the latent feature corresponding to each timestep $t$. Finally, $z_0$ is reconstructed by the decoder $\mathcal{D}$ to obtain the generated image.
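
As an illustration of this inference loop, here is a minimal deterministic DDIM-style ($\eta = 0$) sampling sketch; the timestep schedule and the `denoising_unet` interface are assumptions for exposition, not the exact sampler configuration used in the paper.

```python
import torch

@torch.no_grad()
def ddim_sample(denoising_unet, c, shape, alphas_cumprod, steps=20):
    """Deterministic DDIM sampling (eta = 0) from z_T ~ N(0, I) down to z_0 (sketch)."""
    device = alphas_cumprod.device
    z = torch.randn(shape, device=device)                          # z_T from a Gaussian prior
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, steps, device=device).long()
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < steps else torch.tensor(1.0, device=device)
        t_batch = torch.full((shape[0],), int(t), device=device, dtype=torch.long)
        eps = denoising_unet(z, t_batch, c)                         # predict noise at timestep t
        x0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()         # estimate the clean latent
        z = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps     # DDIM update toward the previous step
    return z                                                        # decode with the VAE decoder afterwards
```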

3.2 NETWORK ARCHITECTURE

Overview. Fig. 2 provides the overview of our method. The initial input to the network consists of multi-frame noise. The denoising UNet is configured based on the design of SD, employing the same framework and block units, and inherits the training weights from SD. Additionally, our method incorporates three crucial components: 1) ReferenceNet, encoding the appearance features of the character from the reference image; 2) Pose Guider, encoding motion control signals for achieving controllable character movements; 3) Temporal layer, encoding temporal relationship to ensure the continuity of character motion.

ReferenceNet. In text-to-video tasks, textual prompts articulate high-level semantics, necessitating only semantic relevance with the generated visual content. However, in image-to-video tasks, images encapsulate more low-level detailed features, demanding precise consistency in the generated results. In preceding studies focused on image-driven generation, most approaches[ip, objectstitch, paint, dreampose, disco, videocrafter1] employ the CLIP image encoder as a substitute for the text encoder in cross-attention. However, this design falls short of addressing issues related to detail consistency. One reason for this limitation is that the input to the CLIP image encoder comprises low-resolution (224×224) images, resulting in the loss of significant fine-grained detail information. Another factor is that CLIP is trained to match semantic features for text, emphasizing high-level feature matching, thereby leading to a deficit in detailed features within the feature encoding.

Hence, we devise a reference image feature extraction network named ReferenceNet. We adopt a framework identical to the denoising UNet for ReferenceNet, excluding the temporal layer. Similar to the denoising UNet, ReferenceNet inherits weights from the original SD, and weight updates are conducted independently for each. We then explain how features from ReferenceNet are integrated into the denoising UNet. Specifically, as shown in Fig. 2, we replace the self-attention layer with a spatial-attention layer. Given a feature map $x_1 \in \mathbb{R}^{t \times h \times w \times c}$ from the denoising UNet and $x_2 \in \mathbb{R}^{h \times w \times c}$ from ReferenceNet, we first copy $x_2$ $t$ times and concatenate it with $x_1$ along the $w$ dimension. We then perform self-attention and extract the first half of the feature map as the output. This design offers two advantages: firstly, ReferenceNet can leverage the pretrained image feature modeling capabilities of the original SD, resulting in well-initialized features; secondly, because ReferenceNet and the denoising UNet share an essentially identical network structure and initialization weights, the denoising UNet can selectively learn features from ReferenceNet that are correlated in the same feature space. Additionally, cross-attention is employed using the CLIP image encoder. Leveraging the shared feature space with the text encoder, it provides semantic features of the reference image, serving as a beneficial initialization to expedite the training of the entire network.
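
A minimal PyTorch sketch of the spatial-attention fusion described above; the tensor layout (flattening the $h \times 2w$ map into a token sequence) and the `attn` module are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def spatial_attention_fusion(x1, x2, attn: nn.MultiheadAttention):
    """Fuse ReferenceNet features into the denoising UNet via spatial self-attention (sketch).

    x1: (t, h, w, c) feature map from the denoising UNet (t video frames).
    x2: (h, w, c)    feature map from ReferenceNet for the single reference image.
    """
    t, h, w, c = x1.shape
    x2_rep = x2.unsqueeze(0).expand(t, h, w, c)          # copy x2 t times
    x_cat = torch.cat([x1, x2_rep], dim=2)               # concatenate along the w dimension -> (t, h, 2w, c)
    tokens = x_cat.reshape(t, h * 2 * w, c)              # flatten spatial positions into a token sequence
    out, _ = attn(tokens, tokens, tokens)                # self-attention over the concatenated map
    out = out.reshape(t, h, 2 * w, c)
    return out[:, :, :w, :]                              # keep the first half (the denoising-UNet side)

# Usage sketch:
# attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
# fused = spatial_attention_fusion(torch.randn(24, 32, 32, 320), torch.randn(32, 32, 320), attn)
```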

A comparable design is ControlNet[controlnet], which introduces additional control features into the denoising UNet using zero convolution. However, control information, such as depth and edge, is spatially aligned with the target image, while the reference image and the target image are spatially related but not aligned. Consequently, ControlNet is not suitable for direct application. We will substantiate this in the subsequent experimental Section 4.4.

While ReferenceNet introduces a comparable number of parameters to the denoising UNet, in diffusion-based video generation, all video frames undergo denoising multiple times, whereas ReferenceNet only needs to extract features once throughout the entire process. Consequently, during inference, it does not lead to a substantial increase in computational overhead.

Pose Guider. ControlNet[controlnet] demonstrates highly robust conditional generation capabilities beyond text. Different from such methods, since the denoising UNet needs to be finetuned, we choose not to incorporate an additional control network, to prevent a significant increase in computational complexity. Instead, we employ a lightweight Pose Guider. This Pose Guider utilizes four convolution layers (4×4 kernels, 2×2 strides, with 16, 32, 64, 128 channels, similar to the condition encoder in [controlnet]) to align the pose image with the same resolution as the noise latent. The processed pose image is then added to the noise latent before being input into the denoising UNet. The Pose Guider is initialized with Gaussian weights, and in the final projection layer we employ zero convolution.
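
The following is a minimal PyTorch sketch of such a lightweight pose guider, following the layer hyperparameters listed above (4×4 kernels, stride 2, 16/32/64/128 channels, zero-initialized final projection); the activation choice and output channel count are assumptions for illustration, not the released module.

```python
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    """Lightweight pose encoder: four strided convs plus a zero-initialized projection (sketch)."""

    def __init__(self, pose_channels: int = 3, latent_channels: int = 4):
        super().__init__()
        layers, in_ch = [], pose_channels
        for out_ch in (16, 32, 64, 128):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1), nn.SiLU()]
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers)
        # Zero convolution: the guider initially contributes nothing to the noise latent.
        self.proj = nn.Conv2d(in_ch, latent_channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose_image: torch.Tensor) -> torch.Tensor:
        # pose_image: (b, 3, H, W) rendered skeleton. Four stride-2 convs downsample 16x;
        # matching the SD noise-latent resolution (H/8) may require a higher-resolution pose
        # input or adjusted strides - this sketch simply follows the listed hyperparameters.
        return self.proj(self.encoder(pose_image))

# pose_feat = PoseGuider()(torch.randn(1, 3, 768, 768))  # added to the noise latent before the UNet
```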

Temporal Layer. Numerous studies have suggested incorporating supplementary temporal layers into text-to-image (T2I) models to capture the temporal dependencies among video frames. This design facilitates the transfer of pretrained image generation capabilities from the base T2I model. Adhering to this principle, our temporal layer is integrated after the spatial-attention and cross-attention components within the Res-Trans block. The design of the temporal layer is inspired by AnimateDiff[animatediff]. Specifically, for a feature map $x \in \mathbb{R}^{b \times t \times h \times w \times c}$, we first reshape it to $x \in \mathbb{R}^{(b \times h \times w) \times t \times c}$ and then perform temporal attention, i.e., self-attention along the dimension $t$. The feature from the temporal layer is incorporated into the original feature through a residual connection. This design aligns with the two-stage training approach that we describe in the following subsection. The temporal layer is applied exclusively within the Res-Trans blocks of the denoising UNet. ReferenceNet computes features for a single reference image and does not engage in temporal modeling. Because the Pose Guider already achieves controllability of continuous character movement, experiments demonstrate that the temporal layer ensures temporal smoothness and continuity of appearance details, obviating the need for intricate motion modeling.
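
Below is a minimal PyTorch sketch of the temporal-attention step described above (reshape so that the frame axis becomes the sequence dimension, attend, reshape back, and add residually); the normalization and attention modules are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalLayer(nn.Module):
    """Self-attention along the frame dimension t with a residual connection (sketch)."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, t, h, w, c) video feature map
        b, t, h, w, c = x.shape
        tokens = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)   # one temporal sequence per spatial position
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)                             # self-attention along t
        tokens = tokens + attn_out                                   # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)  # back to (b, t, h, w, c)

# y = TemporalLayer(channels=320)(torch.randn(2, 24, 16, 16, 320))
```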

3.3 TRAINING STRATEGY

The training process is divided into two stages. In the first stage, training is performed using individual video frames. Within the denoising UNet, we temporarily exclude the temporal layer and the model takes single-frame noise as input. The ReferenceNet and Pose Guider are also trained during this stage. The reference image is randomly selected from the entire video clip. We initialize the denoising UNet and ReferenceNet with the pretrained weights from SD. The Pose Guider is initialized using Gaussian weights, except for the final projection layer, which utilizes zero convolution. The weights of the VAE's encoder and decoder, as well as the CLIP image encoder, are all kept fixed. The optimization objective in this stage is to enable the model to generate high-quality animated images under the condition of a given reference image and target pose. In the second stage, we introduce the temporal layer into the previously trained model and initialize it using pretrained weights from AnimateDiff[animatediff]. The input to the model consists of a 24-frame video clip. During this stage, we only train the temporal layer while fixing the weights of the rest of the network.
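
As an illustration of which modules are trainable in each stage, here is a hedged PyTorch-style sketch; the module names (`denoising_unet`, `reference_net`, `pose_guider`, `temporal_layers`, `vae`, `clip_image_encoder`) are placeholders for exposition, not the authors' code.

```python
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(stage: int, denoising_unet, reference_net, pose_guider,
                    temporal_layers, vae, clip_image_encoder):
    """Stage 1: train UNet + ReferenceNet + Pose Guider on single frames.
    Stage 2: train only the temporal layers on 24-frame clips."""
    set_trainable(vae, False)                    # VAE encoder/decoder always frozen
    set_trainable(clip_image_encoder, False)     # CLIP image encoder always frozen
    if stage == 1:
        set_trainable(denoising_unet, True)
        set_trainable(reference_net, True)
        set_trainable(pose_guider, True)
        set_trainable(temporal_layers, False)    # temporal layer excluded in stage 1
    else:
        set_trainable(denoising_unet, False)
        set_trainable(reference_net, False)
        set_trainable(pose_guider, False)
        set_trainable(temporal_layers, True)     # initialized from AnimateDiff weights
```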

4 EXPERIMENTS

4.1 IMPLEMENTATIONS

To demonstrate the applicability of our approach in animating various characters, we collect 5K character video clips (2-10 seconds long) from the internet to train our model. We employ DWPose[dwpose] to extract the pose sequence of characters in the video, including the body and hands, rendering it as pose skeleton images following OpenPose[openpose]. Experiments are conducted on 4 NVIDIA A100 GPUs. In the first training stage, individual video frames are sampled, resized, and center-cropped to a resolution of 768×768. Training is conducted for 30,000 steps with a batch size of 64. In the second training stage, we train the temporal layer for 10,000 steps with 24-frame video sequences and a batch size of 4. Both learning rates are set to 1e-5. During inference, we rescale the length of the driving pose skeleton to approximate the length of the character’s skeleton in the reference image and use a DDIM sampler for 20 denoising steps. We adopt the temporal aggregation method in [edge], connecting results from different batches to generate long videos. For fair comparison with other image animation methods, we also train our model on two specific benchmarks (UBC fashion video dataset[dwnet] and TikTok dataset[tiktok]) without using additional data, as will be discussed in Section 4.3.
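
The temporal aggregation step (connecting batches into a long video) is not detailed here; one common realization is overlapping sliding windows whose predictions are averaged in the overlapping frames, sketched below as an assumption rather than the exact procedure of [edge].

```python
import torch

def aggregate_windows(generate_clip, pose_sequence, window: int = 24, overlap: int = 6):
    """Generate a long video from overlapping pose windows and average frames in the overlaps (sketch).

    generate_clip: callable mapping a pose window to a (frames, C, H, W) video tensor.
    pose_sequence: (N, ...) full driving pose sequence.
    """
    n = pose_sequence.shape[0]
    stride = window - overlap
    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:                                  # make sure tail frames are covered
        starts.append(n - window)
    out_sum, out_cnt = None, None
    for start in starts:
        clip = generate_clip(pose_sequence[start:start + window])
        if out_sum is None:
            c, h, w = clip.shape[1:]
            out_sum = torch.zeros(n, c, h, w)
            out_cnt = torch.zeros(n, 1, 1, 1)
        out_sum[start:start + window] += clip
        out_cnt[start:start + window] += 1
    return out_sum / out_cnt.clamp(min=1)                        # blended long video
```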

Figure 3: Qualitative Results. Given a reference image (the leftmost image), our approach demonstrates the ability to animate diverse characters, encompassing full-body human figures, half-length portraits, cartoon characters, and humanoid figures. The illustration showcases results with clear, consistent details, and continuous motion. More video results are available in supplementary material.

4.2 QUALITATIVE RESULTS

Fig. 3 demonstrates that our method can animate arbitrary characters, including full-body human figures, half-length portraits, cartoon characters, and humanoid characters. Our approach is capable of generating high-definition and realistic character details. It maintains temporal consistency with the reference images even under substantial motion and exhibits temporal continuity between frames. More video results are available in the supplementary material.

4.3 COMPARISONS

To demonstrate the effectiveness of our approach compared to other image animation methods, we evaluate its performance on two specific benchmarks: fashion video synthesis and human dance generation. For quantitative assessment of image-level quality, SSIM[ssim], PSNR[psnr] and LPIPS[lpips] are employed. Video-level evaluation uses the FVD[fvd] metric.
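
For reference, a minimal sketch of how the image-level metrics could be computed with common libraries (scikit-image for SSIM/PSNR and the `lpips` package); FVD requires a pretrained video feature extractor and is omitted here. These library choices are assumptions, not the paper's exact evaluation code.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: (H, W, 3) uint8 frames; returns SSIM, PSNR, and LPIPS for one frame pair."""
    ssim = structural_similarity(pred, gt, channel_axis=2)
    psnr = peak_signal_noise_ratio(gt, pred)
    # lpips expects (1, 3, H, W) tensors scaled to [-1, 1]
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return {"SSIM": ssim, "PSNR": psnr, "LPIPS": lp}
```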

Figure 4: Qualitative comparison for fashion video synthesis. DreamPose and BDMM exhibit shortcomings in preserving the fine-textured details of clothing, whereas our method excels in maintaining exceptional detail features. More comparisons can be found in the supplementary material.

| Method | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FVD ↓ |
| --- | --- | --- | --- | --- |
| MRAA[mraa] | 0.749 | - | 0.212 | 253.6 |
| TPSMM[tpsmm] | 0.746 | - | 0.213 | 247.5 |
| BDMM[bidirectionally] | 0.918 | 24.07 | 0.048 | 148.3 |
| DreamPose[dreampose] | 0.885 | - | 0.068 | 238.7 |
| DreamPose* | 0.879 | 34.75 | 0.111 | 279.6 |
| Ours | 0.931 | 38.49 | 0.044 | 81.6 |

Table 1: Quantitative comparison for fashion video synthesis. "DreamPose*" denotes the result without sample finetuning.

Fashion Video Synthesis. Fashion video synthesis aims to turn fashion photographs into realistic, animated videos using a driving pose sequence. Experiments are conducted on the UBC fashion video dataset, which consists of 500 training and 100 testing videos, each containing roughly 350 frames. The quantitative comparison is shown in Tab. 1. Our results outperform other methods, particularly exhibiting a significant lead in the video-level metric (FVD). A qualitative comparison is shown in Fig. 4. For a fair comparison, we obtain results of DreamPose without sample finetuning using its open-source code. In the domain of fashion videos, there is a stringent requirement for fine-grained clothing details. However, the videos generated by DreamPose and BDMM fail to maintain the consistency of clothing details and exhibit noticeable errors in terms of color and fine structural elements. In contrast, our method produces results that effectively preserve the consistency of clothing details.

Figure 5: Qualitative comparison between DisCo and our method. DisCo’s generated results display problems such as pose control errors, color inaccuracy, and inconsistent details. In contrast, our method demonstrates significant improvements in addressing these issues.

| Method | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FVD ↓ |
| --- | --- | --- | --- | --- |
| FOMM[fomm] | 0.648 | 29.01 | 0.335 | 405.2 |
| MRAA[mraa] | 0.672 | 29.39 | 0.296 | 284.8 |
| TPSMM[tpsmm] | 0.673 | 29.18 | 0.299 | 306.1 |
| DisCo[disco] | 0.668 | 29.03 | 0.292 | 292.8 |
| Ours | 0.718 | 29.56 | 0.285 | 171.9 |

Table 2: Quantitative comparison for human dance generation.

Human Dance Generation. Human Dance Generation focuses on animating images in real-world dance scenarios. We utilize the TikTok dataset, comprising 340 training and 100 testing single human dancing videos (10-15 seconds long). Following DisCo’s dataset partitioning and utilizing an identical test set (10 TikTok-style videos), we conduct a quantitative comparison presented in Tab. 2, and our method achieves the best results. For enhanced generalization, DisCo incorporates human attribute pre-training, utilizing a large number of image pairs for model pre-training. In contrast, our training is exclusively conducted on the TikTok dataset, yielding results superior to DisCo. We present qualitative comparison with DisCo in Fig. 5. Given the scene’s complexity, DisCo’s pipeline involves an extra step with SAM[segment] to generate the human foreground mask. Conversely, our approach showcases that, even without explicit learning of human mask, the model can grasp the foreground-background relationship from the subject’s motion, negating the necessity for prior human segmentation. Additionally, during intricate dance sequences, our model stands out in maintaining visual continuity throughout the motion and exhibits enhanced robustness in handling diverse character appearances.

Figure 6: Qualitative comparison with image-to-video methods. Current image-to-video methods struggle to generate substantial character movements and face challenges in maintaining long-term consistency with character image features.

General Image-to-Video Methods. Currently, numerous studies propose video diffusion models with strong generative capabilities based on large-scale training data. We select two of the most well-known and effective image-to-video methods for comparison: AnimateDiff[animatediff] and Gen-2[gen1]. As these two methods do not perform pose control, we only compare their ability to maintain the appearance fidelity to the reference image. As depicted in Fig. 6, current image-to-video methods face challenges in generating substantial character movements and struggle to maintain long-term appearance consistency in videos, thus hindering effective support for consistent character animation.

4.4 ABLATION STUDY

To demonstrate the effectiveness of the ReferenceNet design, we explore alternative designs, including (1) using only the CLIP image encoder to represent reference image features without integrating ReferenceNet, (2) initially finetuning SD and subsequently training ControlNet with the reference image, and (3) integrating the above two designs. Experiments are conducted on the UBC fashion video dataset. As shown in Fig. 7, the visualizations illustrate that ReferenceNet outperforms the other three designs. Solely relying on CLIP features as reference image features can preserve image similarity but fails to fully transfer details. ControlNet does not enhance results as its features lack spatial correspondence, rendering it inapplicable. Quantitative results are also presented in Tab. 3, demonstrating the superiority of our design.

Figure 7: Ablation study of different designs. Only our ReferenceNet ensures the consistent preservation of details in the character's appearance.

| Design | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FVD ↓ |
| --- | --- | --- | --- | --- |
| CLIP | 0.897 | 36.09 | 0.089 | 208.5 |
| ControlNet | 0.892 | 35.89 | 0.105 | 213.9 |
| CLIP+ControlNet | 0.898 | 36.03 | 0.086 | 205.4 |
| Ours | 0.931 | 38.49 | 0.044 | 81.6 |

Table 3: Quantitative comparison for ablation study.

5 LIMITATIONS

The limitations of our method can be summarized in three aspects: First, similar to many visual generation models, our model may struggle to generate highly stable results for hand movements, sometimes leading to distortions and motion blur. Second, since images provide information from only one perspective, generating unseen parts during character movement is an ill-posed problem which encounters potential instability. Third, due to the utilization of DDPM, our model exhibits a lower operational efficiency compared to non-diffusion-model-based methods.

6 CONCLUSION

In this paper, we present Animate Anyone, a character animation framework capable of transforming character photographs into animated videos controlled by a desired pose sequence while ensuring consistent appearance and temporal stability. We propose ReferenceNet, which faithfully preserves intricate character appearances, and we also achieve efficient pose controllability and temporal continuity. Our approach not only applies to general character animation but also outperforms existing methods on specific benchmarks. Animate Anyone serves as a foundational method, with the potential for future extension into various image-to-video applications.

REFERENCES