[Paper Translation] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation (turning a single image into video, Alibaba, 2023)


Paper: https://arxiv.org/pdf/2311.17117.pdf

Code: https://github.com/HumanAIGC/AnimateAnyone

Project page: https://humanaigc.github.io/animate-anyone/

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo
Institute for Intelligent Computing, Alibaba Group
{hooks.hl, zimu.gx, futian.zp, xisheng.sk, zhangbang.zb,
https://humanaigc.github.io/animate-anyone/

ABSTRACT

Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with the detailed information of the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character’s movements and employ an effective temporal modeling approach to ensure smooth transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on benchmarks for fashion video and human dance synthesis, achieving state-of-the-art results.

Figure 1: Consistent and controllable character animation results given a reference image (the leftmost image in each group). Our approach is capable of animating arbitrary characters, generating clear and temporally stable video results while maintaining consistency with the appearance details of the reference character. More video results are available in the supplementary material.

1 INTRODUCTION

Character Animation is the task of animating source character images into realistic videos according to desired posture sequences, with many potential applications such as online retail, entertainment videos, artistic creation and virtual characters. Beginning with the advent of GANs[gan, wgan, stylegan], numerous studies have delved into the realms of image animation and pose transfer[fomm, mraa, ren2020deep, tpsmm, siarohin2019animating, zhang2022exploring, bidirectionally, everybody]. However, the generated images or videos still exhibit issues such as local distortion, blurred details, semantic inconsistency, and temporal instability, which impede the widespread application of these methods.

In recent years, diffusion models[denoising] have showcased their superiority in producing high-quality images and videos. Researchers have begun exploring human image-to-video tasks by leveraging the architecture of diffusion models and their pretrained robust generative capabilities. DreamPose[dreampose] focuses on fashion image-to-video synthesis, extending Stable Diffusion[ldm] and proposing an adapter module to integrate CLIP[clip] and VAE[vae] features from images. However, DreamPose requires finetuning on input samples to ensure consistent results, leading to suboptimal operational efficiency. DisCo[disco] explores human dance generation, similarly modifying Stable Diffusion, integrating character features through CLIP, and incorporating background features through ControlNet[controlnet]. However, it exhibits deficiencies in preserving character details and suffers from inter-frame jitter issues.

Furthermore, current research on character animation predominantly focuses on specific tasks and benchmarks, resulting in a limited generalization capability. Recently, benefiting from advancements in text-to-image research[dalle2, glide, imagen, ldm, composer, ediffi], video generation (e.g., text-to-video, video editing)[animatediff, cogvideo, fatezero, imagenvideo, text2videozero, tuneavideo, videocomposer, align, gen1, makeavideo, vdm] has also achieved notable progress in terms of visual quality and diversity. Several studies extend text-to-video methodologies to image-to-video[videocomposer, videocrafter1, i2vgen, animatediff]. However, these methods fall short of capturing intricate details from images, providing more diversity but lacking precision, particularly when applied to character animation, leading to temporal variations in the fine-grained details of the character’s appearance. Moreover, when dealing with substantial character movements, these approaches struggle to generate a consistently stable and continuous process. Currently, no character animation method has been shown to simultaneously achieve generalizability and consistency.

In this paper, we present Animate Anyone, a method capable of transforming character images into animated videos controlled by desired pose sequences. We inherit the network design and pretrained weights from Stable Diffusion (SD) and modify the denoising UNet[unet] to accommodate multi-frame inputs. To address the challenge of maintaining appearance consistency, we introduce ReferenceNet, specifically designed as a symmetrical UNet structure to capture spatial details of the reference image. At each corresponding layer of the UNet blocks, we integrate features from ReferenceNet into the denoising UNet using spatial-attention[attention]. This architecture enables the model to comprehensively learn the relationship with the reference image in a consistent feature space, which significantly contributes to the improvement of appearance details preservation. To ensure pose controllability, we devise a lightweight pose guider to efficiently integrate pose control signals into the denoising process. For temporal stability, we introduce a temporal layer to model relationships across multiple frames, which preserves high-resolution details in visual quality while simulating a continuous and smooth temporal motion process.

Our model is trained on an internal dataset of 5K character video clips. Fig. 1 shows the animation results for various characters. Compared to previous methods, our approach presents several notable advantages. Firstly, it effectively maintains the spatial and temporal consistency of character appearance in videos. Secondly, it produces high-definition videos without issues such as temporal jitter or flickering. Thirdly, it is capable of animating any character image into a video, unconstrained by specific domains. We evaluate our method on two specific human video synthesis benchmarks (UBC fashion video dataset[dwnet] and TikTok dataset[tiktok]), using only the corresponding training datasets for each benchmark in the experiments. Our approach achieves state-of-the-art results. We also compare our method with general image-to-video approaches trained on large-scale data, and our approach demonstrates superior capabilities in character animation. We envision that Animate Anyone could serve as a foundational solution for character video creation, inspiring the development of more innovative and creative applications.

Figure 2: Overview of our method. The pose sequence is initially encoded using the Pose Guider and fused with multi-frame noise, followed by the Denoising UNet conducting the denoising process for video generation. The computational block of the Denoising UNet consists of Spatial-Attention, Cross-Attention, and Temporal-Attention, as illustrated in the dashed box on the right. The integration of the reference image involves two aspects. Firstly, detailed features are extracted through ReferenceNet and utilized for Spatial-Attention. Secondly, semantic features are extracted through the CLIP image encoder for Cross-Attention. Temporal-Attention operates in the temporal dimension. Finally, the VAE decoder decodes the result into a video clip.

2 RELATED WORKS

2.1 DIFFUSION MODEL FOR IMAGE GENERATION

In text-to-image research, diffusion-based methods[dalle2, imagen, ldm, glide, ediffi, composer] have achieved significantly superior generation results, becoming the mainstream of research. To reduce computational complexity, the Latent Diffusion Model[ldm] proposes denoising in the latent space, striking a balance between effectiveness and efficiency. ControlNet[controlnet] and T2I-Adapter[t2iadapter] delve into the controllability of visual generation by incorporating additional encoding layers, facilitating controlled generation under various conditions such as pose, mask, edge and depth. Some studies further investigate image generation under given image conditions. IP-Adapter[ip] enables diffusion models to generate image results that incorporate the content specified by a given image prompt. ObjectStitch[objectstitch] and Paint-by-Example[paint] leverage CLIP[clip] and propose diffusion-based image editing methods under a given image condition. TryonDiffusion[tryondiffusion] applies diffusion models to the virtual apparel try-on task and introduces the Parallel-UNet structure.


2.2 DIFFUSION MODEL FOR VIDEO GENERATION

With the success of diffusion models in text-to-image applications, research in text-to-video has extensively drawn inspiration from text-to-image models in terms of model structure. Many studies[text2videozero, fatezero, cogvideo, tuneavideo, rerender, gen1, followyourpose, makeavideo, vdm] explore the augmentation of inter-frame attention modeling on the foundation of text-to-image (T2I) models to achieve video generation. Some works turn pretrained T2I models into video generators by inserting temporal layers. Video LDM[align] proposes to first pretrain the model on images only and then train temporal layers on videos. AnimateDiff[animatediff] presents a motion module trained on large video data which could be injected into most personalized T2I models without specific tuning. Our approach draws inspiration from such methods for temporal modeling.

Some studies extend text-to-video capabilities to image-to-video. VideoComposer[videocomposer] incorporates images into the diffusion input during training as a conditional control. AnimateDiff[animatediff] performs weighted mixing of image latent and random noise during denoising. VideoCrafter[videocrafter1] incorporates textual and visual features from CLIP as the input for cross-attention. However, these approaches still face challenges in achieving stable human video generation, and the exploration of incorporating image condition input remains an area requiring further investigation.

2.3 DIFFUSION MODEL FOR HUMAN IMAGE ANIMATION

Image animation[fomm, mraa, ren2020deep, tpsmm, siarohin2019animating, zhang2022exploring, bidirectionally, everybody] aims to generate images or videos based on one or more input images. In recent research, the superior generation quality and stable controllability offered by diffusion models have led to their integration into human image animation. PIDM[pidm] proposes texture diffusion blocks to inject the desired texture patterns into denoising for human pose transfer. LFDM[LFDM] synthesizes an optical flow sequence in the latent space, warping the input image based on given conditions. LEO[leo] represents motion as a sequence of flow maps and employs a diffusion model to synthesize sequences of motion codes. DreamPose[dreampose] utilizes the pretrained Stable Diffusion model and proposes an adapter to model the CLIP and VAE image embeddings. DisCo[disco] draws inspiration from ControlNet, decoupling the control of pose and background. Despite the incorporation of diffusion models to enhance generation quality, these methods still grapple with issues such as texture inconsistency and temporal instability in their results. Moreover, no existing method investigates or demonstrates a more generalized capability in character animation.

3 METHODS

We target pose-guided image-to-video synthesis for character animation. Given a reference image describing the appearance of a character and a pose sequence, our model generates an animated video of the character. The pipeline of our method is illustrated in Fig. 2. In this section, we first provide a concise introduction to Stable Diffusion in Sec 3.1, which lays the foundational framework and network structure for our method. Then we provide a detailed explanation of the design specifics in Sec 3.2. Finally, we present the training process in Sec 3.3.

3.1 PRELIMINARY: STABLE DIFFUSION

Our method is an extension of Stable Diffusion (SD), which is developed from the Latent Diffusion Model (LDM). To reduce the computational complexity of the model, it models feature distributions in the latent space. SD employs an autoencoder[vae, vqvae] to establish an implicit representation of images, consisting of an encoder E and a decoder D. Given an image x, the encoder first maps it to a latent representation, z = E(x), and the decoder then reconstructs it, x_recon = D(z).
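As a concrete illustration of this encode/decode round trip, the sketch below assumes the open-source diffusers library and the publicly released SD 1.5 VAE weights; the model id, the 0.18215 scaling factor, and the random stand-in image are assumptions for illustration, not part of this paper.

```python
# Minimal sketch of SD's latent round trip, z = E(x) and x_recon = D(z).
# Assumes the open-source `diffusers` library and the public SD 1.5 VAE weights.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.eval()

x = torch.randn(1, 3, 512, 512)  # stand-in for an image normalized to [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()        # encoder E: 3x512x512 -> 4x64x64 latent
    z = z * vae.config.scaling_factor             # SD's latent scaling (0.18215)
    x_recon = vae.decode(z / vae.config.scaling_factor).sample  # decoder D
```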

SD learns to denoise normally-distributed noise ϵ into a realistic latent z. During training, the image latent z is diffused over t timesteps to produce the noisy latent z_t, and the denoising UNet is trained to predict the applied noise. The optimization process is defined by the following objective:

$$L = \mathbb{E}_{z_t, c, \epsilon, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, c, t) \rVert_2^2 \,\right] \tag{1}$$

where ϵ_θ represents the function of the denoising UNet and c represents the embeddings of the conditional information. In the original SD, a CLIP ViT-L/14[vit] text encoder is applied to represent the text prompt as token embeddings for text-to-image generation. The denoising UNet consists of four downsample layers, one middle layer and four upsample layers. A typical block within a layer includes three types of computation: 2D convolution, self-attention[attention], and cross-attention (termed a Res-Trans block). Cross-attention is conducted between the text embedding and the corresponding network feature.
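To make Eq. (1) concrete, a simplified training step could look like the sketch below. This is a hedged illustration rather than the authors' released code: `unet`, `condition`, and `noise_scheduler` are placeholders for the denoising UNet, the condition embeddings c, and a standard DDPM-style scheduler (API as in diffusers).

```python
# Hedged sketch of the noise-prediction objective in Eq. (1).
# `unet`, `condition`, and `noise_scheduler` are placeholders, not released code.
import torch
import torch.nn.functional as F

def training_step(unet, noise_scheduler, z0, condition):
    """One step of L = E[ || eps - eps_theta(z_t, c, t) ||_2^2 ]."""
    noise = torch.randn_like(z0)                          # eps ~ N(0, I)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)   # a random timestep per sample
    z_t = noise_scheduler.add_noise(z0, noise, t)         # diffuse z0 to the noisy latent z_t
    noise_pred = unet(z_t, t, condition)                  # eps_theta(z_t, c, t)
    return F.mse_loss(noise_pred, noise)                  # squared L2 noise-prediction loss
```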

At inference, z_T is sampled from a random Gaussian distribution at the initial timestep T and is progressively denoised and restored to z_0 via an iterative sampling process (e.g. DDPM[denoising], DDIM[ddim]). In each iteration, the denoising UNet predicts the noise on the latent feature corresponding to the current timestep t. Finally, z_0 is reconstructed by the decoder D to obtain the generated image.
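A hedged sketch of this sampling loop is given below; `unet`, `vae`, and `condition` are placeholders, and the DDIM scheduler API and the 0.18215 latent scaling are borrowed from the public diffusers/SD release as assumptions.

```python
# Hedged sketch of inference: sample z_T ~ N(0, I), iteratively denoise to z_0,
# then decode with the VAE decoder D. Placeholders: `unet`, `vae`, `condition`.
import torch
from diffusers import DDIMScheduler

@torch.no_grad()
def sample(unet, vae, condition, shape=(1, 4, 64, 64), num_steps=50):
    scheduler = DDIMScheduler(num_train_timesteps=1000)
    scheduler.set_timesteps(num_steps)
    z = torch.randn(shape)                        # z_T sampled from a Gaussian
    for t in scheduler.timesteps:                 # progressive denoising, T -> 0
        noise_pred = unet(z, t, condition)        # predicted noise at timestep t
        z = scheduler.step(noise_pred, t, z).prev_sample
    return vae.decode(z / 0.18215).sample         # decoder D reconstructs the image
```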

3.2 NETWORK ARCHITECTURE

Overview. Fig. 2 provides the overview of our method. The initial input to the network consists of multi-frame noise. The denoising UNet is configured based on the design of SD, employing the same framework and block units, and inherits the training weights from SD. Additionally, our method incorporates three crucial components: 1) ReferenceNet, encoding the appearance features of the character from the reference image; 2) Pose Guider, encoding motion control signals for achieving controllable character movements; 3) Temporal layer, encoding temporal relationship to ensure the continuity of character motion.

ReferenceNet. In text-to-video tasks, textual prompts articulate high-level semantics, necessitating only semantic relevance with the generated visual content. However, in image-to-video tasks, images encapsulate more low-level detailed features, demanding precise consistency in the generated results. In preceding studies focused on image-driven generation, most approaches[ip, objectstitch, paint, dreampose, disco, videocrafter1] employ the CLIP image encoder as a substitute for the text encoder in cross-attention. However, this design falls short of addressing issues related to detail consistency. One reason for this limitation is that the input to the CLIP image encoder comprises low-resolution (224×224) images, resulting in the loss of significant fine-grained detail information. Another factor is that CLIP is trained to match semantic features for text, emphasizing high-level feature matching, thereby leading to a deficit in detailed features within the feature encoding.

Hence, we devise a reference image feature extraction network named ReferenceNet. We adopt a framework identical to the denoising UNet for ReferenceNet, excluding the temporal layer. Similar to the denoising UNet, ReferenceNet inherits weights from the original SD, and weight updates are conducted independently for each. We then explain how features from ReferenceNet are integrated into the denoising UNet. Specifically, as shown in Fig. 2, we replace the self-attention layer with a spatial-attention layer. Given a feature map x1 ∈ R^(t×h×w×c) from the denoising UNet and x2 ∈ R^(h×w×c) from ReferenceNet, we first copy x2 t times and concatenate it with x1 along the w dimension. We then perform self-attention and extract the first half of the feature map as the output. This design offers two advantages: firstly, ReferenceNet can leverage the pretrained image feature modeling capabilities of the original SD, resulting in well-initialized features; secondly, due to the essentially identical network structure and shared initialization weights between ReferenceNet and the denoising UNet, the denoising UNet can selectively learn features from ReferenceNet that are correlated in the same feature space. Additionally, cross-attention is employed using the CLIP image encoder. Leveraging the shared feature space with the text encoder, it provides semantic features of the reference image, serving as a beneficial initialization to expedite the entire network training process.
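The fusion step described above can be summarized in a few lines of PyTorch. The sketch below is our hedged reading of this paragraph (replicate x2 t times, concatenate along the width dimension, run self-attention, keep the first half); the tensor sizes and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of ReferenceNet feature fusion via spatial attention,
# following the description above (not the authors' released code).
import torch
import torch.nn as nn

def spatial_attention_fusion(x1, x2, attn: nn.MultiheadAttention):
    """x1: (t, h, w, c) from the denoising UNet; x2: (h, w, c) from ReferenceNet."""
    t, h, w, c = x1.shape
    x2 = x2.unsqueeze(0).expand(t, -1, -1, -1)      # copy x2 t times -> (t, h, w, c)
    x = torch.cat([x1, x2], dim=2)                  # concatenate along the w dimension -> (t, h, 2w, c)
    tokens = x.reshape(t, h * 2 * w, c)             # flatten spatial positions into tokens
    out, _ = attn(tokens, tokens, tokens)           # self-attention over the concatenated map
    out = out.reshape(t, h, 2 * w, c)
    return out[:, :, :w, :]                         # keep the first half (the denoising-UNet part)

# Example usage with hypothetical sizes:
attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)
x1 = torch.randn(16, 32, 32, 320)                   # 16 video frames
x2 = torch.randn(32, 32, 320)                       # reference image features
fused = spatial_attention_fusion(x1, x2, attn)      # (16, 32, 32, 320)
```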

A comparable design is ControlNet[controlnet], which introduces additional control features into the denoising UNet using zero convolution. However, control information, such as depth and edge, is spatially aligned with the target image, while the reference image and the target image are spatially related but not aligned. Consequently, ControlNet is not suitable for direct application. We will substantiate this in the subsequent experimental Section 4.4.
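For reference, the zero convolution mentioned here is simply a convolution whose weights and bias are initialized to zero, so the injected control branch contributes nothing at the start of training; a minimal sketch of that standard construction (an illustration, not code from either paper):

```python
# Minimal sketch of a ControlNet-style "zero convolution": a 1x1 convolution
# initialized to zero so the control branch has no effect before training.
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```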

While ReferenceNet introduces a comparable number of parameters to the denoising UNet, in diffusion-based video generation, all video frames undergo denoising multiple times, whereas ReferenceNet only needs to extract features once throughout the entire process. Consequently, during inference, it does not lead to a substantial increase in computational overhead.
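The efficiency argument amounts to extracting ReferenceNet features once per video and reusing them at every denoising step; the control flow might look like the hedged sketch below, where all names (`referencenet`, `denoising_unet`, `scheduler`, `pose_features`) are placeholders rather than the authors' API.

```python
# Hedged sketch of why ReferenceNet adds little inference overhead:
# its features are extracted once, while the denoising UNet runs at every timestep.
import torch

@torch.no_grad()
def animate(referencenet, denoising_unet, scheduler, ref_latent, pose_features, z):
    ref_features = referencenet(ref_latent)            # computed ONCE per video
    for t in scheduler.timesteps:                      # denoising repeats many times
        noise_pred = denoising_unet(z, t, ref_features, pose_features)
        z = scheduler.step(noise_pred, t, z).prev_sample
    return z
```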

Pose Guider. ControlNet[controlnet] demonstrates hig