Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks
Wei Xiong¹, Wenhan Luo², Lin Ma², Wei Liu², and Jiebo Luo¹
¹Department of Computer Science, University of Rochester, Rochester, NY 14623
²Tencent AI Lab, Shenzhen, China
ABSTRACT
Given a photo taken outdoors, can we predict the immediate future, e.g., how the clouds will move in the sky? We address this problem by presenting a generative adversarial network (GAN) based two-stage approach to generating realistic time-lapse videos of high resolution. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos with realistic content for each frame. The second stage refines the generated video from the first stage by enforcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, the Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset and test our approach on this new dataset. Using our model, we are able to generate realistic videos of up to 128×128 resolution for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over state-of-the-art models.
I INTRODUCTION
Humans can often estimate fairly well what will happen in the immediate future given the current scene. However, for vision systems, predicting future states is still a challenging task. The problem of future prediction or video synthesis has drawn increasing attention in recent years since it is critical for various kinds of vision applications, such as autonomous driving [1], video understanding [2], and robotics [3]. The goal of video prediction in this paper is to generate realistic, long-term, and high-quality future frames given one starting frame. Achieving such a goal is difficult, as it is challenging to model the multi-modality and uncertainty in generating both the content and motion of future frames.
Fig. 1: From top to bottom: example frames of videos generated by VGAN [4], RNN-GAN [5], the first stage of our model, and the second stage of our model, respectively. The content generated by our model (the third and fourth rows) is visually more realistic. The left column shows the input frame.
In terms of content generation, the main problem is to define what to learn. Generating the future on the basis of only one static image encounters the inherent uncertainty of the future, as illustrated in [6]. Since there can be multiple possibilities for reasonable future scenes following the first frame, the objective function is difficult to define. Generating future frames by simply learning to reconstruct the real video can lead to unrealistic results [4, 7]. Several models, including [8] and [4], have been proposed to address this problem based on generative adversarial networks [9]. For example, 3D convolution is incorporated in an adversarial network in [4] to model the transformation from an image to a video. Their model produces plausible futures given the first frame. However, the generated videos tend to be blurry and lose content details, which degrades their realism. A possible cause is that the vanilla encoder-decoder structure in the generator fails to preserve all the indispensable details of the content.
Regarding motion transformation, the main challenge is to drive the given frame to transform realistically over time. Some prior work has investigated this problem. Zhou and Berg [5] use an RNN to model the temporal transformations. They are able to generate a few types of motion patterns, but these are not realistic enough. The reason may be that each future frame is based on the state of previous frames, so the error accumulates and the motion distorts over time. The information loss and error accumulation during sequence generation hinder the success of future prediction.
The performance of the prior models indicates that it is nontrivial to generate, with a single model, videos that have both realistic content in each frame and vivid motion dynamics across frames. One reason may be that the representation capacity of a single model is limited in satisfying two objectives that may contradict each other. To this end, we divide the modeling of video generation into content modeling and motion modeling, and propose a Multi-Stage Dynamic Generative Adversarial Network (MD-GAN) model to produce realistic future videos. There are two stages in our approach. The first stage aims at generating future frames with content details as realistic as possible given an input frame. The second stage specifically deals with motion modeling, i.e., making the movement of objects between adjacent frames more vivid while keeping the content realistic.
To be more specific, we develop a generative adversarial network called Base-Net to generate content in the first stage. Both the generator and the discriminator are composed of 3D convolutions and deconvolutions to model temporal and spatial patterns. The adversarial loss of this stage encourages the generator to produce videos whose distribution is close to that of real ones. In order to preserve more content details, we use a 3D U-net-like architecture [10] in the generator instead of the vanilla encoder-decoder structure. Skip connections [11] are used to link the corresponding feature maps in the encoder and decoder so that the decoder can reuse features from the encoder, thus reducing the information loss. In this way, the model can generate better content details in each future frame, which are visually more pleasing than those produced by the vanilla encoder-decoder architecture, such as the model in [4].
The Base-Net can generate frames with concrete details, but may not be capable of modeling the motion transformations across frames. To generate future frames with vivid motion, the second stage of MD-GAN, which we call the Refine-Net, takes the output of the first stage as input and refines the temporal transformation with another generative adversarial network while preserving the realistic content details. We propose an adversarial ranking loss to train this network, so as to encourage the generated video to be closer to the real one while being further away from the input video (from Stage I) with regard to motion. To this end, we introduce the Gram matrix [12] to model the dynamic transformations among consecutive frames. We present a few example frames generated by the conventional methods and our method in Fig. 1. The frames generated by our model are sharper than those of the state-of-the-art methods and are visually almost as realistic as the real ones.
We build a large-scale time-lapse video dataset called Sky Scene to evaluate models for future prediction. Our dataset includes daytime, nightfall, starry sky, and aurora scenes. MD-GAN is trained on this dataset and predicts future frames given a static image of a sky scene. We are able to produce realistic 128×128 videos, whose resolution is much higher than that of the state-of-the-art models. Unlike some prior work which generates merely one frame at a time, our model generates 32 future frames in a single pass, further preventing error accumulation and information loss.
Our key contributions are as follows:
- We build a large-scale time-lapse video dataset, which contains high-resolution dynamic videos of sky scenes.
- We propose a Multi-Stage Dynamic Generative Adversarial Network (MD-GAN), which effectively captures spatial and temporal transformations and thus generates realistic time-lapse future frames at up to 128×128 resolution given only one starting frame.
- We introduce the Gram matrix for motion modeling and propose an adversarial ranking loss to mimic the motion of real-world videos, which refines the motion dynamics of the preliminary outputs from the first stage and forces the model to produce more realistic and higher-quality future frames.
II RELATED WORK
II-A GENERATIVE ADVERSARIAL NETWORKS
A generative adversarial network [9] is composed of a generator and a discriminator. The generator tries to fool the discriminator by producing samples similar to real ones, while the discriminator is trained to distinguish the generated samples from the real ones. They are trained in an adversarial way until a “Nash equilibrium” is reached. GANs have been successfully applied to image generation. In the seminal paper [9], models trained on the MNIST dataset and the Toronto Face Database (TFD), respectively, generate images of digits and faces with high likelihood. Relying only on random noise, the original GAN cannot control the mode of the generated samples, so the conditional GAN [13] was proposed, generating images of digits conditioned on class labels and captions conditioned on image features. Many subsequent works are variants of the conditional GAN, including image-to-image translation [14, 15], text-to-image translation [16], and super-resolution [17]. Our model is also a GAN, conditioned on a starting image to generate a video.
Inspired by the coarse-to-fine strategy, multi-stack methods such as StackGAN [18] and LAPGAN [19] have been proposed to first generate coarse images and then refine them into finer images. Our model also employs this strategy to stack GANs in two stages. However, instead of refining the pixel-level details in each frame, the second stage focuses on improving the motion dynamics across the frames generated by the first stage.
Fig. 2: The overall architecture of the MD-GAN model. The input image is first duplicated to 32 frames as input to the generator G1 of the Base-Net, which produces a video Y1. The discriminator D1 then distinguishes the real video Y from Y1. Following the Base-Net, the Refine-Net takes the resulting video of G1 as input and generates a more realistic video Y2. The discriminator D2 is updated with an adversarial ranking loss to push Y2 (the result of the Refine-Net) closer to real videos.
II-B VIDEO GENERATION
Video generation has become popular recently. Based on the conditional VAE [20], Xue et al. [21] propose a cross convolutional network to model layered motion, which applies learned kernels to image features encoded by a multi-scale image encoder. The output difference image is added to the current frame to produce the next frame. Many approaches employ GANs for future prediction. In [4], a two-stream CNN, one stream for the foreground and the other for the background, is proposed for video generation. Combining the dynamic foreground stream and the static background stream, the generated videos look real. In the follow-up work [6], Vondrick and Torralba formulate the future prediction task as transforming pixels in the past into the future. Based on large-scale unlabeled video data, a CNN model is trained with adversarial learning. Content and motion are decomposed and encoded separately by multi-scale residual blocks, and then combined and decoded to generate plausible videos on both the KTH and the Weizmann datasets [22]. A similar idea is presented in [23]. To generate long-term future frames, Villegas et al. [8] estimate high-level structure (human body pose), and train an LSTM and an analogy-based encoder-decoder CNN to generate future frames based on the current image and the estimated high-level structure.
The closest work to ours is [5], which also generates time-lapse videos. However, there are important differences between their work and ours. First, our method is based on 3D convolutions, while a recurrent neural network is employed in [5] to recursively generate future frames, which is prone to error accumulation. Second, as modeling motion is indispensable for video generation, we explicitly model motion by introducing the Gram matrix. Finally, we generate high-resolution (128×128) videos of dynamic scenes, while the videos generated in [5] are simple (usually with a clean background) and of 64×64 resolution.
III OUR APPROACH
III-A OVERVIEW
The proposed MD-GAN takes a single RGB image as input and attempts to predict future frames that are as realistic as possible. This task is accomplished in two stages in a coarse-to-fine manner: 1) content generation by the Base-Net in Stage I. Given an input image x, the model generates a video Y1 of T frames (including the starting frame, i.e., the input image). The Base-Net ensures that each frame in Y1 looks like a real natural image; besides, Y1 also serves as a coarse estimate of the ground truth Y regarding motion. 2) motion generation by the Refine-Net in Stage II. The Refine-Net refines Y1 with vivid motion dynamics and produces a more vivid video Y2 as the final prediction. The discriminator D2 of the Refine-Net takes three inputs: the result video Y1 of the Base-Net, the fake video Y2 produced by the generator of the Refine-Net, and the ground-truth video Y. We define an adversarial ranking loss to encourage the final video Y2 to be closer to the real video and further away from the result video Y1 of the Base-Net. Note that in each stage, we follow the setting of Pix2Pix [14] and do not incorporate any random noise. The overall architecture of our model is shown in Fig. 2.
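A compact sketch of this two-stage flow at inference time, written as PyTorch-style pseudocode under the assumption that G1 and G2 are the trained 3D convolutional generators described below (the function and variable names are ours):

```python
import torch

@torch.no_grad()
def predict_future(x: torch.Tensor, G1, G2, T: int = 32) -> torch.Tensor:
    """Two-stage MD-GAN inference sketch.

    x:  starting frame of shape (3, H, W)
    G1: Base-Net generator  -- realistic per-frame content, coarse motion
    G2: Refine-Net generator -- refined, more vivid motion dynamics
    """
    # Duplicate the single frame into a static clip of T frames: (1, 3, T, H, W).
    X = x.unsqueeze(0).unsqueeze(2).repeat(1, 1, T, 1, 1)
    Y1 = G1(X)   # Stage I prediction
    Y2 = G2(Y1)  # Stage II refinement: the final prediction
    return Y2
```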
III-B STAGE I: BASE-NET
As shown in Fig. 2, the Base-Net is a generative adversarial network composed of a generator G1 and a discriminator D1. Given an image $x \in \mathbb{R}^{3 \times H \times W}$ as a starting frame, we duplicate it $T$ times, obtaining a static video $X \in \mathbb{R}^{3 \times T \times H \times W}$. (In the generator, we could also use a 2D CNN to encode the image, but we duplicate the input image into a video to better fit the 3D U-net-like architecture of G1.) By forwarding X through layers of 3D convolutions and 3D deconvolutions, the generator G1 outputs a video $Y_1 \in \mathbb{R}^{3 \times T \times H \times W}$ of $T$ frames, i.e., $Y_1 = G_1(X)$.
For generator G1, we adopt an encoder-decoder architecture, which is also employed in [24] and [4]. However, the vanilla encoder-decoder architecture has trouble generating decent results because the features from the encoder may not be fully exploited. Therefore, we instead utilize a 3D U-net-like architecture [10] so that features in the encoder can be fully exploited to generate Y1. The U-net architecture is implemented by introducing skip connections between the feature maps of the encoder and the decoder, as shown in Fig. 2. The skip connections build information highways between the features in the bottom and top layers, so that features can be reused. In this way, the generated video is more likely to contain rich content details. This may seem like a small trick, yet it plays a key role in improving the quality of the generated videos.
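As a minimal illustration of such a skip connection in a 3D encoder-decoder (a sketch with illustrative layer sizes, not the exact configuration of G1):

```python
import torch
import torch.nn as nn

class TinyUNet3D(nn.Module):
    """Illustrative 3D U-net-like generator with a single skip connection."""

    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # The last decoder layer consumes both its upsampled input and the
        # corresponding encoder features (hence 32 + 32 = 64 input channels).
        self.dec1 = nn.Sequential(nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, x):                              # x: (N, 3, T, H, W)
        e1 = self.enc1(x)                              # (N, 32, T/2, H/2, W/2)
        e2 = self.enc2(e1)                             # (N, 64, T/4, H/4, W/4)
        d2 = self.dec2(e2)                             # (N, 32, T/2, H/2, W/2)
        return self.dec1(torch.cat([d2, e1], dim=1))   # skip connection reuses e1
```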
The discriminator D1 then takes video Y1 and the real video Y as input and tries to distinguish them. x is the first frame of Y. D1 shares the same architecture as the encoder part of G1, except that the final layer is a single node with a sigmoid activation function.
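Correspondingly, a minimal sketch of a 3D convolutional discriminator that ends in a single sigmoid node (again illustrative rather than the exact D1 configuration):

```python
import torch.nn as nn

class TinyDiscriminator3D(nn.Module):
    """Illustrative 3D CNN discriminator: a video in, one real/fake score out."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool3d(1),                   # collapse T, H, W
        )
        self.classifier = nn.Sequential(nn.Flatten(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, video):                          # video: (N, 3, T, H, W)
        return self.classifier(self.features(video))   # (N, 1) probability of "real"
```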
To train our GAN-based model, we impose an adversarial loss on the Base-Net.
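Assuming the standard conditional GAN formulation, with the generator G1 and discriminator D1 defined above, this loss takes the form:

$$\min_{G_1}\max_{D_1}\ \mathcal{L}_{adv} = \mathbb{E}_{Y}\!\left[\log D_1(Y)\right] + \mathbb{E}_{X}\!\left[\log\left(1 - D_1(G_1(X))\right)\right]. \tag{1}$$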
Prior work based on conditional GANs has found that combining the adversarial loss with an L1 or L2 loss [14] in the pixel space benefits performance. We therefore define a content loss as a complement to the adversarial loss, to further ensure that the content of the generated video follows patterns similar to those of real-world videos. As pointed out in [14], the L1 distance usually yields sharper outputs than the L2 distance. Recently, instead of measuring the similarity of images in pixel space, a perceptual loss [25] has been introduced in some GAN-based approaches to model the distance between high-level feature representations. These features are extracted from a well-trained CNN model, and previous experiments suggest that they capture the semantics of content [17]. Although the perceptual loss performs well in combination with GANs on some tasks [17, 26], it typically requires features to be extracted from a pretrained deep neural network, which is both time- and space-consuming. In addition, we observe in our experiments that directly combining the adversarial loss with an L1 loss, which minimizes the pixel-space distance between the generated video and the ground-truth video, already leads to satisfactory performance. Thus, we define our content loss as the pixel-wise L1 distance between the generated video and the ground truth.
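In the notation above, this content loss can be written (up to a normalization constant, which we omit) as:

$$\mathcal{L}_{con}(X, Y) = \lVert Y - G_1(X)\rVert_1. \tag{2}$$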
The final objective of our Base-Net in Stage I combines the adversarial loss and the content loss.
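Assuming a single weighting coefficient $\lambda$ (a hyperparameter we introduce here for illustration) to balance the two terms, the objective can be written as:

$$\mathcal{L}_{stage_1} = \mathcal{L}_{adv} + \lambda\,\mathcal{L}_{con}(X, Y). \tag{3}$$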
The adversarial training allows the Base-Net to produce videos with realistic content details. However, since the learning capacity of a single GAN is limited given the uncertainty of the future, one GAN model may not be able to capture the correct motion patterns in the real data. As a consequence, the motion dynamics of the generated videos may not be realistic enough. To tackle this problem, we further process the output of Stage I with another GAN model in Stage II, called the Refine-Net, to compensate for the lack of vivid motion dynamics and generate more realistic videos.
III-C STAGE II: REFINE-NET
Taking the video Y1 from Stage I as input, the Refine-Net improves the motion quality of the generated video Y2, so that it becomes difficult for human eyes to tell which of Y2 and the ground-truth video Y is real.
The generator G2 of the Refine-Net is similar to G1 in the Base-Net. When training the model, we find it difficult to generate vivid motion while retaining realistic content details when using skip connections. In other words, skip connections mainly contribute to content generation, but may not be helpful for motion generation. We therefore remove a few skip connections from G2, as illustrated in Fig. 2. The discriminator D2 of the Refine-Net is also a CNN with 3D convolutions and shares the same structure as D1 in the Base-Net.
We adopt adversarial training to update G2 and D2. However, naively employing the vanilla adversarial loss can lead to an identity mapping, since the input Y1 is already an optimal output of such a structure, i.e., of G1. If G2 learns an identity mapping, the output Y2 is not improved at all. To force the network to learn effective temporal transformations, we propose an adversarial ranking loss that drives the network to generate videos closer to real-world videos and further away from the input video (Y1 from Stage I). The ranking loss, detailed later, is defined as $\mathcal{L}_{rank}(Y_1, Y, Y_2)$ with regard to the input Y1, the output Y2, and the ground-truth video Y. To construct such a ranking loss, we should take advantage of effective features that represent the dynamics across frames well. Based on such feature representations, distances between videos can be conveniently calculated.
We employ the Gram matrix [12] as the motion feature representation to assist G2 in learning dynamics across video frames. Given an input video, we first extract features of the video with discriminator D2. The Gram matrix is then calculated across frames using these features, so that it incorporates rich temporal information.
Specifically, given an input video $Y$, suppose that the output of the $l$-th convolutional layer in D2 is $H^l_Y \in \mathbb{R}^{N \times C_l \times T_l \times H_l \times W_l}$, where $(N, C_l, T_l, H_l, W_l)$ are the batch size, the number of filters, the length of the time dimension, and the height and width of the feature maps, respectively. We reshape $H^l_Y$ to $\hat{H}^l_Y \in \mathbb{R}^{N \times M_l \times S_l}$, where $M_l = C_l \times T_l$ and $S_l = H_l \times W_l$. Then we calculate the Gram matrix $g(Y; l)$ of the $l$-th layer as follows:

$$g(Y; l) = \frac{1}{M_l \times S_l} \sum_{n=1}^{N} \hat{H}^{l,n}_Y \left(\hat{H}^{l,n}_Y\right)^{\mathsf{T}}, \tag{4}$$

where $\hat{H}^{l,n}_Y$ is the $n$-th sample of $\hat{H}^l_Y$. $g(Y; l)$ computes the covariance matrix between the intermediate features of discriminator D2. Since the calculation incorporates information from different time steps, it encodes the motion information of the given video $Y$.
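A minimal PyTorch-style sketch of Eq. (4) (the function name and the choice of returning a single batch-summed matrix are ours; the input would be the activations of a chosen layer of D2):

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix g(Y; l) of Eq. (4) for a batch of 3D-CNN features.

    feat: activations of shape (N, C_l, T_l, H_l, W_l).
    Returns an (M_l, M_l) matrix with M_l = C_l * T_l, summed over the batch
    and normalized by M_l * S_l, where S_l = H_l * W_l.
    """
    n, c, t, h, w = feat.shape
    m, s = c * t, h * w
    hat = feat.reshape(n, m, s)                 # \hat{H}^l_Y
    gram = torch.bmm(hat, hat.transpose(1, 2))  # per-sample (M_l x M_l) products
    return gram.sum(dim=0) / (m * s)
```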
The Gram matrix has been successfully applied to synthesizing dynamic textures in previous works [27, 28], but our work differs from them in several aspects. First, we use the Gram matrix for video prediction, while the prior works use it for dynamic texture synthesis. Second, we directly calculate the Gram matrix of videos based on the features of discriminator D2, which is updated in each iteration during training. In contrast, the prior works typically calculate it with a pre-trained VGG network [29], which is fixed during training. The motivation of such a different choice is that, as discriminator D2 is closely related to the measurement of motion quality, it is reasonable to directly use features in D2.
To make full use of the video representations, we adopt a variant of the contrastive loss introduced in [30] and [31] to compute the distance between videos, and define an adversarial ranking loss with respect to the Gram-matrix features from the $l$-th layer of the discriminator.
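One plausible instantiation (an assumption on our part, since the exact contrastive variant follows [30, 31]) scores the L1 distances between Gram matrices with a softmax, so that minimizing the loss pulls $g(Y_2; l)$ toward $g(Y; l)$ while pushing it away from $g(Y_1; l)$:

$$\mathcal{L}_{rank}(Y_1, Y, Y_2; l) = -\log \frac{\exp\!\left(-\lVert g(Y_2;l) - g(Y;l)\rVert_1\right)}{\exp\!\left(-\lVert g(Y_2;l) - g(Y;l)\rVert_1\right) + \exp\!\left(-\lVert g(Y_2;l) - g(Y_1;l)\rVert_1\right)}. \tag{5}$$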
We extract features from multiple convolutional layers of the discriminator for the input Y1, the output Y2, and the ground-truth video Y, and calculate their Gram matrices, respectively. Note that the features for each video are extracted from the same discriminator. The final adversarial ranking loss aggregates these per-layer terms.
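Assuming a plain, unweighted sum over the selected layers, this aggregation reads:

$$\mathcal{L}_{rank}(Y_1, Y, Y_2) = \sum_{l} \mathcal{L}_{rank}(Y_1, Y, Y_2; l). \tag{6}$$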
Similar to the objective in Stage I, we also incorporate the pixel-wise L1 distance to capture low-level details in the overall objective of the Refine-Net.
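With weighting coefficients $\lambda$ and $\beta$ (hyperparameters we introduce here for illustration), this overall objective can be written as:

$$\mathcal{L}_{stage_2} = \mathcal{L}_{adv} + \lambda\,\mathcal{L}_{rank}(Y_1, Y, Y_2) + \beta\,\lVert Y - G_2(Y_1)\rVert_1. \tag{7}$$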
As shown in Algorithm 1, the generator and the discriminator are trained alternately. When training