Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks


Wei Xiong¹, Wenhan Luo², Lin Ma², Wei Liu², and Jiebo Luo¹
¹Department of Computer Science, University of Rochester, Rochester, NY 14623
²Tencent AI Lab, Shenzhen, China


Taking a photo outside, can we predict the immediate future, e.g., how would the cloud move in the sky? We address this problem by presenting a generative adversarial network (GAN) based two-stage approach to generating realistic time-lapse videos of high resolution. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos of realistic contents for each frame. The second stage refines the generated video from the first stage by enforcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, Gram matrix is employed to model the motion more precisely. We build a large scale time-lapse dataset, and test our approach on this new dataset. Using our model, we are able to generate realistic videos of up to 128×128 resolution for 32 frames. Quantitative and qualitative experiment results have demonstrated the superiority of our model over the state-of-the-art models.




Humans can often estimate fairly well what will happen in the immediate future given the current scene. However, for vision systems, predicting the future states is still a challenging task. The problem of future prediction or video synthesis has drawn more and more attention in recent years since it is critical for various kinds of vision applications, such as automatic driving [1], video understanding [2], and robotics [3]. The goal of video prediction in this paper is to generate realistic, long-term, and high-quality future frames given one starting frame. Achieving such a goal is difficult, as it is challenging to model the multi-modality and uncertainty in generating both the content and motion in future frames.


Fig. 1: From top to bottom: example frames of generated videos by VGAN [4], RNN-GAN [5], the first stage of our model, and the second stage of our model, respectively. The contents generated by our model (the third and fourth rows) are visually more realistic. The left column is the input frame.

In terms of content generation, the main problem is to define what to learn. Generating the future from only one static image faces the inherent uncertainty of the future, as illustrated in [6]. Since multiple reasonable future scenes can follow the first frame, the objective function is difficult to define. Generating future frames by simply learning to reconstruct the real video can lead to unrealistic results [4, 7]. Several models, including [8] and [4], have been proposed to address this problem based on generative adversarial networks [9]. For example, 3D convolution is incorporated in an adversarial network to model the transformation from an image to a video in [4]. Their model produces plausible futures given the first frame. However, the generated video tends to be blurry and to lose content details, which degrades the realism of the generated videos. A possible cause is that the vanilla encoder-decoder structure in the generator fails to preserve all the indispensable details of the content.

Regarding motion transformation, the main challenge is to drive the given frame to transform realistically over time. Some prior work has investigated this problem. Zhou and Berg [5] use an RNN to model the temporal transformations. Their model is able to generate a few types of motion patterns, but the results are not realistic enough. The reason may be that each future frame is based on the state of previous frames, so the error accumulates and the motion distorts over time. The information loss and error accumulation during sequence generation hinder the success of future prediction.


The performance of the prior models indicates that it is nontrivial to generate videos with both realistic contents in each frame and vivid motion dynamics across frames with a single model at the same time. One reason may be that the representation capacity of a single model is limited in satisfying two objectives that may contradict each other. To this end, we divide the modeling of video generation into content and motion modeling, and propose a Multi-stage Dynamic Generative Adversarial Network (MD-GAN) model to produce realistic future videos. There are two stages in our approach. The first stage aims at generating future frames with content details as realistic as possible given an input frame. The second stage specifically deals with motion modeling, i.e., to make the movement of objects between adjacent frames more vivid, while keeping the content realistic.

To be more specific, we develop a generative adversarial network called Base-Net to generate contents in the first stage. Both the generator and the discriminator are composed of 3D convolutions and deconvolutions to model temporal and spatial patterns. The adversarial loss of this stage encourages the generator to produce videos of similar distributions to real ones. In order to preserve more content details, we use a 3D U-net [10] like architecture in the generator instead of the vanilla encoder-decoder structure. Skip connections [11] are used to link the corresponding feature maps in the encoder and decoder so that the decoder can reuse features in the encoder, thus reducing the information loss. In this way, the model can generate better content details in each future frame, which are visually more pleasing than those produced by the vanilla encoder-decoder architecture such as the model in [4].

The Base-Net can generate frames with concrete details, but may not be capable of modeling the motion transformations across frames. To generate future frames with vivid motion, the second stage of MD-GAN takes the output of the first stage as input and refines the temporal transformation with another generative adversarial network, which we call Refine-Net, while preserving the realistic content details. We propose an adversarial ranking loss to train this network so as to encourage the generated video to be closer to the real one while being further away from the input video (from Stage I) regarding motion. To this end, we introduce the Gram matrix [12] to model the dynamic transformations among consecutive frames. We present a few example frames generated by the conventional methods and our method in Fig. 1. The image frames generated by our model are sharper than those of the state-of-the-art models and are visually almost as realistic as the real ones.

We build a large scale time-lapse video dataset called Sky Scene to evaluate the models for future prediction. Our dataset includes daytime, nightfall, starry sky, and aurora scenes. MD-GAN is trained on this dataset and predicts future frames given a static image of sky scene. We are able to produce 128×128 realistic videos, whose resolution is much higher than that of the state-of-the-art models. Unlike some prior work which generates merely one frame at a time, our model generates 32 future frames by a single pass, further preventing error accumulation and information loss.

Our key contributions are as follows:

  1. We build a large scale time-lapse video dataset, which contains high-resolution dynamic videos of sky scenes.
  2. We propose a Multi-Stage Dynamic Generative Adversarial Network (MD-GAN), which can effectively capture the spatial and temporal transformations, thus generating realistic time-lapse future frames up to 128×128 resolution given only one starting frame.
  3. We introduce the Gram matrix for motion modeling and propose an adversarial ranking loss to mimic motions of real-world videos, which refines motion dynamics of preliminary outputs in the first stage and forces the model to produce more realistic and higher-quality future frames.





A generative adversarial network [9] is composed of a generator and a discriminator. The generator tries to fool the discriminator by producing samples similar to real ones, while the discriminator is trained to distinguish the generated samples from the real ones. They are trained in an adversarial way and finally a “Nash Equilibrium” is achieved. GAN has been successfully applied to image generation. In the seminal paper [9], models trained on the MNIST dataset and the Toronto Face Database (TFD), respectively, generate images of digits and faces with high likelihood. Relying only on random noise, the original GAN cannot control the mode of the generated samples, thus conditional GAN [13] is proposed. Images of digits conditioned on class labels and captions conditioned on image features are generated. Many subsequent works are variants of conditional GAN, including image to image translation [14, 15], text to image translation [16] and super-resolution [17]. Our model is also a GAN conditioned on a starting image to generate a video.

Inspired by the coarse-to-fine strategy, multi-stack methods such as StackGAN [18], LAPGAN [19] have been proposed to first generate coarse images and then refine them to finer images. Our model also employs this strategy to stack GANs in two stages. However, instead of refining the pixel-level details in each frame, the second stage focuses on improving motion dynamics across frames generated by the first stage.


Fig. 2: The overall architecture of MD-GAN model. The input image is first duplicated to 32 frames as input to generator G1 of the Base-Net, which produces a video Y1. The discriminator D1 then distinguishes the real video Y from Y1. Following the Base-Net, the Refine-Net takes the result video of G1 as the input and generates a more realistic video Y2. The discriminator D2 is updated with an adversarial ranking loss to push Y2 (the result of Refine-Net) closer to real videos.


Video generation has been popular recently. Based on conditional VAE [20], Xue et al. [21] propose a cross convolutional network to model layered motion, which applies learned kernels to image features encoded in a multi-scale image encoder. The output difference image is added to the current frame to produce the next frame. Many approaches employ GAN for future prediction. In [4], a two-stream CNN, one for foreground and the other one for background, is proposed for video generation. Combining the dynamic foreground stream and the static background stream, the generated video looks real. In the follow-up work [6], Vondrick and Torralba formulate the future prediction task as transforming pixels in the past to future. Based on large scale unlabeled video data, a CNN model is trained with adversarial learning. Content and motion are decomposed and encoded separately by multi-scale residual blocks, and then combined and decoded to generate plausible videos on both the KTH and the Weizmann datasets [22]. A similar idea is presented in [23]. To generate long-term future frames, Villegas et al. [8] estimate high-level structure (human body pose), and learn LSTM and analogy-based encoder-decoder CNN to generate future frames based on the current image and the estimated high-level structure.

The closest work to ours is [5], which also generates time-lapse videos. However, there are important differences between their work and ours. First, our method is based on 3D convolution while a recurrent neural network is employed in [5] to recursively generate future frames, which is prone to error accumulation. Second, as modeling motion is indispensable for video generation, we explicitly model motion by introducing the Gram matrix. Finally, we generate high-resolution (128×128) videos of dynamic scenes, while the generated videos in [5] are simple (usually with clean background) and of resolution 64×64.



The proposed MD-GAN takes a single RGB image as input and attempts to predict future frames that are as realistic as possible. This task is accomplished in two stages in a coarse-to-fine manner: 1) Content generation by the Base-Net in Stage I. Given an input image x, the model generates a video Y1 of T frames (including the starting frame, i.e., the input image). The Base-Net ensures that each produced frame in Y1 looks like a real natural image. In addition, Y1 serves as a coarse estimate of the ground truth Y regarding motion. 2) Motion generation by the Refine-Net in Stage II. The Refine-Net refines Y1 with vivid motion dynamics and produces a more vivid video Y2 as the final prediction. The discriminator D2 of the Refine-Net takes three inputs: the result video Y1 of the Base-Net, the fake video Y2 produced by the generator of the Refine-Net, and the ground truth video Y. We define an adversarial ranking loss to encourage the final video Y2 to be closer to the real video and further away from the result video Y1 of the Base-Net. Note that in each stage, we follow the setting in Pix2Pix [14] and do not incorporate any random noise. The overall architecture of our model is shown in Fig. 2.



As shown in Fig. 2, the Base-Net is a generative adversarial network composed of a generator G1 and a discriminator D1. Given an image $x \in \mathbb{R}^{3 \times H \times W}$ as a starting frame, we duplicate it T times, obtaining a static video $X \in \mathbb{R}^{3 \times T \times H \times W}$. (In the generator, we could also use a 2D CNN to encode the image, but we duplicate the input image into a video to better fit the 3D U-net like architecture of G1.) By forwarding X through layers of 3D convolutions and 3D deconvolutions, the generator G1 outputs a video $Y_1 \in \mathbb{R}^{3 \times T \times H \times W}$ of T frames, i.e., $Y_1 = G_1(X)$.
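The duplication step can be illustrated with a short NumPy sketch. The function name `make_static_video` and the channel-first array layout are assumptions made for illustration; the paper does not specify an implementation.

```python
import numpy as np

def make_static_video(x, num_frames=32):
    """Repeat a single frame along a new time axis.

    x: array of shape (3, H, W), the starting frame.
    Returns an array of shape (3, T, H, W) in which every frame equals x.
    """
    if x.ndim != 3:
        raise ValueError("expected a (3, H, W) image")
    # Insert a time axis after the channel axis, then repeat along it.
    return np.repeat(x[:, None, :, :], num_frames, axis=1)

frame = np.random.rand(3, 128, 128).astype(np.float32)
static_video = make_static_video(frame, num_frames=32)
```

The result is the static video X that the 3D generator consumes; all temporal variation in the output therefore has to come from the network itself.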

For generator G1, we adopt an encoder-decoder architecture, which is also employed in [24] and [4]. However, the vanilla encoder-decoder architecture encounters problems in generating decent results as the features from the encoder may not be fully exploited. Therefore, we utilize a 3D U-net like architecture [10] instead so that features in the encoder can be fully made use of to generate Y1. The U-net architecture is implemented by introducing skip connections between the feature maps of the encoder and the decoder, as shown in Fig. 2. The skip connections build information highways between the features in the bottom and top layers, so that features can be reused. In this way, the generated video is more likely to contain rich content details. This may seem like a small trick, yet it plays a key role in improving the quality of videos.

The discriminator D1 then takes video Y1 and the real video Y as input and tries to distinguish them. x is the first frame of Y. D1 shares the same architecture as the encoder part of G1, except that the final layer is a single node with a sigmoid activation function.

To train our GAN-based model, the adversarial loss of the Base-Net is defined as:
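Following the standard GAN objective [9] and the notation above, this loss can be written as below. This is a reconstruction under that assumption, with expectations taken over the training data:

$$ \mathcal{L}_{adv} = \min_{G_1} \max_{D_1} \; \mathbb{E}\left[\log D_1(Y)\right] + \mathbb{E}\left[\log\left(1 - D_1(G_1(X))\right)\right]. \qquad (1) $$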

Prior work based on conditional GAN discovers that combining the adversarial loss with the L1 or L2 loss [14] in the pixel space will benefit the performance. So we define a content loss function as a complement to the adversarial loss, to further ensure that the content of the generated video follows similar patterns to the content of real-world videos. As pointed out in [14], L1 distance usually results in sharper outputs than those of L2 distance. Recently, instead of measuring the similarity of images in pixel space, perceptual loss [25] is introduced in some GAN-based approaches to model the distance between high-level feature representations. These features are extracted from a well-trained CNN model and previous experiments suggest they capture semantics of contents [17]. Although the perceptual loss performs well in combination with GANs [17, 26] on some tasks, it typically requires features to be extracted from a pretrained deep neural network, which is both time and space consuming. In addition, we observe in experiments that directly combining the adversarial loss and the L1 loss that minimizes the Minkowski distance between the generated video and the ground truth video in the pixel space leads to satisfactory performance. Thus, we define our content loss as
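With the L1 distance chosen above, the content loss takes the following form. It is written here as the L1 norm of the pixel-wise difference between the ground truth and the generated video; any normalization constant is an assumption and is left out:

$$ \mathcal{L}_{con}(G_1) = \lVert Y - G_1(X) \rVert_1. \qquad (2) $$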


The final objective of our Base-Net in Stage I is
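Assuming the Stage I objective is the unweighted sum of the adversarial and content terms, i.e., $\mathcal{L}_{stage1} = \mathcal{L}_{adv} + \mathcal{L}_{con}(G_1)$, the generator-side value can be sketched in NumPy as follows. The function name `stage1_generator_loss`, the mean reduction of the L1 term, and the `eps` guard are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def stage1_generator_loss(d_fake, y_fake, y_real, eps=1e-8):
    """Generator-side Stage I objective (sketch).

    d_fake: sigmoid outputs of D1 on generated videos, shape (N,).
    y_fake, y_real: generated and ground-truth videos with equal shapes.
    """
    # Adversarial term minimized by the generator: log(1 - D1(G1(X))).
    adv = np.mean(np.log(1.0 - d_fake + eps))
    # L1 content term, mean-reduced here (an assumption) for scale stability.
    con = np.mean(np.abs(y_real - y_fake))
    return adv + con

loss = stage1_generator_loss(
    d_fake=np.array([0.5, 0.5]),
    y_fake=np.zeros((2, 3, 4, 8, 8)),
    y_real=np.ones((2, 3, 4, 8, 8)),
)
```

In practice the two terms would be computed inside an automatic-differentiation framework so that gradients can flow back to G1; the NumPy version only shows how the terms combine.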

The adversarial training allows the Base-Net to produce videos with realistic content details. However, as the learning capacity of GAN is limited considering the uncertainty of the future, one single GAN model may not be able to capture the correct motion patterns in the real data. As a consequence, the motion dynamics of the generated videos may not be realistic enough. To tackle this problem, we further process the output of Stage I by another GAN model called Refine-Net in Stage II, to compensate it for vivid motion dynamics, and generate more realistic videos.


Taking the video Y1 from Stage I as input, the Refine-Net produces a video Y2 with improved motion quality, so that human observers find it hard to tell Y2 apart from the ground truth video Y.

The generator G2 of Refine-Net is similar to G1 in Base-Net. When training the model, we find it difficult to generate vivid motion while retaining realistic content details using skip connections. In other words, skip connections mainly contribute to content generation, but may not be helpful for motion generation. So we remove a few skip connections from G2, as illustrated in Fig. 2. The discriminator D2 of Refine-Net is also a CNN with 3D convolutions and shares the same structure with D1 in Base-Net.

We adopt adversarial training to update G2 and D2. However, naively employing the vanilla adversarial loss can lead to an identity mapping, since the input Y1 is already an optimal result of a generator with such a structure, i.e., G1. If G2 learns an identity mapping, the output Y2 is not improved. To force the network to learn effective temporal transformations, we propose an adversarial ranking loss that drives the network to generate videos which are closer to real-world videos while further away from the input video (Y1 from Stage I). The ranking loss, denoted Lrank(Y1,Y,Y2) and detailed later, is defined with regard to the input Y1, the output Y2, and the ground-truth video Y. To construct such a ranking loss, we should take advantage of effective features that represent the dynamics across frames well. Based on such feature representations, distances between videos can be conveniently calculated.

We employ the Gram matrix [12] as the motion feature representation to assist G2 to learn dynamics across video frames. Given an input video, we first extract features of the video with discriminator D2. Then the Gram matrix is calculated across the frames using these features such that it incorporates rich temporal information.

Specifically, given an input video Y, suppose that the output of the l-th convolutional layer in D2 is $H^{l}_{Y} \in \mathbb{R}^{N \times C_l \times T_l \times H_l \times W_l}$, where $(N, C_l, T_l, H_l, W_l)$ are the batch size, the number of filters, the length of the time dimension, and the height and width of the feature maps, respectively. We reshape $H^{l}_{Y}$ to $\hat{H}^{l}_{Y} \in \mathbb{R}^{N \times M_l \times S_l}$, where $M_l = C_l \times T_l$ and $S_l = H_l \times W_l$. Then we calculate the Gram matrix $g(Y; l)$ of the l-th layer as follows:

$$ g(Y; l) = \frac{1}{M_l \times S_l} \sum_{n=1}^{N} \hat{H}^{l,n}_{Y} \left(\hat{H}^{l,n}_{Y}\right)^{\top}, \qquad (4) $$

where $\hat{H}^{l,n}_{Y}$ is the n-th sample of $\hat{H}^{l}_{Y}$. $g(Y; l)$ computes the (uncentered) covariance matrix between the intermediate features of discriminator D2. Since the calculation incorporates information from different time steps, it can encode motion information of the given video Y.
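Eq. (4) can be sketched in NumPy as follows. The function name `gram_matrix` is hypothetical, and the feature tensor is assumed to be a plain array of shape (N, C, T, H, W) already extracted from a discriminator layer.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of Eq. (4) for one layer.

    features: array of shape (N, C, T, H, W).
    Returns an (M, M) matrix with M = C * T.
    """
    n, c, t, h, w = features.shape
    m, s = c * t, h * w
    # Reshape to (N, M, S): channels and time are merged, as are height and width.
    h_hat = features.reshape(n, m, s)
    # Sum over samples n of h_hat[n] @ h_hat[n].T, then scale by 1 / (M * S).
    return np.einsum("nms,nks->mk", h_hat, h_hat) / (m * s)

feats = np.ones((2, 3, 4, 5, 5))
g = gram_matrix(feats)
```

Because the channel and time axes are merged into M before the outer product, entries of the Gram matrix correlate features at different time steps, which is what lets it carry motion information.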

The Gram matrix has been successfully applied to synthesizing dynamic textures in previous works [27, 28], but our work differs from them in several aspects. First, we use the Gram matrix for video prediction, while the prior works use it for dynamic texture synthesis. Second, we directly calculate the Gram matrix of videos based on the features of discriminator D2, which is updated in each iteration during training. In contrast, the prior works typically calculate it with a pre-trained VGG network [29], which is fixed during training. The motivation of such a different choice is that, as discriminator D2 is closely related to the measurement of motion quality, it is reasonable to directly use features in D2.

To make full use of the video representations, we adopt a variant of the contrastive loss introduced in [30] and [31] to compute the distance between videos. Our adversarial ranking loss with respect to features from the l-th layer is defined as:
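One contrastive form consistent with this description, using L1 distances between Gram matrices, is given below. This is a reconstruction; the exact distance and normalization are assumptions:

$$ \mathcal{L}_{rank}(Y_1, Y, Y_2; l) = -\log \frac{e^{-\lVert g(Y_2; l) - g(Y; l) \rVert_1}}{e^{-\lVert g(Y_2; l) - g(Y; l) \rVert_1} + e^{-\lVert g(Y_2; l) - g(Y_1; l) \rVert_1}}. \qquad (5) $$

The loss is small when the Gram matrix of Y2 is close to that of Y and far from that of Y1.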


We extract the features from multiple convolutional layers of the discriminator for the input Y1, output Y2 and the ground-truth video Y, and calculate their Gram matrices, respectively. Note that the features for each video are extracted from the same discriminator. The final adversarial ranking loss is:
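A minimal NumPy sketch of such a multi-layer ranking loss is given below. It assumes that the per-layer loss is the negative log of a softmax over negated L1 distances between Gram matrices (a contrastive form in the spirit of [30, 31]) and that the final loss is a plain sum over the selected layers. The function names and the list-of-matrices interface are hypothetical.

```python
import numpy as np

def rank_loss_layer(g_y1, g_y, g_y2):
    """Per-layer ranking loss from three Gram matrices (assumed form)."""
    d_real = np.abs(g_y2 - g_y).sum()    # distance from output to ground truth
    d_input = np.abs(g_y2 - g_y1).sum()  # distance from output to Stage I input
    # -log(e^{-d_real} / (e^{-d_real} + e^{-d_input})) = log(1 + e^{d_real - d_input})
    return np.logaddexp(0.0, d_real - d_input)

def adversarial_ranking_loss(grams_y1, grams_y, grams_y2):
    """Sum the per-layer losses over the selected discriminator layers."""
    return sum(
        rank_loss_layer(a, b, c)
        for a, b, c in zip(grams_y1, grams_y, grams_y2)
    )

# Two layers where the output is equally far from the ground truth and the input.
zeros, ones, half = np.zeros((2, 2)), np.ones((2, 2)), 0.5 * np.ones((2, 2))
total = adversarial_ranking_loss([zeros, zeros], [ones, ones], [half, half])
```

Using `np.logaddexp` keeps the computation numerically stable when the two distances differ by a large amount.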


Similar to the objective in Stage I, we also incorporate the pixel-wise L1 distance to capture low-level details. The overall objective for the Refine-Net is:
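Combining the three terms, with a weight λ on the adversarial ranking loss (the weight reported in the implementation details), the objective can be written as follows. The unweighted adversarial and content terms are an assumption of this reconstruction:

$$ \mathcal{L}_{stage2} = \mathcal{L}_{adv} + \lambda \cdot \mathcal{L}_{rank}(Y_1, Y, Y_2) + \mathcal{L}_{con}. \qquad (7) $$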


As shown in Algorithm 1, the generator and discriminator are trained alternately. When training generator G2 with discriminator D2 fixed, we try to minimize the adversarial ranking loss Lrank(Y1,Y,Y2), such that the distance between the generated Y2 and the ground truth Y is encouraged to be smaller, while the distance between Y2 and Y1 is encouraged to be larger. By doing so, the distribution of videos generated by the Refine-Net is forced to be similar to that of the real ones, and the quality of videos from Stage I can be improved.

When training discriminator D2 with generator G2 fixed, on the contrary, we maximize the adversarial ranking loss Lrank(Y1,Y,Y2). The insight is that if we update D2 by always expecting that the distance between Y2 and Y is not small enough, then the generator G2 is encouraged to produce Y2 closer to Y and further away from Y1 in the next iteration. By optimizing the ranking loss in such an adversarial manner, the Refine-Net is able to learn realistic dynamic patterns and generate vivid videos.



We build a relatively large-scale dataset of time-lapse videos from the Internet. We collect over 5,000 time-lapse videos from YouTube, manually cut them into short clips, and select those containing dynamic sky scenes, such as a cloudy sky with moving clouds or a starry sky with moving stars. Clips that are too dark or that contain quick zoom-in and zoom-out effects are discarded.

We split the set of selected video clips into a training set and a testing set. Note that all the video clips belonging to the same long video are in the same set to ensure that the testing video clips are disjoint from those in the training set. We then decompose the short video clips into frames and form clips of 32 consecutive frames each. There is no overlap between two consecutive clips. We collect 35,392 training video clips and 2,815 testing video clips, each containing 32 frames. The original size of each frame is 3×640×360, and we resize it into a square image of size 128×128. Before feeding the clips to the model, we normalize the color values to [−1,1]. No other preprocessing is required.
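The clip construction and normalization can be sketched as follows. The sketch assumes that the frames have already been decoded and resized to 128×128 and are stored as a uint8 array of shape (F, H, W, 3); the function name `frames_to_clips` and the output layout are illustrative choices, not details from the paper.

```python
import numpy as np

def frames_to_clips(frames, clip_len=32):
    """Group frames into non-overlapping clips and scale colors to [-1, 1].

    frames: uint8 array of shape (F, H, W, 3).
    Returns a float32 array of shape (K, 3, clip_len, H, W), K = F // clip_len.
    """
    n_clips = len(frames) // clip_len
    # Drop trailing frames that do not fill a whole clip.
    frames = frames[: n_clips * clip_len]
    clips = frames.reshape(n_clips, clip_len, *frames.shape[1:])
    # Map [0, 255] to [-1, 1].
    clips = clips.astype(np.float32) / 127.5 - 1.0
    # Move channels first: (K, T, H, W, 3) -> (K, 3, T, H, W).
    return clips.transpose(0, 4, 1, 2, 3)

video = np.random.randint(0, 256, size=(70, 128, 128, 3), dtype=np.uint8)
clips = frames_to_clips(video)
```

With 70 input frames this yields two clips and discards the last six frames, matching the non-overlapping scheme described above.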



TABLE I: The architecture of the generators in both stages. The size of the input video is 3×32×128×128.

Our dataset contains videos with both complex contents and diverse motion patterns. There are various types of scenes in the dataset, including daytime, nightfall, dawn, starry night and aurora. They exhibit different kinds of foreground (the sky) and colors. Unlike some previous time-lapse video datasets, e.g., [5], which contain relatively clean backgrounds, the backgrounds in our dataset show high diversity across videos. The scenes may contain trees, mountains, buildings and other static objects. It is also challenging to learn the diverse dynamic patterns within each type of scene. The clouds in the blue sky may be of any arbitrary shape and move in any direction. In the starry night scene, the stars usually move fast along a curve in the dark sky.

Our dataset can be used for various tasks on learning dynamic patterns, including unconditional video generation [4], video prediction [8], video classification [32], and dynamic texture synthesis [27]. In this paper, we use it for video prediction. The samples of our dataset are displayed in the supplementary materials.


The Base-Net takes a 3×128×128 start image and generates 32 image frames of resolution 128×128, i.e., T=32. The Refine-Net takes the result video of the Base-Net as input, and generates a more realistic video with 128×128 resolution. The models in both stages are optimized with stochastic gradient descent. We use Adam as the optimizer with β=0.5 and the momentum is set to 0.9. The learning rate is 0.0002 and fixed throughout the training procedure.

We use Batch Normalization [33] followed by Leaky ReLU [34] in all the 3D convolutional layers in both the encoder part of generators and discriminators, except for their first and last layers. For the deconvolutional layers, we use ReLU [35] instead of Leaky ReLU. We use Tanh as the activation function of the output layer of the generators. The Gram matrices are calculated using the features of the first and third convolutional layers (after the ReLU layer) of discriminator D2. The weight of the adversarial ranking loss is set to 1 in all experiments, i.e., λ=1. The detailed configurations of G1 are given in Table I. In G2, we remove the skip connections between “conv1” and “deconv6”, “conv2” and “deconv5”. We use the identity mapping as the skip connection [11].

“Which is more realistic?”      POS
Random Selection                50
Prefers Ours over VGAN          92
Prefers Ours over RNN-GAN       97
Prefers VGAN over Real          5
Prefers RNN-GAN over Real       1
Prefers Ours over Real          16

TABLE II: Quantitative comparison results of different models. We show pairs of videos to a few workers, and ask them “which is more realistic”. We count their evaluation results, which are denoted as Preference Opinion Score (POS). The value range of POS could be [0,100]. If the value is greater than 50 then it means the former performs better than the latter.
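As a minimal illustration of how a POS value could be computed from pairwise judgments, the sketch below assumes that POS is simply the percentage of comparisons in which the former video is preferred. The function name `preference_opinion_score` and the boolean vote encoding are hypothetical.

```python
def preference_opinion_score(votes):
    """Percentage of comparisons in which the former model is preferred.

    votes: iterable of booleans, True when the worker prefers the former video.
    """
    votes = list(votes)
    if not votes:
        raise ValueError("at least one vote is required")
    return 100.0 * sum(votes) / len(votes)

# 100 comparisons in which the former video wins 92 times.
pos = preference_opinion_score([True] * 92 + [False] * 8)
```

Under this reading, a score of 50 corresponds to random selection between the two videos.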


We perform a quantitative comparison between our model and the models presented in [4] and [5]. For notational convenience, we name these two models VGAN [4] and RNN-GAN [5], respectively. For a fair comparison, we reproduce the results of their models exactly according to their papers and reference code, except for some adaptations to match our experimental setting. In particular, all the methods produce 32 frames as the output. Note that both VGAN and RNN-GAN generate videos of resolution 64×64, so we resize the videos produced by our model to resolution 64×64 for fairness.

Fig. 1 shows exemplar results of each method. The video frames generated by VGAN (the first row) and RNN-GAN (the second row) tend to be blurry, while our Base-Net (the third row) and Refine-Net (the fourth row) produce samples that are much more realistic, indicating that skip connections and the 3D U-net architecture greatly benefit content generation.

To compare the models more directly on both content and motion generation, we compare them in pairs. For each pair of models, we randomly select 100 clips from the testing set and take their first frames as input. Each of the two models then produces a future prediction as a video of 32 frames. Based on these outputs, we conduct 100 opinion tests with professional workers. In each test, we show a worker the two videos generated by the two models from the same input frame, and the worker is asked which one is more realistic. The two videos are shown in a random order to avoid the potential issue that a worker always prefers the video on the left (or right) out of laziness. Five groups of comparisons are conducted in total. Apart from comparing ours against VGAN and against RNN-GAN, we also compare ours, VGAN, and RNN-GAN against real videos to evaluate the performance of these models.
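The protocol above can be sketched in a few lines of Python. This is an illustration only: `judge` is a hypothetical callback standing in for a human worker, and the function names are not part of any released code. The sketch shows the two ingredients described in the text, namely randomizing which video appears on the left and reporting the percentage of trials in which the former model is preferred.

```python
import random


def present_pair(video_a, video_b, rng):
    """Return (left, right, a_is_left) with the side of video_a chosen at random."""
    a_is_left = rng.random() < 0.5
    if a_is_left:
        return video_a, video_b, True
    return video_b, video_a, False


def preference_opinion_score(votes_for_former, num_trials):
    """POS: percentage of trials in which the former model was preferred."""
    if num_trials <= 0:
        raise ValueError("num_trials must be positive")
    return 100.0 * votes_for_former / num_trials


def run_comparison(pairs, judge, seed=0):
    """Tally a POS over a list of (former, latter) video pairs.

    `judge(left, right)` is a hypothetical stand-in for a human worker
    and must return either "left" or "right".
    """
    rng = random.Random(seed)
    votes = 0
    for former, latter in pairs:
        left, right, former_is_left = present_pair(former, latter, rng)
        choice = judge(left, right)
        # The former model wins when the chosen side is the side it was shown on.
        if (choice == "left") == former_is_left:
            votes += 1
    return preference_opinion_score(votes, len(pairs))
```

With 100 trials per comparison, a POS of 50 corresponds to random selection, which is why 50 serves as the reference row in Table II.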

Table II shows the quantitative comparison results. Our model outperforms VGAN [4] with regard to the Preference Opinion Score (POS). Qualitatively, videos generated by VGAN are usually not as sharp as ours. We suspect that the following reasons contribute to the superiority of our model. First, we adopt a U-net-like structure instead of the vanilla encoder-decoder structure used in VGAN. The connections between the encoder and the decoder bring more powerful representations, thus producing more concrete contents. Second, the Refine-Net makes further efforts to learn more vivid dynamic patterns. Our model also performs better than RNN-GAN [5]. One reason might be that RNN-GAN uses an RNN to generate image frames sequentially, so its results are prone to error accumulation. Our model employs 3D convolutions instead of an RNN, so the state of the next frame does not heavily depend on the state of previous frames.

When comparing ours, VGAN and RNN-GAN with real videos, our model consistently achieves better POS than both VGAN and RNN-GAN, showing the superiority of our multi-stage model. Some results of our model are as decent as the real ones, or even perceived as more realistic than the real ones, suggesting that our model is able to generate realistic future scenes.

Fig. 3: The generated video frames by Stage I (left) and Stage II (right) given the same first frame. We show exemplar frames 1, 8, 16, 24, and 32. Red circles are used to indicate the locations and areas where obvious movements take place between adjacent frames. Larger and more circles are observed in the frames of Stage II, indicating that there are more vivid motions generated by the Refine-Net.


Although the Base-Net can generate videos of decent details and plausible motion, it fails to generate vivid dynamics. For instance, some of the results in the scene of cloudy daytime fail to exhibit apparent cloud movement. The Refine-Net makes attempts to compensate for the motion based on the result of Base-Net, while preserving the concrete content details. In this part, we evaluate the performance of Stage II versus Stage I in terms of both quantitative and qualitative results.

| “Which is more realistic?” | POS |
|---|---|
| Random Selection | 50 |
| Prefers Stage II to Stage I | 70 |
| Prefers Stage II to Real | 16 |
| Prefers Stage I to Real | 8 |

TABLE III: Quantitative comparison results of Stage I versus Stage II. The evaluation metric is the same as that in Table II.

Quantitative Results. Given an identical starting frame as input, we generate two videos, one by the Base-Net in Stage I and one by the Refine-Net in Stage II. The comparison is carried out over 100 pairs of generated videos in the same way as in the previous section. For each pair of videos, we ask the workers which one is more realistic. To check how effective our model is, we also compare the results of the Base-Net and the Refine-Net with the ground-truth videos. The results in Table III reveal that the Refine-Net contributes significantly to the realism of the generated videos. When comparing the Refine-Net with the Base-Net, the advantage is 40 points (70 versus 30) in terms of POS. Not surprisingly, the Refine-Net also obtains a better POS than the Base-Net when videos of these two models are compared with the ground-truth videos (16 versus 8).

Qualitative Results. As shown in Fig. 1, although our Refine-Net mainly focuses on improving the motion quality, it still preserves fine content details, which are visually almost as realistic as the frames produced by the Base-Net. In addition to the content comparison, we further compare the motion dynamics of the videos generated by the two stages. Fig. 3 shows four video clips generated by the Base-Net and the Refine-Net from the same starting frames. Motions are indicated by red circles in the frames. Please note the differences between the next and previous frames. We also encourage readers to check more qualitative results in our supplementary materials. The results in Fig. 3 indicate that although the Base-Net can generate concrete object details, the content of the next frames seems to have no significant difference from the previous frames. While it does capture the motion patterns to some degree, such as color changes or some inconspicuous object movements, the Base-Net fails to generate vivid dynamic scene sequences. In contrast, the Refine-Net takes the result of the Base-Net and produces more realistic motion dynamics learned from the dataset. As a result, the scene sequences show more evident movements across adjacent frames.


Although our model is designed for time-lapse video generation, it is a general method for video prediction. To evaluate our approach thoroughly, we compare our model with both VGAN and RNN-GAN on the Beach dataset released by [4], which does not contain any time-lapse video. We use only 10% of this dataset as training data and the rest as testing data to test our model (both Stage I and Stage II), VGAN, and RNN-GAN. For a fair comparison, all these models take a 64×64 image as input. To this end, we adjust our model to take 64×64 resolution images and videos by omitting the first convolutional layer, which originally takes 128×128 resolution images or videos as input. The remaining parts of our model are unchanged. For each approach, we calculate the Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR), and Structural Similarity Index (SSIM) between 1000 randomly sampled pairs of generated videos and the corresponding ground-truth videos. The results in Table IV demonstrate the superiority of our MD-GAN model.
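For reference, MSE and PSNR can be computed as in the minimal sketch below, written in plain Python. It assumes each video has been flattened into a list of pixel values and that `peak` is the maximum possible pixel value; the value range used for the reported numbers is not stated here, so `peak=1.0` is an assumption and this sketch is not claimed to reproduce Table IV. SSIM is omitted because it requires windowed local statistics.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(pred) != len(target):
        raise ValueError("inputs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the maximum pixel value."""
    error = mse(pred, target)
    if error == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak ** 2 / error)


def mean_metrics(pairs, peak=1.0):
    """Average MSE and PSNR over (generated, ground_truth) pairs."""
    mses = [mse(g, t) for g, t in pairs]
    psnrs = [psnr(g, t, peak) for g, t in pairs]
    return sum(mses) / len(mses), sum(psnrs) / len(psnrs)
```

Note that averaging per-pair PSNR values, as `mean_metrics` does, generally gives a different number than computing PSNR from the averaged MSE, since the logarithm is nonlinear.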

| Method | MSE ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|
| VGAN [4] | 0.0958 | 11.5586 | 0.6035 |
| RNN-GAN [5] | 0.1849 | 7.7988 | 0.5143 |
| MD-GAN Stage I (Ours) | 0.0530 | 14.8982 | 0.7624 |
| MD-GAN Stage II (Ours) | **0.0422** | **16.1951** | **0.8019** |

TABLE IV: Experiment results on Beach dataset in terms of MSE, PSNR and SSIM (arrows indicating direction of better performance). The best performance values are shown in bold.


We propose the MD-GAN model which can generate realistic time-lapse videos of resolution as high as 128×128 in a coarse-to-fine manner. In the first stage, our model generates sharp content details and rough motion dynamics by Base-Net with a 3D U-net like network as the generator. In the second stage, Refine-Net improves the motion quality with an adversarial ranking loss which incorporates the Gram matrix to effectively model the motion patterns. Experiments show that our model outperforms the state-of-the-art models and can generate videos which are visually as realistic as the real-world videos in many cases.