Time-Lapse Video Generation Based on Multi-Stage Dynamic GANs


Download PDF: https://arxiv.org/pdf/1709.07592v3.pdf

Download code: https://github.com/weixiong-ur/mdgan

Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks

AUTHORS

Wei Xiong¹, Wenhan Luo², Lin Ma², Wei Liu², and Jiebo Luo¹
¹ Department of Computer Science, University of Rochester, Rochester, NY 14623
² Tencent AI Lab, Shenzhen, China

ABSTRACT

Taking a photo outside, can we predict the immediate future, e.g., how would the cloud move in the sky? We address this problem by presenting a generative adversarial network (GAN) based two-stage approach to generating realistic time-lapse videos of high resolution. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos of realistic contents for each frame. The second stage refines the generated video from the first stage by enforcing it to be closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, Gram matrix is employed to model the motion more precisely. We build a large scale time-lapse dataset, and test our approach on this new dataset. Using our model, we are able to generate realistic videos of up to 128×128 resolution for 32 frames. Quantitative and qualitative experiment results have demonstrated the superiority of our model over the state-of-the-art models.


I INTRODUCTION

Humans can often estimate fairly well what will happen in the immediate future given the current scene. However, for vision systems, predicting the future states is still a challenging task. The problem of future prediction or video synthesis has drawn more and more attention in recent years since it is critical for various kinds of vision applications, such as automatic driving [1], video understanding [2], and robotics [3]. The goal of video prediction in this paper is to generate realistic, long-term, and high-quality future frames given one starting frame. Achieving such a goal is difficult, as it is challenging to model the multi-modality and uncertainty in generating both the content and motion in future frames.


Fig. 1: From top to bottom: example frames of generated videos by VGAN [4], RNN-GAN [5], the first stage of our model, and the second stage of our model, respectively. The contents generated by our model (the third and fourth rows) are visually more realistic. The left column is the input frame.

In terms of content generation, the main problem is to define what to learn. Generating future on the basis of only one static image encounters inherent uncertainty of the future, which has been illustrated in [6]. Since there can be multiple possibilities for reasonable future scenes following the first frame, the objective function is difficult to define. Generating future frames by simply learning to reconstruct the real video can lead to unrealistic results [4, 7]. Several models including [8] and [4] are proposed to address this problem based on generative adversarial networks [9]. For example, 3D convolution is incorporated in an adversarial network to model the transformation from an image to video in [4]. Their model produces plausible futures given the first frame. However, the generated video tends to be blurry and lose content details, which degrades the reality of generated videos. A possible cause is that the vanilla encoder-decoder structure in the generator fails to preserve all the indispensable details of the content.

Regarding motion transformation, the main challenge is to drive the given frame to transform realistically over time. Some prior work has investigated this problem. Zhou and Berg [5] use an RNN to model the temporal transformations. Their model can generate a few types of motion patterns, but the results are not realistic enough. The reason may be that each future frame is based on the state of the previous frames, so the error accumulates and the motion becomes distorted over time. The information loss and error accumulation during sequence generation hinder the success of future prediction.


The performance of the prior models indicates that it is nontrivial to generate videos with both realistic contents in each frame and vivid motion dynamics across frames with a single model at the same time. One reason may be that the representation capacity of a single model is limited in satisfying two objectives that may contradict each other. To this end, we divide the modeling of video generation into content and motion modeling, and propose a Multi-stage Dynamic Generative Adversarial Network (MD-GAN) model to produce realistic future videos. There are two stages in our approach. The first stage aims at generating future frames with content details as realistic as possible given an input frame. The second stage specifically deals with motion modeling, i.e., to make the movement of objects between adjacent frames more vivid, while keeping the content realistic.

To be more specific, we develop a generative adversarial network called Base-Net to generate contents in the first stage. Both the generator and the discriminator are composed of 3D convolutions and deconvolutions to model temporal and spatial patterns. The adversarial loss of this stage encourages the generator to produce videos of similar distributions to real ones. In order to preserve more content details, we use a 3D U-net [10] like architecture in the generator instead of the vanilla encoder-decoder structure. Skip connections [11] are used to link the corresponding feature maps in the encoder and decoder so that the decoder can reuse features in the encoder, thus reducing the information loss. In this way, the model can generate better content details in each future frame, which are visually more pleasing than those produced by the vanilla encoder-decoder architecture such as the model in [4].

The Base-Net can generate frames with concrete details, but may not be capable of modeling the motion transformations across frames. To generate future frames with vivid motion, the second stage MD-GAN takes the output of the first stage as input, and refines the temporal transformation with another generative adversarial network while preserving the realistic content details, which we call Refine-Net. We propose an adversarial ranking loss to train this network so as to encourage the generated video to be closer to the real one while being further away from the input video (from stage I) regarding motion. To this end, we introduce the Gram matrix [12] to model the dynamic transformations among consecutive frames. We present a few example frames generated by the conventional methods and our method in Fig. 1. The image frames generated by our model are sharper than the state-of-the-art and are visually almost as realistic as the real ones.

We build a large scale time-lapse video dataset called Sky Scene to evaluate the models for future prediction. Our dataset includes daytime, nightfall, starry sky, and aurora scenes. MD-GAN is trained on this dataset and predicts future frames given a static image of sky scene. We are able to produce 128×128 realistic videos, whose resolution is much higher than that of the state-of-the-art models. Unlike some prior work which generates merely one frame at a time, our model generates 32 future frames by a single pass, further preventing error accumulation and information loss.

Our key contributions are as follows:

  1. We build a large scale time-lapse video dataset, which contains high-resolution dynamic videos of sky scenes.
  2. We propose a Multi-Stage Dynamic Generative Adversarial Network (MD-GAN), which can effectively capture the spatial and temporal transformations, thus generating realistic time-lapse future frames up to 128×128 resolution given only one starting frame.
  3. We introduce the Gram matrix for motion modeling and propose an adversarial ranking loss to mimic motions of real-world videos, which refines motion dynamics of preliminary outputs in the first stage and forces the model to produce more realistic and higher-quality future frames.

II-A GENERATIVE ADVERSARIAL NETWORKS

A generative adversarial network [9] is composed of a generator and a discriminator. The generator tries to fool the discriminator by producing samples similar to real ones, while the discriminator is trained to distinguish the generated samples from the real ones. They are trained in an adversarial way and finally a “Nash Equilibrium” is achieved. GAN has been successfully applied to image generation. In the seminal paper [9], models trained on the MNIST dataset and the Toronto Face Database (TFD), respectively, generate images of digits and faces with high likelihood. Relying only on random noise, the original GAN cannot control the mode of the generated samples, thus conditional GAN [13] is proposed. Images of digits conditioned on class labels and captions conditioned on image features are generated. Many subsequent works are variants of conditional GAN, including image to image translation [14, 15], text to image translation [16] and super-resolution [17]. Our model is also a GAN conditioned on a starting image to generate a video.

Inspired by the coarse-to-fine strategy, multi-stack methods such as StackGAN [18], LAPGAN [19] have been proposed to first generate coarse images and then refine them to finer images. Our model also employs this strategy to stack GANs in two stages. However, instead of refining the pixel-level details in each frame, the second stage focuses on improving motion dynamics across frames generated by the first stage.

Fig. 2: The overall architecture of MD-GAN model. The input image is first duplicated to 32 frames as input to generator G1 of the Base-Net, which produces a video Y1. The discriminator D1 then distinguishes the real video Y from Y1. Following the Base-Net, the Refine-Net takes the result video of G1 as the input and generates a more realistic video Y2. The discriminator D2 is updated with an adversarial ranking loss to push Y2 (the result of Refine-Net) closer to real videos.

II-B VIDEO GENERATION

Video generation has been popular recently. Based on conditional VAE [20], Xue et al. [21] propose a cross convolutional network to model layered motion, which applies learned kernels to image features encoded in a multi-scale image encoder. The output difference image is added to the current frame to produce the next frame. Many approaches employ GAN for future prediction. In [4], a two-stream CNN, one for foreground and the other one for background, is proposed for video generation. Combining the dynamic foreground stream and the static background stream, the generated video looks real. In the follow-up work [6], Vondrick and Torralba formulate the future prediction task as transforming pixels in the past to future. Based on large scale unlabeled video data, a CNN model is trained with adversarial learning. Content and motion are decomposed and encoded separately by multi-scale residual blocks, and then combined and decoded to generate plausible videos on both the KTH and the Weizmann datasets [22]. A similar idea is presented in [23]. To generate long-term future frames, Villegas et al. [8] estimate high-level structure (human body pose), and learn LSTM and analogy-based encoder-decoder CNN to generate future frames based on the current image and the estimated high-level structure.

The closest work to ours is [5], which also generates time-lapse videos. However, there are important differences between their work and ours. First, our method is based on 3D convolution while a recurrent neural network is employed in [5] to recursively generate future frames, which is prone to error accumulation. Second, as modeling motion is indispensable for video generation, we explicitly model motion by introducing the Gram matrix. Finally, we generate high-resolution (128×128) videos of dynamic scenes, while the generated videos in [5] are simple (usually with clean background) and of resolution 64×64.

III OUR APPROACH

III-A OVERVIEW

The proposed MD-GAN takes a single RGB image as input and attempts to predict future frames that are as realistic as possible. This task is accomplished in two stages in a coarse-to-fine manner: 1) Content generation by the Base-Net in Stage I. Given an input image x, the model generates a video Y1 of T frames (including the starting frame, i.e., the input image). The Base-Net ensures that each produced frame in Y1 looks like a real natural image. Besides, Y1 also serves as a coarse estimation of the ground truth Y regarding motion. 2) Motion generation by the Refine-Net in Stage II. The Refine-Net refines Y1 with vivid motion dynamics and produces a more vivid video Y2 as the final prediction. The discriminator D2 of the Refine-Net takes three inputs: the resulting video Y1 of the Base-Net, the fake video Y2 produced by the generator of the Refine-Net, and the ground truth video Y. We define an adversarial ranking loss to encourage the final video Y2 to be closer to the real video and further away from the Base-Net result Y1. Note that in each stage, we follow the setting in Pix2Pix [14] and do not incorporate any random noise. The overall architecture of our model is shown in Fig. 2.


III-B STAGE I: BASE-NET

As shown in Fig. 2, the Base-Net is a generative adversarial network composed of a generator $G_1$ and a discriminator $D_1$. Given an image $x \in \mathbb{R}^{3 \times H \times W}$ as the starting frame, we duplicate it $T$ times, obtaining a static video $X \in \mathbb{R}^{3 \times T \times H \times W}$. (Footnote 1: In the generator, we could also use a 2D CNN to encode the image, but we duplicate the input image into a video to better fit the 3D U-net like architecture of $G_1$.) By forwarding $X$ through layers of 3D convolutions and 3D deconvolutions, the generator $G_1$ outputs a video $Y_1 \in \mathbb{R}^{3 \times T \times H \times W}$ of $T$ frames, i.e., $Y_1 = G_1(X)$.
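
The duplication step can be illustrated with a minimal, dependency-free sketch. The nested-list layout and the helper name `duplicate_frame` are assumptions made for illustration only; the authors' code presumably operates on framework tensors.

```python
from typing import List

Frame = List[List[List[float]]]        # indexed [channel][row][col], i.e. 3 x H x W
Video = List[List[List[List[float]]]]  # indexed [channel][time][row][col], i.e. 3 x T x H x W


def duplicate_frame(x: Frame, T: int) -> Video:
    """Repeat one frame T times along a new time axis (hypothetical helper)."""
    # Each channel gets T independent copies of its H x W plane.
    return [[[row[:] for row in channel] for _ in range(T)] for channel in x]


# Toy 3 x 2 x 2 frame standing in for a 3 x 128 x 128 image.
x = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(3)]
X = duplicate_frame(x, T=32)
shape = (len(X), len(X[0]), len(X[0][0]), len(X[0][0][0]))
print(shape)
```

Every time step of `X` holds the same picture, so all motion in the output has to be produced by the generator itself.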

For generator G1, we adopt an encoder-decoder architecture, which is also employed in [24] and [4]. However, the vanilla encoder-decoder architecture encounters problems in generating decent results as the features from the encoder may not be fully exploited. Therefore, we utilize a 3D U-net like architecture [10] instead so that features in the encoder can be fully made use of to generate Y1. The U-net architecture is implemented by introducing skip connections between the feature maps of the encoder and the decoder, as shown in Fig. 2. The skip connections build information highways between the features in the bottom and top layers, so that features can be reused. In this way, the generated video is more likely to contain rich content details. This may seem like a small trick, yet it plays a key role in improving the quality of videos.

The discriminator D1 then takes video Y1 and the real video Y as input and tries to distinguish them. x is the first frame of Y. D1 shares the same architecture as the encoder part of G1, except that the final layer is a single node with a sigmoid activation function.

To train our GAN-based model, the adversarial loss of the Base-Net is defined as:
image.png
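
The formula itself is not reproduced in the text here. As a hedged reconstruction, the standard GAN objective of [9], applied to the generator G1, the discriminator D1, the static input video X and the real video Y defined above, would read as follows. The symbol name for the loss is an assumption.

```latex
\mathcal{L}_{adv} = \min_{G_1} \max_{D_1} \;
  \mathbb{E}\left[\log D_1(Y)\right] +
  \mathbb{E}\left[\log\left(1 - D_1\left(G_1(X)\right)\right)\right]
```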

Prior work based on conditional GAN discovers that combining the adversarial loss with the L1 or L2 loss [14] in the pixel space will benefit the performance. So we define a content loss function as a complement to the adversarial loss, to further ensure that the content of the generated video follows similar patterns to the content of real-world videos. As pointed out in [14], L1 distance usually results in sharper outputs than those of L2 distance. Recently, instead of measuring the similarity of images in pixel space, perceptual loss [25] is introduced in some GAN-based approaches to model the distance between high-level feature representations. These features are extracted from a well-trained CNN model and previous experiments suggest they capture semantics of contents [17]. Although the perceptual loss performs well in combination with GANs [17, 26] on some tasks, it typically requires features to be extracted from a pretrained deep neural network, which is both time and space consuming. In addition, we observe in experiments that directly combining the adversarial loss and the L1 loss that minimizes the Minkowski distance between the generated video and the ground truth video in the pixel space leads to satisfactory performance. Thus, we define our content loss as

image.png
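
As a rough illustration of the pixel-space content loss described above, the sketch below compares a mean L1 distance with a mean L2 distance between a generated and a real video. The helper names and the averaging over elements are assumptions; the paper's loss may be an unnormalised sum.

```python
def flatten(v):
    """Flatten an arbitrarily nested list of numbers into a flat list of floats."""
    if isinstance(v, (int, float)):
        return [float(v)]
    out = []
    for item in v:
        out.extend(flatten(item))
    return out


def l1_content_loss(generated, real):
    """Mean absolute pixel difference (assumed normalisation)."""
    g, r = flatten(generated), flatten(real)
    return sum(abs(a - b) for a, b in zip(g, r)) / len(g)


def l2_content_loss(generated, real):
    """Mean squared pixel difference, shown only for comparison."""
    g, r = flatten(generated), flatten(real)
    return sum((a - b) ** 2 for a, b in zip(g, r)) / len(g)


generated = [[0, 1], [2, 3]]
real = [[0, 0], [0, 0]]
print(l1_content_loss(generated, real))  # 1.5
print(l2_content_loss(generated, real))  # 3.5
```

The squared penalty of L2 is dominated by large errors, which is one common explanation for why L1 tends to give sharper outputs, as noted in [14].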

The final objective of our Base-Net in Stage I is
image.png

The adversarial training allows the Base-Net to produce videos with realistic content details. However, as the learning capacity of GAN is limited considering the uncertainty of the future, one single GAN model may not be able to capture the correct motion patterns in the real data. As a consequence, the motion dynamics of the generated videos may not be realistic enough. To tackle this problem, we further process the output of Stage I by another GAN model called Refine-Net in Stage II, to compensate it for vivid motion dynamics, and generate more realistic videos.

III-C STAGE II: REFINE-NET

Taking the video Y1 from Stage I as input, the Refine-Net improves the motion quality of the generated video Y2, so that human observers find it hard to tell Y2 apart from the ground truth video Y.

The generator G2 of Refine-Net is similar to G1 in Base-Net. When training the model, we find it difficult to generate vivid motion while retaining realistic content details using skip connections. In other words, skip connections mainly contribute to content generation, but may not be helpful for motion generation. So we remove a few skip connections from G2, as illustrated in Fig. 2. The discriminator D2 of Refine-Net is also a CNN with 3D convolutions and shares the same structure with D1 in Base-Net.

We adopt the adversarial training to update G2 and D2. However, naively employing the vanilla adversarial loss can lead to an identity mapping since the input Y1 is an optimal result of such a structure, i.e. G1. As long as G2 learns an identity mapping, the output Y2 would not be improved. To force the network to learn effective temporal transformations, we propose an adversarial ranking loss to drive the network to generate videos which are closer to real-world videos while further away from the input video (Y1 from Stage I). The ranking loss is defined as Lrank(Y1,Y,Y2), which will be detailed later, with regard to the input Y1, output Y2 and the ground-truth video Y. To construct such a ranking loss, we should take the advantage of effective features that can well represent the dynamics across frames. Based on such feature representations, distances between videos can be conveniently calculated.

We employ the Gram matrix [12] as the motion feature representation to assist G2 to learn dynamics across video frames. Given an input video, we first extract features of the video with discriminator D2. Then the Gram matrix is calculated across the frames using these features such that it incorporates rich temporal information.

Specifically, given an input video $Y$, suppose that the output of the $l$-th convolutional layer in $D_2$ is $H_Y^l \in \mathbb{R}^{N \times C_l \times T_l \times H_l \times W_l}$, where $(N, C_l, T_l, H_l, W_l)$ are the batch size, the number of filters, the length of the time dimension, and the height and width of the feature maps, respectively. We reshape $H_Y^l$ to $\hat{H}_Y^l \in \mathbb{R}^{N \times M_l \times S_l}$, where $M_l = C_l \times T_l$ and $S_l = H_l \times W_l$. Then we calculate the Gram matrix $g(Y; l)$ of the $l$-th layer as follows:

$$g(Y; l) = \frac{1}{M_l \times S_l} \sum_{n=1}^{N} \hat{H}_Y^{l,n} \left(\hat{H}_Y^{l,n}\right)^{T}, \tag{4}$$

where $\hat{H}_Y^{l,n}$ is the $n$-th sample of $\hat{H}_Y^l$. $g(Y; l)$ calculates the covariance matrix between the intermediate features of discriminator $D_2$. Since the calculation incorporates information from different time steps, it can encode motion information of the given video $Y$.
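
To make Eq. (4) concrete, here is a small pure-Python sketch of the reshape and the Gram matrix computation on a toy feature map. The function names and the nested-list layout are assumptions for illustration; a real implementation would use batched tensor operations.

```python
def reshape_features(feat):
    """Reshape one sample feat[c][t][h][w] into an (C*T) x (H*W) matrix."""
    # Rows are indexed by (channel, time), columns by (height, width).
    return [[v for row in frame for v in row] for channel in feat for frame in channel]


def gram_matrix(h_hat):
    """Eq. (4): g = 1 / (M * S) * sum_n H_n H_n^T for a batch of N (M x S) matrices."""
    m, s = len(h_hat[0]), len(h_hat[0][0])
    g = [[0.0] * m for _ in range(m)]
    for sample in h_hat:
        for i in range(m):
            for j in range(m):
                # Inner product between feature rows i and j of this sample.
                g[i][j] += sum(a * b for a, b in zip(sample[i], sample[j]))
    scale = 1.0 / (m * s)
    return [[v * scale for v in row] for row in g]


# Toy batch with N=1, C=1, T=2, H=1, W=2.
features = [[[[[1.0, 2.0]], [[3.0, 4.0]]]]]
h_hat = [reshape_features(sample) for sample in features]
g = gram_matrix(h_hat)
print(g)  # [[1.25, 2.75], [2.75, 6.25]]
```

Because each row of the reshaped matrix belongs to a specific (channel, time step) pair, the off-diagonal entries correlate features taken at different time steps, which is how temporal information enters the representation.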

The Gram matrix has been successfully applied to synthesizing dynamic textures in previous works [27, 28], but our work differs from them in several aspects. First, we use the Gram matrix for video prediction, while the prior works use it for dynamic texture synthesis. Second, we directly calculate the Gram matrix of videos based on the features of discriminator D2, which is updated in each iteration during training. In contrast, the prior works typically calculate it with a pre-trained VGG network [29], which is fixed during training. The motivation of such a different choice is that, as discriminator D2 is closely related to the measurement of motion quality, it is reasonable to directly use features in D2.
image.png

To make full use of the video representations, we adopt a variant of the contrastive loss introduced in [30] and [31] to compute the distance between videos. Our adversarial ranking loss with respect to features from the l-th layer is defined as:

image.png
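
The formula is not reproduced in the text here, so the following is only a sketch under an explicit assumption: the ranking loss is taken to be a softmax-style contrastive term over L1 distances between Gram matrices, with the real video Y as the positive and the Stage I video Y1 as the negative. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def l1_distance(a: Matrix, b: Matrix) -> float:
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def ranking_loss_layer(g_y1: Matrix, g_y: Matrix, g_y2: Matrix) -> float:
    """Assumed form: -log( e^{-d(Y2,Y)} / (e^{-d(Y2,Y)} + e^{-d(Y2,Y1)}) )."""
    d_real = l1_distance(g_y2, g_y)    # should become small
    d_base = l1_distance(g_y2, g_y1)   # should become large
    z = d_real - d_base
    # The expression above equals log(1 + e^z); computed in a numerically stable way.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ranking_loss(grams_y1, grams_y, grams_y2) -> float:
    """Sum the per-layer terms over the selected discriminator layers."""
    return sum(ranking_loss_layer(a, b, c) for a, b, c in zip(grams_y1, grams_y, grams_y2))


g_real = [[1.0, 0.0], [0.0, 1.0]]
g_stage1 = [[0.0, 0.0], [0.0, 0.0]]
g_halfway = [[0.5, 0.0], [0.0, 0.5]]

loss_good = ranking_loss_layer(g_stage1, g_real, g_real)    # Y2 matches the real video
loss_bad = ranking_loss_layer(g_stage1, g_real, g_stage1)   # Y2 is a copy of Y1
loss_mid = ranking_loss_layer(g_stage1, g_real, g_halfway)  # equidistant from both
print(loss_good < loss_mid < loss_bad)  # True
```

Under this form an identity mapping (Y2 equal to Y1) is penalised heavily, which matches the stated motivation for the ranking loss.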

We extract the features from multiple convolutional layers of the discriminator for the input Y1, the output Y2 and the ground-truth video Y, and calculate their Gram matrices, respectively. Note that the features for each video are extracted from the same discriminator. The final adversarial ranking loss is:

image.png

Similar to the objective in Stage I, we also incorporate the pixel-wise L1 distance to capture low-level details. The overall objective for the Refine-Net is:

image.png
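
The formula is not reproduced in the text here. Based on the surrounding description (an adversarial term, the ranking loss weighted by λ, and the pixel-wise L1 content term), one plausible form of the Stage II objective is sketched below. The exact symbols and the placement of the weight are assumptions.

```latex
\mathcal{L}_{stage2} = \mathcal{L}_{adv}
  + \lambda \cdot \mathcal{L}_{rank}(Y_1, Y, Y_2)
  + \mathcal{L}_{con}
```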

As shown in Algorithm 1, the generator and discriminator are trained alternately. When training generator G2 with discriminator D2 fixed, we try to minimize the adversarial ranking loss Lrank(Y1,Y,Y2), such that the distance between the generated Y2 and the ground truth Y is encouraged to be smaller, while the distance between Y2 and Y1 is encouraged to be larger. By doing so, the distribution of videos generated by the Refine-Net is forced to be similar to that of the real ones, and the quality of videos from Stage I can be improved.

When training discriminator D2 with generator G2 fixed, on the contrary, we maximize the adversarial ranking loss Lrank(Y1,Y,Y2). The insight is that if we update D2 by always expecting that the distance between Y2 and Y is not small enough, then the generator G2 is encouraged to produce a Y2 that is closer to Y and further away from Y1 in the next iteration. By optimizing the ranking loss in such an adversarial manner, the Refine-Net is able to learn realistic dynamic patterns and generate vivid videos.

IV EXPERIMENTS

IV-A DATASET

We build a relatively large-scale dataset of time-lapse videos from the Internet. We collect over 5,000 time-lapse videos from YouTube, manually cut them into short clips, and select the clips that contain dynamic sky scenes, such as a cloudy sky with moving clouds or a starry sky with moving stars. Clips that are too dark or that contain quick zoom-in and zoom-out effects are discarded.

We split the set of selected video clips into a training set and a testing set. Note that all the video clips belonging to the same long video are in the same set, to ensure that the testing video clips are disjoint from those in the training set. We then decompose the short video clips into frames, and generate clips by sequentially grouping every 32 consecutive frames into one clip. There is no overlap between two consecutive clips. We collect 35,392 training video clips and 2,815 testing video clips, each containing 32 frames. The original size of each frame is 3×640×360, and we resize it into a square image of size 128×128. Before feeding the clips to the model, we normalize the color values to [−1,1]. No other preprocessing is required.
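
The clip construction and the normalization can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state: raw pixel values are 8-bit (0 to 255), and trailing frames that do not fill a whole clip are dropped. The helper names are hypothetical.

```python
def split_into_clips(frames, clip_len=32):
    """Group consecutive frames into non-overlapping clips of clip_len frames."""
    n_clips = len(frames) // clip_len  # leftover frames are dropped (assumption)
    return [frames[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def normalize_pixel(value):
    """Map an 8-bit color value from [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0


# Frame indices stand in for decoded frames of one short video.
frames = list(range(70))
clips = split_into_clips(frames)
print(len(clips), clips[1][0], clips[1][-1])  # 2 32 63
print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
```

The [−1, 1] range matches the Tanh output layer of the generators, so generated and real frames live on the same scale.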

image.png

TABLE I: The architecture of the generators in both stages. The size of the input video is 3×32×128×128.

Our dataset contains videos with both complex contents and diverse motion patterns. There are various types of scenes in the dataset, including daytime, nightfall, dawn, starry night and aurora. They exhibit different kinds of foreground (the sky) and colors. Unlike some previous time-lapse video datasets, e.g., [5], which contain relatively clean backgrounds, the background in our dataset shows high-level diversity across videos. The scenes may contain trees, mountains, buildings and other static objects. It is also challenging to learn the diverse dynamic patterns within each type of scene. The clouds in the blue sky may be of any arbitrary shape and move in any direction. In the starry night scene, the stars usually move fast along a curve in the dark sky.

Our dataset can be used for various tasks on learning dynamic patterns, including unconditional video generation [4], video prediction [8], video classification [32], and dynamic texture synthesis [27]. In this paper, we use it for video prediction. The samples of our dataset are displayed in the supplementary materials.

IV-B IMPLEMENTATION DETAILS

The Base-Net takes a 3×128×128 start image and generates 32 image frames of resolution 128×128, i.e., T=32. The Refine-Net takes the result video of the Base-Net as input, and generates a more realistic video with 128×128 resolution. The models in both stages are optimized with stochastic gradient descent. We use Adam as the optimizer with β=0.5 and the momentum is set to 0.9. The learning rate is 0.0002 and fixed throughout the training procedure.

We use Batch Normalization [33] followed by Leaky ReLU [34] in all the 3D convolutional layers in both the encoder part of generators and discriminators, except for their first and last layers. For the deconvolutional layers, we use ReLU [35] instead of Leaky ReLU. We use Tanh as the activation function of the output layer of the generators. The Gram matrices are calculated using the features of the first and third convolutional layers (after the ReLU layer) of discriminator D2. The weight of the adversarial ranking loss is set to 1 in all experiments, i.e., λ=1. The detailed configurations of G1 are given in Table I. In G2, we remove the skip connections between “conv1” and “deconv6”, “conv2” and “deconv5”. We use the identity mapping as the skip connection [11].

“Which is more realistic?”        POS
Random Selection                  50
Prefers Ours over VGAN            92
Prefers Ours over RNN-GAN         97
Prefers VGAN over Real            5
Prefers RNN-GAN over Real         1
Prefers Ours over Real            16

TABLE II: Quantitative comparison results of different models. We show pairs of videos to a few workers and ask them “which is more realistic”. Their answers are aggregated into the Preference Opinion Score (POS), which ranges from 0 to 100. A value greater than 50 means that the former performs better than the latter.
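
As a rough illustration of how a POS value could be obtained, the sketch below treats POS as the percentage of opinion tests in which the first-named model was preferred. This reading is an assumption based on the caption (a range of 0 to 100, with 50 as the random baseline); the function name `preference_opinion_score` is hypothetical.

```python
def preference_opinion_score(votes):
    """Percentage of votes that prefer the former of the two compared models.

    `votes` is a list of booleans: True when the worker preferred the former.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    return 100.0 * sum(votes) / len(votes)


# 100 opinion tests, 92 of which prefer the former model.
votes = [True] * 92 + [False] * 8
pos = preference_opinion_score(votes)
# Under this reading, the score of the latter model is the complement.
complement = 100.0 - pos
```

Under this reading, a row such as "Prefers Ours over VGAN 92" would correspond to 92 of 100 tests favoring our model and the remaining 8 favoring VGAN.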

IV-C COMPARISON WITH EXISTING METHODS

We perform a quantitative comparison between our model and the models presented in [4] and [5]. For notational convenience, we refer to these two models as VGAN [4] and RNN-GAN [5], respectively. For a fair comparison, we reproduce the results of their models exactly according to their papers and reference code, except for some adaptations needed to match our experimental setting. In particular, all methods are set to produce 32 frames as output. Note that both VGAN and RNN-GAN generate videos of resolution 64×64, so we resize the videos produced by our model to 64×64 for fairness.

Fig. 1 shows exemplar results of each method. The video frames generated by VGAN (the first row) and RNN-GAN (the second row) tend to be blurry, while our Base-Net (the third row) and Refine-Net (the fourth row) produce samples that are much more realistic, indicating that the skip connections and the 3D U-net architecture greatly benefit content generation.

In order to compare the models more directly on both content and motion generation, we compare them in pairs. For each pair of models, we randomly select 100 clips from the testing set and take their first frames as input. Each of the two models then produces a 32-frame video as its prediction of the future. Based on these outputs, we collect 100 opinion tests from professional workers. In each test, we show a worker two videos generated by the two models from the same input frame, and the worker is asked which one is more realistic. The two videos are shown in random order to avoid the potential issue that a worker always prefers the video on the left (or right) due to laziness. Five groups of comparisons are conducted in total. Apart from comparing ours against VGAN and RNN-GAN, respectively, we also compare ours, VGAN and RNN-GAN against real videos to evaluate the performance of these models.
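
The pairing protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the function `build_trials`, the clip identifiers, and the model labels are hypothetical, and the paper does not specify how the sampling or the randomization was implemented.

```python
import random


def build_trials(test_clip_ids, model_a, model_b, n_clips=100, seed=0):
    """Sample clips and randomize the left/right placement of the two models."""
    rng = random.Random(seed)
    # Randomly select the clips whose first frames serve as input.
    selected = rng.sample(list(test_clip_ids), n_clips)
    trials = []
    for clip_id in selected:
        pair = [model_a, model_b]
        # Shuffle the display order so that a worker who always picks
        # one side cannot systematically favor one model.
        rng.shuffle(pair)
        trials.append({"clip": clip_id, "left": pair[0], "right": pair[1]})
    return trials


# Hypothetical test set with 500 clip identifiers.
trials = build_trials(range(500), "ours", "vgan")
```

Seeding the generator keeps the trial list reproducible, which makes it easier to audit which video appeared on which side in each opinion test.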

Table II shows the quantitative comparison results. Our model outperforms VGAN [4] with regard to the Preference Opinion Score (POS). Qualitatively, videos generated by VGAN are usually not as sharp as ours. We suspect that the following reasons contribute to the superiority of our model. First, we adopt a U-net-like structure instead of the vanilla encoder-decoder structure used in VGAN. The connections between the encoder and the decoder bring more powerful representations, thus producing more concrete contents. Second, the Refine-Net makes further efforts to learn more vivid dynamic patterns. Our model also performs better than RNN-GAN [5]. One reason might be that RNN-GAN uses an RNN to generate image frames sequentially, so its results are prone to error accumulation. Our model employs 3D convolutions instead of an RNN, so that the state of the next frame does not depend heavily on the state of previous frames.

When comparing ours, VGAN and RNN-GAN with real videos, our model consistently achieves better POS than both VGAN and RNN-GAN, showing the superiority of our multi-stage model. Some results of our model are as decent as the real ones, or even perceived as more realistic than the real ones, suggesting that our model is able to generate realistic future scenes.

Fig. 3: Video frames generated by Stage I (left) and Stage II (right) given the same first frame. We show exemplar frames 1, 8, 16, 24, and 32. Red circles indicate the locations and areas where obvious movements take place between adjacent frames. More and larger circles are observed in the frames of Stage II, indicating that the Refine-Net generates more vivid motion.

IV-D COMPARISON BETWEEN BASE-NET AND REFINE-NET

Although the Base-Net can generate videos of decent details and plausible motion, it fails to generate vivid dynamics. For instance, some of the results in the scene of cloudy daytime fail to exhibit apparent cloud movement. The Refine-Net makes attempts to compensate for the motion based on the result of Base-Net, while preserving the concrete content details. In this part, we evaluate the performance of Stage II versus Stage I in terms of both quantitative and qualitative results.

“Which is more realistic?”        POS
Random Selection                  50
Prefers Stage II to Stage I       70
Prefers Stage II to Real          16
Prefers Stage I to Real           8

TABLE III: Quantitative comparison results of Stage I versus Stage II. The evaluation metric is the same as that in Table II.

Quantitative Results. Given an identical starting frame as input, we generate two videos, one by the Base-Net in Stage I and one by the Refine-Net in Stage II. The comparison is carried out over 100 pairs of generated videos in the same way as in the previous section. Showing each pair of videos, we ask the workers which one is more realistic. To check how effective our model is, we also compare the results of the Base-Net and the Refine-Net with the ground-truth videos. The results in Table III reveal that the Refine-Net contributes significantly to the realism of the generated videos. When comparing the Refine-Net with the Base-Net, the advantage is about 40 (70 versus 30) in terms of POS. Not surprisingly, the Refine-Net also obtains a better POS than the Base-Net when videos of the two models are compared with the ground-truth videos.

Qualitative Results. As shown in Fig. 1, although our Refine-Net mainly focuses on improving the motion quality, it still preserves fine content details that are visually almost as realistic as the frames produced by the Base-Net. In addition to the content comparison, we further compare the motion dynamics of the videos produced by the two stages. In Fig. 3, we show four video clips generated by the Base-Net and the Refine-Net individually from the same starting frame. Motions are indicated by red circles in the frames. Please note the differences between the next and previous frames. We also encourage readers to check more qualitative results in our supplementary materials. The results in Fig. 3 indicate that although the Base-Net can generate concrete object details, the content of the next frames seems to show no significant difference from the previous frames. While it does capture the motion patterns to some degree, such as color changes or some inconspicuous object movements, the Base-Net fails to generate vivid dynamic scene sequences. In contrast, the Refine-Net takes the result of the Base-Net and produces more realistic motion dynamics learned from the dataset. As a result, the scene sequences show more evident movements across adjacent frames.

IV-E EXPERIMENT ON BEACH DATASET

Although our model is designed for time-lapse video generation, it is in fact a general method for video prediction. To evaluate our approach thoroughly, we compare our model with both VGAN and RNN-GAN on the Beach dataset released by [4], which does not contain any time-lapse video. We use only 10% of this dataset as training data, and the rest as testing data to test our model (both Stage I and Stage II), VGAN and RNN-GAN. For a fair comparison, all these models take a 64×64 image as input. To this end, we adjust our model to take 64×64 resolution images and videos by omitting the first convolutional layer, which originally takes 128×128 resolution images or videos as input. The remaining parts of our model are unchanged. For each approach, we calculate the Mean Square Error (MSE), Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index (SSIM) between 1000 randomly sampled pairs of generated videos and the corresponding ground-truth videos. The results in Table IV demonstrate the superiority of our MD-GAN model.

Method                      MSE ↓     PSNR ↑     SSIM ↑
VGAN [4]                    0.0958    11.5586    0.6035
RNN-GAN [5]                 0.1849    7.7988     0.5143
MD-GAN Stage I (Ours)       0.0530    14.8982    0.7624
MD-GAN Stage II (Ours)      0.0422    16.1951    0.8019

TABLE IV: Experiment results on the Beach dataset in terms of MSE, PSNR and SSIM (arrows indicate the direction of better performance). MD-GAN Stage II achieves the best value on all three metrics.
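
For readers unfamiliar with these metrics, the following minimal sketch computes MSE and PSNR for flattened videos and averages them over a set of pairs. It is not the evaluation code used in the paper. It assumes pixel values scaled to [0, 1] (hence `max_value=1.0`), and the helper names are hypothetical. SSIM is omitted because it requires windowed local statistics.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    if not pred or len(pred) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; max_value=1.0 is an assumed pixel range."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def mean_metrics(pairs):
    """Average MSE and PSNR over (generated, ground truth) pairs."""
    mses = [mse(pred, target) for pred, target in pairs]
    psnrs = [psnr(pred, target) for pred, target in pairs]
    return sum(mses) / len(mses), sum(psnrs) / len(psnrs)


# Two toy "videos", each flattened to two pixel values.
pairs = [([0.0, 1.0], [0.0, 0.0]), ([0.5, 0.5], [0.0, 0.0])]
mean_mse, mean_psnr = mean_metrics(pairs)
# PSNR computed from the averaged MSE, for comparison with mean_psnr.
psnr_of_mean_mse = 10.0 * math.log10(1.0 / mean_mse)
```

Averaging PSNR per pair generally gives a different number than computing PSNR from the averaged MSE, so a reported PSNR column need not equal 10·log10(1/MSE) of the reported MSE column.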

V CONCLUSIONS

We propose the MD-GAN model, which can generate realistic time-lapse videos of resolution as high as 128×128 in a coarse-to-fine manner. In the first stage, the Base-Net generates sharp content details and rough motion dynamics, using a 3D U-net-like network as the generator. In the second stage, the Refine-Net improves the motion quality with an adversarial ranking loss that incorporates the Gram matrix to model the motion patterns effectively. Experiments show that our model outperforms the state-of-the-art models and, in many cases, can generate videos that are visually as realistic as real-world videos.