Generative Adversarial Nets
Ian J. Goodfellow, Jean Pouget-Abadie*, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio†
Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC H3C 3J7
Abstract
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
1 Introduction
The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.
This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.
2 Related work
An alternative to directed graphical models with latent variables are undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].
Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.
Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive auto-encoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples from a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.
Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].
3 Adversarial nets
The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_{g}$ over data $\pmb{x}$, we define a prior on input noise variables $p_{z}(z)$, then represent a mapping to data space as $G(z;\theta_{g})$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_{g}$. We also define a second multilayer perceptron $D(\pmb{x};\theta_{d})$ that outputs a single scalar. $D(\pmb{x})$ represents the probability that $\pmb{x}$ came from the data rather than $p_{g}$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1-D(G(z)))$.
In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G,D)$ :
$$
\min_{G}\max_{D} V(D,G)=\mathbb{E}_{\pmb{x}\sim p_{\mathrm{data}}(\pmb{x})}[\log D(\pmb{x})]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))].
$$
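As a concrete illustration of the two players, both networks can be written as tiny multilayer perceptrons in a few lines of numpy. This is a sketch only: the layer sizes, initialization scale, and tanh hidden activations below are illustrative choices, not the architectures used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Small fully connected net: one (weights, biases) pair per layer."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def generator(z, params):
    """G(z; theta_g): map noise to data space (tanh hidden layer, linear output)."""
    h = np.tanh(z @ params[0][0] + params[0][1])
    return h @ params[1][0] + params[1][1]

def discriminator(x, params):
    """D(x; theta_d): a single scalar in (0, 1) per input -- the probability
    that x came from the data rather than p_g."""
    h = np.tanh(x @ params[0][0] + params[0][1])
    logit = h @ params[1][0] + params[1][1]
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid output

theta_g = init_mlp([2, 8, 1])  # noise dim 2 -> data dim 1 (illustrative sizes)
theta_d = init_mlp([1, 8, 1])

z = rng.normal(size=(5, 2))         # z ~ p_z(z)
x_fake = generator(z, theta_g)      # samples distributed as p_g
p = discriminator(x_fake, theta_d)  # D(G(z)), each value in (0, 1)
```

Training then plugs these two functions into the minimax objective above.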
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1-D(G(z)))$ saturates. Rather than training $G$ to minimize $\log(1-D(G(z)))$, we can train $G$ to maximize $\log D(G(z))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
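The saturation argument can be checked with one line of arithmetic: the gradient of $\log(1-D(G(z)))$ with respect to $D(G(z))$ is $-1/(1-D(G(z)))$, while the gradient of $-\log D(G(z))$ is $-1/D(G(z))$. The value $10^{-3}$ below is an arbitrary stand-in for an early-training discriminator output, not a number from the paper.

```python
d = 1e-3  # stand-in for D(G(z)) early in training, when D confidently rejects fakes

# d/dD [ log(1 - D) ]: the minimax objective that G descends.
g_saturating = -1.0 / (1.0 - d)   # close to -1: the signal to G has saturated

# d/dD [ -log D ]: descending this is the same as maximizing log D(G(z)).
g_nonsaturating = -1.0 / d        # close to -1000: much stronger early signal

ratio = abs(g_nonsaturating) / abs(g_saturating)
print(ratio)  # the alternative objective's gradient is ~1000x larger here
```

Both objectives share the same fixed point at $D(G(z))=\frac{1}{2}$; only the early-training gradient magnitudes differ.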

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) $p_{x}$ and those of the generative distribution $p_{g}$ (G) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $\pmb{x}$. The upward arrows show how the mapping ${\pmb x}=G({\pmb z})$ imposes the non-uniform distribution $p_{g}$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_{g}$. (a) Consider an adversarial pair near convergence: $p_{g}$ is similar to $p_{\mathrm{data}}$ and $D$ is a partially accurate classifier. (b) In the inner loop of the algorithm $D$ is trained to discriminate samples from data, converging to $D^{*}(\pmb{x})=\frac{p_{\mathrm{data}}(\pmb{x})}{p_{\mathrm{data}}(\pmb{x})+p_{g}(\pmb{x})}$. (c) After an update to $G$, the gradient of $D$ has guided $G(z)$ to flow to regions that are more likely to be classified as data. (d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which both cannot improve because $p_{g}=p_{\mathrm{data}}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(\pmb{x})=\frac{1}{2}$.
4 Theoretical Results
The generator $G$ implicitly defines a probability distribution $p_ {g}$ as the distribution of the samples $G(z)$ obtained when $z\sim p_ {z}$ . Therefore, we would like Algorithm 1 to converge to a good estimator of $p_ {\mathrm{data}}$ , if given enough capacity and training time. The results of this section are done in a nonparametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.
We will show in section 4.1 that this minimax game has a global optimum for $p_{g}=p_{\mathrm{data}}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.
Algorithm 1 Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, $k$, is a hyperparameter. We used $k=1$, the least expensive option, in our experiments.
for number of training iterations do
for $k$ steps do
$\bullet$ Sample minibatch of $m$ noise samples $\{z^{(1)},\ldots,z^{(m)}\}$ from noise prior $p_{z}(z)$.
$\bullet$ Sample minibatch of $m$ examples $\{\pmb{x}^{(1)},\ldots,\pmb{x}^{(m)}\}$ from data generating distribution $p_{\mathrm{data}}(\pmb{x})$.
$\bullet$ Update the discriminator by ascending its stochastic gradient:
$$
\nabla_ {\theta_ {d}}\frac{1}{m}\sum_ {i=1}^{m}\left[\log D\left(\pmb{x}^{(i)}\right)+\log\left(1-D\left(G\left(\pmb{z}^{(i)}\right)\right)\right)\right].
$$
end for
$\bullet$ Sample minibatch of $m$ noise samples $\{z^{(1)},\ldots,z^{(m)}\}$ from noise prior $p_{z}(z)$.
$\bullet$ Update the generator by descending its stochastic gradient:
$$
\nabla_ {\theta_ {g}}\frac{1}{m}\sum_ {i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right).
$$
end for
The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
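The following toy run of Algorithm 1 is a sketch under heavy assumptions, not the paper's experimental setup: the "generator" is linear, the "discriminator" is logistic regression on the features $(x, x^2)$, the data is a 1-D Gaussian, finite-difference gradients stand in for backpropagation, and the generator uses the non-saturating $\log D(G(z))$ objective from Section 3. The learning rate, minibatch size, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-8  # keeps the logs finite when D saturates

# Toy instance: p_data = N(3, 1); G(z; theta_g) = theta_g[0]*z + theta_g[1],
# z ~ N(0, 1); D(x; theta_d) = sigmoid(theta_d[0]*x + theta_d[1]*x^2 + theta_d[2]).
def D(x, td):
    return 1.0 / (1.0 + np.exp(-(td[0] * x + td[1] * x**2 + td[2])))

def G(z, tg):
    return tg[0] * z + tg[1]

def v_minibatch(td, tg, x_real, z):
    # Minibatch estimate of V(D, G): the quantity the discriminator ascends.
    return np.mean(np.log(D(x_real, td) + EPS)
                   + np.log(1 - D(G(z, tg), td) + EPS))

def g_objective(tg, td, z):
    # Non-saturating generator objective: maximize log D(G(z)).
    return np.mean(np.log(D(G(z, tg), td) + EPS))

def grad(f, p, eps=1e-5):
    # Central finite differences -- good enough for a 2-3 parameter sketch.
    g = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = eps
        g[i] = (f(p + step) - f(p - step)) / (2 * eps)
    return g

tg = np.array([0.5, 0.0])   # generator parameters: scale, shift
td = np.zeros(3)            # discriminator parameters
m, k, lr = 128, 1, 0.02     # minibatch size, D steps per G step, learning rate

for _ in range(3000):
    for _ in range(k):                 # k ascent steps on the discriminator
        x_real = rng.normal(3.0, 1.0, m)
        z = rng.normal(size=m)
        td = td + lr * grad(lambda p: v_minibatch(p, tg, x_real, z), td)
    z = rng.normal(size=m)             # one ascent step on the generator objective
    tg = tg + lr * grad(lambda p: g_objective(p, td, z), tg)

x_fake = G(rng.normal(size=10000), tg)
print(x_fake.mean())  # the generated mean drifts toward the data mean of 3
```

With real networks the finite differences are replaced by backpropagation through $D$ into $G$, but the alternating structure of the loop is the same.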
4.1 Global Optimality of $p_{g}=p_{\mathrm{data}}$
We first consider the optimal discriminator $D$ for any given generator $G$.
Proposition 1. For $G$ fixed, the optimal discriminator $D$ is
$$
D_ {G}^{* }(\pmb{x})=\frac{p_ {d a t a}(\pmb{x})}{p_ {d a t a}(\pmb{x})+p_ {g}(\pmb{x})}
$$
Proof. The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G,D)$:
$$
\begin{aligned}
V(G,D)&=\int_{\pmb{x}}p_{\mathrm{data}}(\pmb{x})\log(D(\pmb{x}))\,dx+\int_{z}p_{z}(z)\log(1-D(G(z)))\,dz\\
&=\int_{\pmb{x}}p_{\mathrm{data}}(\pmb{x})\log(D(\pmb{x}))+p_{g}(\pmb{x})\log(1-D(\pmb{x}))\,dx
\end{aligned}
$$
For any $(a,b)\in\mathbb{R}^{2}\setminus\{(0,0)\}$, the function $y\mapsto a\log(y)+b\log(1-y)$ achieves its maximum in $[0,1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $\mathrm{Supp}(p_{\mathrm{data}})\cup\mathrm{Supp}(p_{g})$, concluding the proof. $\square$
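The key fact in this proof, that $y\mapsto a\log(y)+b\log(1-y)$ is maximized at $\frac{a}{a+b}$, is easy to confirm numerically on a grid. The particular values of $a$ and $b$ below are arbitrary, standing in for $p_{\mathrm{data}}(\pmb{x})$ and $p_{g}(\pmb{x})$ at one point $\pmb{x}$.

```python
import numpy as np

a, b = 0.7, 0.3  # stand-ins for p_data(x) and p_g(x) at a fixed x

def f(y):
    # The integrand maximized pointwise by the optimal discriminator.
    return a * np.log(y) + b * np.log(1 - y)

# Dense grid search over the open interval (0, 1).
ys = np.linspace(1e-4, 1 - 1e-4, 100_000)
y_best = ys[np.argmax(f(ys))]
print(y_best, a / (a + b))  # both approximately 0.7
```

Setting the derivative $a/y - b/(1-y)$ to zero gives the same answer analytically.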
Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y=y\mid\pmb{x})$, where $Y$ indicates whether $\pmb{x}$ comes from $p_{\mathrm{data}}$ (with $y=1$) or from $p_{g}$ (with $y=0$). The minimax game in Eq. 1 can now be reformulated as:
$$
\begin{aligned}
C(G)&=\max_{D}V(G,D)\\
&=\mathbb{E}_{\pmb{x}\sim p_{\mathrm{data}}}[\log D_{G}^{*}(\pmb{x})]+\mathbb{E}_{z\sim p_{z}}[\log(1-D_{G}^{*}(G(z)))]\\
&=\mathbb{E}_{\pmb{x}\sim p_{\mathrm{data}}}[\log D_{G}^{*}(\pmb{x})]+\mathbb{E}_{\pmb{x}\sim p_{g}}[\log(1-D_{G}^{*}(\pmb{x}))]\\
&=\mathbb{E}_{\pmb{x}\sim p_{\mathrm{data}}}\left[\log\frac{p_{\mathrm{data}}(\pmb{x})}{p_{\mathrm{data}}(\pmb{x})+p_{g}(\pmb{x})}\right]+\mathbb{E}_{\pmb{x}\sim p_{g}}\left[\log\frac{p_{g}(\pmb{x})}{p_{\mathrm{data}}(\pmb{x})+p_{g}(\pmb{x})}\right]
\end{aligned}
$$
Theorem 1. The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_{g}=p_{data}$. At that point, $C(G)$ achieves the value $-\log 4$.
Proof. For $p_{g}=p_{\mathrm{data}}$, $D_{G}^{*}(\pmb{x})=\frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_{G}^{*}(\pmb{x})=\frac{1}{2}$, we find $C(G)=\log\frac{1}{2}+\log\frac{1}{2}=-\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_{g}=p_{\mathrm{data}}$, observe that
$$
\begin{array}{r}{\mathbb{E}_ {\pmb{x}\sim p_ {\mathrm{data}}}\left[-\log2\right]+\mathbb{E}_ {\pmb{x}\sim p_ {g}}\left[-\log2\right]=-\log4}\end{array}
$$
and that by subtracting this expression from $C(G)=V(D_ {G}^{* },G)$ , we obtain:
$$
C(G)=-\log(4)+\mathrm{KL}\left(p_{\mathrm{data}}\,\middle\|\,\frac{p_{\mathrm{data}}+p_{g}}{2}\right)+\mathrm{KL}\left(p_{g}\,\middle\|\,\frac{p_{\mathrm{data}}+p_{g}}{2}\right)
$$
where $\mathrm{KL}$ is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process:
$$
C(G)=-\log(4)+2\cdot\mathrm{JSD}\left(p_{\mathrm{data}}\parallel p_{g}\right)
$$
Since the Jensen–Shannon divergence between two distributions is always non-negative and zero only when they are equal, we have shown that $C^{*}=-\log(4)$ is the global minimum of $C(G)$ and that the only solution is $p_{g}=p_{\mathrm{data}}$, i.e., the generative model perfectly replicating the data generating process. $\square$
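Theorem 1's identity can be sanity-checked on discrete distributions, where both $C(G)$ under the optimal discriminator and the Jensen–Shannon form can be computed exactly. The two three-point distributions below are arbitrary stand-ins for $p_{\mathrm{data}}$ and $p_{g}$, not anything from the paper.

```python
import numpy as np

# Two discrete distributions standing in for p_data and p_g.
p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])
mix = (p_data + p_g) / 2  # the mixture (p_data + p_g) / 2

def kl(p, q):
    # Kullback-Leibler divergence for strictly positive discrete distributions.
    return np.sum(p * np.log(p / q))

# C(G) evaluated directly by plugging the optimal discriminator D* into V:
d_star = p_data / (p_data + p_g)
c_direct = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# C(G) via the Jensen-Shannon identity of Theorem 1:
jsd = 0.5 * kl(p_data, mix) + 0.5 * kl(p_g, mix)
c_jsd = -np.log(4) + 2 * jsd

print(c_direct, c_jsd)  # the two computations agree
```

Both values coincide and sit above $-\log 4$; replacing `p_g` with a copy of `p_data` drives them down to exactly $-\log 4$.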
4.2 Convergence of Algorithm 1
Proposition 2. If $G$ and $D$ have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given $G$, and $p_{g}$ is updated so as to improve the criterion
$$
\mathbb{E}_ {\pmb{x}\sim p_ {d a t a}}[\log D_ {G}^{* }(\pmb{x})]+\mathbb{E}_ {\pmb{x}\sim p_ {g}}[\log(1-D_ {G}^{* }(\pmb{x}))]
$$
then $p_{g}$ converges to $p_{\mathrm{data}}$.
Proof. Consider $V(G,D)=U(p_{g},D)$ as a function of $p_{g}$ as done in the above criterion. Note that $U(p_{g},D)$ is convex in $p_{g}$. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if $f(x)=\sup_{\alpha\in\mathcal{A}}f_{\alpha}(x)$ and $f_{\alpha}(x)$ is convex in $x$ for every $\alpha$, then $\partial f_{\beta}(x)\in\partial f$ if $\beta=\arg\sup_{\alpha\in\mathcal{A}}f_{\alpha}(x)$. This is equivalent to computing a gradient descent update for $p_{g}$ at the optimal $D$ given the corresponding $G$. $\sup_{D}U(p_{g},D)$ is convex in $p_{g}$ with a unique global optimum as proven in Thm 1, therefore with sufficiently small updates of $p_{g}$, $p_{g}$ converges to $p_{\mathrm{data}}$, concluding the proof. $\square$
In practice, adversarial nets represent a limited family of $p_ {g}$ distributions via the function $G(z;\theta_ {g})$ , and we optimize $\theta_ {g}$ rather than $p_ {g}$ itself. Using a multilayer perceptron to define $G$ introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.
5 Experiments
We trained adversarial nets on a range of datasets including MNIST [23], the Toronto Face Database (TFD) [28], and CIFAR-10 [21]. The generator nets used a mixture of rectifier linear activations [19, 9] and sigmoid activations, while the discriminator net used maxout [10] activations. Dropout [17] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.
We estimate probability of the test set data under $p_ {g}$ by fitting a Gaussian Parzen window to the samples generated with $G$ and reporting the log-likelihood under this distribution. The $\sigma$ parameter
| Model | MNIST | TFD |
|---|---|---|
| DBN [3] | 138 ± 2 | 1909 ± 66 |
| Stacked CAE [3] | 121 ± 1.6 | 2110 ± 50 |
| Deep GSN [6] | 214 ± 1.1 | 1890 ± 29 |
| Adversarial nets | 225 ± 2 | 2057 ± 26 |
Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different $\sigma$ chosen by cross-validation on the validation set of each fold, and report the mean log-likelihood on each fold. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.
of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
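A minimal version of this Parzen-window evaluation might look as follows. This is a sketch: the sample counts and $\sigma$ grid are illustrative, and standard-normal draws stand in for generator samples so that the script is self-contained (for a 1-D standard normal, the true expected log-likelihood is $-\frac{1}{2}\log(2\pi e)\approx-1.42$).

```python
import numpy as np

def parzen_log_likelihood(samples, test_x, sigma):
    """Mean log-likelihood of test points under an isotropic Gaussian
    Parzen window fit to the given samples (log-sum-exp over kernels)."""
    n, d = samples.shape
    diffs = test_x[:, None, :] - samples[None, :, :]       # (T, n, d)
    sq = -0.5 * np.sum(diffs**2, axis=2) / sigma**2        # (T, n) kernel exponents
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma**2)
    m = sq.max(axis=1, keepdims=True)                      # stabilize log-sum-exp
    ll = m[:, 0] + np.log(np.exp(sq - m).sum(axis=1)) - log_norm
    return ll.mean()

rng = np.random.default_rng(0)
gen_samples = rng.normal(size=(1000, 1))  # stand-in for draws from G
valid_x = rng.normal(size=(200, 1))       # held-out set for choosing sigma
test_x = rng.normal(size=(500, 1))

# Pick sigma by cross-validation on the held-out set, as in the paper.
sigmas = [0.05, 0.1, 0.2, 0.5, 1.0]
best = max(sigmas, key=lambda s: parzen_log_likelihood(gen_samples, valid_x, s))
ll = parzen_log_likelihood(gen_samples, test_x, best)
print(best, ll)  # ll should land near -1.42 for this synthetic setup
```

The estimator's variance and its poor behavior in high dimensions, noted above, are visible if one raises `d` and shrinks the sample count.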
In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.

Figure 2: Visualization of samples from the model. The rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)


Figure 3: Digits obtained by linearly interpolating between coordinates in $z$ space of the full model.
Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.
| | Deep directed graphical models | Deep undirected graphical models | Generative autoencoders | Adversarial models |
|---|---|---|---|---|
| Training | Inference needed during training. | Inference needed during training. MCMC needed to approximate partition function gradient. | Enforced tradeoff between mixing and power of reconstruction generation | Synchronizing the discriminator with the generator. Helvetica. |
| Inference | Learned approximate inference | Variational inference | MCMC-based inference | Learned approximate inference |
| Sampling | No difficulties | Requires Markov chain | Requires Markov chain | No difficulties |
| Evaluating $p(x)$ | Intractable, may be approximated with AIS | Intractable, may be approximated with AIS | Not explicitly represented, may be approximated with Parzen density estimation | Not explicitly represented, may be approximated with Parzen density estimation |
| Model design | Nearly all models incur extreme difficulty | Careful design needed to ensure multiple properties | Any differentiable function is theoretically permitted | Any differentiable function is theoretically permitted |
6 Advantages and disadvantages
This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of $p_{g}(\pmb{x})$, and that $D$ must be synchronized well with $G$ during training (in particular, $G$ must not be trained too much without updating $D$, in order to avoid "the Helvetica scenario" in which $G$ collapses too many values of $\mathbf{z}$ to the same value of $\mathbf{x}$ to have enough diversity to model $p_{\mathrm{data}}$), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.
这一新框架相较于先前的建模框架既有优势也有不足。主要缺点在于没有对 $p_g(\pmb{x})$ 的显式表示,且训练过程中判别器 $D$ 必须与生成器 $G$ 保持良好同步(特别是不能在没有更新 $D$ 的情况下过度训练 $G$,以避免"Helvetica情景"——即 $G$ 将过多 $\mathbf{z}$ 值坍缩到相同的 $\mathbf{x}$ 值,导致缺乏足够多样性来建模 $p_{\text{data}}$),这与玻尔兹曼机在学习步骤间必须保持负链更新的情况类似。其优势在于:无需马尔可夫链、仅通过反向传播获取梯度、学习过程中无需推理,且模型可融入多种函数。表2总结了生成对抗网络与其他生成建模方法的对比。
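The synchronization requirement above (several discriminator updates per generator update, so that $D$ stays near its optimum while $G$ changes slowly) is the structure of the paper's alternating training procedure. A skeleton of that loop, where the two `update_*` callables are hypothetical stand-ins for single stochastic-gradient steps:

```python
def train(update_discriminator, update_generator, num_iterations, k=1):
    """Alternating minimax updates for a GAN.

    For each of `num_iterations` outer steps, the discriminator D is
    updated `k` times before the generator G is updated once.
    Over-training G between D updates risks the "Helvetica scenario",
    where G collapses many values of z to the same x.
    """
    for _ in range(num_iterations):
        for _ in range(k):
            update_discriminator()   # ascend V(D, G) in D's parameters
        update_generator()           # descend V(D, G) in G's parameters
```

The paper uses $k = 1$, the least expensive option, but notes that any schedule keeping $D$ near its optimum is admissible.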
The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator’s parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.
上述优势主要体现在计算层面。对抗模型还可能获得一些统计优势,因为生成器网络不直接通过数据样本更新,而是仅通过判别器回传的梯度进行更新。这意味着输入数据不会被直接复制到生成器参数中。对抗网络的另一优势是能够表征非常尖锐(甚至退化)的分布,而基于马尔可夫链的方法要求分布必须具有一定模糊性,才能使链在不同模态间有效混合。
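The claim that data never enters the generator's update directly can be verified on a scalar toy model: with a hypothetical one-parameter generator $G_\theta(z) = \theta z$ and logistic discriminator $D(x) = \sigma(wx + b)$, the gradient of the generator loss $\log(1 - D(G_\theta(z)))$ involves only $z$ and $D$'s parameters, never a training example. A sketch (the scalar forms are illustrative assumptions, not the paper's networks):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generator_grad(theta, w, b, z):
    """d/dtheta of log(1 - D(G(z))) for G(z) = theta*z, D(x) = sigmoid(w*x + b).

    No training example appears anywhere: the generator parameter
    receives gradient only through the discriminator.
    """
    x = theta * z              # generated sample
    d = sigmoid(w * x + b)     # discriminator's output on it
    # Chain rule: dlog(1-d)/dd = -1/(1-d); dd/dx = d*(1-d)*w; dx/dtheta = z.
    return (-1.0 / (1.0 - d)) * d * (1.0 - d) * w * z
```

The expression simplifies to $-D(G(z))\,w\,z$: the data influences this update only insofar as it has already shaped $w$ and $b$.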
7 Conclusions and future work
7 结论与未来工作
This framework admits many straightforward extensions:
该框架允许许多直接扩展:
- A conditional generative model $p(\pmb{x}\mid\pmb{c})$ can be obtained by adding $\pmb{c}$ as input to both $G$ and $D$.
- Learned approximate inference can be performed by training an auxiliary network to predict $\pmb{z}$ given $\pmb{x}$. This is similar to the inference net trained by the wake-sleep algorithm [15], but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.
- 通过将 ${\pmb c}$ 同时作为 $G$ 和 $D$ 的输入,可以得到条件生成模型 $p({\pmb x}\mid{\pmb c})$。
- 学习式近似推断:可以训练一个辅助网络,在给定 $\pmb{x}$ 时预测 $\pmb{z}$。这类似于 wake-sleep 算法 [15] 训练出的推断网络,但优势在于生成器网络训练完成后,推断网络可以针对固定的生成器网络进行训练。
- One can approximately model all conditionals $p(\pmb{x}_S \mid \pmb{x}_{\not S})$, where $S$ is a subset of the indices of $\pmb{x}$, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [11].
- Semi-supervised learning: features from the discriminator or inference net could improve the performance of classifiers when limited labeled data is available.
- Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating $G$ and $D$, or by determining better distributions from which to sample $\mathbf{z}$ during training.
- 可以通过训练共享参数的系列条件模型,近似建模所有条件概率 $p(\pmb{x}_S \mid \pmb{x}_{\not S})$,其中 $S$ 是 $\pmb{x}$ 索引的子集。本质上,这是用对抗网络实现确定性MP-DBM [11]的随机扩展。
- 半监督学习:当标记数据有限时,判别器或推理网络提取的特征能提升分类器性能。
- 效率优化:通过设计更好的 $G$ 与 $D$ 协调机制,或改进训练时 $\mathbf{z}$ 的采样分布策略,可大幅加速训练过程。
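The first extension above (conditioning on $\pmb{c}$) amounts to nothing more than concatenating $\pmb{c}$ to the input of each network. A minimal NumPy sketch with hypothetical one-layer networks; a real model would use deep nets, but the conditioning mechanism is the same:

```python
import numpy as np

def conditional_generator(z, c, W_g, b_g):
    """G(z | c): the condition c is concatenated to the noise z.

    W_g and b_g are hypothetical single-layer generator parameters.
    """
    return np.tanh(np.concatenate([z, c], axis=-1) @ W_g + b_g)

def conditional_discriminator(x, c, W_d, b_d):
    """D(x | c): probability that x is a real sample, given condition c."""
    logits = np.concatenate([x, c], axis=-1) @ W_d + b_d
    return 1.0 / (1.0 + np.exp(-logits))
```

Both networks see $\pmb{c}$, so $D$ judges whether $\pmb{x}$ is plausible *for that condition*, and the minimax game is otherwise unchanged.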
This paper has demonstrated the viability of the adversarial modeling framework, suggesting that these research directions could prove useful.
本文验证了对抗建模框架的可行性,表明这些研究方向可能具有实用价值。
Acknowledgments
致谢
We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [12] and Theano [7, 1], particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Arnaud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank CIFAR, and Canada Research Chairs for funding, and Compute Canada, and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.
我们感谢Patrice Marcotte、Olivier Delalleau、Kyunghyun Cho、Guillaume Alain和Jason Yosinski富有启发性的讨论。Yann Dauphin与我们分享了他的Parzen窗口评估代码。我们感谢Pylearn2 [12]和Theano [7, 1]的开发者,特别是Frédéric Bastien专门为该项目紧急实现了Theano的功能。Arnaud Bergeron在LaTeX排版方面提供了急需的支持。我们还要感谢CIFAR和加拿大研究讲席计划提供的资金支持,以及Compute Canada和Calcul Québec提供的计算资源。Ian Goodfellow获得了2013年谷歌深度学习奖学金的支持。最后,我们要感谢Les Trois Brasseurs激发了我们的创造力。
References
参考文献
