Generative Adversarial Nets
Ian J. Goodfellow, Jean Pouget-Abadie*, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Département d'informatique et de recherche opérationnelle, Université de Montréal, Montréal, QC H3C 3J7
Abstract
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution and $D$ equal to $\frac{1}{2}$ everywhere. In the case where $G$ and $D$ are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
1 Introduction
The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.
This framework can yield specific training algorithms for many kinds of models and optimization algorithms. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.
2 Related work
An alternative to directed graphical models with latent variables is undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].
Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.
Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive auto-encoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples drawn from a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.
Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].
3 Adversarial nets
The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator’s distribution $p_ {g}$ over data $\pmb{x}$, we define a prior on input noise variables $p_ {z}(\pmb{z})$, then represent a mapping to data space as $G(\pmb{z};\theta_ {g})$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_ {g}$. We also define a second multilayer perceptron $D(\pmb{x};\theta_ {d})$ that outputs a single scalar. $D(\pmb{x})$ represents the probability that $\pmb{x}$ came from the data rather than $p_ {g}$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1-D(G(\pmb{z})))$:
In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G,D)$:
$$
\min_{G}\max_{D} V(D,G)=\mathbb{E}_ {\pmb{x}\sim p_ {\mathrm{data}}(\pmb{x})}[\log D(\pmb{x})]+\mathbb{E}_ {\pmb{z}\sim p_ {z}(\pmb{z})}[\log(1-D(G(\pmb{z})))]. \tag{1}
$$
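As an illustrative aside (not part of the original paper), this setup can be written down concretely. The sketch below defines $G(\pmb{z};\theta_ {g})$ and $D(\pmb{x};\theta_ {d})$ as small multilayer perceptrons in PyTorch; the layer widths, the 100-dimensional Gaussian noise prior, and the sigmoid output ranges are assumptions chosen only for readability.

```python
import torch
import torch.nn as nn

noise_dim, data_dim, hidden = 100, 784, 256  # assumed sizes, e.g. flattened 28x28 images

# Generator G(z; theta_g): a differentiable map from noise space to data space.
G = nn.Sequential(
    nn.Linear(noise_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, data_dim), nn.Sigmoid(),  # outputs in [0, 1]
)

# Discriminator D(x; theta_d): a single scalar, the probability that x came from the data.
D = nn.Sequential(
    nn.Linear(data_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, 1), nn.Sigmoid(),
)

z = torch.randn(16, noise_dim)   # z ~ p_z(z), here a standard Gaussian prior
x_fake = G(z)                    # samples in data space, distributed as p_g
d_fake = D(x_fake)               # D(G(z)), a probability in (0, 1)
```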
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as $G$ and $D$ are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing $D$ to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between $k$ steps of optimizing $D$ and one step of optimizing $G$. This results in $D$ being maintained near its optimal solution, so long as $G$ changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for $G$ to learn well. Early in learning, when $G$ is poor, $D$ can reject samples with high confidence because they are clearly different from the training data. In this case, $\log(1-D(G(\pmb{z})))$ saturates. Rather than training $G$ to minimize $\log(1-D(G(\pmb{z})))$, we can train $G$ to maximize $\log D(G(\pmb{z}))$. This objective function results in the same fixed point of the dynamics of $G$ and $D$ but provides much stronger gradients early in learning.
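A minimal sketch (ours, not the paper's) of how the two objectives discussed above might be computed on a minibatch; the function name and the small `eps` added for numerical stability are assumptions.

```python
import torch

def gan_objectives(d_real, d_fake, eps=1e-8):
    """d_real = D(x) on a minibatch of data, d_fake = D(G(z)) on generated samples."""
    # Discriminator ascends V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]  (Eq. 1)
    d_objective = torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean()
    # Minimax generator loss: minimize E[log(1 - D(G(z)))]; saturates when D rejects confidently
    g_loss_minimax = torch.log(1.0 - d_fake + eps).mean()
    # Non-saturating heuristic: maximize E[log D(G(z))], i.e. minimize its negative
    g_loss_nonsaturating = -torch.log(d_fake + eps).mean()
    return d_objective, g_loss_minimax, g_loss_nonsaturating
```

Both generator losses share the same fixed points, but the non-saturating form gives usable gradients when $D(G(\pmb{z}))$ is close to zero.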
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) $p_ {x}$ and those of the generative distribution $p_ {g}$ (G) (green, solid line). The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The horizontal line above is part of the domain of $\pmb{x}$. The upward arrows show how the mapping $\pmb{x}=G(\pmb{z})$ imposes the non-uniform distribution $p_ {g}$ on transformed samples. $G$ contracts in regions of high density and expands in regions of low density of $p_ {g}$. (a) Consider an adversarial pair near convergence: $p_ {g}$ is similar to $p_ {\mathrm{data}}$ and $D$ is a partially accurate classifier. (b) In the inner loop of the algorithm $D$ is trained to discriminate samples from data, converging to $D^{* }(\pmb{x})=\frac{p_ {\mathrm{data}}(\pmb{x})}{p_ {\mathrm{data}}(\pmb{x})+p_ {g}(\pmb{x})}$. (c) After an update to $G$, the gradient of $D$ has guided $G(\pmb{z})$ to flow to regions that are more likely to be classified as data. (d) After several steps of training, if $G$ and $D$ have enough capacity, they will reach a point at which both cannot improve because $p_ {g}=p_ {\mathrm{data}}$. The discriminator is unable to differentiate between the two distributions, i.e. $D(\pmb{x})=\frac{1}{2}$.
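The 1-D picture in Figure 1 can be approximated with a short numerical experiment; the Gaussian data density, the affine stand-in for $G$, and the histogram estimate of $p_ {g}$ below are all assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-ins: p_data is a Gaussian; the "generator" is a fixed map of uniform noise.
def p_data(x):
    return np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2) / (0.5 * np.sqrt(2.0 * np.pi))

def G(z):
    return 4.0 * z  # pushes z ~ U[0, 1] into data space; a placeholder for a trained generator

# Estimate p_g with a histogram of generated samples.
edges = np.linspace(-1.0, 5.0, 301)
samples = G(rng.uniform(0.0, 1.0, size=200_000))
hist, edges = np.histogram(samples, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
p_g = hist  # density estimate of p_g at the bin centers

# Optimal discriminator for this fixed G (cf. Proposition 1 below).
D_star = p_data(centers) / (p_data(centers) + p_g + 1e-12)
```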
4 Theoretical Results
The generator $G$ implicitly defines a probability distribution $p_ {g}$ as the distribution of the samples $G(z)$ obtained when $z\sim p_ {z}$. Therefore, we would like Algorithm 1 to converge to a good estimator of $p_ {\mathrm{data}}$, if given enough capacity and training time. The results of this section are derived in a nonparametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.
We will show in section 4.1 that this minimax game has a global optimum for $p_ {g}=p_ {\mathrm{data}}$. We will then show in section 4.2 that Algorithm 1 optimizes Eq. 1, thus obtaining the desired result.
Algorithm 1 Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, $k$, is a hyperparameter. We used $k=1$, the least expensive option, in our experiments.
for number of training iterations do
for $k$ steps do
$\bullet$ Sample minibatch of $m$ noise samples $\{z^{(1)},\ldots,z^{(m)}\}$ from noise prior $p_ {z}(z)$.
$\bullet$ Sample minibatch of $m$ examples $\{\pmb{x}^{(1)},\ldots,\pmb{x}^{(m)}\}$ from data generating distribution $p_ {\mathrm{data}}(\pmb{x})$.
$\bullet$ Update the discriminator by ascending its stochastic gradient:
$$
\nabla_ {\theta_ {d}}\frac{1}{m}\sum_ {i=1}^{m}\left[\log D\left(\pmb{x}^{(i)}\right)+\log\left(1-D\left(G\left(\pmb{z}^{(i)}\right)\right)\right)\right].
$$
end for
$\bullet$ Sample minibatch of $m$ noise samples $\{z^{(1)},\ldots,z^{(m)}\}$ from noise prior $p_ {z}(z)$.
$\bullet$ Update the generator by descending its stochastic gradient:
$$
\nabla_ {\theta_ {g}}\frac{1}{m}\sum_ {i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right).
$$
end for
The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
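Putting the pieces together, the following is a minimal sketch of Algorithm 1 with $k=1$ and SGD with momentum, as mentioned above. The network sizes, learning rates, and the synthetic Gaussian stand-in for $p_ {\mathrm{data}}$ are assumptions; the generator step uses the non-saturating heuristic from Section 3 rather than the minimax loss.

```python
import torch
import torch.nn as nn

noise_dim, data_dim, batch, iters, eps = 100, 2, 64, 1000, 1e-8

G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)
opt_g = torch.optim.SGD(G.parameters(), lr=0.01, momentum=0.9)

for _ in range(iters):
    # k = 1 discriminator step: ascend E[log D(x)] + E[log(1 - D(G(z)))]
    x = torch.randn(batch, data_dim) * 0.5 + 3.0   # stand-in for samples from p_data
    z = torch.randn(batch, noise_dim)               # z ~ p_z(z)
    d_obj = torch.log(D(x) + eps).mean() + torch.log(1.0 - D(G(z).detach()) + eps).mean()
    opt_d.zero_grad()
    (-d_obj).backward()                             # ascend by descending the negative
    opt_d.step()

    # One generator step: maximize E[log D(G(z))] (non-saturating form)
    z = torch.randn(batch, noise_dim)
    g_loss = -torch.log(D(G(z)) + eps).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```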
4.1 Global Optimality of $p_ {g}=p_ {\mathrm{data}}$
We first consider the optimal discriminator $D$ for any given generator $G$.
Proposition 1. For $G$ fixed, the optimal discriminator $D$ is
$$
D_ {G}^{* }(\pmb{x})=\frac{p_ {\mathrm{data}}(\pmb{x})}{p_ {\mathrm{data}}(\pmb{x})+p_ {g}(\pmb{x})} \tag{2}
$$
Proof. The training criterion for the discriminator $D$, given any generator $G$, is to maximize the quantity $V(G,D)$
$$
\begin{aligned}
V(G,D) &= \int_ {\pmb{x}} p_ {\mathrm{data}}(\pmb{x})\log(D(\pmb{x}))\,dx + \int_ {\pmb{z}} p_ {z}(\pmb{z})\log(1-D(G(\pmb{z})))\,dz \\
&= \int_ {\pmb{x}} p_ {\mathrm{data}}(\pmb{x})\log(D(\pmb{x})) + p_ {g}(\pmb{x})\log(1-D(\pmb{x}))\,dx
\end{aligned} \tag{3}
$$
For any $(a,b)\in\mathbb{R}^{2}\setminus\{(0,0)\}$, the function $y\rightarrow a\log(y)+b\log(1-y)$ achieves its maximum in $[0,1]$ at $\frac{a}{a+b}$. The discriminator does not need to be defined outside of $\mathrm{Supp}(p_ {\mathrm{data}})\cup \mathrm{Supp}(p_ {g})$, concluding the proof. $\square$
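For completeness, the maximizer cited in the proof follows from a one-line calculus step (for $a,b>0$; the boundary cases $a=0$ or $b=0$ push the maximum to $y=0$ or $y=1$, which still equals $\frac{a}{a+b}$):

$$
\frac{d}{dy}\bigl[a\log y + b\log(1-y)\bigr]=\frac{a}{y}-\frac{b}{1-y}=0 \;\Longleftrightarrow\; a(1-y)=by \;\Longleftrightarrow\; y=\frac{a}{a+b},
$$

and the second derivative $-\frac{a}{y^{2}}-\frac{b}{(1-y)^{2}}<0$ confirms that this stationary point is a maximum.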
Note that the training objective for $D$ can be interpreted as maximizing the log-likelihood for estimating the conditional probability $P(Y=y|\pmb{x})$, where $Y$ indicates whether $\pmb{x}$ comes from $p_ {\mathrm{data}}$ (with $y=1$) or from $p_ {g}$ (with $y=0$). The minimax game in Eq. 1 can now be reformulated as:
$$
\begin{aligned}
C(G) &= \max_ {D} V(G,D) \\
&= \mathbb{E}_ {\pmb{x}\sim p_ {\mathrm{data}}}[\log D_ {G}^{* }(\pmb{x})]+\mathbb{E}_ {\pmb{z}\sim p_ {z}}[\log(1-D_ {G}^{* }(G(\pmb{z})))] \\
&= \mathbb{E}_ {\pmb{x}\sim p_ {\mathrm{data}}}[\log D_ {G}^{* }(\pmb{x})]+\mathbb{E}_ {\pmb{x}\sim p_ {g}}[\log(1-D_ {G}^{* }(\pmb{x}))] \\
&= \mathbb{E}_ {\pmb{x}\sim p_ {\mathrm{data}}}\left[\log\frac{p_ {\mathrm{data}}(\pmb{x})}{p_ {\mathrm{data}}(\pmb{x})+p_ {g}(\pmb{x})}\right]+\mathbb{E}_ {\pmb{x}\sim p_ {g}}\left[\log\frac{p_ {g}(\pmb{x})}{p_ {\mathrm{data}}(\pmb{x})+p_ {g}(\pmb{x})}\right]
\end{aligned} \tag{4}
$$
Theorem 1. The global minimum of the virtual training criterion $C(G)$ is achieved if and only if $p_ {g}=p_ {\mathrm{data}}$. At that point, $C(G)$ achieves the value $-\log 4$.
Proof. For $p_ {g}=p_ {\mathrm{data}}$, $D_ {G}^{* }(\pmb{x})=\frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_ {G}^{* }(\pmb{x})=\frac{1}{2}$, we find $C(G)=\log\frac{1}{2}+\log\frac{1}{2}=-\log 4$. To see that this is the best possible value of $C(G)$, reached only for $p_ {g}=p_ {\mathrm{data}}$, observe that
$$
\mathbb{E}_ {\pmb{x}\sim p_ {\mathrm{data}}}\left[-\log 2\right]+\mathbb{E}_ {\pmb{x}\sim p_ {g}}\left[-\log 2\right]=-\log 4
$$
and that by subtracting this expression from $C(G)=V(D_ {G}^{* },G)$ , we obtain:
$$
C(G)=-\log(4)+\mathrm{KL}\left(p_ {\mathrm{data}}\,\Big\|\,\frac{p_ {\mathrm{data}}+p_ {g}}{2}\right)+\mathrm{KL}\left(p_ {g}\,\Big\|\,\frac{p_ {\mathrm{data}}+p_ {g}}{2}\right) \tag{5}
$$
where $\mathrm{KL}$ is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model’s distribution and the data generating process:
$$
C(G)=-\log(4)+2\cdot \mathrm{JSD}\left(p_ {\mathrm{data}}\,\|\,p_ {g}\right) \tag{6}
$$
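As a numerical sanity check (not part of the paper), the identity $C(G)=-\log(4)+2\cdot\mathrm{JSD}(p_ {\mathrm{data}}\,\|\,p_ {g})$ and the fact that the minimum $-\log 4$ is attained exactly at $p_ {g}=p_ {\mathrm{data}}$ can be verified for discrete distributions; the two example histograms below are arbitrary choices.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def C(p_data, p_g):
    """C(G) evaluated at the optimal discriminator D* = p_data / (p_data + p_g)."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

p_data = np.array([0.1, 0.2, 0.3, 0.4])
p_g = np.array([0.25, 0.25, 0.25, 0.25])

assert np.isclose(C(p_data, p_g), -np.log(4.0) + 2.0 * jsd(p_data, p_g))   # Eq. 6
assert np.isclose(C(p_data, p_data), -np.log(4.0))  # global minimum at p_g = p_data
```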