Open Questions about Generative Adversarial Networks

gingo 2021-05-17 15:09:17 AI Basics

What we’d like to find out about GANs that we don’t know yet.

Problem 1: What are the trade-offs between GANs and other generative models?
Problem 2: What sorts of distributions can GANs model?
Problem 3: How can we scale GANs beyond image synthesis?
Problem 4: What can we say about the global convergence of the training dynamics?
Problem 5: How should we evaluate GANs and when should we use them?
Problem 6: How does GAN training scale with batch size?
Problem 7: What is the relationship between GANs and adversarial examples?

AUTHORS: Augustus Odena
AFFILIATIONS: Google Brain Team
PUBLISHED: April 9, 2019
DOI: 10.23915/distill.00018

By some metrics, research on Generative Adversarial Networks (GANs) has progressed substantially in the past two years. Practical improvements to image synthesis models are being made [1, 2, 3, 4, 5] almost too quickly to keep up with:

Figure: image synthesis results from Odena et al., 2016 [1]; Miyato et al., 2017 [3]; Zhang et al., 2018 [2]; and Brock et al., 2018 [4].

However, by other metrics, less has happened. For instance, there is still widespread disagreement about how GANs should be evaluated. Given that current image synthesis benchmarks seem somewhat saturated, we think now is a good time to reflect on research goals for this sub-field.

Lists of open problems have helped other fields with this [6, 7, 8, 9]. This article suggests open research problems that we’d be excited for other researchers to work on (footnote 1). We assume a fair amount of background (or willingness to look things up) because we reference too many results to explain them all in detail.

What are the Trade-Offs Between GANs and other Generative Models?

In addition to GANs, two other types of generative model are currently popular: Flow Models and Autoregressive Models (footnote 2). Roughly speaking, Flow Models [11, 12, 13, 14] apply a stack of invertible transformations to a sample from a prior so that exact log-likelihoods of observations can be computed. On the other hand, Autoregressive Models [15, 16, 17, 18] factorize the distribution over observations into conditional distributions and process one component of the observation at a time (for images, they may process one pixel at a time). Recent research [13, 19] suggests that these models have different performance characteristics and trade-offs. We think that accurately characterizing these trade-offs and deciding whether they are intrinsic to the model families is an interesting open question.
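The two descriptions above can be written compactly in symbols. The notation is introduced here purely for illustration and is not taken from the article: g is the invertible map applied to a prior sample z with density p_Z, f is its inverse, and n is the number of components of an observation x.

```latex
% Flow model: exact log-likelihood via the change-of-variables formula
\log p_X(x) = \log p_Z\big(f(x)\big)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,
\qquad f = g^{-1}

% Autoregressive model: chain-rule factorization over the n components of x
p_X(x) = \prod_{i=1}^{n} p\big(x_i \mid x_1, \dots, x_{i-1}\big)
```

The first identity is why invertibility matters for Flow Models, and the product in the second is why Autoregressive Models handle one component at a time.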

For concreteness, let’s temporarily focus on the difference in computational cost between GANs and Flow Models. At first glance, Flow Models seem like they might make GANs unnecessary. Flow Models allow for exact log-likelihood computation and exact inference, so if training Flow Models and GANs had the same computational cost, GANs might not be useful. A lot of effort is spent on training GANs, so it seems like we should care about whether Flow Models make GANs obsolete (footnote 3).

However, there seems to be a substantial gap between the computational cost of training GANs and Flow Models. To estimate the magnitude of this gap, we can consider two models trained on datasets of human faces. The GLOW model [13] is trained to generate 256x256 celebrity faces using 40 GPUs for 2 weeks and about 200 million parameters. In contrast, progressive GANs [19] are trained on a similar face dataset with 8 GPUs for 4 days, using about 46 million parameters, to generate 1024x1024 images. Roughly speaking, the Flow Model took 17 times more GPU days and 4 times more parameters to generate images with 16 times fewer pixels. This comparison isn’t perfect (footnote 4), but it gives you a sense of things.
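The rough ratios quoted above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the dictionary keys and function name are illustrative, and "2 weeks" is taken to mean 14 days.

```python
# Figures quoted in the text above; names are illustrative.
glow = {"gpus": 40, "days": 14, "params_millions": 200, "resolution": 256}
progan = {"gpus": 8, "days": 4, "params_millions": 46, "resolution": 1024}

def gpu_days(model):
    """Total GPU-days = number of GPUs times wall-clock days."""
    return model["gpus"] * model["days"]

gpu_day_ratio = gpu_days(glow) / gpu_days(progan)                  # 560 / 32
param_ratio = glow["params_millions"] / progan["params_millions"]  # 200 / 46
pixel_ratio = (progan["resolution"] / glow["resolution"]) ** 2     # (1024 / 256)^2

print(f"GPU-days: {gpu_days(glow)} vs {gpu_days(progan)} (ratio {gpu_day_ratio:.1f})")
print(f"parameter ratio: {param_ratio:.1f}")
print(f"pixel ratio: {pixel_ratio:.0f}")
```

The GPU-day ratio comes out at 17.5 and the parameter ratio at roughly 4.3, which the text rounds to 17 and 4.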

Why are the Flow Models less efficient? We see two possible reasons. First, maximum likelihood training might be computationally harder to do than adversarial training. In particular, if any element of your training set is assigned zero probability by your generative model, you will be penalized infinitely harshly! A GAN generator, on the other hand, is only penalized indirectly for assigning zero probability to training set elements, and this penalty is less harsh. Second, normalizing flows might be an inefficient way to represent certain functions. Section 6.1 of [20] does some small experiments on expressivity, but at present we’re not aware of any in-depth analysis of this question.
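The "infinitely harsh" penalty is easy to see on a toy discrete model. The following is a minimal sketch, assuming a hand-written categorical distribution in place of a real generative model; the function name and the example data are hypothetical.

```python
import math

def negative_log_likelihood(model_probs, training_set):
    """Average negative log-likelihood of a discrete model on a training set.

    model_probs maps each outcome to its model probability. An outcome with
    probability zero contributes an infinite penalty.
    """
    total = 0.0
    for x in training_set:
        p = model_probs.get(x, 0.0)
        if p == 0.0:
            return math.inf  # log(0) diverges: one uncovered point is enough
        total -= math.log(p)
    return total / len(training_set)

training_set = ["a", "b", "c", "a"]
covers_everything = {"a": 0.5, "b": 0.25, "c": 0.25}
drops_one_mode = {"a": 0.5, "b": 0.5, "c": 0.0}

print(negative_log_likelihood(covers_everything, training_set))  # finite
print(negative_log_likelihood(drops_one_mode, training_set))     # inf
```

A single training element outside the model's support is enough to make the objective infinite, regardless of how well the rest of the data is modeled.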

We’ve discussed the trade-off between GANs and Flow Models, but what about Autoregressive Models? It turns out that Autoregressive Models can be expressed as Flow Models (because they are both reversible [14]) that are not parallelizable (footnote 5). It also turns out that Autoregressive Models are more time and parameter efficient than Flow Models. Thus, GANs are parallel and efficient but not reversible, Flow Models are reversible and parallel but not efficient, and Autoregressive Models are reversible and efficient but not parallel.

	Parallel	Efficient	Reversible
GANs	Yes	Yes	No
Flow Models	Yes	No	Yes
Autoregressive Models	No	Yes	Yes

This brings us to our first open problem:

Problem 1: What are the fundamental trade-offs between GANs and other generative models?

In particular, can we make some sort of CAP Theorem [22] type statement about reversibility, parallelism, and parameter/time efficiency?

One way to approach this problem could be to study more models that are a hybrid of multiple model families. This has been considered for hybrid GAN/Flow Models [23, 24], but we think that this approach is still underexplored.

We’re also not sure about whether maximum likelihood training is necessarily harder than GAN training. It’s true that placing zero mass on a training data point is not explicitly prohibited under the GAN training loss, but it’s also true that a sufficiently powerful discriminator will be able to do better than chance if the generator does this. It does seem like GANs are learning distributions of low support in practice [25], though.

Ultimately, we suspect that Flow Models are fundamentally less expressive per-parameter than arbitrary decoder functions, and we suspect that this is provable under certain assumptions.

What Sorts of Distributions Can GANs Model?

Most GAN research focuses on image synthesis. In particular, people train GANs on a handful of standard (in the Deep Learning community) image datasets: MNIST [26], CIFAR-10 [27], STL-10 [28], CelebA [29], and ImageNet [30]. There is some folklore about which of these datasets is ‘easiest’ to model. In particular, MNIST and CelebA are considered easier than ImageNet, CIFAR-10, or STL-10 due to being ‘extremely regular’ [31]. Others have noted that ‘a high number of classes is what makes ImageNet synthesis difficult for GANs’ [1]. These observations are supported by the empirical fact that the state-of-the-art image synthesis model on CelebA [5] generates images that seem substantially more convincing than the state-of-the-art image synthesis model on ImageNet [4].

However, we’ve had to come to these conclusions through the laborious and noisy process of trying to train GANs on ever larger and more complicated datasets. In particular, we’ve mostly studied how GANs perform on the datasets that happened to be lying around for object recognition.

As with any science, we would like to have a simple theory that explains our experimental observations. Ideally, we could look at a dataset, perform some computations without ever actually training a generative model, and then say something like ‘this dataset will be easy for a GAN to model, but not a VAE’. There has been some progress on this topic [32, 33], but we feel that more can be done. We can now state the problem:

Problem 2: Given a distribution, what can we say about how hard it will be for a GAN to model that distribution?

We might ask the following related questions as well: What do we mean by ‘model the distribution’? Are we satisfied with a low-support representation, or do we want a true density model? Are there distributions that a GAN can never learn to model? Are there distributions that are learnable for a GAN in principle, but are not efficiently learnable, for some reasonable model of resource-consumption? Are the answers to these questions actually any different for GANs than they are for other generative models?

We propose two strategies for answering these questions:

Synthetic Datasets - We can study synthetic datasets to probe what traits affect learnability. For example, in [32] the authors create a dataset of synthetic triangles. We feel that this angle is under-explored. Synthetic datasets can even be parameterized by quantities of interest, such as connectedness or smoothness, allowing for systematic study. Such a dataset could also be useful for studying other types of generative models.
Modify Existing Theoretical Results - We can take existing theoretical results and try to modify the assumptions to account for different properties of the dataset. For instance, we could take results about GANs that apply given unimodal data distributions and see what happens to them when the data distribution becomes multi-modal.
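As a rough illustration of a parameterized synthetic dataset, the sketch below samples 2-D points from Gaussians placed evenly on a circle. This is not the triangle dataset of [32]; the function `ring_of_gaussians` and its parameters are hypothetical, and using the number of modes and their spread as a stand-in for "connectedness" is an assumption made here.

```python
import math
import random

def ring_of_gaussians(n_samples, n_modes, radius=1.0, std=0.05, seed=0):
    """Sample 2-D points from n_modes Gaussians spaced evenly on a circle.

    n_modes and std are the knobs of interest: more modes and a smaller std
    give a more disconnected target distribution.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n_samples):
        k = rng.randrange(n_modes)                # pick a mode uniformly
        angle = 2.0 * math.pi * k / n_modes
        cx, cy = radius * math.cos(angle), radius * math.sin(angle)
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
        labels.append(k)
    return points, labels

# A sweep over one quantity of interest (the number of modes).
for n_modes in (1, 4, 16):
    points, labels = ring_of_gaussians(1000, n_modes)
    print(n_modes, "modes ->", len(set(labels)), "modes observed in 1000 samples")
```

Because the mode label of every sample is known, the same generator could also be used to count how many modes a trained model covers.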

How Can we Scale GANs Beyond Image Synthesis?

Aside from applications like image-to-image translation [34] and domain adaptation [35], most GAN successes have been in image synthesis. Attempts to use GANs beyond images have focused on three domains:

Text - The discrete nature of text makes it difficult to apply GANs. This is because GANs rely on backpropagating a signal from the discriminator through the generated content into the generator. There are two approaches to addressing this difficulty. The first is to have the GAN act only on continuous representations of the discrete data, as in [36]. The second is to use an actual discrete model and attempt to train the GAN using gradient estimation, as in [37]. Other, more sophisticated treatments exist [38], but as far as we can tell, none of them produce results that are competitive (in terms of perplexity) with likelihood-based language models [39].
Structured Data - What about other non-Euclidean structured data, like graphs? The study of this type of data is called geometric deep learning [40]. GANs have had limited success here, but so have other deep learning techniques, so it’s hard to tell how much the GAN aspect matters. We’re aware of one attempt to use GANs in this space [41], which has the generator produce (and the discriminator ‘critique’) random walks that are meant to resemble those sampled from a source graph.
Audio - Audio is the domain in which GANs are closest to achieving the success they’ve enjoyed with images. The first serious attempt at applying GANs to unsupervised audio synthesis is [42], in which the authors make a variety of special allowances for the fact that they are operating on audio. More recent work suggests GANs can even outperform autoregressive models on some perceptual metrics [43].
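To make "gradient estimation" for discrete outputs concrete, here is a minimal sketch of a generic score-function (REINFORCE-style) estimator on a three-token toy vocabulary. It is one common form of gradient estimation and is not claimed to be the method of [37]; the `reward` vector is a hypothetical stand-in for a discriminator's score.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(logits, reward):
    """Gradient of E_{x ~ softmax(logits)}[reward[x]] with respect to the logits."""
    p = softmax(logits)
    baseline = sum(pi * ri for pi, ri in zip(p, reward))
    return [pi * (ri - baseline) for pi, ri in zip(p, reward)]

def score_function_gradient(logits, reward, n_samples, seed=0):
    """Monte Carlo estimate of the same gradient from discrete samples only.

    Uses grad log p(x) = onehot(x) - softmax(logits). The reward is never
    differentiated, so it can be a black box such as a discriminator score.
    """
    rng = random.Random(seed)
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        x = rng.choices(range(len(logits)), weights=p)[0]   # discrete sample
        for j in range(len(logits)):
            grad[j] += reward[x] * ((1.0 if j == x else 0.0) - p[j])
    return [g / n_samples for g in grad]

logits = [0.0, 0.0, 0.0]          # a three-token "vocabulary"
reward = [1.0, 0.0, 0.0]          # stand-in for a discriminator's score per token
print(exact_gradient(logits, reward))
print(score_function_gradient(logits, reward, n_samples=20000))
```

The estimate approaches the exact gradient as the number of samples grows, but its variance is the practical obstacle: with a realistic vocabulary and sequence length, far more samples are needed than in this toy case.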

Despite these attempts, images are clearly the easiest domain for GANs. This leads us to the statement of the problem:

Problem 3: How can GANs be made to perform well on non-image data?

Does scaling GANs to other domains require new training techniques, or does it simply require better implicit priors for each domain?

We expect GANs to eventually achieve image-synthesis-level success on other continuous data, but we expect that this will require better implicit priors. Finding these priors will require thinking hard about what makes sense and is computationally feasible in a given domain.

For structured data or data that is not continuous, we’re less sure. One approach might be to make both the generator and discriminator be agents trained with reinforcement learning. Making this approach work could require large-scale computational resources [44]. Finally, this problem may just require fundamental research progress.

What can we Say About the Global Convergence of GAN Training?

Training GANs is different from training other neural networks because we simultaneously optimize the generator and discriminator for opposing objectives. Under certain assumptions (footnote 6), this simultaneous optimization is locally asymptotically stable [45, 46].

Unfortunately, it’s hard to prove interesting things about the fully general case. This is because the discriminator/generator’s loss is a non-convex function of its parameters. But all neural networks have this problem! We’d like some way to focus on just the problems created by simultaneous optimization. This brings us to our question:

Problem 4: When can we prove that GANs are globally convergent?

Which neural network convergence results can be applied to GANs?

There has been nontrivial progress on this question. Broadly speaking, there are 3 existing techniques, all of which have generated promising results but none of which have been studied to completion:

Simplifying Assumptions - The first strategy is to make simplifying assumptions about the generator and discriminator. For example, the simplified LGQ GAN [47] (linear generator, Gaussian data, and quadratic discriminator) can be shown to be globally convergent if optimized with a special technique [48, 49] and some additional assumptions (footnote 7). As another example, [50, 51] show under different simplifying assumptions that certain types of GANs perform a mixture of moment matching and maximum likelihood estimation. It seems promising to gradually relax those assumptions to see what happens. For example, we could move away from unimodal distributions [52]. This is a natural relaxation to study because ‘mode collapse’ is a standard GAN pathology.
Use Techniques from Normal Neural Networks - The second strategy is to apply techniques for analyzing normal neural networks (which are also non-convex) to answer questions about convergence of GANs. For instance, it’s argued in [53] that the non-convexity of deep neural networks isn’t a problem (footnote 8), because low-quality local minima of the loss function become exponentially rare as the network gets larger. Can this analysis be ‘lifted into GAN space’? In fact, it seems like a generally useful heuristic to take analyses of deep neural networks used as classifiers and see if they apply to GANs.
Game Theory - The final strategy is to model GAN training using notions from game theory [54, 55, 56]. These techniques yield training procedures that provably converge to some kind of approximate Nash equilibrium, but they need unreasonably large amounts of resources to do so. The ‘obvious’ next step in this case is to try to reduce those resource requirements.
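A standard toy case shows why simultaneous optimization is a difficulty in its own right, separate from non-convexity. The sketch below is not a GAN: it runs simultaneous gradient descent and ascent on the bilinear game f(x, y) = x * y, where x minimizes and y maximizes. The function name, step size, and starting point are illustrative choices.

```python
import math

def simultaneous_gradient_step(x, y, lr):
    """One simultaneous update on f(x, y) = x * y.

    x plays the minimizer (descends df/dx = y) and y plays the maximizer
    (ascends df/dy = x). Both updates use the old values of x and y.
    """
    return x - lr * y, y + lr * x

x, y, lr = 1.0, 1.0, 0.1
start = math.hypot(x, y)
for step in range(100):
    x, y = simultaneous_gradient_step(x, y, lr)

# The unique equilibrium is (0, 0), yet the distance from it grows by a
# factor of sqrt(1 + lr**2) at every step.
print(start, math.hypot(x, y))
```

Each player's objective is linear in its own variable, so nothing here is non-convex, and the iterates still spiral away from the equilibrium. This is the kind of behavior that a convergence analysis specific to simultaneous optimization has to account for.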

How Should we Evaluate GANs and When Should we Use Them?

When it comes to evaluating GANs, there are many proposals but little consensus. Suggestions include:

Inception Score and FID - Both these scores [57, 58] use a pre-trained image classifier and both have known issues [59, 60]. A common criticism is that these scores measure ‘sample quality’ and don’t really capture ‘sample diversity’.
MS-SSIM - [1] propose using MS-SSIM [61] to separately evaluate diversity, but this technique has some issues and hasn’t really caught on.
AIS - [62] propose putting a Gaussian observation model on the outputs of a GAN and using annealed importance sampling [63] to estimate the log likelihood under this model, but [23] show that estimates computed this way are inaccurate in the case where the GAN generator is also a flow model (footnote 9).
Geometry Score - [64] suggest computing geometric properties of the generated data manifold and comparing those properties to the real data.
Precision and Recall - [65] attempt to measure both the ‘precision’ and ‘recall’ of GANs.
Skill Rating - [66, 67] have shown that trained GAN discriminators can contain useful information with which evaluation can be performed.
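As one concrete instance from the list above, the Inception Score is usually defined as the exponential of the average KL divergence between each sample's predicted class distribution p(y|x) and the marginal p(y). The sketch below assumes the class probabilities have already been produced by a pre-trained classifier; here they are written by hand, and the function name is illustrative.

```python
import math

def inception_score(class_probs):
    """exp of the mean KL divergence between each p(y|x) and the marginal p(y).

    class_probs is a list of per-sample class-probability vectors, as would
    be produced by a pre-trained classifier (not included here).
    """
    n = len(class_probs)
    n_classes = len(class_probs[0])
    marginal = [sum(p[k] for p in class_probs) / n for k in range(n_classes)]
    mean_kl = 0.0
    for p in class_probs:
        # Terms with p[k] == 0 contribute nothing to the KL divergence.
        mean_kl += sum(pk * math.log(pk / marginal[k])
                       for k, pk in enumerate(p) if pk > 0.0)
    return math.exp(mean_kl / n)

confident_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
confident_but_collapsed = [[1.0, 0.0, 0.0]] * 3
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3

for name, probs in [("diverse", confident_and_diverse),
                    ("collapsed", confident_but_collapsed),
                    ("uncertain", uncertain)]:
    print(name, inception_score(probs))
```

The score depends only on predicted class distributions, so a generator that emitted one identical image per class would score as well as one with rich variation inside each class.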

Those are just a small fraction of the proposed GAN evaluation schemes. Although the Inception Score and FID are relatively popular, GAN evaluation is clearly not a settled issue. Ultimately, we think that confusion about how to evaluate GANs stems from confusion about when to use GANs. Thus, we have bundled those two questions into one:

Problem 5: When should we use GANs instead of other generative models?

How should we evaluate performance in those contexts?

What should we use GANs for? If you want an actual density model, GANs probably aren’t the best choice. There is now good experimental evidence that GANs learn a ‘low support’ representation of the target dataset [23, 24, 25], which means there may be substantial parts of the test set to which a GAN (implicitly) assigns zero likelihood.

Rather than worrying too much about this (footnote 10), we think it makes sense to focus GAN research on tasks where this is fine or even helpful. GANs are likely to be well-suited to tasks with a perceptual flavor. Graphics applications like image synthesis, image translation, image infilling, and attribute manipulation all fall under this umbrella.

How should we evaluate GANs on these perceptual tasks? Ideally, we would just use a human judge, but this is expensive. A cheap proxy is to see if a classifier can distinguish between real and fake examples. This is called a classifier two-sample test (C2ST) [68, 69, 60]. The main issue with C2STs is that if the generator has even a minor defect that’s systematic across samples (e.g., [70]), this will dominate the evaluation.
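The idea behind a classifier two-sample test can be sketched in a few lines. The example below is a deliberately tiny stand-in: 1-D Gaussian samples replace images, and a nearest-centroid rule replaces a trained neural classifier. Both substitutions, and the function name, are assumptions made for illustration.

```python
import random

def c2st_accuracy(real, fake, seed=0):
    """Held-out accuracy of a nearest-centroid classifier on 1-D samples.

    Accuracy near 0.5 means the classifier cannot tell the two sample sets
    apart; accuracy near 1.0 means they are easy to distinguish.
    """
    rng = random.Random(seed)
    data = [(x, 1) for x in real] + [(x, 0) for x in fake]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]

    # "Training": one centroid per class.
    real_train = [x for x, label in train if label == 1]
    fake_train = [x for x, label in train if label == 0]
    real_mean = sum(real_train) / len(real_train)
    fake_mean = sum(fake_train) / len(fake_train)

    # "Testing": predict the class whose centroid is closer.
    correct = 0
    for x, label in test:
        predicted = 1 if abs(x - real_mean) < abs(x - fake_mean) else 0
        correct += predicted == label
    return correct / len(test)

rng = random.Random(1)
real = [rng.gauss(0.0, 1.0) for _ in range(2000)]
good_fake = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # same distribution
bad_fake = [rng.gauss(3.0, 1.0) for _ in range(2000)]    # systematic shift

print(c2st_accuracy(real, good_fake))
print(c2st_accuracy(real, bad_fake))
```

The shifted samples illustrate the issue raised above: a single systematic difference is enough for the classifier to separate the two sets, whatever else the generator gets right.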

Ideally, we’d have a holistic evaluation that isn’t dominated by a single factor. One approach might be to make a critic that is blind to the dominant defect. But once we do this, some other defect may dominate, requiring a new critic, and so on. If we do this iteratively, we could get a kind of ‘Gram-Schmidt procedure for critics’, creating an ordered list of the most important defects and critics that ignore them. Perhaps this can be done by performing PCA on the critic activations [71] and progressively throwing out more and more of the higher variance components.

Finally, we could evaluate on humans despite the expense. This would allow us to measure the thing that we actually care about. This kind of approach can be made less expensive by predicting human answers and only interacting with a real human when the prediction is uncertain [72].

How does GAN Training Scale with Batch Size?

Large minibatches have helped to scale up image classification [73, 74, 75, 76, 77]. Can they also help us scale up GANs? Large minibatches may be especially important for effectively using highly parallel hardware accelerators [78, 79].

At first glance, it seems like the answer should be yes — after all, the discriminator in most GANs is just an image classifier. Larger batches can accelerate training if it is bottlenecked on gradient noise. However, GANs have a separate bottleneck that classifiers don’t: the training procedure can diverge. Thus, we can state our problem:

Problem 6: How does GAN training scale with batch size?

How big a role does gradient noise play in GAN training?

Can GAN training be modified so that it scales better with batch size?

There’s some evidence that increasing minibatch size improves quantitative results and reduces training time [4]. If this phenomenon is robust, it would suggest that gradient noise is a dominating factor. However, this hasn’t been systematically studied, so we believe this question remains open.
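For intuition about what "gradient noise" means here, the following sketch measures how the spread of a minibatch-averaged gradient shrinks with batch size. It models each per-example gradient as a fixed true value plus unit Gaussian noise, which is an assumption made purely for illustration and says nothing about how real GAN gradients behave.

```python
import random
import statistics

def minibatch_gradient_std(batch_size, n_batches=2000, seed=0):
    """Empirical std of a minibatch-mean gradient.

    Each per-example 'gradient' is modeled as true_gradient + unit Gaussian
    noise, which is an assumption made purely for illustration.
    """
    rng = random.Random(seed)
    true_gradient = 1.0
    estimates = []
    for _ in range(n_batches):
        batch = [true_gradient + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)
    return statistics.stdev(estimates)

for batch_size in (1, 4, 16, 64):
    print(batch_size, round(minibatch_gradient_std(batch_size), 3))
```

Under this independent-noise model the spread falls roughly as one over the square root of the batch size. Whether that reduction translates into faster or more stable GAN training is exactly the open question.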

Can alternate training procedures make better use of large batches? Optimal Transport GANs [80] theoretically have better convergence properties than normal GANs, but need a large batch size because they try to align batches of samples and training data. As a result, they seem like a promising candidate for scaling to very large batch sizes.

Finally, asynchronous SGD [81, 82, 83, 84] could be a good alternative for making use of new hardware. In this setting, the limiting factor tends to be that gradient updates are computed on ‘stale’ copies of the parameters. But GANs seem to actually benefit from training on past parameter snapshots [56, 85], so we might ask if asynchronous SGD interacts in a special way with GAN training.

What is the Relationship Between GANs and Adversarial Examples?

It’s well known [86] that image classifiers suffer from adversarial examples: human-imperceptible perturbations that cause classifiers to give the wrong output when added to images. It’s also now known that there are classification problems which can normally be efficiently learned, but are exponentially hard to learn robustly [87, 88].

Since the GAN discriminator is an image classifier, one might worry about it suffering from adversarial examples. Despite the large bodies of literature on GANs and adversarial examples, there doesn’t seem to be much work on how they relate (footnote 11). Thus, we can ask the question:

Problem 7: How does the adversarial robustness of the discriminator affect GAN training?

How can we begin to think about this problem? Consider a fixed discriminator D. An adversarial example for D would exist if there were a generator sample G(z) correctly classified as fake and a small perturbation p such that G(z) + p is classified as real. With a GAN, the concern would be that the gradient update for the generator would yield a new generator G’ where G’(z) = G(z) + p.
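The definition above can be made concrete with a toy fixed discriminator. The sketch below assumes a logistic discriminator on 4-dimensional inputs with hand-picked weights, and builds the perturbation p by stepping along the sign of the gradient; all numbers and names are hypothetical.

```python
import math

def discriminator(x, w, b):
    """Probability that x is real under a fixed logistic discriminator."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def sign(v):
    return (v > 0) - (v < 0)

# Hypothetical fixed discriminator D and generator sample G(z).
w, b = [0.5, -0.25, 0.75, -0.5], 0.0
g_z = [-0.1, 0.1, -0.1, 0.1]

# For a linear model the gradient of the logit with respect to x is w, so a
# step of size epsilon along sign(w) raises the logit by epsilon * sum(|w|).
epsilon = 0.15
p = [epsilon * sign(wi) for wi in w]
perturbed = [xi + pi for xi, pi in zip(g_z, p)]

print(discriminator(g_z, w, b))        # below 0.5: classified as fake
print(discriminator(perturbed, w, b))  # above 0.5: classified as real
```

In this toy case the logit moves from -0.2 to 0.1, so the decision flips even though no coordinate changes by more than epsilon. With image-sized inputs the sum of absolute weights is much larger, so a far smaller epsilon would be enough.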

Is this concern realistic? [89] shows that deliberate attacks on generative models can work, but we are more worried about something you might call an ‘accidental attack’. There are reasons to believe that these accidental attacks are less likely. First, the generator is only allowed to make one gradient update before the discriminator is updated again. In contrast, current adversarial attacks are typically run for tens of iterations [90]. Second, the generator is optimized given a batch of samples from the prior, and this batch is different for every gradient step. Finally, the optimization takes place in the space of parameters of the generator rather than in pixel space. However, none of these arguments decisively rules out the generator creating adversarial examples. We think this is a fruitful topic for further exploration.

Acknowledgments

We would like to thank Colin Raffel, Ben Poole, Eric Jang, Dustin Tran, Alex Kurakin, David Berthelot, Aurko Roy, Ian Goodfellow, and Matt Hoffman for helpful discussions and feedback. We would especially like to single out Chris Olah, who provided substantial feedback on the text and help with editing.

Footnotes

1. We also believe that writing this article has clarified our thinking about GANs, and we would encourage other researchers to write similar articles about their own sub-fields.
2. This statement shouldn’t be taken too literally. Those are useful terms for describing fuzzy clusters in ‘model-space’, but there are models that aren’t easy to describe as belonging to just one of those clusters. I’ve also left out VAEs [10] entirely; they’re arguably no longer considered state-of-the-art at any tasks of record.
3. Even in this case, there might still be other reasons to use adversarial training in contexts like image-to-image translation. It also might still make sense to combine adversarial training with maximum-likelihood training.
4. For instance, it’s possible that the progressive growing technique could be applied to Flow Models as well.
5. Parallelizable is somewhat imprecise in this context. We mean that sampling from Flow Models must in general be done sequentially, one observation at a time. There may be ways around this limitation though [21].
6. These assumptions are very strict. The referenced paper assumes (roughly speaking) that the equilibrium we are looking for exists and that we are already very close to it.
7. Among other things, it’s assumed that we can first learn the means of the Gaussian and then learn the variances.
8. A fact that practitioners already kind of suspected.
9. The generator being a flow model allows for computation of exact log-likelihoods in this case. [[↩]](https://distill.pub/2019/gan-open-problem