[Paper Translation] GRADIENT ORIGIN NETWORKS


原文地址:https://arxiv.org/pdf/2007.02798v5


GRADIENT ORIGIN NETWORKS

Sam Bond-Taylor∗ & Chris G. Willcocks∗

ABSTRACT

This paper proposes a new type of generative model that is able to quickly learn a latent representation without an encoder. This is achieved using empirical Bayes to calculate the expectation of the posterior, which is implemented by initialising a latent vector with zeros, then using the gradient of the log-likelihood of the data with respect to this zero vector as new latent points. The approach has similar characteristics to autoencoders, but with a simpler architecture, and is demonstrated in a variational autoencoder equivalent that permits sampling. This also allows implicit representation networks to learn a space of implicit functions without requiring a hypernetwork, retaining their representation advantages across datasets. The experiments show that the proposed method converges faster, with significantly lower reconstruction error than autoencoders, while requiring half the parameters.

1 INTRODUCTION

Observable data in nature has some parameters which are known, such as local coordinates, but also some unknown parameters such as how the data is related to other examples. Generative models, which learn a distribution over observables, are central to our understanding of patterns in nature and allow for efficient query of new unseen examples. Recently, deep generative models have received interest due to their ability to capture a broad set of features when modelling data distributions. As such, they offer direct applications such as synthesising high fidelity images (Karras et al., 2020), super-resolution (Dai et al., 2019), speech synthesis (Li et al., 2019), and drug discovery (Segler et al., 2018), as well as benefits for downstream tasks like semi-supervised learning (Chen et al., 2020).

A number of methods have been proposed such as Variational Autoencoders (VAEs, Figure 1a), which learn to encode the data to a latent space that follows a normal distribution permitting sampling (Kingma & Welling, 2014). Generative Adversarial Networks (GANs) have two competing networks, one which generates data and another which discriminates from implausible results (Goodfellow et al., 2014). Variational approaches that approximate the posterior using gradient descent (Lipton & Tripathi, 2017) and short run MCMC (Nijkamp et al., 2020) respectively have been proposed, but to obtain a latent vector for a sample, they require iterative gradient updates. Autoregressive Models (Van Den Oord et al., 2016) decompose the data distribution as the product of conditional distributions and Normalizing Flows (Rezende & Mohamed, 2015) chain together invertible functions; both methods allow exact likelihood inference. Energy-Based Models (EBMs) map data points to energy values proportional to likelihood thereby permitting sampling through the use of Monte Carlo Markov Chains (Du & Mordatch, 2019). In general to support encoding, these approaches require separate encoding networks, are limited to invertible functions, or require multiple sampling steps.


Figure 1: Gradient Origin Networks (GONs; b) use gradients (dashed lines) as encodings thus only a single network $F$ is required, which can be an implicit representation network (c). Unlike VAEs (a) which use two networks, $E$ and $D$, variational GONs (d) permit sampling with only one network.

Implicit representation learning (Park et al., 2019; Tancik et al., 2020), where a network is trained on data parameterised continuously rather than in discrete grid form, has seen a surge of interest due to the small number of parameters, speed of convergence, and ability to model fine details. In particular, sinusoidal representation networks (SIRENs) (Sitzmann et al., 2020b) achieve impressive results, modelling many signals with high precision, thanks to their use of periodic activations paired with carefully initialised MLPs. So far, however, these models have been limited to modelling single data samples, or use an additional hypernetwork or meta learning (Sitzmann et al., 2020a) to estimate the weights of a simple implicit model, adding significant complexity.

This paper proposes Gradient Origin Networks (GONs), a new type of generative model (Figure 1b) that do not require encoders or hypernetworks. This is achieved by initialising latent points at the origin, then using the gradient of the log-likelihood of the data with respect to these points as the latent space. At inference, latent vectors can be obtained in a single step without requiring iteration. GONs are shown to have similar characteristics to convolutional autoencoders and variational autoencoders using approximately half the parameters, and can be applied to implicit representation networks (such as SIRENs), allowing a space of implicit functions to be learned with a simpler overall architecture.

2 PRELIMINARIES

We first introduce some background context that will be used to derive our proposed approach.

2.1 EMPIRICAL BAYES

The concept of empirical Bayes (Robbins, 1956; Saremi & Hyvarinen, 2019), for a random variable $\mathbf{z}\sim p_ {\mathbf{z}}$ and particular observation ${\bf z}_ {0}\sim p_ {{\bf z}_ {0}}$ , provides an estimator of $\mathbf{z}$ expressed purely in terms of $p(\mathbf{z}_ {0})$ that minimises the expected squared error. This estimator can be written as a conditional mean:

$$
\hat{\mathbf{z}}(\mathbf{z}_ {0})=\int\mathbf{z}p(\mathbf{z}|\mathbf{z}_ {0})d\mathbf{z}=\int\mathbf{z}\frac{p(\mathbf{z},\mathbf{z}_ {0})}{p(\mathbf{z}_ {0})}d\mathbf{z}.
$$

Of particular relevance is the case where $\mathbf{z}_ {\mathrm{0}}$ is a noisy observation of $\mathbf{z}$ with covariance $\pmb{\Sigma}$ . In this case $p(\mathbf{z}_ {0})$ can be represented by marginalising out $\mathbf{z}$ :

$$
p(\mathbf{z}_ {0})=\int{\frac{1}{(2\pi)^{d/2}|\operatorname*{det}({\boldsymbol{\Sigma}})|^{1/2}}}\exp\Big(-(\mathbf{z}_ {0}-\mathbf{z})^{T}{\boldsymbol{\Sigma}}^{-1}(\mathbf{z}_ {0}-\mathbf{z})/2\Big)p(\mathbf{z})d\mathbf{z}.
$$

Differentiating this with respect to $\mathbf{z}_ {\mathrm{0}}$ and multiplying both sides by $\pmb{\Sigma}$ gives:

$$
\pmb{\Sigma}\nabla_ {\mathbf{z}_ {0}}p(\mathbf{z}_ {0})=\int(\mathbf{z}-\mathbf{z}_ {0})p(\mathbf{z},\mathbf{z}_ {0})d\mathbf{z}=\int\mathbf{z}p(\mathbf{z},\mathbf{z}_ {0})d\mathbf{z}-\mathbf{z}_ {0}p(\mathbf{z}_ {0}).
$$

After dividing through by $p(\mathbf{z}_ {0})$ and combining with Equation 1 we obtain a closed form estimator of $\mathbf{z}$ (Miyasawa, 1961) written in terms of the score function $\nabla\log p(\mathbf{z}_ {0})$ (Hyvärinen, 2005):

$$
\hat{\mathbf{z}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\Sigma\nabla_ {\mathbf{z}_ {0}}\log p(\mathbf{z}_ {0}).
$$

This optimal procedure is achieved in what can be interpreted as a single gradient descent step, with no knowledge of the prior $p(\mathbf{z})$ . By rearranging Equation 4, a definition of $\nabla\log p(\mathbf{z}_ {0})$ can be derived; this can be used to train models that approximate the score function (Song & Ermon, 2019).
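
As a concrete check of this estimator, the following sketch (an illustration, not part of the original paper) assumes a standard normal prior and isotropic noise so that the score of $p(\mathbf{z}_ {0})$ is available in closed form; the single-step estimate then coincides with the exact posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 8, 0.5                      # latent dimension and noise variance (assumed)

# Sample z ~ N(0, I) and a noisy observation z0 = z + N(0, sigma2 * I).
z = rng.standard_normal(d)
z0 = z + np.sqrt(sigma2) * rng.standard_normal(d)

# With a standard normal prior, z0 ~ N(0, (1 + sigma2) I), so the score of
# p(z0) is known analytically: grad log p(z0) = -z0 / (1 + sigma2).
score = -z0 / (1.0 + sigma2)

# Empirical Bayes (Miyasawa) estimate: z_hat = z0 + Sigma * grad log p(z0).
z_hat = z0 + sigma2 * score

# For this Gaussian case the single-step estimate equals the posterior mean E[z | z0].
posterior_mean = z0 / (1.0 + sigma2)
assert np.allclose(z_hat, posterior_mean)
```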

2.2 VARIATIONAL AUTOENCODERS

Variational Autoencoders (VAEs; Kingma & Welling 2014) are a probabilistic take on standard autoencoders that permit sampling. A latent-based generative model $p_ {\pmb{\theta}}(\mathbf{x}|\mathbf{z})$ is defined with a normally distributed prior over the latent variables, $p_ {\pmb\theta}(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},\pmb{I}_ {d})$ . $p_ {\pmb{\theta}}(\mathbf{x}|\mathbf{z})$ is typically parameterised as a Bernoulli, Gaussian, multinomial distribution, or mixture of logits. In this case, the true posterior $p_ {\pmb{\theta}}(\mathbf{z}|\mathbf{x})$ is intractable, so a secondary encoding network $q_ {\phi}(\mathbf{z}|\mathbf{x})$ is used to approximate the true posterior; the pair of networks thus resembles a traditional autoencoder. This allows VAEs to approximate $p_ {\theta}(\mathbf{x})$ by maximising the evidence lower bound (ELBO), defined as:

$$
\log p_ {\theta}(\mathbf{x})\geq\mathcal{L}^{\mathrm{VAE}}=-D_ {\mathrm{KL}}(q_ {\phi}(\mathbf{z}|\mathbf{x})\,||\,\mathcal{N}(\mathbf{0},I_ {d}))+\mathbb{E}_ {q_ {\phi}(\mathbf{z}|\mathbf{x})}[\log p_ {\theta}(\mathbf{x}|\mathbf{z})].
$$

To optimise this lower bound with respect to $\pmb{\theta}$ and $\phi$ , gradients must be backpropagated through the stochastic process of generating samples from ${\bf z}^{\prime}\sim q_ {\phi}({\bf z}|{\bf x})$ . This is permitted by reparameterising $\mathbf{z}$ using the differentiable function $\mathbf{z}^{\prime}=\mu(\mathbf{z})+\sigma(\mathbf{z})\odot\epsilon$ , where $\epsilon\sim\mathcal{N}(\mathbf{0},\pmb{I}_ {d})$ and $\mu(\mathbf{z})$ and $\sigma(\mathbf{z})^{2}$ are the mean and variance respectively of a multivariate Gaussian distribution with diagonal covariance.
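
A minimal PyTorch sketch of this reparameterisation and of the closed-form KL term appearing in the ELBO above is given below; predicting the log-variance rather than $\sigma$ directly is an implementation convenience assumed here, not something prescribed by the text.

```python
import torch

def reparameterise(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # z' = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and sigma.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
```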

3 METHOD

Consider some dataset $\mathbf x\sim p_ {d}$ of continuous or discrete signals $\mathbf{x}\in\mathbb{R}^{m}$ . It is typical to assume that the data can be represented by low dimensional latent variables $\mathbf{z}\in\mathbb{R}^{k}$ , which can be used by a generative neural network to reconstruct the data. These variables are often estimated through the use of a secondary encoding network that is trained concurrently with the generative network. An encoding network adds additional complexity (and parameters) to the model, it can be difficult to balance the capacities of the two networks, and for complex hierarchical generative models designing a suitable architecture can be difficult. This has led some to instead approximate latent variables by performing gradient descent on the generative network (Bojanowski et al., 2018; Nijkamp et al., 2020). While this addresses the aforementioned problems, it significantly increases the run time of the inference process, introduces additional hyperparameters to tune, and convergence is not guaranteed.

3.1 GRADIENT ORIGIN NETWORKS

We propose a generative model that consists only of a decoding network, using empirical Bayes to approximate the posterior in a single step. That is, for some data point $\mathbf{x}$ and latent variable $\mathbf{z}\sim p_ {\mathbf{z}}$ , we wish to find an approximation of $p(\mathbf{z}|\mathbf{x})$ . Given some noisy observation $\mathbf z_ {0}=\mathbf z+\mathcal{N}(\mathbf{0},\pmb I_ {d})$ of $\mathbf{z}$ , empirical Bayes can be applied to approximate $\mathbf{z}$ . Specifically, since we wish to approximate $\mathbf{z}$ conditioned on $\mathbf{x}$ , we instead calculate $\hat{\bf z}_ {\bf x}$ , the least squares estimate of $p(\mathbf{z}|\mathbf{x})$ (proof in Appendix A):

$$
\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\log p(\mathbf{z}_ {0}|\mathbf{x}).
$$

Using Bayes’ rule, $\log p(\mathbf{z}_ {0}|\mathbf{x})$ can be written as $\log p(\mathbf{z}_ {0}|\mathbf{x})=\log p(\mathbf{x}|\mathbf{z}_ {0})+\log p(\mathbf{z}_ {0})-\log p(\mathbf{x})$ . Since $\log p(\mathbf{x})$ is a normalising constant that does not affect the gradient, we can rewrite Equation 6 in terms only of the decoding network and $p(\mathbf{z}_ {0})$ :

$$
\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\Big(\log p(\mathbf{x}|\mathbf{z}_ {0})+\log p(\mathbf{z}_ {0})\Big).
$$

It still remains, however, how to construct a noisy estimate of $\mathbf{z}_ {\mathrm{0}}$ with no knowledge of $\mathbf{z}$ . If we assume $\mathbf{z}$ follows a known distribution, then it is possible to develop reasonable estimates. For instance, if we assume $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},I_ {d})$ then we could sample from $p(\mathbf{z}_ {0})=\mathcal{N}(\mathbf{z}_ {0};\mathbf{0},2I_ {d})$ ; however, this could be far from the true distribution of $p(\mathbf{z}_ {0}|\mathbf{z})=\mathcal{N}(\mathbf{z}_ {0};\mathbf{z},\pmb{I}_ {d})$ . Instead we propose initialising $\mathbf{z}_ {\mathrm{0}}$ at the origin since this is the distribution’s mean. Initialising at a constant position decreases the input variation and thus simplifies the optimisation procedure. Naturally, how $p(\mathbf{x}|\mathbf{z})$ is modelled affects $\hat{\mathbf{z}}_ {\mathbf{x}}$ . While mean-field models result in $\hat{\mathbf{z}}_ {\mathbf{x}}$ that are linear functions of $\mathbf{x}$ , conditional autoregressive models, for instance, result in non-linear $\hat{\mathbf{z}}_ {\mathbf{x}}$ ; multiple gradient steps also induce non-linearity. However, we show that a single step works well on high dimensional data, suggesting that linear encoders, which normally do not scale to high dimensional data, are effective in this case.
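
As an illustration of Equation 7, the sketch below (PyTorch; it assumes a unit-variance Gaussian likelihood so that $\log p(\mathbf{x}|\mathbf{z}_ {0})$ reduces to a negative squared error up to a constant, and the assumed $\mathcal{N}(\mathbf{0},2I_ {d})$ marginal for $\mathbf{z}_ {0}$) computes $\hat{\mathbf{z}}_ {\mathbf{x}}$ with a single automatic-differentiation step from the origin.

```python
import torch

def latent_estimate(decoder, x, latent_dim):
    # z0 is initialised at the origin, the mean of the assumed latent distribution.
    z0 = torch.zeros(x.size(0), latent_dim, device=x.device, requires_grad=True)

    # log p(x | z0): a unit-variance Gaussian likelihood, i.e. -0.5 * ||x - F(z0)||^2
    # up to a constant; other likelihoods from the text would slot in the same way.
    log_px = -0.5 * ((x - decoder(z0)) ** 2).flatten(1).sum(dim=1)

    # log p(z0) under the assumed N(0, 2I) marginal of the noisy observation.
    log_pz0 = -0.25 * (z0 ** 2).sum(dim=1)

    # Equation 7: a single empirical Bayes step from the origin.
    grad = torch.autograd.grad((log_px + log_pz0).sum(), z0, create_graph=True)[0]
    return z0 + grad
```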

3.2 AUTOENCODING WITH GONS

Before exploring GONs as generative models, we discuss the case where the prior $p(\mathbf{z})$ is unknown; such a model is referred to as an autoencoder. As such, the distribution $p(\mathbf{z}_ {0}\vert\mathbf{z})$ is also unknown, thus it is again unclear how we can construct a noisy estimate of $\mathbf{z}$ . By training a model end-to-end where $\mathbf{z}_ {\mathrm{0}}$ is chosen as the origin, however, a prior is implicitly learned over $\mathbf{z}$ such that it is reachable from $\mathbf{z}_ {\mathrm{0}}$ . Although $p(\mathbf{z})$ is unknown, we do not wish to impose a prior on $\mathbf{z}_ {\mathrm{0}}$ ; the term which enforces this in Equation 7 is $\log p(\mathbf{z}_ {0})$ , so we can safely ignore this term and simply maximise the likelihood of the data given $\mathbf{z}_ {\mathrm{0}}$ . Our estimator of $\mathbf{z}$ can therefore be defined simply as $\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\log p(\mathbf{x}|\mathbf{z}_ {0})$ , which can otherwise be interpreted as a single gradient descent step on the conditional log-likelihood of the data. From this estimate, the data can be reconstructed by passing $\hat{\bf z}_ {\bf x}$ through the decoder to parameterise $p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}})$ . This procedure can be viewed more explicitly when using a neural network $F\colon\mathbb{R}^{k}\rightarrow\mathbb{R}^{m}$ to output the mean of $p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}})$ , parameterised by a normal distribution; in this case the loss function is defined in terms of the mean squared error loss $\mathcal{L}^{\mathrm{MSE}}$ :

$$
G_ {\mathbf{x}}=\mathcal{L}^{\mathrm{MSE}}(\mathbf{x},F(-\nabla_ {\mathbf{z}_ {0}}\mathcal{L}^{\mathrm{MSE}}(\mathbf{x},F(\mathbf{z}_ {0})))).
$$

The gradient computation thereby plays a similar role to an encoder, while $F$ can be viewed as a decoder, with the outer loss term determining the overall reconstruction quality. Using a single network to perform both roles has the advantage of simplifying the overall architecture, avoiding the need to balance networks, and avoiding bottlenecks; this is demonstrated in Figure 1b which provides a visualisation of the GON process.
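
A minimal PyTorch training step implementing this objective is sketched below (summed squared error is used for $\mathcal{L}^{\mathrm{MSE}}$ ; the network, optimiser, and latent size are placeholders); `create_graph=True` is what allows the outer loss to backpropagate through the inner gradient. At inference, the latent code for a new sample is obtained with the inner pass alone, i.e. a single gradient evaluation at the origin.

```python
import torch

def gon_step(F, x, latent_dim, opt):
    """One autoencoding GON training step (a sketch of the objective above)."""
    # Inner pass: decode the origin, then take the gradient of the reconstruction
    # error with respect to z0; this gradient plays the role of the encoder.
    z0 = torch.zeros(x.size(0), latent_dim, device=x.device, requires_grad=True)
    inner_loss = ((x - F(z0)) ** 2).sum()
    z = -torch.autograd.grad(inner_loss, z0, create_graph=True)[0]

    # Outer pass: decode the estimated latents and minimise reconstruction error.
    outer_loss = ((x - F(z)) ** 2).sum()

    opt.zero_grad()
    outer_loss.backward()   # backpropagates through the inner gradient as well
    opt.step()
    return outer_loss.detach(), z.detach()
```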

3.3 VARIATIONAL GONS

The variational approach can be naturally applied to GONs, allowing sampling in a single step while only requiring the generative network, reducing the parameters necessary. Similar to before, a feedforward neural network $F$ parameterises $p(\mathbf{x}|\mathbf{z})$ , while the expectation of $p(\mathbf{z}|\mathbf{x})$ is calculated with empirical Bayes. A normal prior is assumed over $p(\mathbf{z})$ , thus Equation 7 can be written as:

$$
\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\Big(\log p(\mathbf{x}|\mathbf{z}_ {0})+\log\mathcal{N}(\mathbf{z}_ {0};\mathbf{0},2I_ {d})\Big),
$$

where $\mathbf{z}_ {0}=\mathbf{0}$ as discussed in Section 3.1. While it would be possible to use this estimate directly within a constant-variance VAE (Ghosh et al., 2019), we opt to incorporate the reparameterisation trick into the generative network as a stochastic layer, to represent the distribution over which $\mathbf{x}$ could be encoded, using empirical Bayes to estimate $\mathbf{z}$ . Similar to the autoencoding approach, we could ignore $\log p(\mathbf{z}_ {0})$ ; however, we find assuming a normally distributed $\mathbf{z}_ {\mathrm{0}}$ implicitly constrains $\mathbf{z}$ , aiding the optimisation procedure. Specifically, the forward pass of $F$ is implemented as follows: $\hat{\mathbf{z}}_ {\mathbf{x}}$ (or $\mathbf{z}_ {\mathrm{0}}$ ) is mapped by linear transformations to $\mu(\hat{\bf z}_ {\bf x})$ and $\sigma(\hat{\bf z}_ {\bf x})$ and the reparameterisation trick is applied; subsequently the further transformations formerly defined as $F$ in the GON formulation are applied, providing parameters for $p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}})$ . Training is performed end-to-end, maximising the ELBO:

$$
\mathcal{L}^{\mathrm{VAE}}=-D_ {\mathrm{KL}}(\mathcal{N}(\mu(\hat{\mathbf{z}}_ {\mathbf{x}}),\sigma(\hat{\mathbf{z}}_ {\mathbf{x}})^{2})\,||\,\mathcal{N}(\mathbf{0},I_ {d}))+\log p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}}).
$$

These steps are shown in Figure 1d, which in practice has a simple implementation.
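
A sketch of these steps in PyTorch is given below; the hidden layer sizes, the unit-variance Gaussian likelihood, and the use of a log-variance head are illustrative assumptions. The origin is passed through the stochastic network, a single empirical Bayes step gives $\hat{\mathbf{z}}_ {\mathbf{x}}$ , and the negative ELBO is minimised; sampling then amounts to drawing from $\mathcal{N}(\mathbf{0},I_ {d})$ and applying the transformations after the stochastic layer (`model.decoder` in this sketch).

```python
import torch
from torch import nn

class VariationalGON(nn.Module):
    """Sketch of a variational GON; layer sizes and likelihood are assumptions."""
    def __init__(self, latent_dim, data_dim, hidden=256):
        super().__init__()
        self.latent_dim = latent_dim
        self.to_mu = nn.Linear(latent_dim, latent_dim)
        self.to_log_var = nn.Linear(latent_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ELU(), nn.Linear(hidden, data_dim))

    def forward(self, z):
        # Stochastic layer: linear maps to mu and sigma, then the reparameterisation trick.
        mu, log_var = self.to_mu(z), self.to_log_var(z)
        z_sample = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z_sample), mu, log_var

def variational_gon_loss(model, x):
    # Inner pass from the origin: gradient of log p(x|z0) + log N(z0; 0, 2I) w.r.t. z0.
    z0 = torch.zeros(x.size(0), model.latent_dim, device=x.device, requires_grad=True)
    recon, _, _ = model(z0)
    inner = -0.5 * ((x - recon) ** 2).sum() - 0.25 * (z0 ** 2).sum()
    z_hat = z0 + torch.autograd.grad(inner, z0, create_graph=True)[0]

    # Outer pass: negative ELBO with a unit-variance Gaussian likelihood.
    recon, mu, log_var = model(z_hat)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    nll = 0.5 * ((x - recon) ** 2).sum()
    return nll + kl
```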

3.4 IMPLICIT GONS

In the field of implicit representation networks, the aim is to learn a neural approximation of a function $\Phi$ that satisfies an implicit equation of the form:

$$
R({\bf c},\Phi,\nabla\Phi,\nabla^{2}\Phi,\ldots)=0,\quad\Phi:{\bf c}\mapsto\Phi({\bf c}),
$$

where $R$ ’s definition is problem dependent but often corresponds to a loss function. Equations with this structure arise in a myriad of fields, namely 3D modelling, image, video, and audio representation (Sitzmann et al., 2020b). In these cases, data samples $\mathbf{x}=(\mathbf{c},\Phi_ {\mathbf{x}}(\mathbf{c}))$ can be represented in this form in terms of coordinates $\mathbf{c}\in\mathbb{R}^{n}$ and the corresponding data at those points, $\Phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ . Due to the continuous nature of $\Phi$ , data with large spatial dimensions can be represented much more efficiently than approaches using convolutions, for instance. Despite these benefits, representing a distribution of data points using implicit methods is more challenging, leading to the use of hypernetworks which estimate the weights of an implicit network for each data point (Sitzmann et al., 2020a); this increases the number of parameters and adds significant complexity.

By applying the GON procedure to implicit representation networks, it is possible to learn a space of implicit functions without the need for additional networks. We assume there exist latent vectors $\mathbf{z}\in\mathbb{R}^{k}$ corresponding to data samples; the concatenation of these latent variables with the data coordinates can therefore geometrically be seen as points on a manifold that describe the full dataset, in keeping with the manifold hypothesis (Fefferman et al., 2016). An implicit Gradient Origin Network can be trained on this space to mimic $\Phi$ using a neural network $F\colon\mathbb{R}^{n+k}\rightarrow\mathbb{R}^{m}$ , thereby learning a space of implicit functions by modifying Equation 8:

$$
I_ {\mathbf{x}}=\int\mathcal{L}\Big(\Phi_ {\mathbf{x}}(\mathbf{c}),F\Big(\mathbf{c}\oplus-\nabla_ {\mathbf{z}_ {0}}\int\mathcal{L}\big(\Phi_ {\mathbf{x}}(\mathbf{c}),F(\mathbf{c}\oplus\mathbf{z}_ {0})\big)d\mathbf{c}\Big)\Big)d\mathbf{c},
$$

where both integrations are performed over the space of coordinates and $\mathbf{c}\oplus\mathbf{z}$ represents a concatenation (Figure 1c). Similar to the non-implicit approach, the computation of latent vectors can be expressed as $\mathbf{z}=-\nabla_ {\mathbf{z}_ {0}}\int\mathcal{L}\big(\Phi_ {\mathbf{x}}(\mathbf{c}),F(\mathbf{c}\oplus\mathbf{z}_ {0})\big)d\mathbf{c}$ . In particular, we parameterise $F$ with a SIREN (Sitzmann et al., 2020b), finding that it is capable of modelling both high and low frequency components in high dimensional spaces.
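
The sketch below illustrates this setup in PyTorch: a single latent vector per sample is concatenated onto every coordinate, and the inner integral is approximated by a sum over the sampled coordinates. The sine layer is a simplified stand-in for a SIREN layer (it omits SIREN's careful initialisation), and the shapes and layer sizes are assumptions.

```python
import torch
from torch import nn

class SineLayer(nn.Module):
    """Linear layer with a sine activation; a simplified stand-in for a SIREN layer."""
    def __init__(self, in_features, out_features, w0=30.0):
        super().__init__()
        self.linear, self.w0 = nn.Linear(in_features, out_features), w0

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

def implicit_gon_loss(F, coords, values, latent_dim):
    """coords: (B, N, n) coordinates; values: (B, N, m) signal values at those coordinates."""
    b, n_points, _ = coords.shape
    z0 = torch.zeros(b, 1, latent_dim, device=coords.device, requires_grad=True)

    # Inner pass: the same latent vector is concatenated onto every coordinate.
    inp = torch.cat([coords, z0.expand(-1, n_points, -1)], dim=-1)
    inner = ((values - F(inp)) ** 2).sum()
    z = -torch.autograd.grad(inner, z0, create_graph=True)[0]

    # Outer pass with the estimated latents.
    inp = torch.cat([coords, z.expand(-1, n_points, -1)], dim=-1)
    return ((values - F(inp)) ** 2).sum()

# e.g. F = nn.Sequential(SineLayer(2 + 32, 256), SineLayer(256, 256), nn.Linear(256, 1))
```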

3.5 GON GENERALISATIONS

There are a number of interesting generalisations that make this approach applicable to other tasks. In Equations 8 and 12 we use the same $\mathcal{L}$ in both the inner term and outer term; however, as with variational GONs, it is possible to use different functions. Through this, a GON can be trained concurrently as a generative model and classifier, or, with some minor modifications to the loss involving the addition of the categorical cross-entropy loss function $\mathcal{L}^{\mathrm{CCE}}$ to maximise the likelihood of classification, solely as a classifier:

$$
C_ {\mathbf{x}}=\mathcal{L}^{\mathrm{CCE}}(f(-\nabla_ {\mathbf{z}_ {0}}\mathcal{L}(\mathbf{x},F(\mathbf{z}_ {0}))),\mathbf{y}),
$$

where $\mathbf{y}$ are the target labels and $f$ is a single linear layer followed by softmax. Another possibility is modality conversion for translation tasks; in this case the inner reconstruction loss is performed on the source signal and the outer loss on the target signal.
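
A sketch of the pure-classifier variant is given below (PyTorch; `nn.functional.cross_entropy` expects logits and applies the softmax internally, so $f$ is just a linear layer producing class scores in this sketch).

```python
import torch
from torch import nn

def gon_classifier_loss(F, f, x, y, latent_dim):
    """Inner loss is reconstructive; the outer loss is cross-entropy on the gradient encodings."""
    z0 = torch.zeros(x.size(0), latent_dim, device=x.device, requires_grad=True)
    inner = ((x - F(z0)) ** 2).sum()
    z = -torch.autograd.grad(inner, z0, create_graph=True)[0]
    logits = f(z)                      # f: a single linear layer mapping latents to class logits
    return nn.functional.cross_entropy(logits, y)
```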

3.6 JUSTIFICATION

Beyond empirical Bayes, we provide some additional analysis on why a single gradient step is in general sufficient as an encoder. Firstly, the gradient of the loss inherently offers substantial information about the data, making it a good encoder. Secondly, a good latent space should satisfy local consistency (Zhou et al., 2003; Kamnitsas et al., 2018). GONs satisfy this since similar data points will have similar gradients due to the constant latent initialisation. As such, the network needs only to find an equilibrium where its prior is the gradient operation, allowing for significant freedom. Finally, since GONs are trained by backpropagating through empirical Bayes, there are benefits to using an activation function whose second derivative is non-zero.

4 RESULTS

We evaluate Gradient Origin Networks on a variety of image datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Small NORB (LeCun et al., 2004), COIL-20 (Nane et al., 1996), CIFAR-10 (Krizhevsky et al., 2009), CelebA (Liu et al., 2015), and LSUN Bedroom (Yu et al., 2015). Simple models are used: for small images, implicit GONs consist of approximately 4 hidden layers of 256 units and convolutional GONs consist of 4 convolution layers with Batch Normalization (Ioffe & Szegedy, 2015) and the ELU non-linearity (Clevert et al., 2016), for larger images the same general architecture is used, scaled up with additional layers; all training is performed with the Adam optimiser (Kingma & Ba, 2015). Error bars in graphs represent standard deviations over three runs.
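
For reference, a decoder in the spirit of the convolutional GONs described above might look like the following sketch; the channel widths, kernel sizes, strides, sigmoid output, and $32\times32$ output resolution are illustrative assumptions rather than the exact configuration used in the experiments.

```python
from torch import nn

# Illustrative small convolutional GON decoder: roughly four convolution layers with
# BatchNorm and ELU, as described above; the exact layout here is an assumption.
def make_decoder(latent_dim=32, out_channels=1, width=64):
    return nn.Sequential(
        nn.Unflatten(1, (latent_dim, 1, 1)),
        nn.ConvTranspose2d(latent_dim, width * 4, 4, 1, 0), nn.BatchNorm2d(width * 4), nn.ELU(),  # 4x4
        nn.ConvTranspose2d(width * 4, width * 2, 4, 2, 1), nn.BatchNorm2d(width * 2), nn.ELU(),   # 8x8
        nn.ConvTranspose2d(width * 2, width, 4, 2, 1), nn.BatchNorm2d(width), nn.ELU(),           # 16x16
        nn.ConvTranspose2d(width, out_channels, 4, 2, 1), nn.Sigmoid(),                           # 32x32
    )
```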

Table 1: Validation reconstruction loss (summed squared error) over 500 epochs. For GLO, latents are assigned to data points and jointly optimised with the network. GONs significantly outperform other single step methods and achieve the lowest reconstruction error on four of the five datasets.

| | MNIST | Fashion-MNIST | Small NORB | COIL-20 | CIFAR-10 |
|---|---|---|---|---|---|
| Single step: | | | | | |
| GON (ours) | 0.41±0.01 | 1.00±0.01 | 0.17±0.01 | 5.35±0.01 | 9.12±0.03 |
| AE | 1.33±0.02 | 1.75±0.02 | 0.73±0.01 | 6.05±0.11 | 12.24±0.05 |
| Tied AE | 2.16±0.03 | 2.45±0.02 | 0.93±0.03 | 6.68±0.13 | 14.12±0.34 |
| 1 Step Detach | 8.17±0.15 | 7.76±0.21 | 1.84±0.01 | 15.72±0.70 | 30.17±0.29 |
| Multiple steps: | | | | | |
| 10 Step Detach | 8.13±0.50 | 7.22±0.92 | 1.78±0.02 | 15.48±0.60 | 28.68±1.16 |
| 10 Step | 0.42±0.01 | 1.01±0.01 | 0.17±0.01 | 5.36±0.01 | 9.19±0.01 |
| GLO | 0.70±0.01 | 1.78±0.01 | 0.61±0.01 | 3.27±0.02 | 9.12±0.03 |

| | MNIST | Fashion-MNIST | Small NORB | COIL-20 | CIFAR-10 | CelebA |
|---|---|---|---|---|---|---|
| VGON (ours) | 1.06 | 3.30 | 2.34 | 3.44 | 5.85 | 5.41 |
| VAE | 1.15 | 3.23 | 2.54 | 3.63 | 5.94 | 5.59 |

Table 2: Validation ELBO in bits/dim over 1000 epochs (CelebA is trained over 150 epochs).

4.1 QUANTITATIVE EVALUATION

A quantitative evaluation of the representation ability of GONs is performed in Table 1 against a number of baseline approaches. We compare against the single step methods: a standard autoencoder, an autoencoder with tied weights (Gedeon, 1998), and a GON with gradients detached from z, as well as the multi-step methods: 10 gradient descent steps per data point, with and without detaching gradients, and a GLO (Bojanowski et al., 2018) which assigns a persistent latent vector to each data point and optimises them with gradient descent, therefore taking orders of magnitude longer to reconstruct validation data than other approaches. For the 10-step methods, a learning rate of 0.1 is applied as used in other literature (Nijkamp et al., 2019); the GLO is trained with MSE for consistency with the other approaches and we do not project latents onto a hypersphere as proposed by the authors, since in this experiment sampling is unimportant and this would handicap the approach. GONs achieve much lower validation loss than other single step methods and are competitive with the multi-step approaches; in fact, GONs achieve the lowest loss on four out of the five datasets.

Our variational GON is compared with a VAE, whose decoder is the same as the GON, quantitatively in terms of ELBO on the test set in Table 2. We find that the representation ability of GONs is pertinent here also, allowing the variational GON to achieve a lower ELBO in bits/dim on five out of the six datasets tested. Both models were trained with the original VAE formulation for fairness; however, we note that variations aimed at improving VAE sample quality such as $\beta$ -VAEs (Higgins et al., 2017) are also applicable to variational GONs to the same effect.

An ablation study is performed, comparing convolutional GONs with autoencoders whose decoders have exactly the same architecture as the GONs (Figure 2a) and where the autoencoder decoders mirror their respective encoders. The mean reconstruction loss over the test set is measured after each epoch for a variety of latent sizes. Despite having half the number of parameters and linear encodings, GONs achieve lower validation loss than autoencoders.

[Figure 2: (a) GONs overfit less than standard autoencoders; (b) multiple latent update steps (*=gradients detached, †=not detached).]
