GRADIENT ORIGIN NETWORKS
Sam Bond-Taylor∗ & Chris G. Willcocks∗
ABSTRACT
This paper proposes a new type of generative model that is able to quickly learn a latent representation without an encoder. This is achieved using empirical Bayes to calculate the expectation of the posterior, which is implemented by initialising a latent vector with zeros, then using the gradient of the log-likelihood of the data with respect to this zero vector as new latent points. The approach has similar characteristics to autoencoders, but with a simpler architecture, and is demonstrated in a variational autoencoder equivalent that permits sampling. This also allows implicit representation networks to learn a space of implicit functions without requiring a hypernetwork, retaining their representation advantages across datasets. The experiments show that the proposed method converges faster, with significantly lower reconstruction error than autoencoders, while requiring half the parameters.
1 INTRODUCTION
Observable data in nature has some parameters which are known, such as local coordinates, but also some unknown parameters such as how the data is related to other examples. Generative models, which learn a distribution over observables, are central to our understanding of patterns in nature and allow for efficient query of new unseen examples. Recently, deep generative models have received interest due to their ability to capture a broad set of features when modelling data distributions. As such, they offer direct applications such as synthesising high fidelity images (Karras et al., 2020), super-resolution (Dai et al., 2019), speech synthesis (Li et al., 2019), and drug discovery (Segler et al., 2018), as well as benefits for downstream tasks like semi-supervised learning (Chen et al., 2020).
A number of methods have been proposed, such as Variational Autoencoders (VAEs, Figure 1a), which learn to encode the data to a latent space that follows a normal distribution, permitting sampling (Kingma & Welling, 2014). Generative Adversarial Networks (GANs) have two competing networks, one which generates data and another which discriminates plausible results from implausible ones (Goodfellow et al., 2014). Variational approaches that approximate the posterior using gradient descent (Lipton & Tripathi, 2017) and short-run MCMC (Nijkamp et al., 2020) respectively have been proposed, but to obtain a latent vector for a sample, they require iterative gradient updates. Autoregressive models (Van Den Oord et al., 2016) decompose the data distribution as the product of conditional distributions, and Normalizing Flows (Rezende & Mohamed, 2015) chain together invertible functions; both methods allow exact likelihood inference. Energy-Based Models (EBMs) map data points to energy values proportional to likelihood, thereby permitting sampling through the use of Markov chain Monte Carlo (Du & Mordatch, 2019). In general, to support encoding, these approaches require separate encoding networks, are limited to invertible functions, or require multiple sampling steps.

Figure 1: Gradient Origin Networks (GONs; b) use gradients (dashed lines) as encodings thus only a single network $F$ is required, which can be an implicit representation network (c). Unlike VAEs (a) which use two networks, $E$ and $D$ , variational GONs (d) permit sampling with only one network.
Implicit representation learning (Park et al., 2019; Tancik et al., 2020), where a network is trained on data parameterised continuously rather than in discrete grid form, has seen a surge of interest due to the small number of parameters, speed of convergence, and ability to model fine details. In particular, sinusoidal representation networks (SIRENs) (Sitzmann et al., 2020b) achieve impressive results, modelling many signals with high precision, thanks to their use of periodic activations paired with carefully initialised MLPs. So far, however, these models have been limited to modelling single data samples, or use an additional hypernetwork or meta-learning (Sitzmann et al., 2020a) to estimate the weights of a simple implicit model, adding significant complexity.
This paper proposes Gradient Origin Networks (GONs), a new type of generative model (Figure 1b) that does not require encoders or hypernetworks. This is achieved by initialising latent points at the origin, then using the gradient of the log-likelihood of the data with respect to these points as the latent space. At inference, latent vectors can be obtained in a single step without requiring iteration. GONs are shown to have similar characteristics to convolutional autoencoders and variational autoencoders while using approximately half the parameters, and can be applied to implicit representation networks (such as SIRENs), allowing a space of implicit functions to be learned with a simpler overall architecture.
2 PRELIMINARIES
We first introduce some background context that will be used to derive our proposed approach.
2.1 EMPIRICAL BAYES
The concept of empirical Bayes (Robbins, 1956; Saremi & Hyvarinen, 2019), for a random variable $\mathbf{z}\sim p_ {\mathbf{z}}$ and particular observation ${\bf z}_ {0}\sim p_ {{\bf z}_ {0}}$ , provides an estimator of $\mathbf{z}$ expressed purely in terms of $p(\mathbf{z}_ {0})$ that minimises the expected squared error. This estimator can be written as a conditional mean:
$$
\hat{\mathbf{z}}(\mathbf{z}_ {0})=\int\mathbf{z}p(\mathbf{z}|\mathbf{z}_ {0})d\mathbf{z}=\int\mathbf{z}\frac{p(\mathbf{z},\mathbf{z}_ {0})}{p(\mathbf{z}_ {0})}d\mathbf{z}.
$$
Of particular relevance is the case where $\mathbf{z}_ {0}$ is a noisy observation of $\mathbf{z}$ with covariance $\pmb{\Sigma}$ . In this case $p(\mathbf{z}_ {0})$ can be represented by marginalising out $\mathbf{z}$ :
$$
p(\mathbf{z}_ {0})=\int\frac{1}{(2\pi)^{d/2}|\operatorname*{det}(\boldsymbol{\Sigma})|^{1/2}}\exp\Big(-(\mathbf{z}_ {0}-\mathbf{z})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{z}_ {0}-\mathbf{z})/2\Big)p(\mathbf{z})d\mathbf{z}.
$$
Differentiating this with respect to $\mathbf{z}_ {\mathrm{0}}$ and multiplying both sides by $\pmb{\Sigma}$ gives:
$$
\pmb{\Sigma}\nabla_ {\mathbf{z}_ {0}}p(\mathbf{z}_ {0})=\int(\mathbf{z}-\mathbf{z}_ {0})p(\mathbf{z},\mathbf{z}_ {0})d\mathbf{z}=\int\mathbf{z}p(\mathbf{z},\mathbf{z}_ {0})d\mathbf{z}-\mathbf{z}_ {0}p(\mathbf{z}_ {0}).
$$
After dividing through by $p(\mathbf{z}_ {0})$ and combining with Equation 1 we obtain a closed form estimator of $\mathbf{z}$ (Miyasawa, 1961) written in terms of the score function $\nabla\log p(\mathbf{z}_ {0})$ (Hyvärinen, 2005):
$$
\hat{\mathbf{z}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\Sigma\nabla_ {\mathbf{z}_ {0}}\log p(\mathbf{z}_ {0}).
$$
This optimal procedure is achieved in what can be interpreted as a single gradient descent step, with no knowledge of the prior $p(\mathbf{z})$ . By rearranging Equation 4, a definition of $\nabla\log p(\mathbf{z}_ {0})$ can be derived; this can be used to train models that approximate the score function (Song & Ermon, 2019).
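The estimator in Equation 4 can be checked numerically in a toy setting where everything has a closed form. The following sketch (illustrative values, not from the paper) takes a Gaussian prior $\mathbf{z}\sim\mathcal{N}(\mu,s^{2}I)$ and isotropic noise $\mathbf{z}_ {0}=\mathbf{z}+\mathcal{N}(\mathbf{0},\sigma^{2}I)$, so that $p(\mathbf{z}_ {0})=\mathcal{N}(\mu,(s^{2}+\sigma^{2})I)$ and the score is analytic; the single-step estimate then coincides exactly with the posterior mean $\mathbb{E}[\mathbf{z}|\mathbf{z}_ {0}]$:

```python
import numpy as np

# Toy check of the empirical Bayes estimator (Equation 4).
# Assumed setup: prior z ~ N(mu, s2 * I), noise z0 = z + N(0, sigma2 * I),
# so p(z0) = N(mu, (s2 + sigma2) * I) and its score is available exactly.
mu, s2, sigma2 = 1.5, 4.0, 0.25          # prior mean/variance, noise variance
z0 = np.array([2.0, -0.5, 3.1])          # a hypothetical noisy observation

score = -(z0 - mu) / (s2 + sigma2)       # grad_{z0} log p(z0) for a Gaussian
z_hat = z0 + sigma2 * score              # single-step estimate (Equation 4)

# Exact posterior mean E[z | z0] for conjugate Gaussians, for comparison.
posterior_mean = (s2 * z0 + sigma2 * mu) / (s2 + sigma2)
assert np.allclose(z_hat, posterior_mean)
```

The agreement holds with no knowledge of the prior being used in the estimator itself, only the score of the marginal $p(\mathbf{z}_ {0})$.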
2.2 VARIATIONAL AUTOENCODERS
Variational Autoencoders (VAEs; Kingma & Welling 2014) are a probabilistic take on standard autoencoders that permits sampling. A latent-based generative model $p_ {\pmb{\theta}}(\mathbf{x}|\mathbf{z})$ is defined with a normally distributed prior over the latent variables, $p_ {\pmb\theta}(\mathbf z)=\mathcal{N}(\mathbf z;\mathbf0,\pmb I_ {d})$ . $p_ {\pmb{\theta}}(\mathbf{x}|\mathbf{z})$ is typically parameterised as a Bernoulli, Gaussian, or multinomial distribution, or a mixture of logits. In this case, the true posterior $p_ {\pmb{\theta}}(\mathbf{z}|\mathbf{x})$ is intractable, so a secondary encoding network $q_ {\phi}(\mathbf{z}|\mathbf{x})$ is used to approximate the true posterior; the pair of networks thus resembles a traditional autoencoder. This allows VAEs to approximate $p_ {\theta}(\mathbf{x})$ by maximising the evidence lower bound (ELBO), defined as:
$$
\log p_ {\theta}(\mathbf{x})\geq\mathcal{L}^{\mathrm{VAE}}=-D_ {\mathrm{KL}}(q_ {\phi}(\mathbf{z}|\mathbf{x})\,||\,\mathcal{N}(\mathbf{0},I_ {d}))+\mathbb{E}_ {q_ {\phi}(\mathbf{z}|\mathbf{x})}[\log p_ {\theta}(\mathbf{x}|\mathbf{z})].
$$
To optimise this lower bound with respect to $\pmb{\theta}$ and $\phi$ , gradients must be backpropagated through the stochastic process of generating samples from ${\bf z}^{\prime}\sim q_ {\phi}({\bf z}|{\bf x})$ . This is permitted by reparameterising $\mathbf{z}$ using the differentiable function $\mathbf{z}^{\prime}=\mu(\mathbf{z})+\sigma(\mathbf{z})\odot\epsilon$ , where $\epsilon\sim\mathcal{N}(\mathbf{0},\pmb{I}_ {d})$ , and $\mu(\mathbf{z})$ and $\sigma(\mathbf{z})^{2}$ are the mean and variance respectively of a multivariate Gaussian distribution with diagonal covariance.
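The reparameterisation trick can be sketched minimally as follows: sampling $\mathbf{z}^{\prime}\sim\mathcal{N}(\mu,\sigma^{2})$ is rewritten as a deterministic function of $(\mu,\sigma)$ plus parameter-free noise, so gradients can flow through $\mu$ and $\sigma$. The values of `mu` and `sigma` below are arbitrary placeholders:

```python
import numpy as np

# Reparameterisation trick: z' = mu + sigma * eps with eps ~ N(0, I).
# The randomness lives entirely in eps, which does not depend on the
# parameters, making the sampling step differentiable w.r.t. mu and sigma.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])

eps = rng.standard_normal((100_000, 2))   # parameter-free noise
z = mu + sigma * eps                      # reparameterised samples

# Empirically, the samples have the intended per-dimension mean and std.
assert np.allclose(z.mean(axis=0), mu, atol=0.03)
assert np.allclose(z.std(axis=0), sigma, atol=0.03)
```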
3 METHOD
Consider a dataset $\mathbf x\sim p_ {d}$ of continuous or discrete signals $\mathbf{x}\in\mathbb{R}^{m}$ . It is typical to assume that the data can be represented by low dimensional latent variables $\mathbf{z}\in\mathbb{R}^{k}$ , which can be used by a generative neural network to reconstruct the data. These variables are often estimated through the use of a secondary encoding network that is trained concurrently with the generative network. An encoding network adds additional complexity (and parameters) to the model, it can be difficult to balance the capacities of the two networks, and for complex hierarchical generative models designing a suitable architecture can be difficult. This has led some to instead approximate latent variables by performing gradient descent on the generative network (Bojanowski et al., 2018; Nijkamp et al., 2020). While this addresses the aforementioned problems, it significantly increases the run time of the inference process, introduces additional hyperparameters to tune, and convergence is not guaranteed.
3.1 GRADIENT ORIGIN NETWORKS
We propose a generative model that consists only of a decoding network, using empirical Bayes to approximate the posterior in a single step. That is, for some data point $\mathbf{x}$ and latent variable $\mathbf{z}\sim p_ {\mathbf{z}}$ , we wish to find an approximation of $p(\mathbf{z}|\mathbf{x})$ . Given some noisy observation $\mathbf z_ {0}=\mathbf z+\boldsymbol\epsilon$ of $\mathbf{z}$ , with $\boldsymbol\epsilon\sim\mathcal{N}(\mathbf0,\pmb I_ {d})$ , empirical Bayes can be applied to approximate $\mathbf{z}$ . Specifically, since we wish to approximate $\mathbf{z}$ conditioned on $\mathbf{x}$ , we instead calculate $\hat{\bf z}_ {\bf x}$ , the least squares estimate of $p(\mathbf{z}|\mathbf{x})$ (proof in Appendix A):
$$
\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\log p(\mathbf{z}_ {0}|\mathbf{x}).
$$
Using Bayes’ rule, $\log p(\mathbf{z}_ {0}|\mathbf{x})$ can be written as $\log p(\mathbf{z}_ {0}|\mathbf{x})=\log p(\mathbf{x}|\mathbf{z}_ {0})+\log p(\mathbf{z}_ {0})-\log p(\mathbf{x})$ . Since $\log p(\mathbf{x})$ is a normalising constant that does not affect the gradient, we can rewrite Equation 6 in terms only of the decoding network and $p(\mathbf{z}_ {0})$ :
$$
\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\Big(\log p(\mathbf{x}|\mathbf{z}_ {0})+\log p(\mathbf{z}_ {0})\Big).
$$
It remains unclear, however, how to construct the noisy observation $\mathbf{z}_ {0}$ with no knowledge of $\mathbf{z}$ . If we assume $\mathbf{z}$ follows a known distribution, then it is possible to develop reasonable estimates. For instance, if we assume $p(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},I_ {d})$ then we could sample from $p(\mathbf{z}_ {0})=\mathcal{N}(\mathbf{z}_ {0};\mathbf{0},2I_ {d})$ ; however, this could be far from the true distribution $p(\mathbf{z}_ {0}|\mathbf{z})=\mathcal{N}(\mathbf{z}_ {0};\mathbf{z},I_ {d})$ . Instead, we propose initialising $\mathbf{z}_ {0}$ at the origin, since this is the distribution’s mean. Initialising at a constant position decreases the input variation and thus simplifies the optimisation procedure. Naturally, how $p(\mathbf{x}|\mathbf{z})$ is modelled affects $\hat{\mathbf{z}}_ {\mathbf{x}}$ . While mean-field models result in $\hat{\mathbf{z}}_ {\mathbf{x}}$ that are linear functions of $\mathbf{x}$ , conditional autoregressive models, for instance, result in non-linear $\hat{\mathbf{z}}_ {\mathbf{x}}$ ; multiple gradient steps also induce non-linearity. However, we show that a single step works well on high dimensional data, suggesting that linear encoders, which normally do not scale to high dimensional data, are effective in this case.
3.2 AUTOENCODING WITH GONS
Before exploring GONs as generative models, we discuss the case where the prior $p(\mathbf{z})$ is unknown; such a model is referred to as an autoencoder. As such, the distribution $p(\mathbf{z}_ {0}\vert\mathbf{z})$ is also unknown, thus it is again unclear how we can construct a noisy estimate of $\mathbf{z}$ . By training a model end-to-end where $\mathbf{z}_ {0}$ is chosen as the origin, however, a prior is implicitly learned over $\mathbf{z}$ such that it is reachable from $\mathbf{z}_ {0}$ . Although $p(\mathbf{z})$ is unknown, we do not wish to impose a prior on $\mathbf{z}_ {0}$ ; the term in Equation 7 which enforces this is $\log p(\mathbf{z}_ {0})$ , so we can safely ignore this term and simply maximise the likelihood of the data given $\mathbf{z}_ {0}$ . Our estimator of $\mathbf{z}$ can therefore be defined simply as $\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\log p(\mathbf{x}|\mathbf{z}_ {0})$ , which can otherwise be interpreted as a single gradient descent step on the conditional log-likelihood of the data. From this estimate, the data can be reconstructed by passing $\hat{\bf z}_ {\bf x}$ through the decoder to parameterise $p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}})$ . This procedure can be viewed more explicitly when using a neural network $F\colon\mathbb{R}^{k}\rightarrow\mathbb{R}^{m}$ to output the mean of $p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}})$ , parameterised by a normal distribution; in this case the loss function is defined in terms of the mean squared error loss $\mathcal{L}^{\mathrm{MSE}}$ :
$$
G_ {\mathbf{x}}=\mathcal{L}^{\mathrm{MSE}}(\mathbf{x},F(-\nabla_ {\mathbf{z}_ {0}}\mathcal{L}^{\mathrm{MSE}}(\mathbf{x},F(\mathbf{z}_ {0})))).
$$
The gradient computation thereby plays a similar role to an encoder, while $F$ can be viewed as a decoder, with the outer loss term determining the overall reconstruction quality. Using a single network to perform both roles has the advantage of simplifying the overall architecture, avoiding the need to balance networks, and avoiding bottlenecks; this is demonstrated in Figure 1b which provides a visualisation of the GON process.
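The single-step encoding can be made concrete with the simplest possible decoder, a linear map $F(\mathbf{z})=W\mathbf{z}$, where the inner gradient has a closed form. This sketch uses toy values (the choice of $W$ is hypothetical, chosen so that $W^{T}W=I/2$ and the step is exact):

```python
import numpy as np

# One GON encoding step for a linear decoder F(z) = W z. With the inner
# loss L = ||x - W z0||^2 evaluated at the origin z0 = 0, the gradient is
# -2 W^T x, so the latent z = -grad is simply 2 W^T x: a linear encoder
# obtained for free from the decoder.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((16, 4)))  # orthonormal columns
W = Q / np.sqrt(2.0)                               # decoder weights: W^T W = I/2
x = W @ np.array([1.0, -2.0, 0.5, 3.0])            # data in the decoder's range

z0 = np.zeros(4)                    # latent initialised at the origin
grad = -2.0 * W.T @ (x - W @ z0)    # grad of ||x - W z0||^2 at z0
z = z0 - grad                       # single-step latent: 2 W^T x here
recon = W @ z                       # decode the inferred latent

assert np.allclose(recon, x)        # exact because W^T W = I/2
```

For a nonlinear decoder the gradient is no longer linear in $\mathbf{x}$, but the structure of the computation is identical: one backward pass through $F$ replaces the encoder.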
3.3 VARIATIONAL GONS
The variational approach can be naturally applied to GONs, allowing sampling in a single step while only requiring the generative network, reducing the parameters necessary. Similar to before, a feedforward neural network $F$ parameterises $p(\mathbf{x}|\mathbf{z})$ , while the expectation of $p(\mathbf{z}|\mathbf{x})$ is calculated with empirical Bayes. A normal prior is assumed over $p(\mathbf{z})$ , thus Equation 7 can be written as:
$$
\hat{\mathbf{z}}_ {\mathbf{x}}(\mathbf{z}_ {0})=\mathbf{z}_ {0}+\nabla_ {\mathbf{z}_ {0}}\Big(\log p(\mathbf{x}|\mathbf{z}_ {0})+\log\mathcal{N}(\mathbf{z}_ {0};\mathbf{0},2I_ {d})\Big),
$$
where $\mathbf{z}_ {0}=\mathbf{0}$ as discussed in Section 3.1. While it would be possible to use this estimate directly within a constant-variance VAE (Ghosh et al., 2019), we opt to incorporate the reparameterisation trick into the generative network as a stochastic layer, to represent the distribution over which $\mathbf{x}$ could be encoded, using empirical Bayes to estimate $\mathbf{z}$ . Similar to the autoencoding approach, we could ignore $\log p(\mathbf{z}_ {0})$ ; however, we find that assuming a normally distributed $\mathbf{z}_ {0}$ implicitly constrains $\mathbf{z}$ , aiding the optimisation procedure. Specifically, the forward pass of $F$ is implemented as follows: $\hat{\mathbf{z}}_ {\mathbf{x}}$ (or $\mathbf{z}_ {0}$ ) is mapped by linear transformations to $\mu(\hat{\bf z}_ {\bf x})$ and $\sigma(\hat{\bf z}_ {\bf x})$ and the reparameterisation trick is applied; subsequently, the further transformations formerly defined as $F$ in the GON formulation are applied, providing parameters for $p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}})$ . Training is performed end-to-end, maximising the ELBO:
$$
\begin{array}{r}{\mathcal{L}^{\mathrm{VAE}}=-D_ {\mathrm{KL}}(\mathcal{N}(\mu(\hat{\mathbf{z}}_ {\mathbf{x}}),\sigma(\hat{\mathbf{z}}_ {\mathbf{x}})^{2})||\mathcal{N}(\mathbf{0},I_ {d}))+\log p(\mathbf{x}|\hat{\mathbf{z}}_ {\mathbf{x}}).}\end{array}
$$
These steps are shown in Figure 1d, which in practice has a simple implementation.
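The KL term of this ELBO has the standard closed form for a diagonal Gaussian against $\mathcal{N}(\mathbf{0},I_ {d})$, namely $\frac{1}{2}\sum_i(\mu_i^2+\sigma_i^2-1-\log\sigma_i^2)$. A sketch with made-up $\mu,\sigma$ values, not tied to any trained network:

```python
import numpy as np

# Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), the first term of the
# ELBO. mu and sigma here are arbitrary illustrative values.
mu = np.array([0.3, -1.2])
sigma = np.array([0.8, 1.5])

kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

assert kl > 0.0  # KL is strictly positive unless mu = 0 and sigma = 1
# Sanity check: matching distributions give exactly zero.
mu0, sigma0 = np.zeros(2), np.ones(2)
assert np.isclose(0.5 * np.sum(mu0**2 + sigma0**2 - 1.0 - np.log(sigma0**2)), 0.0)
```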
3.4 IMPLICIT GONS
In the field of implicit representation networks, the aim is to learn a neural approximation of a function $\Phi$ that satisfies an implicit equation of the form:
$$
R({\bf c},\Phi,\nabla_ {\Phi},\nabla_ {\Phi}^{2},\ldots)=0,\quad\Phi\colon{\bf c}\mapsto\Phi({\bf c}),
$$
where $R$ ’s definition is problem dependent but often corresponds to a loss function. Equations with this structure arise in a myriad of fields, namely 3D modelling, image, video, and audio representation (Sitzmann et al., 2020b). In these cases, data samples $\mathbf{x}=(\mathbf{c},\Phi_ {\mathbf{x}}(\mathbf{c}))$ can be represented in this form in terms of coordinates $\mathbf{c}\in\mathbb{R}^{n}$ and the corresponding data at those points, $\Phi\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ . Due to the continuous nature of $\Phi$ , data with large spatial dimensions can be represented much more efficiently than approaches using convolutions, for instance. Despite these benefits, representing a distribution of data points using implicit methods is more challenging, leading to the use of hypernetworks which estimate the weights of an implicit network for each data point (Sitzmann et al., 2020a); this increases the number of parameters and adds significant complexity.
By applying the GON procedure to implicit representation networks, it is possible to learn a space of implicit functions without the need for additional networks. We assume there exist latent vectors $\mathbf{z}\in\mathbb{R}^{k}$ corresponding to data samples; the concatenation of these latent variables with the data coordinates can therefore geometrically be seen as points on a manifold that describe the full dataset, in keeping with the manifold hypothesis (Fefferman et al., 2016). An implicit Gradient Origin Network can be trained on this space to mimic $\Phi$ using a neural network $F\colon\mathbb{R}^{n+k}\rightarrow\mathbb{R}^{m}$ , thereby learning a space of implicit functions by modifying Equation 8:
$$
I_ {\mathbf{x}}=\int\mathcal{L}\Big(\Phi_ {\mathbf{x}}(\mathbf{c}),F\Big(\mathbf{c}\oplus-\nabla_ {\mathbf{z}_ {0}}\int\mathcal{L}\big(\Phi_ {\mathbf{x}}(\mathbf{c}),F(\mathbf{c}\oplus\mathbf{z}_ {0})\big)d\mathbf{c}\Big)\Big)d\mathbf{c},
$$
where both integrations are performed over the space of coordinates and $\mathbf{c}\oplus\mathbf{z}$ represents a concatenation (Figure 1c). Similar to the non-implicit approach, the computation of latent vectors can be expressed as $\mathbf{z}=-\nabla_ {\mathbf{z}_ {0}}\int\mathcal{L}\big(\Phi_ {\mathbf{x}}(\mathbf{c}),F(\mathbf{c}\oplus\mathbf{z}_ {0})\big)d\mathbf{c}$ . In particular, we parameterise $F$ with a SIREN (Sitzmann et al., 2020b), finding that it is capable of modelling both high and low frequency components in high dimensional spaces.
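The network input $\mathbf{c}\oplus\mathbf{z}$ can be assembled as follows: each coordinate on a pixel grid is concatenated with the (shared) latent vector for that sample, so a single network maps $\mathbb{R}^{n+k}\rightarrow\mathbb{R}^{m}$. The grid size and latent dimension below are illustrative, not values from the paper:

```python
import numpy as np

# Assemble the c (+) z inputs for an implicit GON over an H x W pixel grid.
H = W = 8   # spatial grid (illustrative)
k = 4       # latent dimension (illustrative)

# Coordinates normalised to [-1, 1], one (y, x) pair per pixel.
ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
coords = np.stack([ys, xs], axis=-1).reshape(-1, 2)     # shape (H*W, 2)

z = np.zeros(k)                                         # z0: origin latent
inputs = np.concatenate(
    [coords, np.broadcast_to(z, (coords.shape[0], k))], axis=-1
)                                                       # c (+) z per pixel

assert inputs.shape == (H * W, 2 + k)
```

The inner integral over coordinates is then approximated by summing the loss over these `H * W` rows, and the gradient with respect to `z` (via automatic differentiation in practice) gives the latent for the sample.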
3.5 GON GENERALISATIONS
There are a number of interesting generalisations that make this approach applicable to other tasks. In Equations 8 and 12 we use the same $\mathcal{L}$ in both the inner term and outer term; however, as with variational GONs, it is possible to use different functions. Through this, a GON can be trained concurrently as a generative model and classifier, or, with some minor modifications to the loss involving the addition of the categorical cross entropy loss function $\mathcal{L}^{\mathrm{CCE}}$ to maximise the likelihood of classification, solely as a classifier:
$$
C_ {\mathbf{x}}=\mathcal{L}^{\mathrm{CCE}}(f(-\nabla_ {\mathbf{z}_ {0}}\mathcal{L}(\mathbf{x},F(\mathbf{z}_ {0}))),\mathbf{y}),
$$
where $\mathbf{y}$ are the target labels and $f$ is a single linear layer followed by a softmax. Another possibility is modality conversion for translation tasks; in this case the inner reconstruction loss is performed on the source signal and the outer loss on the target signal.
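The classification head above can be sketched with toy values: the latent stands in for the negative inner-loss gradient, and $f$ is a linear layer plus softmax. The weights `A`, `b` are hypothetical, untrained parameters:

```python
import numpy as np

# Sketch of f(-grad_{z0} L(x, F(z0))): a linear layer followed by softmax
# over class logits. z stands in for the gradient-derived latent.
rng = np.random.default_rng(2)
z = rng.standard_normal(4)                         # latent from the inner step

A, b = rng.standard_normal((3, 4)), np.zeros(3)    # 3-class linear head
logits = A @ z + b
probs = np.exp(logits - logits.max())              # numerically stable softmax
probs /= probs.sum()

assert np.isclose(probs.sum(), 1.0) and (probs > 0).all()
```

The categorical cross entropy $\mathcal{L}^{\mathrm{CCE}}$ is then `-np.log(probs[target])` for the target class, backpropagated through both the head and the inner gradient computation.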
3.6 JUSTIFICATION
Beyond empirical Bayes, we provide some additional analysis of why a single gradient step is in general sufficient as an encoder. Firstly, the gradient of the loss inherently offers substantial information about the data, making it a good encoder. Secondly, a good latent space should satisfy local consistency (Zhou et al., 2003; Kamnitsas et al., 2018). GONs satisfy this since similar data points will have similar gradients due to the constant latent initialisation. As such, the network needs only to find an equilibrium where its prior is the gradient operation, allowing for significant freedom. Finally, since GONs are trained by backpropagating through empirical Bayes, there are benefits to using an activation function whose second derivative is non-zero.
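The local consistency claim can be illustrated numerically: because $\mathbf{z}_0$ is always the origin, the encoding is a deterministic function of the data, so nearby inputs receive nearby gradients. The sketch below assumes a hypothetical linear decoder, for which the gradient encoding is $2W^{\top}\mathbf{x}$ in closed form, and compares cosine similarities:

```python
import numpy as np

rng = np.random.default_rng(1)
D, Z = 16, 8
W = rng.normal(scale=0.1, size=(D, Z))   # decoder weights, F(z) = W @ z

def gon_encode(x):
    # With z0 = 0 and an MSE loss, -grad_{z0}||x - W z0||^2 = 2 W^T x.
    return 2.0 * W.T @ x

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = rng.normal(size=D)
x_near = x + 0.01 * rng.normal(size=D)   # small perturbation of x
x_far = rng.normal(size=D)               # unrelated data point

sim_near = cosine(gon_encode(x), gon_encode(x_near))
sim_far = cosine(gon_encode(x), gon_encode(x_far))
# sim_near is close to 1; sim_far is not, reflecting local consistency.
```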
4 RESULTS
We evaluate Gradient Origin Networks on a variety of image datasets: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Small NORB (LeCun et al., 2004), COIL-20 (Nane et al., 1996), CIFAR-10 (Krizhevsky et al., 2009), CelebA (Liu et al., 2015), and LSUN Bedroom (Yu et al., 2015). Simple models are used: for small images, implicit GONs consist of approximately 4 hidden layers of 256 units, and convolutional GONs consist of 4 convolution layers with Batch Normalization (Ioffe & Szegedy, 2015) and the ELU non-linearity (Clevert et al., 2016); for larger images the same general architecture is used, scaled up with additional layers. All training is performed with the Adam optimiser (Kingma & Ba, 2015). Error bars in graphs represent standard deviations over three runs.
Table 1: Validation reconstruction loss (summed squared error) over 500 epochs. For GLO, latents are assigned to data points and jointly optimised with the network. GONs significantly outperform other single step methods and achieve the lowest reconstruction error on four of the five datasets.
| | MNIST | Fashion-MNIST | SmallNORB | COIL-20 | CIFAR-10 |
|---|---|---|---|---|---|
| Single Step | | | | | |
| GON (ours) | 0.41±0.01 | 1.00±0.01 | 0.17±0.01 | 5.35±0.01 | 9.12±0.03 |
| AE | 1.33±0.02 | 1.75±0.02 | 0.73±0.01 | 6.05±0.11 | 12.24±0.05 |
| TiedAE | 2.16±0.03 | 2.45±0.02 | 0.93±0.03 | 6.68±0.13 | 14.12±0.34 |
| 1 Step Detach | 8.17±0.15 | 7.76±0.21 | 1.84±0.01 | 15.72±0.70 | 30.17±0.29 |
| Multiple Steps | | | | | |
| 10 Step Detach | 8.13±0.50 | 7.22±0.92 | 1.78±0.02 | 15.48±0.60 | 28.68±1.16 |
| 10 Step | 0.42±0.01 | 1.01±0.01 | 0.17±0.01 | 5.36±0.01 | 9.19±0.01 |
| GLO | 0.70±0.01 | 1.78±0.01 | 0.61±0.01 | 3.27±0.02 | 9.12±0.03 |
| | MNIST | Fashion-MNIST | SmallNORB | COIL-20 | CIFAR-10 | CelebA |
|---|---|---|---|---|---|---|
| VGON (ours) | 1.06 | 3.30 | 2.34 | 3.44 | 5.85 | 5.41 |
| VAE | 1.15 | 3.23 | 2.54 | 3.63 | 5.94 | 5.59 |
Table 2: Validation ELBO in bits/dim over 1000 epochs (CelebA is trained over 150 epochs).
4.1 QUANTITATIVE EVALUATION
A quantitative evaluation of the representation ability of GONs is performed in Table 1 against a number of baseline approaches. We compare against the single-step methods: a standard autoencoder, an autoencoder with tied weights (Gedeon, 1998), and a GON with gradients detached from $\mathbf{z}$; as well as the multi-step methods: 10 gradient descent steps per data point, with and without detaching gradients, and a GLO (Bojanowski et al., 2018), which assigns a persistent latent vector to each data point and optimises them with gradient descent, therefore taking orders of magnitude longer to reconstruct validation data than the other approaches. For the 10-step methods, a learning rate of 0.1 is applied as used in other literature (Nijkamp et al., 2019); the GLO is trained with MSE for consistency with the other approaches, and we do not project latents onto a hypersphere as proposed by the authors, since in this experiment sampling is unimportant and this would handicap the approach. GONs achieve much lower validation loss than the other single-step methods and are competitive with the multi-step approaches; in fact, GONs achieve the lowest loss on four of the five datasets.
Our variational GON is compared quantitatively with a VAE, whose decoder is the same as the GON, in terms of ELBO on the test set in Table 2. We find that the representation ability of GONs is pertinent here also, allowing the variational GON to achieve a lower ELBO on five of the six datasets tested. Both models were trained with the original VAE formulation for fairness; however, we note that variations aimed at improving VAE sample quality, such as $\beta$-VAEs (Higgins et al., 2017), are also applicable to variational GONs to the same effect.
An ablation study is performed, comparing convolutional GONs with autoencoders whose decoders have exactly the same architecture as the GONs (Figure 2a) and whose encoders mirror their respective decoders. The mean reconstruction loss over the test set is measured after each epoch for a variety of latent sizes. Despite having half the number of parameters and linear encodings, GONs achieve lower validation loss than autoencoders.

Figure 2: Gradient Origin Networks trained on CIFAR-10 are found to outperform autoencoders using exactly the same architecture without the encoder, requiring half the number of parameters. (a) GONs achieve lower validation loss than autoencoders. (b) Multiple latent update steps ($*=$ gradients detached, $\dagger=$ not detached). (c) GONs overfit less than standard autoencoders.

Figure 3: The impact of the activation function and the number of latent variables on GON performance, measured by comparing reconstruction losses during training.
GONs achieve significantly lower reconstruction loss over a wide range of architectures. Further experiments evaluating the impact of latent size and model capacity are performed in Appendix B.
Our hypothesis that for high dimensional datasets a single gradient step is sufficient when jointly optimised with the forward pass is tested on the CIFAR-10 dataset in Figure 2b. We observe a negligible difference between a single step and multiple jointly optimised steps, in support of our hypothesis. Performing multiple steps greatly increases run-time, so there is seemingly no benefit in this case. Additionally, the importance of the joint optimisation procedure is determined by detaching the gradients from $\mathbf{z}$ before reconstructing (Figure 2b); this results in markedly worse performance, even in conjunction with multiple steps. In Figure 2c we assess whether the greater performance of GONs relative to autoencoders comes at the expense of generalisation; we find that the opposite is true: the discrepancy between reconstruction loss on the training and test sets is greater with autoencoders. This result extends to other datasets, as can be seen in Appendix E.
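The single jointly optimised step can be sketched end-to-end. The toy below uses a linear decoder and minimises the outer reconstruction loss, whose computation contains the inner gradient step; to stay dependency-free, parameter gradients are estimated by finite differences rather than an autodiff framework (in practice this second-order backpropagation is done with e.g. `create_graph=True` in PyTorch). All sizes and learning rates here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
D, Z, N = 6, 3, 32
X = rng.normal(size=(N, D))                  # toy dataset
theta = rng.normal(scale=0.1, size=(D, Z))   # linear decoder F(z) = theta @ z

def gon_loss(theta, x):
    # Inner step at the origin: z = -grad_{z0} ||x - theta z0||^2 = 2 theta^T x.
    z = 2.0 * theta.T @ x
    # Outer loss: reconstruct x from the gradient-derived latent.
    r = x - theta @ z
    return r @ r

def batch_loss(theta):
    return np.mean([gon_loss(theta, x) for x in X])

# Train theta by gradient descent on the outer loss; because the inner
# gradient step sits inside gon_loss, its second-order effect on theta is
# optimised jointly, mirroring the single-step GON objective.
eps, lr = 1e-5, 0.01
loss_before = batch_loss(theta)
for _ in range(300):
    g = np.zeros_like(theta)
    for i in range(D):
        for j in range(Z):
            t = theta.copy()
            t[i, j] += eps
            g[i, j] = (batch_loss(t) - batch_loss(theta)) / eps
    theta -= lr * g
loss_after = batch_loss(theta)               # reconstruction loss decreases
```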
Figure 3 demonstrates the effect activation functions have on convolutional GON performance for different numbers of latent variables. Since optimising GONs requires computing second-order derivatives, the choice of non-linearity has different requirements than in standard models. In particular, GONs prefer functions that are not only effective activation functions, but whose second derivatives are also non-zero, unlike ReLUs, where $\mathrm{ReLU}^{\prime\prime}(x)=0$. The ELU non-linearity is effective with all tested architectures.
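The point about second derivatives can be checked with finite differences: away from its kink, ReLU is piecewise linear, so its second derivative vanishes, whereas ELU has $\mathrm{ELU}''(x)=e^{x}$ for $x<0$ (with $\alpha=1$). A small sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def elu(x, a=1.0):
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

xs = np.array([-2.0, -0.5, 0.5, 2.0])    # points away from the kink at 0
relu2 = second_derivative(relu, xs)      # ~0 everywhere: no curvature
elu2 = second_derivative(elu, xs)        # exp(x) for x < 0, ~0 for x > 0
```

A zero second derivative means the encoding gradient carries no curvature information for the optimiser to backpropagate through, consistent with the observation that ELU works well for GONs.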
4.2 QUALITATIVE EVALUATION
The representation ability of implicit GONs is shown in Figure 4, where we train on large image datasets using a relatively small number of parameters. In particular, Figure 4a shows MNIST can be well-represented with just 4,385 parameters (a SIREN with 3 hidden layers, each with 32 hidden units, and 32 latent dimensions). An advantage of modelling data with implicit networks is that coordinates can be sampled at arbitrarily high resolutions. In Figure 5 we train on $32{\times}32$ images then reconstruct at $256{\times}256$. A significant amount of high frequency detail is modelled despite only seeing low resolution images. The structure of the implicit GON latent space is shown by sampling latent codes from pairs of images, and then spherically interpolating between them to synthesise new samples (Figure 6). These samples are shown to capture variations in shape (the shoes in Figure 6b), size, and rotation (Figure 6d).
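Spherical interpolation between two latent codes $\mathbf{z}_1$ and $\mathbf{z}_2$ follows the standard slerp formula $\frac{\sin((1-t)\Omega)}{\sin\Omega}\mathbf{z}_1+\frac{\sin(t\Omega)}{\sin\Omega}\mathbf{z}_2$, where $\Omega$ is the angle between the vectors. A minimal sketch (the latent dimension and the codes themselves are placeholders):

```python
import numpy as np

def slerp(z1, z2, t):
    """Spherical linear interpolation between latent vectors z1 and z2."""
    omega = np.arccos(np.clip(
        z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1.0 - t) * z1 + t * z2   # nearly parallel: fall back to lerp
    return (np.sin((1.0 - t) * omega) * z1
            + np.sin(t * omega) * z2) / np.sin(omega)

rng = np.random.default_rng(3)
z1, z2 = rng.normal(size=32), rng.normal(size=32)
# Interpolation path; each point would be decoded to an image by the GON.
path = [slerp(z1, z2, t) for t in np.linspace(0.0, 1.0, 8)]
```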
We assess the quality of samples from variational GONs using convolutional models trained to convergence in Figure 7. These are diverse and often contain fine details. Samples from an implicit GON trained with early stopping, as a simple alternative, can be found in Appendix D; however, this approach results in fine details being lost. GONs are also found to converge quickly; we plot reconstructions at multiple time points during the first minute of training (Figure 8). After only 3 seconds of training on a single GPU, a large amount of signal information from MNIST is modelled.

Figure 4: Training implicit GONs with few parameters demonstrates their representation ability.

Figure 5: By training an implicit GON on $32{\times}32$ images, then sampling at $256{\times}256$, super-resolution is possible despite never observing high resolution data.
In order to evaluate how well GONs can represent high resolution natural images, we train a convolutional GON on the LSUN Bedroom dataset scaled to $128{\times}128$ (Figure 9a). As with smaller, simpler data, we find training to be extremely stable and consistent over a wide range of hyperparameter settings. Reconstructions are of excellent quality given the simple network architecture. A convolutional variational GON is also trained on the CelebA dataset scaled to $64{\times}64$ (Figure 9b). Unconditional samples are somewhat blurry, as commonly associated with traditional VAE models on complex natural images (Zhao et al., 2017), but otherwise show wide variety.
5 DISCUSSION
Despite similarities with autoencoder approaches, the absence of an encoding network offers several advantages. VAEs with overpowered decoders are known to ignore the latent variables (Chen et al., 2017), whereas GONs have only one network that equally serves both encoding and decoding functionality. Designing inference networks for complicated decoders is not a trivial task (Vahdat & Kautz, 2020); inferring latent variables with a GON simplifies this procedure. Similar to GONs, Normalizing Flow methods are also capable of encoding and decoding with a single

Figure 6: Spherical linear interpolations between points in the latent space for trained implicit GONs using different datasets (approximately 2-10 minutes training per dataset on a single GPU).


Figure 7: Random samples from a convolutional variational GON with normally distributed latents.

Figure 8: Convergence of convolutional GONs with 74k parameters.

(a) LSUN $128{\times}128$ Bedroom validation images (left) reconstructed by a convolutional GON (right).

(b) Samples from a convolutional variational GON trained on CelebA.
Figure 9: GONs are able to represent high resolution complex datasets to a high degree of fidelity.
set of weights; however, they achieve this by restricting the network to be invertible. This requires considerable architectural restrictions that affect performance, make them less parameter efficient, and prevent them from reducing the dimension of the data (Kingma & Dhariwal, 2018). Similarly, autoencoders with tied weights also encode and decode with a single set of weights by using the transpose of the encoder's weight matrices in the decoder; this, however, is only applicable to simple architectures. GONs, on the other hand, use gradients as encodings, which allows arbitrary functions to be used.
A number of previous works have used gradient-based computations to learn latent vectors; however, as far as we are aware, we are the first to use a single gradient step jointly optimised with the feed-forward pass, making our approach fundamentally different. Latent encodings have been estimated for pre-trained generative models without encoders, namely Generative Adversarial Networks, using approaches such as standard gradient descent (Lipton & Tripathi, 2017; Zhu et al., 2016). A number of approaches have trained generative models directly with gradient-based inference (Han et al., 2017; Bojanowski et al., 2018; Zadeh et al., 2019); these assign latent vectors to data points and jointly learn them with the network parameters through standard gradient descent or Langevin dynamics. This is very slow, however, and convergence for unseen data samples is not guaranteed. Short-run MCMC has also been applied (Nijkamp et al., 2020), but this still requires approximately 25 update steps. Since GONs train end-to-end, the optimiser can make use of second-order derivatives to allow inference in a single step. Also of relevance is model-agnostic meta-learning (Finn et al., 2017), which trains an architecture such that a few gradient descent steps are all that are necessary to adapt to new tasks. This is achieved by backpropagating through these gradients, similar to GONs.
In the case of implicit GONs, the integration terms in Equation 12 result in computation time that scales in proportion to the data dimension. This makes training slower for very high dimensional data, although we have not yet investigated Monte Carlo integration. In general, GONs are stable and consistent, capable of generating quality samples with an exceptionally small number of parameters, and converge to diverse results within few iterations. Nevertheless, there are avenues to explore so as to improve the quality of samples and scale to larger datasets. In particular, it would be beneficial to focus on how to better sample these models, perform formal analysis on the gradients, and investigate whether the distance function could be improved to better capture fine details.
CONCLUSION
In conclusion, we have introduced a method based on empirical Bayes which computes the gradient of the data-fitting loss with respect to the origin, then jointly fits the data while learning this new point of reference in the latent space. The results show that this approach is able to represent datasets using a small number of parameters with a simple overall architecture, which has advantages in applications such as implicit representation networks. GONs are shown to converge faster, with lower overall reconstruction loss, than autoencoders using the exact same architecture but without the encoder. Experiments show that the choice of non-linearity is important, as the network derivative jointly acts as the encoding function.
A PROOF OF CONDITIONAL EMPIRICAL BAYES
For latent variables $\mathbf{z}\sim p_ {\mathbf{z}}$ , a noisy observation of $\mathbf{z}$ , $\mathbf{z}_ {0}=\mathbf{z}+\mathcal{N}(\mathbf{0},\pmb{I}_ {d})$ , and data point $\mathbf x\sim p_ {d}$ , we wish to find an estimator of $p(\mathbf{z}|\mathbf{x})$ . To achieve this, we condition the Bayes least squares estimator on $\mathbf{x}$ :
$$
\hat{\mathbf{z}}_{\mathbf{x}}(\mathbf{z}_{0})=\int\mathbf{z}\,p(\mathbf{z}|\mathbf{z}_{0},\mathbf{x})d\mathbf{z}=\int\mathbf{z}\,\frac{p(\mathbf{z}_{0},\mathbf{z}|\mathbf{x})}{p(\mathbf{z}_{0}|\mathbf{x})}d\mathbf{z}.
$$
Through the probabilistic chain rule and by marginalising out $\mathbf{z}$, we can define $p(\mathbf{z}_{0}|\mathbf{x})=\int p(\mathbf{z}_{0}|\mathbf{z},\mathbf{x})p(\mathbf{z}|\mathbf{x})d\mathbf{z}$, which simplifies to $\int p(\mathbf{z}_{0}|\mathbf{z})p(\mathbf{z}|\mathbf{x})d\mathbf{z}$ since $\mathbf{z}_{0}$ depends only on $\mathbf{z}$. Writing this out fully, we obtain:
$$
p(\mathbf{z}_{0}|\mathbf{x})=\int\frac{1}{(2\pi)^{d/2}|\det(\pmb{\Sigma})|^{1/2}}\exp\Big(-(\mathbf{z}_{0}-\mathbf{z})^{T}\pmb{\Sigma}^{-1}(\mathbf{z}_{0}-\mathbf{z})/2\Big)p(\mathbf{z}|\mathbf{x})d\mathbf{z}.
$$
Differentiating with respect to $\mathbf{z}_ {\mathrm{0}}$ and multiplying both sides by $\pmb{\Sigma}$ gives:
$$
\begin{aligned}
\pmb{\Sigma}\nabla_{\mathbf{z}_{0}}p(\mathbf{z}_{0}|\mathbf{x})&=\int(\mathbf{z}-\mathbf{z}_{0})p(\mathbf{z}_{0}|\mathbf{z},\mathbf{x})p(\mathbf{z}|\mathbf{x})d\mathbf{z}=\int(\mathbf{z}-\mathbf{z}_{0})p(\mathbf{z}_{0},\mathbf{z}|\mathbf{x})d\mathbf{z}\\
&=\int\mathbf{z}\,p(\mathbf{z}_{0},\mathbf{z}|\mathbf{x})d\mathbf{z}-\mathbf{z}_{0}\,p(\mathbf{z}_{0}|\mathbf{x}).
\end{aligned}
$$
After dividing both sides by $p(\mathbf{z}_ {0}\vert\mathbf{x})$ and combining with Equation 14 we get:
$$
\pmb{\Sigma}\frac{\nabla_{\mathbf{z}_{0}}p(\mathbf{z}_{0}|\mathbf{x})}{p(\mathbf{z}_{0}|\mathbf{x})}=\int\mathbf{z}\frac{p(\mathbf{z}_{0},\mathbf{z}|\mathbf{x})}{p(\mathbf{z}_{0}|\mathbf{x})}d\mathbf{z}-\mathbf{z}_{0}=\hat{\mathbf{z}}_{\mathbf{x}}(\mathbf{z}_{0})-\mathbf{z}_{0}.
$$
Finally, this can be rearranged to give the single-step estimator of $\mathbf{z}$:
$$
\hat{\mathbf{z}}_{\mathbf{x}}(\mathbf{z}_{0})=\mathbf{z}_{0}+\pmb{\Sigma}\frac{\nabla_{\mathbf{z}_{0}}p(\mathbf{z}_{0}|\mathbf{x})}{p(\mathbf{z}_{0}|\mathbf{x})}=\mathbf{z}_{0}+\pmb{\Sigma}\nabla_{\mathbf{z}_{0}}\log p(\mathbf{z}_{0}|\mathbf{x}).
$$
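As a sanity check of this estimator, consider the fully Gaussian scalar special case, where the identity can be verified in closed form: if $z|x\sim\mathcal{N}(\mu,\sigma^{2})$ and $z_{0}=z+\mathcal{N}(0,1)$, then $p(z_{0}|x)=\mathcal{N}(\mu,\sigma^{2}+1)$, and the single-step estimate $z_{0}+\nabla_{z_{0}}\log p(z_{0}|x)$ (with $\pmb{\Sigma}=\pmb{I}$) coincides with the exact posterior mean $(\sigma^{2}z_{0}+\mu)/(\sigma^{2}+1)$. The numerical values below are arbitrary:

```python
import numpy as np

# Scalar Gaussian special case: z | x ~ N(mu, s2), z0 = z + N(0, 1).
# Then p(z0 | x) = N(mu, s2 + 1), and the single-step estimate
# z0 + d/dz0 log p(z0 | x) should equal (s2 * z0 + mu) / (s2 + 1).
mu, s2 = 1.5, 0.8                        # arbitrary illustrative values
z0 = np.linspace(-3.0, 3.0, 7)           # a few noisy observations

score = -(z0 - mu) / (s2 + 1.0)          # grad of log N(z0; mu, s2 + 1)
single_step = z0 + score                 # the estimator derived above
posterior_mean = (s2 * z0 + mu) / (s2 + 1.0)
# single_step and posterior_mean agree exactly in this Gaussian case.
```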
B COMPARISON WITH AUTO ENCODERS
This section compares Gradient Origin Networks with autoencoders that use exactly the same architecture, as described in the paper. All experiments are performed on the CIFAR-10 dataset. After each epoch, the model is fixed and the mean reconstruction loss is measured over the training set.
In Figure 10a the depth of the networks is altered by removing blocks, thereby downscaling to various resolutions. We find that GONs outperform autoencoders until latents have a spatial size of $8{\times}8$ (where the equivalent GON has only 2 convolution layers). Considering the limit where neither model has any parameters, the latent space is the input data, i.e. $\mathbf{z}=\mathbf{x}$. Substituting the definition of a GON (Equation 8), this becomes $-\nabla_{\mathbf{z}_{0}}\mathcal{L}=\mathbf{x}$, which simplifies to $\mathbf{0}=\mathbf{x}$, a contradiction. This is not a concern in normal practice, as evidenced by the results presented here.
Figure 10b explores the relationship between autoencoders and GONs when changing the number of convolution filters; GONs are found to outperform autoencoders in all cases. The discrepancy between the loss curves decreases as the number of filters increases, likely due to the diminishing returns of providing more capacity to the GON when its loss is already significantly closer to 0.
A similar pattern is found when decreasing the latent space size (Figure 10c); in this case the latent space likely becomes the limiting factor. With larger latent spaces GONs significantly outperform autoencoders; however, when the latent bottleneck becomes more pronounced, this lead lessens.

(a) Encoding images to various spatial dimensions. (b) Varying capacities by changing the number of convolution filters $f$. (c) Encoding images to a variety of latent space sizes.
Figure 10: Experiments comparing convolutional GONs with autoencoders on CIFAR-10, where the GON uses exactly the same architecture as the AE, without the encoder. (a) At the limit, autoencoders tend towards the identity function, whereas GONs are unable to operate with no parameters. As the number of network parameters increases (b) and the latent size decreases (c), the performance lead of GONs over AEs decreases due to diminishing returns/bottlenecking.
C INITIALISING $\mathbf{z}_{0}$ AT THE ORIGIN
We evaluate different initialisations of $\mathbf{z}_{0}$ in Figure 11 by sampling $\mathbf{z}_{0}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\pmb{I})$ for a variety of standard deviations $\sigma$. The proposed approach ($\sigma=0$) achieves the lowest reconstruction loss (Figure 11a); results for $\sigma>0$ are similar, suggesting that the latent space is adjusted so that $\mathbf{z}_{0}$ simulates the origin. An alternative parameterisation of $\mathbf{z}$ is to use a single gradient-descent-style step $\mathbf{z}=\mathbf{z}_{0}-\nabla_{\mathbf{z}_{0}}\mathcal{L}$ (Figure 11b); however, losses are higher than with the proposed GON initialisation.

Figure 11: Training GONs with $\mathbf{z}_{0}$ sampled from normal distributions with a variety of standard deviations $\sigma$, $\mathbf{z}_{0}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\pmb{I})$. Approach (a) directly uses the negative gradients as encodings, while approach (b) performs one gradient-descent-style step initialised at $\mathbf{z}_{0}$.
D LATENT SPACE AND EARLY STOPPING
Figure 12 shows a GON trained as a classifier where the latent space is squeezed into 2D for visualisation (Equation 13), and Figure 13 shows that the negative gradients of a GON after 800 steps of training are approximately normally distributed; Figure 14 shows that the latents of a typical convolutional VAE after 800 steps are significantly less normally distributed than those of GONs.
To obtain new samples from implicit GONs, we can use early stopping as a simple alternative to variational approaches. These samples (Figure 15) are diverse and capture the object shapes.

Figure 12: 2D latent space samples of an implicit GON classifier trained on MNIST (class colours).

Figure 13: Histogram of latent gradients after 800 implicit GON steps with a SIREN.

Figure 14: Histogram of traditional VAE latents after 800 steps.



Figure 15: GONs trained with early stopping can be sampled by approximating their latent space with a multivariate normal distribution. These images show samples from an implicit GON trained with early stopping.
E GON GENERALISATION
In Figure 16, the training and test losses for GONs and autoencoders are plotted for a variety of datasets. In all cases, GONs generalise better than their equivalent autoencoders while achieving lower losses with fewer parameters.

Figure 16: The discrepancy between training and test reconstruction losses when using a GON is smaller than for equivalent autoencoders over a variety of datasets.
F AVAILABILITY
F 可用性
Source code for the convolutional GON, variational GON, and implicit GON is available under the MIT license on GitHub at: https://github.com/cwkx/GON. This implementation uses PyTorch, and all reported experiments use an NVIDIA RTX 2080 Ti GPU.
G SUPER-RESOLUTION
Additional super-sampled images, obtained by training an implicit GON on $28{\times}28$ MNIST data then reconstructing test data at a resolution of $256{\times}256$, are shown in Figure 17.

Figure 17: Super-sampling $28{\times}28$ MNIST test data at $256{\times}256$ coordinates using an implicit GON.
