[论文翻译]基于自然梯度下降的图神经网络优化


原文地址:https://arxiv.org/pdf/2008.09624v1


Optimization of Graph Neural Networks with Natural Gradient Descent

基于自然梯度下降的图神经网络优化

Abstract

摘要

In this work, we propose to employ information-geometric tools to optimize a graph neural network architecture such as the graph convolutional networks. More specifically, we develop optimization algorithms for the graph-based semi-supervised learning by employing the natural gradient information in the optimization process. This allows us to efficiently exploit the geometry of the underlying statistical model or parameter space for optimization and inference. To the best of our knowledge, this is the first work that has utilized the natural gradient for the optimization of graph neural networks that can be extended to other semi-supervised problems. Efficient computations algorithms are developed and extensive numerical studies are conducted to demonstrate the superior performance of our algorithms over existing algorithms such as ADAM and SGD.

在本工作中,我们提出采用信息几何工具来优化图神经网络架构(如图卷积网络)。具体而言,我们通过在优化过程中利用自然梯度信息,开发了基于图的半监督学习优化算法。这使我们能够高效利用底层统计模型或参数空间的几何结构进行优化和推断。据我们所知,这是首个将自然梯度应用于图神经网络优化的研究,该方法可扩展至其他半监督学习问题。我们开发了高效的计算算法,并通过大量数值实验证明,该算法性能优于ADAM和SGD等现有算法。

Index Terms

索引术语

Graph neural network, Fisher information, natural gradient descent, network data.

图神经网络 (Graph Neural Network)、费舍尔信息 (Fisher information)、自然梯度下降 (Natural Gradient Descent)、网络数据

I. INTRODUCTION

I. 引言

In machine learning, the cost function is mostly evaluated using labeled samples that are not easy to collect. Semi-supervised learning tries to find a better model by using unlabeled samples. Most of the semi-supervised methods are based on a graph representation on (transformed) samples and labels [11]. For example, augmentation methods create a new graph in which original and augmented samples are connected. Graphs, as datasets with linked samples, have been the center of attention in semi-supervised learning. Graph Neural Network (GNN), initially proposed to capture graph representations in neural networks [20], have been used for semisupervised learning in a variety of problems like node classification, link predictions, and so on. The goal of each GNN layer is to transform features while considering the graph structure by aggregating information from connected or neighboring nodes. When there is only one graph, the goal of node classification becomes predicting node labels in a graph while only a portion of node labels are available (even though the model might have access to the features of all nodes). Inspired by the advance of convolutional neural networks [14] in computer vision [12], Graph Convolutional Network (GCN) [10] employs the spectra of graph Laplacian for filtering signals and the kernel can be approximated using Chebyshev polynomials or functions [24, 22]. GCN has become a standard and popular tool in the emerging field of geometric deep learning [2].

在机器学习中,成本函数主要通过难以收集的标注样本进行评估。半监督学习试图利用未标注样本来寻找更优模型。大多数半监督方法基于(变换后)样本和标签的图表示[11]。例如,数据增强方法会构建新图,将原始样本与增强样本相连。作为包含关联样本的数据集,图结构始终是半监督学习的核心关注点。图神经网络(GNN)最初是为捕捉神经网络中的图表示而提出[20],现已被广泛应用于节点分类、链接预测等半监督学习任务。每个GNN层的目标是通过聚合相连或相邻节点的信息,在考虑图结构的同时转换特征。当仅存在单一图时,节点分类任务就转化为在仅部分节点标签可用的情况下(尽管模型可能获取所有节点特征)预测图中节点标签。受计算机视觉领域卷积神经网络(CNN)进展[14][12]的启发,图卷积网络(GCN)[10]利用图拉普拉斯矩阵谱进行信号滤波,其核函数可通过切比雪夫多项式或函数逼近[24][22]。GCN已成为几何深度学习这一新兴领域[2]的标准流行工具。

From the optimization perspective, Stochastic Gradient Descent (SGD)-based methods that use an estimation of gradients have been popular choices due to their simplicity and efficiency. However, SGD-based algorithms may be slow in convergence and hard to tune on large datasets. Adding extra information about gradients, may help with the convergence but are not always possible or easy to obtain. For example, using second-order gradients like the Hessian matrix, resulting in the Newton method, is among the best choices which, however, are not easy to calculate especially in NNs. When the dataset is large or samples are redundant, NNs are trained using methods built on top of SGD like AdaGrad [4] or Adam [9]. Such methods use the gradients information from previous iterations or simply add more parameters like momentum to the SGD. Natural Gradient Descent (NGD) [1] provides an alternative based on the secondmoment of gradients. Using an estimation of the inverse of the Fisher information matrix (simply Fisher), NGD transforms gradients into so-called natural gradients that showed to be much faster compared to the SGD in many cases. The use of NGD allows efficient exploration of the geometry of the underlying parameter space in the optimization process. Also, Fisher information plays a pivotal role throughout statistical modeling [16]. In frequent is t statistics, Fisher information is used to construct hypothesis tests and confidence intervals by maximum likelihood estimators. In Bayesian statistics, it defines the Jeffreys’s prior, a default prior commonly used for estimation problems and nuisance parameters in a Bayesian hypothesis test. In minimum description length, Fisher information measures the model complexity and its role in model selection within the minimum description length framework like AIC and BIC. Under this interpretation, NGD is invariant to any smooth and invertible re parameter iz ation of the model, while SGD-based methods highly depend on the parameter iz ation. For models with a large number of parameters like DNN, Fisher is so huge that makes it almost impossible to evaluate natural gradients. Thus, for faster calculation it is preferred to use an approximation of Fisher like Kronecker-Factored Approximate Curvature (KFAC) [18] that are easier to store and inverse.

从优化角度来看,基于随机梯度下降 (Stochastic Gradient Descent, SGD) 的方法因其简单高效而广受欢迎。然而,基于SGD的算法在大型数据集上可能收敛缓慢且难以调参。虽然增加梯度相关信息有助于提升收敛性,但这些信息往往难以获取。例如,使用二阶梯度(如Hessian矩阵)的牛顿法虽是最优选择之一,但在神经网络中尤其难以计算。

当数据集庞大或样本存在冗余时,神经网络通常采用基于SGD改进的方法进行训练,例如AdaGrad [4] 或Adam [9]。这类方法利用历史迭代的梯度信息,或通过添加动量等参数来增强SGD。自然梯度下降 (Natural Gradient Descent, NGD) [1] 则基于梯度二阶矩提供了另一种思路——通过估计费雪信息矩阵(简称Fisher)的逆矩阵,NGD将梯度转换为自然梯度,其收敛速度在多类场景中显著优于SGD。NGD能有效探索参数空间的几何特性,而Fisher信息在统计建模中始终发挥着核心作用 [16]。

在频率统计中,费雪信息用于构建基于最大似然估计的假设检验与置信区间;在贝叶斯统计中,它定义了Jeffreys先验,常作为估计问题和贝叶斯假设检验中冗余参数的默认先验;在最小描述长度框架下,费雪信息通过AIC、BIC等准则衡量模型复杂度并参与模型选择。值得注意的是,NGD对模型任意光滑可逆的重参数化具有不变性,而SGD类方法则高度依赖参数化形式。

对于深度神经网络等参数量庞大的模型,费雪矩阵规模过大导致自然梯度计算不可行。因此实际多采用近似方法,例如更易存储求逆的克罗内克分解近似曲率 (Kronecker-Factored Approximate Curvature, KFAC) [18]。

Both GNN and training NNs with NGD have been active areas of research in recent years but, to the best of our knowledge, this is the first attempt on using NGD in the semi-supervised learning. In this work, a new framework for optimizing GNNs is proposed that takes into account the unlabeled samples in the approximation of Fisher. Section II provides an overview of related topics such as semi-supervised learning, GNN, and NGD. The proposed algorithm is described in section III and a series of experiments are performed in section IV to evaluate the method’s efficiency and sensitivity to hyper-parameters. Finally, the work is concluded in section V.

近年来,图神经网络 (GNN) 和基于自然梯度下降 (NGD) 训练神经网络一直是活跃的研究领域,但据我们所知,这是首次尝试在半监督学习中使用 NGD。本文提出了一种优化 GNN 的新框架,该框架在 Fisher 矩阵近似中考虑了未标记样本。第 II 节概述了相关主题,如半监督学习、GNN 和 NGD。第 III 节描述了所提出的算法,第 IV 节通过一系列实验评估了该方法的效率和对超参数的敏感性。最后,第 V 节对工作进行了总结。

II. PROBLEM AND BACKGROUND

II. 问题与背景

In this section, first, the graph-based semi-supervised learning with a focus on least-squared regression and cross-entropy classification is defined. Required backgrounds on the optimization and neural networks are provided in the subsequent sections. A detailed description of the notation is summarized in the Table I.

在本节中,首先定义了基于图的半监督学习,重点介绍了最小二乘回归和交叉熵分类。后续章节提供了关于优化和神经网络的必要背景知识。表1总结了符号的详细描述。

A. Problem

A. 问题

Consider an information source $q(x)$ generating independent samples $x{i} in X$ , the target distribution $q(y|x)$ associating $ {y}{i}in~Y $ to each $x{i}$ , and the adjacency distribution $q(a|x,x^{prime})$ representing the link between two nodes given their covariates levels $x$ and $x^{prime}$ . The problem of learning $q(y|x)$ is to estimate some parameters that minimizes the cost function

考虑一个信息源 $q(x)$ 生成独立样本 $x{i} in X$,目标分布 $q(y|x)$ 将 ${y}{i}\in~Y$ 关联到每个 $x{i}$,邻接分布 $q(a|x,x^{prime})$ 表示给定协变量水平 ${x}$ 和 ${x}^{prime}$ 时两个节点之间的链接。学习 $q({y}|{x})$ 的问题在于估计某些参数 以最小化成本函数

$$
r(\theta)=E_{\mathbf{x},\underline{{\mathbf{x}}}^{\prime}\sim q(\mathbf{x}),\underline{{a}}\sim q(a|\mathbf{x},\mathbf{x}^{\prime}),\mathbf{y}\sim q(\mathbf{y}|\mathbf{x})}[l(\mathbf{y},\mathbf{f}(\mathbf{x},\underline{{\mathbf{x}}}^{\prime},\underline{{a}};\theta))]
$$

$$
r(\theta)=E_{\mathbf{x},\underline{{\mathbf{x}}}^{\prime}\sim q(\mathbf{x}),\underline{{a}}\sim q(a|\mathbf{x},\mathbf{x}^{\prime}),\mathbf{y}\sim q(\mathbf{y}|\mathbf{x})}[l(\mathbf{y},\mathbf{f}(\mathbf{x},\underline{{\mathbf{x}}}^{\prime},\underline{{a}};\theta))]
$$

where the loss function $l(\mathbf{y},\hat{\mathbf{y}})$ measures the prediction error between $\mathbf{y}$ and $\hat{\bf y}$ . Also, $\underline{{\mathbf{x}}}^{\prime}$ and $\underline{{\boldsymbol{a}}}$ show sequences of $\mathbf{x}^{\prime}$ and $a$ , respectively. As $q(\mathbf{x}),q(a|\mathbf{x},\mathbf{x}^{\prime})$ , and $q(\mathbf{y}|\mathbf{x})$ are usually unknown or unavailable, the cost $r(\pmb\theta)$ is estimated using samples from these distributions. Furthermore, it is often more expensive to sample from $q(\mathbf{y}|\mathbf{x})$ than $q(\mathbf{x})$ and $q(a|\mathbf{x},\mathbf{x}^{\prime})$ resulting in the different number of samples from each distribution being available.

其中损失函数 $l(\mathbf{y},\hat{\mathbf{y}})$ 用于衡量 $\mathbf{y}$ 与 $\hat{\bf y}$ 之间的预测误差。此外,$\underline{{\mathbf{x}}}^{\prime}$ 和 $\underline{{\boldsymbol{a}}}$ 分别表示 $\mathbf{x}^{\prime}$ 和 $a$ 的序列。由于 $q(\mathbf{x}),q(a|\mathbf{x},\mathbf{x}^{\prime})$ 和 $q(\mathbf{y}|\mathbf{x})$ 通常未知或不可获取,因此需通过从这些分布中采样来估计成本 $r(\pmb\theta)$。此外,从 $q(\mathbf{y}|\mathbf{x})$ 采样的成本通常高于 $q(\mathbf{x})$ 和 $q(a|\mathbf{x},\mathbf{x}^{\prime})$,这导致从各分布获得的样本数量存在差异。

Let $X:=:X_{0}$ to be a $d_{0}\times n$ matrix of $n\geq1$ i.i.d $\mathbf{x}_{i}$ samples from $q(\mathbf{x})$ (equivalent to $X\sim q(X))$ . It is assumed that $n\times n$ adjacency matrix $A=[a_{i j}]$ is sampled from $q(a|\mathbf{x}_{i},\mathbf{x}_{j})$ for $i,j=1,\dots,n$ (equivalent to $A\sim q(A|X))$ . One can consider $(X,A)$ to be a graph of $n$ nodes in which the $i$ th column of $X$ shows the covariate at the node $i$ and $D=\mathrm{diag}(\textstyle\sum_{j}a_{i j})$ denotes the diagonal degree matrix. Also, denote $Y$ to be a $d_{m}\times\bar{n}$ matrix of $\bar{n}<n$ samples $\mathbf{y}_{i}$ from $q(\mathbf{y}|\mathbf{x}_{i})$ for $i=1,\dots,\bar{n}$ and $\mathbf{z}=[\mathbb{1}(i\in{1,\dots,\bar{n}})]_{i=1}^{n}$ to be the training mask vector. Note that 1(condition) is 1 if the condition is true and 0 otherwise. Thus, an empirical cost can be estimated by

设 $X:=:X_{0}$ 为一个 $d_{0}\times n$ 的矩阵,其中包含 $n\geq1$ 个来自 $q(\mathbf{x})$ 的独立同分布样本 $\mathbf{x}_{i}$ (等价于 $X\sim q(X))$。假设 $n\times n$ 的邻接矩阵 $A=[a_{i j}]$ 是从 $q(a|\mathbf{x}_{i},\mathbf{x}_{j})$ 中采样得到的,其中 $i,j=1,\dots,n$ (等价于 $A\sim q(A|X))$。可以将 $(X,A)$ 视为一个具有 $n$ 个节点的图,其中 $X$ 的第 $i$ 列表示节点 $i$ 的协变量,$D=\mathrm{diag}(\textstyle\sum_{j}a_{i j})$ 表示对角度矩阵。此外,设 $Y$ 为一个 $d_{m}\times\bar{n}$ 的矩阵,其中包含 $\bar{n}<n$ 个来自 $q(\mathbf{y}|\mathbf{x}_{i})$ 的样本 $\mathbf{y}_{i}$,其中 $i=1,\dots,\bar{n}$,而 $\mathbf{z}=[\mathbb{1}(i\in{1,\dots,\bar{n}})]_{i=1}^{n}$ 是训练掩码向量。注意,1(condition) 在条件为真时为1,否则为0。因此,可以通过以下方式估计经验成本:

TABLE I NOTATION

SymbolDescription
x,x, XScalar, vector, matrix
∈, 入,Regularization hyper-parameters
The learning rate
AAdjacency matrix
XTMatrix transpose
IComfortable identity matrix
A sequence of x vectors
nThe total number of samples
nThe number of labeled samples
FFisher information matrix
BPreconditioning matrix
r(0)The cost of parameters 0
l(y,y)The loss between y and y
q(x)The source distribution
q(y|x)The target distribution
q(a|x,x')The adjacency distribution
p(ylf(X, A;0))The prediction distribution
Φ()An element-wise nonliear function
VefGradient of scalar f wrt. 0
JefJacobian of vector f wrt. 0
HefHessian of scalar f wrt. 0
Element-wise multiplication operation

表 1 符号说明

符号 描述
x, x, X 标量, 向量, 矩阵
∈, 入 正则化超参数
学习率
A 邻接矩阵
XT 矩阵转置
I 舒适单位矩阵
x向量序列
n 样本总数
n 已标注样本数
F 费舍尔信息矩阵
B 预处理矩阵
r(0) 参数0的代价
l(y,y) y与y之间的损失
q(x) 源分布
q(y|x) 目标分布
q(a|x,x') 邻接分布
p(ylf(X, A;0)) 预测分布
Φ() 逐元素非线性函数
Vef 标量f关于θ的梯度
Jef 向量f关于θ的雅可比矩阵
Hef 标量f关于θ的海森矩阵
逐元素乘法运算

$$
\hat{r}(\pmb\theta)=\frac{1}{\bar{n}}\sum_{i=1}^{\bar{n}}l(\mathbf y_{i},\mathbf f(\mathbf x_{i},X,A;\pmb\theta)),
$$

$$
\hat{r}(\pmb\theta)=\frac{1}{\bar{n}}\sum_{i=1}^{\bar{n}}l(\mathbf y_{i},\mathbf f(\mathbf x_{i},X,A;\pmb\theta)),
$$

where $f left(x{i},X,A; pmb theta right)$ shows the processed $x {i}$ when having access to $n-1$ extra samples and links between them. Note that as $ X$ contains ${x}{i}$ (the $ i$ th column), $f left(x {i},X,A; pmb theta right)$ and $f left(X,A; theta right)$ are used interchangeably.

其中 $ f left(x{i},X,A;pmb theta right)$ 表示当能够访问 $n-1$ 个额外样本及其间连接时对 $x {i}$ 的处理结果。注意由于 $X$ 包含 $x {i}$ (第 $i$ 列), $f left(x {i},X,A; pmb theta right)$ 和 $f left(X,A;theta right)$ 可互换使用。

Assuming $p(\mathbf{y}|\mathbf{f}(X,A;\pmb\theta))=p_{\pmb\theta}(\mathbf{y}|X,A)$ to be an exponential family with natural parameters in $\mathbb{F}$ , the loss function becomes

假设 $p(\mathbf{y}|\mathbf{f}(X,A;\pmb\theta))=p_{\pmb\theta}(\mathbf{y}|X,A)$ 是指数族分布且自然参数属于 $\mathbb{F}$,则损失函数变为

$$
l(\mathbf{y},\mathbf{f}\left(X,A;\pmb{\theta}\right))=-\log p(\mathbf{y}|\mathbf{f}\left(X,A;\pmb{\theta}\right)).
$$

$$
l(\mathbf{y},\mathbf{f}\left(X,A;\pmb{\theta}\right))=-\log p(\mathbf{y}|\mathbf{f}\left(X,A;\pmb{\theta}\right)).
$$

In the least-squared regression,

在最小二乘回归中,

$$
p(\mathbf{y}|\mathbf{f}(X,A;\pmb\theta))=\mathcal{N}(\mathbf{y}|\mathbf{f}(X,A;\pmb\theta),\sigma^{2})
$$

$$
p(\mathbf{y}|\mathbf{f}(X,A;\pmb\theta))=\mathcal{N}(\mathbf{y}|\mathbf{f}(X,A;\pmb\theta),\sigma^{2})
$$

for fixed $\sigma^{2}$ and $\mathbb{F}=\mathbb{Y}=\mathbb{R}$ . In the cross-entropy classification to $c$ classes,

对于固定的 $\sigma^{2}$ 且 $\mathbb{F}=\mathbb{Y}=\mathbb{R}$。在交叉熵分类到 $c$ 类时,

$$
p(y=k|\mathbf{f}(X,A;\pmb\theta))=\exp(\mathbf{f}_{k})/\sum_{j=1}^{c}\exp(\mathbf{f}_{j})
$$

$$
p(y=k|\mathbf{f}(X,A;\pmb\theta))=\exp(\mathbf{f}_{k})/\sum_{j=1}^{c}\exp(\mathbf{f}_{j})
$$

for $\mathbb{F}=\mathbb{R}^{c}$ and $\mathbb{Y}={1,\ldots,c}$ .

对于 $\mathbb{F}=\mathbb{R}^{c}$ 和 $\mathbb{Y}={1,\ldots,c}$。

B. Parameter estimation

B. 参数估计

Having the first order approximation of $r(\pmb\theta)$ around a point $\pmb{\theta}_{0}$ ,

在点 $\pmb{\theta}_{0}$ 附近对 $r(\pmb\theta)$ 进行一阶近似

图片.png

the gradient descent can be used to update parameter $\pmb{\theta}$ iterative ly:

梯度下降可用于迭代更新参数 $\pmb{\theta}$:
图片.png

where $\eta>0$ denotes the learning rate, ${\bf g}{0}={\bf g}(\pmb{\theta}{0})$ is the gradient at $\pmb{\theta}{0}$ for

其中 $\eta>0$ 表示学习率,${\bf g}{0}={\bf g}(\pmb{\theta}{0})$ 是在 $\pmb{\theta}{0}$ 处的梯度。

$$
\mathbf{g}(\pmb\theta)=\frac{\partial r(\pmb\theta)}{\partial\pmb\theta}
$$

$$
\mathbf{g}(\pmb\theta)=\frac{\partial r(\pmb\theta)}{\partial\pmb\theta}
$$

and $B$ shows a symmetric positive definite matrix called pre conditioner capturing the interplay between the elements of $\pmb{\theta}$ . In SGD, $B=I$ and ${\bf g}_{0}$ is approximated by:

且 $B$ 表示一个对称正定矩阵,称为预条件子 (pre conditioner),用于捕捉 $\pmb{\theta}$ 各元素间的相互作用。在 SGD 中,$B=I$ 且 ${\bf g}_{0}$ 通过以下方式近似:
图片.png

where $\bar{n}\geq1$ can be the mini-batch (a randomly drawn subset of the dataset) size.

其中 $\bar{n}\geq1$ 可以是小批量(从数据集中随机抽取的子集)的大小。

To take into the account the relation between $\pmb{\theta}$ elements, one can use the second order approximation of $r(\pmb\theta)$ :

考虑到 $\pmb{\theta}$ 元素之间的关系,可以使用 $r(\pmb\theta)$ 的二阶近似:

图片.png

where $H_{0}=H(\pmb\theta_{0})$ denotes the Hessian matrix at $\pmb{\theta}_{0}$ for

其中 $H_{0}=H(\pmb\theta_{0})$ 表示 $\pmb{\theta}_{0}$ 处的 Hessian 矩阵

$$
H(\pmb\theta)=\frac{\partial^{2}r(\pmb\theta)}{\partial\pmb\theta^{\top}\pmb\theta}.
$$

$$
H(\pmb\theta)=\frac{\partial^{2}r(\pmb\theta)}{\partial\pmb\theta^{\top}\pmb\theta}.
$$

Thus, having the gradients of $r(\pmb\theta)$ around $\pmb{\theta}$ as:

因此,在 $\pmb{\theta}$ 附近得到 $r(\pmb\theta)$ 的梯度为:

图片.png

the parameters can be updated using:

参数可通过以下方式更新:

图片.png

The convergence of Eq. 13 heavily depends on the selection of $\eta$ and the distribution of $I-\eta B H_{0}$ eigenvalues. Note that update rules Eqs. 7 and 13 coincides at $B=H_{0}^{-1}$ resulting the Newton’s method. As it is not always possible or desirable to obtain Hessian, several pre conditioners are suggested to adapt the information geometry of the parameter space.

式13的收敛性很大程度上取决于$\eta$的选择以及$I-\eta B H_{0}$特征值的分布。需要注意的是,当$B=H_{0}^{-1}$时,更新规则式7和式13会退化为牛顿法。由于获取Hessian矩阵并不总是可行或可取,因此建议采用若干预条件子来适应参数空间的信息几何结构。

In NGD, the pre conditioner is defined to be the inverse of Fisher Information matrix:

在NGD中,预条件子被定义为Fisher信息矩阵的逆矩阵:

$$
\begin{array}{r l}{F(\pmb\theta):=E_{\mathbf x,\mathbf y\sim p(\mathbf x,\mathbf y;\theta)}[\nabla_{\pmb\theta}\nabla_{\pmb\theta}^{\mathsf{T}}]}\ {=E_{\mathbf x\sim q(\mathbf x),\mathbf y\sim p(\mathbf y|\mathbf x;\pmb\theta)}[\nabla_{\pmb\theta}\nabla_{\pmb\theta}^{\mathsf{T}}]}\end{array}
$$

$$
\begin{array}{r l}{F(\pmb\theta):=E_{\mathbf x,\mathbf y\sim p(\mathbf x,\mathbf y;\theta)}[\nabla_{\pmb\theta}\nabla_{\pmb\theta}^{\mathsf{T}}]}\ {=E_{\mathbf x\sim q(\mathbf x),\mathbf y\sim p(\mathbf y|\mathbf x;\pmb\theta)}[\nabla_{\pmb\theta}\nabla_{\pmb\theta}^{\mathsf{T}}]}\end{array}
$$

where $p(\mathbf{x},\mathbf{y};\pmb\theta):=q(\mathbf{x})p(\mathbf{y}|\mathbf{x};\pmb\theta)$ and

其中 $p(\mathbf{x},\mathbf{y};\pmb\theta):=q(\mathbf{x})p(\mathbf{y}|\mathbf{x};\pmb\theta)$ 且

$$
\begin{array}{r}{\nabla_{\theta}:=-\nabla_{\theta}\log p(\mathbf{x},\mathbf{y};\theta).}\end{array}
$$

$$
\begin{array}{r}{\nabla_{\theta}:=-\nabla_{\theta}\log p(\mathbf{x},\mathbf{y};\theta).}\end{array}
$$

C. Neural Networks

C. 神经网络

A neural network is a mapping from the input space $\mathbb{X}$ to the output space $\mathbb{F}$ through a series of $m$ layers. Layer $k\in{1,\ldots,m}$ , projects $d{k-1}$ -dimensional input $\mathbf{x}{k-1}$ to $d{k}$ -dimensional output $\mathbf{x}{k}$ and can be expressed as:

神经网络是通过一系列 $m$ 层将输入空间 $\mathbb{X}$ 映射到输出空间 $\mathbb{F}$ 的映射关系。其中第 $k\in{1,\ldots,m}$ 层将 $d{k-1}$ 维输入 $\mathbf{x}{k-1}$ 投影为 $d{k}$ 维输出 $\mathbf{x}{k}$,其表达式为:

图片.png

where $\phi{k}$ is an element-wise non-linear function and $W{k}$ is the $d{k}\times d{k-1}$ -dimensional weight matrix. The bias is not explicitly mentioned as it could be the last column of $W{k}$ when $\mathbf{x}{k}$ has an extra unit element. Let the $\pmb{\theta}=[\pmb{\theta}{1},\dots,\pmb{\theta}{m}]$ to be the parameters of an $m$ -layer neural network formed by stacking $m$ vectors of dimension $d{k}d{k-1}$ for $k=1,\ldots,m$ and $\dim(\mathbf{x})=d{0}$ such that $\begin{array}{r}{\dim(\pmb{\theta})=\sum{k=1}^{m}d{k}d{k-1}}\end{array}$ . The parameters of the $k$ ’th layer, $\pmb{\theta}{k}=\mathrm{vec}(W{k})$ for $\operatorname{vec}(W{k})=$ $\left[\mathbf{w}{1},\ldots,\mathbf{w}{d{k}}\right]$ , is also shaped by piling up rows of $W{k}$ . Their gradients, $\nabla{\pmb{\theta}{k}}$ , could be written as:

其中 $\phi{k}$ 是逐元素的非线性函数,$W{k}$ 是 $d{k}\times d{k-1}$ 维的权重矩阵。偏置项未显式提及,因为当 $\mathbf{x}{k}$ 包含额外单位元素时,它可能是 $W{k}$ 的最后一列。设 $\pmb{\theta}=[\pmb{\theta}_{1},\dots,\pmb{\theta}{m}]$ 为一个 $m$ 层神经网络的参数,该网络通过堆叠 $m$ 个维度为 $d{k}d{k-1}$ 的向量构成($k=1,\ldots,m$),且 $\dim(\mathbf{x})=d{0}$,因此 $\begin{array}{r}{\dim(\pmb{\theta})=\sum{k=1}^{m}d{k}d{k-1}}\end{array}$。第 $k$ 层的参数 $\pmb{\theta}{k}=\mathrm{vec}(W{k})$(其中 $\operatorname{vec}(W{k})=$ $\left[\mathbf{w}_{1},\ldots,\mathbf{w}{d{k}}\right]$)也是通过堆叠 $W{k}$ 的行来构建的。它们的梯度 $\nabla{\pmb{\theta}{k}}$ 可表示为:

图片.png

for $d{k}\times d{k}d{k-1}$ -dimensional matrix $\partial\mathbf{x}{k}/\partial\pmb{\theta}{k}$ and $d{k}$ -dimensional vector $\partial l/\partial\mathbf{x}{k}$ .

对于 $d{k}\times d{k}d{k-1}$ 维矩阵 $\partial\mathbf{x}{k}/\partial\pmb{\theta}{k}$ 和 $d{k}$ 维向量 $\partial l/\partial\mathbf{x}{k}$。

D. Graph Neural Networks

D. 图神经网络

The Graph Neural Network (GNN) extends the NN mapping to the data represented in graph domains [20]. The basic idea is to use related samples when the adjacency information is available. In other words, the input to the $k$ ’th layer, $\mathbf{x}{k-1}$ is transformed into $\tilde{\mathbf{x}}{k-1}$ that take into the account unlabeled samples using the adjacency such that $p(\mathbf{x}{k-1},A)=p(\tilde{\mathbf{x}}{k-1})$ . Therefore, for each node $i=1,\ldots,n$ , the Eq. 17 can be written by a local transition function (or a single message passing step) as:

图神经网络 (Graph Neural Network, GNN) 将神经网络映射扩展到图域表示的数据 [20]。其核心思想是在邻接信息可用时利用相关样本。换句话说,第 $k$ 层的输入 $\mathbf{x}{k-1}$ 会被转换为 $\tilde{\mathbf{x}}{k-1}$ ,该转换通过邻接矩阵考虑未标记样本,使得 $p(\mathbf{x}{k-1},A)=p(\tilde{\mathbf{x}}{k-1})$ 。因此,对于每个节点 $i=1,\ldots,n$ ,公式 17 可通过局部转移函数(或单次消息传递步骤)表示为:

图片.png

where $\underline{{\mathbf{x}}}{k,i}$ denotes all the information coming from nodes connected to the $i$ th node at the $k$ th layer. The subscripts here are used to indicate both the layer and the node, i.e. $\mathbf{x}{k,i}$ means the state embedding of node $i$ in the layer $k$ . Also, the local transition Eq. 19, parameterized by $W{k}$ , is shared by all nodes that includes the information of the graph structure, and $\mathbf{x}{0,i}=\mathbf{x}{i}$ .

其中 $\underline{{\mathbf{x}}}_{k,i}$ 表示第 $k$ 层中与第 $i$ 个节点相连的所有节点传入的信息。此处下标同时表示层和节点,即 $\mathbf{x}{k,i}$ 代表第 $k$ 层中第 $i$ 个节点的状态嵌入。此外,由 $W{k}$ 参数化的局部转移方程 (Eq. 19) 被所有包含图结构信息的节点共享,且 $\mathbf{x}{0,i}=\mathbf{x}{i}$。

The Graph Convolutional Network (GCN) is a one of the GNN with the message passing operation as a linear approximation to spectral graph convolution, followed by a non-linear activation function as:

图卷积网络 (GCN) 是一种图神经网络 (GNN),其消息传递操作是对谱图卷积的线性近似,随后通过非线性激活函数实现:

图片.png

where $\phi_{k}$ is a element-wise nonlinear activation function such as $\mathrm{RELU}(x)=\operatorname*{max}(x,0),\mid$ $W_{k}$ is a $d_{k}\times d_{k-1}$ parameter matrix that needs to be estimated. $\tilde{A}$ denotes the normalized adjacency matrix defined by:

其中 $\phi_{k}$ 是逐元素非线性激活函数,例如 $\mathrm{RELU}(x)=\operatorname*{max}(x,0)$,$W_{k}$ 是需要估计的 $d_{k}\times d_{k-1}$ 参数矩阵。$\tilde{A}$ 表示由下式定义的归一化邻接矩阵:

$$
\tilde{A}=(D+I)^{-1/2}(A+I)(D+I)^{-1/2}
$$

$$
\tilde{A}=(D+I)^{-1/2}(A+I)(D+I)^{-1/2}
$$

to overcome the over fitting issue due to the small number of labeled samples $\bar{n}$ . In fact, a GCN layer is basically a NN (Eq. 17) where the input $\mathbf{x}{k-1}$ is initially updated into $\tilde{\mathbf{x}}{k-1}$ using a so-called re normalization trick such that $\begin{array}{r}{\tilde{\mathbf{x}}{k-1,i}=\sum{j=1}^{n}\tilde{a}{i,j}\mathbf{x}{k-1,i}}\end{array}$ where $\tilde{A}=[\tilde{a}{i,j}]$ . Comparing Eq. 20 with the more general Eq. 19, the local transition function $\mathbf{f}{k}$ is defined as a linear combination followed by a nonlinear activation function. For classifying $\mathbf{x}$ into $c$ classes, having a $c$ -dimensional $\mathbf{x}{m}$ as the output of the last layer with a Softmax activation function, the loss between the label $\mathbf{y}$ and the prediction $\mathbf{x}{m}$ becomes:

为克服因标记样本数量 $\bar{n}$ 过少导致的过拟合问题。实际上,GCN层本质上是一个神经网络(公式17),其中输入 $\mathbf{x}{k-1}$ 会通过所谓的重归一化技巧先更新为 $\tilde{\mathbf{x}}{k-1}$,即 $\begin{array}{r}{\tilde{\mathbf{x}}{k-1,i}=\sum{j=1}^{n}\tilde{a}{i,j}\mathbf{x}{k-1,i}}\end{array}$,其中 $\tilde{A}=[\tilde{a}{i,j}]$。将公式20与更通用的公式19进行比较,局部转移函数 $\mathbf{f}{k}$ 被定义为线性组合后接非线性激活函数。为了将 $\mathbf{x}$ 分类到 $c$ 个类别中,最后一层输出具有Softmax激活函数的 $c$ 维向量 $\mathbf{x}{m}$,标签 $\mathbf{y}$ 与预测 $\mathbf{x}{m}$ 之间的损失为:

图片.png

III. METHOD

III. 方法

The basic idea of preconditioning is to capture the relation between the gradients of parameters $\nabla_{\theta}$ . This relation can be as complete as a matrix $B$ (for example, NGD) representing the pairwise relation between elements of $\nabla_{\theta}$ or as simple as a weighting vector (for example, Adam) with the same size as $\nabla_{\pmb{\theta}}$ resulting in a diagonal $B$ . Considering the flow of gradients $\nabla_{\pmb{\theta},t}$ over the training time as input features, the goal of preconditioning is to extract useful features that help with the updating rule. One can consider the pre conditioner to be the expected value of $B(\mathbf{x},\mathbf{y})=[b_{i j}]^{-1}$ for

预处理的基本思想是捕捉参数梯度 $\nabla_{\theta}$ 之间的关系。这种关系可以是一个完整的矩阵 $B$ (例如NGD),表示 $\nabla_{\theta}$ 元素之间的成对关系;也可以是一个简单的权重向量 (例如Adam),其大小与 $\nabla_{\pmb{\theta}}$ 相同,从而得到一个对角矩阵 $B$。将训练过程中梯度流 $\nabla_{\pmb{\theta},t}$ 视为输入特征,预处理的目标是提取有助于更新规则的有用特征。可以将预处理器视为 $B(\mathbf{x},\mathbf{y})=[b_{i j}]^{-1}$ 的期望值。

$$
b_{i j}=b_{i,j}(\mathbf{x},\mathbf{y})=b(\nabla_{\theta i}||\nabla_{\theta j})^{1}.
$$

$$
b_{i j}=b_{i,j}(\mathbf{x},\mathbf{y})=b(\nabla_{\theta i}||\nabla_{\theta j})^{1}.
$$

In methods with a diagonal pre conditioner like Adam, $B({\bf x},{\bf y})=\mathrm{diag}(\nabla_{\pmb\theta}\odot\nabla_{\pmb\theta})$ , the pairwise relation between gradients is neglected. Pre conditioners like Hessian inverse in Newton’s method with the form of $b_{i j}=\partial\nabla_{\pmb{\theta}_{i}}/\partial\theta_{j}$ are based on the second derivative that encodes the cost curvature in the parameter space. In NGD and similar methods, this curvature is approximated using the second moment of gradient $b_{i j}=\nabla_{\pmb{\theta}_{i}}\nabla_{\pmb{\theta}_{j}}$ , as an approximation of Hessian, in some empirical cases (see [13] for a detailed discussion).

在采用对角预条件器(如Adam)的方法中,$B({\bf x},{\bf y})=\mathrm{diag}(\nabla_{\pmb\theta}\odot\nabla_{\pmb\theta})$,梯度间的成对关系被忽略。而牛顿法等采用Hessian逆形式的预条件器 $b_{i j}=\partial\nabla_{\pmb{\theta}_{i}}/\partial\theta_{j}$ 则基于二阶导数,该导数编码了参数空间中的代价曲率。在自然梯度下降(NGD)及类似方法中,该曲率通过梯度的二阶矩 $b_{i j}=\nabla_{\pmb{\theta}_{i}}\nabla_{\pmb{\theta}_{j}}$ 作为Hessian的近似进行估算(具体讨论可参见[13])。

In this section, a new preconditioning algorithm, motivated by natural gradient, is proposed for graph-based semi-supervised learning that improves the convergence of Adam and SGD with intuitive and insensitive hyper-parameters. The natural gradient is a concept from information geometry and stands for the steepest descent direction in the Riemannian manifold of probability distributions [1], where the distance in the distribution space is measured with a special Riemannian metric. This metric depends only on the properties of the distributions themselves and not their parameters, and in particular, it approximates the square root of the KL divergence within a small neighborhood [17]. Instead of measuring the distance between the parameters $\pmb{\theta}$ and $\pmb{\theta}^{\prime}$ , the cost is measured by the KL divergence between their distributions $p(\pmb\theta)$ and $p(\pmb\theta^{\prime})$ . Consequently, the steepest descent direction in the statistical manifold is the negative gradient preconditioned with the Fisher information matrix $F(\pmb\theta)$ . The validation cost on three different datasets is shown in Fig. 1 where preconditioning is applied to both Adam and SGD.

本节提出了一种受自然梯度启发的新预处理算法,用于基于图的半监督学习,该算法通过直观且不敏感的超参数提升了Adam和SGD的收敛性。自然梯度源于信息几何学,代表概率分布黎曼流形中的最速下降方向 [1],其中分布空间的距离由特殊黎曼度量衡量。该度量仅取决于分布本身的特性而非其参数,尤其在小邻域内近似KL散度的平方根 [17]。不同于测量参数$\pmb{\theta}$与$\pmb{\theta}^{\prime}$间的距离,其代价通过二者分布$p(\pmb\theta)$与$p(\pmb\theta^{\prime})$的KL散度来衡量。因此,统计流形中的最速下降方向是经Fisher信息矩阵$F(\pmb\theta)$预处理的负梯度。图1展示了在三个不同数据集上对Adam和SGD应用预处理后的验证成本。


Fig. 1. The validation costs of four optimization methods on the second split of Citation datasets over 10 runs. A 2-layer GCN with a 64-dimensional hidden variable is used in all experiments. As shown in Fig. 1a, 1b, and 1c (upper row), the proposed Adam-KDAC methods (green and red curves) outperform vanilla Adam methods (blue and orange curves) on all three datasets. Also, Fig. 1d, 1e, and 1f (bottom row) reveal that the suggested SGD-KFAC methods (green and red curves) achieve a remarkably faster convergence than the vanilla SGD method (blue and orange curves) on all three datasets.

图 1: 四种优化方法在Citation数据集第二个划分上10次运行的验证成本。所有实验均使用具有64维隐藏变量的2层GCN。如图1a、1b和1c(上行)所示,提出的Adam-KDAC方法(绿色和红色曲线)在所有三个数据集上均优于原始Adam方法(蓝色和橙色曲线)。此外,图1d、1e和1f(下行)显示,建议的SGD-KFAC方法(绿色和红色曲线)在所有三个数据集上比原始SGD方法(蓝色和橙色曲线)实现了显著更快的收敛。

As the original NGD (Eq. 14) is defined based on a prediction function with access only to a single sample, $p(\mathbf{y}|\mathbf{f}(\mathbf{x};\pmb\theta))$ , Fisher information matrix with the presence of the adjacency distribution becomes:

由于原始NGD (式14) 是基于仅能访问单个样本的预测函数 $p(\mathbf{y}|\mathbf{f}(\mathbf{x};\pmb\theta))$ 定义的,引入邻接分布后的Fisher信息矩阵变为:

$$
\begin{array}{r l}&{F(\pmb\theta)=E_{\mathbf x,\underline{x}^{\prime},\underline{a},\mathbf y\sim p(\mathbf x,\underline{x}^{\prime},\underline{a},\mathbf y;\pmb\theta)}[\nabla_{\b\theta}\nabla_{\b\theta}^{\top}]}\ &{\qquad=E_{\mathbf x,\underline{x}^{\prime}\sim q(\mathbf x),\underline{a}\sim q({\boldsymbol a}|\mathbf x,\mathbf x^{\prime}),\mathbf y\sim p(\mathbf y|\mathbf x,\underline{x}^{\prime},\underline{a};\pmb\theta)}[\nabla_{\b\theta}\nabla_{\b\theta}^{\top}].}\end{array}
$$

$$
\begin{array}{r l}&{F(\pmb\theta)=E_{\mathbf x,\underline{x}^{\prime},\underline{a},\mathbf y\sim p(\mathbf x,\underline{x}^{\prime},\underline{a},\mathbf y;\pmb\theta)}[\nabla_{\b\theta}\nabla_{\b\theta}^{\top}]}\ &{\qquad=E_{\mathbf x,\underline{x}^{\prime}\sim q(\mathbf x),\underline{a}\sim q({\boldsymbol a}|\mathbf x,\mathbf x^{\prime}),\mathbf y\sim p(\mathbf y|\mathbf x,\underline{x}^{\prime},\underline{a};\pmb\theta)}[\nabla_{\b\theta}\nabla_{\b\theta}^{\top}].}\end{array}
$$

With $n$ samples of $q(\mathbf{x})$ , i.e. $X$ and $n^{2}$ samples of $q(a|X)$ , i.e. $A$ , Fisher can be estimated as:

利用 $q(\mathbf{x})$ 的 $n$ 个样本 $X$ 和 $q(a|X)$ 的 $n^{2}$ 个样本 $A$,Fisher 可估计为:

$$
\begin{array}{r}{\hat{F}(\pmb{\theta})=E_{\mathbf{y}\sim p(\mathbf{y}|X,A;\pmb{\theta})}[\nabla_{\pmb{\theta}}\nabla_{\pmb{\theta}}^{\mathsf{T}}],}\end{array}
$$

$$
\begin{array}{r}{\hat{F}(\pmb{\theta})=E_{\mathbf{y}\sim p(\mathbf{y}|X,A;\pmb{\theta})}[\nabla_{\pmb{\theta}}\nabla_{\pmb{\theta}}^{\mathsf{T}}],}\end{array}
$$

where

其中

$$
\nabla_{\pmb{\theta}}=-\nabla_{\pmb{\theta}}\log p(X,A,\mathbf{y};\pmb{\theta}).
$$

$$
\nabla_{\pmb{\theta}}=-\nabla_{\pmb{\theta}}\log p(X,A,\mathbf{y};\pmb{\theta}).
$$

In fact, to evaluate the expectation in Eq. 26, $q(X)$ and $q(A|X)$ are approximated with $\hat{q}(X)$ and $\hat{q}(A|X)$ using ${{\bf x}{i}}{i=1}^{n}$ and $A$ , respectively. However, there are only $\bar{n}$ samples from $\hat{q}(\mathbf{y}|\mathbf{x}{j})$ as an approximation of $q(\mathbf{y}|\mathbf{x}{j})$ for the following replacement:

事实上,为了计算式 (26) 中的期望值,我们用 $\hat{q}(X)$ 和 $\hat{q}(A|X)$ 分别近似 $q(X)$ 和 $q(A|X)$ ,其中使用了 ${{\bf x}{i}}{i=1}^{n}$ 和 $A$ 。然而,对于后续替换中的 $q(\mathbf{y}|\mathbf{x}{j})$ ,仅有 $\bar{n}$ 个来自 $\hat{q}(\mathbf{y}|\mathbf{x}{j})$ 的样本作为其近似。

$$
p(\mathbf{y}|X,A;\pmb{\theta})\approx\hat{q}(\mathbf{y}|\mathbf{x}_{i}).
$$

$$
p(\mathbf{y}|X,A;\pmb{\theta})\approx\hat{q}(\mathbf{y}|\mathbf{x}_{i}).
$$

Therefore, an empirical Fisher can be obtained by

因此,可通过经验方法获得Fisher矩阵

for

为了

图片.png

From the computation perspective, the matrix $B_{i}(\pmb\theta)$ can be very large, for example, in neural networks with multiple layers, the parameters could be huge, so it needs to be approximated too. In networks characterized with Eqs. 17 or 20, a simple solution would be ignoring the cross-layer terms so that $B_{i}(\pmb\theta)^{-1}$ and consequently $B_{i}(\pmb\theta)$ turns into a block-diagonal matrix:

从计算角度来看,矩阵 $B_{i}(\pmb\theta)$ 可能非常大,例如在多层神经网络中,参数量可能极其庞大,因此也需要进行近似处理。对于式17或式20描述的网络,一个简单的解决方案是忽略跨层项,使得 $B_{i}(\pmb\theta)^{-1}$ 及其对应的 $B_{i}(\pmb\theta)$ 转化为块对角矩阵:

图片.png

In KFAC, the diagonal block $B{k,i}$ , corresponded to $k$ ’th layer with the dimension $d{k}d{k-1}\times d{k}d{k-1}$ , is approximated with the Kronecker product of the inverse of two smaller matrices $U{k,i}$ and $V{k,i}$ as:

在KFAC中,对应于第$k$层、维度为$d{k}d{k-1}\times d{k}d{k-1}$的对角块$B{k,i}$,通过两个较小矩阵$U{k,i}$和$V{k,i}$的逆的Kronecker积近似表示为:
图片.png

For $\nabla{\pmb{\theta},i}=[\nabla{\pmb{\theta}{1},i}^{\mathsf{T}},\cdot\cdot\cdot,\nabla{\pmb{\theta}{m},i}^{\mathsf{T}}]^{\mathsf{T}}$ , the preconditioned gradient $B{k,i}\nabla{\theta{k},i}$ can be computed using the identity

对于 $\nabla{\pmb{\theta},i}=[\nabla{\pmb{\theta}{1},i}^{\mathsf{T}},\cdot\cdot\cdot,\nabla{\pmb{\theta}{m},i}^{\mathsf{T}}]^{\mathsf{T}}$,预条件梯度 $B{k,i}\nabla{\theta{k},i}$ 可通过恒等式计算

Noting that:

请注意:

图片.png

$U_{k}$ and $V{k}$ blocks are approximated with the expected values of $\mathbf{u}{k,i}\mathbf{u}{k,i}^{\mathsf{T}}$ and $\mathbf{v}{k,i}\mathbf{v}{k,i}^{\mathsf{T}}$ respectively where $\mathrm{dim}({\mathbf{u}}{k})={d}{k}$ , $\dim(\mathbf{v}{k})=d{k-1}$ . Finally, $U{k}^{-1}$ and ${\cal V}{k}^{-1}$ are evaluated by taking inverses of $U{k}+\epsilon^{-1/2}$ and $V{k}+\epsilon^{-1/2}$ for $\epsilon$ being the regular iz ation hyper-parameter.

$U{k}$ 和 $V{k}$ 块分别用 $\mathbf{u}{k,i}\mathbf{u}{k,i}^{\mathsf{T}}$ 和 $\mathbf{v}{k,i}\mathbf{v}{k,i}^{\mathsf{T}}$ 的期望值近似,其中 $\mathrm{dim}({\mathbf{u}}{k})={d}{k}$,$\dim(\mathbf{v}{k})=d{k-1}$。最后,通过取 $U{k}+\epsilon^{-1/2}$ 和 $V{k}+\epsilon^{-1/2}$ 的逆来计算 $U{k}^{-1}$ 和 ${\cal V}{k}^{-1}$,其中 $\epsilon$ 是正则化超参数。

For a graph with $n$ nodes, adjacency matrix $A$ , and the training set ${(\mathbf{x}{i},\mathbf{y}{i})}{i=1}^{\bar{n}}+{\mathbf{x}{i}}{i=\bar{n}+1}^{n}.$ , $U{k}$ and $V{k}$ are estimated in two ways: (1) using only $\bar{n}$ labeled samples, and (2) including $n-\bar{n}$ unlabeled samples. In the first method, $U{k}$ and $V{k}$ are estimated by:

对于一个具有 $n$ 个节点、邻接矩阵 $A$ 以及训练集 ${(\mathbf{x}{i},\mathbf{y}{i})}{i=1}^{\bar{n}}+{\mathbf{x}{i}}{i=\bar{n}+1}^{n}.$ 的图,$U{k}$ 和 $V{k}$ 通过两种方式估算:(1) 仅使用 $\bar{n}$ 个标注样本,(2) 包含 $n-\bar{n}$ 个未标注样本。在第一种方法中,$U{k}$ 和 $V{k}$ 通过以下方式估算:

图片.png

Note that both $\partial l/\partial X{k}$ and $\phi{k}^{\prime}(W{k}\tilde{X}{k-1})$ are $d{k}\times n$ matrices and the last $n-\bar{n}$ columns of $\partial l/\partial X{k}$ are zero. However, as unlabeled samples are not used in the first method, one needs to evaluate loss function for $i={\bar{n}}+1,\dots,n$ , which can be done by sampling $\hat{\mathbf{y}}{i}$ from $p(\mathbf{y}|\mathbf{x};\pmb\theta)$ . In the second method, these new samples are added to the empirical cost as

注意到 $\partial l/\partial X{k}$ 和 $\phi{k}^{\prime}(W{k}\tilde{X}{k-1})$ 都是 $d{k}\times n$ 矩阵,且 $\partial l/\partial X{k}$ 的最后 $n-\bar{n}$ 列为零。然而,由于第一种方法未使用未标注样本,需要为 $i={\bar{n}}+1,\dots,n$ 评估损失函数,这可以通过从 $p(\mathbf{y}|\mathbf{x};\pmb\theta)$ 中采样 $\hat{\mathbf{y}}{i}$ 来实现。在第二种方法中,这些新样本被添加到经验成本中作为

图片.png

where $0\leq\lambda\leq1$ denotes the regular iz ation hyper-parameter for controlling the cost of predicted labels and $\lambda=0$ results the first method. As the prediction improves over the course of training, $\lambda$ can be a function of iteration $t$ , for example here, it is defined to be:

其中 $0\leq\lambda\leq1$ 表示用于控制预测标签成本的正则化超参数,$\lambda=0$ 对应第一种方法。随着训练过程中预测效果的提升,$\lambda$ 可以是迭代次数 $t$ 的函数,例如在此定义为:

$$
\lambda(t):=\left(\frac{t}{\operatorname*{max}(t)}\right)^{\gamma},
$$

$$
\lambda(t):=\left(\frac{t}{\operatorname*{max}(t)}\right)^{\gamma},
$$

where $\operatorname*{max}(t)$ shows the maximum number of iterations and $\gamma$ is the replaced regular iz ation hyper-parameter. Algorithm 1 shows the preconditioning step for modifying gradients of each layer at any iteration such that gradients are first, transformed using two matrices of ${\cal V}{k}^{-1}$ and $U{k}^{-1}$ , then sent to the optimization algorithm for updating parameters.

其中 $\operatorname*{max}(t)$ 表示最大迭代次数,$\gamma$ 为替换的正则化超参数。算法1展示了在每次迭代中修改各层梯度的预处理步骤:梯度首先通过 ${\cal V}{k}^{-1}$ 和 $U{k}^{-1}$ 两个矩阵进行变换,随后被送入优化算法以更新参数。

A. Relation between Fisher and Hessian

A. Fisher与Hessian矩阵的关系

The Hessian of the cost function:

代价函数的 Hessian 矩阵:

$$
H_{\pmb\theta}r(\pmb\theta)=E_{X,A,\mathbf y\sim p(X,A,\mathbf y;\pmb\theta)}[H_{\pmb\theta}l(\mathbf y,f(X,A;\pmb\theta))]
$$

$$
H_{\pmb\theta}r(\pmb\theta)=E_{X,A,\mathbf y\sim p(X,A,\mathbf y;\pmb\theta)}[H_{\pmb\theta}l(\mathbf y,f(X,A;\pmb\theta))]
$$

Algorithm 1 Semi-Supervised Preconditioning

Require: VWkGradient of parameters for k = 1, .. . , m
Require: AAdjacency matrix
Require: DDegree matrix
Require: ZTraining mask vector
Require: ∈, 入Regularization hyper-parameters

算法 1: 半监督预处理

| Require: VWk | 参数梯度, k = 1, ..., m |
| Require: A | 邻接矩阵 |
| Require: D | 度矩阵 |
| Require: Z | 训练掩码向量 |
| Require: ∈, λ | 正则化超参数 |

$$
\begin{array}{r l}&{n=\dim(\mathbf{z})}\ &{\bar{n}=\sum(\mathbf{z})}\ &{\tilde{A}=(D+I)^{-1/2}(A+I)(D+I)^{-1/2}=[\tilde{a}_{i j}]}\end{array}
$$

$$
\begin{array}{r l}&{n=\dim(\mathbf{z})}\ &{\bar{n}=\sum(\mathbf{z})}\ &{\tilde{A}=(D+I)^{-1/2}(A+I)(D+I)^{-1/2}=[\tilde{a}_{i j}]}\end{array}
$$

for $k=1,\ldots,m$ do

对于 $k=1,\ldots,m$ 执行

图片.png

function INVERSE $(X)$ output $(X+\epsilon^{-1/2}I)^{-1}$

函数 INVERSE $(X)$ 输出 $(X+\epsilon^{-1/2}I)^{-1}$

can also be approximated using ${\hat{q}}(X),{\hat{q}}(A|X)$ , and ${\hat{q}}(\mathbf{y}|\mathbf{x}_{i})$ resulting the empirical Hessian to be

也可以使用 ${\hat{q}}(X)$、${\hat{q}}(A|X)$ 和 ${\hat{q}}(\mathbf{y}|\mathbf{x}_{i})$ 进行近似,从而得到经验Hessian矩阵为

图片.png

which is equivalent to the empirical Fisher Eq. 31 when $p(X,A,\mathbf{y};\pmb\theta)$ is estimated with $\hat{q}(X)\hat{q}(A|X)\hat{q}(\mathbf{y}|\mathbf{x}_{i})$ for $i=1,\ldots,\bar{n}$ (see Lemma 1 in the appendix).

这等价于经验Fisher方程(式31),当$p(X,A,\mathbf{y};\pmb\theta)$用$\hat{q}(X)\hat{q}(A|X)\hat{q}(\mathbf{y}|\mathbf{x}_{i})$估计时,其中$i=1,\ldots,\bar{n}$(参见附录中的引理1)。

IV. EXPERIMENTS

IV. 实验

In this section, the performance of the proposed algorithm is evaluated compared to Adam and SGD on several datasets for the task of node classification in single graphs. The task is assumed to be trans duct ive when all the features are available for training but only a portion of labels are used in the training. First, a detailed description of datasets and the model architecture are provided. Then, the general optimization setup, commonly used for the node classification, is specified. The last part includes the sensitivity to hyper-parameter and training time analysis in addition to validation cost convergence and the test accuracy. All the experiments are conducted mainly using Pytorch [19] and Pytorch Geometric [5], two open-source Python libraries for automating differentiation and working with graph datasets.

在本节中,我们评估了所提出算法在单图节点分类任务上相对于Adam和SGD在多个数据集上的性能。该任务被设定为直推式学习 (transductive) ,即所有特征都可用于训练,但仅使用部分标签进行训练。首先,我们详细描述了数据集和模型架构。接着,说明了节点分类任务常用的通用优化设置。最后部分包括超参数敏感性分析、训练时间分析,以及验证成本收敛性和测试准确率。所有实验主要使用两个开源Python库实现:PyTorch [19] 和 PyTorch Geometric [5] ,它们分别用于自动微分和图数据集处理。

A. Datasets

A. 数据集

Three citation datasets with the statistics shown in Table II are used in the experiments [21]. Cora, CiteSeer, and PubMed are single graphs in which nodes and edges correspond to documents and citation links, respectively. A sparse feature vector (document keywords) and a class label are associated with each node. Several splits of these datasets are used in the node classification task. The first split, 20 instances are randomly selected for training, 500 for validation, and 1000 for the test; the rest of the labels are not used [23]. In the second split, all nodes except $500{+}1000$ validation and test nodes are used for the training [3]. To evaluate the over fitting behavior, the third split exploits all labels for training excluding $500+500$ nodes for the validation and test [15].

实验中使用了三个引文数据集,其统计信息如表 II 所示 [21]。Cora、CiteSeer 和 PubMed 是单图结构,其中节点和边分别对应文档和引用链接。每个节点关联一个稀疏特征向量(文档关键词)和一个类别标签。在节点分类任务中采用了这些数据集的多种划分方式:

  • 第一种划分:随机选择 20 个样本用于训练,500 个用于验证,1000 个用于测试,其余标签不作使用 [23];
  • 第二种划分:除 $500{+}1000$ 个验证集和测试集节点外,其余节点均用于训练 [3];
  • 第三种划分:为评估过拟合行为,使用除 $500+500$ 个验证集和测试集节点外的全部标签进行训练 [15]。

TABLE II CITATION NETWORK DATASETS STATISTICS

DatasetNodesEdgesClassesFeatures
Citeseer3,327473263,703
Cora2,7085,42971,433
Pubmed19,71744,3383500

表 II 引用网络数据集统计

数据集 节点数 边数 类别数 特征数
Citeseer 3,327 4,732 6 3,703
Cora 2,708 5,429 7 1,433
Pubmed 19,717 44,338 3 500

B. Architectures

B. 架构

In the node classification using a NN followed by Softmax function (Eq. 5), the class with maximum probability is chosen to be the predicted node label. A 2-layer GCN with a 64- dimensional hidden variable is used for comparing different optimization methods. In the first

在使用神经网络接Softmax函数(式5)进行节点分类时,选择概率最大的类别作为预测节点标签。为了比较不同优化方法,我们采用了具有64维隐藏变量的2层GCN。在第一...

layer, the activation function ReLU is followed by a dropout function with a rate of 0.5. The loss function is evaluated as the negative log-likelihood of Softmax (Eq. 5) of the last layer.

在每一层中,激活函数 ReLU 后接一个丢弃率为 0.5 的 dropout 函数。损失函数计算为最后一层 Softmax (式 5) 的负对数似然。

C. Optimization

C. 优化

The weights of parameters are initialized like the original GCN [10] and input vectors are row-normalized accordingly [7]. The model is trained for 200 epochs without any early stopping and the learning rate of 0.01. The Adam and SGD are used with the weight decay of $5\times10^{-4}$ and the momentum of 0.9, respectively.

参数权重按照原始GCN [10] 的方式进行初始化,输入向量相应地进行了行归一化 [7]。模型训练200个周期,未采用早停策略,学习率为0.01。优化器分别采用Adam和SGD,权重衰减为 $5\times10^{-4}$,动量设置为0.9。

D. Results

D. 结果

The optimization performance is measured by both the minimum validation cost and the test accuracy for the best validation cost. The validation cost of training a 2-layer GCN with a 64-dimensional hidden variable is used for comparing optimization methods (Adam and SGD) with their preconditioned version (Adam-KFAC and SGD-KFAC). For each method, unlabeled samples are used in the training process with a ratio controlled by $\gamma$ . Fig. 1 shows the validation cost of four methods based on Adam (upper row) and SGD (bottom row) for all three Citation datasets. The test accuracy of a 2-layer GCN trained using four different methods on three split are shown in Tab. III, IV, and V. Reported values of test accuracy in tables are averages and $95%$ confidence intervals over 10 runs for the best hyper-parameters tuned on the second split of the CiteSeer dataset. Note that the test accuracy may not always reflect the performance of the optimization method as the objective function (cross-entropy) is not the same as the prediction function (argmax). However, in most cases, the proposed method achieves better accuracy compared to Adam (the first row in all tables). As a fixed learning rate 0.01 is used in all methods, SGD has a very slow convergence and does not provide competitive results.

优化性能通过最佳验证成本下的最小验证成本和测试准确率来衡量。采用训练具有64维隐藏变量的2层GCN(图卷积网络)的验证成本,用于比较优化方法(Adam和SGD)与其预处理版本(Adam-KFAC和SGD-KFAC)。每种方法在训练过程中使用的未标记样本比例由$\gamma$控制。图1展示了基于Adam(上排)和SGD(下排)的四种方法在所有三个Citation数据集上的验证成本。表III、IV和V分别展示了使用四种不同方法训练的2层GCN在三组数据划分上的测试准确率。表中报告的测试准确率值为在CiteSeer数据集第二组划分上调优最佳超参数后,10次运行的平均值及$95%$置信区间。需注意,由于目标函数(交叉熵)与预测函数(argmax)不同,测试准确率并不总能反映优化方法的性能。但在大多数情况下,所提方法相比Adam(所有表格第一行)获得了更高的准确率。由于所有方法均采用固定学习率0.01,SGD收敛速度极慢且未能提供有竞争力的结果。

The importance of hyper-parameters $\epsilon$ , $\gamma$ are shown in Fig. 2. Figures 2a and 2d depict the sensitivity of Adam and SGD to the $\epsilon$ parameter, respectively. As the inverse of $\epsilon$ directly affects the same factor as the learning rate $\eta$ , the smaller the $\epsilon$ , the faster the convergence. However, choosing very small $\epsilon$ results in larger confidence intervals which are not desirable. The effect of $\gamma$ on Adam and SGD are depicted in figures 2b and 2e, respectively. When using Adam, due to its faster convergence compared to SGD, smaller $\gamma$ , i.e. using more predictions leads to much wider confidence intervals. In other words, the training process dominated by more labels results in a more stable convergence with a smaller variance. Thus, for a stable estimation, $\lambda$ or $\gamma$ must be tuned with respect to the optimization algorithm because of their sensitivity to the convergence rate. Since the Fisher matrix does not change considerably at each iteration, an experiment is performed to explore the sensitivity of validation loss to the frequency of updating Fisher. In Figures 2c and 2f, the validation cost over time is evaluated for updating Fisher every $4,8,\ldots,128$ iterations. When Fisher is updated more frequently, its computation takes more time hence the training process lasts longer (having other hyper-parameters fixed). Increasing the update frequency does not affect the performance to some extent, however, it largely reduces the training time. As updating Fisher every 50 or 100 iterations, does not affect the final validation cost to a great extent, to speed up the training process, Fisher is updated every 50 epochs in all of the experiments.

超参数 $\epsilon$ 和 $\gamma$ 的重要性如图 2 所示。图 2a 和图 2d 分别展示了 Adam 和 SGD 对 $\epsilon$ 参数的敏感性。由于 $\epsilon$ 的倒数直接影响与学习率 $\eta$ 相同的因子,$\epsilon$ 越小,收敛速度越快。然而,选择过小的 $\epsilon$ 会导致置信区间增大,这是不可取的。$\gamma$ 对 Adam 和 SGD 的影响分别在图 2b 和图 2e 中展示。使用 Adam 时,由于其收敛速度比 SGD 快,较小的 $\gamma$(即使用更多预测)会导致置信区间显著变宽。换句话说,由更多标签主导的训练过程会带来方差更小的稳定收敛。因此,为了稳定估计,必须根据优化算法调整 $\lambda$ 或 $\gamma$,因为它们对收敛速度非常敏感。

由于 Fisher 矩阵在每次迭代中变化不大,我们通过实验探索验证损失对 Fisher 更新频率的敏感性。在图 2c 和图 2f 中,评估了每 $4,8,\ldots,128$ 次迭代更新 Fisher 时的验证成本随时间的变化。Fisher 更新越频繁,计算耗时越长,训练过程持续时间也越长(其他超参数固定时)。增加更新频率在一定程度上不会影响性能,但会大幅减少训练时间。由于每 50 或 100 次迭代更新 Fisher 对最终验证成本影响较小,为了加速训练过程,所有实验中 Fisher 均每 50 个周期更新一次。

TABLE III THE TEST ACCURACY OF FOUR OPTIMIZATION METHODS ON THE FIRST SPLIT OF CITATION DATASETS OVER 10 RUNS. A 2-LAYER GCN WITH A 64-DIMENSIONAL HIDDEN VARIABLE IS USED IN ALL EXPERIMENTS.

CiteSeerCoraPubmed
Adam71.66 ± 0.6181.20 ± 0.2579.72 ± 0.30
Adam74.28 ± 0.6782.42± 0.3380.06 ± 0.34
Adam-KFAC71.94 ± 0.5381.68 ± 0.2579.48 ± 0.28
Adam-KFAC70.24 ± 0.6682.84 ± 0.8776.94 ± 0.59
SGD20.38±8.9223.14 ± 5.1745.76 ± 3.04
SGDr17.64 ± 6.1817.26 ± 8.4146.20 ± 4.35
SGD-KFACe71.82 ± 0.4882.06 ± 0.3477.20 ± 0.63
SGD-KFAC-73.52±0.2281.70 ± 0.7979.20 ± 0.29

表 III 四种优化方法在CITATION数据集第一个划分上运行10次的测试准确率。所有实验均使用具有64维隐藏变量的2层GCN。

CiteSeer Cora Pubmed
Adam 71.66 ± 0.61 81.20 ± 0.25 79.72 ± 0.30
Adam 74.28 ± 0.67 82.42 ± 0.33 80.06 ± 0.34
Adam-KFAC 71.94 ± 0.53 81.68 ± 0.25 79.48 ± 0.28
Adam-KFAC 70.24 ± 0.66 82.84 ± 0.87 76.94 ± 0.59
SGD 20.38 ± 8.92 23.14 ± 5.17 45.76 ± 3.04
SGDr 17.64 ± 6.18 17.26 ± 8.41 46.20 ± 4.35
SGD-KFACe 71.82 ± 0.48 82.06 ± 0.34 77.20 ± 0.63
SGD-KFAC- 73.52 ± 0.22 81.70 ± 0.79 79.20 ± 0.29

TABLE IV THE TEST ACCURACY OF FOUR OPTIMIZATION METHODS ON THE SECOND SPLIT OF CITATION DATASETS OVER 10 RUNS. A 2-LAYER GCN WITH A 64-DIMENSIONAL HIDDEN VARIABLE IS USED IN ALL EXPERIMENTS.

CiteSeerCoraPubmed
Adam78.68 ± 0.8387.36 ± 0.4787.78 ± 0.14
Adam77.98 ± 0.3987.28 ± 0.3487.52 ± 0.30
Adam-KFAC79.50 ± 0.1587.60 ± 0.2088.46 ± 0.24
Adam-KFAC79.42 ± 0.3286.60 ± 0.3087.88 ± 0.16
SGD20.80 ± 2.1231.90 ± 0.0043.22 ± 1.42
SGD20.96 ± 5.2231.90 ± 0.0040.82 ± 0.33
SGD-KFACe79.48 ± 0.4087.54± 0.4389.08 ± 0.18
SGD-KFAC77.32 ± 0.2787.42 ± 0.2488.18 ± 0.30

表 IV 四种优化方法在引用数据集第二划分上10次运行的测试准确率。所有实验均使用具有64维隐藏变量的2层GCN。

CiteSeer Cora Pubmed
Adam 78.68 ± 0.83 87.36 ± 0.47 87.78 ± 0.14
Adam 77.98 ± 0.39 87.28 ± 0.34 87.52 ± 0.30
Adam-KFAC 79.50 ± 0.15 87.60 ± 0.20 88.46 ± 0.24
Adam-KFAC 79.42 ± 0.32 86.60 ± 0.30 87.88 ± 0.16
SGD 20.80 ± 2.12 31.90 ± 0.00 43.22 ± 1.42
SGD 20.96 ± 5.22 31.90 ± 0.00 40.82 ± 0.33
SGD-KFACe 79.48 ± 0.40 87.54 ± 0.43 89.08 ± 0.18
SGD-KFAC 77.32 ± 0.27 87.42 ± 0.24 88.18 ± 0.30

TABLE V THE TEST ACCURACY OF FOUR OPTIMIZATION METHODS ON THE THIRD SPLIT OF CITATION DATASETS OVER 10 RUNS. A 2-LAYER GCN WITH A 64-DIMENSIONAL HIDDEN VARIABLE IS USED IN ALL EXPERIMENTS.

CiteSeerCoraPubmed
Adam79.80 ± 0.6689.44 ± 0.4187.16 ± 0.71
Adam79.64 ± 0.3289.60 ± 0.9187.44 ± 0.27
Adam-KFAC80.52±0.1490.16 ± 0.5987.84 ± 0.21
Adam-KFAC80.52 ± 0.2289.24 ± 0.6487.36 ± 0.37
SGD15.04 ± 1.7032.80 ± 0.0041.96 ± 0.44
SGDr16.12 ± 5.3032.80 ± 0.0041.20 ± 0.00
SGD-KFACe79.76 ± 0.7589.88 ± 0.1489.36 ± 0.57
SGD-KFAC78.52 ± 0.2888.72 ± 0.3887.88 ± 0.80

表 V 四种优化方法在引用数据集第三次划分上运行10次的测试准确率。所有实验均使用具有64维隐藏变量的2层GCN。

CiteSeer Cora Pubmed
Adam 79.80 ± 0.66 89.44 ± 0.41 87.16 ± 0.71
Adam 79.64 ± 0.32 89.60 ± 0.91 87.44 ± 0.27
Adam-KFAC 80.52±0.14 90.16 ± 0.59 87.84 ± 0.21
Adam-KFAC 80.52 ± 0.22 89.24 ± 0.64 87.36 ± 0.37
SGD 15.04 ± 1.70 32.80 ± 0.00 41.96 ± 0.44
SGDr 16.12 ± 5.30 32.80 ± 0.00 41.20 ± 0.00
SGD-KFACe 79.76 ± 0.75 89.88 ± 0.14 89.36 ± 0.57
SGD-KFAC 78.52 ± 0.28 88.72 ± 0.38 87.88 ± 0.80


Fig. 2. The sensitivity of $\epsilon,\gamma$ , and updating frequency on validation costs of Adam-KFAC (upper) and SGD-KFAC (below) when training on the second split of CiteSeer dataset over 10 runs. A 2-layer GCN with a 64-dimensional hidden variable is used in all experiments. Fig. 2a and 2d show that smaller $\epsilon$ results in a faster convergence with a probable cost of larger variance as it inversely scales the same factor as the learning rate. As depicted in Fig. 2b and 2e, the larger the $\gamma$ , the more stable the convergence (the more confined confidence intervals). Finally, it can be seen in Fig. 2c and 2f that since performances are similar under different updating frequencies, selecting a relatively large frequency (50) can reduce the training time substantially.

图 2: 在 CiteSeer 数据集第二划分上训练 10 次时,Adam-KFAC (上) 和 SGD-KFAC (下) 的验证成本对 $\epsilon,\gamma$ 及更新频率的敏感性。所有实验均使用具有 64 维隐藏变量的 2 层 GCN。图 2a 和 2d 表明,较小的 $\epsilon$ 会带来更快的收敛速度,但可能以更大方差为代价,因为它与学习率成反比缩放同一因子。如图 2b 和 2e 所示,$\gamma$ 越大,收敛越稳定 (置信区间越窄)。最后从图 2c 和 2f 可见,由于不同更新频率下性能相似,选择较大频率 (50) 可大幅减少训练时间。

To examine the time complexity of the proposed method, the validation costs of AdamKFAC and SGD-KFAC are compared with Adam and SGD when training on the second split of Citation datasets with respect to the training time for 200 epochs (Fig. 3). The training on

为检验所提方法的时间复杂度,将AdamKFAC和SGD-KFAC与Adam和SGD在Citation数据集第二划分上进行200轮训练时的验证耗时进行了对比 (图3)。训练过程中...

Cora and PubMed (Fig. 3b and 3c) takes a shorter time compared to the training on CitSeer (Fig. 3a) mainly because of the dimension of input features as it directly enlarges the size of the Fisher matrix. As shown in Fig. 3, the proposed SGD-KFAC method (red curve) converges much faster than the vanilla SGD as expected. Surprisingly, SGD-KFAC outperforms Adam and even Adam-KFAC methods in all datasets implying that the naive SGD with a natural gradient pre conditioner can lead to a faster convergence than Adam-based methods. Another interesting observation is that Adam-based methods demonstrate similar performances in all experiments making them independent of the dataset while SGD-based methods show different over fitting behavior.

Cora和PubMed数据集(图3b和3c)的训练时间比CitSeer(图3a)更短,这主要归因于输入特征的维度,因为它会直接扩大Fisher矩阵的规模。如图3所示,我们提出的SGD-KFAC方法(红色曲线)如预期那样比普通SGD收敛得更快。令人惊讶的是,在所有数据集中SGD-KFAC都优于Adam甚至Adam-KFAC方法,这表明带有自然梯度预调节器的朴素SGD可以比基于Adam的方法实现更快的收敛。另一个有趣的观察是,基于Adam的方法在所有实验中表现出相似的性能,使其独立于数据集,而基于SGD的方法则显示出不同的过拟合行为。


Fig. 3. The validation costs of four optimization methods with respect to the training time on the second split of Citation datasets over 10 runs. A 2-layer GCN with a 64-dimensional hidden variable is used in all experiments. The proposed SGD-KFAC method shows the highest convergence rate among all other methods and it is slightly faster than Adam-KFAC.

图 3: 四种优化方法在Citation数据集第二划分上经过10次运行的验证成本随训练时间变化情况。所有实验均使用具有64维隐藏变量的2层GCN。提出的SGD-KFAC方法在所有方法中表现出最高的收敛速度,且略快于Adam-KFAC。

V. CONCLUSION

V. 结论

In this work, we introduced a novel optimization framework for graph-based semi-supervised learning. After the distinct definition of semi-supervised problems with the adjacency distribution, we provided a comprehensive review of topics like semi-supervised learning, graph neural network, and preconditioning optimization (and NGD as its especial case). We adopted a commonly used probabilistic framework covering least-squared regression and cross-entropy classification. In the node classification task, our proposed method showed to improve Adam and SGD not only in the validation cost but also in the test accuracy of GCN on three splits of Citation datasets. Extensive experiments were provided on the sensitivity to hyper-parameters and the time complexity. As the first work, to the best of our knowledge, on the preconditioned optimization of graph neural networks, we not only achieved the best test accuracy but also empirically showed that it can be used with both Adam and SGD.

在本研究中,我们提出了一种新颖的基于图的半监督学习优化框架。通过邻接分布明确定义半监督问题后,我们全面综述了半监督学习、图神经网络、预条件优化(及其特例NGD)等主题。采用涵盖最小二乘回归和交叉熵分类的通用概率框架,在节点分类任务中,所提方法在GCN对Citation数据集三种划分的验证成本和测试准确率上均优于Adam与SGD。我们深入分析了超参数敏感性和时间复杂度,并提供了大量实验。作为图神经网络预条件优化领域的开创性工作,我们不仅取得了最佳测试准确率,还通过实验验证了该方法可同时适配Adam和SGD优化器。

As the pre conditioner may significantly affect Adam, illustrating the relation between NGD and Adam and effectively combining them can be a promising direction for future work. We also aim to deploy faster approximation methods than KFAC like [6] and better sampling methods for exploiting unlabeled samples. Finally, since this work is mainly focused on single parameter layers, another possible research path would be adjusting KFAC to, for example, residual layers [8].

由于预条件子可能显著影响Adam,阐明NGD与Adam之间的关系并有效结合它们,是未来工作的一个潜在方向。我们也致力于部署比KFAC更快的近似方法(如[6]),以及更好的采样方法来利用未标记样本。最后,由于本研究主要关注单参数层,另一可能的研究路径是调整KFAC以适用于残差层等结构[8]。

ACKNOWLEDGMENT

致谢

YF and LL were partially supported by NSF grants DMS Career 1654579 and DMS 1854779.

YF 和 LL 部分得到了美国国家科学基金会 (NSF) 资助 DMS Career 1654579 和 DMS 1854779 的支持。

REFERENCES

参考文献

[23] Zhilin Yang, William Cohen, and Ruslan Salak hud in ov. “Revisiting semi-supervised learning with graph embeddings”. In: International conference on machine learning. 2016, pp. 40–48.

[23] Zhilin Yang、William Cohen 和 Ruslan Salakhutdinov。 "重新审视基于图嵌入的半监督学习"。见: 国际机器学习会议。 2016年, 第40-48页。

[24] Jie Zhou et al. “Graph neural networks: A review of methods and applications”. In: arXiv preprint arXiv:1812.08434 (2018).

[24] Jie Zhou等. "图神经网络(Graph Neural Networks): 方法与应用综述". 见于: arXiv预印本 arXiv:1812.08434 (2018).

APPENDIX

附录

Lemma 1. The expected value of the Hessian of $-\log p(X,A,\mathbf{y};\pmb\theta)$ is equal to Fisher information matrix, i.e.

引理1. 负对数似然函数 $-\log p(X,A,\mathbf{y};\pmb\theta)$ 的Hessian矩阵期望等于费雪信息矩阵 (Fisher information matrix) ,即

$$
-E_{X,A,\mathbf{y}\sim p(X,A,\mathbf{y};\theta)}[H_{\theta}\log p(X,A,\mathbf{y};\pmb{\theta})]=F
$$

$$
-E_{X,A,\mathbf{y}\sim p(X,A,\mathbf{y};\theta)}[H_{\theta}\log p(X,A,\mathbf{y};\pmb{\theta})]=F
$$

Proof. The Hessian of $f(\pmb\theta)$ can be written as the Jacobian of $\nabla_{\theta}f$ :

证明。$f(\pmb\theta)$ 的 Hessian 矩阵可表示为 $\nabla_{\theta}f$ 的 Jacobian 矩阵:

$$
H_{\boldsymbol\theta}f(\boldsymbol\theta)=J_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}f(\boldsymbol\theta).
$$

$$
H_{\boldsymbol\theta}f(\boldsymbol\theta)=J_{\boldsymbol\theta}\nabla_{\boldsymbol\theta}f(\boldsymbol\theta).
$$

So for the Hessian of the negative log-likelihood becomes:

因此负对数似然的Hessian矩阵变为:

图片.png

Taking the expectation over $p(X,A,\mathbf{y};\pmb\theta)$ , the first term turns into zero:

对 $p(X,A,\mathbf{y};\pmb\theta)$ 取期望后,第一项变为零:

图片.png

and Fisher is defined as the expected value of the second term.

Fisher 定义为第二项的期望值。

阅读全文(20积分)