[论文翻译]通用图Transformer自注意力网络


原文地址:https://arxiv.org/pdf/1909.11855v9


Universal Graph Transformer Self-Attention Networks

通用图Transformer自注意力网络

Abstract

摘要

The transformer self-attention network has been extensively used in research domains such as computer vision, image processing, and natural language processing. But it has not been actively used in graph neural networks (GNNs) where constructing an advanced aggregation function is essential. To this end, we present U2GNN, an effective GNN model leveraging a transformer self-attention mechanism followed by a recurrent transition, to induce a powerful aggregation function to learn graph representations. Experimental results show that the proposed U2GNN achieves state-of-the-art accuracies on well-known benchmark datasets for graph classification.1

Transformer自注意力网络已广泛应用于计算机视觉、图像处理和自然语言处理等研究领域,但在图神经网络(GNNs)中尚未得到充分利用,而构建高级聚合函数对GNN至关重要。为此,我们提出U2GNN——一种通过Transformer自注意力机制与循环转移相结合的有效GNN模型,可生成强大的聚合函数来学习图表示。实验结果表明,所提出的U2GNN在图分类的知名基准数据集上达到了最先进的准确率。[20]

Keywords: Graph neural networks $\cdot$ Graph classification Transformer Self-attention

关键词:图神经网络 (Graph Neural Networks) $\cdot$ 图分类 Transformer 自注意力 (Self-attention)

1 Introduction

1 引言

Graph-structured data appear in many real-world and scientific fields, e.g., knowledge graphs, recommend er systems, social and citation networks, as well as telecommunication and biological networks [3, 56]. In general, we can view a graph as a network of nodes and edges, where nodes correspond to individual objects, and edges encode relationships among those objects. For example, in an online forum, each discussion thread can be constructed as a graph where nodes represent users, and edges represent commenting activities between users [49].

图结构数据广泛存在于现实世界和科学领域中,例如知识图谱、推荐系统、社交与引文网络,以及电信和生物网络 [3, 56]。通常,我们可以将图视为由节点和边组成的网络,其中节点对应独立对象,边则编码这些对象之间的关系。例如,在在线论坛中,每个讨论主题都可以构建为一个图,其中节点代表用户,边表示用户之间的评论活动 [49]。

Learning graph representations is one of the most important topics for the graph-structured data [17, 57, 46, 52], where a goal is to construct vector representations for graphs to predict graph labels. Early approaches measure similarities among graphs to build a graph kernel for graph classification [14], wherein graph kernel-based approaches decompose graphs into “atomic subgraphs” (such as random walks [14, 22, 44], shortest paths [4], graphlets [39], and Weisfeiler-Lehman subtree patterns [38]) to have a vector of frequencies of atomic subgraphs for each given graph. Besides, word embedding-based approaches utilize Word2Vec [32] and Doc2Vec [26] to learn embeddings for the atomic subgraphs [49] and the entire graphs [33, 20].

学习图表示是图结构数据最重要的课题之一 [17, 57, 46, 52],其目标是为图构建向量表示以预测图标签。早期方法通过测量图之间的相似性来构建用于图分类的图核 [14],其中基于图核的方法将图分解为"原子子图"(如随机游走 [14, 22, 44]、最短路径 [4]、小图模式 [39] 和 Weisfeiler-Lehman 子树模式 [38]),从而为每个给定图生成原子子图频率向量。此外,基于词嵌入的方法利用 Word2Vec [32] 和 Doc2Vec [26] 来学习原子子图 [49] 和完整图 [33, 20] 的嵌入表示。

Recently, graph neural networks (GNNs) become an essential strand to learn low-dimensional continuous embeddings of the entire graphs to predict the graph labels [36, 17, 57, 46, 52]. GNNs use an Aggregation function to update the vector representation of each node by transforming and aggregating the vector representations of its neighbors [24, 16, 42]. Then GNNs apply a graph-level pooling function (i.e., a ReadOut operation such as simple sum pooling [48]) to obtain graph embeddings. Compared to the graph kernel and word embedding-based approaches, GNNs have been achieving state-of-the-art accuracies for the graph classification task.

近年来,图神经网络 (GNNs) 已成为学习整个图的低维连续嵌入以预测图标签的重要方法 [36, 17, 57, 46, 52]。GNNs 通过聚合函数 (Aggregation function) 对邻居节点的向量表示进行变换和聚合,从而更新每个节点的向量表示 [24, 16, 42]。随后,GNNs 应用图级池化函数 (即 ReadOut 操作,如简单的求和池化 [48]) 来获取图嵌入。与基于图核 (graph kernel) 和词嵌入 (word embedding) 的方法相比,GNNs 在图分类任务中实现了最先进的准确率。

To further improve the classification performance, constructing a powerful aggregation function is essential for GNNs as also discussed in [48]. Nowadays, there are novel applications of the transformer self-attention network [41, 9] recognized, published, and used successfully in research domains such as computer vision, image processing, and natural language processing. Hence we consider the use of the transformer to a new domain, i.e., graph neural networks (GNNs), as a novelty. Inspired by this advanced self-attention network, we present U2GNN, an effective GNN model, which induces a powerful Aggregation function leveraging a self-attention mechanism [41] followed by a recurrent transition, to update the vector representations of nodes from their neighbors. In particular, our U2GNN is different from related work as follows:

为进一步提升分类性能,构建强大的聚合函数对图神经网络(GNNs)至关重要,[48]中也对此进行了讨论。当前,Transformer自注意力网络[41,9]的创新应用已在计算机视觉、图像处理和自然语言处理等领域得到认可、发表并成功运用。因此,我们将Transformer应用于新领域——图神经网络(GNNs)视为一项创新。受这一先进自注意力网络的启发,我们提出了高效的GNN模型U2GNN,该模型通过自注意力机制[41]结合循环转移来构建强大的聚合函数,从而根据邻居节点更新节点的向量表示。具体而言,我们的U2GNN与相关工作的区别如下:

• A “concurrent” work – Hyper-SAGNN [55] – utilizes the transformer self-attention network for hypergraphs that have diverse and different structures, hence requiring a different solution. Besides, the later and closely related work, GraphBERT [53], is an extension of our U2GNN for semisupervised node classification. • Graph Attention Network (GAT) [42] borrows the standard attention technique from [2] in using a single-layer feed forward neural network para met rize d by a weight vector and then applying the non-linearity function followed by the softmax to compute the importance weights of neighbors of a given node. Note that our U2GNN adopts a scaled dot-product attention mechanism which is more robust and efficient than the attention technique used in GAT.

• 一项"同期"工作——Hyper-SAGNN [55]——针对具有多样化和不同结构的超图(hypergraph)采用了Transformer自注意力网络,因此需要不同的解决方案。此外,后续密切相关的GraphBERT [53]是我们U2GNN在半监督节点分类任务上的扩展。
• 图注意力网络(GAT) [42]借鉴了文献[2]的标准注意力技术,使用由权重向量参数化的单层前馈神经网络,然后应用非线性函数和softmax来计算给定节点邻居的重要性权重。需要注意的是,我们的U2GNN采用了缩放点积注意力机制(scaled dot-product attention),比GAT使用的注意力技术更鲁棒高效。

• Regarding the model architecture, Graph Transformer Network (GTN) [51] identifies useful metapaths [45] to transform graph structures and applies GCN [24] to learn the node embeddings for the node classification task on heterogeneous graphs. Self-Attention Graph Pooling (SAGPool) [27] uses GCN as an attention mechanism to weight the nodes, employs a node selection method [13] to retain a portion of the nodes, and applies the existing graph-level ReadOut pooling layers (consisting of global pooling [54] and hierarchical pooling [6]) to obtain the graph embeddings.

• 在模型架构方面,Graph Transformer Network (GTN) [51] 通过识别有用的元路径 (metapaths) [45] 来转换图结构,并应用 GCN [24] 学习异构图上节点分类任务的节点嵌入。Self-Attention Graph Pooling (SAGPool) [27] 使用 GCN 作为注意力机制对节点进行加权,采用节点选择方法 [13] 保留部分节点,并应用现有的图级 ReadOut 池化层(包含全局池化 [54] 和分层池化 [6])来获取图嵌入。

• To this end, we note that U2GNN is entirely different from GAT [42], Graph Transformer Network [51], and Self-Attention Graph Pooling [27], except similar titles.

• 为此,我们注意到U2GNN与GAT [42]、Graph Transformer Network [51]和Self-Attention Graph Pooling [27]完全不同,除了标题相似。

Contributions. Our main contributions in this paper are as follows:

贡献。我们在本文中的主要贡献如下:

• We propose U2GNN, an effective GNN model, leveraging the transformer self-attention network to construct an advanced aggregation function to learn the graph representations. • Experimental results show that U2GNN obtains state-of-the-art accuracies on well-known benchmark datasets for the graph classification task.

• 我们提出了U2GNN,一种有效的图神经网络 (GNN) 模型,利用Transformer自注意力网络构建高级聚合函数来学习图表示。
• 实验结果表明,U2GNN在图分类任务的知名基准数据集上取得了最先进的准确率。

2 Related work

2 相关工作

2.1 Graph kernel-based approaches These early approaches decompose graphs into “atomic subgraphs” (such as random walks [14, 22, 44], shortest paths [4], graphlets [39], and Weisfeiler-Lehman subtree patterns [38]) to measure the similarities among graphs [14]. Therefore, we can view each atomic substructure as a word term and each graph as a text document. We then represent a collection of graphs as a document-term matrix that describes the normalized frequency of terms in documents. We use an inner product to compute the graph similarities to derive a “kernel matrix” used for the kernel-based learning algorithms such as Support Vector Machines (SVM) [19] to measure the graph classification performance. We refer to an overview of the graph kernel-based approaches in [35, 25].

2.1 基于图核的方法
这些早期方法将图分解为"原子子图"(如随机游走 [14, 22, 44]、最短路径 [4]、小图模式 [39] 和 Weisfeiler-Lehman 子树模式 [38])来测量图之间的相似性 [14]。因此,我们可以将每个原子子结构视为一个词项,将每个图视为一个文本文档。然后,我们将图集合表示为描述文档中词项归一化频率的文档-词项矩阵。我们使用内积计算图相似性,从而推导出"核矩阵",用于基于核的学习算法(如支持向量机 (SVM) [19])来测量图分类性能。关于基于图核方法的综述可参阅 [35, 25]。

2.2 Word embedding-based approaches Since the introduction of word embedding models such as Word2Vec [32] and Doc2Vec [26], several works have used them for the graph classification task. Deep graph kernel (DGK) [49] applies Word2Vec to learn the embeddings of the atomic substructures (such as the graphlets, the WL subtree patterns, and the shortest paths) to derive the kernel matrix. Anonymous walk embedding (AWE) [20] maps random walks into “anonymous walks”, views each anonymous walk as a word token, and utilizes Doc2Vec to achieve the graph embeddings to compute the graph similarities to construct the kernel matrix. Graph2Vec [33] employs Doc2Vec to learn the embeddings for the WL subtree patterns and to obtain the graph embeddings to train a SVM classifier.

2.2 基于词嵌入的方法
自Word2Vec [32]和Doc2Vec [26]等词嵌入模型提出以来,多项研究将其应用于图分类任务。深度图核(DGK) [49]采用Word2Vec学习原子子结构(如图元、WL子树模式和最短路径)的嵌入表示以推导核矩阵。匿名游走嵌入(AWE) [20]将随机游走映射为"匿名游走",将每个匿名游走视为一个词token,并利用Doc2Vec实现图嵌入以计算图相似性来构建核矩阵。Graph2Vec [33]使用Doc2Vec学习WL子树模式的嵌入表示,进而获取图嵌入以训练SVM分类器。

2.3 Graph neural networks Recent works have focused on using graph neural networks (GNNs) to perform the graph classification task [36, 28, 34, 15, 54, 50, 43, 48]. In general, GNNs use an Aggregation function which aims to update the vector representation of each node by recursively propagating the vector representations of its neighbors [36, 24, 16, 42]. The Aggregation function can be a neural network such as gated recurrent units (GRU) [28] or multi-layer perce ptr on s (MLPs) [48]. Besides, GCN [24], GraphSAGE [16], and GAT [42] are also used as the Aggregation functions. GNNs then utilize a ReadOut pooling function to obtain the graph embeddings, which are fed to multiple fully-connected layers followed by a softmax layer to predict the graph labels.

2.3 图神经网络
近期研究聚焦于利用图神经网络 (GNNs) 完成图分类任务 [36, 28, 34, 15, 54, 50, 43, 48]。通常,GNNs 通过聚合函数 (Aggregation function) 递归传播相邻节点的向量表示来更新各节点的向量表征 [36, 24, 16, 42]。该聚合函数可采用门控循环单元 (GRU) [28] 或多层感知机 (MLPs) [48] 等神经网络结构,也可选用图卷积网络 (GCN) [24]、GraphSAGE [16] 和图注意力网络 (GAT) [42]。随后,GNNs 通过 ReadOut 池化函数获取图嵌入表示,并将其输入全连接层与 softmax 层以预测图标签。

Regarding the aggregation of node representations, GCN [24] updates vector representation for a given node $\textsf{v}\in\mathscr{V}$ from its neighbors, using multiple GCN layers stacked on top of each other to capture $k$ -hops neighbors, as:

关于节点表征的聚合,GCN [24] 通过堆叠多个GCN层来捕捉 $k$ 跳邻居信息,从而更新给定节点 $\textsf{v}\in\mathscr{V}$ 的向量表示,其公式为:

$$
\mathsf{h}{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\sum_{\mathsf{u}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}a_{\mathsf{v},\mathsf{u}}\mathsf{W}^{(k)}\mathsf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathsf{h}{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\sum_{\mathsf{u}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}a_{\mathsf{v},\mathsf{u}}\mathsf{W}^{(k)}\mathsf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$

where $k$ is the layer index; $a_{\mathsf{v},\mathsf{u}}$ is an edge constant between nodes $\mathsf{v}$ and $\mathsf{u}$ in the re-normalized adjacency matrix; $\mathbf{W}^{(k)}$ is a weight matrix; $\mathbf{h}{\mathbf{u}}^{(0)}$ is a feature vector of node $\mathfrak{u}$ ; $\mathsf{g}$ is a non-linear activation function; and $\mathcal{N}_{\sf v}$ is the set of neighbors of node $\mathsf{v}$ .

其中 $k$ 为层索引;$a_{\mathsf{v},\mathsf{u}}$ 是重归一化邻接矩阵中节点 $\mathsf{v}$ 和 $\mathsf{u}$ 之间的边常数;$\mathbf{W}^{(k)}$ 是权重矩阵;$\mathbf{h}{\mathbf{u}}^{(0)}$ 是节点 $\mathfrak{u}$ 的特征向量;$\mathsf{g}$ 是非线性激活函数;$\mathcal{N}_{\sf v}$ 是节点 $\mathsf{v}$ 的邻居集合。

GraphSAGE [16] extends GCN to use a node-wise procedure of uniformly sampling a fixed number of neighbors for each node at each layer as:

GraphSAGE [16] 将 GCN 扩展为采用逐节点处理方式,在每一层为每个节点均匀采样固定数量的邻居,具体如下:

$$
\mathbf{h}{\mathrm{v}}^{(k+1)}=\mathrm{g}\left(\mathbf{W}^{(k)}\left[\mathbf{h}{\mathrm{v}}^{(k)};\mathbf{h}{\mathcal{N}_{\mathrm{v}}^{\prime}}^{(k)}\right]\right),\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathbf{h}{\mathrm{v}}^{(k+1)}=\mathrm{g}\left(\mathbf{W}^{(k)}\left[\mathbf{h}{\mathrm{v}}^{(k)};\mathbf{h}{\mathcal{N}_{\mathrm{v}}^{\prime}}^{(k)}\right]\right),\forall\mathsf{v}\in\mathcal{V}
$$

where [;] denotes a vector concatenation, and $\mathbf{h}{\mathcal{N}_{\mathrm{v}}^{\prime}}^{(k)}$ can be

其中 [;] 表示向量拼接,$\mathbf{h}{\mathcal{N}_{\mathrm{v}}^{\prime}}^{(k)}$ 可以

where $\mathcal{N}{\sf v}^{\prime}$ is defined as a fixed-size, uniformly sampled from $\mathcal{N}{\sf v}$ of $\mathsf{v}$ . Besides, $\mathcal{N}_{\sf v}^{\prime}$ is sampled differently through each layer.

其中 $\mathcal{N}{\sf v}^{\prime}$ 定义为从 $\mathsf{v}$ 的 $\mathcal{N}{\sf v}$ 中均匀采样的固定大小集合。此外,$\mathcal{N}_{\sf v}^{\prime}$ 在每一层都进行不同的采样。

Graph Attention Network (GAT) [42] extends GCN to compute edge weights following the standard attention technique [2] as:

图注意力网络 (GAT) [42] 通过标准注意力技术 [2] 计算边权重来扩展 GCN,其公式为:

$$
\mathsf{h}{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\sum_{\mathsf{u}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}\tau_{\mathsf{v},\mathsf{u}}^{(k)}\mathsf{W}^{(k)}\mathsf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathsf{h}{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\sum_{\mathsf{u}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}\tau_{\mathsf{v},\mathsf{u}}^{(k)}\mathsf{W}^{(k)}\mathsf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathsf{s o f t m a x}\left(\mathsf{L e a k y R e L U}\left(\pmb{a}^{(k)\top}\left[\mathbf{W}^{(k)}\mathbf{h}{\mathsf{v}}^{(k)};\mathbf{W}^{(k)}\mathbf{h}_{\mathsf{u}}^{(k)}\right]\right)\right)
$$

3 U2GNN: Universal Graph Transformer Self-Attention Networks

3 U2GNN: 通用图Transformer自注意力网络

In this section, we present the additional background of graph neural networks and detail our proposed U2GNN.

在本节中,我们将介绍图神经网络 (Graph Neural Networks) 的额外背景知识,并详细阐述我们提出的 U2GNN。

3.1 Graph classification We represent each graph $\mathcal{G}=\left(\mathcal{V},\mathcal{E},{\mathbf{h}{\mathsf{v}}^{(0)}}{\mathsf{v}\in\mathcal{V}}\right)$ , where $\nu$ is a set of nodes, $\mathcal{E}$ is a soeft noof deed ${\bf h}{\vee}^{(0)}\in\mathbb{R}^{d}$ rseeptr eosfe tghrea pfehast $\textsf{v}\in\mathscr{V}$ $M$ ${\mathcal{G}{m}}{m=1}^{M}$ and their corresponding class labels , the graph classification task is to learn an embedding $\mathbf{e}{\mathcal{G}{m}}$ for each graph $\mathcal{G}{m}$ to predict its label $\mathsf{y}_{m}$ .

3.1 图分类
我们将每个图表示为 $\mathcal{G}=\left(\mathcal{V},\mathcal{E},{\mathbf{h}{\mathsf{v}}^{(0)}}{\mathsf{v}\in\mathcal{V}}\right)$,其中 $\nu$ 是节点集合,$\mathcal{E}$ 是边集合,$\mathbf{h}{\vee}^{(0)}\in\mathbb{R}^{d}$ 表示节点 $\textsf{v}\in\mathscr{V}$ 的初始特征。给定一组图 ${\mathcal{G}{m}}{m=1}^{M}$ 及其对应的类别标签 ,图分类任务的目标是为每个图 $\mathcal{G}{m}$ 学习一个嵌入表示 $\mathbf{e}{\mathcal{G}{m}}$,以预测其标签 $\mathsf{y}_{m}$。

3.2 Graph neural networks In general, GNNs aim to update the vector representation of each node by recursively aggregating and transforming the vector representations of its neighbors [24, 16, 42]. After that, GNNs use a ReadOut pooling function to obtain the vector representations of the entire graphs [15, 54, 50, 43, 48].

3.2 图神经网络 (Graph Neural Networks)
一般而言,图神经网络旨在通过递归聚合和转换其邻居节点的向量表示来更新每个节点的向量表示 [24, 16, 42]。随后,图神经网络使用 ReadOut 池化函数获取整个图的向量表示 [15, 54, 50, 43, 48]。

where $\mathsf{h}{\mathsf{v}}^{(k)}$ is the vector representation of node $\mathsf{v}$ at the $k$ -th iteration/layer, $\mathcal{N}_{\sf v}$ is the set of neighbors of $\mathsf{v}$ .

其中 $\mathsf{h}{\mathsf{v}}^{(k)}$ 是节点 $\mathsf{v}$ 在第 $k$ 次迭代/层的向量表示,$\mathcal{N}_{\sf v}$ 是 $\mathsf{v}$ 的邻居节点集合。

Many methods have been proposed to construct the Aggregation functions, e.g., GCN [24], GraphSAGE [16], and GAT [42]. Recently, Graph Isomorphism Network (GIN-0) [48] uses a more powerful Aggregation function based on a multi-layer perceptron (MLP) network of two fully-connected layers as:

许多方法被提出来构建聚合函数,例如 GCN [24]、GraphSAGE [16] 和 GAT [42]。最近,图同构网络 (GIN-0) [48] 使用了一个更强大的聚合函数,基于一个包含两个全连接层的多层感知机 (MLP) 网络,其形式为:

$$
\mathsf{h}{\mathsf{v}}^{(k+1)}=\mathsf{M L P}^{(\mathsf{k})}\left(\sum_{\mathsf{u}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}\mathsf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathsf{h}{\mathsf{v}}^{(k+1)}=\mathsf{M L P}^{(\mathsf{k})}\left(\sum_{\mathsf{u}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}\mathsf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$

Following [48], we employ a concatenation over the vector representations of node $\mathsf{v}$ at the different layers to construct a vector representation $\mathbf{e}_{\lor}$ for each node $\mathsf{v}$ as:

遵循[48]的方法,我们通过对节点$\mathsf{v}$在不同层的向量表示进行拼接,为每个节点$\mathsf{v}$构建向量表示$\mathbf{e}_{\lor}$,具体形式如下:

$$
\mathbf{e}{\mathsf{v}}=\left[\mathsf{h}{\mathsf{v}}^{(1)};\mathsf{h}{\mathsf{v}}^{(2)};...;\mathsf{h}_{\mathsf{v}}^{(K)}\right],\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathbf{e}{\mathsf{v}}=\left[\mathsf{h}{\mathsf{v}}^{(1)};\mathsf{h}{\mathsf{v}}^{(2)};...;\mathsf{h}_{\mathsf{v}}^{(K)}\right],\forall\mathsf{v}\in\mathcal{V}
$$

where $K$ is the index of the last layer. The graph-level ReadOut function can be a simple sum pooling [48] or a complicated pooling such as hierarchical pooling [6], and differentiable pooling [50]. As the sum pooling produces competitive results [48], we use the sum pooling to obtain the embedding $\mathbf{e}_{\mathcal{G}}$ of the entire graph $\mathcal{G}$ as:

其中 $K$ 是最后一层的索引。图级别的 ReadOut 函数可以是简单的求和池化 [48] 或复杂的池化方法,如分层池化 [6] 和可微分池化 [50]。由于求和池化能产生具有竞争力的结果 [48],我们使用求和池化来获取整个图 $\mathcal{G}$ 的嵌入 $\mathbf{e}_{\mathcal{G}}$,公式如下:

$$
\mathbf{e}{\mathcal{G}}=\sum_{\ v\in\mathcal{V}}\mathbf{e}{\mathsf{v}}=\sum_{\mathsf{v}\in\mathcal{V}}\left[\mathbf{h}{\mathsf{v}}^{(1)};\mathbf{h}{\mathsf{v}}^{(2)};...;\mathbf{h}_{\mathsf{v}}^{(K)}\right]
$$

$$
\mathbf{e}{\mathcal{G}}=\sum_{\ v\in\mathcal{V}}\mathbf{e}{\mathsf{v}}=\sum_{\mathsf{v}\in\mathcal{V}}\left[\mathbf{h}{\mathsf{v}}^{(1)};\mathbf{h}{\mathsf{v}}^{(2)};...;\mathbf{h}_{\mathsf{v}}^{(K)}\right]
$$

After that, we can follow [48] to feed the graph embeddings $\mathbf{e}_{\mathcal{G}}$ to a single fully-connected layer followed by a softmax layer to predict the graph labels.

之后,我们可以按照 [48] 的方法,将图嵌入 $\mathbf{e}_{\mathcal{G}}$ 输入到一个全连接层,再接一个 softmax 层来预测图标签。

3.3 The proposed U2GNN The transformer selfattention network [41, 9] has widely applied as a novelty in research domains such as computer vision and NLP. Similarity, we also consider the successful use of this recent advanced technique to a new domain, i.e., graph neural networks (GNNs), as a novel application. Moreover, as also discussed in [48], constructing an powerful Aggregation mechanism is essential for GNNs. To this end, we induce an advanced Aggregation function, using the universal transformer network [9] consisting of a self-attention mechanism [41] followed by a recurrent transition (Trans) with adding residual connections [18] and layer normalization (LNorm) [1], as illustrated in Figure 1.

3.3 提出的U2GNN
Transformer自注意力网络 [41, 9] 作为一种创新技术已广泛应用于计算机视觉和自然语言处理等领域。类似地,我们也考虑将这一前沿技术成功应用于新领域——图神经网络 (GNNs) ,作为一项新颖的应用。此外,如 [48] 所述,构建强大的聚合机制对GNN至关重要。为此,我们引入了一种先进的聚合函数,采用通用Transformer网络 [9] ,该网络由自注意力机制 [41] 和循环转换 (Trans) 组成,并添加了残差连接 [18] 和层归一化 (LNorm) [1] ,如图 1 所示。

The residual connections [18] are used to add useful information learned in the lower layers to the higher layers, and more importantly, to allow gradients to directly pass through the layers to avoid vanishing gradient or exploding gradient problems. The layer normalization (LNorm) [1] is used to normalize the inputs across the feature dimensions to stabilize the network to enable smoother gradients and faster training. The residual connections and the layer normalization are commonly used in many architectures and thus are omitted in the paper for simplicity.

残差连接 [18] 用于将低层学习到的有用信息添加到高层,更重要的是允许梯度直接穿过各层,避免梯度消失或爆炸问题。层归一化 (LNorm) [1] 用于沿特征维度对输入进行归一化,以稳定网络,实现更平滑的梯度和更快的训练。由于残差连接和层归一化在多种架构中广泛使用,本文为简洁起见省略了相关图示。


Figure 1: Illustration of our U2GNN.

图 1: U2GNN 架构示意图

Formally, given an input graph $\mathcal{G}$ , we uniformly sample a set $\mathcal{N}{\sf v}$ of neighbors for each $v\in\mathcal{V}$ and then input $\mathcal{N}{\sf v}\cup{{\sf v}}$ to the U2GNN learning process. Note that we sample a different $\mathcal{N}{\sf v}$ for node v at each training batch. We also construct multiple layers stacked on top of each other in our U2GNN. Regarding the $k$ -th layer, given a node $v\in\nu$ , at each step $t$ , we induce a transformer self-attention-based function to aggregate the vector representations for all nodes $\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup{\mathsf{v}}$ as:

形式上,给定输入图 $\mathcal{G}$,我们为每个 $v\in\mathcal{V}$ 均匀采样一组邻居 $\mathcal{N}{\sf v}$,然后将 $\mathcal{N}{\sf v}\cup{{\sf v}}$ 输入到 U2GNN 学习过程中。注意,在每个训练批次中,我们为节点 v 采样不同的 $\mathcal{N}{\sf v}$。我们还在 U2GNN 中构建了多个堆叠层。对于第 $k$ 层,给定节点 $v\in\nu$,在每一步 $t$,我们引入一个基于 Transformer 自注意力的函数来聚合所有节点 $\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup{\mathsf{v}}$ 的向量表示:

$$
\begin{array}{r}{\mathbf{\bar{h}}{t,\mathbf{u}}^{(k)}=\mathrm{TRANSFORMER-AGGREGATION}\left(\mathbf{\bar{h}}_{t-1,\mathbf{u}}^{(k)}\right)}\end{array}
$$

$$
\begin{array}{r}{\mathbf{\bar{h}}{t,\mathbf{u}}^{(k)}=\mathrm{TRANSFORMER-AGGREGATION}\left(\mathbf{\bar{h}}_{t-1,\mathbf{u}}^{(k)}\right)}\end{array}
$$

In particular,

特别是,

$$
\mathbf{x}{t,\mathrm{u}}^{(k)}=\mathrm{LNoRM}\left(\mathbf{h}{t-1,\mathrm{u}}^{(k)}+\mathrm{ATT}\left(\mathbf{h}_{t-1,\mathrm{u}}^{(k)}\right)\right)
$$

$$
\mathbf{x}{t,\mathrm{u}}^{(k)}=\mathrm{LNoRM}\left(\mathbf{h}{t-1,\mathrm{u}}^{(k)}+\mathrm{ATT}\left(\mathbf{h}_{t-1,\mathrm{u}}^{(k)}\right)\right)
$$

then,

然后,

$$
\mathbf{h}{t,\mathbf{u}}^{(k)}=\mathrm{LNORM}\left(\mathbf{x}{t,\mathbf{u}}^{(k)}+\mathrm{TRANS}\left(\mathbf{x}_{t,\mathbf{u}}^{(k)}\right)\right)
$$

$$
\mathbf{h}{t,\mathbf{u}}^{(k)}=\mathrm{LNORM}\left(\mathbf{x}{t,\mathbf{u}}^{(k)}+\mathrm{TRANS}\left(\mathbf{x}_{t,\mathbf{u}}^{(k)}\right)\right)
$$

where ${\bf h}_{t,{\bf u}}^{(k)}\in\mathbb{R}^{d}$ ; $\mathrm{TRANS}(.)$ and ATT $(.)$ denote a MLP network (i.e., two fully-connected layers) and a selfattention network respectively:

其中 ${\bf h}_{t,{\bf u}}^{(k)}\in\mathbb{R}^{d}$;$\mathrm{TRANS}(.)$ 和 ATT $(.)$ 分别表示一个多层感知机网络 (即两个全连接层) 和一个自注意力网络:

$$
\left(\mathbf{x}{t,\mathrm{u}}^{(k)}\right)=\mathbf{W}{2}^{(k)}{\mathrm{ReLU}}\left(\mathbf{W}{1}^{(k)}\mathbf{x}{t,\mathrm{u}}^{(k)}+\mathbf{b}{1}^{(k)}\right)+\mathbf{b}_{2}^{(k)}
$$

$$
\left(\mathbf{x}{t,\mathrm{u}}^{(k)}\right)=\mathbf{W}{2}^{(k)}{\mathrm{ReLU}}\left(\mathbf{W}{1}^{(k)}\mathbf{x}{t,\mathrm{u}}^{(k)}+\mathbf{b}{1}^{(k)}\right)+\mathbf{b}_{2}^{(k)}
$$

where W(1k) $\mathbf{W}{1}^{(k)}\in\mathbb{R}^{s\times d}$ and W(2k) $\mathbf{W_{2}^{(k)}}\in~\mathbb{R}^{d\times s}$ are weight matrices, and b(1k) and b(2k) are bias parameters, and:

其中 $\mathbf{W}{1}^{(k)}\in\mathbb{R}^{s\times d}$ 和 $\mathbf{W_{2}^{(k)}}\in~\mathbb{R}^{d\times s}$ 是权重矩阵,b(1k) 和 b(2k) 是偏置参数,且:

$$
\mathrm{ATT}\left(\mathbf{h}{t-1,\mathbf{u}}^{(k)}\right)=\sum_{\mathbf{u}^{\prime}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}\alpha_{\mathbf{u},\mathbf{u}^{\prime}}^{(k)}\left(V^{(k)}\mathbf{h}_{t-1,\mathbf{u}^{\prime}}^{(k)}\right)
$$

$$
\mathrm{ATT}\left(\mathbf{h}{t-1,\mathbf{u}}^{(k)}\right)=\sum_{\mathbf{u}^{\prime}\in\mathcal{N}{\mathsf{v}}\cup{\mathsf{v}}}\alpha_{\mathbf{u},\mathbf{u}^{\prime}}^{(k)}\left(V^{(k)}\mathbf{h}_{t-1,\mathbf{u}^{\prime}}^{(k)}\right)
$$

where $V^{(k)}\in\mathbb{R}^{d\times d}$ is a value-projection weight matrix; $\alpha_{\mathsf{u},\mathsf{u}^{\prime}}$ is an attention weight, which is computed using

其中 $V^{(k)}\in\mathbb{R}^{d\times d}$ 是值投影权重矩阵;$\alpha_{\mathsf{u},\mathsf{u}^{\prime}}$ 是通过计算得到的注意力权重

$$
\alpha_{\mathbf{u},\mathbf{u}^{\prime}}^{(k)}=\mathsf{s o f t m a x}\left(\frac{\left(Q^{(k)}\mathsf{h}{t-1,\mathbf{u}}^{(k)}\right)\cdot\left(K^{(k)}\mathsf{h}_{t-1,\mathbf{u}^{\prime}}^{(k)}\right)}{\sqrt{d}}\right)
$$

$$
\alpha_{\mathbf{u},\mathbf{u}^{\prime}}^{(k)}=\mathsf{s o f t m a x}\left(\frac{\left(Q^{(k)}\mathsf{h}{t-1,\mathbf{u}}^{(k)}\right)\cdot\left(K^{(k)}\mathsf{h}_{t-1,\mathbf{u}^{\prime}}^{(k)}\right)}{\sqrt{d}}\right)
$$

where Q(k) ${\pmb Q}^{(k)}\in~\mathbb{R}^{d\times d}$ and ${\cal K}^{(k)}\in~\mathbb{R}^{d\times d}$ are queryprojection and key-projection matrices, respectively.

其中 ${\pmb Q}^{(k)}\in~\mathbb{R}^{d\times d}$ 和 ${\cal K}^{(k)}\in~\mathbb{R}^{d\times d}$ 分别是查询投影矩阵和键投影矩阵。

After $T$ steps, we feed ${\boldsymbol{\mathsf{h}}_{T,\mathsf{v}}^{(k)}\in\mathbb{R}^{d}}$ to the next $(k+1)$ - th layer as:

经过 $T$ 步后,我们将 ${\boldsymbol{\mathsf{h}}_{T,\mathsf{v}}^{(k)}\in\mathbb{R}^{d}}$ 输入到下一 $(k+1)$ 层:

$$
\mathsf{h}{0,\mathsf{v}}^{(k+1)}=\mathsf{h}_{T,\mathsf{v}}^{(k)},\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathsf{h}{0,\mathsf{v}}^{(k+1)}=\mathsf{h}_{T,\mathsf{v}}^{(k)},\forall\mathsf{v}\in\mathcal{V}
$$

Note that ${\bf h}{0,\mathsf{v}}^{(0)}={\bf h}_{\mathsf{v}}^{(0)}\in\mathbb{R}^{d}$ is the feature vector of node $\mathsf{v}$ .

注意到 ${\bf h}{0,\mathsf{v}}^{(0)}={\bf h}_{\mathsf{v}}^{(0)}\in\mathbb{R}^{d}$ 是节点 $\mathsf{v}$ 的特征向量。

We apply the vector concatenation across the layers to obtain the vector representations $\mathbf{e}_{\mathsf{v}}$ of nodes $\mathsf{v}$ following Equation 3.9 as:

我们按照公式3.9对层间向量进行拼接,得到节点$\mathsf{v}$的向量表示$\mathbf{e}_{\mathsf{v}}$,具体如下:

$$
\mathbf{e}{\mathsf{v}}=\left[\mathsf{h}{0,\mathsf{v}}^{(1)};\mathsf{h}{0,\mathsf{v}}^{(2)};...;\mathsf{h}_{0,\mathsf{v}}^{(K)}\right],\forall\mathsf{v}\in\mathcal{V}
$$

$$
\mathbf{e}{\mathsf{v}}=\left[\mathsf{h}{0,\mathsf{v}}^{(1)};\mathsf{h}{0,\mathsf{v}}^{(2)};...;\mathsf{h}_{0,\mathsf{v}}^{(K)}\right],\forall\mathsf{v}\in\mathcal{V}
$$

where $K$ is the number of layers. We use $\mathbf{e}{\mathsf{v}}$ as the final embedding of node $v\in\mathcal{V}$ and then sum all the final embeddings of nodes in $\mathcal{G}$ to get the final embedding $\mathbf{e}{\mathcal{G}}$ of the entire graph $\mathcal{G}$ . We feed $\mathbf{e}_{\mathcal{G}}$ to a single fullyconnected layer followed by a softmax layer to predict the graph label as:

其中 $K$ 是层数。我们使用 $\mathbf{e}{\mathsf{v}}$ 作为节点 $v\in\mathcal{V}$ 的最终嵌入,然后对图 $\mathcal{G}$ 中所有节点的最终嵌入求和,得到整个图 $\mathcal{G}$ 的最终嵌入 $\mathbf{e}{\mathcal{G}}$。我们将 $\mathbf{e}_{\mathcal{G}}$ 输入到一个全连接层,后接一个 softmax 层,以预测图的标签为:

$$
\hat{\pmb{{\forall}}}{\mathcal{G}}=\mathsf{s o f t m a x}\left(\mathbf{W}\mathbf{e}_{\mathcal{G}}+\mathbf{b}\right)
$$

$$
\hat{\pmb{{\forall}}}{\mathcal{G}}=\mathsf{s o f t m a x}\left(\mathbf{W}\mathbf{e}_{\mathcal{G}}+\mathbf{b}\right)
$$

Finally, we learn the model parameters by minimizing the cross-entropy loss function. To sum up, we briefly present the learning process of our proposed U2GNN in Algorithm 1.

最后,我们通过最小化交叉熵损失函数来学习模型参数。综上所述,我们在算法1中简要展示了所提出的U2GNN的学习过程。

3.4 Discussion We discuss some findings in our proposed U2GNN as follows:

3.4 讨论
我们在提出的U2GNN中发现以下几点:

• As established empirically, our results shown in Section 5 imply that the U2GNN self-attentionbased aggregation function is a powerful computation process compared to other existing functions.

• 根据实证研究,我们在第5节展示的结果表明,与其他现有函数相比,U2GNN基于自注意力(self-attention)的聚合函数是一种强大的计算过程。

4 Experimental setup

4 实验设置

We use seven well-known datasets consisting of three social network datasets (COLLAB, IMDB-B, and IMDBM) and four bioinformatics datasets (DD, MUTAG, PROTEINS, and PTC). The social network datasets do not have available node features; thus, we follow [34, 54] to use node degrees as features. Table 1 reports the statistics of these datasets.

我们使用了七个知名数据集,包括三个社交网络数据集 (COLLAB、IMDB-B 和 IMDBM) 和四个生物信息学数据集 (DD、MUTAG、PROTEINS 和 PTC)。由于社交网络数据集没有可用的节点特征,因此我们遵循 [34, 54] 的方法,使用节点度数作为特征。表 1 报告了这些数据集的统计信息。

• If we set $T$ to $1$ , $\alpha_{\mathbf{u},\mathbf{u}^{\prime}}^{(k)}$ to 1, $V^{(k)}$ to the identity matrix in Equation 3.15, and do not use both the residual connections and the layer normalization, we simplify our U2GNN aggregation function (from Equations 3.17 and 3.13) as:

• 如果我们设 $T$ 为 $1$,$\alpha_{\mathbf{u},\mathbf{u}^{\prime}}^{(k)}$ 为 1,$V^{(k)}$ 为等式 3.15 中的单位矩阵,并且不使用残差连接 (residual connections) 和层归一化 (layer normalization),就可以将 U2GNN 聚合函数(来自等式 3.17 和 3.13)简化为:

$$
\begin{array}{r c l}{\mathbf{h}{1,\mathbf{y}}^{(k+1)}}&{=}&{\displaystyle\mathrm{TRANS}\left(\mathrm{ATT}\left(\mathbf{h}{0,\mathbf{v}}^{(k+1)}\right)\right)}\ &{=}&{\displaystyle\mathrm{TRANS}\left(\sum_{\mathbf{u}\in\mathcal{N}{\mathbf{v}}\cup{\mathbf{v}}}\mathbf{h}{0,\mathbf{u}}^{(k+1)}\right)}\ &{=}&{\displaystyle\mathrm{TRANS}\left(\sum_{\mathbf{u}\in\mathcal{N}{\mathbf{v}}\cup{\mathbf{v}}}\mathbf{h}_{1,\mathbf{u}}^{(k)}\right)}\end{array}
$$

$$
\begin{array}{r c l}{\mathbf{h}{1,\mathbf{y}}^{(k+1)}}&{=}&{\displaystyle\mathrm{TRANS}\left(\mathrm{ATT}\left(\mathbf{h}{0,\mathbf{v}}^{(k+1)}\right)\right)}\ &{=}&{\displaystyle\mathrm{TRANS}\left(\sum_{\mathbf{u}\in\mathcal{N}{\mathbf{v}}\cup{\mathbf{v}}}\mathbf{h}{0,\mathbf{u}}^{(k+1)}\right)}\ &{=}&{\displaystyle\mathrm{TRANS}\left(\sum_{\mathbf{u}\in\mathcal{N}{\mathbf{v}}\cup{\mathbf{v}}}\mathbf{h}_{1,\mathbf{u}}^{(k)}\right)}\end{array}
$$

where Trans(.) denotes the MLP network of two fully-connected layers (as defined in Equation 3.14). Thus, this implies that our U2GNN can be simplified (w.r.t Equation 3.20) to be equivalent to Graph Isomorphism Network (GIN-0) [48] (w.r.t Equation 3.8) – one of the recent state-of-the-art GNNs. Experimental results presented in Section 5 show that U2GNN outperforms GIN-0 on benchmark datasets for the graph classification task.

其中 Trans(.) 表示由两个全连接层组成的 MLP 网络 (如公式 3.14 所定义)。这意味着我们的 U2GNN 可以简化为 (相对于公式 3.20) 与图同构网络 (Graph Isomorphism Network, GIN-0) [48] (相对于公式 3.8) 等效——后者是当前最先进的图神经网络之一。第 5 节的实验结果表明,在图分类任务的基准数据集上,U2GNN 的性能优于 GIN-0。

• We probably could construct a complex architecture using a complicated graph-level pooling (such as hierarchical pooling [6]) followed by multiple fully-connected layers [7, 31] to predict the graph labels. However, we refrained from doing that, as our key purpose is to introduce a single, unified and effective model that can work well and produce competitive performances on the benchmark datasets. Therefore, using the sum pooling followed by a single fully-connected layer is reasonable for our U2GNN.

• 我们或许可以通过复杂的图级池化(如分层池化 [6])结合多个全连接层 [7, 31]来构建一个复杂架构以预测图标签。但考虑到本文的核心目标是提出一个统一、高效且能在基准数据集上取得竞争力的单一模型,因此我们选择在U2GNN中仅使用求和池化与单层全连接这种合理设计。

Table 1: Statistics of the experimental benchmark datasets. #G denotes the numbers of graphs. #Cls denotes the number of class labels. Avg#N denotes the average number of nodes per graph. Avg#E denotes the average number of neighbors per node. $d$ is the dimension of feature vectors. Note that $d$ is also equal to the node embedding size at each U2GNN layer.

表 1: 实验基准数据集的统计信息。#G 表示图的数量。#Cls 表示类别标签的数量。Avg#N 表示每个图的平均节点数。Avg#E 表示每个节点的平均邻居数。$d$ 是特征向量的维度。注意 $d$ 也等于每个 U2GNN 层的节点嵌入大小。

数据集 #G #Cls Avg#N Avg#E d
COLLAB 5,000 3 74.5 65.9
IMDB-M 1,500 3 13.0 10.1
IMDB-B 1,000 2 19.8 9.8
DD 1,178 2 284.3 5.0 82
PROTEINS 1,113 2 39.1 3.7 3
PTC 344 2 25.6 2.0 19
MUTAG 188 2 17.9 2.2 7

• Social networks datasets: COLLAB is a scientific dataset, where each graph represents a collaboration network of a corresponding researcher with other researchers from each of 3 physics fields; each graph is labeled to a physics field that the researcher belongs to. IMDB-B and IMDB-M are movie collaboration datasets, where each graph is derived from actor/actress and genre information of different movies on IMDB; nodes correspond to actors/actresses, and each edge represents a coappearance of two actors/actresses in the same movie; each graph is assigned to a genre.

• 社交网络数据集:COLLAB 是一个科学数据集,其中每个图代表一位研究者与来自3个物理学领域的其他研究者之间的合作网络;每个图被标记为该研究者所属的物理学领域。IMDB-B 和 IMDB-M 是电影合作数据集,其中每个图源自 IMDB 上不同电影的演员/女演员和流派信息;节点对应演员/女演员,每条边表示两位演员/女演员在同一部电影中的共同出演;每个图被分配到一个流派。

• Bioinformatics datasets: DD [10] is a collection of 1,178 protein network structures with 82 discrete node labels, where each graph is classified into enzyme or non-enzyme class. MUTAG [8] is a collection of 188 nitro compound networks with 7 discrete node labels, where classes indicate a mutagenic effect on a bacterium. PROTEINS comprises

• 生物信息学数据集:DD [10] 包含1,178个具有82种离散节点标签的蛋白质网络结构,每个图被分类为酶或非酶类别。MUTAG [8] 是188个硝基化合物网络的集合,具有7种离散节点标签,类别表示对细菌的诱变效应。PROTEINS包含

Table 2: Graph classification results ( $%$ accuracy). The best scores are in bold.

表 2: 图分类结果 ( $%$ 准确率)。最佳分数以粗体显示。

模型 COLLAB IMDB-B IMDB-M DD PROTEINS MUTAG PTC
GK [39] 72.84 ± 0.28 65.87 ± 0.98 43.89 ± 0.38 78.45 ± 0.26 71.67 ± 0.55 81.58 ± 2.11 57.26 ± 1.41
WL [38] 79.02 ± 1.77 73.40 ± 4.63 49.33 ± 4.75 79.78 ± 0.36 74.68 ± 0.49 82.05 ± 0.36 57.97 ± 0.49
PSCN [34] 72.60 ± 2.15 71.00 ± 2.29 45.23 ± 2.84 77.12 ± 2.41 75.89 ± 2.76 92.63 ± 4.21 62.29 ± 5.68
GCN [24] 81.72 ± 1.64 73.30 ± 5.29 51.20 ± 5.13 79.12 ± 3.07 75.65 ± 3.24 87.20 ± 5.11
GFN [7] 81.50 ± 2.42 73.00 ± 4.35 51.80 ± 5.16 78.78 ± 3.49 76.46 ± 4.06 90.84 ± 7.22
GraphSAGE [16] 79.70 ± 1.70 72.40 ± 3.60 49.90 ± 5.00 65.80 ± 4.90 65.90 ± 2.70 79.80 ± 13.9
GAT [42] 75.80 ± 1.60 70.50 ± 2.30 47.80 ± 3.10 74.70 ± 2.20 89.40 ± 6.10 66.70 ± 5.10
DGCNN [54] 73.76 ± 0.49 70.03 ± 0.86 47.83 ± 0.85 79.37 ± 0.94 75.54 ± 0.94 85.83 ± 1.66 58.59 ± 2.47
SAGPool [27] 76.45 ± 0.97 71.86 ± 0.97
PPGN [30] 81.38 ± 1.42 73.00 ± 5.77 50.46 ± 3.59 77.20 ± 4.73 90.55 ± 8.70 66.17 ± 6.54
CapsGNN [47] 79.62 ± 0.91 73.10 ± 4.83 50.27 ± 2.65 75.38 ± 4.17 76.28 ± 3.63 86.67 ± 6.88
DSGC [37] 79.20 ± 1.60 73.20 ± 4.90 48.50 ± 4.80 77.40 ± 6.40 74.20 ± 3.80 86.70 ± 7.60
GIN-0 [48] 80.20 ± 1.90 75.10 ± 5.10 52.30 ± 2.80 76.20 ± 2.80 89.40 ± 5.60 64.60 ± 7.00
GCAPS [43] 77.71 ± 2.51 71.69 ± 3.40 48.50 ± 4.10 77.62 ± 4.99 76.40 ± 4.17 66.01 ± 5.91
IEGN [31] 77.92 ± 1.70 71.27 ± 4.50 48.55 ± 3.90 75.19 ± 4.30 84.61 ± 10.0 59.47 ± 7.30
U2GNN 77.84 ± 1.48 77.04 ± 3.45 53.60 ± 3.53 80.23 ± 1.48 78.53 ± 4.07 89.97 ± 3.65 69.63 ± 3.60

1,113 graphs obtained from [5] to present secondary structure elements (SSEs). PTC [40] consists of 344 chemical compound networks with 19 discrete node labels where classes show carcinogen i city for male and female rats.

从[5]中获取的1,113个图用于呈现二级结构元件(SSE)。PTC [40]包含344个化合物网络,具有19个离散节点标签,其类别显示对雄性和雌性大鼠的致癌性。

4.1 Training protocol We vary the number $K$ of U2GNN layers in ${1,~2,~3}$ , the number of steps $T$ in ${1,2,3,4}$ , the number of neighbors ( $|\mathcal{N}_{\sf v}|=N$ ) sampled for each node in ${4,8,16}$ , and the dimension s of Trans(.) (in Equation 3.14) in ${128, 256, 512, 1024}$ (in Equation 3.14). We set the batch size to 4. We apply the Adam optimizer [23] to train our U2GNN and select the Adam initial learning rate $l r\in$ ${5e^{-5},1e^{-4},5e^{-4},1e^{-3}}$ . We run up to 50 epochs to evaluate our U2GNN.

4.1 训练方案
我们调整以下超参数:U2GNN层数 $K$ 在 ${1,~2,~3}$ 中变化,步数 $T$ 在 ${1,2,3,4}$ 中变化,每个节点的采样邻居数 $(|\mathcal{N}_{\sf v}|=N)$ 在 ${4,8,16}$ 中变化,Trans(.) (公式3.14)的维度 $s$ 在 ${128, 256, 512, 1024}$ (公式3.14)中变化。批次大小固定为4,采用Adam优化器 [23] 训练U2GNN,初始学习率 $lr\in$ ${5e^{-5},1e^{-4},5e^{-4},1e^{-3}}$。模型评估最多运行50个epoch。

4.2 Evaluation protocol We follow [48, 47, 30, 37, 7] to use the same data splits and the same 10-fold cross-validation scheme to calculate the classification performance for a fair comparison.

4.2 评估方案
我们遵循[48, 47, 30, 37, 7]采用相同的数据划分和10折交叉验证方案计算分类性能,以确保公平比较。

We compare our U2GNN with up-to-date strong baselines as follows:

我们将U2GNN与以下最新的强基线进行比较:

• Unsupervised approaches: Graphlet Kernel (GK) [39] and Weisfeiler-Lehman kernel (WL) [38]. • Supervised approaches: PATCHY-SAN (PSCN) [34], Graph Convolutional Network (GCN) [24], GraphSAGE [16], Graph Attention Network (GAT) [42], Deep Graph CNN (DGCNN)

• 无监督方法:Graphlet Kernel (GK) [39] 和 Weisfeiler-Lehman kernel (WL) [38]。
• 监督方法:PATCHY-SAN (PSCN) [34]、Graph Convolutional Network (GCN) [24]、GraphSAGE [16]、Graph Attention Network (GAT) [42]、Deep Graph CNN (DGCNN)

[54], Graph Capsule Convolution Neural Network (GCAPS) [43], Capsule Graph Neural Network (CapsGNN) [47], Self-Attention Graph Pooling (SAGPool) [27], Graph Isomorphism Network (GIN-0) [48], Graph Feature Network (GFN) [7], Invariant-E qui variant Graph Network (IEGN) [31], Provably Powerful Graph Network (PPGN) [30], and Disc rim i native Structural Graph Classification (DSGC) [37].

[54]、Graph Capsule Convolution Neural Network (GCAPS) [43]、Capsule Graph Neural Network (CapsGNN) [47]、Self-Attention Graph Pooling (SAGPool) [27]、Graph Isomorphism Network (GIN-0) [48]、Graph Feature Network (GFN) [7]、Invariant-Equivariant Graph Network (IEGN) [31]、Provably Powerful Graph Network (PPGN) [30] 以及 Discriminative Structural Graph Classification (DSGC) [37]。

We report the baselines taken from the original papers or published in [20, 43, 47, 12, 7, 37].

我们报告了从原始论文或[20, 43, 47, 12, 7, 37]中发表的基线结果。

5 Experimental results

5 实验结果

Table 2 presents the experimental results of U2GNN and other strong baseline models for the benchmark datasets, and Figure 3 shows the classification accuracies of our proposed U2GNN across 10 folds for each dataset. On the social network datasets, our U2GNN produces new state-of-the-art performances on IMDB-B and IMDB-M and gains a competitive score on COLLAB. Especially, U2GNN obtains $4+%$ absolute higher accuracies than all the supervised baseline models on IMDB-B and IMDB-M. Regarding COLLAB, we found the best results of GCN in [7], where, after obtaining the graph embeddings, [7] utilized two fully-connected layers to predict the graph labels for GCN and GFN. This is different from our U2GNN where we used a single fully-connected layer as we discussed in Section 3.4.

表 2 展示了 U2GNN 和其他强基线模型在基准数据集上的实验结果,图 3 则显示了我们提出的 U2GNN 在每个数据集上 10 折交叉验证的分类准确率。在社交网络数据集上,我们的 U2GNN 在 IMDB-B 和 IMDB-M 上取得了新的最先进性能,并在 COLLAB 上获得了具有竞争力的分数。特别是,U2GNN 在 IMDB-B 和 IMDB-M 上的准确率比所有监督基线模型绝对高出 $4+%$。关于 COLLAB,我们发现 [7] 中 GCN 的最佳结果,其中 [7] 在获得图嵌入后,使用两个全连接层来预测 GCN 和 GFN 的图标签。这与我们的 U2GNN 不同,如第 3.4 节所述,我们仅使用了一个全连接层。

On the bioinformatics datasets, our U2GNN obtains the highest accuracies on DD, PROTEINS, and PTC.

在生物信息学数据集上,我们的U2GNN在DD、PROTEINS和PTC上获得了最高准确率。


Figure 2: A $\mathrm{t}$ -SNE visualization of the node embeddings learned by GIN-0 and our U2GNN on the PTC dataset.

图 2: GIN-0 和我们的 U2GNN 在 PTC 数据集上学习到的节点嵌入的 $\mathrm{t}$ -SNE 可视化。

Moreover, U2GNN achieves a competitive accuracy compared with those of the baseline models on MUTAG. Additionally, there are no significant differences between our U2GNN and the baselines on MUTAG as this dataset only consists of 188 graphs, which explains the high variance in the results.

此外,U2GNN在MUTAG数据集上的准确率与基线模型相比具有竞争力。由于该数据集仅包含188个图,导致结果方差较大,因此我们的U2GNN与基线模型在MUTAG上无显著差异。


Figure 3: Classification accuracies across 10 folds for each dataset.

图 3: 各数据集在10折交叉验证中的分类准确率。


Figure 4: Effects of the number of timesteps ( $^{\prime}I^{\prime}$ ), and the number of neighbors sampled for each node ( $N=$ $|\mathcal{N}_{\sf v}|,$ ).

图 4: 时间步数 ( $^{\prime}I^{\prime}$ ) 和每个节点采样的邻居数量 ( $N=$ $|\mathcal{N}_{\sf v}|,$ ) 的影响。

Hyper-parameter analysis. We investigate the effects of the number of timesteps ( $T$ ) and the number of neighbors sampled for each node ( $N=|\mathcal{N}_{\sf v}|$ ) in Figure 4. In general, we find that higher $T$ can help on most of the datasets as we may use more steps $T$ to encode the graph structures better. Furthermore, the social network datasets are denser than the bioinformatics ones; hence we should use more sampled neighbors (i.e., using higher $N$ ) on the social network datasets rather than on the bioinformatics ones.

超参数分析。我们在图 4 中研究了时间步数 ($T$) 和每个节点采样的邻居数 ($N=|\mathcal{N}_{\sf v}|$) 的影响。总体而言,我们发现更高的 $T$ 在大多数数据集上都有帮助,因为可以使用更多步数 $T$ 来更好地编码图结构。此外,社交网络数据集比生物信息学数据集更密集,因此应在社交网络数据集上使用更多采样邻居 (即更高的 $N$),而非生物信息学数据集。

Visualization. To qualitatively demonstrate the effectiveness of our U2GNN, we use $\mathrm{\Deltat}$ -SNE [29] to visualize the node embeddings learned by GIN-0 and our U2GNN on PTC where the node labels are available. Compared to GIN-0, Figure 2 shows that our U2GNN produces a higher quality of learned node embeddings wherein the nodes are well-clustered according to the node labels.

可视化。为了定性展示我们提出的 U2GNN 的有效性,我们使用 $\mathrm{\Deltat}$ -SNE [29] 对 GIN-0 和 U2GNN 在具有节点标签的 PTC 数据集上学习到的节点嵌入进行可视化。如图 2 所示,与 GIN-0 相比,我们的 U2GNN 能生成更高质量的节点嵌入,其中节点根据标签形成了良好的聚类。

In general, the superior performance of our method over the up-to-date baselines (especially, GIN-0) indicates that the U2GNN self-attention-based aggregation function is an advanced computation process to improve the classification performance of GNNs.

总体而言,我们的方法相较于最新基线(特别是 GIN-0)的优越性能表明,基于 U2GNN 自注意力(self-attention)的聚合函数是一种提升图神经网络(GNNs)分类性能的先进计算过程。

6 Conclusion

6 结论

We introduce U2GNN, an advantageous graph neural network model. U2GNN induces a powerful aggregation function leveraging the transformer self-attention network to improve the graph classification performance. We evaluate our U2GNN using the same data splits and the same 10-fold cross-validation scheme on well-known benchmark datasets. Experimental results demonstrate that U2GNN outperforms up-to-date models and produces state-of-the-art accuracies on these datasets. 2

我们推出U2GNN,一种优势显著的图神经网络模型。U2GNN通过利用Transformer自注意力网络构建强大的聚合函数,从而提升图分类性能。我们在知名基准数据集上采用相同的数据划分和10折交叉验证方案对U2GNN进行评估。实验结果表明,U2GNN优于当前最新模型,在这些数据集上实现了最先进的准确率。2

References

参考文献

A Unsupervised Graph Neural Networks

无监督图神经网络

The unsupervised learning is essential in both industry and academic applications, where expanding unsupervised GNN models is more suitable due to the limited availability of class labels. Therefore, we introduce a new unsupervised learning to train GNNs for the graph classification task.

无监督学习在工业界和学术界应用中至关重要,由于类别标签的可用性有限,扩展无监督GNN模型更为合适。因此,我们引入了一种新的无监督学习方法来训练GNN以完成图分类任务。

A.1 Learning process Most of the recent approaches have focused on the supervised learning where they use the graph labels during the training process [47, 48, 7, 31, 37]. In a situation where no graph labels are available during training, some works (such as DGK [49], Graph2Vec [33], and AWE [20]) have considered the unsupervised learning, where they can have access to all nodes from the entire dataset (i.e., additionally using all nodes in the test set during training). But they produce lower classification accuracies compared to the supervised approaches.

A.1 学习过程
近期大多数方法都集中在监督学习上,即在训练过程中使用图标签 [47, 48, 7, 31, 37]。在训练期间无法获取图标签的情况下,部分工作(如 DGK [49]、Graph2Vec [33] 和 AWE [20])采用了无监督学习,它们可以访问整个数据集的所有节点(即在训练期间额外使用测试集中的所有节点)。但与监督方法相比,这些方法的分类准确率较低。

Table 3: Graph classification results ( $%$ accuracy) in the unsupervised learning. The best scores are in bold. uGCN denotes our unsupervised GCN. Note that we do not make any direct comparison between the unsupervised approaches and the supervised ones because of the difference in the training data.

表 3: 无监督学习中的图分类结果 ( $%$ 准确率)。最佳分数以粗体显示。uGCN表示我们的无监督GCN。请注意,由于训练数据的差异,我们没有对无监督方法和有监督方法进行直接比较。

Model COLLAB IMDB-B IMDB-M DD PROTEINS MUTAG PTC
DGK [49] 73.09 ± 0.25 66.96 ± 0.56 44.55 ± 0.52 73.50 ± 1.01 75.68 ± 0.54 87.44 ± 2.72 60.08±2.55
AWE [20] 73.93 ± 1.94 74.45 ±5.83 51.54±3.61 71.51 ± 4.02 87.87 ± 9.76
uGCN 93.28 ± 0.99 94.50 ± 2.79 81.66 ± 3.16 94.31 ± 1.71 89.09 ± 3.25 95.36 ± 2.64 92.67 ± 4.60
U2GNN 95.62 ± 0.92 96.41 ± 1.94 89.20 ±2.52 95.67 ± 1.89 80.01 ± 3.21 88.47 ± 7.13 91.81 ± 6.61

model. We outline the general process of our unsupervised learning in Algorithm 2.

我们在算法2中概述了无监督学习的通用流程。

A.2 Training protocol We follow some unsupervised approaches such as DGK [49] and AWE [20] to train our unsupervised U2GNN on all nodes from the entire dataset (i.e., consisting of all nodes from the test set during training) for a fair comparison. The hyperparameters are varied as same as in Section 4.1.

A.2 训练协议
我们遵循一些无监督方法,如DGK [49]和AWE [20],在整个数据集的所有节点上(即训练期间包含测试集中的所有节点)训练我们的无监督U2GNN,以确保公平比较。超参数设置与第4.1节相同。

To this end, we propose a new unsupervised learning to train GNNs for the graph classification task. We can see $\mathbf{e}{\mathsf{v}}$ in Equation 3.9 as a vector representation encoded for the substructure around node $\mathsf{v}$ . The goal of our unsupervised learning is to guide GNNs to recognize and distinguish the substructures within each graph, leading to improve the classification accuracies of unsupervised models. To achieve this goal, we consider a final embedding $\mathbf{o}{\mathsf{v}}$ for each node $\mathsf{v}$ , and make the similarity between $\mathbf{e}{\mathsf{v}}$ and $\mathbf{o}{\mathsf{v}}$ higher than that between $\mathbf{e}_{\mathsf{v}}$ and the final embeddings of the other nodes, by minimizing the sampled softmax loss function [21] applied to node $\mathsf{v}$ as:

为此,我们提出了一种新的无监督学习方法,用于训练图神经网络 (GNN) 完成图分类任务。我们可以将公式 3.9 中的 $\mathbf{e}{\mathsf{v}}$ 视为节点 $\mathsf{v}$ 周围子结构的向量表示。我们无监督学习的目标是引导 GNN 识别和区分每个图中的子结构,从而提高无监督模型的分类准确率。为实现这一目标,我们为每个节点 $\mathsf{v}$ 考虑一个最终嵌入 $\mathbf{o}{\mathsf{v}}$ ,并通过最小化应用于节点 $\mathsf{v}$ 的采样 softmax 损失函数 [21] ,使得 $\mathbf{e}{\mathsf{v}}$ 和 $\mathbf{o}{\mathsf{v}}$ 之间的相似度高于 $\mathbf{e}_{\mathsf{v}}$ 与其他节点最终嵌入之间的相似度,该损失函数定义为:

We also train our GCN baseline following our unsupervised learning. We set the batch size to 4 and vary the number of GCN layers in ${1,2,3}$ and the hidden layer size in {32, 64, 128, 256}. We also use the Adam optimizer [23] to train this unsupervised GCN up to 50 epochs.

我们还按照无监督学习的方式训练了GCN基线模型。设置批次大小为4,并调整GCN层数为${1,2,3}$,隐藏层尺寸为{32, 64, 128, 256}。同样使用Adam优化器 [23] 对该无监督GCN进行最多50轮训练。

$$
\mathcal{L}{\mathsf{U2G N N}}\left(\mathsf{v}\right)=-\log\frac{\exp(\mathbf{o}{\mathsf{v}}\cdot\mathbf{e}{\mathsf{v}})}{\sum_{\mathsf{v^{\prime}}\in\mathcal{V^{\prime}}}\exp(\mathbf{o}{\mathsf{v^{\prime}}}\cdot\mathbf{e}_{\mathsf{v}})}
$$

$$
\mathcal{L}{\mathsf{U2G N N}}\left(\mathsf{v}\right)=-\log\frac{\exp(\mathbf{o}{\mathsf{v}}\cdot\mathbf{e}{\mathsf{v}})}{\sum_{\mathsf{v^{\prime}}\in\mathcal{V^{\prime}}}\exp(\mathbf{o}{\mathsf{v^{\prime}}}\cdot\mathbf{e}_{\mathsf{v}})}
$$

where $\nu^{\prime}$ is a subset sampled from ${\cup\mathcal{V}{m}}{m=1}^{M}$ . Node embeddings are learned implicitly as model parameters. After that, we sum all the final embeddings $\mathbf{o}{\mathsf{v}}$ of nodes $\mathsf{v}$ in $\mathcal{G}$ to obtain the graph embedding $\mathbf{e}_{\mathcal{G}}$ . We then use the logistic regression classifier [11] with setting the termination criterion to 0.001 to evaluate our

其中 $\nu^{\prime}$ 是从 ${\cup\mathcal{V}{m}}{m=1}^{M}$ 中采样的子集。节点嵌入作为模型参数被隐式学习。之后,我们对图 $\mathcal{G}$ 中所有节点 $\mathsf{v}$ 的最终嵌入 $\mathbf{o}{\mathsf{v}}$ 求和,得到图嵌入 $\mathbf{e}_{\mathcal{G}}$。随后采用逻辑回归分类器 [11] 并设置终止条件为 0.001 进行评估。

A.3 Evaluation protocol We utilizes the logistic regression classifier [11] with using the 10-fold crossvalidation scheme to evaluate our models. We compare our unsupervised GCN (denoted as uGCN) and U2GNN with the baselines: Deep Graph Kernel (DGK) [49] and Anonymous Walk Embedding (AWE) [20].

A.3 评估方案
我们采用逻辑回归分类器 [11] 和10折交叉验证方案来评估模型性能。将无监督GCN (记为uGCN) 和U2GNN与基线方法进行比较:深度图核 (DGK) [49] 和匿名游走嵌入 (AWE) [20]。

A.4 Experimental results Table 3 presents the experimental results in the unsupervised learning. We encourage a re-evaluation to examine the existing GNNs from the supervised learning to our new unsupervised learning to see negative results. For example, regarding the supervised learning, as shown in Table 2, our supervised U2GNN outperforms GCN on the bioinformatics datasets. However, regarding the unsupervised learning, as shown in Table 3, our uGCN works better than our unsupervised U2GNN on PROTEINS, MUTAG, and PTC. These three datasets are much sparse, and node neighbors have a similar effect on each other. Hence, without using the graph labels during training, U2GNN does not increase the similar effects on node neighbors, leading to be outperformed by our uGCN on these datasets.

A.4 实验结果
表 3 展示了无监督学习的实验结果。我们建议重新评估现有 GNN (Graph Neural Network) 从监督学习到我们新的无监督学习的表现,以观察负面结果。例如,在监督学习方面,如表 2 所示,我们的监督式 U2GNN 在生物信息学数据集上优于 GCN (Graph Convolutional Network)。然而,在无监督学习方面,如表 3 所示,我们的 uGCN 在 PROTEINS、MUTAG 和 PTC 数据集上表现优于无监督式 U2GNN。这三个数据集较为稀疏,且节点邻居之间具有相似的影响。因此,在训练过程中不使用图标签的情况下,U2GNN 未能增强节点邻居间的相似效应,导致在这些数据集上被我们的 uGCN 超越。

In general, both our unsupervised U2GNN and uGCN obtain the state-of-the-art accuracies on the benchmark datasets. The significant gains demonstrate a notable impact of our unsupervised learning. It aims to guide GNNs to identify the sub-graphs’ structures for every node; hence, the models can memorize the structural differences among graphs to produce the plausible node and graph embeddings as visualized in Figure 6, leading to improve the unsupervised performance. We hope that future GNN works can consider the unsupervised learning beside the supervised one.

总体而言,我们的无监督方法U2GNN和uGCN在基准数据集上都达到了最先进的准确率。这一显著提升证明了我们无监督学习的突出效果。该方法旨在指导GNN识别每个节点的子图结构,从而使模型能够记忆图之间的结构差异,生成如图6所示的可信节点与图嵌入,最终提升无监督性能。我们希望未来GNN研究能在监督学习之外也考虑无监督学习方向。


Figure 5: Effects of the number of timesteps ( $T$ ) and the number of neighbors sampled for each node ( $N=|\mathcal{N}_{\sf v}|$ ) in the unsupervised learning.

图 5: 无监督学习中时间步数 ($T$) 和每个节点采样邻居数 ($N=|\mathcal{N}_{\sf v}|$) 的影响。

Hyper-parameter analysis. As shown in both Figures 4 and 5, we also see similar findings in both the supervised and unsupervised training settings.

超参数分析。如图 4 和图 5 所示,我们在监督和无监督训练设置中也观察到了类似的发现。


Figure 6: A visualization of the node and graph embeddings learned by our unsupervised U2GNN on the DD dataset.

图 6: 我们的无监督 U2GNN 在 DD 数据集上学到的节点和图嵌入可视化。

阅读全文(20积分)