Universal Graph Transformer Self-Attention Networks
Abstract
The transformer self-attention network has been extensively used in research domains such as computer vision, image processing, and natural language processing. However, it has not been actively used in graph neural networks (GNNs), where constructing an advanced aggregation function is essential. To this end, we present U2GNN, an effective GNN model leveraging a transformer self-attention mechanism followed by a recurrent transition, to induce a powerful aggregation function to learn graph representations. Experimental results show that the proposed U2GNN achieves state-of-the-art accuracies on well-known benchmark datasets for graph classification.
Keywords: Graph neural networks $\cdot$ Graph classification $\cdot$ Transformer $\cdot$ Self-attention
1 Introduction
Graph-structured data appear in many real-world and scientific fields, e.g., knowledge graphs, recommender systems, social and citation networks, as well as telecommunication and biological networks [3, 56]. In general, we can view a graph as a network of nodes and edges, where nodes correspond to individual objects, and edges encode relationships among those objects. For example, in an online forum, each discussion thread can be constructed as a graph where nodes represent users, and edges represent commenting activities between users [49].
Learning graph representations is one of the most important topics for graph-structured data [17, 57, 46, 52], where the goal is to construct vector representations for graphs to predict graph labels. Early approaches measure similarities among graphs to build a graph kernel for graph classification [14], wherein graph kernel-based approaches decompose graphs into “atomic subgraphs” (such as random walks [14, 22, 44], shortest paths [4], graphlets [39], and Weisfeiler-Lehman subtree patterns [38]) to obtain a vector of frequencies of atomic subgraphs for each given graph. Besides, word embedding-based approaches utilize Word2Vec [32] and Doc2Vec [26] to learn embeddings for the atomic subgraphs [49] and the entire graphs [33, 20].
Recently, graph neural networks (GNNs) have become an essential strand for learning low-dimensional continuous embeddings of entire graphs to predict graph labels [36, 17, 57, 46, 52]. GNNs use an Aggregation function to update the vector representation of each node by transforming and aggregating the vector representations of its neighbors [24, 16, 42]. Then GNNs apply a graph-level pooling function (i.e., a ReadOut operation such as simple sum pooling [48]) to obtain graph embeddings. Compared to the graph kernel and word embedding-based approaches, GNNs have been achieving state-of-the-art accuracies for the graph classification task.
To further improve the classification performance, constructing a powerful aggregation function is essential for GNNs, as also discussed in [48]. The transformer self-attention network [41, 9] has recently been applied successfully in research domains such as computer vision, image processing, and natural language processing. Hence, we consider the use of the transformer in a new domain, i.e., graph neural networks (GNNs), to be a novel application. Inspired by this advanced self-attention network, we present U2GNN, an effective GNN model, which induces a powerful Aggregation function leveraging a self-attention mechanism [41] followed by a recurrent transition to update the vector representations of nodes from their neighbors. In particular, our U2GNN differs from related work as follows:
• A “concurrent” work, Hyper-SAGNN [55], utilizes the transformer self-attention network for hypergraphs, which have diverse and different structures and hence require a different solution. Besides, the later and closely related work GraphBERT [53] is an extension of our U2GNN for semi-supervised node classification.
• Graph Attention Network (GAT) [42] borrows the standard attention technique from [2], using a single-layer feed-forward neural network parametrized by a weight vector and then applying a non-linearity followed by the softmax to compute the importance weights of the neighbors of a given node. Note that our U2GNN adopts a scaled dot-product attention mechanism, which is more robust and efficient than the attention technique used in GAT.
• Regarding the model architecture, Graph Transformer Network (GTN) [51] identifies useful metapaths [45] to transform graph structures and applies GCN [24] to learn the node embeddings for the node classification task on heterogeneous graphs. Self-Attention Graph Pooling (SAGPool) [27] uses GCN as an attention mechanism to weight the nodes, employs a node selection method [13] to retain a portion of the nodes, and applies the existing graph-level ReadOut pooling layers (consisting of global pooling [54] and hierarchical pooling [6]) to obtain the graph embeddings.
• In summary, we note that U2GNN is entirely different from GAT [42], Graph Transformer Network [51], and Self-Attention Graph Pooling [27], apart from their similar titles.
Contributions. Our main contributions in this paper are as follows:
• We propose U2GNN, an effective GNN model, leveraging the transformer self-attention network to construct an advanced aggregation function to learn the graph representations.
• Experimental results show that U2GNN obtains state-of-the-art accuracies on well-known benchmark datasets for the graph classification task.
2 Related work
2.1 Graph kernel-based approaches
These early approaches decompose graphs into “atomic subgraphs” (such as random walks [14, 22, 44], shortest paths [4], graphlets [39], and Weisfeiler-Lehman subtree patterns [38]) to measure the similarities among graphs [14]. We can therefore view each atomic substructure as a word term and each graph as a text document, and represent a collection of graphs as a document-term matrix that describes the normalized frequency of terms in documents. An inner product over these frequency vectors yields a “kernel matrix”, which is used by kernel-based learning algorithms such as Support Vector Machines (SVM) [19] to measure the graph classification performance. We refer to [35, 25] for an overview of the graph kernel-based approaches.
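To make this pipeline concrete, the sketch below (plain Python/NumPy; the `graph_terms` input and all names are ours, not from the cited works) builds the document-term matrix of substructure frequencies and takes inner products to form the kernel matrix that a precomputed-kernel SVM could consume. It is a minimal sketch of the general recipe, not a specific kernel from the literature.

```python
import numpy as np
from collections import Counter

def kernel_matrix(graph_terms):
    """graph_terms: list of lists, each inner list holding the atomic
    substructures (e.g., shortest-path strings, WL labels) extracted
    from one graph. Returns the normalized-frequency kernel matrix."""
    vocab = sorted({t for terms in graph_terms for t in terms})
    index = {t: i for i, t in enumerate(vocab)}

    # Document-term matrix: one row of substructure frequencies per graph.
    X = np.zeros((len(graph_terms), len(vocab)))
    for g, terms in enumerate(graph_terms):
        for t, c in Counter(terms).items():
            X[g, index[t]] = c
        X[g] /= max(X[g].sum(), 1.0)  # normalize term frequencies

    return X @ X.T  # inner products give the kernel matrix

# Toy usage: three "graphs" described by their extracted substructures.
K = kernel_matrix([["A-B", "B-C"], ["A-B", "A-B"], ["C-D"]])
print(K.shape)  # (3, 3); K can be fed to a precomputed-kernel SVM
```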
2.2 Word embedding-based approaches
Since the introduction of word embedding models such as Word2Vec [32] and Doc2Vec [26], several works have used them for the graph classification task. Deep graph kernel (DGK) [49] applies Word2Vec to learn the embeddings of the atomic substructures (such as the graphlets, the WL subtree patterns, and the shortest paths) to derive the kernel matrix. Anonymous walk embedding (AWE) [20] maps random walks into “anonymous walks”, views each anonymous walk as a word token, and utilizes Doc2Vec to obtain the graph embeddings, which are used to compute the graph similarities and construct the kernel matrix. Graph2Vec [33] employs Doc2Vec to learn the embeddings for the WL subtree patterns and to obtain the graph embeddings to train an SVM classifier.
2.3 Graph neural networks
Recent works have focused on using graph neural networks (GNNs) to perform the graph classification task [36, 28, 34, 15, 54, 50, 43, 48]. In general, GNNs use an Aggregation function that updates the vector representation of each node by recursively propagating the vector representations of its neighbors [36, 24, 16, 42]. The Aggregation function can be a neural network such as gated recurrent units (GRU) [28] or multi-layer perceptrons (MLPs) [48]. Besides, GCN [24], GraphSAGE [16], and GAT [42] are also used as Aggregation functions. GNNs then utilize a ReadOut pooling function to obtain the graph embeddings, which are fed to multiple fully-connected layers followed by a softmax layer to predict the graph labels.
Regarding the aggregation of node representations, GCN [24] updates the vector representation of a given node $\mathsf{v}\in\mathcal{V}$ from its neighbors, using multiple GCN layers stacked on top of each other to capture $k$-hop neighbors, as:
$$
\mathbf{h}_{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\sum_{\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}a_{\mathsf{v},\mathsf{u}}\mathbf{W}^{(k)}\mathbf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$
where $k$ is the layer index; $a_{\mathsf{v},\mathsf{u}}$ is an edge constant between nodes $\mathsf{v}$ and $\mathsf{u}$ in the re-normalized adjacency matrix; $\mathbf{W}^{(k)}$ is a weight matrix; $\mathbf{h}_{\mathsf{u}}^{(0)}$ is the feature vector of node $\mathsf{u}$; $\mathsf{g}$ is a non-linear activation function; and $\mathcal{N}_{\mathsf{v}}$ is the set of neighbors of node $\mathsf{v}$.
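As a minimal illustration of this update, the dense sketch below assumes a precomputed re-normalized adjacency matrix `A_hat` with self-loops (as in [24]); variable names are ours.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """h_v^{(k+1)} = g( sum_u a_{v,u} W^{(k)} h_u^{(k)} ), written densely."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        # A_hat: [N, N] re-normalized adjacency (with self-loops);
        # H: [N, in_dim] node representations at layer k.
        return torch.relu(A_hat @ self.W(H))

# Stacking K such layers captures K-hop neighborhoods.
```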
GraphSAGE [16] extends GCN to use a node-wise procedure of uniformly sampling a fixed number of neighbors for each node at each layer as:
$$
\mathbf{h}_{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\mathbf{W}^{(k)}\left[\mathbf{h}_{\mathsf{v}}^{(k)};\mathbf{h}_{\mathcal{N}_{\mathsf{v}}^{\prime}}^{(k)}\right]\right),\forall\mathsf{v}\in\mathcal{V}
$$
where [;] denotes a vector concatenation, and $\mathbf{h}_{\mathcal{N}_{\mathsf{v}}^{\prime}}^{(k)}$ is aggregated from the representations of the nodes in $\mathcal{N}_{\mathsf{v}}^{\prime}$, a fixed-size set uniformly sampled from the neighbors $\mathcal{N}_{\mathsf{v}}$ of $\mathsf{v}$. Besides, $\mathcal{N}_{\mathsf{v}}^{\prime}$ is sampled differently at each layer.
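A rough sketch of this sampled-neighbor update follows; it assumes a mean aggregator over the sampled neighbors (one of GraphSAGE's variants), and the adjacency-list input and names are ours.

```python
import random
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """h_v <- g(W [h_v ; h_{N'_v}]) with N'_v uniformly sampled per call."""
    def __init__(self, in_dim, out_dim, num_samples=5):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim)
        self.num_samples = num_samples

    def forward(self, H, neighbors):
        # H: [N, in_dim]; neighbors: dict {v: list of neighbor ids}.
        out = []
        for v in range(H.size(0)):
            pool = neighbors[v] or [v]  # fall back to a self-loop if isolated
            sampled = [random.choice(pool) for _ in range(self.num_samples)]
            h_nv = H[sampled].mean(dim=0)  # mean aggregator over sampled neighbors
            out.append(torch.relu(self.W(torch.cat([H[v], h_nv]))))
        return torch.stack(out)
```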
Graph Attention Network (GAT) [42] extends GCN to compute edge weights following the standard attention technique [2] as:
$$
\mathbf{h}_{\mathsf{v}}^{(k+1)}=\mathsf{g}\left(\sum_{\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}\tau_{\mathsf{v},\mathsf{u}}^{(k)}\mathbf{W}^{(k)}\mathbf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$
where the attention weight $\tau_{\mathsf{v},\mathsf{u}}^{(k)}$ is computed as:
$$
\tau_{\mathsf{v},\mathsf{u}}^{(k)}=\mathsf{softmax}\left(\mathsf{LeakyReLU}\left(\pmb{a}^{(k)\top}\left[\mathbf{W}^{(k)}\mathbf{h}_{\mathsf{v}}^{(k)};\mathbf{W}^{(k)}\mathbf{h}_{\mathsf{u}}^{(k)}\right]\right)\right)
$$
where $\pmb{a}^{(k)}$ is a weight vector.
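For reference, a single-head sketch of this GAT-style update over a dense adjacency mask follows (our simplification; multi-head attention and dropout are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head GAT-style update over a dense boolean adjacency mask."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Parameter(torch.randn(2 * out_dim))  # scoring weight vector

    def forward(self, adj, H):
        # adj: [N, N] boolean mask including self-loops; H: [N, in_dim].
        Wh = self.W(H)                                   # [N, out_dim]
        N = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        scores = F.leaky_relu(pairs @ self.a)            # [N, N] raw scores
        scores = scores.masked_fill(~adj, float('-inf')) # keep only real edges
        tau = torch.softmax(scores, dim=-1)              # attention weights
        return torch.relu(tau @ Wh)
```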
3 U2GNN: Universal Graph Transformer Self-Attention Networks
In this section, we present the additional background of graph neural networks and detail our proposed U2GNN.
3.1 Graph classification
We represent each graph as $\mathcal{G}=\left(\mathcal{V},\mathcal{E},\{\mathbf{h}_{\mathsf{v}}^{(0)}\}_{\mathsf{v}\in\mathcal{V}}\right)$, where $\mathcal{V}$ is a set of nodes, $\mathcal{E}$ is a set of edges, and $\mathbf{h}_{\mathsf{v}}^{(0)}\in\mathbb{R}^{d}$ represents the initial feature vector of node $\mathsf{v}\in\mathcal{V}$. Given a set of $M$ graphs $\{\mathcal{G}_{m}\}_{m=1}^{M}$ and their corresponding class labels, the graph classification task is to learn an embedding $\mathbf{e}_{\mathcal{G}_{m}}$ for each graph $\mathcal{G}_{m}$ to predict its label $\mathsf{y}_{m}$.
3.2 Graph neural networks
In general, GNNs aim to update the vector representation of each node by recursively aggregating and transforming the vector representations of its neighbors [24, 16, 42]. After that, GNNs use a ReadOut pooling function to obtain the vector representations of the entire graphs [15, 54, 50, 43, 48].
Formally, the Aggregation step can be abstracted as:
$$
\mathbf{h}_{\mathsf{v}}^{(k+1)}=\mathsf{Aggregation}\left(\left\{\mathbf{h}_{\mathsf{u}}^{(k)}\right\}_{\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}\right),\forall\mathsf{v}\in\mathcal{V}
$$
where $\mathbf{h}_{\mathsf{v}}^{(k)}$ is the vector representation of node $\mathsf{v}$ at the $k$-th iteration/layer, and $\mathcal{N}_{\mathsf{v}}$ is the set of neighbors of $\mathsf{v}$.
Many methods have been proposed to construct the Aggregation functions, e.g., GCN [24], GraphSAGE [16], and GAT [42]. Recently, Graph Isomorphism Network (GIN-0) [48] uses a more powerful Aggregation function based on a multi-layer perceptron (MLP) network of two fully-connected layers as:
$$
\mathbf{h}_{\mathsf{v}}^{(k+1)}=\mathsf{MLP}^{(k)}\left(\sum_{\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}\mathbf{h}_{\mathsf{u}}^{(k)}\right),\forall\mathsf{v}\in\mathcal{V}
$$
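A minimal dense sketch of this GIN-0-style update (assuming a 0/1 adjacency matrix with self-loops added; names are ours):

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """h_v <- MLP( sum_{u in N(v) ∪ {v}} h_u ), i.e., a GIN-0-style update."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, adj_with_self_loops, H):
        # adj_with_self_loops: [N, N] 0/1 float matrix; H: [N, in_dim].
        return self.mlp(adj_with_self_loops @ H)  # sum over N(v) ∪ {v}
```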
Following [48], we employ a concatenation over the vector representations of node $\mathsf{v}$ at the different layers to construct a vector representation $\mathbf{e}_{\mathsf{v}}$ for each node $\mathsf{v}$ as:
$$
\mathbf{e}_{\mathsf{v}}=\left[\mathbf{h}_{\mathsf{v}}^{(1)};\mathbf{h}_{\mathsf{v}}^{(2)};...;\mathbf{h}_{\mathsf{v}}^{(K)}\right],\forall\mathsf{v}\in\mathcal{V}
$$
where $K$ is the index of the last layer. The graph-level ReadOut function can be a simple sum pooling [48] or a more complicated pooling such as hierarchical pooling [6] or differentiable pooling [50]. As the sum pooling produces competitive results [48], we use it to obtain the embedding $\mathbf{e}_{\mathcal{G}}$ of the entire graph $\mathcal{G}$ as:
$$
\mathbf{e}_{\mathcal{G}}=\sum_{\mathsf{v}\in\mathcal{V}}\mathbf{e}_{\mathsf{v}}=\sum_{\mathsf{v}\in\mathcal{V}}\left[\mathbf{h}_{\mathsf{v}}^{(1)};\mathbf{h}_{\mathsf{v}}^{(2)};...;\mathbf{h}_{\mathsf{v}}^{(K)}\right]
$$
After that, we can follow [48] to feed the graph embeddings $\mathbf{e}_{\mathcal{G}}$ to a single fully-connected layer followed by a softmax layer to predict the graph labels.
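Putting the layer-wise concatenation, the sum-pooling ReadOut, and the prediction layer together, a small sketch with toy shapes (not tied to any particular aggregation function):

```python
import torch
import torch.nn as nn

def readout_and_predict(layer_outputs, classifier):
    """layer_outputs: list of [N, d] tensors h^{(1)}, ..., h^{(K)} for one graph.
    classifier: nn.Linear mapping K*d -> number of classes."""
    e_v = torch.cat(layer_outputs, dim=-1)   # [N, K*d] concatenation per node
    e_G = e_v.sum(dim=0)                     # sum pooling over all nodes
    return torch.softmax(classifier(e_G), dim=-1)

# Example: K = 3 layers, d = 8, 5 nodes, 2 classes.
outs = [torch.randn(5, 8) for _ in range(3)]
probs = readout_and_predict(outs, nn.Linear(3 * 8, 2))
```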
3.3 The proposed U2GNN
The transformer self-attention network [41, 9] has been widely and successfully applied in research domains such as computer vision and NLP. Similarly, we consider the use of this recent advanced technique in a new domain, i.e., graph neural networks (GNNs), as a novel application. Moreover, as also discussed in [48], constructing a powerful Aggregation mechanism is essential for GNNs. To this end, we induce an advanced Aggregation function using the universal transformer network [9], consisting of a self-attention mechanism [41] followed by a recurrent transition (Trans), with residual connections [18] and layer normalization (LNorm) [1], as illustrated in Figure 1.
The residual connections [18] are used to add useful information learned in the lower layers to the higher layers, and more importantly, to allow gradients to directly pass through the layers to avoid vanishing gradient or exploding gradient problems. The layer normalization (LNorm) [1] is used to normalize the inputs across the feature dimensions to stabilize the network to enable smoother gradients and faster training. The residual connections and the layer normalization are commonly used in many architectures and thus are omitted in the paper for simplicity.
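As a reminder of how these two wrappers compose around a sub-layer, here is a generic sketch with a linear stand-in for the sub-layer (not U2GNN-specific):

```python
import torch
import torch.nn as nn

d = 16
lnorm = nn.LayerNorm(d)        # normalizes across the feature dimension
sublayer = nn.Linear(d, d)     # stands in for self-attention or the MLP

h = torch.randn(10, d)
out = lnorm(h + sublayer(h))   # residual connection, then layer normalization
```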
Figure 1: Illustration of our U2GNN.
Formally, given an input graph $\mathcal{G}$, we uniformly sample a set $\mathcal{N}_{\mathsf{v}}$ of neighbors for each $\mathsf{v}\in\mathcal{V}$ and then input $\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}$ to the U2GNN learning process. Note that we sample a different $\mathcal{N}_{\mathsf{v}}$ for node $\mathsf{v}$ at each training batch. We also construct multiple layers stacked on top of each other in our U2GNN. Regarding the $k$-th layer, given a node $\mathsf{v}\in\mathcal{V}$, at each step $t$, we induce a transformer self-attention-based function to aggregate the vector representations of all nodes $\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}$ as:
$$
\mathbf{h}_{t,\mathsf{u}}^{(k)}=\mathrm{TRANSFORMER\text{-}AGGREGATION}\left(\mathbf{h}_{t-1,\mathsf{u}}^{(k)}\right)
$$
In particular,
$$
\mathbf{x}_{t,\mathsf{u}}^{(k)}=\mathrm{LNorm}\left(\mathbf{h}_{t-1,\mathsf{u}}^{(k)}+\mathrm{Att}\left(\mathbf{h}_{t-1,\mathsf{u}}^{(k)}\right)\right)
$$
then,
$$
\mathbf{h}_{t,\mathsf{u}}^{(k)}=\mathrm{LNorm}\left(\mathbf{x}_{t,\mathsf{u}}^{(k)}+\mathrm{Trans}\left(\mathbf{x}_{t,\mathsf{u}}^{(k)}\right)\right)
$$
where $\mathbf{h}_{t,\mathsf{u}}^{(k)}\in\mathbb{R}^{d}$; $\mathrm{Trans}(.)$ and $\mathrm{Att}(.)$ denote an MLP network (i.e., two fully-connected layers) and a self-attention network, respectively:
$$
\mathrm{Trans}\left(\mathbf{x}_{t,\mathsf{u}}^{(k)}\right)=\mathbf{W}_{2}^{(k)}\mathrm{ReLU}\left(\mathbf{W}_{1}^{(k)}\mathbf{x}_{t,\mathsf{u}}^{(k)}+\mathbf{b}_{1}^{(k)}\right)+\mathbf{b}_{2}^{(k)}
$$
where $\mathbf{W}_{1}^{(k)}\in\mathbb{R}^{s\times d}$ and $\mathbf{W}_{2}^{(k)}\in\mathbb{R}^{d\times s}$ are weight matrices, and $\mathbf{b}_{1}^{(k)}$ and $\mathbf{b}_{2}^{(k)}$ are bias parameters, and:
$$
\mathrm{Att}\left(\mathbf{h}_{t-1,\mathsf{u}}^{(k)}\right)=\sum_{\mathsf{u}^{\prime}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}\alpha_{\mathsf{u},\mathsf{u}^{\prime}}^{(k)}\left(\mathbf{V}^{(k)}\mathbf{h}_{t-1,\mathsf{u}^{\prime}}^{(k)}\right)
$$
where $\mathbf{V}^{(k)}\in\mathbb{R}^{d\times d}$ is a value-projection weight matrix, and $\alpha_{\mathsf{u},\mathsf{u}^{\prime}}^{(k)}$ is an attention weight, which is computed using the scaled dot-product attention mechanism as:
$$
\alpha_{\mathsf{u},\mathsf{u}^{\prime}}^{(k)}=\mathsf{softmax}\left(\frac{\left(\mathbf{Q}^{(k)}\mathbf{h}_{t-1,\mathsf{u}}^{(k)}\right)\cdot\left(\mathbf{K}^{(k)}\mathbf{h}_{t-1,\mathsf{u}^{\prime}}^{(k)}\right)}{\sqrt{d}}\right)
$$
where $\mathbf{Q}^{(k)}\in\mathbb{R}^{d\times d}$ and $\mathbf{K}^{(k)}\in\mathbb{R}^{d\times d}$ are query-projection and key-projection matrices, respectively.
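In isolation, the scaled dot-product attention over the nodes of one sampled neighborhood can be sketched as follows (single head; the random projections appear only in the toy usage):

```python
import torch

def scaled_dot_product_attention(H, Q, K, V):
    """H: [n, d] node representations of N_v ∪ {v}; Q, K, V: [d, d] projections.
    Returns sum_u' alpha_{u,u'} (V h_u') for every node u."""
    d = H.size(-1)
    scores = (H @ Q.T) @ (H @ K.T).T / d ** 0.5   # (Q h_u) . (K h_u') / sqrt(d)
    alpha = torch.softmax(scores, dim=-1)          # attention weights per node u
    return alpha @ (H @ V.T)

# Toy usage with random projections for a 4-node neighborhood, d = 8.
H = torch.randn(4, 8)
out = scaled_dot_product_attention(H, *(torch.randn(8, 8) for _ in range(3)))
```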
After $T$ steps, we feed $\mathbf{h}_{T,\mathsf{v}}^{(k)}\in\mathbb{R}^{d}$ to the next $(k+1)$-th layer as:
$$
\mathbf{h}_{0,\mathsf{v}}^{(k+1)}=\mathbf{h}_{T,\mathsf{v}}^{(k)},\forall\mathsf{v}\in\mathcal{V}
$$
Note that $\mathbf{h}_{0,\mathsf{v}}^{(0)}=\mathbf{h}_{\mathsf{v}}^{(0)}\in\mathbb{R}^{d}$ is the feature vector of node $\mathsf{v}$.
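Under our reading of the equations above, one U2GNN-style layer can be sketched as $T$ recurrent steps, each applying the self-attention and the two-layer Trans MLP, wrapped in residual connections and layer normalization. The hyperparameter names d, s, and T follow the notation above; everything else in this sketch is our assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class U2GNNAggregationSketch(nn.Module):
    """One U2GNN-style layer applied to the nodes of N_v ∪ {v} (a sketch)."""
    def __init__(self, d, s, T):
        super().__init__()
        self.Q = nn.Linear(d, d, bias=False)   # query projection
        self.K = nn.Linear(d, d, bias=False)   # key projection
        self.V = nn.Linear(d, d, bias=False)   # value projection
        self.trans = nn.Sequential(nn.Linear(d, s), nn.ReLU(), nn.Linear(s, d))
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.T = T
        self.scale = d ** 0.5

    def attention(self, H):
        # Scaled dot-product attention over the nodes in the neighborhood.
        alpha = torch.softmax((self.Q(H) @ self.K(H).T) / self.scale, dim=-1)
        return alpha @ self.V(H)

    def forward(self, H):
        # H: [n, d] holds h_{0,u}^{(k)} for every u in N_v ∪ {v}.
        for _ in range(self.T):
            X = self.norm1(H + self.attention(H))  # Att + residual + LNorm
            H = self.norm2(X + self.trans(X))      # Trans + residual + LNorm
        return H  # h_{T,u}^{(k)}; the row of v feeds layer k+1
```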
We apply the vector concatenation across the layers, as in Section 3.2, to obtain the vector representation $\mathbf{e}_{\mathsf{v}}$ of each node $\mathsf{v}$:
$$
\mathbf{e}_{\mathsf{v}}=\left[\mathbf{h}_{0,\mathsf{v}}^{(1)};\mathbf{h}_{0,\mathsf{v}}^{(2)};...;\mathbf{h}_{0,\mathsf{v}}^{(K)}\right],\forall\mathsf{v}\in\mathcal{V}
$$
where $K$ is the number of layers. We use $\mathbf{e}_{\mathsf{v}}$ as the final embedding of node $\mathsf{v}\in\mathcal{V}$ and then sum the final embeddings of all nodes in $\mathcal{G}$ to get the final embedding $\mathbf{e}_{\mathcal{G}}$ of the entire graph $\mathcal{G}$. We feed $\mathbf{e}_{\mathcal{G}}$ to a single fully-connected layer followed by a softmax layer to predict the graph label as:
$$
\hat{\mathbf{y}}_{\mathcal{G}}=\mathsf{softmax}\left(\mathbf{W}\mathbf{e}_{\mathcal{G}}+\mathbf{b}\right)
$$
Finally, we learn the model parameters by minimizing the cross-entropy loss function. To sum up, we briefly present the learning process of our proposed U2GNN in Algorithm 1.
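The end-to-end flow described in this section can be summarized by the following sketch (not a transcription of Algorithm 1; neighbor sampling and batching are abstracted away, and the toy usage plugs in simple stand-in layers rather than the U2GNN aggregation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def u2gnn_forward(layers, classifier, H0):
    """layers: K per-layer aggregation modules applied graph-wide for brevity;
    classifier: nn.Linear(K*d, num_classes); H0: [N, d] initial node features."""
    H, per_layer = H0, []
    for layer in layers:
        H = layer(H)                      # h_{0,v}^{(k+1)} = h_{T,v}^{(k)}
        per_layer.append(H)
    e_v = torch.cat(per_layer, dim=-1)    # concatenate across the K layers
    e_G = e_v.sum(dim=0)                  # sum pooling over the nodes
    return classifier(e_G)                # logits; softmax lives in the loss

# Toy usage with stand-in layers (K = 2, d = 8, 5 nodes, 3 classes).
layers = nn.ModuleList([nn.Sequential(nn.Linear(8, 8), nn.ReLU()) for _ in range(2)])
clf = nn.Linear(2 * 8, 3)
loss = F.cross_entropy(u2gnn_forward(layers, clf, torch.randn(5, 8)).unsqueeze(0),
                       torch.tensor([1]))   # cross-entropy on the graph label
loss.backward()
```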
3.4 Discussion
We discuss some findings in our proposed U2GNN as follows:
• As established empirically, our results in Section 5 imply that the U2GNN self-attention-based aggregation function is a more powerful computation process than other existing aggregation functions.
• If we set $T$ to 1, $\alpha_{\mathsf{u},\mathsf{u}^{\prime}}^{(k)}$ to 1, and $\mathbf{V}^{(k)}$ to the identity matrix in the $\mathrm{Att}$ equation above, and do not use the residual connections and the layer normalization, our U2GNN aggregation function simplifies to:
$$
\begin{array}{rcl}
\mathbf{h}_{1,\mathsf{v}}^{(k+1)}&=&\mathrm{Trans}\left(\mathrm{Att}\left(\mathbf{h}_{0,\mathsf{v}}^{(k+1)}\right)\right)\\
&=&\mathrm{Trans}\left(\displaystyle\sum_{\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}\mathbf{h}_{0,\mathsf{u}}^{(k+1)}\right)\\
&=&\mathrm{Trans}\left(\displaystyle\sum_{\mathsf{u}\in\mathcal{N}_{\mathsf{v}}\cup\{\mathsf{v}\}}\mathbf{h}_{1,\mathsf{u}}^{(k)}\right)
\end{array}
$$
where $\mathrm{Trans}(.)$ denotes the MLP network of two fully-connected layers defined above. This implies that our U2GNN can be simplified to be equivalent to Graph Isomorphism Network (GIN-0) [48], one of the recent state-of-the-art GNNs. Experimental results presented in Section 5 show that U2GNN outperforms GIN-0 on benchmark datasets for the graph classification task.
• We probably could construct a complex architecture using a complicated graph-level pooling (such as hierarchical pooling [6]) followed by multiple fully-connected layers [7, 31] to predict the graph labels. However, we refrained from doing that, as our key purpose is to introduce a single, unified, and effective model that can work well and produce competitive performances on the benchmark datasets.
4 Experimental setup
We use seven well-known datasets consisting of three social network datasets (COLLAB, IMDB-B, and IMDB-M) and four bioinformatics datasets (DD, MUTAG, PROTEINS, and PTC). The social network datasets do not have available node features; thus, we follow [34, 54] to use node degrees as features. Table 1 reports the statistics of these datasets.