A Method to Predict Semantic Relations on Artificial Intelligence Papers
Abstract—Predicting the emergence of links in large evolving networks is a difficult task with many practical applications. Recently, the Science4Cast competition illustrated this challenge by presenting a network of 64,000 AI concepts and asking participants to predict which topics are going to be researched together in the future. In this paper, we present a solution to this problem based on a new family of deep learning approaches, namely Graph Neural Networks.
The results of the challenge show that our solution is competitive even though we had to impose severe restrictions to obtain a computationally efficient and parsimonious model: ignoring the intrinsic dynamics of the graph and using only a small subset of the nodes surrounding a target link. Preliminary experiments presented in this paper suggest that the model is learning two related but different patterns: the absorption of a node by a sub-graph and the union of denser sub-graphs. The model seems to excel at recognizing the first type of pattern.
Index Terms—Link Prediction, Graph Learning, Deep Learning, Graph Neural Networks, Science4Cast.
I. INTRODUCTION
In recent years, there has been increasing interest in learning from graph-structured data [1]. As graphs can be used to represent data in a diverse range of disciplines, problems such as graph or node classification have a wide range of applications.
One of the main topics in graph learning is the problem of Link Prediction: given a graph at some time $t$, the goal is to predict the formation of new links in future states of the graph. Many approaches have been developed over the years: graph kernels [2], node embeddings [3] and traditional Machine Learning methods [4] have all been widely used with good performance. More recently, Deep Learning has enabled new, improved results [5].
The Science4Cast challenge seeks to encourage research on Link Prediction [4]. It proposes a new benchmark dataset that contains semantic information on scientific papers and can be used to forecast emerging research topics in the Artificial Intelligence field.
The solution presented in this paper is based on a new, specialized family of Neural Networks, namely Graph Neural Networks (GNN) [6]. Experimental results show that the method can achieve competitive performance using only a subset of the nodes surrounding a target link and ignoring valuable information about the dynamics of the network (which is used by other methods). These simplifications lead to a solution with low memory usage that can be used without specialized hardware.
Preliminary experiments we present in this paper suggest the model is learning two related, but different patterns. The first is a kind of cold-start scenario in which a node is absorbed by a larger sub-graph. The second is the union of two dense sub-graphs. The model seems to excel at recognizing the asymmetric cold-start case.
The rest of this paper is structured as follows. In the next section we provide an overview of the Science4Cast challenge. In Section III, we introduce the problem of link prediction, providing a brief summary of the background behind our method and of the technique used as a baseline. In Section IV we present our solution, stressing the restrictions we had to impose to obtain an efficient method. Section V presents numerical results on the test data made available to the participants of the challenge. Section VI concludes the work with some final remarks.
II. THE CHALLENGE
The massive amount of scientific papers (and its accelerated growth) allows for a detailed analysis of the behavior and evolution of scientific disciplines with an unprecedented level of sophistication [4], [7].
This is the motivation behind the Science4Cast Challenge, which seeks to predict emerging trends in AI.
To achieve this, a semantic network obtained from an intelligent analysis of a large corpus of AI papers is presented to the community [4]. In the resulting graph, nodes represent semantic concepts relevant to the field and edges represent the co-occurrence of two concepts in the same paper.
The challenge is as follows. Given the graph formed until the year 2017, predict the probability of link formation for different pairs of nodes by the year 2020. This is equivalent to predicting which pairs of concepts are going to be researched together during the following 3 years.
From a technical perspective, the challenge can be viewed as a Link Prediction problem on a partially dynamic graph: new edges can be added between existing vertices along a temporal dimension.
A number of useful applications can be derived from this challenge: from an assistant system that helps researchers choose their research subjects, to the meta-study of the dynamics and evolution of science, as well as novel approaches to the link prediction problem itself.
A. Dataset
The network given for the challenge contains information about the co-occurrence of concepts in the AI field from 1994 until 2017. It has around 64,000 nodes, each representing a concept addressed in an AI paper. The network is represented as a list of edges and their creation dates (discretized in days since 01/01/1990). There are 7,652,945 edges formed by the end of 2017. A list of $10^{6}$ vertex pairs without links by 2017 is designated for making the predictions for 2020. These potential links are not subject to any restrictions and were chosen completely at random.
An older snapshot of the network is given for testing/validation. A list of $10^{6}$ vertex pairs without links by the end of 2014 is designated as a standardized test set for making predictions for 2017. The ground truth for these test pairs is explicitly provided. Most of the results reported in this paper were obtained on this standardized test set.
DESCRIPTIVE STATISTICS OF THE STANDARDIZED TEST SET (2014) AND CHALLENGE SET (2017): NUMBER OF NODES, NUMBER OF EDGES, PERCENTAGE OF TEST PAIRS IN WHICH EXACTLY 1 NODE HAS DEGREE 0, PERCENTAGE OF TEST PAIRS IN WHICH BOTH NODES HAVE DEGREE 0
| Year | #Nodes | #Edges | % pairs with exactly 1 degree-0 node | % pairs with both nodes of degree 0 |
|------|--------|--------|--------------------------------------|-------------------------------------|
| 2014 | 64,719 | 2,278,611 | 49.4% | 30.9% |
| 2017 | 64,719 | 7,652,945 | 31.8% | 3.9% |
The semantic meaning of each node is hidden from the participants. In fact, no feature information about nodes or edges is provided. Therefore, only the topology can be used to address the challenge. A notable problem with this is that a significant amount of test pairs have at least one node with degree 0, which means that there is no information about its relationships with other nodes. This is similar to the cold-start problem in collaborative filtering, a problem that could be addressed using additional metadata.
In this paper, potential links are categorized into 3 types: 1) Type 0: links where both nodes have a nonzero degree, 2) Type 1: links where exactly 1 node has a nonzero degree, and 3) Type 2: links where both nodes have a degree of 0.
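As a minimal illustration (hypothetical helper and variable names, not part of the challenge code), the type of each candidate pair can be obtained from the node degrees of the given snapshot, assuming the graph is stored as a SciPy sparse adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

def categorize_pairs(adj: csr_matrix, pairs: np.ndarray) -> np.ndarray:
    """Return 0, 1 or 2 for each candidate pair, counting its degree-0 endpoints.

    `adj` is the (weighted) adjacency matrix of the snapshot and `pairs` is an
    (M, 2) array of node indices for the unconnected vertex pairs to score.
    """
    degrees = np.asarray(adj.sum(axis=1)).ravel()      # per-node degree
    zero_u = (degrees[pairs[:, 0]] == 0).astype(int)   # first endpoint isolated?
    zero_v = (degrees[pairs[:, 1]] == 0).astype(int)   # second endpoint isolated?
    return zero_u + zero_v                             # Type 0, 1 or 2
```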
It is worth noting that some edges can be repeated if the same pair of concepts is found in different papers, which results in a weighted undirected graph.
III. BACKGROUND
A variety of methods have been investigated to learn from graph data. Graph kernels, node embeddings and traditional Machine Learning methods have been widely used. Recently, Deep Learning has allowed for breakthroughs in different areas of graph learning.
A. Machine Learning based on Feature Engineering
In this approach, a feature vector is first designed using, e.g., metrics from graph theory such as degree, common neighbours (CN), and the Katz index. Then, a traditional supervised algorithm is trained. The method in [4] is proposed in the Science4Cast Challenge as a competitive baseline, referred to as MK's Baseline. Its feature vector was designed using 15 metrics that capture information about a pair of nodes along the temporal dimension: the degree of both nodes in the current year and the previous two years, the total number of shared neighbours of each node in the current year and the previous two years, and the number of shared neighbours in the current year and the previous two years. A 2-layer Fully Connected Neural Network is then trained on this representation.
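As a rough illustration only (this is not the baseline's actual code and it covers only a simplified subset of the 15 features; helper names are hypothetical), features of this kind can be computed from a list of yearly snapshots as follows:

```python
import numpy as np
from scipy.sparse import csr_matrix

def pair_features(snapshots: list, u: int, v: int) -> np.ndarray:
    """Simplified hand-crafted features for a candidate pair (u, v).

    `snapshots` holds the adjacency matrices of the current year and the two
    previous years; for each one we record both degrees and the number of
    common neighbours, i.e. a reduced, baseline-style feature vector.
    """
    feats = []
    for adj in snapshots:
        binary = (adj > 0).astype(np.int8)              # ignore edge multiplicities
        deg = np.asarray(binary.sum(axis=1)).ravel()
        common = binary[u].multiply(binary[v]).nnz      # shared neighbours of u and v
        feats.extend([deg[u], deg[v], common])
    return np.asarray(feats, dtype=np.float32)
```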
A modified version of MK's Baseline is used in this paper to compare results on the standardized test set. All Type 2 vertex pairs were excluded from the input data and were assigned a prediction of 0.
B. Graph Neural Networks
Graph Neural Networks (GNN) are a specialized family of Neural Networks that can receive a graph directly as input [8] [9] [10] [11]. Based on message-passing mechanisms, the main idea is to learn a latent representation for each node by aggregating information about its neighbours using learnable parameters. As GNNs have grown in popularity, many layers have been proposed to implement this idea. A Graph Convolutional Network (GCN) is characterized by having a layer that resembles a convolution operation.
a) GCN Layer: This is the standard layer for the graph convolution operation [8]. For a given undirected graph $G$ with $N$ nodes, adjacency matrix $A\in R^{N\times N}$, and node feature matrix $X\in R^{N\times c}$, the layer computes the following propagation
$$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right),$$
where $H^{(l)}\in R^{N\times D}$ is the matrix of activations of layer $l$ ($H^{(0)}=X$), $\tilde{A}=A+I_{N}$ is the adjacency matrix of the graph $G$ with added self-connections, $I_{N}$ is the $N\times N$ identity matrix, $\tilde{D}_{ii}=\sum_{j}\tilde{A}_{ij}$, $W^{(l)}$ is a layer-specific trainable weight matrix, and $\sigma(\cdot)$ is a nonlinear activation function (usually $\tanh(\cdot)$).
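For illustration, a minimal dense NumPy sketch of this propagation rule (real implementations rely on sparse operations and libraries such as PyTorch Geometric):

```python
import numpy as np

def gcn_layer(adj: np.ndarray, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One GCN propagation step: sigma(D^{-1/2} (A + I) D^{-1/2} H W), sigma = tanh."""
    a_tilde = adj + np.eye(adj.shape[0])                 # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))      # diagonal of D^{-1/2}
    a_norm = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.tanh(a_norm @ h @ w)                       # h: (N, c), w: (c, D)
```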
Many specialized architectures based on graph convolutional layers have been proposed for specific problems. In [12], an architecture for graph classification, namely DGCNN, was presented. In a DGCNN, a stack of graph convolutional layers is used to extract latent representations of the nodes. A novel Sort Pooling layer takes as input the concatenation of every GCN layer output, sorts the node representations in a consistent order and standardizes the output to a fixed size $k\times\sum c_{i}$, where $c_{i}$ is the number of channels of graph convolutional layer $i$. The parameter $k$ limits the number of nodes considered in the layer's output and should therefore be fine-tuned accordingly.
A stack of 1D convolutional layers consumes the flattened output of the Sort Pooling layer, and a few fully-connected layers complete the model.
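A simplified per-graph sketch of the Sort Pooling operation (common implementations sort the nodes by their last feature channel; exact tie-breaking details are omitted, and names here are hypothetical):

```python
import torch

def sort_pooling(node_feats: torch.Tensor, k: int) -> torch.Tensor:
    """Sort node representations by their last channel, then truncate or
    zero-pad to exactly k rows so every graph yields a fixed-size output."""
    order = node_feats[:, -1].argsort(descending=True)
    x = node_feats[order]
    n, c = x.shape
    if n >= k:
        return x[:k]
    pad = torch.zeros(k - n, c, dtype=x.dtype, device=x.device)
    return torch.cat([x, pad], dim=0)
```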
b) Link prediction using GNN: For link prediction, [13] proposed the SEAL framework. Let us define:

• The $h$-hop enclosing sub-graph $G_{x,y}^{h}$ of a pair of nodes $(x,y)$: the sub-graph induced from $G$ by the set of nodes $\{i \mid d(i,x)\leq h \text{ or } d(i,y)\leq h\}$, where $d(\cdot,\cdot)$ denotes the shortest-path distance.
• An $h$-order heuristic for $(x,y)$: a heuristic score for the link $(x,y)$ that can be computed using only the $h$-hop neighbourhood of $x$ and $y$.
Using these definitions, it is easy to prove the following theorem.
Theorem 1. Any $h$ -order heuristic for $(x,y)$ can be accurately calculated from $G_{x,y}^{h}.$
Moreover, they propose a novel $\gamma$-decaying heuristic theory that unifies a wide range of heuristics in a single framework. It can be proved that any high-order $\gamma$-decaying heuristic can be approximated from the $h$-hop enclosing sub-graph with an error that, under certain conditions, decreases at least exponentially with $h$. Lastly, they demonstrate that most well-known heuristics for link prediction can be represented as $\gamma$-decaying heuristics.
According to these results, the SEAL framework addresses the link prediction problem as the task of classifying the sub-graphs enclosing the target nodes. This requires 3 stages:
• Enclosing sub-graph extraction.
• Node information matrix $X$ construction.
• GNN training.
Node Information Matrix Construction. This step is often crucial for training an accurate link prediction model. The node information matrix $X$ in SEAL usually accommodates three types of features: explicit node features, node embeddings, and structural labels. Structural labels encode the role of each node in the topology. The center nodes $x$ and $y$ are the target nodes between which the link is located. Nodes with different relative positions to the center nodes have different structural importance to the link.
[13] proposed the Double-Radius Node Labeling (DRNL) method $f_{l}:V\to N$ , which assigns an integer label $f_{l}(i)$ to every node $i$ in the enclosing sub-graph. $f_{l}(i)$ takes the form
$$f_{l}(i)=1+\min(d_{x},d_{y})+(d/2)\left[(d/2)+(d\,\%\,2)-1\right],$$
where $d_{x}:=d(i,x)$, $d_{y}:=d(i,y)$, $d:=d_{x}+d_{y}$, and $(d/2)$ and $(d\,\%\,2)$ are the integer quotient and remainder of $d$ divided by 2, respectively.
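A minimal NetworkX sketch of this labeling (hypothetical helper; note that the original SEAL implementation additionally computes $d_{x}$ with $y$ temporarily removed and vice versa, and assigns label 0 to unreachable nodes, details omitted here):

```python
import networkx as nx

def drnl_label(subgraph: nx.Graph, i, x, y) -> int:
    """Double-Radius Node Label of node i in the enclosing sub-graph of (x, y)."""
    if i == x or i == y:
        return 1                                         # target nodes get label 1
    dx = nx.shortest_path_length(subgraph, i, x)         # d(i, x)
    dy = nx.shortest_path_length(subgraph, i, y)         # d(i, y)
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * ((d // 2) + (d % 2) - 1)
```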
IV. PROPOSED METHOD
There are a variety of decisions to be made on how the challenge should be posed. The method presented in this paper has 2 main stages: 1) data usage strategy, and 2) SEAL-DGCNN training.
A. Data Usage Strategy
As the challenge's goal is to use the graph formed until a given year to predict the existence of a potential link 3 years in the future, it makes sense to prepare training data under exactly the same conditions. For a given dataset, let us define:
• Input year: the last year up to which the graph is formed.
• Target year: the year for which predictions are required.
In the challenge, Target year $=$ Input year $+$ 3. For a given pair (Input year, Target year), the training data is created using an analogous, earlier pair of years (Input year training, Target year training) with the same 3-year gap.
Therefore, the training inputs for the model will include node pairs $(a,b)$ for which no links exist in the graph by the end of Input year training. The training labels/targets will be $+1$ for node pairs for which a link has been observed by Target year training and $0$ for node pairs for which no link has been observed by Target year training.
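A sketch of this labeling step, under the assumption that the raw data is an edge list with creation days and that the candidate pairs have already been sampled (hypothetical helper; the sampling restrictions are not shown):

```python
import numpy as np

def label_training_pairs(edges: np.ndarray, days: np.ndarray,
                         input_cutoff: int, target_cutoff: int,
                         pairs: np.ndarray) -> np.ndarray:
    """Label candidate pairs that are unlinked at the input cutoff: 1 if a link
    appears by the target cutoff, 0 otherwise (cutoffs expressed in days)."""
    by_input = {tuple(sorted(e)) for e, t in zip(edges.tolist(), days) if t <= input_cutoff}
    by_target = {tuple(sorted(e)) for e, t in zip(edges.tolist(), days) if t <= target_cutoff}
    labels = []
    for u, v in pairs.tolist():
        key = (min(u, v), max(u, v))
        assert key not in by_input          # candidate pairs must be unlinked at input year
        labels.append(1 if key in by_target else 0)
    return np.asarray(labels, dtype=np.int64)
```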
The training pairs $(a,b)$ were chosen almost at random, subject to only 2 restrictions.
It is worth noting that the proposed method uses only the final state of the graph as a static input to the network, without considering any form of information along the temporal dimension.
B. SEAL - DGCNN Training
The SEAL framework based on a DGCNN is used to predict link formation, i.e., for each input pair, the $h$-hop enclosing sub-graph is extracted and that sub-graph is classified by a DGCNN. Due to computational constraints, only 1-hop enclosing sub-graphs were used, and $65\%$ of their nodes were dropped at random.
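A NetworkX-based sketch of this extraction step with the random node dropping (hypothetical helper; the 0.65 drop ratio follows the text):

```python
import random
import networkx as nx

def one_hop_enclosing_subgraph(g: nx.Graph, x, y, drop_ratio: float = 0.65) -> nx.Graph:
    """1-hop enclosing sub-graph of the pair (x, y), with part of the surrounding
    nodes dropped at random to keep memory usage low."""
    neighbours = (set(g.neighbors(x)) | set(g.neighbors(y))) - {x, y}
    n_keep = int(round(len(neighbours) * (1.0 - drop_ratio)))
    kept = random.sample(sorted(neighbours), n_keep)     # random subset of neighbours
    return g.subgraph([x, y] + kept).copy()              # induced sub-graph
```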
For each sub-graph, the feature matrix $X$ is constructed as a one-hot encoding of the DRNL node labels described in the previous section.
The DGCNN architecture used for the challenge consists of 4 stacked GCN layers with 128 channels/filters each, a 1-channel GCN layer, a Sort Pooling layer, two 1D convolutional layers with 64 and 128 channels respectively, a max-pooling layer, and two 1D convolutional layers with 128 and 64 channels respectively, followed by 3 fully-connected layers with 256, 128 and 1 neurons, respectively.
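A rough PyTorch Geometric sketch of this architecture (layer widths follow the description above; the sort-pooling size $k$, kernel sizes and other hyper-parameters are assumptions and may differ from the model actually used):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_sort_pool

class DGCNNSketch(nn.Module):
    """Illustrative DGCNN for enclosing sub-graph classification (not the official code)."""
    def __init__(self, in_channels: int, k: int = 30):
        super().__init__()
        self.k = k
        # 4 GCN layers with 128 channels plus a final 1-channel GCN layer.
        self.gcns = nn.ModuleList([
            GCNConv(in_channels, 128), GCNConv(128, 128),
            GCNConv(128, 128), GCNConv(128, 128), GCNConv(128, 1),
        ])
        total_channels = 4 * 128 + 1                 # concatenated GCN outputs
        # 1D convolutions over the sort-pooled node sequence.
        self.conv1 = nn.Conv1d(1, 64, kernel_size=total_channels, stride=total_channels)
        self.conv2 = nn.Conv1d(64, 128, kernel_size=3)
        self.pool = nn.MaxPool1d(2)
        self.conv3 = nn.Conv1d(128, 128, kernel_size=3)
        self.conv4 = nn.Conv1d(128, 64, kernel_size=3)
        # 3 fully-connected layers (LazyLinear infers the flattened size).
        self.mlp = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x, edge_index, batch):
        outputs, h = [], x
        for gcn in self.gcns:
            h = torch.tanh(gcn(h, edge_index))
            outputs.append(h)
        h = torch.cat(outputs, dim=-1)               # (num_nodes, 513)
        h = global_sort_pool(h, batch, k=self.k)     # (batch_size, k * 513)
        h = h.unsqueeze(1)                           # 1-channel signal for Conv1d
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h))
        h = self.pool(h)
        h = torch.relu(self.conv3(h))
        h = torch.relu(self.conv4(h))
        return self.mlp(h.flatten(1))                # link-formation logit
```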