[Paper translation] A Fair Comparison of Graph Neural Networks for Graph Classification


Original paper: https://arxiv.org/pdf/1912.09893v3


A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION

ABSTRACT

Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigorousness and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models.

1 INTRODUCTION

Over the years, researchers have raised concerns about several flaws in scholarship, such as experimental reproducibility and replicability in machine learning (McDermott, 1976; Lipton & Steinhardt, 2018) and science in general (National Academies of Sciences & Medicine, 2019). These issues are not easy to address, as a collective effort is required to avoid bad practices. Examples include the ambiguity of experimental procedures, the impossibility of reproducing results and the improper comparison of machine learning models. As a result, it can be difficult to uniformly assess the effectiveness of one method against another. This work investigates these issues for the graph representation learning field, by providing a uniform and rigorous benchmarking of state-of-the-art models.

Graph Neural Networks (GNNs) (Micheli, 2009; Scarselli et al., 2008) have recently become the standard tool for machine learning on graphs. These architectures effectively combine node features and graph topology to build distributed node representations. GNNs can be used to solve node classification (Kipf & Welling, 2017) and link prediction (Zhang & Chen, 2018) tasks, or they can be applied to downstream graph classification (Bacciu et al., 2018). In the literature, such models are usually evaluated on chemical and social domains (Xu et al., 2019).

Given their appeal, an ever increasing number of GNNs is being developed (Gilmer et al., 2017). However, despite the theoretical advancements reached by the latest contributions in the field, we find that the experimental settings are in many cases ambiguous or not reproducible.

Some of the most common reproducibility problems we encounter in this field concern hyperparameter selection and the correct usage of data splits for model selection versus model assessment. Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not standardized across different works in terms of node and edge features.

These issues easily generate doubts and confusion among practitioners who need a fully transparent and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through two different phases, namely model selection on the validation set and model assessment on the test set. Clearly, failing to keep these phases well separated could lead to over-optimistic and biased estimates of the true performance of a model, making it hard for other researchers to present competitive results without following the same ambiguous evaluation procedures.

With this premise, our primary contribution is to provide the graph learning community with a fair performance comparison among GNN architectures, using a standardized and reproducible experimental environment. More in detail, we performed a large number of experiments within a rigorous model selection and assessment framework, in which all models were compared using the same features and the same data splits.

Secondly, we investigate if and to what extent current GNN models can effectively exploit graph structure. To this end, we add two domain-specific and structure-agnostic baselines, whose purpose is to disentangle the contribution of structural information from node features. Much to our surprise, we found out that these baselines can even perform better than GNNs on some datasets; this calls for moderation when reporting improvements that do not clearly outperform structure-agnostic competitors.

Our last contribution is a study on the effect of node degrees as features in social datasets. Indeed, we show that providing the degree can be beneficial in terms of performance, and that it also has implications for the number of GNN layers needed to reach good results.

We publicly release code and dataset splits to reproduce our results, in order to allow other researchers to carry out rigorous evaluations with minimum additional effort.

Disclaimer Before delving into the work, we would like to clarify that this work does not aim at pinpointing the best (or worst) performing GNN, nor does it disavow the effort researchers have put into the development of these models. Rather, it is intended to be an attempt to set up a standardized and uniform evaluation framework for GNNs, such that future contributions can be compared fairly and objectively with existing architectures.

2 RELATED WORK

Graph Neural Networks At the core of GNNs is the idea to compute a state for each node in a graph, which is iteratively updated according to the state of neighboring nodes. Thanks to layering (Micheli, 2009) or recursive (Scarselli et al., 2008) schemes, these models propagate information and construct node representations that can be “aware” of the broader graph structure. GNNs have recently gained popularity because they can efficiently and automatically extract relevant features from a graph; in the past, the most popular way to deal with complex structures was to use kernel functions (Shervashidze et al., 2011) to compute task-agnostic features. However, such kernels are non-adaptive and typically computationally expensive, which makes GNNs even more appealing.

Even though in this work we specifically focus on architectures designed for graph classification, all GNNs share the notion of “convolution” over node neighborhoods, as a generalization of convolution on grids. For example, GraphSAGE (Hamilton et al., 2017) first performs sum, mean or max-pooling neighborhood aggregation, and then it updates the node representation applying a linear projection on top of the convolution. It also relies on a neighborhood sampling scheme to keep computational complexity constant. Instead, Graph Isomorphism Network (GIN) (Xu et al., 2019) builds upon the limitations of GraphSAGE, extending it with arbitrary aggregation functions on multi-sets. The model is proven to be as theoretically powerful as the Weisfeiler-Lehman test of graph isomorphism. Very recently, Wagstaff et al. (2019) gave an upper bound to the number of hidden units needed to learn permutation-invariant functions over sets and multi-sets. Differently from the above methods, Edge-Conditioned Convolution (ECC) (Simonovsky & Komodakis, 2017) learns a different parameter for each edge label. Therefore, neighbor aggregation is weighted according to specific edge parameters. Finally, Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al., 2018) proposes a convolutional layer similar to the formulation of Kipf & Welling (2017).

Some models also exploit a pooling scheme, which is applied after convolutional layers in order to reduce the size of a graph. For example, the pooling scheme of ECC coarsens graphs through a differentiable pooling map that can be pre-computed. Similarly, DiffPool (Ying et al., 2018) proposes an adaptive pooling mechanism that collapses nodes on the basis of a supervised criterion. In practice, DiffPool combines a differentiable graph encoder with its pooling strategy, so that the architecture is end-to-end trainable. Lastly, DGCNN differs from other works in that nodes are sorted and aligned by a specific algorithm called SortPool (Zhang et al., 2018).

Model evaluation The work of Shchur et al. (2018) shares a similar purpose with our contribution. In particular, the authors compare different GNNs on node classification tasks, showing that results are highly dependent on the particular train/validation/test split of choice, up to the point where changing splits leads to dramatically different performance rankings. Thus, they recommend evaluating GNNs on multiple test splits to achieve a fair comparison. Even though we operate in a different setting (graph instead of node classification), we follow the authors’ suggestions by evaluating models under a controlled and rigorous assessment framework. Finally, the work of Dacrema et al. (2019) criticizes a large number of neural recommender systems, most of which are not reproducible, showing that only one of them truly improves against a simple baseline.

3 RISK ASSESSMENT AND MODEL SELECTION

Here, we recap the risk assessment (also called model evaluation or model assessment) and model selection procedures, to clearly lay out the experimental procedure followed in this paper. For space reasons, the overall procedure is visually summarized in Appendix A.1.

3.1 RISK ASSESSMENT

The goal of risk assessment is to provide an estimate of the performance of a class of models. When a test set is not explicitly given, a common way to proceed is to use $k$-fold Cross Validation (CV) (Stone, 1974; Varma & Simon, 2006; Cawley & Talbot, 2010). $k$-fold CV uses $k$ different training/test splits to estimate the generalization performance of a model; for each partition, an internal model selection procedure selects the hyper-parameters using the training data only. This way, test data is never used for model selection. As model selection is performed independently for each training/test split, we obtain different “best” hyper-parameter configurations; this is why we refer to the performance of a class of models.
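To make the splitting step concrete, the following is a minimal sketch of a stratified $k$-fold partition in plain Python. The function names `stratified_k_fold` and `train_test_splits` are hypothetical and are not taken from the authors' released code; the round-robin dealing is just one simple way to keep class proportions roughly equal across folds.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices whose class proportions roughly match `labels`."""
    rng = random.Random(seed)  # fixed seed so the splits can be reproduced
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        # deal the members of each class round-robin over the folds
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy example: 100 graphs, 60 of class 0 and 40 of class 1.
labels = [0] * 60 + [1] * 40
folds = stratified_k_fold(labels, k=10)
splits = list(train_test_splits(folds))
```

Storing `folds` on disk once and reusing them for every model is what makes the resulting comparison use identical data splits.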

3.2 MODEL SELECTION

The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyper-parameter configurations the one that works best on a specific validation set. If a validation set is not given, one can rely on a holdout training/validation split or an inner $k$-fold. Nevertheless, the key point to remember is that validation performances are biased estimates of the true generalization capabilities. Consequently, model selection results are generally over-optimistic; this issue is thoroughly documented in Cawley & Talbot (2010). This is why the main contribution of this work is to clearly separate model selection and model assessment estimates, something that is lacking or ambiguous in the literature under consideration.

4 OVERVIEW OF REPRODUCIBILITY ISSUES

To motivate our contribution, we follow the approach of Dacrema et al. (2019) and briefly review recent papers describing five different GNN models, highlighting problems in the experimental setups as well as reproducibility of results. We emphasize that our observations are based solely on the contents of the papers and the available code. Suitable GNN works were selected according to the following criteria: i) performances obtained with 10-fold CV; ii) peer reviewed; iii) strong architectural differences; iv) popularity. In particular, we selected DGCNN (Zhang et al., 2018), DiffPool (Ying et al., 2018), ECC (Simonovsky & Komodakis, 2017), GIN (Xu et al., 2019) and GraphSAGE (Hamilton et al., 2017). For a detailed description of each model we refer to their respective papers. Our criteria to assess quality of evaluation and reproducibility are: i) code for data preprocessing, model selection and assessment is provided; ii) data splits are provided; iii) data is split by means of a stratification technique, to preserve class proportions across all partitions; iv) results of the 10-fold CV are reported correctly using standard deviations, and they refer to model evaluation (test sets) rather than model selection (validation sets). Table 1 summarizes our findings.

Table 1: Criteria for reproducibility considered in this work and their compliance among considered models. (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A) indicates ambiguity (i.e. it is unclear whether the criterion is met or not), (-) indicates lack of information (i.e. no details are provided about the criterion). Note that GraphSAGE is excluded from this comparison, as it was not directly applied by its authors to graph classification tasks.

| Criterion | DGCNN | DiffPool | ECC | GIN |
|---|---|---|---|---|
| Data preprocessing code | Y | Y | - | Y |
| Model selection code | N | N | - | N |
| Model evaluation code | Y | Y | - | Y |
| Data splits provided | Y | N | N | Y |
| Label stratification | Y | N | - | Y |
| Test accuracy reported | Y | A | A | N |
| Standard deviations reported | Y | N | N | Y |

DGCNN The authors evaluate the model on 10-fold CV. While the architecture is fixed for all datasets, learning rate and epochs are tuned using only one random CV fold, and then reused on all the other folds. While this practice is still acceptable, it may lead to sub-optimal performances. Nonetheless, the code to reproduce model selection is not available. Moreover, the authors run CV 10 times, and they report the average of the 10 final scores. As a result, the variance of the provided estimates is reduced. However, the same procedure was not applied to the other competitors. Finally, CV data splits are correctly stratified and publicly available, making it possible to reproduce at least the evaluation experiments.

DiffPool From both the paper and the provided code, it is unclear if reported results are obtained on a test set rather than a validation set. Although the authors state that 10-fold CV is used, standard deviations of DiffPool and its competitors are not reported. Moreover, the authors state that they applied early stopping on the validation set to prevent overfitting; unfortunately, neither model selection code nor validation splits are available. Furthermore, according to the code, data is randomly split (without stratification) and no random seed is set, hence splits are different each time the code is executed.

ECC The paper reports that ECC is evaluated on 10-fold CV, but results do not include standard deviations. Similarly to DGCNN, hyper-parameters are fixed in advance, hence it is not clear if and how model selection has been performed. Importantly, there are no references in the code repository to data pre-processing, data stratification, data splitting, and model selection.

GIN The authors correctly list all the hyper-parameters tuned. However, as stated explicitly in the paper and in the public review discussion, they report the validation accuracy of 10-fold CV. In other words, reported results refer to model selection and not to model evaluation. The code for model selection is not provided.

GraphSAGE The original paper does not test this model on graph classification datasets, but GraphSAGE is often used in other papers as a strong baseline. It follows that GraphSAGE results on graph classification should be accompanied by the code to reproduce the experiments. Despite that, the two works which report results of GraphSAGE (DiffPool and GIN) fail to do so.

Summary Our analysis reveals that GNN works rarely comply with good machine learning practices as regards the quality of evaluation and reproducibility of results. This motivates the need to re-evaluate all models within a rigorous, reproducible and fair environment.

5 EXPERIMENTS

In this section we detail our main experiment, in which we re-evaluate the above-mentioned models on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely follows the rigorous practices described in Section 3. In addition, we implement two baselines whose purpose is to understand the extent to which GNNs are able to exploit structural information. All models have been implemented by means of the PyTorch Geometric library (Fey & Lenssen, 2019), which provides graph pre-processing routines and makes the definition of graph convolution easier to implement. We sometimes found discrepancies between papers and related code; in such cases, we complied with the specifications in the paper. Because GraphSAGE was not applied to graph classification in the original work, we opted for a max-pooling global aggregation function to classify graph instances; further, we do not use the sampled neighborhood aggregation scheme defined in Hamilton et al. (2017), in order to allow nodes to have access to their whole neighborhood.

Datasets All graph datasets are publicly available (Kersting et al., 2016) and represent a relevant subset of those most frequently used in the literature to compare GNNs. Some collect molecular graphs, while others contain social graphs. In particular, we used D&D (Dobson & Doig, 2003), PROTEINS (Borgwardt et al., 2005), NCI1 (Wale et al., 2008) and ENZYMES (Schomburg et al., 2004) for binary and multi-class classification of chemical compounds, whereas IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-5K and COLLAB (Yanardag & Vishwanathan, 2015) are social datasets. Dataset statistics are reported in Table A.2.

Baselines We adopt two distinct baselines, one for chemical and one for social datasets. On all chemical datasets except ENZYMES, we follow Ralaivola et al. (2005); Luzhnica et al. (2019) and implement the Molecular Fingerprint technique, which first applies global sum pooling (i.e., counts the occurrences of atom types in the graph by summing the features of all nodes in the graph together) and then applies a single-layer MLP with ReLU activations. On social domains and ENZYMES (due to the presence of additional features), we take inspiration from the work of Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes: first, we apply a single-layer MLP on top of node features, followed by global sum pooling and another single-layer MLP for classification. Note that neither baseline leverages graph topology.

Using these baselines as a reference is of fundamental importance for future works, as they can provide feedback on the effectiveness of GNNs on a specific dataset. As a matter of fact, if GNN performances are close to the ones of a structure-agnostic baseline, one can draw two possible conclusions: the task does not need topological information to be effectively solved, or the GNN is not exploiting graph structure adequately. While the former can be verified through domain-specific human expertise, the latter is more difficult to assess, as multiple factors come into play such as the amount of training data, the structural inductive bias imposed by the architecture and the hyper-parameters used for model selection. Nevertheless, significant improvements with respect to these baselines are a strong indicator that graph topology has been exploited. Therefore, structure-agnostic baselines become vital to understand if and how a model can be improved.
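The forward passes of the two baselines can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: "single-layer MLP" is read here as one dense ReLU layer, a linear readout is added to produce class scores, and all names (`init_linear`, `fingerprint_baseline`, `deep_sets_baseline`) and layer sizes are hypothetical. Weights are random, since only the data flow matters for the illustration.

```python
import random

def init_linear(n_in, n_out, rng):
    """Random weights and zero biases for a dense layer (illustrative only)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sum_pool(node_features):
    """Global sum pooling: add up the feature vectors of all nodes."""
    return [sum(column) for column in zip(*node_features)]

def fingerprint_baseline(node_features, hidden, out):
    """Chemical baseline: sum pooling (atom-type counts), ReLU layer, readout."""
    pooled = sum_pool(node_features)
    return linear(relu(linear(pooled, hidden)), out)

def deep_sets_baseline(node_features, node_layer, out):
    """Social/ENZYMES baseline: per-node ReLU layer, sum pooling, readout."""
    embedded = [relu(linear(x, node_layer)) for x in node_features]
    return linear(sum_pool(embedded), out)

rng = random.Random(0)
# Four atoms with one-hot types (3 types); no edge list is ever passed in.
molecule = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
hidden, out = init_linear(3, 8, rng), init_linear(8, 2, rng)
logits = fingerprint_baseline(molecule, hidden, out)
shuffled_logits = fingerprint_baseline(molecule[::-1], hidden, out)
```

Because both functions only ever see the multiset of node features, reordering the nodes leaves the output unchanged, and any two graphs with the same node features but different edges receive the same prediction.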

Table 2: Pseudo-code for model assessment (left) and model selection (right). In Algorithm 1, “Select” refers to Algorithm 2, whereas “Train” and “Eval” represent training and inference phases, respectively. After each model selection, the best configuration $\mathrm{best}_{k}$ is used to evaluate the external test fold. Performances are averaged across $R$ training runs, where $R$ in our case is set to 3.

Algorithm 1 Model Assessment ($k$-fold CV)

1: Input: Dataset $\mathcal{D}$, set of configurations $\Theta$
2: Split $\mathcal{D}$ into $k$ folds $F_{1},\ldots,F_{k}$
3: for $i\gets 1,\ldots,k$ do
4: $\mathrm{train}_{k},\mathrm{test}_{k}\gets\bigl(\bigcup_{j\neq i}F_{j}\bigr),F_{i}$
5: $\mathrm{best}_{k}\gets\mathrm{Select}(\mathrm{train}_{k},\Theta)$
6: for $r\gets 1,\ldots,R$ do
7: $\mathrm{model}_{r}\gets\mathrm{Train}(\mathrm{train}_{k},\mathrm{best}_{k})$
8: $p_{r}\gets\mathrm{Eval}(\mathrm{model}_{r},\mathrm{test}_{k})$
9: end for
10: $\mathrm{perf}_{k}\gets\sum_{r=1}^{R}p_{r}/R$
11: end for
12: return $\sum_{i=1}^{k}\mathrm{perf}_{i}/k$

Algorithm 2 Model Selection

1: Input: $\mathrm{train}_{k}$, $\Theta$
2: Split $\mathrm{train}_{k}$ into $\mathrm{train}$ and $\mathrm{valid}$
3: $p_{\theta}=\emptyset$
4: for each $\theta\in\Theta$ do
5: $\mathrm{model}\gets\mathrm{Train}(\mathrm{train},\theta)$
6: $p_{\theta}\gets p_{\theta}\cup\mathrm{Eval}(\mathrm{model},\mathrm{valid})$
7: end for
8: $\mathrm{best}_{\theta}\gets\operatorname{argmax}_{\theta}p_{\theta}$
9: return $\mathrm{best}_{\theta}$
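The two procedures in Table 2 can be sketched in Python as follows. This is a simplified illustration, not the released code: `select` and `assess` are hypothetical names, the inner holdout is a plain (unstratified) 10% split, the early-stopping holdout used in the final runs is omitted, and the toy threshold "model" only exists to make the sketch runnable.

```python
import random
import statistics

def select(train_k, configs, train_fn, eval_fn, rng):
    """Algorithm 2: score every configuration on a held-out 10% of train_k."""
    data = list(train_k)
    rng.shuffle(data)
    n_valid = max(1, len(data) // 10)
    valid, train = data[:n_valid], data[n_valid:]
    scores = [eval_fn(train_fn(train, theta), valid) for theta in configs]
    # configuration with the highest validation score (first one wins on ties)
    return configs[scores.index(max(scores))]

def assess(folds, configs, train_fn, eval_fn, R=3, seed=0):
    """Algorithm 1: k-fold assessment with an independent selection per fold."""
    rng = random.Random(seed)
    perf = []
    for i, test_k in enumerate(folds):
        train_k = [x for j, fold in enumerate(folds) if j != i for x in fold]
        best_k = select(train_k, configs, train_fn, eval_fn, rng)  # never sees test_k
        runs = [eval_fn(train_fn(train_k, best_k), test_k) for _ in range(R)]
        perf.append(statistics.mean(runs))
    return statistics.mean(perf), statistics.stdev(perf)

# Hypothetical toy task: the label is 1 when x > 0.5; a "model" is just a threshold.
def toy_train(data, theta):
    return theta  # nothing is learned; the configuration is the model

def toy_eval(model, data):
    return sum(int(x > model) == y for x, y in data) / len(data)

samples = [(i / 100, int(i > 50)) for i in range(100)]
folds = [samples[i::10] for i in range(10)]
mean_acc, std_acc = assess(folds, configs=[0.5, 0.2, 0.8],
                           train_fn=toy_train, eval_fn=toy_eval)
```

The point of the structure is that `test_k` is only touched after `best_k` has been fixed, so the reported mean and standard deviation are assessment estimates rather than selection estimates.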

Experimental Setting Our experimental approach is to use a 10-fold CV for model assessment and an inner holdout technique with a $90\%/10\%$ training/validation split for model selection. After each model selection, we train three times on the whole training fold, holding out a random fraction ($10\%$) of the data to perform early stopping. These three separate runs are needed to smooth the effect of unfavorable random weight initialization on test performances. The final test fold score is obtained as the mean of these three runs; Table 2 reports the pseudo-code of the entire evaluation process.

To be consistent with the literature, we implement early stopping with patience parameter $n$, where training stops if $n$ epochs have passed without improvement on the validation set. A high value of $n$ can favor model selection by making it less sensitive to fluctuations in the validation score at the cost of additional computation. Importantly, all data partitions have been pre-computed, so that models are selected and evaluated on the same data splits. Moreover, all data splits are stratified, i.e., class proportions are preserved inside each $k$-fold split as well as in the holdout splits used for model selection.
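The patience rule described above can be illustrated with a short sketch. The function `early_stopping_epoch` is a hypothetical helper that replays a list of per-epoch validation scores instead of training a real model; it assumes that higher scores are better (for a validation loss the comparison would be reversed).

```python
def early_stopping_epoch(validation_scores, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # patience exhausted
    return best_epoch, len(validation_scores) - 1  # ran out of epochs

scores = [0.60, 0.65, 0.64, 0.66, 0.65, 0.65, 0.64, 0.70]
short_patience = early_stopping_epoch(scores, patience=3)
long_patience = early_stopping_epoch(scores, patience=5)
```

On this toy curve, a patience of 3 stops at epoch 6 and keeps the model from epoch 3, whereas a patience of 5 survives the plateau and reaches the better score at epoch 7, at the price of more epochs.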

Hyper-parameters Hyper-parameter tuning is performed via grid search. For the sake of conciseness, we list all hyper-parameters in Section A.7. Notice that we always include those used by other authors in their respective papers. We select the number of convolutional layers, the embedding space dimension, the learning rate, and the criterion for early stopping (either based on the validation accuracy or validation loss) for all models. Depending on the model, we also selected regularization terms, dropout, and other model-specific parameters.

Computational considerations Our experiments involve a large number of training runs. For all models, grid sizes range from 32 to 72 possible configurations, depending on the number of hyper-parameters to choose from. However, we tried to keep the upper bound on the number of parameters as similar as possible across models. The total effort required, in terms of the number of single training runs, to complete model assessment procedures exceeded 47000. Such a large number required extensive use of parallelism, both in CPU and GPU, to conduct the experiments in a reasonable amount of time. We emphasize that in some cases (e.g. ECC in social datasets), training on a single hyper-parameter configuration required more than 72 hours, which would have made the sequential exploration of one single grid last months. Therefore, due to the large amount of experiments to conduct and to the computational resources available, we limited the time to complete a single training to 72 hours.
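As a rough, back-of-the-envelope illustration of where the cost comes from, the pseudo-code in Table 2 implies that one model on one dataset needs $k\,(|\Theta| + R)$ single training runs: $|\Theta|$ runs for the selection in each fold plus $R$ final runs. The helper below is a hypothetical sketch of that count only; it does not attempt to reproduce the reported total, which depends on the actual grid of each model.

```python
def training_runs(k, grid_size, R):
    """Single training runs for one model on one dataset: k * (|Theta| + R)."""
    return k * (grid_size + R)

# With 10 folds and 3 final runs, for the smallest and largest grid sizes quoted above.
smallest = training_runs(k=10, grid_size=32, R=3)
largest = training_runs(k=10, grid_size=72, R=3)
```

Under these assumptions a single model/dataset pair requires between 350 and 750 training runs, which is why the experiments were run in parallel.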

Table 3: Results on chemical datasets, reported as mean accuracy ± standard deviation. Best performances are highlighted in bold.

| Model | D&D | NCI1 | PROTEINS | ENZYMES |
|---|---|---|---|---|
| Baseline | **78.4±4.5** | 69.8±2.2 | **75.8±3.7** | **65.2±6.4** |
| DGCNN | 76.6±4.3 | 76.4±1.7 | 72.9±3.5 | 38.9±5.7 |
| DiffPool | 75.0±3.5 | 76.9±1.9 | 73.7±3.5 | 59.5±5.6 |
| ECC | 72.6±4.1 | 76.2±1.4 | 72.3±3.4 | 29.5±8.2 |
| GIN | 75.3±2.9 | **80.0±1.4** | 73.3±4.0 | 59.6±4.5 |
| GraphSAGE | 72.9±2.0 | 76.0±1.8 | 73.0±4.5 | 58.2±6.0 |

6 RESULTS AND DISCUSSION

Tables 3 and 4 show the results of our experiments. Overall, GIN seems to be effective on social datasets. Importantly, we discover that on D&D, PROTEINS and ENZYMES none of the GNNs are able to improve over the baseline. On the contrary, on NCI1 the baseline is clearly outperformed: this result suggests that the GNNs we analyzed can actually exploit the topological information of the graphs in this dataset. Moreover, we observe that an overly-parameterized baseline is not able to overfit the NCI1 training data completely. To see this, consider that a baseline with 10000 hidden units and no regularization reaches around $67\%$ training accuracy, while GIN can easily overfit ($\approx 100\%$) the training data. This indicates that structural information hugely affects the ability to fit the training set. On social datasets, we observe that adding node degrees as features is beneficial, but such an effect is more noticeable for REDDIT-BINARY, REDDIT-5K and COLLAB.

6.1 THE IMPORTANCE OF BASELINES

6.1 基线的重要性

Our results also show that structure-agnostic baselines are an essential tool to understand the effectiveness of GNNs and extract useful insights. As an example, since none of the GNNs surpasses the baseline on D&D, PROTEINS and ENZYMES, we argue that the state-of-the-art GNN models we analyzed are not yet able to fully exploit the structure on such datasets; indeed, in chemistry, structural features are known to correlate with molecular properties (van Rossum, 1963). For all these reasons, we suggest putting small performance gains on these datasets into the right perspective, at least until the baseline is clearly outperformed. Currently, small average fluctuations on these datasets are likely to be caused by other factors, such as random initializations, rather than a successful exploitation of the structure. In conclusion, we warmly recommend that GNN practitioners include baseline comparisons in future works, in order to better characterize the extent of their contributions.

我们的研究结果还表明,结构无关基线是理解图神经网络(GNN)有效性和提取有用洞察的重要工具。例如,由于在D&D、PROTEINS和ENZYMES数据集上没有任何GNN模型超越基线表现,我们认为所分析的最先进GNN模型尚未能充分利用这些数据集的结构信息;事实上,在化学领域,结构特征已知与分子特性相关 (van Rossum, 1963)。基于这些原因,我们建议正确看待这些数据集上的微小性能提升,至少在基线被明显超越之前。目前,这些数据集上的微小平均波动更可能是由随机初始化等其他因素引起,而非对结构的成功利用。最后,我们强烈建议GNN实践者在未来工作中纳入基线比较,以更准确地评估其贡献程度。
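
The baseline comparison argued for above can be checked directly against the numbers in Table 3. The snippet below is a minimal sketch: the values are copied from Table 3, while the function name `gnns_beating_baseline` is hypothetical. It compares mean accuracies only, so it ignores the standard deviations; a gap smaller than one standard deviation should not be read as a real improvement.

```python
# Mean test accuracy (%) and standard deviation, copied from Table 3.
TABLE3 = {
    "Baseline":  {"D&D": (78.4, 4.5), "NCI1": (69.8, 2.2), "PROTEINS": (75.8, 3.7), "ENZYMES": (65.2, 6.4)},
    "DGCNN":     {"D&D": (76.6, 4.3), "NCI1": (76.4, 1.7), "PROTEINS": (72.9, 3.5), "ENZYMES": (38.9, 5.7)},
    "DiffPool":  {"D&D": (75.0, 3.5), "NCI1": (76.9, 1.9), "PROTEINS": (73.7, 3.5), "ENZYMES": (59.5, 5.6)},
    "ECC":       {"D&D": (72.6, 4.1), "NCI1": (76.2, 1.4), "PROTEINS": (72.3, 3.4), "ENZYMES": (29.5, 8.2)},
    "GIN":       {"D&D": (75.3, 2.9), "NCI1": (80.0, 1.4), "PROTEINS": (73.3, 4.0), "ENZYMES": (59.6, 4.5)},
    "GraphSAGE": {"D&D": (72.9, 2.0), "NCI1": (76.0, 1.8), "PROTEINS": (73.0, 4.5), "ENZYMES": (58.2, 6.0)},
}


def gnns_beating_baseline(table, baseline="Baseline"):
    """Map each dataset to the sorted list of models whose mean beats the baseline's."""
    result = {}
    for dataset, (base_mean, _std) in table[baseline].items():
        result[dataset] = sorted(
            model
            for model, scores in table.items()
            if model != baseline and scores[dataset][0] > base_mean
        )
    return result


winners = gnns_beating_baseline(TABLE3)
for dataset, models in winners.items():
    print(dataset, models if models else "no GNN beats the baseline")
```

On these numbers, only NCI1 has any GNN above the baseline, which matches the discussion in this section.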

6.2 THE EFFECT OF NODE DEGREE

6.2 节点度数的影响

Based on our results, using node degrees as input features is almost always beneficial to performance on social datasets, sometimes by a large amount. As an example, degree information is sufficient for our baseline to improve performance by $\approx 15\%$, hence being competitive on many datasets; in particular, the baseline achieves the best performance on IMDB-BINARY. In contrast, adding node degrees is less relevant for most GNNs, since they can automatically infer such information from the structure. One notable exception is DGCNN, which explicitly needs node degrees to perform well on all datasets. Moreover, we observe that the ranking of all models drastically changes after the addition of the degrees; this raises the question of how other structural features (such as the clustering coefficient) affect performance, which we leave to future work. One may also wonder whether the addition of the degree influences the number of layers that are necessary to solve the task. We therefore investigated the matter by computing the median number of layers across the 10 different folds. We observed a general trend across models, with GraphSAGE being the only exception, where the addition of the degree reduces the number of layers needed by $\approx 1$, as shown in Table A.3. This may be due to the fact that most architectures find it useful to compute the degree at the very first layer, as such information seems useful to the overall performance.

根据我们的研究结果,在社交数据集上使用节点度(node degree)作为输入特征几乎总能提升模型性能,有时提升幅度相当显著。例如,仅凭节点度信息就使基线模型性能提升约15%,使其在多个数据集上具备竞争力;特别是在IMDB-BINARY数据集上,基线模型取得了最佳表现。相比之下,大多数图神经网络(GNN)本身能从图结构中自动推断节点度信息,因此额外添加该特征效果有限。DGCNN是个显著例外,该模型在所有数据集上都需要显式输入节点度才能取得良好表现。值得注意的是,加入节点度特征后,所有模型的性能排名会发生剧烈变化,这引发了对其他结构特征(如聚类系数)影响性能的思考,我们将此留待未来研究。此外,我们还探究了添加节点度特征是否会改变模型所需的层数。通过计算10次交叉验证的中值层数,我们发现除GraphSAGE外,所有模型都呈现减少约1个网络层的趋势(详见表A.3)。这可能是因为多数架构倾向于在第一层就计算节点度信息,该特征对整体性能提升具有显著作用。
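
To make the degree feature concrete, the following sketch computes node degrees from an undirected edge list and turns them into per-node input vectors. It is an illustration only: the one-hot encoding, the clipping at `max_degree`, and the function names are assumptions made here, and this section does not state how the degree is actually encoded in the experiments.

```python
from collections import Counter


def node_degrees(num_nodes, edges):
    """Degree of each node in an undirected graph given as (u, v) pairs.

    Each edge is assumed to be listed once; a self-loop (u, u) counts twice.
    """
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return [counts[i] for i in range(num_nodes)]


def one_hot_degrees(degrees, max_degree):
    """One-hot encode degrees; values above `max_degree` share the last slot."""
    features = []
    for d in degrees:
        row = [0] * (max_degree + 1)
        row[min(d, max_degree)] = 1
        features.append(row)
    return features


# Path graph 0 - 1 - 2: the middle node has degree 2, the endpoints degree 1.
degrees = node_degrees(3, [(0, 1), (1, 2)])
features = one_hot_degrees(degrees, max_degree=2)
```

A structure-agnostic baseline that receives `features` sees a summary of local connectivity without performing any message passing, which is consistent with the large gains reported for the baseline on social datasets.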

Table 4: Results on social datasets with mean accuracy and standard deviation are reported. Best performances are highlighted in bold. OOR means Out of Resources, either time ($> 72$ hours for a single training) or GPU memory.

表 4: 社交数据集上的结果报告了平均准确率和标准差。最佳性能以粗体标出。OOR表示资源不足,可能是单次训练时间超过72小时或GPU内存不足。

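| Features | Model | IMDB-B | IMDB-M | REDDIT-B | REDDIT-5K | COLLAB |
|---|---|---|---|---|---|---|
| No degree | Baseline | 50.7 ± 2.4 | 36.1 ± 3.0 | 72.1 ± 7.8 | 35.1 ± 1.4 | 55.0 ± 1.9 |
| No degree | DGCNN | 53.3 ± 5.0 | 38.6 ± 2.2 | 77.1 ± 2.9 | 35.7 ± 1.8 | 57.4 ± 1.9 |
| No degree | DiffPool | 68.3 ± 6.1 | 45.1 ± 3.2 | 76.6 ± 2.4 | 34.6 ± 2.0 | 67.7 ± 1.9 |
| No degree | ECC | 67.8 ± 4.8 | 44.8 ± 3.1 | OOR | OOR | OOR |
| No degree | GIN | 66.8 ± 3.9 | 42.2 ± 4.6 | **87.0 ± 4.4** | **53.8 ± 5.9** | **75.9 ± 1.9** |
| No degree | GraphSAGE | **69.9 ± 4.6** | **47.2 ± 3.6** | 86.1 ± 2.0 | 49.9 ± 1.7 | 71.6 ± 1.5 |
| With degree | Baseline | 70.8 ± 5.0 | **49.1 ± 3.5** | 82.2 ± 3.0 | 52.2 ± 1.5 | 70.2 ± 1.5 |
| With degree | DGCNN | 69.2 ± 3.0 | 45.6 ± 3.4 | 87.8 ± 2.5 | 49.2 ± 1.2 | 71.2 ± 1.9 |
| With degree | DiffPool | 68.4 ± 3.3 | 45.6 ± 3.4 | 89.1 ± 1.6 | 53.8 ± 1.4 | 68.9 ± 2.0 |
| With degree | ECC | 67.7 ± 2.8 | 43.5 ± 3.1 | OOR | OOR | OOR |
| With degree | GIN | **71.2 ± 3.9** | 48.5 ± 3.3 | **89.9 ± 1.9** | **56.1 ± 1.7** | **75.6 ± 2.3** |
| With degree | GraphSAGE | 68.8 ± 4.5 | 47.6 ± 3.5 | 84.3 ± 1.9 | 50.0 ± 1.3 | 73.9 ± 1.7 |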


Figure 1: Results on chemical and social (with degree) benchmarks. For each benchmark, we report the validation and test accuracies of the evaluated models, together with published results when available.

图 1: 化学与社交(带度数)基准测试结果与已发表结果(如有)共同展示。对于每项测试,我们报告了评估模型的验证准确率和测试准确率,并附上已发表结果(如有)。

6.3 COMPARISON WITH PUBLISHED RESULTS

6.3 与已发表结果的对比

Figure 1 compares the average values of our test results with those reported in the literature. In addition, we plot the average of our validation results across the 10 different model selections. The plots show that our test accuracies are in most cases different from what is reported in the literature, and the gap between the two estimates is usually consistent. In contrast, our average validation accuracies are always higher than or equal to our test results; this is expected, as discussed in Section 3.2. Finally, we emphasize once again that our results are i) obtained within the framework of a rigorous model selection and assessment protocol; ii) fair with respect to data splits and input features assigned to all competitors; iii) reproducible. In contrast, we saw in Section 4 how published results rely on unclear or poorly documented experimental settings.

图 1: 将我们的测试结果平均值与文献报道值进行对比。此外,我们还绘制了10种不同模型选择下验证结果的平均值。图表显示在多数情况下,我们的测试准确率与文献报道存在差异,且两者差距通常保持一致。相比之下,我们的平均验证准确率始终高于或等于测试结果,这与第3.2节的讨论预期一致。最后需要再次强调:我们的结果 $i$ ) 是在严格的模型选择与评估框架下获得; $i i$ ) 对所有参与者的数据划分和输入特征保持公平;iii) 具备可复现性。而如第4节所示,已发表成果的实验设置往往存在表述不清或文档不全的问题。
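
The gap between validation and test accuracy noted above can be reproduced with a toy simulation. This is a sketch with made-up numbers, not the paper's data: every configuration is given the same hypothetical true accuracy (0.75) and independent Gaussian noise on its validation and test estimates, so any gap comes purely from picking the configuration with the best validation score. The function name and all parameters are assumptions; the grid size of 32 merely mirrors the smallest grid mentioned in Section 5.

```python
import random


def simulate_selection_bias(num_configs=32, num_trials=1000, noise=0.02, seed=0):
    """Average validation and test accuracy of the configuration selected on validation."""
    rng = random.Random(seed)
    true_acc = 0.75  # hypothetical accuracy shared by all configurations
    val_total = 0.0
    test_total = 0.0
    for _ in range(num_trials):
        val = [true_acc + rng.gauss(0.0, noise) for _ in range(num_configs)]
        test = [true_acc + rng.gauss(0.0, noise) for _ in range(num_configs)]
        # Model selection: keep the configuration with the best validation score.
        best = max(range(num_configs), key=val.__getitem__)
        val_total += val[best]
        test_total += test[best]
    return val_total / num_trials, test_total / num_trials


val_mean, test_mean = simulate_selection_bias()
print(f"validation: {val_mean:.3f}  test: {test_mean:.3f}")
```

The selected configuration's validation score is the maximum of many noisy estimates and is therefore biased upward, while its test score remains an unbiased estimate. This is why validation accuracy should not be reported as an estimate of generalization performance.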

7 CONCLUSIONS

7 结论

In this paper, we wanted to show how a rigorous empirical evaluation of GNNs can help design future experiments and better reason about the effectiveness of different architectural choices. To this end, we highlighted ambiguities in the experimental settings of different papers, and we proposed a clear and reproducible procedure for future comparisons. We then provided a complete re-evaluation of five GNNs on nine datasets, which required a significant amount of time and computational resources. This uniform environment helped us reason about the role of structure, as we found that structure-agnostic baselines outperform GNNs on some chemical datasets, thus suggesting that structural properties have not been exploited yet. Moreover, we objectively analyzed the effect of the degree feature on performance and model selection in social datasets, unveiling an effect on the depth of GNNs. Finally, we provide the graph learning community with reliable and reproducible results to which GNN practitioners can compare their architectures. We hope that this work, along with the library we release, will prove useful to researchers and practitioners who want to compare GNNs in a more rigorous way.

在本文中,我们旨在展示如何通过对图神经网络 (GNN) 的严谨实证评估来指导未来实验设计,并更合理地分析不同架构选择的有效性。为此,我们指出了多篇论文实验设置中的模糊之处,并提出了一套清晰且可复现的未来对比流程。随后,我们在九个数据集上对五种GNN模型进行了全面重评估,这项工作耗费了大量时间和计算资源。这种标准化环境帮助我们深入理解结构的作用——我们发现某些化学数据集上,无视结构的基线模型表现优于GNN,这表明结构特性尚未得到充分利用。此外,我们客观分析了社交数据集中度特征对性能与模型选择的影响,揭示了其对GNN深度的影响机制。最终,我们为图学习社区提供了可靠且可复现的基准结果,供GNN实践者进行架构对比。希望这项工作连同我们开源的代码库,能为需要更严谨对比GNN的研究者和开发者提供实用参考。
