A FAIR COMPARISON OF GRAPH NEURAL NETWORKS FOR GRAPH CLASSIFICATION

ABSTRACT

Experimental reproducibility and replicability are critical topics in machine learning. Authors have often raised concerns about their lack in scientific publications to improve the quality of the field. Recently, the graph representation learning field has attracted the attention of a wide research community, which resulted in a large stream of works. As such, several Graph Neural Network models have been developed to effectively tackle graph classification. However, experimental procedures often lack rigor and are hardly reproducible. Motivated by this, we provide an overview of common practices that should be avoided to fairly compare with the state of the art. To counter this troubling trend, we ran more than 47000 experiments in a controlled and uniform framework to re-evaluate five popular models across nine common benchmarks. Moreover, by comparing GNNs with structure-agnostic baselines we provide convincing evidence that, on some datasets, structural information has not been exploited yet. We believe that this work can contribute to the development of the graph learning field, by providing a much needed grounding for rigorous evaluations of graph classification models.

1 INTRODUCTION

Over the years, researchers have raised concerns about several flaws in scholarship, such as experimental reproducibility and replicability in machine learning (McDermott, 1976; Lipton & Steinhardt, 2018) and science in general (National Academies of Sciences & Medicine, 2019). These issues are not easy to address, as a collective effort is required to avoid bad practices. Examples include the ambiguity of experimental procedures, the impossibility of reproducing results and the improper comparison of machine learning models. As a result, it can be difficult to uniformly assess the effectiveness of one method against another. This work investigates these issues for the graph representation learning field, by providing a uniform and rigorous benchmarking of state-of-the-art models.

Graph Neural Networks (GNNs) (Micheli, 2009; Scarselli et al., 2008) have recently become the standard tool for machine learning on graphs. These architectures effectively combine node features and graph topology to build distributed node representations. GNNs can be used to solve node classification (Kipf & Welling, 2017) and link prediction (Zhang & Chen, 2018) tasks, or they can be applied to downstream graph classification (Bacciu et al., 2018). In literature, such models are usually evaluated on chemical and social domains (Xu et al., 2019).

Given their appeal, an ever increasing number of GNNs is being developed (Gilmer et al., 2017). However, despite the theoretical advancements reached by the latest contributions in the field, we find that the experimental settings are in many cases ambiguous or not reproducible.

Some of the most common reproducibility problems we encounter in this field concern hyperparameters selection and the correct usage of data splits for model selection versus model assessment. Moreover, the evaluation code is sometimes missing or incomplete, and experiments are not standardized across different works in terms of node and edge features.

These issues easily generate doubts and confusion among practitioners that need a fully transparent and reproducible experimental setting. As a matter of fact, the evaluation of a model goes through two different phases, namely model selection on the validation set and model assessment on the test set. Clearly, to fail in keeping these phases well separated could lead to over-optimistic and biased estimates of the true performance of a model, making it hard for other researchers to present competitive results without following the same ambiguous evaluation procedures.

With this premise, our primary contribution is to provide the graph learning community with a fair performance comparison among GNN architectures, using a standardized and reproducible experimental environment. More in detail, we performed a large number of experiments within a rigorous model selection and assessment framework, in which all models were compared using the same features and the same data splits.

Secondly, we investigate if and to what extent current GNN models can effectively exploit graph structure. To this end, we add two domain-specific and structure-agnostic baselines, whose purpose is to disentangle the contribution of structural information from node features. Much to our surprise, we found out that these baselines can even perform better than GNNs on some datasets; this calls for moderation when reporting improvements that do not clearly outperform structure-agnostic competitors.

Our last contribution is a study on the effect of node degrees as features in social datasets. Indeed, we show that providing the degree can be beneficial in terms of performance, and that it also has implications for the number of GNN layers needed to reach good results.

We publicly release code and dataset splits to reproduce our results, in order to allow other researchers to carry out rigorous evaluations with minimum additional effort.

Disclaimer. Before delving into the work, we would like to clarify that this work does not aim at pinpointing the best (or worst) performing GNN, nor does it disavow the effort researchers have put into the development of these models. Rather, it is intended to be an attempt to set up a standardized and uniform evaluation framework for GNNs, such that future contributions can be compared fairly and objectively with existing architectures.

2 RELATED WORK

Graph Neural Networks. At the core of GNNs is the idea to compute a state for each node in a graph, which is iteratively updated according to the state of neighboring nodes. Thanks to layering (Micheli, 2009) or recursive (Scarselli et al., 2008) schemes, these models propagate information and construct node representations that can be “aware” of the broader graph structure. GNNs have recently gained popularity because they can efficiently and automatically extract relevant features from a graph; in the past, the most popular way to deal with complex structures was to use kernel functions (Shervashidze et al., 2011) to compute task-agnostic features. However, such kernels are non-adaptive and typically computationally expensive, which makes GNNs even more appealing.
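
The iterative state update described above can be illustrated with a minimal, parameter-free sketch. It is not any of the architectures discussed in this paper: the function name `propagate`, the mean aggregation, and the toy path graph are assumptions made for illustration, and a real GNN would additionally apply learned transformations at each round.

```python
from typing import Dict, List

Vector = List[float]


def propagate(states: Dict[int, Vector], neighbors: Dict[int, List[int]]) -> Dict[int, Vector]:
    """One update round: each node averages its own state with its neighbors' states."""
    new_states = {}
    for node, state in states.items():
        group = [state] + [states[n] for n in neighbors[node]]
        new_states[node] = [
            sum(vec[d] for vec in group) / len(group) for d in range(len(state))
        ]
    return new_states


# Toy path graph 0 - 1 - 2 with one-hot initial node states (hypothetical data).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
states = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}

one_hop = propagate(states, neighbors)   # node 0 only "sees" node 1
two_hop = propagate(one_hop, neighbors)  # node 0 now also receives information from node 2
```

After one round, the state of node 0 carries no trace of node 2; after two rounds it does, which is the sense in which stacked layers make a node representation aware of a wider neighborhood.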

Even though in this work we specifically focus on architectures designed for graph classification, all GNNs share the notion of “convolution” over node neighborhoods, as a generalization of convolution on grids. For example, GraphSAGE (Hamilton et al., 2017) first performs sum, mean or max-pooling neighborhood aggregation, and then it updates the node representation applying a linear projection on top of the convolution. It also relies on a neighborhood sampling scheme to keep computational complexity constant. Instead, Graph Isomorphism Network (GIN) (Xu et al., 2019) builds upon the limitations of GraphSAGE, extending it with arbitrary aggregation functions on multi-sets. The model is proven to be as theoretically powerful as the Weisfeiler-Lehman test of graph isomorphism. Very recently, Wagstaff et al. (2019) gave an upper bound to the number of hidden units needed to learn permutation-invariant functions over sets and multi-sets. Differently from the above methods, Edge-Conditioned Convolution (ECC) (Simonovsky & Komodakis, 2017) learns a different parameter for each edge label. Therefore, neighbor aggregation is weighted according to specific edge parameters. Finally, Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al., 2018) proposes a convolutional layer similar to the formulation of Kipf & Welling (2017).

Some models also exploit a pooling scheme, which is applied after convolutional layers in order to reduce the size of a graph. For example, the pooling scheme of ECC coarsens graphs through a differentiable pooling map that can be pre-computed. Similarly, DiffPool (Ying et al., 2018) proposes an adaptive pooling mechanism that collapses nodes on the basis of a supervised criterion. In practice, DiffPool combines a differentiable graph encoder with its pooling strategy, so that the architecture is end-to-end trainable. Lastly, DGCNN differs from other works in that nodes are sorted and aligned by a specific algorithm called SortPool (Zhang et al., 2018).

Model evaluation. The work of Shchur et al. (2018) shares a similar purpose with our contribution. In particular, the authors compare different GNNs on node classification tasks, showing that results are highly dependent on the particular train/validation/test split of choice, up to the point where changing splits leads to dramatically different performance rankings. Thus, they recommend evaluating GNNs on multiple test splits to achieve a fair comparison. Even though we operate in a different setting (graph instead of node classification), we follow the authors’ suggestions by evaluating models under a controlled and rigorous assessment framework. Finally, the work of Dacrema et al. (2019) criticizes a large number of neural recommender systems, most of which are not reproducible, showing that only one of them truly improves against a simple baseline.

3 RISK ASSESSMENT AND MODEL SELECTION

Here, we recap the risk assessment (also called model evaluation or model assessment) and model selection procedures, to clearly lay out the experimental procedure followed in this paper. For space reasons, the overall procedure is visually summarized in Appendix A.1.

3.1 RISK ASSESSMENT

The goal of risk assessment is to provide an estimate of the performance of a class of models. When a test set is not explicitly given, a common way to proceed is to use $k$-fold Cross Validation (CV) (Stone, 1974; Varma & Simon, 2006; Cawley & Talbot, 2010). $k$-fold CV uses $k$ different training/test splits to estimate the generalization performance of a model; for each partition, an internal model selection procedure selects the hyper-parameters using the training data only. This way, test data is never used for model selection. As model selection is performed independently for each training/test split, we obtain different “best” hyper-parameter configurations; this is why we refer to the performance of a class of models.
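
The procedure above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the framework used in this paper: the 1-D threshold classifier, the `shift` hyper-parameter, the 90/10 inner holdout split, and all function names are hypothetical. What it shows is only the data flow, in which each outer test fold is used once for scoring and never for choosing the hyper-parameter.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed):
    """Shuffle indices 0..n-1 once and deal them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit(train, shift):
    """Toy 1-D classifier: threshold = midpoint of the two class means, plus shift."""
    pos = mean(x for x, y in train if y == 1)
    neg = mean(x for x, y in train if y == 0)
    return (pos + neg) / 2 + shift


def accuracy(threshold, data):
    return mean(1.0 if int(x > threshold) == y else 0.0 for x, y in data)


def select_shift(train, candidates, seed):
    """Inner model selection on a holdout split of the TRAINING data only."""
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.9 * len(shuffled))
    inner_train, valid = shuffled[:cut], shuffled[cut:]
    return max(candidates, key=lambda s: accuracy(fit(inner_train, s), valid))


def risk_assessment(data, k, candidates, seed=0):
    """Outer k-fold CV: one test score per fold, each with its own selected shift."""
    scores = []
    for i, test_idx in enumerate(k_fold_indices(len(data), k, seed)):
        held_out = set(test_idx)
        train = [d for j, d in enumerate(data) if j not in held_out]
        test = [data[j] for j in test_idx]
        best = select_shift(train, candidates, seed + i + 1)
        scores.append(accuracy(fit(train, best), test))  # retrain, then assess once
    return scores


# Hypothetical toy data: label is 1 when x >= 0.5.
data = [(i / 100, int(i >= 50)) for i in range(100)]
scores = risk_assessment(data, k=10, candidates=[0.0, -0.2, 0.2])
```

Because `select_shift` may return a different value in each outer fold, the list `scores` describes the whole procedure (model class plus selection), not a single fixed configuration.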

3.2 MODEL SELECTION

The goal of model selection, or hyper-parameter tuning, is to choose among a set of candidate hyper-parameter configurations the one that works best on a specific validation set. If a validation set is not given, one can rely on a holdout training/validation split or an inner $k$-fold. Nevertheless, the key point to remember is that validation performances are biased estimates of the true generalization capabilities. Consequently, model selection results are generally over-optimistic; this issue is thoroughly documented in Cawley & Talbot (2010). This is why the main contribution of this work is to clearly separate model selection and model assessment estimates, something that is lacking or ambiguous in the literature under consideration.
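
The optimism of validation scores can be seen in a small simulation. The sketch below is an assumption-laden toy, not an experiment from this paper: every "configuration" is a classifier that guesses labels at random, so its true accuracy is 0.5 by construction, and the sample sizes and seed are arbitrary.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


rng = random.Random(0)
n_valid, n_test, n_configs = 50, 50, 100
valid_labels = [rng.randint(0, 1) for _ in range(n_valid)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

# Each hypothetical configuration guesses at random on both sets.
results = []
for _ in range(n_configs):
    valid_preds = [rng.randint(0, 1) for _ in range(n_valid)]
    test_preds = [rng.randint(0, 1) for _ in range(n_test)]
    results.append((accuracy(valid_preds, valid_labels),
                    accuracy(test_preds, test_labels)))

# Model selection: keep the configuration with the best validation accuracy.
best_valid, best_test = max(results, key=lambda r: r[0])
mean_valid = sum(v for v, _ in results) / n_configs
print(f"selected config: validation={best_valid:.2f}, test={best_test:.2f}")
```

The selected validation score is a maximum over many noisy estimates, so it can never fall below the average validation score, even though no configuration is actually better than chance. Only the test score of the selected configuration is an unbiased estimate.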

4 OVERVIEW OF REPRODUCIBILITY ISSUES

To motivate our contribution, we follow the approach of Dacrema et al. (2019) and briefly review recent papers describing five different GNN models, highlighting problems in the experimental setups as well as reproducibility of results. We emphasize that our observations are based solely on the contents of each paper and the available code. Suitable GNN works were selected according to the following criteria: i) performances obtained with 10-fold CV; ii) peer reviewed; iii) strong architectural differences; iv) popularity. In particular, we selected DGCNN (Zhang et al., 2018), DiffPool (Ying et al., 2018), ECC (Simonovsky & Komodakis, 2017), GIN (Xu et al., 2019) and GraphSAGE (Hamilton et al., 2017). For a detailed description of each model we refer to their respective papers. Our criteria to assess quality of evaluation and reproducibility are: i) code for data preprocessing, model selection and assessment is provided; ii) data splits are provided; iii) data is split by means of a stratification technique, to preserve class proportions across all partitions; iv) results of the 10-fold CV are reported correctly using standard deviations, and they refer to model evaluation (test sets) rather than model selection (validation sets). Table 1 summarizes our findings.
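
Criterion iv) amounts to summarizing the per-fold test scores with both a mean and a standard deviation. A minimal sketch follows; the ten accuracy values are made up for illustration, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the text does not specify which is meant.

```python
from statistics import mean, stdev


def summarize(test_scores):
    """Return (mean, sample standard deviation) of per-fold TEST scores."""
    return mean(test_scores), stdev(test_scores)


# Hypothetical test accuracies from the 10 outer folds of a 10-fold CV.
fold_test_accuracy = [0.70, 0.75, 0.72, 0.68, 0.74, 0.71, 0.73, 0.69, 0.76, 0.72]

avg, std = summarize(fold_test_accuracy)
print(f"{100 * avg:.1f} +/- {100 * std:.1f}")
```

Reporting the spread alongside the mean makes it possible to judge whether a difference between two models is larger than the fold-to-fold variation.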

Table 1: Criteria for reproducibility considered in this work and their compliance among considered models. (Y) indicates that the criterion is met, (N) indicates that the criterion is not satisfied, (A) indicates ambiguity (i.e. it is unclear whether the criterion is met or not), (-) indicates lack of information (i.e. no details are provided about the criterion). Note that GraphSAGE is excluded from this comparison, as it was not directly applied by authors to graph classification tasks.

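Criterion	DGCNN	DiffPool	ECC	GIN
Data preprocessing code	Y	Y	-	Y
Model selection code	N	N	-	N
Model evaluation code	Y	Y	-	Y
Data splits provided	Y	N	N	Y
Label stratification	Y	N	-	Y
Report accuracy on test	Y	A	A	N
Report standard deviations	Y	N	N	Y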

DGCNN. The authors evaluate the model on 10-fold CV. While the architecture is fixed for all datasets, learning rate and epochs are tuned using only one random CV fold, and then reused on all the other folds. While this practice is still acceptable, it may lead to sub-optimal performances. In any case, the code to reproduce model selection is not available. Moreover, the authors run CV 10 times, and they report the average of the 10 final scores. As a result, the variance of the provided estimates is reduced. However, the same procedure was not applied to the competitors. Finally, CV data splits are correctly stratified and publicly available, making it possible to reproduce at least the evaluation experiments.

DiffPool. From both the paper and the provided code, it is unclear if reported results are obtained on a test set rather than a validation set. Although the authors state that 10-fold CV is used, standard deviations of DiffPool and its competitors are not reported. Moreover, the authors state that they applied early stopping on the validation set to prevent overfitting; unfortunately, neither model selection code nor validation splits are available. Furthermore, according to the code, data is randomly split (without stratification) and no random seed is set, hence splits are different each time the code is executed.
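
Both problems noted in the last sentence, missing stratification and a missing seed, are cheap to avoid. The following is a minimal sketch of a seeded, stratified split in plain Python; it is not the code of any reviewed work, and the function name `stratified_k_fold`, the round-robin assignment, and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed):
    """Return k disjoint test folds (index lists) that preserve class proportions.

    A fixed seed makes the folds identical on every run.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin across the folds.
        for pos, i in enumerate(members):
            folds[(offset + pos) % k].append(i)
        offset += len(members)
    return folds


# Hypothetical imbalanced dataset: 60 graphs of class 0 and 40 of class 1.
labels = [0] * 60 + [1] * 40
folds_a = stratified_k_fold(labels, k=10, seed=42)
folds_b = stratified_k_fold(labels, k=10, seed=42)
```

With these toy labels every fold receives 6 graphs of class 0 and 4 of class 1, and calling the function twice with the same seed yields the same folds, so the split can be published or regenerated exactly.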

ECC. The paper reports that ECC is evaluated on 10-fold CV, but results do not include standard deviations. Similarly to DGCNN, hyper-parameters are fixed in advance, hence it is not clear if and how model selection has been performed. Importantly, there are no references in the code repository to data pre-processing, data stratification, data splitting, and model selection.

GIN. The authors correctly list all the hyper-parameters tuned. However, as stated explicitly in the paper and in the public review discussion, they report the validation accuracy of 10-fold CV. In other words, reported results refer to model selection and not to model evaluation. The code for model selection is not provided.

GraphSAGE. The original paper does not test this model on graph classification datasets, but GraphSAGE is often used in other papers as a strong baseline. It follows that GraphSAGE results on graph classification should be accompanied by the code to reproduce the experiments. Despite that, the two works which report results of GraphSAGE (DiffPool and GIN) fail to do so.

Summary. Our analysis reveals that GNN works rarely comply with good machine learning practices as regards the quality of evaluation and reproducibility of results. This motivates the need to re-evaluate all models within a rigorous, reproducible and fair environment.

5 EXPERIMENTS

In this section we detail our main experiment, in which we re-evaluate the above-mentioned models on 9 datasets (4 chemical, 5 social), using a model selection and assessment framework that closely follows the rigorous practices described in Section 3. In addition, we implement two baselines whose purpose is to understand the extent to which GNNs are able to exploit structural information. All models have been implemented by means of the PyTorch Geometric library (Fey & Lenssen, 2019), which provides graph pre-processing routines and makes the definition of graph convolution easier to implement. We sometimes found discrepancies between papers and related code; in such cases, we complied with the specifications in the paper. Because GraphSAGE was not applied to graph classification in the original work, we opted for a max-pooling global aggregation function to classify graph instances; further, we do not use the sampled neighborhood aggregation scheme defined in Hamilton et al. (2017), in order to allow nodes to have access to their whole neighborhood.
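
The max-pooling global aggregation mentioned for GraphSAGE can be pictured as an element-wise maximum over the node embeddings of a graph. The sketch below is a plain-Python illustration with made-up embeddings; it does not use the PyTorch Geometric API, and the function name `global_max_pool` is chosen here only for readability.

```python
def global_max_pool(node_embeddings):
    """Element-wise maximum over all node embeddings of a single graph."""
    return [max(column) for column in zip(*node_embeddings)]


# Hypothetical 2-D node embeddings produced by the convolutional layers.
graph_a = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
graph_b = [[0.4, 0.2], [0.3, 0.5], [0.1, 0.9]]  # same nodes, listed in a different order

pooled_a = global_max_pool(graph_a)
pooled_b = global_max_pool(graph_b)
```

The result has a fixed size regardless of the number of nodes and does not depend on node ordering, which is what allows a standard classifier to be applied on top of it.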

Datasets. All graph datasets are publicly available (Kersting et al., 2016) and represent a relevant subset of those most frequently used in literature to compare GNNs. Some collect molecular graphs, while others contain social graphs. In particular, we used D&D (Dobson & Doig, 2003), PROTEINS (Borgwardt et al., 2005), NCI1 (Wale et al., 2008) and ENZYMES (Schomburg et al., 2004) for binary and multi-class classification of chemical compounds, whereas IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-5K and COLLAB (Yanardag & Vishwanathan, 2015) are social datasets. Dataset statistics are reported in Table A.2.

Baselines. We adopt two distinct baselines, one for chemical and one for social datasets. On all chemical datasets except ENZYMES, we follow Ralaivola et al. (2005); Luzhnica et al. (2019) and implement the Molecular Fingerprint technique, which first applies global sum pooling (i.e., counts the occurrences of atom types in the graph by summing the features of all nodes in the graph together) and then applies a single-layer MLP with ReLU activations. On social domains and ENZYMES (due to the presence of additional features), we take inspiration from the work of Zaheer et al. (2017) to learn permutation-invariant functions over sets of nodes: first, we apply a single-layer MLP on top of node features, followed by global sum pooling and another single-layer MLP for classification. Note that neither baseline leverages graph topology. Using these baselines as a reference is of fundamental importance for future works, as they can provide feedback on the effectiveness of GNNs on a specific dataset. As a matter of fact, if GNN performances are close to the ones of a structure-agnostic baseline, one can draw two possible conclusions: the task does not need topological information to be effectively solved, or the GNN is not exploiting graph structure adequately. While the former can be verified through domain-specific human expertise, the second is more difficult to assess, as multiple factors come into play such as the amount of training data, the struct

Original paper: https://arxiv.org/pdf/1912.09893v3
