[Paper Translation] Focus on what matters: Applying Discourse Coherence Theory to Cross Document Coreference


Original paper: https://arxiv.org/pdf/2110.05362v1


Focus on what matters: Applying Discourse Coherence Theory to Cross Document Coreference

Abstract

Performing event and entity coreference resolution across documents vastly increases the number of candidate mentions, making it intractable to do the full $n^{2}$ pairwise comparisons. Existing approaches simplify by considering coreference only within document clusters, but this fails to handle inter-cluster coreference, common in many applications. As a result, cross-document coreference algorithms are rarely applied to downstream tasks. We draw on an insight from discourse coherence theory: potential coreferences are constrained by the reader’s discourse focus. We model the entities/events in a reader’s focus as a neighborhood within a learned latent embedding space which minimizes the distance between mentions and the centroids of their gold coreference clusters. We then use these neighborhoods to sample only hard negatives to train a fine-grained classifier on mention pairs and their local discourse features. Our approach achieves state-of-the-art results for both events and entities on the ECB+, Gun Violence, Football Coreference, and Cross-Domain Cross-Document Coreference corpora. Furthermore, training on multiple corpora improves average performance across all datasets by 17.2 F1 points, leading to a robust coreference resolution model for use in downstream tasks where link distribution is unknown.

1 Introduction

Cross-document coreference resolution of entities and events (CDCR) is an increasingly important problem, as downstream tasks that benefit from coreference annotations (such as question answering, information extraction, and summarization) begin interpreting multiple documents simultaneously. Yet the number of candidate mentions across documents makes evaluating the full $n^{2}$ pairwise comparisons intractable (Cremisini and Finlayson, 2020). For single-document coreference, the search space is pruned with simple recency-based heuristics, but there is no natural corollary to recency with multiple documents.

Most CDCR systems thus instead cluster the documents and perform the full $n^{2}$ comparisons only within each cluster, disregarding inter-cluster coreference (Lee et al., 2012; Yang et al., 2015; Choubey and Huang, 2017; Barhom et al., 2019; Cattan et al., 2020; Yu et al., 2020; Caciularu et al., 2021). This was effective for the ECB+ dataset, on which most CDCR methods have been evaluated, because ECB+ has lexically distinct topics with almost no inter-cluster coreference.

Such document clustering, however, keeps CDCR systems from being generally applicable. Bugert et al. (2020b) show that inter-cluster coreference makes up the majority of coreference in many applications. Cremisini and Finlayson (2020) note that document clustering methods are also unlikely to generalize well to real data where documents lack the significant lexical differences of ECB+ topics. These issues present a major barrier to the general applicability of CDCR.

Human readers, by contrast, are able to perform coreference resolution with minimal pairwise comparisons. How do they do it? Discourse coherence theory (Grosz, 1977, 1978; Grosz and Sidner, 1986) proposes a simple mechanism: a reader focuses on only a small set of entities/events from their full knowledge. This set, the attentional state, is constructed as entities/events are brought into focus either explicitly by reference or implicitly by their similarity to what has been referenced. Since attentional state is inherently dynamic (entities/events come into and out of focus as discourse progresses), a document-level approach is a poor model of this mechanism.

We propose modeling focus at the mention level using the two-stage approach illustrated in Figure 1. We model attentional state as the set of K nearest neighbors within a latent embedding space for mentions. This space is learned with a distance-based classification loss to construct embeddings that minimize the distance between mentions and the centroid of all mentions which share their reference class.

Figure 1: A high-level overview of our system. For a particular mention, candidate coreferring mentions are retrieved from a neighborhood surrounding the mention. These candidate pairs are fed to a pairwise classifier specialized for hard negatives fetched from this space. This allows our method to create a high-fidelity coreference graph with minimal pairwise comparison and no a priori assumptions about coreference. We use a bi-encoder for candidate retrieval and a cross-encoder for pairwise classification (Humeau et al., 2020).

These attentional state neighborhoods aggressively constrain the search space for our second-stage pairwise classifier. This classifier utilizes cross-attention between mention pairs and their local discourse features to capture the features important within an attentional state which are comparison-specific (Grosz, 1978). By sampling from attentional state neighborhoods at training time, we train on only hard negatives such as those shown in Table 1. We analyze the contribution of the local discourse features to our approach, providing an explanation for the empirical effectiveness of our classifier and that of earlier work like Caciularu et al. (2021).

Following the recommendations of Bugert et al. (2020a), we evaluate our method on multiple event and entity CDCR corpora, as well as on cross-corpus transfer for event CDCR. Our method achieves state-of-the-art results on the ECB+ corpus for both events (+0.2 F1) and entities (+0.7 F1), the Gun Violence Corpus (+11.3 F1), the Football Coreference Corpus (+13.3 F1), and the Cross-Domain Cross-Document Coreference Corpus (+34.5 F1). We further improve average results by training across all event CDCR corpora, leading to a 17.2 F1 improvement for average performance across all tasks. Our robust model makes it feasible to apply CDCR to a wide variety of downstream tasks, without requiring expensive new coreference annotations to enable fine-tuning on each new corpus. (This has been a huge effort for the few tasks that have attempted it, like multi-hop QA (Dhingra et al., 2018; Chen et al., 2019) and multi-document summarization (Falke et al., 2017).)

2 Related Work

Cross-Document Coreference. Many CDCR algorithms use hand-engineered event features to perform classification. Such systems have a low pairwise classification cost and therefore ignore the quadratic scaling and perform no pruning (Bejan and Harabagiu, 2010; Yang et al., 2015; Vossen and Cybulska, 2016; Bugert et al., 2020a). Other such systems choose to include document clustering to increase precision, which can be done with very little tradeoff for the ECB+ corpus (Lee et al., 2012; Cremisini and Finlayson, 2020).

Kenyon-Dean et al. (2018) explore an approach that avoids pairwise classification entirely, instead relying purely on representation learning and clustering within an embedding space. They propose a novel distance-based regularization term for their classifier that encourages representations that can be used for clustering. This approach is more scalable than pairwise classification approaches, but its performance lags behind the state of the art as it cannot use pairwise information.

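Table 1: Examples of positives and hard negatives within an attentional state neighborhood

| Mention type | Mention | Relation |
| --- | --- | --- |
| Event | A preliminary magnitude 2.0 earthquake occurred in the Geysers area | Root |
| Event | The quake occurred at about 7:30 a.m. | Coreferent |
| Event | The tremor occurred at 9:27 a.m. | Different |
| Entity | ...will make AMD one of the world's largest suppliers of graphics chips | Root |
| Entity | ...the company announced a 334 million dollar deal | Coreferent |
| Entity | Intel, the world's largest graphics chip maker, declined to comment on the deal | Different |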

Table 2: Evaluation results using $B^{3}$. For our approaches, (+)/(−) indicates usage of discourse or only a single sentence, respectively. Methods marked with * perform all pairwise comparisons without pruning.

| Method | ECB+ Events R | ECB+ Events P | ECB+ Events F1 | GVC R | GVC P | GVC F1 | FCC R | FCC P | FCC F1 | ECB+ Entities R | ECB+ Entities P | ECB+ Entities F1 | CD2CR R | CD2CR P | CD2CR F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Barhom et al. (2019) | 81.8 | 77.5 | 79.6 | 81.0 | 66.0 | 72.7 | 17.9 | 88.3 | 29.8 | 66.8 | 75.5 | 70.9 | - | - | - |
| Barhom et al. (2019)\* | - | - | - | - | - | - | 36.0 | 83.0 | 50.2 | - | - | - | - | - | - |
| Bugert et al. (2020a)\* | 71.8 | 81.2 | 76.2 | 49.9 | 73.6 | 59.5 | 38.3 | 70.8 | 49.7 | - | - | - | - | - | - |
| Cattan et al. (2020) | 82.1 | 82.7 | 82.4 | - | - | - | - | - | - | 70.7 | 74.8 | 72.7 | 57.0 | 35.0 | 44.0 |
| Yu et al. (2020) | 86.1 | 84.7 | 85.4 | - | - | - | - | - | - | - | - | - | - | - | - |
| Caciularu et al. (2021) | 84.9 | 87.9 | 86.4 | - | - | - | - | - | - | 82.5 | 81.7 | 82.1 | - | - | - |
| Our Approach− | 84.9 | 82.4 | 83.6 | 67.2 | 81.1 | 73.5 | 47.9 | 68.7 | 56.5 | 84.8 | 76.2 | 80.3 | 67.7 | 72.8 | 70.2 |
| Our Approach+ | 85.6 | 87.7 | 86.6 | 82.2 | 83.8 | 83.0 | 61.6 | 65.4 | 63.5 | 85.1 | 80.6 | 82.8 | 77.4 | 79.7 | 78.5 |

Most recent systems use neural models for pairwise classification (Barhom et al., 2019; Cattan et al., 2020; Meged et al., 2020; Zeng et al., 2020; Yu et al., 2020; Caciularu et al., 2021). These algorithms each use document clustering, a pairwise neural classifier to construct distance matrices within each topic, and agglomerative clustering to compute the final clusters. Innovation has focused on the pairwise classification stage, with variants of document clustering as the only pruning option. Caciularu et al. (2021) set the previous state of the art for both events and entities in ECB+ using a cross-document language model with a large context window to cross-encode and classify a pair of mentions with the full context of their documents.

Other Tasks. Lee et al. (2018) introduce the concept of a “coarse-to-fine” approach in single-document entity coreference resolution. The architecture utilises a bilinear scoring function to generate a set of likely antecedents, which is then passed through a more expensive classifier which performs higher-order inference on antecedent chains. Our work extends to multiple documents the idea of using a high-recall but low-precision pruning function combined with expensive pairwise classification to balance recall, precision, and runtime efficiency.

Wu et al. (2020) use a similar architecture to ours to create a highly scalable system for zero-shot entity linking. Their method treats entity linking as a ranking problem, using a bi-encoder to retrieve possible entity mentions and then re-ranking the candidate mentions using a cross-encoder. Their results confirm that such architectures can deliver state-of-the-art performance while achieving tremendous scale. However, in coreference resolution, mentions can have one, many, or no coreferring mentions, which makes treating it as a ranking problem non-trivial and necessitates the novel training and inference processes we propose.

3 Model

Our system is trained in multiple stages and evaluated as a single pipeline. First, we train the encoder for the pruning model to define our latent embedding space. Then, we use this model to sample training data for a pairwise classifier which performs binary classification for coreference. Our complete pipeline retrieves candidate pairs from the attentional state, classifies them using the pairwise classifier, and performs a variant of the agglomerative clustering algorithm proposed by Barhom et al. (2019) to form the final clusters, as laid out in Figure 2.

3.1 Candidate Retrieval

Encoding Setup. We feed the sentences from a window surrounding the mention sentence to a fine-tuned BERT architecture initialized from RoBERTa-large pre-trained weights (Devlin et al., 2019; Liu et al., 2019). A mention is represented as the concatenation of the token-level representations at the boundaries of the mention, following the span boundary representations used by Lee et al. (2017).

Figure 2: Clustering algorithm used at inference time

Optimization. Similar to Kenyon-Dean et al. (2018), the network is trained to perform a multi-class classification problem where the classes are labels assigned to the gold coreference clusters, which are the connected components of the coreference graph. Rather than adding distance-based regularization, we instead optimize the distance metric directly by using the inner product as our scoring function.

Before each epoch, we construct the representation $y_{m_{i}}$ of each mention with the encoder from the previous epoch. Each gold coreference cluster $c_{i}$ is represented by $y_{c_{i}}$, the centroid of its component mentions:

$$
y_{c_{i}}={\frac{1}{\mid c_{i}\mid}}\sum_{y_{m_{i}}\in c_{i}}y_{m_{i}}
$$

The score $s_{o}$ of a mention $m_{i}$ for a cluster $c_{i}$ is simply the inner product between this cluster representation and the mention representation:

$$
s_{o}(m_{i},c_{i})=y_{m_{i}}\cdot y_{c_{i}}
$$
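The centroid and score definitions above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the authors' PyTorch code; the function names `cluster_centroids` and `score` and the toy 2-dimensional embeddings are assumptions made for the example.

```python
from collections import defaultdict

def cluster_centroids(embeddings, cluster_ids):
    """Average the mention vectors of each gold cluster (the y_c equation)."""
    groups = defaultdict(list)
    for vec, cid in zip(embeddings, cluster_ids):
        groups[cid].append(vec)
    return {
        cid: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for cid, vecs in groups.items()
    }

def score(mention_vec, centroid_vec):
    """Inner product s_o between a mention and a cluster centroid."""
    return sum(m * c for m, c in zip(mention_vec, centroid_vec))

# Toy data (hypothetical): two mentions in cluster "quake", one singleton "deal".
embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
cluster_ids = ["quake", "quake", "deal"]
centroids = cluster_centroids(embeddings, cluster_ids)
```

Note that the singleton cluster's centroid is just its own mention vector, which is the case the text later singles out when discussing the repulsive term of the loss.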

Using this scoring function, the model is trained to predict the correct cluster for a mention with respect to sampled negative clusters. We combine random in-batch negative clusters with hard negatives from the top 10 predicted gold clusters for each training sample in the batch, following Gillick et al. (2019). For each mention $m_{i}$ with true cluster $c^{\prime}$ and negative clusters $B$, the loss is computed using categorical cross-entropy loss on the softmax of our score vector, which we express as:

$$
L(m_{i},c^{\prime})=-s_{o}(m_{i},c^{\prime})+\log\sum_{c_{i}\in B}\exp(s_{o}(m_{i},c_{i}))
$$

This loss function can be interpreted intuitively as rewarding embeddings which form separable dense mention clusters according to their gold coreference labels. The left term in our loss function acts as an attractive component towards the centroid of the gold cluster, while the right term acts as a repulsive component away from the centroids of incorrect clusters. The repulsive component is especially important for singleton clusters, whose centroids are by definition identical to their mention representations.
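To make the attractive and repulsive terms concrete, the loss can be evaluated on toy vectors. The sketch below is illustrative only: `retrieval_loss` is a hypothetical name, and it assumes the candidate set $B$ passed in also contains the true centroid, so that the value coincides with softmax cross-entropy as the text describes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieval_loss(mention_vec, true_centroid, candidate_centroids):
    """L = -s(m, c') + log sum_{c in B} exp(s(m, c)).

    Assumption: candidate_centroids (the set B) includes the true centroid.
    """
    scores = [dot(mention_vec, c) for c in candidate_centroids]
    peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return -dot(mention_vec, true_centroid) + log_sum_exp

mention = [1.0, 0.0]
true_c = [2.0, 0.0]
far_c = [-2.0, 0.0]   # an easy negative, far from the mention
near_c = [1.5, 0.0]   # a hard negative, close to the mention
loss_easy = retrieval_loss(mention, true_c, [true_c, far_c])
loss_hard = retrieval_loss(mention, true_c, [true_c, near_c])
```

In this toy setting the hard negative yields the larger loss, which is consistent with the reading of the right-hand term as a push away from nearby incorrect centroids.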

Inference. Unlike previous work using the bi-encoder architecture, our inference task is distinct from our training task. Since our training task requires oracle knowledge of the gold coreference labels, it cannot be performed at inference time. However, since the embedding model is optimized to place all mentions near their centroids, it implicitly places all mentions of the same class close to one another even when that class is unknown. Therefore, the set of K nearest mentions within this space is made up of coreferences and references to highly related entities/events such as those shown in Table 1, which models an attentional state made up of entities/events explicitly and implicitly in focus (Grosz and Sidner, 1986).

Compared to document clustering, this approach can prune aggressively without disregarding any links. The encoding step scales linearly and old embeddings do not need to be recomputed if new documents are added. Importantly, no pairs are disregarded a priori when we compute the nearest neighbor graph and this efficient computation can scale to millions of points using GPU-enabled nearest neighbor libraries like FAISS (Johnson et al., 2017), which we use for our implementation.
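The retrieval step can be illustrated with a brute-force nearest-neighbor search. This is only a sketch: the paper uses FAISS, whereas the loop below is a quadratic stand-in written in plain Python, and ranking neighbors by inner product is an assumption carried over from the training score. The names `k_nearest_mentions` and `candidate_pairs` are hypothetical.

```python
def k_nearest_mentions(embeddings, k):
    """For each mention index, return the k other mentions with the highest inner product."""
    neighbors = {}
    for i, query in enumerate(embeddings):
        scored = [
            (sum(a * b for a, b in zip(query, other)), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        # Highest similarity first; ties broken by mention index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors[i] = [j for _, j in scored[:k]]
    return neighbors

def candidate_pairs(neighbors):
    """Collapse (i, j) and (j, i) into a sorted list of unordered candidate pairs."""
    return sorted({tuple(sorted((i, j))) for i, js in neighbors.items() for j in js})

# Toy embeddings (hypothetical): mentions 0/1 and 2/3 point in similar directions.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbors = k_nearest_mentions(embeddings, k=1)
pairs = candidate_pairs(neighbors)
```

With $n$ mentions this yields at most $nK$ candidate pairs for the pairwise classifier instead of the $n(n-1)/2$ pairs of an exhaustive comparison.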

3.2 Pairwise Classifier

Classification Setup. For pairwise classification, we use a transformer with cross-attention between pairs. This follows prior work demonstrating that such encoders pick up distinctions between classes which previously required custom logic (Yu et al., 2020). Our use of cross-attention is also motivated by discourse coherence theory. Grosz (1978) highlights that, within an attentional state, the importance to coreference of a mention’s features depends heavily on the features of the mention it is being compared to.

The cross-encoder is a fine-tuned BERT architecture starting with RoBERTa-large pre-trained weights. For a mention pair $(e_{i},e_{j})$, we build a pairwise representation by feeding the following sequence to our encoder, where $S_{i}$ is the sentence in which the mention occurs and $w$ is the maximum number of sentences away from the mention sentence we include as context:

$$
\langle s\rangle S_{i-w}...S_{i}...S_{i+w}\langle/s\rangle\langle s\rangle S_{j-w}...S_{j}...S_{j+w}\langle/s\rangle
$$
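The input sequence above can be assembled at the string level as follows. This is a sketch under stated assumptions: the helper names are hypothetical, windows are clipped at document boundaries (the text does not say how edges are handled), and a real tokenizer would normally insert the `<s>` and `</s>` special tokens itself.

```python
def context_window(sentences, index, w):
    """Sentences within w positions of the mention sentence, clipped to the document."""
    start = max(0, index - w)
    end = min(len(sentences), index + w + 1)
    return sentences[start:end]

def build_pair_input(doc_i, i, doc_j, j, w):
    """Join both context windows using the <s> ... </s> delimiters from the formula."""
    left = " ".join(context_window(doc_i, i, w))
    right = " ".join(context_window(doc_j, j, w))
    return f"<s> {left} </s> <s> {right} </s>"

# Hypothetical documents given as lists of sentences.
doc_a = ["A0.", "A1.", "A2.", "A3."]
doc_b = ["B0.", "B1.", "B2."]
text = build_pair_input(doc_a, 1, doc_b, 2, w=1)
```

Here the second mention sits in the last sentence of its document, so its window holds only two sentences.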

Each mention is represented as $v_{e_{i}}$ which is the concatenation of the representations of its boundary tokens, with the pair of mentions represented as the concatenation of each mention representation and the element-wise multiplication of the two mentions:

$$
v_{(e_{i},e_{j})}=[v_{e_{i}},v_{e_{j}},v_{e_{i}}\odot v_{e_{j}}]
$$

This vector is fed into a multi-layer perceptron, and we apply the softmax function to get the probability that $e_{i}$ and $e_{j}$ are coreferring.
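The pair representation and the final softmax can be sketched as below. The single linear layer is a deliberately simplified stand-in for the multi-layer perceptron, and the weights are hypothetical untrained values chosen only so the example runs; none of these names come from the paper.

```python
import math

def pair_representation(v_i, v_j):
    """[v_i, v_j, v_i * v_j], where * is element-wise multiplication."""
    return v_i + v_j + [a * b for a, b in zip(v_i, v_j)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def coreference_probability(pair_vec, weights, bias):
    """One linear layer standing in for the MLP; class 1 means coreferring."""
    logits = [sum(w * x for w, x in zip(row, pair_vec)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)[1]

v_i, v_j = [1.0, 2.0], [3.0, -1.0]
pair_vec = pair_representation(v_i, v_j)
weights = [[0.0] * 6, [0.1] * 6]  # hypothetical, untrained parameters
bias = [0.0, 0.0]
prob = coreference_probability(pair_vec, weights, bias)
```

The element-wise product gives the classifier a direct signal about which dimensions of the two mention vectors agree, in addition to the raw vectors themselves.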

Training Pair Generation. We use K nearest neighbors in the bi-encoder embedding space to generate training data for the pairwise classifier. This gives the training data a distribution of positives and negatives similar to the one the classifier will likely see at inference time, and it also serves to sample only positive and hard negative pairs.

These negatives are those that the bi-encoder was unable to separate clearly in isolation, which makes them prime candidates for more expensive cross-comparison. At training time, the selection of hyperparameter K is used to balance the volume of training data with the difficulty of negative pairs.
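The pair-generation step can be sketched as labeling each retrieved neighbor pair against the gold clusters. The function name `training_pairs`, the K=2 neighborhoods, and the cluster labels below are all hypothetical illustration data.

```python
def training_pairs(neighbors, gold_cluster):
    """Label each retrieved pair: 1 if both mentions share a gold cluster, else 0 (hard negative)."""
    seen = set()
    examples = []
    for i, js in neighbors.items():
        for j in js:
            pair = (min(i, j), max(i, j))
            if pair in seen:
                continue  # (i, j) and (j, i) are the same training example
            seen.add(pair)
            label = int(gold_cluster[i] == gold_cluster[j])
            examples.append((pair, label))
    return examples

# Hypothetical K=2 neighborhoods for four mentions in two gold clusters.
neighbors = {0: [1, 2], 1: [0, 2], 2: [3, 1], 3: [2, 0]}
gold_cluster = {0: "quake_a", 1: "quake_a", 2: "quake_b", 3: "quake_b"}
examples = training_pairs(neighbors, gold_cluster)
```

Raising K adds more pairs, but the extra pairs come from farther away in the embedding space and are therefore easier negatives on average, which is the trade-off the text describes.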

Optimization. Once the training data has been generated, we simply train the classifier in a binary setup to classify a pair as either coreferring or non-coreferring. As with prior work, we optimize our pairwise classifier using binary cross-entropy loss.

3.3 Clustering

At inference time, we use a modified form of the agglomerative clustering algorithm designed by Barhom et al. (2019) to compute clusters, as described in Figure 2. We do not perform mention detection, so our method relies on gold mentions or a separate mention detection step. First, it generates pairs of mentions using K nearest neighbor retrieval within our embedding space. Each of these pairs is run through the trained cross-encoder, and all pairs with a probability of less than 0.5 are removed. Pairs are then sorted by their classification probability and clusters are merged greedily.

Following Barhom et al. (2019), we compute the score between two clusters as the average score between all mention pairs in each cluster. However, since we only compare two clusters that share a local edge, we do this without computing the full pairwise distance matrix.
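One possible reading of this greedy procedure is sketched below; it is not the algorithm from Figure 2, whose details are not reproduced in this section. Two choices are assumptions made for the example: the cluster-to-cluster score averages only the cross-cluster pairs that were actually scored, and scores below the threshold are kept for that average even though they never start a merge. The name `greedy_cluster` is hypothetical.

```python
def greedy_cluster(num_mentions, pair_probs, threshold=0.5):
    """Greedily merge clusters along classifier edges, strongest edge first."""
    cluster_of = list(range(num_mentions))           # mention -> cluster id
    members = {c: {c} for c in range(num_mentions)}  # cluster id -> mentions

    def cluster_score(a, b):
        # Assumption: average only over cross-cluster pairs that were scored.
        scores = [p for (i, j), p in pair_probs.items()
                  if {cluster_of[i], cluster_of[j]} == {a, b}]
        return sum(scores) / len(scores)

    edges = sorted(((p, pair) for pair, p in pair_probs.items() if p >= threshold),
                   reverse=True)
    for _, (i, j) in edges:
        a, b = cluster_of[i], cluster_of[j]
        if a != b and cluster_score(a, b) >= threshold:
            for m in members[b]:
                cluster_of[m] = a
            members[a] |= members.pop(b)
    return sorted(sorted(ms) for ms in members.values())

# Hypothetical classifier probabilities for retrieved pairs of five mentions.
pair_probs = {(0, 1): 0.9, (1, 2): 0.6, (0, 2): 0.1, (2, 3): 0.8, (3, 4): 0.2}
clusters = greedy_cluster(5, pair_probs)
```

In the toy run, the 0.6 edge between mentions 1 and 2 does not merge their clusters because the average cross-cluster score, pulled down by the 0.1 pair, falls below the threshold. Only pairs returned by retrieval are ever consulted, so no full pairwise matrix is built.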

4 Experiments

We perform an empirical study across 3 event and 2 entity English cross-document coreference corpora.

4.1 Datasets

Here we briefly cover the properties of each corpus we evaluate on. For a more thorough breakdown of corpus properties for event CDCR, see Bugert et al. (2020a).

Event Coreference Bank Plus (ECB+). Historically, the ECB+ corpus has been the primary dataset used for evaluating CDCR. This corpus is based on the original Event Coreference Bank corpus from Bejan and Harabagiu (2010), with entity annotations added in Lee et al. (2012) to allow joint modeling and additional documents added by Cybulska and Vossen (2014). By number of documents, it is the largest corpus we evaluate on, with 982 articles covering 43 diverse topics. It contains 26,712 coreference links between 6,833 event mentions and 69,050 coreference links between 8,289 entity mentions.

Gun Violence Corpus (GVC). The Gun Violence Corpus was introduced by Vossen et al. (2018) to present a greater challenge for CDCR by curating a corpus with high similarity between all mentions and documents covered. All 510 articles in the dataset cover incidents of gun violence and are lexically similar, which presents a greater challenge for document clustering. It contains 29,398 links between 7,298 event mentions.

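Table 3: Cross-evaluation of our approach compared to Bugert et al. (2020a) using the $B^{3}$ metric

| Model | Training data | ECB+ R | ECB+ P | ECB+ F1 | GVC R | GVC P | GVC F1 | FCC R | FCC P | FCC F1 | Harmonic Mean R | Harmonic Mean P | Harmonic Mean F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | ECB+ | 71.8 | 81.2 | 76.2 | 40.1 | 50.3 | 44.6 | 21.6 | 71.0 | 33.1 | 35.2 | 64.8 | 45.6 |
| Ours | ECB+ | 87.1 | 85.3 | 86.2 | 59.3 | 70.7 | 64.5 | 28.5 | 78.0 | 41.7 | 47.3 | 77.6 | 58.8 |
| Baseline | FCC | 22.1 | 89.0 | 35.4 | 6.4 | 82.9 | 11.9 | 38.3 | 70.8 | 49.7 | 13.2 | 80.2 | 22.6 |
| Ours | FCC | 88.3 | 19.3 | 31.7 | 63.3 | 29.0 | 39.8 | 51.7 | 73.2 | 60.6 | 64.6 | 30.0 | 41.0 |
| Baseline | GVC | 78.9 | 63.5 | 70.4 | 49.9 | 73.6 | 59.5 | 31.0 | 62.6 | 41.5 | 46.2 | 66.2 | 54.4 |
| Ours | GVC | 88.4 | 44.2 | 58.9 | 78.6 | 78.8 | 78.7 | 46.1 | 48.5 | 47.3 | 65.6 | 53.6 | 59.0 |
| Baseline | ECB+ & FCC | 71.8 | 77.2 | 74.4 | 41.2 | 46.5 | 43.7 | 31.0 | 71.6 | 43.3 | 42.6 | 62.0 | 50.5 |
| Ours | ECB+ & FCC | 83.3 | 86.2 | 84.7 | 59.0 | 70.8 | 64.4 | 49.2 | 87.0 | 62.9 | 60.9 | 80.6 | 69.4 |
| Baseline | ECB+ & GVC | 78.1 | 68.5 | 73.0 | 46.4 | 40.0 | 43.0 | 39.2 | 50.0 | 43.9 | 50.1 | 50.3 | 50.2 |
| Ours | ECB+ & GVC | 84.1 | 85.5 | 84.8 | 80.5 | 87.0 | 83.6 | 26.6 | 78.5 | 39.7 | 48.4 | 83.5 | 61.3 |
| Baseline | GVC & FCC | 78.2 | 50.6 | 61.4 | 48.8 | 60.7 | 54.1 | 61.0 | 39.6 | 48.0 | 60.4 | 48.8 | 54.0 |
| Ours | GVC & FCC | 94.2 | 19.4 | 32.2 | 82.2 | 75.3 | 78.6 | 54.7 | 77.2 | 64.0 | 73.1 | 38.6 | 50.5 |
| Baseline | All Datasets | 87.2 | 32.3 | 47.1 | 70.7 | 29.6 | 41.7 | 50.8 | 42.6 | 46.3 | 66.2 | 34.0 | 44.9 |
| Ours | All Datasets | 83.4 | 84.0 | 83.7 | 70.8 | 86.7 | 78.0 | 49.1 | 72.3 | 58.6 | 64.6 | 80.5 | 71.6 |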

Football Coreference Corpus (FCC). Bugert et al. (2020b) introduced the Football Coreference Corpus in order to evaluate the ability of CDCR systems to identify event coreference across subtopics. It contains 451 documents covering football tournaments, where articles covering one tournament often refer to events from other tournaments. While it is the smallest corpus in terms of document size, it has the largest number of coreference links of any dataset we evaluate on, with 145,272 coreference links between 3,563 event mentions. Bugert et al. (2020a) re-annotate this corpus at the token level and add entity labels to enable easier validation between FCC and ECB+.

Cross-Domain Cross-Document Coreference Corpus (CD2CR). Ravenscroft et al. (2021) present a dataset which evaluates the ability of CDCR models to work across domains which vary significantly in style and vocabulary. It contains 918 documents, made up of 459 pairs of a scientific paper and a newspaper article covering the paper. These articles cover a variety of topics, but since documents come in automatically discovered pairs, existing evaluations use the gold document pairs. It contains 13,169 links between 3,102 entity mentions.

4.2 Evaluation and Results

All models are implemented in PyTorch (Paszke et al., 2019) and optimized with Adam (Kingma and Ba, 2015). Training the whole pipeline takes one day on a single Tesla V100 GPU. For ECB+, we use the data split used by Cybulska and Vossen (2015). For both FCC and GVC, we use the data splits used by Bugert et al. (2020a). For CD2CR, we use the splits used by Ravenscroft et al. (2021). We compare the $B^{3}$ metric, since it is reported by baselines for all corpora and, because we do not perform mention identification, has the fewest applicable downsides identified by Moosavi and Strube (2016) (a full table of metrics for our corpus-tailored systems can be found in Appendix A). We use a context window size of 5 sentences during candidate retrieval and of 3 sentences during pairwise classification for all experiments. For corpus-tailored evaluations, we retrieve 15 pairs for each mention at training time and 5 pairs at inference time. For cross-corpus evaluations, we retrieve 5 pairs for each mention for both training and inference.
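For readers unfamiliar with the metric, $B^{3}$ scores each mention by the overlap between its predicted cluster and its gold cluster and then averages over mentions. The sketch below follows that standard definition and assumes gold and predicted clusterings cover the same mention set, which matches the gold-mention setting used here; the function name `b_cubed` and the toy clusterings are hypothetical.

```python
def b_cubed(gold, predicted):
    """B^3 recall, precision, and F1 for two mention -> cluster-id mappings."""
    def members(assignment):
        clusters = {}
        for mention, cid in assignment.items():
            clusters.setdefault(cid, set()).add(mention)
        return {m: clusters[cid] for m, cid in assignment.items()}

    gold_of, pred_of = members(gold), members(predicted)
    recall = precision = 0.0
    for m in gold:
        overlap = len(gold_of[m] & pred_of[m])
        recall += overlap / len(gold_of[m])
        precision += overlap / len(pred_of[m])
    recall /= len(gold)
    precision /= len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Hypothetical example: the system splits m1 from m2 and wrongly groups m2 with m3.
gold = {"m1": "A", "m2": "A", "m3": "B"}
predicted = {"m1": 0, "m2": 1, "m3": 1}
recall, precision, f1 = b_cubed(gold, predicted)
```

Because the score is computed per mention, a wrong merge lowers precision for every mention in the merged cluster, while a wrong split lowers recall for every mention in the gold cluster.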

ECB+. Our approach achieves a new state-of-the-art result on ECB+, which is the most widely used CDCR dataset. Our results improve on Caciularu et al. (2021) by 0.2 F1 points for events and 0.7 F1 points for entities. This result is particularly noteworthy since document clustering can be performed nearly perfectly for the ECB+ dataset (Barhom et al., 2019) and there are no inter-cluster links (Bugert et al., 2020a).

Given that document clustering has almost no downside for ECB+ and Caciularu et al. (2021) use a cross-encoder architecture with a much wider context window for classification, we largely credit the increased performance on the ECB+ dataset to the benefits of hard sampling using our attentional state neighborhoods.

GVC & FCC. We evaluate the broader applicability of our model for event CDCR by applying it to the FCC and GVC datasets. Each aims to address elements of real-world event CDCR overlooked by ECB+. These datasets only annotate events, preventing joint modeling of events and entities. This negatively impacts Barhom et al. (2019), which was designed as a joint method, but requires no changes to our architecture.

Our approach improves over the state of the art by 11.3 F1 points for the GVC dataset and by 13.1 F1 points for the FCC dataset. It is worth noting that the previous state-of-the-art was split between these datasets, with document clustering benefiting GVC and harming FCC performance. Our approach improves on the results for both datasets without modification, unifying the state-of-the-art under one approach.

我们的方法在GVC数据集上将当前最优水平提高了11.3个F1分数,在FCC数据集上提高了13.1个F1分数。值得注意的是,此前的最优方法在这两个数据集上表现分化:文档聚类对GVC有益却损害了FCC性能。而我们的方法无需调整即可同时提升两个数据集的性能,用单一方案统一了最优水平。

CD2CR CD2CR presents a unique challenge: its coreference links span two domains with very different linguistic properties, academic text and science journalism. While one might expect this linguistic diversity to make it hard for our pruning method to retrieve pairs across domains, our method proves robust to this challenge, with a 34.5 F1 point improvement over the state of the art. This is especially significant because CD2CR previously relied on a highly corpus-tailored document linking algorithm that used data such as DOI matching and author name and affiliation matching. Such tailoring was needed because the document clustering algorithms used for $\mathrm{ECB+}$ are a bad fit for CD2CR due to its within-topic lexical diversity. This highlights how flexible our method is compared to document clustering.

CD2CR
CD2CR提出了一个独特的共指链接挑战,其涉及两个语言特性截然不同的领域:学术文本和科学新闻报道。尽管人们可能认为这种语言多样性会导致我们的剪枝方法难以跨域检索配对,但我们的方法展现了强大的鲁棒性,相比现有最优技术实现了34.5个F1分的提升。这一结果尤为重要,因为CD2CR此前采用了高度定制化的文档链接算法,依赖DOI匹配、作者姓名及机构匹配等数据——由于$\mathrm{ECB+}$使用的文档聚类算法会因主题内词汇多样性而难以适配CD2CR。这凸显了我们的方法相比文档聚类技术的灵活性。

Event Cross-Dataset Evaluation We evaluate the robustness of our learned models by training and evaluating across the multiple event datasets. Bugert et al. (2020a) propose cross-corpus training as a treatment to produce more generally effective models, since downstream corpora are unlikely to match any specific CDCR corpus. We follow their cross-corpus evaluation and present the results for this cross-evaluation in Table 3.

事件跨数据集评估
我们通过在多个事件数据集上进行训练和评估,来验证所学模型的鲁棒性。Bugert等人 (2020a) 提出跨语料训练作为一种解决方案,可以产生更具普适性的模型效果,因为下游语料不太可能完全匹配任何特定的CDCR语料库。我们遵循其跨语料评估方法,并在表3中展示了本次交叉评估的结果。

For models trained on the train split from a single corpus, we see significant performance loss when evaluated on test splits from other corpora, as is expected. However, we see vastly improved generalizability with our approach when trained on a single corpus compared to the baseline set by Bugert et al. (2020a).

对于在单一语料库训练集上训练的模型,当在其他语料库的测试集上评估时,如预期那样会出现显著的性能下降。然而,与Bugert等人 (2020a) 设定的基线相比,我们的方法在单一语料库训练下展现出显著提升的泛化能力。

Table 4: Candidate Retrieval with Alternate Classifiers evaluated on $\mathrm{ECB+}$ using $B^{3}$

表 4: 在 $\mathrm{ECB+}$ 数据集上使用 $B^{3}$ 评估的候选检索替代分类器性能

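| Pairwise Classifier | R | P | F1 |
| --- | --- | --- | --- |
| Barhom et al. (2019) | 76.2 | 70.7 | 73.4 |
| Yu et al. (2020) | 84.4 | 81.4 | 82.9 |
| Discourse Cross-Encoder (ours) | 87.1 | 85.3 | 86.2 |
| Oracle Model | 96.3 | 100.0 | 98.1 |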

To evaluate the ability of our model to learn from multiple corpora at once, we train our pipeline on combinations of multiple datasets. Datasets are combined naively by using all documents and mentions from the train split of each corpus.

为了评估我们的模型从多个语料库中同时学习的能力,我们在多个数据集的组合上训练了我们的流程。通过简单地合并每个语料库训练集中的所有文档和提及来组合数据集。
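
The naive combination described above amounts to concatenating the train splits. A minimal sketch follows, assuming a hypothetical in-memory layout where each corpus is a dictionary of splits and each document is a dictionary with `doc_id` and `mentions` keys. Prefixing ids with the corpus name is an added assumption to avoid collisions, not something the paper specifies.

```python
def combine_train_splits(corpora):
    """Concatenate the train splits of several corpora into one training set.

    corpora: dict mapping corpus name -> dict with a "train" list of documents
    (hypothetical layout; each document is a dict with "doc_id" and "mentions").
    """
    combined = []
    for name, splits in corpora.items():
        for doc in splits["train"]:
            # Prefix ids so documents from different corpora cannot collide.
            combined.append({**doc, "doc_id": f"{name}/{doc['doc_id']}"})
    return combined


example_corpora = {
    "ECB+": {"train": [{"doc_id": "1", "mentions": ["m1"]}], "test": []},
    "GVC": {"train": [{"doc_id": "1", "mentions": ["m2", "m3"]}], "test": []},
}
combined_train = combine_train_splits(example_corpora)
```

No re-weighting or balancing between corpora is applied in this sketch, which mirrors the "naive" combination the text describes.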

Interestingly, our performance improves on FCC and GVC when training our model with two out of three datasets for both GVC and FCC. We achieve our best results on FCC when GVC training data is added and our best results on GVC when $\mathrm{ECB+}$ data is added. This signals that there is potential for further improvement of the model trained on all datasets by exploring what causes the performance decrease with the introduction of the third dataset in these two cases.

有趣的是,当使用三分之二的数据集(GVC和FCC各两个)训练模型时,我们在FCC和GVC上的性能均有提升。具体而言,添加GVC训练数据时,我们在FCC上取得最佳结果;而添加$\mathrm{ECB+}$数据时,则在GVC上表现最优。这表明通过探究这两种情况下引入第三个数据集导致性能下降的原因,有望进一步提升全数据集训练模型的性能。

Most importantly, our model trained across all datasets shows improved generalizability across each dataset, sacrificing 2.9, 5.0, and 4.9 F1 points compared to our state-of-the-art corpus-tailored models for $\mathrm{ECB+}$, GVC, and FCC respectively. This is a 4.27 point F1 decrease on average, compared to 16.7 F1 points for the baseline, suggesting that our model adapts more effectively to the varying feature importance across corpora shown by Bugert et al. (2020a). For use in downstream systems, this model variant makes it feasible to handle a variety of downstream corpora without fine-tuning, which is especially important since the majority of downstream tasks lack coreference annotations for fine-tuning.

最重要的是,我们跨所有数据集训练的模型显示出各数据集的泛化能力提升,与针对ECB+、GVC和FCC定制的最先进语料库模型相比,分别牺牲了2.9、5.0和4.9 F1分数。相比基线16.7 F1分的下降幅度,我们的模型平均仅降低4.27 F1分,这表明我们的模型能更有效地适应Bugert等人(2020a)所展示的不同语料库间的特征重要性差异。对于下游系统应用,该模型变体使得无需微调即可适配多种下游语料库成为可能,这一点尤为重要,因为大多数下游任务缺乏用于微调的共指标注数据。
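
The average decrease quoted above follows directly from the three per-corpus drops. A quick check in Python, using only the numbers stated in the text (the variable names are illustrative):

```python
# F1 points lost by the all-corpora model relative to each corpus-tailored model.
drops = {"ECB+": 2.9, "GVC": 5.0, "FCC": 4.9}

mean_drop = sum(drops.values()) / len(drops)
print(round(mean_drop, 2))  # 4.27
```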

5 Analysis

5 分析

We analyze the components of our model in isolation to explain the sources of our significant performance gains and the bottlenecks which still exist.

我们通过单独分析模型的各个组件,来解释性能显著提升的来源以及仍然存在的瓶颈。

5.1 Candidate Retrieval Isolation

5.1 候选检索隔离

We evaluate our pruning method with alternate classifiers in Table 4. For these experiments, we fetch 5 nearest neighbor pairs for each mention.

我们在表4中用替代分类器评估了剪枝方法。在这些实验中,我们为每个提及获取了5个最近的邻居对。

Table 5: Masking Study of Discourse Cross-Encoder. Masking is applied only to sentences from the context window, leaving the sentence where the mention occurs fully unmasked. $(^{+})/(^{-})$ indicates usage of discourse or only a single sentence respectively.

表 5: 篇章交叉编码器的掩码研究。掩码仅应用于上下文窗口中的句子,提及出现的句子完全不做掩码处理。$(^{+})/(^{-})$ 分别表示使用篇章或仅使用单个句子。

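| Model Variant | Event R | Event P | Event F1 | Entity R | Entity P | Entity F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Ours with discourse $(^{+})$ | 87.1 | 85.3 | 86.2 | 84.1 | 77.6 | 80.7 |
| - Time and location | 84.5 | 85.9 | 85.2 | 82.6 | 79.0 | 80.7 |
| - Coreference | 85.2 | 86.0 | 85.6 | 83.5 | 72.9 | 77.8 |
| - All entities | 82.0 | 87.9 | 84.9 | 81.4 | 73.2 | 77.1 |
| - All events | 88.2 | 82.3 | 85.1 | 81.4 | 80.5 | 81.0 |
| Ours without discourse $(^{-})$ | 84.4 | 81.4 | 82.9 | 84.1 | 69.4 | 76.0 |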

We define the upper-bound performance of our pruning method by performing an oracle study in which the pruned pairs are passed to a pairwise classifier that has access to gold labels. Despite using only 5 nearest neighbors, the system achieves a recall of 96.3, resulting in an upper-bound F1 of 98.1. Future work can use our pruning method with improved pairwise classification methods without concern, since the pruning method delivers near-perfect results with an oracle pairwise classifier.

我们通过一项预言机研究定义了剪枝方法的上限性能,其中剪枝后的配对会输入一个能访问真实标签的成对分类器。尽管仅使用5个最近邻,该系统仍实现了96.3的召回率,最终获得98.1的F1上限值。未来工作可以放心地将我们的剪枝方法与改进的成对分类方法结合使用,因为该剪枝方法配合预言机成对分类器时能提供近乎完美的结果。
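
The upper-bound F1 can be reproduced from the reported recall. The sketch below assumes that an oracle classifier never links two non-coreferent mentions, so precision is 100, and that F1 is the harmonic mean of precision and recall. The helper name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on a 0-100 scale)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Oracle classifier: no false-positive links (assumed), recall limited by pruning.
upper_bound_f1 = f1_score(precision=100.0, recall=96.3)
print(round(upper_bound_f1, 1))  # 98.1
```

The remaining 3.7 points of recall correspond to coreferent mentions that the candidate retrieval step fails to connect within 5 nearest neighbors, which no downstream classifier can recover.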

We isolate the benefits of our pairwise classification approach by using our pruning model with the pairwise classifiers of Barhom et al. (2019) and the trigger-only variant of Yu et al. (2020). The resulting performance is worse than that of our work, indicating that the pairwise classification model we utilize also plays an important role in our results. Our approach differs from Yu et al. (2020) in its use of hard negative training and local discourse features, leading us to believe these are the primary beneficial factors.

我们通过将剪枝模型与Barhom等人(2019)的成对分类器和Yu等人(2020)的仅触发器变体结合使用,从而分离出成对分类方法的优势。最终性能低于我们的工作,表明我们所采用的成对分类模型在结果中也起着重要作用。我们的方法与Yu等人(2020)的不同之处在于使用了硬负样本训练方法和局部语篇特征,这使我们相信这些是主要的有利因素。

5.2 Discourse Context Ablation Study

5.2 话语语境消融研究

Both our work and the prior state-of-the-art (Caciularu et al., 2021) utilize discourse features during pairwise comparison, which significantly improves performance compared to just a single sentence of context. However, it is not well understood what features of local discourse are valuable to CDCR. We analyze the contributions of local discourse information through two ablation studies.

我们的工作和先前的最先进方法 (Caciularu et al., 2021) 都在成对比较中利用了篇章特征,相比仅使用单句上下文显著提升了性能。但目前尚不清楚哪些局部篇章特征对跨文档共指消解 (CDCR) 具有重要价值。我们通过两项消融实验分析了局部篇章信息的贡献。

We first evaluate the sensitivity of our model to the hyperparameter $w$, the number of sentences surrounding each mention included as context, by keeping a fixed bi-encoder and training 4 separate cross-encoders, one for each value from $w=0$ to $w=3$. Due to our model's 512 token limit, we do not evaluate beyond $w=3$. The results of this ablation, shown in Table 6, demonstrate that each increase in window size increases performance, with diminishing returns.

我们首先通过固定双编码器并训练4个独立的交叉编码器(从$w=0$到$w=3$),评估模型对超参数$w$(即每个提及包含的上下文句子数量)的敏感性。由于模型的512 token限制,未对$w=3$以上情况进行评估。如表6所示,消融实验结果表明,窗口尺寸的每次增大都能提升性能,但收益递减。
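
To make the role of $w$ concrete, here is a minimal sketch of extracting a context window around a mention's sentence. It assumes the window is symmetric, taking up to $w$ sentences on each side and clipping at document boundaries; the text does not state this explicitly, and the function name `context_window` is hypothetical.

```python
def context_window(sentences, mention_idx, w):
    """Return the mention's sentence plus up to w sentences on each side.

    sentences: list of sentence strings for one document.
    mention_idx: index of the sentence containing the mention.
    w: window size; w=0 returns only the mention's own sentence.
    """
    start = max(0, mention_idx - w)
    end = min(len(sentences), mention_idx + w + 1)
    return sentences[start:end]


doc = ["s0", "s1", "s2", "s3", "s4"]
window = context_window(doc, mention_idx=2, w=1)  # ["s1", "s2", "s3"]
```

In a real pipeline the resulting window would still need to be truncated to fit the encoder's 512 token limit mentioned above.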

To understand which local discourse features contribute to this improvement, we study three special types of token from the surrounding discourse: times, locations, and coreferences. Time and location within a sentence have been used in past work via semantic role labeling (Barhom et al., 2019; Bugert et al., 2020a), and coreferring tokens are intuitively informative as they provide additional information about the same event/entity. By including local discourse, $21\%$, $11\%$, and $29\%$ of events and $18\%$, $9\%$, and $34\%$ of entities gain access to new time, location, and coreference information, respectively. For example, consider the following text:

为探究哪些局部语篇特征促成了这一改进,我们研究了周边语篇中的三类特殊token:时间、地点和共指。句子中的时间和地点信息在以往基于语义角色标注的研究中已有应用 (Barhom et al., 2019; Bugert et al., 2020a) ,而共指token能直观地提供同一事件/实体的补充信息。通过引入局部语篇,分别有 $21%$ 、 $11%$ 、 $29%$ 的事件和 $18%$ 、 $9%$ 、 $34%$ 的实体获得了新的时间、地点及共指信息。例如以下文本:

A strong earthquake struck Indonesia’s Aceh province on Tuesday. Many houses were damaged and dozens of villagers were injured.

周二,印度尼西亚亚齐省发生强烈地震。许多房屋受损,数十名村民受伤。

While the event "damaged" is ambiguous with only the context of a single sentence, it becomes much more specific when contextualized with the previous sentence, which contains both a time and a location for the event. We evaluate our system with tokens of these types masked from the local discourse, with results reported in Table 5.

虽然仅凭单一句子的上下文,"damaged"这一事件表述显得模糊不清,但当结合前文包含事件时间和地点的句子进行语境化解读时,其含义就变得具体得多。我们通过屏蔽局部语篇中这类token来评估系统性能,结果如表5所示。
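
The masking procedure can be sketched as follows. This is an illustration under assumptions: tokens are given as `(token, type)` pairs with hypothetical type labels such as `"TIME"` and `"LOC"`, and masked tokens are replaced by a `[MASK]` string. How the paper identifies token types and which mask symbol it uses are not specified here.

```python
def mask_context(sentences, mention_idx, masked_types, mask_token="[MASK]"):
    """Mask tokens of the given types in context sentences only.

    sentences: list of sentences, each a list of (token, type) pairs, where
    type is a hypothetical label ("TIME", "LOC", ...) or None.
    The sentence at mention_idx is left fully unmasked.
    """
    masked = []
    for i, sentence in enumerate(sentences):
        if i == mention_idx:
            masked.append([tok for tok, _ in sentence])
        else:
            masked.append(
                [mask_token if typ in masked_types else tok for tok, typ in sentence]
            )
    return masked


example = [
    [("A", None), ("strong", None), ("earthquake", None), ("struck", None),
     ("Indonesia's", "LOC"), ("Aceh", "LOC"), ("province", "LOC"),
     ("on", None), ("Tuesday", "TIME"), (".", None)],
    [("Many", None), ("houses", None), ("were", None), ("damaged", None), (".", None)],
]
masked_example = mask_context(example, mention_idx=1, masked_types={"TIME", "LOC"})
```

With time and location masked, the context sentence no longer tells the model when or where the earthquake happened, which is the information the "- Time and location" row of Table 5 removes.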

For events, masking time and location (-1.0 F1) and masking coreference (-0.6 F1) in the local discourse both significantly harm performance. For entities, however, only within-document coreference seems to have a major impact (-2.9 F1). Both events and entities are more impacted by masking all entities (-1.3 F1 for events, -3.6 F1 for entities) than by masking all events (-1.1 F1 for events, $+0.3$ F1 for entities), which matches the expectation that the greater degree of polysemy of event tokens makes them less discriminative.

对于事件而言,在局部语篇中同时掩码时间和地点 (-1.0 F1) 以及共指关系 (-0.6 F1) 会显著损害性能。然而,仅文档内的共指关系似乎对实体解析产生重大影响 (-2.9 F1)。与掩码所有事件 (-1.1 F1 事件,$+0.3$ F1 实体) 相比,掩码所有实体对事件和实体的影响更大 (事件 -1.3 F1,实体 -3.6 F1),这与事件token的多义性更高导致其区分度更低的预期相符。
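
The F1 deltas quoted in this paragraph can be recomputed from the F1 columns of Table 5. The snippet below uses only values from that table; the dictionary keys are illustrative names.

```python
# (event F1, entity F1) from Table 5.
full_model = (86.2, 80.7)
ablations = {
    "time_and_location": (85.2, 80.7),
    "coreference": (85.6, 77.8),
    "all_entities": (84.9, 77.1),
    "all_events": (85.1, 81.0),
}

# Change in F1 relative to the full model with discourse context.
deltas = {
    name: (round(event - full_model[0], 1), round(entity - full_model[1], 1))
    for name, (event, entity) in ablations.items()
}
for name, (event_delta, entity_delta) in deltas.items():
    print(f"{name}: events {event_delta:+.1f}, entities {entity_delta:+.1f}")
```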

Table 6: Ablation on cross-encoder context window $w$ evaluated on $\mathrm{ECB+}$ using $B^{3}$

表 6: 交叉编码器上下文窗口 $w$ 在 $\mathrm{ECB+}$ 数据集上使用 $B^{3}$ 指标的消融实验

| $w$ | R | P | F1 |
| --- | --- | --- | --- |
| 0 | 84.4 | 81.4 | 82.9 |
| 1 | 83.4 | 86.5 | 84.9 |
| 2 | 83.1 | 87.7 | 85.4 |
| 3 | 87.1 | 85.3 | 86.2 |

6 Conclusion and Future Work

6 结论与未来工作

In this work, we presented a two-step method for resolving cross-document event and entity coreference inspired by discourse coherence theory. We achieved state-of-the-art results on 3 event and 2 entity CDCR datasets, unifying the previously fractured CDCR space with a single model. We further improved applicability by training across corpora, presenting a model which can be used for downstream tasks that lack coreference annotations for fine-tuning. We demonstrated that our pruning method offers high upper-bound performance and that both stages of our model contribute to our state-of-the-art results. Finally, we explained the contributions of local discourse features when cross-encoding for coreference resolution.

在本工作中,我们提出了一种受语篇连贯理论启发的跨文档事件与实体共指消解两步法。我们在3个事件和2个实体CDCR数据集上取得了最先进的结果,用单一模型统一了此前割裂的CDCR研究领域。通过跨语料库训练,我们进一步提升了模型适用性,使其可用于缺乏共指标注进行微调的下游任务。实验表明,我们的剪枝方法具有较高的性能上限,且模型的两个阶段都对最终结果有贡献。最后,我们阐释了局部语篇特征在共指消解跨编码过程中的作用。

We identify 3 areas of future work:

我们确定了未来工作的3个方向:

• Using knowledge distillation to further improve scalability. Wu et al. (2020) demonstrate that much of the quality gain from cross-encoding can be transferred to a bi-encoder through knowledge distillation, which could have the potential to remove pairwise classification altogether.

• 利用知识蒸馏 (knowledge distillation) 进一步提升可扩展性。Wu 等人 (2020) 的研究表明,通过知识蒸馏可以将交叉编码中获得的大部分质量增益转移到双编码器中,这有可能完全消除成对分类的需求。

• Pairing alternate models for pairwise classification with the bi-encoder candidate pair generator. Our candidate pair generator is unlikely to become a recall bottleneck, so future efforts in CDCR should focus primarily on improving the accuracy of pairwise classification.

• 将交替模型与双编码器候选对生成器配对用于成对分类。我们的候选对生成器不太可能成为召回瓶颈,因此CDCR未来的工作应主要集中于提升成对分类的准确性。

• Integrating CDCR into a wider range of tasks. Our work is robust to a wide variety of data, but it is still unknown which cross-document tasks benefit the most from coreference information.

• 将跨文档共指消解 (CDCR) 整合到更广泛的任务中。我们的方法对多种数据具有鲁棒性,但目前尚不清楚哪些跨文档任务最能受益于共指信息。
