[论文翻译]聚焦关键:运用话语连贯理论解决跨文档共指问题


原文地址:https://arxiv.org/pdf/2110.05362v1


Focus on what matters: Applying Discourse Coherence Theory to Cross Document Coreference

聚焦关键:运用话语连贯理论解决跨文档共指问题

Abstract

摘要

Performing event and entity coreference resolution across documents vastly increases the number of candidate mentions, making it intractable to do the full $n^{2}$ pairwise comparisons. Existing approaches simplify by considering coreference only within document clusters, but this fails to handle inter-cluster coreference, which is common in many applications. As a result, cross-document coreference algorithms are rarely applied to downstream tasks. We draw on an insight from discourse coherence theory: potential coreferences are constrained by the reader’s discourse focus. We model the entities/events in a reader’s focus as a neighborhood within a learned latent embedding space which minimizes the distance between mentions and the centroids of their gold coreference clusters. We then use these neighborhoods to sample only hard negatives to train a fine-grained classifier on mention pairs and their local discourse features. Our approach achieves state-of-the-art results for both events and entities on the $\mathrm{ECB+}$, Gun Violence, Football Coreference, and Cross-Domain Cross-Document Coreference corpora. Furthermore, training on multiple corpora improves average performance across all datasets by 17.2 F1 points, leading to a robust coreference resolution model for use in downstream tasks where link distribution is unknown.

跨文档执行事件和实体共指消解会极大增加候选提及的数量,使得完整的 $n^{2}$ 成对比较难以处理。现有方法通过仅考虑文档簇内的共指来简化问题,但这无法处理簇间共指,而后者在许多应用中很常见。因此跨文档共指消解算法很少应用于下游任务。我们借鉴了语篇连贯理论的一个观点:潜在的共指关系受读者语篇焦点限制。我们将读者焦点中的实体/事件建模为学习到的潜在嵌入空间中的一个邻域,该邻域最小化提及与其黄金共指簇质心之间的距离。然后利用这些邻域仅采样困难负例,在提及对及其局部语篇特征上训练细粒度分类器。我们的方法在 $\mathrm{ECB+}$、Gun Violence、Football Coreference 和 Cross-Domain Cross-Document Coreference 语料库上均取得了事件和实体共指消解的最先进结果。此外,在多个语料库上训练使所有数据集的平均性能提升了17.2个F1点,从而为链接分布未知的下游任务提供了鲁棒的共指消解模型。

1 Introduction

1 引言

Cross-document coreference resolution of entities and events (CDCR) is an increasingly important problem, as downstream tasks that benefit from coreference annotations (such as question answering, information extraction, and summarization) begin interpreting multiple documents simultaneously. Yet the number of candidate mentions across documents makes evaluating the full $n^{2}$ pairwise comparisons intractable (Cremisini and Finlayson, 2020). For single-document coreference, the search space is pruned with simple recency-based heuristics, but there is no natural corollary to recency with multiple documents.

跨文档实体与事件共指消解(CDCR)日益重要,因为受益于共指标注的下游任务(如问答、信息抽取和摘要生成)开始同时处理多文档。然而跨文档候选指称项的数量使得评估全部 $n^{2}$ 成对比较变得不可行 (Cremisini and Finlayson, 2020)。对于单文档共指,可通过基于时效性的简单启发式方法剪枝搜索空间,但多文档场景缺乏天然的时效性对应机制。

Most CDCR systems thus instead cluster the documents and perform the full $n^{2}$ comparisons only within each cluster, disregarding inter-cluster coreference (Lee et al., 2012; Yang et al., 2015; Choubey and Huang, 2017; Barhom et al., 2019; Cattan et al., 2020; Yu et al., 2020; Caciularu et al., 2021). This was effective for the $\mathrm{ECB+}$ dataset, on which most CDCR methods have been evaluated, because $\mathrm{ECB+}$ has lexically distinct topics with almost no inter-cluster coreference.

因此,大多数CDCR系统转而采用文档聚类策略,仅在各簇内部执行完整的$n^{2}$次比较,而忽略簇间共指关系 (Lee et al., 2012; Yang et al., 2015; Choubey and Huang, 2017; Barhom et al., 2019; Cattan et al., 2020; Yu et al., 2020; Caciularu et al., 2021)。这种方法在$\mathrm{ECB+}$数据集上表现有效(当前多数CDCR方法均基于该数据集评估),因为$\mathrm{ECB+}$的词汇主题区分明显,几乎不存在簇间共指现象。

Such document clustering, however, keeps CDCR systems from being generally applicable. Bugert et al. (2020b) show that inter-cluster coreference makes up the majority of coreference in many applications. Cremisini and Finlayson (2020) note that document clustering methods are also unlikely to generalize well to real data, where documents lack the significant lexical differences of $\mathrm{ECB+}$ topics. These issues present a major barrier to the general applicability of CDCR.

然而,这种文档聚类方式限制了跨文档共指消解(CDCR)系统的通用性。Bugert等人(2020b)指出,在许多应用场景中,跨聚类的共指关系占据了共指现象的主体。Cremisini与Finlayson(2020)强调,当实际数据中的文档缺乏类似$\mathrm{ECB+}$主题那样显著的词汇差异时,文档聚类方法的泛化能力也会大幅受限。这些问题对CDCR技术的广泛应用构成了主要障碍。

Human readers, by contrast, are able to perform coreference resolution with minimal pairwise comparisons. How do they do it? Discourse coherence theory (Grosz, 1977, 1978; Grosz and Sidner, 1986) proposes a simple mechanism: a reader focuses on only a small set of entities/events from their full knowledge. This set, the attentional state, is constructed as entities/events are brought into focus either explicitly by reference or implicitly by their similarity to what has been referenced. Since attentional state is inherently dynamic, with entities/events coming into and out of focus as discourse progresses, a document-level approach is a poor model of this mechanism.

相比之下,人类读者仅需极少量的成对比较就能完成共指消解。他们是如何做到的?话语连贯性理论 (Grosz, 1977, 1978; Grosz and Sidner, 1986) 提出了一个简单机制:读者仅从全部知识中聚焦于一小部分实体/事件。这个被称为注意力状态的集合,会随着实体/事件通过显式引用或与已引用内容的隐式相似性进入焦点而被构建。由于注意力状态具有动态本质——随着话语推进,实体/事件会不断进入和退出焦点——因此文档级方法难以有效建模这一机制。

We propose modeling focus at the mention level using the two-stage approach illustrated in Figure 1:

我们提出采用图1所示的两阶段方法在提及级别上建模焦点。


Figure 1: A high-level overview of our system: For a particular mention, candidate coreferring mentions are retrieved from a neighborhood surrounding the mention. These candidate pairs are fed to a pairwise classifier specialized for hard negatives fetched from this space. This allows our method to create a high-fidelity coreference graph with minimal pairwise comparison and no a priori assumptions about coreference. We use a bi-encoder for candidate retrieval and a cross-encoder for pairwise classification (Humeau et al., 2020).

图 1: 系统概览:针对特定提及,从该提及周围的邻域中检索候选共指提及。这些候选对被送入专为该空间提取的困难负样本设计的成对分类器。这种方法使我们能够以最少的成对比较且无需对共指做先验假设,构建高保真共指图。我们采用双编码器 (bi-encoder) 进行候选检索,交叉编码器 (cross-encoder) 进行成对分类 (Humeau et al., 2020)。

  1. We model attentional state as the set of K nearest neighbors within a latent embedding space for mentions. This space is learned with a distance-based classification loss to construct embeddings that minimize the distance between mentions and the centroid of all mentions which share their reference class.
  1. 我们将注意力状态建模为潜在嵌入空间中提及的K个最近邻集合。该空间通过基于距离的分类损失进行学习,以构建最小化提及与其共享引用类别的所有提及质心之间距离的嵌入。

  2. These attentional state neighborhoods aggressively constrain the search space for our second-stage pairwise classifier. This classifier utilizes cross-attention between mention pairs and their local discourse features to capture the features important within an attentional state, which are comparison-specific (Grosz, 1978). By sampling from attentional state neighborhoods at training time, we train on only hard negatives such as those shown in Table 1. We analyze the contribution of the local discourse features to our approach, providing an explanation for the empirical effectiveness of our classifier and that of earlier work like Caciularu et al. (2021).

  2. 这些注意力状态邻域严格限制了第二阶段成对分类器的搜索空间。该分类器利用提及对之间的交叉注意力及其局部话语特征,捕捉注意力状态中与具体比较相关的关键特征 (Grosz, 1978)。通过在训练时从注意力状态邻域采样,我们仅针对困难负样本进行训练,如表 1 所示。我们分析了局部话语特征对本方法的贡献,为我们的分类器以及 Caciularu 等人 (2021) 等先前工作的实证有效性提供了解释。

Following the recommendations of Bugert et al. (2020a), we evaluate our method on multiple event and entity CDCR corpora, as well as on cross-corpus transfer for event CDCR. Our method achieves state-of-the-art results on the $\mathrm{ECB+}$ corpus for both events (+0.2 F1) and entities (+0.7 F1), the Gun Violence Corpus (+11.3 F1), the Football Coreference Corpus (+13.3 F1), and the Cross-Domain Cross-Document Coreference Corpus (+34.5 F1). We further improve average results by training across all event CDCR corpora, leading to a 17.2 F1 improvement for average performance across all tasks. Our robust model makes it feasible to apply CDCR to a wide variety of downstream tasks, without requiring expensive new coreference annotations to enable fine-tuning on each new corpus. (This has been a huge effort for the few tasks that have attempted it, such as multi-hop QA (Dhingra et al., 2018; Chen et al., 2019) and multi-document summarization (Falke et al., 2017).)

遵循Bugert等人 (2020a) 的建议,我们在多个事件与实体跨文档共指消解(CDCR)语料库上评估了该方法,包括事件CDCR的跨语料库迁移任务。我们的方法在$\mathrm{ECB+}$语料库上取得了最先进的结果:事件共指消解性能提升 (+0.2 F1),实体共指消解提升 (+0.7 F1);在枪支暴力语料库上提升 (+11.3 F1),足球共指语料库提升 (+13.3 F1),跨领域跨文档共指语料库提升 (+34.5 F1)。通过在所有事件CDCR语料库上进行联合训练,我们进一步将平均性能提升了17.2个F1值。该鲁棒模型使得CDCR技术能广泛应用于下游任务,而无需为每个新语料库进行昂贵的共指标注微调(现有研究中仅有多跳问答 (Dhingra等, 2018; Chen等, 2019) 和多文档摘要 (Falke等, 2017) 等少数任务尝试过这种标注,耗费了大量资源)。

2 Related Work

2 相关工作

Cross-Document Coreference Many CDCR algorithms use hand-engineered event features to perform classification. Such systems have a low pairwise classification cost and therefore ignore the quadratic scaling and perform no pruning (Bejan and Harabagiu, 2010; Yang et al., 2015; Vossen and Cybulska, 2016; Bugert et al., 2020a). Other such systems choose to include document clustering to increase precision, which can be done with very little tradeoff for the $\mathrm{ECB+}$ corpus (Lee et al., 2012; Cremisini and Finlayson, 2020).

跨文档共指消解
许多跨文档共指消解(CDCR)算法使用手工设计的事件特征进行分类。这类系统具有较低的成对分类成本,因此忽略了二次复杂度且不进行剪枝 (Bejan and Harabagiu, 2010; Yang et al., 2015; Vossen and Cybulska, 2016; Bugert et al., 2020a)。另一些系统选择引入文档聚类来提高精确度,这种方法在$\mathrm{ECB+}$语料库上几乎无需权衡 (Lee et al., 2012; Cremisini and Finlayson, 2020)。

Kenyon-Dean et al. (2018) explore an approach that avoids pairwise classification entirely, instead relying purely on representation learning and clustering within an embedding space. They propose a novel distance-based regularization term for their classifier that encourages representations that can be used for clustering. This approach is more scalable than pairwise classification approaches, but its performance lags behind the state of the art as it cannot use pairwise information.

Kenyon-Dean等人(2018)探索了一种完全避免成对分类的方法,纯粹依赖嵌入空间中的表征学习和聚类。他们为分类器提出了一种新颖的基于距离的正则化项,以鼓励可用于聚类的表征。这种方法比成对分类方法更具可扩展性,但由于无法利用成对信息,其性能落后于最先进水平。

Table 1: Examples of positives and hard negatives within an attentional state neighborhood

表 1: 注意力邻域中的正例与困难负例示例

| 提及类型 | 提及内容 | 关系 |
| --- | --- | --- |
| 事件 | 盖瑟斯地区发生初步震级2.0地震 | 根节点 |
| 事件 | 地震发生于上午7:30左右 | 共指 |
| 事件 | 震动发生在上午9:27 | 不同 |
| 实体 | ...将使AMD成为全球最大图形芯片供应商之一 | 根节点 |
| 实体 | ...该公司宣布达成3.34亿美元协议 | 共指 |
| 实体 | 全球最大图形芯片制造商英特尔拒绝对该交易置评 | 不同 |

Table 2: Evaluation Results using $B^{3}$. For our approaches, (+)/(−) indicates usage of discourse or only a single sentence, respectively. Methods marked with * perform all pairwise comparisons without pruning.

| 方法 | ECB+ 事件 (R / P / F1) | GVC (R / P / F1) | FCC (R / P / F1) | ECB+ 实体 (R / P / F1) | CD2CR (R / P / F1) |
| --- | --- | --- | --- | --- | --- |
| Barhom et al. (2019) | 81.8 / 77.5 / 79.6 | 81.0 / 66.0 / 72.7 | 17.9 / 88.3 / 29.8 | 66.8 / 75.5 / 70.9 | – |
| Barhom et al. (2019)* | – | – | 36.0 / 83.0 / 50.2 | – | – |
| Bugert et al. (2020a)* | 71.8 / 81.2 / 76.2 | 49.9 / 73.6 / 59.5 | 38.3 / 70.8 / 49.7 | – | – |
| Cattan et al. (2020) | 82.1 / 82.7 / 82.4 | – | – | 70.7 / 74.8 / 72.7 | 57.0 / 35.0 / 44.0 |
| Yu et al. (2020) | 86.1 / 84.7 / 85.4 | – | – | – | – |
| Caciularu et al. (2021) | 84.9 / 87.9 / 86.4 | – | – | 82.5 / 81.7 / 82.1 | – |
| Our Approach− | 84.9 / 82.4 / 83.6 | 67.2 / 81.1 / 73.5 | 47.9 / 68.7 / 56.5 | 84.8 / 76.2 / 80.3 | 67.7 / 72.8 / 70.2 |
| Our Approach+ | 85.6 / 87.7 / 86.6 | 82.2 / 83.8 / 83.0 | 61.6 / 65.4 / 63.5 | 85.1 / 80.6 / 82.8 | 77.4 / 79.7 / 78.5 |

表 2: 使用 $B^{3}$ 的评估结果。对于我们的方法, (+)/(−) 表示使用篇章或仅使用单句。标有 * 的方法表示执行了所有无剪枝的成对比较。

Most recent systems use neural models for pairwise classification (Barhom et al., 2019; Cattan et al., 2020; Meged et al., 2020; Zeng et al., 2020; Yu et al., 2020; Caciularu et al., 2021). These algorithms each use document clustering, a pairwise neural classifier to construct distance matrices within each topic, and agglomerative clustering to compute the final clusters. Innovation has focused on the pairwise classification stage, with variants of document clustering as the only pruning option. Caciularu et al. (2021) sets the previous state of the art for both events and entities in $\mathrm{ECB+}$, using a cross-document language model with a large context window to cross-encode and classify a pair of mentions with the full context of their documents.

最新系统采用神经网络模型进行成对分类 (Barhom et al., 2019; Cattan et al., 2020; Meged et al., 2020; Zeng et al., 2020; Yu et al., 2020; Caciularu et al., 2021)。这些算法均使用文档聚类、成对神经分类器构建各主题内的距离矩阵,并通过凝聚聚类计算最终聚类结果。创新主要集中在成对分类阶段,其中文档聚类的变体是唯一的剪枝选项。Caciularu等人 (2021) 在$\mathrm{ECB+}$数据集上创造了此前事件与实体共指的最佳性能,他们采用具有大上下文窗口的跨文档语言模型,通过跨编码方式利用文档完整上下文对提及对进行分类。

Other Tasks Lee et al. (2018) introduces the concept of a “coarse-to-fine” approach in single-document entity coreference resolution. The architecture utilizes a bi-linear scoring function to generate a set of likely antecedents, which is then passed through a more expensive classifier that performs higher-order inference on antecedent chains. Our work extends to multiple documents the idea of combining a high-recall but low-precision pruning function with expensive pairwise classification to balance recall, precision, and runtime efficiency.

其他任务
Lee等人(2018) 在单文档实体共指消解中提出了"由粗到精(coarse-to-fine)"的方法概念。该架构采用双线性评分函数生成一组可能的先行词,再通过计算成本更高的分类器对这些先行词链进行高阶推理。我们的工作将这种结合高召回低精度剪枝函数与高成本成对分类的思路扩展到多文档场景,以平衡召回率、精确率和运行效率。

Wu et al. (2020) use a similar architecture to ours to create a highly scalable system for zero-shot entity linking. Their method treats entity linking as a ranking problem, using a bi-encoder to retrieve possible entity mentions and then re-ranking the candidate mentions using a cross-encoder. Their results confirm that such architectures can deliver state-of-the-art performance while achieving tremendous scale. However, in coreference resolution, mentions can have one, many, or no coreferring mentions, which makes treating it as a ranking problem non-trivial and necessitates the novel training and inference processes we propose.

Wu等人 (2020) 采用与我们相似的架构构建了一个高度可扩展的零样本 (zero-shot) 实体链接系统。该方法将实体链接视为排序问题,使用双编码器 (bi-encoder) 检索可能的实体指称项,再通过交叉编码器 (cross-encoder) 对候选指称项进行重排序。其结果表明,此类架构在实现大规模扩展的同时仍能保持最先进的性能。然而在共指消解任务中,指称项可能对应零个、一个或多个共指对象,这使得将其视为排序问题变得尤为复杂,因此需要我们提出的新型训练与推理流程。

3 Model

3 模型

Our system is trained in multiple stages and evaluated as a single pipeline. First, we train the encoder for the pruning model to define our latent embedding space. Then, we use this model to sample training data for a pairwise classifier which performs binary classification for coreference. Our complete pipeline retrieves candidate pairs from the attentional state, classifies them using the pairwise classifier, and performs a variant of the agglomerative clustering algorithm proposed by Barhom et al. (2019) to form the final clusters, as laid out in Figure 2.

我们的系统经过多阶段训练,并作为单一流程进行评估。首先,我们训练剪枝模型的编码器以定义潜在嵌入空间。接着,使用该模型为二元共指分类器采样训练数据。完整流程包括:从注意力状态检索候选对,使用成对分类器进行分类,并采用Barhom等人(2019)提出的聚合聚类算法变体形成最终簇,如图2所示。

3.1 Candidate Retrieval

3.1 候选检索

Figure 2: Clustering algorithm used at inference time

图 2: 推理时使用的聚类算法

Encoding Setup We feed the sentences from a window surrounding the mention sentence to a fine-tuned BERT architecture initialized from RoBERTa-large pre-trained weights (Devlin et al., 2019; Liu et al., 2019). A mention is represented as the concatenation of the token-level representations at the boundaries of the mention, following the span boundary representations used by Lee et al. (2017).

编码设置
我们将提及句周围窗口中的句子输入到一个经过微调的BERT架构中,该架构初始化自RoBERTa-large预训练权重 (Devlin等人, 2019; Liu等人, 2019)。提及 (Mention) 的表示采用Lee等人 (2017) 使用的跨度边界表示方法,通过连接提及边界处的Token级表示来实现。
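A minimal PyTorch sketch of this span-boundary representation, assuming the contextual token states have already been produced by the encoder (the function and variable names are ours):

```python
import torch

def mention_representation(token_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """token_states: (seq_len, hidden) contextual token embeddings from the encoder;
    start / end: indices of the mention's boundary tokens."""
    # concatenate the boundary-token representations, following Lee et al. (2017)
    return torch.cat([token_states[start], token_states[end]], dim=-1)
```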

Optimization Similar to Kenyon-Dean et al. (2018), the network is trained to perform a multi-class classification problem where the classes are labels assigned to the gold coreference clusters, which are the connected components of the coreference graph. Rather than adding distance-based regularization, we instead optimize the distance metric directly by using the inner product as our scoring function.

优化
与 Kenyon-Dean 等人 (2018) 类似,该网络被训练用于执行多类别分类问题,其中类别是分配给黄金共指簇(即共指图的连通分量)的标签。我们没有添加基于距离的正则化,而是通过使用内积作为评分函数来直接优化距离度量。

Before each epoch, we construct the representation of each mention $y_{m_{i}}$ with the encoder from the previous epoch. Each gold coreference cluster $c_{i}$ is represented by $y_{c_{i}}$, the centroid of its component mentions:

在每个训练周期开始前,我们使用上一周期的编码器为每个提及项构建表征$y_{m_{i}}$。每个黄金共指簇$c_{i}$由其成员提及的质心$y_{c_{i}}$表示:

$$
y_{c_{i}}={\frac{1}{\mid c_{i}\mid}}\sum_{y_{m_{i}}\in c_{i}}y_{m_{i}}
$$


The score $s_{o}$ of a mention $m_{i}$ for a cluster $c_{i}$ is simply the inner product between this cluster representation and the mention representation:

提及 $m_{i}$ 对于聚类 $c_{i}$ 的得分 $s_{o}$ 即为该聚类表示与提及表示的内积:

$$
s_{o}(m_{i},c_{i})=y_{m_{i}}\cdot y_{c_{i}}
$$


Using this scoring function, the model is trained to predict the correct cluster for a mention with respect to sampled negative clusters. We combine random in-batch negative clusters with hard negatives from the top 10 predicted gold clusters for each training sample in the batch, following Gillick et al. (2019). For each mention $m_{i}$ with true cluster $c^{\prime}$ and negative clusters $B$, the loss is computed using categorical cross-entropy loss on the softmax of our score vector, which we express as:

利用该评分函数,模型被训练用于预测提及项相对于采样负聚类的正确聚类。我们遵循Gillick等人(2019) 的方法,将批次内随机负聚类与每个训练样本前10个预测黄金聚类的困难负样本相结合。对于每个提及项$m_{i}$,其真实聚类为$c^{\prime}$,负聚类集合为$B$,损失函数采用分类交叉熵计算得分向量的softmax结果,公式表示为:

$$
L(m_{i},c^{\prime})=-s_{o}(m_{i},c^{\prime})+\log\sum_{c_{i}\in B}\exp(s_{o}(m_{i},c_{i}))
$$


This loss function can be interpreted intuitively as rewarding embeddings which form separable dense mention clusters according to their gold coreference labels. The left term in our loss function acts as an attractive component towards the centroid of the gold cluster, while the right term acts as a repulsive component away from the centroids of incorrect clusters. The repulsive component is especially important for singleton clusters, whose centroids are by definition identical to their mention representations.

该损失函数可以直观理解为:根据正确的共指标签,对形成可分离密集提及簇的嵌入进行奖励。损失函数中的左项作为吸引项,使嵌入向正确簇的质心靠拢;而右项作为排斥项,使嵌入远离错误簇的质心。对于单例簇而言,排斥项尤为重要——根据定义,这些簇的质心与其提及表示完全相同。
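The following is a minimal PyTorch sketch of this training objective, assuming the cluster centroids have been recomputed from the previous epoch's mention embeddings; the function names and tensor shapes are illustrative rather than taken from the released code:

```python
import torch
import torch.nn.functional as F

def cluster_centroids(mention_embs: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
    """mention_embs: (num_mentions, dim); cluster_ids: (num_mentions,) gold cluster labels.
    Returns one centroid per gold cluster, i.e. the mean of its member mentions."""
    centroids = [mention_embs[cluster_ids == c].mean(dim=0) for c in cluster_ids.unique()]
    return torch.stack(centroids)

def bi_encoder_loss(mention_emb: torch.Tensor,
                    gold_centroid: torch.Tensor,
                    negative_centroids: torch.Tensor) -> torch.Tensor:
    """mention_emb: (dim,); gold_centroid: (dim,); negative_centroids: (num_neg, dim),
    a mix of in-batch negatives and hard negatives from the top predicted clusters."""
    candidates = torch.cat([gold_centroid.unsqueeze(0), negative_centroids], dim=0)
    scores = candidates @ mention_emb            # inner-product scores s_o(m_i, c_i)
    target = torch.zeros(1, dtype=torch.long)    # index 0 is the gold cluster c'
    # cross-entropy over the scores equals -s_o(m_i, c') + log sum_c exp(s_o(m_i, c))
    return F.cross_entropy(scores.unsqueeze(0), target)
```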

Inference Unlike previous work using the bi-encoder architecture, our inference task is distinct from our training task. Since our training task requires oracle knowledge of the gold coreference labels, it cannot be performed at inference time. However, since the embedding model is optimized to place all mentions near their centroids, it implicitly places all mentions of the same class close to one another even when that class is unknown. Therefore, the set of K nearest mentions within this space is made up of coreferences and references to highly related entities/events, such as those shown in Table 1, which models an attentional state made up of entities/events explicitly and implicitly in focus (Grosz and Sidner, 1986).

推理
与之前使用双编码器架构的工作不同,我们的推理任务与训练任务存在本质区别。由于训练任务需要黄金共指标签的先验知识,这些信息在推理时无法获取。但通过优化嵌入模型使所有提及靠近其质心,该模型会隐式地将同一类别的所有提及彼此靠近——即使该类别未知。因此,该嵌入空间中K个最近邻的提及集合既包含共指实例,也包含高度相关的实体/事件引用,如表1所示。这一邻域集合模拟了由显式或隐式聚焦的实体/事件组成的注意力状态 (Grosz and Sidner, 1986)。


Compared to document clustering, this approach can prune aggressively without disregarding any links. The encoding step scales linearly and old embeddings do not need to be recomputed if new documents are added. Importantly, no pairs are disregarded a priori when we compute the nearest neighbor graph and this efficient computation can scale to millions of points using GPU-enabled nearest neighbor libraries like FAISS (Johnson et al., 2017), which we use for our implementation.

与文档聚类相比,这种方法可以积极剪枝而不忽略任何链接。编码步骤呈线性扩展,且添加新文档时无需重新计算旧嵌入。关键的是,在计算最近邻图时我们没有先验地忽略任何配对,这种高效计算可通过启用GPU的最近邻库(如FAISS (Johnson et al., 2017) )扩展到数百万个点,我们将其用于实现中。
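The paper names FAISS for this nearest-neighbor search; below is a minimal sketch of how the attentional-state neighborhoods could be retrieved with an exact inner-product index (the function and variable names are ours):

```python
import faiss
import numpy as np

def attentional_state_neighborhoods(mention_embeddings: np.ndarray, k: int) -> np.ndarray:
    """mention_embeddings: (num_mentions, dim) float32 bi-encoder outputs.
    Returns the indices of the K nearest mentions for every mention."""
    index = faiss.IndexFlatIP(mention_embeddings.shape[1])  # exact inner-product search
    index.add(mention_embeddings)
    # retrieve k + 1 neighbors because the closest match is the mention itself
    _, neighbor_ids = index.search(mention_embeddings, k + 1)
    return neighbor_ids[:, 1:]  # drop the self-match in the first column
```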

3.2 Pairwise Classifier

3.2 成对分类器

Classification Setup For pairwise classification, we use a transformer with cross-attention between pairs. This follows prior work demonstrating that such encoders pick up distinctions between classes which previously required custom logic (Yu et al., 2020). Our use of cross-attention is also motivated by discourse coherence theory. Grosz (1978) highlights that, within an attentional state, the importance to coreference of a mention’s features depends heavily on the features of the mention it is being compared to.

分类设置
对于成对分类任务,我们采用具有交叉注意力机制的 Transformer 模型处理样本对。该方法延续了先前研究的成果,即此类编码器能捕捉原本需要定制逻辑的类别差异 (Yu et al., 2020)。我们采用交叉注意力机制还受到语篇连贯理论的启发。Grosz (1978) 指出,在注意力状态中,某一提及的特征对共指判断的重要性,很大程度上取决于与之比较的提及的特征。

The cross-encoder is a fine-tuned BERT architecture starting from RoBERTa-large pre-trained weights. For a mention pair $(e_{i},e_{j})$, we build a pairwise representation by feeding the following sequence to our encoder, where $S_{i}$ is the sentence in which the mention occurs and $w$ is the maximum number of sentences away from the mention sentence we include as context:

交叉编码器是基于RoBERTa-large预训练权重微调的BERT架构。对于提及对$(e_{i},e_{j})$,我们通过向编码器输入以下序列来构建成对表示,其中$S_{i}$是提及出现的句子,$w$是作为上下文包含的与提及句子相距的最大句子数:

$$
\langle s\rangle S_{i-w}...S_{i}...S_{i+w}\langle/s\rangle\langle s\rangle S_{j-w}...S_{j}...S_{j+w}\langle/s\rangle
$$

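A small illustrative helper for assembling this input sequence, assuming sentence-segmented documents and RoBERTa-style separator tokens (the helper name and arguments are ours):

```python
def build_pair_input(sents_i, idx_i, sents_j, idx_j, w):
    """sents_i / sents_j: sentence lists for the two mentions' documents;
    idx_i / idx_j: indices of the mention sentences; w: context window size."""
    ctx_i = " ".join(sents_i[max(0, idx_i - w): idx_i + w + 1])
    ctx_j = " ".join(sents_j[max(0, idx_j - w): idx_j + w + 1])
    # <s> S_{i-w} ... S_{i+w} </s><s> S_{j-w} ... S_{j+w} </s>
    return f"<s> {ctx_i} </s><s> {ctx_j} </s>"
```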

Each mention is represented as $v_{e_{i}}$ which is the concatenation of the representations of its boundary tokens, with the pair of mentions represented as the concatenation of each mention representation and the element-wise multiplication of the two mentions:

每个提及表示为 $v_{e_{i}}$ ,即其边界token表示的拼接,而提及对则表示为每个提及表示的拼接与两个提及的逐元素相乘结果:

$$
v_{(e_{i},e_{j})}=[v_{e_{i}},v_{e_{j}},v_{e_{i}}\odot v_{e_{j}}]
$$


This vector is fed into a multi-layer perceptron and we take the softmax function to get the probability that $e_{i}$ and $e_{j}$ are co referring.

该向量被输入到一个多层感知机中,我们采用softmax函数来获取$e_{i}$和$e_{j}$共指的概率。
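A hedged sketch of this pairwise scorer, assuming the boundary-token representations $v_{e_{i}}$ and $v_{e_{j}}$ have already been taken from the cross-encoder output; the module name, hidden size, and layer count are illustrative:

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores a mention pair from its cross-encoder boundary representations v_i and v_j."""
    def __init__(self, mention_dim: int, hidden_dim: int = 1024):
        super().__init__()
        # mention_dim is twice the encoder hidden size (two boundary tokens concatenated)
        self.mlp = nn.Sequential(
            nn.Linear(3 * mention_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),   # two classes: non-coreferring / coreferring
        )

    def forward(self, v_i: torch.Tensor, v_j: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([v_i, v_j, v_i * v_j], dim=-1)   # [v_i, v_j, v_i ⊙ v_j]
        logits = self.mlp(pair)
        return torch.softmax(logits, dim=-1)[..., 1]       # probability of coreference
```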

Training Pair Generation We use the K nearest neighbors in the bi-encoder embedding space to generate training data for the pairwise classifier. This gives the training data a distribution of positives and negatives similar to the one the classifier will likely see at inference time, while also ensuring that only positive and hard negative pairs are sampled.

训练对生成
我们使用双编码器嵌入空间中的K近邻来为成对分类器生成训练数据。这为训练数据提供了与分类器在推理时可能看到的正负样本分布相似的分布,同时也仅采样正样本和困难负样本对。

These negatives are those that the bi-encoder was unable to separate clearly in isolation, which makes them prime candidates for more expensive cross-comparison. At training time, the selection of the hyperparameter K is used to balance the volume of training data with the difficulty of negative pairs.

这些负样本是双编码器无法单独清晰区分的样本,因此它们成为更耗时的交叉比较的主要候选对象。在训练时,超参数K的选择用于平衡训练数据量与负样本对的难度。
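One way this neighborhood-based sampling could look, assuming gold cluster labels are available at training time and the k-NN indices come from the bi-encoder space (the names are illustrative):

```python
def sample_training_pairs(neighbor_ids, gold_cluster):
    """neighbor_ids: (num_mentions, k) nearest-neighbor indices in the bi-encoder space;
    gold_cluster: mapping from mention index to its gold coreference cluster id."""
    positives, hard_negatives = [], []
    for m, neighbors in enumerate(neighbor_ids):
        for n in neighbors:
            if gold_cluster[m] == gold_cluster[n]:
                positives.append((m, n))        # coreferring neighbor
            else:
                hard_negatives.append((m, n))   # near miss the bi-encoder could not separate
    return positives, hard_negatives
```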

Optimization Once the training data has been generated, we simply train the classifier in a binary setup to classify a pair as either coreferring or non-coreferring. As with prior work, we optimize our pairwise classifier using binary cross-entropy loss.

优化
生成训练数据后,我们只需在二元设置下训练分类器,将实体对分类为共指或非共指。与先前工作一致,我们采用二元交叉熵损失函数来优化配对分类器。

3.3 Clustering

3.3 聚类

At inference time, we use a modified form of the agglomerative clustering algorithm designed by Barhom et al. (2019) to compute clusters, as described in Figure 2. We do not perform mention detection, so our method relies on gold mentions or a separate mention detection step. First, the algorithm generates pairs of mentions using K nearest neighbor retrieval within our embedding space. Each of these pairs is run through the trained cross-encoder, and all pairs with a probability of less than 0.5 are removed. Pairs are then sorted by their classification probability and clusters are merged greedily.

在推理阶段,我们采用Barhom等人(2019)设计的改进版凝聚聚类算法计算簇,如图2所示。由于不执行指称检测(mention detection),该方法依赖黄金指称或独立的指称检测步骤。首先,算法在嵌入空间中使用K近邻检索生成指称对,所有指称对通过训练好的交叉编码器进行分类,并剔除概率低于0.5的指称对。随后按分类概率排序,以贪心策略合并簇。

Following Barhom et al. (2019), we compute the score between two clusters as the average score between all mention pairs in each cluster. However, since we only compare two clusters that share a local edge, we do this without computing the full pairwise distance matrix.

遵循 Barhom 等人 (2019) 的方法,我们通过计算每个簇中所有提及对之间的平均得分来衡量两个簇之间的得分。但由于我们仅比较共享局部边的两个簇,因此无需计算完整的成对距离矩阵。
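A simplified sketch of this merging step, assuming the cross-encoder probabilities for the retrieved pairs have already been computed; it follows the description above rather than the exact algorithm of Barhom et al. (2019):

```python
def greedy_cluster(candidate_pairs, pair_prob, threshold=0.5):
    """candidate_pairs: (i, j) mention-id pairs from the k-NN graph that survived the
    cross-encoder filter; pair_prob: {frozenset({i, j}): coreference probability}."""
    # start from singleton clusters
    cluster_of = {m: frozenset([m]) for pair in candidate_pairs for m in pair}

    def cluster_score(a, b):
        # average cross-encoder probability over all scored mention pairs between a and b
        scores = [pair_prob[frozenset((x, y))]
                  for x in a for y in b if frozenset((x, y)) in pair_prob]
        return sum(scores) / len(scores) if scores else 0.0

    # merge greedily in order of decreasing pair probability
    for i, j in sorted(candidate_pairs, key=lambda p: pair_prob[frozenset(p)], reverse=True):
        a, b = cluster_of[i], cluster_of[j]
        if a is not b and cluster_score(a, b) >= threshold:
            merged = a | b
            for m in merged:
                cluster_of[m] = merged
    return set(cluster_of.values())
```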

4 Experiments

4 实验

We perform an empirical study across 3 event and 2 entity English cross-document coreference corpora.

我们对3个事件和2个实体的英文跨文档共指语料库进行了实证研究。

4.1 Datasets

4.1 数据集

Here we briefly cover the properties of each corpus we evaluate on. For a more thorough breakdown of corpus properties for event CDCR, see Bugert et al. (2020a).

我们在此简要介绍所评估的每个语料库的特性。关于事件跨文档共指消解 (event CDCR) 语料库属性的详细分析,请参阅 Bugert 等人 (2020a) 的研究。

Event Coreference Bank Plus (ECB+) Historically, the $\mathrm{ECB+}$ corpus has been the primary dataset used for evaluating CDCR. This corpus is based on the original Event Coreference Bank corpus of Bejan and Harabagiu (2010), with entity annotations added by Lee et al. (2012) to allow joint modeling and additional documents added by Cybulska and Vossen (2014). By number of documents, it is the largest corpus we evaluate on, with 982 articles covering 43 diverse topics. It contains 26,712 coreference links between 6,833 event mentions and 69,050 coreference links between 8,289 entity mentions.

事件共指库增强版 $\mathbf{(ECB+)}$

历史上,$\mathrm{ECB+}$ 语料库一直是评估跨文档共指消解 (CDCR) 的主要数据集。该语料库基于 Bejan 和 Harabagiu (2010) 的原始事件共指库 (Event Co reference Bank) ,并在 Lee 等人 (2012) 的研究中添加了实体标注以实现联合建模,随后 Cybulska 和 Vossen (2014) 又补充了额外文档。按文档数量计算,这是我们评估的最大语料库,包含 982 篇文章,涵盖 43 个不同主题。其中包含 6,833 个事件提及之间的 26,712 条共指链接,以及 8,289 个实体提及之间的 69,050 条共指链接。

Gun Violence Corpus (GVC) The Gun Violence Corpus was introduced by Vossen et al. (2018) to present a greater challenge for CDCR by curating a corpus with high similarity between all mentions and documents covered. All 510 articles in the dataset cover incidents of gun violence and are lexically similar, which presents a greater challenge for document clustering. It contains 29,398 links between 7,298 event mentions.

枪支暴力语料库 (GVC)
Gun Violence Corpus由Vossen等人 (2018) 提出,通过构建一个所有提及事件和涵盖文档间具有高度相似性的语料库,为CDCR带来更大挑战。该数据集包含510篇报道枪支暴力事件的文章,这些文章在词汇层面高度相似,为文档聚类任务增加了难度。语料库包含7,298个事件提及之间的29,398条关联链接。

Table 3: Cross-Evaluation of our approach compared to Bugert et al. (2020a) using the $B^{3}$ metric

表 3: 我们的方法与 Bugert 等人 (2020a) 使用 $B^{3}$ 指标的交叉评估对比

| 模型 | 训练数据集 | ECB+ (R / P / F1) | GVC (R / P / F1) | FCC (R / P / F1) | Harmonic Mean (R / P / F1) |
| --- | --- | --- | --- | --- | --- |
| Baseline | ECB+ | 71.8 / 81.2 / 76.2 | 40.1 / 50.3 / 44.6 | 21.6 / 71.0 / 33.1 | 35.2 / 64.8 / 45.6 |
| Ours | ECB+ | 87.1 / 85.3 / 86.2 | 59.3 / 70.7 / 64.5 | 28.5 / 78.0 / 41.7 | 47.3 / 77.6 / 58.8 |
| Baseline | FCC | 22.1 / 89.0 / 35.4 | 6.4 / 82.9 / 11.9 | 38.3 / 70.8 / 49.7 | 13.2 / 80.2 / 22.6 |
| Ours | FCC | 88.3 / 19.3 / 31.7 | 63.3 / 29.0 / 39.8 | 51.7 / 73.2 / 60.6 | 64.6 / 30.0 / 41.0 |
| Baseline | GVC | 78.9 / 63.5 / 70.4 | 49.9 / 73.6 / 59.5 | 31.0 / 62.6 / 41.5 | 46.2 / 66.2 / 54.4 |
| Ours | GVC | 88.4 / 44.2 / 58.9 | 78.6 / 78.8 / 78.7 | 46.1 / 48.5 / 47.3 | 65.6 / 53.6 / 59.0 |
| Baseline | ECB+ & FCC | 71.8 / 77.2 / 74.4 | 41.2 / 46.5 / 43.7 | 31.0 / 71.6 / 43.3 | 42.6 / 62.0 / 50.5 |
| Ours | ECB+ & FCC | 83.3 / 86.2 / 84.7 | 59.0 / 70.8 / 64.4 | 49.2 / 87.0 / 62.9 | 60.9 / 80.6 / 69.4 |
| Baseline | ECB+ & GVC | 78.1 / 68.5 / 73.0 | 46.4 / 40.0 / 43.0 | | |