[论文翻译]Maverick: 高效精准的指代消解技术挑战近期趋势


原文地址:https://arxiv.org/pdf/2407.21489v1


Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends

Maverick: 高效精准的指代消解技术挑战近期趋势

Abstract

摘要

Large autoregressive generative models have emerged as the cornerstone for achieving the highest performance across several Natural Language Processing tasks. However, the urge to attain superior results has, at times, led to the premature replacement of carefully designed task-specific approaches without exhaustive experimentation. The Coreference Resolution task is no exception; all recent state-of-the-art solutions adopt large generative autoregressive models that outperform encoder-based discriminative systems. In this work, we challenge this recent trend by introducing Maverick, a carefully designed – yet simple – pipeline, which enables running a state-of-the-art Coreference Resolution system within the constraints of an academic budget, outperforming models with up to 13 billion parameters with as few as 500 million parameters. Maverick achieves state-of-the-art performance on the CoNLL-2012 benchmark, training with up to 0.006x the memory resources and obtaining a 170x faster inference compared to previous state-of-the-art systems. We extensively validate the robustness of the Maverick framework with an array of diverse experiments, reporting improvements over prior systems in data-scarce, long-document, and out-of-domain settings. We release our code and models for research purposes at https://github.com/SapienzaNLP/maverick-coref.

大型自回归生成模型已成为在多项自然语言处理任务中实现最高性能的基石。然而,追求卓越结果的冲动有时会导致未经充分实验就过早取代精心设计的任务特定方法。共指消解任务也不例外;所有最新的最先进解决方案都采用大型生成式自回归模型,其性能优于基于编码器的判别系统。在这项工作中,我们通过引入Maverick对这一趋势提出挑战——这是一个精心设计却简洁的流程,能在学术预算限制下运行最先进的共指消解系统,仅用5亿参数就超越高达130亿参数的模型。Maverick在CoNLL-2012基准测试中实现了最先进性能,训练所需内存资源仅为前代最优系统的0.006倍,推理速度提升达170倍。我们通过一系列多样化实验全面验证了Maverick框架的鲁棒性,在数据稀缺、长文档和跨领域场景中均报告了对先前系统的改进。研究代码和模型已发布于 https://github.com/SapienzaNLP/maverick-coref

1 Introduction

1 引言

As one of the core tasks in Natural Language Processing, Coreference Resolution aims to identify and group expressions (called mentions) that refer to the same entity (Karttunen, 1969). Given its crucial role in various downstream tasks, such as Knowledge Graph Construction (Li et al., 2020), Entity Linking (Kundu et al., 2018; Agarwal et al., 2022), Question Answering (Dhingra et al., 2018; Dasigi et al., 2019; Bhattacharjee et al., 2020; Chen and Durrett, 2021), Machine Translation (Stojanovski and Fraser, 2018; Voita et al., 2018; Ohtani et al., 2019; Yehudai et al., 2023) and Text Summarization (Falke et al., 2017; Pasunuru et al., 2021; Liu et al., 2021), inter alia, there is a pressing need for both high performance and efficiency. However, recent works in Coreference Resolution either explore methods to obtain reasonable performance optimizing time and memory efficiency (Kirstain et al., 2021; Dobrovolskii, 2021; Otmazgin et al., 2022), or strive to improve benchmark scores regardless of the increased computational demand (Bohnet et al., 2023; Zhang et al., 2023).

作为自然语言处理的核心任务之一,共指消解 (Coreference Resolution) 旨在识别并分组指向同一实体的表述(称为提及)(Karttunen, 1969)。由于其在知识图谱构建 (Li et al., 2020)、实体链接 (Kundu et al., 2018; Agarwal et al., 2022)、问答系统 (Dhingra et al., 2018; Dasigi et al., 2019; Bhattacharjee et al., 2020; Chen and Durrett, 2021)、机器翻译 (Stojanovski and Fraser, 2018; Voita et al., 2018; Ohtani et al., 2019; Yehudai et al., 2023) 和文本摘要 (Falke et al., 2017; Pasunuru et al., 2021; Liu et al., 2021) 等下游任务中的关键作用,对高性能与高效率的需求日益迫切。然而,当前共指消解研究要么探索在优化时间和内存效率的同时获得合理性能的方法 (Kirstain et al., 2021; Dobrovolskii, 2021; Otmazgin et al., 2022),要么不顾计算需求增加而竭力提升基准分数 (Bohnet et al., 2023; Zhang et al., 2023)。

Efficient solutions usually rely on discriminative formulations, frequently employing the mention-antecedent classification method proposed by Lee et al. (2017). These approaches leverage relatively small encoder-only transformer architectures (Joshi et al., 2020; Beltagy et al., 2020) to encode documents and build on top of them task-specific networks that ensure high speed and efficiency. On the other hand, performance-centered solutions are nowadays dominated by general-purpose large Sequence-to-Sequence models (Liu et al., 2022; Zhang et al., 2023). A notable example of this formulation, and currently the state of the art in Coreference Resolution, is Bohnet et al. (2023), which proposes a transition-based system that incrementally builds clusters of mentions by generating coreference links sentence by sentence in an autoregressive fashion. Although Sequence-to-Sequence solutions achieve remarkable performance, their autoregressive nature and the size of the underlying language models (up to 13B parameters) make them dramatically slower and memory-demanding compared to traditional encoder-only approaches. This not only makes their usage for downstream applications impractical, but also poses a significant barrier to their accessibility for a large number of users operating within an academic budget.

高效的解决方案通常依赖于判别式 (discriminative) 表述,常采用 Lee 等人 (2017) 提出的提及-先行词分类方法。这些方法利用相对较小的仅编码器架构 Transformer (Joshi 等人, 2020; Beltagy 等人, 2020) 对文档进行编码,并在此基础上构建任务专用网络以确保高速和效率。另一方面,以性能为中心的解决方案目前主要由通用大序列到序列 (Sequence-to-Sequence) 模型主导 (Liu 等人, 2022; Zhang 等人, 2023)。该表述的典型代表是 Bohnet 等人 (2023) 提出的基于转移的系统,它通过自回归方式逐句生成共指链接来逐步构建提及簇,这也是当前共指消解 (Coreference Resolution) 领域的最先进技术。尽管序列到序列解决方案实现了卓越性能,但其自回归特性及底层大语言模型 (参数规模高达 130 亿) 使得它们与传统仅编码器方法相比速度显著更慢、内存需求更高。这不仅导致其在下游应用中难以实用,也为学术预算范围内运作的大量用户设置了可及性障碍。

In this work we argue that discriminative encoder-only approaches for Coreference Resolution have not yet expressed their full potential and have been discarded too early in the urge to achieve state-of-the-art performance. In proposing Maverick, we strike an optimal balance between high performance and efficiency, a combination that was missing in previous systems. Our framework enables an encoder-only model to achieve top-tier performance while keeping the overall model size less than one-twentieth of the current state-of-the-art system, and training it with academic resources. Moreover, when further reducing the size of the underlying transformer encoder, Maverick performs in the same ballpark as encoder-only efficiency-driven solutions while improving speed and memory consumption. Finally, we propose a novel incremental Coreference Resolution method that, integrated into the Maverick framework, results in a robust architecture for out-of-domain, data-scarce, and long-document settings.

在本研究中,我们认为仅使用判别式编码器 (encoder-only) 的共指消解 (Coreference Resolution) 方法尚未充分发挥其潜力,并在追求最先进性能的过程中被过早放弃。通过提出 Maverick 框架,我们在高性能与效率之间实现了最佳平衡,这种组合在先前系统中一直缺失。我们的框架使仅编码器模型能够达到顶级性能,同时将整体模型规模控制在当前最先进系统的二十分之一以内,并仅需学术资源即可完成训练。此外,当进一步缩减底层 Transformer 编码器规模时,Maverick 的性能与仅追求效率的编码器方案相当,同时提升了速度和内存效率。最后,我们提出了一种新颖的增量式共指消解方法,将其集成到 Maverick 框架后,可构建出适用于跨领域、数据稀缺和长文档场景的鲁棒架构。

2 Related Work

2 相关工作

We now introduce well-established approaches to neural Coreference Resolution. Specifically, we first delve into the details of traditional discriminative solutions, including their incremental variations, and then present the recent paradigm shift for approaches based on large generative architectures.

我们现在介绍神经共指消解 (Coreference Resolution) 的成熟方法。具体而言,首先深入探讨传统判别式解决方案的细节(包括其增量变体),然后介绍基于大型生成架构方法的最新范式转变。

2.1 Discriminative models

2.1 判别式模型

Discriminative approaches tackle the Coreference Resolution task as a classification problem, usually employing encoder-only architectures. The pioneering works of Lee et al. (2017, 2018) introduced the first end-to-end discriminative system for Coreference Resolution, the Coarse-to-Fine model. First, it involves a mention extraction step, in which the spans most likely to be coreference mentions are identified. This is followed by a mention-antecedent classification step where, for each extracted mention, the model searches for its most probable antecedent (i.e., the extracted span that appears before in the text). This pipeline, composed of mention extraction and mention-antecedent classification steps, has been adopted with minor modifications in many subsequent works, which we refer to as Coarse-to-Fine models.

判别式方法将共指消解任务视为分类问题,通常采用仅编码器架构。Lee等人的开创性工作(2017, 2018)提出了首个端到端的共指消解判别式系统——由粗到精模型。首先进行提及抽取步骤,识别最可能成为共指提及的文本片段;随后执行提及-先行词分类步骤,模型为每个抽取的提及寻找其最可能的先行词(即文本中先前出现的抽取片段)。这种由提及抽取和提及-先行词分类组成的流程框架,在后继诸多研究中仅经微小调整便被沿用,我们统称为由粗到精模型。

Coarse-to-Fine Models Among the works that build upon the Coarse-to-Fine formulation, Lee et al. (2018), Joshi et al. (2019) and Joshi et al. (2020) experimented with changing the underlying document encoder, utilizing ELMo (Peters et al.,

粗到精模型
在基于粗到精 (Coarse-to-Fine) 框架的研究中,Lee等人 (2018)、Joshi等人 (2019) 和Joshi等人 (2020) 分别尝试将底层文档编码器替换为ELMo (Peters等人,

2018), BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2020), respectively, achieving remarkable score improvements on the English OntoNotes (Pradhan et al., 2012). Similarly, Kirstain et al. (2021) introduced s2e-coref, which reduces the high memory footprint of SpanBERT by leveraging the LongFormer (Beltagy et al., 2020) sparse-attention mechanism. Based on the same architecture, Otmazgin et al. (2023) analyzed the impact of having multiple experts score different linguistically motivated categories (e.g., pronouns-nouns, nouns-nouns, etc.). While the foregoing works have been able to modernize the original Coarse-to-Fine formulation, training their architectures on the OntoNotes dataset still requires a considerable amount of memory. This occurs because they rely on the traditional Coarse-to-Fine pipeline that, as we cover in Section 3.1, has a large memory overhead and is based on manually-set thresholds to regulate memory usage.

2018)、BERT (Devlin等人, 2019) 和SpanBERT (Joshi等人, 2020),在英语OntoNotes (Pradhan等人, 2012) 上取得了显著的分数提升。类似地,Kirstain等人 (2021) 提出了s2e-coref,通过利用LongFormer (Beltagy等人, 2020) 的稀疏注意力机制,降低了SpanBERT的高内存占用。基于相同架构,Otmazgin等人 (2023) 分析了让多个专家对不同语言学动机类别(如代词-名词、名词-名词等)进行评分的影响。尽管上述工作对原始Coarse-to-Fine框架进行了现代化改造,但在OntoNotes数据集上训练这些架构仍需要大量内存。这是因为它们依赖于传统的Coarse-to-Fine流程;正如我们在3.1节所述,该流程具有较高的内存开销,并依赖手动设置的阈值来调节内存使用。

Incremental Models Discriminative systems also include incremental techniques. Incremental Coreference Resolution has a strong cognitive grounding: research on the “garden-path” effect shows that humans resolve referring expressions incrementally (Altmann and Steedman, 1988).

增量模型
判别式系统还包括增量技术。增量共指消解 (Incremental Coreference Resolution) 具有坚实的认知基础:关于"花园路径"效应 (garden-path effect) 的研究表明,人类会以增量方式解析指代表达式 (Altmann and Steedman, 1988)。

A seminal work that proposed an automatic incremental system was that of Webster and Curran (2014), which introduced a clustering approach based on the shift-reduce paradigm. In this formulation, for each mention, a classifier decides whether to SHIFT it into a singleton (i.e., single mention cluster) or to REDUCE it within an existing cluster. The same approach has recently been reintroduced in ICoref (Xia et al., 2020) and longdoc (Toshniwal et al., 2021), which adopted SpanBERT and LongFormer, respectively. In these works the mention extraction step is identical to that of Coarse-to-Fine models. On the other hand, the mention clustering step is performed by using a linear classifier that scores each mention against a vector representation of previously built clusters, in an incremental fashion. This method ensures constant memory usage since cluster representations are updated with a learnable function. In Section 3.2 we present a novel performance-driven incremental method that obtains superior performance and generalization capabilities, in which we adopt a lightweight transformer architecture that retains the mention representations.

Webster和Curran (2014) 的开创性工作提出了基于移进-归约范式的自动增量聚类系统。该框架中,分类器会为每个指称项判断是将其SHIFT为单例簇(即仅含单个指称项的簇),还是REDUCE到现有簇中。ICoref (Xia et al., 2020) 和longdoc (Toshniwal et al., 2021) 近期分别采用SpanBERT和LongFormer重新引入了该方法,其指称项抽取步骤与由粗到精 (Coarse-to-Fine) 模型完全一致。而指称项聚类步骤则通过线性分类器对每个指称项与已构建簇的向量表征进行增量式评分。由于簇表征通过可学习函数更新,该方法能确保内存占用量恒定。在第3.2节中,我们提出了一种新型性能驱动的增量方法:通过采用保留指称项表征的轻量级Transformer架构,该方法获得了更优的性能与泛化能力。

2.2 Sequence-to-Sequence models

2.2 序列到序列模型

Recent state-of-the-art Coreference Resolution systems all employ autoregressive generative approaches. However, an early example of a Sequence-to-Sequence model, TANL (Paolini et al., 2021), failed to achieve competitive performance on OntoNotes. The first system to show that the autoregressive formulation was competitive was ASP (Liu et al., 2022), which outperformed encoder-only discriminative approaches. ASP is an autoregressive pointer-based model that generates actions for mention extraction (bracket pairing) and then conditions the next step to generate coreference links. Notably, the breakthrough achieved by ASP is not only due to its formulation but also to its usage of large generative models. Indeed, the success of their approach is strictly correlated with the underlying model size, since, when using models with a comparable number of parameters, the performance is significantly lower than encoder-only approaches. The same occurs in Zhang et al. (2023), a fully-seq2seq approach where a model learns to generate a formatted sequence encoding coreference notation, in which they report a strong positive correlation between performance and model sizes.

最新的最先进共指消解系统均采用自回归生成方法。然而,早期的Seq2Seq模型TANL (Paolini et al., 2021) 未能在OntoNotes上取得有竞争力的表现。首个证明自回归框架具有竞争力的是ASP (Liu et al., 2022),其性能超越了仅编码器的判别式方法。ASP是基于自回归指针的模型,首先生成提及抽取动作(括号配对),再以此为条件在下一步生成共指链接。值得注意的是,ASP的突破不仅源于其框架设计,更得益于大型生成模型的使用:当使用参数量相当的模型时,其性能显著低于仅编码器方法。Zhang et al. (2023) 采用全seq2seq方法让模型学习生成编码共指标注的格式化序列,同样报告了性能与模型规模之间的强正相关。

Finally, the current state of the art on the OntoNotes benchmark is held by Link-Append (Bohnet et al., 2023), a transition-based system that incrementally builds clusters exploiting a multi-pass Sequence-to-Sequence architecture. This approach incrementally maps the mentions in previously coreference-annotated sentences to system actions for the current sentence, using the same shift-reduce incremental paradigm presented in Section 2.1. This method obtains state-of-the-art performance at the cost of using a 13B-parameter model and processing one sentence at a time, drastically increasing the need for computational power. While the foregoing models ensure superior performance compared to previous discriminative approaches, using them for inference is out of reach for many users, not to mention the exorbitant cost of training them from scratch.

最后,OntoNotes基准测试的当前最先进系统由Link-Append (Bohnet et al., 2023)保持,这是一个基于转移的系统,利用多遍 (multi-pass) Sequence-to-Sequence架构逐步构建聚类。该方法采用与第2.1节相同的shift-reduce增量范式,逐步将先前共指标注句子中的提及映射到当前句子的系统动作。此方法以使用130亿参数模型和逐句处理为代价获得最先进性能,大幅增加了对计算能力的需求。虽然上述模型相比之前的判别式方法确保了更优性能,但对许多用户而言,使用它们进行推理遥不可及,更不用说从头训练它们的巨额成本了。

3 Methodology

3 方法论

In this section, we present the Maverick framework: we propose replacing the preprocessing and training strategy of Coarse-to-Fine models with a novel pipeline that improves the training and inference efficiency of Coreference Resolution systems. Furthermore, with the Maverick Pipeline, we eliminate the dependency on long-standing manually-set hyperparameters that regulate memory usage. Finally, building on top of our pipeline, we propose three models: Maverick$_{s2e}$ and Maverick$_{mes}$, which adopt a mention-antecedent classification technique, and Maverick$_{incr}$, which is based upon a novel incremental formulation.

在本节中,我们提出Maverick框架:通过一种创新流程替代由粗到精 (Coarse-to-Fine) 模型的预处理与训练策略,从而提升共指消解系统的训练和推理效率。该流程还消除了对长期以来用于调控内存使用的人工设定超参数的依赖。最后,基于此流程,我们提出三个模型:采用提及-先行词分类技术的Maverick$_{s2e}$和Maverick$_{mes}$,以及基于新型增量式方法的Maverick$_{incr}$。

3.1 Maverick Pipeline

3.1 Maverick Pipeline

The Maverick Pipeline combines i) a novel mention extraction method, ii) an efficient mention regularization technique, and iii) a new mention pruning strategy.

Maverick Pipeline结合了i)一种新颖的提及提取方法、ii)高效的提及正则化技术,以及iii)新的提及剪枝策略。

Mention Extraction When it comes to extracting mentions from a document $D$ , there are different strategies to model the probability that a span contains a mention. Several previous works follow the Coarse-to-Fine formulation presented in Section 2.1, which consists of scoring all the possible spans in $D$ . This entails a quadratic computational cost in relation to the input length, which they mitigate by introducing several pruning techniques.

提及抽取
在从文档 $D$ 中抽取提及内容时,存在多种策略来建模某个文本片段包含提及的概率。先前的一些研究遵循第2.1节提出的由粗到精 (Coarse-to-Fine) 方法,该方法需要对 $D$ 中所有可能的文本片段进行评分。这会导致与输入长度相关的二次计算成本,研究者们通过引入多种剪枝技术来缓解这一问题。

In this work, we employ a different strategy. We extract coreference mentions by first identifying all the possible starts of a mention, and then, for each start, extracting its possible end. To extract start indices, we first compute the hidden representation $(x_{1},\ldots,x_{n})$ of the tokens $(t_{1},\ldots,t_{n})\in D$ using a transformer encoder, and then use a fully-connected layer $F$ to compute the probability of each $t_{i}$ being the start of a mention as:

在本工作中,我们采用了不同的策略:首先识别所有可能的提及起始位置,然后为每个起始位置提取其可能的结束位置,从而抽取共指提及。为提取起始索引,我们首先使用Transformer编码器计算token $(t_{1},\ldots,t_{n})\in D$ 的隐藏表示 $(x_{1},\ldots,x_{n})$,然后使用全连接层 $F$ 计算每个 $t_{i}$ 作为提及起始位置的概率:

$$
\begin{aligned}
F_{start}(x) &= W_{start}'\,(\mathrm{GeLU}(W_{start}\,x)) \\
p_{start}(t_i) &= \sigma(F_{start}(x_i))
\end{aligned}
$$


with $W_{start}'$, $W_{start}$ being the learnable parameters, and $\sigma$ the sigmoid function. For each start of a mention $t_{s}$, i.e., those tokens having $p_{start}(t_{s})>0.5$, we then compute the probability of its subsequent tokens $t_{j}$, with $s\leq j$, being the end of a mention that starts with $t_{s}$. We follow the same process as that of the mention start classification, but we condition the prediction on the starting token by concatenating the start, $x_{s}$, and end, $x_{j}$, hidden representations before the linear classifier:

其中 $W_{start}'$ 和 $W_{start}$ 是可学习参数,$\sigma$ 为sigmoid函数。对于每个提及起始位置 $t_{s}$(即满足 $p_{start}(t_{s})>0.5$ 的token),我们计算其后续token $t_{j}$($s\leq j$)作为该提及结束位置的概率。我们采用与提及起始分类相同的方法,但通过将起始隐藏表示 $x_{s}$ 和结束隐藏表示 $x_{j}$ 拼接后输入线性分类器,使预测以起始token为条件:

$$
\begin{aligned}
F_{end}(x, x') &= W_{end}'\,(\mathrm{GeLU}(W_{end}\,[x, x'])) \\
p_{end}(t_j \mid t_s) &= \sigma(F_{end}(x_s, x_j))
\end{aligned}
$$


with $W_{end}'$, $W_{end}$ being learnable parameters. This formulation handles overlapping mentions since, for each start $t_{s}$, we can find multiple ends $t_{e}$ (i.e., those that have $p_{end}(t_{j} \mid t_{s})>0.5$).

其中 $W_{end}'$ 和 $W_{end}$ 是可学习参数。该公式能够处理重叠提及:对于每个起始位置 $t_{s}$,可以找到多个结束位置 $t_{e}$(即满足 $p_{end}(t_{j} \mid t_{s})>0.5$ 的位置)。
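To make the two extraction heads concrete, the sketch below shows one way to implement them in PyTorch. It is a minimal illustration, not the authors' code: the class name, hidden size, and single-document (unbatched) processing are our assumptions; the official implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class MentionExtractor(nn.Module):
    """Sketch of the start/end mention-extraction heads (sizes are assumptions)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # F_start: W'_start(GeLU(W_start x))
        self.start_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.GELU(), nn.Linear(hidden_size, 1)
        )
        # F_end: W'_end(GeLU(W_end [x_s, x_j])), conditioned on the start token
        self.end_head = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size), nn.GELU(), nn.Linear(hidden_size, 1)
        )

    def forward(self, x: torch.Tensor) -> list[tuple[int, int]]:
        # x: (seq_len, hidden_size) token states from the transformer encoder
        p_start = torch.sigmoid(self.start_head(x)).squeeze(-1)      # p_start(t_i)
        starts = (p_start > 0.5).nonzero(as_tuple=True)[0].tolist()  # candidate starts
        mentions = []
        for s in starts:
            # concatenate x_s with every x_j (s <= j) and score each pair
            pairs = torch.cat([x[s].expand(x.size(0) - s, -1), x[s:]], dim=-1)
            p_end = torch.sigmoid(self.end_head(pairs)).squeeze(-1)  # p_end(t_j | t_s)
            for off in (p_end > 0.5).nonzero(as_tuple=True)[0].tolist():
                mentions.append((s, s + off))  # multiple ends per start: overlaps allowed
        return mentions
```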

Previous works already adopted a linear layer to compute start and end mention scores for each possible mention, i.e., s2e-coref (Kirstain et al., 2021) and LingMess (Otmazgin et al., 2023). However, our mention extraction technique differs from previous approaches since i) we produce two probabilities ($0 < p < 1$) instead of two unbounded scores, and ii) we use the computed start probability to filter out possible mentions, which reduces by a factor of 9 the number of mentions considered compared to existing Coarse-to-Fine systems (Table 1, first row).

先前的研究已采用线性层为每个可能的提及计算起始和结束分数,例如 s2e-coref (Kirstain et al., 2021) 和 LingMess (Otmazgin et al., 2023)。然而,我们的提及抽取技术与之前的方法不同:i) 我们生成两个概率($0 < p < 1$)而非两个无界分数;ii) 我们使用计算出的起始概率过滤可能的提及,与现有的由粗到精系统相比,将考虑的提及数量缩减至约九分之一(表 1 第一行)。

Mention Regularization To further reduce the computational demand of this process, in the Maverick Pipeline we use the end-of-sentence (EOS) mention regularization strategy: after extracting the span start, we consider only the tokens up to the nearest EOS as possible mention end candidates. Since annotated mentions never span across sentences, EOS mention regularization prunes the number of mentions considered without any loss of information. While this heuristic was initially introduced in the implementation of Lee et al. (2018), all the recent Coarse-to-Fine works have abandoned it in favor of maximum span-length regularization, a manually-set hyperparameter that regulates a threshold to filter out spans that exceed a certain length. This implies a large overhead of unnecessary computations and introduces a structural bias against long mentions that exceed the fixed length. In our work, we not only reintroduce EOS mention regularization, but also study its contribution in terms of efficiency, as reported in Table 1, second row.

提及正则化
为了进一步降低这一过程的计算需求,在Maverick Pipeline中我们采用句尾(EOS)提及正则化策略:提取跨度起始位置后,仅将最近EOS前的token视为可能的提及结束候选。由于标注提及从不跨句,EOS提及正则化能在不丢失信息的前提下减少待考虑的提及数量。该启发式方法最初由Lee等人(2018)在实现中提出,但近期所有Coarse-to-Fine模型都弃用了它,转而采用最大跨度长度正则化——这种人工设置的超参数通过阈值过滤超过特定长度的跨度。这不仅带来大量不必要的计算开销,还会引入结构性偏差,即忽略超过固定长度的长提及。我们在工作中不仅重新引入EOS提及正则化,还通过表1第二行数据量化分析了其对效率的提升贡献。

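The EOS heuristic itself amounts to a few lines of code. Below is a minimal sketch assuming sentence boundaries are given as a sorted list of token indices; the helper name is ours.

```python
import bisect

def eos_candidate_ends(start: int, eos_indices: list[int], seq_len: int) -> range:
    """Only tokens up to the nearest end-of-sentence can close a span starting
    at `start`, since gold mentions never cross sentence boundaries."""
    i = bisect.bisect_left(eos_indices, start)
    sentence_end = eos_indices[i] if i < len(eos_indices) else seq_len - 1
    return range(start, sentence_end + 1)

# With EOS tokens at positions 9 and 21, a span starting at token 12 only
# considers ends 12..21, with no fixed span-length threshold involved.
assert list(eos_candidate_ends(12, [9, 21], 40)) == list(range(12, 22))
```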

Mention Pruning After the mention extraction step, as a result of the Maverick Pipeline, we consider an 18x lower number of candidate mentions for the successive mention clustering phase (Table 1). This step consists of computing, for each mention, the probability of all its antecedents being in the same cluster, incurring a quadratic computational cost. Within the Coarse-to-Fine formulation, this high computational cost is mitigated by considering only the top $k$ mentions according to their probability score, where $k$ is a manually set hyperparameter. Since after our mention extraction step we obtain probabilities for a very concise number of mentions, we consider only mentions classified as probable candidates (i.e., those with $p_{start}>0.5$ and $p_{end}>0.5$), reducing the number of mention pairs considered by a factor of 10. In Table 1, we compare the previous Coarse-to-Fine formulation with the new Maverick Pipeline.

提及剪枝
在提及抽取步骤之后,得益于Maverick Pipeline,后续提及聚类阶段需要考虑的候选提及数量降低为原来的1/18(表1)。该步骤需要为每个提及计算其所有先行词与之同簇的概率,这会带来二次方的计算成本。在由粗到精 (Coarse-to-Fine) 框架中,这一高计算成本通过仅考虑概率得分最高的 $k$ 个提及来缓解,其中 $k$ 是手动设置的超参数。由于我们的提及抽取步骤已为数量很少的提及给出概率,我们仅考虑被分类为可能候选的提及(即满足 $p_{start}>0.5$ 且 $p_{end}>0.5$ 的提及),从而将考虑的提及对数量缩减至约十分之一。在表1中,我们将先前的由粗到精框架与新的Maverick Pipeline进行了比较。

Table 1: Comparison between the Coarse-to-Fine pipeline and the Maverick Pipeline in terms of the average number of mentions considered in the mention extraction step (top) and the average number of mention pairs considered in the mention clustering step (bottom). The statistics are computed on the OntoNotes dev set, and refer to the hyperparameters proposed in Lee et al. (2018), which were unchanged by subsequent Coarse-to-Fine works, i.e., span-len = 30, top-k = 0.4.

| Step | Coarse-to-Fine | Maverick | Reduction |
|---|---|---|---|
| Ment. Extraction | Enumeration: 183,577 | (i) Start-End: 20,565 | -8.92x |
| | (+) Span-length: 14,265 | (ii) (+) EOS: 777 | -18.3x |
| Ment. Clustering | Top-k: 29,334 | (iii) Pred-only: 2,713 | -10.81x |

表 1: 粗到精 (Coarse-to-Fine) 流程与 Maverick 流程在提及抽取步骤考虑的平均提及数量(上)及提及聚类步骤考虑的平均提及对数量(下)的对比。统计数据基于 OntoNotes 开发集计算,并采用 Lee et al. (2018) 提出的超参数(后续粗到精研究未作改动),即 span-len = 30,top-k = 0.4。

| 步骤 | Coarse-to-Fine | Maverick | 缩减 |
|---|---|---|---|
| 提及抽取 | 枚举: 183,577 | (i) 起止点: 20,565 | -8.92x |
| | (+) 跨度长度: 14,265 | (ii) (+) EOS: 777 | -18.3x |
| 提及聚类 | Top-k: 29,334 | (iii) 仅预测: 2,713 | -10.81x |

3.2 Mention Clustering

3.2 提及聚类

As a result of the Maverick Pipeline, we obtain a set of candidate mentions $M=(m_{1},m_{2},\ldots,m_{l})$, for which we propose three different clustering techniques: Maverick$_{s2e}$ and Maverick$_{mes}$, which use two well-established Coarse-to-Fine mention-antecedent techniques, and Maverick$_{incr}$, which adopts a novel incremental technique that leverages a light transformer architecture.

通过Maverick Pipeline的处理,我们得到一组候选提及 $M=(m_{1},m_{2},\ldots,m_{l})$,并针对这些提及提出了三种不同的聚类技术:采用两种成熟的由粗到精提及-先行词技术的Maverick$_{s2e}$和Maverick$_{mes}$,以及采用新型增量技术、利用轻量级Transformer架构的Maverick$_{incr}$。

Mention-Antecedent models The first proposed model, Maverick$_{s2e}$, adopts an equivalent mention clustering strategy to Kirstain et al. (2021): given a mention $m_{i}=(x_{s},x_{e})$ and its antecedent $m_{j}=(x_{s'},x_{e'})$, with their start and end token hidden states, we use two fully-connected layers to model their corresponding representations:

提及-先行词模型
首个提出的模型Maverick$_{s2e}$采用了与Kirstain等人 (2021) 等效的提及聚类策略:给定提及 $m_{i}=(x_{s},x_{e})$ 及其先行词 $m_{j}=(x_{s'},x_{e'})$,利用它们的起始和结束token隐藏状态,通过两个全连接层建模其对应表示:

$$
\begin{aligned}
F_s(x) &= W_s'\,(\mathrm{GeLU}(W_s\,x)) \\
F_e(x) &= W_e'\,(\mathrm{GeLU}(W_e\,x))
\end{aligned}
$$


we then calculate their probability to be in the same cluster as:

我们随后计算它们属于同一集群的概率为:

$$
\begin{aligned}
p_c(m_i, m_j) = \sigma\big( & F_s(x_s) \cdot W_{ss} \cdot F_s(x_{s'}) \\
 {}+{} & F_e(x_e) \cdot W_{ee} \cdot F_e(x_{e'}) \\
 {}+{} & F_s(x_s) \cdot W_{se} \cdot F_e(x_{e'}) \\
 {}+{} & F_e(x_e) \cdot W_{es} \cdot F_s(x_{s'}) \big)
\end{aligned}
$$


with $W_{ss}$, $W_{ee}$, $W_{se}$, $W_{es}$ being four learnable matrices and $W_{s}$, $W_{s}'$, $W_{e}$, $W_{e}'$ the learnable parameters of the two fully-connected layers.

其中 $W_{ss}$、$W_{ee}$、$W_{se}$、$W_{es}$ 为四个可学习矩阵,$W_{s}$、$W_{s}'$、$W_{e}$、$W_{e}'$ 为两个全连接层的可学习参数。
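The four bilinear terms map almost one-to-one onto code. The following PyTorch sketch is illustrative only; the class name, hidden size, and initialization are assumptions, and the reference implementation is in the repository.

```python
import torch
import torch.nn as nn

class S2EPairScorer(nn.Module):
    """Sketch of the Maverick_s2e mention-antecedent score p_c(m_i, m_j)."""

    def __init__(self, h: int = 768):
        super().__init__()
        self.f_s = nn.Sequential(nn.Linear(h, h), nn.GELU(), nn.Linear(h, h))  # F_s
        self.f_e = nn.Sequential(nn.Linear(h, h), nn.GELU(), nn.Linear(h, h))  # F_e
        # The four bilinear matrices W_ss, W_ee, W_se, W_es
        self.w = nn.ParameterDict(
            {k: nn.Parameter(torch.empty(h, h)) for k in ("ss", "ee", "se", "es")}
        )
        for p in self.w.values():
            nn.init.xavier_uniform_(p)

    def forward(self, x_s, x_e, x_s2, x_e2) -> torch.Tensor:
        s_i, e_i = self.f_s(x_s), self.f_e(x_e)    # mention m_i = (x_s, x_e)
        s_j, e_j = self.f_s(x_s2), self.f_e(x_e2)  # antecedent m_j = (x_s', x_e')
        score = (s_i @ self.w["ss"] @ s_j + e_i @ self.w["ee"] @ e_j
                 + s_i @ self.w["se"] @ e_j + e_i @ self.w["es"] @ s_j)
        return torch.sigmoid(score)                # p_c(m_i, m_j)
```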

A similar formulation is adopted in Maverick$_{mes}$ where, instead of using only one generic mention-pair scorer, we use 6 different scorers that handle linguistically motivated categories, as introduced by Otmazgin et al. (2023). We detect which category $k$ a pair of mentions $m_{i}$ and $m_{j}$ belongs to (e.g., if $m_{i}$ is a pronoun and $m_{j}$ is a proper noun, the category will be PRONOUN-ENTITY) and use a category-specific scorer to compute $p_{c}$. A complete description of the process along with the list of categories can be found in Appendix A.

Maverick$_{mes}$采用了类似的公式,不同之处在于我们并非仅使用一个通用的提及对 (mention pair) 评分器,而是如Otmazgin等人 (2023) 所述,采用6个针对不同语言学动机类别的专用评分器。我们会检测提及对 $m_{i}$ 和 $m_{j}$ 所属的类别 $k$(例如,若 $m_{i}$ 是代词而 $m_{j}$ 是专有名词,则该类别为PRONOUN-ENTITY),并使用对应类别的专用评分器计算 $p_{c}$。完整处理流程及类别列表详见附录A。
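For illustration, a simplified routing sketch follows: it keys a mention pair to a category and picks that category's scorer. We show only three stand-in categories with a toy pronoun check (the paper's six categories and their linguistic features are in Appendix A), and reuse the `S2EPairScorer` sketch from above.

```python
PRONOUNS = {"he", "she", "it", "they", "we", "you", "i"}  # abbreviated toy list

def pair_category(m_i_text: str, m_j_text: str) -> str:
    """Map a mention pair to a (simplified) linguistically motivated category."""
    i_pron = m_i_text.lower() in PRONOUNS
    j_pron = m_j_text.lower() in PRONOUNS
    if i_pron and j_pron:
        return "PRONOUN-PRONOUN"
    if i_pron or j_pron:
        return "PRONOUN-ENTITY"   # e.g. m_i a pronoun, m_j a proper noun
    return "ENTITY-ENTITY"

# One scorer per category instead of a single generic one.
scorers = {c: S2EPairScorer() for c in
           ("PRONOUN-PRONOUN", "PRONOUN-ENTITY", "ENTITY-ENTITY")}
# p_c = scorers[pair_category("he", "Barack Obama")](x_s, x_e, x_s2, x_e2)
```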

Incremental model Finally, we introduce a novel incremental approach to tackle the mention clustering step, namely Maverick$_{incr}$, which follows the standard shift-reduce paradigm introduced in Section 2.1. Differently from previous neural incremental techniques (i.e., ICoref (Xia et al., 2020) and longdoc (Toshniwal et al., 2021)), which use a linear classifier to obtain the clustering probability between each mention and a fixed-length vector representation of previously built clusters, Maverick$_{incr}$ leverages a lightweight transformer model to attend to previous clusters, for which we retain the mentions' hidden representations. Specifically, we compute the hidden representations $(h_{1},\ldots,h_{l})$ for all the candidate mentions in $M$ using a fully-connected layer on top of the concatenation of their start and end token representations. We first assign the first mention $m_{1}$ to the first cluster $c_{1}=(m_{1})$. Then, for each mention $m_{i}\in M$ at step $i$, we obtain the probability of $m_{i}$ being in a certain cluster $c_{j}$ by encoding $h_{i}$ with all the representations of the mentions contained in the cluster $c_{j}$ using a transformer architecture. We use the first special token ([CLS]) of a single-layer transformer architecture $T$ to obtain the score $S(m_{i},c_{j})$ of $m_{i}$ being in the cluster $c_{j}=(m_{f},\ldots,m_{g})$, with $f\leq g<i$, as:

增量模型
最后,我们提出一种新颖的增量方法来处理提及聚类步骤,即Maverick$_{incr}$,它遵循第2.1节介绍的标准移进-归约范式。与之前使用线性分类器计算每个提及与固定长度聚类向量表示之间聚类概率的神经增量技术(如ICoref (Xia et al., 2020) 和longdoc (Toshniwal et al., 2021))不同,Maverick$_{incr}$采用轻量级Transformer模型关注先前聚类,并保留其中提及的隐藏表示。具体而言,我们通过一个全连接层对提及起止token表示的拼接进行处理,计算候选提及集 $M$ 中所有提及的隐藏表示 $(h_{1},\ldots,h_{l})$。首先将第一个提及 $m_{1}$ 分配至初始聚类 $c_{1}=(m_{1})$。随后在步骤 $i$ 处理每个提及 $m_{i}\in M$ 时,通过Transformer架构将 $h_{i}$ 与聚类 $c_{j}$ 所含提及的全部表示共同编码,得到 $m_{i}$ 属于聚类 $c_{j}$ 的概率。我们使用单层Transformer架构 $T$ 的首个特殊token([CLS])计算 $m_{i}$ 属于聚类 $c_{j}=(m_{f},\ldots,m_{g})$($f\leq g<i$)的得分 $S(m_{i},c_{j})$:

$$
S(m_i, c_j) = W_c \cdot \mathrm{ReLU}\big(T_{CLS}(h_i, h_f, \dots, h_g)\big)
$$


Finally, we compute the probability of $m_{i}$ belonging to $c_{j}$ as:

最后,我们计算 $m_{i}$ 属于 $c_{j}$ 的概率为:

$$
p_c\big(m_i \in c_j \mid c_j = (m_f, \ldots, m_g)\big) = \sigma\big(S(m_i, c_j)\big)
$$


We calculate this probability for each cluster $c_{j}$ up to step $i$ . We assign the mention $m_{i}$ to the most probable cluster $c_{j}$ having $p_{c}(m_{i}\in c_{j})>0.5$ if one exists, or we create a new singleton cluster containing $m_{i}$ .

我们为每个聚类 $c_{j}$ 计算截至步骤 $i$ 时的该概率。若存在 $p_{c}(m_{i}\in c_{j})>0.5$ 的聚类 $c_{j}$ ,则将提及 $m_{i}$ 分配给概率最高的聚类,否则为 $m_{i}$ 创建新的单例聚类。
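The following sketch puts the transformer scorer and the shift-reduce loop together. It is a hedged illustration: the hyperparameters (hidden size, number of attention heads) and the greedy inference loop are our assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class IncrementalClusterer(nn.Module):
    """Sketch of Maverick_incr: score a mention against each cluster by
    attending over the cluster's retained mention representations."""

    def __init__(self, h: int = 768):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(h))                  # [CLS] embedding
        self.t = nn.TransformerEncoderLayer(d_model=h, nhead=8)  # single-layer T
        self.w_c = nn.Linear(h, 1)

    def score(self, h_i: torch.Tensor, cluster: list[torch.Tensor]) -> torch.Tensor:
        # sequence: [CLS], h_i, h_f, ..., h_g ; read out the [CLS] state
        seq = torch.stack([self.cls, h_i, *cluster]).unsqueeze(1)  # (L, 1, h)
        t_cls = self.t(seq)[0, 0]                                  # T_CLS(h_i, h_f..h_g)
        return self.w_c(torch.relu(t_cls)).squeeze(-1)             # S(m_i, c_j)

    @torch.no_grad()
    def cluster(self, mentions: list[torch.Tensor]) -> list[list[int]]:
        clusters, reprs = [[0]], [[mentions[0]]]                   # c_1 = (m_1)
        for i, h_i in enumerate(mentions[1:], start=1):
            p = torch.stack([torch.sigmoid(self.score(h_i, c)) for c in reprs])
            best = int(p.argmax())
            if p[best] > 0.5:                # REDUCE: join the most probable cluster
                clusters[best].append(i)
                reprs[best].append(h_i)
            else:                            # SHIFT: open a new singleton cluster
                clusters.append([i])
                reprs.append([h_i])
        return clusters
```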

As we show in Sections 5.3 and 5.5, this formulation obtains better results than previous incremental methods, and is beneficial when dealing with long-document and out-of-domain settings.

正如我们在第5.3节和第5.5节所示,该公式比以往的增量方法取得了更好的结果,并且在处理长文档和跨领域场景时具有优势。

3.3 Training

3.3 训练

To train a Maverick model, we optimize the sum of three binary cross-entropy losses:

为了训练Maverick模型,我们优化了三个二元交叉熵损失的总和:

$$
\begin{aligned}
L_{coref} &= L_{start} + L_{end} + L_{clust} \\
L_{start} &= \sum_{i=1}^{N} -\big(y_i \log p_{start}(t_i) + (1-y_i)\log(1-p_{start}(t_i))\big) \\
L_{end} &= \sum_{s=1}^{S} \sum_{j=1}^{E_s} -\big(y_j \log p_{end}(t_j \mid t_s) + (1-y_j)\log(1-p_{end}(t_j \mid t_s))\big)
\end{aligned}
$$


where $N$ is the sequence length, $S$ is the number of starts, $E_{s}$ is the number of possible ends for a start $s$, and $p_{start}(t_{i})$ and $p_{end}(t_{j} \mid t_{s})$ are those defined in Section 3.1.

其中 $N$ 是序列长度,$S$ 是起始点数量,$E_{s}$ 是起始点 $s$ 对应的可能终点数量,$p_{start}(t_{i})$ 和 $p_{end}(t_{j}|t_{s})$ 的定义见第3.1节。

Finally, $L_{clust}$ is the loss for the mention clustering step. Since we experiment with two different mention clustering formulations, we use a different loss for each clustering technique, namely $L_{clust}^{ant}$ for the mention-antecedent models, i.e., Maverick$_{s2e}$ and Maverick$_{mes}$, and $L_{clust}^{incr}$ for the incremental model, i.e., Maverick$_{incr}$:

最后,$L_{clust}$ 是提及聚类步骤的损失。由于我们实验了两种不同的提及聚类公式,因此对每种聚类技术使用不同的损失:对提及-先行词模型(即Maverick$_{s2e}$和Maverick$_{mes}$)使用 $L_{clust}^{ant}$,对增量模型(即Maverick$_{incr}$)使用 $L_{clust}^{incr}$:

$$
\begin{aligned}
L_{clust}^{ant} &= \sum_{i=1}^{|M|} \sum_{j=1}^{|M|} -\big(y_i \log p_c(m_i \mid m_j) + (1-y_i)\log(1-p_c(m_i \mid m_j))\big) \\
L_{clust}^{incr} &= \sum_{i=1}^{|M|} \sum_{j=1}^{C_i} -\big(y_i \log p_c(m_i \in c_j) + (1-y_i)\log(1-p_c(m_i \in c_j))\big)
\end{aligned}
$$


Table 2: Dataset statistics: number of documents in each dataset split, average number of words and mentions per document, and singletons percentage.

| Dataset | #Train | #Dev | #Test | Tokens | Mentions | %Sing |
|---|---|---|---|---|---|---|
| OntoNotes | 2802 | 343 | 348 | 467 | 56 | 0 |
| LitBank | 80 | 10 | 10 | 2105 | 291 | 19.8 |
| PreCo | 36120 | 500 | 500 | 337 | 105 | 52.0 |
| GAP | – | – | 2000 | 95 | 3 | – |
| WikiCoref | – | – | 30 | 1996 | 230 | 0 |

表 2: 数据集统计:各数据集划分中的文档数量、每文档平均词数和提及数,以及单例百分比。

| Dataset | #Train | #Dev | #Test | Tokens | Mentions | %Sing |
|---|---|---|---|---|---|---|
| OntoNotes | 2802 | 343 | 348 | 467 | 56 | 0 |
| LitBank | 80 | 10 | 10 | 2105 | 291 | 19.8 |
| PreCo | 36120 | 500 | 500 | 337 | 105 | 52.0 |
| GAP | – | – | 2000 | 95 | 3 | – |
| WikiCoref | – | – | 30 | 1996 | 230 | 0 |

where $|M|$ is the number of extracted mentions, $C_{i}$ is the number of clusters created up to step $i$, and $p_{c}(m_{i} \mid m_{j})$ and $p_{c}(m_{i}\in c_{j})$ are defined in Section 3.2.

其中 $|M|$ 表示抽取的提及数量,$C_{i}$ 表示截至第 $i$ 步已创建的聚类数量,$p_{c}(m_{i} \mid m_{j})$ 和 $p_{c}(m_{i}\in c_{j})$ 的定义见第3.2节。

All the models we introduce are trained using teacher forcing. In particular, in the mention token end classification step, we use gold start indices to condition the end tokens prediction, and, for the mention clustering step, we consider only gold mention indices. For Maverick in cr, at each iteration, we compare each mention only to previous gold clusters.

我们提出的所有模型都采用教师强制 (teacher forcing) 进行训练。具体而言,在提及结束token分类步骤中,我们使用标注的起始索引来约束结束token的预测;在提及聚类步骤中,我们仅考虑标注的提及索引。对于Maverick$_{incr}$,每次迭代时我们仅将每个提及与此前的标注聚类进行比较。
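Under teacher forcing, all three terms are plain binary cross-entropies over probabilities computed from gold inputs, so the total objective can be sketched in a few lines (tensor names and shapes are assumptions):

```python
import torch.nn.functional as F

def coref_loss(p_start, y_start, p_end, y_end, p_clust, y_clust):
    """L_coref = L_start + L_end + L_clust, each a binary cross-entropy.

    p_start: probabilities for all N tokens; p_end: probabilities for the
    candidate ends of each *gold* start; p_clust: pairwise (L^ant) or
    mention-cluster (L^incr) probabilities against gold clusters.
    """
    return (F.binary_cross_entropy(p_start, y_start)
            + F.binary_cross_entropy(p_end, y_end)
            + F.binary_cross_entropy(p_clust, y_clust))
```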

4 Experiments Setup

4 实验设置

4.1 Datasets

4.1 数据集

We train and evaluate all the comparison systems on three Coreference Resolution datasets:

我们在三个共指消解数据集上训练并评估所有对比系统:

OntoNotes (Pradhan et al., 2012), proposed in the CoNLL-2012 shared task, is the de facto standard dataset used to benchmark Coreference Resolution systems. It consists of documents that span seven distinct genres, including full-length documents (broadcast news, newswire, magazines, weblogs, and testaments) and multiple-speaker transcripts (broadcast and telephone conversations).

OntoNotes (Pradhan et al., 2012) 是CoNLL-2012共享任务中提出的数据集,现已成为指代消解系统的基准测试标准。该数据集包含七种不同体裁的文档,包括完整篇幅的文档(广播新闻、通讯社报道、杂志文章、网络博客和圣经文本)以及多人对话转录文本(广播对话和电话通话)。

LitBank (Bamman et al., 2020) contains 100 literary documents typically used to evaluate long-document Coreference Resolution.

LitBank (Bamman et al., 2020) 包含100份通常用于评估长文档共指消解的文学文献。

PreCo (Chen et al., 2018) is a large-scale dataset that includes reading comprehension tests for middle school and high school students.

PreCo (Chen et al., 2018) 是一个包含初高中学生阅读理解测试的大规模数据集。

Notably, both LitBank and PreCo have different annotation guidelines compared to OntoNotes, and provide annotation for singletons (i.e., single-mention clusters). Furthermore, we evaluate models trained on OntoNotes on three out-of-domain datasets:

值得注意的是,LitBank和PreCo的标注规范与OntoNotes不同,并且对单例(即仅含单个提及的簇)进行了标注。此外,我们还在三个域外数据集上评估了在OntoNotes上训练的模型: