Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends
Abstract
Large autoregressive generative models have emerged as the cornerstone for achieving the highest performance across several Natural Language Processing tasks. However, the urge to attain superior results has, at times, led to the premature replacement of carefully designed task-specific approaches without exhaustive experimentation. The Coreference Resolution task is no exception; all recent state-of-the-art solutions adopt large generative autoregressive models that outperform encoder-based discriminative systems. In this work, we challenge this recent trend by introducing Maverick, a carefully designed – yet simple – pipeline, which enables running a state-of-the-art Coreference Resolution system within the constraints of an academic budget, outperforming models with up to 13 billion parameters with as few as 500 million parameters. Maverick achieves state-of-the-art performance on the CoNLL-2012 benchmark, training with up to 0.006x the memory resources and obtaining a $170\mathrm{x}$ faster inference compared to previous state-of-the-art systems. We extensively validate the robustness of the Maverick framework with an array of diverse experiments, reporting improvements over prior systems in data-scarce, long-document, and out-of-domain settings. We release our code and models for research purposes at https://github.com/SapienzaNLP/maverick-coref.
1 Introduction
As one of the core tasks in Natural Language Processing, Coreference Resolution aims to identify and group expressions (called mentions) that refer to the same entity (Karttunen, 1969). Given its crucial role in various downstream tasks, such as Knowledge Graph Construction (Li et al., 2020), Entity Linking (Kundu et al., 2018; Agarwal et al., 2022), Question Answering (Dhingra et al., 2018; Dasigi et al., 2019; Bhattacharjee et al., 2020; Chen and Durrett, 2021), Machine Translation (Stojanovski and Fraser, 2018; Voita et al., 2018; Ohtani et al., 2019; Yehudai et al., 2023) and Text Summarization (Falke et al., 2017; Pasunuru et al., 2021; Liu et al., 2021), inter alia, there is a pressing need for both high performance and efficiency. However, recent works in Coreference Resolution either explore methods to obtain reasonable performance while optimizing time and memory efficiency (Kirstain et al., 2021; Dobrovolskii, 2021; Otmazgin et al., 2022), or strive to improve benchmark scores regardless of the increased computational demand (Bohnet et al., 2023; Zhang et al., 2023).
Efficient solutions usually rely on discriminative formulations, frequently employing the mention-antecedent classification method proposed by Lee et al. (2017). These approaches leverage relatively small encoder-only transformer architectures (Joshi et al., 2020; Beltagy et al., 2020) to encode documents and build task-specific networks on top of them that ensure high speed and efficiency. On the other hand, performance-centered solutions are nowadays dominated by general-purpose large Sequence-to-Sequence models (Liu et al., 2022; Zhang et al., 2023). A notable example of this formulation, and currently the state of the art in Coreference Resolution, is Bohnet et al. (2023), which proposes a transition-based system that incrementally builds clusters of mentions by generating coreference links sentence by sentence in an autoregressive fashion. Although Sequence-to-Sequence solutions achieve remarkable performance, their autoregressive nature and the size of the underlying language models (up to 13B parameters) make them dramatically slower and more memory-demanding than traditional encoder-only approaches. This not only makes their usage for downstream applications impractical, but also poses a significant barrier to their accessibility for the large number of users operating within an academic budget.
In this work we argue that discriminative encoder-only approaches for Coreference Resolution have not yet expressed their full potential and have been discarded too early in the urge to achieve state-of-the-art performance. In proposing Maverick, we strike an optimal balance between high performance and efficiency, a combination that was missing in previous systems. Our framework enables an encoder-only model to achieve top-tier performance while keeping the overall model size less than one-twentieth of the current state-of-the-art system, and training it with academic resources. Moreover, when further reducing the size of the underlying transformer encoder, Maverick performs in the same ballpark as encoder-only efficiency-driven solutions while improving speed and memory consumption. Finally, we propose a novel incremental Coreference Resolution method that, integrated into the Maverick framework, results in a robust architecture for out-of-domain, data-scarce, and long-document settings.
2 Related Work
We now introduce well-established approaches to neural Coreference Resolution. Specifically, we first delve into the details of traditional discriminative solutions, including their incremental variations, and then present the recent paradigm shift toward approaches based on large generative architectures.
2.1 Discriminative models
Discriminative approaches tackle the Coreference Resolution task as a classification problem, usually employing encoder-only architectures. The pioneering works of Lee et al. (2017, 2018) introduced the first end-to-end discriminative system for Coreference Resolution, the Coarse-to-Fine model. First, it involves a mention extraction step, in which the spans most likely to be coreference mentions are identified. This is followed by a mention-antecedent classification step where, for each extracted mention, the model searches for its most probable antecedent (i.e., the extracted span that appears earlier in the text). This pipeline, composed of mention extraction and mention-antecedent classification steps, has been adopted with minor modifications in many subsequent works, which we refer to as Coarse-to-Fine models.
Coarse-to-Fine Models Among the works that build upon the Coarse-to-Fine formulation, Lee et al. (2018), Joshi et al. (2019) and Joshi et al. (2020) experimented with changing the underlying document encoder, utilizing ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2020), respectively, achieving remarkable score improvements on the English OntoNotes (Pradhan et al., 2012). Similarly, Kirstain et al. (2021) introduced s2e-coref, which reduces the high memory footprint of SpanBERT by leveraging the LongFormer (Beltagy et al., 2020) sparse-attention mechanism. Based on the same architecture, Otmazgin et al. (2023) analyzed the impact of having multiple experts score different linguistically motivated categories (e.g., pronouns-nouns, nouns-nouns, etc.). While the foregoing works have been able to modernize the original Coarse-to-Fine formulation, training their architectures on the OntoNotes dataset still requires a considerable amount of memory.1 This occurs because they rely on the traditional Coarse-to-Fine pipeline that, as we cover in Section 3.1, has a large memory overhead and is based on manually-set thresholds to regulate memory usage.
Incremental Models Discriminative systems also include incremental techniques. Incremental Coreference Resolution has a strong cognitive grounding: research on the “garden-path” effect shows that humans resolve referring expressions incrementally (Altmann and Steedman, 1988).
A seminal work that proposed an automatic incremental system was that of Webster and Curran (2014), which introduced a clustering approach based on the shift-reduce paradigm. In this formulation, for each mention, a classifier decides whether to SHIFT it into a singleton (i.e., single mention cluster) or to REDUCE it within an existing cluster. The same approach has recently been reintroduced in ICoref (Xia et al., 2020) and longdoc (Toshniwal et al., 2021), which adopted SpanBERT and LongFormer, respectively. In these works the mention extraction step is identical to that of Coarse-to-Fine models. On the other hand, the mention clustering step is performed by using a linear classifier that scores each mention against a vector representation of previously built clusters, in an incremental fashion. This method ensures constant memory usage since cluster representations are updated with a learnable function. In Section 3.2 we present a novel performance-driven incremental method that obtains superior performance and generalization capabilities, in which we adopt a lightweight transformer architecture that retains the mention representations.
2.2 Sequence-to-Sequence models
Recent state-of-the-art Coreference Resolution systems all employ autoregressive generative approaches. However, an early example of a Sequence-to-Sequence model, TANL (Paolini et al., 2021), failed to achieve competitive performance on OntoNotes. The first system to show that the autoregressive formulation was competitive was ASP (Liu et al., 2022), which outperformed encoder-only discriminative approaches. ASP is an autoregressive pointer-based model that generates actions for mention extraction (bracket pairing) and then conditions the next step to generate coreference links. Notably, the breakthrough achieved by ASP is due not only to its formulation but also to its use of large generative models. Indeed, the success of their approach is strictly correlated with the underlying model size, since, when using models with a comparable number of parameters, the performance is significantly lower than that of encoder-only approaches. The same occurs in Zhang et al. (2023), a fully-seq2seq approach where a model learns to generate a formatted sequence encoding coreference notation, in which they report a strong positive correlation between performance and model size.
Finally, the current state of the art on the OntoNotes benchmark is held by Link-Append (Bohnet et al., 2023), a transition-based system that incrementally builds clusters exploiting a multi-pass Sequence-to-Sequence architecture. This approach incrementally maps the mentions in previously coreference-annotated sentences to system actions for the current sentence, using the same shift-reduce incremental paradigm presented in Section 2.1. This method obtains state-of-the-art performance at the cost of using a 13B-parameter model and processing one sentence at a time, drastically increasing the need for computational power. While the foregoing models ensure superior performance compared to previous discriminative approaches, using them for inference is out of reach for many users, not to mention the exorbitant cost of training them from scratch.
3 Methodology
In this section, we present the Maverick framework: we propose replacing the preprocessing and training strategy of Coarse-to-Fine models with a novel pipeline that improves the training and inference efficiency of Coreference Resolution systems. Furthermore, with the Maverick Pipeline, we eliminate the dependency on long-standing manually-set hyperparameters that regulate memory usage. Finally, building on top of our pipeline, we propose three models: Maverick$_{s2e}$ and Maverick$_{mes}$, which adopt a mention-antecedent classification technique, and Maverick$_{incr}$, which is based upon a novel incremental formulation.
3.1 Maverick Pipeline
The Maverick Pipeline combines i) a novel mention extraction method, ii) an efficient mention regularization technique, and iii) a new mention pruning strategy.
Mention Extraction When it comes to extracting mentions from a document $D$ , there are different strategies to model the probability that a span contains a mention. Several previous works follow the Coarse-to-Fine formulation presented in Section 2.1, which consists of scoring all the possible spans in $D$ . This entails a quadratic computational cost in relation to the input length, which they mitigate by introducing several pruning techniques.
In this work, we employ a different strategy. We extract coreference mentions by first identifying all the possible starts of a mention, and then, for each start, extracting its possible end. To extract start indices, we first compute the hidden representation $(x_{1},\ldots,x_{n})$ of the tokens $(t_{1},\ldots,t_{n})\in D$ using a transformer encoder, and then use a fully-connected layer $F$ to compute the probability of each $t_{i}$ being the start of a mention as:
$$
\begin{aligned}
F_{start}(x) &= W_{start}^{\prime}\,(\mathrm{GeLU}(W_{start}\,x))\\
p_{start}(t_{i}) &= \sigma(F_{start}(x_{i}))
\end{aligned}
$$
with $W_{start}^{\prime}$ and $W_{start}$ being the learnable parameters, and $\sigma$ the sigmoid function. For each start of a mention $t_{s}$, i.e., those tokens having $p_{start}(t_{s})>0.5$, we then compute the probability of each of its subsequent tokens $t_{j}$, with $s\leq j$, being the end of a mention that starts with $t_{s}$. We follow the same process as that of the mention start classification, but we condition the prediction on the starting token by concatenating the start, $x_{s}$, and end, $x_{j}$, hidden representations before the linear classifier:
$$
\begin{aligned}
F_{end}(x, x^{\prime}) &= W_{end}^{\prime}\,(\mathrm{GeLU}(W_{end}\,[x, x^{\prime}]))\\
p_{end}(t_{j} \mid t_{s}) &= \sigma(F_{end}(x_{s}, x_{j}))
\end{aligned}
$$
with $W_{end}^{\prime}, W_{end}$ being learnable parameters. This formulation handles overlapping mentions since, for each start $t_{s}$, we can find multiple ends $t_{e}$ (i.e., those that have $p_{end}(t_{e} \mid t_{s})>0.5$).
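To make the two-step filter concrete, here is a minimal, dependency-free sketch of the extraction logic with fabricated logits (the `start_logits` list and `end_logits` dictionary stand in for the outputs of $F_{start}$ and $F_{end}$; all names and numbers are illustrative, not the trained model):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def extract_mentions(start_logits, end_logits):
    """Toy stand-ins for F_start / F_end: start_logits[i] scores token i as a
    mention start; end_logits[(s, j)] scores token j as the end of a mention
    starting at s. A span (s, j) is kept only if both probabilities > 0.5."""
    mentions = []
    for s, z in enumerate(start_logits):
        if sigmoid(z) > 0.5:  # p_start(t_s) > 0.5: s is a candidate start
            for (s2, j), ze in end_logits.items():
                if s2 == s and j >= s and sigmoid(ze) > 0.5:  # p_end(t_j | t_s) > 0.5
                    mentions.append((s, j))
    return mentions

# Toy 5-token document: token 0 starts two (overlapping) mentions.
start_logits = [2.0, -3.0, 1.5, -2.0, -4.0]
end_logits = {(0, 0): 1.0, (0, 2): 2.0, (2, 3): 0.8, (2, 4): -1.0}
print(extract_mentions(start_logits, end_logits))  # [(0, 0), (0, 2), (2, 3)]
```

Note how start 0 yields two ends, illustrating the overlapping-mention case described above.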
Previous works already adopted a linear layer to compute start and end mention scores for each possible mention, i.e., s2e-coref (Kirstain et al., 2021) and LingMess (Otmazgin et al., 2023). However, our mention extraction technique differs from previous approaches since i) we produce two probabilities ($0<p<1$) instead of two unbounded scores, and ii) we use the computed start probability to filter out possible mentions, which reduces by a factor of 9 the number of mentions considered compared to existing Coarse-to-Fine systems (Table 1, first row).
Mention Regularization To further reduce the computational demand of this process, in the Maverick Pipeline we use the end-of-sentence (EOS) mention regularization strategy: after extracting the span start, we consider only the tokens up to the nearest EOS as possible mention end candidates.2 Since annotated mentions never span across sentences, EOS mention regularization prunes the number of mentions considered without any loss of information. While this heuristic was initially introduced in the implementation of Lee et al. (2018), all the recent Coarse-to-Fine works have abandoned it in favor of maximum span-length regularization, a manually-set hyperparameter that regulates a threshold to filter out spans that exceed a certain length. This implies a large overhead of unnecessary computations and introduces a structural bias that ignores long mentions exceeding a fixed length.3 In our work, we not only reintroduce EOS mention regularization, but also study its contribution in terms of efficiency, as reported in Table 1, second row.
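The EOS heuristic can be sketched in a few lines; `eos_candidate_ends` and its arguments are hypothetical names for illustration, assuming sentence boundaries are known token positions:

```python
def eos_candidate_ends(start, eos_positions, num_tokens):
    """Given a mention start, return the candidate end positions under EOS
    mention regularization: only tokens from `start` up to the nearest
    end-of-sentence token, since gold mentions never cross sentences."""
    nearest_eos = min((e for e in eos_positions if e >= start),
                      default=num_tokens - 1)
    return list(range(start, nearest_eos + 1))

# 10-token toy document with sentence breaks after tokens 3 and 9.
print(eos_candidate_ends(2, eos_positions=[3, 9], num_tokens=10))  # [2, 3]
print(eos_candidate_ends(5, eos_positions=[3, 9], num_tokens=10))  # [5, 6, 7, 8, 9]
```

Unlike a fixed span-length cap, the candidate set here adapts to the sentence the start falls in.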
Mention Pruning After the mention extraction step, as a result of the Maverick Pipeline, we consider an $18\mathrm{x}$ lower number of candidate mentions for the successive mention clustering phase (Table 1). This step consists of computing, for each mention, the probability of all its antecedents being in the same cluster, incurring a quadratic computational cost. Within the Coarse-to-Fine formulation, this high computational cost is mitigated by considering only the top $k$ mentions according to their probability score, where $k$ is a manually set hyperparameter. Since after our mention extraction step we obtain probabilities for a very concise number of mentions, we consider only mentions classified as probable candidates (i.e., those with $p_{e n d}>0.5$ and $p_{s t a r t}>0.5$ ), reducing the number of mention pairs considered by a factor of 10. In Table 1, we compare the previous Coarse-to-Fine formulation with the new Maverick Pipeline.
Table 1: Comparison between the Coarse-to-Fine pipeline and the Maverick Pipeline in terms of the average number of mentions considered in the mention extraction step (top) and the average number of mention pairs considered in the mention clustering step (bottom). The statistics are computed on the OntoNotes dev set, and refer to the hyperparameters proposed in Lee et al. (2018), which were unchanged by subsequent Coarse-to-Fine works, i.e., span-len $=30$, top-k $=0.4$.
| | Coarse-to-Fine | Maverick | △ |
|---|---|---|---|
| Ment. Extraction | Enumeration: 183,577 | (i) Start-End: 20,565 | -8.92x |
| | (+) Span-length: 14,265 | (ii) (+) EOS: 777 | -18.3x |
| Ment. Clustering | Top-k: 29,334 | (iii) Pred-only: 2,713 | -10.81x |
3.2 Mention Clustering
As a result of the Maverick Pipeline, we obtain a set of candidate mentions $M=(m_{1},m_{2},\ldots,m_{l})$, for which we propose three different clustering techniques: Maverick$_{s2e}$ and Maverick$_{mes}$, which use two well-established Coarse-to-Fine mention-antecedent techniques, and Maverick$_{incr}$, which adopts a novel incremental technique that leverages a light transformer architecture.
Mention-Antecedent models The first proposed model, Maverick$_{s2e}$, adopts an equivalent mention clustering strategy to Kirstain et al. (2021): given a mention $m_{i}=(x_{s},x_{e})$ and its antecedent $m_{j}=(x_{s^{\prime}},x_{e^{\prime}})$, with their start and end token hidden states, we use two fully-connected layers to model their corresponding representations:
$$
\begin{aligned}
F_{s}(x) &= W_{s}^{\prime}\,(\mathrm{GeLU}(W_{s}\,x))\\
F_{e}(x) &= W_{e}^{\prime}\,(\mathrm{GeLU}(W_{e}\,x))
\end{aligned}
$$
We then calculate the probability that they belong to the same cluster as:
$$
\begin{aligned}
p_{c}(m_{i}, m_{j}) = \sigma(&F_{s}(x_{s}) \cdot W_{ss} \cdot F_{s}(x_{s^{\prime}})\\
&+ F_{e}(x_{e}) \cdot W_{ee} \cdot F_{e}(x_{e^{\prime}})\\
&+ F_{s}(x_{s}) \cdot W_{se} \cdot F_{e}(x_{e^{\prime}})\\
&+ F_{e}(x_{e}) \cdot W_{es} \cdot F_{s}(x_{s^{\prime}}))
\end{aligned}
$$
with $W_{ss}$, $W_{ee}$, $W_{se}$, $W_{es}$ being four learnable matrices and $W_{s}$, $W_{s}^{\prime}$, $W_{e}$, $W_{e}^{\prime}$ the learnable parameters of the two fully-connected layers.
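A minimal numeric sketch of the four-term bilinear scorer, using 2-dimensional toy representations and identity weight matrices (all values are fabricated; the trained model learns these weights):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bilinear(u, W, v):
    """u^T W v for plain-list vectors and a small weight matrix."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def pair_probability(fs_i, fe_i, fs_j, fe_j, Wss, Wee, Wse, Wes):
    """p_c(m_i, m_j): sigmoid over the sum of the four bilinear start/end terms."""
    score = (bilinear(fs_i, Wss, fs_j) + bilinear(fe_i, Wee, fe_j)
             + bilinear(fs_i, Wse, fe_j) + bilinear(fe_i, Wes, fs_j))
    return sigmoid(score)

I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights: score reduces to dot products
p = pair_probability([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0],
                     I2, I2, I2, I2)
print(round(p, 3))  # sigmoid(1 + 1 + 0 + 0) = sigmoid(2) ≈ 0.881
```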
A similar formulation is adopted in Maverick$_{mes}$, where, instead of using only one generic mention-pair scorer, we use 6 different scorers that handle linguistically motivated categories, as introduced by Otmazgin et al. (2023). We detect which category $k$ a pair of mentions $m_{i}$ and $m_{j}$ belongs to (e.g., if $m_{i}$ is a pronoun and $m_{j}$ is a proper noun, the category will be PRONOUN-ENTITY) and use a category-specific scorer to compute $p_{c}$. A complete description of the process along with the list of categories can be found in Appendix A.
Incremental model Finally, we introduce a novel incremental approach to tackle the mention clustering step, namely Maverick$_{incr}$, which follows the standard shift-reduce paradigm introduced in Section 2.1. Differently from previous neural incremental techniques (i.e., ICoref (Xia et al., 2020) and longdoc (Toshniwal et al., 2021)), which use a linear classifier to obtain the clustering probability between each mention and a fixed-length vector representation of previously built clusters, Maverick$_{incr}$ leverages a lightweight transformer model to attend to previous clusters, for which we retain the mentions’ hidden representations. Specifically, we compute the hidden representations $(h_{1},\ldots,h_{l})$ for all the candidate mentions in $M$ using a fully-connected layer on top of the concatenation of their start and end token representations. We first assign the first mention $m_{1}$ to the first cluster $c_{1}=(m_{1})$. Then, for each mention $m_{i}\in M$ at step $i$, we obtain the probability of $m_{i}$ being in a certain cluster $c_{j}$ by encoding $h_{i}$ with all the representations of the mentions contained in the cluster $c_{j}$ using a transformer architecture. We use the first special token ([CLS]) of a single-layer transformer architecture $T$ to obtain the score $S(m_{i},c_{j})$ of $m_{i}$ being in the cluster $c_{j}=(m_{f},\ldots,m_{g})$, with $f\leq g<i$, as:
$$
S(m_{i}, c_{j}) = W_{c} \cdot \mathrm{ReLU}(T_{CLS}(h_{i}, h_{f}, \dots, h_{g}))
$$
Finally, we compute the probability of $m_{i}$ belonging to $c_{j}$ as:
$$
p_{c}(m_{i} \in c_{j} \mid c_{j}=(m_{f},\ldots,m_{g})) = \sigma(S(m_{i}, c_{j}))
$$
We calculate this probability for each cluster $c_{j}$ up to step $i$ . We assign the mention $m_{i}$ to the most probable cluster $c_{j}$ having $p_{c}(m_{i}\in c_{j})>0.5$ if one exists, or we create a new singleton cluster containing $m_{i}$ .
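The clustering loop itself is independent of how the score $S(m_i, c_j)$ is produced; the sketch below plugs in a toy string-matching scorer in place of the transformer $T$ (all names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def incremental_cluster(mentions, score):
    """Shift-reduce clustering: each mention joins the most probable existing
    cluster with p_c > 0.5, or starts a new singleton cluster otherwise."""
    clusters = []
    for m in mentions:
        probs = [(sigmoid(score(m, c)), idx) for idx, c in enumerate(clusters)]
        best = max(probs, default=(0.0, None))
        if best[0] > 0.5:
            clusters[best[1]].append(m)  # REDUCE into the best existing cluster
        else:
            clusters.append([m])         # SHIFT: open a new singleton cluster
    return clusters

# Toy scorer: high score iff the mention string already occurs in the cluster.
toy_score = lambda m, cluster: 5.0 if m in cluster else -5.0
print(incremental_cluster(["Alice", "Bob", "Alice", "she"], toy_score))
# [['Alice', 'Alice'], ['Bob'], ['she']]
```

Because unmatched mentions open their own clusters, the same loop naturally produces singletons when the corpus annotates them.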
As we show in Sections 5.3 and 5.5, this formulation obtains better results than previous incremental methods, and is beneficial when dealing with long-document and out-of-domain settings.
3.3 Training
To train a Maverick model, we optimize the sum of three binary cross-entropy losses:
$$
L_{coref} = L_{start} + L_{end} + L_{clust}
$$
$$
\begin{aligned}
L_{start} &= \sum_{i=1}^{N} -\big(y_{i}\log(p_{start}(t_{i})) + (1 - y_{i})\log(1 - p_{start}(t_{i}))\big)\\
L_{end} &= \sum_{s=1}^{S} \sum_{j=1}^{E_{s}} -\big(y_{j}\log(p_{end}(t_{j} \mid t_{s})) + (1 - y_{j})\log(1 - p_{end}(t_{j} \mid t_{s}))\big)
\end{aligned}
$$
where $N$ is the sequence length, $S$ is the number of starts, $E_{s}$ is the number of possible ends for a start $s$, and $p_{start}(t_{i})$ and $p_{end}(t_{j}|t_{s})$ are those defined in Section 3.1.
Finally, $L_{clust}$ is the loss for the mention clustering step. Since we experiment with two different mention clustering formulations, we use a different loss for each clustering technique, namely $L_{clust}^{ant}$ for the mention-antecedent models, i.e., Maverick$_{s2e}$ and Maverick$_{mes}$, and $L_{clust}^{incr}$ for the incremental model, i.e., Maverick$_{incr}$:
$$
\begin{aligned}
L_{clust}^{ant} &= \sum_{i=1}^{|M|} \sum_{j=1}^{|M|} -\big(y_{i}\log(p_{c}(m_{i} \mid m_{j})) + (1 - y_{i})\log(1 - p_{c}(m_{i} \mid m_{j}))\big)\\
L_{clust}^{incr} &= \sum_{i=1}^{|M|} \sum_{j=1}^{|C_{i}|} -\big(y_{i}\log(p_{c}(m_{i} \in c_{j})) + (1 - y_{i})\log(1 - p_{c}(m_{i} \in c_{j}))\big)
\end{aligned}
$$
Table 2: Dataset statistics: number of documents in each dataset split, average number of words and mentions per document, and singletons percentage.
| Dataset | #Train | #Dev | #Test | Tokens | Mentions | %Sing |
|---|---|---|---|---|---|---|
| OntoNotes | 2802 | 343 | 348 | 467 | 56 | 0 |
| LitBank | 80 | 10 | 10 | 2105 | 291 | 19.8 |
| PreCo | 36120 | 500 | 500 | 337 | 105 | 52.0 |
| GAP | – | – | 2000 | 95 | 3 | – |
| WikiCoref | – | – | 30 | 1996 | 230 | 0 |
表 2: 数据集统计:各数据集划分中的文档数量、每文档平均词数和提及数,以及单例百分比。
| Dataset | #Train | #Dev | #Test | Tokens | Mentions | %Sing |
|---|---|---|---|---|---|---|
| OntoNotes | 2802 | 343 | 348 | 467 | 56 | 0 |
| LitBank | 80 | 10 | 10 | 2105 | 291 | 19.8 |
| PreCo | 36120 | 500 | 500 | 337 | 105 | 52.0 |
| GAP | – | – | 2000 | 95 | 3 | – |
| WikiCoref | – | – | 30 | 1996 | 230 | 0 |
where $|M|$ is the number of extracted mentions, $C_{i}$ is the set of clusters created up to step $i$ , and $p_{c}(m_{i}|m_{j})$ and $p_{c}(m_{i}\in c_{j})$ are defined in Section 3.2.
其中 $|M|$ 表示提取的提及数量,$C_{i}$ 表示截至第 $i$ 步创建的聚类集合,$p_{c}(m_{i}|m_{j})$ 和 $p_{c}(m_{i}\in c_{j})$ 的定义见第3.2节。
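Both clustering losses share the same summed-BCE shape and differ only in what the second index ranges over (candidate antecedent mentions vs. already-built clusters). A minimal illustrative sketch, with hand-made probabilities rather than model outputs:

```python
import math

def clustering_loss(probs, labels):
    """Summed binary cross-entropy over a grid of probabilities: probs[i][j]
    plays the role of p_c(m_i | m_j) for the mention-antecedent loss, or of
    p_c(m_i in c_j) for the incremental loss; labels are 0/1 gold targets."""
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```

For the incremental variant, each row `i` would have as many columns as there are clusters built up to step `i`, so rows can have different lengths.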
All the models we introduce are trained using teacher forcing. In particular, in the mention token end classification step, we use gold start indices to condition the prediction of end tokens, and, for the mention clustering step, we consider only gold mention indices. For $\mathrm{Maverick_{incr}}$, at each iteration, we compare each mention only to the previous gold clusters.
我们引入的所有模型都采用教师强制(teacher forcing)进行训练。具体而言,在提及token结束分类步骤中,我们使用标注的起始索引来约束结束token的预测;在提及聚类步骤中,我们仅考虑标注的提及索引。对于 $\mathrm{Maverick_{incr}}$,每次迭代时我们仅将每个提及与之前的标注聚类进行比较。
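The teacher-forcing scheme above amounts to indexing with gold positions during training; a toy sketch of the end-classification side (variable names are ours, not from the released code):

```python
def end_candidate_states(hidden_states, gold_starts):
    """Teacher forcing for the end classifier: gather the token representations
    at GOLD start indices, so that end prediction is conditioned on correct
    starts during training; at inference, predicted starts would be used instead."""
    return [hidden_states[s] for s in gold_starts]
```

The same idea applies to clustering: during training the candidate set is built from gold mentions (or gold clusters, for the incremental model), never from the model's own possibly wrong predictions.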
4 Experiments Setup
4 实验设置
4.1 Datasets
4.1 数据集
We train and evaluate all the comparison systems on three Coreference Resolution datasets:
我们在三个共指消解数据集上训练并评估所有对比系统:
OntoNotes (Pradhan et al., 2012), proposed in the CoNLL-2012 shared task, is the de facto standard dataset used to benchmark Coreference Resolution systems. It consists of documents that span seven distinct genres, including full-length documents (broadcast news, newswire, magazines, weblogs, and Testaments) and multiple speaker transcripts (broadcast and telephone conversations).
OntoNotes (Pradhan et al., 2012) 是CoNLL-2012共享任务中提出的数据集,现已成为指代消解系统的基准测试标准。该数据集包含七种不同体裁的文档,包括完整篇幅的文档(广播新闻、通讯社报道、杂志文章、网络博客和圣经文本)以及多人对话转录文本(广播对话和电话通话)。
LitBank (Bamman et al., 2020) contains 100 literary documents typically used to evaluate long-document Coreference Resolution.
LitBank (Bamman et al., 2020) 包含100份通常用于评估长文档共指消解的文学文献。
PreCo (Chen et al., 2018) is a large-scale dataset that includes reading comprehension tests for middle school and high school students.
PreCo (Chen et al., 2018) 是一个包含初高中学生阅读理解测试的大规模数据集。
Notably, both LitBank and PreCo have different annotation guidelines compared to OntoNotes, and provide annotations for singletons (i.e., single-mention clusters). Furthermore, we evaluate models trained on OntoNotes on three out-of-domain datasets:
值得注意的是,LitBank和PreCo的标注规范与OntoNotes不同,并为单例(即单提及聚类)提供了标注。此外,我们在三个域外数据集上评估了基于OntoNotes训练的模型:
• GAP (Webster et al., 2018) contains sentences in which, given a pronoun, the model has to choose between two candidate mentions.
• GAP (Webster et al., 2018) 包含的句子要求模型在给定代词的情况下,从两个候选提及中选择一个。
• $\mathrm{LitBank_{ns}}$ and $\mathrm{PreCo_{ns}}$, the test sets of the two datasets where we filter out singleton annotations.
• $\mathrm{LitBank_{ns}}$ 和 $\mathrm{PreCo_{ns}}$,这两个数据集的测试集已过滤掉单例标注。
• WikiCoref (Ghaddar and Langlais, 2016), which contains Wikipedia texts, including documents with up to 9,869 tokens.
• WikiCoref (Ghaddar and Langlais, 2016),包含维基百科文本,其中部分文档的token数量高达9,869个。
The statistics of the datasets used are shown in Table 2.
所用数据集的统计信息如表 2 所示。
4.2 Comparison Systems
4.2 对比系统
Discriminative Among the discriminative systems, we consider c2f-coref (Joshi et al., 2020) and s2e-coref (Kirstain et al., 2021), which build upon the Coarse-to-Fine formulation and adopt different document encoders. We also report the results of LingMess (Otmazgin et al., 2023), which is the previous best encoder-only solution, and f-coref (Otmazgin et al., 2022), which is a distilled version of LingMess. Furthermore, we include CorefQA (Wu et al., 2020), which casts Coreference Resolution as extractive Question Answering, and wl-coref (Dobrovolskii, 2021), which first predicts coreference links between words, and then extracts mention spans. Finally, we report the results of incremental systems, such as ICoref (Xia et al., 2020) and longdoc (Toshniwal et al., 2021).
在判别式系统中,我们考察了基于Coarse-to-Fine框架的c2f-coref (Joshi et al., 2020) 和采用不同文档编码器的s2e-coref (Kirstain et al., 2021)。同时汇报了当前最佳纯编码器方案LingMess (Otmazgin et al., 2023) 及其蒸馏版本f-coref (Otmazgin et al., 2022) 的结果。此外还纳入将共指消解建模为抽取式问答的CorefQA (Wu et al., 2020),以及先预测词间共指链再抽取指称跨度的wl-coref (Dobrovolskii, 2021)。最后展示了增量式系统ICoref (Xia et al., 2020) 和longdoc (Toshniwal et al., 2021) 的性能表现。
Sequence-to-Sequence We compare our models with TANL (Paolini et al., 2021) and ASP (Liu et al., 2022), which frame Coreference Resolution as autoregressive structured prediction. We also include Link-Append (Bohnet et al., 2023), a transition-based system that builds clusters with a multi-pass Sequence-to-Sequence architecture. Finally, we report the results of seq2seq (Zhang et al., 2023), a model that learns to generate a sequence with Coreference Resolution labels.
序列到序列 我们将我们的模型与TANL (Paolini等人, 2021) 和ASP (Liu等人, 2022) 进行比较,这些模型将共指消解 (Coreference Resolution) 框架化为自回归结构化预测。我们还纳入了Link-Append (Bohnet等人, 2023),这是一个基于转移的系统,通过多轮序列到序列架构构建聚类。最后,我们报告了seq2seq (Zhang等人, 2023) 的结果,该模型学习生成带有共指消解标签的序列。
4.3 Maverick Setup
4.3 Maverick设置
All Maverick models use DeBERTa-v3 (He et al., 2023) as the document encoder. We use DeBERTa because it can model very long input texts effectively (He et al., 2021).4 Moreover, compared to the LongFormer, which was previously adopted by several token-level systems, DeBERTa ensures a larger maximum input sequence length (e.g., $\mathrm{DeBERTa_{large}}$ can handle sequences of up to 24,528 tokens, while the LongFormer handles only 4,096) and has shown better performance empirically in our experiments on the OntoNotes dataset.
所有Maverick模型均采用DeBERTa-v3 (He et al., 2023)作为文档编码器。我们选择DeBERTa是因为它能有效建模超长输入文本 (He et al., 2021)。此外,相比此前多个token级系统采用的LongFormer,DeBERTa支持更大的输入序列长度上限(例如 $\mathrm{DeBERTa_{large}}$ 可处理24,528个token,而LongFormer仅支持4,096),且我们在OntoNotes数据集上的实验表明其具有更优的实证性能。
On the other hand, using DeBERTa to encode long documents is computationally expensive, because its attention mechanism incurs a quadratic computational complexity. Whereas this further increases the computational cost of traditional Coarse-to-Fine systems, the Maverick Pipeline enables us to train models that leverage $\mathrm{DeBERTa_{large}}$ on the OntoNotes dataset without any performance-lowering pruning heuristic. To train our models, we use Adafactor (Shazeer and Stern, 2018) as our optimizer, with a learning rate of 3e-4 for the linear layers and 2e-5 for the pre-trained encoder. We perform all our experiments within an academic budget, i.e., a single RTX 4090, which has 24GB of VRAM. We report more training details in Appendix B.
另一方面,使用DeBERTa编码长文档在计算上成本高昂,因为其注意力机制会导致二次计算复杂度。虽然这会进一步增加传统由粗到精(Coarse-to-Fine)系统的计算开销,但Maverick Pipeline使我们能够在OntoNotes数据集上训练基于 $\mathrm{DeBERTa_{large}}$ 的模型,且无需采用任何降低性能的剪枝启发式方法。我们使用Adafactor (Shazeer and Stern, 2018)作为优化器进行模型训练,其中线性层学习率为3e-4,预训练编码器学习率为2e-5。所有实验均在学术预算范围内完成(即单张24GB显存的RTX 4090显卡)。更多训练细节详见附录B。
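The two learning rates can be implemented with optimizer parameter groups; the sketch below shows only the grouping logic (the `encoder.` prefix is an assumption about parameter naming, and an Adafactor implementation such as the one in `transformers` would consume these groups):

```python
def make_param_groups(named_params, encoder_prefix="encoder.",
                      head_lr=3e-4, encoder_lr=2e-5):
    """Split parameters into pretrained-encoder and task-head groups, with the
    learning rates from Section 4.3 (2e-5 for the encoder, 3e-4 for linear layers)."""
    enc, head = [], []
    for name, param in named_params:
        (enc if name.startswith(encoder_prefix) else head).append(param)
    return [{"params": enc, "lr": encoder_lr},
            {"params": head, "lr": head_lr}]
```

Passing this list to the optimizer constructor applies a different learning rate to each group, which is the standard way to fine-tune a pretrained encoder more gently than freshly initialized heads.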
5 Results
5 结果
5.1 English OntoNotes
5.1 英语OntoNotes
We report in Table 3 the average CoNLL-F1 score of the comparison systems trained on the English OntoNotes, along with their underlying pre-trained language models and total parameters. Compared to previous discriminative systems, we report gains of +2.2 CoNLL-F1 points over LingMess, the best encoder-only model. Interestingly, we even outperform CorefQA, which uses additional Question Answering training data.
我们在表 3 中报告了基于英语 OntoNotes 训练的对比系统的平均 CoNLL-F1 分数,以及它们底层的预训练语言模型和总参数量。与之前的判别式系统相比,我们比最佳纯编码器模型 LingMess 提高了 $+2.2$ 个 CoNLL-F1 点。值得注意的是,我们的表现甚至超过了使用额外问答训练数据的 CorefQA。
Concerning Sequence-to-Sequence approaches, we report extensive improvements over systems with a similar amount of parameters compared to our large models (500M): we obtain +3.4 points compared to ASP (770M), and the gap is even wider when taking into consideration Link-Append (3B) and seq2seq (770M), with +6.4 and +5.6 points, respectively. Most importantly, Maverick models surpass the performance of all Sequence-to-Sequence transformers even when they have several billions of parameters. Among our proposed methods, $\mathrm{Maverick_{mes}}$ shows the best performance, setting a new state of the art with a score of 83.6 CoNLL-F1 points on the OntoNotes benchmark. More detailed results, including a table with MUC, $\mathbf{B}^{3}$, and $\mathrm{CEAF}_{\phi_{4}}$ scores and a qualitative error analysis, can be found in Appendix C.
关于序列到序列 (Sequence-to-Sequence) 方法,与参数量相近的大型模型 (500M) 相比,我们取得了显著提升:相比 ASP (770M) 提升了 +3.4 分,而对比 Link-Append (3B) 和 seq2seq (770M) 时优势更大,分别达到 +6.4 和 +5.6 分。最重要的是,即使面对参数量达数十亿的序列到序列 Transformer 模型,Maverick 模型仍能保持性能优势。在我们提出的方法中,$\mathrm{Maverick_{mes}}$ 表现最佳,在 OntoNotes 基准测试中以 83.6 CoNLL-F1 分刷新了当前最优水平。更多详细结果(包括包含 MUC、$\mathbf{B}^{3}$ 和 $\mathrm{CEAF}_{\phi_{4}}$ 得分的表格)以及定性误差分析见附录 C。
5.2 PreCo and LitBank
5.2 PreCo 和 LitBank
We further validate the robustness of the Maverick framework by training and evaluating systems on the PreCo and LitBank datasets. As reported in Table 4, our models show superior performance when dealing with long documents in a data-scarce setting such as the one LitBank poses. On this dataset, $\mathrm{Maverick_{incr}}$ achieves a new state-of-the-art score of 78.3, gaining +1.0 CoNLL-F1 points compared with seq2seq. On PreCo, $\mathrm{Maverick_{incr}}$ outperforms longdoc, but seq2seq still shows slightly better performance. This is mainly due to the high presence of singletons in PreCo (52% of all the clusters). Our systems, using a mention extraction technique that favors precision rather than recall, are penalized compared to high-recall systems such as seq2seq.5 Among our systems, $\mathrm{Maverick_{incr}}$, leveraging its hybrid architecture, performs better on both PreCo and LitBank.
我们通过在PreCo和LitBank数据集上训练和评估系统,进一步验证了Maverick框架的鲁棒性。如表4所示,我们的模型在处理LitBank等数据稀缺场景下的长文档时表现出优越性能。在该数据集上,$\mathrm{Maverick_{incr}}$ 以78.3分刷新了当前最优成绩,相比seq2seq提升了+1.0个CoNLL-F1点。在PreCo数据集上,$\mathrm{Maverick_{incr}}$ 优于longdoc,但seq2seq仍略占优势,这主要由于PreCo中单例簇占比高达52%。我们的系统采用偏向精确率而非召回率的提及抽取技术,因此相比seq2seq等高召回率系统存在劣势。在我们的系统中,采用混合架构的 $\mathrm{Maverick_{incr}}$ 在PreCo和LitBank数据集上都取得了更好表现。
5.3 Out-of-Domain Evaluation
5.3 域外评估
In Table 5, we report the performance of Maverick systems, along with LingMess, the best encoder-only model, when dealing with out-of-domain texts, that is, when they are trained on OntoNotes and tested on other datasets. First of all, we report considerable improvements on the GAP test set, obtaining a +1.2 F1 score compared to the previous state of the art. We also test models on WikiCoref, $\mathrm{PreCo_{ns}}$ and $\mathrm{LitBank_{ns}}$ (Section 4.1). However, since the span annotation guidelines of these corpora differ from the ones used in OntoNotes, in Table 5 we also report the performance using gold mentions, i.e., skipping the mention extraction step (gold column).6 On the WikiCoref benchmark, we achieve a new state-of-the-art score of 67.2 CoNLL-F1, with an improvement of +4.2 points over the previous best score obtained by LingMess. On the same dataset, when using pre-identified mentions, the gap increases to +5.8 CoNLL-F1 points (76.6 vs 82.4). In the same setting, our models obtain up to +7.3 and +10.1 CoNLL-F1 points on $\mathrm{PreCo_{ns}}$ and $\mathrm{LitBank_{ns}}$, respectively, compared to LingMess. These results suggest that the Maverick training strategy makes this model more suitable when dealing with pre-identified mentions and out-of-domain texts. This further increases the potential benefits that Maverick systems can bring to many downstream applications that exploit coreference as an intermediate layer, such as Entity Linking (Rosales-Méndez et al., 2020) and Relation Extraction (Xiong et al., 2023; Zeng et al., 2023), where the mentions are already identified. Among our models, on $\mathrm{LitBank_{ns}}$ and WikiCoref, $\mathrm{Maverick_{incr}}$ outperforms $\mathrm{Maverick_{mes}}$ and $\mathrm{Maverick_{s2e}}$, confirming the superior capabilities of the incremental formulation in the long-document setting.
Finally, we highlight that the performance gap between using gold mentions and performing full Coreference Resolution is wider when tested on out-of-domain datasets (on average +17%) compared to testing directly on OntoNotes (83.6 vs 93.6, +10%).7 This result, obtained on three different out-of-domain datasets, suggests that the difference in annotation guidelines contributes considerably to lowering the out-of-domain performance (−7%).
在表5中,我们报告了Maverick系统与最佳纯编码器模型LingMess在处理领域外文本(即在OntoNotes上训练并在其他数据集上测试)时的性能表现。首先,我们在GAP测试集上取得了显著提升,相比之前的最先进水平获得了+1.2的F1分数提升。我们还测试了模型在WikiCoref、$\mathrm{PreCo_{ns}}$ 和 $\mathrm{LitBank_{ns}}$ 上的表现(第4.1节)。由于这些语料库的标注规范与OntoNotes不同,表5同时报告了使用黄金提及(即跳过提及抽取步骤)时的性能表现(gold列)。在WikiCoref基准测试中,我们以67.2的CoNLL-F1分数刷新了最先进水平,较LingMess先前最佳成绩提升了+4.2分。同一数据集上,使用预识别提及时差距扩大至+5.8个CoNLL-F1分(76.6 vs 82.4)。相同设置下,我们的模型在 $\mathrm{PreCo_{ns}}$ 和 $\mathrm{LitBank_{ns}}$ 上分别较LingMess取得+7.3和+10.1的CoNLL-F1分提升。这些结果表明Maverick训练策略使其更适用于处理预识别提及和领域外文本,这进一步提升了Maverick系统在实体链接(Rosales-Méndez et al., 2020)和关系抽取(Xiong et al., 2023; Zeng et al., 2023)等以共指消解为中间层且提及已被识别的下游应用中的潜在价值。在我们的模型中,$\mathrm{Maverick_{incr}}$ 在 $\mathrm{LitBank_{ns}}$ 和WikiCoref上表现优于 $\mathrm{Maverick_{mes}}$ 和 $\mathrm{Maverick_{s2e}}$,证实了增量式建模在长文档场景的优越性。最后我们注意到,与直接在OntoNotes上测试(83.6 vs 93.6,+10%)相比,领域外数据集上黄金提及与完整共指消解的性能差距更大(平均+17%)。这一在三个不同领域外数据集上获得的结果表明,标注规范的差异显著降低了领域外性能(−7%)。
Table 3: Results on the OntoNotes benchmark. We report the Avg. CoNLL-F1 score, the number of parameters, the training time, and the hardware used to train each model. Inference time (sec) and memory (GiB) were calculated on an RTX4090. For Sequence-to-Sequence models we include statistics that are reported in the original papers, since we could not run models locally. $({}^{*})$ indicates models trained on additional resources. $(^{d})$ indicates scores obtained on the dev set, however, Maverick systems always perform better on the dev than on the test sets. Missing values (-) are not reported in the original paper, and it is not feasible to reproduce them using our limited hardware resources.
表 3: OntoNotes基准测试结果。我们报告了平均CoNLL-F1分数、参数量、训练时间以及训练各模型所用的硬件。推理时间(秒)和内存占用(GiB)均在RTX4090上测得。对于序列到序列模型,由于无法本地运行,我们直接引用了原论文报告的统计数据。$({}^{*})$表示使用额外资源训练的模型,$(^{d})$表示开发集得分(但Maverick系统在开发集上的表现总是优于测试集)。缺失值(-)表示原论文未报告,且受限于硬件资源无法复现。
| 模型 | 语言模型 | 平均F1 | 参数量 | 训练时间 | 训练硬件 | 推理时间 | 内存占用 |
|---|---|---|---|---|---|---|---|
| 判别式模型 | | | | | | | |
| c2f-coref (Joshi et al., 2020) | SpanBERTlarge | 79.6 | 370M | 40h | 1x32G | 50s | 11.9 |
| ICoref (Xia et al., 2020) | SpanBERTlarge | 79.4 | 377M | – | 1x1080TI-12G | 38s | 2.9 |
| CorefQA (Wu et al., 2020) | SpanBERTlarge | 83.1* | 740M | – | 1xTPUv3-128G | – | – |
| s2e-coref (Kirstain et al., 2021) | LongFormerlarge | 80.3 | 494M | – | 1x32G | 17s | 3.9 |
| longdoc (Toshniwal et al., 2021) | LongFormerlarge | 79.6 | 471M | 16h | 1xA6000-48G | 25s | 2.1 |
| wl-coref (Dobrovolskii, 2021) | RoBERTalarge | 81.0 | 360M | 5h | 1xRTX8000-48G | 11s | 2.3 |
| f-coref (Otmazgin et al., 2022) | DistilRoBERTa | 78.5* | 91M | – | 1xV100-32G | 3s | 1.0 |
| LingMess (Otmazgin et al., 2023) | LongFormerlarge | 81.4 | 590M | 23h | 1xV100-32G | 20s | 4.8 |
| 序列到序列模型 | | | | | | | |
| ASP (Liu et al., 2022) | FLAN-T5large | 80.2 | 770M | – | 1xA100-40G | – | – |
| | FLAN-T5xxl | 82.5 | 11B | 45h | 6xA100-80G | 20m | – |
| Link-Append (Bohnet et al., 2023) | mT5xl | 78.0^d | 3B | – | 128xTPUv4-32G | – | – |
| | mT5xxl | 83.3 | 13B | 48h | 128xTPUv4-32G | 30m | – |
| seq2seq (Zhang et al., 2023) | T5-large | 77.2^d | 770M | – | 8xA100-40G | – | – |
| | T0-11B | 83.2 | 11B | – | 8xA100-40G | 40m | – |
| 我们的判别式模型 | | | | | | | |
| Mavericks2e | DeBERTabase | 81.1 | 192M | 7h | 1xRTX4090-24G | 6s | 1.8 |
| | DeBERTalarge | 83.4 | 449M | 14h | 1xRTX4090-24G | 13s | 4.0 |
| Maverickincr | DeBERTabase | 81.0 | 197M | 21h | 1xRTX4090-24G | 22s | 1.8 |
| | DeBERTalarge | 83.5 | 452M | 29h | 1xRTX4090-24G | 29s | 3.4 |
| Maverickmes | DeBERTabase | 81.4 | 223M | 7h | 1xRTX4090-24G | 6s | 1.9 |
| | DeBERTalarge | 83.6 | 504M | 14h | 1xRTX4090-24G | 14s | 4.0 |
Table 4: Results of the compared systems on the PreCo and LitBank test-sets in terms of CoNLL-F1 score.
表 4: 各对比系统在PreCo和LitBank测试集上的CoNLL-F1分数结果。
| 模型 | PreCo | LitBank |
|---|---|---|
| longdoc (Toshniwal et al., 2021) | 87.8 | 77.2 |
| seq2seq (Zhang et al., 2023) | 88.5 | 77.3 |
| Mavericks2e | 87.2 | 77.6 |
| Maverickincr | 88.0 | 78.3 |
| Maverickmes | 87.4 | 78.0 |
Table 5: Comparison between LingMess and Maverick systems on GAP, WikiCoref, $\mathrm{PreCo_{ns}}$ and $\mathrm{LitBank_{ns}}$. We report scores using system predictions (sys.) or gold mentions (gold).
表 5: LingMess 和 Maverick 系统在 GAP、WikiCoref、$\mathrm{PreCo_{ns}}$ 和 $\mathrm{LitBank_{ns}}$ 上的对比。我们报告了使用系统预测 (sys.) 或传递黄金提及 (gold) 的分数。
| 模型 | GAP | WikiCoref (sys.) | WikiCoref (gold) | PreCons (sys.) | PreCons (gold) | LitBankns (sys.) | LitBankns (gold) |
|---|---|---|---|---|---|---|---|
| LingMess | 89.6 | 63.0 | 76.6 | 65.1 | 80.6 | 64.4 | 73.9 |
| Mavericks2e | 91.1 | 67.2 | 81.5 | 67.2 | 87.9 | 64.8 | 83.1 |
| Maverickincr | 91.2 | 66.8 | 82.4 | 66.1 | 86.5 | 65.4 | 84.0 |
| Maverickmes | 91.1 | 66.8 | 82.1 | 66.1 | 86.9 | 65.1 | 82.8 |
5.4 Speed and Memory Usage
5.4 速度与内存使用
In Table 3, we include details regarding the training time and the hardware used by each comparison system, along with measurements of the inference time and peak memory usage on the OntoNotes validation set. Compared to Coarse-to-Fine models, which require 32GB of VRAM, we can train Maverick systems under 18GB. At inference time, both $\mathrm{Maverick_{mes}}$ and $\mathrm{Maverick_{s2e}}$, exploiting $\mathrm{DeBERTa_{large}}$, achieve competitive speed and memory consumption compared to wl-coref and s2e-coref. Furthermore, when adopting $\mathrm{DeBERTa_{base}}$, $\mathrm{Maverick_{mes}}$ proves to be the most efficient approach among those directly trained on OntoNotes, while, at the same time, attaining performance equal to that of the previous best encoder-only system, LingMess. The only system that shows better inference speed is f-coref, but at the cost of lower performance (−3.0).
在表3中,我们列出了各对比系统的训练时间、所用硬件细节,以及在OntoNotes验证集上测量的推理时间和峰值内存使用情况。相比需要32GB显存的Coarse-to-Fine模型,我们能在18GB以下显存条件下训练Maverick系统。推理阶段,采用 $\mathrm{DeBERTa_{large}}$ 的 $\mathrm{Maverick_{mes}}$ 和 $\mathrm{Maverick_{s2e}}$ 在速度和内存消耗方面均与wl-coref和s2e-coref具有竞争力。此外,当采用 $\mathrm{DeBERTa_{base}}$ 时,$\mathrm{Maverick_{mes}}$ 成为直接在OntoNotes上训练的最高效方案,同时保持与先前最佳纯编码器系统LingMess相当的性能。唯一展现更快推理速度的系统是f-coref,但这是以性能下降3.0为代价的。
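Wall-clock inference time of the kind reported in Table 3 can be measured with a simple helper; this is an illustrative sketch only (in the GPU setting, peak memory would additionally be read via `torch.cuda.max_memory_allocated`, which this stdlib-only snippet omits):

```python
import time

def timed_inference(predict, documents):
    """Run `predict` over all documents and return total wall-clock seconds."""
    start = time.perf_counter()
    for doc in documents:
        predict(doc)
    return time.perf_counter() - start
```

`time.perf_counter` is monotonic and high-resolution, which makes it the appropriate stdlib clock for benchmarking, unlike `time.time`, which can jump with system clock adjustments.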
Table 6: Comparison between Maverick models and previous techniques. LingMess† and s2e-coref† are trained using their official scripts. We use $\mathrm{DeBERTa_{base}}$ because $\mathrm{DeBERTa_{large}}$ could not fit our hardware when training the comparison systems.
表 6: Maverick模型与现有技术的对比。LingMess†与s2e-coref†均使用官方脚本训练。由于训练对比系统时 $\mathrm{DeBERTa_{large}}$ 超出硬件限制,我们采用 $\mathrm{DeBERTa_{base}}$。
| 模型 | LM | 得分 |
|---|---|---|
| Mavericks2e | DeBERTabase | 81.0 |
| s2e-coref† | DeBERTabase | 78.3 |
| Mavericks2e | LongFormerlarge | 80.6 |
| s2e-coref | LongFormerlarge | 80.3 |
| Maverickmes | DeBERTabase | 81.4 |
| LingMess† | DeBERTabase | 78.6 |
| Maverickmes | LongFormerlarge | 81.0 |
| LingMess | LongFormerlarge | 81.4 |
| Maverickincr | DeBERTalarge | 83.5 |
| Maverickprev-incr | DeBERTalarge | 79.6 |
Compared to the previous Sequence-to-Sequence state-of-the-art approach, Link-Append, we train our models with 175x less memory requirements. Comparing inference time is more complicated, since we could not run those models on our memory-constrained budget. For this reason, we report the inference times from the original articles, and hence times achieved with their high-resource settings. Interestingly, we report as much as 170x faster inference compared to seq2seq, which exploits parallel inference on multiple GPUs, and 85x faster inference when compared to the more efficient ASP. Among Maverick models, $\mathrm{Maverick_{incr}}$ is notably slower both in inference and training time, as it incrementally builds clusters using multiple steps.
与之前最先进的序列到序列(Sequence-to-Sequence)方法Link-Append相比,我们的模型训练所需内存减少了175倍。由于在内存受限的条件下无法运行这些模型,推理时间的比较更为复杂。因此,我们引用了原始论文中的推理时间数据,这些数据是在高资源设置下获得的。值得注意的是,与利用多GPU并行推理的seq2seq相比,我们的推理速度提升了170倍;而与效率更高的ASP相比,速度提升了85倍。在Maverick模型中,$\mathrm{Maverick_{incr}}$ 的推理和训练速度明显较慢,因为它需要通过多个步骤逐步构建聚类。
5.5 Maverick Ablation
5.5 Maverick消融实验
In Table 6, we compare the $\mathrm{Maverick_{s2e}}$ and $\mathrm{Maverick_{mes}}$ models with s2e-coref and LingMess, respectively, using different pre-trained encoders. Interestingly, when using DeBERTa, Maverick systems not only achieve better speed and memory efficiency, but also obtain higher performance compared to the previous systems. When using the LongFormer, instead, their scores are in the same ballpark, showing empirically that the Maverick training procedure better exploits the capabilities of DeBERTa. To test the benefits of our novel incremental formulation, $\mathrm{Maverick_{incr}}$, we also implement a Maverick model with the previously adopted incremental method used in longdoc and ICoref (Section 2.1), which we call $\mathrm{Maverick_{prev-incr}}$. Compared to the previous formulation, we report an increase in score of +3.9 CoNLL-F1 points. The improvement demonstrates that exploiting a transformer architecture to attend to all the previously clustered mentions is beneficial, and enables the future usage of hybrid architectures when needed.
在表6中,我们分别使用不同的预训练编码器比较了 $\mathrm{Maverick_{s2e}}$、$\mathrm{Maverick_{mes}}$ 模型与s2e-coref和LingMess。有趣的是,当使用DeBERTa时,Maverick系统不仅实现了更优的速度和内存效率,还获得了比先前系统更高的性能。而使用LongFormer时,它们的分数处于相近水平,这从经验上表明Maverick训练流程能更好地发挥DeBERTa的潜力。为了测试我们新颖的增量式方案 $\mathrm{Maverick_{incr}}$ 的优势,我们还实现了采用longdoc和ICoref(第2.1节)原有增量方法的Maverick模型(称为 $\mathrm{Maverick_{prev-incr}}$)。相比先前方案,我们观察到CoNLL-F1分数提升了+3.9分。这一改进证明利用transformer架构关注所有已聚类提及是有益的,并为未来按需使用混合架构提供了可能。
As a further analysis of whether the efficiency improvements of our systems stem from using DeBERTa or are attributable to the Maverick Pipeline, we compared the speed and memory occupation of a Maverick system using either $\mathrm{DeBERTa_{large}}$ or $\mathrm{LongFormer_{large}}$ as the underlying encoder. Our experiments show that using DeBERTa leads to an increase of +77% in memory space and +23% in time to complete an epoch when training on OntoNotes. An equivalent increase, attributable to the quadratic memory cost of DeBERTa's attention mechanism, was observed for the inference time and memory occupation on the OntoNotes test set. These results highlight the efficiency contribution of the Maverick Pipeline, which is agnostic to the document encoder and can be applied to future coreference systems to ensure higher efficiency.
为了进一步分析我们系统的效率提升是源于使用DeBERTa还是归功于Maverick Pipeline,我们比较了分别采用 $\mathrm{DeBERTa_{large}}$ 和 $\mathrm{LongFormer_{large}}$ 作为基础编码器的Maverick系统在速度与内存占用上的表现。实验表明,在OntoNotes数据集上训练时,使用DeBERTa会导致内存占用增加+77%,单轮训练时间延长+23%。由于DeBERTa注意力机制的二次内存开销,在OntoNotes测试集上的推理时间和内存占用也观测到等效增长。这些结果凸显了Maverick Pipeline的效率贡献:该流程与文档编码器无关,可应用于未来共指消解系统以确保更高效率。
6 Conclusion
6 结论
In this work, we challenged the recent trend of adopting large autoregressive generative models to solve the Coreference Resolution task. To do so, we proposed Maverick, a new framework that enables fast and memory-efficient Coreference Resolution while obtaining state-of-the-art results. This demonstrates that the large computational overhead required by Sequence-to-Sequence approaches is unnecessary. Indeed, in our experiments, Maverick systems demonstrated that they can outperform large generative models and improve on the speed and memory usage of the previous best-performing encoder-only approaches. Furthermore, we introduced $\mathrm{Maverick_{incr}}$, a robust multi-step incremental technique that obtains higher performance in the out-of-domain and long-document settings. By releasing our systems, we make state-of-the-art models usable by a larger portion of users in different scenarios and potentially improve downstream applications.
在这项工作中,我们挑战了近期采用大型自回归生成模型解决共指消解任务的主流趋势。为此,我们提出了Maverick框架,该框架能在实现最先进性能的同时,显著提升共指消解的速度和内存效率。这证明序列到序列方法所需的高计算开销并非必要。实验表明,Maverick系统不仅能超越大型生成模型,还优化了此前最佳纯编码器方法的速度与内存占用。此外,我们推出了 $\mathrm{Maverick_{incr}}$,这是一种鲁棒的多步增量技术,在跨领域和长文档场景中表现更优。通过开源该系统,我们让更广泛的用户群体能在不同场景中使用最先进的模型,并有望推动下游应用的改进。
7 Limitations
7 局限性
Our experiments were limited by our resource setting, i.e., a single RTX 4090. For this reason, we could not run Maverick using larger encoders, and could not properly test Sequence-to-Sequence models as we did with encoder-only models. Nevertheless, we believe this limitation is a common scenario in many real-world applications that would benefit substantially from our system. We also did not test our formulation on multiple languages, but we note that both the methodology behind Maverick and our novel incremental formulation are language agnostic, and thus could be applied to any language.
我们的实验受限于资源设置(即单块RTX 4090显卡)。因此,我们无法使用更大的编码器运行Maverick,也无法像处理纯编码器模型那样充分测试序列到序列模型。尽管如此,我们认为这种限制是许多现实应用中的常见场景,而这些应用将显著受益于我们的系统。我们也没有在多语言环境下测试我们的方案,但需指出Maverick背后的方法论和我们提出的增量式方案都是语言无关的,因此可应用于任何语言。
Acknowledgements
致谢


We gratefully acknowledge the support of the PNRR MUR project PE0000013-FAIR.
我们衷心感谢PNRR MUR项目PE0000013-FAIR的支持。
Roberto Navigli also gratefully acknowledges the support of the CREATIVE project (CRoss-modal understanding and gEnerATIon of Visual and tExtual content), which is funded by the MUR Progetti di Rilevante Interesse Nazionale programme (PRIN 2020). This work has been carried out while Giuliano Martinelli was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome.
Roberto Navigli 同时感谢 CREATIVE 项目 (CRoss-modal understanding and gEnerATIon of Visual and tExtual content) 的支持,该项目由 MUR Progetti di Rilevante Interesse Nazionale 计划 (PRIN 2020) 资助。这项工作是在 Giuliano Martinelli 就读于由罗马大学 (Sapienza University of Rome) 开设的意大利国家人工智能博士项目期间完成的。
