[论文翻译]BLEURT: 学习文本生成的鲁棒性指标


原文地址:https://arxiv.org/pdf/2004.04696v5


BLEURT: Learning Robust Metrics for Text Generation

BLEURT: 学习文本生成的鲁棒性指标

Thibault Sellam Dipanjan Das Ankur P. Parikh Google Research New York, NY {tsellam, dipanjand, aparikh }@google.com

Thibault Sellam Dipanjan Das Ankur P. Parikh 谷歌研究院 纽约州纽约市 {tsellam, dipanjand, aparikh}@google.com

Abstract

摘要

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

文本生成在过去几年取得了显著进展。然而评估指标却相对滞后,因为最常用的选择(如BLEU和ROUGE)可能与人类判断相关性较低。我们提出了BLEURT,这是一种基于BERT的学习型评估指标,仅需数千个可能存在偏差的训练样本即可建模人类判断。该方法的关键在于新颖的预训练方案,通过数百万合成样本来提升模型泛化能力。BLEURT在最近三年的WMT Metrics共享任务和WebNLG竞赛数据集上实现了最先进的结果。与原始基于BERT的方法相比,即使在训练数据稀缺且分布外的情况下,BLEURT仍能提供更优异的表现。

1 Introduction

1 引言

In the last few years, research in natural text generation (NLG) has made significant progress, driven largely by the neural encoder-decoder paradigm (Sutskever et al., 2014; Bahdanau et al., 2015) which can tackle a wide array of tasks including translation (Koehn, 2009), summarization (Mani, 1999; Chopra et al., 2016), structured-data-to-text generation (McKeown, 1992; Kukich, 1983; Wiseman et al., 2017), dialog (Smith and Hipp, 1994; Vinyals and Le, 2015) and image captioning (Fang et al., 2015). However, progress is increasingly impeded by the shortcomings of existing metrics (Wiseman et al., 2017; Ma et al., 2019; Tian et al., 2019).

近年来,自然文本生成(NLG)研究取得了显著进展,这主要归功于神经编码器-解码器范式 (Sutskever et al., 2014; Bahdanau et al., 2015) ,该范式能够处理包括翻译 (Koehn, 2009) 、摘要 (Mani, 1999; Chopra et al., 2016) 、结构化数据到文本生成 (McKeown, 1992; Kukich, 1983; Wiseman et al., 2017) 、对话 (Smith and Hipp, 1994; Vinyals and Le, 2015) 以及图像描述 (Fang et al., 2015) 在内的多种任务。然而,现有指标的不足 (Wiseman et al., 2017; Ma et al., 2019; Tian et al., 2019) 日益阻碍了研究的进一步发展。

Human evaluation is often the best indicator of the quality of a system. However, designing crowdsourcing experiments is an expensive and high-latency process, which does not easily fit in a daily model development pipeline. Therefore, NLG researchers commonly use automatic evaluation metrics, which provide an acceptable proxy for quality and are very cheap to compute. This paper investigates sentence-level, reference-based metrics, which describe the extent to which a candidate sentence is similar to a reference one. The exact definition of similarity may range from string overlap to logical entailment.

人工评估通常是衡量系统质量的最佳指标。然而,设计众包实验成本高昂且耗时较长,难以融入日常模型开发流程。因此,自然语言生成(NLG)研究者通常采用自动评估指标,这些指标能提供可接受的质量参考且计算成本极低。本文研究基于参考语句的句子级评估指标,用于衡量候选句与参考句的相似程度。相似度的具体定义可从字符串重叠度延伸至逻辑蕴含关系。

The first generation of metrics relied on handcrafted rules that measure the surface similarity between the sentences. To illustrate, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), two popular metrics, rely on N-gram overlap. Because those metrics are only sensitive to lexical variation, they cannot appropriately reward semantic or syntactic variations of a given reference. Thus, they have been repeatedly shown to correlate poorly with human judgment, in particular when all the systems to compare have a similar level of accuracy (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018).

第一代指标依赖于手工制定的规则,用于衡量句子之间的表面相似性。例如,BLEU (Papineni et al., 2002) 和 ROUGE (Lin, 2004) 这两种常用指标基于 N-gram 重叠。由于这些指标仅对词汇变化敏感,无法合理评估给定参考文本的语义或句法变化。因此,它们多次被证明与人类判断相关性较差,尤其是在所有对比系统的准确度水平相近时 (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018)。

Increasingly, NLG researchers have addressed those problems by injecting learned components in their metrics. To illustrate, consider the WMT Metrics Shared Task, an annual benchmark in which translation metrics are compared on their ability to imitate human assessments. The last two years of the competition were largely dominated by neural net-based approaches, RUSE, YiSi and ESIM (Ma et al., 2018, 2019). Current approaches largely fall into two categories. Fully learned metrics, such as BEER, RUSE, and ESIM, are trained end-to-end, and they typically rely on handcrafted features and/or learned embeddings. Conversely, hybrid metrics, such as YiSi and BERTscore, combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules. The first category typically offers great expressivity: if a training set of human ratings data is available, the metrics may take full advantage of it and fit the ratings distribution tightly. Furthermore, learned metrics can be tuned to measure task-specific properties, such as fluency, faithfulness, grammar, or style. On the other hand, hybrid metrics offer robustness. They may provide better results when there is little to no training data, and they do not rely on the assumption that training and test data are identically distributed.

越来越多的自然语言生成(NLG)研究人员通过将学习组件注入评估指标来解决这些问题。以WMT指标共享任务为例,这个年度基准测试通过比较各翻译指标模仿人类评估的能力来评判优劣。过去两年的竞赛主要由基于神经网络的方法主导,包括RUSE、YiSi和ESIM (Ma et al., 2018, 2019)。当前方法主要分为两类:完全学习型指标(如BEER、RUSE和ESIM)采用端到端训练,通常依赖手工特征和/或学习得到的嵌入表示;混合型指标(如YiSi和BERTscore)则将训练元素(如上下文嵌入)与手写逻辑(如token对齐规则)相结合。第一类指标通常具有强大的表达能力:若可获得人工评分训练集,这类指标能充分利用数据并紧密拟合评分分布。此外,学习型指标可针对特定任务属性(如流畅性、忠实度、语法或风格)进行优化。另一方面,混合型指标具有更强的鲁棒性,在训练数据不足甚至缺失时仍能提供良好结果,且不依赖训练数据与测试数据同分布的前提假设。

And indeed, the IID assumption is particularly problematic in NLG evaluation because of domain drifts, which have been the main target of the metrics literature, but also because of quality drifts: NLG systems tend to get better over time, and therefore a model trained on ratings data from 2015 may fail to distinguish top performing systems in 2019, especially for newer research tasks. An ideal learned metric would be able to both take full advantage of available ratings data for training, and be robust to distribution drifts, i.e., it should be able to extrapolate.

事实上,IID假设在自然语言生成(NLG)评估中尤其成问题,这不仅因为指标文献主要关注的领域漂移(domain drifts),还由于质量漂移(quality drifts):NLG系统会随时间推移不断改进,因此基于2015年评分数据训练的模型可能无法区分2019年的顶尖系统,尤其对于新兴研究任务而言。理想的习得指标应能充分利用现有评分数据进行训练,同时对分布漂移保持稳健性,即具备外推能力。

Our insight is that it is possible to combine expressivity and robustness by pre-training a fully learned metric on large amounts of synthetic data, before fine-tuning it on human ratings. To this end, we introduce BLEURT,1 a text generation metric based on BERT (Devlin et al., 2019). A key ingredient of BLEURT is a novel pre-training scheme, which uses random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals.

我们的核心观点是:通过在大规模合成数据上预训练一个完全学习的指标(metric),再基于人工评分进行微调,可以兼顾表达力与鲁棒性。为此,我们提出了基于BERT (Devlin et al., 2019) 的文本生成评估指标BLEURT。其关键创新在于预训练方案——通过对维基百科句子施加随机扰动,并结合多样化的词法与语义级监督信号来增强数据。

To demonstrate our approach, we train BLEURT for English and evaluate it under different generalization regimes. We first verify that it provides state-of-the-art results on all recent years of the WMT Metrics Shared Task (2017 to 2019, to-English language pairs). We then stress-test its ability to cope with quality drifts with a synthetic benchmark based on WMT 2017. Finally, we show that it can easily adapt to a different domain with three tasks from a data-to-text dataset, WebNLG 2017 (Gardent et al., 2017). Ablations show that our synthetic pre-training scheme increases performance in the IID setting, and is critical to ensure robustness when the training data is scarce, skewed, or out-of-domain.

为验证我们的方法,我们针对英语训练了BLEURT模型,并在不同泛化场景下进行评估。首先证实该模型在WMT指标共享任务历年数据(2017至2019年英语语对)中均取得最先进结果。随后通过基于WMT 2017构建的合成基准测试,压力测试其应对质量漂移的能力。最后以WebNLG 2017 (Gardent et al., 2017) 数据到文本数据集的三项任务证明模型可轻松适配新领域。消融实验表明:我们的合成预训练方案能提升独立同分布(IID)设定下的性能,且在训练数据稀缺、偏态或跨域时对确保鲁棒性至关重要。

The code and pre-trained models are available online2.

代码和预训练模型已在线发布2。

2 Preliminaries

2 预备知识

Define $\pmb{x}=(x_{1},\ldots,x_{r})$ to be the reference sentence of length $r$ where each $x_{i}$ is a token and let $\tilde{\pmb{x}}=(\tilde{x}_{1},\ldots,\tilde{x}_{p})$ be a prediction sentence of length $p$. Let $\{(\pmb{x}_{i},\tilde{\pmb{x}}_{i},y_{i})\}_{i=1}^{N}$ be a training dataset of size $N$ where $y_{i}\in\mathbb{R}$ is the human rating that indicates how good $\tilde{\pmb{x}}_{i}$ is with respect to $\pmb{x}_{i}$. Given the training data, our goal is to learn a function $f:(\pmb{x},\tilde{\pmb{x}})\rightarrow y$ that predicts the human rating.

定义 $\pmb{x}=(x_{1},\ldots,x_{r})$ 为长度为 $r$ 的参考句子,其中每个 $x_{i}$ 是一个 token,并令 $\tilde{\pmb{x}}=(\tilde{x}_{1},\ldots,\tilde{x}_{p})$ 为长度为 $p$ 的预测句子。设 $\{(\pmb{x}_{i},\tilde{\pmb{x}}_{i},y_{i})\}_{i=1}^{N}$ 为大小为 $N$ 的训练数据集,其中 $y_{i}\in\mathbb{R}$ 是表示 $\tilde{\pmb{x}}_{i}$ 相对于 $\pmb{x}_{i}$ 质量的人工评分。给定训练数据,我们的目标是学习一个函数 $f:(\pmb{x},\tilde{\pmb{x}})\rightarrow y$ 以预测人工评分。

3 Fine-Tuning BERT for Quality Evaluation

3 基于BERT的质量评估微调

Given the small amounts of rating data available, it is natural to leverage unsupervised representations for this task. In our model, we use BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), which is an unsupervised technique that learns contextualized representations of sequences of text. Given $\pmb{x}$ and $\tilde{\pmb{x}}$, BERT is a Transformer (Vaswani et al., 2017) that returns a sequence of contextualized vectors:

鉴于可用的评分数据量较少,利用无监督表示来完成此任务是很自然的选择。在我们的模型中,我们使用了BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019),这是一种无监督技术,可以学习文本序列的上下文表示。给定$\pmb{x}$和$\tilde{\pmb{x}}$,BERT是一个Transformer (Vaswani et al., 2017),它返回一系列上下文向量:

$$
v_{[\mathrm{CLS}]},v_{x_{1}},\ldots,v_{x_{r}},v_{\tilde{x}_{1}},\ldots,v_{\tilde{x}_{p}}=\mathrm{BERT}(\pmb{x},\tilde{\pmb{x}})
$$

$$
v_{[\mathrm{CLS}]},v_{x_{1}},\ldots,v_{x_{r}},v_{\tilde{x}_{1}},\ldots,v_{\tilde{x}_{p}}=\mathrm{BERT}(\pmb{x},\tilde{\pmb{x}})
$$

where $\pmb{v}_{[\mathrm{CLS}]}$ is the representation for the special [CLS] token. As described by Devlin et al. (2019), we add a linear layer on top of the [CLS] vector to predict the rating:

其中 $\pmb{v}_{[\mathrm{CLS}]}$ 是特殊 [CLS] token 的表征。如 Devlin 等人 (2019) 所述,我们在 [CLS] 向量上添加一个线性层来预测评分:

$$
\hat{y}=\pmb{f}(\pmb{x},\tilde{\pmb{x}})=\pmb{W}\tilde{v}_{[\mathrm{CLS}]}+\pmb{b}
$$

$$
\hat{y}=\pmb{f}(\pmb{x},\tilde{\pmb{x}})=\pmb{W}\tilde{v}_{[\mathrm{CLS}]}+\pmb{b}
$$

where $W$ and $b$ are the weight matrix and bias vector respectively. Both the above linear layer as well as the BERT parameters are trained (i.e., fine-tuned) on the supervised data, which typically numbers in a few thousand examples. We use the regression loss $\ell_{\mathrm{supervised}}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}$.

其中 $W$ 和 $b$ 分别是权重矩阵和偏置向量。上述线性层以及 BERT 参数均在通常包含数千个样本的监督数据上进行训练(即微调)。我们使用回归损失 $\ell_{\mathrm{supervised}}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}$。
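The rating head above is just a linear map on the [CLS] vector plus a mean-squared-error loss. A minimal numpy sketch, with a toy hidden size and random vectors standing in for real BERT outputs; `predict` and `supervised_loss` are our own illustrative names, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for BERT's [CLS] vectors (hidden size 8 instead of 1024).
hidden_size = 8
v_cls = rng.normal(size=(3, hidden_size))  # batch of 3 (reference, candidate) pairs

# Linear rating head: y_hat = W v_[CLS] + b
W = rng.normal(size=hidden_size) * 0.1
b = 0.0

def predict(v_cls, W, b):
    """One scalar rating per [CLS] vector."""
    return v_cls @ W + b

def supervised_loss(y, y_hat):
    """l_supervised = (1/N) * sum_i (y_i - y_hat_i)^2"""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

y = np.array([0.9, -0.3, 0.1])  # hypothetical human ratings
y_hat = predict(v_cls, W, b)
loss = supervised_loss(y, y_hat)
```

In the real model, `v_cls` is produced by the fine-tuned BERT encoder, so the gradient of this loss flows through all BERT parameters as well as `W` and `b`.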

Although this approach is quite straightforward, we will show in Section 5 that it gives state-of-the-art results on WMT Metrics Shared Task 17-19, which makes it a high-performing evaluation metric. However, fine-tuning BERT requires a sizable amount of IID data, which is less than ideal for a metric that should generalize to a variety of tasks and model drift.

尽管这种方法相当直接,但我们将在第5节展示,它在WMT Metrics Shared Task 17-19上取得了最先进 (state-of-the-art) 的结果,使其成为高性能的评估指标。然而,微调BERT需要大量独立同分布 (IID) 数据,这对于一个需要泛化到多种任务和模型漂移的指标来说并不理想。

4 Pre-Training on Synthetic Data

4 基于合成数据的预训练

The key aspect of our approach is a pre-training technique that we use to “warm up” BERT before fine-tuning on rating data.3 We generate a large number of synthetic reference-candidate pairs $(z,\tilde{z})$, and we train BERT on several lexical- and semantic-level supervision signals with a multitask loss. As our experiments will show, BLEURT generalizes much better after this phase, especially with incomplete training data.

我们方法的关键在于采用了一种预训练技术,在基于评分数据进行微调前对BERT进行"预热"。我们生成了大量合成参考-候选对$(z,\tilde{z})$,并通过多任务损失函数,基于词汇和语义层面的监督信号对BERT进行训练。实验表明,经过该阶段后BLEURT展现出显著更好的泛化能力,尤其在训练数据不完整时表现更为突出。

Any pre-training approach requires a dataset and a set of pre-training tasks. Ideally, the setup should resemble the final NLG evaluation task, i.e., the sentence pairs should be distributed similarly and the pre-training signals should correlate with human ratings. Unfortunately, we cannot have access to the NLG models that we will evaluate in the future. Therefore, we optimized our scheme for generality, with three requirements. (1) The set of reference sentences should be large and diverse, so that BLEURT can cope with a wide range of NLG domains and tasks. (2) The sentence pairs should contain a wide variety of lexical, syntactic, and semantic dissimilarities. The aim here is to anticipate all variations that an NLG system may produce, e.g., phrase substitution, paraphrases, noise, or omissions. (3) The pre-training objectives should effectively capture those phenomena, so that BLEURT can learn to identify them. The following sections present our approach.

任何预训练方法都需要一个数据集和一组预训练任务。理想情况下,该设置应类似于最终的NLG评估任务,即句子对的分布应相似,且预训练信号应与人类评分相关。遗憾的是,我们无法提前获知未来将评估的NLG模型。因此,我们针对通用性优化了方案,提出三点要求:(1) 参考句集应规模庞大且多样化,使BLEURT能适应广泛的NLG领域和任务;(2) 句子对应包含词汇、句法和语义层面的多样化差异,旨在预判NLG系统可能产生的所有变体,例如短语替换、改述、噪声或遗漏;(3) 预训练目标应有效捕捉这些现象,使BLEURT学会识别它们。以下章节将介绍我们的方法。

4.1 Generating Sentence Pairs

4.1 生成句子对

One way to expose BLEURT to a wide variety of sentence differences is to use existing sentence pairs datasets (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019). These sets are a rich source of related sentences, but they may fail to capture the errors and alterations that NLG systems produce (e.g., omissions, repetitions, nonsensical substitutions). We opted for an automatic approach instead, that can be scaled arbitrarily and at little cost: we generate synthetic sentence pairs $(z,\tilde{z})$ by randomly perturbing 1.8 million segments $_{z}$ from Wikipedia. We use three techniques: mask-filling with BERT, back translation, and randomly dropping out words. We obtain about 6.5 million perturbations $\tilde{z}$ . Let us describe those techniques.

让BLEURT接触多样化句子差异的一种方法是利用现有句子对数据集 (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019)。这些数据集虽能提供丰富的关联语句资源,但可能无法涵盖自然语言生成系统产生的错误和改动 (如遗漏、重复、无意义替换)。我们选择了一种可任意扩展且成本低廉的自动化方案:通过随机扰动180万条维基百科段落$_{z}$来生成合成句子对$(z,\tilde{z})$。具体采用三种技术:基于BERT的掩码填充、回译以及随机词丢弃,最终获得约650万条扰动结果$\tilde{z}$。以下将详细说明这些技术。

Mask-filling with BERT: BERT’s initial training task is to fill gaps (i.e., masked tokens) in tokenized sentences. We leverage this functionality by inserting masks at random positions in the Wikipedia sentences, and fill them with the language model. Thus, we introduce lexical alterations while maintaining the fluency of the sentence. We use two masking strategies—we either introduce the masks at random positions in the sentences, or we create contiguous sequences of masked tokens. More details are provided in the Appendix.

使用BERT进行掩码填充:BERT的初始训练任务是填补标记化句子中的空白(即被掩码的token)。我们通过在维基百科句子中随机位置插入掩码,并利用语言模型进行填充来实现这一功能。这样可以在保持句子流畅性的同时引入词汇变化。我们采用两种掩码策略——要么在句子随机位置插入掩码,要么创建连续的掩码token序列。更多细节见附录。

Back translation: We generate paraphrases and perturbations with back translation, that is, round trips from English to another language and then back to English with a translation model (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Sennrich et al., 2016). Our primary aim is to create variants of the reference sentence that preserve semantics. Additionally, we use the mispredictions of the back translation models as a source of realistic alterations.

回译:我们通过回译生成改写和扰动,即利用翻译模型将英文先翻译成另一种语言再转回英文的往返过程 (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Sennrich et al., 2016)。主要目标是创建保留原句语义的变体,同时利用回译模型的错误预测作为现实语义变化的来源。

Dropping words: We found it useful in our experiments to randomly drop words from the synthetic examples above to create other examples. This method prepares BLEURT for “pathological” behaviors or NLG systems, e.g., void predictions, or sentence truncation.

随机丢弃词语:在实验中,我们发现随机从上述合成示例中删除词语以创建其他样本很有帮助。这种方法能让BLEURT适应"病态"行为或自然语言生成系统,例如空预测或句子截断。
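As a rough illustration of the third perturbation, here is a sketch of random word dropping; the `drop_prob` value and the keep-at-least-one-token guard are our own assumptions, not details from the paper:

```python
import random

def drop_words(tokens, drop_prob=0.3, rng=None):
    """Delete each token independently with probability drop_prob; keep at
    least one token so we never emit an empty sentence (our own guard)."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(tokens)]

z = "the quick brown fox jumps over the lazy dog".split()
z_tilde = drop_words(z)  # a truncated/"pathological" candidate for reference z
```

Applied on top of the mask-filled and back-translated examples, this yields pairs $(z,\tilde{z})$ that mimic omissions and truncations.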

4.2 Pre-Training Signals

4.2 预训练信号

The next step is to augment each sentence pair $(z,\tilde{z})$ with a set of pre-training signals $\{\tau_{k}\}$, where $\tau_{k}$ is the target vector of pre-training task $k$. Good pre-training signals should capture a wide variety of lexical and semantic differences. They should also be cheap to obtain, so that the approach can scale to large amounts of synthetic data. The following section presents our nine pre-training tasks, summarized in Table 1. Additional implementation details are in the Appendix.

下一步是为每个句子对 $(z,\tilde{z})$ 增强一组预训练信号 $\{\tau_{k}\}$,其中 $\tau_{k}$ 是预训练任务 $k$ 的目标向量。良好的预训练信号应能捕捉广泛的词汇和语义差异,同时还需易于获取,以便该方法能扩展到大量合成数据。以下部分介绍了我们的9个预训练任务,总结于表 1。更多实现细节见附录。

Automatic Metrics: We create three signals τBLEU, τROUGE, and τBERTscore with sentence BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BERTscore (Zhang et al., 2020) respectively (we use precision, recall and F-score for the latter two).

自动指标:我们分别使用句子BLEU (Papineni等人,2002)、ROUGE (Lin,2004) 和 BERTscore (Zhang等人,2020) 创建了三个信号τBLEU、τROUGE和τBERTscore(后两者采用精确率、召回率和F值)。
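To give a concrete flavor of these signals, the sketch below computes a simplified ROUGE-1-style precision/recall/F1 from clipped unigram overlap. The real τROUGE signal uses a full ROUGE implementation, so treat this as an illustration only:

```python
from collections import Counter

def unigram_prf(reference, candidate):
    """Clipped unigram overlap between tokenized sentences, returned as
    (precision, recall, F1) -- a simplified stand-in for the tau_ROUGE signal."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum((ref_counts & cand_counts).values())  # clipped match count
    p = overlap / len(candidate) if candidate else 0.0
    r = overlap / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = unigram_prf("the cat sat".split(), "the cat sat down".split())
# overlap = 3, so p = 3/4 and r = 3/3
```

Each such score becomes one regression target in the multitask pre-training loss.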

Back translation Likelihood: The idea behind this signal is to leverage existing translation models to measure semantic equivalence. Given a pair $(z,\tilde{z})$, this training signal measures the probability that $\tilde{z}$ is a back translation of $z$, $P(\tilde{z}|z)$, normalized by the length of $\tilde{z}$. Let $P_{\mathrm{en\to fr}}(z_{\mathrm{fr}}|z)$ be a translation model that assigns probabilities to French sentences $z_{\mathrm{fr}}$ conditioned on English sentences $z$ and let $P_{\mathrm{fr\to en}}(z|z_{\mathrm{fr}})$ be a translation model that assigns probabilities to English sentences given French sentences. If $|\tilde{z}|$ is the number of tokens in $\tilde{z}$, we define our score as $\tau_{\mathrm{en-fr},\tilde{z}|z}=\frac{\log P(\tilde{z}|z)}{|\tilde{z}|}$, with:

回译似然度:该信号背后的理念是利用现有翻译模型来衡量语义等价性。给定一对 $(z,\tilde{z})$,该训练信号衡量 $\tilde{z}$ 是 $z$ 的回译的概率 $P(\tilde{z}|z)$,并通过 $\tilde{z}$ 的长度进行归一化。设 $P_{\mathrm{en\to fr}}(z_{\mathrm{fr}}|z)$ 为在给定英语句子 $z$ 条件下为法语句子 $z_{\mathrm{fr}}$ 分配概率的翻译模型,$P_{\mathrm{fr\to en}}(z|z_{\mathrm{fr}})$ 为在给定法语句子条件下为英语句子分配概率的翻译模型。若 $|\tilde{z}|$ 表示 $\tilde{z}$ 中的 token 数量,则得分定义为 $\tau_{\mathrm{en-fr},\tilde{z}|z}=\frac{\log P(\tilde{z}|z)}{|\tilde{z}|}$,其中:

Table 1: Our pre-training signals.

| Task Type | Pre-training Signals | Loss Type |
| --- | --- | --- |
| BLEU | $\tau_{\mathrm{BLEU}}$ | Regression |
| ROUGE | $\tau_{\mathrm{ROUGE}} = (\tau_{\mathrm{ROUGE\text{-}P}}, \tau_{\mathrm{ROUGE\text{-}R}}, \tau_{\mathrm{ROUGE\text{-}F}})$ | Regression |
| BERTscore | $\tau_{\mathrm{BERTscore}} = (\tau_{\mathrm{BERTscore\text{-}P}}, \tau_{\mathrm{BERTscore\text{-}R}}, \tau_{\mathrm{BERTscore\text{-}F}})$ | Regression |
| Backtrans. likelihood | $\tau_{\mathrm{en\text{-}fr},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}fr},\tilde{z}\mid z}$, $\tau_{\mathrm{en\text{-}de},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}de},\tilde{z}\mid z}$ | Regression |
| Entailment | $\tau_{\mathrm{entail}} = (\tau_{\mathrm{Entail}}, \tau_{\mathrm{Contradict}}, \tau_{\mathrm{Neutral}})$ | Multiclass |
| Backtrans. flag | $\tau_{\mathrm{backtran\_flag}}$ | Multiclass |

表 1: 我们的预训练信号。

| Task Type | Pre-training Signals | Loss Type |
| --- | --- | --- |
| BLEU | $\tau_{\mathrm{BLEU}}$ | Regression |
| ROUGE | $\tau_{\mathrm{ROUGE}} = (\tau_{\mathrm{ROUGE\text{-}P}}, \tau_{\mathrm{ROUGE\text{-}R}}, \tau_{\mathrm{ROUGE\text{-}F}})$ | Regression |
| BERTscore | $\tau_{\mathrm{BERTscore}} = (\tau_{\mathrm{BERTscore\text{-}P}}, \tau_{\mathrm{BERTscore\text{-}R}}, \tau_{\mathrm{BERTscore\text{-}F}})$ | Regression |
| Backtrans. likelihood | $\tau_{\mathrm{en\text{-}fr},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}fr},\tilde{z}\mid z}$, $\tau_{\mathrm{en\text{-}de},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}de},\tilde{z}\mid z}$ | Regression |
| Entailment | $\tau_{\mathrm{entail}} = (\tau_{\mathrm{Entail}}, \tau_{\mathrm{Contradict}}, \tau_{\mathrm{Neutral}})$ | Multiclass |
| Backtrans. flag | $\tau_{\mathrm{backtran\_flag}}$ | Multiclass |

$$
P(\tilde{z}|z)=\sum_{z_{\mathrm{fr}}}P_{\mathrm{fr}\rightarrow\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}})P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)
$$

$$
P(\tilde{z}|z)=\sum_{z_{\mathrm{fr}}}P_{\mathrm{fr}\rightarrow\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}})P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)
$$

Because computing the summation over all possible French sentences is intractable, we approximate the sum using $z_{\mathrm{fr}}^{*}=\arg\max_{z_{\mathrm{fr}}} P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}|z)$ and we assume that $P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}^{*}|z)\approx 1$:

由于对所有可能的法语句子进行求和在计算上不可行,我们使用 $z_{\mathrm{fr}}^{*}=\arg\max_{z_{\mathrm{fr}}} P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}|z)$ 进行近似,并假设 $P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}^{*}|z)\approx 1$:

$$
P(\tilde{z}|z)\approx P_{\mathrm{fr}\to\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}}^{*})
$$

$$
P(\tilde{z}|z)\approx P_{\mathrm{fr}\to\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}}^{*})
$$

We can trivially reverse the procedure to compute $P(z|\tilde{z})$, thus we create 4 pre-training signals $\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}$ with two pairs of languages ($\mathrm{en}\leftrightarrow\mathrm{de}$ and $\mathrm{en}\leftrightarrow\mathrm{fr}$) in both directions.

我们可以轻松地逆向计算 $P(z|\tilde{z})$ ,从而创建4个预训练信号 $\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}$ ,涉及两对语言($\mathrm{en}\leftrightarrow\mathrm{de}$ 和 $\mathrm{en}\leftrightarrow\mathrm{fr}$)的双向转换。
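Numerically, each of these signals reduces to a length-normalized sum of token log-probabilities from the round-trip translation model. A minimal sketch, with hypothetical per-token log-probabilities standing in for a real NMT model:

```python
def backtranslation_signal(token_logprobs):
    """tau = log P(z~|z) / |z~|, where log P(z~|z) is approximated by the
    score P_fr->en(z~ | z_fr*) of the round-trip translation model, i.e. the
    sum of per-token log-probabilities of the back-translation z~.
    The numbers below are hypothetical; a real system queries an NMT model."""
    return sum(token_logprobs) / len(token_logprobs)

# A 4-token back-translation scored by the (imagined) fr->en model:
tau_en_fr = backtranslation_signal([-0.1, -0.5, -0.2, -0.2])
# tau_en_fr = (-1.0) / 4, i.e. about -0.25
```

Swapping which side of the pair is scored, and which language pair is used, yields the four signals listed above.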

Textual Entailment: The signal τentail expresses whether $_{z}$ entails or contradicts $\tilde{z}$ using a classifier. We report the probability of three labels: Entail, Contradict, and Neutral, using BERT finetuned on an entailment dataset, MNLI (Devlin et al., 2019; Williams et al., 2018).

文本蕴含:信号τentail通过分类器表达$_{z}$是否蕴含或矛盾$\tilde{z}$。我们使用在蕴含数据集MNLI上微调过的BERT (Devlin等人,2019;Williams等人,2018) ,报告三个标签的概率:蕴含、矛盾和中性。

Back translation flag: The signal τbacktran_flag is a Boolean that indicates whether the perturbation was generated with back translation or with mask-filling.

回译标志:信号τbacktran flag是一个布尔值,用于指示扰动是通过回译(back translation)还是掩码填充(maskfilling)生成的。

4.3 Modeling

4.3 建模

For each pre-training task, our model uses either a regression or a classification loss. We then aggregate the task-level losses with a weighted sum.

对于每个预训练任务,我们的模型采用回归或分类损失函数,并通过加权求和方式聚合任务级损失。

Let $\tau_{k}$ describe the target vector for each task, e.g., the probabilities for the classes Entail, Contradict, Neutral, or the precision, recall, and F-score for ROUGE. If $\tau_{k}$ is a regression task, then the loss used is the $\ell_{2}$ loss, i.e., $\ell_{k}=\Vert\tau_{k}-\hat{\tau}_{k}\Vert_{2}^{2}/|\tau_{k}|$ where $|\tau_{k}|$ is the dimension of $\tau_{k}$ and $\hat{\tau}_{k}$ is computed by using a task-specific linear layer on top of the [CLS] embedding: $\hat{\tau}_{k}=W_{\tau_{k}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{k}}$. If $\tau_{k}$ is a classification task, we use a separate linear layer to predict a logit for each class $c$: $\hat{\tau}_{kc}=W_{\tau_{kc}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{kc}}$, and we use the multiclass cross-entropy loss. We define our aggregate pre-training loss function as follows:

设 $\tau_{k}$ 表示每个任务的目标向量,例如类别 Entail、Contradict、Neutral 的概率,或 ROUGE 的精确率、召回率和 F 值。若 $\tau_{k}$ 是回归任务,则使用 $\ell_{2}$ 损失,即 $\ell_{k}=\Vert\tau_{k}-\hat{\tau}_{k}\Vert_{2}^{2}/|\tau_{k}|$,其中 $|\tau_{k}|$ 是 $\tau_{k}$ 的维度,$\hat{\tau}_{k}$ 通过在 [CLS] 嵌入上使用任务特定的线性层计算得到:$\hat{\tau}_{k}=W_{\tau_{k}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{k}}$。若 $\tau_{k}$ 是分类任务,则使用单独的线性层预测每个类别 $c$ 的 logit 值:$\hat{\tau}_{kc}=W_{\tau_{kc}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{kc}}$,并采用多类别交叉熵损失。我们定义聚合预训练损失函数如下:

$$
\ell_{\mathrm{pre-training}}=\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}\gamma_{k}\ell_{k}(\pmb{\tau}_ {k}^{m},\hat{\pmb{\tau}}_{k}^{m})
$$

$$
\ell_{\mathrm{pre-training}}=\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}\gamma_{k}\ell_{k}(\pmb{\tau}_ {k}^{m},\hat{\pmb{\tau}}_{k}^{m})
$$

where $\tau_{k}^{m}$ is the target vector for example $m$, $M$ is the number of synthetic examples, and $\gamma_{k}$ are hyperparameter weights obtained with grid search (more details in the Appendix).

其中 $\tau_{k}^{m}$ 是样本 $m$ 的目标向量,$M$ 是合成样本数量,$\gamma_{k}$ 是通过网格搜索获得的超参数权重 (更多细节见附录)。
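The aggregate loss above can be sketched in a few lines of numpy. The task losses mirror the definitions in this section, but the targets, predictions, and γ weights below are illustrative stand-ins, not the paper's actual values:

```python
import numpy as np

def regression_loss(tau, tau_hat):
    """l2 loss normalized by the dimension of the target vector: ||t - t^||^2 / |t|."""
    tau, tau_hat = np.asarray(tau, float), np.asarray(tau_hat, float)
    return float(np.sum((tau - tau_hat) ** 2) / tau.size)

def multiclass_loss(tau, logits):
    """Cross-entropy between target class probabilities and predicted logits."""
    logits = np.asarray(logits, float)
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(-np.sum(np.asarray(tau, float) * log_probs))

def pretraining_loss(examples, gammas):
    """(1/M) * sum_m sum_k gamma_k * l_k(tau_k^m, tau_hat_k^m).
    `examples` maps each synthetic example to {task: (loss_fn, target, prediction)}."""
    total = 0.0
    for tasks in examples:
        for k, (loss_fn, tau, tau_hat) in tasks.items():
            total += gammas[k] * loss_fn(tau, tau_hat)
    return total / len(examples)

gammas = {"rouge": 1.0, "entail": 0.5}  # hypothetical grid-searched weights
examples = [
    {"rouge": (regression_loss, [0.7, 0.6, 0.65], [0.5, 0.5, 0.5]),
     "entail": (multiclass_loss, [1.0, 0.0, 0.0], [2.0, 0.1, -1.0])},
]
loss = pretraining_loss(examples, gammas)
```

In the real model, each prediction comes from its own linear head on the shared [CLS] embedding, so all tasks update the same encoder.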

5 Experiments

5 实验

In this section, we report our experimental results for two tasks, translation and data-to-text. First, we benchmark BLEURT against existing text generation metrics on the last 3 years of the WMT Metrics Shared Task (Bojar et al., 2017). We then evaluate its robustness to quality drifts with a series of synthetic datasets based on WMT17. We test BLEURT’s ability to adapt to different tasks with the WebNLG 2017 Challenge Dataset (Gardent et al., 2017). Finally, we measure the contribution of each pre-training task with ablation experiments.

在本节中,我们报告了翻译和数据到文本两项任务的实验结果。首先,我们在WMT Metrics Shared Task近三年的数据上 (Bojar et al., 2017) 将BLEURT与现有文本生成指标进行基准测试。随后基于WMT17构建的合成数据集序列,评估其对质量漂移的鲁棒性。通过WebNLG 2017挑战赛数据集 (Gardent et al., 2017) 测试BLEURT适应不同任务的能力。最后通过消融实验量化各预训练任务的贡献度。

Our Models: Unless specified otherwise, all BLEURT models are trained in three steps: regular BERT pre-training (Devlin et al., 2019), pre-training on synthetic data (as explained in Section 4), and fine-tuning on task-specific ratings (translation and/or data-to-text). We experiment with two versions of BLEURT, BLEURT and BLEURTbase, respectively based on BERT-Large (24 layers, 1024 hidden units, 16 heads) and BERT-Base (12 layers, 768 hidden units, 12 heads) (Devlin et al., 2019), both uncased. We use batch size 32, learning rate 1e-5, and 800,000 steps for pre-training and 40,000 steps for fine-tuning. We provide the full detail of our training setup in the Appendix.

我们的模型:除非另有说明,所有BLEURT模型都经过三个步骤训练:常规BERT预训练 (Devlin et al., 2019)、合成数据预训练(如第4节所述)以及针对特定任务评分(翻译和/或数据到文本)的微调。我们实验了两种版本的BLEURT,分别是基于BERT-Large(24层,1024个隐藏单元,16个头)的BLEURT和基于BERT-Base(12层,768个隐藏单元,12个头) (Devlin et al., 2019) 的BLEURTbase,两者均为uncased版本。我们使用批量大小32、学习率1e-5进行预训练80万步和微调4万步。完整的训练设置细节见附录。

Table 2: Agreement with human ratings on the WMT17 Metrics Shared Task. The metrics are Kendall Tau $(\tau)$ and the Pearson correlation ($r$, the official metric of the shared task), divided by 100.

| model | cs-en τ/r | de-en τ/r | fi-en τ/r | lv-en τ/r | ru-en τ/r | tr-en τ/r | zh-en τ/r | avg τ/r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 29.6/43.2 | 28.9/42.2 | 38.6/56.0 | 23.9/38.2 | 34.3/47.7 | 34.3/54.0 | 37.4/51.3 | 32.4/47.5 |
| MoverScore | 47.6/67.0 | 51.2/70.8 | NA | NA | 53.4/73.8 | 56.1/76.2 | 53.1/74.4 | 52.3/72.4 |
| BERTscore w/ BERT | 48.0/66.6 | 50.3/70.1 | 61.4/81.4 | 51.6/72.3 | 53.7/73.0 | 55.6/76.0 | 52.2/73.1 | 53.3/73.2 |
| BERTscore w/ roBERTa | 54.2/72.6 | 56.9/76.0 | 64.8/83.2 | 56.2/75.7 | 57.2/75.2 | 57.9/76.1 | 58.8/78.9 | 58.0/76.8 |
| chrF++ | 35.0/52.3 | 36.5/53.4 | 47.5/67.8 | 33.3/52.0 | 41.5/58.8 | 43.2/61.4 | 40.5/59.3 | 39.6/57.9 |
| BEER | 34.0/51.1 | 36.1/53.0 | 48.3/68.1 | 32.8/51.5 | 40.2/57.7 | 42.8/60.0 | 39.5/58.2 | 39.1/57.1 |
| BLEURTbase -pre | 51.5/68.2 | 52.0/70.7 | 66.6/85.1 | 60.8/80.5 | 57.5/77.7 | 56.9/76.0 | 52.1/72.1 | 56.8/75.8 |
| BLEURTbase | 55.7/73.4 | 56.3/75.7 | 68.0/86.8 | 64.7/83.3 | 60.1/80.1 | 62.4/81.7 | 59.5/80.5 | 61.0/80.2 |
| BLEURT -pre | 56.0/74.7 | 57.1/75.7 | 67.2/86.1 | 62.3/81.7 | 58.4/78.3 | 61.6/81.4 | 55.9/76.5 | 59.8/79.2 |
| BLEURT | 59.3/77.3 | 59.9/79.2 | 69.5/87.8 | 64.4/83.5 | 61.3/81.1 | 62.9/82.4 | 60.2/81.4 | 62.5/81.8 |

表 2: WMT17 指标共享任务中与人工评分的相关性。指标为 Kendall Tau $(\tau)$ 和 Pearson 相关系数 $\cdot_{r_{\mathrm{{i}}}}$ (该共享任务的官方指标),数值已除以 100。

| model | cs-en τ/r | de-en τ/r | fi-en τ/r | lv-en τ/r | ru-en τ/r | tr-en τ/r | zh-en τ/r | avg τ/r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 29.6/43.2 | 28.9/42.2 | 38.6/56.0 | 23.9/38.2 | 34.3/47.7 | 34.3/54.0 | 37.4/51.3 | 32.4/47.5 |
| MoverScore | 47.6/67.0 | 51.2/70.8 | NA | NA | 53.4/73.8 | 56.1/76.2 | 53.1/74.4 | 52.3/72.4 |
| BERTscore w/ BERT | 48.0/66.6 | 50.3/70.1 | 61.4/81.4 | 51.6/72.3 | 53.7/73.0 | 55.6/76.0 | 52.2/73.1 | 53.3/73.2 |
| BERTscore w/ roBERTa | 54.2/72.6 | 56.9/76.0 | 64.8/83.2 | 56.2/75.7 | 57.2/75.2 | 57.9/76.1 | 58.8/78.9 | 58.0/76.8 |
| chrF++ | 35.0/52.3 | 36.5/53.4 | 47.5/67.8 | 33.3/52.0 | 41.5/58.8 | 43.2/61.4 | 40.5/59.3 | 39.6/57.9 |
| BEER | 34.0/51.1 | 36.1/53.0 | 48.3/68.1 | 32.8/51.5 | 40.2/57.7 | 42.8/60.0 | 39.5/58.2 | 39.1/57.1 |
| BLEURTbase -pre | 51.5/68.2 | 52.0/70.7 | 66.6/85.1 | 60.8/80.5 | 57.5/77.7 | 56.9/76.0 | 52.1/72.1 | 56.8/75.8 |
| BLEURTbase | 55.7/73.4 | 56.3/75.7 | 68.0/86.8 | 64.7/83.3 | 60.1/80.1 | 62.4/81.7 | 59.5/80.5 | 61.0/80.2 |
| BLEURT -pre | 56.0/74.7 | 57.1/75.7 | 67.2/86.1 | 62.3/81.7 | 58.4/78.3 | 61.6/81.4 | 55.9/76.5 | 59.8/79.2 |
| BLEURT | 59.3/77.3 | 59.9/79.2 | 69.5/87.8 | 64.4/83.5 | 61.3/81.1 | 62.9/82.4 | 60.2/81.4 | 62.5/81.8 |
| model | cs-en τ/DA | de-en τ/DA | et-en τ/DA | fi-en τ/DA | ru-en τ/DA | tr-en τ/DA | zh-en τ/DA | avg τ/DA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 20.0/22.5 | 31.6/41.5 | 26.0/28.2 | 17.1/15.6 | 20.5/22.4 | 22.9/13.6 | 21.6/17.6 | 22.8/23.2 |
| BERTscore w/ BERT | 29.5/40.0 | 39.9/53.8 | 34.7/39.0 | 26.0/29.7 | 27.8/34.7 | 31.7/27.5 | 27.5/25.2 | 31.0/35.7 |
| BERTscore w/ roBERTa | 31.2/41.1 | 42.2/55.5 | 37.0/40.3 | 27.8/30.8 | 30.2/35.4 | 32.8/30.2 | 29.2/26.3 | 32.9/37.1 |
| Meteor++ | 22.4/26.8 | 34.7/45.7 | 29.7/32.9 | 21.6/20.6 | 22.8/25.3 | 27.3/20.4 | 23.6/17.5* | 26.0/27.0 |
| RUSE | 27.0/34.5 | 36.1/49.8 | 32.9/36.8 | 25.5/27.5 | 25.0/31.1 | 29.1/25.9 | 24.6/21.5* | 28.6/32.4 |
| YiSi1 | 23.5/31.7 | 35.5/48.8 | 30.2/35.1 | 21.5/23.1 | 23.3/30.0 | 26.8/23.4 | 23.1/20.9 | 26.3/30.4 |
| YiSi1 SRL | 23.3/31.5 | 34.3/48.3 | 29.8/34.5 | 21.2/23.7 | 22.6/30.6 | 26.1/23.3 | 22.9/20.7 | 25.7/30.4 |
| BLEURTbase -pre | 33.0/39.0 | 41.5/54.6 | 38.2/39.6 | 30.7/31.1 | 30.7/34.9 | 32.9/29.8 | 28.3/25.6 | 33.6/36.4 |
| BLEURTbase | 34.5/42.9 | 43.5/55.6 | 39.2/40.5 | 31.5/30.9 | 31.0/35.7 | 35.0/29.4 | 29.6/26.9 | 34.9/37.4 |
| BLEURT -pre | 34.5/42.1 | 42.7/55.4 | 39.2/40.6 | 31.4/31.6 | 31.4/34.2 | 33.4/29.3 | 28.9/25.6 | 34.5/37.0 |
| BLEURT | 35.6/42.3 | 44.2/56.7 | 40.0/41.4 | 32.1/32.5 | 31.9/36.0 | 35.5/31.5 | 29.7/26.0 | 35.6/38.1 |
| 模型 | cs-en τ/DA | de-en τ/DA | et-en τ/DA | fi-en τ/DA | ru-en τ/DA | tr-en τ/DA | zh-en τ/DA | avg τ/DA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 20.0/22.5 | 31.6/41.5 | 26.0/28.2 | 17.1/15.6 | 20.5/22.4 | 22.9/13.6 | 21.6/17.6 | 22.8/23.2 |
| BERTscore w/ BERT | 29.5/40.0 | 39.9/53.8 | 34.7/39.0 | 26.0/29.7 | 27.8/34.7 | 31.7/27.5 | 27.5/25.2 | 31.0/35.7 |
| BERTscore w/ roBERTa | 31.2/41.1 | 42.2/55.5 | 37.0/40.3 | 27.8/30.8 | 30.2/35.4 | 32.8/30.2 | 29.2/26.3 | 32.9/37.1 |
| Meteor++ | 22.4/26.8 | 34.7/45.7 | 29.7/32.9 | 21.6/20.6 | 22.8/25.3 | 27.3/20.4 | 23.6/17.5* | 26.0/27.0 |
| RUSE | 27.0/34.5 | 36.1/49.8 | 32.9/36.8 | 25.5/27.5 | 25.0/31.1 | 29.1/25.9 | 24.6/21.5* | 28.6/32.4 |
| YiSi1 | 23.5/31.7 | 35.5/48.8 | 30.2/35.1 | 21.5/23.1 | 23.3/30.0 | 26.8/23.4 | 23.1/20.9 | 26.3/30.4 |
| YiSi1 SRL | 23.3/31.5 | 34.3/48.3 | 29.8/34.5 | 21.2/23.7 | 22.6/30.6 | 26.1/23.3 | 22.9/20.7 | 25.7/30.4 |
| BLEURTbase -pre | 33.0/39.0 | 41.5/54.6 | 38.2/39.6 | 30.7/31.1 | 30.7/34.9 | 32.9/29.8 | 28.3/25.6 | 33.6/36.4 |
| BLEURTbase | 34.5/42.9 | 43.5/55.6 | 39.2/40.5 | 31.5/30.9 | 31.0/35.7 | 35.0/29.4 | 29.6/26.9 | 34.9/37.4 |
| BLEURT -pre | 34.5/42.1 | 42.7/55.4 | 39.2/40.6 | 31.4/31.6 | 31.4/34.2 | 33.4/29.3 | 28.9/25.6 | 34.5/37.0 |
| BLEURT | 35.6/42.3 | 44.2/56.7 | 40.0/41.4 | 32.1/32.5 | 31.9/36.0 | 35.5/31.5 | 29.7/26.0 | 35.6/38.1 |

Table 3: Agreement with human ratings on the WMT18 Metrics Shared Task. The metrics are Kendall Tau $(\tau)$ and WMT’s Direct Assessment metrics divided by 100. The star * indicates results that are more than 0.2 percentage points away from the official WMT results (up to 0.4 percentage points away).

表 3: WMT18指标共享任务中与人类评分的相关性。指标为肯德尔tau系数$(\tau)$和WMT直接评估指标(除以100)。星号*表示与官方WMT结果相差超过0.2个百分点(最大相差0.4个百分点)。

5.1 WMT Metrics Shared Task

5.1 WMT 指标共享任务

Datasets and Metrics: We use years 2017 to 2019 of the WMT Metrics Shared Task, to-English language pairs. For each year, we used the official WMT test set, which includes several thousand pairs of sentences with human ratings from the news domain. The training sets contain 5,360, 9,492, and 147,691 records for each year. The test sets for years 2018 and 2019 are noisier, as reported by the organizers and shown by the overall lower correlations.

数据集与评估指标:我们使用2017至2019年WMT指标共享任务的英译语言对数据。每年采用官方WMT测试集,包含新闻领域数千组人工评分的句对。训练集分别包含5,360、9,492和147,691条记录。如组织方所述,2018和2019年测试集噪声更大,整体相关性指标较低也印证了这一点。

We evaluate the agreement between the automatic metrics and the human ratings. For each year, we report two metrics: Kendall’s Tau $\tau$ (for consistency across experiments), and the official WMT metric for that year (for completeness). The official WMT metric is either Pearson’s correlation or a robust variant of Kendall’s Tau called DARR, described in the Appendix. All the numbers come from our own implementation of the benchmark.4 Our results are globally consistent with the official results but we report small differences in 2018 and 2019, marked in the tables.

我们评估了自动指标与人工评分之间的一致性。每年报告两个指标:Kendall's Tau $\tau$(用于实验间一致性)以及该年度的官方WMT指标(用于完整性)。官方WMT指标采用皮尔逊相关系数或称为DARR的Kendall's Tau鲁棒变体(详见附录)。所有数据均基于我们自主实现的基准测试得出。4 我们的结果与官方结果整体一致,但在2018和2019年存在微小差异(表中已标注)。
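As a point of reference, the segment-level agreement can be sketched in a few lines. This is an illustrative sketch, not the paper's benchmark implementation: ties are skipped for simplicity, and the official benchmark differs in its handling of noisy judgments.

```python
from itertools import combinations

def kendall_tau(metric_scores, human_ratings):
    """Kendall's Tau: (concordant - discordant) / all compared pairs.
    Ties are skipped for simplicity; the official benchmark differs."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_ratings), 2):
        s = (m1 - m2) * (h1 - h2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# The metric inverts the order of the first two segments: 4 of 6 pairs agree.
print(kendall_tau([1.0, 3.0, 2.0, 4.0], [0.4, 0.1, 0.5, 0.9]))  # -> (4-2)/6 = 0.333...
```

DARR, the robust variant used in 2018 and 2019, additionally discards pairs of translations whose human scores are too close (see the Appendix).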

Models: We experiment with four versions of BLEURT: BLEURT, BLEURTbase, BLEURT-pre, and BLEURTbase-pre. The first two models are based on BERT-large and BERT-base. In the latter two versions, we skip the pre-training phase and fine-tune directly on the WMT ratings. For each year of the WMT shared task, we use the test sets from the previous years for training and validation. We describe our setup in further detail in the Appendix. We compare BLEURT to participant data from the shared task and to automatic metrics that we ran ourselves. In the former case, we use the best-performing contestants for each year, that is, chrF$^{++}$, BEER, Meteor$^{++}$, RUSE, Yisi1, ESIM and Yisi1-SRL (Mathur et al., 2019). All the contestants use the same WMT training data, in addition to existing sentence or token embeddings. In the latter case, we use Moses sentence BLEU, BERTscore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). For BERTscore, we use BERT-large uncased for fairness, and roBERTa (the recommended version) for completeness (Liu et al., 2019). We run MoverScore on WMT 2017 using the scripts published by the authors.

模型:我们测试了BLEURT的四个版本:BLEURT、BLEURTbase、BLEURT-pre和BLEURTbase-pre。前两个模型基于BERT-large和BERT-base,后两个版本跳过预训练阶段,直接在WMT评分数据上进行微调。针对WMT共享任务的每一年份,我们使用前几年的测试集进行训练和验证。附录详细描述了实验设置。我们将BLEURT与共享任务的参赛数据及自行运行的自动指标进行对比:前者选取历年表现最佳的参赛系统(包括chrF$^{++}$、BEER、Meteor$^{++}$、RUSE、Yisi1、ESIM和Yisi1-SRL (Mathur et al., 2019)),这些系统除使用WMT训练数据外,还结合了现有句子或token嵌入;后者采用Moses句子BLEU、BERTscore (Zhang et al., 2020)和MoverScore (Zhao et al., 2019)。为保持公平性,BERTscore统一使用BERT-large uncased版本,同时为完整性补充roBERTa(推荐版本)(Liu et al., 2019)。MoverScore基于作者发布的脚本在WMT 2017数据上运行。

Results: Tables 2, 3, 4 show the results. For years 2017 and 2018, a BLEURT-based metric

结果:表2、表3、表4展示了结果。对于2017年和2018年,基于BLEURT的指标

model de-en T/DA fi-en T/DA gu-en T/DA kk-en T/DA lt-en T/DA ru-en T/DA zh-en T/DA avg T/DA
sentBLEU 19.4/5.4 20.6/23.3 17.3/18.9 30.0/37.6 23.8/26.2 19.4/12.4 28.7/32.2 22.7/22.3
BERTscore w/ BERT 26.2/17.3 27.6/34.7 25.8/29.3 36.9/44.0 30.8/37.4 25.2/20.6 37.5/41.4 30.0/32.1
BERTscore w/ roBERTa 29.1/19.3 29.7/35.3 27.7/32.4 37.1/43.1 32.6/38.2 26.3/22.7 41.4/43.8 32.0/33.6
ESIM 28.4/16.6 28.9/33.7 27.1/30.4 38.4/43.3 33.2/35.9 26.6/19.9 38.7/39.6 31.6/31.3
YiSi1SRL19 26.3/19.8 27.8/34.6 26.6/30.6 36.9/44.1 30.9/38.0 25.3/22.0 38.9/43.1 30.4/33.2
BLEURTbase-pre 30.1/15.8 30.4/35.4 26.8/29.7 37.8/41.8 34.2/39.0 27.0/20.7 40.1/39.8 32.3/31.7
BLEURTbase 31.0/16.6 31.3/36.2 27.9/30.6 39.5/44.6 35.2/39.4 28.5/21.5 41.7/41.6 33.6/32.9
BLEURT-pre 31.1/16.9 31.3/36.5 27.6/31.3 38.4/42.8 35.0/40.0 27.5/21.4 41.6/41.4 33.2/32.9
BLEURT 31.2/16.9 31.7/36.3 28.3/31.9 39.5/44.6 35.2/40.6 28.3/22.3 42.7/42.4 33.8/33.6
模型 德英 T/DA 芬英 T/DA 古英 T/DA 哈英 T/DA 立英 T/DA 俄英 T/DA 中英 T/DA 平均 T/DA
sentBLEU 19.4/5.4 20.6/23.3 17.3/18.9 30.0/37.6 23.8/26.2 19.4/12.4 28.7/32.2 22.7/22.3
BERTscorew/BERT 26.2/17.3 27.6/34.7 25.8/29.3 36.9/44.0 30.8/37.4 25.2/20.6 37.5/41.4 30.0/32.1
BERTscorew/roBERTa 29.1/19.3 29.7/35.3 27.7/32.4 37.1/43.1 32.6/38.2 26.3/22.7 41.4/43.8 32.0/33.6
ESIM 28.4/16.6 28.9/33.7 27.1/30.4 38.4/43.3 33.2/35.9 26.6/19.9 38.7/39.6 31.6/31.3
YiSi1SRL19 26.3/19.8 27.8/34.6 26.6/30.6 36.9/44.1 30.9/38.0 25.3/22.0 38.9/43.1 30.4/33.2
BLEURTbase-pre 30.1/15.8 30.4/35.4 26.8/29.7 37.8/41.8 34.2/39.0 27.0/20.7 40.1/39.8 32.3/31.7
BLEURTbase 31.0/16.6 31.3/36.2 27.9/30.6 39.5/44.6 35.2/39.4 28.5/21.5 41.7/41.6 33.6/32.9
BLEURT-pre 31.1/16.9 31.3/36.5 27.6/31.3 38.4/42.8 35.0/40.0 27.5/21.4 41.6/41.4 33.2/32.9
BLEURT 31.2/16.9 31.7/36.3 28.3/31.9 39.5/44.6 35.2/40.6 28.3/22.3 42.7/42.4 33.8/33.6

Table 4: Agreement with human ratings on the WMT19 Metrics Shared Task. The metrics are Kendall Tau $(\tau)$ and WMT’s Direct Assessment metrics divided by 100. All the values reported for Yisi1 SRL and ESIM fall within 0.2 percentage of the official WMT results.

表 4: WMT19指标共享任务中与人类评分的吻合度。指标为肯德尔系数$(\tau)$和WMT直接评估指标(除以100)。Yisi1 SRL和ESIM报告的所有数值与官方WMT结果相差均在0.2个百分点以内。


Figure 1: Distribution of the human ratings in the train/validation and test datasets for different skew factors.

图 1: 不同偏斜因子下训练/验证集与测试集中人工评分的分布情况。

dominates the benchmark for each language pair (Tables 2 and 3). BLEURT and BLEURTbase are also competitive for year 2019: they yield the best results for all language pairs on Kendall's Tau, and they come first for 3 out of 7 pairs on DARR. As expected, BLEURT dominates BLEURTbase in the majority of cases. Pre-training consistently improves the results of BLEURT and BLEURTbase. We observe the largest effect on year 2017, where it adds up to 7.4 Kendall Tau points for BLEURTbase (zh-en). The effect is milder on years 2018 and 2019, up to 2.1 points (tr-en, 2018). We explain the difference by the fact that the training data used for 2017 is smaller than the datasets used for the following years, so pre-training is likelier to help. In general, pre-training yields higher returns for BERT-base than for BERT-large; in fact, BLEURTbase with pre-training is often better than BLEURT without it.

在每个语言对的基准测试中占据主导地位(表2和表3)。BLEURT和BLEURTbase在2019年也表现优异:它们在Kendall's Tau上对所有语言对都取得了最佳结果,并在DARR的7个语言对中有3个排名第一。正如预期,BLEURT在大多数情况下优于BLEURTbase。预训练持续提升了BLEURT和BLEURTbase的结果。我们观察到对2017年的影响最大,其中BLEURTbase在zh-en上增加了高达7.4个Kendall Tau点。对2018和2019年的影响较小,最高为2.1点(tr-en, 2018)。我们用2017年训练数据比后续年份数据集更小这一事实来解释这一差异:数据越少,预训练越可能有所帮助。总体而言,预训练对BERT-base的回报高于BERT-large——事实上,经过预训练的BLEURTbase通常优于未经预训练的BLEURT。

Takeaways: Pre-training delivers consistent improvements, especially for BLEURT-base. BLEURT yields state-of-the art performance for all years of the WMT Metrics Shared task.

要点:预训练能带来一致的性能提升,尤其对BLEURT-base模型效果显著。BLEURT在WMT指标共享任务历年数据中均展现出最先进的性能表现。

5.2 Robustness to Quality Drift

5.2 质量漂移的鲁棒性

We assess our claim that pre-training makes BLEURT robust to quality drifts, by constructing a series of tasks for which it is increasingly pressured to extrapolate. All the experiments that follow are based on the WMT Metrics Shared Task 2017, because the ratings for this edition are particularly reliable.5

我们通过构建一系列任务来验证预训练使BLEURT对质量漂移具有鲁棒性的主张,这些任务逐渐增加其外推压力。以下所有实验均基于WMT 2017年度量共享任务,因为该届评分数据尤为可靠。5


Figure 2: Agreement between BLEURT and human ratings for different skew factors in train and test.


图 2: BLEURT评分与人类评分在不同训练和测试偏斜因子下的相关性。

Methodology: We create increasingly challenging datasets by sub-sampling the records from the WMT Metrics shared task, keeping low-rated translations for training and high-rated translations for test. The key parameter is the skew factor $\alpha$, which measures how much the training data is left-skewed and the test data is right-skewed. Figure 1 shows the ratings distributions that we used in our experiments. The training data shrinks as $\alpha$ increases: in the most extreme case ($\alpha=3.0$), we use only 11.9% of the original 5,344 training records. We give the full detail of our sampling methodology in the Appendix.

方法:我们通过对WMT Metrics共享任务中的记录进行子采样来创建难度递增的数据集,保留低评分翻译用于训练,高评分翻译用于测试。关键参数是衡量训练数据左偏程度和测试数据右偏程度的偏斜因子$\alpha$。图1展示了实验中使用的评分分布。随着$\alpha$增大,训练数据量减少:在最极端情况下($\alpha=3.0$),仅使用原始5,344条训练记录中的11.9%。完整采样方法详见附录。

We use BLEURT with and without pre-training and we compare to Moses sentBLEU and BERTscore. We use BERT-large uncased for both BLEURT and BERTscore.

我们对比了使用预训练和未预训练的BLEURT,以及Moses sentBLEU和BERTscore。BLEURT和BERTscore均采用BERT-large uncased模型。


Figure 3: Absolute Kendall Tau of BLEU, Meteor, and BLEURT with human judgements on the WebNLG dataset, varying the size of the data used for training and validation.

图 3: BLEU、Meteor 和 BLEURT 在 WebNLG 数据集上与人类判断的绝对肯德尔 Tau 值,随训练和验证数据规模变化的对比。

Results: Figure 2 presents BLEURT’s performance as we vary the train and test skew independently. Our first observation is that the agreements fall for all metrics as we increase the test skew. This effect was already described in the 2019 WMT Metrics report (Ma et al., 2019). A common explanation is that the task gets more difficult as the ratings get closer—it is easier to discriminate between “good” and “bad” systems than to rank “good” systems.

结果:图 2 展示了当训练集和测试集的偏斜度独立变化时 BLEURT 的表现。我们首先观察到,随着测试集偏斜度的增加,所有指标的吻合度均下降。这一现象已在 2019 年 WMT 指标报告 (Ma et al., 2019) 中被描述过。常见的解释是,当评分越接近时任务会变得更困难——区分"好"和"差"系统比给"好"系统排序更容易。

Training skew has a disastrous effect on BLEURT without pre-training: it is below BERTscore for $\alpha=1.0$, and it falls under sentBLEU for $\alpha\geq1.5$. Pre-trained BLEURT is much more robust: the only case in which it falls under the baselines is $\alpha=3.0$, the most extreme drift, for which incorrect translations are used for train while excellent ones for test.

训练偏差对未经预训练的BLEURT具有灾难性影响:当$\alpha~=~1.0$时其表现低于BERTscore,当$\alpha\geq1.5$时甚至逊于sentBLEU。经过预训练的BLEURT则稳健得多:仅在$\alpha=3.0$这种最极端的偏移情况下(训练使用错误翻译而测试使用优质翻译)会低于基线指标。

Takeaways: Pre-training makes BLEURT significantly more robust to quality drifts.

要点:预训练使BLEURT对质量漂移的鲁棒性显著提升。

5.3 WebNLG Experiments

5.3 WebNLG 实验

In this section, we evaluate BLEURT’s performance on three tasks from a data-to-text dataset, the WebNLG Challenge 2017 (Shimorina et al., 2019). The aim is to assess BLEURT’s capacity to adapt to new tasks with limited training data.

在本节中,我们评估BLEURT在WebNLG Challenge 2017 (Shimorina et al., 2019) 数据到文本数据集上的三项任务表现,旨在测试BLEURT在有限训练数据下适应新任务的能力。

Dataset and Evaluation Tasks: The WebNLG challenge benchmarks systems that produce natural language descriptions of entities (e.g., buildings, cities, artists) from sets of 1 to 5 RDF triples. The organizers released the human assessments for 9 systems over 223 inputs, that is, 4,677 sentence pairs in total (we removed null values). Each input comes with 1 to 3 reference descriptions. The submissions are evaluated on 3 aspects: semantics, grammar, and fluency. We treat each type of rating as a separate modeling task. The data has no natural split between train and test, therefore we experiment with several schemes. We allocate 0% to about 50% of the data to training, and we split on either the evaluated systems or the RDF inputs in order to test different generalization regimes.

数据集与评估任务:
WebNLG挑战赛旨在评测系统根据1到5个RDF三元组生成实体(如建筑、城市、艺术家)自然语言描述的能力。主办方发布了针对223组输入(共4,677对句子,已剔除空值)的9个系统人工评估结果,每组输入包含1至3条参考描述。系统提交结果从语义、语法和流畅性三个维度进行评估,我们将每种评分类型视为独立建模任务。由于数据未预设训练集与测试集的划分,我们尝试了多种方案:将0%至约50%数据分配至训练集,并通过对被评估系统或RDF输入进行划分来测试不同泛化场景。

Systems and Baselines: BLEURT-pre-wmt is a public BERT-large uncased checkpoint directly trained on the WebNLG ratings. BLEURT-wmt was first pre-trained on synthetic data, then fine-tuned on WebNLG data. BLEURT was trained in three steps: first on synthetic data, then on WMT data (16-18), and finally on WebNLG data. When a record comes with several references, we run BLEURT on each reference and report the highest value (Zhang et al., 2020).

系统与基线:
BLEURT-pre-wmt 是一个基于 WebNLG 评分直接训练的公开 BERT-large uncased 检查点。BLEURT-wmt 首先在合成数据上进行预训练,随后在 WebNLG 数据上微调。BLEURT 的训练分为三步:先在合成数据上训练,再在 WMT (16-18) 数据上训练,最后在 WebNLG 数据上训练。当一条记录包含多个参考时,我们对每个参考运行 BLEURT 并报告最高值 (Zhang et al., 2020)。
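The multi-reference handling amounts to a max over per-reference scores. A minimal sketch, where `overlap_score` is a hypothetical stand-in for the learned metric:

```python
def multi_reference_score(score_fn, candidate, references):
    """Score the candidate against every reference and keep the maximum,
    mirroring how records with several reference descriptions are handled."""
    return max(score_fn(candidate, ref) for ref in references)

def overlap_score(candidate, reference):
    """Hypothetical stand-in metric: fraction of reference tokens covered."""
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / max(len(ref), 1)

references = ["the tall tower", "a very tall tower located in Paris"]
print(multi_reference_score(overlap_score, "the tall tower", references))  # -> 1.0
```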

We report four baselines: BLEU, TER, Meteor, and BERTscore. The first three were computed by the WebNLG competition organizers. We ran the latter one ourselves, using BERTlarge uncased for a fair comparison.

我们报告了四个基线指标:BLEU、TER、Meteor和BERTscore。前三个指标由WebNLG竞赛组织者计算得出。为确保公平比较,我们自行运行了最后一个指标(使用BERTlarge uncased模型)。

Results: Figure 3 presents the correlation of the metrics with human assessments as we vary the share of data allocated to training. The more pre-trained BLEURT is, the quicker it adapts. The vanilla BERT approach BLEURT-pre-wmt requires about a third of the WebNLG data to dominate the baselines on the majority of tasks, and it still lags behind on semantics (split by system). In contrast, BLEURT-wmt is competitive with as little as 836 records, and BLEURT is comparable with BERTscore with zero fine-tuning.

结果:图3展示了随着分配给训练的数据比例变化,各指标与人工评估的相关性。预训练越充分的BLEURT模型适应速度越快。原始BERT方法BLEURT-pre-wmt需要约三分之一的WebNLG数据才能在多数任务上超越基线,但在语义指标(按系统划分)上仍表现欠佳。相比之下,BLEURT-wmt仅需836条记录即可达到竞争力,而未经微调的BLEURT与BERTscore表现相当。


Figure 4: Improvement in Kendall Tau on WMT 17 varying the pre-training tasks.

图 4: WMT 17上Kendall Tau随预训练任务变化的改进情况

Takeaways: Thanks to pre-training, BLEURT can quickly adapt to the new tasks. BLEURT finetuned twice (first on synthetic data, then on WMT data) provides acceptable results on all tasks without training data.

要点:得益于预训练,BLEURT能够快速适应新任务。经过两次微调(先是在合成数据上,然后在WMT数据上)的BLEURT,在所有任务上无需训练数据即可提供可接受的结果。

5.4 Ablation Experiments

5.4 消融实验

Figure 4 presents our ablation experiments on WMT 2017, which highlight the relative importance of each pre-training task. On the left side, we compare BLEURT pre-trained on a single task to BLEURT without pre-training. On the right side, we compare full BLEURT to BLEURT pre-trained on all tasks except one. Pre-training on BERTscore, entailment, and the back translation scores yields improvements (symmetrically, ablating them degrades BLEURT). In contrast, BLEU and ROUGE have a negative impact. We conclude that pre-training on high-quality signals helps BLEURT, but that metrics that correlate less well with human judgment may in fact harm the model.6

图4展示了我们在WMT 2017上的消融实验,突出了每个预训练任务的相对重要性。左侧比较了在单一任务上预训练的BLEURT与未预训练的BLEURT;右侧比较了完整版BLEURT与排除某一项任务后预训练的BLEURT。基于BERTscore、蕴含任务和回译得分的预训练能带来提升(对称地,移除它们会降低BLEURT性能),而BLEU和ROUGE则产生负面影响。我们得出结论:基于高质量信号的预训练有助于BLEURT,但与人类判断相关性较低的指标实际上可能损害模型性能。6

6 Related Work

6 相关工作

The WMT shared metrics competition (Bojar et al., 2016; Ma et al., 2018, 2019) has inspired the creation of many learned metrics, some of which use regression or deep learning (Stanojevic and Simaan, 2014; Ma et al., 2017; Shimanaka et al., 2018; Chen et al., 2017; Mathur et al., 2019). Other metrics have been introduced, such as the recent MoverScore (Zhao et al., 2019) which combines contextual embeddings and Earth Mover’s Distance. We provide a head-to-head comparison with the best performing of those in our experiments. Other approaches do not attempt to estimate quality directly, but use information extraction or question answering as a proxy (Wiseman et al., 2017; Goodrich et al., 2019; Eyal et al., 2019). Those are complementary to our work.

WMT共享指标竞赛 (Bojar等人,2016;Ma等人,2018,2019) 催生了许多基于学习的指标,其中一些采用回归或深度学习技术 (Stanojevic和Simaan,2014;Ma等人,2017;Shimanaka等人,2018;Chen等人,2017;Mathur等人,2019)。其他指标也被提出,例如结合上下文嵌入和推土机距离 (Earth Mover's Distance) 的最新MoverScore (Zhao等人,2019)。我们在实验中与表现最佳的指标进行了直接对比。另一些方法不直接评估质量,而是采用信息抽取或问答作为代理指标 (Wiseman等人,2017;Goodrich等人,2019;Eyal等人,2019),这些方法与我们的工作形成互补。

There has been recent work that uses BERT for evaluation. BERTScore (Zhang et al., 2020) proposes replacing the hard n-gram overlap of BLEU with a soft-overlap using BERT embeddings. We use it in all our experiments. Bertr (Mathur et al., 2019) and YiSi (Mathur et al., 2019) also make use of BERT embeddings to capture similarity. SumQE (Xenouleas et al., 2019) fine-tunes BERT for quality estimation as we describe in Section 3. Our focus is different—we train metrics that are not only state-of-the-art in conventional IID experimental setups, but also robust in the presence of scarce and out-of-distribution training data. To our knowledge no existing work has explored pretraining and extrapolation in the context of NLG.

近期有研究利用BERT进行评估。BERTScore (Zhang等人,2020)提出用BERT嵌入的软重叠替代BLEU的硬n-gram重叠。我们在所有实验中都使用了该方法。Bertr (Mathur等人,2019)和YiSi (Mathur等人,2019)也利用BERT嵌入来捕捉相似性。如第3节所述,SumQE (Xenouleas等人,2019)对BERT进行微调以进行质量估计。我们的重点不同——我们训练的度量标准不仅在传统的IID实验设置中是最先进的,而且在训练数据稀缺和分布外的情况下也具有鲁棒性。据我们所知,目前还没有研究在自然语言生成(NLG)的背景下探索预训练和外推。

Previous studies have used noising for referenceless evaluation (Dusek et al., 2019). Noisy pre-training has also been proposed before for other tasks such as paraphrasing (Wieting et al., 2016; Tomar et al., 2017) but generally not with synthetic data. Generating synthetic data via paraphrases and perturbations has been commonly used for generating adversarial examples (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018), an orthogonal line of research.

先前的研究已采用加噪技术进行无参考评估 (Dusek et al., 2019)。加噪预训练方法也曾在其他任务中被提出,例如复述生成 (Wieting et al., 2016; Tomar et al., 2017),但通常不使用合成数据。通过复述和扰动生成合成数据的方法常见于对抗样本生成领域 (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018),这是一条与之正交的研究路线。

7 Conclusion

7 结论

We presented BLEURT, a reference-based text generation metric for English. Because the metric is trained end-to-end, BLEURT can model human assessment with superior accuracy. Furthermore, pre-training makes the metric particularly robust to both domain and quality drifts. Future research directions include multilingual NLG evaluation, and hybrid methods involving both humans and classifiers.

我们提出了BLEURT,一种基于参考的英语文本生成评估指标。由于该指标采用端到端训练,BLEURT能够以更高精度模拟人类评估。此外,预训练使该指标对领域偏移和质量漂移具有特别强的鲁棒性。未来研究方向包括多语言自然语言生成(NLG)评估,以及结合人类与分类器的混合方法。

Acknowledgments

致谢

Thanks to Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, and to the members of the Google AI Language team for the proof-reading, feedback, and suggestions. We also thank Madhavan Kidambi and Ming-Wei Chang, who implemented blank-filling with BERT.

感谢 Eunsol Choi、Nicholas FitzGerald、Jacob Devlin 以及 Google AI Language 团队成员对本文的校对、反馈和建议。同时感谢 Madhavan Kidambi 和 Ming-Wei Chang 实现了基于 BERT 的填空任务。

Inderjeet Mani. 1999. Advances in automatic text summarization. MIT press.

Inderjeet Mani. 1999. 自动文本摘要的进展. MIT出版社.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of ACL.

Nitika Mathur、Timothy Baldwin和Trevor Cohn。2019。将评估置于上下文中:上下文嵌入提升机器翻译评估效果。见ACL会议论文集。

Kathleen McKeown. 1992. Text generation. Cambridge University Press.

Kathleen McKeown. 1992. 文本生成. Cambridge University Press.

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for nlg. Proceedings of EMNLP.

Jekaterina Novikova、Ondrej Dusek、Amanda Cercas Curry和Verena Rieser。2017。为什么我们需要新的自然语言生成评估指标。EMNLP会议论文集。

Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.

Kishore Papineni、Salim Roukos、Todd Ward 和 WeiJing Zhu。2002。BLEU:一种机器翻译自动评估方法。载于ACL会议论文集。

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging nlp models. In Proceedings of ACL.

Marco Tulio Ribeiro、Sameer Singh和Carlos Guestrin。2018. 用于调试NLP模型的语义等价对抗规则。载于ACL会议论文集。

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. Proceedings of ACL.

Rico Sennrich、Barry Haddow 和 Alexandra Birch。2016. 利用单语数据改进神经机器翻译模型。ACL 会议论文集。

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. Ruse: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of WMT.

Hiroki Shimanaka、Tomoyuki Kajiwara和Mamoru Komachi。2018。Ruse:基于句子嵌入的回归器用于自动机器翻译评估。见于WMT会议论文集。

Anastasia Shimorina, Claire Gardent, Shashi Narayan, and Laura Perez-Beltrachini. 2019. Webnlg challenge: Human evaluation results. Technical report.

Anastasia Shimorina、Claire Gardent、Shashi Narayan和Laura Perez-Beltrachini。2019。WebNLG挑战赛:人工评估结果。技术报告。

Ronnie W Smith and D Richard Hipp. 1994. Spoken natural language dialog systems: A practical approach. Oxford University Press.

Ronnie W Smith 和 D Richard Hipp。1994。《口语自然语言对话系统:实用方法》。牛津大学出版社。

Milos Stanojevic and Khalil Simaan. 2014. Beer: Better evaluation as ranking. In Proceedings of WMT.

Milos Stanojevic和Khalil Simaan. 2014. BEER: 基于排序的更好评估方法. 见WMT会议论文集.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.

Ilya Sutskever、Oriol Vinyals 和 Quoc V Le。2014。基于神经网络的序列到序列学习。见《NIPS 会议论文集》。

Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv:1910.08684.

Ran Tian、Shashi Narayan、Thibault Sellam 和 Ankur P Parikh。2019. 忠于事实:数据到文本生成的可信解码。arXiv:1910.08684。

Gaurav Singh Tomar, Thyago Duque, Oscar Tackstrom, Jakob Uszkoreit, and Dipanjan Das. 2017. Neural paraphrase identification of questions with noisy pretraining. Proceedings of the First Workshop on Subword and Character Level Models in NLP.

Gaurav Singh Tomar, Thyago Duque, Oscar Tackstrom, Jakob Uszkoreit, 和 Dipanjan Das. 2017. 基于噪声预训练的问题神经复述识别. 《第一届自然语言处理中子词和字符级模型研讨会论文集》.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Ashish Vaswani、Noam Shazeer、Niki Parmar、Jakob Uszkoreit、Llion Jones、Aidan N Gomez、Łukasz Kaiser 和 Illia Polosukhin。2017. Attention is all you need。发表于 NIPS 会议论文集。

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. Proceedings of ICML.

Oriol Vinyals 和 Quoc Le。2015。神经对话模型。ICML 会议论文集。

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of ICLR.

Alex Wang、Amanpreet Singh、Julian Michael、Felix Hill、Omer Levy 和 Samuel R Bowman。2019. GLUE:自然语言理解的多任务基准与分析平台。ICLR 会议论文集。

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. Proceedings of ICLR.

John Wieting、Mohit Bansal、Kevin Gimpel 和 Karen Livescu。2016. 面向通用释义句嵌入。ICLR 会议论文集。

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. Proceedings of NAACL HLT.

Adina Williams、Nikita Nangia 和 Samuel R Bowman。2018。一个用于通过推理理解句子的广泛覆盖挑战语料库。NAACL HLT 会议论文集。

Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. Proceedings of EMNLP.

Sam Wiseman、Stuart M Shieber 和 Alexander M Rush。2017。数据到文档生成的挑战。EMNLP 会议论文集。

Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, and Ion Androutsopoulos. 2019. Sumqe: a bert-based summary quality estimation model. In Proceedings of EMNLP.

Stratos Xenouleas、Prodromos Malakasiotis、Marianna Apidianaki 和 Ion Androutsopoulos。2019. SumQE:基于BERT的摘要质量评估模型补充材料。见《EMNLP会议论文集》。

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. Proceedings of ICLR.

Tianyi Zhang、Varsha Kishore、Felix Wu、Kilian Q Weinberger 和 Yoav Artzi。2020年。BERTScore:用BERT评估文本生成。ICLR会议论文集。

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. Proceedings of EMNLP.

Wei Zhao、Maxime Peyrard、Fei Liu、Yang Gao、Christian M Meyer 和 Steffen Eger。2019. Moverscore: 基于上下文嵌入和推土机距离的文本生成评估。EMNLP 会议论文集。

A Implementation Details of the Pre-Training Phase

预训练阶段的实现细节

This section provides implementation details for some of the pre-training techniques described in the main paper.

本节提供主论文中描述的部分预训练技术的实现细节。

A.1 Data Generation

A.1 数据生成

Random Masking: We use two masking strategies. The first strategy samples random words in the sentence and replaces them with masks (one for each token). Thus, the masks are scattered across the sentence. The second strategy creates contiguous sequences: it samples a start position $s$ and a length $l$ (uniformly distributed), and it masks all the tokens spanned by words between positions $s$ and $s+l$. In both cases, we use up to 15 masks per sentence. Instead of running the language model once and picking the most likely token at each position, we use beam search (with beam size 8 by default). This enforces consistency and avoids repeated sequences.

随机掩码:我们采用两种掩码策略。第一种策略随机采样句子中的单词并用掩码替换(每个token对应一个掩码),因此掩码会分散在整个句子中。第二种策略生成连续序列:采样起始位置$s$和长度$l$(均匀分布),并掩码从位置$s$到$s+l$之间所有单词对应的token。两种策略中,每句最多使用15个掩码。我们不采用单次运行语言模型并在每个位置选取最可能token的做法,而是使用束搜索(beam search,默认束宽为8),以确保生成结果的一致性并避免重复序列。
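The two masking strategies can be sketched as follows. This is an illustrative sketch with a plain `[MASK]` string and whitespace tokens; the actual pipeline operates on BERT wordpieces and fills the masks with beam search over the language model.

```python
import random

MASK = "[MASK]"

def scatter_mask(tokens, n_masks, rng):
    """Strategy 1: mask n_masks random positions, scattered in the sentence."""
    out = list(tokens)
    for i in rng.sample(range(len(out)), min(n_masks, len(out))):
        out[i] = MASK
    return out

def contiguous_mask(tokens, max_len, rng):
    """Strategy 2: mask a contiguous span starting at s with length l."""
    l = rng.randint(1, max_len)
    s = rng.randint(0, max(len(tokens) - l, 0))
    return tokens[:s] + [MASK] * min(l, len(tokens) - s) + tokens[s + l:]

rng = random.Random(0)
print(scatter_mask("the cat sat on the mat".split(), 2, rng))
print(contiguous_mask("the cat sat on the mat".split(), 3, rng))
```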

Back translation: Consider English and French. Given a forward translation model $P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z_{\mathrm{en}})$ and a backward translation model $P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}})$, we generate $\tilde{z}$ as follows:

反向翻译:考虑英语和法语。给定前向翻译模型 $P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z_{\mathrm{en}})$ 和后向翻译模型 $P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}})$,我们按如下方式生成 $\tilde{z}$:

$$
\tilde{z}=\underset{z_{\mathrm{en}}}{\arg\operatorname*{max}}\left(P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}}^{*})\right)
$$

$$
\tilde{z}=\underset{z_{\mathrm{en}}}{\arg\operatorname*{max}}\left(P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}}^{*})\right)
$$

where $z_{\mathrm{fr}}^{*}=\arg\operatorname*{max}_ {z_{\mathrm{fr}}}\left(P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)\right)$. For the translations, we use a Transformer model (Vaswani et al., 2017), trained on English-German with the tensor2tensor framework.7

其中 $z_{\mathrm{fr}}^{*}=\arg\operatorname*{max}_ {z_{\mathrm{fr}}}\left(P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)\right)$。对于翻译任务,我们使用基于 tensor2tensor 框架训练的 Transformer 模型 (Vaswani et al., 2017),该模型在英语-德语语料上进行训练。

Word dropping: Given a synthetic example $(z,\tilde{z})$, we generate a pair $(z,\tilde{z}^{\prime})$ by randomly dropping words from $\tilde{z}$. We draw the number of words to drop uniformly, up to the length of the sentence. We apply this transformation to about 30% of the data generated with the previous method.

词丢弃:给定一个合成示例 $(z,\tilde{z})$,我们通过从 $\tilde{z}$ 中随机丢弃单词来生成一对 $(z,\tilde{z}^{\prime})$。丢弃的单词数量均匀抽取,最多不超过句子长度。我们对前一种方法生成的数据中约 30% 应用此变换。
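A minimal sketch of the word-dropping perturbation, assuming whitespace tokenization:

```python
import random

def drop_words(sentence, rng):
    """Drop k random words, with k drawn uniformly between 0 and the
    sentence length; applied to ~30% of the generated examples."""
    tokens = sentence.split()
    k = rng.randint(0, len(tokens))
    kept = sorted(rng.sample(range(len(tokens)), len(tokens) - k))
    return " ".join(tokens[i] for i in kept)

rng = random.Random(0)
print(drop_words("the cat sat on the mat", rng))
```

Keeping the surviving indices sorted preserves the original word order, so the perturbed sentence is always a subsequence of the original.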

A.2 Pre-Training Tasks

A.2 预训练任务

We now provide additional details on the signals we used for pre-training.

我们现在提供关于预训练所用信号的更多细节。

Automatic Metrics: As shown in the table, we use three types of signals: BLEU, ROUGE, and BERTscore. For BLEU, we used the original Moses SENTENCE BLEU 8 implementation, using the Moses tokenizer and the default parameters. For ROUGE, we used the seq2seq implementation of ROUGE-N.9 We used a custom implementation of BERTSCORE, based on BERT-large uncased. ROUGE and BERTscore return three scores: precision, recall, and F-score. We use all three quantities.

自动指标:如表所示,我们使用了三种信号:BLEU、ROUGE 和 BERTscore。对于 BLEU,我们采用原始 Moses SENTENCE BLEU 8 实现,使用 Moses tokenizer 和默认参数。对于 ROUGE,我们采用 seq2seq 实现的 ROUGE-N。我们基于 BERT-large uncased 实现了自定义的 BERTSCORE。ROUGE 和 BERTscore 返回三个分数:精确率、召回率和 F 值。我们使用了全部三项指标。

Back translation Likelihood: We compute all the losses using a custom Transformer model (Vaswani et al., 2017), trained on two language pairs (English-French and English-German) with the tensor2tensor framework.

回译似然度:我们使用定制的Transformer模型 (Vaswani et al., 2017) 计算所有损失,该模型基于tensor2tensor框架在两种语言对(英法、英德)上进行训练。

Normalization: All the regression labels are normalized before training.

归一化:所有回归标签在训练前都经过归一化处理。

A.3 Modeling

A.3 建模

Setting the weights of the pre-training tasks: We set the weights $\gamma_{k}$ with grid search, optimizing BLEURT’s performance on WMT 17’s validation set. To reduce the size of the grid, we make groups of pre-training tasks that share the same weights: $(\tau_{\mathrm{BLEU}},\tau_{\mathrm{ROUGE}},\tau_{\mathrm{BERTscore}})$, $\left(\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}\right)$, and $(\tau_{\mathrm{entail}},\tau_{\mathrm{backtran\_flag}})$.

设定预训练任务的权重:我们通过网格搜索设置权重 $\gamma_{k}$ ,优化 BLEURT 在 WMT 17 验证集上的性能。为缩小网格规模,我们将共享相同权重的预训练任务分组:$(\tau_{\mathrm{BLEU}},\tau_{\mathrm{ROUGE}},\tau_{\mathrm{BERTscore}})$、$\left(\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}\right)$ 以及 $(\tau_{\mathrm{entail}},\tau_{\mathrm{backtran\_flag}})$。
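The grouped weighting can be sketched as a weighted sum of per-task losses with one shared weight per group. The group and task names below are shorthand for illustration, not the paper's identifiers:

```python
# One shared weight per group keeps the grid search small:
# 3 weights to tune instead of 9 individual task weights.
GROUPS = {
    "metrics": ["bleu", "rouge", "bertscore"],
    "backtrans": ["en_fr_fwd", "en_fr_bwd", "en_de_fwd", "en_de_bwd"],
    "entail": ["entail", "backtran_flag"],
}

def aggregate_loss(task_losses, group_weights):
    """Weighted sum of per-task pre-training losses, one gamma per group."""
    return sum(
        group_weights[g] * sum(task_losses[t] for t in tasks)
        for g, tasks in GROUPS.items()
    )

losses = {t: 1.0 for tasks in GROUPS.values() for t in tasks}
print(aggregate_loss(losses, {"metrics": 1.0, "backtrans": 0.5, "entail": 2.0}))  # -> 9.0
```

Grid search then iterates over candidate values for the three group weights, keeping the combination that maximizes validation performance.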

B Experiments–Supplementary Material

B 实验–补充材料

B.1 Training Setup for All Experiments

B.1 所有实验的训练设置

We use BERT’s public checkpoints10 with Adam (the default optimizer), learning rate 1e-5, and batch size 32. Unless specified otherwise, we use 800,000 training steps for pre-training and 40,000 steps for fine-tuning. We run training and evaluation in parallel: we run the evaluation every 1,500 steps and store the checkpoint that performs best on a held-out validation set (more details on the data splits and our choice of metrics in the following sections). We use Google Cloud TPUs v2 for learning, and Nvidia Tesla V100 accelerators for evaluation and test. Our code uses Tensorflow 1.15 and Python 2.7.

我们使用BERT公开的检查点[10],采用默认优化器Adam,学习率为1e-5,批量大小为32。除非另有说明,预训练采用800,000步,微调采用40,000步。训练与评估并行执行:每1,500步进行一次评估,并保存在校验集上表现最佳的检查点(数据划分及指标选择详见后续章节)。训练使用Google Cloud TPU v2,评估与测试使用Nvidia Tesla V100加速器。代码基于TensorFlow 1.15和Python 2.7实现。
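The checkpoint selection described above amounts to keeping the checkpoint with the best held-out score among the periodic evaluations. A sketch of that bookkeeping, with a toy validation curve:

```python
def select_best_checkpoint(validation_curve):
    """Given (step, validation_score) pairs produced every 1,500 steps,
    return the checkpoint with the best held-out score."""
    return max(validation_curve, key=lambda pair: pair[1])

curve = [(1500, 0.30), (3000, 0.34), (4500, 0.33)]
print(select_best_checkpoint(curve))  # -> (3000, 0.34)
```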

B.2 WMT Metric Shared Task

B.2 WMT 指标共享任务

Metrics. The metrics used to compare the evaluation systems vary across the years. The organizers use Pearson’s correlation on standardized human judgments across all segments in 2017, and a custom variant of Kendall’s Tau named “DARR” on raw human judgments in 2018 and 2019. The latter metric operates as follows. The organizers gather all the translations for the same reference segment, they enumerate all the possible pairs (translation 1, translation 2), and they discard all the pairs which have a “similar” score (less than 25 points away on a 100-point scale). For each remaining pair, they then determine which translation is the best according to both human judgment and the candidate metric. Let Concordant be the number of pairs on which the metric agrees with the human judgment and Discordant be those on which they disagree; the score is then computed as follows:

指标。用于比较评估系统的指标每年各不相同。2017年主办方对所有段落标准化人工评分采用皮尔逊相关系数 (Pearson's correlation) ,2018和2019年则对原始人工评分使用名为"DARR"的肯德尔Tau (Kendall's Tau) 定制变体。后者的计算方式如下:主办方收集同一参考段落的所有译文,枚举所有可能的译文对 (translation 1, translation 2) ,并剔除得分"相近"的配对 (在100分制下相差小于25分) 。对每个剩余配对,根据人工评判和候选指标确定更优译文。设Concordant为NLG指标判断一致的配对数量,Discordant为不一致数量,则得分计算公式如下:

$$
\frac{|{\mathrm{Concordant}}|-|{\mathrm{Discordant}}|}{|{\mathrm{Concordant}}|+|{\mathrm{Discordant}}|}
$$

$$
\frac{|{\mathrm{Concordant}}|-|{\mathrm{Discordant}}|}{|{\mathrm{Concordant}}|+|{\mathrm{Discordant}}|}
$$

The idea behind the 25 points filter is to make the evaluation more robust, since the judgments collected for WMT 2018 and 2019 are noisy. Kendall’s Tau is identical, but it does not use the filter.

25点过滤器的设计初衷是使评估更具鲁棒性,因为WMT 2018和2019收集的评判数据存在噪声。Kendall's Tau的计算方式与之相同,只是不使用该过滤器。
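Concretely, the DARR computation can be sketched as follows (human scores on a 100-point scale; `threshold=25` implements the similarity filter, and setting it to 0 recovers plain Kendall's Tau):

```python
from itertools import combinations

def darr(human, metric, threshold=25.0):
    """Kendall-style agreement after discarding pairs whose human scores
    differ by less than `threshold` on a 100-point scale."""
    concordant = discordant = 0
    for (h1, m1), (h2, m2) in combinations(zip(human, metric), 2):
        if abs(h1 - h2) < threshold:
            continue  # "similar" pair, filtered out
        if (h1 - h2) * (m1 - m2) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

human = [90.0, 80.0, 40.0, 10.0]
metric = [0.9, 0.95, 0.4, 0.1]  # swaps the two top systems
print(darr(human, metric))  # the swapped pair is filtered -> 1.0
```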


Figure 5: Improvement in Kendall Tau accuracy on all language pairs of the WMT Metrics Shared Task 2017, varying the number of pre-training steps. 0 steps corresponds to 0.555 Kendall Tau for BLEURTbase and 0.580 for BLEURT.

图 5: WMT 2017度量标准共享任务中所有语言对的Kendall Tau准确率提升情况,随预训练步数变化。0步对应BLEURTbase的0.555 Kendall Tau和BLEURT的0.580。

Training setup. To separate training and validation data, we set aside a fixed ratio of records in such a way that there is no “leak” between the datasets (i.e., no train and validation records that share the same source). We use 10% of the data for validation for years 2017 and 2018, and 5% for year 2019. We report results for the models that yield the highest Kendall Tau across all records on validation data. The weights associated to each pre-training task (see our Modeling section) are set with grid search, using the train/validation setup of WMT 2017.

训练设置。为了区分训练数据和验证数据,我们按照固定比例预留部分记录,确保数据集之间不存在"泄漏"(即训练集和验证集中不存在同源记录)。对于2017和2018年的数据,我们使用10%作为验证集,2019年数据则使用5%验证集。我们报告在验证数据上获得最高Kendall Tau系数的模型结果。各预训练任务(参见建模章节)的权重通过网格搜索确定,采用WMT 2017的训练/验证设置方案。

Baselines. we use three metrics: the Moses implementation of sentence BLEU,11 BERTscore,12 and MoverScore,13 which are all available online. We run the Moses tokenizer on the reference and candidate segments before computing sentence BLEU.

基线方法。我们采用三个指标:Moses实现的句子BLEU[11]、BERTscore[12]以及MoverScore[13],这些指标均可在线获取。在计算句子BLEU前,我们对参考文本和候选文本执行了Moses分词器处理。

B.3 Robustness to Quality Drift

B.3 对质量漂移的鲁棒性

Data Re-sampling Methodology: We sample the training and test sets separately, as follows. We split the data in 10 bins of equal size. We then sample each record in the dataset with probabilities $\frac{1}{B^{\alpha}}$ and $\frac{1}{(11-B)^{\alpha}}$ for train and test respectively, where $B$ is the bin index of the record between 1 and 10, and $\alpha$ is a predefined skew factor. The skew factor $\alpha$ controls the drift: a value of 0 has no effect (the ratings are centered around 0), and a value of 3.0 yields extreme differences. Note that the sizes of the datasets decrease as $\alpha$ increases: we use 50.7%, 30.3%, 20.4%, and 11.9% of the original 5,344 training records for $\alpha=0.5$, 1.0, 1.5, and 3.0 respectively.

数据重采样方法:我们分别对训练集和测试集进行采样,具体如下。将数据划分为10个大小相等的区间,然后按概率 $\frac{1}{B^{\alpha}}$ 和 $\frac{1}{(11-B)^{\alpha}}$ 分别对训练集和测试集中的每条记录进行采样,其中 $B$ 是记录所在区间的索引(1到10),$\alpha$ 是预定义的偏斜因子。偏斜因子 $\alpha$ 控制漂移程度:值为0时无影响(评分集中在0附近),值为3.0时会产生极大差异。需注意数据集规模随 $\alpha$ 增大而减小:当 $\alpha=0.5$、1.0、1.5和3.0时,我们分别使用原始5,344条训练记录的 50.7%、30.3%、20.4% 和 11.9%。
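A sketch of the re-sampling scheme. Assigning bins by rating rank is an assumption for illustration; the text only specifies 10 equal-size bins:

```python
import random

def skewed_sample(records, ratings, alpha, rng, for_test=False):
    """Sub-sample records by rating bin. Bin B runs from 1 (lowest-rated)
    to 10 (highest-rated): train keeps low bins with probability 1/B**alpha,
    test keeps high bins with probability 1/(11 - B)**alpha."""
    # Assumption: bins of equal size are assigned by rating rank.
    order = sorted(range(len(records)), key=lambda i: ratings[i])
    bins = {idx: 1 + (rank * 10) // len(records) for rank, idx in enumerate(order)}
    sampled = []
    for i, rec in enumerate(records):
        b = bins[i]
        p = 1.0 / (11 - b) ** alpha if for_test else 1.0 / b ** alpha
        if rng.random() < p:
            sampled.append(rec)
    return sampled

rng = random.Random(0)
train = skewed_sample(list(range(100)), [float(i) for i in range(100)], 1.5, rng)
print(len(train))  # shrinks as alpha grows
```

With $\alpha=0$ every probability is 1 and the whole dataset is kept; large $\alpha$ concentrates the training sample on the lowest-rated bins, reproducing the drift between train and test.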

B.4 Ablation Experiment–How Much Pre-Training Time is Necessary?

B.4 消融实验——需要多少预训练时间?

To understand the relationship between pretraining time and downstream accuracy, we pretrain several versions of BLEURT and we fine-tune them on WMT17 data, varying the number of pretraining steps. Figure 5 presents the results. Most gains are obtained during the first 400,000 steps, that is, after about 2 epochs over our synthetic dataset.

为了理解预训练时间与下游准确率之间的关系,我们预训练了多个版本的BLEURT模型,并在WMT17数据上对其进行微调,同时改变预训练步数。图5展示了结果。大部分性能提升发生在前40万步,即在我们合成数据集上约2个周期后。
