[论文翻译]BLEURT: 学习文本生成的鲁棒性指标


原文地址:https://arxiv.org/pdf/2004.04696v5


BLEURT: Learning Robust Metrics for Text Generation

BLEURT: 学习文本生成的鲁棒性指标

Thibault Sellam Dipanjan Das Ankur P. Parikh Google Research New York, NY {tsellam, dipanjand, aparikh }@google.com

Thibault Sellam Dipanjan Das Ankur P. Parikh 谷歌研究院 纽约州纽约市 {tsellam, dipanjand, aparikh}@google.com

Abstract

摘要

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.

文本生成在过去几年取得了显著进展。然而评估指标却相对滞后,因为最常用的选择(如BLEU和ROUGE)可能与人类判断相关性较低。我们提出了BLEURT,这是一种基于BERT的学习型评估指标,仅需数千个可能存在偏差的训练样本即可建模人类判断。该方法的关键在于新颖的预训练方案,通过数百万合成样本来提升模型泛化能力。BLEURT在最近三年的WMT Metrics共享任务和WebNLG竞赛数据集上实现了最先进的结果。与原始基于BERT的方法相比,即使在训练数据稀缺且分布外的情况下,BLEURT仍能提供更优异的表现。

1 Introduction

1 引言

In the last few years, research in natural text generation (NLG) has made significant progress, driven largely by the neural encoder-decoder paradigm (Sutskever et al., 2014; Bahdanau et al., 2015) which can tackle a wide array of tasks including translation (Koehn, 2009), summarization (Mani, 1999; Chopra et al., 2016), structured-data-to-text generation (McKeown, 1992; Kukich, 1983; Wiseman et al., 2017), dialog (Smith and Hipp, 1994; Vinyals and Le, 2015) and image captioning (Fang et al., 2015). However, progress is increasingly impeded by the shortcomings of existing metrics (Wiseman et al., 2017; Ma et al., 2019; Tian et al., 2019).

近年来,自然文本生成(NLG)研究取得了显著进展,这主要归功于神经编码器-解码器范式 (Sutskever et al., 2014; Bahdanau et al., 2015) ,该范式能够处理包括翻译 (Koehn, 2009) 、摘要 (Mani, 1999; Chopra et al., 2016) 、结构化数据到文本生成 (McKeown, 1992; Kukich, 1983; Wiseman et al., 2017) 、对话 (Smith and Hipp, 1994; Vinyals and Le, 2015) 以及图像描述 (Fang et al., 2015) 在内的多种任务。然而,现有指标的不足 (Wiseman et al., 2017; Ma et al., 2019; Tian et al., 2019) 日益阻碍了研究的进一步发展。

Human evaluation is often the best indicator of the quality of a system. However, designing crowdsourcing experiments is an expensive and high-latency process, which does not easily fit in a daily model development pipeline. Therefore, NLG researchers commonly use automatic evaluation metrics, which provide an acceptable proxy for quality and are very cheap to compute. This paper investigates sentence-level, reference-based metrics, which describe the extent to which a candidate sentence is similar to a reference one. The exact definition of similarity may range from string overlap to logical entailment.

人工评估通常是衡量系统质量的最佳指标。然而,设计众包实验成本高昂且耗时较长,难以融入日常模型开发流程。因此,自然语言生成(NLG)研究者通常采用自动评估指标,这些指标能提供可接受的质量参考且计算成本极低。本文研究基于参考语句的句子级评估指标,用于衡量候选句与参考句的相似程度。相似度的具体定义可从字符串重叠度延伸至逻辑蕴含关系。

The first generation of metrics relied on handcrafted rules that measure the surface similarity between the sentences. To illustrate, BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), two popular metrics, rely on N-gram overlap. Because those metrics are only sensitive to lexical variation, they cannot appropriately reward semantic or syntactic variations of a given reference. Thus, they have been repeatedly shown to correlate poorly with human judgment, in particular when all the systems to compare have a similar level of accuracy (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018).

第一代指标依赖于手工制定的规则,用于衡量句子之间的表面相似性。例如,BLEU (Papineni et al., 2002) 和 ROUGE (Lin, 2004) 这两种常用指标基于 N-gram 重叠。由于这些指标仅对词汇变化敏感,无法合理评估给定参考文本的语义或句法变化。因此,它们多次被证明与人类判断相关性较差,尤其是在所有对比系统的准确度水平相近时 (Liu et al., 2016; Novikova et al., 2017; Chaganty et al., 2018)。

Increasingly, NLG researchers have addressed those problems by injecting learned components in their metrics. To illustrate, consider the WMT Metrics Shared Task, an annual benchmark in which translation metrics are compared on their ability to imitate human assessments. The last two years of the competition were largely dominated by neural net-based approaches, RUSE, YiSi and ESIM (Ma et al., 2018, 2019). Current approaches largely fall into two categories. Fully learned metrics, such as BEER, RUSE, and ESIM, are trained end-to-end, and they typically rely on handcrafted features and/or learned embeddings. Conversely, hybrid metrics, such as YiSi and BERTscore, combine trained elements, e.g., contextual embeddings, with handwritten logic, e.g., token alignment rules. The first category typically offers great expressivity: if a training set of human ratings data is available, the metrics may take full advantage of it and fit the ratings distribution tightly. Furthermore, learned metrics can be tuned to measure task-specific properties, such as fluency, faithfulness, grammar, or style. On the other hand, hybrid metrics offer robustness. They may provide better results when there is little to no training data, and they do not rely on the assumption that training and test data are identically distributed.

越来越多的自然语言生成(NLG)研究人员通过将学习组件注入评估指标来解决这些问题。以WMT指标共享任务为例,这个年度基准测试通过比较各翻译指标模仿人类评估的能力来评判优劣。过去两年的竞赛主要由基于神经网络的方法主导,包括RUSE、YiSi和ESIM (Ma et al., 2018, 2019)。当前方法主要分为两类:完全学习型指标(如BEER、RUSE和ESIM)采用端到端训练,通常依赖手工特征和/或学习得到的嵌入表示;混合型指标(如YiSi和BERTscore)则将训练元素(如上下文嵌入)与手写逻辑(如token对齐规则)相结合。第一类指标通常具有强大的表达能力:若可获得人工评分训练集,这类指标能充分利用数据并紧密拟合评分分布。此外,学习型指标可针对特定任务属性(如流畅性、忠实度、语法或风格)进行优化。另一方面,混合型指标具有更强的鲁棒性,在训练数据不足甚至缺失时仍能提供良好结果,且不依赖训练数据与测试数据同分布的前提假设。

And indeed, the IID assumption is particularly problematic in NLG evaluation because of domain drifts, which have been the main target of the metrics literature, but also because of quality drifts: NLG systems tend to get better over time, and therefore a model trained on ratings data from 2015 may fail to distinguish top performing systems in 2019, especially for newer research tasks. An ideal learned metric would be able to both take full advantage of available ratings data for training, and be robust to distribution drifts, i.e., it should be able to extrapolate.

事实上,IID假设在自然语言生成(NLG)评估中尤其成问题,这不仅因为指标文献主要关注的领域漂移(domain drifts),还由于质量漂移(quality drifts):NLG系统会随时间推移不断改进,因此基于2015年评分数据训练的模型可能无法区分2019年的顶尖系统,尤其对于新兴研究任务而言。理想的习得指标应能充分利用现有评分数据进行训练,同时对分布漂移保持稳健性,即具备外推能力。

Our insight is that it is possible to combine expressivity and robustness by pre-training a fully learned metric on large amounts of synthetic data, before fine-tuning it on human ratings. To this end, we introduce BLEURT,1 a text generation metric based on BERT (Devlin et al., 2019). A key ingredient of BLEURT is a novel pre-training scheme, which uses random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals.

我们的核心观点是:通过在大规模合成数据上预训练一个完全学习的指标(metric),再基于人工评分进行微调,可以兼顾表达力与鲁棒性。为此,我们提出了基于BERT (Devlin et al., 2019) 的文本生成评估指标BLEURT。其关键创新在于预训练方案——通过对维基百科句子施加随机扰动,并结合多样化的词法与语义级监督信号来增强数据。

To demonstrate our approach, we train BLEURT for English and evaluate it under different generalization regimes. We first verify that it provides state-of-the-art results on all recent years of the WMT Metrics Shared Task (2017 to 2019, to-English language pairs). We then stress-test its ability to cope with quality drifts with a synthetic benchmark based on WMT 2017. Finally, we show that it can easily adapt to a different domain with three tasks from a data-to-text dataset, WebNLG 2017 (Gardent et al., 2017). Ablations show that our synthetic pre-training scheme increases performance in the IID setting, and is critical to ensure robustness when the training data is scarce, skewed, or out-of-domain.

为验证我们的方法,我们针对英语训练了BLEURT模型,并在不同泛化场景下进行评估。首先证实该模型在WMT指标共享任务历年数据(2017至2019年英语语对)中均取得最先进结果。随后通过基于WMT 2017构建的合成基准测试,压力测试其应对质量漂移的能力。最后以WebNLG 2017 (Gardent et al., 2017) 数据到文本数据集的三项任务证明模型可轻松适配新领域。消融实验表明:我们的合成预训练方案能提升独立同分布(IID)设定下的性能,且在训练数据稀缺、偏态或跨域时对确保鲁棒性至关重要。

The code and pre-trained models are available online2.

代码和预训练模型已在线发布2。

2 Preliminaries

2 预备知识

Define $\pmb{x}=(x_{1},\ldots,x_{r})$ to be the reference sentence of length $r$ where each $x_{i}$ is a token and let $\tilde{\pmb{x}}=(\tilde{x}_{1},\ldots,\tilde{x}_{p})$ be a prediction sentence of length $p$. Let $\{(\pmb{x}_{i},\tilde{\pmb{x}}_{i},y_{i})\}_{i=1}^{N}$ be a training dataset of size $N$ where $y_{i}\in\mathbb{R}$ is the human rating that indicates how good $\tilde{\pmb{x}}_{i}$ is with respect to $\pmb{x}_{i}$. Given the training data, our goal is to learn a function $f:(\pmb{x},\tilde{\pmb{x}})\rightarrow y$ that predicts the human rating.

定义 $\pmb{x}=(x_{1},\ldots,x_{r})$ 为长度为 $r$ 的参考句子,其中每个 $x_{i}$ 是一个 token,并令 $\tilde{\pmb{x}}=(\tilde{x}_{1},\ldots,\tilde{x}_{p})$ 为长度为 $p$ 的预测句子。设 $\{(\pmb{x}_{i},\tilde{\pmb{x}}_{i},y_{i})\}_{i=1}^{N}$ 为大小为 $N$ 的训练数据集,其中 $y_{i}\in\mathbb{R}$ 是表示 $\tilde{\pmb{x}}_{i}$ 相对于 $\pmb{x}_{i}$ 质量的人工评分。给定训练数据,我们的目标是学习一个函数 $f:(\pmb{x},\tilde{\pmb{x}})\rightarrow y$ 以预测人工评分。

3 Fine-Tuning BERT for Quality Evaluation

3 基于BERT的质量评估微调

Given the small amounts of rating data available, it is natural to leverage unsupervised representations for this task. In our model, we use BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), which is an unsupervised technique that learns contextualized representations of sequences of text. Given $\pmb{x}$ and $\tilde{\pmb{x}}$, BERT is a Transformer (Vaswani et al., 2017) that returns a sequence of contextualized vectors:

鉴于可用的评分数据量较少,利用无监督表示来完成此任务是很自然的选择。在我们的模型中,我们使用了BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019),这是一种无监督技术,可以学习文本序列的上下文表示。给定$\pmb{x}$和$\tilde{\pmb{x}}$,BERT是一个Transformer (Vaswani et al., 2017),它返回一系列上下文向量:

$$
v_{[\mathrm{CLS}]},v_{x_{1}},\ldots,v_{x_{r}},v_{\tilde{x}_{1}},\ldots,v_{\tilde{x}_{p}}=\mathrm{BERT}(\pmb{x},\tilde{\pmb{x}})
$$

$$
v_{[\mathrm{CLS}]},v_{x_{1}},\ldots,v_{x_{r}},v_{\tilde{x}_{1}},\ldots,v_{\tilde{x}_{p}}=\mathrm{BERT}(\pmb{x},\tilde{\pmb{x}})
$$

where $\pmb{v}_{[\mathrm{CLS}]}$ is the representation for the special [CLS] token. As described by Devlin et al. (2019), we add a linear layer on top of the [CLS] vector to predict the rating:

其中 $\pmb{v}_{[\mathrm{CLS}]}$ 是特殊 [CLS] token 的表征。如 Devlin 等人 (2019) 所述,我们在 [CLS] 向量上添加一个线性层来预测评分:

$$
\hat{y}=\pmb{f}(\pmb{x},\tilde{\pmb{x}})=\pmb{W}\tilde{v}_{[\mathrm{CLS}]}+\pmb{b}
$$

$$
\hat{y}=\pmb{f}(\pmb{x},\tilde{\pmb{x}})=\pmb{W}\tilde{v}_{[\mathrm{CLS}]}+\pmb{b}
$$

where $W$ and $b$ are the weight matrix and bias vector respectively. Both the above linear layer as well as the BERT parameters are trained (i.e., fine-tuned) on the supervised data, which typically numbers in a few thousand examples. We use the regression loss $\ell_{\mathrm{supervised}}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}$.

其中 $W$ 和 $b$ 分别是权重矩阵和偏置向量。上述线性层以及 BERT 参数均在通常包含数千个样本的监督数据上进行训练(即微调)。我们使用回归损失 $\ell_{\mathrm{supervised}}=\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}$。
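The rating head above is just a linear map on the [CLS] vector plus a mean-squared-error loss. A minimal numpy sketch, with a toy hidden size and random vectors standing in for real BERT outputs; `predict` and `supervised_loss` are our own illustrative names, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for BERT's [CLS] vectors (hidden size 8 instead of 1024).
hidden_size = 8
v_cls = rng.normal(size=(3, hidden_size))  # batch of 3 (reference, candidate) pairs

# Linear rating head: y_hat = W v_[CLS] + b
W = rng.normal(size=hidden_size) * 0.1
b = 0.0

def predict(v_cls, W, b):
    """One scalar rating per [CLS] vector."""
    return v_cls @ W + b

def supervised_loss(y, y_hat):
    """l_supervised = (1/N) * sum_i (y_i - y_hat_i)^2"""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

y = np.array([0.9, -0.3, 0.1])  # hypothetical human ratings
y_hat = predict(v_cls, W, b)
loss = supervised_loss(y, y_hat)
```

In the real model, `v_cls` is produced by the fine-tuned BERT encoder, so the gradient of this loss flows through all BERT parameters as well as `W` and `b`.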

Although this approach is quite straightforward, we will show in Section 5 that it gives state-of-the-art results on WMT Metrics Shared Task 17-19, which makes it a high-performing evaluation metric. However, fine-tuning BERT requires a sizable amount of IID data, which is less than ideal for a metric that should generalize to a variety of tasks and model drift.

尽管这种方法相当直接,但我们将在第5节展示,它在WMT Metrics Shared Task 17-19上取得了最先进 (state-of-the-art) 的结果,使其成为高性能的评估指标。然而,微调BERT需要大量独立同分布 (IID) 数据,这对于一个需要泛化到多种任务和模型漂移的指标来说并不理想。

4 Pre-Training on Synthetic Data

4 基于合成数据的预训练

The key aspect of our approach is a pre-training technique that we use to “warm up” BERT before fine-tuning on rating data.3 We generate a large number of synthetic reference-candidate pairs $(z,\tilde{z})$, and we train BERT on several lexical- and semantic-level supervision signals with a multitask loss. As our experiments will show, BLEURT generalizes much better after this phase, especially with incomplete training data.

我们方法的关键在于采用了一种预训练技术,在基于评分数据进行微调前对BERT进行"预热"。我们生成了大量合成参考-候选对$(z,\tilde{z})$,并通过多任务损失函数,基于词汇和语义层面的监督信号对BERT进行训练。实验表明,经过该阶段后BLEURT展现出显著更好的泛化能力,尤其在训练数据不完整时表现更为突出。

Any pre-training approach requires a dataset and a set of pre-training tasks. Ideally, the setup should resemble the final NLG evaluation task, i.e., the sentence pairs should be distributed similarly and the pre-training signals should correlate with human ratings. Unfortunately, we cannot have access to the NLG models that we will evaluate in the future. Therefore, we optimized our scheme for generality, with three requirements. (1) The set of reference sentences should be large and diverse, so that BLEURT can cope with a wide range of NLG domains and tasks. (2) The sentence pairs should contain a wide variety of lexical, syntactic, and semantic dissimilarities. The aim here is to anticipate all variations that an NLG system may produce, e.g., phrase substitution, paraphrases, noise, or omissions. (3) The pre-training objectives should effectively capture those phenomena, so that BLEURT can learn to identify them. The following sections present our approach.

任何预训练方法都需要一个数据集和一组预训练任务。理想情况下,该设置应类似于最终的NLG评估任务,即句子对的分布应相似,且预训练信号应与人类评分相关。遗憾的是,我们无法提前获知未来将评估的NLG模型。因此,我们针对通用性优化了方案,提出三点要求:(1) 参考句集应规模庞大且多样化,使BLEURT能适应广泛的NLG领域和任务;(2) 句子对应包含词汇、句法和语义层面的多样化差异,旨在预判NLG系统可能产生的所有变体,例如短语替换、改述、噪声或遗漏;(3) 预训练目标应有效捕捉这些现象,使BLEURT学会识别它们。以下章节将介绍我们的方法。

4.1 Generating Sentence Pairs

4.1 生成句子对

One way to expose BLEURT to a wide variety of sentence differences is to use existing sentence pairs datasets (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019). These sets are a rich source of related sentences, but they may fail to capture the errors and alterations that NLG systems produce (e.g., omissions, repetitions, nonsensical substitutions). We opted for an automatic approach instead, that can be scaled arbitrarily and at little cost: we generate synthetic sentence pairs $(z,\tilde{z})$ by randomly perturbing 1.8 million segments $_{z}$ from Wikipedia. We use three techniques: mask-filling with BERT, back translation, and randomly dropping out words. We obtain about 6.5 million perturbations $\tilde{z}$ . Let us describe those techniques.

让BLEURT接触多样化句子差异的一种方法是利用现有句子对数据集 (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019)。这些数据集虽能提供丰富的关联语句资源,但可能无法涵盖自然语言生成系统产生的错误和改动 (如遗漏、重复、无意义替换)。我们选择了一种可任意扩展且成本低廉的自动化方案:通过随机扰动180万条维基百科段落$_{z}$来生成合成句子对$(z,\tilde{z})$。具体采用三种技术:基于BERT的掩码填充、回译以及随机词丢弃,最终获得约650万条扰动结果$\tilde{z}$。以下将详细说明这些技术。

Mask-filling with BERT: BERT’s initial training task is to fill gaps (i.e., masked tokens) in tokenized sentences. We leverage this functionality by inserting masks at random positions in the Wikipedia sentences, and fill them with the language model. Thus, we introduce lexical alterations while maintaining the fluency of the sentence. We use two masking strategies—we either introduce the masks at random positions in the sentences, or we create contiguous sequences of masked tokens. More details are provided in the Appendix.

使用BERT进行掩码填充:BERT的初始训练任务是填补标记化句子中的空白(即被掩码的token)。我们通过在维基百科句子中随机位置插入掩码,并利用语言模型进行填充来实现这一功能。这样可以在保持句子流畅性的同时引入词汇变化。我们采用两种掩码策略——要么在句子随机位置插入掩码,要么创建连续的掩码token序列。更多细节见附录。

Back translation: We generate paraphrases and perturbations with back translation, that is, round trips from English to another language and then back to English with a translation model (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Sennrich et al., 2016). Our primary aim is to create variants of the reference sentence that preserve semantics. Additionally, we use the mispredictions of the back translation models as a source of realistic alterations.

回译:我们通过回译生成改写和扰动,即利用翻译模型将英文先翻译成另一种语言再转回英文的往返过程 (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Sennrich et al., 2016)。主要目标是创建保留原句语义的变体,同时利用回译模型的错误预测作为现实语义变化的来源。

Dropping words: We found it useful in our experiments to randomly drop words from the synthetic examples above to create other examples. This method prepares BLEURT for “pathological” behaviors or NLG systems, e.g., void predictions, or sentence truncation.

随机丢弃词语:在实验中,我们发现随机从上述合成示例中删除词语以创建其他样本很有帮助。这种方法能让BLEURT适应"病态"行为或自然语言生成系统,例如空预测或句子截断。
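As a rough illustration of the third perturbation, here is a sketch of random word dropping; the `drop_prob` value and the keep-at-least-one-token guard are our own assumptions, not details from the paper:

```python
import random

def drop_words(tokens, drop_prob=0.3, rng=None):
    """Delete each token independently with probability drop_prob; keep at
    least one token so we never emit an empty sentence (our own guard)."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(tokens)]

z = "the quick brown fox jumps over the lazy dog".split()
z_tilde = drop_words(z)  # a truncated/"pathological" candidate for reference z
```

Applied on top of the mask-filled and back-translated examples, this yields pairs $(z,\tilde{z})$ that mimic omissions and truncations.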

4.2 Pre-Training Signals

4.2 预训练信号

The next step is to augment each sentence pair $(z,\tilde{z})$ with a set of pre-training signals $\{\tau_{k}\}$, where $\tau_{k}$ is the target vector of pre-training task $k$. Good pre-training signals should capture a wide variety of lexical and semantic differences. They should also be cheap to obtain, so that the approach can scale to large amounts of synthetic data. The following section presents our nine pre-training tasks, summarized in Table 1. Additional implementation details are in the Appendix.

下一步是为每个句子对 $(z,\tilde{z})$ 增强一组预训练信号 $\{\tau_{k}\}$,其中 $\tau_{k}$ 是预训练任务 $k$ 的目标向量。良好的预训练信号应能捕捉广泛的词汇和语义差异,同时还需易于获取,以便该方法能扩展到大量合成数据。以下部分介绍了我们的9个预训练任务,总结于表 1。更多实现细节见附录。

Automatic Metrics: We create three signals τBLEU, τROUGE, and τBERTscore with sentence BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BERTscore (Zhang et al., 2020) respectively (we use precision, recall and F-score for the latter two).

自动指标:我们分别使用句子BLEU (Papineni等人,2002)、ROUGE (Lin,2004) 和 BERTscore (Zhang等人,2020) 创建了三个信号τBLEU、τROUGE和τBERTscore(后两者采用精确率、召回率和F值)。
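To give a concrete flavor of these signals, the sketch below computes a simplified ROUGE-1-style precision/recall/F1 from clipped unigram overlap. The real τROUGE signal uses a full ROUGE implementation, so treat this as an illustration only:

```python
from collections import Counter

def unigram_prf(reference, candidate):
    """Clipped unigram overlap between tokenized sentences, returned as
    (precision, recall, F1) -- a simplified stand-in for the tau_ROUGE signal."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    overlap = sum((ref_counts & cand_counts).values())  # clipped match count
    p = overlap / len(candidate) if candidate else 0.0
    r = overlap / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = unigram_prf("the cat sat".split(), "the cat sat down".split())
# overlap = 3, so p = 3/4 and r = 3/3
```

Each such score becomes one regression target in the multitask pre-training loss.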

Back translation Likelihood: The idea behind this signal is to leverage existing translation models to measure semantic equivalence. Given a pair $(z,\tilde{z})$, this training signal measures the probability that $\tilde{z}$ is a back translation of $z$, $P(\tilde{z}|z)$, normalized by the length of $\tilde{z}$. Let $P_{\mathrm{en\to fr}}(z_{\mathrm{fr}}|z)$ be a translation model that assigns probabilities to French sentences $z_{\mathrm{fr}}$ conditioned on English sentences $z$ and let $P_{\mathrm{fr\to en}}(z|z_{\mathrm{fr}})$ be a translation model that assigns probabilities to English sentences given French sentences. If $|\tilde{z}|$ is the number of tokens in $\tilde{z}$, we define our score as $\tau_{\mathrm{en-fr},\tilde{z}|z}=\frac{\log P(\tilde{z}|z)}{|\tilde{z}|}$, with:

回译似然度:该信号背后的理念是利用现有翻译模型来衡量语义等价性。给定一对 $(z,\tilde{z})$,该训练信号衡量 $\tilde{z}$ 是 $z$ 的回译的概率 $P(\tilde{z}|z)$,并通过 $\tilde{z}$ 的长度进行归一化。设 $P_{\mathrm{en\to fr}}(z_{\mathrm{fr}}|z)$ 为在给定英语句子 $z$ 条件下为法语句子 $z_{\mathrm{fr}}$ 分配概率的翻译模型,$P_{\mathrm{fr\to en}}(z|z_{\mathrm{fr}})$ 为在给定法语句子条件下为英语句子分配概率的翻译模型。若 $|\tilde{z}|$ 表示 $\tilde{z}$ 中的 token 数量,则得分定义为 $\tau_{\mathrm{en-fr},\tilde{z}|z}=\frac{\log P(\tilde{z}|z)}{|\tilde{z}|}$,其中:

Table 1: Our pre-training signals.

| Task Type | Pre-training Signals | Loss Type |
| --- | --- | --- |
| BLEU | $\tau_{\mathrm{BLEU}}$ | Regression |
| ROUGE | $\tau_{\mathrm{ROUGE}} = (\tau_{\mathrm{ROUGE\text{-}P}}, \tau_{\mathrm{ROUGE\text{-}R}}, \tau_{\mathrm{ROUGE\text{-}F}})$ | Regression |
| BERTscore | $\tau_{\mathrm{BERTscore}} = (\tau_{\mathrm{BERTscore\text{-}P}}, \tau_{\mathrm{BERTscore\text{-}R}}, \tau_{\mathrm{BERTscore\text{-}F}})$ | Regression |
| Backtrans. likelihood | $\tau_{\mathrm{en\text{-}fr},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}fr},\tilde{z}\mid z}$, $\tau_{\mathrm{en\text{-}de},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}de},\tilde{z}\mid z}$ | Regression |
| Entailment | $\tau_{\mathrm{entail}} = (\tau_{\mathrm{Entail}}, \tau_{\mathrm{Contradict}}, \tau_{\mathrm{Neutral}})$ | Multiclass |
| Backtrans. flag | $\tau_{\mathrm{backtran\_flag}}$ | Multiclass |

表 1: 我们的预训练信号。

| Task Type | Pre-training Signals | Loss Type |
| --- | --- | --- |
| BLEU | $\tau_{\mathrm{BLEU}}$ | Regression |
| ROUGE | $\tau_{\mathrm{ROUGE}} = (\tau_{\mathrm{ROUGE\text{-}P}}, \tau_{\mathrm{ROUGE\text{-}R}}, \tau_{\mathrm{ROUGE\text{-}F}})$ | Regression |
| BERTscore | $\tau_{\mathrm{BERTscore}} = (\tau_{\mathrm{BERTscore\text{-}P}}, \tau_{\mathrm{BERTscore\text{-}R}}, \tau_{\mathrm{BERTscore\text{-}F}})$ | Regression |
| Backtrans. likelihood | $\tau_{\mathrm{en\text{-}fr},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}fr},\tilde{z}\mid z}$, $\tau_{\mathrm{en\text{-}de},z\mid\tilde{z}}$, $\tau_{\mathrm{en\text{-}de},\tilde{z}\mid z}$ | Regression |
| Entailment | $\tau_{\mathrm{entail}} = (\tau_{\mathrm{Entail}}, \tau_{\mathrm{Contradict}}, \tau_{\mathrm{Neutral}})$ | Multiclass |
| Backtrans. flag | $\tau_{\mathrm{backtran\_flag}}$ | Multiclass |

$$
P(\tilde{z}|z)=\sum_{z_{\mathrm{fr}}}P_{\mathrm{fr}\rightarrow\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}})P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)
$$

$$
P(\tilde{z}|z)=\sum_{z_{\mathrm{fr}}}P_{\mathrm{fr}\rightarrow\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}})P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)
$$

Because computing the summation over all possible French sentences is intractable, we approximate the sum using $z_{\mathrm{fr}}^{*}=\arg\max_{z_{\mathrm{fr}}} P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}|z)$ and we assume that $P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}^{*}|z)\approx 1$:

由于对所有可能的法语句子进行求和在计算上不可行,我们使用 $z_{\mathrm{fr}}^{*}=\arg\max_{z_{\mathrm{fr}}} P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}|z)$ 进行近似,并假设 $P_{\mathrm{en}\to\mathrm{fr}}(z_{\mathrm{fr}}^{*}|z)\approx 1$:

$$
P(\tilde{z}|z)\approx P_{\mathrm{fr}\to\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}}^{*})
$$

$$
P(\tilde{z}|z)\approx P_{\mathrm{fr}\to\mathrm{en}}(\tilde{z}|z_{\mathrm{fr}}^{*})
$$

We can trivially reverse the procedure to compute $P(z|\tilde{z})$, thus we create 4 pre-training signals $\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}$ with two pairs of languages ($\mathrm{en}\leftrightarrow\mathrm{de}$ and $\mathrm{en}\leftrightarrow\mathrm{fr}$) in both directions.

我们可以轻松地逆向计算 $P(z|\tilde{z})$ ,从而创建4个预训练信号 $\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}$ ,涉及两对语言($\mathrm{en}\leftrightarrow\mathrm{de}$ 和 $\mathrm{en}\leftrightarrow\mathrm{fr}$)的双向转换。
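Numerically, each of these signals reduces to a length-normalized sum of token log-probabilities from the round-trip translation model. A minimal sketch, with hypothetical per-token log-probabilities standing in for a real NMT model:

```python
def backtranslation_signal(token_logprobs):
    """tau = log P(z~|z) / |z~|, where log P(z~|z) is approximated by the
    score P_fr->en(z~ | z_fr*) of the round-trip translation model, i.e. the
    sum of per-token log-probabilities of the back-translation z~.
    The numbers below are hypothetical; a real system queries an NMT model."""
    return sum(token_logprobs) / len(token_logprobs)

# A 4-token back-translation scored by the (imagined) fr->en model:
tau_en_fr = backtranslation_signal([-0.1, -0.5, -0.2, -0.2])
# tau_en_fr = (-1.0) / 4, i.e. about -0.25
```

Swapping which side of the pair is scored, and which language pair is used, yields the four signals listed above.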

Textual Entailment: The signal τentail expresses whether $_{z}$ entails or contradicts $\tilde{z}$ using a classifier. We report the probability of three labels: Entail, Contradict, and Neutral, using BERT finetuned on an entailment dataset, MNLI (Devlin et al., 2019; Williams et al., 2018).

文本蕴含:信号τentail通过分类器表达$_{z}$是否蕴含或矛盾$\tilde{z}$。我们使用在蕴含数据集MNLI上微调过的BERT (Devlin等人,2019;Williams等人,2018) ,报告三个标签的概率:蕴含、矛盾和中性。

Back translation flag: The signal τbacktran_flag is a Boolean that indicates whether the perturbation was generated with back translation or with mask-filling.

回译标志:信号τbacktran flag是一个布尔值,用于指示扰动是通过回译(back translation)还是掩码填充(maskfilling)生成的。

4.3 Modeling

4.3 建模

For each pre-training task, our model uses either a regression or a classification loss. We then aggregate the task-level losses with a weighted sum.

对于每个预训练任务,我们的模型采用回归或分类损失函数,并通过加权求和方式聚合任务级损失。

Let $\tau_{k}$ describe the target vector for each task, e.g., the probabilities for the classes Entail, Contradict, Neutral, or the precision, recall, and F-score for ROUGE. If $\tau_{k}$ is a regression task, then the loss used is the $\ell_{2}$ loss, i.e., $\ell_{k}=\Vert\tau_{k}-\hat{\tau}_{k}\Vert_{2}^{2}/|\tau_{k}|$ where $|\tau_{k}|$ is the dimension of $\tau_{k}$ and $\hat{\tau}_{k}$ is computed by using a task-specific linear layer on top of the [CLS] embedding: $\hat{\tau}_{k}=W_{\tau_{k}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{k}}$. If $\tau_{k}$ is a classification task, we use a separate linear layer to predict a logit for each class $c$: $\hat{\tau}_{kc}=W_{\tau_{kc}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{kc}}$, and we use the multiclass cross-entropy loss. We define our aggregate pre-training loss function as follows:

设 $\tau_{k}$ 表示每个任务的目标向量,例如类别 Entail、Contradict、Neutral 的概率,或 ROUGE 的精确率、召回率和 F 值。若 $\tau_{k}$ 是回归任务,则使用 $\ell_{2}$ 损失,即 $\ell_{k}=\Vert\tau_{k}-\hat{\tau}_{k}\Vert_{2}^{2}/|\tau_{k}|$,其中 $|\tau_{k}|$ 是 $\tau_{k}$ 的维度,$\hat{\tau}_{k}$ 通过在 [CLS] 嵌入上使用任务特定的线性层计算得到:$\hat{\tau}_{k}=W_{\tau_{k}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{k}}$。若 $\tau_{k}$ 是分类任务,则使用单独的线性层预测每个类别 $c$ 的 logit 值:$\hat{\tau}_{kc}=W_{\tau_{kc}}\tilde{v}_{[\mathrm{CLS}]}+b_{\tau_{kc}}$,并采用多类别交叉熵损失。我们定义聚合预训练损失函数如下:

$$
\ell_{\mathrm{pre-training}}=\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}\gamma_{k}\ell_{k}(\pmb{\tau}_ {k}^{m},\hat{\pmb{\tau}}_{k}^{m})
$$

$$
\ell_{\mathrm{pre-training}}=\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K}\gamma_{k}\ell_{k}(\pmb{\tau}_ {k}^{m},\hat{\pmb{\tau}}_{k}^{m})
$$

where $\tau_{k}^{m}$ is the target vector for example $m$, $M$ is the number of synthetic examples, and $\gamma_{k}$ are hyperparameter weights obtained with grid search (more details in the Appendix).

其中 $\tau_{k}^{m}$ 是样本 $m$ 的目标向量,$M$ 是合成样本数量,$\gamma_{k}$ 是通过网格搜索获得的超参数权重 (更多细节见附录)。
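The aggregate loss above can be sketched in a few lines of numpy. The task losses mirror the definitions in this section, but the targets, predictions, and γ weights below are illustrative stand-ins, not the paper's actual values:

```python
import numpy as np

def regression_loss(tau, tau_hat):
    """l2 loss normalized by the dimension of the target vector: ||t - t^||^2 / |t|."""
    tau, tau_hat = np.asarray(tau, float), np.asarray(tau_hat, float)
    return float(np.sum((tau - tau_hat) ** 2) / tau.size)

def multiclass_loss(tau, logits):
    """Cross-entropy between target class probabilities and predicted logits."""
    logits = np.asarray(logits, float)
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(-np.sum(np.asarray(tau, float) * log_probs))

def pretraining_loss(examples, gammas):
    """(1/M) * sum_m sum_k gamma_k * l_k(tau_k^m, tau_hat_k^m).
    `examples` maps each synthetic example to {task: (loss_fn, target, prediction)}."""
    total = 0.0
    for tasks in examples:
        for k, (loss_fn, tau, tau_hat) in tasks.items():
            total += gammas[k] * loss_fn(tau, tau_hat)
    return total / len(examples)

gammas = {"rouge": 1.0, "entail": 0.5}  # hypothetical grid-searched weights
examples = [
    {"rouge": (regression_loss, [0.7, 0.6, 0.65], [0.5, 0.5, 0.5]),
     "entail": (multiclass_loss, [1.0, 0.0, 0.0], [2.0, 0.1, -1.0])},
]
loss = pretraining_loss(examples, gammas)
```

In the real model, each prediction comes from its own linear head on the shared [CLS] embedding, so all tasks update the same encoder.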

5 Experiments

5 实验

In this section, we report our experimental results for two tasks, translation and data-to-text. First, we benchmark BLEURT against existing text generation metrics on the last 3 years of the WMT Metrics Shared Task (Bojar et al., 2017). We then evaluate its robustness to quality drifts with a series of synthetic datasets based on WMT17. We test BLEURT’s ability to adapt to different tasks with the WebNLG 2017 Challenge Dataset (Gardent et al., 2017). Finally, we measure the contribution of each pre-training task with ablation experiments.

在本节中,我们报告了翻译和数据到文本两项任务的实验结果。首先,我们在WMT Metrics Shared Task近三年的数据上 (Bojar et al., 2017) 将BLEURT与现有文本生成指标进行基准测试。随后基于WMT17构建的合成数据集序列,评估其对质量漂移的鲁棒性。通过WebNLG 2017挑战赛数据集 (Gardent et al., 2017) 测试BLEURT适应不同任务的能力。最后通过消融实验量化各预训练任务的贡献度。

Our Models: Unless specified otherwise, all BLEURT models are trained in three steps: regular BERT pre-training (Devlin et al., 2019), pre-training on synthetic data (as explained in Section 4), and fine-tuning on task-specific ratings (translation and/or data-to-text). We experiment with two versions of BLEURT, BLEURT and BLEURTbase, respectively based on BERT-Large (24 layers, 1024 hidden units, 16 heads) and BERT-Base (12 layers, 768 hidden units, 12 heads) (Devlin et al., 2019), both uncased. We use batch size 32, learning rate 1e-5, and 800,000 steps for pre-training and 40,000 steps for fine-tuning. We provide the full detail of our training setup in the Appendix.

我们的模型:除非另有说明,所有BLEURT模型都经过三个步骤训练:常规BERT预训练 (Devlin et al., 2019)、合成数据预训练(如第4节所述)以及针对特定任务评分(翻译和/或数据到文本)的微调。我们实验了两种版本的BLEURT,分别是基于BERT-Large(24层,1024个隐藏单元,16个头)的BLEURT和基于BERT-Base(12层,768个隐藏单元,12个头) (Devlin et al., 2019) 的BLEURTbase,两者均为uncased版本。我们使用批量大小32、学习率1e-5进行预训练80万步和微调4万步。完整的训练设置细节见附录。

Table 2: Agreement with human ratings on the WMT17 Metrics Shared Task. The metrics are Kendall Tau $(\tau)$ and the Pearson correlation ($r$, the official metric of the shared task), divided by 100.

| model | cs-en τ/r | de-en τ/r | fi-en τ/r | lv-en τ/r | ru-en τ/r | tr-en τ/r | zh-en τ/r | avg τ/r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 29.6/43.2 | 28.9/42.2 | 38.6/56.0 | 23.9/38.2 | 34.3/47.7 | 34.3/54.0 | 37.4/51.3 | 32.4/47.5 |
| MoverScore | 47.6/67.0 | 51.2/70.8 | NA | NA | 53.4/73.8 | 56.1/76.2 | 53.1/74.4 | 52.3/72.4 |
| BERTscore w/ BERT | 48.0/66.6 | 50.3/70.1 | 61.4/81.4 | 51.6/72.3 | 53.7/73.0 | 55.6/76.0 | 52.2/73.1 | 53.3/73.2 |
| BERTscore w/ roBERTa | 54.2/72.6 | 56.9/76.0 | 64.8/83.2 | 56.2/75.7 | 57.2/75.2 | 57.9/76.1 | 58.8/78.9 | 58.0/76.8 |
| chrF++ | 35.0/52.3 | 36.5/53.4 | 47.5/67.8 | 33.3/52.0 | 41.5/58.8 | 43.2/61.4 | 40.5/59.3 | 39.6/57.9 |
| BEER | 34.0/51.1 | 36.1/53.0 | 48.3/68.1 | 32.8/51.5 | 40.2/57.7 | 42.8/60.0 | 39.5/58.2 | 39.1/57.1 |
| BLEURTbase -pre | 51.5/68.2 | 52.0/70.7 | 66.6/85.1 | 60.8/80.5 | 57.5/77.7 | 56.9/76.0 | 52.1/72.1 | 56.8/75.8 |
| BLEURTbase | 55.7/73.4 | 56.3/75.7 | 68.0/86.8 | 64.7/83.3 | 60.1/80.1 | 62.4/81.7 | 59.5/80.5 | 61.0/80.2 |
| BLEURT -pre | 56.0/74.7 | 57.1/75.7 | 67.2/86.1 | 62.3/81.7 | 58.4/78.3 | 61.6/81.4 | 55.9/76.5 | 59.8/79.2 |
| BLEURT | 59.3/77.3 | 59.9/79.2 | 69.5/87.8 | 64.4/83.5 | 61.3/81.1 | 62.9/82.4 | 60.2/81.4 | 62.5/81.8 |

表 2: WMT17 指标共享任务中与人工评分的相关性。指标为 Kendall Tau $(\tau)$ 和 Pearson 相关系数 $\cdot_{r_{\mathrm{{i}}}}$ (该共享任务的官方指标),数值已除以 100。

| model | cs-en τ/r | de-en τ/r | fi-en τ/r | lv-en τ/r | ru-en τ/r | tr-en τ/r | zh-en τ/r | avg τ/r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 29.6/43.2 | 28.9/42.2 | 38.6/56.0 | 23.9/38.2 | 34.3/47.7 | 34.3/54.0 | 37.4/51.3 | 32.4/47.5 |
| MoverScore | 47.6/67.0 | 51.2/70.8 | NA | NA | 53.4/73.8 | 56.1/76.2 | 53.1/74.4 | 52.3/72.4 |
| BERTscore w/ BERT | 48.0/66.6 | 50.3/70.1 | 61.4/81.4 | 51.6/72.3 | 53.7/73.0 | 55.6/76.0 | 52.2/73.1 | 53.3/73.2 |
| BERTscore w/ roBERTa | 54.2/72.6 | 56.9/76.0 | 64.8/83.2 | 56.2/75.7 | 57.2/75.2 | 57.9/76.1 | 58.8/78.9 | 58.0/76.8 |
| chrF++ | 35.0/52.3 | 36.5/53.4 | 47.5/67.8 | 33.3/52.0 | 41.5/58.8 | 43.2/61.4 | 40.5/59.3 | 39.6/57.9 |
| BEER | 34.0/51.1 | 36.1/53.0 | 48.3/68.1 | 32.8/51.5 | 40.2/57.7 | 42.8/60.0 | 39.5/58.2 | 39.1/57.1 |
| BLEURTbase -pre | 51.5/68.2 | 52.0/70.7 | 66.6/85.1 | 60.8/80.5 | 57.5/77.7 | 56.9/76.0 | 52.1/72.1 | 56.8/75.8 |
| BLEURTbase | 55.7/73.4 | 56.3/75.7 | 68.0/86.8 | 64.7/83.3 | 60.1/80.1 | 62.4/81.7 | 59.5/80.5 | 61.0/80.2 |
| BLEURT -pre | 56.0/74.7 | 57.1/75.7 | 67.2/86.1 | 62.3/81.7 | 58.4/78.3 | 61.6/81.4 | 55.9/76.5 | 59.8/79.2 |
| BLEURT | 59.3/77.3 | 59.9/79.2 | 69.5/87.8 | 64.4/83.5 | 61.3/81.1 | 62.9/82.4 | 60.2/81.4 | 62.5/81.8 |
| model | cs-en τ/DA | de-en τ/DA | et-en τ/DA | fi-en τ/DA | ru-en τ/DA | tr-en τ/DA | zh-en τ/DA | avg τ/DA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 20.0/22.5 | 31.6/41.5 | 26.0/28.2 | 17.1/15.6 | 20.5/22.4 | 22.9/13.6 | 21.6/17.6 | 22.8/23.2 |
| BERTscore w/ BERT | 29.5/40.0 | 39.9/53.8 | 34.7/39.0 | 26.0/29.7 | 27.8/34.7 | 31.7/27.5 | 27.5/25.2 | 31.0/35.7 |
| BERTscore w/ roBERTa | 31.2/41.1 | 42.2/55.5 | 37.0/40.3 | 27.8/30.8 | 30.2/35.4 | 32.8/30.2 | 29.2/26.3 | 32.9/37.1 |
| Meteor++ | 22.4/26.8 | 34.7/45.7 | 29.7/32.9 | 21.6/20.6 | 22.8/25.3 | 27.3/20.4 | 23.6/17.5* | 26.0/27.0 |
| RUSE | 27.0/34.5 | 36.1/49.8 | 32.9/36.8 | 25.5/27.5 | 25.0/31.1 | 29.1/25.9 | 24.6/21.5* | 28.6/32.4 |
| YiSi1 | 23.5/31.7 | 35.5/48.8 | 30.2/35.1 | 21.5/23.1 | 23.3/30.0 | 26.8/23.4 | 23.1/20.9 | 26.3/30.4 |
| YiSi1 SRL | 23.3/31.5 | 34.3/48.3 | 29.8/34.5 | 21.2/23.7 | 22.6/30.6 | 26.1/23.3 | 22.9/20.7 | 25.7/30.4 |
| BLEURTbase -pre | 33.0/39.0 | 41.5/54.6 | 38.2/39.6 | 30.7/31.1 | 30.7/34.9 | 32.9/29.8 | 28.3/25.6 | 33.6/36.4 |
| BLEURTbase | 34.5/42.9 | 43.5/55.6 | 39.2/40.5 | 31.5/30.9 | 31.0/35.7 | 35.0/29.4 | 29.6/26.9 | 34.9/37.4 |
| BLEURT -pre | 34.5/42.1 | 42.7/55.4 | 39.2/40.6 | 31.4/31.6 | 31.4/34.2 | 33.4/29.3 | 28.9/25.6 | 34.5/37.0 |
| BLEURT | 35.6/42.3 | 44.2/56.7 | 40.0/41.4 | 32.1/32.5 | 31.9/36.0 | 35.5/31.5 | 29.7/26.0 | 35.6/38.1 |
| 模型 | cs-en τ/DA | de-en τ/DA | et-en τ/DA | fi-en τ/DA | ru-en τ/DA | tr-en τ/DA | zh-en τ/DA | avg τ/DA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sentBLEU | 20.0/22.5 | 31.6/41.5 | 26.0/28.2 | 17.1/15.6 | 20.5/22.4 | 22.9/13.6 | 21.6/17.6 | 22.8/23.2 |
| BERTscore w/ BERT | 29.5/40.0 | 39.9/53.8 | 34.7/39.0 | 26.0/29.7 | 27.8/34.7 | 31.7/27.5 | 27.5/25.2 | 31.0/35.7 |
| BERTscore w/ roBERTa | 31.2/41.1 | 42.2/55.5 | 37.0/40.3 | 27.8/30.8 | 30.2/35.4 | 32.8/30.2 | 29.2/26.3 | 32.9/37.1 |
| Meteor++ | 22.4/26.8 | 34.7/45.7 | 29.7/32.9 | 21.6/20.6 | 22.8/25.3 | 27.3/20.4 | 23.6/17.5* | 26.0/27.0 |
| RUSE | 27.0/34.5 | 36.1/49.8 | 32.9/36.8 | 25.5/27.5 | 25.0/31.1 | 29.1/25.9 | 24.6/21.5* | 28.6/32.4 |
| YiSi1 | 23.5/31.7 | 35.5/48.8 | 30.2/35.1 | 21.5/23.1 | 23.3/30.0 | 26.8/23.4 | 23.1/20.9 | 26.3/30.4 |
| YiSi1 SRL | 23.3/31.5 | 34.3/48.3 | 29.8/34.5 | 21.2/23.7 | 22.6/30.6 | 26.1/23.3 | 22.9/20.7 | 25.7/30.4 |
| BLEURTbase -pre | 33.0/39.0 | 41.5/54.6 | 38.2/39.6 | 30.7/31.1 | 30.7/34.9 | 32.9/29.8 | 28.3/25.6 | 33.6/36.4 |
| BLEURTbase | 34.5/42.9 | 43.5/55.6 | 39.2/40.5 | 31.5/30.9 | 31.0/35.7 | 35.0/29.4 | 29.6/26.9 | 34.9/37.4 |
| BLEURT -pre | 34.5/42.1 | 42.7/55.4 | 39.2/40.6 | 31.4/31.6 | 31.4/34.2 | 33.4/29.3 | 28.9/25.6 | 34.5/37.0 |
| BLEURT | 35.6/42.3 | 44.2/56.7 | 40.0/41.4 | 32.1/32.5 | 31.9/36.0 | 35.5/31.5 | 29.7/26.0 | 35.6/38.1 |

Table 3: Agreement with human ratings on the WMT18 Metrics Shared Task. The metrics are Kendall Tau $(\tau)$ and WMT’s Direct Assessment metrics divided by 100. The star * indicates results that are more than 0.2 percentage points away from the official WMT results (up to 0.4 percentage points away).

表 3: WMT18指标共享任务中与人类评分的相关性。指标为肯德尔tau系数$(\tau)$和WMT直接评估指标(除以100)。星号*表示与官方WMT结果相差超过0.2个百分点(最大相差0.4个百分点)。

5.1 WMT Metrics Shared Task

5.1 WMT 指标共享任务

Datasets and Metrics: We use years 2017 to 2019 of the WMT Metrics Shared Task, to-English language pairs. For each year, we used the official WMT test set, which includes several thousand pairs of sentences with human ratings from the news domain. The training sets contain 5,360, 9,492, and 147,691 records for each year. The test sets for years 2018 and 2019 are noisier, as reported by the organizers and shown by the overall lower correlations.

数据集与评估指标:我们使用2017至2019年WMT指标共享任务的英译语言对数据。每年采用官方WMT测试集,包含新闻领域数千组人工评分的句对。训练集分别包含5,360、9,492和147,691条记录。如组织方所述,2018和2019年测试集噪声更大,整体相关性指标较低也印证了这一点。

We evaluate the agreement between the automatic metrics and the human ratings. For each year, we report two metrics: Kendall’s Tau $\tau$ (for consistency across experiments), and the official WMT metric for that year (for completeness). The official WMT metric is either Pearson’s correlation or a robust variant of Kendall’s Tau called DARR, described in the Appendix. All the numbers come from our own implementation of the benchmark.4 Our results are globally consistent with the official results but we report small differences in 2018 and 2019, marked in the tables.

我们评估了自动指标与人工评分之间的一致性。每年报告两个指标:Kendall's Tau $\tau$(用于实验间一致性)以及该年度的官方WMT指标(用于完整性)。官方WMT指标采用皮尔逊相关系数或称为DARR的Kendall's Tau鲁棒变体(详见附录)。所有数据均基于我们自主实现的基准测试得出。4 我们的结果与官方结果整体一致,但在2018和2019年存在微小差异(表中已标注)。
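As a point of reference, the segment-level agreement can be sketched in a few lines. This is an illustrative sketch, not the paper's benchmark implementation: ties are skipped for simplicity, and the official benchmark differs in its handling of noisy judgments.

```python
from itertools import combinations

def kendall_tau(metric_scores, human_ratings):
    """Kendall's Tau: (concordant - discordant) / all compared pairs.
    Ties are skipped for simplicity; the official benchmark differs."""
    concordant = discordant = 0
    for (m1, h1), (m2, h2) in combinations(zip(metric_scores, human_ratings), 2):
        s = (m1 - m2) * (h1 - h2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# The metric inverts the order of the first two segments: 4 of 6 pairs agree.
print(kendall_tau([1.0, 3.0, 2.0, 4.0], [0.4, 0.1, 0.5, 0.9]))  # -> (4-2)/6 = 0.333...
```

DARR, the robust variant used in 2018 and 2019, additionally discards pairs of translations whose human scores are too close (see the Appendix).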

Models: We experiment with four versions of BLEURT: BLEURT, BLEURTbase, BLEURT-pre, and BLEURTbase-pre. The first two models are based on BERT-large and BERT-base. In the latter two versions, we skip the pre-training phase and fine-tune directly on the WMT ratings. For each year of the WMT shared task, we use the test sets from the previous years for training and validation. We describe our setup in further detail in the Appendix. We compare BLEURT to participant data from the shared task and to automatic metrics that we ran ourselves. In the former case, we use the best-performing contestants for each year, that is, chrF$^{++}$, BEER, Meteor$^{++}$, RUSE, Yisi1, ESIM and Yisi1-SRL (Mathur et al., 2019). All the contestants use the same WMT training data, in addition to existing sentence or token embeddings. In the latter case, we use Moses sentence BLEU, BERTscore (Zhang et al., 2020), and MoverScore (Zhao et al., 2019). For BERTscore, we use BERT-large uncased for fairness, and roBERTa (the recommended version) for completeness (Liu et al., 2019). We run MoverScore on WMT 2017 using the scripts published by the authors.

模型:我们测试了BLEURT的四个版本:BLEURT、BLEURTbase、BLEURT-pre和BLEURTbase-pre。前两个模型基于BERT-large和BERT-base,后两个版本跳过预训练阶段,直接在WMT评分数据上进行微调。针对WMT共享任务的每一年份,我们使用前几年的测试集进行训练和验证。附录详细描述了实验设置。我们将BLEURT与共享任务的参赛数据及自行运行的自动指标进行对比:前者选取历年表现最佳的参赛系统(包括chrF$^{++}$、BEER、Meteor$^{++}$、RUSE、Yisi1、ESIM和Yisi1-SRL (Mathur et al., 2019)),这些系统除使用WMT训练数据外,还结合了现有句子或token嵌入;后者采用Moses句子BLEU、BERTscore (Zhang et al., 2020)和MoverScore (Zhao et al., 2019)。为保持公平性,BERTscore统一使用BERT-large uncased版本,同时为完整性补充roBERTa(推荐版本)(Liu et al., 2019)。MoverScore基于作者发布的脚本在WMT 2017数据上运行。

Results: Tables 2, 3, 4 show the results. For years 2017 and 2018, a BLEURT-based metric

结果:表2、表3、表4展示了结果。对于2017年和2018年,基于BLEURT的指标

model de-en T/DA fi-en T/DA gu-en T/DA kk-en T/DA lt-en T/DA ru-en T/DA zh-en T/DA avg T/DA
sentBLEU 19.4/5.4 20.6/23.3 17.3/18.9 30.0/37.6 23.8/26.2 19.4/12.4 28.7/32.2 22.7/22.3
BERTscore w/ BERT 26.2/17.3 27.6/34.7 25.8/29.3 36.9/44.0 30.8/37.4 25.2/20.6 37.5/41.4 30.0/32.1
BERTscore w/ roBERTa 29.1/19.3 29.7/35.3 27.7/32.4 37.1/43.1 32.6/38.2 26.3/22.7 41.4/43.8 32.0/33.6
ESIM 28.4/16.6 28.9/33.7 27.1/30.4 38.4/43.3 33.2/35.9 26.6/19.9 38.7/39.6 31.6/31.3
YiSi1SRL19 26.3/19.8 27.8/34.6 26.6/30.6 36.9/44.1 30.9/38.0 25.3/22.0 38.9/43.1 30.4/33.2
BLEURTbase-pre 30.1/15.8 30.4/35.4 26.8/29.7 37.8/41.8 34.2/39.0 27.0/20.7 40.1/39.8 32.3/31.7
BLEURTbase 31.0/16.6 31.3/36.2 27.9/30.6 39.5/44.6 35.2/39.4 28.5/21.5 41.7/41.6 33.6/32.9
BLEURT-pre 31.1/16.9 31.3/36.5 27.6/31.3 38.4/42.8 35.0/40.0 27.5/21.4 41.6/41.4 33.2/32.9
BLEURT 31.2/16.9 31.7/36.3 28.3/31.9 39.5/44.6 35.2/40.6 28.3/22.3 42.7/42.4 33.8/33.6
模型 德英 T/DA 芬英 T/DA 古英 T/DA 哈英 T/DA 立英 T/DA 俄英 T/DA 中英 T/DA 平均 T/DA
sentBLEU 19.4/5.4 20.6/23.3 17.3/18.9 30.0/37.6 23.8/26.2 19.4/12.4 28.7/32.2 22.7/22.3
BERTscorew/BERT 26.2/17.3 27.6/34.7 25.8/29.3 36.9/44.0 30.8/37.4 25.2/20.6 37.5/41.4 30.0/32.1
BERTscorew/roBERTa 29.1/19.3 29.7/35.3 27.7/32.4 37.1/43.1 32.6/38.2 26.3/22.7 41.4/43.8 32.0/33.6
ESIM 28.4/16.6 28.9/33.7 27.1/30.4 38.4/43.3 33.2/35.9 26.6/19.9 38.7/39.6 31.6/31.3
YiSi1SRL19 26.3/19.8 27.8/34.6 26.6/30.6 36.9/44.1 30.9/38.0 25.3/22.0 38.9/43.1 30.4/33.2
BLEURTbase-pre 30.1/15.8 30.4/35.4 26.8/29.7 37.8/41.8 34.2/39.0 27.0/20.7 40.1/39.8 32.3/31.7
BLEURTbase 31.0/16.6 31.3/36.2 27.9/30.6 39.5/44.6 35.2/39.4 28.5/21.5 41.7/41.6 33.6/32.9
BLEURT-pre 31.1/16.9 31.3/36.5 27.6/31.3 38.4/42.8 35.0/40.0 27.5/21.4 41.6/41.4 33.2/32.9
BLEURT 31.2/16.9 31.7/36.3 28.3/31.9 39.5/44.6 35.2/40.6 28.3/22.3 42.7/42.4 33.8/33.6

Table 4: Agreement with human ratings on the WMT19 Metrics Shared Task. The metrics are Kendall Tau $(\tau)$ and WMT’s Direct Assessment metrics divided by 100. All the values reported for Yisi1 SRL and ESIM fall within 0.2 percentage of the official WMT results.

表 4: WMT19指标共享任务中与人类评分的吻合度。指标为肯德尔系数$(\tau)$和WMT直接评估指标(除以100)。Yisi1 SRL和ESIM报告的所有数值与官方WMT结果相差均在0.2个百分点以内。


Figure 1: Distribution of the human ratings in the train/validation and test datasets for different skew factors.

图 1: 不同偏斜因子下训练/验证集与测试集中人工评分的分布情况。

dominates the benchmark for each language pair (Tables 2 and 3). BLEURT and BLEURTbase are also competitive for year 2019: they yield the best results for all language pairs on Kendall's Tau, and they come first for 3 out of 7 pairs on DARR. As expected, BLEURT dominates BLEURTbase in the majority of cases. Pre-training consistently improves the results of BLEURT and BLEURTbase. We observe the largest effect on year 2017, where it adds up to 7.4 Kendall Tau points for BLEURTbase (zh-en). The effect is milder on years 2018 and 2019, up to 2.1 points (tr-en, 2018). We explain the difference by the fact that the training data used for 2017 is smaller than the datasets used for the following years, so pre-training is likelier to help. In general, pre-training yields higher returns for BERT-base than for BERT-large; in fact, BLEURTbase with pre-training is often better than BLEURT without it.

在每个语言对的基准测试中占据主导地位(表2和表3)。BLEURT和BLEURTbase在2019年也表现优异:它们在Kendall's Tau上对所有语言对都取得了最佳结果,并在DARR的7个语言对中有3个排名第一。正如预期,BLEURT在大多数情况下优于BLEURTbase。预训练持续提升了BLEURT和BLEURTbase的结果。我们观察到对2017年的影响最大,其中BLEURTbase在zh-en上增加了高达7.4个Kendall Tau点。对2018和2019年的影响较小,最高为2.1点(tr-en, 2018)。我们用2017年训练数据比后续年份数据集更小这一事实来解释这一差异:数据越少,预训练越可能有所帮助。总体而言,预训练对BERT-base的回报高于BERT-large——事实上,经过预训练的BLEURTbase通常优于未经预训练的BLEURT。

Takeaways: Pre-training delivers consistent improvements, especially for BLEURT-base. BLEURT yields state-of-the art performance for all years of the WMT Metrics Shared task.

要点:预训练能带来一致的性能提升,尤其对BLEURT-base模型效果显著。BLEURT在WMT指标共享任务历年数据中均展现出最先进的性能表现。

5.2 Robustness to Quality Drift

5.2 质量漂移的鲁棒性

We assess our claim that pre-training makes BLEURT robust to quality drifts, by constructing a series of tasks for which it is increasingly pressured to extrapolate. All the experiments that follow are based on the WMT Metrics Shared Task 2017, because the ratings for this edition are particularly reliable.5

我们通过构建一系列任务来验证预训练使BLEURT对质量漂移具有鲁棒性的主张,这些任务逐渐增加其外推压力。以下所有实验均基于WMT 2017年度量共享任务,因为该届评分数据尤为可靠。5


Figure 2: Agreement between BLEURT and human ratings for different skew factors in train and test.


图 2: BLEURT评分与人类评分在不同训练和测试偏斜因子下的相关性。

Methodology: We create increasingly challenging datasets by sub-sampling the records from the WMT Metrics shared task, keeping low-rated translations for training and high-rated translations for test. The key parameter is the skew factor $\alpha$, which measures how much the training data is left-skewed and the test data is right-skewed. Figure 1 shows the ratings distributions that we used in our experiments. The training data shrinks as $\alpha$ increases: in the most extreme case ($\alpha=3.0$), we use only 11.9% of the original 5,344 training records. We give the full detail of our sampling methodology in the Appendix.

方法:我们通过对WMT Metrics共享任务中的记录进行子采样来创建难度递增的数据集,保留低评分翻译用于训练,高评分翻译用于测试。关键参数是衡量训练数据左偏程度和测试数据右偏程度的偏斜因子$\alpha$。图1展示了实验中使用的评分分布。随着$\alpha$增大,训练数据量减少:在最极端情况下($\alpha=3.0$),仅使用原始5,344条训练记录中的11.9%。完整采样方法详见附录。

We use BLEURT with and without pre-training and we compare to Moses sentBLEU and BERTscore. We use BERT-large uncased for both BLEURT and BERTscore.

我们对比了使用预训练和未预训练的BLEURT,以及Moses sentBLEU和BERTscore。BLEURT和BERTscore均采用BERT-large uncased模型。


Figure 3: Absolute Kendall Tau of BLEU, Meteor, and BLEURT with human judgements on the WebNLG dataset, varying the size of the data used for training and validation.

图 3: BLEU、Meteor 和 BLEURT 在 WebNLG 数据集上与人类判断的绝对肯德尔 Tau 值,随训练和验证数据规模变化的对比。

Results: Figure 2 presents BLEURT’s performance as we vary the train and test skew independently. Our first observation is that the agreements fall for all metrics as we increase the test skew. This effect was already described in the 2019 WMT Metrics report (Ma et al., 2019). A common explanation is that the task gets more difficult as the ratings get closer—it is easier to discriminate between “good” and “bad” systems than to rank “good” systems.

结果:图 2 展示了当训练集和测试集的偏斜度独立变化时 BLEURT 的表现。我们首先观察到,随着测试集偏斜度的增加,所有指标的吻合度均下降。这一现象已在 2019 年 WMT 指标报告 (Ma et al., 2019) 中被描述过。常见的解释是,当评分越接近时任务会变得更困难——区分"好"和"差"系统比给"好"系统排序更容易。

Training skew has a disastrous effect on BLEURT without pre-training: it is below BERTscore for $\alpha=1.0$, and it falls under sentBLEU for $\alpha\geq1.5$. Pre-trained BLEURT is much more robust: the only case in which it falls under the baselines is $\alpha=3.0$, the most extreme drift, for which incorrect translations are used for train while excellent ones for test.

训练偏差对未经预训练的BLEURT具有灾难性影响:当$\alpha~=~1.0$时其表现低于BERTscore,当$\alpha\geq1.5$时甚至逊于sentBLEU。经过预训练的BLEURT则稳健得多:仅在$\alpha=3.0$这种最极端的偏移情况下(训练使用错误翻译而测试使用优质翻译)会低于基线指标。

Takeaways: Pre-training makes BLEURT significantly more robust to quality drifts.

要点:预训练使BLEURT对质量漂移的鲁棒性显著提升。

5.3 WebNLG Experiments

5.3 WebNLG 实验

In this section, we evaluate BLEURT’s performance on three tasks from a data-to-text dataset, the WebNLG Challenge 2017 (Shimorina et al., 2019). The aim is to assess BLEURT’s capacity to adapt to new tasks with limited training data.

在本节中,我们评估BLEURT在WebNLG Challenge 2017 (Shimorina et al., 2019) 数据到文本数据集上的三项任务表现,旨在测试BLEURT在有限训练数据下适应新任务的能力。

Dataset and Evaluation Tasks: The WebNLG challenge benchmarks systems that produce natural language descriptions of entities (e.g., buildings, cities, artists) from sets of 1 to 5 RDF triples. The organizers released the human assessments for 9 systems over 223 inputs, that is, 4,677 sentence pairs in total (we removed null values). Each input comes with 1 to 3 reference descriptions. The submissions are evaluated on 3 aspects: semantics, grammar, and fluency. We treat each type of rating as a separate modeling task. The data has no natural split between train and test, therefore we experiment with several schemes. We allocate 0% to about 50% of the data to training, and we split on either the evaluated systems or the RDF inputs in order to test different generalization regimes.

数据集与评估任务:
WebNLG挑战赛旨在评测系统根据1到5个RDF三元组生成实体(如建筑、城市、艺术家)自然语言描述的能力。主办方发布了针对223组输入(共4,677对句子,已剔除空值)的9个系统人工评估结果,每组输入包含1至3条参考描述。系统提交结果从语义、语法和流畅性三个维度进行评估,我们将每种评分类型视为独立建模任务。由于数据未预设训练集与测试集的划分,我们尝试了多种方案:将0%至约50%数据分配至训练集,并通过对被评估系统或RDF输入进行划分来测试不同泛化场景。

Systems and Baselines: BLEURT-pre-wmt is a public BERT-large uncased checkpoint directly trained on the WebNLG ratings. BLEURT-wmt was first pre-trained on synthetic data, then fine-tuned on WebNLG data. BLEURT was trained in three steps: first on synthetic data, then on WMT data (16-18), and finally on WebNLG data. When a record comes with several references, we run BLEURT on each reference and report the highest value (Zhang et al., 2020).

系统与基线:
BLEURT-pre-wmt 是一个基于 WebNLG 评分直接训练的公开 BERT-large uncased 检查点。BLEURT-wmt 首先在合成数据上进行预训练,随后在 WebNLG 数据上微调。BLEURT 的训练分为三步:先在合成数据上训练,再在 WMT (16-18) 数据上训练,最后在 WebNLG 数据上训练。当一条记录包含多个参考时,我们对每个参考运行 BLEURT 并报告最高值 (Zhang et al., 2020)。
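The multi-reference handling amounts to a max over per-reference scores. A minimal sketch, where `overlap_score` is a hypothetical stand-in for the learned metric:

```python
def multi_reference_score(score_fn, candidate, references):
    """Score the candidate against every reference and keep the maximum,
    mirroring how records with several reference descriptions are handled."""
    return max(score_fn(candidate, ref) for ref in references)

def overlap_score(candidate, reference):
    """Hypothetical stand-in metric: fraction of reference tokens covered."""
    cand, ref = set(candidate.split()), set(reference.split())
    return len(cand & ref) / max(len(ref), 1)

references = ["the tall tower", "a very tall tower located in Paris"]
print(multi_reference_score(overlap_score, "the tall tower", references))  # -> 1.0
```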

We report four baselines: BLEU, TER, Meteor, and BERTscore. The first three were computed by the WebNLG competition organizers. We ran the latter one ourselves, using BERTlarge uncased for a fair comparison.

我们报告了四个基线指标:BLEU、TER、Meteor和BERTscore。前三个指标由WebNLG竞赛组织者计算得出。为确保公平比较,我们自行运行了最后一个指标(使用BERTlarge uncased模型)。

Results: Figure 3 presents the correlation of the metrics with human assessments as we vary the share of data allocated to training. The more pre-trained BLEURT is, the quicker it adapts. The vanilla BERT approach BLEURT-pre-wmt requires about a third of the WebNLG data to dominate the baselines on the majority of tasks, and it still lags behind on semantics (split by system). In contrast, BLEURT-wmt is competitive with as little as 836 records, and BLEURT is comparable with BERTscore with zero fine-tuning.

结果:图3展示了随着分配给训练的数据比例变化,各指标与人工评估的相关性。预训练越充分的BLEURT模型适应速度越快。原始BERT方法BLEURT-pre-wmt需要约三分之一的WebNLG数据才能在多数任务上超越基线,但在语义指标(按系统划分)上仍表现欠佳。相比之下,BLEURT-wmt仅需836条记录即可达到竞争力,而未经微调的BLEURT与BERTscore表现相当。


Figure 4: Improvement in Kendall Tau on WMT 17 varying the pre-training tasks.

图 4: WMT 17上Kendall Tau随预训练任务变化的改进情况

Takeaways: Thanks to pre-training, BLEURT can quickly adapt to the new tasks. BLEURT finetuned twice (first on synthetic data, then on WMT data) provides acceptable results on all tasks without training data.

要点:得益于预训练,BLEURT能够快速适应新任务。经过两次微调(先是在合成数据上,然后在WMT数据上)的BLEURT,在所有任务上无需训练数据即可提供可接受的结果。

5.4 Ablation Experiments

5.4 消融实验

Figure 4 presents our ablation experiments on WMT 2017, which highlight the relative importance of each pre-training task. On the left side, we compare BLEURT pre-trained on a single task to BLEURT without pre-training. On the right side, we compare full BLEURT to BLEURT pre-trained on all tasks except one. Pre-training on BERTscore, entailment, and the back translation scores yields improvements (symmetrically, ablating them degrades BLEURT). In contrast, BLEU and ROUGE have a negative impact. We conclude that pre-training on high-quality signals helps BLEURT, but that metrics that correlate less well with human judgment may in fact harm the model.6

图4展示了我们在WMT 2017上的消融实验,突出了每个预训练任务的相对重要性。左侧比较了在单一任务上预训练的BLEURT与未预训练的BLEURT;右侧比较了完整版BLEURT与排除某一项任务后预训练的BLEURT。基于BERTscore、蕴含任务和回译得分的预训练能带来提升(对称地,移除它们会降低BLEURT性能),而BLEU和ROUGE则产生负面影响。我们得出结论:基于高质量信号的预训练有助于BLEURT,但与人类判断相关性较低的指标实际上可能损害模型性能。6

6 Related Work

6 相关工作

The WMT shared metrics competition (Bojar et al., 2016; Ma et al., 2018, 2019) has inspired the creation of many learned metrics, some of which use regression or deep learning (Stanojevic and Simaan, 2014; Ma et al., 2017; Shimanaka et al., 2018; Chen et al., 2017; Mathur et al., 2019). Other metrics have been introduced, such as the recent MoverScore (Zhao et al., 2019) which combines contextual embeddings and Earth Mover’s Distance. We provide a head-to-head comparison with the best performing of those in our experiments. Other approaches do not attempt to estimate quality directly, but use information extraction or question answering as a proxy (Wiseman et al., 2017; Goodrich et al., 2019; Eyal et al., 2019). Those are complementary to our work.

WMT共享指标竞赛 (Bojar等人,2016;Ma等人,2018,2019) 催生了许多基于学习的指标,其中一些采用回归或深度学习技术 (Stanojevic和Simaan,2014;Ma等人,2017;Shimanaka等人,2018;Chen等人,2017;Mathur等人,2019)。其他指标也被提出,例如结合上下文嵌入和推土机距离 (Earth Mover's Distance) 的最新MoverScore (Zhao等人,2019)。我们在实验中与表现最佳的指标进行了直接对比。另一些方法不直接评估质量,而是采用信息抽取或问答作为代理指标 (Wiseman等人,2017;Goodrich等人,2019;Eyal等人,2019),这些方法与我们的工作形成互补。

There has been recent work that uses BERT for evaluation. BERTScore (Zhang et al., 2020) proposes replacing the hard n-gram overlap of BLEU with a soft-overlap using BERT embeddings. We use it in all our experiments. Bertr (Mathur et al., 2019) and YiSi (Mathur et al., 2019) also make use of BERT embeddings to capture similarity. SumQE (Xenouleas et al., 2019) fine-tunes BERT for quality estimation as we describe in Section 3. Our focus is different—we train metrics that are not only state-of-the-art in conventional IID experimental setups, but also robust in the presence of scarce and out-of-distribution training data. To our knowledge no existing work has explored pretraining and extrapolation in the context of NLG.

近期有研究利用BERT进行评估。BERTScore (Zhang等人,2020)提出用BERT嵌入的软重叠替代BLEU的硬n-gram重叠。我们在所有实验中都使用了该方法。Bertr (Mathur等人,2019)和YiSi (Mathur等人,2019)也利用BERT嵌入来捕捉相似性。如第3节所述,SumQE (Xenouleas等人,2019)对BERT进行微调以进行质量估计。我们的重点不同——我们训练的度量标准不仅在传统的IID实验设置中是最先进的,而且在训练数据稀缺和分布外的情况下也具有鲁棒性。据我们所知,目前还没有研究在自然语言生成(NLG)的背景下探索预训练和外推。

Previous studies have used noising for referenceless evaluation (Dusek et al., 2019). Noisy pre-training has also been proposed before for other tasks such as paraphrasing (Wieting et al., 2016; Tomar et al., 2017) but generally not with synthetic data. Generating synthetic data via paraphrases and perturbations has been commonly used for generating adversarial examples (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018), an orthogonal line of research.

先前的研究已采用加噪技术进行无参考评估 (Dusek et al., 2019)。加噪预训练方法也曾在其他任务中被提出,例如复述生成 (Wieting et al., 2016; Tomar et al., 2017),但通常不使用合成数据。通过复述和扰动生成合成数据的方法常见于对抗样本生成领域 (Jia and Liang, 2017; Iyyer et al., 2018; Belinkov and Bisk, 2018; Ribeiro et al., 2018),这是一条与之正交的研究路线。

7 Conclusion

7 结论

We presented BLEURT, a reference-based text generation metric for English. Because the metric is trained end-to-end, BLEURT can model human assessment with superior accuracy. Furthermore, pre-training makes the metric particularly robust to both domain and quality drifts. Future research directions include multilingual NLG evaluation, and hybrid methods involving both humans and classifiers.

我们提出了BLEURT,一种基于参考的英语文本生成评估指标。由于该指标采用端到端训练,BLEURT能够以更高精度模拟人类评估。此外,预训练使该指标对领域偏移和质量漂移具有特别强的鲁棒性。未来研究方向包括多语言自然语言生成(NLG)评估,以及结合人类与分类器的混合方法。

Acknowledgments

致谢

Thanks to Eunsol Choi, Nicholas FitzGerald, Jacob Devlin, and to the members of the Google AI Language team for the proof-reading, feedback, and suggestions. We also thank Madhavan Kidambi and Ming-Wei Chang, who implemented blank-filling with BERT.

感谢 Eunsol Choi、Nicholas FitzGerald、Jacob Devlin 以及 Google AI Language 团队成员对本文的校对、反馈和建议。同时感谢 Madhavan Kidambi 和 Ming-Wei Chang 实现了基于 BERT 的填空任务。

Inderjeet Mani. 1999. Advances in automatic text summarization. MIT press.

Inderjeet Mani. 1999. 自动文本摘要的进展. MIT出版社.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of ACL.

Nitika Mathur、Timothy Baldwin和Trevor Cohn。2019。将评估置于上下文中:上下文嵌入提升机器翻译评估效果。见ACL会议论文集。

Kathleen McKeown. 1992. Text generation. Cambridge University Press.

Kathleen McKeown. 1992. 文本生成. Cambridge University Press.

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for nlg. Proceedings of EMNLP.

Jekaterina Novikova、Ondrej Dusek、Amanda Cercas Curry和Verena Rieser。2017。为什么我们需要新的自然语言生成评估指标。EMNLP会议论文集。

Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.

Kishore Papineni、Salim Roukos、Todd Ward 和 WeiJing Zhu。2002。BLEU:一种机器翻译自动评估方法。载于ACL会议论文集。

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging nlp models. In Proceedings of ACL.

Marco Tulio Ribeiro、Sameer Singh和Carlos Guestrin。2018. 用于调试NLP模型的语义等价对抗规则。载于ACL会议论文集。

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. Proceedings of ACL.

Rico Sennrich、Barry Haddow 和 Alexandra Birch。2016. 利用单语数据改进神经机器翻译模型。ACL 会议论文集。

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. Ruse: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of WMT.

Hiroki Shimanaka、Tomoyuki Kajiwara和Mamoru Komachi。2018。Ruse:基于句子嵌入的回归器用于自动机器翻译评估。见于WMT会议论文集。

Anastasia Shimorina, Claire Gardent, Shashi Narayan, and Laura Perez-Beltrachini. 2019. Webnlg challenge: Human evaluation results. Technical report.

Anastasia Shimorina、Claire Gardent、Shashi Narayan和Laura Perez-Beltrachini。2019。WebNLG挑战赛:人工评估结果。技术报告。

Ronnie W Smith and D Richard Hipp. 1994. Spoken natural language dialog systems: A practical approach. Oxford University Press.

Ronnie W Smith 和 D Richard Hipp。1994。《口语自然语言对话系统:实用方法》。牛津大学出版社。

Milos Stanojevic and Khalil Simaan. 2014. Beer: Better evaluation as ranking. In Proceedings of WMT.

Milos Stanojevic和Khalil Simaan. 2014. BEER: 基于排序的更好评估方法. 见WMT会议论文集.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS.

Ilya Sutskever、Oriol Vinyals 和 Quoc V Le。2014。基于神经网络的序列到序列学习。见《NIPS 会议论文集》。

Ran Tian, Shashi Narayan, Thibault Sellam, and Ankur P Parikh. 2019. Sticking to the facts: Confident decoding for faithful data-to-text generation. arXiv:1910.08684.

Ran Tian、Shashi Narayan、Thibault Sellam 和 Ankur P Parikh。2019. 忠于事实:数据到文本生成的可信解码。arXiv:1910.08684。

Gaurav Singh Tomar, Thyago Duque, Oscar Tackstrom, Jakob Uszkoreit, and Dipanjan Das. 2017. Neural paraphrase identification of questions with noisy pretraining. Proceedings of the First Workshop on Subword and Character Level Models in NLP.

Gaurav Singh Tomar, Thyago Duque, Oscar Tackstrom, Jakob Uszkoreit, 和 Dipanjan Das. 2017. 基于噪声预训练的问题神经复述识别. 《第一届自然语言处理中子词和字符级模型研讨会论文集》.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS.

Ashish Vaswani、Noam Shazeer、Niki Parmar、Jakob Uszkoreit、Llion Jones、Aidan N Gomez、Łukasz Kaiser 和 Illia Polosukhin。2017. Attention is all you need。发表于 NIPS 会议论文集。

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. Proceedings of ICML.

Oriol Vinyals 和 Quoc Le。2015。神经对话模型。ICML 会议论文集。

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. Glue: A multi-task benchmark and analysis platform for natural language understanding. Proceedings of ICLR.

Alex Wang、Amanpreet Singh、Julian Michael、Felix Hill、Omer Levy 和 Samuel R Bowman。2019. GLUE:自然语言理解的多任务基准与分析平台。ICLR 会议论文集。

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. Proceedings of ICLR.

John Wieting、Mohit Bansal、Kevin Gimpel 和 Karen Livescu。2016. 面向通用释义句嵌入。ICLR 会议论文集。

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. Proceedings of NAACL HLT.

Adina Williams、Nikita Nangia 和 Samuel R Bowman。2018。一个用于通过推理理解句子的广泛覆盖挑战语料库。NAACL HLT 会议论文集。

Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in data-to-document generation. Proceedings of EMNLP.

Sam Wiseman、Stuart M Shieber 和 Alexander M Rush。2017。数据到文档生成的挑战。EMNLP 会议论文集。

Stratos Xenouleas, Prodromos Malakasiotis, Marianna Apidianaki, and Ion Androutsopoulos. 2019. Sumqe: a bert-based summary quality estimation model. In Proceedings of EMNLP.

Stratos Xenouleas、Prodromos Malakasiotis、Marianna Apidianaki 和 Ion Androutsopoulos。2019. SumQE:基于BERT的摘要质量评估模型补充材料。见《EMNLP会议论文集》。

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. Proceedings of ICLR.

Tianyi Zhang、Varsha Kishore、Felix Wu、Kilian Q Weinberger 和 Yoav Artzi。2020年。BERTScore:用BERT评估文本生成。ICLR会议论文集。

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. Proceedings of EMNLP.

Wei Zhao、Maxime Peyrard、Fei Liu、Yang Gao、Christian M Meyer 和 Steffen Eger。2019. Moverscore: 基于上下文嵌入和推土机距离的文本生成评估。EMNLP 会议论文集。

A Implementation Details of the Pre-Training Phase

预训练阶段的实现细节

This section provides implementation details for some of the pre-training techniques described in the main paper.

本节提供主论文中描述的部分预训练技术的实现细节。

A.1 Data Generation

A.1 数据生成

Random Masking: We use two masking strategies. The first strategy samples random words in the sentence and replaces them with masks (one for each token). Thus, the masks are scattered across the sentence. The second strategy creates contiguous sequences: it samples a start position $s$ and a length $l$ (uniformly distributed), and it masks all the tokens spanned by words between positions $s$ and $s+l$. In both cases, we use up to 15 masks per sentence. Instead of running the language model once and picking the most likely token at each position, we use beam search (with beam size 8 by default). This enforces consistency and avoids repeated sequences.

随机掩码:我们采用两种掩码策略。第一种策略随机采样句子中的单词并用掩码替换(每个token对应一个掩码),因此掩码会分散在整个句子中。第二种策略生成连续序列:采样起始位置$s$和长度$l$(均匀分布),并掩码从位置$s$到$s+l$之间所有单词对应的token。两种策略中,每句最多使用15个掩码。我们不采用单次运行语言模型并在每个位置选取最可能token的做法,而是使用束搜索(beam search,默认束宽为8),以确保生成结果的一致性并避免重复序列。
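The two masking strategies can be sketched as follows. This is an illustrative sketch with a plain `[MASK]` string and whitespace tokens; the actual pipeline operates on BERT wordpieces and fills the masks with beam search over the language model.

```python
import random

MASK = "[MASK]"

def scatter_mask(tokens, n_masks, rng):
    """Strategy 1: mask n_masks random positions, scattered in the sentence."""
    out = list(tokens)
    for i in rng.sample(range(len(out)), min(n_masks, len(out))):
        out[i] = MASK
    return out

def contiguous_mask(tokens, max_len, rng):
    """Strategy 2: mask a contiguous span starting at s with length l."""
    l = rng.randint(1, max_len)
    s = rng.randint(0, max(len(tokens) - l, 0))
    return tokens[:s] + [MASK] * min(l, len(tokens) - s) + tokens[s + l:]

rng = random.Random(0)
print(scatter_mask("the cat sat on the mat".split(), 2, rng))
print(contiguous_mask("the cat sat on the mat".split(), 3, rng))
```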

Back translation: Consider English and French. Given a forward translation model $P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z_{\mathrm{en}})$ and a backward translation model $P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}})$, we generate $\tilde{z}$ as follows:

反向翻译:考虑英语和法语。给定前向翻译模型 $P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z_{\mathrm{en}})$ 和后向翻译模型 $P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}})$,我们按如下方式生成 $\tilde{z}$:

$$
\tilde{z}=\underset{z_{\mathrm{en}}}{\arg\operatorname*{max}}\left(P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}}^{*})\right)
$$

$$
\tilde{z}=\underset{z_{\mathrm{en}}}{\arg\operatorname*{max}}\left(P_{\mathrm{fr}\rightarrow\mathrm{en}}(z_{\mathrm{en}}|z_{\mathrm{fr}}^{*})\right)
$$

where $z_{\mathrm{fr}}^{*}=\arg\operatorname*{max}_ {z_{\mathrm{fr}}}\left(P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)\right)$. For the translations, we use a Transformer model (Vaswani et al., 2017), trained on English-German with the tensor2tensor framework.7

其中 $z_{\mathrm{fr}}^{*}=\arg\operatorname*{max}_ {z_{\mathrm{fr}}}\left(P_{\mathrm{en}\rightarrow\mathrm{fr}}(z_{\mathrm{fr}}|z)\right)$。对于翻译任务,我们使用基于 tensor2tensor 框架训练的 Transformer 模型 (Vaswani et al., 2017),该模型在英语-德语语料上进行训练。

Word dropping: Given a synthetic example $(z,\tilde{z})$, we generate a pair $(z,\tilde{z}^{\prime})$ by randomly dropping words from $\tilde{z}$. We draw the number of words to drop uniformly, up to the length of the sentence. We apply this transformation to about 30% of the data generated with the previous method.

词丢弃:给定一个合成示例 $(z,\tilde{z})$,我们通过从 $\tilde{z}$ 中随机丢弃单词来生成一对 $(z,\tilde{z}^{\prime})$。丢弃的单词数量均匀抽取,最多不超过句子长度。我们对前一种方法生成的数据中约 30% 应用此变换。
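A minimal sketch of the word-dropping perturbation, assuming whitespace tokenization:

```python
import random

def drop_words(sentence, rng):
    """Drop k random words, with k drawn uniformly between 0 and the
    sentence length; applied to ~30% of the generated examples."""
    tokens = sentence.split()
    k = rng.randint(0, len(tokens))
    kept = sorted(rng.sample(range(len(tokens)), len(tokens) - k))
    return " ".join(tokens[i] for i in kept)

rng = random.Random(0)
print(drop_words("the cat sat on the mat", rng))
```

Keeping the surviving indices sorted preserves the original word order, so the perturbed sentence is always a subsequence of the original.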

A.2 Pre-Training Tasks

A.2 预训练任务

We now provide additional details on the signals we used for pre-training.

我们现在提供关于预训练所用信号的更多细节。

Automatic Metrics: As shown in the table, we use three types of signals: BLEU, ROUGE, and BERTscore. For BLEU, we used the original Moses SENTENCE BLEU 8 implementation, using the Moses tokenizer and the default parameters. For ROUGE, we used the seq2seq implementation of ROUGE-N.9 We used a custom implementation of BERTSCORE, based on BERT-large uncased. ROUGE and BERTscore return three scores: precision, recall, and F-score. We use all three quantities.

自动指标:如表所示,我们使用了三种信号:BLEU、ROUGE 和 BERTscore。对于 BLEU,我们采用原始 Moses SENTENCE BLEU 8 实现,使用 Moses tokenizer 和默认参数。对于 ROUGE,我们采用 seq2seq 实现的 ROUGE-N。我们基于 BERT-large uncased 实现了自定义的 BERTSCORE。ROUGE 和 BERTscore 返回三个分数:精确率、召回率和 F 值。我们使用了全部三项指标。

Back translation Likelihood: We compute all the losses using a custom Transformer model (Vaswani et al., 2017), trained on two language pairs (English-French and English-German) with the tensor2tensor framework.

回译似然度:我们使用定制的Transformer模型 (Vaswani et al., 2017) 计算所有损失,该模型基于tensor2tensor框架在两种语言对(英法、英德)上进行训练。

Normalization: All the regression labels are normalized before training.

归一化:所有回归标签在训练前都经过归一化处理。

A.3 Modeling

A.3 建模

Setting the weights of the pre-training tasks: We set the weights $\gamma_{k}$ with grid search, optimizing BLEURT’s performance on WMT 17’s validation set. To reduce the size of the grid, we make groups of pre-training tasks that share the same weights: $(\tau_{\mathrm{BLEU}},\tau_{\mathrm{ROUGE}},\tau_{\mathrm{BERTscore}})$, $\left(\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}\right)$, and $(\tau_{\mathrm{entail}},\tau_{\mathrm{backtran\_flag}})$.

设定预训练任务的权重:我们通过网格搜索设置权重 $\gamma_{k}$ ,优化 BLEURT 在 WMT 17 验证集上的性能。为缩小网格规模,我们将共享相同权重的预训练任务分组:$(\tau_{\mathrm{BLEU}},\tau_{\mathrm{ROUGE}},\tau_{\mathrm{BERTscore}})$、$\left(\tau_{\mathrm{en-fr},z|\tilde{z}},\tau_{\mathrm{en-fr},\tilde{z}|z},\tau_{\mathrm{en-de},z|\tilde{z}},\tau_{\mathrm{en-de},\tilde{z}|z}\right)$ 以及 $(\tau_{\mathrm{entail}},\tau_{\mathrm{backtran\_flag}})$。
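The grouped weighting can be sketched as a weighted sum of per-task losses with one shared weight per group. The group and task names below are shorthand for illustration, not the paper's identifiers:

```python
# One shared weight per group keeps the grid search small:
# 3 weights to tune instead of 9 individual task weights.
GROUPS = {
    "metrics": ["bleu", "rouge", "bertscore"],
    "backtrans": ["en_fr_fwd", "en_fr_bwd", "en_de_fwd", "en_de_bwd"],
    "entail": ["entail", "backtran_flag"],
}

def aggregate_loss(task_losses, group_weights):
    """Weighted sum of per-task pre-training losses, one gamma per group."""
    return sum(
        group_weights[g] * sum(task_losses[t] for t in tasks)
        for g, tasks in GROUPS.items()
    )

losses = {t: 1.0 for tasks in GROUPS.values() for t in tasks}
print(aggregate_loss(losses, {"metrics": 1.0, "backtrans": 0.5, "entail": 2.0}))  # -> 9.0
```

Grid search then iterates over candidate values for the three group weights, keeping the combination that maximizes validation performance.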

B Experiments–Supplementary Material

B 实验–补充材料

B.1 Training Setup for All Experiments

B.1 所有实验的训练设置

We use BERT’s public checkpoints10 with Adam (the default optimizer), learning rate 1e-5, and batch size 32. Unless specified otherwise, we use 800,000 training steps for pre-training and 40,000 steps for fine-tuning. We run training and evaluation in parallel: we run the evaluation every 1,500 steps and store the checkpoint that performs best on a held-out validation set (more details on the data splits and our choice of metrics in the following sections). We use Google Cloud TPUs v2 for learning, and Nvidia Tesla V100 accelerators for evaluation and test. Our code uses Tensorflow 1.15 and Python 2.7.

我们使用BERT公开的检查点[10],采用默认优化器Adam,学习率为1e-5,批量大小为32。除非另有说明,预训练采用800,000步,微调采用40,000步。训练与评估并行执行:每1,500步进行一次评估,并保存在校验集上表现最佳的检查点(数据划分及指标选择详见后续章节)。训练使用Google Cloud TPU v2,评估与测试使用Nvidia Tesla V100加速器。代码基于TensorFlow 1.15和Python 2.7实现。
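The checkpoint selection described above amounts to keeping the checkpoint with the best held-out score among the periodic evaluations. A sketch of that bookkeeping, with a toy validation curve:

```python
def select_best_checkpoint(validation_curve):
    """Given (step, validation_score) pairs produced every 1,500 steps,
    return the checkpoint with the best held-out score."""
    return max(validation_curve, key=lambda pair: pair[1])

curve = [(1500, 0.30), (3000, 0.34), (4500, 0.33)]
print(select_best_checkpoint(curve))  # -> (3000, 0.34)
```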

B.2 WMT Metric Shared Task

B.2 WMT 指标共享任务

Metrics. The metrics used to compare the evaluation systems vary across the years. The organizers use Pearson’s correlation on standardized human judgments across all segments in 2017, and a custom variant of Kendall’s Tau named “DARR” on raw human judgments in 2018 and 2019. The latter metric operates as follows. The organizers gather all the translations for the same reference segment, they enumerate all the possible pairs (translation 1, translation 2), and they discard all the pairs which have a “similar” score (less than 25 points away on a 100-point scale). For each remaining pair, they then determine which translation is the best according to both human judgment and the candidate metric. Let Concordant be the number of pairs on which the metric agrees with the human judgment and Discordant be those on which they disagree; the score is then computed as follows:

指标。用于比较评估系统的指标每年各不相同。2017年主办方对所有段落标准化人工评分采用皮尔逊相关系数 (Pearson's correlation) ,2018和2019年则对原始人工评分使用名为"DARR"的肯德尔Tau (Kendall's Tau) 定制变体。后者的计算方式如下:主办方收集同一参考段落的所有译文,枚举所有可能的译文对 (translation 1, translation 2) ,并剔除得分"相近"的配对 (在100分制下相差小于25分) 。对每个剩余配对,根据人工评判和候选指标确定更优译文。设Concordant为NLG指标判断一致的配对数量,Discordant为不一致数量,则得分计算公式如下:

$$
\frac{|{\mathrm{Concordant}}|-|{\mathrm{Discordant}}|}{|{\mathrm{Concordant}}|+|{\mathrm{Discordant}}|}
$$

$$
\frac{|{\mathrm{Concordant}}|-|{\mathrm{Discordant}}|}{|{\mathrm{Concordant}}|+|{\mathrm{Discordant}}|}
$$

The idea behind the 25 points filter is to make the evaluation more robust, since the judgments collected for WMT 2018 and 2019 are noisy. Kendall’s Tau is identical, but it does not use the filter.

25点过滤器的设计初衷是使评估更具鲁棒性,因为WMT 2018和2019收集的评判数据存在噪声。Kendall's Tau的计算方式与之相同,只是不使用该过滤器。
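Concretely, the DARR computation can be sketched as follows (human scores on a 100-point scale; `threshold=25` implements the similarity filter, and setting it to 0 recovers plain Kendall's Tau):

```python
from itertools import combinations

def darr(human, metric, threshold=25.0):
    """Kendall-style agreement after discarding pairs whose human scores
    differ by less than `threshold` on a 100-point scale."""
    concordant = discordant = 0
    for (h1, m1), (h2, m2) in combinations(zip(human, metric), 2):
        if abs(h1 - h2) < threshold:
            continue  # "similar" pair, filtered out
        if (h1 - h2) * (m1 - m2) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

human = [90.0, 80.0, 40.0, 10.0]
metric = [0.9, 0.95, 0.4, 0.1]  # swaps the two top systems
print(darr(human, metric))  # the swapped pair is filtered -> 1.0
```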


Figure 5: Improvement in Kendall Tau accuracy on all language pairs of the WMT Metrics Shared Task 2017, varying the number of pre-training steps. 0 steps corresponds to 0.555 Kendall Tau for BLEURTbase and 0.580 for BLEURT.

图 5: WMT 2017度量标准共享任务中所有语言对的Kendall Tau准确率提升情况,随预训练步数变化。0步对应BLEURTbase的0.555 Kendall Tau和BLEURT的0.580。

Training setup. To separate training and validation data, we set aside a fixed ratio of records in such a way that there is no “leak” between the datasets (i.e., no train and validation records that share the same source). We use 10% of the data for validation for years 2017 and 2018, and 5% for year 2019. We report results for the models that yield the highest Kendall Tau across all records on validation data. The weights associated to each pre-training task (see our Modeling section) are set with grid search, using the train/validation setup of WMT 2017.

训练设置。为了区分训练数据和验证数据,我们按照固定比例预留部分记录,确保数据集之间不存在"泄漏"(即训练集和验证集中不存在同源记录)。对于2017和2018年的数据,我们使用10%作为验证集,2019年数据则使用5%验证集。我们报告在验证数据上获得最高Kendall Tau系数的模型结果。各预训练任务(参见建模章节)的权重通过网格搜索确定,采用WMT 2017的训练/验证设置方案。

Baselines. we use three metrics: the Moses implementation of sentence BLEU,11 BERTscore,12 and MoverScore,13 which are all available online. We run the Moses tokenizer on the reference and candidate segments before computing sentence BLEU.

基线方法。我们采用三个指标:Moses实现的句子BLEU[11]、BERTscore[12]以及MoverScore[13],这些指标均可在线获取。在计算句子BLEU前,我们对参考文本和候选文本执行了Moses分词器处理。

B.3 Robustness to Quality Drift

B.3 对质量漂移的鲁棒性

Data Re-sampling Methodology: We sample the training and test sets separately, as follows. We split the data in 10 bins of equal size. We then sample each record in the dataset with probabilities $\frac{1}{B^{\alpha}}$ and $\frac{1}{(11-B)^{\alpha}}$ for train and test respectively, where $B$ is the bin index of the record between 1 and 10, and $\alpha$ is a predefined skew factor. The skew factor $\alpha$ controls the drift: a value of 0 has no effect (the ratings are centered around 0), and a value of 3.0 yields extreme differences. Note that the sizes of the datasets decrease as $\alpha$ increases: we use 50.7%, 30.3%, 20.4%, and 11.9% of the original 5,344 training records for $\alpha=0.5$, 1.0, 1.5, and 3.0 respectively.

数据重采样方法:我们分别对训练集和测试集进行采样,具体如下。将数据划分为10个大小相等的区间,然后按概率 $\frac{1}{B^{\alpha}}$ 和 $\frac{1}{(11-B)^{\alpha}}$ 分别对训练集和测试集中的每条记录进行采样,其中 $B$ 是记录所在区间的索引(1到10),$\alpha$ 是预定义的偏斜因子。偏斜因子 $\alpha$ 控制漂移程度:值为0时无影响(评分集中在0附近),值为3.0时会产生极大差异。需注意数据集规模随 $\alpha$ 增大而减小:当 $\alpha=0.5$、1.0、1.5和3.0时,我们分别使用原始5,344条训练记录的 50.7%、30.3%、20.4% 和 11.9%。
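A sketch of the re-sampling scheme. Assigning bins by rating rank is an assumption for illustration; the text only specifies 10 equal-size bins:

```python
import random

def skewed_sample(records, ratings, alpha, rng, for_test=False):
    """Sub-sample records by rating bin. Bin B runs from 1 (lowest-rated)
    to 10 (highest-rated): train keeps low bins with probability 1/B**alpha,
    test keeps high bins with probability 1/(11 - B)**alpha."""
    # Assumption: bins of equal size are assigned by rating rank.
    order = sorted(range(len(records)), key=lambda i: ratings[i])
    bins = {idx: 1 + (rank * 10) // len(records) for rank, idx in enumerate(order)}
    sampled = []
    for i, rec in enumerate(records):
        b = bins[i]
        p = 1.0 / (11 - b) ** alpha if for_test else 1.0 / b ** alpha
        if rng.random() < p:
            sampled.append(rec)
    return sampled

rng = random.Random(0)
train = skewed_sample(list(range(100)), [float(i) for i in range(100)], 1.5, rng)
print(len(train))  # shrinks as alpha grows
```

With $\alpha=0$ every probability is 1 and the whole dataset is kept; large $\alpha$ concentrates the training sample on the lowest-rated bins, reproducing the drift between train and test.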

B.4 Ablation Experiment–How Much Pre-Training Time is Necessary?

B.4 消融实验——需要多少预训练时间?

To understand the relationship between pretraining time and downstream accuracy, we pretrain several versions of BLEURT and we fine-tune them on WMT17 data, varying the number of pretraining steps. Figure 5 presents the results. Most gains are obtained during the first 400,000 steps, that is, after about 2 epochs over our synthetic dataset.

为了理解预训练时间与下游准确率之间的关系,我们预训练了多个版本的BLEURT模型,并在WMT17数据上对其进行微调,同时改变预训练步数。图5展示了结果。大部分性能提升发生在前40万步,即在我们合成数据集上约2个周期后。
