[论文翻译]探索统一的文本到文本 Transformer (Text-to-Text Transformer) 的迁移学习极限


原文地址:https://jmlr.org/papers/volume21/20-074/20-074.pdf


Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

探索统一的文本到文本 Transformer (Text-to-Text Transformer) 的迁移学习极限

Colin Raffel∗ Noam Shazeer∗ Adam Roberts∗ Katherine Lee∗ Sharan Narang Michael Matena Yanqi Zhou Wei Li Peter J. Liu

craffel@gmail.com noam@google.com adarob@google.com katherinelee@google.com sharannarang@google.com mmatena@google.com yanqiz@google.com mweili@google.com peterjliu@google.com

Google, Mountain View, CA 94043, USA

谷歌,山景城,加利福尼亚州 94043,美国

Editor: Ivan Titov

编辑:Ivan Titov

Abstract

摘要

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

迁移学习,即模型首先在一个数据丰富的任务上进行预训练,然后再在下游任务上进行微调,已成为自然语言处理 (NLP) 中的一种强大技术。迁移学习的有效性催生了多种方法、方法论和实践。在本文中,我们通过引入一个统一的框架,将所有基于文本的语言问题转换为文本到文本格式,来探讨 NLP 中的迁移学习技术。我们的系统研究比较了数十个语言理解任务上的预训练目标、架构、未标注数据集、迁移方法和其他因素。通过结合我们在探索中获得的见解与规模以及我们新的“Colossal Clean Crawled Corpus”,我们在涵盖摘要、问答、文本分类等众多基准测试中取得了最先进的结果。为了促进未来在 NLP 的迁移学习方面的工作,我们发布了我们的数据集、预训练模型和代码。

Keywords: transfer learning, natural language processing, multi-task learning, attention-based models, deep learning

关键词:迁移学习,自然语言处理,多任务学习,基于注意力的模型 (attention-based models),深度学习

1. Introduction

1. 引言

Training a machine learning model to perform natural language processing (NLP) tasks often requires that the model can process text in a way that is amenable to downstream learning. This can be loosely viewed as developing general-purpose knowledge that allows the model to “understand” text. This knowledge can range from low-level (e.g. the spelling or meaning of words) to high-level (e.g. that a tuba is too large to fit in most backpacks). In modern machine learning practice, providing this knowledge is rarely done explicitly; instead, it is often learned as part of an auxiliary task. For example, a historically common approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map word identities to a continuous representation where, ideally, similar words map to similar vectors. These vectors are often learned through an objective that, for example, encourages co-occurring words to be positioned nearby in the continuous space (Mikolov et al., 2013b).

训练机器学习模型以执行自然语言处理 (NLP) 任务通常要求该模型能够以有利于下游学习的方式处理文本。这可以宽松地视为开发使模型能够“理解”文本的通用知识。这种知识可以从低级(例如单词的拼写或含义)到高级(例如大号 (tuba) 太大而无法放入大多数背包中)。在现代机器学习实践中,提供这种知识很少是显式的;相反,它通常是作为辅助任务的一部分而被学习到的。例如,一种历史上常见的方法是使用词向量 (Mikolov et al., 2013b,a; Pennington et al., 2014) 将词映射到连续表示,在理想情况下,相似的词会映射到相似的向量。这些向量通常是通过一个目标函数学习的,例如鼓励共现词在连续空间中彼此靠近 (Mikolov et al., 2013b)。

Recently, it has become increasingly common to pre-train the entire model on a data-rich task. Ideally, this pre-training causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. In applications of transfer learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014), pre-training is typically done via supervised learning on a large labeled data set like ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data. This approach has recently been used to obtain state-of-the-art results in many of the most common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training for NLP is particularly attractive because unlabeled text data is available en masse thanks to the Internet—for example, the Common Crawl project $^2$ produces about 20TB of text data extracted from web pages each month. This is a natural fit for neural networks, which have been shown to exhibit remarkable scalability, i.e. it is often possible to achieve better performance simply by training a larger model on a larger data set (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).

最近,越来越多的做法是在数据丰富的任务上预训练整个模型。理想情况下,这种预训练使模型能够发展出通用的能力和知识,这些能力与知识可以转移到下游任务中。在计算机视觉领域的迁移学习应用中 (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014),预训练通常通过在大型标注数据集(如 ImageNet (Russakovsky et al., 2015; Deng et al., 2009))上的监督学习完成。相比之下,现代 NLP 领域的迁移学习技术经常使用无监督学习在未标注的数据上进行预训练。这种方法最近被用于在许多常见的 NLP 基准测试中取得最先进的结果 (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019)。除了其实证优势外,NLP 的无监督预训练特别有吸引力,因为互联网提供了大量未标注的文本数据——例如,Common Crawl 项目 $^2$ 每月生成约 20TB 从网页中提取的文本数据。这非常适合神经网络,研究表明神经网络表现出显著的可扩展性,即通常可以通过在更大的数据集上训练更大的模型来获得更好的性能 (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a)。

This synergy has resulted in a great deal of recent work developing transfer learning methodology for NLP, which has produced a wide landscape of pre-training objectives (Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al., 2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018; Houlsby et al., 2019; Peters et al., 2019), and more. The rapid rate of progress and diversity of techniques in this burgeoning field can make it difficult to compare different algorithms, tease apart the effects of new contributions, and understand the space of existing methods for transfer learning. Motivated by a need for more rigorous understanding, we leverage a unified approach to transfer learning that allows us to systematically study different approaches and push the current limits of the field.

这种协同效应促使近期涌现了大量开发 NLP 迁移学习方法的工作,产生了种类繁多的预训练目标 (Howard 和 Ruder, 2018; Devlin 等, 2018; Yang 等, 2019; Dong 等, 2019)、无标签数据集 (Yang 等, 2019; Liu 等, 2019c; Zellers 等, 2019)、基准测试 (Wang 等, 2019b, 2018; Conneau 和 Kiela, 2018)、微调方法 (Howard 和 Ruder, 2018; Houlsby 等, 2019; Peters 等, 2019) 等等。这个新兴领域中技术进步的快速步伐和多样性使得比较不同算法、区分新贡献的效果以及理解现有迁移学习方法的空间变得困难。出于对更严谨理解的需求,我们采用了一种统一的迁移学习方法,使我们能够系统地研究不同的方法并推动该领域的现有极限。

The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output. This approach is inspired by previous unifying frameworks for NLP tasks, including casting all text problems as question answering (McCann et al., 2018), language modeling (Radford et al., 2019), or span extraction Keskar et al. (2019b) tasks. Crucially, the text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider. We leverage this flexibility by evaluating performance on a wide variety of English-based NLP problems, including question answering, document

我们工作的基本思想是将每个文本处理问题视为“文本到文本 (text-to-text)”问题,即以文本作为输入并生成新的文本作为输出。这种方法受到之前统一的自然语言处理任务框架的启发,包括将所有文本问题转化为问答任务 (McCann et al., 2018),语言模型任务 (Radford et al., 2019),或片段抽取任务 Keskar et al. (2019b)。关键在于,文本到文本框架使我们能够直接应用相同的模型、目标函数、训练过程和解码过程来处理每一个考虑的任务。我们利用这种灵活性,在广泛的基于英语的自然语言处理问题上评估性能,包括问答、文档

Figure 1: A diagram of our text-to-text framework. Every task we consider—including translation, question answering, and classification—is cast as feeding our model text as input and training it to generate some target text. This allows us to use the same model, loss function, hyperparameters, etc. across our diverse set of tasks. It also provides a standard testbed for the methods included in our empirical survey. “T5” refers to our model, which we dub the “Text-to-Text Transfer Transformer”.

图 1: 我们的文本到文本框架的示意图。我们考虑的每一项任务——包括翻译、问答和分类——都被视为将文本输入我们的模型并训练它生成目标文本。这使我们能够跨多个不同的任务使用相同的模型、损失函数、超参数等。这也为我们的实证调查中包含的方法提供了一个标准测试平台。“T5”是指我们的模型,我们称之为“文本到文本迁移 Transformer (Text-to-Text Transfer Transformer)”。

summarization, and sentiment classification, to name a few. With this unified approach, we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.

摘要、情感分类等。通过这种统一的方法,我们可以比较不同的迁移学习目标、未标注数据集和其他因素的有效性,同时通过将模型和数据集的规模扩大到超出以往的范围,来探索迁移学习在自然语言处理 (NLP) 中的极限。

We emphasize that our goal is not to propose new methods but instead to provide a comprehensive perspective on where the field stands. As such, our work primarily comprises a survey, exploration, and empirical comparison of existing techniques. We also explore the limits of current approaches by scaling up the insights from our systematic study (training models up to 11 billion parameters) to obtain state-of-the-art results in many of the tasks we consider. In order to perform experiments at this scale, we introduce the “Colossal Clean Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text scraped from the web. Recognizing that the main utility of transfer learning is the possibility of leveraging pre-trained models in data-scarce settings, we release our code, data sets, and pre-trained models.

我们强调,我们的目标不是提出新方法,而是提供一个全面的视角来审视该领域目前所处的位置。因此,我们的工作主要由现有技术的综述、探索和实证比较组成。我们还通过扩大系统研究中的见解(训练模型最多达 110 亿个参数)来探索当前方法的极限,以在我们考虑的许多任务中获得最先进的结果。为了进行这种规模的实验,我们引入了“Colossal Clean Crawled Corpus” (C4),这是一个包含数百 GB 清洁英文文本的数据集,这些文本是从网络上抓取的。认识到迁移学习的主要优势在于能够在数据稀缺的情况下利用预训练模型,我们发布了代码、数据集和预训练模型。

The remainder of the paper is structured as follows: In the following section, we discuss our base model and its implementation, our procedure for formulating every text processing problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a large set of experiments that explore the field of transfer learning for NLP. At the end of the section (Section 3.7), we combine insights from our systematic study to obtain state-of-the-art results on a wide variety of benchmarks. Finally, we provide a summary of our results and wrap up with a look towards the future in Section 4.

本文其余部分结构如下:在接下来的部分,我们讨论了我们的基础模型及其实现,我们将每个文本处理问题表述为文本到文本任务的方法,以及我们考虑的任务套件。在第 3 节中,我们展示了一组大规模实验,探索了 NLP 领域的迁移学习。在本节末尾(第 3.7 节),我们结合系统研究的见解,在各种基准测试中获得了最先进的结果。最后,在第 4 节中,我们总结了研究结果,并展望未来。

2. Setup

2. 设置

Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on. We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data. We refer to our model and framework as the “Text-to-Text Transfer Transformer” (T5).

在呈现我们大规模实证研究的结果之前,我们回顾了理解这些结果所需的背景知识,包括 Transformer 模型架构以及我们用于评估的下游任务。我们还介绍了将每个问题视为文本到文本任务的方法,并描述了我们创建的基于 Common Crawl 的未标注文本数据集“Colossal Clean Crawled Corpus” (C4)。我们将模型和框架称为“Text-to-Text Transfer Transformer” (T5)。

2.1. Model

2.1. 模型

Early results on transfer learning for NLP leveraged recurrent neural networks (Peters et al., 2018; Howard and Ruder, 2018), but it has recently become more common to use models based on the “Transformer” architecture (Vaswani et al., 2017). The Transformer was initially shown to be effective for machine translation, but it has subsequently been used in a wide variety of NLP settings (Radford et al., 2018; Devlin et al., 2018; McCann et al., 2018; Yu et al., 2018). Due to its increasing ubiquity, all of the models we study are based on the Transformer architecture. Apart from the details mentioned below and the variants we explore in Section 3.2, we do not deviate significantly from this architecture as originally proposed. Instead of providing a comprehensive definition of this model, we refer the interested reader to the original paper (Vaswani et al., 2017) or follow-up tutorials $^{3,4}$ for a more detailed introduction.

早期的迁移学习在自然语言处理 (NLP) 领域利用了循环神经网络 (Peters et al., 2018; Howard and Ruder, 2018),但最近更常见的是使用基于 “Transformer” 架构的模型 (Vaswani et al., 2017)。Transformer 最初被证明在机器翻译方面有效,但随后已被应用于各种 NLP 场景 (Radford et al., 2018; Devlin et al., 2018; McCann et al., 2018; Yu et al., 2018)。由于其日益普及,我们研究的所有模型都基于 Transformer 架构。除了下面提到的细节和我们在第 3.2 节中探讨的变体外,我们并没有显著偏离最初提出的架构。我们不提供该模型的全面定义,感兴趣的读者可以参考原始论文 (Vaswani et al., 2017) 或后续教程 $^{3,4}$ 以获得更详细的介绍。

The primary building block of the Transformer is self-attention (Cheng et al., 2016). Self-attention is a variant of attention (Graves, 2013; Bahdanau et al., 2015) that processes a sequence by replacing each element by a weighted average of the rest of the sequence. The original Transformer consisted of an encoder-decoder architecture and was intended for sequence-to-sequence (Sutskever et al., 2014; Kalchbrenner et al., 2014) tasks. It has recently also become common to use models consisting of a single Transformer layer stack, with varying forms of self-attention used to produce architectures appropriate for language modeling (Radford et al., 2018; Al-Rfou et al., 2019) or classification and span prediction tasks (Devlin et al., 2018; Yang et al., 2019). We empirically explore these architectural variants in Section 3.2.

Transformer 的主要构建模块是自注意力 (Cheng et al., 2016)。自注意力是注意力机制 (Graves, 2013; Bahdanau et al., 2015) 的一种变体,它通过用序列中其他元素的加权平均值替换每个元素来处理序列。最初的 Transformer 包含编码器-解码器架构,并旨在用于序列到序列 (Sutskever et al., 2014; Kalchbrenner et al., 2014) 任务。最近,也变得常见的是使用由单个 Transformer 层堆栈组成的模型,采用不同形式的自注意力来生成适用于语言建模 (Radford et al., 2018; Al-Rfou et al., 2019) 或分类和跨度预测任务 (Devlin et al., 2018; Yang et al., 2019) 的架构。我们在第 3.2 节中对这些架构变体进行了实证探索。

Overall, our encoder-decoder Transformer implementation closely follows its originally-proposed form (Vaswani et al., 2017). First, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of “blocks”, each of which comprises two subcomponents: a self-attention layer followed by a small feed-forward network. Layer normalization (Ba et al., 2016) is applied to the input of each subcomponent. We use a simplified version of layer normalization where the activations are only rescaled and no additive bias is applied. After layer normalization, a residual skip connection (He et al., 2016) adds each subcomponent’s input to its output. Dropout (Srivastava et al., 2014) is applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack. The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs. The output of the final decoder block is fed into a dense layer with a softmax output, whose weights are shared with the input embedding matrix. All attention mechanisms in the Transformer are split up into independent “heads” whose outputs are concatenated before being further processed.

总体而言,我们的编码器-解码器 Transformer 实现紧密遵循其最初提出的形态 (Vaswani et al., 2017)。首先,输入的 Token 序列被映射到嵌入序列,然后传递给编码器。编码器由多个“块”的堆栈组成,每个块包含两个子组件:一个自注意力层后接一个小的前馈网络。对每个子组件的输入应用层归一化 (Ba et al., 2016)。我们使用了一种简化的层归一化版本,其中激活值仅被重新缩放,而没有添加偏置。在层归一化之后,残差跳跃连接 (He et al., 2016) 将每个子组件的输入加到其输出上。在前馈网络、跳跃连接、注意力权重以及整个堆栈的输入和输出中应用了 Dropout (Srivastava et al., 2014)。解码器在结构上与编码器类似,不同之处在于它在每个自注意力层之后包含一个标准的注意力机制,该机制关注编码器的输出。解码器中的自注意力机制还使用了一种自回归或因果自注意力的形式,只允许模型关注过去的输出。最终解码器块的输出被送入一个带有 softmax 输出的全连接层,其权重与输入嵌入矩阵共享。Transformer 中的所有注意力机制都被拆分为独立的“头”,其输出在进一步处理之前被拼接在一起。
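下面给出一个示意性的 NumPy 片段,用来说明上文描述的两个细节:仅做重新缩放、不加偏置的简化层归一化,以及“层归一化作用于子组件输入、残差连接把输入加回输出”的子层结构。其中的函数名与实现细节均为假设的简化写法,并非官方实现。

```python
import numpy as np

def simplified_layer_norm(x, scale, eps=1e-6):
    """简化层归一化(示意):只按均方根重新缩放激活值,不做均值中心化,也不加偏置。"""
    variance = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * scale

def sublayer(x, f, scale, dropout_rate=0.1, training=False):
    """示意的子层结构:先对输入做层归一化,经过子组件 f,再通过残差连接加回原输入。"""
    y = f(simplified_layer_norm(x, scale))
    if training:
        # dropout 作用在残差分支的输出上(文中还在注意力权重、前馈网络内部等处使用 dropout)
        keep = (np.random.rand(*y.shape) >= dropout_rate) / (1.0 - dropout_rate)
        y = y * keep
    return x + y

# 用法示例:把一个任意的前馈子组件套进该结构
d_model = 8
x = np.random.randn(4, d_model)
out = sublayer(x, lambda h: np.maximum(h, 0.0), scale=np.ones(d_model))
```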

Since self-attention is order-independent (i.e. it is an operation on sets), it is common to provide an explicit position signal to the Transformer. While the original Transformer used a sinusoidal position signal or learned position embeddings, it has recently become more common to use relative position embeddings (Shaw et al., 2018; Huang et al., 2018a). Instead of using a fixed embedding for each position, relative position embeddings produce a different learned embedding according to the offset between the “key” and “query” being compared in the self-attention mechanism. We use a simplified form of position embeddings where each “embedding” is simply a scalar that is added to the corresponding logit used for computing the attention weights. For efficiency, we also share the position embedding parameters across all layers in our model, though within a given layer each attention head uses a different learned position embedding. Typically, a fixed number of embeddings are learned, each corresponding to a range of possible key-query offsets. In this work, we use 32 embeddings for all of our models with ranges that increase in size logarithmically up to an offset of 128 beyond which we assign all relative positions to the same embedding. Note that a given layer is insensitive to relative position beyond 128 tokens, but subsequent layers can build a sensitivity to larger offsets by combining local information from previous layers. To summarize, our model is roughly equivalent to the original Transformer proposed by Vaswani et al. (2017) with the exception of removing the Layer Norm bias, placing the layer normalization outside the residual path, and using a different position embedding scheme. Since these architectural changes are orthogonal to the experimental factors we consider in our empirical survey of transfer learning, we leave the ablation of their impact for future work.

由于自注意力机制与顺序无关(即它是对集合的操作),通常会向 Transformer 提供一个显式的位置信号。虽然原始的 Transformer 使用了正弦位置信号或学习到的位置嵌入,但最近更常见的是使用相对位置嵌入 (Shaw et al., 2018; Huang et al., 2018a)。不同于为每个位置使用固定的嵌入,相对位置嵌入根据自注意力机制中比较的 “key” 和 “query” 之间的偏移生成不同的学习嵌入。我们使用了一种简化的位置嵌入形式,其中每个“嵌入”只是一个标量,加到用于计算注意力权重的相应 logit 上。为了提高效率,我们在模型的所有层之间共享位置嵌入参数,尽管在给定层中每个注意力头使用不同的学习到的位置嵌入。通常,会学习固定数量的嵌入,每个对应一个可能的 key-query 偏移范围。在这项工作中,我们为所有模型使用 32 个嵌入,其范围以对数方式增大,直到 128 的偏移量,超过该偏移量后我们将所有相对位置分配给同一个嵌入。请注意,给定层对超过 128 个 Token 的相对位置不敏感,但后续层可以通过组合前面各层的局部信息来建立对更大偏移量的敏感性。总结来说,我们的模型大致等同于 Vaswani 等人 (2017) 提出的原始 Transformer,区别在于移除了 Layer Norm 偏置、将层归一化置于残差路径之外,并使用了不同的位置嵌入方案。由于这些架构更改与我们在迁移学习实证调查中考虑的实验因素正交,我们将其影响的消融分析留待未来工作。
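作为补充,下面是一个示意性的桶化函数,说明“桶数量固定、按对数间隔增大、超过 128 的偏移共用同一嵌入”的相对位置方案大致如何实现;具体的分桶边界以及只处理偏移绝对值的简化均为假设,并非论文的精确实现。每个桶对应一个可学习的标量,加到相应的注意力 logit 上。

```python
import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """示意:把 key 与 query 的相对偏移映射为 [0, num_buckets) 中的桶编号(简化版)。"""
    distance = abs(int(offset))
    half = num_buckets // 2
    if distance < half:
        return distance                      # 近距离:每个偏移各占一个桶
    if distance >= max_distance:
        return num_buckets - 1               # 超过 max_distance 的偏移共用最后一个桶
    # 中间范围:按对数刻度把 [half, max_distance) 压缩到剩余的桶里
    log_ratio = math.log(distance / half) / math.log(max_distance / half)
    return half + int(log_ratio * (num_buckets - 1 - half))

print([relative_position_bucket(d) for d in (0, 5, 20, 80, 200)])
```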

As part of our study, we experiment with the scalability of these models, i.e. how their performance changes as they are made to have more parameters or layers. Training large models can be non-trivial since they might not fit on a single machine and require a great deal of computation. As a result, we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods.5 TPU Pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. We leverage the Mesh TensorFlow library (Shazeer et al., 2018) for ease of implementation of both model parallelism and data parallelism (Krizhevsky, 2014).

在我们的研究中,我们实验了这些模型的可扩展性,即当它们拥有更多参数或层数时性能如何变化。训练大模型可能并非易事,因为它们可能无法适应单个机器,并且需要大量的计算资源。因此,我们使用模型并行和数据并行的组合,在 Cloud TPU Pods 的“切片”上训练模型。TPU Pods 是多机架 ML 超级计算机,包含 1,024 个通过高速 2D 网格互连连接的 TPU v3 芯片,并配有支持的 CPU 主机。我们利用 Mesh TensorFlow 库 (Shazeer et al., 2018) 来简化模型并行和数据并行 (Krizhevsky, 2014) 的实现。

2.2. The Colossal Clean Crawled Corpus

2.2. 巨大的清洁爬取语料库 (Colossal Clean Crawled Corpus)

Much of the previous work on transfer learning for NLP makes use of large unlabeled data sets for unsupervised learning. In this paper, we are interested in measuring the effect of the quality, characteristics, and size of this unlabeled data. To generate data sets that satisfy our needs, we leverage Common Crawl as a source of text scraped from the web.

之前关于 NLP 迁移学习的许多工作都利用了大型未标注数据集进行无监督学习。在本文中,我们关注的是测量这些未标注数据的质量、特征和规模的影响。为了生成满足我们需求的数据集,我们利用 Common Crawl 作为从网络上抓取的文本来源。

Common Crawl has previously been used as a source of text data for NLP, for example to train an n-gram language model (Buck et al., 2014), as training data for commonsense reasoning (Trinh and Le, 2018), for mining parallel texts for machine translation (Smith et al., 2013), as a pre-training data set (Grave et al., 2018; Zellers et al., 2019; Liu et al., 2019c), and even simply as a giant text corpus for testing optimizers (Anil et al., 2019).

Common Crawl 之前已被用作 NLP 的文本数据来源,例如用于训练 n 元语言模型 (Buck et al., 2014),作为常识推理的训练数据 (Trinh 和 Le, 2018),用于挖掘机器翻译的平行文本 (Smith et al., 2013),作为预训练数据集 (Grave et al., 2018; Zellers et al., 2019; Liu et al., 2019c),甚至仅仅作为一个巨大的文本语料库来测试优化器 (Anil et al., 2019)。

Common Crawl is a publicly-available web archive that provides “web extracted text” by removing markup and other non-text content from the scraped HTML files. This process produces around 20TB of scraped text data each month. Unfortunately, the majority of the resulting text is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:

Common Crawl 是一个公开的网页档案,它通过从抓取的 HTML 文件中移除标记和其他非文本内容,提供“网页提取文本”。此过程每月产生约 20TB 的抓取文本数据。不幸的是,大部分生成的文本并不是自然语言。相反,它主要由乱码或模板文本组成,如菜单、错误消息或重复文本。此外,相当一部分抓取的文本包含对我们考虑的任何任务都不太可能有帮助的内容(冒犯性语言、占位符文本、源代码等)。为了解决这些问题,我们使用了以下启发式方法来清理 Common Crawl 的网页提取文本:

Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect to filter out any pages that were not classified as English with a probability of at least 0.99. Our heuristics are inspired by past work on using Common Crawl as a source of data for NLP: For example, Grave et al. (2018) also filter text using an automatic language detector and discard short lines and Smith et al. (2013); Grave et al. (2018) both perform line-level deduplication. However, we opted to create a new data set because prior data sets use a more limited set of filtering heuristics, are not publicly available, and/or are different in scope (e.g. are limited to News data (Zellers et al., 2019; Liu et al., 2019c), comprise only Creative Commons content (Habernal et al., 2016), or are focused on parallel training data for machine translation (Smith et al., 2013)).

此外,由于我们的大多数下游任务都集中在英文文本上,我们使用了 langdetect 来过滤掉任何未被识别为英语(概率至少为 0.99)的页面。我们的启发式方法借鉴了过去使用 Common Crawl 作为 NLP 数据来源的工作:例如,Grave 等 (2018) 也使用自动语言检测器过滤文本并丢弃短行;Smith 等 (2013) 和 Grave 等 (2018) 都执行行级去重。然而,我们选择创建一个新的数据集,因为先前的数据集使用的过滤启发式方法更为有限、不公开可用,和/或范围不同(例如,仅限于新闻数据 (Zellers 等, 2019; Liu 等, 2019c),仅包含知识共享 (Creative Commons) 内容 (Habernal 等, 2016),或者专注于机器翻译的平行训练数据 (Smith 等, 2013))。
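下面是一个示意性的过滤函数,展示如何用 langdetect 实现“仅保留以至少 0.99 的概率被判定为英语的页面”这一条启发式;阈值取自上文,异常处理方式为假设,并非 C4 流水线的原始代码。

```python
from langdetect import detect_langs

def keep_english(page_text, threshold=0.99):
    """示意:页面被 langdetect 判定为英语且概率不低于 threshold 时保留,否则丢弃。"""
    try:
        candidates = detect_langs(page_text)   # 返回带概率的语言候选列表
    except Exception:
        return False                           # 文本过短或无法判定时直接丢弃(假设的处理方式)
    return any(c.lang == "en" and c.prob >= threshold for c in candidates)

print(keep_english("Thank you for inviting me to your party last week."))
```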

To assemble our base data set, we downloaded the web extracted text from April 2019 and applied the aforementioned filtering. This produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text. We dub this data set the “Colossal Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets.8 We consider the impact of using various alternative versions of this data set in Section 3.4.

为了组装我们的基础数据集,我们下载了2019年4月的网页提取文本,并应用了上述过滤。这产生了一个不仅比大多数用于预训练的数据集大几个数量级(约750 GB),而且还包含相当干净和自然的英文文本的集合。我们将这个数据集命名为“Colossal Clean Crawled Corpus”(简称 C4),并作为 TensorFlow Datasets 的一部分发布。8 我们在第3.4节中考虑了使用该数据集的各种替代版本的影响。

2.3. Downstream Tasks

2.3. 下游任务

Our goal in this paper is to measure general language learning abilities. As such, we study downstream performance on a diverse set of benchmarks, including machine translation, question answering, abstractive summarization, and text classification. Specifically, we measure performance on the GLUE and SuperGLUE text classification meta-benchmarks; CNN/Daily Mail abstractive summarization; SQuAD question answering; and WMT English to German, French, and Romanian translation. All data was sourced from TensorFlow Datasets. $^{9}$

我们在这篇论文中的目标是测量通用语言学习能力。因此,我们研究了在多个不同基准测试上的下游性能,包括机器翻译、问答、抽象式摘要 (abstractive summarization) 和文本分类。具体来说,我们在 GLUE 和 SuperGLUE 文本分类元基准上测量性能;CNN/Daily Mail 抽象式摘要;SQuAD 问答;以及 WMT 英语到德语、法语和罗马尼亚语的翻译。所有数据均来自 TensorFlow Datasets。 $^{9}$

GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019b) each comprise a collection of text classification tasks meant to test general language understanding abilities:

GLUE (Wang et al., 2018) 和 SuperGLUE (Wang et al., 2019b) 各自包含一系列文本分类任务,旨在测试通用语言理解能力:

We use the data sets as distributed by the GLUE and SuperGLUE benchmarks. For simplicity, when fine-tuning we treat all of the tasks in the GLUE benchmark (and similarly for SuperGLUE) as a single task by concatenating all of the constituent data sets. As suggested by Kocijan et al. (2019) we also include the Definite Pronoun Resolution (DPR) data set (Rahman and Ng, 2012) in the combined SuperGLUE task.

我们使用由 GLUE 和 SuperGLUE 基准分发的数据集。为简化起见,在微调时,我们将 GLUE 基准中的所有任务(以及类似的 SuperGLUE)视为单个任务,通过连接所有组成的数据集来处理。根据 Kocijan 等人 (2019) 的建议,我们还将在合并的 SuperGLUE 任务中包含确定性代词消解 (DPR) 数据集 (Rahman 和 Ng, 2012)。

The CNN/Daily Mail (Hermann et al., 2015) data set was introduced as a question-answering task but was adapted for text summarization by Nallapati et al. (2016); we use the non-anonymized version from See et al. (2017) as an abstractive summarization task. SQuAD (Rajpurkar et al., 2016) is a common question-answering benchmark. In our experiments, the model is fed the question and its context and asked to generate the answer token-by-token. For WMT English to German, we use the same training data as (Vaswani et al., 2017) (i.e. News Commentary v13, Common Crawl, Europarl v7) and newstest2013 as a validation set (Bojar et al., 2014). For English to French, we use the standard training data from 2015 and newstest2014 as a validation set (Bojar et al., 2015). For English to Romanian, which is a standard lower-resource machine translation benchmark, we use the train and validation sets from WMT 2016 (Bojar et al., 2016). Note that we only pre-train on English data, so in order to learn to translate a given model will need to learn to generate text in a new language.

CNN/Daily Mail (Hermann 等, 2015) 数据集最初是作为问答任务引入的,但被 Nallapati 等 (2016) 改编为文本摘要任务;我们使用 See 等 (2017) 提供的非匿名版本作为抽象式摘要任务。SQuAD (Rajpurkar 等, 2016) 是一个常见的问答基准。在我们的实验中,模型接收问题及其上下文,并被要求逐个 Token 生成答案。对于 WMT 英语到德语,我们使用与 (Vaswani 等, 2017) 相同的训练数据(即 News Commentary v13、Common Crawl、Europarl v7),并使用 newstest2013 作为验证集 (Bojar 等, 2014)。对于英语到法语,我们使用 2015 年的标准训练数据,并使用 newstest2014 作为验证集 (Bojar 等, 2015)。对于英语到罗马尼亚语,这是一个标准的低资源机器翻译基准,我们使用 WMT 2016 (Bojar 等, 2016) 的训练和验证集。请注意,我们仅在英语数据上进行预训练,因此为了学会翻译,给定模型需要学会生成新语言的文本。

2.4. Input and Output Format

2.4. 输入和输出格式

In order to train a single model on the diverse set of tasks described above, we cast all of the tasks we consider into a “text-to-text” format—that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This framework provides a consistent training objective both for pre-training and fine-tuning. Specifically, the model is trained with a maximum likelihood objective (using “teacher forcing” (Williams and Zipser, 1989)) regardless of the task. To specify which task the model should perform, we add a task-specific (text) prefix to the original input sequence before feeding it to the model.

为了在一个模型上训练上述多样化的任务集,我们将所有考虑的任务转换为“文本到文本”格式——即,模型接收一些文本作为上下文或条件,然后被要求生成一些输出文本。这个框架为预训练和微调提供了统一的训练目标。具体来说,无论任务如何,模型都是使用最大似然目标(采用“教师强制” (Williams and Zipser, 1989))进行训练的。为了指定模型应执行的任务,我们在原始输入序列前添加一个特定于任务的(文本)前缀,然后再将其输入模型。

As an example, to ask the model to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.” For text classification tasks, the model simply predicts a single word corresponding to the target label. For example, on the MNLI benchmark (Williams et al., 2017) the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. With our preprocessing, the input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word “entailment”. Note that an issue arises if our model outputs text on a text classification task that does not correspond to any of the possible labels (for example if the model outputs “hamburger” when the only possible labels for a task were “entailment”, “neutral”, or “contradiction”). In this case, we always count the model’s output as wrong, though we never observed this behavior in any of our trained models. Note that the choice of text prefix used for a given task is essentially a hyper parameter; we found that changing the exact wording of the prefix had limited impact and so did not perform extensive experiments into different prefix choices. A diagram of our text-to-text framework with a few input/output examples is shown in Figure 1. We provide full examples of pre processed inputs for every task we studied in Appendix D.

例如,要求模型将句子 “That is good.” 从英语翻译成德语时,模型会接收序列 “translate English to German: That is good.”,并被训练输出 “Das ist gut.”。对于文本分类任务,模型只需预测一个对应目标标签的单词。例如,在 MNLI 基准测试 (Williams et al., 2017) 中,目标是预测前提与假设之间的关系是蕴含 (“entailment”)、矛盾 (“contradiction”) 还是中立 (“neutral”)。经过我们的预处理,输入序列变为 “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.”,对应的目标词为 “entailment”。请注意,如果模型在文本分类任务中输出的文本不对应任何可能的标签(例如,当唯一可能的标签为 “entailment”、“neutral” 或 “contradiction” 时,模型输出了 “hamburger”),就会出现问题。在这种情况下,我们始终将模型的输出视为错误,尽管我们在任何训练好的模型中从未观察到这种行为。还需注意,给定任务使用的文本前缀选择本质上是一个超参数;我们发现改变前缀的确切措辞影响有限,因此没有对不同的前缀选择进行广泛的实验。图 1 展示了我们文本到文本框架的示意图以及若干输入/输出示例。我们在附录 D 中提供了我们研究的每个任务的预处理输入的完整示例。
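下面用一个很小的 Python语言 片段示意这种“任务前缀 + 原始输入 → 目标文本”的改写方式;前缀措辞沿用上文的例子,函数名与样本字段名均为假设,仅作说明。

```python
def to_text_to_text(task, example):
    """示意:把不同任务的样本统一改写成 (输入文本, 目标文本) 对。"""
    if task == "translate_en_de":
        return ("translate English to German: " + example["en"], example["de"])
    if task == "mnli":
        source = "mnli premise: {} hypothesis: {}".format(
            example["premise"], example["hypothesis"])
        return (source, example["label"])  # 目标是 "entailment" / "neutral" / "contradiction" 之一
    raise ValueError("未覆盖的任务: " + task)

# 用法示例,对应正文中的翻译例子
print(to_text_to_text("translate_en_de", {"en": "That is good.", "de": "Das ist gut."}))
```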


Our text-to-text framework follows previous work that casts multiple NLP tasks into a common format: McCann et al. (2018) propose the “Natural Language Decathlon”, a benchmark that uses a consistent question-answering format for a suite of ten NLP tasks. The Natural Language Decathlon also stipulates that all models must be multi-task, i.e. are able to simultaneously tackle all of the tasks at once. We instead allow for separately fine-tuning the model on each individual task and use short task prefixes instead of an explicit question-answer format. Radford et al. (2019) evaluate the zero-shot learning capabilities of language models by feeding some input to the model as a prefix and then autoregressively sampling an output. For example, automatic summarization is done by feeding in a document followed by the text “TL;DR:” (short for “too long, didn’t read”, a common abbreviation) and then the summary is predicted via autoregressive decoding. We mainly consider models that explicitly process an input with an encoder before generating an output with a separate decoder and we focus on transfer learning rather than zero-shot learning. Finally, Keskar et al. (2019b) unify many NLP tasks as “span extraction”, where text corresponding to possible output choices are appended to the input and the model is trained to extract the input span corresponding to the correct choice. In contrast, our framework also allows for generative tasks like machine translation and abstractive summarization where it is not possible to enumerate all possible output choices.

我们的文本到文本框架遵循先前将多个 NLP 任务转换为通用格式的工作:McCann 等 (2018) 提出了“自然语言十项全能 (Natural Language Decathlon)”,这是一个为一组十个 NLP 任务使用统一问答格式的基准。自然语言十项全能还规定所有模型必须是多任务的,即能够同时处理所有任务。相比之下,我们允许对每个单独的任务分别微调,并使用简短的任务前缀而不是显式的问答格式。Radford 等 (2019) 通过将一些输入作为前缀喂给模型,然后自回归地采样输出来评估语言模型的零样本学习能力。例如,自动摘要的做法是输入文档后接文本 “TL;DR:”(“too long, didn't read” 的常见缩写,意为“太长,没读”),然后通过自回归解码预测摘要。我们主要考虑那些先用编码器显式处理输入、再用单独的解码器生成输出的模型,并且我们专注于迁移学习而非零样本学习。最后,Keskar 等 (2019b) 将许多 NLP 任务统一为“片段提取”,即把可能的输出选项对应的文本附加到输入上,并训练模型提取与正确选项对应的输入片段。相比之下,我们的框架还允许机器翻译和抽象式摘要这类无法枚举所有可能输出选项的生成式任务。

We were able to straightforwardly cast all of the tasks we considered into a text-to-text format with the exception of STS-B, which is a regression task where the goal is to predict a similarity score between 1 and 5. We found that most of these scores were annotated in increments of 0.2, so we simply rounded any score to the nearest increment of 0.2 and converted the result to a literal string representation of the number (e.g. the floating-point value 2.57 would be mapped to the string “2.6”). At test time, if the model outputs a string corresponding to a number between 1 and 5, we convert it to a floating-point value; otherwise, we treat the model’s prediction as incorrect. This effectively recasts the STS-B regression problem as a 21-class classification problem.

我们能够将所有考虑的任务直接转换为文本到文本的格式,例外的是 STS-B,这是一个回归任务,目标是预测 1 到 5 之间的相似度分数。我们发现这些分数大多以 0.2 的增量进行标注,因此我们将任何分数四舍五入到最接近的 0.2 增量,并将结果转换为数字的字面字符串表示(例如,浮点值 2.57 将映射为字符串 “2.6”)。在测试时,如果模型输出一个介于 1 和 5 之间的数字字符串,我们将其转换为浮点值;否则,我们将模型的预测视为不正确。这实际上将 STS-B 回归问题重新定义为一个 21 类分类问题。
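下面的示意代码展示这种取整与字符串化的做法,以及测试时把模型输出转回浮点数、无法解析或超出范围则记为错误的处理方式;函数名为假设,仅作说明。

```python
def stsb_score_to_text(score):
    """示意:把 STS-B 分数四舍五入到最近的 0.2 增量,并转成字符串目标。"""
    return "{:.1f}".format(round(score * 5) / 5)

def stsb_text_to_score(prediction):
    """示意:测试时把模型输出解析回 1~5 之间的浮点数,解析失败或越界返回 None(视为错误)。"""
    try:
        value = float(prediction)
    except ValueError:
        return None
    return value if 1.0 <= value <= 5.0 else None

print(stsb_score_to_text(2.57))   # "2.6",对应正文中的例子
print(stsb_text_to_score("2.6"))  # 2.6
```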

Separately, we also convert the Winograd tasks (WNLI from GLUE, WSC from SuperGLUE, and the DPR data set we add to SuperGLUE) into a simpler format that is more amenable to the text-to-text framework. Examples from the Winograd tasks consist of a text passage containing an ambiguous pronoun that could refer to more than one of the noun phrases in the passage. For example, the passage might be “The city councilmen refused the demonstrators a permit because they feared violence.”, which contains the ambiguous pronoun “they” that could refer to “city councilmen” or “demonstrators”. We cast the WNLI, WSC, and DPR tasks as text-to-text problems by highlighting the ambiguous pronoun in the text passage and asking the model to predict the noun that it refers to. The example mentioned above would be transformed to the input “The city councilmen refused the demonstrators a permit because *they* feared violence.” and the model would be trained to predict the target text “The city councilmen”.

我们还分别将 Winograd 任务(来自 GLUE 的 WNLI、来自 SuperGLUE 的 WSC,以及我们添加到 SuperGLUE 的 DPR 数据集)转换为更简单、更适合文本到文本框架的格式。Winograd 任务的示例由包含歧义代词的文本段落组成,该代词可能指代段落中多个名词短语之一。例如,段落可能是 “The city councilmen refused the demonstrators a permit because they feared violence.”,其中包含歧义代词 “they”,它可以指代 “city councilmen” 或 “demonstrators”。我们通过在文本段落中突出显示歧义代词并要求模型预测其指代的名词,把 WNLI、WSC 和 DPR 任务转化为文本到文本问题。上述示例将被转换为输入 “The city councilmen refused the demonstrators a permit because *they* feared violence.”,模型将被训练以预测目标文本 “The city councilmen”。

For WSC, examples contain the passage, the ambiguous pronoun, a candidate noun, and a True/False label reflecting whether the candidate matches the pronoun (ignoring any articles). We only train on examples with a “True” label since we do not know the correct noun targets for examples with a “False” label. For evaluation, we assign a “True” label if the words in the model’s output are a subset of the words in the candidate noun phrase (or vice versa) and assign a “False” label otherwise. This removes roughly half of the WSC training set, but the DPR data set adds about 1,000 pronoun resolution examples. Examples from DPR are annotated with the correct referent noun, making it easy to use this data set in the format listed above.

对于 WSC,示例包含段落、歧义代词、一个候选名词,以及一个反映候选名词是否与代词匹配(忽略任何冠词)的 True/False 标签。我们只在带有 “True” 标签的示例上进行训练,因为我们不知道带有 “False” 标签的示例的正确名词目标。在评估时,如果模型输出中的单词是候选名词短语中单词的子集(或反之亦然),则判为 “True”,否则判为 “False”。这大约去除了 WSC 训练集的一半,但 DPR 数据集增加了大约 1,000 个代词消解示例。DPR 的示例标注有正确的指代名词,因此很容易按上述格式使用该数据集。
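上述“词集合互为子集则判 True”的评测规则可以用下面的示意函数表达;大小写处理等细节为假设的简化。

```python
def wsc_match(model_output, candidate_noun):
    """示意:若模型输出的词集合是候选名词短语词集合的子集(或反之),判为 True,否则为 False。"""
    predicted = set(model_output.lower().split())
    candidate = set(candidate_noun.lower().split())
    return predicted.issubset(candidate) or candidate.issubset(predicted)

print(wsc_match("The city councilmen", "city councilmen"))  # True(假设的例子)
```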

The WNLI training and validation sets have a significant overlap with the WSC training set. To avoid leaking validation examples into our training data (a particular issue in the multi-task experiments of Section 3.5.2), we therefore never train on WNLI and never report results on the WNLI validation set. Omitting results on the WNLI validation set is standard practice (Devlin et al., 2018) due to the fact that it is “adversarial” with respect to the training set, i.e. validation examples are all slightly-perturbed versions of training examples with the opposite label. As such, we do not include WNLI in the average GLUE score whenever we report on the validation set (all sections except Section 3.7 where results are presented on the test sets). Converting examples from WNLI to the “referent noun prediction” variant described above is a little more involved; we describe this process in Appendix B.

WNLI 的训练集和验证集与 WSC 训练集有显著重叠。为了避免验证样本泄露到训练数据中(这在第 3.5.2 节的多任务实验中尤其成问题),我们因此从不在 WNLI 上训练,也从不报告 WNLI 验证集上的结果。省略 WNLI 验证集上的结果是标准做法 (Devlin et al., 2018),因为它相对于训练集是“对抗性”的,即验证样本都是训练样本的轻微扰动版本,且标签相反。因此,在报告验证集结果时(除在测试集上呈现结果的第 3.7 节外的所有章节),我们的平均 GLUE 分数均不包含 WNLI。将 WNLI 的样本转换为上述“指代名词预测”变体稍微复杂一些;我们在附录 B 中描述了这个过程。

3. Experiments

3. 实验

Recent advances in transfer learning for NLP have come from a wide variety of developments, such as new pre-training objectives, model architectures, unlabeled data sets, and more. In this section, we carry out an empirical survey of these techniques in hopes of teasing apart their contribution and significance. We then combine the insights gained to attain state-of-the-art in many of the tasks we consider. Since transfer learning for NLP is a rapidly growing area of research, it is not feasible for us to cover every possible technique or idea in our empirical study. For a broader literature review, we recommend a recent survey by Ruder et al. (2019).

近期在自然语言处理 (NLP) 中的迁移学习进展来自于各种各样的发展,例如新的预训练目标、模型架构、无标签数据集等。在本节中,我们对这些技术进行了实证调查,以期能够区分它们的贡献和重要性。然后我们将获得的见解结合起来,在我们考虑的许多任务中达到最先进水平。由于 NLP 的迁移学习是一个快速发展的研究领域,我们无法在实证研究中涵盖每一种可能的技术或想法。对于更广泛的文献综述,我们推荐 Ruder 等人 (2019) 的最新调查。

We systematically study these contributions by taking a reasonable baseline (described in Section 3.1) and altering one aspect of the setup at a time. For example, in Section 3.3 we measure the performance of different unsupervised objectives while keeping the rest of our experimental pipeline fixed. This “coordinate ascent” approach might miss second-order effects (for example, some particular unsupervised objective may work best on a model larger than our baseline setting), but performing a combinatorial exploration of all of the factors in our study would be prohibitively expensive. In future work, we expect it could be fruitful to more thoroughly consider combinations of the approaches we study.

我们系统地研究这些贡献,通过采用一个合理的基线(描述在第 3.1 节)并每次改变设置的一个方面。例如,在第 3.3 节中,我们在保持其余实验流程固定的情况下测量不同无监督目标的性能。这种“坐标上升”方法可能会忽略二阶效应(例如,某些特定的无监督目标可能在一个比我们基线设置更大的模型上表现最佳),但对我们研究中的所有因素进行组合探索将过于昂贵。在未来的工作中,我们预计更彻底地考虑我们研究的方法组合可能会有成效。

Our goal is to compare a variety of different approaches on a diverse set of tasks while keeping as many factors fixed as possible. In order to satisfy this aim, in some cases we do not exactly replicate existing approaches. For example, “encoder-only” models like BERT (Devlin et al., 2018) are designed to produce a single prediction per input token or a single prediction for an entire input sequence. This makes them applicable for classification or span prediction tasks but not for generative tasks like translation or abstractive summarization. As such, none of the model architectures we consider are identical to BERT or consist of an encoder-only structure. Instead, we test approaches that are similar in spirit—for example, we consider an analogous objective to BERT’s “masked language modeling” objective in Section 3.3 and we consider a model architecture that behaves similarly to BERT on text classification tasks in Section 3.2.

我们的目标是在尽可能多的任务上比较各种不同的方法,同时保持尽可能多的因素不变。为了实现这一目标,在某些情况下我们不会完全复制现有的方法。例如,“仅编码器”模型如 BERT (Devlin et al., 2018) 被设计为对每个输入 Token 或整个输入序列生成单个预测。这使得它们适用于分类或跨度预测任务,但不适用于翻译或摘要生成等生成式任务。因此,我们考虑的模型架构中没有一个与 BERT 完全相同或仅由编码器结构组成。相反,我们测试了类似的方法——例如,在第 3.3 节中,我们考虑了一个类似于 BERT 的“掩码语言建模 (masked language modeling)”目标,并在第 3.2 节中考虑了一个在文本分类任务上表现类似于 BERT 的模型架构。

After outlining our baseline experimental setup in the following subsection, we undertake an empirical comparison of model architectures (Section 3.2), unsupervised objectives (Section 3.3), pre-training data sets (Section 3.4), transfer approaches (Section 3.5), and scaling (Section 3.6). At the culmination of this section, we combine insights from our study with scale to obtain state-of-the-art results in many tasks we consider (Section 3.7).

在下一小节中概述了我们的基线实验设置后,我们对模型架构(第 3.2 节)、无监督目标(第 3.3 节)、预训练数据集(第 3.4 节)、迁移方法(第 3.5 节)和扩展(第 3.6 节)进行了实证比较。在本节的最后,我们将研究中的见解与规模相结合,在我们考虑的许多任务中取得了最先进的结果(第 3.7 节)。

3.1. Baseline

3.1. 基线模型 (Baseline)

Our goal for our baseline is to reflect typical, modern practice. We pre-train a standard Transformer (described in Section 2.1) using a simple denoising objective and then separately fine-tune on each of our downstream tasks. We describe the details of this experimental setup in the following subsections.

我们的基线目标是反映典型、现代的实践。我们使用一个简单的去噪目标预训练一个标准的 Transformer(见第 2.1 节),然后分别在每个下游任务上进行微调。我们在以下小节中描述此实验设置的详细信息。

3.1.1. Model

3.1.1. 模型


For our model, we use a standard encoder-decoder Transformer as proposed by Vaswani et al. (2017). While many modern approaches to transfer learning for NLP use a Transformer architecture consisting of only a single “stack” (e.g. for language modeling (Radford et al., 2018; Dong et al., 2019) or classification and span prediction (Devlin et al., 2018; Yang et al., 2019)), we found that using a standard encoder-decoder structure achieved good results on both generative and classification tasks. We explore the performance of different model architectures in Section 3.2.

对于我们的模型,我们使用了由 Vaswani 等人 (2017) 提出的标准编码器-解码器 Transformer。虽然许多现代自然语言处理的迁移学习方法使用仅包含单个“堆栈”的 Transformer 架构(例如,用于语言建模 (Radford 等,2018;Dong 等,2019) 或分类和跨度预测 (Devlin 等,2018;Yang 等,2019)),我们发现使用标准的编码器-解码器结构在生成式和分类任务上都取得了良好的结果。我们在第 3.2 节中探讨了不同模型架构的性能。

Our baseline model is designed so that the encoder and decoder are each similar in size and configuration to a “BERTBASE” (Devlin et al., 2018) stack. Specifically, both the encoder and decoder consist of 12 blocks (each block comprising self-attention, optional encoder-decoder attention, and a feed-forward network). The feed-forward networks in each block consist of a dense layer with an output dimensionality of $d_{\mathrm{ff}}=3072$ followed by a ReLU non linearity and another dense layer. The “key” and “value” matrices of all attention mechanisms have an inner dimensionality of $d_{\mathrm{kv}}=64$ and all attention mechanisms have 12 heads. All other sub-layers and embeddings have a dimensionality of $d_{\mathrm{model}}=768$ . In total, this results in a model with about 220 million parameters. This is roughly twice the number of parameters of BERTBASE since our baseline model contains two layer stacks instead of one. For regularization, we use a dropout probability of 0.1 everywhere dropout is applied in the model.

我们的基线模型设计为编码器和解码器在大小和配置上各自类似于“BERTBASE” (Devlin et al., 2018) 的堆栈。具体来说,编码器和解码器均由 12 个块组成(每个块包含自注意力机制、可选的编码器-解码器注意力机制和前馈网络)。每个块中的前馈网络由一个输出维度为 $d_{\mathrm{ff}}=3072$ 的全连接层组成,后面跟着 ReLU 非线性层和另一个全连接层。所有注意力机制的“key”和“value”矩阵的内部维度为 $d_{\mathrm{kv}}=64$ ,并且所有注意力机制有 12 个头。所有其他子层和嵌入的维度为 $d_{\mathrm{model}}=768$ 。总计,这导致了一个约有 2.2 亿参数的模型。这大约是 BERTBASE 参数数量的两倍,因为我们的基线模型包含两个层堆栈而不是一个。为了正则化,我们在模型中应用 dropout 的地方使用了 0.1 的 dropout 概率。
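按上文给出的超参数,可以用下面的粗略算术核对“约 2.2 亿参数”这一数字;该估算忽略了层归一化缩放参数和相对位置嵌入等少量参数,仅作示意。

```python
d_model, d_ff, d_kv, num_heads, num_blocks, vocab_size = 768, 3072, 64, 12, 12, 32000

attention = 4 * d_model * (num_heads * d_kv)          # Q、K、V 投影加输出投影
feed_forward = 2 * d_model * d_ff                     # 两个全连接层
encoder = num_blocks * (attention + feed_forward)     # 每块:自注意力 + 前馈网络
decoder = num_blocks * (2 * attention + feed_forward) # 每块多一个编码器-解码器注意力
embedding = vocab_size * d_model                      # 输入嵌入,与输出 softmax 共享权重

total = encoder + decoder + embedding
print(f"约 {total / 1e6:.0f}M 个参数")                 # 大约 223M,与文中“约 2.2 亿”一致
```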

3.1.2. Training

3.1.2. 训练

As described in Section 2.4, all tasks are formulated as text-to-text tasks. This allows us to always train using standard maximum likelihood, i.e. using teacher forcing (Williams and Zipser, 1989) and a cross-entropy loss. For optimization, we use AdaFactor (Shazeer and Stern, 2018). At test time, we use greedy decoding (i.e. choosing the highest-probability logit at every timestep).

如第 2.4 节所述,所有任务都被表述为文本到文本的任务。这使我们能够始终使用标准的最大似然进行训练,即使用教师强制 (teacher forcing) (Williams 和 Zipser, 1989) 和交叉熵损失。对于优化,我们使用 AdaFactor (Shazeer 和 Stern, 2018)。在测试时,我们使用贪婪解码(即在每个时间步选择概率最高的 logit)。

We pre-train each model for $2^{19}=524{,}288$ steps on C4 before fine-tuning. We use a maximum sequence length of 512 and a batch size of 128 sequences. Whenever possible, we “pack” multiple sequences into each entry of the batch $^{10}$ so that our batches contain roughly $2^{16}=65{,}536$ tokens. In total, this batch size and number of steps corresponds to pre-training on $2^{35}\approx34\mathrm{B}$ tokens. This is considerably less than BERT (Devlin et al., 2018), which used roughly 137B tokens, or RoBERTa (Liu et al., 2019c), which used roughly $2.2\mathrm{T}$ tokens. Using only $2^{35}$ tokens results in a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. We consider the effect of pre-training for more steps in Sections 3.6 and 3.7. Note that $2^{35}$ tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training.

我们在 C4 上对每个模型进行预训练,共进行 $2^{19}=524,288$ 步,然后进行微调。我们使用最大序列长度为 512 和批次大小为 128 个序列。在可能的情况下,我们将多个序列“打包”到每个批次的每个条目中 $^{10}$ ,从而使我们的批次包含大约 $2^{16}=65,536$ 个 Token。总计,这个批次大小和步数相当于在 $2^{35}\approx34\mathrm{B}$ 个 Token 上进行预训练。这比 BERT (Devlin et al., 2018) 使用的大约 137B 个 Token 或 RoBERTa (Liu et al., 2019c) 使用的大约 $2.2\mathrm{T}$ 个 Token 要少得多。仅使用 $2^{35}$ 个 Token 可以在合理的计算预算内提供足够的预训练量,以实现可接受的性能。我们在第 3.6 节和第 3.7 节中考虑了增加预训练步数的影响。需要注意的是,$2^{35}$ 个 Token 仅覆盖整个 C4 数据集的一小部分,因此在预训练过程中我们不会重复任何数据。
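文中关于预训练 Token 总量的算术可以直接核对如下(仅为示意的数值检查)。

```python
steps = 2 ** 19              # 524,288 个预训练步
tokens_per_batch = 128 * 512 # 128 条长度 512 的序列,打包后约 2^16 = 65,536 个 Token
total_tokens = steps * tokens_per_batch

print(tokens_per_batch == 2 ** 16)               # True
print(total_tokens == 2 ** 35)                   # True
print(f"约 {total_tokens / 1e9:.1f}B 个 Token")  # 约 34.4B,即文中的 2^35 ≈ 34B
```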

During pre-training, we use an “inverse square root” learning rate schedule: $1/\sqrt{\operatorname*{max}(n,k)}$ where $n$ is the current training iteration and $k$ is the number of warm-up steps (set to $10^{4}$ in all of our experiments). This sets a constant learning rate of 0.01 for the first $10^{4}$ steps, then exponentially decays the learning rate until pre-training is over. We also experimented with using a triangular learning rate (Howard and Ruder, 2018), which produced slightly better results but requires knowing the total number of training steps ahead of time. Since we will be varying the number of training steps in some of our experiments, we opt for the more generic inverse square root schedule.

在预训练期间,我们使用“反平方根”学习率调度:$1/\sqrt{\operatorname*{max}(n,k)}$,其中 $n$ 是当前训练迭代次数,$k$ 是热身步数(在所有实验中设置为 $10^4$)。这在前 $10^4$ 步设置了 0.01 的恒定学习率,然后指数衰减学习率直到预训练结束。我们也尝试了使用三角形学习率 (Howard 和 Ruder, 2018),这产生了稍好的结果,但需要提前知道总的训练步数。由于在某些实验中我们将改变训练步数,因此我们选择了更通用的反平方根调度。
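该调度可以写成如下的简短函数(按文中公式 $1/\sqrt{\max(n, k)}$,$k=10^4$;变量名为假设)。

```python
import math

def inverse_sqrt_lr(step, warmup_steps=10_000):
    """反平方根学习率调度:前 warmup_steps 步为常数 0.01,之后随步数衰减。"""
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(inverse_sqrt_lr(1))        # 0.01
print(inverse_sqrt_lr(100_000))  # 约 0.00316
print(inverse_sqrt_lr(524_288))  # 预训练结束时约 0.00138
```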

Our models are fine-tuned for $2^{18}=262{,}144$ steps on all tasks. This value was chosen as a trade-off between the high-resource tasks (i.e. those with large data sets), which benefit from additional fine-tuning, and low-resource tasks (smaller data sets), which overfit quickly. During fine-tuning, we continue using batches with 128 length-512 sequences (i.e. $2^{16}$ tokens per batch). We use a constant learning rate of 0.001 when fine-tuning. We save a checkpoint every 5,000 steps and report results on the model checkpoint corresponding to the highest validation performance. For models fine-tuned on multiple tasks, we choose the best checkpoint for each task independently. For all of the experiments except those in Section 3.7, we report results in the validation set to avoid performing model selection on the test set.

我们的模型在所有任务上微调了 $2^{18}=262,144$ 步。这个值是在高资源任务(即那些具有大数据集的任务)和低资源任务(较小数据集)之间进行权衡后选择的,前者从额外的微调中受益,而后者则会迅速过拟合。在微调期间,我们继续使用包含 128 个长度为 512 的序列的批次(即每批次 $2^{16}$ 个 Token)。微调时我们使用 0.001 的恒定学习率。我们每 5,000 步保存一个检查点,并报告验证性能最高的模型检查点的结果。对于在多个任务上微调的模型,我们为每个任务独立选择最佳检查点。除了第 3.7 节中的实验外,我们在验证集上报告结果,以避免在测试集上进行模型选择。

3.1.3. Vocabulary

3.1.3. 词汇表 (Vocabulary)

We use SentencePiece (Kudo and Richardson, 2018) to encode text as WordPiece tokens (Sennrich et al., 2015; Kudo, 2018). For all experiments, we use a vocabulary of 32,000 wordpieces. Since we ultimately fine-tune our model on English to German, French, and Romanian translation, we also require that our vocabulary covers these non-English languages. To address this, we classified pages from the Common Crawl scrape used in C4 as German, French, and Romanian. Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian. This vocabulary was shared across both the input and output of our model. Note that our vocabulary makes it so that our model can only process a predetermined, fixed set of languages.

我们使用 SentencePiece (Kudo 和 Richardson, 2018) 将文本编码为 WordPiece Token (Sennrich 等, 2015; Kudo, 2018)。对于所有实验,我们使用包含 32,000 个词片 (wordpieces) 的词汇表。由于我们最终在英语到德语、法语和罗马尼亚语的翻译任务上微调模型,因此我们的词汇表还需要覆盖这些非英语语言。为此,我们对 C4 所用的 Common Crawl 抓取数据中的页面进行了分类,识别出德语、法语和罗马尼亚语页面。然后,我们在由 10 份英语 C4 数据与各 1 份被分类为德语、法语或罗马尼亚语的数据组成的混合数据上训练了 SentencePiece 模型。该词汇表在模型的输入和输出之间共享。请注意,我们的词汇表使得模型只能处理预先确定的固定语言集合。
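下面给出一个示意性的 SentencePiece 训练与使用片段;其中的输入文件名、混合文本的构造方式以及训练参数均为假设,仅用来说明“在 10:1:1:1 的英、德、法、罗混合文本上训练 32,000 词片词表”这一流程。

```python
import sentencepiece as spm

# 假设 mixture.txt 已按 10 份英语 C4 文本、各 1 份德语/法语/罗马尼亚语文本采样拼接而成
spm.SentencePieceTrainer.train(
    input="mixture.txt",
    model_prefix="t5_vocab",
    vocab_size=32000,
)

sp = spm.SentencePieceProcessor(model_file="t5_vocab.model")
print(sp.encode("translate English to German: That is good.", out_type=str))
```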

3.1.4. Unsupervised Objective

3.1.4. 无监督目标

Leveraging unlabeled data to pre-train our model necessitates an objective that does not require labels but (loosely speaking) teaches the model generalizable knowledge that will be useful in downstream tasks. Preliminary work that applied the transfer learning paradigm of pre-training and fine-tuning all of the model’s parameters to NLP problems used a causal language modeling objective for pre-training (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018). However, it has recently been shown that “denoising” objectives (Devlin et al., 2018; Taylor, 1953) (also called “masked language modeling”) produce better performance and as a result they have quickly become standard. In a denoising objective, the model is trained to predict missing or otherwise corrupted tokens in the input. Inspired by BERT’s “masked language modeling” objective and the

利用未标注数据对模型进行预训练需要一个不需要标签的目标函数,但(粗略地说)要教会模型可泛化的知识,这些知识在下游任务中将是有用的。早期的工作将迁移学习范式中的预训练和微调所有模型参数应用于自然语言处理 (NLP) 问题时,使用了因果语言建模目标函数进行预训练 (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018)。然而,最近的研究表明,“去噪”目标函数 (Devlin et al., 2018; Taylor, 1953) (也称为“掩码语言建模”)能够产生更好的性能,因此它们迅速成为标准。在去噪目标函数中,模型被训练来预测输入中缺失或以其他方式损坏的 Token。受 BERT 的“掩码语言建模”目标函数的启发以及


Figure 2: Schematic of the objective we use in our baseline model. In this example, we process the sentence “Thank you for inviting me to your party last week.” The words “for”, “inviting” and “last” (marked with an $\times$ ) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example. Since “for” and “inviting” occur consecutively, they are replaced by a single sentinel <X>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input plus a final sentinel token <Z>.

图 2: 我们在基线模型中使用的目标的示意图。在这个例子中,我们处理句子 “Thank you for inviting me to your party last week.” 单词 “for”、“inviting” 和 “last”(标记为 $\times$)被随机选中进行破坏。每个连续的被破坏 Token 区间被一个哨兵 Token(显示为 <X> 和 <Y>)替换,该哨兵 Token 在整个示例中是唯一的。由于 “for” 和 “inviting” 连续出现,它们被单个哨兵 <X> 替换。输出序列由被删除的区间组成,这些区间由输入中用于替换它们的哨兵 Token 分隔,再加上一个最终的哨兵 Token <Z>。

“word dropout” regularization technique (Bowman et al., 2015), we design an objective that randomly samples and then drops out 15% of tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens which are added to our vocabulary and do not correspond to any wordpiece. The target then corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence plus a final sentinel token to mark the end of the target sequence. Our choices to mask consecutive spans of tokens and only predict dropped-out tokens were made to reduce the computational cost of pre-training. We perform thorough investigation into pre-training objectives in Section 3.3. An example of the transformation resulting from applying this objective is shown in Figure 2. We empirically compare this objective to many other variants in Section 3.3.

“词 dropout”正则化技术 (Bowman et al., 2015) 的启发,我们设计了一个目标函数:随机采样并丢弃输入序列中 15% 的 Token。所有连续的被丢弃 Token 区间都被一个单一的哨兵 Token 替代。每个哨兵 Token 被分配一个对该序列唯一的 Token ID。这些哨兵 ID 是特殊 Token,它们被添加到我们的词汇表中,不对应任何词片。目标序列则由所有被丢弃的 Token 区间组成,用输入序列中使用的相同哨兵 Token 分隔,再加上一个标记目标序列结束的最终哨兵 Token。我们选择遮蔽连续的 Token 区间并仅预测被丢弃的 Token,是为了降低预训练的计算成本。我们在第 3.3 节对预训练目标进行了深入研究。图 2 展示了应用此目标后得到的转换示例。我们在第 3.3 节中将该目标与许多其他变体进行了实证比较。
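下面是该去噪目标的一个示意实现:随机丢弃约 15% 的 Token,把每个连续的被丢弃区间替换为唯一的哨兵 Token,并把这些区间依次拼成目标序列。哨兵写作 `<extra_id_n>` 以及按均匀概率逐 Token 采样的方式均为简化假设,并非官方数据流水线。

```python
import random

def span_corruption(tokens, corruption_rate=0.15, seed=0):
    """示意:构造 (被破坏的输入序列, 目标序列),目标由被丢弃的区间与哨兵 Token 组成。"""
    rng = random.Random(seed)
    dropped = [rng.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if dropped[i]:
            inputs.append(f"<extra_id_{sentinel}>")     # 整个连续区间只用一个哨兵
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and dropped[i]:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")            # 目标末尾的最终哨兵
    return inputs, targets

src = "Thank you for inviting me to your party last week .".split()
print(span_corruption(src))
```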

3.1.5. Baseline Performance

3.1.5. 基准性能

In this section, we present results using the baseline experimental procedure described above to get a sense of what kind of performance to expect on our suite of downstream tasks. Ideally, we would repeat every experiment in our study multiple times to get a confidence interval on our results. Unfortunately, this would be prohibitively expensive due to the large number of experiments we run. As a cheaper alternative, we train our baseline model 10 times from scratch (i.e. with different random initialization s and data set shuffling) and assume that the variance over these runs of the base model also applies to each experimental variant. We don’t expect most of the changes we make to have a dramatic effect on the inter-run variance, so this should provide a reasonable indication of the significance of different changes. Separately, we also measure the performance of training our model for $2^{18}$ steps (the same number we use for fine-tuning) on all downstream tasks without pre-training. This gives us an idea of how much pre-training benefits our model in the baseline setting.

在本节中,我们使用上述基线实验程序展示结果,以了解在我们的下游任务套件上可以预期的性能。理想情况下,我们会重复研究中的每个实验多次以获得结果的置信区间。不幸的是,由于我们运行的实验数量庞大,这将过于昂贵。作为更便宜的替代方案,我们从头开始训练基线模型 10 次(即使用不同的随机初始化和数据集洗牌),并假设这些基线模型运行之间的方差也适用于每个实验变体。我们不期望所做的大多数更改会对运行间方差产生显著影响,因此这应该能合理地反映不同更改的重要性。此外,我们还测量了在所有下游任务上训练模型 $2^{18}$ 步(与我们用于微调的步数相同)而不进行预训练的性能。这使我们了解在基线设置下预训练对模型的好处有多大。

Table 1: Average and standard deviation of scores achieved by our baseline model and training procedure. For comparison, we also report performance when training on each task from scratch (i.e. without any pre-training) for the same number of steps used to fine-tune the baseline model. All scores in this table (and every table in our paper except Table 14) are reported on the validation sets of each data set.

表 1: 基准模型和训练过程所达到分数的平均值和标准差。为了对比,我们也报告了在相同步数下从头开始训练(即没有任何预训练)的性能。本表中的所有分数(以及本文中除表 14 外的所有表格)均在每个数据集的验证集上报告。

| | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ★ 基准平均值 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 基准标准差 | 0.235 | 0.065 | 0.343 | 0.416 | 0.112 | 0.090 | 0.108 |
| 无预训练 | 66.22 | 17.60 | 50.31 | 53.04 | 25.86 | 39.77 | 24.04 |

When reporting results in the main text, we only report a subset of the scores across all the benchmarks to conserve space and ease interpretation. For GLUE and SuperGLUE, we report the average score across all subtasks (as stipulated by the official benchmarks) under the headings “GLUE” and “SGLUE”. For all translation tasks, we report the BLEU score (Papineni et al., 2002) as provided by SacreBLEU v1.3.0 (Post, 2018) with “exp” smoothing and “intl” token iz ation. We refer to scores for WMT English to German, English to French, and English to Romanian as EnDe, EnFr, and EnRo, respectively. For CNN/Daily Mail, we find the performance of models on the ROUGE-1-F, ROUGE-2-F, and ROUGE-L-F metrics (Lin, 2004) to be highly correlated so we report the ROUGE-2-F score alone under the heading “CNNDM”. Similarly, for SQuAD we find the performance of the “exact match” and “F1” scores to be highly correlated so we report the “exact match” score alone. We provide every score achieved on every task for all experiments in Table 16, Appendix E.

在正文报告结果时,我们仅报告所有基准测试中的一部分分数,以节省空间并便于理解。对于 GLUE 和 SuperGLUE,我们在标题为“GLUE”和“SGLUE”的部分下报告所有子任务的平均分数(按照官方基准的规定)。对于所有翻译任务,我们报告由 SacreBLEU v1.3.0 (Post, 2018) 提供的 BLEU 分数 (Papineni et al., 2002),使用“exp”平滑和“intl”Token 化。我们将 WMT 英语到德语、英语到法语和英语到罗马尼亚语的分数分别称为 EnDe、EnFr 和 EnRo。对于 CNN/Daily Mail,我们发现模型在 ROUGE-1-F、ROUGE-2-F 和 ROUGE-L-F 指标 (Lin, 2004) 上的表现高度相关,因此我们仅在标题为“CNNDM”的部分下报告 ROUGE-2-F 分数。同样地,对于 SQuAD,我们发现“完全匹配”和“F1”分数的表现高度相关,因此我们仅报告“完全匹配”分数。我们在附录 E 的表 16 中提供了所有实验中每个任务的所有得分。
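作为参考,下面是用 sacrebleu 的 Python API 计算 BLEU 的一个最小示意。这里假设使用的是较新版本的 sacrebleu(关键字参数名在 v1.3.0 中略有不同,但 “exp” 平滑和 “intl” Token 化的设置含义相同);示例中的句子仅用于演示。

```python
import sacrebleu

hypotheses = ["Das ist gut."]          # 模型输出,每个元素对应一个句子
references = [["Das ist sehr gut."]]   # 参考译文流的列表(这里只有一组参考)

bleu = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    smooth_method="exp",  # 论文报告所用的 "exp" 平滑
    tokenize="intl",      # "intl" Token 化
)
print(bleu.score)
```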

Our results tables are all formatted so that each row corresponds to a particular experimental configuration with columns giving the scores for each benchmark. We will include the mean performance of the baseline configuration in most tables. Wherever a baseline configuration appears, we will mark it with a $\star$ (as in the first row of Table 1). We also will boldface any score that is within two standard deviations of the maximum (best) in a given experiment.

我们的结果表格均格式化为每一行对应一个特定的实验配置,列给出每个基准测试的分数。我们将在大多数表格中包含基线配置的平均性能。每当基线配置出现时,我们将用 $\star$ 标记它(如表 1 的第一行)。我们还会将任何在给定实验中与最大值(最佳)相差不超过两个标准差的分数加粗。
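下面这个小函数示意了这条加粗规则,假设同一列中的所有分数共用一个标准差估计(即由基线的多次运行得到的标准差):

```python
def within_two_std_of_best(scores, std):
    """若某分数与该列最大值相差不超过两个标准差,则返回 True(需要加粗)。"""
    best = max(scores)
    return [s >= best - 2 * std for s in scores]

# 以表 1 的 GLUE 列为例:基准标准差为 0.235
print(within_two_std_of_best([83.28, 66.22], std=0.235))  # [True, False]
```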

Our baseline results are shown in Table 1. Overall, our results are comparable to existing models of similar size. For example, BERTBASE achieved an exact match score of 80.8 on SQuAD and an accuracy of 84.4 on MNLI-matched, whereas we achieve 80.88 and 84.24, respectively (see Table 16). Note that we cannot directly compare our baseline to BERTBASE because ours is an encoder-decoder model and was pre-trained for roughly $1/4$ as many steps. Unsurprisingly, we find that pre-training provides significant gains across almost all benchmarks. The only exception is WMT English to French, which is a large enough data set that gains from pre-training tend to be marginal. We include this task in our experiments to test the behavior of transfer learning in the high-resource regime. Since we perform early stopping by selecting the best-performing checkpoint, the large disparity between our baseline and “no pre-training” emphasizes how much pre-training improves performance on tasks with limited data. While we do not explicitly measure improvements in data efficiency in this paper, we emphasize that this is one of the primary benefits of the transfer learning paradigm.

我们的基线结果如表 1 所示。总体而言,我们的结果与现有类似规模的模型相当。例如,BERTBASE 在 SQuAD 上达到了 80.8 的精确匹配分数和在 MNLI-matched 上达到了 84.4 的准确率,而我们分别达到了 80.88 和 84.24(见表 16)。请注意,我们不能直接将我们的基线与 BERTBASE 进行比较,因为我们的模型是编码器-解码器结构,并且预训练步数大约为 BERTBASE 的 1/4。不出所料,我们发现预训练在几乎所有基准测试中都提供了显著的提升。唯一的例外是 WMT 英语到法语任务,该数据集足够大,以至于预训练带来的增益通常较小。我们在实验中包含此任务是为了测试高资源环境下的迁移学习行为。由于我们通过选择表现最佳的检查点来进行提前停止,因此我们基线与“无预训练”之间的巨大差异强调了预训练在数据有限的任务上对性能的大幅提升。虽然我们在这篇论文中没有明确测量数据效率的改进,但我们强调这是迁移学习范式的主要优势之一。

Figure 3: Matrices representing different attention mask patterns. The input and output of the self-attention mechanism are denoted $x$ and $y$ respectively. A dark cell at row $i$ and column $j$ indicates that the self-attention mechanism is allowed to attend to input element $j$ at output timestep $i$ . A light cell indicates that the self-attention mechanism is not allowed to attend to the corresponding $i$ and $j$ combination. Left: A fully-visible mask allows the self-attention mechanism to attend to the full input at every output timestep. Middle: A causal mask prevents the $i$ th output element from depending on any input elements from “the future”. Right: Causal masking with a prefix allows the self-attention mechanism to use fully-visible masking on a portion of the input sequence.

图 3: 表示不同注意力掩码模式的矩阵。自注意力机制的输入和输出分别表示为 $x$ 和 $y$ 。第 $i$ 行和第 $j$ 列的深色单元格表示自注意力机制允许在输出时间步 $i$ 处关注输入元素 $j$ 。浅色单元格表示自注意力机制不允许关注对应的 $i$ 和 $j$ 组合。左:完全可见的掩码允许自注意力机制在每个输出时间步关注整个输入。中:因果掩码防止第 $i$ 个输出元素依赖于任何来自“未来”的输入元素。右:带前缀的因果掩码允许自注意力机制在输入序列的一部分上使用完全可见的掩码。

As for inter-run variance, we find that for most tasks the standard deviation across runs is smaller than 1% of the task’s baseline score. Exceptions to this rule include CoLA, CB, and COPA, which are all low-resource tasks from the GLUE and SuperGLUE benchmarks. For example, on CB our baseline model had an average F1 score of 91.22 with a standard deviation of 3.237 (see Table 16), which may be partly due to the fact that CB’s validation set contains only 56 examples. Note that the GLUE and SuperGLUE scores are computed as the average of scores across the tasks comprising each benchmark. As a result, we caution that the high inter-run variance of CoLA, CB, and COPA can make it harder to compare models using the GLUE and SuperGLUE scores alone.

对于运行间方差,我们发现对于大多数任务,各次运行的标准差小于任务基线分数的1%。这一规则的例外包括CoLA、CB和COPA,这些任务都是来自GLUE和SuperGLUE基准测试的低资源任务。例如,在CB上,我们的基线模型平均F1分数为91.22,标准差为3.237(见表 16),这可能部分是由于CB的验证集仅包含56个样本。请注意,GLUE和SuperGLUE分数是根据每个基准测试中各任务分数的平均值计算得出的。因此,我们提醒,CoLA、CB和COPA的高运行间方差可能会使得仅使用GLUE和SuperGLUE分数来比较模型变得更加困难。

3.2. Architectures

3.2. 架构

While the Transformer was originally introduced with an encoder-decoder architecture, much modern work on transfer learning for NLP uses alternative architectures. In this section, we review and compare these architectural variants.

虽然 Transformer 最初是以编码器-解码器架构引入的,但目前自然语言处理 (NLP) 领域的很多迁移学习工作使用了替代架构。在本节中,我们将回顾和比较这些架构变体。

3.2.1. Model Structures

3.2.1. 模型结构

A major distinguishing factor for different architectures is the “mask” used by different attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let $y_{i}$ refer to the $i$ th element of the output sequence and $x_{j}$ refer to the $j$ th entry of the input sequence. $y_{i}$ is computed as $\textstyle\sum_{j}w_{i,j}x_{j}$ , where $w_{i,j}$ is the scalar weight produced by the self-attention mechanism as a function of $x_{i}$ and $x_{j}$ . The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output timestep. Diagrams of the masks we will consider are shown in Figure 3. For example, the causal mask (Figure 3, middle) sets any $w_{i,j}$ to zero if $j>i$ .

不同架构的主要区别在于模型中使用的不同注意力机制的“掩码”。回顾一下,Transformer 中的自注意力操作以一个序列作为输入,并输出一个相同长度的新序列。输出序列的每个元素是通过对输入序列的元素进行加权平均计算得出的。具体来说,令 $y_{i}$ 表示输出序列的第 $i$ 个元素,$x_{j}$ 表示输入序列的第 $j$ 个元素。$y_{i}$ 计算为 $\textstyle\sum_{j}w_{i,j}x_{j}$,其中 $w_{i,j}$ 是自注意力机制根据 $x_{i}$ 和 $x_{j}$ 计算出的标量权重。注意力掩码用于将某些权重置零,以限制在给定输出时间步可以关注输入序列中的哪些元素。我们将考虑的掩码图如图 3 所示。例如,因果掩码(图 3,中间)会在 $j > i$ 时将 $w_{i,j}$ 置为零。
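下面用 NumPy 给出图 3 中三种掩码矩阵的示意构造(1 表示允许关注,0 表示被屏蔽),仅用于说明,不对应任何具体实现:

```python
import numpy as np

def fully_visible_mask(n):
    """每个输出时间步都可以关注全部输入。"""
    return np.ones((n, n), dtype=np.int32)

def causal_mask(n):
    """j > i 的位置被置零,即不能关注“未来”。"""
    return np.tril(np.ones((n, n), dtype=np.int32))

def prefix_causal_mask(n, prefix_len):
    """前 prefix_len 个输入对所有输出时间步完全可见,其余部分保持因果掩码。"""
    mask = causal_mask(n)
    mask[:, :prefix_len] = 1
    return mask

print(prefix_causal_mask(5, prefix_len=2))
```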



Figure 4: Schematics of the Transformer architecture variants we consider. In this diagram, blocks represent elements of a sequence and lines represent attention visibility. Different colored groups of blocks indicate different Transformer layer stacks. Dark grey lines correspond to fully-visible masking and light grey lines correspond to causal masking. We use “.” to denote a special end-of-sequence token that represents the end of a prediction. The input and output sequences are represented as $x$ and $y$ respectively. Left: A standard encoder-decoder architecture uses fully-visible masking in the encoder and the encoder-decoder attention, with causal masking in the decoder. Middle: A language model consists of a single Transformer layer stack and is fed the concatenation of the input and target, using a causal mask throughout. Right: Adding a prefix to a language model corresponds to allowing fully-visible masking over the input.

图 4: 我们考虑的 Transformer 架构变体示意图。在此图中,块表示序列的元素,线表示注意力可见性。不同颜色的块组表示不同的 Transformer 层堆栈。深灰色线对应完全可见的掩码,浅灰色线对应因果掩码。我们使用 “.” 表示一个特殊的序列结束 Token ,它代表预测的结束。输入和输出序列分别表示为 $x$ 和 $y$ 。左:标准的编码器 - 解码器架构在编码器和编码器 - 解码器注意力中使用完全可见的掩码,在解码器中使用因果掩码。中:语言模型由单个 Transformer 层堆栈组成,并输入输入和目标的连接,全程使用因果掩码。右:在语言模型中添加前缀对应于允许在输入上进行完全可见的掩码。

The first model structure we consider is an encoder-decoder Transformer, which consists of two layer stacks: The encoder, which is fed an input sequence, and the decoder, which produces a new output sequence. A schematic of this architectural variant is shown in the left panel of Figure 4.

我们考虑的第一个模型结构是一个编码器-解码器 Transformer,它由两个层堆栈组成:编码器,接收输入序列;解码器,生成新的输出序列。这种架构变体的示意图如图 4 左侧所示。

The encoder uses a “fully-visible” attention mask. Fully-visible masking allows a selfattention mechanism to attend to any entry of the input when producing each entry of its output. We visualize this masking pattern in Figure 3, left. This form of masking is appropriate when attending over a “prefix”, i.e. some context provided to the model that is later used when making predictions. BERT (Devlin et al., 2018) also uses a fully-visible masking pattern and appends a special “classification” token to the input. BERT’s output at the timestep corresponding to the classification token is then used to make a prediction for classifying the input sequence.

编码器使用“完全可见”的注意力掩码。完全可见的掩码允许自注意力机制在生成每个输出项时关注输入的任何项。我们在图 3 (左) 中可视化了这种掩码模式。当对一个“前缀”进行关注时,这种掩码形式是合适的,即提供给模型的一些上下文,在之后用于进行预测时会用到。BERT (Devlin et al., 2018) 也使用完全可见的掩码模式,并在输入中附加一个特殊的“分类”Token。然后使用 BERT 在与分类 Token 对应的时间步的输出来对输入序列进行分类预测。

The self-attention operations in the Transformer’s decoder use a “causal” masking pattern. When producing the $i$ th entry of the output sequence, causal masking prevents the model from attending to the $j$ th entry of the input sequence for $j>i$ . This is used during training so that the model can’t “see into the future” as it produces its output. An attention matrix for this masking pattern is shown in Figure 3, middle.

Transformer 的解码器中的自注意力操作使用“因果”掩码模式。在生成输出序列的第 $i$ 个元素时,因果掩码防止模型关注输入序列中第 $j$ 个($j>i$)元素。这在训练期间使用,以确保模型不能“窥视未来”来生成其输出。这种掩码模式的注意力矩阵如图 3 中间所示。

The decoder in an encoder-decoder Transformer is used to autoregressively produce an output sequence. That is, at each output timestep, a token is sampled from the model’s predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on. As such, a Transformer decoder (without an encoder) can be used as a language model (LM), i.e. a model trained solely for next-step prediction (Liu et al., 2018; Radford et al., 2018; Al-Rfou et al., 2019). This constitutes the second model structure we consider. A schematic of this architecture is shown in Figure 4, middle. In fact, early work on transfer learning for NLP used this architecture with a language modeling objective as a pre-training method (Radford et al., 2018).

编码器-解码器 Transformer 中的解码器用于自回归地生成输出序列。也就是说,在每个输出时间步,从模型预测的分布中采样一个 Token,并将该样本反馈到模型中以生成下一个输出时间步的预测,依此类推。因此,一个 Transformer 解码器(不带编码器)可以用作语言模型 (LM),即仅训练用于下一步预测的模型 (Liu et al., 2018; Radford et al., 2018; Al-Rfou et al., 2019)。这构成了我们考虑的第二种模型结构。该架构的示意图如图 4 中间所示。实际上,早期的 NLP 迁移学习工作使用这种架构和语言建模目标作为预训练方法 (Radford et al., 2018)。

Language models are typically used for compression or sequence generation (Graves, 2013). However, they can also be used in the text-to-text framework simply by concatenating the inputs and targets. As an example, consider the case of English to German translation: If we have a training datapoint with input sentence “That is good.” and target “Das ist gut.”, we would simply train the model on next-step prediction over the concatenated input sequence “translate English to German: That is good. target: Das ist gut.” If we wanted to obtain the model’s prediction for this example, the model would be fed the prefix “translate English to German: That is good. target:” and would be asked to generate the remainder of the sequence autoregressively. In this way, the model can predict an output sequence given an input, which satisfies the needs of text-to-text tasks. This approach was recently used to show that language models can learn to perform some text-to-text tasks without supervision (Radford et al., 2019).

语言模型通常用于压缩或序列生成 (Graves, 2013)。然而,它们也可以通过连接输入和目标在文本到文本框架中使用。例如,考虑英译德的情况:如果我们有一个训练数据点,输入句子为“That is good.”,目标为“Das ist gut.”,我们只需训练模型对连接后的输入序列“translate English to German: That is good. target: Das ist gut.”进行下一步预测。如果我们要获得模型对此示例的预测,我们会给模型提供前缀“translate English to German: That is good. target:”,并要求模型自回归地生成剩余的序列。通过这种方式,模型可以根据输入预测输出序列,满足文本到文本任务的需求。这种方法最近被用来证明语言模型可以在没有监督的情况下学习执行某些文本到文本任务 (Radford et al., 2019)。
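下面的小片段示意了如何把一个翻译样本拼接成语言模型的训练序列,以及推理时提供给模型的前缀;这只是对上文例子的字符串层面演示:

```python
source = "That is good."
target = "Das ist gut."

# 训练时:在整个拼接序列上做下一步预测
train_sequence = f"translate English to German: {source} target: {target}"

# 推理时:只提供前缀,让模型自回归地生成其余部分
inference_prefix = f"translate English to German: {source} target:"

print(train_sequence)
print(inference_prefix)
```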

A fundamental and frequently cited drawback of using a language model in the text-to-text setting is that causal masking forces the model’s representation of the $i$ th entry of the input sequence to only depend on the entries up until $i$ . To see why this is potentially disadvantageous, consider the text-to-text framework where the model is provided with a prefix/context before being asked to make predictions (e.g., the prefix is an English sentence and the model is asked to predict the German translation). With fully causal masking, the model’s representation of a prefix state can only depend on prior entries of the prefix. So, when predicting an entry of the output, the model will attend to a representation of the prefix that is unnecessarily limited. Similar arguments have been made against using a unidirectional recurrent neural network encoder in sequence-to-sequence models (Bahdanau et al., 2015).

在文本到文本设置中使用语言模型的一个基本且经常被提及的缺点是,因果掩码迫使模型对输入序列第 $i$ 个条目的表示仅依赖于直到 $i$ 的条目。要理解为什么这可能是不利的,可以考虑这样一个文本到文本框架:模型在被要求进行预测之前会提供一个前缀/上下文(例如,前缀是一个英语句子,模型被要求预测德语翻译)。在完全因果掩码的情况下,模型对前缀状态的表示只能依赖于前缀的先前条目。因此,在预测输出条目时,模型将关注一个不必要地受限的前缀表示。类似的论点也针对在序列到序列模型中使用单向循环神经网络编码器提出过 (Bahdanau et al., 2015)。

This issue can be avoided in a Transformer-based language model simply by changing the masking pattern. Instead of using a causal mask, we use fully-visible masking during the prefix portion of the sequence. This masking pattern and a schematic of the resulting “prefix LM” (the third model structure we consider) are illustrated in the rightmost panels of Figures 3 and 4, respectively. In the English to German translation example mentioned above, fully-visible masking would be applied to the prefix “translate English to German: That is good. target:” and causal masking would be used during training for predicting the target “Das ist gut.” Using a prefix LM in the text-to-text framework was originally proposed by

这个问题可以在基于 Transformer 的语言模型中通过改变掩码模式来避免。不是使用因果掩码,我们在序列的前缀部分使用完全可见的掩码。这种掩码模式和由此产生的“前缀语言模型”(我们考虑的第三种模型结构)分别在图 3 和图 4 的最右侧面板中进行了说明。在上述的英德翻译例子中,完全可见的掩码将应用于前缀 “translate English to German: That is good. target:”,而在训练时用于预测目标 “Das ist gut.” 的则是因果掩码。在文本到文本框架中使用前缀语言模型最初是由

Liu et al. (2018). More recently, Dong et al. (2019) showed that this architecture is effective on a wide variety of text-to-text tasks. This architecture is similar to an encoder-decoder model with parameters shared across the encoder and decoder and with the encoder-decoder attention replaced with full attention across the input and target sequence.

Liu 等 (2018)。最近,Dong 等 (2019) 表明该架构在各种文本到文本任务上是有效的。该架构类似于编码器-解码器模型,在编码器和解码器之间共享参数,并将编码器-解码器注意力替换为输入序列和目标序列之间的全注意力。

We note that when following our text-to-text framework, the prefix LM architecture closely resembles BERT (Devlin et al., 2018) for classification tasks. To see why, consider an example from the MNLI benchmark where the premise is “I hate pigeons.”, the hypothesis is “My feelings towards pigeons are filled with animosity.” and the correct label is “entailment”. To feed this example into a language model, we would transform it into the sequence “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target: entailment”. In this case, the fully-visible prefix would correspond to the entire input sequence up to the word “target:”, which can be seen as being analogous to the “classification” token used in BERT. So, our model would have full visibility over the entire input, and then would be tasked with making a classification by outputting the word “entailment”. It is easy for the model to learn to output one of the valid class labels given the task prefix (“mnli” in this case). As such, the main difference between a prefix LM and the BERT architecture is that the classifier is simply integrated into the output layer of the Transformer decoder in the prefix LM.

我们注意到,在遵循我们的文本到文本框架时,前缀语言模型架构在分类任务中与 BERT (Devlin et al., 2018) 非常相似。为了理解原因,考虑来自 MNLI 基准的一个例子,其中前提为 “I hate pigeons.”(我讨厌鸽子),假设为 “My feelings towards pigeons are filled with animosity.”(我对鸽子的感情充满了敌意),正确标签为 “entailment”(蕴含)。要将这个例子输入到语言模型中,我们会将其转换为序列 “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target: entailment”。在这种情况下,完全可见的前缀对应于直到单词 “target:” 的整个输入序列,这可以被视为类似于 BERT 中使用的 “classification” Token。因此,我们的模型可以完全看到整个输入,并通过输出单词 “entailment” 来进行分类。鉴于任务前缀(在这个例子中是 “mnli”),模型很容易学习输出一个有效的类别标签。因此,前缀语言模型与 BERT 架构的主要区别在于,分类器被简单地集成到了前缀语言模型的 Transformer 解码器的输出层中。

3.2.2. Comparing Different Model Structures

3.2.2. 比较不同的模型结构

In the interest of experimentally comparing these architectural variants, we would like each model we consider to be equivalent in some meaningful way. We might say that two models are equivalent if they either have the same number of parameters or they require roughly the same amount of computation to process a given (input-sequence, target-sequence) pair. Unfortunately, it is not possible to compare an encoder-decoder model to a language model architecture (comprising a single Transformer stack) according to both of these criteria at the same time. To see why, first note an encoder-decoder model with $L$ layers in the encoder and $L$ layers in the decoder has approximately the same number of parameters as a language model with $2L$ layers. However, the same $L+L$ encoder-decoder model will have approximately the same computational cost as a language model with only $L$ layers. This is a consequence of the fact that the $L$ layers in the language model must be applied to both the input and output sequence, while the encoder is only applied to the input sequence and the decoder is only applied to the output sequence. Note that these equivalences are approximate—there are some extra parameters in the decoder due to the encoder-decoder attention and there are also some computational costs in the attention layers that are quadratic in the sequence lengths. In practice, however, we observed nearly identical step times for $L$ -layer language models versus $L+L$ -layer encoder-decoder models, suggesting a roughly equivalent computational cost. Further, for the model sizes we consider, the number of parameters in the encoder-decoder attention layers is about 10% of the total parameter count, so we make the simplifying assumption that an $L+L$ -layer encoder-decoder model has the same number of parameters as an $2L$ -layer language model.

为了实验性地比较这些架构变体,我们希望每个考虑的模型在某些有意义的方式上是等价的。我们可以说两个模型是等价的,如果它们要么具有相同的参数数量,要么处理给定 (输入序列, 目标序列) 对所需的计算量大致相同。不幸的是,根据这两个标准同时比较编码器-解码器模型和语言模型架构(由单个 Transformer 堆栈组成)是不可能的。

要理解原因,请首先注意一个编码器有 $L$ 层且解码器也有 $L$ 层的编码器-解码器模型大约与一个有 $2L$ 层的语言模型具有相同的参数数量。然而,同一个 $L+L$ 的编码器-解码器模型将大约与只有 $L$ 层的语言模型具有相同的计算成本。这是由于语言模型中的 $L$ 层必须应用于输入和输出序列,而编码器仅应用于输入序列,解码器仅应用于输出序列。请注意,这些等价关系是近似的——解码器中由于编码器-解码器注意力机制存在一些额外的参数,并且在注意力层中也有一些与序列长度呈二次关系的计算成本。然而,在实践中,我们观察到 $L$ 层语言模型与 $L+L$ 层编码器-解码器模型的每步时间几乎相同,这表明其计算成本大致相当。此外,对于我们考虑的模型规模,编码器-解码器注意力层中的参数数量约占总参数数量的 10%,因此我们简化假设 $L+L$ 层的编码器-解码器模型与 $2L$ 层的语言模型具有相同的参数数量。

To provide a reasonable means of comparison, we consider multiple configurations for our encoder-decoder model. We will refer to the number of layers and parameters in a BERTBASE-sized layer stack as $L$ and $P$ , respectively. We will use $M$ to refer to the number of FLOPs required for an $L+L$ -layer encoder-decoder model or $L$ -layer decoder-only model to process a given input-target pair. In total, we will compare:

为了提供一个合理的比较手段,我们考虑了编码器-解码器模型的多种配置。我们将 BERTBASE 规模的层堆栈中的层数和参数数量分别表示为 $L$ 和 $P$,并用 $M$ 表示 $L+L$ 层的编码器-解码器模型或 $L$ 层的仅解码器模型处理一个给定输入-目标对所需的 FLOPs。总共,我们将比较以下几种配置(与表 2 中的各行对应):

- 编码器和解码器各有 $L$ 层的编码器-解码器模型,约 $2P$ 个参数,计算成本约为 $M$ FLOPs;
- 与之等价、但编码器与解码器共享参数的模型,约 $P$ 个参数,计算成本约为 $M$ FLOPs;
- 编码器和解码器各有 $L/2$ 层的编码器-解码器模型,约 $P$ 个参数,计算成本约为 $M/2$ FLOPs;
- 具有 $L$ 层的仅解码器语言模型,约 $P$ 个参数,计算成本约为 $M$ FLOPs;
- 架构与上一项相同、但对输入部分使用完全可见掩码的仅解码器前缀语言模型,参数量和计算成本也与上一项相同。
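为了直观地理解上述等价关系,下面给出一个非常粗略的计算示意(忽略编码器-解码器注意力的额外参数以及与序列长度呈二次关系的注意力计算;每层参数量、序列长度等数值均为假设):

```python
def encdec_cost(L, p, n_in, n_out):
    """编码器 L 层 + 解码器 L 层:约 2Lp 个参数;
    编码器只处理输入、解码器只处理目标,总计算量约为 Lp*n_in + Lp*n_out。"""
    params = 2 * L * p
    flops = L * p * n_in + L * p * n_out
    return params, flops

def lm_cost(L, p, n_in, n_out):
    """仅解码器语言模型:同一摞 L 层要同时处理输入和目标。"""
    params = L * p
    flops = L * p * (n_in + n_out)
    return params, flops

L, p, n_in, n_out = 12, 7_000_000, 512, 512
print(encdec_cost(L, p, n_in, n_out))     # 约 2P 参数、约 M FLOPs
print(lm_cost(2 * L, p, n_in, n_out))     # 约 2P 参数、约 2M FLOPs
print(lm_cost(L, p, n_in, n_out))         # 约 P 参数、约 M FLOPs
```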

3.2.3. Objectives

3.2.3. 目标

As an unsupervised objective, we will consider both a basic language modeling objective as well as our baseline denoising objective described in Section 3.1.4. We include the language modeling objective due to its historic use as a pre-training objective (Dai and Le, 2015; Ramachandran et al., 2016; Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018) as well as its natural fit for the language model architectures we consider. For models that ingest a prefix before making predictions (the encoder-decoder model and prefix LM), we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions. For the standard language model, we train the model to predict the entire span from beginning to end. Our unsupervised denoising objective is designed for text-to-text models; to adapt it for use with a language model we concatenate the inputs and targets as described in Section 3.2.1.

作为无监督目标,我们将考虑基本的语言建模目标以及我们在第 3.1.4 节中描述的基线去噪目标。我们包含语言建模目标是由于其历史上被用作预训练目标 (Dai and Le, 2015; Ramachandran et al., 2016; Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018) 以及它与我们考虑的语言模型架构的自然契合。对于在进行预测之前需要输入前缀的模型(编码器-解码器模型和前缀 LM),我们从未标记的数据集中采样一段文本,并随机选择一个点将其分为前缀和目标部分。对于标准语言模型,我们训练模型从头到尾预测整个片段。我们的无监督去噪目标是为文本到文本模型设计的;为了适应语言模型使用,我们将输入和目标连接起来,如第 3.2.1 节所述。
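对于需要先读入前缀的模型,上述语言建模目标只需随机选一个切分点即可构造训练样本;下面是一个纯示意的小片段(切分方式为假设的简化):

```python
import random

def split_prefix_target(tokens, rng):
    """随机选择一个切分点,前半部分作为前缀输入,后半部分作为预测目标。"""
    split = rng.randint(1, len(tokens) - 1)
    return tokens[:split], tokens[split:]

rng = random.Random(0)
tokens = "Thank you for inviting me to your party last week .".split()
prefix, target = split_prefix_target(tokens, rng)
print(prefix)
print(target)
```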

3.2.4. Results

3.2.4. 结果

The scores achieved by each of the architectures we compare are shown in Table 2. For all tasks, the encoder-decoder architecture with the denoising objective performed best. This variant has the highest parameter count ($2P$) but the same computational cost as the $P$ -parameter decoder-only models. Surprisingly, we found that sharing parameters across the encoder and decoder performed nearly as well. In contrast, halving the number of layers in the encoder and decoder stacks significantly hurt performance. Concurrent work (Lan et al., 2019) also found that sharing parameters across Transformer blocks can be an effective means of lowering the total parameter count without sacrificing much performance. XLNet also bears some resemblance to the shared encoder-decoder approach with a denoising objective (Yang et al., 2019). We also note that the shared parameter encoder-decoder outperforms the decoder-only prefix LM, suggesting that the addition of an explicit encoder-decoder attention is beneficial. Finally, we confirm the widely-held conception that using a denoising objective always results in better downstream task performance compared to a language modeling objective. This observation has been previously made by Devlin et al. (2018), Voita et al. (2019), and Lample and Conneau (2019) among others. We undertake a more detailed exploration of unsupervised objectives in the following section.

我们比较的每种架构所取得的分数见表 2。对于所有任务,使用去噪目标的编码器-解码器架构表现最佳。这种变体具有最高的参数量 ($2P$),但计算成本与 $P$ 参数的仅解码器模型相同。令人惊讶的是,我们发现跨编码器和解码器共享参数的表现几乎一样好。相反,将编码器和解码器堆栈中的层数减半显著损害了性能。同期工作 (Lan et al., 2019) 也发现,跨 Transformer 模块共享参数可以有效降低总参数量而不牺牲太多性能。XLNet 也与带有去噪目标的共享编码器-解码器方法有某些相似之处 (Yang et al., 2019)。我们还注意到,共享参数的编码器-解码器优于仅解码器前缀语言模型,这表明添加显式的编码器-解码器注意力是有益的。最后,我们确认了一种广泛持有的观点,即使用去噪目标总是比使用语言建模目标带来更好的下游任务性能。这一观察结果之前已由 Devlin et al. (2018)、Voita et al. (2019) 和 Lample and Conneau (2019) 等人提出。我们在下一节中对无监督目标进行更详细的探讨。

Table 2: Performance of the different architectural variants described in Section 3.2.2. We use $P$ to refer to the number of parameters in a 12-layer base Transformer layer stack and $M$ to refer to the FLOPs required to process a sequence using the encoder-decoder model. We evaluate each architectural variant using a denoising objective (described in Section 3.1.4) and an autoregressive objective (as is commonly used to train language models).

表 2: 不同架构变体的性能,如第 3.2.2 节所述。我们使用 $P$ 表示 12 层基础 Transformer 层堆栈中的参数数量,并使用 $M$ 表示使用编码器-解码器模型处理序列所需的 FLOPs。我们使用去噪目标(如第 3.1.4 节所述)和自回归目标(通常用于训练语言模型)评估每个架构变体。

| 架构 | 目标 | 参数 | 成本 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ★ 编码器-解码器 | 去噪 | 2P | M | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 编码器-解码器,共享 | 去噪 | P | M | 82.81 | 18.78 | 80.63 | 70.73 | 26.72 | 39.03 | 27.46 |
| 编码器-解码器,6 层 | 去噪 | P | M/2 | 80.88 | 18.97 | 77.59 | 68.42 | 26.38 | 38.40 | 26.95 |
| 语言模型 | 去噪 | P | M | 74.70 | 17.93 | 61.14 | 55.02 | 25.09 | 35.28 | 25.86 |
| 前缀 LM | 去噪 | P | M | 81.82 | 18.61 | 78.94 | 68.11 | 26.43 | 37.98 | 27.39 |
| 编码器-解码器 | 语言建模 (LM) | 2P | M | 79.56 | 18.59 | 76.02 | 64.29 | 26.27 | 39.17 | 26.86 |
| 编码器-解码器,共享 | 语言建模 (LM) | P | M | 79.60 | 18.13 | 76.35 | 63.50 | 26.62 | 39.17 | 27.05 |
| 编码器-解码器,6 层 | 语言建模 (LM) | P | M/2 | 78.67 | 18.26 | 75.32 | 64.06 | 26.13 | 38.42 | 26.89 |
| 语言模型 | 语言建模 (LM) | P | M | 73.78 | 17.54 | 53.81 | 56.51 | 25.23 | 34.31 | 25.38 |
| 前缀 LM | 语言建模 (LM) | P | M | 79.68 | 17.84 | 76.87 | 64.86 | 26.28 | 37.51 | 26.76 |

3.3. Unsupervised Objectives

3.3. 无监督目标

The choice of unsupervised objective is of central importance as it provides the mechanism through which the model gains general-purpose knowledge to apply to downstream tasks. This has led to the development of a wide variety of pre-training objectives (Dai and Le, 2015; Ramachandran et al., 2016; Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019b; Wang et al., 2019a; Song et al., 2019; Dong et al., 2019; Joshi et al., 2019). In this section, we perform a procedural exploration of the space of unsupervised objectives. In many cases, we will not replicate an existing objective exactly—some will be modified to fit our text-to-text encoder-decoder framework and, in other cases, we will use objectives that combine concepts from multiple common approaches.

无监督目标的选择至关重要,因为它为模型提供了获取通用知识的机制,以应用于下游任务。这导致了各种各样的预训练目标的开发 (Dai and Le, 2015; Ramachandran et al., 2016; Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019b; Wang et al., 2019a; Song et al., 2019; Dong et al., 2019; Joshi et al., 2019)。在本节中,我们将对无监督目标的空间进行程序化探索。在许多情况下,我们不会完全复制现有的目标——有些将被修改以适应我们的文本到文本编码器-解码器框架,在其他情况下,我们将使用结合了多个常见方法概念的目标。

Overall, all of our objectives ingest a sequence of token IDs corresponding to a tokenized span of text from our unlabeled text data set. The token sequence is processed to produce a (corrupted) input sequence and a corresponding target. Then, the model is trained as usual with maximum likelihood to predict the target sequence. We provide illustrative examples of many of the objectives we consider in Table 3.

总体而言,我们的所有目标都会输入一个与我们未标注文本数据集中的一段标记化文本相对应的 Token ID 序列。Token 序列被处理以生成一个(损坏的)输入序列和相应的目标序列。然后,模型像往常一样通过最大似然方法训练以预测目标序列。我们在表 3 中提供了许多我们考虑的目标的说明性示例。

3.3.1. Disparate High-Level Approaches

3.3.1. 不同的高层方法

To begin with, we compare three techniques that are inspired by commonly-used objectives but differ significantly in their approach. First, we include a basic “prefix language modeling” objective as was used in Section 3.2.3. This technique splits a span of text into two components, one to use as inputs to the encoder and the other to use as a target sequence

首先,我们比较三种受常用目标启发但方法上存在显著差异的技术。第一种是包含基本的“前缀语言建模 (prefix language modeling)”目标,如第 3.2.3 节所使用的方法。该技术将一段文本分割成两个部分,一部分作为编码器的输入,另一部分作为目标序列。

| 目标 | 输入 | 目标序列 |
| --- | --- | --- |
| 前缀语言建模 | Thank you for inviting | me to your party last week . |
| BERT 风格 (Devlin et al., 2018) | Thank you ⟨M⟩ ⟨M⟩ me to your party apple week . | (原始文本) |
| 去洗牌 | party me for your to . last fun you inviting week Thank | (原始文本) |
| MASS 风格 (Song et al., 2019) | Thank you ⟨M⟩ ⟨M⟩ me to your party ⟨M⟩ week . | (原始文本) |
| 独立同分布噪声,替换片段 | Thank you ⟨X⟩ me to your party ⟨Y⟩ week . | ⟨X⟩ for inviting ⟨Y⟩ last ⟨Z⟩ |
| 独立同分布噪声,删除 Token | Thank you me to your party week . | for inviting last |
| 随机片段 | Thank you ⟨X⟩ to ⟨Y⟩ week . | ⟨X⟩ for inviting me ⟨Y⟩ your party last ⟨Z⟩ |

Table 3: Examples of inputs and targets produced by some of the unsupervised objectives we consider applied to the input text “Thank you for inviting me to your party last week .” Note that all of our objectives process tokenized text. For this particular sentence, all words were mapped to a single token by our vocabulary. We write (original text) as a target to denote that the model is tasked with reconstructing the entire input text. ⟨M⟩ denotes a shared mask token and ⟨X⟩, ⟨Y⟩, and ⟨Z⟩ denote sentinel tokens that are assigned unique token IDs. The BERT-style objective (second row) includes a corruption where some tokens are replaced by a random token ID; we show this via the greyed-out word apple.

表 3: 我们考虑的一些无监督目标应用于输入文本 “Thank you for inviting me to your party last week .” 的输入和目标示例。请注意,我们所有的目标都处理分词文本。对于这个特定的句子,所有单词都被我们的词汇映射为单个 Token。我们写 (原始文本) 作为目标,表示模型的任务是重建整个输入文本。⟨M⟩ 表示共享的掩码 Token,⟨X⟩、⟨Y⟩ 和 ⟨Z⟩ 表示被分配唯一 Token ID 的哨兵 Token。BERT 风格的目标(第二行)包括一些 Token 被随机 Token ID 替换的破坏;我们通过灰色显示的单词 apple 来表示这一点。


Table 4: Performance of the three disparate pre-training objectives described in Section 3.3.1.

表 4: 第 3.3.1 节中描述的三种不同的预训练目标的性能。

| 目标 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 前缀语言建模 | 80.69 | 18.94 | 77.99 | 65.27 | 26.86 | 39.73 | 27.49 |
| BERT 风格 (Devlin et al., 2018) | 82.96 | 19.17 | 80.65 | 69.85 | 26.78 | 40.03 | 27.41 |
| 去洗牌 | 73.17 | 18.59 | 67.61 | 58.47 | 26.11 | 39.30 | 25.62 |

to be predicted by the decoder. Second, we consider an objective inspired by the “masked language modeling” (MLM) objective used in BERT (Devlin et al., 2018). MLM takes a span of text and corrupts 15% of the tokens. 90% of the corrupted tokens are replaced with a special mask token and 10% are replaced with a random token. Since BERT is an encoder-only model, its goal during pre-training is to reconstruct masked tokens at the output of the encoder. In the encoder-decoder case, we simply use the entire uncorrupted sequence as the target. Note that this differs from our baseline objective, which uses only the corrupted tokens as targets; we compare these two approaches in Section 3.3.2. Finally, we also consider a basic deshuffling objective as used e.g. in (Liu et al., 2019a) where it was applied to a denoising sequential autoencoder. This approach takes a sequence of tokens, shuffles it, and then uses the original deshuffled sequence as a target. We provide examples of the inputs and targets for these three methods in the first three rows of Table 3.

该目标序列将由解码器进行预测。其次,我们考虑一个受“掩码语言建模” (MLM) 目标启发的目标,该目标在 BERT (Devlin et al., 2018) 中使用。MLM 选取一段文本并破坏其中 15% 的 Token。90% 的被破坏 Token 被替换为特殊的掩码 Token,10% 被替换为随机 Token。由于 BERT 是仅编码器模型,其在预训练期间的目标是通过编码器的输出重建被掩码的 Token。在编码器-解码器的情况下,我们简单地将整个未被破坏的序列作为目标。请注意,这与我们的基线目标不同,后者仅使用被破坏的 Token 作为目标;我们在第 3.3.2 节中比较了这两种方法。最后,我们还考虑了一个基本的去洗牌目标,例如在 (Liu et al., 2019a) 中应用到去噪序列自编码器。这种方法取一串 Token,将其打乱,然后使用原始的未打乱序列作为目标。我们在表 3 的前三行提供了这三种方法的输入和目标示例。

The performance of these three objectives is shown in Table 4. Overall, we find that the BERT-style objective performs best, though the prefix language modeling objective attains similar performance on the translation tasks. Indeed, the motivation for the BERT objective was to outperform language model-based pre-training. The deshuffling objective performs considerably worse than both prefix language modeling and the BERT-style objective.

这三个目标的性能如表 4 所示。总体而言,我们发现 BERT 式 (BERT-style) 目标表现最好,尽管前缀语言建模目标在翻译任务上达到了类似的性能。实际上,BERT 目标的动机是为了超越基于语言模型的预训练。去洗牌目标的表现明显不如前缀语言建模和 BERT 式目标。

Table 5: Comparison of variants of the BERT-style pre-training objective. In the first two variants, the model is trained to reconstruct the original uncorrupted text segment. In the latter two, the model only predicts the sequence of corrupted tokens.

| 目标 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT 风格 (Devlin et al., 2018) | 82.96 | 19.17 | 80.65 | 69.85 | 26.78 | 40.03 | 27.41 |
| MASS 风格 (Song et al., 2019) | 82.32 | 19.16 | 80.10 | 69.28 | 26.79 | 39.89 | 27.55 |
| ★ 替换损坏的片段 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 删除损坏的 Token | 84.44 | 19.31 | 80.52 | 68.67 | 27.07 | 39.76 | 27.82 |

表 5: BERT-style 预训练目标变体的比较。在前两个变体中,模型被训练以重建原始未损坏的文本片段。在后两个变体中,模型仅预测损坏的 Token 序列。

3.3.2. Simplifying the BERT Objective

3.3.2. 简化 BERT 目标

Based on the results in the prior section, we will now focus on exploring modifications to the BERT-style denoising objective. This objective was originally proposed as a pre-training technique for an encoder-only model trained for classification and span prediction. As such, it may be possible to modify it so that it performs better or is more efficient in our encoder-decoder text-to-text setup.

基于前一节的结果,我们现在将专注于探索对 BERT 式去噪目标的修改。这个目标最初被提出作为一种仅编码器模型的预训练技术,该模型用于分类和跨度预测。因此,有可能对其进行修改,使其在我们的编码器-解码器文本到文本设置中表现更好或更高效。

First, we consider a simple variant of the BERT-style objective where we don’t include the random token swapping step. The resulting objective simply replaces 15% of the tokens in the input with a mask token and the model is trained to reconstruct the original uncorrupted sequence. A similar masking objective was used by Song et al. (2019) where it was referred to as “MASS”, so we call this variant the “MASS-style” objective. Second, we were interested to see if it was possible to avoid predicting the entire uncorrupted text span since this requires self-attention over long sequences in the decoder. We consider two strategies to achieve this: First, instead of replacing each corrupted token with a mask token, we replace the entirety of each consecutive span of corrupted tokens with a unique mask token. Then, the target sequence becomes the concatenation of the “corrupted” spans, each prefixed by the mask token used to replace it in the input. This is the pre-training objective we use in our baseline, described in Section 3.1.4. Second, we also consider a variant where we simply drop the corrupted tokens from the input sequence completely and task the model with reconstructing the dropped tokens in order. Examples of these approaches are shown in the fifth and sixth rows of Table 3.

首先,我们考虑一个简单的 BERT-style 目标变体,在这个变体中我们不包括随机 Token 交换步骤。得到的目标简单地将输入中的 15% 的 Token 替换为掩码 Token,并训练模型以重建原始未损坏的序列。Song 等人 (2019) 使用了一个类似的掩码目标,他们称之为“MASS”,因此我们将这种变体称为“MASS-style”目标。其次,我们有兴趣研究是否可以避免预测整个未损坏的文本片段,因为这需要在解码器中对长序列进行自注意力操作。我们考虑了两种策略来实现这一点:第一种策略是,不是用掩码 Token 替换每个损坏的 Token,而是用唯一的掩码 Token 替换每个连续的损坏 Token 区域。然后,目标序列成为“损坏”区域的连接,每个区域前缀为用于替换输入中该区域的掩码 Token。这是我们在第 3.1.4 节中描述的基线使用的预训练目标。第二种策略是,我们还考虑了一种变体,即完全从输入序列中删除损坏的 Token,并要求模型按顺序重建这些被删除的 Token。这些方法的例子显示在表 3 的第五行和第六行。
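沿用前文示意代码中的记号,下面的片段在同一个损坏掩码上构造 “MASS 风格” 与 “删除损坏的 Token” 两种变体的输入/目标(“替换损坏片段” 即前文基线目标的构造方式);掩码 Token 写法 `<M>` 等仅为说明用的假设:

```python
def mass_style(tokens, drop):
    """MASS 风格:每个被损坏的 Token 替换为共享掩码 Token,目标为完整原文。"""
    inputs = ["<M>" if d else t for t, d in zip(tokens, drop)]
    return inputs, list(tokens)

def drop_tokens(tokens, drop):
    """删除损坏的 Token:输入中直接去掉,目标为被删除 Token 按原顺序组成的序列。"""
    inputs = [t for t, d in zip(tokens, drop) if not d]
    targets = [t for t, d in zip(tokens, drop) if d]
    return inputs, targets

tokens = "Thank you for inviting me to your party last week .".split()
drop = [False, False, True, True, False, False, False, False, True, False, False]
print(mass_style(tokens, drop))
print(drop_tokens(tokens, drop))
```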

An empirical comparison of the original BERT-style objective to these three alternatives is shown in Table 5. We find that in our setting, all of these variants perform similarly. The only exception was that dropping corrupted tokens completely produced a small improvement in the GLUE score thanks to a significantly higher score on CoLA (60.04, compared to our baseline average of 53.84, see Table 16). This may be due to the fact that CoLA involves classifying whether a given sentence is grammatically and syntactically acceptable, and being able to determine when tokens are missing is closely related to detecting acceptability. However, dropping tokens completely performed worse than replacing them with sentinel tokens on SuperGLUE. The two variants that do not require predicting the full original sequence (“replace corrupted spans” and “drop corrupted spans”) are both potentially attractive since they make the target sequences shorter and consequently make training faster. Going forward, we will explore variants where we replace corrupted spans with sentinel tokens and only predict the corrupted tokens (as in our baseline objective).

原始 BERT 风格目标与这三种替代方案的实证比较见表 5。我们发现,在我们的设置中,所有这些变体表现相似。唯一的例外是完全删除损坏的 Token 产生了小幅度的 GLUE 得分提升,这得益于 CoLA 上显著更高的得分(60.04,相比之下我们基线平均得分为 53.84,见表 16)。这可能是由于 CoLA 涉及判断给定句子在语法和句法上是否可接受,而能够确定 Token 是否缺失与检测可接受性密切相关。然而,完全删除 Token 在 SuperGLUE 上的表现不如用哨兵 Token 替换它们。不需要预测完整原始序列的两个变体(“替换损坏片段”和“删除损坏片段”)都具有吸引力,因为它们使目标序列更短,从而加快训练速度。未来,我们将探索用哨兵 Token 替换损坏片段并仅预测损坏的 Token 的变体(如我们基线目标所示)。

Table 6: Performance of the i.i.d. corruption objective with different corruption rates.

表 6: 不同损坏率的 i.i.d. 损坏目标的性能。

| 损坏率 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 10% | 82.82 | 19.00 | 80.38 | 69.55 | 26.87 | 39.28 | 27.44 |
| ★ 15% | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 25% | 83.00 | 19.54 | 80.96 | 70.48 | 27.04 | 39.83 | 27.47 |
| 50% | 81.27 | 19.32 | 79.80 | 70.33 | 27.01 | 39.90 | 27.49 |

3.3.3. Varying the Corruption Rate

3.3.3. 改变损坏率

So far, we have been corrupting 15% of the tokens, the value used in BERT (Devlin et al., 2018). Again, since our text-to-text framework differs from BERT’s, we are interested to see if a different corruption rate works better for us. We compare corruption rates of 10%, 15%, 25%, and 50% in Table 6. Overall, we find that the corruption rate had a limited effect on the model’s performance. The only exception is that the largest corruption rate we consider (50%) results in a significant degradation of performance on GLUE and SQuAD. Using a larger corruption rate also results in longer targets, which can potentially slow down training. Based on these results and the historical precedent set by BERT, we will use a corruption rate of 15% going forward.

到目前为止,我们已经破坏了 15% 的 Token,这是 BERT (Devlin et al., 2018) 使用的比例。再次强调,由于我们的文本到文本框架与 BERT 不同,我们有兴趣看看不同的破坏率是否对我们更有效。我们在表 6 中比较了 10%、15%、25% 和 50% 的破坏率。总体而言,我们发现破坏率对模型性能的影响有限。唯一的例外是我们考虑的最大破坏率 (50%) 导致 GLUE 和 SQuAD 上的性能显著下降。使用更大的破坏率也会导致目标序列更长,这可能会减慢训练速度。基于这些结果以及 BERT 设定的历史先例,我们将继续使用 15% 的破坏率。

3.3.4. Corrupting Spans

3.3.4. 损坏连续片段 (Corrupting Spans)

We now turn towards the goal of speeding up training by predicting shorter targets. The approach we have used so far makes an i.i.d. decision for each input token as to whether to corrupt it or not. When multiple consecutive tokens have been corrupted, they are treated as a “span” and a single unique mask token is used to replace the entire span. Replacing entire spans with a single token results in unlabeled text data being processed into shorter sequences. Since we are using an i.i.d. corruption strategy, it is not always the case that a significant number of corrupted tokens appear consecutively. As a result, we might obtain additional speedup by specifically corrupting spans of tokens rather than corrupting individual tokens in an i.i.d. manner. Corrupting spans was also previously considered as a pre-training objective for BERT, where it was found to improve performance (Joshi et al., 2019).

我们现在转向通过预测更短的目标来加速训练的目标。到目前为止,我们使用的方法是对每个输入 Token 是否要破坏它进行独立同分布 (i.i.d.) 决策。当多个连续的 Token 被破坏时,它们被视为一个“片段”,并使用一个唯一的掩码 Token 来替换整个片段。用单个 Token 替换整个片段的结果是未标注的文本数据被处理成更短的序列。由于我们使用的是独立同分布的破坏策略,并不总是会有大量连续的破坏 Token 出现。因此,通过专门破坏 Token 片段而不是以独立同分布的方式破坏单个 Token,我们可能会获得额外的加速。破坏片段之前也被考虑作为 BERT 的预训练目标,在那里发现它可以提高性能 (Joshi et al., 2019)。

To test this idea, we consider an objective that specifically corrupts contiguous, randomly-spaced spans of tokens. This objective can be parametrized by the proportion of tokens to be corrupted and the total number of corrupted spans. The span lengths are then chosen randomly to satisfy these specified parameters. For example, if we are processing a sequence of 500 tokens and we have specified that 15% of tokens should be corrupted and that there should be 25 total spans, then the total number of corrupted tokens would be $500\times0.15=75$ and the average span length would be $75/25=3$ . Note that given the original sequence length and corruption rate, we can equivalently parametrize this objective either by the average span length or the total number of spans.

为了测试这个想法,我们考虑一个特定的目标,该目标会破坏连续的、随机间隔的 Token 跨度。这个目标可以通过要破坏的 Token 比例和总破坏跨度数来参数化。然后跨度长度将被随机选择以满足这些指定的参数。例如,如果我们正在处理一个包含 500 个 Token 的序列,并且我们指定了 15% 的 Token 应该被破坏,并且应该有 25 个总跨度,那么被破坏的 Token 总数将是 $500\times0.15=75$,平均跨度长度将是 $75/25=3$。请注意,给定原始序列长度和破坏率,我们可以等效地通过平均跨度长度或总跨度数来参数化此目标。
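下面的片段按这种参数化方式计算被损坏的 Token 总数与平均片段长度,并示意性地把损坏预算随机切分成指定数量的片段(具体的随机切分方式只是一个假设的简化,并非论文实现):

```python
import random

def span_budget(seq_len=500, corrupt_rate=0.15, num_spans=25, seed=0):
    rng = random.Random(seed)
    num_corrupted = round(seq_len * corrupt_rate)        # 500 * 0.15 = 75
    avg_span_len = num_corrupted / num_spans             # 75 / 25 = 3
    # 把 num_corrupted 个 Token 随机切成 num_spans 段,每段至少 1 个 Token
    cuts = sorted(rng.sample(range(1, num_corrupted), num_spans - 1))
    lengths = [b - a for a, b in zip([0] + cuts, cuts + [num_corrupted])]
    return num_corrupted, avg_span_len, lengths

total, avg, lengths = span_budget()
print(total, avg, sum(lengths) == total)
```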

| 跨度长度 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ★ 基线 (i.i.d.) | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 2 | 83.54 | 19.39 | 82.09 | 72.20 | 26.76 | 39.99 | 27.63 |
| 3 | 83.49 | 19.62 | 81.84 | 72.53 | 26.86 | 39.65 | 27.62 |
| 5 | 83.40 | 19.24 | 82.05 | 72.23 | 26.88 | 39.40 | 27.53 |
| 10 | 82.85 | 19.33 | 81.84 | 70.44 | 26.79 | 39.49 | 27.69 |

Table 7: Performance of the span-corruption objective (inspired by Joshi et al. (2019)) for different average span lengths. In all cases, we corrupt 15% of the original text sequence.

表 7: 不同平均跨度长度的 span-corruption 目标 (受 Joshi et al. (2019) 启发) 的性能。在所有情况下,我们破坏了原始文本序列的 15% 。


Figure 5: A flow chart of our exploration of unsupervised objectives. We first consider a few disparate approaches in Section 3.3.1 and find that a BERT-style denoising objective performs best. Then, we consider various methods for simplifying the BERT objective so that it produces shorter target sequences in Section 3.3.2. Given that replacing dropped-out spans with sentinel tokens performs well and results in short target sequences, in Section 3.3.3 we experiment with different corruption rates. Finally, we evaluate an objective that intentionally corrupts contiguous spans of tokens in Section 3.3.4.

图 5: 我们探索无监督目标的流程图。我们首先在第 3.3.1 节考虑了几种不同的方法,并发现 BERT 风格的去噪目标表现最佳。然后,我们在第 3.3.2 节考虑了多种简化 BERT 目标的方法,使其生成更短的目标序列。鉴于用哨兵 Token 替换丢弃的片段表现良好且结果为目标序列较短,在第 3.3.3 节中我们尝试了不同的损坏率。最后,我们在第 3.3.4 节评估了一个故意损坏连续 Token 片段的目标。

We compare the span-corruption objective to the i.i.d-corruption objective in Table 7. We use a corruption rate of 15% in all cases and compare using average span lengths of 2, 3, 5 and 10. Again, we find a limited difference between these objectives, though the version with an average span length of 10 slightly underperforms the other values in some cases. We also find in particular that using an average span length of 3 slightly (but significantly) outperforms the i.i.d. objective on most non-translation benchmarks. Fortunately, the span-corruption objective also provides some speedup during training compared to the i.i.d. noise approach because span corruption produces shorter sequences on average.

我们在表 7 中比较了跨度损坏目标和独立同分布损坏目标。在所有情况下,我们使用 15% 的损坏率,并比较平均跨度长度为 2、3、5 和 10 的情况。再次,我们发现这些目标之间的差异有限,尽管在某些情况下,平均跨度长度为 10 的版本表现略逊于其他取值。我们还特别发现,使用平均跨度长度为 3 的方法在大多数非翻译基准测试中略微(但显著)优于独立同分布目标。幸运的是,与独立同分布噪声方法相比,跨度损坏目标在训练过程中也带来了一些加速,因为跨度损坏产生的序列平均更短。


3.3.5. Discussion

3.3.5. 讨论

Figure 5 shows a flow chart of the choices made during our exploration of unsupervised objectives. Overall, the most significant difference in performance we observed was that denoising objectives outperformed language modeling and deshuffling for pre-training. We did not observe a remarkable difference across the many variants of the denoising objectives we explored. However, different objectives (or parameterizations of objectives) can lead to different sequence lengths and thus different training speeds. This implies that choosing among the denoising objectives we considered here should mainly be done according to their computational cost. Our results also suggest that additional exploration of objectives similar to the ones we consider here may not lead to significant gains for the tasks and model we consider. Instead, it may be fortuitous to explore entirely different ways of leveraging unlabeled data.

图 5 显示了我们在探索无监督目标过程中所做选择的流程图。总体而言,我们观察到性能上最显著的差异是:在预训练中,去噪目标优于语言建模和去洗牌目标。在我们探索的许多去噪目标变体之间,我们没有观察到显著差异。然而,不同的目标(或目标的不同参数化)可能导致不同的序列长度,从而导致不同的训练速度。这意味着在我们这里考虑的去噪目标之间进行选择时,主要应根据其计算成本来进行。我们的结果还表明,对于我们考虑的任务和模型,进一步探索与此处类似的目标可能不会带来显著的收益。相反,探索完全不同的利用未标注数据的方式也许会有所收获。

3.4. Pre-training Data set

3.4. 预训练数据集

Like the unsupervised objective, the pre-training data set itself is a crucial component of the transfer learning pipeline. However, unlike objectives and benchmarks, new pre-training data sets are usually not treated as significant contributions on their own and are often not released alongside pre-trained models and code. Instead, they are typically introduced in the course of presenting a new method or model. As a result, there has been relatively little comparison of different pre-training data sets as well as a lack of a “standard” data set used for pre-training. Some recent notable exceptions (Baevski et al., 2019; Liu et al., 2019c; Yang et al., 2019) have compared pre-training on a new large (often Common Crawl-sourced) data set to using a smaller preexisting data set (often Wikipedia). To probe more deeply into the impact of the pre-training data set on performance, in this section we compare variants of our C4 data set and other potential sources of pre-training data. We release all of the C4 data set variants we consider as part of TensorFlow Datasets.

与无监督目标一样,预训练数据集本身是迁移学习管道的关键组成部分。然而,与目标和基准不同,新的预训练数据集通常不被视为独立的重要贡献,并且通常不会与预训练模型和代码一起发布。相反,它们通常是在介绍新方法或模型的过程中引入的。因此,不同预训练数据集之间的比较相对较少,也没有一个用于预训练的“标准”数据集。一些最近值得注意的例外 (Baevski et al., 2019; Liu et al., 2019c; Yang et al., 2019) 比较了在新的大型(通常是 Common Crawl 来源)数据集上进行预训练与使用较小的现有数据集(通常是 Wikipedia)进行预训练的效果。为了更深入地探究预训练数据集对性能的影响,在本节中我们将比较我们 C4 数据集的不同版本以及其他潜在的预训练数据来源。我们将所有考虑的 C4 数据集版本作为 TensorFlow Datasets 的一部分发布。

3.4.1. Unlabeled Data Sets

3.4.1. 未标注数据集

In creating C4, we developed various heuristics to filter the web-extracted text from Common Crawl (see Section 2.2 for a description). We are interested in measuring whether this filtering results in improved performance on downstream tasks, in addition to comparing it to other filtering approaches and common pre-training data sets. Towards this end, we compare the performance of our baseline model after pre-training on the following data sets:

在创建 C4 时,我们开发了多种启发式方法来过滤从 Common Crawl 提取的文本(详见第 2.2 节)。我们感兴趣的是测量这种过滤是否能在下游任务中提高性能,此外还要将其与其他过滤方法和常见的预训练数据集进行比较。为此,我们比较了基线模型在以下数据集上预训练后的性能:

C4 As a baseline, we first consider pre-training on our proposed unlabeled data set as described in Section 2.2.

C4 作为基线,我们首先考虑在第 2.2 节中描述的我们提出的未标注数据集上进行预训练。

Unfiltered C4 To measure the effect of the heuristic filtering we used in creating C4 (deduplication, removing bad words, only retaining sentences, etc.), we also generate an alternate version of C4 that forgoes this filtering. Note that we still use langdetect to extract English text. As a result, our “unfiltered” variant still includes some filtering because langdetect sometimes assigns a low probability to non-natural English text.

未过滤的 C4

为了测量我们在创建 C4 时使用的启发式过滤(去重、移除不良词汇、仅保留句子等)的效果,我们还生成了一个不进行这种过滤的 C4 替代版本。请注意,我们仍然使用 langdetect 来提取英文文本。因此,我们的“未过滤”变体仍然包含一些过滤,因为 langdetect 有时会对非自然的英文文本分配较低的概率。
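作为示意,下面用 langdetect 库实现一个非常简化的英文过滤器;0.99 的概率阈值来自 C4 构建过程的描述,但函数名与整体用法只是一个粗略的草稿,并非实际的 C4 流水线:

```python
from langdetect import detect_langs

def probably_english(text, threshold=0.99):
    """若 langdetect 判定文本为英文的概率不低于阈值,则保留该页面。"""
    try:
        return any(lang.lang == "en" and lang.prob >= threshold
                   for lang in detect_langs(text))
    except Exception:
        return False  # 文本过短或无法检测时直接丢弃

print(probably_english("Thank you for inviting me to your party last week."))
```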

RealNews-like Recent work has used text data extracted from news websites (Zellers et al., 2019; Baevski et al., 2019). To compare to this approach, we generate another unlabeled data set by additionally filtering C4 to only include content from one of the domains used in the “RealNews” data set (Zellers et al., 2019). Note that for ease of comparison, we retain the heuristic filtering methods used in C4; the only difference is that we have ostensibly omitted any non-news content.

类似 RealNews:近期工作使用了从新闻网站提取的文本数据 (Zellers et al., 2019; Baevski et al., 2019)。为了与这种方法进行比较,我们通过进一步过滤 C4、仅保留来自 “RealNews” 数据集 (Zellers et al., 2019) 所用域名之一的内容,生成了另一个未标注数据集。请注意,为了便于比较,我们保留了 C4 中使用的启发式过滤方法;唯一的区别是我们基本上省略了所有非新闻内容。

Table 8: Performance resulting from pre-training on different data sets. The first four variants are based on our new C4 data set.

表 8: 在不同数据集上预训练的结果。前四个变体基于我们新的 C4 数据集。

| 数据集 | 大小 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ★ C4 | 745GB | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| C4, unfiltered | 6.1TB | 81.46 | 19.14 | 78.78 | 68.04 | 26.55 | 39.34 | 27.21 |
| RealNews-like | 35GB | 83.83 | 19.23 | 80.39 | 72.38 | 26.75 | 39.90 | 27.48 |
| WebText-like | 17GB | 84.03 | 19.31 | 81.42 | 71.40 | 26.80 | 39.74 | 27.59 |
| Wikipedia | 16GB | 81.85 | 19.31 | 81.29 | 68.01 | 26.94 | 39.69 | 27.67 |
| Wikipedia + TBC | 20GB | 83.65 | 19.28 | 82.08 | 73.24 | 26.77 | 39.63 | 27.57 |

WebText-like Similarly, the WebText data set (Radford et al., 2019) only uses content from webpages that were submitted to the content aggregation website Reddit and received a “score” of at least 3. The score for a webpage submitted to Reddit is computed based on the proportion of users who endorse (upvote) or oppose (downvote) the webpage. The idea behind using the Reddit score as a quality signal is that users of the site would only upvote high-quality text content. To generate a comparable data set, we first tried removing all content from C4 that did not originate from a URL that appeared in the list prepared by the Open Web Text effort. However, this resulted in comparatively little content—only about 2 GB—because most pages never appear on Reddit. Recall that C4 was created based on a single month of Common Crawl data. To avoid using a prohibitively small data set, we therefore downloaded 12 months of data from Common Crawl from August 2018 to July 2019, applied our heuristic filtering for C4, then applied the Reddit filter. This produced a 17 GB WebText-like data set, which is of comparable size to the original 40GB WebText data set (Radford et al., 2019).

类似 WebText:WebText 数据集 (Radford et al., 2019) 仅使用提交到内容聚合网站 Reddit 且获得至少 3 分“评分”的网页内容。Reddit 上提交的网页评分是根据支持(点赞)或反对(点踩)该网页的用户比例计算得出的。使用 Reddit 评分作为质量信号的想法是,网站用户只会给高质量的文本内容点赞。为了生成一个可比较的数据集,我们首先尝试移除 C4 中所有并非来自 Open Web Text 项目整理的 URL 列表中的内容。然而,这只产生了相对较少的内容(仅约 2 GB),因为大多数页面从未出现在 Reddit 上。回想一下,C4 是基于单个月份的 Common Crawl 数据创建的。为了避免使用过小的数据集,我们因此下载了从 2018 年 8 月到 2019 年 7 月共 12 个月的 Common Crawl 数据,先应用了 C4 的启发式过滤,然后应用了 Reddit 过滤器。这产生了一个 17 GB 的类似 WebText 的数据集,其大小与原始的 40 GB WebText 数据集 (Radford et al., 2019) 相当。

Wikipedia The website Wikipedia consists of millions of encyclopedia articles written collaboratively. The content on the site is subject to strict quality guidelines and therefore has been used as a reliable source of clean and natural text. We use the English Wikipedia text data from TensorFlow Datasets, which omits any markup or reference sections from the articles.

维基百科网站由数百万篇协作撰写的百科全书文章组成。网站上的内容受严格的质量指南约束,因此一直被用作可靠、干净且自然的文本来源。我们使用来自 TensorFlow Datasets 的英文维基百科文本数据,这些数据省略了文章中的任何标记或参考部分。

Wikipedia + Toronto Books Corpus A drawback of using pre-training data from Wikipedia is that it represents only one possible domain of natural text (encyclopedia articles). To mitigate this, BERT (Devlin et al., 2018) combined data from Wikipedia with the Toronto Books Corpus (TBC) (Zhu et al., 2015). TBC contains text extracted from eBooks, which represents a different domain of natural language. BERT’s popularity has led to the Wikipedia + TBC combination being used in many subsequent works.

维基百科 + 多伦多书籍语料库
使用来自维基百科的预训练数据的一个缺点是,它仅代表自然文本的一种可能领域(百科全书文章)。为了解决这个问题,BERT (Devlin et al., 2018) 将来自维基百科的数据与多伦多书籍语料库 (TBC) (Zhu et al., 2015) 结合。TBC 包含从电子书中提取的文本,代表了自然语言的另一个领域。BERT 的流行使得维基百科 + TBC 的组合在许多后续工作中被采用。

The results achieved after pre-training on each of these data sets is shown in Table 8. A first obvious takeaway is that removing the heuristic filtering from C4 uniformly degrades performance and makes the unfiltered variant perform the worst in every task. Beyond this, we found that in some cases a pre-training data set with a more constrained domain outperformed the diverse C4 data set. For example, using the Wikipedia + TBC corpus produced a SuperGLUE score of 73.24, beating our baseline’s score (using C4) of 71.36. This is almost entirely attributable to a boost in performance from 25.78 (baseline, C4) to 50.93 (Wikipedia + TBC) on the Exact Match score for MultiRC (see Table 16). MultiRC is a reading comprehension data set whose largest source of data comes from fiction books, which is exactly the domain covered by TBC. Similarly, using the RealNews-like data set for pre-training conferred an increase from 68.16 to 73.72 on the Exact Match score for ReCoRD, a data set that measures reading comprehension on news articles. As a final example, using data from Wikipedia produced significant (but less dramatic) gains on SQuAD, which is a question-answering data set with passages sourced from Wikipedia. Similar observations have been made in prior work, e.g. Beltagy et al. (2019) found that pre-training BERT on text from research papers improved its performance on scientific tasks. The main lesson behind these findings is that pre-training on in-domain unlabeled data can improve performance on downstream tasks. This is unsurprising but also unsatisfying if our goal is to pre-train a model that can rapidly adapt to language tasks from arbitrary domains. Liu et al. (2019c) also observed that pre-training on a more diverse data set yielded improvements on downstream tasks. This observation also motivates the parallel line of research on domain adaptation for natural language processing; for surveys of this field see e.g. Ruder (2019); Li (2012).

在这些数据集上分别进行预训练后取得的结果如表 8 所示。一个明显的结论是,从 C4 中移除启发式过滤会一致地降低性能,并使未过滤的变体在每个任务中表现最差。除此之外,我们发现,在某些情况下,具有更受限领域的预训练数据集比多样化的 C4 数据集表现更好。例如,使用 Wikipedia + TBC 语料库产生的 SuperGLUE 得分为 73.24,超过了我们基线(使用 C4)的得分 71.36。这几乎完全归因于 MultiRC 的精确匹配得分从 25.78(基线,C4)提升到 50.93(Wikipedia + TBC)(见表 16)。MultiRC 是一个阅读理解数据集,其最大的数据来源是小说书籍,这正是 TBC 所涵盖的领域。同样,使用类似 RealNews 的数据集进行预训练使得 ReCoRD 的精确匹配得分从 68.16 提升到 73.72,ReCoRD 是一个衡量新闻文章阅读理解的数据集。作为最后一个例子,使用来自 Wikipedia 的数据在 SQuAD 上产生了显著(但不那么戏剧性)的增益,SQuAD 是一个从 Wikipedia 获取段落的问题回答数据集。类似的现象在先前的工作中也有观察到,例如 Beltagy 等 (2019) 发现,在研究论文文本上预训练 BERT 可以提高其在科学任务上的性能。这些发现的主要教训是,在领域内未标注的数据上进行预训练可以提高下游任务的性能。这并不令人惊讶,但如果我们的目标是预训练一个能够快速适应任意领域语言任务的模型,这也是不令人满意的。Liu 等 (2019c) 也观察到,在更多样化的数据集上进行预训练可以在下游任务上带来改进。这一观察也推动了自然语言处理领域自适应的平行研究;关于该领域的综述,请参见 Ruder (2019);Li (2012)。

A drawback to only pre-training on a single domain is that the resulting data sets are often substantially smaller. Similarly, while the WebText-like variant performed as well or better than the C4 data set in our baseline setting, the Reddit-based filtering produced a data set that was about 40 $\times$ smaller than C4 despite being based on $12\times$ more data from Common Crawl. Note, however, that in our baseline setup we only pre-train on $2^{35}\approx34\mathrm{B}$ tokens, which is only about 8 times larger than the smallest pre-training data set we consider. We investigate at what point using a smaller pre-training data sets poses an issue in the following section.

仅在一个领域进行预训练的缺点是,生成的数据集通常会小得多。同样,虽然类似 WebText 的变体在我们的基准设置中表现得与 C4 数据集一样好或更好,但基于 Reddit 的过滤产生的数据集比 C4 小约 40 倍,尽管它基于来自 Common Crawl 的 12 倍更多的数据。然而,请注意,在我们的基准设置中,我们只预训练了 $2^{35}\approx34\mathrm{B}$ 个 Token,这仅比我们考虑的最小预训练数据集大 8 倍左右。我们在以下部分研究使用较小的预训练数据集在什么情况下会成为一个问题。

3.4.2. Pre-training Data set Size

3.4.2. 预训练数据集大小

The pipeline we use to create C4 was designed to be able to create extremely large pretraining data sets. The access to so much data allows us to pre-train our models without repeating examples. It is not clear whether repeating examples during pre-training would be helpful or harmful to downstream performance because our pre-training objective is itself stochastic and can help prevent the model from seeing the same exact data multiple times.

我们用于创建 C4 的管道旨在能够创建极其庞大的预训练数据集。访问如此多的数据使我们能够在不重复示例的情况下预训练模型。目前尚不清楚在预训练期间重复示例是有益还是有害于下游性能,因为我们的预训练目标本身是随机的,可以帮助防止模型多次看到完全相同的数据。

To test the effect of limited unlabeled data set sizes, we pre-trained our baseline model on artificially truncated versions of C4. Recall that we pre-train our baseline model on $2^{35}\approx34\mathrm{B}$ tokens (a small fraction of the total size of C4). We consider training on truncated variants of C4 consisting of $2^{29}$ , $2^{27}$ , $2^{25}$ and $2^{23}$ tokens. These sizes correspond to repeating the data set 64, 256, 1,024, and 4,096 times respectively over the course of pre-training.

为了测试有限未标注数据集大小的影响,我们在人工截断版本的 C4 上预训练了我们的基准模型。回想一下,我们在大约 $2^{35}\approx34\mathrm{B}$ 个 Token (C4 总大小的一小部分) 上预训练了基准模型。我们考虑在包含 $2^{29}$、$2^{27}$、$2^{25}$ 和 $2^{23}$ 个 Token 的 C4 截断版本上进行训练。这些大小分别对应于在整个预训练过程中重复数据集 64、256、1,024 和 4,096 次。
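重复次数等于预训练的 Token 总预算除以截断后的数据集大小,可以直接验证表 9 中的数值:

```python
budget = 2 ** 35  # 预训练总 Token 数(约 34B)
for truncated in (2 ** 29, 2 ** 27, 2 ** 25, 2 ** 23):
    print(f"{truncated} tokens -> {budget // truncated} 次重复")
```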

Table 9: Measuring the effect of repeating data during pre-training. In these experiments, we only use the first $N$ tokens from C4 (with varying values of $N$ shown in the first column) but still pre-train over $2^{35}$ tokens. This results in the data set being repeated over the course of pre-training (with the number of repeats for each experiment shown in the second column), which may result in memorization (see Figure 6).

表 9: 测量预训练期间重复数据的效果。在这些实验中,我们仅使用 C4 的前 N 个 Token (第一列显示了不同的 N 值),但仍然预训练超过 $2^{35}$ 个 Token。这导致在整个预训练过程中数据集被重复 (每个实验的重复次数显示在第二列),这可能导致记忆效应 (见图 6)。

| Token 数量 | 重复次数 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ★ 完整数据集 | 0 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| $2^{29}$ | 64 | 82.87 | 19.19 | 80.97 | 72.03 | 26.83 | 39.74 | 27.63 |
| $2^{27}$ | 256 | 82.62 | 19.20 | 79.78 | 69.97 | 27.02 | 39.71 | 27.33 |
| $2^{25}$ | 1,024 | 79.55 | 18.57 | 76.27 | 64.76 | 26.38 | 39.56 | 26.80 |
| $2^{23}$ | 4,096 | 76.34 | 18.33 | 70.92 | 59.29 | 26.37 | 38.84 | 25.81 |

The resulting downstream performance is shown in Table 9. As expected, performance degrades as the data set size shrinks. We suspect this may be due to the fact that the model begins to memorize the pre-training data set. To measure if this is true, we plot the training loss for each of these data set sizes in Figure 6. Indeed, the model attains significantly smaller training losses as the size of the pre-training data set shrinks, suggesting possible memorization. Baevski et al. (2019) similarly observed that truncating the pre-training data set size can degrade downstream task performance.

由此产生的下游性能如表 9 所示。如预期的那样,随着数据集大小的缩小,性能下降。我们怀疑这可能是由于模型开始记忆预训练数据集。为了测量这一假设是否正确,我们在图 6 中绘制了每个数据集大小的训练损失。确实,随着预训练数据集大小的缩小,模型获得的训练损失显著减小,这表明可能存在记忆现象。Baevski 等 (2019) 同样观察到截断预训练数据集大小会降低下游任务的性能。

We note that these effects are limited when the pre-training data set is repeated only 64 times. This suggests that some amount of repetition of pre-training data might not be harmful. However, given that additional pre-training can be beneficial (as we will show in Section 3.6) and that obtaining additional unlabeled data is cheap and easy, we suggest using large pre-training data sets whenever possible. We also note that this effect may be more pronounced for larger model sizes, i.e. a bigger model may be more prone to overfitting to a smaller pre-training data set.

我们注意到,当预训练数据集仅重复 64 次时,这些效果是有限的。这表明一定程度的预训练数据重复可能无害。然而,鉴于额外的预训练可能是有益的(如我们将在第 3.6 节中展示)以及获取额外的未标注数据既便宜又容易,我们建议在可能的情况下使用较大的预训练数据集。我们还注意到,对于更大的模型尺寸,这种效应可能更为明显,即更大的模型可能更容易对较小的预训练数据集过拟合。

3.5. Training Strategy

3.5. 训练策略

So far we have considered the setting where all parameters of a model are pre-trained on an unsupervised task before being fine-tuned on individual supervised tasks. While this approach is straightforward, various alternative methods for training the model on downstream/supervised tasks have been proposed. In this section, we compare different schemes for fine-tuning the model in addition to the approach of training the model simultaneously on multiple tasks.

到目前为止,我们考虑的设置是:先在无监督任务上预训练模型的所有参数,然后再针对各个有监督任务分别进行微调。虽然这种方法简单直接,但人们也提出了多种在下游/有监督任务上训练模型的替代方法。在本节中,我们除了比较不同的微调方案外,还比较了同时在多个任务上训练模型的方法。

3.5.1. Fine-tuning Methods

3.5.1. 微调方法 (Fine-tuning Methods)

It has been argued that fine-tuning all of the model’s parameters can lead to suboptimal results, particularly on low-resource tasks (Peters et al., 2019). Early results on transfer learning for text classification tasks advocated fine-tuning only the parameters of a small classifier that was fed sentence embeddings produced by a fixed pre-trained model (Subramanian et al., 2018; Kiros et al., 2015; Logeswaran and Lee, 2018; Hill et al., 2016; Conneau et al., 2017). This approach is less applicable to our encoder-decoder model because the entire decoder must be trained to output the target sequences for a given task. Instead, we focus on two alternative fine-tuning approaches that update only a subset of the parameters of our encoder-decoder model.

有观点认为,微调模型的所有参数可能会导致次优结果,特别是在低资源任务上 (Peters et al., 2019)。早期关于文本分类任务迁移学习的工作主张仅微调一个小型分类器的参数,该分类器以固定的预训练模型生成的句子嵌入作为输入 (Subramanian et al., 2018; Kiros et al., 2015; Logeswaran and Lee, 2018; Hill et al., 2016; Conneau et al., 2017)。这种方法不太适用于我们的编码器-解码器模型,因为必须训练整个解码器才能输出给定任务的目标序列。因此,我们专注于两种只更新编码器-解码器模型部分参数的替代微调方法。


Figure 6: Pre-training loss for our original C4 data set as well as 4 artificially truncated versions. The sizes listed refer to the number of tokens in each data set. The four sizes considered correspond to repeating the data set between 64 and 4,096 times over the course of pre-training. Using a smaller data set size results in smaller training loss values, which may suggest some memorization of the unlabeled data set.

图 6: 我们原始 C4 数据集以及 4 个经过人工截断的版本的预训练损失。所列的大小指的是每个数据集中的 Token 数量。考虑的四个大小对应于在预训练过程中重复数据集 64 到 4,096 次。使用较小的数据集会导致较小的训练损失值,这可能表明对未标记数据集存在一定程度的记忆。

The first, “adapter layers” (Houlsby et al., 2019; Bapna et al., 2019), is motivated by the goal of keeping most of the original model fixed while fine-tuning. Adapter layers are additional dense-ReLU-dense blocks that are added after each of the preexisting feed-forward networks in each block of the Transformer. These new feed-forward networks are designed so that their output dimensionality matches their input. This allows them to be inserted into the network with no additional changes to the structure or parameters. When fine-tuning, only the adapter layer and layer normalization parameters are updated. The main hyperparameter of this approach is the inner dimensionality $d$ of the feed-forward network, which changes the number of new parameters added to the model. We experiment with various values for $d$.

第一种方法是“适配器层” (Houlsby et al., 2019; Bapna et al., 2019),其动机是在微调时保持原始模型的大部分结构不变。适配器层是附加的 dense-ReLU-dense 块,添加到 Transformer 每个块中已有的前馈网络之后。这些新的前馈网络设计为输出维度与输入维度匹配。这使得它们可以在不改变网络结构或参数的情况下插入网络。在微调时,仅更新适配器层和层归一化参数。这种方法的主要超参数是前馈网络的内部维度 $d$ ,它决定了添加到模型的新参数数量。我们对不同的 $d$ 值进行了实验。
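
To make the structure of an adapter block concrete, here is a minimal NumPy sketch of the dense-ReLU-dense block described above. The residual connection and the zero initialization of the up-projection are common choices from the adapter literature (Houlsby et al., 2019) rather than details specified in this section, and the shapes are purely illustrative.

```python
import numpy as np

def adapter_block(x, w_down, w_up):
    """Dense -> ReLU -> dense with a residual connection, so the output has the
    same dimensionality (d_model) as the input and can be inserted after each
    feed-forward network of the Transformer."""
    hidden = np.maximum(x @ w_down, 0.0)   # d_model -> d (the inner dimensionality)
    return x + hidden @ w_up               # d -> d_model, plus residual

# Illustrative shapes only; d is the hyperparameter swept in Table 10.
d_model, d = 768, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d_model))              # 4 token representations
w_down = 0.02 * rng.standard_normal((d_model, d))  # small random down-projection
w_up = np.zeros((d, d_model))                      # zero init: adapter starts as identity
print(adapter_block(x, w_down, w_up).shape)        # (4, 768)
```

During fine-tuning, only these adapter weights (together with the layer normalization parameters) would be updated, while the rest of the Transformer stays frozen.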

The second alternative fine-tuning method we consider is “gradual unfreezing” (Howard and Ruder, 2018). In gradual unfreezing, more and more of the model’s parameters are finetuned over time. Gradual unfreezing was originally applied to a language model architecture consisting of a single stack of layers. In this setting, at the start of fine-tuning only the parameters of the final layer are updated, then after training for a certain number of updates the parameters of the second-to-last layer are also included, and so on until the entire network’s parameters are being fine-tuned. To adapt this approach to our encoder-decoder model, we gradually unfreeze layers in the encoder and decoder in parallel, starting from the top in both cases. Since the parameters of our input embedding matrix and output classification matrix are shared, we update them throughout fine-tuning. Recall that our baseline model consists of 12 layers each in the encoder and decoder and is fine-tuned for

我们考虑的第二种替代微调方法是“逐步解冻” (gradual unfreezing) (Howard and Ruder, 2018)。在逐步解冻中,随着训练的进行,越来越多的模型参数参与微调。逐步解冻最初应用于由单个层堆栈组成的语言模型架构。在这种设置下,微调开始时只更新最后一层的参数,经过一定数量的更新后,倒数第二层的参数也被纳入,依此类推,直到整个网络的参数都参与微调。为了将这种方法应用于我们的编码器-解码器模型,我们在编码器和解码器中并行地逐步解冻各层,两者都从顶层开始。由于输入嵌入矩阵和输出分类矩阵的参数是共享的,我们在整个微调过程中始终更新它们。回顾一下,我们的基准模型的编码器和解码器各由 12 层组成,并会进行 $2^{18}$ 步的微调。

Table 10: Comparison of different alternative fine-tuning methods that only update a subset of the model’s parameters. For adapter layers, $d$ refers to the inner dimensionality of the adapters.

表 10: 不同替代微调方法的比较,这些方法仅更新模型参数的子集。对于适配器层,$d$ 指的是适配器的内部维度。

| 微调方法 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ★ 所有参数 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 适配器层, $d=32$ | 80.52 | 15.08 | 79.32 | 60.40 | 13.84 | 17.88 | 15.54 |
| 适配器层, $d=128$ | 81.51 | 16.62 | 79.47 | 63.03 | 19.83 | 27.50 | 22.63 |
| 适配器层, $d=512$ | 81.54 | 17.78 | 79.18 | 64.30 | 23.45 | 33.98 | 25.81 |
| 适配器层, $d=2048$ | 81.51 | 16.62 | 79.47 | 63.03 | 19.83 | 27.50 | 22.63 |
| 逐步解冻 | 82.50 | 18.95 | 79.17 | 70.79 | 26.71 | 39.02 | 26.93 |

$2^{18}$ steps. As such, we subdivide the fine-tuning process into 12 episodes of $2^{18}/12$ steps each and train from layers $12-n$ to 12 in the $n$th episode. We note that Howard and Ruder (2018) suggested fine-tuning an additional layer after each epoch of training. However, since our supervised data sets vary so much in size and since some of our downstream tasks are actually mixtures of many tasks (GLUE and SuperGLUE), we instead adopt the simpler strategy of fine-tuning an additional layer after every $2^{18}/12$ steps.

因此,我们将微调过程细分为 12 个阶段,每个阶段包含 $2^{18}/12$ 步,并在第 $n$ 个阶段训练第 $12-n$ 层到第 12 层。我们注意到,Howard 和 Ruder (2018) 建议在每个训练轮次 (epoch) 结束后再多微调一层。然而,由于我们的有监督数据集大小差异很大,且一些下游任务实际上是多个任务的混合(如 GLUE 和 SuperGLUE),我们改为采用更简单的策略:每 $2^{18}/12$ 步额外微调一层。
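
The episode bookkeeping described above can be summarized in a few lines of Python. The sketch below is an illustration, not the authors' implementation: it assumes the first episode updates only the top layer, so that in episode $n$ the top $n$ layers of each stack are trainable; the exact off-by-one convention in the paper's layer indexing may differ.

```python
# Gradual unfreezing schedule: 2**18 fine-tuning steps split into 12 episodes;
# in episode n, the top n layers of both the encoder and the decoder (plus the
# shared embedding/classification matrix) are trainable.
TOTAL_STEPS = 2 ** 18
NUM_LAYERS = 12
EPISODE_LENGTH = TOTAL_STEPS // NUM_LAYERS   # about 2**18 / 12 steps per episode

def unfrozen_layers(step):
    """Return the 1-based indices (counted from the bottom) of the layers
    updated at a given fine-tuning step."""
    episode = min(step // EPISODE_LENGTH + 1, NUM_LAYERS)
    return list(range(NUM_LAYERS, NUM_LAYERS - episode, -1))

print(unfrozen_layers(0))                # [12]: only the top layer at first
print(unfrozen_layers(EPISODE_LENGTH))   # [12, 11]: the second episode adds layer 11
print(unfrozen_layers(TOTAL_STEPS - 1))  # all 12 layers in the final episode
```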

A comparison of the performance of these fine-tuning approaches is shown in Table 10. For adapter layers, we report the performance using an inner dimensionality $d$ of 32, 128, 512, and 2048. Consistent with past results (Houlsby et al., 2019; Bapna et al., 2019), we find that lower-resource tasks like SQuAD work well with a small value of $d$ whereas higher-resource tasks require a large dimensionality to achieve reasonable performance. This suggests that adapter layers could be a promising technique for fine-tuning on fewer parameters as long as the dimensionality is scaled appropriately to the task size. Note that in our case we treat GLUE and SuperGLUE each as a single “task” by concatenating their constituent data sets, so although they comprise some low-resource data sets the combined data set is large enough that it necessitates a large value of $d$. We found that gradual unfreezing caused a minor degradation in performance across all tasks, though it did provide some speedup during fine-tuning. Better results may be attainable by more carefully tuning the unfreezing schedule.

这些微调方法的性能比较见表 10。对于适配器层,我们报告了使用内部维度 $d$ 为 32、128、512 和 2048 时的性能。与以往的结果一致 (Houlsby et al., 2019; Bapna et al., 2019),我们发现像 SQuAD 这样的低资源任务使用较小的 $d$ 值就能表现良好,而高资源任务则需要较大的维度才能达到合理的性能。这表明只要维度根据任务规模适当调整,适配器层可能是在较少参数上进行微调的一种有前途的技术。请注意,在我们的情况下,我们通过连接其组成数据集,将 GLUE 和 SuperGLUE 各自视为一个单独的“任务”,因此尽管它们包含一些低资源数据集,但合并后的数据集足够大,以至于需要较大的 $d$ 值。我们发现逐步解冻在所有任务中导致了轻微的性能下降,尽管它确实在微调期间提供了一些加速。通过更仔细地调整解冻计划,可能会获得更好的结果。

3.5.2. Multi-task Learning

3.5.2. 多任务学习

So far, we have been pre-training our model on a single unsupervised learning task before fine-tuning it individually on each downstream task. An alternative approach, called “multitask learning” (Ruder, 2017; Caruana, 1997), is to train the model on multiple tasks at a time. This approach typically has the goal of training a single model that can simultaneously perform many tasks at once, i.e. the model and most of its parameters are shared across all tasks. We relax this goal somewhat and instead investigate methods for training on multiple tasks at once in order to eventually produce separate parameter settings that perform well on each individual task. For example, we might train a single model on many tasks, but when reporting performance we are allowed to select a different checkpoint for each task. This loosens the multi-task learning framework and puts it on more even footing compared to the pre-train-then-fine-tune approach we have considered so far. We also note that in our unified text-to-text framework, “multi-task learning” simply corresponds to mixing data sets together. It follows that we can still train on unlabeled data when using multi-task learning by treating the unsupervised task as one of the tasks being mixed together. In contrast, most applications of multi-task learning to NLP add task-specific classification networks or use different loss functions for each task (Liu et al., 2019b).

到目前为止,我们一直在单个无监督学习任务上预训练我们的模型,然后在每个下游任务上分别微调它。另一种方法称为“多任务学习” (multitask learning) (Ruder, 2017; Caruana, 1997),即同时在多个任务上训练模型。这种方法通常旨在训练一个单一的模型,使其能够同时执行多个任务,即模型及其大部分参数在所有任务之间共享。我们稍微放宽了这一目标,转而研究同时在多个任务上训练的方法,以最终为每个单独任务产生表现良好的不同参数设置。例如,我们可能会用许多任务训练一个模型,但在报告性能时,我们可以为每个任务选择不同的检查点。这放松了多任务学习框架,并使其与我们迄今为止考虑的先预训练后微调的方法更加可比。我们还注意到,在我们统一的文本到文本框架中,“多任务学习”只是将数据集混合在一起。因此,在使用多任务学习时,我们仍然可以通过将无监督任务作为被混合的任务之一,继续在未标注数据上进行训练。相比之下,大多数将多任务学习应用于自然语言处理 (NLP) 的工作会添加特定于任务的分类网络,或为每个任务使用不同的损失函数 (Liu et al., 2019b)。

As pointed out by Arivazhagan et al. (2019), an extremely important factor in multi-task learning is how much data from each task the model should be trained on. Our goal is to not under- or over-train the model—that is, we want the model to see enough data from a given task that it can perform the task well, but not to see so much data that it memorizes the training set. How exactly to set the proportion of data coming from each task can depend on various factors including data set sizes, the “difficulty” of learning the task (i.e. how much data the model must see before being able to perform the task effectively), regularization, etc. An additional issue is the potential for “task interference” or “negative transfer”, where achieving good performance on one task can hinder performance on another. Given these concerns, we begin by exploring various strategies for setting the proportion of data coming from each task. A similar exploration was performed by Wang et al. (2019a).

如 Arivazhagan 等 (2019) 所指出的,在多任务学习中一个极为重要的因素是模型应该在每个任务上训练多少数据。我们的目标是既不要欠训练也不要过训练模型——也就是说,我们希望模型能够看到足够多来自给定任务的数据,以便能够很好地执行该任务,但又不能看到太多数据以至于记住训练集。如何精确设置来自每个任务的数据比例可以取决于各种因素,包括数据集大小、学习任务的“难度”(即模型在能够有效执行任务之前必须看到多少数据)、正则化等。另一个问题是可能出现的“任务干扰”或“负迁移”,即在一个任务上取得良好性能可能会妨碍在另一个任务上的表现。鉴于这些考虑,我们首先探索为每个任务设置数据比例的各种策略。Wang 等 (2019a) 也进行了类似的探索。

Examples-proportional mixing A major factor in how quickly a model will overfit to a given task is the task’s data set size. As such, a natural way to set the mixing proportions is to sample in proportion to the size of each task’s data set. This is equivalent to concatenating the data sets for all tasks and randomly sampling examples from the combined data set. Note, however, that we are including our unsupervised denoising task, which uses a data set that is orders of magnitude larger than every other task’s. It follows that if we simply sample in proportion to each data set’s size, the vast majority of the data the model sees will be unlabeled, and it will undertrain on all of the supervised tasks. Even without the unsupervised task, some tasks (e.g. WMT English to French) are so large that they would similarly crowd out most of the batches. To get around this issue, we set an artificial “limit” on the data set sizes before computing the proportions. Specifically, if the number of examples in each of our $N$ tasks’ data sets is $e_{n}, n\in\{1,\ldots,N\}$, then we set the probability of sampling an example from the $m$th task during training to $r_{m}=\min(e_{m},K)/\sum_{n}\min(e_{n},K)$ where $K$ is the artificial data set size limit.

示例比例混合

一个模型在给定任务上过拟合的速度主要取决于该任务的数据集大小。因此,设置混合比例的一种自然方法是根据每个任务数据集的大小按比例进行采样。这相当于将所有任务的数据集连接起来,并从合并后的数据集中随机采样示例。然而,需要注意的是,我们包括了一个无监督去噪任务,该任务使用的数据集比其他所有任务的数据集大几个数量级。因此,如果我们只是根据每个数据集的大小按比例采样,模型看到的绝大多数数据将是未标注的,并且它将在所有有监督任务上欠训练。即使没有无监督任务,某些任务(例如 WMT 英语到法语)也非常大,以至于它们会占据大多数批次。为了解决这个问题,我们在计算比例之前对数据集大小设置了一个人工“限制”。具体来说,如果我们 $N$ 个任务的数据集中各自的示例数量为 $e_{n}, n\in\{1,\ldots,N\}$,那么我们将训练期间从第 $m$ 个任务中采样示例的概率设置为 $r_{m}=\min(e_{m},K)/\sum_{n}\min(e_{n},K)$,其中 $K$ 是人工设定的数据集大小限制。
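
A small Python sketch of the examples-proportional mixing rule with the artificial limit $K$ is given below. The data set sizes are made-up placeholders chosen only to illustrate how the cap keeps one very large task (such as the unsupervised denoising task) from dominating the mixture; they are not the actual sizes used in the paper.

```python
# Examples-proportional mixing with an artificial size limit K:
#   r_m = min(e_m, K) / sum_n min(e_n, K)
def mixing_rates(example_counts, limit):
    capped = [min(e, limit) for e in example_counts]
    total = sum(capped)
    return [c / total for c in capped]

# Hypothetical example counts: a huge unsupervised task, a large translation
# task, a mid-sized task, and a low-resource task (numbers are illustrative).
example_counts = [10_000_000_000, 40_000_000, 400_000, 8_000]
for limit in (2 ** 16, 2 ** 21):
    rates = mixing_rates(example_counts, limit)
    print(f"K = 2^{limit.bit_length() - 1}:", [round(r, 4) for r in rates])
# Without the cap, the first task would receive ~99.6% of the samples; with K
# the smaller supervised tasks get a non-negligible share of each batch.
```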

Temperature-scaled mixing An alternative way of mitigating the huge disparity between data set sizes is to adjust the “temperature” of the mixing rates. This approach was used by multilingual BERT to ensure that the model was sufficiently trained on low-resource languages. To implement temperature scaling with temperature $T$, we raise each task’s mixing rate $r_{m}$ to the power of $1/T$ and renormalize the rates so that they sum to 1. When $T=1$, this approach is equivalent to examples-proportional mixing and as $T$ increases the proportions become closer to equal mixing. We retain the data set size limit $K$ (applied to obtain $r_{m}$ before temperature scaling) but set it to a large value of $K=2^{21}$. We use a large