Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
探索统一的文本到文本 Transformer (Text-to-Text Transformer) 的迁移学习极限
Colin Raffel∗ Noam Shazeer∗ Adam Roberts∗ Katherine Lee∗ Sharan Narang Michael Matena Yanqi Zhou Wei Li Peter J. Liu
craffel@gmail.com noam@google.com adarob@google.com katherinelee@google.com sharannarang@google.com mmatena@google.com yanqiz@google.com mweili@google.com peterjliu@google.com
Google, Mountain View, CA 94043, USA
谷歌,山景城,加利福尼亚州 94043,美国
Editor: Ivan Titov
编辑:Ivan Titov
Abstract
摘要
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
迁移学习,即模型首先在一个数据丰富的任务上进行预训练,然后再在下游任务上进行微调,已成为自然语言处理 (NLP) 中的一种强大技术。迁移学习的有效性催生了多种方法、方法论和实践。在本文中,我们通过引入一个统一的框架,将所有基于文本的语言问题转换为文本到文本格式,来探讨 NLP 中的迁移学习技术。我们的系统研究比较了数十个语言理解任务上的预训练目标、架构、未标注数据集、迁移方法和其他因素。通过结合我们在探索中获得的见解与规模以及我们新的“Colossal Clean Crawled Corpus”,我们在涵盖总结、问答、文本分类等众多基准测试中取得了最先进的结果。为了促进未来在 NLP 的迁移学习方面的工作,我们发布了我们的数据集、预训练模型和代码。
Keywords: transfer learning, natural language processing, multi-task learning, attention-based models, deep learning
关键词:迁移学习,自然语言处理,多任务学习,基于注意力的模型 (attention-based models),深度学习
1. Introduction
1. 引言
Training a machine learning model to perform natural language processing (NLP) tasks often requires that the model can process text in a way that is amenable to downstream learning. This can be loosely viewed as developing general-purpose knowledge that allows the model to “understand” text. This knowledge can range from low-level (e.g. the spelling or meaning of words) to high-level (e.g. that a tuba is too large to fit in most backpacks). In modern machine learning practice, providing this knowledge is rarely done explicitly; instead, it is often learned as part of an auxiliary task. For example, a historically common approach is to use word vectors (Mikolov et al., 2013b,a; Pennington et al., 2014) to map word identities to a continuous representation where, ideally, similar words map to similar vectors. These vectors are often learned through an objective that, for example, encourages co-occurring words to be positioned nearby in the continuous space (Mikolov et al., 2013b).
训练机器学习模型以执行自然语言处理 (NLP) 任务通常要求该模型能够以有利于下游学习的方式处理文本。这可以宽松地视为开发使模型能够“理解”文本的通用知识。这种知识可以从低级(例如单词的拼写或含义)到高级(例如大号太大而无法放入大多数背包中)。在现代机器学习实践中,提供这种知识很少是显式的;相反,它通常是作为辅助任务的一部分而被学习到的。例如,一种历史上常见的方法是使用词向量 (Mikolov et al., 2013b,a; Pennington et al., 2014) 将词身份映射到连续表示,在理想情况下,相似的词会映射到相似的向量。这些向量通常是通过一个目标函数学习的,例如鼓励共现词在连续空间中彼此靠近 (Mikolov et al., 2013b)。
Recently, it has become increasingly common to pre-train the entire model on a data-rich task. Ideally, this pre-training causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks. In applications of transfer learning to computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014), pre-training is typically done via supervised learning on a large labeled data set like ImageNet (Russakovsky et al., 2015; Deng et al., 2009). In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data. This approach has recently been used to obtain state-of-the-art results in many of the most common NLP benchmarks (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019). Beyond its empirical strength, unsupervised pre-training for NLP is particularly attractive because unlabeled text data is available en masse thanks to the Internet—for example, the Common Crawl project $^2$ produces about 20TB of text data extracted from web pages each month. This is a natural fit for neural networks, which have been shown to exhibit remarkable scalability, i.e. it is often possible to achieve better performance simply by training a larger model on a larger data set (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a).
最近,越来越多的做法是在数据丰富的任务上预训练整个模型。理想情况下,这种预训练使模型能够发展出通用的能力和知识,这些能力与知识可以转移到下游任务中。在计算机视觉领域的迁移学习应用中 (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014),预训练通常通过在大型标注数据集(如 ImageNet (Russakovsky et al., 2015; Deng et al., 2009))上的监督学习完成。相比之下,现代 NLP 领域的迁移学习技术经常使用无监督学习在未标注的数据上进行预训练。这种方法最近被用于在许多常见的 NLP 基准测试中取得最先进的结果 (Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019)。除了其实证优势外,NLP 的无监督预训练特别有吸引力,因为互联网提供了大量未标注的文本数据——例如,Common Crawl 项目 $^2$ 每月生成约 20TB 从网页中提取的文本数据。这非常适合神经网络,研究表明神经网络表现出显著的可扩展性,即通常可以通过在更大的数据集上训练更大的模型来获得更好的性能 (Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Radford et al., 2019; Shazeer et al., 2018; Huang et al., 2018b; Keskar et al., 2019a)。
This synergy has resulted in a great deal of recent work developing transfer learning methodology for NLP, which has produced a wide landscape of pre-training objectives (Howard and Ruder, 2018; Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019), unlabeled data sets (Yang et al., 2019; Liu et al., 2019c; Zellers et al., 2019), benchmarks (Wang et al., 2019b, 2018; Conneau and Kiela, 2018), fine-tuning methods (Howard and Ruder, 2018; Houlsby et al., 2019; Peters et al., 2019), and more. The rapid rate of progress and diversity of techniques in this burgeoning field can make it difficult to compare different algorithms, tease apart the effects of new contributions, and understand the space of existing methods for transfer learning. Motivated by a need for more rigorous understanding, we leverage a unified approach to transfer learning that allows us to systematically study different approaches and push the current limits of the field.
这种协同作用导致了近期在开发自然语言处理的迁移学习方法方面做了大量工作,这产生了广泛的预训练目标 (Howard 和 Ruder, 2018; Devlin 等, 2018; Yang 等, 2019; Dong 等, 2019),无标签数据集 (Yang 等, 2019; Liu 等, 2019c; Zellers 等, 2019),基准测试 (Wang 等, 2019b, 2018; Conneau 和 Kiela, 2018),微调方法 (Howard 和 Ruder, 2018; Houlsby 等, 2019; Peters 等, 2019) 等等。这个新兴领域中技术进步的快速步伐和多样性使得比较不同算法、区分新贡献的效果以及理解现有的迁移学习方法变得困难。出于对更严谨理解的需求,我们采用了一种统一的迁移学习方法,使我们能够系统地研究不同的方法并推动该领域的现有极限。
The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem, i.e. taking text as input and producing new text as output. This approach is inspired by previous unifying frameworks for NLP tasks, including casting all text problems as question answering (McCann et al., 2018), language modeling (Radford et al., 2019), or span extraction Keskar et al. (2019b) tasks. Crucially, the text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider. We leverage this flexibility by evaluating performance on a wide variety of English-based NLP problems, including question answering, document
我们工作的基本思想是将每个文本处理问题视为“文本到文本 (text-to-text)”问题,即以文本作为输入并生成新的文本作为输出。这种方法受到之前统一的自然语言处理任务框架的启发,包括将所有文本问题转化为问答任务 (McCann et al., 2018),语言模型任务 (Radford et al., 2019),或片段抽取任务 Keskar et al. (2019b)。关键在于,文本到文本框架使我们能够直接应用相同的模型、目标函数、训练过程和解码过程来处理每一个考虑的任务。我们利用这种灵活性,在广泛的基于英语的自然语言处理问题上评估性能,包括问答、文档


Figure 1: A diagram of our text-to-text framework. Every task we consider—including translation, question answering, and classification—is cast as feeding our model text as input and training it to generate some target text. This allows us to use the same model, loss function, hyperparameters, etc. across our diverse set of tasks. It also provides a standard testbed for the methods included in our empirical survey. “T5” refers to our model, which we dub the “Text-to-Text Transfer Transformer”.
图 1: 我们的文本到文本框架的示意图。我们考虑的每一项任务——包括翻译、问答和分类——都被视为将文本输入我们的模型并训练它生成目标文本。这使我们能够跨多个不同的任务使用相同的模型、损失函数、超参数等。这也为我们的实证调查中包含的方法提供了一个标准测试平台。“T5”是指我们的模型,我们称之为“文本到文本迁移 Transformer (Text-to-Text Transfer Transformer)”。
summarization, and sentiment classification, to name a few. With this unified approach, we can compare the effectiveness of different transfer learning objectives, unlabeled data sets, and other factors, while exploring the limits of transfer learning for NLP by scaling up models and data sets beyond what has previously been considered.
摘要、情感分类等。通过这种统一的方法,我们可以比较不同的迁移学习目标、未标记数据集和其他因素的有效性,同时通过将模型和数据集扩展到超出以往考虑的规模,来探索迁移学习在自然语言处理 (NLP) 中的极限。
We emphasize that our goal is not to propose new methods but instead to provide a comprehensive perspective on where the field stands. As such, our work primarily comprises a survey, exploration, and empirical comparison of existing techniques. We also explore the limits of current approaches by scaling up the insights from our systematic study (training models up to 11 billion parameters) to obtain state-of-the-art results in many of the tasks we consider. In order to perform experiments at this scale, we introduce the “Colossal Clean Crawled Corpus” (C4), a data set consisting of hundreds of gigabytes of clean English text scraped from the web. Recognizing that the main utility of transfer learning is the possibility of leveraging pre-trained models in data-scarce settings, we release our code, data sets, and pre-trained models.
我们强调,我们的目标不是提出新方法,而是提供一个全面的视角来审视该领域目前所处的位置。因此,我们的工作主要由现有技术的综述、探索和实证比较组成。我们还通过扩大系统研究中的见解(训练模型最多达 110 亿个参数)来探索当前方法的极限,以在我们考虑的许多任务中获得最先进的结果。为了进行这种规模的实验,我们引入了“Colossal Clean Crawled Corpus” (C4),这是一个包含数百 GB 清洁英文文本的数据集,这些文本是从网络上抓取的。认识到迁移学习的主要优势在于能够在数据稀缺的情况下利用预训练模型,我们发布了代码、数据集和预训练模型。
The remainder of the paper is structured as follows: In the following section, we discuss our base model and its implementation, our procedure for formulating every text processing problem as a text-to-text task, and the suite of tasks we consider. In Section 3, we present a large set of experiments that explore the field of transfer learning for NLP. At the end of the section (Section 3.7), we combine insights from our systematic study to obtain state-of-the-art results on a wide variety of benchmarks. Finally, we provide a summary of our results and wrap up with a look towards the future in Section 4.
本文其余部分结构如下:在接下来的部分,我们讨论了我们的基础模型及其实现,我们将每个文本处理问题表述为文本到文本任务的方法,以及我们考虑的任务套件。在第 3 节中,我们展示了一组大规模实验,探索了 NLP 领域的迁移学习。在本节末尾(第 3.7 节),我们结合系统研究的见解,在各种基准测试中获得了最先进的结果。最后,在第 4 节中,我们总结了研究结果,并展望未来。
2. Setup
2. 设置
Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on. We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data. We refer to our model and framework as the “Text-to-Text Transfer Transformer” (T5).
在呈现我们大规模实证研究的结果之前,我们回顾了理解这些结果所需的相关背景知识,包括 Transformer 模型架构以及我们评估的下游任务。我们还介绍了将每个问题视为文本到文本任务的方法,并描述了我们创建的基于 Common Crawl 的未标注文本数据集“Colossal Clean Crawled Corpus” (C4)。我们将模型和框架称为“Text-to-Text Transfer Transformer” (T5)。
2.1. Model
2.1. 模型
Early results on transfer learning for NLP leveraged recurrent neural networks (Peters et al., 2018; Howard and Ruder, 2018), but it has recently become more common to use models based on the “Transformer” architecture (Vaswani et al., 2017). The Transformer was initially shown to be effective for machine translation, but it has subsequently been used in a wide variety of NLP settings (Radford et al., 2018; Devlin et al., 2018; McCann et al., 2018; Yu et al., 2018). Due to its increasing ubiquity, all of the models we study are based on the Transformer architecture. Apart from the details mentioned below and the variants we explore in Section 3.2, we do not deviate significantly from this architecture as originally proposed. Instead of providing a comprehensive definition of this model, we refer the interested reader to the original paper (Vaswani et al., 2017) or follow-up tutorials $^{3,4}$ for a more detailed introduction.
早期的迁移学习在自然语言处理 (NLP) 领域利用了循环神经网络 (Peters et al., 2018; Howard and Ruder, 2018),但最近更常见的是使用基于 “Transformer” 架构的模型 (Vaswani et al., 2017)。Transformer 最初被证明在机器翻译方面有效,但随后已被应用于各种 NLP 场景 (Radford et al., 2018; Devlin et al., 2018; McCann et al., 2018; Yu et al., 2018)。由于其日益普及,我们研究的所有模型都基于 Transformer 架构。除了下面提到的细节和我们在第 3.2 节中探讨的变体外,我们并没有显著偏离最初提出的架构。我们不提供该模型的全面定义,感兴趣的读者可以参考原始论文 (Vaswani et al., 2017) 或后续教程 $^{3,4}$ 以获得更详细的介绍。
The primary building block of the Transformer is self-attention (Cheng et al., 2016). Self-attention is a variant of attention (Graves, 2013; Bahdanau et al., 2015) that processes a sequence by replacing each element by a weighted average of the rest of the sequence. The original Transformer consisted of an encoder-decoder architecture and was intended for sequence-to-sequence (Sutskever et al., 2014; Kalchbrenner et al., 2014) tasks. It has recently also become common to use models consisting of a single Transformer layer stack, with varying forms of self-attention used to produce architectures appropriate for language modeling (Radford et al., 2018; Al-Rfou et al., 2019) or classification and span prediction tasks (Devlin et al., 2018; Yang et al., 2019). We empirically explore these architectural variants in Section 3.2.
Transformer 的主要构建模块是自注意力 (Cheng et al., 2016)。自注意力是注意力机制 (Graves, 2013; Bahdanau et al., 2015) 的一种变体,它通过用序列中其他元素的加权平均值替换每个元素来处理序列。最初的 Transformer 包含编码器-解码器架构,并旨在用于序列到序列 (Sutskever et al., 2014; Kalchbrenner et al., 2014) 任务。最近,也变得常见的是使用由单个 Transformer 层堆栈组成的模型,采用不同形式的自注意力来生成适用于语言建模 (Radford et al., 2018; Al-Rfou et al., 2019) 或分类和跨度预测任务 (Devlin et al., 2018; Yang et al., 2019) 的架构。我们在第 3.2 节中对这些架构变体进行了实证探索。
Overall, our encoder-decoder Transformer implementation closely follows its originally-proposed form (Vaswani et al., 2017). First, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of “blocks”, each of which comprises two subcomponents: a self-attention layer followed by a small feed-forward network. Layer normalization (Ba et al., 2016) is applied to the input of each subcomponent. We use a simplified version of layer normalization where the activations are only rescaled and no additive bias is applied. After layer normalization, a residual skip connection (He et al., 2016) adds each subcomponent’s input to its output. Dropout (Srivastava et al., 2014) is applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack. The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs. The output of the final decoder block is fed into a dense layer with a softmax output, whose weights are shared with the input embedding matrix. All attention mechanisms in the Transformer are split up into independent “heads” whose outputs are concatenated before being further processed.
总体而言,我们的编码器-解码器 Transformer 实现紧密遵循其最初提出的形态 (Vaswani et al., 2017)。首先,输入的 Token 序列被映射到嵌入序列,然后传递给编码器。编码器由多个“块”的堆栈组成,每个块包含两个子组件:一个自注意力层后接一个小的前馈网络。对每个子组件的输入应用层归一化 (Ba et al., 2016)。我们使用了一种简化的层归一化版本,其中激活值仅被重新缩放,而没有添加偏置。在层归一化之后,残差跳跃连接 (He et al., 2016) 将每个子组件的输入加到其输出上。在前馈网络、跳跃连接、注意力权重以及整个堆栈的输入和输出中应用了 Dropout (Srivastava et al., 2014)。解码器在结构上与编码器类似,不同之处在于它在每个自注意力层之后包含一个标准的注意力机制,该机制关注编码器的输出。解码器中的自注意力机制还使用了一种自回归或因果自注意力的形式,只允许模型关注过去的输出。最终解码器块的输出被送入一个带有 softmax 输出的全连接层,其权重与输入嵌入矩阵共享。Transformer 中的所有注意力机制都被拆分为独立的“头”,其输出在进一步处理之前被拼接在一起。
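The block structure described above can be summarized in a short sketch. The following is a minimal, illustrative Python rendering (not the authors' Mesh TensorFlow implementation) of one pre-norm encoder block with the bias-free, rescale-only layer normalization; the helper callables `self_attention`, `feed_forward`, and `dropout` are assumed to be supplied elsewhere.

```python
import numpy as np

def simplified_layer_norm(x, scale, eps=1e-6):
    # Rescale-only layer normalization: no mean subtraction, no additive bias.
    variance = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * scale

def encoder_block(x, self_attention, feed_forward, dropout, attn_scale, ff_scale):
    # Sub-component 1: layer norm -> self-attention -> dropout -> residual add.
    h = x + dropout(self_attention(simplified_layer_norm(x, attn_scale)))
    # Sub-component 2: layer norm -> feed-forward network -> dropout -> residual add.
    return h + dropout(feed_forward(simplified_layer_norm(h, ff_scale)))
```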
Since self-attention is order-independent (i.e. it is an operation on sets), it is common to provide an explicit position signal to the Transformer. While the original Transformer used a sinusoidal position signal or learned position embeddings, it has recently become more common to use relative position embeddings (Shaw et al., 2018; Huang et al., 2018a). Instead of using a fixed embedding for each position, relative position embeddings produce a different learned embedding according to the offset between the “key” and “query” being compared in the self-attention mechanism. We use a simplified form of position embeddings where each “embedding” is simply a scalar that is added to the corresponding logit used for computing the attention weights. For efficiency, we also share the position embedding parameters across all layers in our model, though within a given layer each attention head uses a different learned position embedding. Typically, a fixed number of embeddings are learned, each corresponding to a range of possible key-query offsets. In this work, we use 32 embeddings for all of our models with ranges that increase in size logarithmically up to an offset of 128 beyond which we assign all relative positions to the same embedding. Note that a given layer is insensitive to relative position beyond 128 tokens, but subsequent layers can build a sensitivity to larger offsets by combining local information from previous layers. To summarize, our model is roughly equivalent to the original Transformer proposed by Vaswani et al. (2017) with the exception of removing the Layer Norm bias, placing the layer normalization outside the residual path, and using a different position embedding scheme. Since these architectural changes are orthogonal to the experimental factors we consider in our empirical survey of transfer learning, we leave the ablation of their impact for future work.
由于自注意力机制与顺序无关(即它是对集合的操作),通常会向 Transformer 提供一个显式的位置信号。虽然原始的 Transformer 使用了正弦位置信号或学习到的位置嵌入,但最近更常见的是使用相对位置嵌入 (Shaw et al., 2018; Huang et al., 2018a)。相对位置嵌入不是为每个位置使用固定的嵌入,而是根据自注意力机制中比较的“key”和“query”之间的偏移生成不同的学习嵌入。我们使用了一种简化的位置嵌入形式,其中每个“嵌入”只是一个标量,加到用于计算注意力权重的相应 logit 上。为了提高效率,我们在模型的所有层之间共享位置嵌入参数,尽管在给定层中每个注意力头使用不同的学习到的位置嵌入。通常,会学习固定数量的嵌入,每个对应于可能的 key-query 偏移范围。在这项工作中,我们为所有模型使用 32 个嵌入,其范围以对数方式增长,直到超过 128 的偏移量,之后我们将所有相对位置分配给相同的嵌入。请注意,给定层对于超过 128 个 token 的相对位置是不敏感的,但后续层可以通过组合前一层的局部信息来建立对更大偏移量的敏感性。总结来说,我们的模型大致等同于 Vaswani 等人 (2017) 提出的原始 Transformer,例外之处在于移除了 Layer Norm 偏置,将层归一化置于残差路径之外,并使用了不同的位置嵌入方案。由于这些架构更改与我们在迁移学习的经验调查中考虑的实验因素正交,我们留待未来工作来分析它们的影响。
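As a concrete illustration of the logarithmic bucketing just described (32 buckets, with all offsets beyond 128 sharing one embedding), here is a minimal sketch. The released T5 code uses a more elaborate bucketing that, for example, also distinguishes the sign of the offset, so treat the details below as assumptions rather than the actual implementation.

```python
import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Map a key-query offset to a bucket ID: small offsets get exact buckets,
    larger ones fall into logarithmically wider buckets, and anything beyond
    `max_distance` shares the final bucket."""
    n = abs(offset)
    max_exact = num_buckets // 2
    if n < max_exact:
        return n
    bucket = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return min(bucket, num_buckets - 1)
```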
As part of our study, we experiment with the scalability of these models, i.e. how their performance changes as they are made to have more parameters or layers. Training large models can be non-trivial since they might not fit on a single machine and require a great deal of computation. As a result, we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods. $^{5}$ TPU Pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. We leverage the Mesh TensorFlow library (Shazeer et al., 2018) for ease of implementation of both model parallelism and data parallelism (Krizhevsky, 2014).
在我们的研究中,我们实验了这些模型的可扩展性,即当它们拥有更多参数或层数时性能如何变化。训练大模型可能并非易事,因为它们可能无法适应单个机器,并且需要大量的计算资源。因此,我们使用模型并行和数据并行的组合,在 Cloud TPU Pods 的“切片”上训练模型。TPU Pods 是多机架 ML 超级计算机,包含 1,024 个通过高速 2D 网格互连连接的 TPU v3 芯片,并配有支持的 CPU 主机。我们利用 Mesh TensorFlow 库 (Shazeer et al., 2018) 来简化模型并行和数据并行 (Krizhevsky, 2014) 的实现。
2.2. The Colossal Clean Crawled Corpus
2.2. 巨大的清洁爬取语料库 (Colossal Clean Crawled Corpus)
Much of the previous work on transfer learning for NLP makes use of large unlabeled data sets for unsupervised learning. In this paper, we are interested in measuring the effect of the quality, characteristics, and size of this unlabeled data. To generate data sets that satisfy our needs, we leverage Common Crawl as a source of text scraped from the web. Common
之前关于 NLP 迁移学习的许多工作都利用了大型未标注数据集进行无监督学习。在本文中,我们关注的是测量这些未标注数据的质量、特征和规模的影响。为了生成满足我们需求的数据集,我们利用 Common Crawl 作为从网络上抓取的文本来源。Common
Crawl has previously been used as a source of text data for NLP, for example to train an n-gram language model (Buck et al., 2014), as training data for commonsense reasoning (Trinh and Le, 2018), for mining parallel texts for machine translation (Smith et al., 2013), as a pre-training data set (Grave et al., 2018; Zellers et al., 2019; Liu et al., 2019c), and even simply as a giant text corpus for testing optimizers (Anil et al., 2019).
Crawl 之前已被用作 NLP 的文本数据来源,例如用于训练 n 元语言模型 (Buck et al., 2014),作为常识推理的训练数据 (Trinh 和 Le, 2018),用于挖掘机器翻译的平行文本 (Smith et al., 2013),作为预训练数据集 (Grave et al., 2018; Zellers et al., 2019; Liu et al., 2019c),甚至仅仅作为一个巨大的文本语料库来测试优化器 (Anil et al., 2019)。
Common Crawl is a publicly-available web archive that provides “web extracted text” by removing markup and other non-text content from the scraped HTML files. This process produces around 20TB of scraped text data each month. Unfortunately, the majority of the resulting text is not natural language. Instead, it largely comprises gibberish or boiler-plate text like menus, error messages, or duplicate text. Furthermore, a good deal of the scraped text contains content that is unlikely to be helpful for any of the tasks we consider (offensive language, placeholder text, source code, etc.). To address these issues, we used the following heuristics for cleaning up Common Crawl’s web extracted text:
Common Crawl 是一个公开的网页档案,通过从抓取的 HTML 文件中移除标记和其他非文本内容,提供“网页提取文本”。此过程每月产生约 20TB 的抓取文本数据。不幸的是,大部分生成的文本并不是自然语言。相反,它主要由乱码或模板文本组成,如菜单、错误消息或重复文本。此外,相当一部分抓取的文本包含对我们考虑的任何任务都不太可能有帮助的内容(冒犯性语言、占位符文本、源代码等)。为了解决这些问题,我们使用了以下启发式方法来清理 Common Crawl 的网页提取文本:
Additionally, since most of our downstream tasks are focused on English-language text, we used langdetect to filter out any pages that were not classified as English with a probability of at least 0.99. Our heuristics are inspired by past work on using Common Crawl as a source of data for NLP: For example, Grave et al. (2018) also filter text using an automatic language detector and discard short lines, and Smith et al. (2013); Grave et al. (2018) both perform line-level deduplication. However, we opted to create a new data set because prior data sets use a more limited set of filtering heuristics, are not publicly available, and/or are different in scope (e.g. are limited to News data (Zellers et al., 2019; Liu et al., 2019c), comprise only Creative Commons content (Habernal et al., 2016), or are focused on parallel training data for machine translation (Smith et al., 2013)).
此外,由于我们的大多数下游任务都集中在英文文本上,我们使用了 langdetect 来过滤掉任何未被识别为英语(概率至少为 0.99)的页面。我们的启发式方法借鉴了过去使用 Common Crawl 作为 NLP 数据来源的工作:例如,Grave 等 (2018) 也使用自动语言检测器过滤文本并丢弃短行,而 Smith 等 (2013) 和 Grave 等 (2018) 都执行行级去重。然而,我们选择创建一个新的数据集,因为先前的数据集使用了更有限的过滤启发式方法,没有公开可用,和/或在范围上不同(例如,仅限于新闻数据 (Zellers 等, 2019; Liu 等, 2019c),仅包含知识共享内容 (Habernal 等, 2016),或者专注于机器翻译的平行训练数据 (Smith 等, 2013))。
To assemble our base data set, we downloaded the web extracted text from April 2019 and applied the aforementioned filtering. This produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text. We dub this data set the “Colossal Clean Crawled Corpus” (or C4 for short) and release it as part of TensorFlow Datasets. $^{8}$ We consider the impact of using various alternative versions of this data set in Section 3.4.
为了组装我们的基础数据集,我们下载了 2019 年 4 月的网页提取文本,并应用了上述过滤。这产生了一个不仅比大多数用于预训练的数据集大几个数量级(约 750 GB),而且还包含相当干净和自然的英文文本的集合。我们将这个数据集命名为“Colossal Clean Crawled Corpus”(简称 C4),并作为 TensorFlow Datasets 的一部分发布。$^{8}$ 我们在第 3.4 节中考虑了使用该数据集的各种替代版本的影响。
2.3. Downstream Tasks
2.3. 下游任务
Our goal in this paper is to measure general language learning abilities. As such, we study downstream performance on a diverse set of benchmarks, including machine translation, question answering, abstractive summarization, and text classification. Specifically, we measure performance on the GLUE and SuperGLUE text classification meta-benchmarks; CNN/Daily Mail abstractive summarization; SQuAD question answering; and WMT English to German, French, and Romanian translation. All data was sourced from TensorFlow Datasets. $^{9}$
我们在这篇论文中的目标是测量通用语言学习能力。因此,我们研究了在多个不同基准测试上的下游性能,包括机器翻译、问答、抽象式摘要 (abstractive summarization) 和文本分类。具体来说,我们在 GLUE 和 SuperGLUE 文本分类元基准上测量性能;CNN/Daily Mail 抽象式摘要;SQuAD 问答;以及 WMT 英语到德语、法语和罗马尼亚语的翻译。所有数据均来自 TensorFlow Datasets。 $^{9}$
GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019b) each comprise a collection of text classification tasks meant to test general language understanding abilities:
GLUE (Wang et al., 2018) 和 SuperGLUE (Wang et al., 2019b) 各自包含一系列文本分类任务,旨在测试通用语言理解能力:
We use the data sets as distributed by the GLUE and SuperGLUE benchmarks. For simplicity, when fine-tuning we treat all of the tasks in the GLUE benchmark (and similarly for SuperGLUE) as a single task by concatenating all of the constituent data sets. As suggested by Kocijan et al. (2019) we also include the Definite Pronoun Resolution (DPR) data set (Rahman and Ng, 2012) in the combined SuperGLUE task.
我们使用由 GLUE 和 SuperGLUE 基准分发的数据集。为简化起见,在微调时,我们将 GLUE 基准中的所有任务(以及类似的 SuperGLUE)视为单个任务,通过连接所有组成的数据集来处理。根据 Kocijan 等人 (2019) 的建议,我们还将在合并的 SuperGLUE 任务中包含确定性代词消解 (DPR) 数据集 (Rahman 和 Ng, 2012)。
The CNN/Daily Mail (Hermann et al., 2015) data set was introduced as a question-answering task but was adapted for text summarization by Nallapati et al. (2016); we use the non-anonymized version from See et al. (2017) as an abstractive summarization task. SQuAD (Rajpurkar et al., 2016) is a common question-answering benchmark. In our experiments, the model is fed the question and its context and asked to generate the answer token-by-token. For WMT English to German, we use the same training data as (Vaswani et al., 2017) (i.e. News Commentary v13, Common Crawl, Europarl v7) and newstest2013 as a validation set (Bojar et al., 2014). For English to French, we use the standard training data from 2015 and newstest2014 as a validation set (Bojar et al., 2015). For English to Romanian, which is a standard lower-resource machine translation benchmark, we use the train and validation sets from WMT 2016 (Bojar et al., 2016). Note that we only pre-train on English data, so in order to learn to translate a given model will need to learn to generate text in a new language.
CNN/Daily Mail (Hermann 等, 2015) 数据集最初是作为问答任务引入的,但被 Nallapati 等 (2016) 改编为文本摘要任务;我们使用 See 等 (2017) 提供的非匿名版本作为抽象式摘要任务。SQuAD (Rajpurkar 等, 2016) 是一个常见的问答基准。在我们的实验中,模型接收问题及其上下文,并被要求逐个 Token 生成答案。对于 WMT 英语到德语,我们使用与 (Vaswani 等, 2017) 相同的训练数据(即 News Commentary v13、Common Crawl、Europarl v7),并使用 newstest2013 作为验证集 (Bojar 等, 2014)。对于英语到法语,我们使用 2015 年的标准训练数据,并使用 newstest2014 作为验证集 (Bojar 等, 2015)。对于英语到罗马尼亚语,这是一个标准的低资源机器翻译基准,我们使用 WMT 2016 (Bojar 等, 2016) 的训练和验证集。请注意,我们仅在英语数据上进行预训练,因此为了学习翻译,给定模型需要学习生成新语言的文本。
2.4. Input and Output Format
2.4. 输入和输出格式
In order to train a single model on the diverse set of tasks described above, we cast all of the tasks we consider into a “text-to-text” format—that is, a task where the model is fed some text for context or conditioning and is then asked to produce some output text. This framework provides a consistent training objective both for pre-training and fine-tuning. Specifically, the model is trained with a maximum likelihood objective (using “teacher forcing” (Williams and Zipser, 1989)) regardless of the task. To specify which task the model should perform, we add a task-specific (text) prefix to the original input sequence before feeding it to the model.
为了在一个模型上训练上述多样化的任务集,我们将所有考虑的任务转换为“文本到文本”格式——即,模型接收一些文本作为上下文或条件,然后被要求生成一些输出文本。这个框架为预训练和微调提供了统一的训练目标。具体来说,无论任务如何,模型都是使用最大似然目标(采用“教师强制” (Williams and Zipser, 1989))进行训练的。为了指定模型应执行的任务,我们在原始输入序列前添加一个特定于任务的(文本)前缀,然后再将其输入模型。
As an example, to ask the model to translate the sentence “That is good.” from English to German, the model would be fed the sequence “translate English to German: That is good.” and would be trained to output “Das ist gut.” For text classification tasks, the model simply predicts a single word corresponding to the target label. For example, on the MNLI benchmark (Williams et al., 2017) the goal is to predict whether a premise implies (“entailment”), contradicts (“contradiction”), or neither (“neutral”) a hypothesis. With our preprocessing, the input sequence becomes “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.” with the corresponding target word “entailment”. Note that an issue arises if our model outputs text on a text classification task that does not correspond to any of the possible labels (for example if the model outputs “hamburger” when the only possible labels for a task were “entailment”, “neutral”, or “contradiction”). In this case, we always count the model’s output as wrong, though we never observed this behavior in any of our trained models. Note that the choice of text prefix used for a given task is essentially a hyperparameter; we found that changing the exact wording of the prefix had limited impact and so did not perform extensive experiments into different prefix choices. A diagram of our text-to-text framework with a few input/output examples is shown in Figure 1. We provide full examples of preprocessed inputs for every task we studied in Appendix D.
例如,要求模型将句子“That is good.”从英语翻译成德语,模型会接收序列“translate English to German: That is good.”并被训练输出“Das ist gut。”对于文本分类任务,模型只需预测一个对应目标标签的单词。例如,在 MNLI 基准测试 (Williams et al., 2017) 中,目标是预测前提是否意味着(“entailment”),矛盾(“contradiction”),或既不意味着也不矛盾(“neutral”)假设。经过我们的预处理,输入序列变为“mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity.”,对应的标签词为“entailment”。请注意,如果模型在文本分类任务中输出的文本不对应任何可能的标签(例如,当唯一可能的标签为“entailment”,“neutral”,或“contradiction”时,模型输出了“hamburger”),则会出现问题。在这种情况下,我们始终将模型的输出视为错误,尽管我们在任何训练的模型中从未观察到这种行为。请注意,给定任务使用的文本前缀选择本质上是一个超参数;我们发现改变前缀的确切措辞影响有限,因此没有对不同的前缀选择进行广泛的实验。图 1 显示了我们文本到文本框架的示例输入/输出图。我们在附录 D 中提供了我们研究的每个任务的预处理输入的完整示例。
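To make the input/output construction concrete, here is a minimal sketch of how the prefixed examples above could be assembled; the helper name and its task keys are illustrative assumptions rather than the paper's actual preprocessing code (see Appendix D for the real preprocessed inputs).

```python
def to_text_to_text(task, **fields):
    """Illustrative conversion of an example into an (input text, target text) pair.
    The exact prefix wording is a hyperparameter in the paper."""
    if task == "translate_en_de":
        return (f"translate English to German: {fields['source']}", fields["target"])
    if task == "mnli":
        return (
            f"mnli premise: {fields['premise']} hypothesis: {fields['hypothesis']}",
            fields["label"],  # e.g. "entailment", "neutral", or "contradiction"
        )
    raise ValueError(f"unknown task: {task}")

# Example usage:
# to_text_to_text("translate_en_de", source="That is good.", target="Das ist gut.")
```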
Our text-to-text framework follows previous work that casts multiple NLP tasks into a common format: McCann et al. (2018) propose the “Natural Language Decathlon”, a benchmark that uses a consistent question-answering format for a suite of ten NLP tasks. The Natural Language Decathlon also stipulates that all models must be multi-task, i.e. are able to simultaneously tackle all of the tasks at once. We instead allow for separately fine-tuning the model on each individual task and use short task prefixes instead of an explicit question-answer format. Radford et al. (2019) evaluate the zero-shot learning capabilities of language models by feeding some input to the model as a prefix and then autoregressively sampling an output. For example, automatic summarization is done by feeding in a document followed by the text “TL;DR:” (short for “too long, didn’t read”, a common abbreviation) and then the summary is predicted via autoregressive decoding. We mainly consider models that explicitly process an input with an encoder before generating an output with a separate decoder and we focus on transfer learning rather than zero-shot learning. Finally, Keskar et al. (2019b) unify many NLP tasks as “span extraction”, where text corresponding to possible output choices are appended to the input and the model is trained to extract the input span corresponding to the correct choice. In contrast, our framework also allows for generative tasks like machine translation and abstractive summarization where it is not possible to enumerate all possible output choices.
我们的文本到文本框架遵循先前的工作,将多个 NLP 任务转换为一种通用格式:McCann 等 (2018) 提出了“自然语言十项全能”,这是一个使用一致的问题回答格式的十个 NLP 任务基准。自然语言十项全能还规定所有模型必须是多任务的,即能够同时处理所有任务。相比之下,我们允许对每个单独的任务进行分别微调,并使用简短的任务前缀而不是显式的问题-回答格式。Radford 等 (2019) 通过将一些输入作为前缀输入模型,然后自回归地采样输出来评估语言模型的零样本学习能力。例如,自动摘要生成是通过输入文档后跟文本 “TL;DR:”(意为“太长,没读”,一个常见的缩写),然后通过自回归解码预测摘要。我们主要考虑那些在生成输出之前显式地用编码器处理输入并在单独的解码器中生成输出的模型,并且我们专注于迁移学习而非零样本学习。最后,Keskar 等 (2019b) 将许多 NLP 任务统一为“片段提取”,其中可能的输出选择对应的文本被附加到输入上,模型被训练以提取对应正确选择的输入片段。相比之下,我们的框架还允许生成式任务,如机器翻译和抽象性摘要,在这些任务中不可能枚举所有可能的输出选择。
We were able to straightforwardly cast all of the tasks we considered into a text-to-text format with the exception of STS-B, which is a regression task where the goal is to predict a similarity score between 1 and 5. We found that most of these scores were annotated in increments of 0.2, so we simply rounded any score to the nearest increment of 0.2 and converted the result to a literal string representation of the number (e.g. the floating-point value 2.57 would be mapped to the string “2.6”). At test time, if the model outputs a string corresponding to a number between 1 and 5, we convert it to a floating-point value; otherwise, we treat the model’s prediction as incorrect. This effectively recasts the STS-B regression problem as a 21-class classification problem.
我们能够将所有考虑的任务直接转换为文本到文本的格式,例外的是 STS-B,这是一个回归任务,目标是预测 1 到 5 之间的相似度分数。我们发现这些分数大多以 0.2 的增量进行标注,因此我们将任何分数四舍五入到最接近的 0.2 增量,并将结果转换为数字的字面字符串表示(例如,浮点值 2.57 将映射为字符串 “2.6”)。在测试时,如果模型输出一个介于 1 和 5 之间的数字字符串,我们将其转换为浮点值;否则,我们将模型的预测视为不正确。这实际上将 STS-B 回归问题重新定义为一个 21 类分类问题。
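A minimal sketch of this recasting is shown below; the function names are hypothetical, but the rounding and parsing follow the procedure just described.

```python
def stsb_score_to_string(score):
    # Round to the nearest 0.2 increment and render it as the literal string
    # target, e.g. 2.57 -> "2.6". This recasts regression as 21-class classification.
    return "%.1f" % (round(score * 5) / 5)

def stsb_string_to_score(prediction):
    # At test time, parse the model output back into a float if it is a number
    # between 1 and 5; otherwise the prediction is treated as incorrect (None).
    try:
        value = float(prediction)
    except ValueError:
        return None
    return value if 1.0 <= value <= 5.0 else None
```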
Separately, we also convert the Winograd tasks (WNLI from GLUE, WSC from SuperGLUE, and the DPR data set we add to SuperGLUE) into a simpler format that is more amenable to the text-to-text framework. Examples from the Winograd tasks consist of a text passage containing an ambiguous pronoun that could refer to more than one of the noun phrases in the passage. For example, the passage might be “The city councilmen refused the demonstrators a permit because they feared violence.”, which contains the ambiguous pronoun “they” that could refer to “city councilmen” or “demonstrators”. We cast the WNLI, WSC, and DPR tasks as text-to-text problems by highlighting the ambiguous pronoun in the text passage and asking the model to predict the noun that it refers to. The example mentioned above would be transformed to the input “The city councilmen refused the demonstrators a permit because *they* feared violence.” and the model would be trained to predict the target text “The city councilmen”.
我们还分别将 Winograd 任务(来自 GLUE 的 WNLI,来自 SuperGLUE 的 WSC,以及我们添加到 SuperGLUE 的 DPR 数据集)转换为更简单的格式,使其更适合文本到文本框架。Winograd 任务的示例包括包含歧义代词的文本段落,该代词可能指代段落中的多个名词短语之一。例如,段落可能是“The city councilmen refused the demonstrators a permit because they feared violence.”,其中包含歧义代词“they”,它可以指代“city councilmen”或“demonstrators”。我们将 WNLI、WSC 和 DPR 任务视为文本到文本问题,通过在文本段落中突出显示歧义代词,并要求模型预测它所指代的名词。上述提到的示例将被转换为输入“The city councilmen refused the demonstrators a permit because *they* feared violence.”,模型将被训练以预测目标文本“The city councilmen”。
For WSC, examples contain the passage, the ambiguous pronoun, a candidate noun, and a True/False label reflecting whether the candidate matches the pronoun (ignoring any articles). We only train on examples with a “True” label since we do not know the correct noun targets for examples with a “False” label. For evaluation, we assign a “True” label if the words in the model’s output are a subset of the words in the candidate noun phrase (or vice versa) and assign a “False” label otherwise. This removes roughly half of the WSC training set, but the DPR data set adds about 1,000 pronoun resolution examples. Examples from DPR are annotated with the correct referent noun, making it easy to use this data set in the format listed above.
对于WSC,示例包含段落、歧义代词、候选名词以及反映候选名词是否匹配代词的True/False标签(忽略任何冠词)。我们只在带有“True”标签的示例上进行训练,因为我们不知道带有“False”标签的示例的正确名词目标。对于评估,如果模型输出中的单词是候选名词短语中的单词的子集(或反之亦然),则分配一个“True”标签,否则分配一个“False”标签。这大约去除了WSC训练集的一半,但DPR数据集添加了大约1,000个代词解析示例。来自DPR的示例标注有正确的指代名词,使其易于使用上述格式的数据集。
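The subset-matching evaluation heuristic described above can be written as a short function. This is an illustrative sketch rather than the paper's exact code; the specific article list is an assumption based on the "ignoring any articles" description.

```python
def wsc_label(model_output, candidate_noun_phrase):
    """Assign True if the predicted words are a subset of the candidate noun
    phrase's words or vice versa, ignoring articles; otherwise assign False."""
    articles = {"a", "an", "the"}
    predicted = {w for w in model_output.lower().split() if w not in articles}
    candidate = {w for w in candidate_noun_phrase.lower().split() if w not in articles}
    return predicted <= candidate or candidate <= predicted
```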
The WNLI training and validation sets have a significant overlap with the WSC training set. To avoid leaking validation examples into our training data (a particular issue in the multi-task experiments of Section 3.5.2), we therefore never train on WNLI and never report results on the WNLI validation set. Omitting results on the WNLI validation set is standard practice (Devlin et al., 2018) due to the fact that it is “adversarial” with respect to the training set, i.e. validation examples are all slightly-perturbed versions of training examples with the opposite label. As such, we do not include WNLI in the average GLUE score whenever we report on the validation set (all sections except Section 3.7 where results are presented on the test sets). Converting examples from WNLI to the “referent noun prediction” variant described above is a little more involved; we describe this process in Appendix B.
WNLI 训练集和验证集与 WSC 训练集有显著重叠。为了避免验证样本泄露到训练数据中(这在第 3.5.2 节的多任务实验中是一个特别的问题),我们从不在 WNLI 上进行训练,也从不报告 WNLI 验证集上的结果。省略 WNLI 验证集上的结果是标准做法 (Devlin et al., 2018),因为它相对于训练集是“对抗性”的,即验证样本都是训练样本的轻微扰动版本,并且标签相反。因此,每当我们在验证集上报告结果时(除第 3.7 节外的所有章节,该节的结果在测试集上呈现),我们都不将 WNLI 计入 GLUE 平均分。将 WNLI 的样本转换为上述“指代名词预测”变体稍微复杂一些;我们在附录 B 中描述了这个过程。
3. Experiments
3. 实验
Recent advances in transfer learning for NLP have come from a wide variety of developments, such as new pre-training objectives, model architectures, unlabeled data sets, and more. In this section, we carry out an empirical survey of these techniques in hopes of teasing apart their contribution and significance. We then combine the insights gained to attain state-of-the-art in many of the tasks we consider. Since transfer learning for NLP is a rapidly growing area of research, it is not feasible for us to cover every possible technique or idea in our empirical study. For a broader literature review, we recommend a recent survey by Ruder et al. (2019).
近期在自然语言处理 (NLP) 中的迁移学习进展来自于各种各样的发展,例如新的预训练目标、模型架构、无标签数据集等。在本节中,我们对这些技术进行了实证调查,以期能够区分它们的贡献和重要性。然后我们将获得的见解结合起来,在我们考虑的许多任务中达到最先进水平。由于 NLP 的迁移学习是一个快速发展的研究领域,我们无法在实证研究中涵盖每一种可能的技术或想法。对于更广泛的文献综述,我们推荐 Ruder 等人 (2019) 的最新调查。
We systematically study these contributions by taking a reasonable baseline (described in Section 3.1) and altering one aspect of the setup at a time. For example, in Section 3.3 we measure the performance of different unsupervised objectives while keeping the rest of our experimental pipeline fixed. This “coordinate ascent” approach might miss second-order effects (for example, some particular unsupervised objective may work best on a model larger than our baseline setting), but performing a combinatorial exploration of all of the factors in our study would be prohibitively expensive. In future work, we expect it could be fruitful to more thoroughly consider combinations of the approaches we study.
我们系统地研究这些贡献,通过采用一个合理的基线(描述在第 3.1 节)并每次改变设置的一个方面。例如,在第 3.3 节中,我们在保持其余实验流程固定的情况下测量不同无监督目标的性能。这种“坐标上升”方法可能会忽略二阶效应(例如,某些特定的无监督目标可能在一个比我们基线设置更大的模型上表现最佳),但对我们研究中的所有因素进行组合探索将过于昂贵。在未来的工作中,我们预计更彻底地考虑我们研究的方法组合可能会有成效。
Our goal is to compare a variety of different approaches on a diverse set of tasks while keeping as many factors fixed as possible. In order to satisfy this aim, in some cases we do not exactly replicate existing approaches. For example, “encoder-only” models like BERT (Devlin et al., 2018) are designed to produce a single prediction per input token or a single prediction for an entire input sequence. This makes them applicable for classification or span prediction tasks but not for generative tasks like translation or abstractive summarization. As such, none of the model architectures we consider are identical to BERT or consist of an encoder-only structure. Instead, we test approaches that are similar in spirit—for example, we consider an analogous objective to BERT’s “masked language modeling” objective in Section 3.3 and we consider a model architecture that behaves similarly to BERT on text classification tasks in Section 3.2.
我们的目标是在尽可能多的任务上比较各种不同的方法,同时保持尽可能多的因素不变。为了实现这一目标,在某些情况下我们不会完全复制现有的方法。例如,“仅编码器”模型如 BERT (Devlin et al., 2018) 被设计为对每个输入 Token 或整个输入序列生成单个预测。这使得它们适用于分类或跨度预测任务,但不适用于翻译或摘要生成等生成式任务。因此,我们考虑的模型架构中没有一个与 BERT 完全相同或仅由编码器结构组成。相反,我们测试了类似的方法——例如,在第 3.3 节中,我们考虑了一个类似于 BERT 的“掩码语言建模 (masked language modeling)”目标,并在第 3.2 节中考虑了一个在文本分类任务上表现类似于 BERT 的模型架构。
After outlining our baseline experimental setup in the following subsection, we undertake an empirical comparison of model architectures (Section 3.2), unsupervised objectives (Section 3.3), pre-training data sets (Section 3.4), transfer approaches (Section 3.5), and scaling (Section 3.6). At the culmination of this section, we combine insights from our study with scale to obtain state-of-the-art results in many tasks we consider (Section 3.7).
在下一小节中概述了我们的基线实验设置后,我们对模型架构(第 3.2 节)、无监督目标(第 3.3 节)、预训练数据集(第 3.4 节)、迁移方法(第 3.5 节)和扩展(第 3.6 节)进行了实证比较。在本节的最后,我们将研究中的见解与规模相结合,在我们考虑的许多任务中取得了最先进的结果(第 3.7 节)。
3.1. Baseline
3.1. 基线模型 (Baseline)
Our goal for our baseline is to reflect typical, modern practice. We pre-train a standard Transformer (described in Section 2.1) using a simple denoising objective and then separately fine-tune on each of our downstream tasks. We describe the details of this experimental setup in the following subsections.
我们的基线旨在反映典型的现代实践。我们使用简单的去噪目标预训练一个标准的 Transformer (如第 2.1 节所述),然后分别在每个下游任务上进行微调。我们在以下小节中描述此实验设置的详细信息。
3.1.1. Model
3.1.1. 模型
For our model, we use a standard encoder-decoder Transformer as proposed by Vaswani et al. (2017). While many modern approaches to transfer learning for NLP use a Transformer architecture consisting of only a single “stack” (e.g. for language modeling (Radford et al., 2018; Dong et al., 2019) or classification and span prediction (Devlin et al., 2018; Yang et al., 2019)), we found that using a standard encoder-decoder structure achieved good results on both generative and classification tasks. We explore the performance of different model architectures in Section 3.2.
对于我们的模型,我们使用了由 Vaswani 等人 (2017) 提出的标准编码器-解码器 Transformer。虽然许多现代自然语言处理的迁移学习方法使用仅包含单个“堆栈”的 Transformer 架构(例如,用于语言建模 (Radford 等,2018;Dong 等,2019) 或分类和跨度预测 (Devlin 等,2018;Yang 等,2019)),我们发现使用标准的编码器-解码器结构在生成式和分类任务上都取得了良好的结果。我们在第 3.2 节中探讨了不同模型架构的性能。
Our baseline model is designed so that the encoder and decoder are each similar in size and configuration to a “BERTBASE” (Devlin et al., 2018) stack. Specifically, both the encoder and decoder consist of 12 blocks (each block comprising self-attention, optional encoder-decoder attention, and a feed-forward network). The feed-forward networks in each block consist of a dense layer with an output dimensionality of $d_{\mathrm{ff}}=3072$ followed by a ReLU nonlinearity and another dense layer. The “key” and “value” matrices of all attention mechanisms have an inner dimensionality of $d_{\mathrm{kv}}=64$ and all attention mechanisms have 12 heads. All other sub-layers and embeddings have a dimensionality of $d_{\mathrm{model}}=768$ . In total, this results in a model with about 220 million parameters. This is roughly twice the number of parameters of BERTBASE since our baseline model contains two layer stacks instead of one. For regularization, we use a dropout probability of 0.1 everywhere dropout is applied in the model.
我们的基线模型设计为编码器和解码器在大小和配置上各自类似于“BERTBASE” (Devlin et al., 2018) 的堆栈。具体来说,编码器和解码器均由 12 个块组成(每个块包含自注意力机制、可选的编码器-解码器注意力机制和前馈网络)。每个块中的前馈网络由一个输出维度为 $d_{\mathrm{ff}}=3072$ 的全连接层组成,后面跟着 ReLU 非线性层和另一个全连接层。所有注意力机制的“key”和“value”矩阵的内部维度为 $d_{\mathrm{kv}}=64$ ,并且所有注意力机制有 12 个头。所有其他子层和嵌入的维度为 $d_{\mathrm{model}}=768$ 。总计,这导致了一个约有 2.2 亿参数的模型。这大约是 BERTBASE 参数数量的两倍,因为我们的基线模型包含两个层堆栈而不是一个。为了正则化,我们在模型中应用 dropout 的地方使用了 0.1 的 dropout 概率。
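As a rough sanity check on the quoted parameter count, the stated hyperparameters can be plugged into a back-of-the-envelope calculation. The sketch below ignores layer-norm scales and relative position embeddings, so the total is only approximate.

```python
d_model, d_ff, d_kv, heads, layers, vocab = 768, 3072, 64, 12, 12, 32000

attn = 4 * d_model * (heads * d_kv)      # Q, K, V, and output projections
ffn = 2 * d_model * d_ff                 # the two dense layers per block
encoder = layers * (attn + ffn)          # self-attention + FFN per encoder block
decoder = layers * (2 * attn + ffn)      # decoder blocks add encoder-decoder attention
embeddings = vocab * d_model             # shared with the output softmax layer

total = encoder + decoder + embeddings
print(f"{total / 1e6:.0f}M parameters")  # roughly 220M, matching the text
```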
3.1.2. Training
3.1.2. 训练
As described in Section 2.4, all tasks are formulated as text-to-text tasks. This allows us to always train using standard maximum likelihood, i.e. using teacher forcing (Williams and Zipser, 1989) and a cross-entropy loss. For optimization, we use AdaFactor (Shazeer and Stern, 2018). At test time, we use greedy decoding (i.e. choosing the highest-probability logit at every timestep).
如第 2.4 节所述,所有任务都被表述为文本到文本的任务。这使我们能够始终使用标准的最大似然进行训练,即使用教师强制 (teacher forcing) (Williams 和 Zipser, 1989) 和交叉熵损失。对于优化,我们使用 AdaFactor (Shazeer 和 Stern, 2018)。在测试时,我们使用贪婪解码(即在每个时间步选择概率最高的 logit)。
We pre-train each model for $2^{19}=524{,}288$ steps on C4 before fine-tuning. We use a maximum sequence length of 512 and a batch size of 128 sequences. Whenever possible, we “pack” multiple sequences into each entry of the batch $^{10}$ so that our batches contain roughly $2^{16}=65{,}536$ tokens. In total, this batch size and number of steps corresponds to pre-training on $2^{35}\approx34\mathrm{B}$ tokens. This is considerably less than BERT (Devlin et al., 2018), which used roughly 137B tokens, or RoBERTa (Liu et al., 2019c), which used roughly $2.2\mathrm{T}$ tokens. Using only $2^{35}$ tokens results in a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. We consider the effect of pre-training for more steps in Sections 3.6 and 3.7. Note that $2^{35}$ tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training.
我们在 C4 上对每个模型进行预训练,共进行 $2^{19}=524,288$ 步,然后进行微调。我们使用最大序列长度为 512 和批次大小为 128 个序列。在可能的情况下,我们将多个序列“打包”到每个批次的每个条目中 $^{10}$ ,从而使我们的批次包含大约 $2^{16}=65,536$ 个 Token。总计,这个批次大小和步数相当于在 $2^{35}\approx34\mathrm{B}$ 个 Token 上进行预训练。这比 BERT (Devlin et al., 2018) 使用的大约 137B 个 Token 或 RoBERTa (Liu et al., 2019c) 使用的大约 $2.2\mathrm{T}$ 个 Token 要少得多。仅使用 $2^{35}$ 个 Token 可以在合理的计算预算内提供足够的预训练量,以实现可接受的性能。我们在第 3.6 节和第 3.7 节中考虑了增加预训练步数的影响。需要注意的是,$2^{35}$ 个 Token 仅覆盖整个 C4 数据集的一小部分,因此在预训练过程中我们不会重复任何数据。
During pre-training, we use an “inverse square root” learning rate schedule: $1/\sqrt{\operatorname*{max}(n,k)}$ where $n$ is the current training iteration and $k$ is the number of warm-up steps (set to $10^{4}$ in all of our experiments). This sets a constant learning rate of 0.01 for the first $10^{4}$ steps, then exponentially decays the learning rate until pre-training is over. We also experimented with using a triangular learning rate (Howard and Ruder, 2018), which produced slightly better results but requires knowing the total number of training steps ahead of time. Since we will be varying the number of training steps in some of our experiments, we opt for the more generic inverse square root schedule.
在预训练期间,我们使用“反平方根”学习率调度:$1/\sqrt{\operatorname*{max}(n,k)}$,其中 $n$ 是当前训练迭代次数,$k$ 是热身步数(在所有实验中设置为 $10^4$)。这在前 $10^4$ 步设置了 0.01 的恒定学习率,然后指数衰减学习率直到预训练结束。我们也尝试了使用三角形学习率 (Howard 和 Ruder, 2018),这产生了稍好的结果,但需要提前知道总的训练步数。由于在某些实验中我们将改变训练步数,因此我们选择了更通用的反平方根调度。
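The schedule above is simple enough to state as a one-line function; this is an illustrative sketch of the formula as described, with the warm-up step count as a parameter.

```python
def inverse_sqrt_learning_rate(step, warmup_steps=10_000):
    # Constant at 1 / sqrt(warmup_steps) = 0.01 for the first 10^4 steps,
    # then decays as 1 / sqrt(step) for the remainder of pre-training.
    return max(step, warmup_steps) ** -0.5
```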
Our models are fine-tuned for $2^{18}=262{,}144$ steps on all tasks. This value was chosen as a trade-off between the high-resource tasks (i.e. those with large data sets), which benefit from additional fine-tuning, and low-resource tasks (smaller data sets), which overfit quickly. During fine-tuning, we continue using batches with 128 length-512 sequences (i.e. $2^{16}$ tokens per batch). We use a constant learning rate of 0.001 when fine-tuning. We save a checkpoint every 5,000 steps and report results on the model checkpoint corresponding to the highest validation performance. For models fine-tuned on multiple tasks, we choose the best checkpoint for each task independently. For all of the experiments except those in Section 3.7, we report results in the validation set to avoid performing model selection on the test set.
我们的模型在所有任务上微调了 $2^{18}=262,144$ 步。这个值是在高资源任务(即那些具有大数据集的任务)和低资源任务(较小数据集)之间进行权衡后选择的,前者从额外的微调中受益,而后者则会迅速过拟合。在微调期间,我们继续使用包含 128 个长度为 512 的序列的批次(即每批次 $2^{16}$ 个 Token)。微调时我们使用 0.001 的恒定学习率。我们每 5,000 步保存一个检查点,并报告验证性能最高的模型检查点的结果。对于在多个任务上微调的模型,我们为每个任务独立选择最佳检查点。除了第 3.7 节中的实验外,我们在验证集上报告结果,以避免在测试集上进行模型选择。
3.1.3. Vocabulary
3.1.3. 词汇表 (Vocabulary)
We use SentencePiece (Kudo and Richardson, 2018) to encode text as WordPiece tokens (Sennrich et al., 2015; Kudo, 2018). For all experiments, we use a vocabulary of 32,000 wordpieces. Since we ultimately fine-tune our model on English to German, French, and Romanian translation, we also require that our vocabulary covers these non-English languages. To address this, we classified pages from the Common Crawl scrape used in C4 as German, French, and Romanian. Then, we trained our SentencePiece model on a mixture of 10 parts of English C4 data with 1 part each of data classified as German, French or Romanian. This vocabulary was shared across both the input and output of our model. Note that our vocabulary makes it so that our model can only process a predetermined, fixed set of languages.
我们使用 SentencePiece (Kudo 和 Richardson, 2018) 将文本编码为 WordPiece Token (Sennrich 等, 2015; Kudo, 2018)。对于所有实验,我们使用包含 32,000 个 wordpieces 的词汇表。由于我们最终在英语到德语、法语和罗马尼亚语的翻译任务上微调模型,因此我们的词汇表还需要覆盖这些非英语语言。为此,我们将 C4 所用的 Common Crawl 抓取数据中的页面分类为德语、法语和罗马尼亚语。然后,我们在由 10 份英语 C4 数据与各 1 份被分类为德语、法语或罗马尼亚语的数据混合而成的数据上训练了 SentencePiece 模型。该词汇表在模型的输入和输出之间共享。请注意,我们的词汇表使得模型只能处理预定义的固定语言集合。
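For illustration, such a vocabulary could be trained with the SentencePiece library roughly as below. The file name is hypothetical, and only the library, the 32,000-piece vocabulary size, and the 10:1:1:1 language mixture are specified by the paper; everything else is an assumption.

```python
import sentencepiece as spm

# Assumed input: a text file containing a 10:1:1:1 mixture of English C4 text
# with pages classified as German, French, and Romanian.
spm.SentencePieceTrainer.train(
    input="c4_language_mixture.txt",  # hypothetical path
    model_prefix="t5_vocab",          # hypothetical output prefix
    vocab_size=32000,
)
```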
3.1.4. Unsupervised Objective
3.1.4. 无监督目标
Leveraging unlabeled data to pre-train our model necessitates an objective that does not require labels but (loosely speaking) teaches the model generalizable knowledge that will be useful in downstream tasks. Preliminary work that applied the transfer learning paradigm of pre-training and fine-tuning all of the model’s parameters to NLP problems used a causal language modeling objective for pre-training (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018). However, it has recently been shown that “denoising” objectives (Devlin et al., 2018; Taylor, 1953) (also called “masked language modeling”) produce better performance and as a result they have quickly become standard. In a denoising objective, the model is trained to predict missing or otherwise corrupted tokens in the input. Inspired by BERT’s “masked language modeling” objective and the
利用未标注数据对模型进行预训练需要一个不需要标签的目标函数,但(粗略地说)要教会模型可泛化的知识,这些知识在下游任务中将是有用的。早期的工作将迁移学习范式中的预训练和微调所有模型参数应用于自然语言处理 (NLP) 问题时,使用了因果语言建模目标函数进行预训练 (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018)。然而,最近的研究表明,“去噪”目标函数 (Devlin et al., 2018; Taylor, 1953) (也称为“掩码语言建模”)能够产生更好的性能,因此它们迅速成为标准。在去噪目标函数中,模型被训练来预测输入中缺失或以其他方式损坏的 Token。受 BERT 的“掩码语言建模”目标函数的启发以及


Figure 2: Schematic of the objective we use in our baseline model. In this example, we process the sentence “Thank you for inviting me to your party last week.” The words “for”, “inviting” and “last” (marked with an $\times$ ) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token that is unique to the sequence. The target sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input plus a final sentinel token marking the end of the target.
图 2: 我们在基线模型中使用的目标的示意图。在这个例子中,我们处理句子 “Thank you for inviting me to your party last week.” 单词 “for”、“inviting” 和 “last”(标记为 $\times$)被随机选中进行破坏。每个连续的被破坏 Token 区域都被替换为一个对该序列唯一的哨兵 Token。目标序列由所有被丢弃的 Token 区域组成,以输入中用于替换它们的相同哨兵 Token 分隔,并以一个最终的哨兵 Token 标记目标的结束。
“word dropout” regularization technique (Bowman et al., 2015), we design an objective that randomly samples and then drops out 15% of tokens in the input sequence. All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence. The sentinel IDs are special tokens which are added to our vocabulary and do not correspond to any wordpiece. The target then corresponds to all of the dropped-out spans of tokens, delimited by the same sentinel tokens used in the input sequence plus a final sentinel token to mark the end of the target sequence. Our choices to mask consecutive spans of tokens and only predict dropped-out tokens were made to reduce the computational cost of pre-training. We perform thorough investigation into pre-training objectives in Section 3.3. An example of the transformation resulting from applying this objective is shown in Figure 2. We empirically compare this objective to many other variants in Section 3.3.
采用“词 dropout”正则化技术 (Bowman et al., 2015),我们设计了一个目标函数,该函数随机采样并以 15% 的概率丢弃输入序列中的 Token。所有连续的被丢弃的 Token 区域将被一个单一的哨兵 Token 替代。每个哨兵 Token 被分配一个对序列唯一的 Token ID。这些哨兵 ID 是特殊 Token,它们被添加到我们的词汇表中,并不对应任何词片段。目标则是所有被丢弃的 Token 区域,由输入序列中使用的相同哨兵 Token 分隔,再加上一个最终的哨兵 Token 标记目标序列的结束。我们选择遮蔽连续的 Token 区域并仅预测被丢弃的 Token,是为了降低预训练的计算成本。我们在第 3.3 节对预训练目标进行了深入研究。图 2 展示了应用此目标后转换的一个示例。我们在第 3.3 节中对该目标与许多其他变体进行了实证比较。
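The sketch below illustrates the input/target construction just described on a token-by-token basis. The sentinel naming scheme and the independent per-token sampling are simplifying assumptions (the released preprocessing also controls the average span length), so treat it as a toy rendering rather than the actual implementation.

```python
import random

def span_corruption(tokens, corruption_rate=0.15, seed=None):
    """Toy sketch of the denoising objective: drop ~15% of tokens, replace each
    consecutive dropped span with a unique sentinel, and build the target from
    the dropped spans delimited by the same sentinels plus a final sentinel."""
    rng = random.Random(seed)
    dropped = [rng.random() < corruption_rate for _ in tokens]

    inputs, targets, sentinel_id = [], [], 0
    in_span = False
    for token, drop in zip(tokens, dropped):
        if drop:
            if not in_span:                            # start of a new corrupted span
                sentinel = f"<sentinel_{sentinel_id}>"  # hypothetical sentinel naming
                inputs.append(sentinel)
                targets.append(sentinel)
                sentinel_id += 1
                in_span = True
            targets.append(token)
        else:
            inputs.append(token)
            in_span = False
    targets.append(f"<sentinel_{sentinel_id}>")        # final sentinel ends the target
    return inputs, targets

# Example (cf. Figure 2):
# span_corruption("Thank you for inviting me to your party last week .".split(), seed=0)
```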
3.1.5. Baseline Performance
3.1.5. 基准性能
In this section, we present results using the baseline experimental procedure described above to get a sense of what kind of performance to expect on our suite of downstream tasks. Ideally, we would repeat every experiment in our study multiple times to get a confidence interval on our results. Unfortunately, this would be prohibitively expensive due to the large number of experiments we run. As a cheaper alternative, we train our baseline model 10 times from scratch (i.e. with different random initialization s and data set shuffling) and assume that the variance over these runs of the base model also applies to each experimental variant. We don’t expect most of the changes we make to have a dramatic effect on the inter-run variance, so this should provide a reasonable indication of the significance of different changes. Separately, we also measure the performance of training our model for $2^{18}$ steps (the same number we use for fine-tuning) on all downstream tasks without pre-training. This gives us an idea of how much pre-training benefits our model in the baseline setting.
在本节中,我们使用上述基线实验程序展示结果,以了解在我们的下游任务套件上可以预期的性能。理想情况下,我们会重复研究中的每个实验多次以获得结果的置信区间。不幸的是,由于我们运行的实验数量庞大,这将过于昂贵。作为更便宜的替代方案,我们从头开始训练基线模型 10 次(即使用不同的随机初始化和数据集洗牌),并假设这些基线模型运行之间的方差也适用于每个实验变体。我们不期望所做的大多数更改会对运行间方差产生显著影响,因此这应该能合理地反映不同更改的重要性。此外,我们还测量了在所有下游任务上训练模型 $2^{18}$ 步(与我们用于微调的步数相同)而不进行预训练的性能。这使我们了解在基线设置下预训练对模型的好处有多大。
Table 1: Average and standard deviation of scores achieved by our baseline model and training procedure. For comparison, we also report performance when training on each task from scratch (i.e. without any pre-training) for the same number of steps used to fine-tune the baseline model. All scores in this table (and every table in our paper except Table 14) are reported on the validation sets of each data set.
表 1: 基准模型和训练过程所达到分数的平均值和标准差。为了对比,我们也报告了在相同步数下从头开始训练(即没有任何预训练)的性能。本表中的所有分数(以及本文中除表 14 外的所有表格)均在每个数据集的验证集上报告。
| | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| ★ 基准平均值 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 基准标准差 | 0.235 | 0.065 | 0.343 | 0.416 | 0.112 | 0.090 | 0.108 |
| 无预训练 | 66.22 | 17.60 | 50.31 | 53.04 | 25.86 | 39.77 | 24.04 |
When reporting results in the main text, we only report a subset of the scores across all the benchmarks to conserve space and ease interpretation. For GLUE and SuperGLUE, we report the average score across all subtasks (as stipulated by the official benchmarks) under the headings “GLUE” and “SGLUE”. For all translation tasks, we report the BLEU score (Papineni et al., 2002) as provided by SacreBLEU v1.3.0 (Post, 2018) with “exp” smoothing and “intl” tokenization. We refer to scores for WMT English to German, English to French, and English to Romanian as EnDe, EnFr, and EnRo, respectively. For CNN/Daily Mail, we find the performance of models on the ROUGE-1-F, ROUGE-2-F, and ROUGE-L-F metrics (Lin, 2004) to be highly correlated so we report the ROUGE-2-F score alone under the heading “CNNDM”. Similarly, for SQuAD we find the performance of the “exact match” and “F1” scores to be highly correlated so we report the “exact match” score alone. We provide every score achieved on every task for all experiments in Table 16, Appendix E.
在正文报告结果时,我们仅报告所有基准测试中的一部分分数,以节省空间并便于理解。对于 GLUE 和 SuperGLUE,我们在标题为“GLUE”和“SGLUE”的部分下报告所有子任务的平均分数(按照官方基准的规定)。对于所有翻译任务,我们报告由 SacreBLEU v1.3.0 (Post, 2018) 提供的 BLEU 分数 (Papineni et al., 2002),使用“exp”平滑和“intl”Token 化。我们将 WMT 英语到德语、英语到法语和英语到罗马尼亚语的分数分别称为 EnDe、EnFr 和 EnRo。对于 CNN/Daily Mail,我们发现模型在 ROUGE-1-F、ROUGE-2-F 和 ROUGE-L-F 指标 (Lin, 2004) 上的表现高度相关,因此我们仅在标题为“CNNDM”的部分下报告 ROUGE-2-F 分数。同样地,对于 SQuAD,我们发现“完全匹配”和“F1”分数的表现高度相关,因此我们仅报告“完全匹配”分数。我们在附录 E 的表 16 中提供了所有实验中每个任务的所有得分。
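For reference, a BLEU computation matching the reported setting could look roughly like the following; the variable names are placeholders and the call is a sketch of the SacreBLEU Python API rather than the exact evaluation script used in the paper.

```python
import sacrebleu

# `hypotheses` is an assumed list of detokenized system outputs and
# `references` a parallel list of reference translations.
bleu = sacrebleu.corpus_bleu(
    hypotheses, [references], smooth_method="exp", tokenize="intl"
)
print(bleu.score)
```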
Our results tables are all formatted so that each row corresponds to a particular experimental configuration with columns giving the scores for each benchmark. We will include the mean performance of the baseline configuration in most tables. Wherever a baseline configuration appears, we will mark it with a $\star$ (as in the first row of Table 1). We also will boldface any score that is within two standard deviations of the maximum (best) in a given experiment.
我们的结果表格均格式化为每一行对应一个特定的实验配置,列给出每个基准测试的分数。我们将在大多数表格中包含基线配置的平均性能。每当基线配置出现时,我们将用 $\star$ 标记它(如表 1 的第一行)。我们还会将任何在给定实验中与最大值(最佳)相差不超过两个标准差的分数加粗。
Our baseline results are shown in Table 1. Overall, our results are comparable to existing models of similar size. For example, BERTBASE achieved an exact match score of 80.8 on SQuAD and an accuracy of 84.4 on MNLI-matched, whereas we achieve 80.88 and 84.24, respectively (see Table 16). Note that we cannot directly compare our baseline to BERTBASE because ours is an encoder-decoder model and was pre-trained for roughly $1/4$ as many steps. Unsurprisingly, we find that pre-training provides significant gains across almost all benchmarks. The only exception is WMT English to French, which is a large enough data set that gains from pre-training tend to be marginal. We include this task in our experiments to test the behavior of transfer learning in the high-resource regime. Since we perform early stopping by selecting the best-performing checkpoint, the large disparity between our baseline and “no pre-training” emphasizes how much pre-training improves performance on tasks with limited data. While we do not explicitly measure improvements in data efficiency in this paper, we emphasize that this is one of the primary benefits of the transfer learning paradigm.
我们的基线结果如表 1 所示。总体而言,我们的结果与现有类似规模的模型相当。例如,BERTBASE 在 SQuAD 上达到了 80.8 的精确匹配分数和在 MNLI-matched 上达到了 84.4 的准确率,而我们分别达到了 80.88 和 84.24(见表 16)。请注意,我们不能直接将我们的基线与 BERTBASE 进行比较,因为我们的模型是编码器-解码器结构,并且预训练步数大约为 BERTBASE 的 1/4。不出所料,我们发现预训练在几乎所有基准测试中都提供了显著的提升。唯一的例外是 WMT 英语到法语任务,该数据集足够大,以至于预训练带来的增益通常较小。我们在实验中包含此任务是为了测试高资源环境下的迁移学习行为。由于我们通过选择表现最佳的检查点来进行提前停止,因此我们基线与“无预训练”之间的巨大差异强调了预训练在数据有限的任务上对性能的大幅提升。虽然我们在这篇论文中没有明确测量数据效率的改进,但我们强调这是迁移学习范式的主要优势之一。


Figure 3: Matrices representing different attention mask patterns. The input and output of the self-attention mechanism are denoted $x$ and $y$ respectively. A dark cell at row $i$ and column $j$ indicates that the self-attention mechanism is allowed to attend to input element $j$ at output timestep $i$ . A light cell indicates that the self-attention mechanism is not allowed to attend to the corresponding $i$ and $j$ combination. Left: A fully-visible mask allows the self-attention mechanism to attend to the full input at every output timestep. Middle: A causal mask prevents the $i$ th output element from depending on any input elements from “the future”. Right: Causal masking with a prefix allows the self-attention mechanism to use fully-visible masking on a portion of the input sequence.
图 3: 表示不同注意力掩码模式的矩阵。自注意力机制的输入和输出分别表示为 $x$ 和 $y$ 。第 $i$ 行和第 $j$ 列的深色单元格表示自注意力机制允许在输出时间步 $i$ 处关注输入元素 $j$ 。浅色单元格表示自注意力机制不允许关注对应的 $i$ 和 $j$ 组合。左:完全可见的掩码允许自注意力机制在每个输出时间步关注整个输入。中:因果掩码防止第 $i$ 个输出元素依赖于任何来自“未来”的输入元素。右:带前缀的因果掩码允许自注意力机制在输入序列的一部分上使用完全可见的掩码。
As for inter-run variance, we find that for most tasks the standard deviation across runs is smaller than 1% of the task’s baseline score. Exceptions to this rule include CoLA, CB, and COPA, which are all low-resource tasks from the GLUE and SuperGLUE benchmarks. For example, on CB our baseline model had an average F1 score of 91.22 with a standard deviation of 3.237 (see Table 16), which may be partly due to the fact that CB’s validation set contains only 56 examples. Note that the GLUE and SuperGLUE scores are computed as the average of scores across the tasks comprising each benchmark. As a result, we caution that the high inter-run variance of CoLA, CB, and COPA can make it harder to compare models using the GLUE and SuperGLUE scores alone.
对于运行间方差,我们发现对于大多数任务,各次运行的标准差小于任务基线分数的1%。这一规则的例外包括CoLA、CB和COPA,这些任务都是来自GLUE和SuperGLUE基准测试的低资源任务。例如,在CB上,我们的基线模型平均F1分数为91.22,标准差为3.237(见表 16),这可能部分是由于CB的验证集仅包含56个样本。请注意,GLUE和SuperGLUE分数是根据每个基准测试中各任务分数的平均值计算得出的。因此,我们提醒,CoLA、CB和COPA的高运行间方差可能会使得仅使用GLUE和SuperGLUE分数来比较模型变得更加困难。
3.2. Architectures
3.2. 架构
While the Transformer was originally introduced with an encoder-decoder architecture, much modern work on transfer learning for NLP uses alternative architectures. In this section, we review and compare these architectural variants.
虽然 Transformer 最初是以编码器-解码器架构引入的,但目前自然语言处理 (NLP) 领域的很多迁移学习工作使用了替代架构。在本节中,我们将回顾和比较这些架构变体。
3.2.1. Model Structures
3.2.1. 模型结构
A major distinguishing factor for different architectures is the “mask” used by different attention mechanisms in the model. Recall that the self-attention operation in a Transformer takes a sequence as input and outputs a new sequence of the same length. Each entry of the output sequence is produced by computing a weighted average of entries of the input sequence. Specifically, let $y_{i}$ refer to the $i$ th element of the output sequence and $x_{j}$ refer to the $j$ th entry of the input sequence. $y_{i}$ is computed as $\textstyle\sum_{j}w_{i,j}x_{j}$ , where $w_{i,j}$ is the scalar weight produced by the self-attention mechanism as a function of $x_{i}$ and $x_{j}$ . The attention mask is then used to zero out certain weights in order to constrain which entries of the input can be attended to at a given output timestep. Diagrams of the masks we will consider are shown in Figure 3. For example, the causal mask (Figure 3, middle) sets any $w_{i,j}$ to zero if $j>i$ .
不同架构的主要区别在于模型中使用的不同注意力机制的“掩码”。回顾一下,Transformer 中的自注意力操作以一个序列作为输入,并输出一个相同长度的新序列。输出序列的每个元素是通过对输入序列的元素进行加权平均计算得出的。具体来说,令 $y_{i}$ 表示输出序列的第 $i$ 个元素,$x_{j}$ 表示输入序列的第 $j$ 个元素。$y_{i}$ 计算为 $\textstyle\sum_{j}w_{i,j}x_{j}$,其中 $w_{i,j}$ 是自注意力机制根据 $x_{i}$ 和 $x_{j}$ 计算出的标量权重。注意力掩码用于将某些权重置零,以限制在给定输出时间步可以关注的输入序列中的哪些元素。我们将考虑的掩码图如图 3 所示。例如,因果掩码(图 3,中间)会将任何 $w_{i,j}$ 置为零,如果 $j > i$。
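As an illustration of the three masking patterns in Figure 3, here is a small NumPy sketch that constructs the corresponding binary mask matrices (a 1 at row $i$, column $j$ means the output at timestep $i$ may attend to input element $j$; a 0 means $w_{i,j}$ is zeroed out); `prefix_len` is an illustrative parameter, not a quantity defined in the text.

```python
import numpy as np

def fully_visible_mask(n: int) -> np.ndarray:
    # Every output position may attend to every input position (Figure 3, left).
    return np.ones((n, n), dtype=np.int32)

def causal_mask(n: int) -> np.ndarray:
    # Position i may only attend to positions j <= i, i.e. w_ij = 0 for j > i (Figure 3, middle).
    return np.tril(np.ones((n, n), dtype=np.int32))

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    # Fully-visible over the first `prefix_len` positions, causal afterwards (Figure 3, right).
    mask = causal_mask(n)
    mask[:, :prefix_len] = 1
    return mask

print(prefix_lm_mask(6, prefix_len=3))
```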

Figure 4: Schematics of the Transformer architecture variants we consider. In this diagram, blocks represent elements of a sequence and lines represent attention visibility. Different colored groups of blocks indicate different Transformer layer stacks. Dark grey lines correspond to fully-visible masking and light grey lines correspond to causal masking. We use “.” to denote a special end-of-sequence token that represents the end of a prediction. The input and output sequences are represented as $x$ and $y$ respectively. Left: A standard encoder-decoder architecture uses fully-visible masking in the encoder and the encoder-decoder attention, with causal masking in the decoder. Middle: A language model consists of a single Transformer layer stack and is fed the concatenation of the input and target, using a causal mask throughout. Right: Adding a prefix to a language model corresponds to allowing fully-visible masking over the input.
图 4: 我们考虑的 Transformer 架构变体示意图。在此图中,块表示序列的元素,线表示注意力可见性。不同颜色的块组表示不同的 Transformer 层堆栈。深灰色线对应完全可见的掩码,浅灰色线对应因果掩码。我们使用 “.” 表示一个特殊的序列结束 Token ,它代表预测的结束。输入和输出序列分别表示为 $x$ 和 $y$ 。左:标准的编码器 - 解码器架构在编码器和编码器 - 解码器注意力中使用完全可见的掩码,在解码器中使用因果掩码。中:语言模型由单个 Transformer 层堆栈组成,并输入输入和目标的连接,全程使用因果掩码。右:在语言模型中添加前缀对应于允许在输入上进行完全可见的掩码。
The first model structure we consider is an encoder-decoder Transformer, which consists of two layer stacks: The encoder, which is fed an input sequence, and the decoder, which produces a new output sequence. A schematic of this architectural variant is shown in the left panel of Figure 4.
我们考虑的第一个模型结构是一个编码器-解码器 Transformer,它由两个层堆栈组成:编码器,接收输入序列;解码器,生成新的输出序列。这种架构变体的示意图如图 4 左侧所示。
The encoder uses a “fully-visible” attention mask. Fully-visible masking allows a self-attention mechanism to attend to any entry of the input when producing each entry of its output. We visualize this masking pattern in Figure 3, left. This form of masking is appropriate when attending over a “prefix”, i.e. some context provided to the model that is later used when making predictions. BERT (Devlin et al., 2018) also uses a fully-visible masking pattern and appends a special “classification” token to the input. BERT’s output at the timestep corresponding to the classification token is then used to make a prediction for classifying the input sequence.
编码器使用“完全可见”的注意力掩码。完全可见的掩码允许自注意力机制在生成每个输出项时关注输入的任何项。我们在图 3 (左) 中可视化了这种掩码模式。当对一个“前缀”进行关注时,这种掩码形式是合适的,即提供给模型的一些上下文,在之后用于进行预测时会用到。BERT (Devlin et al., 2018) 也使用完全可见的掩码模式,并在输入中附加一个特殊的“分类”Token。然后使用 BERT 在与分类 Token 对应的时间步的输出来对输入序列进行分类预测。
The self-attention operations in the Transformer’s decoder use a “causal” masking pattern. When producing the $i$ th entry of the output sequence, causal masking prevents the model from attending to the $j$ th entry of the input sequence for $j>i$ . This is used during training so that the model can’t “see into the future” as it produces its output. An attention matrix for this masking pattern is shown in Figure 3, middle.
Transformer 的解码器中的自注意力操作使用“因果”掩码模式。在生成输出序列的第 $i$ 个元素时,因果掩码防止模型关注输入序列中第 $j$ 个($j>i$)元素。这在训练期间使用,以确保模型不能“窥视未来”来生成其输出。这种掩码模式的注意力矩阵如图 3 中间所示。
The decoder in an encoder-decoder Transformer is used to autoregressively produce an output sequence. That is, at each output timestep, a token is sampled from the model’s predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on. As such, a Transformer decoder (without an encoder) can be used as a language model (LM), i.e. a model trained solely for next-step prediction (Liu et al., 2018; Radford et al., 2018; Al-Rfou et al., 2019). This constitutes the second model structure we consider. A schematic of this architecture is shown in Figure 4, middle. In fact, early work on transfer learning for NLP used this architecture with a language modeling objective as a pre-training method (Radford et al., 2018).
编码器-解码器 Transformer 中的解码器用于自回归地生成输出序列。也就是说,在每个输出时间步,从模型预测的分布中采样一个 Token,并将该样本反馈到模型中以生成下一个输出时间步的预测,依此类推。因此,一个 Transformer 解码器(不带编码器)可以用作语言模型 (LM),即仅训练用于下一步预测的模型 (Liu et al., 2018; Radford et al., 2018; Al-Rfou et al., 2019)。这构成了我们考虑的第二种模型结构。该架构的示意图如图 4 中间所示。实际上,早期的 NLP 迁移学习工作使用这种架构和语言建模目标作为预训练方法 (Radford et al., 2018)。
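The sampling loop described above can be sketched as follows; `model` is a hypothetical callable returning a next-token distribution, and details such as the maximum length and stochastic (rather than greedy) sampling are illustrative assumptions rather than the exact decoding procedure used in the paper.

```python
import numpy as np

def sample_autoregressively(model, prompt_ids, eos_id, max_len=64, rng=None):
    """Repeatedly sample the next token from the model's predicted distribution
    and feed it back as input. `model(ids)` is assumed to return a 1-D array of
    next-token probabilities; `eos_id` plays the role of the "." end-of-sequence
    token in Figure 4."""
    rng = rng or np.random.default_rng()
    ids = list(prompt_ids)
    for _ in range(max_len):
        probs = model(ids)                        # hypothetical model call
        next_id = int(rng.choice(len(probs), p=probs))
        ids.append(next_id)                       # feed the sample back in
        if next_id == eos_id:
            break
    return ids
```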
Language models are typically used for compression or sequence generation (Graves, 2013). However, they can also be used in the text-to-text framework simply by concatenating the inputs and targets. As an example, consider the case of English to German translation: If we have a training datapoint with input sentence “That is good.” and target “Das ist gut.”, we would simply train the model on next-step prediction over the concatenated input sequence “translate English to German: That is good. target: Das ist gut.” If we wanted to obtain the model’s prediction for this example, the model would be fed the prefix “translate English to German: That is good. target:” and would be asked to generate the remainder of the sequence autoregressively. In this way, the model can predict an output sequence given an input, which satisfies the needs of text-to-text tasks. This approach was recently used to show that language models can learn to perform some text-to-text tasks without supervision (Radford et al., 2019).
语言模型通常用于压缩或序列生成 (Graves, 2013)。然而,它们也可以通过连接输入和目标在文本到文本框架中使用。例如,考虑英译德的情况:如果我们有一个训练数据点,输入句子为“That is good.”,目标为“Das ist gut.”,我们只需训练模型对连接后的输入序列“translate English to German: That is good. target: Das ist gut.”进行下一步预测。如果我们要获得模型对此示例的预测,我们会给模型提供前缀“translate English to German: That is good. target:”,并要求模型自回归地生成剩余的序列。通过这种方式,模型可以根据输入预测输出序列,满足文本到文本任务的需求。这种方法最近被用来证明语言模型可以在没有监督的情况下学习执行某些文本到文本任务 (Radford et al., 2019)。
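A minimal sketch of the string construction in this example, showing the concatenated sequence used for next-step-prediction training and the prefix fed to the model at prediction time; the helper function names are ours.

```python
def lm_training_sequence(source: str, target: str) -> str:
    # Concatenate input and target into one sequence for next-step prediction.
    return f"translate English to German: {source} target: {target}"

def lm_inference_prefix(source: str) -> str:
    # At prediction time the model only sees the prefix and generates the rest.
    return f"translate English to German: {source} target:"

print(lm_training_sequence("That is good.", "Das ist gut."))
# -> "translate English to German: That is good. target: Das ist gut."
print(lm_inference_prefix("That is good."))
# -> "translate English to German: That is good. target:"
```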
A fundamental and frequently cited drawback of using a language model in the text-to-text setting is that causal masking forces the model’s representation of the $i$ th entry of the input sequence to only depend on the entries up until $i$ . To see why this is potentially disadvantageous, consider the text-to-text framework where the model is provided with a prefix/context before being asked to make predictions (e.g., the prefix is an English sentence and the model is asked to predict the German translation). With fully causal masking, the model’s representation of a prefix state can only depend on prior entries of the prefix. So, when predicting an entry of the output, the model will attend to a representation of the prefix that is unnecessarily limited. Similar arguments have been made against using a unidirectional recurrent neural network encoder in sequence-to-sequence models (Bahdanau et al., 2015).
在文本到文本设置中使用语言模型的一个基本且经常被引用的缺点是,因果掩码迫使模型对输入序列的第 $i$ 个条目的表示仅依赖于直到 $i$ 的条目。要理解为什么这可能是不利的,可以考虑这样一个文本到文本框架:模型在被要求进行预测之前会提供一个前缀/上下文(例如,前缀是一个英语句子,模型被要求预测德语翻译)。在完全因果掩码的情况下,模型对前缀状态的表示只能依赖于前缀的先前条目。因此,在预测输出条目时,模型将关注一个不必要地受限的前缀表示。类似的论点也针对在序列到序列模型中使用单向循环神经网络编码器提出过 (Bahdanau et al., 2015)。
This issue can be avoided in a Transformer-based language model simply by changing the masking pattern. Instead of using a causal mask, we use fully-visible masking during the prefix portion of the sequence. This masking pattern and a schematic of the resulting “prefix LM” (the third model structure we consider) are illustrated in the rightmost panels of Figures 3 and 4, respectively. In the English to German translation example mentioned above, fully-visible masking would be applied to the prefix “translate English to German: That is good. target:” and causal masking would be used during training for predicting the target “Das ist gut.” Using a prefix LM in the text-to-text framework was originally proposed by Liu et al. (2018).
这个问题可以在基于 Transformer 的语言模型中通过改变掩码模式来避免。我们不使用因果掩码,而是在序列的前缀部分使用完全可见的掩码。这种掩码模式和由此产生的“前缀语言模型”(我们考虑的第三种模型结构)分别在图 3 和图 4 的最右侧面板中进行了说明。在上述的英德翻译例子中,完全可见的掩码将应用于前缀 “translate English to German: That is good. target:”,而在训练时用于预测目标 “Das ist gut.” 的则是因果掩码。在文本到文本框架中使用前缀语言模型最初是由 Liu et al. (2018) 提出的。
More recently, Dong et al. (2019) showed that this architecture is effective on a wide variety of text-to-text tasks. This architecture is similar to an encoder-decoder model with parameters shared across the encoder and decoder and with the encoder-decoder attention replaced with full attention across the input and target sequence.
最近,Dong 等 (2019) 表明该架构在各种文本到文本任务上是有效的。该架构类似于编码器-解码器模型,在编码器和解码器之间共享参数,并将编码器-解码器注意力替换为输入序列和目标序列之间的全注意力。
We note that when following our text-to-text framework, the prefix LM architecture closely resembles BERT (Devlin et al., 2018) for classification tasks. To see why, consider an example from the MNLI benchmark where the premise is “I hate pigeons.”, the hypothesis is “My feelings towards pigeons are filled with animosity.” and the correct label is “entailment”. To feed this example into a language model, we would transform it into the sequence “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target: entailment”. In this case, the fully-visible prefix would correspond to the entire input sequence up to the word “target:”, which can be seen as being analogous to the “classification” token used in BERT. So, our model would have full visibility over the entire input, and then would be tasked with making a classification by outputting the word “entailment”. It is easy for the model to learn to output one of the valid class labels given the task prefix (“mnli” in this case). As such, the main difference between a prefix LM and the BERT architecture is that the classifier is simply integrated into the output layer of the Transformer decoder in the prefix LM.
我们注意到,在遵循我们的文本到文本框架时,前缀语言模型架构在分类任务中与 BERT (Devlin et al., 2018) 非常相似。为了理解原因,考虑来自 MNLI 基准的一个例子,其中前提为 “I hate pigeons.”(我讨厌鸽子),假设为 “My feelings towards pigeons are filled with animosity.”(我对鸽子的感情充满了敌意),正确标签为 “entailment”(蕴含)。要将这个例子输入到语言模型中,我们会将其转换为序列 “mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity. target: entailment”。在这种情况下,完全可见的前缀对应于直到单词 “target:” 的整个输入序列,这可以被视为类似于 BERT 中使用的 “classification” Token。因此,我们的模型可以完全看到整个输入,并通过输出单词 “entailment” 来进行分类。鉴于任务前缀(在这个例子中是 “mnli”),模型很容易学习输出一个有效的类别标签。因此,前缀语言模型与 BERT 架构的主要区别在于,分类器被简单地集成到了前缀语言模型的 Transformer 解码器的输出层中。
3.2.2. Comparing Different Model Structures
3.2.2. 比较不同的模型结构
In the interest of experimentally comparing these architectural variants, we would like each model we consider to be equivalent in some meaningful way. We might say that two models are equivalent if they either have the same number of parameters or they require roughly the same amount of computation to process a given (input-sequence, target-sequence) pair. Unfortunately, it is not possible to compare an encoder-decoder model to a language model architecture (comprising a single Transformer stack) according to both of these criteria at the same time. To see why, first note that an encoder-decoder model with $L$ layers in the encoder and $L$ layers in the decoder has approximately the same number of parameters as a language model with $2L$ layers. However, the same $L+L$ encoder-decoder model will have approximately the same computational cost as a language model with only $L$ layers. This is a consequence of the fact that the $L$ layers in the language model must be applied to both the input and output sequence, while the encoder is only applied to the input sequence and the decoder is only applied to the output sequence. Note that these equivalences are approximate—there are some extra parameters in the decoder due to the encoder-decoder attention and there are also some computational costs in the attention layers that are quadratic in the sequence lengths. In practice, however, we observed nearly identical step times for $L$ -layer language models versus $L+L$ -layer encoder-decoder models, suggesting a roughly equivalent computational cost. Further, for the model sizes we consider, the number of parameters in the encoder-decoder attention layers is about 10% of the total parameter count, so we make the simplifying assumption that an $L+L$ -layer encoder-decoder model has the same number of parameters as a $2L$ -layer language model.
为了实验性地比较这些架构变体,我们希望每个考虑的模型在某些有意义的方式上是等价的。我们可以说两个模型是等价的,如果它们要么具有相同的参数数量,要么处理给定 (输入序列, 目标序列) 对所需的计算量大致相同。不幸的是,根据这两个标准同时比较编码器-解码器模型和语言模型架构(由单个 Transformer 堆栈组成)是不可能的。
要理解原因,请首先注意一个编码器有 $L$ 层且解码器也有 $L$ 层的编码器-解码器模型大约与一个有 $2L$ 层的语言模型具有相同的参数数量。然而,同一个 $L+L$ 的编码器-解码器模型将大约与只有 $L$ 层的语言模型具有相同的计算成本。这是由于语言模型中的 $L$ 层必须应用于输入和输出序列,而编码器仅应用于输入序列,解码器仅应用于输出序列。请注意,这些等价关系是近似的——解码器中由于编码器-解码器注意力机制存在一些额外的参数,并且在注意力层中也有一些与序列长度呈二次关系的计算成本。然而,在实践中,我们观察到 $L$ 层语言模型与 $L+L$ 层编码器-解码器模型的每步时间几乎相同,这表明其计算成本大致相当。此外,对于我们考虑的模型规模,编码器-解码器注意力层中的参数数量约占总参数数量的 $10%$,因此我们简化假设 $L+L$ 层的编码器-解码器模型与 $2L$ 层的语言模型具有相同的参数数量。
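The following back-of-the-envelope sketch restates the equivalence argument above under the same simplifying assumptions (encoder-decoder attention parameters and the quadratic attention terms are ignored); the per-layer parameter count and the 2-FLOPs-per-parameter-per-token constant are illustrative choices of ours, not values from the paper.

```python
def approx_params(num_layers, params_per_layer=7_000_000):
    # Parameter count grows roughly linearly with the number of layers.
    return num_layers * params_per_layer

def approx_flops_enc_dec(L, input_len, target_len, params_per_layer=7_000_000):
    # The encoder (L layers) processes only the input; the decoder (L layers) only the target.
    return 2 * params_per_layer * (L * input_len + L * target_len)

def approx_flops_lm(num_layers, input_len, target_len, params_per_layer=7_000_000):
    # A decoder-only stack must process the concatenated input and target.
    return 2 * params_per_layer * num_layers * (input_len + target_len)

L, inp, tgt = 12, 512, 512
# An L+L encoder-decoder has ~2L layers' worth of parameters, matching a 2L-layer LM,
# but its compute matches an L-layer LM:
print(approx_flops_enc_dec(L, inp, tgt) == approx_flops_lm(L, inp, tgt))     # True
print(approx_flops_lm(2 * L, inp, tgt) / approx_flops_enc_dec(L, inp, tgt))  # 2.0
```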
To provide a reasonable means of comparison, we consider multiple configurations for our encoder-decoder model. We will refer to the number of layers and parameters in a BERTBASE-sized layer stack as $L$ and $P$ , respectively. We will use $M$ to refer to the number of FLOPs required for an $L+L$ -layer encoder-decoder model or $L$ -layer decoder-only model to process a given input-target pair. In total, we will compare the following configurations, corresponding to the rows of Table 2: an encoder-decoder model with $L$ layers in the encoder and $L$ layers in the decoder ($2P$ parameters, $M$ FLOPs); the same model with parameters shared between the encoder and decoder ($P$ parameters, $M$ FLOPs); an encoder-decoder model with $L/2$ layers each in the encoder and decoder ($P$ parameters, $M/2$ FLOPs); a decoder-only language model with $L$ layers ($P$ parameters, $M$ FLOPs); and a decoder-only prefix LM with the same architecture but fully-visible masking over the input ($P$ parameters, $M$ FLOPs).
为了提供一个合理的比较手段,我们考虑了编码器-解码器模型的多种配置。我们将 BERTBASE 规模的层堆栈中的层数和参数数量分别表示为 $L$ 和 $P$ 。我们将使用 $M$ 来表示具有 $L+L$ 层的编码器-解码器模型或 $L$ 层的仅解码器模型处理给定输入-目标对所需的 FLOPs 数量。总共,我们将比较以下配置(对应于表 2 的各行):编码器和解码器各有 $L$ 层的编码器-解码器模型($2P$ 参数,$M$ FLOPs);在编码器和解码器之间共享参数的同一模型($P$ 参数,$M$ FLOPs);编码器和解码器各有 $L/2$ 层的编码器-解码器模型($P$ 参数,$M/2$ FLOPs);具有 $L$ 层的仅解码器语言模型($P$ 参数,$M$ FLOPs);以及具有相同架构但对输入使用完全可见掩码的仅解码器前缀语言模型($P$ 参数,$M$ FLOPs)。
3.2.3. Objectives
3.2.3. 目标
As an unsupervised objective, we will consider both a basic language modeling objective as well as our baseline denoising objective described in Section 3.1.4. We include the language modeling objective due to its historic use as a pre-training objective (Dai and Le, 2015; Ramachandran et al., 2016; Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018) as well as its natural fit for the language model architectures we consider. For models that ingest a prefix before making predictions (the encoder-decoder model and prefix LM), we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions. For the standard language model, we train the model to predict the entire span from beginning to end. Our unsupervised denoising objective is designed for text-to-text models; to adapt it for use with a language model we concatenate the inputs and targets as described in Section 3.2.1.
作为无监督目标,我们将考虑基本的语言建模目标以及我们在第 3.1.4 节中描述的基线去噪目标。我们包含语言建模目标是由于其历史上被用作预训练目标 (Dai and Le, 2015; Ramachandran et al., 2016; Howard and Ruder, 2018; Radford et al., 2018; Peters et al., 2018) 以及它与我们考虑的语言模型架构的自然契合。对于在进行预测之前需要输入前缀的模型(编码器-解码器模型和前缀 LM),我们从未标记的数据集中采样一段文本,并随机选择一个点将其分为前缀和目标部分。对于标准语言模型,我们训练模型从头到尾预测整个片段。我们的无监督去噪目标是为文本到文本模型设计的;为了适应语言模型使用,我们将输入和目标连接起来,如第 3.2.1 节所述。
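A small sketch of how input/target pairs for these language modeling objectives could be constructed from an unlabeled span of token IDs: the prefix-style objectives split the span at a random point, while the standard LM objective predicts the whole span; the function names and choice of random generator are ours.

```python
import numpy as np

def prefix_lm_example(token_ids, rng):
    """Split a span of token IDs at a random point into (prefix, target)."""
    split = int(rng.integers(1, len(token_ids)))   # at least one token on each side
    return token_ids[:split], token_ids[split:]

def lm_example(token_ids):
    """Standard LM objective: predict the entire span from beginning to end."""
    return [], list(token_ids)

rng = np.random.default_rng(0)
span = list(range(10))                             # stand-in for a tokenized text span
print(prefix_lm_example(span, rng))
print(lm_example(span))
```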
3.2.4. Results
3.2.4. 结果
The scores achieved by each of the architectures we compare are shown in Table 2. For all tasks, the encoder-decoder architecture with the denoising objective performed best. This variant has the highest parameter count (2 $P$ ) but the same computational cost as the $P$ -parameter decoder-only models. Surprisingly, we found that sharing parameters across the encoder and decoder performed nearly as well. In contrast, halving the number of layers in the encoder and decoder stacks significantly hurt performance. Concurrent work (Lan et al., 2019) also found that sharing parameters across Transformer blocks can be an effective means of lowering the total parameter count without sacrificing much performance. XLNet also bears some resemblance to the shared encoder-decoder approach with a denoising objective (Yang et al., 2019). We also note that the shared parameter encoder-decoder outperforms the decoder-only prefix LM, suggesting that the addition of an explicit encoder-decoder attention is beneficial. Finally, we confirm the widely-held conception that using a denoising objective always results in better downstream task performance compared to a language modeling objective. This observation has been previously made by Devlin et al. (2018), Voita et al. (2019), and Lample and Conneau (2019) among others. We undertake a more detailed exploration of unsupervised objectives in the following section.
我们比较的每种架构所获得的分数如表 2 所示。
对于所有任务,使用去噪目标的编码器-解码器架构表现最佳。这种变体具有最高的参数量 (2 $P$ ),但计算成本与 $P$ 参数的仅解码器模型相同。令人惊讶的是,我们发现跨编码器和解码器共享参数的表现几乎一样好。相反,将编码器和解码器堆栈中的层数减半显著损害了性能。同期工作 (Lan et al., 2019) 也发现,跨 Transformer 模块共享参数可以有效降低总参数量而不牺牲太多性能。XLNet 也与带有去噪目标的共享编码器-解码器方法有某些相似之处 (Yang et al., 2019)。我们还注意到,共享参数的编码器-解码器优于仅解码器前缀语言模型,这表明添加显式的编码器-解码器注意力是有益的。最后,我们确认了一种广泛持有的观点,即使用去噪目标总是比使用语言建模目标在下游任务中表现更好。这一观察结果之前已被 Devlin et al. (2018),Voita et al. (2019),和 Lample and Conneau (2019) 等人提出过。我们在下一节中对无监督目标进行更详细的探讨。
Table 2: Performance of the different architectural variants described in Section 3.2.2. We use $P$ to refer to the number of parameters in a 12-layer base Transformer layer stack and $M$ to refer to the FLOPs required to process a sequence using the encoder-decoder model. We evaluate each architectural variant using a denoising objective (described in Section 3.1.4) and an autoregressive objective (as is commonly used to train language models).
表 2: 不同架构变体的性能,如第 3.2.2 节所述。我们使用 $P$ 表示 12 层基础 Transformer 层堆栈中的参数数量,并使用 $M$ 表示使用编码器-解码器模型处理序列所需的 FLOPs。我们使用去噪目标(如第 3.1.4 节所述)和自回归目标(通常用于训练语言模型)评估每个架构变体。
| 架构 | 目标 | 参数 | 成本 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|---|---|---|
| ★ 编码器-解码器 | 去噪 | 2P | M | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 编码器-解码器,共享 | 去噪 | P | M | 82.81 | 18.78 | 80.63 | 70.73 | 26.72 | 39.03 | 27.46 |
| 编码器-解码器,6 层 | 去噪 | P | M/2 | 80.88 | 18.97 | 77.59 | 68.42 | 26.38 | 38.40 | 26.95 |
| 语言模型 | 去噪 | P | M | 74.70 | 17.93 | 61.14 | 55.02 | 25.09 | 35.28 | 25.86 |
| 前缀 LM | 去噪 | P | M | 81.82 | 18.61 | 78.94 | 68.11 | 26.43 | 37.98 | 27.39 |
| 编码器-解码器 | 语言模型 (LM) | 2P | M | 79.56 | 18.59 | 76.02 | 64.29 | 26.27 | 39.17 | 26.86 |
| 编码器-解码器,共享 | 语言模型 (LM) | P | M | 79.60 | 18.13 | 76.35 | 63.50 | 26.62 | 39.17 | 27.05 |
| 编码器-解码器,6 层 | 语言模型 (LM) | P | M/2 | 78.67 | 18.26 | 75.32 | 64.06 | 26.13 | 38.42 | 26.89 |
| 语言模型 | 语言模型 (LM) | P | M | 73.78 | 17.54 | 53.81 | 56.51 | 25.23 | 34.31 | 25.38 |
| 前缀 LM | 语言模型 (LM) | P | M | 79.68 | 17.84 | 76.87 | 64.86 | 26.28 | 37.51 | 26.76 |
3.3. Unsupervised Objectives
3.3. 无监督目标
The choice of unsupervised objective is of central importance as it provides the mechanism through which the model gains general-purpose knowledge to apply to downstream tasks. This has led to the development of a wide variety of pre-training objectives (Dai and Le, 2015; Ramachandran et al., 2016; Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019b; Wang et al., 2019a; Song et al., 2019; Dong et al., 2019; Joshi et al., 2019). In this section, we perform a procedural exploration of the space of unsupervised objectives. In many cases, we will not replicate an existing objective exactly—some will be modified to fit our text-to-text encoder-decoder framework and, in other cases, we will use objectives that combine concepts from multiple common approaches.
无监督目标的选择至关重要,因为它为模型提供了获取通用知识的机制,以应用于下游任务。这导致了各种各样的预训练目标的开发 (Dai and Le, 2015; Ramachandran et al., 2016; Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019b; Wang et al., 2019a; Song et al., 2019; Dong et al., 2019; Joshi et al., 2019)。在本节中,我们将对无监督目标的空间进行程序化探索。在许多情况下,我们不会完全复制现有的目标——有些将被修改以适应我们的文本到文本编码器-解码器框架,在其他情况下,我们将使用结合了多个常见方法概念的目标。
Overall, all of our objectives ingest a sequence of token IDs corresponding to a tokenized span of text from our unlabeled text data set. The token sequence is processed to produce a (corrupted) input sequence and a corresponding target. Then, the model is trained as usual with maximum likelihood to predict the target sequence. We provide illustrative examples of many of the objectives we consider in Table 3.
总体而言,我们的所有目标都会输入一个与我们未标注文本数据集中的一段标记化文本相对应的 Token ID 序列。Token 序列被处理以生成一个(损坏的)输入序列和相应的目标序列。然后,模型像往常一样通过最大似然方法训练以预测目标序列。我们在表 3 中提供了许多我们考虑的目标的说明性示例。
3.3.1. Disparate High-Level Approaches
3.3.1. 不同的高层方法
To begin with, we compare three techniques that are inspired by commonly-used objectives but differ significantly in their approach. First, we include a basic “prefix language modeling” objective as was used in Section 3.2.3. This technique splits a span of text into two components, one to use as inputs to the encoder and the other to use as a target sequence to be predicted by the decoder.
首先,我们比较三种受常用目标启发但方法上存在显著差异的技术。第一种是包含基本的“前缀语言建模 (prefix language modeling)”目标,如第 3.2.3 节所使用的方法。该技术将一段文本分割成两个部分,一部分作为编码器的输入,另一部分作为目标序列。
| 目标 | 输入 | 输出 |
|---|---|---|
| 前缀语言建模 | Thank you for inviting | me to your party last week . |
| BERT 风格 Devlin et al. (2018) | Thank you <M> <M> me to your party apple week . | (original text) |
| 解洗牌 | party me for your to . last fun you inviting week Thank | (original text) |
| MASS 风格 Song et al. (2019) | Thank you <M> <M> me to your party <M> week . | (original text) |
| 独立同分布噪声,替换片段 | Thank you <X> me to your party <Y> week . | <X> for inviting <Y> last <Z> |
| 独立同分布噪声,删除 Token | Thank you me to your party week . | for inviting last |
| 随机片段 | Thank you <X> to <Y> week . | <X> for inviting me <Y> your party last <Z> |
Table 3: Examples of inputs and targets produced by some of the unsupervised objectives we consider applied to the input text “Thank you for inviting me to your party last week .” Note that all of our objectives process tokenized text. For this particular sentence, all words were mapped to a single token by our vocabulary. We write (original text) as a target to denote that the model is tasked with reconstructing the entire input text.
表 3: 我们考虑的一些无监督目标应用于输入文本 “Thank you for inviting me to your party last week .” 的输入和目标示例。请注意,我们所有的目标都处理分词文本。对于这个特定的句子,所有单词都被我们的词汇映射为单个 Token 。我们写 (原始文本) 作为目标,表示模型的任务是重建整个输入文本。
Table 4: Performance of the three disparate pre-training objectives described in Section 3.3.1.
表 4: 第 3.3.1 节中描述的三种不同的预训练目标的性能。
| 目标 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| 前缀语言建模 | 80.69 | 18.94 | 77.99 | 65.27 | 26.86 | 39.73 | 27.49 |
| BERT-style (Devlin et al., 2018) | 82.96 | 19.17 | 80.65 | 69.85 | 26.78 | 40.03 | 27.41 |
| 反洗牌 | 73.17 | 18.59 | 67.61 | 58.47 | 26.11 | 39.30 | 25.62 |
Second, we consider an objective inspired by the “masked language modeling” (MLM) objective used in BERT (Devlin et al., 2018). MLM takes a span of text and corrupts 15% of the tokens. 90% of the corrupted tokens are replaced with a special mask token and 10% are replaced with a random token. Since BERT is an encoder-only model, its goal during pre-training is to reconstruct masked tokens at the output of the encoder. In the encoder-decoder case, we simply use the entire uncorrupted sequence as the target. Note that this differs from our baseline objective, which uses only the corrupted tokens as targets; we compare these two approaches in Section 3.3.2. Finally, we also consider a basic deshuffling objective as used e.g. in (Liu et al., 2019a) where it was applied to a denoising sequential autoencoder. This approach takes a sequence of tokens, shuffles it, and then uses the original deshuffled sequence as a target. We provide examples of the inputs and targets for these three methods in the first three rows of Table 3.
其次,我们考虑一个受“掩码语言建模” (MLM) 目标启发的目标,该目标在 BERT (Devlin et al., 2018) 中使用。MLM 选取一段文本并破坏其中 15% 的 Token。90% 的被破坏 Token 被替换为特殊的掩码 Token,10% 被替换为随机 Token。由于 BERT 是仅编码器模型,其在预训练期间的目标是通过编码器的输出重建被掩码的 Token。在编码器-解码器的情况下,我们简单地将整个未被破坏的序列作为目标。请注意,这与我们的基线目标不同,后者仅使用被破坏的 Token 作为目标;我们在第 3.3.2 节中比较了这两种方法。最后,我们还考虑了一个基本的去洗牌目标,例如在 (Liu et al., 2019a) 中应用到去噪序列自编码器。这种方法取一串 Token,将其打乱,然后使用原始的未打乱序列作为目标。我们在表 3 的前三行提供了这三种方法的输入和目标示例。
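A sketch of the BERT-style corruption described above, adapted to our encoder-decoder setting where the target is the full uncorrupted sequence: 15% of tokens are corrupted, of which 90% become a mask token and 10% a random token; `MASK_ID` and `vocab_size` are illustrative placeholders rather than values used in the paper.

```python
import numpy as np

MASK_ID = 32000   # illustrative mask-token ID

def bert_style_corrupt(token_ids, vocab_size, rng, corrupt_rate=0.15):
    """Return (corrupted_input, target); the target is the uncorrupted sequence."""
    inputs = list(token_ids)
    for i in range(len(inputs)):
        if rng.random() < corrupt_rate:
            if rng.random() < 0.9:
                inputs[i] = MASK_ID                          # 90%: mask token
            else:
                inputs[i] = int(rng.integers(vocab_size))    # 10%: random token
    return inputs, list(token_ids)

rng = np.random.default_rng(0)
print(bert_style_corrupt([10, 11, 12, 13, 14, 15], vocab_size=32000, rng=rng))
```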
The performance of these three objectives is shown in Table 4. Overall, we find that the BERT-style objective performs best, though the prefix language modeling objective attains similar performance on the translation tasks. Indeed, the motivation for the BERT objective was to outperform language model-based pre-training. The de shuffling objective performs considerably worse than both prefix language modeling and the BERT-style objective.
这三个目标的性能如表 4 所示。总体而言,我们发现 BERT 式 (BERT-style) 目标表现最好,尽管前缀语言建模目标在翻译任务上达到了类似的性能。实际上,BERT 目标的动机是为了超越基于语言模型的预训练。去洗牌目标的表现明显不如前缀语言建模和 BERT 式目标。
Table 5: Comparison of variants of the BERT-style pre-training objective. In the first two variants, the model is trained to reconstruct the original uncorrupted text segment. In the latter two, the model only predicts the sequence of corrupted tokens.
| 目标 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| BERT-style (Devlin et al., 2018) | 82.96 | 19.17 | 80.65 | 69.85 | 26.78 | 40.03 | 27.41 |
| MASS-style (Song et al., 2019) | 82.32 | 19.16 | 80.10 | 69.28 | 26.79 | 39.89 | 27.55 |
| ★ 替换损坏的片段 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 删除损坏的 Token | 84.44 | 19.31 | 80.52 | 68.67 | 27.07 | 39.76 | 27.82 |
表 5: BERT-style 预训练目标变体的比较。在前两个变体中,模型被训练以重建原始未损坏的文本片段。在后两个变体中,模型仅预测损坏的 Token 序列。
3.3.2. Simplifying the BERT Objective
3.3.2. 简化 BERT 目标
Based on the results in the prior section, we will now focus on exploring modifications to the BERT-style denoising objective. This objective was originally proposed as a pre-training technique for an encoder-only model trained for classification and span prediction. As such, it may be possible to modify it so that it performs better or is more efficient in our encoder-decoder text-to-text setup.
基于前一节的结果,我们现在将专注于探索对 BERT 式去噪目标的修改。这个目标最初被提出作为一种仅编码器模型的预训练技术,该模型用于分类和跨度预测。因此,有可能对其进行修改,使其在我们的编码器-解码器文本到文本设置中表现更好或更高效。
First, we consider a simple variant of the BERT-style objective where we don’t include the random token swapping step. The resulting objective simply replaces $15%$ of the tokens in the input with a mask token and the model is trained to reconstruct the original uncorrupted sequence. A similar masking objective was used by Song et al. (2019) where it was referred to as “MASS”, so we call this variant the “MASS-style” objective. Second, we were interested to see if it was possible to avoid predicting the entire uncorrupted text span since this requires self-attention over long sequences in the decoder. We consider two strategies to achieve this: First, instead of replacing each corrupted token with a mask token, we replace the entirety of each consecutive span of corrupted tokens with a unique mask token. Then, the target sequence becomes the concatenation of the “corrupted” spans, each prefixed by the mask token used to replace it in the input. This is the pre-training objective we use in our baseline, described in Section 3.1.4. Second, we also consider a variant where we simply drop the corrupted tokens from the input sequence completely and task the model with reconstructing the dropped tokens in order. Examples of these approaches are shown in the fifth and sixth rows of Table 3.
首先,我们考虑一个简单的 BERT-style 目标变体,在这个变体中我们不包括随机 Token 交换步骤。得到的目标简单地将输入中的 15% 的 Token 替换为掩码 Token,并训练模型以重建原始未损坏的序列。Song 等人 (2019) 使用了一个类似的掩码目标,他们称之为“MASS”,因此我们将这种变体称为“MASS-style”目标。其次,我们有兴趣研究是否可以避免预测整个未损坏的文本片段,因为这需要在解码器中对长序列进行自注意力操作。我们考虑了两种策略来实现这一点:第一种策略是,不是用掩码 Token 替换每个损坏的 Token,而是用唯一的掩码 Token 替换每个连续的损坏 Token 区域。然后,目标序列成为“损坏”区域的连接,每个区域前缀为用于替换输入中该区域的掩码 Token。这是我们在第 3.1.4 节中描述的基线使用的预训练目标。第二种策略是,我们还考虑了一种变体,即完全从输入序列中删除损坏的 Token,并要求模型按顺序重建这些被删除的 Token。这些方法的例子显示在表 3 的第五行和第六行。
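To make the two shorter-target variants concrete, here is a sketch of the input/target construction once the corrupted positions have been chosen: consecutive corrupted tokens are collapsed into a single sentinel per span (our baseline objective), or dropped entirely; for readability the example operates on word strings and uses -1 and -2 as stand-ins for the unique sentinel tokens.

```python
def replace_corrupted_spans(token_ids, corrupted, sentinel_ids):
    """`corrupted[i]` says whether token i was chosen for corruption. Consecutive
    corrupted tokens are collapsed into one sentinel in the input; the target lists
    each sentinel followed by the tokens it replaced."""
    inputs, targets = [], []
    span, prev_bad = -1, False
    for tok, bad in zip(token_ids, corrupted):
        if bad:
            if not prev_bad:                      # start of a new corrupted span
                span += 1
                inputs.append(sentinel_ids[span])
                targets.append(sentinel_ids[span])
            targets.append(tok)
        else:
            inputs.append(tok)
        prev_bad = bad
    return inputs, targets

def drop_corrupted_tokens(token_ids, corrupted):
    """Drop corrupted tokens from the input; the target is the dropped tokens in order."""
    inputs = [t for t, bad in zip(token_ids, corrupted) if not bad]
    targets = [t for t, bad in zip(token_ids, corrupted) if bad]
    return inputs, targets

# The Table 3 sentence with "for inviting" and "last" corrupted:
toks = ["Thank", "you", "for", "inviting", "me", "to", "your", "party", "last", "week", "."]
bad  = [False, False, True, True, False, False, False, False, True, False, False]
print(replace_corrupted_spans(toks, bad, sentinel_ids=[-1, -2]))
# inputs:  ['Thank', 'you', -1, 'me', 'to', 'your', 'party', -2, 'week', '.']
# targets: [-1, 'for', 'inviting', -2, 'last']
print(drop_corrupted_tokens(toks, bad))
```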
An empirical comparison of the original BERT-style objective to these three alternatives is shown in Table 5. We find that in our setting, all of these variants perform similarly. The only exception was that dropping corrupted tokens completely produced a small improvement in the GLUE score thanks to a significantly higher score on CoLA (60.04, compared to our baseline average of 53.84, see Table 16). This may be due to the fact that CoLA involves classifying whether a given sentence is grammatically and syntactically acceptable, and being able to determine when tokens are missing is closely related to detecting acceptability. However, dropping tokens completely performed worse than replacing them with sentinel tokens on SuperGLUE. The two variants that do not require predicting the full original sequence (“replace corrupted spans” and “drop corrupted spans”) are both potentially attractive since they make the target sequences shorter and consequently make training faster. Going forward, we will explore variants where we replace corrupted spans with sentinel tokens and only predict the corrupted tokens (as in our baseline objective).
原始 BERT 风格目标与这三种替代方案的实证比较如表 5 所示。我们发现,在我们的设置中,所有这些变体表现相似。唯一的例外是完全删除损坏的 Token 产生了小幅度的 GLUE 得分提升,这得益于 CoLA 上显著更高的得分(60.04,相比之下我们基线平均得分为 53.84,见表 16)。这可能是由于 CoLA 涉及判断给定句子在语法和句法上是否可接受,而能够确定 Token 是否缺失与检测可接受性密切相关。然而,完全删除 Token 在 SuperGLUE 上的表现不如用哨兵 Token 替换它们。不需要预测完整原始序列的两个变体(“替换损坏片段”和“删除损坏片段”)都具有吸引力,因为它们使目标序列更短,从而加快训练速度。未来,我们将探索用哨兵 Token 替换损坏片段并仅预测损坏的 Token 的变体(如我们基线目标所示)。
Table 6: Performance of the i.i.d. corruption objective with different corruption rates.
表 6: 不同损坏率的 i.i.d. 损坏目标的性能。
| 损坏率 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| 10% | 82.82 | 19.00 | 80.38 | 69.55 | 26.87 | 39.28 | 27.44 |
| ★ 15% | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 25% | 83.00 | 19.54 | 80.96 | 70.48 | 27.04 | 39.83 | 27.47 |
| 50% | 81.27 | 19.32 | 79.80 | 70.33 | 27.01 | 39.90 | 27.49 |
3.3.3. Varying the Corruption Rate
3.3.3. 改变损坏率
So far, we have been corrupting $15%$ of the tokens, the value used in BERT (Devlin et al., 2018). Again, since our text-to-text framework differs from BERT’s, we are interested to see if a different corruption rate works better for us. We compare corruption rates of $10%$ , $15%$ , $25%$ , and $50%$ in Table 6. Overall, we find that the corruption rate had a limited effect on the model’s performance. The only exception is that the largest corruption rate we consider $(50%)$ results in a significant degradation of performance on GLUE and SQuAD. Using a larger corruption rate also results in longer targets, which can potentially slow down training. Based on these results and the historical precedent set by BERT, we will use a corruption rate of $15%$ going forward.
到目前为止,我们已经破坏了 $15%$ 的 Token,这是 BERT (Devlin et al., 2018) 使用的比例。再次强调,由于我们的文本到文本框架与 BERT 不同,我们有兴趣看看不同的破坏率是否对我们更有效。我们在表 6 中比较了 $10%$、$15%$、$25%$ 和 $50%$ 的破坏率。总体而言,我们发现破坏率对模型性能的影响有限。唯一的例外是我们考虑的最大破坏率 $(50%)$ 导致 GLUE 和 SQuAD 上的性能显著下降。使用更大的破坏率也会导致目标序列更长,这可能会减慢训练速度。基于这些结果以及 BERT 设定的历史先例,我们将继续使用 $15%$ 的破坏率。
3.3.4. Corrupting Spans
3.3.4. 损坏片段 (Corrupting Spans)
We now turn towards the goal of speeding up training by predicting shorter targets. The approach we have used so far makes an i.i.d. decision for each input token as to whether to corrupt it or not. When multiple consecutive tokens have been corrupted, they are treated as a “span” and a single unique mask token is used to replace the entire span. Replacing entire spans with a single token results in unlabeled text data being processed into shorter sequences. Since we are using an i.i.d. corruption strategy, it is not always the case that a significant number of corrupted tokens appear consecutively. As a result, we might obtain additional speedup by specifically corrupting spans of tokens rather than corrupting individual tokens in an i.i.d. manner. Corrupting spans was also previously considered as a pre-training objective for BERT, where it was found to improve performance (Joshi et al., 2019).
我们现在转向通过预测更短的目标来加速训练的目标。到目前为止,我们使用的方法是对每个输入 Token 是否要破坏它进行独立同分布 (i.i.d.) 决策。当多个连续的 Token 被破坏时,它们被视为一个“片段”,并使用一个唯一的掩码 Token 来替换整个片段。用单个 Token 替换整个片段的结果是未标注的文本数据被处理成更短的序列。由于我们使用的是独立同分布的破坏策略,并不总是会有大量连续的破坏 Token 出现。因此,通过专门破坏 Token 片段而不是以独立同分布的方式破坏单个 Token,我们可能会获得额外的加速。破坏片段之前也被考虑作为 BERT 的预训练目标,在那里发现它可以提高性能 (Joshi et al., 2019)。
To test this idea, we consider an objective that specifically corrupts contiguous, randomly-spaced spans of tokens. This objective can be parametrized by the proportion of tokens to be corrupted and the total number of corrupted spans. The span lengths are then chosen randomly to satisfy these specified parameters. For example, if we are processing a sequence of 500 tokens and we have specified that 15% of tokens should be corrupted and that there should be 25 total spans, then the total number of corrupted tokens would be $500\times0.15=75$ and the average span length would be $75/25=3$ . Note that given the original sequence length and corruption rate, we can equivalently parametrize this objective either by the average span length or the total number of spans.
为了测试这个想法,我们考虑一个特定的目标,该目标会破坏连续的、随机间隔的 Token 跨度。这个目标可以通过要破坏的 Token 比例和总破坏跨度数来参数化。然后跨度长度将被随机选择以满足这些指定的参数。例如,如果我们正在处理一个包含 500 个 Token 的序列,并且我们指定了 15% 的 Token 应该被破坏,并且应该有 25 个总跨度,那么被破坏的 Token 总数将是 $500\times0.15=75$,平均跨度长度将是 $75/25=3$。请注意,给定原始序列长度和破坏率,我们可以等效地通过平均跨度长度或总跨度数来参数化此目标。
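The parametrization described above amounts to the following arithmetic, shown here as a small helper that reproduces the 500-token example (75 corrupted tokens in 25 spans of average length 3):

```python
def span_corruption_params(seq_len, corruption_rate=0.15, mean_span_len=3.0):
    """Given a corruption rate and a mean span length, return the total number of
    corrupted tokens and the number of spans."""
    num_corrupted = round(seq_len * corruption_rate)
    num_spans = round(num_corrupted / mean_span_len)
    return num_corrupted, num_spans

print(span_corruption_params(500, 0.15, 3.0))   # (75, 25)
```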
| 跨度长度 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| ★ 基线 (i.i.d.) | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 2 | 83.54 | 19.39 | 82.09 | 72.20 | 26.76 | 39.99 | 27.63 |
| 3 | 83.49 | 19.62 | 81.84 | 72.53 | 26.86 | 39.65 | 27.62 |
| 5 | 83.40 | 19.24 | 82.05 | 72.23 | 26.88 | 39.40 | 27.53 |
| 10 | 82.85 | 19.33 | 81.84 | 70.44 | 26.79 | 39.49 | 27.69 |
Table 7: Performance of the span-corruption objective (inspired by Joshi et al. (2019)) for different average span lengths. In all cases, we corrupt $15%$ of the original text sequence.
表 7: 不同平均跨度长度的 span-corruption 目标 (受 Joshi et al. (2019) 启发) 的性能。在所有情况下,我们破坏了原始文本序列的 15% 。


Figure 5: A flow chart of our exploration of unsupervised objectives. We first consider a few disparate approaches in Section 3.3.1 and find that a BERT-style denoising objective performs best. Then, we consider various methods for simplifying the BERT objective so that it produces shorter target sequences in Section 3.3.2. Given that replacing dropped-out spans with sentinel tokens performs well and results in short target sequences, in Section 3.3.3 we experiment with different corruption rates. Finally, we evaluate an objective that intentionally corrupts contiguous spans of tokens in Section 3.3.4.
图 5: 我们探索无监督目标的流程图。我们首先在第 3.3.1 节考虑了几种不同的方法,并发现 BERT 风格的去噪目标表现最佳。然后,我们在第 3.3.2 节考虑了多种简化 BERT 目标的方法,使其生成更短的目标序列。鉴于用哨兵 Token 替换丢弃的片段表现良好且结果为目标序列较短,在第 3.3.3 节中我们尝试了不同的损坏率。最后,我们在第 3.3.4 节评估了一个故意损坏连续 Token 片段的目标。
We compare the span-corruption objective to the i.i.d.-corruption objective in Table 7. We use a corruption rate of 15% in all cases and compare using average span lengths of 2, 3, 5 and 10. Again, we find a limited difference between these objectives, though the version with an average span length of 10 slightly underperforms the other values in some cases. We also find in particular that using an average span length of 3 slightly (but significantly) outperforms the i.i.d. objective on most non-translation benchmarks. Fortunately, the span-corruption objective also provides some speedup during training compared to the i.i.d. noise approach because span corruption produces shorter sequences on average.
我们在表 7 中比较了 span-corruption 目标和 i.i.d-corruption 目标。在所有情况下,我们使用 15% 的 corruption 率,并使用平均 span 长度为 2、3、5 和 10 进行比较。再次,我们发现这些目标之间的差异有限,尽管在某些情况下,平均 span 长度为 10 的版本表现略逊于其他值。我们还特别发现,使用平均 span 长度为 3 的方法在大多数非翻译基准测试中略微(但显著)优于 i.i.d. 目标。幸运的是,与 i.i.d. 噪声方法相比,span-corruption 目标在训练过程中也提供了一些加速,因为 span 腐蚀产生的序列较短。
3.3.5. Discussion
3.3.5. 讨论
Figure 5 shows a flow chart of the choices made during our exploration of unsupervised objectives. Overall, the most significant difference in performance we observed was that denoising objectives outperformed language modeling and deshuffling for pre-training. We did not observe a remarkable difference across the many variants of the denoising objectives we explored. However, different objectives (or parameterizations of objectives) can lead to different sequence lengths and thus different training speeds. This implies that choosing among the denoising objectives we considered here should mainly be done according to their computational cost. Our results also suggest that additional exploration of objectives similar to the ones we consider here may not lead to significant gains for the tasks and model we consider. Instead, it may be fortuitous to explore entirely different ways of leveraging unlabeled data.
图 5 显示了在探索无监督目标过程中所做的选择的流程图。总体而言,我们观察到性能上最显著的差异是去噪目标优于语言建模和解洗牌用于预训练。我们没有观察到在我们探索的许多去噪目标变体之间有显著差异。然而,不同的目标(或目标的参数化)可能导致不同的序列长度,从而导致不同的训练速度。这意味着在我们这里考虑的去噪目标之间进行选择时,主要应根据其计算成本来进行。我们的结果还表明,对我们这里考虑的任务和模型,进一步探索类似的目标可能不会带来显著的收益。相反,探索完全不同的利用未标记数据的方法可能是有利的。
3.4. Pre-training Data set
3.4. 预训练数据集
Like the unsupervised objective, the pre-training data set itself is a crucial component of the transfer learning pipeline. However, unlike objectives and benchmarks, new pre-training data sets are usually not treated as significant contributions on their own and are often not released alongside pre-trained models and code. Instead, they are typically introduced in the course of presenting a new method or model. As a result, there has been relatively little comparison of different pre-training data sets as well as a lack of a “standard” data set used for pre-training. Some recent notable exceptions (Baevski et al., 2019; Liu et al., 2019c; Yang et al., 2019) have compared pre-training on a new large (often Common Crawl-sourced) data set to using a smaller preexisting data set (often Wikipedia). To probe more deeply into the impact of the pre-training data set on performance, in this section we compare variants of our C4 data set and other potential sources of pre-training data. We release all of the C4 data set variants we consider as part of TensorFlow Datasets.11
与无监督目标一样,预训练数据集本身是迁移学习管道的关键组成部分。然而,与目标和基准不同,新的预训练数据集通常不被视为独立的重要贡献,并且通常不会与预训练模型和代码一起发布。相反,它们通常是在介绍新方法或模型的过程中引入的。因此,不同预训练数据集之间的比较相对较少,也没有一个用于预训练的“标准”数据集。一些最近值得注意的例外 (Baevski et al., 2019; Liu et al., 2019c; Yang et al., 2019) 比较了在新的大型(通常是 Common Crawl 来源)数据集上进行预训练与使用较小的现有数据集(通常是 Wikipedia)进行预训练的效果。为了更深入地探究预训练数据集对性能的影响,在本节中我们将比较我们 C4 数据集的不同版本以及其他潜在的预训练数据来源。我们将所有考虑的 C4 数据集版本作为 TensorFlow Datasets 的一部分发布。
3.4.1. Unlabeled Data Sets
3.4.1. 未标注数据集
In creating C4, we developed various heuristics to filter the web-extracted text from Common Crawl (see Section 2.2 for a description). We are interested in measuring whether this filtering results in improved performance on downstream tasks, in addition to comparing it to other filtering approaches and common pre-training data sets. Towards this end, we compare the performance of our baseline model after pre-training on the following data sets:
在创建 C4 时,我们开发了多种启发式方法来过滤从 Common Crawl 提取的文本(详见第 2.2 节)。我们感兴趣的是测量这种过滤是否能在下游任务中提高性能,此外还要将其与其他过滤方法和常见的预训练数据集进行比较。为此,我们比较了基线模型在以下数据集上预训练后的性能:
C4 As a baseline, we first consider pre-training on our proposed unlabeled data set as described in Section 2.2.
C4 作为基线,我们首先考虑在第 2.2 节中描述的我们提出的未标注数据集上进行预训练。
Unfiltered C4 To measure the effect of the heuristic filtering we used in creating C4 (deduplication, removing bad words, only retaining sentences, etc.), we also generate an alternate version of C4 that forgoes this filtering. Note that we still use langdetect to extract English text. As a result, our “unfiltered” variant still includes some filtering because langdetect sometimes assigns a low probability to non-natural English text.
未过滤的 C4
为了测量我们在创建 C4 时使用的启发式过滤(去重、移除不良词汇、仅保留句子等)的效果,我们还生成了一个不进行这种过滤的 C4 替代版本。请注意,我们仍然使用 langdetect 来提取英文文本。因此,我们的“未过滤”变体仍然包含一些过滤,因为 langdetect 有时会对非自然的英文文本分配较低的概率。
RealNews-like Recent work has used text data extracted from news websites (Zellers et al., 2019; Baevski et al., 2019). To compare to this approach, we generate another unlabeled data set by additionally filtering C4 to only include content from one of the domains used in the “RealNews” data set (Zellers et al., 2019). Note that for ease of comparison, we retain the heuristic filtering methods used in C4; the only difference is that we have ostensibly omitted any non-news content.
类似 RealNews 近期工作使用了从新闻网站提取的文本数据 (Zellers et al., 2019; Baevski et al., 2019)。为了与这种方法进行比较,我们通过进一步过滤 C4,仅保留 “RealNews” 数据集 (Zellers et al., 2019) 所使用域名之一的内容,生成了另一个未标注的数据集。请注意,为了便于比较,我们保留了 C4 中使用的启发式过滤方法;唯一的区别是我们名义上省略了所有非新闻内容。
Table 8: Performance resulting from pre-training on different data sets. The first four variants are based on our new C4 data set.
表 8: 在不同数据集上预训练的结果。前四个变体基于我们新的 C4 数据集。
| 数据集 | 大小 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|---|
| ★ C4 | 745GB | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| C4, unfiltered | 6.1TB | 81.46 | 19.14 | 78.78 | 68.04 | 26.55 | 39.34 | 27.21 |
| RealNews-like | 35GB | 83.83 | 19.23 | 80.39 | 72.38 | 26.75 | 39.90 | 27.48 |
| WebText-like | 17GB | 84.03 | 19.31 | 81.42 | 71.40 | 26.80 | 39.74 | 27.59 |
| Wikipedia | 16GB | 81.85 | 19.31 | 81.29 | 68.01 | 26.94 | 39.69 | 27.67 |
| Wikipedia + TBC | 20GB | 83.65 | 19.28 | 82.08 | 73.24 | 26.77 | 39.63 | 27.57 |
WebText-like Similarly, the WebText data set (Radford et al., 2019) only uses content from webpages that were submitted to the content aggregation website Reddit and received a “score” of at least 3. The score for a webpage submitted to Reddit is computed based on the proportion of users who endorse (upvote) or oppose (downvote) the webpage. The idea behind using the Reddit score as a quality signal is that users of the site would only upvote high-quality text content. To generate a comparable data set, we first tried removing all content from C4 that did not originate from a URL that appeared in the list prepared by the OpenWebText effort. However, this resulted in comparatively little content—only about 2 GB—because most pages never appear on Reddit. Recall that C4 was created based on a single month of Common Crawl data. To avoid using a prohibitively small data set, we therefore downloaded 12 months of data from Common Crawl from August 2018 to July 2019, applied our heuristic filtering for C4, then applied the Reddit filter. This produced a 17 GB WebText-like data set, which is of comparable size to the original 40GB WebText data set (Radford et al., 2019).
类似 WebText 同样地,WebText 数据集 (Radford et al., 2019) 仅使用了提交到内容聚合网站 Reddit 且至少获得 3 分“评分”的网页内容。Reddit 上提交的网页评分是根据用户对网页的支持(点赞)或反对(点踩)的比例计算得出的。使用 Reddit 评分作为质量信号的想法是,只有高质量的文本内容才会得到用户的点赞。为了生成一个可比较的数据集,我们首先尝试移除 C4 中所有未出现在 OpenWebText 项目列表中的 URL 所对应的网页内容。然而,这导致了相对较少的内容——仅有大约 2 GB——因为大多数页面从未出现在 Reddit 上。回想一下,C4 是基于一个月的 Common Crawl 数据创建的。为了避免使用过小的数据集,我们因此下载了从 2018 年 8 月到 2019 年 7 月共 12 个月的 Common Crawl 数据,应用了 C4 的启发式过滤,然后应用了 Reddit 过滤器。这产生了 17 GB 类似 WebText 的数据集,其大小与原始 40 GB 的 WebText 数据集 (Radford et al., 2019) 相当。
Wikipedia The website Wikipedia consists of millions of encyclopedia articles written collaboratively. The content on the site is subject to strict quality guidelines and therefore has been used as a reliable source of clean and natural text. We use the English Wikipedia text data from TensorFlow Datasets,13 which omits any markup or reference sections from the articles.
维基百科网站由数百万篇协作撰写的百科全书文章组成。网站上的内容受严格的质量指南约束,因此一直被用作可靠、干净且自然的文本来源。我们使用来自 TensorFlow Datasets 的英文维基百科文本数据,13 这些数据省略了文章中的任何标记或参考部分。
Wikipedia + Toronto Books Corpus A drawback of using pre-training data from Wikipedia is that it represents only one possible domain of natural text (encyclopedia articles). To mitigate this, BERT (Devlin et al., 2018) combined data from Wikipedia with the Toronto Books Corpus (TBC) (Zhu et al., 2015). TBC contains text extracted from eBooks, which represents a different domain of natural language. BERT’s popularity has led to the Wikipedia $+$ TBC combination being used in many subsequent works.
维基百科 + 多伦多书籍语料库
使用来自维基百科的预训练数据的一个缺点是,它仅代表自然文本的一种可能领域(百科全书文章)。为了解决这个问题,BERT (Devlin et al., 2018) 将来自维基百科的数据与多伦多书籍语料库 (TBC) (Zhu et al., 2015) 结合。TBC 包含从电子书中提取的文本,代表了自然语言的另一个领域。BERT 的流行使得维基百科 + TBC 的组合在许多后续工作中被采用。
The results achieved after pre-training on each of these data sets are shown in Table 8. A first obvious takeaway is that removing the heuristic filtering from C4 uniformly degrades performance and makes the unfiltered variant perform the worst in every task. Beyond this, we found that in some cases a pre-training data set with a more constrained domain outperformed the diverse C4 data set. For example, using the Wikipedia $+$ TBC corpus produced a SuperGLUE score of 73.24, beating our baseline’s score (using C4) of 71.36. This is almost entirely attributable to a boost in performance from 25.78 (baseline, C4) to 50.93 (Wikipedia $+$ TBC) on the Exact Match score for MultiRC (see Table 16). MultiRC is a reading comprehension data set whose largest source of data comes from fiction books, which is exactly the domain covered by TBC. Similarly, using the RealNews-like data set for pre-training conferred an increase from 68.16 to 73.72 on the Exact Match score for ReCoRD, a data set that measures reading comprehension on news articles. As a final example, using data from Wikipedia produced significant (but less dramatic) gains on SQuAD, which is a question-answering data set with passages sourced from Wikipedia. Similar observations have been made in prior work, e.g. Beltagy et al. (2019) found that pre-training BERT on text from research papers improved its performance on scientific tasks. The main lesson behind these findings is that pre-training on in-domain unlabeled data can improve performance on downstream tasks. This is unsurprising but also unsatisfying if our goal is to pre-train a model that can rapidly adapt to language tasks from arbitrary domains. Liu et al. (2019c) also observed that pre-training on a more diverse data set yielded improvements on downstream tasks. This observation also motivates the parallel line of research on domain adaptation for natural language processing; for surveys of this field see e.g. Ruder (2019); Li (2012).
在每个数据集上预训练后取得的结果如表 8 所示。一个明显的结论是,从 C4 中移除启发式过滤会一致地降低性能,并使未过滤的变体在每个任务中表现最差。除此之外,我们发现,在某些情况下,具有更受限领域的预训练数据集比多样化的 C4 数据集表现更好。例如,使用 Wikipedia + TBC 语料库产生的 SuperGLUE 得分为 73.24,超过了我们基线(使用 C4)的得分 71.36。这几乎完全归因于 MultiRC 的精确匹配得分从 25.78(基线,C4)提升到 50.93(Wikipedia + TBC)(见表 16)。MultiRC 是一个阅读理解数据集,其最大的数据来源是小说书籍,这正是 TBC 所涵盖的领域。同样,使用类似 RealNews 的数据集进行预训练使得 ReCoRD 的精确匹配得分从 68.16 提升到 73.72,ReCoRD 是一个衡量新闻文章阅读理解的数据集。作为最后一个例子,使用来自 Wikipedia 的数据在 SQuAD 上产生了显著(但不那么戏剧性)的增益,SQuAD 是一个从 Wikipedia 获取段落的问题回答数据集。类似的现象在先前的工作中也有观察到,例如 Beltagy 等 (2019) 发现,在研究论文文本上预训练 BERT 可以提高其在科学任务上的性能。这些发现的主要教训是,在领域内未标注的数据上进行预训练可以提高下游任务的性能。这并不令人惊讶,但如果我们的目标是预训练一个能够快速适应任意领域语言任务的模型,这也是不令人满意的。Liu 等 (2019c) 也观察到,在更多样化的数据集上进行预训练可以在下游任务上带来改进。这一观察也推动了自然语言处理领域自适应的平行研究;关于该领域的综述,请参见 Ruder (2019);Li (2012)。
A drawback to only pre-training on a single domain is that the resulting data sets are often substantially smaller. Similarly, while the WebText-like variant performed as well or better than the C4 data set in our baseline setting, the Reddit-based filtering produced a data set that was about $40\times$ smaller than C4 despite being based on $12\times$ more data from Common Crawl. Note, however, that in our baseline setup we only pre-train on $2^{35}\approx34\mathrm{B}$ tokens, which is only about 8 times larger than the smallest pre-training data set we consider. We investigate at what point using a smaller pre-training data set poses an issue in the following section.
仅在一个领域进行预训练的缺点是,生成的数据集通常会小得多。同样,虽然类似 WebText 的变体在我们的基准设置中表现得与 C4 数据集一样好或更好,但基于 Reddit 的过滤产生的数据集比 C4 小约 40 倍,尽管它基于来自 Common Crawl 的 12 倍更多的数据。然而,请注意,在我们的基准设置中,我们只预训练了 $2^{35}\approx34\mathrm{B}$ 个 Token,这仅比我们考虑的最小预训练数据集大 8 倍左右。我们在以下部分研究使用较小的预训练数据集在什么情况下会成为一个问题。
3.4.2. Pre-training Data set Size
3.4.2. 预训练数据集大小
The pipeline we use to create C4 was designed to be able to create extremely large pretraining data sets. The access to so much data allows us to pre-train our models without repeating examples. It is not clear whether repeating examples during pre-training would be helpful or harmful to downstream performance because our pre-training objective is itself stochastic and can help prevent the model from seeing the same exact data multiple times.
我们用于创建 C4 的管道旨在能够创建极其庞大的预训练数据集。访问如此多的数据使我们能够在不重复示例的情况下预训练模型。目前尚不清楚在预训练期间重复示例是有益还是有害于下游性能,因为我们的预训练目标本身是随机的,可以帮助防止模型多次看到完全相同的数据。
To test the effect of limited unlabeled data set sizes, we pre-trained our baseline model on artificially truncated versions of C4. Recall that we pre-train our baseline model on $2^{35}\approx34\mathrm{B}$ tokens (a small fraction of the total size of C4). We consider training on truncated variants of C4 consisting of $2^{29}$ , $2^{27}$ , $2^{25}$ and $2^{23}$ tokens. These sizes correspond to repeating the data set 64, 256, 1,024, and 4,096 times respectively over the course of pre-training.
为了测试有限未标注数据集大小的影响,我们在人工截断版本的 C4 上预训练了我们的基准模型。回想一下,我们在大约 $2^{35}\approx34\mathrm{B}$ 个 Token (C4 总大小的一小部分) 上预训练了基准模型。我们考虑在包含 $2^{29}$、$2^{27}$、$2^{25}$ 和 $2^{23}$ 个 Token 的 C4 截断版本上进行训练。这些大小分别对应于在整个预训练过程中重复数据集 64、256、1,024 和 4,096 次。
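For reference, the number of repeats reported in Table 9 is simply the ratio of the total number of pre-training tokens to the truncated data set size:

```python
total_pretraining_tokens = 2 ** 35
for exponent in (29, 27, 25, 23):
    repeats = total_pretraining_tokens // (2 ** exponent)
    print(f"2^{exponent} tokens -> repeated {repeats} times")   # 64, 256, 1024, 4096
```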
Table 9: Measuring the effect of repeating data during pre-training. In these experiments, we only use the first $N$ tokens from C4 (with varying values of $N$ shown in the first column) but still pre-train over $2^{35}$ tokens. This results in the data set being repeated over the course of pre-training (with the number of repeats for each experiment shown in the second column), which may result in memorization (see Figure 6).
表 9: 测量预训练期间重复数据的效果。在这些实验中,我们仅使用 C4 的前 N 个 Token (第一列显示了不同的 N 值),但仍然预训练超过 $2^{35}$ 个 Token。这导致在整个预训练过程中数据集被重复 (每个实验的重复次数显示在第二列),这可能导致记忆效应 (见图 6)。
| Token 数量 | 重复次数 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|---|
| 完整数据集 | 0 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| $2^{29}$ | 64 | 82.87 | 19.19 | 80.97 | 72.03 | 26.83 | 39.74 | 27.63 |
| $2^{27}$ | 256 | 82.62 | 19.20 | 79.78 | 69.97 | 27.02 | 39.71 | 27.33 |
| $2^{25}$ | 1,024 | 79.55 | 18.57 | 76.27 | 64.76 | 26.38 | 39.56 | 26.80 |
| $2^{23}$ | 4,096 | 76.34 | 18.33 | 70.92 | 59.29 | 26.37 | 38.84 | 25.81 |
The resulting downstream performance is shown in Table 9. As expected, performance degrades as the data set size shrinks. We suspect this may be due to the fact that the model begins to memorize the pre-training data set. To measure if this is true, we plot the training loss for each of these data set sizes in Figure 6. Indeed, the model attains significantly smaller training losses as the size of the pre-training data set shrinks, suggesting possible memorization. Baevski et al. (2019) similarly observed that truncating the pre-training data set size can degrade downstream task performance.
下游性能结果如表 9 所示。如预期的那样,随着数据集大小的缩小,性能下降。我们怀疑这可能是由于模型开始记忆预训练数据集。为了测量这一假设是否正确,我们在图 6 中绘制了每个数据集大小的训练损失。确实,随着预训练数据集大小的缩小,模型获得的训练损失显著减小,这表明可能存在记忆现象。Baevski 等 (2019) 同样观察到截断预训练数据集大小会降低下游任务的性能。
We note that these effects are limited when the pre-training data set is repeated only 64 times. This suggests that some amount of repetition of pre-training data might not be harmful. However, given that additional pre-training can be beneficial (as we will show in Section 3.6) and that obtaining additional unlabeled data is cheap and easy, we suggest using large pre-training data sets whenever possible. We also note that this effect may be more pronounced for larger model sizes, i.e. a bigger model may be more prone to over fitting to a smaller pre-training data set.
我们注意到,当预训练数据集仅重复 64 次时,这些效果是有限的。这表明一定程度的预训练数据重复可能无害。然而,鉴于额外的预训练可能是有益的(如我们将在第 3.6 节中展示)以及获取额外的未标注数据既便宜又容易,我们建议在可能的情况下使用较大的预训练数据集。我们还注意到,对于更大的模型尺寸,这种效应可能更为明显,即更大的模型可能更容易对较小的预训练数据集过拟合。
3.5. Training Strategy
3.5. 训练策略
So far we have considered the setting where all parameters of a model are pre-trained on an unsupervised task before being fine-tuned on individual supervised tasks. While this approach is straightforward, various alternative methods for training the model on downstream/supervised tasks have been proposed. In this section, we compare different schemes for fine-tuning the model in addition to the approach of training the model simultaneously on multiple tasks.
到目前为止,我们考虑了在对模型的所有参数进行无监督任务预训练后,再对各个有监督任务进行微调的设置。虽然这种方法简单直接,但已经提出了各种替代方法来在下游/有监督任务上训练模型。在本节中,我们将比较不同的微调方案,以及同时在多个任务上训练模型的方法。
3.5.1. Fine-tuning Methods
3.5.1. 微调方法 (Fine-tuning Methods)
It has been argued that fine-tuning all of the model’s parameters can lead to suboptimal results, particularly on low-resource tasks (Peters et al., 2019). Early results on transfer learning for text classification tasks advocated fine-tuning only the parameters of a small classifier that was fed sentence embeddings produced by a fixed pre-trained model (Subramanian et al., 2018; Kiros et al., 2015; Logeswaran and Lee, 2018; Hill et al., 2016; Conneau et al., 2017). This approach is less applicable to our encoder-decoder model because the entire decoder must be trained to output the target sequences for a given task. Instead, we focus on two alternative fine-tuning approaches that update only a subset of the parameters of our encoder-decoder model.
有观点认为,微调模型的所有参数可能会导致次优结果,特别是在低资源任务上 (Peters et al., 2019)。早期关于文本分类任务的迁移学习结果显示,仅微调一个小分类器的参数是有效的,该分类器使用固定预训练模型生成的句子嵌入作为输入 (Subramanian et al., 2018; Kiros et al., 2015; Logeswaran 和 Lee, 2018; Hill et al., 2016; Conneau et al., 2017)。这种方法对我们的编码器-解码器模型不太适用,因为整个解码器必须被训练以输出给定任务的目标序列。相反,我们专注于两种替代的微调方法,这些方法只更新编码器-解码器模型的部分参数。

Figure 6: Pre-training loss for our original C4 data set as well as 4 artificially truncated versions. The sizes listed refer to the number of tokens in each data set. The four sizes considered correspond to repeating the data set between 64 and 4,096 times over the course of pre-training. Using a smaller data set size results in smaller training loss values, which may suggest some memorization of the unlabeled data set.
图 6: 我们原始 C4 数据集以及 4 个经过人工截断的版本的预训练损失。所列的大小指的是每个数据集中的 Token 数量。考虑的四个大小对应于在预训练过程中重复数据集 64 到 4,096 次。使用较小的数据集会导致较小的训练损失值,这可能表明对未标记数据集存在一定程度的记忆。
The first, “adapter layers” (Houlsby et al., 2019; Bapna et al., 2019), is motivated by the goal of keeping most of the original model fixed while fine-tuning. Adapter layers are additional dense-ReLU-dense blocks that are added after each of the preexisting feed-forward networks in each block of the Transformer. These new feed-forward networks are designed so that their output dimensionality matches their input. This allows them to be inserted into the network with no additional changes to the structure or parameters. When fine-tuning, only the adapter layer and layer normalization parameters are updated. The main hyperparameter of this approach is the inner dimensionality $d$ of the feed-forward network, which changes the number of new parameters added to the model. We experiment with various values for $d$.
第一种方法是“适配器层” (Houlsby et al., 2019; Bapna et al., 2019),其动机是在微调时保持原始模型的大部分结构不变。适配器层是附加的 dense-ReLU-dense 块,添加到 Transformer 每个块中已有的前馈网络之后。这些新的前馈网络设计为输出维度与输入维度匹配。这使得它们可以在不改变网络结构或参数的情况下插入网络。在微调时,仅更新适配器层和层归一化参数。这种方法的主要超参数是前馈网络的内部维度 $d$ ,它决定了添加到模型的新参数数量。我们对不同的 $d$ 值进行了实验。
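To make the adapter-layer setup concrete, the following is a minimal sketch rather than the authors' implementation: a dense-ReLU-dense block whose output dimensionality matches its input, written in PyTorch purely for illustration. The class and helper names are invented for this example, and the residual connection is an assumption borrowed from Houlsby et al. (2019) rather than something stated explicitly in the text above.

```python
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    """Illustrative dense-ReLU-dense adapter block (not the paper's code).

    The inner dimensionality ``d`` is the main hyperparameter; because the
    output dimensionality matches the input, the block can be inserted after
    an existing feed-forward layer without changing any other shapes.
    """

    def __init__(self, d_model: int, d: int):
        super().__init__()
        self.down = nn.Linear(d_model, d)  # project d_model -> d
        self.up = nn.Linear(d, d_model)    # project d -> d_model
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection (an assumption here, following Houlsby et al., 2019)
        # keeps the adapted layer close to the original function at initialization.
        return x + self.up(self.act(self.down(x)))

def freeze_all_but_adapters_and_layer_norm(model: nn.Module) -> None:
    """Only adapter and layer-normalization parameters are updated when fine-tuning.

    The name-based matching below assumes adapters are registered under names
    containing "adapter" and layer norms under "layer_norm"; adjust as needed.
    """
    for name, param in model.named_parameters():
        param.requires_grad = ("adapter" in name) or ("layer_norm" in name)
```

A larger `d` adds more new parameters, which is exactly the trade-off explored across the adapter rows of Table 10 below.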
The second alternative fine-tuning method we consider is “gradual unfreezing” (Howard and Ruder, 2018). In gradual unfreezing, more and more of the model’s parameters are fine-tuned over time. Gradual unfreezing was originally applied to a language model architecture consisting of a single stack of layers. In this setting, at the start of fine-tuning only the parameters of the final layer are updated, then after training for a certain number of updates the parameters of the second-to-last layer are also included, and so on until the entire network’s parameters are being fine-tuned. To adapt this approach to our encoder-decoder model, we gradually unfreeze layers in the encoder and decoder in parallel, starting from the top in both cases. Since the parameters of our input embedding matrix and output classification matrix are shared, we update them throughout fine-tuning. Recall that our baseline model consists of 12 layers each in the encoder and decoder and is fine-tuned for
我们考虑的第二种替代微调方法是“渐进解冻” (Howard and Ruder, 2018)。在渐进解冻中,随着时间的推移,越来越多的模型参数被微调。渐进解冻最初应用于由单个层堆栈组成的语言模型架构。在这种设置下,在微调开始时,只有最后一层的参数被更新,然后在经过一定数量的更新后,倒数第二层的参数也被包含进来,依此类推,直到整个网络的参数都被微调。为了将这种方法适应于我们的编码器-解码器模型,我们在编码器和解码器中并行地逐渐解冻层,从顶部开始。由于我们的输入嵌入矩阵和输出分类矩阵的参数是共享的,因此我们在整个微调过程中更新它们。回顾一下,我们的基准模型由编码器和解码器各 12 层组成,并微调
Table 10: Comparison of different alternative fine-tuning methods that only update a subset of the model’s parameters. For adapter layers, $d$ refers to the inner dimensionality of the adapters.
表 10: 不同替代微调方法的比较,这些方法仅更新模型参数的子集。对于适配器层,$d$ 指的是适配器的内部维度。
| 微调方法 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| ★所有参数 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 适配器层, d = 32 | 80.52 | 15.08 | 79.32 | 60.40 | 13.84 | 17.88 | 15.54 |
| 适配器层, d = 128 | 81.51 | 16.62 | 79.47 | 63.03 | 19.83 | 27.50 | 22.63 |
| 适配器层, d = 512 | 81.54 | 17.78 | 79.18 | 64.30 | 23.45 | 33.98 | 25.81 |
| 适配器层, d = 2048 | 81.51 | 16.62 | 79.47 | 63.03 | 19.83 | 27.50 | 22.63 |
| 逐步解冻 | 82.50 | 18.95 | 79.17 | 70.79 | 26.71 | 39.02 | 26.93 |
$2^{18}$ steps. As such, we subdivide the fine-tuning process into 12 episodes of $2^{18}/12$ steps each and train from layers $12-n$ to 12 in the $n$th episode. We note that Howard and Ruder (2018) suggested fine-tuning an additional layer after each epoch of training. However, since our supervised data sets vary so much in size and since some of our downstream tasks are actually mixtures of many tasks (GLUE and SuperGLUE), we instead adopt the simpler strategy of fine-tuning an additional layer after every $2^{18}/12$ steps.
$2^{18}$ 步。因此,我们将微调过程细分为 12 个阶段,每个阶段包含 $2^{18}/12$ 步,并在第 $n$ 阶段从第 $12-n$ 层到第 12 层进行训练。我们注意到 Howard 和 Ruder (2018) 建议在每个训练周期后微调额外的一层。然而,由于我们的监督数据集大小差异很大,且一些下游任务实际上是多个任务的混合 (如 GLUE 和 SuperGLUE),我们采用了更简单的策略,在每 $2^{18}/12$ 步后微调额外的一层。
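As a rough sketch of the unfreezing schedule just described (one plausible reading of the layer indexing, not the authors' code): with 12 layers per stack and $2^{18}$ fine-tuning steps, one additional layer from the top of both the encoder and decoder becomes trainable after every $2^{18}/12$ steps, while the shared embedding/classifier parameters stay trainable throughout. The function name below is invented for this example.

```python
TOTAL_STEPS = 2 ** 18      # fine-tuning budget
NUM_LAYERS = 12            # layers in each of the encoder and decoder stacks
EPISODE_LEN = TOTAL_STEPS // NUM_LAYERS  # 2^18 / 12, about 21,845 steps per episode

def trainable_layers(step: int, num_layers: int = NUM_LAYERS) -> range:
    """Return the 1-indexed layer indices that are fine-tuned at a given step.

    In the n-th episode the top n layers of each stack are updated, so only the
    final layer is trained at the start and the whole stack by the last episode.
    The shared input-embedding / output-classifier parameters are always updated.
    """
    episode = min(step // EPISODE_LEN + 1, num_layers)
    return range(num_layers - episode + 1, num_layers + 1)

assert list(trainable_layers(0)) == [12]                              # start: top layer only
assert list(trainable_layers(TOTAL_STEPS - 1)) == list(range(1, 13))  # end: all layers
```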
A comparison of the performance of these fine-tuning approaches is shown in Table 10. For adapter layers, we report the performance using an inner dimensionality $d$ of 32, 128, 512, and 2048. Consistent with past results (Houlsby et al., 2019; Bapna et al., 2019), we find that lower-resource tasks like SQuAD work well with a small value of $d$ whereas higher-resource tasks require a large dimensionality to achieve reasonable performance. This suggests that adapter layers could be a promising technique for fine-tuning on fewer parameters as long as the dimensionality is scaled appropriately to the task size. Note that in our case we treat GLUE and SuperGLUE each as a single “task” by concatenating their constituent data sets, so although they comprise some low-resource data sets the combined data set is large enough that it necessitates a large value of $d$. We found that gradual unfreezing caused a minor degradation in performance across all tasks, though it did provide some speedup during fine-tuning. Better results may be attainable by more carefully tuning the unfreezing schedule.
这些微调方法的性能比较见表 10。对于适配器层,我们报告了使用内部维度 $d$ 为 32、128、512 和 2048 的性能。根据以往的结果 (Houlsby et al., 2019; Bapna et al., 2019),我们发现低资源任务如 SQuAD 使用较小的 $d$ 值表现良好,而高资源任务需要较大的维度才能达到合理的性能。这表明只要维度根据任务规模适当调整,适配器层可能是减少参数进行微调的一种有前途的技术。请注意,在我们的情况下,我们将 GLUE 和 SuperGLUE 各自视为一个单独的“任务”,通过连接其组成的数据集来处理,因此尽管它们包含一些低资源数据集,但合并后的数据集足够大,以至于需要较大的 $d$ 值。我们发现逐步解冻在所有任务中导致了轻微的性能下降,尽管它确实在微调期间提供了一些加速。通过更仔细地调整解冻计划,可能会获得更好的结果。
3.5.2. Multi-task Learning
3.5.2. 多任务学习
So far, we have been pre-training our model on a single unsupervised learning task before fine-tuning it individually on each downstream task. An alternative approach, called “multi-task learning” (Ruder, 2017; Caruana, 1997), is to train the model on multiple tasks at a time. This approach typically has the goal of training a single model that can simultaneously perform many tasks at once, i.e. the model and most of its parameters are shared across all tasks. We relax this goal somewhat and instead investigate methods for training on multiple tasks at once in order to eventually produce separate parameter settings that perform well on each individual task. For example, we might train a single model on many tasks, but when reporting performance we are allowed to select a different checkpoint for each task. This loosens the multi-task learning framework and puts it on more even footing compared to the pre-train-then-fine-tune approach we have considered so far. We also note that in our unified text-to-text framework, “multi-task learning” simply corresponds to mixing data sets together. It follows that we can still train on unlabeled data when using multi-task learning by treating the unsupervised task as one of the tasks being mixed together. In contrast, most applications of multi-task learning to NLP add task-specific classification networks or use different loss functions for each task (Liu et al., 2019b).
到目前为止,我们一直在单个无监督学习任务上预训练我们的模型,然后在每个下游任务上分别微调它。另一种方法称为“多任务学习” (multitask learning) (Ruder, 2017; Caruana, 1997),是同时在多个任务上训练模型。这种方法通常旨在训练一个单一的模型,该模型能够同时执行多个任务,即模型及其大部分参数在所有任务之间共享。我们稍微放宽了这一目标,而是研究同时在多个任务上训练的方法,以最终产生在每个单独任务上表现良好的不同参数设置。例如,我们可能会在一个模型上训练许多任务,但在报告性能时,我们可以为每个任务选择不同的检查点。这放松了多任务学习框架,并使其与我们迄今为止考虑的预训练-然后微调方法更加公平。我们还注意到,在我们统一的文本到文本框架中,“多任务学习”只是将数据集混合在一起。因此,当使用多任务学习时,我们仍然可以通过将无监督任务视为其中一个被混合的任务来训练未标记的数据。相比之下,大多数应用于自然语言处理 (NLP) 的多任务学习方法会添加特定于任务的分类网络或为每个任务使用不同的损失函数 (Liu et al., 2019b)。
As pointed out by Arivazhagan et al. (2019), an extremely important factor in multi-task learning is how much data from each task the model should be trained on. Our goal is to not under- or over-train the model—that is, we want the model to see enough data from a given task that it can perform the task well, but not to see so much data that it memorizes the training set. How exactly to set the proportion of data coming from each task can depend on various factors including data set sizes, the “difficulty” of learning the task (i.e. how much data the model must see before being able to perform the task effectively), regularization, etc. An additional issue is the potential for “task interference” or “negative transfer”, where achieving good performance on one task can hinder performance on another. Given these concerns, we begin by exploring various strategies for setting the proportion of data coming from each task. A similar exploration was performed by Wang et al. (2019a).
如 Arivazhagan 等 (2019) 所指出的,在多任务学习中一个极为重要的因素是模型应该在每个任务上训练多少数据。我们的目标是既不要欠训练也不要过训练模型——也就是说,我们希望模型能够看到足够多来自给定任务的数据,以便能够很好地执行该任务,但又不能看到太多数据以至于记住训练集。如何精确设置来自每个任务的数据比例可以取决于各种因素,包括数据集大小、学习任务的“难度”(即模型在能够有效执行任务之前必须看到多少数据)、正则化等。另一个问题是可能出现的“任务干扰”或“负迁移”,即在一个任务上取得良好性能可能会妨碍在另一个任务上的表现。鉴于这些考虑,我们首先探索为每个任务设置数据比例的各种策略。Wang 等 (2019a) 也进行了类似的探索。
Examples-proportional mixing A major factor in how quickly a model will overfit to a given task is the task’s data set size. As such, a natural way to set the mixing proportions is to sample in proportion to the size of each task’s data set. This is equivalent to concatenating the data sets for all tasks and randomly sampling examples from the combined data set. Note, however, that we are including our unsupervised denoising task, which uses a data set that is orders of magnitude larger than every other task’s. It follows that if we simply sample in proportion to each data set’s size, the vast majority of the data the model sees will be unlabeled, and it will undertrain on all of the supervised tasks. Even without the unsupervised task, some tasks (e.g. WMT English to French) are so large that they would similarly crowd out most of the batches. To get around this issue, we set an artificial “limit” on the data set sizes before computing the proportions. Specifically, if the number of examples in each of our $N$ tasks’ data sets is $e_n$, $n \in \{1, \ldots, N\}$, then we set the probability of sampling an example from the $m$th task during training to $r_m = \min(e_m, K)/\sum_n \min(e_n, K)$ where $K$ is the artificial data set size limit.
示例比例混合
一个模型过拟合到给定任务的速度主要取决于该任务的数据集大小。因此,设置混合比例的一种自然方法是根据每个任务数据集的大小进行采样。这相当于将所有任务的数据集连接起来,并从合并后的数据集中随机采样示例。然而,需要注意的是,我们包括了一个无监督去噪任务,该任务使用的数据集比其他所有任务的数据集大几个数量级。因此,如果我们只是根据每个数据集的大小进行采样,模型看到的绝大多数数据将是未标注的,并且它将在所有有监督任务上欠训练。即使没有无监督任务,某些任务 (例如 WMT 英语到法语) 也非常大,以至于它们会占据大多数批次。为了解决这个问题,我们在计算比例之前对数据集大小设置了一个人工“限制”。具体来说,如果每个任务的数据集中的示例数量为 $e_n$, $n\in\{1,\ldots,N\}$,那么我们将训练期间从第 $m$ 个任务中采样示例的概率设置为 $r_m=\min(e_m,K)/\sum_n \min(e_n,K)$,其中 $K$ 是人工数据集大小限制。
Temperature-scaled mixing An alternative way of mitigating the huge disparity between data set sizes is to adjust the “temperature” of the mixing rates. This approach was used by multilingual BERT to ensure that the model was sufficiently trained on low-resource languages.$^{14}$ To implement temperature scaling with temperature $T$, we raise each task’s mixing rate $r_m$ to the power of $1/T$ and renormalize the rates so that they sum to 1. When $T=1$, this approach is equivalent to examples-proportional mixing and as $T$ increases the proportions become closer to equal mixing. We retain the data set size limit $K$ (applied to obtain $r_m$ before temperature scaling) but set it to a large value of $K=2^{21}$. We use a large value of $K$ because increasing the temperature will decrease the mixing rate of the largest data sets.
温度缩放混合
一种缓解数据集大小巨大差异的替代方法是调整混合率的“温度”。这种方法被多语言 BERT 用于确保模型在低资源语言上得到充分训练 [14]。为了实现温度为 $T$ 的温度缩放,我们将每个任务的混合率 $r_m$ 提高到 $1/T$ 次幂,并重新归一化这些比率,使它们的总和为 1。当 $T=1$ 时,这种方法等同于按示例比例混合,随着 $T$ 增加,比例逐渐接近均匀混合。我们保留数据集大小限制 $K$ (在温度缩放之前用于获得 $r_m$),但将其设置为较大的值 $K=2^{21}$。我们使用较大的 $K$ 值是因为增加温度会降低最大数据集的混合率。
Table 11: Comparison of multi-task training using different mixing strategies. Examplesproportional mixing refers to sampling examples from each data set according to the total size of each data set, with an artificial limit ( $K$ ) on the maximum data set size. Temperature-scaled mixing re-scales the sampling rates by a temperature $T$ . For temperature-scaled mixing, we use an artificial data set size limit of $K=2^{21}$ .
表 11: 不同混合策略的多任务训练对比。Examples-proportional 混合指的是根据每个数据集的总大小从每个数据集中采样示例,并对最大数据集大小设置一个人为限制 ($K$)。Temperature-scaled 混合通过温度 $T$ 对采样率进行重新缩放。对于 Temperature-scaled 混合,我们使用的人为数据集大小限制为 $K=2^{21}$。
| 混合策略 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| 基线 (预训练/微调) | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 等量 | 76.13 | 19.02 | 76.51 | 63.37 | 23.89 | 34.31 | 26.78 |
| Examples-proportional, $K = 2^{16}$ | 80.45 | 19.04 | 77.25 | 69.95 | 24.35 | 34.99 | 27.10 |
| Examples-proportional, $K = 2^{17}$ | 81.56 | 19.12 | 77.00 | 67.91 | 24.36 | 35.00 | 27.25 |
| Examples-proportional, $K = 2^{18}$ | 81.67 | 19.07 | 78.17 | 67.94 | 24.57 | 35.19 | 27.39 |
| Examples-proportional, $K = 2^{19}$ | 81.42 | 19.24 | 79.78 | 67.30 | 25.21 | 36.30 | 27.76 |
| Examples-proportional, $K = 2^{20}$ | 80.80 | 19.24 | 80.36 | 67.38 | 25.66 | 36.93 | 27.68 |
| Examples-proportional, $K = 2^{21}$ | 79.83 | 18.79 | 79.50 | 65.10 | 25.82 | 37.22 | 27.13 |
| Temperature-scaled, $T = 2$ | 81.90 | 19.28 | 79.42 | 69.92 | 25.42 | 36.72 | 27.20 |
| Temperature-scaled, $T = 4$ | 80.56 | 19.22 | 77.99 | 69.54 | 25.04 | 35.82 | 27.45 |
| Temperature-scaled, $T = 8$ | 77.21 | 19.10 | 77.14 | 66.07 | 24.55 | 35.35 | 27.17 |
Equal mixing In this case, we sample examples from each task with equal probability. Specifically, each example in each batch is sampled uniformly at random from one of the data sets we train on. This is most likely a suboptimal strategy, as the model will overfit quickly on low-resource tasks and underfit on high-resource tasks. We mainly include it as a point of reference of what might go wrong when the proportions are set suboptimally.
均匀混合 在这种情况下,我们以相等的概率从每个任务中采样示例。具体来说,每个批次中的每个示例都是从我们训练的数据集中均匀随机采样的。这很可能是一种次优策略,因为模型会在低资源任务上过拟合,而在高资源任务上欠拟合。我们主要将其作为比例设置不当可能导致问题的参考点。
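A compact sketch of the three mixing strategies above (illustrative only; the function name and the example data set sizes are made up, not taken from the paper): with $T=1$ the computation reduces to examples-proportional mixing with limit $K$, larger $T$ gives temperature-scaled mixing, and very large $T$ approaches equal mixing.

```python
from typing import Dict

def mixing_rates(sizes: Dict[str, int], K: int, T: float = 1.0) -> Dict[str, float]:
    """Per-task sampling probabilities.

    T = 1: examples-proportional mixing, r_m = min(e_m, K) / sum_n min(e_n, K).
    T > 1: temperature-scaled mixing (each rate raised to 1/T, then renormalized).
    T -> infinity: approaches equal mixing.
    """
    clipped = {task: min(examples, K) for task, examples in sizes.items()}
    total = sum(clipped.values())
    rates = {task: c / total for task, c in clipped.items()}
    scaled = {task: r ** (1.0 / T) for task, r in rates.items()}
    normalizer = sum(scaled.values())
    return {task: s / normalizer for task, s in scaled.items()}

# Hypothetical data set sizes, chosen only to show the effect of K and T:
sizes = {"unsupervised_denoising": 10**9, "wmt_en_fr": 4 * 10**7, "squad": 10**5}
print(mixing_rates(sizes, K=2**19, T=1.0))  # examples-proportional with limit K
print(mixing_rates(sizes, K=2**21, T=2.0))  # temperature-scaled, T = 2
```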
To compare these mixing strategies on equal footing with our baseline pre-train-then-fine-tune results, we train multi-task models for the same total number of steps: $2^{19}+2^{18}=786{,}432$. The results are shown in Table 11.
为了在相同的基准上比较这些混合策略与我们预训练然后微调的结果,我们训练了多任务模型,总步数相同:$2^{19}+2^{18}=786{,}432$。结果如表 11 所示。
In general, we find that multi-task training underperforms pre-training followed by fine-tuning on most tasks. The “equal” mixing strategy in particular results in dramatically degraded performance, which may be because the low-resource tasks have overfit, the high-resource tasks have not seen enough data, or the model has not seen enough unlabeled data to learn general-purpose language capabilities. For examples-proportional mixing, we find that for most tasks there is a “sweet spot” for $K$ where the model obtains the best performance, and larger or smaller values of $K$ tend to result in worse performance. The exception (for the range of $K$ values we considered) was WMT English to French translation, which is such a high-resource task that it always benefits from a higher mixing proportion. Finally, we note that temperature-scaled mixing also provides a means of obtaining reasonable performance from most tasks, with $T=2$ performing the best in most cases. The finding that a multi-task model is outperformed by separate models trained on each individual task has previously been observed e.g. by Arivazhagan et al. (2019) and McCann et al. (2018), though it has been shown that the multi-task setup can confer benefits across very similar tasks (Liu et al., 2019b; Ratner et al., 2018). In the following section, we explore ways to close the gap between multi-task training and the pre-train-then-fine-tune approach.
总体而言,我们发现多任务训练在大多数任务上的表现不如预训练后再进行微调。特别是“等量”混合策略会导致性能显著下降,这可能是由于低资源任务过拟合、高资源任务数据不足,或者模型未获得足够的未标注数据来学习通用语言能力。对于按比例混合,我们发现在大多数任务中存在一个 $K$ 的“最佳点”,在此处模型可以获得最佳性能,而更大或更小的 $K$ 值往往会导致性能下降。例外情况 (在我们考虑的 $K$ 值范围内) 是 WMT 英语到法语翻译任务,这是一个高资源任务,因此总是从更高的混合比例中受益。最后,我们注意到温度缩放混合也提供了一种从大多数任务中获得合理性能的方法,其中 $T=2$ 在大多数情况下表现最佳。多任务模型被单独训练的单个任务模型超越的现象之前已被观察到,例如 Arivazhagan 等 (2019) 和 McCann 等 (2018),尽管已证明多任务设置可以在非常相似的任务之间带来好处 (Liu et al., 2019b; Ratner et al., 2018)。在下一节中,我们将探讨缩小多任务训练与预训练然后微调方法之间差距的方法。
Table 12: Comparison of unsupervised pre-training, multi-task learning, and various forms of multi-task pre-training.
表 12: 非监督预训练、多任务学习及各种形式的多任务预训练的比较。
| 训练策略 | GLUE | CNN_DM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| 非监督预训练 + 微调 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 多任务训练 | 81.42 | 19.24 | 79.78 | 67.30 | 25.21 | 36.30 | 27.76 |
| 多任务预训练 + 微调 | 83.11 | 19.12 | 80.26 | 71.03 | 27.08 | 39.80 | 28.07 |
| 留一法多任务训练 | 81.98 | 19.05 | 79.97 | 71.68 | 26.93 | 39.79 | 27.87 |
| 有监督多任务预训练 | 79.93 | 18.96 | 77.38 | 65.36 | 26.81 | 40.13 | 28.04 |
3.5.3. Combining Multi-Task Learning with Fine-Tuning
3.5.3. 结合多任务学习与微调 (Combining Multi-Task Learning with Fine-Tuning)
Recall that we are studying a relaxed version of multi-task learning where we train a single model on a mixture of tasks but are allowed to evaluate performance using different parameter settings (checkpoints) for the model. We can extend this approach by considering the case where the model is pre-trained on all tasks at once but is then fine-tuned on the individual supervised tasks. This is the method used by the “MT-DNN” (Liu et al., 2015, 2019b), which achieved state-of-the-art performance on GLUE and other benchmarks when it was introduced. We consider three variants of this approach: In the first, we simply pre-train the model on an examples-proportional mixture with an artificial data set size limit of $K=2^{19}$ before fine-tuning it on each individual downstream task. This helps us measure whether including the supervised tasks alongside the unsupervised objective during pre-training gives the model some beneficial early exposure to the downstream tasks. We might also hope that mixing in many sources of supervision could help the pre-trained model obtain a more general set of “skills” (loosely speaking) before it is adapted to an individual task. To measure this directly, we consider a second variant where we pre-train the model on the same examples-proportional mixture (with $K=2^{19}$ ) except that we omit one of the downstream tasks from this pre-training mixture. Then, we fine-tune the model on the task that was left out during pre-training. We repeat this for each of the downstream tasks we consider. We call this approach “leave-one-out” multi-task training. This simulates the real-world setting where a pre-trained model is fine-tuned on a task it had not seen during pre-training. Note that multi-task pre-training provides a diverse mixture of supervised tasks. Since other fields (e.g. computer vision (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014)) use a supervised data set for pre-training, we were interested to see whether omitting the unsupervised task from the multi-task pre-training mixture still produced good results. For our third variant we therefore pre-train on an examples-proportional mixture of all of the supervised tasks we consider with $K=2^{19}$ . In all of these variants, we follow our standard procedure of pre-training for $2^{19}$ steps before fine-tuning for $2^{18}$ steps.
回顾我们正在研究一个多任务学习的简化版本,其中我们在任务混合上训练单个模型,但允许使用不同的参数设置(检查点)来评估模型性能。我们可以扩展这种方法,考虑模型在所有任务上同时预训练,然后在各个监督任务上进行微调的情况。这是“MT-DNN” (Liu et al., 2015, 2019b) 使用的方法,在引入时它在 GLUE 和其他基准测试中取得了最先进的性能。我们考虑了这种方法的三个变体:在第一个变体中,我们在一个示例比例混合的数据集上预训练模型,并设置一个人工数据集大小限制 $K=2^{19}$ ,然后再对每个下游任务进行微调。这有助于我们测量在预训练期间将监督任务与无监督目标一起包含是否能给模型提供一些有益的早期接触下游任务的机会。我们还希望,混合多种监督源可以帮助预训练模型在适应特定任务之前获得更广泛的一组“技能”(广义上讲)。为了直接测量这一点,我们考虑第二个变体,在这个变体中,我们在相同的示例比例混合数据集上预训练模型($K=2^{19}$),但省略其中一个下游任务。然后,我们在预训练期间未包含的任务上微调模型。我们对每个考虑的下游任务重复这一过程。我们将这种方法称为“留一法”多任务训练。这模拟了现实世界中的情况,即预训练模型在一个它在预训练期间未见过的任务上进行微调。请注意,多任务预训练提供了多样化的监督任务混合。由于其他领域(例如计算机视觉 (Oquab et al., 2014; Jia et al., 2014; Huh et al., 2016; Yosinski et al., 2014))使用监督数据集进行预训练,我们有兴趣了解从多任务预训练混合中省略无监督任务是否仍能产生良好的结果。因此,在我们的第三个变体中,我们在所有考虑的监督任务的示例比例混合数据集上预训练模型,$K=2^{19}$。在这所有变体中,我们遵循标准程序,在微调 $2^{18}$ 步之前预训练 $2^{19}$ 步。
We compare the results of these approaches in Table 12. For comparison, we also include results for our baseline (pre-train then fine-tune) and for standard multi-task learning (without fine-tuning) on an examples-proportional mixture with $K=2^{19}$ . We find that fine-tuning after multi-task pre-training results in comparable performance to our baseline. This suggests that using fine-tuning after multi-task learning can help mitigate some of the trade-offs between different mixing rates described in Section 3.5.2. Interestingly, the performance of “leave-one-out” training was only slightly worse, suggesting that a model that was trained on a variety of tasks can still adapt to new tasks (i.e. multi-task pretraining might not result in a dramatic task interference). Finally, supervised multi-task pre-training performed significantly worse in every case except for the translation tasks. This could suggest that the translation tasks benefit less from (English) pre-training, whereas unsupervised pre-training is an important factor in the other tasks.
我们比较了这些方法的结果,见表 12。为了对比,我们也包含了我们基线(预训练然后微调)和标准多任务学习(不进行微调)在示例比例混合下的结果,其中 $K=2^{19}$ 。我们发现,在多任务预训练后进行微调的结果与我们的基线相当。这表明,在多任务学习后使用微调可以帮助缓解第 3.5.2 节中描述的不同混合率之间的一些权衡。有趣的是,“留一法”训练的性能仅略差,这表明在一个多种任务上训练的模型仍然可以适应新任务(即多任务预训练可能不会导致严重的任务干扰)。最后,监督式多任务预训练在除翻译任务外的所有情况下表现显著更差。这可能表明翻译任务从(英文)预训练中的获益较少,而非监督预训练是其他任务的重要因素。
3.6. Scaling
3.6. 扩展
The “bitter lesson” of machine learning research argues that general methods that can leverage additional computation ultimately win out against methods that rely on human expertise (Sutton, 2019; Hestness et al., 2017; Shazeer et al., 2017; Jozefowicz et al., 2016; Mahajan et al., 2018; Shazeer et al., 2018, 2017; Huang et al., 2018b; Keskar et al., 2019a). Recent results suggest that this may hold true for transfer learning in NLP (Liu et al., 2019c; Radford et al., 2019; Yang et al., 2019; Lan et al., 2019), i.e. it has repeatedly been shown that scaling up produces improved performance compared to more carefully-engineered methods. However, there are a variety of possible ways to scale, including using a bigger model, training the model for more steps, and ensembling. In this section, we compare these different approaches by addressing the following premise: “You were just given 4 $\times$ more compute. How should you use it?”
机器学习研究的“苦涩教训”认为,能够利用额外计算资源的通用方法最终会胜过依赖人类专业知识的方法 (Sutton, 2019; Hestness 等, 2017; Shazeer 等, 2017; Jozefowicz 等, 2016; Mahajan 等, 2018; Shazeer 等, 2018, 2017; Huang 等, 2018b; Keskar 等, 2019a)。最近的结果表明,这可能也适用于自然语言处理中的迁移学习 (Liu 等, 2019c; Radford 等, 2019; Yang 等, 2019; Lan 等, 2019),即不断有研究表明,扩大规模相比更精心设计的方法能产生更好的性能。然而,扩大规模有多种可能的方式,包括使用更大的模型、增加训练步骤以及集成。在本节中,我们将通过探讨以下前提来比较这些不同的方法:“你刚刚获得了 4 × 更多的计算资源。你应该怎样使用它们?”
We start with our baseline model, which has 220M parameters and is pre-trained and fine-tuned for $2^{19}$ and $2^{18}$ steps respectively. The encoder and decoder are both sized similarly to “BERTBASE”. To experiment with increased model size, we follow the guidelines of “BERTLARGE” (Devlin et al., 2018) and use $d_{\mathrm{ff}}=4096$, $d_{\mathrm{model}}=1024$, $d_{\mathrm{kv}}=64$ and 16-head attention mechanisms. We then generate two variants with 16 and 32 layers each in the encoder and decoder, producing models with $2\times$ and $4\times$ as many parameters as our original model. These two variants also have roughly $2\times$ and $4\times$ the computational cost. Using our baseline and these two larger models, we consider three ways of using $4\times$ as much computation: Training for $4\times$ as many steps, training for $2\times$ as many steps with the $2\times$ bigger model, and training the $4\times$ bigger model for the “baseline” number of training steps. When we increase the training steps, we scale both the pre-train and fine-tune steps for simplicity. Note that when increasing the number of pre-training steps, we are effectively including more pre-training data as C4 is so large that we do not complete one pass over the data even when training for $2^{23}$ steps.
我们从基线模型开始,该模型有 2.2 亿个参数,并分别进行了 $2^{19}$ 和 $2^{18}$ 步的预训练和微调。编码器和解码器的规模与“BERTBASE”相似。为了实验更大的模型规模,我们遵循 Devlin 等人 (2018) 的“BERTLARGE”指南,使用 $d_{\mathrm{ff}} = 4096$ , $d_{\mathrm{model}} = 1024$ , $d_{\mathrm{kv}} = 64$ 和 16 头注意力机制。然后我们生成了两个变体,在编码器和解码器中分别有 16 层和 32 层,产生的模型参数分别是原始模型的 2 倍和 4 倍。这两个变体的计算成本也大约是原来的 2 倍和 4 倍。
使用我们的基线模型和这两个更大规模的模型,我们考虑了三种使用 4 倍计算量的方法:训练 4 倍的步数,用 2 倍大的模型训练 2 倍的步数,以及用 4 倍大的模型以“基线”训练步数进行训练。当我们增加训练步数时,为了简化操作,我们同时增加预训练和微调的步数。需要注意的是,当增加预训练步数时,实际上我们包含了更多的预训练数据,因为 C4 如此之大,即使训练 $2^{23}$ 步也无法完成一次完整的数据遍历。
An alternative way for the model to see 4 $\times$ as much data is to increase the batch size by a factor of 4. This can potentially result in faster training due to more efficient parallel iz ation. However, training with a 4 $\times$ larger batch size can yield a different outcome than training for 4 $\times$ as many steps (Shallue et al., 2018). We include an additional experiment where we train our baseline model with a 4 $\times$ larger batch size to compare these two cases.
让模型看到 4 倍数据的另一种方式是将批量大小增加 4 倍。由于并行化效率更高,这可能会使训练更快。然而,使用 4 倍大的批量进行训练可能会产生与训练 4 倍步数不同的结果 (Shallue et al., 2018)。我们增加了一个额外的实验,在该实验中我们用 4 倍大的批量训练我们的基准模型,以比较这两种情况。
It is common practice on many of the benchmarks we consider to eke out additional performance by training and evaluating using an ensemble of models. This provides an orthogonal way of using additional computation. To compare other scaling methods to ensembling, we also measure the performance of an ensemble of 4 separately pre-trained and fine-tuned models. We average the logits across the ensemble before feeding them into the output softmax nonlinearity to obtain an aggregate prediction. Instead of pre-training 4 separate models, a cheaper alternative is to take a single pre-trained model and produce 4 separate fine-tuned versions. While this does not use our entire $4\times$ computational budget, we also include this method to see if it produces competitive performance to the other scaling methods.
在我们考虑的许多基准测试中,通常做法是通过训练和评估模型集合来挤出额外的性能。这提供了一种使用额外计算资源的正交方法。为了将其他扩展方法与集成方法进行比较,我们还测量了由 4 个单独预训练和微调的模型组成的集成的性能。我们在将 logits 输入输出 softmax 非线性之前对它们进行平均以获得聚合预测。与其预训练 4 个单独的模型,一个更便宜的替代方案是采用单个预训练模型并生成 4 个单独微调的版本。虽然这种方法没有使用我们全部 4 × 的计算预算,我们也包括这种方法以查看其是否能产生与其他扩展方法竞争的性能。
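The logit-averaging step described above can be sketched as follows (illustrative only; PyTorch and the function name are used purely for exposition, and each ensemble member is assumed to return logits of the same shape for the same decoder state):

```python
from typing import List
import torch

def ensemble_logprobs(logits_per_model: List[torch.Tensor]) -> torch.Tensor:
    """Average logits across ensemble members, then apply the output softmax.

    Each tensor has shape [batch, vocab]. Averaging happens *before* the softmax
    nonlinearity, matching the aggregation described in the text.
    """
    avg_logits = torch.stack(logits_per_model, dim=0).mean(dim=0)
    return torch.log_softmax(avg_logits, dim=-1)
```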
Table 13: Comparison of different methods of scaling up our baseline model. All methods except ensembling fine-tuned models use $4\times$ the computation as the baseline. “Size” refers to the number of parameters in the model and “training time” refers to the number of steps used for both pre-training and fine-tuning.
表 13: 不同放大方法与我们基准模型的对比。除集成微调模型外的所有方法使用的计算量是基准模型的 $4\times$ 。"Size" 指的是模型中的参数数量,"training time" 指的是预训练和微调所用的步数。
| 放大策略 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| ★基准线 | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| 1× size, 4× training steps | 85.33 | 19.33 | 82.45 | 74.72 | 27.08 | 40.66 | 27.93 |
| 1× size, 4× batch size | 84.60 | 19.42 | 82.52 | 74.64 | 27.07 | 40.60 | 27.84 |
| 2× size, 2× training steps | 86.18 | 19.66 | 84.18 | 77.18 | 27.52 | 41.03 | 28.19 |
| 4× size, 1× training steps | 85.91 | 19.73 | 83.86 | 78.04 | 27.47 | 40.71 | 28.10 |
| 4× 集成 | 84.77 | 20.10 | 83.09 | 71.74 | 28.05 | 40.53 | 28.57 |
| 4× 集成,仅微调 | 84.05 | 19.57 | 82.36 | 71.55 | 27.55 | 40.22 | 28.09 |
The performance achieved after applying these various scaling methods is shown in Table 13. Unsurprisingly, increasing the training time and/or model size consistently improves the baseline. There was no clear winner between training for $4\times$ as many steps or using a $4\times$ larger batch size, though both were beneficial. In general, increasing the model size resulted in an additional bump in performance compared to solely increasing the training time or batch size. We did not observe a large difference between training a $2\times$ bigger model for $2\times$ as long and training a $4\times$ bigger model on any of the tasks we studied. This suggests that increasing the training time and increasing the model size can be complementary means of improving performance. Our results also suggest that ensembling provides an orthogonal and effective means of improving performance through scale. In some tasks (CNN/DM, WMT English to German, and WMT English to Romanian), ensembling 4 completely separately trained models significantly outperformed every other scaling approach. Ensembling models that were pre-trained together but fine-tuned separately also gave a substantial performance increase over the baseline, which suggests a cheaper means of improving performance. The only exception was SuperGLUE, where neither ensembling approach significantly improved over the baseline.
应用这些不同的缩放方法后取得的性能如表 13 所示。
不出所料,增加训练时间和/或模型大小始终能改善基线性能。在训练步数增加 4 倍和使用 4 倍更大的批量之间没有明显的胜者,尽管两者都有益。总体而言,增加模型大小相比仅增加训练时间或批量大小带来了额外的性能提升。我们没有观察到在我们研究的任务中,训练一个 2 倍大的模型两倍长的时间与训练一个 4 倍大的模型之间有显著差异。这表明增加训练时间和增加模型大小可以是互补的改进性能的方法。我们的结果还表明,集成提供了一种正交且有效的通过规模改进性能的手段。在某些任务(CNN/DM、WMT 英语到德语和 WMT 英语到罗马尼亚语)中,集成 4 个完全独立训练的模型显著优于其他所有缩放方法。集成那些一起预训练但分别微调的模型也比基线性能有了显著提升,这表明了一种更经济的改进性能的方法。唯一的例外是 SuperGLUE,在这里两种集成方法都没有显著优于基线。
We note that different scaling methods have different trade-offs that are separate from their performance. For example, using a larger model can make downstream fine-tuning and inference more expensive. In contrast, the cost of pre-training a small model for longer is effectively amortized if it is applied to many downstream tasks. Separately, we note that ensembling $N$ separate models has a similar cost to using a model that has an $N\times$ higher computational cost. As a result, some consideration for the eventual use of the model is important when choosing between scaling methods.
我们注意到不同的缩放方法有不同的权衡,这些权衡与其性能是分开的。例如,使用更大的模型会使下游微调和推理更加昂贵。相比之下,如果小模型被应用于许多下游任务,则其更长时间的预训练成本实际上是可以分摊的。另外,我们注意到集成 $N$ 个独立模型的成本与使用计算成本高 $N$ 倍的单个模型相似。因此,在选择缩放方法时,考虑模型的最终使用情况是很重要的。
3.7. Putting It All Together
3.7. 综合应用
We now leverage the insights from our systematic study to determine how far we can push performance on popular NLP benchmarks. We are also interested in exploring the current limits of transfer learning for NLP by training larger models on large amounts of data. We start with our baseline training approach and make the following changes:
我们现在利用系统研究的见解来确定在流行的 NLP 基准测试上可以将性能提升到什么程度。我们还对通过在大量数据上训练更大模型来探索 NLP 迁移学习的当前极限感兴趣。我们从基线训练方法开始,并做出以下更改:
Objective We swap out the i.i.d. denoising objective in our baseline for the span-corruption objective described in Section 3.3.4, which was loosely inspired by SpanBERT (Joshi et al., 2019). Specifically, we use a mean span length of 3 and corrupt $15\%$ of the original sequence. We found that this objective produced marginally better performance (Table 7) while being slightly more computationally efficient due to shorter target sequence lengths.
目标 我们将基线中的独立同分布去噪目标替换为第 3.3.4 节中描述的片段腐蚀目标,该目标在一定程度上受到 SpanBERT (Joshi 等, 2019) 的启发。具体来说,我们使用平均片段长度为 3,并腐蚀 $15\%$ 的原始序列。我们发现这个目标产生了略好的性能 (表 7),同时由于目标序列长度较短,计算效率也略有提高。
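A rough sketch of span corruption with a 15% corruption rate and mean span length 3 (illustrative only: span placement is simplified here, the exact sampling procedure lives in Section 3.3.4, and the `<extra_id_k>` sentinel naming is taken from the released T5 vocabulary rather than from the text above):

```python
import random
from typing import List, Tuple

def span_corrupt(tokens: List[str],
                 corruption_rate: float = 0.15,
                 mean_span_len: int = 3,
                 seed: int = 0) -> Tuple[List[str], List[str]]:
    """Simplified span corruption (a sketch, not the paper's exact sampler).

    Contiguous spans are replaced by unique sentinel tokens in the input; the
    target lists each sentinel followed by the tokens it replaced, ending with
    one final sentinel.
    """
    rng = random.Random(seed)
    n = len(tokens)
    n_corrupt = max(1, round(n * corruption_rate))
    n_spans = max(1, round(n_corrupt / mean_span_len))
    span_len = max(1, n_corrupt // n_spans)

    # Candidate span starts spaced so spans cannot overlap (simplified placement).
    candidates = list(range(0, max(1, n - span_len), span_len + 1))
    rng.shuffle(candidates)
    starts = sorted(candidates[:n_spans])

    inputs: List[str] = []
    targets: List[str] = []
    i = 0
    for k, s in enumerate(starts):
        sentinel = f"<extra_id_{k}>"  # sentinel naming is an assumption in this sketch
        inputs.extend(tokens[i:s])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[s:s + span_len])
        i = s + span_len
    inputs.extend(tokens[i:])
    targets.append(f"<extra_id_{len(starts)}>")  # closing sentinel on the target side
    return inputs, targets

# Example on a hypothetical sentence:
inp, tgt = span_corrupt("Thank you for inviting me to your party last week .".split())
```

Because only the dropped-out spans (plus sentinels) appear on the target side, targets are much shorter than under i.i.d. denoising, which is where the computational saving mentioned above comes from.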
Longer training Our baseline model uses a relatively small amount of pre-training (1/4 as much as BERT (Devlin et al., 2018), 1/16 as much as XLNet (Yang et al., 2019), 1/64 as much as RoBERTa (Liu et al., 2019c), etc.). Fortunately, C4 is big enough that we can train for substantially longer without repeating data (which can be detrimental, as shown in Section 3.4.2). We found in Section 3.6 that additional pre-training can indeed be helpful, and that both increasing the batch size and increasing the number of training steps can confer this benefit. We therefore pre-train our models for 1 million steps on a batch size of $2^{11}$ sequences of length 512, corresponding to a total of about 1 trillion pre-training tokens (about $32\times$ as many as our baseline). In Section 3.4.1, we showed that pre-training on the RealNews-like, WebText-like, and Wikipedia + TBC data sets outperformed pre-training on C4 on a few downstream tasks. However, these data set variants are sufficiently small that they would be repeated hundreds of times over the course of pre-training on 1 trillion tokens. Since we showed in Section 3.4.2 that this repetition could be harmful, we opted instead to continue using the C4 data set.
更长的训练
我们的基准模型使用相对较少的预训练数据 (仅为 BERT (Devlin et al., 2018) 的 1/4,XLNet (Yang et al., 2019) 的 1/16,RoBERTa (Liu et al., 2019c) 的 1/64 等)。幸运的是,C4 数据集足够大,我们可以进行更长时间的训练而不会重复数据 (这可能会产生不利影响,如第 3.4.2 节所示)。我们在第 3.6 节中发现,额外的预训练确实是有帮助的,并且增加批量大小和增加训练步数都可以带来这种好处。因此,我们在批量大小为 $2^{11}$、长度为 512 的序列上对模型进行了 1 百万步的预训练,相当于大约 1 万亿个预训练 Token (约为我们基准模型的 32 倍)。在第 3.4.1 节中,我们展示了在 RealNews 类似、WebText 类似和 Wikipedia + TBC 数据集上进行预训练在一些下游任务上优于在 C4 上进行预训练。然而,这些数据集变体规模较小,在 1 万亿个 Token 的预训练过程中会重复数百次。由于我们在第 3.4.2 节中表明这种重复可能是有害的,因此我们选择继续使用 C4 数据集。
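The "about 1 trillion tokens" figure follows directly from the stated schedule (sequences per batch, tokens per sequence, and number of steps):

$$2^{11} \times 512 \times 10^{6} = 2^{20} \times 10^{6} \approx 1.05 \times 10^{12}\ \text{tokens},$$

which is roughly 30 times the baseline's $2^{35}\approx34\mathrm{B}$ tokens, consistent with the "about $32\times$" figure quoted above.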
Model sizes In Section 3.6 we also showed how scaling up the baseline model size improved performance. However, using smaller models can be helpful in settings where limited computational resources are available for fine-tuning or inference. Based on these factors, we train models with a wide range of sizes:
模型大小
在第 3.6 节中,我们也展示了如何通过扩大基线模型的规模来提高性能。然而,在微调或推理时可用的计算资源有限的情况下,使用较小的模型可能会更有帮助。基于这些因素,我们训练了具有广泛大小范围的模型:
Both the “3B” and “11B” variants use $d_{\mathrm{model}}=1024$, a 24 layer encoder and decoder, and $d_{\mathrm{kv}}=128$. For the “3B” variant, we use $d_{\mathrm{ff}}=16{,}384$ with 32-headed attention, which results in around 2.8 billion parameters; for “11B” we use $d_{\mathrm{ff}}=65{,}536$ with 128-headed attention producing a model with about 11 billion parameters. We chose to scale up $d_{\mathrm{ff}}$ specifically because modern accelerators (such as the TPUs we train our models on) are most efficient for large dense matrix multiplications like those in the Transformer’s feed-forward networks.
“3B”和“11B”两个变体均使用 $d_{\mathrm{model}}=1024$、24 层编码器和解码器,以及 $d_{\mathrm{kv}}=128$。对于“3B”版本,我们使用 $d_{\mathrm{ff}}=16{,}384$ 并采用 32 头注意力机制,这导致模型参数量约为 28 亿;对于“11B”,我们使用 $d_{\mathrm{ff}}=65{,}536$ 并采用 128 头注意力机制,生成的模型参数量约为 110 亿。我们选择特别增加 $d_{\mathrm{ff}}$ 是因为现代加速器 (例如我们用于训练模型的 TPU) 在处理像 Transformer 前馈网络中的大型密集矩阵乘法时效率最高。
Multi-task pre-training In Section 3.5.3, we showed that pre-training on a multi-task mixture of unsupervised and supervised tasks before fine-tuning worked as well as pre-training on the unsupervised task alone. This is the approach advocated by the “MT-DNN” (Liu et al., 2015, 2019b). It also has the practical benefit of being able to monitor “downstream” performance for the entire duration of training, rather than just during fine-tuning. We therefore used multi-task pre-training in our final set of experiments. We hypothesize that larger models trained for longer might benefit from a larger proportion of unlabeled data because they are more likely to overfit to smaller training data sets. However, we also note that the results of Section 3.5.3 suggest that fine-tuning after multi-task pre-training can mitigate some of the issues that might arise from choosing a suboptimal proportion of unlabeled data. Based on these ideas, we substitute the following artificial data set sizes for our unlabeled data before using standard example-proportional mixing (described in Section 3.5.2): 710,000 for Small, 2,620,000 for Base, 8,660,000 for Large, 33,500,000 for 3B, and 133,000,000 for 11B. For all model variants, we also capped the effective data set size of the WMT English to French and WMT English to German data sets to 1M examples during pre-training.
多任务预训练
在第 3.5.3 节中,我们展示了在微调之前对无监督和有监督任务的混合进行预训练的效果与仅对无监督任务进行预训练一样好。这是由“MT-DNN” (Liu et al., 2015, 2019b) 提倡的方法。它还具有能够在整个训练期间监控“下游”性能的实际优势,而不仅仅是在微调期间。因此,在我们的最终实验中使用了多任务预训练。我们假设更大的模型经过更长时间的训练可能会从更大比例的未标记数据中受益,因为它们更容易过拟合到较小的训练数据集。然而,我们也注意到第 3.5.3 节的结果表明,在多任务预训练后进行微调可以缓解由于选择次优比例的未标记数据而可能引起的一些问题。基于这些想法,在使用标准示例比例混合(在第 3.5.2 节中描述)之前,我们用以下人工数据集大小替换未标记数据:Small 为 710,000,Base 为 2,620,000,Large 为 8,660,000,3B 为 33,500,000,11B 为 133,000,000。对于所有模型变体,在预训练期间,我们将 WMT 英语到法语和 WMT 英语到德语数据集的有效数据集大小限制为 1M 示例。
Fine-tuning on individual GLUE and SuperGLUE tasks So far, when fine-tuning on GLUE and SuperGLUE, we have concatenated all of the data sets in each benchmark so that we only fine-tune models once for GLUE and once for SuperGLUE. This approach makes our study logistically simpler, but we found that this sacrifices a small amount of performance on some tasks compared to fine-tuning on the task separately. A potential issue with fine-tuning on individual tasks, which would otherwise be mitigated by training on all tasks at once, is that we might overfit quickly to low-resource tasks. For example, our large batch size of $2^{11}$ length-512 sequences would result in the entire data set appearing multiple times in each batch for many of the low-resource GLUE and SuperGLUE tasks. We therefore use a smaller batch size of 8 length-512 sequences during fine-tuning for each GLUE and SuperGLUE task. We also save checkpoints every 1,000 steps rather than every 5,000 steps to ensure we have access to the model’s parameters before it overfits.
在 GLUE 和 SuperGLUE 单个任务上的微调
到目前为止,在对 GLUE 和 SuperGLUE 进行微调时,我们将每个基准中的所有数据集连接起来,以便我们只为 GLUE 和 SuperGLUE 各微调一次模型。这种方法使我们的研究在操作上更简单,但我们发现,与分别对各任务进行微调相比,这在某些任务上会牺牲少量性能。对单个任务进行微调的一个潜在问题是,我们可能会快速过拟合到低资源任务,而一次性在所有任务上训练则可以缓解这一问题。例如,我们较大的批量大小 ($2^{11}$ 个长度为 512 的序列) 会导致许多低资源的 GLUE 和 SuperGLUE 任务的整个数据集在每个批量中多次出现。因此,在对每个 GLUE 和 SuperGLUE 任务进行微调时,我们使用较小的批量大小,即 8 个长度为 512 的序列。我们还每 1,000 步保存一次检查点,而不是每 5,000 步保存一次,以确保在模型过拟合之前可以访问其参数。
Beam search All of our previous results were reported using greedy decoding. For tasks with long output sequences, we found improved performance from using beam search (Sutskever et al., 2014). Specifically, we use a beam width of 4 and a length penalty of $\alpha=0.6$ (Wu et al., 2016) for the WMT translation and CNN/DM summarization tasks.
束搜索
我们之前的所有结果都是使用贪婪解码报告的。对于输出序列较长的任务,我们发现使用束搜索 (Sutskever et al., 2014) 可以提高性能。具体来说,我们在 WMT 翻译和 CNN/DM 摘要任务中使用了束宽为 4 和长度惩罚 $\alpha=0.6$ (Wu et al., 2016)。
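For reference, the length penalty of Wu et al. (2016) with $\alpha=0.6$ normalizes each hypothesis's log-probability as sketched below (a minimal illustration; how it is wired into the actual beam-search loop is omitted, and the function names are invented for this example):

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style length penalty lp(Y) = ((5 + |Y|) / (5 + 1)) ** alpha (Wu et al., 2016)."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(sum_logprob: float, length: int, alpha: float = 0.6) -> float:
    """Length-normalized score log P(Y | X) / lp(Y) used to rank beam hypotheses."""
    return sum_logprob / length_penalty(length, alpha)
```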
Test set Since this is our final set of experiments, we report results on the test set rather than the validation set. For CNN/Daily Mail, we use the standard test set distributed with the data set. For the WMT tasks, this corresponds to using newstest2014 for English-German, newstest2015 for English-French, and newstest2016 for English-Romanian. For GLUE and SuperGLUE, we used the benchmark evaluation servers to compute official test set scores.$^{15,16}$ For SQuAD, evaluating on the test set requires running inference on a benchmark server. Unfortunately, the computational resources on this server are insufficient for obtaining predictions from our largest models. As a result, we instead continue to report performance on the SQuAD validation set. Fortunately, the model with the highest performance on the SQuAD test set also reported results on the validation set, so we can still compare to what is ostensibly the state-of-the-art.
测试集
由于这是我们的最终实验集,我们在测试集上报告结果而不是验证集。对于 CNN/Daily Mail,我们使用与数据集一起分发的标准测试集。对于 WMT 任务,这分别对应于使用 newstest2014 (英语-德语)、newstest2015 (英语-法语) 以及 newstest2016 (英语-罗马尼亚语)。对于 GLUE 和 SuperGLUE,我们使用基准评估服务器来计算官方测试集分数。$^{15,16}$ 对于 SQuAD,在测试集上进行评估需要在基准服务器上运行推理。不幸的是,该服务器上的计算资源不足以从我们最大的模型中获得预测结果。因此,我们继续在 SQuAD 验证集上报告性能。幸运的是,在 SQuAD 测试集上表现最好的模型也在验证集上报告了结果,因此我们仍然可以与所谓的最先进水平进行比较。
Apart from those changes mentioned above, we use the same training procedure and hyperparameters as our baseline (AdaFactor optimizer, inverse square root learning rate schedule for pre-training, constant learning rate for fine-tuning, dropout regularization, vocabulary, etc.). For reference, these details are described in Section 2.
除了上述更改外,我们使用与基线相同的训练过程和超参数(AdaFactor 优化器,预训练时的反平方根学习率调度,微调时的恒定学习率,dropout 正则化,词汇表等)。详细信息请参见第 2 节。
The results of this final set of experiments are shown in Table 14. Overall, we achieved state-of-the-art performance on 18 out of the 24 tasks we consider. As expected, our largest (11 billion parameter) model performed best among our model size variants across all tasks. Our T5-3B model variant did beat the previous state of the art in a few tasks, but scaling the model size to 11 billion parameters was the most important ingredient for achieving our best performance. We now analyze the results for each individual benchmark.
这组最终实验的结果如表 14 所示。总体而言,在我们考虑的 24 个任务中,我们在 18 个任务上取得了最先进的性能。如预期的那样,我们最大的 (110 亿参数) 模型在我们的各个模型规模变体中、在所有任务上均表现最佳。我们的 T5-3B 模型变体确实在几个任务中超过了之前的最先进水平,但将模型大小扩展到 110 亿参数是实现我们最佳性能的最关键因素。我们现在对每个单独的基准测试结果进行分析。
We achieved a state-of-the-art average GLUE score of 90.3. Notably, our performance was substantially better than the previous state-of-the-art for the natural language inference tasks MNLI, RTE, and WNLI. RTE and WNLI are two of the tasks where machine performance has historically lagged behind human performance, which is 93.6 and 95.9 respectively (Wang et al., 2018). In terms of parameter count, our 11B model variant is the largest model that has been submitted to the GLUE benchmark. However, most of the best-scoring submissions use a large amount of ensembling and computation to produce predictions. For example, the best-performing variant of ALBERT (Lan et al., 2019) uses a model similar in size and architecture to our 3B variant (though it has dramatically fewer parameters due to clever parameter sharing). To produce its impressive performance on GLUE, the ALBERT authors ensembled “from 6 to 17” models depending on the task. This likely results in it being more computationally expensive to produce predictions with the ALBERT ensemble than it is with T5-11B.
我们实现了最先进的平均 GLUE 得分 90.3。值得注意的是,我们在自然语言推理任务 MNLI、RTE 和 WNLI 上的表现显著优于之前的最先进水平。RTE 和 WNLI 是机器表现历史上落后于人类表现的两个任务,人类表现分别为 93.6 和 95.9 (Wang et al., 2018)。在参数数量方面,我们的 11B 模型变体是提交给 GLUE 基准测试的最大模型。然而,大多数得分最高的提交使用了大量的集成和计算来生成预测。例如,表现最好的 ALBERT 变体 (Lan et al., 2019) 使用了一个与我们 3B 变体相似大小和架构的模型(尽管由于巧妙的参数共享,它的参数数量大幅减少)。为了在 GLUE 上取得令人印象深刻的表现,ALBERT 的作者根据任务的不同集成了“从 6 到 17”个模型。这可能导致使用 ALBERT 集成生成预测的计算成本高于使用 T5-11B。
For SQuAD, we outperformed the previous state-of-the-art (ALBERT (Lan et al., 2019)) by over one point on the Exact Match score. SQuAD is a long-standing benchmark that was created over three years ago, and most recent improvements have only increased the state-of-the-art by a fraction of a percentage point. We note that when results are reported on the test set, they are typically based on an ensemble of models and/or leverage external data sets (e.g. TriviaQA (Joshi et al., 2017) or NewsQA (Trischler et al., 2016)) to augment the small SQuAD training set. Human performance on SQuAD is estimated at 82.30 and 91.22 for the Exact Match and F1 metric respectively (Rajpurkar et al., 2016), so it is not clear if further improvements on this benchmark are meaningful.
对于 SQuAD,我们在 Exact Match 分数上超过了之前的最先进模型 (ALBERT (Lan et al., 2019)) 超过一个点。SQuAD 是一个三年多前创建的长期基准,最近的改进通常只将最先进水平提高了不到一个百分点。我们注意到,当在测试集上报告结果时,通常是基于模型集成和/或利用外部数据集(例如 TriviaQA (Joshi et al., 2017) 或 NewsQA (Trischler et al., 2016))来扩充较小的 SQuAD 训练集。人类在 SQuAD 上的表现估计为 Exact Match 82.30 和 F1 指标 91.22 (Rajpurkar et al., 2016),因此进一步改进这个基准是否有意义尚不清楚。
| 模型 | GLUE 平均 | CoLA Matthew's | SST-2 准确率 | MRPC F1 | MRPC 准确率 | STS-B Pearson | STS-B Spearman |
|---|---|---|---|---|---|---|---|
| 之前最佳 | 89.4$^{a}$ | 69.2$^{b}$ | 97.1$^{a}$ | 93.6$^{b}$ | 91.5$^{b}$ | 92.7$^{b}$ | 92.3$^{b}$ |
| T5-Small | 77.4 | 41.0 | 91.8 | 89.7 | 86.6 | 85.6 | 85.0 |
| T5-Base | 82.7 | 51.1 | 95.2 | 90.7 | 87.5 | 89.4 | 88.6 |
| T5-Large | 86.4 | 61.2 | 96.3 | 92.4 | 89.9 | 89.9 | 89.2 |
| T5-3B | 88.5 | 67.1 | 97.4 | 92.5 | 90.0 | 90.6 | 89.8 |
| T5-11B | 90.3 | 71.6 | 97.5 | 92.8 | 90.4 | 93.1 | 92.8 |
| QQP | QQP | MNLI-m | MNLI-mm | QNLI | RTE | WNLI | |
| 模型 | F1 | 准确率 | 准确率 | 准确率 | 准确率 | 准确率 | 准确率 |
| 之前最佳 | 74.8$^{c}$ | 90.7$^{b}$ | 91.3$^{a}$ | 91.0$^{a}$ | 99.2$^{a}$ | 89.2$^{a}$ | 91.8$^{a}$ |
| T5-Small | 70.0 | 88.0 | 82.4 | 82.3 | 90.3 | 69.9 | 69.2 |
| T5-Base | 72.6 | 89.4 | 87.1 | 86.2 | 93.7 | 80.1 | 78.8 |
| T5-Large | 73.9 | 89.9 | 89.9 | 89.6 | 94.8 | 87.2 | 85.6 |
| T5-3B | 74.4 | 89.7 | 91.4 | 91.2 | 96.3 | 91.1 | 89.7 |
| T5-11B | 75.1 | 90.6 | 92.2 | 91.9 | 96.9 | 92.8 | 94.5 |
| SQuAD | SQuAD | SuperGLUE | BoolQ | CB | CB | COPA | |
| 模型 | EM | F1 | 平均 | 准确率 | F1 | 准确率 | 准确率 |
| 之前最佳 | 90.1$^{a}$ | 95.5$^{a}$ | 84.6$^{d}$ | 87.1$^{d}$ | 90.5$^{d}$ | 95.2$^{d}$ | 90.6$^{d}$ |
| T5-Small | 79.10 | 87.24 | 63.3 | 76.4 | 56.9 | 81.6 | 46.0 |
| T5-Base | 85.44 | 92.08 | 76.2 | 81.4 | 86.2 | 94.0 | 71.2 |
| T5-Large | 86.66 | 93.79 | 82.3 | 85.4 | 91.6 | 94.8 | 83.4 |
| T5-3B | 88.53 | 94.95 | 86.4 | 89.9 | 90.3 | 94.4 | 92.0 |
| T5-11B | 91.26 | 96.22 | 88.9 | 91.2 | 93.9 | 96.8 | 94.8 |
| MultiRC | MultiRC | ReCoRD | ReCoRD | RTE | WiC | WSC | |
| 模型 | F1a | EM | F1 | 准确率 | 准确率 | 准确率 | 准确率 |
| 之前最佳 | 84.4$^{d}$ | 52.5$^{d}$ | 90.6$^{d}$ | 90.0$^{d}$ | 88.2$^{d}$ | 69.9$^{d}$ | 89.0$^{d}$ |
| T5-Small | 69.3 | 26.3 | 56.3 | 55.4 | 73.3 | 66.9 | 70.5 |
| T5-Base | 79.7 | 43.1 | 75.0 | 74.2 | 81.5 | 68.3 | 80.8 |
| T5-Large | 83.3 | 50.7 | 86.8 | 85.9 | 87.8 | 69.3 | 86.3 |
| T5-3B | 86.8 | 58.3 | 91.2 | 90.4 | 90.7 | 72.1 | 90.4 |
| T5-11B | 88.1 | 63.3 | 94.1 | 93.4 | 92.5 | 76.9 | 93.8 |
| WMT EnDe | WMT EnFr | WMT EnRo | CNN/DM | CNN/DM | CNN/DM | ||
| 模型 | BLEU | BLEU | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | |
| 之前最佳 | 33.8$^{e}$ | 43.8$^{e}$ | 38.5$^{f}$ | 43.47$^{g}$ | 20.30$^{g}$ | 40.63$^{g}$ | |
| T5-Small | 26.7 | 36.0 | 26.8 | 41.12 | 19.56 | 38.35 | |
| T5-Base | 30.9 | 41.2 | 28.0 | 42.05 | 20.34 | 39.40 | |
| T5-Large | 32.0 | 41.5 | 28.1 | 42.50 | 20.68 | 39.75 | |
| T5-3B | 31.8 | 42.6 | 28.2 | 42.72 | 21.02 | 39.94 | |
| T5-11B | 32.1 | 43.4 | 28.1 | 43.52 | 21.55 | 40.69 |
Table 14: Performance of our T5 variants on every task we study. Small, Base, Large, 3B, and 11B refer to model configurations with 60 million, 220 million, 770 million, 3 billion, and 11 billion parameters, respectively. In the first row of each table, we report the state-of-the-art for the task (as of October 24th, 2019), with the superscript denoting its source with references listed at the end of this caption. All results are reported on the test set except for SQuAD where we use the validation set. $^{a}$ (Lan et al., 2019) $^{b}$ (Wang et al., 2019c) $^{c}$ (Zhu et al., 2019) $^{d}$ (Liu et al., 2019c) $^{e}$ (Edunov et al., 2018) $^{f}$ (Lample and Conneau, 2019) $^{g}$ (Dong et al., 2019)
表 14: 我们的 T5 变体在每个研究任务上的性能。Small, Base, Large, 3B, 和 11B 分别指参数量为 60 百万,2.2 亿,7.7 亿,30 亿和 110 亿的模型配置。在每个表格的第一行,我们报告了该任务的最新水平(截至 2019 年 10 月 24 日),上标表示其来源,并在本说明末尾列出引用。所有结果均在测试集上报告,除了 SQuAD,我们在验证集上使用。 $^{a}$ (Lan et al., 2019) $^{b}$ (Wang et al., 2019c) $^{c}$ (Zhu et al., 2019) $^{d}$ (Liu et al., 2019c) $^{e}$ (Edunov et al., 2018) $^{f}$ (Lample and Conneau, 2019) $^{g}$ (Dong et al., 2019)
For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 (Liu et al., 2019c) to 88.9). SuperGLUE was designed to include tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” (Wang et al., 2019b). We nearly match the human performance of 89.8 (Wang et al., 2019b). Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve $100\%$ accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.
对于 SuperGLUE,我们将最先进的水平大幅提高(从平均分数 84.6 (Liu et al., 2019c) 提高到 88.9)。SuperGLUE 被设计为包含“超出当前最先进系统范围,但大多数受过大学教育的英语使用者可以解决”的任务 (Wang et al., 2019b)。我们几乎达到了人类表现的 89.8 (Wang et al., 2019b)。有趣的是,在阅读理解任务(MultiRC 和 ReCoRD)中,我们大大超过了人类的表现,这表明用于这些任务的评估指标可能偏向于机器预测。另一方面,人类在 COPA 和 WSC 上实现了 100% 的准确率,这显著优于我们模型的表现。这表明仍然存在一些语言任务对我们的模型来说难以完美完成,尤其是在资源有限的情况下。
We did not achieve state-of-the-art performance on any of the WMT translation tasks. This may be in part due to our use of an English-only unlabeled data set. We also note that most of the best results on these tasks use back translation (Edunov et al., 2018; Lample and Conneau, 2019), which is a sophisticated data augmentation scheme. The state of the art on the low-resource English to Romanian benchmark also uses additional forms of cross-lingual unsupervised training (Lample and Conneau, 2019). Our results suggest that scale and English-language pre-training may be insufficient to match the performance of these more sophisticated methods. On a more specific note, the best results on English to German news test 2014 set use the much larger training set from WMT 2018 (Edunov et al., 2018), making direct comparison to our results difficult.
我们在任何 WMT 翻译任务上都未达到最先进水平。这可能部分归因于我们使用了仅包含英语的未标注数据集。我们还注意到,这些任务中的大多数最佳结果都使用了回译 (Edunov et al., 2018; Lample 和 Conneau, 2019),这是一种复杂的数据增强方案。在低资源英语到罗马尼亚语基准上的最先进水平也使用了额外形式的跨语言无监督训练 (Lample 和 Conneau, 2019)。我们的结果表明,规模和英语预训练可能不足以匹敌这些更复杂方法的性能。更具体地说,英语到德语新闻测试 2014 集的最佳结果使用了来自 WMT 2018 的更大训练集 (Edunov et al., 2018),这使得与我们的结果进行直接比较变得困难。
Finally, on CNN/Daily Mail we attain state-of-the-art performance, though only by a significant amount on the ROUGE-2-F score. It has been shown that improvements to the ROUGE score do not necessarily correspond to more coherent summaries (Paulus et al., 2017). Furthermore, while CNN/Daily Mail is posed as an abstractive summarization benchmark, purely extractive approaches have been shown to work well (Liu, 2019). It has also been argued that generative models trained with maximum likelihood are prone to producing repetitive summaries (See et al., 2017). Despite these potential issues, we find that our models do generate coherent and largely correct summaries. We provide some non-cherry-picked validation set examples in Appendix C.
最后,在 CNN/Daily Mail 上我们达到了最先进的性能,尽管仅在 ROUGE-2-F 分数上有显著提升。已经证明,ROUGE 分数的改进并不一定对应更连贯的摘要 (Paulus et al., 2017)。此外,虽然 CNN/Daily Mail 被视为抽象式摘要的基准,但完全抽取式方法也被证明效果良好 (Liu, 2019)。还有一种观点认为,使用最大似然训练的生成式模型容易产生重复的摘要 (See et al., 2017)。尽管存在这些潜在问题,我们发现我们的模型确实生成了连贯且大部分正确的摘要。我们在附录 C 中提供了一些未经挑选的验证集示例。
To achieve its strong results, T5 combines insights from our experimental study with unprecedented scale. Note that in Section 3.6 we found that scaling up the pre-training amount or size of our baseline model produced substantial gains. Given this, we were interested to measure how much the “non-scaling” changes we introduced into T5 contributed to its strong performance. We therefore carried out a final experiment where we compared the following three configurations: First, the standard baseline model, which was pre-trained on $2^{35}\approx34\mathrm{B}$ tokens; second, the baseline trained instead for about 1 trillion tokens (i.e. the same amount of pre-training used for T5), which we refer to as “baseline-1T”; and third, T5-Base. Note that the differences between baseline-1T and T5-Base comprise the “non-scaling” changes we made when designing T5. As such, comparing the performance of these two models gives us a concrete measurement of the impact of the insights from our systematic study.
为了实现其强大的结果,T5 结合了我们实验研究的见解和前所未有的规模。请注意,在第 3.6 节中我们发现,增加预训练数据量或我们基准模型的大小带来了显著的提升。鉴于此,我们有兴趣衡量我们引入到 T5 中的“非扩展”更改对其强大性能的贡献程度。因此,我们进行了最后一次实验,比较了以下三种配置:第一,标准基准模型,该模型在约 $2^{35}\approx34\mathrm{B}$ 个 Token 上进行了预训练;第二,预训练了大约 1 万亿个 Token 的基准模型(即与 T5 使用相同的预训练量),我们称之为“baseline-1T”;第三,T5-Base。请注意,baseline-1T 和 T5-Base 之间的差异包括我们在设计 T5 时所做的“非扩展”更改。因此,比较这两个模型的性能可以为我们提供一个系统研究见解影响的具体测量。
The performance of these three model configurations is shown in Table 15. Consistent with the findings in Section 3.6, we find that additional pre-training improves performance over the baseline. Nevertheless, T5-Base substantially outperforms baseline-1T on all downstream tasks. This suggests that scale is not the only factor that contributes to T5’s
这三种模型配置的性能如表 15 所示。与第 3.6 节的发现一致,我们发现额外的预训练相比基线提高了性能。然而,T5-Base 在所有下游任务上都显著优于 baseline-1T。这表明规模并不是促成 T5 成功的唯一因素。
| 模型 | GLUE | CNNDM | SQuAD | SGLUE | EnDe | EnFr | EnRo |
|---|---|---|---|---|---|---|---|
| Baseline | 83.28 | 19.24 | 80.88 | 71.36 | 26.98 | 39.82 | 27.65 |
| Baseline-1T | 84.80 | 19.62 | 83.01 | 73.90 | 27.46 | 40.30 | 28.34 |
| T5-Base | 85.97 | 20.90 | 85.44 | 75.64 | 28.37 | 41.37 | 28.98 |
Table 15: Performance comparison of T5-Base to our baseline experimental setup used in the rest of the paper. Results are reported on the validation set. “Baseline-1T” refers to the performance achieved by pre-training the baseline model on 1 trillion tokens (the same number used for the T5 model variants) instead of $2^{35}\approx34\mathrm{B}$ tokens (as was used for the baseline).
表 15: T5-Base 与我们在论文其余部分使用的基线实验设置的性能对比。结果在验证集上报告。“Baseline-1T”指的是通过在 1 万亿个 Token (与 T5 模型变体使用的数量相同) 上预训练基线模型所达到的性能,而不是在 $2^{35}\approx34\mathrm{B}$ 个 Token (如基线模型所使用的) 上预训练。
success. We hypothesize that the larger models benefit not only from their increased size but also from these non-scaling factors.
我们假设,较大的模型不仅从其增大的规模中受益,还从这些非扩展因素中获益。
4. Reflection
4. 反思
Having completed our systematic study, we wrap up by first recapping some of our most significant findings. Our results provide some high-level perspective on which avenues of research might be more or less promising. To conclude, we outline some topics we think might provide effective approaches for further progressing the field.
在完成系统研究之后,我们首先回顾其中一些最重要的发现。我们的结果提供了一些高层次的视角,说明哪些研究方向可能更有前景、哪些前景较小。最后,我们概述了一些我们认为可能为该领域的进一步发展提供有效途径的主题。
4.1. Takeaways
4.1. 主要收获
Text-to-text Our text-to-text framework provides a simple way to train a single model on a wide variety of text tasks using the same loss function and decoding procedure. We showed how this approach can be successfully applied to generative tasks like abstractive summarization, classification tasks like natural language inference, and even regression tasks like STS-B. In spite of its simplicity, we found the text-to-text framework obtained comparable performance to task-specific architectures and ultimately produced state-of-the-art results when combined with scale.
文本到文本
我们的文本到文本框架提供了一种简单的方法,可以使用相同的损失函数和解码过程,在各种各样的文本任务上训练单一模型。我们展示了这种方法如何成功应用于生成式任务(如摘要生成)、分类任务(如自然语言推理),甚至回归任务(如 STS-B)。尽管方法简单,我们发现文本到文本框架获得了与特定任务架构相当的性能,并且在结合规模后最终取得了最先进的结果。
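For concreteness, here is a minimal sketch of this casting under the prefixed formats shown in Appendix D; the helper function and its field names are our own illustration, not the released preprocessing code:

```python
# A minimal sketch of casting different task types into (input, target) string
# pairs, following the prefixed formats shown in Appendix D. The helper and
# its field names are illustrative only.
def to_text_to_text(task, example):
    if task == "cola":  # classification: label indices become label words
        inputs = "cola sentence: " + example["sentence"]
        target = "acceptable" if example["label"] == 1 else "unacceptable"
    elif task == "cnn_dailymail":  # abstractive summarization: plain text target
        inputs = "summarize: " + example["article"]
        target = example["highlights"]
    elif task == "wmt_en_de":  # translation: plain text target
        inputs = "translate English to German: " + example["en"]
        target = example["de"]
    else:
        raise ValueError("unknown task: " + task)
    return inputs, target
```

Because every task shares this string-to-string interface, the same maximum-likelihood training loss and decoding procedure can be reused unchanged across tasks.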
Architectures While some work on transfer learning for NLP has considered architectural variants of the Transformer, we found the original encoder-decoder form worked best in our text-to-text framework. Though an encoder-decoder model uses twice as many parameters as “encoder-only” (e.g. BERT) or “decoder-only” (language model) architectures, it has a similar computational cost. We also showed that sharing the parameters in the encoder and decoder did not result in a substantial performance drop while halving the total parameter count.
架构
虽然一些关于 NLP 迁移学习的工作考虑了 Transformer 的架构变体,但我们发现原始的编码器-解码器形式在我们的文本到文本框架中表现最好。尽管编码器-解码器模型使用的参数量是“仅编码器”(例如 BERT)或“仅解码器”(语言模型)架构的两倍,但其计算成本相似。我们还表明,在编码器和解码器之间共享参数并不会导致性能显著下降,同时将总参数量减少了一半。
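A back-of-the-envelope sketch of this parameter/compute argument, using the standard ~12·d_model² parameters-per-block estimate (cross-attention and other small terms ignored; the layer count and width below are illustrative, not the paper's model sizes):

```python
# Rough check: an encoder-decoder with L encoder and L decoder blocks has ~2x
# the parameters of an L-block decoder-only model, but each token is processed
# by only L blocks, so per-token compute is comparable.
def block_params(d_model):
    return 12 * d_model ** 2   # self-attention (~4d^2) + feed-forward (~8d^2)

L, d = 12, 768                              # hypothetical sizes
lm_params = L * block_params(d)             # decoder-only language model
enc_dec_params = 2 * L * block_params(d)    # separate encoder + decoder stacks
assert enc_dec_params == 2 * lm_params

# Per-token compute is ~2 FLOPs per parameter that actually touches the token.
# An input token passes only through the L encoder blocks and a target token
# only through the L decoder blocks, matching the L-block language model.
flops_per_token_lm = 2 * lm_params
flops_per_token_enc_dec = 2 * L * block_params(d)
assert flops_per_token_lm == flops_per_token_enc_dec
```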
Unsupervised objectives Overall, we found that most “denoising” objectives, which train the model to reconstruct randomly corrupted text, performed similarly in the text-to-text setup. As a result, we suggest using objectives that produce short target sequences so that unsupervised pre-training is more computationally efficient.
无监督目标
总体而言,我们发现大多数“去噪”目标,即训练模型重建随机损坏的文本,在文本到文本设置中表现相似。因此,我们建议使用生成短目标序列的目标,从而使无监督预训练更加计算高效。
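A simplified sketch of such a denoising objective with sentinel tokens follows; runs of dropped tokens become a single sentinel in the input, which is exactly what keeps the target sequences short. This illustrates the idea and is not the exact implementation from the released codebase:

```python
import random

# Simplified span-corruption sketch: consecutive dropped tokens are replaced
# by one sentinel in the input; the target lists each sentinel followed by
# the tokens it replaced, plus a final terminating sentinel.
def span_corrupt(tokens, corruption_rate=0.15, rng=None):
    rng = rng or random.Random(0)
    inputs, targets = [], []
    sentinel, prev_dropped = 0, False
    for tok in tokens:
        if rng.random() < corruption_rate:
            if not prev_dropped:  # start of a new corrupted span
                inputs.append("<extra_id_%d>" % sentinel)
                targets.append("<extra_id_%d>" % sentinel)
                sentinel += 1
            targets.append(tok)
            prev_dropped = True
        else:
            inputs.append(tok)
            prev_dropped = False
    targets.append("<extra_id_%d>" % sentinel)  # final sentinel ends the target
    return " ".join(inputs), " ".join(targets)

# e.g. span_corrupt("Thank you for inviting me to your party last week .".split())
```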
Data sets We introduced the “Colossal Clean Crawled Corpus” (C4), which comprises heuristically cleaned text from the Common Crawl web dump. When comparing C4 to data sets that use additional filtering, we found that training on in-domain unlabeled data could boost performance in a few downstream tasks. However, constraining to a single domain typically results in a smaller data set. We separately showed that performance can degrade when an unlabeled data set is small enough that it is repeated many times over the course of pre-training. This motivates the use of a large and diverse data set like C4 for generic language understanding tasks.
数据集
我们介绍了“Colossal Clean Crawled Corpus” (C4),它包含从 Common Crawl 网络转储中启发式清理的文本。在将 C4 与使用额外过滤的数据集进行比较时,我们发现,在域内未标注数据上进行训练可以提升一些下游任务的性能。然而,限制在一个单一领域通常会导致数据集规模较小。我们还分别展示了当未标注数据集足够小以至于在整个预训练过程中被多次重复时,性能可能会下降。这促使我们在通用语言理解任务中使用像 C4 这样大而多样的数据集。
Training strategies We found that the basic approach of updating all of a pre-trained model’s parameters during fine-tuning outperformed methods that are designed to update fewer parameters, although updating all parameters is most expensive. We also experimented with various approaches for training the model on multiple tasks at once, which in our text-to-text setting simply corresponds to mixing examples from different data sets when constructing batches. The primary concern in multi-task learning is setting the proportion of each task to train on. We ultimately did not find a strategy for setting mixing proportions that matched the performance of the basic approach of unsupervised pre-training followed by supervised fine-tuning. However, we found that fine-tuning after pre-training on a mixture of tasks produced comparable performance to unsupervised pre-training.
训练策略
我们发现,在微调期间更新预训练模型所有参数的基本方法优于旨在更新较少参数的方法,尽管更新所有参数的成本最高。我们还尝试了多种方法来同时训练模型执行多个任务,在我们的文本到文本设置中,这仅仅对应于在构建批次时混合来自不同数据集的示例。多任务学习的主要问题是设置每个任务的训练比例。我们最终没有找到一种设置混合比例的策略能够匹配无监督预训练后进行有监督微调的基本方法的性能。然而,我们发现,在任务混合上进行预训练后再进行微调,其性能与无监督预训练相当。
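A minimal sketch of this kind of example mixing is shown below. The rate computation caps each data set at an artificial size K (the value shown is a hypothetical setting, not a recommendation), and batches are then assembled by sampling tasks at those rates:

```python
import random

# "Examples-proportional" mixing sketch: each batch element is drawn from a
# task sampled in proportion to its (capped) data set size. Illustrative only.
def examples_proportional_rates(dataset_sizes, K=2 ** 16):
    capped = {name: min(size, K) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}

def sample_batch(datasets, rates, batch_size, rng=None):
    rng = rng or random.Random(0)
    names, weights = zip(*rates.items())
    return [rng.choice(datasets[rng.choices(names, weights)[0]])
            for _ in range(batch_size)]
```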
Scaling We compared various strategies for taking advantage of additional compute, including training the model on more data, training a larger model, and using an ensemble of models. We found each approach conferred a significant boost in performance, though training a smaller model on more data was often outperformed by training a larger model for fewer steps. We also showed an ensemble of models can provide substantially better results than a single model, which provides an orthogonal means of leveraging additional computation. Ensembling models that were fine-tuned from the same base pre-trained model performed worse than pre-training and fine-tuning all models completely separately, though fine-tune-only ensembling still substantially outperformed a single model.
扩展
我们比较了多种利用额外计算资源的策略,包括在更多数据上训练模型、训练更大规模的模型以及使用模型集成。我们发现每种方法都能显著提升性能,尽管在更多数据上训练较小模型通常不如训练较大模型但训练步数较少的效果好。我们还展示了模型集成可以提供比单个模型更优的结果,这为利用额外计算资源提供了一种正交的方法。从同一基础预训练模型微调的模型集成表现不如完全独立预训练和微调的所有模型,但仅微调的模型集成仍然显著优于单个模型。
Pushing the limits We combined our above insights and trained substantially larger models (up to 11 billion parameters) to achieve state-of-the-art results across many of the benchmarks we considered. For unsupervised training, we extracted text from our C4 data set and applied a denoising objective that corrupts contiguous spans of tokens. We pre-trained on a multi-task mixture before fine-tuning on individual tasks. Overall, our models were trained on over 1 trillion tokens. In the interest of facilitating the replication, extension, and application of our results, we release our code, the C4 data set, and pre-trained model weights for each T5 variant.
推动极限
我们结合了上述见解,训练了规模显著更大的模型(最多达 110 亿参数),以在我们考虑的许多基准测试中取得最先进的成果。对于无监督训练,我们从 C4 数据集中提取文本,并应用一种去噪目标,该目标会破坏连续的 Token 段。我们在多任务混合数据上进行了预训练,然后在各个任务上进行微调。总体而言,我们的模型是在超过 1 万亿个 Token 上训练的。为了便于复制、扩展和应用我们的成果,我们发布了代码、C4 数据集以及每个 T5 变体的预训练模型权重。
4.2. Outlook
4.2. 展望
The inconvenience of large models An unsurprising but important result from our study is that larger models tend to perform better. The fact that the hardware used for running these models is continually getting cheaper and more powerful suggests that scaling up may continue to be a promising way to achieve better performance (Sutton, 2019). However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning (Konečný et al., 2015, 2016). Relatedly, one beneficial use of transfer learning is the possibility of attaining good performance on low-resource tasks. Low-resource tasks often occur (by definition) in settings where one lacks the assets to label more data. It follows that low-resource applications often also have limited access to computational resources which can incur additional costs. As a result, we advocate for research on methods that achieve stronger performance with cheaper models so that transfer learning can be applied where it will have the most impact. Some current work along these lines include distillation (Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2019), parameter sharing (Lan et al., 2019), and conditional computation (Shazeer et al., 2017).
大模型的不便性
一项并不意外但重要的研究结果是,较大的模型往往表现更好。用于运行这些模型的硬件不断变得更便宜且更强大,这表明扩大规模可能是实现更好性能的一种有前景的方法 (Sutton, 2019)。然而,在某些应用场景和情境中,使用较小或成本较低的模型是有帮助的,例如在客户端推理或联邦学习 (Konečný et al., 2015, 2016) 中。与此相关的是,迁移学习的一个有益用途是在低资源任务上获得良好性能的可能性。低资源任务通常发生在缺乏标注更多数据资产的情况下(根据定义)。因此,低资源应用通常也受限于计算资源,这可能会带来额外的成本。因此,我们提倡研究能够在更便宜的模型上实现更强性能的方法,以便迁移学习可以在最需要的地方发挥作用。目前在这方面的一些工作包括蒸馏 (Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2019),参数共享 (Lan et al., 2019),以及条件计算 (Shazeer et al., 2017)。
More efficient knowledge extraction Recall that one of the goals of pre-training is (loosely speaking) to provide the model with general-purpose “knowledge” that improves its performance on downstream tasks. The method we use in this work, which is currently common practice, is to train the model to denoise corrupted spans of text. We suspect that this simplistic technique may not be a very efficient way to teach the model general-purpose knowledge. More concretely, it would be useful to be able to attain good fine-tuning performance without needing to train our models on 1 trillion tokens of text first. Some concurrent work along these lines improves efficiency by pre-training a model to distinguish between real and machine-generated text (Clark et al., 2020).
更高效的知识提取
回顾预训练的一个目标(粗略地说)是为模型提供通用的“知识”,以提高其在下游任务中的性能。我们在本工作中使用的方法,目前是常见的做法,是训练模型对损坏的文本片段进行去噪。我们怀疑这种简单的技术可能不是教授模型通用知识的有效方法。更具体地说,在不需要首先训练我们的模型处理 1 万亿个 Token 的文本的情况下,能够获得良好的微调性能将是有用的。一些同期的工作通过预训练模型来区分真实和机器生成的文本 (Clark et al., 2020) 来提高效率。
Formalizing the similarity between tasks We observed that pre-training on unlabeled in-domain data can improve performance on downstream tasks (Section 3.4). This finding mostly relies on basic observations like the fact that SQuAD was created using data from Wikipedia. It would be useful to formulate a more rigorous notion of the “similarity” between the pre-training and downstream tasks, so that we could make more principled choices about what source of unlabeled data to use. There is some early empirical work along these lines in the field of computer vision (Huh et al., 2016; Kornblith et al., 2018; He et al., 2018). A better notion of the relatedness of tasks could also help choose supervised pre-training tasks, which has been shown to be helpful for the GLUE benchmark (Phang et al., 2018).
形式化任务之间的相似性
我们观察到,在未标注的领域内数据上进行预训练可以提高下游任务的性能(第 3.4 节)。这一发现主要依赖于一些基本观察,例如 SQuAD 是使用来自 Wikipedia 的数据创建的。为了能够更合理地选择使用哪种未标注数据源,有必要制定一个更严格的“任务相似性”概念。在计算机视觉领域已经有一些早期的实证工作沿着这个方向展开 (Huh et al., 2016; Kornblith et al., 2018; He et al., 2018)。更好的任务相关性概念也有助于选择监督预训练任务,这已被证明对 GLUE 基准测试有帮助 (Phang et al., 2018)。
Language-agnostic models We were disappointed to find that English-only pre-training did not achieve state-of-the-art results on the translation tasks we studied. We also are interested in avoiding the logistical difficulty of needing to specify which languages a vocabulary can encode ahead of time. To address these issues, we are interested in further investigating language-agnostic models, i.e. models that can perform a given NLP task with good performance regardless of the text’s language. This is an especially pertinent issue given that English is not the native language for the majority of the world’s population.
语言无关模型
我们失望地发现,仅使用英语进行预训练并未在我们研究的翻译任务中取得最先进的结果。我们也希望避免需要提前指定词汇表能够编码哪些语言所带来的实际操作上的困难。为了解决这些问题,我们有兴趣进一步研究语言无关模型,即无论文本是何种语言,都能以良好性能执行给定自然语言处理任务的模型。鉴于英语并非世界上大多数人口的母语,这是一个尤为切题的问题。
The motivation for this paper was the flurry of recent work on transfer learning for NLP. Before we began this work, these advances had already enabled breakthroughs in settings where learning-based methods had not yet been shown to be effective. We are happy to be able to continue this trend, for example by nearly matching human-level performance on the SuperGLUE benchmark, a task specifically designed to be difficult for modern transfer-learning pipelines. Our results stem from the combination of a straightforward and unified text-to-text framework, our new C4 data set, and insights from our systematic study. Additionally, we provided an empirical overview of the field and a perspective on where it stands. We are excited to see continued work using transfer learning towards the goal of general language understanding.
本文的动机源于最近关于自然语言处理 (NLP) 领域迁移学习的一系列工作。在我们开始这项工作之前,这些进展已经在学习方法尚未证明有效的场景中实现了突破。我们很高兴能够延续这一趋势,例如在 SuperGLUE 基准测试中几乎达到人类水平的表现,该任务是专门为现代迁移学习管道设计的难题。我们的成果来自于一个简单且统一的文本到文本框架、我们新的 C4 数据集以及我们系统研究的见解。此外,我们还提供了该领域的实证概述,并对其现状进行了展望。我们期待看到继续使用迁移学习来实现通用语言理解的研究。
Acknowledgments
致谢
We thank Grady Simon, Noah Fiedel, Samuel R. Bowman, Augustus Odena, Daphne Ippolito, Noah Constant, Orhan Firat, Ankur Bapna, and Sebastian Ruder for their comments on this manuscript; Zak Stone and the TFRC team for their support; Austin Tarango for his guidance on data set creation; Melvin Johnson, Dima Lepikhin, Katrin Tomanek, Jeff Klingner, and Naveen Arivazhagan for insight into multi-task machine translation; Neil Houlsby for comments on adapter layers; Olga Wichowska, Ola Spyra, Michael Banfield, Yi Lin, and Frank Chen for assistance with infrastructure; Etienne Pot, Ryan Sepassi, and Pierre Ruyssen for collaboration on TensorFlow Datasets; Rohan Anil for help with our download pipeline for Common Crawl; Robby Neale and Taku Kudo for their work on SentencePiece; and many other members of the Google Brain team for their discussion and insight.
我们感谢 Grady Simon、Noah Fiedel、Samuel R. Bowman、Augustus Odena、Daphne Ippolito、Noah Constant、Orhan Firat、Ankur Bapna 和 Sebastian Ruder 对本文稿的评论;感谢 Zak Stone 和 TFRC 团队的支持;感谢 Austin Tarango 在数据集创建方面的指导;感谢 Melvin Johnson、Dima Lepikhin、Katrin Tomanek、Jeff Klingner 和 Naveen Arivazhagan 对多任务机器翻译的见解;感谢 Neil Houlsby 对适配器层的评论;感谢 Olga Wichowska、Ola Spyra、Michael Banfield、Yi Lin 和 Frank Chen 在基础设施方面的帮助;感谢 Etienne Pot、Ryan Sepassi 和 Pierre Ruyssen 在 TensorFlow Datasets 方面的合作;感谢 Rohan Anil 对 Common Crawl 下载管道的帮助;感谢 Robby Neale 和 Taku Kudo 在 SentencePiece 方面的工作;以及感谢 Google Brain 团队的许多其他成员的讨论和见解。
Appendix A. Contributions
附录 A. 贡献
Colin designed the scope of this project and wrote this paper, ran all the experiments in Sections 3.1 to 3.6, and contributed a large portion of our codebase. Noam contributed many of the ideas, including the text-to-text framework, unsupervised objectives, and data set mixing strategies; implemented our base Transformer model and its architectural variants; and ran the experiments in Section 3.7. Adam oversaw all engineering aspects for this project, created the C4 data set, implemented our data set pipeline, and added various benchmark data sets. Katherine coordinated experiments, wrote and updated documentation, ran experiments to help design our baseline, and contributed to many parts of our codebase. Sharan contributed some of the required data sets and preprocessors, and ran assorted preliminary experiments, in addition to co-leading the open-sourcing of our codebase. Michael owned all aspects of the Winograd data sets, ingested many of the data sets we used, contributed various improvements and fixes to our infrastructure, and ran some preliminary experiments. Yanqi ran experiments and implemented methods to help settle on a reasonable baseline and helped with the final fine-tuning of the models in Section 3.7. Wei also helped with final fine-tuning and improved some of our preprocessors. Peter prototyped an early version of the pre-training data set and resolved issues pertaining to the SQuAD and CNN/DM tasks. All authors helped set the scope and research direction we followed in this work.
Colin 设计了本项目的范围并撰写了本文,完成了第 3.1 节到第 3.6 节的所有实验,并贡献了我们代码库的大部分。Noam 提出了许多想法,包括文本到文本框架、无监督目标和数据集混合策略;实现了我们的基础 Transformer 模型及其架构变体;并进行了第 3.7 节的实验。Adam 监督了该项目的所有工程方面,创建了 C4 数据集,实现了我们的数据集管道,并添加了各种基准数据集。Katherine 协调了实验,撰写并更新了文档,进行了帮助设计我们基线的实验,并为我们代码库的多个部分做出了贡献。Sharan 贡献了一些必需的数据集和预处理器,进行了各种初步实验,此外还共同领导了我们代码库的开源工作。Michael 负责 Winograd 数据集的所有方面,导入了许多我们使用的数据集,为我们的基础设施贡献了各种改进和修复,并进行了一些初步实验。Yanqi 进行了实验并实现了相关方法,以帮助确定合理的基线,并帮助完成了第 3.7 节中模型的最终微调。Wei 也帮助进行了最终微调并改进了我们的一些预处理器。Peter 构建了预训练数据集的早期原型,并解决了与 SQuAD 和 CNN/DM 任务相关的问题。所有作者都帮助设定了我们在本工作中遵循的范围和研究方向。
Appendix B. Converting WNLI to Our Text-to-Text Format
附录 B. 将 WNLI 转换为我们的文本到文本格式
Note that as discussed in Section 2.4, we do not train on any of the data from WNLI. Instead, when evaluating on the WNLI test set (for the results in Section 3.7), we convert the WNLI test set to the “referent noun prediction” text-to-text format so that we can evaluate using a model trained on WSC and DPR. Our WNLI preprocessor is inspired by the one proposed by He et al. (2019). Recall that examples from WNLI consist of a premise, a hypothesis, and a label that indicates whether the hypothesis is True or False. Using the example from Section 2.4, the premise would be “The city councilmen refused the demonstrators a permit because they feared violence.” with the hypothesis “The demonstrators feared violence.” and the label False. We first find the location of all pronouns in the premise (“they” in our example). Then, we find the maximum number of words that precede or follow each pronoun that are a substring in the hypothesis (“feared violence” in our example), ignoring case and punctuation. When the premise contains multiple candidate pronouns, we choose the pronoun that is preceded or followed by the largest substring of the hypothesis. We then highlight the pronoun in the premise by surrounding it with asterisks. For the candidate noun (which is compared to our model’s prediction to obtain a True or False label), we remove the matching substring from the hypothesis and optionally make it non-possessive (resulting in “the demonstrators”).
请注意,如第 2.4 节所述,我们不使用 WNLI 的任何数据进行训练。相反,在评估 WNLI 测试集时(针对第 3.7 节的结果),我们将 WNLI 测试集转换为“指代名词预测”的文本到文本格式,以便使用在 WSC 和 DPR 上训练的模型进行评估。我们的 WNLI 预处理器受到 He et al. (2019) 提出的方法的启发。回忆一下,WNLI 中的示例由前提、假设和指示假设是否为真的标签组成。使用第 2.4 节中的例子,前提是“The city councilmen refused the demonstrators a permit because they feared violence.”,假设为“The demonstrators feared violence.”,标签为 False。我们首先找到前提中所有代词的位置(本例中的“they”)。然后,我们找到每个代词前后能构成假设子字符串的最大单词数(本例中的“feared violence”),忽略大小写和标点符号。当前提包含多个候选代词时,我们选择其前后能匹配假设中最大子字符串的代词。然后我们在前提中用星号包围代词以突出显示它。对于候选名词(将其与我们模型的预测进行比较以获得 True 或 False 标签),我们从假设中删除匹配的子字符串,并视情况将其改为非所有格形式(得到“the demonstrators”)。
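A simplified sketch of the pronoun-selection step of this conversion is given below; the pronoun list and normalization are our own illustrative choices, and the candidate-noun extraction is omitted:

```python
import re

# Simplified sketch of converting a WNLI premise into the WSC-style
# "referent noun prediction" input described above. Illustrative only.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "hers", "its", "their", "theirs"}

def _norm(words):
    return [re.sub(r"\W", "", w).lower() for w in words]

def wnli_premise_to_wsc(premise, hypothesis):
    p_words = premise.split()
    h_text = " ".join(_norm(hypothesis.split()))
    best_i, best_len = None, -1
    for i, w in enumerate(p_words):
        if _norm([w])[0] not in PRONOUNS:
            continue
        # Longest run of words just before or just after the pronoun that
        # appears (case/punctuation-insensitively) in the hypothesis.
        length = 0
        for lo in range(i):
            if " ".join(_norm(p_words[lo:i])) in h_text:
                length = max(length, i - lo)
        for hi in range(i + 1, len(p_words) + 1):
            if " ".join(_norm(p_words[i + 1:hi])) in h_text:
                length = max(length, hi - i - 1)
        if length > best_len:
            best_i, best_len = i, length
    if best_i is None:           # no pronoun found; leave premise unmarked
        return "wsc: " + premise
    marked = list(p_words)
    marked[best_i] = "*" + marked[best_i] + "*"
    return "wsc: " + " ".join(marked)

# wnli_premise_to_wsc(
#     "The city councilmen refused the demonstrators a permit because they feared violence.",
#     "The demonstrators feared violence.")
# -> "... because *they* feared violence."
```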
Appendix C. Example Predictions on CNN/Daily Mail
附录 C. CNN/Daily Mail 示例预测
To show that our model is generating fluent summaries, we include a few example decodes from our best model (T5-11B) on the validation set along with the ground-truth summaries. These examples were selected at random and were not cherry-picked.
为了证明我们的模型能够生成流畅的摘要,我们从验证集中选取了几个示例解码结果,这些结果来自我们最好的模型 (T5-11B),并附上了真实的摘要。这些示例是随机选择的,并未经过精心挑选。
- Ground-truth: leopard gave up after spiky creature refused to back down in fight in kruger national park, south africa . wildlife enthusiast lisl moolman, 41, caught the bizarre battle while out on the road and armed with her camera . the leopard and porcupine tumbled out of the bushes and began to fight by roadside - watched by ms moolman .
- 真实情况:在南非克鲁格国家公园,多刺生物在搏斗中拒不退让,豹最终放弃了。野生动物爱好者丽斯尔·穆尔曼 (Lisl Moolman),41 岁,在外出时用相机捕捉到了这场奇特的战斗。豹和豪猪从灌木丛中滚了出来,并在路边开始搏斗,这一过程被穆尔曼女士目睹。
Prediction: leopard tried to make lunch out of a plucky porcupine in kruger national park, south africa . but the predator was put firmly in its place after the spiky creature refused to back down during a fight . wildlife enthusiast lisl moolman, 41, caught the bizarre battle while out on the road and armed with her camera .
预测:在南非克鲁格国家公园,一只豹子试图将一只勇敢的豪猪变成午餐。但在打斗过程中,这只多刺的动物拒不退让,最终让捕食者吃了苦头。41岁的野生动物爱好者莉斯尔·穆尔曼 (Lisl Moolman) 在外出时用相机捕捉到了这场奇特的战斗。
- Ground-truth: researchers say homes are a hotbed of undiscovered species . study of 40 north carolina homes found 8,000 bacterial and archaeal taxa 11 houses in california revealed a great variety of fungus among us .
- 真实情况:研究人员说家庭是未被发现的物种的温床。对北卡罗来纳州 40 个家庭的研究发现了 8,000 种细菌和古菌分类单元,加州的 11 所房屋中发现了我们周围存在大量真菌。
Prediction: researchers estimate that the indoor biome makes up about 0.5 percent of ice-free land, or about 247,000 square miles, almost the size of texas . a study of 40 houses in north carolina used genomic technologies to document more than 8,000 bacterial and archaeal taxa . another study of 50 houses in the same state described more than 750 types of arthropods, including insects and spiders . an examination of 11 houses in california found a great variety of fungus among us .
预测:研究人员估计,室内生物群落占无冰土地的约 0.5%,即约 247,000 平方英里,几乎相当于德克萨斯州的大小。一项对北卡罗来纳州 40 所房屋的研究使用基因组技术记录了超过 8,000 种细菌和古菌分类单元。另一项对同一州 50 所房屋的研究描述了超过 750 种节肢动物,包括昆虫和蜘蛛。对加利福尼亚州 11 所房屋的检查发现我们周围存在大量真菌。
- Ground-truth: executions have been temporarily halted in the state after the drug used for lethal injections appeared faulty on monday night . kelly gissendaner had her death delayed for a second time when it was found that the lethal pentobarbital dose appeared cloudy . state officials dithered over the decision , changing their mind three times before deciding not to take the risk that it wouldn’t work . it is ’essential that executions are carried out in a constitutional manner,’ said georgia attorney general sam olens .
- 真实情况:周一晚上,用于执行注射死刑的药物出现问题后,该州已暂时停止执行死刑。凯利·吉森丹纳 (Kelly Gissendaner) 的死刑因发现致命剂量的戊巴比妥出现浑浊而第二次被推迟。州官员对这一决定犹豫不决,三次改变主意,最终决定不冒药物可能失效的风险。佐治亚州总检察长山姆·奥伦斯表示:“必须以符合宪法的方式执行死刑。”
Prediction: the execution of kelly gissendaner, the only woman on georgia’s death row, was delayed for a second time on monday . the execution team found that the lethal pentobarbital dose appeared cloudy . the cloudy drug bolstered death penalty opponents, who have been vocal in their opposition after three botched executions in other parts of the country .
预测:佐治亚州死囚中唯一的女性凯利·吉森丹纳 (Kelly Gissendaner) 的死刑于周一第二次被推迟。执行团队发现致命剂量的戊巴比妥出现浑浊。浑浊的药物助长了死刑反对者的声势,此前美国其他地区已经发生了三次执行失误,他们一直强烈表达反对意见。
- Ground-truth: dani alves was not selected for the brazil squad to face france and chile . barcelona defender was filmed serving up hot dogs to people on saturday . this week alves released a charity single with former team-mat jose pinto . alves looks set to leave barcelona on a free transfer this summer . reports in spanish press claim he will sign for paris saint-germain
- 真实情况: Dani Alves 未被选入巴西队对阵法国和智利的大名单。 巴塞罗那后卫在周六被拍到为人们分发热狗。 本周 Alves 与前队友 Jose Pinto 发行了一首慈善单曲。 Alves 很可能会在今年夏天以自由转会的方式离开巴塞罗那。 西班牙媒体的报道声称他将签约巴黎圣日耳曼。
Prediction: dani alves was not selected for brazil’s friendlies against france and chile . the barcelona right back has released a charity single with jose pinto . alves was filmed serving up snacks from behind the counter of a hot dog bar on saturday . alves is yet to be offered a new deal at the nou camp . click here for all the latest barcelona news .
预测:丹尼·阿尔维斯未被选入巴西队对阵法国和智利的友谊赛。这位巴塞罗那右后卫与何塞·平托发布了一首慈善单曲。周六,阿尔维斯被拍到在热狗吧柜台后面分发小吃。阿尔维斯尚未收到诺坎普的新合同报价。点击这里查看最新的巴塞罗那新闻。
Appendix D. Preprocessed Examples
附录 D. 预处理示例
In this section, we provide examples of our preprocessing for each of the data sets we consider.
在本节中,我们提供了对每个数据集进行预处理的示例。
D.1. CoLA
D.1. CoLA
Original input:
Sentence: John made Bill master of himself.
约翰让比尔成为自己的主人。
Processed input: cola sentence: John made Bill master of himself.
处理后的输入:cola 句子:John 使 Bill 成为自己的主人。
Original target: 1
原始目标:1
Processed target: acceptable
处理后的目标:可接受
D.2. RTE
D.2. RTE
Original input:
Sentence 1: A smaller proportion of Yugoslavia’s Italians were settled in Slovenia (at the 1991 national census, some 3000 inhabitants of Slovenia declared themselves as ethnic Italians).
南斯拉夫的意大利人中有较小比例定居在斯洛文尼亚(根据 1991 年全国人口普查,约有 3000 名斯洛文尼亚居民声明自己是意大利族裔)。
Sentence 2: Slovenia has 3,000 inhabitants.
斯洛文尼亚有 3,000 名居民。
Processed input: rte sentence1: A smaller proportion of Yugoslavia’s Italians were settled in Slovenia (at the 1991 national census, some 3000 inhabitants of Slovenia declared themselves as ethnic Italians). sentence2: Slovenia has 3,000 inhabitants.
处理后的输入:rte 句子 1: 较小比例的南斯拉夫意大利人定居在斯洛文尼亚(根据 1991 年全国人口普查,约有 3000 名斯洛文尼亚居民声明自己是意大利族裔)。句子 2: 斯洛文尼亚有 3,000 名居民。
Original target: 1
原始目标:1
Processed target: not_entailment
处理后的目标:非蕴含
D.3. MNLI
D.3. MNLI
Original input:
Hypothesis: The St. Louis Cardinals have always won.
假设:圣路易斯红雀队一直赢。
Premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but
前提:是的,嗯,输球嘛,我是说,我,我本来就是圣路易斯人,圣路易斯红雀队当年还在那儿的时候,呃,基本上算是一支常输的球队,但是
Processed input: mnli hypothesis: The St. Louis Cardinals have always won. premise: yeah well losing is i mean i’m i’m originally from Saint Louis and Saint Louis Cardinals when they were there were uh a mostly a losing team but
处理后的输入:mnli 假设:圣路易斯红雀队总是赢。前提:嗯,输了,我是说,我原本是圣路易斯人,而圣路易斯红雀队在那时大部分时间是一支输球的队伍。
Original target: 2
原始目标:2
Processed target: contradiction
处理后的目标:矛盾
D.4. MRPC
Original input:
Sentence 1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said .
Sentence 2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " .
D.4. MRPC
原始输入:
句子 1: 我们之所以采取行动,是因为我们以 9 月 11 日的经验为棱镜,用新的视角看待现有的证据,"拉姆斯菲尔德说。
句子 2: 美国采取行动的原因是政府以 9 月 11 日的经验为棱镜,"用新的视角看待现有的证据"。
Processed input: mrpc sentence1: We acted because we saw the existing evidence in a new light , through the prism of our experience on 11 September , " Rumsfeld said . sentence2: Rather , the US acted because the administration saw " existing evidence in a new light , through the prism of our experience on September 11 " .
处理后的输入: mrpc sentence1: 我们之所以采取行动,是因为我们以 9 月 11 日的经验为棱镜,用新的视角看待现有的证据,"拉姆斯菲尔德说。sentence2: 美国采取行动的原因是政府以 9 月 11 日的经验为棱镜,“用新的视角看待现有的证据”。
Original target: 1
原始目标:1
Processed target: equivalent
处理后的目标:等效
D.5. QNLI
D.5. QNLI
Original input:
Question: Where did Jebe die?
问题:Jebe 在哪里去世?
Sentence: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand.
成吉思汗随后召回了 Subutai 回蒙古,Jebe 在返回撒马尔罕的途中去世。
Processed input: qnli question: Where did Jebe die? sentence: Genghis Khan recalled Subutai back to Mongolia soon afterwards, and Jebe died on the road back to Samarkand.
处理后的输入: qnli 问题:Jebe 在哪里去世?句子:成吉思汗不久后召回了 Subutai 回蒙古,而 Jebe 在返回撒马尔罕的途中去世。
Original target: 0
原始目标:0
Processed target: entailment
处理后的目标:蕴含
D.6. QQP
D.6. QQP
Original input:
Question 1: What attributes would have made you highly desirable in ancient Rome?
问题 1: 在古代罗马,哪些属性会让你非常受欢迎?
Question 2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?
问题 2: 我如何获得加入 IT 公司的机会作为新人?
Processed input: qqp question1: What attributes would have made you highly desirable in ancient Rome? question2: How I GET OPPERTINUTY TO JOIN IT COMPANY AS A FRESHER?
处理后的输入: qqp 问题 1: 在古代罗马,哪些属性会让你非常受欢迎?问题 2: 我怎样才能有机会以应届新人身份加入 IT 公司?
Original target: 0
原始目标:0
Processed target: not duplicate
处理后的目标:不重复
D.7. SST2
D.7. SST2
Original input:
Sentence: it confirms fincher ’s status as a film maker who artfully bends technical know-how to the service of psychological insight .
它确认了芬奇作为电影制作人的地位,他巧妙地将技术知识服务于心理洞察力。
Processed input: sst2 sentence: it confirms fincher ’s status as a film maker who artfully bends technical know-how to the service of psychological insight
处理后的输入:sst2 句子:它确认了芬奇作为一位巧妙地将技术知识服务于心理洞察的电影制作人的地位
Original target: 1
原始目标:1
Processed target: positive
处理后的目标:正面
D.8. STSB
D.8. STS-B
Original input:
Sentence 1: Representatives for Puretunes could not immediately be reached for comment Wednesday.
Puretunes 的代表周三未能立即联系上以发表评论。
Sentence 2: Puretunes representatives could not be located Thursday to comment on the suit.
Puretunes 的代表周四未能联系上以对诉讼发表评论。
Processed input: stsb sentence1: Representatives for Puretunes could not immediately be reached for comment Wednesday. sentence2: Puretunes representatives could not be located Thursday to comment on the suit.
处理后的输入:句子1:Puretunes 的代表周三未能立即联系上以发表评论。句子2:周四无法找到 Puretunes 的代表对诉讼发表评论。
Original target: 3.25
原始目标:3.25
Processed target: 3.2
处理后的目标:3.2
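The processed target above reflects rounding the continuous similarity score to the nearest increment of 0.2 before rendering it as a string, which keeps the set of possible targets small; a one-line sketch consistent with the 3.25 → “3.2” example:

```python
# Round an STS-B similarity score to the nearest 0.2 and render it as text.
def stsb_target(score):
    return "%.1f" % (round(score * 5) / 5)

assert stsb_target(3.25) == "3.2"   # matches the example above
```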
D.9. CB
D.9. CB
Original input:
Hypothesis: Valence was helping
假设:Valence 在协助
Premise: Valence the void-brain, Valence the virtuous valet. Why couldn’t the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping?
前提:Valence 这个没脑子的家伙,Valence 这位正直的侍从。为什么这家伙不能自己选择要捅巨人身体的哪个部位?他以为自己是在帮忙吗?
Processed input: cb hypothesis: Valence was helping premise: Valence the void-brain, Valence the virtuous valet. Why couldn’t the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping?
处理后的输入:cb 假设:Valence 在帮助 前提:Valence 这个没脑子的家伙,Valence 这位正直的侍从。为什么这家伙不能自己选择要捅巨人身体的哪个部位?他以为自己是在帮忙吗?
Original target: 1
原始目标:1
Processed target: contradiction
处理后的目标:矛盾
D.10. COPA
D.10. COPA
Original input:
Question: effect
问题:影响
Premise: Political violence broke out in the nation.
前提:该国爆发了政治暴力事件。
Choice 1: Many citizens relocated to the capitol.
选项 1: 许多市民搬迁到了首都。
Choice 2: Many citizens took refuge in other territories.
选项 2: 许多市民在其他地区寻求庇护。
Processed input: copa choice1: Many citizens relocated to the capitol. choice2: Many citizens took refuge in other territories. premise: Political violence broke out in the nation. question: effect
处理后的输入:copa 选项1:许多市民搬迁到了首都。选项2:许多市民在其他地区寻求庇护。前提:该国爆发了政治暴力。问题:影响
Original target: 1
原始目标:1
Processed target: True
处理后的目标:True
D.11. MultiRC
D.11. 多重选择阅读理解 (MultiRC)
Original input:
请提供需要翻译的英文内容。
Answer: There was only pie to eat, rather than traditional breakfast foods
Paragraph: Sent 1: Once upon a time, there was a squirrel named Joey.
Sent 2: Joey loved to go outside and play with his cousin Jimmy.
Sent 3: Joey and Jimmy played silly games together, and were always laughing.
Sent 4: One day, Joey and Jimmy went swimming together at their Aunt Julie’s pond.
Sent 5: Joey woke up early in the morning to eat some food before they left.
Sent 6: He couldn’t find anything to eat except for pie!
Sent 7: Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast.
Sent 8: After he ate, he and Jimmy went to the pond.
Sent 9: On their way there they saw their friend Jack Rabbit.
Sent 10: They dove into the water and swam for several hours.
Sent 11: The sun was out, but the breeze was cold.
Sent 12: Joey and Jimmy got out of the water and started walking home.
Sent 13: Their fur was wet, and the breeze chilled them.
Sent 14: When they got home, they dried off, and Jimmy put on his favorite purple shirt.
Sent 15: Joey put on a blue shirt with red and green dots.
Sent 16: The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed.
答案:只有派可以吃,而不是传统的早餐食物。
段落:
句 1: 从前,有一只名叫 Joey 的松鼠。
句 2: Joey 喜欢出去和他表弟 Jimmy 一起玩。
句 3: Joey 和 Jimmy 一起玩傻乎乎的游戏,总是笑个不停。
句 4: 有一天,Joey 和 Jimmy 一起去他们阿姨 Julie 的池塘游泳。
句 5: Joey 一大早就起床,在他们离开前吃点东西。
句 6: 他找不到任何可以吃的东西,除了派!
句 7: 通常,Joey 会吃谷物、水果(一个梨)或燕麦片作为早餐。
句 8: 吃完后,他和 Jimmy 去了池塘。
句 9: 在去那里的路上,他们看到了他们的朋友 Jack Rabbit。
句 10: 他们跳进水里,游了好几个小时。
句 11: 太阳出来了,但微风很冷。
句 12: Joey 和 Jimmy 从水里出来,开始往家走。
句 13: 他们的毛湿漉漉的,微风让他们感到寒冷。
句 14: 回到家后,他们擦干身体,Jimmy 穿上了他最喜欢的紫色衬衫。
句 15: Joey 穿上了一件有红绿圆点的蓝色衬衫。
句 16: 这两只松鼠吃了 Joey 的妈妈 Jasmine 做的食物,然后去睡觉了。
Question: Why was Joey surprised the morning he woke up for breakfast?
问题:为什么乔伊早上醒来吃早餐时感到惊讶?
Processed input: multirc question: Why was Joey surprised the morning he woke up for breakfast? answer: There was only pie to eat, rather than traditional breakfast foods paragraph: Sent 1: Once upon a time, there was a squirrel named Joey.
Sent 2: Joey loved to go outside and play with his cousin Jimmy.
Sent 3: Joey and Jimmy played silly games together, and were always laughing.
Sent 4: One day, Joey and Jimmy went swimming together at their Aunt Julie’s pond.
Sent 5: Joey woke up early in the morning to eat some food before they left.
Sent 6: He couldn’t find anything to eat except for pie!
Sent 7: Usually, Joey would eat cereal, fruit (a pear), or oatmeal for breakfast.
Sent 8: After he ate, he and Jimmy went to the pond.
Sent 9: On their way there they saw their friend Jack Rabbit.
Sent 10: They dove into the water and swam for several hours.
Sent 11: The sun was out, but the breeze was cold.
Sent 12: Joey and Jimmy got out of the water and started walking home.
Sent 13: Their fur was wet, and the breeze chilled them.
Sent 14: When they got home, they dried off, and Jimmy put on his favorite purple shirt.
Sent 15: Joey put on a blue shirt with red and green dots.
Sent 16: The two squirrels ate some food that Joey’s mom, Jasmine, made and went off to bed.
多选阅读理解问题:乔伊为什么在早上醒来吃早餐时感到惊讶?
答案:只有派可以吃,而不是传统的早餐食物
段落:
句子 1: 从前,有一只名叫乔伊的松鼠。
句子 2: 乔伊喜欢出去和他堂弟吉米一起玩。
句子 3: 乔伊和吉米一起玩傻乎乎的游戏,总是笑个不停。
句子 4: 有一天,乔伊和吉米一起去他们阿姨朱莉的池塘游泳。
句子 5: 乔伊一大早就起床,在他们离开前吃点东西。
句子 6: 他找不到任何可以吃的东西,除了派!
句子 7: 平时,乔伊会吃谷物、水果(一个梨)或燕麦粥当早餐。
句子 8: 吃完后,他和吉米去了池塘。
句子 9: 在去那里的路上,他们看到了他们的朋友杰克兔。
句子 10: 他们跳进水里,游了几个小时。
句子 11: 太阳出来了,但微风很冷。
句子 12: 乔伊和吉米从水里出来,开始往家走。
句子 13: 他们的毛湿了,微风让他们感到寒冷。
句子 14: 回到家后,他们擦干身体,吉米穿上了他最喜欢的紫色衬衫。
句子 15: 乔伊穿上了一件有红绿点的蓝色衬衫。
句子 16: 这两只松鼠吃了乔伊的妈妈贾斯敏做的食物,然后上床睡觉。
Original target: 1
原始目标:1
Processed target: True
处理后的目标:True
D.12. WiC
D.12. WiC
Original input:
POS: N
词性:N
Sentence 1: It was the deliberation of his act that was insulting .
句子 1: 令人感到侮辱的,正是他行为的蓄意。
Sentence 2: The deliberations of the jury .
句子 2: 陪审团的审议。
Word: deliberation
单词:deliberation
Processed input: wic pos: N sentence1: It was the deliberation of his act that was insulting . sentence2: The deliberations of the jury . word: deliberation
处理后的输入:wic pos: N 句子 1: 令人感到侮辱的,正是他行为的蓄意。句子 2: 陪审团的审议。单词:deliberation
Original target: 0
原始目标:0
Processed target: False
处理后的目标:False
D.13. WSC and DPR
D.13. WSC 和 DPR
Original input:
Span 2 text: it
Span 2 文本:它
Span 1 text: stable
Span 1 文本:stable (马厩)
Span 2 index: 20
Span 2 索引:20
Span 1 index: 1
Span 1 索引:1
Text: The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made it pleasant and airy.
文本:马厩非常宽敞,有四个不错的隔间;一扇大的摆动窗通向院子,使它显得舒适通风。
Processed input: wsc: The stable was very roomy, with four good stalls; a large swinging window opened into the yard , which made *it* pleasant and airy.
处理后的输入:wsc: 马厩非常宽敞,有四个不错的马厩;一扇大的可开关的窗户通向院子,这使得 它 显得愉快且通风。
Original target: 1
原始目标:1
Processed target: stable
处理后的目标:stable (马厩)
D.14. CNN/Daily Mail
D.14. CNN/每日邮报
Original input: marouane fellaini and adnan januzaj continue to show the world they are not just teammates but also best mates. the manchester united and belgium duo both posted pictures of themselves out at a restaurant on monday night ahead of their game against newcastle on wednesday . januzaj poses in the middle of fellaini and a friend looking like somebody who failed to receive the memo about it being a jackson 5 themed night. premier league duo adnan januzaj and marouane fellaini pose with a friend on the dance floor . manchester united and belgium duo fellaini and januzaj are good friends both on and off the pitch . manchester united ace fellaini runs over to the bench to celebrate his goal against qpr with friend januzaj . the disco effect in the background adds to the theory, but januzaj doesn’t seem to mind as they later pose on the dance floor with other friends. united haven’t had too many reasons to have a song and dance this season so it seems they may be hitting the discotheques as another form of release. however, victory against newcastle on wednesday would leave manager louis van gaal at least tapping his toes as they continue to fight for a champions league spot this season. januzaj and robin van persie join fellaini in celebrating in front of the manchester united fans at west brom . januzaj receives some words of wisdom from manchester united’s dutch manager louis van gaal . januzaj and fellaini are joined by some friends as they take to the dance floor ahead of the newcastle game .
马罗内·费莱尼和阿德南·贾努扎伊继续向世界展示他们不仅是队友,也是最好的朋友。这对曼联和比利时的搭档在周三对阵纽卡斯尔的比赛前,于周一晚上发布了自己在一家餐厅外出的照片。贾努扎伊在费莱尼和一位朋友中间摆姿势,看起来像是没有收到关于这是杰克逊五人组主题夜的通知的人。英超二人组阿德南·贾努扎伊和马罗内·费莱尼与一位朋友在舞池中合影。曼联和比利时的搭档费莱尼和贾努扎伊在场内外都是好朋友。曼联王牌费莱尼跑到替补席上与好友贾努扎伊一起庆祝他对女王公园巡游者的进球。背景中的迪斯科效果增加了这一理论,但贾努扎伊似乎并不介意,因为他们后来与其他朋友在舞池中合影。本赛季曼联并没有太多理由进行歌舞庆祝,所以他们似乎将迪厅作为另一种释放方式。然而,周三战胜纽卡斯尔将使主教练路易斯·范加尔至少轻敲脚尖,因为他们继续为本赛季的欧冠资格而战。贾努扎伊和罗宾·范佩西加入费莱尼,在西布罗姆维奇的曼联球迷面前庆祝。贾努扎伊从曼联的荷兰籍主教练路易斯·范加尔那里得到了一些忠告。贾努扎伊和费莱尼在纽卡斯尔比赛前与一些朋友一起走上舞池。
Processed input: summarize: marouane fellaini and adnan januzaj continue to show the world they are not just teammates but also best mates. the manchester united and belgium duo both posted pictures of themselves out at a restaurant on monday night ahead of their game against newcastle on wednesday . januzaj poses in the middle of fellaini and a friend looking like somebody who failed to receive the memo about it being a jackson 5 themed night. premier league duo adnan januzaj and marouane fellaini pose with a friend on the dance floor . manchester united and belgium duo fellaini and januzaj are good friends both on and off the pitch . manchester united ace fellaini runs over to the bench to celebrate his goal against qpr with friend januzaj . the disco effect in the background adds to the theory, but januzaj doesn’t seem to mind as they later pose on the dance floor with other friends. united haven’t had too many reasons to have a song and dance this season so it seems they may be hitting the discotheques as another form of release. however, victory against newcastle on wednesday would leave manager louis van gaal at least tapping his toes as they continue to fight for a champions league spot this season. januzaj and robin van persie join fellaini in celebrating in front of the manchester united fans at west brom . januzaj receives some words of wisdom from manchester united’s dutch manager louis van gaal . januzaj and fellaini are joined by some friends as they take to the dance floor ahead of the newcastle game .
处理后的输入:summarize: 马罗内·费莱尼和阿德南·贾努扎伊继续向世界展示他们不仅是队友,也是最好的朋友。这对曼联和比利时的组合在周三对阵纽卡斯尔的比赛前,于周一晚上发布了自己在一家餐厅外出的照片。贾努扎伊在费莱尼和一位朋友中间摆姿势,看起来像是没有收到关于这是杰克逊五人组主题之夜的通知的人。英超二人组阿德南·贾努扎伊和马罗内·费莱尼与一位朋友在舞池中合影。曼联和比利时组合费莱尼和贾努扎伊在场上和场下都是好朋友。曼联王牌费莱尼跑到替补席上与好友贾努扎伊一起庆祝他对女王公园巡游者的进球。背景中的迪斯科效果增加了这一理论,但贾努扎伊似乎并不介意,因为他们后来与其他朋友在舞池中合影。曼联本赛季并没有太多理由进行歌舞庆祝,所以他们似乎将迪厅作为另一种释放方式。然而,周三对纽卡斯尔的胜利会让主教练路易斯·范加尔至少轻敲脚尖,因为他们继续为本赛季的欧冠资格而战。贾努扎伊和罗宾·范佩西加入费莱尼,在西布罗姆维奇的曼联球迷面前庆祝。贾努扎伊得到了曼联荷兰籍主教练路易斯·范加尔的一些忠告。贾努扎伊和费莱尼与一些朋友一起在纽卡斯尔比赛前登上舞池。
Original target: the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth .
比利时二人组于周一晚与一些朋友踏上了舞池。曼联将在周三的英超联赛中迎战纽卡斯尔。红魔将争取在近七场客场比赛中的第二场联赛胜利。路易斯·范加尔的球队目前以两分优势领先利物浦,排名第四。
Processed target: the belgian duo took to the dance floor on monday night with some friends . manchester united face newcastle in the premier league on wednesday . red devils will be looking for just their second league away win in seven . louis van gaal’s side currently sit two points clear of liverpool in fourth .
处理后的目标:比利时二人组于周一晚上与一些朋友一起上了舞池。曼联将在周三的英超联赛中迎战纽卡斯尔。红魔将争取在七场客场比赛中的第二场联赛胜利。路易斯·范加尔的球队目前以两分优势领先利物浦,排名第四。
D.15. SQuAD
D.15. SQuAD
Original input:
Question: What does increased oxygen concentrations in the patient’s lungs displace?
问题:患者肺部增加的氧气浓度会置换什么?
Context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O2 as soon as possible is part of the treatment.
高压(高气压)医学使用特殊的氧气舱来增加患者周围 O₂ 的分压,必要时也包括医务人员。一氧化碳中毒、气性坏疽和减压病(‘弯曲病’)有时会使用这些设备进行治疗。肺部 O₂ 浓度的增加有助于将一氧化碳从血红蛋白的血红素基团中置换出来。氧气对引起气性坏疽的厌氧细菌有毒,因此增加其分压有助于杀死它们。减压病发生在潜水员在潜水后减压过快时,导致血液中形成惰性气体(主要是氮气和氦气)的气泡。尽快增加 O₂ 的压力是治疗的一部分。
Processed input: question: What does increased oxygen concentrations in the patient’s lungs displace? context: Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O2 as soon as possible is part of the treatment.
患者肺部增加的氧气浓度会置换什么?
高气压 (high-pressure) 医学使用特殊的氧气舱来增加患者周围 O2 的分压,必要时也包括医务人员。一氧化碳中毒、气性坏疽和减压病 (‘bends’) 有时会使用这些设备进行治疗。肺部增加的 O2 浓度有助于从血红蛋白的血红素基团中置换一氧化碳。氧气对引起气性坏疽的厌氧菌有毒,因此增加其分压有助于杀死它们。减压病发生在潜水员在潜水后减压过快时,导致惰性气体(主要是氮气和氦气)在其血液中形成气泡。尽快增加 O2 的压力是治疗的一部分。
Original target: carbon monoxide
原始目标:一氧化碳
Processed target: carbon monoxide
处理后的目标:一氧化碳
D.16. WMT English to German
D.16. WMT 英语到德语
Original input: "Luigi often said to me that he never wanted the brothers to end up in court," she wrote.
她写道:“Luigi 经常对我说,他不希望兄弟俩最终对簿公堂。”
Processed input: translate English to German: "Luigi often said to me that he never wanted the brothers to end up in court," she wrote.
处理后的输入:将英语翻译成德语:她写道:“Luigi 经常对我说,他不希望兄弟俩最终对簿公堂。”
Original target: "Luigi sagte oft zu mir, dass er nie wollte, dass die Brüder vor Gericht landen", schrieb sie.
"卢igi经常对我说,他从不希望兄弟们对簿公堂", 她写道。
Processed target: "Luigi sagte oft zu mir, dass er nie wollte, dass die Brüder vor Gericht landen", schrieb sie.
处理后的目标:“Luigi 经常对我说,他从来不想让兄弟们对簿公堂”,她写道。
D.17. WMT English to French
D.17. WMT 英语到法语
Original input: This image section from an infrared recording by the Spitzer telescope shows a "family portrait" of countless generations of stars: the oldest stars are seen as blue dots, while more difficult to identify are the pink-coloured "new-borns" in the star delivery room.
这张由斯皮策望远镜拍摄的红外图像显示了一个“家族肖像”,包含了无数代恒星:最老的恒星表现为蓝色点,而更难以识别的是粉色的“新生儿”位于恒星诞生室。
Processed input: translate English to French: This image section from an infrared recording by the Spitzer telescope shows a "family portrait" of countless generations of stars: the oldest stars are seen as blue dots, while more difficult to identify are the pink-coloured "new-borns" in the star delivery room.
此图像部分来自斯皮策望远镜的红外记录,显示了无数代恒星的“家庭肖像”:最老的恒星被看作蓝色的点,而更难以识别的是星诞生区中粉红色的“新生儿”。
Original target: Ce détail d’une photographie infrarouge prise par le télescope Spitzer montre un "portrait de famille" des innombrables générations d’étoiles: les plus vieilles étoiles sont en bleu et les points roses, plus difficiles à identifier, sont les "nouveau-nés" dans la salle d’accouchement de l’univers.
原始目标:这张由斯皮策望远镜拍摄的红外照片细节显示了无数代恒星的“家庭肖像”:最老的恒星呈蓝色,而较难识别的粉红色点则是宇宙产房中的“新生儿”。
Processed target: Ce détail d’une photographie infrarouge prise par le télescope Spitzer montre un "portrait de famille" des innombrables générations d’étoiles: les plus vieilles étoiles sont en bleu et les points roses, plus difficiles à identifier, sont les "nouveau-nés" dans la salle d’accouchement de l’univers.
处理后的目标:这张由斯皮策望远镜拍摄的红外照片细节显示了无数代恒星的“家庭肖像”:最老的恒星呈蓝色,而较难识别的粉点则是宇宙产房中的“新生儿”。
D.18. WMT English to Romanian
D.18. WMT 英语到罗马尼亚语
Original input: Taco Bell said it plans to add 2,000 locations in the US by 2022.
塔可钟表示计划到2022年在美国增加2,000个门店。
Processed input: translate English to Romanian: Taco Bell said it plans to add 2,000 locations in the US by 2022.
塔可钟表示,计划到 2022 年在美国增加 2,000 个门店。
Original target: Taco Bell a afirmat că, până în 2022, intenționează să deschidă 2000 de restaurante în SUA.
塔可钟宣布,截至 2022 年,计划在美国开设 2000 家餐厅。
Processed target: Taco Bell a afirmat că, până în 2022, intenționează să deschidă 2000 de restaurante în SUA.
处理后的目标:塔可钟表示,到 2022 年,打算在美国开设 2000 家餐厅。
Appendix E. Scores on Every Task for All Experiments
附录 E. 所有实验在每个任务上的得分
The following table lists the scores achieved on every task in the experiments described in Sections 3.2 to 3.6.
以下表格列出了在第 3.2 节至第 3.6 节描述的实验中每个任务取得的分数。
References
参考文献
Christian Buck, Kenneth Heafield, and Bas Van Ooyen. N-gram counts and language models from the common crawl. In LREC, 2014.
Christian Buck, Kenneth Heafield 和 Bas Van Ooyen. 来自 Common Crawl 的 N-gram 计数和语言模型。发表于 LREC,2014。
Rich Caruana. Multitask learning. Machine learning, 28(1), 1997.
Rich Caruana. 多任务学习 (Multitask learning). 机器学习, 28(1), 1997.
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017.
Romain Paulus, Caiming Xiong 和 Richard Socher. 一种用于生成式摘要的深度强化模型。arXiv 预印本 arXiv:1705.04304, 2017。
