[论文翻译]语言模型是无监督多任务学习者


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/GPT%E7%B3%BB%E5%88%97/gpt2-language_models_are_unsupervised_multitask_learners.pdf


Language Models are Unsupervised Multitask Learners

语言模型是无监督多任务学习者

Alec Radford * 1 Jeffrey Wu * 1 Rewon Child 1 David Luan 1 Dario Amodei ** 1 Ilya Sutskever *


Abstract

摘要

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the $127{,}000+$ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

自然语言处理任务,如问答、机器翻译、阅读理解和摘要生成,通常通过在特定任务的数据集上进行监督学习来解决。我们证明,当在一个名为WebText的数百万网页的新数据集上训练时,语言模型开始在没有明确监督的情况下学习这些任务。当给定一个文档和问题时,语言模型生成的答案在CoQA数据集上达到了55 F1分数——在不使用超过127,000个训练样本的情况下,匹配或超过了4个基线系统中的3个。语言模型的容量对于零样本任务迁移的成功至关重要,增加容量可以在任务中以对数线性方式提高性能。我们最大的模型GPT-2是一个拥有15亿参数的Transformer,在零样本设置下,在8个测试的语言建模数据集中的7个上达到了最先进的结果,但仍然对WebText欠拟合。模型的样本反映了这些改进,并包含连贯的文本段落。这些发现为构建从自然发生的演示中学习任务的语言处理系统提供了一条有希望的路径。

1. Introduction

1. 引言

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.

机器学习系统现在通过结合使用大规模数据集、高容量模型和监督学习,在它们被训练的任务上表现出色(Krizhevsky 等人,2012)(Sutskever 等人,2014)(Amodei 等人,2016)。然而,这些系统在面对数据分布的微小变化(Recht 等人,2018)和任务规范的调整(Kirkpatrick 等人,2017)时显得脆弱且敏感。当前的系统更适合被描述为狭窄的专家,而非胜任的通才。我们希望朝着更通用的系统发展,这些系统能够执行许多任务——最终无需为每个任务手动创建和标注训练数据集。

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

创建机器学习系统的主流方法是收集一个训练示例的数据集,展示所需任务的正确行为,训练系统模仿这些行为,然后在独立同分布(IID)的保留示例上测试其性能。这种方法在推动窄领域专家系统方面取得了良好进展。但字幕生成模型(Lake 等,2017)、阅读理解系统(Jia & Liang,2017)和图像分类器(Alcorn 等,2018)在多样性和各种可能输入上的表现常常不稳定,突显了这种方法的某些不足。

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018) to begin studying this.

我们怀疑,当前系统泛化能力不足的主要原因是在单一领域数据集上进行单一任务训练的普遍性。要在现有架构上实现鲁棒系统的进展,可能需要在广泛的领域和任务上进行训练和性能评估。最近,已经提出了几个基准测试,如 GLUE (Wang et al., 2018) 和 decaNLP (McCann et al., 2018),以开始研究这一问题。

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

多任务学习 (Caruana, 1997) 是一个有潜力提升整体性能的框架。然而,在自然语言处理 (NLP) 领域,多任务训练仍处于起步阶段。最近的研究报告了适度的性能提升 (Yogatama et al., 2019),而迄今为止最雄心勃勃的两项工作分别训练了总共 10 对和 17 对 (数据集, 目标) 组合 (McCann et al., 2018) (Bowman et al., 2018)。从元学习的角度来看,每个 (数据集, 目标) 组合都是从数据集和目标分布中采样的单个训练样本。当前的机器学习系统需要数百到数千个样本来诱导出泛化良好的函数。这表明,多任务训练可能需要同样多的有效训练对,才能在当前方法下实现其潜力。继续扩展数据集的创建和目标的设计,以当前技术强行达到所需程度,将非常困难。这促使我们探索更多用于执行多任务学习的设置。

The current best performing systems on language tasks utilize a combination of pre-training and supervised finetuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al., 2013) (Collobert et al., 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al., 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al., 2018) (Devlin et al., 2018).

当前在语言任务上表现最佳的系统结合了预训练和监督微调。这种方法历史悠久,趋势是向更灵活的迁移形式发展。首先,学习词向量并将其用作任务特定架构的输入 (Mikolov et al., 2013) (Collobert et al., 2011),然后迁移循环网络的上下文表示 (Dai & Le, 2015) (Peters et al., 2018),最近的研究表明,任务特定架构不再必要,迁移多个自注意力块就足够了 (Radford et al., 2018) (Devlin et al., 2018)。


Figure 1. Zero-shot task performance of WebText LMs as a function of model size on many NLP tasks. Reading Comprehension results are on CoQA (Reddy et al., 2018), translation on WMT-14 Fr-En (Artetxe et al., 2017), summarization on CNN and Daily Mail (See et al., 2017), and Question Answering on Natural Questions (Kwiatkowski et al., 2019). Section 3 contains detailed descriptions of each result.

图 1: WebText 大语言模型在不同 NLP 任务上的零样本任务性能随模型大小的变化。阅读理解结果基于 CoQA (Reddy et al., 2018),翻译结果基于 WMT-14 Fr-En (Artetxe et al., 2017),摘要结果基于 CNN 和 Daily Mail (See et al., 2017),问答结果基于 Natural Questions (Kwiatkowski et al., 2019)。第 3 节包含每个结果的详细描述。

These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017).

这些方法仍然需要监督训练来执行任务。当只有少量或没有监督数据可用时,另一项工作展示了语言模型在执行特定任务(如常识推理 (Schwartz et al., 2017) 和情感分析 (Radford et al., 2017))方面的潜力。

In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

在本文中,我们结合了这两条研究路线,并延续了更通用的迁移方法趋势。我们展示了语言模型能够在零样本设置下执行下游任务——无需任何参数或架构修改。我们通过强调语言模型在零样本设置下执行广泛任务的能力,展示了这种方法的潜力。根据任务的不同,我们取得了有前景、具有竞争力以及最先进的结果。

2. Approach

2. 方法

At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples $(x_{1},x_{2},...,x_{n})$ each composed of variable length sequences of symbols $\left(s_{1},s_{2},...,s_{n}\right)$ . Since language has a natural sequential ordering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities (Jelinek & Mercer, 1980) (Bengio et al., 2003):

我们方法的核心是语言建模。语言建模通常被定义为从一组示例 $(x_{1},x_{2},...,x_{n})$ 中进行无监督分布估计,每个示例由可变长度的符号序列 $\left(s_{1},s_{2},...,s_{n}\right)$ 组成。由于语言具有自然的顺序性,通常将符号的联合概率分解为条件概率的乘积 (Jelinek & Mercer, 1980) (Bengio et al., 2003):

$$
p(x)=\prod_{i=1}^{n}p(s_{i}|s_{1},...,s_{i-1})
$$

This approach allows for tractable sampling from and estimation of $p(x)$ as well as any conditionals of the form $p(s_{n-k},...,s_{n}|s_{1},...,s_{n-k-1})$ . In recent years, there have been significant improvements in the expressiveness of models that can compute these conditional probabilities, such as self-attention architectures like the Transformer (Vaswani et al., 2017).

这种方法允许从 $p(x)$ 以及任何形式为 $p(s_{n-k},...,s_{n}|s_{1},...,s_{n-k-1})$ 的条件概率中进行可处理的采样和估计。近年来,能够计算这些条件概率的模型的表达能力有了显著提升,例如 Transformer 等自注意力架构 (Vaswani et al., 2017)。

Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution $p(output|input)$. Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model $p(output|input,task)$. This has been variously formalized in multitask and meta-learning settings. Task conditioning is often implemented at an architectural level, such as the task specific encoders and decoders in (Kaiser et al., 2017) or at an algorithmic level such as the inner and outer loop optimization framework of MAML (Finn et al., 2017). But as exemplified in McCann et al. (2018), language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) demonstrated it was possible to train a single model, the MQAN, to infer and perform many different tasks on examples with this type of format.

学习执行单个任务可以在概率框架中表示为估计条件分布 $p(output|input)$。由于一个通用系统应该能够执行许多不同的任务,即使对于相同的输入,它不仅应该基于输入进行条件化,还应该基于要执行的任务进行条件化。也就是说,它应该建模 $p(output|input,task)$。这在多任务和元学习设置中已经以各种形式进行了形式化。任务条件化通常在架构级别实现,例如 (Kaiser et al., 2017) 中的任务特定编码器和解码器,或者在算法级别实现,例如 MAML (Finn et al., 2017) 的内外循环优化框架。但正如 McCann et al. (2018) 所展示的那样,语言提供了一种灵活的方式来将任务、输入和输出都指定为符号序列。例如,一个翻译训练示例可以写为序列 (translate to french, english text, french text)。同样,一个阅读理解训练示例可以写为 (answer the question, document, question, answer)。McCann et al. (2018) 证明了可以训练一个单一模型 MQAN,以推断并执行具有这种格式示例的许多不同任务。
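
下面是一段示意性的 Python 代码(并非论文的原始实现),展示如何按上述思路把任务描述、输入和输出拼接为同一个符号序列;其中的提示模板与函数名均为说明用的假设。

```python
# 示意代码:将不同任务统一表示为符号序列(提示)的一种可能写法。
# 提示模板(如 "translate to french")仅为说明性假设,并非论文给出的确切格式。

def format_task(task: str, *fields: str) -> str:
    """Join a task description and its inputs/outputs into one flat symbol sequence."""
    return " ".join((task,) + fields)

# 翻译训练样本: (translate to french, english text, french text)
translation_example = format_task(
    "translate to french",
    "The cat sat on the mat.",
    "Le chat s'est assis sur le tapis.",
)

# 阅读理解训练样本: (answer the question, document, question, answer)
qa_example = format_task(
    "answer the question",
    "GPT-2 is a 1.5B parameter Transformer language model.",
    "How many parameters does GPT-2 have?",
    "1.5 billion",
)

# 零样本推断时,只提供任务和输入,让语言模型续写输出部分。
zero_shot_prompt = format_task("translate to french", "Where is the library?")
print(zero_shot_prompt)
```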

Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al., 2015) are sidestepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches.

语言建模原则上也能够学习 McCann 等人 (2018) 的任务,而无需明确监督哪些符号是需要预测的输出。由于监督目标与非监督目标相同,但仅在序列的子集上进行评估,因此非监督目标的全局最小值也是监督目标的全局最小值。在这种稍微简化的设置中,(Sutskever 等人, 2015) 中讨论的关于密度估计作为原则性训练目标的担忧被回避了。问题反而变成了我们是否能够在实践中优化非监督目标以达到收敛。初步实验证实,足够大的语言模型能够在这种简化的设置中进行多任务学习,但学习速度比显式监督方法慢得多。

While it is a large step from the well-posed setup described above to the messiness of “language in the wild”, Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept – learning a QA task without a reward signal by using forward prediction of a teacher’s outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

虽然从上述定义明确的情境到“现实世界中的语言”的复杂性是一个巨大的跨越,但 Weston (2016) 在对话的背景下提出,需要开发能够直接从自然语言中学习的系统,并展示了一个概念验证——通过预测教师的输出来学习问答任务,而无需奖励信号。尽管对话是一种有吸引力的方法,但我们担心它过于局限。互联网包含了大量无需交互通信即可被动获取的信息。我们推测,具有足够能力的大语言模型将开始学习推断并执行自然语言序列中展示的任务,以便更好地预测它们,无论这些任务是如何获取的。如果大语言模型能够做到这一点,它实际上就是在进行无监督的多任务学习。我们通过分析大语言模型在零样本设置下在各种任务上的表现来测试这一点。

2.1. Training Dataset

2.1. 训练数据集

Most prior work trained language models on a single domain of text, such as news articles (Jozefowicz et al., 2016), Wikipedia (Merity et al., 2016), or fiction books (Kiros et al., 2015). Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.

大多数先前的工作在单一领域的文本上训练语言模型,例如新闻文章 (Jozefowicz et al., 2016)、维基百科 (Merity et al., 2016) 或小说书籍 (Kiros et al., 2015)。我们的方法鼓励构建尽可能大且多样化的数据集,以便在尽可能多的领域和上下文中收集任务的自然语言演示。

A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le (2018) used Common Crawl in their work on commonsense reasoning but noted a large amount of documents “whose content are mostly unintelligible”. We observed similar data issues in our initial experiments with Common Crawl.

一个多样且几乎无限的文本来源是诸如 Common Crawl 的网络抓取数据。虽然这些存档比当前的语言建模数据集大许多数量级,但它们存在显著的数据质量问题。Trinh & Le (2018) 在他们的常识推理工作中使用了 Common Crawl,但指出大量文档“内容大多难以理解”。我们在使用 Common Crawl 的初步实验中也观察到了类似的数据质量问题。

Table 1. Examples of naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set.

表 1. 在 WebText 训练集中发现的英语到法语和法语到英语翻译的自然示例。

“I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool].”
(“我不是世界上最聪明的人,但就像法语里说的:Je ne suis pas un imbecile[我不是傻瓜]。”)

“Brevet Sans Garantie Du Gouvernement”, translated to English: “Patented without government warranty”.
(“Brevet Sans Garantie Du Gouvernement”,翻译为英文:“Patented without government warranty”。)

Trinh & Le (2018)’s best results were achieved using a small subsample of Common Crawl which included only documents most similar to their target dataset, the Winograd Schema Challenge. While this is a pragmatic approach to improve performance on a specific task, we want to avoid making assumptions about the tasks to be performed ahead of time.

Trinh & Le (2018) 的最佳结果是通过使用 Common Crawl 的一个小子样本实现的,该子样本仅包含与他们的目标数据集 Winograd Schema Challenge 最相似的文档。虽然这是一种提高特定任务性能的实用方法,但我们希望避免提前对要执行的任务做出假设。

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

相反,我们创建了一个新的网页抓取,强调文档质量。为此,我们只抓取了经过人工筛选/过滤的网页。手动过滤整个网页抓取将非常昂贵,因此作为起点,我们从社交媒体平台 Reddit 抓取了所有获得至少 3 个 karma 的外链。这可以被视为其他用户是否认为该链接有趣、有教育意义,或仅仅是好笑的一种启发式指标。
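
作为示意,下面的 Python 片段勾勒了正文描述的启发式筛选流程(仅保留 karma 不低于 3 的出站链接并去重);其中的 posts 数据结构与阈值筛选写法均为说明性假设,并非 WebText 实际的构建脚本。

```python
# 示意代码:按照正文描述的启发式筛选出站链接(karma >= 3)。
# posts 的字段与取值均为假设示例。

MIN_KARMA = 3

posts = [
    {"url": "https://example.com/a-good-article", "karma": 57},
    {"url": "https://example.com/low-quality-page", "karma": 1},
    {"url": "https://example.com/another-link", "karma": 3},
]

def select_outbound_links(posts, min_karma=MIN_KARMA):
    """Keep only links whose submissions received at least `min_karma` karma, de-duplicated."""
    seen = set()
    kept = []
    for post in posts:
        url = post["url"]
        if post["karma"] >= min_karma and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

print(select_outbound_links(posts))  # 仅保留 karma >= 3 的去重链接
```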

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of $40\ \mathrm{GB}$ of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.

生成的 WebText 数据集包含了这 4500 万个链接的文本子集。为了从 HTML 响应中提取文本,我们结合使用了 Dragnet (Peters & Lecocq, 2013) 和 Newspaper 内容提取器。本文中展示的所有结果均使用了 WebText 的初步版本,该版本不包含 2017 年 12 月之后创建的链接,并且在去重和基于启发式的清理后,包含了略多于 800 万份文档,总计 $40\ \mathrm{GB}$ 的文本。我们从 WebText 中移除了所有维基百科文档,因为它是其他数据集的常见数据源,并且由于训练数据与测试评估任务的重叠,可能会使分析复杂化。

2.2. Input Representation

2.2. 输入表示

A general language model (LM) should be able to compute the probability of (and also generate) any string. Current large scale LMs include pre-processing steps such as lowercasing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings. While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement as exemplified in work such as Gillick et al. (2015), current byte-level LMs are not competitive with word-level LMs on large scale datasets such as the One Billion Word Benchmark (Al-Rfou et al., 2018). We observed a similar performance gap in our own attempts to train standard byte-level LMs on WebText.

通用语言模型 (LM) 应该能够计算(并生成)任何字符串的概率。当前的大规模语言模型包括预处理步骤,如小写化、Token化和处理词汇表外的Token,这些步骤限制了模型可处理的字符串范围。虽然将Unicode字符串作为UTF-8字节序列处理可以优雅地满足这一要求,如Gillick等人 (2015) 的工作所示,但当前的字节级语言模型在大规模数据集(如One Billion Word Benchmark (Al-Rfou等人, 2018))上无法与词级语言模型竞争。我们在尝试在WebText上训练标准字节级语言模型时也观察到了类似的性能差距。

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. These implementations would require including the full space of Unicode symbols in order to model all Unicode strings. This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. However, directly applying BPE to the byte sequence results in suboptimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.

字节对编码 (Byte Pair Encoding, BPE) (Sennrich et al., 2015) 是一种介于字符级和词级语言建模之间的实用方法,它有效地在频繁符号序列的词级输入和不频繁符号序列的字符级输入之间进行插值。尽管其名为字节对编码,但参考的 BPE 实现通常操作的是 Unicode 码点而非字节序列。这些实现需要包含完整的 Unicode 符号空间才能对所有 Unicode 字符串进行建模。这会导致在添加任何多符号 Token 之前,基础词汇表的大小超过 130,000。与通常使用 BPE 的 32,000 到 64,000 Token 词汇表相比,这显然过大。相比之下,字节级的 BPE 只需要一个大小为 256 的基础词汇表。然而,由于 BPE 使用基于贪婪频率的启发式方法来构建 Token 词汇表,直接将其应用于字节序列会导致次优的合并。我们观察到 BPE 包含了许多常见单词的多个版本,例如 dog,因为它们以多种变体出现,如 dog.dog!dog?。这导致有限词汇表槽位和模型容量的次优分配。为了避免这种情况,我们阻止 BPE 在任何字节序列中跨字符类别进行合并。我们为空格添加了一个例外,这显著提高了压缩效率,同时仅在最少的词汇 Token 中增加了单词的碎片化。
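
下面是一个高度简化的 Python 示意,演示字节级 BPE 的单步合并,并按正文描述禁止跨字符类别的合并、仅对空格保留例外;其中的类别划分与合并细节是说明性假设,并非 GPT-2 实际分词器的实现。

```python
# 示意代码:字节级 BPE 的单步合并,禁止跨字符类别合并(空格作为例外,允许并入后续单词)。

from collections import Counter

def category(b: bytes) -> str:
    ch = b.decode("utf-8", errors="replace")
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch == " ":
        return "space"
    return "other"

def can_merge(left: bytes, right: bytes) -> bool:
    """Disallow merges across categories, except a space merging into a following letter."""
    lcat, rcat = category(left[-1:]), category(right[:1])
    if lcat == "space" and rcat == "letter":
        return True  # 空格例外:提高压缩效率
    return lcat == rcat

def bpe_merge_step(tokens: list[bytes]) -> list[bytes]:
    """One greedy BPE step: merge the most frequent allowed adjacent pair."""
    pairs = Counter(
        (tokens[i], tokens[i + 1])
        for i in range(len(tokens) - 1)
        if can_merge(tokens[i], tokens[i + 1])
    )
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = [bytes([x]) for x in "dog. dog! dog?".encode("utf-8")]
for _ in range(6):
    tokens = bpe_merge_step(tokens)
print(tokens)  # "dog" 的字节会倾向于合并,而不会与标点 "." "!" "?" 合并
```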

This input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.

这种输入表示方式使我们能够将词级大语言模型的经验优势与字节级方法的通用性结合起来。由于我们的方法可以为任何 Unicode 字符串分配概率,因此我们可以在任何数据集上评估大语言模型,而无需考虑预处理、Token 化或词汇表大小。

2.3. Model

2.3. 模型

We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016), and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.

我们使用基于 Transformer (Vaswani et al., 2017) 的架构来构建我们的语言模型。该模型主要遵循 OpenAI GPT 模型 (Radford et al., 2018) 的细节,并进行了少量修改。层归一化 (Ba et al., 2016) 被移动到每个子块的输入处,类似于预激活残差网络 (He et al., 2016),并且在最终的自注意力块之后添加了额外的层归一化。我们使用了一种修改后的初始化方法,该方法考虑了模型深度对残差路径累积的影响。我们在初始化时将残差层的权重缩放为 $1/\sqrt{N}$,其中 $N$ 是残差层的数量。词汇表扩展到 50,257。我们还将上下文大小从 512 个 token 增加到 1024 个 token,并使用更大的批量大小 512。

Table 2. Architecture hyperparameters for the 4 model sizes.

Parameters Layers d_model
117M 12 768
345M 24 1024
762M 36 1280
1542M 48 1600

表 2: 4 种模型规模的架构超参数
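
作为对上述修改的示意,下面的 PyTorch 片段展示了把层归一化放在子块输入处的残差块,以及按 $1/\sqrt{N}$ 缩放残差层初始权重的做法;模块结构为简化假设,并非 GPT-2 的完整模型定义。

```python
# 示意代码(PyTorch):预激活式残差块与 1/sqrt(N) 残差权重缩放初始化。

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormMLPBlock(nn.Module):
    """A residual sub-block: x + MLP(LayerNorm(x)), i.e. layer norm at the sub-block input."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.fc = nn.Linear(d_model, 4 * d_model)
        self.proj = nn.Linear(4 * d_model, d_model)  # residual projection layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(F.gelu(self.fc(self.ln(x))))

def scale_residual_init(blocks: nn.ModuleList) -> None:
    """Scale residual-projection weights by 1/sqrt(N), N = number of residual layers."""
    n_residual = len(blocks)
    for block in blocks:
        with torch.no_grad():
            block.proj.weight.mul_(1.0 / math.sqrt(n_residual))

d_model, n_layers = 768, 12  # 与最小模型(117M)规模一致的示例设置
blocks = nn.ModuleList([PreNormMLPBlock(d_model) for _ in range(n_layers)])
scale_residual_init(blocks)

x = torch.randn(2, 16, d_model)  # (batch, sequence, d_model)
for block in blocks:
    x = block(x)
final_ln = nn.LayerNorm(d_model)  # 对应正文中最后添加的额外层归一化(此处仅示意)
print(final_ln(x).shape)
```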

3. Experiments

3. 实验

We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a $5%$ held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.

我们训练并基准测试了四个大小近似对数均匀分布的大语言模型。架构总结如表 2 所示。最小的模型相当于原始的 GPT,第二小的模型相当于 BERT (Devlin et al., 2018) 中最大的模型。我们最大的模型称为 GPT-2,其参数量比 GPT 多一个数量级。每个模型的学习率都经过手动调整,以在 WebText 的 $5%$ 保留样本上获得最佳困惑度。所有模型仍然对 WebText 欠拟合,且随着训练时间的增加,保留样本的困惑度仍在改善。

3.1. Language Modeling

3.1. 语言建模

As an initial step towards zero-shot task transfer, we are interested in understanding how WebText LMs perform at zero-shot domain transfer on the primary task they are trained for – language modeling. Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark. Results on language modeling datasets are commonly reported in a quantity which is a scaled or exponentiated version of the average negative log probability per canonical prediction unit - usually a character, a byte, or a word. We evaluate the same quantity by computing the log-probability of a dataset according to a WebText LM and dividing by the number of canonical units. For many of these datasets, WebText LMs would be tested significantly out-of-distribution, having to predict aggressively standardized text, tokenization artifacts such as disconnected punctuation and contractions, shuffled sentences, and even the string ${<}\mathrm{UNK}{>}$ which is extremely rare in WebText - occurring only 26 times in 40 billion bytes. We report our main results in Table 3 using invertible de-tokenizers which remove as many of these tokenization / pre-processing artifacts as possible. Since these de-tokenizers are invertible, we can still calculate the log probability of a dataset and they can be thought of as a simple form of domain adaptation. We observe gains of 2.5 to 5 perplexity for GPT-2 with these de-tokenizers.

作为零样本任务迁移的初步尝试,我们感兴趣的是了解 WebText 大语言模型在其主要训练任务——语言建模——上的零样本领域迁移表现。由于我们的模型在字节级别上运行,不需要有损预处理或 Token 化,因此我们可以在任何语言模型基准上对其进行评估。语言建模数据集的结果通常以每个标准预测单元(通常是一个字符、一个字节或一个词)的平均负对数概率的缩放或指数版本的形式报告。我们通过根据 WebText 大语言模型计算数据集的对数概率,并将其除以标准单元的数量来评估相同的量。对于许多这些数据集,WebText 大语言模型将在显著偏离分布的情况下进行测试,必须预测高度标准化的文本、Token 化伪影(如断开的标点符号和缩写)、打乱的句子,甚至是字符串 ${<}\mathrm{UNK}{>}$,这在 WebText 中极为罕见——在 400 亿字节中仅出现 26 次。我们在表 3 中使用可逆的去 Token 化器报告了主要结果,这些去 Token 化器尽可能多地移除了这些 Token 化/预处理伪影。由于这些去 Token 化器是可逆的,我们仍然可以计算数据集的对数概率,并且它们可以被视为一种简单的领域适应形式。我们观察到,使用这些去 Token 化器后,GPT-2 的困惑度改善了 2.5 到 5。
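
下面的 Python 片段示意了如何把数据集上的总负对数似然换算为按词、字节或字符归一的指标(困惑度、BPB、BPC),以及可逆去 Token 化的基本思路;其中的替换规则与数值均为假设示例,并非论文使用的确切规则集。

```python
# 示意代码:总负对数似然(自然对数)到各种标准单元指标的换算,以及一个极简的可逆去 Token 化器。

import math

def detokenize(text: str) -> str:
    """Undo a few common PTB-style tokenization artifacts; each rule is invertible."""
    rules = [(" n't", "n't"), (" 's", "'s"), (" ,", ","), (" .", "."), ("<unk>", "<UNK>")]
    for src, dst in rules:
        text = text.replace(src, dst)
    return text

def metrics(total_neg_log_prob_nats: float, n_words: int, n_bytes: int, n_chars: int):
    """Convert a summed negative log-probability into perplexity / bits-per-byte / bits-per-char."""
    return {
        "word_perplexity": math.exp(total_neg_log_prob_nats / n_words),
        "bits_per_byte": total_neg_log_prob_nats / (n_bytes * math.log(2)),
        "bits_per_char": total_neg_log_prob_nats / (n_chars * math.log(2)),
    }

# 假设某评测集共 245,000 个词、5.2M 字节、5.0M 字符,模型给出的总 NLL 为 760,000 nats:
print(metrics(760_000.0, 245_000, 5_200_000, 5_000_000))
```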


Table 3. Zero-shot results on many datasets. No training or fine-tuning was performed for any of these results. PTB and WikiText-2 results are from (Gong et al., 2018). CBT results are from (Bajgar et al., 2016). LAMBADA accuracy result is from (Hoang et al., 2018) and LAMBADA perplexity result is from (Grave et al., 2016). Other results are from (Dai et al., 2019).


LAMBADA (PPL) LAMBADA (ACC) CBT-CN (ACC) CBT-NE (ACC) WikiText2 (PPL) PTB (PPL) enwik8 (BPB) text8 (BPC) WikiText103 (PPL) 1BW (PPL)
SOTA 99.8 59.23 85.7 82.3 39.14 46.54 0.99 1.08 18.3 21.8
117M 35.13 45.99 87.65 83.4 29.41 65.85 1.16 1.17 37.50 75.20
345M 15.60 55.48 92.35 87.1 22.76 47.33 1.01 1.06 26.37 55.72
762M 10.87 60.12 93.45 88.0 19.93 40.31 0.97 1.02 22.05 44.575
1542M 8.63 63.24 93.30 89.05 18.34 35.76 0.93 0.98 17.48 42.16

表 3: 多个数据集上的零样本结果。这些结果均未进行任何训练或微调。PTB 和 WikiText-2 的结果来自 (Gong et al., 2018)。CBT 的结果来自 (Bajgar et al., 2016)。LAMBADA 准确率结果来自 (Hoang et al., 2018),LAMBADA 困惑度结果来自 (Grave et al., 2016)。其他结果来自 (Dai et al., 2019)。

WebText LMs transfer well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting. Large improvements are noticed on small datasets such as Penn Treebank and WikiText-2 which have only 1 to 2 million training tokens. Large improvements are also noticed on datasets created to measure long-term dependencies like LAMBADA (Paperno et al., 2016) and the Children’s Book Test (Hill et al., 2015). Our model is still significantly worse than prior work on the One Billion Word Benchmark (Chelba et al., 2013). This is likely due to a combination of it being both the largest dataset and having some of the most destructive pre-processing - 1BW’s sentence level shuffling removes all long-range structure.

WebText 语言模型在不同领域和数据集上表现出色,在零样本设置下,8 个数据集中有 7 个达到了当前最佳水平。在小型数据集如 Penn Treebank 和 WikiText-2 上(这些数据集仅有 100 万到 200 万的训练 token),模型表现显著提升。在用于衡量长期依赖性的数据集上,如 LAMBADA (Paperno et al., 2016) 和儿童图书测试 (Hill et al., 2015),模型也有显著提升。然而,我们的模型在 One Billion Word Benchmark (Chelba et al., 2013) 上的表现仍显著落后于之前的工作。这可能是由于该数据集规模最大,并且进行了最具破坏性的预处理——1BW 的句子级别打乱破坏了所有长程结构。


3.2. Children’s Book Test

3.2. 儿童图书测试

Figure 2. Performance on the Children’s Book Test as a function of model capacity. Human performance is from Bajgar et al. (2016), instead of the much lower estimates from the original paper.

图 2: 儿童图书测试性能随模型容量的变化。人类表现数据来自 Bajgar 等人 (2016),而非原论文中低得多的估计值。

The Children’s Book Test (CBT) (Hill et al., 2015) was created to examine the performance of LMs on different categories of words: named entities, nouns, verbs, and prepositions. Rather than reporting perplexity as an evaluation metric, CBT reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct. Following the LM approach introduced in the original paper, we compute the probability of each choice and the rest of the sentence conditioned on this choice according to the LM, and predict the one with the highest probability. As seen in Figure 2 performance steadily improves as model size is increased and closes the majority of the gap to human performance on this test. Data overlap analysis showed one of the CBT test set books, The Jungle Book by Rudyard Kipling, is in WebText, so we report results on the validation set which has no significant overlap. GPT-2 achieves new state of the art results of $93.3%$ on common nouns and $89.1%$ on named entities. A de-tokenizer was applied to remove PTB style tokenization artifacts from CBT.

儿童图书测试 (CBT) (Hill et al., 2015) 旨在检验大语言模型在不同类别词汇上的表现:命名实体、名词、动词和介词。CBT 不报告困惑度作为评估指标,而是报告在自动构建的完形填空测试中的准确率,任务是从10个可能的选项中选择被省略的正确单词。按照原始论文中引入的大语言模型方法,我们根据模型计算每个选项及句子其余部分的条件概率,并预测概率最高的选项。如图2所示,随着模型规模的增加,性能稳步提升,并缩小了与人类在该测试中表现的大部分差距。数据重叠分析显示,CBT测试集中的一本书《丛林之书》由 Rudyard Kipling 所著,存在于 WebText 中,因此我们报告了在验证集上的结果,该验证集没有显著重叠。GPT-2 在普通名词上取得了 $93.3%$ 的最新成果,在命名实体上取得了 $89.1%$ 的成果。应用了去 Token 化器以去除 CBT 中的 PTB 风格 Token 化伪影。
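
下面用一段示意性的 Python 代码说明这种完形填空打分方式:把每个候选词代入空缺,再用语言模型对补全后的文本打分,取分数最高者;score_with_lm 只是一个假设的占位接口,并非真实模型调用。

```python
# 示意代码:用语言模型概率为完形填空的 10 个候选词打分,选概率最高者。

def score_with_lm(text: str) -> float:
    """Placeholder for a language model's log-probability of `text` (higher is better)."""
    return -len(text)  # 仅为演示接口;真实实现应返回语言模型给出的对数概率

def answer_cloze(context: str, sentence_with_blank: str, choices: list[str]) -> str:
    scores = {}
    for choice in choices:
        filled = sentence_with_blank.replace("XXXXX", choice)
        scores[choice] = score_with_lm(context + " " + filled)
    return max(scores, key=scores.get)

context = "Mowgli walked through the jungle with Bagheera."
sentence = "Then XXXXX climbed the old tree."
choices = ["Mowgli", "river", "tree", "quickly"]
print(answer_cloze(context, sentence, choices))
```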

3.3. LAMBADA

3.3. LAMBADA

The LAMBADA dataset (Paperno et al., 2016) tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict. GPT-2 improves the state of the art from 99.8 (Grave et al., 2016) to 8.6 perplexity and increases the accuracy of LMs on this test from $19%$ (Dehghani et al., 2018) to $52.66%$ . Investigating GPT-2’s errors showed most predictions are valid continuations of the sentence, but are not valid final words. This suggests that the LM is not using the additional useful constraint that the word must be the final of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy to $63.24%$ , improving the overall state of the art on this task by $4%$ . The previous state of the art (Hoang et al., 2018) used a different restricted prediction setting where the outputs of the model were constrained to only words that appeared in the context. For GPT-2, this restriction is harmful rather than helpful since $19%$ of answers are not in context. We use a version of the dataset without preprocessing.

LAMBADA 数据集 (Paperno et al., 2016) 测试系统在文本中建模长距离依赖关系的能力。任务要求预测句子的最后一个词,这些句子需要至少 50 个 Token 的上下文才能让人成功预测。GPT-2 将最先进水平从 99.8 (Grave et al., 2016) 提高到 8.6 的困惑度,并将大语言模型在此测试中的准确率从 $19%$ (Dehghani et al., 2018) 提高到 $52.66%$。对 GPT-2 的错误进行分析表明,大多数预测是句子的有效延续,但不是有效的最后一个词。这表明大语言模型没有利用到该词必须是句子最后一个词这一额外有用的约束。添加一个停用词过滤器作为近似,进一步将准确率提高到 $63.24%$,使该任务的整体最先进水平提高了 $4%$。之前的最先进水平 (Hoang et al., 2018) 使用了不同的受限预测设置,其中模型的输出被限制为仅出现在上下文中的词。对于 GPT-2 来说,这种限制是有害的,因为 $19%$ 的答案不在上下文中。我们使用未经预处理的版本的数据集。
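
下面的小段 Python 代码示意了正文提到的停用词过滤:在模型给出的下一词候选中剔除停用词后再取概率最高者,作为“该词必须是句末词”这一约束的近似;候选词及其分数均为假设数据,真实实现应来自语言模型的下一词分布。

```python
# 示意代码:预测句子最后一个词时的停用词过滤。

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "was", "he", "she", "it"}

def predict_final_word(candidates: dict[str, float]) -> str:
    """Pick the highest-probability candidate that is not a stop word."""
    filtered = {w: p for w, p in candidates.items() if w.lower() not in STOP_WORDS}
    pool = filtered or candidates  # 若全部被过滤则退回原始候选
    return max(pool, key=pool.get)

candidates = {"the": 0.21, "door": 0.18, "was": 0.12, "window": 0.09}  # 假设的下一词分布
print(predict_final_word(candidates))  # -> "door"
```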


3.4. Winograd Schema Challenge

3.4. Winograd Schema 挑战

Figure 3. Performance on the Winograd Schema Challenge as a function of model capacity.

图 3: Winograd Schema Challenge 上模型性能随模型容量的变化。

The Winograd Schema challenge (Levesque et al., 2012) was constructed to measure the capability of a system to perform commonsense reasoning by measuring its ability to resolve ambiguities in text. Recently Trinh & Le (2018) demonstrated significant progress on this challenge using LMs, by predicting the resolution of the ambiguity with higher probability. We follow their problem formulation and visualize the performance of our models with both full and partial scoring techniques in Figure 3. GPT-2 improves state of the art accuracy by $7%$ , achieving $70.70%$ . The dataset is quite small with only 273 examples so we recommend reading Trichelair et al. (2018) to help contextualize this result.

Winograd Schema挑战(Levesque等人,2012)旨在通过测量系统解决文本中歧义的能力来评估其进行常识推理的能力。最近,Trinh & Le(2018)通过使用大语言模型(LMs)以更高的概率预测歧义的解决,展示了在这一挑战上的显著进展。我们遵循他们的问题表述,并在图3中展示了我们模型在完整和部分评分技术下的性能。GPT-2将最先进的准确率提高了7%,达到了70.70%。该数据集非常小,只有273个示例,因此我们建议阅读Trichelair等人(2018)的研究,以帮助理解这一结果。

3.5. Reading Comprehension

3.5. 阅读理解

The Conversation Question Answering dataset (CoQA) Reddy et al. (2018) consists of documents from 7 different domains paired with natural language dialogues between a question asker and a question answerer about the document. CoQA tests reading comprehension capabilities and also the ability of models to answer questions that depend on conversation history (such as “Why?”).

对话问答数据集 (CoQA) Reddy 等人 (2018) 由来自 7 个不同领域的文档组成,这些文档与提问者和回答者之间关于文档的自然语言对话配对。CoQA 测试阅读理解能力以及模型回答依赖于对话历史的问题(例如“为什么?”)的能力。

Greedy decoding from GPT-2 when conditioned on a document, the history of the associated conversation, and a final token A: achieves 55 F1 on the development set. This matches or exceeds the performance of 3 out of 4 baseline systems without using the $127{,}000+$ manually collected question answer pairs those baselines were trained on. The supervised SOTA, a BERT based system (Devlin et al., 2018), is nearing the 89 F1 performance of humans. While GPT-2’s performance is exciting for a system without any supervised training, some inspection of its answers and errors suggests GPT-2 often uses simple retrieval based heuristics such as answer with a name from the document in response to a who question.

当以文档、相关对话历史以及最后一个 Token A: 作为条件对 GPT-2 进行贪婪解码时,其在开发集上达到了 55 F1 的分数。这一表现与 4 个基线系统中的 3 个相当或更好,且未使用这些基线系统训练时所依赖的超过 127,000 条人工收集的问答对。监督学习的 SOTA,一个基于 BERT 的系统 (Devlin et al., 2018),正在接近人类 89 F1 的表现。尽管对于一个没有任何监督训练的系统来说,GPT-2 的表现令人兴奋,但对其答案和错误的一些检查表明,GPT-2 经常使用简单的基于检索的启发式方法,例如在回答“who”类问题时用文档中出现的名字作答。

Table 4. Summarization performance as measured by ROUGE F1 metrics on the CNN and Daily Mail dataset. Bottom-Up Sum is the SOTA model from (Gehrmann et al., 2018)

R-1 R-2 R-L R-AVG
Bottom-Up Sum 41.22 18.68 38.34 32.75
Lede-3 40.38 17.66 36.62 31.55
Seq2Seq+Attn 31.33 11.81 28.83 23.99
GPT-2 TL;DR: 29.34 8.27 26.58 21.40
Random-3 28.78 8.63 25.52 20.98
GPT-2 no hint 21.58 4.03 19.47 15.03

表 4. 在 CNN 和 Daily Mail 数据集上通过 ROUGE F1 指标衡量的摘要性能。Bottom-Up Sum 是来自 (Gehrmann et al., 2018) 的 SOTA 模型。

3.6. Summarization

3.6. 摘要

We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset (Nallapati et al., 2016). To induce summarization behavior we add the text TL;DR: after the article and generate 100 tokens with Top-$k$ random sampling (Fan et al., 2018) with $k=2$ which reduces repetition and encourages more abstractive summaries than greedy decoding. We use the first 3 generated sentences in these 100 tokens as the summary. While qualitatively the generations resemble summaries, as shown in Table 14, they often focus on recent content from the article or confuse specific details such as how many cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly reported ROUGE 1,2,L metrics the generated summaries only begin to approach the performance of classic neural baselines and just barely outperforms selecting 3 random sentences from the article. GPT-2’s performance drops by 6.4 points on the aggregate metric when the task hint is removed which demonstrates the ability to invoke task specific behavior in a language model with natural language.

我们测试了 GPT-2 在 CNN 和 Daily Mail 数据集 (Nallapati et al., 2016) 上进行摘要生成的能力。为了引导摘要行为,我们在文章后添加文本 TL;DR:,并使用 Top $k$ 随机采样 (Fan et al., 2018) 生成 100 个 Token,其中 $k=2$,这减少了重复并鼓励生成比贪婪解码更抽象的摘要。我们使用这 100 个 Token 中生成的前 3 个句子作为摘要。虽然生成的摘要从质量上类似于摘要,如表 14 所示,但它们通常关注文章中的近期内容,或者混淆了具体细节,例如车祸中涉及多少辆车,或者标志是在帽子还是衬衫上。在常用的 ROUGE 1,2,L 指标上,生成的摘要仅开始接近经典神经基线的性能,并且仅略微优于从文章中随机选择 3 个句子的表现。当移除任务提示时,GPT-2 在综合指标上的表现下降了 6.4 分,这表明了通过自然语言在语言模型中调用特定任务行为的能力。
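
下面用 Python 示意 Top-$k$($k=2$)随机采样的核心步骤,以及在文章末尾附加 TL;DR: 提示的做法;其中的 logits 为假设的模型输出,并非真实解码器实现。

```python
# 示意代码:top-k(k=2)随机采样与 "TL;DR:" 任务提示。

import math
import random

def top_k_sample(logits: dict[str, float], k: int = 2) -> str:
    """Sample the next token from the k highest-scoring candidates (softmax over those k)."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [math.exp(score) for _, score in top]
    return random.choices(tokens, weights=weights, k=1)[0]

article = "A severe storm hit the coast on Monday, flooding several towns ..."
prompt = article + "\nTL;DR:"  # 在文章后附加任务提示,诱导摘要行为

next_token_logits = {"The": 2.1, "A": 1.9, "storm": 1.2, "banana": -3.0}  # 假设的模型输出
print(prompt.splitlines()[-1])                 # 'TL;DR:' 提示行
print(top_k_sample(next_token_logits, k=2))    # 从 top-2 候选中随机采样下一个 token
```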

3.7. Translation

3.7. 翻译

We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format english sentence $=$ french sentence and then after a final prompt of english sentence $=$ we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon inferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019). Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step. In order to confirm this, we ran a byte-level language detector on WebText which detected only 10MB of data in the French language which is approximately $500\mathrm{x}$ smaller than the monolingual French corpus common in prior unsupervised machine translation research.

我们测试了 GPT-2 是否已经开始学习如何从一种语言翻译到另一种语言。为了帮助模型推断出这是期望的任务,我们以格式为“英语句子 $=$ 法语句子”的示例对作为上下文条件,然后在最后的提示“英语句子 $=$ ”之后,使用贪婪解码从模型中采样,并将生成的第一个句子作为翻译。在 WMT-14 英法测试集上,GPT-2 获得了 5 BLEU 分,略低于在无监督词翻译研究中推断出的双语词典逐词替换的结果 (Conneau et al., 2017b)。在 WMT-14 法英测试集上,GPT-2 能够利用其非常强大的英语语言模型表现显著更好,获得了 11.5 BLEU 分。这优于 (Artetxe et al., 2017) 和 (Lample et al., 2017) 中的几个无监督机器翻译基线,但仍远低于当前最佳无监督机器翻译方法 (Artetxe et al., 2019) 的 33.5 BLEU 分。这一任务的表现令我们感到惊讶,因为我们特意从 WebText 中移除了非英语网页作为过滤步骤。为了确认这一点,我们在 WebText 上运行了一个字节级语言检测器,检测到仅有 10MB 的法语数据,这比之前无监督机器翻译研究中常见的单语法语语料库小约 $500\mathrm{x}$。

Table 5. The 30 most confident answers generated by GPT-2 on the development set of Natural Questions sorted by their probability according to GPT-2. None of these questions appear in WebText according to the procedure described in Section 4.

表 5. GPT-2 在 Natural Questions 开发集上生成的 30 个最自信的答案,按 GPT-2 的概率排序。根据第 4 节描述的程序,这些问题均未出现在 WebText 中。

问题 生成的答案 正确性 概率
Who wrote the book the origin of species? Charles Darwin 83.4%
Who is the founder of the ubuntu project? Mark Shuttleworth 82.0%
Who is the quarterback for the green bay packers? Aaron Rodgers < 81.1%
Panda is a national animal of which country? China 76.8%
Who came up with the theory of relativity? Albert Einstein 76.4%
When was the first star wars film released? 1977 < 71.4%
What is the most common blood type in sweden? A 70.6%
Who is regarded as the founder of psychoanalysis? Sigmund Freud 69.3%
Who took the first steps on the moon in 1969? Neil Armstrong 66.8%
Who is the largest supermarket chain in the uk? Tesco 65.3%
What is the meaning of shalom in english? peace 64.0%
Who was the author of the art of war? Sun Tzu 59.6%
Largest state in the us by land mass? California 59.2%
Green algae is an example of which type of reproduction? parthenogenesis 56.5%
Vikramsamvat calender is official in which country? India 55.6%
Who is mostly responsible for writing the declaration of independence? Thomas Jefferson < 53.3%
What us state forms the western boundary of montana? Montana 52.3%
Who plays ser davos in game of thrones? Peter Dinklage 52.1%
Who appoints the chair of the federal reserve system? Janet Yellen 51.5%
State the process that divides one nucleus into two genetically identical nuclei? mitosis 50.7%
Who won the most mvp awards in the nba? Michael Jordan 50.2%
What river is associated with the city of rome? the Tiber 48.6%
Who is the first president to be impeached? Andrew Johnson 48.3%
Who is the head of the department of homeland security 2017? John Kelly 47.0%
What is the name given to the common currency to the european union? Euro 46.8%
What was the emperor name in star wars? Palpatine 46.5%
Do you have to have a gun permit to shoot at a range? No 46.4%
Who proposed evolution in 1859 as the basis of biological development? Charles Darwin 45.7%
Nuclear power plant that blew up in russia? Chernobyl 45.7%
Who played john connor in the original terminator? Arnold Schwarzenegger 45.2%

3.8. Question Answering

3.8. 问答

A potential way to test what information is contained within a language model is to evaluate how often it generates the correct answer to factoid-style questions. Previous showcasing of this behavior in neural systems where all information is stored in parameters such as A Neural Conversational Model (Vinyals & Le, 2015) reported qualitative results due to the lack of high-quality evaluation datasets. The recently introduced Natural Questions dataset (Kwiatkowski et al., 2019) is a promising resource to test this more quantitatively. Similar to translation, the context of the language model is seeded with example question answer pairs which helps the model infer the short answer style of the dataset. GPT-2 answers $4.1%$ of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD. As a comparison point, the smallest model does not exceed the $1.0%$ accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc...). GPT-2 answers 5.3 times more questions correctly, suggesting that model capacity has been a major factor in the poor performance of neural systems on this kind of task as of yet. The probability GPT-2 assigns to its generated answers is well calibrated and GPT-2 has an accuracy of $63.1%$ on the $1%$ of questions it is most confident in. The 30 most confident answers generated by GPT-2 on development set questions are shown in Table 5. The performance of GPT-2 is still much, much, worse than the 30 to $50%$ range of open domain question answering systems which hybridize information retrieval with extractive document question answering (Alberti et al., 2019).

测试语言模型中包含哪些信息的一种潜在方法是评估其对事实类问题生成正确答案的频率。之前在将所有信息存储于参数中的神经网络系统(例如 A Neural Conversational Model (Vinyals & Le, 2015))中展示这种行为时,由于缺乏高质量评估数据集,只能报告定性结果。最近引入的 Natural Questions 数据集 (Kwiatkowski et al., 2019) 是一个有前景的资源,可以更定量地测试这一点。与翻译类似,大语言模型的上下文被植入了示例问答对,这有助于模型推断数据集的简短答案风格。在阅读理解数据集(如 SQuAD)常用的精确匹配指标下,GPT-2 正确回答了 $4.1%$ 的问题。作为对比,最小的模型没有超过一个极其简单的基线的 $1.0%$ 准确率,该基线返回每种问题类型(谁、什么、哪里等)的最常见答案。GPT-2 正确回答的问题数量是其 5.3 倍,这表明模型容量是迄今为止神经网络系统在此类任务上表现不佳的主要因素。GPT-2 为其生成的答案分配的概率校准良好,并且在其最有信心的 $1%$ 问题上,GPT-2 的准确率为 $63.1%$。表 5 展示了 GPT-2 在开发集问题上生成的 30 个最有信心的答案。GPT-2 的表现仍然远远低于结合信息检索与抽取式文档问答的开放域问答系统的 30 到 $50%$ 范围 (Alberti et al., 2019)。
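
作为参考,下面给出 exact match 指标的一种简化 Python 实现(SQuAD 风格的归一化:小写、去标点、去冠词、压缩空白);具体归一化细节应以官方评测脚本为准,此处仅为示意。

```python
# 示意代码:问答评测中 exact match 指标的简化实现。

import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)

print(exact_match("Charles Darwin", ["Charles Darwin"]))   # True
print(exact_match("the Tiber", ["Tiber", "Tiber River"]))  # True(冠词被归一化去掉)
print(exact_match("California", ["Alaska"]))               # False
```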

Table 6. Percentage of test set 8-grams overlapping with training sets.

表 6. 测试集 8-gram 与训练集重叠的百分比。

PTB WikiText-2 enwik8 text8 Wikitext-103 1BW
Dataset train 2.67% 0.66% 7.50% 2.34% 9.09% 13.19%
WebText train 0.88% 1.63% 6.31% 3.94% 2.42% 3.75%

4. Generalization vs Memorization

4. 泛化与记忆

Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has $3.3%$ overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems. As the size of datasets increases this issue becomes increasingly likely which suggests a similar phenomena could be happening with WebText. Therefore it is important to analyze how much test data also shows up in the training data.

最近的计算机视觉研究表明,常见的图像数据集中包含相当数量的近似重复图像。例如,CIFAR-10 的训练集和测试集之间有 3.3% 的重叠 (Barz & Denzler, 2019)。这导致机器学习系统的泛化性能被高估。随着数据集规模的增加,这一问题变得愈发可能,这表明 WebText 可能也存在类似现象。因此,分析测试数据中有多少也出现在训练数据中是非常重要的。

To study this we created Bloom filters containing 8-grams of WebText training set tokens. To improve recall, strings were normalized to contain only lower-cased alphanumeric words with a single space as a delimiter. The Bloom filters were constructed such that the false positive rate is upper bounded by $\textstyle{\frac{1}{10^{8}}}$ . We further verified the low false positive rate by generating 1M strings, of which zero were found by the filter.

为了研究这一点,我们创建了包含 WebText 训练集 Token 的 8-gram 的布隆过滤器 (Bloom filter)。为了提高召回率,字符串被归一化为仅包含小写字母数字单词,并以单个空格作为分隔符。布隆过滤器的构建使得误报率的上限为 $\textstyle{\frac{1}{10^{8}}}$。我们通过生成 100 万个字符串进一步验证了低误报率,其中过滤器未发现任何误报。
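
下面的 Python 片段示意了这一重叠分析的流程:将文本归一化为小写字母数字单词、抽取 8-gram,并计算测试文档中有多少 8-gram 出现在训练集合中。为简洁起见用 set 代替布隆过滤器;真实实现使用布隆过滤器是为了在极大规模下控制内存,并把误报率约束在约 $\textstyle{\frac{1}{10^{8}}}$ 量级。

```python
# 示意代码:8-gram 归一化与测试集/训练集重叠率计算(用 set 代替布隆过滤器)。

import re

def normalize(text: str) -> str:
    """Lower-cased alphanumeric words separated by single spaces, as described in the text."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def ngrams(text: str, n: int = 8):
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_rate(test_doc: str, train_ngrams: set, n: int = 8) -> float:
    test_ngrams = ngrams(test_doc, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & train_ngrams) / len(test_ngrams)

train_corpus = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
test_doc = "a quick brown fox jumps over the lazy dog near the quiet river bank today again"
train = ngrams(train_corpus)
print(f"{overlap_rate(test_doc, train):.1%}")  # 测试文档 8-gram 在训练集合中的占比
```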

These Bloom filters let us calculate, given a dataset, the percentage of 8-grams from that dataset that are also found in the WebText training set. Table 6 shows this overlap analysis for the test sets of common LM benchmarks. Common LM datasets’ test sets have between $1\!-\!6%$ overlap with WebText train, with an average overlap of $3.2%$ . Somewhat surprisingly, many datasets have larger overlaps with their own training splits, with an average of $5.9%$ overlap.

这些布隆过滤器使我们能够计算,给定一个数据集,其中在 WebText 训练集中也出现的 8-gram 的百分比。表 6 展示了常见大语言模型基准测试集的这种重叠分析。常见大语言模型数据集的测试集与 WebText 训练集的重叠率在 1% 到 6% 之间,平均重叠率为 $3.2%$。有些令人惊讶的是,许多数据集与其自身的训练集有更大的重叠,平均重叠率为 $5.9%$。

Our approach optimizes for recall, and while manual inspection of the overlaps shows many common phrases, there are many longer matches that are due to duplicated data. This is not unique to WebText. For instance, we discovered that the test set of WikiText-103 has an article which is also in the training dataset. Since there are only 60 articles in the test set there is at least an overlap of $1.6%$ . Potentially more worryingly, 1BW has an overlap of nearly $13.2%$ with its own training set according to our procedure.

我们的方法优化了召回率,虽然手动检查重叠部分显示了许多常见短语,但也有许多较长的匹配是由于数据重复造成的。这种情况并非 WebText 独有。例如,我们发现 WikiText-103 的测试集中有一篇文章也出现在训练数据集中。由于测试集中只有 60 篇文章,因此至少有 $1.6%$ 的重叠。更令人担忧的是,根据我们的方法,1BW 与其训练集的重叠率接近 $13.2%$。

For the Winograd Schema Challenge, we found only 10 schemata which had any 8-gram overlaps with the WebText training set. Of these, 2 were spurious matches. Of the remaining 8, only 1 schema appeared in any contexts that gave away the answer.

对于 Winograd Schema Challenge,我们发现只有 10 个模式与 WebText 训练集存在 8-gram 重叠。其中 2 个是虚假匹配。在剩下的 8 个中,只有 1 个模式出现在会泄露答案的上下文中。

For CoQA, about $15%$ of documents in the news domain are already in WebText and the model performs about 3 F1 better on these. CoQA’s development set metric reports the average performance over 5 different domains and we measure a gain of about 0.5-1.0 F1 due to overlap across the various domains. However, no actual training questions or answers are in WebText since CoQA was released after the cutoff date for links in WebText.

对于 CoQA,新闻领域约 15% 的文档已经存在于 WebText 中,模型在这些文档上的表现大约提高了 3 F1。CoQA 的开发集指标报告了 5 个不同领域的平均表现,我们测量到由于各领域之间的重叠,F1 值提高了约 0.5-1.0。然而,WebText 中并未包含实际的训练问题或答案,因为 CoQA 是在 WebText 链接截止日期之后发布的。

On LAMBADA, the average overlap is $1.2%$ . GPT-2 performs about 2 perplexity better on examples with greater than $15%$ overlap. Recalculating metrics when excluding all examples with any overlap shifts results from 8.6 to 8.7 perplexity and reduces accuracy from $63.2%$ to $62.9%$ . This very small change in overall results is likely due to only 1 in 200 examples having significant overlap.

在LAMBADA数据集上,平均重叠率为$1.2%$。GPT-2在重叠率大于$15%$的样本上表现更好,困惑度降低了约2。当排除所有有重叠的样本后重新计算指标,困惑度从8.6变为8.7,准确率从$63.2%$降至$62.9%$。整体结果的微小变化可能是由于仅有1/200的样本存在显著重叠。

Overall, our analysis suggests that data overlap between WebText training data and specific evaluation datasets provides a small but consistent benefit to reported results. However, for most datasets we do not notice significantly larger overlaps than those already existing between standard training and test sets, as Table 6 highlights.

总体而言,我们的分析表明,WebText训练数据与特定评估数据集之间的数据重叠对报告结果提供了微小但一致的益处。然而,对于大多数数据集,我们并未注意到比标准训练集和测试集之间已经存在的重叠显著更大的重叠,如表6所示。

Understanding and quantifying how highly similar text impacts performance is an important research question. Better de-duplication techniques such as scalable fuzzy matching could also help better answer these questions. For now, we recommend the use of n-gram overlap based de-duplication as an important verification step and sanity check during the creation of training and test splits for new NLP datasets.

理解和量化高度相似文本如何影响性能是一个重要的研究问题。更好的去重技术,如可扩展的模糊匹配,也可以帮助更好地回答这些问题。目前,我们建议在创建新 NLP 数据集的训练和测试分割时,使用基于 n-gram 重叠的去重作为重要的验证步骤和合理性检查。

Another potential way of determining whether the performance of WebText LMs is attributable to memorization is inspecting their performance on their own held-out set. As shown in Figure 4, performance on both the training and test sets of WebText are similar and improve together as model size is increased. This suggests even GPT-2 is still underfitting on WebText in many ways.

另一种确定 WebText 大语言模型性能是否归因于记忆的潜在方法是检查它们在自己保留集上的表现。如图 4 所示,WebText 的训练集和测试集上的表现相似,并且随着模型规模的增加而一起提升。这表明即使 GPT-2 在许多方面仍然对 WebText 欠拟合。

GPT-2 is also able to write news articles about the discovery of talking unicorns. An example is provided in Table 13.

GPT-2 也能够撰写关于发现会说话的独角兽的新闻文章。表 13 中提供了一个示例。

5. Related Work

5. 相关工作

A significant portion of this work measured the performance of larger language models trained on larger datasets. This is similar to the work of Jozefowicz et al. (2016) which scaled RNN based language models on the 1 Billion Word Benchmark. Bajgar et al. (2016) also previously improved results on the Children’s Book Test by creating a much larger training dataset out of Project Gutenberg to supplement the standard training dataset. Hestness et al. (2017) conducted a thorough analysis of how the performance of various deep learning models changes as a function of both model capacity and dataset size. Our experiments, while much noisier across tasks, suggest similar trends hold for sub-tasks of an objective and continue into the 1B+ parameter regime.

本工作的一个重要部分是测量在更大数据集上训练的更大语言模型的性能。这与 Jozefowicz 等人 (2016) 的工作类似,他们在 10 亿词基准上扩展了基于 RNN 的语言模型。Bajgar 等人 (2016) 之前也通过从古腾堡计划中创建更大的训练数据集来补充标准训练数据集,从而改进了儿童图书测试的结果。Hestness 等人 (2017) 对各种深度学习模型的性能如何随模型容量和数据集大小的变化进行了深入分析。我们的实验虽然在任务之间噪声更大,但表明类似的趋势在目标的子任务中仍然存在,并延续到 1B+ 参数范围内。


Figure 4. The performance of LMs trained on WebText as a function of model size.

图 4: 在 WebText 上训练的语言模型性能随模型大小的变化。

Interesting learned functionality in generative models has been documented before such as the cells in an RNN language model performing line-width tracking and quote/comment detection Karpathy et al. (2015). More inspirational to our work was the observation of Liu et al. (2018) that a model trained to generate Wikipedia articles also learned to translate names between languages.

在生成式模型中有趣的学习功能之前已被记录,例如RNN语言模型中的单元执行行宽跟踪和引号/评论检测 [Karpathy et al., 2015]。对我们的工作更具启发性的是 [Liu et al., 2018] 的观察,即一个训练用于生成维基百科文章的模型也学会了在不同语言之间翻译名称。

Previous work has explored alternative approaches to filtering and constructing a large text corpus of web pages, such as the iWeb Corpus (Davies, 2018).

先前的工作已经探索了过滤和构建网页大型文本语料库的替代方法,例如 iWeb 语料库 (Davies, 2018)。

There has been extensive work on pre-training methods for language tasks. In addition to those mentioned in the introduction, GloVe (Pennington et al., 2014) scaled word vector representation learning to all of Common Crawl. An influential early work on deep representation learning for text was Skip-thought Vectors (Kiros et al., 2015). McCann et al. (2017) explored the use of representations derived from machine translation models and Howard & Ruder (2018) improved the RNN based fine-tuning approaches of (Dai & Le, 2015). (Conneau et al., 2017a) studied the transfer performance of representations learned by natural language inference models and (Subramanian et al., 2018) explored large-scale multitask training.

在语言任务的预训练方法方面已有大量研究。除了引言中提到的那些,GloVe (Pennington et al., 2014) 将词向量表示学习扩展到整个 Common Crawl。文本深度表示学习的一个有影响力的早期工作是 Skip-thought Vectors (Kiros et al., 2015)。McCann et al. (2017) 探索了从机器翻译模型中提取表示的使用,而 Howard & Ruder (2018) 改进了 (Dai & Le, 2015) 基于 RNN 的微调方法。(Conneau et al., 2017a) 研究了自然语言推理模型学习到的表示的迁移性能,(Subramanian et al., 2018) 则探索了大规模多任务训练。

(Ramachandran et al., 2016) demonstrated that seq2seq models benefit from being initialized with pre-trained language models as encoders and decoders. More recent work has shown that LM pre-training is helpful when fine-tuned for difficult generation tasks like chit-chat dialog and dialog based question answering systems as well (Wolf et al., 2019) (Dinan et al., 2018).

(Ramachandran et al., 2016) 证明了使用预训练语言模型作为编码器和解码器初始化的序列到序列模型能够从中受益。最近的研究表明,在微调用于困难生成任务(如闲聊对话和基于对话的问答系统)时,语言模型预训练同样有帮助 (Wolf et al., 2019) (Dinan et al., 2018)。

6. Discussion

6. 讨论

Much research has been dedicated to learning (Hill et al., 2016), understanding (Levy & Goldberg, 2014), and critically evaluating (Wieting & Kiela, 2019) the representations of both supervised and unsupervised pre-training methods. Our results suggest that unsupervised task learning is an additional promising area of research to explore. These findings potentially help explain the widespread success of pre-training techniques for down-stream NLP tasks as we show that, in the limit, one of these pre-training techniques begins to learn to perform tasks directly without the need for supervised adaption or modification.

大量研究致力于学习 (Hill et al., 2016)、理解 (Levy & Goldberg, 2014) 以及批判性评估 (Wieting & Kiela, 2019) 有监督和无监督预训练方法的表示。我们的结果表明,无监督任务学习是另一个值得探索的有前景的研究领域。这些发现可能有助于解释预训练技术在下游 NLP 任务中广泛成功的原因,因为我们表明,在极限情况下,这些预训练技术之一开始直接学习执行任务,而无需有监督的适应或修改。

On reading comprehension the performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. However, on other tasks such as summarization, while it is qualitatively performing the task, its performance is still only rudimentary according to quantitative metrics. While suggestive as a research result, in terms of practical applications, the zero-shot performance of GPT-2 is still far from use-able.

在阅读理解任务中,GPT-2 在零样本设置下的表现与有监督的基线模型相当。然而,在其他任务(如摘要生成)中,尽管 GPT-2 在定性上能够完成任务,但根据定量指标,其表现仍然较为基础。虽然作为研究结果具有一定的启发性,但在实际应用中,GPT-2 的零样本性能仍然远未达到可用水平。

We have studied the zero-shot performance of WebText LMs on many canonical NLP tasks, but there are many additional tasks that could be evaluated. There are undoubtedly many practical tasks where the performance of GPT-2 is still no better than random. Even on common tasks that we evaluated on, such as question answering and translation, language models only begin to outperform trivial baselines when they have sufficient capacity.

我们研究了 WebText 大语言模型在许多经典 NLP 任务上的零样本表现,但还有许多其他任务可以评估。毫无疑问,在许多实际任务中,GPT-2 的表现仍然不比随机好。即使在我们评估的常见任务上,如问答和翻译,语言模型只有在具备足够能力时才开始优于简单基线。

While zero-shot performance establishes a baseline of the potential performance of GPT-2 on many tasks, it is not clear where the ceiling is with finetuning. On some tasks, GPT-2’s fully abstractive output is a significant departure from the extractive pointer network (Vinyals et al., 2015) based outputs which are currently state of the art on many question answering and reading comprehension datasets. Given the prior success of fine-tuning GPT, we plan to investigate fine-tuning on benchmarks such as decaNLP and GLUE, especially since it is unclear whether the additional training data and capacity of GPT-2 is sufficient to overcome the inefficiencies of uni-directional representations demonstrated by BERT (Devlin et al., 2018).

虽然零样本性能为 GPT-2 在许多任务上的潜在性能建立了基线,但尚不清楚通过微调能达到的上限在哪里。在某些任务上,GPT-2 的完全抽象输出与目前在许多问答和阅读理解数据集上最先进的基于提取指针网络 (Vinyals et al., 2015) 的输出有显著差异。鉴于之前微调 GPT 的成功,我们计划在 decaNLP 和 GLUE 等基准上进行微调研究,特别是因为尚不清楚 GPT-2 的额外训练数据和容量是否足以克服 BERT (Devlin et al., 2018) 所展示的单向表示的低效性。

7. Conclusion

7. 结论

When a large language model is trained on a suf