[论文翻译]语言模型是少样本学习者


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/GPT%E7%B3%BB%E5%88%97/gpt3-Language-Models-are-Few-Shot-Learners.pdf


Language Models are Few-Shot Learners

语言模型是少样本学习者

OpenAI

| Tom B. Brown* | Benjamin Mann* | Nick Ryder* | Melanie Subbiah* |
| Jared Kaplan | Prafulla Dhariwal | Arvind Neelakantan | Pranav Shyam | Girish Sastry |
| Amanda Askell | Sandhini Agarwal | Ariel Herbert-Voss | Gretchen Krueger | Tom Henighan |
| Rewon Child | Aditya Ramesh | Daniel M. Ziegler | Jeffrey Wu | Clemens Winter |
| Christopher Hesse | Mark Chen | Eric Sigler | Mateusz Litwin | Scott Gray |
| Benjamin Chess | Jack Clark | Christopher Berner |
| Sam McCandlish | Alec Radford | Ilya Sutskever | Dario Amodei |

Abstract

摘要

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, $10\mathrm{x}$ more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

最近的研究表明,通过在大量文本语料库上进行预训练,然后在特定任务上进行微调,可以在许多自然语言处理(NLP)任务和基准测试中取得显著进展。尽管这种方法在架构上通常是任务无关的,但它仍然需要数千甚至数万个示例的任务特定微调数据集。相比之下,人类通常只需几个示例或简单的指令就能执行新的语言任务——这是当前NLP系统仍然难以做到的。本文展示了扩展语言模型可以显著提高任务无关的少样本性能,有时甚至能与之前的最先进微调方法相媲美。具体来说,我们训练了GPT-3,这是一个拥有1750亿参数的自回归语言模型,参数数量是之前任何非稀疏语言模型的10倍,并在少样本设置下测试了其性能。对于所有任务,GPT-3在没有任何梯度更新或微调的情况下应用,任务和少样本演示仅通过与模型的文本交互来指定。GPT-3在许多NLP数据集上表现出色,包括翻译、问答和完形填空任务,以及一些需要即时推理或领域适应的任务,如还原乱序单词、在句子中使用新词或执行三位数算术。同时,我们也识别出一些GPT-3的少样本学习仍然表现不佳的数据集,以及一些GPT-3面临与在大型网络语料库上训练相关的方法论问题的数据集。最后,我们发现GPT-3能够生成新闻文章样本,人类评估者难以将其与人类撰写的文章区分开来。我们讨论了这一发现以及GPT-3总体上更广泛的社会影响。

Contents

目录

1 Introduction

1 引言

2 Approach

2 方法

3 Results

3 结果

4 Measuring and Preventing Memorization Of Benchmarks

4 测量和防止基准测试的记忆化

5 Limitations

5 限制

6 Broader Impacts

6 更广泛的影响

7 Related Work

7 相关工作

1 Introduction

1 引言

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models $[\mathrm{VSP^{+}17}]$ have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].

近年来,NLP系统呈现出一种趋势,即采用预训练的语言表示,并以越来越灵活且与任务无关的方式应用于下游迁移。最初,使用词向量 [MCCD13, PSM14] 学习单层表示,并将其输入到特定任务的架构中;随后,使用具有多层表示和上下文状态的RNN来形成更强的表示 [DL15, MBXS17, PNZtY18](尽管仍然应用于特定任务的架构);最近,预训练的循环或Transformer语言模型 $[\mathrm{VSP^{+}17}]$ 被直接微调,完全消除了对特定任务架构的需求 [RNSS18, DCLT18, HR18]。

This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.

最后这一范式在许多具有挑战性的自然语言处理任务上取得了显著进展,例如阅读理解、问答、文本蕴含等,并且随着新架构和算法的出现持续进步 [RSR+19, LOG+19, YDY+19, LCG+19]。然而,这种方法的一个主要限制是,尽管架构是任务无关的,但仍然需要特定任务的数据集和特定任务的微调:要在某个任务上实现强大的性能,通常需要对该任务的数千到数十万个样本的数据集进行微调。出于多种原因,消除这一限制是可取的。

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

首先,从实际角度来看,每个新任务都需要一个大型标注样本数据集,这限制了语言模型的适用性。可能有用的语言任务范围非常广泛,从纠正语法,到生成某个抽象概念的示例,再到点评一篇短篇小说。对于许多这类任务,很难收集到大规模的监督训练数据集,尤其是当这一过程必须针对每个新任务重复进行时。

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].

其次,随着模型表达能力的增强和训练分布的收窄,利用训练数据中虚假相关性的可能性也从根本上增大。这可能会给预训练加微调的范式带来问题:在这种范式中,模型被设计得很大以便在预训练期间吸收信息,但随后却在非常狭窄的任务分布上进行微调。例如,[HLW+20] 观察到,较大的模型不一定在分布外泛化得更好。有证据表明,在这种范式下实现的泛化可能很差,因为模型过度针对训练分布,无法在其之外很好地泛化 [YdC+19, MPL19]。因此,微调模型在特定基准测试上的性能,即使名义上达到人类水平,也可能夸大其在底层任务上的实际性能 [GSL+18, NK19]。

Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

第三,人类不需要大量的监督数据集来学习大多数语言任务——自然语言中的简短指令(例如,“请告诉我这句话描述的是快乐还是悲伤的事情”)或最多几个示例(例如,“这里有两个勇敢行为的例子;请给出第三个勇敢的例子”)通常足以让人类以至少合理的能力执行新任务。除了指出我们当前自然语言处理(NLP)技术的概念局限性外,这种适应性还具有实际优势——它允许人类无缝地混合或切换许多任务和技能,例如在长时间的对话中执行加法运算。为了广泛有用,我们希望有一天我们的 NLP 系统也能具备同样的流畅性和通用性。


Figure 1.1: Language model meta-learning. During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task. We use the term “in-context learning” to describe the inner loop of this process, which occurs within the forward-pass upon each sequence. The sequences in this diagram are not intended to be representative of the data a model would see during pre-training, but are intended to show that there are sometimes repeated sub-tasks embedded within a single sequence.

图 1.1: 语言模型的元学习。在无监督预训练过程中,语言模型发展出一系列广泛的技能和模式识别能力。然后,在推理时,它利用这些能力快速适应或识别所需的任务。我们使用“上下文学习”这一术语来描述这一过程的内循环,该循环发生在每个序列的前向传递中。图中的序列并不代表模型在预训练期间会看到的数据,而是为了展示有时在单个序列中嵌入的重复子任务。


Figure 1.2: Larger models make increasingly efficient use of in-context information. We show in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description (see Sec. 3.9.2). The steeper “in-context learning curves” for large models demonstrate improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range of tasks.

图 1.2: 更大的模型能够更高效地利用上下文信息。我们展示了一个简单任务中的上下文学习性能,该任务要求模型从单词中删除随机符号,无论是否有自然语言任务描述(见第 3.9.2 节)。大型模型的“上下文学习曲线”更陡峭,表明它们从上下文信息中学习任务的能力有所提高。我们在广泛的任务中观察到了类似的行为。

One potential route towards addressing these issues is meta-learning – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work $[\mathrm{RWC}^{+}19]$ attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.

解决这些问题的一个潜在途径是元学习(meta-learning)——在语言模型的背景下,这意味着模型在训练时发展出一套广泛的技能和模式识别能力,然后在推理时利用这些能力快速适应或识别所需任务(如图 1.1 所示)。最近的工作 [RWC+19] 试图通过我们称之为“上下文学习”(in-context learning)的方式来实现这一点,使用预训练语言模型的文本输入作为任务规范的一种形式:模型以自然语言指令和/或任务的几个示例为条件,然后通过预测接下来会发生什么来完成任务的更多实例。

While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example $[\mathrm{RWC}^{+}19]$ achieves only $4\%$ on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.

尽管这种方法显示出了一些初步的潜力,但其结果仍然远不及微调——例如,$[\mathrm{RWC}^{+}19]$ 在 Natural Questions 上仅达到了 4%,甚至其 55 F1 的 CoQa 结果也比当前的最新技术落后了 35 分以上。显然,元学习需要大幅改进才能成为解决语言任务的实用方法。

Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters $[\mathrm{RWC}^{+}19]$ , to 8 billion parameters $[\mathrm{SPP^{+}19}]$ , 11 billion parameters $[\mathrm{RSR}^{+}19]$ , and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale $[\mathrm{KMH}^{+}20]$ . Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

语言建模的另一个最新趋势可能提供了一条前进的道路。近年来,Transformer语言模型的容量大幅增加,从1亿参数 [RNSS18],到3亿参数 [DCLT18],再到15亿参数 $[\mathrm{RWC}^{+}19]$,80亿参数 $[\mathrm{SPP^{+}19}]$,110亿参数 $[\mathrm{RSR}^{+}19]$,最终达到170亿参数 [Tur20]。每次增加都带来了文本合成和/或下游NLP任务的改进,并且有证据表明,与许多下游任务相关性良好的对数损失(log loss)随着规模的增加呈现出平滑的改进趋势 $[\mathrm{KMH}^{+}20]$。由于上下文学习涉及在模型参数中吸收许多技能和任务,因此上下文学习能力可能随着规模的增加而表现出类似的强劲提升。


Figure 1.3: Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning. See Figure 3.8 for a more detailed analysis on SuperGLUE, a standard NLP benchmark suite.

图 1.3: 所有 42 个以准确率为基准的测试集的综合表现。虽然零样本性能随着模型规模的增加稳步提升,但少样本性能提升得更快,这表明更大的模型在上下文学习中更为熟练。关于 SuperGLUE(一个标准的 NLP 基准测试套件)的详细分析,请参见图 3.8。

In this paper, we test this hypothesis by training a 175 billion parameter auto regressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.

在本文中,我们通过训练一个拥有1750亿参数的自回归语言模型(我们称之为GPT-3)来测试这一假设,并测量其上下文学习能力。具体来说,我们在超过二十个自然语言处理(NLP)数据集上评估GPT-3,同时还设计了几个新任务来测试其对训练集中不太可能直接包含的任务的快速适应能力。对于每个任务,我们在三种条件下评估GPT-3:(a) "少样本学习",即上下文学习,我们允许尽可能多的演示样本放入模型的上下文窗口中(通常为10到100个),(b) "单样本学习",即只允许一个演示样本,以及(c) "零样本"学习,即不允许任何演示样本,仅向模型提供自然语言的指令。原则上,GPT-3也可以在传统的微调设置中进行评估,但我们将其留待未来工作。

Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, $K$ . Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.

图 1.2 展示了我们研究的条件,并展示了一个简单任务的少样本学习,该任务要求模型从单词中删除多余的符号。随着自然语言任务描述的添加以及模型上下文中示例数量 $K$ 的增加,模型性能有所提升。少样本学习也随着模型规模的增大而显著提高。尽管在这种情况下结果特别显著,但模型规模和上下文示例数量的一般趋势在我们研究的大多数任务中都成立。我们强调,这些“学习”曲线不涉及梯度更新或微调,只是增加了作为条件的演示数量。

Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves $64.3\%$ accuracy on TriviaQA in the zero-shot setting, $68.0\%$ in the one-shot setting, and $71.2\%$ in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.

在自然语言处理(NLP)任务中,GPT-3 在零样本(zero-shot)和单样本(one-shot)设置下取得了令人瞩目的成果,而在少样本(few-shot)设置下,有时能与最先进水平相媲美,甚至偶尔超越它(尽管最先进水平目前由经过微调的模型保持)。例如,GPT-3 在零样本设置下的 CoQA 任务中达到了 81.5 F1,在单样本设置下达到了 84.0 F1,在少样本设置下达到了 85.0 F1。同样,GPT-3 在零样本设置下的 TriviaQA 任务中达到了 64.3% 的准确率,在单样本设置下达到了 68.0%,在少样本设置下达到了 71.2%,后者相对于在相同闭卷设置下运行的微调模型而言达到了最先进水平。

GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning, which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles.

GPT-3 在旨在测试快速适应或即时推理的任务中也展示了单样本和少样本能力,这些任务包括还原乱序单词、执行算术运算,以及在仅见过一次定义后在句子中使用新词。我们还展示了在少样本设置下,GPT-3 能够生成合成新闻文章,人类评估者难以将其与人类撰写的文章区分开来。

At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.

同时,我们也发现了一些任务,即使是在 GPT-3 的规模下,少样本表现仍然存在困难。这包括像 ANLI 数据集这样的自然语言推理任务,以及一些阅读理解数据集,如 RACE 或 QuAC。通过全面展示 GPT-3 的优势和劣势,包括这些局限性,我们希望激发对大语言模型中少样本学习的研究,并引起对最需要进展的领域的关注。

A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).

从图 1.3 可以对整体结果有一个大致的直观了解,该图汇总了各项任务的结果(尽管它本身不应被视为严格或有意义的基准)。

We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.

我们还对“数据污染”进行了系统研究——这是一个在 Common Crawl 等数据集上训练高容量模型时日益严重的问题:这些数据集可能包含来自测试集的内容,仅仅因为这些内容往往本来就存在于网络上。在本文中,我们开发了系统性的工具来测量数据污染并量化其带来的扭曲效应。尽管我们发现数据污染对 GPT-3 在大多数数据集上的表现影响甚微,但我们确实识别出少数数据集上污染可能夸大了结果;根据严重程度,我们要么不报告这些数据集的结果,要么用星号加以标注。

In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

除了上述所有内容外,我们还训练了一系列较小的模型(参数范围从1.25亿到130亿),以便在零样本、单样本和少样本设置中与GPT-3的性能进行比较。总体而言,对于大多数任务,我们发现模型容量在这三种设置中的扩展相对平滑;一个显著的模式是,零样本、单样本和少样本性能之间的差距通常随着模型容量的增加而增大,这可能表明较大的模型是更熟练的元学习者。

Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.

最后,鉴于 GPT-3 展现出的广泛能力,我们讨论了关于偏见、公平性以及更广泛社会影响的担忧,并尝试对 GPT-3 在这方面的特性进行初步分析。

The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.

本文的其余部分组织如下。在第2节中,我们描述了训练GPT-3并对其进行评估的方法。第3节展示了在零样本、单样本和少样本设置下各种任务的结果。第4节讨论了数据污染(训练-测试重叠)的问题。第5节讨论了GPT-3的局限性。第6节讨论了更广泛的影响。第7节回顾了相关工作,第8节总结了全文。

2 Approach

2 方法

Our basic pre-training approach, including model, data, and training, is similar to the process described in $[\mathrm{RWC}^{+}19]$ , with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to $[\mathrm{RWC}^{+}19]$ , but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

我们的基本预训练方法,包括模型、数据和训练,与 $[\mathrm{RWC}^{+}19]$ 中描述的过程相似,主要是对模型大小、数据集大小和多样性以及训练长度进行了相对直接的扩展。我们对上下文学习的使用也与 $[\mathrm{RWC}^{+}19]$ 相似,但在本工作中,我们系统地探索了在上下文中学习的不同设置。因此,我们首先明确定义并对比了我们将评估 GPT-3 或原则上可以评估 GPT-3 的不同设置。这些设置可以被视为在依赖任务特定数据的程度上存在差异。具体来说,我们可以在这个谱系中识别出至少四个点(见图 2.1 的图示):

• Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.

• 微调 (Fine-Tuning, FT) 是近年来最常见的方法,它通过在特定任务的监督数据集上训练来更新预训练模型的权重。通常使用数千到数十万个标注样本。微调的主要优势是在许多基准测试中表现出色。主要缺点是需要为每个任务准备一个新的大型数据集,可能在分布外泛化能力上表现不佳 [MPL19],并且可能利用训练数据中的虚假特征 [GSL+18, NK19],这可能导致与人类表现的不公平比较。在本工作中,我们没有对 GPT-3 进行微调,因为我们的重点是任务无关的性能,但原则上 GPT-3 是可以微调的,这也是未来工作的一个很有前景的方向。

• Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning $[\mathrm{RWC}^{+}19]$ , but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set $K$ in the range of 10 to 100 as this is how many examples can fit in the model’s context window ($n_{\mathrm{ctx}}=2048$). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.

• 少样本 (Few-Shot, FS) 是我们在本文中使用的术语,指的是在推理时给模型提供少量任务演示作为条件 $[\mathrm{RWC}^{+}19]$,但不允许进行权重更新的设置。如图 2.1 所示,对于典型的数据集,一个示例包含上下文和期望的完成内容(例如一个英语句子和其法语翻译),少样本学习通过提供 $K$ 个上下文和完成内容的示例,然后提供一个最终的上下文示例,期望模型能够生成完成内容。我们通常将 $K$ 设置在 10 到 100 的范围内,因为这是模型上下文窗口 $(n_{\mathrm{ctx}}=2048)$ 能够容纳的示例数量。少样本学习的主要优点是大大减少了对任务特定数据的需求,并降低了从大型但狭窄的微调数据集中学习过于狭窄分布的可能性。主要缺点是,迄今为止,这种方法的结果远不如最先进的微调模型。此外,仍然需要少量的任务特定数据。正如名称所示,这里描述的语言模型的少样本学习与机器学习中其他上下文中的少样本学习 [HYC01, $\mathrm{VBL}^{+}16]$ 相关——两者都涉及基于广泛任务分布(在这种情况下隐含在预训练数据中)的学习,然后快速适应新任务。

• One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.

• 单样本 (One-Shot, 1S) 与少样本类似,区别在于只允许提供一个示例,同时还会给出任务的自然语言描述,如图 2.1 所示。之所以将单样本与少样本和零样本(见下文)区分开来,是因为它最接近某些任务传达给人类的方式。例如,当要求人类在人类工作者服务(如 Mechanical Turk)上生成数据集时,通常会提供一个任务示例。相比之下,如果没有给出示例,有时很难传达任务的内容或格式。


Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning (the three settings we explore for in-context learning). The panels above show four methods for performing a task with a language model – fine-tuning is the traditional method, whereas zero-, one-, and few-shot, which we study in this work, require the model to perform the task with only forward passes at test time. We typically present the model with a few dozen examples in the few-shot setting. Exact phrasings for all task descriptions, examples and prompts can be found in Appendix G.

图 2.1: 零样本、单样本和少样本,与传统微调的对比。上面的面板展示了使用语言模型执行任务的四种方法——微调是传统方法,而零样本、单样本和少样本(我们在本研究中探讨的)要求模型在测试时仅通过前向传递来执行任务。在少样本设置中,我们通常向模型提供几十个示例。所有任务描述、示例和提示的确切措辞可以在附录 G 中找到。

• Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”. For example, if someone is asked to “make a table of world records for the $200\mathrm{m}$ dash”, this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.

• 零样本 (Zero-Shot, 0S) 与单样本类似,但不允许提供任何示例,模型仅会收到描述任务的自然语言指令。这种方法提供了最大的便利性、潜在的鲁棒性,并避免了虚假相关性(除非它们在预训练数据的大规模语料库中广泛存在),但也是最具挑战性的设置。在某些情况下,甚至人类在没有先例的情况下也可能难以理解任务的格式,因此这种设置在某些情况下“过于困难”。例如,如果有人被要求“制作一个200米短跑世界纪录的表格”,这个请求可能是模糊的,因为可能不清楚表格的具体格式或应包含哪些内容(即使经过仔细澄清,准确理解需求也可能很困难)。然而,至少在某些情况下,零样本最接近人类执行任务的方式——例如,在图2.1的翻译示例中,人类可能仅通过文本指令就知道该怎么做。

Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.

图 2.1 展示了使用英语翻译成法语的例子来说明这四种方法。在本文中,我们主要关注零样本、单样本和少样本,目的是将它们进行比较,不是作为竞争性的替代方案,而是作为不同的问题设置,这些设置在特定基准测试中的性能和样本效率之间提供了不同的权衡。我们特别强调了少样本的结果,因为其中许多结果仅略微落后于最先进的微调模型。然而,最终,单样本,甚至有时是零样本,似乎是与人类表现最公平的比较,并且是未来工作的重要目标。
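
为了更直观地展示这几种设置在输入形式上的差异,下面给出一个示意性的 Python 草图,按照图 2.1 中英译法的例子拼接零样本、单样本和少样本的提示文本。其中 `build_prompt` 及各示例句子仅为说明用的假设写法,并非论文附录 G 中的原始提示。

```python
def build_prompt(task_description, demonstrations, query, k):
    """按零样本 (k=0)、单样本 (k=1) 或少样本 (k>1) 的设置拼接提示文本。

    demonstrations: [(源句, 目标句), ...],只取前 k 个作为上下文中的任务演示;
    query: 最后一个只给上下文的例子,期望模型在其后补全答案。
    """
    lines = [task_description]
    for src, tgt in demonstrations[:k]:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")  # 模型只做前向传递并预测后续文本,没有任何梯度更新
    return "\n".join(lines)


# 示意用法:英译法(示例风格仿照图 2.1)
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
print(build_prompt("Translate English to French:", demos, "cheese", k=0))  # 零样本
print(build_prompt("Translate English to French:", demos, "cheese", k=1))  # 单样本
print(build_prompt("Translate English to French:", demos, "cheese", k=2))  # 少样本
```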

Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.

2.1-2.3 节分别详细介绍了我们的模型、训练数据和训练过程。2.4 节讨论了如何进行少样本、单样本和零样本评估的细节。

Table 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models which we trained. All models were trained for a total of 300 billion tokens.

表 2.1: 我们训练的模型的规模、架构和学习超参数(Token 的批量大小和学习率)。所有模型总共训练了 3000 亿个 Token。

| 模型名称 | 参数量 | 层数 | 模型维度 | 头数 | 头维度 | 批量大小 | 学习率 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | 6.0 × 10^-4 |
| GPT-3 Medium | 350M | 24 | 1024 | 16 | 64 | 0.5M | 3.0 × 10^-4 |
| GPT-3 Large | 760M | 24 | 1536 | 16 | 96 | 0.5M | 2.5 × 10^-4 |
| GPT-3 XL | 1.3B | 24 | 2048 | 24 | 128 | 1M | 2.0 × 10^-4 |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | 1.6 × 10^-4 |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | 1.2 × 10^-4 |
| GPT-3 13B | 13.0B | 40 | 5140 | 40 | 128 | 2M | 1.0 × 10^-4 |
| GPT-3 175B (即“GPT-3”) | 175.0B | 96 | 12288 | 96 | 128 | 3.2M | 0.6 × 10^-4 |

2.1 Model and Architectures

2.1 模型与架构

We use the same model and architecture as GPT-2 $[\mathrm{RWC}^{+}19]$ , including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work $[\mathrm{KMH}^{+}20]$ suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.

我们使用了与 GPT-2 [RWC+19] 相同的模型和架构,包括其中描述的修改后的初始化、预归一化和可逆的 Token 化,唯一的区别是我们在 Transformer 的层中使用了交替的密集和局部带状稀疏注意力模式,类似于 Sparse Transformer [CGRS19]。为了研究机器学习性能对模型大小的依赖性,我们训练了 8 种不同大小的模型,参数数量从 1.25 亿到 1750 亿不等,最后一个模型我们称之为 GPT-3。先前的工作 [KMH+20] 表明,在足够的训练数据下,验证损失的缩放应该大致是一个平滑的幂律函数;训练多个不同大小的模型使我们能够验证这一假设,无论是在验证损失还是在下游语言任务中。

Table 2.1 shows the sizes and architectures of our 8 models. Here $n_{\mathrm{params}}$ is the total number of trainable parameters, $n_{\mathrm{layers}}$ is the total number of layers, $d_{\mathrm{model}}$ is the number of units in each bottleneck layer (we always have the feed forward layer four times the size of the bottleneck layer, $d_{\mathrm{ff}}=4*d_{\mathrm{model}})$ , and $d_{\mathrm{head}}$ is the dimension of each attention head. All models use a context window of $n_{\mathrm{ctx}}=2048$ tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPU’s. Previous work $[\mathrm{KMH}^{+}20]$ suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.

表 2.1 展示了我们 8 个模型的规模和架构。其中,$n_{\mathrm{params}}$ 是可训练参数的总数,$n_{\mathrm{layers}}$ 是总层数,$d_{\mathrm{model}}$ 是每个瓶颈层的单元数(我们总是将前馈层的尺寸设为瓶颈层的四倍,即 $d_{\mathrm{ff}}=4*d_{\mathrm{model}}$),$d_{\mathrm{head}}$ 是每个注意力头的维度。所有模型都使用 $n_{\mathrm{ctx}}=2048$ 个 token 的上下文窗口。为了最小化节点之间的数据传输,我们在深度和宽度维度上将模型分布在多个 GPU 上。每个模型的精确架构参数是基于计算效率和 GPU 上模型布局的负载均衡来选择的。之前的工作 $[\mathrm{KMH}^{+}20]$ 表明,验证损失在合理范围内对这些参数并不十分敏感。
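
作为对表 2.1 中规模数字的一个粗略印证,下面的 Python 草图按常见的仅解码器 Transformer 结构估算参数量:每层注意力约 $4d_{\mathrm{model}}^2$,前馈层按文中 $d_{\mathrm{ff}}=4d_{\mathrm{model}}$ 约 $8d_{\mathrm{model}}^2$,再加上词嵌入与位置嵌入。其中词表大小 50257 取自 GPT-2 的 BPE 词表,属于假设值;计算忽略偏置与 LayerNorm 等小项,仅作数量级上的近似。

```python
def approx_params(n_layers, d_model, n_vocab=50257, n_ctx=2048):
    """粗略估算仅解码器 Transformer 的可训练参数量(忽略偏置、LayerNorm 等小项)。"""
    attn = 4 * d_model * d_model          # Q、K、V 投影与输出投影
    ffn = 2 * d_model * (4 * d_model)     # 前馈层:d_ff = 4 * d_model
    per_layer = attn + ffn                # 每层约 12 * d_model^2
    embeddings = (n_vocab + n_ctx) * d_model
    return n_layers * per_layer + embeddings


# 以表 2.1 中的 GPT-3 175B 配置为例:96 层、d_model = 12288
print(f"{approx_params(96, 12288) / 1e9:.1f}B")  # 输出约 174.6B,与 175B 在同一量级
```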

2.2 Training Dataset

2.2 训练数据集

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset $[\mathrm{RSR}^{+}19]$ constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of Common Crawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment Common Crawl and increase its diversity.

语言模型的数据集迅速扩展,最终形成了包含近万亿词汇的 Common Crawl 数据集 $[\mathrm{RSR}^{+}19]$。如此规模的数据集足以训练我们最大的模型,而无需在相同的序列上重复更新。然而,我们发现未经过滤或轻度过滤的 Common Crawl 版本往往比经过精心筛选的数据集质量更低。因此,我们采取了三个步骤来提高数据集的平均质量:(1) 我们下载并过滤了 Common Crawl 的一个版本,基于其与一系列高质量参考语料库的相似性;(2) 我们在文档级别进行了模糊去重,无论是在数据集内部还是跨数据集,以防止冗余,并保持留出验证集的完整性,使其能够准确衡量过拟合;(3) 我们还在训练混合中添加了已知的高质量参考语料库,以增强 Common Crawl 并增加其多样性。

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in $[\mathrm{KMH}^{+}20]$ , two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.

前两点的详细信息(Common Crawl的处理)在附录A中进行了描述。对于第三点,我们添加了几个精选的高质量数据集,包括通过长时间抓取链接收集的WebText数据集的扩展版本 [RWC+19],首次在 $[\mathrm{KMH}^{+}20]$ 中描述,以及两个基于互联网的书籍语料库(Books1和Books2)和英文维基百科。

Table 2.2 shows the final mixture of datasets that we used in training. The Common Crawl data was downloaded from 41 shards of monthly Common Crawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size, but rather datasets we view as higher-quality are sampled more frequently, such that Common Crawl and Books2 datasets are sampled less than once during training, but the other datasets are sampled 2-3 times. This essentially accepts a small amount of over fitting in exchange for higher quality training data.

表 2.2 展示了我们在训练中使用的最终数据集混合情况。Common Crawl 数据是从 2016 年至 2019 年的 41 个月度 Common Crawl 分片中下载的,过滤前构成了 45TB 的压缩纯文本,过滤后为 570GB,大约相当于 4000 亿个字节对编码的 Token。需要注意的是,在训练过程中,数据集并不是按其大小比例进行采样的,而是我们认为质量更高的数据集会被更频繁地采样,因此 Common Crawl 和 Books2 数据集在训练过程中被采样的次数少于一次,而其他数据集则被采样 2-3 次。这本质上是为了换取更高质量的训练数据而接受少量的过拟合。


Figure 2.2: Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models [KMH+20] we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B is almost $10\mathrm{x}$ larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of compute during pre-training. Methodology for these calculations can be found in Appendix D.

Table 2.2: Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.

图 2.2: 训练期间使用的总计算量。基于《Scaling Laws For Neural Language Models》[KMH+20] 的分析,我们在比通常少得多的 Token 上训练了更大的模型。因此,尽管 GPT-3 3B 几乎比 RoBERTa-Large(355M 参数)大 10 倍,但两个模型在预训练期间都花费了大约 50 petaflop/s-days 的计算量。这些计算的方法可以在附录 D 中找到。

表 2.2: 用于训练 GPT-3 的数据集。“训练混合中的权重”指的是训练期间从给定数据集中抽取的样本比例,我们有意不使其与数据集的大小成比例。因此,当我们训练 3000 亿个 Token 时,一些数据集在训练期间会被看到多达 3.4 次,而其他数据集则被看到不到一次。

| 数据集 | 数量 (tokens) | 训练混合权重 | 训练 300B tokens 时的 epochs |
| --- | --- | --- | --- |
| Common Crawl (过滤后) | 4100亿 | 60% | 0.44 |
| WebText2 | 190亿 | 22% | 2.9 |
| Books1 | 120亿 | 8% | 1.9 |
| Books2 | 550亿 | 8% | 0.43 |
| Wikipedia | 30亿 | 3% | 3.4 |
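
上表中的“训练混合权重”可以理解为:每抽取一条训练样本时,按这些权重(而非按数据集大小)决定它来自哪个数据集。下面是一个示意性的 Python 草图,用 `random.choices` 模拟这种按权重采样的过程;它只是对采样思路的演示,并非论文实际使用的数据加载实现。

```python
import random

# 表 2.2 中的训练混合权重
mixture = {
    "Common Crawl (过滤后)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

def sample_source(rng=random):
    """按混合权重抽取一条训练样本的来源数据集(权重会被按比例归一化)。"""
    names, weights = zip(*mixture.items())
    return rng.choices(names, weights=weights, k=1)[0]

# 模拟抽取 10 万条样本,统计各来源出现的频率,应大致与权重一致
counts = {name: 0 for name in mixture}
for _ in range(100_000):
    counts[sample_source()] += 1
print({name: round(c / 100_000, 3) for name, c in counts.items()})
```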

A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.

一个主要的方法论问题是:在大量互联网数据上预训练的语言模型(尤其是有能力记忆海量内容的大型模型),可能会因为在预训练期间无意中见过下游任务的测试集或开发集而造成污染。为了减少这种污染,我们搜索并尝试移除与本文研究的所有基准测试的开发集和测试集重叠的部分。不幸的是,过滤过程中的一个程序错误导致我们忽略了一些重叠,而由于训练成本过高,重新训练模型并不可行。在第 4 节中,我们分析了剩余重叠的影响,并将在未来的工作中更积极地移除数据污染。

2.3 Training Process

2.3 训练过程

As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPU’s on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.

如 [KMH+20, MKAT18] 中所发现的那样,更大的模型通常可以使用更大的批量大小,但需要更小的学习率。我们在训练过程中测量梯度噪声尺度,并使用它来指导我们选择批量大小 [MKAT18]。表 2.1 显示了我们使用的参数设置。为了在不耗尽内存的情况下训练更大的模型,我们在每个矩阵乘法内部和网络层之间使用了模型并行化的混合方法。所有模型都在 Microsoft 提供的高带宽集群的一部分 V100 GPU 上进行训练。训练过程和超参数设置的详细信息在附录 B 中描述。

2.4 Evaluation

2.4 评估

For few-shot learning, we evaluate each example in the evaluation set by randomly drawing $K$ examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.

对于少样本学习,我们通过从每个任务的训练集中随机抽取 $K$ 个样本作为条件来评估评估集中的每个样本,具体分隔符根据任务的不同使用 1 个或 2 个换行符。对于 LAMBADA 和 Storycloze,由于没有监督训练集可用,我们从开发集中抽取条件样本并在测试集上进行评估。对于 Winograd(原始版本,而非 SuperGLUE 版本),只有一个数据集,因此我们直接从中抽取条件样本。

$K$ can be any value from 0 to the maximum amount allowed by the model’s context window, which is $n_{\mathrm{ctx}}=2048$ for all models and typically fits 10 to 100 examples. Larger values of $K$ are usually but not always better, so when a separate development and test set are available, we experiment with a few values of $K$ on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for $K=0$ , instead of) demonstrations.

$K$ 可以是 0 到模型上下文窗口允许的最大值之间的任何值,对于所有模型来说,$n_{\mathrm{ctx}}=2048$,通常可以容纳 10 到 100 个示例。较大的 $K$ 值通常但不总是更好,因此当有独立的开发和测试集时,我们在开发集上尝试几个 $K$ 值,然后在测试集上运行最佳值。对于某些任务(见附录 G),除了演示(或对于 $K=0$,代替演示)之外,我们还使用自然语言提示。
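
下面用一个简短的 Python 草图示意上述做法:从任务的训练集中随机抽取 $K$ 个示例作为条件,按任务不同用 1 或 2 个换行符分隔,最后接上待评估样本的上下文。示例数据为虚构,仅用于展示提示的拼接方式。

```python
import random

def make_few_shot_prompt(train_set, query_context, k, delimiter="\n\n"):
    """从训练集中随机抽取 k 个 (上下文, 答案) 示例作为条件,再接上待评估样本的上下文。

    delimiter 取 "\n" 或 "\n\n",对应文中按任务不同使用 1 或 2 个换行符分隔。
    """
    demos = random.sample(train_set, k)
    parts = [context + answer for context, answer in demos]
    parts.append(query_context)  # 待评估样本只给上下文,由模型补全
    return delimiter.join(parts)


# 示意用法(虚构的问答示例)
train = [
    ("Q: 2 + 2 = ?\nA: ", "4"),
    ("Q: 3 + 5 = ?\nA: ", "8"),
    ("Q: 7 - 4 = ?\nA: ", "3"),
]
print(make_few_shot_prompt(train, "Q: 6 + 1 = ?\nA: ", k=2))
```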

On tasks that involve choosing one correct completion from several options (multiple choice), we provide $K$ examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, by computing $\frac{P(\text{completion}|\text{context})}{P(\text{completion}|\text{answer context})}$ , where answer context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.

在涉及从多个选项中选择一个正确完成的任务(多项选择)中,我们提供 $K$ 个上下文加正确完成的示例,然后提供一个仅上下文的示例,并比较语言模型 (LM) 对每个完成的似然值。对于大多数任务,我们比较每个 Token 的似然值(以进行长度归一化),然而在少数数据集(ARC、OpenBookQA 和 RACE)上,我们通过用每个完成的无条件概率进行归一化,在开发集上获得了额外的收益,即计算 $\frac{P(\text{completion}|\text{context})}{P(\text{completion}|\text{answer context})}$,其中 answer context 是字符串 "Answer: " 或 "A: ",用于提示完成应该是一个答案,但在其他方面是通用的。
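
下面的 Python 草图示意这两种打分方式。其中 `token_logprobs(prompt, completion)` 是一个假设的接口,表示在给定 prompt 的条件下返回 completion 各 Token 的对数概率列表,具体实现取决于所用的模型;代码只是对上述归一化思路的示意,并非论文的原始评测脚本。

```python
def score_choice(context, choice, token_logprobs, method="per_token",
                 answer_context="Answer: "):
    """计算多项选择任务中单个候选补全的得分。

    method = "per_token":对数似然按 Token 数归一化(即按长度归一,多数任务使用);
    method = "unconditional":计算 log [ P(completion|context) / P(completion|answer_context) ],
    即用仅以 "Answer: " 为条件的"无条件"概率做归一化(用于 ARC、OpenBookQA、RACE)。
    token_logprobs(prompt, completion) 为假设的接口,返回逐 Token 对数概率列表。
    """
    lp = token_logprobs(context, choice)
    if method == "per_token":
        return sum(lp) / len(lp)
    uncond = token_logprobs(answer_context, choice)
    return sum(lp) - sum(uncond)


def pick_answer(context, choices, token_logprobs, **kwargs):
    """返回得分最高的候选补全。"""
    return max(choices, key=lambda c: score_choice(context, c, token_logprobs, **kwargs))
```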

On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what is done by $[\mathrm{RSR}^{+}19]$ (see Appendix G for details).

在涉及二元分类的任务中,我们为选项赋予更具语义意义的名称(例如“True”或“False”而不是0或1),然后将任务视为多项选择;有时我们也会将任务框架化,类似于 $[\mathrm{RSR}^{+}19]$ 的做法(详见附录G)。

On tasks with free-form completion, we use beam search with the same parameters as $[\mathrm{RSR}^{+}19]$ : a beam width of 4 and a length penalty of $\alpha=0.6$ . We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.

在自由形式完成的任务中,我们使用与 $[\mathrm{RSR}^{+}19]$ 相同的参数进行束搜索 (beam search) :束宽度为 4,长度惩罚为 $\alpha=0.6$ 。我们根据手头数据集的标准,使用 F1 相似度得分、BLEU 或精确匹配来评分模型。
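
作为参考,下面给出词级 F1 与精确匹配的一个简化 Python 草图。实际评测脚本通常还会做小写化、去标点、去冠词等归一化处理,此处从略,因此数值与官方脚本不一定完全一致。

```python
def f1_score(prediction, ground_truth):
    """词级 F1:按空白切分后统计预测与参考答案的词重叠(简化版,未做文本归一化)。"""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, ground_truth):
    """精确匹配:去除首尾空白后完全相同记 1 分,否则 0 分(简化版)。"""
    return float(prediction.strip() == ground_truth.strip())


print(f1_score("the quick brown fox", "the brown fox"))  # 约 0.857
```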

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PiQa) where we were able to make submission work, and we submit only the 200B few-shot results, and report development set results for everything else.

最终结果在测试集上报告(当测试集公开可用时),针对每个模型大小和学习设置(零样本、单样本和少样本)。当测试集为私有时,我们的模型通常太大,无法适应测试服务器,因此我们在开发集上报告结果。我们确实在少数数据集(SuperGLUE、TriviaQA、PiQa)上提交到测试服务器,这些数据集我们能够成功提交,并且我们仅提交200B少样本结果,其他所有结果均在开发集上报告。

3 Results

3 结果

In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6 additional extra-small models with as few as 100,000 parameters. As observed in $[\mathrm{KMH}^{+}20]$ , language modeling performance follows a power-law when making efficient use of training compute. After extending this trend by two more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a broad spectrum of natural language tasks.

在图 3.1 中,我们展示了第 2 节中描述的 8 个模型的训练曲线。在此图中,我们还包含了 6 个额外的超小型模型,这些模型的参数量仅为 100,000。正如 $[\mathrm{KMH}^{+}20]$ 中所观察到的,当有效利用训练计算资源时,语言建模性能遵循幂律。在将这一趋势扩展了两个数量级后,我们观察到仅有轻微(如果有的话)偏离幂律的情况。有人可能会担心,这些交叉熵损失的改进仅来自于对训练语料库中虚假细节的建模。然而,我们将在接下来的章节中看到,交叉熵损失的改进在广泛的自然语言任务中带来了一致的性能提升。

Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.

下面,我们在广泛的数据集上评估第2节中描述的8个模型(1750亿参数的GPT-3和7个较小的模型)。我们将这些数据集分为9个类别,代表大致相似的任务。

In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling, such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question answering tasks: tasks which require using the information stored in the model’s parameters to answer general knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in 3.8 we briefly explore NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities – these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the few-shot, one-shot, and zero-shot settings.

在3.1节中,我们评估了传统的语言建模任务以及类似于语言建模的任务,例如完形填空任务和句子/段落补全任务。在3.2节中,我们评估了“闭卷”问答任务:这些任务需要使用模型参数中存储的信息来回答一般知识问题。在3.3节中,我们评估了模型在语言之间翻译的能力(尤其是单样本和少样本)。在3.4节中,我们评估了模型在类似Winograd Schema任务上的表现。在3.5节中,我们评估了涉及常识推理或问答的数据集。在3.6节中,我们评估了阅读理解任务,在3.7节中,我们评估了SuperGLUE基准套件,并在3.8节中简要探讨了自然语言推理(NLI)。最后,在3.9节中,我们设计了一些额外的任务,专门用于探究上下文学习能力——这些任务侧重于即时推理、适应能力或开放式文本合成。我们在少样本、单样本和零样本设置下评估了所有任务。


Figure 3.1: Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior observed in $[\mathrm{KMH}^{+}20]$ continues for an additional two orders of magnitude with only small deviations from the predicted curve. For this figure, we exclude embedding parameters from compute and parameter counts.

Table 3.1: Zero-shot results on PTB language modeling dataset. Many other common language modeling datasets are omitted because they are derived from Wikipedia or other sources which are included in GPT-3’s training data. $^{a}[\mathrm{RWC}^{+}19]$

图 3.1: 计算量与性能的平滑扩展。性能(以交叉熵验证损失衡量)与用于训练的计算量呈幂律趋势。在 $[\mathrm{KMH}^{+}20]$ 中观察到的幂律行为在额外两个数量级上继续存在,仅与预测曲线有微小偏差。在本图中,我们从计算量和参数计数中排除了嵌入参数。

表 3.1: PTB 语言建模数据集上的零样本结果。许多其他常见的语言建模数据集被省略,因为它们源自维基百科或其他包含在 GPT-3 训练数据中的来源。$^{a}[\mathrm{RWC}^{+}19]$

| 设置 | PTB |
| --- | --- |
| SOTA (零样本) | 35.8$^{a}$ |
| GPT-3 零样本 | 20.5 |

3.1 Language Modeling, Cloze, and Completion Tasks

3.1 语言建模、完形填空和补全任务

In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible completions of a piece of text.

在本节中,我们测试了 GPT-3 在传统语言建模任务中的表现,以及涉及预测单个关键词、完成句子或段落、或在文本的多个可能完成选项之间做出选择的相关任务。

3.1.1 Language Modeling

3.1.1 语言建模

We calculate zero-shot perplexity on the Penn Tree Bank (PTB) $[\mathrm{MKM^{+}94}]$ dataset measured in $[\mathrm{RWC}^{+}19]$ . We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot.

我们在 Penn Tree Bank (PTB) 数据集上计算了零样本困惑度 (zero-shot perplexity),该数据集在 $[\mathrm{RWC}^{+}19]$ 中进行了测量。我们省略了该工作中的 4 个与 Wikipedia 相关的任务,因为它们完全包含在我们的训练数据中,并且由于数据集中有大量内容包含在我们的训练集中,我们也省略了十亿词基准测试 (one-billion word benchmark)。PTB 由于早于现代互联网,避免了这些问题。我们最大的模型在 PTB 上以 15 分的显著优势创下了新的 SOTA (State of the Art),达到了 20.50 的困惑度。需要注意的是,由于 PTB 是一个传统的语言建模数据集,它没有明确的示例分离来定义单样本或少样本评估,因此我们仅测量零样本。
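
作为对“困惑度”这一指标的简要说明,下面的 Python 草图给出了由逐 Token 对数概率计算困惑度的标准方式(假设使用自然对数);PTB 评测中具体的分词与文本拼接细节此处从略,代码仅用于说明指标本身。

```python
import math

def perplexity(token_logprobs):
    """困惑度 = exp(平均负对数似然),token_logprobs 为各 Token 的自然对数概率。"""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# 示意:若模型对每个 Token 平均给出 ln p ≈ -3.02,则困惑度约为 20.5
print(perplexity([-3.02] * 100))  # ≈ 20.5
```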

3.1.2 LAMBADA

3.1.2 LAMBADA

The LAMBADA dataset $[\mathrm{PKL}^{+}16]$ tests the modeling of long-range dependencies in text – the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. $[\mathrm{BHT^{+}}20]$ reflect on the small $1.5\%$ improvement achieved by a doubling of model size between two recent state of the art results ($[\mathrm{SPP^{+}19}]$

LAMBADA 数据集 [PKL+16] 测试了文本中长距离依赖关系的建模——模型需要在阅读一段上下文后预测句子的最后一个词。最近有观点认为,语言模型的持续扩展在这一困难基准上带来的收益正在递减。[BHT+20] 回顾了最近两个最先进结果([SPP+19]

| 设置 | LAMBADA (准确率) | LAMBADA (困惑度) | StoryCloze (准确率) | HellaSwag (准确率) |
| --- | --- | --- | --- | --- |
| SOTA | 68.0$^{a}$ | 8.63$^{b}$ | 91.8$^{c}$ | 85.6$^{d}$ |
| GPT-3 零样本 | 76.2 | 3.00 | 83.2 | 78.9 |
| GPT-3 单样本 | 72.5 | 3.35 | 84.7 | 78.1 |
| GPT-3 少样本 | 86.4 | 1.92 | 87.7 | 79.3 |


Table 3.2: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while achieving respectable performance on two difficult completion prediction datasets. $^{a}$[Tur20] $^{b}$[RWC+19] $^{c}$[LDL19] $^{d}$[LCH+20]

Figure 3.2: On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3 2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of the art by $18\%$. Note zero-shot uses a different format from one-shot and few-shot as described in the text.

表 3.2: 填空和补全任务的表现。GPT-3 在 LAMBADA 上显著提升了 SOTA(State of the Art),同时在两个困难的补全预测数据集上取得了不错的表现。$^{a}$[Tur20] $^{b}$[RWC+19] $^{c}$[LDL19] $^{d}$[LCH+20]
图 3.2: 在 LAMBADA 上,语言模型的少样本能力显著提升了准确率。GPT-3 2.7B 在该设置下超越了拥有 170 亿参数的 SOTA 模型 Turing-NLG [Tur20],而 GPT-3 175B 将最先进水平提升了 18%。需要注意的是,零样本使用的格式与单样本和少样本不同,如文中所述。

and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves $76\%$ on LAMBADA, a gain of $8\%$ over the previous state of the art.

与 [Tur20])之间模型规模翻倍却仅带来 1.5% 的小幅提升,并认为“继续以数量级扩展硬件和数据规模并不是前进的方向”。我们发现这条路径仍然充满希望:在零样本设置下,GPT-3 在 LAMBADA 上达到了 76% 的准确率,比之前的最先进水平提高了 8%。

LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word filters $[\mathrm{RWC}^{+}19]$ (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We use the following fill-in-the-blank format:

LAMBADA 也是少样本学习灵活性的一个展示,因为它提供了一种解决该数据集经典问题的方法。尽管 LAMBADA 中的补全总是句子的最后一个词,但标准的语言模型无法知道这一细节。因此,它不仅为正确的结尾分配概率,还为段落的其他有效延续分配概率。过去,这个问题已经通过停用词过滤器 $[\mathrm{RWC}^{+}19]$ (禁止“延续”词)部分解决。而少样本设置则允许我们将任务“框架”为完形填空测试,并让语言模型从示例中推断出需要精确补全一个词。我们使用以下填空格式:
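
原文在此处给出了填空格式的示例;按照论文中的例子,其大致形式如下(仅作示意):

```
Alice was friends with Bob. Alice went to visit her friend ___. → Bob

George bought some baseball equipment, a ball, a glove, and a ___. →
```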

When presented with examples formatted this way, GPT-3 achieves $86.4\%$ accuracy in the few-shot setting, an increase of over $18\%$ from the previous state-of-the-art. We observe that few-shot performance improves strongly with model size. While this setting decreases the performance of the smallest model by almost $20\%$ , for GPT-3 it improves accuracy by $10\%$ . Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot setting. Perhaps this is because all models still require several examples to recognize the pattern.

当以这种方式呈现示例时,GPT-3 在少样本设置中达到了 86.4% 的准确率,比之前的最先进水平提高了超过 18%。我们观察到,少样本性能随着模型规模的增加而显著提升。虽然这种设置使最小模型的性能下降了近 20%,但对于 GPT-3 来说,准确率提高了 10%。最后,填空方法在单样本设置中并不有效,其表现始终不如零样本设置。这可能是因为所有模型仍然需要多个示例才能识别出模式。

Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the wiki split test server.

表 3.3: 三个开放域问答任务的结果。GPT-3 在少样本、单样本和零样本设置下的表现,与之前闭卷和开放域设置的 SOTA 结果进行了比较。TriviaQA 的少样本结果是在 wiki 分割测试服务器上评估的。

| 设置 | NaturalQs | WebQS | TriviaQA |
| --- | --- | --- | --- |
| RAG (微调, 开放域) [LPP+20] | 44.5 | 45.5 | 68.0 |
| T5-11B+SSM (微调, 闭卷) [RRS20] | 36.6 | 44.7 | 60.5 |
| T5-11B (微调, 闭卷) | 34.5 | 37.4 | 50.1 |
| GPT-3 零样本 | 14.6 | 14.4 | 64.3 |
| GPT-3 单样本 | 23.0 | 25.3 | 68.0 |
| GPT-3 少样本 | 29.9 | 41.5 | 71.2 |

One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance.

需要注意的是,对测试集污染的分析发现,LAMBADA 数据集中有相当一部分似乎出现在我们的训练数据中——然而,第 4 节中的分析表明,这对性能的影响可以忽略不计。

3.1.3 HellaSwag

3.1.3 HellaSwag

The HellaSwag dataset $[Z\mathrm{HB}^{+}19]$ involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve $95.6\%$ accuracy). GPT-3 achieves $78.1\%$ accuracy in the one-shot setting and $79.3\%$ accuracy in the few-shot setting, outperforming the $75.4\%$ accuracy of a fine-tuned 1.5B parameter language model $[Z\mathrm{HR}^{+}19]$ but still a fair amount lower than the overall SOTA of $85.6\%$ achieved by the fine-tuned multi-task model ALUM.

HellaSwag 数据集 [ZHB+19] 涉及为故事或一组指令选择最佳结尾。这些示例经过对抗性挖掘,使其对语言模型来说具有挑战性,但对人类来说仍然容易(人类准确率达到 95.6%)。GPT-3 在单样本设置中达到了 78.1% 的准确率,在少样本设置中达到了 79.3% 的准确率,优于微调的 1.5B 参数语言模型 [ZHR+19] 的 75.4% 准确率,但仍远低于微调多任务模型 ALUM 实现的 85.6% 的总体 SOTA。

3.1.4 StoryCloze

3.1.4 StoryCloze

We next evaluate GPT-3 on the StoryCloze 2016 dataset $[\mathrm{MCH^{+}}16]$ , which involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 achieves $83.2\%$ in the zero-shot setting and $87.7\%$ in the few-shot setting (with $K=70$ ). This is still $4.1\%$ lower than the fine-tuned SOTA using a BERT based model [LDL19] but improves over previous zero-shot results by roughly $10\%$ .

我们接下来在 StoryCloze 2016 数据集 $[\mathrm{MCH^{+}}16]$ 上评估 GPT-3,该数据集涉及为五句话长的故事选择正确的结尾句。在这里,GPT-3 在零样本设置下达到了 83.2% 的准确率,在少样本设置下($K=70$)达到了 87.7% 的准确率。这仍然比使用基于 BERT 的模型 [LDL19] 进行微调的 SOTA 低 4.1%,但比之前的零样本结果提高了大约 10%。

3.2 Closed Book Question Answering

3.2 闭卷问答

In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions $[\mathrm{KPR}^{+}19]$ , Web Questions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.

在本节中,我们评估了 GPT-3 回答广泛事实性知识问题的能力。由于可能的查询数量巨大,这一任务通常通过使用信息检索系统来查找相关文本,并结合一个模型来生成给定问题和检索到的文本的答案。由于这种设置允许系统搜索并基于可能包含答案的文本进行条件生成,因此被称为“开卷”。[RRS20] 最近证明了大语言模型在不依赖辅助信息的情况下直接回答问题也可以表现得非常出色。他们将这种更为严格的评估设置称为“闭卷”。他们的研究表明,容量更高的模型可能会表现得更好,我们使用 GPT-3 来测试这一假设。我们在 [RRS20] 中的 3 个数据集上评估 GPT-3:Natural Questions $[\mathrm{KPR}^{+}19]$、Web Questions [BCFL13] 和 TriviaQA [JCWZ17],使用相同的划分。需要注意的是,除了所有结果都在闭卷设置下,我们使用的少样本、单样本和零样本评估代表了比以往闭卷问答工作更为严格的设置:除了不允许使用外部内容外,也不允许在问答数据集上进行微调。

The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve $64.3\%$ in the zero-shot setting, $68.0\%$ in the one-shot setting, and $71.2\%$ in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by $14.2\%$ , and also outperforms a version with Q&A tailored span prediction during pre-training by $3.8\%$ . The one-shot result improves by $3.7\%$ and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents $[\mathrm{LPP}^{+}20]$ . GPT-3’s few-shot result further improves performance another $3.2\%$ beyond this.

GPT-3 的结果如表 3.3 所示。在 TriviaQA 上,我们在零样本设置下达到了 64.3%,在单样本设置下达到了 68.0%,在少样本设置下达到了 71.2%。零样本结果已经比微调的 T5-11B 高出 14.2%,也比在预训练期间使用问答定制的跨度预测版本高出 3.8%。单样本结果提高了 3.7%,并与开放域问答系统的 SOTA 结果相当,该系统不仅进行了微调,还使用了在 21M 文档的 15.3B 参数密集向量索引上学习到的检索机制 [LPP+20]。GPT-3 的少样本结果在此基础上进一步提高了 3.2%。

On Web Questions (WebQs), GPT-3 achieves $14.4\%$ in the zero-shot setting, $25.3\%$ in the one-shot setting, and $41.5\%$ in the few-shot setting. This compares to $37.4\%$ for fine-tuned T5-11B, and $44.7\%$ for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQS shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.

在Web Questions (WebQs) 数据集上,GPT-3 在零样本 (zero-shot) 设置下达到了 14.4%,在单样本 (one-shot) 设置下达到了 25.3%,在少样本 (few-shot) 设置下达到了 41.5%。相比之下,经过微调的 T5-11B 达到了 37.4%,而使用了问答特定预训练过程的 T5-11B+SSM 达到了 44.7%。在少样本设置下,GPT-3 的表现接近了最先进的微调模型。值得注意的是,与 TriviaQA 相比,WebQs 从零样本到少样本的提升要大得多(实际上其零样本和单样本表现较差),这可能表明 WebQs 的问题和/或其答案的风格超出了 GPT-3 的分布范围。尽管如此,GPT-3 似乎能够适应这种分布,在少样本设置下恢复了强劲的表现。


Figure 3.3: On TriviaQA GPT-3’s performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG $[\mathrm{LPP}^{+}20]$ .

图 3.3: 在 TriviaQA 上,GPT-3 的性能随着模型规模的增加而平稳增长,这表明随着容量的增加,语言模型继续吸收知识。单样本和少样本性能相比零样本行为有显著提升,匹配并超过了 SOTA 微调开放域模型 RAG 的性能 $[\mathrm{LPP}^{+}20]$。

On Natural Questions (NQs) GPT-3 achieves $14.6%$ in the zero-shot setting, $23.0%$ in the one-shot setting, and $29.9%$ in the few-shot setting, compared to $36.6%$ for fine-tuned T5-11B+SSM. Similar to WebQS, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQS. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia specifically which could be testing the limits of GPT-3’s capacity and broad pre-training distribution.

在自然问题 (NQs) 上,GPT-3 在零样本设置中达到了 $14.6%$,在单样本设置中达到了 $23.0%$,在少样本设置中达到了 $29.9%$,而经过微调的 T5-11B+SSM 则达到了 $36.6%$。与 WebQS 类似,从零样本到少样本的大幅提升可能表明存在分布偏移,这也可能解释了与 TriviaQA 和 WebQS 相比表现不那么具有竞争力的原因。特别是,NQs 中的问题往往涉及非常细粒度的维基百科知识,这可能是在测试 GPT-3 的能力和广泛的预训练分布的极限。

Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model.

总体而言,在三个数据集中的一个上,GPT-3 的单样本表现与开放领域微调的 SOTA (State of the Art) 相当。在另外两个数据集上,尽管没有使用微调,GPT-3 的表现也接近闭卷 SOTA。在所有三个数据集上,我们发现模型的性能随着模型规模的增加而非常平滑地提升(图 3.3 和附录 H 图 H.7),这可能反映了模型容量直接转化为模型参数中吸收的更多“知识”的观点。

3.3 Translation

3.3 翻译

For GPT-2 a filter was used on a multilingual collection of documents to produce an English only dataset due to capacity concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially when translating between French and English despite only training on 10 megabytes of remaining French text. Since we increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training dataset to include more representation of other languages, though this remains an area for further improvement. As discussed in 2.2 the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although GPT-3’s training data is still primarily English ($93%$ by word count), it also includes $7%$ of text in other languages. These languages are documented in the supplemental material. In order to better understand translation capability, we also expand our analysis to include two additional commonly studied languages, German and Romanian.

由于容量限制,GPT-2 在多语言文档集合上使用了过滤器,以生成仅包含英语的数据集。尽管进行了这种过滤,GPT-2 仍然显示出一定的多语言能力,并且在仅训练了 10 兆字节的法语文本的情况下,在法语和英语之间的翻译任务中表现不俗。由于我们从 GPT-2 到 GPT-3 将容量增加了两个数量级以上,我们也扩大了训练数据集的范围,以包含更多其他语言的代表性数据,尽管这仍然是一个需要进一步改进的领域。正如 2.2 节所讨论的,我们的数据主要来自原始 Common Crawl,仅进行了基于质量的过滤。尽管 GPT-3 的训练数据仍然以英语为主(按词数计算占 93%),但它也包含了 7% 的其他语言文本。这些语言在补充材料中有详细记录。为了更好地理解翻译能力,我们还扩展了分析范围,包括两种常用的研究语言:德语和罗马尼亚语。

Existing unsupervised machine translation approaches often combine pre training on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in particular. However, our one / few-shot settings aren’t strictly comparable to prior unsupervised work since they make use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.

现有的无监督机器翻译方法通常将单语数据集对的预训练与回译 [SHB15] 结合起来,以可控的方式桥接两种语言。相比之下,GPT-3 从混合了多种语言的训练数据中学习,这些数据在单词、句子和文档级别上自然结合。GPT-3 还使用单一的训练目标,该目标并未针对任何特定任务进行定制或设计。然而,我们的单样本/少样本设置与之前的无监督工作并不严格可比,因为它们使用了少量的配对示例(1 或 64 个)。这相当于最多一两页的上下文训练数据。

Results are shown in Table 3.4. Zero-shot GPT-3, which only receives a natural language description of the task, still underperforms recent unsupervised NMT results. However, providing only a single example demonstration for

结果如表 3.4 所示。零样本 GPT-3 仅接收任务的自然语言描述,其表现仍不及最近的无监督神经机器翻译 (NMT) 结果。然而,仅提供一个示例演示...

| 设置 | En→Fr | Fr→En | En→De | De→En | En→Ro | Ro→En |
| --- | --- | --- | --- | --- | --- | --- |
| SOTA (监督学习) | 45.6a | 35.0b | 41.2c | 40.2d | 38.5e | 39.9e |
| XLM [LC19] | 33.4 | 33.3 | 26.4 | 34.3 | 33.3 | 31.8 |
| MASS [STQ+19] | 37.5 | 34.9 | 28.3 | 35.2 | 35.2 | 33.1 |
| mBART [LGG+20] | - | - | 29.8 | 34.0 | 35.0 | 30.5 |
| GPT-3 零样本 | 25.2 | 21.2 | 24.6 | 27.2 | 14.1 | 19.9 |
| GPT-3 单样本 | 28.3 | 33.7 | 26.2 | 30.4 | 20.6 | 38.6 |
| GPT-3 少样本 | 32.6 | 39.2 | 29.7 | 40.6 | 21.0 | 39.5 |

Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating into English reflecting its strength as an English LM. We report BLEU scores on the WMT’14 $\mathrm{Fr}{\leftrightarrow}\mathrm{En}$ , WMT’16 $\scriptstyle\mathrm{De}\leftrightarrow\mathrm{En}$ , and WMT’16 $\scriptstyle\mathbf{Ro}\leftrightarrow\mathbf{En}$ datasets as measured by multi-bleu.perl with XLM’s tokenization in order to compare most closely with prior unsupervised NMT work. SacreBLEUf [Pos18] results reported in Appendix H. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA with relative confidence. a[EOAG18] b[DHKH14] $^c[\mathrm{WXH^{+}18}]$ d[oR16] $^{e}[\mathrm{LGG}^{+}20]$ f [SacreBLEU signature: BLEU $^+$ case.mixed+numrefs. $^{1+}$ smooth.exp+tok.intl+version.1.2.20]

表 3.4: 少样本 GPT-3 在翻译成英语时比之前的无监督神经机器翻译 (NMT) 工作高出 5 个 BLEU 分数,反映了其作为英语语言模型 (LM) 的优势。我们报告了 WMT'14 $\mathrm{Fr}{\leftrightarrow}\mathrm{En}$、WMT'16 $\scriptstyle\mathrm{De}\leftrightarrow\mathrm{En}$ 和 WMT'16 $\scriptstyle\mathbf{Ro}\leftrightarrow\mathbf{En}$ 数据集上的 BLEU 分数,这些分数是通过 multi-bleu.perl 使用 XLM 的 Token 化方法测量的,以便与之前的无监督 NMT 工作进行最接近的比较。SacreBLEUf [Pos18] 的结果在附录 H 中报告。下划线表示无监督或少样本的 SOTA (State-of-the-Art),粗体表示具有相对置信度的有监督 SOTA。a[EOAG18] b[DHKH14] $^c[\mathrm{WXH^{+}18}]$ d[oR16] $^{e}[\mathrm{LGG}^{+}20]$ f [SacreBLEU 签名: BLEU $^+$ case.mixed+numrefs. $^{1+}$ smooth.exp+tok.intl+version.1.2.20]
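表中提到的 multi-bleu.perl 与 SacreBLEU 都是常用的 BLEU 计算工具。下面给出一个使用 sacrebleu Python 包计算语料级 BLEU 的示意片段;参数取值(如 `tokenize="intl"`、`smooth_method="exp"`)只是按照上表脚注中的 SacreBLEU 签名做的近似对应,具体以所用版本的文档为准,示例中的译文和参考句也是虚构的。

```python
# 使用 sacrebleu 计算语料级 BLEU 的最小示意(需先 pip install sacrebleu)。
import sacrebleu

# 系统译文(每个元素一句)与参考译文(列表的列表,支持多参考,需与译文逐句对齐)。
hypotheses = ["The cat sat on the mat .", "He went to the market yesterday ."]
references = [["The cat is sitting on the mat .", "Yesterday he went to the market ."]]

bleu = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    smooth_method="exp",   # 大致对应签名中的 smooth.exp
    tokenize="intl",       # 大致对应签名中的 tok.intl
    lowercase=False,
)
print(f"BLEU = {bleu.score:.2f}")
```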


Figure 3.4: Few-shot translation performance on 6 language pairs as model capacity increases. There is a consistent trend of improvement across all datasets as the model scales, as well as a tendency for translation into English to be stronger than translation from English.


图 3.4: 随着模型容量的增加,6种语言对的少样本翻译性能。随着模型的扩展,所有数据集的性能都呈现出持续提升的趋势,并且翻译成英语的性能往往强于从英语翻译出来的性能。

| 设置 | Winograd | Winogrande (XL) |
| --- | --- | --- |
| 微调 SOTA | 90.1a | 84.6b |
| GPT-3 零样本 | 88.3* | 70.2 |
| GPT-3 单样本 | 89.7* | 73.2 |
| GPT-3 少样本 | 88.6* | 77.7 |


Table 3.5: Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section 4 for details on potential contamination of the Winograd test set. a[SBBC19] $^b[\mathrm{LYN}^{+}20]$
Figure 3.5: Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales. Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B is competitive with a fine-tuned RoBERTa-large.

表 3.5: WSC273 版本的 Winograd 模式和对抗性 Winogrande 数据集的结果。有关 Winograd 测试集潜在污染的详细信息,请参见第 4 节。a[SBBC19] $^b[\mathrm{LYN}^{+}20]$
图 3.5: 随着模型容量的增加,对抗性 Winogrande 数据集上的零样本、单样本和少样本性能。随着模型规模的增加,少样本学习的收益相对平稳,少样本 GPT-3 175B 与微调的 RoBERTA-large 具有竞争力。

each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting further improves another 4 BLEU resulting in similar average performance to prior unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few-shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and the appearance that these are uncompetitive benchmarks we do not suspect those results represent true state of the art. For Ro-En, few-shot GPT-3 performs within 0.5 BLEU of the overall SOTA which is achieved by a combination of unsupervised pre-training, supervised finetuning on 608K labeled examples, and back-translation [LHCG19b].

每次翻译任务都将性能提高了超过 7 BLEU,并且接近了先前工作的竞争性能。在完整的少样本设置中,GPT-3 进一步提高了 4 BLEU,使得平均性能与先前的无监督神经机器翻译 (NMT) 工作相当。GPT-3 的性能在不同语言方向上存在显著偏差。对于研究的三种输入语言,GPT-3 在翻译成英语时显著优于先前的无监督 NMT 工作,但在翻译成其他语言时表现不佳。在 En-Ro 上的性能是一个明显的异常值,比先前的无监督 NMT 工作差了超过 10 BLEU。这可能是由于重用了 GPT-2 的字节级 BPE Tokenizer 的弱点,该 Tokenizer 是为几乎完全由英语组成的训练数据集开发的。对于 Fr-En 和 De-En,少样本 GPT-3 优于我们找到的最佳监督结果,但由于我们对文献的不熟悉以及这些基准似乎没有竞争力,我们不认为这些结果代表了真正的技术水平。对于 Ro-En,少样本 GPT-3 的表现与整体 SOTA 相差不到 0.5 BLEU,而 SOTA 是通过无监督预训练、在 608K 标注样本上进行监督微调以及反向翻译 [LHCG19b] 的组合实现的。

Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three settings is shown in Appendix H.

最后,在所有语言对和所有三种设置(零样本、单样本和少样本)中,翻译性能都随着模型容量的增加而平稳提升。图 3.4 展示了少样本结果的情况,所有三种设置的扩展情况见附录 H。

3.4 Winograd-Style Tasks

3.4 Winograd 风格任务

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot setting.

Winograd Schemas Challenge [LDM12] 是 NLP 中的一个经典任务,要求确定代词指代的是哪个词,这类代词在语法上存在歧义,但对人类来说在语义上没有歧义。最近经过微调的语言模型在原始的 Winograd 数据集上已经达到了接近人类的表现,但在更具挑战性的版本(如对抗性挖掘的 Winogrande 数据集 [SBBC19])上,仍然显著落后于人类的表现。我们照例在零样本、单样本和少样本设置下测试了 GPT-3 在 Winograd 和 Winogrande 上的表现。

| 设置 | PIQA | ARC (简单) | ARC (挑战) | OpenBookQA |
| --- | --- | --- | --- | --- |
| 微调 SOTA | 79.4 | 92.0 [KKS+20] | 78.5 [KKS+20] | 87.2 [KKS+20] |
| GPT-3 零样本 | 81.0* | 68.8 | 51.4 | 57.6 |
| GPT-3 单样本 | 80.5* | 71.2 | 53.2 | 58.8 |
| GPT-3 少样本 | 82.8* | 70.1 | 51.5 | 65.4 |


Table 3.6: GPT-3 results on three commonsense reasoning tasks, PIQA, ARC, and OpenBookQA. GPT-3 Few-Shot PIQA result is evaluated on the test server. See Section 4 for details on potential contamination issues on the PIQA test set.
Figure 3.6: GPT-3 results on PIQA in the zero-shot, one-shot, and few-shot settings. The largest model achieves a score on the development set in all three conditions that exceeds the best recorded score on the task.

表 3.6: GPT-3 在三个常识推理任务(PIQA、ARC 和 OpenBookQA)上的结果。GPT-3 的少样本 PIQA 结果是在测试服务器上评估的。有关 PIQA 测试集上潜在污染问题的详细信息,请参见第 4 节。

图 3.6: GPT-3 在零样本、单样本和少样本设置下的 PIQA 结果。最大模型在所有三种条件下的开发集得分都超过了该任务的最佳记录得分。

On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in $[\mathrm{RWC}^{+}19]$ . Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd GPT-3 achieves $88.3%$ , $89.7%$ , and $88.6%$ in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data but this appears to have only a small effect on results (see Section 4).

在 Winograd 数据集上,我们使用与 $[\mathrm{RWC}^{+}19]$ 中描述的相同的“部分评估”方法,对 GPT-3 在原始的 273 个 Winograd 模式上进行了测试。需要注意的是,此设置与 SuperGLUE 基准中的 WSC 任务略有不同,后者以二元分类的形式呈现,并需要实体提取以转换为本节中描述的形式。在 Winograd 数据集上,GPT-3 在零样本、单样本和少样本设置中分别达到了 $88.3%$、$89.7%$ 和 $88.6%$ 的准确率,虽然没有明显的上下文学习效果,但在所有情况下都取得了接近最先进水平和估计人类表现的强劲结果。我们注意到,污染分析发现训练数据中存在一些 Winograd 模式,但这似乎对结果的影响很小(见第 4 节)。
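上文提到的“部分评估 (partial evaluation)”方法,大致思路是:把代词分别替换为各候选指代对象,然后比较模型在不同替换下对代词之后那段文本的对数似然,取似然更高者作为预测。下面是一个示意性的打分框架,其中 `log_prob(context, continuation)` 是一个假设的接口(返回在给定上下文下补全文本的对数似然),并非任何真实库的 API;示例中的玩具打分函数也仅用于演示流程。

```python
# 部分评估的示意:比较不同候选替换下"代词之后文本"的对数似然。
# log_prob 为假设的接口,需由具体的语言模型实现来提供。
from typing import Callable, List

def winograd_partial_eval(
    prefix: str,                 # 代词之前的文本
    candidates: List[str],       # 候选指代对象(用来替换代词)
    continuation: str,           # 代词之后的文本
    log_prob: Callable[[str, str], float],
) -> int:
    """返回使 continuation 对数似然最大的候选下标。"""
    scores = [log_prob(prefix + cand, continuation) for cand in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

if __name__ == "__main__":
    # 一个玩具打分函数,仅作演示:偏好与 continuation 共享更多词的上下文。
    def toy_log_prob(context: str, continuation: str) -> float:
        ctx, cont = set(context.lower().split()), set(continuation.lower().split())
        return float(len(ctx & cont))

    idx = winograd_partial_eval(
        prefix="The trophy doesn't fit into the brown suitcase because ",
        candidates=["the trophy", "the suitcase"],
        continuation="is too large.",
        log_prob=toy_log_prob,
    )
    print("predicted candidate index:", idx)
```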

On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves $70.2%$ in the zero-shot setting, $73.2%$ in the one-shot setting, and $77.7%$ in the few-shot setting. For comparison a fine-tuned RoBERTA model achieves $79%$ , state-of-the-art is $84.6%$ achieved with a fine-tuned high capacity model (T5), and human performance on the task as reported by [SBBC19] is $94.0%$ .

在更具挑战性的 Winogrande 数据集上,我们发现上下文学习确实带来了提升:GPT-3 在零样本设置下达到了 70.2%,在单样本设置下达到了 73.2%,在少样本设置下达到了 77.7%。作为对比,经过微调的 RoBERTa 模型达到了 79%,当前最佳成绩是由经过微调的高容量模型 (T5) 取得的 84.6%,而 [SBBC19] 报告的人类在该任务上的表现为 94.0%。

3.5 Common Sense Reasoning

3.5 常识推理

Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) $[\mathrm{BZB^{+}19}]$ , asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves $81.0%$ accuracy zero-shot, $80.5%$ accuracy one-shot, and $82.8%$ accuracy few-shot (the last measured on PIQA’s test server). This compares favorably to the $79.4%$ accuracy prior state-of-the-art of a fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over $10%$ worse than human performance, but GPT-3’s few-shot and even zero-shot result outperform the current state-of-the-art. Our analysis flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark the result with an asterisk. See Section 4 for details.

接下来我们考虑三个数据集,这些数据集试图捕捉物理或科学推理,与句子补全、阅读理解或广泛知识问答不同。第一个数据集是 PhysicalQA (PIQA) [BZB^{+}19],它提出了关于物理世界如何运作的常识性问题,旨在探测对世界的实际理解。GPT-3 在零样本情况下达到了 81.0% 的准确率,单样本情况下达到了 80.5% 的准确率,少样本情况下达到了 82.8% 的准确率(最后一个是在 PIQA 的测试服务器上测量的)。这比之前最先进的微调 RoBERTa 的 79.4% 准确率要好。PIQA 在模型规模上的扩展相对较浅,并且仍然比人类表现差 10% 以上,但 GPT-3 的少样本甚至零样本结果优于当前的最先进技术。我们的分析指出 PIQA 可能存在数据污染问题(尽管测试标签是隐藏的),因此我们保守地在结果上标记了星号。详见第 4 节。

Table 3.7: Results on reading comprehension tasks. All scores are F1 except results for RACE which report accuracy. $^{a}[{\mathrm{JZC}}^{+}19]$ b[JN20] c[AI19] d[QIA20] $^{e}[\mathrm{SPP^{+}19}]$

表 3.7: 阅读理解任务的结果。除 RACE 的结果为准确率外,其余均为 F1 分数。$^{a}[{\mathrm{JZC}}^{+}19]$ $^{b}[JN20]$ $^{c}[AI19]$ $^{d}[QIA20]$ $^{e}[\mathrm{SPP^{+}19}]$

| 设置 | CoQA | DROP | QuAC | SQuADv2 | RACE-h | RACE-m |
| --- | --- | --- | --- | --- | --- | --- |
| 微调 SOTA | 90.7$^{a}$ | 89.1$^{b}$ | 74.4$^{c}$ | 93.0$^{d}$ | 90.0$^{e}$ | 93.1$^{e}$ |
| GPT-3 零样本 | 81.5 | 23.6 | 41.5 | 59.5 | 45.5 | 58.4 |
| GPT-3 单样本 | 84.0 | 34.3 | 43.3 | 65.4 | 45.9 | 57.4 |
| GPT-3 少样本 | 85.0 | 36.5 | 44.3 | 69.8 | 46.8 | 58.1 |

ARC $[\mathrm{CCE^{+}18}]$ is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval methods are unable to correctly answer, GPT-3 achieves $51.4%$ accuracy in the zero-shot setting, $53.2%$ in the one-shot setting, and $51.5%$ in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline $(55.9%)$ from UnifiedQA $[\mathrm{KKS^{+}20}]$ . On the “Easy” version of the dataset (questions which either of the mentioned baseline approaches answered correctly), GPT-3 achieves $68.8%$ , $71.2%$ , and $70.1%$ which slightly exceeds a fine-tuned RoBERTa baseline from $[\mathrm{KKS}^{+}20]$ . However, both of these results are still much worse than the overall SOTAs achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by $27%$ on the challenge set and $22%$ on the easy set.

ARC $[\mathrm{CCE^{+}18}]$ 是一个从三年级到九年级科学考试中收集的多项选择题数据集。在“挑战”版本的数据集中,这些问题经过筛选,简单的统计或信息检索方法无法正确回答,GPT-3 在零样本设置下的准确率为 $51.4%$ ,在单样本设置下为 $53.2%$ ,在少样本设置下为 $51.5%$ 。这接近了 UnifiedQA $[\mathrm{KKS^{+}20}]$ 中微调的 RoBERTa 基线 $(55.9%)$ 的表现。在“简单”版本的数据集中(这些问题是上述基线方法中任何一个都能正确回答的),GPT-3 的准确率为 $68.8%$ 、 $71.2%$ 和 $70.1%$ ,略高于 $[\mathrm{KKS}^{+}20]$ 中的微调 RoBERTa 基线。然而,这两个结果仍然远低于 UnifiedQA 实现的整体 SOTA,其在挑战集上比 GPT-3 的少样本结果高出 $27%$ ,在简单集上高出 $22%$ 。

On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the leader board.

在 OpenBookQA [MCKS18] 上,GPT-3 从零样本到少样本设置有了显著提升,但仍比整体 SOTA 低 20 多分。GPT-3 的少样本表现与排行榜上经过微调的 BERT Large 基线相似。

Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.

总体而言,GPT-3 在常识推理任务中的上下文学习表现参差不齐。在 PIQA 和 ARC 任务中,单样本和少样本学习设置下仅观察到微小且不一致的提升,但在 OpenBookQA 上则观察到了显著改进。GPT-3 在新的 PIQA 数据集上所有评估设置中均达到了 SOTA(State of the Art)水平。

3.6 Reading Comprehension

3.6 阅读理解

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple choice, and span-based answer formats in both dialog and single question settings. We observe a wide spread in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.

接下来,我们在阅读理解任务上评估 GPT-3。我们使用了包含 5 个数据集的套件,这些数据集涵盖了摘要式、多项选择和基于跨度的答案格式,涵盖了对话和单一问题设置。我们观察到 GPT-3 在这些数据集上的表现差异较大,这表明其在不同答案格式下的能力有所不同。总体而言,我们观察到 GPT-3 与每个数据集上使用上下文表示训练的初始基线和早期结果相当。

GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19] a free-form conversational dataset and performs worst (13 F1 below an ELMo baseline) on QuAC $[\mathrm{CHI^{+}18}]$ a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP $[\mathrm{DWD}^{+}19]$ , a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems $[\mathrm{RLL^{+}19}]$ . On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE $[\mathrm{LXL^{+}17}]$ , a multiple choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly and is only competitive with the earliest work utilizing contextual representations and is still $45%$ behind SOTA.

GPT-3 在 CoQA [RCM19] 上表现最佳(与人类基线的差距在 3 分以内),这是一个自由形式的对话数据集;而在 QuAC $[\mathrm{CHI^{+}18}]$ 上表现最差(比 ELMo 基线低 13 F1),这是一个需要建模结构化对话行为以及师生互动的答案跨度选择的数据集。在 DROP $[\mathrm{DWD}^{+}19]$ 上,这是一个测试阅读理解中的离散推理和计算能力的数据集,GPT-3 在少样本设置下优于原始论文中的微调 BERT 基线,但仍远低于人类表现以及通过符号系统增强神经网络的最先进方法 $[\mathrm{RLL^{+}19}]$。在 SQuAD 2.0 [RJL18] 上,GPT-3 展示了其少样本学习能力,与零样本设置相比提升了近 10 F1(达到 69.8),使其略微优于原始论文中的最佳微调结果。在 RACE $[\mathrm{LXL^{+}17}]$ 上,这是一个中学和高中英语考试的多项选择题数据集,GPT-3 表现相对较弱,仅与最早利用上下文表示的工作相当,仍落后 SOTA 45%。

3.7 SuperGLUE

3.7 SuperGLUE

In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark $[\mathrm{WPN^{+}19}]$ $[\mathrm{CLC^{+}19}]$ [DMST19] [RBG11] $[\mathrm{KCR}^{+}18]$ $[\mathrm{ZLL}^{+}18]$ [DGM06] $[\mathrm{BHDD^{+}06}]$ [GMDD07] $[\mathrm{BDD^{+}09}]$ [PCC18] $[\mathrm{PHR}^{+}18]$ . GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC

为了更好地在 NLP 任务上聚合结果,并以更系统的方式与 BERT 和 RoBERTa 等流行模型进行比较,我们还在标准化的数据集集合 SuperGLUE 基准 $[\mathrm{WPN^{+}19}]$ $[\mathrm{CLC^{+}19}]$ [DMST19] [RBG11] $[\mathrm{KCR}^{+}18]$ $[\mathrm{ZLL}^{+}18]$ [DGM06] $[\mathrm{BHDD^{+}06}]$ [GMDD07] $[\mathrm{BDD^{+}09}]$ [PCC18] $[\mathrm{PHR}^{+}18]$ 上评估了 GPT-3。GPT-3 在 SuperGLUE 数据集上的测试集性能如表 3.8 所示。在少样本设置中,我们对所有任务使用了 32 个示例,这些示例是从训练集中随机采样的。


Figure 3.7: GPT-3 results on CoQA reading comprehension task. GPT-3 175B achieves 85 F1 in the few-shot setting, only a few points behind measured human performance and state-of-the-art fine-tuned models. Zero-shot and one-shot performance is a few points behind, with the gains to few-shot being largest for bigger models.

图 3.7: GPT-3 在 CoQA 阅读理解任务上的结果。GPT-3 175B 在少样本设置下达到了 85 F1 分,仅比测量的人类表现和最先进的微调模型低几分。零样本和单样本表现稍低几分,且对于更大的模型,少样本带来的增益最大。

| 设置 | SuperGLUE 平均 | BoolQ 准确率 | CB 准确率 | CB F1 | COPA 准确率 | RTE 准确率 |
| --- | --- | --- | --- | --- | --- | --- |
| 微调 SOTA | 89.0 | 91.0 | 96.9 | 93.9 | 94.8 | 92.5 |
| 微调 BERT-Large | 69.0 | 77.4 | 83.6 | 75.7 | 70.6 | 71.7 |
| GPT-3 少样本 | 71.8 | 76.4 | 75.6 | 52.0 | 92.0 | 69.0 |

| 设置 | WiC 准确率 | WSC 准确率 | MultiRC 准确率 | MultiRC F1a | ReCoRD 准确率 | ReCoRD F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 微调 SOTA | 76.1 | 93.8 | 62.3 | 88.2 | 92.5 | 93.3 |
| 微调 BERT-Large | 69.6 | 64.6 | 24.1 | 70.0 | 71.3 | 72.0 |
| GPT-3 少样本 | 49.4 | 80.1 | 30.5 | 75.4 | 90.2 | 91.1 |

Table 3.8: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient updates.

表 3.8: GPT-3 在 SuperGLUE 上的性能与微调基线和 SOTA 的比较。所有结果均在测试集上报告。GPT-3 少样本在每个任务的上下文中总共给出 32 个示例,并且不进行梯度更新。


Figure 3.8: Performance on SuperGLUE increases with model size and number of examples in context. A value of $K=32$ means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference lines (our test set results are in Table 3.8). The BERT-Large reference model was fine-tuned on the SuperGLUE training set (125K examples), whereas $\mathrm{BERT++}$ was first fine-tuned on MultiNLI (392K examples) and SWAG (113K examples) before further fine-tuning on the SuperGLUE training set (for a total of 630K fine-tuning examples). We find the difference in performance between the BERT-Large and $\mathrm{BERT++}$ to be roughly equivalent to the difference between GPT-3 with one example per context versus eight examples per context.

图 3.8: SuperGLUE 上的性能随着模型大小和上下文中的示例数量增加而提升。$K=32$ 表示我们的模型在每个任务中展示了 32 个示例,总共 256 个示例分布在 SuperGLUE 的 8 个任务中。我们报告了 GPT-3 在开发集上的值,因此我们的数字不能直接与虚线参考线进行比较(我们的测试集结果在表 3.8 中)。BERT-Large 参考模型在 SuperGLUE 训练集(125K 示例)上进行了微调,而 $\mathrm{BERT++}$ 首先在 MultiNLI(392K 示例)和 SWAG(113K 示例)上进行了微调,然后在 SuperGLUE 训练集上进一步微调(总共 630K 微调示例)。我们发现 BERT-Large 和 $\mathrm{BERT++}$ 之间的性能差异大致相当于 GPT-3 在每个上下文中展示一个示例与八个示例之间的差异。

and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.

对于除 WSC 和 MultiRC 之外的所有任务,我们为每个问题重新采样了一组新的示例作为其上下文。对于 WSC 和 MultiRC,我们则对所有评估的问题使用从训练集中随机抽取的同一组示例作为上下文。

We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, performance is still relatively strong, achieving $80.1%$ in the few-shot setting (note that GPT-3 achieves $88.6%$ on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at $75.6%$ in the few-shot setting.

我们观察到 GPT-3 在不同任务中的表现差异很大。在 COPA 和 ReCoRD 任务中,GPT-3 在单样本和少样本设置下接近 SOTA 性能,其中 COPA 仅落后几分,在排行榜上位居第二,而第一名是由一个经过微调的 110 亿参数模型 (T5) 占据。在 WSC 任务中,表现仍然相对强劲,在少样本设置下达到了 $80.1%$(需要注意的是,GPT-3 在原始 Winograd 数据集上的表现达到了 $88.6%$,如第 3.4 节所述)。在 BoolQ、MultiRC 和 RTE 任务中,表现尚可,大致与经过微调的 BERT-Large 相当。在 CB 任务中,我们在少样本设置下看到了 $75.6%$ 的表现。

WiC is a notable weak spot with few-shot performance at $49.4%$ (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark) – GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-large on four of eight tasks and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.

WiC 是一个显著的弱点,少样本性能仅为 $49.4%$ (随机概率)。我们尝试了多种不同的表述和公式来解决 WiC 问题(涉及确定一个词在两个句子中是否具有相同的含义),但均未能取得良好的性能。这暗示了一个现象,在下一节(讨论 ANLI 基准时)将更加清晰——GPT-3 在少样本或单样本设置下,对于涉及比较两个句子或片段的某些任务表现较弱,例如一个词在两个句子中的使用方式是否相同(WiC),一个句子是否是另一个句子的改写,或者一个句子是否暗示了另一个句子。这也可能解释了 RTE 和 CB 相对较低的分数,这些任务也遵循类似的格式。尽管存在这些弱点,GPT-3 在八个任务中的四个任务上仍然优于经过微调的 BERT-large,并且在两个任务上接近由经过微调的 110 亿参数模型保持的最先进水平。

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context showing increasing benefits from in-context learning (Figure 3.8). We scale $K$ up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of $K$ , we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.

最后,我们注意到,少样本 SuperGLUE 分数随着模型大小和上下文中的示例数量稳步提高,显示出上下文学习的益处不断增加(图 3.8)。我们将每个任务的 $K$ 扩展到 32 个示例,超过这个数量后,额外的示例将无法可靠地放入我们的上下文中。在遍历 $K$ 的值时,我们发现 GPT-3 每个任务需要的总示例数少于 8 个,就能在整体 SuperGLUE 分数上超过微调后的 BERT-Large。

3.8 NLI

3.8 自然语言推理 (NLI)

Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two or three class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random $(56%)$ in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset $[\mathrm{NWD^{+}19}]$ . ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting $(\sim33%)$ , whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.

自然语言推理 (Natural Language Inference, NLI) [Fyo00] 关注的是理解两个句子之间关系的能力。在实际应用中,该任务通常被构建为一个两分类或三分类问题,模型需要判断第二个句子是否在逻辑上从第一个句子中得出、与第一个句子矛盾,或者可能是真实的(中性)。SuperGLUE 包含一个 NLI 数据集 RTE,它评估该任务的二分类版本。在 RTE 上,只有最大版本的 GPT-3 在任何评估设置中表现得明显优于随机 $(56%)$,但在少样本设置中,GPT-3 的表现与单任务微调的 BERT Large 相似。我们还评估了最近引入的对抗性自然语言推理 (Adversarial Natural Language Inference, ANLI) 数据集 $[\mathrm{NWD^{+}19}]$。ANLI 是一个困难的数据集,它采用了一系列对抗性挖掘的自然语言推理问题,分为三轮(R1、R2 和 R3)。与 RTE 类似,所有小于 GPT-3 的模型在 ANLI 上的表现几乎完全随机,即使在少样本设置中 $(\sim33%)$,而 GPT-3 本身在第三轮中显示出一些进展的迹象。ANLI R3 的结果在图 3.9 中突出显示,所有轮的完整结果可以在附录 H 中找到。这些在 RTE 和 ANLI 上的结果表明,NLI 对于语言模型来说仍然是一个非常困难的任务,它们才刚刚开始显示出进展的迹象。


Figure 3.9: Performance of GPT-3 on ANLI Round 3. Results are on the dev-set, which has only 1500 examples and therefore has high variance (we estimate a standard deviation of $1.2%$ ). We find that smaller models hover around random chance, while few-shot GPT-3 175B closes almost half the gap from random chance to SOTA. Results for ANLI rounds 1 and 2 are shown in the appendix.

图 3.9: GPT-3 在 ANLI 第 3 轮的表现。结果基于开发集,该开发集仅有 1500 个样本,因此具有较高的方差(我们估计标准差为 $1.2%$)。我们发现较小的模型在随机概率附近徘徊,而少样本的 GPT-3 175B 几乎缩小了从随机概率到 SOTA 的一半差距。ANLI 第 1 轮和第 2 轮的结果见附录。

3.9 Synthetic and Qualitative Tasks

3.9 合成与定性任务

One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets with the hope of stimulating further study of test-time behavior of language models.

一种探究 GPT-3 在少样本(或零样本和单样本)设置下能力范围的方法是,给它一些需要即时进行简单计算推理、识别训练中不太可能出现的新模式,或快速适应不寻常任务的任务。我们设计了几项任务来测试这类能力。首先,我们测试 GPT-3 进行算术运算的能力。其次,我们创建了几项涉及重新排列或解构单词中字母的任务,这些任务在训练中不太可能被完全见过。第三,我们测试 GPT-3 在少样本情况下解决 SAT 风格类比问题的能力。最后,我们在几项定性任务上测试 GPT-3,包括在句子中使用新词、纠正英语语法以及新闻文章生成。我们将发布这些合成数据集,以期激发对语言模型测试时行为的进一步研究。

3.9.1 Arithmetic

3.9.1 算术

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

为了测试 GPT-3 在没有任务特定训练的情况下执行简单算术运算的能力,我们开发了一套包含 10 个测试的小型测试集,这些测试涉及用自然语言向 GPT-3 提出简单的算术问题:


Figure 3.10: Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175B), with the latter being able to reliably perform accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and correct answers a significant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot are shown in the appendix.

图 3.10: 不同大小模型在少样本设置下所有 10 个算术任务的结果。从第二大的模型 (GPT-3 13B) 到最大的模型 (GPT-3 175B) 有一个显著的跳跃,后者能够可靠地准确进行 2 位数的算术运算,通常准确进行 3 位数的算术运算,并且在 4-5 位数的算术运算、2 位数的乘法和复合运算中有相当一部分情况能给出正确答案。单样本和零样本的结果见附录。

• 2 digit addition (2D+) – The model is asked to add two integers sampled uniformly from $[0,100)$ , phrased as a question, e.g. “Q: What is 48 plus 76? A: 124”.
• 2 digit subtraction (2D-) – The model is asked to subtract two integers sampled uniformly from $[0,100)$ ; the answer may be negative, e.g. “Q: What is 34 minus 53? A: -19”.
• 3 digit addition (3D+) – Same as 2 digit addition, except numbers are uniformly sampled from $[0,1000)$ .
• 3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from $[0,1000)$ .
• 4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from $[0,10000)$ .
• 4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from $[0,10000)$ .
• 5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from $[0,100000)$ .
• 5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from $[0,100000)$ .
• 2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from $[0,100)$ , e.g. “Q: What is 24 times 42? A: 1008”.
• One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on $[0,10)$ and the operations are selected uniformly from {+, -, *}.

• 两位数加法 (2D+) – 模型被要求将两个从 $[0,100)$ 中均匀采样的整数相加,以问题形式给出,例如“Q: 48 加 76 是多少?A: 124”。
• 两位数减法 (2D-) – 模型被要求将两个从 $[0,100)$ 中均匀采样的整数相减,答案可能为负数,例如“Q: 34 减 53 是多少?A: -19”。
• 三位数加法 (3D+) – 与两位数加法相同,只是数字从 $[0,1000)$ 中均匀采样。
• 三位数减法 (3D-) – 与两位数减法相同,只是数字从 $[0,1000)$ 中均匀采样。
• 四位数加法 (4D+) – 与三位数加法相同,只是从 $[0,10000)$ 中均匀采样。
• 四位数减法 (4D-) – 与三位数减法相同,只是从 $[0,10000)$ 中均匀采样。
• 五位数加法 (5D+) – 与三位数加法相同,只是从 $[0,100000)$ 中均匀采样。
• 五位数减法 (5D-) – 与三位数减法相同,只是从 $[0,100000)$ 中均匀采样。
• 两位数乘法 (2Dx) – 模型被要求将两个从 $[0,100)$ 中均匀采样的整数相乘,例如“Q: 24 乘以 42 是多少?A: 1008”。
• 一位数复合运算 (1DC) – 模型被要求对三个一位数进行复合运算,最后两个数用括号括起来。例如,“Q: 6+(4*8) 是多少?A: 38”。这三个一位数从 $[0,10)$ 中均匀采样,运算符从 {+, -, *} 中均匀选择。

In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances.

在所有10个任务中,模型必须准确生成正确答案。对于每个任务,我们生成一个包含2,000个随机实例的数据集,并在这些实例上评估所有模型。
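作为补充说明,下面给出一个按上述描述随机生成算术题实例的示意脚本。题面格式与采样范围按正文描述编写,`make_problem` 等名称为本文假设,并非论文发布的数据集生成脚本。

```python
# 按正文描述随机生成加/减/乘与一位数复合运算题目的示意脚本。
import random
from typing import Tuple

def make_problem(task: str, rng: random.Random) -> Tuple[str, str]:
    """task 形如 '2D+'、'3D-'、'2Dx'、'1DC';返回 (问题, 答案) 字符串对。"""
    if task.endswith("+") or task.endswith("-"):
        digits = int(task[0])
        hi = 10 ** digits
        a, b = rng.randrange(hi), rng.randrange(hi)
        if task.endswith("+"):
            return f"Q: What is {a} plus {b}? A:", str(a + b)
        return f"Q: What is {a} minus {b}? A:", str(a - b)
    if task == "2Dx":
        a, b = rng.randrange(100), rng.randrange(100)
        return f"Q: What is {a} times {b}? A:", str(a * b)
    if task == "1DC":  # 一位数复合运算,例如 6+(4*8)
        a, b, c = rng.randrange(10), rng.randrange(10), rng.randrange(10)
        op1, op2 = rng.choice("+-*"), rng.choice("+-*")
        expr = f"{a}{op1}({b}{op2}{c})"
        return f"Q: What is {expr}? A:", str(eval(expr))
    raise ValueError(f"unknown task: {task}")

if __name__ == "__main__":
    rng = random.Random(0)
    for task in ["2D+", "3D-", "2Dx", "1DC"]:
        dataset = [make_problem(task, rng) for _ in range(2000)]  # 每个任务 2,000 个实例
        print(task, dataset[0])
```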

First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving $100%$ accuracy on 2 digit addition, $98.9%$ at 2 digit subtraction, $80.2%$ at 3 digit addition, and $94.2%$ at 3-digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves $25{-}26%$ accuracy on four digit operations and $9\mathrm{-}10%$ accuracy on five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves $29.2%$ accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves $21.3%$ accuracy at single digit combined operations (for example, $^{9*}(7{+}5))$ , suggesting that it has some robustness beyond just single operations.

首先,我们在少样本设置下评估 GPT-3,结果如图 3.10 所示。在加法和减法方面,当数字位数较少时,GPT-3 表现出较强的能力,在 2 位数加法上达到了 $100%$ 的准确率,2 位数减法为 $98.9%$,3 位数加法为 $80.2%$,3 位数减法为 $94.2%$。随着位数的增加,性能有所下降,但 GPT-3 在 4 位数运算上仍能达到 $25{-}26%$ 的准确率,在 5 位数运算上达到 $9\mathrm{-}10%$ 的准确率,这表明它至少具备一定能力来泛化到更大位数的运算。GPT-3 在 2 位数乘法上也达到了 $29.2%$ 的准确率,这是一个特别计算密集型的运算。最后,GPT-3 在单位数组合运算(例如 $^{9*}(7{+}5))$ 上达到了 $21.3%$ 的准确率,这表明它在单一运算之外还具备一定的鲁棒性。

As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than $10%$ of the time.

如图 3.10 所示,小型模型在这些任务上表现都很差——即使是拥有 130 亿参数的模型(仅次于 1750 亿参数的完整 GPT-3)也只能在一半的时间内解决两位数的加减法,而其他所有操作的准确率都不到 $10%$。

One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation to the task (or at the very least recognition of the task) is important to performing these computations correctly. Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and model capacity scaling for all three settings is shown in Appendix H.

单样本和零样本性能相对于少样本性能有所下降,这表明适应任务(或至少识别任务)对于正确执行这些计算是重要的。然而,单样本性能仍然相当强,即使是完整 GPT-3 的零样本性能也显著优于所有较小模型的少样本学习。完整 GPT-3 的三种设置如表 3.9 所示,所有三种设置的模型容量扩展见附录 H。

| 设置 | 2D+ | 2D- | 3D+ | 3D- | 4D+ | 4D- | 5D+ | 5D- | 2Dx | 1DC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3 零样本 | 76.9 | 58.0 | 34.2 | 48.3 | 4.0 | 7.5 | 0.7 | 0.8 | 19.8 | 9.8 |
| GPT-3 单样本 | 99.6 | 86.4 | 65.5 | 78.7 | 14.0 | 14.0 | 3.5 | 3.8 | 27.4 | 14.3 |
| GPT-3 少样本 | 100.0 | 98.9 | 80.4 | 94.2 | 25.5 | 26.8 | 9.3 | 9.9 | 29.2 | 21.3 |

Table 3.9: Results on basic arithmetic tasks for GPT-3 175B. ${2,3,4,5}\mathrm{D}{+,-}$ is 2, 3, 4, and 5 digit addition or subtraction, 2Dx is 2 digit multiplication. 1DC is 1 digit composite operations. Results become progressively stronger moving from the zero-shot to one-shot to few-shot setting, but even the zero-shot shows significant arithmetic abilities.

表 3.9: GPT-3 175B 在基础算术任务上的结果。${2,3,4,5}\mathrm{D}{+,-}$ 表示 2、3、4 和 5 位数的加法或减法,2Dx 表示 2 位数的乘法,1DC 表示 1 位数的复合运算。从零样本到单样本再到少样本设置,结果逐渐增强,但即使是零样本也显示出显著的算术能力。

| Setting | CL | A1 | A2 | RI | RW |
| --- | --- | --- | --- | --- | --- |
| GPT-3 Zero-shot | 3.66 | 2.28 | 8.91 | 8.26 | 0.09 |
| GPT-3 One-shot | 21.7 | 8.62 | 25.9 | 45.4 | 0.48 |
| GPT-3 Few-shot | 37.9 | 15.1 | 39.7 | 67.2 | 0.44 |

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches $(0.8%)$ and out of 2,000 subtraction problems we found only 2 matches $(0.1%)$ , suggesting that only a trivial fraction of the correct answers could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than memorizing a table.

为了抽查模型是否仅仅记住了特定的算术问题,我们从测试集中选取了 3 位数算术问题,并在训练数据中搜索了两种形式:"<NUM1> + <NUM2> =" 和 "<NUM1> plus <NUM2>"。在 2,000 道加法问题中,我们仅找到了 17 个匹配项 $(0.8%)$,而在 2,000 道减法问题中,我们仅找到了 2 个匹配项 $(0.1%)$,这表明只有极少部分正确答案可能是通过记忆得到的。此外,对错误答案的检查显示,模型经常犯诸如忘记进位“1”这样的错误,这表明它实际上是在尝试执行相关计算,而不是在记忆一个表格。
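下面是一个对应上述抽查思路的示意片段:在训练语料中精确搜索测试题的两种文本形式并统计命中行数。其中 `problems`、`corpus_lines` 等输入均为假设的占位数据,仅用于说明匹配逻辑,并非论文实际使用的去重/检索流程。

```python
# 抽查记忆化:在训练语料中搜索 "<a> + <b> =" 与 "<a> plus <b>" 两种形式。
from typing import Iterable, List, Tuple

def count_exact_matches(problems: List[Tuple[int, int]], corpus_lines: Iterable[str]) -> int:
    patterns = set()
    for a, b in problems:
        patterns.add(f"{a} + {b} =")
        patterns.add(f"{a} plus {b}")
    hits = 0
    for line in corpus_lines:
        if any(p in line for p in patterns):
            hits += 1
    return hits

if __name__ == "__main__":
    problems = [(123, 456), (789, 12)]                      # 假设的测试题
    corpus = ["we know that 123 + 456 = 579", "unrelated"]  # 假设的训练语料片段
    print(count_exact_matches(problems, corpus))            # 输出 1
```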

Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.

总体而言,GPT-3 在少样本、单样本甚至零样本设置下,对中等复杂度的算术表现出合理的熟练度。

3.9.2 Word Scrambling and Manipulation Tasks

3.9.2 单词打乱与操作任务

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:

• Cycle letters in word (CL) – the letters of the word are cycled, and the model must recover the original word.
• Anagrams of all but the first and last characters (A1) – every letter except the first and last is scrambled randomly, and the model must recover the original word.
• Anagrams of all but the first and last 2 characters (A2) – every letter except the first two and last two is scrambled randomly, and the model must recover the original word.
• Random insertion in word (RI) – random punctuation or space characters are inserted between the letters of the word, and the model must output the original word.
• Reversed words (RW) – the word is spelled backwards, and the model must output the original word.

为了测试 GPT-3 从少量示例中学习新颖符号操作的能力,我们设计了一组包含 5 个“字符操作”任务的小测试。每个任务都涉及给模型一个通过字符打乱、添加或删除等方式扭曲的单词,并要求其恢复原始单词。这 5 个任务分别是:

• 循环字母 (CL) – 将单词的字母循环移位,模型需要恢复出原始单词。
• 除首尾字母外的变位词 (A1) – 除首尾字母外的所有字母被随机打乱,模型需要恢复出原始单词。
• 除首尾各两个字母外的变位词 (A2) – 除开头两个和结尾两个字母外的所有字母被随机打乱,模型需要恢复出原始单词。
• 单词中的随机插入 (RI) – 在单词的字母之间插入随机的标点或空格字符,模型需要输出原始单词。
• 单词反转 (RW) – 单词被倒序拼写,模型需要输出原始单词。

For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving $66.9%$ on removing random insertions, $38.6%$ on cycling letters, $40.2%$ on the easier anagram task, and $15.1%$ on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.

对于每个任务,我们生成了 10,000 个示例,这些示例取自 [Nor09] 统计的最常见的 10,000 个单词中长度超过 4 个字符且少于 15 个字符的单词。少样本结果如图 3.11 所示。任务性能随着模型规模的增加而平稳增长,完整的 GPT-3 模型在移除随机插入字符的任务上达到了 $66.9%$,在循环字母任务上达到了 $38.6%$,在较简单的变位词任务上达到了 $40.2%$,在较难的变位词任务(仅首尾字母固定)上达到了 $15.1%$。所有模型都无法反转单词中的字母。
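为说明这几类字符操作任务的构造方式,下面给出一个示意脚本。打乱规则按正文描述编写,具体细节(如插入字符的集合、在每个字母后而非严格在字母之间插入)属于本文的假设,并非论文的原始数据生成代码。

```python
# 构造循环字母 (CL)、变位词 (A1/A2)、随机插入 (RI) 与字母反转 (RW) 任务的示意脚本。
import random

def cycle_letters(word: str, rng: random.Random) -> str:
    k = rng.randrange(1, len(word))          # 把末尾 k 个字母移到开头
    return word[-k:] + word[:-k]

def anagram(word: str, keep: int, rng: random.Random) -> str:
    """keep=1 对应 A1(仅固定首尾字母),keep=2 对应 A2(固定首尾各两个字母)。"""
    middle = list(word[keep:-keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[-keep:]

def random_insertion(word: str, rng: random.Random) -> str:
    fillers = " .,!?;:'"                      # 假设的插入字符集合
    # 在每个字母后插入一个随机字符(对"在字母之间插入"的一个近似)
    return "".join(ch + rng.choice(fillers) for ch in word)

def reverse_word(word: str) -> str:
    return word[::-1]

if __name__ == "__main__":
    rng = random.Random(0)
    w = "inevitably"
    print(cycle_letters(w, rng), anagram(w, 1, rng), anagram(w, 2, rng),
          random_insertion(w, rng), reverse_word(w))
```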


Figure 3.11: Few-shot performance on the five word scrambling tasks for different sizes of model. There is generally smooth improvement with model size although the random insertion task shows an upward slope of improvement with the 175B model solving the task the majority of the time. Scaling of one-shot and zero-shot performance is shown in the appendix. All tasks are done with $K=100$ .

图 3.11: 不同规模模型在五个单词乱序任务上的少样本表现。尽管随机插入任务在175B模型上表现出明显的改进趋势,但总体上随着模型规模的增加,性能平稳提升。单样本和零样本表现的扩展情况见附录。所有任务均在 $K=100$ 的条件下完成。

In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).

在单样本设置中,性能明显更弱(下降一半或更多),而在零样本设置中,模型几乎无法执行任何任务(表 3.10)。这表明模型确实是在测试时学习这些任务,因为模型无法在零样本情况下执行这些任务,而且这些任务的人工性质使得它们不太可能出现在预训练数据中(尽管我们无法完全确定这一点)。

We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.

我们可以通过绘制“上下文学习曲线”来进一步量化性能,这些曲线展示了任务性能随上下文示例数量的变化。我们在图 1.2 中展示了符号插入任务的上下文学习曲线。可以看到,更大的模型能够越来越有效地利用上下文信息,包括任务示例和自然语言任务描述。

Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average $\sim0.7$ words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.

最后,值得一提的是,解决这些任务需要字符级别的操作,而我们的 BPE 编码作用于单词的显著部分(平均每个 token 约 0.7 个单词),因此从语言模型的角度来看,成功完成这些任务不仅涉及操作 BPE token,还需要理解并分解其子结构。此外,CL、A1 和 A2 不是双射的(即未打乱的单词不是打乱单词的确定性函数),这要求模型执行一些搜索以找到正确的未打乱形式。因此,所涉及的技能似乎需要非平凡的模式匹配和计算能力。
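正文提到 BPE 编码平均约每个 token 对应 0.7 个单词。下面给出一个用 GPT-2 的 BPE 分词器粗略估算“平均每个 token 对应多少个单词”的示意片段;需要安装 transformers 库,统计口径(按空格切词)只是一个粗略近似,并非论文中的统计方式。

```python
# 用 GPT-2 的 BPE 分词器粗略估算每个 token 平均对应多少个单词。
# 需要: pip install transformers
from transformers import GPT2TokenizerFast

def words_per_token(text: str) -> float:
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    n_words = len(text.split())        # 以空格切分的单词数(粗略口径)
    n_tokens = len(tok.encode(text))   # BPE token 数
    return n_words / n_tokens

if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog near the riverbank."
    print(f"words per token ≈ {words_per_token(sample):.2f}")
```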

3.9.3 SAT Analogies

3.9.3 SAT 类比

To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves $65.2%$ in the few-shot setting, $59.1%$ in the one-shot setting, and $53.7%$ in the zero-shot setting, whereas the average score among college applicants was $57%$ [TL05] (random guessing yields $20%$ ). As shown in Figure 3.12, the results improve with scale, with the full 175 billion parameter model improving by over $10%$ compared to the 13 billion parameter model.

为了测试 GPT-3 在另一项相对于典型文本分布来说较为不寻常的任务上的表现,我们收集了 374 道“SAT 类比”问题 [TLBS03]。类比题是一种多项选择题,曾在 2005 年之前作为 SAT 大学入学考试的一部分。一个典型的例子是:“audacious 之于 boldness 如同 (a) sanctimonious 之于 hypocrisy, (b) anonymous 之于 identity, (c) remorseful 之于 misdeed, (d) deleterious 之于 result, (e) impressionable 之于 temptation”。学生需要从五个词对中选择与原始词对具有相同关系的词对;在这个例子中,答案是“sanctimonious 之于 hypocrisy”。在这项任务中,GPT-3 在少样本设置下达到了 $65.2%$,在单样本设置下达到了 $59.1%$,在零样本设置下达到了 $53.7%$,而大学申请者的平均得分为 $57%$ [TL05](随机猜测的得分为 $20%$)。如图 3.12 所示,随着模型规模的增加,结果有所改善,拥有 1750 亿参数的完整模型相比 130 亿参数的模型提升了超过 $10%$。


Figure 3.12: Zero-, one-,and few-shot performance on SAT analogy tasks, for different sizes of model. The largest model achieves $65%$ accuracy in the few-shot setting, and also demonstrates significant gains to in-context learning which are not present in smaller models.

图 3.12: 不同大小模型在 SAT 类比任务上的零样本、单样本和少样本表现。最大模型在少样本设置下达到了 65% 的准确率,并且在上下文学习中表现出显著的提升,而较小模型则没有这种提升。

3.9.4 News Article Generation

3.9.4 新闻文章生成

Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story $[\mathrm{RWC}^{+}19]$ . Relative to $[\mathrm{RWC}^{+}19]$ , the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective – for example GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre.

先前关于生成式语言模型的工作通过从模型中进行条件采样,给定一个由人类编写的新闻故事可能的第一句话作为提示,定性测试了其生成合成“新闻文章”的能力 [RWC+19]。与 [RWC+19] 相比,用于训练 GPT-3 的数据集在新闻文章上的权重要小得多,因此尝试通过原始的无条件样本来生成新闻文章效果较差——例如,GPT-3 经常将“新闻文章”的第一句话解释为一条推文,然后发布合成回复或后续推文。为了解决这个问题,我们利用了 GPT-3 的少样本学习能力,在模型的上下文中提供了三篇之前的新闻文章作为条件。通过给定下一篇文章的标题和副标题,模型能够可靠地生成“新闻”类型的短篇文章。

To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. $[Z\mathrm{HR}^{+}19]$ . Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.3

为了评估 GPT-3 生成新闻文章的质量(我们相信这可能与条件样本生成质量相关),我们决定测量人类区分 GPT-3 生成的文章与真实文章的能力。类似的工作已由 Kreps 等人 [KMB20] 和 Zellers 等人 [ZHR+19] 进行。生成式语言模型的训练目标是匹配人类生成内容的分布,因此人类区分两者的(无)能力可能是衡量质量的重要指标。

In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model4. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.

为了了解人类检测模型生成文本的能力,我们随机从网站 newser.com 上选取了 25 篇文章的标题和副标题(平均长度:215 词)。然后,我们使用四个参数量从 125M 到 175B(GPT-3)的语言模型生成了这些标题和副标题的续写内容(平均长度:200 词)。对于每个模型,我们向大约 80 名美国参与者展示了一个测验,测验内容包括这些真实的标题和副标题,随后是人工撰写的文章或模型生成的文章。参与者被要求选择文章是“非常可能由人类撰写”、“更可能由人类撰写”、“我不知道”、“更可能由机器撰写”或“非常可能由机器撰写”。

The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.

我们选择的文章不在模型的训练数据中,并且模型输出是通过程序格式化和选择的,以防止人为挑选。所有模型使用相同的上下文来生成输出,并且预训练时使用相同的上下文大小,每篇文章的标题和副标题都作为每个模型的提示。然而,我们还进行了一项实验,以控制参与者的努力和注意力,该实验遵循相同的格式,但涉及故意生成的低质量模型文章。这是通过从“控制模型”生成文章来实现的:一个160M参数的模型,没有上下文,并且增加了输出随机性。

Table 3.11: Human accuracy in identifying whether short ($\sim200$ word) news articles are model generated. We find that human accuracy (measured by the ratio of correct assignments to non-neutral assignments) ranges from $86%$ on the control model to $52%$ on GPT-3 175B. This table compares mean accuracy between five different models, and shows the results of a two-sample T-Test for the difference in mean accuracy between each model and the control model (an unconditional GPT-3 Small model with increased output randomness).

表 3.11: 人类在识别短新闻文章(约 200 词)是否为模型生成时的准确率。我们发现,人类的准确率(通过正确分配与非中性分配的比率来衡量)从控制模型的 86% 到 GPT-3 175B 的 52% 不等。该表比较了五个不同模型的平均准确率,并展示了每个模型与控制模型(一个增加了输出随机性的无条件 GPT-3 Small 模型)之间平均准确率差异的双样本 T 检验结果。

| 模型 | 平均准确率 | 95% 置信区间(低,高) | 与控制模型比较的 t 值(p 值) | “我不知道”分配比例 |
| --- | --- | --- | --- | --- |
| 控制模型(故意设计的差模型) | 86% | 83%-90% | - | 3.6% |
| GPT-3 Small | 76% | 72%-80% | 3.9 (2e-4) | 4.9% |
| GPT-3 Medium | 61% | 58%-65% | 10.3 (7e-21) | 6.0% |
| GPT-3 Large | 68% | 64%-72% | 7.3 (3e-11) | 8.7% |
| GPT-3 XL | 62% | 59%-65% | 10.7 (1e-19) | 7.5% |
| GPT-3 2.7B | 62% | 58%-65% | 10.4 (5e-19) | 7.1% |
| GPT-3 6.7B | 59% | 56%-63% | 11.2 (3e-21) | 6.2% |
| GPT-3 13B | 55% | 52%-58% | 15.3 (1e-32) | 7.1% |
| GPT-3 175B | 52% | 49%-54% | 16.9 (1e-34) | 7.8% |

Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was $\sim86%$ where $50%$ is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at $\sim52%$ (see Table 3.11).5 Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance.6 This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).

人类在检测故意编写的糟糕文章是由模型生成时的平均准确率(每位参与者正确分配与非中性分配的比率)为 $\sim86%$,其中 $50%$ 是随机水平的表现。相比之下,检测由 175B 参数模型生成的文章的平均准确率仅略高于随机水平,为 $\sim52%$(见表 3.11)。5 人类检测模型生成文本的能力似乎随着模型规模的增加而下降:随着模型规模的增加,准确率似乎趋向于随机水平,而人类对 GPT-3 的检测接近随机水平。6 尽管随着模型规模的增加,参与者在每个输出上花费的时间更多(见附录 E),但这一现象仍然成立。
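表 3.11 中报告的是各模型与控制模型之间平均准确率差异的双样本 T 检验。下面是一个用 scipy 进行双样本 T 检验的示意片段,其中的参与者准确率数据为随意生成的占位数值(并非实验数据),仅用于展示计算方式。

```python
# 双样本 T 检验示意:比较两组参与者(控制模型 vs 某个 GPT-3 模型)的平均准确率。
# 需要: pip install scipy numpy
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_control = rng.normal(loc=0.86, scale=0.10, size=80)   # 假设的 80 名参与者准确率
acc_gpt3 = rng.normal(loc=0.52, scale=0.12, size=80)      # 假设的另一组 80 名参与者准确率

t_stat, p_value = stats.ttest_ind(acc_gpt3, acc_control)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```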

Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15.7 Much of the text is—as indicated by the evaluations—difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.

图 3.14 和图 3.15 中展示了 GPT-3 生成的合成文章示例。正如评估所示,大部分文本对人类来说难以与真实的人类内容区分开来。事实错误可能是文章由模型生成的指标,因为与人类作者不同,模型无法访问文章标题所指的具体事实或文章撰写的时间。其他指标包括重复、不合逻辑的推论和不寻常的措辞,尽管这些通常足够微妙,以至于不会被注意到。

Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER $[\mathrm{ZHR}^{+}19]$ and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito 等人 [IDCBE19] 关于语言模型检测的相关研究表明,自动检测器如 GROVER $[Z\mathrm{HR}^{+}1\bar{9}]$ 和 GLTR [GSR19] 在检测模型生成的文本方面可能比人类评估者更成功。这些模型的自动检测可能是未来研究的一个有前景的领域。

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.

Ippolito 等人 [IDCBE19] 也指出,随着人类观察到的 token 数量增加,人类检测模型生成文本的准确性也会提高。为了初步调查人类在检测由 GPT-3 175B 生成长篇新闻文章时的表现,我们从路透社中选取了 12 篇平均长度为 569 个单词的世界新闻文章,并使用 GPT-3 生成了这些文章的续写,平均长度为 498 个单词(比我们最初的实验长 298 个单词)。按照上述方法,我们进行了两项实验,每项实验约有 80 名美国参与者,以比较人类检测 GPT-3 生成文章与对照组模型生成文章的能力。

We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was $\sim88%$ , while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely above chance at $\sim52%$ (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written news articles.

我们发现,人类在检测控制模型生成的故意写得较差的长文章时的平均准确率为 $\sim88%$,而在检测由 GPT-3 175B 生成长文章时的平均准确率仅略高于随机水平,为 $\sim52%$(见表 3.12)。这表明,对于大约 500 字长的新闻文章,GPT-3 生成的文章仍然让人类难以区分其与人类撰写的新闻文章。

3.9.5 Learning and Using Novel Words

3.9.5 学习和使用新词

A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate)

发展语言学中研究的一个任务 [CB78] 是学习和使用新词的能力,例如在只看到一次定义后就能在句子中使用一个词,或者反过来从一次使用中推断出一个词的含义。这里我们定性地测试 GPT-3 执行前者的能力。具体来说,我们给 GPT-3 一个不存在的词的定义,例如“Gigamuru”,然后要求它在句子中使用它。我们提供一个到五个之前的(独立的)示例。


Human ability to detect model generated news articles
Figure 3.13: People’s ability to identify whether news articles are model-generated (measured by the ratio of correct assignments to non-neutral assignments) decreases as model size increases. Accuracy on the outputs of the deliberately-bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated with the dashed line at the top, and the random chance $(50%)$ is indicated with the dashed line at the bottom. Line of best fit is a power law with $95%$ confidence intervals.
Table 3.12: People’s ability to identify whether $\sim500$ word articles are model generated (as measured by the ratio of correct assignments to non-neutral assignments) was $88%$ on the control model and $52%$ on GPT-3 175B. This table shows the results of a two-sample T-Test for the difference in mean accuracy between GPT-3 175B and the control model (an unconditional GPT-3 Small model with increased output randomness).


人类检测模型生成新闻文章的能力
图 3.13: 随着模型规模的增加,人们识别新闻文章是否由模型生成的能力(通过正确分配与非中性分配的比率来衡量)下降。顶部虚线表示故意不良控制模型(具有更高输出随机性的无条件 GPT-3 Small 模型)的输出准确率,底部虚线表示随机概率 $(50%)$。最佳拟合线为幂律,置信区间为 $95%$。
表 3.12: 人们识别 $\sim500$ 字文章是否由模型生成的能力(通过正确分配与非中性分配的比率来衡量)在控制模型上为 $88%$,在 GPT-3 175B 上为 $52%$。该表显示了 GPT-3 175B 与控制模型(具有增加输出随机性的无条件 GPT-3 Small 模型)之间平均准确率差异的双样本 T 检验结果。

| 模型 | 平均准确率 | 95% 置信区间(低,高) | 与控制组比较的 t 值(p 值) | “我不知道”分配 |
| --- | --- | --- | --- | --- |
| 控制组 | 88% | 84%-91% | - | 2.7% |
| GPT-3 175B | 52% | 48%-57% | 12.7 (3.2e-23) | 10.6% |
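图 3.13 的最佳拟合线是关于模型规模的幂律。下面给出一个在对数-对数坐标下用最小二乘拟合 `y ≈ a * N^b` 形式幂律的示意片段:其中的准确率取自表 3.11,各模型的参数量为大致的近似值(属于本文的假设),对“准确率 − 50%”做拟合也只是演示拟合方法的一种可能做法,并不代表论文所用的具体函数形式。

```python
# 在对数-对数坐标下做线性最小二乘,拟合 y ≈ a * N^b 形式的幂律(仅演示方法)。
import numpy as np

# 各模型的近似参数量(个)与表 3.11 中的平均准确率。
params = np.array([1.25e8, 3.5e8, 7.6e8, 1.3e9, 2.7e9, 6.7e9, 1.3e10, 1.75e11])
accuracy = np.array([0.76, 0.61, 0.68, 0.62, 0.62, 0.59, 0.55, 0.52])

y = accuracy - 0.5                       # 相对随机水平 (50%) 的差值
b, log_a = np.polyfit(np.log(params), np.log(y), deg=1)
a = np.exp(log_a)
print(f"fit: accuracy - 0.5 ≈ {a:.3g} * N^{b:.3f}")
```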


Figure 3.14: The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human written article (accuracy: $12%$ ).

图 3.14: GPT-3 生成的新闻文章,人类最难将其与人类撰写的文章区分开来(准确率:$12%$)。


Figure 3.16: Representative GPT-3 completions for the few-shot task of using a new word in a sentence. Boldface is GPT-3’s completions, plain text is human prompts. In the first example both the prompt and the completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 other than the conditioning shown here.

图 3.16: 在少样本任务中使用新词的 GPT-3 补全示例。粗体为 GPT-3 的补全内容,普通文本为人类提示。在第一个示例中,提示和补全均由人类提供;这随后作为后续示例的条件,GPT-3 接收连续的额外提示并提供补全内容。除了此处显示的条件外,没有向 GPT-3 提供任何特定于任务的信息。

nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Figure 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

即一个(不同的)不存在的单词被定义并在句子中使用,因此就该大类任务的先前示例而言是少样本,而就这个特定单词而言是单样本。图 3.16 展示了我们生成的 6 个示例;所有定义均由人工生成,第一个答案作为条件由人工生成,而后续答案则由 GPT-3 生成。这些示例是在一次连续生成的,我们没有省略或重复尝试任何提示。在所有情况下,生成的句子似乎都是正确或至少是合理的单词使用。在最后一个句子中,模型为单词“screeg”生成了一个合理的变位形式(即“screeghed”),尽管该词的使用略显笨拙(“screeghed at each other”),但从描述玩具剑战的角度来看是合理的。总体而言,GPT-3 在句子中使用新词的任务上至少表现出了一定的熟练度。

3.9.6 Correcting English Grammar

3.9.6 纠正英语语法

Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.

另一个非常适合少样本学习的任务是纠正英语语法。我们通过提供“Poor English Input: <句子>\n Good English Output: <句子>”形式的提示,在少样本设置下使用 GPT-3 进行测试。我们给 GPT-3 提供一个由人类生成的纠正示例,然后要求它再纠正 5 个句子(同样没有任何遗漏或重复)。结果如图 3.17 所示。
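A minimal sketch of how such a grammar-correction prompt could be assembled is shown below; the correction pair and the helper name are illustrative placeholders, and the model's completion would be sampled after the final "Good English Output:" line.

```python
# Sketch of the few-shot "Poor English Input / Good English Output" prompt
# format described above. The example correction pair is a placeholder.
def build_grammar_prompt(corrected_examples, sentence_to_fix):
    lines = []
    for bad, good in corrected_examples:
        lines.append(f"Poor English Input: {bad}")
        lines.append(f"Good English Output: {good}")
    lines.append(f"Poor English Input: {sentence_to_fix}")
    lines.append("Good English Output:")
    return "\n".join(lines)

print(build_grammar_prompt(
    [("I eated the purple berries.", "I ate the purple berries.")],
    "The patient was died.",
))
```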

4 Measuring and Preventing Memorization Of Benchmarks

4 测量和防止基准测试的记忆化

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pre-training datasets, we believe this issue is becoming increasingly important to attend to.

由于我们的训练数据集来源于互联网,我们的模型可能在部分基准测试集上进行了训练。从互联网规模的数据集中准确检测测试污染是一个新的研究领域,尚未建立最佳实践。尽管在不调查污染的情况下训练大型模型是常见做法,但鉴于预训练数据集的规模不断扩大,我们认为这个问题变得越来越重要。

This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that

这种担忧并非只是假设。最早在Common Crawl数据上训练语言模型的论文之一 [TL18] 检测并移除了一个与他们的评估数据集重叠的训练文档。其他工作如GPT-2 [RWC+19] 也进行了事后重叠分析。他们的研究结果相对令人鼓舞,发现

Figure 3.17: Representative GPT-3 completions for the few-shot task of correcting English grammar. Boldface is GPT-3’s completions, plain text is human prompts. In the first few examples both the prompt and the completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 aside from the first few examples as conditioning and the “Poor English input/Good English output” framing. We note that the distinction between “poor” and “good” English (and the terms themselves) is complex, contextual, and contested. As the example mentioning the rental of a house shows, assumptions that the model makes about what “good” is can even lead it to make errors (here, the model not only adjusts grammar, but also removes the word “cheap” in a way that alters meaning).

图 3.17: GPT-3 在少样本任务中纠正英语语法的代表性补全结果。加粗部分是 GPT-3 的补全内容,普通文本是人类提示。在前几个示例中,提示和补全内容均由人类提供;这为后续示例提供了条件,GPT-3 接收到连续的额外提示并提供补全内容。除了前几个示例作为条件和“Poor English input/Good English output”框架外,没有向 GPT-3 提供任何特定于任务的内容。我们注意到,“poor”和“good”英语之间的区别(以及这些术语本身)是复杂的、情境化的且有争议的。正如提到租房的那个示例所示,模型对“good”的假设甚至可能导致其犯错(在这里,模型不仅调整了语法,还以改变含义的方式删除了“cheap”一词)。


Figure 4.1: GPT-3 Training Curves. We measure model performance during training on a deduplicated validation split of our training distribution. Though there is some gap between training and validation performance, the gap grows only minimally with model size and training time, suggesting that most of the gap comes from a difference in difficulty rather than overfitting.

图 4.1: GPT-3 训练曲线 我们在训练过程中测量模型在我们训练分布的去重验证集上的表现。尽管训练和验证表现之间存在一定差距,但随着模型规模和训练时间的增加,差距的增长非常有限,这表明大部分差距来自于难度差异而非过拟合。

although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).

尽管模型在训练和测试数据重叠的部分表现稍好,但由于污染数据所占比例较小(通常只有几个百分点),这并未显著影响报告的结果。

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.

GPT-3 的运行机制有所不同。一方面,其数据集和模型规模比 GPT-2 使用的要大两个数量级,并且包含了大量的 Common Crawl 数据,这增加了数据污染和记忆的风险。另一方面,正是由于数据量庞大,即使是 GPT-3 175B 模型,相对于去重后的验证集(图 4.1),其训练集的过拟合程度也并不显著。因此,我们预计数据污染可能会频繁发生,但其影响可能没有人们担心的那么大。

We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.

我们最初尝试通过主动搜索并尝试移除训练数据与本文研究的所有基准测试的开发集和测试集之间的任何重叠来解决数据污染问题。不幸的是,由于一个错误,训练数据中仅部分移除了所有检测到的重叠。由于训练成本的原因,重新训练模型并不可行。为了解决这个问题,我们详细研究了剩余检测到的重叠如何影响结果。

For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pre-training set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.

对于每个基准测试,我们生成了一个“干净”版本,该版本移除了所有可能泄露的样本,这些样本大致定义为与预训练集中的任何内容有13-gram重叠的样本(或者当样本长度短于13-gram时与整个样本重叠)。目标是非常保守地标记任何可能存在的污染,以便生成一个高置信度的无污染干净子集。具体步骤详见附录C。
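A toy sketch of this conservative flagging rule is given below; the real procedure (tokenization details, normalization, and scale) is specified in Appendix C, so the helper names and the whitespace tokenization here are simplifying assumptions.

```python
# Toy illustration of the conservative 13-gram overlap filter described above:
# an example is flagged if any of its 13-grams (or the whole example, when it
# is shorter than 13 tokens) also appears in the pre-training corpus.
def ngrams(tokens, n=13):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_training_index(training_documents, n=13):
    index = set()
    for doc in training_documents:
        index |= ngrams(doc.lower().split(), n)
    return index

def is_potentially_contaminated(example_text, training_index, training_text, n=13):
    """training_index: set of 13-grams; training_text: lowercased corpus string."""
    tokens = example_text.lower().split()
    if len(tokens) < n:
        # Shorter than 13 tokens: flag if the whole example appears verbatim.
        return " ".join(tokens) in training_text
    return bool(ngrams(tokens, n) & training_index)
```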

We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over $50%$ ), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.

然后我们在这些干净的基准上评估 GPT-3,并与原始分数进行比较。如果在干净子集上的分数与整个数据集上的分数相似,这表明即使存在污染,也不会对报告的结果产生显著影响。如果在干净子集上的分数较低,这表明污染可能夸大了结果。结果总结在图 4.2 中。尽管潜在污染通常很高(四分之一的基准得分超过 $50%$),但在大多数情况下,性能变化微乎其微,我们没有发现污染水平与性能差异之间存在相关性。我们得出结论,要么我们的保守方法大大高估了污染,要么污染对性能影响很小。
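The comparison itself is simple to state in code. The sketch below is only a schematic summary: the `is_flagged` predicate could be the overlap check sketched above and `score_fn` stands in for each benchmark's own metric, both of which are assumed names for illustration.

```python
# Schematic version of the clean-vs-full comparison summarized in Figure 4.2.
def contamination_report(examples, is_flagged, score_fn):
    """examples: benchmark items; is_flagged: item -> bool; score_fn: items -> score."""
    clean = [ex for ex in examples if not is_flagged(ex)]
    full_score, clean_score = score_fn(examples), score_fn(clean)
    return {
        "percent_clean": 100.0 * len(clean) / len(examples),
        "full_score": full_score,
        "clean_score": clean_score,
        # A clean-subset score well below the full score suggests contamination
        # may be inflating the headline result.
        "relative_diff_percent": 100.0 * (clean_score - full_score) / full_score,
    }
```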

Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult.

下面,我们详细回顾了几种特定情况,其中要么 (1) 模型在清理后的版本上表现明显较差,要么 (2) 潜在的污染非常高,这使得测量性能差异变得困难。

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below:

我们的分析标记了六组基准用于进一步调查:单词重组、阅读理解(QuAC、SQuAD2、DROP)、PIQA、Winograd、语言建模任务(Wikitext任务、1BW)以及德语到英语翻译。由于我们的重叠分析设计得非常保守,我们预计会产生一些误报。我们在下面总结了每组任务的结果:


Figure 4.2: Benchmark contamination analysis. We constructed cleaned versions of each of our benchmarks to check for potential contamination in our training set. The x-axis is a conservative lower bound for how much of the dataset is known with high confidence to be clean, and the y-axis shows the difference in performance when evaluating only on the verified clean subset. Performance on most benchmarks changed negligibly, but some were flagged for further review. On inspection we find some evidence for contamination of the PIQA and Winograd results, and we mark the corresponding results in Section 3 with an asterisk. We find no evidence that other benchmarks are affected. (X-axis label: “Percentage of Data Clean in Dataset”.)

图 4.2: 基准污染分析 我们构建了每个基准的清理版本,以检查训练集中潜在的污染。x 轴是一个保守的下限,表示数据集中已知高度清洁的部分,y 轴显示了仅在已验证的清洁子集上评估时的性能差异。大多数基准的性能变化可以忽略不计,但有些基准被标记为需要进一步审查。在检查中,我们发现 PIQA 和 Winograd 结果存在一些污染的证据,并在第 3 节中用星号标记了相应的结果。我们没有发现其他基准受到影响的证据。(横轴标签:“数据集中清洁数据的百分比”。)

• Reading Comprehension: Our initial analysis flagged ${>}90%$ of task examples from QuAC, SQuAD2, and DROP as potentially contaminated, a fraction so large that even measuring the differential on a clean subset was difficult. Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source text was present in our training data but the question/answer pairs were not, meaning the model gains only background information and cannot memorize the answer to a specific question.

• 阅读理解:我们的初步分析标记了来自 QuAC、SQuAD2 和 DROP 的任务示例中 ${>}90%$ 的部分可能受到污染,以至于即使在干净的子集上测量差异也很困难。然而,经过手动检查后,我们发现,在我们检查的所有 3 个数据集中,源文本存在于我们的训练数据中,但问题/答案对并不存在,这意味着模型只能获得背景信息,而无法记住特定问题的答案。

• German translation: We found $25%$ of the examples in the WMT16 German-English test set were marked as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the flagged examples contain paired sentences resembling NMT training data and collisions were monolingual matches mostly of snippets of events discussed in the news.

• 德语翻译:我们发现 WMT16 德语-英语测试集中有 $25%$ 的样本被标记为可能受到污染,相关的总效应大小为 1-2 BLEU。经过检查,所有被标记的样本中都没有包含类似于神经机器翻译 (NMT) 训练数据的成对句子,且碰撞主要是新闻中讨论的事件片段的单语匹配。

• Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small, but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance – this is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.

• 反转词和字谜:回想一下,这些任务的形式为“alaok = koala”。由于这些任务的长度较短,我们使用了2-gram进行过滤(忽略标点符号)。在检查标记的重叠后,我们发现它们通常不是训练集中的真实反转或解谜实例,而是回文或简单的解谜,例如“kayak = kayak”。重叠量很小,但移除这些简单任务会导致难度增加,从而产生虚假信号。与此相关的是,符号插入任务显示出高度重叠,但对性能没有影响——这是因为该任务涉及从单词中移除非字母字符,而重叠分析本身忽略了这些字符,导致许多虚假匹配。

• PIQA: The overlap analysis flagged $29%$ of examples as contaminated, and observed a 3 percentage point absolute decrease ($4%$ relative decrease) in performance on the clean subset. Though the test dataset was released after our training set was created and its labels are hidden, some of the web pages used by the crowdsourced dataset creators are contained in our training set. We found a similar decrease in a $25\mathrm{x}$ smaller model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential contamination.

• PIQA:重叠分析标记了29%的示例为受污染,并在干净子集上观察到性能下降了3个百分点(相对下降4%)。尽管测试数据集是在我们的训练集创建后发布的,并且其标签是隐藏的,但众包数据集创建者使用的一些网页包含在我们的训练集中。我们在一个容量小25倍的模型中发现了类似的性能下降,该模型的记忆能力要小得多,这使我们怀疑这种变化可能是统计偏差而非记忆;工人们复制的示例可能只是更容易。不幸的是,我们无法严格证明这一假设。因此,我们在PIQA结果上标记了星号,以表示这种潜在的污染。

• Winograd: The overlap analysis flagged $45%$ of examples, and found a $2.6%$ decrease in performance on the clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.

• Winograd:重叠分析标记了 $45%$ 的样本,并在干净子集上发现了 $2.6%$ 的性能下降。对重叠数据点的手动检查显示,132 个 Winograd 模式实际上存在于我们的训练集中,尽管其格式与我们呈现给模型的任务格式不同。尽管性能下降较小,我们在主论文中用星号标记了 Winograd 结果。

• Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably extract a clean subset here, we do not report results on these datasets, even though we intended to when starting this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language modeling benchmark.

• 语言建模:我们发现 GPT-2 中测量的 4 个维基百科语言建模基准,以及儿童图书测试数据集,几乎完全包含在我们的训练数据中。由于我们无法可靠地提取一个干净的子集,因此我们没有报告这些数据集的结果,尽管我们在开始这项工作时有意这样做。我们注意到,由于 Penn Tree Bank 的年代久远,它未受影响,因此成为我们主要的语言建模基准。

We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within $0.5%$ of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.

我们还检查了污染率较高但对性能影响几乎为零的数据集,只是为了验证实际污染的程度。这些数据集往往包含误报。它们要么没有实际污染,要么污染并未泄露任务的答案。一个显著的例外是 LAMBADA,它似乎存在大量真实的污染,但对性能的影响非常小,干净子集的得分与完整数据集的得分相差在 $0.5%$ 以内。此外,严格来说,我们的填空格式排除了最简单的记忆形式。然而,由于我们在本文中对 LAMBADA 取得了非常大的进展,因此在结果部分中注明了潜在的污染。

An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but at the same time is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.

我们污染分析的一个重要限制是,我们无法确定干净子集是否来自与原始数据集相同的分布。仍然有可能记忆效应夸大了结果,但同时被某些统计偏差精确抵消,导致干净子集更容易。然而,接近零的偏移数量之多表明这种情况不太可能,而且我们还观察到小模型的偏移没有明显差异,这些小模型不太可能进行记忆。

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.

总体而言,我们已尽最大努力来衡量和记录数据污染的影响,并根据严重程度对问题结果进行标注或直接删除。在设计基准测试和训练模型时,仍有许多工作要做,以解决这一重要且微妙的问题。有关我们分析的更详细解释,请参阅附录 C。

5 Limitations

5 局限性

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.

GPT-3 及其分析存在一些局限性。以下我们描述其中一些,并为未来的工作提出方向。

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA $[\mathrm{BZB^{+}19}]$ ) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.

首先,尽管 GPT-3 在定量和定性方面有了显著的改进,特别是与其前身 GPT-2 相比,但它在文本合成和多个 NLP 任务中仍然存在明显的弱点。在文本合成方面,尽管整体质量很高,但 GPT-3 生成的样本有时在文档级别上会出现语义重复,在足够长的段落中失去连贯性,自相矛盾,偶尔还会包含不合逻辑的句子或段落。我们将发布 500 个未经筛选的无条件样本集合,以帮助更好地理解 GPT-3 在文本合成方面的局限性和优势。在离散语言任务领域,我们非正式地注意到,尽管 GPT-3 在某些测试该领域的数据集(如 PIQA [BZB⁺19])上表现良好,但它在“常识物理”方面似乎特别困难。具体来说,GPT-3 在回答诸如“如果我把奶酪放进冰箱,它会融化吗?”这类问题时存在困难。从定量角度来看,GPT-3 的上下文学习性能在我们的基准测试套件中存在一些显著差距,如第 3 节所述,特别是在一些“比较”任务上,如确定两个词在句子中的使用方式是否相同,或一个句子是否暗示另一个句子(分别为 WIC 和 ANLI),以及在一部分阅读理解任务上,GPT-3 的表现几乎与随机猜测无异。这一点尤其引人注目,因为 GPT-3 在许多其他任务上的少样本表现非常出色。

GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models $[\mathrm{RSR}^{+}19]$. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.

GPT-3 存在一些结构和算法上的局限性,这可能是上述部分问题的原因。我们专注于探索自回归语言模型中的上下文学习行为,因为使用这类模型进行采样和计算概率是相对简单的。因此,我们的实验没有包含任何双向架构或其他训练目标,例如去噪。这与最近的许多文献形成了显著差异,这些文献记录了在使用这些方法时,相比标准语言模型,微调性能有所提升 $[\mathrm{RSR}^{+}19]$ 。因此,我们的设计决策可能会在那些从双向性中受益的任务上表现较差。这些任务可能包括填空任务、需要回顾并比较两段内容的任务,或者需要重新阅读或仔细考虑一段长文后生成简短答案的任务。这可能是 GPT-3 在某些任务上少样本表现滞后的一个可能解释,例如 WIC(涉及比较一个词在两个句子中的使用)、ANLI(涉及比较两个句子以判断一个是否暗示另一个)以及一些阅读理解任务(例如 QuAC 和 RACE)。我们还基于过去的文献推测,一个大型的双向模型在微调方面会比 GPT-3 更强。构建一个与 GPT-3 规模相当的双向模型,或者尝试让双向模型在少样本或零样本学习中发挥作用,是未来研究的一个有前景的方向,可能有助于实现“两全其美”。

A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pre-training objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world $[\mathrm{BHT}^{+}20]$. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans $[\mathrm{ZSW}^{+}19\mathrm{a}]$, fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world $[\mathrm{CLY^{+}19}]$.

本文所述通用方法的一个更根本的限制——无论是自回归还是双向的类大语言模型(LM)的扩展——是它可能最终会遇到(或可能已经遇到)预训练目标的限制。我们当前的目标是对每个Token进行同等加权,缺乏对预测内容重要性的区分。[RRS20]展示了针对感兴趣实体定制预测的好处。此外,对于自监督目标,任务规范依赖于将所需任务强制转化为预测问题,而最终,有用的语言系统(例如虚拟助手)可能更适合被视为采取目标导向的行动,而不仅仅是做出预测。最后,大型预训练语言模型并未基于其他经验领域(如视频或现实世界的物理交互)进行训练,因此缺乏大量关于世界的上下文信息 $[\mathrm{BHT}^{+}20]$。由于所有这些原因,扩展纯自监督预测可能会遇到限制,可能需要采用不同的方法进行增强。有希望的未来方向可能包括从人类学习目标函数 $[\mathrm{ZSW}^{+}19\mathrm{a}]$,通过强化学习进行微调,或添加其他模态(如图像)以提供基础和对世界的更好建模 $[\mathrm{CLY^{+}19}]$。

Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.

语言模型普遍存在的另一个局限是预训练期间的样本效率低下。虽然 GPT-3 在测试时的样本效率上更接近人类(单样本或零样本),但它在预训练期间所接触的文本量仍然远超人类一生所见的文本量 [Lin20]。提高预训练的样本效率是未来工作的重要方向,可能通过基于物理世界的额外信息或算法改进来实现。

A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pre-training, although possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.

GPT-3 中少样本学习的一个局限性或至少是不确定性在于,尚不清楚少样本学习是否真的在推理时“从零开始”学习新任务,还是仅仅识别并辨认它在训练期间学到的任务。这些可能性存在于一个范围内,从训练集中包含与测试时分布完全相同的演示,到识别相同任务但格式不同,再到适应一般任务(如问答)的特定风格,再到完全从头学习一项技能。GPT-3 在这个范围内的位置可能因任务而异。像单词打乱或无意义单词定义这样的合成任务似乎特别可能从头学习,而翻译显然必须在预训练期间学习,尽管可能是从组织和风格与测试数据非常不同的数据中学习的。最终,甚至不清楚人类是从零开始学习还是从先前的演示中学习。即使在预训练期间组织多样化的演示并在测试时识别它们,对于语言模型来说也是一种进步,但准确理解少样本学习的工作原理仍然是未来研究的一个重要且未探索的方向。

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.

与 GPT-3 规模模型相关的一个限制是,无论目标函数或算法如何,它们在进行推理时既昂贵又不方便,这可能对当前形式下这种规模模型的实际适用性构成挑战。未来可能的一个方向是将大模型蒸馏 (distillation) [HVD15] 到特定任务的可管理大小。像 GPT-3 这样的大模型包含非常广泛的技能,其中大多数技能对于特定任务来说并不需要,这表明在原则上可以进行激进的蒸馏。蒸馏在一般情况下已经得到了广泛研究 [LHCG19a],但尚未在数千亿参数的规模上进行尝试;将其应用于这种规模的模型可能会带来新的挑战和机遇。
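As a reference point for what distillation in the sense of [HVD15] involves, here is a generic sketch of the soft-target objective (temperature-softened teacher probabilities plus a hard-label term). It is standard distillation code written with PyTorch purely for illustration, not a GPT-3-specific recipe.

```python
# Minimal sketch of knowledge distillation in the sense of [HVD15]: the student
# is trained to match temperature-softened teacher probabilities in addition to
# the usual hard-label cross-entropy. Generic PyTorch code, not GPT-3-specific.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # scaled by T^2 so gradient magnitudes stay comparable as T varies.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```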

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).

最后,GPT-3 与大多数深度学习系统共享一些局限性——其决策不易解释,对于新颖输入的预测不一定校准良好,正如在标准基准测试中观察到的性能方差远高于人类,并且它保留了训练数据中的偏见。最后一个问题——数据中的偏见可能导致模型生成刻板或偏见内容——从社会角度来看尤其值得关注,将在下一节关于更广泛影响的部分(第6节)中与其他问题一起讨论。

6 Broader Impacts

6 更广泛的影响

Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models.

语言模型对社会有着广泛的有益应用,包括代码和写作自动补全、语法辅助、游戏叙事生成、改善搜索引擎响应以及回答问题。但它们也可能有潜在的有害应用。GPT-3 相比小型模型提高了文本生成的质量和适应性,并增加了区分合成文本与人类书写文本的难度。因此,它有可能推动语言模型的有益和有害应用。

Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3).

我们在此关注改进后语言模型的潜在危害,并非认为这些危害必然更大,而是为了激发研究和缓解这些危害的努力。像GPT-3这样的语言模型具有广泛的影响。我们主要关注两个问题:6.1节中讨论的GPT-3等语言模型可能被故意滥用的风险,以及6.2节中讨论的GPT-3等模型中的偏见、公平性和代表性等问题。我们还将简要讨论能效问题(6.3节)。

6.1 Misuse of Language Models

6.1 语言模型的滥用

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

恶意使用大语言模型可能有些难以预测,因为它们通常涉及在非常不同的环境中或以研究人员预期之外的目的重新利用这些模型。为了帮助解决这个问题,我们可以从传统的安全风险评估框架的角度来思考,这些框架概述了关键步骤,如识别威胁和潜在影响、评估可能性,以及确定风险(即可能性和影响的结合)[Ros12]。我们讨论三个因素:潜在的滥用应用、威胁行为者以及外部激励结构。

6.1.1 Potential Misuse Applications

6.1.1 潜在的滥用应用

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text generation could lower existing barriers to carrying out these activities and increase their efficacy.

任何依赖生成文本的社会有害活动都可能因强大的语言模型而增强。例如,虚假信息、垃圾邮件、网络钓鱼、滥用法律和政府程序、欺诈性学术论文写作以及社会工程学借口。这些应用中的许多都受限于人类撰写高质量文本的能力。能够生成高质量文本的语言模型可能会降低执行这些活动的现有障碍,并提高其效果。

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in 3.9.4 represents a concerning milestone in this regard.

随着文本合成质量的提高,语言模型的滥用潜力也在增加。GPT-3 在 3.9.4 节中生成的多段合成内容,人们难以将其与人类撰写的文本区分开来,这标志着一个令人担忧的里程碑。

6.1.2 Threat Actor Analysis

6.1.2 威胁行为者分析

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas $[\mathrm{SBC^{+}19}]$ .

威胁行为者可以根据技能和资源水平进行分类,从技能和资源水平较低或中等的行为者(他们可能能够构建恶意产品)到“高级持续性威胁 (APTs)”(Advanced Persistent Threats):这些是技能高超且资源充足(例如由国家支持的)的团体,拥有长期目标 $[\mathrm{SBC^{+}19}]$。

To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.

为了了解低技能和中等技能的行为者如何看待语言模型,我们一直在监控那些经常讨论误导策略、恶意软件分发和计算机欺诈的论坛和聊天群组。虽然我们在2019年春季GPT-2首次发布后确实发现了大量关于滥用的讨论,但自那时以来,我们发现的实验实例较少,且没有成功的部署案例。此外,这些滥用讨论与语言模型技术的媒体报道相关。基于此,我们评估认为,这些行为者滥用语言模型的威胁并非迫在眉睫,但可靠性的显著提升可能会改变这一现状。

Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.

由于APT(高级持续性威胁)通常不会公开讨论其操作,我们咨询了专业的威胁分析师,了解可能涉及使用语言模型的APT活动。自GPT-2发布以来,尚未发现可能通过使用语言模型获得潜在收益的操作有明显差异。评估认为,语言模型可能不值得投入大量资源,因为目前尚无令人信服的证据表明当前的语言模型在生成文本方面显著优于现有方法,而且“定向”或“控制”语言模型内容的方法仍处于非常早期的阶段。

6.1.3 External Incentive Structures

6.1.3 外部激励结构

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

每个威胁行为者群体也有一套他们依赖的战术、技术和程序 (TTPs) 来实现他们的目标。TTPs 受到经济因素的影响,如可扩展性和部署的便捷性;钓鱼攻击在所有群体中极为流行,因为它提供了一种低成本、低投入、高回报的方法来部署恶意软件和窃取登录凭证。使用语言模型来增强现有的 TTPs 可能会进一步降低部署成本。

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable $99%$ of the time, but produces incoherent outputs $1%$ of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be.

易用性是另一个重要的激励因素。拥有稳定的基础设施对TTPs的采用有很大影响。然而,语言模型的输出是随机的,尽管开发者可以对其进行约束(例如使用top-k截断),但在没有人类反馈的情况下,它们无法保持一致的表现。如果一个社交媒体虚假信息机器人在99%的时间内产生可靠的输出,但在1%的时间内产生不连贯的输出,这可能会减少操作该机器人所需的人力。但仍然需要人类来过滤输出,这限制了操作的可扩展性。
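For reference, the top-k truncation mentioned above amounts to restricting sampling to the k most likely tokens and renormalizing. A minimal, model-agnostic sketch is shown below; the function name and the choice of k are illustrative only.

```python
# Minimal, model-agnostic sketch of top-k truncation: keep only the k most
# likely tokens, renormalize, and sample. Function name and k are illustrative.
import numpy as np

def top_k_sample(logits, k=40, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    top_idx = np.argsort(logits)[-k:]                  # indices of the k largest logits
    shifted = logits[top_idx] - logits[top_idx].max()  # shift for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(rng.choice(top_idx, p=probs))

print(top_k_sample([2.0, 1.0, 0.5, -1.0, 0.1], k=3))
```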

Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

根据我们对该模型的分析以及对威胁行为者和环境的分析,我们怀疑 AI 研究人员最终会开发出足够一致且可控的语言模型,从而引起恶意行为者的更大兴趣。我们预计这将为更广泛的研究社区带来挑战,并希望通过结合缓解研究、原型设计以及与其他技术开发人员的协调来解决这一问题。

6.2 Fairness, Bias, and Representation

6.2 公平性、偏见与代表性

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation.

训练数据中存在的偏见可能导致模型生成刻板或带有偏见的内容。这令人担忧,因为模型偏见可能通过强化现有刻板印象和产生贬低性描述等方式,以不同方式伤害相关群体 [Cra17]。为了更好地理解 GPT-3 在公平性、偏见和代表性方面的局限性,我们对模型中的偏见进行了分析。

Our goal is not to exhaustively characterize GPT-3, but to give a p