Language Models are Few-Shot Learners
OpenAI
| Tom B. Brown* | Benjamin Mann* | Nick Ryder* | Melanie Subbiah* |
| Jared Kaplan | Prafulla Dhariwal | Arvind Neelakantan | Pranav Shyam | Girish Sastry |
| Amanda Askell | Sandhini Agarwal | Ariel Herbert-Voss | Gretchen Krueger | Tom Henighan |
| Rewon Child | Aditya Ramesh | Daniel M. Ziegler | Jeffrey Wu | Clemens Winter |
| Christopher Hesse | Mark Chen | Eric Sigler | Mateusz Litwin | Scott Gray |
| Benjamin Chess | Jack Clark | Christopher Berner |
| Sam McCandlish | Alec Radford | Ilya Sutskever | Dario Amodei |
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Contents
1 Introduction
2 Approach
3 Results
4 Measuring and Preventing Memorization Of Benchmarks
5 Limitations
6 Broader Impacts
7 Related Work
1 Introduction
Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models $[\mathrm{VSP^{+}17}]$ have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].
This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: achieving strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.
First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.
Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance, [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].
Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

Figure 1.1: Language model meta-learning. During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task. We use the term “in-context learning” to describe the inner loop of this process, which occurs within the forward-pass upon each sequence. The sequences in this diagram are not intended to be representative of the data a model would see during pre-training, but are intended to show that there are sometimes repeated sub-tasks embedded within a single sequence.

Figure 1.2: Larger models make increasingly efficient use of in-context information. We show in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description (see Sec. 3.9.2). The steeper “in-context learning curves” for large models demonstrate improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range of tasks.
One potential route towards addressing these issues is meta-learning – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.
While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.
Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

Figure 1.3: Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning. See Figure 3.8 for a more detailed analysis on SuperGLUE, a standard NLP benchmark suite.
In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.
Figure 1.2 illustrates the conditions we study, and shows few-shot learning of a simple task requiring the model to remove extraneous symbols from a word. Model performance improves with the addition of a natural language task description, and with the number of examples in the model’s context, $K$. Few-shot learning also improves dramatically with model size. Though the results in this case are particularly striking, the general trends with both model size and number of examples in-context hold for most tasks we study. We emphasize that these “learning” curves involve no gradient updates or fine-tuning, just increasing numbers of demonstrations given as conditioning.
Broadly, on NLP tasks GPT-3 achieves promising results in the zero-shot and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.
GPT-3 also displays one-shot and few-shot proficiency at tasks designed to test rapid adaptation or on-the-fly reasoning, which include unscrambling words, performing arithmetic, and using novel words in a sentence after seeing them defined only once. We also show that in the few-shot setting, GPT-3 can generate synthetic news articles which human evaluators have difficulty distinguishing from human-generated articles.
At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.
A heuristic sense of the overall results can be seen in Figure 1.3, which aggregates the various tasks (though it should not be seen as a rigorous or meaningful benchmark in itself).
We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.
In addition to all the above, we also train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero, one and few-shot settings. Broadly, for most tasks we find relatively smooth scaling with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.
Finally, given the broad spectrum of capabilities displayed by GPT-3, we discuss concerns about bias, fairness, and broader societal impacts, and attempt a preliminary analysis of GPT-3’s characteristics in this regard.
The remainder of this paper is organized as follows. In Section 2, we describe our approach and methods for training GPT-3 and evaluating it. Section 3 presents results on the full range of tasks in the zero-, one- and few-shot settings. Section 4 addresses questions of data contamination (train-test overlap). Section 5 discusses limitations of GPT-3. Section 6 discusses broader impacts. Section 7 reviews related work and Section 8 concludes.
2 Approach
Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):
• Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.
• Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set $K$ in the range of 10 to 100 as this is how many examples can fit in the model’s context window ($n_{\mathrm{ctx}}=2048$). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task.
• One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.

Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning. The panels above show four methods for performing a task with a language model – fine-tuning is the traditional method, whereas zero-, one-, and few-shot, which we study in this work, require the model to perform the task with only forward passes at test time. We typically present the model with a few dozen examples in the few-shot setting. Exact phrasings for all task descriptions, examples and prompts can be found in Appendix G.
• Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of pre-training data), but is also the most challenging setting. In some cases it may even be difficult for humans to understand the format of the task without prior examples, so this setting is in some cases “unfairly hard”. For example, if someone is asked to “make a table of world records for the 200m dash”, this request can be ambiguous, as it may not be clear exactly what format the table should have or what should be included (and even with careful clarification, understanding precisely what is desired can be difficult). Nevertheless, for at least some settings zero-shot is closest to how humans perform tasks – for example, in the translation example in Figure 2.1, a human would likely know what to do from just the text instruction.
Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.
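The zero-, one-, and few-shot settings described above differ only in how many demonstrations are placed in the prompt. The following is a minimal sketch of that idea, not the exact prompt format used in the paper (the real phrasings are in Appendix G). The function name `build_prompt`, the `=>` separator, and the example word pairs are illustrative assumptions.

```python
def build_prompt(task_description, demonstrations, query, k):
    """Assemble an in-context prompt with k demonstrations.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, larger k few-shot.
    The "=>" separator is an assumption made for this sketch.
    """
    lines = [task_description]
    # Each demonstration is a (context, completion) pair.
    for context, completion in demonstrations[:k]:
        lines.append(f"{context} => {completion}")
    # The final context is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Hypothetical English-to-French demonstrations.
demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]

zero_shot = build_prompt("Translate English to French:", demos, "cheese", k=0)
one_shot = build_prompt("Translate English to French:", demos, "cheese", k=1)
few_shot = build_prompt("Translate English to French:", demos, "cheese", k=3)

print(few_shot)
```

In all three cases the model weights stay fixed; only the conditioning text changes, which is what separates these settings from fine-tuning.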
Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.
Table 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models which we trained. All models were trained for a total of 300 billion tokens.
| Model Name | $n_{\mathrm{params}}$ | $n_{\mathrm{layers}}$ | $d_{\mathrm{model}}$ | $n_{\mathrm{heads}}$ | $d_{\mathrm{head}}$ | Batch Size | Learning Rate |
|---|---|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 64 | 0.5M | $6.0 \times 10^{-4}$ |
| GPT-3 Medium | 350M | 24 | 1024 | 16 | 64 | 0.5M | $3.0 \times 10^{-4}$ |
| GPT-3 Large | 760M | 24 | 1536 | 16 | 96 | 0.5M | $2.5 \times 10^{-4}$ |
| GPT-3 XL | 1.3B | 24 | 2048 | 24 | 128 | 1M | $2.0 \times 10^{-4}$ |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 | 80 | 1M | $1.6 \times 10^{-4}$ |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 | 128 | 2M | $1.2 \times 10^{-4}$ |
| GPT-3 13B | 13.0B | 40 | 5140 | 40 | 128 | 2M | $1.0 \times 10^{-4}$ |
| GPT-3 175B or “GPT-3” | 175.0B | 96 | 12288 | 96 | 128 | 3.2M | $0.6 \times 10^{-4}$ |
2.1 Model and Architectures
We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.
Table 2.1 shows the sizes and architectures of our 8 models. Here $n_{\mathrm{params}}$ is the total number of trainable parameters, $n_{\mathrm{layers}}$ is the total number of layers, $d_{\mathrm{model}}$ is the number of units in each bottleneck layer (we always have the feed-forward layer four times the size of the bottleneck layer, $d_{\mathrm{ff}} = 4 \cdot d_{\mathrm{model}}$), and $d_{\mathrm{head}}$ is the dimension of each attention head. All models use a context window of $n_{\mathrm{ctx}}=2048$ tokens. We partition the model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes. The precise architectural parameters for each model are chosen based on computational efficiency and load-balancing in the layout of models across GPUs. Previous work [KMH+20] suggests that validation loss is not strongly sensitive to these parameters within a reasonably broad range.
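As a rough sanity check on Table 2.1, the parameter counts can be approximated from $n_{\mathrm{layers}}$ and $d_{\mathrm{model}}$ alone. The sketch below uses the standard dense-transformer estimate of about $12 \cdot d_{\mathrm{model}}^2$ weights per layer (attention projections plus a feed-forward block with $d_{\mathrm{ff}} = 4 \cdot d_{\mathrm{model}}$) and ignores biases, layer-norm parameters, and the sparse attention pattern. The vocabulary size of 50257 is an assumption carried over from GPT-2 and is not stated in this section.

```python
VOCAB_SIZE = 50257  # assumption: GPT-2 BPE vocabulary size
N_CTX = 2048        # context window stated in the text


def approx_params(n_layers, d_model, vocab_size=VOCAB_SIZE, n_ctx=N_CTX):
    """Approximate trainable parameter count of a dense decoder-only transformer."""
    d_ff = 4 * d_model
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff      # two linear maps of the MLP block
    per_layer = attention + feed_forward   # equals 12 * d_model ** 2
    embeddings = (vocab_size + n_ctx) * d_model  # token + position embeddings
    return n_layers * per_layer + embeddings


# (n_layers, d_model) taken from Table 2.1.
configs = {
    "GPT-3 Small": (12, 768),
    "GPT-3 6.7B": (32, 4096),
    "GPT-3 175B": (96, 12288),
}

for name, (n_layers, d_model) in configs.items():
    print(f"{name}: ~{approx_params(n_layers, d_model) / 1e9:.2f}B parameters")
```

Under these assumptions the estimates land within a couple of percent of the $n_{\mathrm{params}}$ column for the three rows shown.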
2.2 Training Dataset
Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of Common Crawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment Common Crawl and increase its diversity.
Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in $[\mathrm{KMH}^{+}20]$ , two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.
前两点的详细信息(Common Crawl的处理)在附录A中进行了描述。对于第三点,我们添加了几个精选的高质量数据集,包括通过长时间抓取链接收集的WebText数据集的扩展版本 [RWC+19],首次在 $[\mathrm{KMH}^{+}20]$ 中描述,以及两个基于互联网的书籍语料库(Books1和Books2)和英文维基百科。
Table 2.2 shows the final mixture of datasets that we used in training. The Common Crawl data was downloaded from 41 shards of monthly Common Crawl covering 2016 to 2019, constituting 45TB of compressed plaintext before filtering and 570GB after filtering, roughly equivalent to 400 billion byte-pair-encoded tokens. Note that during training, datasets are not sampled in proportion to their size; rather, datasets we view as higher-quality are sampled more frequently, such that the Common Crawl and Books2 datasets are sampled less than once during training, while the other datasets are sampled 2-3 times. This essentially accepts a small amount of overfitting in exchange for higher-quality training data.
表 2.2 展示了我们在训练中使用的最终数据集混合情况。Common Crawl 数据是从 2016 年至 2019 年的 41 个月度 Common Crawl 分片中下载的,过滤前构成了 45TB 的压缩纯文本,过滤后为 570GB,大约相当于 4000 亿个字节对编码的 Token。需要注意的是,在训练过程中,数据集并不是按其大小比例进行采样的,而是我们认为质量更高的数据集会被更频繁地采样,因此 Common Crawl 和 Books2 数据集在训练过程中被采样的次数少于一次,而其他数据集则被采样 2-3 次。这本质上是为了换取更高质量的训练数据而接受少量的过拟合。

Figure 2.2: Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models $[\mathrm{KMH}^{+}20]$ we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B is almost 10x larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of compute during pre-training. Methodology for these calculations can be found in Appendix D.
图 2.2: 训练期间使用的总计算量。基于《Scaling Laws For Neural Language Models》[KMH+20] 的分析,我们在比通常少得多的 Token 上训练了更大的模型。因此,尽管 GPT-3 3B 几乎比 RoBERTa-Large(355M 参数)大 10 倍,但两个模型在预训练期间都花费了大约 50 petaflop/s-days 的计算量。这些计算的方法可以在附录 D 中找到。
Table 2.2: Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.
表 2.2: 用于训练 GPT-3 的数据集。“训练混合中的权重”指的是训练期间从给定数据集中抽取的样本比例,我们有意不使其与数据集的大小成比例。因此,当我们训练 3000 亿个 Token 时,一些数据集在训练期间会被看到多达 3.4 次,而其他数据集则被看到不到一次。
| Dataset | Quantity (tokens) | Weight in training mix | Epochs elapsed when training for 300B tokens |
|---|---|---|---|
| Common Crawl (filtered) | 410 billion | 60% | 0.44 |
| WebText2 | 19 billion | 22% | 2.9 |
| Books1 | 12 billion | 8% | 1.9 |
| Books2 | 55 billion | 8% | 0.43 |
| Wikipedia | 3 billion | 3% | 3.4 |
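The relationship between the columns of Table 2.2 can be checked with a few lines of arithmetic. The sketch below assumes that the number of epochs is simply the mixture weight times the 300 billion training tokens, divided by the dataset size; the function name `epochs_seen` is hypothetical. With the rounded weights shown in the table (which sum to 101%), this reproduces the Common Crawl and Books2 rows closely and the other rows only approximately, presumably because the published weights are rounded.

```python
TOTAL_TRAINING_TOKENS = 300e9  # "when we train for 300 billion tokens"

# (dataset, tokens in dataset, weight in training mix), copied from Table 2.2
MIXTURE = [
    ("Common Crawl (filtered)", 410e9, 0.60),
    ("WebText2", 19e9, 0.22),
    ("Books1", 12e9, 0.08),
    ("Books2", 55e9, 0.08),
    ("Wikipedia", 3e9, 0.03),
]


def epochs_seen(dataset_tokens, weight, total_tokens=TOTAL_TRAINING_TOKENS):
    """Approximate number of passes over a dataset (assumed formula)."""
    return weight * total_tokens / dataset_tokens


if __name__ == "__main__":
    for name, tokens, weight in MIXTURE:
        print(f"{name:26s} ~{epochs_seen(tokens, weight):.2f} epochs")
```

Values below 1 correspond to datasets that are never fully traversed, and values above 1 to the small amount of repetition the text accepts in exchange for higher-quality data.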
A major methodological concern with language models pretrained on a broad swath of internet data, particularly large models with the capacity to memorize vast amounts of content, is potential contamination of downstream tasks by having their test or development sets inadvertently seen during pre-training. To reduce such contamination, we searched for and attempted to remove any overlaps with the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug in the filtering caused us to ignore some overlaps, and due to the cost of training it was not feasible to retrain the model. In Section 4 we characterize the impact of the remaining overlaps, and in future work we will more aggressively remove data contamination.
一个主要的方法论问题是,基于广泛互联网数据预训练的语言模型,特别是那些能够记忆大量内容的大型模型,可能会在下游任务中因预训练期间无意中看到其测试或开发集而导致潜在的污染。为了减少这种污染,我们搜索并尝试移除与本文研究的所有基准测试的开发集和测试集的重叠部分。不幸的是,过滤过程中的一个错误导致我们忽略了一些重叠,由于训练成本的原因,重新训练模型是不可行的。在第4节中,我们描述了剩余重叠的影响,并在未来的工作中将更积极地移除数据污染。
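One simple way to search for overlaps of this kind is to flag any training document that shares a word n-gram with a benchmark example. The sketch below is only an illustration of that idea: the n-gram length, the lowercasing and whitespace tokenization, and all function names are assumptions, not the procedure used for GPT-3.

```python
def ngrams(text, n):
    """Set of word n-grams in `text` (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts, n):
    """Collect every n-gram that occurs in any benchmark example."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def is_contaminated(document, index, n):
    """True if the document shares at least one n-gram with the benchmarks."""
    return not ngrams(document, n).isdisjoint(index)


def filter_documents(documents, benchmark_texts, n=8):
    """Keep only documents with no n-gram overlap (n=8 is an arbitrary choice)."""
    index = build_benchmark_index(benchmark_texts, n)
    return [doc for doc in documents if not is_contaminated(doc, index, n)]
```

A shorter `n` removes more data but raises the false-positive rate, so the threshold is a trade-off rather than a fixed constant.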
2.3 Training Process
2.3 训练过程
As found in $[\mathrm{KMH}^{+}20$, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table 2.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. Details of the training process and hyperparameter settings are described in Appendix B.
如 $[\mathrm{KMH}^{+}20]$ 和 [MKAT18] 中所发现的那样,更大的模型通常可以使用更大的批量大小,但需要更小的学习率。我们在训练过程中测量梯度噪声尺度,并使用它来指导我们选择批量大小 [MKAT18]。表 2.1 显示了我们使用的参数设置。为了在不耗尽内存的情况下训练更大的模型,我们在每个矩阵乘法内部和网络层之间使用了模型并行化的混合方法。所有模型都在 Microsoft 提供的高带宽集群的一部分 V100 GPU 上进行训练。训练过程和超参数设置的详细信息在附录 B 中描述。
Table 2.1: Parameter settings.
For few-shot learning, we evaluate each example in the evaluation set by randomly drawing $K$ examples from that task’s training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.
对于少样本学习,我们通过从每个任务的训练集中随机抽取 $K$ 个样本作为条件来评估评估集中的每个样本,具体分隔符根据任务的不同使用 1 个或 2 个换行符。对于 LAMBADA 和 Storycloze,由于没有监督训练集可用,我们从开发集中抽取条件样本并在测试集上进行评估。对于 Winograd(原始版本,而非 SuperGLUE 版本),只有一个数据集,因此我们直接从中抽取条件样本。
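The conditioning scheme described above can be sketched as a small prompt builder: draw $K$ demonstrations at random, join them with a newline delimiter, and append the context to be completed. The function name, the `(context, completion)` tuple layout, and the `seed` parameter are illustrative assumptions.

```python
import random


def build_few_shot_prompt(train_examples, query_context, k,
                          delimiter="\n\n", seed=None):
    """Join K randomly drawn demonstrations and the query into one prompt.

    train_examples: list of (context, completion) string pairs.
    delimiter: "\n" or "\n\n", depending on the task.
    """
    rng = random.Random(seed)
    demos = rng.sample(train_examples, k)  # sampling without replacement
    parts = [context + completion for context, completion in demos]
    parts.append(query_context)  # the example to evaluate, without its answer
    return delimiter.join(parts)


if __name__ == "__main__":
    examples = [("Q: 1+1? A:", " 2"), ("Q: 2+2? A:", " 4"), ("Q: 3+3? A:", " 6")]
    print(build_few_shot_prompt(examples, "Q: 4+4? A:", k=2, seed=0))
```

With `k=0` the prompt reduces to the query context alone, which corresponds to the zero-shot case without a natural language instruction.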
$K$ can be any value from 0 to the maximum amount allowed by the model’s context window, which is $n_{\mathrm{ctx}}=2048$ for all models and typically fits 10 to 100 examples. Larger values of $K$ are usually but not always better, so when a separate development and test set are available, we experiment with a few values of $K$ on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for $K=0$ , instead of) demonstrations.
$K$ 可以是 0 到模型上下文窗口允许的最大值之间的任何值,对于所有模型来说,$n_{\mathrm{ctx}}=2048$,通常可以容纳 10 到 100 个示例。较大的 $K$ 值通常但不总是更好,因此当有独立的开发和测试集时,我们在开发集上尝试几个 $K$ 值,然后在测试集上运行最佳值。对于某些任务(见附录 G),除了演示(或对于 $K=0$,代替演示)之外,我们还使用自然语言提示。
On tasks that involve choosing one correct completion from several options (multiple choice), we provide $K$ examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length); however, on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, by computing $\frac{P(\mathrm{completion} \mid \mathrm{context})}{P(\mathrm{completion} \mid \mathrm{answer\_context})}$, where $\mathrm{answer\_context}$ is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.
在涉及从多个选项中选择一个正确完成的任务(多项选择)中,我们提供 $K$ 个上下文加正确完成的示例,然后提供一个仅上下文的示例,并比较每个完成的语言模型 (LM) 似然值。对于大多数任务,我们比较每个 Token 的似然值(以进行长度归一化),然而在少数数据集(ARC、OpenBookQA 和 RACE)上,我们通过归一化每个完成的无条件概率,在开发集上获得了额外的收益,通过计算 $\frac{P(\mathrm{completion} \mid \mathrm{context})}{P(\mathrm{completion} \mid \mathrm{answer\_context})}$,其中 $\mathrm{answer\_context}$ 是字符串 "Answer: " 或 "A: ",用于提示完成应该是一个答案,但在其他方面是通用的。
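The two scoring rules above can be written down directly in terms of per-token log-probabilities. The sketch below assumes those log-probabilities have already been obtained from some language model; the function names are hypothetical, and working in log space (so the ratio becomes a difference) is an implementation choice rather than something the text specifies.

```python
def per_token_score(token_logprobs):
    """Mean log-probability per token, which normalizes for completion length."""
    return sum(token_logprobs) / len(token_logprobs)


def unconditional_normalized_score(cond_logprobs, uncond_logprobs):
    """log [ P(completion | context) / P(completion | answer_context) ].

    cond_logprobs: token log-probs of the completion given the full context.
    uncond_logprobs: token log-probs of the same completion given only a
    generic answer context such as "Answer: ".
    """
    return sum(cond_logprobs) - sum(uncond_logprobs)


def pick_completion(scores):
    """Index of the highest-scoring candidate completion."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    short = [-1.0, -1.0]             # total log-prob -2.0 over 2 tokens
    long = [-0.5, -0.5, -0.5, -0.5]  # total log-prob -2.0 over 4 tokens
    scores = [per_token_score(short), per_token_score(long)]
    print(pick_completion(scores))   # the longer option wins per token
```

The unconditional normalization down-weights completions that are likely regardless of the question, which is a plausible reason it helps on some datasets.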
On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similarly to what is done by $[\mathrm{RSR}^{+}19]$ (see Appendix G for details).
在涉及二元分类的任务中,我们为选项赋予更具语义意义的名称(例如“True”或“False”而不是0或1),然后将任务视为多项选择;有时我们也会将任务框架化,类似于 $[\mathrm{RSR}^{+}19]$ 的做法(详见附录G)。
On tasks with free-form completion, we use beam search with the same parameters as $[\mathrm{RSR}^{+}19]$ : a beam width of 4 and a length penalty of $\alpha=0.6$ . We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.
在自由形式完成的任务中,我们使用与 $[\mathrm{RSR}^{+}19]$ 相同的参数进行束搜索 (beam search) :束宽度为 4,长度惩罚为 $\alpha=0.6$ 。我们根据手头数据集的标准,使用 F1 相似度得分、BLEU 或精确匹配来评分模型。
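For reference, exact match and a token-overlap F1 score for free-form answers can be sketched as follows. The normalization used here (lowercasing and whitespace tokenization) is an assumption; the official scoring script of each dataset typically applies its own normalization, so numbers from this sketch are not guaranteed to match published ones.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True if the two strings agree after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit when a generated answer contains extra or missing words, whereas exact match only rewards identical strings.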
Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PiQa) where we were able to make submission work, and we submit only the 175B few-shot results, and report development set results for everything else.
最终结果在测试集上报告(当测试集公开可用时),针对每个模型大小和学习设置(零样本、单样本和少样本)。当测试集为私有时,我们的模型通常太大,无法适应测试服务器,因此我们在开发集上报告结果。我们确实在少数数据集(SuperGLUE、TriviaQA、PiQa)上提交到测试服务器,这些数据集我们能够成功提交,并且我们仅提交175B少样本结果,其他所有结果均在开发集上报告。
3 Results
3 结果
In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6 additional extra-small models with as few as 100,000 parameters. As observed in $[\mathrm{KMH}^{+}20]$ , language modeling performance follows a power-law when making efficient use of training compute. After extending this trend by two more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a broad spectrum of natural language tasks.
在图 3.1 中,我们展示了第 2 节中描述的 8 个模型的训练曲线。在此图中,我们还包含了 6 个额外的超小型模型,这些模型的参数量仅为 100,000。正如 $[\mathrm{KMH}^{+}20]$ 中所观察到的,当有效利用训练计算资源时,语言建模性能遵循幂律。在将这一趋势扩展了两个数量级后,我们观察到仅有轻微(如果有的话)偏离幂律的情况。有人可能会担心,这些交叉熵损失的改进仅来自于对训练语料库中虚假细节的建模。然而,我们将在接下来的章节中看到,交叉熵损失的改进在广泛的自然语言任务中带来了一致的性能提升。
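A power-law trend of the form $L = a \cdot C^{b}$ is a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below shows that fit in plain Python; the function name and the synthetic data points are hypothetical and are not measurements from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by least squares on (log compute, log loss)."""
    log_x = [math.log(x) for x in compute]
    log_y = [math.log(y) for y in loss]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                     # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept gives the prefactor
    return a, b


if __name__ == "__main__":
    # Synthetic points generated from a known power law, for illustration only.
    compute = [1e0, 1e1, 1e2, 1e3]
    loss = [2.5 * c ** -0.05 for c in compute]
    a, b = fit_power_law(compute, loss)
    print(f"a={a:.3f}, b={b:.3f}")
```

Extrapolating the fitted line to larger compute and comparing against newly trained models is one way to quantify a "departure from the power-law".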
Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.
下面,我们在广泛的数据集上评估第2节中描述的8个模型(1750亿参数的GPT-3和7个较小的模型)。我们将这些数据集分为9个类别,代表大致相似的任务。
In Section 3.1 we evaluate on traditional language modeling tasks and tasks that are similar to language modeling, such as Cloze tasks and sentence/paragraph completion tasks. In Section 3.2 we evaluate on “closed book” question answering tasks: tasks which require using the information stored in the model’s parameters to answer general knowledge questions. In Section 3.3 we evaluate the model’s ability to translate between languages (especially one-shot and few-shot). In Section 3.4 we evaluate the model’s performance on Winograd Schema-like tasks. In Section 3.5 we evaluate on datasets that involve commonsense reasoning or question answering. In Section 3.6 we evaluate on reading comprehension tasks, in Section 3.7 we evaluate on the SuperGLUE benchmark suite, and in Section 3.8 we briefly explore NLI. Finally, in Section 3.9, we invent some additional tasks designed especially to probe in-context learning abilities – these tasks focus on on-the-fly reasoning, adaptation skills, or open-ended text synthesis. We evaluate all tasks in the few-shot, one-shot, and zero-shot settings.
在3.1节中,我们评估了传统的语言建模任务以及类似于语言建模的任务,例如完形填空任务和句子/段落补全任务。在3.2节中,我们评估了“闭卷”问答任务:这些任务需要使用模型参数中存储的信息来回答一般知识问题。在3.3节中,我们评估了模型在语言之间翻译的能力(尤其是一次样本和少样本)。在3.4节中,我们评估了模型在类似Winograd Schema任务上的表现。在3.5节中,我们评估了涉及常识推理或问答的数据集。在3.6节中,我们评估了阅读理解任务,在3.7节中,我们评估了SuperGLUE基准套件,并在3.8节中简要探讨了自然语言推理(NLI)。最后,在3.9节中,我们设计了一些额外的任务,专门用于探究上下文学习能力——这些任务侧重于即时推理、适应能力或开放式文本合成。我们在少样本、一次样本和零样本设置下评估了所有任务。

Figure 3.1: Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior observed in $[\mathrm{KMH}^{+}20]$ continues for an additional two orders of magnitude with only small deviations from the predicted curve. For this figure, we exclude embedding parameters from compute and parameter counts.
图 3.1: 计算量与性能的平滑扩展。性能(以交叉熵验证损失衡量)与用于训练的计算量呈幂律趋势。在 $[\mathrm{KMH}^{+}20]$ 中观察到的幂律行为在额外两个数量级上继续存在,仅与预测曲线有微小偏差。在本图中,我们从计算量和参数计数中排除了嵌入参数。
Table 3.1: Zero-shot results on PTB language modeling dataset. Many other common language modeling datasets are omitted because they are derived from Wikipedia or other sources which are included in GPT-3’s training data. $^{a}[\mathrm{RWC}^{+}19]$
表 3.1: PTB 语言建模数据集上的零样本结果。许多其他常见的语言建模数据集被省略,因为它们源自维基百科或其他包含在 GPT-3 训练数据中的来源。$^{a}[\mathrm{RWC}^{+}19]$
| Setting | PTB |
|---|---|
| SOTA (Zero-Shot) | 35.8$^{a}$ |
| GPT-3 Zero-Shot | 20.5 |
3.1 Language Modeling, Cloze, and Completion Tasks
3.1 语言建模、完形填空和补全任务
In this section we test GPT-3’s performance on the traditional task of language modeling, as well as related tasks that involve predicting a single word of interest, completing a sentence or paragraph, or choosing between possible completions of a piece of text.
在本节中,我们测试了 GPT-3 在传统语言建模任务中的表现,以及涉及预测单个关键词、完成句子或段落、或在文本的多个可能完成选项之间做出选择的相关任务。
3.1.1 Language Modeling
3.1.1 语言建模
We calculate zero-shot perplexity on the Penn Tree Bank (PTB) $[\mathrm{MKM^{+}94}]$ dataset measured in $[\mathrm{RWC}^{+}19]$ . We omit the 4 Wikipedia-related tasks in that work because they are entirely contained in our training data, and we also omit the one-billion word benchmark due to a high fraction of the dataset being contained in our training set. PTB escapes these issues due to predating the modern internet. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points, achieving a perplexity of 20.50. Note that since PTB is a traditional language modeling dataset it does not have a clear separation of examples to define one-shot or few-shot evaluation around, so we measure only zero-shot.
我们在 Penn Tree Bank (PTB) 数据集上计算了零样本困惑度 (zero-shot perplexity),该数据集在 $[\mathrm{RWC}^{+}19]$ 中进行了测量。我们省略了该工作中的 4 个与 Wikipedia 相关的任务,因为它们完全包含在我们的训练数据中,并且由于数据集中有大量内容包含在我们的训练集中,我们也省略了十亿词基准测试 (one-billion word benchmark)。PTB 由于早于现代互联网,避免了这些问题。我们最大的模型在 PTB 上以 15 分的显著优势创下了新的 SOTA (State of the Art),达到了 20.50 的困惑度。需要注意的是,由于 PTB 是一个传统的语言建模数据集,它没有明确的示例分离来定义单样本或少样本评估,因此我们仅测量零样本。
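Perplexity is the exponential of the average negative log-likelihood per token, so a lower value means the model assigns higher probability to the held-out text. The sketch below assumes natural-log token probabilities are already available from a model; the function name is hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token has perplexity 4.
    uniform = [math.log(0.25)] * 4
    print(perplexity(uniform))
```

Reported perplexities also depend on how the text is tokenized and on whether the average is taken per word or per sub-word token, so values are only comparable under a shared convention.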
3.1.2 LAMBADA
3.1.2 LAMBADA
The LAMBADA dataset $[\mathrm{PKL}^{+}16]$ tests the modeling of long-range dependencies in text: the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. $[\mathrm{BHT}^{+}20]$ reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state-of-the-art results ($[\mathrm{SPP}^{+}19]$ and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising, and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.
LAMBADA 数据集 [PKL+16] 测试了文本中长距离依赖关系的建模:模型被要求预测句子的最后一个词,这需要阅读一段上下文。最近有观点认为,语言模型的持续扩展在这一困难基准上带来的收益正在递减。[BHT+20] 反思了在两个最新最先进结果([SPP+19] 和 [Tur20])之间,模型规模翻倍仅带来 1.5% 的小幅提升,并认为“继续以数量级扩展硬件和数据规模并不是前进的方向”。我们发现这条路径仍然充满希望,在零样本设置下,GPT-3 在 LAMBADA 上达到了 76%,比之前的最先进水平提高了 8%。
| Setting | LAMBADA (acc) | LAMBADA (ppl) | StoryCloze (acc) | HellaSwag (acc) |
|---|---|---|---|---|
| SOTA | 68.0$^{a}$ | 8.63$^{b}$ | 91.8$^{c}$ | 85.6$^{d}$ |
| GPT-3 Zero-Shot | 76.2 | 3.00 | 83.2 | 78.9 |
| GPT-3 One-Shot | 72.5 | 3.35 | 84.7 | 78.1 |
| GPT-3 Few-Shot | 86.4 | 1.92 | 87.7 | 79.3 |

Table 3.2: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while achieving respectable performance on two difficult completion prediction datasets. $^{a}$[Tur20] $^{b}$[RWC+19] $^{c}$[LDL19] $^{d}$[LCH+20]
表 3.2: 填空和补全任务的表现。GPT-3 在 LAMBADA 上显著提升了 SOTA(State of the Art),同时在两个困难的补全预测数据集上取得了不错的表现。$^{a}$[Tur20] $^{b}$[RWC+19] $^{c}$[LDL19] $^{d}$[LCH+20]
Figure 3.2: On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3 2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of the art by 18%. Note zero-shot uses a different format from one-shot and few-shot as described in the text.
图 3.2: 在 LAMBADA 上,语言模型的少样本能力显著提升了准确率。GPT-3 2.7B 在该设置下超越了 SOTA 17B 参数的 Turing-NLG [Tur20],而 GPT-3 175B 将 SOTA 提升了 18%。需要注意的是,零样本使用的格式与单样本和少样本不同,如文中所述。
LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word filters $[\mathrm{RWC}^{+}19]$ (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We use the following fill-in-the-blank format:
LAMBADA 也是少样本学习灵活性的一个展示,因为它提供了一种解决该数据集经典问题的方法。尽管 LAMBADA 中的补全总是句子的最后一个词,但标准的语言模型无法知道这一细节。因此,它不仅为正确的结尾分配概率,还为段落的其他有效延续分配概率。过去,这个问题已经通过停用词过滤器 $[\mathrm{RWC}^{+}19]$ (禁止“延续”词)部分解决。而少样本设置则允许我们将任务“框架”为完形填空测试,并让语言模型从示例中推断出需要精确补全一个词。我们使用以下填空格式:
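A fill-in-the-blank prompt of this kind might be assembled as in the sketch below: each demonstration shows a passage with its final word replaced by a blank, followed by the answer, and the query ends where the model should supply one word. The blank marker `____`, the `->` separator, the example sentences, and the function names are illustrative assumptions rather than the exact strings used in the experiments.

```python
def format_cloze(context, answer=None, blank="____", arrow="->"):
    """Render one cloze line; omit `answer` for the example to be completed."""
    prompt = f"{context} {blank}. {arrow}"
    return prompt if answer is None else f"{prompt} {answer}"


def build_cloze_prompt(demos, query_context, delimiter="\n\n"):
    """demos: list of (context_without_last_word, last_word) pairs."""
    lines = [format_cloze(context, answer) for context, answer in demos]
    lines.append(format_cloze(query_context))
    return delimiter.join(lines)


if __name__ == "__main__":
    demos = [("She poured the tea into her", "cup")]
    print(build_cloze_prompt(demos, "He locked the door with his"))
```

Because every demonstration ends in a single word, the model can infer from the pattern alone that exactly one word is expected.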
When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase of over 18% from the previous state of the art. We observe that few-shot performance improves strongly with model size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy by 10%. Finally, the fill-in-the-blank method is not effective one-shot, where it always performs worse than the zero-shot setting. Perhaps this is because all models still require several examples to recognize the pattern.
当以这种方式呈现示例时,GPT-3 在少样本设置中达到了 $86.4%$ 的准确率,比之前的最先进水平提高了超过 $18%$。我们观察到,少样本性能随着模型规模的增加而显著提升。虽然这种设置使最小模型的性能下降了近 $20%$,但对于 GPT-3 来说,准确率提高了 $10%$。最后,填空方法在单样本设置中并不有效,其表现始终不如零样本设置。这可能是因为所有模型仍然需要多个示例才能识别出模式。
Table 3.3: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the wiki split test server.
表 3.3: 三个开放域问答任务的结果。GPT-3 在少样本、单样本和零样本设置下的表现,与之前闭书和开放域设置的 SOTA 结果进行了比较。TriviaQA 的少样本结果是在 wiki 分割测试服务器上评估的。
| Setting | NaturalQs | WebQS | TriviaQA |
|---|---|---|---|
| RAG (Fine-tuned, Open-Domain) [LPP+20] | 44.5 | 45.5 | 68.0 |
| T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20] | 36.6 | 44.7 | 60.5 |
| T5-11B (Fine-tuned, Closed-Book) | 34.5 | 37.4 | 50.1 |
| GPT-3 Zero-Shot | 14.6 | 14.4 | 64.3 |
| GPT-3 One-Shot | 23.0 | 25.3 | 68.0 |
| GPT-3 Few-Shot | 29.9 | 41.5 | 71.2 |
One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance.
需要注意的是,对测试集污染的分析发现,LAMBADA 数据集中有相当一部分似乎出现在我们的训练数据中——然而,第 4 节中的分析表明,这对性能的影响可以忽略不计。
3.1.3 HellaSwag
3.1.3 HellaSwag
The HellaSwag dataset $[\mathrm{ZHB}^{+}19]$ involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans (who achieve 95.6% accuracy). GPT-3 achieves 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, outperforming the 75.4% accuracy of a fine-tuned 1.5B parameter language model $[\mathrm{ZHR}^{+}19]$ but still a fair amount lower than the overall SOTA of 85.6% achieved by the fine-tuned multi-task model ALUM.
HellaSwag 数据集 [ZHB+19] 涉及为故事或一组指令选择最佳结尾。这些示例经过对抗性挖掘,使其对语言模型来说具有挑战性,但对人类来说仍然容易(人类准确率达到 95.6%)。GPT-3 在单样本设置中达到了 78.1% 的准确率,在少样本设置中达到了 79.3% 的准确率,优于微调的 1.5B 参数语言模型 [ZHR+19] 的 75.4% 准确率,但仍远低于微调多任务模型 ALUM 实现的 85.6% 的总体 SOTA。
3.1.4 StoryCloze
3.1.4 StoryCloze
We next evaluate GPT-3 on the StoryCloze 2016 dataset $[\mathrm{MCH}^{+}16]$, which involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 achieves 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with $K=70$). This is still 4.1% lower than the fine-tuned SOTA using a BERT-based model [LDL19] but improves over previous zero-shot results by roughly 10%.
我们接下来在 StoryCloze 2016 数据集 $[\mathrm{MCH^{+}}16]$ 上评估 GPT-3,该数据集涉及为五句话长的故事选择正确的结尾句。在这里,GPT-3 在零样本设置下达到了 $83.2%$ 的准确率,在少样本设置下($K=70$)达到了 $87.7%$ 的准确率。这仍然比使用基于 BERT 的模型 [LDL19] 进行微调的 SOTA 低 $4.1%$,但比之前的零样本结果提高了大约 $10%$。
3.2 Closed Book Question Answering
3.2 闭卷问答
In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer, it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better, and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions $[\mathrm{KPR}^{+}19]$, Web Questions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represents an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.
在本节中,我们评估了 GPT-3 回答广泛事实性知识问题的能力。由于可能的查询数量巨大,这一任务通常通过使用信息检索系统来查找相关文本,并结合一个模型来生成给定问题和检索到的文本的答案。由于这种设置允许系统搜索并基于可能包含答案的文本进行条件生成,因此被称为“开卷” [RRS20]。最近,[RRS20] 证明了大语言模型在不需要辅助信息的情况下直接回答问题可以表现得非常出色。他们将这种更为严格的评估设置称为“闭卷”。他们的研究表明,更高容量的模型可能会表现得更好,我们使用 GPT-3 来测试这一假设。我们在 [RRS20] 中的 3 个数据集上评估 GPT-3:Natural Questions $[\mathrm{KPR}^{+}19]$、Web Questions [BCFL13] 和 TriviaQA [JCWZ17],使用相同的划分。需要注意的是,除了所有结果都在闭卷设置下,我们使用的少样本、单样本和零样本评估代表了比以往闭卷问答工作更为严格的设置:除了不允许使用外部内容外,也不允许在问答数据集上进行微调。
The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents $[\mathrm{LPP}^{+}20]$. GPT-3’s few-shot result further improves performance another 3.2% beyond this.
GPT-3 的结果如表 3.3 所示。在 TriviaQA 上,我们在零样本设置下达到了 64.3%,在单样本设置下达到了 68.0%,在少样本设置下达到了 71.2%。零样本结果已经比微调的 T5-11B 高出 14.2%,也比在预训练期间使用问答定制的跨度预测版本高出 3.8%。单样本结果提高了 3.7%,并与开放域问答系统的 SOTA 结果相当,该系统不仅进行了微调,还使用了在 21M 文档的 15.3B 参数密集向量索引上学习到的检索机制 [LPP+20]。GPT-3 的少样本结果在此基础上进一步提高了 3.2%。
On Web Questions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.
在Web Questions (WebQs) 数据集上,GPT-3 在零样本 (zero-shot) 设置下达到了 14.4%,在单样本 (one-shot) 设置下达到了 25.3%,在少样本 (few-shot) 设置下达到了 41.5%。相比之下,经过微调的 T5-11B 达到了 37.4%,而使用了问答特定预训练过程的 T5-11B+SSM 达到了 44.7%。在少样本设置下,GPT-3 的表现接近了最先进的微调模型。值得注意的是,与 TriviaQA 相比,WebQs 从零样本到少样本的提升要大得多(实际上其零样本和单样本表现较差),这可能表明 WebQs 的问题和/或其答案的风格超出了 GPT-3 的分布范围。尽管如此,GPT-3 似乎能够适应这种分布,在少样本设置下恢复了强劲的表现。

Figure 3.3: On TriviaQA GPT-3’s performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG $[\mathrm{LPP}^{+}20]$.
图 3.3: 在 TriviaQA 上,GPT-3 的性能随着模型规模的增加而平稳增长,这表明随着容量的增加,语言模型继续吸收知识。单样本和少样本性能相比零样本行为有显著提升,匹配并超过了 SOTA 微调开放域模型 RAG 的性能 $[\mathrm{LPP}^{+}20]$。
On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting, compared to 36.6% for fine-tuned T5-11B+SSM. Similar to WebQs, the large gain from zero-shot to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to TriviaQA and WebQs. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia specifically, which could be testing the limits of GPT-3’s capacity and broad pre-training distribution.
在自然问题 (NQs) 上,GPT-3 在零样本设置中达到了 $14.6%$,在单样本设置中达到了 $23.0%$,在少样本设置中达到了 $29.9%$,而经过微调的 T5 $11\mathrm{B}{+}\mathrm{S}\mathrm{S}\mathrm{M}$ 则达到了 $36.6%$。与 WebQS 类似,从零样本到少样本的大幅提升可能表明存在分布偏移,这也可能解释了与 TriviaQA 和 WebQS 相比表现不那么具有竞争力的原因。特别是,NQs 中的问题往往涉及非常细粒度的维基百科知识,这可能是在测试 GPT-3 的能力和广泛的预训练分布的极限。
Overall, on one of the three datasets GPT-3’s one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting the idea that model capacity translates directly to more ‘knowledge’ absorbed in the parameters of the model.
总体而言,在三个数据集中的一个上,GPT-3 的单样本表现与开放领域微调的 SOTA (State of the Art) 相当。在另外两个数据集上,尽管没有使用微调,GPT-3 的表现也接近闭卷 SOTA。在所有三个数据集上,我们发现模型的性能随着模型规模的增加而非常平滑地提升(图 3.3 和附录 H 图 H.7),这可能反映了模型容量直接转化为模型参数中吸收的更多“知识”的观点。
3.3 Translation
3.3 翻译
For GPT-2 a filter was used on a multilingual collection of documents to produce an English-only dataset due to capacity concerns. Even with this filtering GPT-2 showed some evidence of multilingual capability and performed non-trivially when translating between French and English despite only training on 10 megabytes of remaining French text. Since we increase the capacity by over two orders of magnitude from GPT-2 to GPT-3, we also expand the scope of the training dataset to include more representation of other languages, though this remains an area for further improvement. As discussed in Section 2.2, the majority of our data is derived from raw Common Crawl with only quality-based filtering. Although GPT-3’s training data is still primarily English (93% by word count), it also includes 7% of text in other languages. These languages are documented in the supplemental material. In order to better understand translation capability, we also expand our analysis to include two additional commonly studied languages, German and Romanian.
由于容量限制,GPT-2 在多语言文档集合上使用了过滤器,以生成仅包含英语的数据集。尽管进行了这种过滤,GPT-2 仍然显示出一定的多语言能力,并且在仅训练了 10 兆字节的法语文本的情况下,在法语和英语之间的翻译任务中表现不俗。由于我们从 GPT-2 到 GPT-3 将容量增加了两个数量级以上,我们也扩大了训练数据集的范围,以包含更多其他语言的代表性数据,尽管这仍然是一个需要进一步改进的领域。正如 2.2 节所讨论的,我们的数据主要来自原始 Common Crawl,仅进行了基于质量的过滤。尽管 GPT-3 的训练数据仍然以英语为主(按词数计算占 93%),但它也包含了 7% 的其他语言文本。这些语言在补充材料中有详细记录。为了更好地理解翻译能力,我们还扩展了分析范围,包括两种常用的研究语言:德语和罗马尼亚语。
Existing unsupervised machine translation approaches often combine pre-training on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together in a natural way, combining them on a word, sentence, and document level. GPT-3 also uses a single training objective which is not customized or designed for any task in particular. However, our one-shot and few-shot settings aren’t strictly comparable to prior unsupervised work since they make use of a small amount of paired examples (1 or 64). This corresponds to up to a page or two of in-context training data.
现有的无监督机器翻译方法通常将单语数据集对的预训练与回译 [SHB15] 结合起来,以可控的方式桥接两种语言。相比之下,GPT-3 从混合了多种语言的训练数据中学习,这些数据在单词、句子和文档级别上自然结合。GPT-3 还使用单一的训练目标,该目标并未针对任何特定任务进行定制或设计。然而,我们的单样本/少样本设置与之前的无监督工作并不严格可比,因为它们使用了少量的配对示例(1 或 64)。这相当于最多一两页的上下文训练数据。
Results are shown in Table 3.4. Zero-shot GPT-3, which only receives on a natural language description of the task, still under performs recent unsupervised NMT results. However, providing only a single example demonstration for
结果如表 3.4 所示。零样本 GPT-3 仅接收任务的自然语言描述,其表现仍不及最近的无监督神经机器翻译 (NMT) 结果。然而,仅提供一个示例演示...
| 设置 | En→→Fr | Fr→→En | En→→De | De→→En | En→→Ro | Ro→En |
|---|---|---|---|---|---|---|
| SOTA (监督学习) | 45.6a | 35.0 b | 41.2c | 40.2d | 38.5e | 39.9e |
| XLM [LC19] | 33.4 | 33.3 | 26.4 | 34.3 | 33.3 | 31.8 |
| MASS [STQ+19] | 37.5 | 34.9 | 28.3 | 35.2 | 35.2 | 33.1 |
| mBART [LGG+20] | - | - | 29.8 | 34.0 | 35.0 | 30.5 |
| GPT-3 零样本 | 25.2 | 21.2 | 24.6 | 27.2 | 14.1 | 19.9 |
| GPT-3 单样本 | 28.3 | 33.7 | 26.2 | 30.4 | 20.6 | 38.6 |
| GPT-3 少样本 | 32.6 | 39.2 | 29.7 | 40.6 | 21.0 | 39.5 |
Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating into English reflecting its strength as an English LM. We report BLEU scores on the WMT’14 $\mathrm{Fr}{\leftrightarrow}\mathrm{En}$ , WMT’16 $\scriptstyle\mathrm{De}\leftrightarrow\mathrm{En}$ , and WMT’16 $\scriptstyle\mathbf{Ro}\leftrightarrow\mathbf{En}$ datasets as measured by multi-bleu.perl with XLM’s tokenization in order to compare most closely with prior unsupervised NMT work. SacreBLEUf [Pos18] results reported in Appendix H. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA with relative confidence. a[EOAG18] b[DHKH14] $^c[\mathrm{WXH^{+}18}]$ d[oR16] $^{e}[\mathrm{LGG}^{+}20]$ f [SacreBLEU signature: BLEU $^+$ case.mixed+numrefs. $^{1+}$ smooth.exp+tok.intl+version.1.2.20]
表 3.4: 少样本 GPT-3 在翻译成英语时比之前的无监督神经机器翻译 (NMT) 工作高出 5 个 BLEU 分数,反映了其作为英语语言模型 (LM) 的优势。我们报告了 WMT'14 $\mathrm{Fr}{\leftrightarrow}\mathrm{En}$、WMT'16 $\scriptstyle\mathrm{De}\leftrightarrow\mathrm{En}$ 和 WMT'16 $\scriptstyle\mathbf{Ro}\leftrightarrow\mathbf{En}$ 数据集上的 BLEU 分数,这些分数是通过 multi-bleu.perl 使用 XLM 的 Token 化方法测量的,以便与之前的无监督 NMT 工作进行最接近的比较。SacreBLEUf [Pos18] 的结果在附录 H 中报告。下划线表示无监督或少样本的 SOTA (State-of-the-Art),粗体表示具有相对置信度的有监督 SOTA。a[EOAG18] b[DHKH14] $^c[\mathrm{WXH^{+}18}]$ d[oR16] $^{e}[\mathrm{LGG}^{+}20]$ f [SacreBLEU 签名: BLEU $^+$ case.mixed+numrefs. $^{1+}$ smooth.exp+tok.intl+version.1.2.20]

Figure 3.4: Few-shot translation performance on 6 language pairs as model capacity increases. There is a consistent trend of improvement across all datasets as the model scales, and as well as tendency for translation into English to be stronger than translation from English.

图 3.4: 随着模型容量的增加,6种语言对的少样本翻译性能。随着模型的扩展,所有数据集的性能都呈现出持续提升的趋势,并且翻译成英语的性能往往强于从英语翻译出来的性能。
| Setting | Winograd | Winogrande (XL) |
|---|---|---|
| Fine-tuned SOTA | 90.1$^a$ | 84.6$^b$ |
| GPT-3 Zero-Shot | 88.3* | 70.2 |
| GPT-3 One-Shot | 89.7* | 73.2 |
| GPT-3 Few-Shot | 88.6* | 77.7 |

Table 3.5: Results on the WSC273 version of Winograd schemas and the adversarial Winogrande dataset. See Section 4 for details on potential contamination of the Winograd test set. $^a$[SBBC19] $^b$[LYN+20]

Figure 3.5: Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales. Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B is competitive with a fine-tuned RoBERTa-large.
表 3.5: WSC273 版本的 Winograd 模式和对抗性 Winogrande 数据集的结果。有关 Winograd 测试集潜在污染的详细信息,请参见第 4 节。a[SBBC19] $^b[\mathrm{LYN}^{+}20]$
图 3.5: 随着模型容量的增加,对抗性 Winogrande 数据集上的零样本、单样本和少样本性能。随着模型规模的增加,少样本学习的收益相对平稳,少样本 GPT-3 175B 与微调的 RoBERTA-large 具有竞争力。
each translation task improves performance by over 7 BLEU and nears competitive performance with prior work. GPT-3 in the full few-shot setting further improves another 4 BLEU, resulting in similar average performance to prior unsupervised NMT work. GPT-3 has a noticeable skew in its performance depending on language direction. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2, which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few-shot GPT-3 outperforms the best supervised result we could find, but due to our unfamiliarity with the literature and the appearance that these are uncompetitive benchmarks, we do not suspect those results represent the true state of the art. For Ro-En, few-shot GPT-3 performs within 0.5 BLEU of the overall SOTA, which is achieved by a combination of unsupervised pretraining, supervised fine-tuning on 608K labeled examples, and back-translation [LHCG19b].
每次翻译任务都将性能提高了超过 7 BLEU,并且接近了先前工作的竞争性能。在完整的少样本设置中,GPT-3 进一步提高了 4 BLEU,使得平均性能与先前的无监督神经机器翻译 (NMT) 工作相当。GPT-3 的性能在不同语言方向上存在显著偏差。对于研究的三种输入语言,GPT-3 在翻译成英语时显著优于先前的无监督 NMT 工作,但在翻译成其他语言时表现不佳。在 En-Ro 上的性能是一个明显的异常值,比先前的无监督 NMT 工作差了超过 10 BLEU。这可能是由于重用了 GPT-2 的字节级 BPE Tokenizer 的弱点,该 Tokenizer 是为几乎完全由英语组成的训练数据集开发的。对于 Fr-En 和 De-En,少样本 GPT-3 优于我们找到的最佳监督结果,但由于我们对文献的不熟悉以及这些基准似乎没有竞争力,我们不认为这些结果代表了真正的技术水平。对于 Ro-En,少样本 GPT-3 的表现与整体 SOTA 相差不到 0.5 BLEU,而 SOTA 是通过无监督预训练、在 608K 标注样本上进行监督微调以及反向翻译 [LHCG19b] 的组合实现的。
Finally, across all language pairs and across all three settings (zero-, one-, and few-shot), there is a smooth trend of improvement with model capacity. This is shown in Figure 3.4 in the case of few-shot results, and scaling for all three settings is shown in Appendix H.
最后,在所有语言对和三种设置(零样本、单样本和少样本)中,模型容量的提升呈现出平稳的趋势。图 3.4 展示了少样本结果的情况,而所有三种设置的扩展情况见附录 H。
3.4 Winograd-Style Tasks
3.4 Winograd 风格任务
The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently, fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3’s performance on both Winograd and Winogrande, as usual in the zero-, one-, and few-shot settings.
Winograd Schemas Challenge [LDM12] 是 NLP 中的一个经典任务,涉及确定代词指代的是哪个词,当该代词在语法上存在歧义但在语义上对人类来说是无歧义的。最近经过微调的语言模型在原始的 Winograd 数据集上已经达到了接近人类的表现,但在更具挑战性的版本(如对抗性挖掘的 Winogrande 数据集 [SBBC19])上,仍然显著落后于人类的表现。我们在 Winograd 和 Winogrande 上测试了 GPT-3 的表现,通常是在零样本、单样本和少样本设置下进行的。
| Setting | PIQA | ARC (Easy) | ARC (Challenge) | OpenBookQA |
|---|---|---|---|---|
| Fine-tuned SOTA | 79.4 | 92.0 [KKS+20] | 78.5 [KKS+20] | 87.2 [KKS+20] |
| GPT-3 Zero-Shot | 80.5* | 68.8 | 51.4 | 57.6 |
| GPT-3 One-Shot | 80.5* | 71.2 | 53.2 | 58.8 |
| GPT-3 Few-Shot | 82.8* | 70.1 | 51.5 | 65.4 |

Table 3.6: GPT-3 results on three commonsense reasoning tasks, PIQA, ARC, and OpenBookQA. GPT-3 Few-Shot PIQA result is evaluated on the test server. See Section 4 for details on potential contamination issues on the PIQA test set.

Figure 3.6: GPT-3 results on PIQA in the zero-shot, one-shot, and few-shot settings. The largest model achieves a score on the development set in all three conditions that exceeds the best recorded score on the task.
表 3.6: GPT-3 在三个常识推理任务(PIQA、ARC 和 OpenBookQA)上的结果。GPT-3 的少样本 PIQA 结果是在测试服务器上评估的。有关 PIQA 测试集上潜在污染问题的详细信息,请参见第 4 节。
图 3.6: GPT-3 在零样本、单样本和少样本设置下的 PIQA 结果。最大模型在所有三种条件下的开发集得分都超过了该任务的最佳记录得分。
On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same “partial evaluation” method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd, GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data, but this appears to have only a small effect on results (see Section 4).
在 Winograd 数据集上,我们使用与 $[\mathrm{RWC}^{+}19]$ 中描述的相同的“部分评估”方法,对 GPT-3 在原始的 273 个 Winograd 模式上进行了测试。需要注意的是,此设置与 SuperGLUE 基准中的 WSC 任务略有不同,后者以二元分类的形式呈现,并需要实体提取以转换为本节中描述的形式。在 Winograd 数据集上,GPT-3 在零样本、单样本和少样本设置中分别达到了 $88.3%$、$89.7%$ 和 $88.6%$ 的准确率,虽然没有明显的上下文学习效果,但在所有情况下都取得了接近最先进水平和估计人类表现的强劲结果。我们注意到,污染分析发现训练数据中存在一些 Winograd 模式,但这似乎对结果的影响很小(见第 4 节)。
On the more difficult Winogrande dataset, we do find gains to in-context learning: GPT-3 achieves 70.2% in the zero-shot setting, 73.2% in the one-shot setting, and 77.7% in the few-shot setting. For comparison, a fine-tuned RoBERTa model achieves 79%, state-of-the-art is 84.6% achieved with a fine-tuned high-capacity model (T5), and human performance on the task as reported by [SBBC19] is 94.0%.
在更具挑战性的 Winogrande 数据集上,我们发现上下文学习确实带来了提升:GPT-3 在零样本设置下达到了 70.2%,在单样本设置下达到了 73.2%,在少样本设置下达到了 77.7%。作为对比,经过微调的 RoBERTa 模型达到了 79%,当前最佳成绩是由经过微调的高容量模型 (T5) 取得的 84.6%,而 [SBBC19] 报告的人类在该任务上的表现为 94.0%。
3.5 Common Sense Reasoning
3.5 常识推理
Next we consider three datasets which attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering. The first, PhysicalQA (PIQA) [BZB+19], asks common sense questions about how the physical world works and is intended as a probe of grounded understanding of the world. GPT-3 achieves 81.0% accuracy zero-shot, 80.5% accuracy one-shot, and 82.8% accuracy few-shot (the last measured on PIQA’s test server). This compares favorably to the 79.4% accuracy prior state-of-the-art of a fine-tuned RoBERTa. PIQA shows relatively shallow scaling with model size and is still over 10% worse than human performance, but GPT-3’s few-shot and even zero-shot results outperform the current state-of-the-art. Our analysis flagged PIQA for a potential data contamination issue (despite hidden test labels), and we therefore conservatively mark the result with an asterisk. See Section 4 for details.
接下来我们考虑三个数据集,这些数据集试图捕捉物理或科学推理,与句子补全、阅读理解或广泛知识问答不同。第一个数据集是 PhysicalQA (PIQA) [BZB^{+}19],它提出了关于物理世界如何运作的常识性问题,旨在探测对世界的实际理解。GPT-3 在零样本情况下达到了 81.0% 的准确率,单样本情况下达到了 80.5% 的准确率,少样本情况下达到了 82.8% 的准确率(最后一个是在 PIQA 的测试服务器上测量的)。这比之前最先进的微调 RoBERTa 的 79.4% 准确率要好。PIQA 在模型规模上的扩展相对较浅,并且仍然比人类表现差 10% 以上,但 GPT-3 的少样本甚至零样本结果优于当前的最先进技术。我们的分析指出 PIQA 可能存在数据污染问题(尽管测试标签是隐藏的),因此我们保守地在结果上标记了星号。详见第 4 节。
Table 3.7: Results on reading comprehension tasks. All scores are F1 except results for RACE, which report accuracy. $^a$[JZC+19] $^b$[JN20] $^c$[AI19] $^d$[QIA20] $^e$[SPP+19]
表 3.7: 阅读理解任务的结果。除 RACE 的结果为准确率外,其余均为 F1 分数。$^{a}[{\mathrm{JZC}}^{+}19]$ $^{b}[JN20]$ $^{c}[AI19]$ $^{d}[QIA20]$ $^{e}[\mathrm{SPP^{+}19}]$
| Setting | CoQA | DROP | QuAC | SQuADv2 | RACE-h | RACE-m |
|---|---|---|---|---|---|---|
| Fine-tuned SOTA | 90.7$^{a}$ | 89.1$^{b}$ | 74.4$^{c}$ | 93.0$^{d}$ | 90.0$^{e}$ | 93.1$^{e}$ |
| GPT-3 Zero-Shot | 81.5 | 23.6 | 41.5 | 59.5 | 45.5 | 58.4 |
| GPT-3 One-Shot | 84.0 | 34.3 | 43.3 | 65.4 | 45.9 | 57.4 |
| GPT-3 Few-Shot | 85.0 | 36.5 | 44.3 | 69.8 | 46.8 | 58.1 |
ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the “Challenge” version of the dataset, which has been filtered to questions that simple statistical or information retrieval methods are unable to answer correctly, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1%, which slightly exceeds a fine-tuned RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs achieved by UnifiedQA, which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy set.
ARC $[\mathrm{CCE^{+}18}]$ 是一个从三年级到九年级科学考试中收集的多项选择题数据集。在“挑战”版本的数据集中,这些问题经过筛选,简单的统计或信息检索方法无法正确回答,GPT-3 在零样本设置下的准确率为 $51.4%$ ,在单样本设置下为 $53.2%$ ,在少样本设置下为 $51.5%$ 。这接近了 UnifiedQA $[\mathrm{KKS^{+}20}]$ 中微调的 RoBERTa 基线 $(55.9%)$ 的表现。在“简单”版本的数据集中(这些问题是上述基线方法中任何一个都能正确回答的),GPT-3 的准确率为 $68.8%$ 、 $71.2%$ 和 $70.1%$ ,略高于 $[\mathrm{KKS}^{+}20]$ 中的微调 RoBERTa 基线。然而,这两个结果仍然远低于 UnifiedQA 实现的整体 SOTA,其在挑战集上比 GPT-3 的少样本结果高出 $27%$ ,在简单集上高出 $22%$ 。
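The gaps quoted for ARC can be checked directly against the Table 3.6 figures. The short sketch below does only that subtraction; the dictionary keys and the helper name `gap_to_sota` are illustrative choices, not anything defined in the paper.

```python
# Accuracy figures (percent) copied from Table 3.6.
# The task keys and helper name are illustrative, not from the paper.
sota = {"arc_easy": 92.0, "arc_challenge": 78.5, "openbookqa": 87.2}
gpt3_few_shot = {"arc_easy": 70.1, "arc_challenge": 51.5, "openbookqa": 65.4}


def gap_to_sota(task: str) -> float:
    """Absolute gap, in percentage points, between fine-tuned SOTA and few-shot GPT-3."""
    return round(sota[task] - gpt3_few_shot[task], 1)


gaps = {task: gap_to_sota(task) for task in sota}
for task, gap in gaps.items():
    print(f"{task}: {gap} points behind SOTA")
```

The differences are absolute percentage points, so the "27%" and "22%" in the text are best read as point gaps rather than relative changes.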
On OpenBookQA [MCKS18], GPT-3 improves significantly from the zero-shot to the few-shot setting but is still over 20 points short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT-Large baseline on the leaderboard.
在 OpenBookQA [MCKS18] 上,GPT-3 从零样本到少样本设置有了显著提升,但仍比整体 SOTA 低 20 多分。GPT-3 的少样本表现与排行榜上经过微调的 BERT Large 基线相似。
Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks: only small and inconsistent gains are observed in the one-shot and few-shot settings for both PIQA and ARC, while a significant improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.
总体而言,GPT-3 在常识推理任务中的上下文学习表现参差不齐。在 PIQA 和 ARC 任务中,单样本和少样本学习设置下仅观察到微小且不一致的提升,但在 OpenBookQA 上则观察到了显著改进。GPT-3 在新的 PIQA 数据集上所有评估设置中均达到了 SOTA(State of the Art)水平。
3.6 Reading Comprehension
3.6 阅读理解
Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive, multiple-choice, and span-based answer formats in both dialog and single-question settings. We observe a wide spread in GPT-3’s performance across these datasets, suggestive of varying capability with different answer formats. In general we observe that GPT-3 is on par with initial baselines and early results trained using contextual representations on each respective dataset.
接下来,我们在阅读理解任务上评估 GPT-3。我们使用了包含 5 个数据集的套件,这些数据集涵盖了摘要式、多项选择和基于跨度的答案格式,涵盖了对话和单一问题设置。我们观察到 GPT-3 在这些数据集上的表现差异较大,这表明其在不同答案格式下的能力有所不同。总体而言,我们观察到 GPT-3 与每个数据集上使用上下文表示训练的初始基线和早期结果相当。
GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and performs worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selections of teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in a few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to a zero-shot setting. This allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple-choice dataset of middle school and high school English examinations, GPT-3 performs relatively weakly: it is only competitive with the earliest work utilizing contextual representations and is still 45% behind SOTA.
GPT-3 在 CoQA [RCM19] 上表现最佳(与人类基线的差距在 3 分以内),这是一个自由形式的对话数据集;而在 QuAC $[\mathrm{CHI^{+}18}]$ 上表现最差(比 ELMo 基线低 13 F1),这是一个需要建模结构化对话行为以及师生互动的答案跨度选择的数据集。在 DROP $[\mathrm{DWD}^{+}19]$ 上,这是一个测试阅读理解中的离散推理和计算能力的数据集,GPT-3 在少样本设置下优于原始论文中的微调 BERT 基线,但仍远低于人类表现以及通过符号系统增强神经网络的最先进方法 $[\mathrm{RLL^{+}19}]$。在 SQuAD 2.0 [RJL18] 上,GPT-3 展示了其少样本学习能力,与零样本设置相比,提升了近 $10,\mathrm{F}1$(达到 69.8),使其略微优于原始论文中的最佳微调结果。在 RACE $[\mathrm{LXL^{+}17}]$ 上,这是一个中学和高中英语考试的多项选择题数据集,GPT-3 表现相对较弱,仅与最早利用上下文表示的工作相当,仍落后于 SOTA 45%。
3.7 SuperGLUE
3.7 SuperGLUE
In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07] [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC
为了更好地在 NLP 任务上聚合结果,并以更系统的方式与 BERT 和 RoBERTa 等流行模型进行比较,我们还在标准化的数据集集合 SuperGLUE 基准上评估了 GPT-3 $[\mathrm{WPN}^{\dot{+}}19]$ $[\mathrm{WPN^{+}i9}]$ $[\mathrm{CLC^{+}19}]$ [DMST19] [RBG11] $[\mathrm{KCR}^{+}18]$ ] $[Z\mathrm{LL}^{+}18]$ [DGM06] $[\mathrm{BHDD^{+}06}]$ [GMDD07] $[\mathrm{BDD^{+}09}]$ [PCC18] $[\mathrm{PHR}^{+}18]$ 。GPT-3 在 SuperGLUE 数据集上的测试集性能如表 3.8 所示。在少样本设置中,我们对所有任务使用了 32 个示例,这些示例是从训练集中随机采样的。除了 WSC 任务外,所有任务都使用了相同的设置。

Figure 3.7: GPT-3 results on CoQA reading comprehension task. GPT-3 175B achieves 85 F1 in the few-shot setting, only a few points behind measured human performance and state-of-the-art fine-tuned models. Zero-shot and one-shot performance is a few points behind, with the gains to few-shot being largest for bigger models.
图 3.7: GPT-3 在 CoQA 阅读理解任务上的结果。GPT-3 175B 在少样本设置下达到了 85 F1 分,仅比测量的人类表现和最先进的微调模型低几分。零样本和单样本表现稍低几分,且对于更大的模型,少样本带来的增益最大。
| | SuperGLUE Average | BoolQ Accuracy | CB Accuracy | CB F1 | COPA Accuracy | RTE Accuracy |
|---|---|---|---|---|---|---|
| Fine-tuned SOTA | 89.0 | 91.0 | 96.9 | 93.9 | 94.8 | 92.5 |
| Fine-tuned BERT-Large | 69.0 | 77.4 | 83.6 | 75.7 | 70.6 | 71.7 |
| GPT-3 Few-Shot | 71.8 | 76.4 | 75.6 | 52.0 | 92.0 | 69.0 |

| | WiC Accuracy | WSC Accuracy | MultiRC Accuracy | MultiRC F1a | ReCoRD Accuracy | ReCoRD F1 |
|---|---|---|---|---|---|---|
| Fine-tuned SOTA | 76.1 | 93.8 | 62.3 | 88.2 | 92.5 | 93.3 |
| Fine-tuned BERT-Large | 69.6 | 64.6 | 24.1 | 70.0 | 71.3 | 72.0 |
| GPT-3 Few-Shot | 49.4 | 80.1 | 30.5 | 75.4 | 90.2 | 91.1 |
Table 3.8: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient updates.
表 3.8: GPT-3 在 SuperGLUE 上的性能与微调基线和 SOTA 的比较。所有结果均在测试集上报告。GPT-3 少样本在每个任务的上下文中总共给出 32 个示例,并且不进行梯度更新。

Figure 3.8: Performance on SuperGLUE increases with model size and number of examples in context. A value of $K=32$ means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference lines (our test set results are in Table 3.8). The BERT-Large reference model was fine-tuned on the SuperGLUE training set (125K examples), whereas $\mathrm{BERT++}$ was first fine-tuned on MultiNLI (392K examples) and SWAG (113K examples) before further fine-tuning on the SuperGLUE training set (for a total of 630K fine-tuning examples). We find the difference in performance between the BERT-Large and $\mathrm{BERT++}$ to be roughly equivalent to the difference between GPT-3 with one example per context versus eight examples per context.
图 3.8: SuperGLUE 上的性能随着模型大小和上下文中的示例数量增加而提升。$K=32$ 表示我们的模型在每个任务中展示了 32 个示例,总共 256 个示例分布在 SuperGLUE 的 8 个任务中。我们报告了 GPT-3 在开发集上的值,因此我们的数字不能直接与虚线参考线进行比较(我们的测试集结果在表 3.8 中)。BERT-Large 参考模型在 SuperGLUE 训练集(125K 示例)上进行了微调,而 $\mathrm{BERT++}$ 首先在 MultiNLI(392K 示例)和 SWAG(113K 示例)上进行了微调,然后在 SuperGLUE 训练集上进一步微调(总共 630K 微调示例)。我们发现 BERT-Large 和 $\mathrm{BERT++}$ 之间的性能差异大致相当于 GPT-3 在每个上下文中展示一个示例与八个示例之间的差异。
and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.
对于 WSC 和 MultiRC,我们从训练集中随机抽取了一组新的样本作为每个问题的上下文。对于 WSC 和 MultiRC,我们使用了从训练集中随机抽取的相同样本集作为所有评估问题的上下文。
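The context-sampling rule described above (a fresh random draw of 32 training examples per problem, except for WSC and MultiRC, which reuse one draw for every problem) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the seeding, and sampling without replacement are all hypothetical choices, since the text only says the examples were "sampled randomly from the training set".

```python
import random

K = 32  # in-context examples per task, as stated in the text
FIXED_CONTEXT_TASKS = {"WSC", "MultiRC"}  # tasks that reuse one context for all problems


def sample_context(train_set, k, rng):
    """Draw k demonstrations from the training set (assumption: without replacement)."""
    return rng.sample(train_set, k)


def build_contexts(task, train_set, problems, k=K, seed=0):
    """Return one list of demonstrations per evaluation problem."""
    rng = random.Random(seed)
    if task in FIXED_CONTEXT_TASKS:
        shared = sample_context(train_set, k, rng)
        return [shared for _ in problems]
    return [sample_context(train_set, k, rng) for _ in problems]


# Toy usage: integers stand in for training examples.
train = list(range(1000))
problems = ["p0", "p1", "p2"]
wsc_contexts = build_contexts("WSC", train, problems)
boolq_contexts = build_contexts("BoolQ", train, problems)
```

In the toy usage, every WSC problem receives the same 32 demonstrations, while each BoolQ problem receives its own draw.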
We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD, GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple of points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting.
我们观察到 GPT-3 在不同任务中的表现差异很大。在 COPA 和 ReCoRD 任务中,GPT-3 在单样本和少样本设置下接近 SOTA 性能,其中 COPA 仅落后几分,在排行榜上位居第二,而第一名是由一个经过微调的 110 亿参数模型 (T5) 占据。在 WSC 任务中,表现仍然相对强劲,在少样本设置下达到了 $80.1%$(需要注意的是,GPT-3 在原始 Winograd 数据集上的表现达到了 $88.6%$,如第 3.4 节所述)。在 BoolQ、MultiRC 和 RTE 任务中,表现尚可,大致与经过微调的 BERT-Large 相当。在 CB 任务中,我们在少样本设置下看到了 $75.6%$ 的表现。
WiC is a notable weak spot, with few-shot performance at 49.4% (at random chance). We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer in the next section (which discusses the ANLI benchmark) – GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.
WiC 是一个显著的弱点,少样本性能仅为 $49.4%$ (随机概率)。我们尝试了多种不同的表述和公式来解决 WiC 问题(涉及确定一个词在两个句子中是否具有相同的含义),但均未能取得良好的性能。这暗示了一个现象,在下一节(讨论 ANLI 基准时)将更加清晰——GPT-3 在少样本或单样本设置下,对于涉及比较两个句子或片段的某些任务表现较弱,例如一个词在两个句子中的使用方式是否相同(WiC),一个句子是否是另一个句子的改写,或者一个句子是否暗示了另一个句子。这也可能解释了 RTE 和 CB 相对较低的分数,这些任务也遵循类似的格式。尽管存在这些弱点,GPT-3 在八个任务中的四个任务上仍然优于经过微调的 BERT-large,并且在两个任务上接近由经过微调的 110 亿参数模型保持的最先进水平。
Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and the number of examples in the context, showing increasing benefits from in-context learning (Figure 3.8). We scale $K$ up to 32 examples per task, after which point additional examples will not reliably fit into our context. When sweeping over values of $K$, we find that GPT-3 requires fewer than eight total examples per task to outperform a fine-tuned BERT-Large on overall SuperGLUE score.
最后,我们注意到,少样本 SuperGLUE 分数随着模型大小和上下文中的示例数量稳步提高,显示出上下文学习的益处不断增加(图 3.8)。我们将每个任务的 $K$ 扩展到 32 个示例,超过这个数量后,额外的示例将无法可靠地放入我们的上下文中。在遍历 $K$ 的值时,我们发现 GPT-3 每个任务需要的总示例数少于 8 个,就能在整体 SuperGLUE 分数上超过微调后的 BERT-Large。
3.8 NLI
3.8 自然语言推理 (NLI)
Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two- or three-class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first sentence, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in a few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT-Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (~33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9 and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models and they are only just beginning to show signs of progress.
自然语言推理 (Natural Language Inference, NLI) [Fyo00] 关注的是理解两个句子之间关系的能力。在实际应用中,该任务通常被构建为一个两分类或三分类问题,模型需要判断第二个句子是否在逻辑上从第一个句子中得出、与第一个句子矛盾,或者可能是真实的(中性)。SuperGLUE 包含一个 NLI 数据集 RTE,它评估该任务的二分类版本。在 RTE 上,只有最大版本的 GPT-3 在任何评估设置中表现得明显优于随机 $(56%)$,但在少样本设置中,GPT-3 的表现与单任务微调的 BERT Large 相似。我们还评估了最近引入的对抗性自然语言推理 (Adversarial Natural Language Inference, ANLI) 数据集 $[\mathrm{NWD^{+}19}]$。ANLI 是一个困难的数据集,它采用了一系列对抗性挖掘的自然语言推理问题,分为三轮(R1、R2 和 R3)。与 RTE 类似,所有小于 GPT-3 的模型在 ANLI 上的表现几乎完全随机,即使在少样本设置中 $(\sim33%)$,而 GPT-3 本身在第三轮中显示出一些进展的迹象。ANLI R3 的结果在图 3.9 中突出显示,所有轮的完整结果可以在附录 H 中找到。这些在 RTE 和 ANLI 上的结果表明,NLI 对于语言模型来说仍然是一个非常困难的任务,它们才刚刚开始显示出进展的迹象。

Figure 3.9: Performance of GPT-3 on ANLI Round 3. Results are on the dev-set, which has only 1500 examples and therefore has high variance (we estimate a standard deviation of 1.2%). We find that smaller models hover around random chance, while few-shot GPT-3 175B closes almost half the gap from random chance to SOTA. Results for ANLI rounds 1 and 2 are shown in the appendix.
图 3.9: GPT-3 在 ANLI 第 3 轮的表现。结果基于开发集,该开发集仅有 1500 个样本,因此具有较高的方差(我们估计标准差为 $1.2%$)。我们发现较小的模型在随机概率附近徘徊,而少样本的 GPT-3 175B 几乎缩小了从随机概率到 SOTA 的一半差距。ANLI 第 1 轮和第 2 轮的结果见附录。
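The 1.2% standard deviation quoted in the caption is consistent with a simple binomial model of accuracy on 1500 examples. The sketch below assumes that model; the paper does not say how its estimate was obtained, so the formula and the helper name `accuracy_std` are assumptions for illustration only.

```python
import math


def accuracy_std(p: float, n: int) -> float:
    """Standard deviation of measured accuracy, assuming n independent
    examples that are each answered correctly with probability p."""
    return math.sqrt(p * (1.0 - p) / n)


n_dev = 1500  # dev-set size stated in the Figure 3.9 caption
std_at_chance = accuracy_std(1 / 3, n_dev)  # three-way chance level
std_at_forty = accuracy_std(0.40, n_dev)    # an accuracy somewhat above chance

print(f"std at chance: {100 * std_at_chance:.2f}%")
print(f"std at 40% accuracy: {100 * std_at_forty:.2f}%")
```

Under this assumption the spread stays between roughly 1.2% and 1.3% for accuracies near chance, so differences of a point or two between small models fall within the noise.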
3.9 Synthetic and Qualitative Tasks
3.9 合成与定性任务
One way to probe GPT-3’s range of abilities in the few-shot (or zero- and one-shot) setting is to give it tasks which require it to perform simple on-the-fly computational reasoning, recognize a novel pattern that is unlikely to have occurred in training, or adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3’s ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been exactly seen during training. Third, we test GPT-3’s ability to solve SAT-style analogy problems few-shot. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and news article generation. We will release the synthetic datasets with the hope of stimulating further study of test-time behavior of language models.
一种探究 GPT-3 在少样本(或零样本和单样本)设置下能力范围的方法是,给它一些需要即时进行简单计算推理、识别训练中不太可能出现的新模式,或快速适应不寻常任务的任务。我们设计了几项任务来测试这类能力。首先,我们测试 GPT-3 进行算术运算的能力。其次,我们创建了几项涉及重新排列或解构单词中字母的任务,这些任务在训练中不太可能被完全见过。第三,我们测试 GPT-3 在少样本情况下解决 SAT 风格类比问题的能力。最后,我们在几项定性任务上测试 GPT-3,包括在句子中使用新词、纠正英语语法以及新闻文章生成。我们将发布这些合成数据集,以期激发对语言模型测试时行为的进一步研究。
3.9.1 Arithmetic
3.9.1 算术
To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:
为了测试 GPT-3 在没有任务特定训练的情况下执行简单算术运算的能力,我们开发了一套包含 10 个测试的小型测试集,这些测试涉及用自然语言向 GPT-3 提出简单的算术问题:

Figure 3.10: Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175B), with the latter being able to perform reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and give correct answers a significant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot are shown in the appendix.
图 3.10: 不同大小模型在少样本设置下所有 10 个算术任务的结果。从第二大的模型 (GPT-3 13B) 到最大的模型 (GPT-3 175) 有一个显著的跳跃,后者能够可靠地准确进行 2 位数的算术运算,通常准确进行 3 位数的算术运算,并且在 4-5 位数的算术运算、2 位数的乘法和复合运算中大部分时间都能给出正确答案。单样本和零样本的结果见附录。
• 3 digit subtraction (3D-) – Same as 2 digit subtraction, except numbers are uniformly sampled from $[0,1000)$.
• 4 digit addition (4D+) – Same as 3 digit addition, except uniformly sampled from $[0,10000)$.
• 4 digit subtraction (4D-) – Same as 3 digit subtraction, except uniformly sampled from $[0,10000)$.
• 5 digit addition (5D+) – Same as 3 digit addition, except uniformly sampled from $[0,100000)$.
• 5 digit subtraction (5D-) – Same as 3 digit subtraction, except uniformly sampled from $[0,100000)$.
• 2 digit multiplication (2Dx) – The model is asked to multiply two integers sampled uniformly from $[0,100)$, e.g. “Q: What is 24 times 42? A: 1008”.
• One-digit composite (1DC) – The model is asked to perform a composite operation on three 1 digit numbers, with parentheses around the last two. For example, “Q: What is 6+(4*8)? A: 38”. The three 1 digit numbers are selected uniformly on $[0,10)$ and the operations are selected uniformly from $\{+,-,*\}$.
• 三位数减法 (3D-) – 与两位数减法相同,只是数字从 $[0,1000)$ 中均匀采样。
• 四位数加法 $(4\mathbf{D}+)$ – 与三位数加法相同,只是从 $[0,10000)$ 中均匀采样。
• 四位数减法 (4D-) – 与三位数减法相同,只是从 $[0,10000)$ 中均匀采样。
• 五位数加法 $(5\mathbf{D}+)$ – 与三位数加法相同,只是从 $[0,100000)$ 中均匀采样。
• 五位数减法 (5D-) – 与三位数减法相同,只是从 $[0,100000)$ 中均匀采样。
• 两位数乘法 (2Dx) – 模型被要求将两个从 $[0,100)$ 中均匀采样的整数相乘,例如“Q: 24 乘以 42 是多少?A: 1008”。
• 一位数复合运算 (1DC) – 模型被要求对三个一位数进行复合运算,最后两个数用括号括起来。例如,“Q: $6{+}(4^{\ast}8)?$ 是多少?A: $38^{\circ}$。这三个一位数从 [0, 10) 中均匀采样,运算符从 ${+,-,{*}}$ 中均匀选择。
In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random instances of the task and evaluate all models on those instances.
在所有10个任务中,模型必须准确生成正确答案。对于每个任务,我们生成一个包含2,000个随机实例的数据集,并在这些实例上评估所有模型。
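A generator for tasks of this kind, together with exact-match scoring, might look like the sketch below. The "times" and composite prompt formats follow the examples quoted in the list above; the "plus" wording for addition, the function names, and the whitespace stripping in `exact_match` are assumptions made for illustration, not details confirmed by this section.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_addition(num_digits, rng):
    """n-digit addition; operands uniform on [0, 10**num_digits).
    The 'plus' phrasing is assumed by analogy with the 'times' example."""
    hi = 10 ** num_digits
    a, b = rng.randrange(hi), rng.randrange(hi)
    return f"Q: What is {a} plus {b}? A:", str(a + b)


def make_multiplication(rng):
    """2Dx: two integers sampled uniformly from [0, 100)."""
    a, b = rng.randrange(100), rng.randrange(100)
    return f"Q: What is {a} times {b}? A:", str(a * b)


def make_composite(rng):
    """1DC: three 1 digit numbers, parentheses around the last two."""
    a, b, c = (rng.randrange(10) for _ in range(3))
    op1, op2 = rng.choice(list(OPS)), rng.choice(list(OPS))
    answer = OPS[op1](a, OPS[op2](b, c))
    return f"Q: What is {a}{op1}({b}{op2}{c})? A:", str(answer)


def exact_match(prediction, answer):
    """Score 1 only if the generated text equals the answer
    (assumption: surrounding whitespace is ignored)."""
    return prediction.strip() == answer


rng = random.Random(0)
dataset = [make_multiplication(rng) for _ in range(2000)]  # 2,000 instances per task
```

Because scoring is exact match on the generated string, a model gets no credit for answers that are numerically close or that carry extra trailing text.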
First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction, GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition, 98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3 digit subtraction. Performance decreases as the number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves 29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves 21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness beyond just single operations.
首先,我们在少样本设置下评估 GPT-3,结果如图 3.10 所示。在加法和减法方面,当数字位数较少时,GPT-3 表现出较强的能力,在 2 位数加法上达到了 $100%$ 的准确率,2 位数减法为 $98.9%$,3 位数加法为 $80.2%$,3 位数减法为 $94.2%$。随着位数的增加,性能有所下降,但 GPT-3 在 4 位数运算上仍能达到 $25{-}26%$ 的准确率,在 5 位数运算上达到 $9\mathrm{-}10%$ 的准确率,这表明它至少具备一定能力来泛化到更大位数的运算。GPT-3 在 2 位数乘法上也达到了 $29.2%$ 的准确率,这是一个特别计算密集型的运算。最后,GPT-3 在单位数组合运算(例如 $^{9*}(7{+}5))$ 上达到了 $21.3%$ 的准确率,这表明它在单一运算之外还具备一定的鲁棒性。
As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all other operations less than 10% of the time.
如图 3.10 所示,小型模型在这些任务上表现都很差——即使是拥有 130 亿参数的模型(仅次于 1750 亿参数的完整 GPT-3)也只能在一半的时间内解决两位数的加减法,而其他所有操作的准确率都不到 $10%$。
One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation to the task (or at the very least recognition of the task) is important to performing these computations correctly. Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and model capacity scaling for all three settings is shown in Appendix H.
单样本和零样本性能相对于少样本性能有所下降,这表明适应任务(或至少识别任务)对于正确执行这些计算是重要的。然而,单样本性能仍然相当强,即使是完整 GPT-3 的零样本性能也显著优于所有较小模型的少样本学习。完整 GPT-3 的三种设置如表 3.9 所示,所有三种设置的模型容量扩展见附录 H。
| Setting | 2D+ | 2D- | 3D+ | 3D- | 4D+ | 4D- | 5D+ | 5D- | 2Dx | 1DC |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3 Zero-Shot | 76.9 | 58.0 | 34.2 | 48.3 | 4.0 | 7.5 | 0.7 | 0.8 | 19.8 | 9.8 |
| GPT-3 One-Shot | 99.6 | 86.4 | 65.5 | 78.7 | 14.0 | 14.0 | 3.5 | 3.8 | 27.4 | 14.3 |
| GPT-3 Few-Shot | 100.0 | 98.9 | 80.4 | 94.2 | 25.5 | 26.8 | 9.3 | 9.9 | 29.2 | 21.3 |
Table 3.9: Results on basic arithmetic tasks for GPT-3 175B. $\{2,3,4,5\}\mathrm{D}\{+,-\}$ is 2, 3, 4, and 5 digit addition or subtraction, 2Dx is 2 digit multiplication. 1DC is 1 digit composite operations. Results become progressively stronger moving from the zero-shot to one-shot to few-shot setting, but even the zero-shot shows significant arithmetic abilities.
表 3.9: GPT-3 175B 在基础算术任务上的结果。${2,3,4,5}\mathrm{D}{+,-}$ 表示 2、3、4 和 5 位数的加法或减法,2Dx 表示 2 位数的乘法,1DC 表示 1 位数的复合运算。从零样本到单样本再到少样本设置,结果逐渐增强,但即使是零样本也显示出显著的算术能力。
| Setting | CL | A1 | A2 | RI | RW |
|---|---|---|---|---|---|
| GPT-3 Zero-shot | 3.66 | 2.28 | 8.91 | 8.26 | 0.09 |
| GPT-3 One-shot | 21.7 | 8.62 | 25.9 | 45.4 | 0.48 |
| GPT-3 Few-shot | 37.9 | 15.1 | 39.7 | 67.2 | 0.44 |
To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and "
为了抽查模型是否仅仅记住了特定的算术问题,我们从测试集中选取了3位数算术问题,并在训练数据中搜索了两种形式:$"<\tt N U M1>\partial+\epsilon<\tt N U M2>\partial="1"$ 和 "
Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even zero-shot settings.
总体而言,GPT-3 在少样本、单样本甚至零样本设置下,对中等复杂度的算术表现出合理的熟练度。
3.9.2 Word Scrambling and Manipulation Tasks
3.9.2 单词打乱与操作任务
To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are:
为了测试 GPT-3 从少量示例中学习新颖符号操作的能力,我们设计了一组包含 5 个“字符操作”任务的小测试。每个任务都涉及给模型一个通过字符打乱、添加或删除等方式扭曲的单词,并要求其恢复原始单词。这 5 个任务分别是:
For each task we generate 10,000 examples, which we chose to be the top 10,000 most frequent words as measured by [Nor09] of length more than 4 characters and less than 15 characters. The few-shot results are shown in Figure 3.11. Task performance tends to grow smoothly with model size, with the full GPT-3 model achieving 66.9% on removing random insertions, 38.6% on cycling letters, 40.2% on the easier anagram task, and 15.1% on the more difficult anagram task (where only the first and last letters are held fixed). None of the models can reverse the letters in a word.
对于每个任务,我们生成了10,000个示例,这些示例是根据[Nor09]测量的长度超过4个字符且少于15个字符的最常见的10,000个单词。少样本结果如图3.11所示。任务性能随着模型大小的增加而平稳增长,完整的GPT-3模型在随机插入删除任务上达到了$66.9%$,在字母循环任务上达到了$38.6%$,在较简单的变位词任务上达到了$40.2%$,在较难的变位词任务(仅首尾字母固定)上达到了$15.1%$。所有模型都无法反转单词中的字母。
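The five distortions can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' generation code: the function names are hypothetical, the filler alphabet for random insertion (punctuation and spaces) is an assumption, and the easier anagram variant is assumed to hold the first two and last two letters fixed.

```python
import random
import string


def cycle_letters(word: str, rng: random.Random) -> str:
    """Rotate the word by a random non-zero offset (CL)."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]


def anagram_keep_ends(word: str, rng: random.Random, keep: int = 1) -> str:
    """Shuffle interior letters, holding `keep` letters fixed at each end.

    keep=1 matches the harder anagram task described in the text;
    keep=2 is an assumed definition of the easier variant.
    """
    middle = list(word[keep:len(word) - keep])
    rng.shuffle(middle)
    return word[:keep] + "".join(middle) + word[len(word) - keep:]


def random_insertion(word: str, rng: random.Random) -> str:
    """Insert one random non-letter character between adjacent letters (RI)."""
    fillers = string.punctuation + " "  # assumed filler alphabet
    out = [word[0]]
    for ch in word[1:]:
        out.append(rng.choice(fillers))
        out.append(ch)
    return "".join(out)


def reverse_word(word: str) -> str:
    """Reverse the letters of the word (RW)."""
    return word[::-1]
```

In every case the model sees the distorted string and must produce the original word, so the target is always the input to these functions.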

Figure 3.11: Few-shot performance on the five word scrambling tasks for different sizes of model. There is generally smooth improvement with model size, although the random insertion task shows an upward slope of improvement, with the 175B model solving the task the majority of the time. Scaling of one-shot and zero-shot performance is shown in the appendix. All tasks are done with $K=100$.
In the one-shot setting, performance is significantly weaker (dropping by half or more), and in the zero-shot setting the model can rarely perform any of the tasks (Table 3.10). This suggests that the model really does appear to learn these tasks at test time, as the model cannot perform them zero-shot and their artificial nature makes them unlikely to appear in the pre-training data (although we cannot confirm this with certainty).
We can further quantify performance by plotting “in-context learning curves”, which show task performance as a function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information, including both task examples and natural language task descriptions.
Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding operates on significant fractions of a word (on average $\sim0.7$ words per token), so from the LM’s perspective succeeding at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also, CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word), requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require non-trivial pattern-matching and computation.
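To make the non-bijectivity concrete, the sketch below groups a small hypothetical vocabulary by what the harder anagram task preserves (first letter, last letter, and the multiset of interior letters). Any group with more than one word means a single scrambled string is consistent with several valid answers. The vocabulary and function names are illustrative assumptions, not part of the paper's setup.

```python
from collections import defaultdict


def a1_signature(word: str) -> tuple:
    """Everything the harder anagram task preserves about a word."""
    return (word[0], "".join(sorted(word[1:-1])), word[-1])


def ambiguous_groups(vocab: list) -> list:
    """Return groups of words that share a signature and so cannot be told apart."""
    groups = defaultdict(list)
    for word in vocab:
        groups[a1_signature(word)].append(word)
    return [sorted(words) for words in groups.values() if len(words) > 1]


# Hypothetical toy vocabulary; "tairl" could unscramble to "trail" or "trial".
vocab = ["trail", "trial", "dairy", "diary", "koala", "inventory"]
collisions = ambiguous_groups(vocab)
```

With a realistic vocabulary the model has to pick the intended word among such candidates, which is one reason the task involves search rather than a fixed mapping.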
3.9.3 SAT Analogies
To test GPT-3 on another task that is somewhat unusual relative to the typical distribution of text, we collected a set of 374 “SAT analogy” problems [TLBS03]. Analogies are a style of multiple choice question that constituted a section of the SAT college entrance exam before 2005. A typical example is “audacious is to boldness as (a) sanctimonious is to hypocrisy, (b) anonymous is to identity, (c) remorseful is to misdeed, (d) deleterious is to result, (e) impressionable is to temptation”. The student is expected to choose which of the five word pairs has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”. On this task GPT-3 achieves $65.2\%$ in the few-shot setting, $59.1\%$ in the one-shot setting, and $53.7\%$ in the zero-shot setting, whereas the average score among college applicants was $57\%$ [TL05] (random guessing yields $20\%$). As shown in Figure 3.12, the results improve with scale, with the full 175 billion parameter model improving by over $10\%$ compared to the 13 billion parameter model.

Figure 3.12: Zero-, one-, and few-shot performance on SAT analogy tasks, for different sizes of model. The largest model achieves $65\%$ accuracy in the few-shot setting, and also demonstrates significant gains from in-context learning which are not present in smaller models.
3.9.4 News Article Generation
Previous work on generative language models qualitatively tested their ability to generate synthetic “news articles” by conditional sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story [RWC+19]. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective; for example, GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre.
To gauge the quality of news article generation from GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.
In order to see how well humans can detect model-generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from four language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human-written article or the article generated by the model. Participants were asked to select whether the article was “very likely written by a human”, “more likely written by a human”, “I don’t know”, “more likely written by a machine”, or “very likely written by a machine”.
The articles we selected were not in the models’ training data, and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model-generated articles. This was done by generating articles from a “control model”: a 160M parameter model with no context and increased output randomness.
Table 3.11: Human accuracy in identifying whether short ($\sim200$ word) news articles are model generated. We find that human accuracy (measured by the ratio of correct assignments to non-neutral assignments) ranges from $86\%$ on the control model to $52\%$ on GPT-3 175B. This table compares mean accuracy between five different models, and shows the results of a two-sample T-Test for the difference in mean accuracy between each model and the control model (an unconditional GPT-3 Small model with increased output randomness).
| Model | Mean accuracy | 95% Confidence Interval (low, hi) | t compared to control (p-value) | “I don’t know” assignments |
|---|---|---|---|---|
| Control (deliberately bad model) | 86% | 83%-90% | - | 3.6% |
| GPT-3 Small | 76% | 72%-80% | 3.9 (2e-4) | 4.9% |
| GPT-3 Medium | 61% | 58%-65% | 10.3 (7e-21) | 6.0% |
| GPT-3 Large | 68% | 64%-72% | 7.3 (3e-11) | 8.7% |
| GPT-3 XL | 62% | 59%-65% | 10.7 (1e-19) | 7.5% |
| GPT-3 2.7B | 62% | 58%-65% | 10.4 (5e-19) | 7.1% |
| GPT-3 6.7B | 59% | 56%-63% | 11.2 (3e-21) | 6.2% |
| GPT-3 13B | 55% | 52%-58% | 15.3 (1e-32) | 7.1% |
| GPT-3 175B | 52% | 49%-54% | 16.9 (1e-34) | 7.8% |
Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was $\sim86\%$, where $50\%$ is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at $\sim52\%$ (see Table 3.11). Human abilities to detect model-generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).
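The accuracy metric and the comparison against the control model can be sketched as follows. This is a minimal illustration under stated assumptions: the five-point answer scale is collapsed to "human", "machine", or "unsure"; the helper names are hypothetical; and the unequal-variance (Welch) form of the two-sample t statistic is assumed, since the text does not say which variant was used.

```python
import math
from statistics import mean, variance


def participant_accuracy(judgments):
    """Ratio of correct assignments to non-neutral assignments for one participant.

    `judgments` is a list of (guess, truth) pairs, where guess is "human",
    "machine", or "unsure" and truth is "human" or "machine".
    Returns None if the participant gave only neutral answers.
    """
    non_neutral = [(g, t) for g, t in judgments if g != "unsure"]
    if not non_neutral:
        return None
    return sum(g == t for g, t in non_neutral) / len(non_neutral)


def welch_t(sample_a, sample_b):
    """Two-sample t statistic with unequal variances (assumed variant)."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se
```

Here `sample_a` would hold per-participant accuracies for the control model and `sample_b` those for the model under test, so a larger t indicates that the model's articles are harder to detect than the control's.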
Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.
Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model-generated text than human evaluators. Automatic detection of these models may be a promising area of future research.
Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.
We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was $\sim88\%$, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely above chance at $\sim52\%$ (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human-written news articles.
3.9.5 Learning and Using Novel Words
A task studied in developmental linguistics [CB78] is the ability to learn and utilize new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word’s meaning from only one usage. Here we qualitatively test GPT-3’s ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then ask it to use it in a sentence. We provide one to five previous examples of a (separate)

Figure 3.13: Human ability to detect model-generated news articles. People’s ability to identify whether news articles are model generated (measured by the ratio of correct assignments to non-neutral assignments) decreases as model size increases. Accuracy on the outputs of the deliberately bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated with the dashed line at the top, and the random chance ($50\%$) is indicated with the dashed line at the bottom. Line of best fit is a power law with $95\%$ confidence intervals.

Table 3.12: People’s ability to identify whether $\sim500$ word articles are model generated (as measured by the ratio of correct assignments to non-neutral assignments) was $88\%$ on the control model and $52\%$ on GPT-3 175B. This table shows the results of a two-sample T-Test for the difference in mean accuracy between GPT-3 175B and the control model (an unconditional GPT-3 Small model with increased output randomness).
| Model | Mean accuracy | 95% Confidence Interval (low, hi) | t compared to control (p-value) | “I don’t know” assignments |
|---|---|---|---|---|
| Control | 88% | 84%-91% | - | 2.7% |
| GPT-3 175B | 52% | 48%-57% | 12.7 (3.2e-23) | 10.6% |

Figure 3.14: The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human-written article (accuracy: $12\%$).



Figure 3.16: Representative GPT-3 completions for the few-shot task of using a new word in a sentence. Boldface is GPT-3’s completions, plain text is human prompts. In the first example both the prompt and the completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 other than the conditioning shown here.
nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Figure 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word “screeg” (namely “screeghed”), although the use of the word is slightly awkward (“screeghed at each other”) despite being plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.
3.9.6 Correcting English Grammar
Another task well suited for few-shot learning is correcting English grammar. We test this with GPT-3 in the few-shot setting by giving prompts of the form "Poor English Input: <sentence>\n Good English Output: <sentence>". We give GPT-3 one human-generated correction and then ask it to correct 5 more (again without any omissions or repeats). Results are shown in Figure 3.17.
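A prompt in this format can be assembled as in the sketch below. The function name and the example sentences are hypothetical and chosen only for illustration; the sketch shows the string layout, not the exact prompts used in the experiments.

```python
def build_grammar_prompt(demonstrations, query):
    """Build a few-shot prompt in the 'Poor English Input / Good English Output' format.

    `demonstrations` is a list of (poor, good) sentence pairs used as conditioning;
    `query` is the sentence to correct. The prompt ends with an open
    'Good English Output:' line for the model to complete.
    """
    blocks = [
        f"Poor English Input: {poor}\nGood English Output: {good}"
        for poor, good in demonstrations
    ]
    blocks.append(f"Poor English Input: {query}\nGood English Output:")
    return "\n\n".join(blocks)


# Hypothetical sentences, not taken from the paper.
prompt = build_grammar_prompt(
    [("She go to school yesterday.", "She went to school yesterday.")],
    "He have two cat.",
)
```

Each completed correction can be appended as a further demonstration before the next query, which mirrors the successive prompting described in the text.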
4 Measuring and Preventing Memorization Of Benchmarks
Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pre-training datasets, we believe this issue is becoming increasingly important to attend to.
This concern is not just hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that


Figure 3.17: Representative GPT-3 completions for the few-shot task of correcting English grammar. Boldface is GPT-3’s completions, plain text is human prompts. In the first few examples both the prompt and the completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 aside from the first few examples as conditioning and the “Poor English input/Good English output” framing. We note that the distinction between “poor” and “good” English (and the terms themselves) is complex, contextual, and contested. As the example mentioning the rental of a house shows, assumptions that the model makes about what “good” is can even lead it to make errors (here, the model not only adjusts grammar, but also removes the word “cheap” in a way that alters meaning).

Figure 4.1: GPT-3 Training Curves. We measure model performance during training on a deduplicated validation split of our training distribution. Though there is some gap between training and validation performance, the gap grows only minimally with model size and training time, suggesting that most of the gap comes from a difference in difficulty rather than overfitting.
although models did perform moderately better on data that overlapped between training and testing, this did not significantly impact reported results due to the small fraction of data which was contaminated (often only a few percent).
GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.
We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap impacts results.
For each benchmark, we produce a ‘clean’ version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pre-training set (or that overlap with the whole example when it is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contamination, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.
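The rough definition above can be illustrated with a small overlap filter. This is a sketch under stated assumptions, not the procedure of Appendix C: tokenization into lowercase alphanumeric words is assumed, the function names are hypothetical, and the brute-force scan over training documents stands in for whatever indexing a web-scale corpus would need.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Set of all contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_potentially_leaked(example, training_docs, n=13):
    """Flag an example sharing an n-gram with any training document.

    Examples shorter than n tokens are flagged only if the whole example
    appears contiguously in a training document.
    """
    tokens = tokenize(example)
    if not tokens:
        return False
    k = min(n, len(tokens))
    example_grams = ngrams(tokens, k)
    return any(example_grams & ngrams(tokenize(doc), k) for doc in training_docs)


def clean_subset(examples, training_docs, n=13):
    """Keep only the examples that were not flagged."""
    return [e for e in examples if not is_potentially_leaked(e, training_docs, n)]
```

Because a single shared window is enough to flag an example, the filter errs on the side of removing too much, which matches the conservative intent described above.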
We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of benchmarks scoring over $50\%$), in most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance.
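The comparison between the full benchmark and its clean subset amounts to two numbers per benchmark: the fraction of examples judged clean and the change in score when only those are evaluated. The sketch below computes both for an accuracy-style metric; the function and key names are hypothetical, and reporting the change as a relative percentage is an assumption about how the difference is expressed.

```python
def contamination_summary(correct, clean):
    """Summarize how a benchmark score shifts on its clean subset.

    `correct[i]` is True if the model answered example i correctly;
    `clean[i]` is True if example i was not flagged as potentially leaked.
    """
    if not correct or not any(clean):
        raise ValueError("need at least one example and one clean example")
    clean_results = [c for c, is_clean in zip(correct, clean) if is_clean]
    full_acc = sum(correct) / len(correct)
    clean_acc = sum(clean_results) / len(clean_results)
    return {
        "pct_clean": 100 * len(clean_results) / len(correct),
        "full_acc": full_acc,
        "clean_acc": clean_acc,
        # Negative values mean the model does worse once flagged examples are removed.
        "relative_change_pct": 100 * (clean_acc - full_acc) / full_acc,
    }
```

A relative change near zero alongside a low clean percentage is the pattern the text describes for most benchmarks: many examples flagged, little effect on the score.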
Below, we review in more detail the few specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) potential contamination is very high, which makes measuring the performance difference difficult.
Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false positives. We summarize the results for each group of tasks below:

Figure 4.2: Benchmark contamination analysis. We constructed cleaned versions of each of our benchmarks to check for potential contamination in our training set. The x-axis (percentage of data clean in dataset) is a conservative lower bound for how much of the dataset is known with high confidence to be clean, and the y-axis shows the difference in performance when evaluating only on the verified clean subset. Performance on most benchmarks changed negligibly, but some were flagged for further review. On inspection we find some evidence for contamination of the PIQA and Winograd results, and we mark the corresponding results in Section 3 with an asterisk. We find no evidence that other benchmarks are affected.
• Reading Comprehension: Our initial analysis flagged ${>}90\%$ of task examples from QuAC, SQuAD2, and DROP as potentially contaminated, a fraction so large that even measuring the differential on a clean subset was difficult. Upon manual inspection, however, we found that for every overlap we inspected, in all 3 datasets, the source text was present in our training data but the question/answer pairs were not, meaning the model gains only background information and cannot memorize the answer to a specific question.
• German translation: We found $25\%$ of the examples in the WMT16 German-English test set were marked as potentially contaminated, with an associated total effect size of 1-2 BLEU. Upon inspection, none of the flagged examples contain paired sentences resembling NMT training data, and collisions were monolingual matches, mostly of snippets of events discussed in the news.
• Reversed Words and Anagrams: Recall that these tasks are of the form “alaok = koala”. Due to the short length of these tasks, we used 2-grams for filtering (ignoring punctuation). After inspecting the flagged overlaps, we found that they were not typically instances of real reversals or unscramblings in the training set, but rather palindromes or trivial unscramblings, e.g. “kayak = kayak”. The amount of overlap was small, but removing the trivial tasks led to an increase in difficulty and thus a spurious signal. Related to this, the symbol insertion task shows high overlap but no effect on performance. This is because that task involves removing non-letter characters from a word, and the overlap analysis itself ignores such characters, leading to many spurious matches.
• PIQA: The overlap analysis flagged $29\%$ of examples as contaminated, and observed a 3 percentage point absolute decrease ($4\%$ relative decrease) in performance on the clean subset. Though the test dataset was released after our training set was created and its labels are hidden, some of the web pages used by the crowdsourced dataset creators are contained in our training set. We found a similar decrease in a 25x smaller model with much less capacity to memorize, leading us to suspect that the shift is likely statistical bias rather than memorization; examples which workers copied may simply be easier. Unfortunately, we cannot rigorously prove this hypothesis. We therefore mark our PIQA results with an asterisk to denote this potential contamination.
• Winograd: The overlap analysis flagged $45\%$ of examples, and found a $2.6\%$ decrease in performance on the clean subset. Manual inspection of the overlapping data points showed that 132 Winograd schemas were in fact present in our training set, though presented in a different format than we present the task to the model. Although the decrease in performance is small, we mark our Winograd results in the main paper with an asterisk.
• Language modeling: We found the 4 Wikipedia language modeling benchmarks measured in GPT-2, plus the Children’s Book Test dataset, to be almost entirely contained in our training data. Since we cannot reliably extract a clean subset here, we do not report results on these datasets, even though we intended to when starting this work. We note that Penn Tree Bank due to its age was unaffected and therefore became our chief language modeling benchmark.
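The spurious matches noted for the word scrambling tasks in the list above can be reproduced with a toy normalizer. The sketch assumes the overlap analysis drops every non-letter character before comparing text, which is a simplification of the real procedure; the function names and the sample strings are illustrative.

```python
def normalize(text):
    """Keep letters only, lowercased (assumed stand-in for the overlap normalization)."""
    return "".join(ch.lower() for ch in text if ch.isalpha())


def looks_identical_after_normalization(distorted, original):
    """True if the distorted input and its answer collapse to the same string."""
    return normalize(distorted) == normalize(original)


def is_trivial_reversal(word):
    """Palindromes make the reversed-word task trivial: input equals output."""
    return word == word[::-1]


# A symbol insertion example collapses onto its own answer, so any training
# document containing the plain word registers as an overlap.
spurious = looks_identical_after_normalization("s.u!c/c!e.s s i/o/n", "succession")
```

Under this normalization every symbol insertion example is indistinguishable from an ordinary occurrence of the word, which explains a high overlap rate with no effect on measured performance.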
We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply to verify how much actual contamination existed. These appeared to often contain false positives. They had either no actual contamination, or had contamination that did not give away the answer to the task. One notable exception was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very small, with the clean subset scoring within $0.5\%$ of the full dataset. Also, strictly speaking, our fill-in-the-blank format precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this paper, the potential contamination is noted in the results section.
An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the same distribution as the original dataset. It remains possible that memorization inflates results but at the same time is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small models, which are unlikely to be memorizing.
我们污染分析的一个重要限制是,我们无法确定干净子集是否来自与原始数据集相同的分布。仍然有可能记忆效应夸大了结果,但同时被某些统计偏差精确抵消,导致干净子集更容易。然而,接近零的偏移数量之多表明这种情况不太可能,而且我们还观察到小模型的偏移没有明显差异,这些小模型不太可能进行记忆。
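The clean-versus-full comparison discussed above can be illustrated with a minimal sketch. The function name `contamination_shift`, the per-example inputs, and the choice to report the shift as a relative percentage are assumptions made for illustration; the exact procedure used in the paper is described in Appendix C and is not reproduced here.

```python
def contamination_shift(correct, is_dirty):
    """Compare accuracy on the full benchmark with accuracy on its clean subset.

    correct  -- list of 0/1 flags, one per example (1 = model answered correctly)
    is_dirty -- list of booleans, True if the example was flagged as contaminated
    Returns (full_accuracy, clean_accuracy, relative_change_percent).
    The relative-change definition is an illustrative assumption.
    """
    if len(correct) != len(is_dirty) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    full_acc = sum(correct) / len(correct)
    clean = [c for c, dirty in zip(correct, is_dirty) if not dirty]
    if not clean:
        raise ValueError("no clean examples remain")
    clean_acc = sum(clean) / len(clean)
    if full_acc == 0:
        raise ValueError("relative change is undefined when full accuracy is 0")
    relative_change = 100.0 * (clean_acc - full_acc) / full_acc
    return full_acc, clean_acc, relative_change


# Toy data, invented purely to exercise the function.
correct = [1, 1, 0, 1, 0, 1, 1, 0]
is_dirty = [True, False, False, False, True, False, False, False]
full_acc, clean_acc, shift = contamination_shift(correct, is_dirty)
print(f"full={full_acc:.3f} clean={clean_acc:.3f} shift={shift:+.2f}%")
```

A shift close to zero on many benchmarks is the pattern the text relies on; as the paragraph above notes, such a result cannot rule out a clean subset that is systematically easier or harder.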
Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.
总体而言,我们已尽最大努力来衡量和记录数据污染的影响,并根据严重程度对问题结果进行标注或直接删除。在设计基准测试和训练模型时,仍有许多工作要做,以解决这一重要且微妙的问题。有关我们分析的更详细解释,请参阅附录 C。
5 Limitations
5 局限性
GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.
GPT-3 及其分析存在一些局限性。以下我们描述其中一些,并为未来的工作提出方向。
First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3’s limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA $[\mathrm{BZB^{+}19}]$ ) that test this domain. Specifically GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”. Quantitatively, GPT-3’s in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3, and in particular it does little better than chance when evaluated one-shot or even few-shot on some “comparison” tasks, such as determining if two words are used the same way in a sentence, or if one sentence implies another (WIC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3’s strong few-shot performance on many other tasks.
首先,尽管 GPT-3 在定量和定性方面有了显著的改进,特别是与其前身 GPT-2 相比,但它在文本合成和多个 NLP 任务中仍然存在明显的弱点。在文本合成方面,尽管整体质量很高,但 GPT-3 生成的样本有时在文档级别上会出现语义重复,在足够长的段落中失去连贯性,自相矛盾,偶尔还会包含不合逻辑的句子或段落。我们将发布 500 个未经筛选的无条件样本集合,以帮助更好地理解 GPT-3 在文本合成方面的局限性和优势。在离散语言任务领域,我们非正式地注意到,尽管 GPT-3 在某些测试该领域的数据集(如 PIQA [BZB⁺19])上表现良好,但它在“常识物理”方面似乎特别困难。具体来说,GPT-3 在回答诸如“如果我把奶酪放进冰箱,它会融化吗?”这类问题时存在困难。从定量角度来看,GPT-3 的上下文学习性能在我们的基准测试套件中存在一些显著差距,如第 3 节所述,特别是在一些“比较”任务上,如确定两个词在句子中的使用方式是否相同,或一个句子是否暗示另一个句子(分别为 WIC 和 ANLI),以及在一部分阅读理解任务上,GPT-3 的表现几乎与随机猜测无异。这一点尤其引人注目,因为 GPT-3 在许多其他任务上的少样本表现非常出色。
GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample and compute likelihoods with this model class. As a result our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models $[\mathrm{RSR}^{+}19]$. Thus our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3’s lagging few-shot performance on a few of the tasks, such as WIC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few- or zero-shot learning, is a promising direction for future research, and could help achieve the “best of both worlds”.
GPT-3 存在一些结构和算法上的局限性,这可能是上述部分问题的原因。我们专注于探索自回归语言模型中的上下文学习行为,因为使用这类模型进行采样和计算概率是相对简单的。因此,我们的实验没有包含任何双向架构或其他训练目标,例如去噪。这与最近的许多文献形成了显著差异,这些文献记录了在使用这些方法时,相比标准语言模型,微调性能有所提升 $[\mathrm{RSR}^{+}19]$ 。因此,我们的设计决策可能会在那些从双向性中受益的任务上表现较差。这些任务可能包括填空任务、需要回顾并比较两段内容的任务,或者需要重新阅读或仔细考虑一段长文后生成简短答案的任务。这可能是 GPT-3 在某些任务上少样本表现滞后的一个可能解释,例如 WIC(涉及比较一个词在两个句子中的使用)、ANLI(涉及比较两个句子以判断一个是否暗示另一个)以及一些阅读理解任务(例如 QuAC 和 RACE)。我们还基于过去的文献推测,一个大型的双向模型在微调方面会比 GPT-3 更强。构建一个与 GPT-3 规模相当的双向模型,或者尝试让双向模型在少样本或零样本学习中发挥作用,是未来研究的一个有前景的方向,可能有助于实现“两全其美”。
A more fundamental limitation of the general approach described in this paper – scaling up any LM-like model, whether autoregressive or bidirectional – is that it may eventually run into (or could already be running into) the limits of the pre-training objective. Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world $[\mathrm{BHT^{+}20}]$. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans $[\mathrm{ZSW^{+}19a}]$, fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world $[\mathrm{CLY^{+}19}]$.
本文所述通用方法(即扩展任何类语言模型 (LM) 的模型,无论是自回归还是双向的)的一个更根本的限制是,它可能最终会遇到(或可能已经遇到)预训练目标的限制。我们当前的目标是对每个Token进行同等加权,缺乏对预测内容重要性的区分。[RRS20]展示了针对感兴趣实体定制预测的好处。此外,对于自监督目标,任务规范依赖于将所需任务强制转化为预测问题,而最终,有用的语言系统(例如虚拟助手)可能更适合被视为采取目标导向的行动,而不仅仅是做出预测。最后,大型预训练语言模型并未基于其他经验领域(如视频或现实世界的物理交互)进行训练,因此缺乏大量关于世界的上下文信息 $[\mathrm{BHT^{+}20}]$。由于所有这些原因,扩展纯自监督预测可能会遇到限制,可能需要采用不同的方法进行增强。有希望的未来方向可能包括从人类学习目标函数 $[\mathrm{ZSW^{+}19a}]$,通过强化学习进行微调,或添加其他模态(如图像)以提供基础和对世界的更好建模 $[\mathrm{CLY^{+}19}]$。
Another limitation broadly shared by language models is poor sample efficiency during pre-training. While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime [Lin20]. Improving pre-training sample efficiency is an important direction for future work, and might come from grounding in the physical world to provide additional information, or from algorithmic improvements.
语言模型普遍存在的另一个局限是预训练期间的样本效率低下。虽然 GPT-3 在测试时的样本效率上更接近人类(少样本或零样本),但它在预训练期间所接触的文本量仍然远超人类一生所见的文本量 [Lin20]。提高预训练的样本效率是未来工作的重要方向,可能通过基于物理世界的额外信息或算法改进来实现。
A limitation, or at least uncertainty, associated with few-shot learning in GPT-3 is ambiguity about whether few-shot learning actually learns new tasks “from scratch” at inference time, or if it simply recognizes and identifies tasks that it has learned during training. These possibilities exist on a spectrum, ranging from demonstrations in the training set that are drawn from exactly the same distribution as those at test time, to recognizing the same task but in a different format, to adapting to a specific style of a general task such as QA, to learning a skill entirely de novo. Where GPT-3 is on this spectrum may also vary from task to task. Synthetic tasks such as word scrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pre-training, although possibly from data that is very different in organization and style than the test data. Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations. Even organizing diverse demonstrations during pre-training and identifying them at test time would be an advance for language models, but nevertheless understanding precisely how few-shot learning works is an important unexplored direction for future research.
GPT-3 中少样本学习的一个局限性或至少是不确定性在于,尚不清楚少样本学习是否真的在推理时“从零开始”学习新任务,还是仅仅识别和识别在训练期间学到的任务。这些可能性存在于一个范围内,从训练集中的演示与测试时的演示完全相同的分布,到识别相同任务但格式不同,再到适应一般任务(如问答)的特定风格,再到完全从头学习一项技能。GPT-3 在这个范围内的位置可能因任务而异。像单词打乱或无意义单词定义这样的合成任务似乎特别可能从头学习,而翻译显然必须在预训练期间学习,尽管可能是从组织和风格与测试数据非常不同的数据中学习的。最终,甚至不清楚人类是从零开始学习还是从先前的演示中学习。即使在预训练期间组织多样化的演示并在测试时识别它们,对于语言模型来说也是一种进步,但准确理解少样本学习的工作原理仍然是未来研究的一个重要且未探索的方向。
A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.
与 GPT-3 规模模型相关的一个限制是,无论目标函数或算法如何,它们在进行推理时既昂贵又不方便,这可能对当前形式下这种规模模型的实际适用性构成挑战。未来可能的一个方向是将大模型蒸馏 (distillation) [HVD15] 到特定任务的可管理大小。像 GPT-3 这样的大模型包含非常广泛的技能,其中大多数技能对于特定任务来说并不需要,这表明在原则上可以进行激进的蒸馏。蒸馏在一般情况下已经得到了广泛研究 [LHCG19a],但尚未在数千亿参数的规模上进行尝试;将其应用于这种规模的模型可能会带来新的挑战和机遇。
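To make the distillation idea concrete, the following is a minimal sketch of a soft-target distillation loss in the general style associated with [HVD15]: the student is trained to match the teacher's temperature-softened output distribution. The function names, the example logits, and the temperature value are illustrative assumptions, and nothing here reflects an experiment carried out in this paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions.

    The result is multiplied by temperature**2, a common convention that keeps
    gradient magnitudes comparable across temperatures (an assumption here).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, teacher))          # identical outputs give zero loss
print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # mismatched outputs give a positive loss
```

In a task-specific setting, the teacher logits would come from the large model evaluated only on inputs from the target task, which is one way the "most skills are not needed" observation could be exploited.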
Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts (Section 6).
最后,GPT-3 与大多数深度学习系统共享一些局限性——其决策不易解释,对于新颖输入的预测不一定校准良好,正如在标准基准测试中观察到的性能方差远高于人类,并且它保留了训练数据中的偏见。最后一个问题——数据中的偏见可能导致模型生成刻板或偏见内容——从社会角度来看尤其值得关注,将在下一节关于更广泛影响的部分(第6节)中与其他问题一起讨论。
6 Broader Impacts
6 更广泛的影响
Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models.
语言模型对社会有着广泛的有益应用,包括代码和写作自动补全、语法辅助、游戏叙事生成、改善搜索引擎响应以及回答问题。但它们也可能有潜在的有害应用。GPT-3 相比小型模型提高了文本生成的质量和适应性,并增加了区分合成文本与人类书写文本的难度。因此,它有可能推动语言模型的有益和有害应用。
Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3).
我们在此关注改进后语言模型的潜在危害,并非认为这些危害必然更大,而是为了激发研究和缓解这些危害的努力。像GPT-3这样的语言模型具有广泛的影响。我们主要关注两个问题:6.1节中讨论的GPT-3等语言模型可能被故意滥用的风险,以及6.2节中讨论的GPT-3等模型中的偏见、公平性和代表性等问题。我们还将简要讨论能效问题(6.3节)。
6.1 Misuse of Language Models
6.1 语言模型的滥用
Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.
恶意使用大语言模型可能有些难以预测,因为它们通常涉及在非常不同的环境中或以研究人员预期之外的目的重新利用这些模型。为了帮助解决这个问题,我们可以从传统的安全风险评估框架的角度来思考,这些框架概述了关键步骤,如识别威胁和潜在影响、评估可能性,以及确定风险(即可能性和影响的结合)[Ros12]。我们讨论三个因素:潜在的滥用应用、威胁行为者以及外部激励结构。
6.1.1 Potential Misuse Applications
6.1.1 潜在的滥用应用
Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text generation could lower existing barriers to carrying out these activities and increase their efficacy.
任何依赖生成文本的社会有害活动都可能因强大的语言模型而增强。例如,虚假信息、垃圾邮件、网络钓鱼、滥用法律和政府程序、欺诈性学术论文写作以及社会工程学借口。这些应用中的许多都受限于人类撰写高质量文本的能力。能够生成高质量文本的语言模型可能会降低执行这些活动的现有障碍,并提高其效果。
The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text, shown in Section 3.9.4, represents a concerning milestone in this regard.
随着文本合成质量的提高,语言模型的滥用潜力也在增加。GPT-3 在 3.9.4 节中生成的多段合成内容,人们难以将其与人类撰写的文本区分开来,这标志着一个令人担忧的里程碑。
6.1.2 Threat Actor Analysis
6.1.2 威胁行为者分析
Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to ‘advanced persistent threats’ (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas $[\mathrm{SBC^{+}19}]$ .
威胁行为者可以根据技能和资源水平进行分类,从技能和资源水平较低或中等的行为者(他们可能能够构建恶意产品)到“高级持续性威胁 (APTs)”(Advanced Persistent Threats):这些是技能高超且资源充足(例如由国家支持的)的团体,拥有长期目标 $[\mathrm{SBC^{+}19}]$。
To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.
为了了解低技能和中等技能的行为者如何看待语言模型,我们一直在监控那些经常讨论误导策略、恶意软件分发和计算机欺诈的论坛和聊天群组。虽然我们在2019年春季GPT-2首次发布后确实发现了大量关于滥用的讨论,但自那时以来,我们发现的实验实例较少,且没有成功的部署案例。此外,这些滥用讨论与语言模型技术的媒体报道相关。基于此,我们评估认为,这些行为者滥用语言模型的威胁并非迫在眉睫,但可靠性的显著提升可能会改变这一现状。
Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or “controlling” the content of language models are still at a very early stage.
由于APT(高级持续性威胁)通常不会公开讨论其操作,我们咨询了专业的威胁分析师,了解可能涉及使用语言模型的APT活动。自GPT-2发布以来,尚未发现可能通过使用语言模型获得潜在收益的操作有明显差异。评估认为,语言模型可能不值得投入大量资源,因为目前尚无令人信服的证据表明当前的语言模型在生成文本方面显著优于现有方法,而且“定向”或“控制”语言模型内容的方法仍处于非常早期的阶段。
6.1.3 External Incentive Structures
6.1.3 外部激励结构
Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.
每个威胁行为者群体也有一套他们依赖的战术、技术和程序 (TTPs) 来实现他们的目标。TTPs 受到经济因素的影响,如可扩展性和部署的便捷性;钓鱼攻击在所有群体中极为流行,因为它提供了一种低成本、低投入、高回报的方法来部署恶意软件和窃取登录凭证。使用语言模型来增强现有的 TTPs 可能会进一步降低部署成本。
Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be.
易用性是另一个重要的激励因素。拥有稳定的基础设施对TTPs的采用有很大影响。然而,语言模型的输出是随机的,尽管开发者可以对其进行约束(例如使用top-k截断),但在没有人类反馈的情况下,它们无法保持一致的表现。如果一个社交媒体虚假信息机器人在99%的时间内产生可靠的输出,但在1%的时间内产生不连贯的输出,这可能会减少操作该机器人所需的人力。但仍然需要人类来过滤输出,这限制了操作的可扩展性。
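The top-k truncation mentioned above can be sketched as follows: only the k highest-scoring tokens remain eligible, and one of them is drawn in proportion to its softmax weight. The function name `top_k_sample` and the example logits are illustrative assumptions; this is not the sampling code used for GPT-3.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample a token index after restricting to the k highest-scoring tokens."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalized softmax weights over the surviving tokens only.
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical logits over a five-token vocabulary.
logits = [3.2, 0.1, 2.9, -1.0, 0.5]
rng = random.Random(0)
print([top_k_sample(logits, k=2, rng=rng) for _ in range(10)])  # only indices 0 and 2 can appear
```

Truncation narrows the set of possible outputs but leaves the draw itself random, which is consistent with the point above that constrained sampling alone does not guarantee consistent behavior.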
Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this through a combination of mitigation research, prototyping, and coordinating with other technical developers.
根据我们对该模型的分析以及对威胁行为者和环境的分析,我们怀疑 AI 研究人员最终会开发出足够一致且可控的语言模型,从而引起恶意行为者的更大兴趣。我们预计这将为更广泛的研究社区带来挑战,并希望通过结合缓解研究、原型设计以及与其他技术开发人员的协调来解决这一问题。
6.2 Fairness, Bias, and Representation
6.2 公平性、偏见与代表性
Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation.
训练数据中存在的偏见可能导致模型生成刻板或带有偏见的内容。这令人担忧,因为模型偏见可能通过强化现有刻板印象和产生贬低性描述等方式,以不同方式伤害相关群体 [Cra17]。为了更好地理解 GPT-3 在公平性、偏见和代表性方面的局限性,我们对模型中的偏见进行了分析。
Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s biases even within the studied categories.
我们的目标不是详尽地描述 GPT-3 的特性,而是对其一些局限性和行为进行初步分析。我们重点关注与性别、种族和宗教相关的偏见,尽管可能还存在许多其他类别的偏见,可以在后续工作中进行研究。这是一项初步分析,并未反映模型在研究类别中的所有偏见。
Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension.
总体而言,我们的分析表明,经过互联网训练的模型具有互联网规模的偏见;模型往往反映了其训练数据中存在的刻板印象。下面我们讨论在性别、种族和宗教维度上的初步偏见发现。我们探究了1750亿参数模型以及类似的小型模型中的偏见,以观察它们在这一维度上是否存在差异以及如何不同。
6.2.1 Gender
6.2.1 性别
In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found that occupations in general have a higher probability of being followed by a male gender identifier than a female one (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant). 83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured this by feeding the model a context such as "The detective was a" and then looking at the probability of the model following up with male indicating words (e.g. man, male etc.) or female indicating words (woman, female etc.). In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper etc.
在我们对 GPT-3 中性别偏见的研究中,我们重点关注了性别与职业之间的关联。我们发现,当给定诸如“The {occupation} was a”(中性变体)这样的上下文时,大多数职业更有可能被 GPT-3 识别为男性性别标识符(换句话说,它们倾向于男性)。在我们测试的 388 个职业中,83% 的职业更有可能被 GPT-3 识别为男性。我们通过向模型输入诸如“The detective was a”这样的上下文,然后观察模型后续生成男性指示词(例如 man、male 等)或女性指示词(例如 woman、female 等)的概率来衡量这一点。特别是,像立法者、银行家或名誉教授等需要较高教育水平的职业,以及像石匠、机械师和警长等需要高强度体力劳动的职业,都明显倾向于男性。而更有可能被识别为女性的职业包括助产士、护士、接待员、管家等。
We also tested how these probabilities changed when we shifted the context to "The competent {occupation} was a" (Competent Variant), and to "The incompetent {occupation} was a" (Incompetent Variant), for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a", the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a", the majority of occupations still leaned male, with a probability similar to that for our original neutral prompt. The average occupation bias, measured as $\frac{1}{n_{\mathrm{jobs}}}\sum_{\mathrm{jobs}}\log\left(\frac{P(\mathrm{female}\mid\mathrm{Context})}{P(\mathrm{male}\mid\mathrm{Context})}\right)$, was $-1.11$ for the Neutral Variant, $-2.14$ for the Competent Variant, and $-1.15$ for the Incompetent Variant.
我们还测试了当我们将上下文改为“The competent {occupation} was a”(称职变体)以及改为“The incompetent {occupation} was a”(不称职变体)时,数据集中每个职业的这些概率如何变化。我们发现,当提示为“The competent {occupation} was a”时,大多数职业被男性标识符跟随的概率比女性标识符更高,这比我们原始的中性提示“The {occupation} was a”的情况更为明显。当提示为“The incompetent {occupation} was a”时,大多数职业仍然倾向于男性,其概率与原始中性提示的情况相近。平均职业偏差(以 $\frac{1}{n_{\mathrm{jobs}}}\sum_{\mathrm{jobs}}\log\left(\frac{P(\mathrm{female}\mid\mathrm{Context})}{P(\mathrm{male}\mid\mathrm{Context})}\right)$ 衡量)在中性变体中为 $-1.11$,在称职变体中为 $-2.14$,在不称职变体中为 $-1.15$。
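The average occupation bias reported above is the mean, over occupations, of the log ratio between the probability of a female identifier and the probability of a male identifier following the prompt. A minimal sketch of that computation is given below. The probabilities are invented for illustration and are not model outputs, the natural logarithm is an assumption (the text does not state the base), and the function name is hypothetical.

```python
import math


def average_occupation_bias(probs):
    """Mean of log(P(female | context) / P(male | context)) over occupations.

    probs maps an occupation name to a (p_female, p_male) pair.
    Negative values indicate a male lean, positive values a female lean.
    """
    if not probs:
        raise ValueError("at least one occupation is required")
    log_ratios = [math.log(p_female / p_male) for p_female, p_male in probs.values()]
    return sum(log_ratios) / len(log_ratios)


# Invented probabilities for three occupations (illustrative only).
toy_probs = {
    "detective": (0.10, 0.40),
    "nurse": (0.50, 0.10),
    "mason": (0.02, 0.50),
}
print(round(average_occupation_bias(toy_probs), 3))
```

Because the metric averages log ratios, one strongly skewed occupation can outweigh several mildly skewed ones, so it complements rather than replaces the share of occupations that lean male.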
We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model’s tendency to associate most occupations with males. One method measured the model’s ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. ‘She’ refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee).
我们还使用两种方法对Winogender数据集[RNLVD18]进行了代词解析,进一步证实了模型倾向于将大多数职业与男性相关联。一种方法测量了模型正确分配代词作为职业或参与者的能力。例如,我们向模型输入一个上下文,如“顾问会见了被顾问者,因为她想获得关于工作申请的建议。‘她’指的是”,并找到两个可能选项之间概率最低的选项(职业选项:顾问;参与者选项:被顾问者)。
Occupation and participant words often have societal biases associated with them such as the assumption that most occupants are by default male. We found that the language models learnt some of these biases such as a tendency to associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns, with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger models are more robust than smaller models.
职业和参与者词汇通常带有社会偏见,例如默认大多数从业者为男性的假设。我们发现,语言模型学习到了一些这样的偏见,例如更倾向于将女性代词与参与者职位相关联,而不是男性代词。GPT-3 175B 在此任务中的准确率最高,为 64.17%。它也是唯一一个在职业句子(正确答案为职业选项的句子)中,女性准确率高于男性的模型(81.7% 对 76.7%)。所有其他模型在职业句子中,男性代词的准确率均高于女性代词,除了我们的第二大模型 GPT-3 13B,它对两者的准确率相同,均为 60%。这提供了一些初步证据,表明在偏见问题可能导致语言模型出错的地方,较大的模型比小模型更具鲁棒性。
We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance oriented words such as “beautiful” and “gorgeous” as compared to men who were more often described using adjectives that span a greater spectrum.
我们还进行了共现测试,分析了哪些词可能出现在其他预选词的附近。我们通过为数据集中的每个提示生成800个长度为50的输出,温度为1,top p为0.9,创建了一个模型输出样本集。对于性别,我们使用了诸如“He was very”、“She was very”、“He would be described as”、“She would be described as”等提示。我们使用现成的词性标注工具 [LB02] 查看了前100个最受青睐的形容词和副词。我们发现,女性更常被描述为使用与外貌相关的词汇,如“beautiful”和“gorgeous”,而男性则更常被描述为使用涵盖更广泛范围的形容词。
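The sampling settings quoted above (temperature 1, top p of 0.9) correspond to nucleus sampling: after temperature scaling, only the smallest set of tokens whose cumulative probability reaches p is kept, and the next token is drawn from that renormalized set. The sketch below illustrates the idea on a toy distribution; the function names and logits are assumptions for illustration and this is not the generation code used in the paper.

```python
import math
import random


def nucleus_distribution(logits, temperature=1.0, top_p=0.9):
    """Return {token_index: probability} restricted to the top-p nucleus."""
    if temperature <= 0 or not 0 < top_p <= 1:
        raise ValueError("temperature must be > 0 and top_p must be in (0, 1]")
    # Temperature-scaled softmax, computed stably.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index from the nucleus distribution."""
    dist = nucleus_distribution(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy logits whose softmax is roughly (0.5, 0.3, 0.15, 0.05).
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(sorted(nucleus_distribution(toy_logits, temperature=1.0, top_p=0.9)))
```

With these toy values the least likely token falls outside the nucleus, so it can never be sampled, while the remaining three keep their relative proportions.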
Table 6.1: Most Biased Descriptive Words in 175B Model
表 6.1: 175B 模型中最具偏见的描述性词语
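| Top 10 Most Biased Male Descriptive Words with Raw Co-Occurrence Counts | Top 10 Most Biased Female Descriptive Words with Raw Co-Occurrence Counts |
|---|---|
| Average Number of Co-Occurrences Across All Words: 17.5 | Average Number of Co-Occurrences Across All Words: 23.9 |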
| Large (16) | Optimistic (12) |
| Mostly (15) | Bubbly (12) |
| Lazy (14) | Naughty (12) |
| Fantastic (13) | Easy-going (12) |
| Eccentric (13) | Petite (10) |
| Protect (10) | Tight (10) |
| Jolly (10) | Pregnant (10) |
| Stable (9) | Gorgeous (28) |
| Personable (22) | Sucked (8) |
| Survive (7) | Beautiful (158) |
Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each word co-occurred with a pronoun indicator. “Most Favored” here indicates words which were most skewed towards a category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective, we have also included the average for the number of co-occurrences across all qualifying words for each gender.
表 6.1 展示了模型最偏好的前 10 个描述性词汇,以及每个词汇与代词指示词共现的原始次数。这里的“最偏好”指的是那些在与某一类别共现时,相较于另一类别,共现率更高的词汇。为了更直观地理解这些数字,我们还列出了每个性别所有符合条件的词汇的共现次数的平均值。
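The notion of "most favored" words can be illustrated with a small co-occurrence sketch: count words in the completions generated for each category, convert the counts to rates, and rank words by how much higher their rate is for one category than for the other. The ratio-with-smoothing score, the regular-expression tokenizer, and the toy completions are all assumptions made for illustration; the analysis in the text additionally restricts attention to adjectives and adverbs using a POS tagger, which is omitted here.

```python
import re
from collections import Counter


def word_rates(samples):
    """Return (raw counts, per-word rate) over a list of generated texts."""
    counts = Counter()
    for text in samples:
        counts.update(re.findall(r"[a-z'-]+", text.lower()))
    total = sum(counts.values())
    return counts, {word: c / total for word, c in counts.items()}


def most_favored(samples_a, samples_b, top_n=10, smoothing=1e-6):
    """Words most skewed towards category A, with their raw counts in A.

    Skew is the ratio of a word's rate in A to its (smoothed) rate in B;
    this particular score is an illustrative assumption.
    """
    counts_a, rates_a = word_rates(samples_a)
    _, rates_b = word_rates(samples_b)
    skew = {w: rates_a[w] / (rates_b.get(w, 0.0) + smoothing) for w in rates_a}
    ranked = sorted(skew, key=skew.get, reverse=True)[:top_n]
    return [(w, counts_a[w]) for w in ranked]


# Toy completions, invented for illustration.
samples_she = ["she was very beautiful", "she was very kind"]
samples_he = ["he was very large", "he was very kind"]
print(most_favored(samples_she, samples_he, top_n=2))
```

Reporting the raw count next to each word, as Table 6.1 does, helps separate words that are skewed but rare from words that are both skewed and frequent.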
6.2.2 Race
6.2.2 种族
To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation $[\mathrm{HZJ^{+}19}]$, we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).
为了研究 GPT-3 中的种族偏见,我们使用以下提示语作为种子输入模型:“The {race} man was very”、“The {race} woman was very”以及“People would describe the {race} person as”,并为每个提示生成了 800 个样本,其中 {race} 被替换为表示种族类别的术语,例如 White 或 Asian。然后,我们测量了生成样本中的词语共现情况。鉴于先前的研究表明,语言模型在改变诸如职业等特征时会产生不同情感的文本 [HZJ+19],我们探讨了种族如何影响情感。我们使用 SentiWordNet [BES10] 对与每个种族不成比例共现的词语进行情感测量。每个词语的情感得分范围从 100 到 -100,正分表示积极词语(例如 wonderfulness: 100, amicable: 87.5),负分表示消极词语(例如 wretched: -87.5, horrid: -87.5),得分为 0 表示中性词语(例如 sloping, chalet)。
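The word-level sentiment scoring described above can be sketched with a toy lexicon. The six scores below are the examples quoted in the text; everything else, including the function name, the decision to average scores, and the handling of words missing from the lexicon, is an assumption for illustration. The real analysis uses SentiWordNet, whose interface is not reproduced here.

```python
# Toy lexicon built only from the example scores quoted in the text.
TOY_LEXICON = {
    "wonderfulness": 100.0,
    "amicable": 87.5,
    "wretched": -87.5,
    "horrid": -87.5,
    "sloping": 0.0,
    "chalet": 0.0,
}


def mean_sentiment(words, lexicon=TOY_LEXICON):
    """Average sentiment of the words found in the lexicon (0.0 if none are found).

    Averaging is an illustrative aggregation choice; words that are missing
    from the lexicon are skipped rather than treated as neutral.
    """
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical lists of words that co-occurred disproportionately with a prompt.
print(mean_sentiment(["amicable", "chalet", "unlisted"]))
print(mean_sentiment(["wretched", "horrid"]))
```

Because each word is scored in isolation, a completion that discusses a painful historical topic receives a negative score regardless of the stance it takes, which is the caveat raised in the paragraph that follows.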
It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology.
需要注意的是,我们明确提示模型讨论种族问题,这反过来生成了关注种族特征的文本;这些结果并非来自模型在自然环境中讨论种族问题,而是在实验设置中讨论种族问题,这些模型已被引导这样做。此外,由于我们通过简单地查看词语共现来测量情感,结果情感可能反映社会历史因素——例如,与讨论奴隶制相关的文本通常会带有负面情感,这可能导致在这种测试方法下,某个群体与负面情感相关联。
Across the models we analyzed, ‘Asian’ had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, ‘Black’ had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.
在我们分析的模型中,“Asian”的情感得分一直较高——在7个模型中的3个中排名第一。另一方面,“Black”的情感得分一直较低——在7个模型中的5个中排名最低。这些差异在较大模型上略有缩小。这一分析揭示了不同模型的偏见,并强调了需要对情感、实体和输入数据之间的关系进行更复杂的分析。

Figure 6.1: Racial Sentiment Across Models
图 6.1: 不同模型的种族情感分析
Table 6.2: The ten most favored words about each religion in the GPT-3 175B model.
表 6.2: GPT-3 175B 模型中对每种宗教最偏好的十个描述词
| Religion | Most Favored Descriptive Words |
|---|---|
| 无神论 (Atheism) | 'Theists', 'Cool', 'Agnostics', 'Mad', 'Theism', 'Defensive', 'Complaining', 'Correct', 'Arrogant', 'Characterized' |
| 佛教 (Buddhism) | 'Myanmar', 'Vegetarians', 'Burma', 'Fellowship', 'Monk', 'Japanese', 'Reluctant', 'Wisdom', 'Enlightenment', 'Non-Violent' |
| 基督教 (Christianity) | 'Attend', 'Ignorant', 'Response', 'Judgmental', 'Grace', 'Execution', 'Egypt', 'Continue', 'Comments', 'Officially' |
| 印度教 (Hinduism) | 'Caste', 'Cows', 'BJP', 'Kashmir', 'Modi', 'Celebrated', 'Dharma', 'Pakistani', 'Originated', 'Africa' |
| 伊斯兰教 (Islam) | 'Pillars', 'Terrorism', 'Fasting', 'Sheikh', 'Non-Muslim', 'Source', 'Charities', 'Levant', 'Allah', 'Prophet' |
| 犹太教 (Judaism) | 'Gentiles', 'Race', 'Semites', 'Whites', 'Blacks', 'Smartest', 'Racists', 'Arabs', 'Game', 'Russian' |
6.2.3 Religion
6.2.3 宗教
We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length $\approx 50$ with a temperature of 1 and a top $p$ of 0.9 for every prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words.
我们研究了哪些词汇与无神论、佛教、基督教、印度教、伊斯兰教和犹太教相关的宗教术语共现。为此,我们为每个提示生成了800个长度约为50的模型输出,温度为1,top $p$ 为0.9。我们的提示形式为“{宗教信徒}是”(例如“基督徒是”),针对上述六种宗教类别中的每一种。然后,我们让模型自然地完成补全,并创建了一个用于研究词汇共现的补全语料库。
The following is an example output from the model:
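"Buddhists are divided into two main branches - Theravada and Mahayana. Theravada is the more conservative branch, centering on monastic life and the earliest sutras."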
Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3.
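The sampling settings used for these completions (temperature 1, top $p$ of 0.9) can be illustrated with a minimal nucleus-sampling sketch. This is not the code used for the analysis; the function name `top_p_sample` and the NumPy-based implementation are assumptions made purely for illustration.

```python
import numpy as np


def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token id using temperature scaling and top-p (nucleus) filtering.

    Hypothetical helper: a sketch of the general technique, not the paper's code.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Numerically stable softmax.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Sort tokens from most to least likely and accumulate probability mass.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    # Renormalize over the kept tokens and sample from them.
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))
```

With a top $p$ of 0.9, low-probability tail tokens are never sampled, while the temperature of 1 leaves the relative probabilities of the remaining tokens unchanged.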
6.2.4 Future Bias and Fairness Challenges
We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting [MWZ+18].
Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, HZJ+19], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general-purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric-driven objective to ‘remove’ bias, as this has been shown to have blind spots [GG19, NvNvdG19], but in a holistic manner.
6.3 Energy Usage
Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 2.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [SDSE19].
The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].
7 Related Work
Several lines of work have focused on increasing parameter count and/or computation in language models as a means to improve generative or task performance. An early work scaled LSTM-based language models to over a billion parameters [JVS+16]. One line of work straightforwardly increases the size of transformer models, scaling up parameters and FLOPS-per-token roughly in proportion. Work in this vein has successively increased model size: 213 million parameters [VSP+17] in the original paper, 300 million parameters [DCLT18], 1.5 billion parameters [RWC+19], 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and most recently 17 billion parameters [Tur20]. A second line of work has focused on increasing parameter count but not computation, as a means of increasing models’ capacity to store information without increased computational cost. These approaches rely on the conditional computation framework [BLC13]; specifically, the mixture-of-experts method [SMM+17] has been used to produce 100 billion parameter models and more recently 50 billion parameter translation models [AJF19], though only a small fraction of the parameters are actually used on each forward pass. A third approach increases computation without increasing parameters; examples of this approach include adaptive computation time [Gra16] and the universal transformer [DGV+18]. Our work focuses on the first approach (scaling compute and parameters together, by straightforwardly making the neural net larger), and increases model size 10x beyond previous models that employ this strategy.
Several efforts have also systematically studied the effect of scale on language model performance. [KMH+20, RRBS19, LWS+20, HNA+17] find a smooth power-law trend in loss as autoregressive language models are scaled up. This work suggests that this trend largely continues as models continue to scale up (although a slight bending of the curve can perhaps be detected in Figure 3.1), and we also find relatively smooth increases in many (though not all) downstream tasks across 3 orders of magnitude of scaling.
Another line of work goes in the opposite direction from scaling, attempting to preserve strong performance in language models that are as small as possible. This approach includes ALBERT [LCG+19] as well as general [HVD15] and task-specific [SDCW19, JYS+19, KR16] approaches to distillation of language models. These architectures and techniques are potentially complementary to our work, and could be applied to decrease latency and memory footprint of giant models.
As fine-tuned language models have neared human performance on many standard benchmark tasks, considerable effort has been devoted to constructing more difficult or open-ended tasks, including question answering [KPR+19, IBGC+14, CCE+18, MCKS18], reading comprehension [CHI+18, RCM19], and adversarially constructed datasets designed to be difficult for existing language models [SBBC19, NWD+19]. In this work we test our models on many of these datasets.
Many previous efforts have focused specifically on question-answering, which constitutes a significant fraction of the tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model, and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in focusing on in-context learning but could be combined in the future with those of [GLT+20, LPP+20].
Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it structurally similar to metalearning as applied to ML in general. Here there is an extensive literature, including matching networks [VBL+16], RL2 [DSC+16], learning to optimize [RL16, ADG+16, LM17] and MAML [FAL17]. Our approach of stuffing the model’s context with previous examples is most structurally similar to RL2 and also resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model’s activations across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) updates the weights, and implicitly learns the ability to adapt to or at least recognize tasks defined at inference-time. Few-shot auto-regressive density estimation was explored in [RCP+17], and [GWC+18] studied low-resource NMT as a few-shot learning problem.
While the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with similar goals is semi-supervised learning, where approaches such as UDA [XDH+19] also explore methods of fine-tuning when very little labeled data is available.
Giving multi-task models instructions in natural language was first formalized in a supervised setting with [MKXS18] and utilized for some tasks (such as summarizing) in a language model with [RWC+19]. The notion of presenting tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied for multi-task fine-tuning rather than for in-context learning without weight updates.
Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together, rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18] and multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18] and pushed the boundaries on certain tasks [KKS+20], but is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large enough scale appears to offer a “natural” broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92].
Algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15] and encoder-decoder architectures [LLG+19, RSR+19], random permutations during training [YDY+19], architectures that improve the efficiency of sampling [DYY+19], improvements in data and training procedures [LOG+19], and efficiency increases in the embedding parameters [LCG+19]. Many of these techniques provide significant gains on downstream tasks. In this work we continue to focus on pure autoregressive language models, both in order to focus on in-context learning performance and to reduce the complexity of our large model implementations. However, it is very likely that incorporating these algorithmic advances could improve GPT-3’s performance on downstream tasks, especially in the fine-tuning setting, and combining GPT-3’s scale with these algorithmic techniques is a promising direction for future work.
8 Conclusion
We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.
Acknowledgements
The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI’s infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and supercomputing teams for making it possible to train models at this scale.
Contributions
Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and Jeffrey Wu implemented the large-scale models, training infrastructure, and model-parallel strategies.
Tom Brown, Dario Amodei, Ben Mann, and Nick Ryder conducted pre-training experiments.
Ben Mann and Alec Radford collected, filtered, deduplicated, and conducted overlap analysis on the training data.
Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Tom Henighan, and Girish Sastry implemented the downstream tasks and the software framework for supporting them, including creation of synthetic tasks.
Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show continued gains, and applied scaling laws to help predict and guide model and data scaling decisions for the research.
Ben Mann implemented sampling without replacement during training.
Alec Radford originally demonstrated few-shot learning occurs in language models.
Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and systematically studied in-context learning curves, task prompting, and evaluation methods.
Prafulla Dhariwal implemented an early version of the codebase, and developed the memory optimizations for fully half-precision training.
Rewon Child and Mark Chen developed an early version of our model-parallel strategy.
Rewon Child and Scott Gray contributed the sparse transformer.
Aditya Ramesh experimented with loss scaling strategies for pre-training.
Melanie Subbiah and Arvind Neelakantan implemented, experimented with, and tested beam search.
Pranav Shyam worked on SuperGLUE and assisted with connections to few-shot learning and meta-learning literature.
Sandhini Agarwal conducted the fairness and representation analysis.
Girish Sastry and Amanda Askell conducted the human evaluations of the model.
Ariel Herbert-Voss conducted the threat analysis of malicious use.
Gretchen Krueger edited and red-teamed the policy sections of the paper.
Benjamin Chess, Clemens Winter, Eric Sigler, Christopher Hesse, Mateusz Litwin, and Christopher Berner optimized OpenAI’s clusters to run the largest models efficiently.
Scott Gray developed fast GPU kernels used during training.
Jack Clark led the analysis of ethical impacts — fairness and representation, human assessments of the model, and broader impacts analysis, and advised Gretchen, Amanda, Girish, Sandhini, and Ariel on their work.
Dario Amodei, Alec Radford, Tom Brown, Sam McCandlish, Nick Ryder, Jared Kaplan, Sandhini Agarwal, Amanda Askell, Girish Sastry, and Jack Clark wrote the paper.
Sam McCandlish led the analysis of model scaling, and advised Tom Henighan and Jared Kaplan on their work.
Alec Radford advised the project from an NLP perspective, suggested tasks, put the results in context, and demonstrated the benefit of weight decay for training.
Ilya Sutskever was an early advocate for scaling large generative likelihood models, and advised Pranav, Prafulla, Rewon, Alec, and Aditya on their work.
Dario Amodei designed and led the research.
A Details of Common Crawl Filtering
As mentioned in Section 2.2, we employed two techniques to improve the quality of the Common Crawl dataset: (1) filtering Common Crawl and (2) fuzzy deduplication:
- In order to improve the quality of Common Crawl, we developed an automatic filtering method to remove low-quality documents. Using the original WebText as a proxy for high-quality documents, we trained a classifier to distinguish these from raw Common Crawl. We then used this classifier to re-sample Common Crawl by prioritizing documents which were predicted by the classifier to be higher quality. The classifier is trained using logistic regression with features from Spark’s standard tokenizer and HashingTF (footnote 10). For the positive examples, we used a collection of curated datasets such as WebText, Wikipedia, and our web books corpus, and for the negative examples, we used unfiltered Common Crawl. We used this classifier to score Common Crawl documents. We kept each document in our dataset iff
$$
\mathtt{np.random.pareto}(\alpha) > 1 - \mathtt{document\_score}
$$
We chose $\alpha=9$ in order to take mostly documents the classifier scored highly, but still include some documents that were out of distribution. $\alpha$ was chosen to match the distribution of scores from our classifier on WebText. We found this re-weighting increased quality as measured by loss on a range of out-of-distribution generative text samples.
- To further improve model quality and prevent overfitting (which becomes increasingly important as model capacity increases), we fuzzily deduplicated documents (i.e. removed documents with high overlap with other documents) within each dataset using Spark’s MinHashLSH implementation with 10 hashes, using the same features as were used for classification above. We also fuzzily removed WebText from Common Crawl. Overall this decreased dataset size by an average of 10%.
After filtering for duplicates and quality, we also partially removed text occurring in benchmark datasets, described in Appendix C.
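The quality-based keep rule from the first step above (keep a document when `np.random.pareto(α) > 1 - document_score`) can be sketched as follows. This is only an illustration: the function name `keep_mask`, the seeded `Generator` interface, and the example scores are assumptions, and the classifier that produces the scores is not shown.

```python
import numpy as np

ALPHA = 9.0  # value of alpha reported in the text


def keep_mask(document_scores, alpha=ALPHA, seed=0):
    """Return a boolean mask of documents to keep (hypothetical helper).

    Uses Generator.pareto, which, like np.random.pareto, draws from the
    Pareto II (Lomax) distribution with the given shape.
    """
    scores = np.asarray(document_scores, dtype=np.float64)
    rng = np.random.default_rng(seed)
    noise = rng.pareto(alpha, size=scores.shape)
    # Keep rule from the text: pareto(alpha) > 1 - document_score.
    return noise > 1.0 - scores


if __name__ == "__main__":
    scores = np.array([0.05, 0.5, 0.9, 0.99])  # made-up classifier scores
    print(keep_mask(scores))
```

Because a Lomax sample exceeds $x$ with probability $(1+x)^{-\alpha}$, a document with score $s$ is kept with probability $(2-s)^{-\alpha}$. With $\alpha=9$, a document scored 0 survives about 0.2% of the time, which is how some out-of-distribution documents are still included.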
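The fuzzy deduplication step can be illustrated with a small pure-Python MinHash sketch. It is not the Spark MinHashLSH pipeline described above: the word-shingle features, the `threshold` value, and the all-pairs comparison (instead of locality-sensitive bucketing) are simplifying assumptions, and all function names are hypothetical. Only the use of 10 hashes comes from the text.

```python
import hashlib

NUM_HASHES = 10  # number of hashes reported in the text


def shingles(text, k=5):
    """Set of k-word shingles (assumed features; the paper used HashingTF features)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """MinHash signature: the minimum hash value per seed over all shingles."""
    toks = shingles(text)
    return tuple(min(_seeded_hash(seed, t) for t in toks) for seed in range(num_hashes))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any already-kept document.

    threshold is an illustrative value; the text does not state one.
    """
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

With only 10 hashes the similarity estimate is coarse (it moves in steps of 0.1), so a sketch like this trades precision for cheap signatures.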
B Details of Model Training
To train all versions of GPT-3, we use Adam with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and $\epsilon=10^{-8}$, we clip the global norm of the gradient at 1.0, and we use cosine decay for learning rate down to 10% of its value, over 260 billion tokens (after 260 billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the first 375 million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during training (until an epoch boundary is reached) to minimize overfitting. All models use weight decay of 0.1 to provide a small amount of regularization [LH17].
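The learning-rate and batch-size schedules described above can be sketched as plain functions of the number of tokens seen. Several details are assumptions rather than facts from the text: the cosine decay is taken to start when warmup ends, "32k" is read as 32,000 tokens, and `peak_lr`, `full_batch_tokens`, and `ramp_tokens` are hypothetical parameters supplied by the caller.

```python
import math

WARMUP_TOKENS = 375e6      # linear warmup length from the text
DECAY_END_TOKENS = 260e9   # cosine decay ends here
FINAL_FRACTION = 0.10      # learning rate floor as a fraction of the peak


def learning_rate(tokens_seen, peak_lr):
    """Linear warmup, cosine decay to 10% of peak, then constant."""
    min_lr = FINAL_FRACTION * peak_lr
    if tokens_seen < WARMUP_TOKENS:
        return peak_lr * tokens_seen / WARMUP_TOKENS
    if tokens_seen >= DECAY_END_TOKENS:
        return min_lr
    # Assumption: decay progress is measured from the end of warmup.
    progress = (tokens_seen - WARMUP_TOKENS) / (DECAY_END_TOKENS - WARMUP_TOKENS)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def batch_size_tokens(tokens_seen, full_batch_tokens, ramp_tokens,
                      start_batch_tokens=32_000):
    """Linear batch-size ramp from a small value to the full value."""
    if tokens_seen >= ramp_tokens:
        return full_batch_tokens
    fraction = tokens_seen / ramp_tokens
    return start_batch_tokens + fraction * (full_batch_tokens - start_batch_tokens)
```

Expressing both schedules in tokens rather than steps keeps them well defined while the batch size itself is changing.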
During training we always train on sequences of the full $n_{\mathrm{ctx}} = 2048$ token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
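A minimal sketch of this packing scheme is shown below. The end-of-text token id, the function name `pack_documents`, and the choice to drop a trailing partial sequence are assumptions for illustration; the text does not specify how remainders or documents longer than the context window are handled.

```python
N_CTX = 2048   # context window size from the text
EOT_ID = 0     # hypothetical id for the special end-of-text token


def pack_documents(tokenized_docs, n_ctx=N_CTX, eot_id=EOT_ID):
    """Concatenate documents, delimited by an end-of-text token, into full sequences.

    No document-specific attention mask is produced: the delimiter is the only
    signal that neighbouring documents are unrelated.
    """
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eot_id)
    # Assumption: keep only complete n_ctx-length sequences and drop the remainder.
    n_full = len(stream) // n_ctx
    return [stream[i * n_ctx:(i + 1) * n_ctx] for i in range(n_full)]


if __name__ == "__main__":
    docs = [[1, 2], [3], [4, 5, 6]]
    print(pack_documents(docs, n_ctx=4))
```

In this sketch a document longer than the window simply continues into the next sequence, which is one possible reading of the description.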
C Details of Test Set Contamination Studies
In Section 4 we gave a high-level overview of test set contamination studies. In this section we provide details on methodology and results.
Initial training set filtering
We attempted to remove text occurring in benchmarks from training data by searching for 13-gram overlaps between all test/development sets used in this work and our training data, and we removed the colliding 13-gram as well as a 200 character window around it, splitting the original document into pieces. For filtering purposes we define a gram as a lowercase, whitespace delimited word with no punctuation. Pieces less than 200 characters long were discarded. Documents split into more than 10 pieces were considered contaminated and removed entirely. Originally we removed entire documents given a single collision, but that overly penalized long documents such as books for false positives. An example of a false positive might be a test set based on Wikipedia, in which the Wikipedia article quotes a single line from a book. We ignored 13-grams that matched more than 10 training documents, as inspection showed the majority of these to contain common cultural phrases, legal boilerplate, or similar content that we likely do want the model to learn, rather than undesired specific overlaps with test sets. Examples for various frequencies can be found in the GPT-3 release repository (footnote 11).
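The filtering procedure above can be sketched for a single training document as follows. This is an illustration under stated assumptions, not the pipeline used for GPT-3: the 200-character window is applied on each side of a collision, the piece count is taken before short pieces are discarded, the rule that ignores 13-grams matching more than 10 training documents is omitted, and all function names are hypothetical.

```python
import re
import string

N = 13            # n-gram size from the text
WINDOW = 200      # characters removed around a collision (assumed: on each side)
MIN_PIECE = 200   # pieces shorter than this are discarded
MAX_PIECES = 10   # documents split into more pieces are removed entirely

_PUNCT = str.maketrans("", "", string.punctuation)


def grams_with_spans(text):
    """Lowercased, punctuation-free words with their character spans."""
    out = []
    for match in re.finditer(r"\S+", text):
        word = match.group().lower().translate(_PUNCT)
        if word:
            out.append((word, match.start(), match.end()))
    return out


def ngram_set(text, n=N):
    """All n-grams of normalized words in a benchmark text."""
    words = [w for w, _, _ in grams_with_spans(text)]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def filter_document(doc, test_ngrams, n=N):
    """Return the surviving pieces of doc after removing benchmark overlaps."""
    toks = grams_with_spans(doc)
    cuts = []
    for i in range(len(toks) - n + 1):
        if tuple(w for w, _, _ in toks[i:i + n]) in test_ngrams:
            start = max(0, toks[i][1] - WINDOW)
            end = min(len(doc), toks[i + n - 1][2] + WINDOW)
            cuts.append((start, end))
    if not cuts:
        return [doc]
    # Merge overlapping removal ranges before splitting.
    cuts.sort()
    merged = [cuts[0]]
    for start, end in cuts[1:]:
        if start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    pieces, pos = [], 0
    for start, end in merged:
        pieces.append(doc[pos:start])
        pos = end
    pieces.append(doc[pos:])
    if len(pieces) > MAX_PIECES:
        return []  # treated as contaminated and removed entirely
    return [p for p in pieces if len(p) >= MIN_PIECE]
```

Splitting rather than dropping the whole document is what protects long texts such as books from being discarded over a single quoted line.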
Overlap methodology
For our benchmark overlap analysis in Section 4, we used a variable number of words $N$ to check for overlap for each dataset, where $N$ is the 5th percentile example length in words, ignoring all punctuation, whitespace, and casing. Due to spurious collisions at lower values of $N$ we use a minimum value of 8 on non-synthetic tasks. For performance reasons, we set a maximum value of 13 for all tasks. Values for $N$ and the amount of data marked as dirty are shown in Table C.1. Unlike GPT-2’s use of bloom filters to compute probabilistic bounds for test contamination, we used Apache Spark to compute exact collisions across all training and test sets. We compute overlaps between test sets and our full training corpus, even though we only trained on 40% of our filtered Common Crawl documents per Section 2.2.
We define a ‘dirty’ example as one with any $N$ -gram overlap with any training document, and a ‘clean’ example as one with no collision.
我们将“脏”样本定义为与任何训练文档存在任何 $N$ -gram 重叠的样本,而“干净”样本则定义为没有任何重叠的样本。
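As a rough sketch of the methodology above, the snippet below picks $N$ for a dataset and marks an example as dirty. It is an assumption-laden illustration: the function names are hypothetical, the 5th percentile uses the nearest-rank definition (the text does not say which definition was used), and words are extracted with a simple `\w+` pattern, which may split contractions differently from the original analysis.

```python
import math
import re


def words(text):
    """Lowercased words, ignoring punctuation and whitespace."""
    return re.findall(r"\w+", text.lower())


def choose_n(examples, synthetic=False, min_n=8, max_n=13):
    """5th-percentile example length in words, clamped as described above."""
    lengths = sorted(len(words(e)) for e in examples)
    rank = max(1, math.ceil(0.05 * len(lengths)))  # nearest-rank percentile
    n = lengths[rank - 1]
    if not synthetic:
        n = max(n, min_n)   # minimum of 8 on non-synthetic tasks
    return min(n, max_n)    # maximum of 13 on all tasks


def ngrams(text, n):
    w = words(text)
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}


def is_dirty(example, train_ngrams, n):
    """An example is dirty if it shares any N-gram with the training data."""
    return not ngrams(example, n).isdisjoint(train_ngrams)
```

Here `train_ngrams` stands in for the set of $N$-grams of the whole training corpus; at the scale described in the text that set would be computed with a distributed job rather than held in memory.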
Test and validation splits had similar contamination levels despite some test splits being unlabeled. Due to a bug revealed by this analysis, the filtering described above failed on long documents such as books. Because of cost considerations it was infeasible to retrain the model on a corrected version of the training dataset. As such, several language modeling benchmarks plus the Children's Book Test showed almost complete overlap, and therefore were not included in this paper. Overlaps are shown in Table C.1.
测试集和验证集的污染水平相似,尽管部分测试集未标注。由于分析中揭示的一个错误,上述过滤方法在长文档(如书籍)上失效。出于成本考虑,无法在修正后的训练数据集上重新训练模型。因此,几个语言建模基准测试以及儿童图书测试显示出几乎完全的重叠,因此未包含在本文中。重叠情况如表 C.1 所示。
Overlap results To understand how much having seen some of the data helps the model perform on downstream tasks, we filter every validation and test set by dirtiness. Then we run evaluation on the clean-only examples and report the relative percent change between the clean score and the original score. If the clean score is more than $1\%$ or $2\%$ worse than the overall score, it suggests the model may have overfit to the examples it has seen. If the clean score is significantly better, our filtering scheme may have preferentially marked easier examples as dirty.
重叠结果
为了了解模型在见过部分数据后对下游任务表现的帮助程度,我们根据脏数据过滤每个验证集和测试集。然后,我们仅在干净样本上运行评估,并报告干净分数与原始分数之间的相对百分比变化。如果干净分数比整体分数差超过 $1%$ 或 $2%$,则表明模型可能对见过的样本过拟合。如果干净分数显著更好,则我们的过滤方案可能优先将较简单的样本标记为脏数据。
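The comparison described above reduces to a small calculation. The sketch below assumes a hypothetical per-example record with `correct` and `dirty` flags and uses plain accuracy as the score; other metrics (F1, BLEU) would replace `accuracy` accordingly.

```python
def accuracy(results):
    """Fraction of examples answered correctly."""
    return sum(1 for r in results if r["correct"]) / len(results)


def clean_vs_all(results):
    """Relative percent change of the clean-only score versus the overall score."""
    clean = [r for r in results if not r["dirty"]]
    if not clean:
        raise ValueError("no clean examples left after filtering")
    all_score = accuracy(results)
    clean_score = accuracy(clean)
    return 100.0 * (clean_score - all_score) / all_score
```

A clearly negative value means the clean subset scores worse than the full benchmark, which is the pattern the text treats as a possible sign of memorization.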
This overlap metric tends to show a high rate of false positives for datasets that contain background information (but not answers) drawn from the web (such as SQuAD, which draws from Wikipedia) or examples less than 8 words long, which we ignored in our filtering process (except for word scrambling tasks). One instance where this technique seems to fail to give good signal is DROP, a reading comprehension task in which $94\%$ of the examples are dirty. The information required to answer the question is in a passage provided to the model, so having seen the passage during training but not the questions and answers does not meaningfully constitute cheating. We confirmed that every matching training document contained only the source passage, and none of the questions and answers in the dataset. The more likely explanation for the decrease in performance is that the $6\%$ of examples that remain after filtering come from a slightly different distribution than the dirty examples.
这种重叠度量方法在包含从网络(如SQuAD,其数据来源于维基百科)提取的背景信息(但不是答案)或长度少于8个单词的示例的数据集中,往往显示出较高的误报率,我们在过滤过程中忽略了这些情况(除了单词打乱任务)。DROP是一个阅读理解任务,其中94%的示例是“脏”的,这种技术似乎无法提供良好的信号。回答问题的信息在提供给模型的段落中,因此在训练过程中看到段落但没有看到问题和答案并不构成作弊。我们确认每个匹配的训练文档仅包含源段落,而不包含数据集中的任何问题和答案。性能下降的更可能解释是,过滤后剩余的6%示例来自与“脏”示例略有不同的分布。
Figure 4.2 shows that as the dataset becomes more contaminated, the variance of the clean/all fraction increases, but there is no apparent bias towards improved or degraded performance. This suggests that GPT-3 is relatively insensitive to contamination. See Section 4 for details on the datasets we flagged for further review.
图 4.2 显示,随着数据集污染程度的增加,干净/全部比例(clean/all fraction)的方差增大,但性能改善或下降的偏差并不明显。这表明 GPT-3 对污染相对不敏感。关于我们标记为需要进一步审查的数据集的详细信息,请参见第 4 节。
Table C.1: Overlap statistics for all datasets sorted from dirtiest to cleanest. We consider a dataset example dirty if it has a single $N$ -gram collision with any document in our training corpus. “Relative Difference Clean vs All” shows the percent change in performance between only the clean examples vs all the examples in the benchmark. “Count” shows the number of examples. “Clean percentage” is the percent of examples that are clean vs total. For “Acc/F1/BLEU” we use the metric specified in “Metric”. These scores come from evaluations with a different seed for the random examples used for in-context learning, and will therefore differ slightly from the scores elsewhere in the paper.
表 C.1: 按从最脏到最干净排序的所有数据集的重叠统计。如果数据集中的某个示例与我们的训练语料库中的任何文档有单个 $N$ -gram 碰撞,我们则认为该示例是脏的。“相对差异(干净 vs 全部)”显示了仅在干净示例与基准中所有示例之间的性能变化百分比。“计数”显示了示例的数量。“干净百分比”是干净示例占总示例的百分比。对于“Acc/F1/BLEU”,我们使用“Metric”中指定的指标。这些分数来自使用不同随机种子进行上下文学习示例的评估,因此与论文中其他地方的分数略有不同。
D Total Compute Used to Train Language Models
D 用于训练语言模型的总计算量
This appendix contains the calculations that were used to derive the approximate compute used to train the language models in Figure 2.2. As a simplifying assumption, we ignore the attention operation, as it typically uses less than $10\%$ of the total compute for the models we are analyzing.
本附录包含用于推导图 2.2 中训练语言模型所使用的大致计算量的计算过程。作为简化假设,我们忽略了注意力操作,因为它通常在我们分析的模型中使用的总计算量不到 $10%$。
Calculations can be seen in Table D.1 and are explained within the table caption.
计算可以在表 D.1 中查看,并在表格标题中进行了解释。
| Model | Total train compute (PF-days) | Total train compute (flops) | Params (M) | Training tokens (billions) | Flops per param per token | Mult for bwd pass | Fwd-pass flops per active param per token | Frac of params active for each token |
|---|---|---|---|---|---|---|---|---|
| T5-Small | 2.08E+00 | 1.80E+20 | 60 | 1,000 | 3 | 3 | 1 | |
| T5-Base | 7.64E+00 | 6.60E+20 | 220 | 1,000 | 3 | 3 | 1 | |
| T5-Large | 2.67E+01 | 2.31E+21 | 770 | 1,000 | 3 | 3 | 1 | |
| T5-3B | 1.04E+02 | 9.00E+21 | 3,000 | 1,000 | 3 | 3 | 1 | |
| T5-11B | 3.82E+02 | 3.30E+22 | 11,000 | 1,000 | 3 | 3 | 1 | |
| BERT-Base | 1.89E+00 | 1.64E+20 | 109 | 250 | 6 | 3 | 2 | |
| BERT-Large | 6.16E+00 | 5.33E+20 | 355 | 250 | 6 | 3 | 2 | |
| RoBERTa-Base | 1.74E+01 | 1.50E+21 | 125 | 2,000 | 6 | 3 | 2 | |
| RoBERTa-Large | 4.93E+01 | 4.26E+21 | 355 | 2,000 | 6 | 3 | 2 | |
| GPT-3 Small | 2.60E+00 | 2.25E+20 | 125 | 300 | 6 | 3 | 2 | |
| GPT-3 Medium | 7.42E+00 | 6.41E+20 | 356 | 300 | 6 | 3 | 2 | |
| GPT-3 Large | 1.58E+01 | 1.37E+21 | 760 | 300 | 6 | 3 | 2 | |
| GPT-3 XL | 2.75E+01 | 2.38E+21 | 1,320 | 300 | 6 | 3 | 2 | |
| GPT-3 2.7B | 5.52E+01 | 4.77E+21 | 2,650 | 300 | 6 | 3 | 2 | |
| GPT-3 6.7B | 1.39E+02 | 1.20E+22 | 6,660 | 300 | 6 | 3 | 2 | |
| GPT-3 13B | 2.68E+02 | 2.31E+22 | 12,850 | 300 | 6 | 3 | 2 | |
| GPT-3 175B | 3.64E+03 | 3.14E+23 | 174,600 | 300 | 6 | 3 | 2 | |
Table D.1: Starting from the right hand side and moving left, we begin with the number of training tokens that each model was trained with. Next we note that since T5 uses an encoder-decoder model, only half of the parameters are active for each token during a forward or backwards pass. We then note that each token is involved in a single addition and a single multiply for each active parameter in the forward pass (ignoring attention). Then we add a multiplier of 3x to account for the backwards pass (as computing both $\frac{\partial\,\mathrm{params}}{\partial\,\mathrm{loss}}$ and $\frac{\partial\,\mathrm{acts}}{\partial\,\mathrm{loss}}$ uses a similar amount of compute as the forwards pass). Combining the previous two numbers, we get the total flops per parameter per token. We multiply this value by the total training tokens and the total parameters to yield the number of total flops used during training. We report both flops and petaflop/s-days (each of which is $8.64\mathrm{e}{+19}$ flops).
表 D.1: 从右侧开始向左移动,我们首先列出每个模型训练时使用的训练 token 数量。接着我们注意到,由于 T5 使用了编码器-解码器模型,因此在每次前向或反向传播中,每个 token 只激活了一半的参数。然后我们注意到,每个 token 在反向传播中会额外涉及 3 倍的计算(因为计算 ∂p∂al 和 ∂∂laocstss 所需的计算量与正向传播相似),因此我们将总训练 token 数量与总参数数量相乘,得到训练期间使用的总浮点运算次数 (flops)。我们同时报告了 flops 和 petaflop/s-day(每个 petaflop/s-day 等于 $8.64\mathrm{e}{+19}$ flops)。
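The arithmetic behind Table D.1 can be checked directly. The sketch below is illustrative only; the function names are hypothetical, and the inputs are the parameter count (in millions), training tokens (in billions), and flops per parameter per token taken from the table rows.

```python
PFS_DAY_FLOPS = 1e15 * 86_400  # one petaflop/s sustained for a day = 8.64e19 flops


def training_flops(params_millions, tokens_billions, flops_per_param_per_token):
    """Total training flops = parameters x tokens x flops per parameter per token."""
    return (params_millions * 1e6) * (tokens_billions * 1e9) * flops_per_param_per_token


def pf_days(flops):
    """Convert a flop count to petaflop/s-days."""
    return flops / PFS_DAY_FLOPS


# GPT-3 175B row: 174,600M parameters, 300B tokens, 6 flops per parameter per token.
gpt3_flops = training_flops(174_600, 300, 6)   # about 3.14e23 flops
gpt3_pf_days = pf_days(gpt3_flops)             # about 3.64e3 PF-days
```

The same two functions reproduce the T5-Small row (60M parameters, 1,000B tokens, 3 flops per parameter per token) at about 1.80e20 flops, or roughly 2.08 PF-days.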
E Human Quality Assessment of Synthetic News Articles
合成新闻文章的人类质量评估
This appendix contains details on the experiments measuring human ability to distinguish GPT-3-generated synthetic news articles from real news articles. We first describe the experiments on the $\sim200$ word news articles, and then describe the preliminary investigation of $\sim500$ word news articles generated by GPT-3.
本附录详细介绍了测量人类区分 GPT-3 生成的合成新闻文章与真实新闻文章能力的实验。我们首先描述了关于 $\sim200$ 字新闻文章的实验,然后介绍了对 GPT-3 生成的 $\sim500$ 字新闻文章的初步调查。
Participants: We recruited 718 unique participants to take part in 6 experiments. 97 participants were excluded for failing an internet check question, leaving a total of 621 participants: 343 male, 271 female, and 7 other. Mean participant age was $\sim38$ years old. All participants were recruited through Positly, which maintains a whitelist of high-performing workers from Mechanical Turk. All participants were US-based but there were no other demographic restrictions. Participants were paid \$12 for their participation, based on a task time estimate of 60 minutes determined by pilot runs. In order to ensure that the sample of participants for each experiment quiz was unique, participants were not allowed to take part in an experiment more than once.
参与者:我们招募了718名独特的参与者参加6项实验。97名参与者因未能通过互联网检查问题而被排除,最终共有621名参与者:343名男性,271名女性,以及7名其他性别。参与者的平均年龄约为38岁。所有参与者均通过Positly招募,Positly维护了一份来自Mechanical Turk的高绩效工作者白名单。所有参与者均位于美国,但没有其他人口统计限制。参与者根据试运行确定的60分钟任务时间估计,获得12美元的报酬。为了确保每个实验测试的参与者样本是唯一的,参与者不允许多次参加同一实验。
Procedure and design: We arbitrarily selected 25 news articles that appeared in newser.com in early 2020. We used the article titles and subtitles to produce outputs from the 125M, 350M, 760M, 1.3B, 2.7B, 6.7B, 13.0B, and 175B (GPT-3) parameter language models. Five outputs per question were generated by each model and the generation with a word count closest to that of the human written article was selected automatically. This was to minimize the effect that completion length might have on participants' judgments. The same output procedure was used for each model, with the exception of the intentionally bad control model, as described in the main text.
流程与设计:我们随机选取了2020年初出现在newser.com上的25篇新闻文章。我们使用文章标题和副标题,从125M、350M、760M、1.3B、2.7B、6.7B、13.0B和200B(GPT-3)参数的语言模型中生成输出。每个模型为每个问题生成五个输出,并自动选择与人类撰写的文章字数最接近的生成结果。这是为了尽量减少完成长度可能对参与者判断的影响。每个模型的输出流程相同,除了移除了故意设置的不良控制模型,如正文所述。
| Model | Participants Recruited | Participants Excluded | Genders (m:f:other) | Mean Age | Average Word Count (human:model) |
|---|---|---|---|---|---|
| Control | 76 | 7 | 32:37:0 | 39 | 216:216 |
| GPT-3 Small | 80 | 7 | 41:31:1 | 40 | 216:188 |
| GPT-3 Medium | 80 | 7 | 46:28:2 | 39 | 216:202 |
| GPT-3 Large | 81 | 24 | 46:28:2 | 37 | 216:200 |
| GPT-3 XL | 79 | 14 | 32:32:1 | 38 | 216:199 |
| GPT-3 2.7B | 80 | 11 | 36:33:0 | 40 | 216:202 |
| GPT-3 6.7B | 76 | 5 | 46:28:2 | 37 | 216:195 |
| GPT-3 13.0B | 81 | 13 | 46:28:2 | 37 | 216:209 |
| GPT-3 175B | 80 | 9 | 42:29:0 | 37 | 216:216 |
Average time spent trying to detect model generated news article
平均检测模型生成新闻文章所花费的时间

Table E.1: Participant details and article lengths for each experiment to evaluate human detection of $\sim200$ word model generated news articles. Participants were excluded due to internet check fails.

Figure E.1: Participants spend more time trying to identify whether each news article is machine generated as model size increases. Duration on the control model is indicated with the dashed line. Line of best fit is a linear model on a log scale with $95\%$ confidence intervals.
表 E.1: 用于评估人类检测 $\sim200$ 字模型生成新闻文章的每个实验的参与者详情和文章长度。参与者因互联网检查失败而被排除。
图 E.1: 随着模型规模的增加,参与者花费更多时间试图识别每篇新闻文章是否由机器生成。控制模型的持续时间用虚线表示。最佳拟合线是对数尺度上的线性模型,具有 $95%$ 的置信区间。
In each experiment, half of the participants were randomly assigned to quiz A and half were randomly assigned to quiz B. Each quiz consisted of 25 articles: half (12-13) were human written and half (12-13) were model generated: the articles with human written completions in quiz A had model generated completions in quiz B and vice versa. The order of quiz question was shuffled for each participant. Participants could leave comments and were asked to indicate if they had seen the articles before. Participants were instructed not to look up the articles or their content during the quiz and at the end of the quiz were asked if they had looked anything up during the quiz.
在每次实验中,一半的参与者被随机分配到测验A,另一半被随机分配到测验B。每个测验包含25篇文章:一半(12-13篇)是人类撰写的,另一半(12-13篇)是模型生成的:测验A中人类撰写的文章在测验B中是模型生成的,反之亦然。每个参与者的测验问题顺序都被打乱。参与者可以留下评论,并被要求指出他们是否之前见过这些文章。参与者被指示在测验期间不要查找文章或其内容,并在测验结束时被询问是否在测验期间查找了任何内容。
Statistical Tests: To compare means on the different runs, we performed a two-sample t-test for independent groups for each model against the control. This was implemented in Python using the scipy.stats.ttest_ind function. When plotting a regression line in the graph of average participant accuracy vs model size, we fit a power law of the form $ax^{-b}$. The $95\%$ confidence intervals were estimated from the t-distribution of the sample mean.
统计测试:为了比较不同运行中的均值,我们对每个模型与对照组进行了独立样本的双样本t检验。这是在Python语言中使用scipy.stats.ttest_ind函数实现的。在绘制参与者平均准确率与模型大小的关系图时,我们拟合了形式为$\dot{a}x^{-b}$的幂律。$95%$的置信区间是根据样本均值的t分布估计的。
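To make the test above concrete, here is a standard-library sketch of the statistic that a two-sample t-test for independent groups computes. It is not the code used for the paper (which, per the text, called `scipy.stats.ttest_ind`); it assumes the equal-variance (pooled) form, since the text does not say which variant was run, and it stops at the t statistic and degrees of freedom rather than the p-value. The function name and sample values are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance (sample variances use n - 1).
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical per-participant accuracies for one model and for the control.
model_acc = [0.52, 0.48, 0.56, 0.50, 0.44]
control_acc = [0.88, 0.84, 0.92, 0.80, 0.86]
t_stat, dof = two_sample_t(model_acc, control_acc)
```

The p-value would then come from the t-distribution with `dof` degrees of freedom, which is the part a statistics library supplies.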
Duration statistics: In the main text, we discussed the finding that the ability of human participants to distinguish model and human generated news articles decreases as our models become larger. We have also found that the average time spent for a given set of questions increases as the model size increases, as shown in Figure E.1. Lower accuracy scores despite increased time investment from participants support the finding that larger models generate harder-to-distinguish news articles.
持续时间统计:在正文中,我们讨论了人类参与者区分模型和人类生成新闻文章的能力随着模型规模的增大而减弱的发现。我们还发现,随着模型规模的增大,参与者回答一组问题的平均时间也在增加,如图 E.1 所示。尽管参与者投入的时间增加,但准确率却下降,这支持了较大模型生成的新闻文章更难区分的发现。
Table E.2: Participant details and article lengths for the experiments investigating human detection of $\sim500$ word model generated news articles. Participants were excluded due to internet check fails.
表 E.2: 实验参与者详情及文章长度,用于研究人类对约500词模型生成新闻文章的检测。参与者因互联网检查失败被排除。
| Model | Participants Recruited | Participants Excluded | Genders (m:f:other) | Mean Age | Average Word Count (human:model) |
|---|---|---|---|---|---|
| Control | 79 | 17 | 32:37:0 | 39 | 569:464 |
| GPT-3 175B | 81 | 19 | 32:30:0 | 40 | 569:498 |
Preliminary investigation of $\sim500$ word articles: We recruited 160 unique US-based participants to take part in 2 experiments through Positly (details are given in Table E.2). We randomly selected 12 Reuters world news articles from late 2019 and created a context for GPT-3 175B that consisted of a single Reuters article not in this set of 12. We then used the article titles and Reuters locations to generate completions from GPT-3 175B and the 160M control model from the previous experiments. These were used to create two 12-question quizzes per model, each consisting of half human written and half model generated articles. Comprehension questions were added and articles were shown to participants in 3 stages at 30 second intervals to encourage closer reading. Participants were paid \$12 for this task. Model generation selection methods, exclusion criteria, and statistical tests mirror those of the previous experiments.
$\sim500$ 字文章的初步调查:我们通过 Positly 招募了 160 名美国参与者参加 2 项实验(详细信息见表 E.2)。我们从 2019 年底的 12 篇路透社世界新闻文章中随机选择了一篇,并为 GPT-3 175B 创建了一个上下文,该上下文由一篇不在这 12 篇文章中的路透社文章组成。然后,我们使用文章标题和路透社位置从 GPT-3 175B 和之前实验中的 160M 控制模型生成补全内容。这些内容用于为每个模型创建两个 12 题的测验,每个测验由一半人类撰写和一半模型生成的文章组成。添加了理解问题,并在 30 秒的间隔内分三个阶段向参与者展示文章,以鼓励更仔细的阅读。参与者为此任务获得了 $\mathbb{S}12$ 的报酬。模型生成选择方法、排除标准和统计测试与之前的实验相同。
F Additional Samples from GPT-3
F GPT-3 的额外样本
GPT-3 adapts well to many tasks other than the ones explored in the main body of the paper. As an example, in Figure F.1, we show four uncurated samples from a prompt suggesting that the model write a poem, with a given title, in the style of Wallace Stevens. We first experimented with a few prompts, then generated four samples with no additional editing or selection (sampling at temperature 1 using nucleus sampling [HBFC19] with $P=0.9$ ). Completions were truncated when the model began to write a new title and author heading, or broke into prose commentary.
GPT-3 在适应许多任务方面表现出色,除了本文正文中探讨的任务外。例如,在图 F.1 中,我们展示了四个未经筛选的样本,这些样本来自一个提示,要求模型以 Wallace Stevens 的风格写一首诗,并给出一个标题。我们首先尝试了几个提示,然后生成了四个样本,没有进行额外的编辑或选择(使用温度 1 进行采样,采用核采样 [HBFC19],$P=0.9$)。当模型开始写新的标题和作者标题,或转向散文评论时,生成的内容被截断。
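For readers unfamiliar with the sampling setup mentioned above, the following is a minimal sketch of nucleus (top-p) sampling over a single next-token distribution, with temperature 1 and $P=0.9$ as defaults. It is a generic illustration of the technique from [HBFC19], not GPT-3's decoding code; the function names and the toy distribution are assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_indices(probs, top_p=0.9):
    """Smallest set of most-probable tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept


def nucleus_sample(logits, top_p=0.9, temperature=1.0, rng=None):
    """Sample one token index from the renormalized nucleus."""
    rng = rng or random
    probs = softmax(logits, temperature)
    kept = nucleus_indices(probs, top_p)
    r = rng.random() * sum(probs[i] for i in kept)
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r < acc:
            return i
    return kept[-1]
```

With a toy distribution of `[0.5, 0.3, 0.15, 0.05]`, the nucleus at $P=0.9$ keeps the first three tokens and the least likely token is never sampled.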
SOME TREES John Ashbery [Poem text omitted]
一些树
John Ashbery
[诗歌文本省略]
Shadows on the Way Wallace Stevens
Generated Poem 1
生成的诗 1
Generated Poem 3
生成的诗歌 3
I must have shadows on the way If I am to walk I must have Each step taken slowly and alone To have it ready made
我必须在路上有阴影
如果我必须行走
我必须每一步都缓慢而独自地迈出
让它准备就绪
And I must think in lines of grey To have dim thoughts to be my guide Must look on blue and green And never let my eye forget That color is my friend And purple must surround me too
我必须用灰色的线条思考
让暗淡的思绪成为我的向导
必须凝视蓝色和绿色
永远不让我的眼睛忘记
色彩是我的朋友
紫色也必须环绕我
The yellow of the sun is no more Intrusive than the bluish snow That falls on all of us. I must have Grey thoughts and blue thoughts walk with me If I am to go away at all.
太阳的黄色并不比落在我们所有人身上的蓝色雪更侵扰。如果我要离开,必须有灰色的思绪和蓝色的思绪与我同行。
Generated Poem 2
生成的诗 2
Comes to the wood.
来到树林。
The sun was all we had. Now, in the shade All is changed. The mind must dwell on those White fields, that to its eyes were always old; Those ancient gleams, convoluting The way they lay among the huge roots, The great dark tomes of reverie, The plumed labyrinth of the sea.
太阳是我们唯一拥有的。如今,在阴影中
一切已改变。心灵必须沉思那些
白色的田野,在它眼中总是古老的;
那些古老的光芒,盘旋着
它们躺在巨大根须之间的方式,
那些沉思的厚重黑暗典籍,
海洋那羽饰的迷宫。
We played, too, once, in the warmest rooms. To one content, there is one content For all mankind. The forest glades Are all the more for being in fable, And wood and town in legend shadowed Once more to return. The one content Has in it every happiness, more brave For being purposeless; and so Those island gardens, deep in the mind, Make of remoteness familiarity And of the unspoken name a friend.
我们也曾在最温暖的房间里玩耍。对于一种满足,全人类都有一种满足。森林的空地因寓言而更加珍贵,而树林和城镇在传说的阴影中再次回归。这种满足包含了每一种幸福,因无目的而更加勇敢;因此,那些深藏于心灵中的岛屿花园,将遥远变为熟悉,将未说出的名字变为朋友。
Generated Poem 4
生成诗歌 4
G Details of Task Phrasing and Specifications
G 任务表述和规范细节
The following figures illustrate the formatting and phrasing of all the tasks included in the paper. All data comes from the ground truth datasets in this section, and no samples from GPT-3 are included here.
以下图表展示了本文中所有任务的格式和表述。所有数据均来自本节中的真实数据集,此处不包含任何来自 GPT-3 的样本。
Context $\rightarrow$
上下文 $\rightarrow$
Informal conversation is an important part of any business relationship.Before you start a discussion,however,make sure you understand which topics are suitable and which are considered taboo in a particular culture. Latin Americans enjoy sharing information about their local history, art and customs.You may expect questions about your family,and be sure to show pictures of your children.You may feel free to ask similar questions of your Latin American friends.The French think of conversation as an art form,and they enjoy the value of lively discussions as well as disagreements. For them,arguments can be interesting and they can cover pretty much or any topic ---- as long as they occur in are respectful and intelligent manner. In the United States,business people like to discuss a wide range of topics,including opinions about work,family,hobbies,and politics. In Japan,China,and Korea,however,people are much more private.They do not share much about their thoughts,feelings,or emotions because they feel that doing so might take away from the harmonious business relationship they’re trying to build.Middle Easterners are also private about their personal lives and family matters.It is considered rude,for example,to ask a businessman from Saudi Arabia about his wife or children. As a general rule,it’s best not to talk about politics or religion with your business friends.This can get you into trouble,even in the United States,where people hold different religious views.In addition,discussing one’s salary is usually considered unsuitable.Sports is typically a friendly subject in most parts of the world,although be careful not to criticize national sport.Instead,be friendly and praise your host’s team.
非正式对话是任何商业关系中的重要组成部分。然而,在开始讨论之前,确保你了解在特定文化中哪些话题是合适的,哪些是禁忌的。拉丁美洲人喜欢分享关于他们当地历史、艺术和习俗的信息。你可能会被问到关于你家庭的问题,并且一定要展示你孩子的照片。你可以自由地向你的拉丁美洲朋友提出类似的问题。法国人将对话视为一种艺术形式,他们享受生动讨论和分歧的价值。对他们来说,争论可以很有趣,并且可以涵盖几乎所有话题——只要这些争论是以尊重和智慧的方式进行的。在美国,商人喜欢讨论广泛的话题,包括对工作、家庭、爱好和政治的看法。然而,在日本、中国和韩国,人们更加注重隐私。他们不会过多分享他们的想法、感受或情感,因为他们觉得这样做可能会破坏他们试图建立的和谐商业关系。中东人也对他们的个人生活和家庭事务保持隐私。例如,询问一位来自沙特阿拉伯的商人关于他的妻子或孩子的问题被认为是粗鲁的。一般来说,最好不要与你的商业朋友谈论政治或宗教。这可能会让你陷入麻烦,即使是在美国,人们持有不同的宗教观点。此外,讨论某人的薪水通常被认为是不合适的。在世界上大多数地方,体育通常是一个友好的话题,但要注意不要批评国家运动。相反,要友好并赞扬你主人的团队。
$\mathsf{Q}$ : What shouldn’t you do when talking about sports with colleagues from another country?
$\mathsf{Q}$ : 与来自其他国家的同事谈论体育时,不应该做什么?
A: Criticizing the sports of your colleagues’ country.
A: 批评同事所在国家的体育运动。
Q: Which is typically a friendly topic in most places according to the author?
问:根据作者的说法,在大多数地方,哪个话题通常是友好的?
A: Sports.
A: 体育。
Q: Why are people from Asia more private in their conversation with others?
Q: 为什么亚洲人在与他人交谈时更加私密?
A: They don’t want to have their good relationship with others harmed by informal conversation.
A: 他们不想因为非正式的对话而损害与他人的良好关系。
$\mathsf{Q}$ : The author considers politics and religion .
$\mathsf{Q}$ : 作者考虑了政治和宗教。
Correct Answer $\rightarrow$ taboo

Incorrect Answer $\rightarrow$ cheerful topics

Incorrect Answer $\rightarrow$ rude topics

Incorrect Answer $\rightarrow$ topics that can never be talked about
Figure G.1: Formatted dataset example for RACE-h. When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
图 G.1: RACE-h 的格式化数据集示例。在预测时,我们按照第 2 节中描述的方法,通过每个答案的无条件概率进行归一化。
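One way to read "normalize by the unconditional probability of each answer" is to score each candidate by how much more likely it becomes given the full context than given a neutral answer prefix. The sketch below follows that reading; `logprob` is a hypothetical callable returning a model's log-probability of a completion given a context, and the neutral prefix `"A:"` is an assumption rather than something stated in this section.

```python
def pick_answer(logprob, context, candidates, answer_context="A:"):
    """Choose the candidate with the largest conditional-minus-unconditional log-probability."""
    def score(candidate):
        conditional = logprob(context, candidate)            # log P(answer | full context)
        unconditional = logprob(answer_context, candidate)   # log P(answer | neutral prefix)
        return conditional - unconditional
    return max(candidates, key=score)


# Toy stand-in for a language model: a lookup table of log-probabilities.
_table = {
    ("passage ... Q: ... A:", "taboo"): -2.0,
    ("passage ... Q: ... A:", "rude topics"): -1.5,
    ("A:", "taboo"): -6.0,
    ("A:", "rude topics"): -2.0,
}


def toy_logprob(context, completion):
    return _table[(context, completion)]


best = pick_answer(toy_logprob, "passage ... Q: ... A:", ["taboo", "rude topics"])
```

In the toy table the raw conditional score favors "rude topics", but after dividing out how likely each string is on its own, "taboo" wins, which is the effect the normalization is meant to have.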
Figure G.2: Formatted dataset example for ANLI R2
图 G.2: ANLI R2 格式化数据集示例
Context $\rightarrow$ anli 2: anli 2: The Gold Coast Hotel & Casino is a hotel and casino located in Paradise, Nevada. This locals' casino is owned and operated by Boyd Gaming. The Gold Coast is located one mile (~ 1.6km) west of the Las Vegas Strip on West Flamingo Road. It is located across the street from the Palms Casino Resort and the Rio All Suite Hotel and Casino. Question: The Gold Coast is a budget-friendly casino. True, False, or Neither?

Correct Answer $\rightarrow$ Neither

Incorrect Answer $\rightarrow$ True

Incorrect Answer $\rightarrow$ False
Context $\rightarrow$
上下文 $\rightarrow$
Q: Which of the following is True according to the passage?
Q: 根据文章内容,以下哪项是正确的?
A: If a kid hated four people,he or she had to carry four potatoes.
A: 如果一个孩子讨厌四个人,他或她必须携带四个土豆。
Q: We can learn from the passage that we should .
Q: 我们可以从文章中了解到我们应该 。
A: throw away the hatred inside
A: 抛弃内心的仇恨
Q: The children complained about besides the weight trouble.
Q: 孩子们除了体重问题外还抱怨了什么。
Q: Mrs.Smith asked her students to write on the potatoes.
Q: Smith 夫人让她的学生在土豆上写字。


Figure G.3: Formatted dataset example for RACE-m. When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
图 G.3: RACE-m 的格式化数据集示例。在预测时,我们按照第 2 节中描述的方法对每个答案的无条件概率进行归一化。
Context $\rightarrow$ My body cast a shadow over the grass because

Correct Answer $\rightarrow$ the sun was rising.

Incorrect Answer $\rightarrow$ the grass was cut.
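Figure G.5: Formatted dataset example for COPA

Context $\rightarrow$ (CNN) Yuval Rabin, whose father, Yitzhak Rabin, was assassinated while serving as Prime Minister of Israel, criticized Donald Trump's remarks. "For me personally, this is a new level of ugliness," Rabin wrote in USA Today. He said that Trump's appeal to "Second Amendment people" to stop Hillary Clinton -- comments that were criticized as a call for violence against Clinton, something Trump denied -- "were a new level of ugliness in an ugly campaign season."

- The son of a former Israeli Prime Minister who was assassinated wrote an op-ed about the consequence of violent political rhetoric.

- Warns of "parallels" between Israel of the 1990s and the U.S. today.

Correct Answer $\rightarrow$ - Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Donald Trump's aggressive rhetoric.

Correct Answer $\rightarrow$ - Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Trump's aggressive rhetoric.

Incorrect Answer $\rightarrow$ - Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Hillary Clinton's aggressive rhetoric.

Incorrect Answer $\rightarrow$ - Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned the U.S.'s aggressive rhetoric.

Incorrect Answer $\rightarrow$ - Referencing his father, who was shot and killed by an extremist amid political tension in Israel in 1995, Rabin condemned Yitzhak Rabin's aggressive rhetoric.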
Figure G.6: Formatted dataset example for ReCoRD. We consider the context above to be a single "problem" because this is how the task is presented in the ReCoRD dataset and scored in the ReCoRD evaluation script.

Figure G.7: Formatted dataset example for ANLI R1
图 G.6: ReCoRD 的格式化数据集示例。我们将上述上下文视为一个“问题”,因为这是 ReCoRD 数据集中任务的呈现方式,也是 ReCoRD 评估脚本中的评分方式。
图 G.7: ANLI R1 的格式化数据集示例
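Context $\rightarrow$ anli 1: anli 1: Fulton James MacGregor MSP is a Scottish politician who is a Scottish National Party (SNP) Member of Scottish Parliament for the constituency of Coatbridge and Chryston. MacGregor is currently Parliamentary Liaison Officer to Shona Robison, Cabinet Secretary for Health & Sport. He also serves on the Justice and Education & Skills committees in the Scottish Parliament. Question: Fulton James MacGregor is a Scottish politician who is a Liaison officer to Shona Robison who he swears is his best friend. True, False, or Neither?

Correct Answer $\rightarrow$ Neither

Incorrect Answer $\rightarrow$ True

Incorrect Answer $\rightarrow$ False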
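Context $\rightarrow$ Organisms require energy in order to do what?

Correct Answer $\rightarrow$ mature and develop.

Incorrect Answer $\rightarrow$ rest soundly.

Incorrect Answer $\rightarrow$ absorb light.

Incorrect Answer $\rightarrow$ take in nutrients.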
Figure G.8: Formatted dataset example for OpenBookQA. When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
图 G.8: OpenBookQA 的格式化数据集示例。在预测时,我们按照第 2 节中描述的方法对每个答案的无条件概率进行归一化。
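Context $\rightarrow$ Making a cake: Several cake pops are shown on a display. A woman and girl are shown making the cake pops in a kitchen. They

Correct Answer $\rightarrow$ bake them, then frost and decorate.

Incorrect Answer $\rightarrow$ taste them as they place them on plates.

Incorrect Answer $\rightarrow$ put the frosting on the cake as they pan it.

Incorrect Answer $\rightarrow$ come out and begin decorating the cake as well.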
Figure G.9: Formatted dataset example for HellaSwag
图 G.9: HellaSwag 的格式化数据集示例
Context $\rightarrow$ anli 3: anli 3: We shut the loophole which has American workers actually subsidizing the loss of their own job. They just passed an expansion of that loophole in the last few days: \$43 billion of giveaways, including favors to the oil and gas industry and the people importing ceiling fans from China. Question: The loophole is now gone True, False, or Neither?

Correct Answer $\rightarrow$ False

Incorrect Answer $\rightarrow$ True

Incorrect Answer $\rightarrow$ Neither
Figure G.10: Formatted dataset example for ANLI R3
图 G.10: ANLI R3 的格式化数据集示例
Context $\rightarrow$ Question: George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat? Answer:

Correct Answer $\rightarrow$ dry palms

Incorrect Answer $\rightarrow$ wet palms

Incorrect Answer $\rightarrow$ palms covered with oil

Incorrect Answer $\rightarrow$ palms covered with lotion
Figure G.11: Formatted dataset example for ARC (Challenge). When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
图 G.11: ARC (Challenge) 的格式化数据集示例。在预测时,我们按照第2节中描述的方法对每个答案的无条件概率进行归一化。
Context $\rightarrow$ lull is to trust as

Correct Answer $\rightarrow$ cajole is to compliance

Incorrect Answer $\rightarrow$ balk is to fortitude

Incorrect Answer $\rightarrow$ betray is to loyalty

Incorrect Answer $\rightarrow$ hinder is to destination

Incorrect Answer $\rightarrow$ soothe is to passion

Figure G.12: Formatted dataset example for SAT Analogies
Correct Context $\rightarrow$ sweater

Incorrect Context $\rightarrow$ jacket

Target Completion $\rightarrow$ looks dowdy on her.
Figure G.13: Formatted dataset example for Winograd. The ‘partial’ evaluation method we use compares the probability of the completion given a correct and incorrect context.
图 G.13: Winograd 的格式化数据集示例。我们使用的“部分”评估方法比较了在正确和错误上下文下完成句子的概率。
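Correct Context $\rightarrow$ Johnny likes fruits more than vegetables in his new keto diet because the fruits

Incorrect Context $\rightarrow$ Johnny likes fruits more than vegetables in his new keto diet because the vegetables

Target Completion $\rightarrow$ are saccharine.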
Figure G.14: Formatted dataset example for Winogrande. The ‘partial’ evaluation method we use compares the probability of the completion given a correct and incorrect context.
图 G.14: Winogrande 的格式化数据集示例。我们使用的“部分”评估方法比较了在正确和错误上下文下完成句子的概率。
| 上下文→ 阅读理解答案 | |
|---|---|
| 在这一过程中,外交活动仍在继续。对塔利班的直接施压被证明是无效的。正如国家安全委员会的一份备忘录所言:“在塔利班统治下,阿富汗与其说是一个支持恐怖主义的国家,不如说是一个被恐怖主义支持的国家。”2000年初,美国开始进行高层努力,试图说服巴基斯坦利用其对塔利班的影响力。2000年1月,助理国务卿卡尔·因德福斯和国务院反恐协调员迈克尔·希恩在伊斯兰堡会见了穆沙拉夫将军,并向他暗示,如果巴基斯坦合作,可能会在3月安排总统访问。穆沙拉夫非常渴望这次访问,他承诺会与奥马尔会面并向他施压关于本·拉登的问题。然而,他们离开后向华盛顿报告说,巴基斯坦实际上不太可能对阿富汗采取行动。克林顿总统计划访问印度,并顺道访问巴基斯坦。然而,特勤局和中央情报局强烈警告说,访问巴基斯坦可能会危及总统的生命。反恐官员也认为巴基斯坦做得不够,不值得总统访问。但克林顿总统坚持将巴基斯坦纳入他的南亚之行行程中。2000年3月25日,他的一日停留是自1969年以来美国总统首次访问巴基斯坦。在与穆沙拉夫等人的会晤中,克林顿总统主要关注巴基斯坦和印度之间的紧张局势以及核扩散的危险,但也讨论了本·拉登的问题。克林顿总统告诉我们,当他将穆沙拉夫拉到一边进行简短交谈时,他说:“我去见他时,提出了改善与美国关系的条件,如果他帮助我们抓住本·拉登并处理其他一两个问题。”美国的努力仍在继续。国务院认为谁应该访问印度和巴基斯坦? |
| Correct Answer $\rightarrow$ - [False] Bin Laden | |
Figure G.15: Formatted dataset example for MultiRC. There are three levels within MultiRC: (1) the passage, (2) the questions, and (3) the answers. During evaluation, accuracy is determined at the per-question level, with a question being considered correct if and only if all the answers within the question are labeled correctly. For this reason, we use $K$ to refer to the number of questions shown within the context.

Figure G.16: Formatted dataset example for ARC (Easy). When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
图 G.15: MultiRC 的格式化数据集示例。MultiRC 包含三个层次:(1) 段落,(2) 问题,以及 (3) 答案。在评估过程中,准确率是在每个问题层面上确定的,当且仅当问题中的所有答案都被正确标记时,该问题才被视为正确。因此,我们使用 $K$ 来表示上下文中显示的问题数量。
图 G.16: ARC (Easy) 的格式化数据集示例。在预测时,我们按照第 2 节中描述的方法对每个答案的无条件概率进行归一化。
Context $\rightarrow$ Question: Which factor will most likely cause a person to develop a fever? Answer:

Correct Answer $\rightarrow$ a bacterial population in the bloodstream

Incorrect Answer $\rightarrow$ a leg muscle relaxing after exercise

Incorrect Answer $\rightarrow$ several viral particles on the skin

Incorrect Answer $\rightarrow$ carbohydrates being digested in the stomach
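Context $\rightarrow$ Bob went to the gas station to fill up his car. His tank was completely empty and so was his wallet. The cashier offered to pay for his gas if he came back later to pay. Bob felt grateful as he drove home.

Correct Answer $\rightarrow$ Bob believed that there were good people in the world.

Incorrect Answer $\rightarrow$ Bob contemplated how unfriendly the world was.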
Figure G.17: Formatted dataset example for StoryCloze
图 G.17: StoryCloze 的格式化数据集示例
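Context $\rightarrow$ Helsinki is the capital and largest city of Finland. It is in the region of Uusimaa, in southern Finland, on the shore of the Gulf of Finland. Helsinki has a population of , an urban population of , and a metropolitan population of over 1.4 million, making it the most populous municipality and urban area in Finland. Helsinki is north of Tallinn, Estonia, east of Stockholm, Sweden, and west of Saint Petersburg, Russia. Helsinki has close historical connections with these three cities.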
The Helsinki metropolitan area includes the urban core of Helsinki, Espoo, Vantaa, Kauniainen, and surrounding commuter towns. It is the world’s northernmost metro area of over one million people, and the city is the northernmost capital of an EU member state. The Helsinki metropolitan area is the third largest metropolitan area in the Nordic countries after Stockholm and Copenhagen, and the City of Helsinki is the third largest after Stockholm and Oslo. Helsinki is Finland’s major political, educational, financial, cultural, and research center as well as one of northern Europe’s major cities. Approximately 75% of foreign companies that operate in Finland have settled in the Helsinki region. The nearby municipality of Vantaa is the location of Helsinki Airport, with frequent service to various destinations in Europe and Asia.
赫尔辛基大都市区包括赫尔辛基、埃斯波、万塔、考尼艾宁的城市核心区域以及周边的通勤城镇。它是世界上人口超过一百万的最北端大都市区,也是欧盟成员国中最北端的首都。赫尔辛基大都市区是北欧国家中仅次于斯德哥尔摩和哥本哈根的第三大都市区,而赫尔辛基市则是仅次于斯德哥尔摩和奥斯陆的第三大城市。赫尔辛基是芬兰的主要政治、教育、金融、文化和研究中心,也是北欧的主要城市之一。在芬兰运营的外国公司中,约75%都设在赫尔辛基地区。附近的万塔市是赫尔辛基机场的所在地,该机场提供频繁的航班服务,通往欧洲和亚洲的多个目的地。
Q: what is the most populous municipality in Finland?
芬兰人口最多的直辖市是什么?
A: Helsinki
A: 赫尔辛基
Q: how many people live there?
Q: 那里有多少人居住?
A: 1.4 million in the metropolitan area
A: 都会区140万
Q: what percent of the foreign companies that operate in Finland are in Helsinki?
Q: 在芬兰运营的外国公司中有多少比例位于赫尔辛基?
Target Completion $\rightarrow$ Helsinki, Espoo, Vantaa, Kauniainen, and surrounding commuter towns
Figure G.18: Formatted dataset example for CoQA

Figure G.19: Formatted dataset example for Cycled Letters
图 G.18: CoQA 的格式化数据集示例
图 G.19: Cycled Letters 的格式化数据集示例
Context $\rightarrow$ Please unscramble the letters into a word, and write that word: asinoc =

Target Completion $\rightarrow$ casino
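Context $\rightarrow$ Passage: Saint Jean de Brébeuf was a French Jesuit missionary who travelled to New France in 1625. There he worked primarily with the Huron for the rest of his life, except for a few years in France from 1629 to 1633. He learned their language and culture, writing extensively about each to aid other missionaries. In 1649, Brébeuf and another missionary were captured when an Iroquois raid took over a Huron village. Together with Huron captives, the missionaries were ritually tortured and killed on March 16, 1649. Brébeuf was beatified in 1925 and among eight Jesuit missionaries canonized as saints in the Roman Catholic Church in 1930. Question: How many years did Saint Jean de Brébeuf stay in New France before he went back to France for a few years? Answer: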
Figure G.20: Formatted dataset example for DROP
图 G.20: DROP 的格式化数据集示例
Context $\rightarrow$ Fill in blank:
上下文 $\rightarrow$ 填空:
She held the torch in front of her.
她将火把举在面前。
Target Completion $\rightarrow$ step
目标完成 $\rightarrow$ 步骤
Figure G.21: Formatted dataset example for LAMBADA

Figure G.24: Formatted dataset example for Natural Questions
图 G.21: LAMBADA 格式化数据集示例
图 G.24: Natural Questions 格式化数据集示例
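Context $\rightarrow$ Please unscramble the letters into a word, and write that word: skicts =

Target Completion $\rightarrow$ sticks

Figure G.22: Formatted dataset example for Anagrams 1 (A1)

Context $\rightarrow$ Please unscramble the letters into a word, and write that word: volwskagen =

Target Completion $\rightarrow$ volkswagen

Figure G.23: Formatted dataset example for Anagrams 2

Context $\rightarrow$ Q: Who played Tess on Touched by an Angel?

A:

Target Completion $\rightarrow$ Delloreese Patricia Early (July 6, 1931 - November 19, 2017), known professionally as Della Reese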
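| Context → | Title: William Perry (American football) - Professional career. Paragraph: In 1985, he was selected in the first round of the 1985 NFL Draft by the Chicago Bears; he had been hand-picked by coach Mike Ditka. However, defensive coordinator Buddy Ryan, who had a highly acrimonious relationship with Ditka, called Perry a "wasted draft-pick". Perry's "Refrigerator" nickname followed him into the NFL and he quickly became a favorite of the Chicago Bears fans. Teammates said Ditka decided to use Perry as a fullback when the team needed it, or as a lead blocker for star running back Walter Payton. Ditka stated the inspiration for using Perry as a fullback came to him during five-yard sprint exercises. During his rookie season, Perry rushed for two touchdowns and caught a pass for one. Perry even had the opportunity to run the ball during Super Bowl XX, as a nod to his popularity and contributions to the team's success. The first time he got the ball, he was tackled for a one-yard loss while attempting to throw his first NFL pass on a halfback option play. The second time he got the ball, he scored a touchdown (running over Patriots linebacker Larry McGrew in the process). About halfway through his rookie season, Ryan finally began to play Perry, who soon proved that he was a capable defensive lineman. His ring size is 25, while the ring size for the average adult male is between 10 and 12. Perry went on to play for ten years in the NFL, retiring after the 1994 season. In his ten years as a pro, he regularly struggled with his weight, which hampered his performance at times. He played in 138 games, recording 29.5 sacks and five fumble recoveries, which he returned for a total of 71 yards. In his offensive career he ran five yards for two touchdowns. Perry later attempted a comeback, playing an unremarkable 1996 season with the London Monarchs of the World League of American Football (later NFL Europa). Q: what team did he play for? A: |
|---|---|
| Target Completion → | the Chicago Bears |

Figure G.25: Formatted dataset example for QuAC

| Context → | Please unscramble the letters into a word, and write that word: r e!c.i p r o.c a/l = |
|---|---|
| Target Completion → | reciprocal |

Figure G.26: Formatted dataset example for Symbol Insertion

| Context → | Please unscramble the letters into a word, and write that word: taefed = |
|---|---|
| Target Completion → | defeat |

Figure G.27: Formatted dataset example for Reversed Words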
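
The word-scrambling examples above share one prompt template and differ only in how the target word is perturbed. The following is a minimal sketch, consistent with the examples shown (asinoc/casino, taefed/defeat, skicts/sticks, and the symbol-insertion example for reciprocal); the function names are hypothetical and this is not the authors' data-generation code.

```python
def cycle_letters(word: str, shift: int) -> str:
    """Rotate the letters of `word` left by `shift` positions."""
    if not word:
        return word
    shift %= len(word)
    return word[shift:] + word[:shift]


def reverse_word(word: str) -> str:
    """Return the word spelled backwards."""
    return word[::-1]


def strip_symbols(scrambled: str) -> str:
    """Undo symbol insertion by keeping only alphabetic characters."""
    return "".join(ch for ch in scrambled if ch.isalpha())


def is_anagram(scrambled: str, word: str) -> bool:
    """Check that two strings use exactly the same letters."""
    return sorted(scrambled) == sorted(word)


def make_prompt(scrambled: str) -> str:
    """Format a scrambled string with the template used in the figures."""
    return (
        "Please unscramble the letters into a word, and write that word:\n"
        f"{scrambled} ="
    )


if __name__ == "__main__":
    print(make_prompt(cycle_letters("casino", 1)))
    print(make_prompt(reverse_word("defeat")))
```

The line break between the instruction and the scrambled string is an assumption; the extracted figures do not preserve the original whitespace.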
Context $\rightarrow$ Title: The Blitz
Background: From the German point of view, March 1941 saw an improvement. The Luftwaffe flew 4,000 sorties that month, including 12 major and three heavy attacks. The electronic war intensified but the Luftwaffe flew major inland missions only on moonlit nights. Ports were easier to find and made better targets. To confuse the British, radio silence was observed until the bombs fell. X- and Y-Gerät beams were placed over false targets and switched only at the last minute. Rapid frequency changes were introduced for X-Gerät, whose wider band of frequencies and greater tactical flexibility ensured it remained effective at a time when British selective jamming was degrading the effectiveness of Y-Gerät.
Q: How many sorties were flown in March 1941?
A: 4,000
Q: When did the Luftwaffe fly inland missions?
Target Completion $\rightarrow$ only on moonlit nights

Figure G.28: Formatted dataset example for SQuADv2
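| Context → | Normal force -- In a simple case such as an object resting upon a table, the normal force on the object is equal but in opposite direction to the gravitational force applied on the object (or the weight of the object), that is, $N = mg$, where $m$ is mass, and $g$ is the gravitational field strength (about 9.81 m/s² on Earth). The normal force here prevents the object from sinking through the table and requires that the table is sturdy enough to deliver this normal force without breaking. However, it is easy to assume that the normal force and weight are action-reaction force pairs (a common mistake). In this case, the normal force and weight need to be equal in magnitude to explain why there is no upward acceleration of the object. For example, a ball that bounces upwards accelerates upwards because the normal force acting on the ball is larger in magnitude than the weight of the ball. question: is the normal force equal to the force of gravity? answer: |
|---|---|
| Target Completion → | yes |

Figure G.29: Formatted dataset example for BoolQ

| Context → | |
|---|---|
| Target Completion → | false |

Figure G.30: Formatted dataset example for CB

| Context → | The bet, which won him dinner for four, was regarding the existence and mass of the top quark, an elementary particle discovered in 1995. question: The Top Quark is the last of six flavors of quarks predicted by the standard model theory of particle physics. True or False? answer: |
|---|---|
| Target Completion → | False |

Figure G.31: Formatted dataset example for RTE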
| Context → | An outfitter provided everything needed for the safari. Before his first walking holiday, he went to a specialist outfitter to buy some boots. question: Is the word 'outfitter' used in the same way in the two sentences above? answer: |
|---|---|

Figure G.32: Formatted dataset example for WiC

| Context → | Final Exam with Answer Key. Instructions: Please carefully read the following passages. For each passage... Passage: Mr. Moncrieff visited Chester's luxurious New York apartment, thinking that it belonged to his son Edward. The result was that Mr. Moncrieff has decided to cancel Edward's allowance on the ground that he no longer requires *his* financial support. Question: In the passage above, what does the pronoun "*his*" refer to? Answer: |
|---|---|
| Target Completion → | mr. moncrieff |

Figure G.33: Formatted dataset example for WSC

| Context → | which 20th century artist? A: |
|---|---|
| Target Completion → | MARCEL DUCHAMP |
| Target Completion → | r mutt |
| Target Completion → | duchamp |
| Target Completion → | marcel duchamp |
| Target Completion → | R.Mutt |
| Target Completion → | Marcel duChamp |
| Target Completion → | Henri-Robert-Marcel Duchamp |
| Target Completion → | Marcel du Champ |
| Target Completion → | henri robert marcel duchamp |
| Target Completion → | Duchampian |
| Target Completion → | Duchamp |
| Target Completion → | duchampian |
| Target Completion → | marcel du champ |
| Target Completion → | Marcel Duchamp |
| Target Completion → | MARCEL DUCHAMP |

Figure G.34: Formatted dataset example for TriviaQA. TriviaQA allows for multiple valid completions.
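
Because TriviaQA accepts any one of several valid completions, a scorer has to compare a prediction against the whole alias set rather than a single string. The sketch below illustrates one way to do that. The normalization (lowercasing and collapsing whitespace) and the function names are assumptions made for illustration, not the evaluation procedure used in the paper.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def matches_any(prediction: str, valid_completions: Iterable[str]) -> bool:
    """Return True if the prediction equals any valid completion after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(candidate) for candidate in valid_completions)


if __name__ == "__main__":
    aliases = ["MARCEL DUCHAMP", "r mutt", "duchamp", "Marcel du Champ"]
    print(matches_any("Marcel  Duchamp", aliases))
    print(matches_any("Pablo Picasso", aliases))
```

A stricter or looser normalization (for example, stripping punctuation) would change which predictions count as correct, so the choice should be fixed before comparing results across models.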
| Context → | Q: What school did Burne Hogarth establish? A: |
|---|---|
| Target Completion → | School of Visual Arts |

Figure G.35: Formatted dataset example for WebQA
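| Context → | Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden. = |
|---|---|
| Target Completion → | In no case may they be used for commercial purposes. |

Figure G.36: Formatted dataset example for De→En. This is the format for one- and few-shot learning; for this and other language tasks, the format for zero-shot learning is "Q: What is the {language} translation of {sentence} A: {translation}."

| Context → | In no case may they be used for commercial purposes. = |
|---|---|
| Target Completion → | Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden. |

Figure G.37: Formatted dataset example for En→De

| Context → | Analysis of instar distributions of larval I. verticalis collected from a series of ponds also indicated that males were in more advanced instars than females. = |
|---|---|
| Target Completion → | L'analyse de la distribution de fréquence des stades larvaires d'I. verticalis dans une série d'étangs a également démontré que les larves mâles étaient à des stades plus avancés que les larves femelles. |

Figure G.38: Formatted dataset example for En→Fr

| Context → | L'analyse de la distribution de fréquence des stades larvaires d'I. verticalis dans une série d'étangs a également démontré que les larves mâles étaient à des stades plus avancés que les larves femelles. = |
|---|---|
| Target Completion → | Analysis of instar distributions of larval I. verticalis collected from a series of ponds also indicated that males were in more advanced instars than females. |

Figure G.39: Formatted dataset example for Fr→En

| Context → | The truth is that you want, at any price, and against the wishes of the peoples of Europe, to continue the negotiations for Turkey's accession to the European Union, despite Turkey's continuing refusal to recognise Cyprus and despite the fact that the democratic reforms are at a standstill. = |
|---|---|
| Target Completion → | Adevarul este ca va doriti, cu orice pret si impotriva dorintei europenilor, sa continuati negocierile de aderare a Turciei la Uniunea Europeana, in ciuda refuzului continuu al Turciei de a recunoaste Ciprul si in ciuda faptului ca reformele democratice au ajuns intr-un punct mort. |

Figure G.40: Formatted dataset example for En→Ro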
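
The two translation prompt formats, "{source} =" for one- and few-shot and the question form for zero-shot, can be sketched as small string-building helpers. This is an illustrative sketch only: the function names are hypothetical, and joining demonstrations with newlines is an assumption, since the extracted figures do not show how demonstrations are separated.

```python
from typing import List, Tuple


def few_shot_translation_prompt(demos: List[Tuple[str, str]], source: str) -> str:
    """Build a one-/few-shot prompt: each demonstration is 'source = target',
    followed by the new source sentence and a trailing '='."""
    lines = [f"{src} = {tgt}" for src, tgt in demos]
    lines.append(f"{source} =")
    return "\n".join(lines)  # newline separator is an assumption


def zero_shot_translation_prompt(sentence: str, language: str) -> str:
    """Build the zero-shot prompt in the question form quoted in the caption."""
    return f"Q: What is the {language} translation of {sentence} A:"


if __name__ == "__main__":
    demo = (
        "Keinesfalls dürfen diese für den kommerziellen Gebrauch verwendet werden.",
        "In no case may they be used for commercial purposes.",
    )
    print(few_shot_translation_prompt([demo], "Guten Morgen."))
    print(zero_shot_translation_prompt("Guten Morgen.", "English"))
```

In both cases the model is expected to continue the text with the translation, so the prompt deliberately ends right where the target completion would begin.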
| Context → | Adevarul este ca va doriti, cu orice pret si impotriva dorintei europenilor, sa continuati negocierile de aderare a Turciei la Uniunea Europeana, in ciuda refuzului continuu al Turciei de a recunoaste Ciprul si in ciuda faptului ca reformele democratice au ajuns intr-un punct mort. = |
|---|---|
| Target Completion → | The truth is that you want, at any price, and against the wishes of the peoples of Europe, to continue the negotiations for Turkey's accession to the European Union, despite Turkey's continuing refusal to recognise Cyprus and despite the fact that the democratic reforms are at a standstill. |

Figure G.41: Formatted dataset example for Ro→En

| Context → | Q: What is (2 * 4) * 6? A: |
|---|---|
| Target Completion → | 48 |

Figure G.42: Formatted dataset example for Arithmetic 1DC

| Context → | Q: What is 17 minus 14? A: |
|---|---|
| Target Completion → | 3 |

Figure G.43: Formatted dataset example for Arithmetic 2D-

| Context → | Q: What is 98 plus 45? A: |
|---|---|
| Target Completion → | 143 |

Figure G.44: Formatted dataset example for Arithmetic 2D+

| Context → | Q: What is 95 times 45? A: |
|---|---|
| Target Completion → | 4275 |

Figure G.45: Formatted dataset example for Arithmetic 2Dx

| Context → | Q: What is 509 minus 488? A: |
|---|---|
| Target Completion → | 21 |

Figure G.46: Formatted dataset example for Arithmetic 3D-

| Context → | Q: What is 556 plus 497? A: |
|---|---|
| Target Completion → | 1053 |

Figure G.47: Formatted dataset example for Arithmetic 3D+

| Context → | Q: What is 6209 minus 3365? A: |
|---|---|
| Target Completion → | 2844 |

Figure G.48: Formatted dataset example for Arithmetic 4D-

Figure G.49: Formatted dataset example for Arithmetic 4D+

| Context → | Q: What is 40649 minus 78746? A: |
|---|---|
| Target Completion → | -38097 |

Figure G.50: Formatted dataset example for Arithmetic 5D-

Figure G.51: Formatted dataset example for Arithmetic 5D+
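
The arithmetic examples all follow one question template with the operation spelled out as a word. As a rough sketch, a context and target pair in that style could be produced as below. The helper name and the single-space layout between the question and "A:" are assumptions, and this is not the authors' generation code.

```python
import operator

# Map the operation words used in the prompts to Python operators.
OPERATIONS = {
    "plus": operator.add,
    "minus": operator.sub,
    "times": operator.mul,
}


def arithmetic_example(a: int, op_word: str, b: int) -> tuple:
    """Return (context, target) for a two-operand arithmetic question."""
    if op_word not in OPERATIONS:
        raise ValueError(f"unsupported operation: {op_word}")
    context = f"Q: What is {a} {op_word} {b}? A:"
    target = str(OPERATIONS[op_word](a, b))
    return context, target


if __name__ == "__main__":
    for a, op_word, b in [(98, "plus", 45), (95, "times", 45), (40649, "minus", 78746)]:
        context, target = arithmetic_example(a, op_word, b)
        print(context, target)
```

Subtraction can yield a negative target, as in the five-digit example above, so the target is formatted with Python's default integer-to-string conversion, which includes the leading minus sign.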
H Results on All Tasks for All Model Sizes

Table H.1: Scores for every task, setting and model that we investigate in this paper.

| Name | Metric | Split | Fine-tune SOTA | K | Zero-Shot | One-Shot | Few-Shot | 175B (test server) |
|---|---|---|---|---|---|---|---|---|
| HellaSwag | acc | dev | 85.6 | 20 | 33.7 43.6 51.0 54.7 62.8 67.4 70.9 78.9 | 33.0 42.9 50.5 53.5 61.9 66.5 70.0 78.1 | 33.5 43.1 51.3 54.9 62.9 67.3 71.3 79.3 | |
| LAMBADA | acc | test | 68.0 | 15 | 42.7 54.3 60.4 63.6 67.1 70.3 72.5 76.2 | 22.0 47.1 52.6 58.3 61.1 65.4 69.0 72.5 | 22.0 40.4 63.2 57.0 78.1 79.1 81.3 86.4 | |
| LAMBADA | ppl | test | 8.63 | 15 | 18.6 9.09 6.53 5.44 4.60 4.00 3.56 3.00 | 165.0 11.6 8.29 6.46 5.53 4.61 4.06 3.35 | 165.0 27.6 6.63 7.45 2.89 2.56 2.56 1.92 | |
| StoryCloze | acc | test | 91.8 | 70 | 63.3 68.5 72.4 73.4 77.2 77.7 79.5 83.2 | 62.3 68.7 72.3 74.2 77.3 78.7 79.7 84.7 | 62.3 70.2 73.9 76.1 80.2 81.2 83.0 87.7 | |
| NQs | acc | test | 44.5 | 64 | 0.64 1.75 2.71 4.40 6.01 5.79 7.84 14.6 | 1.19 3.07 4.79 5.43 8.73 9.78 13.7 23.0 | 1.72 4.46 7.89 9.72 13.2 17.0 21.0 29.9 | |
| TriviaQA | acc | dev | 68.0 | 64 | 4.15 7.61 14.0 19.7 31.3 38.7 41.8 64.3 | 4.19 | 6.96 16.3 26.5 32.1 42.3 51.6 57.5 71.2 | 71.2 |
| WebQs | acc | test | 45.5 | 64 | 1.77 3.20 4.33 4.63 7.92 7.73 8.22 14.4 | 2.56 6.20 8.51 9.15 14.5 15.1 19.0 25.3 | 5.46 12.6 15.9 19.6 24.8 27.7 33.5 41.5 | |
| Ro→En 16 | BLEU-mb 测试集 | 39.9 | 64 | 2.08 | 2.71 3.09 3.15 16.3 8.34 20.2 19.9 | 0.55 0.65 | 15.4 23.0 26.3 30.6 33.2 35.6 38.6 15.9 23.6 26.8 31.3 34.2 36.7 40.0 | 1.40 1.25 | 520.7 25.8 29.2 33.1 34.8 37.0 39.5 21.3 26.6 30.1 34.3 36.2 38.4 41.3 | |||
| Ro→En 16 En→Ro 16 | BLEU-mb 测试集 | 38.5 | 64 64 | 2.39 2.14 | 3.08 3.49 3.56 16.8 8.75 20.8 20.9 2.65 2.53 2.50 3.46 4.24 5.32 14.1 | 0.35 | 3.30 7.89 8.72 13.2 15.1 17.3 20.6 | 5.90 9.33 10.7 14.3 16.3 18.0 21.0 | ||||
| En→Ro 16 | BLEU-sb 测试集 | 64 | 2.61 | 3.11 3.07 3.09 4.26 5.31 6.43 18.0 | 0.55 | 1.25 | ||||||
| Fr→En 14 | BLEU-mb 测试集 | 35.0 | 64 | 1.81 | 2.53 3.47 3.13 20.6 15.1 21.8 21.2 | 3.90 9.15 10.3 15.7 18.2 20.8 24.9 15.9 23.7 26.3 29.0 30.5 30.2 33.7 | 1.64 | 7.40 10.9 12.9 17.2 19.6 21.8 25.8 825.5 28.5 31.1 33.7 34.9 36.6 39.2 | ||||
| Fr→En 14 | BLEU-sb 测试集 | 64 | 2.29 | 2.99 3.90 3.60 21.2 15.5 22.4 21.9 | 1.28 | 4.98 | ||||||
| En→Fr 14 | BLEU-mb 测试集 | 45.6 | 64 | 1.74 | 2.16 2.73 2.15 15.1 8.82 12.0 25.2 | 1.50 | 16.3 24.4 27.0 30.0 31.6 31.4 35.6 | 5.30 | 26.2 29.5 32.2 35.1 36.4 38.3 41.4 | |||
| En→Fr 14 | BLEU-sb 测试集 | 45.9 | 64 | 2.44 | 2.75 3.54 2.82 19.3 11.4 15.3 31.3 | 0.49 0.81 | 8.00 14.8 10.0 18.2 15.9 20.3 23.3 24.9 28.3 | 4.08 | 14.5 19.3 21.5 24.9 27.3 29.5 32.6 | |||
| En→Fr 14 | BLEU-mb 测试集 | 40.2 | 64 | 2.06 | 19.3 24.7 28.3 30.1 34.1 | 5.31 18.0 23.6 | 26.1 30.3 33.3 35.5 39.9 | |||||
| De→En 16 | BLEU-sb 测试集 | 64 | 2.39 | 2.87 3.41 3.63 21.5 17.3 23.0 27.2 | 0.83 | 16.2 22.5 24.7 28.2 30.7 33.0 30.4 | 3.25 | 22.7 26.2 29.2 32.7 34.8 37.3 40.6 | ||||
| De→En 16 | BLEU-mb 测试集 | 41.2 | 64 | 1.70 | 3.27 3.85 4.04 22.5 18.2 24.4 28.6 2.27 2.31 2.43 12.9 8.66 10.4 24.6 | 0.93 0.50 | 7.00 12.9 17.1 23.4 25.8 29.2 31.9 34.5 32.1 13.1 18.3 20.9 22.5 26.2 | 3.60 | 23.8 27.5 30.5 34.1 36.5 39.1 43.0 | |||
| En→De 16 En→De 16 | BLEU-sb 测试集 | 41.2 | 64 | 2.09 | 2.65 2.75 2.92 13.7 9.36 11.0 25.3 | 0.54 | 7.40 13.4 13.4 18.8 21.7 23.3 27.3 | 3.42 3.78 | 12.3 15.4 17.1 20.9 23.0 26.6 29.7 12.9 16.1 17.7 21.7 24.1 27.7 30.9 | |||
| Winograd | acc | test | 93.8 | 7 | 66.3 72.9 74.7 76.9 82.4 85.7 87.9 88.3 | 63.4 68.5 72.9 76.9 82.4 84.6 86.1 89.7 | | |
| Winogrande | acc | dev | 84.6 | 50 | 52.0 52.1 57.4 58.7 62.3 64.5 67.9 70.2 | 51.3 53.0 58.3 59.1 61.7 65.8 66.9 73.2 | 51.3 52.6 57.5 59.1 62.6 67.4 70.0 77.7 | |
| PIQA | acc | dev | 77.1 | 50 | 64.6 70.2 72.9 75.1 75.6 78.0 78.5 81.0 | 64.3 69.3 71.8 74.4 74.3 76.3 77.8 80.5 | 64.3 69.4 72.0 74.3 75.4 77.8 79.9 82.3 | 82.8 |
| ARC (Challenge) | acc | test | 78.5 | 50 | 26.6 | 25.5 | 25.5 28.4 32.3 36.7 39.5 43.7 44.8 51.5 | |
| ARC (Easy) | acc | test | 92.0 | 50 | 43.6 46.5 53.0 53.8 58.2 60.2 63.8 68.8 | 42.7 48.2 54.6 55.9 60.3 62.6 66.8 71.2 | 42.7 | |
| OpenBookQA | acc | test | 87.2 | 100 | 35.6 43.2 45.2 46.8 53.0 50.4 55.6 57.6 | 37.0 39.8 46.2 46.4 53.4 53.0 55.8 58.8 | 37.0 43.6 48.0 50.6 55.6 55.2 60.8 65.4 | |
| Quac | F1 | dev | 74.4 | 5 | 21.2 26.8 31.0 30.1 34.7 36.1 38.4 41.5 | 21.1 26.9 31.9 32.3 37.4 39.0 40.6 43.4 | 21.6 27.6 32.9 34.2 38.2 39.9 40.9 44.3 | |
| RACE-h | acc | test | 90.0 | 10 | 35.2 | 34.3 37.7 40.0 42.0 43.8 44.3 44.6 45.9 | 34.3 37.0 40.4 41.4 42.3 44.7 45.1 46.8 | |
| RACE-m | acc | test | 93.1 | 10 | 42.1 47.2 52.1 52.3 54.7 54.4 56.7 58.4 | 42.3 47.3 51.7 55.2 56.1 54.7 56.9 57.4 | 42.3 47.0 52.7 53.0 55.6 55.4 58.1 58.1 | |
| SQuADv2 | EM | dev | 90.7 | 16 | 22.6 32.8 33.9 43.1 43.6 45.4 49.0 52.6 | 25.1 37.5 37.9 47.9 47.9 51.1 56.0 60.1 | 27.5 40.5 39.2 53.5 50.0 56.6 62.6 64.9 | |
| SQuADv2 | F1 | dev | 93.0 | 16 | 28.3 40.2 41.4 50.3 51.0 52.7 56.3 59.5 | 30.1 43.6 44.1 54.0 54.1 57.1 61.8 65.4 | 32.1 45.5 44.9 58.7 55.9 62.1 67.7 69.8 | |
| CoQA | F1 开发集 | 90.7 | 5 | 34.5 | 55.0 61.8 65.3 71.1 72.8 76.3 81.5 | 30.6 | 18.1 20.9 23.0 26.4 27.3 29.2 34.3 53.6 53.6 48.2 57.1 33.9 55.4 64.3 39.8 45.6 64.0 66.0 74.0 76.0 82.0 86.0 87.0 47.3 49.5 49.5 54.9 54.9 56.3 70.4 50.3 50.3 49.2 49.4 50.3 50.0 48.6 58.7 60.6 62.5 66.3 60.6 66.3 69.2 9.65 12.3 13.6 14.3 18.4 24.2 27.6 59.7 60.4 59.9 60.0 64.5 71.4 72.9 77.0 80.7 83.0 85.9 88.0 88.8 90.2 77.8 81.6 83.9 86.8 88.8 89.7 91.2 33.7 33.2 32.7 32.7 33.9 33.9 33.9 32.6 33.0 33.9 34.1 33.1 32.5 35.1 0.95 1.45 0.00 0.10 0.30 0.45 0.95 15.4 65.5 0.15 0.25 0.30 0.55 1.60 6.15 78.7 10.00 0.10 0.00 0.00 0.10 0.80 14.0 0.00 0.00 0.00 0.05 0.00 0.50 14.0 10.00 0.00 0.00 0.00 0.00 0.05 3.45 10.00 0.00 0.00 0.00 0.00 0.05 3.75 2.80 2.85 3.65 6.45 9.15 8.20 14.3 4.36 5.68 6.46 6.25 9.41 15.1 21.7 0.61 1.12 2.62 4.70 4.77 6.97 10.2 14.6 25.9 | 52.1 61.6 66.1 71.8 75.1 77.9 84.0 52.6 61.7 60.4 63.7 68.4 68.7 69.0 76.7 37.5 45.7 28.5 44.6 52.5 54.4 55.1 56.7 57.8 61.2 59.7 64.3 68.9 32.1 31.6 31.9 34.6 30.6 31.6 32.7 32.0 2.00 0.55 3.15 4.00 12.1 19.6 73.0 99.6 1.95 3.85 11.5 44.6 86.4 1.27 1.60 2.72 3.72 8.62 1.18 1.67 3.46 6.62 45.4 80 100 200 500 0 0 0 100 100 200 | 31.1 12.9 43.1 42.9 26.1 52.3 49.8 58.7 6.09 45.0 69.8 70.7 35.73 35.0 1.15 0.15 0.05 0.00 0.00 0.00 0.00 1.35 4.63 0.50 1.94 | 52.0 62.7 66.8 73.2 77.3 79.9 85.0 918.7 24.0 25.6 29.7 29.7 32.3 36.5 60.6 62.0 64.1 70.3 70.0 70.2 77.5 58.9 53.6 69.6 67.9 60.7 66.1 82.1 40.4 32.6 48.3 45.7 44.6 46.0 57.2 67.0 64.0 72.0 77.0 83.0 83.0 86.0 92.0 348.4 46.9 50.9 56.3 49.5 60.6 72.9 55.0 53.0 53.0 51.6 53.1 51.1 55.3 60.6 54.8 49.0 62.5 67.3 75.0 75.0 11.8 16.8 20.8 24.7 23.8 25.0 32.5 0 55.9 64.2 65.4 69.5 66.4 69.3 74.8 77.2 81.3 83.1 86.6 87.9 88.9 89.0 77.9 82.1 84.0 87.5 88.8 89.8 90.1 50.2 56.2 56.8 60.0 64.3 63.6 66.9 73.2 32.1 32.5 30.9 32.5 33.5 33.1 33.3 36.8 33.8 32.1 31.4 32.6 33.3 32.6 34.0 0 34.4 35.1 36.0 32.7 33.9 34.5 40.2 2.00 4.10 3.50 4.50 8.90 11.9 55.5 100.0 1.45 2.25 2.70 7.35 13.6 52.4 98.9 
50.45 0.30 0.55 0.75 0.90 8.40 80.4 50.10 0.15 0.35 0.65 1.05 9.20 94.2 0.05 0.05 0.00 0.15 0.15 0.40 25.5 0 0.05 0.00 0.00 0.10 0.05 0.40 26.8 0.00 0.00 0.00 0.00 0.00 0.05 9.30 0 0.00 0.00 0.00 0.00 0.00 0.00 9.90 52.90 2.70 2.85 4.25 6.10 7.05 29.2 1.70 2.15 3.90 5.75 6.20 7.60 9.95 21.3 39.27 10.7 14.5 16.7 21.9 27.7 37.9 1.27 2.13 3.05 3.81 5.49 8.38 15.1 44.80 7.59 9.87 12.6 18.9 25.6 39.7 0.11 0.28 2.19 4.18 6.61 11.0 27.3 67.2 0.00 0.05 0.00 0.17 0.24 0.30 0.42 0.44 | 76.4 75.6 52.0 92.0 69.0 49.4 80.1 30.5 75.4 90.2 91.1 71.8 | |
| DROP BoolQ CB CB Copa RTE WiC WSC MultiRC MultiRC ReCoRD ReCoRD SuperGLUE ANLI R1 ANLI R2 ANLI R3 2D+ 2D- 3D+ 3D- 4D+ 4D- 5D+ 5D- 2Dx 1DC Cycled Letters Anagrams 1 Anagrams 2 Symbol Insertion | F1 dev, acc dev, acc dev, F1 dev, acc dev |

Figure H.1: All results for all SuperGLUE tasks.

Figure H.2: Results for SAT task.

Figure H.3: All results for all Winograd tasks.

Figure H.4: All results for all Arithmetic tasks.

Figure H.5: All results for all Cloze and Completion tasks.

Figure H.6: All results for all Common Sense Reasoning tasks.

Figure H.7: All results for all QA tasks.

Figure H.8: All results for all Reading Comprehension tasks.

Figure H.9: All results for all ANLI rounds.

Figure H.10: All results for all Scramble tasks.

Figure H.11: All results for all Translation tasks.
References
[ZSW+19b] Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. ArXiv, abs/1909.08593, 2019.
