Language Models are Few-Shot Learners
Abstract
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.
1 Introduction
NLP has shifted from learning task-specific representations and designing task-specific architectures to using task-agnostic pre-training and task-agnostic architectures. This shift has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, and textual entailment, among others. Even though the architecture and initial representations are now task-agnostic, a final task-specific step remains: fine-tuning on a large dataset of examples to adapt a task-agnostic model to perform a desired task.
Recent work $[\mathrm{RWC}^{+}19]$ suggested this final step may not be necessary. $[\mathrm{RWC}^{+}19]$ demonstrated that a single pretrained language model can be zero-shot transferred to perform standard NLP tasks without the need for fine-tuning on a dataset of training examples. While this work was a promising proof of concept, the best-case performance only matched some supervised baselines on a single dataset. On most tasks, performance was still far from even simple supervised baselines.
Figure 1.1: Performance on SuperGLUE increases with model size. A value of $K=32$ means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference lines (our test set results are in the appendix). The BERT-Large reference model was fine-tuned on the SuperGLUE training set (125K examples), whereas BERT++ was first fine-tuned on MultiNLI (392K examples) and SWAG (113K examples) before further fine-tuning on the SuperGLUE training set (for a total of 630K fine-tuning examples).
Performance on SuperGLUE increases with the number of examples in context. We find the difference in performance between BERT-Large and BERT++ to be roughly equivalent to the difference between GPT-3 with one example per context versus eight examples per context.
Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning.
However, $[\mathrm{RWC}^{+}19]$ also showed a potential way forward. The work observed relatively consistent log-linear trends in performance on both transfer tasks and language modeling loss across an order of magnitude of scaling. $[\mathrm{KMH}^{+}20]$ then conducted a much more rigorous study of the scaling behavior of log loss and confirmed smooth scaling trends. In this work, we empirically test whether scaling continues to improve performance by extrapolating the previously identified phenomena another two orders of magnitude. We train a 175 billion parameter autoregressive language model, which we call GPT-3, and measure its transfer learning abilities.
As part of this investigation, we also clarify and systematize the approach introduced in $[\mathrm{RWC}^{+}19]$. While $[\mathrm{RWC}^{+}19]$ describe their work as "zero-shot task transfer", they sometimes provide examples of the relevant task in the context. Due to the use of what are effectively training examples, these cases are better described as "one-shot" or "few-shot" transfer. We study these one-shot and few-shot settings in detail, comparing them with the zero-shot setting, which only uses a natural language description or invocation of the task to be performed. Our findings are summarized in Figure 1.1. We observe that one- and few-shot performance is often much higher than true zero-shot performance, leading us to suggest that language models can also be understood as meta-learners, where slow outer-loop gradient-descent-based learning is combined with fast "in-context" learning implemented within the context activations of the model.
Broadly, on NLP tasks GPT-3 achieves promising results in the zero- and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.
We additionally train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero-, one- and few-shot settings. In general, we find relatively smooth scaling for most tasks with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.
2 Approach
Our basic pre-training approach, including model, data, and training, is similar to the process described in $[\mathrm{RWC}^{+}19]$, with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to $[\mathrm{RWC}^{+}19]$, but in this work we systematically explore different settings for learning within the context:
· Fine-Tuning (FT) - updates the weights of a pre-trained model by training on thousands of supervised labels specific to the desired task. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL$^{+}$18, NK19]. We focus on task-agnostic performance, leaving fine-tuning for future work.
· Few-Shot (FS) - the model is given a few demonstrations of the task at inference time as conditioning $[\mathrm{RWC}^{+}19]$, but no weights are updated. An example typically has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion (see appendix for more details). We typically set $K$ in the range of 10 to 100, as this is how many examples can fit in the model's context window ($n_{\mathrm{ctx}}=2048$). The main advantage of few-shot is a major reduction in the need for task-specific data. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL$^{+}$16] - both involve learning based on a broad distribution of tasks and then rapidly adapting to a new task.
· One-Shot (1S) - similar to few-shot but with $K=1$.
· Zero-Shot (0S) - similar to few-shot but with a natural language description of the task instead of any examples.
The appendix includes a demonstration of the four methods using the example of translating English to French. While the few-shot results we present in this paper achieve the highest performance, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.
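As a rough, hypothetical illustration of the settings above (the delimiters, arrow notation, and function name here are our own assumptions; the exact prompt formats GPT-3 uses are given in the appendix), a few-shot prompt for the English-to-French example might be assembled as follows:

```python
def build_prompt(task_description, demonstrations, query, k):
    """Assemble an illustrative in-context prompt.

    k = 0 corresponds to zero-shot (task description plus query only),
    k = 1 to one-shot, and k of roughly 10-100 to few-shot, bounded by
    the 2048-token context window. No weights are updated; conditioning
    on this text is the only form of "learning" that takes place.
    """
    lines = [task_description]
    for source, target in demonstrations[:k]:
        lines.append(f"{source} => {target}")  # context and desired completion
    lines.append(f"{query} =>")                # final context; the model completes it
    return "\n".join(lines)

# Example usage (few-shot with K = 2 demonstrations):
prompt = build_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "plush giraffe",
    k=2,
)
```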
2.1 Model and Architectures
We use the same model and architecture as GPT-2 $[\mathrm{RWC}^{+}19]$, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. This range of model sizes allows us to test the scaling laws introduced in $[\mathrm{KMH}^{+}20]$.
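The paper defers the exact sparse pattern to the Sparse Transformer reference, so the following is only a minimal sketch of what alternating dense and locally banded causal attention masks could look like; the even/odd alternation and the window size of 256 are illustrative assumptions, not documented GPT-3 settings.

```python
import numpy as np

def causal_attention_mask(seq_len: int, layer_idx: int, window: int = 256) -> np.ndarray:
    """Boolean mask where entry (i, j) is True if query position i may attend to key j.

    Even-indexed layers use dense causal attention; odd-indexed layers restrict
    each position to a local band of the previous `window` positions, in the
    spirit of the Sparse Transformer [CGRS19].
    """
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i                  # no attending to future tokens
    if layer_idx % 2 == 0:
        return causal                       # dense layer
    return causal & ((i - j) < window)      # locally banded sparse layer
```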
More details on the sizes and architectures of our models can be found in the appendix. We partition each model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes.
2.2 Training Dataset
To create our training data, we (1) downloaded and filtered a version of Common Crawl $[\mathrm{RSR}^{+}19]$ based on similarity to a range of high-quality reference corpora, (2) performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) added known high-quality reference corpora to the training mix to augment Common Crawl and increase its diversity. These reference corpora include an expanded version of the WebText dataset $[\mathrm{RWC}^{+}19]$, collected by scraping links over a longer period of time and first described in $[\mathrm{KMH}^{+}20]$, two internet-based books corpora (Books1 and Books2), and English-language Wikipedia (details in the appendix).
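The exact filtering and deduplication pipeline is described in the appendix; as a loose sketch of document-level fuzzy deduplication, one common approach is to fingerprint each document with word n-grams and drop documents that are too similar (by Jaccard similarity) to one already kept. The shingle size and threshold below are illustrative assumptions, and a run at Common Crawl scale would use an approximate scheme such as MinHash/LSH rather than this quadratic scan.

```python
def shingles(text: str, n: int = 5) -> set:
    """Fingerprint a document as the set of its word n-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def fuzzy_dedup(documents, threshold: float = 0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept_docs, kept_fingerprints = [], []
    for doc in documents:
        fp = shingles(doc)
        is_duplicate = any(
            len(fp & other) / max(len(fp | other), 1) >= threshold  # Jaccard similarity
            for other in kept_fingerprints
        )
        if not is_duplicate:
            kept_docs.append(doc)
            kept_fingerprints.append(fp)
    return kept_docs
```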
Table 3.1: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while achieving respectable performance on two difficult completion prediction datasets. $^a$[Tur20] $^b$[RWC$^{+}$19] $^c$[LDL19] $^d$[LCH$^{+}$20]
Setting | LAMBADA (acc) | LAMBADA (ppl) | StoryCloze (acc) | HellaSwag (acc) |
---|---|---|---|---|
SOTA | 68.0$^a$ | 8.63$^b$ | 91.8$^c$ | 85.6$^d$ |
GPT-3 Zero-Shot | 76.2 | 3.00 | 83.2 | 78.9 |
GPT-3 One-Shot | 72.5 | 3.35 | 84.7 | 78.1 |
GPT-3 Few-Shot | 86.4 | 1.92 | 87.7 | 79.3 |
2.3 Training Process
As found in [KMH$^{+}$20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table A.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster. Details of the training process and hyperparameter settings are described in the appendix.
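For reference, the gradient noise scale of [MKAT18] used above to guide the batch size is, in its simple form, the ratio of the trace of the per-example gradient covariance to the squared norm of the true gradient; it estimates the batch size beyond which data parallelism yields diminishing returns. This restates [MKAT18]; the exact estimator used during GPT-3 training is not spelled out here.

```latex
% Simple gradient noise scale from [MKAT18]: G is the true (full-batch) gradient
% and \Sigma the per-example gradient covariance; B_simple approximates the
% critical batch size.
B_{\mathrm{simple}} \;=\; \frac{\operatorname{tr}(\Sigma)}{\lVert G \rVert^{2}}
```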
2.4 Evaluation
For few-shot learning, we evaluate each example in the evaluation set by randomly drawing $K$ examples from that task's training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and StoryCloze there is no supervised training set available, so we draw conditioning examples from the development set and evaluate on the test set.
For some tasks we use a natural language prompt in addition to (or for $K=0$ , instead of) demonstrations. Similar to $[\mathrm{RSR}^{+}19]$ we also sometimes change the formatting of answers. See the appendix for per-task examples.
On tasks with free-form completion, we use beam search with the same parameters as $[\mathrm{RSR}^{+}19]$: a beam width of 4 and a length penalty of $\alpha=0.6$.
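The length-penalty convention is not restated here; assuming it matches the formulation commonly paired with a beam width of 4 and $\alpha=0.6$ (the Wu et al.-style normalization used in that line of work), candidate completions would be ranked by length-normalized log-probability as sketched below.

```latex
% Assumed length-normalized beam-search score (an assumption, not stated in the text):
% Y is a candidate completion of length |Y| given input X.
\mathrm{score}(Y) \;=\; \frac{\log P(Y \mid X)}{\mathrm{lp}(Y)},
\qquad
\mathrm{lp}(Y) \;=\; \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}},
\qquad \alpha = 0.6
```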
Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set.
3 Results
3.1 Language Modeling, Cloze, and Completion Tasks
We test GPT-3's performance on the traditional task of language modeling, as well as related tasks. We calculate zero-shot perplexity on the Penn Tree Bank (PTB) $[\mathrm{MKM^{+}94}]$ dataset measured in $[\mathrm{RWC}^{+}19]$. We omit the 4 Wikipedia-related tasks and the one-billion-word benchmark due to a high fraction of these datasets being contained in our training set. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points.
The LAMBADA dataset $[\mathrm{PKL}^{+}16]$ requires the model to predict the last word of a paragraph. Although $[\mathrm{BHT^{+}20}]$ suggested scaling language models is yielding diminishing returns on this benchmark, we find that zero-shot GPT-3 achieves a substantive gain of 8% over the previous state-of-the-art. For the few-shot setting, we use a fill-in-the-blank format to encourage the language model to only generate one word (Alice was friends with Bob. Alice went to visit her friend, ____. → Bob). With this format, GPT-3 achieves an increase of over 18% from the previous state-of-the-art, and performance improves smoothly with model size. However, the fill-in-the-blank method is not effective in the one-shot setting, where it always performs worse than the zero-shot setting, perhaps because all models require several examples to recognize the pattern. An analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data - however, analysis performed in Section 4 suggests negligible impact on performance.
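Concretely, a few-shot fill-in-the-blank LAMBADA prompt in this style might look like the sketch below; the second passage and the exact blank/arrow markers are illustrative assumptions (the precise formatting is given in the appendix).

```python
# Hypothetical few-shot, fill-in-the-blank LAMBADA prompt: each demonstration
# shows a passage whose final word fills the blank, and the model is asked to
# generate only the single missing word for the last passage.
prompt = (
    "Alice was friends with Bob. Alice went to visit her friend, ____. -> Bob\n"
    "George bought some baseball equipment, a ball, a glove, and a ____. ->"
)
```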
Setting | NaturalQs | WebQS | TriviaQA |
---|---|---|---|
RAG (Fine-tuned, Open-Domain) [LPP+20] | 44.5 | 45.5 | 68.0 |
T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20] | 36.6 | 44.7 | 60.5 |
T5-11B (Fine-tuned, Closed-Book) | 34.5 | 37.4 | 50.1 |
GPT-3 Zero-Shot | 14.6 | 14.4 | 64.3 |
GPT-3 One-Shot | 23.0 | 25.3 | 68.0 |
GPT-3 Few-Shot | 29.9 | 41.5 | 71.2 |
Table 3.2: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed-book and open-domain settings. The TriviaQA few-shot result is evaluated on the wiki split test server.

Table 3.3: GPT-3 results on a selection of QA / RC tasks. CoQA and DROP are F1 while ARC reports accuracy. See the appendix for additional experiments. $^a$[KKS$^{+}$20] $^b$[KKS$^{+}$20] $^c$[JZC$^{+}$19] $^d$[JIN20]
Setting | ARC (Easy) | ARC (Challenge) | CoQA | DROP |
---|---|---|---|---|
Fine-tuned SOTA | 92.0$^a$ | 78.5$^b$ | 90.7$^c$ | 89.1$^d$ |
GPT-3 Zero-Shot | 68.8 | 51.4 | 81.5 | 23.6 |
GPT-3 One-Shot | 71.2 | 53.2 | 84.0 | 34.3 |
GPT-3 Few-Shot | 70.1 | 51.5 | 85.0 | 36.5 |
The HellaSwag dataset [ZHB$^{+}$19] involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans. GPT-3 outperforms a fine-tuned 1.5B parameter language model [ZHR$^{+}$19] but is still a fair amount lower than the overall SOTA achieved by the fine-tuned multi-task model ALUM.
The StoryCloze 2016 dataset $[\mathrm{MCH}^{+}16]$ involves selecting the correct ending sentence for five-sentence-long stories. Here GPT-3 improves over previous zero-shot results by roughly 10% but is overall still 4.1% lower than the fine-tuned SOTA using a BERT-based model [LDL19].
3.2 Question Answering
In this section we measure GPT-3's ability to handle a variety of question answering tasks. First, we look at datasets involving answering questions about broad factual knowledge. We evaluate in the "closed-book" setting (meaning no conditioning information/articles) as suggested by [RRS20]. On TriviaQA [JCWZ17], GPT-3 zero-shot already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents $[\mathrm{LPP}^{+}20]$. GPT-3's few-shot result further improves performance another 3.2% beyond this. On Natural Questions (NQs) $[\mathrm{KPR}^{+}19]$, GPT-3 underperforms a fine-tuned T5-11B+SSM. The questions in NQs tend towards fine-grained Wikipedia knowledge, which could be testing the limits of GPT-3's capacity and broad pretraining distribution.
ARC $[\mathrm{CCE^{+}18}]$ is a commonsense reasoning dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the "Challenge" version of the dataset, which has been filtered to questions that simple statistical or information retrieval methods are unable to correctly answer, GPT-3 approaches the performance of a fine-tuned RoBERTa baseline $[\mathrm{KKS}^{+}20]$. On the "Easy" version of the dataset, GPT-3 slightly exceeds the same fine-tuned RoBERTa baseline $[\mathrm{KKS^{+}20}]$. However, both of these results are still much worse than the overall SOTAs achieved by $[\mathrm{KKS}^{+}20]$.
Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating into English, reflecting its strength as an English LM. We report BLEU scores on the WMT'14 $\mathrm{Fr}\leftrightarrow\mathrm{En}$, WMT'16 $\mathrm{De}\leftrightarrow\mathrm{En}$, and WMT'16 $\mathrm{Ro}\leftrightarrow\mathrm{En}$ datasets as measured by multi-bleu.perl with XLM's tokenization in order to compare most closely with prior unsupervised NMT work. SacreBLEU [Pos18] results are reported in the appendix. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA with relative confidence. $^a$[EOAG18] $^b$[DHKH14] $^c$[WXH$^{+}$18] $^d$[oR16] $^e$[LGG$^{+}$20] $^f$[SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.2.20]
Setting | En→Fr | Fr→En | En→De | De→En | En→Ro | Ro→En |
---|---|---|---|---|---|---|
SOTA (Supervised) | 45.6$^a$ | 35.0$^b$ | 41.2$^c$ | 40.2$^d$ | 38.5$^e$ | 39.9$^e$ |
XLM [LC19] | 33.4 | 33.3 | 26.4 | 34.3 | 33.3 | 31.8 |
MASS [STQ+19] | 37.5 | 34.9 | 28.3 | 35.2 | 35.2 | 33.1 |
mBART [LGG+20] | – | – | 29.8 | 34.0 | 35.0 | 30.5 |
GPT-3 Zero-Shot | 25.2 | 21.2 | 24.6 | 27.2 | 14.1 | 19.9 |
GPT-3 One-Shot | 28.3 | 33.7 | 26.2 | 30.4 | 20.6 | 38.6 |
GPT-3 Few-Shot | 32.6 | 39.2 | 29.7 | 40.6 | 21.0 | 39.5 |
Finally, we evaluate GPT-3 on two reading comprehension datasets. Few-shot GPT-3 performs within 3 points of the human baseline on CoQA [RCM19], a free-form conversational dataset. On DROP $[\mathrm{DWD}^{+}19]$, a dataset testing discrete reasoning and numeracy, few-shot GPT-3 outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems $[\mathrm{RLL}^{+}19]$.
3.3 Translation
In collecting training data for GPT-3, we used the unfiltered distribution of languages reflected in internet text datasets (primarily Common Crawl). As a result, although GPT-3's training data primarily consists of English (93% by word count), it also includes 7% non-English content (full list at GPT-3 GitHub). Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together. Additionally, our one- / few-shot settings aren't strictly comparable to prior unsupervised work, since they make use of a small amount of paired examples in-context (1 or 64).
Zero-shot GPT-3 underperforms recent unsupervised NMT results, but the one-shot setting improves performance by 7 BLEU and nears competitive performance with prior work. Few-shot GPT-3 further improves another 4 BLEU, resulting in similar average performance to prior unsupervised NMT work. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2, which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few-shot GPT-3 outperforms the best supervised result we could find, but due to our unfamiliarity with the literature and the appearance that these are uncompetitive benchmarks, we do not suspect those results represent a true SOTA. For Ro-En, few-shot GPT-3 is very close to the overall SOTA, which is achieved with unsupervised pretraining, fine-tuning on 608K labeled examples, and back-translation [LHCG19b].
3.4 SuperGLUE
The SuperGLUE benchmark is a standardized collection of datasets $[\mathrm{WPN^{+}19}]$. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated. We sweep values of $K$ up to 32 and note that the few-shot SuperGLUE score steadily improves with both model size and the number of examples in the context, showing increasing benefits from in-context learning (Figure 1.1).
Table 3.5: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient updates.
Setting | SuperGLUE Average | BoolQ Acc | CB Acc | CB F1 | COPA Acc | RTE Acc |
---|---|---|---|---|---|---|
Fine-tuned SOTA | 89.0 | 91.0 | 96.9 | 93.9 | 94.8 | 92.5 |
Fine-tuned BERT-Large | 69.0 | 77.4 | 83.6 | 75.7 | 70.6 | 71.7 |
GPT-3 Few-Shot | 71.8 | 76.4 | 75.6 | 52.0 | 92.0 | 69.0 |

Setting | WiC Acc | WSC Acc | MultiRC Acc | MultiRC F1a | ReCoRD Acc | ReCoRD F1 |
---|---|---|---|---|---|---|
Fine-tuned SOTA | 76.1 | 93.8 | 62.3 | 88.2 | 92.5 | 93.3 |
Fine-tuned BERT-Large | 69.6 | 64.6 | 24.1 | 70.0 | 71.3 | 72.0 |
GPT-3 Few-Shot | 49.4 | 80.1 | 30.5 | 75.4 | 90.2 | 91.1 |
We observe a wide range in GPT-3's performance across tasks. On COPA and ReCoRD, GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. WiC is a notable weak spot with few-shot performance equivalent to random chance. We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon (which we saw in other experiments we ran, contained in the Additional Materials) - GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks, and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.
4 Measuring and Preventing Memorization Of Benchmarks
The dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated. For each benchmark, we produce a "clean" version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. In most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance. We provide full details of the methodology and analysis on the most problematic tasks in the appendix.
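As a rough sketch of the overlap test described above (whitespace tokenization and exact set matching are simplifying assumptions here; the full methodology is in the appendix):

```python
def ngrams(tokens, n=13):
    """All n-grams of a token sequence; a sequence shorter than n counts as a
    single 'gram' so that short examples can still be matched in full."""
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_potentially_leaked(example_text, training_ngrams, n=13):
    """Flag a benchmark example whose 13-grams overlap the pretraining set."""
    return bool(ngrams(example_text.split(), n) & training_ngrams)

def clean_benchmark(examples, training_ngrams):
    """'Clean' benchmark version: drop all potentially leaked examples."""
    return [ex for ex in examples if not is_potentially_leaked(ex, training_ngrams)]
```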
5 Limitations
On text synthesis, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. Our release repository contains uncurated unconditional samples.
Our experiments do not include any bidirectional architectures or other training objectives such as denoising. Our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality, such as fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content (ANLI, WiC), or tasks that require re-reading or carefully considering a long passage and then generating a very short answer (QuAC, RACE).
Our objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world $[\mathrm{BHT^{+}20}]$. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans [ZSW$^{+}$19], fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world [CLY$^{+}$19].
GPT-3's size makes it challenging to deploy. Task-specific distillation [HVD15] merits exploration at this new scale.
6 Related Work
Several efforts have studied the effect of scale on language model performance. [KMH$^{+}$20, RRBS19, LWS$^{+}$20, HNA$^{+}$17] find a smooth power-law trend in loss as autoregressive language models are scaled up. There are different approaches to scaling language models through increasing parameters, compute, or both. Our work is most aligned with methods that have increased the size of transformers by increasing parameters and FLOPS-per-token roughly in proportion, with a parameter count of 213 million $[\mathrm{VSP^{+}17}]$ in the original paper, then 300 million [DCLT18], 1.5 billion $[\mathrm{RWC}^{+}19]$, 8 billion $[\mathrm{SPP^{+}19}]$, 11 billion $[\mathrm{RSR}^{+}19]$, and most recently 17 billion [Tur20]. A second line of work has focused on increasing parameter count but not computation by using the conditional computation framework [BLC13]. Specifically, the mixture-of-experts method $[\mathrm{SMM}^{+}17]$ has produced 100 billion parameter models and 50 billion parameter translation models [AJF19]. One way to decrease the computational cost of our models would be to draw from work such as ALBERT [LCG$^{+}$19] or general [HVD15] or task-specific [SDCW19, JYS$^{+}$19, KR16] approaches to distillation. Lastly, a third approach increases computation without increasing parameters through methods like adaptive computation time [Gra16] and the universal transformer [DGV$^{+}$18].
There are many approaches to building multi-task models. Giving task instructions in natural language was first formalized in a supervised setting with [MKXS18] and used in $[\mathrm{RWC}^{+}19]$ for in-context learning and in $[\mathrm{RSR}^{+}19]$ for multi-task fine-tuning. Multi-task learning [Car97] has shown some promising initial results [LGH$^{+}$15, LCR19], and multi-stage fine-tuning has produced SOTA or SOTA-competitive results [PFB18, KKS$^{+}$20]. Meta-learning was used in language models in $[\mathrm{RWC}^{+}19]$, though with limited results and no systematic study. Other uses of meta-learning include matching networks $[\mathrm{VBL}^{+}16]$, RL2 $[\mathrm{DSC}^{+}16]$, learning to optimize [RL16, ADG$^{+}$16, LM17], and MAML [FAL17]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL2. It also resembles [HYC01], in that an inner loop adapts to a task, while an outer loop updates the weights. Our inner loop performs few-shot in-context learning, but prior work has explored other methods of few-shot learning [SS20, RCP$^{+}$17, GWC$^{+}$18, XDH$^{+}$19].
Finally, algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15], encoder-decoder architectures [LLG$^{+}$19, RSR$^{+}$19], random permutations during training $[\mathrm{YDY^{+}19}]$, architectures for sampling efficiency $[\mathrm{DYY^{+}19}]$, data and training improvements $[\mathrm{LOG^{+}19}]$, and embedding parameter efficiency $[\mathrm{LCG^{+}19}]$. It is likely that incorporating some of these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting.
7 Conclusion
We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.
Funding Disclosures
This work was funded by OpenAI. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft.
Broader Impacts
Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models.
Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 7.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 7.2. We also briefly discuss issues of energy efficiency (Section 7.3).
7.1 Misuse of Language Models
Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.
7.1.1 Potential Misuse Applications
Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing, and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high-quality text. Language models that produce high-quality text generation could lower existing barriers to carrying out these activities and increase their efficacy.
The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text represents a concerning milestone in this regard.
7.1.2 Threat Actor Analysis
Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to "advanced persistent threats" (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas $[\mathrm{SBC^{+}19}]$.
To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.
Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or "controlling” the content of language models are still at a very early stage.
7.1.3 External Incentive Structures
Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.
Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-$k$ truncation), they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be.
Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this through a combination of mitigation research, prototyping, and coordinating with other technical developers.
7.2 Fairness, Bias, and Representation
Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals, amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3's limitations when it comes to fairness, bias, and representation.
Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model's biases even within the studied categories.
Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension.