[论文翻译]大语言模型是少样本学习者


原文地址:https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf


Language Models are Few-Shot Learners

大语言模型是少样本学习者

Abstract

摘要

We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, $10\mathrm{x}$ more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks. We also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora.

我们证明了扩大语言模型的规模可以显著提升任务无关的少样本性能,有时甚至可以与之前最先进的微调方法相媲美。具体来说,我们训练了 GPT-3,一个具有 1750 亿参数的自回归语言模型,比以往任何非稀疏语言模型大 10 倍,并测试其在少样本设置中的性能。对于所有任务,GPT-3 在没有任何梯度更新或微调的情况下应用,任务和少样本演示仅通过与模型的文本交互来指定。GPT-3 在许多自然语言处理数据集上表现出色,包括翻译、问答和完形填空任务。我们还指出了一些 GPT-3 的少样本学习仍然存在困难的数据集,以及一些因在大型网络语料库上训练而使 GPT-3 面临方法论问题的数据集。

1 Introduction

1 引言

NLP has shifted from learning task-specific representations and designing task-specific architectures to using task-agnostic pre-training and task-agnostic architectures. This shift has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, among others. Even though the architecture and initial representations are now task-agnostic, a final task-specific step remains: fine-tuning on a large dataset of examples to adapt a task-agnostic model to perform a desired task.

自然语言处理 (NLP) 已经从学习特定任务的表示和设计特定任务的架构转变为使用与任务无关的预训练和与任务无关的架构。这一转变在许多具有挑战性的 NLP 任务上取得了实质性进展,例如阅读理解、问答、文本蕴含等。尽管现在的架构和初始表示是与任务无关的,但最终仍需要一个特定任务的步骤:在大量示例数据集上进行微调,以使与任务无关的模型适应执行所需的任务。

Recent work $[\mathrm{RWC}^{+}19]$ suggested this final step may not be necessary. $[\mathrm{RWC}^{+}19]$ demonstrated that a single pretrained language model can be zero-shot transferred to perform standard NLP tasks

最近的工作 $[\mathrm{RWC}^{+}19]$ 表明这最后一步可能并非必需。$[\mathrm{RWC}^{+}19]$ 证明了单个预训练语言模型可以零样本迁移以执行标准的自然语言处理任务。


Figure 1.1: Performance on SuperGLUE increases with model size. A value of $K=32$ means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in SuperGLUE. We report GPT-3 values on the dev set, so our numbers are not directly comparable to the dotted reference lines (our test set results are in the appendix). The BERT-Large reference model was fine-tuned on the SuperGLUE training set (125K examples), whereas $\mathrm{BERT++}$ was first fine-tuned on MultiNLI (392K examples) and SWAG (113K examples) before further fine-tuning on the SuperGLUE training set (for a total of 630K fine-tuning examples).

图 1.1: SuperGLUE 上的性能随着模型规模的增加而提高。值 $K=32$ 表示我们的模型在每个任务中展示了 32 个示例,总共 256 个示例分布在 SuperGLUE 的 8 个任务中。我们报告的是 GPT-3 在开发集上的结果,因此我们的数字不能直接与虚线参考线进行比较(我们的测试集结果在附录中)。BERT-Large 参考模型是在 SuperGLUE 训练集(125K 示例)上微调的,而 $\mathrm{BERT++}$ 首先在 MultiNLI(392K 示例)和 SWAG(113K 示例)上进行了微调,然后再在 SuperGLUE 训练集上进一步微调(总共 630K 微调示例)。

Performance on SuperGLUE increases with number of examples in context. We find the difference in performance between the BERT-Large and BERT $^{++}$ to be roughly equivalent to the difference between GPT-3 with one example per context versus eight examples per context.

在 SuperGLUE 上的表现随着上下文中示例数量的增加而提高。我们发现 BERT-Large 和 BERT $^{++}$ 之间的性能差异大致相当于 GPT-3 在每个上下文中有一个示例与有八个示例之间的差异。

Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning.

所有 42 个以准确率为衡量标准的基准测试的综合性能。虽然零样本 (Zero-shot) 性能随着模型规模的增大而稳步提高,但少样本 (Few-shot) 性能增长更快,这表明更大的模型在上下文学习方面更为熟练。

without the need for finetuning on a dataset of training examples. While this work was a promising proof of concept, the best case performance only matched some supervised baselines on a single dataset. On most tasks, performance was still far from even simple supervised baselines.

无需在训练样本数据集上进行微调。虽然这项工作是一个有前景的概念验证,但在单个数据集上最佳性能仅与某些监督基线相当。在大多数任务中,性能仍然远低于简单的监督基线。

However $[\mathrm{RWC}^{+}19]$ also showed a potential way forward. The work observed relatively consistent log-linear trends in performance on both transfer tasks and language modeling loss across one order of magnitude of scaling. $[\mathrm{KMH}^{+}20]$ then conducted a much more rigorous study of the scaling behavior of log loss and confirmed smooth scaling trends. In this work, we empirically test whether scaling continues to improve performance by extrapolating the previously identified phenomena another two orders of magnitude. We train a 175 billion parameter autoregressive language model, which we call GPT-3, and measure its transfer learning abilities.

然而 $[\mathrm{RWC}^{+}19]$ 也展示了一种潜在的前进方向。该工作观察到,在一个数量级的扩展范围内,迁移任务和语言建模损失的表现呈现出相对一致的对数线性趋势。$[\mathrm{KMH}^{+}20]$ 随后对对数损失的扩展行为进行了更为严格的研究,并确认了平滑的扩展趋势。在本工作中,我们通过将之前识别的现象外推两个数量级,实证测试扩展是否继续改善性能。我们训练了一个 1750 亿参数的自回归大语言模型,我们称之为 GPT-3,并测量其迁移学习能力。

As part of this investigation, we also clarify and systematize the approach introduced in $[\mathrm{RWC}^{+}19]$. While $[\mathrm{RWC}^{+}19]$ describe their work as "zero-shot task transfer", they sometimes provide examples of the relevant task in the context. Due to the use of what are effectively training examples, these cases are better described as "one-shot" or "few-shot" transfer. We study these one-shot and few-shot settings in detail, comparing them with the zero-shot setting which only uses a natural language description or invocation of the task to be performed. Our findings are summarized in Figure 1.1. We observe that one- and few-shot performance is often much higher than true zero-shot performance, leading us to suggest that language models can also be understood as meta-learners where slow outer-loop gradient descent based learning is combined with fast "in-context" learning implemented within the context activations of the model.

作为这项研究的一部分,我们还澄清并系统化了 $[\mathrm{RWC}^{+}19]$ 中引入的方法。虽然 $[\mathrm{RWC}^{+}19]$ 将他们的工作描述为“零样本任务迁移”,但有时他们在上下文中提供了相关任务的示例。由于实际上使用了训练示例,这些情况更好地被描述为“一样本”或“少样本”迁移。我们详细研究了这些一样本和少样本设置,并将其与仅使用自然语言描述或调用任务的零样本设置进行比较。我们的发现总结在图 1.1 中。我们观察到一样本和少样本的表现通常远高于真正的零样本表现,这使我们认为语言模型也可以被视为元学习器,其中慢速的外部循环梯度下降学习与模型上下文激活中的快速“上下文内”学习相结合。


Broadly, on NLP tasks GPT-3 achieves promising results in the zero- and one-shot settings, and in the few-shot setting is sometimes competitive with or even occasionally surpasses state-of-the-art (despite state-of-the-art being held by fine-tuned models). For example, GPT-3 achieves 81.5 F1 on CoQA in the zero-shot setting, 84.0 F1 on CoQA in the one-shot setting, and 85.0 F1 in the few-shot setting. Similarly, GPT-3 achieves 64.3% accuracy on TriviaQA in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, the last of which is state-of-the-art relative to fine-tuned models operating in the same closed-book setting.

在自然语言处理任务中,GPT-3 在零样本和单样本设置下取得了有希望的结果,在少样本设置下有时可以与最先进水平竞争,甚至偶尔超越最先进水平(尽管最先进水平是由微调模型保持的)。例如,GPT-3 在零样本设置下在 CoQA 上达到了 81.5 F1,在单样本设置下达到了 84.0 F1,在少样本设置下达到了 85.0 F1。同样,GPT-3 在零样本设置下在 TriviaQA 上达到了 64.3% 的准确率,在单样本设置下达到了 68.0%,在少样本设置下达到了 71.2%,后者在相同的闭书设置下相对于微调模型是最先进的。

We additionally train a series of smaller models (ranging from 125 million parameters to 13 billion parameters) in order to compare their performance to GPT-3 in the zero-, one- and few-shot settings. In general, we find relatively smooth scaling for most tasks with model capacity in all three settings; one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model capacity, perhaps suggesting that larger models are more proficient meta-learners.

此外,我们还训练了一系列较小的模型(参数量从 1.25 亿到 130 亿)以比较它们在零样本、单样本和少样本设置下的性能与 GPT-3 的差异。总体而言,我们发现大多数任务在三种设置下随着模型容量的增加表现出相对平滑的扩展;一个值得注意的模式是,零样本、单样本和少样本性能之间的差距往往随着模型容量的增加而扩大,这可能表明较大的模型是更熟练的元学习者。

2 Approach

2 方法

Our basic pre-training approach, including model, data, and training, is similar to the process described in $[\mathrm{RWC}^{+}19]$, with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to $[\mathrm{RWC}^{+}19]$, but in this work we systematically explore different settings for learning within the context:

我们的基本预训练方法,包括模型、数据和训练,与 $[\mathrm{RWC}^{+}19]$ 中描述的过程相似,只是在模型规模、数据集规模和多样性以及训练长度方面进行了相对直接的扩展。我们对上下文中学习的使用也与 $[\mathrm{RWC}^{+}19]$ 类似,但在本工作中,我们系统地探索了上下文中学习的不同设置:

· Fine-Tuning (FT) - updates the weights of a pre-trained model by training on thousands of supervised labels specific to the desired task. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data $[\mathrm{GSL^{+}18}$, NK19]. We focus on task-agnostic performance, leaving fine-tuning for future work.

· 微调 (Fine-Tuning, FT) - 通过在数千个特定于所需任务的监督标签上进行训练,更新预训练模型的权重。微调的主要优势是在许多基准测试中表现出色。主要缺点包括每个任务都需要一个新的大型数据集、可能存在分布外泛化不良的情况 [MPL19],以及可能利用训练数据中的虚假特征 $[\mathrm{GSL^{+}18}$, NK19]。我们专注于任务无关的性能,将微调留作未来的工作。

· Few-Shot (FS) - the model is given a few demonstrations of the task at inference time as conditioning $[\mathrm{RWC}^{+}19]$, but no weights are updated. An example typically has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion (see appendix for more details). We typically set $K$ in the range of 10 to 100, as this is how many examples can fit in the model's context window ($n_{\mathrm{ctx}}=2048$). The main advantage of few-shot is a major reduction in the need for task-specific data. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, $\mathrm{VBL}^{+}16$] - both involve learning based on a broad distribution of tasks and then rapidly adapting to a new task.

· 少样本 (Few-Shot, FS) - 模型在推理时给出几个任务的演示作为条件 $[\mathrm{RWC}^{+}19]$,但不更新任何权重。一个示例通常包含上下文和期望的完成(例如,一个英文句子及其法语翻译),少样本通过提供 $K$ 个上下文和完成的示例,然后给出一个最终的上下文示例,要求模型提供完成(更多详情见附录)。我们通常将 $K$ 设置在 10 到 100 之间,因为这是模型上下文窗口 ($n_{\mathrm{ctx}}=2048$) 中可以容纳的示例数量。少样本的主要优点是大大减少了对特定任务数据的需求。主要缺点是,到目前为止,这种方法的结果远不如最先进的微调模型。此外,仍然需要少量的任务特定数据。正如名称所示,这里描述的语言模型的少样本学习与机器学习其他上下文中使用的少样本学习 [HYC01, $\mathrm{VBL}^{+}16$] 类似——两者都涉及基于广泛的任务分布进行学习,然后快速适应新任务。

· One-Shot (1S) - similar to few-shot but with $K=1$.

· Zero-Shot (0S) - similar to few-shot but with a natural language description of the task instead of any examples.

· 单样本 (One-Shot) - 与少样本类似,但 $K=1$
· 零样本 (Zero-Shot) - 与少样本类似,但使用任务的自然语言描述而不是任何示例。

The appendix includes a demonstration of the four methods using the example of translating English to French. While the few-shot results we present in this paper achieve the highest performance, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.

附录中以英译法为例演示了这四种方法。虽然我们在本文中展示的少样本结果达到了最高性能,但单样本,甚至有时零样本,似乎是与人类表现更公平的比较,也是未来工作的重要目标。
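为直观起见,下面给出一个最小示意(非论文原始实现),说明零样本、单样本和少样本提示大致如何仅通过文本拼接来构造,以英译法为例;其中的任务描述措辞、分隔符和演示句子均为假设,实际使用的格式见论文附录。

```python
# 最小示意(非论文原始实现):零样本 / 单样本 / 少样本提示仅靠文本拼接构造,不做任何梯度更新。
# 任务描述措辞、"=>" 分隔符与演示句子均为假设。

def build_prompt(task_description, demonstrations, query, k):
    """demonstrations: [(源句, 目标句), ...];k=0 为零样本,k=1 为单样本,k>1 为少样本。"""
    parts = [task_description]                  # 自然语言任务描述(零样本时唯一的条件信息)
    for src, tgt in demonstrations[:k]:         # 上下文中的 k 个演示
        parts.append(f"{src} => {tgt}")
    parts.append(f"{query} =>")                 # 最后只给上下文,期望模型补全目标句
    return "\n".join(parts)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_prompt("Translate English to French:", demos, "peppermint", k=2))
```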

2.1 Model and Architectures

2.1 模型和架构

We use the same model and architecture as GPT-2 $[\mathrm{RWC}^{+}19]$, including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. This range of model sizes allows us to test the scaling laws introduced in $[\mathrm{KMH}^{+}20]$.

我们使用与 GPT-2 $[\mathrm{RWC}^{+}19]$ 相同的模型和架构,包括其中描述的修改后的初始化、预归一化和可逆分词,唯一的例外是我们在 Transformer 的各层中交替使用密集和局部带状稀疏的注意力模式,类似于 Sparse Transformer [CGRS19]。为了研究机器学习性能对模型规模的依赖性,我们训练了 8 种不同大小的模型,从 1.25 亿参数到 1750 亿参数,其中最大的模型我们称为 GPT-3。这一范围的模型大小使我们能够测试 $[\mathrm{KMH}^{+}20]$ 中引入的扩展定律。
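下面用一个简化的掩码构造示意(非 GPT-3 的实际实现)说明"交替的密集与局部带状稀疏注意力"的含义:偶数层使用完整的因果掩码,奇数层只允许关注固定窗口内的近邻 token;窗口大小等细节均为假设。

```python
import numpy as np

def causal_dense_mask(n):
    """完整因果掩码:位置 i 可关注所有 j <= i。"""
    return np.tril(np.ones((n, n), dtype=bool))

def causal_banded_mask(n, window):
    """局部带状因果掩码:位置 i 只关注 i-window < j <= i(窗口大小为假设值)。"""
    mask = causal_dense_mask(n)
    for i in range(n):
        mask[i, : max(0, i - window + 1)] = False
    return mask

# 交替使用:偶数层密集、奇数层带状(此处以 8 个位置、4 层、窗口 3 做演示)
masks = [causal_dense_mask(8) if layer % 2 == 0 else causal_banded_mask(8, window=3)
         for layer in range(4)]
```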

More details on the sizes and architectures of our models can be found in the appendix. We partition each model across GPUs along both the depth and width dimension in order to minimize data-transfer between nodes.

有关我们模型的大小和架构的更多详细信息可以在附录中找到。我们将每个模型在 GPU 上按深度和宽度维度进行划分,以最小化节点之间的数据传输。

2.2 Training Dataset

2.2 训练数据集

To create our training data, we (1) downloaded and filtered a version of Common Crawl $[\mathrm{RSR}^{+}19]$ based on similarity to a range of high-quality reference corpora, (2) performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) added known high-quality reference corpora to the training mix to augment Common Crawl and increase its diversity. These reference corpora include an expanded version of the WebText dataset $[\mathrm{RWC}^{+}19]$, collected by scraping links over a longer period of time, and first described in $[\mathrm{KMH}^{+}20]$, two internet-based books corpora (Books1 and Books2) and English-language Wikipedia (details in the appendix).

为了创建我们的训练数据,我们 (1) 下载并过滤了一个版本的 Common Crawl $[\mathrm{RSR}^{+}19]$,基于其与一系列高质量参考语料库的相似性;(2) 在文档级别进行了模糊去重,包括在数据集内部和跨数据集,以防止冗余并保持我们保留的验证集的完整性,作为过拟合的准确衡量标准;(3) 将已知的高质量参考语料库添加到训练混合数据中,以增强 Common Crawl 并增加其多样性。这些参考语料库包括扩展版的 WebText 数据集 $[\mathrm{RWC}^{+}19]$,该数据集通过在更长时间内抓取链接收集而成,并首次在 $[\mathrm{KMH}^{+}20]$ 中描述,还包括两个基于互联网的书籍语料库(Books1 和 Books2)以及英文维基百科(详细信息见附录)。
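论文正文没有给出"文档级模糊去重"的具体算法。下面是一个假设性的最小示意,用基于 shingle 的 Jaccard 相似度表达其思路;相似度阈值 0.8 为假设值,实际的大规模实现通常会改用 MinHash/LSH 等近似方法以避免两两比较。

```python
# 假设性示意:基于 shingle 的文档级模糊去重,非论文原始实现。

def shingles(text, n=5):
    """把文档切成连续 n 个词组成的片段集合(n=5 为假设值)。"""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    return len(a & b) / max(1, len(a | b))

def fuzzy_dedup(docs, threshold=0.8):
    """与所有已保留文档的相似度都低于阈值时才保留当前文档。"""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```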

Table 3.1: Performance on cloze and completion tasks. GPT-3 significantly improves SOTA on LAMBADA while achieving respectable performance on two difficult completion prediction datasets. $^{a}$[Tur20] $^{b}[\mathrm{RWC}^{+}19]$ $^{c}$[LDL19] $^{d}[\mathrm{LCH}^{+}20]$

表 3.1: 完形填空和补全任务的性能。GPT-3 在 LAMBADA 上显著提升了 SOTA,同时在两个困难的补全预测数据集上也取得了不错的成绩。$^{a}$[Tur20] $^{b}$[RWC+19] $^{c}$[LDL19] $^{d}$[LCH+20]

设置 LAMBADA (acc) LAMBADA (ppl) StoryCloze (acc) HellaSwag (acc)
SOTA 68.0 a 8.63 b 91.8 c 85.6 d
GPT-3 零样本 76.2 3.00 83.2 78.9
GPT-3 单样本 72.5 3.35 84.7 78.1
GPT-3 少样本 86.4 1.92 87.7 79.3

2.3 Training Process

2.3 训练过程

As found in $[\mathrm{KMH}^{+}20$, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18]. Table A.1 shows the parameter settings we used. To train the larger models without running out of memory, we use a mixture of model parallelism within each matrix multiply and model parallelism across the layers of the network. All models were trained on V100 GPUs on part of a high-bandwidth cluster. Details of the training process and hyperparameter settings are described in the appendix.

如 $[\mathrm{KMH}^{+}20$, MKAT18] 所示,较大的模型通常可以使用更大的批量大小,但需要较小的学习率。我们测量训练期间的梯度噪声规模,并用它来指导我们选择批量大小 [MKAT18]。表 A.1 显示了我们使用的参数设置。为了在不耗尽内存的情况下训练更大的模型,我们在每个矩阵乘法内部和网络各层之间混合使用模型并行。所有模型均在高带宽集群的一部分上使用 V100 GPU 进行训练。训练过程和超参数设置的详细信息在附录中描述。

2.4 Evaluation

2.4 评估

For few-shot learning, we evaluate each example in the evaluation set by randomly drawing $K$ examples from that task's training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available so we draw conditioning examples from the development set and evaluate on the test set.

对于少样本学习,我们通过从该任务的训练集中随机抽取 $K$ 个示例作为条件,来评估评估集中的每个示例,示例之间根据任务不同用1个或2个换行符分隔。对于 LAMBADA 和 Storycloze,由于没有监督训练集可用,我们从开发集中抽取条件示例,并在测试集上进行评估。
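下面是按本节描述写的一个评估条件构造示意(非原始实现):从任务训练集中随机抽取 $K$ 个示例拼成上下文,并按任务用 1 或 2 个换行符分隔;其中示例的数据格式("prompt" / "completion" 字段)为假设。

```python
import random

def make_eval_context(train_examples, test_example, k, newlines=1, seed=0):
    """train_examples / test_example 采用假设的 {"prompt": ..., "completion": ...} 格式。"""
    rng = random.Random(seed)
    demos = rng.sample(train_examples, k)           # 为每个评估样本随机抽取 K 个条件示例
    sep = "\n" * newlines                           # 按任务用 1 或 2 个换行符分隔
    blocks = [d["prompt"] + " " + d["completion"] for d in demos]
    blocks.append(test_example["prompt"])           # 最后只给待评估样本的上下文
    return sep.join(blocks)
```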

For some tasks we use a natural language prompt in addition to (or for $K=0$ , instead of) demonstrations. Similar to $[\mathrm{RSR}^{+}19]$ we also sometimes change the formatting of answers. See the appendix for per-task examples.

对于某些任务,我们在演示之外(或在 $K=0$ 时代替演示)还使用自然语言提示。类似于 $[\mathrm{RSR}^{+}19]$,我们有时也会更改答案的格式。每个任务的示例请参见附录。

On tasks with free-form completion, we use beam search with the same parameters as $[\mathrm{RSR}^{+}19]$: a beam width of 4 and a length penalty of $\alpha=0.6$.

在自由形式补全的任务中,我们使用与 $[\mathrm{RSR}^{+}19]$ 相同参数的束搜索:束宽为 4,长度惩罚为 $\alpha=0.6$。
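束搜索本身的实现不在本文讨论范围内,下面只示意长度惩罚如何参与候选排序:这里假设采用 GNMT 风格的惩罚公式 $((5+\mathrm{len})/6)^{\alpha}$,这是与 $[\mathrm{RSR}^{+}19]$ 设置相容的常见做法,但具体公式属于假设。

```python
def length_penalty(length, alpha=0.6):
    # 假设采用 GNMT 风格的长度惩罚:lp = ((5 + len) / 6) ** alpha
    return ((5 + length) / 6) ** alpha

def beam_score(token_logprobs, alpha=0.6):
    """候选序列得分 = 累计对数概率 / 长度惩罚;束宽为 4 时保留得分最高的 4 条候选继续扩展。"""
    return sum(token_logprobs) / length_penalty(len(token_logprobs), alpha)

# 长度归一化让较长但整体概率不差的候选不会被直接淘汰
print(beam_score([-0.2, -0.3]), beam_score([-0.2, -0.3, -0.25, -0.3]))
```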

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set.

当测试集公开可用时,我们针对每个模型规模和学习设置(零样本、单样本和少样本)在测试集上报告最终结果。当测试集不公开时,我们的模型通常太大而无法装入测试服务器,因此我们在开发集上报告结果。

3 Results

3 结果

3.1 Language Modeling, Cloze, and Completion Tasks

3.1 语言建模、完形填空和补全任务

We test GPT-3's performance on the traditional task of language modeling as well as related tasks. We calculate zero-shot perplexity on the Penn Tree Bank (PTB) $[\mathrm{MKM^{+}94}]$ dataset measured in $[\mathrm{RWC}^{+}19]$ . We omit the 4 Wikipedia-related tasks and the one-billion word benchmark due to a high fraction of these datasets being contained in our training set. Our largest model sets a new SOTA on PTB by a substantial margin of 15 points.

我们测试 GPT-3 在传统语言建模任务及相关任务上的性能。我们在 Penn Tree Bank (PTB) 数据集上计算零样本困惑度 [MKM^+94],该指标在 [RWC^+19] 中有测量。我们省略了 4 个与 Wikipedia 相关的任务和十亿词基准测试,因为这些数据集中有很大一部分包含在我们的训练集中。我们最大的模型在 PTB 上以显著的 15 分优势设立了新的 SOTA。

The LAMBADA dataset $[\mathrm{PKL}^{+}16]$ requires the model to predict the last word of a paragraph. Although $[\mathrm{BHT^{+}20}]$ suggested scaling language models is yielding diminishing returns on this benchmark, we find that zero-shot GPT-3 achieves a substantive gain of 8% over the previous state-of-the-art. For the few-shot setting, we use a fill-in-the-blank format to encourage the language model to only generate one word (Alice was friends with Bob. Alice went to visit her friend, ___. $\rightarrow$ Bob). With this format, GPT-3 achieves an increase of over 18% from the previous state-of-the-art, and performance improves smoothly with model size. However, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot setting, perhaps because all models require several examples to recognize the pattern. An analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data - however analysis performed in Section 4 suggests negligible impact on performance.

LAMBADA 数据集 $[\mathrm{PKL}^{+}16]$ 要求模型预测段落的最后一个词。尽管 $[\mathrm{BHT^{+}20}]$ 认为在该基准上扩展语言模型的收益正在递减,但我们发现零样本 GPT-3 相对于之前的最先进水平取得了实质性的 8% 的提升。对于少样本设置,我们使用填空格式鼓励语言模型只生成一个词 (Alice was friends with Bob. Alice went to visit her friend, ___. → Bob)。通过这种格式,GPT-3 相对于之前的最先进水平实现了超过 18% 的提升,并且性能随着模型规模的增大而平稳提高。然而,填空方法在单样本设置下并不有效,其表现总是不如零样本设置,可能是因为所有模型都需要几个例子来识别模式。对测试集污染的分析表明,LAMBADA 数据集的相当一部分似乎出现在我们的训练数据中,但第 4 节进行的分析表明这对性能的影响可以忽略不计。
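下面用一个简短示意(提示的具体格式为近似,示例句子仅作说明)展示这种"填空式"少样本提示如何把 LAMBADA 约束为只生成一个词:

```python
# 示意:LAMBADA 少样本评估的填空式提示;格式细节与示例句子均为假设。

def lambada_fewshot_prompt(demos, query_passage):
    """demos: [(去掉末词的段落, 末词), ...];query_passage 为已去掉末词的待预测段落。"""
    lines = [f"{passage} ____. -> {answer}" for passage, answer in demos]
    lines.append(f"{query_passage} ____. ->")
    return "\n".join(lines)

demos = [("Alice was friends with Bob. Alice went to visit her friend,", "Bob")]
print(lambada_fewshot_prompt(demos, "George bought some baseball equipment, a ball, a glove, and a"))
```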

设置 NaturalQs WebQS TriviaQA
RAG (Fine-tuned, Open-Domain) [LPP+20] 44.5 45.5 68.0
T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20] 36.6 44.7 60.5
T5-11B (Fine-tuned, Closed-Book) 34.5 37.4 50.1
GPT-3 零样本 14.6 14.4 64.3
GPT-3 一样本 23.0 25.3 68.0
GPT-3 少样本 29.9 41.5 71.2

Table 3.2: Results on three Open-Domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, as compared to prior SOTA results for closed book and open domain settings. TriviaQA few-shot result is evaluated on the wiki split test server.

Table 3.3: GPT-3 results on a selection of QA / RC tasks. CoQA and DROP are F1 while ARC reports accuracy. See the appendix for additional experiments. $^{a}[\mathrm{KKS}^{+}20]$ $^{b}[\mathrm{KKS}^{+}20]$ $^{c}[\mathrm{JZC}^{+}19]$ $^{d}$[JIN20]

表 3.2: 在三个开放域问答任务上的结果。GPT-3 在少样本、单样本和零样本设置下的表现,与之前封闭书本和开放域设置下的最佳结果 (SOTA) 进行了比较。TriviaQA 少样本结果是在 wiki 分割测试服务器上评估的。

表 3.3: GPT-3 在一系列问答 / 阅读理解任务上的结果。CoQA 和 DROP 报告 F1 分数,而 ARC 报告准确率。更多实验详见附录。$^{a}[\mathrm{KKS}^{+}20]$ $^{b}[\mathrm{KKS}^{+}20]$ $^{c}[\mathrm{JZC}^{+}19]$ $^{d}$[JIN20]

设置 ARC (Easy) ARC (Challenge) CoQA DROP
Fine-tuned SOTA 92.0a 78.5b 90.7c 89.1d
GPT-3 零样本 68.8 51.4 81.5 23.6
GPT-3 单样本 71.2 53.2 84.0 34.3
GPT-3 少样本 70.1 51.5 85.0 36.5

The HellaSwag dataset $[\mathrm{ZHB}^{+}19]$ involves picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans. GPT-3 outperforms a fine-tuned 1.5B parameter language model $[\mathrm{ZHR}^{+}19]$ but is still a fair amount lower than the overall SOTA achieved by the fine-tuned multi-task model ALUM.

HellaSwag 数据集 $[\mathrm{ZHB}^{+}19]$ 涉及选择故事或指令集的最佳结尾。这些例子经过对抗性挖掘,旨在对语言模型具有挑战性,而对人类则相对简单。GPT-3 超过了一个微调的 1.5B 参数语言模型 $[\mathrm{ZHR}^{+}19]$,但仍明显低于由微调的多任务模型 ALUM 达到的整体最先进水平 (SOTA)。

The StoryCloze 2016 dataset $[\mathrm{MCH}^{+}16]$ involves selecting the correct ending sentence for five-sentence long stories. Here GPT-3 improves over previous zero-shot results by roughly 10% but is overall still 4.1% lower than the fine-tuned SOTA using a BERT based model [LDL19].

StoryCloze 2016 数据集 $[\mathrm{MCH}^{+}16]$ 涉及为五句长的故事选择正确的结尾句子。在这里,GPT-3 的零样本 (Zero-shot) 结果比之前提高了大约 10%,但总体上仍比使用基于 BERT 的模型微调得到的最先进水平低 4.1% [LDL19]。

3.2 Question Answering

3.2 问答系统

In this section we measure GPT-3's ability to handle a variety of question answering tasks. First, we look at datasets involving answering questions about broad factual knowledge. We evaluate in the "closed-book" setting (meaning no conditioning information/articles) as suggested by [RRS20]. On TriviaQA [JCWZ17], GPT-3 zero-shot already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents $[\mathrm{LPP}^{+}20]$. GPT-3's few-shot result further improves performance another 3.2% beyond this. On Natural Questions (NQs) $[\mathrm{KPR}^{+}19]$, GPT-3 underperforms a fine-tuned T5-11B+SSM. The questions in NQs tend towards fine-grained Wikipedia knowledge which could be testing the limits of GPT-3's capacity and broad pretraining distribution.

在本节中,我们测量 GPT-3 处理各种问答任务的能力。首先,我们查看涉及回答广泛事实知识问题的数据集。我们在"闭卷"设置(即没有条件信息/文章)下进行评估,如 [RRS20] 所建议的。在 TriviaQA [JCWZ17] 上,GPT-3 在零样本设置下已比微调后的 T5-11B 高出 14.2%,也比在预训练期间使用问答定制跨度预测的版本高出 3.8%。单样本结果再提高 3.7%,与一个开放域问答系统的最先进水平持平,该系统不仅进行了微调,还在覆盖 2100 万文档、153 亿参数的密集向量索引上使用了学习到的检索机制 $[\mathrm{LPP}^{+}20]$。GPT-3 的少样本结果在此基础上又将性能提高了 3.2%。在 Natural Questions (NQs) $[\mathrm{KPR}^{+}19]$ 上,GPT-3 的表现不如微调后的 T5-11B+SSM。NQs 中的问题倾向于细粒度的维基百科知识,这可能是在测试 GPT-3 的容量和广泛预训练分布的极限。

ARC $[\mathrm{CCE^{+}18}]$ is a common sense reasoning dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the "Challenge" version of the dataset, which has been filtered to questions which simple statistical or information retrieval methods are unable to correctly answer, GPT-3 approaches the performance of a fine-tuned RoBERTa baseline $[\mathrm{KKS}^{+}20]$. On the "Easy" version of the dataset, GPT-3 slightly exceeds the same fine-tuned RoBERTa baseline $[\mathrm{KKS^{+}20}]$. However, both of these results are still much worse than the overall SOTAs achieved by $[\mathrm{KKS}^{+}20]$.

ARC $[\mathrm{CCE^{+}18}]$ 是一个从 3 年级到 9 年级科学考试中收集的多选题常识推理数据集。在"Challenge"版本(该版本只保留简单统计或信息检索方法无法正确回答的问题)上,GPT-3 的表现接近微调的 RoBERTa 基准模型 $[\mathrm{KKS}^{+}20]$。在"Easy"版本的数据集中,GPT-3 稍微超过了同一微调的 RoBERTa 基准模型 $[\mathrm{KKS^{+}20}]$。然而,这两个结果仍然远不如 $[\mathrm{KKS}^{+}20]$ 达到的整体最先进水平。

Table 3.4: Few-shot GPT-3 outperforms previous unsupervised NMT work by 5 BLEU when translating into English, reflecting its strength as an English LM. We report BLEU scores on the WMT'14 $\mathrm{Fr}{\leftrightarrow}\mathrm{En}$, WMT'16 $\mathrm{De}{\leftrightarrow}\mathrm{En}$, and WMT'16 $\mathrm{Ro}{\leftrightarrow}\mathrm{En}$ datasets as measured by multi-bleu.perl with XLM's tokenization in order to compare most closely with prior unsupervised NMT work. SacreBLEU [Pos18] results reported in the appendix. Underline indicates an unsupervised or few-shot SOTA, bold indicates supervised SOTA with relative confidence. $^{a}$[EOAG18] $^{b}$[DHKH14] $^{c}[\mathrm{WXH^{+}18}]$ $^{d}$[oR16] $^{e}[\mathrm{LGG^{+}}20]$ $^{f}$[SacreBLEU signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.2.20]

设置 En→Fr Fr→En En→De De→En En→Ro Ro→En
SOTA (有监督) 45.6a 35.0 b 41.2c 40.2d 38.5e 39.9e
XLM [LC19] 33.4 33.3 26.4 34.3 33.3 31.8
MASS [STQ+19] 37.5 34.9 28.3 35.2 35.2 33.1
mBART [LGG+20] 29.8 34.0 35.0 30.5
GPT-3 零样本 25.2 21.2 24.6 27.2 14.1 19.9
GPT-3 单样本 28.3 33.7 26.2 30.4 20.6 38.6
GPT-3 少样本 32.6 39.2 29.7 40.6 21.0 39.5

表 3.4: 少样本 GPT-3 在将文本翻译成英语时,比之前的无监督神经机器翻译工作高出 5 BLEU 分,反映了其作为英语语言模型的优势。我们在 WMT'14 法语↔英语、WMT'16 德语↔英语和 WMT'16 罗马尼亚语↔英语数据集上报告了 BLEU 分数,这些分数通过 multi-bleu.perl 并采用 XLM 的分词方式测量,以便与之前的无监督神经机器翻译工作最接近地比较。SacreBLEU [Pos18] 结果在附录中报告。下划线表示无监督或少样本的最佳水平,粗体表示有监督的最佳水平(具有相对置信度)。$^{a}$[EOAG18] $^{b}$[DHKH14] $^{c}$[WXH+18] $^{d}$[oR16] $^{e}$[LGG+20] $^{f}$[SacreBLEU 签名:BLEU+case.mixed+numrefs.1+smooth.exp+tok.intl+version.1.2.20]

Finally, we evaluate GPT-3 on two reading comprehension datasets. Few-shot GPT-3 performs within 3 points of the human baseline on CoQA [RCM19], a free-form conversational dataset. On DROP $[\mathrm{DWD}^{+}19]$, a dataset testing discrete reasoning and numeracy, few-shot GPT-3 outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems $[\mathrm{RLL}^{+}19]$.

最后,我们在两个阅读理解数据集上评估 GPT-3。少样本 GPT-3 在 CoQA [RCM19] 上的表现与人类基线相差 3 分以内,CoQA 是一个自由形式的对话数据集。在测试离散推理和数值能力的 DROP $[\mathrm{DWD}^{+}19]$ 数据集上,少样本 GPT-3 超过了原始论文中微调的 BERT 基线,但仍然远低于人类表现和最先进的方法,后者通过符号系统增强神经网络 $[\mathrm{RLL}^{+}19]$。

3.3 Translation

3.3 翻译

In collecting training data for GPT-3, we used the unfiltered distribution of languages reflected in internet text datasets (primarily Common Crawl). As a result, although GPT-3's training data primarily consists of English (93% by word count), it also includes 7% non-English content (full list at GPT-3 GitHub). Existing unsupervised machine translation approaches often combine pretraining on a pair of monolingual datasets with back-translation [SHB15] to bridge the two languages in a controlled way. By contrast, GPT-3 learns from a blend of training data that mixes many languages together. Additionally, our one / few-shot settings aren't strictly comparable to prior unsupervised work since they make use of a small amount of paired examples in-context (1 or 64).

在收集 GPT-3 的训练数据时,我们使用了互联网文本数据集(主要是 Common Crawl)中未经过滤的语言分布。因此,尽管 GPT-3 的训练数据主要由英语组成(按词数计算占 93%),还包括 7% 的非英语内容(完整列表见 GPT-3 GitHub)。现有的无监督机器翻译方法通常将在一对单语数据集上的预训练与回译 [SHB15] 相结合,以受控方式连接两种语言。相比之下,GPT-3 从多种语言混合的训练数据中学习。此外,我们的单样本 / 少样本设置并不完全等同于之前的无监督工作,因为它们在上下文中利用了少量成对示例(1 或 64 个)。

Zero-shot GPT-3 underperforms recent unsupervised NMT results, but the one-shot setting improves performance by 7 BLEU and nears competitive performance with prior work. Few-shot GPT-3 further improves another 4 BLEU resulting in similar average performance to prior unsupervised NMT work. For the three input languages studied, GPT-3 significantly outperforms prior unsupervised NMT work when translating into English but underperforms when translating in the other direction. Performance on En-Ro is a noticeable outlier at over 10 BLEU worse than prior unsupervised NMT work. This could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset. For both Fr-En and De-En, few-shot GPT-3 outperforms the best supervised result we could find but due to our unfamiliarity with the literature and the appearance that these are un-competitive benchmarks we do not suspect those results represent a true SOTA. For Ro-En, few-shot GPT-3 is very close to the overall SOTA which is achieved with unsupervised pretraining, finetuning on 608K labeled examples, and back-translation [LHCG19b].

零样本 GPT-3 的表现不如最近的无监督 NMT 结果,但单样本设置将性能提高了 7 BLEU,并接近与之前工作相当的水平。少样本 GPT-3 进一步提高了 4 BLEU,使得平均性能与之前的无监督 NMT 工作相似。对于研究的三种输入语言,GPT-3 在翻译成英语时显著优于之前的无监督 NMT 工作,但在反向翻译时表现较差。En-Ro 的表现是一个明显的例外,比之前的无监督 NMT 工作低了超过 10 BLEU。这可能是由于重用了 GPT-2 的字节级 BPE 分词器,而该分词器是为几乎完全由英语组成的训练数据集开发的。对于 Fr-En 和 De-En,少样本 GPT-3 超过了我们能找到的最佳有监督结果,但由于我们对文献不熟悉以及这些基准似乎不具备竞争力,我们怀疑这些结果并不代表真正的最先进水平。对于 Ro-En,少样本 GPT-3 接近整体最先进水平,该水平是通过无监督预训练、在 608K 标记示例上微调和回译 [LHCG19b] 实现的。

3.4 SuperGLUE

3.4 SuperGLUE

The SuperGLUE benchmark is a standardized collection of datasets $[\mathrm{WPN^{+}19}]$. In the few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated. We sweep values of $K$ up to 32 and note that the few-shot SuperGLUE score steadily improves with both model size and with number of examples in the context showing increasing benefits from in-context learning (Figure 1.1).

SuperGLUE 基准是一个标准化的数据集集合 $[\mathrm{WPN^{+}19}]$。在少样本设置中,我们为所有任务使用了 32 个示例,这些示例是从训练集中随机抽取的。对于除 WSC 和 MultiRC 之外的所有任务,我们为每个问题抽样了一组新的示例以用作上下文。对于 WSC 和 MultiRC,我们使用了从训练集中随机抽取的同一组示例作为所有评估问题的上下文。我们对 $K$ 的取值扫描至最多 32,并注意到少样本 SuperGLUE 分数随着模型大小和上下文中示例数量的增加而稳步提高,显示出从上下文学习中获得的好处不断增加(图 1.1)。

Table 3.5: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient updates.

表 3.5: GPT-3 在 SuperGLUE 上的表现与微调基线和最先进水平 (SOTA) 的比较。所有结果均在测试集上报告。GPT-3 少样本在每个任务的上下文中总共给出 32 个示例,并且不进行梯度更新。

SuperGLUE 平均 BoolQ 准确率 CB 准确率 CB F1 COPA 准确率 RTE 准确率
微调 SOTA 89.0 91.0 96.9 93.9 94.8 92.5
微调 BERT-Large 69.0 77.4 83.6 75.7 70.6 71.7
GPT-3 少样本 71.8 76.4 75.6 52.0 92.0 69.0
WiC 准确率 WSC 准确率 MultiRC 准确率 MultiRC F1a ReCoRD 准确率 ReCoRD F1
微调 SOTA 76.1 93.8 62.3 88.2 92.5 93.3
微调 BERT-Large 69.6 64.6 24.1 70.0 71.3 72.0
GPT-3 少样本 49.4 80.1 30.5 75.4 90.2 91.1

We observe a wide range in GPT-3's performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC, BoolQ, MultiRC, and RTE, performance is reasonable, roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting. WiC is a notable weak spot with few-shot performance equivalent to random chance. We tried a number of different phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two sentences), none of which was able to achieve strong performance. This hints at a phenomenon (which we saw in other experiments we ran contained in the Additional Materials) - GPT-3 appears to be weak in the few-shot or one-shot setting at some tasks that involve comparing two sentences or snippets. This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these weaknesses, GPT-3 still outperforms a fine-tuned BERT-Large on four of eight tasks and on two tasks GPT-3 is close to the state-of-the-art held by a fine-tuned 11 billion parameter model.

我们观察到 GPT-3 在不同任务中的表现范围很广。在 COPA 和 ReCoRD 上,GPT-3 在单样本和少样本设置中接近最先进 (SOTA) 水平,其中 COPA 仅差几个百分点,并且在排行榜上获得第二名,第一名由一个微调的 110 亿参数模型 (T5) 保持。在 WSC、BoolQ、MultiRC 和 RTE 上,表现合理,大致与微调的 BERT-Large 相当。在 CB 上,少样本设置下 75.6% 的成绩显示出一定的潜力。WiC 是一个明显的弱点,在少样本设置中的表现相当于随机猜测。我们尝试了多种不同的表述和公式来处理 WiC(这涉及确定一个词在两个句子中是否使用了相同的含义),但没有一种能够取得良好的表现。这暗示了一个现象(我们在附加材料中包含的其他实验中也观察到了这一点):GPT-3 在涉及比较两个句子或片段的任务中,在少样本或单样本设置下表现较弱。这也可能解释了 RTE 和 CB 的相对较低得分,这些任务也遵循这种格式。尽管存在这些弱点,GPT-3 仍然在八个任务中的四个上超过了微调的 BERT-Large,并且在两个任务上接近由微调的 110 亿参数模型保持的最先进水平。

4 Measuring and Preventing Memorization Of Benchmarks

4 测量并防止对基准的记忆

The dataset and model size are about two orders of magnitude larger than those used for GPT-2, and include a large amount of Common Crawl, creating increased potential for contamination and memorization. On the other hand, precisely due to the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set with which it was deduplicated. For each benchmark, we produce a "clean" version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap with the whole example when it is shorter than 13-grams). We then evaluate GPT-3 on these clean benchmarks, and compare to the original score. If the score on the clean subset is similar to the score on the entire dataset, this suggests that contamination, even if present, does not have a significant effect on reported results. In most cases performance changes only negligibly, and we see no evidence that contamination level and performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination or that contamination has little effect on performance. We provide full details of the methodology and analysis on the most problematic tasks in the appendix.

数据集和模型规模比用于 GPT-2 的大约大两个数量级,并且包含大量 Common Crawl 数据,这增加了污染和记忆的风险。另一方面,正是由于数据量巨大,即使 GPT-3 175B 在其训练集上也没有显著过拟合,这是相对于一个与训练集去重过的保留验证集来衡量的。对于每个基准测试,我们生成一个"干净"版本,该版本移除了所有可能泄露的样本,大致定义为与预训练集中任何内容有 13-gram 重叠的样本(或当样本长度小于 13-gram 时与整个样本重叠)。然后我们在这些干净的基准测试上评估 GPT-3,并与原始分数进行比较。如果干净子集上的分数与整个数据集上的分数相似,这表明即使存在污染,对报告结果的影响也不显著。在大多数情况下,性能变化微乎其微,我们没有发现污染程度和性能差异之间的相关性。我们得出结论,要么我们的保守方法大大高估了污染,要么污染对性能影响很小。我们在附录中提供了针对最具问题任务的方法和分析的全部详细信息。
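按照本节的描述,下面给出"干净"基准构造的一个最小示意(非论文原始代码,分词方式等细节为简化假设):若评测样本与预训练集存在任意 13-gram 重叠(样本本身短于 13-gram 时按整个样本匹配),则将其剔除。

```python
# 最小示意(非论文原始实现):基于 13-gram 重叠的基准去污染。

def ngrams(tokens, n=13):
    if len(tokens) < n:
        return {tuple(tokens)}              # 样本短于 13-gram 时,按整个样本匹配
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_clean_benchmark(benchmark_examples, pretraining_ngrams, n=13):
    """pretraining_ngrams: 事先从预训练集构建的 13-gram 集合(这里假设已在内存中)。"""
    clean = []
    for example in benchmark_examples:
        tokens = example.lower().split()    # 简化:按空白分词;实际可使用与训练一致的分词器
        if ngrams(tokens, n).isdisjoint(pretraining_ngrams):
            clean.append(example)           # 与预训练集无任何重叠才进入"干净"子集
    return clean
```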

5 Limitations

5 局限性

On text synthesis, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. Our release repository contains uncurated unconditional samples.

在文本生成方面,GPT-3 的样本在文档级别上有时仍然会在语义上重复自己,在足够长的段落中开始失去连贯性,自相矛盾,并且偶尔包含不合逻辑的句子或段落。我们的发布仓库包含未经过滤的无条件样本。

Our experiments do not include any bidirectional architectures or other training objectives such as denoising. Our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality, such as fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content (ANLI, WIC), or tasks that require re-reading or carefully considering a long passage and then generating a very short answer (QuAC, RACE).

我们的实验不包括任何双向架构或其他训练目标,例如去噪。我们的设计决策可能会导致在从双向性中受益的任务上表现较差,例如填空任务、涉及回顾和比较两段内容的任务(ANLI、WIC),或需要重读或仔细考虑长篇内容然后生成非常简短答案的任务(QuAC、RACE)。

Our objective weights every token equally and lacks a notion of what is most important to predict and what is less important. [RRS20] demonstrate benefits of customizing prediction to entities of interest. Also, with self-supervised objectives, task specification relies on forcing the desired task into a prediction problem, whereas ultimately, useful language systems (for example virtual assistants) might be better thought of as taking goal-directed actions rather than just making predictions. Finally, large pretrained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world $[\mathrm{BHT^{+}20}]$. For all these reasons, scaling pure self-supervised prediction is likely to hit limits, and augmentation with a different approach is likely to be necessary. Promising future directions in this vein might include learning the objective function from humans $[\mathrm{ZSW}^{+}19]$, fine-tuning with reinforcement learning, or adding additional modalities such as images to provide grounding and a better model of the world $[\mathrm{CLY}^{+}19]$.

我们的目标函数对每个 Token 的权重相同,缺乏对哪些内容最重要、哪些不那么重要的区分。[RRS20] 展示了针对感兴趣实体定制预测的好处。此外,在自监督目标中,任务定义依赖于将所需任务转化为一个预测问题,而最终,有用的语言系统(例如虚拟助手)可能更适合被视为执行目标导向的动作,而不仅仅是进行预测。最后,大型预训练语言模型并未植根于其他经验领域(如视频或现实世界的物理交互),因此缺乏大量关于世界的背景知识 $[\mathrm{BHT^{+}20}]$。基于所有这些原因,纯粹的自监督预测的扩展可能会遇到瓶颈,很可能需要用不同的方法加以增强。这一方向上有希望的未来工作可能包括从人类学习目标函数 $[\mathrm{ZSW}^{+}19]$,使用强化学习进行微调,或添加图像等其他模态以提供落地 (grounding) 和更好的世界模型 $[\mathrm{CLY}^{+}19]$。

GPT-3's size makes it challenging to deploy. Task-specific distillation [HVD15] merits exploration at this new scale.

GPT-3 的规模使其部署具有挑战性。针对特定任务的蒸馏 [HVD15] 在这一新规模下值得探索。

6 Related Work

6 相关工作

Several efforts have studied the effect of scale on language model performance. $[\mathrm{KMH}^{+}20$, RRBS19, $\mathrm{LWS}^{+}20$, $\mathrm{HNA^{+}17}]$ find a smooth power-law trend in loss as autoregressive language models are scaled up. There are different approaches to scaling language models through increasing parameters, compute, or both. Our work is most aligned with methods that have increased the size of transformers by increasing parameters and FLOPS-per-token roughly in proportion, with a parameter count of 213 million $[\mathrm{VSP^{+}17}]$ in the original paper, then 300 million [DCLT18], 1.5 billion $[\mathrm{RWC}^{+}19]$, 8 billion $[\mathrm{SPP^{+}19}]$, 11 billion $[\mathrm{RSR}^{+}19]$, and most recently 17 billion [Tur20]. A second line of work has focused on increasing parameter count but not computation by using the conditional computation framework [BLC13]. Specifically, the mixture-of-experts method $[\mathrm{SMM}^{+}17]$ has produced 100 billion parameter models and 50 billion parameter translation models [AJF19]. One way to decrease the computational cost of our models would be to draw from work such as ALBERT $[\mathrm{LCG}^{+}19]$ or general [HVD15] or task-specific [SDCW19, $\mathrm{JYS^{+}19}$, KR16] approaches to distillation. Lastly, a third approach to scale increases computation without increasing parameters through methods like adaptive computation time [Gra16] and the universal transformer $[\mathrm{DGV}^{+}18]$.

多个研究工作已经探讨了规模对语言模型性能的影响。[KMH+20, RRBS19, LWS+20, HNA+17] 发现,随着自回归语言模型的扩展,损失呈现出平滑的幂律趋势。通过增加参数、计算资源或两者结合,有不同的方法可以扩展语言模型。我们的工作最接近于那些通过按比例增加Transformer的参数和每Token的FLOPS来扩大模型规模的方法,在最初的论文中参数数量为2.13亿 [VSP+17],然后是3亿 [DCLT18],15亿 [RWC+19],80亿 [SPP+19],110亿 [RSR+19],最近达到了170亿 [Tur20]。第二条研究路线专注于通过条件计算框架 [BLC13] 增加参数数量但不增加计算量。特别是,专家混合方法 [SMM+17] 已经产生了1000亿参数的模型和500亿参数的翻译模型 [AJF19]。减少我们模型计算成本的一种方法是从类似ALBERT [LCG+19] 或通用 [HVD15] 或任务特定 [SDCW19, JYS+19, KR16] 的蒸馏方法中借鉴。最后,第三种扩展方法是通过自适应计算时间 [Gra16] 和通用Transformer [DGV+18] 等方法增加计算而不增加参数。

There are many approaches to building multi-task models. Giving task instructions in natural language was first formalized in a supervised setting with [MKXS18] and used in $[\mathrm{RWC}^{+}19]$ for in-context learning and in $[\mathrm{RSR}^{+}19]$ for multi-task fine-tuning. Multi-task learning [Car97] has shown some promising initial results $[\mathrm{LGH^{+}}15$, LCR19] and multi-stage fine-tuning has produced SOTA or SOTA-competitive results [PFB18, $\mathrm{KKS}^{+}20]$. Meta-learning was used in language models in $[\mathrm{RWC}^{+}19]$, though with limited results and no systematic study. Other uses of meta-learning include matching networks $[\mathrm{VBL}^{+}16]$, RL2 $[\mathrm{DSC}^{+}16]$, learning to optimize [RL16, $\mathrm{ADG^{+}}16$, LM17] and MAML [FAL17]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL2. It also resembles [HYC01], in that an inner loop adapts to a task, while an outer loop updates the weights. Our inner loop performs few-shot in-context learning, but prior work has explored other methods of few-shot learning [SS20, $\mathrm{RCP}^{+}17$, $\mathrm{GWC}^{+}18$, $\mathrm{XDH^{+}19}$].

构建多任务模型的方法有很多。用自然语言给出任务指令最初在监督设置中被形式化 [MKXS18],并在 $[\mathrm{RWC}^{+}19]$ 中用于上下文学习,在 $[\mathrm{RSR}^{+}19]$ 中用于多任务微调。多任务学习 [Car97] 已显示出一些有希望的初步结果 $[\mathrm{LGH^{+}}15$, LCR19],多阶段微调已产生 SOTA 或接近 SOTA 的结果 [PFB18, $\mathrm{KKS}^{+}20]$。元学习在语言模型中被 $[\mathrm{RWC}^{+}19]$ 使用,尽管结果有限且没有系统性研究。元学习的其他应用包括匹配网络 $[\mathrm{VBL}^{+}16]$、RL2 $[\mathrm{DSC}^{+}16]$、学习优化 [RL16, $\mathrm{ADG^{+}}16$, LM17] 和 MAML [FAL17]。我们通过在模型上下文中填充先前示例的方法在结构上最类似于 RL2。它也类似于 [HYC01],即内部循环适应任务,而外部循环更新权重。我们的内部循环执行少样本上下文学习,但之前的工作已经探索了其他少样本学习方法 [SS20, $\mathrm{RCP}^{+}17$, $\mathrm{GWC}^{+}18$, $\mathrm{XDH^{+}19}$]。

Finally, algorithmic innovation in language models over the last two years has been enormous, including denoising-based bidirectionality [DCLT18], prefixLM [DL15], encoder-decoder architectures $[\mathrm{LLG}^{+}19$, $\mathrm{RSR}^{+}19]$, random permutations during training $[\mathrm{YDY^{+}19}]$, architectures for sampling efficiency $[\mathrm{DYY^{+}19}]$, data and training improvements $[\mathrm{LOG^{+}19}]$, and embedding parameters efficiency $[\mathrm{LCG^{+}19}]$. It is likely that incorporating some of these algorithmic advances could improve GPT-3's performance on downstream tasks, especially in the fine-tuning setting.

最后,过去两年中语言模型的算法创新非常巨大,包括基于去噪的双向性 [DCLT18]、前缀语言模型 (prefixLM) [DL15]、编码器-解码器架构 $[\mathrm{LLG}^{+}19$, $\mathrm{RSR}^{+}19]$、训练期间的随机排列 $[\mathrm{YDY^{+}19}]$、提高采样效率的架构 $[\mathrm{DYY^{+}19}]$、数据和训练改进 $[\mathrm{LOG^{+}19}]$,以及嵌入参数效率 $[\mathrm{LCG^{+}19}]$。很可能将这些算法进展中的一些融入 GPT-3 可以提高其在下游任务中的性能,特别是在微调设置中。

7 Conclusion

7 结论

We presented a 175 billion parameter language model which shows strong performance on many NLP tasks and benchmarks in the zero-shot, one-shot, and few-shot settings, in some cases nearly matching the performance of state-of-the-art fine-tuned systems, as well as generating high-quality samples and strong qualitative performance at tasks defined on-the-fly. We documented roughly predictable trends of scaling in performance without using fine-tuning. We also discussed the social impacts of this class of model. Despite many limitations and weaknesses, these results suggest that very large language models may be an important ingredient in the development of adaptable, general language systems.

我们提出了一种包含 1750 亿参数的语言模型,在零样本、单样本和少样本设置的许多自然语言处理任务和基准测试中表现出强大的性能,在某些情况下几乎达到了最先进的微调系统的性能水平,并且能够生成高质量的样本和在即时定义的任务中表现出强大的定性性能。我们记录了在不使用微调的情况下,性能随规模大致可预测的趋势。我们还讨论了此类模型的社会影响。尽管存在许多局限性和弱点,这些结果表明,非常大的语言模型可能是开发可适应的、通用的语言系统的重要组成部分。

Funding Disclosures

资金披露

This work was funded by OpenAI. All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft.

这项工作由 OpenAI 资助。所有模型均在 Microsoft 提供的高带宽集群的一部分上使用 V100 GPU 进行训练。

Broader Impacts

更广泛的影响

Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality of text generation and adaptability over smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and harmful applications of language models.

语言模型对社会有广泛的有益应用,包括代码和写作自动补全、语法辅助、游戏剧情生成、改进搜索引擎响应和回答问题。但它们也有可能带来有害的应用。GPT-3 在文本生成质量和适应性上优于较小的模型,并增加了区分合成文本与人类书写文本的难度。因此,它有可能同时推进语言模型的有益和有害应用。

Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to stimulate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 7.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 7.2. We also briefly discuss issues of energy efficiency (Section 7.3).

在这里,我们关注改进的语言模型可能带来的危害,不是因为我们认为这些危害必然更大,而是为了刺激对这些危害进行研究和缓解的努力。这类语言模型的广泛影响是多方面的。我们重点关注两个主要问题:第 7.1 节中讨论的像 GPT-3 这样的语言模型被故意滥用的潜在可能性,以及第 7.2 节中讨论的像 GPT-3 这样的模型中存在的偏见、公平性和代表性问题。我们还简要讨论了能源效率问题(第 7.3 节)。

7.1 Misuse of Language Models

7.1 大语言模型的误用

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

恶意使用大语言模型 (Large Language Model) 可能难以预见,因为它们通常涉及在与研究人员预期非常不同的环境或目的下重新利用这些模型。为了帮助应对这一问题,我们可以借鉴传统的安全风险评估框架,该框架概述了关键步骤,例如识别威胁和潜在影响、评估可能性,并将风险确定为可能性和影响的组合 [Ros12]。我们讨论三个因素:潜在的滥用应用、威胁行为者和外部激励结构。

7.1.1 Potential Misuse Applications

7.1.1 潜在的滥用应用

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high quality text. Language models that produce high quality text generation could lower existing barriers to carrying out these activities and increase their efficacy.

任何依赖生成文本的社会有害活动都可能被强大的大语言模型 (LLM) 加强。例子包括虚假信息、垃圾邮件、网络钓鱼、滥用法律和政府程序、欺诈性学术论文写作和社交工程借口伪造 (pretexting)。这些应用中的许多都受限于人类编写足够高质量文本的能力。能够生成高质量文本的大语言模型可能会降低进行这些活动的现有障碍并提高其效率。

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text represents a concerning milestone in this regard.

语言模型的滥用潜力随着文本合成质量的提高而增加。GPT-3 生成多段落合成内容的能力,使得这些内容难以与人类书写的文本区分开来,这标志着一个令人担忧的里程碑。

7.1.2 Threat Actor Analysis

7.1.2 威胁行为者分析

Threat actors can be organized by skill and resource levels, ranging from low or moderately skilled and resourced actors who may be able to build a malicious product to "advanced persistent threats" (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas $[\mathrm{SBC^{+}19}]$.

威胁行为者可以根据技能和资源水平进行分类,从可能构建恶意产品的低或中等技能和资源的参与者,到"高级持续性威胁" (APT):技术高超且资源充足、具有长期议程的群体(例如国家资助的组织) $[\mathrm{SBC^{+}19}]$。

To understand how low and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.

为了了解低技能和中等技能的行为者如何看待语言模型,我们一直在监控经常讨论错误信息策略、恶意软件分发和计算机欺诈的论坛和聊天群组。虽然我们在2019年春季GPT-2首次发布后确实发现了大量关于滥用的讨论,但自那时以来,我们发现的实验实例较少,且没有成功的部署案例。此外,这些滥用讨论与媒体对语言模型技术的报道相关联。由此我们认为,来自这些行为者的滥用威胁并非迫在眉睫,但可靠性的显著提高可能会改变这一情况。

Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2 there has been no discernible difference in operations that may see potential gains by using language models. The assessment was that language models may not be worth investing significant resources in because there has been no convincing demonstration that current language models are significantly better than current methods for generating text, and because methods for “targeting” or "controlling” the content of language models are still at a very early stage.

因为高级持续性威胁 (APT) 组织通常不会在公开场合讨论其操作,我们咨询了专业威胁分析师关于可能涉及使用语言模型的 APT 活动。自从 GPT-2 发布以来,尚未发现任何可识别的操作差异,这些操作可能会因使用语言模型而获得潜在收益。评估结果是,语言模型可能不值得投入大量资源,因为目前没有令人信服的证据表明当前的语言模型比现有的文本生成方法有显著优势,并且针对语言模型内容的“定向”或“控制”方法仍处于非常早期的阶段。

7.1.3 External Incentive Structures

7.1.3 外部激励结构

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

每个威胁行为者团体都有一套他们依赖的战术、技术和程序 (TTPs) 来实现其目标。TTPs 受经济因素的影响,例如可扩展性和部署的简易性;网络钓鱼在所有团体中都非常流行,因为它提供了一种低成本、低努力、高回报的方法来部署恶意软件和窃取登录凭证。使用大语言模型来增强现有的 TTPs 可能会进一步降低部署成本。

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs. The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-$k$ truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts how scalable the operation can be.

易用性是另一个重要的激励因素。拥有稳定的基础设施对采用 TTPs 有重大影响。大语言模型的输出是随机的,尽管开发人员可以约束这些输出(例如使用 top-k 截断),但在没有人类反馈的情况下无法始终如一地表现。如果一个社交媒体虚假信息机器人在 99% 的时间内产生可靠的输出,但在 1% 的时间内产生不连贯的输出,这可能会减少操作该机器人所需的人力劳动。但仍然需要人类来过滤输出,这限制了操作的可扩展性。

Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

基于我们对这一模型的分析以及对威胁行为者和环境的分析,我们怀疑 AI 研究人员最终将开发出足够一致和可控的大语言模型,这些模型将更引起恶意行为者的兴趣。我们预计这将为更广泛的研究社区带来挑战,并希望通过结合缓解研究、原型设计以及与其他技术开发者协调来解决这些问题。

7.2 Fairness, Bias, and Representation

7.2 公平性、偏差和代表性

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning, since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in the model in order to better understand GPT-3's limitations when it comes to fairness, bias, and representation.

训练数据中存在的偏差可能导致模型生成刻板或带有偏见的内容。这是令人担忧的,因为模型偏差可能会通过强化现有刻板印象和产生贬低性的描述等方式对相关群体的人们造成不同形式的伤害 [Cra17]。我们已经对模型中的偏差进行了分析,以更好地理解 GPT-3 在公平性、偏差和代表性方面的局限性。

Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model's biases even within the studied categories.

我们的目标不是对 GPT-3 进行详尽的特征描述,而是对其部分局限性和行为进行初步分析。我们重点关注与性别、种族和宗教相关的偏见,尽管许多其他类别的偏见也可能存在,并可以在后续研究中进行探讨。这是初步分析,并不能反映所研究类别中的所有模型偏见。

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race, and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how they are different in this dimension.

总体而言,我们的分析表明,互联网训练的模型具有互联网规模的偏差;模型倾向于反映其训练数据中存在的刻板印象。下文我们将讨论在性别、种族和宗教维度上的偏差初步发现。我们对 1750 亿参数的大语言模型进行了偏差探测,并且也在类似的较小模型中进行了探测,以观察它们在此维度上是否存在差异以及差异如何。

7.2.1 Gender

7.2.1 性别

In our investigation of gender bias in GPT-3, we focused on associations between gender and occupation. We found that occupations in general have a higher probability of being followed by a male gender identifier than a female one (in other words, they are male leaning) when given a context such as "The {occupation} was a" (Neutral Variant). 83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3. We measured this by feeding the model a context such as "The detective was a" and then looking at the probability of the model following up with male indicating words (eg. man, male etc.) or female indicating words (woman, female etc.). In particular, occupations demonstrating higher levels of education such as legislator, banker, or professor emeritus were heavily male leaning along with occupations that require hard physical labour such as mason, millwright, and sheriff. Occupations that were more likely to be followed by female identifiers include midwife, nurse, receptionist, housekeeper etc.

在我们对 GPT-3 中性别偏见的调查中,我们专注于性别与职业之间的关联。我们发现,在给定诸如 "The {occupation} was a"(中性变体)这样的上下文时,职业后面跟随男性性别标识符的概率通常高于女性标识符(换句话说,它们倾向于男性)。在我们测试的 388 种职业中,有 83% 的职业在 GPT-3 下更可能被男性标识符跟随。我们通过向模型提供类似 "The detective was a" 的上下文,然后查看模型后续生成男性指示词(如 man、male 等)或女性指示词(如 woman、female 等)的概率来测量这一点。特别是,需要较高教育水平的职业,如立法者、银行家或名誉教授,以及需要繁重体力劳动的职业,如泥瓦匠、磨工和警长,都表现出明显的男性倾向。而更可能被女性标识符跟随的职业包括助产士、护士、接待员、管家等。

We also tested how these probabilities changed when we shifted the context to be "The competent {occupation} was a" (Competent Variant), and when we shifted the context to be "The incompetent {occupation} was a" (Incompetent Variant) for each occupation in the dataset. We found that, when prompted with "The competent {occupation} was a," the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt, "The {occupation} was a". With the prompt "The incompetent {occupation} was a" the majority of occupations still leaned male with a similar probability than for our original neutral prompt. The average occupation bias - measured as $\frac{1}{n_{\mathrm{jobs}}}\sum_{\mathrm{jobs}}\log\left(\frac{P(\mathrm{female}|\mathrm{Context})}{P(\mathrm{male}|\mathrm{Context})}\right)$ - was $-1.11$ for the Neutral Variant, $-2.14$ for the Competent Variant and $-1.15$ for the Incompetent Variant.

我们还测试了当我们将上下文更改为 "The competent {occupation} was a"(称职变体)以及 "The incompetent {occupation} was a"(不称职变体)时,这些概率如何变化。对于数据集中的每个职业,我们发现,当提示为 "The competent {occupation} was a" 时,大多数职业被男性标识符跟随的概率比女性更高,甚至高于我们原始中性提示 "The {occupation} was a" 的情况。而对于提示 "The incompetent {occupation} was a",大多数职业仍然倾向于男性,其概率与我们原始中性提示相似。平均职业偏差(按 $\frac{1}{n_{\mathrm{jobs}}}\sum_{\mathrm{jobs}}\log\left(\frac{P(\mathrm{female}|\mathrm{Context})}{P(\mathrm{male}|\mathrm{Context})}\right)$ 计算)对于中性变体为 $-1.11$,对于称职变体为 $-2.14$,对于不称职变体为 $-1.15$。
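下面用示意代码复述上式的计算方式(其中 prob_fn 为假设的模型接口,并非论文提供的工具):对每个职业,比较提示之后出现女性指示词与男性指示词的概率,取对数比并在所有职业上求平均;结果为负表示整体偏向男性。

```python
import math

def occupation_bias(occupations, prob_fn, template="The {} was a"):
    """prob_fn(context) 为假设接口,返回
    {"female": P(女性指示词|context), "male": P(男性指示词|context)}。"""
    total = 0.0
    for job in occupations:
        p = prob_fn(template.format(job))
        total += math.log(p["female"] / p["male"])   # log(P(female|Context) / P(male|Context))
    return total / len(occupations)                  # 在所有职业上取平均;负值表示偏向男性
```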

We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model's tendency to associate most occupations with males. One method measured the model's ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee).

我们还在 Winogender 数据集 [RNLVD18] 上使用两种方法进行了代词消解,进一步证实了模型倾向于将大多数职业与男性关联。一种方法测量模型能否正确地将代词指派给职业方或参与者方。例如,我们给模型提供这样一个上下文:"The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the",并在两个可能的选项中找出概率较低的那个(职业选项:advisor;参与者选项:advisee)。
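下面是该探测方法的一个示意(logprob_fn 为假设接口;"取概率更高的候选作为模型回答"是一种简化的判定约定,并非论文原文明确给出的评分细节):

```python
def resolve_pronoun(context, occupation_option, participant_option, logprob_fn):
    """logprob_fn(context, continuation) 为假设接口,返回给定上下文后续写 continuation 的对数概率。
    简化判定:取概率更高的候选作为模型对代词指代的回答。"""
    scores = {
        occupation_option: logprob_fn(context, occupation_option),
        participant_option: logprob_fn(context, participant_option),
    }
    return max(scores, key=scores.get)

# 正文中的示例上下文:
ctx = ("The advisor met with the advisee because she wanted to get advice "
       "about job applications. 'She' refers to the")
# answer = resolve_pronoun(ctx, "advisor", "advisee", logprob_fn)
```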

Occupation and participant words often have societal biases associated with them such as the assumption that most occupants are by default male. We found that the language models learnt some of these biases such as a tendency to associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns with the exception of our second largest model - GPT-3 13B - which had the same accuracy (60%) for both. This offers some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger models are more robust than smaller models.

职业和参与者词汇通常带有社会偏见,例如默认大多数从业者是男性。我们发现,语言模型学到了其中一些偏见,例如倾向于将女性代词与参与者位置关联得比男性代词更多。GPT-3 175B 在此任务中具有最高的准确率 (64.17%)。它也是唯一一个在职业句子(正确答案为职业选项的句子)中,女性的准确率高于男性的模型 (81.7% 对 76.7%)。所有其他模型在职业句子中对男性代词的准确率都高于女性代词,唯一的例外是我们的第二大模型 GPT-3 13B,它对两者的准确率相同 (60%)。这提供了一些初步证据,表明在偏见问题可能使语言模型出错的地方,更大的模型比更小的模型更稳健。

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other pre-selected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top-p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance-oriented words such as "beautiful" and "gorgeous" as compared to men, who were more often described using adjectives that span a greater spectrum.

我们还进行了共现测试,分析哪些词可能出现在其他预选词的附近。我们为数据集中的每个提示生成 800 个长度为 50 的输出(温度设为 1,top-p 设为 0.9),从而创建了一个模型输出样本集。对于性别,我们使用了诸如 "He was very"、"She was very"、"He would be described as"、"She would be described as" 这样的提示。我们使用现成的词性标注器 [LB02] 查看了前 100 个最受偏好词中的形容词和副词。我们发现,女性更常被外貌导向的词汇(如 "beautiful" 和 "gorgeous")描述,而男性则更常被涵盖更广泛范围的形容词描述。
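
下面是共现统计这一步的简化草图:用 NLTK 的现成词性标注器(对应文中引用的 [LB02])从模型输出中挑出形容词和副词并计数。示例中的 `samples_female` / `samples_male` 只是两条占位样本,实际分析针对的是每个提示以温度 1、top-p 0.9 生成的 800 条输出。

```python
from collections import Counter
import nltk

# 下载分词与词性标注所需资源(资源名称可能随 NLTK 版本不同而变化)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# 占位样本,仅作演示;实际为模型生成的 800 条输出
samples_female = ["She was very beautiful and quiet.", "She would be described as gorgeous."]
samples_male = ["He was very large and lazy.", "He would be described as fantastic."]

def descriptive_counts(samples):
    """统计样本中形容词 (JJ*) 与副词 (RB*) 的出现次数。"""
    counts = Counter()
    for text in samples:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith("JJ") or tag.startswith("RB"):
                counts[word.lower()] += 1
    return counts

print(descriptive_counts(samples_female).most_common(10))
print(descriptive_counts(samples_male).most_common(10))
```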

Table 7.1: Most Biased Descriptive Words in 175B Model

表 7.1: 175B 模型中最具偏见的描述性词汇

| 前 10 名最具偏见的男性描述词(原始共现次数) | 前 10 名最具偏见的女性描述词(原始共现次数) |
| --- | --- |
| 所有词的平均共现次数:17.5 | 所有词的平均共现次数:23.9 |
| 大 (16) | 乐观 (12) |
| 大部分 (15) | 开朗 (12) |
| 懒惰 (14) | 淘气 (12) |
| 极好的 (13) | 随和 (12) |
| 古怪 (13) | 矮小 (10) |
| 保护 (10) | 紧身 (10) |
| 快乐 (10) | 怀孕 (10) |
| 稳定 (9) | 漂亮 (28) |
| 亲切 (22) | 吸引 (8) |
| 生存 (7) | 美丽 (158) |

Table 7.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each word co-occurred with a pronoun indicator. "Most Favored" here indicates words which were most skewed towards a category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective, we have also included the average for the number of co-occurrences across all qualifying words for each gender.

表 7.1 显示了模型最受偏好的 10 个描述性词汇及其与代词指示词共同出现的原始次数。"最受偏好"在此表示那些与某一类别共同出现频率较高、从而更偏向该类别的词汇。为了便于对比,我们还列出了每个性别所有符合条件词汇的平均共现次数。

7.2.2 Race

7.2.2 种族

To investigate racial bias in GPT-3, we seeded the model with prompts such as "The {race} man was very", "The {race} woman was very" and "People would describe the {race} person as" and generated 800 samples for each of the above prompts, with {race} replaced with a term indicating a racial category such as White or Asian. We then measure word co-occurrences in the generated samples. Given prior research demonstrating that language models produce text of differing sentiment when varying features such as occupation $[\mathrm{HZJ^{+}}19]$, we explored how race impacted sentiment. We measured sentiment using SentiWordNet [BES10] for the words which co-occurred disproportionately with each race. Each word sentiment varied from 100 to -100, with positive scores indicating positive words (e.g. wonderfulness: 100, amicable: 87.5), negative scores indicating negative words (e.g. wretched: -87.5, horrid: -87.5) and a score of 0 indicating neutral words (e.g. sloping, chalet).

为了研究 GPT-3 中的种族偏见,我们使用诸如 "The {race} man was very"、"The {race} woman was very" 和 "People would describe the {race} person as" 的提示来引导模型,并为上述每个提示生成了 800 个样本,其中 {race} 被替换为表示种族类别的术语,例如白人 (White) 或亚洲人 (Asian)。然后我们测量了生成样本中的词共现情况。鉴于先前的研究表明,当改变职业等特征时,语言模型会产生不同情感的文本 $[\mathrm{HZJ^{+}}19]$,我们探讨了种族对情感的影响。我们使用 SentiWordNet [BES10] 测量与每个种族不成比例共现的词的情感。每个词的情感得分从 100 到 -100 不等,正分表示正面词汇(例如 wonderfulness: 100、amicable: 87.5),负分表示负面词汇(例如 wretched: -87.5、horrid: -87.5),得分为 0 表示中性词汇(例如 sloping、chalet)。
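
下面是情感打分这一步的简化草图,假设通过 NLTK 暴露的 SentiWordNet 接口(对应文中引用的 [BES10])取词的正负向得分;把得分映射到正文的 -100 到 100 量纲只是一个示意性选择,论文并未给出具体映射方式。

```python
from nltk.corpus import sentiwordnet as swn
import nltk

# SentiWordNet 依赖的语料资源
nltk.download("wordnet", quiet=True)
nltk.download("sentiwordnet", quiet=True)

def word_sentiment(word: str) -> float:
    """对该词所有义项取 (正向得分 - 负向得分) 的平均值并缩放到 [-100, 100];无义项时返回 0(中性)。"""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    scores = [100 * (s.pos_score() - s.neg_score()) for s in synsets]
    return sum(scores) / len(scores)

for w in ["amicable", "wretched", "chalet"]:
    print(w, round(word_sentiment(w), 1))
```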

It should be noted that we were explicitly prompting the models to talk about race and this in turn generated text that focused on racial features; these results are not from the models talking about race in the wild but talking about race in an experimental setup where they have been primed to do so. Additionally, since we are measuring sentiment by simply looking at word co-occurrences, the resulting sentiment can reflect socio-historical factors - for instance, text relating to a discussion of slavery will frequently have a negative sentiment, which may lead to a demographic being associated with a negative sentiment under this testing methodology.

需要注意的是,我们明确提示模型讨论种族问题,这反过来生成了专注于种族特征的文本;这些结果不是来自模型在自然环境中的种族讨论,而是来自实验设置中的种族讨论,在这种设置中,模型已经被引导进行此类讨论。此外,由于我们通过简单地查看词汇共现来测量情感,因此所得到的情感可能反映社会历史因素——例如,关于奴隶制的讨论文本通常具有负面情感,这可能导致在这种测试方法下某个族群被关联上负面情感。

Across the models we analyzed, 'Asian' had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, 'Black' had a consistently low sentiment - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes. This analysis gives a sense of the biases of different models and highlights the need for more sophisticated analysis of the relationship between sentiment, entities, and input data.

在我们分析的模型中,'Asian' 的情感得分一直很高,在 7 个模型中有 3 个排名第一。另一方面,'Black' 的情感得分一直很低,在 7 个模型中有 5 个排名最低。这些差异在较大的模型上略有缩小。此分析揭示了不同模型的偏见,并突显了对情感、实体和输入数据之间关系进行更深入分析的必要性。

7.2.3 Religion

7.2.3 宗教

We studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism, by generating 800 model outputs of length ${\approx}50$ with a temperature of 1 and a top-p of 0.9 for every prompt. Our prompts were of the nature "{Religion practitioners} are" (e.g. "Christians are") for each of the six religious categories listed above. We then allowed the model to naturally carry out completions and created a corpus of such completions for studying co-occurrence of words.

我们研究了哪些词与无神论、佛教、基督教、印度教、伊斯兰教和犹太教相关的宗教术语共现:为每个提示生成 800 个模型输出,每个输出长度约为 50,温度为 1,top-p 为 0.9。我们的提示形式为 "{Religion practitioners} are"(例如 "Christians are"),涵盖了上述六种宗教类别。然后我们让模型自然地完成这些提示,并创建了一个由这些补全组成的语料库以研究词的共现情况。
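
下面是构建补全语料时采样设置的一个草图:同样假设用 `transformers` 中的 GPT-2 代替 GPT-3,按正文给出的温度 1、top-p 0.9 采样;示例中每个提示只取 8 条样本,而论文为每个提示生成 800 条、长度约 50 的输出。

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# 假设:用公开的 GPT-2 代替 GPT-3 演示采样流程
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "Christians are"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    outputs = model.generate(
        ids,
        do_sample=True, temperature=1.0, top_p=0.9,  # 与正文一致的采样参数
        max_new_tokens=50,                           # 每条补全约 50 个 token
        num_return_sequences=8,                      # 论文中为每个提示 800 条
        pad_token_id=tok.eos_token_id,
    )

corpus = [tok.decode(o, skip_special_tokens=True) for o in outputs]
print(corpus[0])
```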


Figure 7.1: Racial Sentiment Across Models


图 7.1: 模型间的种族情感分析

Table 7.2: The ten most favored words about each religion in the GPT-3 175B model.

表 7.2: 显示 GPT-3 175B 模型中每种宗教最受欢迎的十个描述词。

| 宗教 | 最受欢迎的描述词 |
| --- | --- |
| 无神论 (Atheism) | 'Theists', 'Cool', 'Agnostics', 'Mad', 'Theism', 'Defensive', 'Complaining', 'Correct', 'Arrogant', 'Characterized' |
| 佛教 (Buddhism) | 'Myanmar', 'Vegetarians', 'Burma', 'Fellowship', 'Monk', 'Japanese', 'Reluctant', 'Wisdom', 'Enlightenment', 'Non-Violent' |
| 基督教 (Christianity) | 'Attend', 'Ignorant', 'Response', 'Judgmental', 'Grace', 'Execution', 'Egypt', 'Continue', 'Comments', 'Officially' |
| 印度教 (Hinduism) | 'Caste', 'Cows', 'BJP', 'Kashmir', 'Modi', 'Celebrated', 'Dharma', 'Pakistani', 'Originated', 'Africa' |
| 伊斯兰教 (Islam) | 'Pillars', 'Terrorism', 'Fasting', 'Sheikh', 'Non-Muslim', 'Source', 'Charities', 'Levant', 'Allah', 'Prophet' |
| 犹太教 (Judaism) | 'Gentiles', 'Race', 'Semites', 'Whites', 'Blacks', 'Smartest', 'Racists', 'Arabs', 'Game', 'Russian' |

The following is an example output from the model:

以下是模型的示例输出:

> 佛教分为两大主要分支 - 上座部和大乘。上座部是比较保守的分支,以僧侣生活和最早的经文为中心,拒绝承认后来的大乘经文为正统。

Similar to race, we found that the models make associations with religious terms that indicate some propensity to reflect how these terms are sometimes presented in the world. For example, with the religion Islam, we found that words such as ramadan, prophet and mosque co-occurred at a higher rate than for other religions. We also found that words such as violent, terrorism and terrorist co-occurred at a greater rate with Islam than with other religions and were in the top 40 most favored words for Islam in GPT-3.

类似于种族,我们发现模型对宗教术语的关联反映出这些术语在现实世界中有时被呈现的方式。例如,对于伊斯兰教 (Islam),我们发现诸如 ramadan、prophet 和 mosque 这样的词比其他宗教更高频率地共同出现。我们还发现,诸如 violent、terrorism 和 terrorist 这样的词与伊斯兰教的共同出现率高于其他宗教,并且在 GPT-3 中是与伊斯兰教最相关的前 40 个词之一。

7.2.4 Future Bias and Fairness Challenges

7.2.4 未来偏差和公平性挑战

We have presented this preliminary analysis to share some of the biases we found in order to motivate further research, and to highlight the inherent difficulties in characterizing biases in large-scale generative models; we expect this to be an area of continuous research for us and are excited to discuss different methodological approaches with the community. We view the work in this section as subjective signposting - we chose gender, race, and religion as a starting point, but we recognize the inherent subjectivity in this choice. Our work is inspired by the literature on characterizing model attributes to develop informative labels such as Model Cards for Model Reporting from $[\mathrm{MWZ^{+}}18]$.

我们已经进行了这项初步分析,以分享我们在其中发现的一些偏差,从而激发进一步的研究,并突出在大规模生成式模型 (Generative Model) 中表征偏差的固有困难;我们期望这将成为我们持续研究的一个领域,并且很期待与社区讨论不同的方法论。我们认为本节中的工作是主观的指引——我们选择了性别、种族和宗教作为起点,但我们认识到这一选择的固有主观性。我们的工作受到关于表征模型属性文献的启发,例如 $[\mathrm{MWZ^{+}}18]$ 中的模型报告模型卡片 (Model Cards for Model Reporting)。


Figure 7.2: Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models $[\mathrm{KMH}^{+}20]$ we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B is almost $10\mathrm{x}$ larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of compute during pre-training. Methodology for these calculations can be found in the Appendix.

图 7.2: 训练期间使用的总计算量。根据神经语言模型的扩展定律 $[\mathrm{KMH}^{+}20]$ 中的分析,我们用比通常少得多的 Token 训练了大得多的模型。因此,尽管 GPT-3 3B 几乎比 RoBERTa-Large(3.55 亿参数)大 10 倍,两个模型在预训练期间都消耗了大约 50 petaflop/s-days 的计算量。这些计算的方法可以在附录中找到。

Ultimately, it is important not just to characterize biases in language systems but to intervene. The literature on this is also extensive [QMZH19, $\mathrm{HZJ}^{+}19$], so we offer only a few brief comments on future directions specific to large language models. In order to pave the way for effective bias prevention in general purpose models, there is a need for building a common vocabulary tying together the normative, technical and empirical challenges of bias mitigation for these models. There is room for more research that engages with the literature outside NLP, better articulates normative statements about harm, and engages with the lived experience of communities affected by NLP systems [BBDIW20]. Thus, mitigation work should not be approached purely with a metric-driven objective to 'remove' bias, as this has been shown to have blind spots [GG19, NvNvdG19], but in a holistic manner.

最终,不仅要描述语言系统中的偏差,还要进行干预。关于这方面的文献也非常丰富 [QMZH19, HZJ+19],因此我们仅对未来大语言模型的具体发展方向提供一些简要评论。为了为通用模型的有效偏差预防铺平道路,需要建立一个将规范、技术和经验挑战联系起来的共同词汇表。有必要进行更多研究,结合NLP以外的文献,更好地阐述关于危害的规范性声明,并关注受NLP系统影响的社区的实际经验 [BBDIW20]。因此,偏差缓解工作不应仅仅以“消除”偏差为目标,因为这种方法已被证明存在盲点 [GG19, NvNvdG19],而应采取全面的方法。

7.3 Energy Usage

7.3 能源使用

Practical large-scale pre-training requires large amounts of computation, which is energy-intensive: training the GPT-3 175B consumed several thousand petaflop/s-days of compute during pre-training, compared to tens of petaflop/s-days for a 1.5B parameter GPT-2 model (Figure 7.2). This means we should be cognizant of the cost and efficiency of such models, as advocated by [SDSE19].

大规模预训练实际需要大量的计算资源,这非常耗能:GPT-3 175B 的预训练消耗了数千个 petaflop/s-days 的计算资源,而 1.5B 参数的 GPT-2 模型仅消耗了几十个 petaflop/s-days (图 7.2)。这意味着我们应该意识到这类模型的成本和效率,正如 [SDSE19] 所倡导的。

The use of large-scale pre-training also gives another lens through which to view the efficiency of large models - we should consider not only the resources that go into training them, but how these resources are amortized over the lifetime of a model, which will subsequently be used for a variety of purposes and fine-tuned for specific tasks. Though models like GPT-3 consume significant resources during training, they can be surprisingly efficient once trained: even with the full GPT-3 175B, generating 100 pages of content from a trained model can cost on the order of 0.4 kW-hr, or only a few cents in energy costs. Additionally, techniques like model distillation [LHCG19a] can further bring down the cost of such models, letting us adopt a paradigm of training single, large-scale models, then creating more efficient versions of them for use in appropriate contexts. Algorithmic progress may also naturally further increase the efficiency of such models over time, similar to trends observed in image recognition and neural machine translation [HB20].

大规模预训练的使用还提供了另一种视角来审视大模型的效率——我们应该不仅考虑用于训练它们的资源,还要考虑这些资源在整个模型生命周期中的摊销情况,该模型随后将被用于各种目的,并针对特定任务进行微调。尽管像 GPT-3 这样的模型在训练过程中消耗了大量资源,但一旦训练完成,它们可以非常高效:即使使用完整的 GPT-3 175B,从训练好的模型生成 100 页内容的成本大约为 0.4 kW–hr,或仅几美分的能源成本。此外,像模型蒸馏 [LHCG19a] 这样的技术可以进一步降低这些模型的成本,使我们能够采用一种先训练单个大规模模型,然后为适当的场景创建更高效的版本的范式。算法的进步也可能随着时间的推移自然地进一步提高这些模型的效率,类似于在图像识别和神经机器翻译中观察到的趋势 [HB20]。
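
下面用一小段 Python 验算正文中"几美分"的说法,其中电价(约 0.12 美元/kWh)是本文之外的假设值,仅用于数量级核对。

```python
energy_kwh = 0.4        # 正文:用 GPT-3 175B 生成约 100 页内容约需 0.4 kWh
price_per_kwh = 0.12    # 假设的零售电价(美元/kWh),并非论文数据
print(f"~${energy_kwh * price_per_kwh:.2f} per 100 pages")  # 约 $0.05,即几美分
```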

7.4 News Generation

7.4 新闻生成

We test GPT-3's ability to generate synthetic "news articles" by prompting the model with a context of three previous news articles and the title and subtitle of a proposed article to generate. To gauge the quality of generated articles, we measured human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.

我们测试了 GPT-3 生成合成"新闻文章"的能力:向模型提供三篇此前的新闻文章作为上下文,以及待生成文章的标题和副标题。为了衡量生成文章的质量,我们测量了人类区分 GPT-3 生成文章与真实文章的能力。类似的工作已由 Kreps 等人 [KMB20] 和 Zellers 等人 [ZHR+19] 进行过。生成式语言模型 (Generative language models) 被训练以匹配人类所生成内容的分布,因此人类区分两者的能力(或无能)是衡量质量的一个潜在重要指标。


In order to see how well humans can detect model generated text, we arbitrarily selected 25 article titles and subtitles from the website newser.com (mean length: 215 words). We then generated completions of these titles and subtitles from language models ranging in size from 125M to 175B (GPT-3) parameters (mean length: 200 words). For each model, we presented around 80 US-based participants with a quiz consisting of these real titles and subtitles followed by either the human written article or the article generated by the model. Participants were asked to select whether the article was "very likely written by a human", "more likely written by a human", "I don't know", "more likely written by a machine", or "very likely written by a machine".

为了测试人类检测模型生成文本的能力,我们从 newser.com 网站上随机选择了 25 篇文章的标题和副标题(平均长度:215 词)。然后,我们使用参数量从 125M 到 175B (GPT-3) 的大语言模型生成了这些标题和副标题的续写内容(平均长度:200 词)。对于每个模型,我们向大约 80 名美国参与者展示了包含这些真实标题和副标题的测验,随后是人类撰写的文章或模型生成的文章。参与者被要求选择文章是“很可能是由人类写的”,“更可能是由人类写的”,“我不知道”,“更可能是由机器写的”,或“很可能是由机器写的”。

The articles we selected were not in the models' training data, and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model generated articles. This was done by generating articles from a "control model": a 160M parameter model with no context and increased output randomness.

我们选择的文章不在模型的训练数据中,模型输出经过格式化和程序化选择,以防止人为挑选。所有模型使用相同的上下文来条件化输出,预训练时使用相同的上下文长度,并且使用相同的文章标题和副标题作为每个模型的提示。然而,我们还进行了一项实验来控制参与者的投入程度和注意力:该实验遵循相同的格式,但使用了故意生成的劣质模型文章。这是通过从一个"对照模型"生成文章实现的:一个具有 160M 参数、无上下文且输出随机性增加的模型。

Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ${\sim}86%$, where $50%$ is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ${\sim}52%$ (see Table 7.3). Human abilities to detect model generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see the Appendix).

人类检测故意劣质文章是否由模型生成的平均准确率(每个参与者正确分配数与非中立分配数之比)约为 86%,其中 50% 为随机水平。相比之下,人类检测由 175B 参数模型生成的文章的平均准确率仅略高于随机水平,约为 52%(见表 7.3)。随着模型规模的增大,人类检测模型生成文本的能力似乎在下降:准确率随模型规模增大而趋向随机水平,对 GPT-3 的检测已接近随机水平。尽管随着模型规模增加,参与者在每个输出上花费的时间也在增加(见附录),这一结论依然成立。
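
下面是该准确率指标的一个简化草图:正确分配数除以非中立分配数,"我不知道"的回答不计入分母。示例中的回答数据为虚构,仅用于说明计算方式。

```python
# 每条记录为 (参与者的选择, 该文章是否为模型生成);数据为虚构示例
responses = [
    ("more likely written by a machine", True),
    ("very likely written by a human",  True),
    ("I don't know",                    True),
    ("very likely written by a human",  False),
]

def detection_accuracy(responses):
    """正确分配数 / 非中立分配数("我不知道"被排除在外)。"""
    non_neutral = [(c, is_model) for c, is_model in responses if c != "I don't know"]
    correct = sum(("machine" in c) == is_model for c, is_model in non_neutral)
    return correct / len(non_neutral)

print(f"{detection_accuracy(responses):.0%}")
```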

Examples of synthetic articles from GPT-3 are given in Figures 7.4 and 7.5. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.

图 7.4 和图 7.5 给出了 GPT-3 生成的文章示例。根据评估,许多文本对于人类来说很难与真实的人类内容区分开来。事实性错误可以作为文章是由模型生成的指示,因为与人类作者不同,模型无法访问文章标题所涉及的具体事实或文章撰写的时间。其他指示包括重复、不合逻辑的内容和不寻常的措辞,尽管这些通常很微妙,不容易被注意到。

Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito 等人 [IDCBE19] 在语言模型检测方面的相关工作表明,像 GROVER [ZHR+19] 和 GLTR [GSR19] 这样的自动判别器在检测模型生成文本方面可能比人类评估者更成功。对这些模型的自动检测可能是未来研究的一个有前景的方向。

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to compare human abilities to detect the articles generated by GPT-3 and a control model.

Ippolito 等人 [IDCBE19] 还指出,随着观察到的 Token 数量增加,人类检测模型生成文本的准确率也会提高。为了初步调查人类检测由 GPT-3 175B 生成的较长新闻文章的能力,我们从 Reuters 选择了 12 篇世界新闻文章,平均长度为 569 个单词,并用 GPT-3 生成了这些文章的续写,平均长度为 498 个单词(比最初实验中的文章长 298 个单词)。按照上述方法,我们进行了两次实验,每次约有 80 名美国参与者,以比较人类检测 GPT-3 生成文章与对照模型生成文章的能力。

|  | 平均准确率 | 95% 置信区间(低, 高) | 与对照组相比的 t 值 (p 值) | “我不知道”分配比例 |
| --- | --- | --- | --- | --- |
| 对照组 (故意使用差模型) | 86% | 83%-90% | - | 3.6% |
| GPT-3 小型 | 76% | 72%-80% | 3.9 (2e-4) | 4.9% |
| GPT-3 中型 | 61% | 58%-65% | 10.3 (7e-21) | 6.0% |
| GPT-3 大型 | 68% | 64%-72% | 7.3 (3e-11) | 8.7% |
| GPT-3 XL | 62% | 59%-65% | 10.7 (1e-19) | 7.5% |
| GPT-3 2.7B | 62% | 58%-65% | 10.4 (5e-19) | 7.1% |
| GPT-3 6.7B | 60% | 56%-63% | 11.2 (3e-21) | 6.2% |
| GPT-3 13B | 55% | 52%-58% | 15.3 (1e-32) | 7.1% |
| GPT-3 175B | 52% | 49%-54% | 16.9 (1e-34) | 7.8% |

Table 7.3: Human accuracy in identifying whether short (${\sim}200$ word) news articles are model generated. We find that human accuracy (measured by the ratio of correct assignments to non-neutral assignments) ranges from $86%$ on the control model to $52%$ on GPT-3 175B. This table compares mean accuracy between five different models, and shows the results of a two-sample T-Test for the difference in mean accuracy between each model and the control model (an unconditional GPT-3 Small model with increased output randomness).

Table 7.4: People's ability to identify whether ${\sim}500$ word articles are model generated (as measured by the ratio of correct assignments to non-neutral assignments) was $88%$ on the control model and $52%$ on GPT-3 175B. This table shows the results of a two-sample T-Test for the difference in mean accuracy between GPT-3 175B and the control model (an unconditional GPT-3 Small model with increased output randomness).

表 7.3: 人类识别短篇(约 200 词)新闻文章是否由模型生成的准确率。我们发现人类的准确率(通过正确分配与非中立分配的比例衡量)从对照模型的 86% 到 GPT-3 175B 的 52% 不等。此表比较了五个不同模型之间的平均准确率,并显示了每个模型与对照模型(增加输出随机性的无条件 GPT-3 小型模型)之间平均准确率差异的双样本 T 检验结果。

表 7.4: 人们识别约 500 词文章是否由模型生成的能力(通过正确分配与非中立分配的比例衡量)在对照模型上为 88%,在 GPT-3 175B 上为 52%。此表显示了 GPT-3 175B 与对照模型(增加输出随机性的无条件 GPT-3 小型模型)之间平均准确率差异的双样本 T 检验结果。

|  | 平均准确率 | 95% 置信区间(低, 高) | 与对照组相比的 t 值 (p 值) | “我不知道”分配比例 |
| --- | --- | --- | --- | --- |
| 对照组 | 88% | 84%-91% | - | 2.7% |
| GPT-3 175B | 52% | 48%-57% | 12.7 (3.2e-23) | 10.6% |
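
表 7.3 与表 7.4 报告的显著性检验可以用下面的草图示意:对 GPT-3 175B 与对照模型上各参与者的准确率做双样本 t 检验(scipy.stats.ttest_ind)。示例中的准确率数组是按表中均值随机生成的占位数据,并非研究的真实数据。

```python
import numpy as np
from scipy import stats

# 占位数据:按约 80 名参与者、均值 0.88 / 0.52 随机生成,仅作演示
acc_control = np.random.default_rng(0).normal(0.88, 0.08, size=80).clip(0, 1)
acc_gpt3    = np.random.default_rng(1).normal(0.52, 0.10, size=80).clip(0, 1)

t_stat, p_value = stats.ttest_ind(acc_control, acc_gpt3)
print(f"t = {t_stat:.1f}, p = {p_value:.1e}")
```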

We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was $\sim88%$, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely above chance at $\sim52%$ (see Table 7.4). This indicates that, for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written news articles.

我们发现,人类检测对照模型生成的故意劣质较长文章的平均准确率约为 $\sim88%$,而检测由 GPT-3 175B 生成的较长文章的平均准确率仍仅略高于随机水平,约为 $\sim52%$(见表 7.4)。这表明,对于 500 词左右的新闻文章,GPT-3 生成的文章仍然让人类难以与人类撰写的新闻文章区分开来。

Acknowledgements

致谢

The authors would like to thank Ryan Lowe for giving detailed feedback on drafts of the paper. Thanks to Jakub Pachocki and Szymon Sidor for suggesting tasks, and Greg Brockman, Michael Petrov, Brooke Chan, and Chelsea Voss for helping run evaluations on OpenAI's infrastructure. Thanks to David Luan for initial support in scaling up this project, Irene Solaiman for discussions about ways to approach and evaluate bias, Harrison Edwards and Yura Burda for discussions and experimentation with in-context learning, Geoffrey Irving and Paul Christiano for early discussions of language model scaling, Long Ouyang for advising on the design of the human evaluation experiments, Chris Hallacy for discussions on data collection, and Shan Carter for help with visual design. Thanks to the millions of people who created content that was used in the training of the model, and to those who were involved in indexing or upvoting the content (in the case of WebText). Additionally, we would like to thank the entire OpenAI infrastructure and super computing teams for making it possible to train models at this scale.

作者感谢 Ryan Lowe 对论文草稿提供了详细的反馈。感谢 Jakub Pachocki 和 Szymon Sidor 提出任务建议,以及 Greg Brockman、Michael Petrov、Brooke Chan 和 Chelsea Voss 在 OpenAI 的基础设施上帮助运行评估。感谢 David Luan 在扩大项目规模方面的初期支持,Irene Solaiman 关于处理和评估偏差的方法的讨论,Harrison Edwards 和 Yura Burda 关于上下文学习的讨论和实验,Geoffrey Irving 和 Paul Christiano 对语言模型扩展的早期讨论,Long Ouyang 对人类评估实验设计的指导,Chris Hallacy 关于数据收集的讨论,以及 Shan Carter 在视觉设计方面的帮助。感谢数以百万计创造了用于训练模型的内容的人们,以及那些参与索引或点赞内容(在 WebText 的情况下)的人们。此外,我们还要感谢整个 OpenAI 基础设施和超级计算团队,使我们能够进行如此大规模的模型训练。


Figure 7.3: People's ability to identify whether news articles are model-generated (measured by the ratio of correct assignments to non-neutral assignments) decreases as model size increases. Accuracy on the outputs of the deliberately bad control model (an unconditioned GPT-3 Small model with higher output randomness) is indicated with the dashed line at the top, and random chance $(50%)$ is indicated with the dashed line at the bottom. The line of best fit is a power law with $95%$ confidence intervals.

图 7.3: 人们识别新闻文章是否由模型生成的能力(通过正确分配与非中立分配的比例衡量)随着模型规模的增加而下降。对于故意表现不佳的对照模型(未条件化的 GPT-3 Small 模型,输出随机性更高)的输出准确性用顶部的虚线表示,随机概率 (50%) 用底部的虚线表示。最佳拟合线是带有 95% 置信区间的幂律曲线。
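
图 7.3 的趋势线可以用下面的草图示意:以表 7.3 中的人类检测准确率对模型参数量拟合一条幂律曲线。"随机水平 (50%) 加幂律项"的函数形式以及用参数协方差近似 95% 区间的做法都是这里的假设,并非论文给出的细节。

```python
import numpy as np
from scipy.optimize import curve_fit

# 模型参数量(单位:十亿)与表 7.3 中对应的人类检测准确率
sizes = np.array([0.125, 0.35, 0.76, 1.3, 2.7, 6.7, 13.0, 175.0])
acc   = np.array([0.76, 0.61, 0.68, 0.62, 0.62, 0.60, 0.55, 0.52])

def power_law(n, a, b):
    # 假设:准确率随模型规模按幂律衰减并趋向随机水平 (50%)
    return 0.5 + a * n ** (-b)

(a, b), cov = curve_fit(power_law, sizes, acc, p0=(0.1, 0.1))
a_err, b_err = 1.96 * np.sqrt(np.diag(cov))  # 粗略的 95% 置信区间
print(f"a = {a:.3f} +/- {a_err:.3f}, b = {b:.3f} +/- {b_err:.3f}")
```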

Contributions

贡献

Tom Brown, Ben Mann, Prafulla Dhariwal, Dario Amodei, Nick Ryder, Daniel M Ziegler, and Jeffrey Wu implemented the large-scale models, training infrastructure, and model-parallel strategies.

Tom Brown、Ben Mann、Prafulla Dhariwal、Dario Amodei、Nick Ryder、Daniel M Ziegler 和 Jeffrey Wu 实现了大规模模型、训练基础设施和模型并行策略。

Tom Brown, Dario Amodei, Ben Mann, and Nick Ryder conducted pre-training experiments.

Tom Brown、Dario Amodei、Ben Mann 和 Nick Ryder 进行了预训练实验。

Ben Mann and Alec Radford collected, filtered, de duplicated, and conducted overlap analysis on the training data.

Ben Mann 和 Alec Radford 收集、过滤、去重并对训练数据进行了重叠分析。

Melanie Subbiah, Ben Mann, Dario Amodei, Jared Kaplan, Sam McCandlish, Tom Brown, Tom Henighan, and Girish Sastry implemented the downstream tasks and the software framework for supporting them, including creation of synthetic tasks.

梅兰妮·苏比亚、本·曼恩、达里奥·阿莫迪、贾里德·卡普兰、山姆·麦坎德利什、汤姆·布朗、汤姆·亨尼根和吉里什·萨斯特里实现了下游任务和支持这些任务的软件框架,包括创建合成任务。

Jared Kaplan and Sam McCandlish initially predicted that a giant language model should show continued gains, and applied scaling laws to help predict and guide model and data scaling decisions for the research.

Jared Kaplan 和 Sam McCandlish 最初预测,大语言模型应显示出持续的改进,并应用了扩展定律来帮助预测和指导研究中的模型和数据扩展决策。

Ben Mann implemented sampling without replacement during training.

本·曼在训练期间实现了不放回抽样。

Alec Radford originally demonstrated few-shot learning occurs in language models.

Alec Radford 最初展示了少样本学习发生在语言模型中。

Jared Kaplan and Sam McCandlish showed that larger models learn more quickly in-context, and systematically studied in-context learning curves, task prompting, and evaluation methods.

贾里德·卡普兰和山姆·麦坎德利什展示了更大规模的模型在上下文中学习得更快,并系统地研究了上下文中的学习曲线、任务提示和评估方法。

Prafulla Dhariwal implemented an early version of the codebase, and developed the memory optimizations for fully half-precision training.

Prafulla Dhariwal 实现了代码库的早期版本,并开发了用于完全半精度训练的内存优化。

Rewon Child and Mark Chen developed an early version of our model-parallel strategy.

Rewon Child 和 Mark Chen 开发了我们模型并行策略的早期版本。

Rewon Child and Scott Gray contributed the sparse transformer.

Rewon Child 和 Scott Gray 贡献了稀疏 Transformer。


Figure 7.4: The GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human written article (accuracy: $12%$).

图 7.4: 人类最难与人类撰写文章区分开的 GPT-3 生成新闻文章(准确率:12%)。

Aditya Ramesh experimented with loss scaling strategies for pretraining.

Aditya Ramesh 实验了预训练的损失缩放策略。

Melanie Subbiah and Arvind Neelakantan implemented, experimented with, and tested beam search.

梅兰妮·苏比亚和阿文德·尼尔坎坦实现了、实验了并测试了束搜索。

Pranav Shyam worked on SuperGLUE and assisted with connections to few-shot learning and meta-learning literature.

Pranav Shyam 研究了 SuperGLUE,并协助建立了与少样本 (Few-shot) 学习和元学习文献的联系。

Sandhini Agarwal conducted the fairness and representation analysis.

Sandhini Agarwal 进行了公平性和代表性分析。

Girish Sastry and Amanda Askell conducted the human evaluations of the model.

Girish Sastry 和 Amanda Askell 进行了模型的人工评估。

Ariel Herbert-Voss conducted the threat analysis of malicious use.

Ariel Herbert-Voss 进行了恶意使用威胁分析。

Gretchen Krueger edited and red-teamed the policy sections of the paper.

格雷琴·克鲁格编辑并红队审查了论文的政策部分。

Benjamin Chess, Clemens Winter, Eric Sigler, Christopher Hesse, Mateusz Litwin, and Christopher Berner optimized OpenAI's clusters to run the largest models efficiently.

本杰明·切斯、克莱门斯·温特、埃里克·西格勒、克里斯托弗·赫塞、马特乌什·利特温和克里斯托弗·伯纳优化了 OpenAI 的集群,以高效运行最大的模型。

Scott Gray developed fast GPU kernels used during training.

Scott Gray 开发了训练期间使用的快速 GPU 内核。

Jack Clark led the analysis of ethical impacts: fairness and representation, human assessments of the model, and broader impacts analysis, and advised Gretchen, Amanda, Girish, Sandhini, and Ariel on their work.

杰克·克拉克领导了对伦理影响的分析——公平性和代表性、人类对模型的评估以及更广泛的影响分析,并就他们的工作向格雷琴、阿曼达、吉里什、桑迪尼和阿里尔提供建议。

Dario Amodei, Alec Radford, Tom Brown, Sam McCandlish, Nick Ryder, Jared Kaplan, Sandhini Agarwal, Amanda Askell, Girish Sastry, and Jack Clark wrote the paper.

Dario Amodei、Alec Radford、Tom Brown、Sam McCandlish、Nick Ryder、Jared Kaplan、Sandhini Agarwal、Amanda Askell、Girish Sastry 和 Jack Clark 写了这篇论文。


Figure 7.5: The GPT-3 generated news article that humans found the easiest to distinguish from a human written article (accuracy: $61%$).

图 7.5: 人类最容易与人类撰写文章区分开的 GPT-3 生成新闻文章(准确率:61%)。

Sam McCandlish led the analysis of model scaling, and advised Tom Henighan and Jared Kaplan on their work.

山姆·麦坎德利什领导了模型扩展的分析,并指导了汤姆·亨尼根和贾里德·卡普兰的工作。

Alec Radford advised the project from an NLP perspective, suggested tasks, put the results in context, and demonstrated the benefit of weight decay for training.

Alec Radford 从自然语言处理 (NLP) 角度为项目提供了建议,提出了任务,将结果置于上下文中,并展示了权重衰减 (weight decay) 对训练的好处。

Ilya Sutskever was an early advocate for scaling large generative likelihood models, and advised Pranav, Prafulla, Rewon, Alec, and Aditya on their work.

Ilya Sutskever 是扩展大型生成式似然模型 (generative likelihood models) 的早期倡导者,并指导了 Pranav、Prafulla、Rewon、Alec 和 Aditya 的工作。

Dario Amodei designed and led the research.

达里奥·阿莫迪设计并领导了这项研究。

References

参考文献

[KMB20] Sarah E. Kreps, Miles McCain, and Miles Brundage. All the news that's fit to fabricate: AI-generated text as a tool of media misinformation, 2020.

[KMB20] Sarah E. Kreps, Miles McCain 和 Miles Brundage. 所有适合伪造的新闻:AI 生成的文本作为媒体虚假信息的工具,2020。
