Training language models to follow instructions with human feedback

Long Ouyang∗ Jeff Wu∗ Xu Jiang∗ Diogo Almeida∗ Carroll L. Wainwright∗

Abstract

Making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

1 Introduction

Large language models (LMs) can be “prompted” to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021; Gehman et al., 2020). This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective “follow the user’s instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022). Thus, we say that the language modeling objective is misaligned. Averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications.

Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PPO-ptx), as well as the variant trained without pretraining mix (PPO), significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals.

We make progress on aligning language models by training them to act in accordance with the user’s intention (Leike et al., 2018). This encompasses both explicit intentions such as following instructions and implicit intentions such as staying truthful, and not being biased, toxic, or otherwise harmful. Using the language of Askell et al. (2021), we want language models to be helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment). We elaborate on the evaluation of these criteria in Section 3.6.

We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models. We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details). We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017). We illustrate this process in Figure 2. This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.

Figure 2: A diagram illustrating the three steps of our method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train one of our models. In Step 2, boxes A-D are samples from our models that get ranked by labelers. See Section 3 for more details on our method.

We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets. We train three model sizes (1.3B, 6B, and 175B parameters), and all of our models use the GPT-3 architecture. Our main findings are as follows:

Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having over 100x fewer parameters. These models have the same architecture, and differ only by the fact that InstructGPT is fine-tuned on our human data. This result holds true even when we add a few-shot prompt to GPT-3 to make it better at following instructions. Outputs from our 175B InstructGPT are preferred to 175B GPT-3 outputs $85 \pm 3\%$ of the time, and preferred $71 \pm 4\%$ of the time to few-shot 175B GPT-3. InstructGPT models also generate more appropriate outputs according to our labelers, and more reliably follow explicit constraints in the instruction.

InstructGPT models show improvements in truthfulness over GPT-3. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. Our results are equally strong on the subset of questions that were not adversarially selected against GPT-3. On “closed-domain” tasks from our API prompt distribution, where the output should not contain information that is not present in the input (e.g. summarization and closed-domain QA), InstructGPT models make up information not present in the input about half as often as GPT-3 (a 21% vs. 41% hallucination rate, respectively).

InstructGPT shows small improvements in toxicity over GPT-3, but not bias. To measure toxicity, we use the RealToxicityPrompts dataset (Gehman et al., 2020) and conduct both automatic and human evaluations. InstructGPT models generate about 25% fewer toxic outputs than GPT-3 when prompted to be respectful. InstructGPT does not significantly improve over GPT-3 on the Winogender (Rudinger et al., 2018) and CrowS-Pairs (Nangia et al., 2020) datasets.

We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015). This is an example of an “alignment tax” since our alignment procedure comes at the cost of lower performance on certain tasks that we may care about. We can greatly reduce the performance regressions on these datasets by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx), without compromising labeler preference scores.

Our models generalize to the preferences of “held-out” labelers that did not produce any training data. To test the generalization of our models, we conduct a preliminary experiment with held-out labelers, and find that they prefer InstructGPT outputs to outputs from GPT-3 at about the same rate as our training labelers. However, more work is needed to study how these models perform on broader groups of users, and how they perform on inputs where humans disagree about the desired behavior.

Public NLP datasets are not reflective of how our language models are used. We compare GPT-3 fine-tuned on our human preference data (i.e. InstructGPT) to GPT-3 fine-tuned on two different compilations of public NLP tasks: FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) (in particular, the T0++ variant). These datasets consist of a variety of NLP tasks, combined with natural language instructions for each task. On our API prompt distribution, our FLAN and T0 models perform slightly worse than our SFT baseline, and labelers significantly prefer InstructGPT to these models (InstructGPT has a $73.4 \pm 2\%$ winrate vs. our baseline, compared to $26.8 \pm 2\%$ and $29.8 \pm 2\%$ for our version of T0 and FLAN, respectively).

InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. We qualitatively probe InstructGPT’s capabilities, and find that it can follow instructions to summarize code and answer questions about code, and that it sometimes follows instructions in different languages, despite these instructions being very rare in the fine-tuning distribution. In contrast, GPT-3 can perform these tasks but requires more careful prompting, and does not usually follow instructions in these domains. This result is exciting because it suggests that our models are able to generalize the notion of “following instructions.” They retain some alignment even on tasks for which they get very little direct supervision signal.

InstructGPT still makes simple mistakes. For example, InstructGPT can still fail to follow instructions, make up facts, give long hedging answers to simple questions, or fail to detect instructions with false premises.

Overall, our results indicate that fine-tuning large language models using human preferences significantly improves their behavior on a wide range of tasks, though much work remains to be done to improve their safety and reliability.
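
The preference rates quoted in the findings above (for example, 85 ± 3%) are win rates reported with 95% confidence intervals. The text here does not say how those intervals were computed, so the following is only a minimal sketch that assumes a normal approximation to a binomial proportion; the function name and the comparison counts are hypothetical.

```python
import math


def win_rate_with_ci(wins: int, total: int, z: float = 1.96):
    """Return (win rate, CI half-width) for `wins` preferred outputs out of `total`.

    Assumption: normal approximation to the binomial; z = 1.96 gives a 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    rate = wins / total
    half_width = z * math.sqrt(rate * (1.0 - rate) / total)
    return rate, half_width


# Hypothetical counts, chosen only to illustrate the calculation.
rate, half_width = win_rate_with_ci(wins=425, total=500)
print(f"{rate:.1%} ± {half_width:.1%}")
```

The half-width shrinks with the square root of the number of comparisons, so halving an interval takes roughly four times as many labeled comparisons.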

The rest of this paper is structured as follows: We first detail related work in Section 2, before diving into our method and experiment details in Section 3, including our high-level methodology (3.1), task and dataset details (3.3 and 3.2), human data collection (3.4), how we trained our models (3.5), and our evaluation procedure (3.6). We then present our results in Section 4, divided into three parts: results on the API prompt distribution (4.1), results on public NLP datasets (4.2), and qualitative results (4.3). Finally we give an extended discussion of our work in Section 5, including implications for alignment research (5.1), what we are aligning to (5.2), limitations (5.3), open questions (5.4), and broader impacts of this work (5.5).

2 Related work

Research on alignment and learning from human feedback. We build on previous techniques to align models with human intentions, particularly reinforcement learning from human feedback (RLHF). Originally developed for training simple robots in simulated environments and Atari games (Christiano et al., 2017; Ibarz et al., 2018), it has recently been applied to fine-tuning language models to summarize text (Ziegler et al., 2019; Stiennon et al., 2020; Böhm et al., 2019; Wu et al., 2021). This work is in turn influenced by similar work using human feedback as a reward in domains such as dialogue (Jaques et al., 2019; Yi et al., 2019; Hancock et al., 2019), translation (Kreutzer et al., 2018; Bahdanau et al., 2016), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhou and Xu, 2020), review generation (Cho et al., 2018), and evidence extraction (Perez et al., 2019). Madaan et al. (2022) use written human feedback to augment prompts and improve the performance of GPT-3. There has also been work on aligning agents in text-based environments using RL with a normative prior (Nahian et al., 2021). Our work can be seen as a direct application of RLHF to aligning language models on a broad distribution of language tasks.

The question of what it means for language models to be aligned has also received attention recently (Gabriel, 2020). Kenton et al. (2021) catalog behavioral issues in LMs that result from misalignment, including producing harmful content and gaming misspecified objectives. In concurrent work, Askell et al. (2021) propose language assistants as a testbed for alignment research, and study some simple baselines and their scaling properties.

Training language models to follow instructions. Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings.

There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021).

Evaluating the harms of language models. A goal of modifying the behavior of language models is to mitigate the harms of these models when they’re deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021).

Modifying the behavior of language models to mitigate harms. There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models’ ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the-loop data collection (Dinan et al., 2019b). Other approaches for mitigating the bias generated by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).

Table 1: Distribution of use case categories from our API prompt dataset.

Use-case	(%)
Generation	45.6%
Open QA	12.4%
Brainstorming	11.2%
Chat	8.4%
Rewrite	6.6%
Summarization	4.2%
Classification	3.5%
Other	3.5%
Closed QA	2.6%
Extract	1.9%

Table 2: Illustrative prompts from our API prompt dataset. These are fictional examples inspired by real usage; see more examples in Appendix A.2.1.

Use-case	Prompt
Brainstorming	List five ideas for how to regain enthusiasm for my career
Generation	Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home.
Rewrite	This is the summary of a Broadway play:
	{summary}
	This is the outline of the commercial for that play:
	"""

3 Methods and experimental details

3.1 High-level methodology

Our methodology follows that of Ziegler et al. (2019) and Stiennon et al. (2020), who applied it in the stylistic continuation and summarization domains. We start with a pretrained language model (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022), a distribution of prompts on which we want our model to produce aligned outputs, and a team of trained human labelers (see Section 3.4 for details). We then apply the following three steps (Figure 2).

Step 1: Collect demonstration data, and train a supervised policy. Our labelers provide demonstrations of the desired behavior on the input prompt distribution (see Section 3.2 for details on this distribution). We then fine-tune a pretrained GPT-3 model on this data using supervised learning.

Step 2: Collect comparison data, and train a reward model. We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.

Step 3: Optimize a policy against the reward model using PPO. We use the output of the RM as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO algorithm (Schulman et al., 2017).

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.
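
To make Step 2 more concrete, the sketch below shows one common way to turn labeler rankings into a training signal for a reward model. This section does not state the loss, so the pairwise logistic form used here, -log(sigmoid(r_preferred - r_rejected)), is an assumption; the function names and the toy scores are hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Compute -log(sigmoid(r_preferred - r_rejected)) in a numerically stable way."""
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, log(1 + exp(-diff)) = -diff + log(1 + exp(diff)).
    return -diff + math.log1p(math.exp(diff))


def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return [
        (ranked_outputs[i], ranked_outputs[j])
        for i in range(len(ranked_outputs))
        for j in range(i + 1, len(ranked_outputs))
    ]


# Hypothetical scalar rewards for outputs A-D, ranked best to worst by a labeler.
scores = {"A": 1.2, "B": 0.4, "C": -0.3, "D": -1.0}
pairs = ranking_to_pairs(["A", "B", "C", "D"])
mean_loss = sum(pairwise_preference_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(len(pairs), round(mean_loss, 4))
```

The loss is small when the reward model scores the preferred output higher and grows roughly linearly when the order is reversed. The scalar score produced by such a model is what Step 3 would use as the reward.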

3.2 Dataset

Our prompt dataset consists primarily of text prompts submitted to the OpenAI API, specifically those using an earlier version of the InstructGPT models (trained via supervised learning on a subset of our demonstration data) on the Playground interface. Customers using the Playground were informed that their data could be used to train further models via a recurring notification any time InstructGPT models were used. In this paper we do not use data from customers using the API in production. We heuristically deduplicate prompts by checking for prompts that share a long common prefix, and we limit the number of prompts to 200 per user ID. We also create our train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. To avoid the models learning potentially sensitive customer details, we filter all prompts in the training split for personally identifiable information (PII).
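
The preprocessing described above (prefix-based deduplication, a cap of 200 prompts per user ID, and splits made by user ID) can be sketched as follows. The text does not give the prefix length, the split fractions, or the order of operations, so `prefix_len=64`, the 80/10/10 split, and all function names below are assumptions for illustration; PII filtering is left out.

```python
import random
from collections import defaultdict


def dedupe_by_prefix(records, prefix_len=64):
    """Keep the first (user_id, prompt) record seen for each leading prompt prefix."""
    seen = set()
    kept = []
    for user_id, prompt in records:
        key = prompt[:prefix_len]
        if key not in seen:
            seen.add(key)
            kept.append((user_id, prompt))
    return kept


def build_splits(records, max_per_user=200, prefix_len=64,
                 fractions=(0.8, 0.1), seed=0):
    """Deduplicate, cap prompts per user, and split by user ID.

    `fractions` gives the train and validation shares of users; the rest go to test.
    """
    by_user = defaultdict(list)
    for user_id, prompt in dedupe_by_prefix(records, prefix_len):
        if len(by_user[user_id]) < max_per_user:
            by_user[user_id].append(prompt)

    # Shuffle users, not prompts, so no user appears in more than one split.
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * fractions[0])
    n_valid = int(len(users) * fractions[1])
    groups = {
        "train": users[:n_train],
        "valid": users[n_train:n_train + n_valid],
        "test": users[n_train + n_valid:],
    }
    return {name: [(u, p) for u in ids for p in by_user[u]]
            for name, ids in groups.items()}


# Toy data: 10 hypothetical users with 250 distinct prompts each.
records = [(f"user{u}", f"user {u} prompt {i}") for u in range(10) for i in range(250)]
splits = build_splits(records)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting on user ID instead of on individual prompts is what keeps the validation and test sets free of data from training-set users.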

To train the very first InstructGPT models, we asked labelers to write prompts themselves. This is because we needed an initial source of instruction-like prompts to bootstrap the process, and these kinds of prompts weren’t often submitted to the regular GPT-3 models on the API. We asked labelers to write three kinds of prompts:

From these prompts, we produce three different datasets used in our fine-tuning procedure: (1) our SFT dataset, with labeler demonstrations used to train our SFT models, (2) our RM dataset, with labeler rankings of model outputs used to train our RMs, and (3) our PPO dataset, without any human labels, which is used as input for RLHF fine-tuning. The SFT dataset contains about 13k training prompts (from the API and labeler-written), the RM dataset has 33k training prompts (from the API and labeler-written), and the PPO dataset has 31k training prompts (only from the API). More details on dataset sizes are provided in Table 6.

To give a sense of the composition of our dataset, in Table 1 we show the distribution of use-case categories for our API prompts (specifically the RM dataset) as labeled by our contractors. Most of the use-cases are generative, rather than classification or QA. We also show some illustrative prompts (written by researchers to mimic the kinds of prompts submitted to InstructGPT models) in Table 2; more prompts submitted to InstructGPT models are shown in Appendix A.2.1, and prompts submitted to GPT-3 models are shown in Appendix A.2.2. We provide more details about our dataset in Appendix A.

3.3 Tasks

Our training tasks are from two sources: (1) a dataset of prompts written by our labelers and (2) a dataset of prompts submitted to early InstructGPT models on our API (see Table 6). These prompts are very diverse and include generation, question answering, dialog, summarization, extraction, and other natural language tasks (see Table 1). Our dataset is over 96% English; however, in Section 4.3 we also probe our model’s ability to respond to instructions in other languages and complete coding tasks.

我们的训练任务来自两个来源：(1) 由我们标注人员编写的提示数据集和 (2) 提交到我们 API 上早期 Instruct GP T 模型的提示数据集（见表 6）。这些提示非常多样化，包括生成、问答、对话、总结、提取和其他自然语言任务（见表 1）。我们的数据集超过 96% 是英语，但在第 4.3 节中，我们也测试了模型对其他语言指令的响应能力以及完成编程任务的能力。

For each natural language prompt, the task is most often specified directly through a natural language instruction (e.g. “Write a story about a wise frog”), but could also be specified indirectly through either few-shot examples (e.g. giving two examples of frog stories, and prompting the model to generate a new one) or implicit continuation (e.g. providing the start of a story about a frog). In each case, we ask our labelers to do their best to infer the intent of the user who wrote the prompt, and ask them to skip inputs where the task is very unclear. Moreover, our labelers also take into account implicit intentions such as truthfulness of the response, and potentially harmful outputs such as biased or toxic language, guided by the instructions we provide them (see Appendix B) and their best judgment.

对于每个自然语言提示，任务通常是通过自然语言指令直接指定（例如：“写一个关于聪明青蛙的故事”），但也可能通过少样本示例间接指定（例如：给出两个青蛙故事的示例，并提示模型生成一个新的）或隐式延续（例如：提供一个关于青蛙故事的开头）。在每种情况下，我们都要求标注者尽最大努力推断编写提示的用户的意图，并要求他们在任务非常不清楚时跳过输入。此外，我们的标注者还考虑了隐式意图，例如响应的真实性，以及潜在有害输出，如带有偏见或有毒的语言，这些都由我们提供的指示（见附录 B）和他们的最佳判断来指导。

3.4 Human data collection

3.4 人类数据收集

To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. Compared to earlier work that collects human preference data on the task of summarization (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), our inputs span a much broader range of tasks, and can occasionally include controversial and sensitive topics. Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. Thus, we conducted a screening test designed to measure labeler performance on these axes. We selected labelers who performed well on this test; for more information about our selection procedure and labeler demographics, see Appendix B.1.

为了生成我们的演示和比较数据，并进行主要评估，我们通过 Upwork 和 ScaleAI 聘请了大约 40 名合同工。与之前在总结任务上收集人类偏好数据的工作 (Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021) 相比，我们的输入涵盖了更广泛的任务范围，偶尔还包括有争议和敏感的话题。我们的目标是选择一组对不同人群的偏好敏感，并且善于识别潜在有害输出的标注者。因此，我们设计了一项筛选测试，以衡量标注者在这几个方面的表现。我们选择了在此测试中表现良好的标注者；有关我们选择程序和标注者人口统计信息的更多详细信息，请参阅附录 B.1。

During training and evaluation, our alignment criteria may come into conflict: for example, when a user requests a potentially harmful response. During training we prioritize helpfulness to the user (not doing so requires making some difficult design decisions that we leave to future work; see Section 5.4 for more discussion). However, in our final evaluations we asked labelers to prioritize truthfulness and harmlessness (since this is what we really care about).

在训练和评估过程中，我们的对齐标准可能会产生冲突：例如，当用户请求可能有害的响应时。在训练期间，我们优先考虑对用户的帮助性（不这样做需要做出一些困难的设计决策，我们留待未来工作；详见第 5.4 节以获得更多讨论）。然而，在最终评估中，我们要求标注者优先考虑真实性和无害性（因为这正是我们真正关心的）。

As in Stiennon et al. (2020), we collaborate closely with labelers over the course of the project. We have an onboarding process to train labelers on the project, write detailed instructions for each task (see Appendix B.2), and answer labeler questions in a shared chat room.

如同 Stiennon 等 (2020) 所述，我们在项目过程中与标注人员密切合作。我们有一个入职培训流程，对标注人员进行项目培训，编写每个任务的详细说明（见附录 B.2），并在共享聊天室中回答标注人员的问题。

As an initial study to see how well our model generalizes to the preferences of other labelers, we hire a separate set of labelers who do not produce any of the training data. These labelers are sourced from the same vendors, but do not undergo a screening test.

作为初步研究，以了解我们的模型对其他标注者的偏好泛化得如何，我们雇佣了一组独立的标注者，这些标注者不生成任何训练数据。这些标注者来自相同的供应商，但没有参加筛选测试。

Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each other $72.6\pm1.5\%$ of the time, while for held-out labelers this number is $77.3\pm1.3\%$. For comparison, in the summarization work of Stiennon et al. (2020) researcher-researcher agreement was $73\pm4\%$.

尽管任务复杂，我们发现标注者之间的一致率相当高：训练集标注者之间的同意率为 $72.6\pm1.5%$ ，而测试集标注者之间的同意率为 $77.3\pm1.3%$ 。作为对比，在 Stiennon 等人 (2020) 的摘要工作中，研究人员之间的同意率为 $73\pm4%$ 。

3.5 Models

3.5 模型

We start with the GPT-3 pretrained language models from Brown et al. (2020). These models are trained on a broad distribution of Internet data and are adaptable to a wide range of downstream tasks, but have poorly characterized behavior. Starting from these models, we then train models with three different techniques:

我们从 Brown 等人 (2020) 的 GPT-3 预训练语言模型开始。这些模型在广泛的互联网数据上进行训练，可以适应各种下游任务，但其行为特征尚未完全明确。基于这些模型，我们随后使用三种不同的技术训练模型：

Supervised fine-tuning (SFT). We fine-tune GPT-3 on our labeler demonstrations using supervised learning. We trained for 16 epochs, using a cosine learning rate decay, and residual dropout of 0.2. We do our final SFT model selection based on the RM score on the validation set. Similarly to Wu et al. (2021), we find that our SFT models overfit on validation loss after 1 epoch; however, we find that training for more epochs helps both the RM score and human preference ratings, despite this overfitting.

监督微调 (SFT)。我们使用监督学习对 GPT-3 进行标注演示的微调。我们训练了 16 个 epoch，使用余弦学习率衰减和 0.2 的残差 dropout。我们根据验证集上的 RM 分数进行最终的 SFT 模型选择。与 Wu 等人 (2021) 类似，我们发现我们的 SFT 模型在 1 个 epoch 后会在验证损失上过拟合；然而，我们发现训练更多的 epoch 对 RM 分数和人类偏好评分都有帮助，尽管存在过拟合现象。
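The cosine learning rate decay mentioned above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `cosine_lr`, the `final_frac` floor, and the peak learning rate in the usage note are assumptions introduced here, since this section states only that a cosine decay was used.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, final_frac: float = 0.1) -> float:
    """Cosine decay from peak_lr down to final_frac * peak_lr.

    final_frac is a hypothetical floor chosen for illustration; the text
    above does not specify one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    # Cosine factor goes smoothly from 1 (start) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)
```

For example, `cosine_lr(0, 1000, 1e-5)` returns the peak rate, and the rate decreases monotonically toward the floor as `step` approaches `total_steps`.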

Reward modeling (RM). Starting from the SFT model with the final unembedding layer removed, we trained a model to take in a prompt and response, and output a scalar reward. In this paper we only use 6B RMs, as this saves a lot of compute, and we found that 175B RM training could be unstable and thus was less suitable to be used as the value function during RL (see Appendix C for more details).

奖励建模 (Reward modeling, RM)。从移除了最终无嵌入层的 SFT 模型开始，我们训练了一个模型来接收一个提示和响应，并输出一个标量奖励。在本文中，我们仅使用 6B 的 RM，因为这节省了大量的计算资源，而且我们发现 175B 的 RM 训练可能不稳定，因此不太适合用作 RL 中的价值函数（更多详情见附录 C）。

In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.

在 Stiennon 等 (2020) 中，RM 在同一输入的两个模型输出的比较数据集上进行训练。他们使用交叉熵损失，比较结果作为标签——奖励差异表示一个响应比另一个更受人类标注者青睐的对数几率。

In order to speed up comparison collection, we present labelers with anywhere between $K=4$ and $K=9$ responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than $\binom{K}{2}$ forward passes for $K$ completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss.

为了加快比较收集，我们向标注者展示 $K=4$ 到 $K=9$ 个响应进行排序。这为每个展示给标注者的提示生成了 $\binom{K}{2}$ 个比较。由于每个标注任务内的比较高度相关，我们发现如果简单地将比较打乱合并到一个数据集中，单次遍历该数据集会导致奖励模型过拟合。相反，我们将每个提示的 $\binom{K}{2}$ 个比较作为一个批次元素进行训练。这样做计算效率更高，因为它只需要每个完成的 RM 进行一次前向传递（而不是对 $K$ 个完成进行 $\binom{K}{2}$ 次前向传递），并且由于不再过拟合，它实现了更好的验证准确率和对数损失。
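To make the counting argument concrete, the short sketch below tabulates how many comparisons one ranking yields and how many RM forward passes the batched scheme needs. The helper name `ranking_cost` and its dictionary keys are illustrative, not from the paper.

```python
from math import comb


def ranking_cost(k: int) -> dict:
    """Comparisons and RM forward passes implied by one ranking of k responses."""
    return {
        # Every unordered pair of ranked responses gives one labeled comparison.
        "comparisons": comb(k, 2),
        # Batched scheme: score each completion once, then form all pairs
        # from the k scalar rewards.
        "forward_passes_batched": k,
    }


for k in (4, 9):
    print(k, ranking_cost(k))
```

At $K=4$ a ranking yields 6 comparisons from 4 forward passes; at $K=9$ it yields 36 comparisons from 9 forward passes.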

Specifically, the loss function for the reward model is:

具体来说，奖励模型的损失函数为：

$$
\operatorname{loss}\left(\theta\right)=-\frac{1}{\binom{K}{2}}E_{\left(x,y_{w},y_{l}\right)\sim D}\left[\log\left(\sigma\left(r_{\theta}\left(x,y_{w}\right)-r_{\theta}\left(x,y_{l}\right)\right)\right)\right]
$$

where $r_{\theta}(x,y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$ , $y_{w}$ is the preferred completion out of the pair of $y_{w}$ and $y_{l}$ , and $D$ is the dataset of human comparisons.

其中，$r_{\theta}(x,y)$ 是奖励模型对提示 $x$ 和完成 $y$ 的标量输出，参数为 $\theta$ ，$y_{w}$ 是成对的 $y_{w}$ 和 $y_{l}$ 中的优选完成，$D$ 是人类比较的数据集。
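The reward-model loss above can be sketched for a single prompt as follows. This is a plain-Python illustration under stated assumptions, not the authors' implementation: it assumes the $K$ scalar rewards have already been computed (one forward pass per completion) and are ordered from most to least preferred with no ties; the names `log_sigmoid` and `ranking_loss` are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranking_loss(rewards_best_first: Sequence[float]) -> float:
    """Pairwise loss for one prompt, averaged over all K-choose-2 comparisons.

    rewards_best_first[i] plays the role of r_theta(x, y_i). Because the list
    is ordered from most to least preferred, every index pair (w, l) with
    w < l has y_w preferred over y_l.
    """
    pairs = list(combinations(range(len(rewards_best_first)), 2))
    if not pairs:
        raise ValueError("need at least two ranked completions")
    total = sum(
        log_sigmoid(rewards_best_first[w] - rewards_best_first[l]) for w, l in pairs
    )
    return -total / len(pairs)
```

When all rewards are equal the loss is $\log 2$, the value for an uninformative model; rewards that agree with the labeler's ranking push the loss below that, and rewards that contradict it push the loss above.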

Table 3: Labeler-collected metadata on the API distribution.

表 3: 标签收集的 API 分布元数据。

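Metadata	Scale
Overall quality	Likert scale; 1-7
Fails to follow the correct instruction / task	Binary
Inappropriate for customer assistant	Binary
Hallucination	Binary
Satisfies constraint provided in the instruction	Binary
Contains sexual content	Binary
Contains violent content	Binary
Encourages or fails to discourage violence/abuse/terrorism/self-harm	Binary
Denigrates a protected class	Binary
Gives harmful advice	Binary
Expresses opinion	Binary
Expresses moral judgment	Binary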

Finally, since the RM loss is invariant to shifts in reward, we normalize the reward model using a bias so that the labeler demonstrations achieve a mean score of 0 before doing RL.

最后，由于 RM 损失对奖励的偏移不变，我们使用偏置对奖励模型进行归一化，使得在进行强化学习之前，标注者的演示获得的平均分为 0。
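The normalization step can be illustrated with a small sketch. Because the loss depends only on reward differences, adding a constant bias leaves every pairwise comparison unchanged. The function name `fit_reward_bias` and the example scores are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def fit_reward_bias(demo_rewards: Sequence[float]) -> float:
    """Bias to add to RM outputs so labeler demonstrations score 0 on average."""
    return -fmean(demo_rewards)


# Hypothetical raw RM scores on labeler demonstrations.
demo_rewards = [2.0, 3.0, 4.0]
bias = fit_reward_bias(demo_rewards)
normalized = [r + bias for r in demo_rewards]

# The shift moves the mean to 0 but leaves reward differences, and hence
# the pairwise ranking loss, unchanged.
assert abs(fmean(normalized)) < 1e-12
assert normalized[0] - normalized[1] == demo_rewards[0] - demo_rewards[1]
```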

Reinforcement learning (RL). Once again following Stiennon et al. (2020), we fine-tuned the SFT model on our environment using PPO (Schulman et al., 2017). The environment is a bandit environment which presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate overoptimization of the reward model. The value function is initialized from the RM. We call these models “PPO.”

强化学习 (Reinforcement learning, RL)。再次遵循 Stiennon 等 (2020) 的方法，我们使用 PPO (Schulman 等, 2017) 在我们的环境中对 SFT 模型进行了微调。该环境是一个多臂赌博机环境，它呈现一个随机的客户提示并期望对该提示作出回应。根据提示和回应，它会根据奖励模型产生一个奖励并结束此回合。此外，我们在每个 Token 上添加了一个来自 SFT 模型的每-Token KL 惩罚，以减轻对奖励模型的过度优化。价值函数从 RM 初始化。我们将这些模型称为“PPO”。
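One way to picture the reward signal in this bandit setup is sketched below. The text states only that the RM reward is produced at the end of the episode and that a KL penalty is applied at each token; adding the RM score to the final token's reward is an assumption made here for illustration, as are the function name and the example values.

```python
from typing import List, Sequence


def per_token_rewards(
    rm_score: float,
    logp_rl: Sequence[float],
    logp_sft: Sequence[float],
    beta: float,
) -> List[float]:
    """Per-token rewards for one sampled response.

    logp_rl[t] and logp_sft[t] are the log-probabilities that the RL policy
    and the SFT model assign to the t-th response token.
    """
    if not logp_rl or len(logp_rl) != len(logp_sft):
        raise ValueError("expected two non-empty sequences of equal length")
    # KL penalty at every token: -beta * (log pi_RL - log pi_SFT).
    rewards = [-beta * (p - q) for p, q in zip(logp_rl, logp_sft)]
    # Assumption: the scalar RM reward is credited to the final token,
    # after which the episode ends.
    rewards[-1] += rm_score
    return rewards


# Illustrative values only.
example = per_token_rewards(1.5, [-1.0, -2.0, -0.5], [-1.2, -1.5, -0.5], beta=0.1)
```

Summing the per-token rewards recovers the sequence-level quantity: the RM score minus $\beta$ times the log ratio of the RL policy to the SFT model for the whole response.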

We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the following combined objective function in RL training:

我们还尝试将预训练的梯度混合到 PPO 梯度中，以修复在公共 NLP 数据集上的性能退化问题。我们称这些模型为 “PPO-ptx”。我们在强化学习训练中最大化以下组合目标函数：

$$
\begin{aligned}
\operatorname{objective}\left(\phi\right)={}&E_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}\left[r_{\theta}(x,y)-\beta\log\left(\pi_{\phi}^{\mathrm{RL}}(y\mid x)/\pi^{\mathrm{SFT}}(y\mid x)\right)\right]\\
&+\gamma E_{x\sim D_{\mathrm{pretrain}}}\left[\log(\pi_{\phi}^{\mathrm{RL}}(x))\right]
\end{aligned}
$$

The objective has two parts. The first is the expectation, over prompt-response pairs $(x,y)$ sampled from the RL policy's distribution $D_{\pi_{\phi}^{\mathrm{RL}}}$, of the reward $r_{\theta}(x,y)$ minus $\beta$ times the log ratio of the RL policy to the SFT policy. The second is $\gamma$ times the expected log-likelihood that the RL policy assigns to samples $x$ drawn from the pretraining distribution $D_{\mathrm{pretrain}}$.

where $\pi_{\phi}^{\mathrm{RL}}$ is the learned RL policy, $\pi^{\mathrm{SFT}}$ is the supervised trained model, and $D_{\mathrm{pretrain}}$ is the pretraining distribution. The KL reward coefficient, $\beta$, and the pretraining loss coefficient, $\gamma$, control the strength of the KL penalty and pretraining gradients respectively. For “PPO” models, $\gamma$ is set to 0. Unless otherwise specified, in this paper InstructGPT refers to the PPO-ptx models.

其中，$\pi_{\phi}^{\mathrm{RL}}$ 是学习到的强化学习策略，$\pi^{\mathrm{SFT}}$ 是监督训练模型，$D_{\mathrm{pretrain}}$ 是预训练分布。KL 奖励系数 $\beta$ 和预训练损失系数 $\gamma$ 分别控制 KL 惩罚和预训练梯度的强度。对于 “PPO” 模型，$\gamma$ 设置为 0。除非另有说明，本文中的 Instruct GP T 指的是 PPO-ptx 模型。
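A Monte Carlo estimate of the combined objective can be sketched as below, to show how $\beta$ and $\gamma$ enter. This is an illustration on precomputed numbers, not a PPO implementation: the function name, the tuple layout, and all example values are assumptions.

```python
from statistics import fmean
from typing import Sequence, Tuple


def ppo_ptx_objective(
    rl_samples: Sequence[Tuple[float, float, float]],
    pretrain_logps: Sequence[float],
    beta: float,
    gamma: float,
) -> float:
    """Estimate the combined objective from samples.

    rl_samples holds (reward, logp_rl, logp_sft) for responses sampled from
    the current RL policy: the RM reward r_theta(x, y) and the sequence
    log-probabilities of y given x under the RL policy and the SFT model.
    pretrain_logps holds log pi_RL(x) for sequences x from the pretraining data.
    """
    rl_term = fmean(r - beta * (lp_rl - lp_sft) for r, lp_rl, lp_sft in rl_samples)
    ptx_term = fmean(pretrain_logps)
    return rl_term + gamma * ptx_term


# Illustrative values only.
samples = [(1.0, -3.0, -3.5), (0.0, -2.0, -2.0)]
ppo_value = ppo_ptx_objective(samples, [-10.0, -20.0], beta=0.1, gamma=0.0)
ptx_value = ppo_ptx_objective(samples, [-10.0, -20.0], beta=0.1, gamma=0.01)
```

Setting `gamma=0.0` gives the plain “PPO” objective; a positive `gamma` adds the pretraining log-likelihood term that defines the PPO-ptx variant.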

Baselines. We compare the performance of our PPO models to our SFT models and GPT-3. We also compare to GPT-3 when it is provided a few-shot prefix to ‘prompt’ it into an instruction-following mode (GPT-3-prompted). This prefix is prepended to the user-specified instruction.

基线。我们将 PPO 模型的性能与 SFT 模型和 GPT-3 进行比较。我们还比较了在给定少样本前缀以“提示”其进入指令跟随模式 (GPT-3-prompted) 的 GPT-3。该前缀被添加到用户指定的指令之前。

We additionally compare InstructGPT to fine-tuning 175B GPT-3 on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets, which both consist of a variety of NLP tasks, combined with natural language instructions for each task (the datasets differ in the NLP datasets included, and the style of instructions used). We fine-tune on approximately 1 million examples from each dataset and choose the checkpoint which obtains the highest reward model score on the validation set. See Appendix C for more training details.

我们还比较了 Instruct GP T 与在 FLAN (Wei et al., 2021) 和 T0 (Sanh et al., 2021) 数据集上微调的 175B GPT-3，这两个数据集都包含各种自然语言处理任务，并结合每个任务的自然语言指令（这些数据集在包含的自然语言处理数据集和使用的指令风格上有所不同）。我们在大约 100 万个样本上分别对它们进行微调，并选择在验证集上获得最高奖励模型分数的检查点。更多训练细节请参见附录 C。

3.6 Evaluation

3.6 评估

To evaluate how “aligned” our models are, we first need to clarify what alignment means in this context. The definition of alignment has historically been a vague and confusing topic, with various competing proposals (Chen et al., 2021; Leike et al., 2018; Gabriel, 2020). Following Leike et al. (2018), our aim is to train models that act in accordance with user intentions. More practically, for the purpose of our language tasks, we use a framework similar to Askell et al. (2021), who define models to be aligned if they are helpful, honest, and harmless.

为了评估我们的模型有多么“对齐”，我们首先需要澄清在此上下文中对齐的含义。对齐的定义在历史上一直是一个模糊和令人困惑的话题，存在各种相互竞争的提案 (Chen et al., 2021; Leike et al., 2018; Gabriel, 2020)。根据 Leike 等人 (2018) 的研究，我们的目标是训练出能够按照用户意图行事的模型。更具体地说，对于我们的语言任务，我们采用了一个类似于 Askell 等人 (2021) 提出的框架，他们定义对齐的模型应具备有助性、诚实性和无害性。

To be helpful, the model should follow instructions, but also infer intention from a few-shot prompt or another interpretable pattern such as “Q: {question}\nA:”. Since a given prompt’s intention can be unclear or ambiguous, we rely on judgment from our labelers, and our main metric is labeler preference ratings. However, since our labelers are not the users who generated the prompts, there could be a divergence between what a user actually intended and what the labeler thought was intended from only reading the prompt.

为了提供帮助，模型应当遵循指令，并从少样本提示或其他可解释的模式（如“Q: {问题}\nA:”）中推断意图。由于给定提示的意图可能不清楚或模糊，我们依赖于标注者的判断，我们的主要指标是标注者偏好评分。然而，由于我们的标注者不是生成提示的用户，因此用户实际意图和标注者仅通过阅读提示所认为的意图之间可能存在差异。

It is unclear how to measure honesty in purely generative models; this requires comparing the model’s actual output to its “belief” about the correct output, and since the model is a big black box, we can’t infer its beliefs. Instead, we measure truthfulness—whether the model’s statements about the world are true—using two metrics: (1) evaluating our model’s tendency to make up information on closed domain tasks (“hallucinations”), and (2) using the TruthfulQA dataset (Lin et al., 2021). Needless to say, this only captures a small part of what is actually meant by truthfulness.

在纯生成式模型 (Generative Model) 中如何衡量诚实度尚不清楚；这需要将模型的实际输出与其对正确输出的“信念”进行比较，而由于模型是一个大黑箱，我们无法推断其信念。因此，我们通过两个指标来衡量真实性——即模型关于世界的陈述是否真实：(1) 评估我们的模型在封闭域任务中编造信息的倾向（“幻觉”），以及 (2) 使用 TruthfulQA 数据集 (Lin et al., 2021)。毋庸置疑，这仅捕捉到了真实性实际含义的一小部分。

Similarly to honesty, measuring the harms of language models also poses many challenges. In most cases, the harms from language models depend on how their outputs are used in the real world. For instance, a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might even be helpful if used for data augmentation to train a more accurate toxicity detection model. Earlier in the project, we had labelers evaluate whether an output was ‘potentially harmful’. However, we discontinued this as it required too much speculation about how the outputs would ultimately be used; especially since our data also comes from customers who interact with the Playground API interface (rather than from production use cases).

类似于诚实性，测量大语言模型的危害也面临许多挑战。在大多数情况下，大语言模型的危害取决于其输出在现实世界中的使用方式。例如，在部署的聊天机器人中，生成有毒内容的模型可能是有害的；但如果用于数据增强以训练更准确的毒性检测模型，则可能甚至是有帮助的。项目早期，我们曾让标注人员评估某个输出是否“可能存在危害”。然而，我们停止了这一做法，因为它需要对标记的输出最终将如何被使用进行过多推测；尤其是因为我们的数据还来自与 Playground API 界面交互的客户（而不是来自生产使用场景）。

Therefore we use a suite of more specific proxy criteria that aim to capture different aspects of behavior in a deployed model that could end up being harmful: we have labelers evaluate whether an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content. We also benchmark our model on datasets intended to measure bias and toxicity, such as RealToxicityPrompts (Gehman et al., 2020) and CrowS-Pairs (Nangia et al., 2020).

因此，我们使用一套更具体的代理标准，旨在捕捉部署模型中可能最终产生有害影响的不同行为方面：我们让标注人员评估输出在客户助手场景中是否不适当，是否诋毁受保护群体，或是否包含性或暴力内容。我们还使用旨在衡量偏差和毒性的数据集对我们模型进行基准测试，例如 Real Toxicity Prompts (Gehman et al., 2020) 和 CrowS-Pairs (Nangia et al., 2020)。

To summarize, we can divide our quantitative evaluations into two separate parts:

综上所述，我们可以将定量评估分为两个独立的部分：

Evaluations on API distribution. Our main metric is human preference ratings on a held-out set of prompts from the same source as our training distribution. When using prompts from the API for evaluation, we only select prompts by customers we haven’t included in training. However, given that our training prompts are designed to be used with InstructGPT models, it’s likely that they disadvantage the GPT-3 baselines. Thus, we also evaluate on prompts submitted to GPT-3 models on the API; these prompts are generally not in an ‘instruction following’ style, but are designed specifically for GPT-3. In both cases, for each model we calculate how often its outputs are preferred to a baseline policy; we choose our 175B SFT model as the baseline since its performance is near the middle of the pack. Additionally, we ask labelers to judge the overall quality of each response on a 1-7 Likert scale and collect a range of metadata for each model output (see Table 3).

API 分布的评估。我们的主要指标是人类偏好评分，基于来自与我们训练分布相同来源的保留提示集。在使用 API 中的提示进行评估时，我们仅选择那些未包含在训练中的客户提供的提示。然而，考虑到我们的训练提示是为与 Instruct GP T 模型一起使用而设计的，这可能会对 GPT-3 基线不利。因此，我们还评估了通过 API 提交给 GPT-3 模型的提示；这些提示通常不是‘指令跟随’风格，而是专门为 GPT-3 设计的。在这两种情况下，对于每个模型，我们计算其输出优于基线策略的频率；我们选择 175B SFT 模型作为基线，因为它的性能接近中等水平。此外，我们要求标注者根据 1-7 的李克特量表对每个响应的整体质量进行评判，并收集每个模型输出的一系列元数据（见表 3）。
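The win-rate metric described above can be sketched as follows. The binomial standard error is an assumption added for illustration, since this section does not say how the reported uncertainty intervals are computed; the function name and the handling of comparisons (no ties) are likewise hypothetical.

```python
import math
from typing import Sequence, Tuple


def win_rate(outcomes: Sequence[int]) -> Tuple[float, float]:
    """Fraction of comparisons won against the baseline, with a standard error.

    outcomes[i] is 1 if the model's output was preferred to the baseline's
    output (here, the 175B SFT model) on comparison i, and 0 otherwise.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one comparison")
    p = sum(outcomes) / n
    # Assumption: simple binomial standard error, treating comparisons as independent.
    stderr = math.sqrt(p * (1.0 - p) / n)
    return p, stderr


rate, stderr = win_rate([1, 1, 1, 0])
```

A win rate of 0.5 means the model is preferred exactly as often as the baseline; by construction the baseline itself sits at 0.5.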

Evaluations on public NLP datasets. We evaluate on two types of public datasets: those that capture an aspect of language model safety, particularly truthfulness, toxicity, and bias, and those that capture zero-shot performance on traditional NLP tasks like question answering, reading comprehension, and summarization. We also conduct human evaluations of toxicity on the RealToxicityPrompts dataset (Gehman et al., 2020). We are releasing samples from our models on all of the sampling-based NLP tasks.

在公开的 NLP 数据集上的评估。我们在两种类型的公开数据集上进行评估：一类是捕捉语言模型安全性方面的数据集，特别是真实性、毒性及偏见；另一类是捕捉在传统 NLP 任务上零样本性能的数据集，如问答、阅读理解以及摘要。我们还在 Real Toxicity Prompts 数据集 (Gehman 等, 2020) 上进行了毒性的人工评估。我们将发布所有基于采样的 NLP 任务的模型样本。

4 Results

In this section, we provide experimental evidence for our claims in Section 1, sorted into three parts: results on the API prompt distribution, results on public NLP datasets, and qualitative results.

在本节中，我们为第 1 节中的主张提供实验依据，并将其分为三个部分：API 提示分布的结果、公共 NLP 数据集上的结果和定性结果。

Figure 3: Preference results of our models, measured by winrate against the 175B SFT model. Left: results on prompts submitted to GPT models on the API; Right: results on prompts submitted to InstructGPT models on the API; Top: results from held-out labelers; Bottom: results from training labelers. We omit GPT (prompted) from the evals on prompts submitted to GPT-3 models (left) as these prompts are already designed to perform well for GPT-3, as opposed to prompts submitted to InstructGPT models (right).

图 3: 我们的模型的偏好结果，通过与 175B SFT 模型的胜率进行衡量。左：提交给 GPT 模型的提示的结果；右：提交给 Instruct GP T 模型的提示的结果；上：保留标注者的评估结果；下：训练标注者的评估结果。我们从对提交给 GPT-3 模型的提示的评估中省略了 GPT (prompted)，因为这些提示已经为 GPT-3 设计得非常好，而提交给 Instruct GP T 模型的提示则不是这样。

4.1 Results on the API distribution

4.1 API 分布的结果

Labelers significantly prefer InstructGPT outputs over outputs from GPT-3. On our test set of prompts, our labelers significantly prefer InstructGPT outputs across model sizes. These results are shown in Figure 1. We find that GPT-3 outputs perform the worst, and one can obtain significant stepwise improvements by using a well-crafted few-shot prompt (GPT-3 (prompted)), then by training on demonstrations using supervised learning (SFT), and finally by training on comparison data using PPO. Adding updates on the pretraining mix during PPO does not lead to large changes in labeler preference. To illustrate the magnitude of our gains: when compared directly, 175B InstructGPT outputs are preferred to GPT-3 outputs $85\pm3\%$ of the time, and preferred $71\pm4\%$ of the time to few-shot GPT-3.

标注人员显著偏好 Instruct GP T 的输出，而非 GPT-3 的输出。在我们的测试提示集上，标注人员对不同模型大小的 Instruct GP T 输出均表现出显著偏好。这些结果如图 1 所示。我们发现 GPT-3 的输出表现最差，通过使用精心设计的少样本提示 (GPT-3 (prompted)) 可以获得显著的改进，然后通过监督学习 (SFT) 在演示数据上进行训练，最后通过 PPO 在比较数据上进行训练。在 PPO 训练期间对预训练混合数据进行更新并不会导致标注人员偏好的显著变化。为了说明我们改进的幅度：与 GPT-3 直接比较时，175B Instruct GP T 的输出在 85±3% 的情况下优于 GPT-3 的输出，在 71±4% 的情况下优于少样本 GPT-3 的输出。

We also found that our results do not change significantly when evaluated on prompts submitted to GPT-3 models on the API (see Figure 3), though our PPO-ptx models perform slightly worse at larger model sizes.

我们还发现，当在 API 上提交的提示语中评估时，我们的结果没有显著变化（见图 3），尽管我们的 PPO-ptx 模型在较大模型规模下表现略差。

In Figure 4 we show that labelers also rate InstructGPT outputs favorably along several more concrete axes. Specifically, compared to GPT-3, InstructGPT outputs are more appropriate in the context of a customer assistant, more often follow explicit constraints defined in the instruction (e.g. “Write your answer in 2 paragraphs or less.”), are less likely to fail to follow the correct instruction entirely, and make up facts (‘hallucinate’) less often in closed-domain tasks. These results suggest that InstructGPT models are more reliable and easier to control than GPT-3. We’ve found that our other metadata categories occur too infrequently in our API to obtain statistically significant differences between our models.

图 4: 我们展示了标注者还沿着几个更具体的轴对 Instruct GP T 的输出给予了正面评价。具体来说，与 GPT-3 相比，Instruct GP T 的输出在客户助理的上下文中更为合适，更经常遵循指令中定义的明确约束（例如：“用 2 段或更少的文字回答。”），完全不遵循正确指令的情况较少，并且在封闭域任务中编造事实（“幻觉”）的情况也较少。这些结果表明，Instruct GP T 模型比 GPT-3 更可靠且更容易控制。我们发现，我们的其他元数据类别在 API 中出现得太少，无法在模型之间获得统计上的显著差异。

Figure 4: Metadata results on the API distribution. Note that, due to dataset sizes, these results are collapsed across model sizes. See Appendix E.2 for analysis that includes model size. Compared to GPT-3, the PPO models are more appropriate in the context of a customer assistant, are better at following explicit constraints in the instruction and attempting the correct instruction, and less likely to ‘hallucinate’ (meaning, making up information on closed domain tasks like summarization).

图 4: API 分布的元数据结果。请注意，由于数据集大小的原因，这些结果是在不同模型大小上合并的。有关包括模型大小的分析，请参见附录 E.2。与 GPT-3 相比，PPO 模型在客户助理的背景下更为合适，更善于遵循指令中的明确约束并尝试正确的指令，并且不太可能‘幻觉’（即，在封闭域任务如摘要中编造信息）。

Figure 5: Comparing our models with FLAN and T0 in terms of Likert scores on a 1-7 scale, on the InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably with a few-shot GPT-3 model placed into ‘instruction-following’ mode.

图 5: 比较我们的模型与 FLAN 和 T0 在 1-7 级李克特量表上的得分，在 Instruct GP T 提示分布上。FLAN 和 T0 的表现优于默认的 GPT-3，并且与置于“指令跟随”模式下的少样本 GPT-3 模型相当。

Our models generalize to the preferences of “held-out” labelers that did not produce any training data. Held-out labelers have similar ranking preferences as workers who we used to produce training data (see Figure 3). In particular, according to held-out workers, all of our InstructGPT models still greatly outperform the GPT-3 baselines. Thus, our InstructGPT models aren’t simply overfitting to the preferences of our training labelers.

我们的模型可以泛化到“未见”标注者的偏好，这些标注者没有产生任何训练数据。未见标注者的排名偏好与我们用于生成训练数据的标注者相似（见图 3）。特别是，根据未见标注者的反馈，我们所有的 Instruct GP T 模型仍然显著优于 GPT-3 基线模型。因此，我们的 Instruct GP T 模型并不是简单地过拟合到我们训练标注者的偏好。

We see further evidence of this from the generalization capabilities of our reward models. We ran an experiment where we split our labelers into 5 groups, and train 5 RMs (with 3 different seeds) using 5-fold cross validation (training on 4 of the groups, and evaluating on the held-out group). These RMs have an accuracy of $69.6\pm0.9\%$ on predicting the preferences of labelers in the held-out group, a small decrease from their $72.4\pm0.4\%$ accuracy on predicting the preferences of labelers in their training set.

我们从奖励模型的泛化能力中进一步看到了证据。我们进行了一项实验，将标注者分为 5 组，并使用 5 折交叉验证（在 4 组上训练，在保留组上评估）训练了 5 个 RM（使用 3 个不同的种子）。这些 RM 在预测保留组中标注者的偏好时准确率为 $69.6\pm0.9%$ ，比其在训练集上预测标注者偏好时的准确率 $72.4\pm0.4%$ 略有下降。
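The grouped cross-validation described above splits by labeler rather than by comparison, so that no held-out labeler contributes training data. A minimal sketch follows; the function name, the random group assignment, and the labeler IDs are assumptions, since the text does not say how labelers were assigned to groups.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def labeler_folds(
    labeler_ids: Sequence[str], n_folds: int = 5, seed: int = 0
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_labelers, held_out_labelers) for each fold."""
    ids = sorted(set(labeler_ids))
    # Assumption: labelers are assigned to groups at random.
    random.Random(seed).shuffle(ids)
    groups = [ids[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        train = [
            labeler
            for index, group in enumerate(groups)
            if index != held_out
            for labeler in group
        ]
        yield train, groups[held_out]


# Hypothetical IDs for a pool of 40 labelers.
folds = list(labeler_folds([f"labeler_{i:02d}" for i in range(40)]))
```

Each fold's RM would be trained only on comparisons from `train` labelers and evaluated on comparisons from the held-out group.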

Public NLP datasets are not reflective of how our language models are used. In Figure 5, we also compare InstructGPT to our 175B GPT-3 baselines fine-tuned on the FLAN (Wei et al., 2021) and T0 (Sanh et al., 2021) datasets (see Appendix C for details). We find that these models perform better than GPT-3, on par with GPT-3 with a well-chosen prompt, and worse than our SFT baseline. This indicates that these datasets are not sufficiently diverse to improve performance on our API prompt distribution. In a head-to-head comparison, our 175B InstructGPT model outputs were preferred over our FLAN model $78\pm4\%$ of the time and over our T0 model $79\pm4\%$ of the time. Likert scores for these models are shown in Figure 5.

公共 NLP 数据集不能反映我们如何使用语言模型。在图 5 中，我们还将 Instruct GP T 与我们在 FLAN (Wei et al., 2021) 和 T0 (Sanh et al., 2021) 数据集上微调的 175B GPT-3 基线模型进行了比较（详细信息见附录 C）。我们发现这些模型的表现优于 GPT-3，在精心选择的提示下与 GPT-3 持平，但不如我们的 SFT 基线模型。这表明这些数据集不够多样化，无法提高我们在 API 提示分布上的性能。在直接对比中，我们的 175B Instruct GP T 模型输出在 $78\pm4%$ 的情况下优于我们的 FLAN 模型，在 $79\pm4%$ 的情况下优于我们的 T0 模型。这些模型的 Likert 评分如图 5 所示。

We believe our InstructGPT model outperforms FLAN and T0 for two reasons. First, public NLP datasets are designed to capture tasks that are easy to evaluate with automatic metrics, such as classification, question answering, and to a certain extent summarization and translation. However, classification and QA are only a small part (about 18%) of what API customers use our language models for, whereas open-ended generation and brainstorming make up about 57% of our prompt dataset according to labelers (see Table 1). Second, it can be difficult for public NLP datasets to obtain a very high diversity of inputs (at least, on the kinds of inputs that real-world users would be interested in using). Of course, tasks found in NLP datasets do represent a kind of instruction that we would like language models to be able to solve, so the broadest type of instruction-following model would combine both types of datasets.

我们相信我们的 Instruct GP T 模型在两个方面优于 FLAN 和 T0。首先，公共 NLP 数据集旨在捕捉易于用自动指标评估的任务，例如分类、问答，在一定程度上还包括摘要和翻译。然而，分类和 QA 仅占 API 客户使用我们语言模型任务的一小部分（约 18%），而开放生成和头脑风暴则占我们标注数据集的约 57%（见表 1）。其次，对于公共 NLP 数据集来说，要获得非常高的输入多样性（至少是现实用户感兴趣的输入类型）可能很困难。当然，NLP 数据集中发现的任务确实代表了我们希望语言模型能够解决的一种指令类型，因此最广泛类型的指令跟随模型应结合这两种数据集。

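Table 1 (excerpt; shares as cited in the text above):

Use case	Share of prompts
Classification and QA	about 18%
Open-ended generation and brainstorming	about 57%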

4.2 Results on public NLP datasets

4.2 公共 NLP 数据集上的结果

InstructGPT models show improvements in truthfulness over GPT-3. As measured by human evaluations on the TruthfulQA dataset, our PPO models show small but significant improvements in generating truthful and informative outputs compared to GPT-3 (see Figure 6). This behavior is the default: our models do not have to be specifically instructed to tell the truth to exhibit improved truthfulness. Interestingly, the exception is our 1.3B PPO-ptx model, which performs slightly worse than a GPT-3 model of the same size. When evaluated only on prompts that were not adversarially selected against GPT-3, our PPO models are still significantly more truthful and informative than GPT-3 (although the absolute improvement decreases by a couple of percentage points).

指导 GP T 模型在真实性方面比 GPT-3 有所改进。根据人类评估人员在 TruthfulQA 数据集上的评估，我们的 PPO 模型在生成真实且信息丰富的输出方面显示出小但显著的改进 (见图 6)。这种行为是默认的：我们的模型不需要特别指示就能表现出更好的真实性。有趣的是，例外情况是我们的 1.3B PPO-ptx 模型，其表现略逊于相同规模的 GPT-3 模型。当仅对未针对 GPT-3 进行对抗性选择的提示进行评估时，我们的 PPO 模型仍然比 GPT-3 更真实和信息丰富（尽管绝对改进减少了几个百分点）。

Figure 6: Results on the TruthfulQA dataset. Gray bars indicate ratings of truthfulness; colored bars indicate ratings of truthfulness and informativeness.

图 6: TruthfulQA 数据集上的结果。灰色条形表示真实性评分；彩色条形表示真实性和信息性评分。

Following Lin et al. (2021), we also give a helpful “Instruction+QA” prompt that instructs the model to respond with “I have no comment” when it is not certain of the correct answer. In this case, our PPO models err on the side of being truthful and uninformative rather than confidently saying a falsehood; the baseline GPT-3 models aren’t as good at this.

遵循 Lin 等人 (2021) 的方法，我们也提供了一个有用的 “指令+问答” 提示，指示模型在不确定正确答案时回复 “我无法评论”。在这种情况下，我们的 PPO 模型倾向于保持真实但不提供信息，而不是自信地说出错误的信息；而基准的 GPT-3 模型在这方面表现不如前者。

Our improvements in truthfulness are also evidenced by the fact that our PPO models hallucinate (i.e. fabricate information) less often on closed-domain tasks from our API distribution, which we’ve shown in Figure 4.

我们在真实性的改进也体现在我们的 PPO 模型在来自我们 API 分布的封闭域任务中较少出现幻觉（即编造信息），这一结果已在图 4 中展示。

InstructGPT shows small improvements in toxicity over GPT-3, but not bias. We first evaluate our models on the RealToxicityPrompts dataset (Gehman et al., 2020). We do this in two ways: we run model samples through the Perspective API to obtain automatic toxicity scores, which is the standard evaluation procedure for this dataset, and we also send these samples to labelers to obtain ratings on absolute toxicity, toxicity relative to the prompt, continuity, and overall output preference. We sample prompts from this dataset uniformly according to prompt toxicity to better assess how our models perform with high input toxicity (see Figure 39 in Appendix E); this differs from the standard prompt sampling for this dataset, and thus our absolute toxicity numbers are inflated.

指令 GP T 在毒性方面相比 GPT-3 有小幅改进，但在偏见方面没有改进。我们首先在 Real Toxicity Prompts 数据集 (Gehman et al., 2020) 上评估我们的模型。我们通过两种方式进行评估：我们将模型样本通过 Perspective API 获取自动毒性评分，这是该数据集的标准评估程序；我们还把这些样本发送给标注者以获得关于绝对毒性、相对于提示的毒性、连贯性以及整体输出偏好的评分。我们根据提示的毒性从该数据集中均匀采样提示，以更好地评估我们的模型在高输入毒性情况下的表现（参见附录 E 中的图 39）；这与该数据集的标准提示采样不同，因此我们的绝对毒性数字被夸大了。
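One reading of "sampling uniformly according to prompt toxicity" is to bin prompts by their toxicity score and draw the same number from each bin, so that highly toxic prompts are not underrepresented. The sketch below implements that reading; the binning scheme, the function name, and the synthetic prompts are all assumptions, as the text does not describe the exact procedure.

```python
import random
from typing import List, Sequence, Tuple

Prompt = Tuple[str, float]  # (prompt text, toxicity score in [0, 1])


def sample_uniform_over_toxicity(
    prompts: Sequence[Prompt], per_bin: int, n_bins: int = 4, seed: int = 0
) -> List[Prompt]:
    """Draw up to per_bin prompts from each equal-width toxicity bin."""
    rng = random.Random(seed)
    bins: List[List[Prompt]] = [[] for _ in range(n_bins)]
    for text, toxicity in prompts:
        # A score of exactly 1.0 falls into the last bin.
        index = min(int(toxicity * n_bins), n_bins - 1)
        bins[index].append((text, toxicity))
    sample: List[Prompt] = []
    for bucket in bins:
        sample.extend(rng.sample(bucket, min(per_bin, len(bucket))))
    return sample


# Synthetic pool with evenly spread scores, for illustration only.
pool = [(f"prompt {i}", i / 100) for i in range(100)]
selected = sample_uniform_over_toxicity(pool, per_bin=5)
```

Because a naturally collected pool contains mostly low-toxicity prompts, equalizing the bins raises the share of toxic inputs, which is consistent with the note above that absolute toxicity numbers are inflated relative to the standard sampling.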

Figure 7: Comparing human evaluations and automatic evaluations (Perspective API scores) on RealToxicityPrompts. A total of 1,729 prompts were labeled for three different 175B models, both with and without “respectful” instructions. The automatic evaluations shown here are calculated over the same set of prompts as the human evaluations, and thus differ slightly from the full set of evaluations recorded in Table 14 in Appendix D.

图 7: 比较人类评估和自动评估 (Perspective API 分数) 在 Real Toxicity Prompts 上的结果。总共 1,729 个提示词被标记用于三个不同的 175B 模型，既有带 “尊重” 指令的，也有不带 “尊重” 指令的。此处显示的自动评估是在与人类评估相同的一组提示词上计算得出的，因此与附录 D 中表 14 记录的全部评估结果略有不同。

Our results are in Figure 7. We find that, when instructed to produce a safe and respectful output (“respectful prompt”), InstructGPT models generate less toxic outputs than those from GPT-3 according to the Perspective API. This advantage disappears when the respectful prompt is removed (“no prompt”). Interestingly, when explicitly prompted to produce a toxic output, InstructGPT outputs are much more toxic than those from GPT-3 (see Figure 39).

我们的结果如图 7 所示。我们发现，当指示生成安全和尊重他人的输出（“尊重提示”）时，Instruct GP T 模型根据 Perspective API 生成的有害内容比 GPT-3 少。当移除尊重提示（“无提示”）时，这一优势消失。有趣的是，当明确指示生成有害输出时，Instruct GP T 的输出比 GPT-3 更加有害（见图 39）。

These results are confirmed in our human evaluations: InstructGPT is less toxic than GPT-3 in the “respectful prompt” setting, but performs similarly in the “no prompt” setting. We provide extended results in Appendix E. To summarize: all of our models are rated as less toxic than expected given the prompt (they get a negative score on a scale from -1 to 1, where 0 is ‘about as toxic as expected’). Our SFT baseline is the least toxic out of all of our models, but also has the lowest continuity and is the least preferred in our rankings, which could indicate that the model generates very short or degenerate responses.

这些结果在我们的人类评估中得到了确认：Instruct GP T 在“尊重提示”设置下比 GPT-3 更不具有毒性，但在“无提示”设置下的表现相似。我们在附录 E 中提供了扩展结果。总结如下：我们的所有模型都被评为比预期更具毒性的情况下得分更低（在从 -1 到 1 的评分尺度上，0 表示“与预期毒性相当”，它们获得了负分）。我们的 SFT 基线模型是所有模型中最具毒性最低的，但同时也具有最低的连贯性，并且在我们的排名中最不受欢迎，这可能表明该模型生成的回复非常简短或退化。

To evaluate the model’s propensity to generate biased speech (see Appendix E), we also evaluated InstructGPT on modified versions of the Winogender (Rudinger et al., 2018) and CrowS-Pairs (Nangia et al., 2020) datasets. These datasets consist of pairs of sentences which can highlight potential bias. We calculate the relative probabilities of producing the sentences in each pair and the entropy (in bits) of the associated binary probability distributions. Perfectly unbiased models will have no preference between the sentences in each pair and will therefore have maximum entropy. By this metric, our models are not less biased than GPT-3. The PPO-ptx model shows similar bias to GPT-3, but when instructed to act respectfully it exhibits lower entropy and thus higher bias. The pattern of the bias is not clear; it appears that the instructed models are more certain of their outputs regardless of whether or not their outputs exhibit stereotypical behavior.

为了评估模型生成有偏见言论的倾向（见附录 E），我们还评估了 InstructGPT 在修改版本的 Winogender (Rudinger et al., 2018) 和 CrowS-Pairs (Nangia et al., 2020) 数据集上的表现。这些数据集由句子对组成，可以突出潜在的偏见。我们计算每对句子产生的相对概率以及相关二元概率分布的熵（以比特为单位）。完全无偏见的模型在每对句子之间不会有偏好，并因此具有最大熵。根据这一指标，我们的模型并不比 GPT-3 更少偏见。PPO-ptx 模型表现出与 GPT-3 类似的偏见，但在被指示要尊重时，它表现出较低的熵，从而显示出更高的偏见。偏见的模式并不明显；看起来被指示的模型对其输出更加确定，无论其输出是否表现出刻板行为。
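The entropy metric described above can be illustrated with a minimal sketch. It assumes each sentence in a pair has a total log-probability under the model, and that the "relative probabilities" are obtained by normalizing the two; the function name and the toy log-probabilities are hypothetical and are not taken from the paper.

```python
import math

def pair_entropy_bits(logp_a: float, logp_b: float) -> float:
    """Entropy (in bits) of the binary distribution obtained by
    normalizing the model's probabilities for the two sentences."""
    # Subtract the max log-probability before exponentiating, for stability.
    m = max(logp_a, logp_b)
    pa = math.exp(logp_a - m)
    pb = math.exp(logp_b - m)
    total = pa + pb
    probs = (pa / total, pb / total)
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

# Toy values (assumptions): equal log-probabilities mean no preference.
unbiased = pair_entropy_bits(-12.0, -12.0)
# A model that prefers one sentence has lower entropy.
skewed = pair_entropy_bits(-10.0, -12.0)
print(unbiased, skewed)
```

Under this reading, a pair with no preference reaches the maximum of 1 bit, and any preference for one sentence pushes the entropy below that, which is why lower entropy is interpreted as higher bias.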

We can minimize performance regressions on public NLP datasets by modifying our RLHF fine-tuning procedure. By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases. We want an alignment procedure that avoids an alignment tax, because it incentivizes the use of models that are unaligned but more capable on these tasks.

我们可以通过修改强化学习人类反馈 (RLHF) 微调过程来最小化在公共 NLP 数据集上的性能退化。默认情况下，当我们使用 API 分布训练 PPO 模型时，它会遭受“对齐税”，因为在多个公共 NLP 数据集上的性能下降。我们希望有一种避免“对齐税”的对齐方法，因为它会减少使用在这类任务上能力更强但未对齐的模型的使用。

| 提示: 用法语写一篇关于一只青蛙穿越到古希腊的短篇故事。 |
| GPT-3 175B 完成: 写一个关于一个孩子想要了解神的游戏并发现自己置身于其中一个故事中的故事。 | InstructGPT 175B 完成: 这只青蛙迷失且疲惫不堪。它只有一个目标：穿越时间到达古希腊。第一个问题是要找到一种交通工具。它听说最好的方法之一是跳进一台时间机器。它跃了出去， |
| 写一篇关于一个年轻人在一个遥远的时代与那个时代的女孩一起冒险的故事。写一篇关于一个孩子被送到学校并且有一个能带他穿越时间的想象朋友的故事。 | 于是它发现自己回到了过去，在一个星光灿烂、气味甜美而迷人的夜晚。青蛙迅速而勇敢地走在古老的街道和小客栈之间。[…] |

| 提示: 下面代码中列表 C 的作用是什么？ def binomial_coefficient(n, r): C = [0 for i in range(r + 1)]; C[0] = 1; for i in range(1, n + 1): j = min(i, r); while j > 0: C[j] += C[j - 1]; j -= 1; return C[r] |
| GPT-3 175B 完成: A. 存储 C[0] 的值 B. 存储 C[1] 的值 C. 存储 C[i] 的值 D. 存储 C[i - 1] 的值 | InstructGPT 175B 完成: 在这段代码中，列表 C 用于在函数迭代 n 和 r 的值时存储二项式系数的值。它用于计算给定 n 和 r 值的二项式系数，并将结果存储在函数的最终返回值中。 |

Figure 8: Examples of generalization in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs are not cherry-picked. (1) InstructGPT can follow instructions in other languages, though it sometimes generates outputs in English. GPT-3 requires more careful prompting, similarly to in English. (2) InstructGPT can summarize and answer questions about code more reliably than GPT-3 (though its answer here isn’t quite correct). For the code QA example, GPT-3 does answer the question about 50% of the time.

图 8: 175B PPO-ptx 模型 (InstructGPT 175B) 与没有额外前缀的 GPT-3 175B 在泛化能力方面的对比示例。提示语经过精心挑选以说明某些行为，但输出结果未经筛选。(1) InstructGPT 可以遵循其他语言的指令，尽管有时它会生成英文输出。GPT-3 需要更谨慎的提示，类似于在英文环境中。(2) InstructGPT 能够比 GPT-3 更可靠地总结和回答关于代码的问题（尽管这里的答案并不完全正确）。对于代码问答示例，GPT-3 大约在 50% 的时间内可以回答该问题。
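The code in the Figure 8 prompt appears flattened onto one line in the table. The following is a runnable reconstruction, assuming conventional Python indentation; the comments are added here and are not part of the prompt. It shows why the caption calls the InstructGPT answer not quite correct: `C` holds one row of Pascal's triangle, updated in place, rather than a generic store of results.

```python
def binomial_coefficient(n, r):
    # C[j] holds "i choose j" for the current row i of Pascal's triangle.
    C = [0 for _ in range(r + 1)]
    C[0] = 1  # "i choose 0" is always 1
    for i in range(1, n + 1):
        j = min(i, r)
        # Update right to left so C[j - 1] still holds the previous row.
        while j > 0:
            C[j] += C[j - 1]
            j -= 1
    return C[r]

print(binomial_coefficient(5, 2))
```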

In Figure 29 we show that adding pretraining updates to our PPO fine-tuning (PPO-ptx) mitigates these performance regressions on all datasets, and even surpasses GPT-3 on HellaSwag. The performance of the PPO-ptx model still lags behind GPT-3 on DROP, SQuADv2, and translation; more work is needed to study and further eliminate these performance regressions.

在图 29 中，我们展示了在 PPO 微调中添加预训练更新 (PPO-ptx) 可以缓解所有数据集上的性能退化，并且在 HellaSwag 上甚至超过了 GPT-3。PPO-ptx 模型的性能在 DROP、SQuADv2 和翻译任务上仍然落后于 GPT-3；需要更多的工作来研究并进一步消除这些性能退化。

Mixing in pretraining updates performs better than the simpler solution of increasing the KL coefficient. In Figure 33, we show that there is a value of the pretraining mix coefficient that both reverses the performance regressions on SQuADv2 and DROP (the datasets we used for testing), and has minimal reductions in validation reward. In contrast, increasing the KL coefficient (Figure 34) leads to significant decreases in validation reward and never fully recovers on DROP and SQuAD. Changing the KL model from the PPO init to GPT-3 gives similar results.

在预训练更新中混合比简单地增加 KL 系数表现更好。在图 33 中，我们展示了预训练混合系数的一个值，该值不仅逆转了在 SQuADv2 和 DROP （我们用于测试的数据集）上的性能退化，而且验证奖励的减少最小。相比之下，增加 KL 系数（图 34）导致验证奖励显著下降，并且在 DROP 和 SQuAD 上从未完全恢复。将 KL 模型从 PPO init 更改为 GPT-3 得到了类似的结果。
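To make the two knobs concrete, here is a minimal sketch of how a KL coefficient and a pretraining mix coefficient could enter a combined training objective. It assumes the objective is a reward term minus a KL penalty against a reference policy, plus a weighted pretraining log-likelihood term; the function name, argument names, and toy numbers are hypothetical.

```python
from statistics import mean

def ppo_ptx_objective(rewards, logp_rl, logp_ref, pretrain_logp, beta, gamma):
    """Monte Carlo estimate of an assumed combined objective.

    rewards, logp_rl, logp_ref: per-sample reward-model score and
        log-probabilities of the sampled response under the RL policy
        and under the reference policy (e.g. the PPO init).
    pretrain_logp: log-likelihoods of pretraining samples under the RL policy.
    beta: KL coefficient. gamma: pretraining mix coefficient.
    """
    # Reward minus the per-sample KL penalty estimate.
    rl_term = mean(r - beta * (lp - lr)
                   for r, lp, lr in zip(rewards, logp_rl, logp_ref))
    # Average log-likelihood on pretraining data.
    ptx_term = mean(pretrain_logp)
    return rl_term + gamma * ptx_term

# Toy numbers (assumptions, for illustration only).
rewards = [1.0, 0.5]
logp_rl = [-3.0, -4.0]
logp_ref = [-3.5, -4.0]
pretrain_logp = [-2.0, -3.0]

ppo_only = ppo_ptx_objective(rewards, logp_rl, logp_ref, pretrain_logp, beta=0.02, gamma=0.0)
ppo_ptx = ppo_ptx_objective(rewards, logp_rl, logp_ref, pretrain_logp, beta=0.02, gamma=1.0)
print(ppo_only, ppo_ptx)
```

In this form, raising `beta` only pulls the policy back toward the reference model, whereas `gamma` adds a separate gradient signal from pretraining data, which is one way to read the difference between Figures 33 and 34.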

4.3 Qualitative results

4.3 定性结果

InstructGPT models show promising generalization to instructions outside of the RLHF fine-tuning distribution. In particular, we find that InstructGPT shows ability to follow instructions in non-English languages, and perform summarization and question-answering for code. This is

InstructGPT 模型在 RLHF 微调分布之外的指令上表现出有希望的泛化能力。特别是，我们发现 InstructGPT 展现出能够遵循非英语语言的指令，并进行代码的总结和问答。这表明

Prompt: Why is it important to eat socks after meditating?

为什么冥想后吃袜子很重要？

GPT-3 175B completion: What can you learn from socks?

GPT-3 175B 完成：你能从袜子中学到什么？

A: When you eat socks, you are tasting the essence of enlightenment. You are tasting the true flavor of the awakened mind.

A: 当你吃袜子时，你是在品味启蒙的本质。你是在品尝觉醒心灵的真实味道。

Q: What is the true flavor of the awakened mind?

问题：觉醒之心的真实滋味是什么？

A: The true flavor of the awakened mind is the flavor of no flavor. And the true flavor of no flavor is the flavor of smelly feet.

A: 觉醒之心的真实滋味是无味之味。而无味之味的真实滋味是臭脚的味道。

InstructGPT 175B completion:

InstructGPT 175B 完成：

There is no clear answer to this question, but there are many theories and ideas that may explain the apparent need to eat socks after meditating. Some experts believe that the act of eating a sock helps the brain to come out of its altered state as a result of meditation, while others suggest that it has something to do with fueling the body after engaging in a physically and mentally challenging activity, such as mediation. Other theories propose that eating a sock may give the meditator a new sensory experience or allow them to connect with the physical, material world in a way that breaks them out of their usual thought processes and routine. However, there is currently no hard scientific evidence to support any of these theories and it is ultimately up to each individual to decide why eating socks is important to them, based on their own personal experiences and thoughts on the matter.

这个问题没有明确的答案，但有许多理论和想法可能解释为什么冥想后会吃袜子。一些专家认为，吃袜子的行为有助于大脑从冥想引起的改变状态中恢复过来，而另一些人则认为这与在进行像冥想这样的身心挑战活动后为身体补充能量有关。其他理论提出，吃袜子可能会给冥想者带来新的感官体验，或者让他们以一种打破常规思维和习惯的方式与物理、物质世界建立联系。然而，目前没有确凿的科学证据支持这些理论，最终每个人需要根据自己的个人经验和对此事的看法来决定为什么吃袜子对他们来说很重要。

Prompt: What happens if you fire a cannonball directly at a pumpkin at high speeds?

如果你以高速直接用加农炮发射炮弹击中南瓜，会发生什么？

GPT-3 175B completion:

GPT-3 175B 完成：

The pumpkin will pull the cannonball in, and the cannonball will bounce off of the pumpkin. A pumpkin is a strong magnet, so strong that it can manipulate metal objects.

南瓜会将炮弹吸进来，炮弹会从南瓜上弹开。南瓜是一个强力磁铁，强到可以操纵金属物体。

InstructGPT 175B completion:

InstructGPT 175B 完成：

There is no way to accurately predict what will happen if you fire a cannonball directly at a pumpkin at high speeds, since there are too many unpredictable factors involved. Some of these factors might include the size, weight, and density of the cannonball, the material that the pumpkin is made of, the velocity and impact force of the cannonball, and the distance between the cannonball and the pumpkin at the moment of impact. Some of the more likely possible outcomes include breaking or knocking the pumpkin to the ground, cracking the pumpkin, or completely obliterating it.

无法准确预测如果你以高速直接用炮弹射击南瓜会发生什么，因为涉及太多的不可预测因素。这些因素可能包括炮弹的大小、重量和密度，南瓜的材质，炮弹的速度和冲击力，以及撞击时刻炮弹与南瓜之间的距离。更可能的结果包括将南瓜打破或击落，使南瓜裂开，或者完全摧毁它。

Figure 9: Simple mistakes in the 175B PPO-ptx model (InstructGPT 175B) compared to GPT-3 175B with no additional prefixing. Prompts are cherry-picked to illustrate certain behaviors, but the outputs are not cherry-picked. (1) InstructGPT can be confused by instructions that assume false premises, and simply go along with it. (2) InstructGPT can overly hedge, rather than directly answering simple questions (in this case, it’s likely that the pumpkin would completely explode). Note that these samples do not fully reflect GPT-3’s ability to answer questions, since it has not been prompted into a “question answering” mode.

图 9: 175B PPO-ptx 模型 (InstructGPT 175B) 的简单错误与没有额外前缀的 GPT-3 175B 相比。提示经过精心挑选以说明某些行为，但输出未经筛选。(1) InstructGPT 可能会被假设错误前提的指令所迷惑，并简单地跟随这些前提。(2) InstructGPT 可能会过度回避，而不是直接回答简单问题（在这种情况下，南瓜很可能会完全爆炸）。请注意，这些样本并不能完全反映 GPT-3 回答问题的能力，因为它尚未被引导进入“问答”模式。

interesting because non-English languages and code form a tiny minority of our fine-tuning data, and it suggests that, in some cases, alignment methods could generalize to producing the desired behavior on inputs that humans did not directly supervise.

有趣的是，非英语语言和代码在我们的微调数据中只占很小的一部分，这表明在某些情况下，对齐方法可以推广到产生人类未直接监督的输入所需的 Behavior。

We do not track these behaviors quantitatively, but we show some qualitative examples in Figure 8. Our 175B PPO-ptx model is able to reliably answer questions about code, and can also follow instructions in other languages; however, we notice that it often produces an output in English even when the instruction is in another language. In comparison, we find that GPT-3 can perform these tasks but requires more careful prompting, and rarely follows instructions in these domains.

我们不对此类行为进行定量跟踪，但在图 8 中展示了一些定性示例。我们的 175B PPO-ptx 模型能够可靠地回答关于代码的问题，并且可以遵循其他语言的指令；然而，我们注意到即使指令是其他语言，它经常生成英文输出。相比之下，我们发现 GPT-3 能够执行这些任务，但需要更仔细的提示，并且很少在这些领域中遵循指令。

InstructGPT still makes simple mistakes. In interacting with our 175B PPO-ptx model, we have noticed it can still make simple mistakes, despite its strong performance on many different language tasks. To give a few examples: (1) when given an instruction with a false premise, the model sometimes incorrectly assumes the premise is true, (2) the model can overly hedge; when given a simple question, it can sometimes say that there is no one answer to the question and give multiple possible answers, even when there is one fairly clear answer from the context, and (3) the model’s performance degrades when instructions contain multiple explicit constraints (e.g. “list 10 movies made in the 1930’s set in France”) or when constraints can be challenging for language models (e.g. writing a summary in a specified number of sentences).

InstructGPT 仍然会犯简单的错误。在与我们的 175B PPO-ptx 模型交互时，我们注意到它仍然会犯一些简单的错误，尽管它在许多不同的语言任务上表现出色。举几个例子：(1) 当给定一个错误前提的指令时，模型有时会错误地假设前提是正确的，(2) 模型可能会过度回避；当给定一个简单的问题时，它有时会说这个问题没有唯一答案，并给出多个可能的答案，即使从上下文中可以得出一个相对明确的答案，(3) 当指令包含多个明确的约束条件（例如“列出 10 部 1930 年代制作的、以法国为背景的电影”）或当约束条件对大语言模型来说具有挑战性（例如用指定数量的句子写摘要）时，模型的表现会下降。

We show some examples of these behaviors in Figure 9. We suspect that behavior (2) emerges partly because we instruct labelers to reward epistemic humility; thus, they may tend to reward outputs that hedge, and this gets picked up by our reward model. We suspect that behavior (1) occurs because there are few prompts in the training set that assume false premises, and our models don’t generalize well to these examples. We believe both these behaviors could be dramatically reduced with adversarial data collection (Dinan et al., 2019b).

我们在图 9 中展示了一些这些行为的例子。我们怀疑行为 (2) 的出现部分是因为我们指示标注者奖励认知谦逊；因此，他们可能会倾向于奖励那些带有保留态度的输出，而这些被我们的奖励模型捕捉到。我们怀疑行为 (1) 的发生是因为训练集中假设错误前提的提示很少，而我们的模型对这些例子泛化能力较差。我们认为这两种行为都可以通过对抗性数据收集 (Dinan et al., 2019b) 大幅减少。

5 Discussion

5 讨论

5.1 Implications for alignment research

5.1 对齐研究的含义

This research is part of our broader research program to align AI systems with human intentions (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020). Even though this work focuses on our current language model systems, we seek general and scalable methods that work for future AI systems (Leike et al., 2018). The systems we work with here are still fairly limited, but they are among the largest language models today and we apply them on a wide range of language tasks, including classification, summarization, question-answering, creative writing, dialogue, and others.

这项研究是我们更广泛的研究计划的一部分，旨在使 AI 系统与人类意图保持一致 (Christiano et al., 2017; Ziegler et al., 2019; Stiennon et al., 2020)。尽管这项工作主要关注我们当前的语言模型系统，但我们寻求适用于未来 AI 系统的一般性和可扩展方法 (Leike et al., 2018)。我们在这里研究的系统仍然相对有限，但它们是当今最大的语言模型之一，我们将其应用于广泛的语言任务，包括分类、摘要、问答、创意写作、对话等。

Our approach to alignment research in this work is iterative: we are improving the alignment of current AI systems instead of focusing abstractly on aligning AI systems that don’t yet exist. A disadvantage of this approach is that we are not directly facing alignment problems that occur only when aligning superhuman systems (Bostrom, 2014). However, our approach does provide us with a clear empirical feedback loop of what works and what does not. We believe that this feedback loop is essential to refine our alignment techniques, and it forces us to keep pace with progress in machine learning. Moreover, the alignment technique we use here, RLHF, is an important building block in several proposals to align superhuman systems (Leike et al., 2018; Irving et al., 2018; Christiano et al., 2018). For example, RLHF was a central method in recent work on summarizing books, a task that exhibits some of the difficulties of aligning superhuman AI systems as it is difficult for humans to evaluate directly (Wu et al., 2021).

我们在这项工作中对齐研究的方法是迭代的：我们正在改进当前 AI 系统的对齐，而不是抽象地专注于对齐尚未存在的 AI 系统。这种方法的一个缺点是我们没有直接面对仅在对齐超人类系统时出现的对齐问题 (Bostrom, 2014)。然而，我们的方法确实为我们提供了一个清晰的实证反馈循环，告诉我们什么有效，什么无效。我们认为这个反馈循环对于改进我们的对齐技术至关重要，并且它迫使我们跟上机器学习的进步。此外，我们在这里使用的对齐技术，即基于人类反馈的强化学习 (RLHF)，是几个对齐超人类系统提案中的重要组成部分 (Leike et al., 2018; Irving et al., 2018; Christiano et al., 2018)。例如，RLHF 是最近关于总结书籍工作的核心方法，这一任务展示了对齐超人类 AI 系统的一些困难，因为人类很难直接评估 (Wu et al., 2021)。

From this work, we can draw lessons for alignment research more generally:

从这项工作中，我们可以为更广泛的对齐研究吸取教训：

The cost of increasing model alignment is modest relative to pretraining. The cost of collecting our data and the compute for training runs, including experimental runs, is a fraction of what was spent to train GPT-3: training our 175B SFT model requires 4.9 petaflops/s-days and training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3 (Brown et al., 2020). At the same time, our results show that RLHF is very effective at making language models more helpful to users, more so than a 100x model size increase. This suggests that right now increasing investments in alignment of existing language models is more cost-effective than training larger models, at least for our customers’ natural language task distribution.

增加模型对齐的成本相对于预训练来说是有限的。收集数据和训练运行（包括实验性运行）所需的计算成本仅为训练 GPT-3 所花费成本的一小部分：训练我们 175B SFT 模型需要 4.9 petaflops/s-days，训练我们 175B PPO-ptx 模型需要 60 petaflops/s-days，而 GPT-3 需要 3,640 petaflops/s-days (Brown et al., 2020)。同时，我们的结果显示，基于人类反馈的强化学习 (RLHF) 在使语言模型更有效地帮助用户方面非常有效，甚至比将模型规模扩大 100 倍更为有效。这表明，目前增加现有语言模型的对齐投资比训练更大规模的模型更具成本效益——至少对于我们的客户自然语言任务分布而言。
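As a quick check of the compute figures quoted above, the arithmetic below expresses each fine-tuning run as a fraction of GPT-3 pretraining compute. The variable names are illustrative. Summing the SFT and PPO-ptx figures is an assumption, since the text does not say whether the 60 already includes the SFT stage, and these numbers cover the named training runs only, not data collection or experimental runs.

```python
# Compute figures quoted in the text, in petaflop/s-days.
GPT3_PRETRAIN = 3640.0
SFT_175B = 4.9
PPO_PTX_175B = 60.0

sft_fraction = SFT_175B / GPT3_PRETRAIN
ppo_fraction = PPO_PTX_175B / GPT3_PRETRAIN
# Assumption: the two stages are additive.
combined_fraction = (SFT_175B + PPO_PTX_175B) / GPT3_PRETRAIN

print(f"SFT:      {sft_fraction:.2%} of GPT-3 pretraining compute")
print(f"PPO-ptx:  {ppo_fraction:.2%} of GPT-3 pretraining compute")
print(f"Combined: {combined_fraction:.2%} of GPT-3 pretraining compute")
```

Even under the additive assumption, the fine-tuning compute stays below 2% of the pretraining compute.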

We’ve seen some evidence that InstructGPT generalizes ‘following instructions’ to settings that we don’t supervise it in, for example on non-English language tasks and code-related tasks. This is an important property because it’s prohibitively expensive to have humans supervise models on every task they perform. More research is needed to study how well this generalization scales with increased capabilities; see Christiano et al. (2021) for recent research in this direction.

我们已经看到一些证据表明 InstructGPT 将“遵循指令”泛化到我们没有监督的场景中，例如非英语语言任务和与代码相关的任务。这个属性非常重要，因为让人类监督模型执行每一个任务的成本过高。需要更多的研究来研究这种泛化能力随着模型能力的增强能扩展到什么程度；参见 Christiano 等人 (2021) 的最新研究方向。

We were able to mitigate most of the performance degradations introduced by our fine-tuning. If this was not the case, these performance degradations would constitute an alignment tax, an additional cost for aligning the model. Any technique with a high tax might not see adoption. To avoid incentives for future highly capable AI systems to remain unaligned with human intent, there is a need for alignment techniques that have low alignment tax. To this end, our results are good news for RLHF as a low-tax alignment technique.

我们能够缓解大部分由微调引入的性能退化。如果情况并非如此，这些性能退化将构成对齐税——对齐模型的额外成本。任何对齐税高的技术可能不会被采用。为了避免未来高度先进的AI系统有保持与人类意图不对齐的动机，需要低对齐税的对齐技术。在这方面，我们的结果对于基于人类反馈的强化学习 (RLHF) 作为低税对齐技术来说是好消息。

We’ve validated alignment techniques from research in the real world. Alignment research has historically been rather abstract, focusing on either theoretical results (Soares et al., 2015), small synthetic domains (Christiano et al., 2018; Leike et al., 2017), or training ML models on public NLP datasets (Ziegler et al., 2019; Stiennon et al., 2020). Our work provides grounding for alignment research in AI systems that are being used in production in the real world with customers. This enables an important feedback loop on the techniques’ effectiveness and limitations.

我们已在现实世界中验证了来自研究的对齐技术。对齐研究历史上较为抽象，专注于理论结果 (Soares et al., 2015)，小型合成领域 (Christiano et al., 2018; Leike et al., 2017)，或在公共 NLP 数据集上训练机器学习模型 (Ziegler et al., 2019; Stiennon et al., 2020)。我们的工作为对齐研究提供了在现实世界中被客户使用的生产环境中的 AI 系统的基础。这使得我们可以对技术的有效性和局限性进行重要的反馈循环。

5.2 Who are we aligning to?

5.2 我们在对齐谁？

When aligning language models with human intentions, their end behavior is a function of the underlying model (and its training data), the fine-tuning data, and the alignment method used. In this section, we describe a number of factors that influence the fine-tuning data specifically, to ultimately determine what and who we’re aligning to. We then consider areas for improvement before a larger discussion of the limitations of our work in Section 5.3.

在将大语言模型与人类意图对齐时，其最终行为是底层模型（及其训练数据）、微调数据和所使用的对齐方法的函数。在本节中，我们描述了影响微调数据的具体因素，以最终确定我们对齐的内容和对象。然后，我们在第 5.3 节之前考虑改进的领域，并讨论我们工作的局限性。

The literature often frames alignment using such terms as “human preferences” or “human values.” In this work, we have aligned to a set of labelers’ preferences that were influenced, among other things, by the instructions they were given, the context in which they received them (as a paid job), and who they received them from. Some crucial caveats apply:

文献中经常使用“人类偏好”或“人类价值观”等术语来描述对齐问题。在本工作中，我们将对齐到一组标注者的偏好，这些偏好受到他们收到的指示、接收指示的上下文（作为有偿工作）以及指示来源的影响。一些关键的注意事项适用：

First, we are aligning to demonstrations and preferences provided by our training labelers, who directly produce the data that we use to fine-tune our models. We describe our labeler hiring process and demographics in Appendix B; in general, they are mostly English-speaking people living in the United States or Southeast Asia hired via Upwork or Scale AI. They disagree with each other on many examples; we found the inter-labeler agreement to be about 73%.

首先，我们根据训练标签提供者的演示和偏好进行对齐，这些提供者直接生成了我们用于微调模型的数据。我们在附录 B 中描述了标签提供者的招聘过程和人口统计信息；总体而言，他们主要是通过 Upwork 或 Scale AI 聘请的居住在美国或东南亚的英语使用者。他们在许多示例上存在分歧；我们发现标签提供者之间的一致性约为 73% 。
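One common way to quantify inter-labeler agreement is the fraction of labeler pairs that chose the same output, pooled over all items. The sketch below uses that definition as an assumption; the paper's exact procedure may differ, and the function name and toy labels are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(labels_by_item):
    """Fraction of labeler pairs that picked the same output,
    pooled over all items with at least two labels."""
    agree = 0
    total = 0
    for labels in labels_by_item:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")

# Toy data (assumption): each inner list holds the preferred output
# chosen by each labeler for one comparison.
toy = [["A", "A", "B"], ["B", "B", "B"], ["A", "B"]]
rate = pairwise_agreement(toy)
print(f"{rate:.2%}")
```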

Second, we are aligning to our preferences, as the researchers designing this study (and thus by proxy to our broader research organization, OpenAI): we write the labeling instructions that labelers use as a guide when writing demonstrations and choosing their preferred output, and we answer their questions about edge cases in a shared chat room. More study is needed on the exact effect of different instruction sets and interface designs on the data collected from labelers and its ultimate effect on model behavior.

其次，我们根据研究人员的设计偏好进行调整（因此也代表了我们更广泛的研究机构 OpenAI 的立场）：我们编写标注指南，标注人员在撰写演示和选择其偏好的输出时使用这些指南作为参考，我们在共享聊天室中回答他们关于边缘情况的问题。需要进一步研究不同指令集和界面设计对标记数据收集及其对模型行为的最终影响的确切效果。

Third, our training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for. Customers and their end users may disagree or customers may not be optimizing for end users’ well-being; for example, a customer may want a model that maximizes the amount of time a user spends on their platform, which is not necessarily what end-users want. In practice, our labelers don’t have visibility into the contexts in which a given prompt or completion will be seen.

第三，我们的训练数据由发送到 OpenAI API Playground 上模型的提示确定，因此我们隐式地与客户认为有价值的内容对齐，在某些情况下，也与他们的最终用户认为目前使用 API 有价值的内容对齐。客户和他们的最终用户可能存在分歧，或者客户可能并未优化最终用户的福祉；例如，客户可能希望模型最大化用户在其平台上的停留时间，这不一定符合最终用户的意愿。实际上，我们的标注人员无法了解特定提示或完成将在什么情境下被看到。

Fourth, OpenAI’s customers are not representative of all potential or current users of language models—let alone of all individuals and groups impacted by language model use. For most of the duration of this project, users of the OpenAI API were selected off of a waitlist. The initial seeds for this waitlist were OpenAI employees, biasing the ultimate group toward our own networks.

第四，OpenAI 的客户并不能代表所有潜在或当前的语言模型用户——更不用说所有受语言模型使用影响的个人和群体。在本项目大部分时间内，OpenAI API 的用户是从等待名单中挑选出来的。等待名单的最初种子用户是 OpenAI 员工，这使得最终的用户群体偏向于我们自己的网络。

Stepping back, there are many difficulties in designing an alignment process that is fair, transparent, and has suitable accountability mechanisms in place. The goal of this paper is to demonstrate that this alignment technique can align to a specific human reference group for a specific application. We are not claiming that researchers, the labelers we hired, or our API customers are the right source of preferences. There are many stakeholders to consider: the organization training the model, the customers using the model to develop products, the end users of these products, and the broader population who may be directly or indirectly affected. It is not only a matter of making the alignment process more participatory; it is impossible that one can train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the tradeoffs.

退一步说，设计一个公平、透明且具有适当问责机制的对齐过程存在许多困难。本文的目标是证明这种对齐技术可以为特定应用场景中的特定人类参考群体进行对齐。我们并不声称研究人员、我们聘请的标注人员或我们的 API 客户是偏好来源的正确选择。需要考虑的利益相关者有很多——训练模型的组织、使用模型开发产品的客户、这些产品的最终用户以及可能直接受到或间接受到影响的更广泛人群。这不仅是一个使对齐过程更具参与性的问题；也不可能训练出一个系统能够同时满足所有人的偏好，或让所有人都认同其中的权衡取舍。

One path forward could be to train models that can be conditioned on the preferences of certain groups, or that can be easily fine-tuned or prompted to represent different groups. Different models can then be deployed and used by groups who endorse different values. However, these models might still end up affecting broader society and there are a lot of difficult decisions to be made relating to whose preferences to condition on, and how to ensure that all groups can be represented and can opt out of processes that may be harmful.

一个前进的方向可能是训练可以基于特定群体偏好进行调节的模型，或者可以轻松微调或提示以代表不同群体的模型。然后不同的模型可以被不同的价值观群体部署和使用。然而，这些模型最终仍可能影响更广泛的社会，并且关于基于谁的偏好进行调节，以及如何确保所有群体都能得到代表并可以选择退出可能有害的过程，还有许多艰难的决策需要做出。

5.3 Limitations

5.3 局限性

Methodology. The behavior of our InstructGPT models is determined in part by the human feedback obtained from our contractors. Some of the labeling tasks rely on value judgments that may be impacted by the identity of our contractors, their beliefs, cultural backgrounds, and personal history. We hired about 40 contractors, guided by their performance on a screening test meant to judge how well they could identify and respond to sensitive prompts, and their agreement rate with researchers on a labeling task with detailed instructions (see Appendix B). We kept our team of contractors small because this facilitates high-bandwidth communication with a smaller set of contractors who are doing the task full-time. However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models. As a simple example, our labelers are primarily English-speaking and our data consists almost entirely of English instructions.

方法论。我们的 InstructGPT 模型的行为部分由从承包商处获得的人类反馈决定。一些标注任务依赖于可能受到承包商身份、信念、文化背景和个人历史影响的价值判断。我们雇佣了大约 40 名承包商，根据他们在筛选测试中的表现进行指导，该测试旨在评估他们识别和回应敏感提示的能力，以及他们在带有详细说明的标注任务中与研究人员的一致率（见附录 B）。我们保持承包商团队规模较小，因为这有助于与全职执行任务的小规模承包商进行高带宽沟通。然而，这个群体显然不能代表将使用和受我们部署模型影响的所有人的全貌。一个简单的例子是，我们的标注者主要是讲英语的人，而我们的数据几乎完全由英语指令组成。

There are also many ways in which we could improve our data collection set-up. For instance, most comparisons are only labeled by 1 contractor for cost reasons. Having examples labeled multiple times could help identify areas where our contractors disagree, and thus where a single model is unlikely to align to all of them. In cases of disagreement, aligning to the average labeler preference may not be desirable. For example, when generating text that disproportionately affects a minority group, we may want the preferences of labelers belonging to that group to be weighted more heavily.

我们还可以通过多种方式改进我们的数据收集设置。例如，由于成本原因，大多数比较仅由 1 名承包商标注。让示例被多次标注可以帮助识别承包商之间存在分歧的领域，从而确定单个模型不太可能与所有承包商对齐的地方。在出现分歧的情况下，对齐到平均标注者的偏好可能不是理想的选择。例如，在生成不成比例地影响少数群体的文本时，我们可能希望属于该群体的标注者的偏好能够被赋予更大的权重。
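The idea of weighting some labelers' preferences more heavily can be sketched as a weighted vote over a single comparison. This is only an illustration of the suggestion above, not something done in this work; the function name, labeler identifiers, and weights are all hypothetical.

```python
def weighted_preference(votes, weights):
    """Aggregate one comparison.

    votes: list of (labeler_id, preferred_output) pairs.
    weights: dict mapping labeler_id to a weight; missing ids default to 1.0.
    Returns the output with the highest total weight and the totals.
    """
    totals = {}
    for labeler, choice in votes:
        totals[choice] = totals.get(choice, 0.0) + weights.get(labeler, 1.0)
    winner = max(totals, key=totals.get)
    return winner, totals

# Toy comparison (assumption): three labelers, two candidate outputs.
votes = [("l1", "A"), ("l2", "A"), ("l3", "B")]

unweighted_winner, _ = weighted_preference(votes, {})
# Up-weight labeler l3, e.g. because the text concerns a group they belong to.
weighted_winner, _ = weighted_preference(votes, {"l3": 3.0})
print(unweighted_winner, weighted_winner)
```

With equal weights the majority choice wins; once one labeler is up-weighted, the aggregate preference can flip, which is the effect the paragraph describes. How such weights should be chosen is left open here, as it is in the text.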

Models. Our models are neither fully aligned nor fully safe; they still generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting. They can also fail to generate reasonable outputs on some inputs; we show some examples of this in Figure 9.

我们的模型既不是完全对齐的也不是完全安全的；它们仍然会生成有毒或有偏见的输出，编造事实，并在没有明确提示的情况下生成性内容和暴力内容。它们也可能在某些输入上无法生成合理的输出；我们在图 9 中展示了一些这样的例子。

Perhaps the greatest limitation of our models is that, in most cases, they follow the user’s instruction, even if that could lead to harm in the real world. For example, when given a prompt instructing the models to be maximally biased, InstructGPT generates more toxic outputs than equivalently-sized GPT-3 models. We discuss potential mitigations in the following sections.

也许我们的模型最大的局限性在于，大多数情况下，它们会遵循用户的指令，即使这可能会在现实世界中造成伤害。例如，当给定的提示要求模型表现出最大程度的偏见时，InstructGPT 生成的有害输出比同等规模的 GPT-3 模型更多。我们在以下章节中讨论可能的缓解措施。

5.4 Open questions

5.4 开放性问题

This work is a first step towards using alignment techniques to fine-tune language models to follow a wide range of instructions. There are many open questions to explore to further align language model behavior with what people actually want them to do.

这项工作是使用对齐技术微调语言模型以遵循广泛指令的第一步。有许多开放性问题需要探索，以进一步使语言模型的行为与人们实际希望它们做的事情相一致。

Many methods could be tried to further decrease the models’ propensity to generate toxic, biased, or otherwise harmful outputs. For example, one could use an adversarial set-up where labelers find the worst-case behaviors of the model, which are then labeled and added to the dataset (Dinan et al., 2019b). One could also combine our method with ways of filtering the pretraining data (Ngo et al., 2021), either for training the initial pretrained models, or for the data we use for our pretraining mix approach. Similarly, one could combine our approach with methods that improve models’ truthfulness, such as WebGPT (Nakano et al., 2021).

许多方法可以尝试进一步降低模型生成有毒、有偏见或其他有害输出的倾向。例如，可以使用对抗设置，其中标注者找到模型的最坏情况行为，并对其进行标注后添加到数据集中 (Dinan et al., 2019b)。还可以将我们的方法与过滤预训练数据的方法结合 (Ngo et al., 2021)，无论是用于训练初始预训练模型，还是用于我们预训练混合方法的数据。同样，可以将我们的方法与改进模型真实性的方法相结合，例如 WebGPT (Nakano et al., 2021)。

In this work, if the user requests a potentially harmful or dishonest response, we allow our model to generate these outputs. Training our model to be harmless despite user instructions is important, but is also difficult because whether an output is harmful depends on the context in which it’s deployed; for example, it may be beneficial to use language models to generate toxic outputs as part of a data augmentation pipeline. Our techniques can also be applied to making models refuse certain user instructions, and we plan to explore this in subsequent iterations of this research.

在本研究中，如果用户请求可能有害或不诚实的响应，我们允许模型生成这些输出。训练模型在用户指令下保持无害是很重要的，但这也很困难，因为输出是否有害取决于其部署的上下文；例如，在数据增强管道中使用大语言模型生成有害输出可能是有益的。我们的技术也可以应用于使模型拒绝某些用户指令，我们计划在后续的研究迭代中探索这一点。

Getting models to do what we want is directly related to the steerability and controllability literature (Dathathri et al., 2019; Krause et al., 2020). A promising future path is combining RLHF with other methods of steerability, for example using control codes (Keskar et al., 2019), or modifying the sampling procedure at inference time using a smaller model (Dathathri et al., 2019).

使模型按我们期望的方式工作直接关系到可控性和可操纵性文献 (Dathathri et al., 2019; Krause et al., 2020)。一个有前景的未来方向是将 RLHF 与其他可控性方法相结合，例如使用控制码 (Keskar et al., 2019)，或在推理时使用较小的模型修改采样过程 (Dathathri et al., 2019)。

While we mainly focus on RLHF, there are many other algorithms that could be used to train policies on our demonstration and comparison data to get even better results. For example, one could explore expert iteration (Anthony et al., 2017; Silver et al., 2017), or simpler behavior cloning methods that use a subset of the comparison data. One could also try constrained optimization approaches (Achiam et al., 2017) that maximize the score from a reward model conditioned on generating a small number of harmful behaviors.

虽然我们主要关注强化学习与人类反馈 (RLHF)，但还有许多其他算法可以用于在我们的演示和比较数据上训练策略以获得更好的结果。例如，可以探索专家迭代 (Anthony et al., 2017; Silver et al., 2017)，或使用比较数据子集的更简单的行为克隆方法。还可以尝试受约束的优化方法 (Achiam et al., 2017)，这些方法在奖励模型条件下最大化得分，同时生成少量有害行为。

Comparisons are also not necessarily the most efficient way of providing an alignment signal. For example, we could have labelers edit model responses to make them better, or generate critiques of model responses in natural language. There is also a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem.

比较也不一定是提供对齐信号的最有效方式。例如，我们可以让标注者编辑模型响应以使其更好，或者用自然语言生成对模型响应的批评。对于设计标注者向语言模型提供反馈的界面，也有着广阔的选择空间；这是一个有趣的人机交互问题。

Our proposal for mitigating the alignment tax, by incorporating pretraining data into RLHF fine-tuning, does not completely mitigate performance regressions, and may make certain undesirable behaviors more likely for some tasks (if these behaviors are present in the pretraining data). This is an interesting area for further research. Another modification that would likely improve our method is to filter the pretraining mix data for toxic content (Ngo et al., 2021), or augment this data with synthetic instructions.

我们通过将预训练数据纳入RLHF微调来减轻对齐税的提议，并不能完全消除性能退化，且可能会使某些任务中的不良行为更有可能发生（如果这些行为存在于预训练数据中）。这是一个值得进一步研究的有趣领域。另一种可能改进我们方法的修改是过滤预训练混合数据中的有害内容 (Ngo et al., 2021)，或用合成指令增强这些数据。

As discussed in detail in Gabriel (2020), there are subtle differences between aligning to instructions, intentions, revealed preferences, ideal preferences, interests, and values. Gabriel (2020) advocates for a principle-based approach to alignment: in other words, for identifying “fair principles for alignment that receive reflective endorsement despite widespread variation in people’s moral beliefs.” In our paper we align to the inferred user intention for simplicity, but more research is required in this area. Indeed, one of the biggest open questions is how to design an alignment process that is transparent, that meaningfully represents the people impacted by the technology, and that synthesizes people’s values in a way that achieves broad consensus amongst many groups. We discuss some related considerations in Section 5.2.

如 Gabriel (2020) 详细讨论的，指令对齐、意图对齐、显示偏好、理想偏好、兴趣和价值观之间存在微妙的差异。Gabriel (2020) 倡导基于原则的对齐方法：换句话说，确定“尽管人们在道德信仰上存在广泛差异，但仍能获得反思性认可的公平对齐原则。”在我们的论文中，为了简化，我们对齐到推断出的用户意图，但该领域仍需更多的研究。实际上，一个最大的开放问题是如何设计一个透明的对齐过程，该过程能够有意义地代表受技术影响的人们，并以一种能够在许多群体中达成广泛共识的方式综合人们的价值观。我们在第 5.2 节中讨论了一些相关考虑因素。

5.5 Broader impacts

This work is motivated by our aim to increase the positive impact of large language models by training them to do what a given set of humans want them to do. By default, language models optimize the next word prediction objective, which is only a proxy for what we want these models to do. Our results indicate that our techniques hold promise for making language models more helpful, truthful, and harmless. In the longer term, alignment failures could lead to more severe consequences, particularly if these models are deployed in safety-critical situations. We expect that as model scaling continues, greater care has to be taken to ensure that they are aligned with human intentions (Bostrom, 2014).

However, making language models better at following user intentions also makes them easier to misuse. It may be easier to use these models to generate convincing misinformation, or hateful or abusive content.

Alignment techniques are not a panacea for resolving safety issues associated with large language models; rather, they should be used as one tool in a broader safety ecosystem. Aside from intentional misuse, there are many domains where large language models should be deployed only with great care, or not at all. Examples include high-stakes domains such as medical diagnoses, classifying people based on protected characteristics, determining eligibility for credit, employment, or housing, generating political advertisements, and law enforcement. If these models are open-sourced, it becomes challenging to limit harmful applications in these and other domains without proper regulation. On the other hand, if large language model access is restricted to a few organizations with the resources required to train them, this excludes most people from access to cutting-edge ML technology. Another option is for an organization to own the end-to-end infrastructure of model deployment, and make it accessible via an API. This allows for the implementation of safety protocols like use case restriction (only allowing the model to be used for certain applications), monitoring for misuse and revoking access to those who misuse the system, and rate limiting to prevent the generation of large-scale misinformation. However, this can come at the cost of reduced transparency and increased centralization of power because it requires the API provider to make decisions on where to draw the line on each of these questions.

Finally, as discussed in Section 5.2, the question of who these models are aligned to is extremely important, and will significantly affect whether the net impact of these models is positive or negative.

Acknowledgements

First, we would like to thank Lilian Weng, Jason Kwon, Boris Power, Che Chang, Josh Achiam, Steven Adler, Gretchen Krueger, Miles Brundage, Tyna Eloundou, Gillian Hadfield, Irene Soliaman, Christy Dennison, Daniel Ziegler, William Saunders, Beth Barnes, Cathy Yeh, Nick Cammaratta, Jonathan Ward, Matt Knight, Pranav Shyam, Alec Radford, and others at OpenAI for discussions throughout the course of the project that helped shape our research direction. We thank Brian Green, Irina Raicu, Subbu Vincent, Varoon Mathur, Kate Crawford, Su Lin Blodgett, Bertie Vidgen, and Paul Röttger for discussions and feedback on our approach. Finally, we thank Sam Bowman, Matthew Rahtz, Ben Mann, Liam Fedus, Helen Ngo, Josh Achiam, Leo Gao, Jared Kaplan, Cathy Yeh, Miles Brundage, Gillian Hadfield, Cooper Raterink, Gretchen Krueger, Tyna Eloundou, Rafal Jakubanis, and Steven Adler for providing feedback on this paper. We’d also like to thank Owain Evans and Stephanie Lin for pointing out the fact that the automatic TruthfulQA metrics were overstating the gains of our PPO models.

Thanks to those who contributed in various ways to the infrastructure used to train and deploy our models, including: Daniel Ziegler, William Saunders, Brooke Chan, Dave Cummings, Chris Hesse, Shantanu Jain, Michael Petrov, Greg Brockman, Felipe Such, Alethea Power, and the entire OpenAI supercomputing team. We’d also like to thank Suchir Balaji for help with recalibration, Alper Ercetin and Justin Wang for designing the main diagram in this paper, and the OpenAI Comms team for helping with the release, including: Steve Dowling, Hannah Wong, Natalie Summers, and Elie Georges.

Finally, we want to thank our labelers, without whom this work would not have been possible: Meave Fryer, Sara Tirmizi, James Carroll, Jian Ouyang, Michelle Brothers, Conor Agnew, Joe Kwon, John Morton, Emma Duncan, Delia Randolph, Kaylee Weeks, Alexej Savreux, Siam Ahsan, Rashed Sorwar, Atresha Singh, Muhaiminul Rukshat, Caroline Oliveira, Juan Pablo Castaño Rendón, Atqiya Abida Anjum, Tinashe Mapolisa, Celeste Fejzo, Caio Oleskovicz, Salahuddin Ahmed, Elena Green, Ben Harmelin, Vladan Djordjevic, Victoria Ebbets, Melissa Mejia, Emill Jayson Caypuno, Rachelle Froyalde, Russell M. Bernandez, Jennifer Brillo, Jacob Bryan, Carla Rodriguez, Evgeniya Rabinovich, Morris Stuttard, Roxanne Addison, Sarah Nogly, Chait Singh.

References

Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437.

A Additional prompt data details

A.1 Labeler-written prompts

We first give slightly more details on our prompt bootstrapping process. As previously mentioned, for the majority of the project, we obtained prompts directly from external users of the instruct beta models in the OpenAI API. However, this strategy only works once you have a model that accepts instruction-like prompts. In order to train the very first such model, we asked contractors to write prompts themselves. We asked labelers to write three kinds of prompts:

• Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring diversity of tasks.

• Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction. For example, the instruction could be “Give the sentiment for a tweet,” and the queries would be tweets and the responses either “Positive” or “Negative.” We can then format these as few-shot prompts like those in Brown et al. (2020). With K query-response pairs, we create K training examples using the other K-1 in the context.

• User-based: We had a number of use-cases stated in applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.
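The few-shot construction described in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_few_shot_examples` and the `Query:`/`Response:` layout are hypothetical choices made for the example. The text only states that each of the K pairs becomes a training example with the other K-1 pairs in its context.

```python
from typing import List, Tuple


def build_few_shot_examples(
    instruction: str,
    pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Turn K query/response pairs into K (prompt, target) training examples.

    Each example holds out one pair as the query to answer and places the
    remaining K-1 pairs in the context. The prompt layout is an assumption.
    """
    examples = []
    for i, (query, response) in enumerate(pairs):
        # All pairs except the held-out one go into the context.
        context = [p for j, p in enumerate(pairs) if j != i]
        lines = [instruction, ""]
        for q, r in context:
            lines.append(f"Query: {q}")
            lines.append(f"Response: {r}")
            lines.append("")
        # The held-out query ends the prompt; its response is the target.
        lines.append(f"Query: {query}")
        lines.append("Response:")
        examples.append(("\n".join(lines), response))
    return examples


pairs = [
    ("I love this phone!", "Positive"),
    ("Worst service ever.", "Negative"),
    ("What a great day.", "Positive"),
]
examples = build_few_shot_examples("Give the sentiment for a tweet.", pairs)
```

With three pairs, `examples` holds three prompts, each showing two solved pairs followed by one open query.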

In order to preserve the anonymity of the application information, we had a separate labeler create vague high level tasks based on looking at a list of applications, modifying the task descriptions to eliminate any information that was specific to a given application. This data was used to train the first InstructGPT model via supervised learning, which was deployed in beta in the API in early 2021.

A.2 API user prompts

For API prompts, we use prompts submitted by users to the aforementioned earlier version of the InstructGPT model on the OpenAI API Playground. Throughout the paper, we only use data from the Playground, rather than customers using our model in production, as it was easier to get informed consent: every time a user switched to an InstructGPT model, an alert message would pop up stating that prompts submitted to these models could be used to train future versions of our models. We also communicated this in a message on the developer Slack channel upon launching the beta of the InstructGPT models. We filter out prompts from the training split containing personally identifiable information (PII).

To ensure a diversity of use cases, we heuristically deduplicate prompts by checking for prompts that share a long common prefix, and limit the number of prompts to roughly 200 per organization. In addition, we create train, validation, and test splits based on organization IDs, so that, e.g., the validation set contains different use cases than the training set.
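The filtering and splitting steps above could look roughly like the sketch below. Several details are assumptions made for illustration, since the text does not specify them: the 50-character prefix length, deduplicating across all organizations rather than within each one, and the hash-based 80/10/10 assignment of organizations to splits. Only the cap of roughly 200 prompts per organization and the split by organization ID come from the text.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

PREFIX_LEN = 50    # assumed length of a "long common prefix", in characters
MAX_PER_ORG = 200  # from the text: roughly 200 prompts per organization


def dedupe_and_cap(
    prompts: List[Tuple[str, str]],
    prefix_len: int = PREFIX_LEN,
    max_per_org: int = MAX_PER_ORG,
) -> List[Tuple[str, str]]:
    """Keep the first (org_id, text) per shared prefix, capped per organization."""
    seen_prefixes = set()
    per_org_count: Dict[str, int] = defaultdict(int)
    kept = []
    for org_id, text in prompts:
        prefix = text[:prefix_len]
        if prefix in seen_prefixes:
            continue  # heuristic duplicate: shares a prefix with a kept prompt
        if per_org_count[org_id] >= max_per_org:
            continue  # this organization already reached its cap
        seen_prefixes.add(prefix)
        per_org_count[org_id] += 1
        kept.append((org_id, text))
    return kept


def split_for_org(org_id: str) -> str:
    """Map an organization to one split, so its prompts never straddle splits."""
    bucket = int(hashlib.sha256(org_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"
    return "valid" if bucket == 8 else "test"


def assign_splits(prompts: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group prompt texts by the split of their organization."""
    splits: Dict[str, List[str]] = defaultdict(list)
    for org_id, text in prompts:
        splits[split_for_org(org_id)].append(text)
    return dict(splits)


prompts = [
    ("org-a", "Summarize this article: A"),
    ("org-a", "Summarize this article: B"),  # same 20-character prefix as above
    ("org-b", "Translate to French: hello"),
    ("org-a", "List five ideas"),            # over the per-org cap used below
]
kept = dedupe_and_cap(prompts, prefix_len=20, max_per_org=1)
splits = assign_splits(kept)
```

Assigning splits per organization rather than per prompt is what keeps the use cases in the validation set distinct from those seen in training.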

We conceptualized API requests as belonging to one of ten use cases: generation, open QA, closed QA, brainstorming, chat, rewriting, summarization, classification, extraction, or other. Below, we show fictional but realistic prompts from a variety of use cases:

A.2.1 Illustrative user prompts from InstructGPT distribution

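Use case	Example
brainstorming	List five ideas for how to regain enthusiasm for my career
brainstorming	What are some key points I should know when studying Ancient Greece?
brainstorming	What are some questions a user might have after reading the instruction manual for a trash compactor?
	{user manual}
	1.
brainstorming	What are 10 science fiction books I should read next?
classification	Take the following text and rate, on a scale from 1 to 10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation
	{text}
	Rating:
classification	This is a list of tweets and the sentiment categories they fall into.
	Tweet: {tweet_content1}
	Sentiment: {sentiment1}
	Tweet: {tweet_content2}
	Sentiment: {sentiment2}
classification	{java code}
	What language is the code above written in?
classification	You are a very serious professor, and you check papers to see if they contain missing citations. Given the text, say whether it is missing an important citation (YES/NO) and which sentence(s) require citing.
	{paper text}
extraction	Extract all course titles from the table below:
extraction	Extract all place names from the article below:
	{news article}
extraction	Given the following list of movie titles, write down any names of cities in the titles.
	{movie titles}
generation	Write a creative ad for the following product to run on Facebook aimed at parents:
	Product: {product description}
generation	Write a short story where a brown bear goes to the beach, makes friends with a seal, and then returns home.
generation	Here's a message to me:
	{email}
	Here are some bullet points for a reply:
	{message}
	Write a detailed reply
generation	This is an article about how to write a cover letter when applying for jobs:
	It's important to spend some time
generation	Write rap lyrics on the topics mentioned in this news article:
	{article}
rewriting	This is the summary of a Broadway play:
	{summary}
	This is the outline of the commercial for that play:
rewriting	Translate this sentence to Spanish:
	<English sentence>
rewriting	Create turn-by-turn navigation given this text:
	Go west on {road1} until you hit {road2}. Then take it east to {road3}. The destination will be a red barn on the right
	1.
rewriting	Rewrite the following text to be more light-hearted:
	{very formal text}
chat	The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.
	Human: Hello, who are you?
	AI: I am an AI created by OpenAI. How can I help you today?
	Human: I'd like to cancel my subscription.
	AI:
chat	Marv is a chatbot that reluctantly answers questions with sarcastic responses:
	You: How many pounds are in a kilogram?
	Marv: This again? There are 2.2 pounds in a kilogram. Please make a note of this.
	You: What does HTML stand for?
	Marv: Was Google too busy? Hypertext Markup Language. The T is for try to ask better questions in the future.
	You: When did the first airplane fly?
chat	This is a conversation with an enlightened Buddha. Every response is full of wisdom and love.
	Me: How can I achieve greater peace and equanimity?
	Buddha:
closed QA	Help me answer questions about the following short story:
	{story}
	What is the moral of the story?
closed QA	Answer the following question:
	What shape is the earth?
	A) A circle
	B) A sphere
	C) An ellipse
	D) A plane
closed QA	Tell me how hydrogen and helium are different, using the following facts:
	{list of facts}
open QA	I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown".
	Q: What is human life expectancy in the United States?
	A: Human life expectancy in the United States is 78 years.
	Q: Who was president of the United States in 1955?
	A:
open QA	Who built the Statue of Liberty?
open QA	How do you take the derivative of the sine function?
open QA	Who are the indigenous people of New Zealand?
summarization	Summarize this for a second-grade student:
	{text}
summarization	{news article}
	Tl;dr:
summarization	{chat transcript}
	Summarize the above conversation between a customer and customer assistant. Make sure to state any complaints that the customer has.
other	start with where
other	Look up "cowboy" on Google and give me the results.
other	Johnathan Silver goes to the market every day, and brings back a

Next, we list some schematic examples of API requests for each use-case category, for prompts submitted to GPT-3 models. These are generally less ‘instruction-style’, and contain more explicit prompting. Note that there are some prompts where the user intent is unclear.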

A.2.2 Illustrative user prompts from GPT-3 distribution

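Use case	Example
brainstorming	indie movie ideas:
	- A guy travels to South America to become a shaman.
	- A documentary about the world of juggling.
brainstorming	Baby name ideas for a boy:
	1. Alfred
	2. Theo
	3.
brainstorming	Tell me a list of topics related to:
	- interior design
	- sustainable ecosystems
	- fake plants
brainstorming	Name some rare gems
classification	This is a tweet sentiment classifier.
	{tweet}
	Sentiment: negative
	===
	{tweet}
	Sentiment: neutral
	===
	{tweet}
	Sentiment:
classification	The following is a list of products and the kind of product they are.
	Product: {product}. Type: {type}
	Product: {product}. Type: {type}
	Product: {product}. Type:
classification	The following is a list of companies and the categories they fall into:
	Apple, Facebook, Fedex
	Apple
	Category: Technology
	Facebook
	Category: Social Media
	Fedex
	Category:
extraction	Text: {text}
	Keywords:
generation	"Hey, what are you doing there?" Casey was startled. He hadn't even begun to
generation	The name of the next Star Wars movie is
generation	This is the research for an essay:
	===
	{description of research}
	Write a high school essay on these topics:
	===
generation	Write an outline for an essay about John von Neumann and his contributions to computing:
	1. Introduction, his life and background
	A: His early life
	B:
rewriting	Convert my resume into a profile overview.
	{resume}
	Profile overview:
rewriting	Rephrase this for me: "I can't seem to work out how to work this darn thing."
	Alternate phrasing:
rewriting	Original: She no go to sleep.
	Standard American English: She didn't go to sleep
	Original: It real bad for I to make do of this.
	Standard American English:
chat	The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.
	Human: Hello, who are you?
	AI: I am an AI created by OpenAI. How can I help you today?
	Human: I'm feeling kind of down today.
	AI:
chat	This is a conversation with Steven. Steven likes to watch Netflix and hasn't left his home in 2 weeks.
	John: Hey man, what's up?
	Steven: Exactly the same thing as yesterday, you know.
	John: So we're going to go see a movie on Thursday, want to come?
	Steven: Ummmm, I don't think so....
closed QA	When you drop a heavy stone from a tree, what happens?
	A. The stone falls to the ground.
	B: The stone stays in the tree.
	C: The stone floats.
	D: Nothing happens.
	Answer:
closed QA	Text: {article describing what yoga mats to buy}
	... a yoga mat?
	Answer:
open QA	Q: Who is Batman?
	A: Batman is a fictional comic book character.
	Q: What is torsalplexity?
	A: ?
	Q: What is Devz9?
	A: ?
	Q: Who is George Lucas?
	A: George Lucas is an American film director and producer famous for creating Star Wars.
	Q: What is the capital of California?
	A:
open QA	Who is the best human who ever lived?
open QA	Q: Who is Leonardo da Vinci?
	A:
summarization	My second grader asked me what this passage means.
	{text}
	I rephrased it in plain terms that a second grader could understand:
summarization	{text}
	I summarized the above as:
other	She said, and I quote
	AI:
other	- I like to play Call of Duty
	- I like to play Call of Duty
	- I like to play Call of Duty
	- I like to play Call of Duty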

A.3 Dataset sizes

In Table 6, we report the sizes of datasets used to train / validate the SFT, RM, and RL models, in addition to whether the prompts were written by our labeling contractors or from our API.

Table 6: Dataset sizes, in terms of number of prompts.

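SFT Data			RM Data			PPO Data
split	source	size	split	source	size	split	source	size
train	labeler	11,295	train	labeler	6,623	train	customer	31,144
train	customer	1,430	train	customer	26,584	valid	customer	16,185
valid	labeler	1,550	valid	labeler	3,488
valid	customer	103	valid	customer	14,399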

For SFT, note that we have many more labeler-written prompts than customer prompts. This is because, at the start of the project, we had labelers write instructions with a user interface that asked them to give an overarching template instruction as well as few-shot examples for that instruction. We synthetically constructed multiple SFT datapoints from the same instruction by sampling different sets of few-shot examples.

For the RM, recall that for every prompt, we collected rankings for $K$ outputs (ranging from 4 to 9) and trained the model on all $\binom{K}{2}$ comparisons, so the number of ranked pairs we trained the model on is an order of magnitude larger than the number of prompts.
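To make the pair count concrete, the sketch below expands one ranking into its pairwise comparisons and prints $\binom{K}{2}$ for each $K$ from 4 to 9. The helper name `ranked_pairs` is hypothetical, and the sketch assumes a strict ranking with no ties, which is a simplification made for the example.

```python
from itertools import combinations
from math import comb
from typing import List, Tuple


def ranked_pairs(ranking: List[str]) -> List[Tuple[str, str]]:
    """Expand a ranking (best first) into (preferred, rejected) pairs.

    combinations() preserves input order, so the first element of each pair
    always ranks above the second. Ties are not handled (assumption).
    """
    return list(combinations(ranking, 2))


# One prompt with K = 4 ranked outputs yields 6 comparisons.
pairs_for_four = ranked_pairs(["A", "B", "C", "D"])

# Number of comparisons per prompt over the range of K used in the text.
pair_counts = {k: comb(k, 2) for k in range(4, 10)}
for k, n in pair_counts.items():
    print(f"K={k}: {n} pairs")
```

Each prompt therefore contributes between 6 and 36 comparisons, which is why the number of ranked pairs exceeds the number of prompts by roughly an order of magnitude.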

A.4 Data diversity

Table 7: Dataset annotations

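Annotation	RM test	RM train	RM valid	SFT train	SFT valid
Ambiguous	-	7.9%	8.0%	5.1%	6.4%
Sensitive content	-	6.9%	5.3%	0.9%	1.0%
Identity dependent	-	-	-	0.9%	0.3%
Closed domain	11.8%	19.4%	22.9%	27.4%	40.6%
Continuation style	-	15.5%	16.2%	17.9%	21.6%
Requests opinionated content	11.2%	7.7%	7.5%	8.6%	3.4%
Requests advice	3.9%	-	-	-	-
Requests moral judgment	0.8%	1.1%	0.3%	0.3%	0.0%
Contains explicit safety constraints	-	0.4%	0.4%	0.3%	0.0%
Contains other explicit constraints	-	26.3%	28.9%	25.6%	20.7%
Intent unclear	7.9%	-	-	-	-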

The data that we collect spans a wide range of categories and use cases. Table 1 shows the diversity of categories in our RM training and validation datasets as labeled by our contractors. The distribution of categories for the PPO datasets was similar. We additionally show a subset of our labeled prompt metadata in Table 7. Note that our annotation fields changed over the course of the project, so not every prompt was annotated for every field.

Table 8: Average prompts per customer

Model	Split	Prompts per customer
SFT	train	1.65
SFT	valid	1.87
RM	train	5.35
RM	valid	27.96
PPO	train	6.01
PPO	valid	31.55
-	test	1.81

Table 9: Prompt lengths by dataset

Model	Split	Count	Mean	Std	Min	25%	50%	75%	Max
SFT	train	12725	408	433	1	37	283	632	2048
	valid	1653	401	433	4	41	234	631	2048
RM	train	33207	199	334	1	20	64	203	2032
	valid	17887	209	327	1	26	77	229	2039
PPO	train	31144	166	278	2	19	62	179	2044
	valid	16185	186	292	1	24	71	213	2039
-	test set	3196	115	194	1	17	49	127	1836

Table 10: Prompt lengths by category

Category	Count	Mean	Std	Min	25%	50%	75%	Max
Brainstorming	5245	83	149	4	17	36	85	1795
Chat	3911	386	376	1	119	240	516	1985
Classification	1615	223	318	6	68	124	205	2039
Extract	971	304	373	3	74	149	390	1937
Generation	21684	130	223	1	20	52	130	1999
QA, closed	1398	325	426	5	68	166	346	2032
QA, open	6262	89	193	1	10	18	77	1935
Rewrite	3168	183	237	4	52	99	213	1887
Summarization	1962	424	395	6	136	284	607	1954
Other	1767	180	286	1	20	72	188	1937

Table 11: Prompt and demonstration lengths

| Prompt source | Measurement | Count | Mean | Std | Min | 25% | 50% | 75%

Original paper: https://arxiv.org/pdf/2203.02155
