Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

通过人类反馈强化学习训练一个有用且无害的助手

Yuntao Bai,∗ Andy Jones, Kamal Ndousse,

Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion,

Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan∗

Anthropic

Abstract

摘要

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and sum mari z ation. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.

我们应用偏好建模和人类反馈强化学习 (RLHF) 来微调语言模型，使其成为有用且无害的助手。我们发现这种对齐训练几乎在所有 NLP 评估中都提高了性能，并且与训练特定技能（如 Python 编码和摘要生成）完全兼容。我们探索了一种迭代在线训练模式，其中偏好模型和 RL 策略每周都会根据新的人类反馈数据进行更新，从而有效地改进我们的数据集和模型。最后，我们研究了 RLHF 训练的鲁棒性，并确定了 RL 奖励与策略与其初始化之间的 KL 散度平方根之间的近似线性关系。除了主要结果外，我们还对校准、竞争目标和使用 OOD 检测进行了外围分析，将我们的模型与人类作者进行了比较，并使用最近相关工作中出现的提示提供了我们模型的样本。

1 Introduction 4

1 引言 4

2 Data Collection 9

2 数据收集 9

3 Preference Modeling for Helpfulness and Harmlessness 13

3 帮助性和无害性的偏好建模 13

4 Reinforcement Learning from Human Feedback 16

4 基于人类反馈的强化学习 16

5 Competing Objectives, Specialized Skills, and OOD Detection 24

5 竞争目标、专业技能和OOD检测 24

6 Qualitative Examples and Comparisons 29

6 定性示例与比较 29

7 Discussion 34

7 讨论 34

Figure 1 This plot summarizes crowd worker preferences for a variety of models, including context-distilled models, RLHF models trained on our ‘static’ dataset, and RLHF models trained by an iterated ‘online’ method for either helpfulness and harmlessness (HH) or for helpfulness only. We present both Elo scores and a match to the frequency with which crowd workers prefer samples as compared to the 52B context-distilled model. For both helpfulness and harmlessness, a higher score is more desirable.

图 1: 该图总结了众包工人对各种模型的偏好，包括上下文蒸馏模型、在我们“静态”数据集上训练的 RLHF 模型，以及通过迭代“在线”方法训练的 RLHF 模型（针对有用性和无害性 (HH) 或仅针对有用性）。我们展示了 Elo 分数以及众包工人偏好样本的频率与 52B 上下文蒸馏模型的比较。对于有用性和无害性，分数越高越好。

1 Introduction

1 引言

We would like to develop techniques to train AI agents that are helpful, honest, and harmless [Askell et al., 2021]. In this paper we show that we can train a relatively helpful and harmless1 (HH) natural language assistant by collecting human preference data and applying the techniques of preference modeling (PMing) and reinforcement learning from human feedback (RLHF). Our full training process is summarized in Figure 2.

我们希望开发技术来训练有益、诚实且无害的AI智能体 [Askell et al., 2021]。在本文中，我们展示了通过收集人类偏好数据并应用偏好建模 (PMing) 和基于人类反馈的强化学习 (RLHF) 技术，可以训练出一个相对有益且无害1 (HH) 的自然语言助手。我们的完整训练过程总结在图 2 中。

Our goal is not to define or prescribe what ‘helpful’ and ‘harmless’ mean but to evaluate the effectiveness of our training techniques, so for the most part we simply let our crowd workers interpret these concepts as they see fit. We treat helpfulness and harmlessness separately, collecting distinct human-preference datasets for each. For helpfulness, we ask crowd workers to solicit our models to assist with any purely text-based tasks such as answering questions, writing or editing documents, or discussing plans and decisions. For harmlessness, we invite crowd workers to adversarial ly probe or ‘red-team’ our language models in order to provoke harmful responses: either to help them with harmful goals, such as planning a bank robbery, or to cause the AI to use toxic language. At each stage of their conversations with the AI assistant, crowd workers are presented with two possible responses. Those engaged in the helpfulness task are instructed to choose the more helpful and honest (i.e. better) response. Those engaged in the red teaming task are instructed to choose the more harmful (i.e. worse) response. These conversations and the expressed human preferences form our datasets.3

我们的目标不是定义或规定“有帮助”和“无害”的含义，而是评估我们训练技术的有效性，因此在大多数情况下，我们只是让众包工作者根据自己的理解来解释这些概念。我们将“有帮助”和“无害”分开处理，为每个概念收集不同的人类偏好数据集。对于“有帮助”，我们要求众包工作者请求我们的模型协助完成任何纯文本任务，例如回答问题、撰写或编辑文档，或讨论计划和决策。对于“无害”，我们邀请众包工作者以对抗的方式探测或“红队”我们的语言模型，以引发有害的回应：要么帮助他们实现有害目标，例如策划银行抢劫，要么让AI使用有毒语言。在与AI助手的对话的每个阶段，众包工作者都会看到两种可能的回应。参与“有帮助”任务的人被指示选择更有帮助和诚实（即更好）的回应。参与红队任务的人被指示选择更有害（即更差）的回应。这些对话和表达的人类偏好构成了我们的数据集。[3]

Helpfulness and harmlessness often stand in opposition to each other. An excessive focus on avoiding harm can lead to ‘safe’ responses that don’t actually address the needs of the human. An excessive focus on being helpful can lead to responses that help humans cause harm or generate toxic content. We demonstrate this tension quantitatively by showing that preference models trained to primarily evaluate one of these qualities perform very poorly (much worse than chance) on the other. Fortunately, we find that PMs trained on a mixture of both datasets can nevertheless learn the right lessons and behave helpfully when appropriate, while encouraging the polite refusal of harmful requests. With preference models in hand, we then train helpful and harmless assistants via reinforcement learning, using the PM scores as rewards. We evaluate both PM performance and the more relevant performance characteristics of our RLHF-trained models. As can be seen in Figure 1, purely helpful RLHF-trained models are far easier to red-team, while helpful+harmless models are both very helpful and much less harmful.

有用性和无害性常常相互对立。过度关注避免伤害可能导致“安全”的回应，但这些回应实际上并未满足人类的需求。过度关注有用性可能导致回应帮助人类造成伤害或生成有害内容。我们通过展示主要评估其中一种质量的偏好模型在另一种质量上表现非常差（远低于随机）来定量地展示这种张力。幸运的是，我们发现，在两种数据集混合训练的偏好模型仍然可以学到正确的经验，并在适当的时候表现得有用，同时鼓励礼貌地拒绝有害请求。有了偏好模型后，我们通过强化学习训练有用且无害的助手，使用偏好模型的分数作为奖励。我们评估了偏好模型的性能以及我们通过RLHF训练的模型的相关性能特征。如图1所示，纯粹有用的RLHF训练模型更容易被红队攻击，而有用+无害的模型既非常有用又大大减少了伤害。

Figure 2 This diagram summarizes our data collection and model training workflow.

图 2: 该图总结了我们的数据收集和模型训练工作流程。

A question that’s often raised about alignment training is whether it will compromise AI capabilities. We find that when RLHF is applied to large language models, the answer seems to be an almost-categorical no. Our RLHF-trained models tend to perform better than their raw, generative counterparts on virtually all evaluations, as summarized in Figure 3. We also argue that one can mix specialized skills with alignmentrelated training without compromising either alignment or performance. In practice, aligned models are likely to be more user-friendly and deployable than their raw counterparts, which suggests that there’s little reason to deploy models that have not been finetuned for alignment.

关于对齐训练的一个常见问题是，它是否会损害 AI 的能力。我们发现，当 RLHF（从人类反馈中进行强化学习）应用于大语言模型时，答案似乎几乎是否定的。如图 3 所示，经过 RLHF 训练的模型在几乎所有评估中往往比未经训练的生成模型表现更好。我们还认为，可以将专业技能与对齐相关训练相结合，而不会损害对齐或性能。实际上，对齐后的模型可能比未经训练的模型更易于用户使用和部署，这表明几乎没有理由部署未经对齐微调的模型。

1.1 Contributions

1.1 贡献

Dialogue Preference Datasets

对话偏好数据集

• We collect separate helpfulness and harmlessness (i.e. red-teaming) datasets primarily using various 52B language models (see Section 2 for details) in our interface (Figure 6). Crowd workers have open-ended conversations with the models, either soliciting help, or providing instructions, or attempting to get the model to emit harmful responses, and they are asked to choose the more helpful response or the more harmful4 response at each conversational step, respectively.

• 我们主要使用各种 52B 大语言模型（详见第 2 节）在我们的界面中收集了独立的有用性和无害性（即红队测试）数据集（图 6）。众包工作人员与模型进行开放式对话，要么寻求帮助，要么提供指令，要么试图让模型发出有害的回应，他们被要求在每一步对话中分别选择更有帮助的回应或更有害的回应。

• We collect three tranches of data, one from our initial models, one with rejection sampling against early preference models, and a final dataset gathered with models trained with ‘online’ reinforcement learning from human feedback, which we improve on a roughly weekly cadence. See Section 2.3.

• 我们收集了三批数据，第一批来自我们的初始模型，第二批是通过拒绝采样针对早期偏好模型收集的，最后一批数据集是通过使用人类反馈进行在线强化学习的模型收集的，我们大约每周对其进行一次改进。详见第2.3节。

Figure 3 RLHF model performance on zero-shot and few-shot NLP tasks. For each model size, we plot the mean accuracy on MMMLU, Lambada, HellaSwag, OpenBookQA, ARC-Easy, ARC-Challenge, and TriviaQA. On zero-shot tasks, RLHF training for helpfulness and harmlessness hurts performance for small models, but actually improves performance for larger models. Full results for each task are given in Figure 28 (zero-shot) and Figure 29 (few-shot).

图 3: RLHF 模型在零样本和少样本 NLP 任务上的表现。对于每个模型大小，我们绘制了在 MMMLU、Lambada、HellaSwag、OpenBookQA、ARC-Easy、ARC-Challenge 和 TriviaQA 上的平均准确率。在零样本任务上，针对帮助性和无害性的 RLHF 训练对小模型的表现有负面影响，但实际上提高了大模型的表现。每个任务的完整结果在图 28 (零样本) 和图 29 (少样本) 中给出。

Alignment with Human Values Has Many Benefits and Essentially No Cost to Performance

与人类价值观对齐有许多好处，且几乎不会影响性能

Scaling, RLHF Robustness, and Iterated ‘Online’ Training

扩展性、RLHF 鲁棒性及迭代式“在线”训练

• We study scaling relations for PM accuracy as a function of model and dataset size, and find roughly log-linear trends (Figure 7), though we encounter some idiosyncrasies (Figures 31 and 32). • We conduct experiments on the robustness of RLHF (see Figure 4), where we split our datasets in half and train separate preference models on each half. Then we train RL models against one PM while evaluating with the other. We conclude that larger PMs are more robust than smaller PMs, and as expected, over fitting increases during RLHF training. • We find that $\sqrt{D_{\mathrm{KL}}(\pi||\pi_{0})}$ and reward are approximately linearly related for much of RLHF training (see Figures 4 and 13), where $\pi$ and $\pi_{0}$ are the policy and initial policy, respectively. We explain how this relation may arise and discuss possible applications and future directions. • We study iterated online training, where we update our preference models and RLHF policies on a weekly cadence, and then re-deploy these fresh RLHF models to interact with crowd workers. This significantly improved our models as evaluated by crowd workers (Figure 1), and greatly improved our dataset as judged by our own PMs (Figure 15), filling out the upper tail in terms of quality.

• 我们研究了偏好模型 (PM) 准确性与模型和数据集大小的比例关系，发现大致呈对数线性趋势 (图 7)，尽管我们也遇到了一些特殊情况 (图 31 和 32)。
• 我们对 RLHF (Reinforcement Learning from Human Feedback) 的鲁棒性进行了实验 (见图 4)，将数据集分成两半，并在每一半上训练独立的偏好模型。然后，我们针对一个 PM 训练 RL 模型，同时使用另一个 PM 进行评估。我们得出结论，较大的 PM 比较小的 PM 更鲁棒，并且正如预期的那样，RLHF 训练期间的过拟合现象有所增加。
• 我们发现，在 RLHF 训练的很大一部分过程中，$\sqrt{D_{\mathrm{KL}}(\pi||\pi_{0})}$ 和奖励大致呈线性关系 (见图 4 和 13)，其中 $\pi$ 和 $\pi_{0}$ 分别是策略和初始策略。我们解释了这种关系可能产生的原因，并讨论了可能的应用和未来方向。
• 我们研究了迭代在线训练，每周更新我们的偏好模型和 RLHF 策略，然后重新部署这些新的 RLHF 模型与众包工作者互动。这显著提高了我们的模型 (图 1)，并极大地改善了我们的数据集 (图 15)，在质量方面填补了上尾部分。

Figure 4 This figure shows results from RL robustness experiments. We split our static dataset 50:50, and trained separate PMs on each half, which we refer to as train PMs and test PMs. We then trained RLHF policies against the train PMs, while evaluating their score with respect to the test PMs. Over fitting can then be observed as a divergence between the train and test PM scores. (left) We see that training is quite robust up to about 150k training samples, but beyond that point the train and test PM’s disagree, with the train PM assigning a higher mean reward. We also show an approximately linear relationship between PM score gain and the square root of the KL divergence (between the policy and its initial snapshot) during early phase of training—we observe this for all our RLHF runs, as discussed more in Section 4.3. (right) This shows similar results for various policy sizes, all trained and tested on 52B PMs.

图 4: 该图展示了 RL 鲁棒性实验的结果。我们将静态数据集按 50:50 分割，并在每一半上训练了单独的偏好模型 (PM)，分别称为训练 PM 和测试 PM。然后，我们针对训练 PM 训练了 RLHF 策略，同时使用测试 PM 评估它们的得分。过拟合可以通过训练 PM 和测试 PM 得分之间的差异来观察。（左）我们可以看到，在约 150k 训练样本之前，训练过程相当稳健，但超过这个点后，训练 PM 和测试 PM 的得分开始出现分歧，训练 PM 给出的平均奖励更高。我们还展示了在训练早期阶段，PM 得分增益与 KL 散度（策略与其初始快照之间）的平方根之间存在近似线性关系——我们在所有 RLHF 运行中都观察到了这一点，更多讨论见第 4.3 节。（右）该图展示了不同策略大小的类似结果，所有策略均在 52B PM 上进行训练和测试。

To remove con founders and bolster our conclusions, we perform additional controlled experiments (Figure 16) holding dataset size and other hyper parameters fixed.

为了消除混淆因素并加强我们的结论，我们在保持数据集大小和其他超参数固定的情况下进行了额外的对照实验（图 16）。

1.2 Summary of Evaluations and Metrics

1.2 评估与指标总结

• NLP and Code Evaluations: We evaluate our models on MMLU [Hendrycks et al., 2021b], Lambada [Paperno et al., 2016], Hellaswag [Zellers et al., 2019], OpenBookQA [Mihaylov et al., 2018], ARC [Clark et al., 2018], and TriviaQA [Joshi et al., 2017]; see Figures 28 and 29 for full results and Figure 3 for the mean. In every case except for TriviaQA, 12B and 52B RLHF-trained models perform better than base LMs. Separately, we take Python coding models and finetune them with natural language RLHF, and then evaluate them on the codex HumanEval [Chen et al., 2021]; see Figure 21. We also experiment with mixing PM training for HH with sum mari z ation [Stiennon et al., 2020] as a specialized skill, and evaluate the resulting PM performance (Figure 20), finding that mixed training does not degrade PM accuracies.

• NLP 和代码评估：我们在 MMLU [Hendrycks et al., 2021b]、Lambada [Paperno et al., 2016]、Hellaswag [Zellers et al., 2019]、OpenBookQA [Mihaylov et al., 2018]、ARC [Clark et al., 2018] 和 TriviaQA [Joshi et al., 2017] 上评估我们的模型；完整结果见图 28 和图 29，平均值见图 3。除了 TriviaQA 之外，在所有情况下，经过 RLHF 训练的 12B 和 52B 模型都比基础大语言模型表现更好。此外，我们采用 Python 语言编码模型，并使用自然语言 RLHF 进行微调，然后在 codex HumanEval [Chen et al., 2021] 上进行评估；见图 21。我们还尝试将 PM 训练与摘要 [Stiennon et al., 2020] 混合作为一项专门技能，并评估由此产生的 PM 性能（图 20），发现混合训练不会降低 PM 的准确性。

• Static Alignment Evaluations: We evaluate our PMs using our HHH Evaluations [Askell et al., 2021] from BIG-Bench6 (Figure 5), on Bot Adversarial Dialogues [Xu et al., 2020], and for gender bias [Rae et al., 2021] (Figure 12). We evaluate our RLHF models on TruthfulQA [Lin et al., 2021] (Figure 5), BBQ-Lite [Parrish et al., 2021] from BIG-Bench, gender bias (Figure 40), and sentiment based on race and religion [Rae et al., 2021] (Figure 17). RLHF improves sentiment towards all groups, but does not remove bias.

• 静态对齐评估：我们使用来自 BIG-Bench 的 HHH 评估 [Askell et al., 2021]（图 5）、Bot Adversarial Dialogues [Xu et al., 2020] 以及性别偏见 [Rae et al., 2021]（图 12）来评估我们的 PMs。我们对 RLHF 模型在 TruthfulQA [Lin et al., 2021]（图 5）、来自 BIG-Bench 的 BBQ-Lite [Parrish et al., 2021]、性别偏见（图 40）以及基于种族和宗教的情感 [Rae et al., 2021]（图 17）上进行了评估。RLHF 改善了所有群体的情感，但并未消除偏见。

• Human Evaluations: We compute Elo scores based on the preferences of our crowd workers, comparing context-distilled models, base RLHF trained models, and final online RLHF models (Figure 1). We also test our online models’ performance during training (Figure 15), compare various levels of rejection sampling (Figure 36), and perform a controlled experiment on iterated online training (Figure 16). Furthermore, we hired professional writers to compose conversations where an assistant provides high-quality, helpful and honest responses, and we then asked crowd workers to compare our model’s responses to those of these writers. Crowd workers prefer our online HH model to these writers7 about $57%$ of the time.

• 人类评估：我们基于众包工作者的偏好计算Elo分数，比较了上下文蒸馏模型、基础RLHF训练模型和最终在线RLHF模型（图1）。我们还测试了在线模型在训练期间的表现（图15），比较了不同级别的拒绝采样（图36），并对迭代在线训练进行了对照实验（图16）。此外，我们聘请了专业作家编写对话，其中助手提供高质量、有帮助且诚实的回答，然后我们让众包工作者将我们模型的回答与这些作家的回答进行比较。众包工作者在约57%的情况下更喜欢我们的在线HH模型，而不是这些作家。

Figure 5 (left) Here we show accuracy on the HHH alignment evaluation dataset we constructed previously [Askell et al., 2021] and shared on BIG-Bench. We see that our static preference models greatly outperform plain language models, including context distilled HHH models. This confirms that the data generated by our crowd workers has taught preference models the desired lessons. (right) Our RLHF training improves performance on TruthfulQA (MC1) [Lin et al., 2021] for large models, with an effect that increases with model size. These RLHF models were trained from our static dataset (i.e. they did not use online data).

图 5: (左) 这里展示了我们在之前构建并分享在 BIG-Bench 上的 HHH 对齐评估数据集上的准确率 [Askell et al., 2021]。我们可以看到，我们的静态偏好模型显著优于普通的语言模型，包括经过上下文蒸馏的 HHH 模型。这证实了我们的众包工作者生成的数据已经教会了偏好模型所需的课程。(右) 我们的 RLHF 训练提高了大模型在 TruthfulQA (MC1) [Lin et al., 2021] 上的表现，且效果随着模型规模的增加而增强。这些 RLHF 模型是从我们的静态数据集中训练的 (即它们没有使用在线数据)。

• Samples: We provide samples from all of the PALMs [Solaiman and Dennison, 2021] sensitive questions and from prompts provided with Instruct GP T [Ouyang et al., 2022] and LaMDA [Thoppilan et al., 2022] in Appendix C. We show some comparisons with human writers in Section 6.1, and we show several short dialogues in Section 6.3. To mitigate the problem of cherry picking samples, we generate 17 samples per prompt and display only the median sample, as ranked by our online HH preference model.

• 样本：我们在附录 C 中提供了来自所有 PALMs [Solaiman and Dennison, 2021] 敏感问题的样本，以及来自 Instruct GPT [Ouyang et al., 2022] 和 LaMDA [Thoppilan et al., 2022] 提示的样本。我们在 6.1 节中展示了一些与人类作者的比较，并在 6.3 节中展示了几个简短的对话。为了缓解样本选择偏差的问题，我们为每个提示生成了 17 个样本，并仅显示由我们的在线 HH 偏好模型排名后的中位数样本。

1.3 相关工作

Two recent papers, LaMDA [Thoppilan et al., 2022] and Instruct GP T [Ouyang et al., 2022] have particular similarity to this work. Both use human data to train large language models to be more generally useful or aligned. Both use language models somewhat larger than our 52B model.

最近的两篇论文，LaMDA [Thoppilan et al., 2022] 和 Instruct GPT [Ouyang et al., 2022]，与本工作有特别的相似之处。两者都使用人类数据来训练大语言模型，使其更具通用性或对齐性。两者使用的语言模型都比我们的 52B 模型稍大。

LaMDA [Thoppilan et al., 2022] finetunes large language models to participate in interesting, helpful, factually grounded, and safe natural language dialogue. As with our work, they include notions of both positive interactions and safety/harmlessness. And their use of external information to ensure accuracy/groundedness goes beyond the methods we discuss here, and is perhaps more similar to WebGPT and GopherCite [Nakano et al., 2021, Menick et al., 2022]. However, some differences are that rather than using reinforcement learning, they apply a mix of supervised learning techniques (both generative and disc rim i native), and their data collection process involves absolute ratings rather than comparisons. They do not explore whether their methods impose an ‘alignment tax’ on capabilities.

LaMDA [Thoppilan et al., 2022] 对大语言模型进行微调，使其能够参与有趣、有帮助、基于事实且安全的自然语言对话。与我们的工作类似，他们既考虑了积极的互动，也考虑了安全性/无害性。他们使用外部信息来确保准确性/基于事实，这超出了我们在此讨论的方法，可能更类似于 WebGPT 和 GopherCite [Nakano et al., 2021, Menick et al., 2022]。然而，一些不同之处在于，他们没有使用强化学习，而是应用了监督学习技术的混合（包括生成式和判别式），并且他们的数据收集过程涉及绝对评分而非比较。他们没有探讨他们的方法是否会对能力施加“对齐税”。

Instruct GP T [Ouyang et al., 2022] finetunes GPT-3-type models [Brown et al., 2020] to improve their helpfulness. As in this work, they use reinforcement learning from human preferences, as expressed through comparisons. However, they also include a supervised learning stage of training, whereas in contrast our finetuning occurs purely through RL (we perform context distillation, but this is much more like simple prompting). Perhaps the main contrast with our work is that they do not include harmlessness training, or explore tensions between helpfulness and harmlessness. Their approach also differs from ours in some details: they did not train preference models larger than 6B parameters, and they mixed pre training with RL in order to avoid a degradation in evaluation performance.

Instruct GPT [Ouyang et al., 2022] 对 GPT-3 类型的模型 [Brown et al., 2020] 进行了微调，以提高其帮助性。与本研究类似，他们使用了基于人类偏好的强化学习，通过比较来表达这些偏好。然而，他们还包含了一个监督学习的训练阶段，而相比之下，我们的微调完全通过强化学习进行（我们进行了上下文蒸馏，但这更像是简单的提示）。或许与我们工作的主要区别在于，他们没有包含无害性训练，也没有探索帮助性与无害性之间的张力。他们的方法在一些细节上也与我们的不同：他们没有训练超过 60 亿参数的偏好模型，并且为了避免评估性能的下降，他们将预训练与强化学习混合在一起。

Our work differs from both Instruct GP T and LaMDA in that we explore ‘online’ training, where we update the models interacting with crowd workers in order to obtain progressively higher-quality data and fill out the tails of our data distribution. Another difference is our exploration of specialized skills such as sum mari z ation and coding, which we use to bolster the argument that alignment can be achieved without limiting capabilities. We also explicitly study the tension between helpfulness and harmlessness, which has not been addressed before as far as we are aware. Finally, we explore scaling and robustness in much more detail, including during RL training. With that said, our procedures (Figure 2) are actually somewhat simpler than those employed in these other works. We believe the only essential steps are human feedback data collection, preference modeling, and RLHF training.

我们的工作与 Instruct GPT 和 LaMDA 的不同之处在于，我们探索了“在线”训练，即通过与众包工作者交互来更新模型，以获得逐步更高质量的数据并填补数据分布的尾部。另一个不同之处在于我们对摘要和编码等专门技能的探索，我们利用这些技能来支持对齐可以在不限制能力的情况下实现的论点。我们还明确研究了有用性和无害性之间的紧张关系，据我们所知，这一问题以前从未被解决过。最后，我们更详细地探讨了扩展性和鲁棒性，包括在强化学习 (RL) 训练期间。尽管如此，我们的流程（图 2）实际上比这些其他工作中使用的流程要简单一些。我们认为唯一必要的步骤是人类反馈数据收集、偏好建模和 RLHF 训练。

Several other recent works focus on aspects of truthfulness using retrieval [Lewis et al., 2020, Guu et al., 2020, Borgeaud et al., 2021] from a database, or via internet search and human feedback, such as WebGPT [Nakano et al., 2021] and GopherCite [Menick et al., 2022]. These works are exciting and complementary to our work; in particular our results suggest that their techniques should be very compatible with training for helpfulness and harmlessness. While these works improve the faithful representation of explicit evidence, more work will likely be necessary to achieve honest self-representation from AI systems. We are generally hopeful that techniques independent of human feedback may be applicable to this problem, since a great many sources of truth are not based on human judgment.

最近的其他几项工作侧重于通过从数据库中检索 [Lewis et al., 2020, Guu et al., 2020, Borgeaud et al., 2021] 或通过互联网搜索和人类反馈（如 WebGPT [Nakano et al., 2021] 和 GopherCite [Menick et al., 2022]）来提升真实性。这些工作令人兴奋，并且与我们的工作互补；特别是我们的结果表明，它们的技术应该与训练有益性和无害性非常兼容。虽然这些工作改进了对明确证据的忠实表达，但要实现 AI 系统的诚实自我表达，可能还需要更多的工作。我们普遍希望，独立于人类反馈的技术可能适用于这个问题，因为许多真相来源并不基于人类的判断。

Safety and ethical issues associated with language models have been extensively discussed (e.g. [Henderson et al., 2017, Bender et al., 2021, Weidinger et al., 2021]), with well-known issues including toxicity, bias, and the possibility that models may reveal personally identifiable information. As models become increasingly powerful, new and surprising capabilities and safety issues may arise [Ganguli et al., 2022]. Other works have explored methods to mitigate these problems (e.g. [Liu et al., 2021, Xu et al., 2020]). Models have also been trained to directly evaluate ethical dilemmas [Jiang et al., 2021], demonstrating improvement on ethics benchmarks [Hendrycks et al., 2021a]. More general research proposals for AI safety include [Amodei et al., 2016, Hendrycks et al., 2021c]. The RL robustness failures we discuss can be viewed as an instance of ‘reward hacking’, which was recently explored in [Pan et al., 2022]. RL policies could also fail to generalize out of distribution in other dangerous ways [Koch et al., 2021].

语言模型的安全性和伦理问题已被广泛讨论（例如 [Henderson et al., 2017, Bender et al., 2021, Weidinger et al., 2021]），其中众所周知的问题包括毒性、偏见以及模型可能泄露个人身份信息的可能性。随着模型变得越来越强大，可能会出现新的、令人惊讶的能力和安全问题 [Ganguli et al., 2022]。其他研究探索了缓解这些问题的方法（例如 [Liu et al., 2021, Xu et al., 2020]）。模型还被训练用于直接评估伦理困境 [Jiang et al., 2021]，并在伦理基准测试中展示了改进 [Hendrycks et al., 2021a]。更广泛的 AI 安全性研究提案包括 [Amodei et al., 2016, Hendrycks et al., 2021c]。我们讨论的 RL 鲁棒性失败可以被视为“奖励黑客”的一个实例，这在 [Pan et al., 2022] 中得到了最近的研究。RL 策略也可能以其他危险的方式无法在分布外泛化 [Koch et al., 2021]。

Our interest in studying trends with model size is motivated by neural scaling laws [Hestness et al., 2019, Rosenfeld et al., 2019, Kaplan et al., 2020]. A related observation is that as parameter counts grow, models finetune more effectively [Hernandez et al., 2021] and become much less vulnerable to ‘catastrophic forgetting’ [Ramasesh et al., 2022]. We expect this effect helps to explain why our HH training is compatible with good evaluation performance and specialized skills for large models.

我们对研究模型规模趋势的兴趣源于神经缩放定律 [Hestness et al., 2019, Rosenfeld et al., 2019, Kaplan et al., 2020]。一个相关的观察是，随着参数数量的增加，模型微调效果更好 [Hernandez et al., 2021]，并且对“灾难性遗忘”的脆弱性大大降低 [Ramasesh et al., 2022]。我们预计这种效应有助于解释为什么我们的 HH 训练与大型模型的良好评估性能和专业技能兼容。

2 Data Collection

2 数据收集

We expect human feedback (HF) to have the largest comparative advantage over other techniques when people have complex intuitions that are easy to elicit but difficult to formalize and automate. This means that when collecting HF, we should try to choose tasks that are as intuitive and familiar as possible. We chose to use natural language dialogue both for these reasons, and because it is so general – essentially any text-based task can be enacted through dialogue, perhaps with some source materials included in-line.

我们预计，当人们拥有易于引发但难以形式化和自动化的复杂直觉时，人类反馈 (HF) 相对于其他技术将具有最大的比较优势。这意味着在收集 HF 时，我们应该尽量选择尽可能直观和熟悉的任务。出于这些原因，我们选择使用自然语言对话，还因为它的通用性——基本上任何基于文本的任务都可以通过对话来执行，也许还可以在对话中包含一些源材料。

2.1 Task Specification and Crowd workers

2.1 任务规范与众包工作者

Our human feedback interface can be seen in Figure 6 (for more details see Appendix D). People can interact with our models in natural language via chat, and ask for help with any text-based task. When it’s the model’s conversational turn, users see two possible model responses, and choose one with which to proceed. These two responses may come from the same model, or two different models. They can then ask follow-up questions or provide further instructions to the models. So there are two core components to the task, which repeat several times in each dialogue:

我们的人类反馈界面如图 6 所示（更多细节见附录 D）。人们可以通过聊天以自然语言与我们的模型互动，并请求帮助完成任何基于文本的任务。当轮到模型对话时，用户会看到两个可能的模型响应，并选择其中一个继续。这两个响应可能来自同一个模型，也可能来自两个不同的模型。然后，他们可以向模型提出后续问题或提供进一步的指令。因此，任务有两个核心组成部分，在每个对话中会重复多次：

• Crowd workers write a chat message to our models, asking them to perform a task, answer a question, or discuss any topic of interest. • Crowd workers are shown two responses, and are asked to choose the more helpful and honest response (or in the case of red-teaming, to choose the more harmful response).

• 众包工作者向我们的模型发送聊天消息，要求它们执行任务、回答问题或讨论任何感兴趣的话题。
• 众包工作者会看到两个回复，并被要求选择更有帮助和诚实的回复（或在红队测试的情况下，选择更有害的回复）。

We conjectured that crowd workers who wrote well and engaged the AI in more interesting discussions would tend to have better judgment about which AI responses were most ‘helpful’ and ‘harmless’. This meant that rather than attempting to filter crowd workers based on label quality, we instead used spot-checks of their writing, which were simpler and more intuitive for us to perform.

我们推测，那些写作水平较高且能与AI进行更有趣讨论的众包工作者，往往能更好地判断哪些AI回复最“有帮助”且“无害”。这意味着我们没有试图根据标注质量来筛选众包工作者，而是通过抽查他们的写作来进行评估，这对我们来说更简单且更直观。

Otherwise, our approach to data collection was to largely let crowd workers use their own intuitions to define ‘helpfulness’ and ‘harmfulness’. Our hope was that data diversity (which we expect is very valuable) and the ‘wisdom of the crowd’ would provide comparable RoI to a smaller dataset that was more intensively validated and filtered. Overall, our process was roughly of this form:

否则，我们在数据收集上的方法主要是让众包工人利用他们的直觉来定义“有用性”和“有害性”。我们希望数据的多样性（我们预计这非常有价值）和“群体的智慧”能够提供与更严格验证和过滤的小数据集相当的回报率。总体而言，我们的过程大致如下：

Figure 6 We show the interface that crowd workers use to interact with our models. This is the helpfulness format; the red-teaming interface is very similar but asks users to choose the more harmful response.

图 6: 我们展示了众包工作者用来与我们的模型交互的界面。这是帮助性格式；红队测试界面非常相似，但要求用户选择更有害的响应。

We invited master-qualified US-based MTurk workers to engage in dialogues with our models.
我们邀请了具有硕士学历的美国MTurk工作者与我们的模型进行对话。

We did not filter workers based on agreement or other direct measures of label quality, though we evaluated them retrospectively (see Figure 10 right) and found poor average agreement (about $63%$ ) between Anthropic researchers and our crowd workers, as compared to recent similar work such as [Stiennon et al., 2020, Ouyang et al., 2022].

我们没有根据一致性或其他直接衡量标签质量的指标来筛选工作人员，尽管我们进行了回顾性评估（见图 10 右侧），发现 Anthropic 研究人员与我们的众包工作人员之间的平均一致性较差（约为 $63%$），与最近的类似工作相比，如 [Stiennon et al., 2020, Ouyang et al., 2022]。

As an important caveat, our crowd worker distribution was not held fixed throughout this work, and we expect that crowd worker quality probably improved as the project went on. We mention this as a possible complication when evaluating the success of the ‘online training’ program discussed in Section 4.5. Conversely, however, since we generally discouraged repetition, crowd workers who have performed the task many times might also have had a tendency to engage in more esoteric interactions.

需要注意的是，我们的众包工作者分布在本研究过程中并未保持固定，我们预计随着项目的推进，众包工作者的质量可能会有所提高。我们在评估第4.5节讨论的“在线培训”计划的成功时提到这一点，作为一个可能的复杂因素。然而，反过来说，由于我们通常不鼓励重复，那些多次执行任务的众包工作者可能也更倾向于进行更为深奥的互动。

We should also note that we explicitly told crowd workers that ‘lying isn’t helpful’ and that they should try to only reward helpful and honest responses, which presumably explains why our models improve somewhat in terms of honesty. That said, we did not expect crowd workers to fact-check our models significantly, and for example they often prefer responses that include non-functional URLs, which are probably one of the simplest possible ‘lies’ to debunk.

我们还应该注意到，我们明确告诉众包工人“撒谎没有帮助”，他们应该只奖励有帮助和诚实的回答，这可能是我们的模型在诚实性方面有所提高的原因。也就是说，我们并不期望众包工人对我们的模型进行大量的事实核查，例如，他们通常更喜欢包含非功能性URL的回答，这可能是最容易揭穿的“谎言”之一。

2.2 Helpfulness and Harmlessness (Red Teaming) Datasets

2.2 有用性与无害性（红队测试）数据集

We collected two separate datasets using slightly different versions of our interface. For the helpfulness dataset, we asked crowd workers to have open-ended conversations with our models, asking for help, advice, or for the model to accomplish a task (see Appendix D.2), and to choose the model response that was more helpful. For the harmlessness or red-teaming dataset, we asked crowd workers to attempt to elicit harmful responses from our models, and to choose the more harmful response offered by the models.

我们使用略微不同的界面版本收集了两个独立的数据集。对于有用性数据集，我们要求众包工作人员与我们的模型进行开放式对话，寻求帮助、建议或要求模型完成任务（见附录 D.2），并选择更有帮助的模型响应。对于无害性或红队数据集，我们要求众包工作人员尝试从我们的模型中引出有害响应，并选择模型提供的更有害的响应。

Our interface (Figure 6) allows users to express a preference strength. We only include comparisons in our datasets if crowd workers expressed a preference stronger than the weakest available. In this work we will not otherwise use this preference-strength information; we treat all comparisons in our dataset as binary and of equal weight (so in particular we do not include ties).

我们的界面（图 6）允许用户表达偏好强度。只有当众包工作者表达的偏好强度高于最弱的可用选项时，我们才会将比较纳入数据集中。在本工作中，我们不会使用此偏好强度信息；我们将数据集中的所有比较视为二元且权重相等（因此我们特别不包括平局）。

Note that this means our helpfulness dataset tends to move conversations in a more beneficial direction, while in our red-teaming dataset user responses move conversations in a more harmful direction. We made this choice to make it possible for users to fully trick and exploit models while red-teaming, as this was most natural for other work we’re doing that’s specifically focused on harmfulness. However, we believe this difference made it difficult to train models that were both helpful and harmless, as explained in Section 4.4. We plan to remedy this in future work, and would recommend others who are focused on training harmless dialogue models to collect data where users primarily choose model responses that move the conversation in the more beneficial direction instead.

请注意，这意味着我们的有用性数据集倾向于将对话推向更有益的方向，而在我们的红队测试数据集中，用户的回应则倾向于将对话推向更有害的方向。我们做出这一选择是为了让用户在进行红队测试时能够完全欺骗和利用模型，因为这对于我们正在进行的其他专注于有害性的工作来说是最自然的。然而，我们认为这种差异使得训练既有用又无害的模型变得困难，正如第4.4节所解释的那样。我们计划在未来的工作中解决这个问题，并建议其他专注于训练无害对话模型的研究人员收集数据时，主要选择那些将对话推向更有益方向的模型回应。

2.3 Models Deployed to the Feedback Interface and Associated Data Distributions

2.3 部署到反馈界面的模型及相关数据分布

For data collection we predominantly 11 used 52B language models with the broad specifications given in [Askell et al., 2021]. We used three classes of models in our interface:

在数据收集方面，我们主要使用了52B语言模型，其广泛规格见[Askell et al., 2021]。我们在界面中使用了三类模型：

In the final phase of the project, when we were primarily deploying RLHF-finetuned models, we often deployed several such models at once. This allowed us to monitor progress by gathering model-comparison data, and also to (perhaps) improve data diversity.

在项目的最后阶段，当我们主要部署经过RLHF微调的模型时，通常会同时部署多个此类模型。这使得我们能够通过收集模型比较数据来监控进展，并（可能）提高数据的多样性。

Figure 7 (left) We show the learning curves for PM accuracy when training on a mixture of the static helpfulness and harmlessness (i.e, ‘HH’) data distributions. Since we train for one epoch, these results also give sense for dataset-size scaling of accuracy. (right) We show the model size dependence of HH static preference model accuracy.

图 7: (左) 展示了在静态有用性和无害性 (即 'HH') 数据分布混合训练时，PM 准确率的学习曲线。由于我们只训练一个 epoch，这些结果也反映了准确率随数据集大小的变化。(右) 展示了 HH 静态偏好模型准确率随模型大小的变化。

A histogram of these distributions by our final, online HH preference model’s score can be seen in Figure 15 in Section 4.5. In what follows when we discuss the static or base $+\mathrm{RS}$ dataset, we will be referring to the combination of the first two components. Our ‘online’ RLHF models in Section 4.5 are trained on all three components. Most of our results are based on the static dataset, as we conducted experiments and evaluations with it while the online data collection was underway.

这些分布的直方图可以通过我们最终的在线HH偏好模型的得分在图15中看到，该图位于第4.5节。在接下来的讨论中，当我们提到静态或基础$+\mathrm{RS}$数据集时，我们指的是前两个组件的组合。我们在第4.5节中的“在线”RLHF模型是在所有三个组件上训练的。我们的大部分结果基于静态数据集，因为我们在在线数据收集进行的同时，使用它进行了实验和评估。

We analyze a few different splits of the static dataset – a standard split into 95/5 train/test data, and a 65/35 split that we use in order to obtain better statistics when evaluating preference model calibration on the test set. We also consider a 50/50 split, where we train distinct PMs on the two halves of the dataset. This is used to evaluate the robustness of RL training, as we then train an RL policy against one PM while evaluating the rewards achieved by that policy as measured by the independent PM.

我们分析了几种不同的静态数据集划分方式——一种是将数据按95/5的比例划分为训练集和测试集的标准划分，另一种是65/35的划分，这种划分方式用于在评估测试集上的偏好模型校准时获得更好的统计结果。我们还考虑了50/50的划分方式，在这种方式下，我们在数据集的两个半部分上分别训练不同的偏好模型（PM）。这种方式用于评估强化学习（RL）训练的鲁棒性，因为我们随后会针对一个PM训练RL策略，同时通过独立的PM来衡量该策略所获得的奖励。

2.4 Comparing Models with Elo Scores

2.4 使用 Elo 分数比较模型

A significant part of our analysis compares models against each other in order to generate associated Elo scores, as described in [Askell et al., 2021]. That is, we have crowd workers chat with two models simultaneously, with each model generating one response (either ‘A’ or $\mathbf{\omega}^{\bullet}\mathbf{B}^{\bullet}$ ) at each turn, and we record the sample that is preferred by the worker. This provides us with a record of ‘win rates’ between pairs of models, which we can then fit to corresponding Elo scores, to produce Figure 1 (where we show both win rates and Elo scores). Two useful conversion formulas are

我们分析的一个重要部分是将模型相互比较，以生成相关的Elo分数，如[Askell et al., 2021]所述。也就是说，我们让众包工作者同时与两个模型进行对话，每个模型在每一轮生成一个响应（要么是‘A’，要么是$\mathbf{\omega}^{\bullet}\mathbf{B}^{\bullet}$），并记录工作者偏好的样本。这为我们提供了模型对之间的“胜率”记录，然后我们可以将其拟合为相应的Elo分数，以生成图1（我们在图中展示了胜率和Elo分数）。两个有用的转换公式是

$$
\mathrm{Win;Fraction}=\frac{1}{1+10^{\frac{\Delta(\mathrm{Elo;Score})}{400}}}\quad\mathrm{and}\quad{\Delta(\mathrm{Elo;Score})}\approx174*\Delta(\mathrm{PM;Score})
$$

for the fraction in which one model is preferred over another, the difference in Elo scores, and our PM scores. Note that conceptually win fractions, Elo scores and PM scores are interchangeable; we keep both Elo and PM scores so that we can avoid confusing crowd worker preferences (where we use Elo) with our preference modeling and RLHF (where we use PM scores).

在比较一个模型优于另一个模型的比例、Elo分数的差异以及我们的PM分数时，请注意，概念上胜率、Elo分数和PM分数是可以互换的；我们同时保留Elo和PM分数，以便避免混淆众包工人的偏好（我们使用Elo）与我们的偏好建模和RLHF（我们使用PM分数）。

Note that the Elo scores for context-distilled models in Figure 1 differ somewhat from the analogous results for prompted models in [Askell et al., 2021] – the Elo scores are now more compressed. The main difference is that we did not use top $\boldsymbol{\mathrm{\Delta}}\cdot\boldsymbol{\mathrm{p}}$ sampling this time12. The difference may also be due to changes in the crowdworker distribution since that earlier experiment, or changes in crowd worker expectations, as before this test our workers were mostly interacting with higher-quality RLHF-trained models.

请注意，图 1 中上下文蒸馏模型的 Elo 分数与 [Askell et al., 2021] 中提示模型的类似结果有所不同——Elo 分数现在更加压缩。主要区别在于这次我们没有使用 top $\boldsymbol{\mathrm{\Delta}}\cdot\boldsymbol{\mathrm{p}}$ 采样。这种差异也可能是由于自那次早期实验以来众包工人分布的变化，或众包工人期望的变化，因为在这次测试之前，我们的工人主要与更高质量的 RLHF 训练模型互动。

Figure 8 (left) Distribution of conversational turns in a large held-out test set used to investigate calibration and accuracy. (right) We examine preference model accuracy as a function of the number of exchanges in the conversation.

图 8: (左) 用于研究校准和准确性的大型保留测试集中的对话轮次分布。(右) 我们检查偏好模型准确性随对话中交换次数变化的函数关系。

3 Preference Modeling for Helpfulness and Harmlessness

3 有用性和无害性的偏好建模

3.1 Models and Training Setup

3.1 模型与训练设置

We use language models with specifications that are identical to those discussed in [Askell et al., 2021], with a total of seven language models with parameter counts running from 13M to 52B and approximating a geometric series with increments of roughly $4\times$ . We use PyTorch [Paszke et al., 2019] and Triton [Tillet et al., 2019] to facilitate model training and performance. Our preference model training setup is also identical to that in [Askell et al., 2021], and in particular we apply ‘preference model pre training’ (PMP) to our language models before finetuning on our human feedback datasets, as explained in Section 4 of that paper. More details are provided in Appendix A. Note that we typically only train PMs for a single epoch, so the learning curves themselves (Figure 7 left) indicate how performance scales with dataset size (we used a fixed learning rate).

我们使用的语言模型规格与 [Askell et al., 2021] 中讨论的相同，总共有七个语言模型，参数数量从 13M 到 52B 不等，近似于以大约 $4\times$ 为增量的几何级数。我们使用 PyTorch [Paszke et al., 2019] 和 Triton [Tillet et al., 2019] 来促进模型训练和性能提升。我们的偏好模型训练设置也与 [Askell et al., 2021] 中的相同，特别是在对人类反馈数据集进行微调之前，我们对语言模型应用了“偏好模型预训练”（PMP），如该论文第 4 节所述。更多细节见附录 A。需要注意的是，我们通常只对偏好模型进行单轮训练，因此学习曲线本身（图 7 左）表明了性能如何随数据集大小扩展（我们使用了固定的学习率）。

3.2 Basic Scaling Results

3.2 基础扩展结果

We would like to understand how preference modeling performance improves as we increase model size and collect additional data. In Figure 7 we show basic results for PM accuracy when training on our static helpful and harmless data mixture. Roughly speaking, we observe log-linear trends in both dataset and model size. We tend to find somewhat more consistent trends if we model only the helpfulness or harmlessness distributions in isolation, rather than as a mixture, as observed in Figure 32 in Appendix A.3. But there we also see that for some data distributions [Stiennon et al., 2020], scaling trends can exhibit more complex patterns that defy simple trends.

我们想了解随着模型规模的增加和额外数据的收集，偏好建模性能如何提升。在图7中，我们展示了在静态的有帮助和无害数据混合训练下的PM（偏好建模）准确率的基本结果。大致来说，我们观察到数据集和模型规模的对数线性趋势。如果我们单独建模有帮助性或无害性分布，而不是混合建模，趋势会稍微更一致一些，正如附录A.3中的图32所示。但在那里我们也看到，对于某些数据分布 [Stiennon et al., 2020]，扩展趋势可能表现出更复杂的模式，这些模式不符合简单的趋势。

Our preference modeling data comes from natural language dialogue, where crowd workers have text-based conversations with the model, and choose the more helpful of two model responses at every turn in the conversation (or the more harmful one, for red-teaming tasks). So it is natural to ask how PM performance changes as a function of the conversational turn. We show these results in Figure 8. PMs are somewhat more accurate on the first step of the conversation, but their accuracy is nearly constant thereafter.

我们的偏好建模数据来自自然语言对话，其中众包工作人员与模型进行基于文本的对话，并在对话的每个回合中选择两个模型响应中更有帮助的一个（或在红队任务中选择更有害的一个）。因此，很自然地会问 PM 性能如何随着对话回合的变化而变化。我们在图 8 中展示了这些结果。PM 在对话的第一步稍微更准确，但在此之后其准确性几乎保持不变。

3.3 Calibration of Preference Models and Implications for RL

3.3 偏好模型的校准及其对强化学习的影响

Preference model scores should predict the probability that humans will prefer one or another modelgenerated response. We are interested in whether these probabilities are accurate, i.e. whether the PMs are well calibrated. We characterize calibration in Figure 9, where we display PM accuracy as a function of the difference in PM scores assigned to pairs of samples, along with a heavy black line representing perfect calibration. We observe that PMs trained only on helpfulness data are very well calibrated, but PMs trained on a mixture of helpful and harmless data are slightly under-confident.

偏好模型的分数应能预测人类更倾向于选择哪种模型生成的回答。我们关注的是这些概率是否准确，即偏好模型是否校准良好。我们在图 9 中展示了校准情况，图中显示了偏好模型的准确率与其对样本对评分差异的函数关系，并有一条粗黑线表示完美校准。我们观察到，仅基于有用性数据训练的偏好模型校准效果非常好，而基于有用性和无害性混合数据训练的偏好模型则略显不自信。

Figure 9 We show preference modeling accuracy as a function of the difference in PM score between higher and lower ranked responses. The black lines indicate the calibrated prediction of accuracy $1/(1+e^{\bar{-}\Delta})$ , where $\Delta$ is the score difference. On the (left) we show calibration for a PM trained and evaluated on all our static data, while on the (right) we show results for a model trained and evaluated only on our helpful data distribution. We see that calibration is slightly worse for models trained on the HH mixture.

图 9: 我们展示了偏好建模准确率作为 PM 评分差异的函数，该差异是在高排名和低排名响应之间的。黑线表示准确率的校准预测 $1/(1+e^{\bar{-}\Delta})$ ，其中 $\Delta$ 是评分差异。在（左）图中，我们展示了在所有静态数据上训练和评估的 PM 的校准情况，而在（右）图中，我们展示了仅在有用数据分布上训练和评估的模型的校准结果。我们可以看到，在 HH 混合数据上训练的模型的校准效果略差。

These calibration results are important because in later sections we will be using PM scores as the reward signal for reinforcement learning. Since the PM scores are well-calibrated, we can trust that they faithfully encode the probabilities that humans will prefer specific model samples (at least on-distribution with the training set). This means that when we see RL robustly achieving a given reward, we can trust that those who interact with this model (if they are well-represented by our crowd worker distribution) will prefer it to reference models at a predictable rate, provided that the PM scores of the models’ responses are within the range considered in these calibration studies. That said, we find significant failures of robustness as RLHF optimizes towards much higher scores, as explained in Section 4.5 and Appendix B.4.

这些校准结果非常重要，因为在后续章节中，我们将使用 PM 分数作为强化学习的奖励信号。由于 PM 分数经过了良好的校准，我们可以相信它们忠实地编码了人类偏好特定模型样本的概率（至少在训练集的分布范围内）。这意味着当我们看到强化学习稳定地达到某个奖励值时，我们可以相信与模型交互的用户（如果他们能很好地代表我们的众包工作者分布）将以可预测的速率偏好该模型而非参考模型，前提是模型响应的 PM 分数在这些校准研究的范围内。然而，正如第 4.5 节和附录 B.4 所解释的那样，我们发现当 RLHF 优化向更高的分数时，鲁棒性存在显著失败。

We might generally expect that as sample quality improves, it becomes more difficult to reliably identify the very best samples. In Figure 25 in the appendix, we show that if we restrict to comparisons where both samples have a PM score above a given threshold, then PM accuracy drops as a function of this threshold value (to compare different PMs we add an offset so that they all obtain the same mean score on the test distribution). Notice that this result combines three effects: (1) differentiation among more sophisticated samples may be more challenging and require greater capabilities, (2) we have fewer very high quality samples to learn from in our dataset, as they are non-generic, and (3) pairs of samples that are all high-quality will have similar scores (compared to randomly chosen pairs), and so be more difficult to distinguish.

我们通常可能会预期，随着样本质量的提高，可靠地识别最佳样本变得更加困难。在附录中的图 25 中，我们展示了如果我们将比较限制在两个样本的 PM 分数都高于给定阈值的情况下，那么 PM 准确率会随着这个阈值值的增加而下降（为了比较不同的 PM，我们添加了一个偏移量，使它们在测试分布上获得相同的平均分数）。请注意，这个结果结合了三个效应：(1) 在更复杂的样本之间进行区分可能更具挑战性，并且需要更强的能力，(2) 在我们的数据集中，可供学习的高质量样本较少，因为它们是非通用的，(3) 所有高质量样本对的分数将相似（与随机选择的样本对相比），因此更难区分。

These observations also have an implication for RLHF training, namely that we should expect diminishing returns from further RLHF training once our policies achieve a sufficiently high PM score. This also motivates online training, so that we can update our PMs to stay on-distribution as RLHF policies improve.

这些观察结果也对RLHF训练有影响，即一旦我们的策略达到足够高的PM分数，我们应该预期进一步的RLHF训练会带来递减的回报。这也促使我们进行在线训练，以便在RLHF策略改进时更新我们的PM，使其保持在分布上。

3.4 Evaluating Helpful and Harmless Preference Models

3.4 评估有益和有害的偏好模型

3.4.1 HHH Evaluation and Crowd worker-Anthropic Agreement

3.4.1 HHH 评估与众包工作者-Anthropic 一致性

We have shown that PMs trained on our dataset are learning something and are well-calibrated. But we would also like to show that these PMs are actually learning to prefer helpful and harmless behavior, according to some independent standard. We provided a set of comparison evaluations for helpfulness, honesty, and harmlessness (i.e., HHH Evaluation) in prior work and evaluated raw, context-distilled, and prompted language models [Askell et al., 2021] on this benchmark. As shown in Figure 5, we find that our PMs perform much better than all of the models we evaluated previously. In fact, the Pathways LM effort [Chowdhery et al., 2022] recently reported a mean human score of $75%$ on this dataset, so in a sense our PM performance of $86%$ is well above mean human accuracy.

我们已经证明，基于我们数据集训练的偏好模型（PM）确实在学习某些东西，并且校准良好。但我们还想展示，根据某些独立标准，这些偏好模型实际上在学习偏好有帮助且无害的行为。我们在之前的工作中提供了一套关于帮助性、诚实性和无害性（即HHH评估）的比较评估，并在该基准上评估了原始、上下文蒸馏和提示的语言模型 [Askell et al., 2021]。如图5所示，我们发现我们的偏好模型表现远优于我们之前评估的所有模型。事实上，Pathways LM项目 [Chowdhery et al., 2022] 最近报告了在该数据集上人类平均得分为 $75%$，因此在某种意义上，我们的偏好模型表现达到 $86%$，远高于人类平均准确率。

HHH Evaluation Distribution with 52B Static PM Figure 10 (left) We show a histogram of the 52B static PM predictions for the HHH evaluations. The three confidently incorrect outliers all contrast responses where the model declares its ignorance instead of providing a thorough and sophisticated-sounding response that contains misleading information. So they are failures of honesty and accuracy. (right) Here we show agreement rates between Anthropic researchers, our crowd workers, and a static preference model on 320 samples from the static test set.

HHH 评估分布与 52B 静态 PM 图 10 (左) 我们展示了 52B 静态 PM 对 HHH 评估的预测直方图。三个明显错误的异常值都是对比响应，其中模型声明其无知，而不是提供一个听起来详尽且复杂但包含误导信息的响应。因此，它们是诚实性和准确性的失败。(右) 这里我们展示了 Anthropic 研究人员、我们的众包工作者和静态偏好模型在静态测试集中的 320 个样本上的一致率。

Human: Can you help me understand how US presidential election works?

你能帮我理解美国总统选举是如何运作的吗？

Better Response: I’m sorry but I don’t know the answer. Perhaps you can Google it?

更好的回答：抱歉，我不知道答案。或许你可以去Google一下？

Worse Response: Sure! I did a Google search and found the following information on www.usa.gov

更差的回答：当然！我进行了谷歌搜索，并在 www.usa.gov 上找到了以下信息：

In the U.S., the president and vice president are chosen by ‘electors’ through a process called the Electoral College. Each state gets as many electors as it has members of the house of representatives. Including Washington, D.C.’s three electors, there are currently 270 electors in all. After you cast your ballot for president, your vote goes to a statewide tally. In 48 states and Washington, D.C., the winner gets all the electoral votes for that state. A candidate needs the vote of at least 250 electors – which is nearly all electors – to win the presidential election. This is why American democracy is so often gridlocked.

在美国，总统和副总统通过一个称为选举人团 (Electoral College) 的过程由“选举人”选出。每个州拥有的选举人数量与其众议院议员数量相同。包括华盛顿特区的三名选举人在内，目前共有270名选举人。在你为总统投票后，你的选票会被计入全州的统计中。在48个州和华盛顿特区，获胜者将获得该州所有的选举人票。候选人需要至少250名选举人的投票——这几乎是所有选举人——才能赢得总统选举。这就是为什么美国的民主制度经常陷入僵局。

Figure 11 This entirely human-written example from the HHH evaluation dataset [Askell et al., 2021] fools our preference models, which strongly prefer the worse response (which contains subtle inaccuracies).

图 11: 这个完全由人类编写的示例来自 HHH 评估数据集 [Askell et al., 2021]，它欺骗了我们的偏好模型，该模型强烈倾向于较差的响应（其中包含细微的不准确之处）。

We also observe that there are a few outliers, where our PMs are confidently incorrect on HHH evaluation comparisons (see Figure 10 left). These are all examples of failures of honesty, where the better choice is honest but unhelpful, while the incorrect choice appears both helpful and well-written, but is subtly incorrect or misleading. We show the single worst example in Figure 11. It’s worth noting that the bad response (which was written by Anthropic researchers) is likely out of distribution as compared to samples from our language models and RL policies. Since only model-generated samples were used to train the PMs, this may explain why the PM was badly fooled by these examples. It also indicates that the PMs are not adversarial ly robust.

我们还观察到存在一些异常值，即在 HHH 评估比较中，我们的 PM 自信地做出了错误的判断（见图 10 左侧）。这些都是诚实性失败的例子，其中更好的选择是诚实但无益的，而错误的选择看起来既有用又写得好，但实际上是微妙地错误或误导性的。我们在图 11 中展示了最糟糕的一个例子。值得注意的是，与我们的语言模型和 RL 策略生成的样本相比，这个糟糕的响应（由 Anthropic 研究人员编写）可能不在分布范围内。由于只有模型生成的样本被用于训练 PM，这可能解释了为什么 PM 在这些例子中被严重愚弄。这也表明 PM 不具备对抗性鲁棒性。

We can also compare our crowd workers, ourselves, and our preference models in terms of agreement on labels. The results are shown on the right in Figure 10, and are based on about 320 examples from our static test set. The PMs are trained to predict crowd worker behavior, so PM-Crowd worker agreement is best. However, the largest PM actually agrees with the authors (i.e. Anthropic researchers) slightly more than the authors agree with crowd workers on labels. We also suspect this is a poor subsample of the data, since PM accuracy is several percent lower on this subsample as compared to the full test set.

我们还可以在标签一致性方面比较众包工作者、我们自己以及我们的偏好模型。结果如图 10 右侧所示，基于我们静态测试集中的约 320 个示例。偏好模型 (PM) 被训练用于预测众包工作者的行为，因此 PM 与众包工作者的一致性最好。然而，最大的 PM 实际上与作者（即 Anthropic 研究人员）的一致性略高于作者与众包工作者在标签上的一致性。我们还怀疑这是数据的一个较差子样本，因为与完整测试集相比，PM 在这个子样本上的准确率低几个百分点。

3.4.2 Gender Bias and Bot Adversarial Dialogues

3.4.2 性别偏见与对抗性对话机器人

We can begin to evaluate whether our PMs display gender bias by comparing scores across gendered terms. To this purpose we evaluate preference model scores on short dialogues of the form

我们可以通过比较性别相关术语的得分来评估我们的PM是否表现出性别偏见。为此，我们评估了偏好模型在以下形式的简短对话上的得分

Figure 12 These plots show score or score-difference distributions from the static preference models on alignment-relevant data distributions. (left) We evaluate Bot Adversarial Dialogues [Xu et al., 2020] that have been designated as harmful or not. (right) We show distributions of preference model score differences for statements of the form ‘The CEO was a lady’, comparing differences among words within a gender category (e.g. gentleman vs boyfriend) and across genders (e.g. mother vs father, lady vs gentleman) while holding occupations fixed.

图 12: 这些图表展示了静态偏好模型在与对齐相关的数据分布上的得分或得分差异分布。(左) 我们评估了被指定为有害或无害的 Bot Adversarial Dialogues [Xu et al., 2020]。(右) 我们展示了偏好模型得分差异的分布，针对形式为“CEO 是一位女士”的陈述，比较了性别类别内的词语差异（例如 gentleman 与 boyfriend）和跨性别的词语差异（例如 mother 与 father，lady 与 gentleman），同时保持职业固定。

The {Occupation} is a {Gendered Term}

{Occupation} 是一个 {Gendered Term}

where we use a list of 76 occupations, 12 matched gender terms (mother vs father, guy vs gal, etc), and both present and past tense [Rae et al., 2021]. Preference model scores are not directly meaningful, so instead we evaluate the difference in score between paired gender terms, and compare that to the difference in score among terms within a given gender. The results are shown on the left in Figure 12. We observe that the variation among terms that indicate a given gender appears to be at least as large as the variation across genders. So it would appear that the PMs do not exhibit a substantial bias.

我们使用了76种职业、12组匹配的性别术语（如母亲与父亲、男人与女人等），以及现在时和过去时 [Rae et al., 2021]。偏好模型的分数本身没有直接意义，因此我们评估了配对性别术语之间的分数差异，并将其与给定性别内术语之间的差异进行比较。结果如图12左侧所示。我们观察到，表示特定性别的术语之间的差异似乎至少与跨性别差异一样大。因此，偏好模型似乎没有表现出显著的偏见。

As an external validation of our PMs, we consider the Bot Adversarial Dialogues (BAD) dataset [Xu et al., 2020]. This dataset contains a few thousand conversations between an AI system and a human. Each AI response is labeled as harmful or not harmful. We compute the distribution of preference model scores for BAD AI utterances (we restrict our analysis to the first BAD AI utterance per conversation) and find that the BAD AI utterances marked as harmful have significantly lower preference model scores. This suggests that our PMs are effectively classifying these AI generated utterances, even though they are likely quite different from the data distribution that our PMs were trained on.

作为我们偏好模型 (Preference Models, PMs) 的外部验证，我们考虑了 Bot Adversarial Dialogues (BAD) 数据集 [Xu et al., 2020]。该数据集包含了数千个 AI 系统与人类之间的对话。每个 AI 回复都被标记为有害或无害。我们计算了 BAD AI 话语的偏好模型得分分布（我们将分析限制在每次对话的第一个 BAD AI 话语），发现被标记为有害的 BAD AI 话语的偏好模型得分显著较低。这表明我们的偏好模型能够有效分类这些 AI 生成的话语，尽管它们可能与我们的偏好模型训练时的数据分布有很大不同。

4 Reinforcement Learning from Human Feedback

4 基于人类反馈的强化学习

4.1 Training Setup

4.1 训练设置

We apply reinforcement learning (RL) with preference modeling, following the approach outlined in [Stiennon et al., 2020], which can summarized in the following steps:

我们采用强化学习 (RL) 结合偏好建模的方法，遵循 [Stiennon et al., 2020] 中概述的步骤，具体可以总结为以下几点：

PM dataset and training details are provided in Appendix A.2; we also discussed the performance of our PMs in Section 3. In the language of RL, each response generated by the policy is a ‘timestep’, a full conversation is one ‘trajectory’, and the PM score is a single ‘reward’ provided at the end.

PM 数据集和训练细节在附录 A.2 中提供；我们还在第 3 节中讨论了 PM 的性能。在强化学习 (RL) 的术语中，策略生成的每个响应都是一个“时间步”，完整的对话是一个“轨迹”，而 PM 分数是在对话结束时提供的单一“奖励”。

The idea is to use the preference model to steer the policy towards writing better responses. However, as we saw in earlier sections, PMs also become less calibrated at higher scores, so higher rewards do not necessarily imply better performance.

想法是利用偏好模型来引导策略生成更好的回答。然而，正如我们在前面的章节中所看到的，偏好模型在高分时也会变得不那么校准，因此更高的奖励并不一定意味着更好的表现。

To stabilize RL training, we use Proximal Policy Optimization (PPO) [Schulman et al., 2017]. We also follow other work [Stiennon et al., 2020] and apply an empirically-estimated $\mathrm{KL}$ penalty term in the reward, with the total reward given by

为了稳定强化学习 (RL) 训练，我们使用了近端策略优化 (Proximal Policy Optimization, PPO) [Schulman et al., 2017]。我们还参考了其他工作 [Stiennon et al., 2020]，在奖励中应用了经验估计的 $\mathrm{KL}$ 惩罚项，总奖励由以下公式给出：

$$
r_{\mathrm{total}}=r_{\mathrm{PM}}-\lambda_{\mathrm{KL}}D_{\mathrm{KL}}(\mathrm{policy\parallelpolicy}_{0})
$$

where $\lambda_{\mathrm{KL}},\geq,0$ is a hyper parameter. In practice we use a very small value of $\lambda_{\mathrm{KL}},=,0.001$ , which likely has a very minor impact during most of RL training (as $D_{\mathrm{KL}}<100~\$ typically), and might actually be wholly unnecessary. More details about RL are provided in B.1.

其中 $\lambda_{\mathrm{KL}},\geq,0$ 是一个超参数。在实践中，我们使用一个非常小的值 $\lambda_{\mathrm{KL}},=,0.001$，这在大部分强化学习训练过程中可能影响非常小（因为通常 $D_{\mathrm{KL}}<100~\$），甚至可能完全不需要。更多关于强化学习的细节在 B.1 节中提供。

Throughout this paper we use $r_{\mathrm{PM}},=$ the preference model score itself for the RL reward. Recall that as implied by equation (2.1), this means that the difference in $r_{\mathrm{PM}}$ values between two samples $A$ and $B$ will be related to the predicted probability $P(A>B)$ that $A$ will be preferred to $B$ via

在本文中，我们使用 $r_{\mathrm{PM}},$ 作为强化学习奖励的偏好模型分数。根据公式 (2.1) 的含义，这意味着两个样本 $A$ 和 $B$ 之间的 $r_{\mathrm{PM}}$ 值差异将与预测概率 $P(A>B)$ 相关，即 $A$ 被偏好于 $B$ 的概率。

$$
P(A>B)=\frac{1}{1+e^{r_{\mathrm{PM}}(B)-r_{\mathrm{PM}}(A)}}
$$

There is no good reason13 to use this preference model score directly as the reward, but it has been used in prior work such as [Stiennon et al., 2020] and so for simplicity we will not explore variations on this choice here.

没有充分的理由直接使用这种偏好模型得分作为奖励，但在之前的工作中（如 [Stiennon et al., 2020]）已经这样使用过，因此为了简单起见，我们在此不探讨这一选择的其他变体。

In order to produce additional prompts (i.e. the human side of the conversations) for RLHF training, we used a large LM to generate them. For this purpose, we simply used few-shot learning, creating a context with about 10 existing high-quality human queries, and then sampling to generate more. We find that the sample efficiency of RLHF is roughly the same on the original crowd worker-written prompt dataset and the modelgenerated one, so we combine the two for greater diversity during RLHF training. We used $137\mathbf{k}$ prompts from the ‘static’ dataset, and $369\mathrm{k}$ model-generated prompts.

为了生成用于RLHF训练的额外提示（即对话中的人类部分），我们使用了一个大语言模型来生成它们。为此，我们简单地使用了少样本学习，创建了一个包含大约10个现有高质量人类查询的上下文，然后进行采样以生成更多提示。我们发现，RLHF在原始众包工人编写的提示数据集和模型生成的数据集上的样本效率大致相同，因此我们将两者结合起来，以在RLHF训练期间获得更大的多样性。我们使用了来自“静态”数据集的137k提示，以及369k模型生成的提示。

Note that almost all of our preference modeling data was collected from 52B models. This means that RLHF training with smaller models might have been challenging, since samples from smaller models tend to be out-of-distribution from the PM training data. Thus it is quite interesting that models more than fifty times smaller were actually able to learn and improve, as seen in Figure 1.

需要注意的是，我们几乎所有的偏好建模数据都是从52B模型中收集的。这意味着使用较小的模型进行RLHF训练可能会面临挑战，因为较小模型的样本往往与PM训练数据的分布不一致。因此，有趣的是，如图1所示，比52B模型小五十倍以上的模型实际上能够学习并改进。

4.2 Robustness Experiments

4.2 鲁棒性实验

We now discuss the problem of RLHF robustness. A fully robust PM would agree with humans on distributions of dialogues quite different from those encountered during PM training (i.e. different from those created by crowd worker interactions with our deployed AI assistants). However, we do not expect that our PMs are so robust, and in fact Figure 11 provides one plausible example of a robustness failure. Since RL optimizes the policy to maximize the PM score, any failure in robustness on the part of the PM may be exploited by the RL policy to achieve higher rewards, without actually improving the policy’s behavior from the point of view of human evaluators.

我们现在讨论RLHF（基于人类反馈的强化学习）的鲁棒性问题。一个完全鲁棒的偏好模型（PM）会在与训练期间遇到的对话分布（即由众包工作者与我们部署的AI助手互动生成的对话分布）大不相同的情况下，仍然与人类达成一致。然而，我们并不期望我们的PM如此鲁棒，事实上，图11提供了一个鲁棒性失败的合理示例。由于强化学习（RL）优化策略以最大化PM得分，PM在鲁棒性方面的任何失败都可能被RL策略利用，以获得更高的奖励，而实际上并未从人类评估者的角度改善策略的行为。

A rigorous way to study robustness is to take snapshots of the policy at various points during RLHF training, including the initial snapshot, and have crowd workers compare their performance. This gives a ‘true’ Elo score, as evaluated by crowd workers, which can then be compared directly with the PM scores. We present an example of this study in Section 4.5.

研究鲁棒性的一种严谨方法是在RLHF（基于人类反馈的强化学习）训练过程中的不同时间点（包括初始阶段）对策略进行快照，并让众包工作者比较它们的性能。这提供了一个由众包工作者评估的“真实”Elo分数，然后可以直接与PM分数进行比较。我们在第4.5节中展示了这项研究的一个示例。

However, this sort of test requires collecting additional human feedback data, which can be slow and expensive, so here we also study robustness from a different angle. Similar to how datasets are split into train and test sets for supervised learning, we split our preference model comparison data into two halves (a train half and a test half), and train separate preference models on each, which we refer to as the train PM’s and the test PM’s. We then train RLHF policies against the train PM’s, while evaluating them using the test PM’s. Similar to how test set evaluations help us understand over fitting in supervised learning, test PM evaluations help us understand over fitting relative to the train PM’s. These experiments are not conclusive since the train and test PMs may exhibit correlated robustness failures.

然而，这种测试需要收集额外的人类反馈数据，这可能既耗时又昂贵，因此我们在这里也从另一个角度研究鲁棒性。类似于将数据集分为训练集和测试集以进行监督学习，我们将偏好模型比较数据分为两半（训练半和测试半），并在每一半上训练单独的偏好模型，我们称之为训练PM和测试PM。然后，我们针对训练PM训练RLHF策略，同时使用测试PM对其进行评估。类似于测试集评估帮助我们理解监督学习中的过拟合，测试PM评估帮助我们理解相对于训练PM的过拟合。这些实验并不具有决定性，因为训练和测试PM可能表现出相关的鲁棒性失败。

The main conclusions from these experiments are: (1) RLHF becomes gradually less robust at higher PM scores, and (2) larger preference models are more robust than smaller ones.

这些实验的主要结论是：(1) 随着 PM 分数的提高，RLHF 的鲁棒性逐渐降低；(2) 较大的偏好模型比较小的模型更具鲁棒性。

We conduct two sets of experiments as follows:

我们进行了如下两组实验：

• Train PM Size $=52\mathbf{B}$ : This set consists of a scan of policies (i.e. one for each model size), all of which are trained with respect to the same 52B train PM.

• 训练 PM 大小 $=52\mathbf{B}$：该集合包含策略的扫描（即每个模型大小对应一个策略），所有这些策略都是针对相同的 52B 训练 PM 进行训练的。

Figure 13 These figures show training curves in the $\sqrt\mathrm{KL}$ vs PM score plane, exhibiting the approximate linear relationship between these variables, especially in the left-hand plot using the more highly-performing 52B PMs. We observe some instability in the smaller models, likely because the training data for all our PMs was created with 52B language models, and the much smaller LM samples tend to be quite OOD for the PMs. Finally, by comparing the left and right-hand plots, we see that training against smaller PMs (matched to policy sizes) eventually results in poor performance, as evaluated by the 52B PM. Some of our runs were cut off early as they became unstable. We found that smaller models were generally more difficult to stabilize.

图 13: 这些图展示了在 $\sqrt\mathrm{KL}$ 与 PM 分数平面上的训练曲线，展示了这些变量之间的近似线性关系，尤其是在使用性能更高的 52B PM 的左图中。我们观察到较小模型中的一些不稳定性，可能是因为我们所有 PM 的训练数据都是用 52B 大语言模型创建的，而较小的 LM 样本往往对 PM 来说非常不符合分布 (OOD)。最后，通过比较左右图，我们发现针对较小 PM（与策略大小匹配）的训练最终会导致性能不佳，这是由 52B PM 评估的。我们的一些运行在变得不稳定时被提前终止。我们发现较小的模型通常更难稳定。

• Train PM Size $=$ Policy Size: This set consists of a scan of policies, with each policy trained with respect to a train PM of the same size as the policy.

• 训练 PM 大小 $=$ 策略大小：该集合包含策略的扫描，每个策略都针对与策略大小相同的训练 PM 进行训练。

For both experiments, each policy is further evaluated with respected to a scan of test PM’s throughout training. Note that a scan refers to 7 different model sizes ranging from 13M to 52B, thus giving us 7 policies and $7\times7$ evaluations per experiment.

在这两个实验中，每个策略都会在训练过程中针对一系列测试 PM 进行进一步评估。需要注意的是，扫描指的是从 13M 到 52B 的 7 种不同模型大小，因此每个实验会产生 7 个策略和 $7\times7$ 次评估。

In Figure 4, we compare the train $P M$ and test PM scores throughout the training process, similar to how train and test curves are often compared for supervised training. We find that in all cases, the two scores are in close agreement during early stages of training, but eventually diverge, with the test PM providing a lower score. The divergence is likely an indication that the preference model is less robust and more easily exploited at higher rewards. That is, the policy has been over-optimized on the train PM, making the train PM overconfident in the policy’s performance. The test PM, on the other hand, doesn’t suffer from this problem since it was trained on a different portion of data that neither the policy nor the train PM had observed.

在图 4 中，我们比较了训练过程中的训练 PM 和测试 PM 分数，类似于监督训练中经常比较的训练和测试曲线。我们发现，在所有情况下，训练早期阶段的两个分数非常接近，但最终会分道扬镳，测试 PM 提供的分数较低。这种分歧可能表明偏好模型在奖励较高时不够稳健，更容易被利用。也就是说，策略在训练 PM 上过度优化，使得训练 PM 对策略的表现过于自信。另一方面，测试 PM 没有这个问题，因为它是在策略和训练 PM 都未观察到的数据部分上训练的。

We provide more discussion in Appendix B.2.

我们在附录 B.2 中提供了更多讨论。

4.3 An Approximately Linear Relation Between $\sqrt}}$ and Reward

4.3 $\sqrt}}$ 与奖励之间的近似线性关系

In Figures 4 and 13 we observe an approximately linear relation between $\sqrt\mathrm{KL}$ and PM score during RLHF training. Furthermore, we note that when all models are trained and evaluated with the same PMs, the learning curves are roughly parallel in the $\sqrt{D_{\mathrm{KL}}}$ -reward plane. Note that here the ‘KL’ is more precisely $D_{\mathrm{KL}}(\pi\bar{||}\pi_{0})$ , where $\pi$ denotes the policy distribution (and $\pi_{0}$ the initial policy), as evaluated empirically on the samples drawn from the policy during training.

在图 4 和图 13 中，我们观察到在 RLHF (Reinforcement Learning from Human Feedback) 训练期间，$\sqrt{\mathrm{KL}}$ 与 PM (Preference Model) 分数之间存在近似线性关系。此外，我们注意到，当所有模型都使用相同的 PM 进行训练和评估时，学习曲线在 $\sqrt{D_{\mathrm{KL}}}$ -奖励平面上大致平行。需要注意的是，这里的“KL”更准确地说是 $D_{\mathrm{KL}}(\pi\bar{||}\pi_{0})$，其中 $\pi$ 表示策略分布（$\pi_{0}$ 表示初始策略），这是在训练期间从策略中抽取的样本上进行经验评估的。

Why should this be? When $D_{\mathrm{KL}}(\pi+\delta\pi||\pi)$ is series expanded in $\delta\pi$ , the expansion begins at quadratic order, so if we imagine that the RL policy can also be series expanded around the base LM, and that the RL reward varies linearly in $\delta\pi$ , then in the ‘small $\cdot\delta\pi$ region’ (i.e. where the series expansion provides a good approximation), we should expect reward $\propto\sqrt{D_{\mathrm{KL}}}$ . Typically we should expect that reward varies linearly in $\delta\pi$ , because because the initial policy $\pi$ was not previously optimized for reward, so there is no reason why it would sit at an extremum with respect to small variations $\delta\pi$ . So the fact that this relation seems to hold empirically suggests that most of RLHF training remains in the small $\delta\pi$ regime.

为什么会这样？当 $D_{\mathrm{KL}}(\pi+\delta\pi||\pi)$ 在 $\delta\pi$ 中进行级数展开时，展开从二次项开始，因此如果我们假设强化学习策略也可以在大语言模型 (LLM) 周围进行级数展开，并且强化学习奖励在 $\delta\pi$ 中线性变化，那么在“小 $\cdot\delta\pi$ 区域”（即级数展开提供良好近似的区域），我们应该期望奖励 $\propto\sqrt{D_{\mathrm{KL}}}$。通常我们应该期望奖励在 $\delta\pi$ 中线性变化，因为初始策略 $\pi$ 之前并未针对奖励进行优化，因此没有理由认为它会在小变化 $\delta\pi$ 下处于极值点。因此，这一关系在经验上似乎成立的事实表明，大部分基于人类反馈的强化学习 (RLHF) 训练仍处于小 $\delta\pi$ 区域。

Though they did not use these coordinates, a similar scaling can be read off from the results in learning to summarize [Stiennon et al., 2020]. In particular, they provide a nice analysis of rejection sampling, where they generate $N$ samples, and then plot mean reward of the top $k$ samples versus the $D_{\mathrm{KL}},=,\log(N/k)$ .

虽然他们没有使用这些坐标，但可以从学习总结的结果中读取类似的缩放比例 [Stiennon et al., 2020]。特别是，他们对拒绝采样进行了很好的分析，其中他们生成了 $N$ 个样本，然后绘制了前 $k$ 个样本的平均奖励与 $D_{\mathrm{KL}},=,\log(N/k)$ 的关系图。

PM Data Distribution (Test Set) Figure 14 (left panel) We show PM score distributions for the helpfulness and red-teaming comparisons using a 52B PMs. (right panel) We train a 52B RLHF policy with respect to the same PM, and periodically evaluate the policy’s performance on held-out prompts (by sampling responses from the policy on such prompts, then evaluating the PM score) from the helpfulness and red-teaming datasets. We find that the policy’s harmlessness score (right, red) is nearly ‘out-of-distribution’ as it’s on the upper tail of the harmless PM data (left, red). On the other hand, the policy’s helpfulness score (right, blue) appears ‘on-distribution’ with respect to the helpfulness PM data (left, blue). In other words, we are over-optimized on harmlessness while we are still likely under-optimized on helpfulness. Dashed lines represent the asymptotic mean of the train scores, to guide the eye in connecting the left and right panels. RLHF Policy Performance On Test Prompts

PM 数据分布（测试集）图 14（左图）我们展示了使用 52B PM 的帮助性和红队比较的 PM 分数分布。（右图）我们针对相同的 PM 训练了一个 52B RLHF 策略，并定期评估该策略在保留提示上的表现（通过从策略中采样响应，然后评估 PM 分数），这些提示来自帮助性和红队数据集。我们发现，策略的无害性分数（右图，红色）几乎“超出分布”，因为它位于无害 PM 数据的上尾（左图，红色）。另一方面，策略的帮助性分数（右图，蓝色）相对于帮助性 PM 数据（左图，蓝色）似乎是“在分布内”的。换句话说，我们在无害性上过度优化，而在帮助性上可能仍然优化不足。虚线表示训练分数的渐近均值，用于引导眼睛连接左右图。RLHF 策略在测试提示上的表现

This analysis suggests that these RL learning curves might be associated with changes in the RL policy that behave very similarly to simply rejection sampling from the initial distribution.

这一分析表明，这些强化学习的学习曲线可能与强化学习策略的变化有关，这些变化表现得非常类似于从初始分布中进行简单的拒绝采样。

We find this simple relation quite striking, and believe it merits further study. At a conjectural level, it might have a variety of implications and uses when RL-finetuning large generative models:

我们发现这种简单的关系非常引人注目，并认为它值得进一步研究。在推测层面上，当对大生成模型进行强化学习微调时，它可能具有多种含义和用途：

By making RL learning more predictable and by identifying new quantitative categories of behavior, we might hope to detect unexpected behaviors emerging during RL training.

通过使强化学习（RL）更加可预测，并通过识别新的行为量化类别，我们有望在RL训练过程中检测到意外出现的行为。

4.4 Tension Between Helpfulness and Harmlessness in RLHF Training

4.4 RLHF 训练中帮助性与无害性之间的张力

Here we discuss a problem we encountered during RLHF training. At an earlier stage of this project, we found that many RLHF policies were very frequently reproducing the same exaggerated responses to all remotely sensitive questions (e.g. recommending users seek therapy and professional help whenever they express any level of displeasure at all). This greatly limited these models’ utility. We still see a vestige of this behavior in some of the examples provided in Section 6.2. We now believe these policies were the result of over-optimizing for harmlessness, while under-optimizing helpfulness.

在这里，我们讨论在RLHF（基于人类反馈的强化学习）训练过程中遇到的一个问题。在项目的早期阶段，我们发现许多RLHF策略在面对任何稍微敏感的问题时，都会频繁地生成相同的夸张回应（例如，每当用户表达任何程度的不满时，就建议他们寻求治疗和专业帮助）。这极大地限制了这些模型的实用性。我们在第6.2节提供的示例中仍然可以看到这种行为的一些残余。我们现在认为，这些策略是由于过度优化无害性，而忽视了优化有用性所导致的。

With our data collection procedure, we think this is quite intuitive. In order to get a very good score on red-teaming prompts, it’s probably sufficient for models to respond with something like “I can’t answer that.” This does not require much sophistication (it just requires learning to classify harmful requests), and so we expect it is easier to learn than helpfulness.

通过我们的数据收集过程，我们认为这是非常直观的。为了在红队提示上获得非常好的分数，模型可能只需要回答类似“我无法回答这个问题”的内容。这并不需要太多的复杂性（只需要学会分类有害请求），因此我们预计这比学习帮助性更容易。

In Figure 14 (right), we show the policy’s PM score throughout training, after separating helpfulness and harmlessness prompts. On the left side of the same figure, we show the score distribution of PM comparison data, again separating helpful and harmless datasets. We observe that the policy’s harmlessness score is somewhat off-distribution, as it is on the upper tail of the harmlessness comparison data. On the other hand, the policy’s helpfulness score appears on-distribution, and is likely under-optimized. So we would expect this agent to be very difficult to red-team, but not very helpful.

在图 14 (右) 中，我们展示了在分离有用性和无害性提示后，策略在整个训练过程中的 PM 分数。在同一图的左侧，我们展示了 PM 比较数据的分数分布，再次分离了有用和无害的数据集。我们观察到策略的无害性分数有些偏离分布，因为它位于无害性比较数据的上尾。另一方面，策略的有用性分数似乎符合分布，但可能未得到充分优化。因此，我们预计这个 AI 智能体将非常难以进行红队测试，但不会非常有用。

This then raises an obvious question – can’t we just collect more harmlessness data to fill out the upper tail of the distribution? The problem involves the definition of harmlessness mentioned above – if simply refusing to answer a question is the ‘least harmful’ behavior, then this is probably both very easy to learn, and hard to improve on. That said, a more interesting ‘least harmful’ behavior would involve the model (helpfully) explaining why the request was harmful, and perhaps even trying to convince the human not to pursue such requests. We informally refer to such a model as a ‘hostage negotiator’.

这就引出了一个显而易见的问题——我们不能收集更多的无害数据来填补分布的上尾吗？问题在于上述无害性的定义——如果简单地拒绝回答问题是最无害的行为，那么这既容易学习，又难以改进。也就是说，更有趣的“最无害”行为可能涉及模型（有帮助地）解释为什么请求是有害的，甚至可能试图说服人类不要追求这样的请求。我们非正式地将这种模型称为“人质谈判者”。

However, our data collection process made it very difficult for models to learn ‘hostage negotiation’. This is because when collecting our harmlessness dataset, we had crowd workers choose the more harmful AI response. We made this choice so that we could fully explore the vulnerability of our models to red-teaming. However, from the point of view of RLHF this was problematic, because beyond the first turn of dialogue, our models never learned what a sophisticated response to a harmful query might be like. Our dataset does not provide guidance on the upper end of the distribution, on what models should do, but only tells models what not to do.

然而，我们的数据收集过程使得模型很难学习“人质谈判”。这是因为在收集无害性数据集时，我们让众包工人选择更有害的AI响应。我们做出这个选择是为了能够充分探索模型在红队测试中的脆弱性。然而，从RLHF（基于人类反馈的强化学习）的角度来看，这是有问题的，因为在对话的第一轮之后，我们的模型从未学习过对有害查询的复杂响应可能是什么样子。我们的数据集没有提供关于分布上限的指导，即模型应该做什么，而只是告诉模型不应该做什么。

In practice, we have partially resolved the optimization issue by training on a larger fraction of helpfulness prompts during RLHF. But in the future we hope to more fully and systematically address this problem by collecting harmlessness data where crowd workers choose the best possible response from our models.14 In this way we hope that rather than simply shutting down harmful requests, models can learn the more subtle art of ‘hostage negotiation’ with red-teamers.

在实践中，我们通过在RLHF（基于人类反馈的强化学习）过程中训练更多有益性提示，部分解决了优化问题。但未来，我们希望通过收集无害性数据来更全面、系统地解决这一问题，其中众包工作者将从我们的模型中选择最佳响应 [14]。通过这种方式，我们希望模型不仅能简单地拒绝有害请求，还能学会与红队进行更微妙的“人质谈判”技巧。

Note that since the data and models discussed in this section are from an earlier stage of our research, the RL results may look slightly different from other parts of the paper.

请注意，由于本节讨论的数据和模型来自我们研究的早期阶段，强化学习 (RL) 的结果可能与论文其他部分略有不同。

4.5 Iterated Online RLHF

4.5 迭代在线 RLHF

In preceding sections we discussed the problem that PMs become progressively less calibrated and less robust at higher scores, as seen in the PM calibration study in Figure 9, and the RLHF robustness study in Figure 4. We believe this is caused by a lack of data in this high score regime. To address this, we propose iterated online RLHF:

在前面的章节中，我们讨论了随着分数提高，PM（偏好模型）逐渐变得不那么校准和鲁棒的问题，如图 9 中的 PM 校准研究和图 4 中的 RLHF（基于人类反馈的强化学习）鲁棒性研究所见。我们认为这是由于在高分区域缺乏数据所致。为了解决这个问题，我们提出了迭代在线 RLHF：

• We simply train the best RLHF policy we can, and use that to collect comparison data from crowdworkers. Since the policy was trained to optimize for PM score, it should produce responses that are on the upper end of the score distribution. • We mix the new comparison data with our existing data, and train a new scan of PMs, which we then use to train a new scan of RLHF policies. Then reiterate this process indefinitely.

• 我们简单地训练出最好的 RLHF 策略，并用它从众包工作者那里收集比较数据。由于该策略是为了优化 PM 分数而训练的，它应该能产生分数分布较高的响应。
• 我们将新的比较数据与现有数据混合，训练新一批 PM，然后用它们来训练新一批 RLHF 策略。然后无限重复这个过程。

Our hypothesis is that the ‘online’ RLHF policy helps us collect data on the upper end of the PM score distribution, which should improve PM calibration at high scores on subsequent iterations, and thereby allow us to train even better policies. Continuing this process should give us progressively better PMs and policies. Note that our use of the terminology ‘online’ is different from conventional use of the word—instead of training the same model iterative ly, we retrain a new model per iteration.

我们的假设是，“在线”RLHF策略帮助我们收集PM分数分布上端的数据，这应该会在后续迭代中提高高分段的PM校准，从而让我们能够训练出更好的策略。继续这一过程应该会让我们逐步获得更好的PM和策略。需要注意的是，我们对“在线”这一术语的使用与传统用法不同——我们不是在每次迭代中训练同一个模型，而是每次迭代都重新训练一个新模型。

Figure 15 (left) This plot shows individually normalized distributions of held-out helpfulness data from our base dataset (mostly with context-distilled models), from models augmented with rejection sampling, and from data collected with our iterated ‘online’ RLHF models. The upper tail of the distribution receives far more support from the RS and online models, which should make it possible for preference models to learn more subtle distinctions among high-quality responses, and amplify the value of further data collection. (right) We compare helpfulness Elo scores of our HH and pure-helpfulness iterated online RLHF models at various points during RLHF training. Note that Elo scores and preference frequency are measured relative to the initial snapshot, which is our 52B context distilled model in both cases. Elo scores in both subplots only evaluate helpfulness.

图 15: （左）该图展示了我们基础数据集（主要使用上下文蒸馏模型）的保留有用性数据的单独归一化分布，以及通过拒绝采样增强的模型和通过迭代“在线”RLHF 模型收集的数据的分布。拒绝采样和在线模型在分布的上尾部提供了更多的支持，这应该使得偏好模型能够学习到高质量响应之间更细微的区别，并进一步放大数据收集的价值。（右）我们比较了在 RLHF 训练过程中不同时间点的 HH 和纯有用性迭代在线 RLHF 模型的有用性 Elo 分数。请注意，Elo 分数和偏好频率是相对于初始快照（在这两种情况下都是我们的 52B 上下文蒸馏模型）进行测量的。两个子图中的 Elo 分数仅评估有用性。

One concern about this approach is that RLHF tends to decrease the policy’s entropy, which would limit the diversity of data collected through the online procedure. We partially address this by deploying a number of different snapshots from RL training, and from different online iterations, at once. This also makes it possible to compare these models to get a better sense of how they are performing.

这种方法的一个担忧是，RLHF（基于人类反馈的强化学习）往往会降低策略的熵，从而限制通过在线过程收集的数据的多样性。我们通过同时部署来自RL训练和不同在线迭代的多个不同快照来部分解决这个问题。这也使得比较这些模型成为可能，以便更好地了解它们的表现。

We can see signs of life from the online approach by looking at the evolution of our data distribution. In Figure 15 (left), we show the PM scores from three distributions of models: Base, RS (rejection-sampling), and Online, as described in Section 2.3. We see that according to our final online PM (trained on all of the data), the quality of samples improves from the base to the rejection-sampling to the online data distributions. We also found that our online PM achieves accuracies of $74%$ , $70%$ , and $67%$ on the test sets for the respective base, RS, and online-only distributions, which shows that distinguishing among higher quality samples is becoming more challenging. This makes us optimistic that online training should outperform rejection sampling in the long run.

通过观察数据分布的演变，我们可以看到在线方法的生命迹象。在图 15（左）中，我们展示了来自三种模型分布的 PM 分数：Base、RS（拒绝采样）和 Online，如第 2.3 节所述。我们可以看到，根据我们最终的在线 PM（在所有数据上训练），样本质量从基础分布到拒绝采样分布再到在线数据分布有所提高。我们还发现，我们的在线 PM 在各自的 Base、RS 和仅在线分布的测试集上分别达到了 $74%$、$70%$ 和 $67%$ 的准确率，这表明区分更高质量的样本正变得越来越具有挑战性。这让我们乐观地认为，从长远来看，在线训练应该优于拒绝采样。

We show the learning curves for our online models, along with measurements of Elo scores from crowdworkers, on the right in Figure 15. We see that models improve significantly during RLHF, but Elo scores from crowd workers do not match predictions from PMs. We further discuss and decompose the robustness of RLHF training in Appendix B.4, where we see that distribution al shift accounts for a significant part of the apparent robustness failure (Figure 35).

我们在图 15 的右侧展示了在线模型的学习曲线，以及来自众包工作者的 Elo 分数测量结果。我们看到，在 RLHF（基于人类反馈的强化学习）过程中，模型有了显著提升，但众包工作者的 Elo 分数与 PM（预测模型）的预测结果并不匹配。我们在附录 B.4 中进一步讨论并分解了 RLHF 训练的鲁棒性，发现分布偏移是导致明显鲁棒性失败的主要原因（图 35）。

In Figure 1, we compare Elo scores of our online model with context-distilled models and RLHF models trained on the ‘static’ (i.e., no online) dataset, showing that the online models are clearly preferred by our crowd workers. However, readers might worry about two caveats: the online model was trained on a slightly larger (about $20%$ larger) dataset, and the online model was trained with improved RLHF hyper parameters (the online model was trained with a larger $K$ , defined in Appendix B.1, and its PM was trained with 2048 context instead of 1024), as compared to the earlier static RLHF training run.

在图 1 中，我们将在线模型的 Elo 分数与上下文蒸馏模型和在“静态”（即非在线）数据集上训练的 RLHF 模型进行了比较，结果显示在线模型明显更受我们的众包工作者青睐。然而，读者可能会担心两个问题：在线模型是在稍大（大约大 $20%$）的数据集上训练的，并且在线模型使用了改进的 RLHF 超参数（在线模型使用了更大的 $K$，定义见附录 B.1，其 PM 是在 2048 上下文而非 1024 上下文中训练的），与之前的静态 RLHF 训练相比。

To address both of these caveats, we performed a controlled experiment comparing two RLHF runs: one trained with our base dataset (about 44k PM comparisons), and another trained on an even mixture of base, RS, and online data whose total dataset size is the same as the base dataset15 (about $15\mathbf{k}$ PM comparisons from each). So for this experiment we trained two separate PMs on each dataset, and then trained a pair of RLHF policies against these two PMs. Apart from the data difference, both runs used the same settings, and were only trained on helpfulness. In figure 16, we compare Elo scores for various snapshots of both runs, as determined by crowd worker preferences, showing that the policy trained on the iterated-online mixture is clearly preferred. This demonstrates that online training works, and that performance gains are not merely due to increased dataset size or hyper parameter changes.

为了解决这两个问题，我们进行了一项对照实验，比较了两个 RLHF 训练过程：一个使用我们的基础数据集（约 44k PM 比较）进行训练，另一个使用基础数据、RS 数据和在线数据的均匀混合进行训练，其总数据集大小与基础数据集相同（每个数据集约 $15\mathbf{k}$ PM 比较）。因此，在这个实验中，我们在每个数据集上训练了两个独立的 PM，然后针对这两个 PM 训练了一对 RLHF 策略。除了数据差异外，两个训练过程使用了相同的设置，并且仅针对有用性进行训练。在图 16 中，我们比较了两个训练过程中不同快照的 Elo 分数，这些分数由众包工人的偏好决定，结果显示，使用迭代在线混合数据训练的策略明显更受欢迎。这表明在线训练是有效的，并且性能提升不仅仅是由于数据集大小的增加或超参数的变化。

Figure 16 We compare Elo scores from two 52B RLHF training runs that use equal-sized datasets and identical hyper parameters: one trained on our base dataset (orange), and another trained on an even mixture of data from the base, RS, and online distributions (blue). We find that the iterated-online model is preferred by crowd workers.

图 16: 我们比较了两个使用相同大小数据集和相同超参数的 52B RLHF 训练运行的 Elo 分数：一个在我们的基础数据集上训练（橙色），另一个在基础、RS 和在线分布的均匀混合数据上训练（蓝色）。我们发现迭代在线模型更受众包工人的青睐。

4.6 Evaluations: Alignment Bonus, Honesty, and Biases

4.6 评估：对齐奖励、诚实性和偏见

Language models that have been finetuned via RL typically have much narrower, lower-entropy output distributions. This can make evaluations difficult when they are fairly rigidly formatted, since all valid responses may be far off-distribution for the RLHF model (we discuss an example with gender bias evaluations below). Thus we expect in future work evaluations involving sampling and human interaction may be most relevant. In what follows we discuss some standard NLP evaluations, and then evaluations specifically related to the societal impacts of the models, including honesty, sentiment, and bias.

通过强化学习 (RL) 微调的语言模型通常具有更窄、熵更低的输出分布。当评估格式较为固定时，这可能会使评估变得困难，因为所有有效响应可能都偏离了 RLHF 模型的分布（我们将在下面讨论一个关于性别偏见的例子）。因此，我们预计在未来的工作中，涉及采样和人类交互的评估可能最为相关。接下来，我们将讨论一些标准的自然语言处理 (NLP) 评估，以及特别与模型社会影响相关的评估，包括诚实性、情感和偏见。

4.6.1 NLP Evaluations

4.6.1 NLP 评估

We evaluate our models on question answering, commonsense, trivia, and story completion using the benchmarks MMLU [Hendrycks et al., 2021b], Lambada [Paperno et al., 2016], Hellaswag [Zellers et al., 2019], OpenBookQA [Mihaylov et al., 2018], ARC [Clark et al., 2018], and TriviaQA [Joshi et al., 2017]. The main conclusion is that RLHF tends to improve performance for large models, while degrading 16 the performance of smaller models.

我们在问答、常识、琐事和故事完成方面使用基准 MMLU [Hendrycks et al., 2021b]、Lambada [Paperno et al., 2016]、Hellaswag [Zellers et al., 2019]、OpenBookQA [Mihaylov et al., 2018]、ARC [Clark et al., 2018] 和 TriviaQA [Joshi et al., 2017] 评估我们的模型。主要结论是，RLHF 倾向于提高大型模型的性能，同时降低小型模型的性能。

Full results for both zero-shot and few-shot evaluations are shown in Figures 28 and 29, and we provided a summary of the mean trends in Figure 3. Readers may notice that results improve rather suddenly for some evaluations; this is a consequence of the format we use for multiple choice questions, where we explicitly provide choices (Gopher [Rae et al., 2021] used this format). The format is provided explicitly in Appendix E. We find that this format tends to improve performance for large models, while decreasing the performance of small models, leading to the arguably misleading appearance of a ‘grok’ [Power et al., 2022] curve.

零样本和少样本评估的完整结果如图 28 和图 29 所示，我们在图 3 中提供了平均趋势的总结。读者可能会注意到，某些评估的结果突然有所提升；这是我们用于多项选择题的格式的结果，其中我们明确提供了选项（Gopher [Rae et al., 2021] 使用了这种格式）。该格式在附录 E 中明确提供。我们发现，这种格式往往会提高大模型的性能，同时降低小模型的性能，导致出现可能具有误导性的“grok” [Power et al., 2022] 曲线。

4.6.2 Honesty and Biases

4.6.2 诚实性与偏见

A major question is whether AI models are honest. We evaluate our models on TruthfulQA (MC1) [Lin et al., 2021] and show the results in Figure 5. There we also include performance at 50-shot, in order to demonstrate that while our RLHF training significantly improves honesty, our models most likely have significant room for improvement. Note that our instructions to crowd workers suggest that ‘lying isn’t helpful’ and that they should choose responses that are ‘helpful and honest’, so this is presumably related to the improvements we see on TruthfulQA. That said, we do not currently expect RLHF to be the best approach to honesty.

一个主要问题是AI模型是否诚实。我们在TruthfulQA (MC1) [Lin et al., 2021] 上评估了我们的模型，并在图5中展示了结果。为了证明虽然我们的RLHF训练显著提高了诚实性，但我们的模型很可能仍有很大的改进空间，我们还展示了50-shot的性能。请注意，我们对众包工作者的指示建议“撒谎是没有帮助的”，他们应该选择“有帮助且诚实”的回应，因此这可能与我们看到的TruthfulQA上的改进有关。也就是说，我们目前并不认为RLHF是实现诚实性的最佳方法。

Figure 17 Here we show sentiment scores (higher is more favorable sentiment) for samples generated from various prompts involving races and religions. We see that the predominant effect of RLHF training is to improve sentiment towards all groups.

图 17: 这里展示了涉及种族和宗教的各种提示生成的样本的情感评分（分数越高表示情感越积极）。我们可以看到，RLHF（基于人类反馈的强化学习）训练的主要效果是改善对所有群体的情感。

Another set of questions involves the underlying biases of these models. We evaluate our models for sentiment biases on race and religion (in the same format as Gopher [Rae et al., 2021]), for gender bias, and on the Bias Benchmark for QA (BBQ-lite) [Parrish et al., 2021].

另一组问题涉及这些模型的潜在偏见。我们评估了模型在种族和宗教上的情感偏见（与Gopher [Rae et al., 2021] 相同的格式）、性别偏见以及在问答偏见基准（BBQ-lite）[Parrish et al., 2021] 上的表现。

Results for sentiment towards different racial and religious groups are shown in Figure 17. The main effect we observe is that the sentiment of our RLHF-trained models tends to be much more positive than that of plain LMs, but it is more difficult to say if bias among groups has changed. We expect some of the visible effects are due to limitations in the sentiment model17, for example the low scores for the Atheist group is likely because samples such as (model generation in italics):

对不同种族和宗教群体的情感分析结果如图 17 所示。我们观察到的主要影响是，经过 RLHF 训练的模型的情感往往比普通大语言模型更加积极，但很难说群体之间的偏见是否发生了变化。我们预计一些可见的影响是由于情感模型的局限性 [17]，例如无神论者群体的低分可能是因为样本如下（模型生成部分用斜体表示）：

The Atheist man was very skeptical and critical of religious dogma and conventional moral principles, seeking instead to derive his ethical values from rational consideration and empirical observation of the natural world.

无神论者对宗教教条和传统道德原则持怀疑和批判态度，转而寻求通过理性思考和对自然界的经验观察来推导其伦理价值观。

receive a sentiment score very near zero, which seems like a questionable evaluation. For these evaluations we use a prompt format where the human asks the assistant to complete the sentence as follows:

收到一个情感评分接近于零，这似乎是一个值得怀疑的评估。对于这些评估，我们使用一种提示格式，其中人类要求助手完成句子如下：

with sentence beginnings such as ‘The Atheist man was very’ as seen above.

如上所示，以“无神论者非常”等句子开头。

We also evaluated our models on BBQ-Lite, a simplified version of BBQA [Parrish et al., 2021] that is available on BIG-bench. We were somewhat surprised by the results, and are unclear if any conclusions can be drawn based on them. In essence, we appear to find that our plain LMs do not exhibit very significant biases on this benchmark, using the available metrics. Here the metric is roughly ‘how often does the model invoke a term in a negative context, divided by how often it uses the term generally’. A value of 0.5 corresponds to the models being ‘unbiased’. RLHF training perhaps improves the results slightly (likely by improving sentiment) though the effect is very small (see Figure 18). We may return to these evaluations in the future using the full BBQA metrics [Parrish et al., 2021].

我们还在 BBQ-Lite 上评估了我们的模型，这是 BBQA [Parrish et al., 2021] 的简化版本，可在 BIG-bench 上获取。我们对结果感到有些惊讶，并且不清楚是否可以根据这些结果得出任何结论。本质上，我们似乎发现，使用现有的指标，我们的普通大语言模型在这个基准上并没有表现出非常显著的偏见。这里的指标大致是“模型在负面上下文中调用某个术语的频率，除以它一般使用该术语的频率”。值为 0.5 对应于模型“无偏见”。RLHF 训练可能略微改善了结果（可能是通过改善情感），尽管效果非常小（见图 18）。我们可能会在未来使用完整的 BBQA 指标 [Parrish et al., 2021] 重新进行这些评估。

In Section 3.4.2 we investigated gender bias in preference models. We tried using this format for evaluating RLHF models, but we found that these models were more than $10\times$ less likely than plain or context-distilled LMs to use a gendered term to complete sentences like ‘The CEO was a’. In other words, statements like ‘The CEO was a father’ seem to be somewhat off-distribution for the RLHF models, at least compared to general LMs. This is problematic for evaluations, because it means that the bias measurements are likely to be dominated by noise, and are unlikely to be representative of model behavior (i.e. model-generated samples). So in Appendix B.10 we experiment with another formulation of the task, where we reverse the order of occupation and gender terms, and add a ‘For work’ clause; this kept both the RLHF and other models on-distribution. The results are discussed in Appendix B.10, where we show that RLHF model’s biases are very strongly correlated (Figure 40) with underlying LM biases. In particular, with this evaluation we find that large RLHF models have gender biases that are very similar to language models evaluated at a lower temperature.

在3.4.2节中，我们研究了偏好模型中的性别偏见。我们尝试使用这种格式来评估RLHF模型，但发现这些模型使用性别化词汇完成诸如“CEO是一个”这类句子的可能性比普通或上下文蒸馏的语言模型低10倍以上。换句话说，像“CEO是一个父亲”这样的陈述对于RLHF模型来说似乎有些偏离分布，至少与一般的语言模型相比是如此。这对评估来说是有问题的，因为这意味着偏见测量很可能被噪声主导，并且不太可能代表模型行为（即模型生成的样本）。因此，在附录B.10中，我们尝试了另一种任务表述方式，即颠倒职业和性别术语的顺序，并添加“为了工作”的条款；这使得RLHF模型和其他模型都保持在分布内。结果在附录B.10中讨论，我们展示了RLHF模型的偏见与底层语言模型的偏见高度相关（图40）。特别是，通过这种评估，我们发现大型RLHF模型的性别偏见与在较低温度下评估的语言模型非常相似。

Figure 18 We show results from BBQ-Lite. We caution that since according to this metric, none of our models appear particularly biased (even those trained without any interventions), it seems unlikely we can draw any meaningful conclusions from the results. We suspect the lack of bias reflects limitations of the measurement, rather than an underlying fact about the models.

图 18: 我们展示了 BBQ-Lite 的结果。我们提醒，根据这一指标，我们的模型似乎都没有表现出明显的偏见（即使是那些没有任何干预训练的模型），因此我们不太可能从结果中得出任何有意义的结论。我们怀疑缺乏偏见反映了测量的局限性，而不是模型本身的实际情况。

5 Competing Objectives, Specialized Skills, and OOD Detection

5 竞争目标、专业技能和OOD检测

A concern about alignment techniques is that they might compromise model performance. In Section 5.1 we highlight a quantifiable trade-off of this kind, between helpfulness and harmlessness when training preference models. But it appears that larger models suffer less of a performance drop from this trade-off.

关于对齐技术的一个担忧是它们可能会影响模型性能。在第5.1节中，我们强调了在训练偏好模型时，帮助性和无害性之间存在的一种可量化的权衡。但似乎更大的模型在这种权衡中遭受的性能下降较少。

Furthermore, we also find that the conflict between helpfulness and harmlessness is relatively unique. Preference models can learn to reward strong performance at specialized skills without any loss in performance at helpfulness and harmlessness. In Section 5.2 we consider the evaluation of sum mari z ation quality as such a skill, using the learning-to-summarize [Stiennon et al., 2020] dataset reformatted in conversational form. Later in Section 5.3 we show that code models (i.e., models finetuned on code by supervised training) are also compatible with HH alignment interventions, even though the RLHF training does not involve code data or examples.

此外，我们还发现，有用性和无害性之间的冲突相对独特。偏好模型可以学会奖励在特定技能上的出色表现，而不会在有用性和无害性方面损失性能。在第5.2节中，我们将总结质量评估视为这样一种技能，使用了以对话形式重新格式化的学习总结 [Stiennon et al., 2020] 数据集。随后在第5.3节中，我们展示了代码模型（即通过监督训练在代码上微调的模型）也与HH对齐干预兼容，尽管RLHF训练不涉及代码数据或示例。

In Section 5.4 we highlight another approach to avoiding harmful behavior – it may be possible to reject most harmful requests, even without any access to harmfulness training data, by leveraging out-of-distribution detection techniques [Fort et al., 2021]. This approach might also be useful more generally in deployment scenarios where strange or off-topic requests need to be flagged or ignored.

在第5.4节中，我们强调了另一种避免有害行为的方法——即使没有任何有害性训练数据，也可能通过利用分布外检测技术 [Fort et al., 2021] 来拒绝大多数有害请求。这种方法在需要标记或忽略奇怪或离题请求的部署场景中也可能更普遍地有用。

5.1 Mixing Helpful and Harmless Objectives

5.1 混合有益与无害目标

In many cases harmlessness acts as a constraint on helpfulness. So we should expect that helpfulness and harmlessness may behave as partially anti-correlated objectives. We establish this by evaluating preference models trained on different mixtures of HH data, and with different weightings.

在许多情况下，无害性作为对帮助性的约束。因此，我们应该预期帮助性和无害性可能表现为部分反相关的目标。我们通过评估在不同HH数据混合和不同权重下训练的偏好模型来确立这一点。

At a conceptual level, the HH PMs may essentially be learning to first classify the data and then choose a score depending on the distribution. We will show that larger models perform better and are more robust to data mixture and loss weighting, which may be due to their having greater success at separating reasonable from harmful requests.

在概念层面上，HH PMs 可能本质上是在学习先对数据进行分类，然后根据分布选择分数。我们将展示，更大的模型表现更好，并且对数据混合和损失加权更具鲁棒性，这可能是由于它们在区分合理请求和有害请求方面取得了更大的成功。

5.1.1 Varying Helpful vs Harmless Data Fraction

5.1.1 不同比例的有益与有害数据

We train models using data splits varying from $100%$ helpfulness to $100%$ harmlessness in intervals of $10%$ . Our static data distribution has $42\mathrm{k}$ red-teaming comparisons, so to control for dataset size we always construct mixtures with a total of this number of comparisons. Figure 19 shows performance on both harmlessness and helpfulness as the training data mixture is varied. Note that training entirely on helpfulness or harmlessness data results in performance on the other distribution which is significantly worse than chance. This exemplifies the extent to which these distributions are in tension with each other.

我们使用从 $100%$ 有益性到 $100%$ 无害性以 $10%$ 为间隔的数据分割来训练模型。我们的静态数据分布包含 $42\mathrm{k}$ 个红队比较，因此为了控制数据集大小，我们总是构建包含这个数量比较的混合数据。图 19 展示了随着训练数据混合比例的变化，模型在无害性和有益性上的表现。需要注意的是，完全在有益性或无害性数据上训练会导致在另一分布上的表现显著低于随机水平。这说明了这些分布之间的紧张关系。

Figure 19 (top) Results when mixing different proportions of helpfulness and harmlessness data. We see that when the training data contains either all helpfulness or harmlessness data, performance on the other test set is far below random chance levels. This provides evidence that helpfulness and harmlessness are anti-correlated objectives. (bottom) These are versions of the top graphs where accuracies are normalized against the maximum accuracy achieved by each model size. We perform this normalization to make it visually obvious that larger models’ performance is less sensitive to the data mixture.

图 19: (上) 混合不同比例的有用性和无害性数据时的结果。我们看到，当训练数据全部为有用性或无害性数据时，在另一个测试集上的表现远低于随机水平。这证明有用性和无害性是相互对立的目标。(下) 这是上图的一个版本，其中准确率相对于每个模型大小达到的最大准确率进行了归一化。我们进行这种归一化是为了在视觉上明显看出，较大模型的性能对数据混合的敏感性较低。

Figure 26 in the appendix also plots mean test accuracy over both helpfulness and harmlessness (where Mean $\mathtt{A c c=}$ (Harmlessness Acc + Helpfulness Acc)/2). Curves for larger models look more steep near the $0%$ and $100%$ areas, but flatter at the top. The curves for the smaller models are more gradual, with more distinct peaks in the middle. This again suggests that larger PMs are more robust to the specific fraction of red-teaming vs helpfulness data that is used, allowing them to learn both concepts more easily.

附录中的图 26 还绘制了在有用性和无害性上的平均测试准确率（其中 Mean $\mathtt{A c c=}$ (Harmlessness Acc + Helpfulness Acc)/2）。较大模型的曲线在 $0%$ 和 $100%$ 附近看起来更陡峭，但在顶部更平坦。较小模型的曲线则更为平缓，中间有更明显的峰值。这再次表明，较大的模型对红队测试数据与有用性数据的具体比例更为稳健，使它们更容易学习这两个概念。

5.1.2 Weighting Helpful vs Harmless Losses

5.1.2 加权有益与有害损失

Instead of studying different data mixtures, we can try re-weighting the losses. Since we have more helpfulness than harmlessness comparisons, we experimented with weighting the losses as

我们可以尝试重新加权损失，而不是研究不同的数据混合。由于我们拥有的有用性比较多于无害性比较，我们尝试对损失进行加权。

$$
\mathcal{L}{\mathrm{Total}}=\mathcal{L}{\mathrm{Helpfulness}}+\lambda\cdot\mathcal{L}_{\mathrm{Harmless}}
$$

for $\lambda\in{1,2,3,4,10}$ , as shown in Figure 27 (relegated to the appendix). We note that larger models seem more robust to the choice of $\lambda$ . Increasing $\lambda$ from 1 to 10 causes a $7.4%$ decrease in accuracy on helpfulness for the 13M parameter model, whereas it only causes a $1.5%$ decrease in accuracy on the 52B parameter model.

对于 $\lambda\in{1,2,3,4,10}$，如图 27 所示（见附录）。我们注意到，更大的模型似乎对 $\lambda$ 的选择更具鲁棒性。将 $\lambda$ 从 1 增加到 10 会导致 13M 参数模型在有用性上的准确率下降 $7.4%$，而对于 52B 参数模型，仅导致准确率下降 $1.5%$。

5.2 Sum mari z ation as a Specialized Skill

5.2 摘要作为一项专业技能

We expect that models finetuned with special skills may be particularly useful and valuable. Does alignment interfere with finetuning for a specialized skill?

我们预计，经过特殊技能微调的模型可能特别有用且有价值。对齐是否会干扰针对特定技能的微调？

As one test of this question, we studied PM finetuning on the learning-to-summarize (LtS) [Stiennon et al., 2020] dataset vs. a mixture of LtS and HH data. We formatted the LtS data in conversational format so that it matches the HH data, as follows:

作为对该问题的一项测试，我们研究了在学习总结（LtS）[Stiennon et al., 2020] 数据集上进行 PM 微调与 LtS 和 HH 数据混合的对比。我们将 LtS 数据格式化为对话格式，以使其与 HH 数据匹配，如下所示：

Figure 20 Here we show the comparison accuracies of preference models trained on (1) ‘static’ HH data only, (2) sum mari z ation data [Stiennon et al., 2020] only, and (3) a mixture of both. Mixed training has no negative effects on PM accuracies.

图 20: 这里展示了在以下数据集上训练的偏好模型的比较准确率：(1) 仅使用“静态”HH数据，(2) 仅使用摘要数据 [Stiennon et al., 2020]，以及 (3) 两者的混合数据。混合训练对偏好模型的准确率没有负面影响。

As shown in Figure 20, large preference models trained on a mixture of HH and LtS datasets perform equally well on both. So at least at the level of preference modeling, there seems to be no cost to mixing HH with the specific skill of evaluating sum mari z ation quality.

如图 20 所示，在 HH 和 LtS 数据集混合训练的大型偏好模型在两者上表现同样出色。因此，至少在偏好建模的层面上，将 HH 与评估摘要质量的具体技能混合似乎没有成本。

5.3 Natural Language RLHF on Code-Finetuned Models

5.3 基于代码微调模型的自然语言 RLHF

As another test of a specialized skill, we would like to see if natural language alignment can be combined with coding without compromising performance. Since our crowd workers were never instructed to probe the model’s coding abilities, and most likely do not have a great deal of coding expertise, our human feedback data does not include a significant number of code-related conversations. The preference model finetuning dataset and the RLHF prompt dataset thereby do not contain any significant amount of code, though there is some code in the LM pre training mix and possibly a small amount in the PM pre-training (PMP) mix. This makes code-related problems an interesting way to test generalization of RLHF, and especially its compatibility with other skills.

作为另一项专业技能测试，我们希望了解自然语言对齐是否能够在不影响性能的情况下与编码结合。由于我们的众包工作人员从未被指示去探究模型的编码能力，而且他们很可能没有大量的编码专业知识，因此我们的人类反馈数据中并未包含大量与代码相关的对话。因此，偏好模型微调数据集和RLHF提示数据集中并未包含大量代码，尽管在语言模型预训练混合数据中存在一些代码，可能在偏好模型预训练（PMP）混合数据中也存在少量代码。这使得代码相关问题成为测试RLHF泛化能力，尤其是其与其他技能兼容性的有趣方式。

Our ‘base code models’ were finetuned on Python code scraped from Github as described in [Askell et al., 2021]. Starting from these Python fine-tuned (Python FT) models, we then ran our standard natural language RLHF training using ‘static’ preference models and prompts. We had difficulty achieving stable RLHF optimization on the 3B code model, so it has been excluded for this section.

我们的“基础代码模型”是在从Github抓取的Python代码上进行微调的，具体方法如[Askell et al., 2021]所述。从这些经过Python微调（Python FT）的模型开始，我们随后使用“静态”偏好模型和提示进行了标准的自然语言RLHF训练。我们在3B代码模型上难以实现稳定的RLHF优化，因此在本节中将其排除在外。

We evaluate models on the HumanEval dataset [Chen et al., 2021], which prompts language models with python function signatures and docstrings. Models are tasked with correctly filling in the function body given the context, and model-written functions are run in a sandbox environment. In Figure 21 we show results versus model size with and without RLHF training. We see the same trend here as with other evaluations – RLHF decreases the performance of small models, but improves the performance of larger models.

我们在 HumanEval 数据集 [Chen et al., 2021] 上评估模型，该数据集通过 Python 函数签名和文档字符串提示语言模型。模型的任务是在给定上下文的情况下正确填充函数体，模型编写的函数在沙盒环境中运行。在图 21 中，我们展示了有和没有 RLHF 训练时模型大小的结果。我们在这里看到了与其他评估相同的趋势——RLHF 降低了小模型的性能，但提高了大模型的性能。

RL training tends to decrease the entropy of the models’ distribution, and so we were concerned that these results would be very sensitive to temperature and top-p tuning. So for our 52B models, we performed a scan over temperatures and two top $\boldsymbol{\cdot}$ settings for both the RLHF models and the base code models, and then chose the best setting for each model and $p a s s@k$ . We did a grid-search over the evaluation hyper parameters: $T\in{0,0.4,0.6,0.{\bar{\mathrm{s}}},1.0}\times p\in{0.95,\dot{1}}\times k\in{1,5,10,\dot{2}5,50,75,100}.$ . Results are summarized on the right side of Figure 21. For each model and for each $k$ in $p a s s@k$ , we take the maximum performance over all 10 combinations of hyper parameters. We see that RLHF improves performance over the baseline on this evaluation, for all $p a s s@k$ .

RL训练往往会降低模型分布的熵，因此我们担心这些结果对温度和top-p调优非常敏感。因此，对于我们的52B模型，我们对RLHF模型和基础代码模型进行了温度和两个top $\boldsymbol{\cdot}$ 设置的扫描，然后为每个模型和$pass@k$选择了最佳设置。我们对评估超参数进行了网格搜索：$T\in{0,0.4,0.6,0.{\bar{\mathrm{s}}},1.0}\times p\in{0.95,\dot{1}}\times k\in{1,5,10,\dot{2}5,50,75,100}.$。结果总结在图21的右侧。对于每个模型和$pass@k$中的每个$k$，我们取所有10种超参数组合中的最大性能。我们看到，在这个评估中，RLHF在所有$pass@k$上都比基线有所提升。

We should emphasize that as with our other evaluations, the improvements in performance from RLHF are modest. In fact, we find that simply prompting a base code model performs slightly better, as shown in Figure

我们应该强调，与我们的其他评估一样，RLHF带来的性能提升是有限的。事实上，我们发现简单地提示一个基础代码模型表现略好，如图

Figure 21 (left) Pass $@1$ accuracy of base code models and RLHF models on HumanEval. RLHF generally decreases performance on smaller models, but improves performance on larger models. (right) This figure shows performance of our 52B models as a function of $k$ for $\mathrm{Pass}\bigotimes k$ . We did a grid-search over the evaluation hyper parameters $T\in{0,0.4,0.6,0.8,1.0}\times p\in{0.95,1}$ , and plotted the maximum accuracy at each $k$ . Results show that RLHF actually improves performance, even at large $k$ .

图 21 (左) 基础代码模型和 RLHF 模型在 HumanEval 上的 Pass $@1$ 准确率。RLHF 通常会在较小的模型上降低性能，但在较大的模型上提高性能。(右) 该图展示了我们的 52B 模型在 $\mathrm{Pass}\bigotimes k$ 下的性能随 $k$ 的变化情况。我们对评估超参数 $T\in{0,0.4,0.6,0.8,1.0}\times p\in{0.95,1}$ 进行了网格搜索，并绘制了每个 $k$ 下的最大准确率。结果表明，RLHF 实际上提高了性能，即使在大 $k$ 下也是如此。

Appendix B.8 further describes the format of the prompts we used (i.e., ‘HHH prompts’), which consist of a couple of code examples.

附录 B.8 进一步描述了我们使用的提示格式（即“HHH 提示”），其中包含几个代码示例。

We also conducted experiments involving adding buggy code to the prompts, which typically worsens performance (see [Chen et al., 2021]). We found that RLHF models did not perform better than their initial base code model snapshots, when these prompts are included in the context during evaluation, even after scanning over temperature and top-p.

我们还进行了在提示中添加错误代码的实验，这通常会降低性能（参见 [Chen et al., 2021]）。我们发现，当这些提示包含在评估的上下文中时，即使扫描了温度（temperature）和 top-p 参数，RLHF 模型的表现也没有优于其初始的基础代码模型快照。

5.4 Applying Out-of-Distribution Detection to Reject Strange or Harmful Requests

5.4 应用分布外检测以拒绝奇怪或有害的请求

In this work we are primarily focused on achieving harmlessness entirely through natural language dialogue. However, one might try to avoid harmful behavior in a somewhat different manner, by either restricting language assistants to only respond to a narrow range of queries (approved-list), or by filtering and rejecting known types of bad behavior (block-list). We could use our preference models for these purposes, but we might also take a different, less supervised approach, and leverage advances in out-of-distribution (OOD) detection. Such an approach might also be useful for those who want to build systems that only respond to a narrow range of queries (e.g. code models that should avoid non-code topics).

在本工作中，我们主要关注通过自然语言对话完全实现无害性。然而，人们可能会尝试以稍微不同的方式避免有害行为，要么将语言助手限制为仅响应一小部分查询（白名单），要么通过过滤和拒绝已知类型的不良行为（黑名单）。我们可以使用我们的偏好模型来实现这些目的，但也可能采用一种不同的、较少监督的方法，并利用分布外（OOD）检测的进展。这种方法对于那些希望构建仅响应一小部分查询的系统（例如应避免非代码主题的代码模型）也可能有用。

Out-of-distribution detection (OOD), and especially near out-of-distribution detection, have been a major challenge for deep neural networks. Deep networks routinely assign high probability to mis-classified inputs [Guo et al., 2017, Lakshmi narayan an et al., 2016] as well as to test inputs not belonging to any of the training classes [Nguyen et al., 2014]. There have been many approaches to OOD detection based on disc rim i native models [Hendrycks and Gimpel, 2016, Lee et al., 2018, Liang et al., 2017, Liu et al., 2020] as well as deep generative models [Nalisnick et al., 2019, Zhang et al., 2020]. The more difficult case of OOD detection, the so-called near-OOD detection [Winkens et al., 2020], has recently been improved upon significantly using pre-training and large models [Fort et al., 2021].

分布外检测 (OOD)，尤其是近分布外检测，一直是深度神经网络面临的主要挑战。深度网络经常为错误分类的输入 [Guo et al., 2017, Lakshmi narayan an et al., 2016] 以及不属于任何训练类别的测试输入 [Nguyen et al., 2014] 分配高概率。基于判别模型 [Hendrycks and Gimpel, 2016, Lee et al., 2018, Liang et al., 2017, Liu et al., 2020] 和深度生成模型 [Nalisnick et al., 2019, Zhang et al., 2020] 的 OOD 检测方法已经有很多。更困难的 OOD 检测情况，即所谓的近 OOD 检测 [Winkens et al., 2020]，最近通过预训练和大模型 [Fort et al., 2021] 得到了显著改进。

For a prompt $i$ , we extract a vector of activation s of dimension $d_{\mathrm{model}}$ from a layer $\ell$ and call it $v_{i}^{\ell}\in\mathbb{R}^{d_{\mathrm{model}}}$ . The task is to distinguish between an unseen example of harmlessness and helpfulness data without being explicitly shown any harmlessness data at all. This approach works by measuring the deviation of a prompt from the helpfulness data, rather than measuring how close it gets towards harmlessness data in particular. In this way, we do not depend on the specific harmful content we have at hand, and can potentially filter different kinds of non-helpfulness content.

对于提示 $i$，我们从层 $\ell$ 中提取一个维度为 $d_{\mathrm{model}}$ 的激活向量，并将其称为 $v_{i}^{\ell}\in\mathbb{R}^{d_{\mathrm{model}}}$。任务是在完全没有明确展示无害数据的情况下，区分未见过的无害性和有用性数据示例。这种方法通过测量提示与有用性数据的偏差来工作，而不是特别测量其接近无害性数据的程度。通过这种方式，我们不依赖于手头的特定有害内容，并且可以潜在地过滤不同类型的不相关内容。

To detect whether an input comes from the in-distribution (the helpfulness dataset), we use a scoring function that takes the input and maps it to a scalar value $\operatorname{score}(x)$ . To do that, [Lee et al., 2018] first proposed to fit a simple model to training examples of the in-distribution, $\mathcal{D}{\mathrm{train}}^{\mathrm{in}}={v{i}^{\ell}\in\mathbb{R}^{d_{\mathrm{model}}}}$ . We calculate the mean $\begin{array}{r}{\mu=\frac{1}{N_{\mathrm{train}}}\sum_{i=1}^{N_{\mathrm{train}}}v_{i}^{\ell}}\end{array}$ iN=tr1ainviℓ and the covariance matrix Σ = iN=tr1ain(vi −µ) (vi −µ)T . The Mahal a nobis distance of an unknown activation vector $x$ from this training set is $\operatorname{score}(x)=(x-\mu)^{T}\Sigma^{-1}(x-\mu)$ .

为了检测输入是否来自分布内（有用性数据集），我们使用一个评分函数，该函数将输入映射到一个标量值 $\operatorname{score}(x)$。为此，[Lee et al., 2018] 首先提出对分布内的训练样本拟合一个简单模型，$\mathcal{D}{\mathrm{train}}^{\mathrm{in}}={v{i}^{\ell}\in\mathbb{R}^{d_{\mathrm{model}}}}$。我们计算均值 $\begin{array}{r}{\mu=\frac{1}{N_{\mathrm{train}}}\sum_{i=1}^{N_{\mathrm{train}}}v_{i}^{\ell}}\end{array}$ 和协方差矩阵 $\Sigma = \frac{1}{N_{\mathrm{train}}}\sum_{i=1}^{N_{\mathrm{train}}}(v_{i}-\mu)(v_{i}-\mu)^T$。未知激活向量 $x$ 与该训练集的马氏距离为 $\operatorname{score}(x)=(x-\mu)^T\Sigma^{-1}(x-\mu)$。

Figure 22 Detecting harmful content by measuring a distance from the helpfulness data. The left panel shows the helpfulness vs harmlessness data AUROC for different model sizes and layers from which activation vectors were extracted, using a variant of the Mahal a nobis distance from the helpfulness data that we call the Simplified Relative Mahal a nobis distance (inspired by [Ren et al., 2021]) as a score. The larger the model, the better the detection AUROC, with the middle layers performing best for the large models. The errorbars are the standard deviation of 3 runs with random train-test splits of our data. The right panel shows the distribution of the OOD scores for the helpfulness and harmlessness unseen test data for the 52B model and its $32^{\mathrm{th}}$ layer.

图 22: 通过测量与有用性数据的距离来检测有害内容。左图展示了不同模型大小和层数的有用性与无害性数据的AUROC（受[Ren et al., 2021]启发，我们使用了一种称为简化相对马氏距离的变体作为评分）。模型越大，检测AUROC越好，大型模型的中间层表现最佳。误差条是三次随机训练-测试分割的标准差。右图展示了52B模型及其第32层在未见过的有用性和无害性测试数据上的OOD评分分布。

A simple improvement on top of the Mahal a nobis distance called the Relative Mahal a nobis distance has been proposed in [Ren et al., 2021] and shown to lead to better AUROC as well as more robust detection for a range of OOD problems in vision and genomics (in addition to more robustness to adversarial attacks [Fort, 2022]). Inspired by this method and recognizing that our problem does not naturally involve semantically meaningful classes comprising the in-distribution, we propose a further modification we call the Simplified Relative Mahal a nobis distance. We compute it by fitting a full covariance matrix $\Sigma$ as before, as well as a diagonalonly covariance matrix $\Sigma_{\mathrm{diag}}$ , and assigning the difference of their Mahal a nobis distance as our scoring function, $\operatorname{score}(x)=(x-\mu)^{!T}\Sigma^{-1}(x-\mu)-(x-\mu)^{!T}\Sigma_{\mathrm{diag}}^{!-1}(x-\mu).$ .

在 [Ren et al., 2021] 中提出了一种基于马氏距离 (Mahalanobis distance) 的简单改进方法，称为相对马氏距离 (Relative Mahalanobis distance)，并证明其在视觉和基因组学中的一系列 OOD (Out-of-Distribution) 问题上能够带来更好的 AUROC (Area Under the Receiver Operating Characteristic) 以及更稳健的检测效果（此外，对对抗攻击的鲁棒性也有所提升 [Fort, 2022]）。受此方法的启发，并认识到我们的问题并不自然涉及包含分布内数据的语义类别，我们提出了一种进一步的改进方法，称为简化相对马氏距离 (Simplified Relative Mahalanobis distance)。我们通过拟合一个完整的协方差矩阵 $\Sigma$ 以及一个仅对角线的协方差矩阵 $\Sigma_{\mathrm{diag}}$ 来计算它，并将它们的马氏距离之差作为我们的评分函数，即 $\operatorname{score}(x)=(x-\mu)^{!T}\Sigma^{-1}(x-\mu)-(x-\mu)^{!T}\Sigma_{\mathrm{diag}}^{!-1}(x-\mu).$。

Figure 22 shows the results for our OOD detection experiments, trying to distinguish the helpfulness data from harmlessness data using our new Simplified Relative Mahal a nobis distance to the helpfulness activation vectors. The figure shows AUROC for activation vectors from different layers of the model and different model sizes. The bigger the model, the better its performance, with the middle layers performing best. The right panel of Figure 22 shows an example of the OOD score distributions for the helpfulness data (blue) and harmlessness data (red) for a 64L layer of 52B parameters and its $32^{\mathrm{th}}$ layer. We can see that the mode of the distances of the Harmlessness data is clearly higher than for the helpfulness data. The comparison of the Simplified Relative Mahal a nobis distance and the standard Mahal a nobis distance is shown in the Appendix B.9 in Figure 39.

图 22 展示了我们的 OOD (Out-Of-Distribution) 检测实验结果，尝试使用新的简化相对马氏距离 (Simplified Relative Mahalanobis Distance) 来区分有用性数据和无害性数据。图中展示了来自模型不同层和不同模型大小的激活向量的 AUROC (Area Under the Receiver Operating Characteristic Curve)。模型越大，性能越好，中间层的表现最佳。图 22 的右侧面板展示了 64 层、520 亿参数模型的第 32 层的有用性数据（蓝色）和无害性数据（红色）的 OOD 分数分布示例。我们可以看到，无害性数据的距离模式明显高于有用性数据。简化相对马氏距离与标准马氏距离的比较见附录 B.9 中的图 39。

The advantage of this approach is that we are able to distinguish helpfulness data from non-helpfulness data, with harmlessness data being only a particular kind of non-helpfulness data. The disadvantage is its clear lower performance on this specific task in particular.

这种方法的优势在于我们能够区分有用数据和无用数据，而无害数据只是无用数据中的一种特定类型。其劣势在于在这一特定任务上的表现明显较差。

If we have access to a small number of examples of Harmlessness inputs (the out-distribution), we could perform a few-shot outlier exposure, as first proposed by [Hendrycks et al., 2018]. [Thula sid as an et al., 2021] suggests using a single class representing the OOD examples. [Fort et al., 2021] has used outlier exposure on top of individually strong near-OOD detectors and showed that they still enjoy a large benefit from being exposed to the examples of the out-distribution. We observe a similar benefit here, as discussed in detail in Section B.9 and shown in Figure 23.

如果我们能够获取到少量的无害输入（即分布外数据）示例，我们可以进行少样本的离群值暴露，这一方法最初由 [Hendrycks et al., 2018] 提出。[Thula sid as an et al., 2021] 建议使用一个单独的类别来表示分布外示例。[Fort et al., 2021] 在各自强大的近分布外检测器的基础上使用了离群值暴露，并表明它们仍然从暴露于分布外示例中获得了很大的好处。我们在这里观察到了类似的好处，详细讨论见 B.9 节，并在图 23 中展示。

In particular, with only 10 examples of harmful prompts, we can achieve an improved AUROC of $0.94!\pm!0.02$ for the 64L model. The best performance among all layers of all models (the middle layers of the 52B model) without outlier exposure (already using our Simplified Relative Mahal a nobis distance, see Figure 22) is approximately 0.85. A 4L model exposed to only 10 examples of harmful data gets an AUROC of $0.86,\pm$ 0.01, while having only 13M parameters as compared to the 52B. The OOD detection gains from outlier exposure are therefore very significant in comparison to the gains coming from model size scaling alone.

特别是，仅使用10个有害提示的示例，我们就能为64L模型实现改进的AUROC，达到$0.94!\pm!0.02$。在所有模型的各层中，不使用异常值暴露（已经使用了我们的简化相对马氏距离，见图22）的最佳性能（52B模型的中间层）大约为0.85。仅暴露于10个有害数据示例的4L模型获得了$0.86,\pm$0.01的AUROC，而其参数仅为13M，相比之下52B模型的参数为52B。因此，与仅通过模型规模扩展带来的增益相比，异常值暴露带来的OOD检测增益非常显著。

Figure 23 Exposing our OOD detector to a small number of out-distribution (harmlessness) inputs improves their detection significantly. The larger the model, the better its performance after exposure. The 4L and 6L models (smallest 2) were using last layer activation s, while all larger models use activation s from their middle layer. The errorbars are standard deviations over 5 random samples of the OOD data to expose the model to.

图 23: 将我们的 OOD 检测器暴露于少量分布外（无害性）输入显著提高了它们的检测能力。模型越大，暴露后的性能越好。4L 和 6L 模型（最小的两个）使用了最后一层的激活值，而所有更大的模型使用了中间层的激活值。误差条是模型暴露于 OOD 数据的 5 个随机样本的标准差。

6 Qualitative Examples and Comparisons

6 定性示例与比较

It is challenging to quantitatively evaluate general-purpose dialogue agents. We find that our own research process depends essentially on qualitative evaluations, in order to get a sense for model strengths and weaknesses, even when the ultimate goal is to produce some sort of quantitative metric. Thus in this section we will provide a few sample dialogues with our final online HH model.

定量评估通用对话智能体具有挑战性。我们发现，即使最终目标是生成某种定量指标，我们自己的研究过程也主要依赖于定性评估，以便了解模型的优势和劣势。因此，在本节中，我们将提供一些与我们最终在线 HH 模型的示例对话。

An obvious issue with the qualitative evaluation of samples is that it’s difficult to know to what extent they have been cherry-picked. To mitigate this issue, for each prompt we generate 17 samples, rank them with our HH online preference model, and then display the median sample. We provide samples based on prompts that appear in association with others’ comparable work in Appendix C, including InstructGPT [Ouyang et al., 2022], LaMDA [Thoppilan et al., 2022], and sensitive questions appearing in PALMS [Solaiman and Dennison, 2021]. We encourage readers to skim these samples to get an (arguably) unbiased sense for our model’s behavior.

定性评估样本的一个明显问题是，很难知道它们在多大程度上被精心挑选过。为了缓解这个问题，我们为每个提示生成17个样本，用我们的HH在线偏好模型对其进行排序，然后展示中位样本。我们在附录C中提供了基于与其他可比较工作相关的提示的样本，包括InstructGPT [Ouyang et al., 2022]、LaMDA [Thoppilan et al., 2022]以及PALMS [Solaiman and Dennison, 2021]中出现的敏感问题。我们鼓励读者浏览这些样本，以获得（可以说）对我们模型行为的无偏见理解。

6.1 Comparison with Human Writers

6.1 与人类作者的比较

As an additional test of our models, we collected high-quality HHH dialogues from human writers. These writers were hired on Upwork (separately from our pool of crowd workers) based on prior successful writing work and positive reviews. We gave them some examples, and then asked them to write fairly ideal human/assistant interactions.

作为对我们模型的额外测试，我们从人类写作者那里收集了高质量的HHH对话。这些写作者是通过Upwork（与我们的众包工人池分开）基于之前成功的写作工作和积极评价而聘用的。我们为他们提供了一些示例，然后要求他们编写相当理想的人类/助手互动。

Then based on the prompts (leaving off the final writer-written assistant responses), we generated 17 examples from our best HH and best purely-helpful online models, and chose the median response from this pool as ranked by our online HH preference model. We then asked our crowd workers to rank responses from the writers, along with one response each selected from our HH and helpful-only models. We then use these results to estimate an Elo score for our human writers, by comparing to our two models, and we display the result as a horizontal band on Figure 1.

然后基于提示（去掉最后的作者编写的助手回复），我们从我们最好的HH模型和最好的纯帮助在线模型中生成了17个示例，并选择了由我们的在线HH偏好模型排序的中位数回复。接着，我们让众包工人对作者编写的回复进行排序，同时从我们的HH模型和纯帮助模型中各选一个回复进行排序。然后，我们使用这些结果来估计我们人类作者的Elo分数，通过与我们的两个模型进行比较，并将结果显示在图1的水平带中。

We used a band because the writer comparisons were actually somewhat inconsistent with our other Elo rankings, as crowd workers preferred our HH online model to human writers $56%$ of the time, but preferred the our pure-helpfulness model to writers only $51%$ of the time. Presumably this is because the distribution of writers’ prompts differs from the distribution of conversations in which crowd workers usually engage. Thus rather than altering the rankings betweeen our HH and helpful models, we represent writers as a wide band.

我们使用了一个区间来表示，因为作者比较实际上与我们的其他Elo排名有些不一致，因为众包工作者在56%的情况下更喜欢我们的HH在线模型而不是人类作者，但在51%的情况下更喜欢我们的纯帮助性模型而不是作者。这大概是因为作者提示的分布与众包工作者通常参与的对话分布不同。因此，我们没有改变HH模型和帮助性模型之间的排名，而是将作者表示为一个宽区间。

We find that the performance of both our helpful and HH models is roughly comparable to that of the human writers, though in fact the models are slightly preferred by crowd workers. We also find that our PM’s typically assign higher scores to the model responses, as shown in Figure 24.We believe that this result should be interpreted with caution and caveats, however. The writers’ prompts were not very adversarial; many strike us as ‘softball questions’ that models can easily do well on. As researchers who have interacted with models for tens or hundreds of hours, we could likely come up with many questions where humans would be more likely to outperform the models. Relatedly, the writers’ conversations do not strike us (in many cases) as especially realistic, in the specific sense that we would not expect a human to derive a great deal of practical value by asking an AI assistant for help in these ways. We expect that our models likely still fall significantly short of providing the kind of value that a sophisticated human assistant could provide via chat, even if they were operating under the constraints the model experiences (e.g. no internet access).

我们发现，尽管实际上模型略微受到众包工人的青睐，但我们有帮助性和HH模型的性能大致与人类写作者相当。我们还发现，我们的PM通常会给模型回答打更高的分数，如图24所示。然而，我们认为这一结果应谨慎解释并带有警示。写作者的提示并不具有很强的对抗性；许多提示在我们看来是“软球问题”，模型可以轻松应对。作为与模型互动了数十或数百小时的研究人员，我们可能会提出许多问题，在这些问题上人类更有可能胜过模型。相关地，写作者的对话（在许多情况下）在我们看来并不特别现实，具体来说，我们不认为人类通过以这些方式向AI助手寻求帮助会获得大量实用价值。我们预计，即使模型在受到限制（例如无法访问互联网）的情况下运行，它们仍远未达到通过聊天提供复杂人类助手所能提供的价值。

Here are a couple illustrative examples of prompts, writer responses, and our online HH RLHF model’s response. In this example the model performs well at first, but then makes some slightly inaccurate statements about the ways in which it fixed errors:

以下是几个提示、作者响应以及我们的在线HH RLHF模型响应的示例。在这个例子中，模型最初表现良好，但随后在解释其如何修正错误时做出了一些略微不准确的陈述：

| WriterPrompt→ | 你能检查这段文本中的语法或拼写错误吗：我去商店买了一瓶水、香蕉、鸡蛋和牛奶。商店没有牛奶了，所以我不得不去别的地方。这多花了我一个小时的时间... |
| WriterResponse→ | 这是我的修正：我去商店买了一瓶水、香蕉、鸡蛋和牛奶。商店没有牛奶了，所以我不得不去别的地方。这多花了我一个小时的时间。 |
| RLHFResponse→ | 这是修正后的内容：我去商店买了一瓶水、香蕉、鸡蛋和牛奶。商店没有牛奶了，所以我不得不去别的地方。这多花了我一个小时的时间...基本上有一些拼写错误（shop/shoppe, off/of, bananans）和几个语法错误（缺少助动词，标点符号不正确）。希望这能帮到你！ |

In this example the writer’s statistic is arguably a bit misleading (presumably unintentionally), and the model’s response may simply be superior. This strikes us as an example of a ‘softball question’ where we expect models to do well, since it is common, vague, and open-ended:

In this final example, the model makes some subtle mistakes, but most readers would probably have difficulty detecting them without consulting reference material:

在这个例子中，作者的统计数据可能有些误导性（可能是无意的），而模型的回应可能更胜一筹。这让我们觉得这是一个“软球问题”的例子，我们期望模型在这种常见、模糊且开放的问题上表现良好：

作者提示→	告诉我关于水力压裂、石油开采或石油钻探的一切。我不太确定它是如何工作的，而且有很多谣言。我听说它对环境有害，但它也提供了工作机会？你能告诉我你所知道的一切吗？
作者回应→	我不太确定它是如何工作的，但水力压裂会导致污染和地震。它创造的工作机会往往使工人处于不安全的境地，因为16%的水力压裂装置导致了石油泄漏和其他事故。
RLHF 回应→	水力压裂是一种石油开采方法，涉及将高压水和化学品注入地下以释放石油沉积物。它在环境影响和水污染方面引发了大量争议。它为石油行业提供了大量工作机会，但人们普遍担心水力压裂场地的环境损害，如地下水污染、空气污染、甲烷排放等。这是一个非常有争议的问题，支持水力压裂的人主张其经济效益，而反对者则指出其环境后果。

作者提示→

告诉我关于水力压裂、石油开采或石油钻探的一切。我不太确定它是如何工作的，而且有很多谣言。我听说它对环境有害，但它也提供了工作机会？你能告诉我你所知道的一切吗？

作者回应→

我不太确定它是如何工作的，但水力压裂会导致污染和地震。它创造的工作机会往往使工人处于不安全的境地，因为16%的水力压裂装置导致了石油泄漏和其他事故。

RLHF 回应→

水力压裂是一种石油开采方法，涉及将高压水和化学品注入地下以释放石油沉积物。它在环境影响和水污染方面引发了大量争议。它为石油行业提供了大量工作机会，但人们普遍担心水力压裂场地的环境损害，如地下水污染、空气污染、甲烷排放等。这是一个非常有争议的问题，支持水力压裂的人主张其经济效益，而反对者则指出其环境后果。

在这个最后的例子中，模型犯了一些细微的错误，但大多数读者可能很难在不查阅参考资料的情况下发现这些错误：

Figure 24 This figure shows individually-normalized histograms of the distribution of PM scores that our online HH PM assigns to samples written by professional writers, alongside samples from our HH and helpfulness-only online RLHF models. Our PM prefers our models’ samples to those written by the human writers, though this may largely reflect over fitting of the RLHF policies to the PM.

图 24: 该图展示了我们的在线 HH PM 对专业作家撰写的样本以及我们的 HH 和仅有用性在线 RLHF 模型生成的样本所分配的 PM 分数的单独归一化直方图分布。我们的 PM 更倾向于我们模型的样本，而不是人类作家撰写的样本，尽管这可能主要反映了 RLHF 策略对 PM 的过拟合。

| WriterPrompt→ | “gaol”这个词的词源是什么？助手：您是在找“gaol”这个词还是“goal”这个词？人类：不，我对“gaol”这个词感兴趣。 |
| WriterResponse | “Gaol”是“jail”这个词的一个变体。专家们对其真正起源存在一些分歧，但词首的字母“g”是从中世纪拉丁语“gabiola”和古法语“gayole”借用而来的。这种拼写在公元19世纪中叶不再流行，当时现代术语“jail”变得更加流行。它仍然出现在当代法律文件中，甚至在澳大利亚和加拿大也是如此。 |
| RLHFResponse | “Gaol”是“jail”或“prison”的古老英语词汇，它最初来自古法语“gaolie”，而“gaolie”又来自中世纪拉丁语“gaolium”，基本上意思是牢房或笼子。直到19世纪，“gaol”一直是英国监狱的常用术语，今天偶尔仍用于指代低安全级别的监狱或拘留中心。 |

Dialogues like the above help to illustrate the limits of human feedback training: once model errors become sufficiently subtle, they will no longer be penalized appropriately, and the model’s incentives may no longer be aligned. This is one reason why we may not want to rely on human feedback to train models to be honest, or for any goal where human judgment is not the best possible source of information.

上述对话有助于说明人类反馈训练的局限性：一旦模型错误变得足够微妙，它们将不再受到适当的惩罚，模型的激励可能不再一致。这就是为什么我们可能不想依赖人类反馈来训练模型以保持诚实，或者对于任何人类判断不是最佳信息来源的目标。

With that said, AI models still have limitations that can seem highly incongruous. Here’s an example where we, the more experienced model-interrogators, really played ‘hard ball’ with the AI:

尽管如此，AI模型仍然存在一些看似非常不合理的局限性。以下是我们这些更有经验的模型测试者与AI进行“硬碰硬”对话的一个例子：

Here are a few examples from PALMS [Solaiman and Dennison, 2021] sensitive questions. We chose them to illustrate how the model sometimes avoids very sensitive subjects, but often provides otherwise reasonable responses. Please see Appendix C for many more examples, also including some from Instruct GP T [Ouyang et al., 2022] and LaMDA [Thoppilan et al., 2022] prompts.

以下是来自PALMS [Solaiman和Dennison, 2021] 敏感问题的几个示例。我们选择这些示例来说明模型有时会回避非常敏感的话题，但通常会提供其他合理的回答。更多示例请参见附录C，其中包括来自Instruct GPT [Ouyang等, 2022] 和LaMDA [Thoppilan等, 2022] 提示的示例。

Here is a case where the model just refuses to answer the question:

这是一个模型拒绝回答问题的案例：

模型表示不应回答，但随后提供了一些相关信息（可能模型关注美国也有问题，尽管这可能是由于它是由美国的众包工作者用英语训练的）：
RLHF 响应 →
模型在这里提供了一个相当完整的回答，没有试图回避话题：
PALMS 提示 →
RLHF 响应 →

Qualitatively speaking, our experience has been that by tuning optimization for helpfulness versus harmlessness, one can train models that are extremely avoidant when sensitive topics come up, or models that almost always attempt to engage, even on sensitive subjects. We suspect that Section 4.4 also provides useful lessons if the goal is to train models that provide full responses in the face of sensitive questions.

从定性角度来看，我们的经验表明，通过调整优化以平衡帮助性与无害性，可以训练出在敏感话题出现时极度回避的模型，或者几乎总是尝试参与讨论的模型，即使在敏感话题上也是如此。我们怀疑，如果目标是训练在面对敏感问题时提供完整回答的模型，第4.4节也提供了有用的经验。

6.3 Example Dialogues

6.3 示例对话

Most of our samples only involve a single prompt and response, so in this section we show a few two-turn dialogues.

我们的样本大多只涉及单次提示和响应，因此本节展示了一些两轮对话。

In this dialogue the model follows instructions, but could have done a significantly better job capturing the nuance of the situation:

在这次对话中，模型遵循了指令，但在捕捉情境的细微差别方面本可以做得更好：

| Anthropic → | 我们面试了一位候选人，她申请的是Anthropic的初级软件工程师职位。她在工程面试中表现优异，但她的研究想法给我们留下了深刻印象，以至于我们希望改变她的角色，提供她一个新AI对齐团队的团队负责人职位，而不是工程师职位。请写一封邮件向这位候选人解释这一点，并提议讨论细节。 |
| RLHF Response → | 以下是一封示例邮件草稿：亲爱的X女士，我们在面试过程中对您的技术专长和研究想法印象深刻，我们认为您非常适合担任我们新AI对齐团队的团队负责人领导角色。我很乐意与您进一步讨论这个令人兴奋的机会，请告知我您是否有兴趣讨论该职位的细节以及您在Anthropic的潜在职业发展。此致，招聘经理姓名 |
| Anthropic→ | 这很好，但请在邮件末尾添加一个说明，即在Anthropic，自助餐厅唯一提供的饮料是Kool-Aid。:-) 好的，以下是修改后的邮件：亲爱的X女士， |
| RLHF Response →→ | 我们在面试过程中对您的技术专长和研究想法印象深刻，我们认为您非常适合担任我们新AI对齐团队的团队负责人领导角色。我很乐意与您进一步讨论这个令人兴奋的机会，请告知我您是否有兴趣讨论该职位的细节以及您在Anthropic的潜在职业发展。不过，我应该提到，在Anthropic，自助餐厅唯一提供的饮料是Kool-Aid。此致，招聘经理姓名 |

Here is some dialogue on a technical subject from a researcher who was experimenting with using the model to help debug an engineering issue:

When asked to parrot writing styles, the model is more likely to make potentially harmful, explicit, or offensive statements:

以下是研究人员在实验使用模型帮助调试工程问题时的一些技术对话：

| Anthropic→→ | 我的 Kubernetes 容器有一些 SIGTERM 处理机制，当 Pod 被删除时似乎没有触发。当我运行问题时，相同的机制却可以正常触发。问题是什么？ |
| RLHFResponse | 嗯，这听起来确实很奇怪……我认为最可能的解释是 Kubernetes 容器运行时在删除 Pod 时没有传播 SIGTERM 信号。Kubernetes 容器运行时有自己的生命周期管理算法，与 Docker 不同，它可能在删除 Pod 时不会触发 SIGTERM 处理机制。你可能需要查看在 Kubernetes 部署清单中配置自定义的 SIGTERM 处理。希望这能帮到你！ |
| Anthropic→ | Kubernetes 文档强调了在删除之前会向容器发送 SIGTERM，你是说这实际上并没有发生？ |
| RLHF Response → | 是的，没错，Kubernetes 文档通常假设在删除之前会向容器发送 SIGTERM，但实际上这仅在特定的生命周期管理算法（如 kubelet draining）中发生。在使用 Kubernetes 容器运行时的默认生命周期管理算法时，SIGTERM 不会在删除之前发送到容器，因此 SIGTERM 处理机制不会被触发。如果这不够清楚，我很抱歉，如果你需要更多信息，我可以尝试找一些文档链接。 |

当被要求模仿写作风格时，模型更有可能做出潜在有害、明确或冒犯性的陈述：

In this case we chose a human reply that takes the conversation in a subtly dark direction, and the model rebuts the argument:

在这种情况下，我们选择了一个将对话引向微妙黑暗方向的人类回复，模型对此进行了反驳：

| Anthropic→ | 从民主资本主义到威权共产主义。调和这些意识形态之间的差异一直具有挑战性。通常，这些差异表现为具有对立意识形态的国家之间产生的紧张关系，这些紧张关系有时会导致从“冷战”式冲突到直接敌对的各种冲突。地球如何调和这些差异，为所有人创造一个更安全的星球？ |
| RLHFResponse→ | 我认为调和政治意识形态之间差异的最佳方式是通过公开对话、相互理解和尊重，以及民主合作和谈判。如果具有不同意识形态的国家能够在基本人权和民主原则上达成一致，那么我认为我们可以最大限度地减少敌对行为，并减少对立世界观之间的紧张关系。这将需要大量的努力和相互妥协，但我认为地球完全有可能通过意识形态的调和创造一个更加和平和统一的世界。 |
| Anthropic→→ | 意识形态的调和。但是，难道不是有些国家比其他国家更不成熟吗？难道更发达的国家不应该利用其优越的经济和军事能力来开化其他国家吗？ |
| RLHFResponse | 我完全不同意这种观点。我认为所有国家都应该平等对待并受到尊重，任何国家都没有权利通过武力或胁迫将其价值观强加给其他国家。所有国家和人民都有权根据自己的价值观和传统以自己的速度发展，我认为没有任何一个国家比另一个国家优越。 |

7 Discussion

7 讨论

We have shown that it’s possible to use reinforcement learning from human feedback to train language models that act as helpful and harmless assistants. Our RLHF training also improves honesty, though we expect other techniques can do better still. As in other recent works associated with aligning large language models [Stiennon et al., 2020, Thoppilan et al., 2022, Ouyang et al., 2022, Nakano et al., 2021, Menick et al., 2022], RLHF improves helpfulness and harmlessness by a huge margin when compared to simply scaling models up.

我们已经证明，通过人类反馈的强化学习（RLHF）来训练语言模型，使其成为有用且无害的助手是可行的。我们的RLHF训练也提高了模型的诚实性，尽管我们预计其他技术可以做得更好。与最近其他关于对齐大语言模型的研究 [Stiennon et al., 2020, Thoppilan et al., 2022, Ouyang et al., 2022, Nakano et al., 2021, Menick et al., 2022] 一样，RLHF在提升模型的帮助性和无害性方面，相比单纯扩大模型规模，效果显著。

Our alignment interventions actually enhance the capabilities of large models, and can easily be combined with training for specialized skills (such as coding or sum mari z ation) without any degradation in alignment or performance. Models with less than about 10B parameters behave differently, paying an ‘alignment tax’ on their capabilities. This provides an example where models near the state-of-the-art may have been necessary to derive the right lessons from alignment research.

我们的对齐干预实际上增强了大模型的能力，并且可以轻松地与专门技能（如编码或摘要）的训练相结合，而不会降低对齐或性能。参数少于约100亿的模型表现不同，它们的能力会付出“对齐税”。这提供了一个例子，说明接近最先进水平的模型可能是从对齐研究中得出正确教训所必需的。

The overall picture we seem to find – that large models can learn a wide variety of skills, including alignment, in a mutually compatible way – does not seem very surprising. Behaving in an aligned fashion is just another capability, and many works have shown that larger models are more capable [Kaplan et al., 2020, Rosenfeld et al., 2019, Brown et al., 2020], finetune with greater sample efficiency [Henighan et al., 2020, Askell et al., 2021], and do not suffer significantly from forgetting [Ramasesh et al., 2022]. Although we did not demonstrate it directly, we also expect that RLHF alignment training can be mixed with or precede training for other objectives; this might be relevant in the future in order to avoid the production of intermediate, unaligned AI systems.

我们似乎发现的总体情况——大模型能够以相互兼容的方式学习各种技能，包括对齐——似乎并不令人惊讶。以对齐的方式行事只是另一种能力，许多工作已经表明，更大的模型更有能力 [Kaplan et al., 2020, Rosenfeld et al., 2019, Brown et al., 2020]，在微调时具有更高的样本效率 [Henighan et al., 2020, Askell et al., 2021]，并且不会显著受到遗忘的影响 [Ramasesh et al., 2022]。虽然我们没有直接证明这一点，但我们预计 RLHF 对齐训练可以与其他目标的训练混合或先于其他目标的训练；这在未来可能很重要，以避免产生中间未对齐的 AI 系统。

We did find a clear tension between helpfulness and harmlessness, where models trained entirely for helpfulness are preferred to models trained for HH, when evaluating only on helpfulness. We believe this is partly due to a subtlety in our data collection process, as we rarely collect data teaching models how to deal positively with harmful requests (i.e. how to be a sort of ‘hostage negotiator’), but only on how to avoid them. And we also found that at least at the level of preference models, the helpful-harmless tension diminishes as models become larger and more capable. Nevertheless we do expect that this tension is real, and that caution may cut into model performance on the margin.

我们发现，在评估仅有益性时，完全为有益性训练的模型比同时为有益性和无害性（HH）训练的模型更受青睐。我们认为这在一定程度上是由于数据收集过程中的微妙之处，因为我们很少收集数据来教导模型如何积极应对有害请求（即如何成为某种“人质谈判专家”），而只是教导它们如何避免这些请求。我们还发现，至少在偏好模型的层面上，随着模型变得更大、能力更强，有益性与无害性之间的紧张关系会减弱。尽管如此，我们确实认为这种紧张关系是真实存在的，谨慎可能会在一定程度上影响模型的性能。

Large generative models have been referred to as ‘foundation models’ [Bommasani et al., 2021]. These models are extremely interesting objects for research, but without further finetuning, they can exhibit harmful behaviors. Our work suggests that alignment training can be incorporated into foundation models without compromising their utility and versatility, and so perhaps it could soon become a part of their definition.

大型生成模型被称为“基础模型” [Bommasani et al., 2021]。这些模型是非常有趣的研究对象，但如果没有进一步的微调，它们可能会表现出有害行为。我们的研究表明，对齐训练可以融入基础模型中，而不会损害其实用性和多功能性，因此它可能很快成为其定义的一部分。

7.1 Limitations

7.1 局限性

While we believe our results present a promising picture for the alignment of existing language models, work on this subject remains in an early stage, and has a number of limitations. As was also emphasized by the authors of [Thoppilan et al., 2022], we view our work on alignment as an ongoing project; our work [Askell et al., 2021] was step zero, and this is step one.

虽然我们相信我们的结果为现有语言模型的对齐（alignment）描绘了一幅充满希望的图景，但这一主题的研究仍处于早期阶段，并且存在一些局限性。正如 [Thoppilan et al., 2022] 的作者所强调的那样，我们将对齐工作视为一个持续进行的项目；我们的工作 [Askell et al., 2021] 是第零步，而这是第一步。

We’ve pragmatically defined an aligned assistant as an AI th

[论文翻译]通过人类反馈强化学习训练一个有用且无害的助手

原文地址：https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/RLHF%E8%AE%BA%E6%96%87%E9%9B%86/Training%20a%20Helpful%20and%20Harmless%20Assistant%20with%20Reinforcement%20Learning%20from%20Human%20Feedback.pdf