Learning to summarize from human feedback
Nisan Stiennon∗ Long Ouyang∗ Jeff Wu∗ Daniel M. Ziegler∗ Ryan Lowe∗
Chelsea Voss∗ Alec Radford Dario Amodei Paul Christiano∗
OpenAI
Abstract
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about: summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts [63] and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles [22], producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
1 Introduction
Large-scale language model pretraining has become increasingly prevalent for achieving high performance on a variety of natural language processing (NLP) tasks. When applying these models to a specific task, they are usually fine-tuned using supervised learning, often to maximize the log probability of a set of human demonstrations.
While this strategy has led to markedly improved performance, there is still a misalignment between this fine-tuning objective (maximizing the likelihood of human-written text) and what we care about (generating high-quality outputs as determined by humans). This misalignment has several causes: the maximum likelihood objective does not distinguish between important errors (e.g. making up facts [41]) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance [56, 52]. Quality can often be improved significantly by non-uniform sampling strategies such as beam search [51], but these can lead to repetition and other undesirable artifacts [69, 23]. Optimizing for quality may be a principled approach to overcoming these problems.
Figure 1: Fraction of the time humans prefer our models’ summaries over the human-generated reference summaries on the TL;DR dataset. Since quality judgments involve an arbitrary decision about how to trade off summary length vs. coverage within the 24-48 token limit, we also provide length-controlled graphs in Appendix F; length differences explain about a third of the gap between feedback and supervised learning at 6.7B.
Our goal in this paper is to advance methods for training language models on objectives that more closely capture the behavior we care about. To make short-term progress towards this goal, we focus on abstractive English text summarization, as it has a long history in the NLP community [16, 8, 54, 59, 50], and is a subjective task where we believe it is difficult to quantify summary quality without human judgments. Indeed, existing automatic metrics for evaluating summary quality, such as ROUGE [39], have received criticism for poor correlation with human judgments [55, 45, 6, 33].
We follow the works of [3, 73], who fine-tune language models from human feedback using reward learning [35]. We first collect a dataset of human preferences between pairs of summaries, then train a reward model (RM) via supervised learning to predict the human-preferred summary. Finally, we train a policy via reinforcement learning (RL) to maximize the score given by the RM; the policy generates a token of text at each ‘time step’, and is updated using the PPO algorithm [58] based on the RM ‘reward’ given to the entire generated summary. We can then gather more human data using samples from the resulting policy, and repeat the process. We follow the works of [48, 4] and use large pretrained GPT-3 models with as many as 6.7 billion parameters.
Our main contributions are four-fold.
(1) We show that training with human feedback significantly outperforms very strong baselines on English summarization. When applying our methods on a version of the Reddit TL;DR dataset [63], we train policies via human feedback that produce better summaries than much larger policies trained via supervised learning. Summaries from our human feedback models are preferred by our labelers to the original human demonstrations in the dataset (see Figure 1).
(2) We show human feedback models generalize much better to new domains than supervised models. Our Reddit-trained human feedback models also generate high-quality summaries of news articles on the CNN/DailyMail (CNN/DM) dataset without any news-specific fine-tuning, almost matching the quality of the dataset’s reference summaries. We perform several checks to ensure that these human preferences reflect a real quality difference: we consistently monitor agreement rates amongst labelers and researchers, and find researcher-labeler agreement rates are nearly as high as researcher-researcher agreement rates (see Section C.2), and we verify models are not merely optimizing simple metrics like length or amount of copying (see Appendices F and G.7).
(3) We conduct extensive empirical analyses of our policy and reward model. We examine the impact of model and data size (Figure 6), study performance as we continue to optimize a given reward model (Section 4.3), and analyze reward model performance using synthetic and human-written perturbations of summaries (Section 4.3). We confirm that our reward model outperforms other metrics such as ROUGE at predicting human preferences, and that optimizing our reward model directly results in better summaries than optimizing ROUGE according to humans (Section 4.4).
(4) We publicly release our human feedback dataset for further research. The dataset contains 64,832 summary comparisons on the TL;DR dataset, as well as our evaluation data on both TL;DR (comparisons and Likert scores) and CNN/DM (Likert scores).
The methods we present in this paper are motivated in part by longer-term concerns about the misalignment of AI systems with what humans want them to do. When misaligned summarization models make up facts, their mistakes are fairly low-risk and easy to spot. However, as AI systems become more powerful and are given increasingly important tasks, the mistakes they make will likely become more subtle and safety-critical, making this an important area for further research.
2 Related work
Most directly related to our work is previous work using human feedback to train summarization models with RL [3, 73]. Böhm et al. [3] learn a reward function from a dataset of human ratings of 2.5k CNN/DM summaries, and train a policy whose summaries are preferred to those of a policy optimizing ROUGE. Our work is most similar to [73], who also train Transformer models [62] to optimize human feedback across a range of tasks, including summarization on the Reddit TL;DR and CNN/DM datasets. Unlike us, they train in an online manner and find the model highly extractive. They note that their labelers prefer extractive summaries and have low agreement rates with researchers. Compared to [73], we use significantly larger models, move to the batch setting for collecting human feedback, ensure high labeler-researcher agreement, and make some algorithmic modifications, such as separating the policy and value networks.
Human feedback has also been used as a reward to train models in other domains such as dialogue [25, 68, 21], translation [32, 1], semantic parsing [34], story generation [72], review generation [7], and evidence extraction [46]. Our reward modeling approach was developed in prior work on learning to rank [40], which has been applied to ranking search results using either explicit feedback [2, 18] or implicit feedback in the form of click-through data [29, 30]. In a related line of research, human feedback has been used to train agents in simulated environments [10, 24]. There is also a rich literature on using RL to optimize automatic metrics for NLP tasks, such as ROUGE for summarization [50, 65, 45, 15, 19], BLEU for translation [50, 66, 1, 43], and other domains [61, 27, 26]. Finally, there has been extensive research on modifying architectures [22, 59] and pre-training procedures [70, 36, 49, 60, 53, 14] for improving summarization performance.
3 Method and experiment details
3.1 High-level methodology
Our approach is similar to the one outlined in [73], adapted to the batch setting. We start with an initial policy that is fine-tuned via supervised learning on the desired dataset (in our case, the Reddit TL;DR summarization dataset). The process (illustrated in Figure 2) then consists of three steps that can be repeated iteratively.
Step 1: Collect samples from existing policies and send comparisons to humans. For each Reddit post, we sample summaries from several sources including the current policy, initial policy, original reference summaries and various baselines. We send a batch of pairs of summaries to our human evaluators, who are tasked with selecting the best summary of a given Reddit post.
Step 2: Learn a reward model from human comparisons. Given a post and a candidate summary, we train a reward model to predict the log odds that this summary is the better one, as judged by our labelers.
Step 3: Optimize a policy against the reward model. We treat the logit output of the reward model as a reward that we optimize using reinforcement learning, specifically with the PPO algorithm [58].
Figure 2: Diagram of our human feedback, reward model training, and policy training procedure.
We provide a more thorough description of our procedure, including details of the reward model and policy training and our quality control process, in the following sections. In practice, rather than precisely iterating this sequence of three steps, we updated our data collection and training procedures over the course of the project while accumulating labels (see Appendix C.6 for details).
3.2 Datasets and task
Datasets. We use the TL;DR summarization dataset [63], which contains ${\sim}3$ million posts from reddit.com across a variety of topics (subreddits), as well as summaries of the posts written by the original poster (TL;DRs). We additionally filter this dataset (see Appendix A) to ensure quality, including using a whitelist of subreddits that are understandable to the general population. Crucially, we also filter to include only posts where the human-written summaries contain between 24 and 48 tokens, to minimize the potential effect of summary length on quality (see Section 4.1 and Appendix F). Our final filtered dataset contains 123,169 posts, and we hold out ${\sim}5\%$ as a validation set. For the remainder of this paper, we refer to this dataset simply as TL;DR.
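The length filter and validation split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `count_tokens` is a hypothetical whitespace stand-in for the model tokenizer, the `"summary"` field name is assumed, and the other filters from Appendix A (such as the subreddit whitelist) are omitted.

```python
import random


def count_tokens(text):
    # Hypothetical stand-in: whitespace tokens. The paper counts model
    # tokens, which this sketch does not reproduce.
    return len(text.split())


def filter_by_summary_length(examples, min_tokens=24, max_tokens=48):
    # Keep only posts whose human-written summary has 24 to 48 tokens.
    return [
        ex for ex in examples
        if min_tokens <= count_tokens(ex["summary"]) <= max_tokens
    ]


def split_train_valid(examples, valid_fraction=0.05, seed=0):
    # Shuffle a copy deterministically, then hold out roughly 5% for validation.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Filtering on the reference summary length (rather than the post length) is what keeps the target length of the task roughly fixed across examples.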
We chose the TL;DR dataset over the more commonly used CNN/DM dataset primarily because very strong performance can be attained on CNN/DM with simple extractive baselines. We find in Section 4.2 that our labelers prefer lead-3 over the CNN/DM reference summaries, and that the supervised T5 model [49] with low-temperature sampling already surpasses the reference summary quality, while copying extensively from the article. On the other hand, simple extractive baselines perform poorly on TL;DR in our human evaluations (see Appendix G.2). Instead of training on CNN/DM, we study the transfer performance of our human feedback models to CNN/DM after being trained to summarize Reddit posts.
Task. We define our ground-truth task as producing a model that generates summaries fewer than 48 tokens long that are as good as possible, according to our judgments. We judge summary quality by how faithfully the summary conveys the original post to a reader who can only read the summary and not the post (see Appendix C.5 for further discussion of criteria). Since we have limited capacity to do comparisons, we hire labelers to do the comparisons for us. We rely on detailed procedures to ensure high agreement between labelers and us on the task, which we describe in the next section.
Table 1: Example of post and samples on the TL;DR dataset, chosen to be particularly short. For random samples (along with posts), see Appendix H and our website.
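Human-written reference TL;DR | 6.7B supervised model | 6.7B human feedback model |
---|---|---|
First date after three years in a relationship, going to the beach, nervous. What should I bring, and what should I do? | Going on a date with a girl I met a few days ago, going to the beach. What should I bring, and what should I do? | Going on my first date tomorrow, cycling to the beach. Would like some advice on what to do and what to bring. I'm a bit nervous and excited. Thanks! |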
3.3 Collecting human feedback
Previous work on fine-tuning language models from human feedback [73] reported “a mismatch between the notion of quality we wanted our model to learn, and what the humans labelers actually evaluated”, leading to model-generated summaries that were high-quality according to the labelers, but fairly low-quality according to the researchers.
Compared to [73], we implement two changes to improve human data quality. First, we transition entirely to the offline setting, where we alternate between sending large batches of comparison data to our human labelers and re-training our models on the cumulative collected data. Second, we maintain a hands-on relationship with labelers: we on-board them with detailed instructions, answer their questions in a shared chat room, and provide regular feedback on their performance. We train all labelers to ensure high agreement with our judgments, and continuously monitor labeler-researcher agreement over the course of the project. See Appendix C.1 and C.5 for details.
As a result of our procedure, we obtained high labeler-researcher agreement: on a subset of comparison tasks, labelers agree with researchers $77\% \pm 2\%$ of the time, while researchers agree with each other $73\% \pm 4\%$ of the time. We provide more analysis of our human data quality in Appendix C.2.
3.4 Models
All of our models are Transformer decoders [62] in the style of GPT-3 [47, 4]. We conduct our human feedback experiments on models with 1.3 billion (1.3B) and 6.7 billion (6.7B) parameters.
Pretrained models. Similarly to [12, 47], we start with models pretrained to autoregressively predict the next token in a large text corpus. As in [48, 4], we use these models as ‘zero-shot’ baselines by padding the context with examples of high-quality summaries from the dataset. We provide details on pretraining in Appendix B, and on our zero-shot procedure in Appendix B.2.
Supervised baselines. We next fine-tune these models via supervised learning to predict summaries from our filtered TL;DR dataset (see Appendix B for details). We use these supervised models to sample initial summaries for collecting comparisons, to initialize our policy and reward models, and as baselines for evaluation. In our final human evaluations, we use $\mathrm{T}{=}0$ to sample from all models, as we found it performed better than higher temperatures or nucleus sampling (see Appendix B.1).
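To make the sampling setting concrete, the sketch below shows one common way temperature enters next-token sampling, with $T=0$ treated as greedy (argmax) decoding. It is an illustrative assumption about how $T=0$ is realized, not the authors' decoding code; `sample_token` and its arguments are hypothetical names.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    # T = 0 is treated as greedy decoding: always pick the highest logit.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from softmax(logits / T), shifting by the maximum
    # for numerical stability.
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Lower temperatures concentrate probability on the most likely tokens, so greedy decoding is the limiting case as the temperature approaches zero.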
To validate that our supervised models are indeed strong baselines for comparison, we run our supervised fine-tuning procedure with our 6.7B model on the CNN/DM dataset, and find that we achieve slightly better ROUGE scores than SOTA models [71] from mid-2019 (see Appendix G.4).
Reward models. To train our reward models, we start from a supervised baseline, as described above, then add a randomly initialized linear head that outputs a scalar value. We train this model to predict which summary $y\in\{y_{0},y_{1}\}$ is better as judged by a human, given a post $x$. If the summary preferred by the human is $y_{i}$, we can write the RM loss as:
$$
\mathrm{loss}(r_{\theta})=-E_{(x,y_{0},y_{1},i)\sim D}\left[\log\left(\sigma\left(r_{\theta}(x,y_{i})-r_{\theta}(x,y_{1-i})\right)\right)\right]
$$
where $r_{\theta}(x,y)$ is the scalar output of the reward model for post $x$ and summary $y$ with parameters $\theta$ , and $D$ is the dataset of human judgments. At the end of training, we normalize the reward model outputs such that the reference summaries from our dataset achieve a mean score of 0.
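As a minimal numerical sketch of this objective, the function below treats the loss as a quantity to minimize: the negative mean log-sigmoid of the reward gap between the preferred and the rejected summary. It operates on precomputed scalar rewards rather than a Transformer, and the function names are hypothetical. `normalize_rewards` illustrates the final shift that makes the reference summaries score 0 on average.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_rm_loss(reward_pairs):
    # reward_pairs: list of (r_preferred, r_rejected) scalar reward model
    # outputs for the human-preferred and the other summary of each post.
    total = sum(log_sigmoid(r_pref - r_rej) for r_pref, r_rej in reward_pairs)
    return -total / len(reward_pairs)


def normalize_rewards(rewards, reference_rewards):
    # Shift rewards so that the reference summaries have mean score 0.
    offset = sum(reference_rewards) / len(reference_rewards)
    return [r - offset for r in rewards]
```

When both summaries receive the same reward the loss equals $\log 2$, and it falls toward 0 as the preferred summary's reward grows relative to the other one. Only reward differences matter to the loss, which is why a constant shift can be applied after training.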
Human feedback policies. We want to use the reward model trained above to train a policy that generates higher-quality outputs as judged by humans. We primarily do this using reinforcement learning, by treating the output of the reward model as a reward for the entire summary that we maximize with the PPO algorithm [58], where each time step is a BPE token. We initialize our policy to be the model fine-tuned on Reddit TL;DR. Importantly, we include a term in the reward that penalizes the KL divergence between the learned RL policy $\pi_{\phi}^{\mathrm{RL}}$ with parameters $\phi$ and this original supervised model $\pi^{\mathrm{SFT}}$, as previously done in [25]. The full reward $R$ can be written as:
$$
R(x,y)=r_{\theta}(x,y)-\beta\log[\pi_{\phi}^{\mathrm{RL}}(y|x)/\pi^{\mathrm{SFT}}(y|x)]
$$
This KL term serves two purposes. First, it acts as an entropy bonus, encouraging the policy to explore and deterring it from collapsing to a single mode. Second, it ensures the policy doesn’t learn to produce outputs that are too different from those that the reward model has seen during training.
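The full reward can be sketched numerically as below, assuming per-token log-probabilities of the sampled summary are available from both policies. The function name, argument names, and any value chosen for `beta` are illustrative assumptions; the text above does not state the coefficient.

```python
def kl_penalized_reward(rm_score, logp_rl, logp_sft, beta):
    # logp_rl / logp_sft: per-token log-probabilities of the same sampled
    # summary under the RL policy and the supervised (SFT) policy.
    # Summing over tokens gives log pi(y|x) for the whole summary.
    log_ratio = sum(logp_rl) - sum(logp_sft)
    # R(x, y) = r_theta(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x))
    return rm_score - beta * log_ratio
```

A summary that the RL policy finds much more likely than the supervised model has a positive log ratio, so its reward is reduced; when the two policies agree, the reward is just the reward model score.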
For the PPO value function, we use a Transformer with completely separate parameters from the policy. This prevents updates to the value function from partially destroying the pretrained policy early in training (see ablation in Appendix G.1). We initialize the value function to the parameters of the reward model. In our experiments, the reward model, policy, and value function are the same size.
4 Results
4.1 Summarizing Reddit posts from human feedback
Policies trained with human feedback are preferred to much larger supervised policies. Our main results evaluating our human feedback policies on TL;DR are shown in Figure 1. We measure policy quality as the percentage of summaries generated by that policy that humans prefer over the reference summaries in the dataset. Our policies trained with human feedback significantly outperform our supervised baselines on this metric, with our 1.3B human feedback model significantly outperforming a supervised model $10\times$ its size ($61\%$ versus $43\%$ raw preference score against reference summaries). Our 6.7B model in turn significantly outperforms our 1.3B model, suggesting that training with human feedback also benefits from scale. Additionally, both of our human feedback models are judged by humans to be superior to the human demonstrations used in the dataset.
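The preference metric used here can be computed as in the sketch below: the fraction of comparisons in which labelers chose the policy's summary over the reference. The binomial standard error is included only as a rough uncertainty estimate and is an assumption of this sketch, not necessarily how the error bars in the paper were obtained; the function name is hypothetical.

```python
import math


def preference_rate(policy_preferred):
    # policy_preferred: one boolean per comparison, True when the labeler
    # preferred the policy's summary over the reference summary.
    n = len(policy_preferred)
    rate = sum(policy_preferred) / n
    # Binomial standard error of the estimated rate.
    std_err = math.sqrt(rate * (1 - rate) / n)
    return rate, std_err
```

For example, `preference_rate([True] * 7 + [False] * 3)` returns a rate of 0.7. A rate above 0.5 means the policy's summaries are preferred to the references more often than not.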
Controlling for summary length. When judging summary quality, summary length is a confounding factor. The target length of a summary is implicitly part of the summarization task; depending on the desired trade-off between conciseness and coverage, a shorter or longer summary might be better. Since our models learned to generate longer summaries, length could account for much of our quality improvements. We find that after controlling for length (Appendix F), the preference of our human feedback models vs. reference summaries drops by ${\sim}5\%$; even so, our 6.7B model summaries are still preferred to the reference summaries ${\sim}65\%$ of the time.
How do our policies improve over the baselines? To better understand the quality of our models’ summaries compared to the reference summaries and those of our supervised baselines, we conduct an additional analysis where human labelers assess summary quality across four dimensions (or “axes”) using a 7-point Likert scale [38]. Labelers rated summaries for coverage (how much important information from the original post is covered), accuracy (to what degree the statements in the summary are stated in the post), coherence (how easy the summary is to read on its own), and overall quality.
Figure 4: Transfer results on CNN/DM. (a) Overall summary quality on CNN/DM as a function of model size. Full results across axes shown in Appendix G.2. (b) Overall scores vs. length for the 6.7B TL;DR supervised baseline, the 6.7B TL;DR human feedback model, and T5 fine-tuned on CNN/DM summaries. At similar summary lengths, our 6.7B TL;DR human feedback model nearly matches T5 despite never being trained to summarize news articles.
The results (Figure 3) indicate that our human feedback models outperform the supervised baselines across every dimension of quality, but particularly coverage. Although our human labelers had a high bar for giving perfect overall scores, summaries from our 6.7B PPO model achieve a 7/7 overall score $45\%$ of the time (compared to $20\%$ and $23\%$ for the 6.7B supervised baseline and reference summaries, respectively).
4.2 Transfer to summarizing news articles
Our human feedback models can also generate excellent summaries of CNN/DM news articles without any further training (Figure 4). Our human feedback models significantly outperform models trained via supervised learning on TL;DR and models trained only on pretraining corpora. In fact, our 6.7B human feedback model performs almost as well as a 6.7B model that was fine-tuned on the CNN/DM reference s