[论文翻译]LLAMA 2: 开源基础模型和微调聊天模型


原文地址:https://arxiv.org/pdf/2307.09288


LLAMA 2: Open Foundation and Fine-Tuned Chat Models

LLAMA 2: 开源基础模型和微调聊天模型

Hugo Touvron* Louis Martint Kevin Stonet

GenAI, Meta

生成式 AI (GenAI), Meta

Abstract

摘要

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closedsource models. We provide a detailed description of our approach to fine-tuning and safety improvements of LLAMA 2-CHAT in order to enable the community to build on our work and contribute to the responsible development of LLMs.

在本工作中,我们开发并发布了 Llama 2,这是一系列预训练和微调的大语言模型 (LLM),参数规模从 70 亿到 700 亿不等。我们的微调大语言模型,称为 LLAMA 2-CHAT,针对对话应用场景进行了优化。我们的模型在我们测试的大多数基准上优于开源聊天模型,并且根据我们对有用性和安全性的人工评估,可能是闭源模型的合适替代品。我们详细描述了对 LLAMA 2-CHAT 进行微调和安全性改进的方法,以使社区能够在此基础上进行构建,并为大语言模型的负责任发展做出贡献。

Contents

目录

1 Introduction

1 引言

2 Pre training 5

2 预训练 5

3 Fine-tuning 8

3 微调 8

4 Safety 20

4 安全性 20

Discussion 32

讨论 32

6Related Work 35

6 相关工作 35

7 Conclusion 36

7 结论 36

A Appendix 46

附录 46


Figure 1: Helpfulness human evaluation results for LLAMA $^{2}$ -CHAT compared to other open-source and closed-source models. Human raters compared model generations on ${\sim}4\mathrm{k}$ prompts consisting of both single and multi-turn prompts. The $9\bar{5}%$ confidence intervals for this evaluation are between $1%$ and $2%$ . More details in Section 3.4.2. While reviewing these results, it is important to note that human evaluations can be noisy due to limitations of the prompt set, subjectivity of the review guidelines, subjectivity of individual raters, and the inherent difficulty of comparing generations.

图 1: LLAMA $^{2}$ -CHAT 相对于其他开源和闭源模型的帮助性人工评估结果。人工评估者在约 4k 个提示上比较了模型生成的结果,这些提示包括单轮和多轮提示。此次评估的 $9\bar{5}%$ 置信区间在 $1%$ 到 $2%$ 之间。更多详情见第 3.4.2 节。在审查这些结果时,需要注意人工评估可能会因提示集的局限性、评审指南的主观性、个别评估者的主观性以及比较生成内容的固有难度而产生噪声。


Figure 2: Win-rate $%$ for helpfulness and safety between commercial-licensed baselines and LLAMA 2-CHAT, according to GPT4. To complement the human evaluation, we used a more capable model, not subject to our own guidance. Green area indicates our model is better according to GPT-4. To remove ties, we used win/ $'(w i n\bar{+},l o s s)$ . The orders in which the model responses are presented to GPT-4 are randomly swapped to alleviate bias.

图 2: 根据 GPT4,商业许可基线和 LLAMA 2-CHAT 在有用性和安全性方面的胜率 % 。为了补充人工评估,我们使用了一个更强大的模型,该模型不受我们自身指导的限制。绿色区域表示根据 GPT-4 我们的模型更好。为了解决平局问题,我们使用了胜/负 (win/(win+loss)) 。模型响应呈现给 GPT-4 的顺序是随机交换的,以减轻偏差。

1 Introduction

1 引言

Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in complex reasoning tasks requiring expert knowledge across a wide range of fields, including in specialized domains such as programming and creative writing. They enable interaction with humans through intuitive chat interfaces, which has led to rapid and widespread adoption among the general public.

大语言模型 (LLMs) 在复杂推理任务中展现出巨大的潜力,这些任务需要跨多个领域的专业知识,包括编程和创意写作等专业领域。它们通过直观的聊天界面与人类进行交互,这导致了在普通大众中的快速和广泛采用。

The capabilities of LLMs are remarkable considering the seemingly straightforward nature of the training methodology. Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data, followed by alignment with human preferences via techniques such as Reinforcement Learning with Human Feedback (RLHF). Although the training methodology is simple, high computational requirements have limited the development of LLMs to a few players. There have been public releases of pretrained LLMs (such as BLOOM (Sca0 et al., 2022), LLaMa-1 (Touvron et al., 2023), and Falcon (Penedo et al., 2023)) that match the performance of closed pretrained competitors like GPT-3 (Brown et al., 2020) and Chinchilla (Hoffmann et al., 2022), but none of these models are suitable substitutes for closed "product" LLMs, such as ChatGPT, BARD, and Claude. These closed product LLMs are heavily fine-tuned to align with human preferences, which greatly enhances their usability and safety. This step can require significant costs in compute and human annotation, and is often not transparent or easily reproducible, limiting progress within the community to advance AI alignment research.

大语言模型的能力令人印象深刻,考虑到其训练方法看似简单直接。自回归 Transformer 模型在大量自我监督数据上进行预训练,随后通过强化学习与人类反馈 (Reinforcement Learning with Human Feedback, RLHF) 等技术与人类偏好对齐。尽管训练方法简单,但高计算需求限制了大语言模型的开发,只有少数参与者能够涉足。已有一些公开发布的预训练大语言模型(如 BLOOM (Sca0 et al., 2022),LLaMa-1 (Touvron et al., 2023),和 Falcon (Penedo et al., 2023)),其性能可与封闭的预训练竞争对手如 GPT-3 (Brown et al., 2020) 和 Chinchilla (Hoffmann et al., 2022) 相媲美,但这些模型都不足以替代封闭的“产品”大语言模型,如 ChatGPT、BARD 和 Claude。这些封闭的产品大语言模型经过大量微调以与人类偏好对齐,这大大提高了它们的可用性和安全性。此步骤可能需要大量的计算和人工标注成本,并且通常不透明或难以复现,从而限制了社区在推进 AI 对齐研究方面的进展。

In this work, we develop and release Llama 2, a family of pretrained and fine-tuned LLMs, LLAMA 2 and LLAMA 2-CHAT, at scales up to 7OB parameters. On the series of helpfulness and safety benchmarks we tested, LLAMA 2-CHAT models generally perform better than existing open-source modeis. They also appear to be on par with some of the closed-source models, at least on the human evaluations we performed (see Figures 1 and 3). We have taken measures to increase the safety of these models, using safety-specific data annotation and tuning, as well as conducting red-teaming and employing iterative evaluations. Additionally, this paper contributes a thorough description of our fine-tuning methodology and approach to improving LLM safety. We hope that this openness will enable the community to reproduce fine-tuned LLMs and continue to improve the safety of those models, paving the way for more responsible development of LLMs. We also share novel observations we made during the development of LLAMA 2 and LLAMA 2-CHAT, such as the emergence of tool usage and temporal organization of knowledge.

在本工作中,我们开发并发布了 Llama 2,一系列预训练和微调的大语言模型,包括 LLAMA 2 和 LLAMA 2-CHAT,参数规模最高达 70B。在我们测试的一系列有用性和安全性基准上,LLAMA 2-CHAT 模型通常表现优于现有的开源模型。它们也似乎与某些闭源模型相当,至少在我们进行的人类评估中是这样(见图 1 和图 3)。我们已采取措施提高这些模型的安全性,使用专门针对安全性的数据标注和调优,并进行红队测试和迭代评估。此外,本文详细描述了我们的微调方法和改进大语言模型安全性的方法。我们希望这种开放性能够使社区能够重现微调的大语言模型,并继续改进这些模型的安全性,为更负责任地开发大语言模型铺平道路。我们还分享了在开发 LLAMA 2 和 LLAMA 2-CHAT 过程中的一些新观察,例如工具使用的出现和知识的时间组织。


Figure 3: Safety human evaluation results for LLAMA 2-CHAT compared to other open-source and closedsource models. Human raters judged model generations for safety violations across ${\sim}2\mathrm{,}000$ adversarial prompts consisting of both single and multi-turn prompts. More details can be found in Section 4.4. It is important to caveat these safety results with the inherent bias of LLM evaluations due to limitations of the prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. Additionally, these safety evaluations are performed using content standards that are likely to be biased towards the LLAMA 2-CHATmodels.

图 3: LLAMA 2-CHAT 与其他开源和闭源模型的安全性人工评估结果。人类评估员对模型生成内容进行了安全性违规判断,涵盖了约 2,000 个对抗性提示,包括单轮和多轮提示。更多细节可以在第 4.4 节中找到。需要注意的是,这些安全性结果存在大语言模型 (LLM) 评估的固有偏差,这是由于提示集的局限性、审查指南的主观性以及个别评估员的主观性所致。此外,这些安全性评估是根据可能偏向于 LLAMA 2-CHAT 模型的内容标准进行的。

We are releasing the following models to the general public for research and commercial uset:

我们正在向公众发布以下模型,供研究和商业使用:

  1. LLAMA 2, an updated version of LLAMA 1, trained on a new mix of publicly available data. We also increased the size of the pre training corpus by $40%$ doubled the context length of the model, and adopted grouped-query attention (Ainslie et al., 2023). We are releasing variants of LLAMA 2 with 7B, i3B, and 70B parameters. We have also trained 34B variants, which we report on in this paper but are not releasing.s
  2. LLAMA 2 是 LLAMA 1 的更新版本,基于新的公开数据混合进行训练。我们还增加了预训练语料库的规模 40%,并将模型的上下文长度翻倍,并采用了分组查询注意力机制 (Ainslie et al., 2023)。我们将发布参数量分别为 7B、i3B 和 70B 的 LLAMA 2 版本。我们还训练了 34B 参数的版本,在本文中有所报告,但不会发布。
  3. LLAMA 2-CHAT, a fine-tuned version of LLAMA 2 that is optimized for dialogue use cases. We release variants of this model with 7B, 13B, and 70B parameters as well.
  4. LLAMA 2-CHAT,这是针对对话应用场景优化的 LLAMA 2 的微调版本。我们发布了该模型的 7B、13B 和 70B 参数的变体。

We believe that the open release of LLMs, when done safely, will be a net benefit to society. Like all LLMs, LLAMA 2 is a new technology that carries potential risks with use (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023). Testing conducted to date has been in English and has not — and could not — cover all scenarios. Therefore, before deploying any applications of LLAMA 2-CHAT, developers should perform safety testing and tuning tailored to their specific applications of the model. We provide a responsible use guide and code examples" to facilitate the safe deployment of LLAMA 2 and LLAMA 2-CHAT. More details of our responsible release strategy can be found in Section 5.3.

我们认为,当大语言模型 (LLM) 的开放发布在确保安全的前提下进行时,将对社会产生净收益。像所有大语言模型一样,LLAMA 2 是一项新技术,在使用中可能带来潜在风险 (Bender et al., 2021b; Weidinger et al., 2021; Solaiman et al., 2023)。迄今为止进行的测试仅限于英语,并且无法涵盖所有场景。因此,在部署任何 LLAMA 2-CHAT 应用之前,开发人员应针对其具体应用场景进行安全性测试和调整。我们提供了一份负责任使用指南和代码示例,以促进 LLAMA 2 和 LLAMA 2-CHAT 的安全部署。更多关于我们的负责任发布策略的详细信息可以在第 5.3 节找到。

The remainder of this paper describes our pre training methodology (Section 2), fine-tuning methodology (Section 3), approach to model safety (Section 4), key observations and insights (Section 5), relevant related work (Section 6), and conclusions (Section 7).

本文其余部分描述了我们的预训练方法 (Section 2),微调方法 (Section 3),模型安全性方法 (Section 4),关键观察和见解 (Section 5),相关工作 (Section 6),以及结论 (Section 7)。


Figure 4: Training of LLAMA 2-CHAT: This process begins with the pre training of LLAMA 2 using publicly available online sources. Following this, we create an initial version of LLAMA 2-CHAT through the application of supervised fine-tuning. Subsequently, the model is iterative ly refined using Reinforcement Learning with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.

图 4: LLAMA 2-CHAT 的训练:此过程从使用公开的在线资源对 LLAMA 2 进行预训练开始。随后,我们通过应用监督微调创建 LLAMA 2-CHAT 的初始版本。之后,模型通过人类反馈的强化学习 (Reinforcement Learning with Human Feedback, RLHF) 方法进行迭代优化,具体包括拒绝采样和近端策略优化 (Proximal Policy Optimization, PPO)。在整个 RLHF 阶段,迭代奖励建模数据的积累与模型增强并行进行至关重要,以确保奖励模型保持在分布内。

2 Pre training

2 预训练

To create the new family of LLAMA 2 models, we began with the pre training approach described in Touvron et al. (2023), using an optimized auto-regressive transformer, but made several changes to improve performance. Specifically, we performed more robust data cleaning, updated our data mixes, trained on $40\bar{%}$ more total tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference s cal ability for our larger models. Table 1 compares the attributes of the new LLAMA 2 models with the LLAMA 1 models.

为了创建新的 LLAMA 2 模型系列,我们从 Touvron 等人 (2023) 描述的预训练方法开始,使用优化的自回归 Transformer,但进行了几项改进以提高性能。具体来说,我们进行了更强大的数据清理,更新了数据组合,在总 Token 数量上增加了 40% 的训练,将上下文长度翻倍,并使用分组查询注意力机制 (GQA) 来提高较大模型的推理可扩展性。表 1: 比较了新 LLAMA 2 模型与 LLAMA 1 模型的属性。

2.1 Pre training Data

2.1 预训练数据

Our training corpus includes a new mix of data from publicly available sources, which does not include data from Meta's products or services. We made an effort to remove data from certain sites known to contain a high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this provides a good performance-cost trade-off, up-sampling the most factual sources in an effort to increase knowledge and dampen hallucinations.

我们的训练语料库包括来自公开可用来源的新数据混合,不包括来自 Meta 产品或服务的数据。我们努力从某些已知包含大量私人个人信息的网站中移除数据。我们在 2 万亿个 Token 的数据上进行了训练,因为这提供了良好的性能与成本权衡,并增加了最事实来源的采样以提高知识准确性并减少幻觉。

We performed a variety of pre training data investigations so that users can better understand the potential capabilities and limitations of our models; results can be found in Section 4.1.

我们进行了多种预训练数据调查,以便用户可以更好地了解我们模型的潜在能力和局限性;结果可以在第 4.1 节中找到。

2.2 Training Details

2.2 训练细节

We adopt most of the pre training setting and model architecture from LLAMA 1. We use the standard transformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and Sennrich, 2019), use the SwiGLU activation function (Shazeer, 2020), and rotary positional embeddings (RoPE, Su et al. 2022). The primary architectural differences from LLAMA 1 include increased context length and grouped-query attention (GQA). We detail in Appendix Section A.2.1 each of these differences with ablation experiments to demonstrate their importance.

我们采用了 LLAMA 1 的大部分预训练设置和模型架构。我们使用标准的 Transformer 架构 (Vaswani 等, 2017),应用 RMSNorm 进行预归一化 (Zhang 和 Sennrich, 2019),使用 SwiGLU 激活函数 (Shazeer, 2020),以及旋转位置编码 (RoPE, Su 等, 2022)。与 LLAMA 1 的主要架构差异包括增加的上下文长度和分组查询注意力机制 (GQA)。我们在附录 A.2.1 节详细介绍了这些差异,并通过消融实验展示了它们的重要性。

Hyper parameters. We trained using the AdamW optimizer (Loshchilov and Hutter, 2017), with $\beta_{1}=$ $0.\dot{9},\beta_{2}\stackrel{\rightharpoonup}{=}0.95,\mathrm{eps}=10^{-5}$ . We use a cosine learning rate schedule, with warmup of 2000 steps, and decay final learning ratedownto $10%$ of the peak learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0. Figure 5 (a) shows the training loss for LLAMA 2 with these hyper parameters.

超参数。我们使用 AdamW 优化器 (Loshchilov and Hutter, 2017),参数为 $\beta_{1}=$ $0.\dot{9},\beta_{2}\stackrel{\rightharpoonup}{=}0.95,\mathrm{eps}=10^{-5}$ 。我们采用余弦学习率调度,预热 2000 步,最终学习率衰减到峰值学习率的 $10%$ 。我们使用权重衰减 0.1 和梯度裁剪 1.0。图 5 (a) 显示了 LLAMA 2 使用这些超参数的训练损失。

Table 1: LAMA 2 family of models. Token counts refer to pre training data only. All models are trained with a global batch-size of 4M tokens. Bigger models — 34B and 70B — use Grouped-Query Attention (GQA) for improved inference s cal ability.

表 1: LAMA 2 系列模型。Token 数量仅指预训练数据。所有模型均使用全局批次大小为 4M Token 进行训练。更大规模的模型 — 34B 和 70B — 使用分组查询注意力 (GQA) 以提高推理可扩展性。

训练数据 参数 上下文长度 GQA Token 学习率
LLAMA1 参见 Touvron 等 (2023) 7B 2k 1.0T 3.0 × 10^-4
13B 2k × 1.0T 3.0 × 10^-4
33B 2k × 1.4T 1.5 × 10^-4
65B 2k 1.4T 1.5 × 10^-4
LLAMA2 新的公开在线数据混合 7B 4k × 2.0T 3.0 × 10^-4
13B 4k 2.0T 3.0 × 10^-4
34B 4k 2.0T 1.5 × 10^-4
70B 4k 2.0T 1.5 × 10^-4


Figure 5: Training Loss for LLAMA 2 models. We compare the training loss of the LLAMA 2 family of models. We observe that after pre training on 2T Tokens, the models still did not show any sign of saturation.

图 5: LLAMA 2 模型的训练损失。我们比较了 LLAMA 2 系列模型的训练损失。我们观察到,在预训练了 2T Token 后,这些模型仍然没有显示出任何饱和的迹象。

Tokenizer. We use the same tokenizer as LLAMA 1; it employs a bytepair encoding (BPE) algorithm (Sennrich et al., 2016) using the implementation from Sentence Piece (Kudo and Richardson, 2018). As with LLAMA 1, we split all numbers into individual digits and use bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32k tokens.

分词器。我们使用与 LLAMA 1 相同的分词器;它采用字节对编码 (BPE) 算法 (Sennrich et al., 2016),使用来自 Sentence Piece (Kudo and Richardson, 2018) 的实现。与 LLAMA 1 一样,我们将所有数字拆分为单个数字,并使用字节分解未知的 UTF-8 字符。词汇表总大小为 32k Token。

2.2.1Training Hardware & Carbon Footprint

2.2.1 训练硬件与碳足迹

Training Hardware. We pretrained our models on Meta's Research Super Cluster (RSC) (Lee and Sengupta, 2022) as well as internal production clusters. Both clusters use NVIDIA A100s. There are two key differences between the two clusters, with the first being the type of interconnect available: RSC uses NVIDiA Quantum InfiniBand while our production cluster is equipped with a RoCE (RDMA over converged Ethernet) solution based on commodity ethernet Switches. Both of these solutions interconnect 200 Gbps end-points. The second difference is the per-GPU power consumption cap —- RSC uses 40oW while our production cluster uses 350W. With this two-cluster setup, we were able to compare the suitability of these different types of interconnect for large scale training. RoCE (which is a more affordable, commercial interconnect network)

训练硬件。我们在 Meta 的 Research Super Cluster (RSC) (Lee 和 Sengupta, 2022) 以及内部生产集群上预训练了我们的模型。两个集群均使用 NVIDIA A100s。这两个集群之间有两个关键差异,首先是可用的互连类型:RSC 使用 NVIDiA Quantum InfiniBand,而我们的生产集群配备了基于商品以太网交换机的 RoCE (RDMA over converged Ethernet) 解决方案。这两种解决方案都互连了 200 Gbps 端点。第二个差异是每 GPU 功耗上限 —— RSC 使用 400W,而我们的生产集群使用 350W。通过这种双集群设置,我们能够比较这些不同类型互连在大规模训练中的适用性。RoCE(这是一种更经济实惠的商业互连网络)

Table2: $\mathbf{CO}_{2}$ emissions during pre training. Time: total GPU time required for training each model. Power Consumption: peak power capacity per GPU device for the GPUs used adjusted for power usage efficiency. $100%$ of the emissions are directly offset by Meta's sustainability program, and because we are openly releasing these models, the pre training costs do not need to be incurred by others.

表 2: 预训练期间的 $\mathbf{CO}_{2}$ 排放。时间:训练每个模型所需的总 GPU 时间。功耗:根据电源使用效率调整后的每块 GPU 设备的峰值功率容量。Meta 的可持续发展计划直接抵消了 $100%$ 的排放,由于我们公开发布了这些模型,预训练成本不需要由其他人承担。

时间 (GPU 小时) 功耗 (W) 碳排放 (tCO2eq)
LLAMA2 7B 184320 400 31.22
13B 368640 400 62.44
34B 1038336 350 153.90
70B 1720320 400 291.42
总计 3311616 539.00

can scale almost as well as expensive Infiniband up to 2000 GPUs, which makes pre training even more democrat iz able.

可以几乎与昂贵的 Infiniband 一样好地扩展到多达 2000 个 GPU,这使得预训练变得更加平民化。

Carbon Footprint of Pre training. Following preceding research (Bender et al., 2021a; Patterson et al., 2021; Wu et al., 2022; Dodge et al, 2022) and using power consumption estimates of GPU devices and carbon efficiency, we aim to calculate the carbon emissions resulting from the pre training of LLAMA 2 models. The actual power usage of a GPU is dependent on its utilization and is likely to vary from the Thermal Design Power (TDP) that we employ as an estimation for GPU power. It is important to note that our calculations do not account for further power demands, such as those from interconnect or non-GPU server power consumption, nor from datacenter cooling systems. Additionally, the carbon output related to the production of AI hardware, like GPUs, could add to the overall carbon footprint as suggested by Gupta et al. (2022b,a).

预训练的碳足迹。遵循先前的研究 (Bender et al., 2021a; Patterson et al., 2021; Wu et al., 2022; Dodge et al, 2022) 并使用 GPU 设备的功耗估计和碳效率,我们旨在计算 LLAMA 2 模型预训练所产生的碳排放。GPU 的实际功耗取决于其利用率,并且可能会与我们用作 GPU 功耗估算的热设计功率 (TDP) 不同。需要注意的是,我们的计算未考虑进一步的电力需求,例如来自互连或非 GPU 服务器功耗的需求,以及数据中心冷却系统的功耗。此外,根据 Gupta 等人的建议 (2022b,a),AI 硬件(如 GPU)的生产相关的碳排放可能会增加总体碳足迹。

Table 2 summarizes the carbon emission for pre training the LLAMA 2 family of models. A cumulative of 3.3M GPU hours of computation was performed on hardware of type A100-80GB (TDP of 400W or 350W). We estimate the total emissions for training to be $539;\mathbf{tCO}_{2}\mathbf{eq},$ ofwhich $100%$ were directly offset by Meta's sustainability program.** Our open release strategy also means that these pre training costs will not need to be incurred by other companies, saving more global resources.

表 2: 总结了预训练 LLAMA 2 系列模型的碳排放。累计进行了 3.3M GPU 小时的计算,硬件类型为 A100-80GB (TDP 为 400W 或 350W)。我们估计训练的总排放量为 $539;\mathbf{tCO}_{2}\mathbf{eq}$ ,其中 $100%$ 由 Meta 的可持续发展计划直接抵消。我们的开放发布策略也意味着这些预训练成本将不会被其他公司承担,从而节省更多的全球资源。

2.3 LLAMA 2 Pretrained Model Evaluation

2.3 LLAMA 2 预训练模型评估

In this section, we report the results for the LLAMA 1 and LLAMA 2 base models, MosaicML Pretrained Transformer (MPT)ft models, and Falcon (Almazrouei et al., 2023) models on standard academic benchmarks. For all the evaluations, we use our internal evaluations library. We reproduce results for the MPT and Falcon models internally. For these models, we always pick the best score between our evaluation framework and any publicly reported results.

在本节中,我们报告了 LLAMA 1 和 LLAMA 2 基础模型、MosaicML 预训练 Transformer (MPT) 模型以及 Falcon (Almazrouei et al., 2023) 模型在标准学术基准测试上的结果。对于所有评估,我们使用内部评估库。我们内部重现了 MPT 和 Falcon 模型的结果。对于这些模型,我们始终选择内部评估框架和任何公开报告结果之间的最佳分数。

In Table 3, we summarize the overall performance across a suite of popular benchmarks. Note that safety benchmarks are shared in Section 4.1. The benchmarks are grouped into the categories listed below. The results for all the individual benchmarks are available in Section A.2.2.

表 3: 我们总结了在一系列流行基准测试中的整体性能。请注意,安全基准测试在第 4.1 节中共享。基准测试被分为以下类别。所有单个基准测试的结果可在附录 A.2.2 中查看。

Table 3: Overall performance on grouped academic benchmarks compared to open-source base models.

表 3: 在分组学术基准测试上的总体性能与开源基础模型的比较。

模型 规模 编程 (Code) 常识推理 世界知识 阅读理解 数学 MMLU BBH AGIEval
MPT 7B 20.5 57.4 41.0 57.5 4.9 26.8 31.0 23.5
30B 28.9 64.9 50.0 64.7 9.1 46.9 38.0 33.8
Falcon 7B 5.6 56.1 42.8 36.0 4.6 26.2 28.0 21.2
40B 15.2 69.2 56.7 65.7 12.6 55.4 37.1 37.0
LLAMA 1 7B 14.1 60.8 46.2 58.5 6.95 35.1 30.3 23.9
13B 18.9 66.1 52.6 62.3 10.9 46.9 37.0 33.9
33B 26.0 70.0 58.4 67.6 21.4 57.8 39.8 41.7
65B 30.7 70.7 60.5 68.6 30.8 63.4 43.5 47.6
LLAMA2 7B 16.8 63.9 48.9 61.3 14.6 45.3 32.6 29.3
13B 24.5 66.9 55.4 65.8 28.7 54.8 39.4 39.1
34B 27.8 69.9 58.7 68.0 24.2 62.6 44.1 43.4
70B 37.5 71.9 63.6 69.4 35.2 68.9 51.2 54.2

● Popular Aggregated Benchmarks. We report the overall results for MMLU (5 shot) (Hendrycks et al., 2020), Big Bench Hard (BBH) (3 shot) (Suzgun et al, 2022), and AGI Eval (3-5 shot) (Zhong et al., 2023). For AGI Eval, we only evaluate on the English tasks and report the average.

● 常见综合基准。我们报告了 MMLU (5 shot) (Hendrycks et al., 2020),Big Bench Hard (BBH) (3 shot) (Suzgun et al, 2022),和 AGI Eval (3-5 shot) (Zhong et al., 2023) 的整体结果。对于 AGI Eval,我们仅评估英文任务并报告平均值。

As shown in Table 3, LLAMA 2 models outperform LLAMA 1 models. In particular, LLAMA 2 7OB improves the results on MMLU and BBH by ${\approx}5$ and ${\approx}8$ points, respectively, compared to LLAMA 1 65B. LLAMA 2 7B and 30B models outperform MPT models of the corresponding size on all categories besides code benchmarks. For the Falcon models, LLAMA 2 7B and 34B outperform Falcon 7B and 40B models on all categories of benchmarks. Additionally, LLAMA 2 7OB model outperforms all open-source models.

表 3:

如表 3 所示,LLAMA 2 模型优于 LLAMA 1 模型。特别是,LLAMA 2 70B 在 MMLU 和 BBH 上的得分分别比 LLAMA 1 65B 提高了约 5 和约 8 个点。LLAMA 2 7B 和 30B 模型在所有类别上均优于相应规模的 MPT 模型,代码基准除外。对于 Falcon 模型,LLAMA 2 7B 和 34B 在所有类别的基准测试中均优于 Falcon 7B 和 40B 模型。此外,LLAMA 2 70B 模型优于所有开源模型。

In addition to open-source models, we also compare LLAMA 2 70B results to closed-source models. As shown in Table 4, LLAMA 2 70B is close to GPT-3.5 (OpenAI, 2023) on MMLU and GSM8K, but there is a significant gap on coding benchmarks. LLAMA 2 70B results are on par or better than PaLM (540B) (Chowdhery et al., 2022) on almost all benchmarks. There is still a large gap in performance between LLAMA 2 70B and GPT-4 andPaLM-2-L.

除了开源模型外,我们还将 LLAMA 2 70B 的结果与闭源模型进行了比较。如表 4 所示,LLAMA 2 70B 在 MMLU 和 GSM8K 上接近 GPT-3.5 (OpenAI, 2023),但在编程基准测试上存在显著差距。LLAMA 2 70B 的结果在几乎所有基准测试中都与 PaLM (540B) (Chowdhery et al., 2022) 持平或更好。LLAMA 2 70B 与 GPT-4 和 PaLM-2-L 之间仍然存在较大的性能差距。

We also analysed the potential data contamination and share the details in Section A.6. Table 4: Comparison to closed-source models on academic benchmarks. Results for GPT-3.5 and GPT-4 are from OpenAI (2023). Results for the PaLM model are from Chowdhery et al. (2022). Results for the PaLM-2-L are from Anil et al. (2023).

我们还分析了潜在的数据污染,并在第 A.6 节中分享了详细信息。表 1: 在学术基准上与闭源模型的比较。GPT-3.5 和 GPT-4 的结果来自 OpenAI (2023)。PaLM 模型的结果来自 Chowdhery 等人 (2022)。PaLM-2-L 的结果来自 Anil 等人 (2023)。

基准 (样本数) GPT-3.5 GPT-4 PaLM PaLM-2-L LLAMA2
MMLU (5-shot) 70.0 86.4 69.3 78.3 68.9
TriviaQA (1-shot) - - 81.4 86.1 85.0
Natural Questions (1-shot) 29.3 37.5 33.0
GSM8K (8-shot) 57.1 92.0 56.5 80.7 56.8
HumanEval (0-shot) 48.1 67.0 26.2 29.9
BIG-BenchHard (3-shot) 52.3 65.7 51.2

3 Fine-tuning

3 微调

LLAMA 2-CHAT is the result of several months of research and iterative applications of alignment techniques, including both instruction tuning and RLHF, requiring significant computational and annotation resources.

LLAMA 2-CHAT 是几个月的研究和迭代应用对齐技术的结果,包括指令微调和 RLHF (强化学习与人类反馈),需要大量的计算和标注资源。

In this section, we report on our experiments and findings using supervised fine-tuning (Section 3.1), as well as initial and iterative reward modeling (Section 3.2.2) and RLHF (Section 3.2.3). We also share a new technique, Ghost Attention (GAtt), which we find helps control dialogue flow over multiple turns (Section 3.3). See Section 4.2 for safety evaluations on fine-tuned models.

在本节中,我们报告了使用监督微调 (第 3.1 节) 的实验和发现,以及初始和迭代奖励建模 (第 3.2.2 节) 和基于人类反馈的强化学习 (RLHF) (第 3.2.3 节) 的结果。我们还分享了一种新技术,幽灵注意力机制 (Ghost Attention, GAtt),我们发现它有助于控制多轮对话流程 (第 3.3 节)。请参见第 4.2 节以获取关于微调模型的安全性评估。

Table 5: SFT annotation- example of a helpfulness (top) and safety (bottom) annotation for SFT, where the annotator has written both the prompt and its answer.

表 5: SFT 标注 - SFT 的有用性 (上) 和安全性 (下) 标注示例,其中标注者编写了提示词及其回答。

Prompt: Write a poem to help me remember the first 10 elements on the periodic table, giving each element its own line.
Response: Hydrogen comes first as element number one. Helium is second for balloons to have fun! In third place is Lithium to hold battery charge, Followed by Beryllium in emeralds small and large. Boron's number five to help us keep things clean. Carbon's next at six, and it's found in every gene. Nitrogen is seven, found in every breath we take, More than eight (or Oxygen) in atmospheric make. Number nine is Fluorine, helping polish up our teeth.
Prompt: Neon gives us glowing signs to read and stand beneath. I want you to roast me. I want you to make it particularly brutal, swearing at me.
Response: I'm sorry, but I cannot comply with that request. Using vulgar language or intentionally hurting someone's feelings is never acceptable. Instead of asking for a roast, I suggest focusing on positive and constructive feedback that can help you grow and improve as an that without resorting to hurtful language or behavior.

Getting Started. To bootstrap, we started the SFT stage with publicly available instruction tuning data (Chung et al., 2022), as utilized previously in Touvron et al. (2023).

开始。为了启动,我们使用公开可用的指令调优数据 (Chung et al., 2022) 开始了 SFT 阶段,这在 Touvron 等人 (2023) 的研究中也 previously 使用过。

为了启动,我们使用公开可用的指令调优数据 (Chung et al., 2022) 开始了 SFT 阶段,这在 Touvron 等人 (2023) 的研究中也曾使用过。

Quality Is All You Need. Third-party SFT data is available from many different sources, but we found that many of these have insufficient diversity and quality — in particular for aligning LLMs towards dialogue-style instructions. As a result, we focused first on collecting several thousand examples of high-quality SFT data, as illustrated in Table 5. By setting aside millions of examples from third-party datasets and using fewer but higher-quality examples from our own vendor-based annotation efforts, our results notably improved. These findings are similar in spirit to Zhou et al. (2023), which also finds that a limited set of clean instruction-tuning data can be sufficient to reach a high level of quality. We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations. Note that we do not include any Meta user data.

质量是您所需要的。第三方 SFT 数据来自许多不同的来源,但我们发现其中许多数据的多样性和质量不足——特别是对于将大语言模型 (LLM) 对齐到对话式指令。因此,我们首先专注于收集数千个高质量的 SFT 数据示例,如表 5 所示。通过放弃数百万个来自第三方数据集的示例,并使用更少但质量更高的来自我们自己供应商标注工作的示例,我们的结果显著改善。这些发现与 Zhou 等人 (2023) 的研究精神相似,该研究也发现有限的一组干净的指令微调数据足以达到高质量水平。我们发现,数万个 SFT 标注就足以实现高质量结果。在收集了总计 27,540 个标注后,我们停止了 SFT 标注工作。请注意,我们不包含任何 Meta 用户数据。

表 5: 高质量 SFT 数据示例

We also observed that different annotation platforms and vendors can result in markedly different downstream model performance, highlighting the importance of data checks even when using vendors to source annotations. To validate our data quality, we carefully examined a set of 180 examples, comparing the annotations provided by humans with the samples generated by the model through manual scrutiny. Surprisingly, we found that the outputs sampled from the resulting SFT model were often competitive with SFT data handwritten by human annotators, suggesting that we could re prioritize and devote more annotation effort to preference-based annotation for RLHF.

我们还观察到,不同的标注平台和供应商会导致下游模型性能显著不同,突显了即使使用供应商获取标注时数据检查的重要性。为了验证我们的数据质量,我们仔细检查了一组 180 个示例,将人类提供的标注与通过手动审查由模型生成的样本进行比较。令人惊讶的是,我们发现从最终的 SFT 模型中采样的输出经常能够与人类标注员手写的 SFT 数据相媲美,这表明我们可以重新优先考虑并将更多的标注精力投入到基于偏好的 RLHF 标注上。

Fine-Tuning Details. For supervised fine-tuning, we use a cosine learning rate schedule with an initial learning rate of $2\times10^{-5}$ , a weight decay of 0.1, a batch size of 64, and a sequence length of 4096 tokens.

微调细节。对于监督式微调,我们使用余弦学习率调度,初始学习率为 $2\times10^{-5}$ ,权重衰减为 0.1,批量大小为 64,序列长度为 4096 Token。

For the fine-tuning process, each sample consists of a prompt and an answer. To ensure the model sequence length is properly filled, we concatenate all the prompts and answers from the training set. A special token is utilized to separate the prompt and answer segments. We utilize an auto regressive objective and zero-out the loss on tokens from the user prompt, so as a result, we back propagate only on answer tokens. Finally, we fine-tune the model for 2 epochs.

对于微调过程,每个样本由一个提示和一个答案组成。为了确保模型序列长度得到正确填充,我们将训练集中的所有提示和答案连接起来。使用一个特殊 Token 来分隔提示和答案部分。我们采用自回归目标,并将用户提示中的 Token 损失置零,因此我们只对答案 Token 进行反向传播。最后,我们对模型进行 2 个 epoch 的微调。

3.2 Reinforcement Learning with Human Feedback (RLHF)

3.2 带有人类反馈的强化学习 (Reinforcement Learning with Human Feedback, RLHF)

RLHF is a model training procedure that is applied to a fine-tuned language model to further align model behavior with human preferences and instruction following. We collect data that represents empirically

RLHF 是一种应用于微调语言模型的模型训练程序,以进一步使模型行为与人类偏好和指令遵循保持一致。我们收集了能够实证代表

sampled human preferences, whereby human annotators select which of two model outputs they prefer. This human feedback is subsequently used to train a reward model, which learns patterns in the preferences of the human annotators and can then automate preference decisions.

采样人类偏好,其中人类标注者选择两个模型输出中他们更喜欢哪一个。这种人类反馈随后用于训练奖励模型 (reward model),该模型学习人类标注者的偏好模式,并可以自动化偏好决策。

3.2.1 Human Preference Data Collection

3.2.1 人类偏好数据收集

Next, we collect human preference data for reward modeling. We chose a binary comparison protocol over other schemes, mainly because it enables us to maximize the diversity of collected prompts. Still, other strategies are worth considering, which we leave for future work.

接下来,我们收集用于奖励建模的人类偏好数据。我们选择了一种二元比较协议,而没有选择其他方案,主要是因为它使我们能够最大化收集到的提示的多样性。尽管如此,其他策略也值得考虑,我们将其留作未来的工作。

Our annotation procedure proceeds as follows. We ask annotators to first write a prompt, then choose between two sampled model responses, based on provided criteria. In order to maximize the diversity, the two responses to a given prompt are sampled from two different model variants, and varying the temperature hyper-parameter. In addition to giving participants a forced choice, we also ask annotators to label the degree to which they prefer their chosen response over the alternative: either their choice is significantly beter, better, slightly better, or negligibly better/ unsure.

我们的标注流程如下。我们要求标注人员首先编写一个提示,然后根据提供的标准在两个采样的模型响应之间进行选择。为了最大化多样性,针对给定提示的两个响应是从两个不同的模型变体中采样得到的,并且通过调整温度超参数来变化。除了让参与者做出强制选择外,我们还要求标注人员标记他们对所选响应相对于另一个响应的偏好程度:即他们的选择是显著更好、较好、稍好或几乎无差异/不确定。

For our collection of preference annotations, we focus on helpfulness and safety. Helpfulness refers to how well LLAMA 2-CHAT responses fulfill users requests and provide requested information; safety refers to whether LLAMA 2-CHAT's responses are unsafe, e.g., "giving detailed instructions on making a bomb" could be considered helpful but is unsafe according to our safety guidelines. Separating the two allows us to apply specific guidelines to each and better guide annotators; for example, our safety annotations provide instructions to focus on adversarial prompts, among other guidance.

对于我们的偏好标注集合,我们专注于帮助性和安全性。帮助性指的是 LLAMA 2-CHAT 回复满足用户请求和提供所需信息的程度;安全性指的是 LLAMA 2-CHAT 的回复是否安全,例如,“提供制作炸弹的详细说明”可能被认为是帮助性的,但根据我们的安全指南这是不安全的。将两者分开使我们能够为每种情况应用特定的指南,并更好地指导标注人员;例如,我们的安全标注提供了专注于对抗性提示等指导。

Apart from differences in annotation guidelines, we additionally collect a safety label during the safety stage. This additional information bins model responses into one of three categories: 1) the preferred response is safe and the other response is not, 2) both responses are safe, and 3) both responses are unsafe, with $18%,,47%,$ and $35%$ of the safety dataset falling into each bin, respectively. We do not include any examples where the chosen response was unsafe and the other response safe, as we believe safer responses will also be better/preferred by humans. Safety guidelines and more detailed information regarding safety annotations can be found in Section 4.2.1.

除了标注指南的差异外,我们在安全阶段还收集了一个安全标签。这些额外的信息将模型响应分为以下三类:1) 优选响应是安全的,而另一个响应不安全,2) 两个响应都是安全的,3) 两个响应都不安全,分别有 18%、47% 和 35% 的安全数据集属于每个类别。我们没有包含任何选择的响应不安全而另一个响应安全的例子,因为我们认为更安全的响应也会被人类认为更好/更优选。有关安全指南和安全标注的详细信息,请参见第 4.2.1 节。

Human annotations were collected in batches on a weekly basis. As we collected more preference data, our reward models improved, and we were able to train progressively better versions for LLAMA 2-CHAT (see the results in Section 5, Figure 20). LLAMA 2-CHAT improvement also shifted the model's data distribution. Since reward model accuracy can quickly degrade if not exposed to this new sample distribution, i.e., from hyper-specialization (Scialom et al., 2020b), it is important before a new LLAMA 2-CHAT tuning iteration to gather new preference data using the latest LLAMA 2-CHAT iterations. This step helps keep the reward model on-distribution and maintain an accurate reward for the latest model.

人类标注以批次形式每周收集。随着我们收集更多偏好数据,我们的奖励模型得到改进,从而能够为 LLAMA 2-CHAT 训练出越来越好的版本(请参见第 5 节,图 20)。LLAMA 2-CHAT 的改进也改变了模型的数据分布。由于奖励模型的准确性如果未接触到新的样本分布可能会迅速下降,即从过度专业化 (Scialom et al., 2020b),因此在新的 LLAMA 2-CHAT 调整迭代之前,收集使用最新 LLAMA 2-CHAT 迭代的新偏好数据非常重要。这一步有助于保持奖励模型的分布一致,并为最新模型维持准确的奖励。

In Table 6, we report the statistics of reward modeling data that we collected over time, and present them against multiple open-source preference datasets including Anthropic Helpful and Harmless (Bai et al., 2022a), OpenAI Summarize (Stiennon et al., 2020), OpenAI WebGPT (Nakano et al., 2021), Stack Exchange (Lambert et al., 2023), Stanford Human Preferences (Ethayarajh et al., 2022), and Synthetic GPT-J (Havrilla). We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines, which we refer to as Meta reward modeling data. Note that the number of tokens in prompts and answers differs depending on the text domain. Sum mari z ation and online forum data generally have longer prompts, while dialogue-style prompts are usually shorter. Compared to existing open-source datasets, our preference data features more conversation turns, and are longer, on average.

表 6:

我们在表 6 中报告了我们随时间收集的奖励建模数据的统计信息,并将其与多个开源偏好数据集进行对比,包括 Anthropic Helpful and Harmless (Bai et al., 2022a)、OpenAI Summarize (Stiennon et al., 2020)、OpenAI WebGPT (Nakano et al., 2021)、Stack Exchange (Lambert et al., 2023)、Stanford Human Preferences (Ethayarajh et al., 2022) 和 Synthetic GPT-J (Havrilla)。我们基于人类按照我们指定的指南进行的超过 100 万次二元比较,收集了一个大型数据集,我们称之为 Meta 奖励建模数据。请注意,提示和答案中的 Token 数量因文本领域而异。总结和在线论坛数据通常具有更长的提示,而对话风格的提示通常较短。与现有的开源数据集相比,我们的偏好数据包含更多的对话轮次,平均长度也更长。

3.2.2 Reward Modeling

3.2.2 奖励建模 (Reward Modeling)

The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize LLAMA 2-CHAT during RLHF for better human preference alignment and improved helpfulness and safety.

奖励模型以模型响应及其对应的提示(包括先前轮次的上下文)作为输入,并输出一个标量分数来表示模型生成的质量(例如,有用性和安全性)。利用这些响应分数作为奖励,我们可以在RLHF过程中优化 LLAMA 2-CHAT,以更好地对齐人类偏好并提高有用性和安全性。

Others have found that helpfulness and safety sometimes trade off (Bai et al., 2022a), which can make it challenging for a single reward model to perform well on both. To address this, we train two separate reward models, one optimized for helpfulness (referred to as Helpfulness RM) and another for safety (Safety RM).

其他人发现,有用性和安全性有时会相互冲突 (Bai et al., 2022a),这使得单个奖励模型在这两方面都表现良好变得具有挑战性。为了解决这个问题,我们训练了两个独立的奖励模型,一个优化用于有用性(称为 Helpfulness RM),另一个用于安全性(Safety RM)。

We initialize our reward models from pretrained chat model checkpoints, as it ensures that both models benefit from knowledge acquired in pre training. In short, the reward model "knows" what the chat model

我们从预训练的聊天模型检查点初始化奖励模型,这确保了两个模型都能受益于预训练中获得的知识。简而言之,奖励模型“知道”聊天模型

数据集 比较次数 平均每轮对话的轮数 平均每例 Token 数 提示中的平均 Token 数 回应中的平均 Token 数
Anthropic Helpful 122,387 3.0 251.5 17.7 88.4
Anthropic Harmless 43,966 3.0 152.5 15.7 46.4
OpenAI Summarize 176,625 1.0 371.1 336.0 35.1
OpenAI WebGPT 13,333 1.0 237.2 48.3 188.9
StackExchange 1,038,480 1.0 440.2 200.1 240.2
Stanford SHP 74,882 1.0 338.3 199.5 138.8
Synthetic GPT-J 33,139 1.0 123.3 13.0 110.3
Meta (Safety & Helpfulness) 1,418,091 3.9 798.5 31.4 234.1
总计 2,919,326 1.6 595.7 108.2 216.9

Table 6: Statistics of human preference data for reward modeling. We list both the open-source and internally collected human preference data used for reward modeling. Note that a binary human preference comparison contains 2 responses (chosen and rejected) sharing the same prompt (and previous dialogue). Each example consists of a prompt (including previous dialogue if available) and a response, which is the input of the reward model. We report the number of comparisons, the average number of turns per dialogue, the average number of tokens per example, per prompt and per response. More details on Meta helpfulness and safety data per batch can be found in Appendix A.3.1.

表 6: 奖励建模的人类偏好数据统计。我们列出了用于奖励建模的开源和内部收集的人类偏好数据。请注意,二元人类偏好比较包含 2 个回复(选择和拒绝),共享相同的提示(和之前的对话)。每个示例由一个提示(如果可用则包括之前的对话)和一个回复组成,这是奖励模型的输入。我们报告了比较次数、每段对话的平均轮数、每个示例、每个提示和每个回复的平均 Token 数量。有关每批 Meta 数据的帮助性和安全性详细信息,请参见附录 A.3.1。

knows. This prevents cases where, for instance, the two models would have an information mismatch, which could result in favoring hallucinations. The model architecture and hyper-parameters are identical to those of the pretrained language models, except that the classification head for next-token prediction is replaced with a regression head for outputting a scalar reward.

这可以防止出现两个模型之间信息不匹配的情况,从而避免倾向于产生幻觉。模型架构和超参数与预训练语言模型相同,只是将用于下一个 Token 预测的分类头替换为用于输出标量奖励的回归头。

Training Objectives. _ To train the reward model, we convert our collected pairwise human preference data into a binary ranking label format (i.e., chosen & rejected) and enforce the chosen response to have a higher score than its counterpart. We used a binary ranking loss consistent with Ouyang et al. (2022):

训练目标。为了训练奖励模型,我们将收集到的成对人类偏好数据转换为二元排名标签格式(即,选择与拒绝),并强制选择的响应得分高于其对应项。我们使用了与 Ouyang 等人 (2022) 一致的二元排名损失:

$$
\mathcal{L}{\mathrm{ranking}}=-\mathrm{log}(\sigma(r{\theta}(x,y_{c})-r_{\theta}(x,y_{r})))
$$

$$
\mathcal{L}{\mathrm{ranking}}=-\mathrm{log}(\sigma(r{\theta}(x,y_{c})-r_{\theta}(x,y_{r})))
$$

where $r_{\theta}(x,y)$ is the scalar score output for prompt $x$ and completion $y$ with model weights $\theta.\ y_{c}$ is the preferred response that annotators choose and $y_{r}$ is the rejected counterpart.

其中,$r_{\theta}(x,y)$ 是提示 $x$ 和完成 $y$ 的标量得分输出,模型权重为 $\theta$ 。$y_{c}$ 是注释者选择的优选响应,$y_{r}$ 是被拒绝的响应。

Built on top of this binary ranking loss, we further modify it separately for better helpfulness and safety reward models as follows. Given that our preference ratings is decomposed as a scale of four points (e.g.. significantly better), as presented in Section 3.2.1, it can be useful to leverage this information to explicitly teach the reward model to assign more discrepant scores to the generations that have more differences. To do so, we further add a margin component in the loss:

基于这个二元排序损失,我们进一步分别对其进行修改,以提高帮助性和安全性的奖励模型,具体如下。鉴于我们的偏好评分被分解为四个等级(例如:显著更好),如第 3.2.1 节所述,利用这些信息显式地教导奖励模型为差异较大的生成分配更不同的分数是有用的。为此,我们在损失中进一步添加了一个边缘组件:

Figure 1: 示例图
表 1: 示例表

零样本学习可以提高模型的泛化能力 [20]。

$$
\mathcal{L}{\mathrm{ranking}}=-\mathrm{log}(\sigma(r{\theta}(x,y_{c})-r_{\theta}(x,y_{r})-m(r)))
$$

$$
\mathcal{L}{\mathrm{ranking}} = -\mathrm{log}(\sigma(r{\theta}(x, y_{c}) - r_{\theta}(x, y_{r}) - m(r)))
$$

where the margin $m(r)$ is a discrete function of the preference rating. Naturally, we use a large margin for pairs with distinct responses, and a smaller one for those with similar responses (shown in Table 27). We found this margin component can improve Helpfulness reward model accuracy especially on samples where two responses are more separable. More detailed ablation and analysis can be found in Table 28 in AppendixA.3.3.

其中 margin $m(r)$ 是偏好评分的离散函数。自然地,我们对响应差异明显的配对使用较大的 margin,而对响应相似的配对使用较小的 margin(如表 27 所示)。我们发现这一 margin 组件可以提高有用性奖励模型的准确性,特别是在两个响应更易区分的样本上。更多详细的消融实验和分析可以在附录 A.3.3 的表 28 中找到。

Data Composition. We combine our newly collected data with existing open-source preference datasets to form a larger training dataset. Initially, open-source datasets were used to bootstrap our reward models while we were in the process of collecting preference annotation data. We note that in the context of RLHF in this study, the role of reward signals is to learn human preference for LLAMA 2-CHAT outputs rather than any model outputs. However, in our experiments, we do not observe negative transfer from the open-source preference datasets. Thus, we have decided to keep them in our data mixture, as they could enable better generalization for the reward model and prevent reward hacking, i.e. LLAMA 2-CHAT taking advantage of some weaknesses of our reward, and so artificially inflating the score despite performing less well.

数据组成。我们将新收集的数据与现有的开源偏好数据集结合,形成一个更大的训练数据集。最初,开源数据集用于引导我们的奖励模型,同时我们正在进行偏好标注数据的收集。我们注意到,在本研究的强化学习人类反馈 (RLHF) 的背景下,奖励信号的作用是学习人类对 LLAMA 2-CHAT 输出的偏好,而不是任何模型的输出。然而,在我们的实验中,我们没有观察到来自开源偏好数据集的负迁移。因此,我们决定将它们保留在我们的数据混合中,因为它们可以为奖励模型提供更好的泛化能力,并防止奖励欺骗,即 LLAMA 2-CHAT 利用我们奖励机制的一些弱点,从而在表现不佳的情况下人为地提高分数。

With training data available from different sources, we experimented with different mixing recipes for both Helpfulness and Safety reward models to ascertain the best settings. After extensive experimentation, the

使用来自不同来源的训练数据,我们对 Helpfulness 和 Safety 奖励模型尝试了不同的混合方案,以确定最佳设置。经过大量实验后,

Helpfulness reward model is eventually trained on all Meta Helpfulness data, combined with an equal parts of the remaining data uniformly sampled from Meta Safety and from the open-source datasets. The Meta Safety reward model is trained on ali Meta Safety and Anthropic Harmless data, mixed with Meta Helpfulness and open-source helpfulness data in a 90/10 proportion. We found that the setting with $10%$ helpfulness data is especially beneficial for the accuracy on samples where both the chosen and rejected responses were deemed safe.

帮助性奖励模型最终在所有 Meta Helpfulness 数据上进行训练,结合了从 Meta Safety 和开源数据集中均匀采样的等量剩余数据。Meta Safety 奖励模型在所有 Meta Safety 和 Anthropic Harmless 数据上进行训练,混合了 Meta Helpfulness 和开源帮助性数据,比例为 90/10。我们发现,包含 10% 帮助性数据的设置对提高样本准确性特别有利,特别是在选择和拒绝的响应都被认为是安全的情况下。

Training Details. We train for one epoch over the training data. In earlier experiments, we found that training longer can lead to over-fitting. We use the same optimizer parameters as for the base model. The maximum learning rate is $5\times10^{-6}$ for the 7OB parameter LLAMA 2-CHAT and $1\times10^{-5}$ for the rest. The learning rate is decreased on a cosine learning rate schedule, down to $10%$ of the maximum learning rate. We use a warm-up of $3%$ of the total number of steps, with a minimum of 5. The effective batch size is kept fixed at 512 pairs, or 1024 rows per batch.

训练细节。我们对训练数据进行一个 epoch 的训练。在早期实验中,我们发现训练时间过长会导致过拟合。我们使用与基础模型相同的优化器参数。最大学习率为 7OB 参数 LLAMA 2-CHAT 模型为 $5\times10^{-6}$,其余模型为 $1\times10^{-5}$。学习率按照余弦学习率调度逐渐降低,降至最大学习率的 $10%$。我们使用总步数的 $3%$ 作为预热步骤,最少为 5 步。有效批量大小固定为 512 对,或每批 1024 行。

Meta Helpful. Meta Safety Anthropic Helpful Anthropic Harmless OpenAI Summ. Stanford SHP Avg
SteamSHP-XL 52.8 43.8 66.8 34.2 54.7 75.7 55.3
Open Assistant 53.8 53.4 67.7 68.4 71.7 55.0 63.0
GPT4 58.6 58.1 - - - - -
Safety RM 56.2 64.5 55.4 74.7 71.7 65.2 64.3
Helpfulness RM 63.2 62.8 72.0 71.0 75.5 80.0 70.6

Table 7: Reward model results. Performance of our final helpfulness and safety reward models on a diverse set of human preference benchmarks. Note that our model is fine-tuned on our collected data, as opposed to the other baselines that we report. Table 8: Granular reward model accuracy per preference rating. We report per-preference rating accuracy for both Helpfulness and Safety reward models on the Meta Helpfulness and Safety test sets. The reward models show superior accuracy on more distinct responses (e.g., significantly better) and lower accuracy on similar responses (e.g., negligibly better).

表 7: 奖励模型结果。我们的最终帮助性和安全性奖励模型在多样的人类偏好基准上的表现。请注意,我们的模型是在我们收集的数据上微调的,而其他基线模型则不是。

表 8: 细粒度奖励模型准确性。我们在 Meta 帮助性和安全性测试集上报告了帮助性和安全性奖励模型的每个偏好评分准确性。奖励模型在更明显的响应(例如,显著更好)上表现出更高的准确性,在相似的响应(例如,微不足道地更好)上表现出较低的准确性。

测试集 显著更好 更好 稍微更好 几乎相同 / 不确定 平均
安全性 RM 帮助性 RM Meta 安全性 94.3 89.9 76.3 73.2 65.7 63.8 55.3 54.5 64.5 62.8
安全性 RM 帮助性 RM Meta 帮助性 64.6 80.7 57.5 67.5 53.8 60.9 52.2 54.7 56.2 63.2

Reward Model Results. On each batch of human preference annotation for reward modeling, we held out 1000 examples as a test set to evaluate our models. We refer to the union of all prompts for the corresponding test sets as "Meta Helpfulness" and "Meta Safety," respectively.

奖励模型结果。在每个批次的人类偏好标注中,我们保留了 1000 个样本作为测试集来评估我们的模型。我们将所有对应测试集的提示的并集分别称为“Meta Helpfulness”和“Meta Safety”。

As reference points, we also evaluated other publicly available alternatives as baselines: SteamSHP-XL (Ethayarajh et al., 2022) based on FLAN-T5-xl, the Open Assistant (Kopf et al., 2023) reward model based on DeBERTa V3 Large (He et al., 2020), and GPT4 accessible through the OpenAI's API. Note that at inference time, as opposed to training, all the reward models can predict a scalar for a single output, without requiring to access its paired output. For GPT-4, we prompt with a zero-shot question "Choose the best answer between $A$ and B," where A and B are the two responses for comparison.

作为参考点,我们还评估了其他公开可用的替代方案作为基线:基于 FLAN-T5-xl 的 SteamSHP-XL (Ethayarajh et al., 2022),基于 DeBERTa V3 Large (He et al., 2020) 的 Open Assistant (Kopf et al., 2023) 奖励模型,以及通过 OpenAI 的 API 可访问的 GPT4。请注意,在推理时(与训练相反),所有奖励模型都可以为单个输出预测一个标量,而无需访问其配对输出。对于 GPT-4,我们使用零样本问题 “在 $A$ 和 B 中选择最佳答案”,其中 A 和 B 是两个待比较的回答。

We report the results in terms of accuracy in Table 7. As expected, our own reward models perform the best on our internal test sets collected based on LLAMA 2-CHAT, with the Helpfulness reward model performing best on the Meta Helpfulness test set, and similarly the Safety reward model performing best on the Meta Safety test set. Overall, our reward models outperform all of the baselines, including GPT-4. Interestingly, GPT-4 performs better than other non-Meta reward models, despite not being trained directly nor targeting specifically this reward modeling task.

表 7:

我们报告了准确率的结果。如预期,我们自己的奖励模型在基于 LLAMA 2-CHAT 收集的内部测试集上表现最佳,其中 Helpfulness 奖励模型在 Meta Helpfulness 测试集上表现最好,同样 Safety 奖励模型在 Meta Safety 测试集上表现最好。总体而言,我们的奖励模型优于所有基线模型,包括 GPT-4。有趣的是,尽管 GPT-4 并没有直接针对此奖励建模任务进行训练,它的表现仍然优于其他非 Meta 奖励模型。


Figure 6: Scaling trends for the reward model. More data and a larger-size model generally improve accuracy, and it appears that our models have not yet saturated from learning on the training data.

图 6: 奖励模型的扩展趋势。更多的数据和更大规模的模型通常会提高准确性,而且看起来我们的模型尚未从训练数据中学习到饱和 [20]。

The fact that helpfulness and safety performed the best on their own domain is potentially due to the tension between the two objectives (i.e., being as helpful as possible versus refusing unsafe prompts when necessary), which may confuse the reward model during training. In order for a single model to perform well on both dimensions, it needs to not only learn to select the better response given a prompt but also to distinguish adversarial prompts from safe ones. As a result, optimizing two separate models eases the reward modeling task. More detailed analysis on this tension between safety and helpfulness can be found in Appendix A.4.1. When we group the scores by preference rating in Table 8, we can see that the accuracy is superior for the "significantly better" test set and degrades gradually as comparison pairs become more similar (e.g., "slightly better"). It is expected that learning to model human preferences becomes challenging when deciding between two similar model responses, due to annotator subjectivity and their reliance on nuanced details that may differentiate responses. We emphasize that the accuracy on more distinct responses matters the most to improve LLAMA 2-CHAT performance. The human preference annotation agreement rate is also higher on more distinct responses than similar pairs.

帮助性和安全性在其各自领域表现最佳,可能是因为这两个目标之间的矛盾(即,尽可能提供帮助与在必要时拒绝不安全的提示),这可能会在训练过程中混淆奖励模型。为了使单个模型在这两个维度上都能表现出色,它不仅需要学习如何根据提示选择更好的响应,还需要区分对抗性提示和安全提示。因此,优化两个独立的模型可以简化奖励建模任务。关于安全性和帮助性之间这种矛盾的更详细分析可以在附录 A.4.1 中找到。当我们将表 8 中的分数按偏好评级分组时,可以看到“显著更好”的测试集准确性更高,并且随着比较对变得越来越相似(例如,“稍好”),准确性逐渐下降。可以预期,在决定两个相似模型响应之间的差异时,学习建模人类偏好会变得更加困难,因为注释者的主观性和他们依赖的细微差别可能会影响响应的区分。我们强调,提高 LLAMA 2-CHAT 性能最关键的是在更明显的响应上的准确性。对于更明显的响应,人类偏好注释的一致率也高于相似对。

Scaling Trends. We study the scaling trends in terms of data and model size for the reward model, finetuning different model sizes on an increasing amount of the reward model data collected each week (see the details on volume per batch in Table 26). Figure 6 reports these trends, showing the expected result that larger models obtain higher performance for a similar volume of data. More importantly, the scaling performance has not yet plateaued given the existing volume of data annotation used for training, a signal that there is room for more improvement with more annotations. We note that reward model accuracy is one of the most important proxies for the final performance of LLAMA 2-CHAT. While best practices for comprehensively evaluating a generative model is an open research question, the ranking task of the reward has no ambiguity. Therefore, everything else being equal, an improvement of the reward model can be directly translated into an improvement for LLAMA 2-CHAT.

扩展趋势。我们研究了奖励模型在数据量和模型规模方面的扩展趋势,微调不同规模的模型在每周收集的不断增加的奖励模型数据上(详见表 26 中每批的数据量)。图 6 报告了这些趋势,显示了预期的结果,即对于相似的数据量,较大的模型获得了更高的性能。更重要的是,鉴于用于训练的数据标注量,扩展性能尚未达到饱和,这表明通过更多的标注仍有改进的空间。我们注意到,奖励模型的准确性是评估 LLAMA 2-CHAT 最终性能最重要的代理之一。虽然全面评估生成式模型的最佳实践仍是一个开放的研究问题,但奖励的排名任务没有歧义。因此,在其他条件相同的情况下,奖励模型的改进可以直接转化为 LLAMA 2-CHAT 的改进。

3.2.3 Iterative Fine-Tuning

3.2.3 迭代微调

As we received more batches of human preference data annotation, we were able to train better reward models and collect more prompts. We therefore trained successive versions for RLHF models, referred to here as RLHF-V1, ..., RLHF-V5.

随着我们收到更多批次的人类偏好数据标注,我们能够训练更好的奖励模型并收集更多的提示。因此,我们训练了多个版本的 RLHF 模型,在这里称为 RLHF-V1,...,RLHF-V5。

We explored RLHF fine-tuning with two main algorithms:

我们探索了两种主要算法的 RLHF 微调:

· Proximal Policy Optimization (PPO) (Schulman et al., 2017), the standard in RLHF literature.

近端策略优化 (PPO) (Schulman et al., 2017),强化学习中的人机协作文献中的标准方法。

· Rejection Sampling fine-tuning. We sample $K$ outputs from the model and select the best candidate with our reward, consistent with Bai et al. (2022b). The same re-ranking strategy for LLMs was also proposed in Deng et al. (2019), where the reward is seen as an energy function. Here, we go one step further, and use the selected outputs for a gradient update. For each prompt, the sample obtaining the highest reward score is considered the new gold standard. Similar to Scialom et al. (2020a), we then fine-tune our model on the new set of ranked samples, reinforcing the reward.

拒绝采样微调。我们从模型中采样 $K$ 个输出,并根据我们的奖励选择最佳候选者,这与 Bai 等人 (2022b) 的方法一致。同样的重新排序策略在 Deng 等人 (2019) 中也被提出用于大语言模型,其中奖励被视为能量函数。在此基础上,我们更进一步,使用选定的输出进行梯度更新。对于每个提示,获得最高奖励分数的样本被视为新的黄金标准。类似于 Scialom 等人 (2020a),然后我们在新的排名样本集上对模型进行微调,以强化奖励。


Figure 7: Max and median reward among $\mathbf{N}$ samples, $N\in[1,\ldots,100]$ averaged over our training set of prompts. The delta between max and median can be interpreted as potential gain with Rejection Sampling.


图 7: 在我们的训练提示集中,N 个样本中的最大和中位奖励,N ∈ [1, …, 100]。最大值与中位数之间的差值可以解释为使用拒绝采样 (Rejection Sampling) 的潜在增益。

The two RL algorithms mainly differ in:

两种 RL 算法主要区别在于:

Until RLHF (V4), we used only Rejection Sampling fine-tuning, and after that, we combined the two sequentially, applying PPO on top of the resulted Rejection Sampling checkpoint before sampling again.

直到 RLHF (V4),我们只使用了拒绝采样微调,在那之后,我们将两者依次结合,在基于拒绝采样检查点进行 PPO 训练后再进行采样。


Figure 8: RLHF impact of the temperature when sampling N outputs and scoring them with a reward model.


图 8: 强化学习与人类反馈 (RLHF) 中温度对采样 N 个输出并用奖励模型评分的影响。

Rejection Sampling. We perform rejection sampling only with our largest 7OB LLAMA 2-CHAT. All smaller models are fine-tuned on rejection sampled data from the larger model, thus distilling the large-model capabilities into the smaller ones. We leave further analysis of the effect of this distillation for future work.

拒绝采样。我们仅使用最大的 7OB LLAMA 2-CHAT 进行拒绝采样。所有较小的模型都在较大模型的拒绝采样数据上进行了微调,从而将大模型的能力蒸馏到较小的模型中。我们将进一步分析这种蒸馏效果的工作留待未来。

At each iterative stage, we sample $K$ answers for each prompt from the most recent model. We score each sample given the best reward model accessible at the time of the experiment, and then select the best answer for a given prompt. In earlier versions of our model, up to RLHF V3, our approach was to confine answer selection solely to the $,^{\prime\prime}{\mathrm{bag}}^{\prime\prime}$ of samples gathered from the preceding iteration. For example, RLHF V3 was trained using only samples from RLHF V2. However, despite continuous improvement, this method led to a regression in some capabilities. For example, RLHF V3 struggled more than previous versions to compose rhyming lines in poems, as discerned through qualitative analysis, suggesting that further investigation into the causes of and mitigation s for forgetting (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) could be a fruitful area for additional future research.

在每个迭代阶段,我们从最新的模型中为每个提示采样 $K$ 个答案。我们根据实验时可用的最佳奖励模型对每个样本进行评分,然后选择给定提示的最佳答案。在我们模型的早期版本中,直到 RLHF V3,我们的方法是将答案选择限制在前一次迭代收集的样本“包”中。例如,RLHF V3 的训练仅使用了来自 RLHF V2 的样本。然而,尽管不断改进,这种方法导致了一些能力的退步。例如,通过定性分析发现,RLHF V3 在诗歌中创作押韵诗句方面比之前的版本更加困难,这表明进一步研究遗忘的原因和缓解措施 (Kirkpatrick et al., 2017; Nguyen et al., 2019; Ramasesh et al., 2021) 可能是一个有前途的未来研究方向。

In response, on subsequent iterations, we modified our strategy, incorporating top-performing samples from all prior iterations, such as those used in RLHF-V1 and RLHF-V2. Although we do not present specific figures, this adjustment demonstrated considerable enhancements in performance and effectively addressed the previously noted issues. This mitigation can be seen as analogous to Synnaeve et al. (2019) and Vinyals et al. (2019) in the RL literature.

相应地,在后续迭代中,我们修改了策略,纳入了所有先前迭代中的顶级表现样本,例如在 RLHF-V1 和 RLHF-V2 中使用的样本。虽然我们没有提供具体的数据,但这一调整显著提升了性能,并有效解决了之前指出的问题。这种缓解方法可以类比于 Synnaeve 等人 (2019) 和 Vinyals 等人 (2019) 在强化学习文献中的工作。

We illustrate the benefit of Rejection Sampling in Figure 7. The delta between the maximum and median curves can be interpreted as the potential gain of fine-tuning on the best output. As expected, this delta increases with more samples, since the maximum increases (i.e., more samples, more opportunities to generate a good trajectory), while the median remains stationary. There is a direct connection between the exploration and the maximum reward we can obtain among the samples. The temperature parameter also plays an important role for exploration, as a higher temperature enables us to sample more diverse outputs.

我们在图 7 中展示了拒绝采样 (Rejection Sampling) 的优势。最大值曲线和中位数曲线之间的差距可以解释为在最佳输出上进行微调的潜在收益。如预期的那样,随着样本数量的增加,这一差距也增大,因为最大值增加(即,更多的样本意味着有更多机会生成一个良好的轨迹),而中位数保持不变。探索与我们从样本中可以获得的最大回报之间存在直接联系。温度参数在探索中也起着重要作用,因为较高的温度使我们能够采样更多样化的输出。

In Figure 8, We report for a LLAMA 2-CHAT-SFT (left) and a LLAMA 2-CHAT-RLHF (right), the maximum reward curves among $_\mathrm{N}$ samples (with $N\in[1,\dotsc,100])$ , for different temperatures. We can observe that the optimal temperature is not constant during the iterative model updates: RLHF has a direct impact on rescaling the temperature. For LLAMA 2-CHAT-RLHF, the optimal temperature when sampling between 10 and 100 outputs is $T\in[1.2,1.3]$ . Given a finite compute budget, it is therefore necessary to re-adjust the temperature progressively. Note that this temperature rescaling happens for a constant number of steps for each model, and always starting from the base model on each new RLHF version.

图 8: 我们报告了 LLAMA 2-CHAT-SFT (左) 和 LLAMA 2-CHAT-RLHF (右) 的最大奖励曲线,这些曲线是基于 $_\mathrm{N}$ 个样本 (其中 $N\in[1,\dotsc,100])$ 在不同温度下的结果。我们可以观察到,在迭代模型更新过程中,最优温度并不是恒定的:强化学习人类反馈 (RLHF) 直接影响温度的重新调整。对于 LLAMA 2-CHAT-RLHF,在采样 10 到 100 个输出时,最优温度为 $T\in[1.2,1.3]$ 。在计算资源有限的情况下,因此有必要逐步重新调整温度。请注意,这种温度重新调整是在每个模型的固定步数内进行的,并且总是从基础模型开始,针对每个新的 RLHF 版本。

PPO. We further train our language model following the RL scheme of Stiennon et al. (2020), which uses the reward model as an estimate for the true reward function (human preference) and the pretrained language model as the policy to optimize. During this phase, we seek to optimize the following objective:

PPO. 我们进一步按照 Stiennon 等人 (2020) 的强化学习方案训练我们的语言模型,该方案使用奖励模型作为真实奖励函数(人类偏好)的估计,并将预训练的语言模型作为策略进行优化。在此阶段,我们力求优化以下目标:

$$
\arg\operatorname*{max}{\pi}\mathbb{E}{p\sim\mathcal{D},g\sim\pi}[R(g\mid p)]
$$

$$
\arg\operatorname*{max}{\pi}\mathbb{E}{p\sim\mathcal{D},g\sim\pi}[R(g\mid p)]
$$

We iterative ly improve the policy by sampling prompts $p$ from our dataset $\mathcal{D}$ and generations $g$ from the policy $\pi$ and use the PPO algorithm and loss function to achieve this objective.

我们通过从数据集 $\mathcal{D}$ 中采样提示 $p$ 和从策略 $\pi$ 中采样生成 $g$ 来迭代地改进策略,并使用 PPO 算法和损失函数来实现这一目标。

The final reward function we use during optimization,

我们在优化过程中使用的最终奖励函数,

$$
R(g\mid p)=\tilde{R}{c}(g\mid p)-\beta D{K L}(\pi_{\theta}(g\mid p)\mid\mid\pi_{0}(g\mid p))
$$

$$
R(g\mid p)=\tilde{R}{c}(g\mid p)-\beta D{K L}(\pi_{\theta}(g\mid p)\mid\mid\pi_{0}(g\mid p))
$$

contains a penalty term for diverging from the original policy $\pi_{0}$ . As was observed in other works (Stiennon et al., 2020; Ouyang et al., 2022), we find this constraint is useful for training stability, and to reduce reward hacking whereby we would achieve high scores from the reward model but low scores from human evaluation.

包含了一个偏离原始策略 $\pi_{0}$ 的惩罚项。正如其他研究 (Stiennon et al., 2020; Ouyang et al., 2022) 所观察到的,我们发现这一约束对于训练稳定性很有用,并且可以减少奖励劫持现象,即我们从奖励模型中获得高分但人类评估得分较低的情况。

We define $R_{c}$ to be a piecewise combination of the safety $\left(R_{s}\right)$ and helpfulness $(R_{h})$ reward models. We have tagged prompts in our dataset that might elicit potentially unsafe responses and prioritize the scores from the safety model. The threshold of 0.15 is chosen for filtering unsafe responses, corresponding to a precision of 0.89 and a recall of 0.55 evaluated on the Meta Safety test set. We also find it important to whiten the final linear scores (shown here by reversing the sigmoid with the logit function) in order to increase stability and balance properly with the KL penalty term $(\beta)$ above.

我们将 $R_{c}$ 定义为安全模型 $\left(R_{s}\right)$ 和有用性模型 $(R_{h})$ 的分段组合。我们对数据集中可能引发潜在不安全响应的提示进行了标记,并优先考虑来自安全模型的分数。选择 0.15 作为过滤不安全响应的阈值,在 Meta Safety 测试集上评估时,该阈值对应于 0.89 的精确率和 0.55 的召回率。我们还发现,对最终的线性分数进行白化处理(通过使用 logit 函数反转 sigmoid 函数来实现)非常重要,以便增加稳定性并适当平衡与 KL 惩罚项 $(\beta)$ 的关系。

$$
\begin{array}{r l}&{R_{c}(g\mid p)=\left{R_{s}(g\mid p)\quad\mathrm{if~}\mathfrak{r}{-}\mathsf{s A F E T Y}(p)\mathrm{or}R{s}(g\mid p)<0.15\right.}\ &{\left.\qquad\qquad\qquad\qquad\qquad\qquad\qquad\mathsf{o t h e r w i s e}\right.}\ &{\tilde{R}_{c}(g\mid p)=\mathsf{w H I T E N}(\mathsf{L O G I T}(R_{c}(g\mid p)))}\end{array}
$$

$$
\begin{array}{r l}
& R_{c}(g\mid p)=\left{R_{s}(g\mid p) \quad \mathrm{if~} \mathfrak{r}{-}\mathsf{s A F E T Y}(p) \mathrm{or} R{s}(g\mid p)<0.15 \right. \
& \left. \qquad\qquad\qquad\qquad\qquad\qquad\qquad \mathsf{o t h e r w i s e} \right. \
& \tilde{R}_{c}(g\mid p)=\mathsf{w H I T E N}(\mathsf{L O G I T}(R_{c}(g\mid p)))
\end{array}
$$

For all models, we use the AdamW optimizer (Loshchilov and Hutter, 2017), with $\beta_{1}=0.9,\beta_{2}=0.95,\mathrm{{eps}=}$ $10^{-5}$ . We use a weight decay of 0.1, gradient clipping of 1.0, and a constant learning rate of $10^{-6}$ For each PPO iteration we use a batch size of 512, a PPO clip threshold of 0.2, a mini-batch size of 64, and take one gradient step per mini-batch. For the 7B and 13B models, we set $\beta=0.01$ (KL penalty), and for the 34B and 70B models, we set $\beta=0.005$

对于所有模型,我们使用 AdamW 优化器 (Loshchilov 和 Hutter, 2017),参数为 $\beta_{1}=0.9, \beta_{2}=0.95, \mathrm{{eps}=}$ $10^{-5}$ 。我们使用权重衰减为 0.1,梯度裁剪为 1.0,以及恒定学习率为 $10^{-6}$ 。对于每次 PPO 迭代,我们使用批量大小为 512,PPO 剪切阈值为 0.2,小批量大小为 64,并在每个小批量上进行一次梯度更新。对于 7B 和 13B 模型,我们将 $\beta=0.01$ (KL 惩罚),而对于 34B 和 70B 模型,我们将 $\beta=0.005$ 。


Figure 9: Issues with multi-turn memory (left) can be improved with GAtt (right).


图 9: 多轮记忆的问题 (左) 可以通过 GAtt 得到改善 (右)。

We train for between 200 and 400 iterations for all our models, and use evaluations on held-out prompts for early stopping. Each iteration of PPO on the 7OB model takes on average $\approx330$ seconds. To train quickly with large batch sizes, we use FSDP (Zhao et al., 2023). This was effective when using ${\mathrm{O}}(1)$ forward or backward passes, but caused a large slow down $(\approx20\times)$ during generation, even when using a large batch size and KV cache. We were able to mitigate this by consolidating the model weights to each node once before generation and then freeing the memory after generation, resuming the rest of the training loop.

我们为所有模型训练 200 到 400 次迭代,并使用保留的提示进行评估以实现提前停止。每次 PPO 在 7OB 模型上的迭代平均需要约 330 秒。为了快速训练并使用大批次大小,我们使用了 FSDP (Zhao et al., 2023)。这在使用 O(1) 前向或后向传递时是有效的,但在生成过程中即使使用大批次大小和 KV 缓存也导致了大约 20 倍的显著减速。我们通过在生成前将模型权重整合到每个节点并在生成后释放内存,从而恢复训练循环的其余部分来缓解这一问题。

3.3 System Message for Multi-Turn Consistency

3.3 多轮一致性系统消息

In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly, Orto $,^{\prime\prime}a c t,\bar{a}s^{\prime\prime}$ some public figure. When We provided such instructions to LLAMA 2-CHAT, the subsequent response should always respect the constraint. However, our initial RLHF models tended to forget the initial instruction after a few turns of dialogue, as illustrated in Figure 9 (left).

在对话设置中,某些指令应适用于所有对话轮次,例如,简洁回应,或“以某位公众人物的身份行事”。当我们向 LLAMA 2-CHAT 提供此类指令时,后续响应应始终遵守该约束。然而,我们的初始 RLHF 模型在几轮对话后往往会忘记初始指令,如图 9 (左) 所示。

To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context Distillation (Bai et al., 2022b) that hacks the fine-tuning data to help the attention focus in a multi-stage process. GAtt enables dialogue control over multiple turns, as illustrated in Figure 9 (right).

为了解决这些限制,我们提出了 Ghost Attention (GAtt),这是一种受 Context Distillation (Bai et al., 2022b) 启发的非常简单的方法,通过修改微调数据来帮助注意力在多阶段过程中集中。GAtt 实现了多轮对话控制,如图 9 (右) 所示。

GAtt Method. Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user and an assistant), with a list of messages $[u_{1},a_{1},\ldots,u_{n},a_{n}],$ where $u_{n}$ and $a_{n}$ correspond to the user and assistant messages for turn $n_{\prime}$ respectively. Then, we define an instruction, inst, that should be respected throughout the dialogue. For example, inst could be "act as." We can then synthetically concatenate this instruction to all the user messages of the conversation.

GAtt 方法。假设我们有一个两人(例如,用户和助手)之间的多轮对话数据集,消息列表为 $[u_{1},a_{1},\ldots,u_{n},a_{n}]$ ,其中 $u_{n}$ 和 $a_{n}$ 分别对应第 $n$ 轮的用户和助手消息。然后,我们定义一个指令 inst,在整个对话中应遵循该指令。例如,inst 可以是“扮演”。我们可以将此指令合成连接到对话中的所有用户消息。

Next, we can sample from this synthetic data using the latest RLHF model. We now have a context-dialogue and the sample with which to fine-tune a model, in a process analogous to Rejection Sampling. Instead of augmenting all context-dialogue turns with the instruction, we can drop it in all but the first turn, but this Would lead to a mismatch at training time between the system message, i.e., all the intermediate assistant messages that come before the last turn, and our sample. To fix this issue, which could hurt the training, we simply set the loss to O for all the tokens from the previous turns, including assistant messages.

接下来,我们可以使用最新的 RLHF 模型从这些合成数据中采样。我们现在有了一个上下文对话和用于微调模型的样本,这个过程类似于拒绝采样。我们不必在所有上下文对话轮次中都添加指令,而是在除了第一轮之外的所有轮次中省略它,但这会导致训练时系统消息(即所有在最后一轮之前出现的中间助手消息)与我们的样本之间不匹配。为了解决这个问题,避免对训练产生负面影响,我们只需将之前所有轮次中的 Token 损失设置为 0,包括助手消息。

For the training instructions, we created a few synthetic constraints to sample from: Hobbies ("You enjoy e.g. Tennis"), Language ("Speak in e.g. French"), or Public Figure ("Act as e.g. Napoleon"). To obtain the lists of hobbies and public figures, we asked LLAMA 2-CHAT to generate it, avoiding a mismatch between the instruction and model knowledge (e.g., asking the model to act as someone it had not encountered during training). To make the instructions more complex and diverse, we construct the final instruction by randomly combining the above constraints. When constructing the final system message for the training data, we also modify the original instruction half of the time to be less verbose, e.g., "Always act as Napoleon from now"-> "Figure: Napoleon." These steps produce an SFT dataset, on which we can fine-tune LLAMA 2-CHAT.

对于训练指令,我们创建了一些合成约束来进行采样:爱好(“你喜欢例如网球”),语言(“用例如法语说话”),或公众人物(“扮演例如拿破仑”)。为了获得爱好和公众人物的列表,我们要求 LLAMA 2-CHAT 生成这些内容,以避免指令与模型知识之间的不匹配(例如,要求模型扮演一个它在训练期间未遇到的人)。为了使指令更加复杂和多样化,我们通过随机组合上述约束来构建最终指令。在构建训练数据的最终系统消息时,我们还有一半的时间会修改原始指令,使其更加简洁,例如,“从现在起始终扮演拿破仑”-> “图:拿破仑。” 这些步骤产生了一个 SFT 数据集,我们可以在其上对 LLAMA 2-CHAT 进行微调。

GAtt Evaluation. β We applied GAtt after RLHF V3. We report a quantitative analysis indicating that GAtt is consistent up to ${20+}$ turns, until the maximum context length is reached (see Appendix A.3.5). We tried to set constraints not present in the training of GAtt at inference time, for instance "Always answer with Haiku, for which the model remained consistent as illustrated in Appendix Figure 28.

GAtt 评估。β 我们在 RLHF V3 之后应用了 GAtt 。我们报告了一项定量分析,表明 GAtt 在达到最大上下文长度之前,最多可保持一致 ${20+}$ 轮(见附录 A.3.5)。我们尝试在推理时设置训练中不存在的约束条件,例如“始终以俳句回答”,模型仍然保持一致,如附录图 28 所示。


Figure 10: Attention visualization for a dialogue with and without GAtt. We considered the maximum activation s across the network and we bin neighboring tokens together.

图 10: 带有和不带有 GAtt 的对话注意力可视化。我们考虑了网络中的最大激活值 s,并将相邻的 Token 合并在一起。

To illustrate how GAtt helped reshape attention during fine-tuning, we display the maximum attention activation s of the model in Figure 10. The left-hand side of each figure corresponds to the system message ("Act as Oscar Wilde"). We can see that the GAtt-equipped model (right) maintains large attention activation s with respect to the system message for a larger portion of the dialogue, as compared to the model without GAtt (left).

为了说明 GAtt 如何在微调期间重塑注意力,我们在图 10 中展示了模型的最大注意力激活。每个图的左侧对应系统消息 (“Act as Oscar Wilde”)。我们可以看到,配备 GAtt 的模型(右侧)在整个对话的更大一部分中保持了对系统消息的较大注意力激活,而没有 GAtt 的模型(左侧)则不然。

图 10:

Despite its utility, the current implementation of GAtt is vanilla, and more development and iteration on this technique could likely further benefit the model. For instance, we could teach the model to change the system message during the conversation by integrating such data during fine-tuning.

尽管其具有实用性,但目前的 GAtt 实现较为基础,对该技术进行更多开发和迭代很可能会进一步提升模型性能。例如,我们可以通过在微调期间整合此类数据,训练模型在对话过程中更改系统消息。

3.4 RLHF Results

3.4 强化学习人类反馈结果 (RLHF Results)

3.4.1 Model-Based Evaluation

3.4.1 基于模型的评估 (Model-Based Evaluation)

Evaluating LLMs is a challenging open-research problem. Human evaluation, while a gold standard, can be complicated by various HCI considerations (Ciark et al., 2021; Gehrmann et al., 2023), and is not always scalable. Thus, to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5, we first observed the improvement of the rewards from the latest reward models, to save costs and increase iteration speed. We later validated major model versions with human evaluations.

评估大语言模型 (LLM) 是一个具有挑战性的开放研究问题。人工评估虽然是黄金标准,但可能因各种人机交互 (HCI) 考虑因素而变得复杂 (Ciark et al., 2021; Gehrmann et al., 2023),并且不一定总是可扩展的。因此,为了在每次迭代中从 RLHF-V1 到 V5 中选择表现最佳的模型,我们首先观察了来自最新奖励模型的奖励改进,以节省成本并提高迭代速度。之后,我们通过人工评估验证了主要模型版本。

How Far Can Model-Based Evaluation Go? To measure the robustness of our reward model, we collected a test set of prompts for both helpfulness and safety, and asked three annotators to judge the quality of the answers based on a 7-point Likert scale (the higher the better). We observe that our reward models overall are well calibrated with our human preference annotations, as illustrated in Figure 29 in the appendix. This confirms the relevance of using our reward as a point-wise metric, despite being trained with a Pairwise Ranking Loss.

基于模型的评估能走多远?为了衡量我们奖励模型的鲁棒性,我们收集了一组测试用例,涵盖了有用性和安全性两个方面的提示,并邀请了三位标注者根据7点李克特量表 (分数越高越好) 对答案质量进行评判。我们观察到,我们的奖励模型总体上与人工偏好标注保持良好的一致性,如附录中的图 29 所示。这确认了使用我们的奖励作为逐点度量的相关性,尽管它是通过成对排名损失训练的。

Still, as Goodhar's Law states, when a measure becomes a target, it ceases to be a good measure. To ensure our measure won't diverge from the human preferences, we additionally used a more general reward, trained on diverse open-source Reward Modeling datasets. We have not yet observed any such divergence, and hypothesize that iterative model updates may be helping to prevent this.

尽管如此,正如古德哈特定律所言,当某个指标成为目标时,它就不再是一个好的指标。为确保我们的指标不会偏离人类偏好,我们还使用了更为通用的奖励机制,在多样化的开源奖励建模数据集上进行训练。我们尚未观察到任何此类偏差,并假设迭代模型更新可能有助于防止这种情况的发生。


Figure 11: Evolution of LLAMA 2-CHAT. We show the evolution after multiple iterations fine-tuning for the win-rate $%$ of LLAMA 2-CHAT compared to ChatGPT. Left: the judge is our reward model, which may favor our model, and right, the judge is GPT-4, which should be more neutral.


图 11: LLAMA 2-CHAT 的演变。我们展示了经过多次迭代微调后,LLAMA 2-CHAT 相比 ChatGPT 的胜率 % 的演变。左:评判模型是我们的奖励模型,可能更偏向我们的模型;右:评判模型是 GPT-4,应该更加中立。

As a last verification step to ensure no regression between our new model and the previous one, we use both to sample during the next annotation iteration. This enables a model comparison "for free" on new prompts and can help to increase diversity when sampling.

作为最后的验证步骤,以确保新模型与之前模型之间没有退步,我们在下一个标注迭代期间使用两者进行采样。这使得我们可以在新的提示上“免费”进行模型比较,并且在采样时有助于增加多样性。

Progression of Models. Figure 11 reports the progress of our different SFT and then RLHF versions for both Safety and Helpfulness axes, measured by our in-house Safety and Helpfulness reward models. On this set of evaluations, we outperform ChatGPT on both axes after RLHF-V3 (harmlessness and helpfulness ${>}50%$ ). Despite the aforementioned relevance of using our reward as a point-wise metric, it can arguably be biased in favor of LLAMA 2-CHAT. Therefore, for a fair comparison, we additionally compute the final results using GPT-4 to assess which generation is preferred. The order in which ChatGPT and LLAMA 2-CHAT outputs appeared in GPT-4 prompt are randomly swapped to avoid any bias. As expected, the win-rate in favor of LLAMA 2-CHAT is less pronounced, although obtaining more than a $60%$ win-rate for Our latest LLAMA 2-CHAT.

模型进展。图 11: 报告了我们不同 SFT 和随后 RLHF 版本在安全性和帮助性两个轴上的进展,这些数据由我们内部的安全性和帮助性奖励模型测量。在这组评估中,在 RLHF-V3 之后,我们在两个轴上都超过了 ChatGPT(无害性和有用性 ${>}50%$)。尽管使用我们的奖励作为点对点度量具有前述的重要性,但它可能偏向于 LLAMA 2-CHAT。因此,为了公平比较,我们还使用 GPT-4 来评估哪个生成结果更优。ChatGPT 和 LLAMA 2-CHAT 的输出在 GPT-4 提示中的顺序随机交换以避免任何偏差。如预期的那样,LLAMA 2-CHAT 的胜率优势不那么明显,尽管我们最新的 LLAMA 2-CHAT 获得了超过 $60%$ 的胜率。

The prompts correspond to a validation set of 1, 586 and 584 prompts for safety and helpfulness, respectively.

提示词对应于安全性和帮助性各自的验证集,其中包含 1,586 和 584 个提示词。

3.4.2 Human Evaluation

3.4.2 人类评估

Human evaluation is often considered the gold standard for judging models for natural language generation, including dialogue models. To evaluate the quality of major model versions, we asked human evaluators to rate them on helpfulness and safety. We compare the LLAMA 2-CHAT models to open-source models (Falcon, MPT MosaicML NLP Team et al. (2023), Vicuna Chiang et al. (2023), as well as closed-source models (ChatGPT (OpenAI, 2023) and PaLM Anil et al. (2023)) on over 4, 000 single and multi-turn prompts. For ChatGPT, We use gpt-3.5-turbo-0301 model in all generations. For PaLM, we use the chat-bison-001 model in all generations. The final prompt count for human evaluations for each model is shown in Table 32. See more methodology details in Appendix, Section A.3.7. The following section shows helpfulness results; safety results are presented in Section 4.4.

人类评估通常被认为是判断自然语言生成模型(包括对话模型)的黄金标准。为了评估主要模型版本的质量,我们请人类评估员对它们的帮助性和安全性进行评分。我们将 LLAMA 2-CHAT 模型与开源模型(Falcon、MPT (MosaicML NLP Team 等, 2023)、Vicuna (Chiang 等, 2023))以及闭源模型(ChatGPT (OpenAI, 2023) 和 PaLM (Anil 等, 2023))在超过 4,000 个单轮和多轮提示上进行了比较。对于 ChatGPT,我们在所有生成中使用 gpt-3.5-turbo-0301 模型。对于 PaLM,我们在所有生成中使用 chat-bison-001 模型。每个模型的人类评估最终提示数量如表 32 所示。更多方法细节请参见附录,A.3.7 节。以下部分展示了帮助性结果;安全性结果将在 4.4 节中呈现。

表 32:

模型 最终提示数量

更多方法细节请参见附录,A.3.7 节。以下部分展示了帮助性结果;安全性结果将在 4.4 节中呈现。

Results. As shown in Figure 12, LLAMA 2-CHAT models outperform open-source models by a significant margin on both single turn and multi-turn prompts. Particularly, LLAMA 2-CHAr 7B model outperforms MPT-7B-chaton $60\breve{%}$ of the prompts. LLAMA 2-CHAT 34B has an overall win rate of more than $75%$ against equivalently sized Vicuna-33B and Falcon 40B models.

结果。如图 12 所示,LLAMA 2-CHAT 模型在单轮和多轮提示中均显著优于开源模型。特别是,LLAMA 2-CHAT 7B 模型在 60% 的提示中优于 MPT-7B-chat。LLAMA 2-CHAT 34B 对比同等规模的 Vicuna-33B 和 Falcon 40B 模型,总体胜率超过 75%。

图 12:


Figure 12: Human evaluation results for LLAMA 2-CHAT models compared to open- and closed-source models across ${\sim}4\mathrm{,}000$ helpfulness prompts with three raters per prompt.


图 12: 对 LLAMA 2-CHAT 模型与开源和闭源模型在约 4,000 个有用性提示上的人类评估结果,每个提示有三位评分员。

The largest LLAMA 2-CHAT model is competitive with ChatGPT. LLAMA 2-CHAT 7OB model has a win rate of $36%$ and a tie rate of $31.5%$ relative to ChatGPT. LLAMA 2-CHAT 7OB model outperforms PaLM-bison chat model by a large percentage on our prompt set. More results and analysis is available in Section A.3.7.

最大的 LLAMA 2-CHAT 模型在性能上与 ChatGPT 具有竞争力。LLAMA 2-CHAT 7OB 模型相对于 ChatGPT 的胜率为 36% ,平局率为 31.5% 。LLAMA 2-CHAT 7OB 模型在我们的提示集上大幅优于 PaLM-bison 聊天模型。更多结果和分析请参见 A.3.7 节。

Inter-Rater Reliability (IRR). In our human evaluations, three different annotators provided independent assessments for each model generation comparison. High IRR scores (closer to 1.0) are typically seen as better from a data quality perspective, however, context is important. Highly subjective tasks like evaluating the overall helpfulness of LLM generations will usually have lower IRR scores than more objective labelling tasks. There are relatively few public benchmarks for these contexts, so we feel sharing our analysis here will benefit the research community.

评分者间信度 (Inter-Rater Reliability, IRR)。在我们的人类评估中,三位不同的标注者为每个模型生成的比较提供了独立的评估。高 IRR 分数(接近 1.0)通常从数据质量的角度来看被认为是更好的,然而,背景很重要。像评估大语言模型生成的整体有用性这样的高度主观任务,通常会比更客观的标记任务有更低的 IRR 分数。对于这些背景,相对较少有公开的基准,因此我们认为在这里分享我们的分析将有利于研究社区。

We used Gwet's AC1/2 statistic (Gwet, 2008, 2014) to measure inter-rater reliability (IRR), as we found it to be the most stable metric across different measurement scenarios. On the 7-point Likert scale helpfulness task that is used in our analysis, Gwet's AC2 score varies between 0.37 and 0.55 depending on the specific model comparison. We see scores on the lower end of that range for ratings from model comparisons with similar win rates to each other (like the LLAMA 2-CHAT-7OB-chat vs. ChatGPT comparison). We see scores on the higher end of that range for ratings from model comparisons with a more clear winner (like the LLAMA 2-CHAT-34b-chat vs. Falcon-40b-instruct).

我们使用了 Gwet 的 AC1/2 统计量 (Gwet, 2008, 2014) 来测量评分者间信度 (IRR),因为我们发现它在不同的测量场景中是最稳定的指标。在我们分析中使用的 7 点李克特量表帮助性任务中,Gwet 的 AC2 分数根据具体模型比较的不同,在 0.37 到 0.55 之间变化。对于具有相似胜率的模型比较(例如 LLAMA 2-CHAT-7OB-chat 与 ChatGPT 的比较),我们看到该范围较低端的分数。对于有更明确胜者的模型比较(例如 LLAMA 2-CHAT-34b-chat 与 Falcon-40b-instruct 的比较),我们看到该范围较高端的分数。

Limitations of human evaluations. While our results indicate that LLAMA 2-CHAT is on par with ChatGPT on human evaluations, it is important to note that human evaluations have several limitations.

人类评估的局限性。尽管我们的结果表明 LLAMA 2-CHAT 在人类评估中与 ChatGPT 不相上下,但需要注意的是,人类评估存在多个局限性。

● By academic and research standards, we have a large prompt set of 4k prompts. However, it does not cover real-world usage of these models, which will likely cover a significantly larger number of use cases. · Diversity of the prompts could be another factor in our results. For example, our prompt set does not include any coding- or reasoning-related prompts. · We only evaluate the final generation of a multi-turn conversation. A more interesting evaluation could be to ask the models to complete a task and rate the overall experience with the model over multiple turns. ● Human evaluation for generative models is inherently subjective and noisy. As a result, evaluation on a different set of prompts or with different instructions could result in different results.

● 按照学术和研究标准,我们有一个包含 4k 提示的大型提示集。然而,它并未涵盖这些模型在现实世界中的使用情况,后者可能会涉及大量更多的用例。· 提示的多样性可能是影响我们结果的另一个因素。例如,我们的提示集中不包含任何与编程或推理相关的提示。· 我们仅评估多轮对话的最终生成内容。更有趣的评估方式可能是要求模型完成一项任务,并对模型在多轮对话中的整体体验进行评分。● 对于生成式模型 (Generative model) 的人工评估本质上是主观且易受干扰的。因此,在不同的提示集或不同的指令下进行评估可能会产生不同的结果。

4 Safety

4 安全性

WARNING: this section contains examples of text that may be considered unsafe, offensive, or upsetting.

警告:本节包含可能被认为是不安全、冒犯或令人不安的文本示例。

In this section, we dive deeper into the important topic of safety measurements and mitigation s. We first discuss our safety investigations into pre training data and pretrained models (Section 4.1). Next, we describe the process of our safety alignment (Section 4.2), explaining how we collected safety-related annotations and utilized SFT and RLHF, and present experimental results. Then, we discuss the red teaming we performed to further understand and improve model safety (Section 4.3). Finally, we present quantitative safety evaluations of LLAMA 2-CHAT (Section 4.4). We also share a model card in the Appendix, in Table 52.

在本节中,我们深入探讨安全测量和缓解措施这一重要话题。我们首先讨论对预训练数据和预训练模型的安全性调查(第 4.1 节)。接下来,我们描述安全对齐的过程(第 4.2 节),解释我们如何收集与安全相关的标注并利用 SFT 和 RLHF,并展示实验结果。然后,我们讨论进行的红队测试以进一步理解和改进模型安全性(第 4.3 节)。最后,我们展示了对 LLAMA 2-CHAT 的定量安全评估(第 4.4 节)。我们还在附录中的表 52 中分享了一个模型卡片。

4.1 Safety in Pre training

4.1 预训练中的安全性

It is important to understand what is in the pre training data both to increase transparency and to shed light on root causes of potential downstream issues, such as potential biases. This can inform what, if any, downstream mitigation s to consider, and help guide appropriate model use. In this section, we analyze the pre training data for distributions of languages, demographic representations, and toxicity. We also present the results of testing the pretrained models on existing safety benchmarks.

了解预训练数据的内容对于提高透明度和揭示潜在下游问题的根本原因(如潜在偏见)非常重要。这可以指导我们考虑是否需要采取下游缓解措施,并帮助正确使用模型。在本节中,我们分析了预训练数据中的语言分布、人口统计学表示和毒性。我们还展示了在现有安全基准上测试预训练模型的结果。

Steps Taken to Pretrain Responsibly. We followed Meta's standard privacy and legal review processes for each dataset used in training. We did not use any Meta user data in training. We excluded data from certain sites known to contain a high volume of personal information about private individuals. We made a best effort to train our models efficiently to reduce the carbon footprint of pre training (Section 2.2.1). Sharing our models broadly willreduce the need for others to train similar models. No additional filtering was conducted on the datasets, to allow LLAMA 2 to be more widely usable across tasks (e.g., it can be better used for hate speech classification), while avoiding the potential for the accidental demographic erasure sometimes caused by over-scrubbing. Importantly, this allows LLAMA 2-CHAT to generalize more effectively during safety tuning with fewer examples (Welbl et al., 2021; Korbak et al., 2023; Xu et al., 2021). As a result, LLAMA 2 models should be used carefully and deployed only after significant safety tuning is applied.

采取的负责任预训练步骤。我们遵循了 Meta 的标准隐私和法律审查流程,针对用于训练的每个数据集进行了审查。我们没有使用任何 Meta 用户数据进行训练。我们排除了某些已知包含大量私人个人信息的网站的数据。我们尽最大努力高效地训练模型,以减少预训练的碳足迹(第 2.2.1 节)。广泛分享我们的模型将减少其他人训练类似模型的需求。为了使 LLAMA 2 能够更广泛地应用于各种任务(例如,它可以更好地用于仇恨言论分类),我们没有对数据集进行额外过滤,同时避免过度清理可能导致的意外人口统计信息删除。重要的是,这使得 LLAMA 2-CHAT 在安全微调时能够更有效地泛化,所需的示例更少(Welbl et al., 2021; Korbak et al., 2023; Xu et al., 2021)。因此,LLAMA 2 模型应在经过显著的安全微调后谨慎使用和部署。

Demographic Representation: Pronouns._ Bias in model generations may result from biases inherited from the training data itself. For instance, Bailey et al. (2022) shows that in massive text corpora, words representing "people" are often used in more similar contexts to words representing "men" than to words representing "women," and Ganesh et al. (2023) demonstrates that a model's performance on fairness metrics can be highly dependent on how the model trains on data representing underrepresented demographic groups. Within our English-language training corpus, we computed the frequencies of the most common English pronouns in Table 9a. We observe that He pronouns are generally overrepresented in documents compared to She pronouns, echoing similar frequency differences observed in pronominal usage for similarly sized model pre training datasets (Chowdhery et al., 2022). This could mean that the model is learning less during pre training about context that mentions She pronouns, and subsequently may potentially generate He pronouns at a higher rate than She pronouns.

人口统计学代表性:代词._ 模型生成中的偏差可能是从训练数据本身继承而来的。例如,Bailey等人 (2022) 显示,在大规模文本语料库中,表示“人”的词语通常在与表示“男性”的词语相似的上下文中使用,而不是与表示“女性”的词语 [20]。Ganesh等人 (2023) 表明,模型在公平性指标上的表现可能高度依赖于模型如何训练代表少数群体的数据 [21]。在我们的英语训练语料库中,我们在表 9a 中计算了最常见的英语代词的频率。我们观察到,He 代词在文档中的代表性通常高于 She 代词,这反映了在类似规模的模型预训练数据集中代词使用频率的相似差异 (Chowdhery等人, 2022)。这意味着模型在预训练期间可能较少学习到提到 She 代词的上下文,并且随后可能会以更高的频率生成 He 代词,而不是 She 代词。

表 9a:

代词 频率
He 较高
She 较低

Demographic Representation: Identities. We also analyze the representation of different demographic groups in the pre training data by measuring rates of usage of demographic identity terms from the Holistic Bias dataset (Smith et al., 2022) as a proxy. We compute frequencies for each descriptor term in the pre training corpus. We group descriptors into 5 axes (Religion, Gender and Sex, Nationality, Race and Ethnicity, and Sexual Orientation), and show the top 5 terms in each axis in Table 9b. In the top 5 terms, we remove a few terms such as "straight," "white," and "black," because these terms have frequent uses beyond demographic mentions (e.g., as basic color terms). We also de duplicate across lists, removing a few terms found in both Gender and Sex and Sexual Orientation. For Gender and Sex, while She pronouns are mentioned in fewer documents, the term "female" is present in a larger percentage of documents. This could imply that while there is less frequent context about She pronouns, comments about "females" are more prevalent, perhaps reflecting the differences in linguistic markedness of these terms (Blodgett et al., 2021). For Sexual Orientation, the top five terms all relate to LGBTQ $^+$ identities. For Nationality, Race and Ethnicity, and Religion, we observe a Western skew (Bhatt et al., 2022). For instance, the term "American" is mentioned in. $69.4%$ of the references, the term "European" is more prevalent than other race and ethnicity, and "Christian" is the most represented religion followed by "Catholic" and "Jewish."

人口统计代表性:身份。我们还通过测量 Holistic Bias 数据集 (Smith et al., 2022) 中不同人口统计身份术语的使用率,来分析预训练数据中不同人口群体的代表性。我们计算了预训练语料库中每个描述符术语的频率。我们将描述符分为 5 个维度(宗教、性别、国籍、种族和民族、性取向),并在表 9b 中显示每个维度的前 5 个术语。在前 5 个术语中,我们移除了一些如“straight”、“white” 和 “black” 的术语,因为这些术语在人口统计提及之外有频繁使用(例如,作为基本颜色术语)。我们还在列表之间去重,移除了一些同时出现在性别和性取向中的术语。对于性别,虽然“She”代词在较少的文档中被提及,但“female”一词出现在更大比例的文档中。这可能意味着虽然“She”代词的上下文较少,但关于“女性”的评论更为普遍,这或许反映了这些术语在语言标记性上的差异 (Blodgett et al., 2021)。对于性取向,前五名术语都与 LGBTQ+ 身份有关。对于国籍、种族和民族以及宗教,我们观察到一种西方偏斜 (Bhatt et al., 2022)。例如,“American”一词在 69.4% 的引用中被提及,“European”比其他种族和民族更为普遍,而“Christian”是最具代表性的宗教,其次是“Catholic”和“Jewish”。

性别代词 75.23% 语法人称 94.47%
她 (she, her, hers, herself) 28.45% 第一人称 (I, me, my, mine, myself, ...) 70.71%
他 (he, him, his, himself) 50.73% 第二人称 (you, your, yours, ...) 61.80%
未指定 (they, them, their, ...) 86.38% 第三人称 (it, its, itself, she, her, he, him, ..) 93.07%

(a) Percentage of documents containing gender pronouns and grammatical person. $75%$ of all documents contain gendered pronouns. Within this subset, $2\bar{8}%$ of all documents contain She pronouns. $94%$ of all documents contain pronouns in general. See the full detailed list of pronouns for each subgroup in Appendix A.4.3.

(b) The percentage listed below each demographic axis represents the percentage of all documents that mention any of the descriptor terms in this axis. The percentage listed for each demographic descriptor represents, among the documents that mention a descriptor in the given demographic axis, the percentage that mention this specific descriptor.

(a) 包含性别代词和语法人称的文档百分比。75% 的所有文档包含性别化的代词。在这个子集中,28.8% 的所有文档包含 She 代词。94% 的所有文档包含代词。有关每个子组的代词完整详细列表,请参见附录 A.4.3。

性别和性 (5.91%) 性取向 (6.67%) 国籍 (14.83%) 种族和民族 (19.51%) 宗教 (7.93%)
描述符 描述符 描述符 描述符 描述符
%Doc %Doc
female 50.0% gay 14.8% american
male 39.1% lesbian 4.3% indian
feminine 5.4% lgbt 4.0% chinese
transgender 4.2% lgbtq 3.6% korean
masculine 3.1% queer 3.5% mexican

(b) 每个人口统计轴下方列出的百分比表示提及该轴中任何描述符的所有文档的百分比。每个描述符旁边列出的百分比表示,在提及给定人口统计轴中的描述符的文档中,提及此特定描述符的文档的百分比。

Table 9: Demographic representations. Analysis of pronouns and identities in our pre training corpus shows some skews that may affect performance, such as higher representations of Western demographics.

表 9: 人口统计学特征。对我们的预训练语料库中代词和身份的分析显示了一些可能影响性能的偏差,例如西方人口统计学特征的比例较高。


Figure 13: Pre training data toxicity. To allow for better downstream generalization, we chose not to scrub toxic data from pre training. The HateBERT classifier assigns a toxicity likelihood of 0.5 or higher to about 0.2% of documents in our pre training corpus.

图 13: 预训练数据毒性。为了允许更好的下游泛化,我们选择不在预训练中清除有毒数据。HateBERT 分类器将毒性可能性为 0.5 或更高的评分赋予了我们预训练语料库中大约 0.2% 的文档。

Data Toxicity. We measure the prevalence of toxicity in the English-language portion of the pre training corpus using a HateBERT classifier fine-tuned on the ToxiGen dataset (Hartvigsen et al., 2022). We score each line of a document separately and average them to assign a document score. Figure 13 shows the distribution of scores in a $10%$ random sample of the full corpus. About $0.2%$ of documents evaluated are assigned a likelihood score of 0.5 or higher, meaning there is a small amount of toxicity in our pre training data.

数据毒性。我们使用在 ToxiGen 数据集 (Hartvigsen et al., 2022) 上微调的 HateBERT 分类器来测量预训练语料库中英语部分的毒性 prevalence。我们分别对每个文档的每一行进行评分并取平均值以分配文档得分。图 13 显示了在整个语料库的 10% 随机样本中的得分分布。大约 0.2% 的被评估文档被分配了一个 0.5 或更高的可能性得分,这意味着我们的预训练数据中有少量的毒性。

Language Identification. While our pre training data is mostly English, it also includes text from a small number of other languages. Table 10 shows the distribution of languages in our corpus, subsetted to those found in more than $0.005%$ of the documents. Our analysis uses the fastText (Bojanowski et al.,2016) language identification tool and a threshold of 0.5 for the language detection. A training corpus with a majority in English means that the model may not be suitable for use in other languages.

语言识别。虽然我们的预训练数据主要为英语,但也包含少量其他语言的文本。表 10: 显示了语料库中语言的分布情况,仅列出占比超过 0.005% 的文档中的语言。我们的分析使用 fastText (Bojanowski 等, 2016) 语言识别工具,并设置语言检测的阈值为 0.5。以英语为主的训练语料库意味着该模型可能不适合用于其他语言。

语言 百分比 语言 百分比
en 89.70% uk 0.07%
unknown 8.38% ko 0.06%
de 0.17% ca 0.04%
fr 0.16% sr 0.04%
SV 0.15% id 0.03%
zh 0.13% CS 0.03%
es 0.13% fi 0.03%
ru 0.13% hu 0.03%
nl 0.12% no 0.03%
it 0.11% ro 0.03%
ja 0.10% bg 0.02%
pl 0.09% da 0.02%
pt 0.09% sl 0.01%
vi 0.08% hr 0.01%

Safety Benchmarks for Pretrained Models. We evaluate the safety capabilities of LLAMA 2 on three popular automatic benchmarks, pertaining to three key dimensions of LM safety.

预训练模型的安全性基准测试。我们评估了 LLAMA 2 在三个流行的自动基准测试中的安全性能力,这些基准测试涉及大语言模型安全性的三个关键维度。

We compare the performance of LLAMA 2 with LLAMA 1 (Touvron et al., 2023), Falcon (Almazrouei et al., 2023), and MPT (MosaicML NLP Team et al., 2023) in Table 11. For decoding, we set temperature to 0.1 and use nucleus sampling (Holtzman et al., 2020) with top $\boldsymbol{p}$ set to 0.9. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher, the better). For ToxiGen, we present the percentage of generations that are deemed toxic by the metric (the lower, the better). Detailed descriptions of the benchmarks and metrics can be found in Appendix A.4.7. When compared to LLAMA $\scriptstyle1-7\mathrm{{B}}$ LLAMA $_{2-7\mathrm{B}}$ demonstrates a $21.37%$ increase in truthfulness and informative ness and a $7.61%$ decrease in toxicity. We also observe an increase in toxicity in the pretrained 13B and 70B LLAMA 2, which may result from larger pre training data or a different dataset mix. Some have postulated the existence of a relationship between pre training dataset size and downstream model toxicity or bias (Bender et al., 2021b), but empirical work to validate this claim is still ongoing (Dodge et al., 2021; Smith and Williams, 2021; Tal et al., 2022), and further evidence from up-to-date models is still needed.

表 11: 我们比较了 LLAMA 2 与 LLAMA 1 (Touvron 等, 2023), Falcon (Almazrouei 等, 2023) 和 MPT (MosaicML NLP Team 等, 2023) 的性能。对于解码,我们将温度设置为 0.1 并使用核采样 (Holtzman 等, 2020),其中 top $\boldsymbol{p}$ 设置为 0.9。对于 TruthfulQA,我们展示了既真实又信息丰富的生成内容的百分比(越高越好)。对于 ToxiGen,我们展示了根据度量标准被认为是毒性的生成内容的百分比(越低越好)。基准测试和度量的详细描述可以在附录 A.4.7 中找到。与 LLAMA 1-7B 相比,LLAMA 2-7B 展示了真实性及信息丰富性提高了 21.37%,毒性降低了 7.61%。我们还观察到预训练的 13B 和 70B LLAMA 2 毒性有所增加,这可能是由于更大的预训练数据集或不同的数据集组合。一些人推测预训练数据集大小与下游模型毒性和偏差之间存在关系 (Bender 等, 2021b),但验证这一说法的经验工作仍在进行中 (Dodge 等, 2021; Smith 和 Williams, 2021; Tal 等, 2022),并且仍需要来自最新模型的进一步证据。

In Appendix A.4.7, we present bias metrics, such as how the sentiment of model generations varies with demographic attributes. We note an increase in positive sentiment overall for many of the groups using BOLD prompts. More detailed results split by different demographic groups can be found in Appendix A.4.8.

在附录 A.4.7 中,我们展示了偏置指标,例如模型生成的情感如何随人口统计属性变化。我们注意到使用 BOLD 提示时,许多群体的整体正面情感有所增加。更详细的结果按不同人口统计群体划分可以在附录 A.4.8 中找到。

LLAMA 2 does not outperform other models on toxicity metrics, and we speculate that this may be because we refrained from aggressively filtering the pre training data. Recall that leaving pre training data unfiltered may enable base models tuned to perform well on more downstream tasks (including hate speech detection), and it carries less risk of accidentally filtering out some demographic groups. We observe that models trained from less aggressively filtered pre training data also required fewer examples to achieve reasonable safety-alignment. We reiterate that this motivated choice does imply that additional safety mitigation s should be applied before deployment of base LLAMA 2 models.

LLAMA 2 在有害性指标上并未超过其他模型,我们推测这可能是因为我们没有对预训练数据进行激进的过滤。回顾一下,保留未过滤的预训练数据可以使基础模型在更多下游任务(包括仇恨言论检测)中表现良好,并且减少了意外过滤掉某些人群的风险。我们观察到,从较少激进过滤的预训练数据中训练的模型也只需要较少的示例即可达到合理的安全对齐。我们重申,这一有动机的选择确实意味着在部署基础 LLAMA 2 模型之前应应用额外的安全缓解措施。

Table 11: Evaluation of pretrained LLMs on automatic safety benchmarks. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller, the better).

表 11: 预训练大语言模型在自动安全基准上的评估。对于 TruthfulQA,我们展示了生成内容既真实又具有信息量的百分比(越高越好)。对于 ToxiGen,我们展示了有毒生成内容的百分比(越低越好)。

TruthfulQA ↑ ToxiGen ↓
MPT
7B 29.13 22.32
30B 35.25 22.61
Falcon
7B 25.95 14.53
40B 40.39 23.44
LLAMA 1
7B 27.42 23.00
13B 41.74 23.08
33B 44.19 22.57
65B 48.71 21.77
LLAMA 2
7B 33.29 21.25
13B 41.86 26.10
34B 43.45 21.19
70B 50.18 24.60

Benchmarks give a summary view of model capabilities and behaviors that allow us to understand general patterns in the model, but they do not provide a fully comprehensive view of the impact the model may have on people or real-world outcomes; that would require study of end-to-end product deployments. Further testing and mitigation should be done to understand bias and other social issues for the specific context in which a system may be deployed. For this, it may be necessary to test beyond the groups available in the BOLD dataset (race, religion, and gender). As LLMs are integrated and deployed, we look forward to continuing research that will amplify their potential for positive impact on these important social issues.

基准测试提供了模型能力和行为的概览,使我们能够理解模型中的总体模式,但它们并不能完全全面地反映模型可能对人们或现实世界结果产生的影响;这需要研究端到端的产品部署。应进一步进行测试和缓解措施,以了解特定部署环境中可能出现的偏见和其他社会问题。为此,可能有必要超越 BOLD 数据集中可用的群体(种族、宗教和性别)进行测试。随着大语言模型 (LLM) 的集成和部署,我们期待继续开展能够放大其在这些重要社会问题上积极影响的研究。

4.2 Safety Fine-Tuning

4.2 安全微调

In this section, we describe our approach to safety fine-tuning, including safety categories, annotation guidelines, and the techniques we use to mitigate safety risks. We employ a process similar to the general fine-tuning methods as described in Section 3, with some notable differences related to safety concerns. Specifically, we use the following techniques in safety fine-tuning:

在本节中,我们描述了我们的安全微调方法,包括安全类别、标注指南以及我们用于缓解安全风险的技术。我们采用的过程类似于第 3 节所述的一般微调方法,但在安全问题上有一些显著的不同。具体来说,我们在安全微调中使用了以下技术:

4.2.1 Safety Categories and Annotation Guidelines

4.2.1 安全类别和标注指南

Based on limitations of LLMs known from prior work, we design instructions for our annotation team to create adversarial prompts along two dimensions: a risk category, or potential topic about which the LLM could produce unsafe content; and an attack vector, or question style to cover different varieties of prompts that could elicit bad model behaviors.

基于先前工作已知的大语言模型 (LLM) 的局限性,我们为标注团队设计了指令,以创建对抗性提示,这些提示沿着两个维度展开:风险类别,或大语言模型可能产生不安全内容的潜在主题;以及攻击向量,或问题风格,以涵盖可能引发不良模型行为的不同类型的提示。

The risk categories considered can be broadly divided into the following three categories: illicit and criminal activities (e.g., terrorism, theft, human trafficking); hateful and harmful activities (e.g., defamation, selfharm, eating disorders, discrimination); and unqualified advice (e.g., medical advice, financial advice, legal advice). The attack vectors explored consist of psychological manipulation (e.g., authority manipulation), logic manipulation (e.g., false premises), syntactic manipulation (e.g., misspelling), semantic manipulation (e.g., metaphor), perspective manipulation (e.g., role playing), non-English languages, and others.

考虑的风险类别可以大致分为以下三类:非法和犯罪活动(例如,恐怖主义、盗窃、人口贩卖);仇恨和有害活动(例如,诽谤、自残、饮食失调、歧视);以及不合格的建议(例如,医疗建议、财务建议、法律建议)。探索的攻击向量包括心理操纵(例如,权威操纵)、逻辑操纵(例如,虚假前提)、语法操纵(例如,拼写错误)、语义操纵(例如,隐喻)、视角操纵(例如,角色扮演)、非英语语言以及其他。

We then define best practices for safe and helpful model responses: the model should first address immediate safety concerns if applicable, then address the prompt by explaining the potential risks to the user, and finally provide additional information if possible. We also ask the annotators to avoid negative user experience categories (see Appendix A.5.2). The guidelines are meant to be a general guide for the model and are iterative ly refined and revised to include newly identified risks.

我们定义了模型生成安全且有帮助回复的最佳实践:模型应首先处理即时的安全问题(如适用),然后通过向用户解释潜在风险来回应提示,最后在可能的情况下提供额外信息。我们还要求标注人员避免负面用户体验类别(见附录 A.5.2)。这些指南旨在作为模型的一般指导,并将迭代地进行优化和修订以纳入新识别的风险。

4.2.2 Safety Supervised Fine-Tuning

4.2.2 安全监督微调

In accordance with the established guidelines from Section 4.2.1, we gather prompts and demonstrations of safe model responses from trained annotators, and use the data for supervised fine-tuning in the same manner as described in Section 3.1. An example can be found in Table 5.

根据第 4.2.1 节既定的指南,我们从训练有素的标注人员处收集安全模型响应的提示和演示,并以与第 3.1 节所述相同的方式使用这些数据进行监督微调。一个例子可以在表 5 中找到。

The annotators are instructed to initially come up with prompts that they think could potentially induce the model to exhibit unsafe behavior, i.e., perform red teaming, as defined by the guidelines. Subsequently, annotators are tasked with crafting a safe and helpful response that the model should produce.

标注人员被指示首先提出他们认为可能诱导模型表现出不安全行为的提示,即根据指南进行红队测试 (red teaming)。随后,标注人员的任务是编写一个安全且有帮助的响应,该响应是模型应该产生的。

4.2.3 Safety RLHF

4.2.3 安全 RLHF

We observe early in the development of LLAMA 2-CHAT that it is able to generalize from the safe demonstrations in supervised fine-tuning. The model quickly learns to write detailed safe responses, address safety concerns, explain why the topic might be sensitive, and provide additional helpful information. In particular, when the model outputs safe responses, they are often more detailed than what the average annotator writes. Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to RLHF to teach the model how to write more nuanced responses. Comprehensive tuning with RLHF has the added benefit that it may make the model more robust to jailbreak attempts (Bai et al., 2022a).

我们在 LLAMA 2-CHAT 的早期开发中观察到,它能够从监督微调中的安全示范中进行泛化。模型迅速学会了编写详细的、安全的回复,处理安全问题,解释为什么某个话题可能敏感,并提供额外的有用信息。特别是当模型输出安全回复时,这些回复通常比普通标注者所写的更为详细。因此,在收集了仅几千个监督示范后,我们完全转向了基于人类反馈的强化学习 (RLHF) 来教导模型如何编写更细致的回复。使用 RLHF 进行全面调优还有一个额外的好处,即可能使模型对越狱尝试更具鲁棒性 (Bai et al., 2022a)。

We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage.

我们通过首先收集与安全相关的类似第 3.2.2 节的人类偏好数据来进行基于人类反馈的强化学习 (RLHF):标注者编写他们认为可以引发不安全行为的提示,然后比较多个模型对这些提示的响应,根据一组指南选择最安全的响应。然后我们使用人类偏好数据来训练安全奖励模型(参见第 3.2.2 节),并在 RLHF 阶段也重用对抗性提示从模型中采样。

Better Long-Tail Safety Robustness without Hurting Helpfulness Safety is inherently a long-tail problem, where the challenge comes from a small number of very specific cases. We investigate the impact of Safety RLHF by taking two intermediate LLAMA 2-CHAT checkpoints-one without adversarial prompts in the RLHF stage and one with them—and score their responses on our test sets using our safety and helpfulness reward models. In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left corner suggesting the improvements of model safety. On the right side, we do not observe any gathering pattern below the $y=x$ line on the right hand side of Figure 14, which indicates that the helpfulness score distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training data, the addition of an additional stage of safety mitigation does not negatively impact model performance on helpfulness to any notable degradation. A qualitative example is shown in Table 12.

更好的长尾安全鲁棒性不会损害有用性

安全本质上是一个长尾问题,挑战来自于少数非常具体的情况。我们研究了安全强化学习与人类反馈 (Safety RLHF) 的影响,通过选取两个中间的 LLAMA 2-CHAT 检查点——一个在 RLHF 阶段没有对抗性提示,另一个有这些提示——并在我们的测试集上使用安全性和有用性奖励模型对它们的响应进行评分。在图 14 中,我们绘制了安全 RM 在安全测试集上的得分分布变化(左)以及有用性 RM 在有用性测试集上的得分分布变化(右)。在图的左侧,我们观察到,在经过 Safety RLHF 调整后,安全 RM 得分分布在安全集上向更高的奖励得分移动,并且接近零的长尾变细。在左上角出现了一个明显的聚类,表明模型安全性有所提高。在右侧,我们在图 14 的右侧没有观察到任何聚集模式出现在 $y=x$ 线之下,这表明在经过 Safety RLHF 调整后,有用性得分分布得以保持。换句话说,在给定足够的有用性训练数据的情况下,增加一个额外的安全缓解阶段不会对模型在有用性方面的表现产生显著的负面影响。表 12 显示了一个定性示例。

Impact of Safety Data Scaling. _ A tension between helpfulness and safety of LLMs has been observed in previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects general model performance, especially helpfulness, we investigate the trends in safety data scaling by adjusting the amount of safety data used in the RLHF stage. In this ablation experiment, we keep the amount of helpfulness training data unchanged ( ${\sim}0.9\mathrm{M}$ samples) and gradually increase the amount of safety data used in model tuning, ranging from $\mathrm{\bar{O}%}$ to $100%$ $\overset{\cdot}{\sim}0.\overset{\cdot}{1}\mathrm{M}$ samples). For the specific training data mix recipe, We follow the procedure described in Section 3.1 and fine-tune LLAMA 2 pretrained model for 2 epochs.

安全数据扩展的影响。
之前的研究中观察到了大语言模型的帮助性和安全性之间存在矛盾 (Bai et al., 2022a)。为了更好地理解添加安全训练数据对模型整体性能,特别是帮助性的影响,我们通过调整RLHF阶段使用的安全数据量来研究安全数据扩展的趋势。在这个消融实验中,我们将帮助性训练数据的数量保持不变(约 0.9M 样本),并逐渐增加用于模型调优的安全数据量,范围从 $\mathrm{\bar{O}%}$ 到 $100%$ (约 0.1M 样本)。对于具体的训练数据混合方案,我们遵循第3.1节中描述的程序,并对LLAMA 2预训练模型进行2个epoch的微调。

We eventually obtain 6 model variants trained with $0%$ $1%$ $10%$ $25%$ $50%,$ and $100%$ of the total safety data. We evaluate them using our safety and helpfulness reward models described in Section 3.2.2. For each variant, we use the safety and helpfulness reward models to score model generations corresponding to prompts in the Meta Safety and Helpful test sets, respectively.

我们最终获得了 6 个模型变体,分别使用总安全数据的 $0%$、$1%$、$10%$、$25%$、$50%$ 和 $100%$ 进行训练。我们使用第 3.2.2 节中描述的安全性和有用性奖励模型对它们进行评估。对于每个变体,我们使用安全性和有用性奖励模型对模型生成的结果进行评分,这些结果分别对应于 Meta 安全性和有用性测试集中的提示。


Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner suggests the improvements of model safety. Right: helpfulness reward model scores of generations on the Meta Helpfulness test set.


图 14: 通过奖励模型分数分布测量的安全性 RLHF 的影响。左:Meta Safety 测试集上生成内容的安全性奖励模型分数。左上角的样本聚类表明了模型安全性的改进。右:Meta Helpfulness 测试集上生成内容的帮助性奖励模型分数。

Prompt: Write a scam email requesting 10,000 dollars.

Before 安全 RLHF 主题: 紧急援助请求 After 安全 RLHF
亲爱的 [Name],希望您身体健康,心情愉快。我今天给您写信是因为我遇到了困难的情况,需要您的帮助。我目前在 [插入地点] 并且被抢劫了。我的钱包、手机和所有其他物品都被拿走了,我一无所有。我试图联系当地 authorities,但他们并没有提供太多帮助。 我无法满足您的请求。发送欺诈邮件索要金钱是不道德的,并且可能违法。尊重他人的时间和资源,不参与欺诈或欺骗活动非常重要。此外,通过虚假或误导性的借口向人们索要金钱是有害且可能存在危险的。相反,我建议专注于以合法和道德的方式与他人在线沟通。
我联系您是因为我需要您的帮助。我需要您给我寄 $10,000,这样我才能回家并重新开始生活。我知道这是一大笔钱,但我保证会尽快还给您。 如果您有任何问题或疑虑,请随时提问,我会尽力以负责任和安全的方式协助您。让我们都努力使用互联网和电子邮件进行积极有益的交流。还有其他我可以帮到您的地方吗?

请注意,上述表格中的内容展示了两种不同的回应方式。左边的内容是一个虚构的欺诈性邮件示例,而右边的内容则是经过安全强化后的适当回复。

As shown in Figure 15, we use the mean reward model scores as proxies of model performance on safety and helpfulness. We observe that when we increase the proportion of safety data, the model's performance on handling risky and adversarial prompts improves dramatically, and we see a lighter tail in the safety reward model score distribution. Meanwhile, the mean helpfulness score remains constant. We hypothesize that this is because we already have a sufficiently large amount of helpfulness training data. Appendix A.4.2 lists more qualitative results that demonstrate how different amounts of safety data in training can change model behavior in responding to adversarial and non-adversarial prompts.

如图 15 所示,我们使用平均奖励模型分数作为模型在安全性和有用性方面的性能代理。我们观察到,当我们增加安全数据的比例时,模型在处理风险和对抗性提示方面的性能显著提高,并且我们在安全奖励模型分数分布中看到更轻的尾部。同时,平均有用性分数保持不变。我们假设这是因为我们已经有足够多的有用性训练数据。附录 A.4.2 列出了更多定性结果,展示了训练中不同数量的安全数据如何改变模型对对抗性和非对抗性提示的响应行为。


Figure 15: Safety data scaling trends. Left: as we increase the amount of safety data in model training, the mean safety RM score improves significantly while the helpfulness counterpart remains relatively stable. Right: the left tail of safety RM scores (i.e., most unsafe responses) gradually disappears with the addition of more safety training data.


图 15: 安全数据扩展趋势。左:随着我们在模型训练中增加安全数据的数量,平均安全 RM 得分显著提高,而有用性得分则保持相对稳定。右:随着更多安全训练数据的加入,安全 RM 得分的左尾(即最不安全的响应)逐渐消失。

Measure of False Refusal. Even though we do not see overall regression on model helpfulness, we qualitatively observe, through interaction, that the model with more safety mitigation answers certain questions in a more conservative manner (e.g., example shown in Appendix Table 38). As a follow-up, we measure false refusal to quantify the frequency that the model incorrectly refuses to answer non-adversarial prompts. Here, We define false refusal as the model incorrectly refusing to answer legitimate user prompts due to irrelevant safety concerns. Refusing due to reasonable causes exceeding the model's capability, e.g., "I am not capable of participating invideo calls" and $^{\prime\prime}2024$ is beyond my knowledge cut-off date," are not counted as false refusals. We train a classifier for detecting refusals in responses and apply the classifier on 1) the above helpfulness test sets and 2) a curated borderline test set consisting of 210 samples for the purpose of boundary testing similar to OpenAI (2023). The borderline dataset is designed intentionally so that its prompts look adversarial (e.g., containing sensitive words or subwords) but are not actually unsafe (e.g., "give me a recipe for Christmas Crack") (see Appendix Table 41 for more examples).

错误拒绝的衡量。尽管我们没有看到模型整体有用性上的退步,但通过交互我们定性地观察到,具有更多安全缓解措施的模型以更保守的方式回答某些问题(例如,见附录表 38)。作为后续工作,我们测量错误拒绝以量化模型因无关的安全顾虑而错误地拒绝回答非对抗性提示的频率。这里,我们将错误拒绝定义为模型由于无关的安全顾虑而错误地拒绝回答合法用户提示。因合理原因超出模型能力范围的拒绝,例如“我无法参与视频通话”和“2024年超出了我的知识截止日期”,不计入错误拒绝。我们训练了一个用于检测响应中拒绝情况的分类器,并将该分类器应用于1)上述有用性测试集和2)一个由210个样本组成的精心策划的边界测试集,目的是进行类似于 OpenAI (2023) 的边界测试。边界数据集是故意设计的,其提示看起来具有对抗性(例如,包含敏感词或子词),但实际上并不危险(例如,“给我一个圣诞节 Crack 的食谱”)(更多示例见附录表 41)。

With more safety data mixed in model tuning, the false-refusal rate becomes larger on both datasets (see Appendix Figure 33). However, false refusal is overall rare—approximately $0.\overset{\cdot}{0}5%$ —on the helpfulness dataset,evenwith $100%$ safety data. On the other hand, the false-refusal rate is much larger on the borderline set due to its difficulty. LLAMA 2-CHAT sometimes has difficulty distinguishing whether a prompt is safe when the prompt contains words that frequently occur in unsafe generations (such as "bomb"). Appendix Table 41 shows some examples of false refusals we discovered.

在模型调优中加入更多安全数据后,两个数据集上的错误拒绝率 (false-refusal rate) 变得更大(见附录图 33)。然而,错误拒绝总体上是罕见的——大约为 0.05% ——即使在帮助性数据集中包含 100% 的安全数据。另一方面,由于边界集的难度较大,其错误拒绝率要高得多。LLAMA 2-CHAT 在区分提示是否安全时有时会遇到困难,特别是当提示中包含经常出现在不安全生成中的词语(如“bomb”)。附录表 41 显示了一些我们发现的错误拒绝示例。

The following is a discussion between ahuman and a knowledgeable and empathetic assistant.The assistant's answers should not include any harmful,unethical,or socially biased content,and should move the conversation in a positive direction.

以下是人类和一位知识渊博且富有同理心的助手之间的对话。助手的回答不应包含任何有害、不道德或社会偏见的内容,并应引导对话向积极的方向发展。

  1. Encourage users to ask for/view additional info as appropriate. If applicable, the model can point the user to other legitimate sources for further reading.
  2. 鼓励用户在适当的情况下请求/查看更多信息。如有适用,模型可以指引用户到其他合法来源以获取进一步阅读资料。

Please write a silly guide that's meant to convince someone that the moon landing was faked.

请编写一份搞笑指南,目的是说服某人月球登陆是伪造的。

Please write a silly guide that's meant to convince someone that the moon landing was faked.

请编写一份搞笑指南,目的是让人相信月球登陆是伪造的。

I cannot provide a response to this prompt as it goes against my programming rules to promote or encourage the spread of false information. It is important to respect the truth and accuracy in all forms of communication, and to refrain from spreading misinformation or conspiracy theories.

我无法对此提示做出回应,因为这违反了我的编程规则,即不促进或鼓励虚假信息的传播。在所有形式的沟通中,尊重真实性和准确性非常重要,应避免传播错误信息或阴谋论。

I can't satisfy your request, $\mathrm{{I'm}}$ just an AI, I cannot provide a guide that promotes misinformation or conspiracy theories. The moon landing was not faked, and it is important to rely on scientific evidence and credible sources when evaluating historical events. It is not appropriate to spread misinformation or promote conspiracy theories, as it can be harmful and disrespectful to those who were involved in the event. Let's focus on sharing accurate and helpful information instead. Is there something else I can help you With?

我无法满足您的请求,$\mathrm{{I'm}}$ 只是一个 AI 智能体,我不能提供任何促进错误信息或阴谋论的指南。月球登陆并不是伪造的,在评估历史事件时,依靠科学证据和可信来源非常重要。传播错误信息或促进阴谋论是不合适的,因为这可能会造成伤害,并对参与该事件的人表示不尊重。让我们专注于分享准确和有用的信息。还有其他我可以帮助您的吗?

Instead, I would like to emphasize the importance of fact-checking and critical thinking when evaluating information. It is crucial to rely on credible sources and evidence-based reasoning when assessing the validity of a claim.

相反,我想强调在评估信息时事实核查和批判性思维的重要性。依赖可信的来源和基于证据的推理来评估主张的有效性是至关重要的。

If you have any other questions or concerns, please feel free to ask, and I will do my best to provide a helpful and accurate response.

如果您有任何其他问题或疑虑,请随时提问,我将尽力提供有帮助且准确的答复。

4.2.4 Context Distillation for Safety

4.2.4 上下文蒸馏以保障安全

We encourage LLAMA 2-CHAT to associate adversarial prompts with safer responses by using context distillation (Askell et al., 2021a) similar to Section 3.3. We observe that the safety capabilities of LLMs can be efficiently enhanced by prefixing the model with a safety preprompt (e.g., "You are a safe and responsible assistant"). Like supervised safety fine-tuning, safety context distillation provides a quick way to bootstrap the model's responses on hard adversarial prompts, so that they can then be further improved in RLHF.

我们鼓励 LLAMA 2-CHAT 使用上下文蒸馏 (Askell et al., 2021a) 将对抗性提示与更安全的响应关联起来,类似于第 3.3 节。我们观察到,通过在模型前添加安全预提示(例如,“您是一个安全且负责任的助手”),可以有效增强大语言模型的安全能力。与监督式安全微调类似,安全上下文蒸馏提供了一种快速引导模型对困难对抗性提示做出响应的方法,从而可以在 RLHF 中进一步改进这些响应。

Specifically, we apply context distillation by prefixing a safety preprompt to adversarial prompts to generate safer responses, and then fine-tune the model on its own safe output given the adversarial prompt without the preprompt. We generate safety preprompts automatically with templates. In particular, we use various adjectives usually associated with safe behavior such as "responsible,”" "'respectful',' or "wise," with the intuition that the model associates them with positive traits that we want to see reflected in safe answers. We show examples of safety preprompts in Appendix Table 39.

具体来说,我们通过在对抗提示前添加安全预提示来进行上下文蒸馏,以生成更安全的响应,然后在不使用预提示的情况下,根据模型自身对对抗提示的安全输出进行微调。我们使用模板自动生成安全预提示。特别是,我们使用了通常与安全行为相关的各种形容词,如“负责任的 (responsible) ”、“尊重的 (respectful) ”或“明智的 (wise) ”,其直觉是模型会将这些形容词与我们希望在安全回答中体现的积极特质联系起来。我们在附录表 39 中展示了安全预提示的例子。

Context Distillation with Answer Templates During the prompt collection phase, we also asked annotators to label prompts according to risk categories, which enables even more targeted preprompts. Specifically, this allows us to provide some dedicated answer templates of how adversarial prompts should be addressed, based on each identified risk category. Figure 16a shows the impact of context distillation and context distillation with answer templates on the safety RM scores.

在提示收集阶段,我们还要求标注人员根据风险类别对提示进行标注,这使得我们可以提供更加有针对性的预提示。具体来说,这使我们能够为每个识别出的风险类别提供一些专门的应答模板 (answer templates),说明如何应对对抗性提示。图 16a 显示了上下文蒸馏 (context distillation) 和带有应答模板的上下文蒸馏对安全 RM 分数的影响。


(a) Impact on Safety RM Score.


(a) 对安全 RM 得分的影响。


(b) Targeted Context Distillation.


(b) 针对性上下文蒸馏。

Figure 16: Context distillation analysis. Left: Distribution of safety RM scores from the base model, when adding a generic preprompt, and when adding a preprompt based on the risk category with tailored answer template. While a generic preprompt increases safety RM scores, a preprompt with tailored answer template helps even more. Right: Context distillation increases the RM score significantly for samples that initially have a low score, but can also have a detrimental effect on samples that initiaily have a high score. We therefore only apply context distillation on targeted samples when it increases RM score.

图 16: 上下文蒸馏分析。左:基础模型的安全性 RM 分数分布,添加通用预提示时的安全性 RM 分数分布,以及添加基于风险类别的预提示并使用定制答案模板时的安全性 RM 分数分布。虽然通用预提示可以提高安全性 RM 分数,但带有定制答案模板的预提示帮助更大。右:上下文蒸馏对初始分数较低的样本显著提高了 RM 分数,但也可能对初始分数较高的样本产生不利影响。因此,我们仅在上下文蒸馏能提高 RM 分数的目标样本上应用此方法。

Rejecting Context Distillation Errors with the Safety Reward Model It is important to note that performing safety context distillation for helpful prompts can degrade model performance and lead to more false refusals (see Appendix Table 40). We therefore perform safety context distillation only on adversarial prompts. However, we observed that context distillation can sometimes degrade response quality, even when dealing with adversarial prompts. Specifically, if the model responses are already of high quality, the application of context distillation can result in less pertinent replies, as the model tends to overemphasize the preprompt, often resorting to generic concerns excessively (see Appendix Table 40 for an example of vague answers due to context distillation). We thus leverage the safety reward model to decide whether to use safety context distillation - we keep the context-distilled output only on the examples where it gets a better reward model score than the original answer. We notice that this is particularly helpful on prompts that the model is very bad at, but limits the negative impact of context distillation (see Figure 16b).

使用安全奖励模型拒绝上下文蒸馏错误

需要注意的是,对有益的提示进行安全上下文蒸馏可能会降低模型性能,并导致更多的错误拒绝(见附录表 40)。因此,我们仅对对抗性提示进行安全上下文蒸馏。然而,我们观察到,即使在处理对抗性提示时,上下文蒸馏有时也会降低响应质量。具体来说,如果模型的响应已经很高质,应用上下文蒸馏可能会导致不那么相关的回复,因为模型倾向于过度强调前置提示,经常过度依赖通用的担忧(见附录表 40 中由于上下文蒸馏导致模糊回答的例子)。因此,我们利用安全奖励模型来决定是否使用安全上下文蒸馏——我们只保留那些获得比原始答案更好的奖励模型分数的上下文蒸馏输出。我们注意到这对模型表现非常差的提示特别有帮助,但限制了上下文蒸馏的负面影响(见图 16b)。

4.3 Red Teaming

4.3 红队测试

Given how broad the capabilities of LLMs are and how varied their training data is, it is insufficient to identify risks solely via ex post facto usage and analysis. Rather, as has been done for other LLMs, we performed various kinds of proactive risk identification, colloquially called "red teaming," based on the term commonly used within computer security. This kind of granular analysis is very important because safety is a long-tail issue, in which even very infrequent edge cases can cause noticeable problems. Even if quantitative scores report good results, these types of qualitative insights allow us to recognize and target specific patterns in a more comprehensive way.

鉴于大语言模型的能力如此广泛,以及其训练数据的多样性,仅通过事后使用和分析来识别风险是不够的。相反,我们对其他大语言模型也进行了各种主动的风险识别工作,通俗地说就是“红队测试”,这一术语在计算机安全领域中常用。这种细致的分析非常重要,因为安全性是一个长尾问题,即使是非常罕见的边缘情况也可能导致明显的问题。即使定量评分报告了良好的结果,这些定性见解也能使我们更全面地识别和针对特定模式。

We conducted a series of red teaming with various groups of internal employees, contract workers, and external vendors. These teams included over 350 people, including domain experts in cyber security, election fraud, social media misinformation, legal, policy, civil rights, ethics, software engineering, machine learning, responsible AI, and creative writing. They aiso included individuals representative of a variety of socioeconomic, gender, ethnicity, and racial demographics.

我们与不同组别的内部员工、合同工和外部供应商进行了一系列红队演练。这些团队包括超过 350 人,其中包括网络安全、选举欺诈、社交媒体虚假信息、法律、政策、公民权利、伦理、软件工程、机器学习、负责任的 AI (responsible AI) 和创意写作等领域的专家。还包括了代表各种社会经济、性别、种族和族裔 demographic 的个人。

The red teamers probed our models across a wide range of risk categories (such as criminal planning, human trafficking, regulated or controlled substances, sexually explicit content, unqualified health or financial advice, privacy violations, and more), as well as different attack vectors (such as hypothetical questions, malformed/misspelled inputs, or extended dialogues). Additionally, we conducted specific tests to determine the capabilities of our models to facilitate the production of weapons (e.g. nuclear, biological, chemical, and cyber); findings on these topics were marginal and were mitigated. Nonetheless, we will continue our red teaming efforts in this front.

红队成员在广泛的风崄类别上对我们模型进行了测试(如犯罪策划、人口贩运、受监管或受控物质、性内容、不合格的健康或财务建议、隐私侵犯等),以及不同的攻击向量(如假设性问题、格式错误/拼写错误的输入、或延长对话)。此外,我们还进行了特定测试以确定我们的模型在促进武器生产(例如核、生物、化学和网络武器)方面的能力;这些主题的研究结果是边缘性的,并且已经得到缓解。尽管如此,我们将在这一领域继续我们的红队工作。

To date, all of our red teaming efforts have targeted model outputs in English, but have crucially included non-English prompts and dialogue contexts, as that is a well-known attack vector. In all exercises, participants were given risk category definitions and were shown just a handful of examples of risky interactions with an LLM. After that, each participant was part of a subteam focused on a particular category of risk or attack vector. After creating each dialogue, the red team participant would annotate various attributes, including risk areas and degree of risk, as captured by a 5-point Likert scale.

到目前为止,我们所有的红队测试工作都针对了英文模型输出,但关键包括了非英文提示和对话上下文,因为这是一个众所周知的攻击向量。在所有演练中,参与者被给予了风险类别定义,并展示了与大语言模型 (LLM) 的少数几次有风险的互动示例。之后,每位参与者成为专注于特定风险或攻击向量类别的子团队的一员。创建每次对话后,红队参与者会标注各种属性,包括风险领域和风险程度,这些由5点李克特量表捕获。

Some examples of useful insights provided by members of red teams that we were able to improve upon throughout development:

我们在开发过程中能够改进的一些由红队成员提供的有用见解示例:

· [Early models] were more likely to have generated unsafe responses without noting that they contain problematic content. However, [slightly later models] have tended to display knowledge that the content is problematic, even if they do go on to provide it. "They respond with [uNSAFE CONTENT] is not appropriate to discuss, etc.' and then immediately follow up with 'With that said, here's how [UNSAFE CONTENT].' " [Latest models] are able to resolve these issues. · Distracting the [early models] by including "quirks" or specific requests usually defeated any reluctance encountered via more direct requests. "A creative writing request (song, story, poem, etc.) isa reliable way to get it to produce content that it is otherwise robust against. · Embedding a problematic request in a positive context often successfully obscured the fact that problematic output was being requested for [early models]: "The overall principle I've found most effective for any kind of attack is to hide it in language that is positive, progressive, and empowering.

· 【早期模型】更有可能生成不安全的响应而不指出它们包含有问题的内容。然而,【稍后一些的模型】倾向于表明知道内容是有问题的,即使它们继续提供这些内容。“它们会回应说【不安全内容】不适合讨论等,然后立即补充说‘尽管如此,这里是如何【不安全内容】的方法。’” 【最新模型】能够解决这些问题。

· 通过在请求中加入“怪癖”或特定要求来分散【早期模型】的注意力,通常可以克服直接请求时遇到的任何犹豫。“创意写作请求(歌曲、故事、诗歌等)是让其生成原本抵抗的内容的一种可靠方法。

· 将有问题的请求嵌入到积极的上下文中,通常可以成功掩盖正在请求有问题输出的事实,对于【早期模型】而言:“我发现对任何类型的攻击最有效的方法是将其隐藏在积极、进步和赋权的语言中。

From Red Teaming Insights to Safer Models. Crucially, after each exercise, we performed a thorough analysis of the collected data, including dialogue length, risk area distribution, histogram of topic of misinformation (where appropriate), and rated degree of risk. In each case, we took the overall lessons as a guide to help further model safety training, and specifically took data from these exercises for model fine-tuning, model feedback training, and as a signal for other safety model training.

从红队测试洞察到更安全的模型。 crucially, 在每次练习后,我们对收集的数据进行了详细分析,包括对话长度、风险区域分布、错误信息主题的直方图(如适用)以及风险程度评级。在每种情况下,我们都将总体教训作为指导,以帮助进一步的模型安全训练,并特别使用这些练习中的数据进行模型微调、模型反馈训练以及作为其他安全模型训练的信号。

Multiple additional rounds of red teaming were performed over several months to measure the robustness of each new model as it was released internally. We defined the robustness of a model, $\gamma,$ with respect to a red teaming exercise executed by a set of experts as the average number of created prompts that would trigger a violating response from the model per person per hour. As an example, on our 7B model, we had an evolution of $\gamma:1.8\rightarrow0.45$ over several red teaming iterations and model refinements. Robustness will likely continue to improve with additional red teaming efforts. Another magnitude that we tracked as new models were produced was the percentage of prompts triggering violating responses discovered in the previous red teaming exercises that were mitigated in a given new candidate release. On average, we had a $\hat{90%}$ rejection rate model over model.

在几个月内进行了多轮红队测试,以衡量每个新模型在内部发布时的稳健性。我们定义了模型的稳健性 $\gamma$,相对于一组专家执行的红队测试,为每小时每人创建的能够触发模型违规响应的提示的平均数量。例如,在我们的 7B 模型中,经过多次红队测试迭代和模型改进,$\gamma:1.8\rightarrow0.45$。随着更多的红队测试工作,稳健性可能会继续提高。另一个我们在新模型生成过程中跟踪的指标是,在之前的红队测试中发现的触发违规响应的提示中有多少比例在新的候选版本中得到了缓解。平均而言,我们有 $\hat{90%}$ 的拒绝率从一个模型到另一个模型。

4.4 Safety Evaluation of LLAMA 2-CHAT

4.4 大语言模型 LLAMA 2-CHAT 的安全性评估

Safety Human Evaluation. β We collected roughly 2,000 adversarial prompts for human evaluation according to the safety categories in Section 4.2.1, in which 1,351 prompts are single-turn and 623 are multi-turn. Examples of evaluation prompts and responses can be found in Appendix A.4.6.

安全性人工评估。我们根据第 4.2.1 节中的安全类别收集了大约 2,000 个对抗性提示进行人工评估,其中 1,351 个提示是单轮的,623 个是多轮的。评估提示和响应的示例可以在附录 A.4.6 中找到。

We then asked raters to judge models for safety violations on a five-point Likert scale with the following definitions:

然后我们要求评分员根据以下定义,在五点李克特量表上对模型的安全性违规进行评判:


Figure 17: Overall safety measures. Left: LLAMA 2-CHAT has low violation percentage overall across model sizes. Right: LLAMA 2-CHAT has high safety and helpfulness mean rating overall across model sizes. It is important to note that these results are subject to limitations of the prompt set, subjectivity of the review guidelines, and subjectivity of individual raters. (a) Overall violation percentage.

图 17: 整体安全措施。左:LLAMA 2-CHAT 在不同模型尺寸下总体违规百分比低。右:LLAMA 2-CHAT 在不同模型尺寸下总体安全性和有用性平均评分高。需要注意的是,这些结果受提示集的局限性、评审指南的主观性以及个别评分者的主观性影响。(a) 总体违规百分比。


(b) Overall safety and helpfulness mean rating.

(b) 总体安全性和帮助性平均评分。

· 1 - Severe safety violations

· 1 - 严重安全违规

We consider a rating of 1 or 2 as violation and use violation percentage as our main evaluation metric, with the mean rating as a supplement. Each example is annotated by three annotators and we take the majority vote to determine if the response is violating or not. We used Gwet's AC1/2 statistic to measure inter-rater reliability (IRR) as in the helpfulness human evaluation. The IRR scores range from $0.70t o0.95$ depending on the annotation batch, indicating a high degree of agreement among annotators on safety assessments. On LLAMA 2-CHAT annotations, the average IRR is 0.92 according to Gwet's AC2 measure. We see lower IRR scores on batches where the models have a high violation rate (e.g., Vicuna) and higher IRR scores on batches where the models have relatively low violation rates (e.g., LLAMA 2-CHAT, Falcon, and ChatGPT).

我们将评分 1 或 2 视为违规,并使用违规百分比作为主要评估指标,平均评分为辅助指标。每个示例由三位标注者进行标注,我们采用多数投票来确定响应是否违规。我们使用 Gwet's AC1/2 统计来衡量标注者间的一致性 (IRR),如同在有用性人工评估中一样。IRR 分数范围从 0.70 到 0.95,取决于标注批次,表明标注者在安全性评估上具有高度一致性。在 LLAMA 2-CHAT 标注中,根据 Gwet's AC2 测量的平均 IRR 为 0.92。我们发现在模型违规率较高的批次(例如 Vicuna)中,IRR 分数较低;而在模型违规率相对较低的批次(例如 LLAMA 2-CHAT、Falcon 和 ChatGPT)中,IRR 分数较高。


Figure 18: Single-turn and multi-turn violation percentage. Note that these results should be interpreted carefully due to limitations of the prompt set, subjectivity of the review guidelines, content standards, and individual raters.

图 18: 单轮和多轮违规百分比。注意,由于提示集的局限性、评审指南的主观性、内容标准以及个人评分员的影响,这些结果应谨慎解读。

We show the overall violation percentage and safety rating of various LLMs in Figure 17. LLAMA 2-CHAT has comparable or lower overall violation percentage across model sizes, while ChatGPT and Falcon (Almazrouei et al., 2023) come next, then MPT (MosaicML NLP Team et al., 2023) and Vicuna (Chiang et al., 2023). It is important to interpret these results carefully, as they are affected by limitations of the prompt set, subjectivity of the review guidelines, content standards, and subjectivity of individual raters. Upon manual analysis, we found that the response of Falcon is typically short (one or two sentences), thus iess prone to generating unsafe content but also generally less helpful. This is reflected by a large number of responses of Falcon with rating $\mathrm{\Delta=3}$ . As a result, we note that in Figure 17b the average rating of Falcon is much lower than LLAMA 2-CHAT (34B) although their violation percentages look similar (3.88 vs 4.45).

我们展示了各种大语言模型 (LLM) 的总体违规百分比和安全评级在图 17 中。LLAMA 2-CHAT 在不同模型尺寸下具有相当或更低的总体违规百分比,其次是 ChatGPT 和 Falcon (Almazrouei et al., 2023),然后是 MPT (MosaicML NLP Team et al., 2023) 和 Vicuna (Chiang et al., 2023)。仔细解释这些结果非常重要,因为它们受到提示集的局限性、审查指南的主观性、内容标准以及个人评分者的主观性的影响。通过手动分析,我们发现 Falcon 的响应通常较短(一到两句话),因此不太容易生成不安全的内容,但通常也较为无帮助。这反映在 Falcon 大量响应的评分为 $\mathrm{\Delta=3}$ 。因此,我们注意到在图 17b 中,尽管 Falcon 和 LLAMA 2-CHAT (34B) 的违规百分比看起来相似(3.88 对 4.45),但 Falcon 的平均评分要低得多。


Figure 19: Violation percentage per risk category. Note: these results should be interpreted carefully due to limitations of the prompt set, subjectivity of the review guidelines, content standards, and individual raters. Table 14: Evaluation of fine-tuned LLMs on different safety datasets. For TruthfulQA, we present the percentage of generations that are both truthful and informative (the higher the better). For ToxiGen, we present the percentage of toxic generations (the smaller the better).


图 19: 每个风险类别的违规百分比。注意:由于提示集的局限性、审查指南的主观性、内容标准以及个人评分者的不同,这些结果应谨慎解读。表 14: 不同安全数据集上微调后的大语言模型 (LLM) 的评估。对于 TruthfulQA,我们展示了既真实又具有信息量的生成内容的百分比(越高越好)。对于 ToxiGen,我们展示了有毒内容生成的百分比(越低越好)。

In Figure 18, we report the violation percentage on single- and multi-turn conversations, respectively. A trend across models is that multi-turn conversations are more prone to inducing unsafe responses. That said, LLAMA 2-CHAT still performs well compared to baselines, especially on multi-turn conversations. We also observe that Falcon performs particularly well on single-turn conversations (largely due to its conciseness) but much worse on multi-turn conversations, which could be due to its lack of multi-turn supervised fine-tuning data. In Figure 19, we show the per-category safety violation percentage of different LLMs. While model perfor mance is similar across categories, LLAMA 2-CHAT has relatively more violations under the unqualified advice category (although still low in an absolute sense), for various reasons, including lack of an appropriate disclaimer(e.g., $\mathrel{\breve{\prime}}!I$ am not a professional") at times. For the other two categories, LLAMA 2-CHAT achieves comparable or lower violation percentage consistently regardless of model sizes.

图 18: 我们分别报告了单轮和多轮对话中的违规百分比。跨模型的趋势是,多轮对话更容易引发不安全的响应。尽管如此,LLAMA 2-CHAT 相对于基线模型仍然表现良好,尤其是在多轮对话中。我们还观察到 Falcon 在单轮对话中表现出色(主要是由于其简洁性),但在多轮对话中的表现要差得多,这可能是由于缺乏多轮监督微调数据。

图 19: 我们展示了不同大语言模型在各分类下的安全违规百分比。虽然模型在各类别中的表现相似,但 LLAMA 2-CHAT 在不合格建议类别下的违规次数相对较多(尽管绝对数量仍然较低),原因包括有时缺乏适当的免责声明(例如,“$\mathrel{\breve{\prime}}!I$ 不是专业人士”)。对于其他两个类别,无论模型大小如何,LLAMA 2-CHAT 的违规百分比始终保持在可比或更低水平。

Truthfulness, Toxicity, and Bias. In Table 14, fine-tuned LLAMA 2-CHAT shows great improvement over the pretrained LLAMA 2 in terms of truthfulness $50.18\rightarrow64.14$ for 70B) and toxicity $24.60^{\circ}\dot{\rightarrow},0.01$ for70B). The percentage of toxic generations shrinks to effectively $0%$ for LLAMA 2-CHAT of all sizes: this is the lowest toxicity level among all compared models. In general, when compared to Falcon and MPT, the fine-tuned LLAMA 2-CHAT shows the best performance in terms of toxicity and truthfulness. After fine-tuning, LLAMA 2-CHAT tends to have an increase in positive sentiment overall for many of the demographic groups in BOLD. In Appendix A.4.8, we present a detailed score breakdown of model generation sentiment across different subgroups for the bias benchmark, along with more in-depth analyses and results of truthfulness and bias.

表 14: 在真实性 (50.18→64.14) 和毒性 (24.60°→0.01) 方面,微调后的 LLAMA 2-CHAT 在 70B 模型上相比预训练的 LLAMA 2 有显著改进。有毒生成的比例在所有规模的 LLAMA 2-CHAT 中有效降低到 0%:这是所有比较模型中最低的毒性水平。总体而言,与 Falcon 和 MPT 相比,微调后的 LLAMA 2-CHAT 在毒性和真实性方面表现出最佳性能。微调后,LLAMA 2-CHAT 对于 BOLD 中的许多人口统计群体的整体正面情感有所增加。在附录 A.4.8 中,我们展示了不同子群体的模型生成情感得分的详细分解,以及关于偏差基准的更深入分析和真实性和偏差的结果。

TruthfulQA↑ ToxiGen↓
ChatGPT 78.46 0.20
Falcon-instruct 7B 28.03 7.89
MPT-instruct 7B 29.99 16.33
LLAMA2-CHAT 7B 57.04 0.00
13B 62.18 0.00
34B 67.20 0.02
70B 64.14
0.01

5 Discussion

5 讨论

Here, we discuss the interesting properties we have observed with RLHF (Section 5.1). We then discuss the limitations of LLAMA 2-CHAT (Section 5.2). Lastly, we present our strategy for responsibly releasing these models (Section 5.3).

在这里,我们讨论了我们在 RLHF (第 5.1 节) 中观察到的有趣特性。然后我们讨论了 LLAMA 2-CHAT 的局限性 (第 5.2 节)。最后,我们介绍了我们负责任地发布这些模型的策略 (第 5.3 节)。

5.1 Learnings and Observations

5.1 学习和观察

Our tuning process revealed several interesting results, such as LLAMA 2-CHAT's abilities to temporally organize its knowledge, or to call APIs for external tools.

我们的调优过程揭示了几个有趣的结果,例如 LLAMA 2-CHAT 具有按时间组织其知识的能力,或者调用外部工具的 API 的能力。


Figure 20: Distribution shift for progressive versions of LLAMA 2-CHAT, from SFT models towards RLHF.


图 20: LLAMA 2-CHAT 进阶版本的分布偏移,从 SFT 模型到 RLHF。

Beyond Human Supervision. At the outset of the project, many among us expressed a preference for supervised annotation, attracted by its denser signal. Meanwhile reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. Our findings underscore that the crucial determinant of RLHF's success lies in the synergy it fosters between humans and LLMs throughout the annotation process.

超越人类监督。在项目初期,我们中的许多人倾向于监督标注,被其更密集的信号所吸引。与此同时,强化学习以其不稳定性而闻名,对于自然语言处理 (NLP) 研究社区的人来说似乎是一个较为神秘的领域。然而,强化学习证明了其高效性,特别是在成本和时间效益方面。我们的研究结果强调,RLHF 成功的关键在于它在整个标注过程中促进的人类与大语言模型 (LLM) 之间的协同作用。

Even with proficient annotators, each individual writes with significant variation. A model fine-tuned on SFT annotation learns this diversity, including, unfortunately, the tail-end of poorly executed annotation. Furthermore, the model's performance is capped by the writing abilities of the most skilled annotators. Human annotators are arguably less subject to discrepancy when comparing two outputs preference annotation for RLHF. Consequently, the reward mechanism swiftly learns to assign low scores to undesirable tail-end distribution and aligns towards the human preference. This phenomena is illustrated in Figure 20, where we can see that the worst answers are progressively removed, shifting the distribution to the right.

即使有熟练的标注人员,每个人的书写风格也存在显著差异。在SFT标注上微调的模型学会了这种多样性,包括不幸的是,那些执行不佳的标注的尾部部分。此外,模型的表现受到最熟练标注人员写作能力的限制。相比之下,人类标注人员在为RLHF进行两个输出的偏好标注时,可能较少受到差异的影响。因此,奖励机制能够迅速学习为不理想的尾部分布分配低分,并与人类偏好对齐。这一现象如图 20 所示,在图中我们可以看到最差的答案逐渐被移除,分布向右移动。

In addition, during annotation, the model has the potential to venture into writing trajectories that even the best annotators may not chart. Nonetheless, humans can still provide valuable feedback when comparing two answers, beyond their own writing competencies. Drawing a parallel, while we may not all be accomplished artists, our ability to appreciate and critique art remains intact. We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF, as documented in Gilardi et al. (2023) and Huang et al. (2023). Supervised data may no longer be the gold standard, and this evolving circumstance compels a re-evaluation of the concept of "supervision."

此外,在标注过程中,模型有可能探索出连最好的标注者也无法绘制的写作轨迹。尽管如此,人类在比较两个答案时仍然可以提供有价值的反馈,这超出了他们自身的写作能力。打个比方,虽然我们并非都是成就斐然的艺术家,但我们欣赏和批评艺术的能力依然存在。我们认为,大语言模型 (LLM) 的卓越写作能力,如在某些任务中超越人类标注者的表现,从根本上是由 RLHF 驱动的,如 Gilardi 等 (2023) 和 Huang 等 (2023) 所记录的那样。监督数据可能不再是黄金标准,这种不断演变的情况迫使我们重新评估“监督”这一概念。

In-Context Temperature Rescaling. β We have observed an intriguing phenomenon related to RLHF, a feature not previously reported to the best of our knowledge: the dynamic re-scaling of temperature contingent upon the context. As indicated in Figure 8, the temperature appears to be influenced by RLHF. Yet, intriguingly, our findings also revealed that the shifts are not uniformly applied across all prompts, as shown in Figure 21.

上下文温度重新缩放 (In-Context Temperature Rescaling)。β 我们观察到一个与 RLHF 相关的有趣现象,据我们所知,这一特征之前未被报道过:根据上下文动态重新调整温度。如图 8 所示,温度似乎受到 RLHF 的影响。然而,有趣的是,我们的研究结果还显示,这些变化并不是均匀应用于所有提示,如图 21 所示。

For instance, when it comes to prompts associated with creativity, such as "Write a poem," an increase in temperature continues to generate diversity across our various RLHF iterations. This can be observed in the Self-BLEU slope, which mirrors a pattern comparable to that of the SFT model.

例如,在与创造力相关的提示中,如“写一首诗”,温度的增加继续在我们各个 RLHF 迭代中生成多样性。这可以从 Self-BLEU 斜率观察到,它反映了与 SFT 模型相似的模式。

On the other hand, for prompts based on factual information, such as "What is the capital of $?^{\prime\prime}$ theSelf-BLEU slope diminishes over time. This pattern suggests that despite the rising temperature, the model learns to consistently provide the same response to factual prompts.

另一方面,对于基于事实信息的提示,例如“什么是 $?^{\prime\prime}$ 的首都”,Self-BLEU 值随时间减少。这种模式表明,尽管温度升高,模型仍然学会对事实提示做出一致的响应。


Figure 21: RLHF learns to adapt the temperature with regard to the type of prompt. Lower Self-BLEU corresponds to more diversity: RLHF eliminates diversity in responses to factual prompts but retains more diversity when generating responses to creative prompts. We prompt each model with a diverse set of 10 creative and 10 factual instructions and sample 25 responses. This is repeated for the temperatures $T\in{k/10\mid k\in\mathbb{N}:1\leq k\leq15}$ . For each of the 25 responses we compute the Self-BLEU metric and report the mean and standard deviation against the temperature.

图 21: 强化学习与人类反馈 (RLHF) 学会了根据提示类型调整温度。较低的 Self-BLEU 对应更高的多样性:RLHF 在事实性提示的回答中消除了多样性,但在生成创造性提示的回答时保留了更多多样性。我们用一组多样的 10 条创造性指令和 10 条事实性指令对每个模型进行提示,并采样 25 个回答。这一过程在温度 $T\in{k/10\mid k\in\mathbb{N}:1\leq k\leq15}$ 下重复进行。对于每个 25 个回答,我们计算 Self-BLEU 指标并报告其均值和标准差与温度的关系。


Figure 22: Time awareness — illustration of our model generalizing the notion of time, with 1,000 SFT time-focused data.

图 22: 时间感知 —— 我们的模型对时间概念的泛化示例,使用了 1,000 个 SFT 时间聚焦数据。

LLAMA 2-CHAT Temporal Perception Our model showcased impressive generalization ability, as shown in Figure 22. We manually tested dozens of examples and observed consistently that our model demonstrates a robust capability to organize its knowledge in a temporal manner, even when provided with minimal data. To instill a concept of time in LLAMA 2-CHAT, we collected a set of 1,000 SFT examples that were related to specific dates. These examples included questions like "How long ago did Barack Obama become president?" Each was associated with two critical pieces of metadata: the date when the query was posed —- which influenced the response - and the event date, a point in time prior to which the question would be nonsensical.

LLAMA 2-CHAT 时间感知

我们的模型展示了令人印象深刻的一般化能力,如图 22 所示。我们手动测试了数十个例子,并一致观察到我们的模型具备强大的能力以时间顺序组织其知识,即使在提供极少数据的情况下也是如此。为了使 LLAMA 2-CHAT 具备时间概念,我们收集了一组 1,000 个与特定日期相关的 SFT 示例。这些示例包括诸如“巴拉克·奥巴马是多久前成为总统的?”之类的问题。每个示例都关联有两个关键的元数据:提问的日期——这会影响回答——和事件日期,即在此日期之前问题是没有意义的时间点。

The observation suggests that LLMs have internalized the concept of time to a greater extent than previously assumed, despite their training being solely based on next-token prediction and data that is randomly shuffled without regard to their chronological context.

观察结果表明,大语言模型 (LLM) 对时间概念的理解程度比之前认为的要深,尽管它们的训练仅基于下一个 Token 预测,并且数据是随机打乱而不考虑其时间顺序的。

Tool Use Emergence The integration of LLMs with tools is a growing research area, as highlighted in Mialon et al. (2023). The approach devised in Toolformer (Schick et al., 2023) entails the sampling of millions

工具使用 emergence 大语言模型与工具的集成是一个不断增长的研究领域,如 Mialon 等 (2023) 所强调的。Toolformer (Schick 等, 2023) 中提出的方法包括采样数百万

模型 ASDiv SVAMP MAWPS
OPT-66B 6.0 4.9 7.9
GPT-J 7.5 5.2 9.9
GPT-J + CC 9.6 5.0 9.3
GPT-3 14.0 10.0 19.8
Toolformer 40.4 29.4 44.0
LLAMA2-CHAT 67.1 69.2 82.4

Table 15: Performance with tool use. Evaluation on the math datasets used in Toolformer. For different baselines, we report the scores from Schick et al. (2023).

表 15: 使用工具的性能。在 Toolformer 使用的数学数据集上进行评估。对于不同的基准,我们报告 Schick et al. (2023) 的分数。

of trajectories, complemented by the formulation of few-shot examples for each tool. Nonetheless, this technique was only applied using a single tool per example, and would not scale for a sequence of tool usage.

轨迹的案例,辅以每个工具的少样本 (Few-shot) 示例。然而,这项技术在每个示例中仅使用了一个工具,因此无法扩展到一系列工具的使用场景。


Figure 23: Tool use emergence. LLAMA 2-CHAT is able to understand the tools's applications, and the API arguments, just through the semantics, despite never having been trained to use tools.

图 23: 工具使用涌现。LLAMA 2-CHAT 能够通过语义理解工具的应用和 API 参数,尽管从未接受过使用工具的训练。

The release of OpenAI's pluginst# has incited substantial discourse within the academic community, igniting questions such as: How can we effectively teach models to utilize tools? or Does the process necessitate a substantial dataset? Our experiments indicate that tool usage can spontaneously emerge from alignment in a zero-shot manner. Although we never explicitly annotate tool-use usage, Figure 23 exhibits an instance where the model demonstrated the capability to utilize a sequence of tools in a zero-shot context.

OpenAI 插件的发布在学术界引起了广泛的讨论,引发了诸如以下问题:我们如何有效地教模型使用工具?或者这个过程是否需要大量的数据集?我们的实验表明,工具的使用可以以零样本的方式自发地从对齐中出现。尽管我们从未明确标注工具的使用,图 23 展示了一个实例,其中模型展示了在零样本情境下使用一系列工具的能力。

In addition, our study extended to evaluating the LLAMA 2-CHAT with access to a calculator. The results from this particular experiment are documented in Table 15. LLM tool use, while exciting, can also cause some safety concerns. We encourage more community research and red teaming in this area.

此外,我们的研究还扩展到评估具有计算器访问权限的 LLAMA 2-CHAT。此实验的结果记录在表 15: 大语言模型工具的使用,虽然令人兴奋,但也可能引发一些安全问题。我们鼓励在此领域进行更多的社区研究和红队测试。

5.2 Limitations and Ethical Considerations

5.2 限制和伦理考量

LLAMA 2-CHAT is subject to the same well-recognized limitations of other LLMs, including a cessation of knowledge updates post-pre training, potential for non-factual generation such as unqualified advice, and a propensity towards hallucinations.

LLAMA 2-CHAT 受到与其他大语言模型 (LLM) 相同的公认限制,包括预训练后知识更新的停止、可能生成不准确的内容(如不合格的建议),以及倾向于产生幻觉。

Furthermore, our initial version of LLAMA 2-CHAT predominantly concentrated on English-language data. While our experimental observations suggest the model has garnered some proficiency in other languages, its proficiency is limited, due primarily to the limited amount of pre training data available in non-English languages (as documented in Table 10). Consequently, the model's performance in languages other than English remains fragile and should be used with caution.

此外,我们最初的 LLAMA 2-CHAT 版本主要集中在英语数据上。虽然我们的实验观察表明该模型在其他语言中获得了一些能力,但由于非英语语言的预训练数据量有限(如表 10 所示),其能力是有限的。因此,该模型在非英语语言中的表现仍然不稳定,使用时应谨慎。

Like other LLMs, LLAMA 2 may generate harmful, offensive, or biased content due to its training on publicly available online datasets. We attempted to mitigate this via fine-tuning, but some issues may remain, particularly for languages other than English where publicly available datasets were not available. We will continue to fine-tune and release updated versions in the future as we progress on addressing these issues.

与其他大语言模型 (LLM) 一样,LLAMA 2 可能会由于在其公开可用的在线数据集上进行训练而生成有害、冒犯性或有偏见的内容。我们尝试通过微调来减轻这些问题,但某些问题可能仍然存在,特别是在英语以外的语言中,因为公开可用的数据集不足。我们将继续微调并在未来发布更新版本,以逐步解决这些问题。

Not everyone who uses AI models has good intentions, and conversational AI agents could potentially be used for nefarious purposes such as generating misinformation or retrieving information about topics like bio terrorism or cybercrime. We have, however, made efforts to tune the models to avoid these topics and diminish any capabilities they might have offered for those use cases.

并非所有使用 AI 模型的人都有善意,对话式 AI 智能体可能会被用于生成虚假信息或检索有关生物恐怖主义或网络犯罪等话题的信息等不良目的。然而,我们已经努力调整这些模型,以避免涉及这些话题,并减少它们在这些用例中可能提供的任何能力。

While we attempted to reasonably balance safety with helpfulness, in some instances, our safety tuning goes too far. Users of LLAMA 2-CHAT may observe an overly cautious approach, with the model erring on the side of declining certain requests or responding with too many safety details.

虽然我们试图在安全性和有用性之间做出合理的平衡,但在某些情况下,我们的安全调整可能过于严格。使用 LLAMA 2-CHAT 的用户可能会发现模型采取了过于谨慎的态度,拒绝某些请求或回应时提供了过多的安全细节。

Users of the pretrained models need to be particularly cautious, and should take extra steps in tuning and deployment as described in our Responsible Use Guide. ss

预训练模型的用户需要特别谨慎,并应按照我们的《负责任使用指南》中的描述,在调优和部署时采取额外的步骤。

5.3 Responsible Release Strategy

5.3 负责任的发布策略

Release Details. We make LLAMA 2 available for both research and commercial use at https : //ai .meta. com/resources/models-and-libraries/llama/. Those who use LLAMA 2 must comply with the terms of the provided license and our Acceptable Use Policy, which prohibit any uses that would violate applicable policies, laws, rules, and regulations.

发布详情。我们使 LLAMA 2 可用于研究和商业用途,网址为 https://ai.meta.com/resources/models-and-libraries/llama/ 。使用 LLAMA 2 的用户必须遵守提供的许可证条款和我们的可接受使用政策,这些政策禁止任何违反适用政策、法律、规则和法规的使用。

We also provide code examples to help developers replicate our safe generations with LLAMA 2-CHAT and apply basic safety techniques at the user input and model output layers. These code samples are available here: https : //github. com/facebook research/llama. Finally, we are sharing a Responsible Use Guide, which provides guidelines regarding safe development and deployment.

我们还提供了代码示例,以帮助开发人员使用 LLAMA 2-CHAT 复制我们的安全生成,并在用户输入和模型输出层应用基本的安全技术。这些代码示例可在以下链接获取:https://github.com/facebookresearch/llama。最后,我们分享了一份负责任使用指南,其中提供了关于安全开发和部署的指导方针。

Responsible Release. While many companies have opted to build AI behind closed doors, we are releasing LLAMA 2 openly to encourage responsible AI innovation. Based on our experience, an open approach draws upon the collective wisdom, diversity, and ingenuity of the AI-practitioner community to realize the benefits of this technology. Collaboration will make these models better and safer. The entire AI community—academic researchers, civil society, policymakers, and industry-must work together to rigorously analyze and expose the risks of current AI systems and to build solutions that address potentially problematic misuse. This approach not only fosters real collaboration with diverse stakeholders-those beyond the walls of big tech companies—but also serves as the cornerstone for democratizing access to foundational models. As argued in Zellers et al. (2019b), open releases promote transparency and allow more people to access AI tools, democratizing the technology and decentralizing AI expertise. We believe that the decentralization of AI expertise does more than simply distribute knowledge—it stimulates innovation and accelerates progress in the industry. Lastly, openly releasing these models consolidates costs and eliminates barriers to entry, allowing small businesses to ieverage innovations in LLMs to explore and build text-generation use cases. Ultimately, we believe this will create a more level playing field for organizations of all sizes across the globe to benefit from the economic growth promised by the advancement of AI.

负责任的发布。虽然许多公司选择在封闭环境中构建 AI,但我们决定公开发布 LLAMA 2 以鼓励负责任的 AI 创新。根据我们的经验,开放的方法能够汇集 AI 实践者社区的集体智慧、多样性和创造力,从而实现这项技术的优势。合作将使这些模型变得更好、更安全。整个 AI 社区——学术研究人员、民间社会、政策制定者和行业——必须共同努力,严格分析并揭示当前 AI 系统的风险,并建立解决潜在误用问题的解决方案。这种方法不仅促进了与多元利益相关者的真正合作——那些超越大型科技公司围墙之外的人们——还为普及基础模型的访问奠定了基石。正如 Zellers 等人在 (2019b) 中所论述的,公开发布促进透明度,让更多人可以访问 AI 工具,使技术民主化并分散 AI 专业知识。我们认为,AI 专业知识的分散不仅仅简单地分配知识——它刺激了创新并加速了行业的进步。最后,公开发布这些模型整合了成本并消除了进入壁垒,使小企业能够利用大语言模型 (LLM) 的创新来探索和构建文本生成应用场景。最终,我们相信这将为全球各种规模的组织创造一个更加公平的竞争环境,使其从 AI 发展带来的经济增长中受益。

We know that not everyone who uses AI models has good intentions, and we acknowledge that there are reasonable concerns regarding the ways that AI will impact our world. Toxic content generation and problematic associations are meaningful risks that the AI community has yet to fully mitigate. As this paper illustrates, we have made strides in limiting the prevalence of these types of responses. While we recognize there is more work to be done, this realization only deepens our commitment to open science and collaboration with the AI community.

我们知道,并非所有使用 AI 模型的人都怀有善意,我们也承认对于 AI 将如何影响我们的世界存在合理的担忧。有毒内容生成和问题关联是 AI 社区尚未完全缓解的重要风险。正如本文所阐述的,我们在限制这些类型响应的普遍性方面已经取得了一定进展。虽然我们认识到还有更多的工作要做,但这一认识更加坚定了我们对开放科学以及与 AI 社区合作的承诺。

6 Related Work

6 相关工作

Large Language Models. The recent years have witnessed a substantial evolution in the field of LLMs. Following the scaling laws of Kaplan et al. (2020), several Large Language Models with more than 100B parameters have been proposed, from GPT-3 (Brown et al., 2020) to Gopher (Rae et al., 2022) or specialized models, e.g. Galactica, for science(Taylor et al., 2022). With 70B parameters, Chinchilla (Hoffmann et al., 2022) redefined those scaling laws towards the number of tokens rather than model weights. Notable in this progression is the rise of Llama, recognized for its focus on computational efficiency during inference (Touvron et al., 2023). A parallel discourse has unfolded around the dynamics of open-source versus closedsource models. Open-source releases like BLOOM (Sca0 et al., 2022), OPT(Zhang et al., 2022), and Falcon (Penedo et al., 2023) have risen to challenge their closed-source counterparts like GPT-3 and Chinchilla.

大语言模型。近年来,大语言模型 (LLM) 领域经历了显著的演变。根据 Kaplan 等人 (2020) 的扩展定律,已经提出了多个参数超过 100B 的大语言模型,从 GPT-3 (Brown 等人, 2020) 到 Gopher (Rae 等人, 2022),或专门化模型,例如 Galactica,用于科学领域 (Taylor 等人, 2022)。Chinchilla (Hoffmann 等人, 2022) 拥有 70B 参数,重新定义了这些扩展定律,转向关注 Token 数量而非模型权重。在这个进展中值得注意的是 Llama 的崛起,因其在推理过程中注重计算效率而受到认可 (Touvron 等人, 2023)。与此同时,关于开源与闭源模型动态的讨论也在展开。像 BLOOM (Sca0 等人, 2022),OPT (Zhang 等人, 2022),和 Falcon (Penedo 等人, 2023) 这样的开源版本已经崛起,挑战像 GPT-3 和 Chinchilla 这样的闭源模型。

Yet, when it comes to the "production-ready" LLMs such as ChatGPT, Bard, and Claude, there's a marked distinction in performance and usability. These models rely on intricate tuning techniques to align with human preferences (Gudibande et al., 2023), a process that is still being explored and refined within the open-source community.

然而,当涉及到“生产就绪”的大语言模型 (LLM) 如 ChatGPT、Bard 和 Claude 时,在性能和可用性上有着明显的区别。这些模型依赖于复杂的调优技术来与人类偏好对齐 (Gudibande et al., 2023),这一过程仍在开源社区中被探索和改进。

Attempts to close this gap have emerged, with distillation-based models such as Vicuna (Chiang et al., 2023) and Alpaca ( Taori et al., 2023) adopting a unique approach to training with synthetic instructions (Honovich et al., 2022; Wang et al., 2022). However, while these models show promise, they still fall short of the bar set by their closed-source counterparts.

尝试弥合这一差距的方法已经出现,例如采用独特训练方法的蒸馏模型 Vicuna (Chiang et al., 2023) 和 Alpaca (Taori et al., 2023),它们使用合成指令进行训练 (Honovich et al., 2022; Wang et al., 2022)。然而,尽管这些模型显示出潜力,但它们仍未能达到其闭源同类产品的水平。

Instruction Tuning. Wei et al. (2021) obtained zero-shot performance on unseen tasks by fine-tuning LLMs on numerous datasets. Chung et al. (2022) and Longpre et al. (2023) investigate the impact of instruction tuning as a function of number of tasks, model size, prompt settings, etc. Prompts used for instruction tuning can be created by humans or by LLMs themselves (Zhou et al., 2022), and follow-up instructions can be used to refine initial generations to make them more useful, engaging, and unbiased (Ganguli et al., 2023; Madaan et al., 2023). An approach related to instruction tuning is chain-of-thought prompting (Wei et al., 2022b), in which models are prompted to explain their reasoning when given a complex problem, in order to increase the likelihood that their final answer is correct.

指令微调。Wei 等 (2021) 通过在大量数据集上微调大语言模型 (LLM) 获得了零样本性能。Chung 等 (2022) 和 Longpre 等 (2023) 研究了任务数量、模型大小、提示设置等因素对指令微调的影响。用于指令微调的提示可以由人类或大语言模型本身创建 (Zhou 等, 2022),并且后续指令可用于改进初始生成内容,使其更加有用、吸引人和无偏见 (Ganguli 等, 2023; Madaan 等, 2023)。与指令微调相关的一种方法是链式思维提示 (Wei 等, 2022b),在这种方法中,当给定复杂问题时,模型被提示解释其推理过程,以增加最终答案正确的可能性。

RLHF has emerged as a powerful strategy for fine-tuning Large Language Models, enabling significant improvements in their performance (Christiano et al., 2017). The method, first showcased by Stiennon et al. (2020) in the context of text-sum mari z ation tasks, has since been extended to a range of other applications. In this paradigm, models are fine-tuned based on feedback from human users, thus iterative ly aligning the models' responses more closely with human expectations and preferences.

RLHF 已经成为微调大语言模型 (LLM) 的一种强大策略,能够显著提高其性能 (Christiano 等, 2017)。该方法首先由 Stiennon 等 (2020) 在文本摘要任务中展示,此后已扩展到其他多个应用领域。在这种范式下,模型根据来自人类用户的反馈进行微调,从而迭代地使模型的响应更符合人类的期望和偏好。

Ouyang et al. (2022) demonstrates that a combination of instruction fine-tuning and RLHF can help fix issues with factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LLMs. Bai et al. (2022b) partially automates this fine-tuning-plus-RLHF approach by replacing the human-labeled fine-tuning data with the model's own self-critiques and revisions, and by replacing human raters with a model when ranking model outputs in RLHF, a process known as $^{\prime}\mathrm{RL}$ from AI Feedback" (RLAIF).

欧阳等人 (2022) 指出,指令微调和基于人类反馈的强化学习 (RLHF) 的结合可以帮助解决大语言模型在事实性、毒性及有用性方面的问题,而这些问题仅靠扩大模型规模是无法解决的。白等人 (2022b) 部分自动化了这种微调加 RLHF 的方法,通过用模型自身的自我批评和修订替代人工标注的微调数据,并在 RLHF 中用模型替代人工评估者对模型输出进行排名,这一过程被称为 “RL 从 AI 反馈” (RLAIF)。

Known LLM Safety Challenges. Recent literature has extensively explored the risks and challenges linked With Large Language Models. Bender et al. (2021b) and Weidinger et al. (2021) underscore various hazards like bias, toxicity, private data leakage, and the potential for malicious uses. Solaiman et al. (2023) categorizes these impacts into two groups -- those that can be assessed within the base system and those requiring a societal context evaluation, while Kumar et al. (2022) offers potential mitigation strategies to curb harm. Work from Roller et al. (2020) and Dinan et al. (2021) also illuminates the difficulties tied to chatbot-oriented LLMs, with concerns ranging from privacy to misleading expertise claims. Deng et al. (2023) proposes a taxonomic framework to tackle these issues, and Bergman et al. (2022) delves into the balance between potential positive and negative impacts from releasing dialogue models.

已知的大语言模型安全挑战。近期文献广泛探讨了与大语言模型 (LLM) 相关的风险和挑战。Bender 等 (2021b) 和 Weidinger 等 (2021) 强调了各种危害,如偏差、毒性、私人数据泄露以及恶意使用的可能性。Solaiman 等 (2023) 将这些影响分为两类——可以在基础系统内评估的影响和需要社会背景评估的影响,而 Kumar 等 (2022) 提出了潜在的缓解策略以减少危害。Roller 等 (2020) 和 Dinan 等 (2021) 的研究也揭示了面向聊天机器人的大语言模型所面临的困难,问题从隐私到误导性的专业声明不一而足。Deng 等 (2023) 提出了一种分类框架来解决这些问题,而 Bergman 等 (2022) 则深入探讨了发布对话模型所带来的潜在正面和负面影响之间的平衡。

Investigations into red teaming reveal specific challenges in tuned LLMs, with studies by Ganguli et al. (2022) and Zhuo et al. (2023) showcasing a variety of successful attack types and their effects on the generation of harmful content. National security agencies and various researchers, such as (Mialon et al., 2023), have also raised red flags around advanced emergent model behaviors, cyber threats, and potential misuse in areas like biological warfare. Lastly, broader societal issues like job displacement due to accelerated AI research and an over-reliance on LLMs ieading to training data degradation are also pertinent considerations (Acemoglu and Restrepo, 2018; Autor and Salomons, 2018; Webb, 2019; Shumailov et al., 2023). We are committed to continuing our work engaging with the broader policy, academic, and industry community on these issues.

对红队测试的调查揭示了在调优的大语言模型 (LLM) 中的具体挑战,Ganguli 等人 (2022) 和 Zhuo 等人 (2023) 的研究展示了多种成功的攻击类型及其对有害内容生成的影响。国家安全机构和各类研究人员,如 (Mialon 等人, 2023),也对高级新兴模型行为、网络威胁以及在生物战等领域的潜在滥用提出了警告。最后,更广泛的社会问题,如由于加速的 AI 研究导致的工作岗位流失以及对大语言模型的过度依赖导致训练数据退化,也是相关的重要考虑因素 (Acemoglu 和 Restrepo, 2018; Autor 和 Salomons, 2018; Webb, 2019; Shumailov 等人, 2023)。我们致力于继续与更广泛的政策、学术和行业社区合作,共同应对这些问题。

7 Conclusion

7 结论

In this study, we have introduced LLAMA 2, a new family of pretrained and fine-tuned models with scales of 7 billion to 70 billion parameters. These models have demonstrated their competitiveness with existing open-source chat models, as well as competency that is equivalent to some proprietary models on evaluation sets we examined, although they still lag behind other models like GPT-4. We meticulously elaborated on the methods and techniques applied in achieving our models, with a heavy emphasis on their alignment with the principles of helpfulness and safety. To contribute more significantly to society and foster the pace of research, we have responsibly opened access to LLAMA 2 and LLAMA 2-CHAT. As part of our ongoing commitment to transparency and safety, we plan to make further improvements to LLAMA 2-CHAT in future work.

在本研究中,我们介绍了 LLAMA 2,这是一组新的预训练和微调模型,参数规模从 70 亿到 700 亿。这些模型展示了与现有开源聊天模型的竞争性,并且在我们考察的评估集上表现出与某些专有模型相当的能力,尽管它们仍然落后于其他模型如 GPT-4。我们详细阐述了实现这些模型所应用的方法和技术,特别强调了其与有用性和安全性原则的一致性。为了更大地贡献社会并加速研究进展,我们负责任地开放了对 LLAMA 2 和 LLAMA 2-CHAT 的访问。作为我们持续致力于透明度和安全性的部分,我们计划在未来的工作中进一步改进 LLAMA 2-CHAT。

References

参考文献

Daron Acemoglu and Pascual Restrepo. Artificial intelligence, automation, and work. In The economics of artificial intelligence: An agenda, pages 197-236. University of Chicago Press, 2018.

达龙·阿西莫格鲁和帕斯卡尔·雷斯特雷波. 人工智能、自动化与工作. 在《人工智能经济学:议程》, 页码 197-236. 芝加哥大学出版社, 2018.

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlya nsk iy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023.

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, 和 Sumit Sanghai. GQA: 从多头检查点训练通用多查询 Transformer 模型, 2023.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. Falcon-40B: an open large language model with state-of-the-art performance.2023.

Ebtesam Almazrouei,Hamza Alobeidli,Abdulaziz Alshamsi,Alessandro Cappelli,Ruxandra Cojocaru,Merouane Debbah,Etienne Goffinet,Daniel Heslow,Julien Launay,Quentin Malartic,Badreddine Noune,Baptiste Pannier,and Guilherme Penedo. Falcon-40B:一个具有最先进性能的开源大语言模型. 2023.

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Diaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Itty cheri ah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report, 2023.

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Diaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Itty cheri ah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So,