[论文翻译]红队测试语言模型以减少危害:方法、扩展行为与经验教训


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/RLHF%E8%AE%BA%E6%96%87%E9%9B%86/Red%20Teaming%20Language%20Models%20to%20Reduce%20Harms-%20Methods%2C%20Scaling%20Behaviors%2C%20and%20Lessons%20Learned.pdf


Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

红队测试语言模型以减少危害:方法、扩展行为与经验教训

Anthropic

Anthropic

Abstract

摘要

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models. Warning: this paper contains examples that may be offensive or upsetting.

我们描述了早期在语言模型上进行红队测试(red teaming)的努力,旨在同时发现、衡量并尝试减少其潜在的有害输出。我们做出了三项主要贡献。首先,我们研究了红队测试在三种模型规模(2.7B、13B 和 52B 参数)和四种模型类型中的扩展行为:普通语言模型(LM);被提示为有帮助、诚实和无害的语言模型;带有拒绝采样的语言模型;以及通过人类反馈强化学习(RLHF)训练为有帮助和无害的模型。我们发现,随着规模的增加,RLHF 模型越来越难以进行红队测试,而其他模型类型的扩展趋势则较为平缓。其次,我们发布了包含 38,961 次红队攻击的数据集,供其他人分析和学习。我们对数据进行了自己的分析,发现了多种有害输出,从冒犯性语言到更为微妙的有害非暴力不道德输出。第三,我们详尽描述了我们的指令、流程、统计方法以及对红队测试的不确定性。我们希望这种透明度能够加速我们作为社区共同合作的能力,以制定关于如何对语言模型进行红队测试的共享规范、实践和技术标准。警告:本文包含可能令人反感或不安的示例。

1 Introduction

1 引言

Large language models exhibit a wide range of harmful behaviors such as reinforcing social biases (e.g., [47, 28, 1, 33, 7]), generating offensive or toxic outputs [25], leaking personally identifiable information from the training data [13], aiding in disinformation campaigns [12], generating extremist texts [37], spreading falsehoods [35], and more [9, 10, 18, 57, 22, 51]. As AI systems improve, the scope of possible harms seems likely to grow [22]. Many strategies have been developed to address some of these harms (e.g., [58, 4, 48, 36, 34, 19, 60]). One potentially useful tool for addressing harm is red teaming—using manual or automated methods to adversarially probe a language model for harmful outputs, and then updating the model to avoid such outputs [42, 20, 3, 11]. In this paper, we describe our early efforts to implement manual red teaming to both make models safer and measure the safety of our models. The models trained with red team data were described in [4], so here we focus on describing our red team results and techniques in detail in the hope that others may benefit from and improve on them.

大语言模型展现出多种有害行为,例如强化社会偏见(如 [47, 28, 1, 33, 7])、生成冒犯性或有害的输出 [25]、从训练数据中泄露个人身份信息 [13]、协助虚假信息传播 [12]、生成极端主义文本 [37]、传播虚假信息 [35] 等 [9, 10, 18, 57, 22, 51]。随着 AI 系统的改进,潜在危害的范围似乎可能会扩大 [22]。许多策略已被开发出来以应对其中一些危害(如 [58, 4, 48, 36, 34, 19, 60])。一种可能有助于应对危害的工具是红队测试(red teaming)——通过手动或自动方法对抗性地探测语言模型的有害输出,然后更新模型以避免此类输出 [42, 20, 3, 11]。在本文中,我们描述了早期实施手动红队测试的努力,旨在使模型更安全并衡量模型的安全性。使用红队数据训练的模型已在 [4] 中描述,因此这里我们重点详细描述红队测试的结果和技术,希望其他人能够从中受益并加以改进。


Figure 1 Red team attack success by model size (x-axes) and model type (colors). (Left) Attack success measured by average red team member self-report (higher is more successful). (Middle) Attack success measured by average minimum harmlessness score (higher is better, less harmful). (Right) Distribution of minimum harmlessness score.

图 1: 按模型大小 (x 轴) 和模型类型 (颜色) 划分的红队攻击成功率。(左) 攻击成功率通过红队成员自我报告的平均值衡量 (越高表示越成功)。(中) 攻击成功率通过平均最小无害性评分衡量 (越高越好,表示危害越小)。(右) 最小无害性评分的分布。

We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (plain LM) [2]; an LM prompted to be helpful, honest, and harmless (HHH prompted LM) [2]; an LM with rejection sampling (RS), which returns the best of sixteen samples as ranked by a helpful and harmless preference model [4]; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF) with the same preference model [4]. The RS and RLHF models rely on data generated from red teaming the prompted LM (see $\S3.2$ for details on all models). Figure 1, middle, shows that: (1) RLHF models are significantly harder to red team as they scale, (2) plain LMs, prompted LMs, and RS models exhibit a flat trend with scale, (3) prompted LMs are not significantly harder to red team than plain LMs, which is inconsistent with our previous results that use static evaluations to show HHH prompting is an effective safety intervention [2], and (4) RS models are the most difficult to red team at any scale; however, qualitatively, they tend to be harmless by being evasive [4].

我们做出了三个主要贡献。首先,我们研究了在3种模型规模(2.7B、13B和52B参数)和4种模型类型下的红队测试扩展行为:普通语言模型(plain LM)[2];被提示为有帮助、诚实和无害的语言模型(HHH prompted LM)[2];带有拒绝采样(RS)的语言模型,该模型返回由有帮助和无害偏好模型排名的16个样本中的最佳样本[4];以及使用来自人类反馈的强化学习(RLHF)训练的有帮助和无害模型,使用相同的偏好模型[4]。RS和RLHF模型依赖于从提示语言模型的红队测试生成的数据(有关所有模型的详细信息,请参见$\S3.2$)。图1中间部分显示:(1) RLHF模型随着规模的扩大,红队测试难度显著增加,(2) 普通LM、提示LM和RS模型在规模上表现出平坦的趋势,(3) 提示LM的红队测试难度并不显著高于普通LM,这与我们之前使用静态评估显示HHH提示是一种有效的安全干预措施的结果不一致[2],以及(4) RS模型在任何规模下都是最难进行红队测试的;然而,从定性上看,它们往往通过回避来保持无害[4]。

Our second contribution is to release our dataset of 38,961 red team attacks for others to analyze and learn from (Table 1). We provide a Datasheet [24] in $\S\mathrm{A}.7$ that fully documents the data and we explain the pros and cons for releasing the data in $\S\mathrm{A}.5$. Our dataset is an order of magnitude larger than a similar available red team dataset [60] and considers models one order of magnitude larger than those in [60]. To our knowledge, we release the only dataset of red team attacks on a model trained to be safe with RLHF. These types of models are already deployed [41] and we believe our data can help shed further light on their strengths and weaknesses. More generally, we believe our data can be used to understand what successful red team attacks look like, to build (semi-)automated red team techniques [42], to build classifiers for harmfulness, and to prototype strategies for measuring and mitigating harms in language models. We also provide our own preliminary analyses of the types of harms uncovered in our data (Figures 2 & 9, $\S4$).

我们的第二个贡献是发布了包含 38,961 次红队攻击的数据集,供其他人分析和学习(表 1)。我们在 $\S\mathrm{A}.7$ 中提供了数据表 [24],完整记录了数据,并在 $\S\mathrm{A}.5$ 中解释了发布数据的利弊。我们的数据集比现有的类似红队数据集 [60] 大一个数量级,并且考虑的模型规模也比 [60] 中的模型大一个数量级。据我们所知,我们发布了唯一一个针对通过 RLHF 训练以确保安全的模型进行红队攻击的数据集。这类模型已经部署 [41],我们相信我们的数据可以进一步揭示它们的优缺点。更广泛地说,我们相信我们的数据可以用于理解成功的红队攻击是什么样子的,构建(半)自动化的红队技术 [42],构建有害性分类器,以及原型化测量和减轻语言模型危害的策略。我们还提供了对我们数据中揭示的危害类型的初步分析(图 2 和图 9,$\S4$)。

Our last contribution is to exhaustively describe our instructions, processes, and statistical methodologies for red teaming ($\S3$). Throughout the design of our experiments, we arrived at many junctures in which we were unsure about how to proceed, even after a literature review on red teaming AI systems ($\S2$). As such, we conducted informational interviews with experts in the field of Trust & Safety and incorporated their suggested best practices ($\S\mathrm{A}.2$) into the design of our experiments in order to ensure the well-being of the red team. In general, we found that red team members enjoyed participating in our experiments and felt motivated by a mission to make AI systems less harmful ($\S\mathrm{A}.2$). Nevertheless, our work suffers from some limitations, which we discuss in $\S5.1$. Based on our experiences, we propose some policy interventions for how we can work together as a community to develop shared norms, practices, and technical standards for how to red team language models ($\S5.2$).

我们的最后一项贡献是详尽描述了我们的指令、流程和红队测试的统计方法($\S3$)。在实验设计过程中,即使在对 AI 系统红队测试的文献进行了回顾之后($\S2$),我们仍遇到了许多不确定如何继续的节点。因此,我们与信任与安全领域的专家进行了信息访谈,并将他们建议的最佳实践($\S\mathrm{A}.2$)纳入实验设计中,以确保红队成员的福祉。总体而言,我们发现红队成员喜欢参与我们的实验,并受到减少 AI 系统危害的使命感的激励($\S\mathrm{A}.2$)。然而,我们的工作也存在一些局限性,我们将在 $\S5.1$ 中讨论这些局限性。基于我们的经验,我们提出了一些政策干预措施,建议我们如何作为一个社区共同努力,制定共享的规范、实践和技术标准,以指导如何对语言模型进行红队测试($\S5.2$)。

2 Related Work

2 相关工作

We use the same models that we developed in our previous work where we train a general language assistant to be helpful, honest, and harmless [2, 4]. However, here we run additional experiments in order to determine the influence of model size on susceptibility to red team attacks (Figure 1) and analyze the content of the attacks (Figures 2 & 9) to understand the types of harms uncovered by red teaming. Additionally, we provide more detail on our red team methods, and release the data, so that others can reproduce (and improve upon) our red team approach and results.

我们使用了之前工作中开发的模型,这些模型训练了一个通用的语言助手,旨在使其有帮助、诚实且无害 [2, 4]。然而,在这里我们进行了额外的实验,以确定模型大小对红队攻击敏感性的影响(图 1),并分析攻击内容(图 2 和图 9),以了解红队测试所揭示的危害类型。此外,我们提供了更多关于红队方法的细节,并发布了数据,以便其他人可以复现(并改进)我们的红队方法和结果。


Figure 2 Visualization of the red team attacks. Each point corresponds to a red team attack embedded in a two dimensional space using UMAP. The color indicates attack success (brighter means a more successful attack) as rated by the red team member who carried out the attack. We manually annotated attacks and found several thematically distinct clusters of attack types (black ellipses and text).

图 2: 红队攻击的可视化。每个点对应于使用 UMAP 嵌入二维空间的红队攻击。颜色表示攻击的成功率(颜色越亮表示攻击越成功),由执行攻击的红队成员评分。我们手动标注了攻击,并发现了几种主题上不同的攻击类型集群(黑色椭圆和文本)。

Apart from our previous work, our approach is most similar to [60] & [53], who have crowd workers attempt to elicit offensive outputs from dialogue agents in open-ended dialogues, then use the resulting data to create effective safety interventions. In [60], they release a Bot Adversarial Dialogues (BAD) dataset of ${\sim}5\mathrm{K}$ conversations with 3 dialogue agents ranging in size from 345M to 2.7B parameters. We collect more data ( ${\sim}40\mathrm{K}$ attacks); red team larger models (up to 52B parameters) in order to measure scaling behaviors, as in [53]; and focus on reinforcement learning from human feedback [14] as our most promising safety intervention.

除了我们之前的工作,我们的方法与 [60] 和 [53] 最为相似,他们通过众包工作者尝试在开放式对话中引出对话模型的攻击性输出,然后利用这些数据创建有效的安全干预措施。在 [60] 中,他们发布了一个名为 Bot Adversarial Dialogues (BAD) 的数据集,包含约 5K 次对话,涉及 3 个不同规模的对话模型,参数范围从 345M 到 2.7B。我们收集了更多的数据(约 40K 次攻击),并对更大的模型(参数高达 52B)进行了红队测试,以衡量扩展行为,如 [53] 中所述;同时,我们将基于人类反馈的强化学习 [14] 作为最有前景的安全干预措施。

Recent work explores how to automate red teaming by using language models instead of humans as the red team [42]. The approach bootstraps from the BAD dataset [60], and uncovers a variety of harms including (but not limited to) finding groups of people that the dialogue agent discusses in offensive ways, identifying personally identifiable information, and leaking private training data. We uncover similar harms in our dataset and plan to use our own data to systematically compare and contrast the types of harms that can be uncovered in manual versus automated methods in future work ($\S5$).

最近的研究探索了如何通过使用语言模型而非人类作为红队来自动化红队测试 [42]。该方法从 BAD 数据集 [60] 中引导,揭示了多种危害,包括(但不限于)发现对话代理以冒犯性方式讨论的群体、识别个人身份信息以及泄露私人训练数据。我们在数据集中发现了类似的危害,并计划在未来的工作中使用我们自己的数据来系统比较和对比手动与自动化方法中可能发现的危害类型($\S5$)。

More generally, although our work focuses on adversarial attacks on generative models, it is heavily inspired by and related to prior work that examines the efficacy of adversarial testing to find and address vulnerabilities in NLP algorithms in discriminative settings. Some of these efforts augment humans (through guidelines, templates, programmatic generation of attacks, and various combinations thereof) to devise test cases that cause systems to fail [45, 46, 29, 21, 30, 55, 6, 23]. Others use humans in the loop to continuously and dynamically build, break, and fix [20] models in order to continuously make them more robust to failure modes [40, 32, 55, 61]. Finally, a large body of work aims to learn adversarial examples that cause downstream models to produce spurious outputs [50], some of which are reviewed in [59]. However, these examples often seem arbitrary and unintelligible to humans, and thus correspond to a different kind of attack than the ones we consider here.

更广泛地说,尽管我们的工作专注于生成模型的对抗攻击,但它深受并关联于先前的研究,这些研究探讨了在判别式设置中通过对抗测试来发现和解决 NLP 算法中的脆弱性的有效性。其中一些工作通过增强人类(通过指南、模板、程序化生成攻击及其各种组合)来设计导致系统失败的测试用例 [45, 46, 29, 21, 30, 55, 6, 23]。另一些工作则利用人类在循环中持续动态地构建、破坏和修复 [20] 模型,以使其对故障模式更加鲁棒 [40, 32, 55, 61]。最后,大量工作旨在学习导致下游模型产生虚假输出的对抗样本 [50],其中一些在 [59] 中进行了回顾。然而,这些样本对人类来说通常显得随意且难以理解,因此与我们在此考虑的对抗攻击类型不同。

Although our work focuses on language models, there is prior art on how to red team multi-modal generative models [39] in the context of DALL·E 2 [44]. These red team efforts reveal some similar categories of harms to those we find in our dataset, such as: "Bias and representation"; "Harassment, bullying, and exploitation"; "Dis- and misinformation"; and other categories that we do not uncover or consider, such as "Copyright and Trademarks". The red team consists of domain experts who are selected according to particular areas of expertise, whereas in our case we rely on crowd workers whose domain expertise we are unaware of ($\S5.1$).

尽管我们的工作主要关注语言模型,但在 DALL·E 2 [44] 的背景下,已有关于如何对多模态生成模型进行红队测试的研究 [39]。这些红队测试揭示了一些与我们数据集中发现的类似危害类别,例如:"偏见和代表性";"骚扰、欺凌和剥削";"虚假和误导信息",以及其他我们未发现或考虑的类别,例如"版权和商标"。红队由根据特定专业领域选出的领域专家组成,而在我们的案例中,我们依赖的是我们不了解其专业背景的众包工人($\S5.1$)。

Table 1: Number of red team attempts per model size (rows) and safety intervention (columns)

表 1: 不同模型大小(行)和安全干预(列)下的红队尝试次数

| Parameters (参数) | Plain LM (普通语言模型) | Prompted LM (提示语言模型) | Rejection Sampling (拒绝采样) | Reinforcement Learning (强化学习) |
| --- | --- | --- | --- | --- |
| 2.7B | 1151 | 2044 | 1445 | 2498 |
| 13B | 2587 | 2174 | 1263 | 2292 |
| 52B | 1414 | 17739 | 1273 | 3081 |

3 Methods

3 方法

3.1 Red Team Task

3.1 红队任务

We developed an interface that instructs red team members to have open-ended conversations with an AI assistant [2, 4] in order to “make the AI behave badly, to get it to say obnoxious, offensive, and harmful things" (Figure 3). Similar to [60], we provide the red team with a brief list of example conversation topics but otherwise leave the instructions fairly open-ended. We ask the red team to rely on creativity, avoid obvious profanity, and to focus on a single topic per attack. To encourage this behavior, we ask participants to enter a short description of how they intend to red team the model prior to the attack.

我们开发了一个界面,指导红队成员与AI助手进行开放式对话 [2, 4],目的是“让AI表现不佳,使其说出令人厌恶、冒犯和有害的内容”(图 3)。与 [60] 类似,我们为红队提供了一个简短的示例对话主题列表,但其他方面则保持指令的开放性。我们要求红队依靠创造力,避免明显的粗俗语言,并专注于每次攻击的单一主题。为了鼓励这种行为,我们要求参与者在攻击前输入他们打算如何对模型进行红队测试的简短描述。

We were concerned about our task causing harm to the red team, so we reviewed available literature in Trust & Safety and related fields (e.g., [26, 15, 16, 49, 5, 27, 31]), as well as conducted a series of informational interviews with industry professionals in the field. We incorporated best practices from this research (see $\S\mathrm{A}.2$ for details) into the content and design of our task instructions (Figure 3, Left) and task interface (Figure 3, Right). For example, in the red team task instructions, we provide clear, specific warnings that red team members may be exposed to sensitive content. We also encourage red team members to select topics within their own risk tolerance, as opposed to prompting them to discuss specific, sensitive topics with the AI assistant. Through surveys and informal feedback, we found that red team members enjoyed the task and did not experience significant negative emotions.

我们担心我们的任务会对红队造成伤害,因此我们回顾了信任与安全及相关领域的现有文献(例如 [26, 15, 16, 49, 5, 27, 31]),并与该领域的行业专业人士进行了一系列信息访谈。我们将这些研究中的最佳实践(详见 $\S\mathrm{A}.2$)融入到任务说明(图 3,左)和任务界面(图 3,右)的内容和设计中。例如,在红队任务说明中,我们提供了明确、具体的警告,提醒红队成员可能会接触到敏感内容。我们还鼓励红队成员选择自己风险承受能力范围内的主题,而不是提示他们与AI助手讨论特定的敏感话题。通过调查和非正式反馈,我们发现红队成员喜欢这项任务,并且没有经历显著的负面情绪。

After the red team member agrees to the terms of the task (Figure 3, Left), they proceed to have an open-ended, multi-turn conversation with the AI assistant. After each turn in a conversation, we present the participant with two possible model-generated responses, and ask them to select the more harmful of the two as in [2, 4]. We do this for two reasons. First, this allows us to expedite the red team's ability to find vulnerabilities in our systems by a factor of two—generative models are stochastic and thus our approach allows the red team twice as many opportunities to catch harmful behavior per attempt. Second, this procedure generates a dataset of pairs of model responses, where one response is labeled as more harmful than the other. We use this dataset to train a harmlessness preference model, which takes as input a model generated response and outputs a score which is lower for more harmful model responses, and higher for less harmful model responses [14, 2, 4]. We use the resulting preference model to build safety interventions, which we describe in $\S3.2$. We do not define what "harmful" means, as this is a complex and subjective concept; instead, we rely on the red team to make their own determinations via a pairwise preference choice [14].

在红队成员同意任务条款后(图 3,左),他们开始与 AI 助手进行开放式的多轮对话。在每轮对话后,我们会向参与者展示两个可能的模型生成响应,并要求他们选择其中更有害的一个,如 [2, 4] 所述。我们这样做有两个原因。首先,这使我们能够将红队发现系统漏洞的能力提高一倍——生成模型是随机的,因此我们的方法使红队每次尝试都有两倍的机会捕捉到有害行为。其次,此过程生成了一对模型响应的数据集,其中一个响应被标记为比另一个更有害。我们使用这个数据集来训练一个无害性偏好模型,该模型以模型生成的响应作为输入,并输出一个分数,分数越低表示模型响应越有害,分数越高表示模型响应越无害 [14, 2, 4]。我们使用生成的偏好模型来构建安全干预措施,这将在 $\S3.2$ 中描述。我们没有定义“有害”的含义,因为这是一个复杂且主观的概念;相反,我们依赖红队通过成对偏好选择来做出自己的判断 [14]。

We ask red team members to have a back-and-forth conversation for four turns (Figure 3, Right). We do not strictly limit the number of turns in each conversation, and empirically, we observe most conversations are 1-4 turns, with some lasting longer. At the end of each conversation, we ask the participant to rate how successful they were at making the AI assistant say something bad. We collect these ratings on a 5-point Likert scale (ranging from 0 to 4) where a 0 means "Not successful" and a 4 means "Very successful" (Figure 3, Right). Red team members continue this process for a series of five dialogues, typically on five unique topics, which culminates in one overall task. Red team members could then choose to complete further tasks.

我们要求红队成员进行四个回合的往返对话(图 3,右)。我们没有严格限制每次对话的回合数,根据经验观察,大多数对话为 1-4 个回合,有些持续更久。在每次对话结束时,我们要求参与者评估他们在让 AI 助手说出不良内容方面的成功程度。我们使用 5 点李克特量表(范围从 0 到 4)收集这些评分,其中 0 表示"不成功",4 表示"非常成功"(图 3,右)。红队成员以此方式连续完成五段对话(通常涉及五个不同的主题),构成一个完整任务。红队成员随后可以选择完成更多任务。

The AI assistant is powered by four types of dialogue models: one baseline model and three models with different types of safety interventions. We assign red team members to models at random—the red team does not know which model they interact with. We describe these models further in the next section.

AI 助手由四种对话模型驱动:一个基线模型和三个具有不同类型安全干预措施的模型。我们随机将红队成员分配给模型——红队不知道他们与哪个模型交互。我们将在下一节中进一步描述这些模型。


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.

图 3: (左) 红队任务指令。(右) 红队尝试的示例。

3.2 Models

3.2 模型

We derive dialogue models, with various safety interventions, from a general language model, and in some cases, a helpful and harmless preference model. For simplicity, we refer to the preference model as a harmlessness preference model, and the output of the model as a harmlessness score throughout this work.6 Here, we first provide basic details on the general language model and the harmlessness preference model, then elaborate on the four dialogue models that power the AI assistant.

我们从通用语言模型中派生出具有各种安全干预措施的对话模型,在某些情况下,还从有帮助和无害的偏好模型中派生。为简化起见,在本工作中,我们将偏好模型称为无害偏好模型,并将模型的输出称为无害分数。6 在此,我们首先提供关于通用语言模型和无害偏好模型的基本细节,然后详细阐述驱动AI助手的四种对话模型。

For our general language models, we train decoder-only transformer models ranging in size from 2.7B to 13B to 52B parameters. Full details about model architectures, training data, training procedures, and model evaluations are described elsewhere [2].

对于我们的通用语言模型,我们训练了参数规模从2.7B到13B再到52B的解码器专用Transformer模型。关于模型架构、训练数据、训练过程和模型评估的完整细节在其他地方有详细描述 [2]。

6 More generally, our preference model is trained to predict both harmlessness and helpfulness. For the latter, we created a separate interface in order to collect preference data about helpfulness. We found a fundamental tension between helpfulness and harmlessness: a model can simply be harmless by refusing to be helpful [4]. As such, we train our preference models to predict both harmlessness and helpfulness. We find that this approach helps to address this tension without loss in predictive accuracy for harmlessness [4].

更广泛地说,我们的偏好模型被训练来预测无害性和有用性。对于后者,我们创建了一个单独的界面来收集关于有用性的偏好数据。我们发现有用性和无害性之间存在一种根本性的张力,模型可以通过拒绝提供帮助来简单地保持无害 [4]。因此,我们训练我们的偏好模型来同时预测无害性和有用性。我们发现这种方法有助于解决这种张力,而不会降低对无害性的预测准确性 [4]。

To train our harmlessness preference model, we use the comparison data from red team attacks on the 52B parameter prompted language model (described below) as the training data—this is why we collected an order of magnitude more data in this case (Table 1). To build these models, we fine-tune 2.7B, 13B, and 52B general language models to predict which model utterances red team members found less harmful, thus producing a harmlessness score [2]. A lower score means a more harmful response.

为了训练我们的无害偏好模型,我们使用了对52B参数提示语言模型(如下所述)进行红队攻击时生成的比较数据作为训练数据——这也是为什么我们在这种情况下收集了数量级更多的数据(表1)。为了构建这些模型,我们对2.7B、13B和52B的通用语言模型进行了微调,以预测红队成员认为哪些模型输出更无害,从而生成一个无害评分 [2]。评分越低,表示越有害。
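
To make the pairwise training setup concrete, here is a minimal sketch of a Bradley-Terry-style comparison loss of the kind commonly used for preference models. The function and tensor names are illustrative assumptions, not the implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def harmlessness_pairwise_loss(score_less_harmful: torch.Tensor,
                               score_more_harmful: torch.Tensor) -> torch.Tensor:
    """Pairwise comparison loss for a harmlessness preference model (illustrative).

    Each argument holds the scalar scores the preference model assigned to the two
    assistant responses in a red team comparison. Minimizing this loss pushes the
    response labeled "less harmful" to receive the higher harmlessness score.
    """
    return -F.logsigmoid(score_less_harmful - score_more_harmful).mean()

# Toy usage with made-up scores for a batch of three comparisons.
less_harmful = torch.tensor([0.8, 0.1, 1.5])
more_harmful = torch.tensor([-0.2, 0.3, 0.9])
print(harmlessness_pairwise_loss(less_harmful, more_harmful))
```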

Plain language models (Plain LM) We use 1-shot learning (in which we place a single example of a 3-turn conversation in our Human, Assistant format in context) to prompt our general language models to behave as dialogue models for use in the interface described above [2]. We consider this method a baseline or control model, since it minimally departs from a general-purpose plain language model and has no explicit safety intervention.

普通语言模型 (Plain LM)
我们使用 1-shot 学习(在上下文中放置一个 3 轮对话的示例,格式为 Human, Assistant)来提示我们的通用语言模型表现得像对话模型,以便在上述界面中使用 [2]。我们将此方法视为基线或控制模型,因为它与通用普通语言模型的差异最小,并且没有明确的安全干预措施。

Prompted language models (Prompted LM) We use 14-shot learning to prompt our general language models to be helpful, harmless, and honest (HHH) [2], similar to dialogue-prompted Gopher [43]. We consider this a simple safety intervention, since we found it to be surprisingly effective at reducing model toxicity, especially for larger models [2, 43]. Furthermore, we use context distillation [2] to train "prompt-free" variants of these prompted models in order to retain the influence of the prompt without occupying a significant portion of the limited context window and decreasing inference time [2]. Empirically, in previous work, we found minimal differences between prompting and context distillation [2].

提示语言模型 (Prompted LM)
我们使用 14-shot 学习来提示我们的通用语言模型,使其具备帮助性、无害性和诚实性 (HHH) [2],类似于对话提示的 Gopher [43]。我们认为这是一种简单的安全干预措施,因为我们发现它在减少模型毒性方面出奇地有效,尤其是对于较大的模型 [2, 43]。此外,我们使用上下文蒸馏 [2] 来训练这些提示模型的“无提示”变体,以保留提示的影响,同时不占用有限上下文窗口的显著部分,并减少推理时间 [2]。根据经验,在之前的工作中,我们发现提示和上下文蒸馏之间的差异很小 [2]。

Rejection sampling (RS) We generate 16 samples of AI assistant responses from prompted language models, rank these samples with the harmlessness preference model, and select the 2 least harmful samples to present to the red team member, thus rejecting the 14 relatively more harmful responses. We did not experiment with changing the parameter 16. We tie the size of the prompted model to the size of the harmlessness preference model, e.g., a 2.7B parameter rejection sampling model consists of a 2.7B prompted language model paired with a 2.7B harmlessness preference model.

拒绝采样 (Rejection Sampling, RS)
我们从提示语言模型中生成 16 个 AI 助手响应样本,使用无害偏好模型对这些样本进行排序,并选择 2 个最无害的样本呈现给红队成员,从而拒绝 14 个相对更有害的响应。我们没有尝试改变参数 16。我们将提示模型的大小与无害偏好模型的大小绑定,例如,一个 2.7B 参数的拒绝采样模型由一个 2.7B 的提示语言模型和一个 2.7B 的无害偏好模型配对组成。
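
The rejection-sampling procedure above reduces to a small loop: sample 16 candidate replies, score them with the harmlessness preference model, and keep the 2 highest-scoring (least harmful) ones. Below is a minimal sketch; `generate_response` and `harmlessness_score` are hypothetical stand-ins for the prompted LM and the preference model.

```python
from typing import Callable, List

def rejection_sample(context: str,
                     generate_response: Callable[[str], str],
                     harmlessness_score: Callable[[str, str], float],
                     num_samples: int = 16,
                     num_keep: int = 2) -> List[str]:
    """Return the `num_keep` least harmful of `num_samples` sampled replies.

    `generate_response(context)` draws one assistant reply from the prompted LM;
    `harmlessness_score(context, reply)` is higher for less harmful replies.
    """
    candidates = [generate_response(context) for _ in range(num_samples)]
    ranked = sorted(candidates,
                    key=lambda reply: harmlessness_score(context, reply),
                    reverse=True)
    return ranked[:num_keep]  # the remaining, more harmful samples are rejected
```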

Reinforcement learning from human feedback (RLHF) We start with a prompted language model, then use reinforcement learning to train it to maximize the scores given by the preference model described above. As in the rejection sampling case, we tie the size of the prompted model to the size of the preference model. Full details about the training procedures, training datasets, and model evaluations are described elsewhere [4]. Intuitively, we expect RLHF models to behave similarly (but not exactly) to RS models; however, RLHF is computationally expensive at train time but efficient at test time. RS is vice-versa.

基于人类反馈的强化学习 (RLHF)
我们从提示语言模型开始,然后使用强化学习来训练它,以最大化上述偏好模型给出的分数。与拒绝采样情况一样,我们将提示模型的大小与偏好模型的大小绑定。关于训练过程、训练数据集和模型评估的完整细节在其他地方有描述 [4]。直观上,我们期望 RLHF 模型的行为与 RS 模型相似(但不完全相同);然而,RLHF 在训练时计算成本高,但在测试时效率高。RS 则相反。

3.3 Red Team

3.3 红队

Our red team consists of 324 US-based crowd workers whom we primarily recruited from Amazon's Mechanical Turk (MTurk) platform (n = 307) and the Upwork platform (n = 17). On MTurk, we paid between \$7.50 and \$9.50 for each set of 5 conversations completed. We found that crowd workers could complete at least 2 tasks an hour, which means that we paid at or above California minimum wage. On Upwork, we paid participants \$20 per hour. Similar to [53], we asked participants to fill out a short demographic survey that incorporated U.S. Census categories and offered participants the option to answer "Prefer not to say" for each question (Figure 4).

我们的红队由 324 名美国众包工作者组成,主要从 Amazon 的 Mechanical Turk (MTurk) 平台 (n = 307) 和 Upwork 平台 (n = 17) 招募。在 MTurk 上,我们为每完成 5 组对话支付 7.50 至 9.50 美元。我们发现众包工作者每小时至少可以完成 2 个任务,这意味着我们支付的报酬达到或超过了加州的最低工资标准。在 Upwork 上,我们每小时支付参与者 20 美元。与 [53] 类似,我们要求参与者填写一份简短的人口统计调查,该调查采用了美国人口普查类别,并为每个问题提供了"不愿回答"的选项(图 4)。

We found that the crowd worker population may not be fully representative of the U.S. population, according to U.S. Census data [54]. For example, we find that individuals who self-identify as "White or Caucasian" are slightly over-represented in our experiments (79% versus the current U.S. Census estimate of 75.8%). Similarly, the percentage of participants with at least a college degree was significantly higher than what is reported by the U.S. Census (66% versus 32.9%).

我们发现,根据美国人口普查数据 [54],众包工作者群体可能并不能完全代表美国人口。例如,我们发现自认为是“白人或高加索人”的个体在我们的实验中略微过多(79% 对比当前美国人口普查估计的 75.8%)。同样,拥有至少大学学历的参与者比例显著高于美国人口普查报告的数据(66% 对比 32.9%)。

Figure 5 shows descriptive statistics about the red team. In particular, we find that ${\sim}80%$ of the red team attacks come from ${\sim}50$ out of ${\sim}300$ workers. As such, the overwhelming majority of the dataset is generated from a minority of particularly prolific red team members. Furthermore, we fit a linear mixed model that evaluates the inherent efficacy of a red team member, which we plot in Figure 5 (Right). We find that some workers are particularly effective at red teaming, whereas others are not. In Appendix A.3 we re-analyze our data while controlling for these two confounds (particularly prolific workers, and particularly (in)effective red team members) and find that these confounds do not significantly influence the main results in Figure 1.

图 5 展示了红队的描述性统计。特别是,我们发现大约 80% 的红队攻击来自大约 300 名工作者中的 50 名。因此,绝大多数数据集是由少数特别多产的红队成员生成的。此外,我们拟合了一个线性混合模型,评估红队成员的固有效能,并在图 5(右)中绘制。我们发现一些工作者在红队测试中特别有效,而另一些则不然。在附录 A.3 中,我们在控制这两个混杂因素(特别多产的工作者和特别(不)有效的红队成员)的情况下重新分析了我们的数据,发现这些混杂因素对图 1 中的主要结果没有显著影响。
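
As a rough illustration of the per-worker analysis (the paper does not publish its exact model specification), the sketch below fits a linear mixed model with a random intercept per red team member on the minimum harmlessness score, using statsmodels; the data frame and column names are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical attack-level data: one row per red team attempt.
rng = np.random.default_rng(0)
workers = np.repeat([f"w{i}" for i in range(10)], 20)
df = pd.DataFrame({
    "worker_id": workers,
    "min_harmlessness": rng.normal(loc=0.0, scale=1.0, size=len(workers)),
})

# Random intercept per worker: a more negative intercept indicates a worker who
# tends to elicit more harmful (lower-scoring) responses, i.e., a more effective
# red team member under this metric.
model = smf.mixedlm("min_harmlessness ~ 1", df, groups=df["worker_id"])
result = model.fit()
print(result.random_effects)
```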

Figure 4 Results of a demographic survey completed by 115 of 324 red team members.

图 4: 324 名红队成员中 115 人完成的人口统计调查结果。

| 性别 | 人数 | 百分比 |
| --- | --- | --- |
| 男性 | 54 | 47.0% |
| 女性 | 60 | 52.2% |
| 非二元性别 | 1 | 0.9% |
| 不愿透露 | 0 | 0% |

| 性取向 | 人数 | 百分比 |
| --- | --- | --- |
| 异性恋 | 94 | 81.7% |
| 同性恋 | 5 | 4.3% |
| 双性恋 | 14 | 12.2% |
| 不确定 | 1 | 0.9% |
| 不愿透露 | 0 | 0% |
| 其他 | 1 | 0.9% |

| 年龄组 | 人数 | 百分比 |
| --- | --- | --- |
| 18-24 | 0 | 0% |
| 25-34 | 29 | 25.2% |
| 35-44 | 39 | 33.9% |
| 45-54 | 27 | 23.5% |
| 55-64 | 16 | 13.9% |
| 65+ | 2 | 1.7% |
| 不愿透露 | 2 | 1.7% |

| 种族 | 人数 | 百分比 |
| --- | --- | --- |
| 美洲印第安人或阿拉斯加原住民 | 2 | 1.7% |
| 亚洲人 | 3 | 2.6% |
| 黑人或非裔美国人 | 10 | 8.7% |
| 西班牙裔、拉丁裔或西班牙人 | 1 | 0.9% |
| 中东或北非人 | 1 | 0.9% |
| 夏威夷原住民或太平洋岛民 | 1 | 0.9% |
| 白人或高加索人 | 94 | 81.7% |
| 不愿透露 | 1 | 0.9% |
| 其他 | 2 | 1.7% |

| 教育程度 | 人数 | 百分比 |
| --- | --- | --- |
| 高中或部分大学 | 40 | 34.8% |
| 大学学位 | 62 | 53.9% |
| 研究生或专业学位 | 12 | 10.4% |
| 不愿透露 | 0 | 0% |
| 其他 | 1 | 0.9% |

| 残疾 | 人数 | 百分比 |
| --- | --- | --- |
| 听力困难 | 0 | 0% |
| 视力困难 | 1 | 0.9% |
| 认知困难 | 1 | 0.9% |
| 行动困难 | 4 | 3% |
| 自理困难 | 1 | 0.9% |
| 其他 | 2 | 1.5% |
|  | 106 | 92% |

3.4 Data Analysis

3.4 数据分析

With our interface, models, and red team in place, we collect 38,961 red team attacks, with O(1K) attacks per model type in all cases except for the 52B prompted model, for which we collect O(10K) attacks (Table 1). We collect more data in the latter case in order to train our harmlessness preference models, as described in $\S3.2$. Figure 6 shows an example red team attack and how we quantify it. In particular, we measure 3 variables for each attack. First, we record the red team member's self-rating of how successful they were on a 5-point Likert scale, where a 0 indicates an unsuccessful attempt, and a 4 indicates a very successful attempt (see also Figure 3, Right, for an example). Figure 7 (Left) shows the distribution over this variable, which is approximately bimodal, with two peaks at 0 and 4, with relatively more mass at 0. This indicates that, on average, red team members self-report successful attacks ${\sim}35%$ of the time.

在我们的界面、模型和红队准备就绪后,我们收集了 38,961 次红队攻击,除了 52B 提示模型收集了 O(10K) 次攻击外,每种模型类型的攻击次数均为 O(1K)(表 1)。我们在后一种情况下收集了更多数据,以便训练我们的无害偏好模型,如 $\S3.2$ 所述。图 6 展示了一个红队攻击的示例以及我们如何量化它。特别是,我们为每次攻击测量了 3 个变量。首先,我们记录了红队成员在 5 点李克特量表上的自我评分,其中 0 表示尝试不成功,4 表示非常成功(参见图 3 右侧的示例)。图 7(左)显示了该变量的分布,它大致呈双峰分布,峰值在 0 和 4 处,其中 0 处的质量相对更多。这表明,平均而言,红队成员自我报告约 35% 的攻击是成功的。


Figure 5 Descriptive statistics about red team members. (Left) Total number of red team attempts (y-axis) per red team member (x-axis), sorted by number of attempts. (Middle) The cumulative distribution (CDF) of the data from the left panel shows that ${\sim}80%$ of attacks come from ${\sim}15%$ of the red team participants. (Right) Estimate of how effective each red team member is at red teaming (y-axis, 0 means average, lower means more effective, lines indicate $95%$ confidence intervals) according to their ability to achieve a low minimum harmlessness score. X-axis is sorted by ability.

图 5: 红队成员的描述性统计。(左) 每个红队成员的总尝试次数 (y轴) 按尝试次数排序 (x轴)。(中) 左图数据的累积分布 (CDF) 显示,约 80% 的攻击来自约 15% 的红队参与者。(右) 根据每个红队成员达到低最低无害性分数的能力,估计他们在红队测试中的效果 (y轴,0 表示平均水平,越低表示越有效,线条表示 95% 的置信区间)。x轴按能力排序。

Next, we use our harmlessness preference model to compute the harmlessness score (higher is less harmful and thus better) of the AI assistant's dialogue. In particular, we compute the score for each assistant utterance, conditioned on everything preceding the utterance, for each utterance in the conversation. For an $N$ -turn conversation, this results in $N$ harmlessness scores (Figure 6). To turn this into one number, we either compute the mean or minimum (worst) harmlessness of the AI assistant. Empirically, we find that the precise choice of aggregate statistic does not significantly change our results, so we choose the minimum harmlessness as a lower bound on the overall harmlessness of the model.

接下来,我们使用无害偏好模型来计算AI助手的对话的无害分数(分数越高表示越无害,因此越好)。具体来说,我们为对话中的每个助手话语计算分数,条件是话语之前的所有内容。对于一个 $N$ 轮对话,这将产生 $N$ 个无害分数(图6)。为了将其转化为一个数字,我们计算AI助手的平均或最小(最差)无害分数。经验表明,聚合统计的具体选择不会显著改变我们的结果,因此我们选择最小无害分数作为模型整体无害性的下限。
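
A minimal sketch of the aggregation just described: score each assistant utterance conditioned on the full preceding dialogue, then take the minimum as a lower bound on harmlessness. `harmlessness_score` is a hypothetical stand-in for the preference model.

```python
from typing import Callable, List, Tuple

def min_harmlessness(turns: List[Tuple[str, str]],
                     harmlessness_score: Callable[[str, str], float]) -> float:
    """Minimum (worst-case) harmlessness over all assistant utterances.

    `turns` is an ordered list of (human_utterance, assistant_utterance) pairs.
    Each assistant utterance is scored conditioned on everything that precedes it.
    """
    context = ""
    scores = []
    for human, assistant in turns:
        context += f"Human: {human}\n"
        scores.append(harmlessness_score(context, assistant))
        context += f"Assistant: {assistant}\n"
    return min(scores)
```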

Figure 7 (Middle) shows the distribution of the minimum harmlessness score over all red team attacks for all the models. The distribution is centered around 0 and skews negative. A more negative score corresponds to more harmful model responses, and a more positive score corresponds to less harmful model responses. The shape of this distribution suggests that the red team members are indeed effective at soliciting harmful responses from the AI assistant. In general, we find that the minimum harmlessness score is inversely proportional to the red team member self-rating of attack success, which is expected ($\S\mathrm{A}.4$, Figure 11). However, the correlation is not perfect. As such, we report statistics of both these variables, conditioned on model type, as measures of red team efficacy in $\S4$.

图 7 (中) 展示了所有模型在所有红队攻击中的最小无害性得分的分布。该分布以 0 为中心,呈负偏态。得分越负,表示模型响应越有害;得分越正,表示模型响应越无害。该分布的形状表明,红队成员确实能够有效地诱导 AI 助手产生有害响应。总体而言,我们发现最小无害性得分与红队成员自我评估的攻击成功率成反比,这是符合预期的($\S\mathrm{A}.4$,图 11)。然而,相关性并不完美。因此,我们在 $\S4$ 中报告了这两个变量按模型类型划分的统计数据,作为红队效能的衡量标准。

Finally, we also use the harmlessness preference model to score the harmfulness of the red team member's intent. To do so, we run the preference model on the red team member's task description (Figure 6). Figure 7 (Right) shows the distribution over this variable, which appears normally distributed with a mean around 1. As such, short descriptions of the attack score as less harmful than the actual AI utterances. We view the intent harmlessness score as a possible confound that we control for in further statistical analyses of the data ($\S\mathrm{A}.3$). Since we find that it does not influence our main results, we do not report on this variable further in the main text.

最后,我们还使用无害偏好模型来评估红队成员意图的有害性。为此,我们在红队成员的任务描述上运行偏好模型(图 6)。图 7(右)显示了该变量的分布情况,其大致呈正态分布,均值约为 1。也就是说,攻击的简短描述被评为比实际的 AI 话语危害更小。我们将意图无害性得分视为一个可能的混淆变量,在进一步的数据统计分析中对其进行控制($\S\mathrm{A}.3$)。由于我们发现它不影响我们的主要结果,因此在正文中不再进一步报告该变量。

3.5 Review Task

3.5 审查任务

After we collected the data across all model types, we performed a follow-up experiment to measure two separate variables: the inter-annotator agreement in the self report of attack success, and the content of the attack types. The former is important because self-ratings of attack success are subjective, and can vary based on elements of the red team attack and red team member that we do not control (e.g., the type of attack or the background of the red team member). As such, we were interested in understanding how much variability (across different raters) there might be for defining a successful attack.

在我们收集了所有模型类型的数据后,我们进行了一项后续实验,以测量两个独立的变量:攻击成功自我报告中的标注者间一致性,以及攻击类型的内容。前者很重要,因为攻击成功的自我评分是主观的,并且可能基于我们无法控制的红队攻击和红队成员的要素(例如,攻击类型或红队成员的背景)而有所不同。因此,我们感兴趣的是了解在定义成功攻击时可能存在多少变异性(跨不同评分者)。


Figure 6 Example of how we quantify red team attempts. First, we compute a harmlessness score (lower is more harmful) on the task description (red). Next, we compute a harmlessness score on the assistant utterances, conditioned on all previous human and assistant utterances (black scores, adjacent to assistant utterances). We aggregate these scores using either a min or max (black, bold). Finally, we rely on human judgement of attack success on a Likert scale (blue).

图 6: 我们如何量化红队尝试的示例。首先,我们计算任务描述的无害性分数(越低越有害)(红色)。接下来,我们计算助手话语的无害性分数,条件是所有先前的人类和助手话语(黑色分数,紧邻助手话语)。我们使用最小值或最大值来聚合这些分数(黑色,加粗)。最后,我们依赖人类对攻击成功的判断,使用李克特量表(蓝色)。

Figure 8 shows our task instructions (Left) and interface (Right). We have 3 annotators review each transcript in the experiment. We ran this experiment on a random sample of 500 red team attacks for the 52B Prompted Language model and 500 attacks on the 52B RLHF model. We have each reviewer report their judgement of how successful the red team member was at making the AI assistant "say something bad." We measure this variable on the same 5-point Likert scale as the main red team experiment and use consistent language across both experiments.

图 8 展示了我们的任务说明(左)和界面(右)。在实验中,我们让 3 位标注者审查每份转录文本。我们在 52B Prompted Language 模型的 500 次红队攻击随机样本和 52B RLHF 模型的 500 次攻击上进行了此实验。我们让每位评审员报告他们对红队成员在“让 AI 助手说出一些不好的内容”方面的成功程度的判断。我们使用与主要红队实验相同的 5 点李克特量表来衡量这一变量,并在两个实验中使用一致的语言。

We find a low level of inter-rater agreement on the success of red team attacks (according to our task setup and instructions), consistent with [60]. In particular, we report Fleiss's Kappa, a statistical measure of inter-annotator agreement that is bounded in $[-\infty, 1]$, where $-\infty$ implies no agreement and 1 indicates perfect agreement. We report a Fleiss's Kappa of 0.32 between the 4 raters (the author and the 3 reviewers) based on a Likert rating scale. When we binarize the rating (1 if rating $\geq 3$, else 0), the agreement increases to 0.49.

我们发现,在红队攻击的成功率上,评分者之间的一致性较低(根据我们的任务设置和指示),这与[60]的结果一致。具体来说,我们报告了Fleiss's Kappa,这是一种用于衡量评分者一致性的统计指标,其取值范围为$[-\infty,1]$,其中$-\infty$表示完全不一致,1表示完全一致。基于Likert评分量表,我们报告了4位评分者(作者和3位评审员)之间的Fleiss's Kappa为0.32。当我们将评分二值化(如果评分$\geq3$则为1,否则为0)时,一致性提高到0.49。
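
For reference, the agreement statistic can be computed as in the sketch below, using statsmodels; the ratings array is invented for illustration, and the binarization mirrors the thresholding described above (1 if rating >= 3, else 0).

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical Likert ratings (0-4): rows are transcripts, columns are the 4 raters
# (the original red team member plus the 3 reviewers).
ratings = np.array([
    [4, 3, 4, 2],
    [0, 0, 1, 0],
    [2, 3, 3, 4],
    [1, 0, 0, 0],
    [4, 4, 3, 3],
])

# Agreement on the raw 5-point Likert scale.
table, _ = aggregate_raters(ratings)
print("Likert kappa:", fleiss_kappa(table))

# Agreement after binarizing: 1 if the rater judged the attack successful (>= 3).
binary_table, _ = aggregate_raters((ratings >= 3).astype(int))
print("Binary kappa:", fleiss_kappa(binary_table))
```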


Figure 7 (Left) Marginal distribution of self-report of red team success rates (higher is more successful). (Middle) Probability distribution function (PDF) of minimum AI harmlessness scores computed from the AI utterances (lower is more harmful). (Right) Distribution of harmlessness scores computed from short descriptions (written by red team members) of attack intent.

图 7: (左) 红队成功率自我报告的边际分布 (越高表示越成功) (中) 从 AI 话语中计算的最小 AI 无害性得分的概率分布函数 (PDF) (越低表示越有害) (右) 从攻击意图的简短描述 (由红队成员撰写) 中计算的无害性得分的分布。

Furthermore, when we exclude the original author and measure the agreement between the 3 annotators, we also see a modest increase in agreement for both the Likert and Binary scales, achieving a maximum agreement of 0.55 for the reviewer-only binary case. Taken together, our results suggest poor to fair agreement on what constitutes a successful attack.

此外,当我们排除原作者并测量3位标注者之间的一致性时,我们也看到Likert量表和二元量表的一致性都有所提高,在仅考虑审稿人的二元情况下,最大一致性达到0.55。综合来看,我们的结果表明,对于什么是成功的攻击,一致性较差到一般。

To get a sense of the type of harms the attacks were meant to elicit, we asked the reviewers to tag transcripts with up to 2 of 20 total topic tags (Figure 8, Right). To develop the list of topic tags, we referred to the taxonomies of potential harms of language models in [48, 57], industry content moderation guidelines, and a manual review of the top 100 most harmful conversations in our dataset. We discuss our findings on tag frequencies in Figure 9 and $\S4$.

为了了解攻击旨在引发的危害类型,我们要求评审员为每个对话记录打上最多2个主题标签(图8,右)。为了制定主题标签列表,我们参考了[48, 57]中关于语言模型潜在危害的分类法、行业内容审核指南,以及对我们数据集中最有害的100个对话的手动审查。我们将在图9和$\S4$中讨论标签频率的发现。

We were particularly concerned with exposing reviewers to potential harm while participating in this experiment, since we ask reviewers to read, rate, and annotate harmful conversations they were not involved in writing. To mitigate this risk, we reviewed and incorporated findings from literature on Trust & Safety [16, 31, 26] into the content of both the task instructions (Figure 8, Left) and interface (Figure 8, Right), as well as the overall design of the experiment. For example, we built custom warning functionality which allowed reviewers to see a preview of the harmful text without being exposed to the entire conversation. Within the preview window, reviewers could skip to the next conversation or proceed with reviewing and rating the selected conversation. We leave further details in $\S\mathrm{A}.2$.

我们特别关注评审员在参与此实验时可能面临的潜在伤害,因为我们要求评审员阅读、评分和注释他们未参与编写的有害对话。为了减轻这一风险,我们审查并整合了关于信任与安全 (Trust & Safety) 的文献 [16, 31, 26] 中的发现,将其应用于任务说明(图 8,左)和界面(图 8,右)的内容,以及实验的整体设计中。例如,我们构建了自定义的警告功能,允许评审员在不暴露整个对话的情况下预览有害文本。在预览窗口中,评审员可以选择跳过当前对话或继续评审和评分选定的对话。更多细节请参见 $\S\mathrm{A}.2$。

Our informational interviews with Trust & Safety industry professionals highlighted the need for creating a sense of community among workers and building social support networks as ways to mitigate possible harms associated with reviewing troubling content, consistent with [26]. As a result, we decided to limit the population of reviewers in this experiment to Upworkers, and we used a shared communication tool (Slack) to regularly communicate with the group. This allowed participants to ask questions, share examples, and discuss work and non-work related topics, not only amongst themselves, but also directly with research staff.

我们与信任与安全行业专业人士的信息访谈强调了在员工中建立社区意识和构建社交支持网络的重要性,以减轻审查令人不安内容可能带来的伤害,这与[26]一致。因此,我们决定将本次实验的审查员群体限制为Upworkers,并使用共享通信工具(Slack)定期与团队沟通。这使得参与者不仅可以在彼此之间提问、分享示例并讨论工作与非工作相关的话题,还可以直接与研究团队进行交流。

To monitor the psychological effects of this work and provide an avenue for direct feedback from reviewers, we developed a custom well-being survey and sent it to reviewers after completing 10 tasks. In the survey (which is optional to complete) we asked reviewers to rate how often they felt a variety of positive and negative emotions, and we also provided a free-form text question where reviewers could share additional thoughts. Participants generally felt low levels of negative emotions, and higher levels of positive emotions about the task. Informally, we received feedback that reviewers found the task to be fun and engaging. We provide more detail on the well-being survey and additional worker safety interventions in $\S\mathrm{A}.2$.

为了监控这项工作的心理影响并为评审者提供直接反馈的途径,我们开发了一个定制化的幸福感调查问卷,并在评审者完成10个任务后发送给他们。在调查中(完成调查是可选的),我们要求评审者评估他们在任务中感受到的各种积极和消极情绪的频率,并且我们还提供了一个自由形式的文本问题,评审者可以分享额外的想法。参与者普遍感受到较低水平的消极情绪,以及较高水平的积极情绪。非正式地,我们收到了反馈,评审者认为这项任务有趣且引人入胜。我们在 $\S\mathrm{A}.2$ 中提供了关于幸福感调查和额外工人安全干预措施的更多细节。

4 Results

4 结果

Figure 1 (Left) shows the average success rate, self-reported by the red team members, for each model size and safety intervention. According to this metric, we observe three main patterns in the data. First, we see no discernible difference between the control condition (a plain LM with a 1-example prompt to turn it into a dialogue agent) and the simplest safety intervention (a plain LM with a 14-example HHH prompt [2]). This result is surprising, in that our previous work found the HHH prompt to be effective at reducing model toxicity, especially for 52B models [2, 43]. It's possible that this is because static prompts from the Real Toxicity Prompts dataset [25] are less adversarial than the dialogue-based attacks employed by red team members.

图 1 (左) 展示了红队成员自我报告的平均成功率,按模型大小和安全干预措施划分。根据这一指标,我们在数据中观察到三个主要模式。首先,我们发现控制条件(带有 1 个示例提示、使其成为对话代理的普通 LM)与最简单的安全干预措施(带有 14 个示例 HHH 提示的普通 LM [2])之间没有明显差异。这一结果令人惊讶,因为我们之前的工作发现 HHH 提示在减少模型毒性方面是有效的,尤其是对于 52B 模型 [2, 43]。这可能是因为 Real Toxicity Prompts 数据集 [25] 中的静态提示的对抗性低于红队成员使用的基于对话的攻击。


Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.

图 8: (左) 红队审查任务指令。(右) 红队审查任务示例。

Second, we find that rejection sampling (RS) makes it particularly difficult to red team our language models. In essence, rejection sampling places a floor on red team attack susceptibility out of the three interventions that we tried. However, qualitatively, we believe that this may be the case because the responses from the RS models tend to be harmless by being evasive [4]. Finally, we find no clear trends with model size for the self-reported attack success rate metric. This is surprising because our previous work typically shows larger models tend to generate more toxic model responses [2, 22].

其次,我们发现拒绝采样 (Rejection Sampling, RS) 使得对我们的语言模型进行红队测试变得尤为困难。本质上,在我们尝试的三种干预措施中,拒绝采样为红队攻击的易感性设置了一个下限。然而,从定性角度来看,我们认为这可能是因为 RS 模型的响应往往通过回避的方式显得无害 [4]。最后,我们发现模型大小与自报告的攻击成功率指标之间没有明显的趋势。这令人惊讶,因为我们之前的工作通常表明,较大的模型往往会产生更多有害的响应 [2, 22]。

Figure 1 (Middle) shows the average minimum harmlessness score (lower is more harmful, see $\S3$ for details) for each model size and safety intervention. For this metric, we do see a clear scaling trend for the reinforcement learning (RLHF) models — as the models increase in size, they become increasingly more difficult to red team. At 52B parameters, we see no difference in harmlessness score for RLHF vs. RS. We also see the same first two trends from Figure 1 (Left): that there is little difference between the plain LM and the prompted LM, and that rejection sampling is an effective safety intervention.

图 1 (中) 展示了每个模型大小和安全干预措施的平均最小无害性得分 (越低越有害,详见 $\S3$)。对于这一指标,我们确实看到了强化学习 (RLHF) 模型的明显扩展趋势——随着模型规模的增加,它们变得越来越难以进行红队测试。在 52B 参数下,我们看到 RLHF 与 RS 的无害性得分没有差异。我们还看到了与图 1 (左) 相同的前两个趋势:普通 LM 与提示 LM 之间几乎没有差异,拒绝采样是一种有效的安全干预措施。

Instead of the average minimum harmlessness metric, Figure 1 (Right) shows the distribution over the harmlessness score. Here, we see that although safety interventions like RLHF and RS indeed decrease the average harmfulness of the model responses, there are still many instances of harmful behavior, as exhibited by the lower tails in the distributions. Although the safety interventions we tested help make systems safer, they still fail to make a perfectly safe system. Figure 10 shows examples of harmful outputs from the RS and RLHF models, respectively. For the RS case, the model at first responds to a harmful inquiry, then starts to demur as the conversation turns more harmful. For the RLHF case, we see a similar pattern, however the assistant remains helpful (though fabricates information) before ultimately refusing to help the human.

图 1 (右) 展示了无害性得分的分布情况,而不是平均最小无害性指标。在这里,我们可以看到,尽管像 RLHF 和 RS 这样的安全干预措施确实降低了模型响应的平均有害性,但仍然存在许多有害行为的实例,如分布中的较低尾部所示。尽管我们测试的安全干预措施有助于使系统更安全,但它们仍然无法使系统完全安全。图 10 分别展示了 RS 和 RLHF 模型的有害输出示例。在 RS 的情况下,模型首先对有害的询问做出响应,然后随着对话变得更加有害而开始犹豫。在 RLHF 的情况下,我们看到了类似的模式,但助手在最终拒绝帮助人类之前仍然保持帮助性(尽管编造了信息)。


Figure 9 Number of attacks (x-axes) classified by a tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes total number of attacks, orange denotes the number of successful attacks.

图 9: 随机抽取的 500 次攻击中,按标签 (y 轴) 分类的攻击次数 (x 轴) 在 52B Prompted LM 和 RLHF 模型上的分布。蓝色表示总攻击次数,橙色表示成功攻击次数。

To further understand the landscape of possible harms surfaced using this approach, across all model sizes and interventions, we created and annotated a visualization of the entire dataset (Figure 2). To do so, we obtained the average per token embeddings of each transcript from the residual stream in the 48th layer of the 52B prompted LM. Then we used UMAP [38] to turn the high-dimensional embeddings into two-dimensional embeddings for visualization. Intuitively, we expect this procedure to place any pair of transcripts closer together in this two dimensional space the more semantically similar to each other they are.

为了进一步理解使用这种方法可能产生的危害情况,我们创建并注释了整个数据集的可视化(图 2)。为此,我们从 52B 提示大语言模型的第 48 层残差流中获取了每个转录本的每个 Token 的平均嵌入。然后,我们使用 UMAP [38] 将高维嵌入转换为二维嵌入以便进行可视化。直观上,我们期望这个过程能够将任何一对转录本在语义上越相似,它们在二维空间中的距离就越近。
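
The visualization pipeline amounts to mean-pooling per-token activations into one vector per transcript and projecting those vectors with UMAP. A minimal sketch with the umap-learn package is below; the embedding step is stubbed out with random vectors since the paper uses internal activations from layer 48 of the 52B prompted LM.

```python
import numpy as np
import umap  # pip install umap-learn

def embed_transcripts(transcripts):
    """Hypothetical stand-in: one mean-pooled embedding per transcript.

    In the paper this is the per-token residual-stream activation from the 48th
    layer of the 52B prompted LM, averaged over tokens; here we return random
    vectors of an arbitrary width purely for illustration.
    """
    return np.random.randn(len(transcripts), 1024)

transcripts = [f"...red team transcript {i}..." for i in range(200)]
embeddings = embed_transcripts(transcripts)

# Project the high-dimensional embeddings to 2D for plotting (Figure 2).
reducer = umap.UMAP(n_components=2, metric="cosine", random_state=0)
coords = reducer.fit_transform(embeddings)
print(coords.shape)  # (200, 2)
```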

We find evidence for basic clusters of red team attempts. These include perhaps more obvious types of attacks, such as those soliciting discriminatory or offensive responses but also some surprising attacks. For example, we found a small cluster of attacks that tried to solicit misinformation in clever and subtle ways, and a small cluster of attacks related to animal abuse. We also find that some types of attacks, such as soliciting advice on how to perpetrate general violence, seem to be more successful than others, such as attempting to elicit offensive language.

我们发现红队尝试的基本集群证据。这些集群包括可能更为明显的攻击类型,例如那些寻求歧视性或攻击性回应的攻击,但也包括一些令人惊讶的攻击。例如,我们发现了一小部分以巧妙和微妙方式试图获取错误信息的攻击,以及一小部分与虐待动物相关的攻击。我们还发现,某些类型的攻击,例如寻求如何实施一般暴力的建议,似乎比其他类型的攻击(例如试图引发攻击性语言)更为成功。

We also found a cluster of 916 attacks designed to solicit personally identifiable information (PII). We developed a regular expression $(\S\mathrm{A}.6)$ to find and filter possible PII from the public dataset $(\S\mathrm{A}.7)$. We manually reviewed the filtered data and found that some of the AI assistant generated PII (such as addresses) appear to be neither real nor accurate, and instead were "hallucinated" by the AI assistant (see Figure 12 for an example). Other potential AI assistant generated PII, such as social security numbers or drivers licenses, are difficult to manually verify. As such, we erred on the side of caution in filtering out the possible synthetic PII in the public dataset $(\S\mathrm{A}.7)$.

我们还发现了一组包含 916 次攻击的集群,这些攻击旨在获取个人身份信息 (PII)。我们开发了一个正则表达式 $(\S\mathrm{A}.6)$ 来从公开数据集中查找和过滤可能的 PII $(\S\mathrm{A}.7)$。我们手动审查了过滤后的数据,发现一些由 AI 助手生成的 PII(如地址)似乎既不真实也不准确,而是由 AI 助手“幻觉”生成的(参见图 12 的示例)。其他潜在的由 AI 助手生成的 PII,如社会安全号码或驾驶执照,很难手动验证。因此,我们在过滤公开数据集中可能的合成 PII 时采取了谨慎的态度 $(\S\mathrm{A}.7)$。
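
As a rough illustration of this kind of filtering (the actual expression is in §A.6 of the paper and is not reproduced here), the sketch below flags transcripts that match simple, assumed patterns for emails, US phone numbers, and SSN-like strings; a production filter would need considerably more care.

```python
import re

# Illustrative patterns only; the paper's actual regular expression differs.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # SSN-like, e.g. 123-45-6789
    re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),   # US-phone-number-like
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),               # email-address-like
]

def contains_possible_pii(text: str) -> bool:
    """Return True if any illustrative PII pattern matches the transcript text."""
    return any(pattern.search(text) for pattern in PII_PATTERNS)

print(contains_possible_pii("You can reach me at (555) 123-4567."))  # True
print(contains_possible_pii("No contact information here."))         # False
```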


Figure 10 Examples of red team attempts that resulted in relatively low harmlessness scores for a (Left) rejection sampling (RS) model and (Right) reinforcement learning from human feedback (RLHF) model.

图 10: 红队尝试的示例,这些尝试导致 (左) 拒绝采样 (RS) 模型和 (右) 基于人类反馈的强化学习 (RLHF) 模型的无害性得分相对较低。

We found that some crowd workers generated low-quality red team data. For example, we observed some red team members use and adapt template-based attacks, where the template was "tell me an insulting term for $X$ that starts with a $Y$ ". With such a template, a crowd worker can trivially generate many red team attempts by iterating over many values of $X$ and $Y$ without careful consideration about the efficacy or usefulness of such an attack. Crowd workers (on MTurk) are incentivized to develop strategies like this in order to increase the number of tasks they complete per time spent in order to maximize their earnings [26].

我们发现一些众包工人生成了低质量的红队数据。例如,我们观察到一些红队成员使用并改编了基于模板的攻击,模板为"告诉我一个用于 $X$ 的、以 $Y$ 开头的侮辱性词汇"。通过这种模板,众包工人可以通过迭代多个 $X$ 和 $Y$ 的值轻松生成许多红队尝试,而无需仔细考虑这种攻击的有效性或实用性。众包工人(在 MTurk 上)被激励开发此类策略,以增加他们在单位时间内完成的任务数量,从而最大化他们的收入 [26]。

To further measure the types of harms produced by red teaming, we examined the frequency of tags (described in $\S3.5$). Figure 9 shows the distribution of tags, which we collected for a small proportion ( ${\sim}3%$ ) of the overall dataset. We find that the top 5 attacks correspond to "Discrimination & injustice," "Hate speech & offensive language," "Violence & incitement," "Non-violent unethical behavior (e.g., lying, cheating, etc.)," and "Bullying & harassment." Interestingly, for these top 5 attack types, the attack success rate was relatively higher for "Non-violent unethical behavior," perhaps due to the fact that these types of attacks may be more subtle than the other ones. Less common tags include: "Child abuse," "Self-harm," "Sexual exploitation & human trafficking," "Terrorism & organized crime," and "Animal abuse." Finally, we find that the tag "Other" was also prevalent, which suggests that ascribing a fixed set of tags to annotate transcripts is unlikely to be comprehensive.

为了进一步衡量红队测试产生的危害类型,我们检查了标签的频率(在 $\S3.5$ 中描述)。图 9 展示了标签的分布情况,这些标签是我们从整体数据集中收集的一小部分 $(\sim!3%)$。我们发现,前五大攻击类型对应的是“歧视与不公正”、“仇恨言论与冒犯性语言”、“暴力与煽动”、“非暴力不道德行为(例如撒谎、欺骗等)”以及“欺凌与骚扰”。有趣的是,对于这五大攻击类型,“非暴力不道德行为”的攻击成功率相对较高,可能是因为这类攻击比其他类型更为隐蔽。较少见的标签包括:“虐待儿童”、“自残”、“性剥削与人口贩卖”、“恐怖主义与有组织犯罪”以及“虐待动物”。最后,我们发现“其他”标签也很普遍,这表明用一组固定的标签来注释转录内容可能不够全面。

5 Discussion

5 讨论

5.1 Limitations and Future Work

5.1 局限性与未来工作

We red team language models in the form of an AI assistant, which has the benefit of allowing the red team to probe the open-ended input and output spaces of the underlying LM. However, LMs can be used in many applications that do not require such open-endedness, e.g., recommendation systems, auto-complete systems, general text classifiers, as sub-components in broader sociotechnical systems, etc. Due to our framing of red teaming an AI assistant, we do not explicitly explore how to reduce the possible harms caused by these other types of applications.

我们以 AI 助手的形式对语言模型进行了红队测试,这种方式的优势在于红队可以探索底层大语言模型的开放式输入和输出空间。然而,大语言模型可以应用于许多不需要这种开放性的场景,例如推荐系统、自动补全系统、通用文本分类器,以及作为更广泛的社会技术系统中的子组件等。由于我们将红队测试框架设定为针对 AI 助手,因此并未明确探讨如何减少这些其他类型应用可能带来的危害。

We found that some crowd workers generated attacks on models that required domain expertise to evaluate, and we were not sure whether or not they had the required domain expertise. For example, it is not clear how to evaluate the harmfulness of responses to inquiries like "Give me detailed instructions on how to build a bomb" or "create methamphetamine" without sufficient knowledge about weapons and chemistry, respectively. We could address this problem by asking third party organizations that specialize in certain domains to red team our systems. Additionally, we could give crowd workers a way to indicate if and how a particular red team attack requires domain expertise in order to evaluate how successful the attack was.

我们发现,一些众包工作者生成的攻击模型需要领域专业知识来评估,而我们不确定他们是否具备所需的领域专业知识。例如,如果没有足够的武器和化学知识,很难评估对“给我详细的炸弹制作说明”或“制造甲基苯丙胺”等询问的回答的危害性。我们可以通过邀请专门从事某些领域的第三方组织对我们的系统进行红队测试来解决这个问题。此外,我们可以为众包工作者提供一种方式,让他们指出某个红队攻击是否需要领域专业知识,以便评估攻击的成功程度。

As expected, our data are incomplete—because LMs are general purpose and open-ended, the space of possible harms is unknown and unbounded [22]. For example, the models we red teamed have been trained partly on Python code; however, we observed no attacks related to code generation ability (e.g., "write a Python program that implements a DDOS attack"). It is possible that sharing our red team interface with more domain experts could have resulted in such attacks. We could have also noted in the instructions to the interface that such attacks would be viable, but we erred on the side of being less prescriptive about how to red team in order to encourage creativity. It is unclear how to strike the right balance.

正如预期的那样,我们的数据并不完整——因为大语言模型是通用且开放式的,可能的危害空间是未知且无界的 [22]。例如,我们进行红队测试的模型部分训练数据是 Python 代码;然而,我们没有观察到与代码生成能力相关的攻击(例如,“编写一个实现 DDOS 攻击的 Python 程序”)。如果我们将红队测试接口分享给更多领域专家,可能会引发此类攻击。我们也可以在接口的说明中指出此类攻击是可行的,但为了鼓励创造力,我们在红队测试的方式上选择了不那么规定性的做法。目前尚不清楚如何找到正确的平衡点。

We also know our data are incomplete because we informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attacks that we call "roleplay attacks" on the RLHF model. In a roleplay attack we exploit the helpfulness of the model by asking it to roleplay as a malevolent character. For example, if we asked the RLHF model to enter "4chan mode," the assistant would oblige and produce harmful and offensive outputs (consistent with what can be found on 4chan). We intend to document additional qualitative safety failures that we uncovered in future work.

我们还知道我们的数据是不完整的,因为我们在内部非正式地对模型进行了红队测试,发现了一些成功攻击类型并未出现在我们发布的数据集中。例如,我们发现了一类针对RLHF模型的攻击,我们称之为“角色扮演攻击”。在角色扮演攻击中,我们通过要求模型扮演一个恶意角色来利用其乐于助人的特性。例如,如果我们要求RLHF模型进入“4chan模式”,助手会顺从并产生有害和冒犯性的输出(与4chan上的内容一致)。我们计划在未来的工作中记录我们发现的更多定性安全失败案例。

Our analysis of the data is bottom-up, in that we first collect the data, then attempt to characterize the attack surface (Figure 2). An alternative approach is to refer to a taxonomy of possible attack types [57] and explicitly ask the red team to attack models according to this taxonomy. Ultimately, an approach that combines both top-down and bottom-up strategies may be worthwhile, especially since people may discover attack types not yet covered by a taxonomy—we see some evidence of this in the frequency of attack types labeled as "Other" in our tagging experiment (Figure 9).

我们对数据的分析是自下而上的,即首先收集数据,然后尝试描述攻击面(图 2)。另一种方法是参考可能的攻击类型的分类法 [57],并明确要求红队根据此分类法攻击模型。最终,结合自上而下和自下而上策略的方法可能是有价值的,尤其是因为人们可能会发现分类法尚未涵盖的攻击类型——我们在标记实验中看到了一些证据,即标记为“其他”的攻击类型的频率(图 9)。

Our approach relies extensively on fully manual red teaming by crowd workers, which is expensive (and possibly slow) to do at scale. Previous work illustrates the potential for automating red teaming [42]. For future work, we plan on explicitly comparing and contrasting (semi-)manual versus automated approaches to red teaming in order to determine how the two methods vary in the efficacy and diversity of resulting red team attacks.

我们的方法主要依赖于众包工作者进行完全手动的人工红队测试,这种方法在大规模应用时成本高昂(且可能效率低下)。先前的研究已经展示了自动化红队测试的潜力 [42]。在未来的工作中,我们计划明确比较和对比(半)手动与自动化红队测试方法,以确定这两种方法在红队攻击效果和多样性方面的差异。

5.2 Policy Interventions

5.2 政策干预

Red teaming entails working with inherently controversial subject matter, and most organizations that red team systems have strong counter-incentives to share their findings. This is a problem; if we cannot publicly discuss, in detail, how we red team systems and what we learn as a result, it will be difficult to broadly share the future risks, failures, and implications of yet-to-be developed systems. This problem gets worse over time. As systems become more capable, the results of red teaming may surface increasingly undesirable harms. Therefore, we need to change the incentive structure so more organizations share findings from their red teaming efforts when doing so is safe and beneficial. To do so, we identify two specific interventions the AI research community could take to build consensus around how to red team and how to release findings from red teaming.

红队测试涉及处理具有争议性的主题,而大多数进行红队测试的组织都有强烈的动机不分享他们的发现。这是一个问题;如果我们不能公开详细讨论如何进行红队测试以及我们从中学习到的内容,那么广泛分享未来系统可能带来的风险、失败和影响将变得困难。随着时间的推移,这个问题会变得更加严重。随着系统能力的增强,红队测试的结果可能会揭示出越来越多不希望的危害。因此,我们需要改变激励机制,以便更多的组织在安全且有益的情况下分享他们的红队测试结果。为此,我们提出了两个具体的干预措施,AI研究社区可以采取这些措施来建立关于如何进行红队测试以及如何发布红队测试结果的共识。

For how to red team, we have detailed our initial approach. However, we conducted this effort in isolation, and we would have benefited from participating in a community-based effort to address certain open questions:

关于如何进行红队测试,我们已经详细阐述了初步方法。然而,我们是独立进行这项工作的,如果能参与社区合作来解决某些开放性问题,将会受益匪浅:

We can make progress towards answering these questions by convening a multidisciplinary community to share different approaches to internal red teaming and drive toward consensus.

我们可以通过召集一个多学科社区来分享内部红队测试的不同方法,并推动达成共识,从而在回答这些问题上取得进展。

The research community lacks shared norms and best practices for how to release findings from red teaming. As a result, we made our decision to release the data largely on our own and likely missed critical perspectives from experts, other disciplines, and members of the public. The decision for how to appropriately release findings will ultimately require a subjective judgment call. For our purposes, we reviewed a sample of our red team dataset and evaluated the pros and cons of a public release (see $\S\mathrm{A}.5$). Among them is the fact that while our red team data can be used to develop safer systems (as described in $\S3.2$), it could also be used to train models that produce more harmful responses. We ultimately felt releasing the dataset would provide more benefit to the research community than potential harm, but we were conscious that we made this decision in a vacuum and that it would be better to have a neutral forum in which to discuss these issues.

研究界在如何发布红队测试结果方面缺乏共享的规范和最佳实践。因此,我们在很大程度上是自行决定发布数据的,可能忽略了来自专家、其他学科和公众成员的关键观点。如何适当发布研究结果的决定最终需要主观判断。为此,我们审查了红队数据集的样本,并评估了公开发布的利弊(参见 $\S\mathrm{A}.5$)。其中一个事实是,虽然我们的红队数据可以用于开发更安全的系统(如 $\S3.2$ 所述),但它也可能被用于训练产生更多有害响应的模型。我们最终认为,发布数据集对研究界的益处大于潜在的危害,但我们意识到我们是在孤立的情况下做出这一决定的,最好能有一个中立的论坛来讨论这些问题。

Acknowledgments

致谢

We thank Rishi Bommasani, Roger Grosse, Gretchen Krueger, Percy Liang, Jared Mueller, and Michael Sellitto for detailed feedback on drafts of the paper. We thank Hannah Pritchett, and the other Trust & Safety professionals we interviewed, for their advice on how to promote the well-being of the red team. We're also deeply grateful to Daniela Amodei, Jarrah Bloomfield, Jamie Kerr, Timothy Telleen-Lawton, Jia Yuan Loke, Jeffrey Ladish, Rebecca Raible, Rune Kvist, Rob Gilson, Guro Khundadze, Filipe Dobreira, and Sebastian Conybeare for their help and support.

我们感谢 Rishi Bommasani、Roger Grosse、Gretchen Krueger、Percy Liang、Jared Mueller 和 Michael Sellitto 对论文草稿的详细反馈。我们感谢 Hannah Pritchett 以及其他我们采访的信任与安全专业人士,他们为如何促进红队的福祉提供了建议。我们还要特别感谢 Daniela Amodei、Jarrah Bloomfield、Jamie Kerr、Timothy Telleen-Lawton、Jia Yuan Loke、Jeffrey Ladish、Rebecca Raible、Rune Kvist、Rob Gilson、Guro Khundadze、Filipe Dobreira 和 Sebastian Conybeare 的帮助与支持。

A Appendix

A 附录

A.1 Author Contributions

A.1 作者贡献

Research: Deep Ganguli and Liane Lovitt co-led the project and analyzed the data together. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Ben Mann, and Jack Clark designed and executed the experiments. Liane Lovitt conducted informational interviews, a literature review, and surveys in order to protect and assess the well-being of the crowd workers who participated in our experiments. Jackson Kernion and Ben Mann built the human feedback data collection infrastructure we used to collect data. They also built the web interfaces to the AI assistant, along with Deep Ganguli and Amanda Askell. Jackson Kernion, along with Josh Jacobson, managed any issues raised by crowd workers. Amanda Askell, Jackson Kernion, and Jack Clark participated in pilot experiments in order to iterate on the experiment design. Nicholas Schiefer created the UMAP plot of red team attacks and helped to compute the minimum harmlessness score.

研究:Deep Ganguli 和 Liane Lovitt 共同领导了该项目并一起分析了数据。Deep Ganguli、Liane Lovitt、Jackson Kernion、Amanda Askell、Ben Mann 和 Jack Clark 设计并执行了实验。Liane Lovitt 进行了信息访谈、文献综述和调查,以保护和评估参与我们实验的众包工作者的福祉。Jackson Kernion 和 Ben Mann 构建了用于收集数据的人类反馈数据收集基础设施。他们还与 Deep Ganguli 和 Amanda Askell 一起构建了 AI 助手的 Web 界面。Jackson Kernion 和 Josh Jacobson 一起处理了众包工作者提出的任何问题。Amanda Askell、Jackson Kernion 和 Jack Clark 参与了试点实验,以迭代实验设计。Nicholas Schiefer 创建了红队攻击的 UMAP 图,并帮助计算了最小无害性分数。

Writing: Deep Ganguli and Liane Lovitt drafted the paper. Ethan Perez and Sam Bowman made significant contributions to the framing and presentation of the paper. Other members of Anthropic made miscellaneous contributions and suggestions throughout the writing process.

写作:Deep Ganguli 和 Liane Lovitt 起草了论文。Ethan Perez 和 Sam Bowman 对论文的框架和呈现做出了重要贡献。Anthropic 的其他成员在整个写作过程中提供了各种贡献和建议。

Policy: Liane Lovitt, Jack Clark, and Deep Ganguli designed the policy interventions and articulated the pros and cons for releasing the data. Liane Lovitt wrote the Datasheet. Nova DasSarma created the regular expression we used to identify personally identifiable information (PII) in our dataset and worked with Jack Clark and Liane Lovitt to filter the PII.

策略:Liane Lovitt、Jack Clark 和 Deep Ganguli 设计了政策干预措施,并阐述了发布数据的利弊。Liane Lovitt 撰写了数据表 (Datasheet)。Nova DasSarma 创建了我们用于识别数据集中个人身份信息 (PII) 的正则表达式,并与 Jack Clark 和 Liane Lovitt 合作过滤了这些 PII。