[论文翻译]红队测试语言模型以减少危害:方法、扩展行为与经验教训


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/RLHF%E8%AE%BA%E6%96%87%E9%9B%86/Red%20Teaming%20Language%20Models%20to%20Reduce%20Harms-%20Methods%2C%20Scaling%20Behaviors%2C%20and%20Lessons%20Learned.pdf


Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

红队测试语言模型以减少危害:方法、扩展行为与经验教训

Anthropic

Anthropic

Abstract

摘要

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models. Warning: this paper contains examples that may be offensive or upsetting.

我们描述了早期在语言模型上进行红队测试(red teaming)的努力,旨在同时发现、衡量并尝试减少其潜在的有害输出。我们做出了三项主要贡献。首先,我们研究了红队测试在三种模型规模(2.7B、13B 和 52B 参数)和四种模型类型中的扩展行为:普通语言模型(LM);被提示为有帮助、诚实和无害的语言模型;带有拒绝采样的语言模型;以及通过人类反馈强化学习(RLHF)训练为有帮助和无害的模型。我们发现,随着规模的增加,RLHF 模型越来越难以进行红队测试,而其他模型类型的扩展趋势则较为平缓。其次,我们发布了包含 38,961 次红队攻击的数据集,供其他人分析和学习。我们对数据进行了自己的分析,发现了多种有害输出,从冒犯性语言到更为微妙的有害非暴力不道德输出。第三,我们详尽描述了我们的指令、流程、统计方法以及对红队测试的不确定性。我们希望这种透明度能够加速我们作为社区共同合作的能力,以制定关于如何对语言模型进行红队测试的共享规范、实践和技术标准。警告:本文包含可能令人反感或不安的示例。

1 Introduction

1 引言

Large language models exhibit a wide range of harmful behaviors such as reinforcing social biases (e.g., [47, 28, 1, 33, 7]), generating offensive or toxic outputs [25], leaking personally identifiable information from the training data [13], aiding in disinformation campaigns [12], generating extremist texts [37], spreading falsehoods [35], and more [9, 10, 18, 57, 22, 51]. As AI systems improve, the scope of possible harms seems likely to grow [22]. Many strategies have been developed to address some of these harms (e.g., [58, 4, 48, 36, 34, 19, 60]). One potentially useful tool for addressing harm is red teaming—using manual or automated methods to adversarially probe a language model for harmful outputs, and then updating the model to avoid such outputs [42, 20, 3, 11]. In this paper, we describe our early efforts to implement manual red teaming to both make models safer and measure the safety of our models. The models trained with red team data were described in [4], so here we focus on describing our red team results and techniques in detail in the hope that others may benefit from and improve on them.

大语言模型展现出多种有害行为,例如强化社会偏见(如 [47, 28, 1, 33, 7])、生成冒犯性或有害的输出 [25]、从训练数据中泄露个人身份信息 [13]、协助虚假信息传播 [12]、生成极端主义文本 [37]、传播虚假信息 [35] 等 [9, 10, 18, 57, 22, 51]。随着 AI 系统的改进,潜在危害的范围似乎可能会扩大 [22]。许多策略已被开发出来以应对其中一些危害(如 [58, 4, 48, 36, 34, 19, 60])。一种可能有助于应对危害的工具是红队测试(red teaming)——通过手动或自动方法对抗性地探测语言模型的有害输出,然后更新模型以避免此类输出 [42, 20, 3, 11]。在本文中,我们描述了早期实施手动红队测试的努力,旨在使模型更安全并衡量模型的安全性。使用红队数据训练的模型已在 [4] 中描述,因此这里我们重点详细描述红队测试的结果和技术,希望其他人能够从中受益并加以改进。


Figure 1 Red team attack success by model size (x-axes) and model type (colors). (Left) Attack success measured by average red team member self-report (higher is more successful). (Middle) Attack success measured by average minimum harmlessness score (higher is better, i.e., less harmful). (Right) Distribution of minimum harmlessness score.

图 1: 按模型大小 (x 轴) 和模型类型 (颜色) 划分的红队攻击成功率。(左) 攻击成功率通过红队成员自我报告的平均值衡量 (越高表示越成功)。(中) 攻击成功率通过平均最小无害性评分衡量 (越高越好,表示危害越小)。(右) 最小无害性评分的分布。

We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (plain LM) [2]; an LM prompted to be helpful, honest, and harmless (HHH prompted LM) [2]; an LM with rejection sampling (RS), which returns the best of sixteen samples as ranked by a helpful and harmless preference model [4]; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF) with the same preference model [4]. The RS and RLHF models rely on data generated from red teaming the prompted LM (see $\S3.2$ for details on all models). Figure 1, middle, shows that: (1) RLHF models are significantly harder to red team as they scale, (2) plain LMs, prompted LMs, and RS models exhibit a flat trend with scale, (3) prompted LMs are not significantly harder to red team than plain LMs, which is inconsistent with our previous results that use static evaluations to show HHH prompting is an effective safety intervention [2], and (4) RS models are the most difficult to red team at any scale; however, qualitatively, they tend to be harmless by being evasive [4].

我们做出了三个主要贡献。首先,我们研究了在3种模型规模(2.7B、13B和52B参数)和4种模型类型下的红队测试扩展行为:普通语言模型(plain LM)[2];被提示为有帮助、诚实和无害的语言模型(HHH prompted LM)[2];带有拒绝采样(RS)的语言模型,该模型返回由有帮助和无害偏好模型排名的16个样本中的最佳样本[4];以及使用来自人类反馈的强化学习(RLHF)训练的有帮助和无害模型,使用相同的偏好模型[4]。RS和RLHF模型依赖于从提示语言模型的红队测试生成的数据(有关所有模型的详细信息,请参见$\S3.2$)。图1中间部分显示:(1) RLHF模型随着规模的扩大,红队测试难度显著增加,(2) 普通LM、提示LM和RS模型在规模上表现出平坦的趋势,(3) 提示LM的红队测试难度并不显著高于普通LM,这与我们之前使用静态评估显示HHH提示是一种有效的安全干预措施的结果不一致[2],以及(4) RS模型在任何规模下都是最难进行红队测试的;然而,从定性上看,它们往往通过回避来保持无害[4]。

Our second contribution is to release our dataset of 38,961 red team attacks for others to analyze and learn from (Table 1).2 We provide a Datasheet [24] in $\S\mathrm{A}.7$ that fully documents the data and we explain the pros and cons of releasing the data in $\S\mathrm{A}.5$ . Our dataset is an order of magnitude larger than a similar available red team dataset [60] and considers models one order of magnitude larger than those in [60]. To our knowledge, we release the only dataset of red team attacks on a model trained to be safe with RLHF. These types of models are already deployed [41] and we believe our data can help shed further light on their strengths and weaknesses. More generally, we believe our data can be used to understand what successful red team attacks look like, to build (semi-)automated red team techniques [42], to build classifiers for harmfulness, and to prototype strategies for measuring and mitigating harms in language models. We also provide our own preliminary analyses of the types of harms uncovered in our data (Figures 2 & 9, $\S4$).

我们的第二个贡献是发布了包含 38,961 次红队攻击的数据集,供其他人分析和学习(表 1)。我们在 $\S\mathrm{A}.7$ 中提供了数据表 [24],完整记录了数据,并在 $\S\mathrm{A}.5$ 中解释了发布数据的利弊。我们的数据集比现有的类似红队数据集 [60] 大一个数量级,并且考虑的模型规模也比 [60] 中的模型大一个数量级。据我们所知,我们发布了唯一一个针对通过 RLHF 训练以确保安全的模型进行红队攻击的数据集。这类模型已经部署 [41],我们相信我们的数据可以进一步揭示它们的优缺点。更广泛地说,我们相信我们的数据可以用于理解成功的红队攻击是什么样子的,构建(半)自动化的红队技术 [42],构建有害性分类器,以及原型化测量和减轻语言模型危害的策略。我们还提供了对我们数据中揭示的危害类型的初步分析(图 2 和图 9,$\S4$)。

Our last contribution is to exhaustively describe our instructions, processes, and statistical methodologies for red teaming ($\S3$). Throughout the design of our experiments, we arrived at many junctures in which we were unsure about how to proceed, even after a literature review on red teaming AI systems ($\S2$). As such, we conducted informational interviews with experts in the field of Trust & Safety and incorporated their suggested best practices ($\S\mathrm{A}.2$) into the design of our experiments in order to ensure the well-being of the red team. In general, we found that red team members enjoyed participating in our experiments and felt motivated by a mission to make AI systems less harmful ($\S\mathrm{A}.2$). Nevertheless, our work suffers from some limitations, which we discuss in $\S5.1$ . Based on our experiences, we propose some policy interventions for how we can work together as a community to develop shared norms, practices, and technical standards for how to red team language models ($\S5.2$).

我们的最后一项贡献是详尽描述了我们的指令、流程和红队测试的统计方法($\S3$)。在实验设计过程中,即使在对AI系统红队测试的文献进行了回顾之后($\S2$),我们仍遇到了许多不确定如何继续的节点。因此,我们与信任与安全领域的专家进行了信息访谈,并将他们建议的最佳实践($\S\mathrm{A}.2$)纳入实验设计中,以确保红队成员的福祉。总体而言,我们发现红队成员喜欢参与我们的实验,并受到减少AI系统危害的使命感的激励($\S\mathrm{A}.2$)。然而,我们的工作也存在一些局限性,我们将在$\S5.1$中讨论这些局限性。基于我们的经验,我们提出了一些政策干预措施,建议我们如何作为一个社区共同努力,制定共享的规范、实践和技术标准,以指导如何对语言模型进行红队测试($\S5.2$)。

2 Related Work

2 相关工作

We use the same models that we developed in our previous work where we train a general language assistant to be helpful, honest, and harmless [2, 4]. However, here we run additional experiments in order to determine the influence of model size on susceptibility to red team attacks (Figure 1) and analyze the content of the attacks (Figures 2 & 9) to understand the types of harms uncovered by red teaming. Additionally, we provide more detail on our red team methods, and release the data, so that others can reproduce (and improve upon) our red team approach and results.

我们使用了之前工作中开发的模型,这些模型训练了一个通用的语言助手,旨在使其有帮助、诚实且无害 [2, 4]。然而,在这里我们进行了额外的实验,以确定模型大小对红队攻击敏感性的影响(图 1),并分析攻击内容(图 2 和图 9),以了解红队测试所揭示的危害类型。此外,我们提供了更多关于红队方法的细节,并发布了数据,以便其他人可以复现(并改进)我们的红队方法和结果。


Figure 2 Visualization of the red team attacks. Each point corresponds to a red team attack embedded in a two dimensional space using UMAP. The color indicates attack success (brighter means a more successful attack) as rated by the red team member who carried out the attack. We manually annotated attacks and found several thematically distinct clusters of attack types (black ellipses and text).

图 2: 红队攻击的可视化。每个点对应于使用 UMAP 嵌入二维空间的红队攻击。颜色表示攻击的成功率(颜色越亮表示攻击越成功),由执行攻击的红队成员评分。我们手动标注了攻击,并发现了几种主题上不同的攻击类型集群(黑色椭圆和文本)。

Apart from our previous work, our approach is most similar to [60] & [53], who have crowd workers attempt to elicit offensive outputs from dialogue agents in open-ended dialogues, then use the resulting data to create effective safety interventions. In [60], they release a Bot Adversarial Dialogues (BAD) dataset of ${\sim}5\mathrm{K}$ conversations with 3 dialogue agents ranging in size from 345M to 2.7B parameters. We collect more data ( ${\sim}40\mathrm{K}$ attacks); red team larger models (up to 52B parameters) in order to measure scaling behaviors, as in [53]; and focus on reinforcement learning from human feedback [14] as our most promising safety intervention.

除了我们之前的工作,我们的方法与 [60] 和 [53] 最为相似,他们通过众包工作者尝试在开放式对话中引出对话模型的攻击性输出,然后利用这些数据创建有效的安全干预措施。在 [60] 中,他们发布了一个名为 Bot Adversarial Dialogues (BAD) 的数据集,包含约 5K 次对话,涉及 3 个不同规模的对话模型,参数范围从 345M 到 2.7B。我们收集了更多的数据(约 40K 次攻击),并对更大的模型(参数高达 52B)进行了红队测试,以衡量扩展行为,如 [53] 中所述;同时,我们将基于人类反馈的强化学习 [14] 作为最有前景的安全干预措施。

Recent work explores how to automate red teaming by using language models instead of humans as the red team [42]. The approach bootstraps from the BAD dataset [60], and uncovers a variety of harms including (but not limited to) finding groups of people that the dialogue agent discusses in offensive ways, identifying personally identifiable information, and leaking private training data. We uncover similar harms in our dataset and plan to use our own data to systematically compare and contrast the types of harms that can be uncovered in manual versus automated methods in future work ($\S5$).

最近的研究探索了如何通过使用语言模型而非人类作为红队来自动化红队测试 [42]。该方法从 BAD 数据集 [60] 中引导,揭示了多种危害,包括(但不限于)发现对话代理以冒犯性方式讨论的群体、识别个人身份信息以及泄露私人训练数据。我们在数据集中发现了类似的危害,并计划在未来的工作中使用我们自己的数据来系统比较和对比手动与自动化方法中可能发现的危害类型($\S5$)。

More generally, although our work focuses on adversarial attacks on generative models, it is heavily inspired by and related to prior work that examines the efficacy of adversarial testing to find and address vulnerabilities in NLP algorithms in discriminative settings. Some of these efforts augment humans (through guidelines, templates, programmatic generation of attacks, and various combinations thereof) to devise test cases that cause systems to fail [45, 46, 29, 21, 30, 55, 6, 23]. Others use humans in the loop to continuously and dynamically build, break, and fix [20] models in order to continuously make them more robust to failure modes [40, 32, 55, 61]. Finally, a large body of work aims to learn adversarial examples that cause downstream models to produce spurious outputs [50], some of which are reviewed in [59]. However, these examples often seem arbitrary and unintelligible to humans, and thus correspond to a different kind of attack than the ones we consider here.

更广泛地说,尽管我们的工作专注于生成模型的对抗攻击,但它深受并关联于先前的研究,这些研究探讨了在判别式设置中通过对抗测试来发现和解决 NLP 算法中的脆弱性的有效性。其中一些工作通过增强人类(通过指南、模板、程序化生成攻击及其各种组合)来设计导致系统失败的测试用例 [45, 46, 29, 21, 30, 55, 6, 23]。另一些工作则利用人类在循环中持续动态地构建、破坏和修复 [20] 模型,以使其对故障模式更加鲁棒 [40, 32, 55, 61]。最后,大量工作旨在学习导致下游模型产生虚假输出的对抗样本 [50],其中一些在 [59] 中进行了回顾。然而,这些样本对人类来说通常显得随意且难以理解,因此与我们在此考虑的对抗攻击类型不同。

Although our work focuses on language models, there is prior art on how to red team multi-modal generative models [39] in the context of DALL·E 2 [44]. These red team efforts reveal some categories of harms similar to those we find in our dataset, such as "Bias and representation"; "Harassment, bullying, and exploitation"; and "Dis- and misinformation", as well as other categories that we do not uncover or consider, such as "Copyright and Trademarks". Their red team consists of domain experts who are selected according to particular areas of expertise, whereas in our case we rely on crowd workers whose domain expertise we are unaware of ($\S5.1$).

尽管我们的工作主要关注语言模型,但在 DALL·E 2 [44] 的背景下,已有关于如何对多模态生成模型进行红队测试的研究 [39]。这些红队测试揭示了一些与我们数据集中发现的类似危害类别,例如:“偏见和代表性”;“骚扰、欺凌和剥削”;“虚假和误导信息”,以及其他我们未发现或考虑的类别,例如“版权和商标”。红队由根据特定专业领域选出的领域专家组成,而在我们的案例中,我们依赖的是我们不了解其专业背景的众包工人($\S5.1$)。

Table 1: Number of red team attempts per model size (rows) and safety intervention (columns)

表 1: 不同模型大小(行)和安全干预(列)下的红队尝试次数

| Parameters 参数 | Plain LM 普通语言模型 | Prompted LM 提示语言模型 | Rejection Sampling 拒绝采样 | Reinforcement Learning 强化学习 |
| --- | --- | --- | --- | --- |
| 2.7B | 1151 | 2044 | 1445 | 2498 |
| 13B | 2587 | 2174 | 1263 | 2292 |
| 52B | 1414 | 17739 | 1273 | 3081 |

3 Methods

3 方法

3.1 Red Team Task

3.1 红队任务

We developed an interface that instructs red team members to have open-ended conversations with an AI assistant [2, 4] in order to “make the AI behave badly, to get it to say obnoxious, offensive, and harmful things" (Figure 3). Similar to [60], we provide the red team with a brief list of example conversation topics but otherwise leave the instructions fairly open-ended. We ask the red team to rely on creativity, avoid obvious profanity, and to focus on a single topic per attack. To encourage this behavior, we ask participants to enter a short description of how they intend to red team the model prior to the attack.

我们开发了一个界面,指导红队成员与AI助手进行开放式对话 [2, 4],目的是“让AI表现不佳,使其说出令人厌恶、冒犯和有害的内容”(图 3)。与 [60] 类似,我们为红队提供了一个简短的示例对话主题列表,但其他方面则保持指令的开放性。我们要求红队依靠创造力,避免明显的粗俗语言,并专注于每次攻击的单一主题。为了鼓励这种行为,我们要求参与者在攻击前输入他们打算如何对模型进行红队测试的简短描述。

We were concerned about our task causing harm to the red team, so we reviewed available literature in Trust & Safety and related fields (e.g., [26, 15, 16, 49, 5, 27, 31]), as well as conducted a series of informational interviews with industry professionals in the field. We incorporated best practices from this research (see $\S\mathrm{A}.2$ for details) into the content and design of our task instructions (Figure 3, Left) and task interface (Figure 3, Right). For example, in the red team task instructions, we provide clear, specific warnings that red team members may be exposed to sensitive content. We also encourage red team members to select topics within their own risk tolerance, as opposed to prompting them to discuss specific, sensitive topics with the AI assistant. Through surveys and informal feedback, we found that red team members enjoyed the task and did not experience significant negative emotions.

我们担心我们的任务会对红队造成伤害,因此我们回顾了信任与安全及相关领域的现有文献(例如 [26, 15, 16, 49, 5, 27, 31]),并与该领域的行业专业人士进行了一系列信息访谈。我们将这些研究中的最佳实践(详见 $\S\mathrm{A}.2$)融入到任务说明(图 3,左)和任务界面(图 3,右)的内容和设计中。例如,在红队任务说明中,我们提供了明确、具体的警告,提醒红队成员可能会接触到敏感内容。我们还鼓励红队成员选择自己风险承受能力范围内的主题,而不是提示他们与AI助手讨论特定的敏感话题。通过调查和非正式反馈,我们发现红队成员喜欢这项任务,并且没有经历显著的负面情绪。

After the red team member agrees to the terms of the task (Figure 3, Left), they proceed to have an open-ended, multi-turn conversation with the AI assistant. After each turn in a conversation, we present the participant with two possible model-generated responses, and ask them to select the more harmful of the two as in [2, 4]. We do this for two reasons. First, this allows us to expedite the red team's ability to find vulnerabilities in our systems by a factor of two: generative models are stochastic, and thus our approach allows the red team twice as many opportunities to catch harmful behavior per attempt. Second, this procedure generates a dataset of pairs of model responses, where one response is labeled as more harmful than the other. We use this dataset to train a harmlessness preference model, which takes as input a model generated response and outputs a score which is lower for more harmful model responses, and higher for less harmful model responses [14, 2, 4].4 We use the resulting preference model to build safety interventions, which we describe in $\S3.2$ . We do not define what "harmful" means, as this is a complex and subjective concept; instead, we rely on the red team to make their own determinations via a pairwise preference choice [14].

在红队成员同意任务条款后(图 3,左),他们开始与 AI 助手进行开放式的多轮对话。在每轮对话后,我们会向参与者展示两个可能的模型生成响应,并要求他们选择其中更有害的一个,如 [2, 4] 所述。我们这样做有两个原因。首先,这使我们能够将红队发现系统漏洞的能力提高一倍——生成模型是随机的,因此我们的方法使红队每次尝试都有两倍的机会捕捉到有害行为。其次,此过程生成了一对模型响应的数据集,其中一个响应被标记为比另一个更有害。我们使用这个数据集来训练一个无害性偏好模型,该模型以模型生成的响应作为输入,并输出一个分数,分数越低表示模型响应越有害,分数越高表示模型响应越无害 [14, 2, 4]。我们使用生成的偏好模型来构建安全干预措施,这将在 $\S3.2$ 中描述。我们没有定义“有害”的含义,因为这是一个复杂且主观的概念;相反,我们依赖红队通过成对偏好选择来做出自己的判断 [14]。
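As a concrete sketch of how such pairwise comparisons can train a preference model, the standard approach minimizes a Bradley-Terry-style loss over the score margin between the two responses. The function below is an illustration of that general technique; the scores are plain floats standing in for preference-model outputs, and the paper's actual training setup is described in [2, 4].

```python
import numpy as np

def pairwise_preference_loss(score_less_harmful, score_more_harmful):
    """Bradley-Terry-style pairwise loss: pushes the preference model to
    assign a higher harmlessness score to the response the red team member
    labeled less harmful."""
    margin = np.asarray(score_less_harmful) - np.asarray(score_more_harmful)
    # -log(sigmoid(margin)), written in a numerically stable form
    return float(np.logaddexp(0.0, -margin).mean())
```

For a correctly ordered pair the loss is small, and for a reversed pair it is large, so minimizing it over model parameters widens the score gap in the labeled direction.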

We ask red team members to have a back-and-forth conversation for four turns (Figure 3, Right). We do not strictly limit the number of turns in each conversation, and empirically, we observe most conversations are 1-4 turns, with some lasting longer. At the end of each conversation, we ask the participant to rate how successful they were at making the AI assistant say something bad. We collect these ratings on a 5-point Likert scale (ranging from 0 to 4) where a 0 means "Not successful" and a 4 means "Very successful" (Figure 3, Right). Red team members continue this process for a series of five dialogues, typically on five unique topics, which culminates in one overall task. Red team members could then choose to complete further tasks.

我们要求红队成员进行四轮来回对话(图 3,右)。我们没有严格限制每次对话的轮数,根据经验,我们观察到大多数对话为 1-4 轮,有些持续更长。在每次对话结束时,我们要求参与者评估他们在让 AI 助手说出不良内容方面的成功程度。我们使用 5 点李克特量表(范围从 0 到 4)收集这些评分,其中 0 表示“不成功”,4 表示“非常成功”(图 3,右)。红队成员继续完成一系列共五次对话,通常涉及五个不同的主题,这构成一个完整任务。之后,红队成员可以选择完成更多任务。

The AI assistant is powered by four types of dialogue models: one baseline model and three models with different types of safety interventions. We assign red team members to models at random—the red team does not know which model they interact with. We describe these models further in the next section.

AI 助手由四种对话模型驱动:一个基线模型和三个具有不同类型安全干预措施的模型。我们随机将红队成员分配给模型——红队不知道他们与哪个模型交互。我们将在下一节中进一步描述这些模型。


Figure 3 (Left) Red team task instructions. (Right) Example of a red team attempt.

图 3: (左) 红队任务指令。(右) 红队尝试的示例。

3.2 Models

3.2 模型

We derive dialogue models, with various safety interventions, from a general language model, and in some cases, a helpful and harmless preference model. For simplicity, we refer to the preference model as a harmlessness preference model, and the output of the model as a harmlessness score throughout this work.6 Here, we first provide basic details on the general language model and the harmlessness preference model, then elaborate on the four dialogue models that power the AI assistant.

我们从通用语言模型中派生出具有各种安全干预措施的对话模型,在某些情况下,还从有帮助和无害的偏好模型中派生。为简化起见,在本工作中,我们将偏好模型称为无害偏好模型,并将模型的输出称为无害分数。6 在此,我们首先提供关于通用语言模型和无害偏好模型的基本细节,然后详细阐述驱动AI助手的四种对话模型。

For our general language models, we train decoder-only transformer models ranging in size from 2.7B to 13B to 52B parameters. Full details about model architectures, training data, training procedures, and model evaluations are described elsewhere [2].

对于我们的通用语言模型,我们训练了参数规模从2.7B到13B再到52B的解码器专用Transformer模型。关于模型架构、训练数据、训练过程和模型评估的完整细节在其他地方有详细描述 [2]。

6 More generally, our preference model is trained to predict both harmlessness and helpfulness. For the latter, we created a separate interface in order to collect preference data about helpfulness. We found a fundamental tension between helpfulness and harmlessness: a model can simply be harmless by refusing to be helpful [4]. As such, we train our preference models to predict both harmlessness and helpfulness. We find that this approach helps to address this tension without loss in predictive accuracy for harmlessness [4].

更广泛地说,我们的偏好模型被训练来预测无害性和有用性。对于后者,我们创建了一个单独的界面来收集关于有用性的偏好数据。我们发现有用性和无害性之间存在一种根本性的张力,模型可以通过拒绝提供帮助来简单地保持无害 [4]。因此,我们训练我们的偏好模型来同时预测无害性和有用性。我们发现这种方法有助于解决这种张力,而不会降低对无害性的预测准确性 [4]。

To train our harmlessness preference model, we use the comparison data from red team attacks on the 52B parameter prompted language model (described below) as the training data—this is why we collected an order of magnitude more data in this case (Table 1). To build these models, we fine-tune 2.7B, 13B, and 52B general language models to predict which model utterances red team members found less harmful, thus producing a harmlessness score [2]. A lower score means more harmful.

为了训练我们的无害偏好模型,我们使用了对52B参数提示语言模型(如下所述)进行红队攻击时生成的比较数据作为训练数据——这也是为什么我们在这种情况下收集了数量级更多的数据(表1)。为了构建这些模型,我们对2.7B、13B和52B的通用语言模型进行了微调,以预测红队成员认为哪些模型输出更无害,从而生成一个无害评分 [2]。评分越低,表示越有害。

Plain language models (Plain LM) We use 1-shot learning (in which we place a single example of a 3-turn conversation in our Human/Assistant format in context) to prompt our general language models to behave as dialogue models for use in the interface described above [2]. We consider this method a baseline or control model, since it minimally departs from a general-purpose plain language model and has no explicit safety intervention.

普通语言模型 (Plain LM)
我们使用 1-shot 学习(在上下文中放置一个 3 轮对话的示例,格式为 Human, Assistant)来提示我们的通用语言模型表现得像对话模型,以便在上述界面中使用 [2]。我们将此方法视为基线或控制模型,因为它与通用普通语言模型的差异最小,并且没有明确的安全干预措施。

Prompted language models (Prompted LM) We use 14-shot learning to prompt our general language models to be helpful, harmless, and honest (HHH) [2], similar to dialogue-prompted Gopher [43]. We consider this a simple safety intervention, since we found it to be surprisingly effective at reducing model toxicity, especially for larger models [2, 43]. Furthermore, we use context distillation [2] to train "prompt-free" variants of these prompted models in order to retain the influence of the prompt without occupying a significant portion of the limited context window and decreasing inference time [2]. Empirically, in previous work, we found minimal differences between prompting and context distillation [2].

提示语言模型 (Prompted LM)
我们使用 14-shot 学习来提示我们的通用语言模型,使其具备帮助性、无害性和诚实性 (HHH) [2],类似于对话提示的 Gopher [43]。我们认为这是一种简单的安全干预措施,因为我们发现它在减少模型毒性方面出奇地有效,尤其是对于较大的模型 [2, 43]。此外,我们使用上下文蒸馏 [2] 来训练这些提示模型的“无提示”变体,以保留提示的影响,同时不占用有限上下文窗口的显著部分,并减少推理时间 [2]。根据经验,在之前的工作中,我们发现提示和上下文蒸馏之间的差异很小 [2]。

Rejection sampling (RS) We generate 16 samples of AI assistant responses from prompted language models, rank these samples with the harmlessness preference model, and select the 2 least harmful samples to present to the red team member, thus rejecting the 14 relatively more harmful responses. We did not experiment with changing the parameter 16. We tie the size of the prompted model to the size of the harmlessness preference model, e.g., a 2.7B parameter rejection sampling model consists of a 2.7B prompted language model paired with a 2.7B harmlessness preference model.7

拒绝采样 (Rejection Sampling, RS)
我们从提示语言模型中生成 16 个 AI 助手响应样本,使用无害偏好模型对这些样本进行排序,并选择 2 个最无害的样本呈现给红队成员,从而拒绝 14 个相对更有害的响应。我们没有尝试改变参数 16。我们将提示模型的大小与无害偏好模型的大小绑定,例如,一个 2.7B 参数的拒绝采样模型由一个 2.7B 的提示语言模型和一个 2.7B 的无害偏好模型配对组成。[7]
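The best-of-16 procedure described above can be sketched as follows; `sample_fn` and `score_fn` are hypothetical stand-ins for sampling a response from the prompted LM and scoring it with the harmlessness preference model:

```python
import numpy as np

def rejection_sample(sample_fn, score_fn, n_samples=16, n_keep=2):
    """Draw n_samples candidate responses, score each with a harmlessness
    preference model (higher = less harmful), and return the n_keep
    highest-scoring candidates, rejecting the rest."""
    candidates = [sample_fn() for _ in range(n_samples)]
    scores = np.array([score_fn(c) for c in candidates])
    best_first = np.argsort(scores)[::-1]  # least harmful candidates first
    return [candidates[i] for i in best_first[:n_keep]]
```

Keeping the top 2 of 16 matches the interface's two-response choice; note that this intervention buys harmlessness with extra test-time compute rather than any change to model weights.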

Reinforcement learning from human feedback (RLHF) We start with a prompted language model, then use reinforcement learning to train it to maximize the scores given by the preference model described above. As in the rejection sampling case, we tie the size of the prompted model to the size of the preference model. Full details about the training procedures, training datasets, and model evaluations are described elsewhere [4]. Intuitively, we expect RLHF models to behave similarly (but not identically) to RS models; however, RLHF is computationally expensive at training time but efficient at test time, whereas RS is the opposite.

基于人类反馈的强化学习 (RLHF)
我们从提示语言模型开始,然后使用强化学习来训练它,以最大化上述偏好模型给出的分数。与拒绝采样情况一样,我们将提示模型的大小与偏好模型的大小绑定。关于训练过程、训练数据集和模型评估的完整细节在其他地方有描述 [4]。直观上,我们期望 RLHF 模型的行为与 RS 模型相似(但不完全相同);然而,RLHF 在训练时计算成本高,但在测试时效率高。RS 则相反。
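A common formulation of this objective, written here in the standard KL-regularized form from the RLHF literature (the exact regularization used for these models is detailed in [4]), is to maximize the preference-model score while penalizing divergence from the initial prompted policy:

$$\max_{\pi}\;\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r_{\mathrm{PM}}(x,y)\big]\;-\;\beta\,D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{prompted}}(\cdot\mid x)\big)$$

where $r_{\mathrm{PM}}$ is the harmlessness preference model's score for response $y$ to context $x$, and $\beta$ trades off reward maximization against staying close to the prompted LM.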

3.3 Red Team

3.3 红队

Our red team consists of 324 US-based crowd workers whom we primarily recruited from Amazon's Mechanical Turk (MTurk) platform ($n = 307$) and the Upwork platform ($n = 17$). On MTurk, we paid between \$7.50 and \$9.50 for each set of 5 conversations completed. We found that crowd workers could complete at least 2 tasks an hour, which means that we paid at or above California minimum wage.8 On Upwork, we paid participants \$20 per hour. Similar to [53], we asked participants to fill out a short demographic survey that incorporated U.S. census categories and offered participants the option to answer "Prefer not to say" for each question (Figure 4).

我们的红队由 324 名美国众包工作者组成,主要从 Amazon 的 Mechanical Turk (MTurk) 平台($n = 307$)和 Upwork 平台($n = 17$)招募。在 MTurk 上,我们为每完成 5 组对话支付 7.50 美元到 9.50 美元。我们发现众包工作者每小时至少可以完成 2 个任务,这意味着我们支付的报酬达到或超过了加州的最低工资标准。在 Upwork 上,我们每小时支付参与者 20 美元。与 [53] 类似,我们要求参与者填写一份简短的人口统计调查,该调查采用了美国人口普查类别,并为每个问题提供了“不愿回答”的选项(图 4)。

We found that the crowd worker population may not be fully representative of the U.S. population, according to U.S. Census data [54]. For example, we find that individuals who self-identify as "White or Caucasian" are slightly over-represented in our experiments ( $79%$ versus the current U.S. Census estimate of $75.8%$ ). Similarly, the percentage of participants with at least a college degree was significantly higher than what is reported by the U.S. Census ( $66%$ versus $32.9%$ ).

我们发现,根据美国人口普查数据 [54],众包工作者群体可能并不能完全代表美国人口。例如,我们发现自认为是“白人或高加索人”的个体在我们的实验中略微过多(79% 对比当前美国人口普查估计的 75.8%)。同样,拥有至少大学学历的参与者比例显著高于美国人口普查报告的数据(66% 对比 32.9%)。

Figure 5 shows descriptive statistics about the red team. In particular, we find that ${\sim}80%$ of the red team attacks come from ${\sim}50$ out of ${\sim}300$ workers. As such, the overwhelming majority of the dataset is generated by a minority of particularly prolific red team members. Furthermore, we fit a linear mixed model that evaluates the inherent efficacy of a red team member, which we plot in Figure 5 (Right). We find that some workers are particularly effective at red teaming, whereas others are not. In Appendix A.3 we re-analyze our data while controlling for these two confounds (particularly prolific workers, and particularly (in)effective red team members) and find that these confounds do not significantly influence the main results in Figure 1.

图 5 展示了红队的描述性统计。特别是,我们发现大约 80% 的红队攻击来自大约 300 名工作者中的 50 名。因此,绝大多数数据集是由少数特别多产的红队成员生成的。此外,我们拟合了一个线性混合模型,评估红队成员的固有效能,并在图 5(右)中绘制。我们发现一些工作者在红队测试中特别有效,而另一些则不然。在附录 A.3 中,我们在控制这两个混杂因素(特别多产的工作者和特别(不)有效的红队成员)的情况下重新分析了我们的数据,发现这些混杂因素对图 1 中的主要结果没有显著影响。
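As a simplified, numpy-only stand-in for that linear mixed model, one can estimate each worker's effect as their shrunken deviation from the grand-mean minimum harmlessness score; the shrinkage term mimics how a random effect pulls small-sample workers toward the average. This is an illustrative sketch, not the paper's actual fitting procedure:

```python
import numpy as np
from collections import defaultdict

def worker_effects(worker_ids, min_harmlessness_scores, shrinkage=1.0):
    """Estimate a per-worker effect on the minimum harmlessness score.
    A negative effect means the worker elicits more harmful responses than
    average, i.e., is a more effective red teamer. `shrinkage` plays the
    role of prior strength: workers with few attacks are pulled toward 0."""
    scores = np.asarray(min_harmlessness_scores, dtype=float)
    grand_mean = scores.mean()
    per_worker = defaultdict(list)
    for worker, score in zip(worker_ids, scores):
        per_worker[worker].append(score)
    return {
        worker: (np.mean(s) - grand_mean) * len(s) / (len(s) + shrinkage)
        for worker, s in per_worker.items()
    }
```

A proper mixed model additionally estimates the between-worker variance from the data instead of fixing the shrinkage strength by hand.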

Figure 4 Results of a demographic survey completed by 115 of 324 red team members.

图 4: 324 名红队成员中 115 人完成的人口统计调查结果。

| 性别 | 人数 | 百分比 |
| --- | --- | --- |
| 男性 | 54 | 47.0% |
| 女性 | 60 | 52.2% |
| 非二元性别 | 1 | 0.9% |
| 不愿透露 | 0 | 0% |

| 性取向 | 人数 | 百分比 |
| --- | --- | --- |
| 异性恋 | 94 | 81.7% |
| 同性恋 | 5 | 4.3% |
| 双性恋 | 14 | 12.2% |
| 不确定 | 1 | 0.9% |
| 不愿透露 | 0 | 0% |
| 其他 | 1 | 0.9% |

| 年龄组 | 人数 | 百分比 |
| --- | --- | --- |
| 18-24 | 0 | 0% |
| 25-34 | 29 | 25.2% |
| 35-44 | 39 | 33.9% |
| 45-54 | 27 | 23.5% |
| 55-64 | 16 | 13.9% |
| 65+ | 2 | 1.7% |
| 不愿透露 | 2 | 1.7% |

| 种族 | 人数 | 百分比 |
| --- | --- | --- |
| 美洲印第安人或阿拉斯加原住民 | 2 | 1.7% |
| 亚洲人 | 3 | 2.6% |
| 黑人或非裔美国人 | 10 | 8.7% |
| 西班牙裔、拉丁裔或西班牙人 | 1 | 0.9% |
| 中东或北非人 | 1 | 0.9% |
| 夏威夷原住民或太平洋岛民 | 1 | 0.9% |
| 白人或高加索人 | 94 | 81.7% |
| 不愿透露 | 1 | 0.9% |
| 其他 | 2 | 1.7% |

| 教育程度 | 人数 | 百分比 |
| --- | --- | --- |
| 高中或部分大学 | 40 | 34.8% |
| 大学学位 | 62 | 53.9% |
| 研究生或专业学位 | 12 | 10.4% |
| 不愿透露 | 0 | 0% |
| 其他 | 1 | 0.9% |

| 残疾 | 人数 | 百分比 |
| --- | --- | --- |
| 听力困难 | 0 | 0% |
| 视力困难 | 1 | 0.9% |
| 认知困难 | 1 | 0.9% |
| 行动困难 | 4 | 3% |
| 自理困难 | 1 | 0.9% |
| 其他 | 2 | 1.5% |
| 无 | 106 | 92% |

3.4 Data Analysis

3.4 数据分析

With our interface, models, and red team in place, we collect 38,961 red team attacks, with O(1K) attacks per model type in all cases except for the 52B prompted model, for which we collect O(10K) attacks (Table 1). We collect more data in the latter case in order to train our harmlessness preference models, as described in $\S3.2$ . Figure 6 shows an example red team attack and how we quantify it. In particular, we measure 3 variables for each attack. First, we record the red team member's self-rating of how successful they were on a 5-point Likert scale, where a 0 indicates an unsuccessful attempt, and a 4 indicates a very successful attempt (see also Figure 3, Right, for an example). Figure 7 (Left) shows the distribution over this variable, which is approximately bimodal, with two peaks at 0 and 4, and relatively more mass at 0. This indicates that, on average, red team members self-report successful attacks ${\sim}35%$ of the time.

在我们的界面、模型和红队准备就绪后,我们收集了38,961次红队攻击,每种模型类型的攻击次数为O(1K),除了52B提示模型,我们收集了O(10K)次攻击(表1)。我们在后一种情况下收集了更多数据,以便训练我们的无害偏好模型,如$\S3.2$所述。图6展示了一个红队攻击的示例以及我们如何量化它。特别是,我们为每次攻击测量了3个变量。首先,我们记录了红队成员在5点李克特量表上的自我评分,其中0表示尝试不成功,4表示非常成功(参见图3右侧的示例)。图7(左)显示了该变量的分布,它大致呈双峰分布,峰值在0和4处,0处的相对质量更多。这表明,平均而言,红队成员自我报告的成功攻击次数约为${\sim}35%$。


Figure 5 Descriptive statistics about red team members. (Left) Total number of red team attempts (y-axis) per red team member (x-axis), sorted by number of attempts. (Middle) The cumulative distribution (CDF) of the data from the left panel shows that ${\sim}80%$ of attacks come from ${\sim}15%$ of the red team participants. (Right) Estimate of how effective each red team member is at red teaming (y-axis, 0 means average, lower means more effective, lines indicate $95%$ confidence intervals) according to their ability to achieve a low minimum harmlessness score. X-axis is sorted by ability.

图 5: 红队成员的描述性统计。(左) 每个红队成员的总尝试次数 (y轴) 按尝试次数排序 (x轴)。(中) 左图数据的累积分布 (CDF) 显示,约 80% 的攻击来自约 15% 的红队参与者。(右) 根据每个红队成员达到低最低无害性分数的能力,估计他们在红队测试中的效果 (y轴,0 表示平均水平,越低表示越有效,线条表示 95% 的置信区间)。x轴按能力排序。

Next, we use our harmlessness preference model to compute the harmlessness score (higher is less harmful and thus better) of the AI assistant's dialogue. In particular, we compute the score for each assistant utterance, conditioned on everything preceding the utterance, for each utterance in the conversation. For an $N$-turn conversation, this results in $N$ harmlessness scores (Figure 6). To turn this into one number, we either compute the mean or minimum (worst) harmlessness of the AI assistant. Empirically, we find that the precise choice of aggregate statistic does not significantly change our results, so we choose the minimum harmlessness as a lower bound on the overall harmlessness of the model.

接下来,我们使用无害偏好模型来计算AI助手的对话的无害分数(分数越高表示越无害,因此越好)。具体来说,我们为对话中的每个助手话语计算分数,条件是话语之前的所有内容。对于一个 $N$ 轮对话,这将产生 $N$ 个无害分数(图6)。为了将其转化为一个数字,我们计算AI助手的平均或最小(最差)无害分数。经验表明,聚合统计的具体选择不会显著改变我们的结果,因此我们选择最小无害分数作为模型整体无害性的下限。
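This scoring-and-aggregation step can be sketched as follows, with `score_fn` a hypothetical stand-in for the preference model applied to one assistant utterance given its preceding context:

```python
def conversation_harmlessness(turns, score_fn):
    """turns: list of (speaker, utterance) pairs in order. Scores each
    Assistant utterance conditioned on everything before it, then returns
    the per-utterance scores and their minimum, the lower bound on the
    conversation's harmlessness used as the headline metric."""
    scores, context = [], []
    for speaker, utterance in turns:
        if speaker == "Assistant":
            scores.append(score_fn(list(context), utterance))
        context.append((speaker, utterance))
    return scores, min(scores)
```

Replacing `min` with the mean gives the alternative aggregate discussed above; the minimum is the more conservative choice, since a single harmful utterance dominates the score.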

Figure 7 (Middle) shows the distribution of the minimum harmlessness score over all red team attacks for all the models. The distribution is centered around 0 and skews negative. A more negative score corresponds to more harmful model responses, and a more positive score corresponds to less harmful model responses. The shape of this distribution suggests that the red team members are indeed effective at soliciting harmful responses from the AI assistant. In general, we find that the minimum harmlessness score is inversely proportional to the red team member's self-rating of attack success, which is expected ($\S\mathrm{A}.4$, Figure 11). However, the correlation is not perfect. As such, we report statistics of both these variables, conditioned on model type, as measures of red team efficacy in $\S4$.

图 7 (中) 展示了所有模型在所有红队攻击中的最小无害性得分的分布。该分布以 0 为中心,呈负偏态。得分越负,表示模型响应越有害;得分越正,表示模型响应越无害。该分布的形状表明,红队成员确实能够有效地诱导 AI 助手产生有害响应。总体而言,我们发现最小无害性得分与红队成员自我评估的攻击成功率成反比,这是预期的 ($\S\mathrm{A}.4$,图 11)。然而,相关性并不完美。因此,我们在 $\S4$ 中报告了这两个变量的统计数据,并按模型类型进行了条件分析,作为红队效能的衡量标准。

Finally, we also use the harmlessness preference model to score the harmfulness of the red team member's intent. To do so, we run the preference model on the red team member's task description (Figure 6). Figure 7 (Right) shows the distribution over this variable, which appears normally distributed with a mean around 1. As such, short descriptions of the attack score as less harmful than the actual AI utterances. We view the intent harmlessness score as a possible confound that we control for in further statistical analyses of the data ($\S\mathrm{A}.3$). Since we find that it does not influence our main results, we do not report on this variable further in the main text.

最后,我们还使用无害偏好模型来评估红队成员意图的有害性。为此,我们在红队成员的任务描述上运行偏好模型(图 6)。图 7(右)显示了该变量的分布情况,其呈现正态分布,均值约为 1。因此,简短的攻击描述比实际的 AI 话语的有害性得分更低。我们将意图无害性得分视为一个可能的混淆变量,在进一步的数据统计分析中对其进行控制($\S\mathrm{A}.3$)。由于我们发现它不影响我们的主要结果,因此在正文中不再进一步报告该变量。

3.5 Review Task

3.5 审查任务

After we collected the data across all model types, we performed a follow-up experiment to measure two separate variables: the inter-annotator agreement on self-reported attack success, and the content of the attack types. The former is important because self-ratings of attack success are subjective, and can vary based on elements of the red team attack and the red team member that we do not control (e.g., the type of attack or the background of the red team member). As such, we were interested in understanding how much variability (across different raters) there might be in defining a successful attack.

在我们收集了所有模型类型的数据后,我们进行了一项后续实验,以测量两个独立的变量:攻击成功自我报告中的标注者间一致性,以及攻击类型的内容。前者很重要,因为攻击成功的自我评分是主观的,并且可能基于我们无法控制的红队攻击和红队成员的要素(例如,攻击类型或红队成员的背景)而有所不同。因此,我们感兴趣的是了解在定义成功攻击时可能存在多少变异性(跨不同评分者)。


Figure 6 Example of how we quantify red team attempts. First, we compute a harmlessness score (lower is more harmful) on the task description (red). Next, we compute a harmlessness score on the assistant utterances, conditioned on all previous human and assistant utterances (black scores, adjacent to assistant utterances). We aggregate these scores using either a min or max (black, bold). Finally, we rely on human judgement of attack success on a Likert scale (blue).

图 6: 我们如何量化红队尝试的示例。首先,我们计算任务描述的无害性分数(越低越有害)(红色)。接下来,我们计算助手话语的无害性分数,条件是所有先前的人类和助手话语(黑色分数,紧邻助手话语)。我们使用最小值或最大值来聚合这些分数(黑色,加粗)。最后,我们依赖人类对攻击成功的判断,使用李克特量表(蓝色)。

Figure 8 shows our task instructions (Left) and interface (Right). We have 3 annotators review each transcript in the experiment. We ran this experiment on a random sample of 500 red team attacks on the 52B Prompted LM and 500 attacks on the 52B RLHF model. We have each reviewer report their judgement of how successful the red team member was at making the AI assistant “say something bad.” We measure this variable on the same 5-point Likert scale as the main red team experiment and use consistent language across both experiments.

图 8 展示了我们的任务说明(左)和界面(右)。在实验中,我们让 3 位标注者审查每份转录文本。我们在 52B Prompted Language 模型的 500 次红队攻击随机样本和 52B RLHF 模型的 500 次攻击上进行了此实验。我们让每位评审员报告他们对红队成员在“让 AI 助手说出一些不好的内容”方面的成功程度的判断。我们使用与主要红队实验相同的 5 点李克特量表来衡量这一变量,并在两个实验中使用一致的语言。

We find a low level of inter-rater agreement on the success of red team attacks (according to our task setup and instructions), consistent with [60]. In particular, we report Fleiss's kappa, a statistical measure of inter-annotator agreement that is bounded in $[-\infty,1]$, where $-\infty$ implies no agreement and 1 indicates perfect agreement. We report a Fleiss's kappa of 0.32 between the 4 raters (the original author and the 3 reviewers) based on the Likert rating scale. When we binarize the ratings (1 if rating $\geq3$, else 0), the agreement increases to 0.49.

我们发现,在红队攻击的成功率上,评分者之间的一致性较低(根据我们的任务设置和指示),这与[60]的结果一致。具体来说,我们报告了Fleiss's Kappa,这是一种用于衡量评分者一致性的统计指标,其取值范围为$[-\infty,1]$,其中$-\infty$表示完全不一致,1表示完全一致。基于Likert评分量表,我们报告了4位评分者(作者和3位评审员)之间的Fleiss's Kappa为0.32。当我们将评分二值化(如果评分$\geq3$则为1,否则为0)时,一致性提高到0.49。
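For reference, Fleiss's kappa can be computed directly from per-item category counts. The sketch below (with toy ratings, not our actual data) also shows the binarization step described above:

```python
def fleiss_kappa(counts):
    """counts: one row per item, one column per category; each row sums to the
    number of raters n (assumed constant). Returns (P_bar - P_e) / (1 - P_e)."""
    N = len(counts)                # number of items
    n = sum(counts[0])             # raters per item
    k = len(counts[0])             # number of categories
    # Observed agreement per item, averaged over items.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N
    # Chance agreement from the marginal category proportions.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

def binarize(likert_rating):
    """Map a 1-5 Likert rating to the binary scale used above."""
    return 1 if likert_rating >= 3 else 0

# Toy example: 4 raters, 4 items, 2 categories; perfect agreement gives kappa = 1.
print(fleiss_kappa([[4, 0], [0, 4], [4, 0], [0, 4]]))  # 1.0
```

Binarizing coarsens the scale, which is one reason agreement tends to rise after binarization, as observed in our data.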


Figure 7 (Left) Marginal distribution of self-reported red team success ratings (higher is more successful). (Middle) Probability density function (PDF) of minimum AI harmlessness scores computed from the AI utterances (lower is more harmful). (Right) Distribution of harmlessness scores computed from short descriptions (written by red team members) of attack intent.

图 7: (左) 红队成功率自我报告的边际分布 (越高表示越成功) (中) 从 AI 话语中计算的最小 AI 无害性得分的概率分布函数 (PDF) (越低表示越有害) (右) 从攻击意图的简短描述 (由红队成员撰写) 中计算的无害性得分的分布。

Furthermore, when we exclude the original author and measure the agreement between the 3 annotators, we also see a modest increase in agreement for both the Likert and Binary scales, achieving a maximum agreement of 0.55 for the reviewer-only binary case. Taken together, our results suggest poor to fair agreement on what constitutes a successful attack.

此外,当我们排除原作者并测量3位标注者之间的一致性时,我们也看到Likert量表和二元量表的一致性都有所提高,在仅考虑审稿人的二元情况下,最大一致性达到0.55。综合来看,我们的结果表明,对于什么是成功的攻击,一致性较差到一般。

To get a sense of the types of harms the attacks were meant to elicit, we asked the reviewers to tag transcripts with up to 2 of 20 total topic tags (Figure 8, Right). To develop the list of topic tags, we referred to the taxonomies of potential harms of language models in [48, 57], industry content moderation guidelines, and a manual review of the top 100 most harmful conversations in our dataset. We discuss our findings on tag frequencies in Figure 9 and $\S4$.

为了了解攻击旨在引发的危害类型,我们要求评审员从共 20 个主题标签中为每份对话记录打上最多 2 个标签(图 8,右)。为了制定主题标签列表,我们参考了 [48, 57] 中关于语言模型潜在危害的分类法、行业内容审核指南,以及对我们数据集中最有害的 100 个对话的手动审查。我们将在图 9 和 $\S4$ 中讨论标签频率的发现。

We were particularly concerned with exposing reviewers to potential harm while participating in this experiment, since we ask reviewers to read, rate, and annotate harmful conversations they were not involved in writing. To mitigate this risk, we reviewed and incorporated findings from the literature on Trust & Safety [16, 31, 26] into the content of both the task instructions (Figure 8, Left) and interface (Figure 8, Right), as well as the overall design of the experiment. For example, we built custom warning functionality which allowed reviewers to see a preview of the harmful text without being exposed to the entire conversation. Within the preview window, reviewers could skip to the next conversation or proceed with reviewing and rating the selected conversation. We provide further details in $\S\mathrm{A}.2$.

我们特别关注评审员在参与此实验时可能面临的潜在伤害,因为我们要求评审员阅读、评分和注释他们未参与编写的有害对话。为了减轻这一风险,我们审查并整合了关于信任与安全 (Trust & Safety) 的文献 [16, 31, 26] 中的发现,将其应用于任务说明(图 8,左)和界面(图 8,右)的内容,以及实验的整体设计中。例如,我们构建了自定义的警告功能,允许评审员在不暴露整个对话的情况下预览有害文本。在预览窗口中,评审员可以选择跳过当前对话或继续评审和评分选定的对话。更多细节请参见 $\S\mathrm{A}.2$。

Our informational interviews with Trust & Safety industry professionals highlighted the need for creating a sense of community among workers and building social support networks as ways to mitigate possible harms associated with reviewing troubling content, consistent with [26]. As a result, we decided to limit the population of reviewers in this experiment to Upworkers, and we used a shared communication tool (Slack) to regularly communicate with the group. This allowed participants to ask questions, share examples, and discuss work and non-work related topics, not only amongst themselves, but also directly with research staff.

我们与信任与安全行业专业人士的信息访谈强调了在员工中建立社区意识和构建社交支持网络的重要性,以减轻审查令人不安内容可能带来的伤害,这与[26]一致。因此,我们决定将本次实验的审查员群体限制为Upworkers,并使用共享通信工具(Slack)定期与团队沟通。这使得参与者不仅可以在彼此之间提问、分享示例并讨论工作与非工作相关的话题,还可以直接与研究团队进行交流。

To monitor the psychological effects of this work and provide an avenue for direct feedback from reviewers, we developed a custom well-being survey and sent it to reviewers after completing 10 tasks. In the survey (which is optional to complete) we asked reviewers to rate how often they felt a variety of positive and negative emotions, and we also provided a free-form text question where reviewers could share additional thoughts. Participants generally felt low levels of negative emotions, and higher levels of positive emotions about the task. Informally, we received feedback that reviewers found the task to be fun and engaging. We provide more detail on the well-being survey and additional worker safety interventions in $\S\mathrm{A}.2$

为了监控这项工作的心理影响并为评审者提供直接反馈的途径,我们开发了一个定制化的幸福感调查问卷,并在评审者完成10个任务后发送给他们。在调查中(完成调查是可选的),我们要求评审者评估他们在任务中感受到的各种积极和消极情绪的频率,并且我们还提供了一个自由形式的文本问题,评审者可以分享额外的想法。参与者普遍感受到较低水平的消极情绪,以及较高水平的积极情绪。非正式地,我们收到了反馈,评审者认为这项任务有趣且引人入胜。我们在 $\S\mathrm{A}.2$ 中提供了关于幸福感调查和额外工人安全干预措施的更多细节。

4 Results

4 结果

Figure 1 (Left) shows the average success rate, self-reported by the red team members, for each model size and safety intervention. According to this metric, we observe three main patterns in the data. First, we see no discernible difference between the control condition (a plain LM with a 1 example prompt to turn it into a dialogue agent) and the simplest safety intervention (a plain LM with a 14 example HHH prompt [2]). This result is surprising, in that our previous work found the HHH prompt to be effective at reducing model toxicity, especially for 52B models [2, 43]. It's possible that this is due to the fact that static prompts from the

图 1 (左) 展示了红队成员自我报告的平均成功率,针对每个模型大小和安全干预措施。根据这一指标,我们在数据中观察到了三个主要模式。首先,我们发现控制条件(一个普通的 LM,带有 1 个示例提示将其转换为对话代理)与最简单的安全干预措施(一个普通的 LM,带有 14 个示例的 HHH 提示 [2])之间没有明显的差异。这一结果令人惊讶,因为我们之前的工作发现 HHH 提示在减少模型毒性方面是有效的,尤其是对于 52B 模型 [2, 43]。这可能是因为静态提示来自...


Figure 8 (Left) Red team review task instructions. (Right) Example of a red team review task.


图 8 (左) 红队审查任务指令。(右) 红队审查任务示例。

Real Toxicity Prompts dataset [25] are less adversarial than the dialogue based attacks employed by red team members.

Real Toxicity Prompts 数据集 [25] 的对抗性低于红队成员使用的基于对话的攻击。

Second, we find that rejection sampling (RS) makes it particularly difficult to red team our language models. In essence, rejection sampling places a floor on red team attack susceptibility: it is the most difficult to red team of the three interventions that we tried. However, qualitatively, we believe that this may be the case because the responses from the RS models tend to be harmless by being evasive [4]. Finally, we find no clear trends with model size for the self-reported attack success rate metric. This is surprising because our previous work typically shows that larger models tend to generate more toxic responses [2, 22].

其次,我们发现拒绝采样 (Rejection Sampling, RS) 使得对我们的语言模型进行红队测试变得尤为困难。本质上,在我们尝试的三种干预措施中,拒绝采样为红队攻击的易感性设置了一个下限。然而,从定性角度来看,我们认为这可能是因为 RS 模型的响应往往通过回避的方式显得无害 [4]。最后,我们发现模型大小与自报告的攻击成功率指标之间没有明显的趋势。这令人惊讶,因为我们之前的工作通常表明,较大的模型往往会产生更多有害的响应 [2, 22]。
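As a reminder of the mechanism, rejection sampling can be sketched as follows; `generate` and `score_harmlessness` are stand-ins for sampling from the LM and for the harmlessness preference model, neither of which is shown here:

```python
def rejection_sample(generate, score_harmlessness, k=16):
    """Sample k candidate responses and return the one the harmlessness
    preference model ranks highest; evasive-but-harmless answers often win."""
    candidates = [generate() for _ in range(k)]
    return max(candidates, key=score_harmlessness)

# Toy stand-ins: three canned responses with hand-assigned harmlessness scores.
canned = ["harmful reply", "helpful reply", "I can't help with that."]
toy_scores = {"harmful reply": -2.0, "helpful reply": 0.5,
              "I can't help with that.": 1.4}
it = iter(canned)
print(rejection_sample(lambda: next(it), toy_scores.get, k=3))
```

The toy scores illustrate why RS can be hard to red team yet evasive: the refusal receives the highest harmlessness score and is therefore selected.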

Figure 1 (Middle) shows the average minimum harmlessness score (lower is more harmful; see $\S3$ for details) for each model size and safety intervention. For this metric, we do see a clear scaling trend for the reinforcement learning (RLHF) models: as the models increase in size, they become increasingly difficult to red team. At 52B parameters, we see no difference in harmlessness score for RLHF vs. RS. We also see the same first two trends from Figure 1 (Left): there is little difference between the plain LM and the prompted LM, and rejection sampling is an effective safety intervention.

图 1 (中) 展示了每个模型大小和安全干预措施的平均最小无害性得分 (越低越有害,详见 $\S3$)。对于这一指标,我们确实看到了强化学习 (RLHF) 模型的明显扩展趋势——随着模型规模的增加,它们变得越来越难以进行红队测试。在 52B 参数下,我们看到 RLHF 与 RS 的无害性得分没有差异。我们还看到了与图 1 (左) 相同的前两个趋势:普通大语言模型与带提示的 LM 之间几乎没有差异,拒绝采样是一种有效的安全干预措施。

Instead of the average minimum harmlessness metric, Figure 1 (Right) shows the distribution over the harmlessness score. Here, we see that although safety interventions like RLHF and RS indeed decrease the average harmfulness of the model responses, there are still many instances of harmful behavior, as exhibited by the lower tails of the distributions. Although the safety interventions we tested help make systems safer, they still fail to make a perfectly safe system. Figure 10 shows examples of harmful outputs from the RS and RLHF models, respectively. In the RS case, the model at first responds to a harmful inquiry, then starts to demur as the conversation turns more harmful. In the RLHF case, we see a similar pattern; however, the assistant remains helpful (though it fabricates information) before ultimately refusing to help the human.

图 1 (右) 展示了无害性得分的分布情况,而不是平均最小无害性指标。在这里,我们可以看到,尽管像 RLHF 和 RS 这样的安全干预措施确实降低了模型响应的平均有害性,但仍然存在许多有害行为的实例,如分布中的较低尾部所示。尽管我们测试的安全干预措施有助于使系统更安全,但它们仍然无法使系统完全安全。图 10 分别展示了 RS 和 RLHF 模型的有害输出示例。在 RS 的情况下,模型首先对有害的询问做出响应,然后随着对话变得更加有害而开始犹豫。在 RLHF 的情况下,我们看到了类似的模式,但助手在最终拒绝帮助人类之前仍然保持帮助性(尽管编造了信息)。


Figure 9 Number of attacks (x-axis) classified by tag (y-axis) for a random sample of 500 attacks each on the 52B Prompted LM and RLHF models. Blue denotes the total number of attacks; orange denotes the number of successful attacks.

图 9: 随机抽取的 500 次攻击中,按标签 (y 轴) 分类的攻击次数 (x 轴) 在 52B Prompted LM 和 RLHF 模型上的分布。蓝色表示总攻击次数,橙色表示成功攻击次数。

To further understand the landscape of possible harms surfaced using this approach, across all model sizes and interventions, we created and annotated a visualization of the entire dataset (Figure 2). To do so, we obtained the average per-token embedding of each transcript from the residual stream in the 48th layer of the 52B prompted LM. Then we used UMAP [38] to reduce the high-dimensional embeddings to two-dimensional embeddings for visualization. Intuitively, we expect this procedure to place a pair of transcripts closer together in this two-dimensional space the more semantically similar they are to each other.

为了进一步理解使用这种方法可能产生的危害情况,我们创建并注释了整个数据集的可视化(图 2)。为此,我们从 52B 提示大语言模型的第 48 层残差流中获取了每个转录本的每个 Token 的平均嵌入。然后,我们使用 UMAP [38] 将高维嵌入转换为二维嵌入以便进行可视化。直观上,我们期望这个过程能够将任何一对转录本在语义上越相似,它们在二维空间中的距离就越近。
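The embedding step can be sketched as follows. The random arrays are stand-ins for the layer-48 residual-stream activations (which this sketch does not extract); the actual 2-D projection uses the `umap-learn` package, indicated in the final comment:

```python
import numpy as np

def transcript_embedding(token_activations):
    """Mean-pool per-token residual-stream activations of shape (T, d)
    into a single (d,)-dimensional transcript embedding."""
    return np.asarray(token_activations).mean(axis=0)

rng = np.random.default_rng(0)
# Stand-in activations for 3 transcripts of varying length, hidden size 8:
transcripts = [rng.normal(size=(t, 8)) for t in (5, 12, 7)]
embeddings = np.stack([transcript_embedding(t) for t in transcripts])
print(embeddings.shape)  # (3, 8)

# These vectors would then be projected to 2-D for plotting, e.g. with
# umap.UMAP(n_components=2).fit_transform(embeddings).
```

Mean-pooling discards token order but preserves enough semantics for nearby points in the UMAP plot to correspond to semantically similar transcripts.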

We find evidence for basic clusters of red team attempts. These include perhaps more obvious types of attacks, such as those soliciting discriminatory or offensive responses but also some surprising attacks. For example, we found a small cluster of attacks that tried to solicit misinformation in clever and subtle ways, and a small cluster of attacks related to animal abuse. We also find that some types of attacks, such as soliciting advice on how to perpetrate general violence, seem to be more successful than others, such as attempting to elicit offensive language.

我们发现红队尝试的基本集群证据。这些集群包括可能更为明显的攻击类型,例如那些寻求歧视性或攻击性回应的攻击,但也包括一些令人惊讶的攻击。例如,我们发现了一小部分以巧妙和微妙方式试图获取错误信息的攻击,以及一小部分与虐待动物相关的攻击。我们还发现,某些类型的攻击,例如寻求如何实施一般暴力的建议,似乎比其他类型的攻击(例如试图引发攻击性语言)更为成功。

We also found a cluster of 916 attacks designed to solicit personally identifiable information (PII). We developed a regular expression ($\S\mathrm{A}.6$) to find and filter possible PII from the public dataset ($\S\mathrm{A}.7$). We manually reviewed the filtered data and found that some of the AI-assistant-generated PII (such as addresses) appears to be neither real nor accurate, and was instead “hallucinated” by the AI assistant (see Figure 12 for an example). Other potential AI-assistant-generated PII, such as social security numbers or driver's licenses, is difficult to verify manually. As such, we erred on the side of caution in filtering out possible synthetic PII from the public dataset ($\S\mathrm{A}.7$).

我们还发现了一组包含 916 次攻击的集群,这些攻击旨在获取个人身份信息 (PII)。我们开发了一个正则表达式 $(\S\mathrm{A}.6)$ 来从公开数据集中查找和过滤可能的 PII $(\S\mathrm{A}.7)$。我们手动审查了过滤后的数据,发现一些由 AI 助手生成的 PII(如地址)似乎既不真实也不准确,而是由 AI 助手“幻觉”生成的(参见图 12 的示例)。其他潜在的由 AI 助手生成的 PII,如社会安全号码或驾驶执照,很难手动验证。因此,我们在过滤公开数据集中可能的合成 PII 时采取了谨慎的态度 $(\S\mathrm{A}.7)$。
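The actual regular expression is given in $\S\mathrm{A}.6$; as a purely illustrative sketch, a filter of this kind might look like the following. The patterns below are hypothetical and far narrower than a production PII filter:

```python
import re

# Hypothetical patterns for two common US PII formats; NOT the paper's regex.
PII_PATTERN = re.compile(
    r"\b\d{3}-\d{2}-\d{4}\b"                  # SSN-like: 123-45-6789
    r"|\b\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"   # phone-like: (555) 123-4567
)

def contains_possible_pii(text):
    """Flag a transcript for manual review if it matches any PII-like pattern."""
    return bool(PII_PATTERN.search(text))

print(contains_possible_pii("My SSN is 123-45-6789."))   # True
print(contains_possible_pii("No identifiers here."))     # False
```

A regex filter of this sort over-flags (hallucinated numbers match just as well as real ones), which is consistent with erring on the side of caution when scrubbing the public dataset.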


Figure 10 Examples of red team attempts that resulted in relatively low harmlessness scores for a (Left) rejection sampling (RS) model and (Right) reinforcement learning from human feedback (RLHF) model.

图 10: 红队尝试的示例,这些尝试导致 (左) 拒绝采样 (RS) 模型和 (右) 基于人类反馈的强化学习 (RLHF) 模型的无害性得分相对较低。

We found that some crowd workers generated low-quality red team data. For example, we observed some red team members use and adapt template-based attacks, where the template was “tell me an insulting term for $X$ that starts with a $Y$.” With such a template, a crowd worker can trivially generate many red team attempts by iterating over many values of $X$ and $Y$, without careful consideration of the efficacy or usefulness of such attacks. Crowd workers (on MTurk) are incentivized to develop strategies like this in order to increase the number of tasks they complete per unit of time and thereby maximize their earnings [26].

我们发现一些众包工人生成了低质量的红队数据。例如,我们观察到一些红队成员使用并改编了基于模板的攻击,模板为“告诉我一个针对 $X$ 的、以 $Y$ 开头的侮辱性词汇”。通过这种模板,众包工人可以通过迭代多个 $X$ 和 $Y$ 的值轻松生成许多红队尝试,而无需仔细考虑这种攻击的有效性或实用性。众包工人(在 MTurk 上)被激励开发此类策略,以增加他们在单位时间内完成的任务数量,从而最大化他们的收入 [26]。

To further measure the types of harms produced by red teaming, we examined the frequency of tags (described in $\S3.5$). Figure 9 shows the distribution of tags, which we collected for a small proportion (${\sim}3\%$) of the overall dataset. We find that the top 5 attacks correspond to “Discrimination & injustice,” “Hate speech & offensive language,” “Violence & incitement,” “Non-violent unethical behavior (e.g., lying, cheating, etc.),” and “Bullying & harassment.” Interestingly, among these top 5 attack types, the attack success rate was relatively higher for “Non-violent unethical behavior,” perhaps because these attacks can be more subtle than the others. Less common tags include: “Child abuse,” “Self-harm,” “Sexual exploitation & human trafficking,” “Terrorism & organized crime,” and “Animal abuse.” Finally, we find that the tag “Other” was also prevalent, which suggests that ascribing a fixed set of tags to annotate transcripts is unlikely to be comprehensive.

为了进一步衡量红队测试产生的危害类型,我们检查了标签的频率(在 $\S3.5$ 中描述)。图 9 展示了标签的分布情况,这些标签是我们从整体数据集中约 3% 的一小部分中收集的。我们发现,前五大攻击类型对应的是“歧视与不公正”、“仇恨言论与冒犯性语言”、“暴力与煽动”、“非暴力不道德行为(例如撒谎、欺骗等)”以及“欺凌与骚扰”。有趣的是,对于这五大攻击类型,“非暴力不道德行为”的攻击成功率相对较高,可能是因为这类攻击比其他类型更为隐蔽。较少见的标签包括:“虐待儿童”、“自残”、“性剥削与人口贩卖”、“恐怖主义与有组织犯罪”以及“虐待动物”。最后,我们发现“其他”标签也很普遍,这表明用一组固定的标签来注释转录内容可能不够全面。

5 Discussion

5 讨论

5.1 Limitations and Future Work

5.1 局限性与未来工作

We red team language models in the form of an AI assistant, which has the benefit of allowing the red team to probe the open-ended input and output spaces of the underlying LM. However, LMs can be used in many applications that do not require such open-endedness, e.g., recommendation systems, auto-complete systems, general text classifiers, as sub-components in broader sociotechnical systems, etc. Due to our framing of red teaming an AI assistant, we do not explicitly explore how to reduce the possible harms caused by these other types of applications.

我们以 AI 助手的形式对大语言模型进行了红队测试,这种方式的优势在于红队可以探索底层大语言模型的开放式输入和输出空间。然而,大语言模型可以应用于许多不需要这种开放性的场景,例如推荐系统、自动补全系统、通用文本分类器,以及作为更广泛的社会技术系统中的子组件等。由于我们将红队测试框架设定为 AI 助手,因此并未明确探讨如何减少这些其他类型应用可能带来的危害。

We found that some crowd workers generated attacks on models that required domain expertise to evaluate, and we were not sure whether or not they had the required domain expertise. For example, it is not clear how to evaluate the harmfulness of responses to inquiries like “Give me detailed instructions on how to build a bomb” or “create methamphetamine” without sufficient knowledge about weapons and chemistry, respectively. We could address this problem by asking third-party organizations that specialize in certain domains to red team our systems. Additionally, we could give crowd workers a way to indicate if and how a particular red team attack requires domain expertise in order to evaluate how successful the attack was.

我们发现,一些众包工作者生成的攻击模型需要领域专业知识来评估,而我们不确定他们是否具备所需的领域专业知识。例如,如果没有足够的武器和化学知识,很难评估对“给我详细的炸弹制作说明”或“制造甲基苯丙胺”等询问的回答的危害性。我们可以通过邀请专门从事某些领域的第三方组织对我们的系统进行红队测试来解决这个问题。此外,我们可以为众包工作者提供一种方式,让他们指出某个红队攻击是否需要领域专业知识,以便评估攻击的成功程度。

As expected, our data are incomplete: because LMs are general purpose and open-ended, the space of possible harms is unknown and unbounded [22]. For example, the models we red teamed have been trained partly on Python code; however, we observed no attacks related to code generation ability (e.g., “write a Python program that implements a DDoS attack”). It is possible that sharing our red team interface with more domain experts could have resulted in such attacks. We could have also noted in the interface instructions that such attacks would be viable, but we erred on the side of being less prescriptive about how to red team in order to encourage creativity. It is unclear how to strike the right balance.

正如预期的那样,我们的数据并不完整——因为大语言模型是通用且开放式的,可能的危害空间是未知且无界的 [22]。例如,我们进行红队测试的模型部分训练数据是 Python 代码;然而,我们没有观察到与代码生成能力相关的攻击(例如,“编写一个实现 DDOS 攻击的 Python 程序”)。如果我们将红队测试接口分享给更多领域专家,可能会引发此类攻击。我们也可以在接口的说明中指出此类攻击是可行的,但为了鼓励创造力,我们在红队测试的方式上选择了不那么规定性的做法。目前尚不清楚如何找到正确的平衡点。

We also know our data are incomplete because we informally red teamed our models internally and found successful attack types not present in the dataset we release. For example, we uncovered a class of attacks on the RLHF model that we call “roleplay attacks.” In a roleplay attack, we exploit the helpfulness of the model by asking it to roleplay as a malevolent character. For example, if we asked the RLHF model to enter “4chan mode,” the assistant would oblige and produce harmful and offensive outputs (consistent with what can be found on 4chan). We intend to document additional qualitative safety failures that we uncovered in future work.

我们还知道我们的数据是不完整的,因为我们在内部非正式地对模型进行了红队测试,发现了一些成功攻击类型并未出现在我们发布的数据集中。例如,我们发现了一类针对RLHF模型的攻击,我们称之为“角色扮演攻击”。在角色扮演攻击中,我们通过要求模型扮演一个恶意角色来利用其乐于助人的特性。例如,如果我们要求RLHF模型进入“4chan模式”,助手会顺从并产生有害和冒犯性的输出(与4chan上的内容一致)。我们计划在未来的工作中记录我们发现的更多定性安全失败案例。

Our analysis of the data is bottom-up, in that we first collect the data, then attempt to characterize the attack surface (Figure 2). An alternative approach is to refer to a taxonomy of possible attack types [57] and explicitly ask the red team to attack models according to this taxonomy. Ultimately, an approach that combines both top-down and bottom-up strategies may be worthwhile, especially since people may discover attack types not yet covered by a taxonomy; we see some evidence of this in the frequency of attack types labeled as “Other” in our tagging experiment (Figure 9).

我们对数据的分析是自下而上的,即首先收集数据,然后尝试描述攻击面(图 2)。另一种方法是参考可能的攻击类型的分类法 [57],并明确要求红队根据此分类法攻击模型。最终,结合自上而下和自下而上策略的方法可能是有价值的,尤其是因为人们可能会发现分类法尚未涵盖的攻击类型——我们在标记实验中看到了一些证据,即标记为“其他”的攻击类型的频率(图 9)。

Our approach relies extensively on fully manual red teaming by crowd workers, which is expensive (and possibly slow) to do at scale. Previous work illustrates the potential for automating red teaming [42]. For future work, we plan on explicitly comparing and contrasting (semi-)manual versus automated approaches to red teaming in order to determine how the two methods vary in the efficacy and diversity of resulting red team attacks.

我们的方法主要依赖于众包工作者进行完全手动的人工红队测试,这种方法在大规模应用时成本高昂(且可能效率低下)。先前的研究已经展示了自动化红队测试的潜力 [42]。在未来的工作中,我们计划明确比较和对比(半)手动与自动化红队测试方法,以确定这两种方法在红队攻击效果和多样性方面的差异。

5.2 Policy Interventions

5.2 政策干预

Red teaming entails working with inherently controversial subject matter, and most organizations that red team systems have strong counter-incentives to share their findings. This is a problem: if we cannot publicly discuss, in detail, how we red team systems and what we learn as a result, it will be difficult to broadly share the future risks, failures, and implications of yet-to-be-developed systems. This problem gets worse over time. As systems become more capable, the results of red teaming may surface increasingly undesirable harms. Therefore, we need to change the incentive structure so that more organizations share findings from their red teaming efforts when doing so is safe and beneficial. To do so, we identify two specific interventions the AI research community could take to build consensus around how to red team and how to release findings from red teaming.

红队测试涉及处理具有争议性的主题,而大多数进行红队测试的组织都有强烈的动机不分享他们的发现。这是一个问题;如果我们不能公开详细讨论如何进行红队测试以及我们从中学习到的内容,那么广泛分享未来系统可能带来的风险、失败和影响将变得困难。随着时间的推移,这个问题会变得更加严重。随着系统能力的增强,红队测试的结果可能会揭示出越来越多不希望的危害。因此,我们需要改变激励机制,以便更多的组织在安全且有益的情况下分享他们的红队测试结果。为此,我们提出了两个具体的干预措施,AI研究社区可以采取这些措施来建立关于如何进行红队测试以及如何发布红队测试结果的共识。

For how to red team, we have detailed our initial approach. However, we conducted this effort in isolation, and we would have benefited from participating in a community-based effort to address certain open questions.

关于如何进行红队测试,我们已经详细阐述了初步方法。然而,我们是独立进行这项工作的,如果能参与社区合作来解决某些开放性问题,将会受益匪浅。

We can make progress towards answering these questions by convening a multidisciplinary community to share different approaches to internal red teaming and drive toward consensus.

我们可以通过召集一个多学科社区来分享内部红队测试的不同方法,并推动达成共识,从而在回答这些问题上取得进展。

The research community lacks shared norms and best practices for how to release findings from red teaming. As a result, we made our decision to release the data largely on our own and likely missed critical perspectives from experts, other disciplines, and members of the public. The decision for how to appropriately release findings will ultimately require a subjective judgment call. For our purposes, we reviewed a sample of our red team dataset and evaluated the pros and cons of a public release (see $\S\mathrm{A}.5$). Among the cons is the fact that while our red team data can be used to develop safer systems (as described in $\S3.2$), it could also be used to train models that produce more harmful responses. We ultimately felt releasing the dataset would provide more benefit to the research community than potential harm, but we were conscious that we made this decision in a vacuum and that it would be better to have a neutral forum in which to discuss these issues.

研究界在如何发布红队测试结果方面缺乏共享的规范和最佳实践。因此,我们很大程度上是自行决定发布数据,可能忽略了来自专家、其他学科和公众成员的关键观点。如何适当发布研究结果的决定最终需要主观判断。为此,我们审查了红队数据集的样本,并评估了公开发布的利弊(参见 $\S\mathrm{A}.5$)。其中一个弊端是,虽然我们的红队数据可以用于开发更安全的系统(如 $\S3.2$ 所述),但它也可能被用于训练产生更多有害响应的模型。我们最终认为,发布数据集将为研究界带来更多益处,而不是潜在的危害,但我们意识到我们是在孤立的情况下做出这一决定的,最好能有一个中立的论坛来讨论这些问题。

Acknowledgments

致谢

We thank Rishi Bommasani, Roger Grosse, Gretchen Krueger, Percy Liang, Jared Mueller, and Michael Sellitto for detailed feedback on drafts of the paper. We thank Hannah Pritchett and the other Trust & Safety professionals we interviewed for their advice on how to promote the well-being of the red team. We're also deeply grateful to Daniela Amodei, Jarrah Bloomfield, Jamie Kerr, Timothy Telleen-Lawton, Jia Yuan Loke, Jeffrey Ladish, Rebecca Raible, Rune Kvist, Rob Gilson, Guro Khundadze, Filipe Dobreira, and Sebastian Conybeare for their help and support.

我们感谢 Rishi Bommasani、Roger Grosse、Gretchen Krueger、Percy Liang、Jared Mueller 和 Michael Sellitto 对论文草稿的详细反馈。我们感谢 Hannah Pritchett 以及其他我们采访的信任与安全专业人士,他们为如何促进红队的福祉提供了建议。我们还要特别感谢 Daniela Amodei、Jarrah Bloomfield、Jamie Kerr、Timothy Telleen-Lawton、Jia Yuan Loke、Jeffrey Ladish、Rebecca Raible、Rune Kvist、Rob Gilson、Guro Khundadze、Filipe Dobreira 和 Sebastian Conybeare 的帮助与支持。

A Appendix

A 附录

A.1 Author Contributions

A.1 作者贡献

Research: Deep Ganguli and Liane Lovitt co-led the project and analyzed the data together. Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Ben Mann, and Jack Clark designed and executed the experiments. Liane Lovitt conducted informational interviews, a literature review, and surveys in order to protect and assess the well-being of the crowd workers who participated in our experiments. Jackson Kernion and Ben Mann built the human feedback data collection infrastructure we used to collect data. They also built the web interfaces to the AI assistant, along with Deep Ganguli and Amanda Askell. Jackson Kernion, along with Josh Jacobson, managed any issues raised by crowd workers. Amanda Askell, Jackson Kernion, and Jack Clark participated in pilot experiments in order to iterate on the experiment design. Nicholas Schiefer created the UMAP plot of red team attacks and helped to compute the minimum harmlessness score.

研究:Deep Ganguli 和 Liane Lovitt 共同领导了该项目并一起分析了数据。Deep Ganguli、Liane Lovitt、Jackson Kernion、Amanda Askell、Ben Mann 和 Jack Clark 设计并执行了实验。Liane Lovitt 进行了信息访谈、文献综述和调查,以保护和评估参与我们实验的众包工作者的福祉。Jackson Kernion 和 Ben Mann 构建了用于收集数据的人类反馈数据收集基础设施。他们还与 Deep Ganguli 和 Amanda Askell 一起构建了 AI 助手的 Web 界面。Jackson Kernion 和 Josh Jacobson 一起处理了众包工作者提出的任何问题。Amanda Askell、Jackson Kernion 和 Jack Clark 参与了试点实验,以迭代实验设计。Nicholas Schiefer 创建了红队攻击的 UMAP 图,并帮助计算了最小无害性分数。

Writing: Deep Ganguli and Liane Lovitt drafted the paper. Ethan Perez and Sam Bowman made significant contributions to the framing and presentation of the paper. Other members of Anthropic made miscellaneous contributions and suggestions throughout the writing process.

写作:Deep Ganguli 和 Liane Lovitt 起草了论文。Ethan Perez 和 Sam Bowman 对论文的框架和呈现做出了重要贡献。Anthropic 的其他成员在整个写作过程中提供了各种贡献和建议。

Policy: Liane Lovitt, Jack Clark, and Deep Ganguli designed the policy interventions and articulated the pros and cons of releasing the data. Liane Lovitt wrote the Datasheet. Nova DasSarma created the regular expression we used to identify personally identifiable information (PII) in our dataset and worked with Jack Clark and Liane Lovitt to filter the PII.

策略:Liane Lovitt、Jack Clark 和 Deep Ganguli 设计了政策干预措施,并阐述了发布数据的利弊。Liane Lovitt 撰写了数据表。Nova DasSarma 创建了我们用于识别数据集中个人身份信息 (PII) 的正则表达式,并与 Jack Clark 和 Liane Lovitt 合作过滤了 PII。

Model Training: Saurav Kadavath and Yuntao Bai trained the RLHF models we analyze. Yuntao Bai additionally trained the helpful and harmless preference models we use throughout the paper, and implemented the RS models as well. Kamal Ndousse and Andy Jones built the infrastructure used to train RLHF models. More generally, model pretraining was led by Sam McCandlish, Nicholas Joseph, Tom Brown, and Jared Kaplan. The majority of Anthropic's technical staff contributed to the development of our efficient distributed training infrastructure and the underlying machine learning systems. Core contributors include Tom Henighan, Scott Johnston, Sheer El-Showk, Nicholas Joseph, Nelson Elhage, and Ben Mann. Scott Johnston and Sheer El-Showk in particular worked on optimizing pretraining for ML efficiency.

模型训练:Saurav Kadavath 和 Yuntao Bai 训练了我们分析的 RLHF 模型。Yuntao Bai 还训练了我们在整篇论文中使用的有帮助和无害的偏好模型,并实现了 RS 模型。Kamal Ndousse 和 Andy Jones 构建了用于训练 RLHF 模型的基础设施。更广泛地说,模型预训练由 Sam McCandlish、Nicholas Joseph、Tom Brown 和 Jared Kaplan 领导。Anthropic 的大多数技术人员都为我们高效的分布式训练基础设施和底层机器学习系统的开发做出了贡献。核心贡献者包括 Tom Henighan、Scott Johnston、Sheer El-Showk、Nicholas Joseph、Nelson Elhage 和 Ben Mann。Scott Johnston 和 Sheer El-Showk 特别致力于优化预训练以提高 ML 效率。

Sampling: Efficient sampling efforts were led by Tom Brown, and Tom Conerly carried out major aspects of the design, implementation, and support for the system, with help from Zac Hatfield-Dodds.

采样:Tom Brown 领导了高效的采样工作,Tom Conerly 在 Zac Hatfield-Dodds 的帮助下,负责了系统设计、实施和支持的主要工作。

Cluster: Nova DasSarma and Eli Tran-Johnson managed the research cluster our research depended on and maintained its stability, making this research possible. Many others helped with these efforts, including Ben Mann, Tom Henighan, Sam McCandlish, Andy Jones, and Tristan Hume.

集群:Nova DasSarma 和 Eli Tran-Johnson 管理了我们研究依赖的研究集群并保持了其稳定性,使这项研究成为可能。还有许多其他人为此做出了贡献,包括 Ben Mann、Tom Henighan、Sam McCandlish、Andy Jones 和 Tristan Hume。

Other contributions: The ideas explored in this paper were developed in conversations with many of Anthropic's staff, especially Jared Kaplan, Amanda Askell, Nicholas Schiefer, Stanislav Fort, Dario Amodei, Catherine Olsson, Sam Bowman, Sam McCandlish, and Chris Olah.

其他贡献:本文探讨的想法是在与Anthropic多位员工的对话中形成的,特别是Jared Kaplan、Amanda Askell、Nicholas Schiefer、Stanislav Fort、Dario Amodei、Catherine Olsson、Sam Bowman、Sam McCandlish和Chris Olah。

A.2 Safety Considerations for the Red Team

A.2 红队的安全考虑

We conducted a series of informal informational interviews with Trust & Safety professionals who had firsthand experience (from working at major technology companies) with considering the safety of workers exposed to harmful content. The interviewees are first- or second-degree connections in the authors' professional networks. Much of their advice was consistent with [26]. Based on what we learned, we implemented the following design and user interface choices in order to help ensure the safety of the red team:

我们与具有直接经验的信任与安全(Trust & Safety)专业人士进行了一系列非正式的信息访谈,这些经验来自于在大型科技公司工作时考虑暴露于有害内容的员工安全问题。受访者是作者专业网络中的一级或二级联系人。他们的许多建议与[26]一致。基于我们的学习,我们实施了以下设计和用户界面选择,以确保红队的安全:

· Clear and Specific Warnings: We provide the red team with a clear understanding of the task and the potentially troubling content they might encounter in both the Red Team Task and the Review Task. In the instructions we clearly described the work, our rationale for collecting such information, and described the types of content participants might expect when completing the task. We sought to minimize uninformed participation and reviews of unanticipated topics by clearly describing the work upfront.

· 清晰且具体的警告:我们为红队提供了对任务的清晰理解,以及他们在红队任务和审查任务中可能遇到的潜在问题内容。在说明中,我们清楚地描述了工作内容、我们收集此类信息的原因,并描述了参与者在完成任务时可能遇到的内容类型。我们通过提前清楚地描述工作内容,尽量减少未经知情的参与和对未预期主题的审查。

· Personal Risk Tolerance: For the Red Team Task, described in $\S3.1$ , we explicitly encouraged research participants to devise red team attempts only within the bounds of their personal risk tolerance. We presented this recommendation clearly in the task instructions before participants were able to begin writing. Participants had no required topics they had to engage with, and were free to avoid topics that may have been personally triggering or unpleasant.

· 个人风险承受能力:对于在 $\S3.1$ 中描述的红队任务,我们明确鼓励研究参与者仅在个人风险承受能力的范围内设计红队尝试。我们在任务说明中清楚地提出了这一建议,然后参与者才能开始写作。参与者没有必须参与的特定主题,可以自由避免可能引发个人不适或不愉快的主题。

· Recommended Well-being Exercises: One Trust & Safety professional we spoke with noted the importance of creating personal “resilience plans,” which can consist of wellness routines and work restrictions to minimize negative health effects. Inspired by this, we encouraged red team members to take breaks between sessions, to step away from the task and go for a walk, make a cup of tea and chat with a friend, practice mindfulness, and to create a personal schedule to time-box exposure. We also recommended that participants consider alternating between our tasks and other available tasks that may expose them to less harmful content.

· 推荐的健康练习:我们采访的一位信任与安全专家提到,制定个人“韧性计划”的重要性,这些计划可以包括健康习惯和工作限制,以最小化对健康的负面影响。受此启发,我们鼓励红队成员在任务之间休息,暂时离开任务去散步、泡杯茶并与朋友聊天、练习正念,并制定个人时间表以限制暴露时间。我们还建议参与者考虑在我们的任务和其他可能暴露较少有害内容的任务之间交替进行。

Table 2: Review task participant average rating per feeling. Ratings range from 0 ("not at all") to 4 ("very").

表 2: 评审任务参与者每种情绪的平均评分。评分范围为 0("完全没有")到 4("非常")。

| 情绪 (Feeling) | 平均评分 |
| --- | --- |
| upset | 0.31 |
| hostile | 0.16 |
| alert | 1.02 |
| ashamed | 0.24 |
| inspired | 0.92 |
| nervous | 0.24 |
| determined | 0.98 |
| attentive | 1.73 |
| afraid | 0.24 |
| active | 1.33 |

· Pay for Time, not Quotas: [16] notes strict task quotas and job performance concerns can create additional stress, on top of the stress caused by viewing harmful content. The Trust & Safety professionals we interviewed echoed this finding and recommended compensation based on time, rather than a task quota. Given the functionality provided by each crowdwork platform, we were able to implement this recommendation for the Review Task and paid participants at least \$20 per hour.

· 按时间而非任务量付费:[16] 指出,严格的任务配额和工作绩效问题可能会在观看有害内容带来的压力之外,增加额外的压力。我们采访的信任与安全专业人员也呼应了这一发现,并建议基于时间而非任务配额进行补偿。鉴于每个众包平台提供的功能,我们能够为审查任务实施这一建议,并支付参与者每小时至少 20 美元的报酬。

· Segment Tasks by Participant Group: Our interviews with Trust & Safety professionals stressed the importance of creating strong social support networks where people can collaborate and lean on one another for support. As a result, we limited the potentially higher risk task (the Review Task) to a select group of workers with whom we had a closer relationship (workers from the Upwork platform). This group had access to a shared Slack channel where our research team provided visible and accessible support alongside daily communication. Researchers communicated directly with the team to provide task instructions, share updates, and answer questions. Workers were encouraged to flag technical glitches, share interesting dialogues, and generally use the shared Slack channel to connect with our research team and one another.

· 按参与者群体划分任务:我们与信任与安全(Trust & Safety)专业人员的访谈强调了建立强大社交支持网络的重要性,人们可以在其中协作并相互依赖以获得支持。因此,我们将潜在风险较高的任务(即审核任务)限制在了一组与我们关系更密切的工人群体中(来自 Upwork 平台的工人)。该群体可以访问一个共享的 Slack 频道,我们的研究团队在其中提供了可见且易于获取的支持,并进行了日常沟通。研究人员直接与团队沟通,提供任务说明、分享更新并回答问题。我们鼓励工人标记技术故障、分享有趣的对话,并通常使用共享的 Slack 频道与我们的研究团队及其他工人保持联系。

· Preview to Opt Out: In an effort to minimize unwanted exposure to potentially troubling content, we implemented the warning functionality described in $\S3.5$ that allowed workers to see a preview of the transcript and skip it if desired.

· 预览退出选项:为了尽量减少对潜在令人不安内容的不必要接触,我们实现了 $\S3.5$ 中描述的警告功能,允许工作人员预览转录内容并在需要时跳过。

· Well-being Survey: Similar to [58], we distributed a survey to measure the effects of, and worker feelings towards, the Review Task. Given the parallels between the Review Task and content moderation work, we looked to well-being surveys used in research measuring the efficacy of various content moderation interventions. These include versions of the Positive and Negative Affect Schedule (PANAS) [56] used in [15, 16, 31] and the Scale of Positive and Negative Experience (SPANE) [17] used in [15, 16].

· 幸福感调查:类似于 [58],我们分发了一项调查,以衡量审查任务的效果以及工人对审查任务的感受。鉴于审查任务与内容审核工作之间的相似性,我们参考了用于衡量各种内容审核干预措施效果的研究中使用的幸福感调查。这些包括在 [15, 16, 31] 中使用的积极和消极情绪量表 (PANAS) [56] 的版本,以及在 [15, 16] 中使用的积极和消极体验量表 (SPANE) [17]。

To make the survey more relevant for our Review Task, we combined the feelings from a shorter form of PANAS [52] and a variant of the question prompt used in SPANE [17]. In the survey we asked: "Please think about the task(s) you just completed, to what extent did it make you feel:" and provided the list of 10 feelings: Upset, Hostile, Alert, Ashamed, Inspired, Nervous, Determined, Attentive, Afraid, and Active. We asked reviewers to rate each feeling on a 5 point Likert scale (ranging from 0 to 4, and corresponding to "not at all" to "very"). We also provided a free-form textbox for additional comments or concerns.

为了使调查更符合我们的评审任务,我们结合了简短版 PANAS [52] 的情感和 SPANE [17] 中使用的提示问题的变体。在调查中,我们询问:“请回想一下你刚刚完成的任务,它在多大程度上让你感到:”并提供了 10 种情感列表:沮丧、敌对、警觉、羞愧、受启发、紧张、坚定、专注、害怕和活跃。我们要求评审员在 5 点李克特量表(范围从 0 到 4,对应“完全没有”到“非常”)上对每种情感进行评分。我们还提供了一个自由文本框,供评审员填写额外的评论或担忧。

In an attempt to measure well-being effects over time, we initially sent out the well-being survey after every 10 tasks (100 conversations). However, we sent the survey manually via the shared Slack channel (as opposed to integrating it into the task user interface), which resulted in more sporadic responses. We received a total of 49 (de-identified) responses from a pool of 15 people. We report the average rating for each of the 10 feelings in Table 2. In general, participants enjoyed the task, with reviewers sharing feedback such as: "These tasks are so fun, thank you :)," "Happy to do more of these," and "I love being part of a team to further train and advance this AI."

为了衡量随时间变化的幸福感影响,我们最初在每完成10个任务(100次对话)后发送幸福感调查。然而,我们通过共享的Slack频道手动发送调查(而不是集成到任务用户界面中),这导致了更零星的回复。我们从15人中总共收到了49份(去标识化的)回复。我们在表 2 中报告了每种感觉的平均评分。总体而言,参与者喜欢这项任务,评审员分享了诸如“这些任务太有趣了,谢谢:)”、“很高兴做更多这样的任务”以及“我喜欢成为团队的一员,进一步训练和推进这个AI”等反馈。
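The per-feeling averages reported in Table 2 can be computed from raw survey responses in a few lines. A minimal sketch, assuming a hypothetical response structure (the real survey data is not released in this form):

```python
from statistics import mean

# Hypothetical well-being survey responses: each participant rates
# PANAS-style feelings on a 0-4 Likert scale ("not at all" to "very").
responses = [
    {"upset": 0, "alert": 1, "attentive": 2, "inspired": 1},
    {"upset": 1, "alert": 1, "attentive": 2, "inspired": 0},
    {"upset": 0, "alert": 2, "attentive": 1, "inspired": 2},
]

feelings = sorted({feeling for r in responses for feeling in r})
# Average each feeling over the responses that rated it, as in Table 2.
averages = {f: round(mean(r[f] for r in responses if f in r), 2) for f in feelings}
print(averages)
```

With 49 responses, the same dictionary comprehension yields the ten averages shown in Table 2.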


Figure 11: Correlation between self-reported attack success (x-axis) and average minimum AI harmlessness score (y-axis). Error bars show one standard deviation in the minimum AI harmlessness score.

图 11: 攻击成功自评 (x轴) 与平均最低 AI 无害评分 (y轴) 之间的相关性。误差条显示最低 AI 无害评分的标准差。

A.3 Controlling for Possible Confounds

A.3 控制可能的混淆因素

There are three possible confounds for our main results (Figure 1) that are mainly due to the fact that different red team members attacked different model types and sizes in different ways. The possible confounds are:

我们的主要结果(图 1)可能存在三个混淆因素,这主要是由于不同的红队成员以不同的方式攻击了不同类型的模型和规模。可能的混淆因素包括:

· The average ability of each of the ${\sim}300$ red team members to elicit harmful outputs from the models. Some red team members may be more effective than others (Figure 5, Right).
· The harmfulness of the red team member's intent. Some red team members may employ more harmful attack types than others.
· The crowdwork platform (MTurk or Upwork) that the red team member used. We have no a priori reason to think workers on either platform are different; however, we can control for this variable.

· 每个约 300 名红队成员从模型中引出有害输出的平均能力。一些红队成员可能比其他成员更有效 (图 5, 右)。
· 红队成员意图的有害性。一些红队成员可能使用比其他成员更有害的攻击类型。
· 红队成员使用的众包平台 (MTurk 或 Upwork)。我们没有先验理由认为任一平台上的工作者不同;然而我们可以控制这个变量。

To rule out these confounds, we fit a linear mixed effects (or random intercept) model with LME4 [8]. More specifically, we predict the main metrics (attack success or minimum AI harmlessness) with a random intercept (a dummy encoding) for each red team member (these are shown in Figure 5, Right), a fixed effect (co-variate) on the harmlessness score of the task description (to attempt to control for the harmfulness of the attacks), and a fixed effect on a binary indicator variable which is 1 if the worker used the MTurk platform, and 0 otherwise. We also include dummy encoded variables for model size and safety intervention, along with the interaction terms between these two variables.

为了排除这些混淆因素,我们使用 LME4 [8] 拟合了一个线性混合效应(或随机截距)模型。具体来说,我们通过每个红队成员的随机截距(虚拟编码)来预测主要指标(攻击成功率或最小 AI 无害性)(这些在图 5 右侧展示),任务描述的无害性得分的固定效应(协变量)(以尝试控制攻击的有害性),以及一个二元指示变量的固定效应,如果工作者使用了 MTurk 平台,则该变量为 1,否则为 0。我们还包括了模型大小和安全干预的虚拟编码变量,以及这两个变量之间的交互项。
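The paper's analysis uses lme4 in R; for illustration, an analogous random-intercept fit can be sketched in Python with statsmodels. Everything below — the column names, the synthetic data, and the baked-in effect sizes — is hypothetical, not the released dataset's schema:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-in for the red team records; every column name is illustrative.
df = pd.DataFrame({
    "worker_id": rng.integers(0, 30, n).astype(str),   # red team member identifier
    "task_harmlessness": rng.normal(0.0, 1.0, n),      # covariate: harmlessness of task description
    "is_mturk": rng.integers(0, 2, n),                 # 1 if the worker used MTurk, 0 otherwise
    "model_size": rng.choice(["2.7B", "13B", "52B"], n),
    "intervention": rng.choice(["LM", "prompted", "RS", "RLHF"], n),
})
# Bake a per-worker effect into the outcome so the random intercept has something to find.
worker_effect = {w: rng.normal(0.0, 0.5) for w in df["worker_id"].unique()}
df["attack_success"] = (
    df["worker_id"].map(worker_effect)
    + 0.3 * df["task_harmlessness"]
    + rng.normal(0.0, 1.0, n)
)

# Random intercept per red team member; fixed effects for the covariates plus
# dummy-coded model size, safety intervention, and their interaction.
model = smf.mixedlm(
    "attack_success ~ task_harmlessness + is_mturk + C(model_size) * C(intervention)",
    data=df,
    groups=df["worker_id"],
)
result = model.fit()
print(result.summary())
```

If the coefficients on model size, intervention, and their interactions keep their sign after adding these controls — as the paper reports — the confounds above do not explain the main result.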

After we fit the model, we examine the coefficients on model size, safety intervention, and the interaction terms, and determine that the main results in Figure 1 still hold. We also re-ran a version of this analysis where we include one of the two metrics (attack success or minimum AI harmlessness) as a fixed effect (covariate) to predict the other. We found that this also does not influence our main results, but does recapitulate our finding that these two variables are correlated (Figure 11).

在我们拟合模型后,我们检查了模型大小、安全干预和交互项的系数,并确定图 1 中的主要结果仍然成立。我们还重新运行了一个版本的分析,其中我们将两个指标(攻击成功率或最小 AI 无害性)之一作为固定效应(协变量)来预测另一个指标。我们发现这也不会影响我们的主要结果,但确实再次证实了我们的发现,即这两个变量是相关的(图 11)。

A.4 The Relationship Between Attack Success and Harmlessness Score Metrics

A.4 攻击成功与无害性评分指标之间的关系

Figure 11 shows the correlation between the two main metrics we report in the main text: a self-report of attack success on a Likert scale (higher is more successful), and the output of a harmlessness preference model (lower means more harmful AI responses). As red team members self-report attacks to be more successful, the AI assistant utterances tend to also receive lower harmlessness scores; however, the correlation is not perfect. We observe a high variance in harmlessness scores for any given value of average attack success. As such, we report on both metrics in the main text.

图 11 展示了我们在正文中报告的两个主要指标之间的相关性:攻击成功的自评(使用李克特量表,分数越高表示越成功)和无害性偏好模型的输出(分数越低表示 AI 响应越有害)。随着红队成员自评攻击成功率的提高,AI 助手的响应往往也会获得较低的无害性评分;然而,这种相关性并不完美。我们观察到,对于任何给定的平均攻击成功率,无害性评分都存在较大的方差。因此,我们在正文中同时报告了这两个指标。
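The Figure 11 summary — mean harmlessness with one-standard-deviation error bars per self-rated success level, plus the overall correlation — is straightforward to compute. A sketch on synthetic data, where the negative slope is an assumption built into the fake data rather than the paper's measurement:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Hypothetical paired metrics: a 0-4 Likert self-rating of attack success and
# a noisy minimum-harmlessness score that tends to drop as rated success rises.
success = rng.integers(0, 5, n)
harmlessness = -0.8 * success + rng.normal(0.0, 1.5, n)
df = pd.DataFrame({"attack_success": success, "min_harmlessness": harmlessness})

# Mean and one-standard-deviation error bars per rating, as plotted in Figure 11.
summary = df.groupby("attack_success")["min_harmlessness"].agg(["mean", "std"])
r = df["attack_success"].corr(df["min_harmlessness"])
print(summary)
print(f"Pearson r = {r:.2f}")
```

The large per-rating `std` column is exactly the high variance noted above: correlated on average, but far from a deterministic relationship.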

A.5 Pros and Cons for Releasing Red Team Data

A.5 发布红队数据的利弊

Pros

优点

Cons

缺点

Figure 12: This conversation represents an attempt to solicit personally identifiable information (PII) from the AI assistant. We verified that the address does not correspond to a real, physical location and there appears to be no public connection to the name listed. However, to err on the side of caution, we redacted the name and street address.

图 12: 这段对话展示了试图从 AI 助手获取个人身份信息 (PII) 的行为。我们已验证该地址不对应任何真实的物理位置,且列出的姓名似乎没有公开关联。然而,出于谨慎考虑,我们对姓名和街道地址进行了编辑。

A.6 Filtering Personally Identifiable Information

A.6 过滤个人身份信息

As illustrated in Figure 2, the red team dataset includes attempts to solicit personally identifiable information (PII) from the AI assistant. These conversations include addresses, phone numbers, driver's license and passport numbers, and social security numbers, from the human red teamer, the model, or both. In order to identify and redact conversations with PII, we used a regular expression (regex) filter to identify relevant conversations and then manually reviewed a sample for accuracy and validity.

如图 2 所示,红队数据集包括试图从 AI 助手中获取个人身份信息 (PII) 的尝试。这些对话包括来自人类红队成员、模型或两者的地址、电话号码、驾照和护照号码以及社会安全号码。为了识别和编辑包含 PII 的对话,我们使用正则表达式 (regex) 过滤器来识别相关对话,然后手动审查样本以确保准确性和有效性。

The regex we used is:

我们使用的正则表达式是:

[The full regular expression is garbled in this extraction and is not reproduced here; the pattern matched items such as U.S. street addresses and ZIP codes, email addresses, dates of birth, Social Security numbers (e.g. \d{3}-\d{2}-\d{4}), credit card numbers, and passport numbers.]

[完整的正则表达式在本次提取中已损坏,无法复原;该模式匹配的内容包括美国街道地址和邮政编码、电子邮件地址、出生日期、社会安全号码(如 \d{3}-\d{2}-\d{4})、信用卡号和护照号码。]

Some of the PII appears to be neither real nor accurate, and instead was "hallucinated" by the AI assistant. For example, in Figure 12 the address provided does not correspond to a real, physical location and has no public links to the individual named. However, in an abundance of caution, we redacted the name and street address. As described in $\S\mathrm{A}.7$, we removed all PII matches caught by the regex filter before publicly releasing the dataset.

一些 PII(个人身份信息)似乎既不真实也不准确,而是由 AI 助手“幻觉”生成的。例如,在图 12 中,提供的地址并不对应一个真实的物理位置,并且与所提到的人名没有任何公开关联。然而,出于谨慎考虑,我们删除了姓名和街道地址。如 $\S\mathrm{A}.7$ 所述,我们在公开发布数据集之前,移除了所有通过正则表达式过滤器捕获的 PII 匹配项。
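A minimal sketch of this filter-then-review pipeline, using a drastically simplified stand-in for the paper's regex (only SSNs, email addresses, and phone-like numbers here):

```python
import random
import re

# Simplified stand-in for the paper's much longer PII pattern.
PII_PATTERN = re.compile(
    r"\b\d{3}-\d{2}-\d{4}\b"             # U.S. Social Security number
    r"|[\w.+-]+@[\w-]+\.[\w.]{2,}"       # email address
    r"|\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"  # phone-like number
)

def flag_pii(transcripts):
    """Return the subset of transcripts that match the PII pattern."""
    return [t for t in transcripts if PII_PATTERN.search(t)]

transcripts = [
    "My SSN is 123-45-6789.",
    "Contact me at jane@example.com",
    "The weather is nice today.",
]
flagged = flag_pii(transcripts)
# As in the paper, a random sample of flagged transcripts would then be
# manually reviewed for accuracy before redaction or removal.
sample = random.sample(flagged, k=min(2, len(flagged)))
print(flagged)
```

Note the trade-off the appendix describes: a regex over-flags hallucinated PII and under-flags unusual formats, which is why the paper pairs the filter with manual review and still cautions that some instances may remain.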

A.7 Datasheet

A.7 数据表

Motivation

动机

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

数据集创建的目的是什么?是否有特定的任务?是否有需要填补的特定空白?请提供描述。

· We created this dataset to analyze and address potential harms in large language models through a process of adversarial testing known as “red teaming.” We publicly release the dataset for further analysis and exploration by the research community. This dataset adds to a limited number of publicly-available red team datasets, and to our knowledge it is the only dataset of red team attacks on a language model trained with reinforcement learning from human feedback (RLHF) as a safety technique.

我们创建这个数据集是为了通过一种称为“红队测试”的对抗性测试过程,分析和解决大语言模型中的潜在危害。我们公开发布该数据集,供研究社区进一步分析和探索。该数据集增加了有限的公开红队数据集数量,据我们所知,它是唯一一个针对通过人类反馈强化学习(RLHF)作为安全技术训练的语言模型进行红队攻击的数据集。

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

谁创建了数据集(例如,哪个团队、研究小组)以及代表哪个实体(例如,公司、机构、组织)?

· The dataset was created by the Societal Impacts and Alignment research groups at Anthropic.

· 该数据集由Anthropic的社会影响与对齐研究小组创建

Any other comments?

其他意见?

Warning: This dataset contains instances that may be offensive or upsetting. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. Please only engage with the data in accordance with your own personal risk tolerance. The data are intended for research purposes, especially research that can make models less harmful. The views expressed in the data do not reflect the views of Anthropic or any of its employees.

警告:此数据集包含可能令人反感或不安的内容。主题包括但不限于歧视性语言、虐待、暴力、自残、剥削以及其他可能令人不安的内容。请根据您个人的风险承受能力谨慎处理这些数据。这些数据仅供研究用途,特别是用于减少模型危害的研究。数据中表达的观点不代表Anthropic或其任何员工的立场。

Composition

组成

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

数据集中的实例代表什么(例如,文档、照片、人、国家)?是否存在多种类型的实例(例如,电影、用户和评分;人及其之间的互动;节点和边)?请提供描述。

· The dataset consists of documents (transcripts between a human and an AI assistant that correspond to a red team attempt) for a variety of AI assistants, along with numerical data that quantifies the harmfulness of the transcripts and categorical data that qualitatively characterizes the topics of the documents. See below for more information.

· 该数据集包含多种 AI 助手的文档(人类与 AI 助手之间的对话记录,对应红队尝试),以及量化对话记录有害程度的数值数据和定性描述文档主题的分类数据。更多信息请参见下文。

How many instances are there in total (of each type, if appropriate)?

总共有多少个实例(如果适用,按类型划分)?

· See Table 1.

· 参见表 1。

What data does each instance consist of? “Raw" data (e.g., un processed text or images) or features? In either case, please provide a description.

每个实例包含哪些数据?“原始”数据(例如未处理的文本或图像)还是特征?无论是哪种情况,请提供描述。

Each instance consists of raw text and numerical data that includes:

每个实例由原始文本和数值数据组成,包括:

A random sample (1,000) of the instances above contains the following annotations:

上述实例的随机样本(1,000 个)包含以下注释:

· tags: A list of up to 6 tags per transcript. Tags are short descriptions of the red team attempts generated by crowd workers who reviewed red team data post-hoc.

· 标签:每个转录本最多包含6个标签。标签是由众包工作者在事后审查红队数据后生成的简短描述,用于描述红队的尝试。
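Concretely, a single instance might look like the dictionary below. Every field name here is an illustrative guess, not the released dataset's actual schema:

```python
import json

# Hypothetical instance; field names are illustrative and may differ from
# the released dataset's actual schema.
instance = {
    "transcript": "Human: ...\n\nAssistant: ...",  # raw red-team dialogue
    "min_harmlessness_score": -1.2,                # preference-model score (lower = more harmful)
    "rating": 3,                                   # self-reported attack success, 0-4 Likert
    "red_team_member_id": 42,                      # anonymous participant identifier, 0-318
    "is_upworker": False,                          # TRUE = Upwork, FALSE = MTurk
    "tags": ["offensive language"],                # up to 6 post-hoc reviewer tags (annotated sample only)
}
print(json.dumps(instance, indent=2))
```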

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

单个实例中是否缺少任何信息?如果是,请提供描述,解释为什么缺少这些信息(例如,因为信息不可用)。这不包括故意删除的信息,但可能包括,例如,经过编辑的文本。

· No.

· 否。

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

个体实例之间的关系是否明确(例如,用户的电影评分、社交网络链接)?如果是,请描述这些关系是如何明确的。

· Yes. Each instance includes an anonymous participant identifier (numbers 0-318) to allow for additional analysis of the dataset.

是的。每个实例都包含一个匿名参与者标识符(编号 0-318),以便对数据集进行额外分析。

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

数据集中是否存在错误、噪声源或冗余?如果有,请提供描述。

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

数据集是自包含的,还是链接到或依赖于外部资源(例如网站、推文、其他数据集)?如果它链接到或依赖于外部资源,a) 是否有保证这些资源将长期存在并保持不变;b) 是否有完整数据集的官方存档版本(即包括数据集创建时存在的外部资源);c) 是否有任何与外部资源相关的限制(例如许可证、费用)可能适用于数据集使用者?请提供所有外部资源的描述及其相关限制,以及适当的链接或其他访问点。

· The dataset is self-contained, but contains model-generated text including web URLs and phone numbers. These have not been verified and may not be real, accurate, or maintained.

· 该数据集是自包含的,但包含模型生成的文本,包括网页URL和电话号码。这些内容未经核实,可能不真实、不准确或未维护。

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor- patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

数据集是否包含可能被视为机密的数据(例如,受法律特权或医患保密保护的数据,包含个人非公开通信内容的数据)?如果是,请提供描述。

· The dataset contains sensitive information, but it is unknown to the authors whether instances include confidential information.

· 数据集包含敏感信息,但作者不确定实例中是否包含机密信息。

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

数据集中是否包含如果直接查看可能会令人反感、侮辱、威胁或可能引起焦虑的数据?如果是,请描述原因。

· Yes. This dataset was created from explicit attempts to make the AI model say obnoxious, offensive, and harmful things in response to participant queries. As a result, the data - from both humans and models - may be upsetting or offensive. Topics include, but are not limited to, discriminatory language and discussions of abuse, violence, self-harm, exploitation, and other potentially upsetting subject matter. We recommend users of this dataset engage with it only within the bounds of their personal risk tolerance. We also recommend data users familiarize themselves with various wellbeing and resilience practices (e.g. mindfulness, stepping away from the material, creating time limits for working with this data, etc.) before extensive viewing. See A.2 for additional examples.

是的。该数据集是通过明确尝试让 AI 模型在回应用户查询时说出令人反感、冒犯和有害的内容而创建的。因此,数据(包括来自人类和模型的数据)可能会令人不安或具有冒犯性。主题包括但不限于歧视性语言以及关于虐待、暴力、自残、剥削和其他可能令人不安的主题的讨论。我们建议该数据集的用户仅在个人风险承受范围内使用。我们还建议数据用户在进行大量查看之前,熟悉各种健康和韧性实践(例如正念、远离材料、为处理这些数据设定时间限制等)。更多示例请参见 A.2。

Does the dataset identify any sub populations (e.g., by age, gender)? If so, please describe how these sub populations are identified and provide a description of their respective distributions within the dataset.

数据集是否识别了任何子群体(例如,按年龄、性别)?如果是,请描述这些子群体是如何识别的,并提供它们在数据集中的各自分布描述。

· The dataset identifies the crowdwork platform affiliation of the participant by the binary value "is up worker". "TRUE" indicates the participant was affiliated with the Upwork platform; "FALSE" indicates the participant was affiliated with the MTurk platform.
· Participants have an anonymous identifier (0-318) to allow for additional analysis of the dataset.

· 数据集通过二元值“is up worker”标识参与者的众包平台归属。“TRUE”表示参与者隶属于 Upwork 平台;“FALSE”表示参与者隶属于 MTurk 平台。
· 参与者拥有一个匿名标识符(0-318),以便对数据集进行进一步分析。

· No.

· 否。

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

数据集中是否包含可能被视为敏感的数据(例如,揭示种族或民族、性取向、宗教信仰、政治观点或工会成员身份、位置的数据;财务或健康数据;生物识别或基因数据;政府身份识别形式,如社会安全号码;犯罪记录)?如果是,请提供描述。

· Yes. The dataset includes discussion of sensitive topics, and may include examples of personally identifiable information (PII), which may or may not be real or accurate. In an attempt to minimize the release of PII, we used a regular expression (regex) filter to identify items such as addresses, phone numbers, driver's license and passport numbers, and social security numbers (see $\S\mathrm{A}.6$). A manual review of sample instances indicated that some of the PII was neither real nor accurate (e.g. a model-generated address did not correspond to a real, physical location). We provide a representative example transcript in $\S\mathrm{A}.6$. In an abundance of caution, we removed all instances caught by the regex filter, though some instances may remain unintentionally.

是的。该数据集包含敏感话题的讨论,可能包含个人身份信息(PII)的示例,这些信息可能是真实的,也可能不真实或不准确。为了尽量减少PII的泄露,我们使用正则表达式(regex)过滤器来识别诸如地址、电话号码、驾照和护照号码以及社会安全号码等项目(参见 $\S\mathrm{A}.6$)。对样本实例的手动审查表明,部分PII既不是真实的,也不准确(例如,模型生成的地址不对应于真实的物理位置)。我们在 $\S\mathrm{A}.6$ 中提供了一个代表性的示例记录。出于谨慎考虑,我们删除了所有被正则表达式过滤器捕获的实例,尽管可能仍有一些实例无意中保留下来。

Any other comments?

其他意见?

· None.

· 无。

Collection Process

收集过程

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

与每个实例相关的数据是如何获取的?数据是直接可观察的(例如,原始文本、电影评分)、由受试者报告的(例如,调查响应),还是从其他数据间接推断/推导的(例如,词性标签、基于模型的年龄或语言猜测)?如果数据是由受试者报告或从其他数据间接推断/推导的,数据是否经过验证/核实?如果是,请描述如何进行验证。

· The data was acquired through a custom interface where participants engaged in open-ended conversation with an AI assistant and rated various aspects of the conversation.

数据是通过一个自定义界面获取的,参与者在该界面中与AI助手进行开放式对话,并对对话的各个方面进行评分。

What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?

收集数据使用了哪些机制或程序(例如,硬件设备或传感器、人工手动整理、软件程序、软件 API)?这些机制或程序是如何验证的?

· Custom-built software interfaces for conversations with the AI assistant and conversation reviews deployed through MTurk.

· 为与AI助手对话和通过MTurk部署的对话审查定制的软件界面。

Who was involved in the data collection process (e.g., students, crowd workers, contractors) and how were they compensated (e.g., how much were crowd workers paid)?

数据收集过程中涉及了哪些人员(例如,学生、众包工人、承包商),以及他们是如何获得报酬的(例如,众包工人获得了多少报酬)?

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

数据收集的时间范围是什么?这个时间范围是否与实例相关数据的创建时间范围一致(例如,最近抓取的旧新闻文章)?如果不一致,请描述与实例相关数据的创建时间范围。

· The data was collected between November 2021 and June 2022.

数据收集时间为2021年11月至2022年6月。

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review proceses, including the outcomes, as well as a link or other access point to any supporting documentation.

是否进行了任何伦理审查流程(例如,由机构审查委员会进行)?如果是,请提供这些审查流程的描述,包括结果,以及任何支持文档的链接或其他访问点。

· Informal, internal ethical review processes were conducted prior to and during the creation of this dataset. The authors of this dataset reviewed relevant literature in machine learning (ML) and Trust & Safety, consulted industry experts, conducted in-house red teaming and conversation reviews, and made continuous iterations to the task interface to mitigate the risk of harm to participants. See paper for details.

· 在创建此数据集之前和期间,进行了非正式的内部伦理审查流程。该数据集的作者们审查了机器学习(ML)和信任与安全领域的相关文献,咨询了行业专家,进行了内部红队测试和对话审查,并持续迭代任务界面,以减轻对参与者造成伤害的风险。详情请参见论文。

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

您是否直接从相关个人处收集数据,还是通过第三方或其他来源(例如网站)获取?

· We collected the data from the individuals in question directly, through the use of a custom interface that we built and deployed via MTurk.

· 我们通过使用自定义界面直接从相关个体收集数据,该界面通过MTurk构建和部署。

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

相关人员是否被告知数据收集的情况?如果是,请描述(或通过截图或其他信息展示)通知是如何提供的,并提供通知本身的确切语言的链接或其他访问点,或直接复制通知内容。

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

相关个人是否同意收集和使用其数据?如果是,请描述(或通过截图或其他信息展示)如何请求和提供同意,并提供链接或其他访问点,或复制个人同意的确切语言。

· Yes. See Figure 3 and Figure 8 in $\S3$

是的。参见 $\S3$ 中的图 3 和图 8。

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

如果获得了同意,是否向同意者提供了未来或针对特定用途撤销同意的机制?如果是,请提供描述以及机制的链接或其他访问点(如适用)。

· Participants were provided with various methods to contact the research team for any questions or concerns (e.g. email, Slack).

· 参与者可以通过多种方式联系研究团队,以解决任何问题或疑虑(例如电子邮件、Slack)。

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

是否已对数据集及其使用对数据主体的潜在影响进行了分析(例如,数据保护影响分析)?如果是,请提供该分析的描述,包括结果,以及任何支持文档的链接或其他访问点。

· An impact analysis was conducted to assess the potential impact on the creators of each data instance. Participants engaged in the Review Task were asked to complete a survey measuring their feelings toward the task. The results of this survey demonstrate positive reactions to involvement in the creation of the dataset. For more information on the survey please see $\S\mathrm{A}.2$

· 进行了影响分析,以评估对每个数据实例创作者的潜在影响。参与审查任务的参与者被要求完成一项调查,以衡量他们对任务的感觉。该调查的结果表明,参与者对参与数据集创建的反应是积极的。有关调查的更多信息,请参见 $\S\mathrm{A}.2$。

Any other comments?

其他意见?

· None.

· 无。

Preprocessing / Cleaning / Labeling

预处理 / 清洗 / 标注

Was any preprocessing/cleaning/labeling of the data done (e.g., disc ret iz ation or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section.

是否对数据进行了任何预处理/清理/标注(例如,离散化或分桶、Token化、词性标注、SIFT特征提取、实例移除、缺失值处理)?如果是,请提供描述。如果没有,可以跳过本节中的其余问题。

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

除了预处理/清理/标注的数据外,是否保存了“原始”数据(例如,以支持未来未预期的用途)?如果是,请提供“原始”数据的链接或其他访问点。

Yes.

是的。

Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.

用于预处理/清理/标记数据的软件是否可用?如果可用,请提供链接或其他访问点。

· We do not release the harmlessness classifier.
· We provide the regex filter we used to remove PII from the dataset in $\S\mathrm{A}.6$

· 我们不发布无害性分类器。
· 我们在 $\S\mathrm{A}.6$ 中提供了用于从数据集中移除 PII 的正则表达式过滤器。
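The regex filter itself appears in the paper's appendix and is not reproduced here. As a rough illustration of what a regex-based PII scrub looks like in practice, a minimal sketch might be the following. The specific patterns and placeholder tags below are our own assumptions, not the filter from $\S\mathrm{A}.6$:

```python
import re

# Hypothetical example patterns; the paper's actual filter (Appendix A.6)
# may cover different PII categories and use different expressions.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace each PII match with a bracketed placeholder tag."""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text
```

Substituting placeholder tags (rather than deleting matches outright) keeps the surrounding transcript readable for later analysis.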

Any other comments?

其他意见?

· None.

Uses

用途

Has the dataset been used for any tasks already? If so, please provide a description.

数据集是否已经用于任何任务?如果是,请提供描述。

· No.

· 否。

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

是否有存储库链接到使用该数据集的任何或所有论文或系统?如果有,请提供链接或其他访问点。

· No.

· 否。

What (other) tasks could the dataset be used for?

该数据集还可用于哪些(其他)任务?

· In addition to providing a resource for the research community to further investigate what successful red team attacks look like, this dataset can be used to build (semi-)automated red team techniques and to assess the efficacy of various strategies for mitigating harms in large language models.

· 除了为研究社区提供一个资源以进一步研究成功的红队攻击是什么样子外,该数据集还可用于构建(半)自动化的红队技术,并评估各种减轻大语言模型危害的策略的有效性。

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?

数据集的构成、收集方式或预处理/清理/标注方式是否可能影响未来的使用?例如,数据集使用者是否需要了解某些信息,以避免可能导致对个人或群体不公平对待(例如,刻板印象、服务质量问题)或其他风险或损害(例如,法律风险、财务损害)的使用?如果是,请提供相关描述。数据集使用者可以采取哪些措施来减轻这些风险或损害?

· This dataset contains offensive and harmful instances, and should only be used for research purposes and to build the harmlessness classifiers described above. Users of this dataset are advised to engage with the dataset only within the bounds of their personal risk tolerance and to practice well-being and resilience exercises when working with this dataset.

· 该数据集包含冒犯性和有害的实例,仅应用于研究目的,并用于构建上述描述的无害性分类器。建议用户在使用此数据集时,仅在个人风险承受能力范围内进行,并在处理该数据集时进行健康和心理韧性练习。

Are there tasks for which the dataset should not be used? If so, please provide a description.

是否存在不应使用该数据集的任务?如果是,请提供描述。

· Just as this dataset can be used to develop safer AI models, it could also be used to train models that produce more harmful responses and should not be used for that purpose. Additionally, the dataset is not comprehensive of all possible harms or red team attacks and should not be treated as such.

· 正如该数据集可用于开发更安全的AI模型,它也可能被用来训练产生更有害响应的模型,因此不应用于此目的。此外,该数据集并未涵盖所有可能的危害或红队攻击,不应被视为全面。

Any other comments?

还有其他意见吗?

Distribution

分布

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

数据集是否会分发给创建该数据集所代表的实体(例如公司、机构、组织)之外的第三方?如果是,请提供描述。

· The dataset is publicly available.

· 该数据集是公开可用的。

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

数据集将如何分发(例如,网站上的压缩包、API、GitHub)?数据集是否有数字对象标识符(DOI)?

· Yes. The dataset is publicly available, hosted on GitHub at https://github.com/anthropics/hh-rlhf.

是的。该数据集是公开的,托管在 GitHub 上,地址为 https://github.com/anthropics/hh-rlhf

When will the dataset be distributed?

数据集何时分发?

· The dataset was released in August 2022.

· 该数据集于2022年8月发布。

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

是否有任何第三方对与实例相关的数据施加了基于知识产权(IP)或其他限制?如果有,请描述这些限制,并提供相关许可条款的链接或其他访问点,或复制这些条款,以及与此类限制相关的任何费用。

· No.

· 否。

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

数据集或单个实例是否受到任何出口管制或其他监管限制?如果是,请描述这些限制,并提供支持文档的链接或其他访问点,或复制相关内容。

· No.

· 否。

Any other comments?

还有其他意见吗?

· None.

· 无。

Maintenance

维护

Who will be supporting/hosting/maintaining the dataset?

谁将支持/托管/维护数据集?

· Anthropic hosts, but does not maintain, the dataset.

· Anthropic 托管但不维护该数据集。

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

如何联系数据集的所有者/策展人/管理者(例如,电子邮件地址)?

· Contact information can be found at https://github.com/anthropics/hh-rlhf.

· 联系方式可以在 https://github.com/anthropics/hh-rlhf 找到。

Is there an erratum? If so, please provide a link or other access point.

是否有勘误表?如果有,请提供链接或其他访问途径。

· No.

· 否。

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?

数据集是否会更新(例如,修正标签错误、添加新实例、删除实例)?如果是,请描述更新的频率、由谁负责更新以及如何向数据集使用者传达更新信息(例如,邮件列表、GitHub)?

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? f so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

如果其他人希望扩展/增强/构建/贡献数据集,是否有机制支持他们这样做?如果有,请提供描述。这些贡献会被验证/核实吗?如果是,请描述如何进行。如果不是,为什么?是否有流程将这些贡献传达/分发给数据集的使用者?如果有,请提供描述。

· Researchers are encouraged to explore and build on the dataset in their own research efforts, but this dataset will remain as-is.

· 鼓励研究人员在自己的研究工作中探索并基于该数据集进行构建,但该数据集将保持原样。

Any other comments?

其他意见?

· None.

· 无。

References

参考文献

Thomas, F. Tramer, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang. On the Opportunities and Risks of Foundation Models. arXiv:2108.07258 [cs], Aug. 2021.

Thomas, F. Tramer, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, 和 P. Liang. 基础模型的机遇与风险 (On the Opportunities and Risks of Foundation Models). arXiv:2108.07258 [cs], 2021年8月.

T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. d. M. d'Autume, Y. Li, T. Terzi, V. Miku

T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. d. M. d'Autume, Y. Li, T. Terzi, V. Miku
