[Paper Translation] OPT: Open Pre-trained Transformer Language Models


Original paper: https://arxiv.org/pdf/2205.01068


OPT: Open Pre-trained Transformer Language Models


Susan Zhang,∗ Stephen Roller,∗ Naman Goyal,∗ Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott,† Sam Shleifer,† Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer

Meta AI


{susanz,roller,naman}@fb.com


Abstract


Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

1 Introduction


Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowdhery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers’ ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

In this technical report, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.

We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories. We are also releasing both the logbook of our model creation as well as our codebase, metaseq, which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs.

We believe the entire AI community — academic researchers, civil society, policymakers, and industry — must work together to develop clear guidelines around responsible AI in general and responsible LLMs in particular, given their centrality in many downstream language applications. A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the diversity of voices defining the ethical considerations of such technologies.


Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size ($d_{\mathrm{model}}$). We also report the peak Learning Rate (LR) and global batch size in number of tokens (Batch).

Model #L #H dmodel LR Batch
125M 12 12 768 6.0e-4 0.5M
350M 24 16 1024 3.0e-4 0.5M
1.3B 24 32 2048 2.0e-4 1M
2.7B 32 32 2560 1.6e-4 1M
6.7B 32 32 4096 1.2e-4 2M
13B 40 40 5120 1.0e-4 4M
30B 48 56 7168 1.0e-4 4M
66B 64 72 9216 0.8e-4 2M
175B 96 96 12288 1.2e-4 2M

2 Method


2.1 Models


We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to reduce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain increased computational efficiency.

2.2 Training Setup


For weight initialization, we follow the same settings provided in the Megatron-LM codebase, using a normal distribution with zero mean and standard deviation of 0.006. The standard deviation for output layers is scaled by a $1.0/\sqrt{2L}$ term, where $L$ is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.
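The initialization rule above can be sketched in a few lines. This is an illustrative sketch rather than the Megatron-LM or metaseq code. The helper names `init_std` and `init_weights` are hypothetical. Which matrices count as "output layers" is an assumption, and the caller is left to decide it through the `is_output_layer` flag.

```python
import math
import random


def init_std(num_layers: int, base_std: float = 0.006,
             is_output_layer: bool = False) -> float:
    """Standard deviation for one weight matrix.

    Output layers get the base std scaled by 1 / sqrt(2 * L),
    where L is the total number of layers; everything else uses
    the base std unchanged.
    """
    if is_output_layer:
        return base_std / math.sqrt(2.0 * num_layers)
    return base_std


def init_weights(rows: int, cols: int, std: float, seed: int = 0):
    """Draw a rows x cols matrix from N(0, std^2). Biases would simply be zeros."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


# OPT-175B has 96 layers (Table 1), so its output layers would use
# a std of 0.006 / sqrt(192) under this reading of the rule.
std_out_175b = init_std(96, is_output_layer=True)
tiny_matrix = init_weights(4, 4, std_out_175b)
```

The scaling shrinks the contribution of each residual branch as depth grows, which is the usual motivation for a $1/\sqrt{2L}$ factor.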

We use an AdamW optimizer (Loshchilov and Hutter, 2017) with $(\beta_{1},\beta_{2})$ set to (0.9, 0.95), and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and are kept constant throughout the course of training.
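The planned schedule (before any mid-flight changes) can be written as a function of tokens seen. This is a minimal sketch under stated assumptions: the function name `lr_at` is hypothetical, the decay is assumed to start right after warmup and to reach 10% of the peak at 300B total tokens, and "2M" is taken as 2,000,000 tokens per batch when converting the 2000 warmup steps into tokens.

```python
def lr_at(tokens_seen: float, peak_lr: float, warmup_tokens: float,
          decay_tokens: float = 300e9, final_ratio: float = 0.1) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to final_ratio * peak_lr."""
    if tokens_seen < warmup_tokens:
        # Warmup phase: learning rate grows linearly from 0.
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen >= decay_tokens:
        # Past the end of the schedule: hold at the floor.
        return peak_lr * final_ratio
    # Decay phase: interpolate between peak_lr and the floor.
    progress = (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens)
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


# OPT-175B: peak LR 1.2e-4 and a 2M-token batch (Table 1), 2000 warmup steps.
peak_175b = 1.2e-4
warmup_175b = 2000 * 2_000_000  # assumed conversion of steps to tokens
halfway_lr = lr_at(150e9, peak_175b, warmup_175b)
```

The actual OPT-175B run deviated from this curve because of the manual LR reductions described in Section 2.5.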

We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some mid-flight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also include a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of $N$ into two division operations by $\sqrt{N}$).
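Both ideas in this paragraph are easy to illustrate on plain Python lists. The sketch below simulates the all-reduce on a single process, so `allreduce_mean_predivide` and `clip_by_global_norm` are hypothetical stand-ins, not metaseq functions. The small epsilon in the clipping denominator is also an assumption.

```python
import math


def allreduce_mean_predivide(per_rank_grads):
    """Average gradients over N ranks using two divisions by sqrt(N).

    Each rank divides by sqrt(N) before the sum, and the summed result is
    divided by sqrt(N) again. The intermediate sum therefore stays smaller
    than a plain sum (less overflow risk) while the pre-divided values stay
    larger than grads / N (less underflow risk).
    """
    n = len(per_rank_grads)
    factor = math.sqrt(n)
    pre_divided = [[g / factor for g in grads] for grads in per_rank_grads]
    summed = [sum(values) for values in zip(*pre_divided)]
    return [s / factor for s in summed]


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads]


averaged = allreduce_mean_predivide([[1.0, 2.0], [3.0, 4.0]])
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Mathematically the result equals the ordinary mean over ranks; only the magnitude of the intermediate values changes.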

2.3 Pre-training Corpus


The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via Common Crawl.


We removed duplicated documents across all datasets by filtering out documents via MinhashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity $\geq 0.95$. We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.
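To make the de-duplication criterion concrete, the sketch below estimates Jaccard similarity with MinHash signatures and drops any document whose estimate against an already kept document reaches 0.95. It is a simplified illustration, not the pipeline used for OPT: the LSH bucketing step that makes the search scale is omitted (documents are compared pairwise), and the 5-word shingles, 128 hash functions, and all function names are assumptions.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set:
    """Set of k-word shingles; short texts collapse to a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set, num_perm: int = 128) -> list:
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list, threshold: float = 0.95) -> list:
    """Return indices of documents kept after near-duplicate removal."""
    kept, kept_sigs = [], []
    for index, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(index)
            kept_sigs.append(sig)
    return kept


corpus = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog today",
    "an entirely unrelated sentence about training large language models",
]
kept_indices = deduplicate(corpus)
```

A production setup would bucket signatures into LSH bands so that only candidate pairs are compared, instead of the quadratic loop used here.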

We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.


RoBERTa. We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).

The Pile. We included a subset of the Pile (Gao et al., 2021a), including: Common Crawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated as we found they increased the risk of instabilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were otherwise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.

PushShift.io Reddit. We included a subset of the Pushshift.io corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.
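Extracting the longest chain from a comment tree amounts to finding the deepest root-to-leaf path. The sketch below is one way to do it. The thread representation (a dict from comment id to child ids), the function name `longest_chain`, and measuring length in number of comments rather than tokens are all assumptions, since the text does not specify them.

```python
def longest_chain(children: dict, root: str) -> list:
    """Return the longest root-to-leaf path as a list of comment ids.

    Uses an explicit stack instead of recursion so that very deep
    threads do not hit Python's recursion limit.
    """
    best = []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        replies = children.get(node, [])
        if not replies and len(path) > len(best):
            best = path  # reached a leaf on a longer path than any seen so far
        for reply in replies:
            stack.append((reply, path + [reply]))
    return best


# A toy thread: the submission has two top-level comments, one of which
# starts a deeper exchange.
thread = {
    "post": ["c1", "c2"],
    "c1": ["c3"],
    "c3": ["c4"],
    "c2": ["c5"],
}
document_ids = longest_chain(thread, "post")
```

All comments outside the returned path (here `c2` and `c5`) would be discarded, which is consistent with the large reduction in corpus size reported above.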

2.4 Training Efficiency


We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16. To avoid underflows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).
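Dynamic loss scaling multiplies the loss by a scale factor before the backward pass so that small FP16 gradients do not underflow, then adapts the factor based on whether the gradients overflowed. The class below is a generic sketch of that control loop. It is not metaseq's implementation, and the initial scale, growth interval, and backoff factor are illustrative assumptions.

```python
import math


class DynamicLossScaler:
    """Minimal dynamic loss scaler: back off on overflow, grow after a clean streak."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads_finite: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if not grads_finite:
            # Overflow: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # A long run without overflow: try a larger scale.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


def all_finite(grads) -> bool:
    """True if no gradient value is inf or NaN."""
    return all(math.isfinite(g) for g in grads)


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
step_applied = scaler.update(all_finite([0.5, float("inf")]))  # overflow
```

A scale that keeps shrinking toward 0 means overflows are happening on nearly every step, which is the symptom Section 2.5 associates with loss divergence.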

2.5 Training Processes


Here we describe significant training process adjustments that arose during OPT-175B pre-training.


Hardware Failures. We faced a significant number of hardware failures in our compute cluster while training OPT-175B. In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostics tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.

Loss Divergences. Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restarting from an earlier checkpoint allowed for the job to recover and continue training. We noticed a correlation between loss divergence, our dynamic loss scalar crashing to 0, and the $l^{2}$-norm of the activations of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state $(\geq 1.0)$, and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.


Figure 1: Empirical LR schedule. We found that lowering learning rate was helpful for avoiding instabilities.




Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.


Other Mid-flight Changes. We conducted a number of other experimental mid-flight changes to handle loss divergences. These included: switching to vanilla SGD (optimization plateaued quickly, and we reverted to AdamW); resetting the dynamic loss scalar (this helped recover some but not all divergences); and switching to a newer version of Megatron (this reduced pressure on activation norms and improved throughput).

3 Evaluations


3.1 Prompting & Few-Shot


We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), WinoGrad (Levesque et al., 2011), WinoGrande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).

We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in evaluation metrics). For the Winograd Schema Challenge (WSC) task in the SuperGLUE benchmark, we follow Brown et al. (2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).

Zero-shot. Overall average zero-shot performance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full breakdown, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately.

Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Challenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpredictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task. For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup, suggesting differences in the methods of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations.


Figure 3: Zero-shot NLP Evaluation Averages. Across a variety of tasks and model sizes, OPT largely matches the reported averages of GPT-3. However, performance varies greatly per task: see Appendix A.



Figure 4: Multi-shot performance. OPT performance for one- and few-shot lags behind GPT-3 models, but performance depends heavily per task; see Appendix A.



Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for number of parameters. We speculate the high performance of PaLM comes predominantly from higher quality and diversity of pre-training data.


One-shot and Few-shot. Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed performances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see underperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).

3.2 Dialogue


Given that LLMs are known to be an integral component of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we follow Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Dinan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against existing open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022).


We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 competition (Dinan et al., 2020b). To control for different tokenization in each of the models, we normalize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1.
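The two metrics can be sketched as follows. This is a simplified illustration with hypothetical function names. `unigram_f1` uses lowercase whitespace tokenization, whereas the ConvAI2 evaluation may apply further text normalization. `normalized_perplexity` assumes that normalizing to the GPT-2 tokenizer space means dividing the total negative log-likelihood (in nats) of the reference text by its GPT-2 token count before exponentiating.

```python
import math
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared word at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def normalized_perplexity(total_nll_nats: float, num_gpt2_tokens: int) -> float:
    """Perplexity per GPT-2 token, regardless of the model's own tokenizer."""
    return math.exp(total_nll_nats / num_gpt2_tokens)


example_f1 = unigram_f1("the cat sat", "the cat ran")
example_ppl = normalized_perplexity(5 * math.log(10.0), 5)
```

Dividing by a shared token count makes perplexities comparable across models whose tokenizers split the same text into different numbers of pieces.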

Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially in the ConvAI2 dataset. On the Wizard-of-Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard-of-Wikipedia supervision.

We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as competitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We additionally evaluated OPT-175B on the ConvAI2 hidden test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like Multi-Session Chat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the Common Crawl snapshot used in pre-training corpus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behavior also highlighted in LaMDA (Thoppilan et al., 2022).

4 Bias & Toxicity Evaluations


To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcomings in these benchmarks (Blodgett et al., 2021; Jacobs and Wallach, 2021), these measurements provide a first step towards understanding the limitations of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).


4.1 Hate Speech Detection


Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-, and few-shot binary cases, the model is presented with text and asked to consider whether the text is racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response.


                              Perplexity (↓)                 Unigram F1 (↑)
Model           Eval          C2    WW    ED    BST   WoI    C2    WW    ED    BST
Reddit 2.7B     Unsupervised  18.9  21.0  11.6  17.4  18.0   .126  .133  .135  .133
BlenderBot 1    Supervised    10.2  12.5  9.0   11.9  14.7   .183  .189  .192  .178
R2C2 BlenderBot Supervised    10.5  12.4  9.1   11.7  14.6   .205  .198  .197  .186
OPT-175B        Unsupervised  10.8  13.3  10.3  12.1  12.0   .185  .152  .149  .162

Table 3: Hate speech detection. F1 scores of detecting hate speech between Davinci and OPT-175B. OPT-175B considerably outperforms Davinci in all settings.

Setup                  Davinci  OPT-175B
Zero-shot              .628     .667
One-shot               .616     .713
Few-shot (binary)      .354     .759
Few-shot (multiclass)  .672     .812

Results are presented in Table 3. With all of our one-shot through few-shot configurations, OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1) evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original 175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated social media discussions in the pre-training dataset has provided additional inductive bias to aid in such classification tasks.

4.2 CrowS-Pairs



Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced benchmark aiming to measure intrasentence level biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example consists of a pair of sentences representing a stereotype, or anti-stereotype, regarding a certain group, with the goal of measuring model preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model.
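As a rough illustration of how such a score can be aggregated, the sketch below reports the percentage of sentence pairs in which the model assigns a higher log-likelihood to the more stereotypical sentence. The function name `crows_pairs_score` is hypothetical, and the log-likelihoods are assumed to be given. How they are computed for an autoregressive model such as OPT-175B is not specified in this section.

```python
def crows_pairs_score(pairs) -> float:
    """Percentage of pairs where the stereotypical sentence is preferred.

    `pairs` is an iterable of (logp_stereotypical, logp_anti_stereotypical)
    tuples. A score of 50 would mean no systematic preference; higher
    values indicate a stronger preference for stereotypical sentences.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one sentence pair")
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


# Toy log-likelihoods for four sentence pairs (made-up numbers).
toy_pairs = [(-10.0, -12.0), (-9.0, -8.0), (-5.0, -7.0), (-3.0, -3.5)]
toy_score = crows_pairs_score(toy_pairs)
```

In the toy data the stereotypical sentence wins in three of four pairs, so the score is 75.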

Table 2: Dialogue Evaluations. OPT-175B, in a fully unsupervised setting, performs competitively against fully supervised models.

Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The OPT-175B model performs worse than Davinci in most categories.

Category             GPT-3  OPT-175B
Gender               62.6   65.7
Religion             73.3   68.6
Race/Color           64.7   68.6
Sexual orientation   76.2   78.6
Age                  64.4   67.8
Nationality          61.6   62.9
Disability           76.7   76.7
Physical appearance  74.6   76.2
Socioeconomic status 73.8   76.2
Overall              67.2   69.5

When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that the Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g., Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.

4.3 StereoSet


Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021) to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at the intersentence level to test a model’s ability to incorporate additional context. To account for a potential trade-off between bias detection and language modeling capability, StereoSet includes two metrics:

Table 5: StereoSet Evaluations. Davinci and OPT-175B perform similarly across all evaluations.

表 5: StereoSet 评估。Davinci 和 OPT-175B 在所有评估中表现相似。

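| Category | Metric | Davinci | OPT-175B |
| --- | --- | --- | --- |
| Profession | LMS (↑) | 78.4 | 74.1 |
| Profession | SS (↓) | 63.4 | 62.6 |
| Profession | ICAT (↑) | 57.5 | 55.4 |
| Gender | LMS (↑) | 75.6 | 74.0 |
| Gender | SS (↓) | 66.5 | 63.6 |
| Gender | ICAT (↑) | 50.6 | 53.8 |
| Religion | LMS (↑) | 80.8 | 84.0 |
| Religion | SS (↓) | 59.0 | 59.0 |
| Religion | ICAT (↑) | 66.3 | 68.9 |
| Race | LMS (↑) | 77.0 | 74.9 |
| Race | SS (↓) | 57.4 | 56.8 |
| Race | ICAT (↑) | 65.7 | 64.8 |
| Overall | LMS (↑) | 77.6 | 74.8 |
| Overall | SS (↓) | 60.8 | 59.9 |
| Overall | ICAT (↑) | 60.8 | 60.0 |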

Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test score (ICAT). Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count, which they report improves metrics for several models.

语言模型评分 (LMS) 和刻板印象评分 (SS),然后将两者结合形成理想化情境关联测试评分 (ICAT)。与 Lieber 等人 (2021) 不同,我们通过 Token 数量而不是字符数量来标准化评分,他们报告这种方法可以改善多个模型的指标。
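To make the combination of LMS and SS concrete, here is a minimal sketch of the ICAT formula as defined in the StereoSet paper (Nadeem et al., 2021). The formula is not spelled out in this section, so treat it as an assumption brought in from that paper; the function name `icat` is hypothetical, and both inputs are assumed to be percentages in [0, 100].

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score from LMS and SS, both on a 0-100 scale.

    ICAT = LMS * min(SS, 100 - SS) / 50, so an ideal model
    (LMS = 100, SS = 50) scores 100, and a fully stereotyped or fully
    anti-stereotyped model (SS = 100 or SS = 0) scores 0.
    """
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must lie in [0, 100]")
    return lms * min(ss, 100.0 - ss) / 50.0


# Overall row of Table 5: (LMS, SS) for each model.
print(round(icat(77.6, 60.8), 1))  # Davinci  -> 60.8
print(round(icat(74.8, 59.9), 1))  # OPT-175B -> 60.0
```

Applied to the overall LMS and SS values in Table 5, this reproduces the reported overall ICAT values; some per-category rows differ by about 0.1, presumably because the published LMS and SS figures are rounded.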

Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the areas of profession and race, while OPT-175B outperforms in the areas of gender and religion. OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms on the LMS metric.

结果如表 5 所示。我们看到 Davinci 和 OPT-175B 的总体得分相似(两者之间的总体 ICAT 非常接近)。具体来说,Davinci 在职业和种族方面表现更好,而 OPT-175B 在性别和宗教方面表现更优。OPT-175B 在 SS 指标上全面领先,而 Davinci 在 LMS 指标上通常表现更好。


4.4 Real Toxicity Prompts

4.4 真实毒性提示 (Real Toxicity Prompts)

We evaluate the tendency of OPT-175B to respond with toxic language via the Real Toxicity Prompts (Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 generations of 20 tokens using nucleus sampling (Holtzman et al., 2020) $(p = 0.9)$ for each of 10,000 randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations, stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed toxicity rates from Davinci and PaLM.

我们评估 OPT-175B 对 Real Toxicity Prompts (Gehman 等,2020) 数据集以毒性语言回应的倾向。遵循 PaLM (Chowdhery 等,2022),我们对从 RTP 中随机抽取的 10,000 个提示中的每一个,使用核采样 (Holtzman 等,2020) $(p = 0.9)$ 抽取 25 次 20 Token 的生成,并报告续写文本的平均毒性概率,按原始提示的分桶毒性进行分层。为了比较,我们报告了 Davinci 和 PaLM 的分桶毒性率。
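The nucleus (top-p) sampling step used in this protocol can be sketched as follows. This is a toy illustration over an explicit next-token distribution, not the generation code used for OPT-175B: the function name `nucleus_sample` and the distribution `toy_next_token` are hypothetical, and a real setup would apply the same filtering to model logits at each of the 20 decoding steps.

```python
import random
from typing import Dict, Optional


def nucleus_sample(
    probs: Dict[str, float],
    p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample one token from the smallest top-ranked set whose mass reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    rng = rng or random.Random()
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, mass = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        mass += prob
        if mass >= p:  # stop once the cumulative mass covers p
            break
    tokens = [t for t, _ in nucleus]
    weights = [w for _, w in nucleus]
    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only.
toy_next_token = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
rng = random.Random(0)
samples = [nucleus_sample(toy_next_token, p=0.9, rng=rng) for _ in range(25)]
print(set(samples) <= {"the", "a", "cat"})  # True: "zebra" lies outside the nucleus
```

With $p = 0.9$, the lowest-probability tail is never sampled, which is the mechanism that trades a little diversity for fewer degenerate continuations.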

Results are shown in Figure 5. Overall, we see that OPT-175B has a higher toxicity rate than either PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating toxic continuations as the toxicity of the prompt increases, which is consistent with the observations of Chowdhery et al. (2022). As with our experiments in hate speech detection, we suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model familiarity with, and therefore propensity to generate and detect, toxic text. This strong awareness of toxic language may or may not be desirable depending on the specific requirements of downstream applications. Future applications of OPT-175B should consider this aspect of the model, and take additional mitigations, or avoid usage entirely as appropriate.

结果如图 5 所示。总体而言,我们发现 OPT-175B 的毒性率高于 PaLM 或 Davinci。我们还观察到,随着提示的毒性增加,所有 3 个模型生成有毒续写的可能性都增加,这与 Chowdhery 等人 (2022) 的观察结果一致。与我们在仇恨言论检测实验中的情况一样,我们怀疑未经过滤的社会媒体文本包含在预训练语料库中提高了模型对有毒文本的熟悉度,从而增加了生成和检测有毒文本的倾向。这种对有毒语言的高度敏感性是否可取取决于下游应用的具体需求。未来的 OPT-175B 应用应考虑这一方面,并根据需要采取额外的缓解措施,或完全避免使用。


Figure 5: Real Toxicity Prompts. OPT-175B is more likely to generate toxic responses than either Davinci or PaLM. Consistent with prior work, toxicity rates increase as prompt toxicity increases.

图 5: Real Toxicity Prompts。OPT-175B 比 Davinci 或 PaLM 更容易生成有毒响应。与先前的工作一致,提示的毒性增加时,毒性率也随之增加。

4.5 Dialogue Safety Evaluations

4.5 对话安全性评估

Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, Safer Dialogues (Ung et al., 2021), measures the ability to recover from explicit safety failures, usually in the form of apologizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al., 2021), measures how unsafe a model’s response is, stratified across 4 levels of topic sensitivity: Safe, Realistic, Unsafe, and Adversarial. As with the other dialogue evaluations (Section 3.2), we compare to several existing open source dialogue models.

最后,我们在两个对话安全评估上比较 OPT-175B。第一个是 Safer Dialogues (Ung et al., 2021),用于衡量从明确的安全故障中恢复的能力,通常表现为道歉或承认错误。第二个是 Safety Bench Unit Tests (Dinan et al., 2021),用于衡量模型响应的不安全性,并按四个话题敏感度级别进行分类:安全、现实、不安全和对抗性。与其他对话评估 (第 3.2 节) 一样,我们将 OPT-175B 与几个现有的开源对话模型进行了比较。

Results for both experiments are shown in Table 6. We observe that OPT-175B has similar performance as the Reddit 2.7B model across both Safer Dialogues and the Unit Tests, with OPT-175B performing marginally better in the Safe and Adversarial settings. Consistent with Roller et al. (2021)


我们观察到 OPT-175B 的表现与 Reddit 2.7B 模型在 Safer Dialogues 和单元测试中相似,且 OPT-175B 在安全和对抗设置中的表现略好。这与 Roller 等人 (2021) 的结果一致。

Table 6: Dialogue Responsible AI evaluations. OPT-175B is roughly on par with the Reddit 2.7B model, but performs worse in the Unsafe setting.

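| Model | Safer Dialogues PPL | Safer Dialogues F1 | Unit Tests: Safe | Unit Tests: Realistic | Unit Tests: Unsafe | Unit Tests: Adversarial |
| --- | --- | --- | --- | --- | --- | --- |
| Reddit 2.7B | 16.2 | .140 | .300 | .261 | .450 | .439 |
| BlenderBot 1 | 12.4 | .161 | .028 | .150 | .250 | .194 |
| R2C2 BlenderBot | 13.8 | .160 | .022 | .133 | .289 | .222 |
| OPT-175B | 14.7 | .141 | .033 | .261 | .567 | .283 |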

表 6: 对话负责任 AI 评估。OPT-175B 大致与 Reddit 2.7B 模型相当,但在不安全设置下表现更差。

and Xu et al. (2020), we find that the models finetuned on curated dialogue datasets (BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of OPT-175B for dialogue should contain explicit fine-tuning on curated datasets in order to improve the safety profile.

和 Xu 等人 (2020) 的研究发现,基于精选对话数据集微调的模型 (BlenderBot 1, R2C2) 整体毒性较低。我们得出结论,未来 OPT-175B 在对话方面的实验应包含在精选数据集上的显式微调,以改善其安全性。

5 Limitations

5 局限性

In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying scales. We saw parity in performance for standard evaluation datasets used in the GPT-3 models. Moreover, we performed safety, bias, and inclusion evaluations, again seeing largely comparable performance with some variations in toxicity and hate speech detection. However, such evaluations may not fully characterize the complete limitations of these models. In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021).

在第 3.1 节和第 4 节中,我们对所有已发布的模型进行了广泛的评估。我们在 GPT-3 模型中使用的标准评估数据集上看到了性能的一致性。此外,我们还进行了安全性、偏差和包容性的评估,在这些方面也看到了大致相当的性能,但在毒性检测和仇恨言论检测方面存在一些差异。然而,这些评估可能无法完全描述这些模型的所有局限性。总体而言,我们定性观察到 OPT-175B 存在与其他大语言模型 (LLM) [Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021] 中所指出的相同局限性。

In particular, we found OPT-175B does not work well with declarative instructions or point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a dialogue beginning with such an instruction, rather than an execution of the instruction. Future work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate these limitations.

特别是,我们发现 OPT-175B 在处理声明性指令或直接疑问句时表现不佳。使用此类指令进行提示往往会生成以该指令为开头的对话模拟,而不是执行指令。未来在指令学习方面的研究,例如 InstructGPT (Ouyang et al., 2022),可能会缓解这些限制。

OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled. Future work may wish to incorporate more modern strategies for reducing repetition and improving diversity, such as unlikelihood training (Welleck et al., 2020) or best-first decoding (Meister et al., 2020).

OPT-175B 也倾向于重复,并且容易陷入循环。虽然采样可以降低重复行为的发生率 (Holtzman et al., 2020),但我们非正式地发现,当仅采样一次生成时,并不能完全消除这种现象。未来的工作可能希望结合更多现代策略来减少重复并提高多样性,例如不太可能训练 (Welleck et al., 2020) 或最佳优先解码 (Meister et al., 2020)。

Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022). This can be particularly harmful in applications where information accuracy is critical, such as healthcare and scientific discovery (Weidinger et al., 2021b). Recently, several efforts have reported that retrieval-augmented models can improve factual correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also benefit from retrieval-augmentation in future iterations.

类似于其他大语言模型 (LLM),OPT-175B 可能生成事实错误的陈述 (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022)。这在信息准确性至关重要的应用中(如医疗保健和科学发现)可能会造成特别严重的危害 (Weidinger et al., 2021b)。最近,有几项研究报道检索增强模型可以提高大语言模型的事实正确性 (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021)。我们认为 OPT-175B 在未来的迭代中也将受益于检索增强。

As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a great deal of work on mitigations for toxicity and biases (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022). Depending on downstream applications, future uses of OPT-175B may need to employ these or novel mitigation approaches, especially before any real world deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these mitigations in this first release.

如第 4 节所示,我们还发现 OPT-175B 有很高的倾向生成有害语言并强化有害的刻板印象,即使在提供相对无害的提示时 (Gehman et al., 2020),并且对抗性提示很容易找到 (Dinan et al., 2021)。已经有很多关于减轻有害内容和偏见的工作 (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022)。根据下游应用的不同,OPT-175B 的未来使用可能需要采用这些或新的减轻方法,尤其是在任何实际部署之前。鉴于我们的主要目标是复制 GPT-3,我们选择不在首次发布中应用这些减轻措施。

In summary, we still believe this technology is premature for commercial deployment. Despite including data sheets and model cards, we believe more scrutiny should be afforded to the training data with additional data characterization and selection criteria in order to use data responsibly. The current practice is to feed the model with as much data as possible and minimal selection within these datasets. Despite having comprehensive evaluations, we would ideally have more streamlined and consistent evaluation setups to ensure replicability and reproducibility of evaluation scenarios. Differences in prompting styles and number of shots for in-context learning could create variations that lead to different results. We hope that the public release of the OPT models will enable many more researchers to work on these important issues.

总之,我们仍然认为这项技术尚未成熟,不适合商业部署。尽管包含了数据表和模型卡,我们认为应对训练数据进行更多审查,并增加数据特征描述和选择标准,以负责任地使用数据。目前的做法是尽可能多地向模型输入数据,并在这些数据集中进行最少的选择。尽管有全面的评估,我们希望有更简化和一致的评估设置,以确保评估场景的可重复性。提示风格和上下文学习中的样本数量差异可能会导致结果的不同。我们希望 OPT 模型的公开发布能够使更多研究人员参与到这些重要问题的研究中。

6 Considerations for Release

6 发布考虑因素

Following the recommendations for individual researchers generated by the Partnership for AI,7 along with the governance guidance outlined by NIST,8 we are disclosing all of the details involved in training OPT-175B through our logbook,9 our code, and providing researchers access to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for OPT-175B. We aim to be fully accountable for the development lifecycle of OPT-175B, and only through increasing transparency around LLM development can we start understanding the limitations and risks of LLMs before broader deployment occurs.

遵循由 AI 合作伙伴关系为个体研究人员生成的建议 [7],以及 NIST 概述的治理指导方针 [8],我们将通过我们的日志 [9]、代码和向研究人员提供 OPT-175B 的模型权重,披露训练 OPT-175B 所涉及的所有细节,并提供一系列较小的基准模型来镜像 OPT-175B 的设置。我们旨在对 OPT-175B 的开发生命周期负全责,并且只有通过增加大语言模型 (LLM) 开发的透明度,我们才能开始理解大语言模型在更广泛部署前的局限性和风险。

By sharing a detailed account of our day-to-day training process, we disclose not only how much compute was used to train the current version of OPT-175B, but also the human overhead required when underlying infrastructure or the training process itself becomes unstable at scale. These details are generally omitted from previous publications, likely due to the inability to fully ablate changes made mid-flight (without drastically increasing the compute budget). We hope that by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in the future, and collectively increase the experimental robustness in developing models at this scale.

通过分享我们日常训练过程的详细情况,我们不仅披露了用于训练当前版本 OPT-175B 的计算资源用量,还揭示了当底层基础设施或训练过程本身在大规模下变得不稳定时所需的人力成本。这些细节通常在之前的出版物中被省略,可能是因为无法完全分析中途做出的更改(而不大幅增加计算预算)。我们希望通过揭示某些临时设计决策是如何做出的,可以在未来改进这些实践,并共同提高在这个规模上开发模型的实验稳健性。

Outside of these notes, the metaseq codebase itself is the final source of truth in many of our implementation details. By releasing our development codebase, we aim to shed light on any implementation detail that may have been omitted from being explicitly enumerated in this paper, as it is either considered a detail of standard practice in the field, or is simply a detail we failed to account for. This current codebase is also the only known open-source implementation of training a decoder-only transformer that is $\ge 175\mathrm{B}$ parameters without the use of pipeline parallelism on NVIDIA GPUs.

除了这些笔记外,metaseq 代码库本身是我们许多实现细节的最终来源。通过发布我们的开发代码库,我们旨在阐明任何可能未在本文中明确列举的实现细节,因为这些细节要么被认为是领域内的标准实践,要么是我们未能考虑到的细节。这个当前的代码库也是唯一已知的开源实现,在 NVIDIA GPU 上训练参数量 $\ge 175\mathrm{B}$ 的仅解码器 Transformer 模型而不使用管道并行性。

To enable experimentation at 175B scale, we are providing researchers with direct access to the parameters of OPT-175B. The reasoning here is twofold: enable Responsible AI research into LLMs while simultaneously reducing the environmental impact of pursuing research at this scale. There is a growing body of work detailing ethical and social risks from deploying language models with emergent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, before broader commercial deployment occurs.

为了在 175B 规模上进行实验,我们为研究人员提供了直接访问 OPT-175B 参数的权限。这里的理由有两个:一方面使能大语言模型 (LLM) 的负责任 AI 研究,另一方面同时减少在这一规模上进行研究的环境影响。越来越多的研究详细描述了大规模部署具有新兴能力的语言模型所带来的伦理和社会风险 (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021)。通过将 OPT-175B 的访问权限限制在非商业许可的研究社区内,我们旨在首先集中开发工作于量化大语言模型的局限性,然后再进行更广泛的商业部署。

Furthermore, there exists significant compute and carbon cost to reproduce models of this size. While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons,10 GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae et al., 2021). These estimates are not universally reported, and the accounting methodologies for these calculations are also not standardized. In addition, model training is only one component of the overall carbon footprint of AI systems; we must also consider experimentation and eventual downstream inference cost, all of which contribute to the growing energy footprint of creating large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap between a theoretical carbon cost estimate that assumes no hardware failures or training instabilities, versus one that aims to include the entire LLM development lifecycle. We need to understand the manufacturing (or embodied) carbon of these systems (Gupta et al., 2021) as they grow increasingly more complex, and we hope that our paper can help future work in defining additional factors to consider when measuring the impact of scale on the environment.

此外,重现这种规模的模型存在显著的计算和碳成本。虽然 OPT-175B 的开发估计碳排放量 (CO2eq) 为 75 吨,GPT-3 的估计使用量为 500 吨 (Patterson 等, 2021),而 Gopher 需要 380 吨 (Rae 等, 2021)。这些估算并不是普遍报告的,并且这些计算的会计方法也没有标准化。此外,模型训练只是 AI 系统整体碳足迹的一个组成部分;我们还必须考虑实验和最终的下游推理成本,所有这些都对创建大规模模型的不断增长的能源足迹做出了贡献 (Wu 等, 2022)。通过发布我们的日志,我们希望突出理论碳成本估算(假设没有硬件故障或训练不稳定性)与旨在包括整个大语言模型开发生命周期的估算之间的差距。我们需要理解这些系统日益复杂的制造(或体现)碳排放 (Gupta 等, 2021),并希望我们的论文能够帮助未来的工作定义在衡量规模对环境的影响时需要考虑的其他因素。
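The relative size of the quoted estimates can be checked with a few lines of arithmetic. The sketch below only restates the three figures given in the text; as the paragraph itself cautions, the underlying accounting methods are not standardized, so the ratios are indicative rather than exact.

```python
# Training-emission estimates in tons CO2eq, as quoted in the text above.
estimates_tons = {"OPT-175B": 75, "GPT-3": 500, "Gopher": 380}

opt = estimates_tons["OPT-175B"]
for name, tons in estimates_tons.items():
    if name != "OPT-175B":
        print(f"{name}: {tons / opt:.1f}x the OPT-175B estimate")
# GPT-3: 6.7x the OPT-175B estimate
# Gopher: 5.1x the OPT-175B estimate
```

The GPT-3 ratio of roughly 6.7 is where the abstract's "1/7th the carbon footprint" figure comes from.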

Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the broader research community to study the impact and limitations of these models with respect to scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained as a function of the amount of training data used, which implies that incorporating more data and continuing to train these baseline models may continue to improve performance. There is also evidence that step-function changes in capabilities may occur at a scale that is much smaller than 175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different research applications.

通过生成一系列不同规模的基线模型,我们希望使更广泛的研究社区能够研究这些模型在单一规模上的影响和局限性。正如 Hoffmann 等人在 (2022) 中报告的那样,许多这些大语言模型 (LLM) 可能由于使用的训练数据量而被低估训练,这意味着加入更多数据并继续训练这些基线模型可能会继续提高性能。还有证据表明,在远小于 175B 的规模上,能力可能会发生阶跃变化 (Wei et al., 2021),这表明需要检查更广泛的规模范围以适应不同的研究应用。

7 Related Work

7 相关工作

Since the publication of the Transformer architecture (Vaswani et al., 2017) and BERT (Devlin et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with self-supervised pre-training. Multiple masked language models, including T5 (Raffel et al., 2020) and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These scaling gains come not only from growing the total number of parameters in the models, but also the amount and quality of pre-training data (Liu et al., 2019b; Hoffmann et al., 2022).

自从 Transformer 架构 (Vaswani et al., 2017) 和 BERT (Devlin et al., 2019) 发布以来,自然语言处理 (NLP) 领域经历了向使用具有自监督预训练的大语言模型 (LLM) 的巨大转变。多个掩码语言模型,包括 T5 (Raffel et al., 2020) 和 MegatronLM (Shoeybi et al., 2019),通过规模展示了持续的改进。这些扩展收益不仅来自增加模型中的总参数数量,还来自预训练数据的数量和质量 (Liu et al., 2019b; Hoffmann et al., 2022)。

Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022; Chowdhery et al., 2022). The resulting massive improvement in generative fluency and quality was first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al., 2020) and later models. Although a variety of very large (over 100B parameters) generative models have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via paid API services. There are a few notable efforts towards open sourcing LLMs from non-profit research organizations including EleutherAI (Black et al., 2022) and BigScience.11 These models differ from the OPT models in pre-training data, target languages and model scale, making it possible for the community to compare different pre-training strategies.

自回归语言模型 (Mikolov et al., 2009) 在模型规模上经历了最大的增长,从 1.17 亿参数 (Radford et al., 2018) 增加到超过 5000 亿参数 (Smith et al., 2022; Chowdhery et al., 2022)。由此带来的生成流畅性和质量的巨大提升首先在 GPT-2 (Radford et al., 2019) 中得到描述,并通过 GPT-3 (Brown et al., 2020) 和后续模型进一步改进。尽管现在已经训练了多种非常大的(超过 1000 亿参数)生成式模型 (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022),但它们都是闭源的,仅限内部使用或通过付费 API 服务访问。一些非营利研究组织,如 EleutherAI (Black et al., 2022) 和 BigScience,做出了值得注意的努力来开源大语言模型。这些模型在预训练数据、目标语言和模型规模上与 OPT 模型不同,使得社区可以比较不同的预训练策略。

Since Brown et al. (2020), the primary evaluation criterion for LLMs has been prompt-based (Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this paper. This is largely due to the convenience of evaluating on many tasks without specialized task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several decades (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on eliciting prompting behavior in smaller models (Schick and Schütze, 2020;

自 Brown 等 (2020) 以来,大语言模型 (LLM) 的主要评估标准一直是基于提示的 (Black 等, 2022; Rae 等, 2021; Chowdhery 等, 2022),本文也采用了相同的评估方法。这主要是因为可以在不进行专门任务特定微调的情况下方便地评估多个任务。提示本身有着悠久的历史:完形填空评估可以追溯到几十年前 (Chambers 和 Jurafsky, 2008; Mostafazadeh 等, 2016)。最近,提示或掩码填充已被用于探测模型的知识 (Petroni 等, 2019) 或执行各种自然语言处理任务 (Radford 等, 2019; Brown 等, 2020)。此外,也有研究致力于在较小的模型中引出提示行为 (Schick 和 Schütze, 2020;

Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al., 2022).

Gao et al., 2021b; Li 和 Liang, 2021; Lester et al., 2021; Scao 和 Rush, 2021),提高了提示的灵活性 (Shin et al., 2020),以及理解为什么和如何进行提示 (Liu et al., 2021; Min et al., 2022)。

Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However, effective prompt engineering remains an open research challenge. Results vary significantly and unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is challenging to write prompts without a development set, which leads to questions about the extent to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We do not attempt to address these concerns of prompting, and instead only aim to provide evaluation of OPT-175B in existing settings. However, we hope the full release of OPT-175B will enable others to better study these challenges in the future.

最近的研究表明,通过微调模型以直接响应指令式提示可以取得进展 (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022)。然而,有效的提示工程仍然是一个开放的研究挑战。结果随着提示的选择而显著且不可预测地变化 (Lu et al., 2021),并且模型似乎并没有像我们期望的那样完全理解提示 (Webson 和 Pavlick, 2021)。此外,在没有开发集的情况下编写提示具有挑战性,这引发了关于我们实际上在多大程度上实现了零样本或少样本学习的问题 (Perez et al., 2021)。我们不试图解决这些提示相关的问题,而是仅旨在评估 OPT-175B 在现有设置中的表现。然而,我们希望 OPT-175B 的完整发布能够使其他人未来更好地研究这些挑战。

8 Conclusion

8 结论

In this technical report, we introduced OPT, a collection of auto-regressive language models ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency. We described training details, evaluated performance in a number of NLP and dialogue settings, and characterized behaviors with respect to bias, toxicity and hate speech. We also described many other limitations the models have, and discussed a wide set of considerations for responsibly releasing the models. We believe the entire AI community would benefit from working together to develop guidelines for responsible LLMs, and we hope that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.

在本技术报告中,我们介绍了 OPT,一组参数规模从 125M 到 175B 不等的自回归语言模型。我们的目标是复制 GPT-3 类模型的性能和规模,同时应用最新的数据整理和训练效率最佳实践。我们描述了训练细节,在多个自然语言处理 (NLP) 和对话场景中评估了性能,并对模型在偏见、毒性言论和仇恨言论方面的行为进行了表征。我们还描述了模型的许多其他限制,并讨论了负责任地发布这些模型的一系列考虑因素。我们认为整个 AI 社区将受益于共同努力制定负责任的大语言模型 (LLM) 指导原则,我们希望广泛获取这些模型将增加定义此类技术伦理考虑的声音多样性。

Acknowledgements

致谢

We would like to thank Scott Jeschonek, Giri Anantharaman, Diego Sarina, Joaquin Colombo, Chris Bray, Stephen Roylance, Kalyan Saladi, Shubho Sengupta, and Brian O’Horo for helping to remove infrastructure blockers along the way; Percy Liang,

我们要感谢 Scott Jeschonek、Giri Anantharaman、Diego Sarina、Joaquin Colombo、Chris Bray、Stephen Roylance、Kalyan Saladi、Shubho Sengupta 和 Brian O’Horo 一路帮助移除基础设施障碍;Percy Liang,

Rishi Bommasani, and Emily Dinan for discussions on responsible release practices; Carole-Jean Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth Pasunuru, and Shruti Bhosale for previous contributions to evaluations; Benjamin Lefaudeux, Geeta Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement work; Emily Dinan, Carole-Jean Wu, Daniel McKinnon, and Mark Tygert for feedback on this draft; Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov, Melanie Kambadur, Zornitsa Kozareva, Ves Stoyanov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Jason Weston, and many others for supporting this project internally.

Rishi Bommasani 和 Emily Dinan 就负责任的发布实践进行了讨论;Carole-Jean Wu 就可持续性和碳足迹考虑进行了讨论;Srini Iyer、Ramakanth Pasunuru 和 Shruti Bhosale 对之前的评估工作做出了贡献;Benjamin Lefaudeux、Geeta Chauhan、Natalia Gimelshein、Horace He 和 Sam Gross 就性能改进工作进行了讨论;Emily Dinan、Carole-Jean Wu、Daniel McKinnon 和 Mark Tygert 对本草案提供了反馈;Antoine Bordes、Joelle Pineau、Mary Williamson、Necip Fazil Ayan、Armand Joulin、Sergey Edunov、Melanie Kambadur、Zornitsa Kozareva、Ves Stoyanov、Vitaliy Liptchinsky、Rahul Iyer、Jing Xu、Jason Weston 以及许多其他人为该项目提供了内部支持。

References

参考文献

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. 朝着类人开放域聊天机器人迈进。 arXiv preprint arXiv:2001.09977.

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, and Ves Stoyanov. 2021. Efficient large scale language modeling with mixtures of experts. CoRR, abs/2112.10684.

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantha raman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, 和 Ves Stoyanov. 2021. 使用专家混合进行高效的大规模语言建模. CoRR, abs/2112.10684.

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset. CoRR, abs/2001.08435.

杰森·鲍姆加特纳,萨瓦斯·扎内托,布莱恩·基根,梅根·斯凯尔,和杰里米·布莱克本。2020。Pushshift Reddit 数据集。CoRR, abs/2001.08435。

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, 和 Shmargaret Shmitchell. 2021. 论随机鹦鹉的危险性:语言模型能否过大? 在 Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 的第 610–623 页。

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.

Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, 和 Yejin Choi. 2020. Piqa: 自然语言中的物理常识推理 (Reasoning about physical commonsense in natural language). Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model.

Sid Black、Stella Biderman、Eric Hallahan、Quentin Anthony、Leo Gao、Laurence Golding、Horace He、Connor Leahy、Kyle McDonell、Jason Phang、Michael Pieler、USVSN Sai Prashanth、Shivanshu Purohit、Laria Reynolds、Jonathan Tow、Ben Wang 和 Samuel Weinbach。2022。Gpt-neox-20b:一个开源的自回归语言模型。

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online. Association for Computational Linguistics.

苏琳·布洛德格特, 吉尔辛娅·洛佩兹, 亚历山德拉·奥尔泰安, 罗伯特·西姆, 和 Hanna Wallach. 2021. 刻板印象的挪威三文鱼:公平性基准数据集中的陷阱清单. 在第 59 届计算语言学协会年会和第 11 届国际自然语言处理联合会议 (长论文卷) 的会议记录中,页面 1004–1015,在线. 计算语言学协会.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, et al. 2021. On the opportunities and risks of foundation models. CoRR, abs/2108.07258.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Nildri Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li FeiFei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, 等. 2021. 论基础模型的机遇与风险. CoRR, abs/2108.07258.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2021. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio. Association for Computational Linguistics.

Ke-Li Chiu and Rohan Alexander. 2021. Detecting hate speech with GPT-3. arXiv preprint arXiv:2103.12407.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. BOLD: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872.

Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser. 2021. Anticipating safety issues in e2e conversational ai: Framework and tooling. arXiv preprint arXiv:2107.03451.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020a. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online. Association for Computational Linguistics.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019a. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020b. The second conversational intelligence challenge (ConvAI2). In The NeurIPS ’18 Competition, pages 187–208, Cham. Springer International Publishing.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander