[Paper Translation] OPT: Open Pre-trained Transformer Language Models


Original paper: https://arxiv.org/pdf/2205.01068


OPT: Open Pre-trained Transformer Language Models


Susan Zhang,∗ Stephen Roller,∗ Naman Goyal,∗ Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott,† Sam Shleifer,† Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer


Meta AI


{susanz,roller,naman}@fb.com


Abstract


Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.


1 Introduction


Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowdhery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers’ ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.


In this technical report, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.


We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories. We are also releasing both the logbook of our model creation as well as our codebase, metaseq, which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs.


We believe the entire AI community — academic researchers, civil society, policymakers, and industry — must work together to develop clear guidelines around responsible AI in general and responsible LLMs in particular, given their centrality in many downstream language applications. A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the diversity of voices defining the ethical considerations of such technologies.


Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size ($d_{\mathrm{model}}$). We also report the peak Learning Rate (LR) and global batch size in number of tokens (Batch).


Model #L #H dmodel LR Batch
125M 12 12 768 6.0e-4 0.5M
350M 24 16 1024 3.0e-4 0.5M
1.3B 24 32 2048 2.0e-4 1M
2.7B 32 32 2560 1.6e-4 1M
6.7B 32 32 4096 1.2e-4 2M
13B 40 40 5120 1.0e-4 4M
30B 48 56 7168 1.0e-4 4M
66B 64 72 9216 0.8e-4 2M
175B 96 96 12288 1.2e-4 2M

2 Method


2.1 Models


We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to reduce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain increased computational efficiency.


2.2 Training Setup


For weight initialization, we follow the same settings provided in the Megatron-LM codebase, using a normal distribution with zero mean and standard deviation of 0.006. The standard deviation for output layers is scaled by a $1.0/\sqrt{2L}$ term where $L$ is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.
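As a concrete illustration, here is a minimal PyTorch sketch of this scheme (function and argument names are ours, not metaseq's):

```python
import math
import torch.nn as nn

def init_opt_weights(module, std=0.006):
    # Base case: weights ~ N(0, 0.006); all biases start at 0.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

def output_proj_std(num_layers, std=0.006):
    # Output layers use the base std scaled by 1/sqrt(2L),
    # where L is the total number of layers.
    return std / math.sqrt(2.0 * num_layers)
```

For the 175B configuration (L = 96 in Table 1), the scaled output-layer standard deviation works out to roughly 0.006/√192 ≈ 0.00043.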


We use an AdamW optimizer (Loshchilov and Hutter, 2017) with $(\beta_{1},\beta_{2})$ set to (0.9, 0.95), and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and are kept constant throughout the course of training.
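A self-contained sketch of this schedule follows (a reconstruction from the numbers above, not the metaseq code; with OPT-175B's 2M-token batches, the 300B-token decay horizon corresponds to roughly 150k steps):

```python
def opt_lr(step, max_lr, warmup_steps=2000, total_steps=150_000, min_ratio=0.1):
    """Linear warmup from 0 to max_lr, then linear decay to 10% of max_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return max_lr * (1.0 - (1.0 - min_ratio) * progress)

# e.g. for OPT-175B: opt_lr(2000, 1.2e-4) == 1.2e-4 (the peak LR),
# and opt_lr(150_000, 1.2e-4) == 1.2e-5 (10% of the peak).
```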


We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some midflight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also include a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of $N$ into two division operations by $\sqrt{N}$ ).
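The pre-divide trick can be sketched as follows (a schematic, with `all_reduce_sum` standing in for e.g. `torch.distributed.all_reduce`):

```python
import math

def reduce_mean_gradient(grad, world_size, all_reduce_sum):
    # Dividing by N once can underflow small FP16 gradients (or the
    # pre-division sum can overflow); two divisions by sqrt(N) keep
    # intermediates in a safer range while still averaging overall.
    grad = grad / math.sqrt(world_size)  # pre-divide on each rank
    grad = all_reduce_sum(grad)          # sum across all N ranks
    return grad / math.sqrt(world_size)  # post-divide completes the mean
```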


2.3 Pre-training Corpus


The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via Common Crawl.


We removed duplicated documents across all datasets by filtering out documents via MinhashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity $\geq$ .95. We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.
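A minimal sketch of this kind of filtering using the `datasketch` library (our choice of library and token-level shingling; the paper does not specify the implementation):

```python
from datasketch import MinHash, MinHashLSH

def deduplicate(docs, threshold=0.95, num_perm=128):
    """docs: dict of doc_id -> list of tokens. Keeps the first copy of
    each near-duplicate cluster (estimated Jaccard >= threshold)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, tokens in docs.items():
        m = MinHash(num_perm=num_perm)
        for tok in set(tokens):
            m.update(tok.encode("utf-8"))
        if lsh.query(m):        # an earlier near-duplicate was already kept
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept
```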


We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.
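For example, the same tokenization can be reproduced with the Hugging Face implementation of the GPT-2 tokenizer (one convenient option; the paper does not prescribe a library):

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
print(tok("Hello world!")["input_ids"])  # [15496, 995, 0]
```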


RoBERTa We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).


The Pile We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated as we found they increased the risk of instabilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were otherwise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.


PushShift.io Reddit We included a subset of the Pushshift.io corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.
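A sketch of the longest-chain extraction (the tree representation and names here are ours; Pushshift comments carry parent ids from which such a child map can be built):

```python
def longest_comment_chain(children, root):
    """children: dict mapping a comment id to the ids of its replies.
    Returns the longest root-to-leaf path of comments in the thread."""
    best = []
    for child in children.get(root, []):
        path = longest_comment_chain(children, child)
        if len(path) > len(best):
            best = path
    return [root] + best
```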


2.4 Training Efficiency


We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16. To avoid underflows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).
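Dynamic loss scaling, in sketch form (the constants below are common illustrative defaults, not the exact metaseq settings):

```python
class DynamicLossScaler:
    """Multiply the loss by `scale` before backward(); if the FP16
    gradients overflow, skip the step and halve the scale, otherwise
    grow it again after a run of overflow-free steps."""
    def __init__(self, init_scale=2.0**15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        if found_overflow:
            self.scale /= 2.0
            self._good_steps = 0
            return False  # caller skips optimizer.step()
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2.0
        return True
```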


2.5 Training Processes


Here we describe significant training process adjustments that arose during OPT-175B pre-training.


Hardware Failures We faced a significant number of hardware failures in our compute cluster while training OPT-175B. In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostics tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.


Loss Divergences Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restarting from an earlier checkpoint allowed for the job to recover and continue training. We noticed a correlation between loss divergence, our dynamic loss scalar crashing to 0, and the $l^{2}$-norm of the activations of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state $(\ge1.0)$, and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.
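As a rough illustration of that restart heuristic (entirely schematic; field names and thresholds are ours):

```python
def pick_restart_checkpoint(checkpoints):
    # checkpoints: list of dicts ordered oldest -> newest, each logging
    # the dynamic loss scale and the recent slope of the final-layer
    # activation l2-norm at save time.
    for ckpt in reversed(checkpoints):  # prefer the most recent
        if ckpt["loss_scale"] >= 1.0 and ckpt["act_norm_slope"] <= 0.0:
            return ckpt                 # "healthy" scale, norms not growing
    return checkpoints[0]               # otherwise fall back to the oldest
```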



Figure 1: Empirical LR schedule. We found that lowering learning rate was helpful for avoiding instabilities.




Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.


Other Mid-flight Changes We conducted a number of other experimental mid-flight changes to handle loss divergences. These included: switching to vanilla SGD (optimization plateaued quickly, and we reverted back to AdamW); resetting the dynamic loss scalar (this helped recover some but not all divergences); and switching to a newer version of Megatron (this reduced pressure on activation norms and improved throughput).


3 Evaluations


3.1 Prompting & Few-Shot


We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), Winograd (Levesque et al., 2011), WinoGrande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).


We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in evaluation metrics). For the Winograd Schema Challenge (WSC) task in the SuperGLUE benchmark, we follow (Brown et al., 2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).
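Multiple-choice tasks of this kind are typically scored by ranking each candidate completion's likelihood under the language model; here is a sketch with Hugging Face-style interfaces (an illustration of the standard recipe, not necessarily our exact harness):

```python
import torch
import torch.nn.functional as F

def choice_logprob(model, tok, prompt, choice):
    # Sum log p(choice tokens | prompt) under a causal LM.
    # Assumes tokenization does not merge across the prompt/choice boundary.
    ids = tok(prompt + choice, return_tensors="pt")["input_ids"]
    n_prompt = len(tok(prompt)["input_ids"])
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = ids[0, 1:]
    return sum(logprobs[i, targets[i]].item()
               for i in range(n_prompt - 1, targets.shape[0]))

def predict(model, tok, prompt, choices):
    return max(choices, key=lambda c: choice_logprob(model, tok, prompt, c))
```

Accuracy is then the fraction of examples for which the highest-likelihood candidate matches the gold answer.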


Zero-shot Overall average zero-shot performance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full breakdown, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately.


Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Challenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpredictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task. For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup, suggesting differences in the methods of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations.



Figure 3: Zero-shot NLP Evaluation Averages. Across a variety of tasks and model sizes, OPT largely matches the reported averages of GPT-3. However, performance varies greatly per task: see Appendix A.



Figure 4: Multi-shot performance. OPT performance for one- and few-shot lags behind GPT-3 models, but performance depends heavily per task; see Appendix A.



Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for number of parameters. We speculate the high performance of PaLM comes predominantly from higher quality and diversity of pre-training data.


One-shot and Few-shot Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed performances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see underperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).


3.2 Dialogue


Given that LLMs are known to be an integral component of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we follow Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Dinan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against existing open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022).


We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 competition (Dinan et al., 2020b). To control for different tokenization in each of the models, we normalize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1.
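One way to implement that normalization (our sketch, not necessarily the exact procedure): hold the model's total log-likelihood of the reference text fixed, but exponentiate per GPT-2 token rather than per native token, so that models with different vocabularies become comparable.

```python
import math

def gpt2_normalized_ppl(total_nll_nats, text, gpt2_tok):
    # total_nll_nats: the summed negative log-likelihood the model
    # assigns to `text` under its own tokenization.
    n_gpt2_tokens = len(gpt2_tok(text)["input_ids"])
    return math.exp(total_nll_nats / n_gpt2_tokens)
```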


Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially in the ConvAI2 dataset. On the Wizard-of-Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard-of-Wikipedia supervision.


We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as competitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We additionally evaluated OPT-175B on the ConvAI2 hidden test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like Multi-Session Chat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the Common Crawl snapshot used in the pre-training corpus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behavior also highlighted in LaMDA (Thoppilan et al., 2022).


4 Bias & Toxicity Evaluations


To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcomings in these benchmarks (Blodgett et al., 2021; Jacobs and Wallach, 2021), these measurements provide a first step towards understanding the limitations of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).


4.1 Hate Speech Detection


Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-, and few-shot binary cases, the model is presented with text and asked to consider whether the text is racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response.


Table 2: Dialogue Evaluations. OPT-175B, in a fully unsupervised setting, performs competitively against fully supervised models. The first five numeric columns report Perplexity (↓) and the last four report Unigram F1 (↑), on ConvAI2 (C2), Wizard of Wikipedia (WW), Empathetic Dialogues (ED), Blended Skill Talk (BST), and Wizard of Internet (WoI).

Model Eval C2 WW ED BST WoI C2 WW ED BST
Reddit 2.7B Unsupervised 18.9 21.0 11.6 17.4 18.0 .126 .133 .135 .133
BlenderBot 1 Supervised 10.2 12.5 9.0 11.9 14.7 .183 .189 .192 .178
R2C2 BlenderBot Supervised 10.5 12.4 9.1 11.7 14.6 .205 .198 .197 .186
OPT-175B Unsupervised 10.8 13.3 10.3 12.1 12.0 .185 .152 .149 .162

Table 3: Hate speech detection. F1 scores of detecting hate speech between Davinci and OPT-175B. OPT-175B considerably outperforms Davinci in all settings.


Setting Davinci OPT-175B
Zero-shot .628 .667
One-shot .616 .713
Few-shot (binary) .354 .759
Few-shot (multiclass) .672 .812

Results are presented in Table 3. With all of our one-shot through few-shot configurations, OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1) evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original 175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated social media discussions in the pre-training dataset has provided additional inductive bias to aid in such classification tasks.


4.2 CrowS-Pairs


Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced benchmark aiming to measure intrasentence level biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example consists of a pair of sentences representing a stereotype, or anti-stereotype, regarding a certain group, with the goal of measuring model preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model.



Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The OPT-175B model performs worse than Davinci in most categories.


Category GPT-3 OPT-175B
Gender 62.6 65.7
Religion 73.3 68.6
Race/Color 64.7 68.6
Sexual orientation 76.2 78.6
Age 64.4 67.8
Nationality 61.6 62.9
Disability 76.7 76.7
Physical appearance 74.6 76.2
Socioeconomic status 73.8 76.2
Overall 67.2 69.5


When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.


4.3 StereoSet


Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021) to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at the intersentence level to test a model’s ability to incorporate additional context. To account for a potential trade-off between bias detection and language modeling capability, StereoSet includes two metrics: Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test score (ICAT). Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count, which they report improves metrics for several models.
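For reference, Nadeem et al. (2021) combine the two scores as

$$\mathrm{ICAT} = \mathrm{LMS} \cdot \frac{\min(\mathrm{SS},\ 100 - \mathrm{SS})}{50},$$

so that an ideal model (LMS $=$ 100, SS $=$ 50) attains the maximum ICAT of 100, while a fully stereotyped or anti-stereotyped model (SS of 100 or 0) scores 0.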


Table 5: StereoSet Evaluations. Davinci and OPT-175B perform similarly across all evaluations.


Category Metric Davinci OPT-175B
Profession LMS (↑) 78.4 74.1
Profession SS (↓) 63.4 62.6
Profession ICAT (↑) 57.5 55.4
Gender LMS (↑) 75.6 74.0
Gender SS (↓) 66.5 63.6
Gender ICAT (↑) 50.6 53.8
Religion LMS (↑) 80.8 84.0
Religion SS (↓) 59.0 59.0
Religion ICAT (↑) 66.3 68.9
Race LMS (↑) 77.0 74.9
Race SS (↓) 57.4 56.8
Race ICAT (↑) 65.7 64.8
Overall LMS (↑) 77.6 74.8
Overall SS (↓) 60.8 59.9
Overall ICAT (↑) 60.8 60.0


Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the areas of profession and race, while OPT-175B outperforms in the areas of gender and religion. OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms on the LMS metric.


4.4 RealToxicityPrompts


We evaluate the tendency of OPT-175B to respond with toxic language via the RealToxicityPrompts (Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 generations of 20 tokens using nucleus sampling (Holtzman et al., 2020) ($p = 0.9$) for each of 10,000 randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations, stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed toxicity rates from Davinci and PaLM.
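Schematically, the stratified evaluation looks like this (a sketch only; `generate` and `toxicity_of` stand in for the model's sampler and a toxicity classifier, which are assumptions about the harness):

```python
import numpy as np

def bucketed_toxicity(prompts, generate, toxicity_of, n_gen=25, n_buckets=10):
    """Mean continuation toxicity, stratified by prompt-toxicity bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for prompt in prompts:
        scores = [toxicity_of(generate(prompt, max_new_tokens=20, top_p=0.9))
                  for _ in range(n_gen)]
        b = min(int(toxicity_of(prompt) * n_buckets), n_buckets - 1)
        buckets[b].append(np.mean(scores))
    return [float(np.mean(b)) if b else None for b in buckets]
```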


Results are shown in Figure 5. Overall, we see that OPT-175B has a higher toxicity rate than either PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating toxic continuations as the toxicity of the prompt increases, which is consistent with the observations of Chowdhery et al. (2022). As with our experiments in hate speech detection, we suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model familiarity with, and therefore propensity to generate and detect, toxic text. This strong awareness of toxic language may or may not be desirable depending on the specific requirements of downstream applications. Future applications of OPT-175B should consider this aspect of the model, and take additional mitigations, or avoid usage entirely as appropriate.



Figure 5: RealToxicityPrompts. OPT-175B is more likely to generate toxic responses than either Davinci or PaLM. Consistent with prior work, toxicity rates increase as prompt toxicity increases.


4.5 Dialogue Safety Evaluations


Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, SaferDialogues (Ung et al., 2021), measures the ability to recover from explicit safety failures, usually in the form of apologizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al., 2021), measures how unsafe a model’s response is, stratified across 4 levels of topic sensitivity: Safe, Realistic, Unsafe, and Adversarial. As with the other dialogue evaluations (Section 3.2), we compare to several existing open source dialogue models.


Results for both experiments are shown in Table 6. We observe that OPT-175B has similar performance as the Reddit 2.7B model across both SaferDialogues and the Unit Tests, with OPT-175B performing marginally better in the Safe and Adversarial settings. Consistent with Roller et al. (2021) and Xu et al. (2020), we find that the models fine-tuned on curated dialogue datasets (BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of OPT-175B for dialogue should contain explicit fine-tuning on curated datasets in order to improve the safety profile.


Table 6: Dialogue Responsible AI evaluations. OPT-175B is roughly on par with the Reddit 2.7B model, but performs worse in the Unsafe setting.

Model PPL F1 Sa Re Un Ad
Reddit 2.7B 16.2 .140 .300 .261 .450 .439
BlenderBot 1 12.4 .161 .028 .150 .250 .194
R2C2 BlenderBot 13.8 .160 .022 .133 .289 .222
OPT-175B 14.7 .141 .033 .261 .567 .283



5 Limitations


In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying scales. We saw parity in performance for standard evaluation datasets used in the GPT-3 models. Moreover, we performed safety, bias, and inclusion evaluations, again seeing largely comparable performance with some variations in toxicity and hate speech detection. However, such evaluations may not fully characterize the complete limitations of these models. In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021).


In particular, we found OPT-175B does not work well with declarative instructions or point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a dialogue beginning with such an instruction, rather than an execution of the instruction. Future work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate these limitations.


OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled. Future work may wish to incorporate more modern strategies for reducing repetition and improving diversity, such as unlikelihood training (Welleck et al., 2020) or best-first decoding (Meister et al., 2020).


Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022). This can be particularly harmful in applications where information accuracy is critical, such as healthcare and scientific discovery (Weidinger et al., 2021b). Recently, several efforts have reported that retrieval-augmented models can improve factual correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also benefit from retrieval-augmentation in future iterations.


As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a great deal of work on mitigations for toxicity and biases (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022). Depending on downstream applications, future uses of OPT-175B may need to employ these or novel mitigation approaches, especially before any real world deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these mitigations in this first release.


In summary, we still believe this technology is premature for commercial deployment. Despite including data sheets and model cards, we believe more scrutiny should be afforded to the training data with additional data characterization and selection criteria in order to use data responsibly. The current practice is to feed the model with as much data as possible and minimal selection within these datasets. Despite having comprehensive evaluations, we would ideally have more streamlined and consistent evaluation setups to ensure replicability and reproducibility of evaluation scenarios. Differences in prompting styles and number of shots for in-context learning could create variations that lead to different results. We hope that the public release of the OPT models will enable many more researchers to work on these important issues.


6 Considerations for Release


Following the recommendations for individual researchers generated by the Partnership for AI, along with the governance guidance outlined by NIST, we are disclosing all of the details involved in training OPT-175B through our logbook, our code, and providing researchers access to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for OPT-175B. We aim to be fully accountable for the development lifecycle of OPT-175B, and only through increasing transparency around LLM development can we start understanding the limitations and risks of LLMs before broader deployment occurs.


By sharing a detailed account of our day-to-day training process, we disclose not only how much compute was used to train the current version of OPT-175B, but also the human overhead required when underlying infrastructure or the training process itself becomes unstable at scale. These details are generally omitted from previous publications, likely due to the inability to fully ablate changes made mid-flight (without drastically increasing the compute budget). We hope that by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in the future, and collectively increase the experimental robustness in developing models at this scale.


Outside of these notes, the metaseq codebase itself is the final source of truth in many of our implementation details. By releasing our development codebase, we aim to shed light on any implementation detail that may have been omitted from being explicitly enumerated in this paper, as it is either considered a detail of standard practice in the field, or is simply a detail we failed to account for. This current codebase is also the only known open-source implementation of training a decoder-only transformer that is $\geq$ 175B parameters without the use of pipeline parallelism on NVIDIA GPUs.


To enable experimentation at 175B scale, we are providing researchers with direct access to the parameters of OPT-175B. The reasoning here is twofold: enable Responsible AI research into LLMs while simultaneously reducing the environmental impact of pursuing research at this scale. There is a growing body of work detailing ethical and social risks from deploying language models with emergent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, before broader commercial deployment occurs.


Furthermore, there exists significant compute and carbon cost to reproduce models of this size. While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons, GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae et al., 2021). These estimates are not universally reported, and the accounting methodologies for these calculations are also not standardized. In addition, model training is only one component of the overall carbon footprint of AI systems; we must also consider experimentation and eventual downstream inference cost, all of which contribute to the growing energy footprint of creating large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap between a theoretical carbon cost estimate that assumes no hardware failures or training instabilities, versus one that aims to include the entire LLM development lifecycle. We need to understand the manufacturing (or embodied) carbon of these systems (Gupta et al., 2021) as they grow increasingly more complex, and we hope that our paper can help future work in defining additional factors to consider when measuring the impact of scale on the environment.


Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the broader research community to study the impact and limitations of these models with respect to scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained as a function of the amount of training data used, which implies that incorporating more data and continuing to train these baseline models may continue to improve performance. There is also evidence that step-function changes in capabilities may occur at a scale that is much smaller than 175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different research applications.


7 Related Work


Since the publication of the Transformer architecture (Vaswani et al., 2017) and BERT (Devlin et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with self-supervised pre-training. Multiple masked language models, including T5 (Raffel et al., 2020) and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These scaling gains come not only from growing the total number of parameters in the models, but also the amount and quality of pre-training data (Liu et al., 2019b; Hoffmann et al., 2022).


Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022; Chowdhery et al., 2022). The resulting massive improvement in generative fluency and quality was first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al., 2020) and later models. Although a variety of very large (over 100B parameters) generative models have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via paid API services. There are a few notable efforts towards open sourcing LLMs from non-profit research organizations including EleutherAI (Black et al., 2022) and BigScience. These models differ from the OPT models in pre-training data, target languages and model scale, making it possible for the community to compare different pre-training strategies.


Since Brown et al. (2020), the primary evaluation criterion for LLMs has been prompt-based (Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this paper. This is largely due to the convenience of evaluating on many tasks without specialized task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several decades (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on eliciting prompting behavior in smaller models (Schick and Schütze, 2020; Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al., 2022).

自 Brown 等 (2020) 以来,大语言模型 (LLM) 的主要评估标准一直是基于提示的 (Black 等, 2022; Rae 等, 2021; Chowdhery 等, 2022),本文也采用了相同的评估方法。这主要是因为可以在不进行专门任务特定微调的情况下方便地评估多个任务。提示本身有着悠久的历史:完形填空评估可以追溯到几十年前 (Chambers 和 Jurafsky, 2008; Mostafazadeh 等, 2016)。最近,提示或掩码填充已被用于探测模型的知识 (Petroni 等, 2019) 或执行各种自然语言处理任务 (Radford 等, 2019; Brown 等, 2020)。此外,也有研究致力于在较小的模型中引出提示行为 (Schick 和 Schütze, 2020;

Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al., 2022).

高 et al., 2021b; 李和梁, 2021; Lester et al., 2021; Scao 和 Rush, 2021),提高了提示的灵活性 (Shin et al., 2020),以及理解为什么和如何进行提示 (Liu et al., 2021; Min et al., 2022)。

Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However, effective prompt engineering remains an open research challenge. Results vary significantly and unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is challenging to write prompts without a development set, which leads to questions about the extent to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We do not attempt to address these concerns of prompting, and instead only aim to provide evaluation of OPT-175B in existing settings. However, we hope the full release of OPT-175B will enable others to better study these challenges in the future.


8 Conclusion


In this technical report, we introduced OPT, a collection of auto-regressive language models ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency. We described training details, evaluated performance in a number of NLP and dialogue settings, and characterized behaviors with respect to bias, toxicity and hate speech. We also described many other limitations the models have, and discussed a wide set of considerations for responsibly releasing the models. We believe the entire AI community would benefit from working together to develop guidelines for responsible LLMs, and we hope that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.


Acknowledgements


We would like to thank Scott Jeschonek, Giri Anantharaman, Diego Sarina, Joaquin Colombo, Chris Bray, Stephen Roylance, Kalyan Saladi, Shubho Sengupta, and Brian O’Horo for helping to remove infrastructure blockers along the way; Percy Liang, Rishi Bommasani, and Emily Dinan for discussions on responsible release practices; Carole-Jean Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth Pasunuru, and Shruti Bhosale for previous contributions to evaluations; Benjamin Lefaudeux, Geeta Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement work; Emily Dinan, Carole-Jean Wu, Daniel McKinnon, and Mark Tygert for feedback on this draft; Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov, Melanie Kambadur, Zornitsa Kozareva, Ves Stoyanov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Jason Weston, and many others for supporting this project internally.

我们要感谢 Scott Jeschonek、Giri Anantharaman、Diego Sarina、Joaquin Colombo、Chris Bray、Stephen Roylance、Kalyan Saladi、Shubho Sengupta 和 Brian O’Horo 一路帮助移除基础设施障碍;Percy Liang,

Rishi Bommasani, and Emily Dinan for discussions on responsible release practices; Carole-Jean Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth Pasunuru, and Shruti Bhosale for previous contributions to evaluations; Benjamin Lefaudeux, Geeta Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement work; Emily Dinan, Carole-Jean Wu, Daniel McKinnon, and Mark Tygert for feedback on this draft; Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov, Melanie Kambadur, Zornitsa Kozareva, Ves Stoyanov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Jason Weston, and many others for supporting this project internally.

Rishi Bommasani 和 Emily Dinan 就负责任的发布实践进行了讨论;Carole-Jean Wu 就可持续性和碳足迹考虑进行了讨论;Srini Iyer、Ramakanth Pasunuru 和 Shruti Bhosale 对之前的评估工作做出了贡献;Benjamin Lefaudeux、Geeta Chauhan、Natalia Gimelshein、Horace He 和 Sam Gross 就性能改进工作进行了讨论;Emily Dinan、Carole-Jean Wu、Daniel McKinnon 和 Mark Tygert 对本草案提供了反馈;Antoine Bordes、Joelle Pineau、Mary Williamson、Necip Fazil Ayan、Armand Joulin、Sergey Edunov、Melanie Kambadur、Zornitsa Kozareva、Ves Stoyanov、Vitaliy Liptchinsky、Rahul Iyer、Jing Xu、Jason Weston 以及许多其他人为该项目提供了内部支持。

References

参考文献

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977.

Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. 2020. 朝着类人开放域聊天机器人迈进。 arXiv preprint arXiv:2001.09977.

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, and Ves Stoyanov. 2021. Efficient large scale language modeling with mixtures of experts. CoRR, abs/2112.10684.

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, 和 Ves Stoyanov. 2021. 使用专家混合进行高效的大规模语言建模. CoRR, abs/2112.10684.

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The pushshift reddit dataset. CoRR, abs/2001.08435.

杰森·鲍姆加特纳,萨瓦斯·扎内托,布莱恩·基根,梅根·斯凯尔,和杰里米·布莱克本。2020。Pushshift Reddit 数据集。CoRR, abs/2001.08435。

Emily M Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623.

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, 和 Shmargaret Shmitchell. 2021. 论随机鹦鹉的危险性:语言模型能否过大? 在 Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 的第 610–623 页。

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, 和 Yejin Choi. 2020. PIQA: 自然语言中的物理常识推理 (Reasoning about physical commonsense in natural language). Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model.

Sid Black、Stella Biderman、Eric Hallahan、Quentin Anthony、Leo Gao、Laurence Golding、Horace He、Connor Leahy、Kyle McDonell、Jason Phang、Michael Pieler、USVSN Sai Prashanth、Shivanshu Purohit、Laria Reynolds、Jonathan Tow、Ben Wang 和 Samuel Weinbach。2022。GPT-NeoX-20B:一个开源的自回归语言模型。

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online. Association for Computational Linguistics.

苏琳·布洛德格特, 吉尔辛娅·洛佩兹, 亚历山德拉·奥尔泰安, 罗伯特·西姆, 和 Hanna Wallach. 2021. 刻板印象的挪威三文鱼:公平性基准数据集中的陷阱清单. 在第 59 届计算语言学协会年会和第 11 届国际自然语言处理联合会议 (长论文卷) 的会议记录中,页面 1004–1015,在线. 计算语言学协会.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, and et al. 2021. On the opportunities and risks of foundation models. CoRR, abs/2108.07258.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, 等. 2021. 论基础模型的机遇与风险. CoRR, abs/2108.07258.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2021. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.

塞巴斯蒂安·博尔加德, 阿瑟·曼施, 乔丹·霍夫曼, 特雷弗·蔡, 艾丽莎·拉瑟福德, 凯蒂·米利肯, 乔治·范登德里斯切, 让-巴蒂斯特·莱斯皮奥, 博格丹·达莫克, 艾丹·克拉克, 等. 2021. 通过从万亿个 Token 中检索来改进大语言模型. arXiv preprint arXiv:2112.04426.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neel a kant an, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel HerbertVoss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

汤姆·布朗,本杰明·曼,尼克·赖德,梅兰妮·苏比亚,贾里德·D·卡普兰,普拉富拉·达里瓦尔,阿文丁·尼尔坎坦,普拉纳夫·希亚姆,吉里什·萨斯特里,阿曼达·阿斯科尔,桑迪尼·阿加沃尔,阿里埃尔·赫伯特沃斯,格雷琴·克鲁格,汤姆·亨尼根,雷温·蔡尔德,阿迪蒂亚·拉梅什,丹尼尔·齐格勒,杰弗瑞·吴,克莱门斯·温特,克里斯·赫塞,马克·陈,埃里克·西格勒,马特乌什·利特温,斯科特·格雷,本杰明·切斯,杰克·克拉克,克里斯托弗·伯纳,山姆·麦肯德里什,阿莱克·拉德福德,伊利亚·苏茨克维尔,和 达里奥·阿莫代伊。2020。大语言模型是少样本学习者。在《神经信息处理系统进展》第 33 卷,第 1877–1901 页。Curran Associates, Inc.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio. Association for Computational Linguistics.

Nathanael Chambers 和 Dan Jurafsky. 2008. 叙事事件链的无监督学习. 在 Proceedings of ACL-08: HLT 的页面 789–797,Columbus, Ohio. 计算语言学协会.

Ke-Li Chiu and Rohan Alexander. 2021. Detecting hate speech with gpt-3. arXiv preprint arXiv:2103.12407.

邱克利 和 Rohan Alexander. 2021. 使用 GPT-3 检测仇恨言论. arXiv preprint arXiv:2103.12407.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi,

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi,

Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways.

Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, 和 Noah Fiedel. 2022. PaLM: 通过路径扩展语言建模。

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457.

彼得·克拉克、艾萨克·考伊、奥伦·埃齐奥尼、图沙尔·科特、阿希什·萨布哈瓦尔、卡丽莎·肖尼克和奥伊文德·塔夫乔德。2018。认为自己已经解决了问答问题?试试 ARC,AI2 推理挑战。CoRR, abs/1803.05457。

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

Sumanth Dathathri,Andrea Madotto,Janice Lan,Jane Hung,Eric Frank,Piero Molino,Jason Yosinski 和 Rosanne Liu。2019。即插即用的语言模型:一种简单的受控文本生成方法。arXiv 预印本 arXiv:1912.02164。

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, 和 Kristina Toutanova. 2019. BERT: 用于语言理解的深度双向 Transformer (deep bidirectional transformers) 预训练. 在 North American Association for Computational Linguistics (NAACL).

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862–872.

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, 和 Rahul Gupta. 2021. BOLD: 用于衡量开放式语言生成中偏差的数据集和度量. 在 Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 中,页面 862–872.

Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser. 2021. Anticipating safety issues in e2e conversational ai: Framework and tooling. arXiv preprint arXiv:2107.03451.

Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, 和 Verena Rieser. 2021. 预测端到端对话式 AI (Conversational AI) 的安全问题:框架和工具. arXiv preprint arXiv:2107.03451.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020a. Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8173–8188, Online. Association for Computational Linguistics.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, 和 Jason Weston. 2020a. 女王也拥有强大力量:减少对话生成中的性别偏见. 在 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),页面 8173–8188,在线. Association for Computational Linguistics.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019a. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, 和 Jason Weston. 2019a. 对话安全的构建、破坏和修复:来自对抗性人类攻击的鲁棒性. arXiv preprint arXiv:1908.06083.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020b. The second conversational intelligence challenge (ConvAI2). In The NeurIPS ’18 Competition, pages 187– 208, Cham. Springer International Publishing.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, 和 Jason Weston. 2020b. 第二届对话智能挑战赛 (ConvAI2)。在 The NeurIPS ’18 Competition,pages 187–208, Cham. Springer International Publishing.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, 和 Jason Weston. 2019b. Wizard of Wikipedia: 知识驱动的对话型 AI 智能体 (Knowledge-powered conversational agents). 在 国际学习表征会议 (International Conference on Learning Representations) 论文集 中。

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021a. The Pile: An 800GB dataset of diverse text for language modeling. CoRR, abs/2101.00027.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser 和 Connor Leahy。2021a。The Pile:一个用于语言建模的 800GB 多样化文本数据集。CoRR, abs/2101.00027。

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021b. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguis- tics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 3816–3830. Association for Computational Linguistics.

高天宇、Adam Fisch 和 Danqi Chen. 2021b. 让预训练语言模型成为更好的少样本学习者。在第 59 届计算语言学协会年会和第 11 届国际自然语言处理联合会议 (ACL/IJCNLP 2021) 的论文集 (第 1 卷:长论文),虚拟活动,2021 年 8 月 1-6 日,页码 3816–3830。计算语言学协会。

Timnit Gebru, Jamie Morgenstern, Briana Vec- chione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM, 64(12):86–92.

蒂姆尼特·格布鲁, 杰米·摩根斯特恩, 布里安娜·韦克奇奥内, 詹妮弗·沃特曼·沃恩, 汉娜·瓦拉赫, Hal Daumé III, 和 凯特·克劳福德. 2021. 数据集数据表. Commun. ACM, 64(12):86–92.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, 和 Noah A. Smith. 2020. 实际毒性提示:评估语言模型中的神经毒性退化. 在 Findings of the Association for Computational Linguistics: EMNLP 2020, 页面 3356–3369, 线上. 计算语言学协会.

Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S Lee, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. 2021. Chasing carbon: The elusive environmental footprint of computing. IEEE International Symposium on High-Performance Computer Architecture (HPCA 2021).

Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S. Lee, Gu-Yeon Wei, David Brooks, 和 Carole-Jean Wu。2021。追逐碳:计算难以捉摸的环境足迹。IEEE International Symposium on High-Performance Computer Architecture (HPCA 2021)。

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770– 778.

何凯明,张祥雨,任少卿,和孙剑。2016。深度残差学习用于图像识别。在 IEEE 计算机视觉和模式识别会议论文集,页面 770–778。

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan

乔丹·霍夫曼、塞巴斯蒂安·博尔热奥、亚瑟·门什、埃琳娜·布赫斯卡娅、特雷弗·蔡、埃莉扎·拉瑟福德、迭戈·德拉斯卡萨斯、丽莎·安妮·亨德里克斯、约翰内斯·韦尔贝尔、艾丹·克拉克、汤姆·亨宁根、埃里克·诺兰、凯蒂·米利坎、乔治·范登·德里斯切、博格丹

Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models.

Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, 和 Laurent Sifre. 2022. 训练计算最优的大语言模型。

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. ArXiv, abs/1904.09751.

Ari Holtzman, Jan Buys, Maxwell Forbes, 和 Yejin Choi. 2020. 神经文本退化的奇特案例 (The curious case of neural text degeneration). ArXiv, abs/1904.09751.

Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 375–385, New York, NY, USA. Association for Computing Machinery.

阿比盖尔·Z·雅各布斯和汉娜·瓦拉奇. 2021. 测量与公平性. 在 Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 375–385, New York, NY, USA. Association for Computing Machinery.

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. 2021. Alignment of language agents. CoRR, abs/2103.14659.

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, 和 Geoffrey Irving. 2021. 语言智能体的对齐. CoRR, abs/2103.14659.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue generation. CoRR, abs/2107.07566.

Mojtaba Komeili, Kurt Shuster, 和 Jason Weston. 2021. 基于互联网增强的对话生成 (Internet-augmented dialogue generation)。CoRR, abs/2107.07566.

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2020. GeDi: Generative discriminator guided sequence generation. arXiv preprint arXiv:2009.06367.

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, 和 Nazneen Fatema Rajani. 2020. GeDi: 生成式判别器引导的序列生成 (Generative discriminator guided sequence generation). arXiv preprint arXiv:2009.06367.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691.

Brian Lester, Rami Al-Rfou 和 Noah Constant. 2021. 参数高效提示微调的规模力量. CoRR, abs/2104.08691.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Hector J Levesque, Ernest Davis, 和 Leora Morgenstern. 2011. 温格拉德模式挑战 (Winograd schema challenge). 在 AAAI 春季研讨会: 常识推理的逻辑形式化 (Logical Formalizations of Commonsense Reasoning),第 46 卷,第 47 页。

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, 等. 2020. 检索增强生成用于知识密集型自然语言处理任务. Advances in Neural Information Processing Systems, 33:9459–9474.

Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. pages 4582–4597.

Xiang Lisa Li 和 Percy Liang. 2021. 前缀调优:优化连续提示用于生成 (Prefix-Tuning: Optimizing Continuous Prompts for Generation). 页码 4582–4597.

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR.

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, 和 Ruslan Salakhutdinov. 2021. 迈向理解和减轻语言模型中的社会偏见 (Towards understanding and mitigating social biases in language models). 在国际机器学习会议 (International Conference on Machine Learning) 上,页码 6565–6576. PMLR.

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs.

Opher Lieber, Or Sharir, Barak Lenz, 和 Yoav Shoham. 2021. Jurassic-1: 技术细节和评估. 技术报告, AI21 Labs.

Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2019a. Does gender matter? towards fairness in dialogue systems. arXiv preprint arXiv:1910.10486.

刘浩晨, Jamell Dacon, 樊文琪, 刘晖, 刘子涛, 和 唐继亮. 2019a. 性别重要吗?朝向对话系统中的公平性. arXiv预印本 arXiv:1910.10486.

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. ETHOS: an online hate speech detection dataset. CoRR, abs/2006.08328.

Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, 和 Grigorios Tsoumakas. 2020. ETHOS: 一个在线仇恨言论检测数据集. CoRR, abs/2006.08328.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, 和 James F. Allen. 2016. 一个语料库和评估框架用于更深入理解常识故事. CoRR, abs/1604.01696.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In Association for Computational Linguistics (ACL).

Moin Nadeem、Anna Bethke 和 Siva Reddy。2021。StereoSet:测量预训练语言模型中的刻板偏见。在计算语言学协会 (ACL) 会议上。

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted questionanswering with human feedback. arXiv preprint arXiv:2112.09332.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, 等. 2021. WebGPT: 浏览器辅助的问答系统与人类反馈. arXiv预印本 arXiv:2112.09332.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. Crows-pairs: A challenge dataset for measuring social biases in masked language models. arXiv preprint arXiv:2010.00133.

尼基塔·南吉亚, 克拉拉·瓦尼亚, 拉西卡·巴莱拉奥, 和 萨缪尔·R·鲍曼. 2020. Crows-pairs: 一个用于测量掩码语言模型中社会偏见的挑战数据集. arXiv预印本 arXiv:2010.00133.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A conversational paradigm for program synthesis. arXiv preprint.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, 和 Caiming Xiong. 2022. 一种用于程序合成的对话范式. arXiv 预印本.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray 等. 2022. 训练语言模型以遵循人类反馈的指令. arXiv preprint arXiv:2203.02155.

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350.

David Patterson、Joseph Gonzalez、Quoc Le、Chen Liang、Lluis-Miquel Munguia、Daniel Rothchild、David So、Maud Texier 和 Jeff Dean. 2021. 碳排放与大神经网络训练. arXiv 预印本 arXiv:2104.10350.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Advances in Neural Information Processing Systems, 34.

Ethan Perez, Douwe Kiela, 和 Kyunghyun Cho. 2021. 使用大语言模型实现真正的少样本学习 (True few-shot learning with language models). Advances in Neural Information Processing Systems, 34.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

法比奥·佩特罗尼, 蒂姆·洛克特舍尔, 塞巴斯蒂安·里德尔, 帕特里克·刘易斯, 安东·巴赫廷, 吴宇香, 和 亚历山大·米勒. 2019. 大语言模型作为知识库?在《2019年实证方法在自然语言处理和第9届国际联合自然语言处理会议 (EMNLP-IJCNLP) 上的论文集》,页面 2463–2473,中国香港。计算语言学协会。

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Karthik Narasimhan, Tim Salimans, 和 Ilya Sutskever. 2018. 通过无监督学习改进语言理解. 技术报告, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, 和 Ilya Sutskever. 2019. 大语言模型是无监督的多任务学习者. 技术报告, OpenAI.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language models: Methods, analysis & insights from training Gopher. CoRR, abs/2112.11446.

杰克·W·雷、塞巴斯蒂安·博尔热奥、特雷弗·蔡、凯蒂·米利肯、乔丹·霍夫曼、H·弗朗西斯·宋、约翰·阿斯兰尼迪斯、萨拉·亨德森、罗曼·林格、苏珊娜·杨、埃莉扎·鲁瑟福德、汤姆·亨宁根、雅各布·梅尼克、阿尔宾·卡西雷尔、理查德·鲍威尔、乔治·范登·德里斯切、丽莎·安妮·亨德里克斯、玛丽-贝丝·劳、波森·黄、阿米莉亚·格莱斯、约翰尼斯·韦尔布、萨曼斯·达塔斯里、萨夫龙·黄、乔纳森·乌萨托、约翰·梅洛、伊琳娜·希金斯、安东尼亚·克雷斯韦尔、内特·麦卡利什、艾米·吴、埃里克·埃尔森、西德汉特·M·贾亚库马尔、叶莲娜·布赫斯卡娅、大卫·巴登、埃斯梅·苏特兰、卡伦·西莫尼扬、米切拉·帕加尼尼、洛朗·西弗、莱娜·马滕斯、李香·洛雷恩、阿德吉纳·坤佐罗、艾达·内玛茨德、叶莲娜·格里博夫斯卡娅、多梅尼科·多纳托、安吉利基·拉扎里杜、亚瑟·门什、让-巴蒂斯特·莱斯皮奥、玛丽亚·蒂姆波乌基利、尼古拉·格里戈列夫、道格·弗里茨、蒂博·索蒂奥、曼塔斯·帕亚尔斯卡斯、托比·波伦、龚志涛、丹尼尔·托亚玛、赛普里安·德马松达图姆、刘玉佳、泰夫恩·特尔齐、弗拉基米尔·米库利克、伊戈尔·巴巴申金、艾丹·克拉克、迭戈·德拉卡萨斯、奥雷利亚·盖伊、克里斯·琼斯、詹姆斯·布拉德伯里、马修·约翰逊、布莱克·A·赫克特曼、劳拉·魏丁格、伊桑·加布里埃尔、威廉·S·艾萨克、爱德华·洛克哈特、西蒙·奥斯因德罗、劳拉·里梅尔、克里斯·戴尔、奥里奥尔·文亚尔斯、卡里姆·艾尤布、杰夫·斯坦威、洛雷恩·贝内特、德米斯·哈萨比斯、科雷伊·卡武奇奥格鲁和杰弗里·欧文。2021. 扩展语言模型:方法、分析及训练 Gopher 的见解。CoRR, abs/2112.11446。

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research (JMLR), 21:1–67.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, 和 Peter J Liu. 2020. 探索迁移学习的极限与统一的文本到文本 Transformer. The Journal of Machine Learning Research (JMLR), 21:1–67.

Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of massive datasets. Cambridge University Press.

Anand Rajaraman 和 Jeffrey David Ullman. 2011. 大规模数据集挖掘 (Mining of massive datasets). Cambridge University Press.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic opendomain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguis- tics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.

Hannah Rashkin, Eric Michael Smith, Margaret Li, 和 Y-Lan Boureau. 2019. 朝向富有同情心的开放领域对话模型:一个新的基准和数据集. 在 Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 的第 5370–5381 页,佛罗伦萨,意大利. Association for Computational Linguistics.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.

斯蒂芬·罗勒,艾米丽·迪南,纳曼·戈亚尔,朱达,玛丽·威廉姆森,刘殷涵,徐静,迈尔斯·奥特,埃里克·迈克尔·史密斯,Y-Lan Boureau,和杰森·韦斯顿。2021。构建开放域聊天机器人的方法。在第 16 届欧洲语言协会计算语言学分会会议:主卷的论文集上,第 300–325 页,在线。计算语言学协会。

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Winogrande: An adversarial winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8732– 8740. AAAI Press.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, 和 Yejin Choi. 2020. Winogrande: 大规模的对抗性 Winograd 模式挑战. 在 The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8732–8740. AAAI Press.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, 和 Alexander M. Rush。2021。多任务提示训练实现零样本任务泛化。

Teven Le Scao and Alexander M. Rush. 2021. How many data points is a prompt worth? pages 2627– 2636.

Teven Le Scao 和 Alexander M. Rush. 2021. 一个提示值多少个数据点?第 2627–2636 页.

Timo Schick and Hinrich Schütze. 2020. It’s not just size that matters: Small language models are also few-shot learners. CoRR, abs/2009.07118.

Timo Schick 和 Hinrich Schütze. 2020. 不仅仅是大小重要:小语言模型也是少样本学习者。CoRR, abs/2009.07118.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. Transactions of the Association for Computational Linguistics, 9:1408–1424.

Timo Schick, Sahana Udupa, 和 Hinrich Schütze. 2021. 自诊断和自去偏:一项减少 NLP 中基于语料库偏差的提案. 计算语言学协会会刊, 9:1408–1424.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715– 1725, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, 和 Alexandra Birch. 2016. 使用子词单元对稀有词进行神经机器翻译。在 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 的第 1715–1725 页,柏林,德国。Association for Computational Linguistics。

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, 和 Nanyun Peng. 2019. 这位女士曾做过保姆:关于语言生成中的偏见. arXiv preprint arXiv:1909.01326.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. pages 4222– 4235.

泰勒·申, 尤萨马·拉泽吉, 罗伯特·L·洛根四世, 埃里克·华莱士, 和 萨米尔·辛格. 2020. AutoPrompt: 使用自动生成的提示词从语言模型中提取知识. 页码 4222–4235.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, 和 Bryan Catanzaro. 2019. Megatron-lm: 使用模型并行训练多十亿参数的语言模型. arXiv preprint arXiv:1909.08053.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2021a. Ethical and social risks of harm from language models.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, 和 Iason Gabriel. 2021a. 语言模型的伦理和社会危害风险。

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021b. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.

劳拉·魏丁格, 约翰·梅尔洛, 玛丽贝斯·劳, 康纳·格里芬, 乔纳森·乌萨托, 黄柏森, 成美拉, 米娅·格莱斯, 博尔哈·巴列, 阿托萨·卡西尔扎德, 等. 2021b. 语言模型的伦理和社会风险. arXiv preprint arXiv:2112.04359.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In International Conference on Learning Representations.

肖恩·韦尔克,伊利亚·库利科夫,斯蒂芬·罗勒,艾米丽·迪南,基永玄·乔和杰森·韦斯顿。2020。使用非似然训练 (unlikelihood training) 的神经文本生成。在国际学习表征会议 (International Conference on Learning Representations) 上。

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, Mike Rabbat, and Kim Hazelwood. 2022. Sustainable AI: environmental implications, challenges and opportunities. In Proceedings of the Conference on Machine Learning and Systems.

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, Mike Rabbat, 和 Kim Hazelwood。2022。可持续 AI (Sustainable AI):环境影响、挑战和机遇。在 Proceedings of the Conference on Machine Learning and Systems 中。

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2020. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, 和 Emily Dinan. 2020. 开放域聊天机器人的安全策略. arXiv preprint arXiv:2010.07079.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021a. Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968, Online. Association for Computational Linguistics.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston 和 Emily Dinan. 2021a. 面向安全对话智能体的机器人对抗性对话. 在 Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 的论文集,第 2950–2968 页,在线. Association for Computational Linguistics.

Jing Xu, Arthur Szlam, and Jason Weston. 2021b. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567.

Jing Xu, Arthur Szlam, 和 Jason Weston。2021b。超越金鱼记忆:长期开放域对话。arXiv 预印本 arXiv:2107.07567。

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, 和 Yejin Choi. 2019. Hellaswag: 机器真的能完成你的句子吗?在 Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. CoRR, abs/1506.06724.

朱昱坤, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, 和 Sanja Fidler. 2015. 对齐书籍和电影:通过观看电影和阅读书籍实现故事般的视觉解释。CoRR, abs/1506.06724.

A Additional Evaluations

A 额外评估


Figure 6: Zero-shot NLP Evaluations. Full evaluations on all 16 NLP tasks, with comparisons where available. We find that across most tasks, GPT-3 models and OPT models perform similarly, but some tasks display highly erratic behavior.

图 6: 零样本 NLP 评估。对所有 16 个 NLP 任务的完整评估,以及可用的比较。我们发现,在大多数任务中,GPT-3 模型和 OPT 模型的表现相似,但某些任务表现出高度不规律的行为。


Figure 7: Multi-shot NLP Evaluations. Full evaluations on all 16 NLP tasks, with comparisons to the GPT-3 reported performance. As with zero-shot, performance is roughly similar for most tasks, with some tasks demonstrating erratic behavior.

图 7: 多样本 NLP 评估。对所有 16 个 NLP 任务的完整评估,并与 GPT-3 报告的性能进行比较。与零样本一样,大多数任务的性能大致相似,但有些任务表现出不稳定的行为。

B Contributions

B 贡献

Pre-training

预训练

Evaluations

评估

• NLP: Xian Li, Xi Victoria Lin, Todor Mihaylov, Stephen Roller, Anjali Sridhar
• Dialogue: Stephen Roller
• Responsible AI Evaluations: Punit Singh Koura, Stephen Roller, Tianlu Wang

• 自然语言处理 (NLP) :Xian Li, Xi Victoria Lin, Todor Mihaylov, Stephen Roller, Anjali Sridhar
• 对话系统:Stephen Roller
• 负责任的 AI 评估:Punit Singh Koura, Stephen Roller, Tianlu Wang

Paper writing: Moya Chen, Stephen Roller, Luke Zettlemoyer, Susan Zhang

论文写作:Moya Chen, Stephen Roller, Luke Zettlemoyer, Susan Zhang

Code release preparation: Christopher Dewan, Susan Zhang

代码发布准备:Christopher Dewan,Susan Zhang

Responsible AI conduct: Mona Diab, Susan Zhang

负责任的 AI 行为:Mona Diab,Susan Zhang

C Datasheet

C 数据表

We follow the recommendations of Gebru et al. (2021) and provide a data card for the dataset used to train the OPT models.

我们遵循 Gebru 等人 (2021) 的建议,并为用于训练 OPT 模型的数据集提供数据卡片。

C.1 Motivation

C.1 动机

• What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description. The instances are textual documents. The overall dataset is composed from a union of the following datasets:

• 数据集中的实例代表什么(例如,文档、照片、人、国家)?是否存在多种类型的实例(例如,电影、用户和评分;人与他们之间的互动;节点和边)?请提供描述。实例是文本文档。整个数据集由以下数据集的并集组成:

• How many instances are there in total (of each type, if appropriate)? The training data contains 180B tokens corresponding to 800 GB of data.

• 总共有多少实例(如果合适,按类型区分)?训练数据包含 180B 个 Token ,对应 800 GB 的数据。

• Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable). The CC-stories dataset contains a subset of Common Crawl data filtered to match the story-like style of Winograd schemas. The remainder of the dataset was collected from the above sources, reformatted, and deduplicated.

• 数据集是否包含所有可能的实例,还是它是来自更大集合的样本(不一定是随机的)?如果是样本,那么更大的集合是什么?该样本是否代表了更大的集合(例如,地理覆盖范围)?如果是,请描述这种代表性是如何验证/确认的。如果不是,请描述原因(例如,为了涵盖更多样化的实例,因为某些实例被扣留或无法获得)。CC-stories 数据集包含从 Common Crawl 数据中筛选出的子集,这些数据经过过滤以匹配 Winograd 模式的故事情节风格。其余数据集是从上述来源收集的,重新格式化并去重。

• What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description. Each instance consists of raw text data.

• 每个实例由什么数据组成?“原始”数据(例如,未处理的文本或图像)还是特征?在任何一种情况下,请提供描述。每个实例由原始文本数据组成。

• Is there a label or target associated with each instance? If so, please provide a description. No.

• 是否每个实例都有对应的标签或目标?如果有,请提供描述。没有。

• Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. No.

• 个别实例中是否缺少任何信息?如果有,请提供描述,解释为什么这些信息缺失(例如,因为信息不可用)。这不包括故意删除的信息,但可能包括例如,删节的文本。否。

• Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit. There are no explicit relationships between individual instances.

• 实例之间的关系是否被明确表示(例如,用户的电影评分,社交网络链接)?如果是,请描述这些关系是如何被明确表示的。实例之间没有明确的关系。

• Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them. We hold out a random validation set of approximately 200MB from the pretraining data, sampled proportionally to each dataset’s size in the pretraining corpus.

• 是否有推荐的数据分割(例如,训练、开发/验证、测试)?如果有,请描述这些分割,并解释其背后的理由。我们从预训练数据中随机保留了一个大约 200MB 的验证集,按每个数据集在预训练语料库中的比例进行采样。

• Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented. N/A.

• 相关个人是否同意收集和使用其数据?如果是,请描述(或通过截图或其他信息展示)如何请求并获得同意,并提供链接或其他访问点,或以其他方式重现个人同意的确切语言。不适用 (N/A)。

• If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate). N/A.

• 如果获得了同意,是否向同意的个人提供了在未来或针对某些用途撤销其同意的机制?如果是,请提供描述以及指向该机制的链接或其他访问点(如果适当)。不适用。

• Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation. Some toxicity and bias evaluations were performed. Please refer to the main document and the model card for these details.

• 是否已进行有关数据集及其使用对数据主体(例如,数据保护影响分析)的潜在影响的分析?如果是,请提供此分析的描述,包括结果,以及任何支持文档的链接或其他访问点。已进行了一些毒性评估和偏差评估。请参阅主文档和模型卡以获取详细信息。

• Any other comments? No.

• 还有其他评论吗?没有。

C.4 Preprocessing/cleaning/labeling

C.4 预处理/清理/标注

• Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section. The component datasets went through standard cleaning and re-formatting practices, including removing repetitive/non-informative text like “Chapter One,” or “This ebook by Project Gutenberg.”

• 是否对数据进行了任何预处理/清理/标注(例如,离散化或分桶、Tokenization、词性标注、SIFT特征提取、实例移除、缺失值处理)?如果是,请提供描述。如果不是,您可以跳过本节中的其余问题。组件数据集经过了标准的清理和重新格式化处理,包括移除重复的或无信息的文本,如“Chapter One”或“This ebook by Project Gutenberg”。

• Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data. The “raw” component datasets are publicly available in their respective locations (more details can be seen in the respective papers linked in references).

• 是否保存了“原始”数据,此外还有预处理/清理/标注的数据(例如,以支持未预见的未来用途)?如果是,请提供“原始”数据的链接或其他访问点。"原始"组件数据集在其各自的位置公开可用(更多详细信息可以参见参考文献中链接的相应论文)。

• Any other comments? No.

• 还有其他评论吗?没有。

C.5 Uses

C.5 用途

• Has the dataset been used for any tasks already? If so, please provide a description. Yes, this dataset was used to pre-train the OPT models.

• 该数据集是否已用于任何任务?如果是,请提供描述。是的,这个数据集已被用于预训练 OPT 模型。

• Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point. https://github.com/facebookresearch/metaseq

• 是否有仓库链接到使用该数据集的任何或所有论文或系统?如果有,请提供链接或其他访问点。 https://github.com/facebookresearch/metaseq

• What (other) tasks could the dataset be used for? This data can be used to pre-train language models, which are foundation to many current and future language tasks.

• 该数据集还可以用于哪些任务?这些数据可以用于预训练语言模型,这些模型是许多当前和未来语言任务的基础。

• Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms? The pipeline for creating this dataset paves the way for building a scalable infrastructure for mining datasets.

• 数据集的组成或其收集和预处理/清理/标注方式是否存在可能影响未来使用的情况?例如,未来用户是否需要了解某些信息以避免可能导致对个人或群体的不公平待遇(如刻板印象、服务质量问题)或其他不良后果(如财务损失、法律风险)?如果是,请提供描述。未来用户可以采取哪些措施来减轻这些不良后果?创建此数据集的流程为构建可扩展的数据集挖掘基础设施铺平了道路。

• Are there tasks for which the dataset should not be used? If so, please provide a description. None that we are currently aware of.

• 是否存在不应使用该数据集的任务?如果有,请提供描述。目前我们尚未知晓任何此类情况。

• Any other comments? No.

• 还有其他评论吗?没有。

C.6 Distribution

C.6 分布

• Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description. Not at this time.

• 该数据集是否会分发给创建该数据集的实体(例如,公司、机构、组织)以外的第三方?如果是,请提供描述。目前不会。

• How will the dataset will be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)? N/A.

• 数据集将如何分发(例如,网站上的 tarball,API,GitHub)?数据集是否有数字对象标识符 (DOI)?不适用。

• When will the dataset be distributed? N/A.

• 数据集将在何时分发?不适用。

• Will the dataset be distributed under a copyright or other intellectual property $(\mathbf{IP})$ license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions. N/A.

• 该数据集是否会根据版权或其他知识产权 (IP) 许可,和/或适用的使用条款 (ToU) 分发?如果是,请描述此许可和/或使用条款,并提供链接或其他访问点,或以其他方式复制任何相关的许可条款或使用条款,以及与这些限制相关的任何费用。不适用 (N/A)。

• Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation. N/A.

• 是否有任何出口管制或其他监管限制适用于该数据集或单个实例?如果有,请描述这些限制,并提供链接或其他访问点,或以其他方式重现任何支持文档。不适用 (N/A)。

• Any other comments? No.

• 有任何其他评论吗?没有。

C.7 Maintenance

C.7 维护

• Who is supporting/hosting/maintaining the dataset? Meta AI.

• 谁在支持/托管/维护数据集?Meta AI。

• How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Refer to the main document.

• 如何联系数据集的所有者/管理者/维护者(例如,电子邮件地址)?请参阅主文档。

• Is there an erratum? If so, please provide a link or other access point. N/A.

• 是否有勘误表?如果有,请提供链接或其他访问方式。不适用 (N/A)。

• Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)? No current plan for updating.

• 数据集是否会更新(例如,更正标注错误、添加新实例、删除实例)?如果是,请描述更新的频率、由谁进行更新,以及如何向用户传达更新信息(例如,邮件列表、GitHub)?目前没有更新计划。

• If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced. N/A.

• 如果数据集涉及人员,是否有关于实例相关数据保留的适用限制(例如,是否告知相关人员他们的数据将被保留一段固定时间然后删除)?如果是,请描述这些限制并解释如何执行。不适用。

• Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users. N/A.

• 较旧版本的数据集是否会继续得到支持/托管/维护?如果是,请描述具体方式。如果不是,请描述其废弃情况将如何通知用户。不适用 (N/A)。

• If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/ verified? If so, please describe how. If not, why not? Is there a process for communicating/ distributing these contributions to other users? If so, please provide a description. No mechanism is available right now.

• 如果其他人希望扩展/增强/构建/贡献于该数据集,是否有机制让他们这样做?如果是,请提供描述。这些贡献是否会进行验证/核实?如果是,请描述如何进行。如果不是,为什么?是否有向其他用户传达/分发这些贡献的流程?如果是,请提供描述。目前没有可用的机制。

• Any other comments? No.

• 还有其他评论吗?没有。

D Model Card

D 模型卡片 (Model Card)

Following Mitchell et al. (2018), we provide a model card for OPT-175B.

遵循 Mitchell 等 (2018) 的做法,我们为 OPT-175B 提供了一个模型卡片。

D.1 Model Details

D.1 模型细节

• Person or organization developing model: OPT-175B was developed by Meta AI.

• 开发模型的个人或组织:OPT-175B 由 Meta AI 开发。

• Model date: OPT-175B was released on May 3, 2022.

• 模型发布日期:OPT-175B 发布于 2022 年 5 月 3 日。

• Model version: OPT-175B described in this paper is version 1.0.0.

• 模型版本:本文描述的 OPT-175B 是 1.0.0 版。

• Model type: OPT-175B is a large decoder-only transformer language model.

• 模型类型:OPT-175B 是一个大型的仅解码器 Transformer 语言模型。

• Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: OPT-175B was trained with AdamW for parameter sizes from 125M to 175B. See the Data Card (Appendix C) for information about training data and Section 2.2 - 2.5 for information about the training process.

• 训练算法、参数、公平性约束或其他应用方法以及特征的相关信息:OPT-175B 使用 AdamW 进行训练,参数规模从 125M 到 175B。有关训练数据的信息,请参见数据卡片 (Appendix C),有关训练过程的信息,请参见第 2.2 - 2.5 节。

• Paper or other resource for more information: See the rest of this paper for more details on OPT-175B as well as the corresponding post on the Meta AI Research Blog. More details are also available in metaseq, our open-source repository.12

• 更多信息的论文或其他资源:请参阅本文的其余部分,以了解更多关于 OPT-175B 的详细信息,以及 Meta AI Research 博客上的相应文章。更多详细信息也可在我们的开源仓库 metaseq 中获得。12

• License: OPT-175B and the smaller baseline models are made available through a non-commercial use license agreement provided in our model license.13

• 许可证:OPT-175B 和较小的基础模型通过非商业使用许可协议提供,该协议包含在我们的模型许可证中。13

• Where to send questions or comments about the model: Please contact the corresponding authors {susanz,roller,namangoyal}@fb.com for any questions or comments.

• 关于模型的问题或评论请联系:请将任何问题或评论发送给相应作者 {susanz,roller,namangoyal}@fb.com。

D.2 Intended Use

D.2 预期用途

• Primary intended uses: We release OPT-175B for research into Language Models, especially as it pertains to Responsible AI. See Section 6 for more detailed Considerations for Release. Information on how to use the model can be found at metaseq, our open-source repository.

• 主要预期用途:我们发布 OPT-175B 用于研究大语言模型 (Large Language Model),特别是与其相关的负责任 AI (Responsible AI)。详见第 6 节,了解更多关于发布的考虑因素。如何使用该模型的信息可以在我们的开源仓库 metaseq 中找到。

• Primary intended users: We primarily target researchers and the related research community.

• 主要目标用户:我们主要面向研究人员及相关研究社区。

• Out-of-scope use cases: OPT-175B is not released for production use or real-world deployments. As we note in Section 5, OPT-175B, like similar large language models, has a variety of shortcomings that make it premature for commercial use.

• 超出范围的使用场景:OPT-175B 不发布用于生产环境或实际部署。如我们在第 5 节中所述,OPT-175B 与类似的大型语言模型一样,存在多种不足之处,使其不适合商业使用。

D.3 Data, Limitations, and Recommendations

• Data selection for training: Training data for OPT-175B was selected based on a combination of breadth and availability. See our Data Card (Appendix C) for more detailed information on the data used to train our model.

D.3 数据、限制和建议

• 训练数据的选择:OPT-175B 的训练数据是根据广度和可用性的结合来选择的。有关用于训练我们模型的数据的更详细信息,请参见我们的数据卡片 (附录 C)。

• Data selection for evaluation: Evaluations in this paper were chosen to provide comparable performance assessments relative to similar scale models in the literature. Given concerns in the community around safety and fairness of large language models in general, we also explicitly provide evaluations on Responsible AI (see Section 4).

• 评估的数据选择:本文中的评估被选择用于提供与文献中类似规模的模型相比的性能评估。鉴于社区对大语言模型的安全性和公平性普遍存在的担忧,我们还明确提供了关于负责任的 AI (Responsible AI) 的评估(见第 4 节)。

• Limitations: Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models. By releasing with a non-commercial license, we also hope to increase communication, transparency, and study of the problems of large language models, especially in areas which may not be aligned with commercial interests. See Section 5 for a more detailed discussion of limitations of OPT-175B.

• 限制:与其他大语言模型类似,由于训练数据的多样性(或缺乏多样性)对模型质量产生下游影响,OPT-175B 在偏差和安全性方面存在局限性。OPT-175B 在生成多样性和幻觉方面也可能存在质量问题。总体而言,OPT-175B 并不能免受困扰现代大语言模型的各种问题的影响。通过采用非商业许可发布,我们也希望增加对大语言模型问题的沟通、透明度和研究,特别是在可能与商业利益不一致的领域。详见第 5 节,了解更多关于 OPT-175B 的限制详细讨论。

• Recommendations for future work: See Section 6 for more about our Considerations for Release, including a discussion of potential avenues of research enabled by opening our model to more of the research community. We hope that the release of OPT-175B, as well as information around our model training process, will increase open science around both large language models in specific and natural language processing and deep learning in general.

• 未来工作建议:请参见第 6 节,了解更多关于我们发布时的考虑因素,包括开放模型给更多研究社区所带来的潜在研究方向的讨论。我们希望 OPT-175B 的发布,以及关于我们模型训练过程的信息,能够促进大语言模型和自然语言处理及深度学习领域的开放科学。

E Sample Model Outputs

E 模型输出示例 (Sample Model Outputs)

For all sample outputs, the initial prompt is given in bold and the remainder is the continuation. These example outputs were intentionally selected to highlight both successes and failures of the OPT-175B model.

对于所有样本输出,初始提示以粗体给出,其余部分为模型的续写。这些示例输出是特意挑选的,以同时展示 OPT-175B 模型的成功与失败。


Figure 8: Poetry generation. We have observed the model can write entertaining poetry on topics such as dodos, samosas, and performance reviews. However, we struggled to get the model to observe rhyme or meter.

图 8: 诗歌生成。我们观察到该模型可以撰写关于渡渡鸟、萨莫萨和绩效评估等主题的娱乐性诗歌。然而,我们很难让模型遵循押韵或格律。


Figure 9: Conversation generation. OPT-175B adopts a patriotic personality when prompted as the Statue of Liberty. However, the model also devolves into somewhat simple and linguistically repetitive generations further into the conversation.

图 9: 对话生成。OPT-175B 在被提示为自由女神像时采用了一种爱国人格。然而,随着对话的深入,模型的生成逐渐变得简单且语言上重复。

1. Introduction

1. 引言

In recent years, deep neural networks have led to a series of breakthroughs in a variety of domains, such as image classification and natural language understanding. In many of these works, network depth and increased model capacity seem to be critical in pushing state-of-the-art forward. In this paper, we attempt to understand what it means for a deep neural network to have a high capacity, and how to quantify it.

近年来,深度神经网络在多个领域(如图像分类和自然语言理解)取得了一系列突破。在许多这些研究中,网络深度和增加的模型容量似乎对于推动最先进水平至关重要。在本文中,我们尝试理解深度神经网络具有高容量的意义,以及如何量化它。

We introduce the notion of network capacity as the upper bound on the complexity of a neural network. We define the complexity of a neural network as the number of parameters and the number of connections, and show that the complexity of a neural network is proportional to the number of parameters, and the network capacity. We define the network capacity of a neural network as the maximum possible number of parameters that the network can have and still be able to accurately reproduce the training data. We then introduce a new measure of network capacity called the capacity-to-data (C2D) ratio, which is the ratio between the maximum number of parameters that the network can have and still be able to accurately reproduce the training data and the number of parameters of the network. We show that the C2D ratio is a good measure of network capacity, and that it is useful for comparing different neural networks. We introduce a new network compression technique called sparsity-promoting compression, which reduces the number of parameters of a neural network, while preserving its accuracy. We apply the sparsity-promoting compression technique to several datasets, and show that it can reduce the number of parameters of a neural network by up to 70%. We also show that sparsity-promoting compression can significantly improve the C2D ratio of the neural network. We then apply the sparsity-promoting compression technique to several image classification datasets, and show that it can significantly improve the accuracy of the neural network. We finally show that the sparsity-promoting compression technique can significantly reduce the memory consumption of a neural network, and that it can be used to reduce the memory consumption of deep neural networks.

我们引入网络容量的概念,作为神经网络复杂度的上限。我们将神经网络的复杂度定义为参数数量和连接数量,并表明神经网络的复杂度与参数数量和网络容量成正比。我们将神经网络的网络容量定义为网络能够准确重现训练数据的最大可能参数数量。然后,我们引入了一个新的网络容量度量,称为容量到数据 (C2D) 比率,它是网络能够准确重现训练数据的最大参数数量与网络参数数量之间的比率。我们表明 C2D 比率是衡量网络容量的良好指标,并且它对于比较不同的神经网络非常有用。我们引入了一种新的网络压缩技术,称为稀疏促进压缩,它可以在保持精度的同时减少神经网络的参数数量。我们将稀疏促进压缩技术应用于多个数据集,并表明它可以将神经网络的参数数量减少多达 70%。我们还表明,稀疏促进压缩可以显著提高神经网络的 C2D 比率。然后,我们将稀疏促进压缩技术应用于多个图像分类数据集,并表明它可以显著提高神经网络的精度。最后,我们表明稀疏促进压缩技术可以显著减少神经网络的内存消耗,并且可以用于减少深度神经网络的内存消耗。
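若按这段生成文本的字面表述将 C2D 比率整理成公式,大致是如下形式(仅为示意性的写法;符号 $P_{\max}$ 与 $P$ 是此处为说明而引入的记号,并非模型输出本身):

$$\mathrm{C2D} = \frac{P_{\max}}{P}$$

其中 $P_{\max}$ 表示网络在仍能准确重现训练数据的前提下可拥有的最大参数数量,$P$ 为网络的实际参数数量。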

2. The complexity of neural networks

2. 神经网络的复杂性

The complexity of a neural network is defined as the number of parameters and the number of connections. Let P be the number of parameters and N be the number of connections, then the complexity of a neural network is given by

神经网络的复杂度定义为参数的数量和连接的数量。设 P 为参数的数量,N 为连接的数量,则神经网络的复杂度由以下公式给出:

(1)


where ∑ is the set of parameters of the neural network. The number of connections is equal to the number of nodes in

其中 ∑ 是神经网络的参数集。连接的数量等于……中的节点数量

Figure 11: Paper writing example. Prompting with "1. Introduction" generally yielded more interesting results compared to prompting with “Abstract.” Our prompt here was inspired by the first sentence of the seminal ResNet work (He et al., 2016).

图 11: 论文写作示例。以 “1. Introduction” 作为提示,通常比以 “Abstract” 作为提示得到更有趣的结果。我们此处的提示灵感来自开创性的 ResNet 论文 (He et al., 2016) 的第一句话。


Figure 12: Arithmetic. We observe mistakes when extending from addition to other operations.

图 12: 算术。我们观察到从加法扩展到其他运算时出现了错误。

def fib(n): "Returns n-th Fibonacci number." if $\texttt{n}<\textbf{2}$ : return 1 else: return fib(n-1) + fib(n-2)

def fib(n):
    "返回第 n 个斐波那契数。"
    if n < 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

def fib(k):
    "Returns k-th Fibonacci number."
    if k == 0:
        return 1
    return fib(k-1) + fib(k-2)

def fib(k):
    "返回第 k 个斐波那契数。"
    if k == 0:
        return 1
    return fib(k-1) + fib(k-2)
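下面给出一个最小的对照草图(假设 Python 3 环境;函数名 fib_prompt 与 fib_model 是此处为便于对比而引入的假设性命名):提示中给出的版本对所有 n ≥ 0 都能终止,而模型续写的版本缺少 k == 1 的基准情形,会一路递归到负数参数。

import sys

def fib_prompt(n):
    "Returns n-th Fibonacci number."  # 提示中给出的版本
    if n < 2:
        return 1
    return fib_prompt(n - 1) + fib_prompt(n - 2)

def fib_model(k):
    "Returns k-th Fibonacci number."  # 模型续写的版本,只有 k == 0 一个基准情形
    if k == 0:
        return 1
    return fib_model(k - 1) + fib_model(k - 2)

print([fib_prompt(n) for n in range(8)])  # [1, 1, 2, 3, 5, 8, 13, 21]

try:
    fib_model(1)  # 递归到 fib_model(-1)、fib_model(-2)……永远到不了基准情形
except RecursionError:
    print("fib_model(1) 超出递归深度限制", file=sys.stderr)

这也正是该图想说明的一类失败:生成的代码表面上看似合理,却遗漏了必要的基准情形。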