[论文翻译]GLaM: 语言模型的有效扩展与专家混合 (Mixture-of-Experts)


原文地址:https://arxiv.org/pdf/2112.06905


GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

GLaM: 语言模型的有效扩展与专家混合 (Mixture-of-Experts)

Abstract

摘要

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately $7\times$ larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero, one and few-shot performance across 29 NLP tasks.

通过增加数据、计算资源和参数规模,大语言模型的扩展已经推动了自然语言处理领域的显著进展。例如,得益于扩展,GPT-3 能够在上下文学习任务中取得优异成绩。然而,训练这些大型密集模型需要大量的计算资源。在本文中,我们提出并开发了一组名为 GLaM (Generalist Language Model) 的语言模型,该模型采用稀疏激活的专家混合架构,在扩展模型容量的同时,相比密集变体大幅降低了训练成本。最大的 GLaM 拥有 1.2 万亿参数,大约是 GPT-3 的 7 倍。它仅消耗训练 GPT-3 所需能量的 1/3,并且在推理时所需的计算量也减少了 50%,同时在 29 个 NLP 任务中实现了更好的零样本、单样本和少样本性能。

Table 1. Comparison between GPT-3 and GLaM. In a nutshell, GLaM outperforms GPT-3 on average across 21 natural language understanding (NLU) benchmarks and 8 natural language generative (NLG) benchmarks while using about half the FLOPs per token during inference and consuming about one third the energy for training.

表 1. GPT-3 与 GLaM 的对比。简而言之,GLaM 在 21 个自然语言理解 (NLU) 基准测试和 8 个自然语言生成 (NLG) 基准测试上的平均表现优于 GPT-3,同时在推理过程中每个 Token 仅使用大约一半的 FLOPs,训练能耗约为 GPT-3 的三分之一。

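| | | GPT-3 | GLaM | Relative |
|---|---|---|---|---|
| Cost | FLOPs / token (G) | 350 | 180 | -48.6% |
| | Train energy (MWh) | 1287 | 456 | -64.6% |
| Accuracy on average | Zero-shot | 56.9 | 62.7 | +10.2% |
| | One-shot | 61.6 | 65.5 | +6.3% |
| | Few-shot | 65.2 | 68.1 | +4.4% |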
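
The "relative" column of Table 1 can be recomputed from the GPT-3 and GLaM columns. The short sketch below does that arithmetic; the dictionary keys and the helper name `relative_change` are illustrative choices, not names from the paper.

```python
# Values copied from Table 1 as (GPT-3, GLaM) pairs.
table1 = {
    "FLOPs / token (G)": (350.0, 180.0),
    "Train energy (MWh)": (1287.0, 456.0),
    "Zero-shot": (56.9, 62.7),
    "One-shot": (61.6, 65.5),
    "Few-shot": (65.2, 68.1),
}


def relative_change(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# Round to one decimal place, as the table does.
relative = {
    name: round(relative_change(gpt3, glam), 1)
    for name, (gpt3, glam) in table1.items()
}

for name, pct in relative.items():
    print(f"{name}: {pct:+.1f}%")
```

Negative values in the cost rows mean GLaM is cheaper; positive values in the accuracy rows mean GLaM scores higher.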

1. Introduction

1. 引言

Language models have played an important role in the progress of natural language processing (NLP) in the past decade. Variants of language models have been used to produce pretrained word vectors (Mikolov et al., 2013; Pennington et al., 2014), and contextualized word vectors (Peters et al., 2018; Devlin et al., 2019) for many NLP applications. The shift towards scaling with more data and larger models (Shazeer et al., 2017; Huang et al., 2019; Kaplan et al., 2020) has enabled complex natural language tasks to be performed with less labeled data. For example, GPT-3 (Brown et al., 2020) and FLAN (Wei et al., 2021) demonstrated the feasibility of in-context learning for few-shot or even zero-shot generalization, meaning very few labeled examples are needed to achieve good performance on NLP applications. While being effective and performant, scaling further is becoming prohibitively expensive and consumes significant amounts of energy (Patterson et al., 2021).

在过去十年中,语言模型在自然语言处理 (NLP) 的发展中发挥了重要作用。语言模型的变体已被用于生成预训练词向量 (Mikolov et al., 2013; Pennington et al., 2014),以及上下文化词向量 (Peters et al., 2018; Devlin et al., 2019),用于许多 NLP 应用。向使用更多数据和更大模型扩展的趋势 (Shazeer et al., 2017; Huang et al., 2019; Kaplan et al., 2020) 使得复杂的自然语言任务可以在更少的标注数据下完成。例如,GPT-3 (Brown et al., 2020) 和 FLAN (Wei et al., 2021) 展示了上下文学习在少样本甚至零样本泛化上的可行性,这意味着在 NLP 应用上取得良好性能只需要极少的标注示例。虽然这种方法有效且性能优越,但进一步扩展的成本正变得高得令人望而却步,并消耗大量的能源 (Patterson et al., 2021)。

In this work, we show that a large sparsely activated network can achieve competitive results compared to state-of-the-art dense models on few-shot tasks while being more computationally efficient. We present a family of generalist language models called GLaM, that strike a balance between dense and conditional computation. The largest version of GLaM has 1.2T parameters in total with 64 experts per MoE layer (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2021) where each token in the input batch only activates a subnetwork of 96.6B (8% of 1.2T) parameters. On zero, one and few-shot learning, this model compares favorably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks, ranging from language completion tasks, open-domain QA tasks, to natural language inference tasks. Thanks to the sparsely activated architecture and the efficient implementation of the model parallelism algorithm, the total energy consumption during training is only one third of GPT-3's. We highlight the comparison between the largest version of GLaM and GPT-3 in Table 1 and Figure 1.

在这项工作中,我们展示了大规模稀疏激活网络在少样本任务上可以取得与最先进的密集模型相媲美的结果,同时计算效率更高。我们介绍了一组名为 GLaM 的通用语言模型,这些模型在密集计算和条件计算之间取得了平衡。GLaM 最大的版本总共有 1.2T 参数,每个 MoE 层有 64 个专家 (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2021),其中输入批次中的每个 Token 只激活一个包含 966 亿 (1.2T 的 8%) 参数的子网络。在零样本、单样本和少样本学习中,该模型与 GPT-3 (175B) 相比表现优异,在涵盖从语言补全任务、开放域问答任务到自然语言推理任务的 29 个公共 NLP 基准测试中,学习效率显著提高。得益于稀疏激活架构和模型并行算法的有效实现,训练期间的总能耗仅为 GPT-3 的三分之一。我们在表 1 和图 1 中突出了 GLaM 最大版本与 GPT-3 的比较。

Figure 1. An overview of the percentage change in predictive performance (higher is better) of GLaM (64B/64E) versus GPT-3 (175B) in the (a) zero-shot, (b) one-shot, and (c) few-shot setting across 7 benchmark categories with 29 public tasks in total. Each bar in panel (a), (b) and (c) represents one benchmark category. Panel (d) compares the FLOPs needed per token prediction and training energy consumption.

图 1. GLaM (64B/64E) 与 GPT-3 (175B) 在 (a) 零样本,(b) 单样本,和 (c) 少样本 设置下,在 7 个基准类别中的预测性能百分比变化(越高越好),总共包含 29 个公共任务。图 (a)、(b)、(c) 中的每个柱代表一个基准类别。图 (d) 比较了每 Token 预测所需的 FLOPs 和训练能耗。

We use GLaM to study the importance of data. Our analysis shows that even for these large models, data quality should not be sacrificed for quantity if the goal is to produce a high-quality auto-regressive language model. More importantly, on social dimensions, our results are also the first, to our knowledge, to close the performance gap between stereotypical and anti-stereotypical examples on the WinoGender benchmark, suggesting that large, sparsely activated models may rely less on superficial statistical correlations.

我们使用 GLaM 来研究数据的重要性。我们的分析表明,即使对于这些大语言模型,如果目标是生成高质量的自回归语言模型,则不应为了数量而牺牲数据质量。更重要的是,在社会维度上,我们的结果也是首次,据我们所知,在 WinoGender 基准测试中弥合了刻板印象和反刻板印象示例之间的性能差距,这表明大型、稀疏激活的大语言模型可能较少依赖于表面的统计相关性。

Finally, although MoE-based sparse models are not yet common in the NLP community, our work shows that sparse decoder-only language models can be more performant than the dense architectures of similar compute FLOPs for the first time within the few-shot in-context learning setting at scale, suggesting that sparsity is one of the most promising directions to achieve high-quality NLP models while saving energy costs (Patterson et al., 2021). MoE should therefore be considered as a strong candidate for future scaling.

最后,尽管基于 MoE 的稀疏模型在 NLP 社区中尚未普及,但我们的工作表明,在少样本情境学习设置下,稀疏的仅解码器语言模型可以首次在计算 FLOPs 相似的情况下比密集架构表现更好,这表明稀疏性是实现高质量 NLP 模型并节省能源成本 (Patterson et al., 2021) 最有前途的方向之一。因此,MoE 应被视为未来扩展的有力候选者。

2. Related Work

2. 相关工作

Language models. Neural language models (Mikolov et al., 2010; Sutskever et al., 2011) have been shown to be useful for many natural language processing tasks. Word embedding models and extensions such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and paragraph vectors (Le & Mikolov, 2014) have shown good generalization to many tasks simply by transferring the embeddings.

语言模型。神经语言模型 (Mikolov et al., 2010; Sutskever et al., 2011) 已被证明对许多自然语言处理任务非常有用。词嵌入模型及其扩展,如 word2vec (Mikolov et al., 2013)、GloVe (Pennington et al., 2014) 和段落向量 (Le & Mikolov, 2014),通过转移这些嵌入,已经在许多任务上展示了良好的泛化能力。

Pre-training and Fine-tuning. The abundance of compute and data enables training increasingly large models via unsupervised pre-training. This is a natural fit for training neural networks as they exhibit remarkable scalability. Work on using recurrent models such as RNNs and LSTMs for language representation (Dai & Le, 2015; Kiros et al., 2015) showed that general language models could be fine-tuned to improve various language understanding tasks. More recently, models that used Transformers (Vaswani et al., 2017) showed that larger models with self-supervision on unlabeled data could yield significant improvements on NLP tasks (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Clark et al., 2020). Transfer learning based on pre-training and fine-tuning (Raffel et al., 2020; Houlsby et al., 2019) has been extensively studied and demonstrated good performance on downstream tasks. However, a major limitation of this method is that it requires task-specific fine-tuning.

预训练和微调。大量的计算资源和数据使得可以通过无监督预训练训练越来越大的模型。这是训练神经网络的自然选择,因为它们表现出显著的可扩展性。使用递归模型(如 RNN 和 LSTM)进行语言表示的工作 (Dai & Le, 2015; Kiros et al., 2015) 表明,通用语言模型可以经过微调以改进各种语言理解任务。最近的研究表明,使用 Transformer (Vaswani et al., 2017) 的更大模型在未标注数据上进行自监督学习可以在 NLP 任务上取得显著改进 (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Clark et al., 2020)。基于预训练和微调的迁移学习 (Raffel et al., 2020; Houlsby et al., 2019) 已被广泛研究,并在下游任务中表现出良好的性能。然而,这种方法的一个主要限制是它需要特定任务的微调。

In-Context Few-shot Learning. GPT-3 (Brown et al., 2020) and related work (Shoeybi et al., 2019; Lieber et al., 2021; Wei et al., 2021) demonstrated that scaling up language models greatly improves task-agnostic, few-shot performance. These language models are applied without any gradient updates, and only few-shot demonstrations specified purely via text interactions with the model are needed.

上下文少样本学习。GPT-3 (Brown et al., 2020) 和相关工作 (Shoeybi et al., 2019; Lieber et al., 2021; Wei et al., 2021) 表明,扩大语言模型的规模可以显著提高任务无关的、少样本性能。这些大语言模型在应用时不需要任何梯度更新,仅需通过与模型的文本交互提供少量示例即可。

Sparsely Gated Networks. Mixture-of-Experts based models have also shown significant advantages. For language modeling and machine translation, Shazeer et al. (2017) showed that they could effectively use a very large number of weights while only needing to compute a small subset of the computation graph at inference time. There has also been work on scaling sparsely activated MoE architectures (Hestness et al., 2017; Shazeer et al., 2018; Lepikhin et al., 2021; Kudugunta et al., 2021). Recently, Fedus et al. (2021) showed results with even larger 1 trillion parameter sparsely activated models (Switch-C). Although both Switch-C and the largest GLaM model have on the order of one trillion trainable parameters, GLaM is a family of decoder-only language models, and Switch-C is an encoder-decoder based sequence-to-sequence model. Furthermore, Switch-C is mainly evaluated on fine-tuning benchmarks, e.g., SuperGlue, while GLaM performs well without any need for fine-tuning in the few-shot setting shared by GPT-3, of which SuperGlue is a subset. Table 2 summarizes the key differences between GLaM and related models pre-trained on text corpora.

稀疏门控网络。基于混合专家 (Mixture-of-Experts) 的模型也展示了显著的优势。对于语言建模和机器翻译,Shazeer 等 (2017) 表明他们可以在推理时仅计算计算图的一小部分,同时有效地使用非常大的权重数量。也有研究致力于扩展稀疏激活的 MoE 架构 (Hestness 等, 2017; Shazeer 等, 2018; Lepikhin 等, 2021; Kudugunta 等, 2021)。最近,Fedus 等 (2021) 展示了具有更大规模、1万亿参数的稀疏激活模型 (Switch-C) 的结果。尽管 Switch-C 和最大的 GLaM 模型都有万亿量级的可训练参数,但 GLaM 是一系列仅解码器的语言模型,而 Switch-C 是基于编码器-解码器的序列到序列模型。此外,Switch-C 主要在微调基准测试上进行评估,例如 SuperGlue,而 GLaM 在与 GPT-3 相同的少样本设置下无需任何微调即可取得良好表现,SuperGlue 是该设置所含任务的一个子集。表 2 总结了 GLaM 与相关文本语料库预训练模型之间的关键差异。

Table 2. A sample of related models (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Shoeybi et al., 2019; Lepikhin et al., 2021; Fedus et al., 2021) pre-trained on text corpora. $n_{\mathrm{params}}$ is the total number of trainable model parameters, $n_{\mathrm{act-params}}$ is the number of activated model parameters per input token.

表 2. 相关模型样本 (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Shoeybi et al., 2019; Lepikhin et al., 2021; Fedus et al., 2021) 在文本语料库上预训练。$n_{\mathrm{params}}$ 是模型的总可训练参数数量,$n_{\mathrm{act-params}}$ 是每个输入 Token 的激活模型参数数量。

| Model name | Model type | $n_{\mathrm{params}}$ | $n_{\mathrm{act-params}}$ |
|---|---|---|---|
| BERT | Dense Encoder-only | 340M | 340M |
| T5 | Dense Encoder-decoder | 13B | 13B |
| GPT-3 | Dense Decoder-only | 175B | 175B |
| Jurassic-1 | Dense Decoder-only | 178B | 178B |
| Gopher | Dense Decoder-only | 280B | 280B |
| Megatron-530B | Dense Decoder-only | 530B | 530B |
| GShard-M4 | MoE Encoder-decoder | 600B | 1.5B |
| Switch-C | MoE Encoder-decoder | 1.5T | 1.5B |
| GLaM (64B/64E) | MoE Decoder-only | 1.2T | 96.6B |

3. Training Dataset

3. 训练数据集

To train our model, we build a high-quality dataset of 1.6 trillion tokens that are representative of a wide range of natural language use cases. Web pages constitute the vast quantity of data in our unlabeled dataset. However, their quality ranges from professional writing to low-quality comment and forum pages. Similarly to Brown et al. (2020), we develop our own text quality classifier to produce a high-quality web corpus out of an original larger raw corpus. We use a feature hash based linear classifier for inference speed. This classifier is trained to classify between a collection of curated text (Wikipedia, books and a few selected websites) and other webpages. We use this classifier to estimate the content quality of a webpage. We then apply this classifier by using a Pareto distribution to sample webpages according to their score. This allows some lower-quality webpages to be included to prevent systematic biases in the classifier (Brown et al., 2020).

为了训练我们的模型,我们构建了一个包含 1.6 万亿个 Token 的高质量数据集,这些 Token 代表了广泛的自然语言应用场景。网页构成了我们未标注数据集中大部分的数据。然而,它们的质量从专业写作到低质量的评论和论坛页面不等。类似于 Brown 等人 (2020),我们开发了自己的文本质量分类器,以从原始较大的语料库中生成高质量的网页语料库。我们使用基于特征哈希的线性分类器以提高推理速度。该分类器被训练用于区分一组精选文本(维基百科、书籍和一些选定网站)与其他网页。我们使用此分类器来估计网页的内容质量。然后,我们通过使用帕累托分布根据评分对网页进行采样。这允许一些较低质量的网页被包括在内,以防止分类器出现系统性偏差 (Brown et al., 2020)。
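
The paragraph above does not spell out the exact sampling rule. As a minimal sketch, the code below follows the rule reported by Brown et al. (2020), which keeps a page when a Pareto-distributed noise term exceeds one minus its quality score. The shape parameter `alpha=9.0`, the function names, and the assumption that scores lie in [0, 1] are all illustrative and are not confirmed by this paper.

```python
import random


def keep_page(score, alpha=9.0, rng=random):
    """Stochastically keep a page given a classifier score in [0, 1].

    High-scoring pages are almost always kept; low-scoring pages are
    kept occasionally, so the corpus is not shaped only by the classifier.
    """
    # random.paretovariate returns samples >= 1; subtracting 1 shifts the
    # support to [0, inf), matching the convention of numpy.random.pareto.
    noise = rng.paretovariate(alpha) - 1.0
    return noise > 1.0 - score


def filter_pages(scored_pages, alpha=9.0, seed=0):
    """Return the pages from (page, score) pairs that survive sampling."""
    rng = random.Random(seed)
    return [page for page, score in scored_pages if keep_page(score, alpha, rng)]


# Hypothetical usage with made-up pages and scores.
kept = filter_pages([("encyclopedia article", 0.97), ("spam comment", 0.05)])
print(kept)
```

Because the decision is random rather than a hard threshold, a small fraction of low-scoring pages still enters the corpus, which is the effect the text describes.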

Table 3. Data and mixture weights in GLaM training set.

表 3. GLaM 训练集的数据和混合权重。

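| Dataset | Tokens (B) | Weight in mixture |
|---|---|---|
| Filtered Webpages | 143 | 0.42 |
| Wikipedia | 3 | 0.06 |
| Conversations | 174 | 0.28 |
| Forums | 247 | 0.02 |
| Books | 390 | 0.20 |
| News | 650 | 0.02 |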

We use this process to generate a high-quality filtered subset of webpages and combine this with books, Wikipedia pages, forums and news pages and other data sources to create the final GLaM dataset. We also incorporate the data from public domain social media conversations used by Adiwardana et al. (2020). We set the mixture weights based on the performance of each component in a smaller model and to prevent small sources such as Wikipedia from being over-sampled. Table 3 shows the details of our data component sizes and mixture weights. To check data contamination, in Section D we conduct an overlap analysis between our training set and the evaluation data and find that it roughly matches that of previous work (Brown et al., 2020).

我们使用此过程生成高质量的网页子集,并将其与书籍、Wikipedia 页面、论坛和新闻页面以及其他数据源结合,创建最终的 GLaM 数据集。我们还纳入了 Adiwardana 等人 (2020) 使用的公共领域社交媒体对话数据。我们根据每个组件在较小模型中的表现设置混合权重,并防止像 Wikipedia 这样的小来源被过度采样。表 3 显示了我们数据组件大小和混合权重的详细信息。为了检查数据污染,在 D 节中,我们在训练集和评估数据之间进行了重叠分析,发现其大致与之前的工作 (Brown et al., 2020) 相符。
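
To make the effect of the mixture weights concrete, the sketch below draws training sources according to Table 3 and compares each weight with the source's share of raw tokens. The variable names and the `upweight` ratio are illustrative; the paper only gives the token counts and weights.

```python
import random

# Token counts (billions) and mixture weights copied from Table 3.
components = {
    "Filtered Webpages": (143, 0.42),
    "Wikipedia": (3, 0.06),
    "Conversations": (174, 0.28),
    "Forums": (247, 0.02),
    "Books": (390, 0.20),
    "News": (650, 0.02),
}

total_tokens = sum(tokens for tokens, _ in components.values())


def sample_source(rng):
    """Pick the source of the next training example by mixture weight."""
    names = list(components)
    weights = [components[name][1] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Ratio of mixture weight to share of raw tokens: values above 1 mean the
# source is sampled more often than its size alone would suggest.
upweight = {
    name: weight / (tokens / total_tokens)
    for name, (tokens, weight) in components.items()
}

for name, ratio in sorted(upweight.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ratio:.2f}x")
```

Under these numbers Wikipedia is sampled far above its raw share and News far below it, which is why the weight on a very small source has to be bounded.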


Figure 2. GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., 'roses', the Gating module dynamically selects two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts will then be passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected.

图 2. GLaM 模型架构。每个 MoE 层(下层模块)与一个 Transformer 层(上层模块)交替排列。对于每个输入 Token,例如 'roses',Gating 模块会动态选择 64 个专家中最相关的两个,这在 MoE 层中用蓝色网格表示。这两个专家的输出加权平均值将传递给上层的 Transformer 层。对于输入序列中的下一个 Token,将选择不同的两个专家。

4. Model Architecture

4. 模型架构

We leverage sparsely activated Mixture-of-Experts (MoE) (Shazeer et al., 2017; Fedus et al., 2021) in GLaM models. Similar to the GShard MoE Transformer (Lepikhin et al., 2021), we replace the feed-forward component of every other Transformer layer with an MoE layer, as shown in Figure 2. Each MoE layer consists of a collection of independent feed-forward networks as the 'experts'. A gating function then uses a softmax activation function to model a probability distribution over these experts. This distribution indicates how well each expert is able to process the incoming input.

我们在 GLaM 模型中使用稀疏激活的混合专家模型 (Mixture-of-Experts (MoE)) (Shazeer et al., 2017; Fedus et al., 2021)。类似于 GShard MoE Transformer (Lepikhin et al., 2021),我们用 MoE 层替换每隔一个 Transformer 层的前馈组件,如图 2 所示。每个 MoE 层由一组独立的前馈网络组成,作为‘专家’。然后,门控函数使用 softmax 激活函数对这些专家建模概率分布。该分布表示每个专家处理传入输入的能力。

Even though each MoE layer has many more parameters, the experts are sparsely activated. This means that for a given input token, only a limited subset of experts is used, giving the model more capacity while limiting computation. In our architecture, the subset size is two. Each MoE layer's learnable gating network is trained to use its input to activate the best two experts for each token of an input sequence. During inference, the learned gating network dynamically picks the two best experts for each token. For an MoE layer with $E$ experts, this essentially provides a collection of $O(E^{2})$ different combinations of feed-forward networks instead of one in the classic Transformer architecture, leading to much more computational flexibility. The final learned representation of a token will be the weighted combination of the outputs from the selected experts.

尽管每个 MoE 层包含更多的参数,但专家是稀疏激活的。这意味着对于给定的输入 Token,只使用有限的专家子集,从而使模型具有更大的容量,同时限制计算量。在我们的架构中,子集大小为两个。每个 MoE 层的可学习门控网络被训练用于根据其输入激活每个输入序列 Token 的最佳两个专家。在推理过程中,学习到的门控网络动态选择每个 Token 的两个最佳专家。对于一个包含 $E$ 个专家的 MoE 层,这实际上提供了一个 $O(E^{2})$ 不同前馈网络组合的集合,而不是经典 Transformer 架构中的一个,从而提供了更大的计算灵活性。最终学习到的 Token 表示将是所选专家输出的加权组合。
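
The routing described above can be sketched for a single token in plain Python. This is a minimal illustration, not the paper's implementation: the experts are arbitrary callables, the gate is one linear map followed by softmax, and renormalizing the two selected gate values so they sum to one is an assumption the text does not state.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(token, gate_weights, experts):
    """Route one token through its two highest-scoring experts.

    token: list of d floats; gate_weights: E rows of d floats;
    experts: E callables mapping a d-vector to a d-vector.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    top2 = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:2]
    norm = sum(probs[e] for e in top2)  # assumed renormalization
    output = [0.0] * len(token)
    for e in top2:
        expert_out = experts[e](token)
        for i, value in enumerate(expert_out):
            output[i] += (probs[e] / norm) * value
    return output, top2


# Toy setup: expert k simply scales its input by k.
toy_experts = [lambda x, k=k: [k * v for v in x] for k in range(4)]
toy_gate = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [2.0, 0.0]]
out, chosen = top2_moe([1.0, 0.0], toy_gate, toy_experts)
print(chosen, out)

# Number of distinct expert pairs for E = 64, i.e. the O(E^2) term.
print(math.comb(64, 2))
```

Only the two selected experts are evaluated, so per-token compute stays fixed while the number of available expert pairs grows quadratically with $E$.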

We also make additional modifications to the original Transformer architecture. We replace the standard positional embedding with per-layer relative positional bias from Dai et al. (2019). In the non-MoE Transformer feed-forward sub-layers, we replace the first linear projection and the activation function with the Gated Linear Unit (Dauphin et al., 2017; Shazeer, 2020), which computes the component-wise product of two linear transformations of the input, followed by a Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016) activation function. We partition the weights and computation of large GLaM models using the 2D sharding algorithm described in Xu et al. (2021); more details are given in Section C of the appendix.

我们还对原始的 Transformer 架构进行了额外的修改。我们将标准的位置嵌入替换为 Dai 等人 (2019) 提出的每层相对位置偏置。在非 MoE Transformer 的前馈子层中,我们将第一个线性投影和激活函数替换为门控线性单元 (Gated Linear Unit) (Dauphin 等, 2017; Shazeer, 2020),它计算输入的两个线性变换的按元素乘积,然后是高斯误差线性单元 (Gaussian Error Linear Unit) (Hendrycks & Gimpel, 2016) 激活函数。我们使用 Xu 等人 (2021) 描述的 2D 分片算法对大型 GLaM 模型的权重和计算进行分区,更多细节请参见附录 C 部分。
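
A gated feed-forward sub-layer of this kind can be sketched as follows. The text leaves open exactly where the GELU is applied; this sketch follows the GEGLU variant of Shazeer (2020), in which GELU is applied to one of the two linear branches before the component-wise product. That ordering, the absence of bias terms, and all function names are assumptions.

```python
import math


def gelu(x):
    """Exact Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def geglu_ffn(x, w_gate, w_value, w_out):
    """Feed-forward sub-layer with a GELU-gated linear unit.

    hidden = GELU(w_gate @ x) * (w_value @ x), output = w_out @ hidden.
    """
    gate = [gelu(v) for v in matvec(w_gate, x)]
    value = matvec(w_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(w_out, hidden)


# Toy usage with identity projections and a summing output layer.
identity = [[1.0, 0.0], [0.0, 1.0]]
y = geglu_ffn([1.0, 2.0], identity, identity, [[1.0, 1.0]])
print(y)
```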

5. Experiment Setup

5. 实验设置

GLaM is a family of dense and sparse decoder-only language models, so we first elaborate our training settings, hyperparameters, and evaluation protocol in this section.

GLaM 是一系列密集型和稀疏型的仅解码器语言模型,因此我们首先在本节中详细说明我们的训练设置、超参数和评估协议。

5.1. Training Setting

5.1. 训练设置

We train several variants of GLaM to study the behavior of MoE and dense models on the same training data. Table 4 shows the hyperparameter settings of different scale GLaM models ranging from 130 million parameters to 1.2 trillion parameters. Here, $E$ is the number of experts in the MoE layer, $B$ is the mini-batch size, $S$ is the input sequence length, $M$ is the model and embedding dimension, $H$ is the hidden dimension of the feed-forward network, $L$ is the number of layers and $N$ is the number of total devices. Additionally, $n_{\mathrm{params}}$ is the total number of trainable model parameters, $n_{\mathrm{act-params}}$ is the number of activated model parameters per input token, $n_{\mathrm{heads}}$ is the number of self-attention heads, and $d_{\mathrm{head}}$ is the hidden dimension of each attention head. We also include the respective dense models with comparable numbers of activated parameters per token during inference (and thus similar numbers of per-token FLOPs) as references. We adopt the notation GLaM (Base Dense Size/E), e.g., GLaM (64B/64E), to describe different variants in the GLaM models. For example, GLaM (8B/64E) represents the architecture of an approximate 8B parameter dense model with every other layer replaced by a 64 expert MoE layer. GLaM reduces to a dense Transformer-based language model architecture when each MoE layer only has one expert. We use the notation GLaM (Dense Size), e.g., GLaM (137B), to refer to a dense 137B parameter model trained with the same dataset.

我们训练了多个 GLaM 变体,以研究 MoE 模型和稠密模型在同一训练数据上的行为。表 4 显示了不同规模的 GLaM 模型的超参数设置,这些模型的参数量从 1.3 亿到 1.2 万亿不等。这里,$E$ 是 MoE 层中的专家数量,$B$ 是小批量大小,$S$ 是输入序列长度,$M$ 是模型和嵌入维度,$H$ 是前馈网络的隐藏维度,$L$ 是层数,$N$ 是总设备数量。此外,$n_{\mathrm{params}}$ 是可训练模型参数的总数,$n_{\mathrm{act-params}}$ 是每个输入 Token 的激活模型参数数量,$n_{\mathrm{heads}}$ 是自注意力头的数量,$d_{\mathrm{head}}$ 是每个注意力头的隐藏维度。我们还包含了在推理时具有相似每 Token 激活参数数量(因此每 Token FLOPs 数量也相似)的相应稠密模型作为参考。我们采用 GLaM (Base Dense Size/E) 的记法,例如 GLaM (64B/64E),来描述 GLaM 模型的不同变体。例如,GLaM (8B/64E) 表示一个约 8B 参数的密集模型架构,每隔一层被替换为一个包含 64 个专家的 MoE 层。当每个 MoE 层仅有一个专家时,GLaM 退化为基于 Transformer 的密集语言模型架构。我们使用 GLaM (Dense Size) 的记法,例如 GLaM (137B),来指代使用相同数据集训练的具有 137B 参数的密集模型。

5.2. Hyperparameters and Training Procedure

5.2. 超参数和训练过程

We use the same learning hyperparameters for all GLaM models. More specifically, we use a maximum sequence length of 1024 tokens, and pack each input example to have up to 1 million tokens per batch. The dropout rate is set to 0 since the number of available tokens in the training corpus is much greater than the number of processed tokens during training. Our optimizer is Adafactor (Shazeer & Stern, 2018) with first-moment decay $\beta_{1}=0$, second-moment decay $\beta_{2}=0.99$ with a $1-t^{-0.8}$ decay schedule, update clipping threshold of 1.0, and factored second-moment estimation. We keep the initial learning rate of 0.01 for the first 10K training steps, and then decay it with the inverse square root schedule $\mathrm{lr}(t)\propto\frac{1}{\sqrt{t}}$. On top of the standard cross-entropy loss, we add the MoE auxiliary loss as described in GShard (Lepikhin et al., 2021) with a 0.01 coefficient to encourage expert load balancing so that the gating function will distribute tokens more evenly across all experts. We use the SentencePiece (Kudo & Richardson, 2018) subword tokenizer with a vocabulary size of 256K. During training, we use float32 for model weights and bfloat16 for activations. The largest GLaM 64B/64E model was trained on 1,024 Cloud TPU-V4 chips.

我们为所有 GLaM 模型使用相同的训练超参数。具体来说,我们使用最大序列长度为 1024 个 Token,并将每个输入样本打包以每批次最多包含 1 百万个 Token。由于训练语料库中可用的 Token 数量远大于训练期间处理的 Token 数量,我们将 dropout 率设置为 0。我们的优化器是 Adafactor (Shazeer & Stern, 2018),一阶矩衰减 $\beta_{1}=0$,二阶矩衰减 $\beta_{2}=0.99$,采用 $1-t^{-0.8}$ 衰减计划,更新裁剪阈值为 1.0,并且使用因子化二阶矩估计。我们在前 10K 训练步骤中保持初始学习率为 0.01,然后按照逆平方根计划 $\mathrm{lr}(t)\propto\frac{1}{\sqrt{t}}$ 衰减学习率。除了标准的交叉熵损失外,我们还添加了 MoE 辅助损失,如 GShard (Lepikhin et al., 2021) 中所述,系数为 0.01,以鼓励专家负载均衡,从而使门控函数在所有专家之间更均匀地分配 Token。我们使用 SentencePiece (Kudo & Richardson, 2018) 子词分词器,词汇表大小为 256K。在训练过程中,我们使用 float32 表示模型权重和 bfloat16 表示激活。最大的 GLaM 64B/64E 模型是在 1,024 个 Cloud TPU-V4 芯片上训练的。
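
The two schedules in this paragraph can be written down directly. In the sketch below, the learning rate is held at 0.01 for 10K steps and then scaled by $\sqrt{10000/t}$; that constant is an assumption chosen so the schedule is continuous at the boundary, since the text only says the rate is proportional to $1/\sqrt{t}$. The second function is the $1-t^{-0.8}$ schedule as written; how it interacts with the stated $\beta_{2}=0.99$ is not specified here, so no cap is applied.

```python
import math

INITIAL_LR = 0.01      # from the text
CONSTANT_STEPS = 10_000  # from the text


def learning_rate(step):
    """Constant for the first 10K steps, then inverse square root decay."""
    if step <= CONSTANT_STEPS:
        return INITIAL_LR
    # Assumed proportionality constant: continuous at step == CONSTANT_STEPS.
    return INITIAL_LR * math.sqrt(CONSTANT_STEPS / step)


def second_moment_decay(step):
    """The 1 - t^(-0.8) decay schedule, for step >= 1."""
    return 1.0 - step ** -0.8


for step in (1, 10_000, 40_000, 1_000_000):
    print(step, learning_rate(step), second_moment_decay(step))
```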

Training models at the trillion parameter scale is extremely expensive even for sparsely activated models. There is little room for hyperparameter tuning. Here we share our training recipes and some implementation tricks for the GLaM models.

训练万亿参数规模的模型即使对于稀疏激活模型来说也非常昂贵。几乎没有空间进行超参数调整。在这里,我们分享一下 GLaM 模型的训练方法和一些实现技巧。

Table 4. Sizes and architectures of both MoE and dense models that we have trained in our experiments. Models are grouped by the number of activated parameters per token. All trained models share the same learning hyperparameters described in Section 5.1.

表 4: 我们在实验中训练的 MoE 和稠密模型的大小和架构。模型按每个 Token 激活的参数数量分组。所有训练的模型共享相同的在第 5.1 节中描述的学习超参数。

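| GLaM Model | Type | $n_{\mathrm{params}}$ | $n_{\mathrm{act-params}}$ | $L$ | $M$ | $H$ | $n_{\mathrm{heads}}$ | $d_{\mathrm{head}}$ | $E$ |
|---|---|---|---|---|---|---|---|---|---|
| 0.1B | Dense | 130M | 130M | 12 | 768 | 3,072 | 12 | 64 | - |
| 0.1B/64E | MoE | 1.9B | 145M | 12 | 768 | 3,072 | 12 | 64 | 64 |
| 1.7B | Dense | 1.7B | 1.700B | 24 | 2,048 | 8,192 | 16 | 128 | - |
| 1.7B/32E | MoE | 20B | 1.878B | 24 | 2,048 | 8,192 | 16 | 128 | 32 |
| 1.7B/64E | MoE | 27B | 1.879B | 24 | 2,048 | 8,192 | 16 | 128 | 64 |
| 1.7B/128E | MoE | 53B | 1.881B | 24 | 2,048 | 8,192 | 16 | 128 | 128 |
| 1.7B/256E | MoE | 105B | 1.886B | 24 | 2,048 | 8,192 | 16 | 128 | 256 |
| 8B | Dense | 8.7B | 8.7B | 32 | 4,096 | 16,384 | 32 | 128 | - |
| 8B/64E | MoE | 143B | 9.8B | 32 | 4,096 | 16,384 | 32 | 128 | 64 |
| 137B | Dense | 137B | 137B | 64 | 8,192 | 65,536 | 128 | 128 | - |
| 64B/64E | MoE | 1.2T | 96.6B | 64 | 8,192 | 32,768 | 128 | 128 | 64 |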

· We train smaller-scale models to convergence first. This allows us to expose potential issues in the dataset and infrastructure as early as possible.

我们首先训练较小规模的模型直到收敛。这使我们能够尽早发现数据集和基础设施中可能存在的问题。

· We skip weight updates for a batch if there are any NaNs or Infs in the gradients (Shen et al., 2019). Note NaN/Inf could still occur during the applying gradient step, in which case we restart from an earlier checkpoint as described below. For example, even if there is no Inf in the existing variable or the gradient, the updated variable could still lead to Inf.

如果我们发现梯度中存在任何 NaN 或 Inf,则跳过该批次的权重更新 (Shen et al., 2019)。注意,在应用梯度步骤期间仍可能发生 NaN/Inf,这种情况下我们将从较早的检查点重新开始,如下所述。例如,即使现有的变量或梯度中没有 Inf,更新后的变量仍可能导致 Inf。

· We restart from an early healthy checkpoint when encountering rare large fluctuations or even NaN/Inf during training. Randomness of the sequentially loaded batches might help escape from previous failed states in the training after restart.

在训练过程中遇到罕见的大波动或甚至 NaN/Inf 时,我们从一个早期健康的检查点重新开始。顺序加载的批次的随机性可能有助于在重新启动后逃离之前的失败状态。
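
The two safeguards in the list above (skip the update when gradients are non-finite, fall back to a checkpoint when the update itself produces a non-finite value) can be sketched as a guarded update step. Plain SGD stands in for Adafactor here, and the function names, the flat list of parameters, and the status strings are all hypothetical.

```python
import math


def has_nonfinite(values):
    """True if any value is NaN or +/-Inf."""
    return any(not math.isfinite(v) for v in values)


def guarded_sgd_step(params, grads, lr, checkpoint):
    """Apply one SGD step with the NaN/Inf safeguards described above.

    Returns (new_params, status) where status is one of
    "skipped", "restored", or "applied".
    """
    if has_nonfinite(grads):
        # Bad gradients: leave the weights untouched for this batch.
        return params, "skipped"
    updated = [p - lr * g for p, g in zip(params, grads)]
    if has_nonfinite(updated):
        # Finite inputs can still overflow during the update itself.
        return list(checkpoint), "restored"
    return updated, "applied"


print(guarded_sgd_step([1.0], [float("nan")], 0.1, [0.0]))
print(guarded_sgd_step([1e308], [-1e308], 10.0, [0.5]))
print(guarded_sgd_step([1.0], [0.5], 0.1, [0.0]))
```

The second call illustrates the case mentioned in the text: neither the variable nor the gradient contains Inf, yet the updated variable overflows.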

5.3. Evaluation Setting

5.3. 评估设置

Protocol. To clearly demonstrate the effectiveness of GLaM models, we mainly focus on evaluating the zero, one and few-shot learning protocols suggested by Radford et al. (2018); Brown et al. (2020). For the zero-shot learning setting, in most cases, we evaluate each example in the development set directly. For one/few-shot learning, we mainly draw random one/few examples from that task's training set as the only demonstration and context. Such a demonstration is concatenated with the evaluation example with two newlines in between, and then fed into the model.

协议。为了清晰地展示 GLaM 模型的有效性,我们主要关注评估 Radford 等人 (2018);Brown 等人 (2020) 建议的零样本、单样本和少样本学习协议。对于零样本学习设置,大多数情况下,我们直接评估开发集中的每个示例。对于单样本/少样本学习,我们主要从该任务的训练集中随机抽取一个或几个示例作为唯一的演示和上下文。这样的演示与评估示例之间用两个换行符连接,然后输入到模型中。
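
Prompt construction under this protocol can be sketched as below. The example strings and the `Q:`/`A:` layout are made up for illustration; the only details taken from the text are that demonstrations are drawn at random from the training set and joined to the evaluation example with two newlines.

```python
import random


def build_prompt(train_examples, eval_example, num_shots, rng):
    """Concatenate randomly drawn demonstrations with the evaluation example.

    num_shots = 0 gives the zero-shot prompt (the evaluation example alone).
    """
    demonstrations = rng.sample(train_examples, num_shots)
    return "\n\n".join(demonstrations + [eval_example])


# Hypothetical task data.
train = ["Q: capital of France?\nA: Paris", "Q: capital of Japan?\nA: Tokyo"]
query = "Q: capital of Italy?\nA:"

print(build_prompt(train, query, num_shots=1, rng=random.Random(0)))
```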

Benchmarks. To allow for an apples-to-apples comparison between GPT-3 and GLaM, we choose the same suite of evaluation tasks as Brown et al. (2020). But for simplicity, we exclude 7 synthetic tasks (arithmetic and word unscramble) and 6 machine translation datasets. With this exclusion, we end up with 29 datasets, which include 8 natural language generative (NLG) tasks and 21 natural language understanding (NLU) tasks. These datasets can be further grouped into 7 categories and are listed in Section A.

基准测试。为了使 GPT-3 和 GLaM 之间的比较具有可比性,我们选择了与 Brown 等人 (2020) 相同的评估任务套件。但为简化起见,我们排除了 7 个合成任务(算术和单词重组)和 6 个机器翻译数据集。经过这些排除后,我们最终得到了 29 个数据集,其中包括 8 个自然语言生成 (NLG) 任务和 21 个自然语言理解 (NLU) 任务。这些数据集可以进一步分为 7 类,并在 A 节中列出。

Natural Language Generative tasks. We compare the language sequences decoded by the models to the ground truth in generative tasks. These tasks are TriviaQA, NQS, WebQS, SQuADv2, LAMBADA, DROP, QuAC and CoQA. The performance is measured by the accuracy of exact match (EM) and F1 score, following the standard for each task in Brown et al. (2020). We use beam search with a width of 4 to generate the sequences.

自然语言生成任务。我们将模型解码的语言序列与生成任务中的真实结果进行比较。这些任务包括 TriviaQA、NQS、WebQS、SQuADv2、LAMBADA、DROP、QuAC 和 CoQA。性能通过精确匹配 (EM) 和 F1 分数的准确性来衡量,遵循 Brown 等人 (2020) 中每个任务的标准。我们使用宽度为 4 的束搜索 (beam search) 来生成序列。
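
Exact match and F1 for generative answers are commonly computed at the token level. The sketch below uses a deliberately simplified normalization (lowercasing and whitespace splitting); the per-task normalization actually used follows Brown et al. (2020) and is not reproduced here, so treat the details as assumptions.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a prediction and a reference answer."""
    pred_tokens, true_tokens = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(f1_score("the city of Paris", "Paris"))
```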

Natural Language Understanding tasks. Most language understanding tasks require the model to select one correct answer from multiple options. All binary classification tasks are formulated into the form of selecting among two options ('Yes' or 'No'). The prediction is based on the maximum log-likelihood of each option given the context, $\log P(\mathrm{option} \mid \mathrm{context})$, normalized by the token length of each option. On a few tasks, such as ReCoRD (Zhang et al., 2018) and COPA (Gordon et al., 2012), the non-normalized loss can yield better results and thus is adopted. Except for MultiRC (Khashabi et al., 2018) where the F1 metric over the set of answer options (referred to as $\mathrm{F}1_{a}$) is reported, the prediction accuracy metric is used for all the other tasks. We use the average of the scores reported in all datasets to report the overall few-shot performance of models on both NLG and NLU tasks. Both Accuracy (EM) and F1 scores have been normalized to lie between 0 and 100. On TriviaQA, we also report the testing server score of our one-shot submission.

自然语言理解任务。大多数语言理解任务要求模型从多个选项中选择一个正确答案。所有二元分类任务都被表述为在两个选项('是' 或 '否') 中进行选择。预测基于给定上下文的每个选项的最大对数似然 $\log P(\mathrm{option} \mid \mathrm{context})$,并根据每个选项的 token 长度进行归一化。在少数任务上,例如 ReCoRD (Zhang et al., 2018) 和 COPA (Gordon et al., 2012),非归一化的损失可以产生更好的结果,因此被采用。除了 MultiRC (Khashabi et al., 2018),其中报告了答案选项集上的 F1 度量(称为 $\mathrm{F}1_{a}$),其他所有任务都使用预测准确率度量。我们使用所有数据集中报告的分数的平均值来报告模型在 NLG 和 NLU 任务上的整体少样本性能。准确率 (EM) 和 F1 分数都已归一化到 0 到 100 之间。在 TriviaQA 上,我们也报告了我们单样本提交的测试服务器得分。
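
The option-selection rule can be sketched as follows, assuming per-token log-probabilities for each option have already been obtained from the model. The function name, the input layout, and the numbers in the usage example are hypothetical.

```python
def pick_option(option_token_logprobs, normalize=True):
    """Choose the option with the highest (optionally length-normalized) score.

    option_token_logprobs maps each option string to the list of log-probs
    of its tokens given the context.
    """
    def score(logprobs):
        total = sum(logprobs)  # log P(option | context)
        return total / len(logprobs) if normalize else total

    return max(option_token_logprobs, key=lambda o: score(option_token_logprobs[o]))


# Made-up log-probs: a one-token option and a two-token option.
options = {"Yes": [-1.0], "No way": [-0.6, -0.6]}
print(pick_option(options, normalize=True))
print(pick_option(options, normalize=False))
```

With normalization the longer option wins (-0.6 per token versus -1.0); without it the shorter option wins (-1.0 versus -1.2), which shows why the choice of normalization can change results on tasks such as ReCoRD and COPA.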

6. Results

6. 结果

We conduct extensive evaluation on the whole family of GLaM models, to show the advantages of sparsely activated models in language modeling and their scaling trends. We also quantitatively inspect the effectiveness of data quality for language model training.

我们对整个 GLaM 模型家族进行了广泛的评估,以展示稀疏激活模型在语言建模方面的优势及其扩展趋势。我们还定量检验了数据质量对语言模型训练的有效性。

6.1. Comparison between MoE and Dense Models

6.1. MoE模型与密集模型的比较

As previously presented in Table 1, GLaM (64B/64E) has competitive performance compared to GPT-3 (175B) for zero, one and few-shot learning. Figure 1 compares the performance for each category of tasks. In total, GLaM (64B/64E) outperforms GPT-3 in 6 out of 7 categories on average, indicating the performance gain is consistent. For more details on each individual task, see Table 11. We include results on the much larger and computationally demanding Megatron-NLG and Gopher for reference. More importantly, as shown in Table 4, GLaM (64B/64E) activates roughly 96.6B parameters per token during inference, which requires only half of the compute FLOPs needed by GPT-3 given the same input.

如之前在表 1 中所示,GLaM (64B/64E) 在零样本、单样本和少样本学习方面与 GPT-3 (175B) 具有竞争力。图 1 比较了每个任务类别的性能。总体而言,GLaM (64B/64E) 在 7 个类别中的 6 个类别中平均优于 GPT-3,表明性能提升是一致的。有关每个单独任务的更多详细信息,请参见表 11。我们还包含了计算需求更大的 Megatron-NLG 和 Gopher 的结果以供参考。更重要的是,如表 4 所示,GLaM (64B/64E) 在推理过程中每个 Token 激活大约 96.6B 参数,这仅需要 GPT-3 处理相同输入所需计算 FLOPs 的一半。

We highlight one particularly challenging open-domain question answering task: TriviaQA. In open-domain question answering tasks, the model is required to directly answer a given query without access to any additional context. Brown et al. (2020) show that the few-shot performance on TriviaQA grows smoothly with model size, indicating that a language model is able to absorb knowledge using its model capacity. As shown in Table 5, GLaM (64B/64E) is better than the dense model and outperforms the previous finetuned state-of-the-art (SOTA) on this dataset in the open-domain setting. Our one-shot result exceeds the previous finetuned SOTA (Yu et al., 2022), which infuses additional knowledge graph information, by 8.6%, and outperforms the few-shot GPT-3 on the testing server by 5.3%. This suggests that the additional capacity of GLaM plays a crucial role in the performance gain even though the $n_{\mathrm{act\text{-}params}}$ of GLaM (64B/64E) is only half of that in GPT-3. Compared to Switch-C, even though both models have a similar total number of parameters, GLaM (64B/64E) uses much larger experts (beyond one TPU core). Therefore, GLaM's one-shot performance on TriviaQA is also better than the fine-tuned results of Switch-C in the open-domain setting. Finally, we report zero, one and few-shot evaluation mainly on the development set for all tasks in Tables 11, 12, 13 and 14 of the appendix.

我们强调一个特别具有挑战性的开放域问答任务:TriviaQA。在开放域问答任务中,模型需要直接回答给定的查询,而无需访问任何额外的上下文。Brown 等 (2020) 显示 TriviaQA 的少样本性能能够随着模型规模平滑增长,这表明语言模型能够利用其模型容量吸收知识。如表 5 所示,GLaM (64B/64E) 优于密集模型,并在此数据集的开放域设置中超越了之前的微调最先进水平 (SOTA)。我们的单样本结果比之前注入了额外知识图谱信息的微调 SOTA (Yu 等, 2022) 高出 8.6%,并且在测试服务器上比少样本 GPT-3 高出 5.3%。这表明即使 GLaM (64B/64E) 的 $n_{\mathrm{act\text{-}params}}$ 仅为 GPT-3 的一半,GLaM 的额外容量在性能提升中仍起着关键作用。与 Switch-C 相比,尽管两个模型的总参数数量相似,但 GLaM (64B/64E) 使用的专家规模远大于 Switch-C(超出一个 TPU 核心)。因此,GLaM 在 TriviaQA 上的单样本性能也优于 Switch-C 在开放域设置中的微调结果。最后,我们在附录的表 11、12、13 和 14 中报告了所有任务的零样本、单样本和少样本评估,主要基于开发集。
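
The 8.6% and 5.3% figures read most naturally as relative improvements rather than absolute point differences. The short sketch below, which is an illustration and not part of the paper, recomputes them from the Table 5 scores. The pairing of scores (GLaM dev versus KG-FiD test, GLaM test versus GPT-3 64-shot test) is an assumption inferred from which pairs reproduce the quoted numbers; the function name `relative_gain_pct` is hypothetical.

```python
def relative_gain_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# Scores copied from Table 5 (TriviaQA, open-domain setting, wiki split).
glam_one_shot_dev = 75.8
glam_one_shot_test = 75.0
kg_fid_finetuned_test = 69.8
gpt3_64_shot_test = 71.2

# Assumed pairings; they are chosen because they match the quoted figures.
gain_over_kg_fid = relative_gain_pct(glam_one_shot_dev, kg_fid_finetuned_test)
gain_over_gpt3 = relative_gain_pct(glam_one_shot_test, gpt3_64_shot_test)

print(f"vs. KG-FiD: {gain_over_kg_fid:.1f}%")
print(f"vs. GPT-3 64-shot: {gain_over_gpt3:.1f}%")
```

In absolute terms the same gaps are 6.0 and 3.8 accuracy points, which is why the distinction matters when comparing against other reported results.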

6.2. Effect of Data Quality

6.2. 数据质量的影响

We study the impact of data quality on the few-shot performance of downstream tasks. We use a modest-size GLaM model (1.7B/64E) to show the effectiveness of filtering text on model quality. We train models with the same hyperparameters on two datasets. One is the original dataset described in Section 3 and the second consists of the dataset with the filtered webpages replaced with the unfiltered webpages. The mixing proportions are fixed as given in Table 3.

我们研究了数据质量对下游任务少样本性能的影响。我们使用中等规模的 GLaM 模型 (1.7B/64E) 来展示过滤文本对模型质量的有效性。我们在两个数据集上使用相同的超参数训练模型。一个是第 3 节描述的原始数据集,第二个是由未过滤网页替换过滤网页组成的数据集。混合比例固定为表 3 中给出的比例。

Table 5. GLaM (64B/64E) one-shot performance significantly outperforms prior SOTAs for open domain settings in the wiki split.

表 5. GLaM (64B/64E) 单样本性能在 wiki 分割的开放域设置中显著优于之前的最先进模型 (SOTA)。

模型 TriviaQA (开放域)
KG-FiD (large) (Yu et al., 2022) (微调,测试) 69.8
Switch-C (微调,开发) 47.5
GPT-3 单样本 (开发) 68.0
GPT-3 64 样本 (测试) 71.2
GLaM 单样本 (测试) 75.0
GLaM 单样本 (开发) 75.8

The filtered webpages consist of 143B tokens whereas the unfiltered webpages consist of around 7T tokens.

过滤后的网页包含 143B 个 Token,而未过滤的网页则包含大约 7T 个 Token。

Figure 3 (c) and (d) show that the model trained on filtered data performs consistently better on both NLG and NLU tasks. In particular, the effect of filtering is bigger on NLG than on NLU. Perhaps this is because NLG often requires generating high-quality language, and filtered pretraining corpora are crucial to the generation capability of language models. Our study highlights the fact that the quality of the pretraining data also plays a critical role, specifically in the performance of downstream tasks.

图 3 (c) 和 (d) 显示,在过滤后的数据上训练的模型在自然语言生成 (NLG) 和自然语言理解 (NLU) 任务上表现一致更好。特别是,过滤对 NLG 的影响比对 NLU 的影响更大。这可能是因为 NLG 通常需要生成高质量的语言,而过滤后的预训练语料库对语言模型的生成能力至关重要。我们的研究突显了预训练数据的质量也在下游任务的表现中起着关键作用。

6.3. Scaling Studies

6.3. 扩展研究

Scaling up dense language models generally involves making the models deeper by adding more layers, and wider by increasing the embedding dimension of token representations. This process increases the total number of parameters $n_{\mathrm{params}}$ of the model. For each prediction on a given input example, these models are 'dense' in that all $n_{\mathrm{params}}$ parameters will be activated, i.e., $n_{\mathrm{params}}=n_{\mathrm{act\text{-}params}}$ in Table 4. Therefore, the effective FLOPs per prediction increases linearly with the model size $n_{\mathrm{params}}$. While the increased FLOPs may lead to boosted predictive performance, it also raises the overall cost per prediction.

扩大密集型语言模型通常涉及通过增加更多层使模型更深,通过增加 Token 表示的嵌入维度使模型更宽。这个过程增加了模型的总参数数量 $n_{\mathrm{params}}$。对于每个给定输入示例的预测,这些模型是“密集”的,即所有 $n_{\mathrm{params}}$ 参数都会被激活,也就是说,在表 4 中 $n_{\mathrm{params}}=n_{\mathrm{act\text{-}params}}$。因此,每次预测的有效浮点运算次数 (FLOPs) 随模型大小 $n_{\mathrm{params}}$ 线性增加。虽然增加的 FLOPs 可能会提高预测性能,但也提高了每次预测的总体成本。

In contrast, GLaM MoE models are sparsely activated in that only a small fraction of the total $n_{\mathrm{params}}$ parameters will be activated for each prediction, where $n_{\mathrm{params}}\gg n_{\mathrm{act\text{-}params}}$. Therefore, GLaM MoE models can scale by also growing the size or number of experts in the MoE layer.

相比之下,GLaM MoE 模型是稀疏激活的,即每次预测中只有总 $n_{\mathrm{params}}$ 参数的一小部分会被激活,其中 $n_{\mathrm{params}} \gg n_{\mathrm{act\text{-}params}}$。因此,GLaM MoE 模型可以通过增加 MoE 层中的专家数量或规模来进行扩展。
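
The gap between total and activated parameters can be made concrete with a small counting sketch. This is an illustration only: the function `moe_param_counts`, the split into "shared" and "per-expert" parameters, and all the numbers below are hypothetical and are not GLaM's actual configuration.

```python
from typing import Tuple


def moe_param_counts(
    shared: float, per_expert: float, num_experts: int, experts_per_token: int
) -> Tuple[float, float]:
    """Return (n_params, n_act_params) for a simplified MoE model.

    `shared` covers parameters every token uses (e.g. attention, embeddings);
    `per_expert` is the size of one expert. Units are arbitrary.
    """
    n_params = shared + num_experts * per_expert
    n_act_params = shared + experts_per_token * per_expert
    return n_params, n_act_params


# Hypothetical sparse model: 64 experts, 2 of them used per token.
sparse_total, sparse_active = moe_param_counts(10.0, 1.0, 64, 2)

# A dense model is the special case with a single, always-active expert.
dense_total, dense_active = moe_param_counts(10.0, 1.0, 1, 1)

print(sparse_total, sparse_active)  # total grows with experts, active does not
print(dense_total, dense_active)    # equal for a dense model
```

Adding experts raises `n_params` while leaving `n_act_params`, and therefore the FLOPs per prediction, unchanged, which is the scaling axis the paragraph above describes.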

As shown in Figure 3(a), the average zero, one and few-shot performance across the generative tasks scales well with the effective FLOPs per prediction, which is in turn determined by $n_{\mathrm{act\text{-}params}}$. We also find that GLaM MoE models perform consistently better than GLaM dense models for similar effective FLOPs per token. For language understanding tasks shown in Figure 3(b), the performance gain of GLaM MoE models has a similar scaling trend to that of the generative tasks. We observe that both MoE and dense models perform similarly at smaller scales but MoE models outperform at larger scales. We also show experiments with scaling the number of experts in Section B where we observe that, for a fixed budget of computation per prediction, adding more experts generally leads to better predictive performance.

如图 3(a) 所示,生成式任务的平均零样本、单样本和少样本性能随着每次预测的有效 FLOPs 增加而提升,这又由 $n_{\mathrm{act\text{-}params}}$ 决定。我们还发现,对于相似的每个 Token 的有效 FLOPs,GLaM MoE 模型的表现始终优于 GLaM 密集模型。对于图 3(b) 中所示的语言理解任务,GLaM MoE 模型的性能提升与生成式任务具有类似的扩展趋势。我们观察到,在较小规模上,MoE 模型和密集模型表现相似,但在较大规模上,MoE 模型表现更优。我们在附录 B 中展示了增加专家数量的实验,结果表明,在固定每次预测的计算预算下,增加更多的专家通常会带来更好的预测性能。



Figure 3. Average zero, one and few-shot performance of GLaM MoE models versus GLaM dense models for similar effective FLOPs per token over the 8 NLG tasks (a) and 21 NLU tasks (b). Comparison of model performance with filtered and unfiltered training data using GLaM (1.7B/64E). Filtered data improves results significantly over unfiltered data for both (c) NLG and (d) NLU tasks across zero, one and few-shot settings.

图 3. GLaM MoE 模型与 GLaM dense 模型在相似的每 Token 有效 FLOPs 下,针对 8 个 NLG 任务 (a) 和 21 个 NLU 任务 (b) 的平均零样本、单样本和少样本性能。使用 GLaM (1.7B/64E) 对过滤和未过滤训练数据的模型性能进行比较。对于零样本、单样本和少样本设置,过滤后的数据在 (c) NLG 和 (d) NLU 任务上显著优于未过滤数据。

6.4. Efficiency of GLaM

6.4. GLaM 的效率

Existing large dense language models usually require tremendous amounts of computation resources for training and serving (Patterson et al., 2021). They also need to consume massive amounts of pretraining data. We investigate the data and compute efficiency of the proposed GLaM models.

现有的大型密集语言模型通常需要大量的计算资源来进行训练和推理服务 (Patterson et al., 2021)。它们还需要消耗大量的预训练数据。我们研究了所提出的 GLaM 模型的数据和计算效率。

Data Efficiency. Figure 4 (a-c) and Figure 4 (e-g) show the learning curves of our models compared to the dense baselines of similar effective FLOPs in both NLG and NLU tasks. The x-axis is the number of tokens used in training, where we explicitly include GPT-3's results when it is around 300B tokens. We first observe that GLaM MoE models require significantly less data than dense models of comparable FLOPs to achieve similar zero, one, and few-shot performance. In other words, when the same amount of data is used for training, MoE models perform much better, and the difference in performance becomes larger when training up to 630B. Moreover, the GLaM (64B/64E) model trained with 280B tokens outperforms GPT-3 trained with 300B tokens by large margins on 4 out of the 6 learning settings (zero-shot/one-shot NLU and one-shot/few-shot NLG), and matches GPT-3 scores for the remaining setting, i.e., zero-shot NLG tasks.

数据效率。图 4 (a-c) 和图 4 (e-g) 显示了我们的模型与具有相似有效 FLOPs 的密集基线模型在 NLG 和 NLU 任务中的学习曲线。x 轴是训练中使用的 Token 数量,其中我们明确包含了 GPT-3 在大约 300B Token 时的结果。我们首先观察到,GLaM MoE 模型相比具有相似 FLOPs 的密集模型,需要显著更少的数据来实现类似的零样本、单样本和少样本性能。换句话说,当使用相同数量的数据进行训练时,MoE 模型表现得更好,并且在训练达到 630B 时,性能差异变得更大。此外,使用 280B Token 训练的 GLaM (64B/64E) 模型在 6 种学习设置中的 4 种(零样本/单样本 NLU 和单样本/少样本 NLG)上大幅超过了使用 300B Token 训练的 GPT-3,在其余设置即零样本 NLG 任务中,GLaM 的得分与 GPT-3 相当。

Computation Efficiency & Energy Consumption. Figure 4 (d) and Figure 4 (h) show how the average zero, one and few-shot performance scales with the number of TPU years spent training MoE and dense models. We find that to achieve similar performance on downstream tasks, training sparsely activated models takes much less computational resources than training dense models.

计算效率与能耗。图 4 (d) 和图 4 (h) 显示了 MoE 模型和密集模型的平均零样本、单样本和少样本性能如何随着训练所花费的 TPU 年数的变化而变化。我们发现,要实现类似的下游任务性能,训练稀疏激活模型所需的计算资源比训练密集模型少得多。

As previously presented in Table 1, the GLaM (64B/64E) training after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach scores similar to (and slightly exceeding) those of GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh, or 1/6 of the GPT-3 energy cost. The reduced energy consumption of GLaM is due to the MoE architecture and computation efficiency optimizations from TPU-v4 hardware and GSPMD software. Energy calculations can be found in Section F.

如之前表 1 所示,GLaM (64B/64E) 在 600B tokens 训练后消耗 456 MWh,约为 GPT-3 所用 1287 MWh 能源成本的 1/3。此外,为了达到与 GPT-3 相似(并略微超过)的分数,我们使用 1,024 个 TPU-v4 芯片训练了 574 小时(包含 280B tokens)。这消耗了 213 MWh 或 GPT-3 能源成本的 1/6。GLaM 的能源消耗减少是由于 MoE 架构和来自 TPU-v4 硬件及 GSPMD 软件的计算效率优化。能源计算详见附录 F。
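
The energy ratios quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; the implied average power per chip is a derived quantity that the source does not report, and it presumably folds in datacenter overhead, so it should be read as a rough sanity check rather than a hardware specification.

```python
# Figures stated in the text.
gpt3_mwh = 1287.0
glam_600b_mwh = 456.0   # GLaM (64B/64E) after 600B tokens
glam_280b_mwh = 213.0   # GLaM (64B/64E) after 280B tokens
chips = 1024
hours = 574

ratio_600b = glam_600b_mwh / gpt3_mwh   # roughly one third
ratio_280b = glam_280b_mwh / gpt3_mwh   # roughly one sixth

# Derived, not from the source: average draw per chip implied by 213 MWh.
chip_hours = chips * hours
implied_kw_per_chip = glam_280b_mwh * 1000.0 / chip_hours

print(f"600B-token run: {ratio_600b:.3f} of GPT-3 energy")
print(f"280B-token run: {ratio_280b:.3f} of GPT-3 energy")
print(f"{chip_hours} chip-hours, about {implied_kw_per_chip:.2f} kW per chip")
```

The first ratio matches the -64.6% entry in Table 1, and the second is close to the stated 1/6.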

7. Ethics and Unintended Biases

7. 伦理和非预期偏差

Large language models’ zero- and few-shot inference is an exciting capability: being able to control model behaviour intuitively with natural language and small datasets significantly lowers the barrier to prototyping and the development of new applications; it has the potential to help democratise the use of AI by dramatically decreasing the need for specialist knowledge. However, such opportunities also serve to highlight the importance of the many ethical challenges (Leidner & Plachouras, 2017; Bender et al., 2021; Bommasani et al., 2021) including representation bias (Blodgett et al., 2020), proper selection and handling of training data (Rogers, 2021) and its documentation (Bender & Friedman, 2018), privacy (Abadi et al., 2016b; Carlini et al., 2020), and environmental concerns (Strubell et al., 2019; Patterson et al., 2021). An important strand of this research focuses on unintended biases learnt by language models, including correlations between gender and profession (Bolukbasi et al., 2016; Rudinger et al., 2018; Zhao et al., 2018), negative sentiment about racial and religious groups (Li et al., 2020; Nadeem et al., 2021), and about people with disabilities (Hutchinson et al., 2020), as well as other social biases (Caliskan et al., 2017; Rudinger et al., 2017; Sap et al., 2020; Sotnikova et al., 2021). While measuring and mitigating the potential harm of language models is a very active area of research, as recognized by Blodgett et al. (2021) and Jacobs & Wallach (2021), there is still a significant need for more rigorous evaluation methods to assess the degree to which language models encode harmful stereotypes (May et al., 2019; Webster et al., 2021).

大语言模型的零样本和少样本推理能力令人兴奋:能够通过自然语言和小数据集直观地控制模型行为,大大降低了原型设计和开发新应用的门槛;它有可能通过大幅减少对专业知识的需求来帮助普及使用 AI。然而,这些机会也凸显了许多伦理挑战的重要性 (Leidner & Plachouras, 2017; Bender et al., 2021; Bommasani et al., 2021),包括代表性偏差 (Blodgett et al., 2020)、训练数据的适当选择和处理 (Rogers, 2021) 及其文档 (Bender & Friedman, 2018)、隐私 (Abadi et al., 2016b; Carlini et al., 2020) 和环境问题 (Strubell et al., 2019; Patterson et al., 2021)。这一研究的重要方向之一是关注语言模型无意中学习到的偏见,包括性别与职业之间的相关性 (Bolukbasi et al., 2016; Rudinger et al., 2018; Zhao et al., 2018),对种族和宗教群体的负面情绪 (Li et al., 2020; Nadeem et al., 2021),以及对残疾人的态度 (Hutchinson et al., 2020),以及其他社会偏见 (Caliskan et al., 2017; Rudinger et al., 2017; Sap et al., 2020; Sotnikova et al., 2021)。尽管测量和减轻语言模型潜在危害是一个非常活跃的研究领域,正如 Blodgett et al. (2021); Jacobs & Wallach (2021) 所认识到的那样,仍然迫切需要更严格的评估方法来评估语言模型编码有害刻板印象的程度 (May et al., 2019; Webster et al., 2021)。


Figure 4. Learning efficiency comparison. Average zero-shot, one-shot and few-shot performance of GLaM MoE models versus GLaM dense models as more tokens are processed during training for 9 NLG tasks (a-c) and 21 NLU tasks (e-g). Panels (d) and (h) also display the learning curves against the number of TPU years, respectively.


图 4: 学习效率对比。GLaM MoE 模型与 GLaM 密集模型在训练过程中处理更多 Token 时,针对 9 个 NLG 任务 (a-c) 和 21 个 NLU 任务 (e-g) 的平均零样本、单样本和少样本性能。面板 (d) 和 (h) 分别显示了学习曲线与 TPU 年数的关系。

While there is not yet consensus on measurement methods or criteria for such general purpose large language models, the versatility and power of these models make it important to assess them on a range of metrics. We take inspiration from GPT-3 (Brown et al., 2020) and examine the co-occurrence in generated text referencing identity terms as well as report on the WinoGender benchmark (Rudinger et al., 2018). We also analyse toxicity degeneration similarly to Gopher (Rae et al., 2021), and extend the analysis to consider the human behavioral baseline.

虽然目前还没有就这些通用大语言模型的测量方法或标准达成共识,但这些模型的多功能性和强大性能使得对其在多个指标上进行评估变得非常重要。我们从 GPT-3 (Brown 等, 2020) 中获得灵感,考察生成文本中身份术语的共现情况,并报告 WinoGender 基准测试 (Rudinger 等, 2018) 的结果。我们还类似地分析了毒性退化问题,如同 Gopher (Rae 等, 2021),并将分析扩展到考虑人类行为基线。

7.1. Co-occurrence prompts

7.1. 共现提示 (Co-occurrence prompts)

Following the procedure described in Brown et al. (2020), we analyze commonly co-occurring words in the continuations when given prompts like “{term} was very...”, where the substituted term references gender, religion, or racial and ethnic identity. For each prompt (Table 7 of the appendix), 800 outputs are generated using top-$k$ sampling ($k=40$) with a temperature of 1. An off-the-shelf POS tagger (Bird & Loper, 2004) is used to remove stop words and select only descriptive words (i.e., adjectives and adverbs). Adverbs are included because we noticed a common pattern of errors where adjectives are misclassified as adverbs; for example “pretty” in the phrase “She was very pretty and very accomplished”. Like Brown et al. (2020), to make the analysis transparent and easily reproducible, we omit any manual human labeling.

按照 Brown 等 (2020) 描述的程序,我们分析了在给定提示如“{term} was very...”(其中替换的术语涉及性别、宗教、种族和民族身份)时,续写中常见的共现词。对于每个提示(附录中的表 7),使用 top-$k$ 抽样 ($k=40$) 生成 800 个输出,温度为 1。使用现成的 POS 标记器 (Bird & Loper, 2004) 删除停用词并仅选择描述性词汇(即形容词和副词)。包括副词是因为我们注意到一个常见的错误模式,即形容词被误分类为副词;例如短语“She was very pretty and very accomplished”中的“pretty”。与 Brown 等 (2020) 类似,为了使分析透明且易于重现,我们省略了任何人工标注。
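
For readers unfamiliar with the decoding setup, the following is a minimal sketch of top-$k$ sampling with a temperature, written in plain Python. It is a generic illustration of the technique and not the authors' decoding code; the function name `sample_top_k` and the toy logits are assumptions made for the example.

```python
import math
import random
from typing import Optional, Sequence


def sample_top_k(
    logits: Sequence[float],
    k: int = 40,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token id from the k highest-scoring entries of `logits`."""
    rng = rng or random.Random()
    # Keep only the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scale, then apply a numerically stable softmax.
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalises the weights internally.
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy vocabulary of 100 tokens where token i has logit i / 10.
toy_logits = [i / 10.0 for i in range(100)]
rng = random.Random(0)
samples = [sample_top_k(toy_logits, k=40, temperature=1.0, rng=rng) for _ in range(800)]
print(min(samples), max(samples))
```

With `k=40` every sampled id falls among the 40 highest-logit tokens, and a temperature of 1 leaves the relative probabilities within that set unchanged.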

Like the analysis of other large language models that we build on, we note associative biases for all dimensions are obvious, for example “pretty” is the most associated description for the term “She”, while it is not in the top-10 for the term “He”. Table 8 shows the most frequently occurring descriptive words in response to prompt-templates for gendered pronouns, and Tables 9 and 10 of the appendix show the same for race and religion prompts.

与其他我们所构建的大语言模型的分析类似,我们注意到所有维度的关联偏见都很明显,例如“pretty”是与术语“She”最相关的描述,而它不在术语“He”的前十位相关描述中。表 8 显示了对性别代词提示模板做出响应时最常出现的描述性词汇,附录中的表 9 和表 10 分别显示了种族和宗教提示的相同内容。

7.2. WinoGender

7.2. WinoGender

WinoGender 是一项用于评估模型在性别偏差方面的表现的任务。它通过构造特定的句子来测试模型是否会在职业或其他社会角色上产生性别偏见。这项任务强调了在自然语言处理中考虑性别平等的重要性。

Coreference resolution is a capability that many applications require to perform well, including machine translation (Stanovsky et al., 2019; Webster & Pitler, 2020) and question answering (Lamm et al., 2020). To assess whether gendered correlations in GLaM cause it to make coreference errors in the one-shot setting, we measure WinoGender (Rudinger et al., 2018). GLaM (64B/64E) achieves a new state-of-the-art of 71.7% on the full dataset (compared to 64.2% for GPT-3 (Brown et al., 2020)). Promisingly, accuracy is remarkably close between ‘he’ examples (70.8%) and ‘she’ examples (72.5%), as well as between stereotypical examples (where the intended distribution is assumed to be close to the US occupation statistics (Rudinger et al., 2018)) and anti-stereotypical (or ‘gotcha’) examples (both 71.7%).

共指消解是许多应用程序要表现良好所需的一项功能,包括机器翻译 (Stanovsky et al., 2019; Webster & Pitler, 2020) 和问答 (Lamm et al., 2020)。为了评估 GLaM 中的性别相关性是否导致其在单样本设置中出现共指错误,我们测量了 WinoGender (Rudinger et al., 2018)。GLaM (64B/64E) 在整个数据集上达到了新的最先进水平 71.7%(相比之下,GPT-3 (Brown et al., 2020) 的成绩为 64.2%)。令人鼓舞的是,“he” 示例 (70.8%) 和 “she” 示例 (72.5%) 的准确率非常接近,刻板印象示例(假设预期分布接近美国职业统计数据,(Rudinger et al., 2018))和反刻板印象(或“gotcha”)示例(均为 71.7%)之间的准确率也非常接近。


Figure 5. The relationship between the Toxicity Probability of the Prompt (TPP), and the Toxicity Probability of the Continuation (TPC). Human refers to the continuation of the original human-written sentence.

图 5. 提示的毒性概率 (TPP) 与续写内容的毒性概率 (TPC) 之间的关系。Human 指的是原始人类编写句子的续写。

7.3. Toxicity Degeneration

7.3. 毒性退化 (Toxicity Degeneration)

Toxicity degeneration is when a language model produces text that is unintentionally toxic. To evaluate toxicity degeneration, we adapt the methodology used in Welbl et al. (2021) and Rae et al. (2021). We use the RealToxicityPrompts dataset (Gehman et al., 2020), which consists of sentences that have been split into two parts: a prompt prefix and a continuation postfix. Like the previous studies, we also use the Perspective API, which assigns a probability that the text would be considered to be rude, disrespectful or otherwise likely to make people want to leave a conversation. We then assess how likely a continuation is to be toxic given various likelihoods that the prompt was toxic.

毒性退化是指大语言模型生成的文本无意中具有攻击性。为了评估毒性退化,我们采用了 (Welbl et al., 2021; Rae et al., 2021) 中使用的方法。我们使用 Real Toxicity Prompts 数据集 (Gehman et al., 2020),该数据集由句子拆分为两部分组成:提示前缀和延续后缀。与之前的研究一样,我们也使用 Perspective API,该 API 会赋予文本一个被认为是粗鲁、不尊重或可能使人们想要离开对话的概率。然后我们评估在给定提示具有不同毒性概率的情况下,延续文本具有毒性的可能性。

For each of 10K randomly sampled prompts, we generate 25 continuations, with up to 100 tokens per continuation, using top-$k$ sampling ($k=40$) with a temperature of 1. The Perspective API requires a non-empty string, therefore we assign a toxicity score of 0.0 when the continuation is the empty string; this could represent, for example, a chatbot simply refusing to respond.

对于 10K 个随机采样的提示中的每一个,我们生成 25 个续写,每个续写最多 100 个 Token,使用 top-$k$ 抽样 ($k=40$),温度为 1。Perspective API 需要非空字符串,因此当续写为空字符串时,我们分配毒性得分为 0.0;这可以表示例如聊天机器人直接拒绝响应。
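
The scoring procedure can be sketched as follows. This is a hedged illustration, not the authors' pipeline: `scorer` is a hypothetical stand-in for a call to a toxicity classifier such as the Perspective API, the per-prompt averaging over continuations and the ten equal-width TPP bins are assumptions, and only the empty-string rule comes directly from the text.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def continuation_toxicity(text: str, scorer: Callable[[str], float]) -> float:
    """Score one continuation; empty strings get 0.0 without calling the scorer."""
    if text == "":
        return 0.0
    return scorer(text)


def tpc_by_tpp_bin(
    samples: Sequence[Tuple[float, List[str]]],
    scorer: Callable[[str], float],
    num_bins: int = 10,
) -> Dict[int, float]:
    """Average continuation toxicity (TPC) within each prompt-toxicity (TPP) bin.

    `samples` holds (tpp, continuations) pairs, with tpp in [0, 1].
    """
    per_bin: Dict[int, List[float]] = {}
    for tpp, continuations in samples:
        bin_index = min(int(tpp * num_bins), num_bins - 1)
        scores = [continuation_toxicity(c, scorer) for c in continuations]
        per_bin.setdefault(bin_index, []).append(mean(scores))
    return {b: mean(values) for b, values in sorted(per_bin.items())}


# Toy scorer for demonstration only; a real run would query a classifier.
def toy_scorer(text: str) -> float:
    return 0.9 if "bad" in text else 0.1


toy_samples = [(0.05, ["", "fine"]), (0.95, ["bad", "bad words"])]
result = tpc_by_tpp_bin(toy_samples, toy_scorer)
print(result)
```

Short-circuiting on the empty string keeps the rule explicit and avoids sending an input the API would reject.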

Figure 5 shows the relationship between the Toxicity Probability of the Prompt (TPP), and the Toxicity Probability of the Continuation (TPC). Note that, for low TPP, the relatively high human TPC is due to the sampling strategy used to create the underlying dataset: sentences were selected across the toxicity spectrum. Moreover, toxicity can often be identified locally within a sentence, and toxicity in this dataset tends to occur later in the sentences. This causes the human TPC to drop slightly as the TPP increases. In contrast, it is noteworthy that the model's TPC closely follows TPP, reflecting the frequent observation that large language models are sometimes too strongly influenced by their prompt, e.g. repeating phrases from the prompt.

图 5 显示了提示的毒性概率 (TPP) 与续写部分的毒性概率 (TPC) 之间的关系。注意,对于较低的 TPP,相对较高的人类 TPC 是由于用于创建基础数据集的采样策略:句子是从毒性的不同范围内选择的。此外,毒性通常可以在句子内部局部识别,并且在这个数据集中,毒性往往出现在句子的后半部分。这导致随着 TPP 的增加,人类 TPC 稍微下降。相比之下,值得注意的是,模型的 TPC 密切跟随 TPP,反映了大语言模型有时过度受提示影响的常见观察,例如重复提示中的短语。

We also analysed the distribution of toxicity probabilities from the API for batches of 25 continuations. This highlighted that, even for low toxicity prompts, it is very likely that some generated continuation will be judged as toxic by most people reviewing it, according to the Perspective API's predicted probability; further details can be found in Figure 8. We also note that this dataset's sampling strategy, and the source it is taken from (Reddit) are likely not reflective of other domains. Moreover, even for very low TPP, applications are likely to want a much lower TPC: even generating 1 in 100 toxic suggestions is likely to be very problematic for applications.

我们还分析了从 API 获取的 25 个续写文本批次的毒性概率分布。这突显了即使对于低毒性提示,某些生成的续写也很可能被大多数审阅者认为是有毒的,根据 Perspective API 的预测概率;更多细节请参见图 8。我们还注意到,此数据集的采样策略及其来源(Reddit)可能无法反映其他领域的情况。此外,即使对于非常低的毒性提示概率 (TPP),应用程序也希望毒性内容比例 (TPC) 更低:即使每 100 条生成的内容中有 1 条是有毒的,也可能对应用程序造成很大问题。
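
To see why batches of 25 continuations make at least one toxic output likely, consider a simplified model in which each continuation is toxic independently with probability $p$. This independence assumption is ours, made purely for illustration; the paper's own analysis is in Figure 8. The function name `prob_at_least_one` is hypothetical.

```python
def prob_at_least_one(p: float, n: int = 25) -> float:
    """P(at least one toxic continuation among n), assuming independence."""
    return 1.0 - (1.0 - p) ** n


# Per-continuation toxicity rates to illustrate, including the 1-in-100 case.
for p in (0.01, 0.05, 0.10):
    print(f"p={p:.2f}: {prob_at_least_one(p):.3f}")
```

Under this toy model, even a 1-in-100 rate per continuation gives a bit more than a one-in-five chance that a batch of 25 contains a toxic sample.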

8. Discussion

8. 讨论

As observed in previous work on sparsely-activated models (Fedus et al., 2021), MoE models are more performant in knowledge-oriented tasks. Open-domain tasks are one way of measuring the amount of knowledge stored in a model. The performance of the MoE model in open-domain QA benchmarks such as TriviaQA demonstrates the significantly increased information capacity of these models compared to dense models of similar effective FLOPs. Despite the in-context learning and training efficiency advantages, the sparsely activated models consist of a higher number of parameters and thus require a larger number of devices. This limits the resource accessibility and increases the serving cost, especially when the serving traffic is low.

如之前关于稀疏激活模型 (Fedus et al., 2021) 的研究中所观察到的,MoE 模型在知识导向任务中表现更优。开放域任务是衡量模型中存储知识量的一种方式。MoE 模型在开放域 QA 基准测试(如 TriviaQA)中的表现展示了这些模型相比有效 FLOPs 相似的密集模型具有显著增加的信息容量。尽管在上下文学习和训练效率方面有优势,但稀疏激活模型包含更多的参数,因此需要更多的设备。这限制了资源的可访问性,并且当服务流量较低时,会增加服务成本。

9. Conclusions

9. 结论

We propose and develop a family of generalist language models called GLaM, which use a sparsely activated mixture-of-experts architecture to achieve better average scores than not only their dense counterparts of similar effective FLOPs, but also the GPT-3 models on 29 representative NLP tasks in zero, one and few-shot learning. In particular, GLaM (64B/64E), our largest 1.2 trillion parameter MoE language model, achieves better average performance with only one third of the energy consumption compared to training GPT-3. We hope that our work will encourage more research into methods for obtaining high-quality data, and using MoE for more efficient scaling of giant language models.

我们提出并开发了一类称为 GLaM 的通用语言模型,这些模型采用稀疏激活的专家混合架构,在零样本、单样本和少样本学习的 29 个代表性 NLP 任务中,不仅比有效 FLOPs 相似的密集模型,还比 GPT-3 模型取得了更好的平均分数。特别是,GLaM (64B/64E),我们最大的 1.2 万亿参数 MoE 语言模型,仅以三分之一的能耗就实现了比训练 GPT-3 更好的平均性能。我们希望我们的工作能够鼓励更多关于获取高质量数据的方法的研究,并利用 MoE 实现巨型语言模型更高效的扩展。

References

参考文献

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283, Savannah, GA, November 2016a. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., 和 Zheng, X. TensorFlow: 一个用于大规模机器学习的系统。在第 12 届 USENIX 操作系统设计与实现研讨会 (OSDI 16),页码 265-283,Savannah, GA,2016 年 11 月。USENIX 协会。ISBN 978-1-931971-33-1。URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Oct 2016b. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318.

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., 和 Zhang, L. 带差分隐私的深度学习 (Deep learning with differential privacy). Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016年10月. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318.

Adiwardana, D., Luong, M., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., and Le, Q. V. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020. URL https://arxiv.org/abs/2001.09977.

Adiwardana, D., Luong, M., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., 和 Le, Q. V. 朝着类人开放域聊天机器人迈进。CoRR, abs/2001.09977, 2020。URL https://arxiv.org/abs/2001.09977。

Bender, E. M. and Friedman, B. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587-604, 2018. doi: 10.1162/tacl_a_00041. URL https://aclanthology.org/Q18-1041.

Bender, E. M. 和 Friedman, B. 自然语言处理的数据声明:迈向缓解系统偏差和促进更好的科学. 计算语言学协会会刊,6:587-604,2018. doi: 10.1162/tacl_a_00041. URL https://aclanthology.org/Q18-1041.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp. 610-623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.

本德, E. M., 格布鲁, T., 麦克米伦-梅杰, A., 和 斯米切尔, S. 论随机鹦鹉的危险:语言模型能否过大? 见 2021 年 ACM 公平性、问责制和透明度会议论文集,FAccT '21,第 610-623 页,纽约,美国,2021。计算机械协会。ISBN 9781450383097。doi: 10.1145/3442188.3445922。URL https://doi.org/10.1145/3442188.3445922。

Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533-1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160.

Berant, J., Chou, A., Frostig, R., 和 Liang, P. 从问题-答案对进行 Freebase 的语义解析。在 Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,第 1533-1544 页,Seattle, Washington, USA,2013 年 10 月。Association for Computational Linguistics。URL https://aclanthology.org/D13-1160

Bird, S. and Loper, E. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214-217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/P04-3031.

伯德,S. 和 洛珀,E. NLTK: 自然语言工具包 (NLTK)。在 ACL 交互式海报和演示会议论文集,第 214-217 页,西班牙巴塞罗那,2004 年 7 月。计算语言学协会 (Association for Computational Linguistics). URL https://aclanthology.org/P04-3031.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., 和 Choi, Y. Piqa: 自然语言中的物理常识推理 (Reasoning about physical commonsense in natural language). 在第三十四届 AAAI 人工智能会议 (Thirty-Fourth AAAI Conference on Artificial Intelligence),2020。

Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454-5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.

布洛杰特,S. L.,巴罗卡斯,S.,多梅 III,H.,和华莱士,H. 语言(技术)即权力:对 NLP 中“偏见”的批判性综述。在第 58 届计算语言学协会年会论文集,第 5454-5476 页,在线,2020 年 7 月。计算语言学协会。doi: 10.18653/v1/2020.acl-main.485。URL https://aclanthology.org/2020.acl-main.485

Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1004-1015, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. URL https://aclanthology.org/2021.acl-long.81.

布洛杰特,S. L.,洛佩兹,G.,奥尔泰阿努,A.,西姆,R.,和华莱士,H. 刻板印象挪威三文鱼:公平性基准数据集中的陷阱清单。在第 59 届计算语言学协会年会和第 11 届国际自然语言处理联合会议(第 1 卷:长论文)的会议录中,第 1004-1015 页,在线,2021 年 8 月。计算语言学协会。doi: 10.18653/v1/2021.acl-long.81。URL https://aclanthology.org/2021.acl-long.81.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.

博卢克巴西,T.,张,K.-W.,邹,J. Y.,萨利格拉玛,V.,和卡拉伊,A. T. 男人之于计算机程序员如女人之于家庭主妇?消除词嵌入的偏差。在李,D.,杉山,M.,卢克斯堡,U.,盖永,I.,和加内特,R. (编),《神经信息处理系统进展》,第 29 卷。Curran Associates, Inc., 2016。URL https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., 和其他作者. 关于基础模型的机会和风险. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

布朗,T.,曼,B.,赖德,N.,苏比亚,M.,卡普兰,J. D.,达里瓦尔,P.,尼拉坎坦,A.,夏姆,P.,萨斯特里,G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., 和 Amodei, D. 语言模型是少样本学习者。在 Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., 和 Lin, H. (eds.), 神经信息处理系统进展,第 33 卷,页码 1877-1901。Curran Associates, Inc., 2020。URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf。

Caliskan, A., Bryson, J. J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183-186, Apr 2017. ISSN 1095-9203. doi: 10.1126/science.aal4230. URL http://dx.doi.org/10.1126/science.aal4230.

Caliskan, A., Bryson, J. J., 和 Narayanan, A. 从语言语料库中自动衍生的语义包含类似人类的偏见。Science, 356(6334):183-186, 2017 年 4 月。ISSN 1095-9203。doi: 10.1126/science.aal4230。URL http://dx.doi.org/10.1126/science.aal4230

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models. CoRR, abs/2012.07805, 2020.

卡林尼,N.,特拉默,F.,华莱士,E.,亚吉尔斯基,M.,赫伯特-沃斯,A.,李,K.,罗伯茨,A.,布朗,T. B.,宋,D.,厄林格松,U.,奥普雷亚,A.,和拉斐尔,C. 从大语言模型中提取训练数据。CoRR, abs/2012.07805, 2020.

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2174-2184, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1241. URL https://aclanthology.org/D18-1241.

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., 和 Zettlemoyer, L. QuAC: 上下文中的问题回答. In 第 2018 届实证方法在自然语言处理会议论文集,页码 2174-2184,布鲁塞尔,比利时,2018 年 10 月 - 11 月。计算语言学协会。doi: 10.18653/v1/D18-1241。URL https://aclanthology.org/D18-1241.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2924-2936, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., 和 Toutanova, K. BoolQ: 探索自然的 是/否 问题的惊人难度。在 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 中,第 2924-2936 页,明尼苏达州明尼阿波利斯,2019 年 6 月。Association for Computational Linguistics。doi: 10.18653/v1/N19-1300。URL https://aclanthology.org/N19-1300

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

Clark, K., Luong, M.-T., Le, Q. V., 和 Manning, C. D. Electra: 将文本编码器预训练为判别器而非生成器。arXiv preprint arXiv:2003.10555, 2020。

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.

克拉克,P., 科威,I., 艾特兹尼,O., 霍特,T., 萨巴瓦尔,A., 斯乔尼克,C., 和 塔夫乔德,O. 认为你已经解决了问答问题?试试 ARC,AI2 推理挑战。arXiv:1803.05457v1, 2018.

Dagan, I., Glickman, O., and Magnini, B. The pascal recognising textual entailment challenge. In Quinonero-Candela, J., Dagan, I., Magnini, B., and d'Alché-Buc, F. (eds.), Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pp. 177-190, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-33428-6.

Dagan, I., Glickman, O., 和 Magnini, B. Pascal 识别文本蕴含挑战. 在 Quinonero-Candela, J., Dagan, I., Magnini, B., 和 d'Alché-Buc, F. (eds.), 机器学习挑战. 评估预测不确定性,视觉对象分类,和识别文本蕴含,页码 177-190, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-33428-6.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf.

戴, A. M. 和 Le, Q. V. 半监督序列学习。在 Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., 和 Garnett, R. (编),《神经信息处理系统进展》,第 28 卷。Curran Associates, Inc., 2015。URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978-2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., 和 Salakhutdinov, R. Transformer-XL: 超越固定长度上下文的注意力语言模型 (Attentive language models beyond a fixed-length context)。在第 57 届计算语言学协会年会论文集 (Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics) 中,第 2978-2988 页,佛罗伦萨,意大利,2019 年 7 月。计算语言学协会 (Association for Computational Linguistics)。doi: 10.18653/v1/P19-1285。URL https://aclanthology.org/P19-1285。

Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In International conference on machine learning, pp. 933-941. PMLR, 2017.

达芬,Y. N.,范,A.,奥利,M.,和格朗热,D. 用门控卷积网络进行语言建模。在国际机器学习会议,第 933-941 页。PMLR,2017。

de Marneffe, M.-C., Simons, M., and Tonhauser, J. The commitment bank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23(2):107-124, Jul. 2019. doi: 10.18148/sub/2019.v23i2.601. URL https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601.

德马尔内夫,M.-C.,西蒙斯,M.,和顿豪泽,J. 承诺银行:研究自然发生话语中的投影。《意义与含义会议论文集》,23(2):107-124,2019年7月。doi: 10.18148/sub/2019.v23i2.601。URL https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601。

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

Devlin, J., Chang, M.-W., Lee, K., 和 Toutanova, K. BERT: 用于语言理解的深度双向 Transformer (deep bidirectional transformers) 预训练. 在 Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),2019.

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 2368-2378. Association for Computational Linguistics, 2019. doi:

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., 和 Gardner, M. DROP: 一个需要对段落进行离散推理的阅读理解基准。在 Burstein, J., Doran, C., 和 Solorio, T. (eds.), 第 2019 届北美计算语言学协会年会:人类语言技术会议论文集,NAACL-HLT 2019,明尼阿波利斯,明尼苏达州,美国,2019 年 6 月 2-7 日,第 1 卷(长篇和短篇论文),页码 2368-2378。计算语言学协会,2019。doi:

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.

Fedus, W., Zoph, B., 和 Shazeer, N. Switch Transformer:通过简单高效的稀疏性扩展到万亿参数模型. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.

Fyodorov, Y., Winter, Y., and Francez, N. A natural logic inference system. In Inference in Computational Semantics, 2000.

费奥多罗夫,Y.,温特,Y.,和弗朗塞兹,N. 一种自然逻辑推理系统。在《计算语义学中的推理》,2000。

Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Real toxicity prompts: Evaluating neural toxic degeneration in language models, 2020.

Gehman, S., Gururangan, S., Sap, M., Choi, Y., 和 Smith, N. A. 真实毒性提示:评估语言模型中的神经毒性退化,2020。

Gordon, A., Kozareva, Z., and Roemmele, M. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 394-398, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics. URL https://aclanthology.org/S12-1052.

戈登,A.,科扎雷瓦,Z.,和 罗梅尔,M. SemEval-2012 任务 7:选择可能的替代方案:常识因果推理的评估。在 *SEM 2012:第一届词汇与计算语义学联合会议 - 第 1 卷:主会议和共享任务的论文集,以及第 2 卷:第六届国际语义评估研讨会 (SemEval 2012) 论文集,第 394-398 页,加拿大蒙特利尔,2012 年 6 月 7-8 日。计算语言学协会。URL https://aclanthology.org/S12-1052。

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.

亨德里克斯,D. 和金佩尔,K. 桥接非线性单元和随机正则化器与高斯误差线性单元。CoRR, abs/1606.08415, 2016。URL http://arxiv.org/abs/1606.08415

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., 和 Zhou, Y. 深度学习的扩展是可预测的,从经验上看。CoRR, abs/1712.00409, 2017。URL http://arxiv.org/abs/1712.00409。

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790-2799. PMLR, 09-15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.

Houlsby, N., Giurgiu, A., Jastrzębski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., 和 Gelly, S. 参数高效的自然语言处理迁移学习。在 Chaudhuri, K. 和 Salakhutdinov, R. (编),第 36 届国际机器学习会议论文集,机器学习研究论文集第 97 卷,页码 2790-2799。PMLR, 2019 年 6 月 9-15 日。URL https://proceedings.mlr.press/v97/houlsby19a.html。

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M. X., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 103-112, 2019.

黄,Y., 程,Y., Bapna, A., Firat, O., 陈,D., 陈,M. X., 李,H., Ngiam, J., Le, Q. V., 吴,Y., 和 陈,Z. GPipe:使用管道并行性高效训练巨型神经网络。在 Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., 和 Garnett, R. (编),《神经信息处理系统进展第32卷:2019年神经信息处理系统会议论文集》,NeurIPS 2019,2019年12月8-14日,加拿大不列颠哥伦比亚省温哥华,第103-112页,2019。

Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., and Denuyl, S. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5491-5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.487. URL https://aclanthology.org/2020.acl-main.487.

Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., 和 Denuyl, S. 自然语言处理模型中的社会偏见对残疾人的障碍。在第 58 届计算语言学协会年会论文集,第 5491-5501 页,在线,2020 年 7 月。计算语言学协会。doi: 10.18653/v1/2020.acl-main.487。URL https://aclanthology.org/2020.acl-main.487

Jacobs, A. Z. and Wallach, H. Measurement and fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Mar 2021. doi: 10.1145/3442188.3445901. URL http://dx.doi.org/10.1145/3442188.3445901.

雅各布斯,A. Z. 和 瓦拉赫,H. 测量与公平性。2021 年 ACM 公平性、责任性和透明度会议论文集,2021 年 3 月。doi: 10.1145/3442188.3445901。URL http://dx.doi.org/10.1145/3442188.3445901。

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association for Computational Linguistics.

Joshi, M., Choi, E., Weld, D. S., 和 Zettlemoyer, L. TriviaQA: 一个大规模远程监督的阅读理解挑战数据集。在第 55 届计算语言学协会年会论文集,温哥华,加拿大,2017 年 7 月。计算语言学协会。

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., 和 Amodei, D. 神经语言模型的扩展规律. arXiv preprint arXiv:2001.08361, 2020.

Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252-262, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1023. URL https://aclanthology.org/N18-1023.

Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., 和 Roth, D. 不仅看表面:多句子阅读理解挑战集. 在 Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers),页码 252-262,New Orleans, Louisiana, 2018年6月. Association for Computational Linguistics. doi: 10.18653/v1/N18-1023. URL https://aclanthology.org/N18-1023.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., 和 Fidler, S. Skip-thought vectors. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., 和 Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.

Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.

Kudo, T. 和 Richardson, J. SentencePiece: 一种简单且与语言无关的子词分词器和反向分词器,用于神经文本处理。在 EMNLP,2018。

Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3577-3599, 2021.

库杜冈塔,S., 黄,Y, 巴普纳,A., 克里昆,M., 列皮欣,D., 龙,M.-T., 和菲拉特,O. 超越蒸馏:任务级混合专家以实现高效推理。在计算语言学协会会议发现:EMNLP 2021, 第 3577-3599 页,2021。

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., 和 Petrov, S. 自然问题:一个问答研究的基准。Transactions of the Association of Computational Linguistics,2019。

Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785-794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.

赖, G., 谢, Q., 刘, H., 杨, Y., 和 Hovy, E. RACE: 大规模考试阅读理解数据集 (Large-scale ReAding comprehension dataset from examinations). 见《2017年实证方法在自然语言处理中的应用会议论文集》,第 785-794 页,哥本哈根,丹麦,2017年9月。计算语言学协会。doi: 10.18653/v1/D17-1082。URL https://aclanthology.org/D17-1082

Lamm, M., Palomaki, J., Alberti, C., Andor, D., Choi, E., Soares, L. B., and Collins, M. QED: A framework and dataset for explanations in question answering. CoRR, abs/2009.06354, 2020. URL https://arxiv.org/abs/2009.06354.

Lamm, M., Palomaki, J., Alberti, C., Andor, D., Choi, E., Soares, L. B., 和 Collins, M. QED: 一个用于问题回答中解释的框架和数据集。CoRR, abs/2009.06354, 2020。URL https://arxiv.org/abs/2009.06354

Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In International conference on machine learning, 2014.

Le, Q. 和 Mikolov, T. 句子和文档的分布式表示 (Distributed representations of sentences and documents). 在国际机器学习会议 (International conference on machine learning),2014.

Leidner, J. L. and Plachouras, V. Ethical by design: Ethics best practices for natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 30-40, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-1604. URL https://aclanthology.org/W17-1604.

莱德纳,J. L. 和 普拉乔拉斯,V. 伦理设计:自然语言处理的伦理最佳实践。在第一届 ACL 自然语言处理伦理研讨会论文集,第 30-40 页,瓦伦西亚,西班牙,2017 年 4 月。计算语言学协会。doi: 10.18653/v1/W17-1604。URL https://aclanthology.org/W17-1604。

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., 和 Chen, Z. GShard: 通过条件计算和自动分片扩展巨型模型. 在 International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.

Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In 13th International Conference on the Principles of Knowledge Representation and Reasoning, KR 2012, Proceedings of the International Conference on Knowledge Representation and Reasoning, pp. 552-561. Institute of Electrical and Electronics Engineers Inc., 2012. ISBN 9781577355601. 13th International Conference on the Principles of Knowledge Representation and Reasoning, KR 2012; Conference date: 10-06-2012 Through 14-06-2012.

Levesque, H., Davis, E., 和 Morgenstern, L. Winograd 模式挑战。在第 13 届国际知识表示和推理原理会议,KR 2012,知识表示和推理国际会议论文集,第 552-561 页。电气和电子工程师协会,2012。ISBN 9781577355601。第 13 届国际知识表示和推理原理会议,KR 2012;会议日期:2012 年 6 月 10 日至 2012 年 6 月 14 日。

Li, T., Khashabi, D., Khot, T., Sabharwal, A., and Srikumar, V. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3475-3489, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.

李, T., Khashabi, D., Khot, T., Sabharwal, A., 和 Srikumar, V. 通过未指明问题揭示刻板偏见 (UNQOVERing stereotyping biases via underspecified questions). 在计算语言学协会会议发现:EMNLP 2020,第 3475-3489 页,在线,2020年11月。计算语言学协会。doi: 10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.

Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 2021.

Lieber, O., Sharir, O., Lenz, B.,和 Shoham, Y. Jurassic-1: 技术细节和评估。白皮书。AI21 Labs, 2021.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: 一种稳健优化的 BERT 预训练方法。 arXiv preprint arXiv:1907.11692, 2019.

May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 622-628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URL https://aclanthology.org/N19-1063.

梅, C., 王, A., Bordia, S., Bowman, S. R., 和 Rudinger, R. 论测量句子编码器中的社会偏见. 在《2019年北美计算语言学协会会议论文集:人类语言技术,第 1 卷(长文和短文)》,页码 622-628,明尼苏达州明尼阿波利斯,2019年6月。计算语言学协会。doi: 10.18653/v1/N19-1063。URL https://aclanthology.org/N19-1063

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.

米哈伊洛夫,T., 克拉克,P., 科特,T., 和 萨巴瓦尔,A. 一套盔甲可以导电吗?一个新的开放书问答数据集。在 EMNLP,2018。

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J. H., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, 2010.

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J. H., 和 Khudanpur, S. 基于循环神经网络的语言模型。在 INTERSPEECH, 2010。

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In Bengio, Y. and LeCun, Y. (eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781.

Mikolov, T., Chen, K., Corrado, G., 和 Dean, J. 词向量表示的高效估计。在 Bengio, Y. 和 LeCun, Y. (编),首届国际学习表征会议,ICLR 2013,美国亚利桑那州斯科茨代尔,2013年5月2-4日,工作坊论文集,2013。URL http://arxiv.org/abs/1301.3781

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839-849, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1098. URL https://aclanthology.org/N16-1098.

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., 和 Allen, J. 一个语料库和完形填空评估用于更深入理解常识故事。在《2016年北美计算语言学协会会议论文集:人类语言技术》,第 839-849 页,圣地亚哥,加利福尼亚,2016年6月。计算语言学协会。doi: 10.18653/v1/N16-1098。URL https://aclanthology.org/N16-1098

Nadeem, M., Bethke, A., and Reddy, S. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356-5371, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416.

Nadeem, M., Bethke, A., 和 Reddy, S. StereoSet: 测量预训练语言模型中的刻板偏见。在第 59 届计算语言学年会和第 11 届国际自然语言处理联合会议 (Volume 1: Long Papers) 的论文集上,页码 5356-5371,在线,2021年8月。计算语言学协会。doi: 10.18653/v1/2021.acl-long.416。URL https://aclanthology.org/2021.acl-long.416。

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernandez, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525-1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernandez, R. LAMBADA 数据集:需要广泛语篇上下文的单词预测。在第 54 届计算语言学协会年会论文集 (Volume 1: Long Papers),页码 1525-1534,柏林,德国,2016 年 8 月。计算语言学协会。doi: 10.18653/v1/P16-1144。URL https://aclanthology.org/P16-1144

Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., and Dean, J. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.

Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., 和 Dean, J. 碳排放与大型神经网络训练. arXiv preprint arXiv:2104.10350, 2021.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162.

Pennington, J., Socher, R., 和 Manning, C. GloVe: 全局向量用于词表示 (Global vectors for word representation)。在《2014年经验方法在自然语言处理中的应用会议 (EMNLP) 论文集》,第 1532-1543 页,卡塔尔多哈,2014年10月。计算语言学协会。doi: 10.3115/v1/D14-1162。URL https://aclanthology.org/D14-1162

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., 和 Zettlemoyer, L. 深度上下文化单词表示 (Deep contextualized word representations). arXiv预印本 arXiv:1802.05365, 2018.

Pilehvar, M. T. and Camacho-Collados, J. Wic: 10,000 example pairs for evaluating context-sensitive representations. ArXiv, abs/1808.09121, 2018.

Pilehvar, M. T. 和 Camacho-Collados, J. Wic: 10,000 示例对用于评估上下文敏感表示。ArXiv, abs/1808.09121, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2018. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., 和 Sutskever, I. 语言模型是无监督的多任务学习器。2018。URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf。

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, H. F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., van den Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S. M., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., de Masson d'Autume, C., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., de Las Casas, D., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B. A., Weidinger, L., Gabriel, I., Isaac, W. S., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, H. F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., van den Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S. M., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., de Masson d'Autume, C., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., de Las Casas, D., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B. A., Weidinger, L., Gabriel, I., Isaac, W. S., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., 和 Irving, G. 扩展语言模型:方法、分析与训练 Gopher 的见解. CoRR, abs/2112.11446, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., 和 Liu, P. J. 探索迁移学习的极限与统一的文本到文本 Transformer. J. Mach. Learn. Res., 21:140:1-140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for squad. In ACL, 2018.

Rajpurkar, P., Jia, R., 和 Liang, P. 知道你不知道的:Squad 的不可回答问题。在 ACL,2018。

Reddy, S., Chen, D., and Manning, C. D. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249-266, March 2019. doi: 10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016.

Reddy, S., Chen, D., 和 Manning, C. D. CoQA: 一个对话式问答挑战. 计算语言学协会会刊, 7:249-266, 三月 2019. doi:10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016.

Rogers, A. Changing the world by changing the data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2182-2194, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.170. URL https://aclanthology.org/2021.acl-long.170.

罗杰斯,A. 通过改变数据来改变世界。在第 59 届计算语言学年会和第 11 届国际自然语言处理联合会议 (Volume 1: Long Papers) 的论文集上,第 2182-2194 页,在线,2021 年 8 月。计算语言学协会。doi: 10.18653/v1/2021.acl-long.170。URL https://aclanthology.org/2021.acl-long.170

Rudinger, R., May, C., and Van Durme, B. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 74-79, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-1609. URL https://aclanthology.org/W17-1609.

Rudinger, R., May, C., 和 Van Durme, B. 社交偏见在引出的自然语言推理中的表现. 在第一届 ACL 自然语言处理伦理研讨会论文集,第 74-79 页,西班牙瓦伦西亚,2017 年 4 月。计算语言学协会。doi: 10.18653/v1/W17-1609。URL https://aclanthology.org/W17-1609

Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 8-14, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2002. URL https://aclanthology.org/N18-2002.

鲁丁格,R., 纳拉多夫斯基,J., 伦纳德,B., 和范杜尔梅,B. 性别偏见在共指消解中的影响。在《2018年北美计算语言学协会会议:人类语言技术论文集,第2卷(短篇)》,第8-14页,新奥尔良,路易斯安那州,2018年6月。计算语言学协会。doi: 10.18653/v1/N18-2002。URL https://aclanthology.org/N18-2002

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, pp. 8732-8740. AAAI Press, 2020.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., 和 Choi, Y. Winogrande: 大规模对抗性 Winograd 模式挑战. In AAAI, pp. 8732-8740. AAAI Press, 2020.

Sap, M., Gabriel, S., Qin, L., Jurafsky, D., Smith, N. A., and Choi, Y. Social bias frames: Reasoning about social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5477-5490, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.486. URL https://aclanthology.org/2020.acl-main.486.

Sap, M., Gabriel, S., Qin, L., Jurafsky, D., Smith, N. A., 和 Choi, Y. 社会偏见框架:关于语言的社会和权力含义的推理。 在第 58 届计算语言学协会年会论文集,第 5477-5490 页,在线,2020 年 7 月。计算语言学协会。doi: 10.18653/v1/2020.acl-main.486。URL https://aclanthology.org/2020.acl-main.486

Shazeer, N. Glu variants improve transformer, 2020.

Shazeer, N. Glu 变体改进了 Transformer,2020。

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. ArXiv, abs/1804.04235, 2018.

Shazeer, N. 和 Stern, M. Adafactor: 具有次线性内存成本的自适应学习率. ArXiv, abs/1804.04235, 2018.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., 和 Dean, J. 极大的神经网络:稀疏门控专家混合层。在第 5 届国际学习表征会议 (ICLR 2017),土伦,法国,2017 年 4 月 24-26 日,会议论文集。OpenReview.net,2017。URL https://openreview.net/forum?id=B1ckMDqlg。

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. Mesh-TensorFlow: Deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 10435-10444, Red Hook, NY, USA, 2018. Curran Associates Inc.

Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M. X., Jia, Y., Kannan, A., Sainath, T. N., Cao, Y., Chiu, C., He, Y., Chorowski, J., Hinsu, S., Laurenzo, S., Qin, J., Firat, O., Macherey, W., Gupta, S., Bapna, A., Zhang, S., Pang, R., Weiss, R. J., Prabhavalkar, R., Liang, Q., Jacob, B., Liang, B., Lee, H., Chelba, C., Jean, S., Li, B., Johnson, M., Anil, R., Tibrewal, R., Liu, X., Eriguchi, A., Jaitly, N., Ari, N., Cherry, C., Haghani, P., Good, O., Cheng, Y., Alvarez, R., Caswell, I., Hsu, W., Yang, Z., Wang, K., Gonina, E., Tomanek, K., Vanik, B., Wu, Z., Jones, L., Schuster, M., Huang, Y., Chen, D., Irie, K., Foster, G. F., Richardson, J., Macherey, K., Bruguier, A., Zen, H., Raffel, C., Kumar, S., Rao, K., Rybach, D., Murray, M., Peddinti, V., Krikun, M., Bacchiani, M., Jablin, T. B., Suderman, R., Williams, I., Lee, B., Bhatia, D., Carlson, J., Yavuz, S., Zhang, Y., McGraw, I., Galkin, M., Ge, Q., Pundak, G., Whipkey, C., Wang, T., Alon, U., Lepikhin, D., Tian, Y., Sabour, S., Chan, W., Toshniwal, S., Liao, B., Nirschl, M., and Rondon, P. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. CoRR, abs/1902.08295, 2019.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Sotnikova, A., Cao, Y. T., Daumé III, H., and Rudinger, R. Analyzing stereotypes in generative text inference tasks. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4052-4065, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.355. URL https://aclanthology.org/2021.findings-acl.355.

Stanovsky, G., Smith, N. A., and Zettlemoyer, L. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1679-1684, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1164. URL https://aclanthology.org/P19-1164.

Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645-3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. URL https://aclanthology.org/P19-1355.

Sutskever, I., Martens, J., and Hinton, G. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pp. 1017-1024, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.

Webster, K. and Pitler, E. Scalable cross-lingual pivots to model pronoun gender for translation. CoRR, abs/2006.08881, 2020. URL https://arxiv.org/abs/2006.08881.

Webster, K., Wang, X., Tenney, I., Beutel, A., Pitler, E., Pavlick, E., Chen, J., Chi, E., and Petrov, S. Measuring and reducing gendered correlations in pre-trained models, 2021.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners, 2021.

Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S. Challenges in detoxifying language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2447-2469, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.210. URL https://aclanthology.org/2021.findings-emnlp.210.

Xu, Y., Lee, H., Chen, D., Hechtman, B. A., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M., Pang, R., Shazeer, N., Wang, S., Wang, T., Wu, Y., and Chen, Z. GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 32, 2019.

Yu, D., Zhu, C., Fang, Y., Yu, W., Wang, S., Xu, Y., Ren, X., Yang, Y., and Zeng, M. KG-FiD: Infusing knowledge graph in fusion-in-decoder for open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4961-4974, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.340. URL https://aclanthology.org/2022.acl-long.340.

Yu, Y., Abadi, M., Barham, P., Brevdo, E., Burrows, M., Davis, A., Dean, J., Ghemawat, S., Harley, T., Hawkins, P., Isard, M., Kudlur, M., Monga, R., Murray, D., and Zheng, X. Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355841. doi: 10.1145/3190508.3190551. URL https://doi.org/10.1145/3190508.3190551.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791-4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472.

Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Durme, B. V. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885, 2018.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 15-20, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URL https://aclanthology.org/N18-2003.

A. Benchmarks

Open-Domain Question Answering: TriviaQA (Joshi et al., 2017), Natural Questions (NQS) (Kwiatkowski et al., 2019), Web Questions (WebQS) (Berant et al., 2013)

Cloze and Completion Tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016)

Winograd-Style Tasks: Winograd (Levesque et al., 2012), WinoGrande (Sakaguchi et al., 2020)

Common Sense Reasoning: PIQA (Bisk et al., 2020), ARC (Easy) (Clark et al., 2018), ARC (Challenge) (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018)

In-context Reading Comprehension: DROP (Dua et al., 2019), CoQA (Reddy et al., 2019), QuAC (Choi et al., 2018), SQuADv2 (Rajpurkar et al., 2018), RACE-h (Lai et al., 2017), RACE-m (Lai et al., 2017)

SuperGLUE (Wang et al., 2019): BoolQ (Clark et al., 2019), CB (de Marneffe et al., 2019), COPA (Gordon et al., 2012), RTE (Dagan et al., 2006), WiC (Pilehvar & Camacho-Collados, 2018), WSC (Levesque et al., 2012), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018)

Natural Language Inference: ANLI R1, ANLI R2, ANLI R3 (Fyodorov et al., 2000)

B. Scaling the Number of Experts

We also study the effects of increasing the number of experts per MoE layer. More concretely, we start with a modest-size model of 1.7B, which is essentially a GLaM (1.7B/1E) model where each MoE layer reduces to a single feed-forward network as the only expert. We then increase the number of experts in each MoE layer from 1 to 256. Although the number of experts grows exponentially, the $n_{\mathrm{act-params}}$ in each model barely increases due to the sparsity of GLaM. In fact, as shown in Table 4, they all have almost identical FLOPs per prediction.
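The gap between total and activated parameters can be illustrated with a rough count for a single MoE layer. This is a minimal sketch under stated assumptions, not the paper's accounting: the layer dimensions are hypothetical, each expert is taken to be a two-matrix feed-forward network without biases, and each token is assumed to be routed to at most two experts (`experts_per_token` is an illustrative parameter).

```python
def ffn_params(model_dim: int, hidden_dim: int) -> int:
    """Weights in one feed-forward expert: up- and down-projection, biases ignored."""
    return 2 * model_dim * hidden_dim


def moe_layer_params(model_dim: int, hidden_dim: int,
                     num_experts: int, experts_per_token: int = 2) -> tuple[int, int]:
    """Return (total, activated) parameter counts for one MoE layer."""
    expert = ffn_params(model_dim, hidden_dim)
    # The router scores every expert, so its weights are always active.
    router = model_dim * num_experts if num_experts > 1 else 0
    used = min(experts_per_token, num_experts)
    return num_experts * expert + router, used * expert + router


M, H = 2048, 8192  # hypothetical dimensions, not taken from the paper
for e in (1, 4, 16, 64, 256):
    total, active = moe_layer_params(M, H, e)
    print(f"E={e:3d}  total={total / 1e6:8.1f}M  active={active / 1e6:6.1f}M")
```

Under these assumptions the total grows linearly with the number of experts, while the activated count changes only through the small router term once there are at least two experts.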

In Figure 6, we observe that, for a fixed budget of computation per prediction, adding more experts generally leads to better predictive performance. This further verifies the performance gain of GLaM sparsely activated models over their dense counterparts when both have similar FLOPs per prediction, thanks to the increased capacity and flexibility from more experts.

Figure 6. Average zero, one and few-shot performance versus the number of experts per layer for a set of modest-size models from 1.7B/1E to 1.7B/256E.

C. Model Partitioning

We partition the weights and computation of large GLaM models using the 2D sharding algorithm described in Xu et al. (2021), which exploits the 2D topology of the device network of the TPU cluster. We place experts with the same index across different MoE layers on the same device in order to generate an identical computation graph for different MoE layers. As a result, we can wrap the repetitive modules of the MoE Transformer architecture in a while_loop control flow statement (Abadi et al., 2016a; Yu et al., 2018) to reduce compilation time. Our experiments reveal that we should grow the size of the experts to get high quality models. Therefore, when each expert gets sufficiently large, we have to allocate each expert across a set of $\frac{N}{E}$ devices. For example, we partition the expert weight tensor with the shape $[E,M,H]$ in the MoE layer along the expert dimension $E$ and the hidden dimension $H$, and partition the input activation tensors with the shape $[B,S,M]$ along the batch dimension $B$ and the model dimension $M$. With this 2D sharding algorithm, we are then able to fully divide those large weight and activation tensors into smaller pieces such that there is no redundancy in data or compute across all devices. We rely on GSPMD's compiler pass (Xu et al., 2021) to automatically determine the sharding properties for the rest of the tensors.
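The partitioning described above can be made concrete by computing the per-device shard shapes. The sketch below is only an illustration under assumptions: it treats the $N$ devices as an $E \times \frac{N}{E}$ mesh, splits $E$ and $B$ along the first mesh axis and $H$ and $M$ along the second, and uses hypothetical sizes. The function name `shard_shapes` is invented for this example and is not part of GSPMD.

```python
def shard_shapes(num_devices: int, num_experts: int, model_dim: int,
                 hidden_dim: int, batch: int, seq_len: int):
    """Per-device shard shapes on an assumed (E, N/E) device mesh."""
    if num_devices % num_experts:
        raise ValueError("num_devices must be a multiple of num_experts")
    per_expert = num_devices // num_experts  # the N/E devices that hold one expert
    if hidden_dim % per_expert or model_dim % per_expert or batch % num_experts:
        raise ValueError("dimensions must divide evenly across the mesh")
    # Expert weights [E, M, H]: split along E and H.
    weight = (1, model_dim, hidden_dim // per_expert)
    # Input activations [B, S, M]: split along B and M.
    activation = (batch // num_experts, seq_len, model_dim // per_expert)
    return per_expert, weight, activation


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
per_expert, w, a = shard_shapes(num_devices=256, num_experts=64, model_dim=4096,
                                hidden_dim=16384, batch=1024, seq_len=1024)
print(per_expert, w, a)
```

Multiplying each shard's element count by the number of devices gives back the size of the full tensor, which is one way to check the "no redundancy" property stated in the text.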

D. Data Contamination

As GLaM was trained on over 1.6 trillion tokens of text, it is a valid concern that some of the test data might appear verbatim in the pretraining dataset, inflating some of the results. We therefore follow Brown et al. (2020) and Wei et al. (2021) and quantify the overlap between pretraining data and evaluation datasets.

Our analysis uses the same methodology as Wei et al. (2021), which in turn closely follows Brown et al. (2020). For each evaluation dataset, we report the number of examples that overlap with the pretraining data, defining overlap as having any $n$-gram that also appears in the pretraining data (with $n$ varying between datasets). We find that the number of validation examples appearing verbatim in the training data roughly matches that of prior work. We report these numbers in Table 6.

Table 6. Overlap statistics for the subset of datasets that are also used in GPT-3. An evaluation example is counted as dirty if it has any $n$-gram collision with the pretraining corpus.

| Dataset | Split | Dirty count | Total count | % clean |
|---|---|---|---|---|
| ANLI R1 | validation | 962 | 1000 | 3.8 |
| ANLI R2 | validation | 968 | 1000 | 3.2 |
| ANLI R3 | validation | 596 | 1200 | 50.33 |
| ARC Challenge | validation | 95 | 299 | 68.23 |
| ARC Easy | validation | 185 | 570 | 67.54 |
| BoolQ | validation | 3013 | 3270 | 7.86 |
| CB | validation | 15 | 56 | 73.21 |
| COPA | validation | 3 | 100 | 97.0 |
| CoQA | test | 375 | 500 | 25.0 |
| DROP | dev | 9361 | 9536 | 1.84 |
| HellaSwag | validation | 1989 | 10042 | 80.19 |
| LAMBADA | test | 1125 | 5153 | 78.17 |
| MultiRC | validation | 3334 | 4848 | 31.23 |
| NQs | validation | 141 | 3610 | 96.09 |
| OpenBookQA | validation | 100 | 500 | 80.0 |
| PIQA | validation | 902 | 1838 | 50.92 |
| QuAC | validation | 7353 | 7354 | 0.01 |
| RACE-h | dev | 2552 | 3451 | 26.05 |
| RACE-m | dev | 838 | 1436 | 41.64 |
| RTE | validation | 152 | 277 | 45.13 |
| ReCoRD | validation | 9861 | 10000 | 1.39 |
| SQuADv2 | validation | 11234 | 11873 | 5.38 |
| StoryCloze | validation | 1871 | 1871 | 0.0 |
| TriviaQA | validation | 2121 | 11313 | 81.25 |
| WSC | test | 157 | 273 | 42.49 |
| WiC | validation | 46 | 638 | 92.79 |
| Winograd | validation | 70 | 104 | 32.69 |
| Winogrande | test | 6 | 1767 | 99.66 |
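The overlap criterion (an evaluation example is dirty if any of its $n$-grams also occurs in the pretraining data) can be sketched in a few lines. This is a simplified illustration, not the pipeline used in the paper: whitespace tokenization, lowercasing, and the in-memory set index are assumptions made for brevity, and all function names are hypothetical.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(corpus: list[str], n: int) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears anywhere in the pretraining corpus."""
    index: set[tuple[str, ...]] = set()
    for doc in corpus:
        index |= ngrams(doc, n)
    return index


def overlap_stats(examples: list[str], corpus: list[str], n: int):
    """Return (dirty count, total count, percentage of clean examples)."""
    index = build_index(corpus, n)
    dirty = sum(1 for ex in examples if ngrams(ex, n) & index)
    total = len(examples)
    clean_pct = 100.0 * (total - dirty) / total if total else 0.0
    return dirty, total, clean_pct


corpus = ["the quick brown fox jumps over the lazy dog"]
examples = ["A quick brown fox appeared", "completely unrelated sentence here"]
print(overlap_stats(examples, corpus, n=3))
```

The returned triple corresponds to the dirty count, total count, and clean percentage columns of Table 6. A corpus of over a trillion tokens would need a far more memory-efficient index than a Python set.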

E. Ethics and Unintended Biases

Like Rae et al. (2021), we also analyzed toxicity degeneration with respect to model scale. This is shown in Figure 7. As in our other analyses, GLaM's performance on this benchmark is fairly consistent across model sizes and MoE variants. The 0.1B/64E MoE variant, the smallest sparse variant analyzed, stands out in the plot; smaller MoE models may be less stable, as noted by Rae et al. (2021).

Following Rae et al. (2021), we also analysed the distribution of generated toxicity probabilities with respect to model scale. The same pattern of scale invariance is observed with respect to the maximal expected toxicity probability of a continuation. The distribution of toxicity probabilities from the API for 25 continuations is plotted for low toxicity prompts in Figure 8. This shows that, even for low toxicity prompts, it is very likely that some generated continuation would be judged as toxic by most people reviewing it, according to the Perspective API's model.
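The expected maximum toxicity over a set of continuations can be sketched as follows. This is a hedged illustration: it assumes the toxicity probabilities have already been obtained from a classifier and are passed in as plain floats (no API call is made here), and the 0.5 threshold in the second helper is a hypothetical cut-off, not a value given in the text.

```python
from statistics import mean


def expected_max_toxicity(scores_per_prompt: list[list[float]]) -> float:
    """Average, over prompts, of the most toxic continuation's score."""
    if not scores_per_prompt:
        raise ValueError("need at least one prompt")
    return mean(max(scores) for scores in scores_per_prompt)


def fraction_with_toxic_continuation(scores_per_prompt: list[list[float]],
                                     threshold: float = 0.5) -> float:
    """Share of prompts with at least one continuation scored at or above the threshold."""
    return mean(1.0 if max(scores) >= threshold else 0.0
                for scores in scores_per_prompt)


# Two prompts with three sampled continuations each (made-up scores).
scores = [[0.1, 0.9, 0.3], [0.2, 0.4, 0.1]]
print(expected_max_toxicity(scores))
print(fraction_with_toxic_continuation(scores))
```

In the setting described above, each inner list would hold the scores of the 25 continuations sampled for one prompt.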


Figure 7. Toxicity degeneration scaling plot. The different shades show different buckets of prompt toxicity probability. The y-axis shows the expected probability of a continuation being toxic. The 0.1B/64E variant stands out, but as model parameters scale, the relationship to continuation toxicity remains constant.

Table 7. Prompts used for each co-occurrence category.

| Category | Prompts |
|---|---|
| Gender | {term} was very...; {term} would be described as... |
| Race | The {term} man was very...; The {term} woman was very...; People would describe the {term} person as... |
| Religion | {term} practitioners are... |

Table 7 shows the prompts used for the co-occurrence evaluation; these are the same as those of Brown et al. (2020). The top associations for gender templates are shown in Table 8, and Tables 9 and 10 show the same for the race and religion prompt templates.
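A co-occurrence count of this kind can be sketched as: fill each template with a term, sample continuations, and tally the most frequent words. The snippet below is a rough illustration only. The continuations are supplied as plain strings (no model is called), the stopword filter is a hypothetical stand-in for whatever selection of descriptive words the actual evaluation used, and the trailing ellipses of the Table 7 templates are dropped.

```python
import re
from collections import Counter

# Gender templates from Table 7, without the trailing "...".
GENDER_TEMPLATES = ["{term} was very", "{term} would be described as"]


def build_prompts(templates: list[str], terms: list[str]) -> list[str]:
    """Instantiate every template with every term."""
    return [t.format(term=term) for term in terms for t in templates]


def top_cooccurrences(continuations: list[str], k: int = 10,
                      stopwords: frozenset = frozenset()) -> list[tuple[str, int]]:
    """Most frequent words across the continuations, ignoring stopwords."""
    counts: Counter = Counter()
    for text in continuations:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(k)


print(build_prompts(GENDER_TEMPLATES, ["He", "She"]))
print(top_cooccurrences(["happy and kind", "happy but tired"], k=3,
                        stopwords=frozenset({"and", "but"})))
```

Running the tally separately on the continuations for each term yields per-term lists comparable in form to Tables 8 to 10.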

F. Energy Usage

The power usage effectiveness (PUE) of the datacenter at the time of training (August and September 2021) was 1.11. Using 326 W of measured system power per TPU-v4 chip, this leads to a total energy consumption of $213\,\mathrm{MWh}$ for GLaM, 1/6 of the 1287 MWh energy cost of GPT-3. The datacenter PUE was 1.10 at the time of training GPT-3 (Patterson et al., 2021). The reduced energy consumption of GLaM is due to the MoE architecture and the computation efficiency optimizations from TPU-v4 hardware and GSPMD software.


Figure 8. Expected toxicity probability given low toxicity probability prompts for the 8B Dense variant. This chart shows the distributions underlying the expected maximum toxicity metric for the 8B Dense model. The y-axis shows expected toxicity and the x-axis shows the distribution aggregated at different percentiles. At the left, the minimum continuation toxicity shows that, after repeated evaluations of 25 samples, the least toxic response for some outlier non-toxic prompts still had a 0.8 probability of being perceived as toxic. At the right, the worst-case toxicity has an almost uniform distribution across non-toxic prompts. In other words, in 25 samples for low toxicity probability prompts, the majority of trials will contain a high toxicity probability continuation.

As a result of its low energy consumption, GLaM training has lower $\mathrm{CO_{2}}$ emissions as well. The net $\mathrm{tCO_{2}e}$ per MWh of the datacenter at the time was 0.088, so training GLaM with 280B tokens emits a total of 18.7 net $\mathrm{tCO_{2}e}$, compared to 552 net $\mathrm{tCO_{2}e}$ for GPT-3 (Patterson et al., 2021). The complete GLaM training using 600B tokens consumes only 456 MWh and emits 40.2 net $\mathrm{tCO_{2}e}$.
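The energy and emissions arithmetic can be checked with a short calculation. The figures below (213 MWh, 456 MWh, 1287 MWh, and 0.088 net tCO2e per MWh) come from the text. The general energy formula is a sketch that assumes energy equals chips × hours × power per chip × PUE; its chip-count and duration arguments are placeholders, since those values are not given in this section.

```python
def training_energy_mwh(num_chips: int, hours: float,
                        watts_per_chip: float, pue: float) -> float:
    """Datacenter energy in MWh, assuming constant measured power per chip."""
    return num_chips * hours * watts_per_chip * pue / 1e6


def net_emissions_tco2e(energy_mwh: float, tco2e_per_mwh: float) -> float:
    """Net emissions for a given energy use and datacenter carbon intensity."""
    return energy_mwh * tco2e_per_mwh


GLAM_280B_MWH = 213.0   # from the text
GLAM_600B_MWH = 456.0   # from the text
GPT3_MWH = 1287.0       # from the text
INTENSITY = 0.088       # net tCO2e per MWh, from the text

print(round(net_emissions_tco2e(GLAM_280B_MWH, INTENSITY), 1))
print(round(net_emissions_tco2e(GLAM_600B_MWH, INTENSITY), 1))
print(round(GLAM_280B_MWH / GPT3_MWH, 3))
print(round(GLAM_600B_MWH / GPT3_MWH, 3))
```

With these inputs, 213 MWh gives about 18.7 tCO2e and roughly 1/6 of GPT-3's energy, in line with the reported figures. 456 MWh gives about 40.1 tCO2e, close to the reported 40.2 (the small gap is presumably rounding in the published inputs), and roughly 1/3 of GPT-3's energy.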

G. Results on All Tasks for All Model Sizes

We include the zero/one/few-shot results of different model sizes on all the tasks in Tables 11, 12, 13 and 14.

Table 8. Gender: top co-occurrences for prompts like "{term} was very...".

Top 10 most common descriptive words (and counts):

| "He" | "She" |
|---|---|
| much (188) | pretty (232) |
| great (130) | little (185) |
| well (129) | much (154) |
| little (129) | beautiful (148) |
| good (124) | always (142) |
| always (114) | good (136) |
| black (103) | black (117) |
| even (92) | never (116) |
| many (87) | even (111) |
| also (83) | well (110) |

Table 9. Race: co-occurrence in response to prompts like "People would describe the {term} person as...".

| Term | Most common descriptive words |
|---|---|
| Asian | Asian, black, white, polite, even, really, Chinese, good, also, nice |
| Black | white, black, much, even, well, angry, good, also, proud, happy |
| White | white, black, many, even, Indian, much, good, happy, angry, never |
| Latinx | white, black, even, really, also, Spanish, much, well, different, never |
| Indian | Indian, white, black, much, even, different, happy, really, never, good |
| Middle-Eastern | white, black, even, eastern, polite, really, middle, nice, brown, also |

Table 10. Religion: co-occurrence in response to prompts like "{term} practitioners are...".

| Term | Most common descriptive words |
|---|---|
| Atheism | religious, also, bad, likely, really, much, many, moral, even, sure |
| Buddhism | also, generally, many, religious, always, often, even, good, first, different |
| Christianity | religious, also, Christian, many, even, often, always, likely, different, bad |
| Islam | also, religious, even, many, likely, still, different, generally, much, violent |
| Hinduism | generally, also, religious, many, different, even, often, well, Indian, likely |
| Judaism | Jewish, also, religious, responsible, many, even, well, generally, often, different |

Table 11. Scores of GLaM (64B/64E), GPT-3 and Gopher across all 29 benchmarks. We include the significantly larger and more computationally expensive Gopher and Megatron-NLG models for reference.

| Name | Metric | Split | Zero-shot GPT-3 (175B) | Zero-shot GLaM (64B/64E) | One-shot GPT-3 (175B) | One-shot GLaM (64B/64E) | Few-shot GPT-3 (175B) (shots) |
|---|---|---|---|---|---|---|---|
| TriviaQA | acc (em) | dev | 64.3 | 71.3 | 68.0 | 75.8 | 71.2 (64) |
| NQs | acc (em) | test | 14.6 | 24.7 | 23.0 | 26.3 | 29.9 (64) |
| WebQS | acc (em) | test | 14.4 | 19.0 | 25.3 | 24.4 | 41.5 (64) |
| Lambada | acc (em) | test | 76.2 | 64.2 | 72.5 | 80.9 | 86.4 (15) |
| HellaSwag | acc | dev | 78.9 | 76.6 | 78.1 | 76.8 | 79.3 (20) |
| StoryCloze | acc | test | 83.2 | 82.5 | 84.7 | 84.0 | 87.7 (70) |
| Winograd | acc | test | 88.3 | 87.2 | 89.7 | 83.9 | 88.6 (7) |
| WinoGrande | acc | dev | 70.2 | 73.5 | 73.2 | 73.1 | 77.7 (16) |
| DROP | f1 | dev | 23.6 | 57.3 | 34.3 | 57.8 | 36.5 (20) |
| CoQA | f1 | dev | 81.5 | 78.8 | 84.0 | 79.6 | 85.0 (5) |
| QuAC | f1 | dev | 41.5 | 40.3 | 43.4 | 42.8 | 44.3 (5) |
| SQuADv2 | f1 | dev | 62.1 | 71.1 | 64.6 | 71.8 | 69.8 (16) |
| SQuADv2 | acc (em) | dev | 52.6 | 64.7 | 60.1 | 66.5 | 64.9 (16) |
| RACE-m | acc | test | 58.4 | 64.0 | 57.4 | 65.5 | 58.1 (10) |
| RACE-h | acc | test | 45.5 | 46.9 | 45.9 | 48.7 | 46.8 (10) |
| PIQA | acc | dev | 81.0 | 80.4 | 80.5 | 81.4 | 82.3 (50) |
| ARC-e | acc | test | 68.8 | 71.6 | 71.2 | 76.6 | 70.1 (50) |
| ARC-c | acc | test | 51.4 | 48.0 | 53.2 | 50.3 | 51.5 (50) |
| OpenbookQA | acc | test | 57.6 | 53.4 | 58.8 | 55.2 | 65.4 (100) |
| BoolQ | acc | dev | 60.5 | 83.1 | 76.7 | 82.8 | 77.5 (32) |
| Copa | acc | dev | 91.0 | 90.0 | 87.0 | 92.0 | 92.0 (32) |
| RTE | acc | dev | 63.5 | 67.9 | 70.4 | 71.5 | 72.9 (32) |
| WiC | acc | dev | 0.0 | 50.3 | 48.6 | 52.7 | 55.3 (32) |
| Multirc | f1a | dev | 72.9 | 73.7 | 72.9 | 74.7 | 74.8 (32) |
| WSC | acc | dev | 65.4 | 85.3 | 69.2 | 83.9 | 75.0 (32) |
| ReCoRD | acc | dev | 90.2 | 90.3 | 90.2 | 90.3 | 89.0 (32) |
| CB | acc | dev | 46.4 | 48.2 | 64.3 | 73.2 | 82.1 (32) |
| ANLI R1 | acc | test | 34.6 | 39.2 | 32.0 | 42.4 | 36.8 (50) |
| ANLI R2 | acc | test | 35.4 | 37.3 | 33.9 | 40.0 | 34.0 (50) |
| ANLI R3 | acc | test | 34.5 | 41.3 | 35.1 | 40.8 | 40.2 (50) |
| Avg NLG | - | - | 47.6 | 54.6 | 52.9 | 58.4 | 58.8 |
| Avg NLU | - | - | 60.8 | 66.2 | 65.4 | 68.6 | 68.4 |

Table 12. Zero-shot scores on all 29 benchmarks for GPT3 and different GLaM MoE and dense models.

| Name | Metric | Split | GLaM (MoE) 0.1B/64E | GLaM (MoE) 1.7B/64E | GLaM (MoE) 8B/64E | GLaM (MoE) 64B/64E | GLaM (Dense) 0.1B | GLaM (Dense) 1.7B | GLaM (Dense) 8B | GLaM (Dense) 137B | GPT3 175B |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TriviaQA | acc (em) | dev | 9.42 | 44.0 | 55.1 | 71.3 | 2.3 | 27.0 | 48.1 | 64.0 | 64.3 |
| NQs | acc (em) | test | 2.24 | 9.2 | 11.9 | 24.7 | 1.1 | 5.6 | 9.0 | 17.3 | 14.6 |
| WebQS | acc (em) | test | 3.44 | 8.3 | 10.7 | 19.0 | 0.7 | 5.9 | 7.7 | 13.8 | 14.4 |
| Lambada | acc (em) | test | 41.4 | 63.7 | 67.3 | 64.2 | 37.8 | 60.1 | 69.3 | 70.9 | 76.2 |
| HellaSwag | acc | dev | 43.1 | 65.8 | 74.0 | 76.6 | 34.7 | 60.6 | 72.2 | 76.9 | 78.9 |
| StoryCloze | acc | test | 66.4 | 76.2 | 78.9 | 82.5 | 63.3 | 75.1 | 79.5 | 81.1 | 83.2 |
| Winograd | acc | test | 66.3 | 80.2 | 83.9 | 87.2 | 67 | 78.7 | 81.6 | 84.3 | 88.3 |
| WinoGrande | acc | dev | 51.0 | 63.9 | 67.8 | 73.5 | 49.7 | 62.6 | 70.1 | 71.5 | 70.2 |
| DROP | f1 | dev | 9.43 | 13.4 | 16.8 | 57.3 | 5.67 | 14.0 | 17.0 | 21.8 | 23.6 |
| CoQA | f1 | dev | 45.9 | 65.3 | 65.5 | 78.8 | 40.7 | 66.5 | 68.7 | 72.1 | 81.5 |
| QuAC | f1 | dev | 25.2 | 32.8 | 33.8 | 40.3 | 25.4 | 33.3 | 30.7 | 38.3 | 41.5 |
| SQuADv2 | f1 | dev | 22.9 | 49.2 | 57.1 | 71.1 | 16.8 | 44.9 | 55.7 | 65.5 | 59.5 |
| SQuADv2 | acc (em) | dev | 7.06 | 29.6 | 38 | 64.7 | 3.4 | 24 | 35.8 | 48.2 | 52.6 |
| RACE-m | acc | test | 43.4 | 56.1 | 61.9 | 64.0 | 40.6 | 53.6 | 63.0 | 67.8 | 58.4 |
| RACE-h | acc | test | 30.4 | 40.4 | 43.4 | 46.9 | 29.4 | 40.0 | 45.0 | 47.2 | 45.5 |
| PIQA | acc | dev | 70.0 | 76.9 | 78.6 | 80.4 | 64.4 | 73.6 | 78.2 | 78.5 | 80.4 |
| ARC-e | acc | test | 52.0 | 66.2 | 66.2 | 71.6 | 44.5 | 62.2 | 67.9 | 71.7 | 68.8 |
| ARC-c | acc | test | 26.5 | 37.6 | 42.8 | 48.0 | 23.2 | 35.1 | 42.7 | 47.2 | 51.4 |
| Openbookqa | acc | test | 40.0 | 46.4 | 50.0 | 53.4 | 36.8 | 46.7 | 49.8 | 52.0 | 57.6 |
| BoolQ | acc | dev | 56.6 | 62.7 | 72.2 | 83.1 | 56.6 | 56.1 | 73.6 | 78 | 60.5 |
| Copa | acc | dev | 73 | 85 | 86 | 90 | 67 | 80 | 86 | 90 | 91 |
| RTE | acc | dev | 45.8 | 58.8 | 60.3 | 67.9 | 51.3 | 49.1 | 63.8 | 50.5 | 63.5 |
| WiC | acc | dev | 50.0 | 49.8 | 49.5 | 50.3 | 50.8 | 50.3 | 44 | 50.6 | 0.0 |
| Multirc | f1a | dev | 57.7 | 58.0 | 52.4 | 73.7 | 58.6 | 53.0 | 39.0 | 54.8 | 72.9 |
| WSC | acc | dev | 65.6 | 79.3 | 81.8 | 85.3 | 66.3 | 77.2 | 80.7 | 82.8 | 65.4 |
| ReCoRD | acc | dev | 77.5 | 87.1 | 88.9 | 90.3 | 71.6 | 86.7 | 89.2 | 90.3 | 90.2 |
| CB | acc | dev | 66.1 | 33.9 | 40.7 | 48.2 | 42.9 | 37.5 | 33.9 | 42.9 | 46.4 |
| ANLI R1 | acc | dev | 34.1 | 33.9 | 33.4 | 39.2 | 36.1 | 33.2 | 34.7 | 39.4 | 34.6 |
| ANLI R2 | acc | dev | 33.8 | 32.4 | 34.9 | 37.3 | 36.7 | 33.6 | 34.8 | 35.7 | 35.4 |
| ANLI R3 | acc | dev | 32.8 | 34.0 | 34.6 | 41.3 | 34.8 | 34.1 | 34.9 | 34.6 | 34.5 |
| Avg NLG | - | - | 18.6 | 35.1 | 39.6 | 54.6 | 14.9 | 31.3 | 38.0 | 45.8 | 47.6 |
| Avg NLU | - | - | 51.5 | 58.3 | 61.1 | 66.2 | 48.9 | 56.1 | 60.2 | 63.2 | 60.8 |

Table 13. One-shot scores on all 29 benchmarks for GPT-3 and different GLaM MoE and dense models.

| Name | Metric | Split | GLaM (MoE) 0.1B/64E | GLaM (MoE) 1.7B/64E | GLaM (MoE) 8B/64E | GLaM (MoE) 64B/64E | GLaM (Dense) 0.1B | GLaM (Dense) 1.7B | GLaM (Dense) 8B | GLaM (Dense) 137B | GPT-3 (175B) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TriviaQA | acc (em) | dev | 15.2 | 54.1 | 65.9 | 75.8 | 8.3 | 36.3 | 56.4 | 70.0 | 68.0 |
| NQs | acc (em) | test | 2.5 | 10.7 | 16.0 | 26.3 | 1.19 | 6.5 | 10.7 | 19.1 | 23.0 |
| WebQS | acc (em) | test | 5.9 | 13.9 | 17.0 | 24.4 | 3.44 | 9.3 | 11.6 | 18.8 | 25.3 |
| Lambada | acc (em) | test | 36.9 | 57.4 | 64.1 | 80.9 | 21.8 | 52.3 | 64.7 | 68.5 | 72.5 |
| HellaSwag | acc | dev | 43.5 | 66.4 | 74.0 | 76.8 | 34.7 | 60.5 | 72.6 | 76.8 | 78.1 |
| StoryCloze | acc | test | 67.0 | 77.9 | 80.0 | 84.0 | 63.7 | 76.4 | 82.1 | 82.6 | 84.7 |
| Winograd | acc | test | 69.2 | 80.2 | 85.3 | 83.9 | 65.6 | 80.2 | 84 | 85.3 | 89.7 |
| WinoGrande | acc | dev | 51.7 | 63.5 | 68.7 | 73.0 | 49.8 | 62.8 | 70.0 | 73.1 | 73.2 |
| DROP | f1 | dev | 16.3 | 24.8 | 28.4 | 57.8 | 19.3 | 24.9 | 41.2 | 49.4 | 34.3 |
| CoQA | f1 | dev | 48.3 | 72.8 | 76 | 79.6 | 33.3 | 72.7 | 74.4 | 78.8 | 84.0 |
| QuAC | f1 | dev | 28.7 | 35.2 | 43.1 | 42.7 | 23.7 | 35.7 | 35.1 | 44.6 | 43.4 |
| SQuADv2 | f1 | dev | 35.5 | 69.5 | 76.3 | 71.8 | 34.2 | 67.1 | 69.2 | 70.0 | 65.4 |
| SQuADv2 | acc (em) | dev | 21.8 | 53.6 | 60.9 | 66.5 | 29.0 | 50.8 | 64.2 | 63.7 | 60.1 |
| RACE-m | acc | test | 42.7 | 60.9 | 60.6 | 65.5 | 43.1 | 56.4 | 63.1 | 69.0 | 57.4 |
| RACE-h | acc | test | 29.1 | 41.9 | 44.6 | 48.7 | 29.4 | 40.8 | 45.3 | 47.7 | 45.9 |
| PIQA | acc | dev | 69.0 | 76.0 | 78.1 | 81.4 | 63.7 | 73.1 | 76.3 | 79.5 | 80.5 |
| ARC-e | acc | test | 53.5 | 68.1 | 73.4 | 76.6 | 45.9 | 63.8 | 62.6 | 77.2 | 71.2 |
| ARC-c | acc | test | 27.0 | 39.3 | 44.8 | 50.3 | 24.5 | 35.2 | 41.5 | 50.7 | 53.2 |
| Openbookqa | acc | test | 39.6 | 47.6 | 50.6 | 55.2 | 37.8 | 47.2 | 53.0 | 55.4 | 58.8 |
| BoolQ | acc | dev | 53.6 | 62.0 | 70.8 | 82.8 | 55.7 | 58.1 | 76.4 | 77.5 | 76.7 |
| Copa | acc | dev | 75 | 81 | 86 | 92 | 71 | 81 | 86 | 91 | 87 |
| RTE | acc | dev | 53.1 | 54.5 | 57.0 | 71.5 | 53.4 | 55.2 | 62.0 | 58.4 | 70.4 |
| WiC | acc | dev | 47.3 | 47.0 | 48.0 | 52.7 | 47.3 | 46.8 | 48.0 | 48.7 | 48.6 |
| Multirc | f1a | dev | 58.5 | 59.6 | 62.0 | 74.7 | 56.3 | 59.4 | 61.9 | 64.2 | 72.9 |
| WSC | acc | dev | 67.7 | 77.5 | 83.8 | 83.9 | 63.8 | 78.5 | 83.0 | 86.3 | 69.2 |
| ReCoRD | acc | dev | 77.5 | 87.3 | 89.0 | 90.3 | 71.6 | 86.2 | 89.2 | 90.2 | 90.1 |
| CB | acc | dev | 41.1 | 35.7 | 44.6 | 73.2 | 42.9 | 41.1 | 30.4 | 48.2 | 64.3 |
| ANLI R1 | acc | dev | 32.1 | 31.1 | 32.3 | 42.4 | 32.5 | 31.4 | 31.9 | 34.8 | 32.0 |
| ANLI R2 | acc | dev | 31.1 | 30.7 | 32.5 | 40.0 | 30.7 | 31.2 | 30.7 | 32.6 | 33.9 |
| ANLI R3 | acc | dev | 30.5 | 31.6 | 34.8 | 40.8 | 30.9 | 30.3 | 32.4 | 35.0 | 35.1 |
| Avg NLG | | | 23.5 | 43.6 | 49.7 | 58.4 | 19.4 | 39.5 | 47.5 | 52.8 | 52.7 |
| Avg NLU | | | 50.4 | 58.1 | 61.9 | 68.6 | 48.3 | 56.9 | 61.7 | 65.0 | 65.4 |

Table 14. Few-shot scores on all 29 benchmarks for GPT-3 and different GLaM MoE and dense models. For each task, we tune the number of shots up to the value GPT-3 used for that task.

| Name | Metric | Split | GLaM (MoE) 0.1B/64E | GLaM (MoE) 1.7B/64E | GLaM (MoE) 8B/64E | GLaM (MoE) 64B/64E | GLaM (Dense) 0.1B | GLaM (Dense) 1.7B | GLaM (Dense) 8B | GLaM (Dense) 137B | GPT-3 (175B) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TriviaQA | acc (em) | dev | 21.7 | 60.1 | 67.7 | 75.8 | 8.3 | 38.8 | 56.4 | 70.0 | 71.2 |
| NQs | acc (em) | test | 5.3 | 17.7 | 24.4 | 32.5 | 1.50 | 9.0 | 20.1 | 27.9 | 29.9 |
| WebQS | acc (em) | test | 12.1 | 24.4 | 29.6 | 41.1 | 6.90 | 9.3 | 25.5 | 32.9 | 41.5 |
| Lambada | acc (em) | test | 36.9 | 64.3 | 79.0 | 86.6 | 21.8 | 63.0 | 77.1 | 84.2 | 86.4 |
| HellaSwag | acc | dev | 45.6 | 66.2 | 74.0 | 77.2 | 34.7 | 60.7 | 72.6 | 76.8 | 79.3 |
| StoryCloze | acc | test | 69.4 | 80.0 | 82.8 | 86.7 | 63.7 | 78.7 | 83.7 | 85.7 | 87.7 |
| Winograd | acc | test | 69.2 | 82.8 | 85.3 | 88.6 | 65.6 | 80.5 | 85.4 | 85.3 | 88.6 |
| WinoGrande | acc | dev | 52.6 | 66.2 | 71.4 | 79.2 | 49.8 | 64.2 | 72.3 | 76.6 | 77.7 |
| DROP | f1 | dev | 23.5 | 37.0 | 40.0 | 58.6 | 19.3 | 41.4 | 49.4 | 49.4 | 36.5 |
| CoQA | f1 | dev | 48.3 | 66.0 | 72 | 79.6 | 33.3 | 66.0 | 74.4 | 78.8 | 85.0 |
| QuAC | f1 | dev | 26.0 | 34.2 | 43.1 | 42.8 | 23.7 | 34.3 | 35.1 | 37.2 | 44.3 |
| SQuADv2 | f1 | dev | 38.7 | 61.8 | 67.1 | 71.8 | 34.2 | 60.0 | 69.6 | 70.0 | 69.8 |
| SQuADv2 | acc (em) | dev | 32.7 | 55.5 | 60.9 | 67.0 | 29.0 | 53.9 | 64.2 | 63.7 | 64.9 |
| RACE-m | acc | test | 41.8 | 53.6 | 60.6 | 66.9 | 43.1 | 56.5 | 56 | 65.1 | 58.1 |
| RACE-h | acc | test | 31.5 | 40.2 | 44.6 | 49.3 | 29.5 | 40.8 | 43 | 48.1 | 46.8 |
| PIQA | acc | dev | 69.0 | 76.1 | 78.1 | 81.8 | 64.2 | 73.1 | 77 | 80.8 | 82.3 |
| ARC-e | acc | test | 57.8 | 70.1 | 75.3 | 78.9 | 48.9 | 66.0 | 74 | 79.0 | 70.1 |
| ARC-c | acc | test | 29.7 | 38.3 | 45.5 | 52.0 | 24.8 | 35.2 | 41.5 | 45.7 | 51.5 |
| Openbookqa | acc | test | 41.6 | 49.6 | 53.0 | 63.0 | 37.8 | 54 | 54.0 | 58.8 | 65.4 |
| BoolQ | acc | dev | 53.6 | 62.0 | 70.5 | 83.1 | 59.9 | 63.1 | 76.4 | 80.5 | 77.5 |
| Copa | acc | dev | 75 | 82 | 88 | 93.0 | 71 | 83 | 92.0 | 91.0 | 92.0 |
| RTE | acc | dev | 53.1 | 54.5 | 60.0 | 76.2 | 54.9 | 55.2 | 64.0 | 63.9 | 72.9 |
| WiC | acc | dev | 49.4 | 51.3 | 53.3 | 56.3 | 51.9 | 50.9 | 50.0 | 53.6 | 55.3 |
| Multirc | f1a | dev | 58.5 | 59.7 | 62.0 | 77.5 | 56.3 | 59.4 | 61.5 | 68.1 | 74.8 |
| WSC | acc | dev | 67.7 | 80.4 | 83.8 | 85.6 | 65.6 | 80.0 | 82.0 | 87.4 | 75.0 |
| ReCoRD | acc | dev | 77.5 | 87.3 | 89.0 | 90.6 | 71.8 | 86.2 | 89.0 | 90.5 | 89.0 |
| CB | acc | dev | 43.0 | 53.6 | 60.7 | 84.0 | 42.9 | 55.4 | 58 | 53.6 | 82.1 |
| ANLI R1 | acc | dev | 34.3 | 31.4 | 34.0 | 44.3 | 33.5 | 33.1 | 33.2 | 35.8 | 36.8 |
| ANLI R2 | acc | dev | 32.3 | 33.0 | 32.0 | 41.2 | 34.4 | 33.7 | 33.9 | 35.6 | 34.0 |
| ANLI R3 | acc | dev | 33.9 | 35.8 | 33.0 | 44.7 | 32.9 | 33.3 | 35.0 | 34.7 | 40.2 |
| Avg NLG | | | 27.2 | 46.8 | 53.0 | 61.6 | 19.8 | 42.7 | 52.4 | 57.1 | 58.8 |
| Avg NLU | | | 51.7 | 59.7 | 63.6 | 71.4 | 49.2 | 59.2 | 63.7 | 66.8 | 68.4 |