GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Abstract

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation FLOPs for inference, while still achieving better overall zero-, one- and few-shot performance across 29 NLP tasks.

Table 1. Comparison between GPT-3 and GLaM. In a nutshell, GLaM outperforms GPT-3 on average across 21 natural language understanding (NLU) benchmarks and 8 natural language generative (NLG) benchmarks while using about half the FLOPs per token during inference and consuming about one third of the energy for training.

		GPT-3	GLaM	Relative
Cost	FLOPs / token (G)	350	180	-48.6%
	Training energy (MWh)	1287	456	-64.6%
Average accuracy	Zero-shot	56.9	62.7	+10.2%
	One-shot	61.6	65.5	+6.3%
	Few-shot	65.2	68.1	+4.4%
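
The "Relative" column of Table 1 can be recomputed from the two raw columns. The short sketch below does that arithmetic; the dictionary keys are just labels chosen here for illustration, and the numbers are copied from the table.

```python
# (GPT-3, GLaM) pairs copied from Table 1.
table1 = {
    "FLOPs / token (G)": (350, 180),
    "Training energy (MWh)": (1287, 456),
    "Zero-shot": (56.9, 62.7),
    "One-shot": (61.6, 65.5),
    "Few-shot": (65.2, 68.1),
}


def relative_change(baseline, new):
    """Percentage change of `new` with respect to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Round to one decimal place, matching the precision used in the table.
relative = {
    name: round(relative_change(gpt3, glam), 1)
    for name, (gpt3, glam) in table1.items()
}

for name, pct in relative.items():
    print(f"{name}: {pct:+.1f}%")
```

Negative values in the cost rows mean GLaM is cheaper; positive values in the accuracy rows mean GLaM scores higher.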

1. Introduction

Language models have played an important role in the progress of natural language processing (NLP) in the past decade. Variants of language models have been used to produce pretrained word vectors (Mikolov et al., 2013; Pennington et al., 2014), and contextualized word vectors (Peters et al., 2018; Devlin et al., 2019) for many NLP applications. The shift towards scaling with more data and larger models (Shazeer et al., 2017; Huang et al., 2019; Kaplan et al., 2020) has enabled complex natural language tasks to be performed with less labeled data. For example, GPT-3 (Brown et al., 2020) and FLAN (Wei et al., 2021) demonstrated the feasibility of in-context learning for few-shot or even zero-shot generalization, meaning very few labeled examples are needed to achieve good performance on NLP applications. While effective and performant, scaling further is becoming prohibitively expensive and consumes significant amounts of energy (Patterson et al., 2021).

In this work, we show that a large sparsely activated network can achieve competitive results compared to state-of-the-art dense models on few-shot tasks while being more computationally efficient. We present a family of generalist language models called GLaM that strike a balance between dense and conditional computation. The largest version of GLaM has 1.2T parameters in total with 64 experts per MoE layer (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2021), where each token in the input batch only activates a subnetwork of 96.6B (8% of 1.2T) parameters. On zero-, one- and few-shot learning, this model compares favorably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks, ranging from language completion tasks and open-domain QA tasks to natural language inference tasks. Thanks to the sparsely activated architecture and the efficient implementation of the model parallelism algorithm, the total energy consumption during training is only one third of GPT-3's. We highlight the comparison between the largest version of GLaM and GPT-3 in Table 1 and Figure 1.

Figure 1. An overview of the percentage change in predictive performance (higher is better) of GLaM (64B/64E) versus GPT-3 (175B) in the (a) zero-shot, (b) one-shot, and (c) few-shot setting across 7 benchmark categories with 29 public tasks in total. Each bar in panels (a), (b) and (c) represents one benchmark category. Panel (d) compares the FLOPs needed per token prediction and training energy consumption.

We use GLaM to study the importance of data. Our analysis shows that even for these large models, data quality should not be sacrificed for quantity if the goal is to produce a high-quality auto-regressive language model. More importantly, on social dimensions, our results are also the first, to our knowledge, to close the performance gap between stereotypical and anti-stereotypical examples on the WinoGender benchmark, suggesting that large, sparsely activated models may rely less on superficial statistical correlations.

Finally, although MoE-based sparse models are not yet common in the NLP community, our work shows, for the first time, that sparse decoder-only language models can outperform dense architectures of similar compute FLOPs in the few-shot in-context learning setting at scale. This suggests that sparsity is one of the most promising directions for achieving high-quality NLP models while saving energy costs (Patterson et al., 2021). MoE should therefore be considered a strong candidate for future scaling.

2. Related Work

Language models. Neural language models (Mikolov et al., 2010; Sutskever et al., 2011) have been shown to be useful for many natural language processing tasks. Word embedding models and extensions such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and paragraph vectors (Le & Mikolov, 2014) have shown good generalization to many tasks simply by transferring the embeddings.

Pre-training and Fine-tuning. The abundance of compute and data enables training increasingly large models via unsupervised pre-training. This is a natural fit for training neural networks as they exhibit remarkable scalability. Work on using recurrent models such as RNNs and LSTMs for language representation (Dai & Le, 2015; Kiros et al., 2015) showed that general language models could be fine-tuned to improve various language understanding tasks. More recently, models that used Transformers (Vaswani et al., 2017) showed that larger models with self-supervision on unlabeled data could yield significant improvements on NLP tasks (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Clark et al., 2020). Transfer learning based on pre-training and fine-tuning (Raffel et al., 2020; Houlsby et al., 2019) has been extensively studied and has demonstrated good performance on downstream tasks. However, a major limitation of this method is that it requires task-specific fine-tuning.

In-Context Few-shot Learning. GPT-3 (Brown et al., 2020) and related work (Shoeybi et al., 2019; Lieber et al., 2021; Wei et al., 2021) demonstrated that scaling up language models greatly improves task-agnostic, few-shot performance. These language models are applied without any gradient updates, and only few-shot demonstrations specified purely via text interactions with the model are needed.

Sparsely Gated Networks. Mixture-of-Experts based models have also shown significant advantages. For language modeling and machine translation, Shazeer et al. (2017) showed that they could effectively use a very large number of weights while only needing to compute a small subset of the computation graph at inference time. There has also been work on scaling sparsely activated MoE architectures (Hestness et al., 2017; Shazeer et al., 2018; Lepikhin et al., 2021; Kudugunta et al., 2021). Recently, Fedus et al. (2021) showed results with even larger 1 trillion parameter sparsely activated models (Switch-C). Although both Switch-C and the largest GLaM model have on the order of one trillion trainable parameters, GLaM is a family of decoder-only language models, and Switch-C is an encoder-decoder based sequence-to-sequence model. Furthermore, Switch-C is mainly evaluated on fine-tuning benchmarks, e.g., SuperGLUE, while GLaM performs well without any need for fine-tuning in the few-shot setting shared with GPT-3, of which SuperGLUE is a subset. Table 2 summarizes the key differences between GLaM and related models pre-trained on text corpora.

Table 2. A sample of related models (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Shoeybi et al., 2019; Lepikhin et al., 2021; Fedus et al., 2021) pre-trained on text corpora. $n_{\mathrm{params}}$ is the total number of trainable model parameters, and $n_{\mathrm{act-params}}$ is the number of activated model parameters per input token.

Model name	Model type	$n_{\mathrm{params}}$	$n_{\mathrm{act-params}}$
BERT	Dense Encoder-only	340M	340M
T5	Dense Encoder-decoder	13B	13B
GPT-3	Dense Decoder-only	175B	175B
Jurassic-1	Dense Decoder-only	178B	178B
Gopher	Dense Decoder-only	280B	280B
Megatron-530B	Dense Decoder-only	530B	530B
GShard-M4	MoE Encoder-decoder	600B	1.5B
Switch-C	MoE Encoder-decoder	1.5T	1.5B
GLaM (64B/64E)	MoE Decoder-only	1.2T	96.6B

3. Training Dataset

To train our model, we build a high-quality dataset of 1.6 trillion tokens that are representative of a wide range of natural language use cases. Web pages make up a vast quantity of the data in our unlabeled dataset. However, their quality ranges from professional writing to low-quality comment and forum pages. Similarly to Brown et al. (2020), we develop our own text quality classifier to produce a high-quality web corpus out of an original larger raw corpus. We use a feature hash based linear classifier for inference speed. This classifier is trained to distinguish between a collection of curated text (Wikipedia, books and a few selected websites) and other webpages. We use this classifier to estimate the content quality of a webpage. We then apply this classifier by using a Pareto distribution to sample webpages according to their score. This allows some lower-quality webpages to be included to prevent systematic biases in the classifier (Brown et al., 2020).
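
The score-based Pareto sampling step can be sketched as follows. This is only an illustration under stated assumptions: the keep rule (keep a page when a Pareto sample exceeds one minus its score) and the shape value 9 are the ones reported by Brown et al. (2020) for GPT-3; the text above does not state the rule or the shape parameter used for GLaM. The function names are hypothetical.

```python
import random


def keep_webpage(quality_score, alpha=9.0, rng=random):
    """Stochastic keep/drop decision for one webpage.

    quality_score: classifier output in [0, 1]; higher means closer to curated text.
    alpha: Pareto shape parameter (assumed value, not given in the text).
    """
    # random.paretovariate returns samples >= 1, so subtracting 1 shifts the
    # support to [0, inf) with most of the probability mass near 0.
    noise = rng.paretovariate(alpha) - 1.0
    return noise > 1.0 - quality_score


def keep_rate(quality_score, trials=20000, seed=0):
    """Empirical fraction of pages kept at a given score."""
    rng = random.Random(seed)
    kept = sum(keep_webpage(quality_score, rng=rng) for _ in range(trials))
    return kept / trials


for score in (0.1, 0.5, 0.9):
    print(score, keep_rate(score))
```

High-scoring pages are kept most of the time, while low-scoring pages still pass occasionally, which is the behavior the paragraph describes as guarding against systematic classifier bias.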

Table 3. Data and mixture weights in GLaM training set.

Dataset	Tokens (B)	Weight in mixture
Filtered webpages	143	0.42
Wikipedia	3	0.06
Conversations	174	0.28
Forums	247	0.02
Books	390	0.20
News	650	0.02

We use this process to generate a high-quality filtered subset of webpages and combine this with books, Wikipedia pages, forums, news pages and other data sources to create the final GLaM dataset. We also incorporate the data from public domain social media conversations used by Adiwardana et al. (2020). We set the mixture weights based on the performance of each component in a smaller model and to prevent small sources such as Wikipedia from being over-sampled. Table 3 shows the details of our data component sizes and mixture weights. To check data contamination, in Section D we conduct an overlap analysis between our training set and the evaluation data and find that it roughly matches that of previous work (Brown et al., 2020).
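
To make the mixture weights in Table 3 more concrete, the sketch below compares each component's weight with its share of the total token count. The ratio it computes is an illustrative quantity defined here, not a metric from the paper.

```python
# (tokens in billions, mixture weight) copied from Table 3.
components = {
    "Filtered webpages": (143, 0.42),
    "Wikipedia": (3, 0.06),
    "Conversations": (174, 0.28),
    "Forums": (247, 0.02),
    "Books": (390, 0.20),
    "News": (650, 0.02),
}

total_tokens = sum(tokens for tokens, _ in components.values())
total_weight = sum(weight for _, weight in components.values())


def sampling_ratio(name):
    """Mixture weight divided by the component's share of all tokens.

    A ratio above 1 means the component is drawn more often than its size
    alone would imply; a ratio below 1 means it is down-weighted.
    """
    tokens, weight = components[name]
    return weight / (tokens / total_tokens)


print(f"total tokens: {total_tokens}B, total weight: {total_weight:.2f}")
for name in components:
    print(f"{name}: {sampling_ratio(name):.2f}x")
```

The component sizes sum to about 1.6 trillion tokens, consistent with the dataset size stated earlier, and the weights sum to 1. Small sources such as Wikipedia sit well above their token share, while the largest sources (news, forums) sit well below it.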

Figure 2. GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., 'roses', the Gating module dynamically selects the two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts is then passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected.

4. Model Architecture

We leverage sparsely activated Mixture-of-Experts (MoE) (Shazeer et al., 2017; Fedus et al., 2021) in GLaM models. Similar to the GShard MoE Transformer (Lepikhin et al., 2021), we replace the feed-forward component of every other Transformer layer with an MoE layer, as shown in Figure 2. Each MoE layer consists of a collection of independent feed-forward networks as the 'experts'. A gating function then uses a softmax activation function to model a probability distribution over these experts. This distribution indicates how well each expert is able to process the incoming input.

Even though each MoE layer has many more parameters, the experts are sparsely activated. This means that for a given input token, only a limited subset of experts is used, giving the model more capacity while limiting computation. In our architecture, the subset size is two. Each MoE layer's learnable gating network is trained to use its input to activate the best two experts for each token of an input sequence. During inference, the learned gating network dynamically picks the two best experts for each token. For an MoE layer with $E$ experts, this essentially provides a collection of $O(E^{2})$ different combinations of feed-forward networks instead of one in the classic Transformer architecture, leading to much more computational flexibility. The final learned representation of a token will be the weighted combination of the outputs from the selected experts.
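
The routing described above can be illustrated with a toy, pure-Python MoE layer. This is a minimal sketch, not the GLaM implementation: the ReLU experts, the renormalization of the two selected gate probabilities, the toy sizes, and all function names are assumptions made here, and details such as expert capacity limits are omitted.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def expert_ffn(expert, x):
    """A tiny feed-forward expert: linear -> ReLU -> linear (activation assumed)."""
    hidden = [max(0.0, h) for h in matvec(expert["w_in"], x)]
    return matvec(expert["w_out"], hidden)


def moe_layer(x, gate_weights, experts, k=2):
    """Route one token vector `x` to its top-k experts and mix their outputs."""
    probs = softmax(matvec(gate_weights, x))
    top = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)  # renormalize over selected experts (assumption)
    out = [0.0] * len(x)
    for e in top:
        y = expert_ffn(experts[e], x)
        out = [o + (probs[e] / norm) * yi for o, yi in zip(out, y)]
    return out, top


def make_toy_layer(num_experts=8, model_dim=4, hidden_dim=6, seed=0):
    """Random weights for a small layer; sizes are illustrative only."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

    gate = rand(num_experts, model_dim)
    experts = [
        {"w_in": rand(hidden_dim, model_dim), "w_out": rand(model_dim, hidden_dim)}
        for _ in range(num_experts)
    ]
    return gate, experts


gate, experts = make_toy_layer()
output, chosen = moe_layer([0.5, -1.0, 0.25, 2.0], gate, experts)
print("selected experts:", chosen)
print("expert pairs available with E = 64:", math.comb(64, 2))
```

Only two of the eight toy experts run for the token, which is where the compute saving comes from. With $E = 64$ there are $\binom{64}{2} = 2016$ possible expert pairs, the quadratic growth behind the $O(E^{2})$ remark.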

We also make additional modifications to the original Transformer architecture. We replace the standard positional embedding with the per-layer relative positional bias from Dai et al. (2019). In the non-MoE Transformer feed-forward sub-layers, we replace the first linear projection and the activation function with the Gated Linear Unit (Dauphin et al., 2017; Shazeer, 2020), which computes the component-wise product of two linear transformations of the input, followed by a Gaussian Error Linear Unit (Hendrycks & Gimpel, 2016) activation function. We partition the weights and computation of large GLaM models using the 2D sharding algorithm described in Xu et al. (2021); more details are given in Section C of the appendix.
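
A gated feed-forward sub-layer of this kind might look like the sketch below. It follows the GEGLU form from Shazeer (2020), in which the GELU is applied to one of the two linear branches before the component-wise product; the wording above leaves the exact ordering open, so treat that choice as an assumption. Biases are omitted, and the weights and names are toy values chosen here.

```python
import math


def gelu(x):
    """Exact Gaussian Error Linear Unit: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector, without bias."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def geglu_ffn(x, w_gate, w_value, w_out):
    """Feed-forward sub-layer: (GELU(W_gate x) * (W_value x)) projected by W_out."""
    gate = [gelu(g) for g in linear(w_gate, x)]
    value = linear(w_value, x)
    hidden = [g * v for g, v in zip(gate, value)]  # component-wise product
    return linear(w_out, hidden)


# Toy weights: model dimension 2, hidden dimension 3.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_value = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

y = geglu_ffn(x, w_gate, w_value, w_out)
print(y)
```

Compared with a standard two-matrix feed-forward block, the gated variant uses three weight matrices, so the hidden dimension is often reduced to keep the parameter count comparable.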

5. Experiment Setup

GLaM is a family of dense and sparse decoder-only language models, so we first elaborate on our training settings, hyperparameters, and evaluation protocol in this section.

5.1. Training Setting

We train several variants of GLaM to study the behavior of MoE and dense models on the same training data. Table 4 shows the hyperparameter settings of different scale GLaM models ranging from 130 million parameters to 1.2 trillion parameters. Here, $E$ is the number of experts in the MoE layer, $B$ is the mini-batch size, $S$ is the input sequence length, $M$ is the model and embedding dimension, $H$ is the hidden dimension of the feed-forward network, $L$ is the number of layers and $N$ is the number of total devices. Additionally, $n_{\mathrm{params}}$ is the total number of trainable model parameters, $n_{\mathrm{act-params}}$ is the number of activated model parameters per input token, $n_{\mathrm{heads}}$ is the number of self-attention heads, and $d_{\mathrm{head}}$ is the hidden dimension of each attention head. We also include the respective dense models with comparable numbers of activated parameters per token during inference (and thus similar numbers of per-token FLOPs) as references. We adopt the notation GLaM (base dense size/E), e.g., GLaM (64B/64E), to describe different variants in the GLaM models. For example, GLaM (8B/64E) represents the architecture of an approximately 8B parameter dense model with every other layer replaced by a 64-expert MoE layer. GLaM reduces to a dense Transformer-based language model architecture when each MoE layer only has one expert. The dense 137B parameter model listed in Table 4 is trained with the same dataset.

5.2. Hyperparameters and Training Procedure

We use the same learning hyperparameters for all GLaM models. More specifically, we use a maximum sequence length of 1024 tokens, and pack each input example to have up to 1 million tokens per batch. The dropout rate is set to 0 since the number of available tokens in the training corpus is much greater than the number of processed tokens during training. Our optimizer is Adafactor (Shazeer & Stern, 2018) with first-moment decay $\beta_{1} = 0$, second-moment decay $\beta_{2} = 0.99$ with a $1 - t^{-0.8}$ decay schedule, update clipping threshold of 1.0, and factored second-moment estimation. We keep the initial learning rate of 0.01 for the first 10K training steps, and then decay it with an inverse square root schedule $\mathrm{lr}(t) \propto \frac{1}{\sqrt{t}}$. On top of the standard cross-entropy loss, we add the MoE auxiliary loss as described in GShard (Lepikhin et al., 2021) with a 0.01 coefficient to encourage expert load balancing so that the gating function will distribute tokens more evenly across all experts. We use the SentencePiece (Kudo & Richardson, 2018) subword tokenizer with a vocabulary size of 256K. During training, we use float32 for model weights and bfloat16 for activations. The largest GLaM 64B/64E model was trained on 1,024 Cloud TPU-V4 chips.
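
The two schedules mentioned above can be written out as small functions. This is a sketch under assumptions: the text only gives the proportionality $\mathrm{lr}(t) \propto 1/\sqrt{t}$, so the constant is chosen here to keep the curve continuous at step 10K, and the way $\beta_{2} = 0.99$ combines with the $1 - t^{-0.8}$ schedule (treated below as an upper cap) is a guess rather than something the text specifies.

```python
import math

BASE_LR = 0.01
CONSTANT_STEPS = 10_000


def learning_rate(step):
    """Constant for the first 10K steps, then proportional to 1/sqrt(step)."""
    if step <= CONSTANT_STEPS:
        return BASE_LR
    # Assumption: scale so that the schedule is continuous at step 10K.
    return BASE_LR * math.sqrt(CONSTANT_STEPS / step)


def second_moment_decay(step, cap=0.99):
    """Second-moment decay at `step` (step >= 1).

    Assumption: follows 1 - step**-0.8 and is capped at 0.99.
    """
    return min(cap, 1.0 - step ** -0.8)


for step in (1, 100, 10_000, 40_000, 1_000_000):
    print(step, learning_rate(step), second_moment_decay(step))
```

Under these assumptions the learning rate halves every time the step count quadruples after step 10K, and the second-moment decay reaches its cap after a few hundred steps.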

Training models at the trillion parameter scale is extremely expensive even for sparsely activated models. There is little room for hyperparameter tuning. Here we share our training recipes and some implementation tricks for the GLaM models.

Table 4. Sizes and architectures of both MoE and dense models that we have trained in our experiments. Models are grouped by the number of activated parameters per token. All trained models share the same learning hyperparameters described in Section 5.2.

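GLaM Model	Type	$n_{\mathrm{params}}$	$n_{\mathrm{act-params}}$	$L$	$M$	$H$	$n_{\mathrm{heads}}$	$d_{\mathrm{head}}$	$E$
0.1B	Dense	130M	130M	12	768	3,072	12	64	-
0.1B/64E	MoE	1.9B	145M	12	768	3,072	12	64	64
1.7B	Dense	1.7B	1.700B	24	2,048	8,192	16	128	-
1.7B/32E	MoE	20B	1.878B	24	2,048	8,192	16	128	32
1.7B/64E	MoE	27B	1.879B	24	2,048	8,192	16	128	64
1.7B/128E	MoE	53B	1.881B	24	2,048	8,192	16	128	128
1.7B/256E	MoE	105B	1.886B	24	2,048	8,192	16	128	256
8B	Dense	8.7B	8.7B	32	4,096	16,384	32	128	-
8B/64E	MoE	143B	9.8B	32	4,096	16,384	32	128	64
137B	Dense	137B	137B	64	8,192	65,536	128	128	-
64B/64E	MoE	1.2T	96.6B	64	8,192	32,768	128	128	64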

· We train smaller-scale models to convergence first. This allows us to expose potential issues in the dataset and infrastructure as early as possible.

· We skip weight updates for a batch if there are any NaNs or Infs in the gradients (Shen et al., 2019). Note that NaN/Inf could still occur during the gradient application step, in which case we restart from an earlier checkpoint as described below. For example, even if there is no Inf in the existing variable or the gradient, the updated variable could still overflow to Inf.

· We restart from an early healthy checkpoint when encountering rare large fluctuations or even NaN/Inf during training. The randomness of the sequentially loaded batches may help training escape the previous failed state after the restart.
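
The last two tricks in the list above can be sketched as a guarded update step. This is a minimal illustration only: it uses plain SGD on Python lists in place of Adafactor, and the function names and status strings are hypothetical.

```python
import math


def all_finite(values):
    """True when no value is NaN or +/-Inf."""
    return all(math.isfinite(v) for v in values)


def guarded_sgd_step(params, grads, lr):
    """Return (new_params, status) for one plain-SGD update.

    status is "applied", "skipped" (bad gradients, params left unchanged), or
    "restore" (the update itself overflowed, so the caller should reload an
    earlier healthy checkpoint).
    """
    if not all_finite(grads):
        return params, "skipped"
    updated = [p - lr * g for p, g in zip(params, grads)]
    if not all_finite(updated):
        return params, "restore"
    return updated, "applied"


print(guarded_sgd_step([1.0], [2.0], 0.5))            # normal update
print(guarded_sgd_step([1.0], [float("nan")], 0.5))   # gradient contains NaN
print(guarded_sgd_step([1e308], [-1e308], 1.0))       # finite inputs, update overflows
```

The third call shows the case called out in the list: both the variable and the gradient are finite, yet the updated variable overflows, so skipping on bad gradients alone is not sufficient.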

5.3. Evaluation Setting

Protocol. To clearly demonstrate the effectiveness of GLaM models, we mainly focus on evaluating the zero-, one- and few-shot learning protocols suggested by Radford et al. (2018) and Brown et al. (2020). For the zero-shot learning setting, in most cases, we evaluate each example in the development set directly. For one/few-shot learning, we mainly draw one or a few random examples from that task's training set as the only demonstration and context. Such a demonstration is concatenated with the evaluation example with two newlines in between, and then fed into the model.
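
The prompt construction described in the protocol can be sketched as below. The function name and the toy examples are hypothetical, examples are treated as already-formatted strings, and separating multiple demonstrations from each other with the same two newlines is an assumption (the text only specifies the separator between the demonstration and the evaluation example).

```python
import random


def build_prompt(train_examples, eval_example, num_shots, seed=0):
    """Join `num_shots` randomly drawn demonstrations and the evaluation example.

    num_shots = 0 gives the zero-shot prompt, 1 the one-shot prompt, and so on.
    """
    rng = random.Random(seed)
    demos = rng.sample(train_examples, num_shots) if num_shots else []
    return "\n\n".join(demos + [eval_example])


# Toy, made-up examples for illustration only.
train = [
    "Q: What is the capital of France?\nA: Paris",
    "Q: How many legs does a spider have?\nA: 8",
    "Q: What color is the sky on a clear day?\nA: Blue",
]
query = "Q: What is the largest planet in the solar system?\nA:"

print(build_prompt(train, query, num_shots=1))
```

The model is then asked to continue the text after the final "A:", with no gradient updates involved.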

Benchmarks. To allow for an apples-to-apples comparison between GPT-3 and GLaM, we choose the same suite of evaluation tasks as Brown et al. (2020). For simplicity, however, we exclude 7 synthetic tasks (arithmetic and word unscramble) and 6 machine translation datasets. With this exclusion, we end up with 29 datasets, which include 8 natural language generative (NLG) tasks and 21 natural language understanding (NLU) tasks. These datasets can be further grouped into 7 categories and are listed in Section A.

Natural Language Generative tasks. We compare the language sequences decoded by the models to the ground truth in generative tasks. These tasks are TriviaQA, NQS, WebQS, SQuADv2, LAMBADA, DROP, QuAC and CoQA. The performance is measured by the accuracy of exact match (EM) and F1 score, following the standard for each task in Brown et al. (2020). We use beam search with a width of 4 to generate the sequences.

Natural Language Understanding tasks. Most language understanding tasks require the model to select one correct answer from multiple options. All binary classification tasks are formulated as selecting between two options ('Yes' or 'No'). The prediction is based on the maximum log-likelihood of each option given the context, $\log P(\mathrm{option} \mid \mathrm{context})$, normalized by the token length of each option. On a few tasks, such as ReCoRD (Zhang et al., 2018) and COPA (Gordon et al., 2012), the non-normalized loss can yield better results and thus is adopted. Except for MultiRC (Khashabi et al., 2018), where the F1 metric over the set of answer options (referred to as $\mathrm{F1}_{a}$) is reported, the prediction accuracy metric is used for all the other tasks. We use the average of the scores reported in all datasets to report the overall few-shot performance of models on both NLG and NLU tasks. Both Accuracy (EM) and F1 scores have been normalized to lie between 0 and 100. On TriviaQA, we also report the testing server score of our one-shot submission.
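
The option-scoring rule can be sketched as follows. The function name is hypothetical and the per-token log-probabilities are made-up numbers standing in for model outputs; the point is only to show how length normalization can change which option wins.

```python
def pick_option(option_token_logprobs, normalize=True):
    """Return the index of the highest-scoring option.

    option_token_logprobs: one list per option, holding
    log P(token | context, previous option tokens) for each token of that option.
    normalize: divide the summed log-likelihood by the option's token length.
    """
    scores = []
    for logprobs in option_token_logprobs:
        total = sum(logprobs)
        scores.append(total / len(logprobs) if normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical values: option 0 has one token, option 1 has three tokens.
options = [
    [-0.9],
    [-0.4, -0.5, -0.3],
]

print(pick_option(options, normalize=True))   # per-token average favors option 1
print(pick_option(options, normalize=False))  # raw sum favors the shorter option 0
```

Without normalization, longer options are penalized simply for having more tokens, which is why the choice between the two variants is made per task.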

6. Results

We conduct extensive evaluation on the whole family of GLaM models to show the advantages of sparsely activated models in language modeling and their scaling trends. We also quantitatively inspect the effectiveness of data quality for language model training.

6.1. Comparison between MoE and Dense Models

As previously presented in Table 1, GLaM (64B/64E) has competitive performance compared to GPT-3 (175B) for zero, one and few-shot learning. Figure 1 compares the performance for each category of tasks. In total, GLaM (64B/64E) outperforms GPT-3 in 6 out of 7 categories on average, indicating the performance gain is consistent. For more details on each individual task, see Table 11. We include results on the much larger and computationally demanding Megatron-NLG and Gopher for reference. More importantly, as shown in Table 4, GLaM (64B/64E) activates roughly 96.6B parameters per token during inference, which requires only half of the compute FLOPs needed by GPT-3 given the same input.

We highlight one particular challenging open-domain question answer task: TriviaQA. In open-domain question answer tasks, the model is required to directly answer a given query without access to any additional context. Brown et al. (2020) show that the few-shot performance of TriviaQA is able to grow smoothly with model size, indicating a language model is able to absorb knowledge using its model capacity. As shown in Table 5, GLaM (64B/64E) is better than the dense model and outperforms the previous finetuned state-of-the-art (SOTA) on this dataset in the opendomain setting. Our one-shot result exceeds the previous finetuned SOTA (Yu et al., 2022) where additional knowledge graph information is infused by $8.6%$ , and outperforms the few-shot GPT-3 on the testing server by $5.3%$ . This suggests that the additional capacity of GLaM plays a crucial role in the performance gain even though the $n_{\mathrm{act}}$ params Of GLaM (64B/64E) is only half of that in GPT-3. Comparing to Switch-C, even though both models have similar total number of parameters, GLaM (64B/64E) uses much larger experts (beyond one TPU core) than Switch-C. Therefore, GLaM's one-shot performance on TriviaQA is also better than the fine-tuned results of Switch-C in the open-domain setting. Finally, we report zero, one and few-shot evaluation mainly on the development set for all tasks in Tables 11, 12, 13 and 14 of the appendix.

我们强调一个特别具有挑战性的开放域问答任务：TriviaQA。在开放域问答任务中，模型需要直接回答给定的查询，而无需访问任何额外的上下文。Brown 等 (2020) 显示 TriviaQA 的少样本性能能够随着模型规模平滑增长，这表明大语言模型能够利用其模型容量吸收知识。如表 5 所示，GLaM (64B/64E) 优于密集模型，并在此数据集的开放域设置中超越了之前的微调最先进水平 (SOTA)。我们的单样本结果超过了之前微调的 SOTA (Yu 等, 2022)，其中额外的知识图谱信息被注入了 8.6%，并且在测试服务器上超过少样本 GPT-3 的性能 5.3%。这表明即使 GLaM (64B/64E) 的 $n_{\mathrm{act}}$ 参数仅为 GPT-3 的一半，GLaM 的额外容量在性能提升中起着关键作用。与 Switch-C 相比，尽管两个模型的总参数数量相似，但 GLaM (64B/64E) 使用的专家规模远大于 Switch-C（超出一个 TPU 核心）。因此，GLaM 在 TriviaQA 上的单样本性能也优于 Switch-C 在开放域设置中的微调结果。最后，我们在附录的表 11、12、13 和 14 中报告了所有任务的主要开发集上的零样本、单样本和少样本评估。
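The 8.6% and 5.3% margins quoted above appear to be relative improvements rather than absolute point differences. The short check below recomputes them from the Table 5 accuracies. Pairing the GLaM dev score with the KG-FiD test score for the first margin is an assumption, made because it is the pairing that reproduces the quoted figure; the helper name `relative_gain` is illustrative only.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Accuracy values copied from Table 5 (TriviaQA, open-domain).
KG_FID_FINETUNED_TEST = 69.8
GPT3_64_SHOT_TEST = 71.2
GLAM_ONE_SHOT_TEST = 75.0
GLAM_ONE_SHOT_DEV = 75.8

# Assumed pairing: GLaM one-shot (dev) against the finetuned SOTA.
gain_over_sota = relative_gain(GLAM_ONE_SHOT_DEV, KG_FID_FINETUNED_TEST)
# GLaM one-shot (test) against few-shot GPT-3 on the test server.
gain_over_gpt3 = relative_gain(GLAM_ONE_SHOT_TEST, GPT3_64_SHOT_TEST)

print(f"over finetuned SOTA: {gain_over_sota:.1f}%")
print(f"over few-shot GPT-3: {gain_over_gpt3:.1f}%")
```

In absolute terms the same comparisons are 6.0 and 3.8 accuracy points.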

6.2. Effect of Data Quality

6.2. 数据质量的影响

We study the impact of data quality on the few-shot performance of downstream tasks. We use a modest-size GLaM model (1.7B/64E) to show the effectiveness of filtering text on model quality. We train models with the same hyperparameters on two datasets. One is the original dataset described in Section 3 and the second consists of the dataset with the filtered webpages replaced with the unfiltered webpages. The mixing proportions are fixed as given in Table 3.

我们研究了数据质量对下游任务少样本性能的影响。我们使用中等规模的 GLaM 模型 (1.7B/64E) 来展示过滤文本对模型质量的有效性。我们在两个数据集上使用相同的超参数训练模型。一个是第 3 节描述的原始数据集，第二个是由未过滤网页替换过滤网页组成的数据集。混合比例固定为表 3 中给出的比例。

Table 5. GLaM (64B/64E) one-shot performance significantly outperforms prior SOTAs for open domain settings in the wiki split.

表 5. GLaM (64B/64E) 单样本性能在 wiki 分割的开放域设置中显著优于之前的最先进模型 (SOTA)。

Model	TriviaQA (Open-Domain)
KG-FiD (large) (Yu et al., 2022) (finetuned, test)	69.8
Switch-C (finetuned, dev)	47.5
GPT-3 one-shot (dev)	68.0
GPT-3 64-shot (test)	71.2
GLaM one-shot (test)	75.0
GLaM one-shot (dev)	75.8

The filtered webpages consist of 143B tokens whereas the unfiltered webpages consist of around 7T tokens.

过滤后的网页包含 143B 个 Token，而未过滤的网页则包含大约 7T 个 Token。

Figure 3 (c) and (d) show that the model trained on filtered data performs consistently better on both NLG and NLU tasks. In particular, the effect of filtering is bigger on NLG than on NLU. Perhaps this is because NLG often requires generating high-quality language, and filtered pretraining corpora are crucial to the generation capability of language models. Our study highlights the fact that the quality of the pretraining data also plays a critical role, specifically, in the performance of downstream tasks.

图 3 (c) 和 (d) 显示，在过滤后的数据上训练的模型在自然语言生成 (NLG) 和自然语言理解 (NLU) 任务上表现一致更好。特别是，过滤对 NLG 的影响比对 NLU 的影响更大。这可能是因为 NLG 通常需要生成高质量的语言，而过滤后的预训练语料库对语言模型的生成能力至关重要。我们的研究突显了预训练数据的质量也在下游任务的表现中起着关键作用。

6.3. Scaling Studies

6.3. 扩展研究

Scaling up dense language models generally involves making the models deeper by adding more layers, and wider by increasing the embedding dimension of token representations. This process increases the total number of parameters $n_{\mathrm{params}}$ of the model. For each prediction on a given input example, these models are 'dense' in that all $n_{\mathrm{params}}$ parameters will be activated, i.e., $n_{\mathrm{params}}=n_{\mathrm{act\text{-}params}}$ in Table 4. Therefore, the effective FLOPs per prediction increase linearly with the model size $n_{\mathrm{params}}$. While the increased FLOPs may lead to boosted predictive performance, they also raise the overall cost per prediction.

扩大密集型语言模型通常涉及通过增加更多层使模型更深，通过增加 Token 表示的嵌入维度使模型更宽。这个过程增加了模型的总参数数量 $n_{\mathrm{params}}$ 。对于每个给定输入示例的预测，这些模型是“密集”的，即所有 $n_{\mathrm{params}}$ 参数都会被激活，也就是说，在表 4 中 $n_{\mathrm{params}}=n_{\mathrm{act\mathrm{params}}}$ 。因此，每次预测的有效浮点运算次数 (FLOPs) 随模型大小 $n_{\mathrm{params}}$ 线性增加。虽然增加的 FLOPs 可能会提高预测性能，但也提高了每次预测的总体成本。

In contrast, GLaM MoE models are sparsely activated in that only a small fraction of the total $n_{\mathrm{params}}$ parameters will be activated for each prediction, where $n_{\mathrm{params}}\gg n_{\mathrm{act\text{-}params}}$. Therefore, GLaM MoE models can scale by also growing the size or number of experts in the MoE layer.

相比之下，GLaM MoE 模型是稀疏激活的，即每次预测中只有总 $n_{\mathrm{params}}$ 参数的一小部分会被激活，其中 $n_{\mathrm{params}} \gg n_{\mathrm{act\text{-}params}}$。因此，GLaM MoE 模型可以通过增加 MoE 层中的专家数量或规模来进行扩展。
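To make the gap between total and activated parameters concrete, the arithmetic below uses only figures stated in this paper: roughly 1.2 trillion total parameters and 96.6B activated parameters per token for GLaM (64B/64E), and 175B for the dense GPT-3. The variable names are illustrative, and the 1.2T total is the rounded value from the text, so the resulting fraction is approximate.

```python
# All values are in billions of parameters and are taken from the text.
TOTAL_PARAMS_B = 1200.0     # GLaM (64B/64E): about 1.2 trillion in total (rounded)
ACTIVATED_PARAMS_B = 96.6   # GLaM (64B/64E): activated per token
GPT3_PARAMS_B = 175.0       # dense model: every parameter is activated

# Share of GLaM's parameters that take part in a single prediction.
activated_fraction = ACTIVATED_PARAMS_B / TOTAL_PARAMS_B

# Activated parameters of GLaM relative to the dense GPT-3.
ratio_to_gpt3 = ACTIVATED_PARAMS_B / GPT3_PARAMS_B

print(f"activated fraction of GLaM: {activated_fraction:.1%}")
print(f"activated params vs GPT-3:  {ratio_to_gpt3:.2f}x")
```

Under these numbers only about 8% of GLaM's parameters are active for any one token, and that active set is a little over half the size of GPT-3.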

As shown in Figure 3(a), the average zero, one and few-shot performance across the generative tasks scales well with the effective FLOPs per prediction, which is in turn determined by $n_{\mathrm{act\text{-}params}}$. We also find that GLaM MoE models perform consistently better than GLaM dense models for similar effective FLOPs per token. For language understanding tasks shown in Figure 3(b), the performance gain of GLaM MoE models has a similar scaling trend to that of the generative tasks. We observe that both MoE and dense models perform similarly at smaller scales but MoE models outperform at larger scales. We also show experiments with scaling the number of experts in Section B where we observe that, for a fixed budget of computation per prediction, adding more experts generally leads to better predictive performance.

如图 3(a) 所示，生成式任务的平均零样本、一样本和少样本性能随着每次预测的有效 FLOPs 增加而提升，这又由 $n_{\mathrm{act.}}$ 参数决定。我们还发现，对于相似的每个 Token 的有效 FLOPs，GLaM MoE 模型的表现始终优于 GLaM 密集模型。对于图 3(b) 中所示的语言理解任务，GLaM MoE 模型的性能提升与生成式任务具有类似的扩展趋势。我们观察到，在较小规模上，MoE 模型和密集模型表现相似，但在较大规模上，MoE 模型表现更优。我们在附录 B 中展示了增加专家数量的实验，结果表明，在固定每次预测的计算预算下，增加更多的专家通常会带来更好的预测性能。

Figure 3. Average zero, one and few-shot performance of GLaM MoE models versus GLaM dense models for similar effective FLOPs per token over the 8 NLG tasks (a) and 21 NLU tasks (b). Comparison of model performance with filtered and unfiltered training data using GLaM (1.7B/64E). Filtered data improves results significantly over unfiltered data for both (c) NLG and (d) NLU tasks across zero, one and few-shot settings.

图 3. GLaM MoE 模型与 GLaM dense 模型在相似的每 Token 有效 FLOPs 下，针对 8 个 NLG 任务 (a) 和 21 个 NLU 任务 (b) 的平均零样本、单样本和少样本性能。使用 GLaM (1.7B/64E) 对过滤和未过滤训练数据的模型性能进行比较。对于零样本、单样本和少样本设置，过滤后的数据在 (c) NLG 和 (d) NLU 任务上显著优于未过滤数据。

6.4. Efficiency of GLaM

6.4. GLaM 的效率

Existing large dense language models usually require tremendous amounts of computation resources for training and serving (Patterson et al., 2021). They also need to consume massive amounts of pretraining data. We investigate the data and compute efficiency of the proposed GLaM models.

现有的大语言模型通常需要大量的计算资源来进行训练和推理 (Patterson et al., 2021)。它们还需要消耗大量的预训练数据。我们研究了所提出的 GLaM 模型的数据和计算效率。

Data Efficiency. Figure 4 (a-c) and Figure 4 (e-g) show the learning curves of our models compared to the dense baselines of similar effective FLOPs in both NLG and NLU tasks. The x-axis is the number of tokens used in training, where we explicitly include GPT-3's results when it is around 300B tokens. We first observe that GLaM MoE models require significantly less data than dense models of comparable FLOPs to achieve similar zero, one, and few-shot performance. In other words, when the same amount of data is used for training, MoE models perform much better, and the difference in performance becomes larger when training up to 630B. Moreover, the GLaM (64B/64E) model trained with 280B tokens outperforms GPT-3 trained with 300B tokens by large margins on 4 out of the 6 learning settings (zero-shot/one-shot NLU and one-shot/few-shot NLG), and matches GPT-3 scores for the remaining setting, i.e., zero-shot NLG tasks.

数据效率。图 4 (a-c) 和图 4 (e-g) 显示了我们的模型与具有相似有效 FLOPs 的密集基线模型在 NLG 和 NLU 任务中的学习曲线。$\mathbf{X}$ 轴是训练中使用的 Token 数量，其中我们明确包含了 GPT-3 在大约 300B Token 时的结果。我们首先观察到，GLaM MoE 模型相比具有相似 FLOPs 的密集模型，需要显著更少的数据来实现类似的零样本、单样本和少样本性能。换句话说，当使用相同数量的数据进行训练时，MoE 模型表现得更好，并且在训练达到 630B 时，性能差异变得更大。此外，使用 280B Token 训练的 GLaM (64B/64E) 模型在 6 种学习设置中的 4 种（零样本/单样本 NLU 和单样本/少样本 NLG）上大幅超过了使用 300B Token 训练的 GPT-3，在其余设置即零样本 NLG 任务中，GLaM 的得分与 GPT-3 相当。

Computation Efficiency & Energy Consumption. Figure 4 (d) and Figure 4 (h) show how the average zero, one and few-shot performance scales with the number of TPU years spent training MoE and dense models. We find that to achieve similar performance on downstream tasks, training sparsely activated models takes much less computational resources than training dense models.

计算效率与能耗。图 4 (d) 和图 4 (h) 显示了 MoE 模型和稠密模型的平均零样本、一样本和少样本性能如何随着 TPU 年数的变化而变化。我们发现，要实现类似的下游任务性能，训练稀疏激活模型所需的计算资源比训练稠密模型少得多。

As previously presented in Table 1, the GLaM (64B/64E) training after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach scores similar to (and slightly exceeding) those of GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh, or 1/6 of the GPT-3 energy cost. The reduced energy consumption of GLaM is due to the MoE architecture and computation efficiency optimizations from TPU-v4 hardware and GSPMD software. Energy calculations can be found in Section F.

如之前表 1 所示，GLaM (64B/64E) 在 600B tokens 训练后消耗 456 MWh，约为 GPT-3 所用 1287 MWh 能源成本的 1/3。此外，为了达到与 GPT-3 相似（并略微超过）的分数，我们使用 1,024 个 TPU-v4 芯片训练了 574 小时（包含 280B tokens）。这消耗了 213 MWh 或 GPT-3 能源成本的 1/6。GLaM 的能源消耗减少是由于 MoE 架构和来自 TPU-v4 硬件及 GSPMD 软件的计算效率优化。能源计算详见附录 F。
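The "about 1/3" and "1/6" ratios follow directly from the three energy figures quoted above, as the sketch below shows. The final quantity is only an implied average derived here from the stated chip count and training hours; it is not a number given in this section, and it presumably folds in whatever overheads the accounting in Section F includes.

```python
# Energy figures quoted in the text, in MWh.
GPT3_MWH = 1287.0
GLAM_600B_TOKENS_MWH = 456.0
GLAM_280B_TOKENS_MWH = 213.0

# Hardware budget quoted for the 280B-token run.
NUM_CHIPS = 1024
TRAINING_HOURS = 574

frac_full_run = GLAM_600B_TOKENS_MWH / GPT3_MWH    # "about 1/3"
frac_short_run = GLAM_280B_TOKENS_MWH / GPT3_MWH   # "1/6"

# Implied average power per chip for the 280B-token run (derived, not quoted).
chip_hours = NUM_CHIPS * TRAINING_HOURS
implied_watts_per_chip = GLAM_280B_TOKENS_MWH * 1e6 / chip_hours

print(f"600B-token run vs GPT-3: {frac_full_run:.3f}")
print(f"280B-token run vs GPT-3: {frac_short_run:.3f}")
print(f"implied average draw:    {implied_watts_per_chip:.0f} W per chip")
```

The first ratio, about 0.354, also matches the -64.6% relative training energy reported in Table 1.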

7. Ethics and Unintended Biases

7. 伦理和非预期偏差

Large language models' zero- and few-shot inference is an exciting capability: being able to control model behaviour intuitively with natural language and small datasets significantly lowers the barrier to prototyping and the development of new applications; it has the potential to help democratise the use of AI by dramatically decreasing the need for specialist knowledge. However, such opportunities also serve to highlight the importance of the many ethical challenges (Leidner & Plachouras, 2017; Bender et al., 2021; Bommasani et al., 2021) including representation bias (Blodgett et al., 2020), proper selection and handling of training data (Rogers, 2021) and its documentation (Bender & Friedman, 2018), privacy (Abadi et al., 2016b; Carlini et al., 2020), and environmental concerns (Strubell et al., 2019; Patterson et al., 2021). An important strand of this research focuses on unintended biases learnt by language models, including correlations between gender and profession (Bolukbasi et al., 2016; Rudinger et al., 2018; Zhao et al., 2018), negative sentiment about racial and religious groups (Li et al., 2020; Nadeem et al., 2021), and about people with disabilities (Hutchinson et al., 2020), as well as other social biases (Caliskan et al., 2017; Rudinger et al., 2017; Sap et al., 2020; Sotnikova et al., 2021). While measuring and mitigating the potential harm of language models is a very active area of research, as recognized by Blodgett et al. (2021) and Jacobs & Wallach (2021), there is still a significant need for more rigorous evaluation methods to assess the degree to which language models encode harmful stereotypes (May et al., 2019; Webster et al., 2021).

大语言模型的零样本和少样本推理能力令人兴奋：能够通过自然语言和小数据集直观地控制模型行为，大大降低了原型设计和开发新应用的门槛；它有可能通过大幅减少对专业知识的需求来帮助普及使用 AI。然而，这些机会也凸显了许多伦理挑战的重要性 (Leidner & Plachouras, 2017; Bender et al., 2021; Bommasani et al., 2021)，包括代表性偏差 (Blodgett et al., 2020)、训练数据的适当选择和处理 (Rogers, 2021) 及其文档 (Bender & Friedman, 2018)、隐私 (Abadi et al., 2016b; Carlini et al., 2020) 和环境问题 (Strubell et al., 2019; Patterson et al., 2021)。这一研究的重要方向之一是关注语言模型无意中学习到的偏见，包括性别与职业之间的相关性 (Bolukbasi et al., 2016; Rudinger et al., 2018; Zhao et al., 2018)，对种族和宗教群体的负面情绪 (Li et al., 2020; Nadeem et al., 2021)，以及对残疾人的态度 (Hutchinson et al., 2020)，以及其他社会偏见 (Caliskan et al., 2017; Rudinger et al., 2017; Sap et al., 2020; Sotnikova et al., 2021)。尽管测量和减轻语言模型潜在危害是一个非常活跃的研究领域，正如 Blodgett et al. (2021); Jacobs & Wallach (2021) 所认识到的那样，仍然迫切需要更严格的评估方法来评估语言模型编码有害刻板印象的程度 (May et al., 2019; Webster et al., 2021)。

Figure 4. Learning efficiency comparison. Average zero-shot, one-shot and few-shot performance of GLaM MoE models versus GLaM dense models as more tokens are processed during training for 9 NLG tasks (a-c) and 21 NLU tasks (e-g). Panels (d) and (h) display the learning curves against the number of TPU years for NLG and NLU, respectively.

图 4: 学习效率对比。GLaM MoE 模型与 GLaM 密集模型在训练过程中处理更多 Token 时，针对 9 个 NLG 任务 (a-c) 和 21 个 NLU 任务 (e-g) 的平均零样本、单样本和少样本性能。面板 (d) 和 (h) 分别显示了学习曲线与 TPU 年数的关系。

While there is not yet consensus on measurement methods or criteria for such general purpose large language models, the versatility and power of these models make it important to assess them on a range of metrics. We take inspiration from GPT-3 (Brown et al., 2020) and examine the co-occurrence in generated text referencing identity terms as well as report on the WinoGender benchmark (Rudinger et al., 2018). We also analyse toxicity degeneration similarly to Gopher (Rae et al., 2021), and extend the analysis to consider the human-behavioral baseline.

虽然目前还没有就这些通用大语言模型的测量方法或标准达成共识，但这些模型的多功能性和强大性能使得对其在多个指标上进行评估变得非常重要。我们从 GPT-3 (Brown 等, 2020) 中获得灵感，考察生成文本中身份术语的共现情况，并报告 WinoGender 基准测试 (Rudinger 等, 2018) 的结果。我们还类似地分析了毒性退化问题，如同 Gopher (Rae 等, 2021)，并将分析扩展到考虑人类行为基线。

7.1. Co-occurrence prompts

7.1. 共现提示 (Co-occurrence prompts)

Following the procedure described in Brown et al. (2020), we analyze commonly co-occurring words in the continuations when given prompts like “{term} was very...”, where the substituted term references gender, religion, or racial and ethnic identity. For each prompt (Table 7 of the appendix), 800 outputs are generated using top-$k$ sampling ($k=40$) with a temperature of 1. An off-the-shelf POS tagger (Bird & Loper, 2004) is used to remove stop words and select only descriptive words (i.e., adjectives and adverbs). Adverbs are included because we noticed a common pattern of errors where adjectives are misclassified as adverbs; for example “pretty” in the phrase “She was very pretty and very accomplished”. Like Brown et al. (2020), to make the analysis transparent and easily reproducible, we omit any manual human labeling.

按照 Brown 等 (2020) 描述的程序，我们分析了在给定提示如“{term} was very...”（其中替换的术语涉及性别、宗教、种族和民族身份）时，续写中常见的共现词。对于每个提示（附录中的表 7），使用 top-$k$ 抽样 ($k=40$) 生成 800 个输出，温度为 1。使用现成的 POS 标记器 (Bird & Loper, 2004) 删除停用词并选择仅描述性词汇（即形容词和副词）。包括副词是因为我们注意到一个常见的错误模式，即形容词被误分类为副词；例如短语“She was very pretty and very accomplished”中的“pretty”。与 Brown 等 (2020) 类似，为了使分析透明且易于重现，我们省略了任何人工标注。
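The filtering step described above can be sketched as follows. This is a minimal illustration, not the code used in the paper: it assumes the continuations have already been POS-tagged with Penn Treebank tags (adjectives start with `JJ`, adverbs with `RB`), and the stop-word set and the function name `descriptive_word_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"very", "so", "too", "not"}


def descriptive_word_counts(tagged_continuations):
    """Count adjectives and adverbs across POS-tagged continuations.

    Each continuation is a list of (word, tag) pairs with Penn Treebank tags.
    Adverbs are kept alongside adjectives because taggers sometimes label
    adjectives such as "pretty" as adverbs.
    """
    counts = Counter()
    for tagged in tagged_continuations:
        for word, tag in tagged:
            lowered = word.lower()
            if tag.startswith(("JJ", "RB")) and lowered not in STOP_WORDS:
                counts[lowered] += 1
    return counts


# Two hand-tagged continuations; "pretty" is mis-tagged as an adverb in the first.
sample = [
    [("She", "PRP"), ("was", "VBD"), ("very", "RB"), ("pretty", "RB"),
     ("and", "CC"), ("very", "RB"), ("accomplished", "JJ")],
    [("She", "PRP"), ("was", "VBD"), ("very", "RB"), ("pretty", "JJ")],
]

top_words = descriptive_word_counts(sample).most_common(2)
print(top_words)
```

Keeping both tag families means the mis-tagged “pretty” in the first continuation is still counted together with the correctly tagged one in the second.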

Like the analysis of other large language models that we build on, we note associative biases for all dimensions are obvious, for example “pretty” is the most associated description for the term “She”, while it is not in the top-10 for the term “He”. Table 8 shows the most frequently occurring descriptive words in response to prompt-templates for gendered pronouns, and Tables 9 and 10 of the appendix show the same for race and religion prompts.

与其他我们所构建的大语言模型的分析类似，我们注意到所有维度的关联偏见都很明显，例如“pretty”是与术语“She”最相关的描述，而它不在术语“He”的前十位相关描述中。表 8 显示了对性别代词提示模板做出响应时最常出现的描述性词汇，附录中的表 9 和表 10 分别显示了种族和宗教提示的相同内容。

7.2. WinoGender

WinoGender 是一项用于评估模型在性别偏差方面的表现的任务。它通过构造特定的句子来测试模型是否会在职业或其他社会角色上产生性别偏见。这项任务强调了在自然语言处理中考虑性别平等的重要性。

Coreference resolution is a capability that many applications require to perform well, including machine translation (Stanovsky et al., 2019; Webster & Pitler, 2020) and question answering (Lamm et al., 2020). To assess whether gendered correlations in GLaM cause it to make coreference errors in the one-shot setting, we measure WinoGender (Rudinger et al., 2018). GLaM (64B/64E) achieves a new state-of-the-art of 71.7% on the full dataset (compared to 64.2% for GPT-3 (Brown et al., 2020)). Promisingly, accuracy is remarkably close between “he” examples (70.8%) and “she” examples (72.5%), as well as between stereotypical examples (where the intended distribution is assumed to be close to the US occupation statistics (Rudinger et al., 2018)) and anti-stereotypical (or “gotcha”) examples (both 71.7%).

共指消解是许多应用程序要表现良好所需的一项功能，包括机器翻译 (Stanovsky et al., 2019; Webster & Pitler, 2020) 和问答 (Lamm et al., 2020)。为了评估 GLaM 中的性别相关性是否导致其在单样本设置中出现共指错误，我们测量了 WinoGender (Rudinger et al., 2018)。GLaM (64B/64E) 在整个数据集上达到了新的最先进水平 71.7%（相比之下，GPT-3 (Brown et al., 2020) 的成绩为 64.2%）。令人鼓舞的是，“he” 示例 (70.8%) 和 “she” 示例 (72.5%) 的准确率非常接近，刻板印象示例（假设预期分布接近美国职业统计数据，(Rudinger et al., 2018)）和反刻板印象（或“gotcha”）示例（均为 71.7%）之间的准确率也非常接近。

Figure 5. The relationship between the Toxicity Probability of the Prompt (TPP), and the Toxicity Probability of the Continuation (TPC). Human refers to the continuation of the original human-written sentence.

图 5. 提示的毒性概率 (TPP) 与续写内容的毒性概率 (TPC) 之间的关系。Human 指的是原始人类编写句子的续写。

7.3. Toxicity Degeneration

7.3. 毒性退化 (Toxicity Degeneration)

Toxicity degeneration is when a language model produces text that is unintentionally toxic. To evaluate toxicity degeneration, we adapt the methodology used by Welbl et al. (2021) and Rae et al. (2021). We use the RealToxicityPrompts dataset (Gehman et al., 2020), which consists of sentences that have been split into two parts: a prompt prefix, and a continuation postfix. Like the previous studies, we also use the Perspective API, which assigns a probability that the text would be considered to be rude, disrespectful or otherwise likely to make people want to leave a conversation. We then assess how likely a continuation is to be toxic given various likelihoods that the prompt was toxic.

毒性退化是指大语言模型生成的文本无意中具有攻击性。为了评估毒性退化，我们采用了 (Welbl et al., 2021; Rae et al., 2021) 中使用的方法。我们使用 Real Toxicity Prompts 数据集 (Gehman et al., 2020)，该数据集由句子拆分为两部分组成：提示前缀和延续后缀。与之前的研究一样，我们也使用 Perspective API，该 API 会赋予文本一个被认为是粗鲁、不尊重或可能使人们想要离开对话的概率。然后我们评估在给定提示具有不同毒性概率的情况下，延续文本具有毒性的可能性。

For each of 10K randomly sampled prompts, we generate 25 continuations, with up to 100 tokens per continuation, using top-$k$ sampling ($k=40$) with a temperature of 1. The Perspective API requires a non-empty string, so we assign a toxicity score of 0.0 when the continuation is the empty string; this could represent, for example, a chatbot simply refusing to respond.

对于每个 10K 随机采样的提示，我们生成 25 个续写，每个续写最多 100 个 Token，使用 top $k$ 抽样 $\lvert k=40\rvert$ ，温度为 1。Perspective API 需要非空字符串，因此当续写为空字符串时，我们分配毒性得分为 0.0；这可以表示例如，聊天机器人简单地拒绝响应。
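The two mechanical pieces of this setup, top-$k$ sampling with a temperature and the zero score for empty continuations, can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: `top_k_sample` works on a plain list of logits, and `scorer` is a hypothetical stand-in for a call to a toxicity classifier such as the Perspective API, whose real client interface is not shown here.

```python
import math
import random


def top_k_sample(logits, k=40, temperature=1.0, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalises the weights, giving a softmax over the top k.
    return rng.choices(top, weights=weights, k=1)[0]


def continuation_toxicity(text, scorer):
    """Score a continuation, assigning 0.0 to the empty string.

    `scorer` is a hypothetical callable mapping non-empty text to a
    toxicity probability in [0, 1].
    """
    if text == "":
        return 0.0
    return scorer(text)


# Usage with toy logits: with k=2 only indices 1 and 2 can ever be drawn.
rng = random.Random(0)
toy_logits = [0.0, 5.0, 1.0, -2.0]
draws = {top_k_sample(toy_logits, k=2, rng=rng) for _ in range(200)}
print(sorted(draws))
```

With `k=1` the sampler reduces to greedy decoding, since only the single largest logit survives the truncation.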

Figure 5 shows the relationship between the Toxicity Probability of the Prompt (TPP), and the Toxicity Probability of the Continuation (TPC). Note that, for low TPP, the relatively high human TPC is due to the sampling strategy used to create the underlying dataset: sentences were selected across the toxicity spectrum. Moreover, toxicity can often be identified locally within a sentence, and toxicity in this dataset tends to occur later in the sentences. This causes the human TPC to slightly drop as the TPP increases. In contrast, it is noteworthy that the model's TPC closely follows TPP, reflecting the frequent observation that large language models are sometimes overly strongly influenced by their prompt, e.g. repeating phrases from the prompt.

图 5: 显示了提示的毒性概率 (TPP) 与续写部分的毒性概率 (TPC) 之间的关系。注意，对于较低的 TPP，相对较高的人类 TPC 是由于用于创建基础数据集的采样策略：句子是从毒性的不同范围内选择的。此外，毒性通常可以在句子内部局部识别，并且在这个数据集中，毒性往往出现在句子的后半部分。这导致随着 TPP 的增加，人类 TPC 稍微下降。相比之下，值得注意的是，模型的 TPC 密切跟随 TPP，反映了大语言模型有时过度受提示影响的常见观察，例如重复提示中的短语。

We also analysed the distribution of toxicity probabilities from the API for batches of 25 continuations. This highlighted that, even for low toxicity prompts, it is very likely that some generated continuation will be judged as toxic by most people reviewing it, according to the Perspective API's predicted probability; further details can be found in Figure 8. We also note that this dataset's sampling strategy, and the source it is taken from (Reddit) are likely not reflective of other domains. Moreover, even for very low TPP, applications are likely to want a much lower TPC: even generating 1 in 100 toxic suggestions is likely to be very problematic for applications.

我们还分析了从 API 获取的 25 个续写文本批次的毒性概率分布。这突显了即使对于低毒性提示，某些生成的续写也很可能被大多数审阅者认为是有毒的，根据 Perspective API 的预测概率；更多细节请参见图 8。我们还注意到，此数据集的采样策略及其来源（Reddit）可能无法反映其他领域的情况。此外，即使对于非常低的毒性提示概率 (TPP)，应用程序也希望毒性内容比例 (TPC) 更低：即使每 100 条生成的内容中有 1 条是有毒的，也可能对应用程序造成很大问题。
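A back-of-the-envelope calculation shows why a seemingly small per-sample rate matters at the batch level. The sketch below takes the "1 in 100" rate mentioned above and the batch size of 25, and assumes that continuations are toxic independently of one another. That independence is a simplifying assumption made here for illustration, not a claim from the paper.

```python
def prob_at_least_one_toxic(p_single: float, n: int = 25) -> float:
    """Probability that at least one of n independent samples is toxic."""
    return 1.0 - (1.0 - p_single) ** n


# "1 in 100" per-continuation rate, batches of 25 continuations.
p_batch = prob_at_least_one_toxic(0.01, 25)
print(f"{p_batch:.3f}")
```

Under this assumption, roughly one batch in five would contain at least one toxic continuation even at a 1% per-sample rate.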

8. Discussion

8. 讨论

As observed in previous work on sparsely-activated models (Fedus et al., 2021), MoE models are more performant in knowledge-oriented tasks. Open-domain tasks are one way of measuring the amount of knowledge stored in a model. The performance of the MoE model in open-domain QA benchmarks such as TriviaQA demonstrates the significantly increased information capacity of these models compared to dense models of similar effective FLOPs. Despite the in-context learning and training efficiency advantages, the sparsely activated models consist of a higher number of parameters and thus require a larger number of devices. This limits the resource accessibility and increases the serving cost, especially when the serving traffic is low.

如之前关于稀疏激活模型 (Fedus et al., 2021) 的研究中所观察到的，MoE 模型在知识导向任务中表现更优。开放域任务是衡量模型中存储知识量的一种方式。MoE 模型在开放域 QA 基准测试（如 TriviaQA）中的表现展示了这些模型相比有效 FLOPs 相似的密集模型具有显著增加的信息容量。尽管在上下文学习和训练效率方面有优势，但稀疏激活模型包含更多的参数，因此需要更多的设备。这限制了资源的可访问性，并且当服务流量较低时，会增加服务成本。

9. Conclusions

9. 结论

We propose and develop a family of generalist language models called GLaM, which use a sparsely activated mixture-of-experts architecture to achieve better average scores than not only their dense counterparts of similar effective FLOPs, but also the GPT-3 models on 29 representative NLP tasks in zero, one and few-shot learning. In particular, GLaM (64B/64E), our largest 1.2 trillion parameter MoE language model, achieves better average performance with only one third of the energy consumption compared to training GPT-3. We hope that our work will encourage more research into methods for obtaining high-quality data, and using MoE for more efficient scaling of giant language models.

我们提出并开发了一类称为 GLaM 的通用语言模型，这些模型采用稀疏激活的专家混合架构，在零样本、单样本和少样本学习的 29 个代表性 NLP 任务中，不仅比有效 FLOPs 相似的密集模型，还比 GPT-3 模型取得了更好的平均分数。特别是，GLaM (64B/64E)，我们最大的 1.2 万亿参数 MoE 语言模型，仅以三分之一的能耗就实现了比训练 GPT-3 更好的平均性能。我们希望我们的工作能够鼓励更多关于获取高质量数据的方法的研究，并利用 MoE 实现巨型语言模型更高效的扩展。

References

参考文献

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283, Savannah, GA, November 2016a. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., 和 Zheng, X. TensorFlow: 一个用于大规模机器学习的系统。在第 12 届 USENIX 操作系统设计与实现研讨会 (OSD1 16)，页码 265-283，Savannah, GA，2016 年 11 月。USENIX 协会。ISBN 978-1-931971-33-1。URL https://www.usenix.org/conference/osdil6/technical-sessions/presentation/abadi。

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Oct 2016b. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318.

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., 和 Zhang, L. 带差分隐私的深度学习 (Deep learning with differential privacy). Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016年10月. doi: 10.1145/2976749.2978318. URL http://dx.doi.0rg/10.1145/2976749. 2978318.

Adiwardana, D., Luong, M., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., and Le, Q. V. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020. URL https://arxiv.org/abs/2001.09977.

Adiwardana, D., Luong, M., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kul sh res h th a, A., Nemade, G., Lu, Y., 和 Le, Q. V. 朝着类人开放域聊天机器人迈进。CoRR, abs/2001.09977, 2020。URL https://arxiv.0rg/abs/2001.09977。

Bender, E. M. and Friedman, B. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587-604, 2018. doi: 10.1162/tacl_a_00041. URL https://aclanthology.org/Q18-1041.

Bender, E. M. 和 Friedman, B. 自然语言处理的数据声明：迈向缓解系统偏差和促进更好的科学. 计算语言学协会会刊，6:587-604，2018. doi: 10.1162/tacl-a-00041. URL https://aclanthology.org/Q18-1041.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pp. 610-623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.

本德, E. M., 格布鲁, T., 麦克米伦-梅杰, A., 和斯米切尔, S. 论随机鹦鹉的危险：语言模型能否过大？见 2021 年 ACM 公平性、问责制和透明度会议论文集，FAccT '21，第 610–623 页，纽约，美国，2021。计算机械协会。ISBN 9781450383097。doi: 10.1145/ 3442188.3445922。URL https://doi.org/10. 1145/3442188.3445922。

Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533-1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160.

Berant, J., Chou, A., Frostig, R., 和 Liang, P. 从问题-答案对进行 Freebase 的语义解析。在 Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing，第 1533-1544 页，Seattle, Washington, USA，2013 年 10 月。Association for Computational Linguistics。URL https://aclanthology.org/D13-1160。

Bird, S. and Loper, E. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214-217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/P04-3031.

伯德，S. 和洛珀，E. NLTK: 自然语言工具包 (NLTK)。在 ACL 交互式海报和演示会议论文集，第 214-217 页，西班牙巴塞罗那，2004 年 7 月。计算语言学协会 (Association for Computational Linguistics). URL https://aclanthology.org/P04-3031.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., 和 Choi, Y. Piqa: 自然语言中的物理常识推理 (Reasoning about physical commonsense in natural language). 在第三十四届 AAAI 人工智能会议 (Thirty-Fourth AAAI Conference on Artificial Intelligence)，2020。

Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454-5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.

布洛杰特，S. L.，巴罗卡斯，S.，多梅 III，H.，和华莱士，H. 语言（技术）即权力：对 NLP 中“偏见”的批判性综述。在第 58 届计算语言学协会年会论文集，第 5454-5476 页，在线，2020 年 7 月。计算语言学协会。doi: 10.18653/v1/2020.acl-main.485。URL https://aclanthology.org/2020.acl-main.485。

Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1004-1015, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. URL https://aclanthology.org/2021.acl-long.81.

布洛杰特，S. L.，洛佩兹，G.，奥尔泰阿努，A.，西姆，R.，和华莱士，H. 刻板印象挪威三文鱼：公平性基准数据集中的陷阱清单。在第 59 届计算语言学协会年会和第 11 届国际自然语言处理联合会议（第 1 卷：长论文）的会议录中，第 1004-1015 页，在线，2021 年 8 月。计算语言学协会。doi: 10.18653/v1/2021.acl-long.81。URL https://aclanthology.org/2021.acl-long.81.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.

博卢克巴西，T.，张，K.-W.，邹，J. Y.，萨利格拉玛，V.，和卡拉伊，A. T. 男人之于计算机程序员如女人之于家庭主妇？消除词嵌入的偏差。在李，D.，杉山，M.，卢克斯堡，U.，盖永，I.，和加内特，R. (编)，《神经信息处理系统进展》，第 29 卷。Curran Associates, Inc., 2016。URL https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf。

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosse- lut, A., Brunskill, E., Bryn jol fs son, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doum- bouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kudi- tipudi, R., 和其他作者. 关于基础模型的机会和风险. CoRR, abs/2108.07258, 2021. URL https://arxiv.0rg/abs/2108.07258.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

布朗，T.，曼，B.，赖德，N.，苏比亚，M.，卡普兰，J. D.，达里瓦尔，P.，尼拉坎坦，A.，夏姆，P.，萨斯特里，G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., 和 Amodei, D. 语言模型是少样本学习者。在 Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., 和 Lin, H. (eds.), 神经信息处理系统进展，第 33 卷，页码 1877-1901。Curran Associates, Inc., 2020。URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf。

Caliskan, A., Bryson, J. J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183-186, Apr 2017. ISSN 1095-9203. doi: 10.1126/science.aal4230. URL http://dx.doi.org/10.1126/science.aal4230.

Caliskan, A., Bryson, J. J., 和 Narayanan, A. 从语言语料库中自动衍生的语义包含类似人类的偏见。Science, 356(6334):183-186, 2017 年 4 月。ISSN 1095-9203。doi: 10.1126/science.aal4230。URL http://dx.doi.org/10.1126/science.aa14230。

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert- Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D., Erlingsson, U., Oprea, A., and Raffel, C. Extracting training data from large language models. CoRR, abs/2012.07805, 2020.

卡林尼，N.，特拉默，F.，华莱士，E.，亚吉尔斯基，M.，赫伯特-沃斯，A.，李，K.，罗伯茨，A.，布朗，T. B.，宋，D.，厄林格松，U.，奥普雷亚，A.，和拉斐尔，C. 从大语言模型中提取训练数据。CoRR, abs/2012.07805, 2020.

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2174-2184, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1241. URL https://aclanthology.org/D18-1241.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2924-2936, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457v1, 2018.

Dagan, I., Glickman, O., and Magnini, B. The PASCAL recognising textual entailment challenge. In Quiñonero-Candela, J., Dagan, I., Magnini, B., and d'Alché-Buc, F. (eds.), Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pp. 177-190, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-33428-6.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978-2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285.

Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In International Conference on Machine Learning, pp. 933-941. PMLR, 2017.

de Marneffe, M.-C., Simons, M., and Tonhauser, J. The commitment bank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23(2):107-124, Jul. 2019. doi: 10.18148/sub/2019.v23i2.601. URL https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 2368-2378. Association for Computational Linguistics, 2019. doi:

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.

Fyodorov, Y., Winter, Y., and Francez, N. A natural logic inference system. In Inference in Computational Semantics, 2000.

Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. RealToxicityPrompts: Evaluating neural toxic degeneration in language models, 2020.

Gordon, A., Kozareva, Z., and Roemmele, M. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 394-398, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics. URL https://aclanthology.org/S12-1052.

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790-2799. PMLR, 09-15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M. X., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 103-112, 2019.

Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., and Denuyl, S. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5491-5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.487. URL https://aclanthology.org/2020.acl-main.487.

Jacobs, A. Z. and Wallach, H. Measurement and fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Mar 2021. doi: 10.1145/3442188.3445901. URL http://dx.doi.org/10.1145/3442188.3445901.

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association for Computational Linguistics.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252-262, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1023. URL https://aclanthology.org/N18-1023.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.

Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.

Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3577-3599, 2021.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.

Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785-794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.

Lamm, M., Palomaki, J., Alberti, C., Andor, D., Choi, E., Soares, L. B., and Collins, M. QED: A framework and dataset for explanations in question answering. CoRR, abs/2009.06354, 2020. URL https://arxiv.org/abs/2009.06354.

Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning, 2014.

Leidner, J. L. and Plachouras, V. Ethical by design: Ethics best practices for natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 30-40, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-1604. URL https://aclanthology.org/W17-1604.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.

Levesque, H., Davis, E., and Morgenstern, L. The Winograd schema challenge. In 13th International Conference on the Principles of Knowledge Representation and Reasoning, KR 2012, Proceedings of the International Conference on Knowledge Representation and Reasoning, pp. 552-561. Institute of Electrical and Electronics Engineers Inc., 2012. ISBN 9781577355601. 13th International Conference on the Principles of Knowledge Representation and Reasoning, KR 2012; Conference date: 10-06-2012 through 14-06-2012.

Li, T., Khashabi, D., Khot, T., Sabharwal, A., and Srikumar, V. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3475-3489, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.

Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 2021.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 622-628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URL https://aclanthology.org/N19-1063.

Mihaylov, T., Clark, P., Khot, T.,

Original paper: https://arxiv.org/pdf/2112.06905
