[论文翻译]BERT: 深度双向 Transformer 用于语言理解的预训练




BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: 深度双向 Transformer 用于语言理解的预训练

Jacob Devlin  Ming-Wei Chang  Kenton Lee  Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract

摘要

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

我们介绍了一种新的语言表示模型,称为 BERT (Bidirectional Encoder Representations from Transformers)。与最近的语言表示模型不同 (Peters et al., 2018a; Radford et al., 2018),BERT 被设计为在所有层中对左右上下文进行联合条件建模,从而从未标注的文本中预训练深度双向表示。因此,预训练的 BERT 模型只需添加一个额外的输出层即可进行微调,以创建多种任务(如问答和语言推理)的最先进模型,而无需进行大量的任务特定架构修改。

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to $80.5\%$ ($7.7\%$ absolute improvement), MultiNLI accuracy to $86.7\%$ ($4.6\%$ absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

BERT 在概念上简单且在实证上强大。它在十一个自然语言处理任务中取得了新的最先进成果,包括将 GLUE 分数提高到 80.5% (绝对提高了 7.7 个百分点),MultiNLI 准确率提高到 86.7% (绝对提高了 4.6 个百分点),SQuAD v1.1 问答测试 F1 分数提高到 93.2 (绝对提高了 1.5 分),以及 SQuAD v2.0 测试 F1 分数提高到 83.1 (绝对提高了 5.1 分)。

1 Introduction

1 引言

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

语言模型预训练已被证明对改进许多自然语言处理任务有效 (Dai 和 Le, 2015; Peters 等, 2018a; Radford 等, 2018; Howard 和 Ruder, 2018)。这些任务包括句子级别的任务,如自然语言推理 (Bowman 等, 2015; Williams 等, 2018) 和释义 (Dolan 和 Brockett, 2005),其目标是通过整体分析来预测句子之间的关系,还包括 Token 级别的任务,如命名实体识别和问答,要求模型在 Token 级别上生成细粒度的输出 (Tjong Kim Sang 和 De Meulder, 2003; Rajpurkar 等, 2016)。

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

将预训练语言表示应用于下游任务的现有两种策略是:基于特征的方法和微调方法。基于特征的方法,例如 ELMo (Peters et al., 2018a),使用特定于任务的架构,将预训练表示作为附加特征包含在内。微调方法,例如生成式预训练 Transformer (OpenAI GPT) (Radford et al., 2018),引入最少的任务特定参数,并通过简单地微调所有预训练参数来在下游任务上进行训练。这两种方法在预训练期间共享相同的目标函数,即使用单向语言模型来学习通用的语言表示。

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

我们认为当前的技术限制了预训练表示的能力,特别是对于微调方法。主要的限制是标准的语言模型是单向的,这限制了预训练期间可以使用的架构选择。例如,在 OpenAI GPT 中,作者使用了从左到右的架构,其中每个 Token 只能关注 Transformer (Vaswani et al., 2017) 自注意力层中的先前 Token。这种限制对于句子级别的任务来说是次优的,并且在将基于微调的方法应用于问答等 Token 级别的任务时可能会非常有害,因为在这些任务中结合双向上下文至关重要。

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:

在本文中,我们通过提出 BERT:双向编码器表示 (Bidirectional Encoder Representations from Transformers) 来改进基于微调的方法。BERT 通过使用“掩码语言模型”(masked language model, MLM) 预训练目标缓解了之前提到的单向性约束,这一方法受到完形填空 (Cloze) 任务 (Taylor, 1953) 的启发。掩码语言模型随机遮蔽输入中的某些 Token,其目标是仅根据上下文预测被遮蔽词的原始词汇 id。与从左到右的语言模型预训练不同,MLM 目标使表示能够融合左右上下文,从而允许我们预训练一个深度双向 Transformer。除了掩码语言模型外,我们还使用了一个“下一句预测”任务,该任务联合预训练文本对表示。本文的贡献如下:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• 我们展示了双向预训练对语言表示的重要性。不同于 Radford 等 (2018) 使用单向语言模型进行预训练,BERT 使用掩码语言模型来实现预训练的深度双向表示。这与 Peters 等 (2018a) 形成对比,后者使用独立训练的从左到右和从右到左语言模型的浅层拼接。

• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

• 我们证明了预训练表示减少了对许多重度工程化的任务特定架构的需求。BERT 是第一个在大量句子级和 Token 级任务上达到最先进性能的基于微调的表示模型,超越了众多任务特定架构。

• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/ google-research/bert.

• BERT 在十一项自然语言处理 (NLP) 任务上取得了最先进的成果。代码和预训练模型可在 https://github.com/google-research/bert 获取。

2 Related Work

2 相关工作

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section.

预训练通用语言表示有着悠久的历史,我们在本节中简要回顾最广泛使用的方法。

2.1 Unsupervised Feature-based Approaches

2.1 基于特征的无监督方法 (Unsupervised Feature-based Approaches)

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pretrain word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

学习广泛适用的词表示一直是几十年来活跃的研究领域,包括非神经网络方法 (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) 和神经网络方法 (Mikolov et al., 2013; Pennington et al., 2014)。预训练词嵌入是现代 NLP 系统的重要组成部分,相比从头学习的嵌入提供了显著改进 (Turian et al., 2010)。为了预训练词嵌入向量,使用了从左到右的语言模型目标 (Mnih and Hinton, 2009),以及用于区分左右上下文中正确和错误单词的目标 (Mikolov et al., 2013)。

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence (Kiros et al., 2015), or denoising autoencoder derived objectives (Hill et al., 2016).

这些方法已经被推广到更粗粒度的表示,例如句子嵌入 (Kiros 等, 2015; Logeswaran 和 Lee, 2018) 或段落嵌入 (Le 和 Mikolov, 2014)。为了训练句子表示,先前的工作使用了以下目标:对候选的下一个句子进行排序 (Jernite 等, 2017; Logeswaran 和 Lee, 2018),根据前一个句子的表示从左到右生成下一个句子的单词 (Kiros 等, 2015),或使用去噪自编码器衍生的目标 (Hill 等, 2016)。

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

ELMo 及其前身 (Peters et al., 2017, 2018a) 沿着不同的维度推广了传统词嵌入研究。它们从左到右和从右到左的语言模型中提取上下文敏感特征。每个 Token 的上下文表示是左到右和右到左表示的连接。在将上下文词嵌入与现有的任务特定架构集成时,ELMo 在多个主要自然语言处理基准测试中取得了最先进的成果 (Peters et al., 2018a),包括问答 (Rajpurkar et al., 2016)、情感分析 (Socher et al., 2013) 和命名实体识别 (Tjong Kim Sang 和 De Meulder, 2003)。Melamud 等人 (2016) 提出通过使用 LSTM 从左右上下文中预测单个词的任务来学习上下文表示。类似于 ELMo,他们的模型是基于特征的,并不是深度双向的。Fedus 等人 (2018) 表明完形填空任务可以用于提高文本生成模型的鲁棒性。

2.2 Unsupervised Fine-tuning Approaches

2.2 无监督微调方法

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008).

与基于特征的方法一样,这一方向的首批工作仅从无标签文本中预训练词嵌入参数 (Collobert and Weston, 2008)。

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

最近,生成上下文 Token 表示的句子或文档编码器已经在无标签文本上进行预训练,并针对有监督的下游任务进行微调 (Dai 和 Le, 2015; Howard 和 Ruder, 2018; Radford 等, 2018)。这些方法的优势在于只需要从头学习少量参数。至少部分由于这一优势,OpenAI GPT (Radford 等, 2018) 在 GLUE 基准测试 (Wang 等, 2018a) 的许多句子级任务上取得了当时最先进的结果。从左到右的语言建模和自编码器目标已被用于此类模型的预训练 (Howard 和 Ruder, 2018; Radford 等, 2018; Dai 和 Le, 2015)。


Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

图 1: BERT 的整体预训练和微调流程。除了输出层之外,预训练和微调使用相同的架构。相同的预训练模型参数用于初始化不同的下游任务模型。在微调期间,所有参数都会进行微调。[CLS] 是添加到每个输入示例前面的特殊符号,而 [SEP] 是特殊的分隔符 Token(例如,分隔问题/答案)。

2.3 Transfer Learning from Supervised Data

2.3 监督数据的迁移学习

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

也有多项研究表明,从大规模监督任务(如自然语言推理 (Conneau et al., 2017) 和机器翻译 (McCann et al., 2017))中进行有效的迁移是可行的。计算机视觉研究同样证明了从大规模预训练模型进行迁移学习的重要性,其中一种有效的方法是微调使用 ImageNet 预训练的模型 (Deng et al., 2009; Yosinski et al., 2014)。

3 BERT

3 BERT

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section.

我们在此部分介绍 BERT 及其详细实现。我们的框架包含两个步骤:预训练和微调。在预训练期间,模型在不同的预训练任务上使用无标签数据进行训练。对于微调,BERT 模型首先用预训练参数进行初始化,并使用来自下游任务的有标签数据对所有参数进行微调。每个下游任务都有单独的微调模型,即使它们是用相同的预训练参数初始化的。图 1 中的问题回答示例将作为本节的运行示例。

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

BERT 的一个显著特点是其在不同任务中采用统一的架构。预训练架构和最终下游任务架构之间的差异很小。

Model Architecture BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”

模型架构

BERT 的模型架构是一个多层双向 Transformer 编码器,基于 Vaswani 等人 (2017) 描述的原始实现,并在 tensor2tensor 库中发布。由于 Transformer 的使用已经变得普遍,且我们的实现与原始实现几乎相同,我们将省略对模型架构的详尽背景描述,并引导读者参阅 Vaswani 等人 (2017) 以及诸如 “The Annotated Transformer” 等优秀指南。

In this work, we denote the number of layers (i.e., Transformer blocks) as $L$, the hidden size as $H$, and the number of self-attention heads as $A$. We primarily report results on two model sizes: BERTBASE ($L=12$, $H=768$, $A=12$, Total Parameters $=110\mathrm{M}$) and BERTLARGE ($L=24$, $H=1024$, $A=16$, Total Parameters $=340\mathrm{M}$).

在本工作中,我们将层数(即 Transformer 块的数量)表示为 $L$,隐藏层大小表示为 $H$,自注意力头的数量表示为 $A$。我们主要报告两种模型尺寸的结果:BERTBASE($L=12$, $H=768$, $A=12$,总参数量 $=110\mathrm{M}$)和 BERTLARGE($L=24$, $H=1024$, $A=16$,总参数量 $=340\mathrm{M}$)。
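作为补充说明(非论文原文),可以按上述 $L$、$H$ 粗略估算两种模型的参数量;注意注意力头数 $A$ 不影响总参数量,因为每个头的维度是 $H/A$。下面是一个示意脚本,假设词表约 3 万、最大位置 512,并忽略 LayerNorm 等小项:

```python
# 根据 L/H 粗略估算 BERT 参数量的示意脚本(假设词表约 3 万、最大位置 512;
# 忽略 LayerNorm 等小项,结果与论文报告的 110M/340M 大致吻合)
def approx_bert_params(L, H, vocab=30000, max_pos=512, num_seg=2):
    embeddings = (vocab + max_pos + num_seg) * H    # token/position/segment 嵌入
    attention = 4 * (H * H + H)                     # Q、K、V 投影 + 输出投影(含偏置)
    ffn = 2 * (H * 4 * H) + 4 * H + H               # 前馈层,中间维度为 4H
    pooler = H * H + H                              # [CLS] 上的 pooler 层
    return embeddings + L * (attention + ffn) + pooler

print(f"BERT-BASE  ≈ {approx_bert_params(12, 768) / 1e6:.0f}M")
print(f"BERT-LARGE ≈ {approx_bert_params(24, 1024) / 1e6:.0f}M")
```

由于省略了若干小项,估算值与论文报告的 110M/340M 有约 1–2% 的出入,但足以说明参数量的主要来源。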

BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.4

为了比较目的,选择了与 OpenAI GPT 具有相同模型大小的 BERTBASE 。然而,至关重要的是,BERT Transformer 使用双向自注意力机制,而 GPT Transformer 使用受限的自注意力机制,其中每个 Token 只能关注其左侧的上下文。

Input/Output Representations To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., $\langle$Question, Answer$\rangle$) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

输入/输出表示

为了使 BERT 能够处理各种下游任务,我们的输入表示能够在一个 token 序列中无歧义地表示单个句子和句子对(例如 $\langle$问题, 答案$\rangle$)。在本文中,“句子”可以是任意连续的文本片段,而不仅仅是一个实际的语言学意义上的句子。“序列”指的是输入给 BERT 的 token 序列,它可以是一个单句,也可以是打包在一起的两个句子。

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as $E$, the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^{H}$, and the final hidden vector for the $i^{\mathrm{th}}$ input token as $T_{i} \in \mathbb{R}^{H}$.

我们使用 WordPiece 嵌入 (Wu et al., 2016),词汇量为 30,000 个 token。每个序列的第一个 token 总是一个特殊的分类 token ([CLS])。与该 token 对应的最终隐藏状态用于分类任务的聚合序列表示。句子对被打包成一个单一的序列。我们通过两种方式区分这些句子。首先,我们用一个特殊的 token ([SEP]) 将它们分开。其次,我们为每个 token 添加一个学习到的嵌入,以指示它属于句子 A 还是句子 B。如图 1 所示,我们将输入嵌入表示为 $E$,特殊 [CLS] token 的最终隐藏向量表示为 $C \in \mathbb{R}^{H}$,第 $i$ 个输入 token 的最终隐藏向量表示为 $T_{i} \in \mathbb{R}^{H}$。
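上述打包规则([CLS] 开头、[SEP] 分隔、用 segment 标记区分句子 A/B)可以用一小段示意代码演示(非论文代码,函数名 pack_sequence 为本文虚构,并假设文本已切分为 WordPiece token):

```python
# 将单句或句子对打包成 BERT 输入序列的示意实现
def pack_sequence(tokens_a, tokens_b=None):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                # 句子 A(含 [CLS] 和第一个 [SEP])记为 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)   # 句子 B(含第二个 [SEP])记为 1
    return tokens, segment_ids

tokens, segment_ids = pack_sequence(["my", "dog", "is", "cute"],
                                    ["he", "likes", "play", "##ing"])
print(tokens)        # ['[CLS]', 'my', ..., '[SEP]', 'he', ..., '[SEP]']
print(segment_ids)   # 前 6 个为 0,后 5 个为 1
```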


For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

对于给定的 token,其输入表示是通过将相应的 token、segment 和 position embeddings 相加构建的。这种构建的可视化可以在图 2 中看到。
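这一逐元素相加的过程可以用如下玩具代码示意(嵌入表、维度均为演示用的假设值,随机初始化,并非真实的 BERT 嵌入):

```python
import random

# 输入表示 = token 嵌入 + segment 嵌入 + position 嵌入,三者逐元素相加(玩具维度示意)
H = 8                                       # 演示用的小隐藏维度
random.seed(0)
rand_vec = lambda: [random.gauss(0, 0.02) for _ in range(H)]
token_emb    = {tid: rand_vec() for tid in range(100)}   # 假设的 100 词小词表
segment_emb  = {sid: rand_vec() for sid in range(2)}     # 句子 A/B 两个 segment
position_emb = {pos: rand_vec() for pos in range(16)}    # 假设最大序列长度 16

token_ids, segment_ids = [5, 17, 3, 42], [0, 0, 1, 1]
input_repr = [
    [t + s + p for t, s, p in zip(token_emb[tid], segment_emb[sid], position_emb[pos])]
    for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))
]
print(len(input_repr), len(input_repr[0]))   # 4 个位置 × H 维
```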

3.1 Pre-training BERT

3.1 预训练 BERT

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

与 Peters 等 (2018a) 和 Radford 等 (2018) 不同,我们不使用传统的从左到右或从右到左的语言模型来预训练 BERT。相反,我们使用两个无监督任务来预训练 BERT,这些任务在本节中描述。此步骤如图 1 左侧所示。

Task #1: Masked LM Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

任务 #1: 掩码语言模型 (Masked LM)

直观上,我们有理由相信,深度双向模型严格强于从左到右的模型,也强于从左到右与从右到左模型的浅层拼接。不幸的是,标准的条件语言模型只能从左到右或从右到左训练,因为双向条件会让每个词间接地“看到自己”,模型便可以在多层上下文中轻易预测目标词。

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask $15\%$ of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

为了训练一个深度双向表示,我们简单地随机屏蔽一部分输入的 Token,然后预测这些被屏蔽的 Token。我们将这个过程称为“屏蔽语言模型”(masked LM,MLM),尽管在文献中它通常被称为完形填空任务 (Cloze task) [Taylor, 1953]。在这种情况下,与屏蔽 Token 对应的最终隐藏向量会被输入到词汇表上的输出 softmax 中,就像标准的语言模型一样。在我们所有的实验中,我们随机屏蔽每个序列中所有 WordPiece Token 的 15%。与去噪自编码器 (Vincent et al., 2008) 不同,我们只预测被屏蔽的词,而不是重建整个输入。

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses $15\%$ of the token positions at random for prediction. If the $i$-th token is chosen, we replace the $i$-th token with (1) the [MASK] token $80\%$ of the time (2) a random token $10\%$ of the time (3) the unchanged $i$-th token $10\%$ of the time. Then, $T_{i}$ will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.

虽然这使我们能够获得一个双向预训练模型,但缺点是在预训练和微调之间产生了不匹配,因为 [MASK] token 在微调时不会出现。为了解决这个问题,我们并不总是用实际的 [MASK] token 替换“被遮掩”的词。训练数据生成器随机选择 15% 的 token 位置进行预测。如果选择了第 $i$ 个 token,则我们以如下方式替换第 $i$ 个 token:(1) 80% 的时间用 [MASK] token 替换;(2) 10% 的时间用一个随机 token 替换;(3) 10% 的时间保持第 $i$ 个 token 不变。然后,使用 $T_{i}$ 来通过交叉熵损失预测原始 token。我们在附录 C.2 中比较了此过程的不同变体。
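上述 80%/10%/10% 的替换策略可以示意如下(mask_for_mlm 及其参数为本文虚构的演示实现;真实实现会在整个序列上精确采样 15% 的位置,这里用逐位置的伯努利采样近似):

```python
import random

# 80/10/10 遮蔽策略的示意实现(vocab 与 rng 为演示用的假设参数)
def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    rng = rng or random.Random(42)
    tokens = list(tokens)
    targets = {}                               # 位置 -> 原始 token,作为预测目标
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            tokens[i] = "[MASK]"               # 80%:替换为 [MASK]
        elif r < 0.9:
            tokens[i] = rng.choice(vocab)      # 10%:替换为随机 token
        # 剩余 10%:保持原 token 不变
    return tokens, targets

vocab = ["my", "dog", "is", "cute", "hairy", "he", "likes"]
masked, targets = mask_for_mlm(["[CLS]"] + vocab * 3 + ["[SEP]"], vocab)
print(sum(tok == "[MASK]" for tok in masked), "个 [MASK],", len(targets), "个预测目标")
```

注意损失只在 targets 记录的位置上计算,这正是与去噪自编码器“重建整个输入”的区别所在。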

Task #2: Next Sentence Prediction (NSP) Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, $50\%$ of the time B is the actual next sentence that follows A (labeled as IsNext), and $50\%$ of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, $C$ is used for next sentence prediction (NSP). Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.

任务 #2: 下一句预测 (Next Sentence Prediction, NSP) 许多重要的下游任务,如问答 (Question Answering, QA) 和自然语言推理 (Natural Language Inference, NLI),都依赖于理解两个句子之间的关系,而这并不能由语言建模直接捕捉。为了训练一个能够理解句子关系的模型,我们对二元化的下一句预测任务进行预训练,该任务可以从任何单语语料库中轻松生成。具体来说,在为每个预训练示例选择句子 A 和 B 时,50% 的时间 B 是实际跟在 A 后面的下一句(标记为 IsNext),而 50% 的时间它是语料库中的一个随机句子(标记为 NotNext)。如图 1 所示,$C$ 用于下一句预测 (NSP)。尽管该任务十分简单,但我们在第 5.1 节中证明了针对此任务的预训练对 QA 和 NLI 都非常有益。
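NSP 训练对的构造可以示意如下(make_nsp_example 与玩具语料均为本文虚构的演示;示意实现中 NotNext 分支也可能偶然抽到真实的下一句,实际实现通常会从其他文档中抽取):

```python
import random

# NSP 训练样本生成示意:corpus 为"文档 -> 句子列表"的玩具语料
def make_nsp_example(doc, idx, corpus, rng):
    sent_a = doc[idx]
    if rng.random() < 0.5 and idx + 1 < len(doc):
        return sent_a, doc[idx + 1], "IsNext"    # 50%:真实的下一句
    other = rng.choice(corpus)                   # 50%:语料库中的随机句子
    return sent_a, rng.choice(other), "NotNext"

corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
rng = random.Random(0)
for _ in range(3):
    a, b, label = make_nsp_example(corpus[0], 0, corpus, rng)
    print(label, "->", b)
```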



Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.


图 2: BERT 输入表示。输入嵌入是 Token 嵌入、分段嵌入和位置嵌入的和。

The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, whereas BERT transfers all parameters to initialize end-task model parameters.

NSP 任务与 Jernite 等人 (2017) 和 Logeswaran 和 Lee (2018) 中使用的表示学习目标密切相关。然而,在先前的工作中,只有句子嵌入被迁移到下游任务,而 BERT 则迁移所有参数以初始化最终任务模型参数。

Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

预训练数据
预训练过程在很大程度上遵循了现有文献中关于语言模型预训练的做法。对于预训练语料库,我们使用了 Books Corpus (800M 词) (Zhu et al., 2015) 和英文维基百科 (2,500M 词)。对于维基百科,我们仅提取文本段落,忽略列表、表格和标题。为了提取长连续序列,使用文档级别的语料库而不是像 Billion Word Benchmark (Chelba et al., 2013) 这样的打乱句子级别的语料库是至关重要的。

3.2 Fine-tuning BERT

3.2 微调 BERT

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks—whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

微调非常直接,因为 Transformer 中的自注意力机制允许 BERT 通过替换适当的输入和输出来建模许多下游任务——无论是涉及单个文本还是文本对的任务。对于涉及文本对的应用,常见的模式是在应用双向交叉注意力之前独立编码文本对,例如 Parikh 等人 (2016);Seo 等人 (2017)。相反,BERT 使用自注意力机制统一这两个阶段,因为使用自注意力编码连接的文本对实际上包含了两个句子之间的双向交叉注意力。

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-$\varnothing$ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

对于每个任务,我们只需将任务特定的输入和输出接入 BERT,并对所有参数进行端到端微调。在输入端,预训练中的句子 A 和句子 B 类似于 (1) 释义任务中的句子对,(2) 蕴含任务中的假设-前提对,(3) 问答任务中的问题-篇章对,以及 (4) 文本分类或序列标注中的退化的文本-$\varnothing$ 对。在输出端,Token 表示被输入到输出层以处理 Token 级别的任务,例如序列标注或问答;而 [CLS] 表示则被输入到输出层以处理分类任务,例如蕴含或情感分析。

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. We describe the task-specific details in the corresponding subsections of Section 4. More details can be found in Appendix A.5.

相比于预训练,微调的代价相对较低。本文中的所有结果最多可以在单个 Cloud TPU 上于 1 小时内复现,或者在 GPU 上花费几小时,前提是使用完全相同的预训练模型。我们在第 4 节的相应小节中描述了特定任务的详细信息。更多详情请参见附录 A.5。

4 Experiments

4 实验

In this section, we present BERT fine-tuning results on 11 NLP tasks.

在本节中,我们展示了 BERT 在 11 个 NLP 任务上的微调结果。

4.1 GLUE

4.1 GLUE


The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1.

通用语言理解评估 (GLUE) 基准 (Wang et al., 2018a) 是一组多样化的自然语言理解任务。GLUE 数据集的详细描述包含在附录 B.1 中。

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector $C \in \mathbb{R}^{H}$ corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights $W \in \mathbb{R}^{K\times H}$, where $K$ is the number of labels. We compute a standard classification loss with $C$ and $W$, i.e., $\log(\operatorname{softmax}(C W^{T}))$.

在 GLUE 上进行微调时,我们将输入序列(对于单个句子或句子对)表示为第 3 节中所述的形式,并使用与第一个输入 Token ([CLS]) 对应的最终隐藏向量 $C \in \mathbb{R}^{H}$ 作为聚合表示。微调期间引入的唯一新参数是分类层权重 $W \in \mathbb{R}^{K\times H}$,其中 $K$ 是标签的数量。我们使用 $C$ 和 $W$ 计算标准分类损失,即 $\log(\operatorname{softmax}(C W^{T}))$。
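该分类损失可以用如下玩具代码示意(classification_loss 为本文虚构的演示函数,采用数值稳定的 log-sum-exp;维度取演示用的小值,实际中 $H$ 为 768/1024,$K$ 为任务的标签数):

```python
import math
import random

# 用 [CLS] 向量 C 和分类权重 W ∈ R^{K×H} 计算 log(softmax(C W^T)) 损失的示意
def classification_loss(C, W, label):
    logits = [sum(w * c for w, c in zip(row, C)) for row in W]   # C W^T,K 维 logits
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))   # 数值稳定的 log-sum-exp
    return log_z - logits[label]                                 # 负对数似然

random.seed(1)
H, K = 6, 3
C = [random.gauss(0, 1) for _ in range(H)]
W = [[random.gauss(0, 0.02) for _ in range(H)] for _ in range(K)]
print(round(classification_loss(C, W, label=2), 4))   # 权重接近 0 时,损失接近 ln K
```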

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

表 1: GLUE 测试结果,由评估服务器评分 (https://gluebenchmark.com/leaderboard)。每个任务下方的数字表示训练样本的数量。“平均”列与官方 GLUE 得分略有不同,因为我们排除了有问题的 WNLI 集。BERT 和 OpenAI GPT 是单模型、单任务。QQP 和 MRPC 报告 F1 分数,STS-B 报告 Spearman 相关性,其他任务报告准确率分数。我们排除了使用 BERT 作为其组件之一的条目。

| 系统 | MNLI-(m/mm) 392k | QQP 363k | QNLI 108k | SST-2 67k | CoLA 8.5k | STS-B 5.7k | MRPC 3.5k | RTE 2.5k | 平均 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.

我们使用批次大小为 32,并对所有 GLUE 任务的数据进行 3 轮微调。对于每个任务,我们在开发集上选择了最佳的微调学习率(在 5e-5、4e-5、3e-5 和 2e-5 中选择)。此外,对于 BERTLARGE,我们发现其在小数据集上有时微调不稳定,因此我们进行了多次随机重启,并在开发集上选择了最佳模型。在随机重启中,我们使用相同的预训练检查点,但执行不同的微调数据洗牌和分类器层初始化。

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining $4.5\%$ and $7.0\%$ respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a $4.6\%$ absolute accuracy improvement. On the official GLUE leaderboard, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

结果如表 1 所示。BERTBASE 和 BERTLARGE 在所有任务上均大幅优于所有系统,分别比之前的最先进水平提高了 4.5% 和 7.0% 的平均准确率。请注意,除了注意力掩码外,BERTBASE 和 OpenAI GPT 在模型架构上几乎相同。对于最大且最广泛报道的 GLUE 任务 MNLI,BERT 获得了 4.6% 的绝对准确率提升。在官方 GLUE 排行榜上,BERTLARGE 获得了 80.5 分,而 OpenAI GPT 在撰写本文时获得了 72.8 分。


We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

我们发现 BERTLARGE 在所有任务上显著优于 BERTBASE,尤其是在训练数据非常少的任务上。模型大小的效果在第 5.2 节中进行了更深入的探讨。

4.2 SQuAD v1.1

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

斯坦福问答数据集 (SQuAD v1.1) 是一个包含 10 万个众包问题/答案对的集合 (Rajpurkar et al., 2016)。给定一个问题以及一段包含答案的维基百科文章,任务是预测答案在文中的文本片段。

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$ during fine-tuning. The probability of word $i$ being the start of the answer span is computed as a dot product between $T_{i}$ and $S$ followed by a softmax over all of the words in the paragraph: $P_{i}=\frac{e^{S\cdot T_{i}}}{\sum_{j}e^{S\cdot T_{j}}}$. The analogous formula is used for the end of the answer span. The score of a candidate span from position $i$ to position $j$ is defined as $S\cdot T_{i}+E\cdot T_{j}$, and the maximum scoring span where $j \ge i$ is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

如图 1 所示,在问答任务中,我们将输入的问题和段落表示为一个打包的序列,问题使用 A 嵌入,段落使用 B 嵌入。在微调期间,我们仅引入一个起始向量 $S \in \mathbb{R}^{H}$ 和一个结束向量 $E \in \mathbb{R}^{H}$。单词 $i$ 是答案跨度起始位置的概率通过计算 $T_{i}$ 和 $S$ 的点积,然后对段落中的所有单词进行 softmax 计算得出:$P_{i}=\frac{e^{S\cdot T_{i}}}{\sum_{j}e^{S\cdot T_{j}}}$。类似的公式用于答案跨度的结束位置。从位置 $i$ 到位置 $j$ 的候选跨度得分定义为 $S\cdot T_{i}+E\cdot T_{j}$,并且预测使用满足 $j \ge i$ 的得分最高的跨度。训练目标是正确起始和结束位置的对数似然之和。我们以 5e-5 的学习率和 32 的批量大小微调 3 个 epoch。
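作为示意,上述起止位置评分可以用几行 numpy 勾勒出来 (函数与变量命名为本文自拟,仅作说明;`T` 对应最后一层的 Token 表示矩阵,`S`、`E` 为微调引入的起止向量):

```python
import numpy as np

def start_probabilities(T, S):
    """P_i = softmax over all paragraph positions of S . T_i."""
    logits = T @ S
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def best_span(T, S, E):
    """Return the (i, j) with j >= i maximizing S.T_i + E.T_j."""
    start_scores = T @ S  # S . T_i for every position i
    end_scores = T @ E    # E . T_j for every position j
    n = T.shape[0]
    best_score, best_ij = -np.inf, (0, 0)
    for i in range(n):
        for j in range(i, n):  # enforce j >= i
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best_ij = score, (i, j)
    return best_ij
```

其中 `best_span` 以 $O(n^2)$ 枚举所有满足 $j \ge i$ 的跨度;实际实现通常还会限制最大答案长度。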

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available,11 and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

表 2 显示了排行榜上的顶级条目以及顶级已发布系统的结果 (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018)。SQuAD 排行榜上的顶级结果没有最新的公开系统描述可用,并且在训练系统时允许使用任何公开数据。因此,我们在系统中使用适度的数据增强:先在 TriviaQA (Joshi et al., 2017) 上微调,再在 SQuAD 上微调。

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-

我们表现最好的系统在集成时比排行榜第一的系统高出 +1.5 F1,作为单一系统时高出 +1.3 F1。实际上,我们的单个 BERT 模型在 F1 分数上超过了最优集成系统。没有 TriviaQA 微调

| 系统 | Dev EM | Dev F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| 排行榜顶级系统 (2018年12月10日) | | | | |
| 人类 | - | - | 82.3 | 91.2 |
| #1 组合 - nlnet | - | - | 86.0 | 91.7 |
| #2 组合 - QANet | - | - | 84.5 | 90.5 |
| 已发布 | | | | |
| BiDAF+ELMo (单模型) | - | 85.6 | - | 85.8 |
| R.M. Reader (组合) | 81.2 | 87.9 | 82.3 | 88.5 |
| 我们的 | | | | |
| BERTBASE (单模型) | 80.8 | 88.5 | - | - |
| BERTLARGE (单模型) | 84.1 | 90.9 | - | - |
| BERTLARGE (组合) | 85.8 | 91.8 | - | - |
| BERTLARGE (单模型+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8 |
| BERTLARGE (组合+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2 |

Table 2: SQuAD 1.1 results. The BERT ensemble consists of 7 systems which use different pre-training checkpoints and fine-tuning seeds.

表 2: SQuAD 1.1 结果。BERT 集成由 7 个使用不同预训练检查点和微调种子的系统组成。

| System | Dev EM | Dev F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| Top Leaderboard Systems (Dec 10th, 2018) | | | | |
| Human | 86.3 | 89.0 | 86.9 | 89.5 |
| #1 Single - MIR-MRC (F-Net) | - | - | 74.8 | 78.0 |
| #2 Single - nlnet | - | - | 74.2 | 77.1 |
| Published | | | | |
| unet (Ensemble) | - | - | 71.4 | 74.9 |
| SLQA+ (Single) | - | - | 71.4 | 74.4 |
| Ours | | | | |
| BERTLARGE (Single) | 78.7 | 81.9 | 80.0 | 83.1 |

Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.

表 3: SQuAD 2.0 结果。我们排除了使用 BERT 作为其组件之一的条目。

tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.12

调优数据时,我们仅损失了 0.1-0.4 F1,仍然大幅优于所有现有系统。12


4.3 SQuAD v2.0

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

SQuAD 2.0 任务通过允许提供的段落中可能不存在简短答案,扩展了 SQuAD 1.1 的问题定义,使问题更加现实。

We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span $s_{\mathrm{null}}=S\cdot C+E\cdot C$ to the score of the best non-null span $\hat{s}_{i,j}=\max_{j\ge i} S\cdot T_{i}+E\cdot T_{j}$ . We predict a non-null answer when $\hat{s}_{i,j} > s_{\mathrm{null}} + \tau$ , where the threshold $\tau$ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.

我们使用一种简单的方法来扩展 SQuAD v1.1 BERT 模型以完成此任务。我们将没有答案的问题视为具有起始和结束于 [CLS] Token 的答案片段。起始和结束答案片段位置的概率空间被扩展为包括 [CLS] Token 的位置。在预测时,我们将无答案片段的得分 $s_{\mathrm{null}}=S\cdot C+E\cdot C$ 与最佳非空片段的得分 $\hat{s}_{i,j}=\max_{j\ge i} S\cdot T_{i}+E\cdot T_{j}$ 进行比较。当 $\hat{s}_{i,j} > s_{\mathrm{null}} + \tau$ 时,我们预测一个非空答案,其中阈值 $\tau$ 在开发集上选择以最大化 F1 分数。我们没有使用 TriviaQA 数据来训练这个模型。我们进行了 2 个 epoch 的微调,学习率为 5e-5,批量大小为 48。
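下面给出这一无答案判定逻辑的最小 numpy 示意 (假设 `T[0]` 即 [CLS] 的表示 $C$;命名为本文自拟):

```python
import numpy as np

def predict_span_or_null(T, S, E, tau=0.0):
    """SQuAD 2.0-style decision sketch.

    Assumes T[0] is the [CLS] representation C, so the no-answer
    score is s_null = S.C + E.C. Returns the best non-null span
    (i, j) when its score beats s_null + tau, else None.
    """
    C = T[0]
    s_null = S @ C + E @ C
    start, end = T @ S, T @ E
    best_score, best_ij = -np.inf, None
    n = T.shape[0]
    for i in range(1, n):          # skip the [CLS] position
        for j in range(i, n):      # enforce j >= i
            score = start[i] + end[j]
            if score > best_score:
                best_score, best_ij = score, (i, j)
    return best_ij if best_score > s_null + tau else None
```

阈值 `tau` 对应正文中在开发集上为最大化 F1 而选取的 $\tau$。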

Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper.

表 4: SWAG Dev 和 Test 准确率。†人类表现是通过 100 个样本测量的,如 SWAG 论文 [20] 所报告。

| 系统 | Dev | Test |
| --- | --- | --- |
| ESIM+GloVe | 51.9 | 52.7 |
| ESIM+ELMo | 59.1 | 59.2 |
| OpenAI GPT | - | 78.0 |
| BERTBASE | 81.6 | - |
| BERTLARGE | 86.6 | 86.3 |
| 人类 (专家)† | - | 85.0 |
| 人类 (5 个标注)† | - | 88.0 |

The results compared to prior leaderboard entries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, excluding systems that use BERT as one of their components. We observe a $+5.1$ F1 improvement over the previous best system.

与之前排行榜条目和顶级已发表工作 (Sun et al., 2018; Wang et al., 2018b) 的对比结果如表 3 所示,其中排除了使用 BERT 作为组件之一的系统。我们观察到相比之前的最佳系统 F1 提升了 +5.1。

4.4 SWAG

4.4 SWAG


The Situations With Adversarial Generations (SWAG) dataset contains $113\mathbf{k}$ sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018). Given a sentence, the task is to choose the most plausible continuation among four choices.

对抗性生成情况 (SWAG) 数据集包含 $113\mathbf{k}$ 个句子对补全示例,用于评估基于常识的推理 (Zellers et al., 2018)。给定一个句子,任务是从四个选项中选择最合理的续句。

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation $C$ denotes a score for each choice, which is normalized with a softmax layer.

在 SWAG 数据集上进行微调时,我们构建四个输入序列,每个序列由给定句子 (句子 A) 和一个可能的延续 (句子 B) 连接而成。引入的唯一任务特定参数是一个向量,它与 [CLS] Token 表示 $C$ 的点积作为每个选项的得分,并通过 softmax 层进行归一化。
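该打分方式可以用如下 numpy 片段示意 (假设 `C` 的每一行是一个候选序列的 [CLS] 表示,`v` 为那个唯一的任务特定向量;命名为本文自拟):

```python
import numpy as np

def choice_probabilities(C, v):
    """C: (4, H) [CLS] vectors, one per (sentence A, continuation B)
    packed sequence; v: the single task-specific vector.
    The score of choice k is v . C_k, normalized with a softmax
    over the four choices."""
    scores = C @ v
    exp = np.exp(scores - scores.max())  # stable softmax
    return exp / exp.sum()
```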

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by $+27.1%$ and OpenAI GPT by $8.3%$ .

我们对模型进行 3 个 epoch 的微调,学习率为 2e-5,批量大小为 16。结果如表 4 所示。BERTLARGE 比作者的基线 ESIM+ELMo 系统高出 27.1%,比 OpenAI GPT 高出 8.3%。


5 Ablation Studies

5 消融研究

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional

在本节中,我们对 BERT 的多个方面进行消融实验,以更好地理解它们的相对重要性。额外的


| 任务 (Dev Set) | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1) |
| --- | --- | --- | --- | --- | --- |
| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

表 5: 使用 BERTBASE 架构对预训练任务的消融研究。“No NSP” 表示在没有下一句预测任务的情况下进行训练。“LTR & No NSP” 表示像 OpenAI GPT 一样,作为从左到右的语言模型 (LM) 进行训练,且没有下一句预测任务。“+ BiLSTM” 表示在微调期间,在 “LTR & No NSP” 模型之上添加一个随机初始化的双向 LSTM (BiLSTM)。


ablation studies can be found in Appendix C.

消融研究详见附录 C。

5.1 Effect of Pre-training Tasks

5.1 预训练任务的影响


We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTBASE:

我们通过使用与 BERTBASE 完全相同的预训练数据、微调方案和超参数,评估两个预训练目标,以此证明 BERT 的深度双向性的重要性:

No NSP: A bidirectional model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

无 NSP: 一个使用“掩码语言模型” (MLM) 进行训练的双向模型,但不包括“下一句预测” (NSP) 任务。

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.

LTR & 无 NSP: 一个仅使用左侧上下文的模型,使用标准的从左到右 (LTR) 语言模型而非 MLM 进行训练。仅左侧上下文的约束同样应用于微调阶段,因为移除它会引入预训练/微调不匹配,从而降低下游任务的表现。此外,该模型在预训练时没有使用 NSP 任务。这与 OpenAI GPT 直接可比,但使用了我们更大的训练数据集、我们的输入表示以及我们的微调方案。

We first examine the impact brought by the NSP task. In Table 5, we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by comparing “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

我们首先考察 NSP 任务带来的影响。在表 5 中,我们显示移除 NSP 对 QNLI、MNLI 和 SQuAD 1.1 的性能有显著负面影响。接下来,我们通过比较“无 NSP”和“LTR & 无 NSP”来评估训练双向表示的影响。LTR 模型在所有任务上的表现都比 MLM 模型差,在 MRPC 和 SQuAD 上有较大下降。

For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no rightside context. In order to make a good faith attempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pretrained bidirectional models. The BiLSTM hurts performance on the GLUE tasks.

对于 SQuAD,直观上很明显,LTR 模型在 Token 预测上表现不佳,因为 Token 级别的隐藏状态没有右侧上下文。为了诚实地尝试加强 LTR 系统,我们在其上添加了一个随机初始化的 BiLSTM。这确实显著提高了 SQuAD 上的结果,但结果仍然远不如预训练的双向模型。BiLSTM 在 GLUE 任务上损害了性能。

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) this is strictly less powerful than a deep bidirectional model, since the deep bidirectional model can use both left and right context at every layer.

我们认识到也可以像 ELMo 那样,训练单独的 LTR 和 RTL 模型,并将每个 Token 表示为两个模型输出的连接。然而:(a) 这比单个双向模型的开销高一倍;(b) 对于问答这样的任务,这并不直观,因为 RTL 模型无法以问题为条件来预测答案;(c) 这严格弱于深度双向模型,因为深度双向模型可以在每一层同时使用左右上下文。

5.2 Effect of Model Size

5.2 模型大小的影响


In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.

在本节中,我们探讨了模型大小对微调任务准确性的影响。我们训练了多个 BERT 模型,这些模型具有不同数量的层、隐藏单元和注意力头,而在其他方面使用了与之前描述相同的超参数和训练过程。

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.

选定的 GLUE 任务的结果如表 6 所示。在该表中,我们报告了 5 次随机重启微调的平均开发集准确率。我们可以看到,较大的模型在所有四个数据集上都带来严格的准确率提升,即使对于只有 3,600 个标注训练样本、且与预训练任务差异显著的 MRPC 也是如此。也许令人惊讶的是,我们能在相对于现有文献已经相当大的模型之上取得如此显著的改进。例如,Vaswani 等人 (2017) 探索的最大 Transformer 为 (L=6, H=1024, A=16),编码器包含 1 亿参数;我们在文献中找到的最大 Transformer 为 (L=64, H=512, A=2),包含 2.35 亿参数 (Al-Rfou 等人, 2018)。相比之下,BERTBASE 包含 1.1 亿参数,BERTLARGE 包含 3.4 亿参数。

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained. Peters et al. (2018b) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and Melamud et al. (2016) mentioned in passing that increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements. Both of these prior works used a featurebased approach — we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the taskspecific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.

长期以来,人们已经知道增加模型大小会在大规模任务(如机器翻译和语言建模)上带来持续的改进,这在表 6 所示的保留训练数据的 LM 困惑度中得到了证明。然而,我们认为这是第一个令人信服地证明扩展到极端模型大小也会在非常小规模的任务上带来显著改进的工作,前提是模型已经进行了充分的预训练。Peters 等 (2018b) 在下游任务的影响方面展示了混合结果,即从两层增加到四层的预训练双向语言模型 (bi-LM) 的效果不一;Melamud 等 (2016) 顺便提到,将隐藏维度大小从 200 增加到 600 有帮助,但进一步增加到 1,000 并未带来额外的改进。这两项先前的工作都使用了基于特征的方法——我们假设当模型直接在下游任务上进行微调并仅使用少量随机初始化的附加参数时,即使下游任务数据非常少,任务特定模型也可以从更大、更具表现力的预训练表示中受益。

5.3 Feature-based Approach with BERT

5.3 基于特征的方法与 BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages. First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation.

所有目前展示的 BERT 结果都使用了微调方法,其中在预训练模型上添加了一个简单的分类层,并且所有参数都在下游任务上联合微调。然而,基于特征的方法(即从预训练模型中提取固定特征)具有一些优势。首先,并非所有任务都可以轻松地用 Transformer 编码器架构表示,因此需要添加特定于任务的模型架构。其次,预先计算训练数据的昂贵表示一次,然后在此表示之上运行许多使用更便宜模型的实验,可以带来显著的计算效益。

In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF

在本节中,我们通过将 BERT 应用于 CoNLL-2003 命名实体识别 (NER) 任务 (Tjong Kim Sang 和 De Meulder, 2003) 来比较这两种方法。在 BERT 的输入中,我们使用保留大小写的 WordPiece 模型,并包含数据提供的最大文档上下文。按照标准做法,我们将此任务表述为标注任务,但不使用 CRF

| #L | #H | #A | LM (ppl) | Dev Set Accuracy |
| --- | --- | --- | --- | --- |
| 3 | 768 | 12 | | |
| 6 | 768 | 3 | | |
| 6 | 768 | 12 | | |
| 12 | 768 | 12 | | |
| 12 | 1024 | 16 | | |
| 24 | 1024 | 16 | | |

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.

表 6: BERT 模型规模的消融实验。#L = 层数;#H = 隐藏层大小;#A = 注意力头数量。“LM (ppl)” 是保留训练数据上的掩码 LM (masked LM) 困惑度。

Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

表 7: CoNLL-2003 命名实体识别结果。超参数是使用开发集选择的。报告的开发集和测试集分数是在使用这些超参数的情况下,经过 5 次随机重启后的平均值。

| 系统 | 开发集 F1 | 测试集 F1 |
| --- | --- | --- |
| ELMo (Peters et al., 2018a) | 95.7 | 92.2 |
| CVT (Clark et al., 2018) | - | 92.6 |
| CSE (Akbik et al., 2018) | - | 93.1 |
| 微调方法 | | |
| BERTLARGE | 96.6 | 92.8 |
| BERTBASE | 96.4 | 92.4 |
| 基于特征的方法 (BERTBASE) | | |
| 嵌入 (Embeddings) | 91.0 | - |
| 倒数第二隐藏层 | 95.6 | - |
| 最后隐藏层 | 94.9 | - |
| 最后四层加权和 | 95.9 | - |
| 最后四层连接 | 96.1 | - |
| 所有 12 层加权和 | 95.5 | - |

layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

层。我们使用第一个子 Token 的表示作为输入,传递给 NER 标签集上的 Token 级分类器。

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

为了对微调方法进行消融,我们采用基于特征的方法:从一个或多个层提取激活值,而不微调 BERT 的任何参数。这些上下文嵌入被用作分类层之前一个随机初始化的两层 768 维 BiLSTM 的输入。
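取层激活作为固定特征的做法 (例如表现最佳的 “最后四层连接”) 可以示意如下 (假设 `all_layers` 保存了冻结编码器所有层的激活;命名为本文自拟):

```python
import numpy as np

def concat_last_four_layers(all_layers):
    """Feature-based sketch: concatenate each token's representations
    from the top four hidden layers of the frozen pre-trained encoder.

    all_layers: array of shape (num_layers, seq_len, H).
    Returns (seq_len, 4*H) features, ready to feed a randomly
    initialized BiLSTM + classifier (not shown here)."""
    top4 = all_layers[-4:]                      # last four layers
    return np.concatenate(list(top4), axis=-1)  # (seq_len, 4*H)
```

由于编码器被冻结,这些特征可以对整个训练集预先计算一次,之后在其上反复训练更小的下游模型。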

Results are presented in Table 7. BERTLARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.

结果如表 7 所示。BERTLARGE 的表现与最先进方法具有竞争力。表现最好的方法是连接预训练 Transformer 最上面四层的 Token 表示,这仅比微调整个模型落后 0.3 F1 。这表明 BERT 对于微调和基于特征的方法都有效。

6 Conclusion

6 结论

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

近期由于使用语言模型进行迁移学习所取得的经验性进展表明,丰富的无监督预训练是许多语言理解系统不可或缺的一部分。特别是,这些结果使得即使资源较少的任务也能从深度单向架构中受益。我们的主要贡献是进一步将这些发现推广到深度双向架构,使同一预训练模型能够成功应对广泛的自然语言处理任务。

References

参考文献

for natural language understanding. In Proceedings of the 2018 EMNLP Workshop Blackbox NLP: An- alyzing and Interpreting Neural Networks for NLP, pages 353–355.

用于自然语言理解。收录于 2018 年 EMNLP Workshop Blackbox NLP: 分析和解释用于 NLP 的神经网络,页面 353–355。

Wei Wang, Ming Yan, and Chen Wu. 2018b. Multigranularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Wei Wang, Ming Yan, 和 Chen Wu. 2018b. 多粒度层次注意力融合网络用于阅读理解和问答。在第 56 届计算语言学协会年会论文集 (长文卷)。Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Alex Warstadt, Amanpreet Singh, 和 Samuel R Bowman. 2018. 神经网络可接受性判断 (Neural network acceptability judgments). arXiv 预印本 arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Adina Williams, Nikita Nangia, 和 Samuel R Bowman. 2018. 通过推理进行句子理解的广泛覆盖挑战语料库。在 NAACL。

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey 等. 2016. Google 的神经机器翻译系统:弥合人类与机器翻译之间的差距. arXiv 预印本 arXiv:1609.08144.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.

Jason Yosinski, Jeff Clune, Yoshua Bengio, 和 Hod Lipson. 2014. 深度神经网络中的特征可迁移性如何? 在 Advances in neural information processing systems 中,页面 3320–3328.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, 和 Quoc V Le. 2018. QANet: 结合局部卷积与全局自注意力机制用于阅读理解。在 ICLR。

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rowan Zellers, Yonatan Bisk, Roy Schwartz, 和 Yejin Choi. 2018. SWAG: 一个大规模对抗性数据集用于基于常识推理 (adversarial dataset for grounded commonsense inference)。在 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)。

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, 和 Sanja Fidler. 2015. 对齐书籍和电影:通过观看电影和阅读书籍实现故事般的视觉解释。在 Proceedings of the IEEE international conference on computer vision,第 19–27 页。

Appendix for “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”

附录 “BERT: 用于语言理解的深度双向 Transformer (Deep Bidirectional Transformers) 预训练”

We organize the appendix into three sections:

我们将附录组织成三个部分:

• Additional implementation details for BERT are presented in Appendix A;

• BERT 的其他实现细节见附录 A;

A Additional Details for BERT

A BERT 的附加细节

A.1 Illustration of the Pre-training Tasks

A.1 预训练任务的说明 (Illustration of the Pre-training Tasks)


We provide examples of the pre-training tasks in the following.

我们在以下提供预训练任务的示例。

Masked LM and the Masking Procedure Assuming the unlabeled sentence is my dog is hairy, and during the random masking procedure we chose the 4-th token (which corresponds to hairy), our masking procedure can be further illustrated by

掩码 LM 和掩码过程 假设未标注的句子是 my dog is hairy,在随机掩码过程中我们选择了第 4 个 Token (对应于 hairy),我们的掩码过程可以进一步说明为

The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for $1.5%$ of all tokens (i.e., $10%$ of $15%$ ), this does not seem to harm the model’s language understanding capability. In Section C.2, we evaluate the impact of this procedure.

这种方法的优点是,Transformer 编码器不知道哪些词会被要求预测或已被随机词替换,因此它被迫保持每个输入 Token 的分布式上下文表示。此外,由于随机替换仅发生在所有 Token 的 1.5% (即 15% 的 10%),这似乎不会损害模型的语言理解能力。在第 C.2 节中,我们评估了此过程的影响。
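上述掩码流程 (15% 选中;选中者中 80%/10%/10% 分别替换为 [MASK]/随机词/保持不变) 可以用纯 Python语言 示意如下 (命名为本文自拟):

```python
import random

def mask_for_mlm(tokens, vocab, select_rate=0.15, seed=0):
    """BERT-style masking sketch. Each token is selected with
    probability 15%; of the selected tokens, 80% are replaced by
    [MASK], 10% by a random vocabulary token, and 10% are left
    unchanged, so only 1.5% of all tokens get a random replacement."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < select_rate:
            targets[i] = tok              # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return masked, targets
```

注意预测目标始终是原始 Token,无论该位置最终呈现为 [MASK]、随机词还是原词。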

Compared to standard language model training, the masked LM only makes predictions on $15%$ of tokens in each batch, which suggests that more pre-training steps may be required for the model

相比于标准语言模型训练,掩码语言模型 (masked LM) 仅对每个批次中 $15%$ 的 Token 进行预测,这表明模型可能需要更多的预训练步骤。

Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-toleft LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

图 3: 预训练模型架构的差异。BERT 使用双向 Transformer。OpenAI GPT 使用从左到右的 Transformer。ELMo 使用独立训练的从左到右和从右到左的 LSTMs 的连接来为下游任务生成特征。在这三个模型中,只有 BERT 的表示在所有层中同时依赖于左右上下文。除了架构上的差异,BERT 和 OpenAI GPT 是微调方法,而 ELMo 是基于特征的方法。



to converge. In Section C.1 we demonstrate that MLM does converge marginally slower than a leftto-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.

在 C.1 节中我们证明了 MLM (masked language model) 的收敛速度确实比从左到右的模型(预测每个 Token)稍慢,但 MLM 模型的经验改进远远超过了增加的训练成本。

Next Sentence Prediction The next sentence prediction task can be illustrated in the following examples.

下一句预测任务可以通过以下例子来说明。

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. $50%$ of the time B is the actual next sentence that follows A and $50%$ of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is $\leq 512$ tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of $15%$ , and no special consideration given to partial word pieces.

为了生成每个训练输入序列,我们从语料库中采样两个文本片段,我们称之为“句子”,尽管它们通常比单个句子长得多(但也可能更短)。第一个句子接收 A 嵌入,第二个句子接收 B 嵌入。50% 的时间 B 是实际跟随 A 的下一个句子,50% 的时间它是一个随机句子,这是为“下一句预测”任务而做的。它们被采样以确保组合长度 ≤512 个 Token。在 WordPiece tokenization 之后应用 LM 掩码,掩码率为 15%,并且不对部分词片段做特殊处理。
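上述 50/50 的句对采样可以示意如下 (未包含长度截断与掩码步骤;命名为本文自拟):

```python
import random

def sample_sentence_pair(spans, idx, rng):
    """NSP example sketch. spans is the list of corpus text spans;
    idx indexes span A. Half the time B is the true following span
    (label IsNext), half the time a random span (label NotNext)."""
    a = spans[idx]
    if rng.random() < 0.5 and idx + 1 < len(spans):
        return a, spans[idx + 1], "IsNext"
    return a, rng.choice(spans), "NotNext"
```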


epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, $\beta_{1}=0.9$ , $\beta_{2}=0.999$ , L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

在 33 亿词的语料库上进行训练。我们使用 Adam 优化器,学习率为 1e-4,$\beta_{1}=0.9$,$\beta_{2}=0.999$,L2 权重衰减为 0.01,前 10,000 步进行学习率预热,之后学习率线性衰减。我们在所有层使用 0.1 的 dropout 概率。遵循 OpenAI GPT 的做法,我们使用 gelu 激活函数 (Hendrycks 和 Gimpel, 2016) 而非标准的 relu。训练损失是平均掩码语言模型似然与平均下一句预测似然之和。
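其中 “前 10,000 步线性预热、随后线性衰减” 的学习率调度可以示意为 (函数命名为本文自拟,默认值对应正文给出的预训练设置):

```python
def bert_learning_rate(step, base_lr=1e-4, warmup=10_000, total=1_000_000):
    """Linear warmup over the first 10,000 steps, then linear decay
    of the learning rate down to zero at the final training step."""
    if step < warmup:
        return base_lr * step / warmup        # ramp up from 0 to base_lr
    return base_lr * (total - step) / (total - warmup)  # decay to 0
```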

A.2 Pre-training Procedure

A.2 预训练过程

We train with batch size of 256 sequences (256 sequences × 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40

我们使用批大小为 256 个序列(256 个序列 * 512 个 Token = 128,000 个 Token/批)训练 1,000,000 步,这大约是 40

Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).13 Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.

BERTBASE 的训练是在 4 个 Cloud TPU (Pod 配置,共 16 个 TPU 芯片) 上进行的。BERTLARGE 的训练是在 16 个 Cloud TPU (共 64 个 TPU 芯片) 上进行的。每次预训练耗时 4 天完成。

Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pretraining in our experiments, we pre-train the model with sequence length of 128 for $90%$ of the steps. Then, we train the rest $10%$ of the steps with a sequence length of 512 to learn the positional embeddings.

更长的序列成本不成比例地高昂,因为注意力机制与序列长度呈二次关系。为了加快预训练速度,在我们的实验中,我们在 90% 的步骤中使用序列长度为 128 进行预训练。然后,我们用剩余 10% 的步骤对序列长度为 512 的数据进行训练,以学习位置嵌入。

A.3 Fine-tuning Procedure

A.3 微调过程 (Fine-tuning Procedure)

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

对于微调,大多数模型超参数与预训练时相同,例外的是批量大小、学习率和训练轮数。dropout 概率始终保持在 0.1 。最优超参数值是任务特定的,但我们发现以下范围的可能值在所有任务中表现良好:

• Batch size: 16, 32

• 批量大小:16,32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5 • Number of epochs: 2, 3, 4

• 学习率 (Adam): 5e-5, 3e-5, 2e-5 • 训练轮数: 2, 3, 4

We also observed that large data sets (e.g., $100\mathrm{k}+$ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.

我们还观察到,大型数据集 (例如,$100\mathrm{k}+$ 标注训练样本) 对超参数选择的敏感度远低于小型数据集。微调通常非常快,因此合理的做法是对上述参数进行穷举搜索 (exhaustive search),并选择在开发集上表现最好的模型。
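按上述取值范围做穷举搜索时,共有 2×3×3 = 18 种组合,可以示意如下 (命名为本文自拟):

```python
from itertools import product

def fine_tuning_grid():
    """All 18 combinations of the ranges reported to work well:
    batch size {16, 32}, learning rate {5e-5, 3e-5, 2e-5},
    epochs {2, 3, 4}."""
    return [
        {"batch_size": b, "learning_rate": lr, "epochs": e}
        for b, lr, e in product([16, 32], [5e-5, 3e-5, 2e-5], [2, 3, 4])
    ]
```

实际使用时,对每种组合微调一次并保留开发集表现最好的那组超参数即可。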

A.4 Comparison of BERT, ELMo ,and OpenAI GPT

A.4 BERT、ELMo 和 OpenAI GPT 的比较


Here we study the differences in recent popular representation learning models, including ELMo, OpenAI GPT and BERT. The comparisons between the model architectures are shown visually in Figure 3. Note that in addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

我们研究了最近流行的表示学习模型之间的差异,包括 ELMo、OpenAI GPT 和 BERT。模型架构之间的比较如图 3 所示。请注意,除了架构上的差异外,BERT 和 OpenAI GPT 是微调方法,而 ELMo 是基于特征的方法。

图 3: ELMo、OpenAI GPT 与 BERT 模型架构的比较。

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally made to make it as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the bidirectionality and the two pre-training tasks presented in Section 3.1 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

与 BERT 最可比的现有预训练方法是 OpenAI GPT,它在大型文本语料库上训练了一个从左到右的 Transformer 语言模型 (LM)。实际上,BERT 的许多设计决策是有意为之,目的是使其尽可能接近 GPT,以便两种方法可以进行最小化的比较。本文的核心论点是,双向性和第 3.1 节中提出的两个预训练任务解释了大部分经验上的改进,但我们确实注意到 BERT 和 GPT 的训练方式之间还存在其他一些差异:

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the two pre-training tasks and the bidirectionality they enable.

为了隔离这些差异的影响,我们在第 5.1 节进行消融实验,结果表明大部分改进实际上来自于两个预训练任务以及它们所实现的双向性。

A.5 Illustrations of Fine-tuning on Different Tasks

A.5 不同任务微调的插图

The illustration of fine-tuning BERT on different tasks can be seen in Figure 4. Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, $E$ represents the input embedding, $T_{i}$ represents the contextual representation of token $i$ , [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

在不同任务上对 BERT 进行微调的示例见图 4。我们的特定任务模型是通过在 BERT 上添加一个额外的输出层形成的,因此只需要从头学习少量参数。在这些任务中,(a) 和 (b) 是序列级任务,而 (c) 和 (d) 是 Token 级任务。在图中,$E$ 表示输入嵌入,$T_{i}$ 表示第 $i$ 个 Token 的上下文表示,[CLS] 是分类输出的特殊符号,[SEP] 是用于分隔非连续 Token 序列的特殊符号。
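序列级任务的输出层可以用一个极简的草图来说明 (仅为示意,并非论文的实际 TensorFlow 实现):在 BERT 输出的 [CLS] 上下文向量之上加一个 softmax 分类层,只有 `W` 和 `b` 是需要从头学习的新参数。隐层维度 768 对应 BERTBASE,标签数 3 以 MNLI 为例。

```python
import numpy as np

HIDDEN, NUM_LABELS = 768, 3  # 以 MNLI 为例:蕴含 / 矛盾 / 中性

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_LABELS))  # 新增的输出层权重
b = np.zeros(NUM_LABELS)

def classify(sequence_output):
    """sequence_output: 形状为 (seq_len, HIDDEN) 的 BERT 上下文表示。

    位置 0 即 [CLS] 向量;任务头只是一个仿射层加 softmax。
    """
    cls = sequence_output[0]        # 取 [CLS] 对应的上下文向量
    logits = cls @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()          # 各标签上的概率分布
```

Token 级任务 (如 NER 或 SQuAD) 的做法类似,只是把同一个仿射层套用到每个 $T_{i}$ 上。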

B Detailed Experimental Setup

B 详细的实验设置

B.1 Detailed Descriptions for the GLUE Benchmark Experiments.

B.1 GLUE 基准测试实验的详细描述

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018a):

GLUE 基准包括以下数据集,这些数据集的描述最初由 Wang et al. (2018a) 总结如下:

MNLI Multi-Genre Natural Language Inference is a large-scale, crowd sourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

MNLI 多体裁自然语言推理 (Multi-Genre Natural Language Inference) 是一个大规模、众包的蕴含分类任务 (Williams et al., 2018)。给定一对句子,目标是预测第二个句子相对于第一个句子是否为蕴含、矛盾或中性。

QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).

QQP Quora 问题对 (Quora Question Pairs) 是一个二分类任务,目标是确定在 Quora 上提出的两个问题是否语义等价 (Chen et al., 2018)。

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018a). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.

QNLI 问题自然语言推理是斯坦福问答数据集 (Stanford Question Answering Dataset) (Rajpurkar et al., 2016) 的一个版本,该版本已转换为二元分类任务 (Wang et al., 2018a)。正例是包含正确答案的 (问题, 句子) 对,而负例是来自同一段落但不包含答案的 (问题, 句子)。

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

SST-2 斯坦福情感树库是一个二元单句分类任务,由从电影评论中提取的句子组成,并有人工标注的情感 (Socher et al., 2013)。


Figure 4: Illustrations of Fine-tuning BERT on Different Tasks.


图 4: 不同任务上微调 BERT 的示例。

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

CoLA 语言可接受性语料库是一个二元单句分类任务,目标是预测一个英语句子在语言上是否“可接受” (Warstadt et al., 2018)。

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

STS-B 语义文本相似性基准是来自新闻标题和其他来源 (Cer et al., 2017) 的句子对集合。这些句子对被标注了 1 到 5 的分数,表示两个句子在语义上的相似程度。

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

MRPC Microsoft Research Paraphrase Corpus 包含从在线新闻来源自动提取的句子对,并有人工标注这些句子对在语义上是否等价 (Dolan and Brockett, 2005)。

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).

RTE 文本蕴含识别是一个二元蕴含任务,类似于 MNLI,但训练数据要少得多 (Bentivogli et al., 2009)。

WNLI Winograd NLI is a small natural language inference dataset (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset, and every trained system that’s been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set to be fair to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

WNLI Winograd NLI 是一个小的自然语言推理数据集 (Levesque et al., 2011)。GLUE 网页指出,该数据集的构建存在一些问题,并且每个提交到 GLUE 的训练系统的表现都比预测多数类别的 65.1 基准准确率差。因此,为了对 OpenAI GPT 公平起见,我们排除了这个数据集。对于我们的 GLUE 提交,我们始终预测多数类别。

C Additional Ablation Studies

C 额外的消融研究

C.1 Effect of Number of Training Steps

C.1 训练步数的影响

Figure 5 presents MNLI Dev accuracy after finetuning from a checkpoint that has been pre-trained for $k$ steps. This allows us to answer the following questions:

图 5 展示了基于预训练了 $k$ 步的检查点进行微调后的 MNLI Dev 准确率。这使我们能够回答以下问题:

  1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch × 1,000,000 steps) to achieve high fine-tuning accuracy?

  1. 问题:BERT 真的需要如此大量的预训练 (128,000 词/批次 × 1,000,000 步) 才能实现高微调精度吗?

Answer: Yes, BERTBASE achieves almost $1.0%$ additional accuracy on MNLI when trained on 1M steps compared to $500\mathrm{k}$ steps.

答:是的,BERTBASE 在 1M 步训练时相比 500k 步训练在 MNLI 上提高了近 1.0% 的准确率。

  2. Question: Does MLM pre-training converge slower than LTR pre-training, since only $15%$ of words are predicted in each batch rather than every word? Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately.

  2. 问题:由于在每个批次中只预测了 $15%$ 的单词,而不是每个单词,MLM 预训练是否比 LTR 预训练收敛得更慢?回答:MLM 模型确实比 LTR 模型收敛稍慢。然而,在绝对准确率方面,MLM 模型几乎立即开始优于 LTR 模型。

C.2 Ablation for Different Masking Procedures

C.2 不同掩码过程的消融实验

In Section 3.1, we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the masked language model (MLM) objective. The following is an ablation study to evaluate the effect of different masking strategies.

在第 3.1 节中,我们提到 BERT 在使用掩码语言模型 (MLM) 目标进行预训练时,采用了一种混合策略来掩码目标 Token。以下是不同掩码策略效果的消融研究。

Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning, as the [MASK] symbol never appears during the fine-tuning stage. We report the Dev results for both MNLI and NER. For NER, we report both fine-tuning and feature-based approaches, as we expect the mismatch will be amplified for the feature-based approach as the model will not have the chance to adjust the representations.

需要注意的是,掩码策略的目的是减少预训练和微调之间的不匹配,因为 [MASK] 符号在微调阶段从未出现。我们报告了 MNLI 和 NER 的开发集结果。对于 NER,我们报告了微调和基于特征的方法的结果,因为我们预计基于特征的方法会放大这种不匹配,因为模型没有机会调整表示。


Figure 5: Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for $k$ steps. The $\mathbf{X}$ -axis is the value of $k$ .

图 5: 训练步数的消融实验。这显示了从预训练了 $k$ 步的模型参数开始微调后的 MNLI 准确率。$\mathbf{X}$-轴是 $k$ 的值。

| MASK | SAME | RND | MNLI Fine-tune | NER Fine-tune | NER Feature-based |
| --- | --- | --- | --- | --- | --- |
| 80% | 10% | 10% | 84.2 | 95.4 | 94.9 |
| 100% | 0% | 0% | 84.3 | 94.9 | 94.0 |
| 80% | 0% | 20% | 84.1 | 95.2 | 94.6 |
| 80% | 20% | 0% | 84.4 | 95.2 | 94.7 |
| 0% | 20% | 80% | 83.7 | 94.8 | 94.6 |
| 0% | 0% | 100% | 83.6 | 94.9 | 94.6 |

Table 8: Ablation over different masking strategies.

表 8: 不同遮罩策略的消融实验。

The results are presented in Table 8. In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token.

结果见表 8。在表中,MASK 表示我们将目标 Token 替换为 [MASK] 符号用于 MLM;SAME 表示我们保持目标 Token 不变;RND 表示我们将目标 Token 替换为另一个随机 Token。

The numbers in the left part of the table represent the probabilities of the specific strategies used during MLM pre-training (BERT uses $80%$ , $10%$ , $10%$ ). The right part of the table represents the Dev set results. For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section 5.3.

表格左半部分的数字代表在 MLM 预训练期间使用特定策略的概率 (BERT 使用 $80%$ 、$10%$ 、$10%$ )。表格右半部分代表 Dev 集的结果。对于基于特征的方法,我们将 BERT 的最后 4 层拼接起来作为特征,第 5.3 节中显示这是最佳方法。

From the table it can be seen that fine-tuning is surprisingly robust to different masking strategies. However, as expected, using only the MASK strategy was problematic when applying the featurebased approach to NER. Interestingly, using only the RND strategy performs much worse than our strategy as well.

从表中可以看出,微调对不同的掩码策略表现出惊人的鲁棒性。然而,正如预期的那样,在应用基于特征的方法进行命名实体识别 (NER) 时,仅使用 MASK 策略存在问题。有趣的是,仅使用 RND 策略的表现也比我们的策略差得多。
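表 8 第一行所用的 $80%$ / $10%$ / $10%$ 混合策略可以用如下草图来示意 (仅为简化示意:真实实现作用于 WordPiece Token,并且会限制每个序列的预测数量上限):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """返回 (masked_tokens, target_positions),用于 MLM 预训练。

    每个 Token 以 15% 的概率被选为预测目标;被选中的 Token
    有 80% 的概率替换为 [MASK],10% 的概率保持不变 (SAME),
    10% 的概率替换为词表中的随机 Token (RND)。
    """
    rnd = random.Random(seed)
    masked = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if rnd.random() >= select_prob:
            continue                      # 未被选中,保持原样
        targets.append(i)
        r = rnd.random()
        if r < 0.8:
            masked[i] = MASK_TOKEN        # MASK 策略
        elif r < 0.9:
            pass                          # SAME 策略:保留原 Token
        else:
            masked[i] = rnd.choice(vocab) # RND 策略:随机替换
    return masked, targets
```

无论 Token 被替换与否,损失始终只在 `targets` 中的位置上计算,模型需在这些位置预测原始 Token。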
