[论文翻译]BERT:面向语言理解的深度双向Transformer预训练


原文地址:https://aclanthology.org/N19-1423.pdf


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT:面向语言理解的深度双向Transformer预训练

Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract

摘要

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

我们提出了一种名为BERT (Bidirectional Encoder Representations from Transformers) 的新语言表示模型。与近期其他语言表示模型 (Peters et al., 2018a; Radford et al., 2018) 不同,BERT通过在所有层中联合调节左右上下文,从未标记文本中预训练深度双向表示。因此,预训练的BERT模型只需添加一个额外输出层进行微调,即可为问答和语言推理等多种任务创建最先进的模型,而无需对任务特定架构进行重大修改。

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to $80.5\%$ ($7.7\%$ point absolute improvement), MultiNLI accuracy to $86.7\%$ ($4.6\%$ absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

BERT 在概念上简单且实证效果强大。它在11项自然语言处理任务中取得了新的最先进成果,包括将GLUE分数提升至80.5%(绝对提升7.7%)、MultiNLI准确率提升至86.7%(绝对提升4.6%)、SQuAD v1.1问答测试F1值达到93.2(绝对提升1.5分)以及SQuAD v2.0测试F1值达到83.1(绝对提升5.1分)。

1 Introduction

1 引言

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

语言模型预训练已被证明能有效提升多种自然语言处理任务的性能 (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018)。这包括句子级任务如自然语言推理 (Bowman et al., 2015; Williams et al., 2018) 和文本复述 (Dolan and Brockett, 2005)——这类任务通过整体分析句子来预测其关联关系;也包括token级任务如命名实体识别和问答系统——这类任务要求模型在token级别生成细粒度输出 (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016)。

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

将预训练语言表征应用于下游任务的现有策略有两种:基于特征的方法和微调方法。基于特征的方法,如ELMo (Peters et al., 2018a),使用包含预训练表征作为附加特征的特定任务架构。微调方法,如生成式预训练Transformer (OpenAI GPT) (Radford et al., 2018),引入最少的任务特定参数,并通过简单微调所有预训练参数来在下游任务上进行训练。这两种方法在预训练期间共享相同的目标函数,即使用单向语言模型来学习通用语言表征。

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

我们认为当前的技术限制了预训练表征的能力,尤其是微调方法。主要局限在于标准语言模型是单向的,这限制了预训练时可用的架构选择。例如,在OpenAI GPT中,作者采用了从左到右的架构,其中每个token在Transformer (Vaswani et al., 2017) 的自注意力层中只能关注先前的token。这种限制对于句子级任务来说是次优的,并且在将基于微调的方法应用于token级任务(如问答)时可能非常不利,因为双向上下文整合至关重要。

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:

在本文中,我们通过提出BERT(来自Transformer的双向编码器表示)改进了基于微调的方法。BERT通过使用受Cloze任务(Taylor, 1953)启发的"掩码语言模型"(MLM)预训练目标,缓解了之前提到的单向性限制。掩码语言模型随机掩码输入中的部分token,目标是根据上下文预测被掩码单词的原始词汇ID。与从左到右的语言模型预训练不同,MLM目标使表示能够融合左右上下文,这使我们能够预训练一个深度双向Transformer。除了掩码语言模型外,我们还使用"下一句预测"任务来联合预训练文本对表示。本文的贡献如下:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• 我们证明了双向预训练对语言表征的重要性。与Radford等人(2018)使用单向语言模型进行预训练不同,BERT采用掩码语言模型来实现预训练的深度双向表征。这也与Peters等人(2018a)形成对比,后者仅对独立训练的从左到右和从右到左语言模型进行浅层拼接。

• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

• 我们证明预训练表征能减少对大量人工设计的任务特定架构的需求。BERT是首个基于微调的表征模型,在句子级和token级任务套件中实现最先进性能,超越了许多任务特定架构。

• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert.

2 Related Work

2 相关工作

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section.

预训练通用语言表征有着悠久的历史,本节将简要回顾最广泛使用的方法。

2.1 Unsupervised Feature-based Approaches

2.1 基于特征的无监督方法

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pretrain word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

学习广泛适用的词表示法数十年来一直是活跃的研究领域,包括非神经网络方法 (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) 和神经网络方法 (Mikolov et al., 2013; Pennington et al., 2014)。预训练词嵌入是现代自然语言处理系统的核心组成部分,相比从头开始学习的嵌入有显著提升 (Turian et al., 2010)。为预训练词嵌入向量,研究者既采用了从左到右的语言建模目标 (Mnih and Hinton, 2009),也使用了区分左右上下文中正确与错误单词的目标 (Mikolov et al., 2013)。

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence (Kiros et al., 2015), or denoising autoencoder derived objectives (Hill et al., 2016).

这些方法已被推广到更粗粒度的单元,例如句子嵌入 (sentence embeddings) (Kiros et al., 2015; Logeswaran and Lee, 2018) 或段落嵌入 (paragraph embeddings) (Le and Mikolov, 2014)。为训练句子表征,先前研究采用了以下目标:对候选下一句子进行排序 (Jernite et al., 2017; Logeswaran and Lee, 2018),基于前句表征从左到右生成下一句单词 (Kiros et al., 2015),或采用去噪自编码器衍生的目标函数 (Hill et al., 2016)。

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

ELMo及其前身 (Peters et al., 2017, 2018a) 从不同维度拓展了传统词嵌入研究。该方法通过从左到右和从右到左的语言模型提取上下文敏感特征,每个token的上下文表示是左右双向表示的拼接。当将上下文词嵌入与现有任务特定架构结合时,ELMo在多项重要NLP基准测试中实现了技术突破 (Peters et al., 2018a) ,包括问答系统 (Rajpurkar et al., 2016) 、情感分析 (Socher et al., 2013) 和命名实体识别 (Tjong Kim Sang and De Meulder, 2003) 。Melamud等人 (2016) 提出通过LSTM预测左右上下文单词的任务来学习上下文表示,其模型与ELMo类似,都是基于特征且非深度双向的。Fedus等人 (2018) 证明完形填空任务可用于提升文本生成模型的鲁棒性。

2.2 Unsupervised Fine-tuning Approaches

2.2 无监督微调方法

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008).

与基于特征的方法类似,这一方向的首批研究仅从无标注文本中预训练词嵌入参数 (Collobert and Weston, 2008)。

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

最近,能够生成上下文token表示的句子或文档编码器已通过无标注文本进行预训练,并针对有监督的下游任务进行微调 (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018)。这类方法的优势在于只需从头学习少量参数。至少部分得益于这一优势,OpenAI GPT (Radford et al., 2018) 在GLUE基准测试 (Wang et al., 2018a) 的许多句子级任务中取得了当时最先进的结果。此类模型的预训练采用了从左到右的语言建模和自编码器目标 (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015)。


Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

图 1: BERT的整体预训练与微调流程。除输出层外,预训练和微调阶段使用相同的架构结构。相同的预训练模型参数被用于初始化不同下游任务的模型。微调过程中所有参数都会被调整。[CLS]是添加在每个输入样本前的特殊符号,[SEP]是特殊分隔符token (例如分隔问题/答案)。

2.3 Transfer Learning from Supervised Data

2.3 基于监督数据的迁移学习

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

已有研究表明,从大规模监督任务(如自然语言推理 (Conneau et al., 2017) 和机器翻译 (McCann et al., 2017))中能实现有效迁移。计算机视觉研究同样证实了基于大型预训练模型进行迁移学习的重要性,其中经典方法是微调基于ImageNet (Deng et al., 2009; Yosinski et al., 2014) 预训练的模型。

3 BERT

3 BERT

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section.

本节介绍BERT及其具体实现。我们的框架包含两个步骤:预训练(pre-training)和微调(fine-tuning)。预训练阶段,模型通过不同预训练任务在无标注数据上进行训练。微调阶段,BERT模型首先加载预训练参数进行初始化,然后使用下游任务的标注数据对所有参数进行微调。尽管不同下游任务使用相同的预训练参数初始化,但每个任务都有独立的微调模型。图1中的问答示例将作为本节的运行案例。

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

BERT的一个显著特点是其跨任务统一的架构。预训练架构与最终下游架构之间的差异极小。

Model Architecture BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.1 Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”2

模型架构
BERT的模型架构是一个基于Vaswani等人(2017)原始实现的多层双向Transformer编码器,并发布于tensor2tensor库。由于Transformer的使用已变得普遍,且我们的实现与原始版本几乎相同,因此我们将省略对模型架构的详尽背景描述,建议读者参考Vaswani等人(2017)以及《The Annotated Transformer》等优秀指南。

In this work, we denote the number of layers (i.e., Transformer blocks) as $L$, the hidden size as $H$, and the number of self-attention heads as $A$.3 We primarily report results on two model sizes: BERTBASE ($L=12$, $H=768$, $A=12$, Total Parameters=110M) and BERTLARGE ($L=24$, $H=1024$, $A=16$, Total Parameters=340M).

在本工作中,我们将层数(即Transformer块)表示为$L$,隐藏层大小表示为$H$,自注意力头数表示为$A$。我们主要报告两种模型尺寸的结果:BERTBASE ($L=12$, $H=768$, $A=12$,总参数量110M);以及BERTLARGE ($L=24$, $H=1024$, $A=16$,总参数量340M)。
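For reference, the two reported sizes can be written down as simple configuration constants. This is a sketch of ours, not code from the paper; the key names are assumptions:

```python
# Hypothetical config dicts mirroring the two model sizes reported above.
BERT_BASE  = dict(num_layers=12, hidden_size=768,  num_attention_heads=12)   # ~110M parameters
BERT_LARGE = dict(num_layers=24, hidden_size=1024, num_attention_heads=16)   # ~340M parameters
```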

BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.4

为了便于比较,BERTBASE 的模型大小被设定为与 OpenAI GPT 相同。但关键在于,BERT 的 Transformer 采用了双向自注意力机制,而 GPT 的 Transformer 使用的是受限自注意力机制,即每个 Token 只能关注其左侧的上下文。4

Input/Output Representations To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., Question, Answer ) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

输入/输出表示

为了使BERT能够处理各种下游任务,我们的输入表示能够在一个token序列中明确表示单个句子和句子对(如问题、答案)。在本研究中,"句子"可以是任意连续的文本片段,而非实际的语言学句子。"序列"指的是BERT的输入token序列,可以是一个句子或两个打包在一起的句子。

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence $\mathtt{A}$ or sentence B. As shown in Figure 1, we denote input embedding as $E$ , the final hidden vector of the special [CLS] token as $C\in\mathbb{R}^{H}$ , and the final hidden vector for the $i^{\mathrm{th}}$ input token as $T_ {i}\in\mathbb{R}^{H}$ .

我们采用30,000个token词汇表的WordPiece嵌入 (Wu et al., 2016) 。每个序列的首个token始终是特殊分类标记 ([CLS]) 。该标记对应的最终隐藏状态被用作分类任务的聚合序列表示。句子对会被合并为单个序列,我们通过两种方式区分句子:首先用特殊分隔符 ([SEP]) 隔开,其次为每个token添加可学习的嵌入标识,用于区分其属于句子$\mathtt{A}$还是句子B。如图1所示,输入嵌入记为$E$,特殊[CLS]标记的最终隐藏向量记为$C\in\mathbb{R}^{H}$,第$i^{\mathrm{th}}$个输入token的最终隐藏向量记为$T_ {i}\in\mathbb{R}^{H}$。
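To make the packing convention concrete, here is a minimal sketch (ours, not the released code) of how a single sentence or a sentence pair could be turned into the token and segment-id sequences described above:

```python
def pack_sequence(tokens_a, tokens_b=None):
    """Prepend [CLS], append [SEP] after each sentence, and build segment ids
    (0 for sentence A, 1 for sentence B)."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# e.g. pack_sequence(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
```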

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

对于给定的Token,其输入表示通过将对应的Token嵌入、片段嵌入和位置嵌入相加构建而成。这一构建过程的可视化展示见图 2。
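A rough PyTorch sketch of this summation follows. It is our illustration rather than the released implementation; the maximum position count and the module name are assumptions:

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Input representation = token + segment + position embeddings (all learned)."""
    def __init__(self, vocab_size=30000, hidden_size=768, max_positions=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)      # WordPiece ids
        self.segment = nn.Embedding(num_segments, hidden_size)  # sentence A vs. sentence B
        self.position = nn.Embedding(max_positions, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len); positions are 0..seq_len-1
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(pos)
```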

3.1 Pre-training BERT

3.1 BERT预训练

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

与 Peters 等人 (2018a) 和 Radford 等人 (2018) 不同,我们并未使用传统的从左到右或从右到左语言模型来预训练 BERT,而是采用本节所述的两个无监督任务进行预训练。该步骤展示在图 1 左侧部分。

Task #1: Masked LM Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-toright and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

任务 #1: 掩码语言模型 (Masked LM)
直观上可以认为,深度双向模型严格优于从左到右模型或左右向模型的浅层拼接。但标准条件语言模型只能进行从左到右或从右到左训练,因为双向条件作用会使每个词间接"看到自己",导致模型在多层次上下文中能轻易预测目标词。

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask $15\%$ of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

为了训练深度双向表征,我们简单地随机遮蔽一定比例的输入token,然后预测这些被遮蔽的token。我们将此过程称为"遮蔽语言模型 (masked LM, MLM)",尽管在文献(Taylor, 1953)中它常被称为完形填空任务(Cloze task)。这种情况下,与标准语言模型类似,对应于遮蔽token的最终隐藏向量会被输入到词汇表上的输出softmax层。在所有实验中,我们随机遮蔽每个序列中15%的WordPiece token。与去噪自编码器(Vincent et al., 2008)不同,我们仅预测被遮蔽的单词而非重建整个输入。

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses $15\%$ of the token positions at random for prediction. If the $i$-th token is chosen, we replace the $i$-th token with (1) the [MASK] token $80\%$ of the time (2) a random token $10\%$ of the time (3) the unchanged $i$-th token $10\%$ of the time. Then, $T_{i}$ will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.

虽然这种方法能让我们获得双向预训练模型,但一个缺点是会造成预训练与微调阶段的不匹配,因为微调时不会出现 [MASK] token。为缓解这一问题,我们并不总是用真实的 [MASK] token 替换被遮蔽词。训练数据生成器会随机选择 15% 的 token 位置进行预测:若选中第 $i$ 个 token,则以 (1) 80% 概率替换为 [MASK] token (2) 10% 概率替换为随机 token (3) 10% 概率保持原 token。随后 $T_{i}$ 将通过交叉熵损失来预测原始 token。我们在附录 C.2 中对比了该流程的多种变体。
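The 15% selection and 80%/10%/10% replacement rule can be sketched as a simple data-generation routine. This is illustrative only; the function name and vocabulary handling are our assumptions:

```python
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15):
    """Return (corrupted_tokens, targets) where targets maps a chosen position
    to the original token the model must predict."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= select_prob:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:                      # 80%: replace with [MASK]
            corrupted[i] = "[MASK]"
        elif r < 0.9:                    # 10%: replace with a random vocabulary token
            corrupted[i] = random.choice(vocab)
        # else 10%: keep the original token unchanged
    return corrupted, targets
```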

Task #2: Next Sentence Prediction (NSP) Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, $50\%$ of the time B is the actual next sentence that follows A (labeled as IsNext), and $50\%$ of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, $C$ is used for next sentence prediction (NSP).5 Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.6

任务 #2: 下一句预测 (NSP)
许多重要下游任务 (如问答系统 (QA) 和自然语言推理 (NLI) ) 都基于对两个句子关系的理解,而语言建模无法直接捕捉这种关系。为了训练能够理解句子关系的模型,我们针对一个二值化的下一句预测任务进行预训练,该任务可直接从任何单语语料库生成。具体而言,在为每个预训练样本选择句子A和B时,50%的概率B是实际接在A后的下一句 (标记为IsNext) ,50%的概率它是语料库中的随机句子 (标记为NotNext) 。如图1所示,C被用于下一句预测 (NSP) 。尽管该任务简单,但我们在5.1节证明针对该任务的预训练对QA和NLI都非常有益。
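A minimal sketch of how IsNext/NotNext training pairs could be drawn from a monolingual corpus; the helper name and corpus representation are assumptions of ours:

```python
import random

def sample_nsp_pair(document, corpus):
    """document: list of sentences in order; corpus: list of all sentences.
    Returns (sentence_a, sentence_b, label)."""
    i = random.randrange(len(document) - 1)
    sentence_a = document[i]
    if random.random() < 0.5:
        return sentence_a, document[i + 1], "IsNext"       # the actual next sentence
    return sentence_a, random.choice(corpus), "NotNext"    # a random corpus sentence
```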


Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

图 2: BERT输入表示。输入嵌入是token嵌入、分段嵌入和位置嵌入的总和。

The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, where BERT transfers all parameters to initialize end-task model parameters.

NSP任务与Jernite等人(2017)以及Logeswaran和Lee(2018)使用的表征学习目标密切相关。然而在先前工作中,只有句子嵌入会被迁移到下游任务,而BERT会迁移所有参数来初始化终端任务模型参数。

Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the Books Corpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

预训练数据
预训练流程基本遵循现有语言模型预训练文献。预训练语料采用Books Corpus (8亿词量) (Zhu et al., 2015) 和英文维基百科 (25亿词量)。对于维基百科,我们仅提取文本段落并忽略列表、表格和标题。使用文档级语料而非乱序的句子级语料(如Billion Word Benchmark (Chelba et al., 2013))对提取长连续序列至关重要。

3.2 Fine-tuning BERT

3.2 微调 BERT

Fine-tuning is straightforward since the selfattention mechanism in the Transformer allows BERT to model many downstream tasks— whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

微调过程非常直接,因为Transformer中的自注意力机制使BERT能够通过替换适当的输入和输出来建模许多下游任务——无论涉及单个文本还是文本对。对于涉及文本对的应用,常见模式是在应用双向交叉注意力之前独立编码文本对,例如Parikh等人(2016)、Seo等人(2017)。而BERT使用自注意力机制统一了这两个阶段,因为用自注意力编码连接的文本对实际上包含了两个句子之间的双向交叉注意力。

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-$\varnothing$ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

对于每项任务,我们只需将任务特定的输入和输出插入BERT模型,并端到端地微调所有参数。在输入阶段,预训练中的句子A和句子B分别对应:(1) 释义任务中的句子对,(2) 蕴含任务中的假设-前提对,(3) 问答任务中的问题-段落对,以及(4) 文本分类或序列标注任务中的退化的文本-$\varnothing$对。在输出阶段,Token表征会被输入到输出层以处理Token级别的任务(如序列标注或问答),而[CLS]表征则被输入到输出层以进行分类任务(如蕴含分析或情感分析)。

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.7 We describe the task-specific details in the corresponding subsections of Section 4. More details can be found in Appendix A.5.

与预训练相比,微调成本相对较低。论文中的所有结果都能在单个Cloud TPU上至多1小时内复现,或在GPU上花费数小时完成(均基于完全相同的预训练模型)[7]。具体任务细节详见第4节对应子章节,更多信息可参阅附录A.5。

4 Experiments

4 实验

In this section, we present BERT fine-tuning results on 11 NLP tasks.

在本节中,我们展示了BERT在11项NLP任务上的微调结果。

4.1 GLUE

4.1 GLUE

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1.

通用语言理解评估 (GLUE) 基准 (Wang et al., 2018a) 是一个包含多样化自然语言理解任务的集合。GLUE 数据集的详细描述见附录 B.1。

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector $C\in\mathbb{R}^{H}$ corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights $W\in\mathbb{R}^{K\times H}$, where $K$ is the number of labels. We compute a standard classification loss with $C$ and $W$, i.e., $\log(\mathrm{softmax}(C W^{T}))$.

在GLUE上进行微调时,我们按照第3节所述表示输入序列(单句或句对),并使用与首个输入token ([CLS]) 对应的最终隐藏向量 $C\in\mathbb{R}^{H}$ 作为聚合表征。微调过程中引入的唯一新参数是分类层权重 $W\in\mathbb{R}^{K\times H}$,其中$K$为标签数量。我们通过 $C$ 和 $W$ 计算标准分类损失,即 $\log(\mathrm{softmax}(C W^{T}))$。
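In code, this classification head amounts to one matrix multiply followed by softmax cross-entropy. A minimal PyTorch sketch under the notation above (the variable names are ours):

```python
import torch
import torch.nn.functional as F

def glue_loss(C, W, labels):
    """C: (batch, H) final [CLS] vectors; W: (K, H) new classification weights;
    labels: (batch,) gold label ids. Standard log(softmax(C W^T)) classification loss."""
    logits = C @ W.t()                       # (batch, K)
    return F.cross_entropy(logits, labels)   # negative log-likelihood of the gold label
```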

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set.8 BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

| System | MNLI-(m/mm) 392k | QQP 363k | QNLI 108k | SST-2 67k | CoLA 8.5k | STS-B 5.7k | MRPC 3.5k | RTE 2.5k | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |

表 1: GLUE 测试结果,由评估服务器 (https://gluebenchmark.com/leaderboard) 评分。每个任务下方的数字表示训练样本数量。"Average"列与官方 GLUE 分数略有不同,因为我们排除了有问题的 WNLI 集。8 BERT 和 OpenAI GPT 是单模型、单任务。QQP 和 MRPC 报告 F1 分数,STS-B 报告 Spearman 相关性,其他任务报告准确率分数。我们排除了使用 BERT 作为组件的条目。

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.9

我们使用32的批量大小,对所有GLUE任务数据进行3个周期的微调。针对每项任务,我们在开发集上选择了最佳微调学习率(从5e-5、4e-5、3e-5和2e-5中选取)。此外,对于BERTLARGE,我们发现小数据集上的微调有时不稳定,因此进行了多次随机重启,并选择开发集上表现最佳的模型。随机重启时,我们使用相同的预训练检查点,但采用不同的微调数据洗牌和分类器层初始化方式。9

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining $4.5\%$ and $7.0\%$ respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a $4.6\%$ absolute accuracy improvement. On the official GLUE leaderboard10, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

结果如表1所示。BERTBASE和BERTLARGE在所有任务上都以显著优势超越所有系统,分别比之前的最先进水平平均准确率提升了4.5%和7.0%。需要注意的是,除了注意力掩码机制外,BERTBASE与OpenAI GPT的模型架构几乎完全相同。在GLUE基准中规模最大、报道最广的MNLI任务上,BERT实现了4.6%的绝对准确率提升。根据GLUE官方排行榜[10]的当前数据,BERTLARGE获得80.5分,而OpenAI GPT得分为72.8。

We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

我们发现BERTLARGE在所有任务上都显著优于BERTBASE,尤其是在训练数据极少的任务中。模型规模的影响将在5.2节进行更深入的探讨。

4.2 SQuAD v1.1

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

斯坦福问答数据集 (SQuAD v1.1) 是一个包含 10 万条众包问答对的数据集 (Rajpurkar et al., 2016)。给定一个问题以及一段包含答案的维基百科文章,任务是预测答案在文章中的文本片段。

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector $S\in\mathbb{R}^{H}$ and an end vector $E\in\mathbb{R}^{H}$ during fine-tuning. The probability of word $i$ being the start of the answer span is computed as a dot product between $T_{i}$ and $S$ followed by a softmax over all of the words in the paragraph: $P_{i}=\frac{e^{S\cdot T_{i}}}{\sum_{j}e^{S\cdot T_{j}}}$. The analogous formula is used for the end of the answer span. The score of a candidate span from position $i$ to position $j$ is defined as $S\cdot T_{i}+E\cdot T_{j}$, and the maximum scoring span where $j\geq i$ is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

如图 1 所示,在问答任务中,我们将输入问题和段落表示为单个打包序列,其中问题使用 A 嵌入,段落使用 B 嵌入。微调期间仅引入起始向量 $S\in\mathbb{R}^{H}$ 和结束向量 $E\in\mathbb{R}^{H}$。词 $i$ 作为答案跨度起始位置的概率通过 $T_{i}$ 与 $S$ 的点积计算,并对段落中所有词进行 softmax 归一化:$P_{i}=\frac{e^{S\cdot T_{i}}}{\sum_{j}e^{S\cdot T_{j}}}$。答案跨度结束位置的计算公式同理。候选跨度从位置 $i$ 到位置 $j$ 的得分定义为 $S\cdot T_{i}+E\cdot T_{j}$,最终预测取满足 $j\geq i$ 条件的最高得分跨度。训练目标为正确起止位置对数似然之和。我们采用学习率 5e-5、批量大小 32 进行 3 个周期的微调。
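A minimal sketch of the start/end scoring and span decoding implied by these formulas (our illustration; real implementations vectorize the search and usually cap the answer length, which we omit here):

```python
import torch

def span_scores(T, S, E):
    """T: (seq_len, H) final token vectors; S, E: (H,) start/end vectors.
    Returns start logits S.T_i and end logits E.T_j for every position."""
    return T @ S, T @ E

def best_span(T, S, E):
    """Highest-scoring candidate span S.T_i + E.T_j with j >= i."""
    start, end = span_scores(T, S, E)
    best_i, best_j, best_score = 0, 0, float("-inf")
    for i in range(T.size(0)):
        for j in range(i, T.size(0)):
            score = (start[i] + end[j]).item()
            if score > best_score:
                best_i, best_j, best_score = i, j, score
    return best_i, best_j, best_score
```

The training objective then corresponds to softmax cross-entropy over the start logits plus the same over the end logits.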

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available,11 and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

表 2 展示了排行榜前列的参赛系统以及已发表顶级系统的结果 (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018)。SQuAD 排行榜的顶尖结果目前没有最新的公开系统描述,且允许在系统训练时使用任何公开数据。因此我们在系统中采用了适度的数据增强策略:先在 TriviaQA (Joshi et al., 2017) 上进行微调,再对 SQuAD 进行微调。

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.12

我们表现最佳的集成系统比排行榜上的顶级系统高出 +1.5 F1,而作为单一系统则高出 +1.3 F1。事实上,我们的单一 BERT 模型在 F1 分数上已经超越了顶级集成系统。即便不使用 TriviaQA 微调数据,我们也仅损失 0.1-0.4 F1,仍以显著优势超越所有现有系统。12

Table 2: SQuAD 1.1 results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.

| System | Dev EM | Dev F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| *Top Leaderboard Systems (Dec 10th, 2018)* | | | | |
| Human | - | - | 82.3 | 91.2 |
| #1 Ensemble - nlnet | - | - | 86.0 | 91.7 |
| #2 Ensemble - QANet | - | - | 84.5 | 90.5 |
| *Published* | | | | |
| BiDAF+ELMo (Single) | - | 85.6 | - | 85.8 |
| R.M. Reader (Ensemble) | 81.2 | 87.9 | 82.3 | 88.5 |
| *Ours* | | | | |
| BERTBASE (Single) | 80.8 | 88.5 | - | - |
| BERTLARGE (Single) | 84.1 | 90.9 | - | - |
| BERTLARGE (Ensemble) | 85.8 | 91.8 | - | - |
| BERTLARGE (Sgl.+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8 |
| BERTLARGE (Ens.+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2 |

表 2: SQuAD 1.1 结果。BERT 集成模型由 7 个使用不同预训练检查点和微调种子的系统组成。

Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.

| System | Dev EM | Dev F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| *Top Leaderboard Systems (Dec 10th, 2018)* | | | | |
| Human | 86.3 | 89.0 | 86.9 | 89.5 |
| #1 Single - MIR-MRC (F-Net) | - | - | 74.8 | 78.0 |
| #2 Single - nlnet | - | - | 74.2 | 77.1 |
| *Published* | | | | |
| unet (Ensemble) | - | - | 71.4 | 74.9 |
| SLQA+ (Single) | - | - | 71.4 | 74.4 |
| *Ours* | | | | |
| BERTLARGE (Single) | 78.7 | 81.9 | 80.0 | 83.1 |

表 3: SQuAD 2.0 结果。我们排除了使用 BERT 作为组件的条目。

4.3 SQuAD v2.0

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

SQuAD 2.0任务通过允许所提供段落中可能不存在简短答案,扩展了SQuAD 1.1的问题定义,使问题更加现实。

We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span: $s_{\mathrm{null}}=S\cdot C+E\cdot C$ to the score of the best non-null span $\hat{s}_{i,j}=\max_{j\geq i}S\cdot T_{i}+E\cdot T_{j}$. We predict a non-null answer when $\hat{s}_{i,j}>s_{\mathrm{null}}+\tau$, where the threshold $\tau$ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.

我们采用一种简单的方法来扩展SQuAD v1.1 BERT模型以完成此任务。对于没有答案的问题,我们将其视为答案跨度起始和结束位置均为[CLS] token的情况。答案跨度起始和结束位置的概率空间被扩展以包含[CLS] token的位置。在预测时,我们将无答案跨度的得分 $s_{\mathrm{null}}=S\cdot C+E\cdot C$ 与最佳非空跨度得分 $\hat{s}_{i,j}=\max_{j\geq i}S\cdot T_{i}+E\cdot T_{j}$ 进行比较。当 $\hat{s}_{i,j}>s_{\mathrm{null}}+\tau$ 时,我们预测为非空答案,其中阈值 $\tau$ 在开发集上选择以最大化F1值。该模型未使用TriviaQA数据。我们以5e-5的学习率和48的批量大小进行了2个epoch的微调。
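A sketch of the resulting no-answer decision rule (the names are ours; $\tau$ is whatever value maximizes dev-set F1):

```python
import torch

def predict_squad2(T, C, S, E, tau):
    """T: (seq_len, H) token vectors, C: (H,) final [CLS] vector, S/E: (H,) start/end vectors.
    Returns (i, j) for a predicted answer span, or None when "no answer" wins."""
    start, end = T @ S, T @ E
    # best non-null span score s_ij = max_{j>=i} S.T_i + E.T_j
    s_ij, i, j = max(((start[a] + end[b]).item(), a, b)
                     for a in range(T.size(0)) for b in range(a, T.size(0)))
    s_null = (S @ C + E @ C).item()          # the [CLS]-anchored "no answer" score
    return (i, j) if s_ij > s_null + tau else None
```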

Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper.

| System | Dev | Test |
| --- | --- | --- |
| ESIM+GloVe | 51.9 | 52.7 |
| ESIM+ELMo | 59.1 | 59.2 |
| OpenAI GPT | - | 78.0 |
| BERTBASE | 81.6 | - |
| BERTLARGE | 86.6 | 86.3 |
| Human (expert)† | - | 85.0 |
| Human (5 annotations)† | - | 88.0 |

表 4: SWAG开发集和测试集准确率。 (†人类表现基于SWAG论文中报告的100个样本测量结果。)

The results compared to prior leader board entries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, excluding systems that use BERT as one of their components. We observe a $+5.1$ F1 improvement over the previous best system.

与之前排行榜条目及顶尖发表成果 (Sun et al., 2018; Wang et al., 2018b) 的对比结果如 表 3 所示 (排除使用 BERT 作为组件的系统)。我们观察到相较于之前最佳系统实现了 $+5.1$ F1 提升。

4.4 SWAG

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018). Given a sentence, the task is to choose the most plausible continuation among four choices.

对抗性生成情境 (SWAG) 数据集包含 113k 个句子对补全示例,用于评估基础常识推理 (Zellers et al., 2018)。给定一个句子,任务是从四个选项中选择最合理的续写。

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameters introduced is a vector whose dot product with the [CLS] token representation $C$ denotes a score for each choice which is normalized with a softmax layer.

在SWAG数据集上进行微调时,我们构建了四个输入序列,每个序列包含给定句子(句子A)与一个可能延续句(句子B)的拼接内容。引入的唯一任务特定参数是一个向量,其与[CLS] token表征$C$的点积表示每个选项的得分,该得分通过softmax层进行归一化。
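Concretely, the four [CLS] vectors can be scored with that single vector and normalized with a softmax; a minimal sketch (parameter names are ours):

```python
import torch
import torch.nn.functional as F

def swag_log_probs(cls_vectors, v):
    """cls_vectors: (4, H) [CLS] representations C for the four (sentence A, candidate) pairs;
    v: (H,) the only new task-specific parameter. Returns log-probabilities over the choices."""
    scores = cls_vectors @ v                 # one dot-product score per candidate
    return F.log_softmax(scores, dim=0)      # train with NLL of the correct choice
```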

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by $+27.1\%$ and OpenAI GPT by $8.3\%$.

我们以2e-5的学习率和16的批量大小对模型进行了3个周期的微调。结果如表4所示。BERTLARGE比作者基线ESIM+ELMo系统高出27.1%,比OpenAI GPT高出8.3%。

5 Ablation Studies

5 消融实验

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional ablation studies can be found in Appendix C.

在本节中,我们对BERT的多个方面进行了消融实验,以更好地理解它们的相对重要性。其他消融研究详见附录C。

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR + No NSP” model during fine-tuning.

| Tasks | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1) |
| --- | --- | --- | --- | --- | --- |
| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

表 5: 使用 BERTBASE 架构对预训练任务进行消融实验(开发集结果)。"No NSP"表示训练时不使用下一句预测任务。"LTR & No NSP"表示像 OpenAI GPT 那样训练为从左到右的语言模型,且不使用下一句预测。"+ BiLSTM"表示在微调时在"LTR + No NSP"模型基础上添加随机初始化的 BiLSTM。

5.1 Effect of Pre-training Tasks

5.1 预训练任务的影响

We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTBASE:

我们通过使用与BERTBASE完全相同的预训练数据、微调方案和超参数评估两个预训练目标,证明了BERT深度双向性的重要性:

No NSP: A bidirectional model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

无NSP:一种双向模型,采用“掩码语言模型”(MLM)训练,但不包含“下一句预测”(NSP)任务。

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.

LTR & No NSP: 一种仅使用标准从左到右 (LTR) 语言模型训练的左向上下文模型,而非掩码语言模型 (MLM)。在微调阶段同样保持左向约束,因为取消该约束会导致预训练/微调不匹配,从而降低下游任务性能。此外,该模型预训练时未使用下一句预测 (NSP) 任务。该模型可直接与 OpenAI GPT 进行对比,但采用了我们更大的训练数据集、输入表示方案及微调策略。

We first examine the impact brought by the NSP task. In Table 5, we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by comparing “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

我们首先研究了 NSP (Next Sentence Prediction) 任务带来的影响。表 5 显示,移除 NSP 会显著降低 QNLI、MNLI 和 SQuAD 1.1 的性能表现。接着,我们通过对比 "No NSP" 和 "LTR & No NSP" 来评估训练双向表示的影响。LTR (Left-to-Right) 模型在所有任务上的表现都逊于 MLM (Masked Language Model) 模型,其中 MRPC 和 SQuAD 的性能下降尤为明显。

For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context. In order to make a good faith attempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pre-trained bidirectional models. The BiLSTM hurts performance on the GLUE tasks.

对于SQuAD数据集,直观上很明显,左到右(LTR)模型在token预测上表现会很差,因为token级别的隐藏状态缺乏右侧上下文。为了真诚地尝试增强LTR系统,我们在顶部添加了一个随机初始化的双向长短时记忆网络(BiLSTM)。这确实显著提升了SQuAD的结果,但仍远逊于预训练双向模型的表现。该BiLSTM结构反而会损害GLUE任务上的性能。

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) this is strictly less powerful than a deep bidirectional model, since it can use both left and right context at every layer.

我们意识到也可以像ELMo那样训练独立的从左到右(LTR)和从右到左(RTL)模型,并将每个token表示为两个模型的串联。但是:(a) 这种方式比单一双向模型成本高一倍;(b) 对于问答(QA)等任务不够直观,因为RTL模型无法基于问题来生成答案;(c) 其能力严格弱于深度双向模型,因为后者能在每一层同时利用左右上下文。

5.2 Effect of Model Size

5.2 模型尺寸的影响

In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.

本节探讨模型规模对微调任务准确率的影响。我们训练了多个不同层数、隐藏单元数和注意力头数的BERT模型,其余超参数和训练流程均与此前描述保持一致。

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is ($L=6$, $H=1024$, $A=16$) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is ($L=64$, $H=512$, $A=2$) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.

在选定的GLUE任务上的结果如表6所示。该表中报告了5次微调随机重启的平均开发集准确率。可以看出,更大的模型在所有四个数据集上都带来了严格的准确率提升,即便是仅有3600个标注训练样本且与预训练任务差异较大的MRPC数据集。另一个令人惊讶的发现是,我们能在已相对现有文献相当庞大的模型基础上取得如此显著的改进。例如,Vaswani等人(2017)探讨的最大Transformer是 ($L=6$, $H=1024$, $A=16$),编码器参数为1亿;而文献中我们发现的最大Transformer是 ($L=64$, $H=512$, $A=2$),参数达2.35亿 (Al-Rfou等人,2018)。相比之下,BERTBASE包含1.1亿参数,BERTLARGE则包含3.4亿参数。

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained. Peters et al. (2018b) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and Melamud et al. (2016) mentioned in passing that increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements. Both of these prior works used a feature-based approach — we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the task-specific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.

众所周知,增加模型规模会持续提升机器翻译和语言建模等大规模任务的性能,表6所示的训练数据LM困惑度便印证了这一点。但我们认为,本研究首次有力证明了:只要模型经过充分预训练,即使扩展到极大模型规模,也能显著提升极小规模任务的性能。Peters等人(2018b)关于将预训练双向语言模型从两层增至四层对下游任务影响的实验得出了不一致的结论,而Melamud等人(2016)曾简要提及将隐藏层维度从200增至600有助提升性能,但继续增至1,000却未带来进一步改善。这两项早期研究均采用基于特征的方法——我们推测,若模型直接在下游任务上进行微调且仅使用极少量随机初始化的附加参数时,即便下游任务数据量极小,任务专用模型仍能受益于更大规模、更具表现力的预训练表征。

5.3 Feature-based Approach with BERT

5.3 基于BERT的特征提取方法

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages. First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation.

目前展示的所有BERT结果都采用了微调(fine-tuning)方法,即在预训练模型上添加简单分类层,并针对下游任务联合微调所有参数。然而,基于特征的方法(从预训练模型中提取固定特征)具有特定优势:首先,并非所有任务都能通过Transformer编码器架构轻松实现,因此需要添加任务特定的模型架构;其次,预计算训练数据的昂贵表示后,可基于该表示运行更轻量模型的多次实验,这将带来显著的计算效率优势。

In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

在本节中,我们通过将BERT应用于CoNLL-2003命名实体识别(NER)任务 (Tjong Kim Sang and De Meulder, 2003) 来比较两种方法。在BERT的输入中,我们使用保留大小写的WordPiece模型,并包含数据提供的最大文档上下文。按照标准做法,我们将其表述为标注任务,但在输出中不使用CRF层。我们使用第一个子token的表示作为NER标签集上token级分类器的输入。
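As a sketch of this feature-based pipeline (our illustration, not the paper's exact setup): the frozen model's layer activations are combined, the vector at each word's first sub-token is selected, and a small token-level classifier is trained on top. The label count and combination options are assumptions for illustration:

```python
import torch
import torch.nn as nn

def word_features(layer_outputs, first_subtoken_index, combine="concat_last_four"):
    """layer_outputs: list of (seq_len, H) tensors, one per Transformer layer (frozen);
    first_subtoken_index: for each word, the position of its first WordPiece token."""
    if combine == "concat_last_four":
        feats = torch.cat(layer_outputs[-4:], dim=-1)        # (seq_len, 4H)
    elif combine == "sum_last_four":
        feats = torch.stack(layer_outputs[-4:]).sum(dim=0)   # (seq_len, H)
    else:
        feats = layer_outputs[-1]                            # last hidden layer only
    idx = torch.tensor(first_subtoken_index)
    return feats[idx]                                        # one feature vector per word

# Hypothetical token-level classifier over the NER label set (no CRF layer),
# assuming 9 BIO labels for CoNLL-2003 (B-/I- for PER, ORG, LOC, MISC, plus O).
classifier = nn.Linear(4 * 768, 9)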

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.

| #L | #H | #A | LM (ppl) | MNLI-m (Dev Acc) | MRPC (Dev Acc) | SST-2 (Dev Acc) |
| --- | --- | --- | --- | --- | --- | --- |
| 3 | 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4 |
| 6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7 |
| 6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3 |
| 12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9 |
| 12 | 1024 | 16 | 3.54 | 85.7 | 86.9 | 93.3 |
| 24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7 |

表 6: BERT模型规模的消融实验。#L = 层数;#H = 隐藏层大小;#A = 注意力头数。"LM (ppl)"表示训练数据遮蔽语言模型的困惑度。

Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

| System | Dev F1 | Test F1 |
| --- | --- | --- |
| ELMo (Peters et al., 2018a) | 95.7 | 92.2 |
| CVT (Clark et al., 2018) | - | 92.6 |
| CSE (Akbik et al., 2018) | - | 93.1 |
| *Fine-tuning approach* | | |
| BERTLARGE | 96.6 | 92.8 |
| BERTBASE | 96.4 | 92.4 |
| *Feature-based approach (BERTBASE)* | | |
| Embeddings | 91.0 | - |
| Second-to-Last Hidden | 95.6 | - |
| Last Hidden | 94.9 | - |
| Weighted Sum Last Four Hidden | 95.9 | - |
| Concat Last Four Hidden | 96.1 | - |
| Weighted Sum All 12 Layers | 95.5 | - |

表 7: CoNLL-2003 命名实体识别结果。超参数通过开发集选定,报告的开发集和测试集分数为使用该超参数进行5次随机重启的平均值。
