[论文翻译]BERT:面向语言理解的深度双向Transformer预训练


原文地址:https://aclanthology.org/N19-1423.pdf


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT:面向语言理解的深度双向Transformer预训练

Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com


Abstract

摘要

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

我们提出了一种名为BERT (Bidirectional Encoder Representations from Transformers) 的新语言表示模型。与近期其他语言表示模型 (Peters et al., 2018a; Radford et al., 2018) 不同,BERT通过在所有层中联合调节左右上下文,从未标记文本中预训练深度双向表示。因此,预训练的BERT模型只需添加一个额外输出层进行微调,即可为问答和语言推理等多种任务创建最先进的模型,而无需对任务特定架构进行重大修改。

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to $80.5\%$ ($7.7\%$ point absolute improvement), MultiNLI accuracy to $86.7\%$ ($4.6\%$ absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

BERT 在概念上简单且实证效果强大。它在11项自然语言处理任务中取得了新的最先进成果,包括将GLUE分数提升至$80.5\%$(绝对提升$7.7\%$)、MultiNLI准确率提升至$86.7\%$(绝对提升$4.6\%$)、SQuAD v1.1问答测试F1值达到93.2(绝对提升1.5分)以及SQuAD v2.0测试F1值达到83.1(绝对提升5.1分)。

1 Introduction

1 引言

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

语言模型预训练已被证明能有效提升多种自然语言处理任务的性能 (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018)。这包括句子级任务如自然语言推理 (Bowman et al., 2015; Williams et al., 2018) 和文本复述 (Dolan and Brockett, 2005)——这类任务通过整体分析句子来预测其关联关系;也包括token级任务如命名实体识别和问答系统——这类任务要求模型在token级别生成细粒度输出 (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016)。

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

将预训练语言表征应用于下游任务的现有策略有两种:基于特征的方法和微调方法。基于特征的方法,如ELMo (Peters et al., 2018a),使用包含预训练表征作为附加特征的特定任务架构。微调方法,如生成式预训练Transformer (OpenAI GPT) (Radford et al., 2018),引入最少的任务特定参数,并通过简单微调所有预训练参数来在下游任务上进行训练。这两种方法在预训练期间共享相同的目标函数,即使用单向语言模型来学习通用语言表征。

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

我们认为当前的技术限制了预训练表征的能力,尤其是微调方法。主要局限在于标准语言模型是单向的,这限制了预训练时可用的架构选择。例如,在OpenAI GPT中,作者采用了从左到右的架构,其中每个token在Transformer (Vaswani et al., 2017) 的自注意力层中只能关注先前的token。这种限制对于句子级任务来说是次优的,并且在将基于微调的方法应用于token级任务(如问答)时可能非常不利,因为双向上下文整合至关重要。

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:

在本文中,我们通过提出BERT(来自Transformer的双向编码器表示)改进了基于微调的方法。BERT通过使用受Cloze任务(Taylor, 1953)启发的"掩码语言模型"(MLM)预训练目标,缓解了之前提到的单向性限制。掩码语言模型随机掩码输入中的部分token,目标是根据上下文预测被掩码单词的原始词汇ID。与从左到右的语言模型预训练不同,MLM目标使表示能够融合左右上下文,这使我们能够预训练一个深度双向Transformer。除了掩码语言模型外,我们还使用"下一句预测"任务来联合预训练文本对表示。本文的贡献如下:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• 我们证明了双向预训练对语言表征的重要性。与Radford等人(2018)使用单向语言模型进行预训练不同,BERT采用掩码语言模型来实现预训练的深度双向表征。这也与Peters等人(2018a)形成对比,后者仅对独立训练的从左到右和从右到左语言模型进行浅层拼接。

• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

• 我们证明预训练表征能减少对大量人工设计的任务特定架构的需求。BERT是首个基于微调的表征模型,在句子级和token级任务套件中实现最先进性能,超越了许多任务特定架构。

• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/google-research/bert.

• BERT将11项NLP任务的最新技术水平向前推进。代码和预训练模型可在 https://github.com/google-research/bert 获取。

2 Related Work

2 相关工作

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section.

预训练通用语言表征有着悠久的历史,本节将简要回顾最广泛使用的方法。

2.1 Unsupervised Feature-based Approaches

2.1 基于特征的无监督方法

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pretrain word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

学习广泛适用的词表示法数十年来一直是活跃的研究领域,包括非神经网络方法 (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) 和神经网络方法 (Mikolov et al., 2013; Pennington et al., 2014)。预训练词嵌入是现代自然语言处理系统的核心组成部分,相比从头开始学习的嵌入有显著提升 (Turian et al., 2010)。为预训练词嵌入向量,研究者既采用了从左到右的语言建模目标 (Mnih and Hinton, 2009),也使用了区分左右上下文中正确与错误单词的目标 (Mikolov et al., 2013)。

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence (Kiros et al., 2015), or denoising autoencoder derived objectives (Hill et al., 2016).

这些方法已被推广到更粗粒度的单元,例如句子嵌入 (sentence embeddings) (Kiros et al., 2015; Logeswaran and Lee, 2018) 或段落嵌入 (paragraph embeddings) (Le and Mikolov, 2014)。为训练句子表征,先前研究采用了以下目标:对候选下一句子进行排序 (Jernite et al., 2017; Logeswaran and Lee, 2018),基于前句表征从左到右生成下一句单词 (Kiros et al., 2015),或采用去噪自编码器衍生的目标函数 (Hill et al., 2016)。

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

ELMo及其前身 (Peters et al., 2017, 2018a) 从不同维度拓展了传统词嵌入研究。该方法通过从左到右和从右到左的语言模型提取上下文敏感特征,每个token的上下文表示是左右双向表示的拼接。当将上下文词嵌入与现有任务特定架构结合时,ELMo在多项重要NLP基准测试中实现了技术突破 (Peters et al., 2018a) ,包括问答系统 (Rajpurkar et al., 2016) 、情感分析 (Socher et al., 2013) 和命名实体识别 (Tjong Kim Sang and De Meulder, 2003) 。Melamud等人 (2016) 提出通过一项使用LSTM基于左右上下文预测单个单词的任务来学习上下文表示,其模型与ELMo类似,都是基于特征且非深度双向的。Fedus等人 (2018) 证明完形填空任务可用于提升文本生成模型的鲁棒性。

2.2 Unsupervised Fine-tuning Approaches

2.2 无监督微调方法

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008).

与基于特征的方法类似,这一方向的首批研究仅从无标注文本中预训练词嵌入参数 (Collobert and Weston, 2008)。

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

最近,能够生成上下文token表示的句子或文档编码器已通过无标注文本进行预训练,并针对有监督的下游任务进行微调 (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018)。这类方法的优势在于只需从头学习少量参数。至少部分得益于这一优势,OpenAI GPT (Radford et al., 2018) 在GLUE基准测试 (Wang et al., 2018a) 的许多句子级任务中取得了当时最先进的结果。此类模型的预训练采用了从左到右的语言建模和自编码器目标 (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015)。


Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

图 1: BERT的整体预训练与微调流程。除输出层外,预训练和微调阶段使用相同的架构结构。相同的预训练模型参数被用于初始化不同下游任务的模型。微调过程中所有参数都会被调整。[CLS]是添加在每个输入样本前的特殊符号,[SEP]是特殊分隔符token (例如分隔问题/答案)。

2.3 Transfer Learning from Supervised Data

2.3 基于监督数据的迁移学习

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

已有研究表明,从大规模监督任务(如自然语言推理 (Conneau et al., 2017) 和机器翻译 (McCann et al., 2017))中能实现有效迁移。计算机视觉研究同样证实了基于大型预训练模型进行迁移学习的重要性,其中经典方法是微调基于ImageNet (Deng et al., 2009; Yosinski et al., 2014) 预训练的模型。

3 BERT

3 BERT

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section.

本节介绍BERT及其具体实现。我们的框架包含两个步骤:预训练(pre-training)和微调(fine-tuning)。预训练阶段,模型通过不同预训练任务在无标注数据上进行训练。微调阶段,BERT模型首先加载预训练参数进行初始化,然后使用下游任务的标注数据对所有参数进行微调。尽管不同下游任务使用相同的预训练参数初始化,但每个任务都有独立的微调模型。图1中的问答示例将作为本节的运行案例。

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

BERT的一个显著特点是其跨任务统一的架构。预训练架构与最终下游架构之间的差异极小。

Model Architecture BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.1 Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”2

模型架构
BERT的模型架构是一个基于Vaswani等人(2017)原始实现的多层双向Transformer编码器,并发布于tensor2tensor库。由于Transformer的使用已变得普遍,且我们的实现与原始版本几乎相同,因此我们将省略对模型架构的详尽背景描述,建议读者参考Vaswani等人(2017)以及《The Annotated Transformer》等优秀指南。

In this work, we denote the number of layers (i.e., Transformer blocks) as $L$, the hidden size as $H$, and the number of self-attention heads as $A$.3 We primarily report results on two model sizes: BERTBASE ($L=12$, $H=768$, $A=12$, Total Parameters $=$ 110M) and BERTLARGE ($L=24$, $H=1024$, $A=16$, Total Parameters $=$ 340M).

在本工作中,我们将层数(即Transformer块数)表示为$L$,隐藏层大小表示为$H$,自注意力头数表示为$A$。我们主要报告两种模型尺寸的结果:BERTBASE($L=12$,$H=768$,$A=12$,总参数量110M)以及BERTLARGE($L=24$,$H=1024$,$A=16$,总参数量340M)。
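
As a rough sanity check on these totals, the parameter count is determined by $L$ and $H$ alone (the head count $A$ does not change it, since each head has dimension $H/A$). A minimal back-of-the-envelope sketch, assuming the standard Transformer layout and BERT's 30,522-token WordPiece vocabulary, with biases and LayerNorm weights ignored:

```python
# Approximate BERT parameter count from (L, H). Biases and LayerNorm are
# ignored, so the totals land slightly below the reported 110M / 340M.
def approx_params(L, H, vocab=30522, max_pos=512, type_vocab=2):
    embeddings = (vocab + max_pos + type_vocab) * H  # token + position + segment
    attention = 4 * H * H                            # Q, K, V and output projections
    ffn = 2 * H * (4 * H)                            # two dense layers, inner size 4H
    return embeddings + L * (attention + ffn)

print(f"BERT-Base:  ~{approx_params(12, 768) / 1e6:.0f}M parameters")   # ~109M
print(f"BERT-Large: ~{approx_params(24, 1024) / 1e6:.0f}M parameters")  # ~334M
```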

BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.4

为了便于比较,BERTBASE 的模型大小被设定为与 OpenAI GPT 相同。但关键在于,BERT 的 Transformer 采用了双向自注意力机制,而 GPT 的 Transformer 使用的是受限自注意力机制,即每个 Token 只能关注其左侧的上下文。4

Input/Output Representations To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., ⟨Question, Answer⟩) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

输入/输出表示

为了使BERT能够处理各种下游任务,我们的输入表示能够在一个token序列中明确表示单个句子和句子对(如问题、答案)。在本研究中,"句子"可以是任意连续的文本片段,而非实际的语言学句子。"序列"指的是BERT的输入token序列,可以是一个句子或两个打包在一起的句子。

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as $E$, the final hidden vector of the special [CLS] token as $C\in\mathbb{R}^{H}$, and the final hidden vector for the $i^{\mathrm{th}}$ input token as $T_ {i}\in\mathbb{R}^{H}$.

我们采用30,000个token词汇表的WordPiece嵌入 (Wu et al., 2016)。每个序列的首个token始终是特殊分类标记 ([CLS])。该标记对应的最终隐藏状态被用作分类任务的聚合序列表示。句子对会被合并为单个序列,我们通过两种方式区分句子:首先用特殊分隔符 ([SEP]) 隔开,其次为每个token添加可学习的嵌入标识,用于区分其属于句子A还是句子B。如图1所示,输入嵌入记为$E$,特殊[CLS]标记的最终隐藏向量记为$C\in\mathbb{R}^{H}$,第 $i$ 个输入token的最终隐藏向量记为$T_ {i}\in\mathbb{R}^{H}$。

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

对于给定的Token,其输入表示通过将对应的Token嵌入、片段嵌入和位置嵌入相加构建而成。这一构建过程的可视化展示见图2。
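
A minimal sketch of this packing and summation, with small random tables standing in for the learned embeddings (the ids, vocabulary size, and hidden size below are toy placeholders, not BERT's actual WordPiece vocabulary):

```python
import numpy as np

H, VOCAB, MAX_LEN = 8, 16, 32                 # toy sizes for illustration
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(VOCAB, H))       # one row per token id
segment_emb = rng.normal(size=(2, H))         # sentence A vs. sentence B
position_emb = rng.normal(size=(MAX_LEN, H))  # learned absolute positions

def pack_pair(tokens_a, tokens_b, cls_id=0, sep_id=1):
    """Pack two 'sentences' as [CLS] A [SEP] B [SEP], with segment ids."""
    ids = [cls_id] + tokens_a + [sep_id] + tokens_b + [sep_id]
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return ids, segments

ids, segments = pack_pair([5, 7, 9], [4, 11])
# Input representation = token + segment + position embeddings, as in Figure 2.
x = token_emb[ids] + segment_emb[segments] + position_emb[: len(ids)]
print(x.shape)  # (8, 8): sequence length x hidden size
```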

3.1 Pre-training BERT

3.1 BERT预训练

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

与 Peters 等人 (2018a) 和 Radford 等人 (2018) 不同,我们并未使用传统的从左到右或从右到左语言模型来预训练 BERT,而是采用本节所述的两个无监督任务进行预训练。该步骤展示在图 1 左侧部分。

Task #1: Masked LM Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

任务 #1: 掩码语言模型 (Masked LM)
直观上可以认为,深度双向模型严格优于从左到右模型或左右向模型的浅层拼接。但标准条件语言模型只能进行从左到右或从右到左训练,因为双向条件作用会使每个词间接"看到自己",导致模型在多层次上下文中能轻易预测目标词。

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask $15\%$ of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

为了训练深度双向表征,我们简单地随机遮蔽一定比例的输入token,然后预测这些被遮蔽的token。我们将此过程称为"遮蔽语言模型 (masked LM, MLM)",尽管在文献(Taylor, 1953)中它常被称为完形填空任务(Cloze task)。这种情况下,与标准语言模型类似,对应于遮蔽token的最终隐藏向量会被输入到词汇表上的输出softmax层。在所有实验中,我们随机遮蔽每个序列中15%的WordPiece token。与去噪自编码器(Vincent et al., 2008)不同,我们仅预测被遮蔽的单词而非重建整个输入。

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses $15\%$ of the token positions at random for prediction. If the $i$-th token is chosen, we replace the $i$-th token with (1) the [MASK] token $80\%$ of the time (2) a random token $10\%$ of the time (3) the unchanged $i$-th token $10\%$ of the time. Then, $T_ {i}$ will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.

虽然这种方法能让我们获得双向预训练模型,但一个缺点是会造成预训练与微调阶段的不匹配,因为微调时不会出现 [MASK] token。为缓解这一问题,我们并不总是用真实的 [MASK] token 替换被遮蔽词。训练数据生成器会随机选择 $15\%$ 的 token 位置进行预测:若选中第 $i$ 个 token,则以 (1) $80\%$ 概率替换为 [MASK] token (2) $10\%$ 概率替换为随机 token (3) $10\%$ 概率保持原 token。随后 $T_ {i}$ 将通过交叉熵损失来预测原始 token。我们在附录 C.2 中对比了该流程的多种变体。
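
A minimal sketch of this 80/10/10 corruption over token ids (the [MASK] id and vocabulary size here are placeholders):

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522  # placeholder special-token id and vocab size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted ids, labels); labels are -1 where nothing is predicted."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-1] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                        # position not chosen for prediction
        labels[i] = tok                     # T_i predicts this id via cross entropy
        r = rng.random()
        if r < 0.8:                         # 80%: replace with [MASK]
            inputs[i] = MASK_ID
        elif r < 0.9:                       # 10%: replace with a random token
            inputs[i] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([2023, 2003, 1037, 7099, 6251]))
```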

Task #2: Next Sentence Prediction (NSP) Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, $50\%$ of the time B is the actual next sentence that follows A (labeled as IsNext), and $50\%$ of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, $C$ is used for next sentence prediction (NSP).5 Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.6

任务 #2: 下一句预测 (NSP)
许多重要下游任务 (如问答系统 (QA) 和自然语言推理 (NLI) ) 都基于对两个句子关系的理解,而语言建模无法直接捕捉这种关系。为了训练能够理解句子关系的模型,我们针对一个二值化的下一句预测任务进行预训练,该任务可直接从任何单语语料库生成。具体而言,在为每个预训练样本选择句子A和B时,50%的概率B是实际接在A后的下一句 (标记为IsNext) ,50%的概率它是语料库中的随机句子 (标记为NotNext) 。如图1所示,C被用于下一句预测 (NSP) 。尽管该任务简单,但我们在5.1节证明针对该任务的预训练对QA和NLI都非常有益。
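
A minimal sketch of how such IsNext/NotNext pairs could be drawn from a document-level corpus (the toy corpus and helper below are illustrative, not the paper's actual data pipeline):

```python
import random

def make_nsp_examples(documents, rng=random.Random(0)):
    """Yield (sentence_a, sentence_b, is_next) from documents, where each
    document is a list of sentences: 50% IsNext, 50% NotNext."""
    for idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                yield doc[i], doc[i + 1], True             # IsNext
            else:
                other = rng.choice(documents[:idx] + documents[idx + 1:])
                yield doc[i], rng.choice(other), False     # NotNext

docs = [["a1", "a2", "a3"], ["b1", "b2"]]
for a, b, is_next in make_nsp_examples(docs):
    print(a, "->", b, "IsNext" if is_next else "NotNext")
```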


Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.

图 2: BERT输入表示。输入嵌入是token嵌入、分段嵌入和位置嵌入的总和。

The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, where BERT transfers all parameters to initialize end-task model parameters.

NSP任务与Jernite等人(2017)以及Logeswaran和Lee(2018)使用的表征学习目标密切相关。然而在先前工作中,只有句子嵌入会被迁移到下游任务,而BERT会迁移所有参数来初始化终端任务模型参数。

Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

预训练数据
预训练流程基本遵循现有语言模型预训练文献。预训练语料采用Books Corpus (8亿词量) (Zhu et al., 2015) 和英文维基百科 (25亿词量)。对于维基百科,我们仅提取文本段落并忽略列表、表格和标题。使用文档级语料而非乱序的句子级语料(如Billion Word Benchmark (Chelba et al., 2013))对提取长连续序列至关重要。

3.2 Fine-tuning BERT

3.2 微调 BERT

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks— whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

微调过程非常直接,因为Transformer中的自注意力机制使BERT能够通过替换适当的输入和输出来建模许多下游任务——无论涉及单个文本还是文本对。对于涉及文本对的应用,常见模式是在应用双向交叉注意力之前独立编码文本对,例如Parikh等人(2016);Seo等人(2017)。而BERT使用自注意力机制统一了这两个阶段,因为用自注意力编码连接的文本对实际上包含了两个句子之间的双向交叉注意力。

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-∅ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

对于每项任务,我们只需将任务特定的输入和输出插入BERT模型,并端到端地微调所有参数。在输入阶段,预训练中的句子A和句子B分别对应:(1) 释义任务中的句子对,(2) 蕴含任务中的假设-前提对,(3) 问答任务中的问题-段落对,以及(4) 文本分类或序列标注任务中退化的文本-∅对。在输出阶段,Token表征会被输入到输出层以处理Token级别的任务(如序列标注或问答),而[CLS]表征则被输入到输出层以进行分类任务(如蕴含分析或情感分析)。

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model.7 We describe the task-specific details in the corresponding subsections of Section 4. More details can be found in Appendix A.5.

与预训练相比,微调成本相对较低。论文中的所有结果都能在单个Cloud TPU上至多1小时内复现,或在GPU上花费数小时完成(均基于完全相同的预训练模型)[7]。具体任务细节详见第4节对应子章节,更多信息可参阅附录A.5。

4 Experiments

4 实验

In this section, we present BERT fine-tuning results on 11 NLP tasks.

在本节中,我们展示了BERT在11项NLP任务上的微调结果。

4.1 GLUE

4.1 GLUE

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1.

通用语言理解评估 (GLUE) 基准 (Wang et al., 2018a) 是一个包含多样化自然语言理解任务的集合。GLUE 数据集的详细描述见附录 B.1。

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector $C\in\mathbb{R}^{H}$ corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights $W\in\mathbb{R}^{K\times H}$, where $K$ is the number of labels. We compute a standard classification loss with $C$ and $W$, i.e., $\log(\mathrm{softmax}(C W^{T}))$.

在GLUE上进行微调时,我们按照第3节所述表示输入序列(单句或句对),并使用与首个输入token ([CLS]) 对应的最终隐藏向量 $C\in\mathbb{R}^{H}$ 作为聚合表征。微调过程中引入的唯一新参数是分类层权重 $W\in\mathbb{R}^{K\times H}$,其中 $K$ 为标签数量。我们通过 $C$ 和 $W$ 计算标准分类损失,即 $\log(\mathrm{softmax}(C W^{T}))$。
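
A minimal PyTorch sketch of this head, with a random tensor standing in for the [CLS] vector $C$ from the encoder (the released BERT code is TensorFlow; this re-expression is illustrative only):

```python
import torch
import torch.nn.functional as F

H, K = 768, 3                              # hidden size; number of labels (e.g. MNLI)
W = torch.randn(K, H, requires_grad=True)  # the only new fine-tuning parameters

def classification_loss(C, labels):
    """C: [batch, H] final [CLS] hidden vectors; standard softmax loss over K labels."""
    logits = C @ W.T                         # [batch, K]
    return F.cross_entropy(logits, labels)   # equals -log softmax(C W^T)[label]

C = torch.randn(4, H)                        # stand-in for BERT's [CLS] outputs
loss = classification_loss(C, torch.tensor([0, 2, 1, 0]))
loss.backward()
print(loss.item())
```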

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set.8 BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

| System | MNLI-(m/mm) 392k | QQP 363k | QNLI 108k | SST-2 67k | CoLA 8.5k | STS-B 5.7k | MRPC 3.5k | RTE 2.5k | Average |
|---|---|---|---|---|---|---|---|---|---|
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |

表 1: GLUE 测试结果,由评估服务器 (https://gluebenchmark.com/leaderboard) 评分。每个任务下方的数字表示训练样本数量。"Average"列与官方 GLUE 分数略有不同,因为我们排除了有问题的 WNLI 集。8 BERT 和 OpenAI GPT 是单模型、单任务。QQP 和 MRPC 报告 F1 分数,STS-B 报告 Spearman 相关性,其他任务报告准确率分数。我们排除了使用 BERT 作为组件的条目。

| System | MNLI-(m/mm) 392k | QQP 363k | QNLI 108k | SST-2 67k | CoLA 8.5k | STS-B 5.7k | MRPC 3.5k | RTE 2.5k | Average |
|---|---|---|---|---|---|---|---|---|---|
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.9

我们使用32的批量大小,对所有GLUE任务数据进行3个周期的微调。针对每项任务,我们在开发集上选择了最佳微调学习率(从5e-5、4e-5、3e-5和2e-5中选取)。此外,对于BERTLARGE,我们发现小数据集上的微调有时不稳定,因此进行了多次随机重启,并选择开发集上表现最佳的模型。随机重启时,我们使用相同的预训练检查点,但采用不同的微调数据洗牌和分类器层初始化方式。9

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining $4.5\%$ and $7.0\%$ respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a $4.6\%$ absolute accuracy improvement. On the official GLUE leaderboard10, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

结果如表1所示。BERTBASE和BERTLARGE在所有任务上都以显著优势超越所有系统,分别比之前的最先进水平平均准确率提升了4.5%和7.0%。需要注意的是,除了注意力掩码机制外,BERTBASE与OpenAI GPT的模型架构几乎完全相同。在GLUE基准中规模最大、报道最广的MNLI任务上,BERT实现了4.6%的绝对准确率提升。根据GLUE官方排行榜[10]的当前数据,BERTLARGE获得80.5分,而OpenAI GPT得分为72.8。

We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

我们发现BERTLARGE在所有任务上都显著优于BERTBASE,尤其是在训练数据极少的任务中。模型规模的影响将在5.2节进行更深入的探讨。

4.2 SQuAD v1.1

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of $100\mathrm{k}$ crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from

斯坦福问答数据集 (SQuAD v1.1) 是一个包含 10 万条众包问答对的数据集 (Rajpurkar et al., 2016)。给定一个问题和一个来自

Wikipedia containing the answer, the task is to predict the answer text span in the passage.

维基百科、包含答案的段落,任务是预测该段落中答案的文本范围。

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector $S\in\mathbb{R}^{H}$ and an end vector $E\in\mathbb{R}^{H}$ during fine-tuning. The probability of word $i$ being the start of the answer span is computed as a dot product between $T_ {i}$ and $S$ followed by a softmax over all of the words in the paragraph: $P_ {i}=\frac{e^{S\cdot T_ {i}}}{\sum_ {j}e^{S\cdot T_ {j}}}$. The analogous formula is used for the end of the answer span. The score of a candidate span from position $i$ to position $j$ is defined as $S\cdot T_ {i}+E\cdot T_ {j}$, and the maximum scoring span where $j\geq i$ is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

如图 1 所示,在问答任务中,我们将输入问题和段落表示为单个打包序列,其中问题使用 A 嵌入,段落使用 B 嵌入。微调期间仅引入起始向量 $S\in\mathbb{R}^{H}$ 和结束向量 $E\in\mathbb{R}^{H}$。词 $i$ 作为答案跨度起始位置的概率通过 $T_ {i}$ 与 $S$ 的点积计算,并对段落中所有词进行 softmax 归一化:$P_ {i}=\frac{e^{S\cdot T_ {i}}}{\sum_ {j}e^{S\cdot T_ {j}}}$。答案跨度结束位置的计算公式同理。候选跨度从位置 $i$ 到位置 $j$ 的得分定义为 $S\cdot T_ {i}+E\cdot T_ {j}$,最终预测取满足 $j\geq i$ 条件的最高得分跨度。训练目标为正确起止位置对数似然之和。我们采用学习率 5e-5、批量大小 32 进行 3 个周期的微调。
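
A minimal NumPy sketch of this span scoring, with random vectors standing in for the encoder outputs $T_i$ and the learned $S$ and $E$ (toy dimensions):

```python
import numpy as np

H, seq_len = 8, 20                   # toy hidden size and passage length
rng = np.random.default_rng(0)
T = rng.normal(size=(seq_len, H))    # final hidden vectors T_i from the encoder
S, E = rng.normal(size=H), rng.normal(size=H)  # start / end vectors

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

start_probs = softmax(T @ S)         # P_i = exp(S.T_i) / sum_j exp(S.T_j)
end_probs = softmax(T @ E)           # analogous distribution for the span end

# Candidate span score is S.T_i + E.T_j; predict the best span with j >= i.
scores = (T @ S)[:, None] + (T @ E)[None, :]
scores[np.tril_indices(seq_len, k=-1)] = -np.inf  # forbid spans with j < i
i, j = np.unravel_index(scores.argmax(), scores.shape)
print(i, j, scores[i, j])
```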

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available,11 and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

表 2 展示了排行榜前列的参赛系统以及已发表顶级系统的结果 (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018)。SQuAD 排行榜的顶尖结果目前没有最新的公开系统描述,且允许在系统训练时使用任何公开数据。因此我们在系统中采用了适度的数据增强策略:先在 TriviaQA (Joshi et al., 2017) 上进行微调,再对 SQuAD 进行微调。

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.12

我们表现最佳的系统在集成设置下比排行榜上的顶级系统高出 +1.5 F1,作为单一系统则高出 +1.3 F1。事实上,我们的单一 BERT 模型在 F1 分数上已经超越了顶级集成系统。即便不使用 TriviaQA 微调数据,我们也仅损失 0.1-0.4 F1,仍以显著优势超越所有现有系统。12

Table 2: SQuAD 1.1 results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.

| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Top Leaderboard Systems (Dec 10th, 2018) | | | | |
| Human | - | - | 82.3 | 91.2 |
| #1 Ensemble - nlnet | - | - | 86.0 | 91.7 |
| #2 Ensemble - QANet | - | - | 84.5 | 90.5 |
| Published | | | | |
| BiDAF+ELMo (Single) | - | 85.6 | - | 85.8 |
| R.M. Reader (Ensemble) | 81.2 | 87.9 | 82.3 | 88.5 |
| Ours | | | | |
| BERTBASE (Single) | 80.8 | 88.5 | - | - |
| BERTLARGE (Single) | 84.1 | 90.9 | - | - |
| BERTLARGE (Ensemble) | 85.8 | 91.8 | - | - |
| BERTLARGE (Sgl.+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8 |
| BERTLARGE (Ens.+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2 |

表 2: SQuAD 1.1 结果。BERT 集成模型由 7 个使用不同预训练检查点和微调种子的系统组成。

| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Top Leaderboard Systems (Dec 10th, 2018) | | | | |
| Human | - | - | 82.3 | 91.2 |
| #1 Ensemble - nlnet | - | - | 86.0 | 91.7 |
| #2 Ensemble - QANet | - | - | 84.5 | 90.5 |
| Published | | | | |
| BiDAF+ELMo (Single) | - | 85.6 | - | 85.8 |
| R.M. Reader (Ensemble) | 81.2 | 87.9 | 82.3 | 88.5 |
| Ours | | | | |
| BERTBASE (Single) | 80.8 | 88.5 | - | - |
| BERTLARGE (Single) | 84.1 | 90.9 | - | - |
| BERTLARGE (Ensemble) | 85.8 | 91.8 | - | - |
| BERTLARGE (Sgl.+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8 |
| BERTLARGE (Ens.+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2 |

Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.

| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Top Leaderboard Systems (Dec 10th, 2018) | | | | |
| Human | 86.3 | 89.0 | 86.9 | 89.5 |
| #1 Single - MIR-MRC (F-Net) | - | - | 74.8 | 78.0 |
| #2 Single - nlnet | - | - | 74.2 | 77.1 |
| Published | | | | |
| unet (Ensemble) | - | - | 71.4 | 74.9 |
| SLQA+ (Single) | - | - | 71.4 | 74.4 |
| Ours | | | | |
| BERTLARGE (Single) | 78.7 | 81.9 | 80.0 | 83.1 |

表 3: SQuAD 2.0 结果。我们排除了使用 BERT 作为组件的条目。

| System | Dev EM | Dev F1 | Test EM | Test F1 |
|---|---|---|---|---|
| Top Leaderboard Systems (Dec 10th, 2018) | | | | |
| Human | 86.3 | 89.0 | 86.9 | 89.5 |
| #1 Single - MIR-MRC (F-Net) | - | - | 74.8 | 78.0 |
| #2 Single - nlnet | - | - | 74.2 | 77.1 |
| Published | | | | |
| unet (Ensemble) | - | - | 71.4 | 74.9 |
| SLQA+ (Single) | - | - | 71.4 | 74.4 |
| Ours | | | | |
| BERTLARGE (Single) | 78.7 | 81.9 | 80.0 | 83.1 |

4.3 SQuAD v2.0

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

SQuAD 2.0任务通过允许所提供段落中可能不存在简短答案,扩展了SQuAD 1.1的问题定义,使问题更加现实。

We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span, $s_ {\text{null}}=S\cdot C+E\cdot C$, to the score of the best non-null span, $\hat{s}_ {i,j}=\max_ {j\geq i}S\cdot T_ {i}+E\cdot T_ {j}$. We predict a non-null answer when $\hat{s}_ {i,j}>s_ {\text{null}}+\tau$, where the threshold $\tau$ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.

我们采用一种简单的方法来扩展SQuAD v1.1 BERT模型以完成此任务。对于没有答案的问题,我们将其视为答案跨度起始和结束位置均为[CLS] token的情况。答案跨度起始和结束位置的概率空间被扩展以包含[CLS] token的位置。在预测时,我们将无答案跨度的得分 $s_ {\text{null}}=S\cdot C+E\cdot C$ 与最佳非空跨度得分 $\hat{s}_ {i,j}=\max_ {j\geq i}S\cdot T_ {i}+E\cdot T_ {j}$ 进行比较。当 $\hat{s}_ {i,j}>s_ {\text{null}}+\tau$ 时,我们预测为非空答案,其中阈值 $\tau$ 在开发集上选择以最大化F1值。该模型未使用TriviaQA数据。我们以5e-5的学习率和48的批量大小进行了2个epoch的微调。
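
Building on the span scoring above, a minimal sketch of the null-versus-span decision (toy dimensions; in practice $\tau$ is tuned on the dev set):

```python
import numpy as np

def predict_v2(T, S, E, C, tau):
    """T: [seq_len, H] token vectors; C: [H] final [CLS] vector.
    Return (i, j) for a non-null answer, or None for no-answer."""
    s_null = S @ C + E @ C                            # score of the no-answer span
    scores = (T @ S)[:, None] + (T @ E)[None, :]
    scores[np.tril_indices(len(T), k=-1)] = -np.inf   # keep only spans with j >= i
    i, j = np.unravel_index(scores.argmax(), scores.shape)
    if scores[i, j] > s_null + tau:                   # threshold chosen to maximize F1
        return int(i), int(j)
    return None

rng = np.random.default_rng(0)
T, S, E, C = rng.normal(size=(20, 8)), rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
print(predict_v2(T, S, E, C, tau=0.0))
```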

Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper.

| System | Dev | Test |
|---|---|---|
| ESIM+GloVe | 51.9 | 52.7 |
| ESIM+ELMo | 59.1 | 59.2 |
| OpenAI GPT | - | 78.0 |
| BERTBASE | 81.6 | - |
| BERTLARGE | 86.6 | 86.3 |
| Human (expert)† | - | 85.0 |
| Human (5 annotations)† | - | 88.0 |

表 4: SWAG开发集和测试集准确率。 (†人类表现基于SWAG论文中报告的100个样本测量结果。)

| 系统 | 开发集 | 测试集 |
|---|---|---|
| ESIM+GloVe | 51.9 | 52.7 |
| ESIM+ELMo | 59.1 | 59.2 |
| OpenAI GPT | - | 78.0 |
| BERTBASE | 81.6 | - |
| BERTLARGE | 86.6 | 86.3 |
| Human (expert)† | - | 85.0 |
| Human (5 annotations)† | - | 88.0 |

The results compared to prior leaderboard entries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, excluding systems that use BERT as one of their components. We observe a +5.1 F1 improvement over the previous best system.

与之前排行榜条目及顶尖发表成果 (Sun et al., 2018; Wang et al., 2018b) 的对比结果如 表 3 所示 (排除使用 BERT 作为组件的系统)。我们观察到相较于之前最佳系统实现了 $+5.1$ F1 提升。

4.4 SWAG

4.4 SWAG

The Situations With Adversarial Generations (SWAG) dataset contains 113k sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018). Given a sentence, the task is to choose the most plausible continuation among four choices.

对抗性生成情境 (SWAG) 数据集包含 113k 个句子对补全示例,用于评估基础常识推理 (Zellers et al., 2018)。给定一个句子,任务是从四个选项中选择最合理的续写。

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation $C$ denotes a score for each choice which is normalized with a softmax layer.

在SWAG数据集上进行微调时,我们构建了四个输入序列,每个序列包含给定句子(句子A)与一个可能延续句(句子B)的拼接内容。引入的唯一任务特定参数是一个向量,其与[CLS] token表征$C$的点积表示每个选项的得分,该得分通过softmax层进行归一化。
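
A minimal PyTorch sketch of this choice-scoring head, with random tensors standing in for the [CLS] vectors of the four encoded sequences:

```python
import torch
import torch.nn.functional as F

H, num_choices = 768, 4
v = torch.randn(H, requires_grad=True)  # the only task-specific parameter vector

def choice_loss(C, labels):
    """C: [batch, num_choices, H], the [CLS] vector for each (sentence A, choice B)
    pair. Each choice's score is its dot product with v, normalized by softmax."""
    logits = C @ v                       # [batch, num_choices]
    return F.cross_entropy(logits, labels)

C = torch.randn(2, num_choices, H)       # stand-ins for four encoded sequences
loss = choice_loss(C, torch.tensor([1, 3]))
loss.backward()
print(loss.item())
```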

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by $+27.1\%$ and OpenAI GPT by $8.3\%$.

我们以2e-5的学习率和16的批量大小对模型进行了3个周期的微调。结果如表4所示。BERTLARGE比作者的基线系统ESIM+ELMo高出 $+27.1\%$,比OpenAI GPT高出 $8.3\%$。

5 Ablation Studies

5 消融实验

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional

在本节中,我们对BERT的多个方面进行了消融实验,以更好地理解它们的相对重要性。

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

| Tasks (Dev Set) | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1) |
|---|---|---|---|---|---|
| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

表 5: 使用 BERTBASE 架构对预训练任务进行消融实验。"No NSP"表示训练时不使用下一句预测任务。"LTR & No NSP"表示像 OpenAI GPT 那样训练为从左到右的语言模型,且不使用下一句预测。"+ BiLSTM"表示在微调时在"LTR & No NSP"模型基础上添加随机初始化的 BiLSTM。

| 任务 (开发集) | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1) |
|---|---|---|---|---|---|
| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

ablation studies can be found in Appendix C.

消融研究详见附录C。

5.1 Effect of Pre-training Tasks

5.1 预训练任务的影响

We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTBASE:

我们通过使用与BERTBASE完全相同的预训练数据、微调方案和超参数评估两个预训练目标,证明了BERT深度双向性的重要性:

No NSP: A bidirectional model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

无NSP:一种双向模型,采用“掩码语言模型”(MLM)训练,但不包含“下一句预测”(NSP)任务。

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.

LTR & No NSP: 一种仅使用标准从左到右 (LTR) 语言模型训练的左向上下文模型,而非掩码语言模型 (MLM)。在微调阶段同样保持左向约束,因为取消该约束会导致预训练/微调不匹配,从而降低下游任务性能。此外,该模型预训练时未使用下一句预测 (NSP) 任务。该模型可直接与 OpenAI GPT 进行对比,但采用了我们更大的训练数据集、输入表示方案及微调策略。

We first examine the impact brought by the NSP task. In Table 5, we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by comparing “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

我们首先研究了 NSP (Next Sentence Prediction) 任务带来的影响。表 5 显示,移除 NSP 会显著降低 QNLI、MNLI 和 SQuAD 1.1 的性能表现。接着,我们通过对比 "No NSP" 和 "LTR & No NSP" 来评估训练双向表示的影响。LTR (Left-to-Right) 模型在所有任务上的表现都逊于 MLM (Masked Language Model) 模型,其中 MRPC 和 SQuAD 的性能下降尤为明显。

For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no right-side context. In order to make a good faith attempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pretrained bidirectional models. The BiLSTM hurts performance on the GLUE tasks.

对于SQuAD数据集,直观上很明显,左到右(LTR)模型在token预测上表现会很差,因为token级别的隐藏状态缺乏右侧上下文。为了真诚地尝试增强LTR系统,我们在顶部添加了一个随机初始化的双向长短时记忆网络(BiLSTM)。这确实显著提升了SQuAD的结果,但仍远逊于预训练双向模型的表现。该BiLSTM结构反而会损害GLUE任务上的性能。

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) it is strictly less powerful than a deep bidirectional model, which can use both left and right context at every layer.

我们意识到也可以像ELMo那样训练独立的从左到右(LTR)和从右到左(RTL)模型,并将每个token表示为两个模型的串联。但是:(a) 这种方式比单一双向模型成本高一倍;(b) 对于问答(QA)等任务不够直观,因为RTL模型无法基于问题来生成答案;(c) 其能力严格弱于深度双向模型,因为后者能在每一层同时利用左右上下文。

5.2 Effect of Model Size

5.2 模型尺寸的影响

In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyper parameters and training procedure as described previously.

本节探讨模型规模对微调任务准确率的影响。我们训练了多个不同层数、隐藏单元数和注意力头数的BERT模型,其余超参数和训练流程均与此前描述保持一致。

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is ($L=6$, $H=1024$, $A=16$) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is ($L=64$, $H=512$, $A=2$) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.

在选定的GLUE任务上的结果如表6所示。该表中报告了5次微调随机重启的平均开发集准确率。可以看出,更大的模型在所有四个数据集上都带来了严格的准确率提升,即便是仅有3600个标注训练样本且与预训练任务差异较大的MRPC数据集。另一个令人惊讶的发现是,我们能在已相对现有文献相当庞大的模型基础上取得如此显著的改进。例如,Vaswani等人(2017)探讨的最大Transformer为($L=6$,$H=1024$,$A=16$),编码器参数量为1亿;而我们在文献中发现的最大Transformer为($L=64$,$H=512$,$A=2$),参数量达2.35亿(Al-Rfou等人,2018)。相比之下,BERTBASE包含1.1亿参数,BERTLARGE则包含3.4亿参数。

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained. Peters et al. (2018b) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and Melamud et al. (2016) mentioned in passing that increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements. Both of these prior works used a feature-based approach — we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the task-specific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.

众所周知,增加模型规模会持续提升机器翻译和语言建模等大规模任务的性能,表6所示的训练数据LM困惑度便印证了这一点。但我们认为,本研究首次有力证明了:只要模型经过充分预训练,即使扩展到极大模型规模,也能显著提升极小规模任务的性能。Peters等人(2018b)关于将预训练双向语言模型从两层增至四层对下游任务影响的实验得出了不一致的结论,而Melamud等人(2016)曾简要提及将隐藏层维度从200增至600有助提升性能,但继续增至1,000却未带来进一步改善。这两项早期研究均采用基于特征的方法——我们推测,若模型直接在下游任务上进行微调且仅使用极少量随机初始化的附加参数时,即便下游任务数据量极小,任务专用模型仍能受益于更大规模、更具表现力的预训练表征。

5.3 Feature-based Approach with BERT

5.3 基于BERT的特征提取方法

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages. First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation.

目前展示的所有BERT结果都采用了微调(fine-tuning)方法,即在预训练模型上添加简单分类层,并针对下游任务联合微调所有参数。然而,基于特征的方法(从预训练模型中提取固定特征)具有特定优势:首先,并非所有任务都能通过Transformer编码器架构轻松实现,因此需要添加任务特定的模型架构;其次,预计算训练数据的昂贵表示后,可基于该表示运行更轻量模型的多次实验,这将带来显著的计算效率优势。

In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

在本节中,我们通过将BERT应用于CoNLL-2003命名实体识别(NER)任务 (Tjong Kim Sang and De Meulder, 2003) 来比较两种方法。在BERT的输入中,我们使用保留大小写的WordPiece模型,并包含数据提供的最大文档上下文。按照标准做法,我们将其表述为标注任务,但在输出中不使用CRF层。我们使用第一个子token的表示作为NER标签集上token级分类器的输入。

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.

| #L | #H | #A | LM (ppl) | MNLI-m | MRPC | SST-2 |
|---|---|---|---|---|---|---|
| 3 | 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4 |
| 6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7 |
| 6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3 |
| 12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9 |
| 12 | 1024 | 16 | 3.54 | 85.7 | 86.9 | 93.3 |
| 24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7 |

表 6: BERT模型规模的消融实验。#L = 层数;#H = 隐藏层大小;#A = 注意力头数。"LM (ppl)"表示训练数据遮蔽语言模型的困惑度。

| #L | #H | #A | LM (ppl) | MNLI-m | MRPC | SST-2 |
|---|---|---|---|---|---|---|
| 3 | 768 | 12 | 5.84 | 77.9 | 79.8 | 88.4 |
| 6 | 768 | 3 | 5.24 | 80.6 | 82.2 | 90.7 |
| 6 | 768 | 12 | 4.68 | 81.9 | 84.8 | 91.3 |
| 12 | 768 | 12 | 3.99 | 84.4 | 86.7 | 92.9 |
| 12 | 1024 | 16 | 3.54 | 85.7 | 86.9 | 93.3 |
| 24 | 1024 | 16 | 3.23 | 86.6 | 87.8 | 93.7 |

Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

| System | Dev F1 | Test F1 |
|---|---|---|
| ELMo (Peters et al., 2018a) | 95.7 | 92.2 |
| CVT (Clark et al., 2018) | - | 92.6 |
| CSE (Akbik et al., 2018) | - | 93.1 |
| Fine-tuning approach | | |
| BERTLARGE | 96.6 | 92.8 |
| BERTBASE | 96.4 | 92.4 |
| Feature-based approach (BERTBASE) | | |
| Embeddings | 91.0 | - |
| Second-to-Last Hidden | 95.6 | - |
| Last Hidden | 94.9 | - |
| Weighted Sum Last Four Hidden | 95.9 | - |
| Concat Last Four Hidden | 96.1 | - |
| Weighted Sum All 12 Layers | 95.5 | - |

表 7: CoNLL-2003 命名实体识别结果。超参数通过开发集选定,报告的开发集和测试集分数为使用该超参数进行5次随机重启的平均值。

| System | Dev F1 | Test F1 |
|---|---|---|
| ELMo (Peters et al., 2018a) | 95.7 | 92.2 |
| CVT (Clark et al., 2018) | - | 92.6 |
| CSE (Akbik et al., 2018) | - | 93.1 |
| Fine-tuning approach | | |
| BERTLARGE | 96.6 | 92.8 |
| BERTBASE | 96.4 | 92.4 |
| Feature-based approach (BERTBASE) | | |
| Embeddings | 91.0 | - |
| Second-to-Last Hidden | 95.6 | - |
| Last Hidden | 94.9 | - |
| Weighted Sum Last Four Hidden | 95.9 | - |
| Concat Last Four Hidden | 96.1 | - |
| Weighted Sum All 12 Layers | 95.5 | - |

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

为了消融微调方法,我们采用基于特征的方法,通过从一个或多个层提取激活值而不微调BERT的任何参数。这些上下文嵌入被用作随机初始化的两层768维BiLSTM的输入,然后传递到分类层。
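
A minimal PyTorch sketch of this feature-based setup, with random tensors standing in for the frozen BERT activations (the layer selection itself, e.g. concatenating the top four layers, is elided):

```python
import torch
import torch.nn as nn

H, num_labels = 768, 9               # hidden size; e.g. a small NER tag set

# Contextual embeddings extracted once from a frozen BERT (random stand-ins here);
# no gradients flow back into BERT under the feature-based approach.
feats = torch.randn(2, 30, H)        # [batch, seq_len, H]

bilstm = nn.LSTM(H, 768, num_layers=2, bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * 768, num_labels)  # token-level classifier (no CRF)

out, _ = bilstm(feats)               # [batch, seq_len, 2*768]
logits = classifier(out)             # per-token label scores
print(logits.shape)                  # torch.Size([2, 30, 9])
```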

Results are presented in Table 7. BERTLARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only $0.3\mathrm{~F}1$ behind fine-tuning the entire model. This demonstrates that BERT is effective for both finetuning and feature-based approaches.

结果如表 7 所示。BERTLARGE 的性能与最先进方法相当。表现最佳的方法是将预训练 Transformer 顶部四个隐藏层的 Token 表征拼接起来,其效果仅比微调整个模型低 0.3 F1。这表明 BERT 对于微调和基于特征的方法都有效。

6 Conclusion

6 结论

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

近期基于语言模型迁移学习的实证改进表明,丰富的无监督预训练已成为众多语言理解系统的核心组成部分。这些成果尤其使得低资源任务也能受益于深度单向架构。我们的主要贡献是将这些发现进一步推广至深度双向架构,使同一预训练模型能成功应对广泛的自然语言处理任务。

References

参考文献

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.

Alan Akbik、Duncan Blythe和Roland Vollgraf。2018. 序列标注的上下文字符串嵌入。载于《第27届国际计算语言学会议论文集》,第1638–1649页。

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Rami Al-Rfou、Dokook Choe、Noah Constant、Mandy Guo和Llion Jones。2018。基于深层自注意力机制的字符级语言建模。arXiv预印本arXiv:1808.04444。

Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853.

Rie Kubota Ando 和 Tong Zhang。2005。一种从多任务和无标注数据中学习预测结构的框架。《机器学习研究期刊》,6(11月):1817–1853。

Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The fifth PASCAL recognizing textual entailment challenge. In TAC. NIST.

Luisa Bentivogli、Bernardo Magnini、Ido Dagan、Hoa Trang Dang和Danilo Giampiccolo。2009。第五届PASCAL文本蕴含识别挑战赛。发表于TAC会议。NIST。

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics.

John Blitzer、Ryan McDonald 和 Fernando Pereira。2006。基于结构对应学习的领域自适应。载于《2006年自然语言处理实证方法会议论文集》,第120-128页。计算语言学协会。

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. 用于学习自然语言推理的大规模标注语料库. 载于EMNLP. 计算语言学协会.

Peter F Brown, Peter V Desouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational linguistics, 18(4):467–479.

Peter F Brown、Peter V Desouza、Robert L Mercer、Vincent J Della Pietra 和 Jenifer C Lai。1992。基于类的自然语言n元模型。Computational linguistics,18(4):467–479。

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez- Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and cross lingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer、Mona Diab、Eneko Agirre、Inigo Lopez-Gazpio 和 Lucia Specia。2017。SemEval-2017 任务 1: 多语言与跨语言语义文本相似度评估。载于《第11届国际语义评测研讨会论文集》(SemEval-2017),第1-14页,加拿大温哥华。计算语言学协会。

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Ciprian Chelba、Tomas Mikolov、Mike Schuster、Qi Ge、Thorsten Brants、Phillipp Koehn 和 Tony Robinson。2013. 用于衡量统计语言建模进展的十亿词基准。arXiv 预印本 arXiv:1312.3005。

Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

Z. Chen、H. Zhang、X. Zhang 和 L. Zhao。2018。Quora 问答对。

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.

Christopher Clark 和 Matt Gardner. 2018. 简单有效的多段落阅读理解. 发表于 ACL.

Kevin Clark, Minh-Thang Luong, Christopher D Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914– 1925.

Kevin Clark、Minh-Thang Luong、Christopher D Manning和Quoc Le。2018。基于交叉视图训练的半监督序列建模。载于《2018年自然语言处理实证方法会议论文集》,第1914–1925页。

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM.

Ronan Collobert 和 Jason Weston. 2008. 自然语言处理的统一架构: 多任务学习的深度神经网络. 见《第25届国际机器学习会议论文集》, 第160–167页. ACM.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Alexis Conneau、Douwe Kiela、Holger Schwenk、Loic Barrault和Antoine Bordes。2017. 基于自然语言推理数据的通用句子表示监督学习。载于《2017年自然语言处理实证方法会议论文集》,第670-680页,丹麦哥本哈根。计算语言学协会。

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.

Andrew M Dai 和 Quoc V Le. 2015. 半监督序列学习. In Advances in neural information processing systems, pages 3079–3087.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. FeiFei. 2009. ImageNet: 一个大规模分层图像数据库. In CVPR09.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

William B Dolan 和 Chris Brockett. 2005. 自动构建句子复述语料库. 见《第三届国际复述研讨会论文集》(IWP2005).

William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. MaskGAN: Better text generation via filling in the ______. arXiv preprint arXiv:1801.07736.

William Fedus、Ian Goodfellow 和 Andrew M Dai. 2018. MaskGAN: 通过填充实现更好的文本生成. arXiv 预印本 arXiv:1801.07736.

Dan Hendrycks and Kevin Gimpel. 2016. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR, abs/1606.08415.

Dan Hendrycks 和 Kevin Gimpel. 2016. 用高斯误差线性单元 (Gaussian Error Linear Units) 桥接非线性与随机正则化器. CoRR, abs/1606.08415.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Felix Hill、Kyunghyun Cho 和 Anna Korhonen。2016. 从无标注数据中学习句子的分布式表示。载于《2016年北美计算语言学协会人类语言技术会议论文集》。计算语言学协会。

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL. Association for Computational Linguistics.

Jeremy Howard 和 Sebastian Ruder。2018。面向文本分类的通用语言模型微调。载于ACL。计算语言学协会。

Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In IJCAI.

胡明浩、彭宇星、黄震、邱锡鹏、韦福如、周明。2018。用于机器阅读理解的强化记忆阅读器。见IJCAI。

Yacine Jernite, Samuel R. Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.

Yacine Jernite、Samuel R. Bowman 和 David Sontag。2017。基于语篇目标的快速无监督句子表征学习。CoRR,abs/1705.00557。

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.

Mandar Joshi、Eunsol Choi、Daniel S Weld和Luke Zettlemoyer。2017. TriviaQA:一个大规模远程监督的阅读理解挑战数据集。载于ACL。

Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302.

Ryan Kiros、Yukun Zhu、Ruslan R Salakhutdinov、Richard Zemel、Raquel Urtasun、Antonio Torralba 和 Sanja Fidler。2015. Skip-thought向量。载于《神经信息处理系统进展》,第3294-3302页。

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.

Quoc Le和Tomas Mikolov。2014。句子和文档的分布式表示。见《国际机器学习会议》,第1188-1196页。

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.

Hector J Levesque、Ernest Davis 和 Leora Morgenstern。2011。Winograd 模式挑战。载于 AAAI 春季研讨会:常识推理的逻辑形式化,第46卷,第47页。

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations.

Lajanugen Logeswaran 和 Honglak Lee. 2018. 一种高效学习句子表征的框架. 见于国际学习表征会议.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In NIPS.

Bryan McCann、James Bradbury、Caiming Xiong 和 Richard Socher。2017. 翻译中的学习:上下文感知词向量。发表于 NIPS。

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In CoNLL.

Oren Melamud、Jacob Goldberger 和 Ido Dagan。2016。context2vec:基于双向 LSTM 的通用上下文嵌入学习。载于 CoNLL。

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Tomas Mikolov、Ilya Sutskever、Kai Chen、Greg S Corrado 和 Jeff Dean。2013. 词与短语的分布式表示及其组合性。载于《神经信息处理系统进展》第26卷,第3111–3119页。Curran Associates公司出版。

Andriy Mnih and Geoffrey E Hinton. 2009. A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.

Andriy Mnih 和 Geoffrey E Hinton. 2009. 可扩展的层次化分布式语言模型. 见 D. Koller, D. Schuurmans, Y. Bengio 和 L. Bottou 编, 《神经信息处理系统进展 21》, 第 1081–1088 页. Curran Associates 公司.

Ankur P Parikh, Oscar Tackstrom, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.

Ankur P Parikh、Oscar Tackstrom、Dipanjan Das 和 Jakob Uszkoreit。2016. 一种可分解注意力模型用于自然语言推理。发表于 EMNLP。

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Jeffrey Pennington、Richard Socher 和 Christopher D. Manning。2014. GloVe: 词表示的全局向量。载于《自然语言处理实证方法》(EMNLP),第1532–1543页。

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.

Matthew Peters、Waleed Ammar、Chandra Bhagavatula 和 Russell Power。2017. 基于双向语言模型的半监督序列标注。载于 ACL。

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In NAACL.

Matthew Peters、Mark Neumann、Mohit Iyyer、Matt Gardner、Christopher Clark、Kenton Lee 和 Luke Zettlemoyer。2018a. 深度上下文词表征 (Deep contextualized word representations)。见 NAACL。

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Matthew Peters、Mark Neumann、Luke Zettlemoyer 和 Wen-tau Yih。2018b。剖析上下文词嵌入:架构与表征。载于《2018年自然语言处理实证方法会议论文集》,第1499–1509页。

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford、Karthik Narasimhan、Tim Salimans和Ilya Sutskever。2018。通过无监督学习提升语言理解能力。技术报告,OpenAI。

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Pranav Rajpurkar、Jian Zhang、Konstantin Lopyrev 和 Percy Liang。2016。SQuAD:面向机器文本理解的 100,000+ 问题集。载于《2016年自然语言处理实证方法会议论文集》,第2383–2392页。

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.

Minjoon Seo、Aniruddha Kembhavi、Ali Farhadi 和 Hannaneh Hajishirzi。2017. 机器理解的双向注意力流。发表于 ICLR。

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.

Richard Socher、Alex Perelygin、Jean Wu、Jason Chuang、Christopher D Manning、Andrew Ng 和 Christopher Potts。2013. 基于情感树库的语义组合性递归深度模型。载于《2013年自然语言处理实证方法会议论文集》,第1631–1642页。

Fu Sun, Linyang Li, Xipeng Qiu, and Yang Liu. 2018. U-net: Machine reading comprehension with unanswerable questions. arXiv preprint arXiv:1810.06638.

Fu Sun、Linyang Li、Xipeng Qiu和Yang Liu。2018。U-net:带不可回答问题机制的机器阅读理解。arXiv预印本arXiv:1810.06638。

Wilson L Taylor. 1953. Cloze procedure: A new tool for measuring readability. Journalism Bulletin, 30(4):415–433.

Wilson L Taylor. 1953. 完形填空程序:测量可读性的新工具。Journalism Bulletin, 30(4):415–433。

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.

Erik F Tjong Kim Sang 和 Fien De Meulder。2003。CoNLL-2003 共享任务简介:语言无关的命名实体识别。载于 CoNLL。

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394.

Joseph Turian、Lev Ratinov 和 Yoshua Bengio。2010. 词表示:一种简单通用的半监督学习方法。载于《第48届计算语言学协会年会论文集》(ACL '10),第384–394页。

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Ashish Vaswani、Noam Shazeer、Niki Parmar、Jakob Uszkoreit、Llion Jones、Aidan N Gomez、Lukasz Kaiser 和 Illia Polosukhin。2017. Attention is all you need。载于《神经信息处理系统进展》,第6000-6010页。

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM.

Pascal Vincent、Hugo Larochelle、Yoshua Bengio 和 Pierre-Antoine Manzagol。2008. 使用降噪自编码器提取和组合鲁棒特征。载于第25届国际机器学习会议论文集,第1096–1103页。ACM。

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

Alex Wang、Amanpreet Singh、Julian Michael、Felix Hill、Omer Levy 和 Samuel Bowman。2018a. GLUE:自然语言理解的多任务基准与分析平台。见《2018年EMNLP研讨会Blackbox NLP:分析与解释神经网络在NLP中的应用》论文集,第353-355页。

Wei Wang, Ming Yan, and Chen Wu. 2018b. Multigranularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Wei Wang、Ming Yan和Chen Wu。2018b。用于阅读理解与问答的多粒度分层注意力融合网络。载于《第56届计算语言学协会年会论文集(第一卷:长论文)》。计算语言学协会。

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Alex Warstadt、Amanpreet Singh 和 Samuel R Bowman。2018。神经网络可接受性判断。arXiv 预印本 arXiv:1805.12471。

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Adina Williams、Nikita Nangia 和 Samuel R Bowman. 2018. 面向广泛覆盖的句子理解推理挑战语料库. 载于 NAACL.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey等. 2016. 谷歌神经机器翻译系统: 缩小人类与机器翻译的差距. arXiv预印本 arXiv:1609.08144.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.

Jason Yosinski、Jeff Clune、Yoshua Bengio 和 Hod Lipson。2014. 深度神经网络中的特征可迁移性如何?载于《神经信息处理系统进展》,第3320-3328页。

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.

Adams Wei Yu、David Dohan、Minh-Thang Luong、Rui Zhao、Kai Chen、Mohammad Norouzi 和 Quoc V Le。2018。QANet:结合局部卷积与全局自注意力机制的阅读理解模型。发表于 ICLR。

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rowan Zellers、Yonatan Bisk、Roy Schwartz和Yejin Choi。2018。SWAG:一个用于基础常识推理的大规模对抗数据集。载于《2018年自然语言处理实证方法会议论文集》(EMNLP)。

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.

Yukun Zhu、Ryan Kiros、Rich Zemel、Ruslan Salakhutdinov、Raquel Urtasun、Antonio Torralba 和 Sanja Fidler。2015。通过观看电影和阅读书籍实现书籍与电影的对齐:构建故事化视觉解释。载于《IEEE国际计算机视觉会议论文集》,第19-27页。

Appendix for “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”

"BERT:面向语言理解的深度双向Transformer预训练"附录

We organize the appendix into three sections:

我们将附录分为三个部分:

• Additional implementation details for BERT are presented in Appendix A;

• 关于BERT的更多实现细节详见附录A;

• Additional details for our experiments are presented in Appendix B; and

• 实验的更多细节见附录 B;

• Additional ablation studies are presented in Appendix C.

• 更多消融实验详见附录 C。

We present additional ablation studies for BERT including:

我们针对BERT进行了额外的消融研究,包括:

• Effect of Number of Training Steps; and
• Ablation for Different Masking Procedures.

  • 训练步数的影响
  • 不同掩码处理方法的消融实验

A Additional Details for BERT

A BERT 的附加细节

A.1 Illustration of the Pre-training Tasks

A.1 预训练任务说明

We provide examples of the pre-training tasks in the following.

我们提供以下预训练任务的示例。

Masked LM and the Masking Procedure Assuming the unlabeled sentence is my dog is hairy, and during the random masking procedure we chose the 4-th token (which corresponds to hairy), our masking procedure can be further illustrated by:

• $80%$ of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]

• $10%$ of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple

• $10%$ of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy. (The purpose of this is to bias the representation towards the actual observed word.)

掩码语言模型 (Masked LM) 与掩码处理流程
假设未标注句子为"my dog is hairy",在随机掩码过程中我们选择第4个token(对应"hairy"),该掩码处理流程可进一步说明如下:

• $80%$ 的情况:将该词替换为 [MASK],例如 my dog is hairy → my dog is [MASK]

• $10%$ 的情况:将该词替换为随机词,例如 my dog is hairy → my dog is apple

• $10%$ 的情况:保持该词不变,例如 my dog is hairy → my dog is hairy。(这样做的目的是使表征偏向实际观测到的词。)
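下面用一小段 Python 代码对上述 80%/10%/10% 混合掩码流程做一个极简示意(仅为帮助理解的草图,并非官方实现;其中的词表 vocab 与按空格分词的方式均为假设):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, p_mask=0.8, p_rnd=0.1):
    """对 token 序列执行 BERT 式随机掩码,返回(掩码后序列, 预测目标列表)。"""
    output, targets = list(tokens), []
    # 随机选取约 mask_rate 比例的位置作为预测目标
    n_pred = max(1, round(len(tokens) * mask_rate))
    for i in random.sample(range(len(tokens)), n_pred):
        targets.append((i, tokens[i]))  # 无论该位置如何被替换,训练目标始终是预测原词
        r = random.random()
        if r < p_mask:                  # 80% 的情况:替换为 [MASK]
            output[i] = "[MASK]"
        elif r < p_mask + p_rnd:        # 10% 的情况:替换为随机词
            output[i] = random.choice(vocab)
        # 其余 10% 的情况:保持原词不变
    return output, targets

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "blue", "run"]))
```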

The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for $1.5%$ of all tokens (i.e., $10%$ of $15%$), this does not seem to harm the model’s language understanding capability. In Section C.2, we evaluate the impact of this procedure.

该方法的优势在于,Transformer 编码器无法预知哪些词需要被预测或被随机词替换,因此它必须为每个输入 token 保留分布式上下文表征。此外,由于随机替换仅作用于全部 token 的 1.5%(即 15% 中的 10%),这似乎不会损害模型的语言理解能力。在附录 C.2 中,我们将评估该操作的影响。

Compared to standard language model training, the masked LM only makes predictions on $15%$ of tokens in each batch, which suggests that more pre-training steps may be required for the model to converge. In Section C.1 we demonstrate that MLM does converge marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.

与标准语言模型训练相比,掩码语言模型 (masked LM) 仅对每批次中 $15%$ 的 token 进行预测,这表明模型可能需要更多预训练步骤才能收敛。在 C.1 节中,我们证明 MLM 的收敛速度确实略慢于从左到右的模型 (后者预测所有 token),但 MLM 模型带来的实证改进远远超过了训练成本的增加。


Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-toleft LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

图 3: 预训练模型架构差异。BERT采用双向Transformer架构,OpenAI GPT使用从左到右的Transformer,ELMo通过拼接独立训练的从左到右和从右到左LSTM来生成下游任务特征。三者中,只有BERT在所有层同时联合左右上下文进行表征。除架构差异外,BERT和OpenAI GPT属于微调方法,而ELMo基于特征提取方法。

Next Sentence Prediction The next sentence prediction task can be illustrated in the following examples:

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

下一句预测 (Next Sentence Prediction) 该任务可以通过以下示例说明:

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

A.2 Pre-training Procedure

A.2 预训练流程

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. $50%$ of the time B is the actual next sentence that follows A and $50%$ of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is $\leq 512$ tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of $15%$, and no special consideration given to partial word pieces.

为生成每个训练输入序列,我们从语料库中采样两段文本(尽管它们通常比单句长得多,但也可能更短),分别称为"句子"。第一句使用A嵌入,第二句使用B嵌入。50%的情况下B是实际接在A后的下一句,另外50%是随机句子,这是为"下一句预测"任务设计的。采样时确保组合长度≤512个token。在WordPiece分词后以15%的统一掩码率应用LM掩码,且不对部分词片段做特殊处理。
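上述采样过程可以用如下 Python 草图示意(假设 docs 是"文档 → 句子列表"的语料组织方式,且每篇文档至少包含两句;截断逻辑仅为粗略示意,并非原始数据管线):

```python
import random

def create_nsp_example(docs, max_len=512):
    """采样 (句子A, 句子B, is_next):50% 取真实下一句,50% 取随机句。"""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)          # 假设每篇文档至少有两句
    a = doc[i].split()
    if random.random() < 0.5:
        b, is_next = doc[i + 1].split(), True   # 真实下一句
    else:                                       # 随机文档中的随机句
        b, is_next = random.choice(random.choice(docs)).split(), False
    # 从较长的一段末尾截断,保证合计长度不超过 max_len 个 token
    while len(a) + len(b) > max_len:
        (a if len(a) > len(b) else b).pop()
    return " ".join(a), " ".join(b), is_next

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(create_nsp_example(docs))
```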

We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

我们使用256个序列的批量大小(256个序列 * 512个token = 128,000个token/批次)进行1,000,000步训练,大约相当于在33亿词规模的语料上训练40个周期。采用Adam优化器,学习率为1e-4,$\beta_1 = 0.9$,$\beta_2 = 0.999$,L2权重衰减为0.01,前10,000步进行学习率预热,之后线性衰减。所有层均使用0.1的dropout概率。遵循OpenAI GPT的方案,我们使用gelu激活函数(Hendrycks and Gimpel, 2016)而非标准relu。训练损失为掩码语言模型似然均值与下一句预测似然均值之和。
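其中"预热 + 线性衰减"的学习率调度可以写成如下简单函数(纯 Python 示意,参数取自上文;并非官方实现):

```python
def learning_rate(step, base_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """前 warmup_steps 步线性升至 base_lr,之后线性衰减到 0。"""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# 示例:打印若干训练步对应的学习率
for s in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(s, learning_rate(s))
```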

Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total). Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pre-training took 4 days to complete.

BERTBASE的训练在4个Cloud TPU组成的Pod配置(共16个TPU芯片)上完成。BERTLARGE的训练在16个Cloud TPU(共64个TPU芯片)上完成。每次预训练需要4天完成。

Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pre-training in our experiments, we pre-train the model with a sequence length of 128 for $90%$ of the steps. Then, we pre-train on sequences of length 512 for the remaining $10%$ of the steps to learn the positional embeddings.

更长的序列会带来不成比例的高成本,因为注意力机制的计算复杂度与序列长度呈平方关系。为了加速实验中的预训练过程,我们先用128的序列长度完成了90%的预训练步骤,随后用512的序列长度训练剩余10%的步骤以学习位置嵌入。
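这一两阶段序列长度方案可以示意如下(假设总步数为 1,000,000;仅为说明,90% 的切分点来自上文):

```python
TOTAL_STEPS = 1_000_000

def seq_len_for_step(step):
    """前 90% 的步数使用长度 128,其余 10% 使用长度 512 以学习位置嵌入。"""
    return 128 if step < 0.9 * TOTAL_STEPS else 512

assert seq_len_for_step(100_000) == 128
assert seq_len_for_step(950_000) == 512
```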

A.3 Fine-tuning Procedure

A.3 微调流程

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

在微调阶段,大多数模型超参数与预训练保持一致,仅调整批次大小、学习率和训练周期数。Dropout概率始终保持在0.1。最优超参数值因任务而异,但我们发现以下参数范围在所有任务中表现良好:

• Batch size: 16, 32

  • 批量大小 (Batch size): 16, 32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5

• Number of epochs: 2, 3, 4

  • 学习率 (Adam):5e-5、3e-5、2e-5
  • 训练轮次:2、3、4

We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.

我们还发现,大数据集(例如超过10万条标注训练样本)对超参数选择的敏感度远低于小数据集。微调通常非常快,因此完全可以对上述参数进行穷举搜索,并选择在开发集上表现最佳的模型。
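对上述取值范围做穷举搜索非常直接,例如下面的草图(其中 train_and_eval 是假设的占位函数,代表一次完整微调并返回开发集指标):

```python
import itertools

def grid_search(train_and_eval):
    """穷举微调超参数组合,返回在开发集上表现最好的 (batch_size, lr, epochs)。"""
    grid = itertools.product([16, 32],              # 批量大小
                             [5e-5, 3e-5, 2e-5],    # 学习率 (Adam)
                             [2, 3, 4])             # 训练轮次
    return max(grid, key=lambda cfg: train_and_eval(*cfg))

# 用法示例:best = grid_search(my_train_and_eval)
```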

A.4 Comparison of BERT, ELMo, and OpenAI GPT

A.4 BERT、ELMo 与 OpenAI GPT 对比

Here we study the differences in recent popular representation learning models, including ELMo, OpenAI GPT, and BERT. The comparisons between the model architectures are shown visually in Figure 3. Note that in addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

我们研究了近期流行的表征学习模型之间的差异,包括ELMo、OpenAI GPT和BERT。模型架构的对比可视化展示在图3中。需要注意的是,除了架构差异外,BERT和OpenAI GPT属于微调方法,而ELMo是基于特征的方法。

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally made to make it as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the bidirectionality and the two pre-training tasks presented in Section 3.1 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

现有预训练方法中与BERT最具可比性的是OpenAI GPT,该方法在大规模文本语料上训练了一个从左到右的Transformer语言模型。实际上,BERT的许多设计决策都刻意保持与GPT高度接近,以便两种方法能进行最小化差异的对比。本文的核心论点是:双向性(bidirectionality)和3.1节提出的两个预训练任务贡献了大部分实证改进,但我们也注意到BERT与GPT在训练方式上还存在以下差异:

• GPT 在 BooksCorpus(8亿词)上训练;BERT 在 BooksCorpus(8亿词)和维基百科(25亿词)上训练。

• GPT 的句子分隔符([SEP])和分类符([CLS])仅在微调阶段引入;BERT 在预训练阶段就学习 [SEP]、[CLS] 以及句子 A/B 嵌入。

• GPT 训练了 100 万步,每批 32,000 词;BERT 训练了 100 万步,每批 128,000 词。

• GPT 在所有微调实验中使用相同的 5e-5 学习率;BERT 则为每个任务选择在开发集上表现最佳的微调学习率。

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the two pre-training tasks and the bidirectionality they enable.

为分离这些差异的影响,我们在5.1节进行了消融实验,结果表明大部分改进实际上来自两个预训练任务及其实现的双向性。

A.5 Illustrations of Fine-tuning on Different Tasks

A.5 不同任务上的微调示例

The illustration of fine-tuning BERT on different tasks can be seen in Figure 4. Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, $E$ represents the input embedding, $T_ {i}$ represents the contextual representation of token $i$ , [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

在不同任务上微调 BERT 的过程如图 4 所示。我们的任务特定模型是通过在 BERT 基础上添加一个额外的输出层构建的,因此只需从头学习极少量的参数。在这些任务中,(a) 和 (b) 是序列级任务,而 (c) 和 (d) 是 token 级任务。图中 $E$ 表示输入嵌入,$T_ {i}$ 表示 token $i$ 的上下文表示,[CLS] 是用于分类输出的特殊符号,[SEP] 是分隔非连续 token 序列的特殊符号。
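对 (a)、(b) 这类序列级任务而言,"添加一个输出层"大致相当于在 [CLS] 位置的最终隐状态上接一个线性分类器。下面用 numpy 给出一个示意(隐层维度与类别数均为假设值,并非完整模型):

```python
import numpy as np

hidden_size, num_labels = 768, 3   # 假设:BERT-Base 隐层维度;MNLI 为三分类
W = np.random.randn(hidden_size, num_labels) * 0.02   # 唯一需要从头学习的参数
b = np.zeros(num_labels)

def classify(cls_hidden):
    """cls_hidden:[CLS] 位置的最终隐状态,形状为 (hidden_size,);返回类别概率。"""
    logits = cls_hidden @ W + b
    exp = np.exp(logits - logits.max())   # 数值稳定的 softmax
    return exp / exp.sum()

print(classify(np.random.randn(hidden_size)))  # 用随机向量模拟一个 [CLS] 表示
```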

B Detailed Experimental Setup

B 详细实验设置

B.1 Detailed Descriptions for the GLUE Benchmark Experiments.

B.1 GLUE 基准实验的详细描述

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018a):

GLUE基准包含以下数据集,其描述最初由Wang等人(2018a)总结:

MNLI Multi-Genre Natural Language Inference is a large-scale, crowd sourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

MNLI(多类型自然语言推理)是一项大规模众包蕴含分类任务 (Williams et al., 2018)。给定一对句子,目标是预测第二个句子相对于第一个句子是蕴含、矛盾还是中立关系。

QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).

QQP Quora Question Pairs是一个二分类任务,目标是判断Quora上提出的两个问题在语义上是否等价 (Chen et al., 2018)。

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018a). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.

QNLI (Question Natural Language Inference) 是斯坦福问答数据集 (Rajpurkar et al., 2016) 的一个版本,已被转化为二分类任务 (Wang et al., 2018a)。正例是包含正确答案的 (问题, 句子) 对,负例是来自同一段落但不包含答案的 (问题, 句子) 对。

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

SST-2 斯坦福情感树库是一个二元单句分类任务,由从电影评论中提取的句子组成,并带有情感的人工标注 (Socher et al., 2013)。


Figure 4: Illustrations of Fine-tuning BERT on Different Tasks.

图 4: 不同任务上微调 BERT 的示意图。

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

CoLA (Corpus of Linguistic Acceptability) 是一个二元的单句分类任务,其目标是判断一个英语句子在语言学上是否"可接受" (Warstadt et al., 2018)。

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

STS-B语义文本相似度基准测试是从新闻标题和其他来源收集的句子对合集 (Cer et al., 2017)。这些句子对标注了1到5分的分数,表示两个句子在语义上的相似程度。

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

MRPC (Microsoft Research Paraphrase Corpus) 包含从在线新闻源自动提取的句子对,并标注了句子对在语义上是否等价 (Dolan and Brockett, 2005)。

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).

RTE (Recognizing Textual Entailment) 是一个类似于 MNLI 的二分类蕴涵任务,但训练数据量少得多 (Bentivogli et al., 2009)。

WNLI Winograd NLI is a small natural language inference dataset (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset, and every trained system that’s been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set to be fair to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

WNLI Winograd NLI 是一个小型自然语言推理数据集 (Levesque et al., 2011)。GLUE 网页指出该数据集的构建存在问题,并且所有提交至 GLUE 的已训练系统表现都低于预测多数类别的 65.1% 基线准确率。因此,为公平对待 OpenAI GPT,我们排除了该数据集。在我们的 GLUE 提交中,始终预测多数类别。

C Additional Ablation Studies

C 其他消融研究

C.1 Effect of Number of Training Steps

C.1 训练步数的影响

Figure 5 presents MNLI Dev accuracy after fine-tuning from a checkpoint that has been pre-trained for $k$ steps. This allows us to answer the following questions:

1. Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to achieve high fine-tuning accuracy? Yes, BERTBASE achieves almost $1.0%$ additional accuracy on MNLI when trained on 1M steps compared to 500k steps.

2. Does MLM pre-training converge slower than LTR pre-training, since only $15%$ of words are predicted in each batch rather than every word? The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately.

图5展示了从预训练了$k$步的检查点进行微调后,MNLI开发集的准确率。这使我们能够回答以下问题:

1. BERT 是否真的需要如此巨大的预训练量(每批 128,000 词 * 1,000,000 步)才能获得较高的微调准确率?是的:与训练 50 万步相比,训练 100 万步的 BERTBASE 在 MNLI 上的准确率可额外提升近 $1.0%$。

2. 由于每批只预测 $15%$ 的词而非全部词,MLM 预训练是否比 LTR(从左到右)预训练收敛更慢?MLM 模型的收敛速度确实略慢于 LTR 模型;但就绝对准确率而言,MLM 模型几乎从一开始就超越了 LTR 模型。

C.2 Ablation for Different Masking Procedures

C.2 不同掩码处理方法的消融实验

In Section 3.1, we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the masked language model (MLM) objective. The following is an ablation study to evaluate the effect of different masking strategies.

在第3.1节中,我们提到BERT在使用掩码语言模型(MLM)目标进行预训练时,对目标token采用了混合掩码策略。以下是评估不同掩码策略效果的消融实验。

Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning, as the [MASK] symbol never appears during the fine-tuning stage. We report the Dev results for both MNLI and NER. For NER, we report both fine-tuning and feature-based approaches, as we expect the mismatch will be amplified for the feature-based approach as the model will not have the chance to adjust the representations.

需要注意的是,掩码策略的目的是减少预训练与微调之间的不匹配,因为微调阶段从未出现 [MASK] 符号。我们报告了 MNLI 和 NER 的开发集结果。对于 NER,我们同时报告了微调和基于特征的方法,因为预计基于特征的方法会放大这种不匹配,因为模型无法调整表征。


Figure 5: Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for $k$ steps. The x-axis is the value of $k$.

图 5: 训练步数消融实验。该图展示了从经过 $k$ 步预训练的模型参数开始微调后,在MNLI任务上的准确率。横轴表示 $k$ 的取值。

Table 8: Ablation over different masking strategies.

| MASK | SAME | RND | MNLI (Fine-tune) | NER (Fine-tune) | NER (Feature-based) |
|------|------|-----|------------------|-----------------|---------------------|
| 80%  | 10%  | 10% | 84.2             | 95.4            | 94.9                |
| 100% | 0%   | 0%  | 84.3             | 94.9            | 94.0                |
| 80%  | 0%   | 20% | 84.1             | 95.2            | 94.6                |
| 80%  | 20%  | 0%  | 84.4             | 95.2            | 94.7                |
| 0%   | 20%  | 80% | 83.7             | 94.8            | 94.6                |
| 0%   | 0%   | 100%| 83.6             | 94.9            | 94.6                |

表 8: 不同掩码策略的消融实验。

| MASK | SAME | RND | MNLI(微调) | NER(微调) | NER(基于特征) |
|------|------|-----|-------------|------------|----------------|
| 80%  | 10%  | 10% | 84.2        | 95.4       | 94.9           |
| 100% | 0%   | 0%  | 84.3        | 94.9       | 94.0           |
| 80%  | 0%   | 20% | 84.1        | 95.2       | 94.6           |
| 80%  | 20%  | 0%  | 84.4        | 95.2       | 94.7           |
| 0%   | 20%  | 80% | 83.7        | 94.8       | 94.6           |
| 0%   | 0%   | 100%| 83.6        | 94.9       | 94.6           |

The results are presented in Table 8. In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token.

结果如表 8 所示。表中 MASK 表示我们用 [MASK] 符号替换目标 token 以进行 MLM (Masked Language Model) ;SAME 表示我们保持目标 token 不变;RND 表示我们用另一个随机 token 替换目标 token。

The numbers in the left part of the table represent the probabilities of the specific strategies used during MLM pre-training (BERT uses $80%$, $10%$, $10%$). The right part of the table represents the Dev set results. For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section 5.3.

表格左侧数字表示 MLM 预训练中采用特定策略的概率 (BERT 使用 $80%$、$10%$、$10%$)。右侧数据代表开发集结果。对于基于特征的方法,我们将 BERT 最后 4 层连接作为特征,该方法在第 5.3 节中被证明是最优方案。
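其中"拼接最后 4 层作为特征"可以示意如下(numpy;层数、序列长度与隐层维度均为假设值,仅为说明拼接操作本身):

```python
import numpy as np

seq_len, hidden, num_layers = 128, 768, 12
# 用随机张量模拟各层输出的 token 表征
layers = [np.random.randn(seq_len, hidden) for _ in range(num_layers)]

# 取最后 4 层,在隐层维度上拼接,得到每个 token 的特征向量
features = np.concatenate(layers[-4:], axis=-1)
print(features.shape)  # (128, 3072)
```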

From the table it can be seen that fine-tuning is surprisingly robust to different masking strategies. However, as expected, using only the MASK strategy was problematic when applying the featurebased approach to NER. Interestingly, using only the RND strategy performs much worse than our strategy as well.

从表中可以看出,微调(fine-tuning)对不同掩码策略表现出惊人的鲁棒性。然而正如预期,当将基于特征的方法应用于NER时,仅使用MASK策略会产生问题。有趣的是,仅使用RND策略的表现也远逊于我们的策略。
