[论文翻译]BERT: 深度双向 Transformer 用于语言理解的预训练




BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: 深度双向 Transformer 用于语言理解的预训练

Jacob Devlin  Ming-Wei Chang  Kenton Lee  Kristina Toutanova
Google AI Language
{jacobdevlin,mingweichang,kentonl,kristout}@google.com

Abstract

摘要

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

我们介绍了一种新的语言表示模型,称为 BERT (Bidirectional Encoder Representations from Transformers)。与最近的语言表示模型不同 (Peters et al., 2018a; Radford et al., 2018),BERT 被设计为在所有层中对左右上下文进行联合条件建模,从而从未标注的文本中预训练深度双向表示。因此,预训练的 BERT 模型只需添加一个额外的输出层即可进行微调,以创建多种任务(如问答和语言推理)的最先进模型,而无需进行大量的任务特定架构修改。

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to $80.5\%$ ($7.7\%$ absolute improvement), MultiNLI accuracy to $86.7\%$ ($4.6\%$ absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

BERT 在概念上简单且在实证上强大。它在十一个自然语言处理任务中取得了新的最先进成果,包括将 GLUE 分数提高到 80.5% (绝对提高了 7.7 个百分点),MultiNLI 准确率提高到 86.7% (绝对提高了 4.6 个百分点),SQuAD v1.1 问答测试 F1 分数提高到 93.2 (绝对提高了 1.5 分),以及 SQuAD v2.0 测试 F1 分数提高到 83.1 (绝对提高了 5.1 分)。

1 Introduction

1 引言

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2018a; Radford et al., 2018; Howard and Ruder, 2018). These include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition and question answering, where models are required to produce fine-grained output at the token level (Tjong Kim Sang and De Meulder, 2003; Rajpurkar et al., 2016).

语言模型预训练已被证明对改进许多自然语言处理任务有效 (Dai 和 Le, 2015; Peters 等, 2018a; Radford 等, 2018; Howard 和 Ruder, 2018)。这些任务包括句子级别的任务,如自然语言推理 (Bowman 等, 2015; Williams 等, 2018) 和释义 (Dolan 和 Brockett, 2005),其目标是通过整体分析来预测句子之间的关系,还包括 Token 级别的任务,如命名实体识别和问答,要求模型在 Token 级别上生成细粒度的输出 (Tjong Kim Sang 和 De Meulder, 2003; Rajpurkar 等, 2016)。

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

将预训练语言表示应用于下游任务的现有两种策略是:基于特征的方法和微调方法。基于特征的方法,例如 ELMo (Peters et al., 2018a),使用特定于任务的架构,将预训练表示作为附加特征包含在内。微调方法,例如生成式预训练 Transformer (OpenAI GPT) (Radford et al., 2018),引入最少的任务特定参数,并通过简单地微调所有预训练参数来在下游任务上进行训练。这两种方法在预训练期间共享相同的目标函数,即使用单向语言模型来学习通用的语言表示。

We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.

我们认为当前的技术限制了预训练表示的能力,特别是对于微调方法。主要的限制是标准的语言模型是单向的,这限制了预训练期间可以使用的架构选择。例如,在 OpenAI GPT 中,作者使用了从左到右的架构,其中每个 Token 只能关注 Transformer (Vaswani et al., 2017) 自注意力层中的先前 Token。这种限制对于句子级别的任务来说是次优的,并且在将基于微调的方法应用于问答等 Token 级别的任务时可能会非常有害,因为在这些任务中结合双向上下文至关重要。

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model” (MLM) pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also use a “next sentence prediction” task that jointly pre-trains text-pair representations. The contributions of our paper are as follows:

在本文中,我们通过提出 BERT:双向编码器表示 (Bidirectional Encoder Representations from Transformers) 来改进基于微调的方法。BERT 通过使用“掩码语言模型”(masked language model, MLM) 预训练目标缓解了之前提到的单向性约束,这一方法受到完形填空 (Cloze) 任务 (Taylor, 1953) 的启发。掩码语言模型随机遮蔽输入中的某些 Token,其目标是仅根据上下文预测被遮蔽词的原始词汇 id。与从左到右的语言模型预训练不同,MLM 目标使表示能够融合左右上下文,从而允许我们预训练一个深度双向 Transformer。除了掩码语言模型外,我们还使用了一个“下一句预测”任务,该任务联合预训练文本对表示。本文的贡献如下:

• We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

• 我们展示了双向预训练对语言表示的重要性。不同于 Radford 等 (2018) 使用单向语言模型进行预训练,BERT 使用掩码语言模型来实现预训练的深度双向表示。这与 Peters 等 (2018a) 形成对比,后者使用独立训练的从左到右和从右到左语言模型的浅层拼接。

• We show that pre-trained representations reduce the need for many heavily-engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.

• 我们证明了预训练表示减少了对许多重度工程化的任务特定架构的需求。BERT 是第一个在大量句子级和 Token 级任务上达到最先进性能的基于微调的表示模型,超越了众多任务特定架构。

• BERT advances the state of the art for eleven NLP tasks. The code and pre-trained models are available at https://github.com/ google-research/bert.

• BERT 在十一项自然语言处理 (NLP) 任务上取得了最先进的成果。代码和预训练模型可在 https://github.com/google-research/bert 获取。

2 Related Work

2 相关工作

There is a long history of pre-training general language representations, and we briefly review the most widely-used approaches in this section.

预训练通用语言表示有着悠久的历史,我们在本节中简要回顾最广泛使用的方法。

2.1 Unsupervised Feature-based Approaches

2.1 基于特征的无监督方法 (Unsupervised Feature-based Approaches)

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Mikolov et al., 2013; Pennington et al., 2014) methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pretrain word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives to discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

学习广泛适用的词表示一直是几十年来活跃的研究领域,包括非神经网络方法 (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) 和神经网络方法 (Mikolov et al., 2013; Pennington et al., 2014)。预训练词嵌入是现代 NLP 系统的重要组成部分,相比从头学习的嵌入提供了显著改进 (Turian et al., 2010)。为了预训练词嵌入向量,使用了从左到右的语言模型目标 (Mnih and Hinton, 2009),以及用于区分左右上下文中正确和错误单词的目标 (Mikolov et al., 2013)。

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives to rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), left-to-right generation of next sentence words given a representation of the previous sentence (Kiros et al., 2015), or denoising autoencoder derived objectives (Hill et al., 2016).

这些方法已经被推广到更粗粒度的表示,例如句子嵌入 (Kiros 等, 2015; Logeswaran 和 Lee, 2018) 或段落嵌入 (Le 和 Mikolov, 2014)。为了训练句子表示,先前的工作使用了以下目标:对候选的下一个句子进行排序 (Jernite 等, 2017; Logeswaran 和 Lee, 2018),根据前一个句子的表示从左到右生成下一个句子的单词 (Kiros 等, 2015),或使用去噪自编码器衍生的目标 (Hill 等, 2016)。

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model. The contextual representation of each token is the concatenation of the left-to-right and right-to-left representations. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state of the art for several major NLP benchmarks (Peters et al., 2018a) including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations through a task to predict a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) shows that the cloze task can be used to improve the robustness of text generation models.

ELMo 及其前身 (Peters et al., 2017, 2018a) 沿着不同的维度推广了传统词嵌入研究。它们从左到右和从右到左的语言模型中提取上下文敏感特征。每个 Token 的上下文表示是左到右和右到左表示的连接。在将上下文词嵌入与现有的任务特定架构集成时,ELMo 在多个主要自然语言处理基准测试中取得了最先进的成果 (Peters et al., 2018a),包括问答 (Rajpurkar et al., 2016)、情感分析 (Socher et al., 2013) 和命名实体识别 (Tjong Kim Sang 和 De Meulder, 2003)。Melamud 等人 (2016) 提出通过使用 LSTM 从左右上下文中预测单个词的任务来学习上下文表示。类似于 ELMo,他们的模型是基于特征的,并不是深度双向的。Fedus 等人 (2018) 表明完形填空任务可以用于提高文本生成模型的鲁棒性。

2.2 Unsupervised Fine-tuning Approaches

2.2 无监督微调方法

As with the feature-based approaches, the first works in this direction only pre-trained word embedding parameters from unlabeled text (Collobert and Weston, 2008).

与基于特征的方法一样,这一方向的首批工作仅从无标签文本中预训练词嵌入参数 (Collobert and Weston, 2008)。

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018a). Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015).

最近,生成上下文 Token 表示的句子或文档编码器已经在无标签文本上进行预训练,并针对有监督的下游任务进行微调 (Dai 和 Le, 2015; Howard 和 Ruder, 2018; Radford 等, 2018)。这些方法的优势在于只需要从头学习少量参数。至少部分由于这一优势,OpenAI GPT (Radford 等, 2018) 在 GLUE 基准测试 (Wang 等, 2018a) 的许多句子级任务上取得了当时最先进的结果。从左到右的语言建模和自编码器目标已被用于此类模型的预训练 (Howard 和 Ruder, 2018; Radford 等, 2018; Dai 和 Le, 2015)。


Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

图 1: BERT 的整体预训练和微调流程。除了输出层之外,预训练和微调使用相同的架构。相同的预训练模型参数用于初始化不同的下游任务模型。在微调期间,所有参数都会进行微调。[CLS] 是添加到每个输入示例前面的特殊符号,而 [SEP] 是特殊的分隔符 Token(例如,分隔问题/答案)。

2.3 Transfer Learning from Supervised Data

2.3 监督数据的迁移学习

There has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained with ImageNet (Deng et al., 2009; Yosinski et al., 2014).

也有多项研究表明,从大规模监督任务(如自然语言推理 (Conneau et al., 2017) 和机器翻译 (McCann et al., 2017))中进行有效的迁移是可行的。计算机视觉研究同样证明了从大规模预训练模型进行迁移学习的重要性,其中一种有效的方法是微调使用 ImageNet 预训练的模型 (Deng et al., 2009; Yosinski et al., 2014)。

3 BERT

3 BERT

We introduce BERT and its detailed implementation in this section. There are two steps in our framework: pre-training and fine-tuning. During pre-training, the model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. The question-answering example in Figure 1 will serve as a running example for this section.

我们在此部分介绍 BERT 及其详细实现。我们的框架包含两个步骤:预训练和微调。在预训练期间,模型在不同的预训练任务上使用无标签数据进行训练。对于微调,BERT 模型首先用预训练参数进行初始化,并使用来自下游任务的有标签数据对所有参数进行微调。每个下游任务都有单独的微调模型,即使它们是用相同的预训练参数初始化的。图 1 中的问题回答示例将作为本节的运行示例。

A distinctive feature of BERT is its unified architecture across different tasks. There is minimal difference between the pre-trained architecture and the final downstream architecture.

BERT 的一个显著特点是其在不同任务中采用统一的架构。预训练架构和最终下游任务架构之间的差异很小。

Model Architecture BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library. Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”

模型架构

BERT 的模型架构是一个多层双向 Transformer 编码器,基于 Vaswani 等人 (2017) 描述的原始实现,并在 tensor2tensor 库中发布。由于 Transformer 的使用已经变得普遍,且我们的实现与原始实现几乎相同,我们将省略对模型架构的详尽背景描述,并引导读者参阅 Vaswani 等人 (2017) 以及诸如 “The Annotated Transformer” 等优秀指南。

In this work, we denote the number of layers (i.e., Transformer blocks) as $L$, the hidden size as $H$, and the number of self-attention heads as $A$. We primarily report results on two model sizes: BERTBASE ($L=12$, $H=768$, $A=12$, Total Parameters $=110\mathrm{M}$) and BERTLARGE ($L=24$, $H=1024$, $A=16$, Total Parameters $=340\mathrm{M}$).

在本工作中,我们将层数(即 Transformer 块的数量)表示为 $L$,隐藏层大小表示为 $H$,自注意力头的数量表示为 $A$。我们主要报告两种模型尺寸的结果:BERTBASE($L=12$, $H=768$, $A=12$,总参数量 $=110\mathrm{M}$)和 BERTLARGE($L=24$, $H=1024$, $A=16$,总参数量 $=340\mathrm{M}$)。
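作为补充说明(非论文原文),可以按上述 $L$、$H$ 粗略估算两种模型的参数量;注意注意力头数 $A$ 不影响总参数量,因为每个头的维度是 $H/A$。下面是一个示意脚本,假设词表约 3 万、最大位置 512,并忽略 LayerNorm 等小项:

```python
# 根据 L/H 粗略估算 BERT 参数量的示意脚本(假设词表约 3 万、最大位置 512;
# 忽略 LayerNorm 等小项,结果与论文报告的 110M/340M 大致吻合)
def approx_bert_params(L, H, vocab=30000, max_pos=512, num_seg=2):
    embeddings = (vocab + max_pos + num_seg) * H    # token/position/segment 嵌入
    attention = 4 * (H * H + H)                     # Q、K、V 投影 + 输出投影(含偏置)
    ffn = 2 * (H * 4 * H) + 4 * H + H               # 前馈层,中间维度为 4H
    pooler = H * H + H                              # [CLS] 上的 pooler 层
    return embeddings + L * (attention + ffn) + pooler

print(f"BERT-BASE  ≈ {approx_bert_params(12, 768) / 1e6:.0f}M")
print(f"BERT-LARGE ≈ {approx_bert_params(24, 1024) / 1e6:.0f}M")
```

由于省略了若干小项,估算值与论文报告的 110M/340M 有约 1–2% 的出入,但足以说明参数量的主要来源。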

BERTBASE was chosen to have the same model size as OpenAI GPT for comparison purposes. Critically, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left.4

为了比较目的,选择了与 OpenAI GPT 具有相同模型大小的 BERTBASE 。然而,至关重要的是,BERT Transformer 使用双向自注意力机制,而 GPT Transformer 使用受限的自注意力机制,其中每个 Token 只能关注其左侧的上下文。

Input/Output Representations To make BERT handle a variety of down-stream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., $\langle$Question, Answer$\rangle$) in one token sequence. Throughout this work, a “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence. A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

输入/输出表示

为了使 BERT 能够处理各种下游任务,我们的输入表示能够在一个 token 序列中无歧义地表示单个句子和句子对(例如 $\langle$问题, 答案$\rangle$)。在本文中,“句子”可以是任意连续的文本片段,而不仅仅是一个实际的语言学意义上的句子。“序列”指的是输入给 BERT 的 token 序列,它可以是一个单句,也可以是打包在一起的两个句子。

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as $E$, the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^{H}$, and the final hidden vector for the $i^{\mathrm{th}}$ input token as $T_{i} \in \mathbb{R}^{H}$.

我们使用 WordPiece 嵌入 (Wu et al., 2016),词汇量为 30,000 个 token。每个序列的第一个 token 总是一个特殊的分类 token ([CLS])。与该 token 对应的最终隐藏状态用于分类任务的聚合序列表示。句子对被打包成一个单一的序列。我们通过两种方式区分这些句子。首先,我们用一个特殊的 token ([SEP]) 将它们分开。其次,我们为每个 token 添加一个学习到的嵌入,以指示它属于句子 A 还是句子 B。如图 1 所示,我们将输入嵌入表示为 $E$,特殊 [CLS] token 的最终隐藏向量表示为 $C \in \mathbb{R}^{H}$,第 $i$ 个输入 token 的最终隐藏向量表示为 $T_{i} \in \mathbb{R}^{H}$。
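上述打包规则([CLS] 开头、[SEP] 分隔、用 segment 标记区分句子 A/B)可以用一小段示意代码演示(非论文代码,函数名 pack_sequence 为本文虚构,并假设文本已切分为 WordPiece token):

```python
# 将单句或句子对打包成 BERT 输入序列的示意实现
def pack_sequence(tokens_a, tokens_b=None):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                # 句子 A(含 [CLS] 和第一个 [SEP])记为 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)   # 句子 B(含第二个 [SEP])记为 1
    return tokens, segment_ids

tokens, segment_ids = pack_sequence(["my", "dog", "is", "cute"],
                                    ["he", "likes", "play", "##ing"])
print(tokens)        # ['[CLS]', 'my', ..., '[SEP]', 'he', ..., '[SEP]']
print(segment_ids)   # 前 6 个为 0,后 5 个为 1
```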


For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.

对于给定的 token,其输入表示是通过将相应的 token、segment 和 position embeddings 相加构建的。这种构建的可视化可以在图 2 中看到。
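这一逐元素相加的过程可以用如下玩具代码示意(嵌入表、维度均为演示用的假设值,随机初始化,并非真实的 BERT 嵌入):

```python
import random

# 输入表示 = token 嵌入 + segment 嵌入 + position 嵌入,三者逐元素相加(玩具维度示意)
H = 8                                       # 演示用的小隐藏维度
random.seed(0)
rand_vec = lambda: [random.gauss(0, 0.02) for _ in range(H)]
token_emb    = {tid: rand_vec() for tid in range(100)}   # 假设的 100 词小词表
segment_emb  = {sid: rand_vec() for sid in range(2)}     # 句子 A/B 两个 segment
position_emb = {pos: rand_vec() for pos in range(16)}    # 假设最大序列长度 16

token_ids, segment_ids = [5, 17, 3, 42], [0, 0, 1, 1]
input_repr = [
    [t + s + p for t, s, p in zip(token_emb[tid], segment_emb[sid], position_emb[pos])]
    for pos, (tid, sid) in enumerate(zip(token_ids, segment_ids))
]
print(len(input_repr), len(input_repr[0]))   # 4 个位置 × H 维
```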

3.1 Pre-training BERT

3.1 预训练 BERT

Unlike Peters et al. (2018a) and Radford et al. (2018), we do not use traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using two unsupervised tasks, described in this section. This step is presented in the left part of Figure 1.

与 Peters 等 (2018a) 和 Radford 等 (2018) 不同,我们不使用传统的从左到右或从右到左的语言模型来预训练 BERT。相反,我们使用两个无监督任务来预训练 BERT,这些任务在本节中描述。此步骤如图 1 左侧所示。

Task #1: Masked LM Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.

任务 #1: 掩码语言模型 (Masked LM)

直观上,我们有理由相信,深度双向模型严格强于从左到右的模型,也强于从左到右与从右到左模型的浅层拼接。不幸的是,标准的条件语言模型只能从左到右或从右到左训练,因为双向条件会让每个词间接地“看到自己”,模型便可以在多层上下文中轻易预测目标词。

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens. We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature (Taylor, 1953). In this case, the final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In all of our experiments, we mask $15\%$ of all WordPiece tokens in each sequence at random. In contrast to denoising auto-encoders (Vincent et al., 2008), we only predict the masked words rather than reconstructing the entire input.

为了训练一个深度双向表示,我们简单地随机屏蔽一部分输入的 Token,然后预测这些被屏蔽的 Token。我们将这个过程称为“屏蔽语言模型”(masked LM,MLM),尽管在文献中它通常被称为完形填空任务 (Cloze task) [Taylor, 1953]。在这种情况下,与屏蔽 Token 对应的最终隐藏向量会被输入到词汇表上的输出 softmax 中,就像标准的语言模型一样。在我们所有的实验中,我们随机屏蔽每个序列中所有 WordPiece Token 的 15%。与去噪自编码器 (Vincent et al., 2008) 不同,我们只预测被屏蔽的词,而不是重建整个输入。

Although this allows us to obtain a bidirectional pre-trained model, a downside is that we are creating a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning. To mitigate this, we do not always replace “masked” words with the actual [MASK] token. The training data generator chooses $15\%$ of the token positions at random for prediction. If the $i$-th token is chosen, we replace the $i$-th token with (1) the [MASK] token $80\%$ of the time (2) a random token $10\%$ of the time (3) the unchanged $i$-th token $10\%$ of the time. Then, $T_{i}$ will be used to predict the original token with cross entropy loss. We compare variations of this procedure in Appendix C.2.

虽然这使我们能够获得一个双向预训练模型,但缺点是在预训练和微调之间产生了不匹配,因为 [MASK] token 在微调时不会出现。为了解决这个问题,我们并不总是用实际的 [MASK] token 替换“被遮掩”的词。训练数据生成器随机选择 15% 的 token 位置进行预测。如果选择了第 $i$ 个 token,则我们以如下方式替换第 $i$ 个 token:(1) 80% 的时间用 [MASK] token 替换;(2) 10% 的时间用一个随机 token 替换;(3) 10% 的时间保持第 $i$ 个 token 不变。然后,使用 $T_{i}$ 来通过交叉熵损失预测原始 token。我们在附录 C.2 中比较了此过程的不同变体。
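上述 80%/10%/10% 的替换策略可以示意如下(mask_for_mlm 及其参数为本文虚构的演示实现;真实实现会在整个序列上精确采样 15% 的位置,这里用逐位置的伯努利采样近似):

```python
import random

# 80/10/10 遮蔽策略的示意实现(vocab 与 rng 为演示用的假设参数)
def mask_for_mlm(tokens, vocab, mask_prob=0.15, rng=None):
    rng = rng or random.Random(42)
    tokens = list(tokens)
    targets = {}                               # 位置 -> 原始 token,作为预测目标
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            tokens[i] = "[MASK]"               # 80%:替换为 [MASK]
        elif r < 0.9:
            tokens[i] = rng.choice(vocab)      # 10%:替换为随机 token
        # 剩余 10%:保持原 token 不变
    return tokens, targets

vocab = ["my", "dog", "is", "cute", "hairy", "he", "likes"]
masked, targets = mask_for_mlm(["[CLS]"] + vocab * 3 + ["[SEP]"], vocab)
print(sum(tok == "[MASK]" for tok in masked), "个 [MASK],", len(targets), "个预测目标")
```

注意损失只在 targets 记录的位置上计算,这正是与去噪自编码器“重建整个输入”的区别所在。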

Task #2: Next Sentence Prediction (NSP) Many important downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) are based on understanding the relationship between two sentences, which is not directly captured by language modeling. In order to train a model that understands sentence relationships, we pre-train for a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, $50\%$ of the time B is the actual next sentence that follows A (labeled as IsNext), and $50\%$ of the time it is a random sentence from the corpus (labeled as NotNext). As we show in Figure 1, $C$ is used for next sentence prediction (NSP). Despite its simplicity, we demonstrate in Section 5.1 that pre-training towards this task is very beneficial to both QA and NLI.

任务 #2: 下一句预测 (Next Sentence Prediction, NSP) 许多重要的下游任务,如问答 (Question Answering, QA) 和自然语言推理 (Natural Language Inference, NLI),都依赖于理解两个句子之间的关系,而这并不能由语言建模直接捕捉。为了训练一个能够理解句子关系的模型,我们对二元化的下一句预测任务进行预训练,该任务可以从任何单语语料库中轻松生成。具体来说,在为每个预训练示例选择句子 A 和 B 时,50% 的时间 B 是实际跟在 A 后面的下一句(标记为 IsNext),而 50% 的时间它是语料库中的一个随机句子(标记为 NotNext)。如图 1 所示,$C$ 用于下一句预测 (NSP)。尽管该任务十分简单,但我们在第 5.1 节中证明了针对此任务的预训练对 QA 和 NLI 都非常有益。
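NSP 训练对的构造可以示意如下(make_nsp_example 与玩具语料均为本文虚构的演示;示意实现中 NotNext 分支也可能偶然抽到真实的下一句,实际实现通常会从其他文档中抽取):

```python
import random

# NSP 训练样本生成示意:corpus 为"文档 -> 句子列表"的玩具语料
def make_nsp_example(doc, idx, corpus, rng):
    sent_a = doc[idx]
    if rng.random() < 0.5 and idx + 1 < len(doc):
        return sent_a, doc[idx + 1], "IsNext"    # 50%:真实的下一句
    other = rng.choice(corpus)                   # 50%:语料库中的随机句子
    return sent_a, rng.choice(other), "NotNext"

corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
rng = random.Random(0)
for _ in range(3):
    a, b, label = make_nsp_example(corpus[0], 0, corpus, rng)
    print(label, "->", b)
```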



Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.


图 2: BERT 输入表示。输入嵌入是 Token 嵌入、分段嵌入和位置嵌入的和。

The NSP task is closely related to representation-learning objectives used in Jernite et al. (2017) and Logeswaran and Lee (2018). However, in prior work, only sentence embeddings are transferred to down-stream tasks, whereas BERT transfers all parameters to initialize end-task model parameters.

NSP 任务与 Jernite 等人 (2017) 和 Logeswaran 和 Lee (2018) 中使用的表示学习目标密切相关。然而,在先前的工作中,只有句子嵌入被迁移到下游任务,而 BERT 则迁移所有参数以初始化最终任务模型参数。

Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words). For Wikipedia we extract only the text passages and ignore lists, tables, and headers. It is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark (Chelba et al., 2013) in order to extract long contiguous sequences.

预训练数据
预训练过程在很大程度上遵循了现有文献中关于语言模型预训练的做法。对于预训练语料库,我们使用了 Books Corpus (800M 词) (Zhu et al., 2015) 和英文维基百科 (2,500M 词)。对于维基百科,我们仅提取文本段落,忽略列表、表格和标题。为了提取长连续序列,使用文档级别的语料库而不是像 Billion Word Benchmark (Chelba et al., 2013) 这样的打乱句子级别的语料库是至关重要的。

3.2 Fine-tuning BERT

3.2 微调 BERT

Fine-tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks—whether they involve single text or text pairs—by swapping out the appropriate inputs and outputs. For applications involving text pairs, a common pattern is to independently encode text pairs before applying bidirectional cross attention, such as Parikh et al. (2016); Seo et al. (2017). BERT instead uses the self-attention mechanism to unify these two stages, as encoding a concatenated text pair with self-attention effectively includes bidirectional cross attention between two sentences.

微调非常直接,因为 Transformer 中的自注意力机制允许 BERT 通过替换适当的输入和输出来建模许多下游任务——无论是涉及单个文本还是文本对的任务。对于涉及文本对的应用,常见的模式是在应用双向交叉注意力之前独立编码文本对,例如 Parikh 等人 (2016);Seo 等人 (2017)。相反,BERT 使用自注意力机制统一这两个阶段,因为使用自注意力编码连接的文本对实际上包含了两个句子之间的双向交叉注意力。

For each task, we simply plug in the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end. At the input, sentence A and sentence B from pre-training are analogous to (1) sentence pairs in paraphrasing, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in question answering, and (4) a degenerate text-$\varnothing$ pair in text classification or sequence tagging. At the output, the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering, and the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.

对于每个任务,我们只需将任务特定的输入和输出接入 BERT,并对所有参数进行端到端微调。在输入端,预训练中的句子 A 和句子 B 类似于 (1) 释义任务中的句子对,(2) 蕴含任务中的假设-前提对,(3) 问答任务中的问题-篇章对,以及 (4) 文本分类或序列标注中的退化的文本-$\varnothing$ 对。在输出端,Token 表示被输入到输出层以处理 Token 级别的任务,例如序列标注或问答;而 [CLS] 表示则被输入到输出层以处理分类任务,例如蕴含或情感分析。

Compared to pre-training, fine-tuning is relatively inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. We describe the task-specific details in the corresponding subsections of Section 4. More details can be found in Appendix A.5.

相比于预训练,微调的代价相对较低。本文中的所有结果最多可以在单个 Cloud TPU 上于 1 小时内复现,或者在 GPU 上花费几小时,前提是使用完全相同的预训练模型。我们在第 4 节的相应小节中描述了特定任务的详细信息。更多详情请参见附录 A.5。

4 Experiments

4 实验

In this section, we present BERT fine-tuning results on 11 NLP tasks.

在本节中,我们展示了 BERT 在 11 个 NLP 任务上的微调结果。

4.1 GLUE

4.1 GLUE


The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018a) is a collection of diverse natural language understanding tasks. Detailed descriptions of GLUE datasets are included in Appendix B.1.

通用语言理解评估 (GLUE) 基准 (Wang et al., 2018a) 是一组多样化的自然语言理解任务。GLUE 数据集的详细描述包含在附录 B.1 中。

To fine-tune on GLUE, we represent the input sequence (for single sentence or sentence pairs) as described in Section 3, and use the final hidden vector $C \in \mathbb{R}^{H}$ corresponding to the first input token ([CLS]) as the aggregate representation. The only new parameters introduced during fine-tuning are classification layer weights $W \in \mathbb{R}^{K\times H}$, where $K$ is the number of labels. We compute a standard classification loss with $C$ and $W$, i.e., $\log(\operatorname{softmax}(C W^{T}))$.

在 GLUE 上进行微调时,我们将输入序列(对于单个句子或句子对)表示为第 3 节中所述的形式,并使用与第一个输入 Token ([CLS]) 对应的最终隐藏向量 $C \in \mathbb{R}^{H}$ 作为聚合表示。微调期间引入的唯一新参数是分类层权重 $W \in \mathbb{R}^{K\times H}$,其中 $K$ 是标签的数量。我们使用 $C$ 和 $W$ 计算标准分类损失,即 $\log(\operatorname{softmax}(C W^{T}))$。
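该分类损失可以用如下玩具代码示意(classification_loss 为本文虚构的演示函数,采用数值稳定的 log-sum-exp;维度取演示用的小值,实际中 $H$ 为 768/1024,$K$ 为任务的标签数):

```python
import math
import random

# 用 [CLS] 向量 C 和分类权重 W ∈ R^{K×H} 计算 log(softmax(C W^T)) 损失的示意
def classification_loss(C, W, label):
    logits = [sum(w * c for w, c in zip(row, C)) for row in W]   # C W^T,K 维 logits
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))   # 数值稳定的 log-sum-exp
    return log_z - logits[label]                                 # 负对数似然

random.seed(1)
H, K = 6, 3
C = [random.gauss(0, 1) for _ in range(H)]
W = [[random.gauss(0, 0.02) for _ in range(H)] for _ in range(K)]
print(round(classification_loss(C, W, label=2), 4))   # 权重接近 0 时,损失接近 ln K
```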

Table 1: GLUE Test results, scored by the evaluation server (https://gluebenchmark.com/leaderboard). The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. BERT and OpenAI GPT are single-model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.

表 1: GLUE 测试结果,由评估服务器评分 (https://gluebenchmark.com/leaderboard)。每个任务下方的数字表示训练样本的数量。“平均”列与官方 GLUE 得分略有不同,因为我们排除了有问题的 WNLI 集。BERT 和 OpenAI GPT 是单模型、单任务。QQP 和 MRPC 报告 F1 分数,STS-B 报告 Spearman 相关性,其他任务报告准确率分数。我们排除了使用 BERT 作为其组件之一的条目。

| 系统 | MNLI-(m/mm) 392k | QQP 363k | QNLI 108k | SST-2 67k | CoLA 8.5k | STS-B 5.7k | MRPC 3.5k | RTE 2.5k | 平均 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| BERTBASE | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| BERTLARGE | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small datasets, so we ran several random restarts and selected the best model on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization.

我们使用批次大小为 32,并对所有 GLUE 任务的数据进行 3 轮微调。对于每个任务,我们在开发集上选择了最佳的微调学习率(在 5e-5、4e-5、3e-5 和 2e-5 中选择)。此外,对于 BERTLARGE,我们发现其在小数据集上有时微调不稳定,因此我们进行了多次随机重启,并在开发集上选择了最佳模型。在随机重启中,我们使用相同的预训练检查点,但执行不同的微调数据洗牌和分类器层初始化。

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all systems on all tasks by a substantial margin, obtaining $4.5\%$ and $7.0\%$ respective average accuracy improvement over the prior state of the art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture apart from the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a $4.6\%$ absolute accuracy improvement. On the official GLUE leaderboard, BERTLARGE obtains a score of 80.5, compared to OpenAI GPT, which obtains 72.8 as of the date of writing.

结果如表 1 所示。BERTBASE 和 BERTLARGE 在所有任务上均大幅优于所有系统,分别比之前的最先进水平提高了 4.5% 和 7.0% 的平均准确率。请注意,除了注意力掩码外,BERTBASE 和 OpenAI GPT 在模型架构上几乎相同。对于最大且最广泛报道的 GLUE 任务 MNLI,BERT 获得了 4.6% 的绝对准确率提升。在官方 GLUE 排行榜上,BERTLARGE 获得了 80.5 分,而 OpenAI GPT 在撰写本文时获得了 72.8 分。


We find that BERTLARGE significantly outperforms BERTBASE across all tasks, especially those with very little training data. The effect of model size is explored more thoroughly in Section 5.2.

我们发现 BERTLARGE 在所有任务上显著优于 BERTBASE,尤其是在训练数据非常少的任务上。模型大小的效果在第 5.2 节中进行了更深入的探讨。

4.2 SQuAD v1.1

4.2 SQuAD v1.1

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a passage from Wikipedia containing the answer, the task is to predict the answer text span in the passage.

斯坦福问答数据集 (SQuAD v1.1) 是一个包含 10 万个众包问题/答案对的集合 (Rajpurkar et al., 2016)。给定一个问题以及一段包含答案的维基百科文章,任务是预测答案在文中的文本片段。

As shown in Figure 1, in the question answering task, we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. We only introduce a start vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$ during fine-tuning. The probability of word $i$ being the start of the answer span is computed as a dot product between $T_{i}$ and $S$ followed by a softmax over all of the words in the paragraph: $P_{i}=\frac{e^{S\cdot T_{i}}}{\sum_{j}e^{S\cdot T_{j}}}$. The analogous formula is used for the end of the answer span. The score of a candidate span from position $i$ to position $j$ is defined as $S\cdot T_{i}+E\cdot T_{j}$, and the maximum scoring span where $j \ge i$ is used as a prediction. The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.

如图 1 所示,在问答任务中,我们将输入的问题和段落表示为一个打包的序列,问题使用 A 嵌入,段落使用 B 嵌入。在微调期间,我们仅引入一个起始向量 $S \in \mathbb{R}^{H}$ 和一个结束向量 $E \in \mathbb{R}^{H}$。单词 $i$ 是答案跨度起始位置的概率通过计算 $T_{i}$ 和 $S$ 的点积,然后对段落中的所有单词进行 softmax 计算得出:$P_{i}=\frac{e^{S\cdot T_{i}}}{\sum_{j}e^{S\cdot T_{j}}}$。类似的公式用于答案跨度的结束位置。从位置 $i$ 到位置 $j$ 的候选跨度得分定义为 $S\cdot T_{i}+E\cdot T_{j}$,并且预测使用满足 $j \ge i$ 的得分最高的跨度。训练目标是正确起始和结束位置的对数似然之和。我们以 5e-5 的学习率和 32 的批量大小微调 3 个 epoch。
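作为示意,上述起止位置评分可以用几行 numpy 勾勒出来 (函数与变量命名为本文自拟,仅作说明;`T` 对应最后一层的 Token 表示矩阵,`S`、`E` 为微调引入的起止向量):

```python
import numpy as np

def start_probabilities(T, S):
    """P_i = softmax over all paragraph positions of S . T_i."""
    logits = T @ S
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def best_span(T, S, E):
    """Return the (i, j) with j >= i maximizing S.T_i + E.T_j."""
    start_scores = T @ S  # S . T_i for every position i
    end_scores = T @ E    # E . T_j for every position j
    n = T.shape[0]
    best_score, best_ij = -np.inf, (0, 0)
    for i in range(n):
        for j in range(i, n):  # enforce j >= i
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best_score, best_ij = score, (i, j)
    return best_ij
```

其中 `best_span` 以 $O(n^2)$ 枚举所有满足 $j \ge i$ 的跨度;实际实现通常还会限制最大答案长度。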

Table 2 shows top leaderboard entries as well as results from top published systems (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018). The top results from the SQuAD leaderboard do not have up-to-date public system descriptions available,11 and are allowed to use any public data when training their systems. We therefore use modest data augmentation in our system by first fine-tuning on TriviaQA (Joshi et al., 2017) before fine-tuning on SQuAD.

表 2 显示了排行榜上的顶级条目以及顶级已发布系统的结果 (Seo et al., 2017; Clark and Gardner, 2018; Peters et al., 2018a; Hu et al., 2018)。SQuAD 排行榜上的顶级结果没有最新的公开系统描述可用,并且在训练系统时允许使用任何公开数据。因此,我们在系统中使用适度的数据增强:先在 TriviaQA (Joshi et al., 2017) 上微调,再在 SQuAD 上微调。

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. Without TriviaQA fine-

我们表现最好的系统在集成时比排行榜第一的系统高出 +1.5 F1,作为单一系统时高出 +1.3 F1。实际上,我们的单个 BERT 模型在 F1 分数上超过了最优集成系统。没有 TriviaQA 微调

| 系统 | Dev EM | Dev F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| 排行榜顶级系统 (2018年12月10日) | | | | |
| 人类 | - | - | 82.3 | 91.2 |
| #1 组合 - nlnet | - | - | 86.0 | 91.7 |
| #2 组合 - QANet | - | - | 84.5 | 90.5 |
| 已发布 | | | | |
| BiDAF+ELMo (单模型) | - | 85.6 | - | 85.8 |
| R.M. Reader (组合) | 81.2 | 87.9 | 82.3 | 88.5 |
| 我们的 | | | | |
| BERTBASE (单模型) | 80.8 | 88.5 | - | - |
| BERTLARGE (单模型) | 84.1 | 90.9 | - | - |
| BERTLARGE (组合) | 85.8 | 91.8 | - | - |
| BERTLARGE (单模型+TriviaQA) | 84.2 | 91.1 | 85.1 | 91.8 |
| BERTLARGE (组合+TriviaQA) | 86.2 | 92.2 | 87.4 | 93.2 |

Table 2: SQuAD 1.1 results. The BERT ensemble consists of 7 systems which use different pre-training checkpoints and fine-tuning seeds.

表 2: SQuAD 1.1 结果。BERT 集成由 7 个使用不同预训练检查点和微调种子的系统组成。

| System | Dev EM | Dev F1 | Test EM | Test F1 |
| --- | --- | --- | --- | --- |
| Top Leaderboard Systems (Dec 10th, 2018) | | | | |
| Human | 86.3 | 89.0 | 86.9 | 89.5 |
| #1 Single - MIR-MRC (F-Net) | - | - | 74.8 | 78.0 |
| #2 Single - nlnet | - | - | 74.2 | 77.1 |
| Published | | | | |
| unet (Ensemble) | - | - | 71.4 | 74.9 |
| SLQA+ (Single) | - | - | 71.4 | 74.4 |
| Ours | | | | |
| BERTLARGE (Single) | 78.7 | 81.9 | 80.0 | 83.1 |

Table 3: SQuAD 2.0 results. We exclude entries that use BERT as one of their components.

表 3: SQuAD 2.0 结果。我们排除了使用 BERT 作为其组件之一的条目。

tuning data, we only lose 0.1-0.4 F1, still outperforming all existing systems by a wide margin.12

调优数据时,我们仅损失了 0.1-0.4 F1,仍然大幅优于所有现有系统。12


4.3 SQuAD v2.0

4.3 SQuAD v2.0

The SQuAD 2.0 task extends the SQuAD 1.1 problem definition by allowing for the possibility that no short answer exists in the provided paragraph, making the problem more realistic.

SQuAD 2.0 任务通过允许提供的段落中可能不存在简短答案,扩展了 SQuAD 1.1 的问题定义,使问题更加现实。

We use a simple approach to extend the SQuAD v1.1 BERT model for this task. We treat questions that do not have an answer as having an answer span with start and end at the [CLS] token. The probability space for the start and end answer span positions is extended to include the position of the [CLS] token. For prediction, we compare the score of the no-answer span $s_{\mathrm{null}}=S\cdot C+E\cdot C$ to the score of the best non-null span $\hat{s}_{i,j}=\max_{j\ge i} S\cdot T_{i}+E\cdot T_{j}$ . We predict a non-null answer when $\hat{s}_{i,j} > s_{\mathrm{null}} + \tau$ , where the threshold $\tau$ is selected on the dev set to maximize F1. We did not use TriviaQA data for this model. We fine-tuned for 2 epochs with a learning rate of 5e-5 and a batch size of 48.

我们使用一种简单的方法来扩展 SQuAD v1.1 BERT 模型以完成此任务。我们将没有答案的问题视为具有起始和结束于 [CLS] Token 的答案片段。起始和结束答案片段位置的概率空间被扩展为包括 [CLS] Token 的位置。在预测时,我们将无答案片段的得分 $s_{\mathrm{null}}=S\cdot C+E\cdot C$ 与最佳非空片段的得分 $\hat{s}_{i,j}=\max_{j\ge i} S\cdot T_{i}+E\cdot T_{j}$ 进行比较。当 $\hat{s}_{i,j} > s_{\mathrm{null}} + \tau$ 时,我们预测一个非空答案,其中阈值 $\tau$ 在开发集上选择以最大化 F1 分数。我们没有使用 TriviaQA 数据来训练这个模型。我们进行了 2 个 epoch 的微调,学习率为 5e-5,批量大小为 48。
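下面给出这一无答案判定逻辑的最小 numpy 示意 (假设 `T[0]` 即 [CLS] 的表示 $C$;命名为本文自拟):

```python
import numpy as np

def predict_span_or_null(T, S, E, tau=0.0):
    """SQuAD 2.0-style decision sketch.

    Assumes T[0] is the [CLS] representation C, so the no-answer
    score is s_null = S.C + E.C. Returns the best non-null span
    (i, j) when its score beats s_null + tau, else None.
    """
    C = T[0]
    s_null = S @ C + E @ C
    start, end = T @ S, T @ E
    best_score, best_ij = -np.inf, None
    n = T.shape[0]
    for i in range(1, n):          # skip the [CLS] position
        for j in range(i, n):      # enforce j >= i
            score = start[i] + end[j]
            if score > best_score:
                best_score, best_ij = score, (i, j)
    return best_ij if best_score > s_null + tau else None
```

阈值 `tau` 对应正文中在开发集上为最大化 F1 而选取的 $\tau$。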

Table 4: SWAG Dev and Test accuracies. †Human performance is measured with 100 samples, as reported in the SWAG paper.

表 4: SWAG Dev 和 Test 准确率。†人类表现是通过 100 个样本测量的,如 SWAG 论文 [20] 所报告。

| 系统 | Dev | Test |
| --- | --- | --- |
| ESIM+GloVe | 51.9 | 52.7 |
| ESIM+ELMo | 59.1 | 59.2 |
| OpenAI GPT | - | 78.0 |
| BERTBASE | 81.6 | - |
| BERTLARGE | 86.6 | 86.3 |
| 人类 (专家)† | - | 85.0 |
| 人类 (5 个标注)† | - | 88.0 |

The results compared to prior leaderboard entries and top published work (Sun et al., 2018; Wang et al., 2018b) are shown in Table 3, excluding systems that use BERT as one of their components. We observe a $+5.1$ F1 improvement over the previous best system.

与之前排行榜条目和顶级已发表工作 (Sun et al., 2018; Wang et al., 2018b) 的对比结果如表 3 所示,其中排除了使用 BERT 作为组件之一的系统。我们观察到相比之前的最佳系统 F1 提升了 +5.1。

4.4 SWAG

4.4 SWAG


The Situations With Adversarial Generations (SWAG) dataset contains $113\mathbf{k}$ sentence-pair completion examples that evaluate grounded commonsense inference (Zellers et al., 2018). Given a sentence, the task is to choose the most plausible continuation among four choices.

对抗性生成情况 (SWAG) 数据集包含 $113\mathbf{k}$ 个句子对补全示例,用于评估基于常识的推理 (Zellers et al., 2018)。给定一个句子,任务是从四个选项中选择最合理的续句。

When fine-tuning on the SWAG dataset, we construct four input sequences, each containing the concatenation of the given sentence (sentence A) and a possible continuation (sentence B). The only task-specific parameter introduced is a vector whose dot product with the [CLS] token representation $C$ denotes a score for each choice, which is normalized with a softmax layer.

在 SWAG 数据集上进行微调时,我们构建四个输入序列,每个序列由给定句子 (句子 A) 和一个可能的延续 (句子 B) 连接而成。引入的唯一任务特定参数是一个向量,它与 [CLS] Token 表示 $C$ 的点积作为每个选项的得分,并通过 softmax 层进行归一化。
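该打分方式可以用如下 numpy 片段示意 (假设 `C` 的每一行是一个候选序列的 [CLS] 表示,`v` 为那个唯一的任务特定向量;命名为本文自拟):

```python
import numpy as np

def choice_probabilities(C, v):
    """C: (4, H) [CLS] vectors, one per (sentence A, continuation B)
    packed sequence; v: the single task-specific vector.
    The score of choice k is v . C_k, normalized with a softmax
    over the four choices."""
    scores = C @ v
    exp = np.exp(scores - scores.max())  # stable softmax
    return exp / exp.sum()
```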

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16. Results are presented in Table 4. BERTLARGE outperforms the authors’ baseline ESIM+ELMo system by $+27.1%$ and OpenAI GPT by $8.3%$ .

我们对模型进行 3 个 epoch 的微调,学习率为 2e-5,批量大小为 16。结果如表 4 所示。BERTLARGE 比作者的基线 ESIM+ELMo 系统高出 27.1%,比 OpenAI GPT 高出 8.3%。


5 Ablation Studies

5 消融研究

In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance. Additional

在本节中,我们对 BERT 的多个方面进行消融实验,以更好地理解它们的相对重要性。额外的


| 任务 (Dev Set) | MNLI-m (Acc) | QNLI (Acc) | MRPC (Acc) | SST-2 (Acc) | SQuAD (F1) |
| --- | --- | --- | --- | --- | --- |
| BERTBASE | 84.4 | 88.4 | 86.7 | 92.7 | 88.5 |
| No NSP | 83.9 | 84.9 | 86.5 | 92.6 | 87.9 |
| LTR & No NSP | 82.1 | 84.3 | 77.5 | 92.1 | 77.8 |
| + BiLSTM | 82.1 | 84.1 | 75.7 | 91.6 | 84.9 |

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sentence prediction, like OpenAI GPT. “+ BiLSTM” adds a randomly initialized BiLSTM on top of the “LTR & No NSP” model during fine-tuning.

表 5: 使用 BERTBASE 架构对预训练任务的消融研究。“No NSP” 表示在没有下一句预测任务的情况下进行训练。“LTR & No NSP” 表示像 OpenAI GPT 一样,作为从左到右的语言模型 (LM) 进行训练,且没有下一句预测任务。“+ BiLSTM” 表示在微调期间,在 “LTR & No NSP” 模型之上添加一个随机初始化的双向 LSTM (BiLSTM)。


ablation studies can be found in Appendix C.

消融研究详见附录 C。

5.1 Effect of Pre-training Tasks

5.1 预训练任务的影响


We demonstrate the importance of the deep bidirectionality of BERT by evaluating two pre-training objectives using exactly the same pre-training data, fine-tuning scheme, and hyperparameters as BERTBASE:

我们通过使用与 BERTBASE 完全相同的预训练数据、微调方案和超参数,评估两个预训练目标,以此证明 BERT 的深度双向性的重要性:

No NSP: A bidirectional model which is trained using the “masked LM” (MLM) but without the “next sentence prediction” (NSP) task.

无 NSP: 一个使用“掩码语言模型” (MLM) 进行训练的双向模型,但不包括“下一句预测” (NSP) 任务。

LTR & No NSP: A left-context-only model which is trained using a standard Left-to-Right (LTR) LM, rather than an MLM. The left-only constraint was also applied at fine-tuning, because removing it introduced a pre-train/fine-tune mismatch that degraded downstream performance. Additionally, this model was pre-trained without the NSP task. This is directly comparable to OpenAI GPT, but using our larger training dataset, our input representation, and our fine-tuning scheme.

LTR & 无 NSP: 一个仅使用左侧上下文的模型,使用标准的从左到右 (LTR) 语言模型而非 MLM 进行训练。仅左侧上下文的约束同样应用于微调阶段,因为移除它会引入预训练/微调不匹配,从而降低下游任务的表现。此外,该模型在预训练时没有使用 NSP 任务。这与 OpenAI GPT 直接可比,但使用了我们更大的训练数据集、我们的输入表示以及我们的微调方案。

We first examine the impact brought by the NSP task. In Table 5, we show that removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD 1.1. Next, we evaluate the impact of training bidirectional representations by comparing “No NSP” to “LTR & No NSP”. The LTR model performs worse than the MLM model on all tasks, with large drops on MRPC and SQuAD.

我们首先考察 NSP 任务带来的影响。在表 5 中,我们显示移除 NSP 对 QNLI、MNLI 和 SQuAD 1.1 的性能有显著负面影响。接下来,我们通过比较“无 NSP”和“LTR & 无 NSP”来评估训练双向表示的影响。LTR 模型在所有任务上的表现都比 MLM 模型差,在 MRPC 和 SQuAD 上有较大下降。

For SQuAD it is intuitively clear that a LTR model will perform poorly at token predictions, since the token-level hidden states have no rightside context. In order to make a good faith attempt at strengthening the LTR system, we added a randomly initialized BiLSTM on top. This does significantly improve results on SQuAD, but the results are still far worse than those of the pretrained bidirectional models. The BiLSTM hurts performance on the GLUE tasks.

对于 SQuAD,直观上很明显,LTR 模型在 Token 预测上表现不佳,因为 Token 级别的隐藏状态没有右侧上下文。为了诚实地尝试加强 LTR 系统,我们在其上添加了一个随机初始化的 BiLSTM。这确实显著提高了 SQuAD 上的结果,但结果仍然远不如预训练的双向模型。BiLSTM 在 GLUE 任务上损害了性能。

We recognize that it would also be possible to train separate LTR and RTL models and represent each token as the concatenation of the two models, as ELMo does. However: (a) this is twice as expensive as a single bidirectional model; (b) this is non-intuitive for tasks like QA, since the RTL model would not be able to condition the answer on the question; (c) this is strictly less powerful than a deep bidirectional model, since the deep bidirectional model can use both left and right context at every layer.

我们认识到也可以像 ELMo 那样,训练单独的 LTR 和 RTL 模型,并将每个 Token 表示为两个模型输出的连接。然而:(a) 这比单个双向模型的开销高一倍;(b) 对于问答这样的任务,这并不直观,因为 RTL 模型无法以问题为条件来预测答案;(c) 这严格弱于深度双向模型,因为深度双向模型可以在每一层同时使用左右上下文。

5.2 Effect of Model Size

5.2 模型大小的影响


In this section, we explore the effect of model size on fine-tuning task accuracy. We trained a number of BERT models with a differing number of layers, hidden units, and attention heads, while otherwise using the same hyperparameters and training procedure as described previously.

在本节中,我们探讨了模型大小对微调任务准确性的影响。我们训练了多个 BERT 模型,这些模型具有不同数量的层、隐藏单元和注意力头,而在其他方面使用了与之前描述相同的超参数和训练过程。

Results on selected GLUE tasks are shown in Table 6. In this table, we report the average Dev Set accuracy from 5 random restarts of fine-tuning. We can see that larger models lead to a strict accuracy improvement across all four datasets, even for MRPC which only has 3,600 labeled training examples, and is substantially different from the pre-training tasks. It is also perhaps surprising that we are able to achieve such significant improvements on top of models which are already quite large relative to the existing literature. For example, the largest Transformer explored in Vaswani et al. (2017) is (L=6, H=1024, A=16) with 100M parameters for the encoder, and the largest Transformer we have found in the literature is (L=64, H=512, A=2) with 235M parameters (Al-Rfou et al., 2018). By contrast, BERTBASE contains 110M parameters and BERTLARGE contains 340M parameters.

选定的 GLUE 任务的结果如表 6 所示。在该表中,我们报告了 5 次随机重启微调的平均开发集准确率。我们可以看到,较大的模型在所有四个数据集上都带来严格的准确率提升,即使对于只有 3,600 个标注训练样本、且与预训练任务差异显著的 MRPC 也是如此。也许令人惊讶的是,我们能在相对于现有文献已经相当大的模型之上取得如此显著的改进。例如,Vaswani 等人 (2017) 探索的最大 Transformer 为 (L=6, H=1024, A=16),编码器包含 1 亿参数;我们在文献中找到的最大 Transformer 为 (L=64, H=512, A=2),包含 2.35 亿参数 (Al-Rfou 等人, 2018)。相比之下,BERTBASE 包含 1.1 亿参数,BERTLARGE 包含 3.4 亿参数。

It has long been known that increasing the model size will lead to continual improvements on large-scale tasks such as machine translation and language modeling, which is demonstrated by the LM perplexity of held-out training data shown in Table 6. However, we believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks, provided that the model has been sufficiently pre-trained. Peters et al. (2018b) presented mixed results on the downstream task impact of increasing the pre-trained bi-LM size from two to four layers and Melamud et al. (2016) mentioned in passing that increasing hidden dimension size from 200 to 600 helped, but increasing further to 1,000 did not bring further improvements. Both of these prior works used a featurebased approach — we hypothesize that when the model is fine-tuned directly on the downstream tasks and uses only a very small number of randomly initialized additional parameters, the taskspecific models can benefit from the larger, more expressive pre-trained representations even when downstream task data is very small.

长期以来,人们已经知道增加模型大小会在大规模任务(如机器翻译和语言建模)上带来持续的改进,这在表 6 所示的保留训练数据的 LM 困惑度中得到了证明。然而,我们认为这是第一个令人信服地证明扩展到极端模型大小也会在非常小规模的任务上带来显著改进的工作,前提是模型已经进行了充分的预训练。Peters 等 (2018b) 在下游任务的影响方面展示了混合结果,即从两层增加到四层的预训练双向语言模型 (bi-LM) 的效果不一;Melamud 等 (2016) 顺便提到,将隐藏维度大小从 200 增加到 600 有帮助,但进一步增加到 1,000 并未带来额外的改进。这两项先前的工作都使用了基于特征的方法——我们假设当模型直接在下游任务上进行微调并仅使用少量随机初始化的附加参数时,即使下游任务数据非常少,任务特定模型也可以从更大、更具表现力的预训练表示中受益。

5.3 Feature-based Approach with BERT

5.3 基于特征的方法与 BERT

All of the BERT results presented so far have used the fine-tuning approach, where a simple classification layer is added to the pre-trained model, and all parameters are jointly fine-tuned on a downstream task. However, the feature-based approach, where fixed features are extracted from the pretrained model, has certain advantages. First, not all tasks can be easily represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. Second, there are major computational benefits to pre-compute an expensive representation of the training data once and then run many experiments with cheaper models on top of this representation.

所有目前展示的 BERT 结果都使用了微调方法,其中在预训练模型上添加了一个简单的分类层,并且所有参数都在下游任务上联合微调。然而,基于特征的方法(即从预训练模型中提取固定特征)具有一些优势。首先,并非所有任务都可以轻松地用 Transformer 编码器架构表示,因此需要添加特定于任务的模型架构。其次,预先计算训练数据的昂贵表示一次,然后在此表示之上运行许多使用更便宜模型的实验,可以带来显著的计算效益。

In this section, we compare the two approaches by applying BERT to the CoNLL-2003 Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003). In the input to BERT, we use a case-preserving WordPiece model, and we include the maximal document context provided by the data. Following standard practice, we formulate this as a tagging task but do not use a CRF

在本节中,我们通过将 BERT 应用于 CoNLL-2003 命名实体识别 (NER) 任务 (Tjong Kim Sang 和 De Meulder, 2003) 来比较这两种方法。在 BERT 的输入中,我们使用保留大小写的 WordPiece 模型,并包含数据提供的最大文档上下文。按照标准做法,我们将此任务表述为标注任务,但不使用 CRF

| #L | #H | #A | LM (ppl) | Dev Set Accuracy |
| --- | --- | --- | --- | --- |
| 3 | 768 | 12 | | |
| 6 | 768 | 3 | | |
| 6 | 768 | 12 | | |
| 12 | 768 | 12 | | |
| 12 | 1024 | 16 | | |
| 24 | 1024 | 16 | | |

Table 6: Ablation over BERT model size. #L = the number of layers; #H = hidden size; #A = number of attention heads. “LM (ppl)” is the masked LM perplexity of held-out training data.

表 6: BERT 模型规模的消融实验。#L = 层数;#H = 隐藏层大小;#A = 注意力头数量。“LM (ppl)” 是保留训练数据上的掩码 LM (masked LM) 困惑度。

Table 7: CoNLL-2003 Named Entity Recognition results. Hyperparameters were selected using the Dev set. The reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

表 7: CoNLL-2003 命名实体识别结果。超参数是使用开发集选择的。报告的开发集和测试集分数是在使用这些超参数的情况下,经过 5 次随机重启后的平均值。

| 系统 | 开发集 F1 | 测试集 F1 |
| --- | --- | --- |
| ELMo (Peters et al., 2018a) | 95.7 | 92.2 |
| CVT (Clark et al., 2018) | - | 92.6 |
| CSE (Akbik et al., 2018) | - | 93.1 |
| 微调方法 | | |
| BERTLARGE | 96.6 | 92.8 |
| BERTBASE | 96.4 | 92.4 |
| 基于特征的方法 (BERTBASE) | | |
| 嵌入 (Embeddings) | 91.0 | - |
| 倒数第二隐藏层 | 95.6 | - |
| 最后隐藏层 | 94.9 | - |
| 最后四层加权和 | 95.9 | - |
| 最后四层连接 | 96.1 | - |
| 所有 12 层加权和 | 95.5 | - |

layer in the output. We use the representation of the first sub-token as the input to the token-level classifier over the NER label set.

层。我们使用第一个子 Token 的表示作为输入,传递给 NER 标签集上的 Token 级分类器。

To ablate the fine-tuning approach, we apply the feature-based approach by extracting the activations from one or more layers without fine-tuning any parameters of BERT. These contextual embeddings are used as input to a randomly initialized two-layer 768-dimensional BiLSTM before the classification layer.

为了对微调方法进行消融,我们采用基于特征的方法:从一个或多个层提取激活值,而不微调 BERT 的任何参数。这些上下文嵌入被用作分类层之前一个随机初始化的两层 768 维 BiLSTM 的输入。
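取层激活作为固定特征的做法 (例如表现最佳的 “最后四层连接”) 可以示意如下 (假设 `all_layers` 保存了冻结编码器所有层的激活;命名为本文自拟):

```python
import numpy as np

def concat_last_four_layers(all_layers):
    """Feature-based sketch: concatenate each token's representations
    from the top four hidden layers of the frozen pre-trained encoder.

    all_layers: array of shape (num_layers, seq_len, H).
    Returns (seq_len, 4*H) features, ready to feed a randomly
    initialized BiLSTM + classifier (not shown here)."""
    top4 = all_layers[-4:]                      # last four layers
    return np.concatenate(list(top4), axis=-1)  # (seq_len, 4*H)
```

由于编码器被冻结,这些特征可以对整个训练集预先计算一次,之后在其上反复训练更小的下游模型。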

Results are presented in Table 7. BERTLARGE performs competitively with state-of-the-art methods. The best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer, which is only 0.3 F1 behind fine-tuning the entire model. This demonstrates that BERT is effective for both fine-tuning and feature-based approaches.

结果如表 7 所示。BERTLARGE 的表现与最先进方法具有竞争力。表现最好的方法是连接预训练 Transformer 最上面四层的 Token 表示,这仅比微调整个模型落后 0.3 F1 。这表明 BERT 对于微调和基于特征的方法都有效。

6 Conclusion

6 结论

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks.

近期由于使用语言模型进行迁移学习所取得的经验性进展表明,丰富的无监督预训练是许多语言理解系统不可或缺的一部分。特别是,这些结果使得即使资源较少的任务也能从深度单向架构中受益。我们的主要贡献是进一步将这些发现推广到深度双向架构,使同一预训练模型能够成功应对广泛的自然语言处理任务。

References

参考文献

for natural language understanding. In Proceedings of the 2018 EMNLP Workshop Blackbox NLP: An- alyzing and Interpreting Neural Networks for NLP, pages 353–355.

用于自然语言理解。收录于 2018 年 EMNLP Workshop Blackbox NLP: 分析和解释用于 NLP 的神经网络,页面 353–355。

Wei Wang, Ming Yan, and Chen Wu. 2018b. Multigranularity hierarchical attention fusion networks for reading comprehension and question answering. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Wei Wang, Ming Yan, 和 Chen Wu. 2018b. 多粒度层次注意力融合网络用于阅读理解和问答。在第 56 届计算语言学协会年会论文集 (长文卷)。Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Alex Warstadt, Amanpreet Singh, 和 Samuel R Bowman. 2018. 神经网络可接受性判断 (Neural network acceptability judgments). arXiv 预印本 arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL.

Adina Williams, Nikita Nangia, 和 Samuel R Bowman. 2018. 通过推理进行句子理解的广泛覆盖挑战语料库。在 NAACL。

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey 等. 2016. Google 的神经机器翻译系统:弥合人类与机器翻译之间的差距. arXiv 预印本 arXiv:1609.08144.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328.

Jason Yosinski, Jeff Clune, Yoshua Bengio, 和 Hod Lipson. 2014. 深度神经网络中的特征可迁移性如何? 在 Advances in neural information processing systems 中,页面 3320–3328.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining local convolution with global self-attention for reading comprehension. In ICLR.

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, 和 Quoc V Le. 2018. QANet: 结合局部卷积与全局自注意力机制用于阅读理解。在 ICLR。

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. Swag: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rowan Zellers, Yonatan Bisk, Roy Schwartz, 和 Yejin Choi. 2018. SWAG: 一个大规模对抗性数据集用于基于常识推理 (adversarial dataset for grounded commonsense inference)。在 Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)。

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, 和 Sanja Fidler. 2015. 对齐书籍和电影:通过观看电影和阅读书籍实现故事般的视觉解释。在 Proceedings of the IEEE international conference on computer vision,第 19–27 页。

Appendix for “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”

附录 “BERT: 用于语言理解的深度双向 Transformer (Deep Bidirectional Transformers) 预训练”

We organize the appendix into three sections:

我们将附录组织成三个部分:

• Additional implementation details for BERT are presented in Appendix A;

• BERT 的其他实现细节见附录 A;

A Additional Details for BERT

A BERT 的附加细节

A.1 Illustration of the Pre-training Tasks

A.1 预训练任务的说明 (Illustration of the Pre-training Tasks)


We provide examples of the pre-training tasks in the following.

我们在以下提供预训练任务的示例。

Masked LM and the Masking Procedure Assuming the unlabeled sentence is my dog is hairy, and during the random masking procedure we chose the 4-th token (which corresponds to hairy), our masking procedure can be further illustrated by

掩码 LM 和掩码过程 假设未标注的句子是 my dog is hairy,在随机掩码过程中我们选择了第 4 个 Token (对应于 hairy),我们的掩码过程可以进一步说明为

The advantage of this procedure is that the Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for $1.5%$ of all tokens (i.e., $10%$ of $15%$ ), this does not seem to harm the model’s language understanding capability. In Section C.2, we evaluate the impact of this procedure.

这种方法的优点是,Transformer 编码器不知道哪些词会被要求预测或已被随机词替换,因此它被迫保持每个输入 Token 的分布式上下文表示。此外,由于随机替换仅发生在所有 Token 的 1.5% (即 15% 的 10%),这似乎不会损害模型的语言理解能力。在第 C.2 节中,我们评估了此过程的影响。
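上述掩码流程 (15% 选中;选中者中 80%/10%/10% 分别替换为 [MASK]/随机词/保持不变) 可以用纯 Python语言 示意如下 (命名为本文自拟):

```python
import random

def mask_for_mlm(tokens, vocab, select_rate=0.15, seed=0):
    """BERT-style masking sketch. Each token is selected with
    probability 15%; of the selected tokens, 80% are replaced by
    [MASK], 10% by a random vocabulary token, and 10% are left
    unchanged, so only 1.5% of all tokens get a random replacement."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < select_rate:
            targets[i] = tok              # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return masked, targets
```

注意预测目标始终是原始 Token,无论该位置最终呈现为 [MASK]、随机词还是原词。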

Compared to standard language model training, the masked LM only makes predictions on $15%$ of tokens in each batch, which suggests that more pre-training steps may be required for the model

相比于标准语言模型训练,掩码语言模型 (masked LM) 仅对每个批次中 $15%$ 的 Token 进行预测,这表明模型可能需要更多的预训练步骤。

Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-toleft LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

图 3: 预训练模型架构的差异。BERT 使用双向 Transformer。OpenAI GPT 使用从左到右的 Transformer。ELMo 使用独立训练的从左到右和从右到左的 LSTMs 的连接来为下游任务生成特征。在这三个模型中,只有 BERT 的表示在所有层中同时依赖于左右上下文。除了架构上的差异,BERT 和 OpenAI GPT 是微调方法,而 ELMo 是基于特征的方法。



to converge. In Section C.1 we demonstrate that MLM does converge marginally slower than a leftto-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.

在 C.1 节中我们证明了 MLM (masked language model) 的收敛速度确实比从左到右的模型(预测每个 Token)稍慢,但 MLM 模型的经验改进远远超过了增加的训练成本。

Next Sentence Prediction The next sentence prediction task can be illustrated in the following examples.

下一句预测任务可以通过以下例子来说明。

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also). The first sentence receives the A embedding and the second receives the B embedding. $50%$ of the time B is the actual next sentence that follows A and $50%$ of the time it is a random sentence, which is done for the “next sentence prediction” task. They are sampled such that the combined length is $\leq 512$ tokens. The LM masking is applied after WordPiece tokenization with a uniform masking rate of $15%$ , and no special consideration given to partial word pieces.

为了生成每个训练输入序列,我们从语料库中采样两个文本片段,我们称之为“句子”,尽管它们通常比单个句子长得多(但也可能更短)。第一个句子接收 A 嵌入,第二个句子接收 B 嵌入。50% 的时间 B 是实际跟随 A 的下一个句子,50% 的时间它是一个随机句子,这是为“下一句预测”任务而做的。它们被采样以确保组合长度 ≤512 个 Token。在 WordPiece tokenization 之后应用 LM 掩码,掩码率为 15%,并且不对部分词片段做特殊处理。
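上述 50/50 的句对采样可以示意如下 (未包含长度截断与掩码步骤;命名为本文自拟):

```python
import random

def sample_sentence_pair(spans, idx, rng):
    """NSP example sketch. spans is the list of corpus text spans;
    idx indexes span A. Half the time B is the true following span
    (label IsNext), half the time a random span (label NotNext)."""
    a = spans[idx]
    if rng.random() < 0.5 and idx + 1 < len(spans):
        return a, spans[idx + 1], "IsNext"
    return a, rng.choice(spans), "NotNext"
```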


epochs over the 3.3 billion word corpus. We use Adam with learning rate of 1e-4, $\beta_{1}=0.9$ , $\beta_{2}=0.999$ , L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers. We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu, following OpenAI GPT. The training loss is the sum of the mean masked LM likelihood and the mean next sentence prediction likelihood.

在 33 亿词的语料库上进行训练。我们使用 Adam 优化器,学习率为 1e-4,$\beta_{1}=0.9$,$\beta_{2}=0.999$,L2 权重衰减为 0.01,前 10,000 步进行学习率预热,之后学习率线性衰减。我们在所有层使用 0.1 的 dropout 概率。遵循 OpenAI GPT 的做法,我们使用 gelu 激活函数 (Hendrycks 和 Gimpel, 2016) 而非标准的 relu。训练损失是平均掩码语言模型似然与平均下一句预测似然之和。
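其中 “前 10,000 步线性预热、随后线性衰减” 的学习率调度可以示意为 (函数命名为本文自拟,默认值对应正文给出的预训练设置):

```python
def bert_learning_rate(step, base_lr=1e-4, warmup=10_000, total=1_000_000):
    """Linear warmup over the first 10,000 steps, then linear decay
    of the learning rate down to zero at the final training step."""
    if step < warmup:
        return base_lr * step / warmup        # ramp up from 0 to base_lr
    return base_lr * (total - step) / (total - warmup)  # decay to 0
```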

A.2 Pre-training Procedure

A.2 预训练过程

We train with batch size of 256 sequences (256 sequences × 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40

我们使用批大小为 256 个序列(256 个序列 * 512 个 Token = 128,000 个 Token/批)训练 1,000,000 步,这大约是 40

Training of BERTBASE was performed on 4 Cloud TPUs in Pod configuration (16 TPU chips total).13 Training of BERTLARGE was performed on 16 Cloud TPUs (64 TPU chips total). Each pretraining took 4 days to complete.

BERTBASE 的训练是在 4 个 Cloud TPU (Pod 配置,共 16 个 TPU 芯片) 上进行的。BERTLARGE 的训练是在 16 个 Cloud TPU (共 64 个 TPU 芯片) 上进行的。每次预训练耗时 4 天完成。

Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. To speed up pretraining in our experiments, we pre-train the model with sequence length of 128 for $90%$ of the steps. Then, we train the rest $10%$ of the steps with a sequence length of 512 to learn the positional embeddings.

更长的序列成本不成比例地高昂,因为注意力机制与序列长度呈二次关系。为了加快预训练速度,在我们的实验中,我们在 90% 的步骤中使用序列长度为 128 进行预训练。然后,我们用剩余 10% 的步骤对序列长度为 512 的数据进行训练,以学习位置嵌入。

A.3 Fine-tuning Procedure

A.3 微调过程 (Fine-tuning Procedure)

For fine-tuning, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs. The dropout probability was always kept at 0.1. The optimal hyperparameter values are task-specific, but we found the following range of possible values to work well across all tasks:

对于微调,大多数模型超参数与预训练时相同,例外的是批量大小、学习率和训练轮数。dropout 概率始终保持在 0.1 。最优超参数值是任务特定的,但我们发现以下范围的可能值在所有任务中表现良好:

• Batch size: 16, 32

• 批量大小:16,32

• Learning rate (Adam): 5e-5, 3e-5, 2e-5 • Number of epochs: 2, 3, 4

• 学习率 (Adam): 5e-5, 3e-5, 2e-5 • 训练轮数: 2, 3, 4

We also observed that large data sets (e.g., $100\mathrm{k}+$ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.

我们还观察到,大型数据集 (例如,$100\mathrm{k}+$ 标注训练样本) 对超参数选择的敏感度远低于小型数据集。微调通常非常快,因此合理的做法是对上述参数进行穷举搜索 (exhaustive search),并选择在开发集上表现最好的模型。
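按上述取值范围做穷举搜索时,共有 2×3×3 = 18 种组合,可以示意如下 (命名为本文自拟):

```python
from itertools import product

def fine_tuning_grid():
    """All 18 combinations of the ranges reported to work well:
    batch size {16, 32}, learning rate {5e-5, 3e-5, 2e-5},
    epochs {2, 3, 4}."""
    return [
        {"batch_size": b, "learning_rate": lr, "epochs": e}
        for b, lr, e in product([16, 32], [5e-5, 3e-5, 2e-5], [2, 3, 4])
    ]
```

实际使用时,对每种组合微调一次并保留开发集表现最好的那组超参数即可。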

A.4 Comparison of BERT, ELMo ,and OpenAI GPT

A.4 BERT、ELMo 和 OpenAI GPT 的比较


Here we study the differences in recent popular representation learning models, including ELMo, OpenAI GPT and BERT. The comparisons between the model architectures are shown visually in Figure 3. Note that in addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

我们研究了最近流行的表示学习模型之间的差异,包括 ELMo、OpenAI GPT 和 BERT。模型架构之间的比较如图 3 所示。请注意,除了架构上的差异外,BERT 和 OpenAI GPT 是微调方法,而 ELMo 是基于特征的方法。

图 3: ELMo、OpenAI GPT 与 BERT 模型架构的比较。

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally made to make it as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the bidirectionality and the two pre-training tasks presented in Section 3.1 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

与 BERT 最可比的现有预训练方法是 OpenAI GPT,它在大型文本语料库上训练了一个从左到右的 Transformer 语言模型 (LM)。实际上,BERT 的许多设计决策是有意为之,目的是使其尽可能接近 GPT,以便两种方法可以进行最小化的比较。本文的核心论点是,双向性和第 3.1 节中提出的两个预训练任务解释了大部分经验上的改进,但我们确实注意到 BERT 和 GPT 的训练方式之间还存在其他一些差异:

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the two pre-training tasks and the bidirectionality they enable.

为了隔离这些差异的影响,我们在第 5.1 节进行消融实验,结果表明大部分改进实际上来自于两个预训练任务以及它们所实现的双向性。

A.5 Illustrations of Fine-tuning on Different Tasks

A.5 不同任务微调的插图

The illustration of fine-tuning BERT on different tasks can be seen in Figure 4. Our task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, $E$ represents the input embedding, $T_{i}$ represents the contextual representation of token $i$ , [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

在不同任务上对 BERT 进行微调的示例见图 4。我们的特定任务模型是通过在 BERT 上添加一个额外的输出层形成的,因此只需要从头学习少量参数。在这些任务中,(a) 和 (b) 是序列级任务,而 (c) 和 (d) 是 Token 级任务。在图中,$E$ 表示输入嵌入,$T_{i}$ 表示第 $i$ 个 Token 的上下文表示,[CLS] 是分类输出的特殊符号,[SEP] 是用于分隔非连续 Token 序列的特殊符号。
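序列级任务的输出层可以用一个极简的草图来说明 (仅为示意,并非论文的实际 TensorFlow 实现):在 BERT 输出的 [CLS] 上下文向量之上加一个 softmax 分类层,只有 `W` 和 `b` 是需要从头学习的新参数。隐层维度 768 对应 BERTBASE,标签数 3 以 MNLI 为例。

```python
import numpy as np

HIDDEN, NUM_LABELS = 768, 3  # 以 MNLI 为例:蕴含 / 矛盾 / 中性

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(HIDDEN, NUM_LABELS))  # 新增的输出层权重
b = np.zeros(NUM_LABELS)

def classify(sequence_output):
    """sequence_output: 形状为 (seq_len, HIDDEN) 的 BERT 上下文表示。

    位置 0 即 [CLS] 向量;任务头只是一个仿射层加 softmax。
    """
    cls = sequence_output[0]        # 取 [CLS] 对应的上下文向量
    logits = cls @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()          # 各标签上的概率分布
```

Token 级任务 (如 NER 或 SQuAD) 的做法类似,只是把同一个仿射层套用到每个 $T_{i}$ 上。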

B Detailed Experimental Setup

B 详细的实验设置

B.1 Detailed Descriptions for the GLUE Benchmark Experiments.

B.1 GLUE 基准测试实验的详细描述

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018a):

GLUE 基准包括以下数据集,这些数据集的描述最初由 Wang et al. (2018a) 总结如下:

MNLI Multi-Genre Natural Language Inference is a large-scale, crowd sourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

MNLI 多体裁自然语言推理 (Multi-Genre Natural Language Inference) 是一个大规模、众包的蕴含分类任务 (Williams et al., 2018)。给定一对句子,目标是预测第二个句子相对于第一个句子是否为蕴含、矛盾或中性。

QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).

QQP Quora 问题对 (Quora Question Pairs) 是一个二分类任务,目标是确定在 Quora 上提出的两个问题是否语义等价 (Chen et al., 2018)。

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018a). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.

QNLI 问题自然语言推理是斯坦福问答数据集 (Stanford Question Answering Dataset) (Rajpurkar et al., 2016) 的一个版本,该版本已转换为二元分类任务 (Wang et al., 2018a)。正例是包含正确答案的 (问题, 句子) 对,而负例是来自同一段落但不包含答案的 (问题, 句子)。

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

SST-2 斯坦福情感树库是一个二元单句分类任务,由从电影评论中提取的句子组成,并有人工标注的情感 (Socher et al., 2013)。


Figure 4: Illustrations of Fine-tuning BERT on Different Tasks.


图 4: 不同任务上微调 BERT 的示例。

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

CoLA 语言可接受性语料库是一个二元单句分类任务,目标是预测一个英语句子在语言上是否“可接受” (Warstadt et al., 2018)。

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

STS-B 语义文本相似性基准是来自新闻标题和其他来源 (Cer et al., 2017) 的句子对集合。这些句子对被标注了 1 到 5 的分数,表示两个句子在语义上的相似程度。

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

MRPC Microsoft Research Paraphrase Corpus 包含从在线新闻来源自动提取的句子对,并有人工标注这些句子对在语义上是否等价 (Dolan and Brockett, 2005)。

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).

RTE 文本蕴含识别是一个二元蕴含任务,类似于 MNLI,但训练数据要少得多 (Bentivogli et al., 2009)。

WNLI Winograd NLI is a small natural language inference dataset (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset, and every trained system that’s been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set to be fair to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

WNLI Winograd NLI 是一个小的自然语言推理数据集 (Levesque et al., 2011)。GLUE 网页指出,该数据集的构建存在一些问题,并且每个提交到 GLUE 的训练系统的表现都比预测多数类别的 65.1 基准准确率差。因此,为了对 OpenAI GPT 公平起见,我们排除了这个数据集。对于我们的 GLUE 提交,我们始终预测多数类别。

C Additional Ablation Studies

C 额外的消融研究

C.1 Effect of Number of Training Steps

C.1 训练步数的影响

Figure 5 presents MNLI Dev accuracy after finetuning from a checkpoint that has been pre-trained for $k$ steps. This allows us to answer the following questions:

图 5 展示了基于预训练了 $k$ 步的检查点进行微调后的 MNLI Dev 准确率。这使我们能够回答以下问题:

  1. Question: Does BERT really need such a large amount of pre-training (128,000 words/batch × 1,000,000 steps) to achieve high fine-tuning accuracy?

  1. 问题:BERT 真的需要如此大量的预训练 (128,000 词/批次 × 1,000,000 步) 才能实现高微调精度吗?

Answer: Yes, BERTBASE achieves almost $1.0%$ additional accuracy on MNLI when trained on 1M steps compared to $500\mathrm{k}$ steps.

答:是的,BERTBASE 在 1M 步训练时相比 500k 步训练在 MNLI 上提高了近 1.0% 的准确率。

  2. Question: Does MLM pre-training converge slower than LTR pre-training, since only $15%$ of words are predicted in each batch rather than every word? Answer: The MLM model does converge slightly slower than the LTR model. However, in terms of absolute accuracy the MLM model begins to outperform the LTR model almost immediately.

  2. 问题:由于在每个批次中只预测了 $15%$ 的单词,而不是每个单词,MLM 预训练是否比 LTR 预训练收敛得更慢?回答:MLM 模型确实比 LTR 模型收敛稍慢。然而,在绝对准确率方面,MLM 模型几乎立即开始优于 LTR 模型。

C.2 Ablation for Different Masking Procedures

C.2 不同掩码过程的消融实验

In Section 3.1, we mention that BERT uses a mixed strategy for masking the target tokens when pre-training with the masked language model (MLM) objective. The following is an ablation study to evaluate the effect of different masking strategies.

在第 3.1 节中,我们提到 BERT 在使用掩码语言模型 (MLM) 目标进行预训练时,采用了一种混合策略来掩码目标 Token。以下是不同掩码策略效果的消融研究。

Note that the purpose of the masking strategies is to reduce the mismatch between pre-training and fine-tuning, as the [MASK] symbol never appears during the fine-tuning stage. We report the Dev results for both MNLI and NER. For NER, we report both fine-tuning and feature-based approaches, as we expect the mismatch will be amplified for the feature-based approach as the model will not have the chance to adjust the representations.

需要注意的是,掩码策略的目的是减少预训练和微调之间的不匹配,因为 [MASK] 符号在微调阶段从未出现。我们报告了 MNLI 和 NER 的开发集结果。对于 NER,我们报告了微调和基于特征的方法的结果,因为我们预计基于特征的方法会放大这种不匹配,因为模型没有机会调整表示。


Figure 5: Ablation over number of training steps. This shows the MNLI accuracy after fine-tuning, starting from model parameters that have been pre-trained for $k$ steps. The $\mathbf{X}$ -axis is the value of $k$ .

图 5: 训练步数的消融实验。这显示了从预训练了 $k$ 步的模型参数开始微调后的 MNLI 准确率。$\mathbf{X}$-轴是 $k$ 的值。

| MASK | SAME | RND | MNLI Fine-tune | NER Fine-tune | NER Feature-based |
| --- | --- | --- | --- | --- | --- |
| 80% | 10% | 10% | 84.2 | 95.4 | 94.9 |
| 100% | 0% | 0% | 84.3 | 94.9 | 94.0 |
| 80% | 0% | 20% | 84.1 | 95.2 | 94.6 |
| 80% | 20% | 0% | 84.4 | 95.2 | 94.7 |
| 0% | 20% | 80% | 83.7 | 94.8 | 94.6 |
| 0% | 0% | 100% | 83.6 | 94.9 | 94.6 |

Table 8: Ablation over different masking strategies.

表 8: 不同遮罩策略的消融实验。

The results are presented in Table 8. In the table, MASK means that we replace the target token with the [MASK] symbol for MLM; SAME means that we keep the target token as is; RND means that we replace the target token with another random token.

结果见表 8。在表中,MASK 表示我们将目标 Token 替换为 [MASK] 符号用于 MLM;SAME 表示我们保持目标 Token 不变;RND 表示我们将目标 Token 替换为另一个随机 Token。

The numbers in the left part of the table represent the probabilities of the specific strategies used during MLM pre-training (BERT uses $80%$ , $10%$ , $10%$ ). The right part of the table represents the Dev set results. For the feature-based approach, we concatenate the last 4 layers of BERT as the features, which was shown to be the best approach in Section 5.3.

表格左半部分的数字代表在 MLM 预训练期间使用特定策略的概率 (BERT 使用 $80%$ 、$10%$ 、$10%$ )。表格右半部分代表 Dev 集的结果。对于基于特征的方法,我们将 BERT 的最后 4 层拼接起来作为特征,第 5.3 节中显示这是最佳方法。

From the table it can be seen that fine-tuning is surprisingly robust to different masking strategies. However, as expected, using only the MASK strategy was problematic when applying the featurebased approach to NER. Interestingly, using only the RND strategy performs much worse than our strategy as well.

从表中可以看出,微调对不同的掩码策略表现出惊人的鲁棒性。然而,正如预期的那样,在应用基于特征的方法进行命名实体识别 (NER) 时,仅使用 MASK 策略存在问题。有趣的是,仅使用 RND 策略的表现也比我们的策略差得多。
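表 8 第一行所用的 $80%$ / $10%$ / $10%$ 混合策略可以用如下草图来示意 (仅为简化示意:真实实现作用于 WordPiece Token,并且会限制每个序列的预测数量上限):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """返回 (masked_tokens, target_positions),用于 MLM 预训练。

    每个 Token 以 15% 的概率被选为预测目标;被选中的 Token
    有 80% 的概率替换为 [MASK],10% 的概率保持不变 (SAME),
    10% 的概率替换为词表中的随机 Token (RND)。
    """
    rnd = random.Random(seed)
    masked = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if rnd.random() >= select_prob:
            continue                      # 未被选中,保持原样
        targets.append(i)
        r = rnd.random()
        if r < 0.8:
            masked[i] = MASK_TOKEN        # MASK 策略
        elif r < 0.9:
            pass                          # SAME 策略:保留原 Token
        else:
            masked[i] = rnd.choice(vocab) # RND 策略:随机替换
    return masked, targets
```

无论 Token 被替换与否,损失始终只在 `targets` 中的位置上计算,模型需在这些位置预测原始 Token。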
