[论文翻译]通过生成式预训练提高语言理解能力




Improving Language Understanding by Generative Pre-Training

通过生成式预训练提高语言理解能力

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever (OpenAI)
alec@openai.com, karthikn@openai.com, tim@openai.com, ilyasu@openai.com

Abstract

摘要

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

自然语言理解包括一系列多样化的任务,例如文本蕴含、问答系统、语义相似度评估和文档分类。尽管大规模未标注文本语料库非常丰富,但用于学习这些特定任务的标注数据却很少,这使得判别式训练的模型难以充分执行。我们证明,通过对一个包含多样化未标注文本的语料库进行生成式预训练,然后对每个特定任务进行判别式微调,可以在这些任务上取得显著提升。与以往的方法不同,我们在微调过程中使用任务感知输入转换,以实现有效的迁移,同时只需要对模型架构进行最小的改动。我们在广泛的自然语言理解基准测试中验证了我们方法的有效性。我们的通用任务无关模型优于使用专门为每个任务设计的架构的判别式训练模型,在所研究的12个任务中有9个任务显著超越了现有技术水平。例如,我们在常识推理(Stories Cloze Test)上取得了8.9%的绝对提升,在问答系统(RACE)上取得了5.7%的绝对提升,在文本蕴含(MultiNLI)上取得了1.5%的绝对提升。


1 Introduction

1 引言

The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pretrained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].

从原始文本中有效学习的能力对于减轻自然语言处理 (NLP) 中对监督学习的依赖至关重要。大多数深度学习方法需要大量的手动标注数据,这限制了它们在许多缺乏标注资源领域的适用性 [61]。在这些情况下,能够利用未标注数据中的语言信息的模型为收集更多标注提供了一种有价值的替代方案,而收集更多标注可能是耗时且昂贵的。此外,即使在有大量监督的情况下,以无监督方式学习良好的表示也可以显著提高性能。迄今为止,最有力的证据是预训练词嵌入 (pretrained word embeddings) 的广泛应用 [10, 39, 42],以改进各种 NLP 任务的性能 [8, 11, 26, 45]。

Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22], with each method outperforming the others on different tasks.1 Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture [43, 44], using intricate learning schemes [21] and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.

利用未标注文本中的词级别以上信息存在两个主要挑战。首先,尚不清楚哪种优化目标最有效地学习有助于迁移的文本表示。最近的研究探讨了各种目标,如语言模型 [44]、机器翻译 [38] 和篇章连贯性 [22],每种方法在不同任务上都优于其他方法。其次,对于如何将这些学到的表示有效迁移到目标任务上,目前还没有共识。现有技术包括对模型架构进行任务特定的修改 [43, 44]、使用复杂的训练方案 [21] 以及添加辅助学习目标 [50]。这些不确定性使得开发有效的半监督学习方法来处理语言变得困难。

In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.

在本文中,我们探索了一种半监督方法用于语言理解任务,该方法结合了无监督预训练和有监督微调。我们的目标是学习一种通用表示,这种表示可以很少的适应性转移到广泛的任务中。我们假设可以获得一个大规模的未标注文本语料库以及几个具有人工标注训练样本的数据集(目标任务)。我们的设置不要求这些目标任务与未标注语料库属于同一领域。我们采用两阶段训练程序。首先,我们在未标注数据上使用语言模型目标来学习神经网络模型的初始参数。随后,我们使用相应的有监督目标将这些参数适应于目标任务。

For our model architecture, we use the Transformer [62], which has been shown to perform strongly on various tasks such as machine translation [62], document generation [34], and syntactic parsing [29]. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.

对于我们的模型架构,我们使用了 Transformer [62],它在各种任务上表现出色,例如机器翻译 [62]、文档生成 [34] 和句法分析 [29]。这一模型选择为我们提供了更结构化的记忆来处理文本中的长期依赖关系,相比于循环网络等替代方案,在不同任务之间实现了稳健的迁移性能。在迁移过程中,我们利用了基于遍历风格方法的任务特定输入适应 [52],这些方法将结构化文本输入处理为单个连续的 Token 序列。正如我们在实验中所展示的,这些适应使我们能够在对预训练模型架构进行最小改动的情况下有效微调。

We evaluate our approach on four types of language understanding tasks – natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66] and 5.5% on the recently introduced GLUE multi-task benchmark [64]. We also analyzed zero-shot behaviors of the pre-trained model on four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks.

我们对四种类型的语言理解任务进行了评估——自然语言推理、问答、语义相似性和文本分类。我们的通用任务无关模型优于针对每个任务专门设计架构的判别训练模型,在所研究的12项任务中有9项显著改进了现有技术水平。例如,在常识推理(Stories Cloze Test)[40]上,我们实现了8.9%的绝对提升;在问答(RACE)[30]上,实现了5.7%的绝对提升;在文本蕴含(MultiNLI)[66]上,实现了1.5%的绝对提升;以及在最近引入的GLUE多任务基准测试[64]上,实现了5.5%的绝对提升。我们还分析了预训练模型在四种不同设置下的零样本 (Zero-shot) 表现,并证明它为下游任务获得了有用的语言知识。

2 Related Work

2 相关工作

Semi-supervised learning for NLP Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics.

半监督学习用于自然语言处理

我们的工作大致属于半监督学习用于自然语言的范畴。这一范式已经吸引了大量的关注,并被应用于诸如序列标注 [24, 33, 57] 或文本分类 [41, 70] 等任务。最早的方法使用未标注数据来计算词级或短语级统计信息,这些统计信息随后被用作监督模型中的特征 [33]。在过去的几年中,研究人员展示了使用在未标注语料库上训练的词嵌入 [11, 39, 42] 来提高各种任务性能的好处 [8, 11, 26, 45]。然而,这些方法主要转移的是词级别的信息,而我们的目标是捕捉更高层次的语义。

Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks [28, 32, 1, 36, 22, 12, 56, 31].

最近的方法已经研究了从无标签数据中学习和利用超过词级别的语义。短语级别或句子级别的嵌入,可以使用无标签语料库进行训练,已被用于将文本编码为适合各种目标任务的向量表示 [28, 32, 1, 36, 22, 12, 56, 31]。

Unsupervised pre-training Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification [20, 49, 63] and regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17] and machine translation [48].

无监督预训练

无监督预训练是半监督学习的一种特殊情况,其目标是找到一个好的初始化点,而不是修改有监督学习的目标。早期的工作探索了该技术在图像分类 [20, 49, 63] 和回归任务 [3] 中的应用。后续研究 [15] 表明,预训练作为一种正则化方案,能够使深度神经网络具有更好的泛化能力。在最近的研究中,该方法已被用于帮助训练深度神经网络执行各种任务,如图像分类 [69]、语音识别 [68]、实体消歧 [17] 和机器翻译 [48]。

The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection and story completion. Other approaches [43, 44, 38] use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.

与我们的工作最接近的是预先训练一个神经网络,使用语言建模目标,然后在监督下对目标任务进行微调。Dai 等人 [13] 和 Howard 和 Ruder [21] 采用这种方法来改进文本分类。然而,尽管预训练阶段有助于捕捉一些语言信息,他们使用的 LSTM 模型将其预测能力限制在较短范围内。相比之下,我们选择的 Transformer 网络使我们能够捕捉更长距离的语言结构,如我们在实验中所展示的那样。此外,我们还在更广泛的任务上展示了我们模型的有效性,包括自然语言推理、释义检测和故事完成。其他方法 [43, 44, 38] 在训练监督模型时,使用预训练的语言或机器翻译模型的隐藏表示作为辅助特征。这涉及到为每个单独的目标任务引入大量新参数,而我们在迁移过程中只需要对模型架构进行最小的更改。

Auxiliary training objectives Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.

辅助训练目标

添加辅助无监督训练目标是半监督学习的一种替代形式。Collobert 和 Weston [10] 的早期工作使用了各种辅助的 NLP 任务,如词性标注 (POS tagging)、分块 (chunking)、命名实体识别和语言建模,以改进语义角色标注。最近,Rei [50] 在其目标任务目标中添加了一个辅助的语言建模目标,并在序列标注任务上展示了性能提升。我们的实验也使用了辅助目标,但正如我们所展示的,无监督预训练已经学到了与目标任务相关的多个语言方面。

3 Framework

3 框架

Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.

我们的训练过程分为两个阶段。第一阶段是在大规模文本语料库上学习一个高容量的语言模型。随后是微调阶段,在此阶段中,我们使用带标签的数据将模型适应于判别任务。

3.1 Unsupervised pre-training

3.1 无监督预训练

Given an unsupervised corpus of tokens $\mathcal{U}=\{u_{1},\ldots,u_{n}\}$ , we use a standard language modeling objective to maximize the following likelihood:

给定一个无监督的 Token 语料库 $\mathcal{U}=\{u_{1},\ldots,u_{n}\}$ ,我们使用标准的语言模型目标来最大化以下似然:

$$
L_{1}(\mathcal{U})=\sum_{i}\log P(u_{i}|u_{i-k},\ldots,u_{i-1};\Theta)
$$

$$
L_{1}(\mathcal{U})=\sum_{i}\log P(u_{i}|u_{i-k},\ldots,u_{i-1};\Theta)
$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$ . These parameters are trained using stochastic gradient descent [51].

其中 $k$ 是上下文窗口的大小,条件概率 $P$ 使用参数为 $\Theta$ 的神经网络进行建模。这些参数通过随机梯度下降 [51] 进行训练。

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feed forward layers to produce an output distribution over target tokens:

在我们的实验中,我们使用多层 Transformer 解码器 [34] 作为语言模型,这是 Transformer [62] 的一个变体。该模型对输入上下文 Token 应用多头自注意力操作,然后通过位置感知前馈层生成目标 Token 的输出分布:

$$
\begin{aligned}
h_{0} &= U W_{e} + W_{p} \\
h_{l} &= \mathsf{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) &= \mathsf{softmax}(h_{n} W_{e}^{T})
\end{aligned}
$$

$$
\begin{aligned}
h_{0} &= U W_{e} + W_{p} \\
h_{l} &= \mathsf{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) &= \mathsf{softmax}(h_{n} W_{e}^{T})
\end{aligned}
$$

where $U=(u_{-k},\dotsc,u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_{e}$ is the token embedding matrix, and $W_{p}$ is the position embedding matrix.

其中 $U=(u_{-k},\dotsc,u_{-1})$ 是 Token 的上下文向量,$n$ 是层数,$W_{e}$ 是 Token 嵌入矩阵,$W_{p}$ 是位置嵌入矩阵。
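
To make the pre-training stage concrete, below is a minimal PyTorch sketch of a decoder-only Transformer language model and the Eq. 1 objective. This is an illustration written for this translation, not the authors' released code; the class and function names are our own, and the default sizes simply mirror the values reported later in Section 4.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerLM(nn.Module):
    """Hypothetical decoder-only Transformer LM matching the equations above."""
    def __init__(self, vocab_size=40000, ctx_len=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # W_e
        self.pos_emb = nn.Embedding(ctx_len, d_model)      # W_p (learned, not sinusoidal)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def hidden(self, tokens):
        """Return h_n, the final block's activation at every position."""
        t = tokens.size(1)
        pos = torch.arange(t, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)                       # h_0 = U W_e + W_p
        causal = torch.full((t, t), float("-inf"), device=tokens.device).triu(1)
        return self.blocks(h, mask=causal)                                 # h_l = transformer_block(h_{l-1})

    def forward(self, tokens):
        return self.hidden(tokens) @ self.tok_emb.weight.T                 # logits; P(u) = softmax(h_n W_e^T)

def lm_loss(model, tokens):
    """Average negative log P(u_i | u_{i-k}, ..., u_{i-1}); minimizing it maximizes Eq. 1."""
    logits = model(tokens[:, :-1])
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```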

3.2 Supervised fine-tuning

3.2 监督微调

After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $\mathcal{C}$ , where each instance consists of a sequence of input tokens, $x^{1},\ldots,x^{m}$ , along with a label $y$ . The inputs are passed through our pre-trained model to obtain the final transformer block’s activation $h_{l}^{m}$ , which is then fed into an added linear output layer with parameters $W_{y}$ to predict $y$ :

在使用公式 1 中的目标训练模型后,我们将参数适应于监督目标任务。我们假设有一个标注数据集 $\mathcal{C}$ ,其中每个实例由一个输入 Token 序列 $x^{1},\ldots,x^{m}$ 以及一个标签 $y$ 组成。输入通过我们预训练的模型传递,以获得最终的 Transformer 块的激活 $h_{l}^{m}$ ,然后将其输入到添加的线性输出层中,该层具有参数 $W_{y}$ ,用于预测 $y$ :

$$
P(y|x^{1},\ldots,x^{m})=\mathsf{softmax}(h_{l}^{m}W_{y}).
$$

$$
P(y|x^{1},\ldots,x^{m})=\mathsf{softmax}(h_{l}^{m}W_{y}).
$$

This gives us the following objective to maximize:

这给我们以下最大化目标:

$$
L_{2}(\mathcal{C})=\sum_{(x,y)}\log P(y|x^{1},\ldots,x^{m}).
$$

$$
L_{2}(\mathcal{C})=\sum_{(x,y)}\log P(y|x^{1},\ldots,x^{m}).
$$

We additionally found that including language modeling as an auxiliary objective to the fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. This is in line with prior work [50, 43], who also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$ ):

此外,我们还发现将语言建模作为微调的辅助目标有助于学习,具体表现在:(a) 提高监督模型的泛化能力,以及 (b) 加快收敛速度。这与之前的工作 [50, 43] 的观察一致,这些工作也注意到这种辅助目标可以提高性能。具体来说,我们优化以下目标(权重为 $\lambda$ ):

$$
L_{3}(\mathcal{C})=L_{2}(\mathcal{C})+\lambda*L_{1}(\mathcal{C})
$$

$$
L_{3}(\mathcal{C})=L_{2}(\mathcal{C})+\lambda * L_{1}(\mathcal{C})
$$

Overall, the only extra parameters we require during fine-tuning are $W_{y}$ , and embeddings for delimiter tokens (described below in Section 3.3).

总体而言,在微调过程中我们唯一需要的额外参数是 $W_{y}$ ,以及分隔符 Token 的嵌入(详见第 3.3 节)。
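
A minimal continuation of the hypothetical sketch from Section 3.1 (same imports and the TransformerLM class defined there): fine-tuning adds only the linear head $W_{y}$ on top of the final activation $h_{l}^{m}$, and the auxiliary language modeling loss is mixed in with weight $\lambda$. The names below are ours, not the authors' code.

```python
class ClassifierHead(nn.Module):
    """Pre-trained LM body plus the single new linear output layer W_y."""
    def __init__(self, lm, n_classes, d_model=768):
        super().__init__()
        self.lm = lm                                       # pre-trained TransformerLM
        self.w_y = nn.Linear(d_model, n_classes, bias=False)

    def forward(self, tokens):
        h = self.lm.hidden(tokens)                         # (batch, seq, d_model)
        return self.w_y(h[:, -1, :])                       # h_l^m at the last token -> P(y | x^1..x^m)

def finetune_loss(head, tokens, labels, lam=0.5):
    """L_3(C) = L_2(C) + lambda * L_1(C), with lambda = 0.5 as reported in Section 4.1."""
    l2 = F.cross_entropy(head(tokens), labels)             # supervised objective L_2
    l1 = lm_loss(head.lm, tokens)                          # auxiliary LM objective L_1 on C
    return l2 + lam * l1
```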


Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.

图 1: (left) Transformer 架构和本工作中使用的训练目标。 (right) 不同任务微调的输入转换。我们将所有结构化输入转换为 Token 序列,以供我们的预训练模型处理,随后是一个线性 + softmax 层。

3.3 Task-specific input transformations

3.3 任务特定的输入转换

For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens $(\langle s\rangle,\langle e\rangle)$ .

对于某些任务,如文本分类,我们可以直接微调模型,如上所述。其他一些任务,如问答或文本蕴含,具有结构化的输入,例如有序的句子对,或文档、问题和答案的三元组。由于我们的预训练模型是在连续的文本序列上进行训练的,因此我们需要对这些任务进行一些修改。之前的工作提出了在迁移表示之上学习特定于任务的架构 [44]。这种方法重新引入了大量特定于任务的定制,并且没有为这些额外的架构组件使用迁移学习。相反,我们采用遍历式方法 [52],将结构化输入转换为预训练模型可以处理的有序序列。这些输入转换使我们能够避免在不同任务之间对架构进行大量更改。以下简要描述这些输入转换,图 1 提供了视觉说明。所有转换包括添加随机初始化的开始和结束 Token (⟨s⟩, ⟨e⟩)。

Textual entailment For entailment tasks, we concatenate the premise $p$ and hypothesis $h$ token sequences, with a delimiter token (\$) in between.

文本蕴含

对于蕴含任务,我们将前提 $p$ 和假设 $h$ 的 Token 序列进行拼接,在中间使用一个分隔符 Token (\$)。

Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations $h_{l}^{m}$ which are added element-wise before being fed into the linear output layer.

相似度

对于相似性任务,被比较的两个句子没有固有的顺序。为了反映这一点,我们修改输入序列以包含两种可能的句子顺序(中间用分隔符隔开),并分别处理每个顺序以生成两个序列表示 $h_{l}^{m}$ ,这些表示在按元素相加后被送入线性输出层。

Question Answering and Commonsense Reasoning For these tasks, we are given a context document $z$ , a question $q$ , and a set of possible answers $\{a_{k}\}$ . We concatenate the document context and question with each possible answer, adding a delimiter token in between to get $[z; q; \$; a_{k}]$ . Each of these sequences is processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers.

问题回答和常识推理

对于这些任务,我们给定一个上下文文档 $z$ ,一个问题 $q$ ,以及一组可能的答案 $\{a_{k}\}$ 。我们将文档上下文和问题与每个可能的答案连接起来,在中间添加一个分隔符 Token ,以获得 $[z; q; \$; a_{k}]$ 。这些序列分别由我们的模型独立处理,然后通过 softmax 层进行归一化,以生成可能答案上的输出分布。
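
As an illustration of these traversal-style transformations, the sketch below builds the token sequences shown in Figure 1. The `encode()` helper and the special-token strings are hypothetical placeholders for the BPE vocabulary; they only show how the pieces are ordered.

```python
START, EXTRACT, DELIM = "<s>", "<e>", "$"   # randomly initialized special tokens

def encode(text):
    # Placeholder for the real byte-pair tokenizer.
    return text.split()

def entailment_input(premise, hypothesis):
    # [<s>; premise; $; hypothesis; <e>]
    return [START] + encode(premise) + [DELIM] + encode(hypothesis) + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # No inherent ordering: build both orders; the two final activations h_l^m
    # are summed element-wise before the linear output layer.
    return entailment_input(text_a, text_b), entailment_input(text_b, text_a)

def multiple_choice_inputs(context, question, answers):
    # One sequence [z; q; $; a_k] per candidate answer; a softmax over the
    # per-sequence scores gives the distribution over answers.
    return [[START] + encode(context) + encode(question) + [DELIM] + encode(a) + [EXTRACT]
            for a in answers]
```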

4 Experiments

4 实验

4.1 Setup

4.1 设置

Unsupervised pre-training We use the Books Corpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information. An alternative dataset, the 1B Word Benchmark, which is used by a similar approach, ELMo [44], is approximately the same size but is shuffled at a sentence level - destroying long-range structure. Our language model achieves a very low token level perplexity of 18.4 on this corpus.

无监督预训练

我们使用 Books Corpus 数据集 [71] 来训练语言模型。它包含超过 7,000 本独特的未出版书籍,涵盖冒险、奇幻和浪漫等多种类型。重要的是,它包含长段连续文本,这使得生成式模型能够学习依赖于长距离信息。另一种数据集是 1B Word Benchmark,该数据集被类似方法 ELMo [44] 使用,其大小大致相同,但以句子级别进行了打乱——破坏了长距离结构。我们的语言模型在这个语料库上达到了非常低的 Token 级别困惑度 18.4。

Table 1: A list of the different tasks and datasets used in our experiments.

表 1: 我们实验中使用的不同任务和数据集列表。

| 任务 | 数据集 |
| --- | --- |
| 自然语言推理 | SNLI [5], MultiNLI [66], Question NLI [64], RTE [4], SciTail [25] |
| 问答 | RACE [30], Story Cloze [40] |
| 句子相似度 | MSR Paraphrase Corpus [14], Quora Question Pairs [9], STS Benchmark [6] |
| 分类 | Stanford Sentiment Treebank-2 [54], CoLA [65] |

Model specifications Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of $2.5\mathrm{e-4}$ . The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of $N(0,0.02)$ was sufficient. We used a byte-pair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with $w=0.01$ on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library2 to clean the raw text in Books Corpus, standardize some punctuation and whitespace, and use the spaCy tokenizer.3

模型规格

我们的模型主要遵循原始的 Transformer 工作 [62]。我们训练了一个 12 层仅解码器的 Transformer,具有带掩码的自注意力头(768 维状态和 12 个注意力头)。对于位置感知前馈网络,我们使用了 3072 维的内部状态。我们使用了 Adam 优化方案 [27],最大学习率为 $2.5\mathrm{e-4}$ 。学习率在前 2000 次更新中从零线性增加,并使用余弦退火调度降至 0。我们在包含 64 个随机采样的连续序列(每个序列 512 个 Token)的小批量上训练了 100 个 epoch。由于在整个模型中广泛使用了层归一化 [2],简单的权重初始化为 $N(0, 0.02)$ 就足够了。我们使用了带有 40,000 次合并的字节对编码 (BPE) 词汇表 [53],以及残差、嵌入和注意力 dropout,dropout 率为 0.1 用于正则化。我们还采用了 [37] 中提出的 L2 正则化的修改版本,对所有非偏置或增益权重应用了 $w=0.01$ 。对于激活函数,我们使用了高斯误差线性单元 (GELU) [18]。我们使用了学习的位置嵌入,而不是原始工作中提出的正弦版本。我们使用 ftfy 库2 来清理 Books Corpus 中的原始文本,标准化一些标点符号和空白,并使用 spaCy 分词器3。
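
For concreteness, here is a small sketch of the pre-training learning-rate schedule described above: linear warmup from zero over the first 2,000 updates, then cosine annealing back to zero. The function name and the `total_steps` argument are our own; the number of steps would follow from 100 epochs of 64-sequence batches.

```python
import math

def pretrain_lr(step, total_steps, max_lr=2.5e-4, warmup=2000):
    """Linear warmup to max_lr, then cosine decay to 0 (illustrative, not the original code)."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```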

Fine-tuning details Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of $6.25\mathrm{e}{-5}$ and a batch size of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. $\lambda$ was set to 0.5.

微调细节

除非另有说明,我们重用无监督预训练的超参数设置。我们在分类器中添加了 dropout,比例为 0.1。对于大多数任务,我们使用 $6.25\mathrm{e}{-5}$ 的学习率和 32 的批量大小。我们的模型微调速度很快,3 个 epoch 的训练对于大多数情况已经足够。我们使用线性学习率衰减计划,并在前 0.2% 的训练过程中进行预热。$\lambda$ 被设置为 0.5。
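
Similarly, a sketch of the fine-tuning schedule quoted above: linear warmup over the first 0.2% of updates, then linear decay to zero. `total_steps` would be roughly (dataset size / 32) × 3 for a 3-epoch run; the helper is illustrative only.

```python
def finetune_lr(step, total_steps, max_lr=6.25e-5, warmup_frac=0.002):
    """Linear warmup over 0.2% of training, then linear decay to 0 (illustrative)."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup))
```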

4.2 Supervised fine-tuning

4.2 监督微调

We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark [64], which we make use of. Table 1 provides an overview of all the tasks and datasets.

我们在多种监督任务上进行实验,包括自然语言推理、问答、语义相似性和文本分类。其中一些任务是最近发布的 GLUE 多任务基准测试 [64] 的一部分,我们对其加以利用。表 1 概述了所有任务和数据集。