[Paper translation] Improving Language Understanding by Generative Pre-Training


Original paper: https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/GPT%E7%B3%BB%E5%88%97/gpt1-language_understanding_paper.pdf


Improving Language Understanding by Generative Pre-Training

通过生成式预训练提升语言理解能力

Alec Radford (alec@openai.com), Karthik Narasimhan (karthikn@openai.com), Tim Salimans (tim@openai.com), Ilya Sutskever (ilyasu@openai.com), OpenAI

Abstract

摘要

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

自然语言理解涵盖了多种多样的任务,如文本蕴含、问答、语义相似性评估和文档分类。尽管未标注的大规模文本语料库丰富,但用于学习这些特定任务的标注数据却相对稀缺,这使得判别式训练的模型难以充分表现。我们证明,通过在多样化的未标注文本语料库上进行生成式预训练,然后在每个特定任务上进行判别式微调,可以在这些任务上实现显著的提升。与之前的方法不同,我们在微调过程中利用任务感知的输入转换,以实现有效的迁移,同时只需对模型架构进行最小限度的修改。我们在多个自然语言理解的基准测试中展示了我们方法的有效性。我们的通用任务无关模型优于为每个任务专门设计的判别式训练模型,在研究的12个任务中有9个显著提升了当前技术水平。例如,我们在常识推理(Stories Cloze Test)上实现了8.9%的绝对提升,在问答(RACE)上实现了5.7%的提升,在文本蕴含(MultiNLI)上实现了1.5%的提升。

1 Introduction

1 引言

The ability to learn effectively from raw text is crucial to alleviating the dependence on supervised learning in natural language processing (NLP). Most deep learning methods require substantial amounts of manually labeled data, which restricts their applicability in many domains that suffer from a dearth of annotated resources [61]. In these situations, models that can leverage linguistic information from unlabeled data provide a valuable alternative to gathering more annotation, which can be time-consuming and expensive. Further, even in cases where considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. The most compelling evidence for this so far has been the extensive use of pretrained word embeddings [10, 39, 42] to improve performance on a range of NLP tasks [8, 11, 26, 45].

从原始文本中有效学习的能力对于减轻自然语言处理 (NLP) 中对监督学习的依赖至关重要。大多数深度学习方法需要大量手动标注的数据,这限制了它们在许多缺乏标注资源的领域中的适用性 [61]。在这些情况下,能够从未标注数据中利用语言信息的模型为收集更多标注提供了一种有价值的替代方案,而收集标注可能既耗时又昂贵。此外,即使在有大量监督数据可用的情况下,以无监督的方式学习良好的表示也可以显著提升性能。迄今为止,最有力的证据是预训练词嵌入 [10, 39, 42] 的广泛使用,以提高一系列 NLP 任务的性能 [8, 11, 26, 45]。

Leveraging more than word-level information from unlabeled text, however, is challenging for two main reasons. First, it is unclear what type of optimization objectives are most effective at learning text representations that are useful for transfer. Recent research has looked at various objectives such as language modeling [44], machine translation [38], and discourse coherence [22], with each method outperforming the others on different tasks. Second, there is no consensus on the most effective way to transfer these learned representations to the target task. Existing techniques involve a combination of making task-specific changes to the model architecture [43, 44], using intricate learning schemes [21], and adding auxiliary learning objectives [50]. These uncertainties have made it difficult to develop effective semi-supervised learning approaches for language processing.

然而,从未标注文本中利用超过词级别的信息具有挑战性,主要有两个原因。首先,尚不清楚哪种类型的优化目标在学习对迁移有用的文本表示方面最为有效。最近的研究探讨了多种目标,如语言建模 [44]、机器翻译 [38] 和语篇连贯性 [22],每种方法在不同任务上表现优于其他方法。其次,关于如何最有效地将这些学习到的表示迁移到目标任务上,目前尚无共识。现有技术包括对模型架构进行任务特定的修改 [43, 44]、使用复杂的学习方案 [21] 以及添加辅助学习目标 [50]。这些不确定性使得开发有效的语言处理半监督学习方法变得困难。

In this paper, we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks. We assume access to a large corpus of unlabeled text and several datasets with manually annotated training examples (target tasks). Our setup does not require these target tasks to be in the same domain as the unlabeled corpus. We employ a two-stage training procedure. First, we use a language modeling objective on the unlabeled data to learn the initial parameters of a neural network model. Subsequently, we adapt these parameters to a target task using the corresponding supervised objective.

在本文中,我们探索了一种结合无监督预训练和有监督微调的语言理解任务的半监督方法。我们的目标是学习一种通用的表示,这种表示只需少量适应即可迁移到广泛的任务中。我们假设可以访问大量未标注的文本语料库以及多个带有手动标注训练示例的数据集(目标任务)。我们的设置不要求这些目标任务与未标注语料库处于同一领域。我们采用了两阶段的训练过程。首先,我们在未标注数据上使用语言建模目标来学习神经网络模型的初始参数。随后,我们使用相应的有监督目标将这些参数适应到目标任务中。

For our model architecture, we use the Transformer [62], which has been shown to perform strongly on various tasks such as machine translation [62], document generation [34], and syntactic parsing [29]. This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks. During transfer, we utilize task-specific input adaptations derived from traversal-style approaches [52], which process structured text input as a single contiguous sequence of tokens. As we demonstrate in our experiments, these adaptations enable us to fine-tune effectively with minimal changes to the architecture of the pre-trained model.

在我们的模型架构中,我们使用了 Transformer [62],该模型在机器翻译 [62]、文档生成 [34] 和句法解析 [29] 等各种任务中表现出色。与循环网络等替代方案相比,这种模型选择为我们提供了更结构化的记忆,以处理文本中的长期依赖关系,从而在不同任务中实现稳健的迁移性能。在迁移过程中,我们利用了基于遍历式方法 [52] 的任务特定输入适配,这些方法将结构化文本输入处理为单个连续的 token 序列。正如我们在实验中所展示的,这些适配使我们能够在对预训练模型架构进行最小改动的情况下有效地进行微调。

We evaluate our approach on four types of language understanding tasks: natural language inference, question answering, semantic similarity, and text classification. Our general task-agnostic model outperforms discriminatively trained models that employ architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test) [40], 5.7% on question answering (RACE) [30], 1.5% on textual entailment (MultiNLI) [66], and 5.5% on the recently introduced GLUE multi-task benchmark [64]. We also analyze zero-shot behaviors of the pre-trained model in four different settings and demonstrate that it acquires useful linguistic knowledge for downstream tasks.

我们在四种类型的语言理解任务上评估了我们的方法——自然语言推理、问答、语义相似度和文本分类。我们的通用任务无关模型在12个研究任务中的9个上显著超越了为每个任务专门设计的架构的判别式训练模型,显著提升了当前技术水平。例如,我们在常识推理(Stories Cloze Test)[40]上实现了8.9%的绝对提升,在问答(RACE)[30]上提升了5.7%,在文本蕴含(MultiNLI)[66]上提升了1.5%,在最近引入的GLUE多任务基准[64]上提升了5.5%。我们还分析了预训练模型在四种不同设置下的零样本行为,并证明其获得了对下游任务有用的语言学知识。

2 Related Work

2 相关工作

Semi-supervised learning for NLP Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling [24, 33, 57] or text classification [41, 70]. The earliest approaches used unlabeled data to compute word-level or phrase-level statistics, which were then used as features in a supervised model [33]. Over the last few years, researchers have demonstrated the benefits of using word embeddings [11, 39, 42], which are trained on unlabeled corpora, to improve performance on a variety of tasks [8, 11, 26, 45]. These approaches, however, mainly transfer word-level information, whereas we aim to capture higher-level semantics.

自然语言处理中的半监督学习
我们的工作大致属于自然语言半监督学习的范畴。这一范式引起了广泛关注,并应用于序列标注 [24, 33, 57] 或文本分类 [41, 70] 等任务。最早的方法使用未标注数据来计算词级或短语级的统计信息,然后将其用作监督模型中的特征 [33]。在过去几年中,研究人员展示了使用在未标注语料库上训练的词嵌入 [11, 39, 42] 来提高各种任务性能的好处 [8, 11, 26, 45]。然而,这些方法主要传递词级信息,而我们的目标是捕捉更高层次的语义。

Recent approaches have investigated learning and utilizing more than word-level semantics from unlabeled data. Phrase-level or sentence-level embeddings, which can be trained using an unlabeled corpus, have been used to encode text into suitable vector representations for various target tasks [28, 32, 1, 36, 22, 12, 56, 31].

最近的研究方法探索了从未标注数据中学习和利用超越词级语义的信息。短语级或句子级嵌入可以通过未标注的语料库进行训练,并用于将文本编码为适合各种目标任务的向量表示 [28, 32, 1, 36, 22, 12, 56, 31]。

Unsupervised pre-training Unsupervised pre-training is a special case of semi-supervised learning where the goal is to find a good initialization point instead of modifying the supervised learning objective. Early works explored the use of the technique in image classification [20, 49, 63] and regression tasks [3]. Subsequent research [15] demonstrated that pre-training acts as a regularization scheme, enabling better generalization in deep neural networks. In recent work, the method has been used to help train deep neural networks on various tasks like image classification [69], speech recognition [68], entity disambiguation [17], and machine translation [48].

无监督预训练
无监督预训练是半监督学习的一种特殊情况,其目标是找到一个良好的初始化点,而不是修改监督学习的目标。早期研究探索了该技术在图像分类 [20, 49, 63] 和回归任务 [3] 中的应用。后续研究 [15] 表明,预训练作为一种正则化方案,能够在深度神经网络中实现更好的泛化。在最近的研究中,该方法被用于帮助训练深度神经网络,应用于图像分类 [69]、语音识别 [68]、实体消歧 [17] 和机器翻译 [48] 等各种任务。

The closest line of work to ours involves pre-training a neural network using a language modeling objective and then fine-tuning it on a target task with supervision. Dai et al. [13] and Howard and Ruder [21] follow this method to improve text classification. However, although the pre-training phase helps capture some linguistic information, their usage of LSTM models restricts their prediction ability to a short range. In contrast, our choice of transformer networks allows us to capture longer-range linguistic structure, as demonstrated in our experiments. Further, we also demonstrate the effectiveness of our model on a wider range of tasks including natural language inference, paraphrase detection, and story completion. Other approaches [43, 44, 38] use hidden representations from a pre-trained language or machine translation model as auxiliary features while training a supervised model on the target task. This involves a substantial amount of new parameters for each separate target task, whereas we require minimal changes to our model architecture during transfer.

与我们工作最接近的研究涉及使用语言建模目标预训练神经网络,然后在有监督的情况下对目标任务进行微调。Dai 等人 [13] 以及 Howard 和 Ruder [21] 遵循这种方法来改进文本分类。然而,尽管预训练阶段有助于捕捉一些语言信息,但他们使用的 LSTM 模型限制了其预测能力的范围。相比之下,我们选择的 Transformer 网络使我们能够捕捉更长范围的语言结构,正如我们的实验所展示的那样。此外,我们还在更广泛的任务上展示了我们模型的有效性,包括自然语言推理、释义检测和故事补全。其他方法 [43, 44, 38] 使用来自预训练语言或机器翻译模型的隐藏表示作为辅助特征,同时在目标任务上训练有监督的模型。这需要为每个单独的目标任务引入大量新参数,而我们在迁移过程中对模型架构的改动非常小。

Auxiliary training objectives Adding auxiliary unsupervised training objectives is an alternative form of semi-supervised learning. Early work by Collobert and Weston [10] used a wide variety of auxiliary NLP tasks such as POS tagging, chunking, named entity recognition, and language modeling to improve semantic role labeling. More recently, Rei [50] added an auxiliary language modeling objective to their target task objective and demonstrated performance gains on sequence labeling tasks. Our experiments also use an auxiliary objective, but as we show, unsupervised pre-training already learns several linguistic aspects relevant to target tasks.

辅助训练目标
添加辅助的无监督训练目标是半监督学习的另一种形式。Collobert 和 Weston [10] 的早期工作使用了多种辅助 NLP 任务,如词性标注 (POS tagging)、分块 (chunking)、命名实体识别 (named entity recognition) 和语言建模 (language modeling),以改进语义角色标注 (semantic role labeling)。最近,Rei [50] 在他们的目标任务目标中添加了一个辅助语言建模目标,并在序列标注任务中展示了性能提升。我们的实验也使用了辅助目标,但正如我们所展示的,无监督预训练已经学习了与目标任务相关的多个语言学方面。

3 Framework

3 框架

Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.

我们的训练过程分为两个阶段。第一阶段是在大量文本语料库上学习一个高容量的语言模型。随后是微调阶段,在这个阶段中,我们使用标注数据将模型适配到一个判别任务上。

3.1 Unsupervised pre-training

3.1 无监督预训练

Given an unsupervised corpus of tokens $\mathcal{U}=\{u_{1},\ldots,u_{n}\}$, we use a standard language modeling objective to maximize the following likelihood:

给定一个无监督的Token语料库 $\mathcal{U}=\{u_{1},\ldots,u_{n}\}$,我们使用标准的语言建模目标来最大化以下似然:

$$
L_{1}(\mathcal{U})=\sum_{i}\log P(u_{i}\mid u_{i-k},\ldots,u_{i-1};\Theta) \tag{1}
$$

where $k$ is the size of the context window, and the conditional probability $P$ is modeled using a neural network with parameters $\Theta$ . These parameters are trained using stochastic gradient descent [51].

其中 $k$ 是上下文窗口的大小,条件概率 $P$ 使用参数为 $\Theta$ 的神经网络建模。这些参数通过随机梯度下降法进行训练 [51]。
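As a rough illustration of Eq. 1, the sketch below sums the log-probability of each token given at most $k$ preceding tokens. The `ProbFn` interface, the `uniform_model` stand-in, and the toy corpus are assumptions made for this example; in the paper, $P$ is a neural network with parameters $\Theta$.

```python
import math
from typing import Callable, Sequence

# Hypothetical model interface: (context tokens, next token) -> probability.
ProbFn = Callable[[Sequence[str], str], float]


def lm_log_likelihood(tokens: Sequence[str], k: int, prob: ProbFn) -> float:
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})."""
    total = 0.0
    for i, token in enumerate(tokens):
        context = tokens[max(0, i - k):i]  # at most k preceding tokens
        total += math.log(prob(context, token))
    return total


def uniform_model(vocab_size: int) -> ProbFn:
    """Toy stand-in for the neural network: every token is equally likely."""
    return lambda context, token: 1.0 / vocab_size


corpus = ["the", "cat", "sat", "on", "the", "mat"]
score = lm_log_likelihood(corpus, k=3, prob=uniform_model(vocab_size=5))
```

Training maximizes this quantity with respect to the model parameters; the uniform stand-in has nothing to train and only shows how the window of size $k$ slides over the corpus.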

In our experiments, we use a multi-layer Transformer decoder [34] for the language model, which is a variant of the transformer [62]. This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feed forward layers to produce an output distribution over target tokens:

在我们的实验中,我们使用了一个多层 Transformer 解码器 [34] 作为语言模型,它是 Transformer [62] 的一个变体。该模型在输入上下文 Token 上应用多头自注意力操作,然后通过逐位置的前馈层生成目标 Token 的输出分布:

$$
\begin{aligned}
h_{0} &= U W_{e} + W_{p} \\
h_{l} &= \mathtt{transformer\_block}(h_{l-1}) \quad \forall l\in[1,n] \\
P(u) &= \mathtt{softmax}(h_{n}W_{e}^{T})
\end{aligned} \tag{2}
$$

where $U=(u_{-k},\dotsc,u_{-1})$ is the context vector of tokens, $n$ is the number of layers, $W_{e}$ is the token embedding matrix, and $W_{p}$ is the position embedding matrix.

其中 $U=(u_{-k},\dotsc,u_{-1})$ 是 Token 的上下文向量,$n$ 是层数,$W_{e}$ 是 Token 嵌入矩阵,$W_{p}$ 是位置嵌入矩阵。
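The three steps above (embed, apply $n$ blocks, project back through the tied embedding matrix) can be sketched in NumPy as follows. This is a simplified sketch under stated assumptions, not the paper's implementation: the block uses a single attention head, ReLU instead of GELU, and layer normalization without learned gain or bias, and all sizes and parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`) are chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def transformer_block(h, p):
    """Single-head masked self-attention followed by a position-wise feed-forward layer."""
    k = h.shape[0]
    q, key, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    scores = q @ key.T / np.sqrt(h.shape[1])
    mask = np.triu(np.ones((k, k), dtype=bool), 1)  # True above the diagonal
    scores = np.where(mask, -1e9, scores)           # block attention to later positions
    h = layer_norm(h + softmax(scores) @ v @ p["Wo"])
    ff = np.maximum(0.0, h @ p["W1"]) @ p["W2"]     # ReLU here; the paper uses GELU
    return layer_norm(h + ff)


def forward(token_ids, W_e, W_p, blocks):
    h = W_e[token_ids] + W_p[: len(token_ids)]  # h_0 = U W_e + W_p
    for p in blocks:                            # h_l = transformer_block(h_{l-1})
        h = transformer_block(h, p)
    return softmax(h @ W_e.T)                   # P(u) = softmax(h_n W_e^T)


rng = np.random.default_rng(0)
vocab, d, ctx, n_layers = 11, 8, 5, 2  # toy sizes, far smaller than the paper's
W_e = rng.normal(0, 0.02, (vocab, d))
W_p = rng.normal(0, 0.02, (ctx, d))
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, 4 * d), "W2": (4 * d, d)}
blocks = [{name: rng.normal(0, 0.02, s) for name, s in shapes.items()}
          for _ in range(n_layers)]
probs = forward(np.array([1, 4, 2, 7]), W_e, W_p, blocks)
```

Indexing `W_e[token_ids]` is equivalent to multiplying one-hot rows of $U$ by $W_{e}$. Because of the mask, the distribution at each position depends only on that position and the ones before it.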

3.2 Supervised fine-tuning

3.2 监督微调

After training the model with the objective in Eq. 1, we adapt the parameters to the supervised target task. We assume a labeled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^{1},\ldots,x^{m}$, along with a label $y$. The inputs are passed through our pre-trained model to obtain the final transformer block's activation $h_{l}^{m}$, which is then fed into an added linear output layer with parameters $W_{y}$ to predict $y$:

在使用公式 1 中的目标训练模型后,我们将参数调整到有监督的目标任务上。我们假设有一个带标签的数据集 $\mathcal{C}$,其中每个实例由一系列输入 Token $x^{1},\ldots,x^{m}$ 和一个标签 $y$ 组成。输入通过我们预训练的模型传递,以获得最终 Transformer 块的激活 $h_{l}^{m}$,然后将其输入到一个带有参数 $W_{y}$ 的线性输出层中以预测 $y$。

$$
P(y\mid x^{1},\ldots,x^{m})=\mathtt{softmax}(h_{l}^{m}W_{y}). \tag{3}
$$

This gives us the following objective to maximize:

这为我们提供了以下需要最大化的目标:

$$
L_{2}(\mathcal{C})=\sum_{(x,y)}\log P(y\mid x^{1},\ldots,x^{m}). \tag{4}
$$

We additionally found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model and (b) accelerating convergence. This is in line with prior work [50, 43], which also observed improved performance with such an auxiliary objective. Specifically, we optimize the following objective (with weight $\lambda$):

此外,我们发现将语言建模作为微调的辅助目标有助于学习,具体表现为:(a) 提高监督模型的泛化能力,(b) 加速收敛。这与之前的工作 [50, 43] 一致,他们也观察到使用这种辅助目标可以提升性能。具体来说,我们优化了以下目标(权重为 $\lambda$):

$$
L_{3}(\mathcal{C})=L_{2}(\mathcal{C})+\lambda\cdot L_{1}(\mathcal{C}) \tag{5}
$$

Overall, the only extra parameters we require during fine-tuning are $W_{y}$ and embeddings for delimiter tokens (described below in Section 3.3).

总体而言,我们在微调过程中唯一需要的额外参数是 $W_{y}$ 以及分隔符 Token 的嵌入(在下面的第 3.3 节中描述)。
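The supervised objective and its combination with the auxiliary language modeling term can be sketched as below. The logits, labels, and the placeholder value for $L_{1}$ are made-up numbers for illustration; the default weight of 0.5 is the value of $\lambda$ reported in the fine-tuning details.

```python
import math
from typing import Sequence


def classification_log_prob(logits: Sequence[float], label: int) -> float:
    """log softmax(h_l^m W_y)[label], computed stably from the output-layer logits."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return logits[label] - log_norm


def fine_tuning_objective(l2: float, l1: float, lam: float = 0.5) -> float:
    """L3(C) = L2(C) + lambda * L1(C); both terms are log-likelihoods to maximize."""
    return l2 + lam * l1


# Toy numbers (assumed): logits and gold labels for two labeled examples.
examples = [([2.0, 0.5, -1.0], 0), ([0.1, 0.3, 1.2], 2)]
l2 = sum(classification_log_prob(z, y) for z, y in examples)
l1 = -42.0  # placeholder for the LM log-likelihood on the same token sequences
l3 = fine_tuning_objective(l2, l1)
```

In practice a framework would minimize the negative of `l3`; the sign convention here follows the text, where each objective is maximized.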


Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer.

图 1: (左) 本工作中使用的 Transformer 架构和训练目标。(右) 针对不同任务进行微调的输入转换。我们将所有结构化输入转换为 Token 序列,由预训练模型处理,然后通过线性+softmax层。

3.3 Task-specific input transformations

3.3 特定任务的输入转换

For some tasks, like text classification, we can directly fine-tune our model as described above. Certain other tasks, like question answering or textual entailment, have structured inputs such as ordered sentence pairs, or triplets of document, question, and answers. Since our pre-trained model was trained on contiguous sequences of text, we require some modifications to apply it to these tasks. Previous work proposed learning task-specific architectures on top of transferred representations [44]. Such an approach re-introduces a significant amount of task-specific customization and does not use transfer learning for these additional architectural components. Instead, we use a traversal-style approach [52], where we convert structured inputs into an ordered sequence that our pre-trained model can process. These input transformations allow us to avoid making extensive changes to the architecture across tasks. We provide a brief description of these input transformations below, and Figure 1 provides a visual illustration. All transformations include adding randomly initialized start and end tokens $(\langle s\rangle,\langle e\rangle)$.

对于某些任务,如文本分类,我们可以直接按照上述方法对模型进行微调。其他一些任务,如问答或文本蕴含,具有结构化的输入,例如有序的句子对,或文档、问题和答案的三元组。由于我们的预训练模型是在连续的文本序列上训练的,因此需要对其进行一些修改才能应用于这些任务。之前的工作提出了在迁移表示的基础上学习任务特定的架构 [44]。这种方法重新引入了大量任务特定的定制,并且没有对这些额外的架构组件使用迁移学习。相反,我们使用了一种遍历式方法 [52],将结构化输入转换为我们的预训练模型可以处理的有序序列。这些输入转换使我们能够避免在任务之间对架构进行大量修改。我们在下面简要描述了这些输入转换,图 1 提供了可视化说明。所有转换都包括添加随机初始化的起始和结束 token $(\langle s\rangle,\langle e\rangle)$。

Textual entailment For entailment tasks, we concatenate the premise $p$ and hypothesis $h$ token sequences, with a delimiter token $(\$)$ in between.

文本蕴含
对于蕴含任务,我们将前提 $p$ 和假设 $h$ 的 Token 序列连接起来,中间用一个分隔符 Token $(\$)$ 隔开。

Similarity For similarity tasks, there is no inherent ordering of the two sentences being compared. To reflect this, we modify the input sequence to contain both possible sentence orderings (with a delimiter in between) and process each independently to produce two sequence representations $h_{l}^{m}$, which are added element-wise before being fed into the linear output layer.

相似性任务中,被比较的两个句子没有固有的顺序。为了反映这一点,我们修改输入序列,使其包含两种可能的句子顺序(中间用分隔符隔开),并分别处理以生成两个序列表示 $h_{l}^{m}$,然后将它们逐元素相加,再输入到线性输出层。
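To see why adding the two representations removes the dependence on sentence order, consider the sketch below. The `toy_encoder` is a deliberately order-sensitive stand-in for the pre-trained model, and the token spellings `<s>`, `$`, and `<e>` are assumptions for illustration.

```python
from typing import Callable, List, Sequence

START, DELIM, END = "<s>", "$", "<e>"  # assumed spellings of the special tokens
Encoder = Callable[[Sequence[str]], List[float]]  # hypothetical: tokens -> h_l^m


def similarity_representation(a: Sequence[str], b: Sequence[str],
                              encode: Encoder) -> List[float]:
    """Encode both orderings independently and add the two vectors element-wise."""
    h_ab = encode([START, *a, DELIM, *b, END])
    h_ba = encode([START, *b, DELIM, *a, END])
    return [x + y for x, y in zip(h_ab, h_ba)]


def toy_encoder(tokens: Sequence[str]) -> List[float]:
    """Order-sensitive stand-in for the pre-trained model (not a real Transformer)."""
    weighted = sum((i + 1) * len(t) for i, t in enumerate(tokens))
    return [float(weighted), float(len(tokens))]


s1 = ["a", "dog", "barks"]
s2 = ["the", "hound", "is", "barking"]
rep_12 = similarity_representation(s1, s2, toy_encoder)
rep_21 = similarity_representation(s2, s1, toy_encoder)
```

Although the encoder gives different outputs for the two orderings, `rep_12` and `rep_21` are identical, so the linear output layer sees the same input regardless of which sentence is listed first.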

Question Answering and Commonsense Reasoning For these tasks, we are given a context document $z$, a question $q$, and a set of possible answers $\{a_{k}\}$. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get $[z;q;\$;a_{k}]$. Each of these sequences is processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers.

问答与常识推理
对于这些任务,我们给定一个上下文文档 $z$、一个问题 $q$ 以及一组可能的答案 $\{a_{k}\}$。我们将文档上下文和问题与每个可能的答案连接起来,并在中间添加一个分隔符 token,得到 $[z;q;\$;a_{k}]$。这些序列中的每一个都通过我们的模型独立处理,然后通过 softmax 层进行归一化,以生成可能答案的输出分布。
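The entailment and multiple-choice transformations described above can be sketched as plain token-list manipulations. The special-token spellings and the length-based `score` function are assumptions for illustration; in the paper the score comes from the pre-trained model followed by the linear output layer.

```python
import math
from typing import Callable, List, Sequence

START, DELIM, END = "<s>", "$", "<e>"  # assumed spellings of the special tokens


def entailment_input(premise: Sequence[str], hypothesis: Sequence[str]) -> List[str]:
    """[<s>; premise; $; hypothesis; <e>]"""
    return [START, *premise, DELIM, *hypothesis, END]


def multiple_choice_inputs(doc: Sequence[str], question: Sequence[str],
                           answers: Sequence[Sequence[str]]) -> List[List[str]]:
    """One sequence [<s>; z; q; $; a_k; <e>] per candidate answer a_k."""
    return [[START, *doc, *question, DELIM, *a, END] for a in answers]


def answer_distribution(sequences: Sequence[Sequence[str]],
                        score: Callable[[Sequence[str]], float]) -> List[float]:
    """Score each sequence independently, then normalize the scores with a softmax."""
    scores = [score(seq) for seq in sequences]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


doc = "Tom went to the shop".split()
question = "What did Tom buy ?".split()
answers = [["milk"], ["a", "red", "car"]]
seqs = multiple_choice_inputs(doc, question, answers)
# Hypothetical scorer standing in for the model plus linear layer.
probs = answer_distribution(seqs, score=lambda seq: -0.1 * len(seq))
```

Each candidate answer gets its own full sequence, so the model never needs a task-specific architecture for handling a variable number of answers.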

4 Experiments

4 实验

4.1 Setup

4.1 设置

Unsupervised pre-training We use the BooksCorpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. Crucially, it contains long stretches of contiguous text, which allows the generative model to learn to condition on long-range information. An alternative dataset, the 1B Word Benchmark, which is used by a similar approach, ELMo [44], is approximately the same size but is shuffled at the sentence level, destroying long-range structure. Our language model achieves a very low token-level perplexity of 18.4 on this corpus.

无监督预训练 我们使用Books Corpus数据集[71]来训练语言模型。该数据集包含超过7,000本未出版的独特书籍,涵盖冒险、奇幻和浪漫等多种类型。关键的是,它包含了大量连续的文本,这使得生成式模型能够学习到长距离信息的条件。另一种数据集1B Word Benchmark,被类似的方法ELMo[44]使用,其规模大致相同,但在句子级别进行了打乱,破坏了长距离结构。我们的语言模型在该语料库上实现了非常低的Token级别困惑度,仅为18.4。
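For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses the standard definition with natural logarithms; the list of per-token log-probabilities is synthetic and only serves to show what a perplexity of 18.4 corresponds to.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean log-probability (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns every token probability 1/18.4 has perplexity 18.4.
uniform_like = [math.log(1 / 18.4)] * 1000
ppl = perplexity(uniform_like)
nats_per_token = math.log(18.4)  # average negative log-likelihood implied by 18.4
```

Intuitively, a perplexity of 18.4 means the model is, on average, as uncertain as if it were choosing uniformly among about 18 tokens at each position.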

Table 1: A list of the different tasks and datasets used in our experiments

表 1: 实验中使用的不同任务和数据集列表

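| Task | Datasets |
| --- | --- |
| Natural language inference | SNLI [5], MultiNLI [66], Question NLI [64], RTE [4], SciTail [25] |
| Question answering | RACE [30], Story Cloze [40] |
| Sentence similarity | MSR Paraphrase Corpus [14], Quora Question Pairs [9], STS Benchmark [6] |
| Classification | Stanford Sentiment Treebank-2 [54], CoLA [65] |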

Model specifications Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme [27] with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm [2] is used extensively throughout the model, a simple weight initialization of $N(0,0.02)$ was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in [37], with $w=0.01$ on all non-bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU) [18]. We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer.

模型规格 我们的模型主要遵循了原始 Transformer 工作 [62]。我们训练了一个 12 层的仅解码器 Transformer,带有掩码自注意力头(768 维状态和 12 个注意力头)。对于位置前馈网络,我们使用了 3072 维的内部状态。我们使用了 Adam 优化方案 [27],最大学习率为 2.5e-4。学习率在前 2000 次更新中从零线性增加,并使用余弦调度退火到 0。我们在 64 个随机采样的连续 512 Token 的小批量上训练了 100 个周期。由于在整个模型中广泛使用了层归一化 [2],简单的权重初始化 $N(0,0.02)$ 就足够了。我们使用了 40,000 次合并的字节对编码 (BPE) 词汇表 [53],以及残差、嵌入和注意力 dropout,正则化率为 0.1。我们还采用了 [37] 中提出的 L2 正则化修改版本,对所有非偏置或增益权重使用 $w=0.01$。对于激活函数,我们使用了高斯误差线性单元 (GELU) [18]。我们使用了学习到的位置嵌入,而不是原始工作中提出的正弦版本。我们使用 ftfy 库来清理 BooksCorpus 中的原始文本,标准化一些标点符号和空格,并使用 spaCy 分词器。
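The pre-training learning-rate schedule described above (linear warmup from zero over the first 2000 updates, then cosine annealing to 0, with a peak of 2.5e-4) might look like the sketch below. The total number of updates is not stated in the text, so `TOTAL` is an assumed value, and the function name is hypothetical.

```python
import math


def pretrain_lr(step: int, total_steps: int,
                max_lr: float = 2.5e-4, warmup: int = 2000) -> float:
    """Linear warmup from zero over `warmup` updates, then cosine annealing to 0."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


TOTAL = 100_000  # assumed; the text does not state the total number of updates
schedule = [pretrain_lr(s, TOTAL) for s in range(0, TOTAL + 1, 1000)]
```

The rate peaks exactly at the end of warmup and then decreases monotonically, reaching zero at the final update.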

Fine-tuning details Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batch size of 32. Our model fine-tunes quickly, and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. $\lambda$ was set to 0.5.

微调细节 除非另有说明,我们重用了无监督预训练的超参数设置。我们在分类器中添加了 dropout,比例为 0.1。对于大多数任务,我们使用学习率 $6.25\mathrm{e}{-5}$ 和批量大小为 32。我们的模型微调速度很快,大多数情况下 3 个训练周期就足够了。我们使用了线性学习率衰减计划,并在 $0.2%$ 的训练过程中进行预热。$\lambda$ 设置为 0.5。
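The fine-tuning schedule differs from pre-training in that the decay is linear and the warmup covers a fraction of training rather than a fixed number of updates. A minimal sketch, assuming a total of 10,000 updates (the text does not give a number) and a hypothetical function name:

```python
def finetune_lr(step: int, total_steps: int,
                max_lr: float = 6.25e-5, warmup_frac: float = 0.002) -> float:
    """Linear warmup over a fraction of training, then linear decay to zero."""
    warmup = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup))


TOTAL_UPDATES = 10_000  # assumed for illustration; 0.2% of this is 20 warmup updates
peak = finetune_lr(20, TOTAL_UPDATES)
```

With only 3 epochs of training, the warmup phase is very short, so almost the entire run is spent on the linear decay.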

4.2 Supervised fine-tuning

4.2 监督微调

We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. Some of these tasks are available as part of the recently released GLUE multi-task benchmark [64], which we make use of. Table 1 provides an overview of all the tasks and datasets.

我们在多种监督任务上进行了实验,包括自然语言推理、问答、语义相似度和文本分类。其中一些任务属于最近发布的 GLUE 多任务基准测试 [64] 的一部分,我们对此进行了利用。表 1 提供了所有任务和数据集的概览。

Natural Language Inference The task of natural language inference (NLI), also known as recognizing textual entailment, involves reading a pair of sentences and judging the relationship between them from one of entailment, contradiction, or neutral. Although there has been a lot of recent interest [58, 35, 44], the task remains challenging due to the presence of a wide variety of phenomena like lexical entailment, coreference, and lexical and syntactic ambiguity. We evaluate on five datasets with diverse sources, including image captions (SNLI), transcribed speech, popular fiction, and government reports (MNLI), Wikipedia articles (QNLI), science exams (SciTail), and news articles (RTE).

自然语言推理 (Natural Language Inference, NLI) 任务,也称为文本蕴含识别,涉及阅读一对句子并判断它们之间的关系是蕴含、矛盾还是中立。尽管最近有很多研究关注这一领域 [58, 35, 44],但由于存在词汇蕴含、共指、词汇和句法歧义等多种现象,该任务仍然具有挑战性。我们在五个不同来源的数据集上进行了评估,包括图像描述 (SNLI)、转录语音、流行小说和政府报告 (MNLI)、维基百科文章 (QNLI)、科学考试 (SciTail) 或新闻文章 (RTE)。

Table 2 details various results on the different NLI tasks for our model and previous state-of-the-art approaches. Our method significantly outperforms the baselines on four of the five datasets, achieving absolute improvements of up to 1.5% on MNLI, 5% on SciTail, 5.8% on QNLI, and 0.6% on SNLI over the previous best results. This demonstrates our model's ability to better reason over multiple sentences and handle aspects of linguistic ambiguity. On RTE, one of the smaller datasets we evaluate on (2490 examples), we achieve an accuracy of 56%, which is below the 61.7% reported by a multi-task biLSTM model. Given the strong performance of our approach on larger NLI datasets, it is likely our model will benefit from multi-task training as well, but we have not explored this currently.

表 2 详细列出了我们的模型与之前最先进方法在不同 NLI (自然语言推理) 任务上的各种结果。我们的方法在五个数据集中的四个上显著优于基线模型,在 MNLI 上实现了高达 $1.5%$ 的绝对提升,在 SciTail 上提升了 $5%$,在 QNLI 上提升了 $5.8%$,在 SNLI 上提升了 $0.6%$,均超过了之前的最佳结果。这表明我们的模型在多句子推理和处理语言歧义方面具有更好的能力。在 RTE 上(我们评估的较小数据集之一,包含 2490 个样本),我们的准确率为 $56%$,低于多任务 biLSTM 模型报告的 $61.7%$。鉴于我们的方法在较大 NLI 数据集上的强劲表现,我们的模型很可能也会从多任务训练中受益,但目前我们尚未对此进行探索。

Table 2: Experimental results on natural language inference tasks, comparing our model with current state-of-the-art methods. 5x indicates an ensemble of 5 models. All datasets use accuracy as the evaluation metric.

表 2: 自然语言推理任务的实验结果,将我们的模型与当前最先进的方法进行比较。5x 表示 5 个模型的集成。所有数据集都使用准确率作为评估指标。

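| Method | MNLI-m | MNLI-mm | SNLI | SciTail | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- |
| ESIM + ELMo [44] (5x) | - | - | 89.3 | - | - | - |
| CAFE [58] (5x) | 80.2 | 79.0 | 89.3 | - | - | - |
| Stochastic Answer Network [35] (3x) | 80.6 | 80.1 | - | - | - | - |
| CAFE [58] | 78.7 | 77.9 | 88.5 | 83.3 | - | - |
| GenSen [64] | 71.4 | 71.3 | - | - | 82.3 | 59.2 |
| Multi-task BiLSTM + Attn [64] | 72.2 | 72.1 | - | - | 82.1 | 61.7 |
| Finetuned Transformer LM (ours) | 82.1 | 81.4 | 89.9 | 88.3 | 88.1 | 56.0 |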

Table 3: Results on question answering and commonsense reasoning, comparing our model with current state-of-the-art methods. 9x means an ensemble of 9 models.

表 3: 问答和常识推理结果,将我们的模型与当前最先进的方法进行比较。9x 表示 9 个模型的集成。

| Method | Story Cloze | RACE-m | RACE-h | RACE |
| --- | --- | --- | --- | --- |
| val-LS-skip [55] | 76.5 | - | - | - |
| Hidden Coherence Model [7] | 77.6 | - | - | - |
| Dynamic Fusion Net [67] (9x) | - | 55.6 | 49.4 | 51.2 |
| BiAttention MRU [59] (9x) | - | 60.2 | 50.3 | 53.3 |
| Finetuned Transformer LM (ours) | 86.5 | 62.9 | 57.4 | 59.0 |

Question answering and commonsense reasoning Another task that requires aspects of single and multi-sentence reasoning is question answering. We use the recently released RACE dataset [30], consisting of English passages with associated questions from middle and high school exams. This corpus has been shown to contain more reasoning-type questions than other datasets like CNN [19] or SQuAD [47], providing the perfect evaluation for our model, which is trained to handle long-range contexts. In addition, we evaluate on the Story Cloze Test [40], which involves selecting the correct ending to multi-sentence stories from two options. On these tasks, our model again outperforms the previous best results by significant margins: up to 8.9% on Story Cloze and 5.7% overall on RACE. This demonstrates the ability of our model to handle long-range contexts effectively.

问答与常识推理
另一个需要单句和多句推理能力的任务是问答。我们使用了最近发布的 RACE 数据集 [30],该数据集包含来自中学和高中考试的英语文章及相关问题。与其他数据集(如 CNN [19] 或 SQuaD [47])相比,该语料库被证明包含更多推理类型的问题,为我们的模型提供了完美的评估场景,因为我们的模型经过训练能够处理长距离上下文。此外,我们还评估了 Story Cloze Test [40],该测试涉及从两个选项中选择多句故事的正确结尾。在这些任务中,我们的模型再次以显著优势超越了之前的最佳结果——在 Story Cloze 上提升了 $8.9%$,在 RACE 上整体提升了 $5.7%$。这证明了我们的模型能够有效处理长距离上下文。

Semantic Similarity Semantic similarity (or paraphrase detection) tasks involve predicting whether two sentences are semantically equivalent or not. The challenges lie in recognizing rephrasing of concepts, understanding negation, and handling syntactic ambiguity. We use three datasets for this task: the Microsoft Paraphrase corpus (MRPC) [14] (collected from news sources), the Quora Question Pairs (QQP) dataset [9], and the Semantic Textual Similarity benchmark (STS-B) [6]. We obtain state-of-the-art results on two of the three semantic similarity tasks (Table 4) with a 1 point absolute gain on STS-B. The performance delta on QQP is significant, with a 4.2% absolute improvement over Single-task BiLSTM + ELMo + Attn.

语义相似性
语义相似性(或复述检测)任务涉及预测两个句子在语义上是否等价。挑战在于识别概念的重述、理解否定以及处理句法歧义。我们为此任务使用了三个数据集:Microsoft Paraphrase语料库(MRPC)[14](从新闻来源收集)、Quora问题对(QQP)数据集[9]和语义文本相似性基准(STS-B)[6]。我们在三个语义相似性任务中的两个上取得了最先进的结果(表4),在STS-B上获得了1个百分点的绝对提升。QQP上的性能差异显著,相较于Single-task BiLSTM + ELMo + Attn,绝对提升了4.2%。

Classification Finally, we also evaluate on two different text classification tasks. The Corpus of Linguistic Acceptability (CoLA) [65] contains expert judgements on whether a sentence is grammatical or not, and tests the innate linguistic bias of trained models. The Stanford Sentiment Treebank (SST-2) [54], on the other hand, is a standard binary classification task. Our model obtains a score of 45.4 on CoLA, which is an especially big jump over the previous best result of 35.0, showcasing the innate linguistic bias learned by our model. The model also achieves 91.3% accuracy on SST-2, which is competitive with the state-of-the-art results. We also achieve an overall score of 72.8 on the GLUE benchmark, which is significantly better than the previous best of 68.9.

分类
最后,我们还在两个不同的文本分类任务上进行了评估。语言可接受性语料库 (CoLA) [65] 包含了专家对句子是否合乎语法的判断,测试了训练模型的固有语言偏差。另一方面,斯坦福情感树库 (SST-2) [54] 是一个标准的二分类任务。我们的模型在 CoLA 上获得了 45.4 的分数,相比之前的最佳结果 35.0 有了显著提升,展示了我们模型学习到的固有语言偏差。该模型在 SST-2 上也达到了 91.3% 的准确率,与最先进的结果相当。我们在 GLUE 基准测试中的总体得分为 72.8,显著优于之前的最佳得分 68.9。

Table 4: Semantic similarity and classification results, comparing our model with current state-of-the-art methods. All task evaluations in this table were done using the GLUE benchmark. (mc = Matthews correlation, acc = Accuracy, pc = Pearson correlation)

表 4: 语义相似度和分类结果,将我们的模型与当前最先进的方法进行比较。本表中的所有任务评估均使用 GLUE 基准进行。(mc = 马修斯相关系数, acc = 准确率, pc = 皮尔逊相关系数)

| Method | CoLA (mc) | SST2 (acc) | MRPC (F1) |
| --- | --- | --- | --- |
| Sparse byte mLSTM [16] | - | 93.2 | - |
| TF-KLD [23] | - | - | 86.0 |
| ECNU (mixed ensemble) [60] | - | - | - |
| Single-task BiLSTM + ELMo + Attn [64] | 35.0 | 90.2 | 80.2 |
| Multi-task BiLSTM + ELMo + Attn [64] | 18.9 | 91.6 | 83.5 |
| Finetuned Transformer LM (ours) | 45.4 | 91.3 | 82.3 |

Overall, our approach achieves new state-of-the-art results in 9 out of the 12 datasets we evaluate on, outperforming ensembles in many cases. Our results also indicate that our approach works well across datasets of different sizes, from smaller datasets such as STS-B ($\approx$5.7k training examples) to the largest one, SNLI ($\approx$550k training examples).

总体而言,我们的方法在评估的12个数据集中,有9个达到了新的最先进水平,在许多情况下优于集成方法。我们的结果还表明,我们的方法在不同规模的数据集上表现良好,从较小的数据集如STS-B(约5.7k训练样本)到最大的数据集SNLI(约550k训练样本)。

5 Analysis

5 分析

Impact of number of layers transferred We observed the impact of transferring a variable number of layers from unsupervised pre-training to the supervised target task. Figure 2 (left) illustrates the performance of our approach on MultiNLI and RACE as a function of the number of layers transferred. We observe the standard result that transferring embeddings improves performance and that each transformer layer provides further benefits, up to 9% for full transfer on MultiNLI. This indicates that each layer in the pre-trained model contains useful functionality for solving target tasks.

转移层数的影响 我们观察了从无监督预训练转移到有监督目标任务时,转移不同层数的影响。图 2(左)展示了我们的方法在 MultiNLI 和 RACE 上的性能随转移层数的变化情况。我们观察到一个标准结果:转移嵌入层可以提高性能,并且每个 Transformer 层在 MultiNLI 上的完全转移可以带来高达 $9%$ 的额外收益。这表明预训练模型中的每一层都包含解决目标任务的有用功能。
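
The layer-transfer experiment above can be pictured with a small sketch. Everything in it is an illustrative assumption rather than the authors' code: the parameter layout (a dict with an `embeddings` list and a `layers` list), the helper names `init_params` and `transfer_layers`, the `copy_embeddings` flag, and the 12-layer depth used in the usage note. Layers that are not transferred simply keep a fresh random initialization.

```python
import random


def init_params(n_layers, width, seed):
    """Hypothetical parameter store: one weight vector per component."""
    rng = random.Random(seed)

    def fresh():
        return [rng.gauss(0.0, 0.02) for _ in range(width)]

    return {"embeddings": fresh(), "layers": [fresh() for _ in range(n_layers)]}


def transfer_layers(pretrained, n_transfer, copy_embeddings=True, seed=0):
    """Build a fine-tuning initialization from a pre-trained model.

    The first `n_transfer` layers (and, by assumption, the embeddings) are
    copied from `pretrained`; every other component is freshly initialized.
    """
    n_layers = len(pretrained["layers"])
    if not 0 <= n_transfer <= n_layers:
        raise ValueError("n_transfer must be between 0 and the model depth")
    width = len(pretrained["embeddings"])
    model = init_params(n_layers, width, seed)
    if copy_embeddings:
        model["embeddings"] = list(pretrained["embeddings"])
    for i in range(n_transfer):
        model["layers"][i] = list(pretrained["layers"][i])
    return model
```

For example, `transfer_layers(init_params(12, 4, seed=1), n_transfer=3, seed=2)` yields a model whose embeddings and first three layers match the pre-trained ones while the remaining nine layers are newly initialized. Sweeping `n_transfer` from 0 to the full depth and fine-tuning each variant would correspond to one curve in Figure 2 (left).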


Figure 2: (left) Effect of transferring an increasing number of layers from the pre-trained language model on RACE and MultiNLI. (right) Plot showing the evolution of zero-shot performance on different tasks as a function of LM pre-training updates. Performance per task is normalized between a random guess baseline and the current state-of-the-art with a single model.

图 2: (左) 从预训练语言模型中迁移越来越多层对 RACE 和 MultiNLI 的影响。(右) 不同任务的零样本性能随着语言模型预训练更新次数的变化图。每个任务的性能在随机猜测基线和当前单模型的最新技术水平之间进行了归一化。

Table 5: Analysis of various model ablations on different tasks. Avg. score is an unweighted average of all the results. (mc = Matthews correlation, acc = Accuracy, pc = Pearson correlation)

表 5: 不同任务上各种模型消融分析。平均分数是所有结果的未加权平均值。(mc = 马修斯相关系数,acc = 准确率,pc = 皮尔逊相关系数)

| Method | Avg. Score | CoLA (mc) | SST2 (acc) | MRPC (F1) | STSB (pc) | QQP (F1) | MNLI (acc) | QNLI (acc) | RTE (acc) |
|---|---|---|---|---|---|---|---|---|---|
| Transformer w/ aux LM (full) | 74.7 | 45.4 | 91.3 | 82.3 | 82.0 | 70.3 | 81.8 | 88.1 | 56.0 |
| Transformer w/o pre-training | 59.9 | 18.9 | 84.0 | 79.4 | 30.9 | 65.5 | 75.7 | 71.2 | 53.8 |
| Transformer w/o aux LM | 75.0 | 47.9 | 92.0 | 84.9 | 83.2 | 69.8 | 81.1 | 86.9 | 54.4 |
| LSTM w/ aux LM | 69.1 | 30.3 | 90.5 | 83.2 | 71.8 | 68.1 | 73.7 | 81.1 | 54.6 |

Zero-shot Behaviors We'd like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised fine-tuning. We visualize the effectiveness of these heuristic solutions over the course of generative pre-training in Figure 2 (right). We observe that the performance of these heuristics is stable and steadily increases over training, suggesting that generative pre-training supports the learning of a wide variety of task-relevant functionality. We also observe that the LSTM exhibits higher variance in its zero-shot performance, suggesting that the inductive bias of the Transformer architecture assists in transfer.

零样本行为 我们想更好地理解为什么 Transformer 的语言模型预训练是有效的。一个假设是,底层的生成模型为了提高其语言建模能力,学会了执行我们评估的许多任务,并且与 LSTM 相比,Transformer 更结构化的注意力记忆有助于迁移。我们设计了一系列启发式解决方案,利用底层生成模型在没有监督微调的情况下执行任务。我们在图 2(右)中可视化了这些启发式解决方案在生成式预训练过程中的有效性。我们观察到这些启发式方法的性能稳定,并在训练过程中稳步提升,这表明生成式预训练支持学习各种任务相关的功能。我们还观察到 LSTM 在零样本性能方面表现出更高的方差,这表明 Transformer 架构的归纳偏差有助于迁移。

For CoLA (linguistic acceptability), examples are scored as the average token log-probability the generative model assigns, and predictions are made by thresholding. For SST-2 (sentiment analysis), we append the token "very" to each example, restrict the language model's output distribution to only the words "positive" and "negative", and take the token it assigns higher probability to as the prediction. For RACE (question answering), we pick the answer to which the generative model assigns the highest average token log-probability when conditioned on the document and question. For DPRD [46] (Winograd schemas), we replace the definite pronoun with the two possible referents and predict the resolution for which the generative model assigns higher average token log-probability to the rest of the sequence after the substitution.

对于 CoLA(语言可接受性),示例的评分是生成模型分配的平均 token 对数概率,并通过阈值进行预测。对于 SST-2(情感分析),我们在每个示例后附加 token "very",并将语言模型的输出分布限制为仅包含单词 "positive" 和 "negative",并猜测模型分配较高概率的 token 作为预测。对于 RACE(问答),我们选择生成模型在文档和问题条件下分配最高平均 token 对数概率的答案。对于 DPRD [46](winograd 模式),我们用两个可能的指代替换定代词,并预测生成模型在替换后对序列其余部分分配较高平均 token 对数概率的解析。
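
The CoLA, SST-2 and RACE heuristics described above all reduce to comparing log-probabilities under a generative model, which the following minimal sketch illustrates. It is not the authors' implementation: the pre-trained Transformer is replaced by a hypothetical add-one-smoothed trigram model (`ToyTrigramLM`) trained on a made-up five-sentence corpus, so it only conditions on the last two tokens, and the function names and the threshold are assumptions chosen for illustration. The DPRD heuristic is omitted.

```python
import math
from collections import Counter

PAD = "<s>"


class ToyTrigramLM:
    """Add-one-smoothed trigram model; a hypothetical stand-in for the pre-trained LM."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.trigram_counts = Counter()
        self.vocab = {PAD}
        for sentence in corpus:
            tokens = [PAD, PAD] + sentence.split()
            self.vocab.update(tokens)
            for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
                self.context_counts[(a, b)] += 1
                self.trigram_counts[(a, b, c)] += 1

    def logprob(self, history, token):
        """log P(token | last two tokens of history)."""
        a, b = ([PAD, PAD] + list(history))[-2:]
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        count = self.trigram_counts[(a, b, token)] + 1
        return math.log(count / (self.context_counts[(a, b)] + vocab_size))


def avg_logprob(lm, tokens, context=()):
    """Average per-token log-probability of `tokens`, conditioned on `context`."""
    history = list(context)
    total = 0.0
    for token in tokens:
        total += lm.logprob(history, token)
        history.append(token)
    return total / len(tokens)


def cola_acceptable(lm, sentence, threshold):
    """CoLA: threshold the average token log-probability of the sentence."""
    return avg_logprob(lm, sentence.split()) >= threshold


def sst2_sentiment(lm, review):
    """SST-2: append "very", then compare only "positive" and "negative"."""
    history = review.split() + ["very"]
    return max(("positive", "negative"), key=lambda w: lm.logprob(history, w))


def race_answer(lm, document, question, answers):
    """RACE: pick the answer with the highest average token log-probability."""
    context = document.split() + question.split()
    return max(answers, key=lambda a: avg_logprob(lm, a.split(), context))


# Made-up training text for the toy model (an assumption, not data from the paper).
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "it was great very positive",
    "it was awful very negative",
    "where did the cat sit ? on the mat",
]
lm = ToyTrigramLM(CORPUS)
```

With this toy model, `sst2_sentiment(lm, "it was great")` selects "positive", and `race_answer` prefers "on the mat" over "on the rug" for the cat question. The point of the sketch is the scoring interface: every heuristic needs only next-token log-probabilities, which is why it can be applied to a pre-trained language model without any supervised fine-tuning.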

Ablation studies We perform three different ablation studies (Table 5). First, we examine the performance of our method without the auxiliary LM objective during fine-tuning. We observe that the auxiliary objective helps on the NLI tasks and QQP. Overall, the trend suggests that larger datasets benefit from the auxiliary objective but smaller datasets do not. Second, we analyze the effect of the Transformer by comparing it with a single-layer 2048-unit LSTM using the same framework. We observe a 5.6 average score drop when using the LSTM instead of the Transformer. The LSTM only outperforms the Transformer on one dataset, MRPC. Finally, we also compare with our transformer architecture directly trained on supervised target tasks, without pre-training. We observe that the lack of pre-training hurts performance across all the tasks, resulting in a 14.8% decrease compared to our full model.

消融研究 我们进行了三项不同的消融研究(表 5)。首先,我们在微调期间检查了没有辅助语言模型目标时我们方法的性能。我们观察到辅助目标在 NLI 任务和 QQP 上有所帮助。总体趋势表明,较大的数据集从辅助目标中受益,而较小的数据集则不然。其次,我们通过将 Transformer 与使用相同框架的单层 2048 单元 LSTM 进行比较,分析了 Transformer 的效果。当使用 LSTM 而不是 Transformer 时,我们观察到平均分数下降了 5.6 分。LSTM 仅在一个数据集 MRPC 上优于 Transformer。最后,我们还与直接在监督目标任务上训练(没有预训练)的 Transformer 架构进行了比较。我们观察到,缺乏预训练会损害所有任务的性能,导致与我们的完整模型相比下降了 14.8%。
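
The ablation claims above can be checked directly against the numbers in Table 5. The short script below is only a reader's sanity check, with helper names (`average_drop`, `tasks_where_better`) invented for illustration. It recomputes the unweighted means, which match the reported averages to within 0.1 (small differences are presumably due to rounding of the per-task scores), and reproduces the 5.6 and 14.8 point gaps.

```python
TASKS = ["CoLA", "SST2", "MRPC", "STSB", "QQP", "MNLI", "QNLI", "RTE"]

# (reported average, per-task scores) copied from Table 5.
TABLE5 = {
    "full": (74.7, [45.4, 91.3, 82.3, 82.0, 70.3, 81.8, 88.1, 56.0]),
    "no_pretraining": (59.9, [18.9, 84.0, 79.4, 30.9, 65.5, 75.7, 71.2, 53.8]),
    "no_aux_lm": (75.0, [47.9, 92.0, 84.9, 83.2, 69.8, 81.1, 86.9, 54.4]),
    "lstm": (69.1, [30.3, 90.5, 83.2, 71.8, 68.1, 73.7, 81.1, 54.6]),
}


def unweighted_mean(scores):
    """Plain arithmetic mean over the eight task scores."""
    return sum(scores) / len(scores)


def average_drop(baseline, ablation):
    """Difference between the reported average scores, in points."""
    return round(TABLE5[baseline][0] - TABLE5[ablation][0], 1)


def tasks_where_better(a, b):
    """Tasks on which configuration `a` scores strictly higher than `b`."""
    return [t for t, x, y in zip(TASKS, TABLE5[a][1], TABLE5[b][1]) if x > y]


if __name__ == "__main__":
    for name, (reported, scores) in TABLE5.items():
        print(name, reported, round(unweighted_mean(scores), 3))
    print("LSTM drop:", average_drop("full", "lstm"))
    print("No pre-training drop:", average_drop("full", "no_pretraining"))
    print("Aux LM helps on:", tasks_where_better("full", "no_aux_lm"))
    print("LSTM wins on:", tasks_where_better("lstm", "full"))
```

Run as a script, this lists QQP, MNLI, QNLI and RTE as the tasks where the auxiliary objective helps, and MRPC as the only task where the LSTM beats the full Transformer, in line with the text.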

6 Conclusion

6 结论

We introduced a framework for achieving strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning. By pre-training on a diverse corpus with long stretches of contiguous text, our model acquires significant world knowledge and the ability to process long-range dependencies, which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets we study. Using unsupervised (pre-)training to boost performance on discriminative tasks has long been an important goal of Machine Learning research. Our work suggests that achieving significant performance gains is indeed possible, and offers hints as to what models (Transformers) and data sets (text with long range dependencies) work best with this approach. We hope that this will help enable new research into unsupervised learning, for both natural language understanding and other domains, further improving our understanding of how and when unsupervised learning works.

我们引入了一个框架,通过生成式预训练和判别式微调,使用单一任务无关的模型实现强大的自然语言理解。通过在包含长连续文本的多样化语料库上进行预训练,我们的模型获得了显著的世界知识和处理长距离依赖的能力,这些能力随后成功转移到解决问答、语义相似性评估、蕴含判定和文本分类等判别式任务上,在我们研究的12个数据集中,有9个数据集上达到了最新的技术水平。使用无监督(预)训练来提高判别式任务的性能长期以来一直是机器学习研究的重要目标。我们的工作表明,实现显著的性能提升确实是可能的,并提供了关于哪些模型(Transformer)和数据集(具有长距离依赖的文本)在这种方法下效果最佳的提示。我们希望这将有助于推动无监督学习的新研究,无论是自然语言理解还是其他领域,进一步加深我们对无监督学习如何以及何时有效的理解。

References

参考文献

[1] S. Arora, Y. Liang, and T. Ma. A simple but tough-to-beat baseline for sentence embeddings. 2016.

[1] S. Arora, Y. Liang, 和 T. Ma. 一个简单但难以超越的句子嵌入基线。2016.

[2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[2] J. L. Ba, J. R. Kiros, 和 G. E. Hinton. 层归一化 (Layer Normalization). arXiv 预印本 arXiv:1607.06450, 2016.
