XLNet: Generalized Autoregressive Pretraining for Language Understanding
Abstract
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
1 Introduction
Unsupervised representation learning has been highly successful in the domain of natural language processing [7, 22, 27, 28, 10]. Typically, these methods first pretrain neural networks on large-scale unlabeled text corpora, and then finetune the models or representations on downstream tasks. Under this shared high-level idea, different unsupervised pretraining objectives have been explored in the literature. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives.
AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model [7, 27, 28]. Specifically, given a text sequence $\mathbf{x}=(x_{1},\cdots,x_{T})$, AR language modeling factorizes the likelihood into a forward product $p(\mathbf{x})=\prod_{t=1}^{T}p(x_{t}\mid\mathbf{x}_{<t})$ or a backward one $p(\mathbf{x})=\prod_{t=T}^{1}p(x_{t}\mid\mathbf{x}_{>t})$. A parametric model (e.g., a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining.
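To make the forward factorization concrete, here is a minimal numpy sketch that computes $\log p(\mathbf{x})$ as a sum of per-token conditional log-probabilities. The mean-of-prefix-embeddings encoder `h_theta` is only a hypothetical stand-in for a real RNN or Transformer context encoder, and all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                       # toy vocabulary size and hidden size
E = rng.normal(size=(V, d))        # token embeddings e(x)
x = np.array([3, 1, 4, 1, 5])      # a toy token sequence

def h_theta(prefix):
    """Hypothetical context encoder for x_{<t}; a real AR model would use
    an RNN or Transformer here instead of a mean of prefix embeddings."""
    if len(prefix) == 0:
        return np.zeros(d)
    return E[prefix].mean(axis=0)

def log_softmax(logits):
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

# Forward factorization: log p(x) = sum_t log p(x_t | x_{<t}).
log_px = 0.0
for t in range(len(x)):
    logits = E @ h_theta(x[:t])    # e(x')^T h_theta(x_{<t}) for all candidate tokens x'
    log_px += log_softmax(logits)[x[t]]
print(log_px)
```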
In comparison, AE based pretraining does not perform explicit density estimation but instead aims to reconstruct the original data from corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a certain portion of tokens are replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to utilize bidirectional contexts for reconstruction. As an immediate benefit, this closes the aforementioned bidirectional information gap in AR language modeling, leading to improved performance. However, the artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT is not able to model the joint probability using the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is oversimplified as high-order, long-range dependency is prevalent in natural language [9].
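As a toy illustration of the denoising objective just described, the sketch below corrupts roughly 15% of a sequence and sums per-position log-probabilities over only the masked positions. The `MASK_ID` value and the random per-position predictions standing in for a bidirectional encoder are assumptions for illustration; the key point is that each masked token contributes an independent term, with no chaining between masked positions.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 10
MASK_ID = 0                          # hypothetical id reserved for the [MASK] symbol
x = np.array([3, 1, 4, 1, 5, 9, 2, 6])

# Corrupt the input: mask roughly 15% of the positions.
m = rng.random(len(x)) < 0.15        # m_t = 1 means x_t is masked
x_hat = np.where(m, MASK_ID, x)      # corrupted sequence \hat{x}

# Stand-in per-position predictions from a bidirectional encoder over x_hat, shape (T, V).
log_probs = np.log(rng.dirichlet(np.ones(V), size=len(x)))

# Each masked token is reconstructed independently of the other masked tokens.
objective = (m * log_probs[np.arange(len(x)), x]).sum()
print(x_hat, objective)
```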
Faced with the pros and cons of existing language pretraining objectives, in this work, we propose XLNet, a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations.
• Firstly, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.
• Secondly, as a generalized AR language model, XLNet does not rely on data corruption. Hence, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT.
In addition to a novel pretraining objective, XLNet improves architectural designs for pretraining.
• Inspired by the latest advancements in AR language modeling, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL [9] into pretraining, which empirically improves the performance especially for tasks involving a longer text sequence.
• Naively applying a Transformer(-XL) architecture to permutation-based language modeling does not work because the factorization order is arbitrary and the target is ambiguous. As a solution, we propose to reparameterize the Transformer(-XL) network to remove the ambiguity.
Empirically, under comparable experiment settings, XLNet consistently outperforms BERT [10] on a wide spectrum of problems including GLUE language understanding tasks, reading comprehension tasks like SQuAD and RACE, text classification tasks such as Yelp and IMDB, and the ClueWeb09-B document ranking task.
Related Work The idea of permutation-based AR modeling has been explored in [32, 12], but there are several key differences. Firstly, previous models aim to improve density estimation by baking an “orderless” inductive bias into the model while XLNet is motivated by enabling AR language models to learn bidirectional contexts. Technically, to construct a valid target-aware prediction distribution, XLNet incorporates the target position into the hidden state via two-stream attention while previous permutation-based AR models relied on implicit position awareness inherent to their MLP architectures. Finally, for both orderless NADE and XLNet, we would like to emphasize that “orderless” does not mean that the input sequence can be randomly permuted but that the model allows for different factorization orders of the distribution.
Another related idea is to perform autoregressive denoising in the context of text generation [11], though it only considers a fixed order.
2 Proposed Method
2.1 Background
In this section, we first review and compare the conventional AR language modeling and BERT for language pretraining. Given a text sequence $\mathbf{x}=[x_{1},\cdots,x_{T}]$, AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization:
$$
\max_{\theta}\;\log p_{\theta}(\mathbf{x})=\sum_{t=1}^{T}\log p_{\theta}(x_{t}\mid\mathbf{x}_{<t})=\sum_{t=1}^{T}\log\frac{\exp\left(h_{\theta}(\mathbf{x}_{1:t-1})^{\top}e(x_{t})\right)}{\sum_{x^{\prime}}\exp\left(h_{\theta}(\mathbf{x}_{1:t-1})^{\top}e(x^{\prime})\right)},\tag{1}
$$
where $h_{\theta}(\mathbf{x}_{1:t-1})$ is a context representation produced by neural models, such as RNNs or Transformers, and $e(x)$ denotes the embedding of $x$. In comparison, BERT is based on denoising auto-encoding. Specifically, for a text sequence $\mathbf{x}$, BERT first constructs a corrupted version $\hat{\mathbf{x}}$ by randomly setting a portion (e.g., $15\%$) of tokens in $\mathbf{x}$ to a special symbol [MASK]. Let the masked tokens be $\bar{\mathbf{x}}$. The training objective is to reconstruct $\bar{\mathbf{x}}$ from $\hat{\mathbf{x}}$:
$$
\max_{\theta}\;\log p_{\theta}(\bar{\mathbf{x}}\mid\hat{\mathbf{x}})\approx\sum_{t=1}^{T}m_{t}\log p_{\theta}(x_{t}\mid\hat{\mathbf{x}})=\sum_{t=1}^{T}m_{t}\log\frac{\exp\left(H_{\theta}(\hat{\mathbf{x}})_{t}^{\top}e(x_{t})\right)}{\sum_{x^{\prime}}\exp\left(H_{\theta}(\hat{\mathbf{x}})_{t}^{\top}e(x^{\prime})\right)},\tag{2}
$$
where $m_{t}=1$ indicates $x_{t}$ is masked, and $H_{\theta}$ is a Transformer that maps a length-$T$ text sequence $\mathbf{x}$ into a sequence of hidden vectors $H_{\theta}(\mathbf{x})=[H_{\theta}(\mathbf{x})_{1},H_{\theta}(\mathbf{x})_{2},\cdots,H_{\theta}(\mathbf{x})_{T}]$. The pros and cons of the two pretraining objectives are compared in the following aspects:
• Independence Assumption: As emphasized by the $\approx$ sign in Eq. (2), BERT factorizes the joint conditional probability $p(\bar{\mathbf{x}}\mid\hat{\mathbf{x}})$ based on an independence assumption that all masked tokens $\bar{\mathbf{x}}$ are separately reconstructed. In comparison, the AR language modeling objective (1) factorizes $p_{\theta}(\mathbf{x})$ using the product rule that holds universally without such an independence assumption.
• Input noise: The input to BERT contains artificial symbols like [MASK] that never occur in downstream tasks, which creates a pretrain-finetune discrepancy. Replacing [MASK] with original tokens as in [10] does not solve the problem because original tokens can only be used with a small probability — otherwise Eq. (2) will be trivial to optimize. In comparison, AR language modeling does not rely on any input corruption and does not suffer from this issue.
• Context dependency: The AR representation $h_{\theta}(\mathbf{x}_{1:t-1})$ is only conditioned on the tokens up to position $t$ (i.e., tokens to the left), while the BERT representation $H_{\theta}(\mathbf{x})_{t}$ has access to the contextual information on both sides. As a result, the BERT objective allows the model to be pretrained to better capture bidirectional context.
2.2 Objective: Permutation Language Modeling
According to the comparison above, AR language modeling and BERT possess their unique advantages over the other. A natural question to ask is whether there exists a pretraining objective that brings the advantages of both while avoiding their weaknesses.
Borrowing ideas from orderless NADE [32], we propose the permutation language modeling objective that not only retains the benefits of AR models but also allows models to capture bidirectional contexts. Specifically, for a sequence $\mathbf{x}$ of length $T$, there are $T!$ different orders to perform a valid autoregressive factorization. Intuitively, if model parameters are shared across all factorization orders, in expectation, the model will learn to gather information from all positions on both sides.
To formalize the idea, let $\mathcal{Z}_{T}$ be the set of all possible permutations of the length-$T$ index sequence $[1,2,\ldots,T]$. We use $z_{t}$ and $\mathbf{z}_{<t}$ to denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z}\in\mathcal{Z}_{T}$. Then, our proposed permutation language modeling objective can be expressed as follows:
$$
\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_{T}}\left[\sum_{t=1}^{T}\log p_{\theta}(x_{z_{t}}\mid\mathbf{x}_{\mathbf{z}_{<t}})\right].\tag{3}
$$
Essentially, for a text sequence $\mathbf{x}$, we sample a factorization order $\mathbf{z}$ at a time and decompose the likelihood $p_{\theta}(\mathbf{x})$ according to the sampled factorization order. Since the same model parameter $\theta$ is shared across all factorization orders during training, in expectation, $x_{t}$ has seen every possible element $x_{i}\neq x_{t}$ in the sequence, hence being able to capture the bidirectional context. Moreover, as this objective fits into the AR framework, it naturally avoids the independence assumption and the pretrain-finetune discrepancy discussed in Section 2.1.
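The following is a minimal numpy sketch of one Monte Carlo sample of objective (3): a factorization order $\mathbf{z}$ is drawn, and the log-likelihood is accumulated token by token under that order. The toy conditional model `cond_logits` is a hypothetical stand-in; note that it also ignores the target position, a deficiency addressed in Section 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8
E = rng.normal(size=(V, d))        # token embeddings e(x)
x = np.array([3, 1, 4, 1, 5])      # toy token sequence
T = len(x)

def log_softmax(logits):
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

def cond_logits(context_ids):
    """Toy stand-in for the shared conditional model over x_{z_<t};
    XLNet uses the two-stream Transformer-XL described in Section 2.3."""
    h = E[context_ids].mean(axis=0) if len(context_ids) else np.zeros(d)
    return E @ h

# One Monte Carlo sample of the expectation over factorization orders.
z = rng.permutation(T)             # a factorization order, not a reordering of x
log_lik = 0.0
for t in range(T):
    context = x[z[:t]]             # x_{z_<t}: tokens that come earlier under the order z
    log_lik += log_softmax(cond_logits(context))[x[z[t]]]
print(z, log_lik)
```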
Remark on Permutation The proposed objective only permutes the factorization order, not the sequence order. In other words, we keep the original sequence order, use the positional encodings corresponding to the original sequence, and rely on a proper attention mask in Transformers to achieve permutation of the factorization order. Note that this choice is necessary, since the model will only encounter text sequences with the natural order during finetuning.
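One way to picture this remark is through attention masks: tokens stay in their original positions, and only the visibility pattern changes with the sampled order. The sketch below is an illustrative construction, not the exact implementation; for a sampled permutation `z`, it builds a mask in which each original position may attend only to positions that appear earlier in the factorization order, plus a variant that also allows a position to attend to itself.

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
z = rng.permutation(T)                  # sampled factorization order over original positions 0..T-1

# rank[i] = step at which original position i appears in the order z
rank = np.empty(T, dtype=int)
rank[z] = np.arange(T)

# mask[i, j] = True means the token at original position i may attend to position j.
strict_mask = rank[:, None] > rank[None, :]      # only positions earlier in the order (z_<t)
inclusive_mask = rank[:, None] >= rank[None, :]  # additionally allows attending to itself (z_<=t)

print(z)
print(strict_mask.astype(int))
print(inclusive_mask.astype(int))
```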
To provide an overall picture, we show an example of predicting the token $x_{3}$ given the same input sequence $\mathbf{x}$ but under different factorization orders in Appendix A.7 (Figure 4).
2.3 Architecture: Two-Stream Self-Attention for Target-Aware Representations
Figure 1: (a): Content stream attention, which is the same as the standard self-attention. (b): Query stream attention, which does not have access to information about the content $x_{z_{t}}$. (c): Overview of the permutation language modeling training with two-stream attention.
While the permutation language modeling objective has desired properties, naive implementation with standard Transformer parameterization may not work. To see the problem, assume we parameterize the next-token distribution $p_{\theta}(X_{z_{t}}\mid\mathbf{x}_{\mathbf{z}_{<t}})$ using the standard Softmax formulation, i.e., $p_{\theta}(X_{z_{t}}=x\mid\mathbf{x}_{\mathbf{z}_{<t}})=\frac{\exp\left(e(x)^{\top}h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}{\sum_{x^{\prime}}\exp\left(e(x^{\prime})^{\top}h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}$, where $h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})$ denotes the hidden representation of $\mathbf{x}_{\mathbf{z}_{<t}}$ produced by the shared Transformer network after proper masking. Now notice that the representation $h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})$ does not depend on which position it will predict, i.e., the value of $z_{t}$. Consequently, the same distribution is predicted regardless of the target position, which is not able to learn useful representations (see Appendix A.1 for a concrete example). To avoid this problem, we propose to re-parameterize the next-token distribution to be target position aware:
$$
p_{\theta}(X_{z_{t}}=x\mid\mathbf{x}_{\mathbf{z}_{<t}})=\frac{\exp\left(e(x)^{\top}g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})\right)}{\sum_{x^{\prime}}\exp\left(e(x^{\prime})^{\top}g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})\right)},\tag{4}
$$
where $g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})$ denotes a new type of representations which additionally take the target position $z_{t}$ as input.
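The toy numpy snippet below illustrates why this reparameterization matters: with the position-agnostic $h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})$, every target position receives the same distribution, whereas a $g_{\theta}$ that also consumes $z_{t}$ yields different distributions for different targets. The additive combination of context and a positional feature is purely illustrative; XLNet instead realizes $g_{\theta}$ with the two-stream attention described next.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8
E = rng.normal(size=(V, d))                 # token embeddings e(x)
W_pos = rng.normal(size=(16, d))            # hypothetical positional features standing in for z_t

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

h_ctx = rng.normal(size=d)                  # h_theta(x_{z_<t}): depends on the context only

# Naive parameterization: softmax(E @ h_ctx) is the same vector for every target position z_t.
p_naive = softmax(E @ h_ctx)

# Target-aware parameterization: g_theta(x_{z_<t}, z_t) also encodes z_t
# (a toy additive combination; XLNet computes g via two-stream attention).
def g_theta(h, z_t):
    return h + W_pos[z_t]

p_at_pos2 = softmax(E @ g_theta(h_ctx, 2))
p_at_pos4 = softmax(E @ g_theta(h_ctx, 4))
print(np.allclose(p_at_pos2, p_at_pos4))    # False: the prediction now depends on the target position
```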
Two-Stream Self-Attention While the idea of target-aware representations removes the ambiguity in target prediction, how to formulate $g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})$ remains a non-trivial problem. Among other possibilities, we propose to “stand” at the target position $z_{t}$ and rely on the position $z_{t}$ to gather information from the context $\mathbf{x}_{\mathbf{z}_{<t}}$ through attention. For this parameterization to work, there are two requirements that are contradictory in a standard Transformer architecture: (1) to predict the token $x_{z_{t}}$, $g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})$ should only use the position $z_{t}$ and not the content $x_{z_{t}}$, otherwise the objective becomes trivial; (2) to predict the other tokens $x_{z_{j}}$ with $j>t$, $g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})$ should also encode the content $x_{z_{t}}$ to provide full contextual information. To resolve such a contradiction, we propose to use two sets of hidden representations instead of one:
• The content representation $h_{\theta}(\mathbf{x}_{\mathbf{z}_{\leq t}})$, or abbreviated as $h_{z_{t}}$, which serves a similar role to the standard hidden states in Transformer. This representation encodes both the context and $x_{z_{t}}$ itself.
• The query representation $g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}},z_{t})$, or abbreviated as $g_{z_{t}}$, which only has access to the contextual information $\mathbf{x}_{\mathbf{z}_{<t}}$ and the position $z_{t}$, but not the content $x_{z_{t}}$, as discussed above.
Computationally, the first layer query stream is initialized with a trainable vector, i.e., $g_{i}^{(0)}=w$, while the content stream is set to the corresponding word embedding, i.e., $h_{i}^{(0)}=e(x_{i})$. For each self-attention layer $m=1,\ldots,M$, the two streams of representations are schematically updated with a shared set of parameters as follows (illustrated in Figures 1 (a) and (b)):
$$
\begin{aligned}
g_{z_{t}}^{(m)}&\leftarrow\mathrm{Attention}(\mathrm{Q}=g_{z_{t}}^{(m-1)},\mathrm{KV}=\mathbf{h}_{\mathbf{z}_{<t}}^{(m-1)};\theta),\quad&&\text{(query stream: use }z_{t}\text{ but cannot see }x_{z_{t}}\text{)}\\
h_{z_{t}}^{(m)}&\leftarrow\mathrm{Attention}(\mathrm{Q}=h_{z_{t}}^{(m-1)},\mathrm{KV}=\mathbf{h}_{\mathbf{z}_{\leq t}}^{(m-1)};\theta),\quad&&\text{(content stream: use both }z_{t}\text{ and }x_{z_{t}}\text{)}
\end{aligned}
$$
where Q, K, V denote the query, key, and value in an attention operation [33]. The update rule of the content representations is exactly the same as the standard self-attention, so during finetuning, we can simply drop the query stream and use the content stream as a normal Transformer(-XL). Finally, we can use the last-layer query representation $g_{z_{t}}^{(M)}$ to compute Eq. (4).
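Below is a hedged, single-head, unbatched numpy sketch of one such layer update under a sampled factorization order: the query stream attends to the content of strictly earlier positions in the order, the content stream additionally attends to its own content, and both streams share the same projection matrices. A real XLNet layer adds multi-head attention, relative positional encodings, residual connections, and layer normalization, all of which are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # shared by both streams

def attention(q, H_kv, visible):
    """Single-head attention of one query vector against the content stream H_kv,
    restricted to the positions where `visible` is True."""
    if not visible.any():                        # the first token in the order sees no context
        return np.zeros(d)
    scores = (q @ Wq) @ (H_kv @ Wk).T / np.sqrt(d)
    scores = np.where(visible, scores, -1e9)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ (H_kv @ Wv)

z = rng.permutation(T)                           # sampled factorization order
rank = np.empty(T, dtype=int)
rank[z] = np.arange(T)                           # rank[i]: step at which position i is predicted

h = rng.normal(size=(T, d))                      # content stream h^{(m-1)}; layer 0 is e(x_i)
g = np.tile(rng.normal(size=d), (T, 1))          # query stream g^{(m-1)}; layer 0 is a shared vector w

h_next, g_next = np.empty_like(h), np.empty_like(g)
for i in range(T):                               # i indexes original positions
    g_next[i] = attention(g[i], h, rank < rank[i])    # query stream: sees x_{z_<t}, never x_{z_t}
    h_next[i] = attention(h[i], h, rank <= rank[i])   # content stream: standard self-attention pattern
```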
Partial Prediction While the permutation language modeling objective (3) has several benefits, it is a much more challenging optimization problem due to the permutation and causes slow convergence in preliminary experiments. To reduce the optimization difficulty, we choose to only predict the last tokens in a factorization order. Formally, we split $\mathbf{z}$ into a non-target subsequence $\mathbf{z}_{\leq c}$ and a target subsequence $\mathbf{z}_{>c}$, where $c$ is the cutting point. The objective is to maximize the log-likelihood of the target subsequence conditioned on the non-target subsequence, i.e.,
$$
\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_{T}}\left[\log p_{\theta}(\mathbf{x}_{\mathbf{z}_{>c}}\mid\mathbf{x}_{\mathbf{z}_{\leq c}})\right]=\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_{T}}\left[\sum_{t=c+1}^{|\mathbf{z}|}\log p_{\theta}(x_{z_{t}}\mid\mathbf{x}_{\mathbf{z}_{<t}})\right].\tag{5}
$$
Note that $\mathbf{z}_{>c}$ is chosen as the target because it possesses the longest context in the sequence given the current factorization order $\mathbf{z}$. A hyperparameter $K$ is used such that about $1/K$ tokens are selected for prediction; i.e., $|\mathbf{z}|/(|\mathbf{z}|-c)\approx K$. For unselected tokens, their query representations need not be computed, which saves memory and computation.
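As a small sketch of this selection rule (with illustrative values of $T$ and $K$), the cutting point $c$ can be chosen so that roughly $|\mathbf{z}|/K$ tokens at the end of the factorization order become prediction targets:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 12, 6                       # illustrative sequence length and partial-prediction constant K
z = rng.permutation(T)             # sampled factorization order

num_targets = max(1, T // K)       # roughly |z| / K tokens are predicted, i.e. |z| - c
c = T - num_targets                # cutting point

non_targets, targets = z[:c], z[c:]
# Only the target positions need query-stream representations; the others only
# participate through the content stream, which saves memory and computation.
print(targets)
```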
2.4 Incorporating Ideas from Transformer-XL
Since our objective function fits in the AR framework, we incorporate the state-of-the-art AR language model, Transformer-XL [9], into our pretraining framework, and name our method after it. We integrate two important techniques in Transformer-XL, namely the relative positional encoding scheme and the segment recurrence mechanism. We apply relative positional encodings based on the original sequence as discussed earlier, which is straightforward. Now we discuss how to integrate the recurrence mechanism into the proposed permutation setting and enable the model to reuse hidden states from previous segments. Without loss of generality, suppose we have two segments taken from a long sequence $\mathbf{s}$; i.e., $\tilde{\mathbf{x}}=\mathbf{s}_{1:T}$ and $\mathbf{x}=\mathbf{s}_{T+1:2T}$. Let $\tilde{\mathbf{z}}$ and $\mathbf{z}$ be permutations of $[1\cdots T]$ and $[T+1\cdots2T]$ respectively. Then, based on the permutation $\tilde{\mathbf{z}}$, we process the first segment, and then cache the obtained content representations $\tilde{\mathbf{h}}^{(m)}$ for each layer $m$. Then, for the next segment $\mathbf{x}$, the attention update with memory can be written as
$$
h_{z_{t}}^{(m)}\leftarrow\mathrm{Attention}(\mathrm{Q}=h_{z_{t}}^{(m-1)},\mathrm{KV}=\left[\tilde{\mathbf{h}}^{(m-1)},\mathbf{h}_{\mathbf{z}_{\leq t}}^{(m-1)}\right];\theta),
$$
where $[.,.]$ denotes concatenation along the sequence dimension. Notice that positional encodings only depend on the actual positions in the original sequence. Thus, the above attention update is independent of $\tilde{\mathbf{z}}$ once the representations $\tilde{\mathbf{h}}^{(m)}$ are obtained. This allows caching and reusing the memory without knowing the factorization order of the previous segment. In expectation, the model learns to utilize the memory over all factorization orders of the last segment. The query stream can be computed in the same way. Finally, Figure 1 (c) presents an overview of the proposed permutation language modeling with two-stream attention (see Appendix A.7 for more detailed illustration).
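To illustrate the memory mechanism, here is a minimal numpy sketch of the content-stream update with cached representations from the previous segment concatenated into the keys and values: memory positions are always visible, while current-segment positions follow the sampled order. Relative positional encodings and the query stream are omitted, and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # illustrative shared projections

mem = rng.normal(size=(T, d))      # cached content representations \tilde{h}^{(m-1)} of segment 1
h = rng.normal(size=(T, d))        # content stream h^{(m-1)} of the current segment
z = rng.permutation(T)             # factorization order of the current segment
rank = np.empty(T, dtype=int)
rank[z] = np.arange(T)

kv = np.concatenate([mem, h], axis=0)          # keys/values: [memory, current segment]
h_next = np.empty_like(h)
for i in range(T):
    # Memory positions are always visible; current-segment positions are visible up to z_<=t.
    visible = np.concatenate([np.ones(T, dtype=bool), rank <= rank[i]])
    scores = (h[i] @ Wq) @ (kv @ Wk).T / np.sqrt(d)
    scores = np.where(visible, scores, -1e9)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    h_next[i] = w @ (kv @ Wv)
```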