Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma Facebook AI, Seattle, WA {sinongwang, belindali, hanfang, mkhabsa, haom}@fb.com
Abstract
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^{2})$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^{2})$ to $O(n)$ in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
1 Introduction
Transformer models (Vaswani et al., 2017) have become ubiquitous for a wide variety of problems in natural language processing (NLP), including translation (Ott et al., 2018), text classification, question answering, among others (Raffel et al., 2019; Mohamed et al., 2019). Over the last couple of years, the number of parameters in state-of-the-art NLP transformers has grown drastically, from the original 340 million introduced in BERT-Large to 175 billion in GPT-3 (Brown et al., 2020). Although these large-scale models yield impressive results on a wide variety of tasks, training and deploying such models is slow in practice. For example, the original BERT-Large model (Devlin et al., 2019) takes four days to train on 16 Cloud TPUs, and the recent GPT-3 (Brown et al., 2020) consumed orders of magnitude more petaflop/s-days of compute to train compared to its predecessor, GPT-2 (Radford et al., 2019). Beyond training, deploying Transformer models to real-world applications is also expensive, usually requiring extensive distillation (Hinton et al., 2015) or compression.
The main efficiency bottleneck in Transformer models is its self-attention mechanism. Here, each token’s representation is updated by attending to all other tokens in the previous layer. This operation is key for retaining long-term information, giving Transformers the edge over recurrent models on long sequences. However, attending to all tokens at each layer incurs a complexity of $O(n^{2})$ with respect to sequence length. Thus, in this paper, we seek to answer the question: can Transformer models be optimized to avoid this quadratic operation, or is this operation required to maintain strong performance?
Prior work has proposed several techniques for improving the efficiency of self-attention. One popular technique is introducing sparsity into attention layers (Child et al., 2019; Qiu et al., 2019; Beltagy et al., 2020) by having each token attend to only a subset of tokens in the whole sequence. This reduces the overall complexity of the attention mechanism to $O(n\sqrt{n})$ (Child et al., 2019). However, as shown in Qiu et al. (2019), this approach suffers from a large performance drop with limited efficiency gains, i.e., a 2% drop with only a 20% speedup. More recently, the Reformer (Kitaev et al., 2020) used locality-sensitive hashing (LSH) to reduce the self-attention complexity to $O(n\log(n))$. However, in practice, the Reformer's efficiency gains only appear on sequences with length $>2048$ (Figure 5 in Kitaev et al. (2020)). Furthermore, the Reformer's multi-round hashing approach actually increases the number of sequential operations, which further undermines its final efficiency gains.
In this work, we introduce a novel approach for tackling the self-attention bottleneck in Transformers. Our approach is inspired by the key observation that self-attention is low rank. More precisely, we show both theoretically and empirically that the stochastic matrix formed by self-attention can be approximated by a low-rank matrix. Empowered by this observation, we introduce a novel mechanism that reduces self-attention to an $O(n)$ operation in both space- and time-complexity: we decompose the original scaled dot-product attention into multiple smaller attentions through linear projections, such that the combination of these operations forms a low-rank factorization of the original attention. A summary of runtimes for various Transformer architectures, including ours, can be found in Table 1.
One predominant application of Transformers, and the one that has seen the most gains, is using them as pretrained language models, whereby models are first pretrained with a language modeling objective on a large corpus, then finetuned on target tasks using supervised data (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2019). Following Devlin et al. (2019), we pretrain our model on BookCorpus (Zhu et al., 2015) plus English Wikipedia using the masked-language-modeling objective. We observe similar pretraining performance to the standard Transformer model. We then finetune our pretrained models on three tasks from GLUE (Wang et al., 2018) and one sentiment analysis task, IMDB reviews (Maas et al., 2011). On these tasks, we find that our model performs comparably to, or even slightly better than, the standard pretrained Transformer, while achieving significant training and inference speedups.
| Model Architecture | Complexity per Layer | Sequential Operation |
| --- | --- | --- |
| Recurrent | $O(n)$ | $O(n)$ |
| Transformer (Vaswani et al., 2017) | $O(n^{2})$ | $O(1)$ |
| Sparse Transformer (Child et al., 2019) | $O(n\sqrt{n})$ | $O(1)$ |
| Reformer (Kitaev et al., 2020) | $O(n\log(n))$ | $O(\log(n))$ |
| Linformer | $O(n)$ | $O(1)$ |
Table 1: Per-layer time complexity and minimum number of sequential operations as a function of sequence length $(n)$ for various architectures.
2 Background and Related Work
2.1 Transformer and Self-Attention
The Transformer is built upon the idea of Multi-Head Self-Attention (MHA), which allows the model to jointly attend to information at different positions from different representation subspaces. MHA is defined as
$$
\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_ {1},\mathrm{head}_ {2},\dots,\mathrm{head}_{h})W^{O},
$$
where $Q,K,V\in\mathbb{R}^{n\times d_{m}}$ are input embedding matrices, $n$ is sequence length, $d_{m}$ is the embedding dimension, and $h$ is the number of heads. Each head is defined as:
$$
\mathrm{head}_ {i}=\mathrm{Attention}(Q W_{i}^{Q},K W_{i}^{K},V W_{i}^{V})=\underbrace{\mathrm{softmax}\left[\frac{Q W_{i}^{Q}(K W_{i}^{K})^{T}}{\sqrt{d_{k}}}\right]}_ {P}V W_{i}^{V},
$$
where $\begin{array}{r}{W_{i}^{Q},W_{i}^{K}\in\mathbb{R}^{d_{m}\times d_{k}},W_{i}^{V}\in\mathbb{R}^{d_{m}\times d_{v}},W^{O}\in\mathbb{R}^{h d_{v}\times d_{m}}}\end{array}$ are learned matrices and $d_{k},d_{v}$ are the hidden dimensions of the projection subspaces. For the rest of this paper, we will not differentiate between $d_{k}$ and $d_{v}$ and just use $d$ .
The self-attention defined in (2) refers to a context mapping matrix $P\in\mathbb{R}^{n\times n}$ . The Transformer uses $P$ to capture the input context for a given token, based on a combination of all tokens in the sequence. However, computing $P$ is expensive. It requires multiplying two $n\times d$ matrices, which is $O(n^{2})$ in time and space complexity. This quadratic dependency on the sequence length has become a bottleneck for Transformers.
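For concreteness, here is a minimal PyTorch sketch of a single attention head that materializes the $n\times n$ matrix $P$; shapes and names are ours for illustration, not taken from the paper's released code.

```python
import torch

def standard_attention_head(X, WQ, WK, WV):
    """One head of standard self-attention; materializes the full n x n matrix P."""
    q, k, v = X @ WQ, X @ WK, X @ WV                        # each (n, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (n, n): the O(n^2) bottleneck
    P = torch.softmax(scores, dim=-1)                       # context mapping matrix P
    return P @ v                                            # (n, d)

n, d_m, d = 512, 768, 64
X = torch.randn(n, d_m)                            # Q = K = V = X for self-attention
WQ, WK, WV = (torch.randn(d_m, d) for _ in range(3))
out = standard_attention_head(X, WQ, WK, WV)       # P alone holds n * n = 262,144 entries
```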
2.2 Related Work
There has been much prior literature on improving the efficiency of Transformers, especially the self-attention bottleneck. The most common techniques for model efficiency that can be applied to Transformers (some specific to Transformers, others more general-purpose) include:
Mixed Precision (Micikevicius et al., 2017): Using half-precision or mixed-precision representations of floating points is popular in deep learning, and is also widely used in training Transformers (Ott et al., 2019). This technique can be further improved through Quantization Aware Training (Jacob et al., 2018; Fan et al., 2020), where the weights are quantized during training and the gradients are approximated with the Straight-Through Estimator. This line of work is orthogonal to our approach, and we use mixed-precision training by default.
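As a concrete illustration (not part of the paper's method), a typical mixed-precision training step with PyTorch's torch.cuda.amp looks roughly like the sketch below; model, optimizer, and loss_fn are placeholders.

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # rescales the loss to avoid fp16 gradient underflow

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in half precision where safe
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)             # unscales gradients, then runs the optimizer step
    scaler.update()
    return loss.item()
```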
Knowledge Distillation (Hinton et al., 2015): Knowledge distillation aims to transfer the "knowledge" from a large teacher model to a lightweight student model, and the student model is then used during inference. However, this approach has drawbacks: it does not address speeding up the teacher model during training, and moreover, student models usually suffer performance degradation compared to the teacher model. For example, when distilling a 12-layer BERT to a 6-layer BERT, the student model experiences an average 2.5% performance drop on several benchmark tasks (Sanh et al., 2019).
Sparse Attention (Child et al., 2019): This technique improves the efficiency of self-attention by adding sparsity to the context mapping matrix $P$. For example, the Sparse Transformer (Child et al., 2019) only computes $P_{ij}$ around the diagonal of matrix $P$ (instead of all $P_{ij}$). Meanwhile, blockwise self-attention (Qiu et al., 2019) divides $P$ into multiple blocks and only computes $P_{ij}$ within the selected blocks. However, these techniques also suffer a large performance degradation while providing only a limited additional speedup, i.e., a 2% drop with a 20% speedup.
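To make the sparsity pattern concrete, the sketch below masks $P_{ij}$ outside a fixed band around the diagonal. It is a generic banded-attention illustration (and still materializes the dense score matrix), not the Sparse Transformer's actual factorized kernels.

```python
import torch

def banded_attention(q, k, v, window=32):
    """Each position attends only to positions within +/- `window` of itself (illustrative)."""
    n = q.shape[0]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    idx = torch.arange(n)
    band = (idx[:, None] - idx[None, :]).abs() <= window   # True inside the diagonal band
    scores = scores.masked_fill(~band, float("-inf"))      # P_ij -> 0 outside the band
    return torch.softmax(scores, dim=-1) @ v
```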
LSH Attention (Kitaev et al., 2020): Locality-sensitive hashing (LSH) attention utilizes a multi-round hashing scheme when computing dot-product attention, which in theory reduces the self-attention complexity to $O(n\log(n))$. However, in practice, their complexity term has a large constant $128^{2}$, and it is only more efficient than the vanilla Transformer when the sequence length is extremely long.
Improving Optimizer Efficiency: Microbatching (Huang et al., 2019) splits a batch into small microbatches (which can be fit into memory), and then separately runs forward and backward passes on them with gradient accumulation. Gradient checkpointing (Chen et al., 2016) saves memory by only caching activations of a subset of layers. The uncached activations are recomputed during backpropagation from the latest checkpoint. Both techniques trade off time for memory, and do not speed up inference.
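For example (a generic PyTorch sketch, not specific to this paper), gradient checkpointing can be applied with torch.utils.checkpoint, trading recomputation time for activation memory:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Wrap a block so its internal activations are recomputed during the backward pass.
block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)

y = checkpoint(block, x)   # forward: activations inside `block` are not cached
y.sum().backward()         # backward: re-runs `block`'s forward to rebuild them
```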
As we've noted, most of these common techniques have limitations in reducing both training and inference time/memory consumption. We therefore investigate how to optimize the self-attention layers themselves, and introduce our approach next.
3 Self-Attention is Low Rank
In this section, we demonstrate that the self-attention mechanism, i.e., the context mapping matrix $P$ , is low-rank.
We first provide a spectrum analysis of the context mapping matrix $P$. We use two pretrained transformer models, RoBERTa-base (12-layer stacked transformer) and RoBERTa-large (24-layer stacked transformer) (Liu et al., 2019), on two tasks: the masked-language-modeling task on Wiki103 (Merity et al., 2016) and the classification task on IMDB (Maas et al., 2011). In Figure 1 (left), we apply singular value decomposition to $P$ across different layers and different heads of the model, and plot the normalized cumulative singular value averaged over 10k sentences. The results exhibit a clear long-tail spectrum distribution across each layer, head and task. This implies that most of the information of matrix $P$ can be recovered from the first few largest singular values. In Figure 1 (right), we plot a heatmap of the normalized cumulative singular value at the 128-th largest singular value (out of 512). We observe that the spectrum distribution in higher layers is more skewed than in lower layers, meaning that, in higher layers, more information is concentrated in the largest singular values and the rank of $P$ is lower.
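The measurement itself is straightforward; a NumPy sketch is shown below, using a random rank-$d$ construction as a stand-in for an attention matrix $P$ extracted from a pretrained model.

```python
import numpy as np

def normalized_cumulative_spectrum(P):
    """Normalized cumulative singular values of a context mapping matrix P (n x n)."""
    s = np.linalg.svd(P, compute_uv=False)   # singular values in descending order
    return np.cumsum(s) / s.sum()

n, d = 512, 64
logits = np.random.randn(n, d) @ np.random.randn(d, n) / np.sqrt(d)   # stand-in attention logits
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)        # row-stochastic, like softmax
spectrum = normalized_cumulative_spectrum(P)
print(spectrum[127])   # fraction of spectral mass in the top 128 of 512 singular values
```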
Below, we provide a theoretical analysis of the above spectrum results.
Theorem 1. (self-attention is low rank) For any $Q,K,V\in\mathbb{R}^{n\times d}$ and $W_{i}^{Q},W_{i}^{K},W_{i}^{V}\in\mathbb{R}^{d\times d}$ , for any column vector $w\in\mathbb{R}^{n}$ of matrix $V W_{i}^{V}$ , there exists a low-rank matrix $\tilde{P}\in\mathbb{R}^{n\times n}$ such that
$$
\operatorname*{Pr}\big(|\tilde{P}w^{T}-Pw^{T}|<\epsilon|Pw^{T}|\big)>1-o(1)\quad\text{and}\quad\mathrm{rank}(\tilde{P})=\Theta(\log(n)),
$$
where the context mapping matrix $P$ is defined in (2).
Figure 1: The left two figures show the spectrum analysis of the self-attention matrix in a pretrained transformer model (Liu et al., 2019) with $n=512$. The Y-axis is the normalized cumulative singular value of the context mapping matrix $P$, and the X-axis is the index of the largest eigenvalue. The results are based on both the RoBERTa-base and large models on two public datasets: Wiki103 and IMDB. The right figure plots the heatmap of the normalized cumulative eigenvalue at the 128-th largest eigenvalue across different layers and heads on the Wiki103 data.
Proof. Based on the definition of the context mapping matrix $P$ , we can write
$$
P=\mathrm{softmax}\underbrace{\left[\frac{Q W_{i}^{Q}(K W_{i}^{K})^{T}}{\sqrt{d}}\right]}_ {A}=\exp{(A)}\cdot D_{A}^{-1},
$$
where $D_{A}$ is an $n\times n$ diagonal matrix. The main idea of this proof is based on the distributional Johnson–Lindenstrauss lemma (Lindenstrauss, 1984) (JL for short). We construct the approximate low-rank matrix as $\tilde{P}=\exp\left(A\right)\cdot D_{A}^{-1}R^{T}R$, where $R\in\mathbb{R}^{k\times n}$ with i.i.d. entries from $N(0,1/k)$. We can then use the JL lemma to show that, for any column vector $w\in\mathbb{R}^{n}$ of matrix $V W_{i}^{V}$, when $k=5\log(n)/(\epsilon^{2}-\epsilon^{3})$, we have
$$
\operatorname*{Pr}\big(|P R^{T}R w^{T}-P w^{T}|\leq\epsilon|P w^{T}|\big)>1-o(1).
$$
For more details, refer to the supplementary materials.
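As a toy numerical check of this construction (random data, not a trained model), one can form $\tilde{P}=\exp(A)\,D_{A}^{-1}R^{T}R$ with a Gaussian $R$ and compare $\tilde{P}w^{T}$ against $Pw^{T}$:

```python
import numpy as np

n, d, eps = 512, 64, 0.5
k = int(5 * np.log(n) / (eps**2 - eps**3))                        # k = 5 log(n) / (eps^2 - eps^3)

A = np.random.randn(n, d) @ np.random.randn(d, n) / np.sqrt(d)    # logits of rank at most d
P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)              # P = exp(A) * D_A^{-1}
R = np.random.randn(k, n) / np.sqrt(k)                            # i.i.d. N(0, 1/k) entries

w = np.random.randn(n)                     # stand-in for a column of V W_i^V
exact = P @ w
approx = P @ (R.T @ (R @ w))               # (P R^T R) w, computable in O(nk)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))     # typically well below eps
```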
Given the low-rank property of the context mapping matrix $P$ , one straightforward idea is to use singular value decomposition (SVD) to approximate $P$ with a low-rank matrix $P_{\mathrm{low}}$ , as follows
$$
P\approx P_{\mathrm{low}}=\sum_{i=1}^{k}\sigma_{i}u_{i}v_{i}^{T}=\underbrace{\left[u_{1},\cdots,u_{k}\right]}_{k}\mathrm{diag}\{\sigma_{1},\cdots,\sigma_{k}\}\left[\begin{array}{c}v_{1}\\ \vdots\\ v_{k}\end{array}\right],
$$
where $\sigma_{i}$, $u_{i}$ and $v_{i}$ are the $i$-th largest singular value and its corresponding singular vectors. Based on the results in Theorem 1 and the Eckart–Young–Mirsky Theorem (Eckart & Young, 1936), one can use $P_{\mathrm{low}}$ to approximate self-attention (2) with $\epsilon$ error and $O(nk)$ time and space complexity. However, this approach requires performing an SVD for each self-attention matrix, which adds additional complexity. Therefore, we propose another approach for low-rank approximation that avoids this added complexity.
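For reference, a NumPy sketch of this SVD route: truncate $P$ to its top-$k$ singular triples and keep only the thin factors, which cost $O(nk)$ memory (the per-matrix SVD itself is what our method avoids).

```python
import numpy as np

def svd_low_rank(P, k):
    """Best rank-k approximation of P (Eckart-Young); returns thin factors of size O(nk)."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]           # shapes (n, k) and (k, n)

n, d, k = 512, 64, 128
A = np.random.randn(n, d) @ np.random.randn(d, n) / np.sqrt(d)
P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
Uk, Vk = svd_low_rank(P, k)
print(np.linalg.norm(Uk @ Vk - P) / np.linalg.norm(P))   # small relative error, per Figure 1
```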
4 Model
In this section, we propose a new self-attention mechanism which allows us to compute the contextual mapping $P\cdot V W_{i}^{V}$ in linear time and memory complexity with respect to sequence length.
The main idea of our proposed linear self-attention (Figure 2) is to add two linear projection matrices $E_{i},F_{i}\in\mathbb{R}^{n\times k}$ when computing key and value. We first project the original $(n\times d)$ -dimensional
Figure 2: Left and bottom-right show architecture and example of our proposed multihead linear self-attention. Top right shows inference time vs. sequence length for various Linformer models.
key and value layers $K W_{i}^{K}$ and $V W_{i}^{V}$ into $(k\times d)$ -dimensional projected key and value layers. We then compute an $(n\times k)$ -dimensional context mapping matrix $\bar{P}$ using scaled dot-product attention.
$$
\begin{aligned}
\overline{\mathrm{head}_{i}} &= \mathrm{Attention}(Q W_{i}^{Q},\,E_{i}K W_{i}^{K},\,F_{i}V W_{i}^{V}) \\
&= \underbrace{\mathrm{softmax}\left(\frac{Q W_{i}^{Q}(E_{i}K W_{i}^{K})^{T}}{\sqrt{d_{k}}}\right)}_{\bar{P}:\,n\times k}\cdot\underbrace{F_{i}V W_{i}^{V}}_{k\times d},
\end{aligned}
$$
Finally, we compute context embeddings for each $\mathrm{head}_{i}$ using $\bar{P}\cdot(F_{i}V W_{i}^{V})$. Note that the above operations only require $O(nk)$ time and space complexity. Thus, if we can choose a very small projected dimension $k$, such that $k\ll n$, then we can significantly reduce both the time and memory consumption. The following theorem states that, when $k=O(d/\epsilon^{2})$ (independent of $n$), one can approximate $P\cdot V W_{i}^{V}$ using linear self-attention (7) with $\epsilon$ error.
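A minimal PyTorch sketch of one such linear-attention head is given below. Module and parameter names are ours, and the projections are stored as $(k\times n)$ matrices so that they reduce the length dimension from $n$ to $k$; the released implementation may differ in details.

```python
import torch
import torch.nn as nn

class LinformerHead(nn.Module):
    """One head of Linformer self-attention: keys and values are projected from length n to k."""
    def __init__(self, d_m, d, n, k):
        super().__init__()
        self.WQ, self.WK, self.WV = (nn.Linear(d_m, d, bias=False) for _ in range(3))
        self.E = nn.Parameter(torch.randn(k, n) / n ** 0.5)   # E_i: projects keys,   length n -> k
        self.F = nn.Parameter(torch.randn(k, n) / n ** 0.5)   # F_i: projects values, length n -> k
        self.d = d

    def forward(self, x):                                     # x: (n, d_m)
        q, k_, v = self.WQ(x), self.WK(x), self.WV(x)         # each (n, d)
        k_proj, v_proj = self.E @ k_, self.F @ v              # each (k, d), computed in O(nkd)
        P_bar = torch.softmax(q @ k_proj.T / self.d ** 0.5, dim=-1)   # (n, k); no n x n matrix
        return P_bar @ v_proj                                 # (n, d)

head = LinformerHead(d_m=768, d=64, n=512, k=128)
out = head(torch.randn(512, 768))                             # (512, 64)
```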
Theorem 2. (Linear self-attention) For any $Q_{i},K_{i},V_{i}\in\mathbb{R}^{n\times d}$ and $W_{i}^{Q},W_{i}^{K},W_{i}^{V}\in\mathbb{R}^{d\times d}$, if $k=\min\{\Theta(9d\log(d)/\epsilon^{2}),\,5\Theta(\log(n)/\epsilon^{2})\}$, then there exist matrices $E_{i},F_{i}\in\mathbb{R}^{n\times k}$ such that, for any row vector $w$ of matrix $Q W_{i}^{Q}(K W_{i}^{K})^{T}/\sqrt{d}$, we have
$$
\operatorname*{Pr}\left(\lVert\operatorname{softmax}(w E_{i}^{T})F_{i}V W_{i}^{V}-\operatorname{softmax}(w)V W_{i}^{V}\rVert\le\epsilon\lVert\operatorname{softmax}(w)\rVert\lVert V W_{i}^{V}\rVert\right)>1-o(1)
$$
Proof. The main idea of the proof is based on the distributional Johnson–Lindenstrauss lemma (Lindenstrauss, 1984). We first prove that for any row vector $x\in\mathbb{R}^{n}$ of matrix $Q W_{i}^{Q}(K W_{i}^{K})^{T}/\sqrt{d_{k}}$ and any column vector $y\in\mathbb{R}^{n}$ of matrix $V W_{i}^{V}$,
$$
\operatorname*{Pr}\left(|\exp(x E_{i}^{T})F_{i}y^{T}-\exp(x)y^{T}|\leq\epsilon|\exp(x)y^{T}|\right)>1-2e^{-(\epsilon^{2}-\epsilon^{3})k/4},
$$
where $E_{i}=\delta R$ and $F_{i}=e^{-\delta}R$, with $R\in\mathbb{R}^{k\times n}$ having i.i.d. entries from $N(0,1/k)$ and $\delta$ a small constant. Applying the result in (9) to every row vector of matrix $A$ and every column vector of matrix $V$, one can directly prove that, for any row vector $A_{i}$ of matrix $A$,
$$
\begin{array}{r}{\operatorname*{Pr}\left(|\exp(A_{i}E_{i}^{T})F_{i}V-\exp(A_{i})V|\leq\epsilon|\exp(A_{i})V|\right)>1-o(1),}\end{array}
$$
by setting $k=5\log(nd)/(\epsilon^{2}-\epsilon^{3})$. This result does not utilize the low-rank property of matrix $A$ ($\operatorname{rank}(A)=d$), and the resultant $k$ has a dependency on the sequence length $n$. We will further utilize the fact that $\operatorname{rank}(A)=d$ to prove that the choice of $k$ can be constant and independent of the sequence length $n$. For more details, refer to the supplementary materials.
In Figure 2 (top right), we plot the inference speed of the Linformer and the standard Transformer versus sequence length, while holding the total number of tokens fixed. We see that while the standard Transformer becomes slower at longer sequence lengths, the Linformer's speed remains relatively flat and is significantly faster at long sequences.
Additional Efficiency Techniques: Several additional techniques can be introduced on top of Linformer to further optimize for both performance and efficiency:
Parameter sharing between projections: One can share parameters for the linear projection matrices $E_{i},F_{i}$ across layers and heads. In particular, we experimented with 3 levels of sharing: headwise sharing (a single $E$ and a single $F$ are shared across all heads within each layer), key-value sharing (in addition to headwise sharing, the key and value projections are shared, i.e., $E_{i}=F_{i}=E$ for each layer), and layerwise sharing (a single projection matrix $E$ is used across all layers, all heads, and for both key and value).
For example, in a 12-layer, 12-head stacked Transformer model, headwise sharing, key-value sharing and layerwise sharing will introduce 24, 12, and 1 distinct linear projection matrices, respectively.
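A simplified sketch of how the three sharing levels could be wired is shown below; the parameter counts match the example above, and the factory function and its names are ours.

```python
import torch
import torch.nn as nn

def build_projections(n, k, num_layers=12, num_heads=12, sharing="layerwise"):
    """Per-(layer, head) pairs (E, F) of (k, n) projections under a given sharing level (illustrative)."""
    new = lambda: nn.Parameter(torch.randn(k, n) / n ** 0.5)
    if sharing == "layerwise":                 # 1 distinct matrix, shared everywhere (E = F)
        E = new()
        return [[(E, E)] * num_heads for _ in range(num_layers)]
    if sharing == "key-value":                 # 1 distinct matrix per layer (E = F within a layer)
        return [[(E, E)] * num_heads for E in (new() for _ in range(num_layers))]
    if sharing == "headwise":                  # 2 distinct matrices per layer (separate E and F)
        return [[(E, F)] * num_heads for E, F in ((new(), new()) for _ in range(num_layers))]
    raise ValueError(sharing)

projections = build_projections(n=512, k=128, sharing="key-value")   # 12 distinct matrices
```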
Nonuniform projected dimension: One can choose a different projected dimension $k$ for different heads and layers. As shown in Figure 1 (right), the contextual mapping matrices in different heads and layers have distinct spectrum distributions, and heads in higher layers tend towards a more skewed spectrum distribution (lower rank). This implies one can choose a smaller projected dimension $k$ for higher layers.
General projections: One can also choose different kinds of low-dimensional projection methods instead of a simple linear projection. For example, one can choose mean/max pooling, or convolution where the kernel size and stride are set to $n/k$. The convolutional projections contain parameters that require training.
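For instance, assuming $n$ is a multiple of $k$, a strided convolution can serve as the length-reducing projection in place of $E_{i}$ or $F_{i}$; the sketch below uses our own tensor shapes for illustration.

```python
import torch
import torch.nn as nn

n, k, d = 512, 128, 64
conv_proj = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=n // k, stride=n // k)

kv = torch.randn(1, n, d)                                    # e.g. K W_i^K or V W_i^V: (batch, n, d)
projected = conv_proj(kv.transpose(1, 2)).transpose(1, 2)    # (1, k, d): length reduced from n to k

# Mean pooling over windows of size n // k is a parameter-free alternative:
pooled = kv.reshape(1, k, n // k, d).mean(dim=2)             # also (1, k, d)
```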
5 Experiments
In this section, we present experimental results for the techniques described above. We analyze the techniques one by one and explore how they impact performance.
5.1 Pretraining Perplexities
We first compare the pretraining performance of our proposed architecture against RoBERTa (Liu et al., 2019), which is based on the Transformer. Following Devlin et al. (2019), we use BookCorpus (Zhu et al., 2015) plus English Wikipedia as our pretraining set (3300M words). All models are pretrained with the masked-language-modeling (MLM) objective, and training for all experiments is parallelized across 64 Tesla V100 GPUs with $250\mathrm{k}$ updates.
Effect of projected dimension: We experiment with various values for the projected dimension $k$. (We use the same $k$ across all layers and heads of Linformer.) In Figure 3(a) and (b), we plot the validation perplexity curves for both the standard Transformer and the Linformer across different $k$, for maximum sequence lengths $n=512$ and $n=1024$. As expected, the Linformer performs better as the projected dimension $k$ increases. However, even at $k=128$ for $n=512$ and $k=256$ for $n=1024$, Linformer's performance is already nearly on par with the original Transformer.

Effect of sharing projections: In Figure 3(c), we plot the validation perplexity curves for the three parameter sharing strategies (headwise, key-value, and layerwise) with $n=512$. Note that when we use just a single projection matrix (i.e., for layerwise sharing), the resulting Linformer model's validation perplexity almost matches that of the non-shared model. This suggests that we can decrease the number of additional parameters in our model, and consequently its memory consumption, without much detriment to performance.
Effect of longer sequences: We evaluate the effect of sequence length during Linformer pretraining. In Figure 3(d), we plot the validation perplexity for Linformer with $n\in\{512,1024,2048,4096\}$, holding the projected dimension $k$ fixed at 256. Note that as sequence length increases, even though our projected dimension is fixed, the final perplexities after convergence remain about the same. This further empirically supports our assertion that the Linformer is linear-time.