Squeeze BERT: What can computer vision teach NLP about efficient neural networks?

Squeeze BERT：计算机视觉能为 NLP 提供哪些关于高效神经网络的启示？

Abstract

摘要

Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as crucial platforms for deploying NLP models at scale. However, today’s highly accurate NLP neural network models such as BERT and RoBERTa are extremely computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone. In this work, we observe that methods such as grouped convolutions have yielded significant speedups for computer vision networks, but many of these techniques have not been adopted by NLP neural network designers. We demonstrate how to replace several operations in self-attention layers with grouped convolutions, and we use this technique in a novel network architecture called Squeeze BERT, which runs $4.3\times$ faster than BERT-base on the Pixel 3 while achieving competitive accuracy on the GLUE test set. The Squeeze BERT code will be released.

人类每天阅读和书写数千亿条消息。此外，由于大规模数据集、大型计算系统和更好的神经网络模型的可用性，自然语言处理（NLP）技术在理解、校对和组织这些消息方面取得了显著进展。因此，在众多应用中部署 NLP 技术以帮助网络用户、社交网络和企业具有巨大的机会。特别是，我们认为智能手机和其他移动设备是规模化部署 NLP 模型的关键平台。然而，当今高度准确的 NLP 神经网络模型（如 BERT 和 RoBERTa）计算成本极高，BERT-base 在 Pixel 3 智能手机上分类一段文本需要 1.7 秒。在这项工作中，我们观察到诸如分组卷积等方法在计算机视觉网络中带来了显著的加速，但许多这些技术尚未被 NLP 神经网络设计者采用。我们展示了如何用分组卷积替换自注意力层中的多个操作，并将这一技术应用于一种名为 Squeeze BERT 的新型网络架构中，该架构在 Pixel 3 上比 BERT-base 快 $4.3\mathbf{X}$ ，同时在 GLUE 测试集上实现了具有竞争力的准确性。Squeeze BERT 代码将公开发布。

1 Introduction and Motivation

1 引言与动机

The human race writes over 300 billion messages per day [1, 2, 3, 4]. Out of these, more than half of the world’s emails are read on mobile devices, and nearly half of Facebook users exclusively access Facebook from a mobile device [5, 6]. Natural language processing (NLP) technology has the potential to aid these users and communities in several ways. When a person writes a message, NLP models can help with spelling and grammar checking as well as sentence completion. When content is added to a social network, NLP can facilitate content moderation before it appears in other users’ news feeds. When a person consumes messages, NLP models can help classify messages into folders, compose news feeds, prioritize messages, and identify duplicate messages.

人类每天撰写超过3000亿条消息 [1, 2, 3, 4]。其中，全球超过一半的电子邮件是在移动设备上阅读的，近一半的Facebook用户仅通过移动设备访问Facebook [5, 6]。自然语言处理 (Natural Language Processing, NLP) 技术有潜力以多种方式帮助这些用户和社区。当一个人撰写消息时，NLP模型可以帮助进行拼写和语法检查以及句子补全。当内容被添加到社交网络时，NLP可以在内容出现在其他用户的新闻推送之前进行内容审核。当一个人消费消息时，NLP模型可以帮助将消息分类到文件夹中，组成新闻推送，优先处理消息，并识别重复消息。

In recent years, the development and adoption of Attention Neural Networks have led to dramatic improvements in almost every area of NLP. In 2017, Vaswani et al. proposed the multi-head self-attention module, which demonstrated superior accuracy to recurrent neural networks on English-German machine translation [7]. These modules have since been adopted by GPT [8] and BERT [9] for sentence classification, and by GPT-2 [10] and CTRL [11] for sentence completion and generation. Recent works such as ELECTRA [12] and RoBERTa [13] have shown that larger datasets and more sophisticated training regimes can further improve the accuracy of self-attention networks.

近年来，注意力神经网络 (Attention Neural Networks) 的发展和应用几乎在自然语言处理 (NLP) 的每个领域都带来了显著的改进。2017 年，Vaswani 等人提出了多头自注意力模块 (multi-head self-attention module)，在英德机器翻译任务中展示了比循环神经网络更高的准确性 [7]。这些模块随后被 GPT [8] 和 BERT [9] 用于句子分类，并被 GPT-2 [10] 和 CTRL [11] 用于句子补全和生成。最近的研究如 ELECTRA [12] 和 RoBERTa [13] 表明，更大的数据集和更复杂的训练机制可以进一步提高自注意力网络的准确性。

Considering the enormity of the textual data created by humans on mobile devices, a natural approach is to deploy NLP models on the mobile devices themselves, embedding them in common apps that are used to read, write, and share text. Unfortunately, many of today’s state-of-the-art NLP models are computationally expensive, often making mobile deployment impractical. For example, we observe that running the BERT-base network on a Google Pixel 3 smartphone takes approximately 1.7 seconds to classify a single text data sample. Much of the research on efficient self-attention networks for NLP has just emerged in the past year. However, starting with SqueezeNet [14], the mobile computer vision (CV) community has spent the last four years optimizing neural networks for mobile devices. Intuitively, it seems like there must be opportunities to apply the lessons learned from the rich literature of mobile CV research to accelerate mobile NLP. In the following we review what has already been applied and propose two additional techniques from CV that we will leverage for accelerating NLP.

考虑到人类在移动设备上创建的文本数据的庞大性，一个自然的方法是将 NLP 模型本身部署在移动设备上，并将其嵌入到用于阅读、写入和分享文本的常见应用中。不幸的是，当今许多最先进的 NLP 模型在计算上往往非常昂贵，通常使得移动部署变得不切实际。例如，我们观察到在 Google Pixel 3 智能手机上运行 BERT-base 网络大约需要 1.7 秒来分类单个文本数据样本。2 许多关于高效自注意力网络的研究在过去一年才出现。然而，从 SqueezeNet [14] 开始，移动计算机视觉 (CV) 社区在过去四年中一直在优化用于移动设备的神经网络。直觉上，似乎有机会将从丰富的移动 CV 研究文献中学到的经验应用于加速移动 NLP。接下来，我们将回顾已经应用的技术，并提出两种来自 CV 的额外技术，我们将利用这些技术来加速 NLP。

1.1 What has CV research already taught NLP research about efficient networks?

1.1 计算机视觉研究已经为自然语言处理研究提供了哪些关于高效网络的启示？

In recent months, novel self-attention networks have been developed with the goal of achieving faster inference. At present, the MobileBERT network defines the state-of-the-art in low-latency text classification for mobile devices [15]. MobileBERT takes approximately 0.6 seconds to classify a text sequence on a Google Pixel 3 smartphone. And, on the GLUE benchmark, which consists of 9 natural language understanding (NLU) datasets [16], MobileBERT achieves higher accuracy than other efficient networks such as DistilBERT [17], PKD [18], and several others [19, 20, 21, 22]. To achieve this, MobileBERT introduced two concepts into their NLP self-attention network that are already in widespread use in CV neural networks:

近几个月来，新型的自注意力网络（self-attention networks）被开发出来，旨在实现更快的推理速度。目前，MobileBERT网络在移动设备的低延迟文本分类领域定义了最先进的技术 [15]。在Google Pixel 3智能手机上，MobileBERT大约需要0.6秒来分类一个文本序列。此外，在由9个自然语言理解（NLU）数据集组成的GLUE基准测试中 [16]，MobileBERT的准确率高于其他高效网络，如DistilBERT [17]、PKD [18] 以及其他几个网络 [19, 20, 21, 22]。为了实现这一点，MobileBERT在其NLP自注意力网络中引入了两个已经在计算机视觉（CV）神经网络中广泛使用的概念：

1.2 What else can CV research teach NLP research about efficient networks?

1.2 计算机视觉研究还能为自然语言处理研究提供哪些关于高效网络的启示？

We are encouraged by the progress that MobileBERT has made in leveraging ideas that are popular in the CV literature to accelerate NLP.

我们对于 MobileBERT 在利用计算机视觉 (CV) 文献中流行的思想来加速自然语言处理 (NLP) 方面取得的进展感到鼓舞。

However, we are aware of two other ideas from CV that were not used in MobileBERT and that could be applied to accelerate NLP:

然而，我们注意到计算机视觉 (CV) 领域还有两个未被 MobileBERT 使用的想法，这些想法可以应用于加速自然语言处理 (NLP)：

1.3 Squeeze BERT: Applying lessons learned from CV to NLP

1.3 Squeeze BERT：将计算机视觉的经验应用于自然语言处理

In this work, we describe how to apply convolutions and particularly grouped convolutions in the design of a novel self-attention network for NLP, which we call Squeeze BERT. Empirically, we find that Squeeze BERT runs at lower latency on a smartphone than BERT-base, MobileBERT, and several other efficient NLP models, while maintaining competitive accuracy.

在本工作中，我们描述了如何在设计一种新型的自注意力网络（我们称之为 Squeeze BERT）中应用卷积，特别是分组卷积，用于自然语言处理（NLP）。我们发现，Squeeze BERT 在智能手机上的延迟低于 BERT-base、MobileBERT 和其他几种高效的自然语言处理 (NLP) 模型，同时保持了具有竞争力的准确性。

Table 1: How does BERT spend its time? This is a breakdown of computation (in floating-point operations, or FLOPs) and latency (on a Google Pixel 3 smartphone) in BERT-base, reported to three significant digits. The FC layers account for more than $97\%$ of the FLOPs and over $88\%$ of the latency.

表 1: BERT 的时间花在哪里？这是 BERT-base 中计算量（以浮点运算次数，即 FLOPs 为单位）和延迟（在 Google Pixel 3 智能手机上）的详细分解，数据保留三位有效数字。全连接层 (FC layers) 占据了超过 97% 的计算量和超过 88% 的延迟。

Stage	Module type	FLOPs	Latency
Embedding	Embedding	0.00%	0.26%
Encoder	FC in self-attention modules	24.3%	18.9%
Encoder	softmax in self-attention modules	2.70%	11.3%
Encoder	FC in feed-forward network layers	73.0%	69.4%
Final classifier	FC in final classifier	0.00%	0.02%
Total		100%	100%

2 Implementing self-attention with convolutions

2 使用卷积实现自注意力 (Self-Attention)

In this section, first we review the basic structure of self-attention networks, next we identify that their biggest computational bottleneck is in their position-wise fully-connected (PFC) layers, and then we show that these PFC layers are equivalent to a 1D convolution with a kernel size of 1.

在本节中，首先我们回顾了自注意力网络的基本结构，接着指出其最大的计算瓶颈在于逐位置的全连接层 (PFC)，然后我们证明这些 PFC 层等价于核大小为 1 的一维卷积。

2.1 Self-attention networks

2.1 自注意力网络

In most BERT-derived networks there are typically 3 stages: the embedding, the encoder, and the classifier [9, 13, 12, 15, 19]. The embedding converts preprocessed words (represented as integer-valued tokens) into learned feature vectors of floating-point numbers; the encoder consists of a series of self-attention and other layers; and the classifier produces the network’s final output. As Table 1 shows, the embedding and the classifier account for less than $1\%$ of the runtime of a self-attention network, so we focus our discussion on the encoder.

在大多数基于 BERT 的网络中，通常有 3 个阶段：嵌入 (embedding)、编码器 (encoder) 和分类器 (classifier) [9, 13, 12, 15, 19]。嵌入将预处理后的单词（表示为整数值的 Token）转换为学习到的浮点数特征向量；编码器由一系列自注意力 (self-attention) 层和其他层组成；分类器生成网络的最终输出。正如我们稍后将在表 1 中看到的，嵌入和分类器占自注意力网络运行时间的不到 $1%$ ，因此我们将讨论重点放在编码器上。

We now describe the encoder that is used in BERT-base [9]. The encoder consists of a stack of blocks. Each block consists of a self-attention module followed by three position-wise fully-connected layers, known as feed-forward network (FFN) layers. Each self-attention module contains three separate position-wise fully-connected (PFC) layers, which are used to generate the query $(Q)$, key $(K)$, and value $(V)$ activation vectors for each position in the feature embedding. Each of these PFC layers in self-attention applies the same operation to each position in the feature embedding independently. While neural networks traditionally multiply weights by activations, a distinguishing factor of attention neural networks is that they multiply activations by other activations, which enables the weighting of tensor elements to adjust dynamically based on the input data. Further, attention networks allow modeling of arbitrary dependencies regardless of their distance in the input or output [7]. The self-attention module proposed by Vaswani et al. [7] (which is also used by GPT [8], BERT [9], RoBERTa [13], ELECTRA [12] and others) multiplies the $Q$, $K$, and $V$ activations together using the equation $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$, where $d_{k}$ is the number of channels in one attention head.

我们现在描述BERT-base [9]中使用的编码器。编码器由一系列块组成。每个块由一个自注意力模块和三个位置全连接层（称为前馈网络（FFN）层）组成。每个自注意力模块包含三个独立的位置全连接（PFC）层，用于为特征嵌入中的每个位置生成查询 $(Q)$ 、键 $(K)$ 和值 $(V)$ 激活向量。自注意力中的每个PFC层独立地对特征嵌入中的每个位置应用相同的操作。虽然传统神经网络将权重乘以激活，但注意力神经网络的一个显著特点是它们将激活乘以其他激活，这使得张量元素能够根据输入数据进行动态加权调整。此外，注意力网络允许对任意依赖关系进行建模，而不管它们在输入或输出中的距离如何 [7]。Vaswani等人 [7] 提出的自注意力模块（也被GPT [8]、BERT [9]、RoBERTa [13]、ELECTRA [12] 等使用）使用公式 $\begin{array}{r}{s o f t m a x(\frac{Q K^{T}}{\sqrt{d_{k}}})V}\end{array}$ 将 $Q,K$ 和 $V$ 激活相乘，其中 $d_{k}$ 是一个注意力头中的通道数。
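The attention equation above can be sketched for a single head in a few lines of plain Python. This is only an illustration under simplifying assumptions: the names `softmax` and `attention` are ours, one head is computed on nested lists, and masking, multiple heads, and the PFC layers that produce $Q$, $K$, and $V$ are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K are lists of P vectors with d_k channels; V is a list of P vectors.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Activations multiplied by other activations: one score per key position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, channel by channel.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0]]   # identical keys, so every query attends uniformly
V = [[2.0, 0.0], [0.0, 4.0]]
result = attention(Q, K, V)    # each output row is the mean of the value vectors
```

Because the two keys are identical, both queries receive uniform attention weights and each output row is the average of the two value vectors.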

2.2 Benchmarking BERT for mobile inference

2.2 移动设备推理中的 BERT 基准测试

To identify the parts of BERT that are time-consuming to compute, we profile BERT on a smartphone. Specifically, we measure the neural network’s latency using PyTorch [34] and TorchScript on a Google Pixel 3 smartphone, with an input sequence length of 128 and a batch size of 1. In Table 1, we show the breakdown of FLOPs and latency among the main components of the BERT network, and we observe that the $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$ calculations in the self-attention modules account for only 11.3 percent of the total latency. However, the PFC layers in the self-attention modules account for 18.9 percent, and the PFC layers in the feed-forward network modules account for 69.4 percent, and in total the PFC layers account for 88.3 percent of the latency. Given that the PFC layers account for the overwhelming majority of the latency, we now turn our focus to reducing the latency of the PFC layers.

为了识别BERT中计算耗时的部分，我们在智能手机上对BERT进行了性能分析。具体来说，我们使用PyTorch [34] 和 Torch Script在Google Pixel 3智能手机上测量了神经网络的延迟，输入序列长度为128，批量大小为1。在表1中，我们展示了BERT网络主要组件的FLOPs和延迟的分解情况，并观察到自注意力模块中的 $\begin{array}{r}{s o f t m a x(\frac{Q K^{T}}{\sqrt{d_{k}}})V}\end{array}$ 计算仅占总延迟的11.3%。然而，自注意力模块中的PFC层占18.9%，前馈网络模块中的PFC层占69.4%，总计PFC层占延迟的88.3%。鉴于PFC层占据了绝大部分的延迟，我们现在将重点放在减少PFC层的延迟上。
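As a rough cross-check of the FLOPs column in Table 1, the multiply-accumulates in one encoder block can be counted by hand. The sketch below assumes the BERT-base sizes given in this paper (768 channels, a 3072-channel intermediate FFN size, sequence length 128), assumes the "softmax" row covers both the $QK^{T}$ product and the multiplication by $V$, and ignores biases, normalization, the embedding, and the classifier. The variable names are illustrative.

```python
P = 128          # sequence length used in the profiling run
C = 768          # BERT-base hidden size
C_FFN = 3072     # BERT-base intermediate size

# Multiply-accumulates per encoder block.
attn_pfc = 3 * P * C * C                         # Q, K, V projections
ffn_pfc = P * (C * C + C * C_FFN + C_FFN * C)    # FFN1, FFN2, FFN3
attn_core = 2 * P * P * C                        # Q K^T and weights-times-V, summed over heads

total = attn_pfc + ffn_pfc + attn_core
shares = {
    "attn_pfc": 100 * attn_pfc / total,
    "attn_core": 100 * attn_core / total,
    "ffn_pfc": 100 * ffn_pfc / total,
}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Under these assumptions the three shares come out at roughly 24.3%, 2.7%, and 73.0%, in line with the FLOPs figures in Table 1. Latency shares differ because operation counts do not translate directly into runtime.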

2.3 Replacing the position-wise fully connected layers (PFC) with convolutions

2.3 使用卷积替换位置全连接层 (PFC)

To address this, we intend to replace the PFC layers with grouped convolutions, which have been shown to produce significant speedups in computer vision networks. As a first step in this direction, we now show that the fully-connected layers used throughout the BERT encoder are a special case of non-grouped 1D convolution. In the following, $f$ denotes the input feature vector, and $w$ denotes the weights. Given an input feature vector of dimensions $(P,C_{in})$ with $P$ positions and $C_{in}$ channels to generate an output of $(P,C_{out})$ features, the operation performed by the position-wise fully-connected layer can be defined as follows:

为了解决这个问题，我们打算用分组卷积（grouped convolutions）替换 PFC 层，这种方法已被证明可以显著加速计算机视觉网络。作为这一方向的第一步，我们现在展示 BERT 编码器中使用的全连接层是非分组 1D 卷积的一种特殊情况。在下文中， $f$ 表示输入特征向量， $w$ 表示权重。给定一个维度为 $(P,C_{i n})$ 的输入特征向量，其中 $P$ 表示位置数， $C_{i n}$ 表示通道数，生成 $(P,C_{o u t})$ 维的输出特征，位置全连接层执行的操作可以定义如下：

$$\mathrm{PositionwiseFullyConnected}_{p,c_{out}}(f,w)=\sum_{i=0}^{C_{in}-1} w_{c_{out},i}\, f_{p,i}$$

Then if we consider the definition of a 1D convolution with kernel size $K$ with the same input and output dimensions

如果我们考虑核大小为 $K$ 的一维卷积的定义，且输入和输出维度相同
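$$\mathrm{Convolution}_{p,c_{out}}(f,w)=\sum_{i=0}^{C_{in}-1}\sum_{k=0}^{K-1} w_{c_{out},i,k}\, f_{\left(p-\frac{K-1}{2}+k\right),i}$$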

we observe that the position-wise fully-connected operation is equivalent to a convolution with a kernel size of $K=1$. Thus, the PFC layers of Vaswani et al. [7], GPT, BERT, and similar self-attention networks can be implemented using convolutions without changing the networks’ numerical properties or behavior.

我们观察到，逐位置全连接操作等价于核大小为 $K=1$ 的卷积。因此，Vaswani 等人 [7]、GPT、BERT 以及类似的自注意力网络中的 PFC 层可以使用卷积实现，而不会改变网络的数值属性或行为。
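The equivalence can be checked numerically on a toy input. The following is a minimal sketch in plain Python, not the authors' implementation: `pfc` and `conv1d` are illustrative names, activations are stored as nested lists with shape $(P, C_{in})$, indices are zero-based, and positions outside the sequence are treated as zeros.

```python
def pfc(f, w):
    """Position-wise fully-connected layer.

    f: P x C_in activations; w: C_out x C_in weights. Returns P x C_out.
    """
    return [[sum(w_row[i] * f_p[i] for i in range(len(f_p))) for w_row in w] for f_p in f]

def conv1d(f, w):
    """Non-grouped 1D convolution with zero padding (output length equals P).

    f: P x C_in activations; w: C_out x C_in x K weights, with K odd.
    """
    P, C_in, K = len(f), len(f[0]), len(w[0][0])
    half = (K - 1) // 2
    out = []
    for p in range(P):
        row = []
        for w_c in w:
            acc = 0.0
            for i in range(C_in):
                for k in range(K):
                    q = p - half + k
                    if 0 <= q < P:  # positions outside the sequence count as zero
                        acc += w_c[i][k] * f[q][i]
            row.append(acc)
        out.append(row)
    return out

f = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # P = 3, C_in = 2
w_fc = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]      # C_out = 3
w_conv = [[[x] for x in row] for row in w_fc]     # same weights with kernel size K = 1

assert pfc(f, w_fc) == conv1d(f, w_conv)
```

With $K=1$ the inner loop over $k$ visits only position $p$ itself, so the convolution reduces to the same sum over input channels as the fully-connected layer.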

3 Incorporating grouped convolutions into self-attention

3 将分组卷积引入自注意力机制

Now that we have shown how to implement the expensive PFC layers in self-attention networks using convolutions, we can incorporate efficient grouped convolutions into a self-attention network. Grouped convolutions are defined as follows. Given an input feature vector of dimensions $(P,C_{in})$ with $P$ positions and $C_{in}$ channels and outputting a vector with dimensions $(P,C_{out})$, a 1D convolution with kernel size $K$ and $G$ groups can be defined as follows.

既然我们已经展示了如何使用卷积在自注意力网络中实现昂贵的 PFC 层，那么我们可以将高效的分组卷积引入自注意力网络。分组卷积定义如下。给定一个维度为 $(P,C_{i n})$ 的输入特征向量，其中 $\mathrm{P}$ 表示位置， $C_{i n}$ 表示通道数，并输出一个维度为 $(P,C_{o u t})$ 的向量，核大小为 $K$ 且分组数为 $G$ 的一维卷积可以定义如下。

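$$\mathrm{Convolution}_{p,c_{out}}(f,w)=\sum_{i=0}^{\frac{C_{in}}{G}-1}\sum_{k=0}^{K-1} w_{c_{out},i,k}\, f_{\left(p-\frac{K-1}{2}+k\right),\left(\left\lfloor\frac{G\,c_{out}}{C_{out}}\right\rfloor\frac{C_{in}}{G}+i\right)}$$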

This is equivalent to splitting the input vector into $G$ separate vectors of size $(P,\frac{C_{in}}{G})$ along the channel dimension and running $G$ separate convolutions with independent weights, each computing vectors of size $(P,\frac{C_{out}}{G})$. The grouped convolution, however, requires only $\frac{1}{G}$ as many floating-point operations (FLOPs) and $\frac{1}{G}$ as many weights as an ordinary convolution, not counting the small (and unchanged) number of operations needed for the channel-wise bias term that is often included in convolutional layers.

这相当于将输入向量沿通道维度分割为 $G$ 个大小为 $(P,\frac{C_{in}}{G})$ 的独立向量，并运行 $G$ 个独立的卷积，每个卷积使用独立的权重计算大小为 $(P,\frac{C_{out}}{G})$ 的向量。然而，分组卷积仅需要普通卷积的 $\frac{1}{G}$ 浮点运算 (FLOPs) 和 $\frac{1}{G}$ 权重，不包括卷积层中通常包含的通道偏置项所需的小量（且不变的）操作。
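The splitting interpretation can be illustrated with a small example. This is a sketch under stated assumptions rather than a reference implementation: `grouped_conv1d` is an illustrative name, indices are zero-based, $G$ is assumed to divide both $C_{in}$ and $C_{out}$, out-of-range positions are treated as zeros, and the bias term is omitted.

```python
def grouped_conv1d(f, w, G):
    """1D grouped convolution with zero padding.

    f: P x C_in activations; w: C_out x (C_in // G) x K weights.
    """
    P, C_in = len(f), len(f[0])
    C_out, K = len(w), len(w[0][0])
    in_per_group, out_per_group = C_in // G, C_out // G
    half = (K - 1) // 2
    out = [[0.0] * C_out for _ in range(P)]
    for p in range(P):
        for c_out in range(C_out):
            # First input channel of the group that this output channel belongs to.
            base = (c_out // out_per_group) * in_per_group
            for i in range(in_per_group):
                for k in range(K):
                    q = p - half + k
                    if 0 <= q < P:
                        out[p][c_out] += w[c_out][i][k] * f[q][base + i]
    return out

# P = 2 positions, C_in = 4, C_out = 4, K = 1, G = 2
f = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
w = [[[1.0], [0.0]], [[0.0], [1.0]], [[1.0], [1.0]], [[2.0], [-1.0]]]
y = grouped_conv1d(f, w, G=2)

# Same result from splitting the channels and running G independent convolutions.
halves = []
for g in range(2):
    f_g = [row[2 * g: 2 * g + 2] for row in f]
    w_g = w[2 * g: 2 * g + 2]
    halves.append(grouped_conv1d(f_g, w_g, G=1))
y_split = [halves[0][p] + halves[1][p] for p in range(2)]

n_weights_grouped = sum(len(w_c) * len(w_c[0]) for w_c in w)   # C_out * (C_in / G) * K
n_weights_dense = 4 * 4 * 1                                    # C_out * C_in * K
```

Here the grouped layer holds half as many weights as the dense layer would, matching the $\frac{1}{G}$ factor for $G=2$.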

3.1 Squeeze BERT

Now, we describe our proposed neural architecture called Squeeze BERT, which uses grouped convolutions. Squeeze BERT is much like BERT-base, but with PFC layers implemented as convolutions, and grouped convolutions for many of the layers. Recall from Section 2 that each block in the BERT-base encoder has a self-attention module with 3 PFC layers, plus 3 more PFC layers called feed-forward network layers ($\mathrm{FFN_{1}}$, $\mathrm{FFN_{2}}$, and $\mathrm{FFN_{3}}$). The FFN layers have the following dimensions: $\mathrm{FFN_{1}}$ has $C_{in}=C_{out}=768$, $\mathrm{FFN_{2}}$ has $C_{in}=768$ and $C_{out}=3072$, and $\mathrm{FFN_{3}}$ has $C_{in}=3072$ and $C_{out}=768$. In all PFC layers of the self-attention modules, and in the $\mathrm{FFN_{2}}$ and $\mathrm{FFN_{3}}$ layers, we use grouped convolutions with $G=4$. To allow for mixing across channels of different groups, we use $G=1$ in the less-expensive $\mathrm{FFN_{1}}$ layers. Note that in BERT-base, $\mathrm{FFN_{2}}$ and $\mathrm{FFN_{3}}$ each have 4 times more arithmetic operations than $\mathrm{FFN_{1}}$. However, when we use $G=4$ in $\mathrm{FFN_{2}}$ and $\mathrm{FFN_{3}}$, all FFN layers have the same number of arithmetic operations.

现在，我们描述我们提出的名为 Squeeze BERT 的神经架构，该架构使用分组卷积。Squeeze BERT 与 BERT-base 非常相似，但将 PFC 层实现为卷积，并且许多层使用分组卷积。回顾第 2 节，BERT-base 编码器中的每个块都有一个带有 3 个 PFC 层的自注意力模块，以及另外 3 个称为前馈网络层的 PFC 层（$\mathrm{FFN_{1}}$、$\mathrm{FFN_{2}}$ 和 $\mathrm{FFN_{3}}$）。FFN 层具有以下维度：$\mathrm{FFN_{1}}$ 有 $C_{in}=C_{out}=768$，$\mathrm{FFN_{2}}$ 有 $C_{in}=768$ 和 $C_{out}=3072$，$\mathrm{FFN_{3}}$ 有 $C_{in}=3072$ 和 $C_{out}=768$。在自注意力模块的所有 PFC 层中，以及在 $\mathrm{FFN_{2}}$ 和 $\mathrm{FFN_{3}}$ 层中，我们使用 $G=4$ 的分组卷积。为了允许不同组通道之间的混合，我们在计算量较少的 $\mathrm{FFN_{1}}$ 层中使用 $G=1$。需要注意的是，在 BERT-base 中，$\mathrm{FFN_{2}}$ 和 $\mathrm{FFN_{3}}$ 的算术操作次数分别是 $\mathrm{FFN_{1}}$ 的 4 倍。然而，当我们在 $\mathrm{FFN_{2}}$ 和 $\mathrm{FFN_{3}}$ 中使用 $G=4$ 时，所有 FFN 层的算术操作次数都相同。
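The effect of these group settings on the per-layer operation count can be tallied directly. The sketch below uses the layer dimensions and $G$ values stated above; the helper `macs_per_position` and the dictionary layout are illustrative, and only multiply-accumulates of the kernel-size-1 layers are counted (no biases, no attention products).

```python
def macs_per_position(c_in, c_out, groups=1):
    """Multiply-accumulates per sequence position for a kernel-size-1 (grouped) convolution."""
    return c_in * c_out // groups

layers = {
    # name: (C_in, C_out, G in BERT-base, G in Squeeze BERT)
    "Q":    (768, 768, 1, 4),
    "K":    (768, 768, 1, 4),
    "V":    (768, 768, 1, 4),
    "FFN1": (768, 768, 1, 1),
    "FFN2": (768, 3072, 1, 4),
    "FFN3": (3072, 768, 1, 4),
}

bert = {n: macs_per_position(ci, co, g) for n, (ci, co, g, _) in layers.items()}
squeeze = {n: macs_per_position(ci, co, g) for n, (ci, co, _, g) in layers.items()}

reduction = sum(bert.values()) / sum(squeeze.values())
print(f"PFC multiply-accumulates per block shrink by {reduction:.1f}x")
```

With $G=4$ the three FFN layers end up with equal cost, and the PFC layers of one block need about 3.2 times fewer multiply-accumulates than in BERT-base. This is an operation count only; measured latency depends on the hardware and the software stack.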

Finally, the embedding size (768), the number of blocks in the encoder (12), the number of heads per self-attention module (12), the tokenizer (WordPiece [35, 36]), and other aspects of Squeeze BERT are adopted from BERT-base. Aside from the convolution-based implementation and the adoption of grouped convolutions, the Squeeze BERT architecture is identical to BERT-base.

最后，Squeeze BERT 的嵌入大小 (768)、编码器中的块数 (12)、每个自注意力模块的头数 (12)、分词器 (WordPiece [35, 36]) 以及其他方面均采用了 BERT-base 的配置。除了基于卷积的实现和分组卷积的采用外，Squeeze BERT 的架构与 BERT-base 完全相同。

4 Experimental Methodology

4 实验方法

4.1 Datasets

4.1 数据集

Pretraining Data. For pretraining, we use a combination of Wikipedia and BooksCorpus [37], setting aside $3\%$ of the combined dataset as a test set. Following the ALBERT paper, we use Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) as pretraining tasks [19].

预训练数据。对于预训练，我们使用了 Wikipedia 和 Books Corpus [37] 的组合，并将组合数据集的 $3%$ 留作测试集。遵循 ALBERT 论文，我们使用掩码语言建模 (Masked Language Modeling, MLM) 和句子顺序预测 (Sentence Order Prediction, SOP) 作为预训练任务 [19]。

Finetuning Data. We finetune and evaluate Squeeze BERT (and other baselines) on the General Language Understanding Evaluation (GLUE) set of tasks. This benchmark consists of a diverse set of 9 NLU tasks; thanks to the structure and breadth of these tasks (see supplementary material for detailed task-level information), GLUE has become the standard evaluation benchmark for NLP research. A model’s performance across the GLUE tasks likely provides a good approximation of that model’s generalizability (especially to other text classification tasks).

微调数据。我们在通用语言理解评估（GLUE）任务集上对Squeeze BERT（及其他基线模型）进行微调和评估。该基准测试包含9个多样化的自然语言理解（NLU）任务；得益于这些任务的结构和广度（详见补充材料中的任务级别详细信息），GLUE已成为NLP研究的标准评估基准。模型在GLUE任务上的表现很可能很好地反映了该模型的泛化能力（尤其是对其他文本分类任务）。

4.2 Training Methodology

4.2 训练方法

Many of the recent papers on efficient NLP networks report results on models trained with bells and whistles such as distillation, adversarial training, and/or transfer learning across GLUE tasks. However, there is no standardization of these training schemes across different papers, making it difficult to distinguish the contribution of the model from the contribution of the training scheme to the final accuracy number. Therefore, we first train Squeeze BERT using a simple training scheme (described in Section 4.2.1, with results reported in Section 5.1), and then we train Squeeze BERT with distillation and other techniques (described in Section 4.2.2, with results reported in Section 5.2).

许多最近关于高效 NLP 网络的论文报告了使用蒸馏、对抗训练和/或跨 GLUE 任务的迁移学习等技术训练的模型结果。然而，不同论文中这些训练方案没有标准化，使得难以区分模型本身和训练方案对最终准确率的贡献。因此，我们首先使用简单的训练方案训练 Squeeze BERT（在第 4.2.1 节中描述，结果在第 5.1 节中报告），然后使用蒸馏和其他技术训练 Squeeze BERT（在第 4.2.2 节中描述，结果在第 5.2 节中报告）。

4.2.1 Training without bells and whistles

4.2.1 无额外修饰的训练

We pretrain Squeeze BERT from scratch (without distillation) using the LAMB optimizer, and we employ the hyperparameters recommended by the LAMB authors: a global batch size of 8192, a learning rate of $2.5\mathrm{e}{-}3$, and a warmup proportion of 0.28 [38]. Following the LAMB paper’s recommendations, we pretrain for $56\mathrm{k}$ steps with a maximum sequence length of 128 and then for $6\mathrm{k}$ steps with a maximum sequence length of 512.

我们使用 LAMB 优化器从头开始预训练 Squeeze BERT（不进行蒸馏），并采用了 LAMB 作者推荐的超参数：全局批量大小为 8192，学习率为 $2.5\mathrm{e}{-}3$，预热比例为 0.28 [38]。根据 LAMB 论文的建议，我们首先以最大序列长度 128 预训练 $56\mathrm{k}$ 步，然后以最大序列长度 512 预训练 $6\mathrm{k}$ 步。

For finetuning, we use the AdamW optimizer with a batch size of 16, without momentum or weight decay, and with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ [39]. As is common in the literature, during finetuning for each task, we perform hyperparameter tuning on the learning rate and dropout rate. We present more details on this in the supplementary material. In the interest of a fair comparison, we also train BERT-base using the aforementioned pretraining and finetuning protocol.

对于微调，我们使用 AdamW 优化器，批量大小为 16，不使用动量或权重衰减， $\beta_{1}=0.9$ 和 $\beta_{2}=0.999$ [39]。正如文献中常见的那样，在针对每个任务进行微调时，我们对学习率和 dropout 率进行超参数调优。我们在补充材料中提供了更多详细信息。为了公平比较，我们还使用上述预训练和微调协议训练了 BERT-base。

4.2.2 Training with bells and whistles

4.2.2 带附加功能的训练

We now review recent techniques for improving the training of NLP networks, and we describe the approaches that we will use for the training and evaluation of Squeeze BERT in Section 5.2.

我们现在回顾最近用于改进自然语言处理（NLP）网络训练的技术，并描述我们将在第5.2节中用于训练和评估Squeeze BERT的方法。

Distillation approaches used in other efficient NLP networks. While the term "knowledge distillation" was coined by Hinton et al. to describe a specific method and equation [40], the term "distillation" is now used in reference to a diverse range of approaches where a "student" network is trained to replicate a "teacher" network. Some researchers distill only the final layer of the network [17], while others also distill the hidden layers [15, 18, 22]. When distilling the hidden layers, some apply layer-by-layer distillation warmup, where each module of the student network is distilled independently while downstream modules are frozen [15]. Some distill during pretraining [15, 17], some distill during finetuning [22], and some do both [18, 21].

其他高效 NLP 网络中使用的蒸馏方法。虽然“知识蒸馏”这一术语是由 Hinton 等人提出的，用于描述一种特定的方法和公式 [40]，但现在“蒸馏”一词被广泛用于指代各种方法，其中“学生”网络被训练以复制“教师”网络。一些研究人员仅蒸馏网络的最后一层 [17]，而另一些则还蒸馏隐藏层 [15, 18, 22]。在蒸馏隐藏层时，一些研究人员采用逐层蒸馏预热，其中学生网络的每个模块独立蒸馏，而下游模块被冻结 [15]。一些在预训练期间进行蒸馏 [15, 17]，一些在微调期间进行蒸馏 [22]，还有一些两者都进行 [18, 21]。

Table 2: Comparison of neural networks on the development set of the GLUE benchmark. † denotes models trained by the authors of the present paper. Bells and whistles are: $A=$ adversarial training; $D=$ distillation of final layer; $E=$ distillation of encoder layers; $S=$ transfer learning across GLUE tasks (a.k.a. STILTs [41]); $W=$ per-layer warmup. In GLUE accuracy, a dash means that accuracy for this task is not provided in the literature.

表 2: GLUE 基准开发集上的神经网络比较。表示由本文作者训练的模型。附加功能包括： $A=$ 对抗训练； $D=$ 最后一层的蒸馏； $E=$ 编码器层的蒸馏； $S=$ 跨 GLUE 任务的迁移学习（也称为 STILTs [41]）； $W=$ 每层预热。在 GLUE 准确率中，破折号表示文献中未提供该任务的准确率。

Columns: Model; Bells and whistles; GLUE accuracy (MNLI-m, MNLI-mm, MRPC, RTE, Average); Efficiency (#MParams, GFLOPs, Latency (ms), Speedup)
Results without bells and whistles
BERT-base†	85.2	84.8
MobileBERT [15]	80.8	88.2
ALBERT-base [19]	81.6	90.3
SqueezeBERT†	51.1	7.42
Results with bells and whistles
DistilBERT 6/768 [17] D	82.2	88.5
一 D	82.58	83.4
Turc 6/768 [20] =	2.1x
Theseus 6/768 [22] DESW	82.3	89.6
MobileBERT [15] DEW	84.4	91.5
SqueezeBERT† DS	51.1	7.42

Bells and whistles used for training Squeeze BERT (for results in Section 5.2). Distillation is not a central focus of this paper, and there is a large design space of potential approaches to distillation, so we select a relatively simple form of distillation for use in Squeeze BERT training. We apply distillation only to the final layer, and only during finetuning. On the GLUE sentence classification tasks, we use soft cross entropy loss with respect to a weighted sum of the teacher’s logits and a one-hot encoding of the ground-truth. Also note that GLUE has one regression task (STS-B text similarity), and for this task we replace the soft cross entropy loss with mean squared error. In addition to distillation, inspired by STILTs [41] and ELECTRA [12], we apply transfer learning from the MNLI GLUE task to other GLUE tasks as follows. The Squeeze BERT student model is pretrained using the approach described in Section 4.2.1, and then it is finetuned on the MNLI task. The weights from MNLI training are used as the initial student weights for other GLUE tasks except for CoLA. Similarly, the teacher model is a BERT-base model that is pretrained using the ELECTRA method and then finetuned on MNLI. Then, the teacher model is finetuned independently on each GLUE task, and these task-specific teacher weights are used for distillation.

用于训练 Squeeze BERT 的附加技巧（用于第 5.2 节的结果）。蒸馏并不是本文的核心重点，而且蒸馏方法的设计空间很大，因此我们选择了一种相对简单的蒸馏形式用于 Squeeze BERT 训练。我们仅在微调期间对最后一层应用蒸馏。在 GLUE 句子分类任务中，我们使用软交叉熵损失，该损失基于教师模型的 logits 和真实标签的 one-hot 编码的加权和。此外，GLUE 中有一个回归任务（STS-B 文本相似度），对于该任务，我们将软交叉熵损失替换为均方误差。除了蒸馏之外，受 STILTS [41] 和 ELECTRA [12] 的启发，我们应用了从 MNLI GLUE 任务到其他 GLUE 任务的迁移学习，具体如下。Squeeze BERT 学生模型使用第 4.2.1 节中描述的方法进行预训练，然后在 MNLI 任务上进行微调。MNLI 训练得到的权重被用作除 CoLA 之外的其他 GLUE 任务的初始学生权重。同样，教师模型是一个使用 ELECTRA 方法预训练并在 MNLI 上微调的 BERT-base 模型。然后，教师模型在每个 GLUE 任务上独立微调，这些特定任务的教师权重用于蒸馏。
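One way to read the final-layer distillation loss for classification is sketched below. The text does not spell out the exact formula, so several details here are assumptions: the teacher's logits are turned into probabilities with a softmax before being blended with the one-hot label, the blending weight `alpha` is a hypothetical parameter, and the function names are ours. The mean squared error variant used for STS-B is not shown.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_cross_entropy(student_logits, target_probs):
    """Cross entropy of the student's prediction against a soft target distribution."""
    return -sum(t * ls for t, ls in zip(target_probs, log_softmax(student_logits)))

def distillation_target(teacher_logits, label, alpha):
    """Blend of teacher probabilities and the one-hot label (alpha is a hypothetical weight)."""
    teacher_probs = [math.exp(ls) for ls in log_softmax(teacher_logits)]
    return [alpha * p + (1.0 - alpha) * (1.0 if c == label else 0.0)
            for c, p in enumerate(teacher_probs)]

student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
target = distillation_target(teacher, label=0, alpha=0.5)
loss = soft_cross_entropy(student, target)
hard_loss = soft_cross_entropy(student, distillation_target(teacher, label=0, alpha=0.0))
```

Setting `alpha` to 0 recovers the ordinary cross entropy against the ground-truth label, which makes the role of the teacher term easy to isolate.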

5 Results

5 结果

We now turn our attention to comparing Squeeze BERT to other efficient neural networks.

我们现在将注意力转向比较 Squeeze BERT 与其他高效神经网络。

5.1 Results without bells and whistles

5.1 无额外修饰的结果

In the upper portions of Tables 2 and 3, we compare our results to other efficient networks on the dev and test sets of the GLUE benchmark. Note that relatively few of the efficiency-optimized networks report results without bells and whistles, and most such results are reported on the development (not test) set of GLUE. Fortunately, the authors of MobileBERT (a network that, as we will see in the next section, compares favorably to other efficient networks when bells and whistles are enabled) do provide development-set results without distillation on 4 of the GLUE tasks. We observe in the upper portion of Table 2 that, when both networks are trained without distillation, Squeeze BERT achieves higher accuracy than MobileBERT on all of these tasks. This provides initial evidence that the techniques from computer vision that we have adopted can be applied to NLP, and reasonable accuracy can be obtained. Further, we observe that Squeeze BERT is $4.3\times$ faster than BERT-base, while MobileBERT is $3.0\times$ faster than BERT-base.

在表 2 和表 3 的上半部分，我们将我们的结果与 GLUE 基准测试的开发集和测试集上的其他高效网络进行了比较。需要注意的是，相对较少的高效优化网络报告了没有额外修饰的结果，而且大多数此类结果都是在 GLUE 的开发集（而非测试集）上报告的。幸运的是，MobileBERT 的作者——我们将在下一节中发现，该网络在启用额外修饰的情况下与其他高效网络相比表现优异——确实提供了在 4 个 GLUE 任务上未经蒸馏的开发集结果。我们在表 2 的上半部分观察到，当两个网络都在没有蒸馏的情况下训练时，Squeeze BERT 在所有任务上的准确率都高于 MobileBERT。这初步证明了我们从计算机视觉中采用的技术可以应用于 NLP，并且可以获得合理的准确率。此外，我们观察到 Squeeze BERT 比 BERT-base 快 $4.3\mathbf{X}$ ，而 MobileBERT 比 BERT-base 快 $3.0\mathrm{x}$ 。
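The reported speedups can be converted into approximate per-sample latencies using the 1.7-second BERT-base figure quoted earlier. This is back-of-the-envelope arithmetic on rounded numbers from the text, not a measurement, and the variable names are illustrative.

```python
bert_base_latency_s = 1.7      # reported for BERT-base on a Pixel 3
squeeze_speedup = 4.3          # reported speedup of Squeeze BERT over BERT-base
mobilebert_speedup = 3.0       # reported speedup of MobileBERT over BERT-base

squeeze_latency_s = bert_base_latency_s / squeeze_speedup
mobilebert_latency_s = bert_base_latency_s / mobilebert_speedup
relative_speedup = squeeze_speedup / mobilebert_speedup

print(f"Squeeze BERT ~{squeeze_latency_s:.2f} s, MobileBERT ~{mobilebert_latency_s:.2f} s, "
      f"ratio ~{relative_speedup:.2f}x")
```

The MobileBERT estimate of about 0.57 seconds agrees with the roughly 0.6 seconds quoted in Section 1.1, and the two speedups together imply that Squeeze BERT is around 1.4 times faster than MobileBERT.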

Table 3: Comparison of neural networks on the test set of the GLUE benchmark. † denotes models trained by the authors of the present paper. Bells and whistles are: $A=$ adversarial training; $D=$ distillation of final layer; $E=$ distillation of encoder layers; $S=$ transfer learning across GLUE tasks (a.k.a. STILTs [41]); $W=$ per-layer warmup.

表 3: GLUE 基准测试集上神经网络的比较。表示由本文作者训练的模型。附加功能包括： $A=$ 对抗训练； $D=$ 最终层蒸馏； $E=$ 编码器层蒸馏； $S=$ 跨 GLUE 任务的迁移学习（也称为 STILTs [41]）； $W=$ 每层预热。

Columns: Model; Bells and whistles; GLUE accuracy (GLUE score, MNLI-m, MNLI-mm, RTE, WNLI); Efficiency (#MParams, GFLOPs, Latency, Speedup)
Results without bells and whistles
BERT-base†
BERT-base [9]
SqueezeBERT†

Results with bells and whistles
TinyBERT 4/312 [21] DE	82.5 81.8 87.792.6 43.3 79.9 62.9 65.1 14.5 1.2
ELECTRA-Small++ [12] AS 81.6 81.5 81.0 79.8	118 14x 88.3 91.1 55.6 84.6 84.9 63.6 65.1 14.0 2.62 248
PKD 6/768 [18] DE	6.8x 89.092.0
Turc 6/768 [20] D Theseus 6/768 [22] DESW	82.5 65.1 67.0 011.3 814 2.1x 82.279.78 89.4 91.8 84.3 65.3 65.1 67.5 11.3 814 2.1x

Due to the dearth of efficient neural network results on GLUE without bells and whistles, we also provide a comparison in Table 2 with the ALBERT-base network. ALBERT-base is a version of BERT-base that uses the same weights across multiple attention layers, and it has a smaller encoder than BERT. Due to these design choices, ALBERT-base has $9\times$ fewer parameters than BERT-base. However, ALBERT-base and BERT-base have the same number of FLOPs, and we observe in our measurements in Table 2 that ALBERT-base does not offer a speedup over BERT-base on a smartphone. Further, on the two GLUE tasks where the ALBERT authors reported the accuracy of ALBERT-base, MobileBERT and Squeeze BERT both achieve higher accuracy than ALBERT-base.

由于在GLUE上没有花哨的高效神经网络结果，我们还在表2中提供了与ALBERT-base网络的比较。ALBERT-base是BERT-base的一个版本，它在多个注意力层中使用相同的权重，并且其编码器比BERT更小。由于这些设计选择，ALBERT-base的参数比BERT-base少 $9\mathrm{x}$ 。然而，ALBERT-base和BERT-base的FLOPs数量相同，我们在表2中的测量中观察到，ALBERT-base在智能手机上并没有比BERT-base提供加速。此外，在ALBERT作者报告ALBERT-base准确率的两个GLUE任务中，MobileBERT和Squeeze BERT的准确率都优于ALBERT-base。

5.2 Results with bells and whistles

5.2 带附加功能的结果

Now, we turn our attention to comparing Squeeze BERT to other models, all trained with bells and whistles. In the lower portion of Table 3, we first observe that when trained with bells and whistles, MobileBERT matches or outperforms the accuracy of the other efficient models (except Squeeze BERT) on 8 of the 9 GLUE tasks. Further, on 4 of the 9 tasks Squeeze BERT is more accurate than MobileBERT; on 4 of the 9 tasks MobileBERT is more accurate than Squeeze BERT; and on 1 task (WNLI) all models predict the most frequently occurring category. Also, Squeeze BERT achieves an average score across all GLUE tasks that is within 0.4 percentage points of MobileBERT. Given the speedup of Squeeze BERT over MobileBERT, we think it is reasonable to say that Squeeze BERT and MobileBERT each offer a compelling speed-accuracy tradeoff for NLP inference on mobile devices.

现在，我们将注意力转向比较 Squeeze BERT 与其他模型，所有模型都经过 bells-and-whistles 训练。在表 3 的下半部分，我们首先观察到，当经过 bells-and-whistles 训练后，MobileBERT 在 9 个 GLUE 任务中的 8 个任务上匹配或超越了其他高效模型（除了 Squeeze BERT）的准确率。此外，在 9 个任务中的 4 个任务上，Squeeze BERT 的准确率超过了 MobileBERT；在 9 个任务中的 4 个任务上，MobileBERT 的准确率超过了 Squeeze BERT；而在 1 个任务（WNLI）上，所有模型都预测了最常出现的类别。此外，Squeeze BERT 在所有 GLUE 任务上的平均得分与 MobileBERT 相差不到 0.4 个百分点。考虑到 Squeeze BERT 相对于 MobileBERT 的速度提升，我们认为可以说 Squeeze BERT 和 MobileBERT 各自为移动设备上的 NLP 推理提供了一个引人注目的速度-准确率权衡。

6 Related Work

6 相关工作

Quantization and Pruning. Quantization is a family of techniques which aims to reduce the number of bits required to store each parameter and/or activation in a neural network, while at the same time maintaining the accuracy of that network. This has been successfully applied to NLP in such works as [46, 47]. Pruning aims to directly eliminate certain parameters from the network while also maintaining accuracy, thereby reducing the storage and potentially computational cost of that network; for an application of this to NLP, please see [48]. These methods could be applied to Squeeze BERT to yield further efficiency improvements, but quantization and pruning are not a focus of this paper.

量化和剪枝。量化是一系列旨在减少神经网络中每个参数和/或激活所需的存储位数的技术，同时保持网络的准确性。这在[46, 47]等工作中已成功应用于自然语言处理（NLP）。剪枝旨在直接从网络中消除某些参数，同时保持准确性，从而减少网络的存储和潜在计算成本；有关其在NLP中的应用，请参见[48]。这些方法可以应用于Squeeze BERT以进一步提高效率，但量化和剪枝不是本文的重点。
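To make the two ideas above concrete, the following is a minimal sketch of symmetric 8-bit quantization and magnitude-based pruning applied to a small list of weights. It is an illustration only and is not the method of [46, 47] or [48]; the function names (`quantize_int8`, `dequantize`, `magnitude_prune`), the per-tensor scale, and the example weights are all assumptions made for this sketch.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_drop = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:n_drop])
    return [0.0 if i in drop else v for i, v in enumerate(weights)]


# Hypothetical example weights, chosen only for illustration.
weights = [0.42, -1.27, 0.03, 0.88, -0.05, 0.6]

q, scale = quantize_int8(weights)        # 8 bits per weight instead of 32
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

pruned = magnitude_prune(weights, sparsity=0.5)  # half of the weights become zero
```

In this sketch the rounding error of each restored weight is bounded by half the quantization step (`scale / 2`), which is the basic reason quantization can preserve accuracy when the step is small relative to the weights.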

Convolutions in self-attention networks for language-generation tasks. In this paper, our experiments focus on natural language understanding (NLU) tasks such as sentence classification. However, another widely-studied area is natural language generation (NLG), which includes the tasks of machine translation (e.g., English-to-German) and language modeling (e.g., automated sentence completion). While we are not aware of work that adopts convolutions in self-attention networks for NLU, we are aware of such work in NLG. For instance, the Evolved Transformer and Lite Transformer architectures contain self-attention modules and convolutions in separate portions of the network [49, 50]. Additionally, LightConv shows that well-designed convolutional networks without self-attention produce comparable results to self-attention networks on certain NLG tasks [51]. Also, Wang et al. sparsify the self-attention matrix multiplication using a pattern of nonzeros that is inspired by dilated convolutions [52]. Finally, while not an attention network, Kim applied convolutional networks to NLU several years before the development of multi-head self-attention [53].

自注意力网络中的卷积用于语言生成任务。本文的实验主要集中在自然语言理解（NLU）任务上，例如句子分类。然而，另一个广泛研究的领域是自然语言生成（NLG），包括机器翻译（例如，英语到德语）和语言建模（例如，自动句子补全）任务。虽然我们不知道有工作在NLU中采用自注意力网络中的卷积，但我们知道在NLG中有这样的工作。例如，Evolved Transformer和Lite Transformer架构在网络的不同部分包含自注意力模块和卷积 [49, 50]。此外，LightConv表明，在某些NLG任务上，设计良好的无自注意力卷积网络可以产生与自注意力网络相当的结果 [51]。此外，Wang等人使用受膨胀卷积启发的非零模式稀疏化自注意力矩阵乘法 [52]。最后，虽然不是注意力网络，但Kim在多头自注意力发展之前几年就将卷积网络应用于NLU [53]。

7 Conclusions & Future Work

7 结论与未来工作

In this paper, we have studied how grouped convolutions, a popular technique in the design of efficient CV neural networks, can be applied to NLP. First, we showed that the position-wise fully-connected layers of self-attention networks can be implemented with mathematically-equivalent 1D convolutions. Further, we proposed Squeeze BERT, an efficient NLP model which implements most of the layers of its self-attention encoder with 1D grouped convolutions. This model yields an appreciable ${>}4\mathbf{x}$ latency decrease over BERT-base when benchmarked on a Pixel 3 phone. We also successfully applied distillation to improve our approach’s accuracy to a level that is competitive with a distillation-trained MobileBERT and with the original version of BERT-base.

在本文中，我们研究了分组卷积（grouped convolutions）这一在高效计算机视觉（CV）神经网络设计中流行的技术如何应用于自然语言处理（NLP）。首先，我们展示了自注意力网络中的逐位置全连接层可以通过数学上等效的一维卷积来实现。此外，我们提出了Squeeze BERT，这是一种高效的NLP模型，它使用一维分组卷积实现了自注意力编码器的大部分层。在Pixel 3手机上进行的基准测试中，该模型相比BERT-base显著降低了 ${>}4\mathbf{x}$ 的延迟。我们还成功应用了蒸馏技术，将我们方法的准确率提升到了与经过蒸馏训练的MobileBERT以及原始版本的BERT-base相竞争的水平。
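The equivalence summarized above can be checked numerically. The sketch below, written in plain Python under assumed toy sizes (sequence length 5, 8 channels, 4 groups), compares a position-wise fully-connected layer with a 1D convolution of kernel size 1, and then shows how grouping reduces the weight count. The helper names (`pfc`, `conv1d`, `transpose`, `count_weights`) are hypothetical and this is not the released Squeeze BERT code.

```python
import random


def pfc(x, w, b):
    """Position-wise fully-connected layer.
    x: [seq_len][c_in], w: [c_out][c_in], b: [c_out] -> [seq_len][c_out]"""
    return [[b[o] + sum(w[o][i] * tok[i] for i in range(len(tok)))
             for o in range(len(w))] for tok in x]


def conv1d(x, w, b, groups=1):
    """1D convolution on channels-first input (no padding, stride 1).
    x: [c_in][seq_len], w: [c_out][c_in // groups][k], b: [c_out]"""
    c_in, seq_len = len(x), len(x[0])
    c_out, k = len(w), len(w[0][0])
    in_per_g, out_per_g = c_in // groups, c_out // groups
    out = []
    for o in range(c_out):
        first = (o // out_per_g) * in_per_g  # first input channel of this group
        row = []
        for p in range(seq_len - k + 1):
            acc = b[o]
            for i in range(in_per_g):
                for t in range(k):
                    acc += w[o][i][t] * x[first + i][p + t]
            row.append(acc)
        out.append(row)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def count_weights(w):
    return sum(len(taps) for filt in w for taps in filt)


random.seed(0)
SEQ_LEN, C, GROUPS = 5, 8, 4  # toy sizes, assumed for illustration
x = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(SEQ_LEN)]
w = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C)]
b = [random.uniform(-1, 1) for _ in range(C)]

# Same weights, reshaped into kernel-size-1 convolution filters.
w_conv = [[[w[o][i]] for i in range(C)] for o in range(C)]
y_fc = pfc(x, w, b)
y_conv = transpose(conv1d(transpose(x), w_conv, b))
max_diff = max(abs(p - q) for r1, r2 in zip(y_fc, y_conv) for p, q in zip(r1, r2))

# Grouped convolution: each output channel only sees C // GROUPS input channels.
w_grouped = [[[random.uniform(-1, 1)] for _ in range(C // GROUPS)] for _ in range(C)]
y_grouped = conv1d(transpose(x), w_grouped, b, groups=GROUPS)
```

With these sizes, the dense layer holds 64 weights while the grouped layer holds 16, a reduction by a factor equal to the number of groups; the output shape is unchanged.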

We now discuss some possibilities for future work in the directions outlined in this paper. Several techniques used in CV that we have not covered here could also be applied to NLP. One very promising direction is downsampling strategies, which decrease the sequence length of the activations in the self-attention network as the layers progress. Extensions of this idea, such as U-Nets [54], as well as modifying channel sizes (hidden size) instead of, or in addition to, sequence length would also be promising directions. Along the same lines, applying ideas such as BiFPNs [55], striding, and dilation [56] may also yield interesting results.

我们现在讨论一些未来工作的可能性，这些方向在本文中已经概述。计算机视觉（CV）中有几种技术可以应用于自然语言处理（NLP），但本文并未涵盖。一个非常有前景的方向是下采样策略，随着层数的增加，这些策略会减少自注意力网络中的激活序列长度。这一想法的扩展，例如 U-Nets [54]，以及修改通道大小（隐藏大小）而不是仅仅修改序列长度，也将是有前景的方向。在这一技术路径上，应用诸如 BiFPNs [55]、步幅和扩张 [56] 等想法也可能产生有趣的结果。
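As a rough illustration of the downsampling direction mentioned above, the sketch below average-pools adjacent tokens so that the sequence length halves at each stage. This is only one possible downsampling choice and is not something evaluated in this paper; the function names (`downsample`, `attention_score_count`) and the toy input are assumptions. The score count uses the fact that self-attention compares every position with every other position, so it grows with the square of the sequence length.

```python
def downsample(x, stride=2):
    """Average-pool non-overlapping windows of `stride` tokens along the sequence.
    x: [seq_len][hidden] -> [ceil(seq_len / stride)][hidden]"""
    out = []
    for start in range(0, len(x), stride):
        window = x[start:start + stride]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def attention_score_count(seq_len):
    """Number of pairwise attention scores per head for one sequence."""
    return seq_len * seq_len


# Toy input: 8 tokens with hidden size 2 (values chosen only for illustration).
seq = [[float(t), float(t) * 10] for t in range(8)]

lengths = [len(seq)]
x = seq
for _ in range(2):  # two downsampling stages
    x = downsample(x, stride=2)
    lengths.append(len(x))

costs = [attention_score_count(n) for n in lengths]
```

Here the sequence length goes from 8 to 4 to 2, and the number of attention scores per head drops from 64 to 16 to 4, which is why shortening the sequence in later layers is attractive for latency.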

The addition of all of these potential techniques opens up a significantly broader search-space of neural architecture designs for NLP. This motivates the application of automated neural architecture search (NAS) approaches such as those described in [57, 58].

所有这些潜在技术的加入为自然语言处理（NLP）的神经架构设计开辟了更广阔的搜索空间。这促使了自动化神经架构搜索（NAS）方法的应用，例如 [57, 58] 中描述的那些方法。

Broader Impact

更广泛的影响

Potential benefits of this work

本工作的潜在优势

We hope the techniques used in this paper will allow more efficient and practical deployment of self-attention based networks, particularly allowing more widespread use of these networks on mobile devices. Possible use cases include email sorting, chat analysis, spam detection, or hate-speech filtering. Facebook has begun using a self-attention based network to automatically detect and remove hate speech on their platform [59, 60]. Currently, companies usually run these models on the server side, but some may be reluctant to deploy them on a mass scale due to computation costs. Mobile inference would not require expensive server infrastructure and would allow use in cases where privacy and security may be a concern. Running on edge devices may also allow more real-time or offline use cases such as grammar checking or sentence completion.

我们希望本文中使用的技术能够更高效、更实际地部署基于自注意力机制的网络，特别是在移动设备上更广泛地使用这些网络。潜在的应用场景包括邮件分类、聊天分析、垃圾邮件检测或仇恨言论过滤。Facebook 已经开始使用基于自注意力机制的网络来自动检测并删除其平台上的仇恨言论 [59, 60]。目前，公司通常会在服务器端运行这些模型，但由于计算成本的原因，一些公司可能对大规模部署持保留态度。在移动设备上进行推理不需要昂贵的服务器基础设施，并且在隐私和安全可能成为问题的情况下也能使用。在边缘设备上运行可能支持更多实时或离线的应用场景，例如语法检查或句子补全。

Potential malicious uses of this work

本工作的潜在恶意用途

Given that we are releasing our model and training and inference code as free software, anyone can train our model on any dataset that they like. While we hope most practitioners will apply our work for altruistic or at least well-intentioned purposes, some may apply our work for ethically questionable or purely self-serving applications. For instance, low-cost mobile inference could allow "smart" key-loggers and eavesdropping viruses to be deployed, and for these devices to better avoid detection by computing locally and only uploading important information. And, while we hope that social networks and other text-centric portals will draw on our work to embed fair and just models into their mobile applications for content moderation, these companies could just as easily draw on our work to train and deploy models that censor content of political enemies or amplify only certain types of messages and voices.

鉴于我们正在将模型、训练和推理代码作为自由软件发布，任何人都可以在他们喜欢的任何数据集上训练我们的模型。虽然我们希望大多数从业者会将我们的工作应用于利他或至少是善意的目的，但有些人可能会将我们的工作应用于道德上有问题或纯粹自私的应用。例如，低成本的移动推理可能允许部署“智能”键盘记录器和窃听病毒，并使这些设备能够通过本地计算并仅上传重要信息来更好地避免检测。此外，虽然我们希望社交媒体和其他以文本为中心的门户网站能够利用我们的工作将公平和公正的模型嵌入到他们的移动应用程序中以进行内容审核，但这些公司同样可以轻松地利用我们的工作来训练和部署模型，以审查政治对手的内容或仅放大某些类型的消息和声音。

Potential effects of unintended bias in the neural network and its training data

神经网络及其训练数据中意外偏差的潜在影响

Models for NLP commonly suffer from biases regarding race and gender [61]. In our work, we have used standard datasets, and we are not aware of any experimental factors that would increase or decrease Squeeze BERT’s propensity for bias, as compared to the approaches used in similar self-attention research such as BERT and ELECTRA. The gender-related biases of our pretraining corpora (Wikipedia and BooksCorpus) are investigated by Tan et al. [61], and the gender biases of BERT and GPT models finetuned on the GLUE tasks are investigated by Babaeianjelodar et al. [62].

自然语言处理 (NLP) 模型通常存在种族和性别偏见 [61]。在我们的工作中，我们使用了标准数据集，并且与 BERT 和 ELECTRA 等类似自注意力研究中使用的方法相比，我们没有发现任何会增加或减少 Squeeze BERT 偏见倾向的实验因素。Tan 等人 [61] 研究了我们的预训练语料库（维基百科和书籍语料库）中的性别相关偏见，而 Babaeianjelodar 等人 [62] 研究了在 GLUE 任务上微调的 BERT 和 GPT 模型的性别偏见。

Aside from gender bias, we are not aware of studies that investigate other patterns of bias in the datasets that we used in our work. However, Sheng et al. investigate biases regarding race, gender, and sexual orientation in a GPT-2 self-attention model trained on a language-modeling (text-generation) task [63]. Further, in the paper introducing the GPT-3 model, a study similar to that of Sheng et al. is performed [64].

除了性别偏见，我们尚未发现针对我们工作中使用的数据集中的其他偏见模式的研究。然而，Sheng 等人研究了在语言建模（文本生成）任务上训练的 GPT-2 自注意力模型中关于种族、性别和性取向的偏见 [63]。此外，在介绍 GPT-3 模型的论文中，进行了与 Sheng 等人类似的研究 [64]。

According to the taxonomy described in [65], the harms caused by biased NLP technology include:

根据[65]中描述的分类法，由有偏见的自然语言处理（NLP）技术造成的危害包括：

• Allocational harms: These arise when an "automated system allocates resources (e.g., credit) or opportunities (e.g., jobs) unfairly to different social groups."
• Representational harms: These arise when "a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether."

• 分配性伤害：当“自动化系统不公平地将资源（例如信贷）或机会（例如工作）分配给不同的社会群体”时，就会产生这种伤害。
• 代表性伤害：当“系统（例如搜索引擎）以不如其他社会群体的方式呈现某些社会群体，贬低他们，或完全未能承认他们的存在”时，就会产生这种伤害。

We direct the interested reader to work by Blodgett et al. [65] and Sun et al. [66] for more details and suggestions on how to minimize the bias in datasets and the neural networks that learn from them.

我们建议感兴趣的读者参考 Blodgett 等人的研究 [65] 和 Sun 等人的研究 [66]，以获取更多关于如何最小化数据集及其训练的神经网络中的偏差的详细信息和建议。

Acknowledgments and Disclosure of Funding

致谢与资金披露

K. Keutzer’s research is supported by Alibaba, Amazon, Google, Facebook, Intel, and Samsung.

K. Keutzer 的研究得到了阿里巴巴、Amazon、Google、Facebook、Intel 和 Samsung 的支持。

References

参考文献

[63] E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “The woman worked as a babysitter: On biase

原文地址：https://arxiv.org/pdf/2006.11316v1
