[论文翻译]一张图像等价于16x16个词:大规模图像识别中的Transformer应用


原文地址:https://arxiv.org/pdf/2010.11929v2


AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

一张图像等价于16x16个词:大规模图像识别中的Transformer应用

Alexey Dosovitskiy∗,†, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗, Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗,†

∗equal technical contribution, †equal advising
Google Research, Brain Team
{adosovitskiy, neilhoulsby}@google.com

∗同等技术贡献,†共同指导
Google Research, Brain Team {adosovitskiy, neilhoulsby}@google.com

ABSTRACT

摘要

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

虽然Transformer架构已成为自然语言处理任务的事实标准,但其在计算机视觉领域的应用仍然有限。在视觉领域,注意力机制要么与卷积网络结合使用,要么用于替换卷积网络的某些组件,同时保持其整体结构不变。我们证明这种对CNN的依赖并非必要,直接应用于图像块序列的纯Transformer在图像分类任务上也能表现出色。当在大规模数据上进行预训练并迁移至多个中型或小型图像识别基准测试(如ImageNet、CIFAR-100、VTAB等)时,视觉Transformer(ViT)相比最先进的卷积网络能取得优异成果,同时训练所需的计算资源显著减少。

1 INTRODUCTION

1 引言

Self-attention-based architectures, in particular Transformers (Vaswani et al., 2017), have become the model of choice in natural language processing (NLP). The dominant approach is to pre-train on a large text corpus and then fine-tune on a smaller task-specific dataset (Devlin et al., 2019). Thanks to Transformers’ computational efficiency and scalability, it has become possible to train models of unprecedented size, with over 100B parameters (Brown et al., 2020; Lepikhin et al., 2020). With the models and datasets growing, there is still no sign of saturating performance.

基于自注意力 (self-attention) 的架构,尤其是 Transformer (Vaswani et al., 2017),已成为自然语言处理 (NLP) 领域的首选模型。主流方法是在大型文本语料库上进行预训练,然后在较小的任务特定数据集上进行微调 (Devlin et al., 2019)。得益于 Transformer 的计算效率和可扩展性,训练参数量超过 1000 亿的超大规模模型已成为可能 (Brown et al., 2020; Lepikhin et al., 2020)。随着模型和数据集的不断增长,其性能仍未出现饱和迹象。

In computer vision, however, convolutional architectures remain dominant (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016). Inspired by NLP successes, multiple works try combining CNN-like architectures with self-attention (Wang et al., 2018; Carion et al., 2020), some replacing the convolutions entirely (Ramachandran et al., 2019; Wang et al., 2020a). The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators due to the use of specialized attention patterns. Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020).

然而在计算机视觉领域,卷积架构仍占据主导地位 (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016)。受自然语言处理领域成功的启发,多项研究尝试将类CNN架构与自注意力机制相结合 (Wang et al., 2018; Carion et al., 2020),有些研究则完全取代了卷积运算 (Ramachandran et al., 2019; Wang et al., 2020a)。后一类模型虽然在理论上高效,但由于采用特殊注意力模式,尚未在现代硬件加速器上实现有效扩展。因此在大规模图像识别任务中,经典的ResNet类架构仍保持最先进水平 (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020)。

Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications. To do so, we split an image into patches and provide the sequence of linear embeddings of these patches as an input to a Transformer. Image patches are treated the same way as tokens (words) in an NLP application. We train the model on image classification in supervised fashion.

受Transformer在自然语言处理(NLP)领域规模化成功的启发,我们尝试以最小修改将标准Transformer直接应用于图像。具体而言,我们将图像分割为若干图块(patch),并将这些图块的线性嵌入序列作为Transformer的输入。图像块的处理方式与NLP应用中的token(单词)完全一致。我们以监督学习方式训练该模型进行图像分类任务。

When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.

在中型数据集(如ImageNet)上训练时,若未采用强正则化措施,这些模型的准确率会略低于同等规模的ResNets,差距约为几个百分点。这种看似不尽人意的结果或许在意料之中:Transformer缺乏CNN固有的归纳偏置(如平移等变性和局部性),因此在数据量不足时泛化能力较差。

However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias. Our Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and transferred to tasks with fewer datapoints. When pre-trained on the public ImageNet-21k dataset or the in-house JFT-300M dataset, ViT approaches or beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of $88.55\%$ on ImageNet, $90.72\%$ on ImageNet-ReaL, $94.55\%$ on CIFAR-100, and $77.63\%$ on the VTAB suite of 19 tasks.

然而,当模型在更大规模的数据集(1400万至3亿张图像)上训练时,情况发生了变化。我们发现大规模训练可以超越归纳偏差。我们的视觉Transformer (ViT) 在经过充分规模的预训练并迁移到数据点较少的任务时,取得了优异的结果。当在公开的ImageNet-21k数据集或内部JFT-300M数据集上进行预训练时,ViT在多个图像识别基准测试中接近或超越了现有技术水平。具体而言,最佳模型在ImageNet上达到了88.55%的准确率,在ImageNet-ReaL上达到90.72%,在CIFAR-100上达到94.55%,在包含19个任务的VTAB套件上达到77.63%。

2 RELATED WORK

2 相关工作

Transformers were proposed by Vaswani et al. (2017) for machine translation, and have since become the state of the art method in many NLP tasks. Large Transformer-based models are often pre-trained on large corpora and then fine-tuned for the task at hand: BERT (Devlin et al., 2019) uses a denoising self-supervised pre-training task, while the GPT line of work uses language modeling as its pre-training task (Radford et al., 2018; 2019; Brown et al., 2020).

Transformer 由 Vaswani 等人 (2017) 提出用于机器翻译,此后成为众多自然语言处理任务的最先进方法。基于 Transformer 的大模型通常在大规模语料上进行预训练,然后针对具体任务进行微调:BERT (Devlin 等人, 2019) 采用去噪自监督预训练任务,而 GPT 系列模型则以语言建模作为预训练任务 (Radford 等人, 2018; 2019; Brown 等人, 2020)。

Naive application of self-attention to images would require that each pixel attends to every other pixel. With quadratic cost in the number of pixels, this does not scale to realistic input sizes. Thus, to apply Transformers in the context of image processing, several approximations have been tried in the past. Parmar et al. (2018) applied the self-attention only in local neighborhoods for each query pixel instead of globally. Such local multi-head dot-product self attention blocks can completely replace convolutions (Hu et al., 2019; Ramachandran et al., 2019; Zhao et al., 2020). In a different line of work, Sparse Transformers (Child et al., 2019) employ scalable approximations to global self-attention in order to be applicable to images. An alternative way to scale attention is to apply it in blocks of varying sizes (Weissenborn et al., 2019), in the extreme case only along individual axes (Ho et al., 2019; Wang et al., 2020a). Many of these specialized attention architectures demonstrate promising results on computer vision tasks, but require complex engineering to be implemented efficiently on hardware accelerators.

将自注意力机制 (self-attention) 直接应用于图像时,需要每个像素关注所有其他像素。由于计算成本与像素数量的平方成正比,这种方法难以扩展到实际输入尺寸。因此,在图像处理领域应用 Transformer 时,过去曾尝试过多种近似方法。Parmar 等人 (2018) 仅在每个查询像素的局部邻域内应用自注意力机制,而非全局范围。这种局部多头点积自注意力模块可以完全替代卷积操作 (Hu 等人, 2019; Ramachandran 等人, 2019; Zhao 等人, 2020)。另一项工作中,稀疏 Transformer (Child 等人, 2019) 采用可扩展的全局自注意力近似方法以适应图像处理需求。另一种扩展注意力机制的方法是将其应用于不同尺寸的块中 (Weissenborn 等人, 2019),极端情况下仅沿单个轴向应用 (Ho 等人, 2019; Wang 等人, 2020a)。这些专用注意力架构大多在计算机视觉任务中展现出优异效果,但需要复杂的工程实现才能在硬件加速器上高效运行。

Most related to ours is the model of Cordonnier et al. (2020), which extracts patches of size $2\times2$ from the input image and applies full self-attention on top. This model is very similar to ViT, but our work goes further to demonstrate that large scale pre-training makes vanilla transformers competitive with (or even better than) state-of-the-art CNNs. Moreover, Cordonnier et al. (2020) use a small patch size of $2\times2$ pixels, which makes the model applicable only to small-resolution images, while we handle medium-resolution images as well.

与我们工作最相关的是Cordonnier等人 (2020) 提出的模型,该模型从输入图像中提取 $2\times2$ 尺寸的图块并应用全自注意力机制。该模型与ViT非常相似,但我们的研究进一步证明大规模预训练能使原始Transformer模型达到 (甚至超越) 最先进CNN的性能。此外,Cordonnier等人 (2020) 使用了 $2\times2$ 像素的小尺寸图块,这使得该模型仅适用于低分辨率图像,而我们的方法还能处理中等分辨率图像。

There has also been a lot of interest in combining convolutional neural networks (CNNs) with forms of self-attention, e.g. by augmenting feature maps for image classification (Bello et al., 2019) or by further processing the output of a CNN using self-attention, e.g. for object detection (Hu et al., 2018; Carion et al., 2020), video processing (Wang et al., 2018; Sun et al., 2019), image classification (Wu et al., 2020), unsupervised object discovery (Locatello et al., 2020), or unified text-vision tasks (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019).

将卷积神经网络 (CNNs) 与自注意力机制相结合的研究也引起了广泛关注,例如通过增强图像分类的特征图 (Bello et al., 2019) ,或利用自注意力进一步处理 CNN 的输出,应用于目标检测 (Hu et al., 2018; Carion et al., 2020) 、视频处理 (Wang et al., 2018; Sun et al., 2019) 、图像分类 (Wu et al., 2020) 、无监督目标发现 (Locatello et al., 2020) 或统一文本-视觉任务 (Chen et al., 2020c; Lu et al., 2019; Li et al., 2019) 。

Another recent related model is image GPT (iGPT) (Chen et al., 2020a), which applies Transformers to image pixels after reducing image resolution and color space. The model is trained in an unsupervised fashion as a generative model, and the resulting representation can then be fine-tuned or probed linearly for classification performance, achieving a maximal accuracy of $72\%$ on ImageNet.

另一个近期相关模型是图像GPT (iGPT) (Chen等人, 2020a),它在降低图像分辨率和颜色空间后将Transformer应用于图像像素。该模型以无监督方式作为生成式模型进行训练,所得表征可通过微调或线性探测来评估分类性能,在ImageNet上实现了72%的最高准确率。

Our work adds to the increasing collection of papers that explore image recognition at larger scales than the standard ImageNet dataset. The use of additional data sources allows achieving state-of-the-art results on standard benchmarks (Mahajan et al., 2018; Touvron et al., 2019; Xie et al., 2020). Moreover, Sun et al. (2017) study how CNN performance scales with dataset size, and Kolesnikov et al. (2020); Djolonga et al. (2020) perform an empirical exploration of CNN transfer learning from large scale datasets such as ImageNet-21k and JFT-300M. We focus on these two latter datasets as well, but train Transformers instead of ResNet-based models used in prior works.

我们的工作为探索超越标准ImageNet数据集规模的大规模图像识别研究增添了新的文献。利用额外数据源可在标准基准测试中实现最先进的结果 (Mahajan等人, 2018; Touvron等人, 2019; Xie等人, 2020) 。此外,Sun等人 (2017) 研究了CNN性能如何随数据集规模扩展,而Kolesnikov等人 (2020) 和Djolonga等人 (2020) 则对从ImageNet-21k和JFT-300M等大规模数据集进行CNN迁移学习展开了实证探索。我们同样聚焦于这两个数据集,但采用Transformer架构进行训练,而非先前研究中基于ResNet的模型。


Figure 1: Model overview. We split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence. The illustration of the Transformer encoder was inspired by Vaswani et al. (2017).

图 1: 模型概览。我们将图像分割为固定大小的图像块 (patch),对每个块进行线性嵌入,添加位置编码,并将得到的向量序列输入标准 Transformer 编码器。为了执行分类任务,我们采用标准方法:在序列中添加一个可学习的"分类标记" (classification token)。Transformer 编码器的示意图灵感来自 Vaswani 等人 (2017)。

3 METHOD

3 方法

In model design we follow the original Transformer (Vaswani et al., 2017) as closely as possible. An advantage of this intentionally simple setup is that scalable NLP Transformer architectures – and their efficient implementations – can be used almost out of the box.

在模型设计上,我们尽可能遵循原版Transformer (Vaswani et al., 2017) 。这种刻意保持简洁的设置有个优势:可扩展的NLP Transformer架构及其高效实现几乎可以直接使用。

3.1 VISION TRANSFORMER (VIT)

3.1 VISION TRANSFORMER (VIT)

An overview of the model is depicted in Figure 1. The standard Transformer receives as input a 1D sequence of token embeddings. To handle 2D images, we reshape the image $\mathbf{x}\in\mathbb{R}^{H\times W\times C}$ into a sequence of flattened 2D patches $\mathbf{x}_{p}\in\mathbb{R}^{N\times(P^{2}\cdot C)}$, where $(H,W)$ is the resolution of the original image, $C$ is the number of channels, $(P,P)$ is the resolution of each image patch, and $N=HW/P^{2}$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses constant latent vector size $D$ through all of its layers, so we flatten the patches and map to $D$ dimensions with a trainable linear projection (Eq. 1). We refer to the output of this projection as the patch embeddings.

模型概览如图 1 所示。标准 Transformer 接收一维 token 嵌入序列作为输入。为处理二维图像,我们将图像 $\mathbf{x}\in\mathbb{R}^{H\times W\times C}$ 重塑为扁平化的二维图像块序列 $\mathbf{x}_{p}\in\mathbb{R}^{N\times(P^{2}\cdot C)}$,其中 $(H,W)$ 是原始图像分辨率,$C$ 是通道数,$(P,P)$ 是每个图像块的分辨率,$N=HW/P^{2}$ 是生成的图像块数量,同时也作为 Transformer 的有效输入序列长度。Transformer 在所有层中使用恒定的潜在向量大小 $D$,因此我们将图像块展平并通过可训练的线性投影映射到 $D$ 维 (公式 1)。我们将此投影的输出称为图像块嵌入。
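
The reshaping of $\mathbf{x}\in\mathbb{R}^{H\times W\times C}$ into $N$ flattened patches can be written in a few lines of NumPy. This is an illustrative sketch rather than code from the paper, assuming $H$ and $W$ are divisible by $P$:

```python
import numpy as np

def image_to_patches(x: np.ndarray, P: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches, with N = H*W / P**2."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by the patch size P"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    x = x.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, P * P * C)

x = np.random.rand(224, 224, 3)         # a dummy 224x224 RGB image
patches = image_to_patches(x, P=16)     # shape (196, 768): N = 196 patches of dimension P^2*C = 768
```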

Similar to BERT’s [class] token, we prepend a learnable embedding to the sequence of embedded patches $(\mathbf{z}_{0}^{0}=\mathbf{x}_{\mathrm{class}})$, whose state at the output of the Transformer encoder $(\mathbf{z}_{L}^{0})$ serves as the image representation y (Eq. 4). Both during pre-training and fine-tuning, a classification head is attached to $\mathbf{z}_{L}^{0}$. The classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time.

类似于BERT的[class] token,我们在嵌入补丁序列前添加一个可学习的嵌入 $(\mathbf{z}_{0}^{0}=\mathbf{x}_{\mathrm{class}})$,其在Transformer编码器输出端的状态 $(\mathbf{z}_{L}^{0})$ 作为图像表示y(公式4)。在预训练和微调阶段,分类头均连接到 $\mathbf{z}_{L}^{0}$。该分类头在预训练时由带单隐藏层的MLP实现,微调时则由单个线性层实现。

Position embeddings are added to the patch embeddings to retain positional information. We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4). The resulting sequence of embedding vectors serves as input to the encoder.

位置嵌入 (position embeddings) 会添加到块嵌入 (patch embeddings) 中以保留位置信息。我们使用标准的可学习一维位置嵌入,因为我们没有观察到使用更先进的二维感知位置嵌入能带来显著的性能提升 (附录 D.4)。生成的嵌入向量序列将作为编码器的输入。

The Transformer encoder (Vaswani et al., 2017) consists of alternating layers of multiheaded self-attention (MSA, see Appendix A) and MLP blocks (Eq. 2, 3). Layernorm (LN) is applied before every block, and residual connections after every block (Wang et al., 2019; Baevski & Auli, 2019).

Transformer编码器 (Vaswani et al., 2017) 由多头自注意力机制 (MSA, 见附录A) 和MLP模块 (公式2、3) 的交替层组成。每个模块前应用层归一化 (LN),每个模块后接残差连接 (Wang et al., 2019; Baevski & Auli, 2019)。

The MLP contains two layers with a GELU non-linearity.

MLP包含两层,带有GELU非线性激活函数。

$$
\begin{array}{l l r}
\mathbf{z}_{0} = [\mathbf{x}_{\mathrm{class}};\, \mathbf{x}_{p}^{1}\mathbf{E};\, \mathbf{x}_{p}^{2}\mathbf{E};\, \cdots;\, \mathbf{x}_{p}^{N}\mathbf{E}] + \mathbf{E}_{pos}, & \mathbf{E}\in\mathbb{R}^{(P^{2}\cdot C)\times D},\ \mathbf{E}_{pos}\in\mathbb{R}^{(N+1)\times D} & (1)\\
\mathbf{z}_{\ell}^{\prime} = \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, & \ell = 1\ldots L & (2)\\
\mathbf{z}_{\ell} = \mathrm{MLP}(\mathrm{LN}(\mathbf{z}_{\ell}^{\prime})) + \mathbf{z}_{\ell}^{\prime}, & \ell = 1\ldots L & (3)\\
\mathbf{y} = \mathrm{LN}(\mathbf{z}_{L}^{0}) & & (4)
\end{array}
$$
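
To make the wiring of Eq. 1-4 concrete, the following NumPy sketch traces the encoder forward pass: patch projection plus class token and position embeddings, then pre-norm MSA and MLP blocks with residual connections. The `msa` and `mlp` callables stand in for multihead self-attention (Appendix A) and the two-layer GELU MLP; all names and shapes are illustrative, not the authors' implementation:

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Layernorm over the feature dimension."""
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def gelu(x):
    """tanh approximation of the GELU non-linearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def make_mlp(W1, b1, W2, b2):
    """The two-layer MLP with GELU used inside each encoder block."""
    return lambda z: gelu(z @ W1 + b1) @ W2 + b2

def vit_encode(x_p, E, E_pos, x_class, blocks):
    """x_p: (N, P^2*C) flattened patches; E: (P^2*C, D); E_pos: (N+1, D); x_class: (D,);
    blocks: list of (msa, mlp) callables, each mapping (N+1, D) -> (N+1, D)."""
    z = np.concatenate([x_class[None, :], x_p @ E], axis=0) + E_pos   # Eq. 1
    for msa, mlp in blocks:
        z = msa(layer_norm(z)) + z                                     # Eq. 2: pre-LN MSA + residual
        z = mlp(layer_norm(z)) + z                                     # Eq. 3: pre-LN MLP + residual
    return layer_norm(z[0])                                            # Eq. 4: image representation y
```

A classification head (an MLP during pre-training, a single linear layer during fine-tuning) would then be applied to the returned representation, as described above.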

Inductive bias. We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global. The two-dimensional neighborhood structure is used very sparingly: in the beginning of the model by cutting the image into patches and at fine-tuning time for adjusting the position embeddings for images of different resolution (as described below). Other than that, the position embeddings at initialization time carry no information about the 2D positions of the patches and all spatial relations between the patches have to be learned from scratch.

归纳偏置 (Inductive bias)。我们注意到 Vision Transformer 相比 CNN 具有更少的图像特定归纳偏置。在 CNN 中,局部性、二维邻域结构和平移等变性被固化在模型的每一层中。而在 ViT 中,只有 MLP 层具有局部性和平移等变性,而自注意力层是全局的。二维邻域结构的使用非常有限:仅在模型开始时将图像切割成块 (patch),以及在微调时调整不同分辨率图像的位置嵌入 (position embeddings)(如下所述)。除此之外,初始化时的位置嵌入不包含任何关于图像块二维位置的信息,所有块之间的空间关系都必须从头学习。

Hybrid Architecture. As an alternative to raw image patches, the input sequence can be formed from feature maps of a CNN (LeCun et al., 1989). In this hybrid model, the patch embedding projection $\mathbf{E}$ (Eq. 1) is applied to patches extracted from a CNN feature map. As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply flattening the spatial dimensions of the feature map and projecting to the Transformer dimension. The classification input embedding and position embeddings are added as described above.

混合架构 (Hybrid Architecture)。作为原始图像块的替代方案,输入序列可由CNN (LeCun et al., 1989) 的特征图构成。该混合模型中,块嵌入投影 $\mathbf{E}$ (式1) 作用于从CNN特征图提取的块。特殊情况下,这些块可采用1x1空间尺寸,此时通过简单展平特征图空间维度并投影至Transformer维度即可获得输入序列。分类输入嵌入与位置嵌入按前述方式添加。
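
A minimal sketch of the 1x1-patch special case of the hybrid, assuming a hypothetical (h, w, c) CNN feature map: the spatial dimensions are flattened into h·w tokens and projected to the Transformer width $D$; the CNN itself is not shown.

```python
import numpy as np

def feature_map_to_tokens(fmap: np.ndarray, E: np.ndarray) -> np.ndarray:
    """1x1-"pixel" patches: flatten an (h, w, c) CNN feature map into h*w tokens and
    project each to the Transformer width D with the patch-embedding matrix E of shape (c, D)."""
    h, w, c = fmap.shape
    return fmap.reshape(h * w, c) @ E

fmap = np.random.rand(14, 14, 1024)      # assumed feature-map shape, for illustration only
E = np.random.rand(1024, 768)            # trainable projection in the real model
tokens = feature_map_to_tokens(fmap, E)  # (196, 768); class token and position embeddings are added as before
```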

3.2 FINE-TUNING AND HIGHER RESOLUTION

3.2 微调与更高分辨率

Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized $D\times K$ feed forward layer, where $K$ is the number of downstream classes. It is often beneficial to fine-tune at higher resolution than pre-training (Touvron et al., 2019; Kolesnikov et al., 2020). When feeding images of higher resolution, we keep the patch size the same, which results in a larger effective sequence length. The Vision Transformer can handle arbitrary sequence lengths (up to memory constraints), however, the pre-trained position embeddings may no longer be meaningful. We therefore perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image. Note that this resolution adjustment and patch extraction are the only points at which an inductive bias about the 2D structure of the images is manually injected into the Vision Transformer.

通常,我们会在大型数据集上对ViT进行预训练,然后针对(较小的)下游任务进行微调。为此,我们会移除预训练的预测头,并附加一个零初始化的$D\times K$前馈层,其中$K$是下游类别数。在比预训练更高分辨率下进行微调通常效果更佳 (Touvron et al., 2019; Kolesnikov et al., 2020)。当输入更高分辨率的图像时,我们保持图像块(patch)大小不变,这会导致有效序列长度增加。Vision Transformer能够处理任意序列长度(受内存限制),但预训练的位置嵌入可能不再有意义。因此,我们会根据预训练位置嵌入在原始图像中的位置进行二维插值。请注意,这种分辨率调整和图像块提取是唯一需要手动将关于图像二维结构的归纳偏置注入Vision Transformer的环节。
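
The resolution adjustment of the position embeddings can be sketched as follows: keep the class-token embedding, reshape the remaining embeddings back onto their original 2D grid, and interpolate to the new grid. The bilinear interpolation via scipy and the grid sizes below are illustrative assumptions; this section does not specify the authors' exact interpolation scheme.

```python
import numpy as np
from scipy.ndimage import zoom

def resize_pos_embed(E_pos: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """E_pos: (1 + old_grid**2, D), i.e. one class-token embedding followed by a flattened
    old_grid x old_grid grid of patch position embeddings. Returns (1 + new_grid**2, D)
    by 2D (bilinear) interpolation of the grid part."""
    cls_tok, grid = E_pos[:1], E_pos[1:]
    D = grid.shape[1]
    grid = grid.reshape(old_grid, old_grid, D)
    scale = new_grid / old_grid
    grid = zoom(grid, (scale, scale, 1.0), order=1)   # bilinear interpolation over the 2D grid
    return np.concatenate([cls_tok, grid.reshape(new_grid * new_grid, D)], axis=0)

E_pos = np.random.rand(1 + 14 * 14, 768)     # pre-trained at 224/16 = 14x14 patches
E_pos_hi = resize_pos_embed(E_pos, 14, 32)   # fine-tuning at 512/16 = 32x32 patches -> (1025, 768)
```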

4 EXPERIMENTS

4 实验

We evaluate the representation learning capabilities of ResNet, Vision Transformer (ViT), and the hybrid. To understand the data requirements of each model, we pre-train on datasets of varying size and evaluate many benchmark tasks. When considering the computational cost of pre-training the model, ViT performs very favourably, attaining state of the art on most recognition benchmarks at a lower pre-training cost. Lastly, we perform a small experiment using self-supervision, and show that self-supervised ViT holds promise for the future.

我们评估了ResNet、Vision Transformer (ViT) 和混合架构的表征学习能力。为理解各模型的数据需求,我们在不同规模的数据集上进行预训练,并评估多项基准任务。当考虑模型预训练的计算成本时,ViT表现优异,以更低的预训练成本在多数识别基准上达到最优水平。最后,我们进行了小规模自监督实验,结果表明自监督ViT具有未来发展潜力。

4.1 SETUP

4.1 设置

Datasets. To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).

数据集。为探究模型的可扩展性,我们采用了以下数据集:包含1k类别和130万张图像的ILSVRC-2012 ImageNet数据集(下文简称ImageNet)、其超集ImageNet-21k(含21k类别和1400万张图像)(Deng et al., 2009),以及包含18k类别和3.03亿张高分辨率图像的JFT数据集(Sun et al., 2017)。我们按照Kolesnikov et al. (2020)的方法对预训练数据集进行下游任务测试集去重处理。将在这些数据集上训练的模型迁移至多个基准任务:原始验证标签和经过清理的ReaL标签(Beyer et al., 2020)的ImageNet、CIFAR-10/100(Krizhevsky, 2009)、Oxford-IIIT Pets(Parkhi et al., 2012)和Oxford Flowers-102(Nilsback & Zisserman, 2008)。这些数据集的预处理流程遵循Kolesnikov et al. (2020)的方案。

Table 1: Details of Vision Transformer model variants.

Model Layers Hidden size D MLP size Heads Params
ViT-Base 12 768 3072 12 86M
ViT-Large 24 1024 4096 16 307M
ViT-Huge 32 1280 5120 16 632M

表 1: Vision Transformer模型变体详情。

模型 层数 隐藏层大小 D MLP 大小 注意力头数 参数量
ViT-Base 12 768 3072 12 86M
ViT-Large 24 1024 4096 16 307M
ViT-Huge 32 1280 5120 16 632M

We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer to diverse tasks, using 1 000 training examples per task. The tasks are divided into three groups: Natural – tasks like the above, Pets, CIFAR, etc. Specialized – medical and satellite imagery, and Structured – tasks that require geometric understanding like localization.

我们还对包含19项任务的VTAB分类套件 (Zhai et al., 2019b) 进行了评估。VTAB通过每项任务使用1000个训练样本,评估低数据量向多样化任务的迁移能力。这些任务分为三组:自然类 (Natural) —— 包含上述宠物分类、CIFAR等类似任务;专业类 (Specialized) —— 涵盖医学和卫星图像;结构化类 (Structured) —— 需要几何理解能力的任务(如定位)。

Model Variants. We base ViT configurations on those used for BERT (Devlin et al., 2019), as summarized in Table 1. The “Base” and “Large” models are directly adopted from BERT and we add the larger “Huge” model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the “Large” variant with $16\times16$ input patch size. Note that the Transformer’s sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.

模型变体。我们基于BERT (Devlin et al., 2019) 的配置设计了ViT架构,如表1所示。"Base"和"Large"模型直接沿用BERT的设置,并新增了更大的"Huge"模型。后续我们将使用简写表示模型规模和输入图像块(patch)尺寸:例如ViT-L/16表示采用$16\times16$输入块尺寸的"Large"变体。需注意Transformer的序列长度与块尺寸平方成反比,因此较小块尺寸的模型计算开销更大。
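
The inverse-quadratic relation between patch size and sequence length (and hence attention cost) can be checked directly; the 224x224 input size below is only an example:

```python
# Effective sequence length N = H*W / P^2 (plus one class token); self-attention cost grows as N^2.
H = W = 224
for P in (32, 16, 14):
    N = (H // P) * (W // P)
    print(f"patch {P}x{P}: N = {N} tokens (+1 class token), attention matrix ~ {N * N:,} entries")
# patch 32x32: N = 49; patch 16x16: N = 196; patch 14x14: N = 256
```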

For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and used standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model “ResNet (BiT)”. For the hybrids, we feed the intermediate feature maps into ViT with patch size of one “pixel”. To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a $4\mathbf{x}$ longer sequence length, and a more expensive ViT model.

对于基线CNN,我们采用ResNet (He et al., 2016),但将批量归一化层 (Ioffe & Szegedy, 2015) 替换为组归一化 (Wu & He, 2018),并使用标准化卷积 (Qiao et al., 2019)。这些改进提升了迁移性能 (Kolesnikov et al., 2020),我们将改进后的模型称为"ResNet (BiT)"。在混合架构中,我们将中间特征图输入到补丁尺寸为1个"像素"的ViT中。为测试不同序列长度,我们:(i) 使用常规ResNet50第4阶段的输出,或(ii) 移除第4阶段,在阶段3放置相同数量的层(保持总层数不变),并取扩展后阶段3的输出。方案(ii)会产生$4\mathbf{x}$的序列长度增长,并构建计算成本更高的ViT模型。

Training & Fine-tuning. We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with $\beta_{1}=0.9$, $\beta_{2}=0.999$, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum, batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).

训练与微调。我们使用Adam (Kingma & Ba, 2015)训练所有模型(包括ResNets),参数设为$\beta_{1}=0.9$、$\beta_{2}=0.999$,批大小为4096,并采用0.1的高权重衰减(附录D.1显示,与常规做法不同,在我们的设置中Adam对ResNets的表现略优于SGD)。学习率采用线性预热与衰减策略,详见附录B.1。微调阶段对所有模型使用带动量的SGD,批大小为512(见附录B.1.1)。表2中的ImageNet结果采用更高分辨率微调:ViT-L/16为512,ViT-H/14为518,并应用Polyak & Juditsky (1992)提出的0.9999系数平均法(Ramachandran et al., 2019; Wang et al., 2020b)。

Metrics. We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to $\{-1,1\}^{K}$ target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.

指标。我们通过少样本或微调准确率报告下游数据集的结果。微调准确率反映了模型在相应数据集上微调后的性能表现。少样本准确率通过求解一个正则化最小二乘回归问题获得,该问题将(冻结的)训练图像子集表征映射到 $\{-1,1\}^{K}$ 目标向量。这种表述方式使我们能以闭式解的形式获得精确解。虽然我们主要关注微调性能,但在微调成本过高时,也会使用线性少样本准确率进行快速即时评估。
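
A minimal sketch of this few-shot evaluation, assuming plain ridge regression from frozen features to $\{-1,1\}^{K}$ one-vs-rest targets; the regularization strength and any feature preprocessing here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def fewshot_linear_probe(X: np.ndarray, y: np.ndarray, K: int, lam: float = 1.0) -> np.ndarray:
    """Closed-form regularized least squares from frozen features X (n, D) to {-1, 1}^K targets.
    Returns a (D, K) weight matrix; class predictions are the argmax over the K outputs."""
    n, D = X.shape
    Y = -np.ones((n, K))
    Y[np.arange(n), y] = 1.0                                    # {-1, 1} one-vs-rest targets
    W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)     # exact solution in closed form
    return W

# toy usage with random "frozen" features (illustration only)
X = np.random.randn(500, 64)
y = np.random.randint(0, 10, size=500)
W = fewshot_linear_probe(X, y, K=10)
preds = np.argmax(X @ W, axis=1)
```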

4.2 COMPARISON TO STATE OF THE ART

4.2 与现有技术的对比

We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.

我们首先将最大的模型——ViT-H/14和ViT-L/16——与文献中最先进的CNN进行比较。第一个对比点是Big Transfer (BiT) (Kolesnikov et al., 2020),它使用大型ResNet进行监督迁移学习。第二个是Noisy Student (Xie et al., 2020),这是一个在ImageNet和无标签的JFT-300M上通过半监督学习训练的大型EfficientNet。目前,Noisy Student在ImageNet上是最先进的,而BiT-L在本文报告的其他数据集上表现最佳。所有模型均在TPUv3硬件上训练,我们报告了预训练每个模型所消耗的TPUv3-core-days数,即用于训练的TPU v3核心数(每芯片2个)乘以训练天数。

Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this

表 2 展示了结果。在 JFT-300M 数据集上预训练的较小模型 ViT-L/16 在所有任务上都优于 BiT-L (同样在该数据集上预训练) ,同时训练所需的计算资源显著减少。更大的 ViT-H/14 模型进一步提升了性能,尤其在更具挑战性的数据集上——ImageNet、CIFAR-100 和 VTAB 套件。值得注意的是...

Ours-JFT (ViT-H/14) Ours-JFT (ViT-L/16) Ours-I21k (ViT-L/16) BiT-L (ResNet152x4) Noisy Student (EfficientNet-L2)
ImageNet 88.55±0.04 87.76±0.03 85.30±0.02 87.54±0.02 88.4/88.5∗
ImageNet ReaL 90.72±0.05 90.54±0.03 88.62±0.05 90.54 90.55
CIFAR-10 99.50±0.06 99.42±0.03 99.15±0.03 99.37±0.06 −
CIFAR-100 94.55±0.04 93.90±0.05 93.25±0.05 93.51±0.08 −
Oxford-IIIT Pets 97.56±0.03 97.32±0.11 94.67±0.15 96.62±0.23 −
Oxford Flowers-102 99.68±0.02 99.74±0.00 99.61±0.02 99.63±0.03 −
VTAB (19 tasks) 77.63±0.23 76.28±0.46 72.72±0.21 76.29±1.70 −
TPUv3-core-days 2.5k 0.68k 0.23k 9.9k 12.3k

Table 2: Comparison with state of the art on popular image classification benchmarks. We report mean and standard deviation of the accuracies, averaged over three fine-tuning runs. Vision Transformer models pre-trained on the JFT-300M dataset outperform ResNet-based baselines on all datasets, while taking substantially less computational resources to pre-train. ViT pre-trained on the smaller public ImageNet-21k dataset performs well too. ∗Slightly improved $88.5\%$ result reported in Touvron et al. (2020).

表 2: 主流图像分类基准的性能对比。我们报告了三次微调运行的平均准确率及其标准差。在 JFT-300M 数据集上预训练的 Vision Transformer 模型在所有数据集上都优于基于 ResNet 的基线模型,同时所需的预训练计算资源显著减少。在较小的公开数据集 ImageNet-21k 上预训练的 ViT 也表现良好。∗Touvron 等人 (2020) 报告了略微提升的 $88.5\%$ 结果。


Figure 2: Breakdown of VTAB performance in Natural, Specialized, and Structured task groups.

图 2: VTAB在自然(Natural)、专业(Specialized)和结构化(Structured)任务组中的性能分解。

model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.

模型的预训练计算量仍远低于此前的最先进水平。但需注意,预训练效率不仅受架构选择影响,还与其他参数相关,如训练计划、优化器、权重衰减等。我们在第4.4节针对不同架构进行了性能与计算资源的对照研究。最后,基于公开ImageNet-21k数据集预训练的ViT-L/16模型在多数数据集上表现优异,且预训练资源消耗更低:使用8核标准云TPUv3约30天即可完成训练。

Figure 2 decomposes the VTAB tasks into their respective groups, and compares to previous SOTA methods on this benchmark: BiT, VIVI – a ResNet co-trained on ImageNet and Youtube (Tschannen et al., 2020), and S4L – supervised plus semi-supervised learning on ImageNet (Zhai et al., 2019a). ViT-H/14 outperforms BiT-R152x4, and other methods, on the Natural and Structured tasks. On the Specialized the performance of the top two models is similar.

图 2 将VTAB任务分解为各自组别,并与该基准测试中的先前SOTA方法进行比较:BiT、VIVI(在ImageNet和YouTube上联合训练的ResNet)(Tschannen et al., 2020),以及S4L(ImageNet上的监督加半监督学习)(Zhai et al., 2019a)。ViT-H/14在自然和结构化任务上优于BiT-R152x4及其他方法。在专业任务上,前两名模型的性能相近。

4.3 PRE-TRAINING DATA REQUIREMENTS

4.3 预训练数据要求

The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.

视觉Transformer (Vision Transformer) 在大型JFT-300M数据集上预训练时表现优异。相比ResNets,其视觉归纳偏置更少,那么数据集规模有多关键?我们进行了两组实验。

First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-300M. To boost the performance on the smaller datasets, we optimize three basic regularization parameters – weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.

首先,我们在逐渐增大的数据集上预训练ViT模型:ImageNet、ImageNet-21k和JFT-300M。为了提高在较小数据集上的性能,我们优化了三个基本正则化参数——权重衰减、dropout和标签平滑。图3展示了在ImageNet上微调后的结果(其他数据集的结果见表5)。当在最小的数据集ImageNet上预训练时,尽管进行了(适度)正则化,ViT-Large模型的性能仍低于ViT-Base模型。使用ImageNet-21k预训练时,它们的性能相似。只有在JFT-300M上,我们才能看到更大模型的全部优势。图3还展示了不同尺寸BiT模型的性能范围。BiT CNN在ImageNet上优于ViT,但在更大的数据集上,ViT实现了反超。


Figure 3: Transfer to ImageNet. While large ViT models perform worse than BiT ResNets (shaded area) when pre-trained on small datasets, they shine when pre-trained on larger datasets. Similarly, larger ViT variants overtake smaller ones as the dataset grows.

图 3: 迁移到 ImageNet。当在小型数据集上进行预训练时,大型 ViT 模型的表现不如 BiT ResNets (阴影区域),但在更大数据集上预训练时表现出色。同样,随着数据集的增大,更大的 ViT 变体会超越较小的变体。


Figure 4: Linear few-shot evaluation on ImageNet versus pre-training size. ResNets perform better with smaller pre-training datasets but plateau sooner than ViT, which performs better with larger pre-training. ViT-b is ViT-B with all hidden dimensions halved.

图 4: ImageNet 上线性少样本评估与预训练规模的关系。ResNet 在较小预训练数据集上表现更好但会更快达到瓶颈,而 ViT 在更大规模预训练时表现更优。ViT-b 指将所有隐藏维度减半的 ViT-B 模型。


Figure 5: Performance versus pre-training compute for different architectures: Vision Transformers, ResNets, and hybrids. Vision Transformers generally outperform ResNets with the same computational budget. Hybrids improve upon pure Transformers for smaller model sizes, but the gap vanishes for larger models.

图 5: 不同架构的性能与预训练计算量对比:Vision Transformer、ResNet 及混合架构。在相同计算预算下,Vision Transformer 通常优于 ResNet。对于较小模型,混合架构比纯 Transformer 有所提升,但随着模型规模增大,这种差距逐渐消失。

Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.

其次,我们在900万、3000万、9000万随机子集以及完整的JFT300M数据集上训练模型。对于较小子集不进行额外正则化处理,所有设置均采用相同超参数。这种方式能评估模型的内在特性,而非正则化效果。不过我们采用了早停策略,并报告训练期间达到的最佳验证准确率。为节省算力,报告的是少样本线性准确率而非完整微调准确率。图4展示了结果:在计算成本相近的情况下,视觉Transformer(ViT)在小数据集上比ResNet更容易过拟合。例如ViT-B/32略快于ResNet50,在900万子集上表现差很多,但在9000万以上子集表现更优。ResNet152x2与ViT-L/16也呈现相同规律。这一结果印证了直觉:卷积归纳偏置对小数据集很有用,但对大数据集而言,直接从数据中学习相关模式不仅足够,甚至更具优势。

Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB (Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT is an exciting direction of future work.

总体而言,ImageNet上的少样本结果(图 4)和VTAB上的低数据量结果(表 2)在极低数据迁移场景中表现良好。进一步分析ViT的少样本特性是未来工作的一个有趣方向。

4.4 SCALING STUDY

4.4 扩展性研究

We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models’ performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).

我们通过评估从JFT-300M迁移的性能,对不同模型进行了受控扩展研究。在此设置下,数据规模不会成为模型性能的瓶颈,我们评估了各模型性能与预训练成本的关系。模型集包括:7个ResNet(R50x1、R50x2、R101x1、R152x1、R152x2预训练7个周期,外加R152x2和R200x3预训练14个周期);6个Vision Transformer(ViT-B/32、B/16、L/32、L/16预训练7个周期,外加L/16和H/14预训练14个周期);以及5个混合模型(R50+ViT-B/32、B/16、L/32、L/16预训练7个周期,外加R50+ViT-L/16预训练14个周期)(对于混合模型,模型名称末尾的数字不代表分块大小,而是ResNet主干中的总降采样率)。

Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately $2-4\times$ less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.

图5展示了迁移性能与总预训练计算量的关系(计算成本详情参见附录D.5)。附录中的表6提供了每个模型的详细结果。我们可以观察到几个规律:首先,在性能/计算量权衡方面,Vision Transformer明显优于ResNet。ViT只需消耗约$2-4\times$的计算量即可达到相同性能(5个数据集的平均值)。其次,在较小计算预算时,混合模型略优于ViT,但随着模型规模增大差异逐渐消失。这个结果有些出人意料,因为人们可能认为卷积局部特征处理对各种规模的ViT都有帮助。第三,Vision Transformer在实验规模范围内尚未出现饱和现象,这为未来的扩展研究提供了动力。

4.5 INSPECTING VISION TRANSFORMER

4.5 视觉Transformer (Vision Transformer) 解析

To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.

为了理解Vision Transformer如何处理图像数据,我们分析了其内部表示。Vision Transformer的第一层将展平的图像块线性投影到低维空间(式1)。图7(左)展示了学习到的嵌入过滤器的主要主成分。这些成分类似于每个图像块内部精细结构的低维表示中合理的基函数。

After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).

投影后,学习得到的位置嵌入会添加到图像块表示中。图7(中)显示,模型学会通过位置嵌入的相似性来编码图像内的距离,即相邻图像块往往具有更相似的位置嵌入。此外,行列结构开始显现:同一行/列的图像块具有相似的嵌入。最后,对于较大网格,有时会出现正弦结构(附录D)。位置嵌入能学习表示二维图像拓扑结构,这解释了为什么手工设计的二维感知嵌入变体未能带来改进(附录D.4)。

Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This “attention distance” is analogous to receptive field size in CNNs.

自注意力机制使ViT即使在最底层也能整合整个图像的信息。我们研究了网络在多大程度上利用了这种能力。具体而言,我们基于注意力权重计算了图像空间中信息整合的平均距离(图7右)。这种"注意力距离"类似于CNN中的感受野大小。
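
As an illustration of this "attention distance", the sketch below takes one head's attention matrix over the patch tokens and averages the pixel distance between query and key patch centers, weighted by attention. The published analysis (Appendix D.7) may differ in details such as the handling of the class token:

```python
import numpy as np

def mean_attention_distance(attn: np.ndarray, grid: int, patch: int) -> float:
    """attn: (N, N) attention weights of one head over N = grid*grid patch tokens
    (class token excluded here), each row summing to 1. Returns the attention-weighted
    mean distance, in pixels, between query and key patch positions."""
    ys, xs = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1) * patch                  # patch positions in pixels
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)      # (N, N) pairwise distances
    return float((attn * dist).sum(axis=1).mean())

# toy usage: uniform attention over a 14x14 grid of 16-pixel patches
N = 14 * 14
attn = np.full((N, N), 1.0 / N)
print(mean_attention_distance(attn, grid=14, patch=16))
```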


Figure 6: Representative examples of attention from the output token to the input space. See Appendix D.7 for details.

图 6: 输出token对输入空间的注意力代表性示例。详情参见附录 D.7。

We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).

我们发现,一些注意力头(attention head)在最底层就已经关注到图像的大部分区域,这表明模型确实具备全局信息整合能力。其他注意力头在低层网络始终保持着较小的注意力距离。这种高度局部化的注意力特性,在采用Transformer前先使用ResNet的混合模型中表现较弱(图7右),暗示其功能可能类似于CNN中的早期卷积层。此外,注意力距离会随着网络深度增加而增大。整体而言,模型关注的图像区域与分类任务具有语义相关性(图6)。

4.6 SELF-SUPERVISION

4.6 自监督学习

Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves $79.9\%$ accuracy on ImageNet, a significant improvement of $2\%$ over training from scratch, but still $4\%$ behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Hénaff et al., 2020) to future work.

Transformer在自然语言处理任务中展现出卓越性能。然而,其成功不仅源于出色的扩展能力,更得益于大规模自监督预训练 (Devlin et al., 2019; Radford et al., 2018)。我们初步探索了掩码补丁预测的自监督方法,模拟BERT中使用的掩码语言建模任务。通过自监督预训练,我们较小的ViT-B/16模型在ImageNet上达到79.9%准确率,相比从头训练显著提升2%,但仍比监督预训练低4%。附录B.1.2包含更多细节。对比预训练 (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Hénaff et al., 2020) 的探索将留待未来工作。
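
As a rough illustration of a masked-patch-prediction objective (the exact corruption scheme and prediction targets used by the authors are described in their Appendix B.1.2 and are not reproduced here), the sketch below masks a random subset of patches and uses each masked patch's mean color as a regression target; the mask ratio and target choice are assumptions:

```python
import numpy as np

def masked_patch_targets(patches: np.ndarray, C: int = 3, mask_ratio: float = 0.5, seed: int = 0):
    """patches: (N, P*P*C) flattened patches. Randomly selects patch indices to mask and
    returns (masked_indices, targets), where each target is the masked patch's mean color
    (C values) -- an illustrative regression target for self-supervised pre-training."""
    rng = np.random.default_rng(seed)
    N = patches.shape[0]
    masked = rng.choice(N, size=int(mask_ratio * N), replace=False)
    targets = patches[masked].reshape(len(masked), -1, C).mean(axis=1)   # (num_masked, C)
    return masked, targets

# toy usage: 196 flattened 16x16x3 patches
patches = np.random.rand(196, 16 * 16 * 3)
masked, targets = masked_patch_targets(patches)   # masked: (98,), targets: (98, 3)
```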


Figure 7: Left: Filters of the initial linear embedding of RGB values of ViT-L/32. Center: Similarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position embedding of the patch with the indicated row and column and the position embeddings of all other patches. Right: Size of attended area by head and network depth. Each dot shows the mean attention distance across images for one of 16 heads at one layer. See Appendix D.7 for details.

图 7: 左: ViT-L/32 对 RGB 值初始线性嵌入的滤波器。中: ViT-L/32 位置嵌入的相似性。每个图块显示指定行列的补丁位置嵌入与所有其他补丁位置嵌入之间的余弦相似度。右: 注意力区域大小与网络深度的关系。每个点表示某一层 16 个头在图像上的平均注意力距离。详见附录 D.7。

5 CONCLUSION

5 结论

We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.

我们探索了将Transformer直接应用于图像识别的方法。与之前计算机视觉领域使用自注意力机制的研究不同,除了初始的图像块提取步骤外,我们没有在架构中引入任何针对图像的归纳偏置。相反,我们将图像视为一系列图像块的序列,并使用自然语言处理中标准的Transformer编码器进行处理。这种简单但可扩展的策略,在结合大型数据集预训练时表现出惊人的效果。因此,Vision Transformer在许多图像分类数据集上达到或超越了当前最优水平,同时预训练成本相对较低。

While these initial results are encouraging, many challenges remain. One is to apply ViT to other computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.

虽然这些初步结果令人鼓舞,但许多挑战仍然存在。其一是将ViT应用于其他计算机视觉任务,如检测和分割。我们的结果与Carion等人 (2020) 的研究共同表明了这种方法的潜力。另一个挑战是继续探索自监督预训练方法。我们的初步实验显示自监督预训练带来了改进,但自监督与大规模监督预训练之间仍存在较大差距。最后,进一步扩展ViT的规模可能会带来性能提升。

ACKNOWLEDGEMENTS

致谢

The work was performed in Berlin, Zurich, and Amsterdam. We thank many colleagues at Google for their help, in particular Andreas Steiner for crucial help with the infrastructure and the open-source release of the code; Joan Puigcerver and Maxim Neumann for help with the large-scale training infrastructure; Dmitry Lepikhin, Aravindh Mahendran, Daniel Keysers, Mario Lucic, Noam Shazeer, Ashish Vaswani, and Colin Raffel for useful discussions.

工作在柏林、苏黎世和阿姆斯特丹进行。我们感谢谷歌众多同事的帮助,特别是Andreas Steiner在基础设施和代码开源发布方面的关键支持;Joan Puigcerver和Maxim Neumann在大规模训练基础设施上的协助;以及Dmitry Lepikhin、Aravindh Mahendran、Daniel Keysers、Mario Lucic、Noam Shazeer、Ashish Vaswani和Colin Raffel的有益讨论。

REFERENCES

参考文献

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In ACL, 2020.

Samira Abnar 和 Willem Zuidema. 量化Transformer中的注意力流. 载于ACL, 2020.

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019.

Philip Bachman、R Devon Hjelm 和 William Buchwalter。通过跨视图最大化互信息学习表征。收录于 NeurIPS,2019年。


Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.

Sergey Ioffe 和 Christian Szegedy. 批量归一化 (Batch Normalization): 通过减少内部协变量偏移加速深度网络训练. 2015.


Table 3: Hyperparameters for training. All models are trained with a batch size of 4096 and learning rate warmup of 10k steps. For ImageNet we found it beneficial to additionally apply gradient clipping at global norm 1. Training resolution is 224.

Models Dataset Epochs Base LR LR decay Weight decay Dropout
ViT-B/{16,32} JFT-300M 7 8·10⁻⁴ linear 0.1 0.0
ViT-L/32 JFT-300M 7 6·10⁻⁴ linear 0.1 0.0
ViT-L/16 JFT-300M 7/14 4·10⁻⁴ linear 0.1 0.0
ViT-H/14 JFT-300M 14 3·10⁻⁴ linear 0.1 0.0
R50x{1,2} JFT-300M 7 10⁻³ linear 0.1 0.0
R101x1 JFT-300M 7 8·10⁻⁴ linear 0.1 0.0
R152x{1,2} JFT-300M 7 6·10⁻⁴ linear 0.1 0.0
R50+ViT-B/{16,32} JFT-300M 7 8·10⁻⁴ linear 0.1 0.0
R50+ViT-L/32 JFT-300M 7 2·10⁻⁴ linear 0.1 0.0
R50+ViT-L/16 JFT-300M 7/14 4·10⁻⁴ linear 0.1 0.0
ViT-B/{16,32} ImageNet-21k 90 10⁻³ linear 0.03 0.1
ViT-L/{16,32} ImageNet-21k 30/90 10⁻³ linear 0.03 0.1
ViT-* ImageNet 300 3·10⁻³ cosine 0.3 0.1

表 3: 训练超参数。所有模型均以4096的批量大小和 10k 步的学习率预热进行训练。对于ImageNet数据集,我们发现额外应用全局范数为1的梯度裁剪效果更佳。训练分辨率为224。

模型 数据集 训练轮数 基础学习率 学习率衰减 权重衰减 Dropout
ViT-B/{16,32} JFT-300M 7 8·10⁻⁴ linear 0.1 0.0
ViT-L/32 JFT-300M 7 6·10⁻⁴ linear 0.1 0.0
ViT-L/16 JFT-300M 7/14 4·10⁻⁴ linear 0.1 0.0
ViT-H/14 JFT-300M 14 3·10⁻⁴ linear 0.1 0.0
R50x{1,2} JFT-300M 7 10⁻³ linear 0.1 0.0
R101x1 JFT-300M 7 8·10⁻⁴ linear 0.1 0.0
R152x{1,2} JFT-300M 7 6·10⁻⁴ linear 0.1 0.0
R50+ViT-B/{16,32} JFT-300M 7 8·10⁻⁴ linear 0.1 0.0
R50+ViT-L/32 JFT-300M 7 2·10⁻⁴ linear 0.1 0.0
R50+ViT-L/16 JFT-300M 7/14 4·10⁻⁴ linear 0.1 0.0
ViT-B/{16,32} ImageNet-21k 90 10⁻³ linear 0.03 0.1
ViT-L/{16,32} ImageNet-21k 30/90 10⁻³ linear 0.03 0.1
ViT-* ImageNet 300 3·10⁻³ cosine 0.3 0.1

APPENDIX

附录

A MULTIHEAD SELF-ATTENTION

A 多头自注意力机制

Standard qkv self-attention (SA, Vaswani et al. (2017)) is a popular building block for neural architectures. For each element in an input sequence $\mathbf{z}\in\mathbb{R}^{N\times D}$, we compute a weighted sum over all values $\mathbf{v}$ in the sequence. The attention weights $A_{ij}$ are based on the pairwise similarity between two elements of the sequence and their respective query $\mathbf{q}^{i}$ and key $\mathbf{k}^{j}$ representations.

标准qkv自注意力机制 (SA, Vaswani et al. (2017)) 是神经架构中的常用模块。对于输入序列 $\mathbf{z}\in\mathbb{R}^{N\times D}$ 中的每个元素,我们计算序列中所有值 $\mathbf{v}$ 的加权和。注意力权重 $A_{ij}$ 基于序列中两个元素之间的成对相似度及其各自的查询 $\mathbf{q}^{i}$ 和键 $\mathbf{k}^{j}$ 表示。

$$
\begin{array}{l l r}
[\mathbf{q}, \mathbf{k}, \mathbf{v}] = \mathbf{z}\mathbf{U}_{qkv}, & \mathbf{U}_{qkv}\in\mathbb{R}^{D\times 3D_{h}}, & (5)\\
A = \mathrm{softmax}\left(\mathbf{q}\mathbf{k}^{\top}/\sqrt{D_{h}}\right), & A\in\mathbb{R}^{N\times N}, & (6)\\
\mathrm{SA}(\mathbf{z}) = A\mathbf{v}. & & (7)
\end{array}
$$

Multihead self-attention (MSA) is an extension of SA in which we run $k$ self-attention operations, called “heads”, in parallel, and project their concatenated outputs. To keep compute and number of parameters constant when changing $k$ , $D_{h}$ (Eq. 5) is typically set to $D/k$ .

多头自注意力 (Multihead Self-Attention, MSA) 是自注意力 (SA) 的扩展形式,通过并行运行 $k$ 个称为"头"的自注意力操作,并对它们的拼接输出进行投影。为确保在调整 $k$ 时计算量和参数量保持不变,通常将 $D_{h}$ (公式5) 设为 $D/k$。

$$
\mathrm{MSA}(\mathbf{z}) = [\mathrm{SA}_{1}(z);\, \mathrm{SA}_{2}(z);\, \cdots;\, \mathrm{SA}_{k}(z)]\,\mathbf{U}_{msa}, \qquad \mathbf{U}_{msa}\in\mathbb{R}^{k\cdot D_{h}\times D} \qquad (8)
$$
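
A self-contained NumPy sketch of Eq. 5-8. For brevity it packs the projections of all $k$ heads into a single $(D, 3D)$ matrix with $D_{h}=D/k$, which is equivalent to $k$ per-head matrices $\mathbf{U}_{qkv}\in\mathbb{R}^{D\times 3D_{h}}$; the random weights are for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z: np.ndarray, U_qkv: np.ndarray, U_msa: np.ndarray, k: int) -> np.ndarray:
    """Multihead self-attention. z: (N, D); U_qkv: (D, 3*D) packing k heads with D_h = D // k;
    U_msa: (D, D), i.e. (k*D_h, D). Returns (N, D)."""
    N, D = z.shape
    D_h = D // k
    qkv = z @ U_qkv                                   # Eq. 5, all heads at once: (N, 3*D)
    q, kk, v = np.split(qkv, 3, axis=-1)              # each (N, D) = (N, k*D_h)
    outs = []
    for h in range(k):                                # run the k "heads" independently
        qh = q[:, h * D_h:(h + 1) * D_h]
        kh = kk[:, h * D_h:(h + 1) * D_h]
        vh = v[:, h * D_h:(h + 1) * D_h]
        A = softmax(qh @ kh.T / np.sqrt(D_h))         # Eq. 6: (N, N) attention weights
        outs.append(A @ vh)                           # Eq. 7: SA_h(z)
    return np.concatenate(outs, axis=-1) @ U_msa      # Eq. 8: project the concatenated heads

# toy usage: 196 patch tokens + 1 class token, D = 768, k = 12 heads
N, D, k = 197, 768, 12
z = np.random.randn(N, D)
out = msa(z, np.random.randn(D, 3 * D) / np.sqrt(D), np.random.randn(D, D) / np.sqrt(D), k)
print(out.shape)   # (197, 768)
```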