Generative Pretraining from Pixels
Mark Chen 1 Alec Radford 1 Rewon Child 1 Jeff Wu 1 Heewoo Jun 1 Prafulla Dhariwal 1 David Luan 1 Ilya Sutskever 1
Abstract
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve $96.3\%$ accuracy with a linear probe, outperforming a supervised Wide ResNet, and $99.0\%$ accuracy with full fine-tuning, matching the top supervised pre-trained models. An even larger model trained on a mixture of ImageNet and web images is competitive with self-supervised benchmarks on ImageNet, achieving $72.0\%$ top-1 accuracy on a linear probe of our features.
1. Introduction
Unsupervised pre-training played a central role in the resurgence of deep learning. Starting in the mid-2000s, approaches such as the Deep Belief Network (Hinton et al., 2006) and Denoising Autoencoder (Vincent et al., 2008) were commonly used in neural networks for computer vision (Lee et al., 2009) and speech recognition (Mohamed et al., 2009). It was believed that a model which learned the data distribution $P(X)$ would also learn beneficial features for the subsequent supervised modeling of $P(Y|X)$ (Lasserre et al., 2006; Erhan et al., 2010). However, advancements such as piecewise linear activation functions (Nair & Hinton, 2010), improved initializations (Glorot & Bengio, 2010), and normalization strategies (Ioffe & Szegedy, 2015; Ba et al., 2016) removed the need for pre-training in order to achieve strong results. Other research cast doubt on the benefits of deep unsupervised representations and reported strong results using a single layer of learned features (Coates et al., 2011), or even random features (Huang et al., 2014; May et al., 2017). The approach fell out of favor as the state of the art increasingly relied on directly encoding prior structure into the model and utilizing abundant supervised data to directly learn representations (Krizhevsky et al., 2012; Graves & Jaitly, 2014). Retrospective study of unsupervised pre-training demonstrated that it could even hurt performance in modern settings (Paine et al., 2014).
Instead, unsupervised pre-training flourished in a different domain. After initial strong results for word vectors (Mikolov et al., 2013), it has pushed the state of the art forward in Natural Language Processing on most tasks (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Interestingly, the training objective of a dominant approach like BERT, the prediction of corrupted inputs, closely resembles that of the Denoising Autoencoder, which was originally developed for images.
As a higher dimensional, noisier, and more redundant modality than text, images are believed to be difficult for generative modeling. Here, self-supervised approaches designed to encourage the modeling of more global structure (Doersch et al., 2015) have shown significant promise. A combination of new training objectives (Oord et al., 2018), more recent architectures (Gomez et al., 2017), and increased model capacity (Kolesnikov et al., 2019) has allowed these methods to achieve state of the art performance in low data settings (Hénaff et al., 2019) and sometimes even outperform supervised representations in transfer learning settings (He et al., 2019; Misra & van der Maaten, 2019; Chen et al., 2020).
Given that it has been a decade since the original wave of generative pre-training methods for images and considering their substantial impact in NLP, this class of methods is due for a modern re-examination and comparison with the recent progress of self-supervised methods. We re-evaluate generative pre-training on images and demonstrate that when using a flexible architecture (Vaswani et al., 2017), a tractable and efficient likelihood based training objective (Larochelle & Murray, 2011; Oord et al., 2016), and significant compute resources (2048 TPU cores), generative pre-training is competitive with other self-supervised approaches and learns representations that significantly improve the state of the art in low-resolution unsupervised representation learning settings.
Figure 1. An overview of our approach. First, we pre-process raw images by resizing to a low resolution and reshaping into a 1D sequence. We then choose one of two pre-training objectives, auto-regressive next pixel prediction or masked pixel prediction. Finally, we evaluate the representations learned by these objectives with linear probes or fine-tuning.
This is especially promising as our architecture uses a dense connectivity pattern which does not encode the 2D spatial structure of images yet is able to match and even outperform approaches which do. We report a set of experiments characterizing the performance of our approach on many datasets and in several different evaluation settings (low data, linear evaluation, full fine-tuning). We also conduct several experiments designed to better understand the achieved performance of these models. We investigate how representations are computed inside our model via the performance of linear probes as a function of model depth as well as studying how scaling the resolution and parameter count of the approach affects performance.
2. Approach
Our approach consists of a pre-training stage followed by a fine-tuning stage. In pre-training, we explore both the auto-regressive and BERT objectives. We also apply the sequence Transformer architecture to predict pixels instead of language tokens.
One way to measure representation quality is to fine-tune for image classification. Fine-tuning adds a small classification head to the model, used to optimize a classification objective, and adapts all weights. Pre-training can be viewed as a favorable initialization or as a regularizer when used in combination with early stopping (Erhan et al., 2010).
Another approach for measuring representation quality uses the pre-trained model as a feature extractor. In particular, given labeled examples $(X,Y)$ , the model is applied to $X$ to produce features $f_{X}$ . Then, a linear classifier is trained on $(f_{X},Y)$ . Linear probing captures the intuition that good features should linearly separate the classes of transfer tasks. Furthermore, linear probes help disentangle feature quality from model architecture: in fine-tuning, one model may outperform another because its architecture is more suited for the downstream task rather than because of better pretraining.
We begin this section by defining the auto-regressive and BERT objectives in the context of images. Next, we outline implementation details for our transformer decoder. Finally, we describe how the transformer is used for fine-tuning and how features are extracted for linear probes.
2.1. Pre-training
Given an unlabeled dataset $X$ consisting of high dimensional data $\boldsymbol{x}=(x_{1},...,x_{n})$, we can pick a permutation $\pi$ of the set $[1,n]$ and model the density $p(x)$ auto-regressively as follows:
$$
p(x)=\prod_{i=1}^{n}p(x_{\pi_{i}}|x_{\pi_{1}},...,x_{\pi_{i-1}},\theta)
$$
When working with images, we pick the identity permutation $\pi_{i}=i$ for $1\leq i\leq n$ , also known as raster order. We train our model by minimizing the negative log-likelihood of the data:
$$
L_{A R}=\underset{x\sim X}{\mathbb{E}}[-\log p(x)]
$$
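As a concrete illustration, a minimal PyTorch-style sketch of computing $L_{AR}$ under raster order is shown below. This is not the implementation used in this work; the `model` interface (a causally masked decoder returning per-position logits) and the batch layout are assumptions for illustration.

```python
import torch.nn.functional as F

def autoregressive_loss(model, pixels):
    """Minimal sketch of L_AR: next-pixel prediction in raster order.

    pixels: LongTensor of shape (batch, n) with discretized pixel values.
    model:  assumed to map a (batch, n) sequence to (batch, n, vocab) logits,
            with causal masking so position i only sees positions <= i.
    """
    logits = model(pixels)                               # (batch, n, vocab)
    # Logits at position i-1 parameterize p(x_i | x_1, ..., x_{i-1}); the
    # unconditional term for x_1 (e.g. via a start token) is omitted here.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = pixels[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)                 # mean negative log-likelihood
```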
We also consider the BERT objective, which samples a sub-sequence $M\subset[1,n]$ such that each index $i$ independently has probability 0.15 of appearing in $M$ . We call $M$ the BERT mask, and we train our model by minimizing the negative log-likelihood of the “masked” elements $x_{M}$ conditioned on the “unmasked” ones $x_{[1,n]\setminus M}$ :
$$
L_{B E R T}=\underset{x\sim X}{\mathbb{E}}\underset{M}{\mathbb{E}}\sum_{i\in M}\left[-\log p\left(x_{i}\vert x_{[1,n]\backslash M}\right)\right]
$$
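A corresponding sketch of $L_{BERT}$ is given below, again as an illustration rather than the actual training code; how the model realizes masking internally (e.g. zeroing content embeddings at masked positions, as described in Section 2.2) is assumed.

```python
import torch
import torch.nn.functional as F

def bert_loss(model, pixels, mask_prob=0.15):
    """Minimal sketch of L_BERT: predict masked pixels from the unmasked ones.

    pixels: LongTensor of shape (batch, n) with discretized pixel values.
    model:  assumed to take (pixels, mask) and return (batch, n, vocab) logits.
    """
    # Each index lands in the BERT mask M independently with probability 0.15.
    mask = torch.rand(pixels.shape, device=pixels.device) < mask_prob
    logits = model(pixels, mask)
    # Only masked positions contribute to the loss.
    return F.cross_entropy(logits[mask], pixels[mask])
```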
In pre-training, we pick one of $L_{A R}$ or $L_{B E R T}$ and minimize the loss over our pre-training dataset.
2.2. Architecture
The transformer decoder takes an input sequence $x_{1},\ldots,x_{n}$ of discrete tokens and produces a $d$-dimensional embedding for each position. The decoder is realized as a stack of $L$ blocks, the $l$-th of which produces an intermediate embedding $h_{1}^{l},...,h_{n}^{l}$ also of dimension $d$. We use the GPT-2 (Radford et al., 2019) formulation of the transformer decoder block, which acts on an input tensor $h^{l}$ as follows:
$$
\begin{aligned}
n^{l} &= \mathrm{layer\_norm}(h^{l}) \\
a^{l} &= h^{l} + \mathrm{multihead\_attention}(n^{l}) \\
h^{l+1} &= a^{l} + \mathrm{mlp}(\mathrm{layer\_norm}(a^{l}))
\end{aligned}
$$
In particular, layer norms precede both the attention and mlp operations, and all operations lie strictly on residual paths. We find that such a formulation allows us to scale the transformer with ease.
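A minimal PyTorch sketch of this pre-norm residual block follows; it is an illustration rather than the training code, and the 4x MLP width and GELU activation are assumptions carried over from GPT-2.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Layer norms precede the attention and MLP sublayers, and both
    sublayers lie on residual paths, as in the equations above."""

    def __init__(self, d, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, h, attn_mask=None):
        n = self.ln1(h)                                     # n^l = layer_norm(h^l)
        a = h + self.attn(n, n, n, attn_mask=attn_mask)[0]  # a^l = h^l + attention(n^l)
        return a + self.mlp(self.ln2(a))                    # h^{l+1} = a^l + mlp(layer_norm(a^l))
```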
The only mixing across sequence elements occurs in the attention operation, and to ensure proper conditioning when training the AR objective, we apply the standard upper triangular mask to the $n\times n$ matrix of attention logits. When using the BERT objective, no attention logit masking is required: after applying content embeddings to the input sequence, we zero out the positions in $M$ .
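The upper triangular mask can be sketched as an additive mask compatible with the block above (again an assumption about the interface, not the original code):

```python
import torch

def causal_attention_mask(n, device=None):
    """AR objective: position i may not attend to positions j > i.
    Entries above the diagonal are -inf so the corresponding attention
    logits vanish after the softmax; the BERT objective uses no attention
    mask at all."""
    mask = torch.full((n, n), float("-inf"), device=device)
    return torch.triu(mask, diagonal=1)
```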
Additionally, since we learn independent position embeddings for each sequence element, our BERT model has no positional inductive biases (i.e. it is permutation invariant). Put another way, any spatial relationships between positions must be learned by the model at train time. This is not entirely true for the AR model, as choosing the raster order also fixes a pre-specified ordering of the conditionals. Nevertheless, permutation invariance is a property in strong contrast to convolutional neural networks, which incorporate the inductive bias that features should arise from spatially proximate elements.
Following the final transformer layer, we apply a layer norm $n^{L}=\mathrm{layer\_norm}(h^{L})$, and learn a projection from $n^{L}$ to logits parameterizing the conditional distributions at each sequence element. When training BERT, we simply ignore the logits at unmasked positions.
2.3. Fine-tuning
When fine-tuning, we average pool $n^{L}$ across the sequence dimension to extract a $d$ -dimensional vector of features per example:
$$
f^{L}=\langle n_{i}^{L}\rangle_{i}
$$
We learn a projection from $f^{L}$ to class logits, which we use to minimize a cross entropy loss $L_{C L F}$ .
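Concretely, a sketch of this fine-tuning head is shown below; the module and argument names are illustrative, not the exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Average-pool n^L over the sequence dimension to get f^L, then
    project to class logits and train with cross entropy (L_CLF)."""

    def __init__(self, d, num_classes):
        super().__init__()
        self.proj = nn.Linear(d, num_classes)

    def forward(self, n_L, labels):
        f_L = n_L.mean(dim=1)                # f^L = <n_i^L>_i
        logits = self.proj(f_L)
        return F.cross_entropy(logits, labels)
```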
While fine-tuning on $L_{C L F}$ yields reasonable downstream performance, we find empirically that the joint objective
$$
L_{G E N}+L_{C L F}
$$
with $L_{GEN}\in\{L_{AR},L_{BERT}\}$ works even better. Similar findings were reported by Radford et al. (2018).
2.4. Linear Probing
Extracting fixed features for linear probing follows a similar procedure to fine-tuning, except that average pooling is not always at the final layer:
$$
f^{l}=\langle n_{i}^{l}\rangle_{i}
$$
where $0\le l\le L$ . We will show in the experiments section that the best features often lie in the middle of the network. As in fine-tuning, we project these intermediate features to produce class logits. Because we view the features as fixed when linear probing, this projection contains the only trainable weights, so we can only optimize $L_{C L F}$ .
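A sketch of the linear probing procedure follows. The `hidden_states` accessor (returning the layer-normed activations of block $l$) is an assumption introduced for illustration, not part of the released model interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def extract_features(model, pixels, layer):
    """Average-pool the activations n^l of a chosen block. The backbone is
    frozen, so no gradients flow through it."""
    n_l = model.hidden_states(pixels, layer)   # assumed accessor: (batch, n, d)
    return n_l.mean(dim=1)                     # f^l, shape (batch, d)

def probe_step(probe: nn.Linear, features, labels, optimizer):
    """The linear probe is the only trainable module, so only L_CLF is optimized."""
    loss = F.cross_entropy(probe(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```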
3. Methodology
Although supervised pre-training is the dominant paradigm for image classification, curating large labeled image datasets is both expensive and time consuming. Instead of further scaling up labeling efforts, we can instead aspire to learn general purpose representations from the much larger set of available unlabeled images and fine-tune them for classification. We investigate this setting using ImageNet as a proxy for a large unlabeled corpus, and small classic labeled datasets (CIFAR-10, CIFAR-100, STL-10) as proxies for downstream tasks. For our largest model, we use an additional 100 million unlabeled web images, filtered to be similar to ImageNet.
Even in cases where labels are available, unsupervised or self-supervised pre-training can still provide benefits in data efficiency or fine-tuning speed. We investigate this setting by pre-training without labels and then fine-tuning or linear probing with labels.
3.1. Dataset and Data Augmentation
We use the ImageNet ILSVRC 2012 training dataset, splitting off $4\%$ as our experimental validation set, and report results on the ILSVRC 2012 validation set as our test set. For CIFAR-10, CIFAR-100 and STL-10, we split off $10\%$ of the provided training set instead. We ignore the provided unlabeled examples in STL-10, which constitute a subset of ImageNet.
No data augmentation is used when pre-training on web images, and lightweight data augmentation is used when pre-training or fine-tuning on ImageNet. Specifically, when employing data augmentation, we randomly resize an image such that the shorter sidelength is in the range [256, 384] and then take a random $224\times224$ crop. When evaluating on ImageNet, we resize the image such that the shorter sidelength is 224, and use the single $224\times224$ center crop.
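A sketch of this resize-and-crop pipeline in torchvision-style transforms is given below; the library and exact composition are assumptions for illustration, not the authors' preprocessing code.

```python
import random
from torchvision import transforms

# Training: resize so the shorter side lies in [256, 384], then a random 224x224 crop.
train_transform = transforms.Compose([
    transforms.Lambda(lambda img: transforms.Resize(random.randint(256, 384))(img)),
    transforms.RandomCrop(224),
])

# Evaluation: resize the shorter side to 224 and take the single center crop.
eval_transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
])
```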
When full-network fine-tuning on CIFAR-10 and CIFAR-100, we use the augmentation popularized by Wide Residual Networks: 4 pixels are reflection padded on each side, and a $32\times32$ crop is randomly sampled from the padded image or its horizontal flip (Zagoruyko & Komodakis, 2016).
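In torchvision terms, this augmentation can be sketched as follows (an assumption about the library, not the original code):

```python
from torchvision import transforms

cifar_transform = transforms.Compose([
    # 4-pixel reflection padding followed by a random 32x32 crop.
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    # Randomly take the horizontal flip of the padded image.
    transforms.RandomHorizontalFlip(),
])
```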
Once optimal hyperparameters are found, we fold our experimental validation set back into the training set, retrain the model, and report numbers on the respective test set.
3.2. Context Reduction
Because the memory requirements of the transformer decoder scale quadratically with context length when using dense attention, we must employ further techniques to reduce context length. If we naively trained a transformer on a sequence of length $224^{2}\times3$ , our attention logits would be tens of thousands of times larger than those used in language models and even a single layer would not fit on a GPU. To deal with this, we first resize our image to a lower resolution, which we call the input resolution (IR). Our models have IRs of either $32^{2}\times3$ , $48^{2}\times3$ , or $64^{2}\times3$ .
An IR of $32^{2}\times3$ is still quite computationally intensive. While working at even lower resolutions is tempting, prior work has demonstrated human performance on image classification begins to drop rapidly below this size (Torralba et al., 2008). Instead, motivated by early color display palettes, we create our own 9-bit color palette by clustering (R, G, B) pixel values using $k$-means with $k=512$. Using this palette yields an input sequence length 3 times shorter than the standard (R, G, B) palette, while still encoding color faithfully. A similar approach was applied to spatial patches by Ranzato et al. (2014). We call the resulting context length ($32^{2}$ or $48^{2}$ or $64^{2}$) the model resolution (MR). Note that this reduction breaks permutation invariance of the color channels, but keeps the model spatially invariant.
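A sketch of building and applying such a palette with scikit-learn is shown below; how the training pixels are sampled and the helper names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_palette(pixel_samples, k=512, seed=0):
    """Cluster sampled (R, G, B) values into k=512 centroids: a 9-bit palette."""
    return KMeans(n_clusters=k, random_state=seed).fit(pixel_samples).cluster_centers_

def quantize(image, palette):
    """Map each pixel of an (H, W, 3) image to its nearest palette color,
    yielding a length H*W token sequence (the model resolution context)."""
    flat = image.reshape(-1, 3).astype(np.float32)
    dists = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)              # indices in [0, 512)
```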
3.3. Model
Our largest model, iGPT-XL, contains $L=60$ layers and uses an embedding size of $d=3072$ for a total of 6.8B parameters. Our next largest model, iGPT-L, is essentially identical to GPT-2 with $L=48$ layers, but contains a slightly smaller embedding size of $d=1536$ (vs 1600) for a total of 1.4B parameters. We use the same model code as GPT-2, except that we initialize weights in the layer-dependent fashion as in Sparse Transformer (Child et al., 2019) and zero-initialize all projections producing logits.
We also train iGPT-M, a $455\mathrm{M}$ parameter model with $L=$ 36 and $d=1024$ and iGPT-S, a 76M parameter model with $L=24$ and $d=512$ to study the effect of model capacity on representation quality in a generative model.
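For reference, these stated configurations can be summarized as follows (a convenience summary of the numbers above, not an official configuration file):

```python
# Layer count L, embedding size d, and approximate parameter counts as stated above.
IGPT_CONFIGS = {
    "iGPT-XL": {"layers": 60, "d_model": 3072, "params": "6.8B"},
    "iGPT-L":  {"layers": 48, "d_model": 1536, "params": "1.4B"},
    "iGPT-M":  {"layers": 36, "d_model": 1024, "params": "455M"},
    "iGPT-S":  {"layers": 24, "d_model": 512,  "params": "76M"},
}
```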
3.4. Training
When pre-training iGPT-XL, we use a batch size of 64 and train for 2M iterations, and for all other models we use a batch size of 128 and train for 1M iterations. We use Adam with $\beta_{1}=0.9$ and $\beta_{2}=0.95$ and sequentially try the learning rates 0.01, 0.003, 0.001, 0.0003, ..., stopping once the final validation loss starts increasing. The learning rate is warmed up for one epoch, and then decays to 0 following a cosine schedule. No dropout is used.
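The warmup-then-cosine schedule can be sketched as below, assuming `warmup_steps` corresponds to one epoch; this is an illustration of the described schedule, not the training code.

```python
import math

def learning_rate(step, base_lr, warmup_steps, total_steps):
    """Linear warmup for one epoch, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```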
When fine-tuning, we use the same batch size and Adam hyperparameters. Here, we do not employ a cosine schedule, and early stop once we reach the maximum validation accuracy. Again, no dropout is used.
When running a linear probe on ImageNet, we follow recent literature and use SGD with momentum 0.9 and a high learning rate (we try the values 30, 10, 3, ... in the manner described above) (He et al., 2019). We train for 1,000,000 iterations with a cosine learning rate schedule. Finally, when running a linear probe on CIFAR-10, CIFAR-100, or STL-10, we use the L-BFGS algorithm for consistency with prior results (Pedregosa et al., 2011).
4. Experiments and Results
We begin with experiments and results from the autoregressive formulation of iGPT. Comparisons with the BERT formulation appear in Section 4.6.
4.1. What Representation Works Best in a Generative Model Without Latent Variables?
Figure 2. Representation quality depends on the layer from which we extract features. In contrast with supervised models, the best representations for these generative models lie in the middle of the network. We plot this unimodal dependence on depth by showing linear probes for iGPT-L on CIFAR-10, CIFAR-100, and STL-10.
In supervised pre-training, representation quality tends to increase monotonically with depth, such that the best representations lie at the penultimate layer (Zeiler & Fergus, 2014). Indeed, since a linear layer produces class logits from pre-logits, a good classifier necessarily achieves high accuracy on a linear probe of its pre-logits. If a downstream task also involves classification, it is empirically validated that penultimate features perform well.
With generative pre-training, it is not obvious whether a task like pixel prediction is relevant to image classification. This suggests that the penultimate layer of a model trained for pixel prediction might not produce the most useful representations for classification. Latent variable models such as VAEs can avoid this issue by explicitly learning a representation of the input data, but deep autoregressive generative models have the same width and connectivity pattern at every layer. Our first experiment studies how representation quality varies over one set of candidate representations: different layers of a generative model. We observe a very different behavior from supervised learning: representations first improve as a function of depth, and then, starting around the middle layer, begin to deteriorate until the penultimate layer (Figure 2).
This behavior potentially suggests that these generative models operate in two phases. In the first phase, each position gathers information from its surrounding context in order to build a more global image representation. In the second phase, this contextualized input is used to solve the conditional next pixel prediction task. This could resemble the behavior of encoder-decoder architectures common across deep learning, but learned within a monolithic architecture via a pre-training objective.
Consequently, when evaluating a generative model with a linear probe, it is important to search for the best layer. Taking the final layer on CIFAR-10 decreases performance by $2.4\%$, the difference between a baseline and a state-of-the-art result. In all settings, we find that the dependence of representation quality on depth is strongly unimodal.
4.2. Better Generative Models Learn Better Representations
Figure 3. Plot of representation quality as a function of validation generative loss. Each line tracks a model throughout generative pre-training: the dotted markers denote checkpoints at steps 65K, 131K, 262K, 524K, and 1000K. The positive slope suggests a link between improved generative performance and improved representation quality. Larger models produce better representations than smaller ones both at the end of training and at the same value of validation loss. iGPT-XL is not shown since it was trained on a different dataset.
Using the linear probe as a tool for measuring representation quality, we investigate whether better generative models (as measured by log-prob on held-out data) also learn better representations.
In Figure 3, we see that as validation loss on the autoregressive objective decreases throughout training, linear probe accuracy increases as well. This trend holds across several model capacities, with higher capacity models achieving better validation losses. This highlights the importance of scale for our approach. Note that for a given validation loss value, bigger models also perform better.
Table 1. Comparing linear probe accuracies between our models and state-of-the-art models utilizing unsupervised ImageNet transfer or supervised ImageNet transfer.
| Model | Acc | Unsup Transfer | Sup Transfer |
|---|---|---|---|
| CIFAR-10 | | | |
| ResNet-152 | 94 | | ✓ |
| SimCLR | 95.3 | ✓ | |
| iGPT-L | 96.3 | ✓ | |
| CIFAR-100 | | | |
| ResNet-152 | 78.0 | | ✓ |
| SimCLR | 80.2 | ✓ | |
| iGPT-L | 82.8 | ✓ | |
| STL-10 | | | |
| AMDIM-L | 94.2 | ✓ | |
| iGPT-L | 95.5 | ✓ | |
4.3. Linear Probes on CIFAR and STL-10
In addition to CIFAR-10, we also evaluate linear probes on CIFAR-100 and STL-10 (Figure 2) to check whether the learned representations are useful across multiple datasets. For this evaluation setting, we achieve state-of-the-art across the entire spectrum of pre-training approaches (Table 1). For example, on CIFAR-10, our model achieves $96.3\%$, outperforming both SimCLR (pre-trained on ImageNet without labels) and a ResNet-152 (pre-trained on ImageNet with labels). In fact, on all three datasets a linear classifier fit to the representations of iGPT-L outperforms the end-to-end supervised training of a WideResNet baseline.
Note that our model is trained at the same input resolution (IR) as CIFAR, whereas models trained at the standard ImageNet IR may experience distribution shock upon linear evaluation. As a counterpoint, though STL-10 has an IR of $96^{2}\times3$ , we still outperform AMDIM-L when we downsample to $32^{2}\times3$ before linear probing. We also note that fine-tuning should allow models trained at high IR to adjust to low resolution input.
4.4. Linear Probes on ImageNet
Recently, there has been a resurgence of interest in unsupervised and self-supervised learning on ImageNet, evaluated using linear probes on ImageNet. This is a particularly difficult setting for us, since we cannot efficiently train at the standard ImageNet input resolution (IR). Indeed, when training iGPT-L with a model resolution (MR) of $32^{2}$, we achieve only $60.3\%$ best-layer linear probe accuracy.
Table 2. Comparing linear probe accuracies between our models and state-of-the-art self-supervised models. A blank input resolution (IR) corresponds to a model working at standard ImageNet resolution. We report the best performing configuration for each contrastive method, finding that our models achieve comparable performance.
| Method | IR | Params (M) | Features | Acc |
|---|---|---|---|---|
| Rotation | orig. | 86 | 8192 | 55.4 |
| iGPT-L | $32^{2}\times3$ | 1362 | 1536 | 60.3 |
| BigBiGAN | orig. | 86 | 8192 | 61.3 |
| iGPT-L | $48^{2}\times3$ | 1362 | 1536 | 65.2 |
| AMDIM | orig. | 626 | 8192 | 68.1 |
| MoCo | orig. | 375 | 8192 | 68.6 |
| iGPT-XL | $64^{2}\times3$ | 6801 | 3072 | 68.7 |
| SimCLR | orig. | 24 | 2048 | 69.3 |
| CPCv2 | orig. | 303 | 8192 | 71.5 |
| iGPT-XL | $64^{2}\times3$ | 6801 | 15360 | 72.0 |
| SimCLR | orig. | 375 | 8192 | 76.5 |
As with CIFAR-10, scale is critical to our approach: iGPT-M achieves $54.5\%$ accuracy and iGPT-S achieves $41.9\%$ accuracy.
The first obvious optimization is to increase MR while staying within accelerator memory limits. With an MR of $48^{2}$, iGPT-L achieves a best-layer accuracy of $65.2\%$ using 1536 features, and with an MR of $64^{2}$, iGPT-XL achieves a best-layer accuracy of $68.7\%$ using 3072 features.
Since contrastive methods report their best results on 8192 features, we would ideally evaluate iGPT with an embedding dimension 8192 for comparison. Training such a model is prohibitively expensive, so we instead concatenate features from multiple layers as an approximation. However, our features tend to be correlated across layers, so we need more of them to be competitive. If we concatenate features from 5 layers centered at the best single layer of iGPT-XL, we achieve an accuracy of $72.0\%$ using 15360 features, which is competitive with recent contrastive learning approaches (Table 2). Note that we require more parameters and compute to achieve this accuracy, but we work at low resolution and without utilizing knowledge of the 2D input structure.
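A sketch of this layer-concatenation evaluation follows, reusing the assumed `hidden_states` accessor from the linear probe sketch; layer indexing and names are illustrative.

```python
import torch

@torch.no_grad()
def concat_layer_features(model, pixels, best_layer, num_layers=5):
    """Concatenate average-pooled features from num_layers layers centered at
    best_layer (e.g. 5 x 3072 = 15360 features for iGPT-XL)."""
    offsets = range(-(num_layers // 2), num_layers // 2 + 1)
    feats = [model.hidden_states(pixels, best_layer + o).mean(dim=1) for o in offsets]
    return torch.cat(feats, dim=-1)          # (batch, num_layers * d)
```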