Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Keyu Tian^1,2^, Yi Jiang^2,†^, Zehuan Yuan^2,∗^, Bingyue Peng^2^, Liwei Wang^1,∗^
^1^ Peking University ^2^ Bytedance Inc
keyutian@stu.pku.edu.cn, jiangyi.enjoy@bytedance.com,
yuanzehuan@bytedance.com, bingyue.peng@bytedance.com, wanglw@pku.edu.cn
Try and explore our online demo at: https://var.vision
Codes and models: https://github.com/FoundationVision/VAR
Corresponding authors: wanglw@pku.edu.cn, yuanzehuan@bytedance.com; †: project lead

Abstract

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”, diverging from the standard raster-scan “next-token prediction”. This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions quickly and generalize well: VAR, for the first time, makes GPT-style AR models surpass diffusion transformers in image generation. On the ImageNet 256×256 benchmark, VAR significantly improves on its AR baseline, lowering the Fréchet inception distance (FID) from 18.65 to 1.73 and raising the inception score (IS) from 80.4 to 350.2, with 20× faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near −0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated two important properties of LLMs: scaling laws and zero-shot generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.


Figure 1: Samples generated by visual autoregressive (VAR) transformers trained on ImageNet. We show 512×512 samples (top), 256×256 samples (middle), and zero-shot image editing results (bottom).


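Figure 2: Standard autoregressive modeling (AR) vs. our proposed visual autoregressive modeling (VAR). (a) AR applied to language: sequential text token generation from left to right, word by word; (b) AR applied to images: sequential visual token generation in raster-scan order, from left to right and top to bottom; (c) VAR for images: multi-scale token maps are generated autoregressively from coarse to fine scales (lower to higher resolutions), with tokens within each scale generated in parallel. VAR requires a multi-scale VQVAE to work.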

1 Introduction

The advent of the GPT series [65, 66, 15, 62, 1] and other autoregressive (AR) large language models (LLMs) [22, 4, 38, 82, 83, 90, 78, 5, 79] has heralded a new epoch in the field of artificial intelligence. These models exhibit promising intelligence in generality and versatility and, despite issues like hallucinations [39], are still considered a solid step toward artificial general intelligence (AGI). At the core of these models is a self-supervised learning strategy: predicting the next token in a sequence, a simple yet profound approach. Studies into the success of these large AR models have highlighted their scalability and generalizability: the former, as exemplified by scaling laws [43, 35], allows us to predict a large model’s performance from smaller ones and thus guides better resource allocation, while the latter, as evidenced by zero-shot and few-shot learning [66, 15], underscores the adaptability of unsupervised-trained models to diverse, unseen tasks. These properties reveal AR models’ potential in learning from vast unlabeled data, encapsulating the essence of “AGI”.


In parallel, the field of computer vision has been striving to develop large autoregressive or world models [58, 57, 6], aiming to emulate their impressive scalability and generalizability. Trailblazing efforts like VQGAN and DALL-E [30, 67] along with their successors [68, 92, 50, 99] have showcased the potential of AR models in image generation. These models utilize a visual tokenizer to discretize continuous images into grids of 2D tokens, which are then flattened to a 1D sequence for AR learning (Fig. 2 b), mirroring the process of sequential language modeling (Fig. 2 a). However, the scaling laws of these models remain underexplored, and more frustratingly, their performance significantly lags behind diffusion models [63, 3, 51], as shown in Fig. 3. In contrast to the remarkable achievements of LLMs, the power of autoregressive models in computer vision appears to be somewhat locked.


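Figure 3: Scaling behavior of different model families on the ImageNet 256×256 generation benchmark. The FID of the validation set serves as a reference lower bound (1.78). VAR with 2B parameters reaches an FID of 1.73, surpassing L-DiT with 3B or 7B parameters.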

Autoregressive modeling requires defining the order of data. Our work reconsiders how to “order” an image: humans typically perceive or create images in a hierarchical manner, first capturing the global structure and then the local details. This multi-scale, coarse-to-fine nature suggests an “order” for images. Also inspired by the widespread multi-scale designs [54, 52, 81, 44], we define autoregressive learning for images as “next-scale prediction” in Fig. 2 (c), diverging from the conventional “next-token prediction” in Fig. 2 (b). Our approach begins by encoding an image into multi-scale token maps. The autoregressive process then starts from the 1×1 token map and progressively expands in resolution: at each step, the transformer predicts the next higher-resolution token map conditioned on all previous ones. We refer to this methodology as Visual AutoRegressive (VAR) modeling.


VAR directly leverages GPT-2-like transformer architecture [66] for visual autoregressive learning. On the ImageNet 256×256 benchmark, VAR significantly improves its AR baseline, achieving a Fréchet inception distance (FID) of 1.73 and an inception score (IS) of 350.2, with inference speed 20× faster (see Sec. 7 for details). Notably, VAR surpasses the Diffusion Transformer (DiT) – the foundation of leading diffusion systems like Stable Diffusion 3.0 and SORA [29, 14] – in FID/IS, data efficiency, inference speed, and scalability. VAR models also exhibit scaling laws akin to those witnessed in LLMs. Lastly, we showcase VAR’s zero-shot generalization capabilities in tasks like image in-painting, out-painting, and editing. In summary, our contributions to the community include:


A new visual generative framework using a multi-scale autoregressive paradigm with next-scale prediction, offering new insights in autoregressive algorithm design for computer vision.
An empirical validation of VAR models’ Scaling Laws and zero-shot generalization potential, which initially emulates the appealing properties of large language models (LLMs).
A breakthrough in visual autoregressive model performance, making GPT-style autoregressive methods surpass strong diffusion models in image synthesis for the first time^1^. [^1^ A related work [95], “language model beats diffusion”, is a BERT-style masked-prediction model.]
A comprehensive open-source code suite, including both VQ tokenizer and autoregressive model training pipelines, to help propel the advancement of visual autoregressive learning.


2 Related Work

2.1 Properties of large autoregressive language models

Scaling laws are found and studied in autoregressive language models [43, 35], which describe a power-law relationship between the scale of model (or dataset, computation, etc.) and the cross-entropy loss value on the test set. Scaling laws allow us to directly predict the performance of a larger model from smaller ones [1], thus guiding better resource allocation. More pleasingly, they show that the performance of LLMs can scale well with the growth of model, data, and computation and never saturate, which is considered a key factor in the success of [15, 82, 83, 98, 90, 38]. The success brought by scaling laws has inspired the vision community to explore more similar methods for multimodality understanding and generation [53, 2, 88, 27, 96, 77, 21, 23, 41, 31, 32, 80, 87].


Zero-shot generalization. Zero-shot generalization [72] refers to the ability of a model, particularly a large language model, to perform tasks that it has not been explicitly trained on. Within computer vision, there is burgeoning interest in the zero-shot and in-context learning abilities of foundation models such as CLIP [64], SAM [48], and Dinov2 [61]. Innovations like Painter [89] and LVM [6] extend visual prompters [40, 11] to achieve in-context learning in vision.


2.2 Visual generation

Raster-scan autoregressive models for visual generation necessitate the encoding of 2D images into 1D token sequences. Early endeavors [20, 84] have shown the ability to generate RGB (or grouped) pixels in the standard row-by-row, raster-scan manner. [69] extends [84] by using multiple independent trainable networks to do super-resolution repeatedly. VQGAN [30] advances [20, 84] by doing autoregressive learning in the latent space of VQVAE [85]. It employs GPT-2 decoder-only transformer to generate tokens in the raster-scan order, like how ViT [28] serializes 2D images into 1D patches. VQVAE-2 [68] and RQ-Transformer [50] also follow this raster-scan manner but use extra scales or stacked codes. Parti [93], based on the architecture of ViT-VQGAN [92], scales the transformer to 20B parameters and works well in text-to-image synthesis.


Masked-prediction model. MaskGIT [17] employs a VQ autoencoder and a masked prediction transformer similar to BERT [25, 10, 34] to generate VQ tokens through a greedy algorithm. MagViT [94] adapts this approach to videos, and MagViT-2 [95] enhances [17, 94] by introducing an improved VQVAE for both images and videos. MUSE [16] further scales MaskGIT to 3B parameters.


Diffusion models’ progress has centered on improved learning or sampling [76, 75, 55, 56, 7], guidance [37, 60], latent learning [70], and architectures [36, 63, 71, 91]. DiT and U-ViT [63, 8] replace or integrate the U-Net with transformers, and have inspired recent image [19, 18] and video synthesis systems [12, 33], including Stable Diffusion 3.0 [29], SORA [14], and Vidu [9].


3 Method

3.1 Preliminary: autoregressive modeling via next-token prediction

Formulation. Consider a sequence of discrete tokens $x = (x_1, x_2, \dots, x_T)$, where $x_t \in [V]$ is an integer from a vocabulary of size $V$. Next-token autoregressive modeling posits that the probability of observing the current token $x_t$ depends only on its prefix $(x_1, x_2, \dots, x_{t-1})$. This unidirectional token dependency assumption allows the likelihood of the sequence $x$ to be factorized as:


	$p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, x_2, \dots, x_{t-1}).$		(1)

Training an autoregressive model $p_\theta$ involves optimizing $p_\theta(x_t \mid x_1, x_2, \dots, x_{t-1})$ over a dataset. This is known as “next-token prediction”, and the trained $p_\theta$ can generate new sequences.
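The factorization in (1) can be illustrated with a short sketch. The conditional model below (`toy_cond_prob`) is purely hypothetical and stands in for a trained network; only the chain-rule summation reflects the formulation above.

```python
import math


def sequence_log_likelihood(tokens, cond_prob):
    """Sum log p(x_t | x_1..x_{t-1}) over the sequence, following Eq. (1)."""
    total = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])  # unidirectional: only earlier tokens are visible
        total += math.log(cond_prob(token, prefix))
    return total


def toy_cond_prob(token, prefix, vocab_size=4):
    """Hypothetical stand-in model: repeats the previous token with probability 0.7."""
    if not prefix:
        return 1.0 / vocab_size          # uniform over the vocabulary for the first token
    if token == prefix[-1]:
        return 0.7
    return 0.3 / (vocab_size - 1)        # remaining mass spread over the other tokens


log_p = sequence_log_likelihood([2, 2, 1], toy_cond_prob)
print(math.exp(log_p))  # product of the three conditionals: 0.25 * 0.7 * 0.1
```

Training would adjust the parameters behind `cond_prob` to raise this log-likelihood over a dataset.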


Tokenization. Images are inherently 2D continuous signals. To apply autoregressive modeling to images via next-token prediction, we must: 1) tokenize an image into several discrete tokens, and 2) define a 1D order of tokens for unidirectional modeling. For 1), a quantized autoencoder such as [30] is often used to convert the image feature map $f \in \mathbb{R}^{h \times w \times C}$ to discrete tokens $q \in [V]^{h \times w}$:


	$f = \mathcal{E}(im), \quad q = \mathcal{Q}(f),$		(2)

where $im$ denotes the raw image, $\mathcal{E}(\cdot)$ an encoder, and $\mathcal{Q}(\cdot)$ a quantizer. The quantizer typically includes a learnable codebook $Z \in \mathbb{R}^{V \times C}$ containing $V$ vectors. The quantization process $q = \mathcal{Q}(f)$ maps each feature vector $f^{(i,j)}$ to the code index $q^{(i,j)}$ of its nearest code in the Euclidean sense:


	$q^{(i,j)} = \left( \arg\min_{v \in [V]} \left\| \text{lookup}(Z, v) - f^{(i,j)} \right\|_2 \right) \in [V],$		(3)

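where $\text{lookup}(Z, v)$ means taking the $v$-th vector in codebook $Z$. To train the quantized autoencoder, $Z$ is looked up by every $q^{(i,j)}$ to get $\hat{f}$, the approximation of the original $f$. A new image $\hat{im}$ is then reconstructed using the decoder $\mathcal{D}(\cdot)$ given $\hat{f}$, and a compound loss $\mathcal{L}$ is minimized: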

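	$\hat{f} = \text{lookup}(Z, q), \quad \hat{im} = \mathcal{D}(\hat{f}),$		(4)
	$\mathcal{L} = \| im - \hat{im} \|_2 + \| f - \hat{f} \|_2 + \lambda_P \mathcal{L}_P(\hat{im}) + \lambda_G \mathcal{L}_G(\hat{im}),$		(5)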

where $\mathcal{L}_P(\cdot)$ is a perceptual loss such as LPIPS [97], $\mathcal{L}_G(\cdot)$ a discriminative loss like StyleGAN’s discriminator loss [46], and $\lambda_P$, $\lambda_G$ are loss weights. Once the autoencoder $\{\mathcal{E}, \mathcal{Q}, \mathcal{D}\}$ is fully trained, it will be used to tokenize images for subsequent training of a unidirectional autoregressive model.
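The nearest-code quantization in (3) and the lookup in (4) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the codebook and feature values are made up, nested lists stand in for tensors, and the helper names `quantize` and `lookup` only mirror the symbols in the text.

```python
def quantize(feature_map, codebook):
    """Map each feature vector to the index of its nearest code, as in Eq. (3)."""
    def sq_dist(a, b):
        # Squared Euclidean distance; it has the same argmin as the distance itself.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [min(range(len(codebook)), key=lambda v: sq_dist(codebook[v], vec))
         for vec in row]
        for row in feature_map
    ]


def lookup(codebook, indices):
    """Replace every index by its code vector, giving the approximation f_hat of Eq. (4)."""
    return [[codebook[v] for v in row] for row in indices]


# Hypothetical codebook with V = 3 codes of dimension C = 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Hypothetical feature map with h = w = 2.
f = [[[0.1, -0.2], [0.9, 1.2]],
     [[-0.8, 0.7], [0.4, 0.4]]]

q = quantize(f, codebook)      # token map of indices in [V]
f_hat = lookup(codebook, q)    # quantized approximation of f
print(q)
```

The gap between `f` and `f_hat` is what the second term of the compound loss (5) penalizes.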

The image tokens in $q \in [V]^{h \times w}$ are arranged in a 2D grid. Unlike natural language sentences with an inherent left-to-right ordering, the order of image tokens must be explicitly defined for unidirectional autoregressive learning. Previous AR methods [30, 92, 50] flatten the 2D grid of $q$ into a 1D sequence $x = (x_1, \dots, x_{h \times w})$ using some strategy such as row-major raster scan, spiral, or z-curve order. Once flattened, they can extract a set of sequences $x$ from the dataset, and then train an autoregressive model to maximize the likelihood in (1) via next-token prediction.
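As a small illustration of the row-major raster scan mentioned above, the sketch below flattens a token grid and maps a grid position to its sequence position. The function names are hypothetical and the grid values are arbitrary.

```python
def raster_flatten(q):
    """Flatten an h x w token grid into a 1D sequence in row-major order."""
    return [token for row in q for token in row]


def raster_position(i, j, w):
    """0-based sequence position of the token at row i, column j."""
    return i * w + j


q = [[0, 1, 5],
     [2, 0, 7]]          # hypothetical 2 x 3 token map
x = raster_flatten(q)    # 1D sequence used for next-token prediction

# Vertically adjacent tokens end up w positions apart in the sequence.
gap = raster_position(1, 1, w=3) - raster_position(0, 1, w=3)
print(x, gap)
```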

Discussion on the weakness of vanilla autoregressive models. The above approach of tokenizing and flattening enables next-token autoregressive learning on images, but introduces several issues:

1) Mathematical premise violation. In quantized autoencoders (VQVAEs), the encoder typically produces an image feature map $f$ with inter-dependent feature vectors $f^{(i,j)}$ for all $i, j$. So after quantization and flattening, the token sequence $(x_1, x_2, \dots, x_{h \times w})$ retains bidirectional correlations. This contradicts the unidirectional dependency assumption of autoregressive models, which dictates that each token $x_t$ should only depend on its prefix $(x_1, x_2, \dots, x_{t-1})$.
2) Inability to perform some zero-shot generalization. As with issue 1), the unidirectional nature of image autoregressive modeling restricts its generalizability in tasks requiring bidirectional reasoning. For example, it cannot predict the top part of an image given the bottom part.
3) Structural degradation. The flattening disrupts the spatial locality inherent in image feature maps. For example, the token $q^{(i,j)}$ and its 4 immediate neighbors $q^{(i \pm 1, j)}$, $q^{(i, j \pm 1)}$ are closely correlated due to their proximity. This spatial relationship is compromised in the linear sequence $x$, where unidirectional constraints diminish these correlations.
4) Inefficiency. Generating an image token sequence $x = (x_1, x_2, \dots, x_{n \times n})$ with a conventional self-attention transformer incurs $\mathcal{O}(n^2)$ autoregressive steps and $\mathcal{O}(n^6)$ computational cost.

Issues 2) and 3) are evident (see examples above). Regarding issue 1), we present empirical evidence in Appendix A. The proof of issue 4) is detailed in Appendix B. These theoretical and practical limitations call for a rethinking of autoregressive models in the context of image generation.
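One way to arrive at the stated $\mathcal{O}(n^6)$ bound is sketched below. It rests on an assumption that is not spelled out in this section: the $i$-th autoregressive step runs self-attention over the $i$ tokens generated so far at a cost proportional to $i^2$. Under that cost model, summing over the $n^2$ steps gives:

```latex
\sum_{i=1}^{n^2} i^2
  = \frac{n^2 \,(n^2 + 1)\,(2n^2 + 1)}{6}
  \sim \frac{n^6}{3}
  = \mathcal{O}(n^6).
```

The number of steps is simply the number of tokens, $n \times n = \mathcal{O}(n^2)$.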


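Figure 4: VAR involves two separate training stages. Stage 1: a multi-scale VQ autoencoder encodes an image into $K$ token maps $R = (r_1, r_2, \dots, r_K)$ and is trained with the compound loss (5). For details on “Multi-scale quantization” and “Embedding”, see Algorithms 1 and 2. Stage 2: a VAR transformer is trained via next-scale prediction (6): it takes $([s], r_1, r_2, \dots, r_{K-1})$ as input to predict $(r_1, r_2, r_3, \dots, r_K)$. An attention mask is used in training to ensure that each $r_k$ can only attend to $r_{\le k}$. A standard cross-entropy loss is used.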

3.2 Visual autoregressive modeling via next-scale prediction

Reformulation. We reconceptualize autoregressive modeling on images by shifting from a “next-token prediction” to a “next-scale prediction” strategy. Here, the autoregressive unit is an entire token map rather than a single token. We start by quantizing a feature map $f \in \mathbb{R}^{h \times w \times C}$ into $K$ multi-scale token maps $(r_1, r_2, \dots, r_K)$, each at an increasingly higher resolution $h_k \times w_k$, culminating in $r_K$, which matches the original feature map’s resolution $h \times w$. The autoregressive likelihood is formulated as:


	$p(r_1, r_2, \dots, r_K) = \prod_{k=1}^{K} p(r_k \mid r_1, r_2, \dots, r_{k-1}),$		(6)

where each autoregressive unit $r_k \in [V]^{h_k \times w_k}$ is the token map at scale $k$ containing $h_k \times w_k$ tokens, and the sequence $(r_1, r_2, \dots, r_{k-1})$ serves as the “prefix” for $r_k$. During the $k$-th autoregressive step, all distributions over the $h_k \times w_k$ tokens in $r_k$ are generated in parallel, conditioned on $r_k$’s prefix and the associated $k$-th position embedding map. This “next-scale prediction” methodology is what we define as visual autoregressive modeling (VAR), depicted on the right side of Fig. 4. Note that in the training of VAR, a block-wise causal attention mask is used to ensure that each $r_k$ can only attend to its prefix $r_{\le k}$. During inference, kv-caching can be used and no mask is needed.
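The block-wise causal mask described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `blockwise_causal_mask` and the nested-list representation are assumptions, and the two-scale schedule is chosen only to keep the mask small.

```python
def blockwise_causal_mask(scales):
    """Build a boolean mask where mask[i][j] is True if query i may attend to key j.

    `scales` lists the (h_k, w_k) resolution of each token map, coarse to fine.
    A token at scale k may attend to every token at scales <= k, including
    all tokens of its own scale, and to no token of a finer scale.
    """
    # scale_id[t] is the index of the scale that sequence position t belongs to.
    scale_id = [k for k, (h, w) in enumerate(scales) for _ in range(h * w)]
    return [[key_scale <= query_scale for key_scale in scale_id]
            for query_scale in scale_id]


mask = blockwise_causal_mask([(1, 1), (2, 2)])   # 1 + 4 = 5 positions
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Compared with the token-level causal mask of next-token prediction, the allowed region here grows in square blocks, one block per scale, so tokens of the same scale see each other.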


Discussion. VAR addresses the previously mentioned three issues as follows:

1) The mathematical premise is satisfied if we constrain each $r_k$ to depend only on its prefix, that is, the process of getting $r_k$ is solely related to $r_{\le k}$. This constraint is acceptable as it aligns with natural coarse-to-fine progressions such as human visual perception and artistic drawing (as discussed in Sec. 1). Further details are provided in the Tokenization paragraph below.
2) The spatial locality is preserved as (i) there is no flattening operation in VAR, and (ii) tokens in each $r_k$ are fully correlated. The multi-scale design additionally reinforces the spatial structure.
3) The complexity of generating an image with an $n \times n$ latent is significantly reduced to $\mathcal{O}(n^4)$; see the Appendix for the proof. This efficiency gain arises from the parallel token generation in each $r_k$.
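To make the $\mathcal{O}(n^4)$ figure concrete, the sketch below counts cost under a simplified model. Two assumptions are made here and are not taken from this section: the side lengths of the token maps double from 1 up to $n$, and each step costs the square of the number of tokens generated so far (one attention pass over the current prefix plus the new token map).

```python
def var_cost(side_lengths):
    """Return (number of steps, total cost) for next-scale generation.

    Assumed cost model: the step for scale k attends over all tokens of
    scales 1..k, costing (tokens so far) ** 2.
    """
    tokens_so_far = 0
    total = 0
    for side in side_lengths:
        tokens_so_far += side * side     # the whole token map is produced in one step
        total += tokens_so_far ** 2
    return len(side_lengths), total


n = 16
steps, total = var_cost([1, 2, 4, 8, 16])   # hypothetical doubling schedule ending at n
print(steps, total, total / n ** 4)
```

With this schedule the number of steps grows like $\log n$, and the total stays within a small constant factor of $n^4$ because the last scale dominates the sum.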


Tokenization. We develop a new multi-scale quantization autoencoder to encode an image into the $K$ multi-scale discrete token maps $R = (r_1, r_2, \dots, r_K)$ necessary for VAR learning (6). We employ the same architecture as VQGAN [30] but with a modified multi-scale quantization layer. The encoding and decoding procedures with residual design on $f$ or $\hat{f}$ are detailed in Algorithms 1 and 2. We empirically find that this residual-style design, akin to [50], can perform better than independent interpolation. Algorithm 1 shows that each $r_k$ depends only on its prefix $(r_1, r_2, \dots, r_{k-1})$. Note that a shared codebook $Z$ is utilized across all scales, ensuring that each $r_k$’s tokens belong to the same vocabulary $[V]$. To address the information loss in upscaling $z_k$ to $h_K \times w_K$, we use $K$ extra convolution layers $\{\phi_k\}_{k=1}^{K}$. No convolution is used after downsampling $f$ to $h_k \times w_k$.


Inputs: raw image $im$;
Hyperparameters: steps $K$, resolutions $(h_k, w_k)_{k=1}^{K}$;
$f = \mathcal{E}(im)$, $R = []$;
for $k = 1, \dots, K$ do
    $r_k = \mathcal{Q}(\text{interpolate}(f, h_k, w_k))$;
    $R = \text{queue\_push}(R, r_k)$;
    $z_k = \text{lookup}(Z, r_k)$;
    $z_k = \text{interpolate}(z_k, h_K, w_K)$;
    $f = f - \phi_k(z_k)$;
end for
Return: multi-scale tokens $R$;

Algorithm 1: Multi-scale VQVAE Encoding

Inputs: multi-scale token maps $R$;
Hyperparameters: steps $K$, resolutions $(h_k, w_k)_{k=1}^{K}$;
$\hat{f} = 0$;
for $k = 1, \dots, K$ do
    $r_k = \text{queue\_pop}(R)$;
    $z_k = \text{lookup}(Z, r_k)$;
    $z_k = \text{interpolate}(z_k, h_K, w_K)$;
    $\hat{f} = \hat{f} + \phi_k(z_k)$;
end for
$\hat{im} = \mathcal{D}(\hat{f})$;
Return: reconstructed image $\hat{im}$;

Algorithm 2: Multi-scale VQVAE Reconstruction
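The residual structure of Algorithms 1 and 2 can be tried out on a toy problem. The sketch below is not the authors' implementation and simplifies heavily: the feature map is a 1-D list of scalars, the encoder and decoder are omitted, `resize` (block averaging down, repetition up) stands in for interpolate, each $\phi_k$ is taken to be the identity, and the codebook values are made up.

```python
def resize(signal, length):
    """Stand-in for interpolate: block-average when shrinking, repeat when growing."""
    n = len(signal)
    if length <= n:
        step = n // length
        return [sum(signal[i * step:(i + 1) * step]) / step for i in range(length)]
    repeat = length // n
    return [value for value in signal for _ in range(repeat)]


def quantize(signal, codebook):
    """Index of the nearest codebook entry for every element."""
    return [min(range(len(codebook)), key=lambda v: abs(codebook[v] - x))
            for x in signal]


def encode(f, codebook, lengths):
    """Algorithm 1 on a 1-D toy: quantize the residual at each scale, coarse to fine."""
    residual, R = list(f), []
    for length in lengths:
        r_k = quantize(resize(residual, length), codebook)
        R.append(r_k)
        z_k = resize([codebook[v] for v in r_k], lengths[-1])
        residual = [a - b for a, b in zip(residual, z_k)]   # phi_k assumed identity
    return R


def decode(R, codebook, lengths):
    """Algorithm 2 on a 1-D toy: sum the upsampled codes of every scale."""
    f_hat = [0.0] * lengths[-1]
    for r_k in R:
        z_k = resize([codebook[v] for v in r_k], lengths[-1])
        f_hat = [a + b for a, b in zip(f_hat, z_k)]
    return f_hat


codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical shared codebook Z
lengths = [1, 2, 4]                      # hypothetical scale schedule
f = [1.0, 1.0, 0.5, -0.5]
R = encode(f, codebook, lengths)
f_hat = decode(R, codebook, lengths)
print(R, f_hat)
```

Because every scale quantizes what the coarser scales left unexplained, each `r_k` is computed from `f` and the earlier token maps only, which is the prefix dependency that the next-scale factorization relies on.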

4 Implementation details

VAR tokenizer. As aforementioned, we use the vanilla VQVAE architecture [30] and a multi-scale quantization scheme with K extra convolutions (0.03M extra parameters). We use a shared codebook for all scales with V=4096. Following the baseline [30], our tokenizer is also trained on OpenImages [49] with the compound loss (5) and a spatial downsample ratio of 16×.

VAR transformer. Our main focus is on the VAR algorithm, so we keep the model architecture simple. We adopt the architecture of standard decoder-only transformers akin to GPT-2 and VQGAN [66, 30] with adaptive normalization (AdaLN), which has widespread adoption and proven effectiveness in many visual generative models [46, 47, 45, 74, 73, 42, 63, 19]. For class-conditional synthesis, we use the class embedding as the start token [s] and also as the condition of AdaLN. We found that normalizing queries and keys to unit vectors before attention can stabilize the training. We do not use advanced techniques from large language models, such as rotary position embedding (RoPE), SwiGLU MLP, or RMS Norm [82, 83]. Our model shape follows a simple rule like [43], in which the width $w$, head count $h$, and drop rate $dr$ are linearly scaled with the depth $d$ as follows:


	$w = 64d, \quad h = d, \quad dr = 0.1 \cdot d / 24.$		(7)

Consequently, the main parameter count $N$ of a VAR transformer with depth $d$ is given by^2^:

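	$N(d) = \underbrace{d \cdot 4w^2}_{\text{self-attention}} + \underbrace{d \cdot 8w^2}_{\text{feed-forward}} + \underbrace{d \cdot 6w^2}_{\text{adaptive layernorm}} = 18\,d\,w^2 = 73728\,d^3.$		(8)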
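The shape rule (7) and the parameter count (8) are simple enough to evaluate directly. The helper below is an illustrative sketch; the function name `var_shape` and the depth used in the example are arbitrary choices, not values taken from this section.

```python
def var_shape(depth):
    """Apply the scaling rule of Eq. (7) and the parameter count of Eq. (8)."""
    width = 64 * depth                       # w = 64 d
    heads = depth                            # h = d
    drop_rate = 0.1 * depth / 24             # dr = 0.1 * d / 24
    params = 18 * depth * width ** 2         # N(d) = 18 d w^2 = 73728 d^3
    return {"width": width, "heads": heads, "drop_rate": drop_rate, "params": params}


cfg = var_shape(16)                          # illustrative depth
print(cfg["width"], cfg["heads"], round(cfg["drop_rate"], 4), cfg["params"])
```

Because the width is tied to the depth, the count grows cubically: doubling the depth multiplies the main parameter count by 8.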

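All models are trained with similar settings: a base learning rate of $10^{-4}$ per 256 batch size, an AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.95$, decay $= 0.05$, a batch size from 768 to 1024, and 200 to 350 training epochs (depending on model size). The evaluations in Sec. 5 suggest that such a simple model design is capable of scaling and generalizing well.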

5 Empirical Results

This section first compares VAR with other image generative model families in Sec. 5.1. Evaluations on the scalability and generalizability of VAR models are presented in Sec. 5.2 and Appendix 6. For implementation details and ablation study, please see Appendix 4 and Appendix 7.


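Table 1: Generative model family comparison on class-conditional ImageNet 256×256. “↓” or “↑” indicates that lower or higher values are better. Metrics include Fréchet inception distance (FID), inception score (IS), precision (Pre)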


Original paper: https://arxiv.org/abs/2404.02905

Code: https://github.com/FoundationVision/VAR