Fractal Generative Models
Tianhong Li¹, Qinyi Sun¹, Lijie Fan², Kaiming He¹

Figure 1: Fractal Generative Model. Four levels are shown in this figure, which is best viewed zoomed in. In this instantiation, we use an autoregressive model as the fractal generator. By recursively calling autoregressive models within autoregressive models, we build a fractal-like framework with self-similarity across different levels.
Abstract
Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
1. Introduction
At the core of computer science lies the concept of modularization. For example, deep neural networks are built from atomic “layers” that serve as modular units (Szegedy et al., 2015). Similarly, modern generative models, such as diffusion models (Song et al., 2020) and autoregressive models (Radford et al., 2018), are built from atomic “generative steps”, each implemented by a deep neural network. By abstracting complex functions into these atomic building blocks, modularization allows us to create more intricate systems by composing these modules.
Building on this concept, we propose abstracting a generative model itself as a module to develop more advanced generative models. Specifically, we introduce a generative model constructed by recursively invoking generative models of the same kind within itself. This recursive strategy results in a generative framework that exhibits complex architectures with self-similarity across different levels of modules, as shown in Figure 1.
Our proposal is analogous to the concept of fractals (Mandelbrot, 1983) in mathematics. Fractals are self-similar patterns constructed using a recursive rule called a generator. Similarly, our framework is built by a recursive process of calling generative models within generative models, exhibiting self-similarity across different levels. We therefore name our framework the “fractal generative model”.
Fractals or near-fractals are commonly observed patterns in biological neural networks. Multiple studies have provided evidence for fractal or scale-invariant small-world network organizations in the brain and its functional networks (Bassett et al., 2006; Sporns, 2006; Bullmore & Sporns, 2009).
These findings suggest that the brain’s development largely adopts the concept of modularization, recursively building larger neural networks from smaller ones.
Beyond biological neural networks, natural data often exhibit fractal or near-fractal patterns. Common fractal patterns range from macroscopic structures such as clouds, tree branches, and snowflakes, to microscopic ones including crystals (Cannon et al., 2000), chromatin (Mirny, 2011), and proteins (Enright & Leitner, 2005). More generally, natural images can also be analogized to fractals. For example, an image is composed of sub-images that are themselves images (although they may follow different distributions). Accordingly, an image generative model can be composed of modules that are themselves image generative models.
The proposed fractal generative model is inspired by the fractal properties observed in both biological neural networks and natural data. Analogous to natural fractal structures, the key component of our design is the generator, which defines the recursive generation rule. For example, such a generator can be an autoregressive model, as illustrated in Figure 1. In this instantiation, each autoregressive model is composed of modules that are themselves autoregressive models. Specifically, each parent autoregressive block spawns multiple child autoregressive blocks, and each child block further spawns more autoregressive blocks. The resulting architecture exhibits fractal-like, self-similar patterns across different levels.
We examine this fractal instantiation on a challenging testbed: pixel-by-pixel image generation. Existing methods that directly model the pixel sequence do not achieve satisfactory results in either likelihood estimation or generation quality (Hawthorne et al., 2022; Yu et al., 2023), as images do not embody a clear sequential order. Despite its difficulty, pixel-by-pixel generation represents a broader class of important generative problems: modeling non-sequential data with intrinsic structures, which matters for many data types beyond images, such as molecular structures, proteins, and biological neural networks.
Our proposed fractal framework demonstrates strong performance on this challenging yet important task. It can generate raw images pixel-by-pixel (Figure 2) while achieving accurate likelihood estimation and high generation quality. We hope our promising results will encourage further research on both the design and applications of fractal generative models, ultimately establishing a new paradigm in generative modeling.
2. Related Work
Fractals. A fractal is a geometric structure characterized by self-similarity across different scales, often constructed by a recursive generation rule called a generator (Mandelbrot, 1983). Fractals are widely observed in nature, with classic examples ranging from macroscopic structures such as clouds, tree branches, and snowflakes, to microscopic ones including crystals (Cannon et al., 2000), chromatin (Mirny, 2011), and proteins (Enright & Leitner, 2005).

Figure 2: Our fractal framework can generate high-quality images pixel-by-pixel. We show the generation process of a $256\times256$ image by recursively calling autoregressive models within autoregressive models. We also provide example videos in our GitHub repository to illustrate the generation process.
Beyond these more easily recognizable fractals, many kinds of natural data also exhibit near-fractal characteristics. Although they do not possess strict self-similarity, they still embody similar multi-scale representations or patterns, such as images (Freeman et al., 1991; Lowe, 1999) and biological neural networks (Bassett et al., 2006; Sporns, 2006; Bullmore & Sporns, 2009). Conceptually, our fractal generative model naturally accommodates all such non-sequential data with intrinsic structures and self-similarity across different scales; in this paper, we demonstrate its capabilities with an image-based instantiation.
Due to their recursive generation rule, fractals inherently exhibit hierarchical structures that are conceptually related to hierarchical design principles in computer vision. However, most hierarchical methods in computer vision do not incorporate the recursion or divide-and-conquer paradigm fundamental to fractal construction, nor do they display self-similarity in their design. The unique combination of hierarchical structure, self-similarity, and recursion is what distinguishes our fractal framework from the hierarchical methods discussed next.
Hierarchical Representations. Extracting hierarchical pyramid representations from visual data has long been an important topic in computer vision. Many early hand-engineered features, such as Steerable Filters, Laplacian Pyramid, and SIFT, employ scale-space analysis to construct feature pyramids (Burt & Adelson, 1987; Freeman et al., 1991; Lowe, 1999; 2004; Dalal & Triggs, 2005). In the context of neural networks, hierarchical designs remain important for capturing multi-scale information. For instance, SPPNet (He et al., 2015) and FPN (Lin et al., 2017) construct multi-scale feature hierarchies with pyramidal feature maps. Our fractal framework is also related to Swin Transformer (Liu et al., 2021), which builds hierarchical feature maps by attending to local windows at different scales. These hierarchical representations have proven effective in various image understanding tasks, including image classification, object detection, and semantic segmentation.
Hierarchical Generative Models. Hierarchical designs are also widely used in generative modeling. Many recent methods employ a two-stage paradigm, where a pre-trained tokenizer maps images into a compact latent space, followed by a generative model on those latent codes (van den Oord et al., 2017; Razavi et al., 2019; Esser et al., 2021; Ramesh et al., 2021). Another example, MegaByte (Yu et al., 2023), implements a two-scale model with a global and a local module for more efficient autoregressive modeling of long pixel sequences, though its performance remains limited.
Another line of research focuses on scale-space image generation. Cascaded diffusion models (Ramesh et al., 2022; Saharia et al., 2022; Pernias et al., 2023) train multiple diffusion models to progressively generate images from low resolution to high resolution. More recently, scale-space autoregressive methods (Tian et al., 2024; Tang et al., 2024; Han et al., 2024) generate tokens one scale at a time using an autoregressive transformer. However, generating images without a tokenizer is often prohibitively expensive for these autoregressive approaches because the large number of tokens or pixels per scale leads to a quadratic computational cost for the attention within each scale.
Modularized Neural Architecture Design. Modularization is a fundamental concept in computer science and deep learning, which atomizes previously complex functions into simple modular units. One of the earliest modular neural architectures was GoogLeNet (Szegedy et al., 2015), which introduced the “Inception module” as a new level of organization. Later research expanded on this principle, designing widely used units such as the residual block (He et al., 2016) and the Transformer block (Vaswani, 2017). Recently, in the field of generative modeling, MAR (Li et al., 2024) modularizes diffusion models as atomic building blocks to model the distribution of each continuous token, enabling the autoregressive modeling of continuous data. By providing higher levels of abstraction, modularization enables us to build more intricate and advanced neural architectures using existing methods as building blocks.
A pioneering approach that applies a modular unit recursively and integrates fractal concepts in neural architecture design is FractalNet (Larsson et al., 2016), which constructs very deep neural networks by recursively calling a simple expansion rule. While FractalNet shares our core idea of recursively invoking a modular unit to form fractal structures, it differs from our method in two key aspects. First, FractalNet uses a small block of convolutional layers as its modular unit, while we use an entire generative model, representing different levels of modularization. Second, FractalNet was mainly designed for classification tasks and thus outputs only low-dimensional logits. In contrast, our approach leverages the exponential scaling behavior of fractal patterns to generate a large set of outputs (e.g., millions of image pixels), demonstrating the potential of fractal-inspired designs for more complex tasks beyond classification.
3. Fractal Generative Models
The key idea behind fractal generative models is to recursively construct more advanced generative models from existing atomic generative modules. In this section, we first present the high-level motivation and intuition behind fractal generative models. We then use the autoregressive model as an illustrative atomic module to demonstrate how fractal generative models can be instantiated and used to model very high-dimensional data distributions.
3.1. Motivation and Intuition
Fractals are complex patterns that emerge from simple, recursive rules. In fractal geometry, these rules are often called “generators” (Mandelbrot, 1983). With different generators, fractal methods can construct many natural patterns, such as clouds, mountains, snowflakes, and tree branches, and have been linked to even more intricate systems, such as the structure of biological neural networks (Bassett et al., 2006; Sporns, 2006; Bullmore & Sporns, 2009), nonlinear dynamics (Aguirre et al., 2009), and chaotic systems (Mandelbrot et al., 2004).
Formally, a fractal generator $g_{i}$ specifies how to produce a set of new data $\{x_{i+1}\}$ for the next-level generator based on one output $x_{i}$ from the previous-level generator: $\{x_{i+1}\} = g_{i}(x_{i})$. For example, as shown in Figure 1, a generator can construct a fractal by recursively calling similar generators within each grey box.
Because each generator level can produce multiple outputs from a single input, a fractal framework can achieve an exponential increase in generated outputs while only requiring a linear number of recursive levels. This property makes it particularly suitable for modeling high-dimensional data with relatively few generator levels.
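To make the growth rate concrete, the following minimal Python sketch counts outputs under the simplifying assumption that every generator emits the same number $k$ of outputs. The function names `num_outputs` and `levels_needed` are illustrative and do not come from the released code.

```python
def num_outputs(k: int, levels: int) -> int:
    """Outputs after `levels` recursive levels when every generator emits k outputs."""
    count = 1  # a single input feeds the first-level generator
    for _ in range(levels):
        count *= k  # every existing output spawns k new ones at the next level
    return count


def levels_needed(total: int, k: int) -> int:
    """Smallest number of levels whose output count reaches `total`."""
    levels, count = 0, 1
    while count < total:
        count *= k
        levels += 1
    return levels


if __name__ == "__main__":
    # The output count grows as k ** levels, so a handful of levels suffices.
    print(num_outputs(16, 4))        # 16 ** 4 outputs from four levels
    print(levels_needed(65536, 16))  # levels required to reach 65,536 outputs
```

Under this assumption the number of outputs is $k^{n}$ for $n$ levels, so the number of levels grows only logarithmically with the number of outputs.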
Specifically, we introduce a fractal generative model that uses atomic generative modules as parametric fractal generators. In this way, a neural network “learns” the recursive rule directly from data. By combining the exponential growth of fractal outputs with neural generative modules, our fractal framework enables the modeling of high-dimensional non-sequential data. Next, we demonstrate how we instantiate this idea with an autoregressive model as the fractal generator.
3.2. Autoregressive Model as Fractal Generator
In this section, we illustrate how to use autoregressive models as the fractal generator to build a fractal generative model. Our goal is to model the joint distribution of a large set of random variables $x_{1},\cdots,x_{N}$, but directly modeling it with a single autoregressive model is computationally prohibitive. To address this, we adopt a divide-and-conquer strategy. The key modularization is to abstract an autoregressive model as a modular unit that models a probability distribution $p(x|c)$. With this modularization, we can construct a more powerful autoregressive model by building it on top of multiple next-level autoregressive models.
Assume that the sequence length in each autoregressive model is a manageable constant $k$, and let the total number of random variables be $N = k^{n}$, where $n = \log_{k}(N)$ represents the number of recursive levels in our fractal framework. The first autoregressive level of the fractal framework then partitions the joint distribution into $k$ subsets, each containing $k^{n-1}$ variables. Formally, we decompose $p(x_{1},\cdots,x_{k^{n}}) = \prod_{i=1}^{k} p(x_{(i-1)\cdot k^{n-1}+1},\cdots,x_{i\cdot k^{n-1}} \mid x_{1},\cdots,x_{(i-1)\cdot k^{n-1}})$. Each conditional distribution $p(\cdots \mid \cdots)$ with $k^{n-1}$ variables is then modeled by the autoregressive model at the second recursive level, and so on. By recursively calling such a divide-and-conquer process, our fractal framework can efficiently handle the joint distribution of $k^{n}$ variables using $n$ levels of autoregressive models, each operating on a manageable sequence length $k$.
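The index bookkeeping in this decomposition can be sketched in a few lines of Python. This is only an illustration of the partition, with no probabilities or networks: the hypothetical helper `partition` lists, for each level, the 1-indexed block of variables $x_{\text{start}},\cdots,x_{\text{end}}$ that one conditional distribution covers, assuming $N$ is an exact power of $k$.

```python
from typing import Iterator, Tuple


def partition(start: int, n_vars: int, k: int, level: int = 1) -> Iterator[Tuple[int, int, int]]:
    """Yield (level, first_index, last_index) for every block in the recursive split.

    `start` is the 1-based index of the first variable in the current block,
    and `n_vars` (assumed to be a power of k) is the number of variables in it.
    """
    if n_vars == 1:
        return  # a single variable cannot be divided further
    if n_vars % k != 0:
        raise ValueError("n_vars must be a power of k")
    size = n_vars // k  # each of the k sub-blocks holds n_vars / k variables
    for i in range(k):
        block_start = start + i * size
        yield (level, block_start, block_start + size - 1)
        # The next level models the variables inside this block recursively.
        yield from partition(block_start, size, k, level + 1)


if __name__ == "__main__":
    # N = 2 ** 3 = 8 variables, k = 2: three levels of blocks.
    for level, first, last in partition(1, 8, 2):
        print("  " * (level - 1) + f"level {level}: x_{first} .. x_{last}")
```

With $k=2$ and $N=8$, level 1 yields the blocks $x_{1},\cdots,x_{4}$ and $x_{5},\cdots,x_{8}$, matching the terms $i=1$ and $i=2$ of the product above.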
This recursive process represents a standard divide-and-conquer strategy. By recursively decomposing the joint distribution, our fractal autoregressive architecture not only significantly reduces computational costs compared to a single large autoregressive model but also captures the intrinsic hierarchical structure within the data.
Conceptually, as long as the data exhibits a structure that can be organized in a divide-and-conquer manner, it can be naturally modeled within our fractal framework. To provide a more concrete example, in the next section, we implement this approach to tackle the challenging task of pixel-by-pixel image generation.
4. An Image Generation Instantiation
We now present one concrete implementation of fractal generative models, instantiated on the challenging task of pixel-by-pixel image generation. Although we use image generation as a testbed in this paper, the same divide-and-conquer architecture could potentially be adapted to other data domains. We first discuss the challenges and importance of pixel-by-pixel image generation.
4.1. Pixel-by-pixel Image Generation
Pixel-by-pixel image generation remains a significant challenge in generative modeling because of the high dimensionality and complexity of raw image data. This task requires models that can efficiently handle a large number of pixels while effectively learning the rich structural patterns and interdependency between them. As a result, pixel-by-pixel image generation has become a challenging benchmark where most existing methods are still limited to likelihood estimation and fail to generate satisfactory images (Child et al., 2019; Hawthorne et al., 2022; Yu et al., 2023).
Though challenging, pixel-by-pixel generation represents a broader class of important high-dimensional generative problems. These problems aim to generate data element-by-element but differ from long-sequence modeling in that they typically involve non-sequential data. For example, many structures, such as molecular configurations, proteins, and biological neural networks, do not exhibit a sequential architecture yet embody very high-dimensional and structural data distributions. By selecting pixel-by-pixel image generation as an instantiation for our fractal framework, we aim not only to tackle a pivotal challenge in computer vision but also to demonstrate the potential of our fractal approach in addressing the broader problem of modeling high-dimensional non-sequential data with intrinsic structures.
4.2. Architecture
As shown in Figure 3, each autoregressive model takes the output from the previous-level generator as its input and produces multiple outputs for the next-level generator. It also takes an image (which can be a patch of the original image), splits it into patches, and embeds them to form the input sequence for a transformer model. These patches are also fed to the corresponding next-level generators. The transformer then takes the output of the previous generator as a separate token, placed before the image tokens. Based on this combined sequence, the transformer produces multiple outputs for the next-level generator.
Following common practices from vision transformers and image generative models (Dosovitskiy et al., 2020; Peebles & Xie, 2023), we set the sequence length of the first generator $g_{0}$ to 256, dividing the original images into $16\times16$ patches. The second-level generator then models each patch and further subdivides them into smaller patches, continuing this process recursively. To manage computational costs, we progressively reduce the width and the number of transformer blocks for smaller patches, as modeling smaller patches is generally easier than larger ones. At the final level, we use a very lightweight transformer to model the RGB channels of each pixel autoregressively, and apply a 256-way cross-entropy loss on the prediction. The exact configurations and computational costs for each transformer across different recursive levels and resolutions are detailed in Table 1. Notably, with our fractal design, the computational cost of modeling a $256\times256$ image is only twice that of modeling a $64\times64$ image.
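As a consistency check on this hierarchy, the short Python sketch below multiplies the per-level sequence lengths for a $256\times256$ image. The first-level length (256) and the final RGB level come from the paragraph above; the patch sizes used for levels two and three ($4\times4$ patches inside each $16\times16$ patch, then individual pixels inside each $4\times4$ patch) are taken from the description of the generation process in Section 4.5. The dictionary name `SEQ_LENS` is illustrative.

```python
from math import prod

IMAGE_SIDE = 256  # image resolution considered in the text

# Sequence length handled by one autoregressive model at each level.
SEQ_LENS = {
    "level 1: 16x16-pixel patches in the image": (IMAGE_SIDE // 16) ** 2,  # 16 * 16 = 256
    "level 2: 4x4-pixel patches in a 16x16 patch": (16 // 4) ** 2,          # 4 * 4 = 16
    "level 3: pixels in a 4x4 patch": 4 * 4,                                # 16
    "level 4: RGB channels of one pixel": 3,
}

# The product of the per-level lengths is the number of modeled variables.
total_variables = prod(SEQ_LENS.values())

if __name__ == "__main__":
    for name, length in SEQ_LENS.items():
        print(f"{name}: sequence length {length}")
    print("total variables:", total_variables)
    print("pixels * channels:", IMAGE_SIDE * IMAGE_SIDE * 3)
```

The product of the four sequence lengths equals $256\times256\times3$, so every channel of every pixel is covered exactly once, while no single model sees a sequence longer than 256.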
Figure 3: Instantiation of our fractal method on pixel-by-pixel image generation. In each fractal level, an autoregressive model receives the output from the previous generator, concatenates it with the corresponding image patches, and employs multiple transformer blocks to produce a set of outputs for the next generators.
Following Li et al. (2024), our method supports different autoregressive designs. In this work, we mainly consider two variants: a raster-order, GPT-like causal transformer (AR) and a random-order, BERT-like bidirectional transformer (MAR) (Figure 6). Both designs follow the autoregressive principle of next-token prediction, each with its own advantages and disadvantages, which we discuss in detail in Appendix B. We name the fractal framework using the AR variant FractalAR and the one using the MAR variant FractalMAR.
4.3. Relation to Scale-space Autoregressive Models
Recently, several models have been introduced that perform next-scale prediction for autoregressive image generation (Tian et al., 2024; Tang et al., 2024; Han et al., 2024). One major difference between these scale-space autoregressive models and our proposed method is that they use a single autoregressive model to predict tokens scale-by-scale. In contrast, our fractal framework adopts a divide-and-conquer strategy to recursively model raw pixels with generative submodules. Another key difference lies in the computational complexities: scale-space autoregressive models are required to perform full attention over the entire sequence of next-scale tokens when generating them, which results in substantially higher computational complexity.

Table 1: Model configurations, parameters, and computational costs at different autoregressive levels for our large-size model. The computational costs are measured by the GFLOPs per forward pass in training with batch size 1. Notably, thanks to the fractal design, the total computational cost of modeling a $256\times256$ image is only twice that of modeling a $64\times64$ image.
For example, when generating images at a resolution of $256\times256$, in the last scale, the attention matrix in each attention block of a scale-space autoregressive model has a size of $(256\times256)^{2}=4{,}294{,}967{,}296$ entries. In contrast, our method performs attention on very small patches $(4\times4)$ when modeling the interdependence of pixels, where each patch’s attention matrix has only $(4\times4)^{2}=256$ entries, resulting in a total attention matrix size of $(64\times64)\times(4\times4)^{2}=1{,}048{,}576$ entries. This reduction makes our approach roughly $4000\times$ (exactly $4096\times$) more computationally efficient at the finest resolution and thus, for the first time, enables modeling high-resolution images pixel-by-pixel.
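The arithmetic in this comparison can be reproduced directly. The Python sketch below only restates the numbers given in the text; the variable names are illustrative.

```python
# Scale-space autoregressive model: full attention over all 256 x 256 positions.
full_tokens = 256 * 256
full_attention_entries = full_tokens ** 2

# Fractal model at the pixel level: attention inside each 4 x 4 patch only.
tokens_per_patch = 4 * 4
entries_per_patch = tokens_per_patch ** 2
num_patches = 64 * 64  # a 256 x 256 image holds 64 x 64 patches of 4 x 4 pixels
fractal_attention_entries = num_patches * entries_per_patch

# Ratio between the two attention-matrix sizes.
reduction = full_attention_entries // fractal_attention_entries

if __name__ == "__main__":
    print("full attention entries:   ", full_attention_entries)
    print("fractal attention entries:", fractal_attention_entries)
    print("reduction factor:         ", reduction)
```

The exact ratio is $2^{32}/2^{20}=2^{12}=4096$, which the text rounds to $4000\times$.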
4.4. Relation to Long-Sequence Modeling
Most previous work on pixel-by-pixel generation formulates the problem as long-sequence modeling and leverages methods from language modeling to address it (Child et al., 2019; Roy et al., 2021; Ren et al., 2021; Hawthorne et al., 2022; Yu et al., 2023). However, the intrinsic structures of many data types, including but not limited to images, go beyond one-dimensional sequences. Different from these methods, we treat such data as sets (instead of sequences) composed of multiple elements and employ a divide-and-conquer strategy to recursively model smaller subsets with fewer elements. This approach is motivated by the observation that much of this data exhibits a near-fractal structure: images are composed of sub-images, molecules are composed of sub-molecules, and biological neural networks are composed of sub-networks. Accordingly, generative models designed to handle such data should be composed of submodules that are themselves generative models.
4.5. Implementation
We briefly describe the training and generation process of our fractal image generation framework. Further details and hyper-parameters can be found in Appendix A.
Training. We train the fractal model end-to-end on raw image pixels by going through the fractal architecture in a breadth-first manner. During training, each autoregressive model takes in an input from the previous autoregressive model and produces a set of outputs for the next-level autoregressive model to take as inputs. This process continues down to the final level, where the image is represented as a sequence of pixels. A final autoregressive model uses the output on each pixel to predict the RGB channels in an autoregressive manner. We compute a cross-entropy loss on the predicted logits (treating the RGB values as discrete integers from 0 to 255) and backpropagate this loss through all levels of autoregressive models, thus training the entire fractal framework end-to-end.
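For readers unfamiliar with the 256-way loss, the sketch below computes the cross-entropy for a single channel value in plain Python. It is a didactic stand-in, not the training code: the function name `cross_entropy_256` is hypothetical, and a real implementation would use a deep-learning framework's batched, differentiable loss.

```python
import math
from typing import Sequence


def cross_entropy_256(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood (in nats) of `target` under softmax(logits)."""
    if len(logits) != 256:
        raise ValueError("expected one logit per intensity value 0..255")
    if not 0 <= target <= 255:
        raise ValueError("target must be an integer intensity in 0..255")
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


if __name__ == "__main__":
    # A model with no information assigns equal logits to all 256 values.
    uniform_loss = cross_entropy_256([0.0] * 256, target=17)
    print("loss in nats:", uniform_loss)
    print("loss in bits:", uniform_loss / math.log(2))
```

A predictor that assigns equal probability to all 256 values incurs $\ln 256$ nats, or 8 bits, per channel, which is a useful baseline when reading likelihood results reported in bits per dimension.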
Generation. Our fractal model generates the image in a pixel-by-pixel manner, following a depth-first order to go through the fractal architecture, as illustrated in Figure 2. Here we use the random-order generation scheme from MAR (Li et al., 2024) as an example. The first-level autoregressive model captures the interdependence between $16\times16$ image patches, and at each step, it generates outputs for the next level based on known patches. The second-level model then leverages these outputs to model the interdependence between $4\times4$ patches within each $16\times16$ patch. Similarly, the third-level autoregressive model models the interdependence between individual pixels within each $4\times4$ patch. Finally, the last-level autoregressive model samples the actual RGB values from the autoregressively predicted RGB logits.
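The depth-first traversal can be illustrated with a toy recursion. The sketch below makes strong simplifying assumptions, all of them ours: elements are visited in raster order rather than the random order of MAR, and the learned conditional distributions are replaced by a uniform random draw. Only the order of the recursive calls reflects the procedure described above; the function name `generate_depth_first` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def generate_depth_first(
    seq_lens: Sequence[int],
    rng: random.Random,
    path: Tuple[int, ...] = (),
) -> List[Tuple[Tuple[int, ...], int]]:
    """Return (path, value) pairs in the order a depth-first traversal emits them.

    `seq_lens[d]` is the sequence length at level d. `path` records which
    element was chosen at each level above the current one.
    """
    depth = len(path)
    if depth == len(seq_lens):
        # Leaf: one channel value. A uniform draw stands in for sampling
        # from the predicted 256-way distribution.
        return [(path, rng.randrange(256))]
    samples: List[Tuple[Tuple[int, ...], int]] = []
    for index in range(seq_lens[depth]):
        # Finish everything below this element before moving to its sibling.
        samples.extend(generate_depth_first(seq_lens, rng, path + (index,)))
    return samples


if __name__ == "__main__":
    # Toy hierarchy: 2 patches, 2 pixels per patch, 3 channels per pixel.
    for path, value in generate_depth_first([2, 2, 3], random.Random(0)):
        print(path, value)
```

In the toy hierarchy, all three channels of the first pixel are emitted before the second pixel is started, and the whole first patch is completed before the second patch begins, which is the depth-first behaviour shown in Figure 2.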
Table 2: More fractal levels achieve better likelihood estimation performance with lower computational costs, measured on uncon