Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu* and Tri Dao*
Machine Learning Department, Carnegie Mellon University; Department of Computer Science, Princeton University
agu@cs.cmu.edu, tri@tridao.me
Abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference ($5\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
1 Introduction
Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs is often a sequence model, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V Le 2014). While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015). The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks (Tay, Dehghani, Bahri, et al. 2022), but often at the expense of the very properties that make it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.
Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). This class of models can be computed very efficiently as either a recurrence or convolution, with linear or near-linear scaling in sequence length. Additionally, they have principled mechanisms for modeling long-range dependencies (Gu, Dao, et al. 2020) in certain data modalities, and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023). However, they have been less effective at modeling discrete and information-dense data such as text.
We propose a new class of selective state space models that improves on prior work on several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.
Selection Mechanism. First, we identify a key limitation of prior models: they lack the ability to efficiently select data in an input-dependent manner (i.e. focus on or ignore particular inputs). Building on intuition based on important synthetic tasks such as selective copy and induction heads, we design a simple selection mechanism by parameterizing the SSM parameters based on the input. This allows the model to filter out irrelevant information and remember relevant information indefinitely.
Hardware-aware Algorithm. This simple change poses a technical challenge for the computation of the model; in fact, all prior SSMs must be time- and input-invariant in order to be computationally efficient. We overcome this with a hardware-aware algorithm that computes the model recurrently with a scan instead of convolution, but does not materialize the expanded state in order to avoid IO access between different levels of the GPU memory hierarchy. The resulting implementation is faster than previous methods both in theory (scaling linearly in sequence length, compared to pseudo-linear for all convolution-based SSMs) and on modern hardware (up to $3\times$ faster on A100 GPUs).
Architecture. We simplify prior deep sequence model architectures by combining the design of prior SSM architectures (Dao, Fu, Saab, et al. 2023) with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design (Mamba) incorporating selective state spaces.
Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties that make them suitable as the backbone of general foundation models operating on sequences. (i) High quality: selectivity brings strong performance on dense modalities such as language and genomics. (ii) Fast training and inference: computation and memory scale linearly in sequence length during training, and unrolling the model autoregressively during inference requires only constant time per step since it does not require a cache of previous elements. (iii) Long context: the quality and efficiency together yield performance improvements on real data up to sequence length 1M.
We empirically validate Mamba's potential as a general sequence FM backbone, in both pretraining quality and domain-specific task performance, on several types of modalities and settings:
· Synthetics. On important synthetic tasks such as copying and induction heads that have been proposed as being key to large language models, Mamba not only solves them easily but can extrapolate solutions to indefinitely long sequences ($>1$M tokens).
· Audio and Genomics. Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, both in pretraining quality and downstream metrics (e.g. reducing FID on a challenging speech generation dataset by more than half). In both settings, its performance improves with longer context up to million-length sequences.
· Language Modeling. Mamba is the first linear-time sequence model that truly achieves Transformer-quality performance, both in pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has $5\times$ generation throughput compared to Transformers of similar size, and Mamba-3B's quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B).
Model code and pre-trained checkpoints are open-sourced at https://github.com/state-spaces/mamba.
2 State Space Models
Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are broadly related to RNNs, CNNs, and classical state space models. They are inspired by a particular continuous system (1) that maps a 1-dimensional function or sequence $x(t)\in\mathbb{R}\mapsto y(t)\in\mathbb{R}$ through an implicit latent state $h(t)\in\mathbb{R}^{N}$.
Selective State Space Model with Hardware-aware State Expansion

Figure 1: (Overview.) Structured SSMs independently map each channel (e.g. $D=5$) of an input $x$ to output $y$ through a higher dimensional latent state $h$ (e.g. $N=4$). Prior SSMs avoid materializing this large effective state ($DN$, times batch size $B$ and sequence length $L$) through clever alternate computation paths requiring time-invariance: the $(\Delta,A,B,C)$ parameters are constant across time. Our selection mechanism adds back input-dependent dynamics, which also requires a careful hardware-aware algorithm to only materialize the expanded states in more efficient levels of the GPU memory hierarchy.
Concretely, S4 models are defined with four parameters $(\Delta,A,B,C)$, which define a sequence-to-sequence transformation in two stages.
$$
\begin{array}{lclclc}
h'(t) = A h(t) + B x(t) & \text{(1a)} & \quad h_{t} = \overline{A} h_{t-1} + \overline{B} x_{t} & \text{(2a)} & \quad \overline{K} = (C\overline{B},\ C\overline{A}\,\overline{B},\ \ldots,\ C\overline{A}^{k}\overline{B},\ \ldots) & \text{(3a)} \\
y(t) = C h(t) & \text{(1b)} & \quad y_{t} = C h_{t} & \text{(2b)} & \quad y = x * \overline{K} & \text{(3b)}
\end{array}
$$
Discretization. The first stage transforms the "continuous parameters" $(\Delta,A,B)$ to "discrete parameters" $(\overline{A},\overline{B})$ through fixed formulas $\overline{A}=f_{A}(\Delta,A)$ and $\overline{B}=f_{B}(\Delta,A,B)$, where the pair $(f_{A},f_{B})$ is called a discretization rule. Various rules can be used such as the zero-order hold (ZOH) defined in equation (4).
$$
\begin{array}{r}{\overline{A}=\exp(\Delta A)\qquad\overline{B}=(\Delta A)^{-1}(\exp(\Delta A)-I)\cdot\Delta B\qquad\text{(4)}}\end{array}
$$
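When $A$ is diagonal (the structure adopted later in this section), the zero-order hold rule above reduces to elementwise scalar formulas. The following is a minimal sketch in plain Python, not the authors' implementation; the function name `zoh_discretize` and the example values are illustrative assumptions.

```python
import math

def zoh_discretize(delta, a_diag, b):
    """Zero-order hold for a diagonal A.

    delta: step size; a_diag, b: lists of N floats (diagonal of A, entries of B).
    Returns (a_bar, b_bar), each a list of N floats.
    """
    a_bar, b_bar = [], []
    for a_n, b_n in zip(a_diag, b):
        da = delta * a_n
        a_bar.append(math.exp(da))
        # (dA)^{-1} (exp(dA) - 1) * delta * B, elementwise.
        # expm1 stays accurate for small |dA|; the limit as dA -> 0 is 1.
        factor = math.expm1(da) / da if da != 0.0 else 1.0
        b_bar.append(factor * delta * b_n)
    return a_bar, b_bar

# Hypothetical example: N = 2 state dimensions, step size 0.1.
a_bar, b_bar = zoh_discretize(delta=0.1, a_diag=[-1.0, -2.0], b=[1.0, 1.0])
```

For small $\Delta$ the result approaches $\overline{A}\approx 1+\Delta A$ and $\overline{B}\approx\Delta B$, which is one way to see why $\Delta$ acts like a step size.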
Discretization has deep connections to continuous-time systems which can endow them with additional properties such as resolution invariance (Nguyen, Goel, et al. 2022) and automatically ensuring that the model is properly normalized (Gu, Johnson, Timalsina, et al. 2023; Orvieto et al. 2023). It also has connections to gating mechanisms of RNNs (Gu, Gulcehre, et al. 2020; Tallec and Ollivier 2018) which we will revisit in Section 3.5. However, from a mechanical point of view discretization can simply be viewed as the first step of the computation graph in the forward pass of an SSM. Alternate flavors of SSMs can bypass the discretization step and parameterize $(\overline{A},\overline{B})$ directly instead (Zhang et al. 2023), which may be easier to reason about.
Computation. After the parameters have been transformed from $(\Delta,A,B,C)\mapsto({\overline{{A}}},{\overline{{B}}},C)$ , the model can be computed in two ways, either as a linear recurrence (2) or a global convolution (3).
Commonly, the model uses the convolutional mode (3) for efficient parallelizable training (where the whole input sequence is seen ahead of time), and switches into recurrent mode (2) for efficient autoregressive inference (where the inputs are seen one timestep at a time).
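The equivalence of the two modes follows from unrolling the recurrence (2) from a zero initial state, which gives $y_t=\sum_{k=0}^{t} C\overline{A}^{k}\overline{B}\,x_{t-k}$, i.e. the convolution (3). The sketch below checks this numerically for a single channel with diagonal $\overline{A}$. It is an illustration only: the function names and toy values are assumptions, and the direct $O(L^2)$ convolution stands in for the FFT-based convolution a practical implementation would use.

```python
def ssm_recurrent(a_bar, b_bar, c, x):
    """Recurrent mode (2): one state update per timestep."""
    n = len(a_bar)
    h = [0.0] * n
    ys = []
    for x_t in x:
        h = [a_bar[i] * h[i] + b_bar[i] * x_t for i in range(n)]
        ys.append(sum(c[i] * h[i] for i in range(n)))
    return ys

def ssm_kernel(a_bar, b_bar, c, length):
    """Convolution kernel (3a): K_k = C * A_bar^k * B_bar for k = 0..length-1."""
    n = len(a_bar)
    return [sum(c[i] * a_bar[i] ** k * b_bar[i] for i in range(n))
            for k in range(length)]

def ssm_convolutional(a_bar, b_bar, c, x):
    """Convolutional mode (3b): causal convolution of x with the kernel."""
    kernel = ssm_kernel(a_bar, b_bar, c, len(x))
    return [sum(kernel[j] * x[t - j] for j in range(t + 1))
            for t in range(len(x))]

# Toy LTI parameters (hypothetical values), N = 2.
a_bar, b_bar, c = [0.9, 0.5], [0.1, 0.2], [1.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0]
y_rec = ssm_recurrent(a_bar, b_bar, c, x)
y_conv = ssm_convolutional(a_bar, b_bar, c, x)
```

The kernel depends only on $(\overline{A},\overline{B},C)$, so it can be computed once and reused for every position; this is exactly what stops working once the parameters vary with the input.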
Linear Time Invariance (LTI). An important property of equations (1) to (3) is that the model's dynamics are constant through time. In other words, $(\Delta,A,B,C)$, and consequently $(\overline{A},\overline{B})$ as well, are fixed for all time-steps. This property is called linear time invariance (LTI), which is deeply connected to recurrence and convolutions. Informally, we think of LTI SSMs as being equivalent to any linear recurrence (2a) or convolution (3b), and use LTI as an umbrella term for these classes of models.
Thus far, all structured SSMs have been LTI (e.g. computed as convolutions) because of fundamental efficiency constraints, discussed in Section 3.3. However, a core insight of this work is that LTI models have fundamental limitations in modeling certain types of data, and our technical contributions involve removing the LTI constraint while overcoming the efficiency bottlenecks.
Structure and Dimensions. Finally, we note that structured SSMs are so named because computing them efficiently also requires imposing structure on the $A$ matrix. The most popular form of structure is diagonal (Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Smith, Warrington, and Linderman 2023), which we also use.
In this case, the $A\in\mathbb{R}^{N\times N},B\in\mathbb{R}^{N\times1},C\in\mathbb{R}^{1\times N}$ matrices can all be represented by $N$ numbers. To operate over an input sequence $x$ of batch size $B$ and length $L$ with $D$ channels, the SSM is applied independently to each channel. Note that in this case, the total hidden state has dimension $D N$ per input, and computing it over the sequence length requires $O(B L D N)$ time and memory; this is the root of the fundamental efficiency bottleneck addressed in Section 3.3.
General State Space Models. We note that the term state space model has a very broad meaning which simply represents the notion of any recurrent process with a latent state. It has been used to refer to many disparate concepts in different disciplines, including Markov decision processes (MDP) (reinforcement learning (Hafner et al. 2020)), dynamic causal modeling (DCM) (computational neuroscience (Friston, Harrison, and Penny 2003)), Kalman filters (controls (Kalman 1960)), hidden Markov models (HMM) and linear dynamical systems (LDS) (machine learning), and recurrent (and sometimes convolutional) models at large (deep learning).
Throughout this entire paper we use the term "SSM" to refer exclusively to the class of structured SSMs or S4 models (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Hasani et al. 2023; Ma et al. 2023; Smith, Warrington, and Linderman 2023) and use these terms interchangeably. For convenience we may also include derivatives of such models, such as those focusing on either the linear-recurrence or global-convolution viewpoints (Y. Li et al. 2023; Orvieto et al. 2023; Poli et al. 2023), and clarify nuances when necessary.
SSM Architectures. SSMs are standalone sequence transformations that can be incorporated into end-to-end neural network architectures. (We also sometimes call SSM architectures SSNNs, which are to SSM layers as CNNs are to linear convolution layers.) We discuss some of the most well-known SSM architectures, many of which will also serve as our primary baselines.
Other closely related SSMs and architectures are discussed further in an extended related work (Appendix B). We highlight in particular S5 (Smith, Warrington, and Linderman 2023), QRNN (Bradbury et al. 2016), and SRU (Lei et al. 2017), which we view as the most closely related methods to our core selective SSM.
3 Selective State Space Models
We motivate our selection mechanism using intuition from synthetic tasks (Section 3.1), then explain how to incorporate this mechanism into state space models (Section 3.2). The resulting time-varying SSMs cannot use convolutions, presenting a technical challenge of how to compute them efficiently. We overcome this with a hardware-aware algorithm that exploits the memory hierarchy on modern hardware (Section 3.3). We then describe a simple SSM architecture without attention or even MLP blocks (Section 3.4). Finally, we discuss some additional properties of selection mechanisms (Section 3.5).
3.1 Motivation: Selection as a Means of Compression
We argue that a fundamental problem of sequence modeling is compressing context into a smaller state. In fact, we can view the tradeoffs of popular sequence models from this point of view. For example, attention is both effective and inefficient because it explicitly does not compress context at all. This can be seen from the fact that autoregressive inference requires explicitly storing the entire context (i.e. the KV cache), which directly causes the slow linear-time inference and quadratic-time training of Transformers. On the other hand, recurrent models are efficient because they have a finite state, implying constant-time inference and linear-time training. However, their effectiveness is limited by how well this state has compressed the context.
To understand this principle, we focus on two running examples of synthetic tasks (Figure 2).
These tasks reveal the failure mode of LTI models. From the recurrent view, their constant dynamics (e.g. the $(\overline{A},\overline{B})$ transitions in (2)) cannot let them select the correct information from their context, or affect the hidden state passed along the sequence in an input-dependent way. From the convolutional view, it is known that global convolutions can solve the vanilla Copying task (Romero et al. 2021) because it only requires time-awareness, but that they have difficulty with the Selective Copying task because of lack of content-awareness (Figure 2). More concretely, the spacing between inputs and outputs varies and cannot be modeled by static convolution kernels.
In summary, the efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity, or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension (see Section 3.5 for more discussion).
3.2 Improving SSMs with Selection
One method of incorporating a selection mechanism into models is by letting their parameters that affect interactions along the sequence (e.g. the recurrent dynamics of an RNN or the convolution kernel of a CNN) be input-dependent.
Algorithms 1 and 2 illustrate the main selection mechanism that we use. The main difference is simply making several parameters $\Delta,B,C$ functions of the input, along with the associated changes to tensor shapes throughout. In particular, we highlight that these parameters now have a length dimension $L$, meaning that the model has changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2.) This loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
We specifically choose $s_{B}(x)=\mathsf{Linear}_{N}(x)$, $s_{C}(x)=\mathsf{Linear}_{N}(x)$, $s_{\Delta}(x)=\mathsf{Broadcast}_{D}(\mathsf{Linear}_{1}(x))$, and $\tau_{\Delta}=\mathsf{softplus}$, where $\mathsf{Linear}_{d}$ is a parameterized projection to dimension $d$. The choice of $s_{\Delta}$ and $\tau_{\Delta}$ is due to a connection to RNN gating mechanisms explained in Section 3.5.
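To make the shapes concrete, here is a naive sequential sketch of a selective SSM for one sequence, using the choices of $s_B$, $s_C$, $s_\Delta$, and $\tau_\Delta$ stated above together with zero-order hold discretization and a diagonal $A$. It is not the authors' algorithm or kernel. The function names, the weight layout, and in particular the learned per-channel offset `delta_bias` added before the softplus are assumptions made for illustration.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return math.log1p(math.exp(-abs(z))) + max(z, 0.0)

def linear(weight, x_t):
    """Bias-free projection: weight is a list of rows, each of length D."""
    return [sum(w * v for w, v in zip(row, x_t)) for row in weight]

def selective_ssm(x, a, w_b, w_c, w_delta, delta_bias):
    """x: L x D input; a: D x N diagonal entries of A, one row per channel.
    w_b, w_c: N x D weights; w_delta: length-D weights; delta_bias: length D.
    Returns y: L x D. Sequential scan, for illustration only."""
    d_model, n_state = len(a), len(a[0])
    h = [[0.0] * n_state for _ in range(d_model)]
    ys = []
    for x_t in x:
        b_t = linear(w_b, x_t)                # s_B(x) = Linear_N(x)
        c_t = linear(w_c, x_t)                # s_C(x) = Linear_N(x)
        s_delta = linear([w_delta], x_t)[0]   # Linear_1(x), broadcast over D
        y_t = []
        for d in range(d_model):
            delta = softplus(delta_bias[d] + s_delta)  # tau_Delta = softplus
            for n in range(n_state):
                da = delta * a[d][n]
                a_bar = math.exp(da)
                factor = math.expm1(da) / da if da != 0.0 else 1.0
                b_bar = factor * delta * b_t[n]
                h[d][n] = a_bar * h[d][n] + b_bar * x_t[d]
            y_t.append(sum(c_t[n] * h[d][n] for n in range(n_state)))
        ys.append(y_t)
    return ys

# Hypothetical toy sizes: L = 3, D = 2, N = 2.
x = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
a = [[-1.0, -2.0], [-0.5, -1.5]]
w_b = [[0.1, 0.2], [0.3, -0.1]]
w_c = [[0.4, 0.0], [-0.2, 0.5]]
y = selective_ssm(x, a, w_b, w_c, w_delta=[0.5, -0.5], delta_bias=[0.0, 0.0])
```

Because $\overline{A}$ and $\overline{B}$ are recomputed from the current token at every step, no fixed convolution kernel reproduces this computation. A very small $\Delta$ leaves the state almost unchanged (the token is ignored), while a large $\Delta$ lets the current input overwrite it.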

Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.
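As a rough illustration of the Selective Copying setup described in the caption, the sketch below builds one input sequence in which the tokens to memorize sit at random positions among noise tokens, and the target is those tokens in order. The vocabulary, noise symbol, sequence length, and function name are all hypothetical choices and do not reproduce the paper's exact data format.

```python
import random

def selective_copying_example(num_tokens=4, seq_len=12,
                              vocab=("A", "B", "C", "D"), noise="_", seed=0):
    """Return (inputs, targets) for one Selective Copying instance."""
    rng = random.Random(seed)
    targets = [rng.choice(vocab) for _ in range(num_tokens)]
    # Random, strictly increasing positions: the spacing varies per example.
    positions = sorted(rng.sample(range(seq_len), num_tokens))
    inputs = [noise] * seq_len
    for pos, tok in zip(positions, targets):
        inputs[pos] = tok
    return inputs, targets

inputs, targets = selective_copying_example()
```

Since the positions change from example to example, a model must decide from each token's content whether to store it; a fixed offset between input and output, which is all a static kernel can encode, is not enough.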
3.3 Efficient Implementation of Selective SSMs
Hardware-friendly primitives such as convolutions (Krizhevsky, Sutskever, and Hinton 2012) and attention (Bahdanau, Cho, and Bengio 2015; Vaswani et al. 2017) enjoy widespread application. Here we aim to make selective SSMs efficient on modern hardware (GPUs) as well. The selection mechanism is quite natural, and earlier works attempted to incorporate special cases of selection, such as letting $\Delta$ vary over time in recurrent SSMs (Gu, Dao, et al. 2020). However, as previously mentioned, a core limitation in the usage of SSMs is their computational efficiency, which was why S4 and all derivatives used LTI (non-selective) models, most commonly in the form of global convolutions.
3.3.1 Motivation of Prior Models
We first revisit this motivation and overview our approach to overcome limitations of prior methods.
· At a high level, recurrent models such as SSMs always balance a tradeoff between expressivity and speed: as discussed in Section 3.1, models with larger hidden state dimension should be more effective but slower. Thus we want to maximize hidden state dimension without paying speed and memory costs.
在高层次上,循环模型(如 SSMs)总是要在表达能力和速度之间进行权衡:正如第 3.1 节所讨论的,具有更大隐藏状态维度的模型应该更有效但更慢。因此,我们希望在不付出速度和内存成本的情况下最大化隐藏状态维度。
· Note that the recurrent mode is more flexible than the convolution mode, since the latter (3) is derived from expanding the former (2) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021). However, this would require computing and materializing the latent state $h$ with shape $(\mathsf{B},\mathsf{L},\mathsf{D},\mathsf{N})$ , which is much larger (by a factor of $N$ , the SSM state dimension) than the input $x$ and output $y$ of shape (B, L, D). Thus the more efficient convolution mode was introduced which could bypass the state computation and materializes a convolution kernel (3a) of size only (B, L, D).
需要注意的是,循环模式比卷积模式更灵活,因为后者 (3) 是从前者 (2) 展开而来的 (Gu, Goel, 和 Ré 2022; Gu, Johnson, Goel, 等 2021)。然而,这需要计算并显式化形状为 $(\mathsf{B},\mathsf{L},\mathsf{D},\mathsf{N})$ 的隐状态 $h$,其规模比输入 $x$ 和输出 $y$(形状为 (B, L, D))大得多(大约是 SSM 状态维度 $N$ 的倍数)。因此,引入了更高效的卷积模式,该模式可以绕过状态计算,并且只显式化一个大小为 (B, L, D) 的卷积核 (3a)。
· Prior LTI state space models leverage the dual recurrent-convolutional forms to increase the effective state dimension by a factor of $N$ $(\approx10-100)$ , much larger than traditional RNNs, without efficiency penalties.
先前的 LTI 状态空间模型利用双重循环-卷积形式,将有效状态维度提高了 N 倍 (≈10-100),比传统 RNN 大得多,且不会带来效率损失。
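To make the memory argument in the bullets above concrete, the sketch below compares the footprint of a materialized state $h$ of shape (B, L, D, N) with that of the input or output of shape (B, L, D). The sizes and the 2-byte element width are illustrative assumptions, not values taken from the paper.

```python
def tensor_bytes(shape, bytes_per_elem=2):
    """Bytes needed to store a dense tensor of the given shape."""
    n = 1
    for s in shape:
        n *= s
    return n * bytes_per_elem

# Hypothetical sizes, chosen only for illustration.
B, L, D, N = 8, 2048, 1024, 16

io_bytes = tensor_bytes((B, L, D))        # input x or output y
state_bytes = tensor_bytes((B, L, D, N))  # fully materialized state h
ratio = state_bytes // io_bytes           # equals the state dimension N

print(f"x or y: {io_bytes / 2**20:.0f} MiB, h: {state_bytes / 2**20:.0f} MiB, ratio: {ratio}x")
```

The ratio is exactly N whatever the other sizes are, which is why avoiding the materialization of $h$ in slow memory matters.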
3.3.2 Overview of Selective Scan: Hardware-Aware State Expansion
3.3.2 Selective Scan 概述:硬件感知状态扩展
The selection mechanism is designed to overcome the limitations of LTI models; at the same time, we therefore need to revisit the computation problem of SSMs. We address this with three classical techniques: kernel fusion, parallel scan, and recomputation. We make two main observations:
选择机制旨在克服 LTI 模型的局限性;同时,我们因此需要重新审视 SSM 的计算问题。我们采用三种经典技术来解决这一问题:kernel fusion(核融合)、parallel scan(并行扫描)和 recomputation(重计算)。我们有两个主要观察:
The main idea is to leverage properties of modern accelerators (GPUs) to materialize the state $h$ only in more efficient levels of the memory hierarchy. In particular, most operations (except matrix multiplication) are bounded by memory bandwidth (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009). This includes our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.
主要思想是利用现代加速器 (GPU) 的特性,仅在内存层次结构的更高效层级中实现状态 $h$ 。具体来说,大多数操作(除矩阵乘法外)受内存带宽限制(Dao, Fu, Ermon, et al. 2022;Ivanov et al. 2021;Williams, Waterman, and Patterson 2009)。这包括我们的扫描操作,我们使用内核融合来减少内存 IO 操作的数量,从而与标准实现相比显著提高速度。
Concretely, instead of preparing the scan input $(\overline{{A}},\overline{{B}})$ of size (B, L, D, N) in GPU HBM (high-bandwidth memory), we load the SSM parameters $(\Delta,A,B,C)$ directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size (B, L, D) back to HBM.
具体来说,我们不是在 GPU HBM (高带宽内存) 中准备大小为 (B, L, D, N) 的扫描输入 $(\overline{{A}},\overline{{B}})$,而是直接从慢速 HBM 加载 SSM 参数 $(\Delta,A,B,C)$ 到快速 SRAM,在 SRAM 中执行离散化和递归,然后将大小为 (B, L, D) 的最终输出写回 HBM。
To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm (Blelloch 1990; Martin and Cundy 2018; Smith, Warrington, and Linderman 2023).
为了避免顺序递归,我们观察到尽管它不是线性的,但仍然可以使用工作高效的并行扫描算法 (Blelloch 1990; Martin 和 Cundy 2018; Smith, Warrington 和 Linderman 2023) 进行并行化。
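The recurrence can be parallelized because composing two steps of $h_t = a_t h_{t-1} + b_t$ is associative. Below is a minimal NumPy sketch, assuming elementwise coefficients `a` and inputs `b` that stand in for $\overline{A}_t$ and $\overline{B}_t x_t$. For brevity it uses a Hillis-Steele-style scan with $O(L\log L)$ work rather than the work-efficient Blelloch scan cited above, and it is not the paper's fused GPU kernel.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference loop: h_t = a_t * h_{t-1} + b_t, starting from h_{-1} = 0."""
    h = np.zeros_like(b)
    prev = np.zeros_like(b[0])
    for t in range(len(b)):
        prev = a[t] * prev + b[t]
        h[t] = prev
    return h

def parallel_scan(a, b):
    """Inclusive scan in about log2(L) vectorized steps.

    Combining a left segment (a_l, b_l) with a right segment (a_r, b_r)
    gives (a_l * a_r, a_r * b_l + b_r), which is an associative operation.
    """
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < len(b):
        # Update b with the old a first, then update a.
        b[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a[shift:] = a[shift:] * a[:-shift]
        shift *= 2
    return b

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 0.99, size=(37, 4))  # hypothetical per-step decay
b = rng.normal(size=(37, 4))              # hypothetical per-step input
h_seq = sequential_scan(a, b)
h_par = parallel_scan(a, b)
print(np.allclose(h_seq, h_par))
```

Each pass of the `while` loop is a single vectorized update over the whole sequence, so the sequential dependency is reduced from L steps to a logarithmic number of passes.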
Finally, we must also avoid saving the intermediate states, which are necessary for backpropagation. We carefully apply the classic technique of recomputation to reduce the memory requirements: the intermediate states are not stored but recomputed in the backward pass when the inputs are loaded from HBM to SRAM. As a result, the fused selective scan layer has the same memory requirements as an optimized Transformer implementation with FlashAttention.
最后,我们必须避免保存反向传播所需的中间状态。我们仔细应用经典的重计算技术来减少内存需求:中间状态不存储,而是在反向传递时从 HBM 加载输入到 SRAM 时重新计算。因此,融合的选择性扫描层具有与具有 Flash Attention 的优化 Transformer 实现相同的内存需求。
Details of the fused kernel and recomputation are in Appendix D. The full Selective SSM layer and algorithm are illustrated in Figure 1.
融合内核和重计算的详细信息见附录 D。完整的 Selective SSM 层和算法如图 1 所示。
3.4 A Simplified SSM Architecture
3.4 简化的 SSM 架构
As with structured SSMs, selective SSMs are standalone sequence transformations that can be flexibly incorporated into neural networks. The H3 architecture is the basis for the most well-known SSM architectures (Section 2), which are generally comprised of a block inspired by linear attention interleaved with an MLP (multi-layer perceptron) block. We simplify this architecture by combining these two components into one, which is stacked homogeneously (Figure 3). This is inspired by the gated attention unit (GAU) (Hua et al. 2022), which did something similar for attention.
与结构化的 SsMs 一样,选择性 SSMs 是独立的序列转换,可以灵活地集成到神经网络中。H3 架构是目前最著名的 SSM 架构的基础(第 2 节),通常由受线性注意力启发的模块与 MLP (多层感知机) 模块交替组成。我们通过将这两个组件合并为一个,并以同质的方式堆叠(图 3),简化了这一架构。这受到门控注意力单元 (GAU) (Hua et al. 2022) 的启发,它对注意力机制进行了类似的改进。
This architecture involves expanding the model dimension $D$ by a controllable expansion factor $E$. For each block, most of the parameters ($3ED^{2}$) are in the linear projections ($2ED^{2}$ for input projections, $ED^{2}$ for output projection), while the inner SSM contributes less. The number of SSM parameters (projections for $\Delta,B,C$, and the matrix $\pmb{A}$) is much smaller in comparison. We repeat this block, interleaved with standard normalization and residual connections, to form the Mamba architecture. We always fix $E=2$ in our experiments and use two stacks of the block to match the $12D^{2}$ parameters of a Transformer's interleaved MHA (multi-head attention) and MLP blocks. We use the SiLU / Swish activation function (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017), motivated so that the Gated MLP becomes the popular "SwiGLU" variant (Chowdhery et al. 2023; Dauphin et al. 2017; Shazeer 2020; Touvron et al. 2023). Finally, we additionally use an optional normalization layer (we choose LayerNorm (J. L. Ba, Kiros, and Hinton 2016)), motivated by RetNet's usage of a normalization layer in a similar location (Y. Sun et al. 2023).
该架构涉及通过可控扩展因子 $E$ 扩展模型维度 $D$ 。对于每个块,大多数参数 $(3E D^{2})$ 位于线性投影中(输入投影为 $2E D^{2}$ ,输出投影为 $E D^{2}$ ),而内部 SSM 的贡献较少。与之相比,SSM 参数($\Delta,B,C$ 的投影和矩阵 $\pmb{A}$ )的数量要小得多。我们重复这个块,并与标准归一化和残差连接交替,以形成 Mamba 架构。在我们的实验中,我们始终将 $E$ 固定为 2,并使用两个块堆栈来匹配 Transformer 的交错 MHA(多头注意力)和 MLP 块的 $12D^{2}$ 参数。我们使用 SiLU / Swish 激活函数 (Hendrycks 和 Gimpel 2016; Ramachandra n, Zoph, 和 Quoc V Le 2017),其动机是使门控 MLP 成为流行的“SwiGLU”变体 (Chowdhery 等 2023; Dauphin 等 2017; Shazeer 2020; Touvron 等 2023)。最后,我们还额外使用了一个可选的归一化层(我们选择了 LayerNorm (J. L. Ba, Kiros, 和 Hinton 2016),其动机是 RetNet 在类似位置使用了归一化层 (Y. Sun 等 2023))。
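The parameter accounting above can be checked with a few lines of arithmetic. The sketch counts only the linear projections, as the text does. The Transformer breakdown ($4D^{2}$ for attention projections plus $8D^{2}$ for an MLP with 4x expansion, no biases) is a common convention assumed here, not something stated in this section.

```python
def mamba_block_projection_params(D, E=2):
    """Projection parameters of one block: 2*E*D^2 (input) + E*D^2 (output)."""
    in_proj = 2 * E * D * D
    out_proj = E * D * D
    return in_proj + out_proj

def transformer_layer_params(D):
    """Assumed breakdown: 4*D^2 for Q, K, V, O projections, 8*D^2 for a D -> 4D -> D MLP."""
    return 4 * D * D + 8 * D * D

D = 768  # hypothetical model dimension
two_blocks = 2 * mamba_block_projection_params(D, E=2)
one_layer = transformer_layer_params(D)
print(two_blocks, one_layer, two_blocks == one_layer)
```

With $E=2$ each block has $6D^{2}$ projection parameters, so two stacked blocks give the $12D^{2}$ quoted above.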
3.5 Properties of Selection Mechanisms
3.5 选择机制的属性
The selection mechanism is a broader concept that can be applied in different ways, such as to more traditional RNNs or CNNs, to different parameters (e.g. $\pmb{A}$ in Algorithm 2), or using different transformations $s(x)$.
选择机制是一个更广泛的概念,可以以不同方式应用,例如应用于更传统的 RNN 或 CNN,应用于不同的参数(例如算法 2 中的 $\pmb{A}$),或使用不同的变换 $s(x)$。

Figure 3: (Architecture.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogeneously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For $\sigma$ we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017).
图 3: (架构。) 我们的简化块设计结合了大多数 SSM 架构基础的 H3 块和现代神经网络中常见的 MLP 块。我们没有交错排列这两个块,而是简单地重复 Mamba 块以保持同质性。与 H3 块相比,Mamba 用激活函数替换了第一个乘法门。与 MLP 块相比,Mamba 在主分支中添加了一个 SSM。对于 $\sigma$,我们使用 SiLU / Swish 激活函数 (Hendrycks 和 Gimpel 2016; Ramachandran, Zoph, 和 Quoc V Le 2017)。
3.5.1 Connection to Gating Mechanisms
3.5.1 与门控机制的联系
We highlight the most important connection: the classical gating mechanism of RNNs is an instance of our selection mechanism for SSMs. We note that the connection between RNN gating and the discretization of continuous-time systems is well established (Funahashi and Nakamura 1993; Tallec and Ollivier 2018). In fact, Theorem 1 is an improvement of Gu, Johnson, Goel, et al. (2021, Lemma 3.1) generalizing to the ZOH discretization and input-dependent gates (proof in Appendix C). More broadly, $\Delta$ in SSMs can be seen to play a generalized role of the RNN gating mechanism. In line with prior work, we adopt the view that discretization of SSMs is the principled foundation of heuristic gating mechanisms.
我们强调最重要的联系:经典的 RNN 门控机制是 SSMs 选择机制的一个实例。我们注意到,RNN 门控与连续时间系统离散化之间的联系已经得到充分确立 (Funahashi 和 Nakamura 1993; Tallec 和 Ollivier 2018)。实际上,定理 1 是对 Gu, Johnson, Goel 等 (2021, 引理 3.1) 的改进,推广到零阶保持 (ZOH) 离散化和输入依赖的门控 (证明见附录 C)。更广泛地说,$\Delta$ 在 SSMs 中可以被视为扮演了 RNN 门控机制的广义角色。与先前的工作一致,我们认为 SSMs 的离散化是启发式门控机制的理论基础。
Theorem 1. When $N=1$, $A=-1$, $B=1$, $s_{\Delta}=\mathsf{Linear}(x)$, and $\tau_{\Delta}=\mathsf{softplus}$, then the selective SSM recurrence (Algorithm 2) takes the form
定理 1. 当 $N=1$,$A=-1$,$B=1$,$s_{\Delta}=\mathsf{Linear}(x)$ 且 $\tau_{\Delta}=\mathsf{softplus}$ 时,选择性 SSM 递归 (算法 2) 的形式为
$$
\begin{array}{l}{g_{t}=\sigma(\mathsf{Linear}(x_{t}))}\\ {h_{t}=(1-g_{t})h_{t-1}+g_{t}x_{t}.}\end{array}
$$
As mentioned in Section 3.2, our specific choices of $s_{\Delta},\tau_{\Delta}$ come from this connection. In particular, note that if a given input $x_{t}$ should be completely ignored (as necessary in the synthetic tasks), all $D$ channels should ignore it, and so we project the input down to 1 dimension before repeating/broadcasting with $\Delta$.
如第 3.2 节所述,我们对 $s_{\Delta},\tau_{\Delta}$ 的具体选择来自于这种联系。特别地,注意如果给定的输入 $x_{t}$ 应该被完全忽略(在合成任务中是必要的),所有 $D$ 个通道都应该忽略它,因此我们在重复/广播与 $\Delta$ 之前将输入投影到 1 维。
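Theorem 1 can be checked numerically. The sketch below assumes the scalar zero-order-hold formulas $\overline{A}=\exp(\Delta A)$ and $\overline{B}=(\Delta A)^{-1}(\exp(\Delta A)-1)\cdot\Delta B$, and uses a hypothetical scalar `Linear(x) = w * x + bias`. With $A=-1$, $B=1$, and $\Delta=\mathrm{softplus}(z)$, it confirms that $\overline{A}=1-\sigma(z)$ and $\overline{B}=\sigma(z)$.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zoh_discretize(delta, A, B):
    """Scalar zero-order hold: returns (A_bar, B_bar)."""
    dA = delta * A
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w, bias = 0.7, -0.2          # hypothetical Linear(x) = w * x + bias
z = w * x + bias

A_bar, B_bar = zoh_discretize(softplus(z), A=-1.0, B=1.0)
g = sigmoid(z)

# Run both recurrences side by side from a zero initial state.
h_ssm, h_gate = 0.0, 0.0
for t in range(len(x)):
    h_ssm = A_bar[t] * h_ssm + B_bar[t] * x[t]
    h_gate = (1.0 - g[t]) * h_gate + g[t] * x[t]

print(np.allclose(A_bar, 1.0 - g), np.allclose(B_bar, g), np.isclose(h_ssm, h_gate))
```

The identity follows from $\exp(-\mathrm{softplus}(z)) = 1/(1+e^{z}) = 1-\sigma(z)$.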
3.5.2 Interpretation of Selection Mechanisms
3.5.2 选择机制的解释
We elaborate on three particular mechanistic effects of selection.
我们详细讨论了选择的三个特定机制效应。
Variable Spacing. Selectivity allows filtering out irrelevant noise tokens that may occur between inputs of interest. This is exemplified by the Selective Copying task, but occurs ubiquitously in common data modalities, particularly for discrete data, for example the presence of language fillers such as "um". This property arises because the model can mechanistically filter out any particular input $x_{t}$, for example in the gated RNN case (Theorem 1) when $g_{t}\to0$.
可变间距。选择性允许过滤掉可能出现在感兴趣输入之间的无关噪声 Token。这在选择性复制任务中得到了体现,但在常见的数据模式中普遍存在,特别是在离散数据中——例如语言填充词如“um”的存在。这种特性之所以出现,是因为模型可以机械地过滤掉任何特定的输入 $x_{t}$ ,例如在门控 RNN 情况下 (Theorem 1) 当 $g_{t}\to0$ 时。
Filtering Context. It has been empirically observed that many sequence models do not improve with longer context (F. Shi et al. 2023), despite the principle that more context should lead to strictly better performance. An explanation is that many sequence models cannot effectively ignore irrelevant context when necessary; an intuitive example is global convolutions (and general LTI models). On the other hand, selective models can simply reset their state at any time to remove extraneous history, and thus their performance in principle improves monotonically with context length (e.g. Section 4.3.2).
过滤上下文。实证观察表明,许多序列模型并不会随着上下文长度的增加而性能提升 (F. Shi 等 2023),尽管从原理上讲,更多的上下文应该带来更好的性能。一种解释是,许多序列模型在必要时无法有效忽略无关的上下文;一个直观的例子是全局卷积(和一般的 LTI 模型)。另一方面,选择性模型可以随时重置其状态以移除多余的歷史记录,因此原则上它们的性能会随着上下文长度的增加而单调提升(例如第 4.3.2 节)。
Boundary Resetting. In settings where multiple independent sequences are stitched together, Transformers can keep them separate by instantiating a particular attention mask, while LTI models will bleed information between the sequences. Selective SSMs can also reset their state at boundaries (e.g. $\Delta_{t}\to\infty$, or Theorem 1 when $g_{t}\to1$). These settings may occur artificially (e.g. packing documents together to improve hardware utilization) or naturally (e.g. episode boundaries in reinforcement learning (Lu et al. 2023)).
边界重置。在多个独立序列被拼接在一起的情况下,Transformer 可以通过实例化特定的注意力掩码来保持它们的分离,而 LTI 模型会在序列之间泄漏信息。选择性 SSM 也可以在边界处重置其状态(例如 $\Delta_{t}\to\infty$ 或当 $g_{t}\to1$ 时的定理 1)。这些设置可能是人为的(例如,将文档打包在一起以提高硬件利用率)或自然的(例如,强化学习中的情节边界 (Lu et al. 2023))。
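Boundary resetting can be illustrated with the gated recurrence of Theorem 1. In this sketch, which uses hypothetical toy sequences and hand-picked gate values, setting $g_t=1$ on the first token of the second sequence makes the packed computation match the computation on that sequence alone; with any smaller gate, history from the first sequence leaks through.

```python
import numpy as np

def gated_scan(x, g, h0=0.0):
    """h_t = (1 - g_t) * h_{t-1} + g_t * x_t, returned for every step."""
    h, out = h0, []
    for xt, gt in zip(x, g):
        h = (1.0 - gt) * h + gt * xt
        out.append(h)
    return np.array(out)

seq_a = np.array([1.0, 2.0, 3.0])
seq_b = np.array([-4.0, 0.5, 2.0])
g_a = np.array([0.5, 0.5, 0.5])
g_b_reset = np.array([1.0, 0.3, 0.6])  # g = 1 at the boundary resets the state
g_b_leaky = np.array([0.5, 0.3, 0.6])  # g < 1 at the boundary keeps old history

x_packed = np.concatenate([seq_a, seq_b])
packed_reset = gated_scan(x_packed, np.concatenate([g_a, g_b_reset]))
packed_leaky = gated_scan(x_packed, np.concatenate([g_a, g_b_leaky]))

alone_reset = gated_scan(seq_b, g_b_reset)
alone_leaky = gated_scan(seq_b, g_b_leaky)

print(np.allclose(packed_reset[3:], alone_reset))  # boundary is respected
print(np.allclose(packed_leaky[3:], alone_leaky))  # information bleeds across
```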
Additionally, we elaborate on effects of each selective parameter.
此外,我们详细说明了每个选择性参数的效果。
Interpretation of $\Delta$. In general, $\Delta$ controls the balance between how much to focus on or ignore the current input $x_{t}$. It generalizes RNN gates (e.g. $g_{t}$ in Theorem 1): mechanically, a large $\Delta$ resets the state $h$ and focuses on the current input $x$, while a small $\Delta$ persists the state and ignores the current input. SSMs (1)-(2) can be interpreted as a continuous system discretized by a timestep $\Delta$, and in this context the intuition is that large $\Delta\to\infty$ represents the system focusing on the current input for longer (thus "selecting" it and forgetting its current state) while a small $\Delta\to0$ represents a transient input that is ignored.
$\Delta$ 的解释。一般来说,$\Delta$ 控制着关注或忽略当前输入 $x_{t}$ 的平衡。它泛化了 RNN 门(例如定理 1 中的 $g_{t}$):从机械角度看,较大的 $\Delta$ 会重置状态 $h$ 并关注当前输入 $x$,而较小的 $\Delta$ 则保持状态并忽略当前输入。状态空间模型 (1)-(2) 可以被解释为由时间步长 $\Delta$ 离散化的连续系统,在这种情况下,直觉上较大的 $\Delta \to \infty$ 表示系统更长时间地关注当前输入(因此“选择”它并忘记其当前状态),而较小的 $\Delta \to 0$ 表示一个被忽略的瞬态输入。
Interpretation of $\pmb{A}$. We remark that while the $\pmb{A}$ parameter could also be selective, it ultimately affects the model only through its interaction with $\Delta$ via $\overline{{A}}=\exp(\Delta A)$ (the discretization (4)). Thus selectivity in $\Delta$ is enough to ensure selectivity in $(\overline{{A}},\overline{{B}})$, and is the main source of improvement. We hypothesize that making $\pmb{A}$ selective in addition to (or instead of) $\Delta$ would have similar performance, and leave it out for simplicity.
对 A 的解释。我们注意到,虽然参数 $\pmb{A}$ 也可以是选择性的,但它最终只通过与 $\Delta$ 的交互影响模型,即通过 $\overline{{A}}=\exp(\Delta A)$ (离散化 (4))。因此,$\Delta$ 中的选择性足以确保 $(\overline{{A}},\overline{{B}})$ 中的选择性,并且是主要的改进来源。我们假设使 $\pmb{A}$ 除了(或代替)$\Delta$ 具有选择性将具有类似的性能,并为了简化将其排除。
Interpretation of $B$ and $C$. As discussed in Section 3.1, the most important property of selectivity is filtering out irrelevant information so that a sequence model's context can be compressed into an efficient state. In an SSM, modifying $B$ and $C$ to be selective allows finer-grained control over whether to let an input $x_{t}$ into the state $h_{t}$, or the state into the output $y_{t}$. These can be interpreted as allowing the model to modulate the recurrent dynamics based on content (input) and context (hidden states) respectively.
对 $B$ 和 $C$ 的解释。如第 3.1 节所述,选择性最重要的属性是过滤掉无关信息,从而使序列模型的上下文可以被压缩成一个高效的状态。在 SSM 中,将 $B$ 和 $C$ 修改为具有选择性可以更精细地控制是否允许输入 $x_{t}$ 进入状态 $h_{t}$ ,或者将状态输出到 $y_{t}$ 。这些可以被解释为允许模型根据内容(输入)和上下文(隐藏状态)分别调节递归动态。
3.6 Additional Model Details
3.6 附加模型细节
Real vs. Complex. Most prior SSMs use complex numbers in their state $h$, which is necessary for strong performance on many tasks in perceptual modalities (Gu, Goel, and Ré 2022). However, it has been empirically observed that completely real-valued SSMs seem to work fine, and possibly even better, in some settings (Ma et al. 2023). We use real values as the default, which work well for all but one of our tasks; we hypothesize that the complex-real tradeoff is related to the continuous-discrete spectrum in data modalities, where complex numbers are helpful for continuous modalities (e.g. audio, video) but not discrete ones (e.g. text, DNA).
实数 vs. 复数。大多数先前的状态空间模型 (SSM) 在其状态 $h$ 中使用复数,这对于许多感知模态任务的高性能是必要的(Gu, Goel, 和 Ré 2022)。然而,已经通过实验观察到,完全实数值的 SSM 在某些情况下似乎工作得同样好,甚至更好(Ma 等人 2023)。我们默认使用实数值,这对我们所有任务中的一个例外都表现良好;我们假设复数与实数之间的权衡与数据模态中的连续-离散谱有关,其中复数对连续模态(例如音频、视频)有帮助,但对离散模态(例如文本、DNA)则不然。
Initialization. Most prior SSMs also suggest special initializations, particularly in the complex-valued case, which can help in several settings such as low-data regimes. Our default initialization for the complex case is S4D-Lin and for the real case is S4D-Real (Gu, Gupta, et al. 2022), which is based on the HIPPO theory (Gu, Dao, et al. 2020). These define the $n$-th element of $\pmb{A}$ as $-1/2+ni$ and $-(n+1)$ respectively. However, we expect many initializations to work fine, particularly in the large-data and real-valued SSM regimes; some ablations are considered in Section 4.6.
初始化。大多数先前的 SM 也建议特殊的初始化方法,特别是在复数值情况下,这可以在多种场景中提供帮助,例如低数据量场景。我们对复数情况的默认初始化方法是 S4D-Lin,对实数情况是 S4D-Real (Gu, Gupta, et al. 2022),这是基于 HIPPO 理论 (Gu, Dao, et al. 2020)。这些方法分别定义了 $\pmb{A}$ 的第 $n$ 个元素为 $-1/2+n i$ 和 $-(n+1)$。然而,我们预计许多初始化方法在大数据量和实数值 SSM 场景下都能很好地工作;一些消融实验在第 4.6 节中进行了考虑。
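The two default initializations can be written down directly from the formulas above. This is a minimal sketch for a single channel with state dimension `N`; the function names are illustrative and do not correspond to any particular library.

```python
import numpy as np

def s4d_lin(N):
    """Complex case: the n-th element of A is -1/2 + n*i, for n = 0..N-1."""
    n = np.arange(N)
    return -0.5 + 1j * n

def s4d_real(N):
    """Real case: the n-th element of A is -(n + 1), for n = 0..N-1."""
    n = np.arange(N)
    return -(n + 1.0)

A_complex = s4d_lin(4)
A_real = s4d_real(4)
print(A_complex)
print(A_real)
```

In both cases every element has a negative real part, so $\exp(\Delta A)$ has magnitude below 1 for any positive $\Delta$ and the recurrence decays rather than grows.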
Parameterization of $\Delta$. We defined the selective adjustment to $\Delta$ as $s_{\Delta}(x)=\mathsf{Broadcast}_{D}(\mathsf{Linear}_{1}(x))$, which was motivated by the mechanics of $\Delta$ (Section 3.5). We observe that it can be generalized from dimension 1 to a larger dimension $R$. We set this to be a small fraction of $D$, which uses a negligible number of parameters compared to the main Linear projections in the block. We additionally note that the broadcasting operation can instead be viewed as another Linear projection, initialized to a specific pattern of 1's and 0's; if this projection is trainable, this leads to the alternative $s_{\Delta}(x)=\mathsf{Linear}_{D}(\mathsf{Linear}_{R}(x))$, which can be viewed as a low-rank projection.
参数化 $\Delta$ 。我们定义对 $\Delta$ 的选择性调整为 $s_{\Delta}(x)=\mathsf{Broadcast}_{D}(\mathsf{Linear}_{1}(x))$ ,这是由 $\Delta$ 的机制(第 3.5 节)所启发的。我们观察到它可以从维度 1 推广到更大的维度 $R$。我们将此设置为 $D$ 的一个小部分,与块中的主要线性投影相比,使用的参数数量可以忽略不计。我们还注意到广播操作可以被视为另一个线性投影,初始化为特定的 1 和 0 的模式;如果这个投影是可训练的,则会导致另一种形式 $s_{\Delta}(x)=\mathsf{Linear}_{D}(\mathsf{Linear}_{R}(x))$ ,这可以被视为低秩投影。
In our experiments, the $\Delta$ parameter (which can be viewed as a bias term) is initialized to $\tau_{\Delta}^{-1}(\mathsf{Uniform}([0.001,0.1]))$, following prior work on SSMs (Gu, Johnson, Timalsina, et al. 2023).
在我们的实验中,$\Delta$ 参数(可以视为偏置项)初始化为 $\tau_{\Delta}^{-1}(\mathsf{Uniform}([0.001,0.1]))$ ,遵循关于 SSM 的先前工作 (Gu, Johnson, Timalsina, et al. 2023)。
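The low-rank parameterization and the bias initialization can be sketched together. The code assumes $\Delta=\tau_{\Delta}(\text{bias}+s_{\Delta}(x))$ with $\tau_{\Delta}=\mathrm{softplus}$; the sizes `D` and `R` and the random weight scales are hypothetical choices for illustration only.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def inv_softplus(y):
    """Inverse of softplus for y > 0: log(exp(y) - 1), written stably."""
    return y + np.log(-np.expm1(-y))

rng = np.random.default_rng(0)
D, R = 64, 4  # hypothetical model dimension and low rank (R much smaller than D)

# Bias chosen so that softplus(bias) is Uniform([0.001, 0.1]) at initialization.
dt_init = rng.uniform(0.001, 0.1, size=D)
dt_bias = inv_softplus(dt_init)

# s_Delta(x) = Linear_D(Linear_R(x)): a rank-R map from D to D dimensions.
W_down = rng.normal(scale=D ** -0.5, size=(D, R))
W_up = rng.normal(scale=R ** -0.5, size=(R, D))

def delta(x):
    return softplus(x @ W_down @ W_up + dt_bias)

x = rng.normal(size=(5, D))
print(delta(x).shape, delta(np.zeros(D)).min(), delta(np.zeros(D)).max())
```

The two factors use `2 * D * R` parameters instead of `D * D`, which is why the cost is negligible when `R` is a small fraction of `D`.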
Remark 3.1. For brevity in our experimental results, we sometimes abbreviate selective SSMs as S6 models, because they are S4 models with a selection mechanism and computed with a scan.
备注 3.1. 为了简洁起见,在我们的实验结果中,我们有时将选择性的 SSM 模型简称为 S6 模型,因为它们是带有选择机制并通过扫描计算的 S4 模型。
4 Empirical Evaluation
4 实证评估
In Section 4.1 we test Mamba's ability to solve the two synthetic tasks motivated in Section 3.1. We then evaluate on three domains, each evaluated on autoregressive pretraining as well as downstream tasks.
在第 4.1 节中,我们测试 Mamba 解决第 3.1 节中提出的两个合成任务的能力。然后我们在三个领域进行评估,每个领域都在自回归预训练和下游任务上进行了评估。
Finally, Section 4.5 shows Mamba's computational efficiency at both training and inference time, and Section 4.6 ablates various components of the architecture and selective SSMs.
最后,第 4.5 节展示了 Mamba 在训练和推理阶段的计算效率,第 4.6 节对架构的各个组件和选择性的 SSMs 进行了消融研究。
4.1 Synthetic Tasks
4.1 合成任务
Full experiment details for these tasks including task details and training protocol are in Appendix E.1.
这些任务的完整实验细节,包括任务详情和训练协议,请参见附录 E.1。
4.1.1 Selective Copying
4.1.1 选择性复制
The Copying task is one of the most well-studied synthetic tasks for sequence modeling, originally designed to test the memorization abilities of recurrent models. As discussed in Section 3.1, LTI SSMs (linear recurrences and global convolutions) can easily solve this task by only keeping track of time instead of reasoning about the data; for example, by constructing a convolution kernel of exactly the right length (Figure 2). This was explicitly validated in earlier work on global convolutions (Romero et al. 2021). The Selective Copying task prevents this shortcut by randomizing the spacing between tokens. Note that this task has been introduced before as the Denoising task (Jing et al. 2019).
复制任务是序列建模中最常研究的合成任务之一,最初设计用于测试循环模型的记忆能力。如第 3.1 节所述,线性时不变状态空间模型 (LTI SSMs)(线性递归和全局卷积)可以通过仅跟踪时间而不是对数据进行推理轻松解决此任务;例如,通过构建恰好合适长度的卷积核(图 2)。这在早期关于全局卷积的工作中得到了明确验证 (Romero et al. 2021)。选择性复制任务通过随机化 token 之间的间距来防止这种捷径。请注意,此任务之前已被介绍为去噪任务 (Jing et al. 2019)。
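To show what randomized spacing means in practice, here is a hypothetical generator for one Selective Copying example. The token ids, sequence length, and the use of id 0 as the noise token are assumptions made for illustration; the paper's exact data format is described in its Appendix E.1 and may differ.

```python
import numpy as np

def selective_copying_example(rng, seq_len=32, n_tokens=4, vocab=8, noise_id=0):
    """Place n_tokens content tokens at random positions among noise tokens.

    The target is the content tokens in their order of appearance, so the
    spacing between them carries no usable signal.
    """
    positions = np.sort(rng.choice(seq_len, size=n_tokens, replace=False))
    content = rng.integers(1, vocab + 1, size=n_tokens)  # ids 1..vocab
    inputs = np.full(seq_len, noise_id)
    inputs[positions] = content
    return inputs, content

rng = np.random.default_rng(0)
inputs, targets = selective_copying_example(rng)
print(inputs)
print(targets)
```

Because the positions change from example to example, a fixed convolution kernel of "exactly the right length" no longer solves the task.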
Note that many previous works argue that adding architecture gating (multiplicative interactions) can endow models with "data-dependence" and solve related tasks (Dao, Fu, Saab, et al. 2023; Poli et al. 2023). However, we find this explanation insufficient intuitively because such gating does not interact along the sequence axis, and cannot affect the spacing between tokens. In particular architecture gating is not an instance of a selection mechanism (Appendix A).
请注意,许多先前的工作认为添加架构门控(乘法交互)可以使模型具有“数据依赖性”,并解决相关任务 (Dao, Fu, Saab, et al. 2023; Poli et al. 2023)。然而,我们认为这种解释直观上不够充分,因为这种门控不会沿序列轴进行交互,且不能影响 Token 之间的间距。特别是,架构门控不是选择机制的一个实例(附录 A)。
Table 1 confirms that gated architectures such as H3 and Mamba only partially improve performance, while the selection mechanism (modifying S4 to S6) easily solves this task, particularly when combined with these more powerful architectures.
表 1: 确认了门控架构(如 H3 和 Mamba)仅部分改善性能,而选择机制(将 S4 修改为 S6)轻松解决此任务,尤其是在与这些更强大的架构结合时。
4.1.2 Induction Heads
4.1.2 归纳头 (Induction Heads)
Induction heads (Olsson et al. 2022) is a simple task from the mechanistic interpretability lens (Elhage et al. 2021) that is surprisingly predictive of the in-context learning ability of LLMs. It requires models to perform associative recall and copy: for example, if the model has seen a bigram such as "Harry Potter" in the sequence, then the next time "Harry" appears in the same sequence, the model should be able to predict "Potter" by copying from history.
归纳头 (Induction heads) (Olsson et al. 2022) 是一个从机械可解释性角度 (Elhage et al. 2021) 看非常简单的任务,但却能惊人地预测大语言模型的上下文学习能力。它要求模型执行关联回忆和复制:例如,如果模型在序列中见过一个二元组如 “Harry Potter”,那么当 “Harry” 再次出现在同一序列中时,模型应该能够通过从历史记录中复制来预测 “Potter”。
Dataset. We train a 2-layer model on the induction heads task at sequence length 256, with a vocab size of 16, which is comparable to prior work on this task (Dao, Fu, Saab, et al. 2023) but with longer sequences. We additionally investigate generalization and extrapolation abilities by evaluating on a range of sequence lengths from $2^{6}=64$ up to $2^{20}=1048576$ at test time.
数据集。我们在序列长度为 256 的归纳头任务上训练一个 2 层模型,词汇表大小为 16,这与该任务的先前工作 (Dao, Fu, Saab, et al. 2023) 相当,但使用更长的序列。此外,我们通过在测试时评估从 $2^6=64$ 到 $2^{20}=1048576$ 的一系列序列长度来研究模型的泛化和外推能力。
Models. Following established work on induction heads, we use 2-layer models, which allows attention to mechanistically solve the induction heads task (Olsson et al. 2022). We test both multi-head attention (8 heads, with various positional encodings) and SSM variants. We use a model dimension $D$ of 64 for Mamba and 128 for the other models.
模型。遵循关于归纳头的已有工作,我们使用2层模型,这使得注意力机制能够解决归纳头任务 (Olsson et al. 2022)。我们测试了多头注意力(8个头,带有各种位置编码)和SSM变体。对于Mamba,我们使用模型维度 $D$ 为64,其他模型则为128。
Results. Table 2 shows that Mamba (or more precisely, its selective SSM layer) has the ability to solve the task perfectly because of its ability to selectively remember the relevant token while ignoring everything else in between. It generalizes perfectly to million-length sequences, or $4000\times$ longer than it saw during training, while no other method goes beyond $2\times$.
表 2: 结果显示,Mamba——或者更准确地说,其选择性 SSM 层——能够完美地解决任务,因为它可以选择性地记住相关 Token 而忽略其他所有内容。它能够完美泛化到百万长度的序列,或比训练时看到的序列长 4000 倍,而没有其他方法能超过 2 倍。
| Model | Arch. | Layer | Acc. |
|---|---|---|---|
| S4 | No gate | S4 | 18.3 |
| - | No gate | S6 | 97.0 |
| H3 | H3 | S4 | 57.0 |
| Hyena | H3 | Hyena | 30.1 |
| - | H3 | S6 | 99.7 |
| - | Mamba | S4 | 56.4 |
| - | Mamba | Hyena | 28.4 |
| Mamba | Mamba | S6 | 99.8 |

Table 2: (Induction Heads.) Models are trained on sequence length $2^{8}=256$, and tested on increasing sequence lengths of $2^{6}=64$ up to $2^{20}=1048576$. Full numbers in Table 11.
表 2: (归纳头。)模型在序列长度为 $2^{8}=256$ 上进行训练,并在逐渐增加的序列长度上进行测试,从 $2^{6}=64$ 到 $2^{20}=1048576$。完整数据见表 11。

Table 1: (Selective Copying.) Accuracy for combinations of architectures and inner sequence layers.

Figure 4: (Scaling Laws.) Models of size $\approx125M$ to $\approx1.3B$ parameters, trained on the Pile. Mamba scales better than all other attention-free models and is the first to match the performance of a very strong "Transformer++" recipe that has now become standard, particularly as the sequence length grows.
表 1: (选择性复制。)不同架构和内部序列层组合的准确性。图 4: (扩展定律。)模型大小从约 125M 到约 1.3B 参数,在 Pile 数据集上训练。Mamba 的扩展性能优于所有其他无注意力机制的模型,并且是第一个匹配非常强大的 “Transformer++” 方案性能的模型,该方案现已成为标准,特别是在序列长度增加时。
Out of positional encoding variants for attention models, xPos (which was designed for length extrapolation) is slightly better than the others; also note that all attention models were only tested up to sequence length $2^{14}=16384$ due to memory limitations. Out of other SSMs, H3 and Hyena are similar, contrary to the findings in Poli et al. (2023).
在注意力模型的位置编码变体中,xPos(专为长度外推设计)略优于其他变体;需要注意的是,由于内存限制,所有注意力模型仅测试到序列长度 $2^{14}=16384$ 。在其他 SSM 中,H3 和 Hyena 类似,这与 Poli 等人 (2023) 的发现不同。
4.2 Language Modeling
4.2 语言模型
We evaluate the Mamba architecture on standard autoregressive language modeling against other architectures, on both pretraining metrics (perplexity) and zero-shot evaluations. We set the model sizes (depth and width) to mirror GPT3 specifications. We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020). All training details are in Appendix E.2.
我们评估 Mamba 架构在标准自回归语言建模任务上的表现,对比其他架构,在预训练指标(困惑度)和零样本 (Zero-shot) 评估上进行测试。我们将模型大小(深度和宽度)设置为与 GPT3 规格一致。我们使用 Pile 数据集 (L. Gao, Biderman, et al. 2020),并遵循 Brown 等人 (2020) 描述的训练方案。所有训练细节见附录 E.2。
4.2.1 Scaling Laws
4.2.1 扩展定律 (Scaling Laws)
For baselines, we compare against the standard Transformer architecture (GPT3 architecture), as well as the strongest Transformer recipe we know of (here referred to as Transformer++), based on the PaLM and LLaMa architectures (e.g. rotary embedding, SwiGLU MLP, RMSNorm instead of LayerNorm, no linear bias, and higher learning rates). We also compare against other recent subquadratic architectures (Figure 4). All model details are in Appendix E.2.
对于基准线,我们对比了标准的 Transformer 架构 (GPT3架构),以及我们所知最强的 Transformer 配方(这里称为 Transformer $^{++}$),该配方基于 PaLM 和 LLaMa 架构(例如旋转嵌入、SwiGLU MLP、使用 RMSNorm 而不是 LayerNorm、无线性偏置以及更高的学习率)。我们还对比了其他最近的次二次架构(图 4)。所有模型细节见附录 E.2。
Figure 4 shows scaling laws under the standard Chinchilla (Hoffmann et al. 2022) protocol, on models from $\approx125M$ to $\approx1.3B$ parameters. Mamba is the first attention-free model to match the performance of a very strong Transformer recipe (Transformer++) that has now become standard, particularly as the sequence length grows. (We note that full results on context length 8k are missing for the RWKV and RetNet baselines, prior strong recurrent models that can also be interpreted as SSMs, because of a lack of efficient implementations leading to out-of-memory or unrealistic computation requirements.)
图 4: 显示了在标准 Chinchilla (Hoffmann 等, 2022) 协议下的扩展规律,模型参数从约 1.25 亿到约 13 亿。Mamba 是第一个无需注意力机制的模型,在性能上匹配了非常强大的 Transformer 配方 (Transformer $\mathrel{\mathop{\uparrow}}++$),这一配方现已成为标准,特别是在序列长度增加时。我们注意到,由于缺乏高效的实现导致内存不足或计算要求不切实际,RWKV 和 RetNet 基线(先前强大的循环模型,也可以解释为 SSMs)在上下文长度为 8k 的完整结果缺失。
4.2.2 Downstream Evaluations
4.2.2 下游评估
Table 3 shows the performance of Mamba on a range of popular downstream zero-shot evaluation tasks. We compare against the most well-known open source models at these sizes, most importantly Pythia (Biderman et al. 2023) and RWKV (B. Peng et al. 2023) which were trained with the same tokenizer, dataset, and training length (300B tokens) as our models. (Note that Mamba and Pythia are trained with context length 2048, while RWKV was trained with context length 1024.)
表 3: 显示了 Mamba 在一系列流行的下游零样本评估任务中的性能。我们与这些规模中最著名的开源模型进行了比较,最重要的是 Pythia (Biderman et al. 2023) 和 RWKV (B. Peng et al. 2023),这些模型使用与我们的模型相同的分词器、数据集和训练长度(300B Token)。(请注意,Mamba 和 Pythia 的上下文长度为 2048,而 RWKV 的上下文长度为 1024。)
Table 3: (Zero-shot Evaluations.) Best results for each size in bold. We compare against open source LMs with various tokenizers, trained for up to 300B tokens. Pile refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.
表 3: (零样本评估。)每个规模的最佳结果用粗体表示。我们将 Mamba 与开源语言模型 (LM) 进行比较,这些模型使用不同的分词器训练,最多训练 300B Token。ile 指的是验证集分割,仅与在同一数据集和分词器(GPT-NeoX-20B)上训练的模型进行比较。对于每个模型规模,Mamba 在每一项评估结果中都是同类最佳,并且通常在模型规模两倍的情况下匹配基线。
| MODEL | TOKEN. | PILE PPL ↓ | LAMBADA PPL ↓ | LAMBADA ACC ↑ | HELLASWAG ACC ↑ | PIQA ACC ↑ | ARC-E ACC ↑ | ARC-C ACC ↑ | WINOGRANDE ACC ↑ | AVERAGE ACC ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Hybrid H3-130M | GPT2 | - | 89.48 | 25.77 | 31.7 | 64.2 | 44.4 | 24.2 | 50.6 | 40.1 |
| Pythia-160M | NeoX | 29.64 | 38.10 | 33.0 | 30.2 | 61.4 | 43.2 | 24.1 | 51.9 | 40.6 |
| Mamba-130M | NeoX | 10.56 | 16.07 | 44.3 | 35.3 | 64.5 | 48.0 | 24.3 | 51.9 | 44.7 |
| Hybrid H3-360M | GPT2 | - | 12.58 | 48.0 | 41.5 | 68.1 | 51.4 | 24.7 | 54.1 | 48.0 |
| Pythia-410M | NeoX | 9.95 | 10.84 | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
| Mamba-370M | NeoX | 8.28 | 8.14 | 55.6 | 46.5 | 69.5 | 55.1 | 28.0 | 55.3 | 50.0 |
| Pythia-1B | NeoX | 7.82 | 7.92 | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
| Mamba-790M | NeoX | 7.33 | 6.02 | 62.7 | 55.1 | 72.1 | 61.2 | 29.5 | 56.1 | 57.1 |
| GPT-Neo 1.3B | GPT2 | - | 7.50 | 57.2 | 48.9 | 71.1 | 56.2 | 25.9 | 54.9 | 52.4 |
| Hybrid H3-1.3B | GPT2 | - | 11.25 | 49.6 | 52.6 | 71.3 | 59.2 | 28.1 | 56.9 | 53.0 |
| OPT-1.3B | OPT | - | 6.64 | 58.0 | 53.7 | 72.4 | 56.7 | 29.6 | 59.5 | 55.0 |
| Pythia-1.4B | NeoX | 7.51 | 6.08 | 61.7 | 52.1 | 71.0 | 60.5 | 28.5 | 57.2 | 55.2 |
| RWKV-1.5B | NeoX | 7.70 | 7.04 | 56.4 | 52.5 | 72.4 | 60.5 | 29.4 | 54.6 | 54.3 |
| Mamba-1.4B | NeoX | 6.80 | 5.04 | 64.9 | 59.1 | 74.2 | 65.5 | 32.8 | 61.5 | 59.7 |
| GPT-Neo 2.7B | GPT2 | - | 5.63 | 62.2 | 55.8 | 72.1 | 61.1 | 30.2 | 57.6 | 56.5 |
| Hybrid H3-2.7B | GPT2 | - | 7.92 | 55.7 | 59.7 | 73.3 | 65.6 | 32.3 | 61.4 | 58.0 |
| OPT-2.7B | OPT | - | 5.12 | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
| Pythia-2.8B | NeoX | 6.73 | 5.04 | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
| RWKV-3B | NeoX | 7.00 | 5.24 | 63.9 | 59.6 | 73.7 | 67.8 | 33.1 | 59.6 | 59.6 |
| Mamba-2.8B | NeoX | 6.22 | 4.23 | 69.2 | 66.1 | 75.2 | 69.7 | 36.3 | 63.5 | 63.3 |
| GPT-J-6B | GPT2 | - | 4.10 | 68.3 | 66.3 | 75.4 | 67.0 | 36.6 | 64.1 | 63.0 |
| OPT-6.7B | OPT | - | 4.25 | 67.7 | 67.2 | 76.3 | 65.6 | 34.9 | 65.5 | 62.9 |
| Pythia-6.9B | NeoX | 6.51 | 4.45 | 67.1 | 64.0 | 75.2 | 67.3 | 35.5 | 61.3 | 61.7 |
| RWKV-7.4B | NeoX | 6.31 | 4.38 | 67.2 | 65.5 | 76.1 | 67.8 | 37.5 | 61.0 | 62.5 |
4.3 DNA Modeling
4.3 DNA 模型构建
Motivated by the success of large language models, there has been recent exploration into using the foundation model paradigm for genomics. DNA has been likened to language in that it consists of sequences of discrete tokens with a finite vocabulary. It is also known for requiring long-range dependencies to model (Avsec et al. 2021). We investigate Mamba as an FM backbone for pretraining and fine-tuning in the same setting as recent works on long-sequence models for DNA (Nguyen, Poli, et al. 2023). In particular, we focus on two explorations of scaling laws across model size and sequence length (Figure 5), and a difficult downstream synthetic classification task requiring long context (Figure 6).
受大语言模型成功的影响,最近开始探索将基础模型范式应用于基因组学。DNA 被类比为语言,因为它由具有有限词汇表的离散 Token 序列组成。它还以需要建模长距离依赖性而闻名 (Avsec et al. 2021)。我们研究了 Mamba 作为预训练和微调的基础模型 (FM) 主干,在与近期关于 DNA 的长序列模型工作的相同设置下 (Nguyen, Poli, et al. 2023)。特别是,我们专注于两个方面的扩展规律的研究,包括模型规模和序列长度 (图 5),以及一个需要长上下文的困难下游合成分类任务 (图 6)。
For pretraining, we largely follow a standard causal language modeling (next token prediction) setup for the training and model details (see also Appendix E.2). For the dataset, we largely follow the setup of HyenaDNA (Nguyen, Poli, et al. 2023), which uses the HG38 dataset for pretraining, consisting of a single human genome with about 4.5 billion tokens (DNA base pairs) in the training split.
对于预训练,我们主要遵循标准的因果语言模型(下一个 Token 预测)设置进行训练和模型细节(详见附录 E.2)。对于数据集,我们主要遵循 HyenaDNA (Nguyen, Poli, et al. 2023) 的设置,该设置使用 HG38 数据集进行预训练,包含一个约 45 亿个 Token(DNA 碱基对)的单个人类基因组作为训练集。

Figure 5: (DNA Scaling Laws.) Pretraining on the HG38 (human genome) dataset. (Left) Fixing short context length $2^{10}=1024$ and increasing size from $\approx200K$ to $\approx40M$ parameters, Mamba scales better than baselines. (Right) Fixing model size and increasing sequence lengths while keeping tokens/batch and total training tokens fixed. Unlike baselines, the selection mechanism of Mamba facilitates better performance with increasing context length.
图 5: (DNA 缩放定律。)在 HG38 (人类基因组) 数据集上进行预训练。(左) 固定短上下文长度 $2^{10}=1024$ 并将参数量从 $\approx200K$ 增加到 $\approx40M$ ,Mamba 比基线模型表现更好。(右) 固定模型大小并增加序列长度,同时保持每批次 Token 数和总训练 Token 数不变。与基线模型不同,Mamba 的选择机制在上下文长度增加时表现出更好的性能。
4.3.1 Scaling: Model Size
4.3.1 扩展:模型大小
In this experiment, we investigate the scaling properties of genomics foundation models with various model backbones (Figure 5 Left).
在本实验中,我们研究了不同模型骨干的基因组基础模型的扩展特性 (图 5 左)。
Training. To advantage the baselines, we train on a short sequence length of 1024; as shown in Section 4.3.2, we expect results to favor Mamba even more at longer sequence lengths. We fix a global batch size of 1024, for a total of $2^{20}\approx1M$ tokens per batch. Models were trained for $10K$ gradient steps for a total of 10B tokens.
训练。为了有利于基线模型,我们使用较短的序列长度 1024 进行训练;如第 4.3.2 节所示,我们预计在更长的序列长度下结果会更有利于 Mamba。我们固定全局批次大小为 1024,每批次总共 $2^{20} \approx 1M$ 个 Token。模型训练了 $10K$ 梯度步,总计 10B 个 Token。
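The token budget quoted above follows directly from the stated hyperparameters. A minimal arithmetic check, using only the values given in the text (the variable names are ours):

```python
# Token budget for the model-size scaling runs (values taken from the text).
seq_len = 1024        # tokens per sequence
global_batch = 1024   # sequences per batch
steps = 10_000        # gradient steps

tokens_per_batch = seq_len * global_batch   # 2**20 = 1,048,576
total_tokens = tokens_per_batch * steps     # about 1.05e10, i.e. roughly 10B

print(tokens_per_batch, total_tokens)
```

The "10B" figure is therefore a rounding of $10^{4}\cdot2^{20}$.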
Results. Figure 5 (Left) shows that Mamba's pretraining perplexity improves smoothly with model size, and that Mamba scales better than both HyenaDNA and Transformer++. For example, at the largest model size of $\approx40M$ parameters, the curve shows that Mamba can match the Transformer++ and HyenaDNA models with roughly $3\times$ to $4\times$ fewer parameters.
结果。图 5 (左) 显示 Mamba 的预训练困惑度随着模型大小平滑改善,并且 Mamba 的扩展性优于 HyenaDNA 和 Transformer++。例如,在大约 40M 参数的最大模型大小下,曲线显示 Mamba 可以用大约 3 到 4 倍更少的参数匹配 Transformer++ 和 HyenaDNA 模型。
4.3.2 Scaling: Context Length
4.3.2 扩展:上下文长度
In the next DNA experiment, we investigate the scaling properties of models with respect to sequence length. We only compare the HyenaDNA and Mamba models, as quadratic attention becomes prohibitively expensive at longer sequence lengths. We pretrain models on sequence lengths $2^{10}=1024$, $2^{12}=4096$, $2^{14}=16384$, $2^{16}=65536$, $2^{18}=262144$, $2^{20}=1048576$. We fix a model size of 6 layers by width 128 (about 1.3M-1.4M parameters). Models were trained for $20K$ gradient steps for a total of $\approx330B$ tokens. The longer sequence lengths used sequence length warmup similar to (Nguyen, Poli, et al. 2023).
在下一个 DNA 实验中,我们研究模型相对于序列长度的扩展特性。我们仅比较 HyenaDNA 和 Mamba 模型,因为二次注意力机制在较长序列长度下变得过于昂贵。我们在序列长度为 $2^{10}=1024$, $2^{12}=4096$, $2^{14}=16384$, $2^{16}=65536$, $2^{18}=262144$, $2^{20}=1048576$ 上预训练模型。我们固定模型大小为 6 层,宽度为 128(约 1.3M-1.4M 参数)。模型训练了 $20K$ 梯度步,总共 $\approx330B$ 个 Token。较长的序列长度使用了类似于 (Nguyen, Poli, et al. 2023) 的序列长度热身。
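To make "keeping tokens/batch fixed" concrete, the sketch below derives the batch size implied at each context length. The tokens-per-batch value of $2^{24}$ is our assumption, inferred from $330B/20K\approx16.5M$; the text states only that it is held constant.

```python
# Assumption: tokens per batch is 2**24 (~16.8M), inferred from 330B / 20K steps.
# The text only says that tokens/batch is held fixed across sequence lengths.
TOKENS_PER_BATCH = 2 ** 24
STEPS = 20_000

def batch_size_for(seq_len: int) -> int:
    """Sequences per batch needed to keep tokens/batch constant."""
    assert TOKENS_PER_BATCH % seq_len == 0
    return TOKENS_PER_BATCH // seq_len

seq_lens = [2 ** k for k in range(10, 21, 2)]        # 1024, 4096, ..., 1048576
schedule = {n: batch_size_for(n) for n in seq_lens}  # seq_len -> sequences per batch
total_tokens = TOKENS_PER_BATCH * STEPS              # about 3.36e11

for n, b in schedule.items():
    print(f"seq_len={n:>8}  batch_size={b:>6}")
```

Under this assumption the longest setting trains on only 16 sequences per batch, which is one reason a sequence length warmup may be helpful.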
Results. Figure 5 (Right) shows that Mamba is able to make use of longer context even up to extremely long sequences of length 1M, and its pretraining perplexity improves as the context increases. On the other hand, the HyenaDNA model gets worse with sequence length. This is intuitive from the discussion in Section 3.5 on properties of the selection mechanism. In particular, LTI models cannot selectively ignore information; from a convolutional perspective, a very long convolution kernel is aggregating all information across a long sequence, which may be very noisy. Note that while HyenaDNA claims to improve with longer context, their results do not control for computation time.
结果。图 5 (右) 显示 Mamba 能够利用更长的上下文,即使对于长度达 1M 的极长序列,其预训练困惑度随着上下文长度的增加而改善。另一方面,HyenaDNA 模型的表现随着序列长度的增加而变差。这从第 3.5 节关于选择机制属性的讨论中可以直观理解。特别是,LTI 模型无法有选择地忽略信息;从卷积的角度来看,一个非常长的卷积核会聚合整个长序列中的所有信息,这可能会非常嘈杂。需要注意的是,尽管 HyenaDNA 声称其在更长的上下文中表现更好,但他们的结果并未控制计算时间。
4.3.3 Synthetic Species Classification
4.3.3 合成物种分类
We evaluate models on a downstream task of classifying between 5 different species by randomly sampling a contiguous segment of their DNA. This task is adapted from HyenaDNA, which used the species {human, lemur, mouse, pig, hippo}. We modify the task to be significantly more challenging by classifying between the five great apes species {human, chimpanzee, gorilla, orangutan, bonobo}, which are known to share $99\%$ of their DNA.
我们在下游任务中评估模型,该任务是通过随机采样一段连续的 DNA 来对 5 种不同的物种进行分类。此任务改编自 HyenaDNA,其使用的物种为 {human, lemur, mouse, pig, hippo}。我们修改了任务,使其更具挑战性,即对五大类人猿物种 {human, chimpanzee, gorilla, orangutan, bonobo} 进行分类,这些物种已知共享 $99\%$ 的 DNA。

Figure 6: (Great Apes DNA Classification.) Accuracy after fine-tuning on sequences of length $2^{10}=1024$ up to $2^{20}=1048576$ using pretrained models of the same context length. Numerical results in Table 13.
图 6: (类人猿 DNA 分类。) 使用相同上下文长度的预训练模型,在长度从 $2^{10}=1024$ 到 $2^{20}=1048576$ 的序列上微调后的准确率。数值结果见表 13。
Figure 7: (Audio Pretraining.) Mamba improves performance over prior state-of-the-art (SaShiMi) in autoregressive audio modeling, while improving up to minute-long context or million-length sequences (controlling for computation).
图 7: (音频预训练。) 在自回归音频建模中,Mamba 的性能优于此前的最先进模型 (SaShiMi),并且在长达一分钟的上下文或百万长度的序列上持续提升 (控制计算量)。


4.4 Audio Modeling and Generation
4.4 音频建模与生成 (Audio Modeling and Generation)
For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols (Goel et al. 2022). This model comprises:
对于音频波形模态,我们主要与 SaShiMi 架构和训练协议 (Goel et al. 2022) 进行比较。该模型包括:
We consider replacing the S4+MLP blocks with Mamba blocks. Experiment details are in Appendix E.4.
我们考虑用 Mamba 模块替换 S4+MLP 模块。实验细节见附录 E.4。
4.4.1 Long-Context Autoregressive Pretraining
4.4.1 长上下文自回归预训练 (Long-Context Autoregressive Pretraining)
We evaluate pretraining quality (autoregressive next-sample prediction) on YouTubeMix (DeepSound 2017), a standard piano music dataset used by prior work consisting of 4 hours of solo piano music, sampled at a rate of $16000\ \mathrm{Hz}$. Pretraining details largely follow the standard language modeling setup (Section 4.2). Figure 7 evaluates the effect of increasing training sequence lengths from $2^{13}=8192$ to $2^{20}\approx10^{6}$ while keeping computation fixed. There are some slight edge cases to the way the data is curated, which may lead to kinks in the scaling curves. For example, only minute-long clips were available, so the maximum sequence length is actually bounded by $60\,\mathrm{s}\cdot16000\,\mathrm{Hz}=960000$.
我们在 YouTubeMix (DeepSound 2017) 上评估了预训练质量(自回归下一个样本预测),这是一个由先前工作使用过的标准钢琴音乐数据集,包含 4 小时的独奏钢琴音乐,采样率为 $16000\ \mathrm{Hz}$ 。预训练细节大部分遵循标准的语言模型设置(第 4.2 节)。图 7 评估了在保持计算量固定的情况下,增加训练序列长度从 $2^{13}=8192$ 到 $2^{20}\approx10^{6}$ 的影响。数据整理方式存在一些细微的边缘情况,这可能导致缩放曲线出现曲折。例如,只有分钟长的片段可用,因此最大序列长度实际上被限制为 $60\,\mathrm{s}\cdot16000\,\mathrm{Hz}=960000$ 。
Both Mamba and the SaShiMi (S4+MLP) baseline improve consistently with longer context lengths; Mamba is better throughout, and the gap widens at longer lengths. The main metric is bits per byte (BPB), which differs by a constant factor of $\log(2)$ from the standard negative log-likelihood (NLL) loss used for pretraining other modalities.
Mamba 和 SaShiMi (S4+MLP) 基线随着上下文长度的增加而持续改进;Mamba 在整个过程中表现更好,且在更长的长度下差距扩大。主要指标是每字节比特数 (BPB),它与用于其他模态预训练的标准负对数似然 (NLL) 损失相差一个常数因子 $\log(2)$。
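The relation between BPB and NLL is a change of logarithm base. The helper below is a small sketch of that conversion; it assumes the NLL is an average in nats per sample and that each audio sample occupies one byte (an assumption on our part, not stated in this section). The function names are illustrative.

```python
import math

def nats_to_bits(nll_nats: float) -> float:
    """Convert an average NLL in nats per sample to bits per sample."""
    return nll_nats / math.log(2)

def bits_to_nats(bpb: float) -> float:
    """Inverse conversion: bits per sample back to nats per sample."""
    return bpb * math.log(2)

# A loss of ln(2) nats corresponds to exactly one bit.
print(nats_to_bits(math.log(2)))
```

Because the factor is constant, model rankings are identical under either metric.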
We note one important detail: this is the only experiment in this paper in which we switched from the real parameterization to complex (Section 3.6). We show additional ablations in Appendix E.4.
我们注意到一个重要细节:这是本文中唯一一次我们将实参数化切换到复数 (第 3.6 节)。我们在附录 E.4 中展示了额外的消融实验。
4.4.2 Autoregressive Speech Generation
4.4.2 自回归语音生成 (Autoregressive Speech Generation)
SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting of 1-second clips sampled at $16000\ \mathrm{Hz}$ of the digits “zero” through “nine” with highly variable characteristics. We largely follow the autoregressive training setup and generation protocol of Goel et al. (2022).
SC09 是一个基准语音生成数据集 (Donahue, McAuley, 和 Puckette 2019; Warden 2018),由以 $16000\ \mathrm{Hz}$ 采样的 1 秒音频片段组成,内容为数字“零”到“九”,具有高度可变的特征。我们主要遵循 Goel 等人 (2022) 的自回归训练设置和生成协议。
Table 4 shows automated metrics of the Mamba-UNet model compared to a variety of baselines from Goel et al. (2022): WaveNet (Oord et al. 2016), SampleRNN (Mehri et al. 2017), WaveGAN (Donahue, McAuley, and Puckette 2019), DiffWave (Z. Kong et al. 2021), and SaShiMi. A small Mamba model outperforms the state-of-the-art (and much larger) GAN- and diffusion-based models. A larger model parameter-matched to the baselines further improves on fidelity metrics dramatically.
表 4: 显示了 Mamba-UNet 模型与 Goel 等 (2022) 提出的多种基准模型的自动化指标对比:WaveNet (Oord 等 2016),SampleRNN (Mehri 等 2017),WaveGAN (Donahue, McAuley, 和 Puckette 2019),DiffWave (Z. Kong 等 2021),和 SaShiMi。小型 Mamba 模型的表现超过了最先进的(且规模更大)基于 GAN 和扩散的模型。与基准模型参数匹配的较大模型在保真度指标上进一步显著提升。
Table 5 takes the small Mamba model and investigates combinations of different architectures for the outer stages and center stage. It shows that Mamba is consistently better than S4+MLP in the outer blocks, and Mamba > S4+MLP > MHA+MLP in the center blocks.
表 5 采用小规模 Mamba 模型,研究外部阶段和中心阶段不同架构的组合。结果显示,在外部块中 Mamba 一致优于 S4+MLP,而在中心块中 Mamba > S4+MLP > MHA+MLP。
Table 4: (SC09) Automated metrics for unconditional generation on a challenging dataset of fixed-length speech clips. (Top to Bottom) Autoregressive baselines, non-autoregressive baselines, Mamba, and dataset metrics.
表 4: (SC09) 在具有挑战性的固定长度语音片段数据集上进行无条件生成的自动化指标。(从上到下) 自回归基线、非自回归基线、Mamba 以及数据集指标。
| Model | Params | NLL↓ | FID↓ | IS↑ | MIS↑ | AM↓ |
|---|---|---|---|---|---|---|
| SampleRNN | 35.0M | 2.042 | 8.96 | 1.71 | 3.02 | 1.76 |
| WaveNet | 4.2M | 1.925 | 5.08 | 2.27 | 5.80 | 1.47 |
| SaShiMi | 5.8M | 1.873 | 1.99 | 5.13 | 42.57 | 0.74 |
| WaveGAN | 19.1M | - | 2.03 | 4.90 | 36.10 | 0.80 |
| DiffWave | 24.1M | - | 1.92 | 5.26 | 51.21 | 0.68 |
| +SaShiMi | 23.0M | - | 1.42 | 5.94 | 69.17 | 0.59 |
| Mamba | 6.1M | 1.852 | 0.94 | 6.26 | 88.54 | 0.52 |
| Mamba | 24.3M | 1.860 | 0.67 | 7.33 | 144.9 | 0.36 |
| Train | - | - | 0.00 | 8.56 | 292.5 | 0.16 |
| Test | - | - | 0.02 | 8.33 | 257.6 | 0.19 |
Table 5: (SC09 Model Ablations) Models with 6M parameters. In SaShiMi's U-Net backbone, there are 8 center blocks operating on sequence length 1000, sandwiched on each side by 8 outer blocks on sequence length 4000, sandwiched by 8 outer blocks on sequence length 16000 (40 blocks total). The architecture of the 8 center blocks is ablated independently of the rest. Note that Transformers (MHA+MLP) were not tested in the more important outer blocks because of efficiency constraints.
表 5: (SC09 模型消融) 包含 6M 参数的模型。在 SaShiMi 的 U-Net 主干中,有 8 个中心块在序列长度 1000 上运行,两侧各被 8 个在序列长度 4000 上运行的外块夹住,再被 8 个在序列长度 16000 上运行的外块夹住 (共 40 个块)。8 个中心块的架构独立于其余部分进行消融研究。注意,由于效率限制,Transformer (MHA+MLP) 没有在更重要的外块中进行测试。
| OUTER | CENTER | NLL↓ | FID↓ | IS ↑ | MIS ↑ | AM↓ |
|---|---|---|---|---|---|---|
| S4+MLP | MHA+MLP | 1.859 | 1.45 | 5.06 | 47.03 | 0.70 |
| S4+MLP | S4+MLP | 1.867 | 1.43 | 5.42 | 53.54 | 0.65 |
| S4+MLP | Mamba | 1.859 | 1.42 | 5.71 | 56.51 | 0.64 |
| Mamba | MHA+MLP | 1.850 | 1.37 | 5.63 | 58.23 | 0.62 |
| Mamba | S4+MLP | 1.853 | 1.07 | 6.05 | 73.34 | 0.55 |
| Mamba | Mamba | 1.852 | 0.94 | 6.26 | 88.54 | 0.52 |
4.5 Speed and Memory Benchmarks
4.5 速度和内存基准测试
We benchmark the speed of the SSM scan operation (state expansion $N=16$), as well as the end-to-end inference throughput of Mamba, in Figure 8. Our efficient SSM scan is faster than the best attention implementation that we know of (FlashAttention-2 (Dao 2024)) beyond sequence length 2K, and up to $20{-}40\times$ faster than a standard scan implementation in PyTorch. Mamba achieves $4{-}5\times$ higher inference throughput than a Transformer of similar size, since without the KV cache it can use much higher batch sizes. For example, a Mamba-6.9B (untrained) would have higher inference throughput than a $5\times$ smaller Transformer-1.3B. Details in Appendix E.5, which additionally includes a benchmark of memory consumption.
我们在图 8 中基准测试了 SSM 扫描操作(状态扩展 $N=16$)的速度,以及 Mamba 的端到端推理吞吐量。我们高效的 SSM 扫描在序列长度超过 2K 时比我们所知的最佳注意力实现(FlashAttention-2 (Dao 2024))更快,并且比 PyTorch 中的标准扫描实现快 20-40 倍。Mamba 的推理吞吐量比类似规模的 Transformer 高 4-5 倍,因为没有 KV 缓存,它可以使用更大的批量大小。例如,未训练的 Mamba-6.9B 的推理吞吐量将高于小 5 倍的 Transformer-1.3B。详细信息见附录 E.5,其中还包含内存消耗的基准测试。
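The batch-size argument can be illustrated with a rough memory estimate. The sketch below compares the per-request inference state of a Transformer (a KV cache that grows with sequence length) against a recurrent SSM state of fixed size. All model dimensions are hypothetical placeholders chosen for illustration, not the configurations benchmarked in Figure 8, and the estimate ignores weights, activations, and any small auxiliary state.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch, bytes_per_elem=2):
    """Keys and values: two (batch, seq_len, d_model) tensors per layer."""
    return 2 * n_layers * batch * seq_len * d_model * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, batch, bytes_per_elem=2):
    """One recurrent state of shape (batch, d_inner, d_state) per layer."""
    return n_layers * batch * d_inner * d_state * bytes_per_elem

# Hypothetical configuration (illustrative only): 24 layers, width 2048,
# SSM inner width 4096, state size N = 16, 16-bit elements.
for seq_len in (2048, 8192, 32768):
    kv = kv_cache_bytes(n_layers=24, d_model=2048, seq_len=seq_len, batch=1)
    ssm = ssm_state_bytes(n_layers=24, d_inner=4096, d_state=16, batch=1)
    print(f"seq_len={seq_len:>6}  kv_cache={kv / 2**20:8.1f} MiB  ssm_state={ssm / 2**20:5.1f} MiB")
```

The KV cache scales linearly with context length while the SSM state does not, so for a fixed memory budget the recurrent model can serve a larger batch.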

Figure 8: (Efficiency Benchmarks.) (Left) Training: our efficient scan is $40\times$ faster than a standard implementation. (Right) Inference: as a recurrent model, Mamba can achieve $5\times$ higher throughput than Transformers.
图 8: (效率基准测试。)(左)训练:我们的高效扫描比标准实现快 40× 。 (右)推理:作为递归模型,Mamba 的吞吐量可以比 Transformer 高 5× 。
4.6 Model Ablations
4.6 模型消融实验 (Model Ablations)
We perform a series of detailed ablations on components of our model, focusing on the setting of language modeling with size $\approx350\mathrm{M}$ models at Chinchilla token counts (same setting as Figure 4).
我们对模型的各个组件进行了一系列详细的消融实验,重点关注在 Chinchilla Token 数量设置下的约 350M 规模的语言模型 (同图 4 设置)。
4.6.1 Architecture
4.6.1 架构
Table 6 investigates the effects of the architecture (block) and its inner SSM layer (Figure 3). We find that
· Among previous non-selective (LTI) SSMs, which are equivalent to global convolutions, performance is very similar.
· Replacing the complex-valued S4 variant from previous work with a real-valued one does not affect performance much, suggesting that (at least for LM) real-valued SSMs may be a better choice when accounting for hardware efficiency.
· Replacing any of these with a selective SSM (S6) significantly improves performance, validating the motivation of Section 3.
表 6: 研究了架构 (block) 及其内部 SSM 层 (图 3) 的影响。我们发现:
- 在之前的非选择性 (LTI) SSM 中,它们等同于全局卷积,性能非常相似。
- 用实数值 S4 变体替换之前工作中的复数值 S4 变体对性能影响不大,这表明(至少对于语言建模)在考虑硬件效率时,实数值 SSM 可能是更好的选择。
- 用选择性 SSM (S6) 替换其中任何一个显著提高了性能,验证了第 3 节的动机。
Table 6: (Ablations: Architecture and SSM layer.) The Mamba block performs similarly to H3 while being simpler. In the inner layer, there is little difference among different parameterizations of LTI models, while selective SSMs (S6) provide a large improvement. More specifically, the S4 (real) variant is S4D-Real and the S4 (complex) variant is S4D-Lin.
表 6: (消融实验:架构和 SSM 层。)Mamba 块的表现与 H3 相似,但更简单。在内层,不同的 LTI 模型参数化之间的差异很小,而选择性的 SSM (S6) 提供了显著的改进。更具体地说,S4 (实数) 变体是 S4D-Real,S4 (复数) 变体是 S4D-Lin。
| Model | Arch. | SSM Layer | Perplexity |
|---|---|---|---|
| Hyena | H3 | Hyena | 10.24 |
| H3 | H3 | S4 (complex) | 10.30 |
| - | H3 | S4 (real) | 10.34 |
| - | H3 | S6 | 8.95 |
Table 7: (Ablations: Selective parameters.) $\Delta$ is the most important parameter (Theorem 1), but using multiple selective parameters together synergizes.
表 7: (消融实验:选择性参数。) $\Delta$ 是最重要的参数(定理 1),但使用多个选择性参数可以产生协同效应。
| Selective $\Delta$ | Selective $B$ | Selective $C$ | Perplexity |
|---|---|---|---|
| × | × | × | 10.93 |
| × | √ | × | 10.15 |
| × | × | √ | 9.98 |
| √ | × | × | 9.81 |
| √ | √ | √ | 8.71 |
Table 6 (continued):
| Model | Arch. | SSM Layer | Perplexity |
|---|---|---|---|
| - | Mamba | Hyena | 10.75 |
| - | Mamba | S4 (complex) | 10.54 |
| - | Mamba | S4 (real) | 10.56 |
| Mamba | Mamba | S6 | 8.69 |
Table 8: (Ablations: Parameterization of $A$.) The more standard initializations based on S4D-Lin (Gu, Gupta, et al. 2022) perform worse than S4D-Real or a random initialization, when the SSM is selective.
表 8: (消融实验:参数初始化 $A$)基于 S4D-Lin (Gu, Gupta, et al. 2022) 的更标准的初始化方法在 SSM 选择性时,表现不如 S4D-Real 或随机初始化。
| $A_n$ Initialization | Field | Perplexity |
|---|---|---|
| $A_n = -\frac{1}{2} + ni$ | Complex | 9.16 |
| $A_n = -1/2$ | Real | 8.85 |
| $A_n = -(n+1)$ | Real | 8.71 |
| $A_n \sim \exp(\mathcal{N}(0,1))$ | Real | 8.71 |
The Mamba architecture performs similarly to the H3 architecture (and seems slightly better when using a selective layer).
Mamba架构的表现与H3架构相似(在使用选择性层时似乎略胜一筹)。
We also investigate interleaving the Mamba block with other blocks such as MLP (a traditional architecture) and MHA (a hybrid attention architecture) in Appendix E.2.2.
我们还在附录 E.2.2 中研究了将 Mamba 块与其他块(如 MLP (多层感知器)、MHA (混合注意力架构))交错使用。
4.6.2 Selective SSM
4.6.2 选择性 SSM
Table 7 ablates the selective SSM layer by considering different combinations of selective $\Delta,B$ and $C$ parameters (Algorithm 2), showing that $\Delta$ is the most important parameter due to its connection to RNN gating (Theorem 1).
表 7: 通过考虑不同的选择性 SSM 层参数组合 (算法 2) 来消融选择性 SSM 层,显示了 Δ 是最重要的参数,这是由于其与 RNN 门控的关联 (定理 1)。
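To make concrete what "selective" means for each parameter, the following is a minimal sequential sketch of a diagonal SSM on one scalar channel, where $\Delta$, $B$, and $C$ can each be toggled between input-dependent and constant. It is an illustration under our own assumptions, not the paper's Algorithm 2 or its hardware-aware implementation: the discretization ($\overline{A}=\exp(\Delta A)$, $\overline{B}=\Delta B$), the scalar projections `w_delta`, `w_B`, `w_C`, and all names are hypothetical.

```python
import math

def softplus(z: float) -> float:
    return math.log1p(math.exp(z))

def selective_scan(xs, A, w_delta, b_delta, w_B, w_C, B0, C0,
                   sel_delta=True, sel_B=True, sel_C=True):
    """Sequential scan for one scalar channel with a diagonal state of size N.

    h_t[n] = exp(delta_t * A[n]) * h_{t-1}[n] + delta_t * B_t[n] * x_t
    y_t    = sum_n C_t[n] * h_t[n]
    """
    N = len(A)
    h = [0.0] * N
    ys = []
    for x in xs:
        # Each parameter is either a function of the current input or a constant.
        delta = softplus(w_delta * x + b_delta) if sel_delta else softplus(b_delta)
        B = [w * x for w in w_B] if sel_B else B0
        C = [w * x for w in w_C] if sel_C else C0
        h = [math.exp(delta * A[n]) * h[n] + delta * B[n] * x for n in range(N)]
        ys.append(sum(C[n] * h[n] for n in range(N)))
    return ys

# Real-valued initialization A_n = -(n + 1), as in row 3 of Table 8.
A = [-(n + 1) for n in range(4)]
params = dict(w_delta=1.0, b_delta=0.0, w_B=[1.0] * 4, w_C=[1.0] * 4,
              B0=[1.0] * 4, C0=[1.0] * 4)
print(selective_scan([0.5, -1.0, 2.0], A, **params))
```

With all three flags set to `False` the layer reduces to an LTI system (doubling the input doubles the output); enabling any flag makes the recurrence depend on the current token, which is the property ablated in Table 7.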
Table 8 considers different initializations of the SSM, which have been shown to make a large difference in some data modalities and settings (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022). On language modeling, we find that simpler real-valued diagonal initializations (S4D-Real, row 3) instead of more standard complex-valued parameterizations (S4D-Lin, row 1) perform better. Random initializations also work well, consistent with findings from prior work (Mehta et al. 2023).
表 8: 考虑了 SSM 的不同初始化方式,这些方式在某些数据模式和设置中已被证明有很大的差异 (Gu, Goel, and Re 2022; Gu, Gupta, et al. 2022)。在语言建模方面,我们发现更简单的实值对角线初始化 (S4D-Real,第 3 行) 比较标准的复数值参数化 (S4D-Lin,第 1 行) 表现更好。随机初始化也表现良好,这与之前的研究结果一致 (Mehta et al. 2023)。
Table 9 and Table 10 consider varying the dimension of the $\Delta$ and $(B,C)$ projections respectively. Changing them from static to selective provides the most benefit, while increasing the dimensions further generally improves performance modestly with a small increase in parameter count.
表 9 和表 10 分别考虑了改变 $\Delta$ 和 (B, C) 投影的维度。将它们从静态改为选择性提供了最大的好处,而进一步增加维度通常会在参数量略有增加的情况下适度提高性能。
Of particular note is the dramatic improvement of the selective SSM when the state size $N$ is increased, with over a 1.0 perplexity improvement for a cost of only $1\%$ additional parameters. This validates our core motivation in Sections 3.1 and 3.3.
特别值得注意的是,当状态大小 $N$ 增加时,选择性 SSM 的显著改进,仅以额外 $1\%$ 参数为代价就实现了超过 1.0 的困惑度改进。这验证了我们在第 3.1 和 3.3 节的核心动机。
5 Discussion
5 讨论
We discuss related work, limitations, and some future directions.
我们讨论了相关工作、局限性以及一些未来方向。
Related Work. Appendix A discusses how the selection mechanism relates to similar concepts. Appendix B has an extended related work of SSMs and other related models.
相关工作。附录 A 讨论了选择机制与类似概念的关系。附录 B 包含了 SSMs 和其他相关模型的扩展相关工作。
Table 9: (Ablations: Expressivity of $\Delta$.) The selection mechanism of $\Delta$ constructs it with a projection of the input. Projecting it even to dim. 1 provides a large increase in performance; increasing it further provides further improvements at the cost of a modest increase in parameters. State size fixed to $N=16$.
表 9: (消融实验:$\Delta$ 的表达能力。$\Delta$ 的选择机制通过输入的投影来构建它。即使将其投影到维度 1 也能显著提高性能;进一步增加维度可以带来更好的性能提升,但会以参数量适度增加为代价。状态大小固定为 $N=16$ 。)
| Size of $\Delta$ proj. | Params (M) | Perplexity |
|---|---|---|
| - | 358.9 | 9.12 |
| 1 | 359.1 | 8.97 |
| 2 | 359.3 | 8.97 |
| 4 | 359.7 | 8.91 |
| 8 | 360.5 | 8.83 |
| 16 | 362.1 | 8.84 |
| 32 | 365.2 | 8.80 |
| 64 | 371.5 | 8.71 |
Table 10: (Ablations: SSM state dimension.) (Top) Constant $B$ and $C$. (Bottom) Selective $B$ and $C$. Increasing the SSM state dimension $N$, which can be viewed as an expansion factor on the dimension of the recurrent state, can significantly improve performance for a negligible cost in parameters/FLOPs, but only when $B$ and $C$ are also selective. Size of $\Delta$ projection fixed to 64.
表 10: (消融实验:SSM 状态维度。) (上) 固定 $B$ 和 $C$ (下) 选择性 $B$ 和 $C$ 。增加 SSM 状态维度 $N$,可以视为递归状态维度的扩展因子,可以在参数/FLOPs 的微小代价下显著提高性能,但仅当 $B$ 和 $C$ 也是选择性时。$\Delta$ 投影的大小固定为 64。
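| $B$ and $C$ | State dimension $N$ | Params (M) | Perplexity |
|---|---|---|---|
| Constant | 1 | 367.1 | 9.88 |
| Constant | 2 | 367.4 | 9.86 |
| Constant | 4 | 368.0 | 9.82 |
| Constant | 8 | 369.1 | 9.82 |
| Constant | 16 | 371.5 | 9.81 |
| Selective | 1 | 367.1 | 9.73 |
| Selective | 2 | 367.4 | 9.40 |
| Selective | 4 | 368.0 | 9.09 |
| Selective | 8 | 369.1 | 8.84 |
| Selective | 16 | 371.5 | 8.71 |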
No Free Lunch: Continuous-Discrete Spectrum. Structured SSMs were originally defined as discretizations of continuous systems (1), and have had a strong inductive bias toward continuous-time data modalities such as perceptual signals (e.g. audio, video). As discussed in Sections 3.1 and 3.5, the selection mechanism overcomes their weaknesses on discrete modalities such as text and DNA; but this conversely can impede their performance on data that LTI SSMs excel on. Our ablations on audio waveforms examine this tradeoff in more detail.
没有免费的午餐:连续-离散谱。结构化的 SSM 最初被定义为连续系统的离散化 (1),并且对连续时间数据模态(例如感知信号,如音频、视频)具有强烈的归纳偏置。正如在第 3.1 节和第 3.5 节中所讨论的,选择机制克服了它们在离散模态(如文本和 DNA)上的弱点;但反过来这也可能阻碍它们在 LTI SSM 擅长的数据上的表现。我们对音频波形的消融实验更详细地研究了这种权衡。
Downstream Affordances. Transformer-based foundation models (particularly LLMs) have a rich ecosystem of properties and modes of interaction with pretrained models, such as fine-tuning, adaptation, prompting, in-context learning, instruction tuning, RLHF, quantization, and so on. We are particularly interested in whether Transformer alternatives such as SSMs have similar properties and affordances.
下游能力。基于 Transformer 的基础模型(特别是大语言模型)拥有丰富的 pretrained 模型属性和交互模式生态系统,例如微调、适应、提示、上下文学习、指令微调、RLHF、量化等。我们特别感兴趣的是,像 SSMs 这样的 Transformer 替代方案是否具有类似的属性和能力。
Scaling. Our empirical evaluation is limited to small model sizes, below the threshold of most strong open source LLMs (e.g. Llama (Touvron et al. 2023)) as well as other recurrent models such as RWKV (B. Peng et al. 2023) and RetNet (Y. Sun et al. 2023), which have been evaluated at the 7B parameter scale and beyond. It remains to assess whether Mamba still compares favorably at these larger sizes. We also note that scaling SSMs may involve further engineering challenges and adjustments to the model that are not discussed in this paper.
扩展。我们的实证评估仅限于小模型规模,低于大多数强大的开源大语言模型 (LLM) 的阈值(例如 Llama (Touvron et al. 2023)),以及其他递归模型,如 RWKV (B. Peng et al. 2023) 和 RetNet (Y. Sun et al. 2023),这些模型已经在 7B 参数规模及以上进行了评估。仍需评估 Mamba 在这些更大规模下是否仍然具有优势。我们还注意到,扩展 SSM 可能涉及进一步的工程挑战和对模型的调整,而这些在本文中未进行讨论。
6 Conclusion
6 结论
We introduce a selection mechanism to structured state space models, allowing them to perform context-dependent reasoning while scaling linearly in sequence length. When incorporated into a simple attention-free architecture, Mamba achieves state-of-the-art results on a diverse set of domains, where it matches or exceeds the performance of strong Transformer models. We are excited about the broad applications of selective state space models to build foundation models for different domains, especially in emerging modalities requiring long context such as genomics, audio, and video. Our results suggest that Mamba is a strong candidate to be a general sequence model backbone.
我们引入了一种选择机制到结构化状态空间模型中,使它们能够在上下文依赖的情况下进行推理,并且在序列长度上呈线性扩展。当集成到一个简单的无注意力架构中时,Mamba 在多个不同领域中取得了最先进的结果,在这些领域中,它的性能与强大的 Transformer 模型相匹配或超越。我们对选择性状态空间模型的广泛应用感到兴奋,可以为不同领域构建基础模型,特别是在需要长上下文的新兴模态如基因组学、音频和视频中。我们的结果表明,Mamba 是成为通用序列模型骨干的有力候选者。
Acknowledgments
致谢
We thank Karan Goel, Arjun Desai, and Kush Bhatia for helpful feedback on the draft.
我们感谢 Karan Goel、Arjun Desai 和 Kush Bhatia 对草稿提供的宝贵反馈。
References
参考文献
[1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. “Unitary Evolution Recurrent Neural Networks”. In: The International Conference on Machine Learning (ICML). 2016, pp. 1120-1128.
[1] Martin Arjovsky, Amar Shah, 和 Yoshua Bengio. “单元演化循环神经网络 (Unitary Evolution Recurrent Neural Networks)”. In: 国际机器学习大会 (ICML). 2016, pp. 1120-1128.
A Discussion: Selection Mechanism
选择机制的讨论 (Selection Mechanism)
Our selection mechanism is inspired by and related to concepts such as gating, hypernetworks, and data-dependence. It can also be viewed as related to “fast weights” (J. Ba et al. 2016; Schmidhuber 1992), which connects classical RNNs with the mechanism of linear attention (Schlag, Irie, and Schmidhuber 2021). However, we believe that it is a distinct concept that is worth clarifying.
我们的选择机制受到 gating、hyper networks 和 data-dependence 等概念的启发并与之相关。它也可以被视为与 “fast weights" (J. Ba et al. 2016; Schmid huber 1992) 相关,这将经典的 RNN 与线性注意力机制 (Schlag, Irie, and Schmid huber 2021) 联系起来。然而,我们认为这是一个 distinct concept(distinct 概念),值得澄清。
Gating. Gating originally referred to the gating mechanisms of RNNs such as the LSTM (Hochreiter and Schmidhuber 1997) and GRU (J. Chung et al. 2014), or the gated equation (5) in Theorem 1. This was interpreted as a particular mechanism for controlling whether to let an input into the hidden state of an RNN. In particular, this affects the propagation of signal through time and causes inputs to interact along the sequence length dimension.
门控。门控最初指的是 RNN 的门控机制,例如 LSTM (Hochreiter and Schmidhuber 1997) 和 GRU (J. Chung 等 2014),或者是定理 1 中的门控公式 (5)。这被解释为一种特定的机制,用于控制是否让输入进入 RNN 的隐藏状态。特别是,这会影响信号在时间上的传播,并导致输入沿序列长度维度相互作用。
However, the concept of gating has since been relaxed in popular usage to simply mean any multiplicative interaction (often with an activation function). For example, elementwise multiplicative components of neural network architectures (that do not interact along sequence length) are now commonly referred to as gated architectures (Hua et al. 2022; Mehta et al. 2023), despite a very different meaning than the original RNN sense. Thus we believe the original concept of RNN gating and the popular usage of multiplicative gating actually have very different semantic meanings.
然而, gating 概念在流行用法中已经被放宽,简单地指任何乘法交互(通常带有激活函数)。例如,神经网络架构中的元素级乘法组件(不沿序列长度交互)现在通常被称为 gated 架构 (Hua et al. 2022; Mehta et al. 2023),尽管其含义与原始的 RNN 意义非常不同。因此我们认为,RNN 中的 gating 概念与流行的乘法 gating 用法实际上具有非常不同的语义意义。
Hypernetworks. Hypernetworks refer to neural networks whose parameters are themselves generated by smaller neural networks. The original idea (Ha, Dai, and Quoc V. Le 2017) used it in a narrow sense to define a large RNN whose recurrent parameters are generated by a smaller RNN, and other variants have been around for a long time (Schmidhuber 1992).
超网络。超网络指的是神经网络的参数由较小的神经网络生成。最初的想法 (Ha, Dai, and Quoc V. Le 2017) 在狭义上将其定义为一个大的递归神经网络 (RNN),其递归参数由一个小的 RNN 生成,其他变体也已经存在很长时间了 (Schmid huber 1992)。
Data-dependence. Similar to hypernetworks, data-dependence can refer to any notion where some parameters of the model depend on the data (Poli et al. 2023).
数据依赖性。类似于超网络,数据依赖性可以指任何模型的某些参数依赖于数据的概念 (Poli et al. 2023)。
Example: GLU Activation. To illustrate the issues with these concepts, consider a simple diagonal linear layer $y=Dx$, where $D$ is a diagonal weight parameter. Now suppose that $D$ is itself generated from a linear transformation of $x$, with an optional nonlinearity: $D=\sigma(Wx)$. Since it is diagonal, the multiplication becomes an elementwise product: $y=\sigma(Wx)\circ x$.
示例:GLU 激活函数。为了说明这些概念的问题,考虑一个简单的对角线性层 $y=D x$ ,其中 $D$ 是一个对角权重参数。现在假设 $D$ 是从 $x$ 的线性变换生成的,并带有可选的非线性:$D = σ(W x)$ 。由于它是对角的,乘法变为逐元素乘积:$y = σ(W x) ∘ x$ 。
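The construction above fits in a few lines of code, which underlines how little machinery it involves. This is a plain-Python sketch with $\sigma$ taken to be the logistic sigmoid (one possible choice of the optional nonlinearity); the function names are ours.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def glu_diag(W, x):
    """Compute y = sigma(W x) ∘ x for a square matrix W (list of rows)."""
    Wx = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    # The "generated diagonal weight" D = sigma(W x) multiplies x elementwise.
    return [sigmoid(Wx[i]) * x[i] for i in range(len(x))]

# With W = 0 the gate is sigma(0) = 0.5 everywhere, so y = 0.5 * x.
print(glu_diag([[0.0, 0.0], [0.0, 0.0]], [2.0, -4.0]))
```

Note that each output depends only on the current input vector: nothing here interacts along the sequence dimension, which is the distinction drawn in the discussion of selection.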
This is a rather trivial transformation, yet it technically satisfies the common meanings of gating (since it has a multiplicative “branch”), hypernetworks (since the parameter $D$ is generated by another layer), and data-dependent (since $D$ depends on the data $x$). However, this in fact simply defines a GLU function, which is so simple that it is often considered just an activation function (Dauphin et al. 2017; Shazeer 2020) instead of a meaningful layer.
这是一项相当简单的变换,但从技术上来说,它满足了门控 (gating) 的常见含义(因为它有一个乘法“分支”),超网络 (hyper networks) 的定义(因为参数 $D$ 由另一层生成),以及数据依赖 (data-dependent) 的定义(因为 $D$ 取决于数据 $x$)。然而,实际上这仅仅定义了一个 GLU 函数,这个函数过于简单,通常被认为只是一个激活函数 (Dauphin et al. 2017; Shazeer 2020),而不是一个有意义的层。
Selection. Thus, while selection mechanisms could be considered a special case of ideas such as architectural gating, hypernetworks, or data-dependence, so can an enormous range of other constructions (essentially anything with a multiplication, including standard attention mechanisms (Bahdanau, Cho, and Bengio 2015; Vaswani et al. 2017) as well), and we find it uninformative to think of them as such.
选择机制可以被视为架构门控、超网络或数据依赖等概念的特例。然而,许多其他构造也可以归为此类——实际上任何包含乘法运算的结构,包括标准的注意力机制 (Bahdanau, Cho, 和 Bengio 2015; Vaswani 等 2017) 也是如此——我们认为将它们视为此类并不具有启发性。
Instead, we view it as most closely related to the gating mechanism of traditional RNNs, which is a special case (Theorem 1) and also has a deeper history of connections to SSMs through variable (input-dependent) discretization of $\Delta$ (Funahashi and Nakamura 1993; Gu, Dao, et al. 2020; Tallec and Olivier 2018). We also eschew the term “gating” in favor of selection to clarify the overloaded use of the former. More narrowly, we use selection to refer to the mechanistic action of a model to select or ignore inputs and facilitate data interaction along the sequence length (Section 3.1). Beyond selective SSMs and gated RNNs, other examples may include input-dependent convolutions (Kosma, Nikolentzos, and Vazirgiannis 2023; Lioutas and Guo 2020; Lutati, Zimerman, and Wolf 2023; Yang et al. 2019) and even attention.
相反,我们认为它与传统 RNN 的门控机制最为密切相关,这是特殊情况(定理 1),并且通过变量(输入依赖)离散化 $\Delta$ 与 SSMs 有着更悠久的联系历史(Funahashi 和 Nakamura 1993;Gu, Dao 等 2020;Tallec 和 Olivier 2018)。我们还避免使用“门控”一词,而用选择来澄清前者的多重含义。更具体地,我们使用选择来指模型选择或忽略输入并促进沿序列长度的数据交互的机制动作(第 3.1 节)。除了选择性 SSMs 和门控 RNNs 外,其他例子可能包括输入依赖卷积(Kosma, Niko lent zo s 和 Vaz ir gianni s 2023;Lioutas 和 Guo 2020;Lutati, Zimerman 和 Wolf 2023;Yang 等 2019)甚至注意力机制。
B Related Work
B 相关工作
We overview several prior works related to our methods. We mention that some of the most closely related models include recurrent layers such as S4, S5, and quasi-RNNs; as well as end-to-end architectures such as H3, RetNet, and RWKV.
我们概述了几项与我们的方法相关的先前工作。我们提到,一些最密切相关的大语言模型包括循环层,例如 S4、S5 和准递归神经网络 (quasi-RNNs);以及端到端架构,例如 H3、RetNet 和 RWKV。
B.1 S4 Variants and Derivatives
B.1 S4 变体和衍生产品
We describe a brief overview of some structured SSMs from past work, particularly those that have a relation to our method.
我们简要概述了一些过去工作中具有结构的 SSM,特别是那些与我们的方法有关的。
· S4 (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) introduced the first structured SSM, describing diagonal structure and diagonal plus low-rank (DPLR). It focused on efficient convolutional algorithms for DPLR SSMs due to a connection to continuous-time online memorization (HIPPO) (Gu, Dao, et al. 2020).
S4 (Gu, Goel, 和 Ré 2022; Gu, Johnson, Goel, 等 2021) 引入了第一个结构化的 SSM,描述了对角线结构和对角线加低秩 (DPLR)。它专注于高效的 DPLR SSM 卷积算法,这是由于与连续时间在线记忆 (HIPPO) (Gu, Dao, 等 2020) 的联系。
· DSS (Gupta, Gu, and Berant 2022) first discovered the empirical effectiveness of diagonal structured SSMs by approximating the HIPPO initialization. This was expanded on theoretically in S4D (Gu, Gupta, et al. 2022).
· DSS (Gupta, Gu, 和 Berant 2022) 首次发现了通过对角结构的 SSM 近似 HIPPO 初始化的实证有效性。这一发现随后在理论上由 S4D (Gu, Gupta, 等 2022) 进行了扩展。
· S5 (Smith, Warrington, and Linderman 2023) independently discovered the diagonal SSM approximation, and is the first S4 model to be computed recurrently with the parallel scan. However, this required lowering the effective state dimension, which they accomplished by switching the SSM dimensions from a SISO (single-input single-output) to MIMO (multi-input multi-output) formulation. Our proposed S6 shares the scan, but differs by (i) keeping the SISO dimensions, which provides a larger effective recurrent state, (ii) using a hardware-aware algorithm to overcome the computation issue, and (iii) adding the selection mechanism.
S5 (Smith, Warrington, 和 Linderman 2023) 独立发现了对角线 SSM 近似,并且是第一个使用并行扫描计算的 S4 模型。然而,这需要降低有效的状态维度,他们通过将 SSM 维度从 SISO (单输入单输出) 转换为 MIMO (多输入多输出) 来实现这一点。我们提出的 S6 共享了扫描机制,但不同之处在于 (i) 保持 SISO 维度,这提供了更大的有效循环状态,(ii) 使用硬件感知算法来克服计算问题,(iii) 添加选择机制。
Lu et al. (2023) applied S5 to meta-RL in order to handle resetting the SSM state between episode trajectories. Their mechanism can be viewed as a particular hard-coded instance of a selection mechanism, where $\overline{A}$ is manually set to 0, instead of our learnable mechanism that depends on the input. It would be interesting to apply selective SSMs generically to this setting and probe if the model has learned to automatically reset its state on episode boundaries.
Lu 等人 (2023) 将 S5 应用于元强化学习 (meta-RL),以处理在情节轨迹之间重置 SSM 状态的问题。他们的机制可以被视为选择机制的一个特定硬编码实例,其中 $\overline{A}$ 手动设置为 0,而不是我们依赖于输入的可学习机制。将选择性 SSM 泛化应用于这种场景并探究模型是否已学会在情节边界上自动重置其状态将是一件有趣的事情。
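The hard-coded reset described above can be sketched on a scalar linear recurrence: at an episode boundary the transition $\overline{A}$ is overridden with 0, so no state is carried across. This is our own minimal illustration, not the implementation of Lu et al. (2023); the function name and the `resets` flag list are hypothetical.

```python
def scan_with_resets(xs, a_bar, b_bar, resets):
    """Scalar recurrence h_t = a_t * h_{t-1} + b_bar * x_t with manual resets."""
    h = 0.0
    out = []
    for x, reset in zip(xs, resets):
        # Hard-coded "selection": a_t = 0 forgets the entire previous state.
        a_t = 0.0 if reset else a_bar
        h = a_t * h + b_bar * x
        out.append(h)
    return out

# The third step starts a new episode, so its state depends only on x_3.
print(scan_with_resets([1.0, 1.0, 1.0, 1.0], a_bar=0.5, b_bar=1.0,
                       resets=[False, False, True, False]))
```

A selective SSM would instead have to learn to drive $\overline{A}$ toward 0 from the input itself, for example from a boundary token.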
· Mega (Ma et al. 2023) introduced a simplification of S4 to be real- instead of complex-valued, giving it an interpretation of being an exponential moving average (EMA). They additionally make an interesting connection of the discretization step of SSMs to an EMA damping term. Contrary to findings in the original S4 papers, this was the first model to show that real-valued SSMs are empirically effective in certain settings or when combined with different architectural components.
Mega (Ma et al. 2023) 引入了 S4 的简化版本,使其成为实值而不是复值,赋予其指数移动平均 (EMA) 的解释。他们还提出了一个有趣的连接,即将状态空间模型 (SSM) 的离散化步骤与 EMA 阻尼项联系起来。与原始 S4 论文中的发现相反,这是第一个显示在某些设置下或与其他架构组件结合时,实值 SSM 在经验上是有效的模型。
· Liquid S4 (Hasani et al. 2023) is also motivated by augmenting S4 with an input-dependent state transition. From this perspective it shares similarity to selection mechanisms, although in a limited form which is still computed convolutionally and close to LTI.
Liquid S4 (Hasani 等 2023) 同样受到通过输入依赖的状态转换增强 S4 的启发。从这个角度来看,它与选择机制相似,尽管是以一种有限的形式,仍然通过卷积计算并且接近线性时不变系统 (LTI)。
· SGConv (Y. Li et al. 2023), Hyena (Poli et al. 2023), LongConv (Fu et al. 2023), MultiresConv (J. Shi, K. A. Wang, and Fox 2023), and Toeplitz Neural Network (Qin, Han, W. Sun, B. He, et al. 2023) all focus on the convolutional representation of S4 and create global or long convolution kernels with different parameterizations. However, these methods cannot do fast autoregressive inference directly.
SGConv (Y. Li 等 2023),Hyena (Poli 等 2023),LongConv (Fu 等 2023),MultiresConv (J. Shi, K. A. Wang, 和 Fox 2023),以及 Toeplitz 神经网络 (Qin, Han, W. Sun, B. He 等 2023) 都专注于 S4 的卷积表示,并创建了具有不同参数化的全局或长卷积核。然而,这些方法无法直接进行快速自回归推理。
Notably, all of these methods, and all other structured SSMs that we are aware of, have been non-selective and usually strictly LTI (linear time invariant).
值得注意的是,所有这些方法,以及我们所知的所有其他结构化的 SSM,都是非选择性的,并且通常严格是 LTI (线性时不变) 的。
B.2 SSM Architectures
B.2 SSM架构
We use SSM architectures or state space neural networks (SSNN) to refer to deep neural network architectures incorporating one of the previous SSMs as a black box layer.
我们使用状态空间模型 (SSM) 架构或状态空间神经网络 (SSNN) 来指代将前述 SSM 作为黑盒层集成的深度神经网络架构。
Copying task because simply masking out the irrelevant inputs does not affect the spacing between the relevant ones (indeed, the Selective Copying task can even be viewed as coming pre-masked if the noise tokens are embedded to 0).
复制任务,因为仅仅屏蔽无关输入并不会影响相关输入之间的间距(确实,选择性复制任务甚至可以被视为预先屏蔽了,如果噪声 Token 被嵌入为 0)。
· RetNet (Y. Sun et al. 2023) is also based on Linear Attention and very similar to H3, but reduces the inner S4 layer to a special case where the state dimension is $N=1$. Although not framed as such, its recurrence can be viewed as a special case of a linear SSM.
RetNet (Y. Sun 等 2023) 也基于线性注意力机制,与 H3 非常相似,但将内部 S4 层简化为状态维度为 $N=1$ 的特殊情况。虽然没有这样表述,但其递归可以被视为线性状态空间模型 (linear SSM) 的一个特例。
Its primary source of improvement is using a linear attention with large head dimension, which can be viewed as another method to perform input-dependent state expansion. Using a larger head dimension in the context of linear attention variants was first done by H3, but it was not used extensively since it requires a proportional amount of extra computation. RetNet avoids this with an alternate way to parallelize the computation, using a variant of standard multi-head attention instead of convolutions. This is made feasible by its particular special case of SSMs, which acts as a simple EMA.
其主要改进来源是使用具有大头维度的线性注意力机制,这可以被视为执行输入依赖状态扩展的另一种方法。在线性注意力变体中使用更大的头维度最早由 H3 实现,但由于这需要额外成比例的计算量,因此并未广泛采用。RetNet 通过使用标准多头注意力机制 (multi-head attention) 的变体而不是卷积来并行化计算,从而避免了这一点,这是通过其特定的 SSM 特例作为简单的 EMA 来实现的。
· RWKV (B. Peng et al. 2023) is another recent RNN designed for language modeling. It is based on AFT (attention-free Transformer; S. Zhai et al. 2021), another variant of linear attention. Its main “WKV” mechanism involves LTI recurrences and can be seen as the ratio of two SSMs.
RWKV (B. Peng 等 2023) 是另一种最近为语言建模设计的 RNN。它基于 AFT (attention-free Transformer (S. Zhai 等 2021)),即另一种线性注意力的变体。其主要的 “WKV” 机制涉及 LTI 回归,可以视为两个 SSM 的比率。
We also highlight the gated attention unit (GAU) from Hua et al. (2022), which was motivated by combining the Transformer's MHA and MLP blocks together and was an inspiration for our architecture (Section 3.4) combining the H3 and MLP blocks.
我们还强调了 Hua 等人 (2022) 提出的门控注意力单元 (GAU),它通过将 Transformer 的多头注意力 (MHA) 和多层感知机 (MLP) 模块结合在一起而得到启发,并且是我们架构 (第 3.4 节) 结合 H3 和 MLP 模块的灵感来源。
B.3 Relationship to RNNs
B.3 与 RNN 的关系
RNNs and SSMs are broadly related, as they both involve the concepts of recurrence on a latent state.
RNN 和 SSM 广泛相关,因为它们都涉及潜在状态的递归概念。
Several older RNNs such as the strongly typed RNN (Balduzzi and Ghifary 2016), quasi-RNN (QRNN) (Bradbury et al. 2016), and simple recurrent unit (SRU) (Lei 2021; Lei et al. 2017) involve forms of gated RNNs without time-wise nonlinearities. Because of the connections of gating mechanisms and selection mechanisms, these can be viewed as cases of selective SSMs, and are thus more powerful in a sense than the family of LTI structured SSMs above. The main differences are:
一些较旧的 RNN,例如强类型 RNN (Balduzzi 和 Ghifary 2016)、准 RNN (QRNN) (Bradbury 等 2016) 和简单循环单元 (SRU) (Lei 2021; Lei 等 2017),涉及没有时间非线性依赖的门控 RNN 形式。由于门控机制和选择机制之间的联系,这些可以被视为选择性 SSM 的实例,因此在某种意义上比上述 LTI 结构的 SSM 更强大。主要区别在于:
Additionally, older RNNs famously suffered from efficiency issues and the vanishing gradients problem (Hochreiter 1991; Hochreiter, Bengio, et al. 2001; Pascanu, Mikolov, and Bengio 2013), both caused by their sequential nature. The former could be solved for some of the above RNNs by leveraging the parallel scan (Martin and Cundy 2018), but the latter was difficult without theory later developed for SSMs. For example, modern structured SSMs differ in more careful parameterization of the recurrent dynamics inspired by classical SSM theory (e.g. through discretization (Gu, Johnson, Goel, et al. 2021; Gu, Johnson, Timalsina, et al. 2023), or direct analysis (Gupta, Mehta, and Berant 2022; Kaul 2020; Orvieto et al. 2023)).
此外,较早的循环神经网络 (RNN) 以效率问题和梯度消失问题而闻名 (Hochreiter 1991; Hochreiter, Bengio, et al. 2001; Pascanu, Mikolov, 和 Bengio 2013),这些问题都是由其顺序性引起的。前者可以通过利用并行扫描 (Martin 和 Cundy 2018) 来解决部分 RNN 的问题,但后者在没有后来为状态空间模型 (SSM) 开发的理论的情况下很难解决。例如,现代结构化的 SSM 在参数化循环动态方面更加谨慎,这受到经典 SSM 理论的启发(例如通过离散化 (Gu, Johnson, Goel, et al. 2021; Gu, Johnson, Timalsina, et al. 2023)),或直接分析 (Gupta, Mehta, 和 Berant 2022; Kaul 2020; Orvieto et al. 2023))。
We also note that there is a long line of work on orthogonal RNNs (Arjovsky, Shah, and Bengio 2016; Henaff, Szlam, and LeCun 2016; Lezcano-Casado and Martinez-Rubio 2019; Mhammedi et al. 2017; Vorontsov et al. 2017) which are motivated by constraining the $\bar{A}$ transition matrix to be orthogonal or unitary, in order to control its eigenvalues and prevent the vanishing gradient problem. However, these had other limitations; we believe that these stem from the fact that orthogonal/unitary RNNs are also LTI. For example, they are almost always evaluated on the Copying task which they can solve perfectly, but observed to struggle on the Selective Copying task (Jing et al. 2019).
我们还注意到,有关正交 RNN (Arjovsky, Shah, 和 Bengio 2016; Henaff, Szlam, 和 LeCun 2016; Lezcano-Casado 和 Martinez-Rubio 2019; Mhammedi 等 2017; Vorontsov 等 2017) 的研究工作历史悠久,这些工作旨在通过将 $\bar{A}$ 转移矩阵约束为正交或酉矩阵来控制其特征值并防止梯度消失问题。然而,这些方法存在其他限制;我们认为这些限制源于正交/酉 RNN 也是线性时不变系统 (LTI)。例如,它们几乎总是被评估于 Copying 任务上,可以完美解决该任务,但在 Selective Copying 任务 (Jing 等 2019) 上表现不佳。
B.4 Linear Attention
B.4 线性注意力机制 (Linear Attention)
The Linear Attention (LA) (Katharopoulos et al. 2020) framework is an important result popularizing kernel attention and showing how it relates to recurrent autoregressive models. Many variants have proposed alternative kernels and other modifications. Random Feature Attention (RFA) (H. Peng et al. 2021) chooses the kernel feature map to approximate softmax attention (i.e. the exp feature map) using the random Fourier feature approximation of Gaussian kernels (Rahimi and Recht 2007). Performer (Choromanski et al. 2021) finds an approximation to the exponential kernel involving only positive features, which also allows the softmax normalization term. TransNormer (Qin, Han, W. Sun, D. Li, et al. 2022) showed that the LA denominator term can be unstable and proposed replacing it with a LayerNorm. cosFormer (Qin, W. Sun, et al. 2022) augments RFA with a cosine reweighting mechanism that incorporates positional information to emphasize locality. Linear Randomized Attention (Zheng, C. Wang, and L. Kong 2022) generalizes RFA from the perspective of importance sampling to provide better estimates of the full softmax kernel (rather than just the exp-transformed numerator).
线性注意力 (Linear Attention) (Katharopoulos et al. 2020) 框架是一个重要的成果,它普及了核注意力,并展示了它与递归自回归模型的关系。许多变体提出了替代的核函数和其他修改。随机特征注意力 (Random Feature Attention) (RFA) (H. Peng et al. 2021) 选择核特征映射来近似 softmax 注意力(即 exp 特征映射),使用高斯核的随机傅里叶特征近似 (Rahimi 和 Recht 2007)。Performer (Choromanski et al. 2021) 找到了一个仅涉及正特征的指数核近似方法,这也允许 softmax 归一化项。TransNormer (Qin, Han, W. Sun, D. Li, et al. 2022) 指出线性注意力分母项可能不稳定,并提议用 LayerNorm 替换它。cosFormer (Qin, W. Sun, et al. 2022) 增强了 RFA,引入了余弦重新加权机制,该机制结合了位置信息以强调局部性。线性随机注意力 (Linear Randomized Attention) (Zheng, C. Wang, 和 L. Kong 2022) 从重要性采样的角度推广了 RFA,以提供对完整 softmax 核的更好估计(而不仅仅是 exp 变换后的分子)。
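The relation between kernel attention and recurrent autoregressive models can be made concrete with a small sketch. It assumes the positive feature map $\phi(x)=\mathrm{elu}(x)+1$ (one common choice; any positive map would do here) and plain Python lists for vectors. All function names are illustrative.

```python
import math

def phi(vec):
    """Positive feature map elu(x) + 1 (an assumed choice of kernel feature map)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def causal_attention_quadratic(Q, K, V):
    """Causal kernel attention computed pairwise: O(L^2) in sequence length."""
    out = []
    for t, q in enumerate(Q):
        fq = phi(q)
        weights = [dot(fq, phi(K[s])) for s in range(t + 1)]
        norm = sum(weights)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights)) / norm
                    for j in range(len(V[0]))])
    return out

def causal_attention_recurrent(Q, K, V):
    """Same output via a running state: O(L) in sequence length."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k_s) v_s^T
    z = [0.0] * d_k                        # running sum of phi(k_s), the denominator term
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d_k)) / norm
                    for j in range(d_v)])
    return out
```

The recurrent form carries a state of size $d_k\times d_v$ per head, which is why a larger head dimension acts as a larger recurrent state.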
Aside from kernel attention, many other variants of efficient attention exist; the survey Tay, Dehghani, Bahri, et al. (2022) offers an extensive categorization of many of these.
除了内核注意力之外,还存在许多其他高效的注意力变体;Tay, Dehghani, Bahri 等 (2022) 的综述对其中的许多进行了广泛的分类。
B.5 Long Context Models
B.5 长上下文模型 (Long Context Models)
Long context has become a popular subject, and several recent models have claimed to scale to longer and longer sequences. However, these are often from a computational standpoint and have not been extensively validated. These include:
长上下文已成为一个热门话题,最近有几个模型声称能够扩展到越来越长的序列。然而,这些模型通常是从计算角度出发,并未经过广泛的验证。这些模型包括:
· Recurrent Memory Transformer (Bulatov, Kuratov, and Burtsev 2023), a lightweight wrapper around a Transformer backbone. It showed ability to generalize up to 1M sequences but only on synthetic memorization tasks; their main result is similar to our Induction Heads extrapolation experiment (Table 2).
循环记忆 Transformer (Bulatov, Kuratov, and Burtsev 2023),这是围绕 Transformer 主干的轻量级包装。它展示了最多可泛化到 1M 序列的能力,但仅限于合成记忆任务;他们的主要结果类似于我们的归纳头外推实验(表 2)。
· LongNet (Ding et al. 2023), which claimed to scale to 1B length but only evaluated on length $<100K$ for actual tasks.
LongNet (Ding et al. 2023),声称可以扩展到 1B 长度,但在实际任务中仅评估了长度 <100K 的情况。
· Hyena and HyenaDNA (Nguyen, Poli, et al. 2023; Poli et al. 2023), which claimed to leverage up to 1M context. However, their experiments trained on proportionally more data at longer contexts, making it hard to conclude if quality improvements at 1M context are due to context length or due to more data and computation.
Hyena 和 HyenaDNA (Nguyen, Poli 等 2023; Poli 等 2023),声称利用了最多 1M 的上下文。然而,他们的实验在更长的上下文中使用了按比例更多的数据进行训练,因此很难得出在 1M 上下文中质量提升是由于上下文长度还是由于更多数据和计算资源。
· Sparse Transformer (Child et al. 2019) showed a proof-of-concept of using a strided sparse attention Transformer to model audio waveforms of length $2^{20}=1048576$, although it did not discuss performance tradeoffs when controlling for computation and model size.
Sparse Transformer (Child 等 2019) 展示了使用步幅稀疏注意力 Transformer 建模长度为 $2^{20}=1048576$ 的音频波形的概念验证,尽管没有讨论在控制计算量和模型大小时的性能权衡。
In contrast, we believe this work presents one of the first approaches to meaningfully demonstrate increasing performance with longer context.
相比之下,我们认为这项工作是首批有意义地展示随着上下文长度增加性能也随之提升的方法之一。
C Mechanics of Selective SSMs
C 选择性 SSM 的机制
Proof of Theorem 1. Consider a selective SSM (Algorithm 2) with $N=1$, $A=-1$, $B=1$, $s_{\Delta}=\mathsf{Linear}(x)$, and $\tau_{\Delta}=\mathsf{softplus}$. The corresponding continuous-time SSM (1) is
定理 1 的证明。考虑一个选择性 SSM (算法 2),其中 $N=1$,$A=-1$,$B=1$,$s_{\Delta}=\mathsf{Linear}(x)$,$\tau_{\Delta}=\mathsf{softplus}$。相应的连续时间 SSM (1) 是
$$
h'(t)=-h(t)+x(t)
$$
which is also called a leaky integrator.
这也被称为漏积分器。
The discretization step size is
离散化步长是
$$
\begin{array}{r l}
\Delta_{t} &= \tau_{\Delta}(\mathsf{Parameter}+s_{\Delta}(x_{t})) \\
&= \mathsf{softplus}(\mathsf{Parameter}+\mathsf{Linear}(x_{t})) \\
&= \mathsf{softplus}(\mathsf{Linear}(x_{t}))
\end{array}
$$
where we observe that the parameter can be viewed as a learnable bias and folded into the linear projection.
我们观察到该参数可以被视为可学习的偏置,并折叠到线性投影中。
Now applying the zero-order hold (ZOH) discretization formulas:
现在应用零阶保持 (ZOH) 离散化公式:
$$
\begin{array}{r l}
\overline{A}_{t} &= \exp(\Delta_{t} A)=\frac{1}{1+\exp(\mathsf{Linear}(x_{t}))}=\sigma(-\mathsf{Linear}(x_{t})) \\
&= 1-\sigma(\mathsf{Linear}(x_{t})) \\
\overline{B}_{t} &= (\Delta_{t} A)^{-1}(\exp(\Delta_{t} A)-I)\cdot\Delta_{t} B=-(\exp(\Delta_{t} A)-I)=1-\overline{A}_{t} \\
&= \sigma(\mathsf{Linear}(x_{t})).
\end{array}
$$
Thus the final discrete recurrence (2a) is
因此,最终的离散递推公式 (2a) 是
$$
\begin{array}{r l}
g_{t} &= \sigma(\mathsf{Linear}(x_{t})) \\
h_{t} &= (1-g_{t})h_{t-1}+g_{t}x_{t}
\end{array}
$$
as desired.
如所需。
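The derivation above can be checked numerically. The sketch below uses a scalar `z_t` as a stand-in for $\mathsf{Linear}(x_t)$ (an assumption for illustration) and compares one step of the ZOH-discretized selective SSM with $A=-1$, $B=1$ against one step of the gated recurrence.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def zoh_discretize(delta, A=-1.0, B=1.0):
    """Zero-order hold for a scalar SSM: returns (A_bar, B_bar)."""
    A_bar = math.exp(delta * A)
    B_bar = (A_bar - 1.0) / (delta * A) * delta * B
    return A_bar, B_bar

def selective_ssm_step(h_prev, x_t, z_t):
    """One step with step size Delta_t = softplus(z_t)."""
    A_bar, B_bar = zoh_discretize(softplus(z_t))
    return A_bar * h_prev + B_bar * x_t

def gated_step(h_prev, x_t, z_t):
    """One step of h_t = (1 - g_t) h_{t-1} + g_t x_t with g_t = sigmoid(z_t)."""
    g = sigmoid(z_t)
    return (1.0 - g) * h_prev + g * x_t
```

For any `z_t`, the two step functions agree up to floating-point error, since $\exp(-\mathsf{softplus}(z))=1-\sigma(z)$.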
D Hardware-aware Algorithm For Selective SSMs
D 选择性 SSM 的硬件感知算法
Without input-dependent selectivity, SSMs can be efficiently implemented as a convolution (Dao, Fu, Saab, et al. 2023; Gu, Goel, and Ré 2022), which leverages the fast Fourier transform (FFT) as a primitive. With selectivity, SSMs are no longer equivalent to convolution, but we leverage the parallel associative scan. While SSM scans are theoretically efficient ($O(BLDN)$ FLOPs, scaling linearly in $L$), training foundation models with selective SSMs requires them to be efficient on modern hardware (GPUs) as well. We describe how we use kernel fusion and recomputation to make the SSM scan fast and memory-efficient. We evaluate the speed of our scan implementation compared to convolution and attention in Section 4.5, showing that it is up to $7\times$ faster than attention at sequence length 32K, and is as memory-efficient as the best attention implementation (FlashAttention).
在没有输入依赖的选择性的情况下,状态空间模型 (SSMs) 可以高效地实现为卷积 (Dao, Fu, Saab, et al. 2023; Gu, Goel, 和 Ré 2022),这利用了快速傅里叶变换 (FFT) 作为基本操作。具有选择性时,状态空间模型不再等同于卷积,但我们利用并行关联扫描。虽然状态空间模型扫描在理论上是高效的($O(BLDN)$ FLOPs,线性扩展于 $L$),但在现代硬件(GPU)上训练带有选择性状态空间模型的基础模型也需要它们高效运行。我们描述了如何使用内核融合和重计算使状态空间模型扫描快速且内存高效。我们在第 4.5 节中评估了我们的扫描实现与卷积和注意力机制的速度,显示其在序列长度为 32K 时比注意力机制快达 $7\times$,并且与最佳的注意力机制实现(FlashAttention)一样内存高效。
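To illustrate why a time-varying linear recurrence admits a parallel associative scan, here is a minimal scalar sketch. Each step $h\mapsto a_t h+b_t$ is represented as a pair $(a_t,b_t)$, and composing two steps is associative. The Hillis-Steele schedule below is one textbook scan, chosen for brevity; it is not claimed to be the schedule used in the actual kernel.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def associative_scan(a, b):
    """Inclusive scan in ceil(log2(L)) rounds; each round is elementwise-parallel."""
    elems = list(zip(a, b))
    shift = 1
    while shift < len(elems):
        elems = [
            combine(elems[i - shift], elems[i]) if i >= shift else elems[i]
            for i in range(len(elems))
        ]
        shift *= 2
    # With zero initial state, h_t equals the offset term of the composed step.
    return [offset for _, offset in elems]
```

Because `combine` is associative, the work inside each round can be distributed across threads, which is what makes the recurrent mode parallelizable over the sequence length.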
Speed. On modern hardware accelerators (GPUs) most operations (except matrix multiply) are bounded by memory bandwidth (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009). This is the case with our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to significant speedup compared to a standard implementation.
速度。在现代硬件加速器 (GPU) 上,大多数操作(除矩阵乘法外)都受内存带宽限制 (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009)。我们的扫描操作也是如此,我们使用内核融合来减少内存 IO 操作的数量,与标准实现相比,这带来了显著的速度提升。
The standard way to implement the scan algorithm in Section 3.2 is to prepare the scan input $\overline{{A}},\overline{{B}}$ of size $(B,L,D,N)$ in GPU HBM (high-bandwidth memory, commonly referred to as GPU memory), call a parallel associative scan implementation to write the scan output of size $(B,L,D,N)$ to GPU HBM, then multiply that scan output with $C$ to produce an output of size $(B,L,D)$. However, this requires the number of memory reads/writes on the order of $O(BLDN)$. We can instead fuse the discretization step, the scan, and the multiplication with $C$ into one kernel:
第 3.2 节中扫描算法的标准实现方式是准备大小为 $(B,L,D,N)$ 的扫描输入 $\overline{{A}},\overline{{B}}$ 到 GPU HBM (高带宽内存,通常称为 GPU 内存),调用并行关联扫描实现,将大小为 $(B,L,D,N)$ 的扫描输出写入 GPU HBM,然后将该扫描输出与 $C$ 相乘以生成大小为 $(B,L,D)$ 的输出。然而,这需要读/写的内存次数为 $O(B L D N)$ 。我们可以将离散化步骤、扫描和与 $C$ 的乘法融合到一个内核中:
This way, we reduce IOs by a factor of $O(N)$ (the state dimension), which in practice speeds up the operation by 20-40 times (Section 4.5).
这样,我们将 IO 操作减少了 $O(N)$ (状态维度)倍,在实际应用中操作速度提高了 20-40 倍(第 4.5 节)。
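A rough count of HBM traffic shows where the $O(N)$ factor comes from. The accounting below is a simplification under stated assumptions: the unfused path materializes three tensors of size $(B,L,D,N)$ in HBM (the two scan inputs and the scan output), while the fused path touches only $\Delta$, the input, and the output at size $(B,L,D)$, $B$ and $C$ at size $(B,L,N)$, and $A$ at size $(D,N)$. These shapes and the function names are assumptions for illustration, and the count ignores repeated reads, caching, and data types.

```python
def hbm_elements_unfused(batch, length, d_model, d_state):
    """Elements materialized in HBM when scan inputs and outputs are stored (assumed: 3 tensors)."""
    return 3 * batch * length * d_model * d_state

def hbm_elements_fused(batch, length, d_model, d_state):
    """Elements moved when discretization, scan, and C-multiply share one kernel (assumed shapes)."""
    tokens = batch * length
    return 3 * tokens * d_model + 2 * tokens * d_state + d_model * d_state

def io_reduction(batch, length, d_model, d_state):
    return (hbm_elements_unfused(batch, length, d_model, d_state)
            / hbm_elements_fused(batch, length, d_model, d_state))
```

For example, with the hypothetical sizes `batch=8, length=2048, d_model=1024, d_state=16`, `io_reduction` comes out slightly below the state dimension of 16, consistent with a reduction on the order of $N$.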
For sequence lengths $L$ that are too long to fit the sequence in SRAM (which is much smaller than HBM), we split the sequences into chunks and perform the fused scan on each chunk. As long as we have the intermediate scan states, we can continue the scan with the next chunk.
对于长度为 $L$ 的序列过长,无法将其放入 SRAM(SRAM 比 HBM 小得多),我们将序列分割成块,并对每个块执行融合扫描。只要我们有中间扫描状态,就可以继续使用下一个块进行扫描。
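The chunking idea can be sketched on a scalar recurrence: each chunk is scanned on its own, and only the final state of one chunk is carried into the next. The chunk length and function names here are illustrative; the real kernel operates on tensors in SRAM rather than Python lists.

```python
def scan_chunk(a, b, h_init):
    """Scan one chunk of h_t = a_t * h_{t-1} + b_t, starting from h_init."""
    h, out = h_init, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out, h

def chunked_scan(a, b, chunk_len, h0=0.0):
    """Process the sequence chunk by chunk, carrying only the boundary state."""
    out, h = [], h0
    for start in range(0, len(a), chunk_len):
        chunk_out, h = scan_chunk(a[start:start + chunk_len],
                                  b[start:start + chunk_len], h)
        out.extend(chunk_out)
    return out
```

The result does not depend on `chunk_len`, so the chunk size can be chosen purely by what fits in fast memory.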
Memory. We describe how we use the classical technique of recomputation to reduce the total amount of memory required to train selective SSM layers.
内存。我们描述了如何使用经典的重新计算 (re computation) 技术来减少训练选择性 SSM 层所需的总内存量。
From the way we fuse the forward pass, we do not save the intermediate states of size $(B,L,D,N)$ to avoid memory blowup. However, these intermediate states are necessary for the backward pass to compute gradients. We instead recompute those intermediate states in the backward pass. Since the inputs $\Delta,A,B,C$ and output gradient read from HBM to SRAM are of size $O(BLN+DN)$, and the input gradients are also of size $O(BLN+DN)$, recomputation avoids the cost of reading $O(BLND)$ elements from HBM. This means that recomputation of the SSM states in the backward pass speeds up the computation compared to storing them and reading them from HBM.
从我们融合前向传播的方式来看,我们不会保存大小为 (B,L,D,N) 的中间状态以避免内存爆炸。然而,这些中间状态对于反向传播计算梯度是必要的。因此,我们在反向传播中重新计算这些中间状态。由于输入 $\Delta,A,B,C$ 和输出梯度从 HBM 读取到 SRAM 的大小为 O(B L N+D N),输入梯度的大小也为 O(B L N+D N),重新计算避免了从 HBM 读取 O(B L N D) 元素的成本。这意味着在反向传播中重新计算 SSM 状态比存储并从 HBM 读取它们更快。
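Recomputation can be illustrated on a scalar recurrence with output $y=\sum_t c_t h_t$. In this sketch the forward pass returns only $y$ and keeps no states; the backward pass rebuilds the states from the inputs and then runs the reverse-time gradient recursion. It is a toy model of the idea, with hypothetical function names, and not the fused kernel itself.

```python
def forward(a, b, c):
    """Compute y = sum_t c_t * h_t with h_t = a_t * h_{t-1} + b_t; no states are saved."""
    h, y = 0.0, 0.0
    for a_t, b_t, c_t in zip(a, b, c):
        h = a_t * h + b_t
        y += c_t * h
    return y

def backward(a, b, c, dy=1.0):
    """Gradients of y w.r.t. a, b, c, recomputing the states from the inputs."""
    states = [0.0]  # states[t] holds h_{t-1}; states[t + 1] holds h_t
    for a_t, b_t in zip(a, b):
        states.append(a_t * states[-1] + b_t)
    n = len(a)
    da, db, dc = [0.0] * n, [0.0] * n, [0.0] * n
    carried = 0.0  # a_{t+1} * dL/dh_{t+1}, flowing backward in time
    for t in reversed(range(n)):
        dh_t = c[t] * dy + carried
        da[t] = dh_t * states[t]
        db[t] = dh_t
        dc[t] = states[t + 1] * dy
        carried = a[t] * dh_t
    return da, db, dc
```

The only tensors that cross the forward/backward boundary are the inputs and the output gradient, which mirrors the trade of extra compute for reduced memory traffic described above.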
Beyond optimizing for the memory requirement of just the scan operation, we also use recomputation to optimize the memory requirement of the entire selective SSM block (input projection, convolution, activation, scan, output projection). In particular, we do not save intermediate activations that take a lot of memory but are fast to recompute (e.g. output of activation function or short convolution). As a result, the selective SSM layer has the same memory requirement as an optimized Transformer implementation with FlashAttention. In particular, each attention layer (FlashAttention) stores around 12 bytes of activations per token, and each MLP layer stores around 20 bytes of activations per token, for a total of 32 bytes (assuming mixed-precision training in FP16 or BF16). Each selective SSM stores around 16 bytes of activations per token. Hence two layers of selective SSMs have around the same activation memory as an attention layer and an MLP layer.
除了优化扫描操作的内存需求外,我们还使用重计算来优化整个选择性 SSM 模块(输入投影、卷积、激活、扫描、输出投影)的内存需求。特别是,我们不保存占用大量内存但可以快速重新计算的中间激活值(例如激活函数或短卷积的输出)。因此,选择性 SSM 层的内存需求与具有 Flash Attention 的优化 Transformer 实现相同。具体来说,每个注意力层(Flash Attention)每 Token 存储大约 12 字节的激活值,每个 MLP 层每 Token 存储大约 20 字节的激活值,总计 32 字节(假设在 FP16 或 BF16 中进行混合精度训练)。每个选择性 SSM 每 Token 存储大约 16 字节的激活值。因此,两个选择性 SSM 层的激活内存与一个注意力层和一个 MLP 层大致相同。
Table 11: (Induction heads.) Models are trained on sequence length $2^{8}=256$ and tested on various sequence lengths from $2^{6}=64$ up to $2^{20}=1048576$. ✓ denotes perfect generalization accuracy, while ✗ denotes out of memory.
表 11: (归纳头。)模型在序列长度 $2^{8}=256$ 上进行训练,并在各种序列长度上进行测试,从 $2^{6}=64$ 到 $2^{20}=1048576$。✓ 表示完美的泛化准确率,而 ✗ 表示内存不足。
| MODEL | PARAMS | TEST ACCURACY (%) AT SEQUENCE LENGTH $2^{6}$ |
|---|---|---|
| MHA-Abs | 137K | |
| MHA-RoPE | 137K | |
| MHA-xPos | 137K | |
| H3 | 153K | |
| Hyena | 69M* | 97.7 |
| Mamba | 74K | |
* Most of the parameters are in learnable positional encodings.
注:大多数参数位于可学习的位置编码中。
E Experimental Details and Additional Results
E 实验细节和附加结果
E.1 Synthetic Tasks
E.1 合成任务
Selective Copying. Our setting is on sequences of length 4096, with a vocab size of 16 possible tokens (including the white “noise” token from Figure 2) and requiring models to memorize 16 “data” tokens. We use 2 layer models with a model dimension of $D=64$.
选择性复制。我们的设置是长度为 4096 的序列,词汇量大小为 16 个可能的 Token(包括来自图 2 的白色“噪声”Token),并要求模型记忆 16 个“数据”Token。我们使用 2 层模型,模型维度为 $D=64$。
Models are trained for 400K steps at a constant learning rate of 0.0001 with a batch size of 64.
模型在 400K 步骤内以恒定学习率 0.0001 和批量大小为 64 进行训练。
Induction Heads. Training consists of randomly generating data every step, with a batch size of 8. We choose an “epoch” size of 8192 steps, and track the accuracy on fixed validation sets (also randomly generated) of each target sequence length. For the MHA-Abs and Mamba models, results are reported after the 25th epoch ($8192\times25=204800$ steps). For the MHA-RoPE and MHA-xPos models, results are reported after the 50th epoch ($8192\times50=409600$ steps). For the LTI H3 and Hyena models, results are reported after the 10th epoch (81920 steps) because they had converged by then and failed to improve further.
归纳头。训练包括每一步随机生成数据,批量大小为 8。我们选择一个“epoch”大小为 8192 步,并跟踪每个目标序列长度在固定验证集(也是随机生成的)上的准确率。对于 MHA-Abs 和 Mamba 模型,在第 25 个 epoch 后报告结果 (8192×25=204800 步)。对于 MHA-RoPE 和 MHA-xPos 模型,在第 50 个 epoch 后报告结果 (8192×50=409600 步)。对于 LTI H3 和 Hyena 模型,在第 10 个 epoch 后报告结果 (81920 步),因为它们在那时已经收敛并且未能进一步改进。
We use the Adam optimizer with no weight decay. All models are trained at constant learning rates 2e-4 and 1e-3, and the better results are reported for each model (2e-4 for all models except Mamba). The attention and Hyena models did not learn at LR 1e-3. H3 learned at both LRs, but interestingly generalized better to shorter sequences at the smaller LR of 2e-4. Mamba learned at both LRs, but extrapolated better at the larger LR of 1e-3.
我们使用 Adam 优化器且不进行权重衰减。所有模型都在恒定学习率 $2e-4$ 和 $1e-3$ 下进行训练,并报告每个模型的更好结果(所有模型除 Mamba 外均为 $2e-4$)。Attention 和 Hyena 模型在学习率 $1e-3$ 下未收敛。H3 在两个学习率下均能学习,但有趣的是,在较小的学习率 $2e-4$ 下对更短序列有更好的泛化能力。Mamba 在两个学习率下均能学习,但在较大的学习率 $1e-3$ 下有更好地外推表现。
E.2 Language Modeling
E.2 语言模型
E.2.1 Scaling Law Details
E.2.1 扩展定律细节
Scaling law experiments generally followed the GPT3 recipe. All models were trained on the Pile with the GPT2 tokenizer.
缩放定律实验通常遵循 GPT3 的方案。所有模型都在 Pile 上使用 GPT2 Tokenizer 进行训练。
Model Sizes. Table 12 specifies the model sizes we use for scaling laws. This is taken directly from the GPT3 specifications (Brown et al. 2020), with very minor modifications. First, we changed the batch size of the 1.3B model from 1M tokens to 0.5M tokens, since we did not use enough parallelization to require the larger batch size. Second, we changed the number of training steps and total tokens to roughly match Chinchilla scaling laws (Hoffmann et al. 2022), which specify that training tokens should increase proportionally to model size.
模型大小。表 12 规定了我们用于扩展定律的模型大小。这直接取自 GPT3 的规格 (Brown et al. 2020),仅做了非常小的修改。首先,我们将 1.3B 模型的批量大小从 1M tokens 改为 0.5M tokens,因为我们没有使用足够的并行化来需要更大的批量大小。其次,我们更改了训练步骤数和总 tokens 数,以大致匹配 Chinchilla 扩展定律 (Hoffmann et al. 2022),该定律规定训练 tokens 应与模型大小成比例增加。
Table 12: (Scaling Law Model Sizes.) Our model sizes and hyperparameters for scaling experiments. (Model dimension and number of heads apply only to Transformer models.)
表 12: (扩展法则模型大小。)我们的模型大小和超参数用于扩展实验。(模型维度和头部数量仅适用于 Transformer 模型。)
| PARAMS | n_layers | d_model | n_heads / d_head | TRAINING STEPS | LEARNING RATE | BATCHSIZE | TOKENS |
|---|---|---|---|---|---|---|---|
| 125M | 12 | 768 | 12/64 | 4800 | 6e-4 | 0.5M tokens | 2.5B |
| 350M | 24 | 1024 | 16/64 | 13500 | 3e-4 | 0.5M tokens | 7B |
| 760M | 24 | 1536 | 16/96 | 29000 | 2.5e-4 | 0.5M tokens | 15B |
| 1.3B | 24 | 2048 | 32/64 | 50000 | 2e-4 | 0.5M tokens | 26B |
By default, the peak learning rate is the GPT3 specification.
默认情况下,峰值学习率是 GPT3 规格。
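The rows of Table 12 can be cross-checked with a few lines of arithmetic. The sketch below assumes that “0.5M tokens” means $2^{19}=524288$ tokens per batch (an assumption; the table does not state the exact value) and recomputes the total tokens and the tokens-per-parameter ratio from the listed training steps.

```python
BATCH_TOKENS = 2 ** 19  # assumption: "0.5M tokens" read as 524288

# (parameter count, training steps, reported total tokens), copied from Table 12
TABLE_12 = {
    "125M": (125e6, 4800, 2.5e9),
    "350M": (350e6, 13500, 7e9),
    "760M": (760e6, 29000, 15e9),
    "1.3B": (1.3e9, 50000, 26e9),
}

def tokens_seen(steps, batch_tokens=BATCH_TOKENS):
    """Total training tokens implied by a step count and batch size."""
    return steps * batch_tokens

def tokens_per_param(name):
    """Ratio of implied training tokens to parameter count for one table row."""
    params, steps, _ = TABLE_12[name]
    return tokens_seen(steps) / params

def summarize():
    return {name: (tokens_seen(steps), tokens_per_param(name))
            for name, (_, steps, _) in TABLE_12.items()}
```

Under this assumption every row lands within a few percent of its reported token count, and the ratio is roughly 20 tokens per parameter in all four rows, which is the proportional scaling the text refers to.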
We give several models an “improved recipe”, inspired by changes adopted by popular large language models such as PaLM (Chowdhery et al. 2023) and LLaMa (Touvron et al. 2023). These include:
我们给几个模型一个“改进的配方”,灵感来源于流行的大语言模型所采用的更改,例如 PaLM (Chowdhery et al. 2023) 和 LLaMa (Touvron et al. 2023)。这些改进包括:
Architecture and Training Details. Our models are:
架构和训练细节。我们的模型是:
· Transformer: The standard Transformer based on GPT3 (Table 12).
Transformer: 基于 GPT3 的标准 Transformer (表 12)
· Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020), and the improved training recipe above.
Transformer++: 一种改进架构的 Transformer,具体包括旋转变换位置编码 (rotary positional encodings) (Su et al. 2021) 和 SwiGLU MLP (Shazeer 2020),以及上述改进的训练方案。
· Hyena: Interleaving a Hyena block (the H3 block with S4 replaced by a global convolution parameterized by an MLP) with standard MLP blocks. The MLP blocks have expansion factor 2 instead of 4 and the number of layers is correspondingly increased by $1.5\times$ to preserve parameter count.
Hyena: 交替使用 Hyena 块(用全局卷积替换 S4 的 H3 块,全局卷积由 MLP 参数化)和标准的 MLP 块。MLP 块的扩展因子为 2 而不是 4,层数相应地增加了 1.5 倍以保持参数数量不变。
· H3++: The H3 architecture with a few modifications, including (i) using the same “thin” Hyena dimensions above, (ii) the improved training recipe above, and (iii) a linear attention head dimension of 8.
H3++: H3 架构经过一些修改,包括 (i) 使用上述相同的“瘦”Hyena 维度,(ii) 上述改进的训练配方,(iii) 线性注意力头维度为 8。
· RWKV: The default RWKV model from B. Peng et al. (2023), including its modified MLP block. We also used as much of its specified training recipe as possible, such as increasing the learning rates by $2\times$ or $3\times$ on certain parameters.
RWKV: 默认的 RWKV 模型来自 B. Peng 等人 (2023),包括其修改后的 MLP 模块。我们还尽可能使用了其指定的训练配方,例如将某些参数的学习率提高 2× 或 3×。
· RetNet: The default RetNet model from Y. Sun et al. (2023). We also gave it the improved training recipe above.
· RetNet: 默认的 RetNet 模型来自 Y. Sun 等人 (2023)。我们还为其提供了上述改进的训练方案。
· Mamba: The standard Mamba architecture, with the improved training recipe.
Mamba: 标准的 Mamba 架构,带有改进的训练配方。
E.2.2 Additional Scaling Law Ablations
E.2.2 额外的缩放定律消融实验
We perform additional ablations on the architecture using the same protocol as the 2k context length scaling laws in Figure 4 (Left).
我们使用与图 4 (左) 中 2k 上下文长度缩放定律相同的协议,对架构进行额外的消融实验。
Mamba Architecture: Interleaving Blocks. We test the effect of different architectural blocks combined with the Mamba block. We focus on the viewpoint that the Mamba block is simply the standard SwiGLU block with an extra conv $\rightarrow$ SSM path added. This leads to two natural ablations:
Mamba架构:交错块。我们测试了不同架构块与Mamba块结合的效果。我们关注的观点是,Mamba块仅仅是标准的SwiGLU块,在此基础上添加了一个额外的卷积 $\rightarrow S S M$ 路径。这导致了两个自然的消融实验:
· What if the Mamba block is interleaved with a standard MLP block, instead of stacked homogeneously? This can also be interpreted as taking Mamba and removing half of the SSMs.
如果 Mamba 块与标准的 MLP 块交错排列,而不是同质堆叠,会怎么样?这也可以解释为对 Mamba 进行改造并去除一半的 SSM。

Figure 9: (Scaling laws: extra ablations.) (Left) Instead of (Right) Instead of
图 9: (扩展定律:额外消融实验。) (左) 替代 (右) 替代
· What if the Mamba block is interleaved with MHA (multi-head attention) blocks? This can also be interpreted as taking a Transformer with SwiGLU MLPs (i.e. what we call Transformer++) and simply adding SSMs to the MLP blocks.
如果 Mamba 块与 MHA (多头注意力) 块交错排列会怎样?这也可以被解释为在具有 SwiGLU MLPs 的 Transformer (即我们称之为 Transformer++ ) 中简单地向 MLP 块添加 SSMs。
Figure 9 (Right) shows these variants compared to the original (homogeneous) Mamba architecture. Interestingly, neither change matters too much. The Mamba-MLP architecture is only slightly worse, and still better than all models except Transformer++. The Mamba-MHA architecture is only slightly better, which is somewhat surprising in light of the fact that many recent works have found that combining (LTI) SSMs with Attention can lead to substantial improvements (Dao, Fu, Saab, et al. 2023; Fathi et al. 2023; Fathullah et al. 2023; Saon, Gupta, and Cui 2023; Zuo et al. 2022).
图 9 (右) 显示了这些变体与原始(同质)Mamba架构的对比。有趣的是,这两种变化的影响都不大。Mamba-MLP架构只略逊一筹,但仍优于所有模型,除了 Transformer++。Mamba-MHA架构只略胜一筹,这在许多近期的研究发现结合 (LTI) SSMs 和注意力机制可以带来显著改进的背景下显得有些出人意料 (Dao, Fu, Saab, et al. 2023; Fathi et al. 2023; Fathullah et al. 2023; Saon, Gupta, and Cui 2023; Zuo et al. 2022)。
H3 Architecture: Training Recipes. Next we ablate differences between the Hyena and H3++ models, our weakest and strongest models outside of Transformer++ and Mamba, particularly to isolate the effect of training recipes.
H3 架构:训练配方。接下来我们分析 Hyena 和 H3++ 模型之间的差异,这是我们除 Transformer++ 和 Mamba 之外最弱和最强的模型,特别目的是隔离训练配方的影响。
· Hyena: The Hyena block with its original architecture and GPT3 training recipe (same as Figure 4).
Hyena: Hyena 块使用其原始架构和 GPT3 训练配方(同图 4)。
· Hyena+: The same architecture but with the improved training recipe described above.
Hyena+: 同样的架构,但使用了上述改进的训练方案。
· H3+: The same architecture as Hyena+ but with the Hyena convolution kernel swapped out for the S4D convolution kernel.
H3+: 与 Hyena+ 架构相同,但将 Hyena 卷积核替换为 S4D 卷积核。
· H3++: The same as H3+, but with a linear attention head dimension of 8. This increases computation inside the SSM recurrence but does not increase parameters.
H3++: 与 H3+ 相同,但线性注意力头维度为 8。这增加了 SSM 循环内的计算量,但不会增加参数量。
Our general convention is that “Model+” represents the base model with the improved training recipe, and “Model++” also allows for architectural changes.
我们的一般约定是:“Model+”表示具有改进训练方案的基础模型,而“Model++”还允许架构更改。
Figure 9 (Right) shows that
图 9 (右) 显示了
E.2.3 Downstream Evaluation Details
E.2.3 下游评估详情
This pretraining procedure is the same as the scaling law protocol, but extended to 300B tokens and with the GPT-NeoX tokenizer (Black et al. 2022) instead of the GPT2 tokenizer. For the 1.3B model, we use a batch size of 1M tokens to be consistent with the GPT3 specifications. We report the perplexity on the Pile validation set, and for this metric only compare to models trained on the same dataset and with the same tokenizer, in particular Pythia and RWKV.
这个预训练过程与缩放定律协议相同,但扩展到 300B tokens,并使用 GPT-NeoX 分词器 (Black et al. 2022) 而不是 GPT2 分词器。对于 1.3B 模型,我们使用 1M tokens 的批量大小以符合 GPT3 规范。我们在 Pile 验证集上报告困惑度,并且仅将此指标与在同一数据集和相同分词器上训练的模型进行比较,特别是 Pythia 和 RWKV。
For downstream evaluation, we use the LM evaluation harness from EleutherAI (L. Gao, Tow, et al. 2021), as done by most work in this area. We evaluate on the following tasks/datasets that measure common sense reasoning:
对于下游评估,我们使用来自 EleutherAI 的 LM 评估框架 (L. Gao, Tow, et al. 2021),如同该领域内的大多数工作一样。我们在以下任务/数据集上进行评估,这些任务/数据集衡量常识推理:
· LAMBADA (Paperno et al. 2016)
· HellaSwag (Zellers et al. 2019)
· PIQA (Bisk et al. 2020)
· ARC-challenge (P. Clark et al. 2018)
· ARC-easy: an easy subset of ARC-challenge
· WinoGrande (Sakaguchi et al. 2021)
· LAMBADA (Paperno 等 2016)
· HellaSwag (Zellers 等 2019)
· PIQA (Bisk 等 2020)
· ARC 挑战 (P. Clark 等 2018)
· ARC 简单版:ARC 挑战的简单子集
· WinoGrande (Sakaguchi 等 2021)
We report accuracy for LAMBADA, WinoGrande, PIQA, and ARC-easy, and accuracy normalized by sequence length for HellaSwag and ARC-challenge (since normalized accuracy is higher for almost all models for these tasks).
我们报告了 LAMBADA、WinoGrande、PIQA 和 ARC-easy 的准确率,以及 HellaSwag 和 ARC-challenge 的按序列长度归一化的准确率(因为对于这些任务,归一化后的准确率几乎对所有模型都更高)。
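The difference between plain and length-normalized accuracy on a multiple-choice task can be sketched as follows. The sketch assumes each candidate completion comes with a total log-likelihood under the model and that the normalizer is the completion's character length; both the function name and the choice of length unit are assumptions for illustration and are not taken from the evaluation harness.

```python
def predict(loglikelihoods, completions, normalize=False):
    """Return the index of the chosen completion.

    loglikelihoods: total log-likelihood of each candidate completion
    completions: the candidate strings, used only for their lengths
    normalize: if True, divide each score by the completion length
    """
    scores = [
        ll / len(text) if normalize else ll
        for ll, text in zip(loglikelihoods, completions)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

For instance, with hypothetical scores `[-4.0, -6.0]` for the completions `"yes"` and `"a much longer answer"`, the unnormalized rule picks the short completion while the normalized rule picks the long one, since longer completions accumulate more negative log-likelihood.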
E.3 DNA Modeling
E.3 DNA建模
E.3.1 Pretraining Details
E.3.1 预训练细节
We describe the dataset and training procedure of the HG38 pretraining task in more detail.
我们更详细地描述了 HG38 预训练任务的数据集和训练过程。
The dataset follows the splits from the prior Enformer work on genomics (Avsec et al. 2021); the training split contains a total of $S=34021$ segments of length $2^{17}=131072$ that cover the genome, for a total of approximately 4.5 billion tokens (DNA base pairs). These segments are tuples of (chromosome number, starting index, ending index), and can be extended if necessary (e.g. to get longer segments).
数据集遵循先前关于基因组学的 Enformer 工作 (Avsec et al. 2021) 的划分;训练集包含总计 $S=34021$ 段,每段长度为 $2^{17}=131072$ ,覆盖整个基因组,总共约有 45 亿个 Token (DNA 碱基对)。这些段是由 (染色体编号, 起始索引, 结束索引) 组成的对,并且可以根据需要进行扩展(例如,以获得更长的段)。
We deviate from HyenaDNA when the training sequence length is not $2^{17}$ . HyenaDNA always takes a fixed sub-segment (e.g. the beginning or middle of the prescribed segment), and thus for any training sequence length each epoch is fixed to 34021 samples and doesn't necessarily go through the whole genome. On the other hand, we use the entire training data:
当训练序列长度不是 $2^{17}$ 时,我们与 HyenaDNA 不同。HyenaDNA 始终采用固定的子片段(例如规定片段的开头或中间),因此对于任何训练序列长度,每个 epoch 固定为 34021 个样本,并不一定遍历整个基因组。另一方面,我们使用全部训练数据:
· When the context length $L$ is less than (or equal to) $2^{17}$, we divide up each segment into non-overlapping sub-segments of length $L$, so that there are $S\times\frac{2^{17}}{L}$ total samples and $S\times2^{17}\approx4.5B$ tokens per epoch.
当上下文长度 $L$ 小于(或等于)$2^{17}$ 时,我们将每个段落分成长度为 $L$ 的非重叠子段落,使得总样本数为 $S \times \frac{2^{17}}{L}$,每轮有 $S \times 2^{17} \approx 4.5B$ 个 Token。
· When the context length $L$ is greater than $2^{17}$, we turn each segment into two samples, one that begins with the prescribed segment and one that ends with the prescribed segment. Thus each epoch has $2S$ items and $2SL$ tokens. For example, at sequence length $2^{18}=262144$ there are $4\times$ as many tokens as the default, and at sequence length $2^{20}$ there are $16\times$ as many tokens.
当上下文长度 $L$ 大于 $2^{17}$ 时,我们将每个段落转换为两个样本,一个以指定的段落开头,另一个以指定的段落结尾。因此每个 epoch 包含 2S 个项目和每 epoch 2SL 个 Token。例如,在序列长度 $2^{18}=262144$ 时,Token 数量是默认值的 $4\times$,而在序列长度 $2^{20}$ 时,Token 数量是默认值的 $16\times$。
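The two cases above can be checked with a short sketch. It only reproduces the sample and token counts stated in the text; the function name `epoch_size` is ours, and we assume $L$ is a power of two so that it divides $2^{17}$ evenly in the first case.

```python
S = 34021            # number of training segments (from the text)
BASE_LEN = 2 ** 17   # prescribed segment length


def epoch_size(L):
    """Return (samples, tokens) per epoch for context length L.

    Assumption: L is a power of two, so it divides BASE_LEN when L <= BASE_LEN.
    """
    if L <= BASE_LEN:
        # Non-overlapping sub-segments of length L inside each segment.
        samples = S * (BASE_LEN // L)
    else:
        # Two samples per segment: one beginning with the prescribed
        # segment and one ending with it.
        samples = 2 * S
    return samples, samples * L


base_tokens = epoch_size(BASE_LEN)[1]
for exp in (10, 17, 18, 19, 20):
    samples, tokens = epoch_size(2 ** exp)
    print(f"L=2^{exp}: {samples} samples, {tokens // base_tokens}x default tokens")
```

For every $L\le 2^{17}$ the token count per epoch is the same ($S\times2^{17}$), while above $2^{17}$ it grows as $2SL$.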
Other training details generally follow the same protocol as our language modeling experiments (Appendix E.2). For example, we use AdamW with $(\beta_{1},\beta_{2})=(0.9,0.95)$, no dropout, and weight decay 0.1. We use a cosine learning rate scheduler with linear warmup for $10\%$ of total steps.
其他训练细节通常遵循与我们的语言建模实验相同的协议 (附录 E.2)。例如,我们使用 AdamW,参数为 $(β_1 , β_2 ) = (0.9, 0.95)$,不使用 dropout,权重衰减为 0.1。我们使用余弦学习率调度器,并在总步数的 10% 内进行线性预热。
E.3.2 Scaling: Model Size Details
E.3.2 扩展:模型规模详情
Models. The models we consider are:
模型。我们考虑的模型是:
· Transformer$^{++}$: a Transformer with improved architecture, notably the usage of RoPE positional encodings (Su et al. 2021). Informally, we found these to be noticeably better than the vanilla positional encodings from (Vaswani et al. 2017).
Transformer $^{++}$ :一种改进架构的 Transformer,特别使用了 RoPE 位置编码 (Su et al. 2021) 。非正式地,我们发现这些明显优于 (Vaswani et al. 2017) 中的 vanilla 位置编码。
· HyenaDNA: the Hyena model from Nguyen, Poli, et al. (2023) and Poli et al. (2023), which is roughly a Transformer with the MHA block replaced by an H3 block using a global convolution parameterized by an MLP.
HyenaDNA:来自 Nguyen, Poli 等 (2023) 的 Hyena 模型,该模型大致是一个将 MHA 模块替换为使用由 MLP 参数化的全局卷积的 H3 模块的 Transformer。
· Mamba: the standard Mamba architecture.
· Mamba: 标准的 Mamba 架构。
Model Sizes. We use the following model sizes.
模型尺寸。我们使用以下模型尺寸。
| Blocks | 4 | 5 | 6 | 7 | 8 | 10 | 12 |
|---|---|---|---|---|---|---|---|
| Model dimension | 64 | 96 | 128 | 192 | 256 | 384 | 512 |
| Params (approx.) | 250K | 700K | 1.4M | 3.5M | 7.0M | 19.3M | 40.7M |
Note that the number of blocks for Mamba is doubled, because one Transformer "layer" includes both the MHA and MLP blocks (and similarly for Hyena), which requires two Mamba blocks to match parameters (Section 3.4).
请注意,Mamba 的块数量翻倍,因为一个 Transformer “层” 包含了 MHA 和 MLP 块(Hyena 也是如此),这需要两个 Mamba 块来匹配参数(第 3.4 节)。
Training. For each model (Transformer$^{++}$, HyenaDNA, Mamba), we swept the learning rate across $\{1e{-}3, 2e{-}3, 4e{-}3, 8e{-}3\}$. The optimal Transformer and HyenaDNA learning rates were 2e-3 across all sizes. The optimal Mamba learning rate was 8e-3; note that Mamba performed better than baselines with matched learning rates (2e-3), but was more stable and improved even more at higher learning rates. (Furthermore, as this LR is at the upper end of the sweep, it is possible that our results are still suboptimal.)
训练。对于每个模型(Transformer $^{++}$, HyenaDNA, Mamba),我们在 ${1e-3, 2e-3, 4e-3, 8e-3}$ 范围内调整学习率。最优的 Transformer 和 HyenaDNA 学习率为 2e-3,适用于所有规模。Mamba 的最优学习率为 8e-3;请注意,Mamba 在匹配的学习率(2e-3)下表现优于基线模型,但在更高的学习率下更加稳定且性能进一步提升。(此外,由于此学习率处于调整范围的上限,因此我们的结果仍可能不是最优的。)
Note that, in contrast to standard LM scaling laws (Table 12), our LR was held constant across model sizes for simplicity. The optimal LR should go down for larger models, but we didn't find a noticeable effect at the small model sizes (at most a few million parameters) we considered.
请注意,与标准的大语言模型扩展定律 (Table 12) 不同,为了简化,我们的学习率 (LR) 在不同模型大小上保持不变。对于较大的模型,最优学习率应该降低,但在我们考虑的小模型尺寸(最多几百万参数)中,我们没有发现明显的效果。
E.3.3 Scaling: Context Length Details
E.3.3 扩展:上下文长度细节
We use a total batch size of $2^{24}\approx16M$ tokens per training step, for every sequence length (e.g. at length $2^{20}$ there are 16 segments per batch and at length $2^{10}$ there are 16384 segments per batch). This is a large batch size relative to the model size by usual LM standards, but note that a batch size of $2^{23}$ is the minimum possible on a machine with 8 GPUs and sequence length of $2^{20}$, and that HyenaDNA used much larger batches of $2^{28}$.
我们使用每步训练的总批次大小为 $2^{24}\approx16M$ 个 Token,对于每个序列长度(例如,在长度 $2^{20}$ 时,每个批次有 16 个段;在长度 $2^{10}$ 时,每个批次有 16384 个段)。相对于通常的语言模型标准,这是一个相对于模型大小较大的批次大小,但请注意,在具有 8 个 GPU 和序列长度为 $2^{20}$ 的机器上,最小可能的批次大小为 $2^{23}$ ,而 HyenaDNA 使用了更大的批次,大小为 $2^{28}$ 。
The learning rate used was 0.008 for Mamba and 0.001 for HyenaDNA; we initially attempted to use the same learning rate of 0.002 from the previous section for HyenaDNA, but found that it was unstable at the longest context length.
使用的学习率为 Mamba 0.008 和 HyenaDNA 0.001;我们最初尝试为 HyenaDNA 使用上一节中的相同学习率 0.002,但发现它在最长上下文长度下不稳定。
Sequence Length Warmup. Following (Nguyen, Poli, et al. 2023), we use sequence length warmup (SLW) during pretraining. We choose a simple schedule of 2 epochs at each power-of-two sequence length starting from $2^{10}=1024$. (Note that because of how data is curated, at the longest sequence lengths more steps and tokens are spent proportionally. In particular, each stage up to length $2^{17}$ processes the same number of tokens, but $4\times$ as many tokens are processed at length $2^{18}$, $8\times$ as many at length $2^{19}$, and $16\times$ as many at length $2^{20}$.)
序列长度预热 (Sequence Length Warmup)。根据 (Nguyen, Poli, et al. 2023),我们在预训练期间使用序列长度预热 (SLW)。我们选择一个简单的计划,在每个2的幂次序列长度上进行2个epoch,从 $2^{10}=1024$ 开始(注意,由于数据的整理方式,在最长的序列长度上,步骤和Token的比例会相应增加。特别是,每个阶段直到长度 $2^{17}$ 处理相同数量的Token,但在长度 $2^{18}$ 处处理的Token是其 $4\times$ ,在长度 $2^{19}$ 处是 $8\times$ ,在长度 $2^{20}$ 处是 $16\times$ 。)
Unlike HyenaDNA, we always control for the number of tokens per gradient update, so the batch size is successively halved as the sequence lengths are doubled in each stage.
与 HyenaDNA 不同,我们始终控制每个梯度更新的 Token 数量,因此在每个阶段序列长度翻倍时,批量大小依次减半。
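As an illustration of this schedule, the sketch below lists the stages of sequence length warmup together with the number of sequences per batch at a fixed budget of $2^{24}$ tokens per step. The function name `slw_stages` and the dictionary layout are ours, and we assume the schedule runs up to and including the target length.

```python
TOKENS_PER_STEP = 2 ** 24  # total batch size in tokens (from the text)


def slw_stages(max_exp, min_exp=10, epochs_per_stage=2):
    """List the warmup stages for a model with target length 2**max_exp.

    Assumption: every power-of-two length from 2**min_exp up to and
    including 2**max_exp gets `epochs_per_stage` epochs.
    """
    stages = []
    for exp in range(min_exp, max_exp + 1):
        seq_len = 2 ** exp
        stages.append({
            "seq_len": seq_len,
            # Tokens per gradient update are held constant, so the number
            # of sequences per batch halves each time the length doubles.
            "sequences_per_batch": TOKENS_PER_STEP // seq_len,
            "epochs": epochs_per_stage,
        })
    return stages


for stage in slw_stages(20):
    print(stage)
```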
Remark E.1. We also note that the schedule was not tuned, and we never experimented with turning off sequence length warmup for these pretraining experiments. We later found that SLW did not help noticeably for audio pretraining at similar lengths (Section 4.4), and it is possible that it is not necessary for DNA pretraining either.
附注 E.1. 我们还注意到,训练计划没有进行调优,并且我们从未尝试在这些预训练实验中关闭序列长度热身。我们后来发现,在相似长度的音频预训练中,序列长度热身 (SLW) 并没有明显帮助(第 4.4 节),因此它对于 DNA 预训练可能也不是必要的。
E.3.4 Species (Great Apes) Classification
E.3.4 物种(大猿)分类
Models are causal and therefore only the last element (across the sequence length) of the model's output is used for the classification head. Note that we control for the total number of elements in the loss function per gradient step. The pretraining objective includes all positions across the sequence length, so that batch size $\times$ sequence length is held constant; in other words, the batch size decreases as the sequence length increases. However, for a classification task, since only the last position enters the loss, the batch size itself is held constant. Note that this also means that fine-tuning models with longer sequence lengths is more computationally expensive.
模型是因果性的,因此仅使用模型输出的最后一个元素(沿序列长度)用于分类头。注意我们控制每个梯度步骤中损失函数中的元素总数。预训练目标包括序列长度中的所有位置,因此批量大小 $\times$ 序列长度保持不变;换句话说,随着序列长度的增加,批量大小会减少。然而,对于分类任务,由于只有最后一个位置进入损失,批量大小本身保持不变。请注意,这也意味着用更长序列长度微调模型在计算上更加昂贵。
Training consists of 10 epochs, each of which has 1024 gradient steps. Each gradient step uses a batch of 64 samples, each drawn independently by uniformly picking a species, uniformly picking a chromosome, and then uniformly picking a contiguous segment of DNA.
训练包括 10 个 epoch,每个 epoch 包含 1024 个梯度步骤。每个梯度步骤使用批量大小 64,这些批量都是通过均匀选择一个物种,均匀选择一个染色体,然后均匀选择一段连续的 DNA 独立随机抽取的。
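The sampling procedure can be sketched as follows. This is an illustration only: the `genomes` layout (a mapping from species to chromosome lengths), the function name `sample_batch`, and the toy values are hypothetical, and we assume every chromosome is at least `seq_len` bases long.

```python
import random


def sample_batch(genomes, seq_len, batch_size=64, rng=random):
    """Draw one batch of (species, chromosome, start, end) tuples.

    `genomes` is a hypothetical layout: {species: {chromosome: length}}.
    Assumption: every chromosome has at least `seq_len` bases.
    """
    species_names = sorted(genomes)
    batch = []
    for _ in range(batch_size):
        species = rng.choice(species_names)              # uniform over species
        chrom = rng.choice(sorted(genomes[species]))     # uniform over chromosomes
        length = genomes[species][chrom]
        start = rng.randrange(length - seq_len + 1)      # uniform contiguous segment
        batch.append((species, chrom, start, start + seq_len))
    return batch


# Toy example with made-up species names and chromosome lengths.
toy_genomes = {
    "species_a": {"chr1": 5000, "chr2": 4000},
    "species_b": {"chr1": 6000},
}
batch = sample_batch(toy_genomes, seq_len=1024, rng=random.Random(0))
```

The species label of each tuple would serve as the classification target.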
Following (Nguyen, Poli, et al. 2023), models with a maximum context length greater than $2^{14}=16384$ use sequence length warmup with 1 epoch at length $2^{14}=16384$, 1 epoch at length $2^{15}=32768$, 1 epoch at length $2^{16}=65536$, and so on up to the maximum sequence length. For example, the model with $2^{20}=1048576$ context undergoes 6 epochs of sequence length warmup before 4 more epochs at its maximum sequence length.
根据 (Nguyen, Poli, et al. 2023),最大上下文长度大于 $2^{14}=16384$ 的模型使用序列长度预热,具体为:在长度为 $2^{14}=16384$ 下进行 1 个 epoch,在长度为 $2^{15}=32768$ 下进行 1 个 epoch,在长度为 $2^{16}=65536$ 下进行 1 个 epoch,依此类推,直到达到最大序列长度。例如,上下文长度为 $2^{20}=1048576$ 的模型在最大序列长度下进行 4 个 epoch 之前,会先进行 6 个 epoch 的序列长度预热。
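A minimal sketch of the per-epoch sequence lengths implied by this description, assuming 10 total epochs as stated above (the function name `finetune_seq_lens` is ours):

```python
def finetune_seq_lens(max_exp, total_epochs=10, warmup_start_exp=14):
    """Sequence length used in each fine-tuning epoch.

    One warmup epoch per power of two from 2**14 up to (but excluding) the
    maximum length, then the remaining epochs at the maximum length. Models
    with maximum length <= 2**14 get no warmup.
    """
    warmup = [2 ** e for e in range(warmup_start_exp, max_exp)]
    return warmup + [2 ** max_exp] * (total_epochs - len(warmup))


schedule = finetune_seq_lens(20)
print(schedule)
```

For the $2^{20}$ model this gives six warmup epochs (lengths $2^{14}$ through $2^{19}$) followed by four epochs at $2^{20}$, matching the example in the text.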
The learning rate for all Hyena models is 4e-5, while the learning rate for all Mamba models is 1e-4. These were found by performing learning rate sweeps for each model among $\{1e{-}5, 2e{-}5, 4e{-}5, 1e{-}4, 2e{-}4\}$ for the smaller sequence lengths $(2^{10}, 2^{12}, 2^{14}, 2^{16})$, and these values were consistently found to be the best for each model. An abridged learning rate sweep was done at length $2^{18}$, which agreed with these values, and a single run at length $2^{20}$ was performed (as described above, the computational cost of these experiments is proportional to the sequence length). The learning rate followed a cosine decay schedule with warmup, with 5 epochs of linear warmup to the maximum learning rate and 5 epochs of cosine decay down to $1e{-}6$. The unusually long learning rate warmup schedule was chosen because the sequence length warmup was also long (e.g. comprising 6 out of 10 epochs for the model with context length $2^{20}$); we did not experiment with this choice.
所有 Hyena 模型的学习率为 4e-5,而所有 Mamba 模型的学习率为 1e-4。这些值是通过对每个模型在较小的序列长度 (2^10, 2^12, 2^14, 2^16) 上进行学习率搜索 {1e-5, 2e-5, 4e-5, 1e-4, 2e-4} 找到的,并且这些值被一致认为是每个模型的最佳选择。在长度为 2^18 处进行了简化的学习率搜索,结果与这些值一致,并且在长度为 2^20 处进行了单次运行(如上所述,这些实验的计算成本与序列长度成正比)。学习率遵循带有 warmup 的余弦衰减计划,包括 5 个 epoch 的线性 warmup 到最大学习率,以及 5 个 epoch 的余弦衰减至 1e-6。选择了异常长的学习率 warmup 计划,因为序列长度 warmup 也较长(例如,对于上下文长度为 2^20 的模型,warmup 占用了 10 个 epoch 中的 6 个);我们没有对此选择进行实验。
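The learning rate schedule described above can be sketched as a function of the gradient step. The defaults use the Mamba values from the text (maximum 1e-4, 1024 steps per epoch, 5 + 5 epochs, floor 1e-6); the function name `lr_at` is ours, and we assume the linear warmup starts from 0, which the text does not specify.

```python
import math


def lr_at(step, max_lr=1e-4, min_lr=1e-6, steps_per_epoch=1024,
          warmup_epochs=5, decay_epochs=5):
    """Learning rate at a given gradient step.

    Assumption: linear warmup starts at 0 (not stated in the text).
    """
    warmup_steps = warmup_epochs * steps_per_epoch
    decay_steps = decay_epochs * steps_per_epoch
    if step < warmup_steps:
        # Linear warmup to the maximum learning rate.
        return max_lr * step / warmup_steps
    # Cosine decay from max_lr down to min_lr.
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in range(11):
    print(epoch, lr_at(epoch * 1024))
```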
Table 13: (Great Apes DNA Classification.) Accuracy after fine-tuning on sequences of length $2^{10}=1024$ up to $2^{20}=1048576$ using pretrained models of the same context length. Random guessing is $20\%$.
表 13: (Great Apes DNA 分类) 使用相同上下文长度的预训练模型在长度为 $2^{10}=1024$ 到 $2^{20}=1048576$ 的序列上微调后的准确率。随机猜测的准确率为 $20%$
| Model | Params | Accuracy (%) at sequence length $2^{10}$ |
|---|---|---|
| HyenaDNA | 1.4M | 28.04 |
| Mamba | 1.4M | 31.47 |
| Mamba | 7M | 30.00 |
Table 14: YouTubeMix length scaling sequence lengths and batch sizes.
表 14: YouTubeMix 长度缩放序列长度和批量大小。
| Sequence length | Batch size | Tokens / batch |
|---|---|---|
| 468×2048=958464 | 1 | 958464 |
| 234×2048=479232 | 2 | 958464 |
| 117×2048=239616 | 4 | 958464 |
| 59×2048=120832 | 8 | 966656 |
| 30×2048=61440 | 16 | 983040 |
| 15×2048=30720 | 32 | 983040 |
| 8×2048=16384 | 64 | 1048576 |
| 4×2048=8192 | 128 | 1048576 |
Results for the Species classification task are in Table 13.
表 13: 物种分类任务的结果。
E.4 Audio Details
E.4 音频详情
E.4.1 YouTubeMix Audio Pretraining
E.4.1 YouTubeMix 音频预训练
Model. We use a model with 3 blocks per stage ($3\times5=15$ total Mamba blocks), pooling factor $p=16$, and outer dimension $D=64$, for about $3.5\mathrm{M}$ parameters.
我们使用一个每阶段包含 3 个块 (共 3×5=15 个 Mamba 块) 的模型,池化因子 $\mathit{p}\mathrm{~=~}16$ ,外部维度 $D=64$ ,参数量约为 $3.5\mathrm{M}$ 。
Dataset. The data is mu-law encoded at 8 bits, so the model is modeling discrete tokens with a vocab size of 256.
数据集。数据以 8 位 mu-law 编码,因此模型使用词汇量为 256 的离散 Token 进行建模。
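For readers unfamiliar with mu-law companding, the sketch below shows one common way to map a waveform sample in $[-1, 1]$ to one of 256 discrete tokens. It uses the standard mu-law compression formula with $\mu=255$; the exact quantization convention used for this dataset is not stated in the text, so the rounding and token layout here are assumptions.

```python
import math

MU = 255  # 8-bit mu-law gives 256 discrete levels


def mu_law_encode(x, mu=MU):
    """Map a sample x in [-1, 1] to an integer token in [0, mu].

    Assumption: round-to-nearest quantization of the companded signal;
    the convention used for the dataset itself is not specified.
    """
    x = max(-1.0, min(1.0, x))
    # Standard mu-law companding: sign(x) * ln(1 + mu|x|) / ln(1 + mu).
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


tokens = [mu_law_encode(v / 100.0) for v in range(-100, 101)]
```

The logarithmic compression allocates more of the 256 levels to small amplitudes, which is why 8 bits suffice for a usable audio vocabulary.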
The dataset consists of clips of up to 1 minute long, or length 960000, which are subsampled and divided into segments of any desired sequence length. Since the architecture involves two stages of pooling by a factor of 16, and we want the resulting sequence length to be a multiple of 8 for hardware efficiency, the longest possible sequence is $468\times2048=958464$. The rest of our sequence lengths are defined by successively halving this and rounding up to the nearest multiple of 2048.
数据集由最长 1 分钟的片段组成,或长度为 960000 的片段,这些片段被下采样并分割成任意所需的序列长度。由于架构涉及两个阶段的池化,池化因子为 16,并且我们希望结果序列长度是 8 的倍数以提高硬件效率,因此最长可能的序列是 $468\times2048=958464$ 。其余的序列长度通过依次将此值减半并向上取最接近的 2048 的倍数来定义。
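The rule above can be reproduced with a few lines of arithmetic. The sketch below derives the multiple of $2048 = 16\times16\times8$ for each sequence length and pairs it with the doubling batch sizes listed in Table 14; the function name `youtubemix_lengths` is ours.

```python
import math

CLIP_LEN = 960000      # maximum clip length (from the text)
UNIT = 16 * 16 * 8     # two pooling stages of 16, times 8 for hardware efficiency


def youtubemix_lengths(n=8):
    """Return (sequence length, batch size, tokens per batch) rows."""
    multiple = CLIP_LEN // UNIT   # largest multiple of 2048 that fits: 468
    batch_size = 1
    rows = []
    for _ in range(n):
        seq_len = multiple * UNIT
        rows.append((seq_len, batch_size, seq_len * batch_size))
        # Halve and round up to the nearest multiple of 2048.
        multiple = math.ceil(multiple / 2)
        batch_size *= 2
    return rows


for row in youtubemix_lengths():
    print(row)
```

Because of the rounding up, the number of tokens per batch is only approximately constant across sequence lengths.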
Table 14 lists the specifications used in Figure 7. Beyond the varying batch sizes, the number of valid segments in the training set varied between different sequence lengths (e.g. the number of training steps per epoch was not constant for different points in the graph), which may have contributed to kinks in the scaling curves.
表 14: 列出了图 7 中使用的规格。除了不同的批量大小外,训练集中有效段的数量在不同的序列长度之间有所不同(例如,每个epoch的训练步骤数量在图表的不同点上并不恒定),这可能导致了扩展曲线中的曲折。
Training. Models were trained for $200K$ training steps with a maximum learning rate of $0.002$, $20K$ ($10\%$) warmup steps, and weight decay 0.1 (similar to our general pretraining recipe across domains).
训练。模型训练了 200K 个训练步骤,最大学习率为 0.002,20K (10%) 的热身步骤,权重衰减为 0.1(与我们在不同领域的通用预训练方案相似)。
Additional Ablations: SSM Parameterizations. We investigate SSM parameterizations on long-form audio waveform pretraining in the setting of Figure 7. The setting is modified slightly to use larger models (8 layers and $D=64$ for 6M params, the SaShiMi default), shorter sequences ($2^{11}=2048$ to $2^{18}=262144$ instead of $2^{13}$ to $2^{20}$), lower LR (0.001 from 0.002), and shorter training cycles (100K instead of 200K steps).
附加消融实验:SSM 参数化。我们研究了在图 7 的设置中,针对长格式音频波形预训练的 SSM 参数化。设置稍作修改,使用更大的模型(8 层和 $D=64$ 对应 6M 参数,为 SaShiMi 默认配置),更短的序列(从 $2^{11}=2048$ 到 $2^{18}=262144$ 而不是从 $2^{13}$ 到 $2^{20}$),更低的学习率(从 0.002 降低到 0.001),以及更短的训练周期(100K 步骤而不是 200K 步骤)。
Figure 10 shows that the change from $S4\to S6$ (i.e. the selection mechanism) is not always beneficial. On long-form audio waveforms, it in fact significantly hampers performance, which may be intuitive from the point of view that audio is uniformly sampled and very smooth, and therefore benefits from continuous linear time-invariant (LTI) methods. After ablating away the selection mechanism, note that the resulting model is the S4 layer inside the Mamba block. To disambiguate, we call this Mamba-S4, as opposed to the default Mamba architecture, Mamba-S6.
图 10: 显示了从 $S4\to S6$ (即选择机制)的改变并不总是有益的。在长音频波形上,实际上显著影响了性能,这可能从音频是均匀采样且非常平滑的角度来看是直观的,因此受益于连续线性时不变 (LTI) 方法。在去除选择机制后,注意到得到的模型是 Mamba 块中的 S4 层。为了区分,我们将这种模型称为 Mamba-S4,而不是默认的 Mamba 架构 Mamba-S6。

Figure 10: (Audio Pretraining (YouTubeMix) Ablations.) As a uniformly-sampled "continuous" signal modality, audio waveforms actually benefit from LTI models which have matching inductive bias. (Left) Homogeneous models (all blocks have the same parameterization). (Right) Only the center U-Net blocks are ablated; the outer blocks are Mamba-S4. Purple line is same as figure on left.
图 10: (音频预训练 (YouTubeMix) 消融实验。)作为均匀采样的“连续”信号模式,音频波形实际上从具有匹配归纳偏置的 LTI 模型中受益。(左)同质模型(所有块具有相同的参数化)(右)仅中心 U-Net 块被消融;外部块是 Mamba-S4。紫色线与左侧图相同。
However, on the right side, we keep the outer layers of the U-Net Mamba-S4 and ablate only the inner layers. The performance differences shrink dramatically; this reinforces the hypothesis that layers closer to the raw audio signal should be LTI, but once they are "tokenized" and compressed by the outer layers, the inner layers no longer need to be LTI. In this setting however, the real-valued SSM still underperforms the complex-valued one.
然而,在右侧,我们保留 U-Net Mamba-S4 的外层,仅移除内层。性能差异显著缩小;这强化了以下假设:接近原始音频信号的层应该是线性时不变 (LTI) 的,但一旦它们被“标记化”并由外层压缩后,内层不再需要是 LTI。然而,在这种设置下,实值 SSM 仍然不如复值 SSM 表现得好。
E.4.2 SC09 Speech Generation
E.4.2 SC09 语音生成
Autoregressive training largely followed the autoregressive language modeling protocol, such as
自回归训练 largely followed the auto regressive language modeling protocol,例如
We used a learning rate of 0.002 and 200000 training steps at a batch size of 16.
我们使用了学习率为 0.002,在批量大小为 16 的情况下进行了 200000 步训练。
The large Mamba model in Table 4 has 15 layers per stage with an outer dimension of $D=96$ and pooling factor 4. We note that this dataset is small (training went through 100 epochs) and for this large model, there was significant overfitting of the BPB or NLL. However, automated metrics of generated samples continued to improve throughout training.
表 4 中的大 Mamba 模型每个阶段有 15 层,外部维度为 $D=96$ ,池化因子为 4 。我们注意到这个数据集较小(训练经过了 100 个 epoch),对于这个大模型,BPB 或 NLL 出现了显著的过拟合。然而,生成样本的自动化评估指标在整个训练过程中持续改进。
The models in the architecture ablations in Table 5 all have 8 layers per stage with an outer dimension of $D=64$ and pooling factor 4. The S4+MLP block has roughly $2D^{2}+4D^{2}$ parameters (expansion factor 2 in the MLP). The Transformer block has $4D^{2}+2D^{2}$ parameters (expansion factor 1 in the MLP). The Mamba block has the usual $\approx6D^{2}$ parameters. All models have roughly 6M total parameters.
表 5 中架构消融的模型每个阶段都有 8 层,外部维度为 $\mathsf{D}=64$ 和池化因子 4。$\mathsf{S4+MLP}$ 模块大约有 $2D^{2}+4D^{2}$ 参数(MLP 中扩展因子为 2)。Transformer 模块有 $4D^{2}+2D^{2}$ 参数(MLP 中扩展因子为 1)。Mamba 模块有通常的 $\approx6D^{2}$ 参数。所有模型大约总共有 6M 参数。
E.5 Efficiency Benchmark
E.5 效率基准测试
Scan Operation. We compare the core operation of selective SSMs, which is the parallel scan (Section 3.3), against convolution and attention, measured on an A100 80GB PCIe GPU. Note that these do not include the cost of other operations outside of this core operation, such as computing the convolutional kernel in global-convolution models, or computing the QKV projections in attention.
扫描操作。我们比较了选择性 SSM 的核心操作,即并行扫描(第 3.3 节),与卷积和注意力机制在 A100 80GB PCIe GPU 上的性能。注意,这些结果不包括该核心操作之外的其他操作的成本,例如在全局卷积模型中计算卷积核,或在注意力机制中计算 QKV 投影。
As a baseline, we implement a standard parallel scan in PyTorch with no kernel fusion. This requires materializing the parameters $\overline{{A}},\overline{{B}},C$ in HBM.
作为基准,我们在 PyTorch 中实现了一个标准的并行扫描,不使用内核融合。这需要在 HBM 中显式化参数 $\overline{{A}},\overline{{B}},C$ 。
Our scan implementation fuses the discretization step and the parallel scan, avoiding the cost of materializing all the large parameters in HBM.
我们的扫描实现将离散化步骤和并行扫描融合在一起,避免了在HBM中物化所有大规模参数的成本。
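To make the notion of a parallel scan concrete, the following is a minimal pure-Python sketch for a scalar linear recurrence $h_t = a_t h_{t-1} + b_t$ with $h_{-1}=0$. It is not the fused kernel or the PyTorch baseline described above; it only illustrates the associative operator that makes the recurrence scannable, using a simple Hillis-Steele style scan with $\log_2 L$ rounds. All function names are ours.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference: run the recurrence h_t = a_t * h_{t-1} + b_t step by step."""
    h, out = 0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_scan(a, b):
    """Inclusive scan with the associative `combine` operator.

    Within each round, every position is updated independently of the
    others, so a round could run in parallel; here it is a plain list
    comprehension for clarity.
    """
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # With h_{-1} = 0, the state h_t is the offset term of the composed map.
    return [b_t for _, b_t in elems]


a = [2, 3, 1, 2, 1]
b = [1, 0, 2, 1, 3]
result = parallel_scan(a, b)
```

In the selective SSM, $a_t$ and $b_t$ correspond to the discretized $\overline{A}$ and $\overline{B}x_t$ terms, which are tensors rather than scalars; the same operator applies elementwise.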
Table 15: (Memory benchmark.) Mamba's memory footprint is comparable to the most optimized Transformer. Results for 125M models.
表 15: (内存基准测试。)Mamba 的内存占用与最优化的 Transformer 相当。125M 模型的结果。
| Batch size | Transformer (w/ FlashAttention-2) | Mamba |
|---|---|---|
| 1 | 4.6GB | 4.8GB |
| 2 | 5.2GB | 5.8GB |
| 4 | 6.9GB | 7.3GB |
| 8 | 11.5GB | 12.3GB |
| 16 | 20.7GB | 23.1GB |
| 32 | 34.5GB | 38.2GB |
For convolution, we use the standard implementation in PyTorch, which separately performs FFTs on the inputs and the filters, multiplies them in the frequency domain, then performs an inverse FFT to obtain the result. The theoretical complexity is $O(L\log(L))$ for sequence length $L$.
对于卷积,我们使用 PyTorch 中的标准实现,该实现分别对输入和滤波器进行快速傅里叶变换 (FFT),在频域中将它们相乘,然后进行逆 FFT 以获得结果。理论复杂度为 $O(L\log(L))$,其中 $L$ 为序列长度。
For attention, we compare against the fastest implementation that we are aware of (FlashAttention-2 (Dao 2024)), with causal mask. Note that FlashAttention-2 with causal mask is about $1.7\times$ faster than without causal mask, since approximately only half of the attention entries are computed.
对于注意力机制,我们对比了我们所知的最快实现(Flash Attention-2 (Dao 2024),带有因果掩码)。注意,带有因果掩码的 Flash Attention-2 比不带因果掩码的快约 $1.7\times$,因为大约只有半数的注意力条目被计算。
We use a batch size of 1 and increase the sequence length from $2^{9}=512$, $2^{10}\approx1K$, $2^{11}\approx2K$, up to $2^{19}\approx500K$ (some of the baselines run out of memory before reaching 500K). We use a model dimension of $D=1024$ and state dimension $N=16$. We measure with BF16 inputs, which is the data type most commonly used for large scale training.
我们使用批量大小为 1,并将序列长度从 $2^{9}=512$、$2^{10}\approx1K$、$2^{11}\approx2K$ 增加到 $2^{19}\approx500K$(某些基线在达到 500K 之前会内存不足)。我们使用模型维度 $D=1024$ 和状态维度 $N=16$。我们使用 BF16 输入进行测量,这是大规模训练中最常用的数据类型。
End-to-end Inference. We measure the inference throughput of a Mamba 1.4B model and an untrained Mamba 6.9B model, against a standard Transformer (GPT3 architecture) at 1.3B and 6.7B size. We use the standard Transformer implementation in the Hugging Face transformers library.
端到端推理。我们测量了 Mamba 1.4B 模型和未训练的 Mamba 6.9B 模型的推理吞吐量,并与标准 Transformer (GPT3 架构) 在 1.3B 和 6.7B 规模下的表现进行对比。我们使用了 Hugging Face transformers 库中的标准 Transformer 实现。
We set the prompt length to be 2048 and the generation length to be 128. We vary the batch size from 1, 2, 4, 8, 16, 32, 64, to 128, and measure the time taken to generate 128 tokens. We then calculate the throughput (tokens/s) as batch size $\times\,128\,/\,$time taken. We repeat the measurements 3 times and take the average. Measurements are done on an A100 80GB PCIe GPU.
我们将提示长度设置为 2048,生成长度设置为 128。我们改变批处理大小为 1, 2, 4, 8, 16, 32, 64 和 128,并测量生成 128 个 Token 所需的时间。然后我们计算吞吐量 (Token/s) 为批处理大小 $\times,128$ / 所需时间。我们重复测量 3 次并取平均值。测量是在 A100 80 GB PCIe GPU 上进行的。
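The throughput calculation is simple enough to state as code. In this sketch the function name `throughput` and the timing values are hypothetical, and we assume the average is taken over the measured times before dividing (the text does not say whether times or throughputs are averaged).

```python
def throughput(batch_size, times_s, gen_len=128):
    """Tokens per second for one batch size.

    `times_s` holds the wall-clock seconds of the repeated runs.
    Assumption: the runs are averaged in time, then converted to tokens/s.
    """
    mean_time = sum(times_s) / len(times_s)
    return batch_size * gen_len / mean_time


# Hypothetical timings for illustration only, not measured results.
example = throughput(batch_size=8, times_s=[2.0, 2.0, 2.0])
```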
Memory Benchmark. The memory usage simply scales proportionally to the size of the activation tensors, as with most deep sequence models. We report measurements of the training memory requirements of 125M models on 1 A100 80GB GPU. Each batch consists of sequences of length 2048. We compare to the most memory-efficient Transformer implementation we are aware of (with kernel fusion from torch.compile and with FlashAttention-2). Table 15 shows that Mamba's memory requirement is comparable to a similar-sized Transformer with an extremely optimized implementation, and we expect further improvement in Mamba's memory footprint in the future.
内存基准测试。内存使用量与激活张量的大小成正比增加,这与大多数深度序列模型相同。我们在 1 块 A100 80GB GPU 上报告了 125M 模型的训练内存需求。每个批次由长度为 2048 的序列组成。我们将其与我们所知的最节省内存的 Transformer 实现进行比较(带有 torch.compile 的内核融合和 Flash Attention-2)。表 15 显示 Mamba 的内存需求与具有极优化实现的类似规模的 Transformer 相当,我们预计 Mamba 的内存占用在未来会进一步改善。
