Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu* and Tri Dao*
Machine Learning Department, Carnegie Mellon University; Department of Computer Science, Princeton University. agu@cs.cmu.edu, tri@tridao.me
Abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference ($5\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
1 Introduction
Foundation models (FMs), or large models pretrained on massive data then adapted for downstream tasks, have emerged as an effective paradigm in modern machine learning. The backbone of these FMs are often sequence models, operating on arbitrary sequences of inputs from a wide variety of domains such as language, images, speech, audio, time series, and genomics (Brown et al. 2020; Dosovitskiy et al. 2020; Ismail Fawaz et al. 2019; Oord et al. 2016; Poli et al. 2023; Sutskever, Vinyals, and Quoc V Le 2014). While this concept is agnostic to a particular choice of model architecture, modern FMs are predominantly based on a single type of sequence model: the Transformer (Vaswani et al. 2017) and its core attention layer (Bahdanau, Cho, and Bengio 2015). The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length. An enormous body of research has appeared on more efficient variants of attention to overcome these drawbacks (Tay, Dehghani, Bahri, et al. 2022), but often at the expense of the very properties that make it effective. As of yet, none of these variants have been shown to be empirically effective at scale across domains.
Recently, structured state space sequence models (SSMs) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021) have emerged as a promising class of architectures for sequence modeling. These models can be interpreted as a combination of recurrent neural networks (RNNs) and convolutional neural networks (CNNs), with inspiration from classical state space models (Kalman 1960). This class of models can be computed very efficiently as either a recurrence or convolution, with linear or near-linear scaling in sequence length. Additionally, they have principled mechanisms for modeling long-range dependencies (Gu, Dao, et al. 2020) in certain data modalities, and have dominated benchmarks such as the Long Range Arena (Tay, Dehghani, Abnar, et al. 2021). Many flavors of SSMs (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Y. Li et al. 2023; Ma et al. 2023; Orvieto et al. 2023; Smith, Warrington, and Linderman 2023) have been successful in domains involving continuous signal data such as audio and vision (Goel et al. 2022; Nguyen, Goel, et al. 2022; Saon, Gupta, and Cui 2023). However, they have been less effective at modeling discrete and information-dense data such as text.
We propose a new class of selective state space models that improves on prior work on several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.
Selection Mechanism. First, we identify a key limitation of prior models: the ability to efficiently select data in an input-dependent manner (i.e. focus on or ignore particular inputs). Building on intuition based on important synthetic tasks such as selective copying and induction heads, we design a simple selection mechanism by parameterizing the SSM parameters based on the input. This allows the model to filter out irrelevant information and remember relevant information indefinitely.
Hardware-aware Algorithm. This simple change poses a technical challenge for the computation of the model; in fact, all prior SSM models must be time- and input-invariant in order to be computationally efficient. We overcome this with a hardware-aware algorithm that computes the model recurrently with a scan instead of convolution, but does not materialize the expanded state in order to avoid IO access between different levels of the GPU memory hierarchy. The resulting implementation is faster than previous methods both in theory (scaling linearly in sequence length, compared to pseudo-linear for all convolution-based SSMs) and on modern hardware (up to $3\times$ faster on A100 GPUs).
Architecture. We simplify prior deep sequence model architectures by combining the design of prior SSM architectures (Dao, Fu, Saab, et al. 2023) with the MLP block of Transformers into a single block, leading to a simple and homogeneous architecture design (Mamba) incorporating selective state spaces.
Selective SSMs, and by extension the Mamba architecture, are fully recurrent models with key properties that make them suitable as the backbone of general foundation models operating on sequences. (i) High quality: selectivity brings strong performance on dense modalities such as language and genomics. (ii) Fast training and inference: computation and memory scale linearly in sequence length during training, and unrolling the model autoregressively during inference requires only constant time per step since it does not require a cache of previous elements. (iii) Long context: the quality and efficiency together yield performance improvements on real data up to sequence length 1M.
We empirically validate Mamba's potential as a general sequence FM backbone, in both pretraining quality and domain-specific task performance, on several types of modalities and settings:
· Synthetics. On important synthetic tasks such as copying and induction heads that have been proposed as being key to large language models, Mamba not only solves them easily but can extrapolate solutions indefinitely long (>1M tokens).
· Audio and Genomics. Mamba outperforms prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, both in pretraining quality and downstream metrics (e.g. reducing FID on a challenging speech generation dataset by more than half). In both settings, its performance improves with longer context up to million-length sequences.
· Language Modeling. Mamba is the first linear-time sequence model that truly achieves Transformer-quality performance, both in pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has $5\times$ generation throughput compared to Transformers of similar size, and Mamba-3B's quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B).
Model code and pre-trained checkpoints are open-sourced at https://github.com/state-spaces/mamba.
2 State Space Models
Structured state space sequence models (S4) are a recent class of sequence models for deep learning that are broadly related to RNNs, CNNs, and classical state space models. They are inspired by a particular continuous system (1) that maps a 1-dimensional function or sequence $x(t)\in\mathbb{R}\mapsto y(t)\in\mathbb{R}$ through an implicit latent state $h(t)\in\mathbb{R}^{N}$.
Figure 1: (Overview.) Structured SSMs independently map each channel (e.g. $D=5$) of an input $x$ to output $y$ through a higher dimensional latent state $h$ (e.g. $N=4$). Prior SSMs avoid materializing this large effective state ($DN$, times batch size $B$ and sequence length $L$) through clever alternate computation paths requiring time-invariance: the $(\Delta,A,B,C)$ parameters are constant across time. Our selection mechanism adds back input-dependent dynamics, which also requires a careful hardware-aware algorithm to only materialize the expanded states in more efficient levels of the GPU memory hierarchy.
Concretely, S4 models are defined with four parameters $(\Delta,A,B,C)$ , which define a sequence-to-sequence transformation in two stages.
$$
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t) \quad \text{(1a)} \qquad & h_{t} &= \overline{A}\,h_{t-1} + \overline{B}\,x_{t} \quad \text{(2a)} \qquad & \overline{K} &= (C\overline{B},\, C\overline{A}\,\overline{B},\, \ldots,\, C\overline{A}^{k}\overline{B},\, \ldots) \quad \text{(3a)} \\
y(t) &= C\,h(t) \quad \text{(1b)} \qquad & y_{t} &= C\,h_{t} \quad \text{(2b)} \qquad & y &= x * \overline{K} \quad \text{(3b)}
\end{aligned}
$$
Discretization. The first stage transforms the "continuous parameters" $(\Delta,A,B)$ to "discrete parameters" $(\overline{A},\overline{B})$ through fixed formulas $\overline{A}=f_{A}(\Delta,A)$ and $\overline{B}=f_{B}(\Delta,A,B)$, where the pair $(f_{A},f_{B})$ is called a discretization rule. Various rules can be used such as the zero-order hold (ZOH) defined in equation (4).
$$
\overline{A} = \exp(\Delta A) \qquad\qquad \overline{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\cdot \Delta B \tag{4}
$$
Discretization has deep connections to continuous-time systems which can endow them with additional properties such as resolution invariance (Nguyen, Goel, et al. 2022) and automatically ensuring that the model is properly normalized (Gu, Johnson, Timalsina, et al. 2023; Orvieto et al. 2023). It also has connections to gating mechanisms of RNNs (Gu, Gulcehre, et al. 2020; Tallec and Ollivier 2018) which we will revisit in Section 3.5. However, from a mechanical point of view discretization can simply be viewed as the first step of the computation graph in the forward pass of an SSM. Alternate flavors of SSMs can bypass the discretization step and parameterize $(\overline{A},\overline{B})$ directly instead (Zhang et al. 2023), which may be easier to reason about.
Computation. After the parameters have been transformed from $(\Delta,A,B,C)\mapsto({\overline{{A}}},{\overline{{B}}},C)$ , the model can be computed in two ways, either as a linear recurrence (2) or a global convolution (3).
Commonly, the model uses the convolutional mode (3) for efficient parallelizable training (where the whole input sequence is seen ahead of time), and switches into recurrent mode (2) for efficient autoregressive inference (where the inputs are seen one timestep at a time).
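For concreteness, the following minimal NumPy sketch (illustrative only; the state size, step size, and random parameters are arbitrary choices rather than values used in our experiments) discretizes a small LTI SSM with the ZOH rule (4) and checks that the recurrent mode (2) and the convolutional mode (3) produce the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32                               # state size and sequence length (illustrative)
delta = 0.1                                # step size Delta (a scalar here for simplicity)
A = -np.diag(rng.uniform(0.5, 2.0, N))     # stable diagonal A
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)

# Zero-order hold discretization, equation (4).
dA = delta * A
A_bar = np.diag(np.exp(np.diag(dA)))                       # exp of a diagonal matrix
B_bar = np.linalg.inv(dA) @ (A_bar - np.eye(N)) @ (delta * B)

# Recurrent mode, equation (2): h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional mode, equation (3): K_bar = (C B_bar, C A_bar B_bar, ...), y = x * K_bar.
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.array([np.dot(K[:t + 1][::-1], x[:t + 1]) for t in range(L)])  # causal convolution

assert np.allclose(y_rec, y_conv)          # the two computation paths agree for an LTI SSM
```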
Linear Time Invariance (LTI). An important property of equations (1) to (3) is that the model's dynamics are constant through time. In other words $(\Delta,A,B,C)$, and consequently $(\overline{A},\overline{B})$ as well, are fixed for all time-steps. This property is called linear time invariance (LTI), which is deeply connected to recurrence and convolutions. Informally, we think of LTI SSMs as being equivalent to any linear recurrence (2a) or convolution (3b), and use LTI as an umbrella term for these classes of models.
Thus far, all structured SSMs have been LTI (e.g. computed as convolutions) because of fundamental efficiency constraints, discussed in Section 3.3. However, a core insight of this work is that LTI models have fundamental limitations in modeling certain types of data, and our technical contributions involve removing the LTI constraint while overcoming the efficiency bottlenecks.
Structure and Dimensions. Finally, we note that structured SSMs are so named because computing them efficiently also requires imposing structure on the $\pmb{A}$ matrix. The most popular form of structure is diagonal (Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Smith, Warrington, and Linderman 2023), which we also use.
In this case, the $A\in\mathbb{R}^{N\times N},B\in\mathbb{R}^{N\times1},C\in\mathbb{R}^{1\times N}$ matrices can all be represented by $N$ numbers. To operate over an input sequence $x$ of batch size $B$ and length $L$ with $D$ channels, the SSM is applied independently to each channel. Note that in this case, the total hidden state has dimension $D N$ per input, and computing it over the sequence length requires $O(B L D N)$ time and memory; this is the root of the fundamental efficiency bottleneck addressed in Section 3.3.
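A shape-level sketch of this computation, with the (B, L, D, N) bookkeeping made explicit (the dimensions below are arbitrary illustrations), shows why materializing the hidden state costs $O(BLDN)$ time and memory: the diagonal SSM is applied independently to each of the $D$ channels, so each position carries a state of size $DN$.

```python
import numpy as np

rng = np.random.default_rng(0)
Bsz, L, D, N = 2, 16, 5, 4                   # batch, length, channels, state size (illustrative)
A = -rng.uniform(0.5, 2.0, (D, N))           # diagonal A per channel, stored as its diagonal
B = rng.standard_normal((D, N))
C = rng.standard_normal((D, N))
delta = 0.1
x = rng.standard_normal((Bsz, L, D))

A_bar = np.exp(delta * A)                            # (D, N): elementwise exp since A is diagonal
B_bar = (A_bar - 1.0) / (delta * A) * (delta * B)    # ZOH (4), specialized to diagonal A

# Materializing the full latent state costs O(B L D N) time and memory.
h = np.zeros((Bsz, L, D, N))
y = np.zeros((Bsz, L, D))
h_prev = np.zeros((Bsz, D, N))
for t in range(L):
    h_prev = A_bar * h_prev + B_bar * x[:, t, :, None]   # independent SSM per channel
    h[:, t] = h_prev
    y[:, t] = (h_prev * C).sum(-1)                       # y_t = C h_t, contracted over N
```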
General State Space Models. We note that the term state space model has a very broad meaning which simply represents the notion of any recurrent process with a latent state. It has been used to refer to many disparate concepts in different disciplines, including Markov decision processes (MDP) (reinforcement learning (Hafner et al. 2020)), dynamic causal modeling (DCM) (computational neuroscience (Friston, Harrison, and Penny 2003)), Kalman filters (controls (Kalman 1960)), hidden Markov models (HMM) and linear dynamical systems (LDS) (machine learning), and recurrent (and sometimes convolutional) models at large (deep learning).
Throughout this entire paper we use the term “SSM" to refer exclusively to the class of structured SSMs or S4 models (Gu, Goel, and Ré 2022; Gu, Gupta, et al. 2022; Gupta, Gu, and Berant 2022; Hasani et al. 2023; Ma et al. 2023; Smith, Warrington, and Linderman 2023) and use these terms interchangeably. For convenience we may also include derivatives of such models, such as those focusing on either the linear-recurrence or global-convolution viewpoints (Y. Li et al. 2023; Orvieto et al. 2023; Poli et al. 2023), and clarify nuances when necessary.
SSM Architectures. SSMs are standalone sequence transformations that can be incorporated into end-to-end neural network architectures. (We also sometimes call SSM architectures SSNNs, which are to SSM layers as CNNs are to linear convolution layers.) We discuss some of the most well-known SSM architectures, many of which will also serve as our primary baselines.
Other closely related SSMs and architectures are discussed further in an extended related work (Appendix B). We highlight in particular S5 (Smith, Warrington, and Linderman 2023), QRNN (Bradbury et al. 2016), and SRU (Lei et al. 2017), which we view as the most closely related methods to our core selective SSM.
3 Selective State Space Models
We motivate our selection mechanism using intuition from synthetic tasks (Section 3.1), then explain how to incorporate this mechanism into state space models (Section 3.2). The resulting time-varying SSMs cannot use convolutions, presenting a technical challenge of how to compute them efficiently. We overcome this with a hardware-aware algorithm that exploits the memory hierarchy on modern hardware (Section 3.3). We then describe a simple SSM architecture without attention or even MLP blocks (Section 3.4). Finally, we discuss some additional properties of selection mechanisms (Section 3.5).
3.1 Motivation: Selection as a Means of Compression
We argue that a fundamental problem of sequence modeling is compressing context into a smaller state. In fact, we can view the tradeoffs of popular sequence models from this point of view. For example, attention is both effective and inefficient because it explicitly does not compress context at all. This can be seen from the fact that autoregressive inference requires explicitly storing the entire context (i.e. the KV cache), which directly causes the slow linear-time inference and quadratic-time training of Transformers. On the other hand, recurrent models are efficient because they have a finite state, implying constant-time inference and linear-time training. However, their effectiveness is limited by how well this state has compressed the context.
To understand this principle, we focus on two running examples of synthetic tasks (Figure 2).
These tasks reveal the failure mode of LTI models. From the recurrent view, their constant dynamics (e.g. the $(\overline{A},\overline{B})$ transitions in (2)) cannot let them select the correct information from their context, or affect the hidden state passed along the sequence in an input-dependent way. From the convolutional view, it is known that global convolutions can solve the vanilla Copying task (Romero et al. 2021) because it only requires time-awareness, but that they have difficulty with the Selective Copying task because of lack of content-awareness (Figure 2). More concretely, the spacing between inputs-to-outputs is varying and cannot be modeled by static convolution kernels.
In summary, the efficiency vs. effectiveness tradeoff of sequence models is characterized by how well they compress their state: efficient models must have a small state, while effective models must have a state that contains all necessary information from the context. In turn, we propose that a fundamental principle for building sequence models is selectivity: or the context-aware ability to focus on or filter out inputs into a sequential state. In particular, a selection mechanism controls how information propagates or interacts along the sequence dimension (see Section 3.5 for more discussion).
3.2 Improving SSMs with Selection
One method of incorporating a selection mechanism into models is by letting their parameters that affect interactions along the sequence (e.g. the recurrent dynamics of an RNN or the convolution kernel of a CNN) be input-dependent.
Algorithms 1 and 2 illustrate the main selection mechanism that we use. The main difference is simply making several parameters $\Delta, B, C$ functions of the input, along with the associated changes to tensor shapes throughout. In particular, we highlight that these parameters now have a length dimension $L$, meaning that the model has changed from time-invariant to time-varying. (Note that shape annotations were described in Section 2.) This loses the equivalence to convolutions (3) with implications for its efficiency, discussed next.
We specifically choose $s_{B}(x) = \mathrm{Linear}_{N}(x)$, $s_{C}(x) = \mathrm{Linear}_{N}(x)$, $s_{\Delta}(x) = \mathrm{Broadcast}_{D}(\mathrm{Linear}_{1}(x))$, and $\tau_{\Delta} = \mathrm{softplus}$, where $\mathrm{Linear}_{d}$ is a parameterized projection to dimension $d$. The choice of $s_{\Delta}$ and $\tau_{\Delta}$ is due to a connection to RNN gating mechanisms explained in Section 3.5.
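As a readable stand-in for the pseudocode referenced above (not the fused, hardware-aware implementation), the following PyTorch sketch makes $B$, $C$, and $\Delta$ functions of the input via the projections $s_B$, $s_C$, $s_\Delta$, and computes the resulting time-varying recurrence with a naive sequential scan. The diagonal $A$ uses the S4D-Real-style values mentioned in Section 3.6; everything else (dimensions, the absence of any optimization) is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSMSketch(nn.Module):
    """A readable (non-fused) sketch of a selective SSM: B, C and Delta are
    functions of the input, so (A_bar, B_bar) vary per time step and the model
    is computed as a sequential scan rather than a convolution."""

    def __init__(self, d_model: int, d_state: int = 4):
        super().__init__()
        # Diagonal A initialized to -(n+1), the S4D-Real values (Section 3.6).
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.s_B = nn.Linear(d_model, d_state)       # s_B(x) = Linear_N(x)
        self.s_C = nn.Linear(d_model, d_state)       # s_C(x) = Linear_N(x)
        self.s_delta = nn.Linear(d_model, 1)         # s_Delta(x) = Broadcast_D(Linear_1(x))

    def forward(self, x):                            # x: (B, L, D)
        Bsz, L, D = x.shape
        A = -torch.exp(self.A_log)                   # (D, N), negative for stability
        Bx, Cx = self.s_B(x), self.s_C(x)            # (B, L, N) each
        delta = F.softplus(self.s_delta(x))          # (B, L, 1), broadcast over the D channels
        dA = delta.unsqueeze(-1) * A                 # (B, L, D, N)
        A_bar = torch.exp(dA)                        # ZOH (4) for diagonal A
        B_bar = (A_bar - 1) / dA * (delta * Bx).unsqueeze(2)   # (B, L, D, N)

        h = x.new_zeros(Bsz, D, A.shape[-1])
        ys = []
        for t in range(L):                           # sequential scan (recurrent mode)
            h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
            ys.append((h * Cx[:, t].unsqueeze(1)).sum(-1))     # y_t = C_t h_t
        return torch.stack(ys, dim=1)                # (B, L, D)

y = SelectiveSSMSketch(d_model=8)(torch.randn(2, 32, 8))
```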
Figure 2: (Left) The standard version of the Copying task involves constant spacing between input and output elements and is easily solved by time-invariant models such as linear recurrences and global convolutions. (Right Top) The Selective Copying task has random spacing in between inputs and requires time-varying models that can selectively remember or ignore inputs depending on their content. (Right Bottom) The Induction Heads task is an example of associative recall that requires retrieving an answer based on context, a key ability for LLMs.
3.3 Efficient Implementation of Selective SSMs
Hardware-friendly primitives such as convolutions (Krizhevsky, Sutskever, and Hinton 2012) and attention (Bahdanau, Cho, and Bengio 2015; Vaswani et al. 2017) enjoy widespread application. Here we aim to make selective SSMs efficient on modern hardware (GPUs) as well. The selection mechanism is quite natural, and earlier works attempted to incorporate special cases of selection, such as letting $\Delta$ vary over time in recurrent SSMs (Gu, Dao, et al. 2020). However, as previously mentioned a core limitation in the usage of SSMs is their computational efficiency, which was why S4 and all derivatives used LTI (non-selective) models, most commonly in the form of global convolutions.
3.3.1 Motivation of Prior Models
We first revisit this motivation and overview our approach to overcome limitations of prior methods.
· At a high level, recurrent models such as SSMs always balance a tradeoff between expressivity and speed: as discussed in Section 3.1, models with larger hidden state dimension should be more effective but slower. Thus we want to maximize hidden state dimension without paying speed and memory costs.
· Note that the recurrent mode is more flexible than the convolution mode, since the latter (3) is derived from expanding the former (2) (Gu, Goel, and Ré 2022; Gu, Johnson, Goel, et al. 2021). However, this would require computing and materializing the latent state $h$ with shape $(\mathsf{B},\mathsf{L},\mathsf{D},\mathsf{N})$ , which is much larger (by a factor of $N$ , the SSM state dimension) than the input $x$ and output $y$ of shape (B, L, D). Thus the more efficient convolution mode was introduced which could bypass the state computation and materializes a convolution kernel (3a) of size only (B, L, D).
· Prior LTI state space models leverage the dual recurrent-convolutional forms to increase the effective state dimension by a factor of $N$ $(\approx10-100)$ , much larger than traditional RNNs, without efficiency penalties.
3.3.2 Overview of Selective Scan: Hardware-Aware State Expansion
The selection mechanism is designed to overcome the limitations of LTI models; at the same time, we therefore need to revisit the computation problem of SSMs. We address this with three classical techniques: kernel fusion, parallel scan, and recomputation. We make two main observations:
The main idea is to leverage properties of modern accelerators (GPUs) to materialize the state $h$ only in more efficient levels of the memory hierarchy. In particular, most operations (except matrix multiplication) are bounded by memory bandwidth (Dao, Fu, Ermon, et al. 2022; Ivanov et al. 2021; Williams, Waterman, and Patterson 2009). This includes our scan operation, and we use kernel fusion to reduce the amount of memory IOs, leading to a significant speedup compared to a standard implementation.
Concretely, instead of preparing the scan input $(\overline{A},\overline{B})$ of size (B, L, D, N) in GPU HBM (high-bandwidth memory), we load the SSM parameters $(\Delta,A,B,C)$ directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size (B, L, D) back to HBM.
To avoid the sequential recurrence, we observe that despite not being linear it can still be parallelized with a work-efficient parallel scan algorithm (Blelloch 1990; Martin and Cundy 2018; Smith, Warrington, and Linderman 2023).
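The reason a parallel scan applies is that the per-step update $h_t = \overline{A}_t h_{t-1} + \overline{B}_t x_t$ composes associatively: the pair $(\overline{A}_1, b_1)$ followed by $(\overline{A}_2, b_2)$ is equivalent to the single pair $(\overline{A}_1\overline{A}_2,\ \overline{A}_2 b_1 + b_2)$. The sketch below (a simple doubling scan in NumPy, not the fused CUDA kernel used in practice and not work-efficient in the Blelloch sense) checks this against the sequential recurrence for scalar coefficients.

```python
import numpy as np

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, with h_{-1} = 0."""
    h = np.zeros_like(b)
    acc = 0.0
    for t in range(len(b)):
        acc = a[t] * acc + b[t]
        h[t] = acc
    return h

def parallel_scan(a, b):
    """Inclusive scan of the associative operator
        (a1, b1) . (a2, b2) = (a1 * a2, a2 * b1 + b2)
    computed in ceil(log2 L) doubling steps (Hillis-Steele style)."""
    a, b = a.copy(), b.copy()
    shift = 1
    while shift < len(b):
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])   # identity element for i < shift
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        b = a * b_prev + b      # combine each element with the prefix ending `shift` steps back
        a = a * a_prev
        shift *= 2
    return b

rng = np.random.default_rng(0)
L = 1024
a = rng.uniform(0.5, 1.0, L)    # stand-in for time-varying A_bar coefficients
b = rng.standard_normal(L)      # stand-in for B_bar_t * x_t
assert np.allclose(sequential_scan(a, b), parallel_scan(a, b))
```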
Finally, we must also avoid saving the intermediate states, which are necessary for backpropagation. We carefully apply the classic technique of recomputation to reduce the memory requirements: the intermediate states are not stored but recomputed in the backward pass when the inputs are loaded from HBM to SRAM. As a result, the fused selective scan layer has the same memory requirements as an optimized transformer implementation with FlashAttention.
Details of the fused kernel and recomputation are in Appendix D. The full Selective SSM layer and algorithm is illustrated in Figure 1.
3.4 A Simplified SSM Architecture
As with structured SSMs, selective SSMs are standalone sequence transformations that can be flexibly incorporated into neural networks. The H3 architecture is the basis for the most well-known SSM architectures (Section 2), which are generally comprised of a block inspired by linear attention interleaved with an MLP (multi-layer perceptron) block. We simplify this architecture by combining these two components into one, which is stacked homogeneously (Figure 3). This is inspired by the gated attention unit (GAU) (Hua et al. 2022), which did something similar for attention.
This architecture involves expanding the model dimension $D$ by a controllable expansion factor $E$. For each block, most of the parameters ($3ED^{2}$) are in the linear projections ($2ED^{2}$ for input projections, $ED^{2}$ for output projection) while the inner SSM contributes less. The number of SSM parameters (projections for $\Delta, B, C$, and the matrix $A$) are much smaller in comparison. We repeat this block, interleaved with standard normalization and residual connections, to form the Mamba architecture. We always fix to $E=2$ in our experiments and use two stacks of the block to match the $12D^{2}$ parameters of a Transformer's interleaved MHA (multi-head attention) and MLP blocks. We use the SiLU / Swish activation function (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017), motivated so that the Gated MLP becomes the popular "SwiGLU" variant (Chowdhery et al. 2023; Dauphin et al. 2017; Shazeer 2020; Touvron et al. 2023). Finally, we additionally use an optional normalization layer (we choose LayerNorm (J. L. Ba, Kiros, and Hinton 2016)), motivated by RetNet's usage of a normalization layer in a similar location (Y. Sun et al. 2023).
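A schematic PyTorch sketch of this block is given below. It follows the description above and Figure 3 (expansion factor $E$, an activation in place of the first multiplicative gate, an SSM on the main branch, SiLU gating, and standard normalization and residual connections), but the short causal convolution, its kernel width, and the `ssm` stand-in module are assumptions about details not fully specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """A schematic sketch of the block in Figure 3 (not the released implementation).

    Most parameters sit in the input projection (2*E*D^2) and the output
    projection (E*D^2); `ssm` is a stand-in for the selective SSM (S6) layer.
    """

    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4, ssm: nn.Module = None):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner, bias=False)   # main branch and gate branch
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner, padding=d_conv - 1)
        self.ssm = ssm if ssm is not None else nn.Identity()         # selective SSM goes here
        self.out_proj = nn.Linear(d_inner, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, u):                                  # u: (B, L, D)
        x, z = self.in_proj(self.norm(u)).chunk(2, dim=-1)
        x = self.conv(x.transpose(1, 2))[..., : u.shape[1]].transpose(1, 2)   # short causal conv
        x = self.ssm(F.silu(x))                            # activation, then selective SSM
        y = x * F.silu(z)                                  # multiplicative gate (SwiGLU-style)
        return u + self.out_proj(y)                        # residual connection

out = MambaBlockSketch(d_model=16)(torch.randn(2, 32, 16))
```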
3.5 Properties of Selection Mechanisms
The selection mechanism is a broader concept that can be applied in different ways, such as to more traditional RNNs or CNNs, to different parameters (e.g. $A$ in Algorithm 2), or using different transformations $s(x)$.
Figure 3: (Architecture.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogeneously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For $\sigma$ we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017).
3.5.1 Connection to Gating Mechanisms
We highlight the most important connection: the classical gating mechanism of RNNs is an instance of our selection mechanism for SSMs. We note that the connection between RNN gating and the discretization of continuous-time systems is well established (Funahashi and Nakamura 1993; Tallec and Ollivier 2018). In fact, Theorem 1 is an improvement of Gu, Johnson, Goel, et al. (2021, Lemma 3.1) generalizing to the ZOH discretization and input-dependent gates (proof in Appendix C). More broadly, $\Delta$ in SSMs can be seen to play a generalized role of the RNN gating mechanism. In line with prior work, we adopt the view that discretization of SSMs is the principled foundation of heuristic gating mechanisms.
Theorem 1. When $N=1$, $A=-1$, $B=1$, $s_{\Delta}=\mathrm{Linear}(x)$, and $\tau_{\Delta}=\mathrm{softplus}$, then the selective SSM recurrence (Algorithm 2) takes the form
$$
\begin{aligned}
g_{t} &= \sigma(\mathrm{Linear}(x_{t})) \\
h_{t} &= (1 - g_{t})\,h_{t-1} + g_{t}\,x_{t}.
\end{aligned}
$$
As mentioned in Section 3.2, our specific choices of $s_{\Delta}, \tau_{\Delta}$ are from this connection. In particular, note that if a given input $x_{t}$ should be completely ignored (as necessary in the synthetic tasks), all $D$ channels should ignore it, and so we project the input down to 1 dimension before repeating/broadcasting with $\Delta$.
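Theorem 1 can also be checked numerically: with ZOH, $\overline{A} = \exp(-\Delta_t)$ and $\overline{B} = 1 - \exp(-\Delta_t)$, and substituting $\Delta_t = \mathrm{softplus}(z_t)$ gives $\overline{B} = \sigma(z_t)$ and $\overline{A} = 1 - \sigma(z_t)$. A small NumPy check (the values of $z_t$ stand in for $\mathrm{Linear}(x_t)$ and are otherwise arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100)                    # z_t = Linear(x_t); the projection is immaterial here

# Selective SSM with N = 1, A = -1, B = 1, tau_Delta = softplus:
delta = np.log1p(np.exp(z))                     # softplus(z_t)
A_bar = np.exp(-delta)                          # exp(Delta * A) with A = -1
B_bar = (np.exp(-delta) - 1.0) / (-delta) * delta   # ZOH (4) with B = 1, i.e. 1 - exp(-Delta)

# Gated-RNN form from Theorem 1: g_t = sigma(z_t), h_t = (1 - g_t) h_{t-1} + g_t x_t
g = 1.0 / (1.0 + np.exp(-z))
assert np.allclose(A_bar, 1.0 - g)              # A_bar = 1 - g_t
assert np.allclose(B_bar, g)                    # B_bar = g_t
```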
3.5.2 Interpretation of Selection Mechanisms
We elaborate on three particular mechanistic effects of selection.
Variable Spacing. Selectivity allows filtering out irrelevant noise tokens that may occur between inputs of interest. This is exemplified by the Selective Copying task, but occurs ubiquitously in common data modalities, particularly for discrete data; for example, the presence of language fillers such as "um". This property arises because the model can mechanistically filter out any particular input $x_{t}$, for example in the gated RNN case (Theorem 1) when $g_{t}\to0$.
Filtering Context. It has been empirically observed that many sequence models do not improve with longer context (F. Shi et al. 2023), despite the principle that more context should lead to strictly better performance. An explanation is that many sequence models cannot effectively ignore irrelevant context when necessary; an intuitive example is global convolutions (and general LTI models). On the other hand, selective models can simply reset their state at any time to remove extraneous history, and thus their performance in principle improves monotonically with context length (e.g. Section 4.3.2).
Boundary Resetting. In settings where multiple independent sequences are stitched together, Transformers can keep them separate by instantiating a particular attention mask, while LTI models will bleed information between the sequences. Selective SSMs can also reset their state at boundaries (e.g. $\Delta_{t}\to\infty$, or Theorem 1 when $g_{t}\to1$). These settings may occur artificially (e.g. packing documents together to improve hardware utilization) or naturally (e.g. episode boundaries in reinforcement learning (Lu et al. 2023)).
Additionally, we elaborate on effects of each selective parameter.
Interpretation of $\Delta$. In general, $\Delta$ controls the balance between how much to focus or ignore the current input $x_{t}$. It generalizes RNN gates (e.g. $g_{t}$ in Theorem 1): mechanically, a large $\Delta$ resets the state $h$ and focuses on the current input $x$, while a small $\Delta$ persists the state and ignores the current input. SSMs (1)-(2) can be interpreted as a continuous system discretized by a timestep $\Delta$, and in this context the intuition is that large $\Delta\to\infty$ represents the system focusing on the current input for longer (thus "selecting" it and forgetting its current state) while a small $\Delta\to0$ represents a transient input that is ignored.
Interpretation of $A$. We remark that while the $A$ parameter could also be selective, it ultimately affects the model only through its interaction with $\Delta$ via $\overline{A}=\exp(\Delta A)$ (the discretization (4)). Thus selectivity in $\Delta$ is enough to ensure selectivity in $(\overline{A},\overline{B})$, and is the main source of improvement. We hypothesize that making $A$ selective in addition to (or instead of) $\Delta$ would have similar performance, and leave it out for simplicity.
Interpretation of $B$ and $C$. As discussed in Section 3.1, the most important property of selectivity is filtering out irrelevant information so that a sequence model's context can be compressed into an efficient state. In an SSM, modifying $B$ and $C$ to be selective allows finer-grained control over whether to let an input $x_{t}$ into the state $h_{t}$, or the state into the output $y_{t}$. These can be interpreted as allowing the model to modulate the recurrent dynamics based on content (input) and context (hidden states) respectively.
3.6 Additional Model Details
Real vs. Complex. Most prior SSMs use complex numbers in their state $h$ which is necessary for strong performance on many tasks in perceptual modalities (Gu, Goel, and Ré 2022). However, it has been empirically observed that completely real-valued SSMs seem to work fine, and possibly even better, in some settings (Ma et al. 2023). We use real values as the default, which work well for all but one of our tasks; we hypothesize that the complex-real tradeoff is related to the continuous-discrete spectrum in data modalities, where complex numbers are helpful for continuous modalities (e.g. audio, video) but not discrete (e.g. text, DNA).
Initialization. Most prior SSMs also suggest special initializations, particularly in the complex-valued case, which can help in several settings such as low-data regimes. Our default initialization for the complex case is S4D-Lin and for the real case is S4D-Real (Gu, Gupta, et al. 2022), which is based on the HIPPO theory (Gu, Dao, et al. 2020). These define the $n$-th element of $A$ as $-1/2+ni$ and $-(n+1)$ respectively. However, we expect many initializations to work fine, particularly in the large-data and real-valued SSM regimes; some ablations are considered in Section 4.6.
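For reference, the two initializations can be written in a few lines. Only the formulas for the $n$-th element of $A$ come from the text; the per-channel tiling is our convention for illustration.

```python
import numpy as np

def s4d_real(d_model: int, d_state: int) -> np.ndarray:
    """S4D-Real: the n-th element of A is -(n + 1) (real-valued default)."""
    A = -(np.arange(d_state) + 1.0)                  # shape (N,)
    return np.tile(A, (d_model, 1))                  # one copy of the diagonal per channel

def s4d_lin(d_model: int, d_state: int) -> np.ndarray:
    """S4D-Lin: the n-th element of A is -1/2 + n*i (complex-valued case)."""
    A = -0.5 + 1j * np.arange(d_state)
    return np.tile(A, (d_model, 1))

A_real = s4d_real(d_model=8, d_state=16)             # the real-valued default used in the text
A_cplx = s4d_lin(d_model=8, d_state=16)
```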
Parameterization of $\Delta$. We defined the selective adjustment to $\Delta$ as $s_{\Delta}(x) = \mathrm{Broadcast}_{D}(\mathrm{Linear}_{1}(x))$, which was motivated by the mechanics of $\Delta$ (Section 3.5). We observe that it can be generalized from dimension 1 to a larger dimension $\mathtt{R}$. We set this to be a small fraction of $\mathtt{D}$, which uses a negligible number of parameters compared to the main Linear projections in the block. We additionally note that the broadcasting operation can instead be viewed as another Linear projection, initialized to a specific pattern of 1's and 0's; if this projection is trainable, this leads to the alternative $s_{\Delta}(x)=\mathrm{Linear}_{D}(\mathrm{Linear}_{R}(x))$, which can be viewed as a low-rank projection.
In our experiments, the $\Delta$ parameter (which can be viewed as a bias term) is initialized to $\tau_{\Delta}^{-1}(\mathrm{Uniform}([0.001, 0.1]))$, following prior work on SSMs (Gu, Johnson, Timalsina, et al. 2023).
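A small PyTorch sketch of this parameterization and initialization follows; the values of $\mathtt{D}$ and $\mathtt{R}$ are illustrative, and the projection weights are left at their default initialization.

```python
import torch
import torch.nn as nn

D, R = 64, 4                                     # model dim and small rank R << D (illustrative)

# s_Delta(x) = Linear_D(Linear_R(x)): a low-rank projection generalizing Broadcast_D(Linear_1(x)).
s_delta = nn.Sequential(nn.Linear(D, R, bias=False), nn.Linear(R, D, bias=True))

# Initialize the Delta bias to softplus^{-1}(Uniform([0.001, 0.1])), so that after
# tau_Delta = softplus the bias alone would produce step sizes in [0.001, 0.1].
u = torch.empty(D).uniform_(0.001, 0.1)
inv_softplus = u + torch.log(-torch.expm1(-u))   # softplus^{-1}(u) = u + log(1 - exp(-u))
with torch.no_grad():
    s_delta[1].bias.copy_(inv_softplus)

x = torch.randn(2, 32, D)
delta = torch.nn.functional.softplus(s_delta(x))  # (2, 32, D) input-dependent step sizes
```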
Remark 3.1. For brevity in our experimental results, we sometimes abbreviate selective SSMs as S6 models, because they are S4 models with a selection mechanism and computed with a scan.
4 Empirical Evaluation
In Section 4.1 we test Mamba's ability to solve the two synthetic tasks motivated in Section 3.1. We then evaluate on three domains, each evaluated on autoregressive pretraining as well as downstream tasks.
Finally, Section 4.5 shows Mamba's computational efficiency at both training and inference time, and Section 4.6 ablates various components of the architecture and selective SSMs.
4.1 Synthetic Tasks
Full experiment details for these tasks including task details and training protocol are in Appendix E.1.
4.1.1 Selective Copying
The Copying task is one of the most well-studied synthetic tasks for sequence modeling, originally designed to test the memorization abilities of recurrent models. As discussed in Section 3.1, LTI SSMs (linear recurrences and global convolutions) can easily solve this task by only keeping track of time instead of reasoning about the data; for example, by constructing a convolution kernel of exactly the right length (Figure 2). This was explicitly validated in earlier work on global convolutions (Romero et al. 2021). The Selective Copying task prevents this shortcut by randomizing the spacing between tokens. Note that this task has been introduced before as the Denoising task (Jing et al. 2019).
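A toy generator for this task is sketched below; the sequence length, number of memorized tokens, and vocabulary size are illustrative assumptions rather than the exact settings of Appendix E.1. The key property is that the positions of the content tokens are resampled per example, so a model cannot exploit fixed spacing.

```python
import numpy as np

def selective_copying_batch(batch: int, seq_len: int = 4096, n_memorize: int = 16,
                            vocab: int = 16, seed: int = 0):
    """Toy Selective Copying generator (our construction, not the paper's exact pipeline):
    `n_memorize` content tokens are placed at random positions among blank/noise tokens,
    and the target is the content tokens in order."""
    rng = np.random.default_rng(seed)
    noise_token = vocab                              # reserve one extra id for the blank filler
    inputs = np.full((batch, seq_len), noise_token, dtype=np.int64)
    targets = rng.integers(0, vocab, size=(batch, n_memorize))
    for i in range(batch):
        pos = np.sort(rng.choice(seq_len, size=n_memorize, replace=False))
        inputs[i, pos] = targets[i]                  # random spacing between content tokens
    return inputs, targets

x, y = selective_copying_batch(batch=8)
```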
Note that many previous works argue that adding architecture gating (multiplicative interactions) can endow models with "data-dependence" and solve related tasks (Dao, Fu, Saab, et al. 2023; Poli et al. 2023). However, we find this explanation insufficient intuitively because such gating does not interact along the sequence axis, and cannot affect the spacing between tokens. In particular architecture gating is not an instance of a selection mechanism (Appendix A).
Table 1 confirms that gated architectures such as H3 and Mamba only partially improve performance, while the selection mechanism (modifying S4 to S6) easily solves this task, particularly when combined with these more powerful architectures.
4.1.2 Induction Heads
Induction heads (Olsson et al. 2022) is a simple task from the mechanistic interpretability lens (Elhage et al. 2021) that is surprisingly predictive of the in-context learning ability of LLMs. It requires models to perform associative recall and copy: for example, if the model has seen a bigram such as "Harry Potter" in the sequence, then the next time "Harry" appears in the same sequence, the model should be able to predict "Potter" by copying from history.
Dataset. We train a 2-layer model on the induction heads task at sequence length 256, with a vocab size of 16, which is comparable to prior work on this task (Dao, Fu, Saab, et al. 2023) but with longer sequences. We additionally investigate generalization and extrapolation abilities by evaluating on a range of sequence lengths from $2^{6}=64$ up to $2^{20}=1048576$ at test time.
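A toy generator in the same spirit (our construction; the exact data pipeline is in Appendix E.1) places a special prompt token followed by an answer token at a random position and repeats the prompt token at the end of the sequence, so the model must recall the answer by associative lookup:

```python
import numpy as np

def induction_heads_batch(batch: int, seq_len: int = 256, vocab: int = 16, seed: int = 0):
    """Toy induction heads generator: the model should predict, at the final position,
    the token that followed the special prompt token earlier in the sequence."""
    rng = np.random.default_rng(seed)
    special = vocab                                       # reserve an extra id for the prompt token
    tokens = rng.integers(0, vocab, size=(batch, seq_len))
    answers = rng.integers(0, vocab, size=batch)
    for i in range(batch):
        pos = rng.integers(1, seq_len - 2)                # leave room for the key/value pair
        tokens[i, pos] = special
        tokens[i, pos + 1] = answers[i]
        tokens[i, -1] = special                           # query: the target is answers[i]
    return tokens, answers

x, y = induction_heads_batch(batch=8)
```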
Models. Following established work on induction heads, we use 2 layer models, which allows attention to mechanistically solve the induction heads task (Olsson et al. 2022). We test both multi-head attention (8 heads, with various positional encodings) and SSM variants. We use a model dimension $D$ of 64 for Mamba and 128 for the other models.
Results. Table 2 shows that Mamba (or more precisely, its selective SSM layer) has the ability to solve the task perfectly because of its ability to selectively remember the relevant token while ignoring everything else in between. It generalizes perfectly to million-length sequences, or $4000\times$ longer than it saw during training, while no other method goes beyond $2\times$.
Table 1: (Selective Copying.) Accuracy for combinations of architectures and inner sequence layers.

| Model | Arch. | Layer | Acc. |
|---|---|---|---|
| S4 | No gate | S4 | 18.3 |
| - | No gate | S6 | 97.0 |
| H3 | H3 | S4 | 57.0 |
| Hyena | H3 | Hyena | 30.1 |
| - | H3 | S6 | 99.7 |
| - | Mamba | S4 | 56.4 |
| - | Mamba | Hyena | 28.4 |
| Mamba | Mamba | S6 | 99.8 |
Table 2: (Induction Heads.) Models are trained on sequence length $2^{8}=256$, and tested on increasing sequence lengths from $2^{6}=64$ up to $2^{20}=1048576$. Full numbers in Table 11.
Figure 4: (Scaling Laws.) Models of size $\approx 125M$ to $\approx 1.3B$ parameters, trained on the Pile. Mamba scales better than all other attention-free models and is the first to match the performance of a very strong "Transformer++" recipe that has now become standard, particularly as the sequence length grows.
Out of positional encoding variants for attention models, xPos (which was designed for length extrapolation) is slightly better than the others; also note that all attention models were only tested up to sequence length $2^{14}=16384$ due to memory limitations. Out of other SSMs, H3 and Hyena are similar, contrary to the findings in Poli et al. (2023).
4.2 Language Modeling
We evaluate the Mamba architecture on standard autoregressive language modeling against other architectures, on both pretraining metrics (perplexity) and zero-shot evaluations. We set the model sizes (depth and width) to mirror GPT3 specifications. We use the Pile dataset (L. Gao, Biderman, et al. 2020), and follow the training recipe described in Brown et al. (2020). All training details are in Appendix E.2.
我们评估 Mamba 架构在标准自回归语言建模任务上的表现,对比其他架构,在预训练指标(困惑度)和零样本 (Zero-shot) 评估上进行测试。我们将模型大小(深度和宽度)设置为与 GPT3 规格一致。我们使用 Pile 数据集 (L. Gao, Biderman, et al. 2020),并遵循 Brown 等人 (2020) 描述的训练方案。所有训练细节见附录 E.2。
4.2.1 Scaling Laws
4.2.1 扩展定律 (Scaling Laws)
For baselines, we compare against the standard Transformer architecture (GPT3 architecture), as well as the strongest Transformer recipe we know of (here referred to as Transformer++), based on the PaLM and LLaMa architectures (e.g. rotary embedding, SwiGLU MLP, RMSNorm instead of LayerNorm, no linear bias, and higher learning rates). We also compare against other recent subquadratic architectures (Figure 4). All model details are in Appendix E.2.
对于基线,我们对比了标准的 Transformer 架构(GPT3 架构),以及我们所知最强的 Transformer 配方(这里称为 Transformer++),该配方基于 PaLM 和 LLaMa 架构(例如旋转嵌入、SwiGLU MLP、使用 RMSNorm 而不是 LayerNorm、无线性偏置以及更高的学习率)。我们还对比了其他最近的次二次架构(图 4)。所有模型细节见附录 E.2。
Figure 4 shows scaling laws under the standard Chinchilla (Hoffmann et al. 2022) protocol, on models from $\approx 125M$ to $\approx 1.3B$ parameters. Mamba is the first attention-free model to match the performance of a very strong Transformer recipe (Transformer++) that has now become standard, particularly as the sequence length grows. (We note that full results on context length 8k are missing for the RWKV and RetNet baselines, prior strong recurrent models that can also be interpreted as SSMs, because a lack of efficient implementations leads to out-of-memory or unrealistic computation requirements.)
图 4 显示了在标准 Chinchilla (Hoffmann et al. 2022) 协议下的扩展定律,模型参数从约 125M 到约 1.3B。Mamba 是第一个在性能上匹配非常强大的 Transformer 配方 (Transformer++) 的无注意力模型,该配方现已成为标准,特别是在序列长度增加时。(我们注意到,RWKV 和 RetNet 基线(先前强大的循环模型,也可以解释为 SSM)在上下文长度 8k 上的完整结果缺失,原因是缺乏高效实现导致内存不足或计算要求不切实际。)
4.2.2 Downstream Evaluations
4.2.2 下游评估
Table 3 shows the performance of Mamba on a range of popular downstream zero-shot evaluation tasks. We compare against the most well-known open source models at these sizes, most importantly Pythia (Biderman et al. 2023) and RWKV (B. Peng et al. 2023) which were trained with the same tokenizer, dataset, and training length (300B tokens) as our models. (Note that Mamba and Pythia are trained with context length 2048, while RWKV was trained with context length 1024.)
表 3 显示了 Mamba 在一系列流行的下游零样本评估任务中的性能。我们与这些规模下最著名的开源模型进行了比较,最重要的是 Pythia (Biderman et al. 2023) 和 RWKV (B. Peng et al. 2023),它们使用与我们的模型相同的分词器、数据集和训练长度(300B Token)进行训练。(请注意,Mamba 和 Pythia 的训练上下文长度为 2048,而 RWKV 的训练上下文长度为 1024。)
Table 3: (Zero-shot Evaluations.) Best results for each size in bold. We compare against open source LMs with various tokenizers, trained for up to 300B tokens. Pile refers to the Pile validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.
表 3: (零样本评估。)每个规模的最佳结果用粗体表示。我们与使用不同分词器、最多训练 300B Token 的开源语言模型进行比较。Pile 指 Pile 验证集,仅与在相同数据集和分词器 (GPT-NeoX-20B) 上训练的模型进行比较。对于每个模型规模,Mamba 在每一项评估结果上都是同类最佳,并且通常能匹配两倍规模的基线模型。
MODEL | TOKEN. | PILE PPL↓ | LAMBADA PPL↓ | LAMBADA ACC ↑ | HELLASWAG ACC ↑ | PIQA ACC↑ | ARC-E ACC↑ | ARC-C ACC ↑ | WINOGRANDE ACC ↑ | AVERAGE ACC ↑ |
---|---|---|---|---|---|---|---|---|---|---|
Hybrid H3-130M | GPT2 | — | 89.48 | 25.77 | 31.7 | 64.2 | 44.4 | 24.2 | 50.6 | 40.1 |
Pythia-160M | NeoX | 29.64 | 38.10 | 33.0 | 30.2 | 61.4 | 43.2 | 24.1 | 51.9 | 40.6 |
Mamba-130M | NeoX | 10.56 | 16.07 | 44.3 | 35.3 | 64.5 | 48.0 | 24.3 | 51.9 | 44.7 |
Hybrid H3-360M | GPT2 | — | 12.58 | 48.0 | 41.5 | 68.1 | 51.4 | 24.7 | 54.1 | 48.0 |
Pythia-410M | NeoX | 9.95 | 10.84 | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
Mamba-370M | NeoX | 8.28 | 8.14 | 55.6 | 46.5 | 69.5 | 55.1 | 28.0 | 55.3 | 50.0 |
Pythia-1B | NeoX | 7.82 | 7.92 | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
Mamba-790M | NeoX | 7.33 | 6.02 | 62.7 | 55.1 | 72.1 | 61.2 | 29.5 | 56.1 | 57.1 |
GPT-Neo 1.3B | GPT2 | — | 7.50 | 57.2 | 48.9 | 71.1 | 56.2 | 25.9 | 54.9 | 52.4 |
Hybrid H3-1.3B | GPT2 | — | 11.25 | 49.6 | 52.6 | 71.3 | 59.2 | 28.1 | 56.9 | 53.0 |
OPT-1.3B | OPT | — | 6.64 | 58.0 | 53.7 | 72.4 | 56.7 | 29.6 | 59.5 | 55.0 |
Pythia-1.4B | NeoX | 7.51 | 6.08 | 61.7 | 52.1 | 71.0 | 60.5 | 28.5 | 57.2 | 55.2 |
RWKV-1.5B | NeoX | 7.70 | 7.04 | 56.4 | 52.5 | 72.4 | 60.5 | 29.4 | 54.6 | 54.3 |
Mamba-1.4B | NeoX | 6.80 | 5.04 | 64.9 | 59.1 | 74.2 | 65.5 | 32.8 | 61.5 | 59.7 |
GPT-Neo 2.7B | GPT2 | — | 5.63 | 62.2 | 55.8 | 72.1 | 61.1 | 30.2 | 57.6 | 56.5 |
Hybrid H3-2.7B | GPT2 | — | 7.92 | 55.7 | 59.7 | 73.3 | 65.6 | 32.3 | 61.4 | 58.0 |
OPT-2.7B | OPT | — | 5.12 | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
Pythia-2.8B | NeoX | 6.73 | 5.04 | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
RWKV-3B | NeoX | 7.00 | 5.24 | 63.9 | 59.6 | 73.7 | 67.8 | 33.1 | 59.6 | 59.6 |
Mamba-2.8B | NeoX | 6.22 | 4.23 | 69.2 | 66.1 | 75.2 | 69.7 | 36.3 | 63.5 | 63.3 |
GPT-J-6B | GPT2 | — | 4.10 | 68.3 | 66.3 | 75.4 | 67.0 | 36.6 | 64.1 | 63.0 |
OPT-6.7B | OPT | — | 4.25 | 67.7 | 67.2 | 76.3 | 65.6 | 34.9 | 65.5 | 62.9 |
Pythia-6.9B | NeoX | 6.51 | 4.45 | 67.1 | 64.0 | 75.2 | 67.3 | 35.5 | 61.3 | 61.7 |
RWKV-7.4B | NeoX | 6.31 | 4.38 | 67.2 | 65.5 | 76.1 | 67.8 | 37.5 | 61.0 | 62.5 |
4.3 DNA Modeling
4.3 DNA 建模 (DNA Modeling)
Motivated by the success of large language models, there has been recent exploration into using the foundation model paradigm for genomics. DNA has been likened to language in that it consists of sequences of discrete tokens with a finite vocabulary. It is also known for requiring long-range dependencies to model (Avsec et al. 2021). We investigate Mamba as an FM backbone for pretraining and fine-tuning in the same setting as recent works on long-sequence models for DNA (Nguyen, Poli, et al. 2023). In particular, we focus on two explorations of scaling laws across model size and sequence length (Figure 5), and a difficult downstream synthetic classification task requiring long context (Figure 6).
受大语言模型成功的影响,最近开始探索将基础模型范式应用于基因组学。DNA 被类比为语言,因为它由具有有限词汇表的离散 Token 序列组成。它还以需要建模长距离依赖性而闻名 (Avsec et al. 2021)。我们研究了 Mamba 作为预训练和微调的基础模型 (FM) 主干,在与近期关于 DNA 的长序列模型工作的相同设置下 (Nguyen, Poli, et al. 2023)。特别是,我们专注于两个方面的扩展规律的研究,包括模型规模和序列长度 (图 5),以及一个需要长上下文的困难下游合成分类任务 (图 6)。
For pretraining, we largely follow a standard causal language modeling (next token prediction) setup for the training and model details (see also Appendix E.2). For the dataset, we largely follow the setup of HyenaDNA (Nguyen, Poli, et al. 2023), which uses the HG38 dataset for pretraining, consisting of a single human genome with about 4.5 billion tokens (DNA base pairs) in the training split.
对于预训练,我们主要遵循标准的因果语言模型(下一个 Token 预测)设置进行训练和模型细节(详见附录 E.2)。对于数据集,我们主要遵循 HyenaDNA (Nguyen, Poli, et al. 2023) 的设置,该设置使用 HG38 数据集进行预训练,包含一个约 45 亿个 Token(DNA 碱基对)的单个人类基因组作为训练集。
Figure 5: (DNA Scaling Laws.) Pretraining on the HG38 (human genome) dataset. (Left) Fixing a short context length of $2^{10}=1024$ and increasing size from $\approx 200K$ to $\approx 40M$ parameters, Mamba scales better than baselines. (Right) Fixing model size and increasing sequence lengths while keeping tokens/batch and total training tokens fixed. Unlike baselines, the selection mechanism of Mamba facilitates better performance with increasing context length.
图 5: (DNA 缩放定律。)在 HG38 (人类基因组) 数据集上进行预训练。(左) 固定短上下文长度 $2^{10}=1024$ 并将参数量从 $\approx200K$ 增加到 $\approx40M$ ,Mamba 比基线模型表现更好。(右) 固定模型大小并增加序列长度,同时保持每批次 Token 数和总训练 Token 数不变。与基线模型不同,Mamba 的选择机制在上下文长度增加时表现出更好的性能。
4.3.1 Scaling: Model Size
4.3.1 扩展:模型大小
In this experiment, we investigate the scaling properties of genomics foundation models with various model backbones (Figure 5 Left).
在本实验中,我们研究了不同模型骨干的基因组基础模型的扩展特性 (图 5 左)。
Training. To give the baselines an advantage, we train on a short sequence length of 1024; as shown in Section 4.3.2, we expect results to favor Mamba even more at longer sequence lengths. We fix a global batch size of 1024, for a total of $2^{20}\approx 1M$ tokens per batch. Models were trained for $10K$ gradient steps for a total of 10B tokens.
训练。为了有利于基线模型,我们使用较短的序列长度 1024 进行训练;如第 4.3.2 节所示,我们预计在更长的序列长度下结果会更有利于 Mamba。我们固定全局批次大小为 1024,每批次总共 $2^{20} \approx 1M$ 个 Token。模型训练了 $10K$ 梯度步,总计 10B 个 Token。
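As a quick sanity check of the stated token budget (pure arithmetic on the numbers above):

```python
seq_len = 1024                # training sequence length
global_batch = 1024           # sequences per batch
tokens_per_batch = seq_len * global_batch
steps = 10_000                # gradient steps

print(tokens_per_batch)           # 1048576 = 2**20, i.e. ~1M tokens per batch
print(tokens_per_batch * steps)   # 10485760000, i.e. ~10B total training tokens
```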
Results. Figure 5 (Left) shows that Mamba's pretraining perplexity improves smoothly with model size, and that Mamba scales better than both HyenaDNA and Transformer++. For example, at the largest model size of $\approx 40M$ parameters, the curve shows that Mamba can match the Transformer++ and HyenaDNA models with roughly $3\times$ to $4\times$ fewer parameters.
结果。图 5 (左) 显示 Mamba 的预训练困惑度随模型大小平滑改善,并且 Mamba 的扩展性优于 HyenaDNA 和 Transformer++。例如,在约 40M 参数的最大模型规模下,曲线显示 Mamba 可以用大约少 3 到 4 倍的参数匹配 Transformer++ 和 HyenaDNA 模型。
4.3.2 Scaling: Context Length
4.3.2 扩展:上下文长度
In the next DNA experiment, we investigate the scaling properties of models with respect to sequence length. We only compare the HyenaDNA and Mamba models, as quadratic attention becomes prohibitively expensive at longer sequence lengths. We pretrain models on sequence lengths $2^{10}=1024$, $2^{12}=4096$, $2^{14}=16384$, $2^{16}=65536$, $2^{18}=262144$, and $2^{20}=1048576$. We fix a model size of 6 layers by width 128 (about 1.3M-1.4M parameters). Models were trained for $20K$ gradient steps for a total of $\approx 330B$ tokens. The longer sequence lengths used sequence length warmup similar to (Nguyen, Poli, et al. 2023).
在下一个 DNA 实验中,我们研究模型相对于序列长度的扩展特性。我们仅比较 HyenaDNA 和 Mamba 模型,因为二次注意力机制在较长序列长度下变得过于昂贵。我们在序列长度 $2^{10}=1024$、$2^{12}=4096$、$2^{14}=16384$、$2^{16}=65536$、$2^{18}=262144$ 和 $2^{20}=1048576$ 上预训练模型。我们固定模型大小为 6 层、宽度 128(约 1.3M-1.4M 参数)。模型训练了 $20K$ 梯度步,总共约 330B Token。较长的序列长度使用了类似于 (Nguyen, Poli, et al. 2023) 的序列长度热身。
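Because tokens/batch and total training tokens are held fixed, the batch size in sequences must shrink as the context length grows. The sketch below only works out the arithmetic implied by the stated numbers (roughly 330B tokens over 20K steps); the per-length batch sizes it prints are an inference for illustration, not reported values.

```python
steps = 20_000
total_tokens = 330e9                      # ~330B tokens, as stated
tokens_per_batch = total_tokens / steps   # ~16.5M tokens per batch (implied)

for log_len in range(10, 21, 2):          # sequence lengths 2^10 ... 2^20
    seq_len = 2 ** log_len
    batch_size = tokens_per_batch / seq_len
    print(f"L = 2^{log_len} = {seq_len:>7d}  ->  ~{batch_size:,.0f} sequences/batch")
```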
Results. Figure 5 (Right) shows that Mamba is able to make use of longer context even up to extremely long sequences of length 1M, and its pretraining perplexity improves as the context increases. On the other hand, the HyenaDNA model gets worse with sequence length. This is intuitive from the discussion in Section 3.5 on properties of the selection mechanism. In particular, LTI models cannot selectively ignore information; from a convolutional perspective, a very long convolution kernel aggregates all information across a long sequence, which may be very noisy. Note that while HyenaDNA claims to improve with longer context, their results do not control for computation time.
结果。图 5 (右) 显示 Mamba 能够利用更长的上下文,即使对于长度达 1M 的极长序列,其预训练困惑度随着上下文长度的增加而改善。另一方面,HyenaDNA 模型的表现随着序列长度的增加而变差。这从第 3.5 节关于选择机制属性的讨论中可以直观理解。特别是,LTI 模型无法有选择地忽略信息;从卷积的角度来看,一个非常长的卷积核会聚合整个长序列中的所有信息,这可能会非常嘈杂。需要注意的是,尽管 HyenaDNA 声称其在更长的上下文中表现更好,但他们的结果并未控制计算时间。
4.3.3 Synthetic Species Classification
4.3.3 合成物种分类
We evaluate models on a downstream task of classifying between 5 different species by randomly sampling a contiguous segment of their DNA. This task is adapted from HyenaDNA, which used the species {human, lemur, mouse, pig, hippo}. We modify the task to be significantly more challenging by classifying between the five great apes species {human, chimpanzee, gorilla, orangutan, bonobo}, which are known to share 99% of their DNA.
我们在一个下游任务上评估模型,该任务通过随机采样一段连续的 DNA 对 5 个不同物种进行分类。此任务改编自 HyenaDNA,其使用的物种为 {human, lemur, mouse, pig, hippo}。我们修改了任务,使其更具挑战性,即对五大类人猿物种 {human, chimpanzee, gorilla, orangutan, bonobo} 进行分类,这些物种已知共享 99% 的 DNA。
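For intuition, examples for this kind of task can be built by cutting a random fixed-length window out of each genome. The sketch below is illustrative only; the placeholder genomes, window length, and label mapping are assumptions rather than the actual data pipeline.

```python
import random

SPECIES = ["human", "chimpanzee", "gorilla", "orangutan", "bonobo"]

def sample_segment(genomes: dict, seq_len: int, rng: random.Random):
    """Draw one (DNA segment, species label) pair: pick a species, then a random
    contiguous window of its genome. Illustrative sketch only."""
    label = rng.randrange(len(SPECIES))
    genome = genomes[SPECIES[label]]            # e.g. a long string over 'ACGT'
    start = rng.randrange(0, len(genome) - seq_len)
    return genome[start:start + seq_len], label

# Toy usage with random placeholder "genomes".
rng = random.Random(0)
toy = {s: "".join(rng.choice("ACGT") for _ in range(10_000)) for s in SPECIES}
segment, label = sample_segment(toy, seq_len=1024, rng=rng)
print(len(segment), SPECIES[label])
```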
Figure 6: (Great Apes DNA Classification.) Accuracy after fine-tuning on sequences of length $2^{10}=1024$ up to $2^{20}=1048576$ using pretrained models of the same context length. Numerical results in Table 13.
Figure 7: (Audio Pretraining.) Mamba improves performance over prior state-of-the-art (SaShiMi) in autoregressive audio modeling, while improving up to minute-long context or million-length sequences (controlling for computation).
图 6: (大猿类 DNA 分类。)使用相同上下文长度的预训练模型,在长度为 $2^{10}=1024$ 到 $2^{20}=1048576$ 的序列上微调后的准确率。数值结果见表 13。
图 7: (音频预训练。)在自回归音频建模中,Mamba 的性能超过了先前的最先进水平 (SaShiMi),并且在上下文长达一分钟或百万长度的序列时持续改进(控制计算量)。
4.4 Audio Modeling and Generation
4.4 音频建模与生成 (Audio Modeling and Generation)
For the audio waveform modality, we compare primarily to the SaShiMi architecture and training protocols (Goel et al. 2022). This model comprises a U-Net backbone with two stages of pooling that double the model dimension per stage, with alternating S4 and MLP blocks in each stage.
对于音频波形模态,我们主要与 SaShiMi 架构和训练协议 (Goel et al. 2022) 进行比较。该模型由一个带有两级池化的 U-Net 主干组成(每级将模型维度加倍),每级中交替使用 S4 和 MLP 块。
We consider replacing the S4+MLP blocks with Mamba blocks. Experiment details are in Appendix E.4.
我们考虑用 Mamba 块替换 S4+MLP 块。实验细节见附录 E.4。
4.4.1 Long-Context Autoregressive Pretraining
4.4.1 长上下文自回归预训练 (Long-Context Autoregressive Pretraining)
We evaluate pretraining quality (autoregressive next-sample prediction) on YouTubeMix (DeepSound 2017), a standard piano music dataset used by prior work consisting of 4 hours of solo piano music, sampled at a rate of 16000 Hz. Pretraining details largely follow the standard language modeling setup (Section 4.2). Figure 7 evaluates the effect of increasing training sequence lengths from $2^{13}=8192$ to $2^{20}\approx 10^{6}$, while keeping computation fixed. There are some slight edge cases to the way the data is curated, which may lead to kinks in the scaling curves. For example, only minute-long clips were available, so the maximum sequence length is actually bounded by $60s \cdot 16000Hz = 960000$.
我们在 YouTubeMix (DeepSound 2017) 上评估预训练质量(自回归下一个样本预测),这是先前工作使用的标准钢琴音乐数据集,包含 4 小时的独奏钢琴音乐,采样率为 16000 Hz。预训练细节大体遵循标准的语言建模设置(第 4.2 节)。图 7 评估了在保持计算量固定的情况下,将训练序列长度从 $2^{13}=8192$ 增加到 $2^{20}\approx 10^{6}$ 的影响。数据整理方式存在一些细微的边界情况,可能导致扩展曲线出现拐点。例如,只有分钟长的片段可用,因此最大序列长度实际上被限制为 $60s \cdot 16000Hz = 960000$。
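The bound on the longest usable training sequence follows directly from the clip length and sample rate:

```python
sample_rate = 16_000        # Hz
clip_seconds = 60           # only minute-long clips are available
max_len = clip_seconds * sample_rate
print(max_len)              # 960000, which is below 2**20 = 1048576
```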
Both Mamba and the SaShiMi (S4+MLP) baseline improve consistently with longer context lengths; Mamba is better throughout, and the gap widens at longer lengths. The main metric is bits per byte (BPB), which is a constant factor $\log(2)$ of the standard negative log-likelihood (NLL) loss used for pretraining other modalities.
Mamba 和 SaShiMi (S4+MLP) 基线都随着上下文长度的增加而持续改进;Mamba 全程表现更好,且在更长的长度下差距扩大。主要指标是每字节比特数 (BPB),它与用于预训练其他模态的标准负对数似然 (NLL) 损失相差一个常数因子 $\log(2)$。
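Concretely, converting a negative log-likelihood measured in nats into bits per byte is a division by $\log 2$; the one-byte-per-sample assumption below (e.g. 8-bit quantized audio) is ours, for illustration.

```python
import math

def nats_to_bits_per_byte(nll_nats: float) -> float:
    """Convert an NLL in nats per (1-byte) sample into bits per byte.
    Assumes one byte per audio sample, e.g. 8-bit mu-law quantization."""
    return nll_nats / math.log(2)

print(nats_to_bits_per_byte(1.0))   # ~1.4427
```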
We note one important detail: this is the only experiment in this paper in which we switched from the real parameterization to the complex one (Section 3.6). We show additional ablations in Appendix E.4.
我们注意到一个重要细节:这是本文中唯一一个我们从实数参数化切换到复数参数化的实验(第 3.6 节)。我们在附录 E.4 中展示了额外的消融实验。
4.4.2 Autoregressive Speech Generation
4.4.2 自回归语音生成 (Autoregressive Speech Generation)
SC09 is a benchmark speech generation dataset (Donahue, McAuley, and Puckette 2019; Warden 2018), consisting of 1-second clips sampled at $16000\ \mathrm{Hz}$ of the digits “zero” through “nine” with highly variable characteristics. We largely follow the auto regressive training setup and generation protocol of Goel et al. (2022).
SC09 是一个基准语音生成数据集 (Donahue, McAuley, 和 Puckette 2019; Warden 2018),由以 $16000\ \mathrm{Hz}$ 采样的 1 秒音频片段组成,内容为数字“零”到“九”,具有高度可变的特征。我们主要遵循 Goel 等人 (2022) 的自回归训练设置和生成协议。
Table 4 shows automated metrics of the Mamba-UNet model compared to a variety of baselines from Goel et al. (2022): WaveNet (Oord et al. 2016), SampleRNN (Mehri et al. 2017), WaveGAN (Donahue, McAuley, and Puckette 2019), DiffWave (Z. Kong et al. 2021), and SaShiMi. A small Mamba model outperforms the state-of-the-art (and much larger) GAN- and diffusion-based models. A larger model parameter-matched to the baselines further improves on fidelity metrics dramatically.
表 4 显示了 Mamba-UNet 模型与 Goel 等人 (2022) 的多种基线模型的自动化指标对比:WaveNet (Oord et al. 2016)、SampleRNN (Mehri et al. 2017)、WaveGAN (Donahue, McAuley, and Puckette 2019)、DiffWave (Z. Kong et al. 2021) 和 SaShiMi。小型 Mamba 模型的表现超过了最先进的(且规模更大的)基于 GAN 和扩散的模型。参数量与基线匹配的更大模型在保真度指标上进一步显著提升。
Table 5 takes the small Mamba model and investigates combinations of different architectures for the outer stages and center stage. It shows that Mamba is consistently better than S4+MLP in the outer blocks, and Mamba > S4+MLP > MHA+MLP in the center blocks.
表 5 采用小规模 Mamba 模型,研究外部阶段和中心阶段不同架构的组合。结果显示,在外部块中 Mamba 始终优于 S4+MLP,而在中心块中 Mamba > S4+MLP > MHA+MLP。
Table 4: (SC09.) Automated metrics for unconditional generation on a challenging dataset of fixed-length speech clips. (Top to bottom) Autoregressive baselines, non-autoregressive baselines, Mamba, and dataset metrics.
表 4: (SC09。)在一个具有挑战性的固定长度语音片段数据集上进行无条件生成的自动化指标。(从上到下)自回归基线、非自回归基线、Mamba 以及数据集指标。
模型 | 参数量 | NLL↓ | FID↓ | IS↑ | mIS↑ | AM↓ |
---|---|---|---|---|---|---|
SampleRNN | 35.0M | 2.042 | 8.96 | 1.71 | 3.02 | 1.76 |
WaveNet | 4.2M | 1.925 | 5.08 | 2.27 | 5.80 | 1.47 |
SaShiMi | 5.8M | 1.873 | 1.99 | 5.13 | 42.57 | 0.74 |
WaveGAN | 19.1M | — | 2.03 | 4.90 | 36.10 | 0.80 |
DiffWave | 24.1M | — | 1.92 | 5.26 | 51.21 | 0.68 |
+SaShiMi | 23.0M | — | 1.42 | 5.94 | 69.17 | 0.59 |
Mamba | 6.1M | 1.852 | 0.94 | 6.26 | 88.54 | 0.52 |
Mamba | 24.3M | 1.860 | 0.67 | 7.33 | 144.9 | 0.36 |
训练集 | — | — | 0.00 | 8.56 | 292.5 | 0.16 |
测试集 | — | — | 0.02 | 8.33 | 257.6 | 0.19 |
Table 5: (SC09 Model Ablations.) Models with 6M parameters. In SaShiMi's U-Net backbone, there are 8 center blocks operating on sequence length 1000, sandwiched on each side by 8 outer blocks on sequence length 4000, in turn sandwiched by 8 outer blocks on sequence length 16000 (40 blocks total). The architecture of the 8 center blocks is ablated independently of the rest. Note that Transformers (MHA+MLP) were not tested in the more important outer blocks because of efficiency constraints.
表 5: (SC09 模型消融。)参数量为 6M 的模型。在 SaShiMi 的 U-Net 主干中,有 8 个中心块作用于序列长度 1000,两侧各有 8 个外部块作用于序列长度 4000,再外侧各有 8 个外部块作用于序列长度 16000(共 40 个块)。8 个中心块的架构独立于其余部分进行消融。注意,由于效率限制,Transformer (MHA+MLP) 没有在更重要的外部块中进行测试。
OUTER | CENTER | NLL↓ | FID↓ | IS ↑ | MIS ↑ | AM↓ |
---|---|---|---|---|---|---|
S4+MLP | MHA+MLP | 1.859 | 1.45 | 5.06 | 47.03 | 0.70 |
S4+MLP | S4+MLP | 1.867 | 1.43 | 5.42 | 53.54 | 0.65 |
S4+MLP | Mamba | 1.859 | 1.42 | 5.71 | 56.51 | 0.64 |
Mamba | MHA+MLP | 1.850 | 1.37 | 5.63 | 58.23 | 0.62 |
Mamba | S4+MLP | 1.853 | 1.07 | 6.05 | 73.34 | 0.55 |
Mamba | Mamba | 1.852 | 0.94 | 6.26 | 88.54 | 0.52 |
4.5 Speed and Memory Benchmarks
4.5 速度和内存基准测试
We benchmark the speed of the SSM scan operation (state expansion $N = 16$), as well as the end-to-end inference throughput of Mamba, in Figure 8. Our efficient SSM scan is faster than the best attention implementation that we know of (FlashAttention-2 (Dao 2024)) beyond sequence length 2K, and up to $20{-}40\times$ faster than a standard scan implementation in PyTorch. Mamba achieves $4{-}5\times$ higher inference throughput than a Transformer of similar size, since without the KV cache it can use much higher batch sizes. For example, a Mamba-6.9B (untrained) would have higher inference throughput than a $5\times$ smaller Transformer-1.3B. Details are in Appendix E.5, which additionally includes a benchmark of memory consumption.
我们在图 8 中对 SSM 扫描操作(状态扩展 $N = 16$)的速度以及 Mamba 的端到端推理吞吐量进行了基准测试。我们高效的 SSM 扫描在序列长度超过 2K 时比我们所知的最佳注意力实现(FlashAttention-2 (Dao 2024))更快,并且比 PyTorch 中的标准扫描实现快 20-40 倍。Mamba 的推理吞吐量比类似规模的 Transformer 高 4-5 倍,因为没有 KV 缓存,它可以使用更大的批量大小。例如,未训练的 Mamba-6.9B 的推理吞吐量将高于小 5 倍的 Transformer-1.3B。详细信息见附录 E.5,其中还包含内存消耗的基准测试。
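The throughput gap is easiest to see from inference-time memory: a Transformer must hold a KV cache that grows with sequence length, while a selective SSM only keeps a fixed-size recurrent state per layer. The sketch below uses the standard KV-cache accounting and a (expand · d_model) × N recurrent state; the concrete dimensions (48 layers, width 4096, expansion 2, fp16) are illustrative assumptions, not measured configurations.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch, bytes_per=2):
    # Keys and values: 2 tensors of shape (batch, seq_len, d_model) per layer.
    return 2 * n_layers * batch * seq_len * d_model * bytes_per

def ssm_state_bytes(n_layers, d_model, batch, expand=2, d_state=16, bytes_per=2):
    # One recurrent state of shape (batch, expand * d_model, d_state) per layer,
    # independent of sequence length (the small depthwise-conv buffer is ignored here).
    return n_layers * batch * expand * d_model * d_state * bytes_per

GiB = 1024 ** 3
# Hypothetical large model: 48 layers, d_model = 4096, fp16, batch 64, context 2048.
print(kv_cache_bytes(48, 4096, seq_len=2048, batch=64) / GiB)   # ~96 GiB, grows with context
print(ssm_state_bytes(48, 4096, batch=64) / GiB)                # ~0.75 GiB, constant in context
```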
Figure 8: (Efficiency Benchmarks.) (Left) Training: our efficient scan is $40\times$ faster than a standard implementation. (Right) Inference: as a recurrent model, Mamba can achieve $5\times$ higher throughput than Transformers.
图 8: (效率基准测试。)(左)训练:我们的高效扫描比标准实现快 40× 。 (右)推理:作为递归模型,Mamba 的吞吐量可以比 Transformer 高 5× 。
4.6 Model Ablations
4.6 模型消融实验 (Model Ablations)
We perform a series of detailed ablations on components of our model, focusing on the setting of language modeling with size $\approx350\mathrm{M}$ models at Chinchilla token counts (same setting as Figure 4).
我们对模型的各个组件进行了一系列详细的消融实验,重点关注在 Chinchilla Token 数量设置下的约 350M 规模的语言模型 (同图 4 设置)。
4.6.1 Architecture
4.6.1 架构
Table 6 investigates the effects of the architecture (block) and its inner SSM layer (Figure 3). We find that:
- Among previous non-selective (LTI) SSMs, which are equivalent to global convolutions, performance is very similar.
- Replacing the complex-valued S4 variant from previous work with a real-valued one does not affect performance much, suggesting that (at least for LM) real-valued SSMs may be a better choice when accounting for hardware efficiency.
- Replacing any of these with a selective SSM (S6) significantly improves performance, validating the motivation of Section 3.
表 6 研究了架构(块)及其内部 SSM 层(图 3)的影响。我们发现:
- 在之前的非选择性 (LTI) SSM 中(它们等同于全局卷积),性能非常相似。
- 用实数值变体替换之前工作中的复数值 S4 变体对性能影响不大,这表明(至少对于语言建模)在考虑硬件效率时,实数值 SSM 可能是更好的选择。
- 用选择性 SSM (S6) 替换其中任何一个都会显著提高性能,验证了第 3 节的动机。
Table 6: (Ablations: Architecture and SSM layer.) The Mamba block performs similarly to H3 while being simpler. In the inner layer, there is little difference among different parameterizations of LTI models, while selective SSMs (S6) provide a large improvement. More specifically, the S4 (real) variant is S4D-Real and the S4 (complex) variant is S4D-Lin.
表 6: (消融实验:架构和 SSM 层。)Mamba 块的表现与 H3 相似,但更简单。在内层,不同的 LTI 模型参数化之间的差异很小,而选择性的 SSM (S6) 提供了显著的改进。更具体地说,S4 (实数) 变体是 S4D-Real,S4 (复数) 变体是 S4D-Lin。
模型 | 架构 | SSM 层 | 困惑度 |
---|---|---|---|
Hyena | H3 | Hyena | 10.24 |
H3 | H3 | S4 (complex) | 10.30 |
- | H3 | S4 (real) | 10.34 |
- | H3 | S6 | 8.95 |
Table 7: (Ablations: Selective parameters.) $\Delta$ is the most important parameter (Theorem 1), but using multiple selective parameters together synergizes.
表 7: (消融实验:选择性参数。) $\Delta$ 是最重要的参数(定理 1),但使用多个选择性参数可以产生协同效应。
SELECTIVE $\Delta$ | SELECTIVE B | SELECTIVE C | PERPLEXITY |
---|---|---|---|
× | 10.93 | ||
× | 10.15 | ||
9.98 | |||
× | × | 9.81 | |
√ | √ | √ | 8.71 |
模型 | 架构 | SSM 层 | 困惑度 |
---|---|---|---|
- | Mamba | Hyena | 10.75 |
- | Mamba | S4 (complex) | 10.54 |
- | Mamba | S4 (real) | 10.56 |
Mamba | Mamba | S6 | 8.69 |
Table 8: (Ablations: Parameterization of $A$.) The more standard initializations based on S4D-Lin (Gu, Gupta, et al. 2022) perform worse than S4D-Real or a random initialization, when the SSM is selective.
表 8: (消融实验:$A$ 的参数化。)当 SSM 具有选择性时,基于 S4D-Lin (Gu, Gupta, et al. 2022) 的更标准的初始化方法表现不如 S4D-Real 或随机初始化。
$A_n$ 初始化 | 域 | 困惑度 |
---|---|---|
$A_n = -\frac{1}{2} + ni$ | 复数 | 9.16 |
$A_n = -1/2$ | 实数 | 8.85 |
$A_n = -(n + 1)$ | 实数 | 8.71 |
$A_n \sim \exp(\mathcal{N}(0, 1))$ | 实数 | 8.71 |
- The Mamba architecture performs similarly to the H3 architecture (and seems slightly better when using a selective layer).
- Mamba 架构的表现与 H3 架构相似(在使用选择性层时似乎略胜一筹)。
We also investigate interleaving the Mamba block with other blocks such as MLP (a traditional architecture) and MHA (a hybrid attention architecture) in Appendix E.2.2.
我们还在附录 E.2.2 中研究了将 Mamba 块与其他块交错使用,例如 MLP(传统架构)和 MHA(混合注意力架构)。
4.6.2 Selective SSM
4.6.2 选择性 SSM
Table 7 ablates the selective SSM layer by considering different combinations of the selective $\Delta$, $B$, and $C$ parameters (Algorithm 2), showing that $\Delta$ is the most important parameter due to its connection to RNN gating (Theorem 1).
表 7 通过考虑选择性参数 $\Delta$、$B$ 和 $C$ 的不同组合(算法 2)对选择性 SSM 层进行消融,结果显示 $\Delta$ 是最重要的参数,这是由于它与 RNN 门控的关联(定理 1)。
Table 8 considers different initializations of the SSM, which have been shown to make a large difference in some data modalities and settings (Gu, Goel, and Re 2022; Gu, Gupta, et al. 2022). On language modeling, we find that simpler real-valued diagonal initializations (S4D-Real, row 3) instead of more standard complex-valued parameterizations (S4D-Lin, row 1) perform better. Random initializations also work well, consistent with findings from prior work (Mehta et al. 2023).
表 8 考虑了 SSM 的不同初始化方式,这些方式在某些数据模态和设置中已被证明有很大影响 (Gu, Goel, and Re 2022; Gu, Gupta, et al. 2022)。在语言建模方面,我们发现更简单的实数值对角初始化 (S4D-Real,第 3 行) 比更标准的复数值参数化 (S4D-Lin,第 1 行) 表现更好。随机初始化也表现良好,这与之前工作的发现一致 (Mehta et al. 2023)。
Table 9 and Table 10 consider varying the dimension of the $\Delta$ and $(B,C)$ projections respectively. Changing them from static to selective provides the most benefit, while increasing the dimensions further generally improves performance modestly with a small increase in parameter count.
表 9 和表 10 分别考虑了改变 $\Delta$ 和 (B, C) 投影的维度。将它们从静态改为选择性提供了最大的好处,而进一步增加维度通常会在参数量略有增加的情况下适度提高性能。
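As a reminder of what these ablations vary: $B$ and $C$ each come from an $N$-dimensional projection of the input, while $\Delta$ comes from a low-rank projection (the projection size varied in Table 9) that is broadcast back to the model dimension and passed through a softplus. The module below is a minimal illustrative reimplementation of that parameterization following this description, not the released code; the concrete sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveProjections(nn.Module):
    """Input-dependent (selective) SSM parameters: Delta, B, C as functions of x."""
    def __init__(self, d_inner: int, d_state: int = 16, dt_rank: int = 64):
        super().__init__()
        self.x_proj = nn.Linear(d_inner, dt_rank + 2 * d_state, bias=False)
        self.dt_proj = nn.Linear(dt_rank, d_inner, bias=True)   # broadcast Delta back to D
        self.dt_rank, self.d_state = dt_rank, d_state

    def forward(self, x):                        # x: (batch, seqlen, d_inner)
        dt, B, C = self.x_proj(x).split([self.dt_rank, self.d_state, self.d_state], dim=-1)
        delta = F.softplus(self.dt_proj(dt))     # (batch, seqlen, d_inner), positive step sizes
        return delta, B, C                       # B, C: (batch, seqlen, d_state)

proj = SelectiveProjections(d_inner=1536, d_state=16, dt_rank=64)
delta, B, C = proj(torch.randn(2, 128, 1536))
print(delta.shape, B.shape, C.shape)
```

Increasing the state dimension $N$ mainly grows these small $D \times N$ projections and the recurrent state itself, which is consistent with the roughly 1% parameter cost reported below.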
Of particular note is the dramatic improvement of the selective SSM when the state size $N$ is increased, with over a 1.0 perplexity improvement for a cost of only 1% additional parameters. This validates our core motivation in Sections 3.1 and 3.3.
特别值得注意的是,当状态大小 $N$ 增加时,选择性 SSM 有显著改进,仅以额外 1% 的参数为代价就实现了超过 1.0 的困惑度改进。这验证了我们在第 3.1 和 3.3 节的核心动机。
5 Discussion
5 讨论
We discuss related work, limitations, and some future directions.
我们讨论了相关工作、局限性以及一些未来方向。
Related Work. Appendix A discusses how the selection mechanism relates to similar concepts. Appendix B has an extended related work of SSMs and other related models.
相关工作。附录 A 讨论了选择机制与类似概念的关系。附录 B 包含了 SSMs 和其他相关模型的扩展相关工作。
Table 9: (Ablations: Expressivity of $\Delta$.) The selection mechanism of $\Delta$ constructs it with a projection of the input. Projecting it even to dim. 1 provides a large increase in performance; increasing it further provides further improvements at the cost of a modest increase in parameters. State size fixed to $N=16$.
表 9: (消融实验:$\Delta$ 的表达能力。$\Delta$ 的选择机制通过输入的投影来构建它。即使将其投影到维度 1 也能显著提高性能;进一步增加维度可以带来更好的性能提升,但会以参数量适度增加为代价。状态大小固定为 $N=16$ 。)
Δ 投影维度 | 参数量 (M) | 困惑度 |
---|---|---|
- | 358.9 | 9.12 |
1 | 359.1 | 8.97 |
2 | 359.3 | 8.97 |
4 | 359.7 | 8.91 |
8 | 360.5 | 8.83 |
16 | 362.1 | 8.84 |
32 | 365.2 | 8.80 |
64 | 371.5 | 8.71 |
Table 10: (Ablations: SSM state dimension.) (Top) Constant $B$ and $C$. (Bottom) Selective $B$ and $C$. Increasing the SSM state dimension $N$, which can be viewed as an expansion factor on the dimension of the recurrent state, can significantly improve performance for a negligible cost in parameters/FLOPs, but only when $B$ and $C$ are also selective. Size of $\Delta$ projection fixed to 64.
表 10: (消融实验:SSM 状态维度。)(上)恒定的 $B$ 和 $C$。(下)选择性的 $B$ 和 $C$。增加 SSM 状态维度 $N$(可以视为递归状态维度的扩展因子),可以在参数/FLOPs 代价可忽略的情况下显著提高性能,但仅当 $B$ 和 $C$ 也是选择性的时候。$\Delta$ 投影的大小固定为 64。
状态维度 N | 参数 (M) | 困惑度 |
---|---|---|
1 | 367.1 | 9.88 |
2 | 367.4 | 9.86 |
4 | 368.0 | 9.82 |
8 | 369.1 | 9.82 |
16 | 371.5 | 9.81 |
1 | 367.1 | 9.73 |
2 | 367.4 | 9.40 |
4 | 368.0 | 9.09 |
8 | 369.1 | 8.84 |
16 | 371.5 | 8.71 |
No Free Lunch: Continuous-Discrete Spectrum. Structured SSMs were originally defined as discretizations of continuous systems (1), and have had a strong inductive bias toward continuous-time data modalities such as perceptual signals (e.g. audio, video). As discussed in Sections 3.1 and 3.5, the selection mechanism overcomes their weaknesses on discrete modalities such as text and DNA; but this conversely can impede their performance on data that LTI SSMs excel on. Our ablations on audio waveforms examine this tradeoff in more detail.
没有免费的午餐:连续-离散谱。结构化的 SSM 最初被定义为连续系统的离散化 (1),并且对连续时间数据模态(例如感知信号,如音频、视频)具有强烈的归纳偏置。正如在第 3.1 节和第 3.5 节中所讨论的,选择机制克服了它们在离散模态(如文本和 DNA)上的弱点;但反过来这也可能阻碍它们在 LTI SSM 擅长的数据上的表现。我们对音频波形的消融实验更详细地研究了这种权衡。
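For reference, the continuous-to-discrete step alluded to here is typically a zero-order-hold discretization, $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$. A sketch for a diagonal $A$ is below; it is illustrative only, not the fused kernel.

```python
import torch

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization for a diagonal SSM.
    A: (d, n) diagonal continuous state matrix (negative entries),
    B: (batch, seqlen, n), delta: (batch, seqlen, d) positive step sizes.
    Returns A_bar, B_bar of shape (batch, seqlen, d, n)."""
    dA = delta.unsqueeze(-1) * A                  # Delta * A, broadcast to (b, l, d, n)
    A_bar = torch.exp(dA)                         # exp(Delta A)
    B_bar = (A_bar - 1.0) / A * B.unsqueeze(2)    # (Delta A)^{-1} (exp(Delta A) - I) Delta B, elementwise
    return A_bar, B_bar

A = -torch.rand(4, 16) - 0.5                      # negative diagonal entries, shape (d, n)
delta = 0.1 * torch.rand(2, 8, 4)                 # (batch, seqlen, d)
B = torch.randn(2, 8, 16)                         # (batch, seqlen, n)
A_bar, B_bar = discretize_zoh(A, B, delta)
print(A_bar.shape, B_bar.shape)
```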
Downstream Affordances. Transformer-based foundation models (particularly LLMs) have a rich ecosystem of properties and modes of interaction with pretrained models, such as fine-tuning, adaptation, prompting, in-context learning, instruction tuning, RLHF, quantization, and so on. We are particularly interested in whether Transformer alternatives such as SSMs have similar properties and affordances.
下游可供性。基于 Transformer 的基础模型(特别是大语言模型)拥有丰富的与预训练模型交互的属性和模式生态系统,例如微调、适应、提示、上下文学习、指令微调、RLHF、量化等。我们特别感兴趣的是,像 SSM 这样的 Transformer 替代方案是否具有类似的属性和可供性。
Scaling. Our empirical evaluation is limited to small model sizes, below the threshold of most strong open source LLMs (e.g. Llama (Touvron et al. 2023)) as well as other recurrent models such as RWKV (B. Peng et al. 2023) and RetNet (Y. Sun et al. 2023), which have been evaluated at the 7B parameter scale and beyond. It remains to assess whether Mamba still compares favorably at these larger sizes. We also note that scaling SSMs may involve further engineering challenges and adjustments to the model that are not discussed in this paper.
扩展。我们的实证评估仅限于小模型规模,低于大多数强大的开源大语言模型 (LLM) 的阈值(例如 Llama (Touvron et al. 2023)),以及其他递归模型,如 RWKV (B. Peng et al. 2023) 和 RetNet (Y. Sun et al. 2023),这些模型已经在 7B 参数规模及以上进行了评估。仍需评估 Mamba 在这些更大规模下是否仍然具有优势。我们还注意到,扩展 SSM 可能涉及进一步的工程挑战和对模型的调整,而这些在本文中未进行讨论。
6 Conclusion
6 结论
We introduce a selection mechanism to structured state space models, allowing them to perform context-dependent reasoning while scaling linearly in sequence length. When incorporated into a simple attention-free architecture, Mamba achieves state-of-the-art results on a diverse set of domains, where it matches or exceeds the performance of strong Transformer models. We are excited about the broad applications of selective state space models to build foundation models for different domains, especially in emerging modalities requiring long context such as genomics, audio, and video. Our results suggest that Mamba is a strong candidate to be a general sequence model backbone.
我们引入了一种选择机制到结构化状态空间模型中,使它们能够在上下文依赖的情况下进行推理,并且在序列长度上呈线性扩展。当集成到一个简单的无注意力架构中时,Mamba 在多个不同领域中取得了最先进的结果,在这些领域中,它的性能与强大的 Transformer 模型相匹配或超越。我们对选择性状态空间模型的广泛应用感到兴奋,可以为不同领域构建基础模型,特别是在需要长上下文的新兴模态如基因组学、音频和视频中。我们的结果表明,Mamba 是成为通用序列模型骨干的有力候选者。
Acknowledgments
致谢
We thank Karan Goel, Arjun Desai, and Kush Bhatia for helpful feedback on the draft.
我们感谢 Karan Goel、Arjun Desai 和 Kush Bhatia 对草稿提供的宝贵反馈。
References
参考文献
[1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. “Unitary Evolution Recurrent Neural Networks". In: The International Conference on Machine Learning (ICML). 2016, pp. 1120-1128.
[1] Martin Arjovsky, Amar Shah, 和 Yoshua Bengio. “单元演化循环神经网络 (Unitary Evolution Recurrent Neural Networks)”. In: 国际机器学习大会 (ICML). 2016, pp. 1120-1128.
A Discussion: Selection Mechanism
A 讨论:选择机制 (Discussion: Selection Mechanism)
Our selection mechanism is inspired by and related to concepts such as gating, hypernetworks, and data-dependence. It can also be viewed as related to "fast weights" (J. Ba et al. 2016; Schmidhuber 1992), which connects classical RNNs with the mechanism of linear attention (Schlag, Irie, and Schmidhuber 2021). However, we believe that it is a distinct concept that is worth clarifying.
我们的选择机制受到门控 (gating)、超网络 (hypernetworks) 和数据依赖性 (data-dependence) 等概念的启发并与之相关。它也可以被视为与 "fast weights" (J. Ba et al. 2016; Schmidhuber 1992) 相关,后者将经典 RNN 与线性注意力机制 (Schlag, Irie, and Schmidhuber 2021) 联系起来。然而,我们认为这是一个值得单独澄清的独立概念。
Gating. Gating originally referred to the gating mechanisms of RNNs such as the LSTM (Hochreiter and Schmidhuber 1997) and GRU (J. Chung et al. 2014), or the gated equation (5) in Theorem 1. This was interpreted as a particular mechanism for controlling whether to let an input into the hidden state of an RNN. In particular, this affects the propagation of signal through time and causes inputs to interact along the sequence length dimension.
门控。门控最初指的是 RNN 的门控机制,例如 LSTM (Hochreiter and Schmidhuber 1997) 和 GRU (J. Chung 等 2014),或者是定理 1 中的门控公式 (5)。这被解释为一种特定的机制,用于控制是否让输入进入 RNN 的隐藏状态。特别是,这会影响信号在时间上的传播,并导致输入沿序列长度维度相互作用。
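In this original sense, the gate takes the familiar form $g_t = \sigma(\mathrm{Linear}(x_t))$, $h_t = (1 - g_t)\,h_{t-1} + g_t\,x_t$, which is also the form recovered by Theorem 1. A minimal scalar sketch (batching and dimensions omitted; illustrative only):

```python
import torch

def gated_recurrence(x, w, b):
    """h_t = (1 - g_t) * h_{t-1} + g_t * x_t with g_t = sigmoid(w * x_t + b).
    Scalar sketch of the classical RNN gate; x: (seqlen,)."""
    h = torch.zeros(())
    hs = []
    for x_t in x:
        g_t = torch.sigmoid(w * x_t + b)      # how much of the new input to let in
        h = (1 - g_t) * h + g_t * x_t         # interaction along the sequence dimension
        hs.append(h)
    return torch.stack(hs)

out = gated_recurrence(torch.randn(8), w=torch.tensor(1.5), b=torch.tensor(0.0))
print(out.shape)   # torch.Size([8])
```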
However, the concept of gating has since been relaxed in popular usage to simply mean any multiplicative interaction (often with an activation function). For example, elementwise multiplicative components of neural network architectures (that do not interact along sequence length) are now commonly referred to as gated architectures (Hua et al. 2022; Mehta et al. 2023), despite a very different meaning than the original RNN sense. Thus we believe the original concept of RNN gating versus the popular usage of multiplicative gating actually have very different semantic meanings.
然而, gating 概念在流行用法中已经被放宽,简单地指任何乘法交互(通常带有激活函数)。例如,神经网络架构中的元素级乘法组件(不沿序列长度交互)现在通常被称为 gated 架构 (Hua et al. 2022; Mehta et al. 2023),尽管其含义与原始的 RNN 意义非常不同。因此我们认为,RNN 中的 gating 概念与流行的乘法 gating 用法实际上具有非常不同的语义意义。
Hypernetworks. Hypernetworks refer to neural networks whose parameters are themselves generated by smaller neural networks. The original idea (Ha, Dai, and Quoc V. Le 2017) used it in a narrow sense to define a large RNN whose recurrent parameters are generated by a smaller RNN, and other variants have been around for a long time (Schmidhuber 1992).
超网络。超网络指参数本身由较小神经网络生成的神经网络。最初的想法 (Ha, Dai, and Quoc V. Le 2017) 在狭义上用它来定义一个大型 RNN,其递归参数由一个较小的 RNN 生成,其他变体也已存在很长时间 (Schmidhuber 1992)。
Data-dependence. Similar to hypernetworks, data-dependence can refer to any notion where some parameters of the model depend on the data (Poli et al. 2023).
数据依赖性。类似于超网络,数据依赖性可以指任何模型的某些参数依赖于数据的概念 (Poli et al. 2023)。
Example: GLU Activation. To illustrate the issues with these concepts, consider a simple diagonal linear layer $y = Dx$, where $D$ is a diagonal weight parameter. Now suppose that $D$ is itself generated from a linear transformation of $x$, with an optional nonlinearity: $D = \sigma(Wx)$. Since it is diagonal, the multiplication becomes an elementwise product: $y = \sigma(Wx) \circ x$.
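A minimal sketch of this observation (GLU-style gating; the module name is ours):

```python
import torch
import torch.nn as nn

class DiagonalGLU(nn.Module):
    """y = sigmoid(W x) * x: a 'data-dependent diagonal layer' that reduces to GLU-style gating."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        # Elementwise product: no interaction across sequence positions.
        return torch.sigmoid(self.W(x)) * x

y = DiagonalGLU(16)(torch.randn(4, 16))
print(y.shape)   # torch.Size([4, 16])
```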