[论文翻译]Samba-ASR: 基于结构化状态空间模型 (Structured State-Space Models) 的尖端语音识别技术


原文地址:https://arxiv.org/pdf/2501.02832v3


Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models

Samba-ASR: 基于结构化状态空间模型 (Structured State-Space Models) 的尖端语音识别技术

ABSTRACT

摘要

We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state-space models (SSMs). Unlike transformer-based ASR models, which rely on self-attention mechanisms to capture dependencies, Samba ASR effectively models both local and global temporal dependencies using efficient state-space dynamics, achieving remarkable performance gains. By addressing the limitations of transformers, such as quadratic scaling with input length and difficulty in handling long-range dependencies, Samba ASR achieves superior accuracy and efficiency.

我们提出Samba ASR,这是首个基于新颖Mamba架构作为编码器和解码器的最先进自动语音识别(ASR)模型,建立在状态空间模型(SSMs)的基础上。与依赖自注意力机制捕捉依赖关系的基于Transformer的ASR模型不同,Samba ASR利用高效的状态空间动力学有效建模局部和全局时间依赖关系,实现了显著的性能提升。通过解决Transformer的局限性,例如输入长度的二次方缩放和处理长距离依赖的困难,Samba ASR实现了卓越的准确性和效率。

Experimental results demonstrate that Samba ASR surpasses existing open-source transformer-based ASR models across various standard benchmarks, establishing it as the new state-of-the-art in ASR. Extensive evaluations on the benchmark dataset show significant improvements in Word Error Rate (WER), with competitive performance even in low-resource scenarios. Furthermore, the inherent computational efficiency and parameter optimization of the Mamba architecture make Samba ASR a scalable and robust solution for diverse ASR tasks.

实验结果表明,Samba ASR 在各种标准基准测试中超越了现有的基于 Transformer 的开源 ASR (Automatic Speech Recognition) 模型,成为 ASR 领域的新技术标杆。在基准数据集上的广泛评估显示,该模型在词错误率 (Word Error Rate, WER) 方面有显著提升,即使在低资源场景下仍保持竞争力。此外,Mamba 架构固有的计算效率和参数优化使 Samba ASR 成为适用于多样化 ASR 任务的可扩展且鲁棒的解决方案。

Our contributions include the development of a new Samba ASR architecture for automatic speech recognition (ASR), demonstrating the superiority of structured state-space models (SSMs) over transformer-based models for speech sequence processing. We provide a comprehensive evaluation on public benchmarks, showcasing state-of-the-art (SOTA) performance, and present an in-depth analysis of computational efficiency, robustness to noise, and sequence generalization. This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging the advancements of state-space modeling, Samba ASR redefines ASR performance standards and sets a new benchmark for future research in this field.

我们的贡献包括开发了一种用于自动语音识别 (ASR) 的新型 Samba ASR 架构,证明了结构化状态空间模型 (SSMs) 在语音序列处理上优于基于 Transformer 的模型。我们在公开基准测试上进行了全面评估,展示了最先进 (SOTA) 的性能,并对计算效率、噪声鲁棒性和序列泛化能力进行了深入分析。这项工作凸显了 Mamba SSMs 作为无需 Transformer 的高效精准 ASR 替代方案的可行性。通过利用状态空间建模的进步,Samba ASR 重新定义了 ASR 性能标准,并为该领域未来研究设立了新基准。

Keywords Mamba $\cdot$ Structured State Space $\cdot$ Automatic Speech Recognition (ASR) $\cdot$ Mamba Blocks $\cdot$ Speech Processing

关键词 Mamba $\cdot$ 结构化状态空间 $\cdot$ 自动语音识别(ASR) $\cdot$ Mamba模块 $\cdot$ 语音处理

1 Introduction

1 引言

The rapid evolution of deep learning has significantly transformed Automatic Speech Recognition (ASR), shifting from traditional systems such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to advanced end-to-end neural architectures. While innovations such as Connectionist Temporal Classification (CTC) and attention-based encoder-decoder models have established new baselines [1], transformer-based models like OpenAI’s Whisper have further pushed the boundaries, setting state-of-the-art benchmarks for multilingual, multitask ASR systems [2].

深度学习的快速发展极大地改变了自动语音识别(ASR)技术,推动该领域从传统的隐马尔可夫模型(HMM)和高斯混合模型(GMM)系统转向先进的端到端神经架构。虽然连接时序分类(CTC)和基于注意力的编码器-解码器模型等创新已确立新的基准[1],但像OpenAI的Whisper这类基于Transformer的模型进一步突破了边界,为多语言、多任务ASR系统树立了最先进的性能标杆[2]。

Despite their successes, transformer architectures face inherent challenges in scaling to long sequences, particularly those encountered in extended audio recordings. Transformers exhibit quadratic complexity with respect to sequence length, leading to high computational costs and memory usage for tasks requiring long-context modeling [3],[4]. These limitations present a significant obstacle to achieving scalable and efficient ASR systems, especially in resource-constrained environments or for real-time applications.

尽管Transformer架构取得了成功,但在处理长序列(尤其是长时间音频录音中的序列)时仍面临固有挑战。Transformer的复杂度随序列长度呈二次方增长,导致长上下文建模任务需要高昂的计算成本和内存开销 [3],[4]。这些限制对构建可扩展且高效的自动语音识别(ASR)系统构成重大障碍,特别是在资源受限环境或实时应用中。

Structured State-Space Models (SSMs) [5] have emerged as a compelling alternative, offering efficient sequence modeling with linear complexity. The Mamba architecture [6], an innovation within this domain, extends SSM capabilities by introducing selective recurrence and hardware-aware optimizations. These advancements address the limitations of traditional linear time-invariant (LTI) dynamics, enabling Mamba to deliver exceptional efficiency and scalability. By leveraging selective state-space dynamics, Mamba achieves efficient long-range dependency modeling, making it particularly well-suited for ASR tasks.

结构化状态空间模型 (SSMs) [5] 已成为一种引人注目的替代方案,以线性复杂度提供高效的序列建模能力。Mamba架构 [6] 作为该领域的创新成果,通过引入选择性循环 (selective recurrence) 和硬件感知优化技术,进一步扩展了SSM的能力。这些突破解决了传统线性时不变 (LTI) 动态系统的局限性,使Mamba能够实现卓越的效率和可扩展性。通过利用选择性状态空间动态机制,Mamba实现了高效的长程依赖建模,使其特别适合自动语音识别 (ASR) 任务。

Mamba’s architecture introduces input-dependent parameters into the state-space equations, allowing for dynamic adaptation to sequence content. This capability compresses context into a smaller state representation while effectively capturing both local and global dependencies [6]. Furthermore, Mamba employs hardware-aware techniques such as kernel fusion and parallel scan, optimizing computational efficiency and minimizing memory overhead during both training and inference. These features establish Mamba as a robust solution for sequence modeling across diverse modalities.

Mamba的架构在状态空间方程中引入了输入相关参数,使其能够动态适应序列内容。这一特性将上下文压缩为更小的状态表示,同时有效捕捉局部和全局依赖关系 [6]。此外,Mamba采用了硬件感知技术,如内核融合和并行扫描,优化了计算效率,并在训练和推理过程中最大限度地减少了内存开销。这些特点使Mamba成为跨多种模态进行序列建模的强大解决方案。

While Mamba has demonstrated success in a range of applications, including language and vision tasks, its direct application to speech-to-text systems remained unexplored prior to this work. The development of Samba-ASR represents a significant breakthrough, showcasing the potential of Mamba-based architectures in ASR. By replacing traditional transformer encoders with Mamba’s efficient state-space modeling, Samba-ASR achieves state-of-the-art performance across major ASR benchmarks, including Gigaspeech [7] and SPGISpeech [8]. The model reduces inference latency and training time while maintaining high accuracy, even under challenging conditions such as noisy or spontaneous speech.

虽然Mamba在语言和视觉任务等一系列应用中已取得成功,但在此项工作之前,其直接应用于语音转文字系统仍未被探索。Samba-ASR的开发标志着重大突破,展现了基于Mamba架构在自动语音识别(ASR)中的潜力。通过用Mamba高效的状态空间建模替代传统Transformer编码器,Samba-ASR在Gigaspeech [7]和SPGISpeech [8]等主流ASR基准测试中实现了最先进的性能。该模型在保持高精度的同时降低了推理延迟和训练时间,即使在嘈杂或自发语音等挑战性条件下也是如此。

The following sections delve into the technical details of State Space Models, the Mamba architecture, and its advancements in both language and vision tasks, setting the stage for our motivation and contributions to efficient ASR systems using Mamba.

以下部分将深入探讨状态空间模型 (State Space Models) 、Mamba架构的技术细节,及其在语言和视觉任务中的进展,为使用Mamba构建高效自动语音识别 (ASR) 系统的研究动机与贡献奠定基础。

1.1 Background

1.1 背景

1.1.1 State Space Models (SSMs)

1.1.1 状态空间模型 (State Space Models, SSMs)

State Space Models (SSMs) [5] provide a robust framework for sequence modeling by representing dynamical systems through a latent state that evolves over time. These models describe how inputs affect system states and how states generate output, using the following equations:

状态空间模型 (SSMs) [5] 通过随时间演变的潜在状态为序列建模提供了一个稳健框架。这些模型使用以下方程描述输入如何影响系统状态以及状态如何生成输出:

$$
h_{t+1}=A(h_{t})+B(x_{t}),y_{t}=C(h_{t})
$$

$$
h_{t+1}=A(h_{t})+B(x_{t}),y_{t}=C(h_{t})
$$

where $h_{t}$ is the latent state at time $t$ , $x_{t}$ is the input, $y_{t}$ is the output, and $A,B,C$ are parameter matrices. This formulation allows SSMs to efficiently model sequential data by transitioning between latent states and producing outputs influenced by both current and historical inputs.

其中 $h_{t}$ 是时间 $t$ 的隐状态,$x_{t}$ 是输入,$y_{t}$ 是输出,$A,B,C$ 为参数矩阵。该公式化方法通过隐状态间的转移以及受当前和历史输入共同影响的输出,使 SSM 能高效建模序列数据。

Traditionally, SSMs are linear time-invariant (LTI), where $A$, $B$, $C$ remain constant over time. Although LTI dynamics provide computational efficiency and stability, they limit the model’s ability to adapt to input-dependent variations. Consequently, classical SSMs often struggle with complex, context-sensitive tasks, especially in discrete and content-rich modalities such as language.

传统上,SSM(状态空间模型)是线性时不变 (LTI) 的,其中 $A,B,$ $C$ 随时间保持恒定。尽管LTI动力学提供了计算效率和稳定性,但它们限制了模型适应输入相关变化的能力。因此,经典SSM通常难以处理复杂、上下文敏感的任务,尤其是在语言等离散且内容丰富的模态中。

The matrices $A$, $B$, and $C$ are learned parameters with the following interpretations:

矩阵 $A$、$B$ 和 $C$ 是学习得到的参数,其含义如下。

• $A$: Determines how much the previous hidden state $h_{t}$ should be considered to calculate the new hidden state $h_{t+1}$.
• $B$: Determines how much the input $x_{t}$ should be considered to calculate the new hidden state $h_{t+1}$.
• $C$: Determines how much the hidden state $h_{t}$ should be considered in calculating the output $y_{t}$.

  • $A$: 决定在计算新隐藏状态 $h_{t+1}$ 时应考虑多少前一隐藏状态 $h_{t}$。
  • $B$: 决定在计算新隐藏状态 $h_{t+1}$ 时应考虑多少输入 $x_{t}$。
  • $C$: 决定在计算输出 $y_{t}$ 时应考虑多少隐藏状态 $h_{t}$。
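To make the recurrence above concrete, here is a minimal sketch (our own illustration, not code from the paper) that steps a discrete linear SSM with fixed matrices $A$, $B$, $C$ over an input sequence:

```python
import numpy as np

def linear_ssm(A, B, C, xs):
    """Run the discrete SSM h_{t+1} = A h_t + B x_t, y_t = C h_t over a sequence.

    A: (d, d) state-transition matrix, B: (d, m) input matrix, C: (p, d) output matrix.
    xs: (T, m) input sequence. Returns the (T, p) output sequence.
    """
    h = np.zeros(A.shape[0])        # initial latent state
    ys = []
    for x_t in xs:                  # strictly sequential (recurrent) evaluation
        ys.append(C @ h)            # y_t reads out the current state
        h = A @ h + B @ x_t         # the input pushes the state forward
    return np.stack(ys)

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
d, m, p, T = 4, 2, 3, 8
A = 0.9 * np.eye(d)                 # a stable transition, for illustration
B = rng.normal(size=(d, m))
C = rng.normal(size=(p, d))
print(linear_ssm(A, B, C, rng.normal(size=(T, m))).shape)  # (8, 3)
```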

1.1.2 Mamba: Linear-Time Sequence Modeling with Selective State Spaces

1.1.2 Mamba: 基于选择性状态空间的线性时间序列建模

Mamba[6] extends traditional SSMs with a selectivity mechanism, addressing the limitations of LTI dynamics while preserving computational efficiency. Mamba’s formulation introduces input-dependent parameters into the state-space equations:

Mamba[6] 通过引入选择性机制扩展了传统 SSM (State-Space Model) ,在保持计算效率的同时解决了 LTI (Linear Time-Invariant) 动态系统的局限性。其核心创新在于将输入相关参数引入状态空间方程:

$$
h_{t+1}=A(h_{t})+B(x_{t}),\quad y_{t}=C(x_{t})\,h_{t}
$$

$$
h_{t+1}=A(h_{t})+B(x_{t}),\quad y_{t}=C(x_{t})\,h_{t}
$$

where $B(x_{t})$ and $C(x_{t})$ are learned functions of the input $x_{t}$, allowing selective propagation of relevant information and enabling dynamic adaptation to sequence content, while $A$ remains a structured state transition matrix. This selective mechanism allows Mamba to efficiently compress context into a smaller state while maintaining the ability to capture long-range dependencies.

其中 $B(x_{t})$ 和 $C(x_{t})$ 是输入 $x_{t}$ 的学习函数,能够选择性传播相关信息并动态适应序列内容,而 $A$ 仍保持为结构化状态转移矩阵。这种选择性机制使 Mamba 能够高效地将上下文压缩至更小状态,同时保持捕获长程依赖关系的能力。
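The selectivity idea can be sketched as follows, under the simplifying assumption that $B(\cdot)$ and $C(\cdot)$ are plain linear projections of the current input; the actual Mamba layer additionally discretizes a continuous-time system with an input-dependent step size and operates channel-wise, so this is illustrative only:

```python
import numpy as np

def selective_ssm(A, W_B, W_C, xs, p):
    """Selective SSM sketch: h_{t+1} = A h_t + B(x_t), y_t = C(x_t) h_t.

    B(.) and C(.) are learned linear functions of the current input, so the
    model can decide, per step, what to write into and read out of the state.
    """
    d = A.shape[0]
    h = np.zeros(d)
    ys = []
    for x_t in xs:
        C_t = (W_C @ x_t).reshape(p, d)   # input-dependent readout C(x_t)
        ys.append(C_t @ h)                # emit only what this input asks for
        h = A @ h + W_B @ x_t             # fixed transition A, selective injection B(x_t)
    return np.stack(ys)

rng = np.random.default_rng(1)
d, m, p, T = 4, 2, 3, 6
A = 0.95 * np.eye(d)
W_B = rng.normal(size=(d, m))             # parameters of B(.)
W_C = rng.normal(size=(p * d, m))         # parameters of C(.)
print(selective_ssm(A, W_B, W_C, rng.normal(size=(T, m)), p).shape)  # (6, 3)
```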

To efficiently handle the introduced time-varying parameters, Mamba employs a hardware-aware implementation using techniques like kernel fusion, parallel scan, and recomputation. This minimizes memory overhead by leveraging GPU memory hierarchies, where state updates are computed in fast, low-level memory (e.g., SRAM) and final outputs are written to high-bandwidth memory (HBM). By avoiding the materialization of large latent states during training, Mamba achieves linear computational complexity while ensuring flexibility for diverse tasks. Furthermore, a recomputation strategy reduces memory requirements during backpropagation by recalculating intermediate states only when needed.

为高效处理引入的时变参数,Mamba采用硬件感知实现方案,运用核融合(kernel fusion)、并行扫描(parallel scan)和重计算(recomputation)等技术。该方案通过利用GPU内存层级结构(状态更新在快速低层内存如SRAM中计算,最终输出写入高带宽内存HBM)来最小化内存开销。通过避免训练期间大型潜在状态的具体化,Mamba在保持任务多样性的同时实现了线性计算复杂度。此外,重计算策略通过在反向传播时按需重新计算中间状态,进一步降低了内存需求。
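The parallel-scan idea rests on the fact that a linear recurrence of the form $h_t = a_t h_{t-1} + b_t$ composes associatively, so it can be evaluated as a prefix scan in logarithmic depth rather than a strictly sequential loop. The sketch below illustrates the associative combine operator with scalar coefficients; it is written sequentially for clarity and is not the fused GPU kernel described above:

```python
import numpy as np

def combine(left, right):
    """Associative operator for the recurrence h = a * h_prev + b.

    Each element is a pair (a, b) representing the affine map h -> a*h + b;
    composing two such maps stays in the same form, which enables a scan.
    """
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan_recurrence(a, b):
    """Compute h_t = a_t * h_{t-1} + b_t (with h_0 = 0) as an inclusive scan.

    Because `combine` is associative, this loop can be reorganized into a
    logarithmic-depth tree of combines on parallel hardware.
    """
    acc = (1.0, 0.0)                 # identity affine map
    hs = []
    for a_t, b_t in zip(a, b):
        acc = combine(acc, (a_t, b_t))
        hs.append(acc[1])            # the offset of the composed map is h_t
    return np.array(hs)

a = np.array([0.5, 0.9, 0.8, 1.0])
b = np.array([1.0, 0.0, 2.0, -1.0])
print(scan_recurrence(a, b))         # matches the naive sequential recurrence
```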

The Mamba architecture simplifies traditional SSM designs by combining sequence transformation and gating mechanisms into a single homogeneous block. This block replaces multi-head attention (MHA) and MLP components with a streamlined structure inspired by gated mechanisms in RNNs, such as:

Mamba架构通过将序列变换和门控机制结合到一个同质块中,简化了传统SSM设计。该块受RNN中门控机制启发,用精简结构取代了多头注意力(MHA)和MLP组件,例如:

$$
g_{t}=\sigma(\mathrm{Linear}(x_{t})),\quad h_{t}=(1-g_{t})h_{t-1}+g_{t}x_{t}
$$

$$
g_{t}=\sigma(\mathrm{Linear}(x_{t})),\quad h_{t}=(1-g_{t})h_{t-1}+g_{t}x_{t}
$$

where $g_{t}$ represents the selection gate. By iteratively stacking these blocks with normalization (e.g., LayerNorm) and activation functions (e.g., SiLU), Mamba achieves high expressiveness while maintaining simplicity. Its design balances performance and efficiency, making it particularly effective for tasks such as Automatic Speech Recognition (ASR), language modeling, and reinforcement learning, where long-context dependencies and low latency are essential.

其中 $g_{t}$ 表示选择门。通过迭代堆叠这些带有归一化 (如 LayerNorm) 和激活函数 (如 SiLU) 的模块,Mamba 在保持简洁性的同时实现了高表现力。其设计平衡了性能与效率,尤其适用于自动语音识别 (ASR)、语言建模和强化学习等需要长上下文依赖和低延迟的任务。
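A rough PyTorch rendering of this gating pattern is shown below; the layer sizes, the residual wiring, and the use of SiLU and LayerNorm are illustrative assumptions rather than the exact Mamba block:

```python
import torch
import torch.nn as nn

class GatedStateBlock(nn.Module):
    """Minimal gated recurrence: g_t = sigmoid(W x_t), h_t = (1 - g_t) h_{t-1} + g_t x_t,
    wrapped with LayerNorm, SiLU, and a residual path the way such blocks are stacked."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        z = self.norm(x)
        g = torch.sigmoid(self.gate(z))            # selection gate g_t in (0, 1)
        h = torch.zeros_like(z[:, 0])
        hs = []
        for t in range(z.size(1)):                 # recurrent form, for clarity only
            h = (1 - g[:, t]) * h + g[:, t] * z[:, t]
            hs.append(h)
        y = torch.stack(hs, dim=1)
        return x + torch.nn.functional.silu(self.out(y))   # residual connection

print(GatedStateBlock(dim=16)(torch.randn(2, 10, 16)).shape)  # torch.Size([2, 10, 16])
```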

1.1.3 Advancements in Large Language and Vision Models Utilizing Mamba

1.1.3 基于Mamba的大语言模型与视觉模型的进展

The Mamba architecture has inspired significant advancements in both language and vision modeling through its innovative state-space mechanism, leading to hybrid and pure Mamba-based models.

Mamba架构通过其创新的状态空间机制,在语言和视觉建模领域均推动了重大进展,催生了基于Mamba的混合模型与纯模型。

Jamba[9] introduces a novel hybrid architecture combining Transformer and Mamba layers, interleaved with mixture-of-experts (MoE) modules. This hybrid design addresses limitations of pure Transformer models in handling long contexts and computational efficiency. The resulting model, Jamba, achieves performance comparable to Mixtral-8x7B while supporting an unprecedented context length of 256,000 tokens—the longest among production-grade models. Jamba’s efficiency is remarkable, delivering three times the throughput of Mixtral 8x7B for long contexts and operating within a single 80GB GPU. This demonstrates the potential of integrating Transformer’s attention mechanisms with Mamba’s efficient state-space dynamics for enhanced performance and resource utilization.

Jamba[9] 提出了一种结合 Transformer 和 Mamba 层的新型混合架构,并穿插了混合专家 (MoE) 模块。这种混合设计解决了纯 Transformer 模型在处理长上下文和计算效率方面的局限性。最终模型 Jamba 实现了与 Mixtral-8x7B 相当的性能,同时支持前所未有的 256,000 token 上下文长度——这是生产级模型中最长的。Jamba 的效率非常出色,对于长上下文场景,其吞吐量是 Mixtral 8x7B 的三倍,并且可以在单个 80GB GPU 上运行。这展示了将 Transformer 的注意力机制与 Mamba 高效状态空间动态相结合以提升性能和资源利用率的潜力。

Falcon Mamba[3], on the other hand, showcases the capabilities of a pure Mamba-based language model. This 7B parameter model trained on 5.8 trillion tokens challenges the notion that attention mechanisms are necessary for competitive performance. Surpassing open-weight Transformer-based models like Mistral 7B and Falcon2 11B, Falcon Mamba demonstrates that efficient inference and constant memory costs are achievable across context lengths. By addressing training stability issues with strategic initializations and RMSNorm placements, Falcon Mamba establishes itself as a competitive and efficient alternative to hybrid architectures.

另一方面,Falcon Mamba[3]展示了纯Mamba架构大语言模型的实力。这个拥有70亿参数、基于5.8万亿token训练的模型,颠覆了"注意力机制是竞争性能必要条件"的认知。它在性能上超越了Mistral 7B和Falcon2 11B等基于Transformer的开源模型,证明了在不同上下文长度下都能实现高效推理与恒定内存消耗。通过策略性初始化和RMSNorm层布局解决了训练稳定性问题,Falcon Mamba成为混合架构的高效替代方案。

Zamba[4] represents another leap in Mamba-based innovation by combining a Mamba backbone with a unique shared attention module. This 7B parameter model achieves competitive performance against leading transformer-based models while maintaining SSM efficiency. With faster inference speeds and reduced memory requirements, Zamba stands out as a resource-efficient model, particularly for generating long sequences. Although slightly behind in reasoning and in-context learning tasks due to limited training data, Zamba demonstrates the viability of hybrid SSM-attention designs for large-scale modeling.

Zamba[4] 通过将 Mamba 主干网络与独特的共享注意力模块相结合,代表了基于 Mamba 架构的又一次创新飞跃。这款 70 亿参数模型在保持 SSM (State Space Model) 效率优势的同时,实现了与主流 Transformer 模型相媲美的性能表现。凭借更快的推理速度和更低的内存需求,Zamba 成为资源高效的典范,尤其擅长生成长序列内容。尽管受限于训练数据量,在推理和上下文学习任务中表现稍逊,但该模型成功验证了混合 SSM-注意力架构在大规模建模中的可行性。

In vision tasks, Vision Mamba (Vim)[10] adapts Mamba for visual representation learning, demonstrating that self-attention mechanisms are not essential for effective vision modeling. Vim introduces bidirectional Mamba blocks to address positional awareness and global context challenges in vision tasks. The model delivers superior performance on benchmarks like ImageNet and COCO, achieving $2.8\times$ faster inference speeds on high-resolution images compared to transformer-based models such as DeiT[11], while reducing GPU memory usage by $86.8\%$. Vim’s subquadratic computation and linear memory complexity make it a highly efficient solution for high-resolution visual tasks.

在视觉任务中,Vision Mamba (Vim)[10] 将 Mamba 适配于视觉表征学习,证明自注意力机制并非高效视觉建模的必要组件。Vim 通过引入双向 Mamba 模块解决了视觉任务中的位置感知与全局上下文挑战。该模型在 ImageNet 和 COCO 等基准测试中表现优异,与基于 Transformer 的 DeiT[11] 等模型相比,在高分辨率图像上推理速度提升 $2.8\times$,同时 GPU 内存占用降低 $86.8\%$。Vim 的次二次计算复杂度和线性内存复杂度,使其成为高分辨率视觉任务的高效解决方案。

These advancements illustrate the adaptability and efficacy of Mamba-based architectures in overcoming challenges across modalities, setting a new standard for resource-efficient and high-performing models in language and vision tasks.

这些进展展示了基于Mamba架构的模型在多模态挑战中展现的适应性和高效性,为语言与视觉任务中资源高效且性能卓越的模型设立了新标准。

1.2 Motivation

1.2 动机

Transformer-based ASR models, while successful, suffer from quadratic scaling, leading to high computational costs and memory usage when processing long audio sequences. This limitation becomes especially challenging with large datasets like Gigaspeech[7] or SPGISpeech[8]. To address these issues, we introduce Samba-ASR, which replaces the transformer encoder with the efficient Mamba SSM. The Mamba architecture offers linear complexity, allowing it to model long-range dependencies without the heavy computational burden.

基于Transformer的语音识别(ASR)模型虽然成功,但存在二次方复杂度问题,导致处理长音频序列时计算成本和内存占用过高。面对Gigaspeech[7]或SPGISpeech[8]等大型数据集时,这一限制尤为突出。为解决这些问题,我们推出Samba-ASR,用高效的Mamba SSM替代了Transformer编码器。Mamba架构具有线性复杂度优势,能在不增加计算负担的情况下建模长距离依赖关系。

By leveraging Mamba’s selective state-space dynamics, Samba-ASR achieves state-of-the-art performance across major ASR benchmarks, surpassing transformer-based systems in both accuracy and efficiency. Our model reduces inference latency and training time, while maintaining robust performance even with noisy or spontaneous speech. Samba-ASR presents a scalable, efficient, and accurate solution for modern ASR tasks, setting a new standard in the field.

通过利用Mamba的选择性状态空间动态特性,Samba-ASR在主流语音识别(ASR)基准测试中实现了最先进的性能表现,在准确性和效率上均超越了基于Transformer的系统。该模型在保持对嘈杂或自发语音鲁棒性的同时,显著降低了推理延迟和训练时间。Samba-ASR为现代语音识别任务提供了可扩展、高效且精准的解决方案,树立了该领域的新标杆。

1.3 Contributions

1.3 贡献

This paper makes the following key contributions:

本文的主要贡献如下:

Samba-ASR sets a new standard for efficiency and scalability in ASR systems, addressing critical challenges in modern speech recognition and paving the way for future innovations in the field.

Samba-ASR为自动语音识别(ASR)系统的效率和可扩展性设立了新标准,解决了现代语音识别领域的关键挑战,并为该领域的未来创新铺平了道路。

2 Related Work

2 相关工作

In recent years, Automatic Speech Recognition (ASR) systems have made significant strides in both accuracy and computational efficiency. Traditional models relied on recurrent and convolutional neural networks, but modern architectures, particularly those leveraging Transformer-based models, have set new benchmarks in performance. These Transformer models, such as Wave2Vec 2.0[13], Conformer[14], Whisper[2], and Nvidia Canary[15], have greatly advanced ASR capabilities by capturing both local and global dependencies in speech data. However, despite their successes, these models often face challenges in terms of computational resources, scalability, and performance on long-form speech data. Recent innovations in State Space Models (SSMs), including the Mamba-based approaches, have emerged as promising alternatives, aiming to overcome these limitations. This section reviews the key developments in ASR technologies, discussing their strengths, limitations, and the contributions of the Mamba-based systems.

近年来,自动语音识别 (Automatic Speech Recognition, ASR) 系统在准确性和计算效率方面取得了显著进展。传统模型依赖循环神经网络和卷积神经网络,但现代架构(尤其是基于Transformer的模型)为性能树立了新标杆。Wave2Vec 2.0[13]、Conformer[14]、Whisper[2]和Nvidia Canary[15]等Transformer模型通过捕捉语音数据的局部和全局依赖关系,极大提升了ASR能力。然而,尽管这些模型取得了成功,它们在计算资源、可扩展性和长语音数据性能方面仍面临挑战。状态空间模型 (State Space Models, SSMs) 的最新创新(包括基于Mamba的方法)已成为有潜力的替代方案,旨在突破这些限制。本节回顾ASR技术的关键发展,讨论其优势、局限性以及基于Mamba系统的贡献。

2.1 Present ASR Systems

2.1 当前 ASR (Automatic Speech Recognition) 系统

2.1.1 Wave2Vec 2.0

2.1.1 Wave2Vec 2.0

The Wav2Vec2[13] model is a widely adopted architecture for speech-to-text tasks, offering a robust method for processing raw audio into meaningful text. Its architecture comprises three main components: the feature encoder, quantization module, and Transformer encoder. The feature encoder processes raw audio waveforms using a series of convolutional layers that extract latent speech representations by downsampling the input while retaining critical temporal features. The quantization module discretizes these latent representations into a finite set of learned speech units using product quantization, which is crucial for self-supervised learning objectives. The Transformer encoder, a core part of the architecture, captures long-range dependencies in the audio data by contextualizing the extracted features through multi-layer attention mechanisms. During pretraining, a contrastive loss is employed by masking a portion of the feature encoder’s output and predicting the corresponding quantized representations, allowing the model to learn contextual speech representations effectively. In downstream tasks, such as speech-to-text generation, Wav2Vec2 is fine-tuned with labeled audio-text data, leveraging the Connectionist Temporal Classification (CTC) loss to map audio features directly to text sequences. This approach has demonstrated exceptional performance in automatic speech recognition (ASR), making Wav2Vec2 a foundational model in related works on ASR and audio-based sequence generation tasks.

Wav2Vec2[13]模型是一种广泛应用于语音转文本任务的架构,为将原始音频处理为有意义文本提供了稳健方法。其架构包含三个主要组件:特征编码器、量化模块和Transformer编码器。特征编码器通过一系列卷积层处理原始音频波形,在降采样输入的同时保留关键时序特征,从而提取潜在语音表征。量化模块采用乘积量化技术将这些潜在表征离散化为有限的学习语音单元集合,这对自监督学习目标至关重要。Transformer编码器作为架构核心组件,通过多层注意力机制对提取特征进行上下文建模,从而捕捉音频数据中的长程依赖关系。在预训练阶段,通过掩码部分特征编码器输出并预测对应量化表征,采用对比损失函数使模型能有效学习上下文语音表征。在下游任务(如语音转文本生成)中,Wav2Vec2利用带标注的音频-文本数据进行微调,借助连接时序分类(CTC)损失直接将音频特征映射到文本序列。该方法在自动语音识别(ASR)领域展现出卓越性能,使Wav2Vec2成为ASR及基于音频的序列生成任务相关研究的基础模型。
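As an illustration of the CTC fine-tuning objective mentioned above, the snippet below scores frame-level encoder outputs against a label sequence with PyTorch’s built-in CTC loss; the shapes, vocabulary size, and random tensors are placeholders, not details from the Wav2Vec2 paper:

```python
import torch
import torch.nn as nn

# Toy shapes: T encoder frames, N utterances, C output symbols (index 0 = CTC blank).
T, N, C = 50, 2, 30
logits = torch.randn(T, N, C, requires_grad=True)        # stand-in for encoder outputs
log_probs = logits.log_softmax(dim=-1)                    # (T, N, C) log-probabilities

targets = torch.randint(1, C, (N, 12), dtype=torch.long)     # label ids, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)        # frames per utterance
target_lengths = torch.full((N,), 12, dtype=torch.long)      # labels per utterance

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()        # gradients flow back into the fine-tuned acoustic encoder
print(float(loss))
```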

2.1.2 Conformer

2.1.2 Conformer

The Conformer[14] architecture has emerged as a significant advancement in speech processing models, particularly for Automatic Speech Recognition (ASR). It is designed to improve the extraction of both local and global features from audio signals by combining the strengths of convolutional networks and transformer-based attention mechanisms. This hybrid approach enables Conformer to achieve state-of-the-art performance in tasks requiring the understanding of sequential audio data, such as speech recognition. The core strength of the Conformer lies in its ability to effectively model both short-term and long-term dependencies, a challenge typically faced by traditional models relying on either convolutions or attention mechanisms alone.

Conformer[14]架构已成为语音处理模型领域的重要突破,尤其在自动语音识别(ASR)方面。该架构通过结合卷积网络和基于Transformer的注意力机制优势,显著提升了从音频信号中提取局部与全局特征的能力。这种混合方法使Conformer在语音识别等需要理解序列化音频数据的任务中实现了最先进的性能。其核心优势在于能同时有效建模短期与长期依赖关系——这一直是传统模型仅依赖卷积或注意力机制时面临的典型挑战。

The preprocessing stage of the Conformer model begins with a convolutional subsampling layer. This initial step reduces the input sequence length by downsampling the feature maps, which not only reduces computational complexity but also retains essential information while discarding irrelevant details. The convolutional layer captures local patterns in the audio signal, which is crucial for preserving fine-grained temporal information. The output of this stage is then passed onto the main encoder, where the core feature extraction takes place.

Conformer模型的预处理阶段始于一个卷积子采样层。这一初始步骤通过下采样特征图来减少输入序列长度,不仅降低了计算复杂度,同时保留了关键信息并舍弃无关细节。该卷积层能捕捉音频信号中的局部模式,这对于保留细粒度时间信息至关重要。此阶段的输出随后被传递到主编码器,进行核心特征提取。

In the encoder, the audio data is processed by a sequence of Conformer blocks, each of which comprises four key modules: a feed-forward module (FFN), a multi-headed self-attention (MHSA) module, a convolution module, and a second FFN module. The MHSA module is responsible for capturing global contextual relationships within the input sequence, leveraging relative positional encoding to manage varying sequence lengths. This helps the model generalize better across different input sizes. The use of pre-norm residual connections in the MHSA module allows for stable and efficient training, as layer normalization is applied before the attention mechanism, followed by a residual connection that aids in gradient flow during training.

在编码器中,音频数据通过一系列Conformer块进行处理,每个块包含四个关键模块:前馈模块 (FFN)、多头自注意力模块 (MHSA)、卷积模块和第二个FFN模块。MHSA模块负责捕捉输入序列中的全局上下文关系,利用相对位置编码来管理不同序列长度。这有助于模型在不同输入尺寸上实现更好的泛化。MHSA模块中采用预归一化残差连接,通过将层归一化应用于注意力机制之前,再结合残差连接,从而在训练过程中保持稳定高效的梯度流动。
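The block structure described above can be summarized in a compact and deliberately simplified PyTorch skeleton; relative positional encoding and the full gated convolution module are omitted, with a plain depthwise temporal convolution standing in for the latter:

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Half-step FFN -> MHSA -> depthwise temporal conv -> half-step FFN, pre-norm residuals."""

    def __init__(self, dim: int = 256, heads: int = 4, kernel: int = 15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.dwconv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.ffn2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                          # first macaron FFN (half-step)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # global context via MHSA
        c = self.conv_norm(x).transpose(1, 2)               # (batch, dim, time) for Conv1d
        x = x + self.dwconv(c).transpose(1, 2)              # local patterns via depthwise conv
        x = x + 0.5 * self.ffn2(x)                          # second macaron FFN (half-step)
        return self.final_norm(x)

print(ConformerBlockSketch()(torch.randn(2, 40, 256)).shape)  # torch.Size([2, 40, 256])
```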

The Conformer architecture combines convolutional and attention mechanisms to enhance speech recognition. By integrating these components, the model is able to handle varying input lengths while preserving both local and global features in the audio signal. The design, which uses a sandwich structure of different modules, helps balance feature extraction and computational efficiency. This makes Conformer a valuable approach for speech recognition tasks and other speech processing applications.

Conformer架构结合了卷积和注意力机制来增强语音识别能力。通过整合这些组件,该模型能够处理不同长度的输入,同时保留音频信号中的局部和全局特征。这种采用不同模块三明治结构的设计,有助于平衡特征提取和计算效率,使Conformer成为语音识别任务及其他语音处理应用的理想方案。

2.1.3 Whisper

2.1.3 Whisper

The Whisper model[2] is built on a sequence-to-sequence Transformer architecture, which is designed to handle various speech processing tasks such as transcription, translation, voice activity detection, and language identification. The input to the model is an 80-channel log-magnitude Mel spectrogram derived from raw audio, re-sampled at $16\mathrm{kHz}$. The spectrogram is computed using 25-millisecond windows with a 10-millisecond stride, which captures the essential features of the audio signal. The model processes these features through a convolutional stem followed by a stack of Transformer blocks to learn meaningful representations of the speech signal.

Whisper模型[2]基于序列到序列的Transformer架构,旨在处理多种语音处理任务,如转录、翻译、语音活动检测和语言识别。模型输入是从原始音频中提取的80通道对数幅度梅尔频谱图(Mel spectrogram),重采样率为$16\mathrm{kHz}$。该频谱图采用25毫秒窗口和10毫秒步长计算,以捕捉音频信号的关键特征。模型通过卷积主干和一系列Transformer块处理这些特征,从而学习语音信号的有意义表示。
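The front-end described here (16 kHz audio, 25 ms windows, 10 ms stride, 80 Mel channels) can be approximated with torchaudio as sketched below; Whisper’s exact log scaling and normalization are omitted:

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400,          # 25 ms window at 16 kHz
    hop_length=160,     # 10 ms stride at 16 kHz
    n_mels=80,          # 80 Mel channels, as in the Whisper front-end
)

waveform = torch.randn(1, SAMPLE_RATE * 3)              # stand-in for 3 s of 16 kHz audio
log_mel = torch.log10(mel(waveform).clamp(min=1e-10))   # log Mel spectrogram
print(log_mel.shape)                                    # (1, 80, ~301): channels x frames
```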

The encoder processes the Mel spectrograms through two initial convolutional layers followed by Transformer blocks. The convolution layers with GELU activation reduce the dimensionality of the spectrogram and capture local patterns, while the Transformer layers are responsible for extracting global temporal dependencies in the audio. The encoder also includes sinusoidal position embeddings, which help the model learn the temporal structure of the audio input. The encoder’s output is a sequence of contextualized representations that capture the relevant acoustic and linguistic information from the audio.

编码器通过两个初始卷积层和Transformer块处理梅尔频谱图(Mel spectrogram)。采用GELU激活函数的卷积层负责降低频谱图维度并捕捉局部特征,而Transformer层则用于提取音频中的全局时间依赖性。该编码器还包含正弦位置嵌入(sinusoidal position embeddings),帮助模型学习音频输入的时间结构。编码器最终输出的是捕获音频中相关声学和语言信息的上下文表征序列。

The decoder takes the encoder’s output and generates text sequences, such as transcriptions or translations, depending on the task. It uses learned position embeddings and a set of special tokens to specify the task (e.g., transcription, translation). The decoder is trained to predict the next token in the sequence, conditioned on both the previously predicted tokens and the input audio features. The model is trained in a multitask setup, enabling it to perform multiple tasks like multilingual transcription and translation with a single unified architecture. The decoder ends with a special end of transcription token, marking the end of the output sequence.

解码器接收编码器的输出,并根据任务生成文本序列(如转录文本或翻译内容)。它通过学习的位置嵌入和一组特殊token来指定任务(例如转录、翻译)。解码器经过训练,能够基于先前预测的token和输入音频特征来预测序列中的下一个token。该模型采用多任务训练机制,使其能够通过单一统一架构执行多语言转录和翻译等多项任务。解码器以特殊的转录结束token作为终止标志,标记输出序列的结束。
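A schematic of this conditional, token-by-token generation is sketched below; the special-token ids and the stand-in decoder are placeholders rather than Whisper’s actual tokenizer or weights:

```python
import torch

# Illustrative special-token ids standing in for task-prompt and end tokens
# (Whisper's real tokenizer assigns its own ids; these are placeholders).
SOT, TRANSCRIBE, EOT, VOCAB = 1, 2, 0, 1000

def greedy_decode(decoder, audio_features, max_len=50):
    """Autoregressive greedy decoding: each step conditions on the encoder's
    audio features and on all previously generated tokens."""
    tokens = [SOT, TRANSCRIBE]                       # task prompt: transcription
    for _ in range(max_len):
        logits = decoder(audio_features, torch.tensor([tokens]))  # (1, len, vocab)
        next_id = int(logits[0, -1].argmax())        # most likely next token
        tokens.append(next_id)
        if next_id == EOT:                           # end-of-transcription token
            break
    return tokens

# Stand-in decoder so the sketch runs end to end: returns random logits.
def dummy_decoder(audio_features, tokens):
    return torch.randn(1, tokens.size(1), VOCAB)

print(greedy_decode(dummy_decoder, torch.randn(1, 300, 512)))
```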

Thus, by using a Transformer-based architecture to handle various speech recognition tasks, the Whisper model processes Mel spectrograms through an encoder to capture audio features and then uses a decoder to generate text. This approach provides a unified solution for tasks like transcription and translation.

因此,通过采用基于Transformer的架构处理多种语音识别任务,Whisper模型利用编码器处理梅尔频谱图(Mel spectrogram)来捕捉音频特征,再通过解码器生成文本。这种方法为转写和翻译等任务提供了统一的解决方案。

2.1.4 Nvidia Canary 1B

2.1.4 Nvidia Canary 1B

The Canary model[15] is an efficient encoder-decoder model designed for automatic speech recognition (ASR) and automatic speech translation (AST). It uses a Fast Conformer-based architecture, a speech-specific modification of the Conformer model, which balances high performance with reduced computational resources and training data. The model processes audio as Mel spectrograms, with a focus on minimizing the need for large datasets, and achieves a $2.8\times$ speedup over traditional models by increasing the downsampling factor to 8.

Canary模型[15]是一种专为自动语音识别(ASR)和自动语音翻译(AST)设计的高效编码器-解码器模型。该模型采用基于Fast Conformer的架构(这是Conformer模型针对语音任务的改进版本),在保持高性能的同时减少了计算资源和训练数据需求。模型以梅尔频谱图作为音频输入,重点降低对大规模数据集的依赖,并通过将下采样因子提升至8实现了相比传统模型 $2.8\times$ 的加速效果。

The model employs a unified multi-task training strategy, where special prompt tokens direct it to perform either transcription or translation tasks. Canary is trained on synthetic data generated through machine translation, using advanced techniques such as dynamic data blending, data balancing, dynamic bucketing, and noise-robust fine-tuning. These methods optimize training efficiency, ensure consistent language representation, and minimize hallucinations when no speech is present.

该模型采用统一的多任务训练策略,通过特殊提示token (prompt tokens) 来执行转写或翻译任务。Canary 使用机器翻译生成的合成数据进行训练,并采用动态数据混合 (dynamic data blending) 、数据平衡 (data balancing) 、动态分桶 (dynamic bucketing) 和抗噪声微调 (noise-robust fine-tuning) 等先进技术。这些方法优化了训练效率,确保语言表征的一致性,并在无语音输入时最大限度减少幻觉现象。

Despite being trained on just 86K hours of speech, much less than models like Whisper, which use up to 5M hours, Canary delivers competitive or superior performance. Its compact architecture and innovative training strategies make it highly effective for ASR and AST tasks, offering impressive results across multiple languages with significantly less training data.

尽管仅训练了8.6万小时的语音数据(远低于Whisper等模型使用的500万小时),Canary仍展现出具有竞争力甚至更优的性能。其紧凑架构和创新训练策略使其在自动语音识别(ASR)和自动语音翻译(AST)任务中表现卓越,仅用少量训练数据即可在多种语言上取得亮眼成果。

2.2 Existing Mamba Based Approach

2.2 现有基于Mamba的方法

Recent advancements in speech processing have been largely driven by Transformer-based[16] models as discussed in Section 2.1, which excel at capturing global dependencies but face computational challenges for long-form sequences. State Space Models (SSMs), like Mamba, have emerged as efficient alternatives due to their linear computational scaling and ability to handle long-range dependencies. However, prior research, such as the BiMamba[17] study, primarily focused on exploring bidirectional Mamba for tasks like speech enhancement and recognition without producing a standalone ASR system competitive with Transformer-based architectures. Similarly, “Exploring the Capability of Mamba in ASR”[18] evaluated Mamba’s potential across various speech tasks, including ASR, text-to-speech, and summarization, showcasing comparable or superior performance to Transformer models like Conformer. However, this work remained domain-focused and did not result in a fully realized ASR model.

近期语音处理领域的进展主要源于基于Transformer[16]的模型(如2.1节所述),这类模型擅长捕捉全局依赖关系,但在处理长序列时面临计算挑战。状态空间模型(State Space Models,SSMs)(如Mamba)因其线性计算复杂度及长程依赖处理能力,已成为高效替代方案。然而,先前研究(例如BiMamba[17])主要探索双向Mamba在语音增强和识别等任务中的应用,并未构建出能与Transformer架构竞争的独立ASR系统。类似地,《Exploring the Capability of Mamba in ASR》[18]评估了Mamba在ASR、文本转语音和摘要等语音任务中的潜力,其表现与Conformer等Transformer模型相当或更优,但该研究仍局限于特定领域,未形成完整的ASR模型。

The “Speech Slytherin”[19] study extended Mamba’s application to speech separation and synthesis, introducing hybrid models like Mamba-TasNet and ConMamba, which achieved competitive results but faced limitations in efficiency for shorter inputs and joint text-speech modeling. While these studies demonstrated Mamba’s promise in speech processing, none produced a robust ASR system capabl