[Paper Translation] Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models


Original paper: https://arxiv.org/pdf/2501.02832v3


Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models


ABSTRACT


We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state-space models (SSMs). Unlike transformer-based ASR models, which rely on self-attention mechanisms to capture dependencies, Samba ASR effectively models both local and global temporal dependencies using efficient state-space dynamics, achieving remarkable performance gains. By addressing the limitations of transformers, such as quadratic scaling with input length and difficulty in handling long-range dependencies, Samba ASR achieves superior accuracy and efficiency.


Experimental results demonstrate that Samba ASR surpasses existing open-source transformer-based ASR models across various standard benchmarks, establishing it as the new state-of-the-art in ASR. Extensive evaluations on benchmark datasets show significant improvements in Word Error Rate (WER), with competitive performance even in low-resource scenarios. Furthermore, the inherent computational efficiency and parameter optimization of the Mamba architecture make Samba ASR a scalable and robust solution for diverse ASR tasks.

Our contributions include the development of a new Samba ASR architecture for automatic speech recognition (ASR), demonstrating the superiority of structured state-space models (SSMs) over transformer-based models for speech sequence processing. We provide a comprehensive evaluation on public benchmarks, showcasing state-of-the-art (SOTA) performance, and present an in-depth analysis of computational efficiency, robustness to noise, and sequence generalization. This work highlights the viability of Mamba SSMs as a transformer-free alternative for efficient and accurate ASR. By leveraging the advancements of state-space modeling, Samba ASR redefines ASR performance standards and sets a new benchmark for future research in this field.


Keywords Mamba $\cdot$ Structured State Space $\cdot$ Automatic Speech Recognition (ASR) $\cdot$ Mamba Blocks $\cdot$ Speech Processing


1 Introduction


The rapid evolution of deep learning has significantly transformed Automatic Speech Recognition (ASR), shifting from traditional systems such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to advanced end-to-end neural architectures. While innovations such as Connectionist Temporal Classification (CTC) and attention-based encoder-decoder models have established new baselines [1], transformer-based models like OpenAI’s Whisper have further pushed the boundaries, setting state-of-the-art benchmarks for multilingual, multitask ASR systems [2].

Despite their successes, transformer architectures face inherent challenges in scaling to long sequences, particularly those encountered in extended audio recordings. Transformers exhibit quadratic complexity with respect to sequence length, leading to high computational costs and memory usage for tasks requiring long-context modeling [3],[4]. These limitations present a significant obstacle to achieving scalable and efficient ASR systems, especially in resource-constrained environments or for real-time applications.

Structured State-Space Models (SSMs) [5] have emerged as a compelling alternative, offering efficient sequence modeling with linear complexity. The Mamba architecture [6], an innovation within this domain, extends SSM capabilities by introducing selective recurrence and hardware-aware optimizations. These advancements address the limitations of traditional linear time-invariant (LTI) dynamics, enabling Mamba to deliver exceptional efficiency and scalability. By leveraging selective state-space dynamics, Mamba achieves efficient long-range dependency modeling, making it particularly well-suited for ASR tasks.

Mamba’s architecture introduces input-dependent parameters into the state-space equations, allowing for dynamic adaptation to sequence content. This capability compresses context into a smaller state representation while effectively capturing both local and global dependencies [6]. Furthermore, Mamba employs hardware-aware techniques such as kernel fusion and parallel scan, optimizing computational efficiency and minimizing memory overhead during both training and inference. These features establish Mamba as a robust solution for sequence modeling across diverse modalities.


While Mamba has demonstrated success in a range of applications, including language and vision tasks, its direct application to speech-to-text systems remained unexplored prior to this work. The development of Samba-ASR represents a significant breakthrough, showcasing the potential of Mamba-based architectures in ASR. By replacing traditional transformer encoders with Mamba’s efficient state-space modeling, Samba-ASR achieves state-of-the-art performance across major ASR benchmarks, including Gigaspeech [7] and SPGISpeech [8]. The model reduces inference latency and training time while maintaining high accuracy, even under challenging conditions such as noisy or spontaneous speech.


The following sections delve into the technical details of State Space Models, the Mamba architecture, and its advancements in both language and vision tasks, setting the stage for our motivation and contributions toward an efficient ASR system using Mamba.

1.1 Background


1.1.1 State Space Models (SSMs)


State Space Models (SSMs) [5] provide a robust framework for sequence modeling by representing dynamical systems through a latent state that evolves over time. These models describe how inputs affect system states and how states generate outputs, using the following equations:

$$
h_{t+1}=A\,h_{t}+B\,x_{t},\qquad y_{t}=C\,h_{t}
$$

where $h_{t}$ is the latent state at time $t$ , $x_{t}$ is the input, $y_{t}$ is the output, and $A,B,C$ are parameter matrices. This formulation allows SSMs to efficiently model sequential data by transitioning between latent states and producing outputs influenced by both current and historical inputs.


Traditionally, SSMs are linear time-invariant (LTI), where $A$, $B$, and $C$ remain constant over time. Although LTI dynamics provide computational efficiency and stability, they limit the model’s ability to adapt to input-dependent variations. Consequently, classical SSMs often struggle with complex, context-sensitive tasks, especially in discrete and content-rich modalities such as language.

The matrices $A$, $B$, and $C$ are learned parameters with the following interpretations:

• $A$: determines how much the previous hidden state $h_{t}$ should be considered when calculating the new hidden state $h_{t+1}$.
• $B$: determines how much the input $x_{t}$ should be considered when calculating the new hidden state $h_{t+1}$.
• $C$: determines how much the hidden state $h_{t}$ should be considered when calculating the output $y_{t}$.
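
To make the update concrete, the following minimal NumPy sketch steps an LTI SSM through a random input sequence; the dimensions and parameter values are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, d_out, T = 4, 1, 1, 10

# Placeholder parameter matrices; in an LTI SSM they are fixed over time.
A = 0.9 * np.eye(d_state)               # state transition
B = rng.normal(size=(d_state, d_in))    # input -> state
C = rng.normal(size=(d_out, d_state))   # state -> output

h = np.zeros(d_state)
for t in range(T):
    x_t = rng.normal(size=d_in)
    y_t = C @ h                         # y_t = C h_t (readout of the current state)
    h = A @ h + B @ x_t                 # h_{t+1} = A h_t + B x_t
```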

1.1.2 Mamba: Linear-Time Sequence Modeling with Selective State Spaces


Mamba[6] extends traditional SSMs with a selectivity mechanism, addressing the limitations of LTI dynamics while preserving computational efficiency. Mamba’s formulation introduces input-dependent parameters into the state-space equations:


$$
h_{t+1}=A\,h_{t}+B(x_{t}),\qquad y_{t}=C(x_{t})\,h_{t}
$$

where $B(x_{t})$ and $C(x_{t})$ are learned functions of the input $x_{t}$, allowing selective propagation of relevant information and enabling dynamic adaptation to sequence content, while $A$ remains a structured state transition matrix. This selective mechanism allows Mamba to efficiently compress context into a smaller state while maintaining the ability to capture long-range dependencies.
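
A minimal sketch of the selective variant follows, with $B(\cdot)$ and $C(\cdot)$ realized as linear maps of the current input while $A$ stays fixed; shapes and values are again our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d_state, d_in, T = 4, 2, 8

A   = 0.95 * np.eye(d_state)            # structured transition, input-independent
W_B = rng.normal(size=(d_state, d_in))  # parameters of the learned map B(.)
W_C = rng.normal(size=(d_state, d_in))  # parameters of the learned map C(.)

h = np.zeros(d_state)
for t in range(T):
    x_t = rng.normal(size=d_in)
    y_t = (W_C @ x_t) @ h               # y_t = C(x_t) h_t, input-dependent readout
    h = A @ h + W_B @ x_t               # h_{t+1} = A h_t + B(x_t), selective injection
```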

To efficiently handle the introduced time-varying parameters, Mamba employs a hardware-aware implementation using techniques like kernel fusion, parallel scan, and recomputation. This minimizes memory overhead by leveraging GPU memory hierarchies, where state updates are computed in fast, low-level memory (e.g., SRAM) and final outputs are written to high-bandwidth memory (HBM). By avoiding the materialization of large latent states during training, Mamba achieves linear computational complexity while ensuring flexibility for diverse tasks. Furthermore, a recomputation strategy reduces memory requirements during backpropagation by recalculating intermediate states only when needed.
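
The parallel scan works because the (discretized, diagonal) recurrence $h_{t}=a_{t}h_{t-1}+b_{t}$ admits an associative combine over $(a,b)$ pairs. The toy sketch below, which does not attempt to reproduce Mamba's fused CUDA kernels, verifies that the combine matches the sequential recurrence:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 8
a = rng.uniform(0.5, 1.0, size=T)   # per-step decay (a diagonal A after discretization)
b = rng.normal(size=T)              # per-step input contribution

# Sequential reference: h_t = a_t * h_{t-1} + b_t
h, seq = 0.0, []
for t in range(T):
    h = a[t] * h + b[t]
    seq.append(h)

# Associative combine on (a, b) pairs; a parallel scan can evaluate this
# in O(log T) depth, which is what hardware-aware implementations exploit.
def combine(p, q):
    return (q[0] * p[0], q[0] * p[1] + q[1])

acc, scanned = (1.0, 0.0), []
for t in range(T):
    acc = combine(acc, (a[t], b[t]))
    scanned.append(acc[1])

assert np.allclose(seq, scanned)    # both orderings yield identical states
```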

The Mamba architecture simplifies traditional SSM designs by combining sequence transformation and gating mechanisms into a single homogeneous block. This block replaces multi-head attention (MHA) and MLP components with a streamlined structure inspired by gated mechanisms in RNNs, such as:

$$
g_{t}=\sigma(\mathrm{Linear}(x_{t})),\quad h_{t}=(1-g_{t})\,h_{t-1}+g_{t}\,x_{t}
$$

where $g_{t}$ represents the selection gate. By iteratively stacking these blocks with normalization (e.g., LayerNorm) and activation functions (e.g., SiLU), Mamba achieves high expressiveness while maintaining simplicity. Its design balances performance and efficiency, making it particularly effective for tasks such as Automatic Speech Recognition (ASR), language modeling, and reinforcement learning, where long-context dependencies and low latency are essential.
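
As a rough structural illustration, the sketch below stacks the gated update with LayerNorm and SiLU in PyTorch; it is a simplified stand-in of our own, not the full Mamba block, which additionally carries the selective SSM dynamics:

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Simplified gated block in the spirit of the equation above."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, T, d_model)
        u = self.act(self.norm(x))
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):
            g = torch.sigmoid(self.gate(u[:, t]))          # g_t = sigma(Linear(x_t))
            h = (1 - g) * h + g * u[:, t]                  # h_t = (1-g_t) h_{t-1} + g_t x_t
            outs.append(h)
        return torch.stack(outs, dim=1) + x                # residual for stacking

blocks = nn.Sequential(*[GatedBlock(64) for _ in range(4)])
print(blocks(torch.randn(2, 16, 64)).shape)                # torch.Size([2, 16, 64])
```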

1.1.3 Advancements in Large Language and Vision Models Utilizing Mamba


The Mamba architecture has inspired significant advancements in both language and vision modeling through its innovative state-space mechanism, leading to hybrid and pure Mamba-based models.


Jamba[9] introduces a novel hybrid architecture combining Transformer and Mamba layers, interleaved with mixture-of-experts (MoE) modules. This hybrid design addresses limitations of pure Transformer models in handling long contexts and computational efficiency. The resulting model, Jamba, achieves performance comparable to Mixtral-8x7B while supporting an unprecedented context length of 256,000 tokens, the longest among production-grade models. Jamba’s efficiency is remarkable, delivering three times the throughput of Mixtral 8x7B for long contexts and operating within a single 80GB GPU. This demonstrates the potential of integrating Transformer’s attention mechanisms with Mamba’s efficient state-space dynamics for enhanced performance and resource utilization.

Falcon Mamba[3], on the other hand, showcases the capabilities of a pure Mamba-based language model. This 7B-parameter model, trained on 5.8 trillion tokens, challenges the notion that attention mechanisms are necessary for competitive performance. Surpassing open-weight Transformer-based models like Mistral 7B and Falcon2 11B, Falcon Mamba demonstrates that efficient inference and constant memory costs are achievable across context lengths. By addressing training stability issues with strategic initializations and RMSNorm placements, Falcon Mamba establishes itself as a competitive and efficient alternative to hybrid architectures.

Zamba[4] represents another leap in Mamba-based innovation by combining a Mamba backbone with a unique shared attention module. This 7B parameter model achieves competitive performance against leading transformer-based models while maintaining SSM efficiency. With faster inference speeds and reduced memory requirements, Zamba stands out as a resource-efficient model, particularly for generating long sequences. Although slightly behind in reasoning and in-context learning tasks due to limited training data, Zamba demonstrates the viability of hybrid SSM-attention designs for large-scale modeling.


In vision tasks, Vision Mamba (Vim)[10] adapts Mamba for visual representation learning, demonstrating that self-attention mechanisms are not essential for effective vision modeling. Vim introduces bidirectional Mamba blocks to address positional awareness and global context challenges in vision tasks. The model delivers superior performance on benchmarks like ImageNet and COCO, achieving 2.8× faster inference on high-resolution images compared to transformer-based models such as DeiT[11], while reducing GPU memory usage by 86.8%. Vim’s subquadratic computation and linear memory complexity make it a highly efficient solution for high-resolution visual tasks.

These advancements illustrate the adaptability and efficacy of Mamba-based architectures in overcoming challenges across modalities, setting a new standard for resource-efficient and high-performing models in language and vision tasks.


1.2 Motivation


Transformer-based ASR models, while successful, suffer from quadratic scaling, leading to high computational costs and memory usage when processing long audio sequences. This limitation becomes especially challenging with large datasets like Gigaspeech[7] or SPGISpeech[8]. To address these issues, we introduce Samba-ASR, which replaces the transformer encoder with the efficient Mamba SSM. The Mamba architecture offers linear complexity, allowing it to model long-range dependencies without the heavy computational burden.


By leveraging Mamba’s selective state-space dynamics, Samba-ASR achieves state-of-the-art performance across major ASR benchmarks, surpassing transformer-based systems in both accuracy and efficiency. Our model reduces inference latency and training time, while maintaining robust performance even with noisy or spontaneous speech. Samba-ASR presents a scalable, efficient, and accurate solution for modern ASR tasks, setting a new standard in the field.


1.3 Contributions


This paper makes the following key contributions:

• A new Samba-ASR architecture for automatic speech recognition, using Mamba-based structured state-space models as both encoder and decoder.
• A comprehensive evaluation on public benchmarks, demonstrating state-of-the-art performance over transformer-based ASR models.
• An in-depth analysis of computational efficiency, robustness to noise, and sequence generalization.

Samba-ASR sets a new standard for efficiency and scalability in ASR systems, addressing critical challenges in modern speech recognition and paving the way for future innovations in the field.

2 Related Work


In recent years, Automatic Speech Recognition (ASR) systems have made significant strides in both accuracy and computational efficiency. Traditional models relied on recurrent and convolutional neural networks, but modern architectures, particularly those leveraging Transformer-based models, have set new benchmarks in performance. These Transformer models, such as Wav2Vec 2.0[13], Conformer[14], Whisper[2], and Nvidia Canary[15], have greatly advanced ASR capabilities by capturing both local and global dependencies in speech data. However, despite their successes, these models often face challenges in terms of computational resources, scalability, and performance on long-form speech data. Recent innovations in State Space Models (SSMs), including Mamba-based approaches, have emerged as promising alternatives, aiming to overcome these limitations. This section reviews the key developments in ASR technologies, discussing their strengths, limitations, and the contributions of Mamba-based systems.

2.1 Present ASR Systems


2.1.1 Wav2Vec 2.0

The Wav2Vec2[13] model is a widely adopted architecture for speech-to-text tasks, offering a robust method for processing raw audio into meaningful text. Its architecture comprises three main components: the feature encoder, the quantization module, and the Transformer encoder. The feature encoder processes raw audio waveforms using a series of convolutional layers that extract latent speech representations by downsampling the input while retaining critical temporal features. The quantization module discretizes these latent representations into a finite set of learned speech units using product quantization, which is crucial for the self-supervised learning objectives. The Transformer encoder, a core part of the architecture, captures long-range dependencies in the audio data by contextualizing the extracted features through multi-layer attention mechanisms. During pretraining, a contrastive loss is employed by masking a portion of the feature encoder’s output and predicting the corresponding quantized representations, allowing the model to learn contextual speech representations effectively. In downstream tasks, such as speech-to-text generation, Wav2Vec2 is fine-tuned with labeled audio-text data, leveraging the Connectionist Temporal Classification (CTC) loss to map audio features directly to text sequences. This approach has demonstrated exceptional performance in automatic speech recognition (ASR), making Wav2Vec2 a foundational model in related work on ASR and audio-based sequence generation tasks.

2.1.2 Conformer


The Conformer[14] architecture has emerged as a significant advancement in speech processing models, particularly for Automatic Speech Recognition (ASR). It is designed to improve the extraction of both local and global features from audio signals by combining the strengths of convolutional networks and transformer-based attention mechanisms. This hybrid approach enables Conformer to achieve state-of-the-art performance in tasks requiring the understanding of sequential audio data, such as speech recognition. The core strength of the Conformer lies in its ability to effectively model both short-term and long-term dependencies, a challenge typically faced by traditional models relying on either convolutions or attention mechanisms alone.


The preprocessing stage of the Conformer model begins with a convolutional subsampling layer. This initial step reduces the input sequence length by downsampling the feature maps, which not only reduces computational complexity but also retains essential information while discarding irrelevant details. The convolutional layer captures local patterns in the audio signal, which is crucial for preserving fine-grained temporal information. The output of this stage is then passed to the main encoder, where the core feature extraction takes place.

In the encoder, the audio data is processed by a sequence of Conformer blocks, each of which comprises four key modules: a feed-forward module (FFN), a multi-headed self-attention (MHSA) module, a convolution module, and a second FFN module. The MHSA module is responsible for capturing global contextual relationships within the input sequence, leveraging relative positional encoding to manage varying sequence lengths. This helps the model generalize better across different input sizes. The use of pre-norm residual connections in the MHSA module allows for stable and efficient training, as layer normalization is applied before the attention mechanism, followed by a residual connection that aids in gradient flow during training.


The Conformer architecture combines convolutional and attention mechanisms to enhance speech recognition. By integrating these components, the model is able to handle varying input lengths while preserving both local and global features in the audio signal. The design, which uses a sandwich structure of different modules, helps balance feature extraction and computational efficiency. This makes Conformer a valuable approach for speech recognition tasks and other speech processing applications.


2.1.3 Whisper


The Whisper model[2] is built on a sequence-to-sequence Transformer architecture, which is designed to handle various speech processing tasks such as transcription, translation, voice activity detection, and language identification. The input to the model is an 80-channel log-magnitude Mel spectrogram derived from raw audio resampled at 16 kHz. The spectrogram is computed using 25-millisecond windows with a 10-millisecond stride, which captures the essential features of the audio signal. The model processes these features through a convolutional stem followed by a stack of Transformer blocks to learn meaningful representations of the speech signal.

The encoder processes the Mel spectrograms through two initial convolutional layers followed by Transformer blocks. The convolutional layers with GELU activation reduce the dimensionality of the spectrogram and capture local patterns, while the Transformer layers are responsible for extracting global temporal dependencies in the audio. The encoder also includes sinusoidal position embeddings, which help the model learn the temporal structure of the audio input. The encoder’s output is a sequence of contextualized representations that capture the relevant acoustic and linguistic information from the audio.

The decoder takes the encoder’s output and generates text sequences, such as transcriptions or translations, depending on the task. It uses learned position embeddings and a set of special tokens to specify the task (e.g., transcription, translation). The decoder is trained to predict the next token in the sequence, conditioned on both the previously predicted tokens and the input audio features. The model is trained in a multitask setup, enabling it to perform multiple tasks like multilingual transcription and translation with a single unified architecture. The decoder ends with a special end of transcription token, marking the end of the output sequence.


Thus, by using a Transformer-based architecture to handle various speech recognition tasks, the Whisper model processes Mel spectrograms through an encoder to capture audio features and then uses a decoder to generate text. This approach provides a unified solution for tasks like transcription and translation.

2.1.4 Nvidia Canary 1B


The Canary model[15] is an efficient encoder-decoder model designed for automatic speech recognition (ASR) and automatic speech translation (AST). It uses a FastConformer-based architecture, a speech-specific modification of the Conformer model, which balances high performance with reduced computational resources and training data. The model processes audio as Mel spectrograms, with a focus on minimizing the need for large datasets, and achieves a 2.8× speedup over traditional models by increasing the downsampling factor to 8.

The model employs a unified multi-task training strategy, where special prompt tokens direct it to perform either transcription or translation tasks. Canary is trained on synthetic data generated through machine translation, using advanced techniques such as dynamic data blending, data balancing, dynamic bucketing, and noise-robust fine-tuning. These methods optimize training efficiency, ensure consistent language representation, and minimize hallucinations when no speech is present.


Despite being trained on just 86K hours of speech, much less than models like Whisper, which use up to 5M hours, Canary delivers competitive or superior performance. Its compact architecture and innovative training strategies make it highly effective for ASR and AST tasks, offering impressive results across multiple languages with significantly less training data.


2.2 Existing Mamba Based Approach


Recent advancements in speech processing have been largely driven by Transformer-based[16] models, as discussed in Section 2.1, which excel at capturing global dependencies but face computational challenges for long-form sequences. State Space Models (SSMs), like Mamba, have emerged as efficient alternatives due to their linear computational scaling and ability to handle long-range dependencies. However, prior research, such as the BiMamba[17] study, primarily focused on exploring bidirectional Mamba for tasks like speech enhancement and recognition without producing a standalone ASR system competitive with Transformer-based architectures. Similarly, "Exploring the Capability of Mamba in ASR"[18] evaluated Mamba’s potential across various speech tasks, including ASR, text-to-speech, and summarization, showcasing comparable or superior performance to Transformer models like Conformer. However, this work remained domain-focused and did not result in a fully realized ASR model.

The "Speech Slytherin"[19] study extended Mamba’s application to speech separation and synthesis, introducing hybrid models like Mamba-TasNet and ConMamba, which achieved competitive results but faced limitations in efficiency for shorter inputs and in joint text-speech modeling. While these studies demonstrated Mamba’s promise in speech processing, none produced a robust ASR system capable of outperforming leading Transformer-based models. In contrast, our work introduces Samba-ASR, the first fully developed Mamba-based ASR system that surpasses Transformer architectures across major benchmarks, including Gigaspeech, LS Clean, LS Other, and SPGISpeech. This establishes Samba-ASR as a state-of-the-art solution, advancing the boundaries of speech recognition in terms of performance and computational efficiency.

3 Data processing


The audio files are first loaded using the standard library torchaudio for efficient I/O operations. The audio file is decoded, down-mixed if necessary, and resampled to a fixed sample rate of 16 kHz, ensuring all audio inputs are in the same format, which is essential for uniform processing. Error handling is implemented to deal with any issues arising during the loading process, such as file format incompatibility or unsupported codecs. The loaded audio is then normalized to a range of [-1, 1] to facilitate model training.

To ensure that the audio inputs match the expected size for processing, they are either padded or trimmed to a specific length $N_{samples}$, defined by the model’s requirements. This step is critical to maintain consistency in the length of audio segments processed by the encoder[20]. The choice of padding or trimming helps maintain the sequence length across all input samples, enabling efficient batch processing during training.

Once the audio data is standardized, it is converted into a log-Mel spectrogram[21], which captures frequency content and time dynamics. This is done by applying the Short-Time Fourier Transform (STFT)[22] to the audio waveform and projecting it onto the Mel filter banks. The resulting magnitude spectrogram is then converted to a logarithmic scale to better match human auditory perception. This transformation enhances the discriminative power of the features, making them more suitable for speech recognition tasks. The spectrograms are further scaled to a range that ensures numerical stability and are normalized before being fed into the ASR model, facilitating accurate training and inference.
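
A sketch of this pipeline with torchaudio is shown below; the window and hop sizes (25 ms / 10 ms at 16 kHz), the fixed length `N_SAMPLES`, and the file path are illustrative assumptions rather than values stated in the paper:

```python
import torch
import torchaudio

SAMPLE_RATE = 16_000
N_SAMPLES = SAMPLE_RATE * 30                         # assumed fixed segment length

def load_audio(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                  # decode the file
    wav = wav.mean(dim=0, keepdim=True)              # down-mix to mono if needed
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    return wav / wav.abs().max().clamp(min=1e-8)     # normalize to [-1, 1]

def pad_or_trim(wav: torch.Tensor, n: int = N_SAMPLES) -> torch.Tensor:
    if wav.size(-1) >= n:
        return wav[..., :n]                          # trim long inputs
    return torch.nn.functional.pad(wav, (0, n - wav.size(-1)))  # pad short ones

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=80)

def log_mel(wav: torch.Tensor) -> torch.Tensor:
    spec = mel(wav)                                  # STFT + Mel filter banks
    return torch.log(spec.clamp(min=1e-10))          # log scale for stability

# features = log_mel(pad_or_trim(load_audio("sample.wav")))
```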

3.1 Tokenizer


The tokenizer is designed for the Mamba ASR (Automatic Speech Recognition) model and converts textual input into a sequence of token IDs suitable for processing by the model. It includes a set of special tokens that mark the beginning and end of a transcription, indicate the task of transcribing the text, and potentially denote information for audio transcriptions. These tokens are crucial for guiding the model’s understanding of the input data. The tokenizer creates a basic vocabulary for English text that includes common ASCII characters, numbers, and punctuation marks. It uses byte-level BPE (Byte Pair Encoding)[23] to segment the text into individual tokens. This method ensures that each element of the input text is represented uniformly, facilitating consistent preprocessing and accurate transcription when used with the Mamba ASR model.
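
A minimal sketch using the Hugging Face `tokenizers` library is given below; the special-token strings, vocabulary size, and training file are hypothetical, since the paper does not specify them:

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical special-token strings; the paper does not list the exact ones.
SPECIALS = ["<|startoftranscript|>", "<|transcribe|>", "<|endoftranscript|>"]

tokenizer = ByteLevelBPETokenizer()
# Train a byte-level BPE vocabulary from plain-text transcripts
# (file path and vocabulary size are illustrative).
tokenizer.train(files=["transcripts.txt"], vocab_size=8000,
                special_tokens=SPECIALS)

ids = tokenizer.encode("hello world").ids   # text -> token IDs
text = tokenizer.decode(ids)                # token IDs -> text
```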

4 Samba-ASR: Architecture


4.1 Overview


Mamba ASR introduces a novel approach to Automatic Speech Recognition (ASR) by utilizing the Mamba architecture, as shown in Figure 1, a state-of-the-art sequence modeling technique known for its computational efficiency and ability to capture long-range dependencies. Unlike traditional Transformer-based models (e.g., Wav2Vec2 and Conformer), which predominantly use self-attention mechanisms for both audio feature extraction and text generation, Mamba ASR offers an alternative based on state space models, allowing for better scalability and efficiency in processing longer sequences of data. This key distinction is central to Mamba ASR’s ability to handle both the audio and text components of ASR tasks more effectively.

At the heart of Mamba ASR are two primary components: an audio encoder and a text decoder, both built with Mamba blocks. These blocks are designed to handle long-range dependencies in both speech and text sequences, offering a more efficient alternative to the memory-intensive approaches of Transformer and Conformer models. In contrast to these models, which use self-attention for global context capture (with varying computational efficiency), Mamba’s state space approach enables more efficient processing without sacrificing performance on tasks like transcription.



Figure 1: Architecture diagram (original) of the Samba-ASR model, illustrating the key components including the Mamba encoder, which processes raw audio features using Mamba blocks, and the Mamba decoder along with the Mamba-Cross-Connection bridge, which generates transcriptions by integrating audio context with text representations. The model’s design focuses on efficient long-range dependency capture for accurate automatic speech recognition.


4.2 Encoder


The Mamba ASR audio encoder processes raw audio input, represented as Mel spectrograms, to generate high-level feature representations that capture essential speech characteristics. It begins by passing the audio input through several convolutional layers[14], a technique borrowed from image processing models. These layers help to capture local temporal patterns in the audio signal and preserve fine-grained details in the audio features. The output of these convolutional layers is then passed through a series of Mamba blocks, which form the core of the encoder.

Unlike Transformer-based models such as Wav2Vec2 and Conformer, which rely on self-attention mechanisms to capture global context, the Mamba encoder uses a state space model that scales linearly with sequence length, making it more computationally efficient for long audio sequences. This results in a more efficient model for handling longer speech sequences without the quadratic complexity that Transformer-based models face.


The output from the Mamba blocks is a sequence of contextualized audio embeddings, with Layer Normalization applied to stabilize the features before they are passed to the decoder. This efficient handling of long-range dependencies in audio sequences is critical for ASR tasks, where the model needs to capture and understand context across the entire utterance.
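
A structural sketch of the encoder in PyTorch follows; the layer counts and widths are assumptions, and a residual MLP stub stands in for a real Mamba block (e.g., the `Mamba` module from the `mamba_ssm` package) so the example runs anywhere:

```python
import torch
import torch.nn as nn

class MambaBlockStub(nn.Module):
    """Placeholder for a real Mamba selective-SSM block; a residual
    MLP keeps the sketch self-contained and runnable without CUDA."""
    def __init__(self, d_model: int):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, d_model), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)

class AudioEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        # Convolutional stem: captures local temporal patterns in the spectrogram.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.blocks = nn.Sequential(*[MambaBlockStub(d_model) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)            # stabilize before the decoder

    def forward(self, mel: torch.Tensor) -> torch.Tensor:   # mel: (batch, n_mels, T)
        x = self.conv(mel).transpose(1, 2)                  # -> (batch, T', d_model)
        return self.norm(self.blocks(x))                    # contextualized embeddings

print(AudioEncoder()(torch.randn(2, 80, 300)).shape)        # torch.Size([2, 150, 256])
```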

4.3 Decoder


The text decoder in Mamba ASR generates the transcription from the encoded audio features. It begins by embedding the input tokens (representing the partially transcribed text) and adding positional embeddings to ensure the order of the sequence is preserved. These embeddings are then processed through a series of Mamba blocks, similar to the encoder. However, here, the decoder is conditioned on the encoded audio features via a Mamba-cross-connection mechanism. This allows the decoder to focus on the relevant portions of the audio sequence while predicting each token, which is essential for accurate transcription.


In Transformer-based models like Wav2Vec2 and Whisper, the encoder directly feeds the decoder, and the self-attention mechanism captures the relationship between the audio features and the generated text. In contrast, Mamba ASR’s Mamba-cross-connection mechanism enables more targeted alignment between the audio and text features, improving the model’s ability to focus on specific audio segments that are most relevant to the current token being predicted. This targeted cross-connection mechanism helps the decoder refine the text representations, integrating both the audio context and previously predicted tokens.


After passing through the Mamba blocks, a final Layer Normalization is applied, and the output is projected onto the vocabulary space via a linear layer followed by a softmax function[24]. This produces a probability distribution over the entire vocabulary, from which the model selects the most likely next token. To maintain the autoregressive nature of text generation, a causal mask ensures that predictions are based only on past tokens.
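
The output head can be sketched as follows; shapes are illustrative, and the mask shown enforces during teacher-forced training the same left-to-right constraint that the recurrent state update provides at inference:

```python
import torch
import torch.nn as nn

vocab_size, d_model, T = 8000, 256, 12            # illustrative sizes

proj = nn.Linear(d_model, vocab_size)             # projection onto the vocabulary
hidden = torch.randn(1, T, d_model)               # decoder states after Mamba blocks

logits = proj(hidden)                             # (1, T, vocab_size)
probs = torch.softmax(logits, dim=-1)             # distribution over tokens
next_token = probs[:, -1].argmax(dim=-1)          # greedy pick of the next token

# Causal mask: position i may only depend on positions j <= i.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
```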

The unique use of Mamba blocks in the decoder enables Mamba ASR to model the intricate relationship between audio features and text tokens effectively, addressing the complex alignment problem in ASR while also being computationally efficient.


5 Dataset


To train Samba-ASR, we utilized a diverse set of high-quality speech datasets. The LibriSpeech clean split, containing 460 hours of transcribed 16 kHz English speech, provided high-quality audio with minimal noise. We leveraged both the Train.100 (100 hours) and Train.360 (360 hours) subsets along with the corresponding validation and test sets. These subsets include recordings with clear pronunciations and low Word Error Rates (WER), making them an ideal foundation for ASR training.

Additionally, we incorporated the GigaSpeech dataset, which added 10,000 hours of transcribed audio from various sources such as audiobooks, podcasts, and YouTube. This dataset covers both read and spontaneous speaking styles across diverse topics including science and arts, enhancing the model’s ability to handle multi-domain speech and spontaneous variations in audio.


We further enriched the training data with SPGISpeech, a domain-specific dataset consisting of 5,000 hours of transcribed financial audio. It features diverse accents (L1 and L2 English speakers), varying audio quality, and professionally formatted transcripts. This dataset played a crucial role in training Samba-ASR to excel in recognizing specialized financial terminologies and handling challenging audio conditions.


6 Training Details


As detailed in Table 1, the Samba-ASR model was trained with AdamW[25] and gradient norm clipping along with a linear learning rate decay. A batch size of 256 was used, and the models were trained for 80 epochs with an initial learning rate of 1e-4, a weight decay of 0.01, and an Adam epsilon of 1e-8. These parameters were selected to ensure stable convergence and effectively mitigate overfitting. Throughout the training process, we tracked training loss, validation loss, and Word Error Rate (WER) to monitor model performance and generalization.

Table 1: Details of Training Parameters used for the training of Samba-ASR

| Training Parameter | Value |
|---|---|
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Adam eps | 1e-8 |
| Batch Size | 256 |
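
The settings in Table 1 translate directly into PyTorch; the clipping threshold and the dummy model and loss below are placeholders of ours, since the paper states only that gradient norm clipping and linear decay were used:

```python
import torch

model = torch.nn.Linear(80, 80)          # stand-in for the Samba-ASR model

optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-4, weight_decay=0.01, eps=1e-8)
# Linear learning-rate decay over the 80 training epochs.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=80)

for epoch in range(80):
    loss = model(torch.randn(256, 80)).pow(2).mean()   # dummy loss, batch size 256
    optimizer.zero_grad()
    loss.backward()
    # Gradient-norm clipping; max_norm=1.0 is an assumed value.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```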

As seen in the Epoch vs. Loss graph in Figure 2, both training and validation loss consistently decreased, starting from an initial value of approximately 7 and converging close to 0.5 by epoch 80. Similarly, the Epoch vs. WER graph in Figure 3 shows a steady decline in WER, from over 4.0 to approximately 0.2 by epoch 80. These results highlight the Samba-ASR model’s ability to achieve stable convergence and significantly improve recognition accuracy, outperforming transformer-based ASR models.


Figure 2: This graph shows the correlation of training and validation loss across epochs, with both losses steadily decreasing and converging around the 72nd epoch.


7 Evaluation and Results


We evaluate Samba-ASR (SandLogic) on four benchmark datasets: GigaSpeech, LibriSpeech (LS) Clean, LS Other, and SPGISpeech, and compare its performance with leading ASR models listed on the Open ASR Leaderboard hosted by Hugging Face. All results are computed using the same evaluation framework to ensure consistency and fairness. The primary evaluation metric is the Word Error Rate (WER).
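
WER counts the word-level substitutions, insertions, and deletions needed to turn a hypothesis into its reference, normalized by the reference length. A quick check with the `jiwer` package (an assumed tooling choice, not one named by the paper):

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions against nine reference words -> WER = 2/9, about 22.2%.
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```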

As shown in Table 2, the model achieves a remarkable average WER of 3.65%, outperforming top-performing systems. On LS Clean, it sets a new standard with a WER of 1.17%, while maintaining a competitive edge on the more challenging LS Other subset with a WER of 2.48%. Exceptional results are also observed on GigaSpeech and SPGISpeech, with WERs of 9.12% and 1.84%, respectively. These outcomes highlight the model’s state-of-the-art performance and its ability to generalize effectively across diverse ASR benchmarks.


Figure 3: This graph demonstrates a significant reduction in Word Error Rate (WER) throughout the training process, indicating improved model performance and accuracy.


Table 2: Model Performance Comparison Across Various Datasets


| Model | Average WER | Gigaspeech | LS Clean | LS Other | SPGISpeech |
|---|---|---|---|---|---|
| Samba-ASR (SandLogic) | 3.65 | 9.12 | 1.17 | 2.48 | 1.84 |
| nvidia/canary-1b | 4.15 | 10.12 | 1.48 | 2.93 | 2.06 |
| nyrahealth/CrisperWhisper | 4.69 | 10.24 | 1.82 | 4.00 | 2.70 |
| nvidia/parakeet-tdt-1.1b | 7.01 | 9.52 | 1.40 | 2.60 | 3.16 |
| openai/whisper-large-v3 | 7.44 | 10.02 | 2.01 | 3.91 | 2.94 |

8 Conclusion


Samba-ASR represents a significant breakthrough in automatic speech recognition technology, demonstrating superior performance across multiple benchmark datasets including GigaSpeech, LibriSpeech Clean/Other, and SPGISpeech. The model achieves remarkable results with an average Word Error Rate (WER) of 3.65%, setting a new state-of-the-art benchmark with particularly impressive performance on LibriSpeech Clean (WER: 1.17%) and SPGISpeech (WER: 1.84%).

The architecture’s success can be attributed to its innovative use of state-space models (SSMs) in both encoder and decoder components, replacing traditional transformer-based attention mechanisms. This design choice results in linear computational complexity, enabling efficient processing of long audio sequences while maintaining high accuracy. The model’s robust performance across diverse speaking styles, audio qualities, and domains demonstrates its practical viability for real-world applications.


Samba-ASR’s achievements extend beyond just performance metrics. The model’s efficient architecture reduces both training time and inference latency, while maintaining linear scaling with sequence length. This combination of improved accuracy and computational efficiency establishes Samba-ASR as a compelling alternative to transformer-based models, setting a new direction for future research in speech recognition technology.


9 Future Scope


Future work on Samba-ASR will explore multiple key directions to enhance its capabilities, scalability, and broader applicability. A primary focus is extending support for multilingual ASR[26] and translation, enabling the system to process and transcribe speech in diverse languages, including those with limited resources. This will make Samba-ASR a robust tool for global applications, catering to cross-lingual communication and breaking language barriers effectively [27].

To address diverse computational requirements, future iterations will explore the development of model variants of different sizes, from lightweight versions optimized for edge devices[28] to larger, high-performance models for enterprise-level use. This scalability will ensure the system’s adaptability to various deployment scenarios, from real-time transcription on mobile devices to large-scale processing in cloud environments.

Enhancing the encoder pre-training process is another critical avenue of research. By incorporating larger and more diverse datasets, we aim to further improve generalization across accents, dialects, and spontaneous speech variations. Additionally, integrating domain-adaptive fine-tuning will allow the model to excel in specific industries, such as healthcare or legal transcription. Finally, efforts to integrate real-time processing capabilities and on-the-fly language detection will make Samba-ASR even more versatile for dynamic and interactive use cases. These advancements will solidify Samba-ASR as a leading-edge solution in the ASR landscape, ensuring its continued evolution to meet emerging challenges in speech recognition.

