[Paper Translation] FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration


Original paper: https://arxiv.org/pdf/2501.14350


FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration


Abstract


We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements for superior performance and optimal efficiency across various applications. FireRedASR comprises two variants:


FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework that leverages large language model (LLM) capabilities. On public Mandarin benchmarks, FireRedASR-LLM (8.3B parameters) achieves an average Character Error Rate (CER) of $3.05\%$, surpassing the latest SOTA of $3.33\%$ with an $8.4\%$ relative CER reduction (CERR). It demonstrates superior generalization capability over industrial-grade baselines, achieving $24\%$-$40\%$ CERR in multi-source Mandarin ASR scenarios such as video, live streaming, and intelligent assistants.


FireRedASR-AED: Designed to balance high performance and computational efficiency, and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture. On public Mandarin benchmarks, FireRedASR-AED (1.1B parameters) achieves an average CER of $3.18\%$, slightly worse than FireRedASR-LLM but still outperforming the latest SOTA model with over 12B parameters. Its more compact size makes it suitable for resource-constrained applications.


Moreover, both models exhibit competitive results on Chinese dialects and English speech benchmarks and excel in singing lyrics recognition. To advance research in speech processing, we release our models and inference code at github.com/FireRedTeam/FireRedASR.


1 Introduction


Automatic Speech Recognition (ASR) has evolved rapidly in recent years, becoming an essential component in intelligent voice interaction and multimedia content understanding. Recent advances in ASR have led to several large-scale models, such as Whisper [1], Qwen-Audio [2, 3], SenseVoice [4], and Seed-ASR [5], showing a paradigm shift from end-to-end models with millions of parameters [6, 7] to larger-scale models [1, 4, 8, 9] and the integration of pre-trained text LLMs [2, 3, 5, 10–19].


Despite their impressive capabilities and larger model sizes, they face significant limitations in practical applications. Some models prioritize multilingual and multitask capabilities, resulting in suboptimal performance for specific languages like Mandarin. Others, despite showing promising results, are limited by their closed-source nature, restricting community-driven improvements and academic research. The growing demands for modern speech interaction systems, highlighted by GPT-4o [20, 21], underscore the need for open-source, high-performance Mandarin ASR solutions.


To address these limitations, in this technical report, we introduce FireRedASR, a family of large-scale models for Mandarin ASR. To meet varying needs in performance and efficiency across a wide range of application scenarios, FireRedASR consists of two variants: FireRedASR-LLM and FireRedASR-AED. FireRedASR-LLM utilizes an innovative Encoder-Adapter-LLM framework [5, 10, 18, 19], comprising 8.3B parameters, to push the boundary of recognition accuracy. This model is particularly well-suited for scenarios where precision is paramount and computational resources are not a primary constraint. FireRedASR-AED, on the other hand, is designed to balance superior performance and optimal efficiency. It employs an Attention-based Encoder-Decoder (AED) architecture [22, 23] with up to 1.1B parameters. Beyond its standalone use, FireRedASR-AED also functions as a crucial speech representation component within larger LLM-based speech frameworks.


Key contributions of our work include:

- FireRedASR-LLM, an Encoder-Adapter-LLM model with 8.3B parameters that achieves SOTA accuracy on public Mandarin benchmarks and strong generalization across real-world scenarios.
- FireRedASR-AED, an Attention-based Encoder-Decoder model with 1.1B parameters that balances high performance with computational efficiency and can serve as a speech representation module in LLM-based speech models.
- The open-source release of both models and their inference code to advance research in speech processing.


The remainder of this report is organized as follows: Section 2 describes the architectures of FireRedASR-AED and FireRedASR-LLM, along with training data and optimization strategies. Section 3 presents comprehensive evaluation results across various benchmarks and practical scenarios compared to recently released large-scale ASR models. Section 4 discusses the key factors contributing to our superior performance. Section 5 concludes the report.


2 FireRedASR


In this section, we present the architectural details and methodologies for our two ASR models: FireRedASR-AED and FireRedASR-LLM. FireRedASR-AED follows the conventional Attention-based Encoder-Decoder architecture, whereas FireRedASR-LLM is built on the Encoder-Adapter-LLM architecture that leverages the power of LLMs for ASR. Both models share similar input feature processing and acoustic encoding strategies but differ in their approaches to token sequence modeling.


2.1 FireRedASR-AED: Attention-based Encoder-Decoder ASR model


FireRedASR-AED adopts an end-to-end architecture that combines a Conformer-based Encoder (Enc) with a Transformer-based Decoder (Dec) [24, 25]. This design choice leverages both the ability of the Conformer to model local and global dependencies in speech features and the effectiveness of the Transformer in sequence transduction. The overall architecture of FireRedASR-AED is illustrated in Figure 1 (bottom right).


Training Data: The training corpus consists of approximately 70,000 hours of audio data, predominantly high-quality Mandarin Chinese speech. Unlike the weakly-labeled datasets used in Whisper, the majority of our data was manually transcribed by professional annotators, ensuring high transcription accuracy and reliability. The dataset also incorporates approximately 11,000 hours of English speech data to enhance English ASR capabilities.



Figure 1: Architecture of FireRedASR-LLM (left), FireRedASR-AED (bottom right), and Adapter.


Input Features: The input features are 80-dimensional log Mel filterbank (Fbank) features, extracted from 25ms windows with a 10ms frame shift, followed by global mean and variance normalization.

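For concreteness, here is a minimal sketch of this front-end using torchaudio's Kaldi-compatible Fbank (an assumed tooling choice; the report does not name an extraction library, and the input file is hypothetical):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

wav, sr = torchaudio.load("speech.wav")  # hypothetical 16kHz mono input
feats = kaldi.fbank(
    wav,
    num_mel_bins=80,      # 80-dimensional log Mel filterbank
    frame_length=25.0,    # 25ms analysis window
    frame_shift=10.0,     # 10ms frame shift
    sample_frequency=sr,
)
# Global mean and variance normalization; in practice the statistics
# would be precomputed over the training corpus (per-utterance here).
feats = (feats - feats.mean(dim=0)) / (feats.std(dim=0) + 1e-8)
```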

Encoder Structure: The encoder consists of two main components: a subsampling module and a stack of Conformer blocks. The subsampling module employs two sequential convolutional layers, each with a stride of 2 and a kernel size of 3, followed by ReLU activation functions. This configuration reduces the temporal resolution from 10ms to 40ms per frame, effectively managing computational complexity while preserving essential acoustic information. The subsampled features are then processed by a stack of Conformer blocks. Each Conformer block consists of four primary components: two Macaron-style feed-forward modules positioned at the beginning and end of the block, a multi-head self-attention module incorporating relative positional encoding [26], and a convolution module equipped with a gated linear unit (GLU) and layer normalization. The kernel size for all 1-D depthwise convolutions is set to 33. This structure enables effective modeling of both local and global dependencies in the speech signal while maintaining computational efficiency.

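A minimal sketch of such a subsampling module under the stated strides and kernel sizes (the channel layout and padding are our assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    """Two stride-2 conv layers: 10ms frames in, 40ms frames out."""
    def __init__(self, in_dim: int = 80, d_model: int = 1280):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # With padding=1, two stride-2 convs shrink the feature axis 4x: 80 -> 20.
        self.out = nn.Linear(d_model * (in_dim // 4), d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, time//4, d_model)
        x = self.conv(x.unsqueeze(1))   # (batch, d_model, time//4, in_dim//4)
        b, c, t, f = x.size()
        return self.out(x.transpose(1, 2).reshape(b, t, c * f))
```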

Decoder Structure: The decoder follows a standard Transformer architecture with several key design choices. It adopts fixed sinusoidal positional encodings and employs weight tying between input and output token embeddings to reduce model complexity. Each Transformer block consists of three primary components: a multi-head self-attention module, a multi-head cross-attention module, and a position-wise feed-forward module, all utilizing pre-norm residual units to enhance training stability and gradient flow.

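The weight tying mentioned above takes only a few lines; a sketch using the vocabulary size and L-configuration width from this report:

```python
import torch.nn as nn

vocab_size, d_model = 7832, 1280
embed = nn.Embedding(vocab_size, d_model)
out_proj = nn.Linear(d_model, vocab_size, bias=False)
out_proj.weight = embed.weight  # tied: one matrix serves embedding and output projection
```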

Tokenization: We employ a mixed tokenization strategy: Chinese characters for Chinese text and token-level byte-pair encoding (BPE) [27] for English text. The total vocabulary size is 7,832, comprising 1,000 English BPE tokens, 6,827 Chinese characters, and 5 special tokens.

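A toy sketch of this mixed strategy (`bpe_encode` is a hypothetical stand-in for the English BPE model; the actual 1,000-token inventory is not reproduced here):

```python
import re

def mixed_tokenize(text, bpe_encode):
    """Chinese characters become individual tokens; other segments are
    delegated to a BPE encoder."""
    tokens = []
    for seg in re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff]+", text):
        if re.match(r"[\u4e00-\u9fff]", seg):
            tokens.append(seg)               # one token per Chinese character
        elif seg.strip():
            tokens.extend(bpe_encode(seg.strip()))  # English -> BPE tokens
    return tokens
```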

We investigated various sizes of FireRedASR-AED, with detailed architectural configurations presented in Table 1, where #Params denotes the number of parameters. Unless otherwise specified, FireRedASR-AED refers to FireRedASR-AED-L.


2.2 FireRedASR-LLM: Encoder-Adapter-LLM-based ASR model


FireRedASR-LLM is also an end-to-end ASR model, but it is designed to integrate the robust speech processing capabilities of FireRedASR-AED with the superior language capabilities of LLMs. It comprises three core components: a Conformer-based audio Encoder, a lightweight audio-text alignment Adapter, and a pre-trained text-based LLM, forming what we term the Encoder-Adapter-LLM architecture. The overall architecture of FireRedASR-LLM is illustrated in Figure 1 (left).


Table 1: Architecture details of FireRedASR-AED and FireRedASR-LLM.


Model Size                    XS      S       M       L
FireRedASR-AED
  Width (d_model)             512     768     1024    1280
  #Layers (Encoder/Decoder)   12/12   16/16   16/16   16/16
  #Params (Total)             140M    413M    732M    1.1B
FireRedASR-LLM
  #Params (Encoder)           86M     256M    455M    710M
  #Params (Adapter)           17M     18M     20M     22M
  #Params (Total)             7.7B    7.9B    8.1B    8.3B

Input Features and Encoder: FireRedASR-LLM employs the same training data, input features, and processing methods as FireRedASR-AED. The encoder of FireRedASR-LLM is initialized with pre-trained weights from the encoder of FireRedASR-AED. This encoder generates continuous representations that encapsulate both acoustic and semantic characteristics of the input speech.

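In code, this initialization amounts to a direct weight copy; a one-line sketch with hypothetical model handles:

```python
# Copy pre-trained AED encoder weights into the LLM-variant encoder
# (`aed_model` / `llm_asr` are hypothetical names; the architectures match).
llm_asr.encoder.load_state_dict(aed_model.encoder.state_dict())
```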

Adapter Structure and Functionality: To seamlessly integrate the audio encoder with the text-based LLM, an adapter network is employed. The adapter transforms the output of the encoder into the semantic space of the LLM, enabling the LLM to accurately recognize the corresponding text content from the input speech. It consists of a simple but effective Linear-ReLU-Linear network, which projects the output dimension of the encoder to match the input embedding dimension of the LLM. Even after temporal subsampling from 10ms to 40ms, the output of the encoder remains too long for the LLM to process efficiently. Therefore, we incorporate an additional frame-splicing operation at the beginning of the adapter. This operation further reduces the temporal resolution from 40ms to 80ms per frame, thereby decreasing the sequence length and improving computational efficiency for the LLM.

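For illustration, a minimal sketch of such an adapter, assuming the L-size encoder width from Table 1 and the Qwen2-7B embedding width of 3584 (the splicing and padding details here are our assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Frame splicing (40ms -> 80ms) followed by Linear-ReLU-Linear."""
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * 2, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, enc_dim) at 40ms per frame
        b, t, d = x.size()
        if t % 2:                       # pad to an even length before splicing
            x = torch.cat([x, x.new_zeros(b, 1, d)], dim=1)
        x = x.reshape(b, -1, d * 2)     # concat adjacent frames: 80ms per frame
        return self.proj(x)             # (batch, T//2, llm_dim)
```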

LLM Initialization and Processing: The LLM component of FireRedASR-LLM is initialized with pre-trained weights from Qwen2-7B-Instruct [28], a notable open-source LLM. During training, the input of FireRedASR-LLM consists of a triplet: (prompt, speech, transcript). The encoder and adapter produce a speech embedding $E_{S}$, while the prompt and transcript are tokenized and embedded by the LLM into a prompt embedding $E_{P}$ and a transcript embedding $E_{T}$. These embeddings are concatenated as $(E_{P},E_{S},E_{T})$ and processed by the subsequent layers of the LLM. During inference, the input is reduced to $(E_{P},E_{S})$, enabling the LLM to perform next-token prediction and generate the recognized text from speech.

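In HuggingFace-style pseudocode, the concatenation could look as follows (`llm`, `encoder`, `adapter`, and the `*_ids` tensors are hypothetical handles used only for illustration):

```python
import torch

E_P = llm.get_input_embeddings()(prompt_ids)      # (1, P, d) prompt embedding
E_T = llm.get_input_embeddings()(transcript_ids)  # (1, T, d), training only
E_S = adapter(encoder(fbank_features))            # (1, S, d) speech embedding

inputs = torch.cat([E_P, E_S, E_T], dim=1)  # (E_P, E_S, E_T) for training
outputs = llm(inputs_embeds=inputs)         # standard next-token prediction
```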

Training Strategy: We employ a carefully designed training strategy that balances adaptation and preservation of pre-trained capabilities: the encoder and adapter are fully trainable, while the majority of the LLM parameters remain fixed. We incorporate trainable LLM Low-Rank Adaptation (LoRA) [29] to efficiently fine-tune the LLM. This strategy ensures that the encoder and adapter are adequately trained to map speech features into the semantic space of the LLM while preserving its pre-trained capabilities. The training objective is based on cross-entropy loss, computed only over the transcript portion of the input, ignoring the prompt and speech embeddings.

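A sketch of this setup using the PEFT library for LoRA (the rank, alpha, and target modules shown are illustrative assumptions; the report does not specify them), reusing `inputs` and `transcript_ids` from the sketch above:

```python
import torch
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to the (otherwise frozen) LLM backbone.
lora_cfg = LoraConfig(r=64, lora_alpha=16,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
llm = get_peft_model(llm, lora_cfg)

# Cross-entropy only over the transcript: positions covering the prompt
# and speech embeddings are set to -100, which the loss ignores.
labels = torch.full(inputs.shape[:2], -100, dtype=torch.long)
labels[:, -transcript_ids.size(1):] = transcript_ids
loss = llm(inputs_embeds=inputs, labels=labels).loss
```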

We investigated various sizes of FireRedASR-LLM, with detailed architectural configurations presented in Table 1. Unless otherwise specified, FireRedASR-LLM refers to FireRedASR-LLM-L.


3 Evaluation


In this section, we conduct a comprehensive evaluation of FireRedASR-LLM and FireRedASR-AED models, with a primary focus on their performance in Mandarin speech recognition. The evaluation is structured into three parts to systematically assess the capabilities and generalization abilities of the models.


First, we benchmark our models using several public Mandarin test sets to establish baseline performance under standardized conditions. Second, we evaluate their performance on diverse multi-source Mandarin speech test sets to validate their robustness in real-world scenarios. Additionally, we assess the models’ effectiveness in singing lyrics recognition, crucial for specific industrial applications. Third, we evaluate the models’ performance on Chinese dialects and English speech recognition to demonstrate their potential for broader applications beyond standard Mandarin.


Metrics: We use Character Error Rate (CER) for evaluating Chinese speech and singing lyrics recognition, and Word Error Rate (WER) for English.

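For reference, a minimal CER implementation (using the `editdistance` package as an assumed dependency):

```python
import editdistance  # Levenshtein distance

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over reference length."""
    return editdistance.eval(list(ref), list(hyp)) / max(len(ref), 1)

print(cer("今天天气真好", "今天天汽真好"))  # 1 substitution / 6 chars ~= 0.167
```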

3.1 Evaluation on Public Mandarin ASR Benchmarks


We benchmark FireRedASR-LLM and FireRedASR-AED against several recently released large-scale ASR models, including Seed-ASR [5], SenseVoice-L [4], Qwen-Audio [2], Paraformer-Large [30], and Whisper-Large-v3 [1]. The evaluation is conducted on four widely-used public Chinese Mandarin ASR test sets: 1) AISHELL-1 [31] test set (aishell1); 2) AISHELL-2 [32] iOS version test set (aishell2); 3) WenetSpeech [33] Internet domain test set (ws_net); 4) WenetSpeech meeting domain test set (ws_meeting). The results for the comparative models are sourced from their respective publications, with the Whisper-Large-v3 results taken from SenseVoice-L [4] and the WenetSpeech results of Qwen-Audio from Seed-ASR [5].


As illustrated in Table 2, both FireRedASR-LLM and FireRedASR-AED outperform Seed-ASR. Notably, FireRedASR-LLM achieves an $8.4\%$ relative CER reduction (CERR) compared to Seed-ASR when averaged across all four test sets (Average-4). Seed-ASR, a state-of-the-art large ASR model that is not open-source, was trained with 7.7 million hours of data in its self-supervised learning stage and 0.562 million hours in its supervised fine-tuning stage, with nearly 2B parameters in its encoder and over 10B parameters in its LLM [5]. In contrast, FireRedASR-AED contains only 1.1B parameters and FireRedASR-LLM 8.3B parameters, highlighting the effectiveness of our models' architecture, training strategies, and datasets. When compared to the other models, most of which are open-source, FireRedASR-AED achieves a $29\%$-$68\%$ CERR with fewer parameters than Whisper-Large-v3, SenseVoice-L, and Qwen-Audio.

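For reference, the relative reduction reported above follows the standard definition:

$$\mathrm{CERR} = \frac{\mathrm{CER}_{\mathrm{baseline}} - \mathrm{CER}_{\mathrm{new}}}{\mathrm{CER}_{\mathrm{baseline}}}, \qquad \text{e.g.,}\ \frac{3.33 - 3.05}{3.33} \approx 8.4\%.$$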

Observation of Scaling Law: Recent studies of LLMs have demonstrated that model performance typically improves with increased model size, known as the scaling law [34]. As shown in Table 3, we investigate the scaling behavior of our models with different model sizes, as detailed in Table 1. For FireRedASR-AED, we scale the model size progressively from 140M, 413M, and 732M to 1.1B parameters. The performance consistently improves with increased model size, achieving CERRs of $6.1\%$, $5.3\%$, and $5.6\%$ when scaling from XS to S, S to M, and M to L, respectively. For FireRedASR-LLM, we focus on scaling the encoder while keeping the LLM backbone unchanged. The encoder size increases from 86M to 710M parameters, with minimal changes in adapter parameters (17M to 22M). This exhibits similar scaling patterns and leads to consistent performance improvements, with an overall $7.3\%$ CERR from the XS ($3.29\%$) to the L ($3.05\%$) configuration. These results demonstrate the effectiveness of our scaling strategies and suggest the potential for further improvements with continued scaling.