Abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. On our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
1 Introduction
The evolution of artificial intelligence toward general-purpose systems has positioned real-time speech interaction as a critical interface for human-machine collaboration. While recent multi-modal large language models (LLMs) have accelerated progress in this domain, open-source communities face persistent challenges despite breakthroughs in proprietary systems like GPT-4o (Hurst et al., 2024) and Doubao (bytedance, 2024). Existing open-source models such as Qwen2-Audio (Chu et al., 2024a), Llama 3 (Dubey et al., 2024), and WavLLM (Hu et al., 2024) struggle with three fundamental limitations: the separation of understanding and generation, which impedes end-to-end system integration; dependence on laborious manual speech data acquisition, which restricts efficient voice replication; and inadequate precision in regulating prosodic features, regional dialects, and tool use. These limitations highlight the urgent demand for deployable frameworks that combine a streamlined architecture with dual competencies in affective computing (accurate emotion perception and adjustment) and contextual cognition (situational reasoning and response formulation).
Current open-source speech systems confront multiple architectural challenges. The traditional framework employs a cascading approach (Huang et al., 2024) that combines Automatic Speech Recognition (ASR), LLM processing, and Text-to-Speech (TTS). This framework introduces error propagation through modality transitions while increasing system complexity. Pure end-to-end approaches, though conceptually elegant, often sacrifice open-domain dialogue quality (Zeng et al., 2024). The tension between modular design and fully integrated systems remains unresolved. Furthermore, traditional text-to-speech pipelines depend on manually curated datasets, particularly for multilingual and multi-dialect scenarios, a process that requires prohibitive human annotation effort. Existing solutions also lack sophisticated control mechanisms for dynamic speech adaptation, such as real-time adjustment of speaking rate, emotional prosody, or musical rendering (e.g., singing and RAP vocals). Crucially, the absence of tool invocation capabilities and contextual awareness prevents these systems from handling complex queries like "Retrieve live weather data and report it in Cantonese" without manual API integration.
This report presents Step-Audio, the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation through four key innovations.
• 130B-Parameter Multi-modal Model: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, audio editing, and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.
• Generative Data Engine: Eliminates traditional TTS's reliance on manual data collection by generating high-quality audio through our 130B-parameter multi-modal model. We leverage this data to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.
• Granular Voice Control: Enables precise regulation through instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (RAP/singing, a cappella humming) to meet diverse speech generation needs.
• Enhanced Intelligence: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.
On open-source benchmarks, Step-Audio demonstrates exceptional performance. It achieves SoTA results on open-domain question answering and complex instruction tasks, including LLaMA Question, TriviaQA, and Complex Bench, with an average improvement of 9.3 points over the best open-source results, validating its advantage in generalized deep semantic understanding. Additionally, to address the current lack of comprehensive end-to-end speech dialogue evaluation systems, we introduce the multi-dimensional StepEval-Audio-360 evaluation framework, which covers 9 dimensions including logical reasoning, creative ability, language proficiency, and comprehension control. As shown in Figure 1, Step-Audio achieves SoTA results across all dimensions in subjective comparisons against open-source models like GLM-4-Voice and Qwen2-Audio, with improvements of 19.2%, 23.7%, and 43.2% in response quality, response relevance, and factual accuracy, respectively. In generation control dimensions such as emotion understanding, speech rate control, RAP vocals, and role-playing in particular, the IF (Instruction Following) and MOS (Mean Opinion Score) metrics improve by 29.8% and 27.1%, respectively, over open-source SoTA models, highlighting its leading advantage in complex speech interaction scenarios.
Figure 1: Human Evaluation of End-to-End Speech Interactions. We conduct comprehensive human assessments comparing Step-Audio against GLM-4-Voice (Zeng et al., 2024) and Qwen2-Audio (Chu et al., 2024b) across nine critical dimensions: role-playing, logical reasoning, creativity, singing, language ability, speech emotion control, gaming interaction, voice instruction following, and voice understanding. Expert evaluators rated end-to-end dialog sessions using Likert scales (1-5) for naturalness and task completion. Step-Audio represents the state of the art (SoTA) across all these dimensions. It is particularly strong in language ability, demonstrating a high level of proficiency in grammar, semantics, and language generation. In singing, Step-Audio outperforms the other models with its natural pitch control, rhythm accuracy, and overall harmonious vocal output, making it a top-tier performer in these two crucial aspects.
2 Related Work
Recent progress in end-to-end speech systems has markedly improved human-AI audio interaction. Early approaches relied on cascaded ASR-LLM-TTS pipelines (Huang et al., 2024), in which distinct modules for speech recognition, language modeling, and speech synthesis are connected sequentially. However, these systems suffered from latency buildup, error propagation, and disjointed optimization. Later approaches sought to enhance integration by directly linking speech encoders to LLMs through trainable adapters (Chu et al., 2024a; Das et al., 2024; Kong et al., 2020), though they still required separate TTS modules for audio output.
The emergence of fully end-to-end systems marked a paradigm shift. Architectures like Llama-Omni (Fang et al., 2024) integrated non-autoregressive (NAR) TTS modules with language models, using connectionist temporal classification (CTC) loss. Freeze-Omni (Wang et al., 2024) uses a combination of autoregressive and NAR speech decoders. These systems demonstrated improved latency but exhibited limitations in handling emotional nuance and natural conversational flow. MinMo (Q. Chen et al., 2025) introduced autoregressive speech token prediction through the CosyVoice2 (Du, Wang, et al., 2024) decoder, while interleaved modeling approaches (Nguyen et al., 2024; Zeng et al., 2024) alternated between text and speech token generation at the sequence level.
Parallel decoding architectures like Moshi (Défossez et al., 2024) and Mini-Omni (Xie & Wu, 2024) represented a significant leap by generating text and multiple speech codebook tokens simultaneously. These systems achieved lower latency through compressed speech token sequences but faced challenges in preserving linguistic capabilities when scaling speech token bandwidth. Current systems generally specialize in specific aspects: GLM-4-Voice (Zeng et al., 2024) prioritizes latency reduction, while Moshi emphasizes speech quality, but none holistically addresses emotion awareness, conversational naturalness, and real-time knowledge integration.
Recent methodological advances have systematically investigated emotion-aware interaction paradigms, though their integration with multi-modal frameworks remains nascent. While some systems (Wang et al., 2024) incorporated basic sentiment analysis, they lacked bidirectional emotional resonance, neither detecting paralinguistic cues in user speech nor generating contextually appropriate emotional responses. The naturalness gap persisted due to LLMs' tendency toward verbose, text-optimized outputs (Fang et al., 2024) that are ill-suited for spoken dialogue. Recent work has introduced task-specific optimizations: LUCY (H. Gao et al., 2025) adopted the architectural framework of Mini-Omni (Xie & Wu, 2024), augmented with specialized fine-tuning on conversational datasets for emotion control and function calling.
3 Architecture
Traditional voice dialogue systems typically employ a cascaded architecture comprising ASR, LLM, and TTS modules. Our proposed model, having undergone comprehensive multi-modal training and alignment of text and audio during the pre-training phase, already possesses end-to-end voice dialogue capabilities. Nevertheless, after extensive exploration of alternative designs, we ultimately adopted the AQTA (audio input, text output) + TTS framework for real-time voice dialogue, as shown in Figure 2, driven by the following considerations:
• Scarcity of high-quality pure-voice dialogue data: The limited availability of pure-voice dialogue data, coupled with its constrained scenarios, restricts the training efficiency of end-to-end voice dialogue models.
• Controllability and customization of output speech: By incorporating a TTS module, we gain flexible control over speech parameters such as timbre and pitch to meet users' personalized demands, while continuously enhancing the model's expressive capabilities.

Figure 2: Architecture of Step-Audio. Step-Audio primarily consists of three components: the speech tokenizer, the LLM, and the speech decoder. The speech tokenizer is responsible for discretizing the input speech into tokens. The LLM models both text and speech tokens, while the speech decoder generates the waveform output.
Our goal is to establish Step-Audio as a real-time multi-modal model that seamlessly integrates speech understanding and synthesis through three key components: (1) a dual-codebook tokenization framework employing parallel linguistic (16.7 Hz, 1024-entry codebook) and semantic (25 Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving; (2) a 130B-parameter LLM based on Step-1 (StepFun, 2024a), enhanced through audio-contextualized continual pre-training and post-training; (3) a hybrid speech synthesizer combining flow matching with a neural vocoder, optimized for real-time waveform generation. In addition, a Voice Activity Detection (VAD) module is employed to extract vocal segments.
3.1 Tokenizer
To overcome the limitations of conventional speech tokenizers, which capture information for either understanding or generation tasks but not both, we propose a dual-codebook speech tokenizer framework in Step-Audio, similar to ARCON (Ming et al., 2024). This approach employs two distinct tokenizers, linguistic and semantic, to better represent speech features. The linguistic tokenizer extracts structured, high-level representations, including phonemic and linguistic features, whereas the semantic tokenizer encodes both semantic and coarse-grained acoustic characteristics.
For linguistic tokenization, we utilize the output of the Paraformer (Z. Gao, Zhang, McLoughlin, & Yan, 2022) encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice's (Du, Chen, et al., 2024) tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech, operating at a token rate of 25 Hz. The linguistic tokenizer employs a codebook size of 1024, while the semantic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details.
To effectively integrate these two tokenization schemes, we implement a token-level interleaving approach inspired by SpiritLM (Nguyen et al., 2024). Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two linguistic tokens are paired with three semantic tokens.
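The 2:3 ratio follows from the two token rates: if 16.7 Hz is read as exactly 50/3 Hz (an assumption), two linguistic tokens and three semantic tokens both span 120 ms. The following minimal sketch illustrates one way such group-wise interleaving could be written. The function name, the `("ling", id)` / `("sem", id)` tagging, and the choice to drop an incomplete trailing group are illustrative assumptions, not details taken from the Step-Audio implementation.

```python
from fractions import Fraction
from typing import List, Tuple

# Token rates from Section 3.1; 16.7 Hz is assumed here to be exactly 50/3 Hz.
LINGUISTIC_RATE_HZ = Fraction(50, 3)
SEMANTIC_RATE_HZ = Fraction(25)

# (stream name, codebook index); this tagging scheme is hypothetical.
Token = Tuple[str, int]


def group_duration_seconds() -> Fraction:
    """Duration covered by one interleaved group (2 linguistic or 3 semantic tokens)."""
    return Fraction(2) / LINGUISTIC_RATE_HZ


def interleave(linguistic: List[int], semantic: List[int]) -> List[Token]:
    """Merge two token streams in repeating groups of 2 linguistic + 3 semantic tokens.

    Tokens that do not fill a complete group are dropped (an assumption
    made to keep the sketch short).
    """
    merged: List[Token] = []
    n_groups = min(len(linguistic) // 2, len(semantic) // 3)
    for g in range(n_groups):
        merged += [("ling", t) for t in linguistic[2 * g: 2 * g + 2]]
        merged += [("sem", t) for t in semantic[3 * g: 3 * g + 3]]
    return merged
```

Under these assumptions, each group of five tokens corresponds to the same 120 ms window of audio in both streams, which is what keeps the two codebooks temporally aligned inside a single sequence.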
3.2 LLM
To enhance Step-Audio's ability to process speech information effectively and achieve accurate speech-text alignment, we conducted audio continual pre-training based on Step-1, a 130-billion-parameter pretrained text-based LLM. The pre-training and post-training processes for Step-Audio are discussed in detail in Sections 4 and 5.
In multi-turn dialogue systems, the substantial disparity in length between audio tokens and text tokens necessitates efficient processing strategies. To address this, historical information is first transcribed into text by an ASR model before being fed to the system, thereby improving computational efficiency. Note, however, that the model architecture retains the capability to process and utilize audio tokens as historical context when required.
3.3 Speech Decoder
The speech decoder consists of a 3-billion-parameter language model, a flow-matching model, and a mel-to-wave vocoder. It is primarily designed to receive text or audio tokens and generate continuous time-domain stylized waveforms that incorporate historical information and instructions. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of linguistic and semantic features throughout the generation process. With a larger speech decoder, we have observed the emergence of enhanced generative capabilities. For further details, please refer to Section 5.1.
3.4 Real-time Inference
To enable real-time interactions, we have designed an optimized inference pipeline, as shown in Figure 3. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include VAD for detecting user speech, the Streaming Audio Tokenizer for processing audio in real time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity.
Figure 3: Architecture of the real-time inference pipeline. When audio is input, it is processed concurrently by the streaming audio tokenizer and the voice activity detection module. The controller manages state transitions. A pause in user speech triggers speculative response generation, with multiple calls made but only one response committed. The context manager handles the conversation history in text format for continuity. Once the user finishes speaking, the system enters the reply state, commits a speculative response, and outputs audio. After that, it returns to the idle state for the next interaction.

Speculative Response Generation. To reduce interaction latency, the system preemptively generates speculative responses. This minimizes perceived delays and enhances responsiveness, though at the cost of occasional redundant computation when speculative responses are discarded. The system begins in the Silence state, awaiting user input. When the VAD detects active speech, the system transitions to the User Speaking state. During this state, the Streaming Audio Tokenizer begins converting audio into tokens. If the user momentarily pauses, the system enters the User Paused state, where speculative response generation is triggered. By preemptively generating a response in anticipation of input completion, the system reduces latency when the conversation resumes. If the user resumes speaking, the speculative response is discarded. Once the system confidently determines that the user has finished speaking, it transitions to the Bot Replying state, commits the most recent speculative response, and delivers its audio output. If interrupted by user speech, the system prioritizes the new input while maintaining conversational continuity. After completing its response, the system returns to the Silence state, ready for the next interaction. Empirical analysis shows that approximately 40% of speculative responses are successfully committed. This mechanism reduces per-response latency by approximately 500 ms compared to non-speculative methods.
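The state transitions described above can be summarized as a small finite-state machine. The sketch below is only an illustration of that control flow: the class name, the event names (`speech_start`, `pause`, `end_of_turn`, `reply_done`), and the counters are hypothetical, and the speculative draft is passed in as a plain string rather than produced by a model.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    SILENCE = auto()
    USER_SPEAKING = auto()
    USER_PAUSED = auto()
    BOT_REPLYING = auto()


class Controller:
    """Toy controller; event names and bookkeeping are illustrative assumptions."""

    def __init__(self) -> None:
        self.state = State.SILENCE
        self.speculative: Optional[str] = None
        self.generated = 0  # speculative responses produced
        self.committed = 0  # speculative responses actually delivered

    def on_event(self, event: str, draft: Optional[str] = None) -> Optional[str]:
        """Advance the state machine and return the committed reply, if any."""
        if event == "speech_start":
            # Covers first speech, resumption after a pause (the draft is
            # discarded), and interruption while the bot is replying.
            self.speculative = None
            self.state = State.USER_SPEAKING
        elif event == "pause" and self.state is State.USER_SPEAKING:
            # A pause triggers speculative generation; `draft` stands in
            # for the model's speculative response.
            self.speculative = draft
            self.generated += 1
            self.state = State.USER_PAUSED
        elif event == "end_of_turn" and self.state is State.USER_PAUSED:
            # The user is judged to have finished: commit the latest draft.
            self.state = State.BOT_REPLYING
            self.committed += 1
            return self.speculative
        elif event == "reply_done" and self.state is State.BOT_REPLYING:
            self.speculative = None
            self.state = State.SILENCE
        return None
```

In this sketch, the ratio `committed / generated` plays the role of the commit rate reported above; every discarded draft corresponds to redundant computation traded for lower latency.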
Context Management. Our system uses text transcriptions instead of raw audio tokens for historical context, as text provides a more compact representation (with an average text-to-audio token ratio of 1:14), which improves performance and enables longer conversations with minimal impact on quality. ASR asynchronously transcribes user speech into text, maintaining an accurate and up-to-date conversation history.
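To give a rough sense of the savings, the following back-of-the-envelope sketch estimates the context cost of a stretch of speech when stored as audio tokens versus text. It assumes that the audio token count is the sum of the two tokenizer rates from Section 3.1 (with 16.7 Hz taken as 50/3 Hz) and that the reported 1:14 average ratio applies to that interleaved sequence; both are assumptions made for illustration.

```python
# Combined audio token rate: linguistic (~16.7 Hz, assumed 50/3) + semantic (25 Hz).
AUDIO_TOKENS_PER_SECOND = 50 / 3 + 25

# Average text-to-audio token ratio reported in the text (1:14).
TEXT_TOKENS_PER_AUDIO_TOKEN = 1 / 14


def history_tokens(speech_seconds: float, as_text: bool) -> float:
    """Approximate number of context tokens needed for `speech_seconds` of speech."""
    audio_tokens = speech_seconds * AUDIO_TOKENS_PER_SECOND
    return audio_tokens * TEXT_TOKENS_PER_AUDIO_TOKEN if as_text else audio_tokens
```

Under these assumptions, one minute of speech costs about 2,500 audio tokens but only around 180 text tokens, which is why transcribed history allows much longer conversations within the same context budget.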
Streaming Audio Tokenizer. The input audio stream is processed through two parallel tokenizer pipelines, each employing fixed-duration segmentation. The resulting tokens are merged into a single sequence with a 2:3 interleaving ratio. Without the streaming audio tokenizer, inference would be significantly slower, with the slowdown depending on the length of the audio input.
4 Pretrain
4.1 Dataset
Our multi-modal pre-training dataset integrates three major categories of data resources: audio, text, and images. The audio section comprises 1.1 trillion tokens of audio continuation data (approximately 7,300,000 hours), 113 billion tokens of TTS (Text-to-Speech) synthesized speech data (about 700,000 hours), 105 billion tokens of ASR (Automatic Speech Recognition) data (around 650,000 hours), and 350 billion tokens of audio-text alternating data (approximately 2,000,000 hours). The text data, amounting to 800 billion tokens, encompasses web documents, books, code, and proprietary materials. The image section includes 800 billion tokens of image-text paired/alternating data, sourced from web pages, books, and proprietary resources.
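As a quick consistency check on these figures, the sketch below converts each audio subset's token count and duration into tokens per second. The dictionary and helper names are ours; the numbers are the ones stated above.

```python
# (tokens, hours) for each audio subset, as stated in the text.
AUDIO_SUBSETS = {
    "audio continuation": (1.1e12, 7_300_000),
    "TTS": (113e9, 700_000),
    "ASR": (105e9, 650_000),
    "audio-text interleaved": (350e9, 2_000_000),
}


def tokens_per_second(tokens: float, hours: float) -> float:
    """Average number of tokens per second of audio."""
    return tokens / (hours * 3600)


rates = {name: tokens_per_second(t, h) for name, (t, h) in AUDIO_SUBSETS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} tokens/s")
```

The pure audio continuation data comes out at roughly 42 tokens per second, close to the combined dual-codebook rate of 16.7 + 25 = 41.7 tokens per second. The other subsets yield somewhat higher values, presumably because their token counts also include text tokens, although the text does not state this explicitly.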
4.2 Training Detail
Step-Audio is a component of Step-Omni, which is designed as a unified pretrained model for speech, image, and text. Training starts from a pretrained text model and a pretrained image encoder and proceeds as continued pre-training. The entire process is divided into three stages.
• Stage 1: We expanded the vocabulary of the pretrained text model by adding 5,120 audio tokens and integrated a pretrained image encoder to form the Step-Omni model. During training, to minimize the loss of the text model's capabilities, the learning rate of the text model backbone is kept at a low level (2e-5) throughout. The learning rates for the embedding and language model (LM) head, however, are set five times higher than the backbone's to facilitate faster convergence of the newly added tokens. Meanwhile, the image encoder remains frozen during the entire training process. At this stage, audio, text, and image data are used in a 2:1:1 ratio, with audio data consisting solely of pure audio continuation tasks.
• Stage 2: After training on 1.2T tokens in Stage 1, we incorporate audio-text interleaved data for further training, with a 1:1 ratio of audio continuation data to audio-text interleaved data. During this stage, the ratio of audio, text, and image data remains 2:1:1.
• Stage 3: After training on 800B tokens in Stage 2, we incorporate ASR and TTS data for further training. The ratio of audio continuation data, audio-text interleaved data, ASR data, and TTS data is set to 1:1:1:1. During this stage, the ratio of audio, text, and image data is adjusted to 4:3:3. Additionally, the learning rates for the embedding and LM head are synchronized with the backbone, following a cosine schedule that decreases from 2e-5 to 5e-6.
We employ the same pre-training strategy across models of varying parameter scales.
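The learning-rate settings of the three stages can be written down as a small helper. This is a sketch under stated assumptions: "five times higher" is read as a 5x multiplier, Stage 2 is assumed to keep the Stage 1 settings (the text only says the rates are synchronized in Stage 3), the frozen image encoder is represented by a learning rate of zero, and the cosine schedule is assumed to be the standard half-cosine over the stage. The function and key names are ours.

```python
import math
from typing import Dict

BACKBONE_LR = 2e-5    # backbone learning rate in Stages 1 and 2
HEAD_MULTIPLIER = 5   # embedding / LM head multiplier (assumed to mean 5x)
FINAL_LR = 5e-6       # end point of the Stage 3 cosine schedule


def learning_rates(stage: int, progress: float = 0.0) -> Dict[str, float]:
    """Per-group learning rates; `progress` in [0, 1] is only used in Stage 3."""
    if stage in (1, 2):
        # Assumption: Stage 2 reuses the Stage 1 learning rates.
        head_lr = BACKBONE_LR * HEAD_MULTIPLIER
        return {"backbone": BACKBONE_LR, "embedding_and_lm_head": head_lr, "image_encoder": 0.0}
    if stage == 3:
        # Assumed half-cosine decay from BACKBONE_LR down to FINAL_LR.
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        lr = FINAL_LR + (BACKBONE_LR - FINAL_LR) * cosine
        return {"backbone": lr, "embedding_and_lm_head": lr, "image_encoder": 0.0}
    raise ValueError(f"unknown stage: {stage}")
```

For example, `learning_rates(1)` gives 2e-5 for the backbone and 1e-4 for the embedding and LM head, while `learning_rates(3, progress=0.5)` gives 1.25e-5 for both groups under the assumed half-cosine shape.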
4.3 Training Infrastructure
We train Step-Omni on thousands of H800 GPUs with 35% Model FLOPs Utilization (MFU). Beyond standard optimizations such as tailored GPU kernels and communication overlap, we highlight two approaches that further enhance our training efficiency.
Disaggregated Data Processing. The processing of multi-modal data in Step-Omni training is computationally intensive, often requiring substantial CPU resources to keep pace with the model training speed. Conventional implementations typically co-locate data processing tasks with training jobs, leading to significant interference between these tasks and ultimately slowing down the training process. To address this issue, we introduce StarWeaver, an RPC-based distributed data processing library. StarWeaver relocates CPU-intensive data pre-processing tasks to remote processes, thereby alleviating the computational burden on the GPU training side and enhancing overall training efficiency. StarWeaver also improves load balancing along the data-parallel dimension, since it is a natural place to redistribute data using global workload information.
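The text does not describe how StarWeaver redistributes data, so the following is only a generic illustration of load balancing with global workload information, not StarWeaver's API or algorithm. It uses a standard greedy heuristic (largest cost first, always assigned to the least-loaded rank); the function name and the notion of a per-sample cost (for example, token count) are assumptions.

```python
import heapq
from typing import List


def balance(sample_costs: List[int], num_ranks: int) -> List[List[int]]:
    """Assign sample indices to data-parallel ranks so that total cost is roughly even.

    Samples are visited from most to least expensive, and each one is given
    to the rank with the smallest accumulated cost so far.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (accumulated cost, rank)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_ranks)]
    order = sorted(range(len(sample_costs)), key=lambda i: -sample_costs[i])
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + sample_costs[idx], rank))
    return assignment
```

A centralized data service is a convenient place for this kind of decision because it sees the cost of every sample in the global batch, whereas a data loader co-located with one training process sees only its own shard.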
Disaggregated Model Placement. For multi-modal models such as Step-Omni, training typically involves not only the LLM but also modality encoders (e.g., a vision encoder). Integrating these diverse components challenges the conventional assumption of training frameworks that the model is homogeneous and monolithic. This mismatch often results in suboptimal training efficiency. To address this issue, we propose disaggregated model placement, which allocates dedicated resources and employs tailored parallelism strategies for each sub-model. This approach effectively minimizes pipeline bubbles caused by model heterogeneity, thereby improving training efficiency. Details can be found in Zhang et al. (2024).
4.4 Exploring Tokenizers for Audio Pre-training
To unify speech understanding and generation, we first explored the choice of speech tokenizer, starting with training on a single codebook. In our experiments, we found that when training the model using only semantic tokens, the next-token prediction perplexity is relatively low, and the semantic coherence between the generated content and the preceding context is good. However, because a significant amount of acoustic information is lost in this setting, the audio subsequently restored by the vocoder suffers severe degradation in timbre and prosody, resulting in poor auditory quality. When training with only linguistic tokens, the audio recovered by the vocoder from the model's continuation sounds good, but the next-token prediction perplexity is very high, and the semantic coherence between the continuation and the preceding context is poor. When training with interleaved semantic and linguistic tokens, the semantic tokens ensure the semantic coherence of the continuation with the preceding context, while the linguistic tokens ensure the auditory quality of the reconstructed audio. Because semantic and linguistic tokens can reference each other, we observed that dual-codebook training lowers the next-token prediction perplexity for both token types compared to single-codebook training, as shown in Figure 4. Notably, the decrease in perplexity is more significant for semantic tokens. Furthermore, ASR ablation results indicate that the dual-codebook model achieves a lower character error rate (CER) on the ASR test set than the single-codebook model (see Section 6.2.1).

Figure 4: Training loss comparison between Dual-Codebook and Single Codebook Tokenizer.
图 4: 双码本与单码本分词器的训练损失对比。
Furthermore, grouping and interleaving linguistic and semantic discrete tokens in a 2:3 ratio speeds up the convergence of the training loss. More importantly, extending the CosyVoice semantic tokens with linguistic tokens improves the model’s ability to understand and follow multi-turn history instructions and mitigates issues such as unclear pronunciation and indistinct articulation, significantly improving on the performance of CosyVoice’s single codebook.
此外,以2:3的比例分组和交错语言离散Token和语义离散Token,可以加速训练损失的收敛。更重要的是,通过语言Token扩展CosyVoice的语义Token,增强了模型理解和遵循多轮历史指令的能力,同时也缓解了发音不清和表达模糊等问题,显著提升了CosyVoice单代码的性能。
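The 2:3 grouping can be illustrated with a short sketch. It assumes the two token streams are already time-aligned, so that each group of 2 linguistic tokens covers the same audio span as 3 semantic tokens, and that the two codebooks use disjoint ID ranges. The function name `interleave_dual_codebook` and the handling of a trailing partial group are illustrative assumptions, not details given in the text.

```python
from typing import List, Sequence


def interleave_dual_codebook(
    linguistic: Sequence[int],
    semantic: Sequence[int],
    ling_group: int = 2,
    sem_group: int = 3,
) -> List[int]:
    """Interleave two token streams in fixed-size groups (2:3 by default).

    Each step emits `ling_group` linguistic tokens followed by
    `sem_group` semantic tokens. A trailing partial group is kept as-is
    (an assumption; the source does not say how the tail is handled).
    """
    merged: List[int] = []
    i = j = 0
    while i < len(linguistic) or j < len(semantic):
        merged.extend(linguistic[i:i + ling_group])
        merged.extend(semantic[j:j + sem_group])
        i += ling_group
        j += sem_group
    return merged


# Hypothetical token IDs: linguistic tokens 1..4, semantic tokens 10..15.
sequence = interleave_dual_codebook([1, 2, 3, 4], [10, 11, 12, 13, 14, 15])
```

With these inputs, the merged sequence places two linguistic tokens before every three semantic tokens, which is the ordering the language model would see during next-token prediction.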
5 Post-Training
5 训练后
5.1 TTS
5.1 文本到语音 (TTS)
5.1.1 Dataset
5.1.1 数据集
High-quality speech data is crucial for the TTS task, as it directly impacts the model’s performance and the expressiveness of the generated speech. Language-specific data, dialect data, speaking styles, emotional data, and paralinguistic data are extremely scarce. Constructing such datasets demands substantial human and financial resources, and the process generally spans an extended period.
高质量的语音数据对于 TTS 任务至关重要,因为它直接影响模型的性能和生成语音的表现力。特定语言数据、方言数据、说话风格、情感数据和副语言数据极为稀缺。构建此类数据集需要大量的人力和财力资源,且该过程通常需要较长的时间。
To address this gap, we present the first synthetic-data-driven framework for TTS systems, comprising three key components:
为了解决这一差距,我们提出了首个基于合成数据的TTS系统框架,该框架包含三个关键组件:
• First, we employ a Step-2 (StepFun, 2024b) LLM to generate linguistically diverse and semantically rich textual content.
• Second, we selected a pre-trained Step-Audio model checkpoint incorporating audio-token cooldown mechanisms, which enables direct generation of speaker-specific, language-dependent, and dialect-aware audio data.
• Third, we developed an Audio-Edit Model by fine-tuning the aforementioned checkpoint, specifically designed to generate nuanced emotional expressions and diverse speaking styles. This model architecture allows for precise control over paralinguistic features while maintaining speaker consistency.
• 首先,我们使用 Step-2 (StepFun, 2024b) 大语言模型生成语言多样且语义丰富的文本内容。
• 其次,我们选择了一个预训练的 Step-Audio 模型检查点,该检查点结合了音频 Token 冷却机制,能够直接生成特定说话者、语言相关且方言感知的音频数据。
• 第三,我们通过微调上述检查点开发了一个 Audio-Edit 模型,专门设计用于生成细腻的情感表达和多样的说话风格。该模型架构允许在保持说话者一致性的同时,精确控制副语言特征。


Figure 5: The process starts with text input, which is processed by a Step-2 LLM to generate multiple rewritten texts. Then, a Step-Audio model generates target-speaker data using the rewritten texts and existing audio wav data. Finally, an Audio-Edit model refines the data to produce emotion/style data, addressing the scarcity of high-quality speech data in TTS tasks.
图 5: 该过程从文本输入开始,由 Step-2 大语言模型处理生成多个重写文本。然后,Step-Audio 模型使用重写文本和现有的音频 wav 数据生成目标说话者的数据。最后,Audio-Edit 模型对数据进行优化,生成情感/风格数据,解决了 TTS 任务中高质量语音数据稀缺的问题。
Language and Dialect Leveraging the robust continuation ability of Step-Audio, which has been trained on large volumes of speaker and language data, we generate target-speaker, language, and dialect data. The text-based LLM Step-2 translates and rewrites chat text to conform to the grammar and style of the target language or dialect. We collect audio recordings and texts from native speakers as prompt audio and prompt text, and then run audio continuation with Step-Audio using the format [system prompt; prompt text; target text; prompt code; target code]. This method quickly yields a large amount of native-speaker data for the target language or dialect from only a small quantity of high-quality seed data.
语言与方言
利用StepAudio在大量说话者和语言数据上训练的强大续写能力,我们生成目标说话者、语言和方言的数据。基于文本的大语言模型Step-2用于翻译和重写聊天文本,以符合目标语言或方言的语法和风格。我们收集来自母语者的录音和文本作为提示音频和文本,然后使用格式[系统提示;提示文本;目标文本;提示代码;目标代码] 以及相应的文本,我们使用 Step-Audio 进行音频延续生成。这种方法只需少量的高质量种子数据,就能快速生成大量目标语言和方言的母语数据。
Emotion and Speaking Styles Emotion and speaking-style data are difficult to build because emotion categories and their intensities are hard to define and tell apart, and because style types are hard to describe and record accurately. To address this, we propose an approach based on the Audio-Edit model that recasts complex emotion and style descriptions as the construction of comparative data pairs. Step-2 rewrites chat text with specific emotions and styles. We collect neutral and emotional speech samples from the same speaker reading identical text, and use Step-Audio for cloning and continuation generation to create (text, neutral audio token, emotion and style audio token) data. Only the (neutral audio token, emotion and style audio token) pairs are used for SFT on the audio-cooldown pre-trained model, which yields the Audio-Edit model. Given neutral-style speech as input, this model generates emotion- or style-enhanced audio, and applying it iteratively produces data with different emotion or style intensities.
情感与说话风格
Singing and RAP We construct a paired dataset of lyrics and vocal segments in three stages: (1) collecting 10,000+ hours of singing/RAP tracks with LyRiCs-format timestamps; (2) extracting dry vocals using Demucs (Rouard, Massa, & Défossez, 2023) and removing silent regions via Voice Activity Detection (VAD); (3) segmenting audio using the LyRiCs timestamps and aligning lyrics with audio segments. Data cleaning also has three steps: (1) RAP Separation: we isolate pure RAP segments by retaining those with higher speech rates and using a genre classification model to identify hip-hop clips; (2) Audio Quality Filtering: using noise detection and speaker diarization, we keep low-noise, single-speaker segments; (3) Alignment Verification: to address misalignment caused by inaccurate LyRiCs timestamps, we compute the Character Error Rate (CER) between transcribed speech and ground-truth lyrics and discard misaligned segments. Ultimately, the retained audio segments total 17.8% of the original song durations. This dataset supports dual training objectives: the LLM learns to map lyrics to linguistic and semantic tokens, while the speech decoder decodes these tokens into high-fidelity vocals with precise tunes.
歌唱与说唱
我们通过三个阶段构建了一个歌词与人声片段的配对数据集:(1) 收集了超过10,000小时的歌唱/说唱音轨,并带有LyRiCs格式的时间戳;(2) 使用Demucs (Rouard, Massa, & D´efossez, 2023) 提取干声,并通过语音活动检测 (VAD) 去除静音区域;(3) 使用LyRiCs时间戳分割音频,并将歌词与音频片段对齐。在数据清洗方面,我们进行了三个步骤:(1) 说唱分离:通过保留语速较高的片段,并使用流派分类模型识别嘻哈片段,分离出纯说唱片段;(2) 音频质量过滤:利用噪声检测和说话人分离技术,保留低噪声、单人说话的片段;(3) 对齐验证:为了解决由于LyRiCs时间戳不准确导致的错位问题,我们计算了转录语音与真实歌词之间的字符错误率 (CER),并丢弃错位的片段。最终,保留的音频片段总长度占原歌曲时长的17.8%。该数据集支持双重训练目标:大语言模型学习将歌词映射到语言和语义Token,而语音解码器则将这些Token解码为高保真且音调精确的人声。
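The alignment-verification step can be sketched as a CER filter over (lyric, transcript) pairs. This is a minimal illustration: the threshold `max_cer=0.2` and the function names are hypothetical, since the text states neither the threshold nor the ASR system used for transcription.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_aligned(
    segments: List[Tuple[str, str]], max_cer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (lyric, transcript) pairs whose CER is at most `max_cer`.

    `max_cer` is a hypothetical threshold chosen for illustration.
    """
    return [(lyric, asr) for lyric, asr in segments if cer(lyric, asr) <= max_cer]
```

A segment whose transcript diverges strongly from its lyric line is treated as misaligned and dropped, which matches the intent of discarding segments with inaccurate timestamps.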
Target Speaker Relying on model generalization from foundational language and dialect data to support multiple languages or dialects for a target speaker is challenging, as the result often falls short of native-speaker quality. To mitigate this, we extract dual codes from audio generated by native speakers whose timbre and prosody are similar to the target speaker’s. These dual codes are combined with the target speaker’s prompt audio to regenerate new audio, from which dual codes are extracted again. Through this straightforward procedure, the target speaker’s speech in new languages and dialects becomes closer to that of a native speaker.
目标说话者支持多语言或方言的目标说话者通过模型泛化基础语言和方言数据具有挑战性,因为它通常无法达到母语者的水平。为了缓解这一问题,我们采用了从与目标说话者音色和韵律相似的母语者生成的音频中提取的双重编码。这些双重编码与目标说话者的提示音频结合,重新生成新的音频,然后再次从中提取双重编码。通过这一简单流程,目标说话者在新语言和方言中的发音更接近母语者。
Quality assessment of data constitutes a critical component of our synthetic data framework. To ensure the reliability and validity of both seed and synthesized data, we have implemented a comprehensive evaluation system incorporating multiple objective metrics: ASR accuracy, Voice Activity Detection (VAD) performance, speaker diarization precision, emotion recognition consistency, and Deep Noise Suppression (DNS) effectiveness. This multi-dimensional quality control mechanism supports the robustness and practical utility of the generated synthetic data.
数据质量评估是我们合成数据框架中的关键组成部分。为了确保种子数据和合成数据的可靠性与有效性,我们实施了一个综合评估系统,包含多个客观指标:ASR准确率、语音活动检测 (Voice Activity Detection, VAD) 性能、说话人分离精度、情绪识别一致性以及深度噪声抑制 (Deep Noise Suppression, DNS) 效果。这种多维度的质量控制机制保证了生成合成数据的鲁棒性和实用性。
5.1.2 Training Detail
5.1.2 训练细节
In contrast to conventional TTS systems that emphasize fine-grained control over speaker characteristics, emotional expression, linguistic features, and stylistic elements, our approach adopts the chat-based paradigm and training methodology of LLMs. This alignment significantly enhances system flexibility while establishing a scalable framework to support future model and data expansion, thereby addressing the critical challenge of scalability in speech synthesis systems.
与传统的 TTS 系统强调对说话者特征、情感表达、语言特征和风格元素的精细控制不同,我们的方法采用了大语言模型的基于聊天的范式与训练方法。这种策略上的对齐显著增强了系统的灵活性,同时建立了一个可扩展的框架来支持未来的模型和数据扩展,从而解决了语音合成系统中可扩展性的关键挑战。
Supervised Fine-Tuning Format The SFT format comprises three essential components: the system prompt, the human input, and the assistant response, structured as a two-turn dialogue. Within this format, the system prompt specifies the speaker attributes and defines the supported instruction tags. The human input and the assistant response carry the textual content and the dual-codebook representations, respectively. The text and audio tokens from the first round can be used to maintain the in-domain speaker’s timbre and style consistency, as well as to enable out-of-domain zero-shot cloning.
监督微调格式
sft 格式由三个基本组成部分构成:系统提示、人类输入和助手响应,采用两轮对话配置进行结构化。在此格式中,系统提示作为指定说话者属性和定义支持的指令标签的基础元素。人类输入和助手响应组件分别设计用于处理文本内容和双码本表示。第一轮的文本和音频 token 可用于保持域内说话者的音色和风格一致性,以及实现域外零样本克隆。
Instruction Tags Instruction tags fall into two categories: descriptive tags and comparative tags. Descriptive tags control aspects such as language, dialect, vocal, and style, while comparative tags express graded levels of emotion and speed. The data for descriptive tags are generated using the cloning capability of the Step-Audio model, supporting languages and styles including Japanese, Korean, Cantonese, Sichuan dialect, cute voice, RAP, and singing. The data for comparative tags are generated using the Audio-Edit model, supporting emotions such as happiness, anger, and sadness, and speed variations such as fast and slow, each divided into five hierarchical levels.
指令标签
指令标签分为两大类:描述性标签和比较性标签。描述性标签用于控制语言、方言、声音和风格等方面,而比较性标签用于情感和速度控制的层次区分。描述性标签的数据是使用 Step-Audio 模型克隆生成的,支持的语言和风格包括日语、韩语、粤语、四川方言、可爱声音、RAP 和唱歌。比较性标签的数据是使用 Audio Edit 模型生成的,支持的情感包括快乐、愤怒、悲伤,以及速度变化如快和慢,每种情感和速度变化都分为五个层次级别。
We employ the SFT data described in Section 5.1.1 and train a 3-billion-parameter model for one epoch with an initial learning rate of $2\times10^{-5}$. The learning rate follows a cosine decay schedule with a lower bound of $2\times10^{-6}$.
我们采用第5.1.1节中描述的SFT数据,并使用一个30亿参数模型,以 $2\times10^{-5}$ 的初始学习率训练一个周期。学习率采用余弦衰减策略进行调整,下限设置为 $2\times10^{-6}$。
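A cosine decay schedule with a lower bound can be written compactly. The sketch below assumes no warmup phase and that the decay spans the whole training run, neither of which the text specifies; `cosine_lr` and `total_steps` are illustrative names.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int,
    lr_max: float = 2e-5,
    lr_min: float = 2e-6,
) -> float:
    """Cosine decay from `lr_max` at step 0 to `lr_min` at `total_steps`.

    Assumes no warmup (not stated in the source). Steps beyond
    `total_steps` stay at `lr_min`.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The same function covers the other cosine schedules reported in Section 5.2 by changing `lr_max` and `lr_min`.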
5.2 AQTA
5.2 AQTA
We applied Reinforcement Learning from Human Feedback (RLHF) for the AQTA task, leading to the creation of the Step-Audio-Chat model, as depicted in Figure 6.
我们为AQTA任务应用了从人类反馈中进行强化学习(Reinforcement Learning from Human Feedback, RLHF),从而创建了Step-Audio-Chat模型,如图6所示。


Figure 6: At each training iteration, we collect multiple responses from different versions of the model. High-quality pairs are then selected through manual scoring and LLM-based evaluation to train the reward model. Finally, we use the PPO algorithm to train the final Step-Audio-Chat model.
图 6: 在每次训练迭代中,我们从模型的不同版本中收集多个响应。然后,通过手动评分以及大语言模型 (LLM) 的评估,选择高质量的对来训练奖励模型。最后,我们使用 PPO 算法来训练最终的 Step-Audio-Chat 模型。
5.2.1 SFT dataset
5.2.1 SFT 数据集
Data Types We categorized the SFT data into several types based on the nature of the input (Q) and output (A):
数据类型
To enhance speech recognition capabilities, we have incorporated additional training data annotated in ASR format alongside existing datasets. These ASR-formatted resources contain detailed transcriptions of speech signals, enabling models to better interpret phonetic patterns and linguistic nuances. The integration of such supplementary ASR-annotated data strengthens model robustness against acoustic variations including regional accents, speaking rate fluctuations, and ambient noise interference.
为了增强语音识别能力,我们在现有数据集的基础上加入了以ASR格式标注的额外训练数据。这些ASR格式的资源包含语音信号的详细转录,使模型能够更好地解析语音模式和语言细微差别。通过整合这些补充的ASR标注数据,模型的鲁棒性得到了增强,能够更好地应对包括地区口音、语速变化和环境噪声干扰在内的声学变化。
Data Processing To optimize the SFT data for effective model training, we implemented the following processing steps:
数据处理
• Single-Turn Data Modification: For single-turn interactions, we applied text length filtering to the inputs. This is because real-world user speech inputs are often concise. Additionally, we modified the outputs to adopt a more conversational text style, enhancing the speech model’s human-like qualities and avoiding rigid, verbose, or overly structured responses.
• 单轮数据修改:对于单轮交互,我们对输入进行了文本长度过滤。这是因为现实世界中的用户语音输入通常较为简洁。此外,我们修改了输出,使其采用更对话式的文本风格,增强了语音模型的拟人化特性,避免了生硬、冗长或过于结构化的响应。
• Multi-Turn Data Processing: For multi-turn interactions, we replaced the speech inputs from previous turns with their corresponding text transcriptions. Only the speech input from the final turn was retained. Furthermore, only the response from the final turn was considered in the loss calculation, focusing the model’s training on generating accurate and relevant responses to the most recent input.
• 多轮数据处理:对于多轮交互,我们将前几轮的语音输入替换为相应的文本转录。仅保留最后一轮的语音输入。此外,仅在损失计算中考虑最后一轮的响应,使模型训练集中在生成对最新输入的准确和相关的响应上。
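The multi-turn rule above (loss only on the final response) amounts to a per-token loss mask. The following sketch assumes each turn is already tokenized, with earlier user turns holding text-transcript tokens and the final user turn holding audio tokens; the `Turn` type and `build_loss_mask` are hypothetical names used for illustration.

```python
from typing import List, Tuple

Turn = Tuple[str, List[int]]  # (role, token_ids); role is "user" or "assistant"


def build_loss_mask(turns: List[Turn]) -> Tuple[List[int], List[int]]:
    """Concatenate turns and mark only the last assistant turn for the loss.

    Returns (input_ids, loss_mask), where loss_mask[i] is 1 if token i
    contributes to the training loss and 0 otherwise.
    """
    last_assistant = max(
        (idx for idx, (role, _) in enumerate(turns) if role == "assistant"),
        default=-1,
    )
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for idx, (_, tokens) in enumerate(turns):
        input_ids.extend(tokens)
        loss_mask.extend([1 if idx == last_assistant else 0] * len(tokens))
    return input_ids, loss_mask
```

Earlier turns still appear in the context, so the model conditions on the full history while gradients come only from the most recent response.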
Through this systematic approach to SFT data construction and processing, we aimed to create a diverse and representative dataset that would enable our speech model to achieve superior performance in real-world scenarios, delivering natural, coherent, and contextually appropriate responses to user inputs.
通过这种系统化的SFT数据构建和处理方法,我们旨在创建一个多样化且具有代表性的数据集,使我们的语音模型能够在实际场景中实现卓越性能,对用户输入生成自然、连贯且上下文恰当的响应。
5.2.2 Supervised Fine-Tuning Details
5.2.2 监督微调细节
We use the SFT data described in Section 5.2.1. The model is fine-tuned for 1 epoch, with the learning rate going from $5.656\times10^{-5}$ to $5.656\times10^{-6}$.
我们使用第5.2.1节中描述的SFT数据。模型以学习率从$5.656\times10^{-5}$到$5.656\times10^{-6}$进行1个周期的微调。
5.2.3 Reward Model dataset
5.2.3 奖励模型数据集
TQTA Preference Data Construction We collected human preference data from TQTA models (e.g., Step-1 and Step-2) and removed categories that are underrepresented in speech dialogues, such as code and mathematics. We mainly retained categories such as daily conversations, role-playing, safety, and instruction following.
TQTA偏好数据构建
我们收集了由TQTA模型生成的人类偏好数据(例如,Step-1和Step-2),并移除了在语音对话中分布较少的类别,如代码和数学。我们主要保留了日常对话、角色扮演、安全性和指令遵循等类别。
AQTA Preference Data Construction For the fine-tuning dataset, we first collected real audio prompts from users and sampled four responses per prompt using the SFT model. We then constructed chosen/rejected pairs by having human annotators rate these four responses on a scale of 1 to 5, based on the criteria of instruction following, conversational naturalness, and safety. In addition to these human-annotated labels, we employed the LLM-as-a-Judge method to score the model’s responses to objective questions and created corresponding chosen/rejected pairs based on the responses’ correctness. To mitigate the pattern bias associated with “deaf hacking” described in Section 5.2.6, we used the hacked PPO model to generate responses for clear audio prompts. If the responses exhibited hacking behavior, we used them as rejected responses. This process aimed to eliminate the pattern bias caused by “deaf hacking” responses appearing exclusively as chosen responses in the reward model’s training data.
AQTA偏好数据构建
对于微调数据集,我们首先从用户那里收集了真实的音频提示,并使用SFT模型生成了四种响应。然后,我们让人工标注者根据指令遵循、对话自然性和安全性等标准,对这四种响应进行1到5分的评分,从而构建了选择/拒绝对。除了这些人工生成的标签外,我们还采用了LLM-as-a-Judge方法对模型在客观问题上的响应进行评分,并根据响应的正确性创建相应的选择/拒绝对。为了减轻5.2.6中描述的“聋哑黑客”相关的模式偏差,我们使用了被破解的PPO模型为具有清晰音频提示的输入生成响应。如果响应表现出黑客行为,我们将其构建为拒绝响应。这一过程旨在消除奖励模型训练数据中仅存在“聋哑黑客”作为选择响应所带来的模式偏差。
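One way to turn 1-to-5 ratings of four sampled responses into chosen/rejected pairs is sketched below. The pairing rule (all response pairs with a score gap of at least `min_gap`) is an assumption made for illustration; the text does not state how pairs are selected from the ratings.

```python
from itertools import combinations
from typing import List, Tuple


def build_pairs(
    responses: List[str], scores: List[int], min_gap: int = 1
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs from rated responses.

    Two responses form a pair when their scores differ by at least
    `min_gap` (a hypothetical parameter); the higher-rated one is chosen.
    """
    pairs: List[Tuple[str, str]] = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(zip(responses, scores), 2):
        if abs(score_a - score_b) < min_gap:
            continue  # tied or near-tied ratings carry no preference signal
        if score_a > score_b:
            pairs.append((resp_a, resp_b))
        else:
            pairs.append((resp_b, resp_a))
    return pairs
```

Responses flagged as “deaf hacking” on clear prompts could be added to the same structure as rejected entries, in line with the mitigation described above.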
5.2.4 Reward Model Training Details
5.2.4 奖励模型训练细节
We implement a two-stage approach for reward model training: TQTA single-modal preference model pre-training, followed by AQTA cross-modal fine-tuning. The model is fine-tuned for 1 epoch on TQTA and 1 epoch on AQTA. The learning rate is adjusted using a cosine decay strategy, initialized at $1.24\times10^{-5}$ with a lower bound of $6\times10^{-6}$.
我们采用两阶段方法进行奖励模型训练:先进行TQTA单模态偏好模型预训练,然后进行AQTA跨模态微调。模型在TQTA上微调1个epoch,在AQTA上微调1个epoch。学习率采用余弦衰减策略进行调整,初始值为 $1.24\times10^{-5}$ ,下限设置为 $6\times10^{-6}$ 。
The reward model is initialized from an SFT model and trained through the two stages with the Bradley-Terry loss (Bradley & Terry, 1952), achieving a pair-wise accuracy of 70.51% on the human preference test set.
奖励模型训练从 SFT 模型初始化,并使用 Bradley-Terry 损失 (Bradley & Terry, 1952) 进行两阶段训练,在人类偏好测试集上达到了 70.51% 的配对准确率。
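For reference, the Bradley-Terry loss on a single preference pair is $-\log\sigma(r_c - r_r)$, where $r_c$ and $r_r$ are the scalar rewards of the chosen and rejected responses. The sketch below computes this loss and the pair-wise accuracy metric on plain floats; it illustrates the standard formulation and is not the authors' training code.

```python
import math
from typing import List, Tuple


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(reward_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (chosen, rejected) reward pairs ranked correctly."""
    correct = sum(1 for r_c, r_r in reward_pairs if r_c > r_r)
    return correct / len(reward_pairs)
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and pair-wise accuracy counts how often that ordering holds on held-out pairs.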
5.2.5 PPO dataset
5.2.5 PPO数据集
For the PPO training data, we used the same prompt seeds as those employed in the AQTA fine-tuning stage of the reward model.
在PPO训练数据中,我们使用了与奖励模型AQTA微调阶段相同的提示种子。
5.2.6 PPO Training Details
5.2.6 PPO 训练细节
After obtaining the reward model, we employ the PPO algorithm (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017) to train the speech large language model. During the RLHF training stage, the critic model is first warmed up for 80 training steps. We use a PPO clip threshold of $\epsilon = 0.2$ and an initial learning rate of $1\times10^{-6}$, which decays using a cosine strategy to a minimum learning rate of $2\times10^{-7}$. Additionally, we set the KL penalty coefficient to $\beta=0.05$.
在获得奖励模型后,我们采用PPO (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017)算法来训练语音大语言模型。在RLHF训练阶段,批评模型通过初始80个训练步骤进行预热。我们采用PPO剪裁阈值为$\epsilon = 0.2$,初始学习率为$1\times10^{-6}$,并使用余弦策略进行衰减,最小学习率为$2\times10^{-7}$。此外,我们将KL惩罚系数设置为$\beta=0.05$。
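The text gives the clip threshold and the KL coefficient but not how the KL penalty enters the objective. The sketch below assumes a common RLHF formulation in which a per-sample KL estimate is subtracted from the reward and the policy is updated with the clipped surrogate objective of Schulman et al. (2017). Both function names are illustrative.

```python
import math


def ppo_clip_term(
    logp_new: float, logp_old: float, advantage: float, eps: float = 0.2
) -> float:
    """Clipped surrogate objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the minimum makes the objective pessimistic: large policy
    # changes gain nothing beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_shaped_reward(
    reward: float, logp_policy: float, logp_ref: float, beta: float = 0.05
) -> float:
    """Reward minus beta times a sample estimate of KL(policy || reference).

    Applying the KL penalty through the reward is an assumption; the
    source only states the coefficient beta = 0.05.
    """
    return reward - beta * (logp_policy - logp_ref)
```

With $\epsilon = 0.2$, a sample with positive advantage stops contributing extra objective once the probability ratio exceeds 1.2, which limits how far each update can move the policy away from the one that generated the responses.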
Unlike the reward hacking observed in the RLHF training of TQTA models, we found that a reward model trained exclusively on human-annotated AQTA preference data exhibited a “deaf hacking” phenomenon (i.e., the reward model assigned high reward to responses containing phrases like “I didn’t hear clearly” regardless of input audio clarity, unintentionally reinforcing deaf hacking patterns during RLHF training). We attributed this issue to a pattern bias in the reward model’s training data, which exclusively featured “deaf hacking” pairs: the model responded with “I didn’t hear clearly” to unclear or semantically incomplete prompts as chosen responses but lacked such responses to clear and semantically complete prompts as rejected ones. To mitigate this bias, we constructed corresponding data as mentioned in Section 5.2.3. We also plan to introduce rule-based rewards during RLHF training to eliminate “deaf hacking” in future work.
与在 TQTA 模型的 RLHF 训练中观察到的奖励攻击不同,我们发现,仅在人工标注的 AQTA 偏好数据上训练的奖励模型表现出了一种“聋攻击”现象(即,无论输入音频清晰度如何,奖励模型都会为包含“我没听清楚”等短语的响应分配高奖励,无意中在 RLHF 训练期间强化了聋攻击模式)。我们将此问题归因于奖励模型训练数据中的模式偏差,该数据仅包含“聋攻击”对:模型对不清晰或语义不完整的提示做出“我没听清楚”的响应作为被选中的响应,但对清晰且语义完整的提示缺乏此类响应作为被拒绝的响应。为了减轻这种偏差,我们构建了第 5.2.3 节中提到的相应数据。我们还计划在未来的工作中在 RLHF 训练期间引入基于规则的奖励,以消除“聋攻击”。
6 Evaluation
6 评估
6.1 Benchmark Design
6.1 基准设计
We have created a new benchmark, StepEval-Audio-360, following a set of design rules. In terms of design principles, this benchmark aims to fill the gaps in the evaluation of multi-modal speech interaction, systematically identify the strengths and weaknesses of the models, and attach importance to user experience and safety. For data collection, real user recordings are used in combination with public corpora. Meanwhile, strict control is exercised over audio quality and semantic annotation, and privacy compliance is ensured.
我们创建了一个名为 StepEval-Audio-360 的新基准,遵循一系列规则。在设计原则方面,该基准旨在填补多模态语音交互评估中的空白,系统地识别模型的优缺点,并重视用户体验和安全性。在数据收集方面,使用真实用户录音与公开语料库相结合。同时,对音频质量和语义标注进行严格控制,以确保符合隐私要求。
The evaluation dimensions mainly cover language proficiency, emotional intelligence, logical reasoning, creativity, multi-instruction following, role-playing, safety, etc. Demographic differences (age/gender/dialect), environmental conditions (noise level/microphone type), and prosodic features (speech rate/pronunciation pattern) are also taken into account. The indicator system combines quantitative analysis, in which scripts automatically verify indicators such as accuracy rate and repetition rate, with evaluation by large language models and humans. In addition, the benchmark is updated quarterly to stay current and is adjusted in light of user feedback.
评估维度主要涵盖语言能力、情商、逻辑推理、创造力、多指令跟随、角色扮演、安全性等。人口统计差异(年龄/性别/方言)、环境条件(噪音水平/麦克风类型)和韵律特征(语速/发音模式)也被考虑在内。指标体系架构结合定量分析,使用脚本自动验证准确率和重复率等指标,同时也涉及大语言模型和人类的评估。此外,基准每季度更新一次,以避免落后,并根据用户反馈进行调整。
6.2 Results
6.2 结果
6.2.1 ASR
6.2.1 ASR
We conducted validation experiments with a 3B model to compare the performance of Semantic Code and Dual-Code on ASR (Automatic Speech Recognition) tasks. While keeping the same amount of audio training data, the Character Error Rate (CER) of the Dual-Code approach improved from 25.5 to 18.4. This demonstrates that the Dual-Code method significantly enhances performance on ASR tasks.
我们使用 3B 模型进行了验证实验,以比较语义编码 (Semantic Code) 和双编码 (Dual-Code) 在自动语音识别 (ASR) 任务上的性能。在保持相同音频训练数据量的情况下,双编码方法的字错误率 (CER) 从 25.5 降低到了 18.4。这表明双编码方法显著提升了 ASR 任务的性能。
We tested the performance of the model at two stages: first, the pre-trained model (Step-Audio Pretrain); and second, the chat model (Step-Audio-Chat) after human preference alignment, using the system prompt “请记录下你所听到的语音内容。” (“Please transcribe the speech content you hear.”). The evaluation datasets include Aishell1, Aishell2 ios, WenetSpeech test-net, WenetSpeech test-meeting, LibriSpeech test-clean, and LibriSpeech test-other. For the evaluation metrics, we employ the character error rate (CER) for Chinese and the word error rate (WER) for English. We systematically compared the following two categories of mainstream large speech models:
我们在两个阶段测试了模型的性能:首先是预训练模型 (Step-Audio Pretrain),其次是在人类偏好对齐后的聊天模型 (Step-Audio-Chat),使用系统提示“请记录下你所听到的语音内容。”。评估数据集包括 Aishell1、Aishell2 ios、WenetSpeech test-net、WenetSpeech test-meeting、LibriSpeech test-clean 和 LibriSpeech test-other。对于评估指标,我们采用中文的字错误率 (CER) 和英文的词错误率 (WER)。我们系统性地比较了以下两类主流大语音模型:
• Hidden feature modeling: Whisper Large-v3 (Radford et al., 2023), Qwen2-Audio (Chu et al., 2024a), MinMo (Q. Chen et al., 2025), LUCY (H. Gao et al., 2025);
• 隐藏特征建模:Whisper Large-v3 (Radford et al., 2023), Qwen2-Audio (Chu et al., 2024a), MinMo (Q. Chen et al., 2025), LUCY (H. Gao et al., 2025);
• Discrete audio token modeling: Moshi (Défossez et al., 2024), GLM-4-voice (Zeng et al., 2024), Step-Audio.
• 离散音频Token建模:Moshi (Défossez et al., 2024)、GLM-4-voice (Zeng et al., 2024)、Step-Audio。
Table 1: ASR result comparison
表 1: ASR 结果对比
| | Hidden Feature Modeling | | | | Discrete Audio Token Modeling | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Whisper Large-v3 | Qwen2-Audio | MinMo | LUCY | Moshi | GLM-4-voice Base | GLM-4-voice Chat | Step-Audio Pretrain | Step-Audio Chat |
| Aishell-1 | 5.14 | 1.53 | - | 2.4 | - | 2.46 | 226.47 | 0.87 | 1.95 |
| Aishell-2 ios | 4.76 | 3.06 | 2.69 | - | - | - | 211.3 | 2.91 | 3.57 |
| Wenetspeech test-net | 9.68 | 7.72 | 6.64 | 8.78 | - | - | 146.05 | 7.62 | 8.75 |
| Wenet test-meeting | 18.54 | 8.4 | 7.6 | 10.42 | - | - | 140.82 | 7.78 | 9.52 |
| Librispeech test-clean | 1.9 | 1.6 | 1.6 | 3.36 | 5.7 | 2.82 | 75.39 | 2.36 | 3.11 |
| Librispeech test-other | 3.65 | 3.6 | 3.82 | 8.05 | - | 7.66 | 80.3 | 6.32 | 8.44 |
| AVG | 7.28 | 4.32 | - | - | - | - | 146.74 | 4.64 | 5.89 |
The specific results are shown in Table 1. Among the audio token-based speech models, Step-Audio Pretrain achieved the best performance with an average CER of 4.64. Compared to the hidden feature models, Step-Audio Pretrain outperformed Whisper Large-v3 and achieved results comparable to Qwen2-Audio and MinMo, particularly on the clean test sets of Aishell1, Aishell2, and LibriSpeech test-clean, where Step-Audio Pretrain (average CER 2.05) was very close to Qwen2-Audio (average CER 2.06). This indicates that Step-Audio, through its dual-codebook compression strategy, has effectively preserved semantic information during the discretization of speech representation.
具体结果如表 1 所示。在基于音频 Token 的语音模型中,Step-Audio Pretrain 以 4.64 的平均 CER 取得了最佳性能。与隐特征模型相比,Step-Audio Pretrain 超越了 Whisper Large-v3,并在 Aishell1、Aishell2 和 LibriSpeech test-clean 的干净测试集上取得了与 Qwen2-Audio 和 MinMo 相当的结果,其中 Step-Audio Pretrain(平均 CER 2.05)与 Qwen2-Audio(平均 CER 2.06)非常接近。这表明 Step-Audio 通过其双码本压缩策略,在语音表示的离散化过程中有效保留了语义信息。
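The clean-set averages quoted above can be reproduced from the per-dataset figures in Table 1. The short check below copies those six values; note that the average mixes CER (the two Chinese sets) and WER (the English set), as the text does. The variable and function names are illustrative.

```python
from typing import Dict, List

CLEAN_SETS: List[str] = ["Aishell-1", "Aishell-2 ios", "Librispeech test-clean"]

# Error rates (%) copied from Table 1.
step_audio_pretrain: Dict[str, float] = {
    "Aishell-1": 0.87,
    "Aishell-2 ios": 2.91,
    "Librispeech test-clean": 2.36,
}
qwen2_audio: Dict[str, float] = {
    "Aishell-1": 1.53,
    "Aishell-2 ios": 3.06,
    "Librispeech test-clean": 1.6,
}


def mean_error(scores: Dict[str, float], names: List[str]) -> float:
    """Unweighted mean of the error rates over the named test sets."""
    return sum(scores[name] for name in names) / len(names)


print(round(mean_error(step_audio_pretrain, CLEAN_SETS), 2))  # prints 2.05
print(round(mean_error(qwen2_audio, CLEAN_SETS), 2))          # prints 2.06
```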
Additionally, we compared the ASR capabilities of the final conversational models. When testing the GLM-4-voice Chat model, we experimented with various prompts and ultimately selected the prompt “ :” for evaluation. However, the model still struggled to follow instructions effectively. In contrast, the Step-Audio Chat model achieved an average CER of 5.89, maintaining strong performance, which reflects its robust ability to follow instructions.
此外,我们还比较了最终对话模型的 ASR 能力。在测试 GLM-4-voice Chat 模型时,我们尝试了多种提示词,最终选择了提示词“ :”进行评估。然而,该模型在有效遵循指令方面仍然存在困难。相比之下,Step-Audio Chat 模型取得了 5.89 的平均 CER,保持了强劲的表现,这反映了其强大的指令遵循能力。
Table 2: Performance comparison of content consistency across Step-Audio, GLM4-Voice, and MinMo.
表 2: Step-Audio、GLM4-Voice 和 MinMo 的内容一致性性能对比
| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
Table 3: Results of Step-Audio-TTS-3B and recent LLM-Based TTS models on the SEED test sets. Step-Audio-TTS-3B-Single denotes dual-codebook backbone with single-codebook vocoder. Step-Audio-TTS denotes 130 Billion parameter version of Step-Audio-TTS.
表 3: Step-Audio-TTS-3B 和最近基于大语言模型的 TTS 模型在 SEED 测试集上的结果。Step-Audio-TTS-3B-Single 表示双码本骨干网络与单码本声码器的结合。Step-Audio-TTS 表示 1300 亿参数版本的 Step-Audio-TTS。
| Model | test-zh | test-zh | test-en | test-en |
|---|---|---|---|---|
| | CER (%) ↓ | SS ↑ | WER (%) ↓ | SS ↑ |
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
6.2.2 TTS
6.2.2 文本到语音 (TTS)
To evaluate the TTS performance of Step-Audio, we utilized the SEED TTS test dataset (Anastassiou et al., 2024). The model was prompted to reproduce the original input text. Although this approach cannot guarantee consistency between the input and output contents, we believe that erroneous outputs also objectively reflect the model’s ability to follow instructions. Therefore, we employed Paraformer (Z. Gao et al., 2022) and Whisper-Large-V3 (Radford et al., 2023) to calculate the error rate for Chinese and English, respectively. The results are summarized in Table 2. Step-Audio achieved the best CER (1.53 on test-zh) and WER (2.71 on test-en) among the open-source spoken models compared.
为了评估 Step-Audio 的 TTS 性能,我们使用了 SEED TTS 测试数据集 (Anastassiou et al., 2024)。模型被提示重现原始输入文本。尽管这种方法无法保证输入和输出内容的一致性,但我们认为错误的输出也客观反映了模型遵循指令的能力。因此,我们采用 Paraformer (Z. Gao et al., 2022) 和 Whisper-Large-V3 (Radford et al., 2023) 分别计算中文和英文的错误率。结果总结在表 2 中。Step-Audio 在开源语音模型中取得了最佳的 CER 和 WER 表现。
Table 4: Performance comparison of dual-codebook resynthesis with CosyVoice.
表 4: Dual-codebook Re synthesis 与 Cosyvoice 的性能对比
| Token | test-zh | test-zh | test-en | test-en |
|---|---|---|---|---|
| | CER (%) ↓ | SS ↑ | WER (%) ↓ | SS ↑ |
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |
For the TTS task, we evaluated the performance of open-source TTS models on the SEED test sets using Paraformer (Z. Gao et al., 2022) and Whisper-Large-V3 (Radford et al., 2023), and employed the ERes2Net model (Y. Chen et al., 2023) to evaluate speaker similarity. The results are summarized in Table 3. Our 3-billion-parameter Step-Audio-TTS-3B with dual codebooks achieved SoTA CER and WER among the open-source models, while also demonstrating competitive similarity scores. Notably, scaling the LLM to 130 billion parameters yielded further improvements in both CER (1.31 to 1.17) and WER (2.31 to 2.0), suggesting the potential benefits of further scaling both synthetic data and model parameters.
在 TTS 任务中,我们使用 Paraformer (Z. Gao et al., 2022) 和 Whisper-Largev3 (Radford et al., 2023) 在 SEED 测试集上评估了开源 TTS 模型的性能,并采用 ERes2Net (Y. Chen et al., 2023) 模型来评估说话人相似度。结果总结在表 3 中。我们的 Step-Audio-TTS-3B 3B 参数版本在双码本的支持下,在 CER 和 WER 方面取得了开源模型中的 SoTA 结果,同时表现出极具竞争力的相似度得分。值得注意的是,将大语言模型扩展到 130B 参数后,CER 和 WER 均有显著提升,这表明进一步扩展合成数据和模型参数可能带来的潜在收益。
To evaluate the benefits of the dual-codebook approach, we compare the resynthesis quality of two token types, the single CosyVoice codebook (Du, Chen, et al., 2024) and the dual codebook, on the aforementioned SEED test sets. The test speech is first quantized into discrete tokens and then reconstructed into waveforms using the token2wav process. We assess the quality of the tokens in terms of speech intelligibility and speaker similarity. The results are presented in Table 4, which shows that the dual-codebook approach achieves a lower CER and WER with an acceptable degradation in speaker similarity.
为评估双码本方法的效果,我们在上述SEED测试集上比较了多种Token的重合成质量,包括:单一cosyvoice码本 (Du, Chen等, 2024) 和双码本。测试语音首先被量化为离散Token,然后通过token2wav过程重建为波形。我们从语音清晰度和说话人相似性两个方面评估Token的质量。表4展示了比较结果,表明双码本方法在保持可接受的说话人相似性退化水平的同时,实现了更低的CER。
6.2.3 AQTA Chat
6.2.3 AQTA Chat
As illustrated in Table 5, we compared the real-time dialogue performance of Step-Audio and open-source models using the StepEval-Audio-360 benchmark described in Section 6.1. The scores for each metric are automatically assessed by GPT-4o. The average scores are shown in the table, with the best results in bold. Since the main content of this benchmark is in Chinese, and Moshi has almost no understanding of Chinese, the results from Moshi are marked with “*” and should be considered for reference only. The results indicate that Step-Audio-Chat delivers the strongest real-time dialogue performance among the compared models.
如表 5 所示,我们使用第 6.1 节中提到的 StepEval-Audio-360 基准比较了 Step-Audio 和开源模型的实时对话性能。每个指标的得分由 GPT-4o 自动评估,表格中显示了平均得分,最佳结果以粗体显示。由于该基准的主要内容为中文,而 Moshi 对中文几乎无法理解,因此 Moshi 的结果标记为“*”,仅供参考。结果表明,Step-Audio-Chat 在实时对话中表现优异。
• Factuality: Step-Audio-Chat achieves the highest score of 66.4%, significantly surpassing the other models. This indicates that Step-Audio-Chat provides responses that are more accurate and reliable, adhering closely to the factual information.
• 事实性:Step-Audio-Chat 达到了最高的 66.4% 的分数,显著超越了其他模型。这表明 Step-Audio-Chat 提供的回答更加准确和可靠,紧密遵循事实信息。
• Relevance: With a score of 75.2%, Step-Audio-Chat also excels in relevance, demonstrating its ability to generate contextually appropriate and meaningful responses to user queries.
• 相关性:Step-Audio-Chat 在相关性方面也表现优异,得分为 75.2%,展示了其生成上下文相关且有意义的回应用户查询的能力。
• Chat Score: With the highest overall chat score (4.11), Step-Audio-Chat provides a superior overall voice chat experience. This score, ranging from 1 (lowest) to 5 (highest), represents a comprehensive evaluation by GPT-4o based on the textual input and output of the conversations.
• 聊天评分:Step-Audio-Chat 以最高的整体聊天评分 (4.11) 提供了卓越的语音聊天体验。该评分范围为 1 (最低) 到 5 (最高),是基于 GPT-4o 对对话文本输入和输出的全面评估。
To further contextualize Step-Audio-Chat’s performance, we conducted evaluations on several publicly available datasets. Among them, Web Questions, Llama Questions, and TriviaQA are knowledge-based question-answering datasets, while ComplexBench and the listening comprehension section of the Hanyu Shuiping Kaoshi (HSK-6) are comprehensive tests. We utilized the full test sets for Web Questions and Llama Questions. For TriviaQA, which lacks ground truth answers for the official test set, we constructed a 1000-sample test set from the development set. This subset comprised 500 samples each from the Wikipedia and Web verified sets, supplemented with data from the Web and Wikipedia dev sets. Because this test set is non-standard, TriviaQA results should be considered preliminary and for reference only. While Llama Questions provides spoken questions, we used TTS to generate spoken versions of the questions for Web Questions, TriviaQA, ComplexBench, and HSK-6, enabling an Audio Question-Text Answer (AQTA) evaluation.
Table 5: Comparison of fundamental voice chat capabilities on StepEval-Audio-360.
| Model | Factuality (%) ↑ | Relevance (%) ↑ | Chat Score ↑ |
|---|---|---|---|
| GLM4-Voice | 54.7 | 66.4 | 3.49 |
| Qwen2-Audio | 22.6 | 26.3 | 2.27 |
| Moshi* | 1.0 | 0 | 1.49 |
| Step-Audio-Chat | **66.4** | **75.2** | **4.11** |
Table 6: Performance comparison of voice chat models on public benchmarks. Llama Questions, Web Questions, and TriviaQA are publicly available datasets, while the audio versions of ComplexBench and HSK-6 were newly constructed in this study from publicly available text corpora. In the absence of an official test set, TriviaQA’s results are for reference only and marked with “*”. Results for Qwen2-Audio and Step-Audio-Chat were obtained via local inference, while the others (Moshi, Freeze-Omni, LUCY, MinMo) are taken from their original publications. Notably, GLM4-Voice’s scores on HSK-6 and ComplexBench were also generated through local inference. A “-” indicates that no result is available.
| Model | Llama Questions | Web Questions | TriviaQA* | ComplexBench | HSK-6 |
|---|---|---|---|---|---|
| GLM4-Voice | 64.7 | 32.2 | 39.1 | 66.0 | 74.0 |
| Moshi | 62.3 | 26.6 | 22.8 | - | - |
| Freeze-Omni | 72.0 | 44.7 | 53.9 | - | - |
| LUCY | 59.7 | 29.3 | 27.0 | - | - |
| MinMo | 78.9 | 55.0 | 48.3 | - | - |
| Qwen2-Audio | 52.0 | 27.0 | 37.3 | 54.0 | - |
| Step-Audio-Chat | _81.0_ | 75.1 | 58.0 | 74.0 | 86.0 |
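The construction of the TriviaQA subset can be sketched as follows. The text does not specify how the verified sets were supplemented, so this sketch assumes one plausible reading: draw up to 500 questions from a verified set and, if it is smaller than 500, top up from the corresponding dev set. The function name, the use of question IDs, and the fixed seed are illustrative assumptions.

```python
import random


def build_subset(verified: list[str], dev: list[str],
                 target: int = 500, seed: int = 0) -> list[str]:
    """Sample `target` question IDs, preferring the verified set.

    Assumption: the dev set is used only to fill the shortfall when the
    verified set holds fewer than `target` questions.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    picked = rng.sample(verified, min(target, len(verified)))
    shortfall = target - len(picked)
    if shortfall > 0:
        already = set(picked)
        extras = [q for q in dev if q not in already]
        picked += rng.sample(extras, shortfall)
    return picked
```

Calling this once for the Wikipedia split and once for the Web split would yield the 1000 samples described above.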
Performance data for Moshi (Défossez et al., 2024), Freeze-Omni (Wang et al., 2024), LUCY (H. Gao et al., 2025), and MinMo (Q. Chen et al., 2025) were taken from their respective publications. Step-Audio-Chat and Qwen2-Audio (Chu et al., 2024a) results were obtained through local API inference. For LUCY, we report the best results from its stage 2 (S2) and stage 3 (S3) training phases, as presented in the original publication. GPT-4o was used to assess the accuracy of each model’s textual responses to the original textual questions. Table 6 presents the average accuracy scores, demonstrating that Step-Audio-Chat achieved the highest accuracy across all open-source benchmarks. The accuracy score for Step-Audio-Chat on the Llama Questions dataset is italicized to indicate that we corrected several errors in the evaluation set. In addition to the automated metrics, we conducted a human evaluation, shown in Figure 1. The results corroborate the automated evaluation, showing a clear and consistent advantage for Step-Audio across all assessed dimensions.
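To make the size of the gap concrete, the following sketch computes Step-Audio-Chat’s margin over the best competing model on each benchmark, using the values reported in Table 6. The helper name is an illustrative assumption, and benchmarks for which a model has no reported result are simply omitted from its row. Note that the Llama Questions margin is not strictly like-for-like, since the Step-Audio-Chat score was obtained on a corrected evaluation set.

```python
# Scores copied from Table 6; missing entries are omitted.
TABLE6 = {
    "GLM4-Voice": {"Llama Questions": 64.7, "Web Questions": 32.2, "TriviaQA": 39.1,
                   "ComplexBench": 66.0, "HSK-6": 74.0},
    "Moshi": {"Llama Questions": 62.3, "Web Questions": 26.6, "TriviaQA": 22.8},
    "Freeze-Omni": {"Llama Questions": 72.0, "Web Questions": 44.7, "TriviaQA": 53.9},
    "LUCY": {"Llama Questions": 59.7, "Web Questions": 29.3, "TriviaQA": 27.0},
    "MinMo": {"Llama Questions": 78.9, "Web Questions": 55.0, "TriviaQA": 48.3},
    "Qwen2-Audio": {"Llama Questions": 52.0, "Web Questions": 27.0, "TriviaQA": 37.3,
                    "ComplexBench": 54.0},
    "Step-Audio-Chat": {"Llama Questions": 81.0, "Web Questions": 75.1, "TriviaQA": 58.0,
                        "ComplexBench": 74.0, "HSK-6": 86.0},
}


def margin_over_best_other(table: dict, model: str) -> dict:
    """For each benchmark, return (best competing model, margin in accuracy points)."""
    margins = {}
    for bench, score in table[model].items():
        others = [(row[bench], name) for name, row in table.items()
                  if name != model and bench in row]
        best_score, best_name = max(others)
        margins[bench] = (best_name, round(score - best_score, 1))
    return margins


margins = margin_over_best_other(TABLE6, "Step-Audio-Chat")
```

Under these numbers the smallest margin is on Llama Questions (2.1 points over MinMo) and the largest is on Web Questions (20.1 points over MinMo).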
6.3 Instruction Following
Audio instruction following reflects the model’s capability to generate accurate audio and textual content in response to input instructions. To measure it, we developed a benchmark for audio instruction following that covers four categories: Languages, Role-playing, Singing / RAP, and Voice Control. The evaluation covers both the accuracy of instruction following and the quality of the generated speech, each rated on a 1-5 Mean Opinion Score (MOS) scale. As shown in Table 7, Step-Audio-Chat scores higher than GLM4-Voice in every category on both instruction following and audio quality.
Table 7: Performance comparison of audio instruction following between GLM4-Voice and Step-Audio-Chat (1-5 MOS).
| Category | Instruction Following (GLM-4-Voice) | Instruction Following (Step-Audio) | Audio Quality (GLM-4-Voice) | Audio Quality (Step-Audio) |
|---|---|---|---|---|
| Languages | 1.9 | 3.8 | 2.9 | 3.3 |
| Role-playing | 3.8 | 4.2 | 3.2 | 3.6 |
| Singing / RAP | 2.1 | 2.4 | 2.4 | 4.0 |
| Voice Control | 3.6 | 4.4 | 3.3 | 4.1 |
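A per-category MOS such as those in Table 7 is the mean of individual 1-5 ratings within each category. The sketch below shows that aggregation under the assumption that ratings arrive as (category, score) pairs. The function name and the rounding to one decimal place are illustrative choices, and any ratings passed to it here are hypothetical rather than data from the paper.

```python
from collections import defaultdict
from statistics import mean


def mos_by_category(ratings: list[tuple[str, int]]) -> dict[str, float]:
    """Average 1-5 ratings per category and round to one decimal place."""
    buckets = defaultdict(list)
    for category, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"rating out of range for {category!r}: {score}")
        buckets[category].append(score)
    return {category: round(mean(scores), 1) for category, scores in buckets.items()}
```

Running the same function once on instruction-following ratings and once on audio-quality ratings would give the two score columns per model.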
6.4 Tool Calling
The proposed Step-Audio system supports real-time tool calls during voice interactions. Because text responses are produced at a much lower bitrate than their corresponding audio streams, the text for a turn is available well before its audio has finished rendering, and our framework uses this gap to invoke tools asynchronously while maintaining seamless voice interaction. As illustrated in Figure 7, the architecture decouples text-based tool processing from the audio generation pipeline, allowing external service queries (e.g., knowledge retrieval) and speech synthesis to execute in parallel. This design eliminates the wait for audio rendering when a tool call is required, significantly enhancing interaction fluidity.

Figure 7: Architecture of asynchronous tool calls in Step-Audio. The text processing thread handles tool calls while the audio generation thread produces speech streams concurrently.
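The decoupling described above can be illustrated with a minimal two-thread sketch. It is not the Step-Audio implementation: `call_tool` and `synthesize` are hypothetical stand-ins for an external service and the audio generation pipeline, the `time.sleep` calls merely simulate latency, and the spoken filler phrase is an assumption made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_tool(query: str) -> str:
    """Hypothetical external service query (e.g. knowledge retrieval)."""
    time.sleep(0.05)  # simulated network latency
    return f"Result for {query}"


def synthesize(words: list[str]) -> list[str]:
    """Hypothetical audio generation: one simulated audio chunk per word."""
    chunks = []
    for word in words:
        time.sleep(0.01)  # simulated rendering time per chunk
        chunks.append(f"<audio:{word}>")
    return chunks


def respond(query: str) -> list[str]:
    """Run the tool call and speech synthesis in parallel threads."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The tool call starts immediately and does not wait for audio rendering.
        tool_future = pool.submit(call_tool, query)
        # Meanwhile the audio thread keeps speaking already-generated text.
        audio_future = pool.submit(synthesize, ["One", "moment", "please"])
        spoken_first = audio_future.result()
        tool_result = tool_future.result()
    # Once the tool result is available, it is spoken as the rest of the answer.
    return spoken_first + synthesize(tool_result.split())
```

In this sketch the tool latency is hidden behind the audio that is already being rendered, which is the effect the asynchronous design aims for.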
7 Conclusion
In this paper, we present Step-Audio, an innovative framework for real-time voice interaction. During the pre-training stage, our dual-codebook speech tokenizer bridges textual and acoustic modalities using 3.3T tokens of multi-modal data, establishing cross-modal alignment. In the post-training phase, we conducted task-specific SFT for the TTS and ASR tasks; for AQTA tasks, we combined SFT on diverse, high-quality datasets with RLHF to enhance response quality and enable fine-grained control over emotional modulation, dialect adaptation, and prosodic pattern generation. Through engineering innovations including speculative streaming with sliced latency compensation and efficient full-duplex coordination, Step-Audio achieves fluid conversational dynamics. Benchmark evaluations across ASR, TTS, and AQTA tasks demonstrate Step-Audio’s exceptional capabilities in speech dialogue.
8 Future Work
In this work, we have demonstrated Step-Audio’s current capabilities in cross-modal integration between speech and text, as part of an initial implementation aimed at trimodal systems. Moving forward, three key areas warrant further exploration: first, extending the framework to achieve native trimodal understanding that incorporates vision, speech, and text; second, improving the efficiency of pure voice dialogue by eliminating intermediate cross-modal conversions in AQAA scenarios; and third, implementing deep-thinking-enhanced tool calls to strengthen intelligent interaction with external knowledge bases.
