[论文翻译]Step-Audio: 智能语音交互中的统一理解与生成


原文地址:https://arxiv.org/pdf/2502.11946v2


Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Step-Audio: 智能语音交互中的统一理解与生成

Abstract

摘要

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

实时语音交互作为人机协作的基础接口,具有巨大的潜力。然而,当前的开源模型面临语音数据收集成本高、动态控制能力弱和智能化有限等局限。为了应对这些挑战,本文介绍了首个生产级开源解决方案 Step-Audio。主要贡献包括:1) 一个 130B 参数的统一语音-文本多模态模型,实现了统一理解与生成,并开源了 Step-Audio-Chat 版本;2) 一个生成式语音数据引擎,构建了经济高效的语音克隆框架,并通过蒸馏生成了开源的轻量级 Step-Audio-TTS-3B 模型;3) 一个指令驱动的精细控制系统,支持跨方言、情感、歌唱和 RAP 的动态调整;4) 一个增强的认知架构,通过工具调用和角色扮演能力有效管理复杂任务。基于我们新的 StepEval-Audio-360 评估基准,Step-Audio 在人类评估中达到了最先进的性能,尤其是在指令遵循方面。在 LLaMA Question 等开源基准上,平均性能提升了 9.3%,展示了我们推动开源多模态语言技术发展的承诺。我们的代码和模型可在 https://github.com/stepfun-ai/Step-Audio 获取。

1 Introduction

1 引言

The evolution of artificial intelligence toward general-purpose systems has positioned real-time speech interaction as a critical interface for human-machine collaboration. While recent multi-modal large language models (LLMs) have accelerated progress in this domain, open-source communities face persistent challenges despite breakthroughs in proprietary systems like GPT-4o (Hurst et al., 2024) and Doubao (bytedance, 2024). Existing open-source models such as Qwen2-Audio (Chu et al., 2024a), Llama 3 (Dubey et al., 2024) and wavLLM (Hu et al., 2024) struggle with three fundamental limitations: the separation of understanding and generation processes that impedes end-to-end system integration, dependence on laborious manual speech data acquisition methods that restricts efficient voice replication, and inadequate precision in regulating prosodic features, regional dialects, and tool utilization capabilities. These limitations highlight the urgent demand for deployable frameworks that harmonize streamlined architecture with dual competencies in affective computing (accurate emotion perception and adjustment) and contextual cognition (situational reasoning and response formulation).

人工智能向通用系统的发展使得实时语音交互成为人机协作的关键接口。尽管最近的多模态大语言模型加速了这一领域的进展,但开源社区在 GPT-4o (Hurst et al., 2024) 和 Doubao (bytedance, 2024) 等专有系统取得突破的同时,仍面临持续挑战。现有的开源模型如 Qwen2-Audio (Chu et al., 2024a)、Llama 3 (Dubey et al., 2024) 和 wavLLM (Hu et al., 2024) 面临三个基本限制:理解和生成过程的分离阻碍了端到端系统集成,依赖繁琐的手动语音数据采集方法限制了高效的语音复制,以及在调节音律特征、地方方言和工具使用能力方面的精度不足。这些限制突显了对可部署框架的迫切需求,这些框架需要将简化架构与情感计算(准确的情感感知和调整)和语境认知(情景推理和响应制定)的双重能力相结合。

Current open-source speech systems confront multiple architectural challenges. The traditional framework employs a cascading approach (Huang et al., 2024) combining Automatic Speech Recognition (ASR), LLM processing, and Text-to-Speech (TTS). This framework introduces error propagation through modality transitions while increasing system complexity. Pure end-to-end approaches, though conceptually elegant, often sacrifice performance in open-domain dialogue quality (Zeng et al., 2024). The tension between modular design and fully integrated systems remains unresolved. Furthermore, traditional text-to-speech pipelines depend on manually curated datasets, particularly for multilingual and multi-dialect scenarios, a process requiring prohibitive human annotation effort. Existing solutions also lack sophisticated control mechanisms for dynamic speech adaptation, such as real-time adjustment of speaking rate, emotional prosody, or musical rendering (e.g., singing and RAP vocals). Crucially, the absence of tool invocation capabilities and contextual awareness prevents handling complex queries like “Retrieve live weather data and report it in Cantonese,” necessitating manual API integration.

当前的开源语音系统面临多重架构挑战。传统框架采用级联方法 (Huang et al., 2024),结合自动语音识别 (ASR)、大语言模型处理和文本到语音 (TTS)。这种框架在模态转换中引入了误差传播,同时增加了系统复杂性。纯端到端方法虽然在概念上优雅,但往往牺牲开放域对话质量的性能 (Zeng et al., 2024)。模块化设计与完全集成系统之间的紧张关系仍未解决。此外,传统的文本到语音管道依赖于手动策划的数据集,尤其是在多语言和多方言场景下——这一过程需要大量的人工标注工作。现有解决方案还缺乏动态语音适应的复杂控制机制,如实时调整语速、情感韵律或音乐渲染(例如,歌唱和说唱人声)。关键的是,缺乏工具调用能力和上下文感知能力,无法处理诸如“检索实时天气数据并用粤语报告”等复杂查询,这需要手动集成API。

This report presents Step-Audio, the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation through four key innovations.

本报告介绍了 Step-Audio,这是首个面向生产环境的开源智能语音交互框架,通过四项关键创新实现了理解与生成的统一。

• 130B-Parameter Multi-modal Model: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, audio editing and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.

• 1300亿参数多模态模型:一个统一的模型,集成了理解和生成能力,能够执行语音识别、语义理解、对话、语音克隆、音频编辑和语音合成。我们已经开源了1300亿参数的Step-Audio-Chat变体。

• Generative Data Engine: Eliminates traditional TTS’s reliance on manual data collection by generating high-quality audio through our 130B-parameter multi-modal model. Leverages this data to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.

• 生成式数据引擎 (Generative Data Engine):通过我们 130B 参数的多模态模型生成高质量音频,消除了传统 TTS 对人工数据收集的依赖。利用这些数据训练并公开发布了一个资源高效的 Step-Audio-TTS-3B 模型,该模型具有增强的指令跟随能力,用于可控语音合成。

• Granular Voice Control: Enables precise regulation through instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (RAP/Singing, a cappella humming) to meet diverse speech generation needs.

• 细粒度语音控制:通过基于指令的控制设计实现精准调节,支持多种情绪(愤怒、喜悦、悲伤)、方言(粤语、四川话等)和声乐风格(RAP/歌唱、无伴奏哼唱),以满足多样化的语音生成需求。

• Enhanced Intelligence: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.

• 增强智能:通过集成工具调用机制和角色扮演增强功能,提升AI智能体在复杂任务中的表现。

In open-source benchmarks, Step-Audio demonstrates exceptional performance. It achieves SoTA results on open-domain question answering and complex instruction tasks including LLaMA Question, TriviaQA, and Complex Bench, with an average improvement of 9.3 points compared to the best open-source metrics, validating its advantage in generalized deep semantic understanding capabilities. Additionally, to address the current lack of comprehensive end-to-end speech dialogue evaluation systems, we introduce the multi-dimensional StepEval-Audio-360 evaluation framework covering 9 dimensions, including logical reasoning, creative ability, language proficiency, and comprehension control among other key capabilities.

在开源基准测试中,Step-Audio 展示了卓越的性能。它在开放域问答和复杂指令任务(包括 LLaMA Question、TriviaQA 和 Complex Bench)上取得了 SoTA 结果,相比最佳开源指标平均提升了 9.3 分,验证了其在广义深度语义理解能力上的优势。此外,为了解决当前缺乏全面的端到端语音对话评估系统的问题,我们引入了多维的 StepEval-Audio-360 评估框架,涵盖逻辑推理、创造力、语言熟练度和理解控制等 9 个维度的关键能力。

As shown in Figure 1, Step-Audio achieves SoTA results across all dimensions in subjective comparisons against open-source models like GLM-4-Voice and Qwen2-Audio, with improvements of 19.2%, 23.7%, and 43.2% in response quality, response relevance, and factual accuracy respectively. Particularly in generation control dimensions such as emotion understanding, speech rate control, RAP vocals, and role-playing, compared to open-source SoTA models, the IF (Instruction Following) and MOS (Mean Opinion Score) metrics improved by 29.8% and 27.1% respectively, highlighting its leading advantage in complex speech interaction scenarios.

如图 1 所示,Step-Audio 在与 GLM-4-Voice 和 Qwen2-Audio 等开源模型的主观比较中,在所有维度上均取得了 SoTA 结果,响应质量、响应相关性和事实准确性分别提高了 19.2%、23.7% 和 43.2%。特别是在情感理解、语速控制、RAP 演唱和角色扮演等生成控制维度上,与开源 SoTA 模型相比,IF(指令跟随)和 MOS(平均意见得分)指标分别提高了 29.8% 和 27.1%,突显了其在复杂语音交互场景中的领先优势。

Figure 1: Human Evaluation of End-to-End Speech Interactions. We conduct comprehensive human assessments comparing Step-Audio against GLM-4-Voice (Zeng et al., 2024) and Qwen2-Audio (Chu et al., 2024b) across nine critical dimensions: role-playing, logical reasoning, creativity, singing, language ability, speech emotion control, gaming interaction, voice instruction following, and voice understanding. Expert evaluators rated end-to-end dialog sessions using Likert scales (1-5) for naturalness and task completion. Step-Audio represents the state-of-the-art (SoTA) across all these dimensions. It is particularly remarkable in language ability, demonstrating a high level of proficiency in grammar, semantics, and language generation. In singing, Step-Audio outshines the other models with its natural pitch control, rhythm accuracy, and overall harmonious vocal output, making it a top-tier performer in these two crucial aspects.

图 1: 端到端语音交互的人类评估。我们对 Step-Audio 与 GLM-4-Voice (Zeng et al., 2024) 和 Qwen2-Audio (Chu et al., 2024b) 进行了全面的人类评估,涵盖了九个关键维度:角色扮演、逻辑推理、创造力、歌唱、语言能力、语音情感控制、游戏交互、语音指令跟随和语音理解。专家评估者使用李克特量表(1-5)对端到端对话会话的自然性和任务完成度进行评分。Step-Audio 在所有维度上都代表了最先进的技术(SoTA)。它在语言能力方面尤为突出,展示了在语法、语义和语言生成方面的高水平熟练度。在歌唱方面,Step-Audio 以其自然的音高控制、节奏准确性和整体和谐的声音输出超越了其他模型,使其在这两个关键方面成为顶级表现者。

2 Related Work

2 相关工作

Recent progress in end-to-end speech systems has markedly improved human-AI audio interaction. Early approaches relied on cascaded ASR-LLM-TTS pipelines (Huang et al., 2024), where distinct modules for speech recognition, language modeling, and speech synthesis are sequentially connected. However, these systems suffered from latency buildup, error propagation, and disjointed optimization. Later approaches sought to enhance integration by directly linking speech encoders to LLMs through trainable adapters (Chu et al., 2024a; Das et al., 2024; Kong et al., 2020), though they still required separate TTS modules for audio output.

端到端语音系统的最新进展显著改善了人机音频交互。早期方法依赖于级联的 ASR-LLM-TTS 管道 (Huang et al., 2024),其中语音识别、语言建模和语音合成的独立模块依次连接。然而,这些系统存在延迟累积、错误传播和优化不连贯的问题。后来的方法通过可训练适配器直接将语音编码器与大语言模型连接 (Chu et al., 2024a; Das et al., 2024; Kong et al., 2020),从而增强了集成性,尽管它们仍然需要单独的 TTS 模块来输出音频。

The emergence of fully end-to-end systems marked a paradigm shift. Architectures like Llama-Omni (Fang et al., 2024) integrated non-autoregressive (NAR) TTS modules with language models, using connectionist temporal classification (CTC) loss. Freeze-Omni (Wang et al., 2024) uses a combination of autoregressive and NAR speech decoders. These systems demonstrated improved latency but exhibited limitations in handling emotional nuance and natural conversational flow. MinMo (Q. Chen et al., 2025) introduced autoregressive speech token prediction through the CosyVoice2 (Du, Wang, et al., 2024) decoder, while interleaved modeling approaches (Nguyen et al., 2024; Zeng et al., 2024) alternated between text and speech token generation at the sequence level.

全端到端系统的出现标志着一个范式的转变。Llama-Omni (Fang et al., 2024) 等架构将非自回归 (NAR) TTS 模块与语言模型集成,使用连接时序分类 (CTC) 损失。Freeze-Omni (Wang et al., 2024) 则结合了自回归和非自回归语音解码器。这些系统在延迟方面有所改善,但在处理情感细微差别和自然对话流方面表现出局限性。MinMo (Q. Chen et al., 2025) 通过 CosyVoice2 (Du, Wang, et al., 2024) 解码器引入了自回归语音 Token 预测,而交错建模方法 (Nguyen et al., 2024; Zeng et al., 2024) 则在序列级别交替进行文本和语音 Token 生成。

Parallel decoding architectures like Moshi (Défossez et al., 2024) and Mini-Omni (Xie & Wu, 2024) represented a significant leap by generating text and multiple speech codebook tokens simultaneously. These systems achieved lower latency through compressed speech token sequences but faced challenges in preserving linguistic capabilities when scaling speech token bandwidth. Current systems generally specialized in specific aspects: GLM-4-Voice (Zeng et al., 2024) prioritized latency reduction, while Moshi emphasized speech quality, but none holistically addressed emotion awareness, conversational naturalness, and real-time knowledge integration.

Moshi (Défossez 等,2024) 和 Mini-Omni (Xie & Wu,2024) 等并行解码架构通过同时生成文本和多个语音编码本 Token 实现了显著飞跃。这些系统通过压缩语音 Token 序列实现了更低的延迟,但在扩展语音 Token 带宽时面临保持语言能力的挑战。当前系统通常在特定方面表现出色:GLM-4-Voice (Zeng 等,2024) 优先减少延迟,而 Moshi 则强调语音质量,但尚无系统全面解决情感感知、对话自然性和实时知识整合。

Recent methodological advances have systematically investigated emotion-aware interaction paradigms, though their integration with multi-modal frameworks remains nascent. While some systems (Wang et al., 2024) incorporated basic sentiment analysis, they lacked bidirectional emotional resonance: neither detecting paralinguistic cues in user speech nor generating contextually appropriate emotional responses. The naturalness gap persisted due to LLMs’ tendency toward verbose, text-optimized outputs (Fang et al., 2024), ill-suited for spoken dialogue. Recent work has introduced task-specific optimizations: LUCY (H. Gao et al., 2025) adopted the architectural framework of Mini-Omni (Xie & Wu, 2024), augmented with specialized fine-tuning on conversational datasets for emotion control and function-calling.

最近的方法论进展系统地研究了情感感知的交互范式,尽管它们与多模态框架的整合仍处于初级阶段。虽然一些系统(Wang 等人,2024)整合了基本的情感分析,但它们缺乏双向的情感共鸣——既未检测到用户语音中的副语言线索,也未生成上下文适当的情感响应。由于大语言模型倾向于冗长且针对文本优化的输出(Fang 等人,2024),这种自然性差距在口语对话中尤为明显。最近的研究引入了针对任务的优化:LUCY(H. Gao 等人,2025)采用了 Mini-Omni(Xie & Wu,2024)的架构框架,并通过对会话数据集的专门微调来增强情感控制和功能调用。

3 Architecture

3 架构

Traditional voice dialogue systems typically employ a cascaded architecture comprising ASR, LLM, and TTS modules. However, our proposed model, having undergone comprehensive multi-modal training and alignment of text and audio during the pretraining phase, already possesses end-to-end voice dialogue capabilities. Despite extensive exploration of alternative designs, we ultimately adopted the AQTA (audio input, text output) + TTS framework for real-time voice dialogue as shown in Figure 2, driven by the following considerations:

传统的语音对话系统通常采用由 ASR、大语言模型和 TTS 模块组成的级联架构。然而,我们提出的模型在预训练阶段经过了全面的多模态训练和文本与音频的对齐,已经具备了端到端的语音对话能力。尽管我们广泛探索了其他设计方案,但最终出于以下考虑,我们采用了 AQTA(音频输入,文本输出)+ TTS 框架来实现实时语音对话,如图 2 所示:

• Scarcity of high-quality pure-voice dialogue data: The limited availability of pure-voice dialogue data, coupled with its constrained scenarios, restricts the training efficiency of end-to-end voice dialogue models.

• 高质量纯语音对话数据的稀缺性:纯语音对话数据的可得性有限且场景受限,限制了端到端语音对话模型的训练效率。

Figure 2: Architecture of Step-Audio. Step-Audio primarily consists of three components: the speech tokenizer, the LLM, and the speech decoder. The speech tokenizer is responsible for discretizing the input speech into tokens. The LLM models both text and speech tokens, while the speech decoder generates the waveform output.

图 2: Step-Audio 的架构。Step-Audio 主要由三个组件组成:语音分词器 (speech tokenizer)、大语言模型 (LLM) 和语音解码器 (speech decoder)。语音分词器负责将输入的语音离散化为 Token。大语言模型同时对文本和语音 Token 进行建模,而语音解码器则生成波形输出。


• Controllability and customization of output speech: By incorporating a TTS module, we gain flexible control over speech parameters such as timbre and pitch to meet users’ personalized demands, while continuously enhancing the model’s expressive capabilities.

• 输出语音的控制能力和定制化:通过引入 TTS 模块,我们能够灵活控制音色和音高等语音参数,以满足用户的个性化需求,同时持续提升模型的表达能力。

Our goal is to establish Step-Audio as a real-time multi-modal model that seamlessly integrates speech understanding and synthesis through four key components: (1) a dual-codebook tokenization framework employing parallel linguistic (16.7 Hz, 1024-codebook) and semantic (25 Hz, 4096-codebook) tokenizers with 2:3 temporal interleaving; (2) a 130B-parameter LLM based on Step-1 (StepFun, 2024a), enhanced through audio-contextualized continual pretraining and post-training; (3) a hybrid speech synthesizer combining flow matching with a neural vocoder, optimized for real-time waveform generation. In addition, a Voice Activity Detection (VAD) module is employed to extract vocal segments.

我们的目标是建立一个实时多模态模型 Step-Audio,通过四个关键组件无缝整合语音理解和合成:(1) 采用并行语言(16.7Hz,1024-codebook)和语义(25Hz,4096-codebook)分词器的双码本 Token 化框架,具有 2:3 的时间交错;(2) 基于 Step-1 (StepFun, 2024a) 的 130B 参数大语言模型,通过音频上下文化的持续预训练和后训练进行增强;(3) 结合流匹配和神经声码器的混合语音合成器,优化实时波形生成。此外,还采用了语音活动检测 (VAD) 模块来提取语音片段。

3.1 Tokenizer

3.1 Tokenizer

To overcome the limitations of conventional speech tokenizers, which separately capture information for either understanding or generation tasks, we propose a dual-codebook speech tokenizer framework in Step-Audio, similar to ARCON (Ming et al., 2024). This approach employs two distinct tokenizers, linguistic and semantic, to better represent speech features. The linguistic tokenizer is utilized to extract structured, high-level representations, including phonemic and linguistic features, whereas the semantic tokenizer is designed to encode both semantic and coarse-grained acoustic characteristics.

为克服传统语音分词器在分别捕捉理解或生成任务信息方面的局限性,我们提出了类似于 ARCON (Ming 等, 2024) 的 Step-Audio 双码本语音分词器框架。该方法采用两个不同的分词器,即语言分词器和语义分词器,以更好地表示语音特征。语言分词器用于提取结构化的高层表示,包括音素和语言特征,而语义分词器则设计用于编码语义和粗粒度的声学特征。

For linguistic tokenization, we utilize the output from the Paraformer (Z. Gao, Zhang, McLoughlin, & Yan, 2022) encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice’s (Du, Chen, et al., 2024) tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech outputs, operating at a token rate of 25 Hz. The linguistic tokenizer employs a codebook size of 1024, while the semantic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details.

在语言 Token 化方面,我们采用了 Paraformer [Z. Gao, Zhang, McLoughlin, & Yan, 2022] 编码器的输出,并将其量化为离散表示,Token 速率为 16.7 Hz。在语义 Token 化方面,我们使用了 CosyVoice [Du, Chen, et al., 2024] 的 Tokenizer,该工具专门设计用于高效编码生成自然且富有表现力的语音输出所需的关键特征,Token 速率为 25 Hz。语言 Tokenizer 的码本大小为 1024,而语义 Tokenizer 则采用更大的码本大小 4096,以捕捉更精细的声学细节。

To effectively integrate these two tokenization schemes, we implement a token-level interleaving approach inspired by SpiritLM (Nguyen et al., 2024). Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two linguistic tokens are paired with three semantic tokens.

为了有效整合这两种Token化方案,我们实现了一种受SpiritLM (Nguyen et al., 2024) 启发的Token级交错方法。鉴于不同的Token速率,我们建立了2:3的时间对齐比例,即每两个语言Token与三个语义Token配对。
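The 2:3 interleaving can be made concrete with a short sketch. The function below is a minimal illustration of the grouping rule described above, not the released implementation; the token values and stream lengths are hypothetical.

```python
def interleave_dual_codebook(linguistic, semantic, group=(2, 3)):
    """Interleave linguistic (16.7 Hz) and semantic (25 Hz) token streams.

    Every 2 linguistic tokens are paired with 3 semantic tokens, matching
    the 2:3 temporal alignment ratio of the dual-codebook tokenizer
    (2 / 16.7 s ~= 3 / 25 s, i.e. both groups cover ~0.12 s of audio).
    """
    n_ling, n_sem = group
    merged, li, si = [], 0, 0
    while li < len(linguistic) and si < len(semantic):
        merged.extend(linguistic[li:li + n_ling])  # two linguistic tokens
        merged.extend(semantic[si:si + n_sem])     # three semantic tokens
        li += n_ling
        si += n_sem
    return merged

# Hypothetical example: 4 linguistic and 6 semantic tokens cover the same span.
print(interleave_dual_codebook(["L0", "L1", "L2", "L3"],
                               ["S0", "S1", "S2", "S3", "S4", "S5"]))
# ['L0', 'L1', 'S0', 'S1', 'S2', 'L2', 'L3', 'S3', 'S4', 'S5']
```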

3.2 LLM

3.2 大语言模型 (LLM)

To enhance Step-Audio’s ability to effectively process speech information and achieve accurate speech-text alignment, we conducted audio continual pretraining based on Step-1, a 130-billion-parameter pretrained text-based LLM. The details of the pretraining and post-training processes for Step-Audio are comprehensively discussed in Sections 4 and 5.

为了增强 Step-Audio 有效处理语音信息并实现准确的语音-文本对齐的能力,我们在 Step-1(一个拥有 1300 亿参数的基于文本预训练的大语言模型)的基础上进行了音频持续预训练。Step-Audio 的预训练和后训练过程将在第 4 节和第 5 节中详细讨论。

In multi-turn dialogue systems, the substantial disparity in length between audio tokens and text tokens necessitates efficient processing strategies. To address this, historical information is initially transcribed into textual format utilizing an ASR model prior to system input, thereby optimizing computational efficiency. However, it should be noted that the model architecture maintains the capability to process and utilize audio tokens as historical context when required.

在多轮对话系统中,音频Token和文本Token之间的长度差异显著,因此需要高效的处理策略。为了解决这一问题,系统输入前会先使用ASR模型将历史信息转录为文本格式,从而优化计算效率。然而,需要注意的是,模型架构在需要时仍保留处理和使用音频Token作为历史上下文的能力。

3.3 Speech Decoder

3.3 语音解码器

The speech decoder consists of a 3-billion-parameter language model, a flow-matching model, and a mel-to-wave vocoder. It is primarily designed to receive text or audio tokens and generate continuous, time-domain stylized waveforms that incorporate historical information and instructions. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of linguistic and semantic features throughout the generation process. On speech decoders with larger parameter counts, we have observed the emergence of enhanced generative capabilities. For further details, please refer to Section 5.1.

语音解码器由一个 30 亿参数的大语言模型、一个流匹配模型和一个主要用于接收文本或音频 Token 并生成融合历史信息和指令的连续时域风格化波形的梅尔频谱到波形声码器组成。为了优化合成语音的清晰度和自然度,语音解码器采用双码交织方法进行训练,确保在整个生成过程中语言和语义特征的无缝集成。在参数更大的语音解码器上,我们观察到了增强生成能力的出现。更多细节请参阅第 5.1 节。

3.4 Real-time Inference

3.4 实时推理

To enable real-time interactions, we have designed an optimized inference pipeline as shown in Figure 3. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include VAD for detecting user speech, the Streaming Audio Tokenizer for processing audio in real-time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity.

为了实现实时交互,我们设计了一个优化的推理管道,如图 3 所示。其核心是 Controller 模块,它管理状态转换、协调推测性响应生成,并确保关键子系统之间的无缝协调。这些子系统包括用于检测用户语音的 VAD (Voice Activity Detection)、用于实时处理音频的 Streaming Audio Tokenizer、用于处理和生成响应的 Step-Audio 语言模型和 Speech Decoder,以及用于保持对话连续性的 Context Manager。

Speculative Response Generation To reduce interaction latency, the system preemptively generates speculative responses. This minimizes perceived delays and enhances responsiveness, though at the cost of occasional redundant computations when speculative responses are discarded.

推测式响应生成
为了减少交互延迟,系统会预先生成推测性响应。这最大限度地降低了感知延迟并增强了响应性,但代价是当推测性响应被丢弃时偶尔会产生冗余计算。

Figure 3: The architecture of the real-time inference pipeline aims to enable real-time interactions. When audio is input, it’s processed concurrently by the streaming audio tokenizer and the voice activity detection module. The controller manages state transitions. A pause in user speech triggers speculative response generation, with multiple calls made but only one response committed. The context manager handles the conversation history in text format for continuity. Once the user finishes speaking, the system enters the reply state, commits a speculative response, and outputs audio. After that, it returns to the idle state for the next interaction.

图 3: 实时推理管道的架构旨在实现实时交互。当输入音频时,流式音频标记器和语音活动检测模块会同时处理音频。控制器管理状态转换。用户语音暂停会触发推测性响应生成,虽然会进行多次调用,但只提交一个响应。上下文管理器以文本格式处理对话历史记录,以确保连续性。一旦用户说完,系统进入回复状态,提交推测性响应,并输出音频。之后,系统返回空闲状态,等待下一次交互。

The system begins in the Silence state, awaiting user input. When the VAD detects active speech, the system transitions to the User Speaking state. During this state, the Streaming Audio Tokenizer begins converting audio into tokens. If the user momentarily pauses, the system enters the UserPaused state, where speculative response generation is triggered. By preemptively generating a response in anticipation of input completion, the system reduces latency when the conversation resumes. If the user resumes speaking, the speculative response is discarded. Once the system confidently determines that the user has finished speaking, it transitions to the Bot Replying state, commits the most recent speculative response, and delivers its audio output. If interrupted by user speech, the system prioritizes the new input while maintaining conversational continuity. After completing its response, the system returns to the Silence state, ready for the next interaction. Empirical analysis shows that approximately 40% of speculative responses are successfully committed. This mechanism reduces per-response latency by approximately 500 ms compared to non-speculative methods.

系统从静默状态开始,等待用户输入。当语音活动检测(VAD)检测到活跃语音时,系统会切换到用户说话状态。在此状态下,流式音频分词器(Streaming Audio Tokenizer)开始将音频转换为Token。如果用户短暂停顿,系统会进入用户暂停状态,此时触发推测性响应生成。通过在预期输入完成之前预先生成响应,系统在对话恢复时减少了延迟。如果用户恢复说话,推测性响应将被丢弃。一旦系统确信用户已完成说话,它会切换到机器人回复状态,提交最新的推测性响应,并输出音频。如果被用户语音打断,系统会优先处理新输入,同时保持对话的连续性。在完成响应后,系统会返回静默状态,准备进行下一次交互。实证分析表明,大约40%的推测性响应成功提交。与非推测性方法相比,该机制将每次响应的延迟减少了约500毫秒。
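The controller logic above can be sketched as a small state machine. The class below is only an illustration of the described states and speculative-commit behavior; the state names, callbacks, and the `generate_response`/`synthesize` stubs are assumptions rather than the released pipeline.

```python
from enum import Enum, auto

class State(Enum):
    SILENCE = auto()
    USER_SPEAKING = auto()
    USER_PAUSED = auto()
    BOT_REPLYING = auto()

class Controller:
    """Toy controller illustrating speculative response generation."""

    def __init__(self, generate_response, synthesize):
        self.state = State.SILENCE
        self.generate_response = generate_response  # text/audio-token generation (stub)
        self.synthesize = synthesize                # speech decoding (stub)
        self.speculative = None

    def on_vad_event(self, speech_active, utterance_finished, tokens_so_far):
        if self.state == State.SILENCE and speech_active:
            self.state = State.USER_SPEAKING
        elif self.state == State.USER_SPEAKING and not speech_active:
            # Pause detected: preemptively generate a speculative response.
            self.state = State.USER_PAUSED
            self.speculative = self.generate_response(tokens_so_far)
        elif self.state == State.USER_PAUSED and speech_active:
            # User resumed speaking: discard the speculative response.
            self.speculative = None
            self.state = State.USER_SPEAKING
        elif self.state == State.USER_PAUSED and utterance_finished:
            # Commit the most recent speculative response and reply.
            self.state = State.BOT_REPLYING
            audio = self.synthesize(self.speculative)
            self.speculative = None
            self.state = State.SILENCE
            return audio
        return None
```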

Context Management Our system utilizes text transcription instead of raw audio tokens for historical context, as it provides a more compact representation (with an average text-to-audio token ratio of 1:14), improving performance, and enabling longer conversations with minimal impact on quality. ASR asynchronously transcribes user speech into text, maintaining an accurate and up-to-date conversation history.

上下文管理
我们的系统利用文本转录而非原始音频Token来管理历史上下文,因为它提供了更紧凑的表示(文本与音频Token的平均比例为1:14),从而提高了性能,并能够在最小化质量影响的情况下进行更长的对话。ASR(自动语音识别)将用户的语音异步转录为文本,保持对话历史的准确性和及时性。
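A minimal sketch of this text-based history strategy is shown below: user turns are stored as ASR transcripts (roughly 14x more compact than audio tokens, per the stated 1:14 ratio), while only the current turn carries raw audio tokens. The class, field names, and the `asr_transcribe` hook are illustrative assumptions.

```python
class ContextManager:
    """Keep dialogue history as text transcripts instead of audio tokens."""

    def __init__(self, asr_transcribe, max_turns=20):
        self.asr_transcribe = asr_transcribe  # asynchronous ASR hook (stub)
        self.max_turns = max_turns
        self.history = []  # list of (role, text) pairs

    def add_user_audio(self, audio_waveform):
        # Transcribe user speech so later turns only carry compact text.
        self.history.append(("user", self.asr_transcribe(audio_waveform)))
        self._truncate()

    def add_bot_text(self, text):
        self.history.append(("assistant", text))
        self._truncate()

    def _truncate(self):
        # Drop the oldest turns once the window is exceeded.
        self.history = self.history[-self.max_turns:]

    def build_prompt(self, current_audio_tokens):
        # Text history plus the current turn's raw audio tokens.
        return {"history": list(self.history), "audio": current_audio_tokens}
```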

Streaming Audio Tokenizer The input audio stream is processed through two parallel tokenizer pipelines, each employing fixed-duration segmentation. The resulting tokens are seamlessly merged into a single sequence with a 2:3 interleaving ratio. Without the streaming audio tokenizer, inference would be significantly slower, with the slowdown depending on the length of the audio input.

流式音频分词器

输入音频流通过两个并行的分词器管道进行处理,每个管道均采用固定时长的分段。生成的Token以2:3的交错比例无缝合并为单一序列。若没有流式音频分词器,推理时间将显著变慢,具体取决于音频输入的长度。
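A streaming variant of the earlier interleaving sketch: audio arrives in fixed-duration chunks, each chunk is tokenized by the two pipelines, and complete 2:3 groups are emitted as soon as they are available. The chunk length and the tokenizer stubs are assumptions, not the actual pipeline.

```python
def stream_tokens(audio_chunks, linguistic_tok, semantic_tok, group=(2, 3)):
    """Incrementally tokenize fixed-duration audio chunks and merge them 2:3.

    linguistic_tok / semantic_tok are stand-ins for the two tokenizer
    pipelines; each maps a chunk of samples to a list of discrete tokens.
    """
    ling_buf, sem_buf = [], []
    for chunk in audio_chunks:                  # e.g. fixed ~120 ms segments
        ling_buf.extend(linguistic_tok(chunk))  # ~2 tokens per group window
        sem_buf.extend(semantic_tok(chunk))     # ~3 tokens per group window
        # Emit every complete 2:3 group without waiting for the full utterance.
        while len(ling_buf) >= group[0] and len(sem_buf) >= group[1]:
            yield ling_buf[:group[0]] + sem_buf[:group[1]]
            ling_buf, sem_buf = ling_buf[group[0]:], sem_buf[group[1]:]
```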

4 Pretrain

4 预训练

4.1 Dataset

4.1 数据集

Our multi-modal pretraining dataset integrates three major categories of data resources: audio, text, and images. The audio section comprises 1.1 trillion tokens of audio continuation data (approximately 7,300,000 hours), 113 billion tokens of TTS (Text-to-Speech) synthesized speech data (about 700,000 hours), 105 billion tokens of ASR (Automatic Speech Recognition) data (around 650,000 hours), and 350 billion tokens of audio-text alternating data (approximately 2,000,000 hours). The text data, amounting to 800 billion tokens, encompasses web documents, books, code, and proprietary materials. The image section includes 800 billion tokens of image-text paired/alternating data, sourced from web pages, books, and proprietary resources.

我们的多模态预训练数据集整合了三大类数据资源:音频、文本和图像。音频部分包含1.1万亿Token的音频延续数据(约7,300,000小时)、1130亿Token的TTS(Text-to-Speech)合成语音数据(约700,000小时)、1050亿Token的ASR(Automatic Speech Recognition)数据(约650,000小时)以及3500亿Token的音频-文本交替数据(约2,000,000小时)。文本数据共计8000亿Token,涵盖网络文档、书籍、代码及专有材料。图像部分包含8000亿Token的图文配对/交替数据,来源包括网页、书籍及专有资源。

4.2 Training Detail

4.2 训练细节

Step-Audio is a component of Step-Omni, which is designed to train a unified pretrained model for speech, image, and text. This training is based on a pretrained text model and image encoder for continued pretraining. The entire process is divided into three stages.

Step-Audio 是 Step-Omni 的一个组件,旨在为语音、图像和文本训练一个统一的预训练模型。该训练基于预训练的文本模型和图像编码器进行持续预训练。整个过程总共分为三个阶段。

• Stage 1: We expanded the vocabulary of the pretrained text model by adding 5,120 audio tokens and integrated a pretrained image encoder to form the Step-Omni model. During training, to ensure minimal loss of the text model’s capabilities, the learning rate of the text model backbone is maintained at a low level (2e-5) throughout. However, the learning rates for the embedding and language model (LM) head are set to five times the backbone’s to facilitate faster convergence of the newly added tokens. Meanwhile, the image encoder remains frozen during the entire training process. At this stage, audio, text, and image data are used in a 2:1:1 ratio, with audio data consisting solely of pure audio continuation tasks.

• 阶段1:我们通过添加5,120个音频Token扩展了预训练文本模型的词汇表,并集成了一个预训练的图像编码器,形成了Step-Omni模型。在训练过程中,为了确保文本模型能力的损失最小化,文本模型主干的学习率始终保持较低水平(2e-5)。然而,嵌入和语言模型(LM)头部的学习率被设置为主干学习率的五倍,以促进新添加Token的快速收敛。同时,图像编码器在整个训练过程中保持冻结。在此阶段,音频、文本和图像数据的比例为2:1:1,音频数据仅包含纯音频延续任务。

• Stage 2: After training on 1.2T tokens in Stage 1, we incorporate audio-text interleaved data for further training, with a 1:1 ratio of audio continuation data to audio-text interleaved data. During this stage, the ratio of audio, text, and image data remains 2:1:1.

• 阶段2:在阶段1训练了1.2T Token后,我们加入音频-文本交替数据进行进一步训练,音频延续数据与音频-文本交替数据的比例为1:1。在此阶段,音频、文本和图像数据的比例保持为2:1:1。

• Stage 3: After training on 800B tokens in Stage 2, we incorporate ASR and TTS data for further training. The ratio of audio continuation data, audio-text interleaved data, ASR data, and TTS data is set to 1:1:1:1. During this phase, the ratio of audio, text, and image data is adjusted to 4:3:3. Additionally, the learning rates for the embedding and LM head are synchronized with the backbone, utilizing a cosine schedule that decreases from 2e-5 to 5e-6 (a short sketch of this learning-rate setup is given below).

• 阶段3:在阶段2训练了800B Token后,我们加入ASR和TTS数据进行进一步训练。音频延续数据、音频-文本交错数据、ASR数据和TTS数据的比例设置为1:1:1:1。在此阶段,音频、文本和图像数据的比例调整为4:3:3。此外,嵌入层和LM头的学习率与主干网络同步,采用余弦调度,从2e-5降至5e-6。

We employ the same pre-training strategy across models of varying parameter scales.

我们在不同参数规模的模型上采用了相同的预训练策略。
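The learning-rate setup described in the three stages above can be summarized in a few lines. This is a sketch under stated assumptions: the module names (`audio_embedding`, `lm_head`) and the step-based schedule granularity are hypothetical, only the rates (2e-5 backbone, 5x for new embeddings/head, cosine decay to 5e-6 in Stage 3) come from the text.

```python
import math

BACKBONE_LR = 2e-5

def stage12_lr(module_name):
    """Stages 1-2: new audio embeddings and LM head train 5x faster than the backbone."""
    if module_name in ("audio_embedding", "lm_head"):  # hypothetical module names
        return 5 * BACKBONE_LR
    return BACKBONE_LR

def stage3_lr(step, total_steps, lr_max=2e-5, lr_min=5e-6):
    """Stage 3: all parameters share one cosine decay from 2e-5 to 5e-6."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

print(stage12_lr("lm_head"))   # 1e-4
print(stage3_lr(500, 1000))    # ~1.25e-05, halfway between the two rates
```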

4.3 Training Infrastructure

4.3 训练基础设施

We train Step-Omni on thousands of H800 GPUs with 35% Model FLOPs Utilization (MFU). Beyond standard optimizations such as tailored GPU kernels and communication overlap, we highlight two innovative approaches that further enhance our training efficiency.

我们在数千块 H800 GPU 上训练了 Step-Omni,模型浮点运算利用率 (MFU) 达到 35%。尽管采用了诸如定制 GPU 内核和通信重叠等标准优化技术,但我们还着重介绍了两项进一步提升训练效率的创新方法。

Disaggregated Data Processing The processing of multi-modal data in Step-Omni training is computationally intensive, often requiring substantial CPU resources to keep pace with the model training speed. Conventional implementations typically co-locate data processing tasks with training jobs, leading to significant interference between these tasks and ultimately slowing down the training process. To address this issue, we introduce StarWeaver, an RPC-based distributed data processing library. StarWeaver relocates CPU-intensive data pre-processing tasks to remote processes, thereby alleviating the computational burden on the GPU training side and enhancing overall training efficiency. StarWeaver also facilitates the enhancement of load balancing in the data-parallel dimension, as it serves as an ideal mechanism for redistributing data with global workload information.

分离式数据处理 (Disaggregated Data Processing)
Step-Omni 训练中的多模态数据处理计算密集,通常需要大量的 CPU 资源以跟上模型训练速度。传统实现通常将数据处理任务与训练任务放在一起,导致这些任务之间产生显著干扰,最终拖慢训练进程。为了解决这个问题,我们引入了 StarWeaver,这是一个基于 RPC(远程过程调用)的分布式数据处理库。StarWeaver 将 CPU 密集型的数据预处理任务迁移到远程进程中,从而减轻 GPU 训练端的计算负担,并提高整体训练效率。StarWeaver 还有助于增强数据并行维度中的负载均衡,因为它是一种理想的机制,可以利用全局工作负载信息重新分配数据。
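StarWeaver itself is not public, so the sketch below only illustrates the general disaggregated pattern it is described as implementing: CPU-heavy preprocessing runs in separate worker processes while the training loop just collects ready batches. Python's standard ProcessPoolExecutor is used here as a local stand-in for the RPC layer; function and field names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

def preprocess(sample):
    """CPU-heavy multi-modal preprocessing (decode audio, resample, tokenize, ...)."""
    # Placeholder transformation; the real pipeline is far more involved.
    return {"tokens": sample["raw"][:1024]}

def training_batches(samples, batch_size=8, workers=16):
    """Yield preprocessed batches while keeping heavy work off the trainer process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        batch = []
        # In a truly disaggregated setup these workers would live on remote CPU
        # nodes behind an RPC layer; here they are local processes for illustration.
        for item in pool.map(preprocess, samples, chunksize=4):
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
```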

Disaggregated Model Placement For multi-modal models such as Step-Omni, training typically involves not only the LLM but also modality encoders (e.g., a vision encoder). Integrating these diverse components challenges the conventional assumption of training frameworks that the model is homogeneous and monolithic. This mismatch often results in suboptimal training efficiency. To address this issue, we propose disaggregated model placement, which allocates dedicated resources and employs tailored parallelism strategies for each sub-model. This novel approach effectively minimizes pipeline bubbles caused by model heterogeneity, thereby achieving optimal training efficiency. Details can be found in (Zhang et al., 2024).

分离式模型放置 (Disaggregated Model Placement)

对于多模态模型,例如 Step-Omni,训练通常不仅涉及大语言模型,还包括模态编码器(如视觉编码器)。整合这些不同的组件对训练框架的传统假设提出了挑战,即模型是同质且单一的。这种不匹配通常会导致训练效率低下。为了解决这个问题,我们提出了分离式模型放置,为每个子模型分配专用资源并采用量身定制的并行策略。这种新方法有效地最小化了由模型异质性引起的管道气泡,从而实现了最佳训练效率。详细信息请参阅 (Zhang et al., 2024)。

4.4 Exploring Tokenizer for Audio Pretraining

4.4 探索音频预训练中的 Tokenizer

To achieve the unification of speech understanding and generation, we first explored the use of a speech tokenizer. Initially, we investigated the training approach using a single codebook. In our experiments, we found that when training the model using only semantic tokens, the next-token prediction perplexity is relatively low, and the semantic coherence between the generated content and the preceding context is good. However, because the semantic tokens discard a large amount of acoustic information, the subsequent audio restoration through the vocoder suffers severe degradation in terms of timbre and prosody, resulting in poor auditory quality. When only using linguistic tokens for training, the audio recovered by the vocoder from the model’s continuation sounds good, but the next-token prediction perplexity is very high, and the semantic coherence between the continuation and the preceding context is poor. When training with interleaved semantic tokens and linguistic tokens, the semantic tokens ensure the semantic coherence of the continuation with the preceding context, while the linguistic tokens ensure the auditory quality of the reconstructed audio. Due to the mutual reference between semantic tokens and linguistic tokens, we observed that when using dual-codebook training, the next-token prediction perplexity for both semantic tokens and linguistic tokens decreased compared to using a single codebook, as shown in Figure 4. Notably, the decrease in next-token prediction perplexity for semantic tokens was more significant. Furthermore, ASR ablation results indicated that the dual-codebook model achieved a lower character error rate (CER) on the ASR test set compared to the pure single-codebook model (see Section 6.2.1).

为了实现语音理解与生成的统一,我们首先探索了使用语音分词器的方法。最初,我们研究了使用单一码本的训练方法。在实验中,我们发现,当仅使用语义 token 进行训练时,模型的下一个 token 预测困惑度相对较低,生成内容与上下文之间的语义连贯性较好。然而,由于语义 token 丢弃了大量声学信息,后续通过声码器恢复的音频在音色和韵律方面严重退化,听觉质量较差。当仅使用语言 token 进行训练时,模型生成的音频通过声码器恢复后听起来效果较好,但下一个 token 预测困惑度非常高,生成内容与上下文之间的语义连贯性较差。当使用交替的语义 token 和语言 token 进行训练时,语义 token 确保了生成内容与上下文的语义连贯性,而语言 token 则保证了重建音频的听觉质量。由于语义 token 和语言 token 之间的相互参考,我们观察到,与使用单一码本相比,使用双码本训练时,语义 token 和语言 token 的下一个 token 预测困惑度均有所下降,如图 4 所示。值得注意的是,语义 token 的下一个 token 预测困惑度下降更为显著。此外,ASR 消融实验结果表明,与纯单一码本模型相比,双码本模型在 ASR 测试集上实现了更低的字符错误率 (CER)(见第 6.2.1 节)。


Figure 4: Training loss comparison between Dual-Codebook and Single Codebook Tokenizer.

图 4: 双码本与单码本分词器的训练损失对比。

Furthermore, grouping and interleaving linguistic discrete tokens and semantic discrete tokens in a 2:3 ratio facilitates faster convergence of the training loss. More importantly, extending the CosyVoice semantic tokens with linguistic tokens enhances the model’s ability to understand and follow multi-turn history instructions and also mitigates issues such as unclear pronunciation and indistinct articulation, significantly improving upon the performance of CosyVoice’s single codebook.

此外,以 2:3 的比例对语言离散 Token 和语义离散 Token 进行分组和交错排列,有助于更快地收敛训练损失。更重要的是,通过语言 Token 扩展 CosyVoice 的语义 Token,增强了模型理解和遵循多轮历史指令的能力,同时也缓解了发音不清和表达模糊等问题,显著提升了相较于 CosyVoice 单码本的性能。

5 Post-Training

5 后训练

5.1 TTS

5.1 文本到语音 (TTS)

5.1.1 Dataset

5.1.1 数据集

High-quality speech data is crucial for the TTS task, as it directly impacts the model’s performance and the expressiveness of the generated speech. Language-specific data, dialect data, speaking styles, emotional data, and paralinguistic data are extremely scarce. Constructing such datasets demands substantial human and financial resources, and the process generally spans an extended period.

高质量的语音数据对于 TTS 任务至关重要,因为它直接影响模型的性能和生成语音的表现力。特定语言的数据、方言数据、说话风格、情感数据和副语言数据极为稀缺。构建此类数据集需要大量的人力和财力资源,且整个过程通常需要较长时间。

To address this gap, we present the first synthetic data-driven framework for TTS systems, comprising three key components:

为了解决这一差距,我们提出了首个基于合成数据的TTS系统框架,该框架包含三个关键组件:

• First, we employ a Step-2 (StepFun, 2024b) LLM to generate linguistically diverse and semantically rich textual content.
• Second, we select a pre-trained Step-Audio model checkpoint incorporating audio-token cooldown mechanisms, which enables direct generation of speaker-specific, language-dependent, and dialect-aware audio data.
• Third, we develop an Audio-Edit Model by fine-tuning the aforementioned checkpoint, specifically designed to generate nuanced emotional expressions and diverse speaking styles. This model architecture allows for precise control over paralinguistic features while maintaining speaker consistency.

• 首先,我们使用 Step-2 (StepFun, 2024b) 大语言模型生成语言多样且语义丰富的文本内容。
• 其次,我们选择了一个预训练的 Step-Audio 模型检查点,该检查点结合了音频 Token 冷却机制,能够直接生成特定说话者、语言相关且方言感知的音频数据。
• 第三,我们通过微调上述检查点,开发了一个音频编辑模型,专门用于生成细腻的情感表达和多样化的说话风格。该模型架构允许精确控制副语言特征,同时保持说话者的一致性。

Figure 5: The process starts with text input which is processed by a Step-2 LLM to generate multiple rewritten texts. Then, a Step-Audio model generates target-speaker data using the rewritten texts and existing audio wav data. Finally, an Audio-Edit model refines the data to produce emotion/style data, addressing the scarcity of high-quality speech data in TTS tasks.

图 5: 该过程从文本输入开始,文本输入由 Step-2 大语言模型处理以生成多个重写文本。然后,Step-Audio 模型使用重写文本和现有的音频 wav 数据生成目标说话者数据。最后,Audio-Edit 模型对数据进行细化,生成情感/风格数据,从而解决 TTS 任务中高质量语音数据稀缺的问题。

Language and Dialect Leveraging the robust continuation ability of Step-Audio, which has been trained on large volumes of speaker and language data, we generate target-speaker, language, and dialect data. The text-based LLM Step-2 is used to translate and rewrite chat text to conform to the grammar and style of the target language or dialect. We collect audio recordings and texts from native speakers as prompt audio and text.

语言和方言
利用 Step-Audio 在大量说话者和语言数据上训练的强大延续能力,我们生成目标说话者、语言和方言的数据。基于文本的大语言模型 Step-2 用于翻译和重写聊天文本,以符合目标语言或方言的语法和风格。我们收集来自母语者的录音和文本,作为提示音频和文本。

Then, using the format [system prompt; prompt text; target text; prompt code; target code] along with the corresponding text, we use Step-Audio for audio continuation generation. This method allows for the quick creation of a large amount of native-speaker data for the target language and dialect with only a small quantity of high-quality seed data.

随后,使用 [系统提示;提示文本;目标文本;提示码;目标码] 的格式以及相应的文本,我们通过 Step-Audio 进行音频延续生成。这种方法只需少量高质量的种子数据,即可快速生成大量目标语言和方言的母语数据。
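A sketch of how a continuation prompt in the [system prompt; prompt text; target text; prompt code; target code] format might be assembled for data generation. The field names, system-prompt wording, and token values are illustrative assumptions, not the exact serialization used by Step-Audio.

```python
def build_continuation_prompt(system_prompt, prompt_text, target_text, prompt_codes):
    """Assemble the continuation input; the model is asked to produce the target codes.

    prompt_codes are the dual-codebook tokens of the native-speaker seed audio;
    the target codes (audio tokens for target_text) are what the model generates.
    """
    return {
        "system": system_prompt,                 # dialect/speaker instruction
        "text": f"{prompt_text} {target_text}",  # seed text followed by new text
        "audio_prefix": prompt_codes,            # seed audio tokens to continue from
    }

example = build_continuation_prompt(
    system_prompt="Speak in Cantonese with the given speaker's voice.",  # hypothetical
    prompt_text="你好,今日天氣點呀?",
    target_text="我哋一齊去飲茶好唔好?",
    prompt_codes=[812, 95, 3301, 77, 2048],      # hypothetical dual-codebook tokens
)
```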

Emotion and Speaking Styles Emotion and speaking style data have been challenging to deal with because of the difficulty in both differentiating and defining emotion categories and their respective intensities, as well as the complexity associated with accurately describing and recording various style types. To address this, an Audio-Edit model-based approach is proposed. It ingeniously converts complex emotion and style descriptions into a comparative pair data construction format. Step-2 is used to rewrite chat text with specific emotions and styles. Normal and emotional speech samples from the same speaker with identical text are collected, and Step-Audio is used for cloning and continuation generation to create (text, neutral audio token, emotion and style audio token) data. Only the (neutral audio token, emotion and style audio token) pairs are used to perform SFT on the audio-cooldown pretrained model to get the Audio-Edit model. Using this model, neutral-style speech can be input to generate emotion- or style-enhanced audio, and data with different emotion or style intensities can be produced iteratively.

情感与说话风格
情感和说话风格数据一直难以处理,因为情感类别及其强度既难以区分又难以定义,而且准确描述和录制各种风格类型也十分复杂。为此,我们提出了一种基于 Audio-Edit 模型的方法,巧妙地将复杂的情感和风格描述转化为成对对比数据的构建形式。我们使用 Step-2 对聊天文本进行带有特定情感和风格的重写,收集同一说话者、相同文本的中性语音和情感语音样本,并使用 Step-Audio 进行克隆和延续生成,构建(文本、中性音频 Token、情感/风格音频 Token)数据。仅使用(中性音频 Token、情感/风格音频 Token)配对数据,对音频冷却预训练模型进行 SFT,得到 Audio-Edit 模型。利用该模型,可以输入中性风格的语音来生成情感或风格增强的音频,并可迭代地生成不同情感或风格强度的数据。

Singing and RAP We construct a paired dataset of lyrics and vocal segments through three stages: (1) Collecting 10,000+ hours of singing/RAP tracks with LyRiCs-format timestamps; (2) Extracting dry vocals using Demucs (Rouard, Massa, & Défossez, 2023) and removing silent regions via Voice Activity Detection (VAD); (3) Segmenting audio using LyRiCs timestamps and aligning lyrics with audio segments. For data cleaning, we performed three steps: (1) RAP Separation: we isolated pure RAP segments by retaining those with higher speech rates and using a genre classification model to identify hip-hop clips; (2) Audio Quality Filtering: utilizing noise detection and speaker diarization, we preserved low-noise, single-speaker segments; (3) Alignment Verification: to address misalignment due to inaccurate LyRiCs timestamps, we computed the Character Error Rate (CER) between transcribed speech and ground-truth lyrics, discarding misaligned segments. Ultimately, the total length of the retained audio segments constituted 17.8% of the original song durations. This dataset supports dual training objectives: the LLM learns to map lyrics to linguistic and semantic tokens, while the speech decoder decodes these tokens into high-fidelity vocals with accurate pitch.

唱歌与说唱
我们通过三个阶段构建了一个歌词与声乐片段配对的数据集:(1) 收集了超过10,000小时的带有LyRiCs格式时间戳的唱歌/说唱曲目;(2) 使用Demucs (Rouard, Massa, & Défossez, 2023) 提取干声,并通过语音活动检测 (VAD) 去除静音区域;(3) 使用LyRiCs时间戳分割音频,并将歌词与音频片段对齐。
在数据清理过程中,我们执行了三个步骤:(1) 说唱分离:通过保留语速较高的片段并使用流派分类模型识别嘻哈片段,我们分离出纯说唱片段;(2) 音频质量过滤:利用噪声检测和说话人分离,我们保留了低噪声、单说话人的片段;(3) 对齐验证:为了解决由于LyRiCs时间戳不准确导致的对齐问题,我们计算了转录语音与真实歌词之间的字符错误率 (CER),并丢弃未对齐的片段。
最终,保留的音频片段总长度占原歌曲时长的17.8%。该数据集支持双重训练目标:大语言模型学习将歌词映射为语言和语义Token,而语音解码器则将这些Token解码为高保真且音调准确的声乐。
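The cleaning criteria above can be expressed as a filter over candidate segments. The thresholds and field names below are hypothetical, and the `character_error_rate` helper is a plain Levenshtein-based sketch; Demucs separation, ASR, and diarization are assumed to have run upstream.

```python
def character_error_rate(hyp, ref):
    """Levenshtein distance between transcript and lyrics, normalized by lyric length."""
    dp = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (h != r))
    return dp[-1] / max(len(ref), 1)

def keep_segment(seg, max_cer=0.2, max_noise=0.3):
    """Apply the three cleaning steps: genre/speech-rate, quality, alignment."""
    is_rap = seg["genre"] == "hip-hop" and seg["speech_rate"] > seg["genre_rate_median"]
    clean = seg["noise_score"] < max_noise and seg["num_speakers"] == 1
    aligned = character_error_rate(seg["asr_text"], seg["lyrics"]) < max_cer
    return (is_rap or seg["is_singing"]) and clean and aligned
```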

Target Speaker Supporting multiple languages or dialects for a target speaker is challenging when relying solely on model generalization from foundational language and dialect data, as this often fails to reach native-speaker quality. To mitigate this issue, we employ dual codes extracted from audio generated by native speakers whose timbre and prosody are similar to the target speaker’s. These dual codes are combined with the target speaker’s prompt audio to regenerate new audio, from which dual codes are then extracted again. Through this straightforward procedure, the target speaker’s speech in new languages and dialects becomes more akin to that of a native speaker.

目标说话人
仅依靠基础语言和方言数据的模型泛化,很难让目标说话人支持多种语言或方言,往往达不到母语者的水平。为缓解这一问题,我们从音色和韵律与目标说话人相近的母语者所生成的音频中提取双码,将这些双码与目标说话人的提示音频结合以重新生成新的音频,再从中重新提取双码。通过这一简单的流程,目标说话人在新语言和新方言中的语音会更接近母语者的水平。

Quality assessment of data constitutes a critical component in our synthetic data framework. To ensure the reliability and validity of both seed and synthesized data, we have implemented a comprehensive evaluation system incorporating multiple objective metrics: ASR accuracy, Voice Activity Detection (VAD) performance, Speaker Diarization precision, Emotion recognition consistency, and Deep Noise Suppression (DNS) effectiveness. This multi-dimensional quality control mechanism guarantees the robustness and practical utility of generated synthetic data.

数据质量评估是我们合成数据框架中的关键组成部分。为确保种子数据和合成数据的可靠性和有效性,我们实施了一个综合评估系统,包含多个客观指标:ASR(自动语音识别)准确率、语音活动检测(VAD)性能、说话人分离精度、情感识别一致性以及深度噪声抑制(DNS)效果。这种多维度的质量控制机制保证了生成合成数据的鲁棒性和实际应用价值。
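A minimal sketch of how the five objective checks listed above could be combined into a single accept/reject gate. Each field is assumed to come from an upstream model (ASR, VAD, diarization, emotion recognition, DNS); names and thresholds are illustrative, not the values used in practice.

```python
def passes_quality_gate(clip, thresholds=None):
    """Combine the objective metrics into one accept/reject decision."""
    t = thresholds or {
        "max_asr_cer": 0.05,        # ASR accuracy check
        "min_speech_ratio": 0.6,    # VAD: fraction of frames that are speech
        "max_speakers": 1,          # diarization: single-speaker clips only
        "min_emotion_agree": 0.8,   # emotion label consistency with the request
        "min_dns_mos": 3.5,         # DNS-style noise/quality score
    }
    return (clip["asr_cer"] <= t["max_asr_cer"]
            and clip["speech_ratio"] >= t["min_speech_ratio"]
            and clip["num_speakers"] <= t["max_speakers"]
            and clip["emotion_agreement"] >= t["min_emotion_agree"]
            and clip["dns_mos"] >= t["min_dns_mos"])
```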

5.1.2 Training Detail

5.1.2 训练细节

In contrast to conventional TTS systems that emphasize fine-grained control over speaker characteristics, emotional expression, linguistic features, and stylistic elements, our approach adopts the chat-based paradigm and training methodology of LLMs. This strategic alignment significantly enhances system flexibility while simultaneously establishing a scalable framework to support future model and data expansion, thereby addressing the critical challenges of scalability in speech synthesis systems.

与传统TTS系统强调对说话者特征、情感表达、语言特征和风格元素的细粒度控制不同,我们的方法采用了大语言模型的聊天范式和训练方法。这种战略对齐显著增强了系统的灵活性,同时建立了一个可扩展的框架,以支持未来的模型和数据扩展,从而解决了语音合成系统中可扩展性的关键挑战。

Supervised Fine-Tuning Format The SFT format comprises three essential components: the system prompt, the human input, and the assistant response, structured in a two-turn dialogue configuration. Within this format, the system prompt serves as the foundational element for specifying the speaker attributes and defining the supported instruction tags. The human input and the assistant response components are specifically designed to handle the textual content and the dual-codebook representations respectively. The text and audio tokens from the first round can be utilized to maintain the in-domain speaker’s timbre and style consistency, as well as to enable out-of-domain zero-shot cloning.

监督微调格式
SFT 格式包含三个基本组成部分:系统提示 (system prompt)、人类输入 (human input) 和助手响应 (assistant response),采用两轮对话结构构建。在该格式中,系统提示作为指定说话者属性和定义支持的指令标签的基础元素。人类输入和助手响应组件分别专门用于处理文本内容和双码本表示。第一轮的文本和音频 Token 可用于保持领域内说话者的音色和风格一致性,以及实现领域外零样本克隆。
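A sketch of what a two-turn SFT sample in this format could look like. The key names, system-prompt wording, instruction tags, and token values are illustrative assumptions rather than the exact on-disk schema.

```python
sft_sample = {
    # System prompt: speaker attributes plus the supported instruction tags.
    "system": "Speaker: female_01. Supported tags: [emotion], [speed], [dialect], [style].",
    "turns": [
        {   # Turn 1 anchors the in-domain timbre/style (or a zero-shot clone prompt).
            "human": "今天的天气真不错。",
            "assistant_audio_tokens": [41, 902, 3117, 65, 2210, 874],  # dual-codebook codes
        },
        {   # Turn 2 carries the instruction-tagged target text to synthesize.
            "human": "[emotion:happy level-4] [speed:fast level-2] 我们周末去爬山吧!",
            "assistant_audio_tokens": [88, 1764, 402, 93, 3355, 129],
        },
    ],
}
```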

Instruction Tags Instruction tags are classified into two distinct categories: descriptive tags and comparative tags. Descriptive tags are utilized for controlling aspects such as language, dialect, voice, and style, while comparative tags are employed for hierarchical distinctions in emotion and speed control. The data for descriptive tags is generated using Step-Audio voice cloning, supporting languages and styles including Japanese, Korean, Cantonese, Sichuan dialect, cute voice, RAP, and singing. The data for comparative tags is generated using the Audio-Edit model, supporting emotions such as happiness, anger, and sadness, as well as speed variations like fast and slow, each divided into five hierarchical levels.

指令标签
指令标签分为两类:描述性标签和比较性标签。描述性标签用于控制语言、方言、嗓音和风格等方面,而比较性标签则用于情感和速度控制的层级区分。描述性标签的数据是通过 Step-Audio 模型生成的,支持的语言和风格包括日语、韩语、粤语、四川方言、可爱嗓音、说唱和演唱。比较性标签的数据是通过 Audio Edit 模型生成的,支持的情感包括快乐、愤怒、悲伤,以及速度变化如快和慢,每种情感和速度变化都分为五个层级。
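A compact illustration of the two tag families described above; the exact tag strings and grouping are assumptions.

```python
INSTRUCTION_TAGS = {
    # Descriptive tags: pick a category value (data generated via Step-Audio cloning).
    "descriptive": {
        "language_dialect": ["Japanese", "Korean", "Cantonese", "Sichuanese"],
        "style": ["cute_voice", "RAP", "singing"],
    },
    # Comparative tags: pick a direction plus one of five intensity levels
    # (data generated via the Audio-Edit model).
    "comparative": {
        "emotion": ["happy", "angry", "sad"],
        "speed": ["fast", "slow"],
        "levels": [1, 2, 3, 4, 5],
    },
}
```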

We employ the SFT data as outlined in Section 5.1.1 and utilize a 3-billion-parameter model, training it for one epoch with an initial learning rate of $2\times10^{-5}$. The lea