Abstract
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine-grained control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. On our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
1 Introduction
The evolution of artificial intelligence toward general-purpose systems has positioned real-time speech interaction as a critical interface for human-machine collaboration. While recent multi-modal large language models (LLMs) have accelerated progress in this domain, open-source communities face persistent challenges despite breakthroughs in proprietary systems like GPT-4o (Hurst et al., 2024) and Doubao (ByteDance, 2024). Existing open-source models such as Qwen2-Audio (Chu et al., 2024a), Llama 3 (Dubey et al., 2024), and WavLLM (Hu et al., 2024) struggle with three fundamental limitations: the separation of understanding and generation processes, which impedes end-to-end system integration; dependence on laborious manual speech data acquisition, which restricts efficient voice replication; and inadequate precision in regulating prosodic features, regional dialects, and tool utilization. These limitations highlight the urgent demand for deployable frameworks that combine a streamlined architecture with dual competencies in affective computing (accurate emotion perception and adjustment) and contextual cognition (situational reasoning and response formulation).
Current open-source speech systems confront multiple architectural challenges. The traditional framework employs a cascading approach (Huang et al., 2024) combining Automatic Speech Recognition (ASR), LLM processing, and Text-to-Speech (TTS). This framework introduces error propagation through modality transitions while increasing system complexity. Pure end-to-end approaches, though conceptually elegant, often sacrifice open-domain dialogue quality (Zeng et al., 2024). The tension between modular design and fully integrated systems remains unresolved. Furthermore, traditional text-to-speech pipelines depend on manually curated datasets, particularly for multilingual and multi-dialect scenarios, a process requiring prohibitive human annotation effort. Existing solutions also lack sophisticated control mechanisms for dynamic speech adaptation, such as real-time adjustment of speaking rate, emotional prosody, or musical rendering (e.g., singing and RAP vocals). Crucially, the absence of tool invocation capabilities and contextual awareness prevents handling complex queries like "Retrieve live weather data and report it in Cantonese," necessitating manual API integration.
This report presents Step-Audio, the first production-ready open-source framework for intelligent speech interaction that harmonizes comprehension and generation through four key innovations.
• 130B-Parameter Multi-modal Model: A single unified model integrating comprehension and generation capabilities, performing speech recognition, semantic understanding, dialogue, voice cloning, audio editing and speech synthesis. We have made the 130B Step-Audio-Chat variant open source.
• Generative Data Engine: Eliminates traditional TTS's reliance on manual data collection by generating high-quality audio through our 130B-parameter multi-modal model. This data is leveraged to train and publicly release a resource-efficient Step-Audio-TTS-3B model with enhanced instruction-following capabilities for controllable speech synthesis.
• Granular Voice Control: Enables precise regulation through an instruction-based control design, supporting multiple emotions (anger, joy, sadness), dialects (Cantonese, Sichuanese, etc.), and vocal styles (RAP, singing, a cappella humming) to meet diverse speech generation needs.
• Enhanced Intelligence: Improves agent performance in complex tasks through ToolCall mechanism integration and role-playing enhancements.
In open-source benchmarks, Step-Audio demonstrates exceptional performance. It achieves SoTA results on open-domain question answering and complex instruction tasks, including LLaMA Question, TriviaQA, and ComplexBench, with an average improvement of 9.3 points over the best open-source results, validating its advantage in generalized deep semantic understanding. Additionally, to address the current lack of comprehensive end-to-end speech dialogue evaluation systems, we introduce the multi-dimensional StepEval-Audio-360 evaluation framework covering nine dimensions, including logical reasoning, creative ability, language proficiency, and comprehension control, among other key capabilities. As shown in Figure 1, Step-Audio achieves SoTA results across all dimensions in subjective comparisons against open-source models such as GLM-4-Voice and Qwen2-Audio, with improvements of 19.2%, 23.7%, and 43.2% in response quality, response relevance, and factual accuracy, respectively. In generation control dimensions such as emotion understanding, speech rate control, RAP vocals, and role-playing, the IF (Instruction Following) and MOS (Mean Opinion Score) metrics improved by 29.8% and 27.1%, respectively, over open-source SoTA models, highlighting its leading advantage in complex speech interaction scenarios.
Figure 1: Human Evaluation of End-to-End Speech Interactions. We conduct comprehensive human assessments comparing Step-Audio against GLM-4-Voice (Zeng et al., 2024) and Qwen2-Audio (Chu et al., 2024b) across nine critical dimensions: role-playing, logical reasoning, creativity, singing, language ability, speech emotion control, gaming interaction, voice instruction following, and voice understanding. Expert evaluators rated end-to-end dialog sessions using Likert scales (1-5) for naturalness and task completion. Step-Audio represents the state of the art (SoTA) across all these dimensions. It is particularly remarkable in language ability, demonstrating a high level of proficiency in grammar, semantics, and language generation. In singing, Step-Audio outshines the other models with its natural pitch control, rhythm accuracy, and overall harmonious vocal output, making it a top-tier performer in these two crucial aspects.
2 Related Work
Recent progress in end-to-end speech systems has markedly improved human-AI audio interaction. Early approaches relied on cascaded ASR-LLM-TTS pipelines (Huang et al., 2024), where distinct modules for speech recognition, language modeling, and speech synthesis are sequentially connected. However, these systems suffered from latency buildup, error propagation, and disjointed optimization. Later approaches sought to enhance integration by directly linking speech encoders to LLMs through trainable adapters (Chu et al., 2024a; Das et al., 2024; Kong et al., 2020), though they still required separate TTS modules for audio output.
The emergence of fully end-to-end systems marked a paradigm shift. Architectures like Llama-Omni (Fang et al., 2024) integrated non-autoregressive (NAR) TTS modules with language models, using connectionist temporal classification (CTC) loss. Freeze-Omni (Wang et al., 2024) uses a combination of autoregressive and NAR speech decoders. These systems demonstrated improved latency but exhibited limitations in handling emotional nuance and natural conversational flow. MinMo (Q. Chen et al., 2025) introduced autoregressive speech token prediction through the CosyVoice2 (Du, Wang, et al., 2024) decoder, while interleaved modeling approaches (Nguyen et al., 2024; Zeng et al., 2024) alternated between text and speech token generation at the sequence level.
Parallel decoding architectures like Moshi (Défossez et al., 2024) and Mini-Omni (Xie & Wu, 2024) represented a significant leap by generating text and multiple speech codebook tokens simultaneously. These systems achieved lower latency through compressed speech token sequences but faced challenges in preserving linguistic capabilities when scaling speech token bandwidth. Current systems generally specialized in specific aspects: GLM-4-Voice (Zeng et al., 2024) prioritized latency reduction, while Moshi emphasized speech quality, but none holistically addressed emotion awareness, conversational naturalness, and real-time knowledge integration.
Recent methodological advances have systematically investigated emotion-aware interaction paradigms, though their integration with multi-modal frameworks remains nascent. While some systems (Wang et al., 2024) incorporated basic sentiment analysis, they lacked bidirectional emotional resonance: they neither detected paralinguistic cues in user speech nor generated contextually appropriate emotional responses. The naturalness gap persisted due to LLMs' tendency toward verbose, text-optimized outputs (Fang et al., 2024) that are ill-suited for spoken dialogue. Recent work has introduced task-specific optimizations: LUCY (H. Gao et al., 2025) adopted the architectural framework of Mini-Omni (Xie & Wu, 2024), augmented with specialized fine-tuning on conversational datasets for emotion control and function calling.
3 Architecture
Traditional voice dialogue systems typically employ a cascaded architecture comprising ASR, LLM, and TTS modules. However, our proposed model, having undergone comprehensive multi-modal training and alignment of text and audio during the pretraining phase, already possesses end-to-end voice dialogue capabilities. Despite extensive exploration of alternative designs, we ultimately adopted the AQTA (audio input, text output) + TTS framework for real-time voice dialogue, as shown in Figure 2, driven by the following considerations:
• Scarcity of high-quality pure-voice dialogue data: The limited availability of pure-voice dialogue data, coupled with its constrained scenarios, restricts the training efficiency of end-to-end voice dialogue models.
• Controllability and customization of output speech: By incorporating a TTS module, we gain flexible control over speech parameters such as timbre and pitch to meet users' personalized demands, while continuously enhancing the model's expressive capabilities.
Figure 2: Architecture of Step-Audio. Step-Audio primarily consists of three components: the speech tokenizer, the LLM, and the speech decoder. The speech tokenizer is responsible for discretizing the input speech into tokens. The LLM models both text and speech tokens, while the speech decoder generates the waveform output.
Our goal is to establish Step-Audio as a real-time multi-modal model that seamlessly integrates speech understanding and synthesis through three key components: (1) a dual-codebook tokenization framework employing parallel linguistic (16.7 Hz, 1024-entry codebook) and semantic (25 Hz, 4096-entry codebook) tokenizers with 2:3 temporal interleaving; (2) a 130B-parameter LLM based on Step-1 (StepFun, 2024a), enhanced through audio-contextualized continual pretraining and post-training; (3) a hybrid speech synthesizer combining flow matching with a neural vocoder, optimized for real-time waveform generation. In addition, a Voice Activity Detection (VAD) module is employed to extract vocal segments.
3.1 Tokenizer
To overcome the limitations of conventional speech tokenizers, which separately capture information for either understanding or generation tasks, we propose a dual-codebook speech tokenizer framework in Step-Audio, similar to ARCON (Ming et al., 2024). This approach employs two distinct tokenizers, linguistic and semantic, to better represent speech features. The linguistic tokenizer is utilized to extract structured, high-level representations, including phonemic and linguistic features, whereas the semantic tokenizer is designed to encode both semantic and coarse-grained acoustic characteristics.
For linguistic tokenization, we utilize the output from the Paraformer (Z. Gao, Zhang, McLoughlin, & Yan, 2022) encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice's (Du, Chen, et al., 2024) tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech outputs, operating at a token rate of 25 Hz. The linguistic tokenizer employs a codebook size of 1024, while the semantic tokenizer utilizes a larger codebook size of 4096 to capture finer acoustic details.
To effectively integrate these two tokenization schemes, we implement a token-level interleaving approach inspired by SpiritLM (Nguyen et al., 2024). Given the differing token rates, we establish a temporal alignment ratio of 2:3, where every two linguistic tokens are paired with three semantic tokens.
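The 2:3 grouping keeps the two streams time-aligned, since two linguistic tokens (2/16.7 ≈ 0.12 s) cover the same duration as three semantic tokens (3/25 = 0.12 s). The following is a minimal sketch of this interleaving, with illustrative function and variable names rather than the actual Step-Audio implementation:
```python
from typing import List

def interleave_dual_codebook(linguistic: List[int], semantic: List[int]) -> List[int]:
    """Merge a 16.7 Hz linguistic stream and a 25 Hz semantic stream by
    alternating groups of 2 linguistic and 3 semantic tokens, so each group
    covers roughly the same ~0.12 s of audio (2/16.7 s ~= 3/25 s)."""
    merged, i, j = [], 0, 0
    while i < len(linguistic) or j < len(semantic):
        merged.extend(linguistic[i:i + 2])
        merged.extend(semantic[j:j + 3])
        i, j = i + 2, j + 3
    return merged

# Toy example: 4 linguistic and 6 semantic tokens cover the same 0.24 s of audio.
print(interleave_dual_codebook([0, 1, 2, 3], [100, 101, 102, 103, 104, 105]))
# -> [0, 1, 100, 101, 102, 2, 3, 103, 104, 105]
```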
3.2 LLM
To enhance Step-Audio's ability to effectively process speech information and achieve accurate speech-text alignment, we conducted audio continual pretraining based on Step-1, a 130-billion-parameter pretrained text-based LLM. The details of the pretraining and post-training processes for Step-Audio are comprehensively discussed in Sections 4 and 5.
In multi-turn dialogue systems, the substantial disparity in length between audio tokens and text tokens necessitates efficient processing strategies. To address this, historical information is initially transcribed into textual format utilizing an ASR model prior to system input, thereby optimizing computational efficiency. However, it should be noted that the model architecture maintains the capability to process and utilize audio tokens as historical context when required.
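As a rough illustration of this strategy (the function and field names below are hypothetical, not the actual Step-Audio API), past turns can be kept as ASR text while only the current user turn is fed as audio tokens:
```python
def build_multi_turn_context(history, current_audio_tokens, asr_transcribe):
    """Keep historical turns as compact text; only the current user turn is
    passed as (much longer) audio tokens. `asr_transcribe` stands in for any
    ASR system that transcribes stored user audio."""
    messages = []
    for turn in history:
        content = turn["text"] if "text" in turn else asr_transcribe(turn["audio"])
        messages.append({"role": turn["role"], "content": content})
    # Current turn is passed to the LLM as interleaved dual-codebook tokens.
    messages.append({"role": "user", "content": current_audio_tokens})
    return messages
```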
3.3 Speech Decoder
The speech decoder consists of a 3-billion-parameter language model, a flow-matching model, and a mel-to-wave vocoder; it is primarily designed to receive text or audio tokens and generate continuous, time-domain stylized waveforms that incorporate historical information and instructions. To optimize the intelligibility and naturalness of the synthesized speech, the speech decoder is trained using a dual-code interleaving approach, ensuring seamless integration of linguistic and semantic features throughout the generation process. With larger-parameter speech decoders, we have observed the emergence of enhanced generative capabilities. For further details, please refer to Section 5.1.
3.4 Real-time Inference
To enable real-time interactions, we have designed an optimized inference pipeline as shown in Figure 3. At its core, the Controller module manages state transitions, orchestrates speculative response generation, and ensures seamless coordination between critical subsystems. These subsystems include VAD for detecting user speech, the Streaming Audio Tokenizer for processing audio in real-time, the Step-Audio language model and Speech Decoder for processing and generating responses, and the Context Manager for preserving conversational continuity.
Figure 3: The architecture of the real-time inference pipeline, designed to enable real-time interactions. When audio is input, it is processed concurrently by the streaming audio tokenizer and the voice activity detection module. The controller manages state transitions. A pause in user speech triggers speculative response generation, with multiple calls made but only one response committed. The context manager handles the conversation history in text format for continuity. Once the user finishes speaking, the system enters the reply state, commits a speculative response, and outputs audio. After that, it returns to the idle state for the next interaction.
Speculative Response Generation To reduce interaction latency, the system preemptively generates speculative responses. This minimizes perceived delays and enhances responsiveness, though at the cost of occasional redundant computation when speculative responses are discarded. The system begins in the Silence state, awaiting user input. When the VAD detects active speech, the system transitions to the UserSpeaking state. During this state, the Streaming Audio Tokenizer begins converting audio into tokens. If the user momentarily pauses, the system enters the UserPaused state, where speculative response generation is triggered. By preemptively generating a response in anticipation of input completion, the system reduces latency when the conversation resumes. If the user resumes speaking, the speculative response is discarded. Once the system confidently determines that the user has finished speaking, it transitions to the BotReplying state, commits the most recent speculative response, and delivers its audio output. If interrupted by user speech, the system prioritizes the new input while maintaining conversational continuity. After completing its response, the system returns to the Silence state, ready for the next interaction. Empirical analysis shows that approximately 40% of speculative responses are successfully committed. This mechanism reduces per-response latency by approximately 500 ms compared to non-speculative methods.
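The transition logic described above can be summarized as a small state machine; the sketch below uses illustrative state and event names rather than the actual implementation:
```python
from enum import Enum, auto

class State(Enum):
    SILENCE = auto()
    USER_SPEAKING = auto()
    USER_PAUSED = auto()
    BOT_REPLYING = auto()

# (state, event) -> next state, following the transitions described in the text.
TRANSITIONS = {
    (State.SILENCE, "vad_speech_detected"): State.USER_SPEAKING,
    (State.USER_SPEAKING, "pause_detected"): State.USER_PAUSED,   # trigger speculative response
    (State.USER_PAUSED, "speech_resumed"): State.USER_SPEAKING,   # discard speculative response
    (State.USER_PAUSED, "end_of_speech"): State.BOT_REPLYING,     # commit speculative response
    (State.BOT_REPLYING, "user_interrupt"): State.USER_SPEAKING,  # prioritize the new input
    (State.BOT_REPLYING, "reply_finished"): State.SILENCE,
}

def next_state(state: State, event: str) -> State:
    """Return the controller's next state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```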
Context Management Our system utilizes text transcription instead of raw audio tokens for historical context, as it provides a more compact representation (with an average text-to-audio token ratio of 1:14), improving performance and enabling longer conversations with minimal impact on quality. ASR asynchronously transcribes user speech into text, maintaining an accurate and up-to-date conversation history.
Streaming Audio Tokenizer The input audio stream is processed through two parallel tokenizer pipelines, each employing fixed-duration segmentation. The resulting tokens are seamlessly merged into a single sequence with a 2:3 interleaving ratio. Without the streaming audio tokenizer, inference would be significantly slower, with the delay growing with the length of the audio input.
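A schematic of this streaming variant, assuming two tokenizer callables that operate on fixed-duration chunks (the names and chunking granularity are assumptions for illustration):
```python
def stream_interleaved_tokens(audio_chunks, linguistic_tokenizer, semantic_tokenizer):
    """Tokenize each fixed-duration chunk with both tokenizers and emit the
    merged 2:3 sequence immediately, so tokenization latency does not grow
    with the total utterance length."""
    for chunk in audio_chunks:
        ling = linguistic_tokenizer(chunk)   # ~2 tokens per 0.12 s at 16.7 Hz
        sem = semantic_tokenizer(chunk)      # ~3 tokens per 0.12 s at 25 Hz
        for k in range(len(ling) // 2):
            yield from ling[2 * k: 2 * k + 2]
            yield from sem[3 * k: 3 * k + 3]
```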
4 Pretrain
4.1 Dataset
Our multi-modal pretraining dataset integrates three major categories of data: audio, text, and images. The audio portion comprises 1.1 trillion tokens of audio continuation data (approximately 7,300,000 hours), 113 billion tokens of TTS (Text-to-Speech) synthesized speech data (about 700,000 hours), 105 billion tokens of ASR (Automatic Speech Recognition) data (around 650,000 hours), and 350 billion tokens of audio-text alternating data (approximately 2,000,000 hours). The text data, amounting to 800 billion tokens, encompasses web documents, books, code, and proprietary materials. The image portion includes 800 billion tokens of image-text paired/alternating data, sourced from web pages, books, and proprietary resources.
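As a rough consistency check of these figures (our own back-of-the-envelope arithmetic, not part of the dataset description), the interleaved dual-codebook rate of 16.7 + 25 ≈ 41.7 tokens per second implies roughly 150K tokens per hour of audio, which matches the reported token-to-hour ratios; the ASR and TTS corpora come out slightly higher, plausibly because they also contain paired text tokens:
```python
tokens_per_second = 16.7 + 25            # interleaved dual-codebook rate (Hz)
print(tokens_per_second * 3600)          # ~150,120 tokens per hour of audio

print(1.1e12 / 7.3e6)                    # audio continuation: ~150,685 tokens/hour
print(113e9 / 0.7e6)                     # TTS data:           ~161,429 tokens/hour
print(105e9 / 0.65e6)                    # ASR data:           ~161,538 tokens/hour
```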
4.2 Training Detail
Step-Audio is a component of Step-Omni, which is designed to train a unified pretrained model for speech, image, and text. This training continues pretraining from a pretrained text model and a pretrained image encoder. The entire process is divided into three stages.
• Stage 1: We expanded the vocabulary of the pretrained text model by adding 5,120 audio tokens and integrated a pretrained image encoder to form the Step-Omni model. During training, to minimize the loss of the text model's capabilities, the learning rate of the text-model backbone is kept at a low level (2e-5) throughout. However, the learning rates for the embedding and language model (LM) head are set five times higher than the backbone's to facilitate faster convergence of the newly added tokens (see the optimizer sketch after this list). Meanwhile, the image encoder remains frozen during the entire training process. At this stage, audio, text, and image data are used in a 2:1:1 ratio, with audio data consisting solely of pure audio continuation tasks.
• Stage 2: After training on 1.2T tokens in Stage 1, we incorporate audio-text interleaved data for further training, with a 1:1 ratio of audio continuation data to audio-text interleaved data. During this stage, the ratio of audio, text, and image data remains 2:1:1.
• Stage 3: After training on 800B tokens in Stage 2, we incorporate ASR and TTS data for further training. The ratio of audio continuation data, audio-text interleaved data, ASR data, and TTS data is set to 1:1:1:1. During this phase, the ratio of audio, text, and image data is adjusted to 4:3:3. Additionally, the learning rates for the embedding and LM head are synchronized with the backbone, utilizing a cosine schedule that decreases from 2e-5 to 5e-6.
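A minimal PyTorch-style sketch of the Stage 1 optimization setup described above (module attribute names such as `image_encoder`, `backbone`, `embed_tokens`, and `lm_head` are illustrative assumptions, not the actual model definition):
```python
import torch

BACKBONE_LR = 2e-5  # kept low to preserve the text model's capabilities

def build_stage1_optimizer(model):
    # Image encoder stays frozen throughout training.
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    # Embedding and LM head get 5x the backbone LR so the 5,120 newly added
    # audio tokens converge faster.
    new_token_params = list(model.embed_tokens.parameters()) + list(model.lm_head.parameters())
    param_groups = [
        {"params": model.backbone.parameters(), "lr": BACKBONE_LR},
        {"params": new_token_params, "lr": 5 * BACKBONE_LR},
    ]
    return torch.optim.AdamW(param_groups)
```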
We employ the same pre-training strategy across models of varying parameter scales.
4.3 Training Infrastructure
We train Step-Omni on thousands of H800 GPUs with 35% Model FLOPs Utilization (MFU). Despite employing standard optimizations such as tailored GPU kernels and communication overlap, we highlight two innovative approaches that further enhance our training efficiency.
Disaggregated Data Processing The processing of multi-modality data in Step-Omni training is computationally intensive, often requiring substantial CPU resources to keep pace with the model training speed. Conventional implementations typically co-locate data processing tasks with training jobs, leading to significant interference between these tasks and ultimately slowing down the training process. To address this issue, we introduce StarWeaver, an RPC-based distributed data processing library. StarWeaver relocates CPU-intensive data pre-processing tasks to remote processes, thereby alleviating the computational burden on the GPU training side and enhancing overall training efficiency. StarWeaver also facilitates improved load balancing in the data-parallel dimension, as it serves as an ideal mechanism for redistributing data with global workload information.
Disaggregated Model Placement For multi-modal models such as Step-Omni, training typically involves not only the LLM but also modality encoders (e.g., a vision encoder). Integrating these diverse components challenges the conventional assumption of training frameworks that the model is homogeneous and monolithic. This mismatch often results in suboptimal training efficiency. To address this issue, we propose disaggregated model placement, which allocates dedicated resources and employs tailored parallelism strategies for each sub-model. This approach effectively minimizes pipeline bubbles caused by model heterogeneity, thereby achieving optimal training efficiency. Details can be found in (Zhang et al., 2024).
4.4 Exploring Tokenizers for Audio Pretraining
To achieve the unification of speech understanding and generation, we first explored the choice of speech tokenizer. Initially, we investigated training with a single codebook. In our experiments, we found that when training the model using only semantic tokens, the next-token prediction perplexity is relatively low, and the semantic coherence between the generated content and the preceding context is good. However, because semantic tokens discard much of the acoustic information, the subsequent audio restoration through the vocoder suffers severe degradation in timbre and prosody, resulting in poor auditory quality. When training with only linguistic tokens, the audio recovered by the vocoder from the model's continuation sounds good, but the next-token prediction perplexity is very high, and the semantic coherence between the continuation and the preceding context is poor. When training with interleaved semantic and linguistic tokens, the semantic tokens ensure the semantic coherence of the continuation with the preceding context, while the linguistic tokens ensure the auditory quality of the reconstructed audio. Because the semantic and linguistic tokens reference each other, we observed that with dual-codebook training, the next-token prediction perplexity for both semantic and linguistic tokens decreased compared to using a single codebook, as shown in Figure 4. Notably, the decrease in next-token prediction perplexity for semantic tokens was more significant. Furthermore, ASR ablation results indicated that the dual-codebook model achieved a lower character error rate (CER) on the ASR test set compared to the single-codebook model (see Section 6.2.1).
Figure 4: Training loss comparison between Dual-Codebook and Single Codebook Tokenizer.
Furthermore, grouping and interleaving linguistic and semantic discrete tokens in a 2:3 ratio facilitates faster convergence of the training loss. More importantly, extending the CosyVoice semantic tokens with linguistic tokens enhances the model's ability to understand and follow multi-turn historical instructions and also mitigates issues such as unclear pronunciation and indistinct articulation, significantly improving on CosyVoice's single-codebook performance.
5 Post-Training
5.1 TTS
5.1.1 Dataset
High-quality speech data is crucial for the TTS task, as it directly impacts the model's performance and the expressiveness of the generated speech. Language-specific data, dialect data, speaking-style data, emotional data, and paralinguistic data are extremely scarce. Constructing such datasets demands substantial human and financial resources, and the process generally spans an extended period.
To address this gap, we present the first synthetic-data-driven framework for TTS systems, comprising three key components (a sketch of the resulting pipeline follows the list):
• First, we employ a Step-2 (StepFun, 2024b) LLM to generate linguistically diverse and semantically rich textual content.
• Second, we select a pretrained Step-Audio model checkpoint incorporating audio-token cooldown mechanisms, which enables direct generation of speaker-specific, language-dependent, and dialect-aware audio data.
• Third, we develop an Audio-Edit model by fine-tuning the aforementioned checkpoint, specifically designed to generate nuanced emotional expressions and diverse speaking styles. This model architecture allows for precise control over paralinguistic features while maintaining speaker consistency.
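A high-level sketch of this three-stage data engine; the function names below are placeholders for the components described above, not an actual API:
```python
def generate_tts_training_data(seed_text, seed_audio, step2_rewrite, step_audio_clone, audio_edit):
    """1) Step-2 rewrites/diversifies the text, 2) Step-Audio continues the
    target-speaker prompt audio to voice the new texts, 3) the Audio-Edit model
    turns neutral renditions into emotion/style variants."""
    rewritten_texts = step2_rewrite(seed_text)
    neutral_audio = [step_audio_clone(seed_audio, text) for text in rewritten_texts]
    styled_audio = [audio_edit(audio, instruction="happy, higher intensity")
                    for audio in neutral_audio]
    return neutral_audio, styled_audio
```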
Figure 5: The process starts with text input, which is processed by a Step-2 LLM to generate multiple rewritten texts. Then, a Step-Audio model generates target-speaker data using the rewritten texts and existing audio data. Finally, an Audio-Edit model refines the data to produce emotion/style data, addressing the scarcity of high-quality speech data in TTS tasks.
Language and Dialect Leveraging the robust continuation ability of Step-Audio, which has been trained on large volumes of speaker and language data, we generate target-speaker, language, and dialect data. The text-based LLM Step-2 is used to translate and rewrite chat text to conform to the grammar and style of the target language or dialect. We collect audio recordings and texts from native speakers as prompt audio and text; then, using the format [system prompt; prompt text; target text; prompt code; target code] along with the corresponding text, we use Step-Audio for audio continuation generation. This method allows for the quick creation of a large amount of native-speaker data for the target language and dialect with only a small quantity of high-quality seed data.
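A hedged sketch of how such a continuation prompt could be assembled (the helper names and exact token layout are assumptions; only the field order follows the format above):
```python
def build_continuation_prompt(system_prompt, prompt_text, target_text, prompt_codes, encode_text):
    """Concatenate [system prompt; prompt text; target text; prompt code] and let
    Step-Audio continue with the target codes, i.e. dual-codebook tokens for the
    target text in the prompt speaker's language/dialect and timbre."""
    return (encode_text(system_prompt)
            + encode_text(prompt_text)
            + encode_text(target_text)
            + prompt_codes)  # the model's continuation supplies the target codes
```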
Emotion and Speaking Styles Emotion and speaking-style data have been challenging to handle because of the difficulty in differentiating and defining emotion categories and their respective intensities, as well as the complexity of accurately describing and recording various style types. To address this, we propose an Audio-Edit-model-based approach, which converts complex emotion and style descriptions into a comparative pair-data construction format. Step-2 is used to rewrite chat text with specific emotions and styles. Normal and emotional speech samples from the same speaker with identical text are collected, and Step-Audio is used for cloning and continuation generation to create (text, neutral audio token, emotion and style audio token) data. Only the (neutral audio token, emotion and style audio token) pairs are used to perform SFT on the audio-cooldown pretrained model to obtain the Audio-Edit model. Using this model, neutral-style speech can be input to generate emotion- or style-enhanced audio, and data with different emotion or style intensities can be produced iteratively.
Singing and RAP We construct a paired dataset of lyrics and vocal segments through three stages: (1) collecting 10,000+ hours of singing/RAP tracks with LyRiCs-format timestamps; (2) extracting dry vocals using Demucs (Rouard, Massa, & Défossez, 2023) and removing silent regions via Voice Activity Detection (VAD); (3) segmenting audio using the LyRiCs timestamps and aligning lyrics with audio segments. For data cleaning, we performed three steps: (1) RAP separation: we isolated pure RAP segments by retaining those with higher speech rates and using a genre classification model to identify hip-hop clips; (2) audio quality filtering: utilizing noise detection and speaker diarization, we preserved low-noise, single-speaker segments; (3) alignment verification: to address misalignment caused by inaccurate LyRiCs timestamps, we computed the Character Error Rate (CER) between transcribed speech and ground-truth lyrics, discarding misaligned segments. Ultimately, the total length of the retained audio segments constituted 17.8% of the original song durations. This dataset supports dual training objectives: the LLM learns to map lyrics to linguistic and semantic tokens, while the speech decoder decodes these tokens into high-fidelity, in-tune vocals.
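The alignment-verification step can be illustrated with a simple CER filter; the ASR callable and the rejection threshold below are assumptions for illustration, not values from the paper:
```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / max(m, 1)

def keep_aligned_segments(segments, asr, max_cer=0.3):
    """Discard (audio, lyric) pairs whose ASR transcript diverges from the lyric."""
    return [(audio, lyric) for audio, lyric in segments if cer(lyric, asr(audio)) <= max_cer]
```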
Target Speaker Supporting multiple languages or dialects for a target speaker purely through model generalization from foundational language and dialect data is challenging, as it often fails to reach native-speaker quality. To mitigate this issue, we employ dual codes extracted from audio generated by native speakers whose timbre and prosody are similar to the target speaker's. These dual codes are combined with the target speaker's prompt audio to regenerate new audio, from which dual codes are then extracted again. Through this straightforward procedure, the target speaker's speech in new languages and dialects becomes more akin to that of a native speaker.
Quality assessment of data constitutes a critical component of our synthetic data framework. To ensure the reliability and validity of both seed and synthesized data, we have implemented a comprehensive evaluation system incorporating multiple objective metrics: ASR accuracy, Voice Activity Detection (VAD) performance, speaker diarization precision, emotion recognition consistency, and Deep Noise Suppression (DNS) effectiveness. This multi-dimensional quality control mechanism guarantees the robustness and practical utility of the generated synthetic data.
5.1.2 Training Detail
In contrast to conventional TTS systems that emphasize fine-grained control over speaker characteristics, emotional expression, linguistic features, and stylistic elements, our approach adopts the chat-based paradigm and training methodology of LLMs. This strategic alignment significantly enhances system flexibility while simultaneously establishing a scalable framework to support future model and data expansion, thereby addressing the critical challenge of scalability in speech synthesis systems.
Supervised Fine-Tuning Format The SFT format comprises three essential components: the system prompt, the human input, and the assistant response, structured in a two-turn dialogue configuration. Within this format, the system prompt serves as the foundational element for specifying the speaker attributes and defining the supported instruction tags. The human input and the assistant response components are specifically designed to handle the textual content and the dual-codebook representations, respectively. The text and audio tokens from the first turn can be utilized to maintain the in-domain speaker's timbre and style consistency, as well as to enable out-of-domain zero-shot cloning.
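A schematic of a single two-turn SFT sample under this format; the field names and tag syntax below are illustrative placeholders rather than the released data schema:
```python
sft_sample = {
    # System prompt: speaker attributes plus the instruction tags this voice supports.
    "system": "Speaker: <speaker_id>. Supported tags: language/dialect, emotion, speed, style.",
    "dialogue": [
        # Turn 1: anchors the speaker's timbre and style (or serves as a zero-shot cloning prompt).
        {"human": "<reference text>",
         "assistant": "<interleaved dual-codebook tokens of the reference audio>"},
        # Turn 2: the synthesis request; the response is the dual-codebook target.
        {"human": "(dialect: Sichuanese)(emotion: happy, level 4) <text to synthesize>",
         "assistant": "<interleaved dual-codebook tokens to be generated>"},
    ],
}
```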
Instruction Tags Instruction tags are classified into two distinct categories: descriptive tags and comparative tags. Descriptive tags are utilized for controlling aspects such as language, dialect, voice, and style, while comparative tags are employed for hierarchical distinctions in emotion and speed control. The data for descriptive tags are generated using Step-Audio voice cloning, supporting languages and styles including Japanese, Korean, Cantonese, Sichuan dialect, cute voice, RAP, and singing. The data for comparative tags are generated using the Audio-Edit model, supporting emotions such as happiness, anger, and sadness, and speed variations such as fast and slow, each divided into five hierarchical levels.