[论文翻译]IndexTTS: 一款工业级可控且高效的零样本文本转语音系统


原文地址:https://arxiv.org/pdf/2502.05512v1


IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

IndexTTS: 一款工业级可控且高效的零样本文本转语音系统

1 Artificial Intelligence Platform Department, bilibili, China {xuanwu,zhousiyi02,shujingchen,wangjinchao,wanglu08}@bilibili.com

1 bilibili 人工智能平台部,中国 {xuanwu,zhousiyi02,shujingchen,wangjinchao,wanglu08}@bilibili.com

Abstract

摘要

Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities.

最近,基于大语言模型 (LLM) 的文本转语音 (TTS) 系统因其高自然度和强大的零样本语音克隆能力逐渐成为行业主流。

Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, and add several novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also performed a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) in terms of codebook utilization for acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a Conformer-based speech conditional encoder and replace the speech code decoder with BigVGAN2.

在此,我们介绍基于XTTS和Tortoise模型的IndexTTS系统,并融入了一些创新改进。具体而言,在中文应用场景中,我们采用了字符与拼音相结合的混合建模策略,从而实现对多音字及长尾字符发音的精确控制。此外,针对声学语音Token的码本利用,我们对向量量化(VQ)与有限标量量化(FSQ)进行了对比分析。为了进一步提升语音克隆的效果与稳定性,我们引入了基于Conformer的语音条件编码器,并将语音解码器替换为BigVGAN2。

Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed, and its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.

与 XTTS 相比,它在自然度、内容一致性和零样本语音克隆方面取得了显著改进。与 Fish-Speech、CosyVoice2、FireRedTTS 和 F5-TTS 等流行的开源 TTS 系统相比,IndexTTS 的训练过程相对简单,使用更可控,推理速度更快,且性能超越了这些系统。我们的演示可在 https://index-tts.github.io 查看。

Index Terms: LLM based zero-shot TTS, industrial-level text-to-speech, polyphone controllable

索引关键词:基于大语言模型的零样本 TTS,工业级文本转语音,多音字可控

1. Introduction

1. 引言

Text-to-speech synthesis (TTS) has extensive applications in fields such as human-computer interaction, education, and entertainment. For example, in video creation scenarios in recent years, TTS can assist users in quickly generating video dubbing, saving recording time, and thus playing a crucial role in the creation process. Many creators hope to provide personalized and highly natural speech synthesis services to meet the needs of different scenarios.

文本到语音合成 (Text-to-speech synthesis, TTS) 在人机交互、教育和娱乐等领域有广泛应用。例如,在近年来的视频创作场景中,TTS 可以帮助用户快速生成视频配音,节省录音时间,从而在创作过程中发挥关键作用。许多创作者希望提供个性化和高度自然的语音合成服务,以满足不同场景的需求。

TTS systems based on large language models and trained on massive amounts of general speech data demonstrate impressive performance in speech generation, such as XTTS[1], Fish-Speech[2], CosyVoice2[3], FireRedTTS[4] and F5-TTS[5]. Compared to traditional systems that rely on more intricate manual designs, such as Megatts 2[6] and Yourtts[7], these systems have achieved significant improvements in naturalness, particularly in zero-shot voice cloning. Generative TTS powered by big data can be roughly classified into three categories. The first is the neural codec language model. To ensure the quality of synthesized audio, it typically employs a multi-codebook codec along with a high-frame-rate configuration, as in Vall-E[8]. This architecture is simple and straightforward, yet it has drawbacks: longer training and inference times, along with compromised stability. The second is end-to-end diffusion-based TTS, a non-autoregressive (NAR) model; F5-TTS[5] and Seed-TTS[9] are cases in point. This approach yields high-quality synthesized audio and is suitable for voice editing, but it is difficult to stream and therefore unsuitable for real-time use. Finally, the hybrid architecture typically uses a single codebook and a low-bitrate codec, generating high-quality audio through a standalone decoder such as a diffusion model or HiFiGAN[10]. It balances performance and generation quality and offers good stability. Given the success of large language models, tokenization is the trend of the future. For industrial-level applications, stability is crucial. Here, we opt for the hybrid architecture, using a single-codebook codec and reconstructing high-quality speech through a speech decoder, as in XTTS, Fish-Speech and CosyVoice2.

基于大语言模型的TTS系统,能够利用海量通用语音数据进行训练,在语音生成方面表现出色,例如XTTS[1]、Fish-Speech[2]、CosyVoice2[3]、FireRedTTS[4]和F5-TTS[5]。与依赖于更复杂手工设计的传统系统(如Megatts 2[6]和Yourtts[7])相比,这些系统在自然度上取得了显著提升,尤其是在零样本语音克隆方面。基于大数据的生成式TTS大致可分为三类。第一类是神经编解码器语言模型。为了确保合成音频的质量,它通常采用多码本编解码器和高帧率配置,如Vall-E[8]。这种架构简单直接,但也有缺点:训练和推理时间较长,且稳定性较差。第二类是基于端到端扩散的TTS,这是一种非自回归(NAR)模型。F5-TTS[5]和Seed-TTS[9]就是典型的例子。这种方法能够生成高质量的合成音频,适合语音编辑,但难以流式处理,因此不适合实时使用。最后,混合架构通常使用单码本和低比特率编解码器,通过独立的解码器(如扩散或HiFiGAN[10])生成高质量音频。它在性能和生成质量之间取得了平衡,并具有良好的稳定性。由于大语言模型的成功,Token化是未来的趋势。对于工业级应用来说,稳定性至关重要。在这里,我们选择了混合架构,使用单码本编解码器,并通过语音解码器(如XTTS、Fish-Speech和CosyVoice2)重建高质量语音。

Based on XTTS[1] and Tortoise[11], we have made several improvements, which mainly include the following. We remove the front-end G2P module and use raw text as input, along with a BPE-based text tokenizer. This simplifies input preprocessing, facilitates multi-language expansion, and enables end-to-end learning of word or polyphone pronunciations from large-scale contextual data. To address the pronunciation control of polyphones and low-frequency characters in Chinese scenarios, which inevitably occur in real-world video creation, we propose a hybrid character-pinyin modeling approach. This allows video creators to correct pronunciations by directly inputting pinyin. Moreover, VQ[12] may suffer from low utilization of the quantization codebook due to codebook collapse, so we conducted a comparative analysis between VQ and FSQ[13] in terms of their codebook utilization for acoustic token representation, achieving nearly $100\%$ codebook utilization. Finally, we have made significant improvements in prosody naturalness, the similarity of zero-shot voice cloning, and system stability. The main improvements and contributions are summarized as follows:

基于XTTS[1]和Tortoise[11],我们进行了多项改进,主要包括以下几点:我们移除了前端的G2P模块,使用原始文本作为输入,并采用基于BPE的文本分词器。这简化了输入预处理,便于多语言扩展,并通过大数据上下文集成实现词或多音字发音的端到端学习。针对中文场景中多音字和低频字符的发音控制问题,这在现实世界视频创作中不可避免,我们提出了一种混合字-拼音建模方法。这使得视频创作者可以通过直接输入拼音来纠正发音。此外,VQ[12]可能由于码本崩溃而导致量化码本利用率低。我们对VQ和FSQ[13]在声学Token表示中的码本利用率进行了对比分析,实现了接近$100\%$的码本利用率。最后,我们在韵律自然度、零样本语音克隆的相似性以及系统稳定性方面做出了显著改进。主要改进和贡献总结如下:

2. IndexTTS System

2. IndexTTS 系统

Similar to XTTS[1], our system incorporates a speech-to-codec VQVAE[12], a text-to-codec language model, and a latent-to-audio decoder, as depicted in Figure 1.

与 XTTS[1] 类似,我们的系统包含语音到编解码器的 VQVAE[12]、文本到编解码器的语言模型以及潜变量到音频的解码器,如图 1 所示。

2.1. Text tokenizer

2.1. 文本 Tokenizer

Currently, our system only supports two languages, Chinese and English. We directly use the raw text as input, which is tokenized by a BPE-based text tokenizer; this makes it convenient to extend the system to other languages. Due to the large number of polyphonic characters in Chinese, we adopt a hybrid modeling approach of Chinese characters and pinyin in Chinese-related scenarios. The vocabulary size of the text tokenizer is 12,000. It encompasses 8,400 Chinese characters along with their corresponding 1,721 pinyin, English word pieces, and several special symbols. During training, we randomly select a certain proportion of non-polyphonic characters and replace them with pinyin. An example of the preprocessing process is presented in Table 1.

目前,我们的系统仅支持两种语言,中文和英文。我们直接使用原始文本作为输入,通过基于 BPE 的文本分词器进行分词,这便于将系统扩展到其他语言。由于中文中存在大量多音字,我们在中文相关场景中采用了汉字和拼音的混合建模方法。文本分词器的词汇量为 12,000,涵盖了 8,400 个汉字及其对应的 1,721 个拼音、英文词片段以及若干特殊符号。在具体训练过程中,我们随机选择一定比例的非多音字并将其替换为拼音。表 1 展示了预处理过程的一个示例。

2.2. Neural Speech Tokenizer

2.2. 神经语音 Tokenizer

Vector Quantization (VQ) is a powerful tool for speech coding, but it may suffer from codebook collapse[13]. The codebook utilization of VQ and FSQ was analyzed in the following experiments. We increased the parameters of the Variational Autoencoder (VAE) to around 50M. The VAE receives a mel-spectrogram as input and encodes each frame with VQ using approximately 8192 codes. The sampling rate of the input audio is $24\,\mathrm{kHz}$, and the token rate output by the speech tokenizer is $25\,\mathrm{Hz}$.

向量量化 (Vector Quantization, VQ) 是语音编码的强大工具,但它可能面临码本崩溃 (codebook collapse) 的问题 [13]。在以下实验中,我们分析了 VQ 和 FSQ 的码本利用率。我们将变分自编码器 (Variational Autoencoder, VAE) 的参数增加到大约 50M。VAE 接收梅尔频谱图作为输入,并使用大约 8192 个码字对每帧进行 VQ 编码。输入音频的采样率为 $24\,\mathrm{kHz}$,语音分词器输出的 Token 率为 $25\,\mathrm{Hz}$。
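To make the codebook-utilization comparison concrete, below is a minimal Python sketch of how utilization can be measured: it counts the fraction of codebook entries hit at least once by the token indices produced on an evaluation set. The codebook size of 8192 matches the paper; the token indices and the FSQ level configuration in the example are illustrative assumptions, not the actual IndexTTS setup.

```python
import torch

def codebook_utilization(code_indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries used at least once by `code_indices`."""
    used = torch.unique(code_indices).numel()
    return used / codebook_size

# Toy comparison: a VQ codebook of 8192 entries versus an FSQ grid whose
# per-dimension levels [8, 8, 8, 4, 4] also yield 8 * 8 * 8 * 4 * 4 = 8192
# implicit codes. `vq_codes` / `fsq_codes` stand in for the token indices a
# trained tokenizer would emit on the same evaluation audio.
vq_codes = torch.randint(0, 2048, (100_000,))   # a collapsed VQ only reaches a subset
fsq_codes = torch.randint(0, 8192, (100_000,))  # FSQ spreads over the whole grid

print(f"VQ  utilization: {codebook_utilization(vq_codes, 8192):.2%}")
print(f"FSQ utilization: {codebook_utilization(fsq_codes, 8192):.2%}")
```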

2.3. Large Language Model for TTS

2.3. 用于 TTS 的大语言模型

The text-to-codec large language model (LLM) is based on the decoder-only transformer architecture, similar to XTTS. It generates a series of audio mel tokens from the input series of text tokens. The LLM is also conditioned by a transformer-based conditioning encoder, which we replace with a Conformer encoder with a subsample rate of 2. We found that this replacement can enhance timbre similarity and training stability.

文本到编解码器的大语言模型基于仅解码器的 Transformer 架构,类似于 XTTS。它从输入的文本 Token 序列中生成一系列音频梅尔 Token。该大语言模型还通过基于 Transformer 的条件编码器进行调节,我们将其替换为子采样率为 2 的 Conformer 编码器。我们发现这种替换可以增强音色相似性和训练稳定性。
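As a rough illustration of such a conditioning encoder, the sketch below subsamples reference mel-spectrogram frames by a factor of 2 with a strided convolution and encodes them with a Conformer stack (torchaudio's Conformer is used here as a stand-in). All layer sizes and the mel dimension are hypothetical; the paper does not specify the exact configuration.

```python
import torch
from torchaudio.models import Conformer

class SpeechConditioner(torch.nn.Module):
    """Encodes reference mel-spectrogram frames into conditioning latents.

    A strided Conv1d subsamples the frame sequence by 2, then a Conformer
    stack encodes it; the outputs condition the text-to-codec LLM.
    """

    def __init__(self, n_mels: int = 100, dim: int = 512):
        super().__init__()
        self.subsample = torch.nn.Conv1d(n_mels, dim, kernel_size=3, stride=2, padding=1)
        self.encoder = Conformer(
            input_dim=dim,
            num_heads=8,
            ffn_dim=2048,
            num_layers=6,
            depthwise_conv_kernel_size=31,
        )

    def forward(self, mels: torch.Tensor, lengths: torch.Tensor):
        # mels: (batch, n_mels, frames); lengths: valid frames per utterance.
        x = self.subsample(mels).transpose(1, 2)   # (batch, frames // 2, dim)
        out_lengths = (lengths - 1) // 2 + 1       # matches the conv output length
        return self.encoder(x, out_lengths)
```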

The training processes of the conditional LLM can be broadly categorized into the following types. The input sequence is structured as follows ([BT] and [ET] indicate the beginning and end of the text token sequence; [BA] and [EA] denote the start and end of the audio token sequence):

条件大语言模型的训练过程大致可分为以下几种类型,输入序列的结构如下([BT] 和 [ET] 表示文本 Token 序列的开始和结束,[BA] 和 [EA] 表示音频 Token 序列的开始和结束):

One such prefix structure is "speaker info, [BT], text, [ET], [BA]". Autoregressive generation by the LM starts from this input prefix sequence and continues until the end-of-sequence token "[EA]" is detected.

其中一种前缀结构为“speaker info, [BT], text, [ET], [BA]”。LM 的自回归生成从该输入前缀序列开始,直到检测到“序列结束” Token “[EA]”。

We adopt SEQ3. It is worth emphasizing that not relying on prompt text is crucial in certain scenarios. For example, in cross-language voice cloning, if prompt text must be provided or identified through a multilingual ASR system, usability will be significantly limited. Additionally, conditioning on both the prompt text and the audio token sequence substantially increases the inference time.

我们采用了SEQ3。值得强调的是,在某些场景下不依赖提示文本(prompt text)至关重要。例如,在跨语言语音克隆中,如果必须提供提示文本或通过多语言ASR系统识别,其可用性将受到极大限制。此外,同时基于提示文本和音频Token序列进行条件处理,将显著增加推理时间。
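A minimal sketch of how such a prefix could be assembled and decoded is shown below. The special-token ids and the `lm_step` next-token function are hypothetical placeholders, since the paper only specifies the layout "speaker info, [BT], text, [ET], [BA]" and the stopping condition [EA].

```python
# Hypothetical ids for the special tokens; the real vocabulary assignments
# used by IndexTTS are not given in the paper.
BT, ET, BA, EA = 1, 2, 3, 4
SPEAKER_SLOT = 0  # placeholder id for a position filled by a conditioning latent

def build_prefix(num_speaker_latents: int, text_ids: list[int]) -> list[int]:
    """Lay out the prefix "speaker info, [BT], text, [ET], [BA]"."""
    return [SPEAKER_SLOT] * num_speaker_latents + [BT] + text_ids + [ET] + [BA]

def generate(lm_step, prefix: list[int], max_len: int = 2000) -> list[int]:
    """Autoregressive loop: call `lm_step(tokens)` (a hypothetical next-token
    function) repeatedly until it emits [EA] or `max_len` tokens are produced."""
    tokens = list(prefix)
    audio_tokens: list[int] = []
    for _ in range(max_len):
        nxt = lm_step(tokens)
        if nxt == EA:
            break
        audio_tokens.append(nxt)
        tokens.append(nxt)
    return audio_tokens
```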

We also found that, compared to single-speaker encoding vectors as used in Tortoise[11] and CosyVoice[3], or speech-prompting methods like Vall-E, the Conformer-based Perceiver demonstrates a superior ability to capture speaker characteristics. Moreover, it ensures consistent model outputs across different runs, effectively mitigating the speaker-shifting issue that may occur between executions. The Perceiver also offers the advantage of utilizing multiple references without imposing length restrictions. This flexibility enables it to comprehensively capture diverse aspects of the target speaker. Furthermore, it even allows for the integration of features from other speakers, thereby facilitating the creation of a truly unique voice.

我们还发现,与Tortoise [11]和CosyVoice [3]等单说话人编码向量,或Vall-E等语音提示方法相比,基于Conformer的Perceiver在捕捉说话人特征方面表现出更优的能力。此外,它确保了不同运行间的模型输出一致性,有效缓解了各种模型执行之间可能发生的说话人偏移问题。Perceiver的优势在于能够利用多个参考而不施加长度限制。这种灵活性使其能够全面捕捉目标说话人的多样化特征。此外,它甚至允许整合其他说话人的特征,从而促进创造真正独特的声音。

2.4. Speech Decoder

2.4. 语音解码器

The last stage converts the SpeechLLM output into a waveform. One approach is to utilize a flow matching[15] or diffusion-based[9] model to transform the speech code generated by the SpeechLLM into an intermediate representation such as the mel-spectrogram[11][9], followed by a vocoder, such as HiFiGAN, to convert the mel-spectrogram into a waveform. This method can generate high-quality audio, but it suffers from slow inference and is complex to stream. The second approach is to directly convert the SpeechLLM output, conditioned on a speaker embedding, into the final waveform. We adopt the second approach, based on the BigVGAN2[14] vocoder, directly reconstructing the audio from the last hidden state of the SpeechLLM, conditioned on the speaker embedding. The latent sampling rate is $25\,\mathrm{Hz}$. It is interpolated to $100\,\mathrm{Hz}$ and then input into BigVGAN2. Subsequently, the signal is decoded by BigVGAN2 and finally output at a sampling rate of $24\,\mathrm{kHz}$.

最后阶段是将 SpeechLLM 的输出转换为波形。一种方法是利用流匹配[15]或基于扩散的模型[9],将 SpeechLLM 生成的语音代码转换为中间表示,例如梅尔频谱图[11][9],然后通过声码器(例如 HifiGAN)将梅尔频谱图转换为波形。这种方法可以生成高质量的音频,但推理速度较慢,且难以实现流式处理。第二种方法是以说话人嵌入为条件,直接将 SpeechLLM 的输出转换为最终波形。我们采用第二种方法,基于 BigVGAN2[14] 声码器,直接根据 SpeechLLM 的最后一个隐藏状态(以说话人嵌入为条件)重构音频。潜在表示的采样率为 $25\,\mathrm{Hz}$,先插值到 $100\,\mathrm{Hz}$,再输入 BigVGAN2,最终以 $24\,\mathrm{kHz}$ 的采样率输出。
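The 25 Hz to 100 Hz interpolation step can be illustrated with the short sketch below; the `vocoder` argument stands in for a BigVGAN2-style decoder and the tensor layout is an assumption, since the paper does not give implementation details.

```python
import torch
import torch.nn.functional as F

def latents_to_waveform(latents: torch.Tensor, vocoder) -> torch.Tensor:
    """Upsample 25 Hz LLM hidden states to 100 Hz and decode them to audio.

    `latents` is assumed to have shape (batch, hidden_dim, num_frames) at
    25 frames per second; `vocoder` is a BigVGAN2-style decoder mapping the
    100 Hz feature sequence to a 24 kHz waveform.
    """
    # Linear interpolation along the time axis: 25 Hz -> 100 Hz (4x upsampling).
    upsampled = F.interpolate(latents, scale_factor=4, mode="linear", align_corners=False)
    return vocoder(upsampled)
```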

3. Experiments

3. 实验

3.1. Dataset

3.1. 数据集

All training data was collected from the internet, with an initial 120,000 hours of raw audio. After voice separation, speaker segmentation, and filtering using Demucs [16], we obtained 34,000 hours of high-quality Chinese-English bilingual data. The dataset includes 25,000 hours of Chinese and 9,000 hours of English audio. We then use ASR (Automatic Speech Recognition) to generate pseudo-labels for the corresponding audio. Finally, we emphasize that punctuation marks are added to the ASR results based on text semantics and speech pauses to create the final training texts. This approach allows users to control pauses flexibly, beyond relying solely on text semantics.

所有训练数据均从互联网收集,初始包含 120,000 小时的原始音频。经过使用 Demucs [16] 进行语音分离、说话人分割和过滤后,我们获得了 34,000 小时的高质量中英双语数据。该数据集包含 25,000 小时的中文音频和 9,000 小时的英文音频。随后,我们使用 ASR(自动语音识别)为相应音频生成伪标签。最后,我们强调根据文本语义和语音停顿在 ASR 结果中添加标点符号,以生成最终训练文本。这种方法使用户能够灵活控制停顿,而不仅仅依赖于文本语义。

3.2. Experimental Settings

3.2. 实验设置

3.2.1. Mixed training of Chinese characters and pinyin

3.2.1. 汉字与拼音的混合训练

We randomly select $50\%$ of the training samples. For each sample, we randomly pick $20\%$ of the Chinese characters. If a character is not a polyphonic character, we replace it with its corresponding pinyin. The replaced text may include Chinese characters, pinyin, English words, and punctuation marks. Then, it is directly tokenized by the BPE tokenizer.

我们随机选择 $50\%$ 的训练样本。对于每个样本,我们随机挑选 $20\%$ 的汉字。如果一个字符不是多音字,我们将其替换为对应的拼音。替换后的文本可能包含汉字、拼音、英文单词和标点符号。然后,直接由 BPE tokenizer 进行 Token 化处理。
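A possible implementation of this mixing step is sketched below. The pinyin lexicon, the polyphonic-character set, and the function itself are assumptions for illustration only; the paper specifies only the 50% sample rate, the 20% character rate, and that polyphonic characters are never replaced.

```python
import random

def mix_pinyin(text: str, pinyin_of: dict, polyphonic: set,
               sample_rate: float = 0.5, char_rate: float = 0.2) -> str:
    """Randomly replace non-polyphonic Chinese characters with their pinyin.

    `pinyin_of` maps a character to a pinyin string such as "XUAN4" and
    `polyphonic` lists characters that must never be replaced; both are
    assumed to come from an external lexicon.
    """
    if random.random() >= sample_rate:           # only ~50% of samples are mixed
        return text
    out = []
    for ch in text:
        replace = (
            ch in pinyin_of                      # a character we can romanize
            and ch not in polyphonic             # polyphonic characters are kept as-is
            and random.random() < char_rate      # replace roughly 20% of the rest
        )
        out.append(pinyin_of[ch] if replace else ch)
    return "".join(out)

# Toy usage mirroring Table 1 (lexicon entries are illustrative only):
pinyin_of = {"眩": "XUAN4", "感": "GAN3"}
polyphonic = {"晕", "觉"}  # characters with multiple readings stay as characters
print(mix_pinyin("晕眩是一种感觉", pinyin_of, polyphonic))
```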


Figure 1: An overview of IndexTTS. A text-to-speech language model, conditioned on prompt speech and text tokens, generates acoustic tokens, and the BigVGAN2 decoder converts the LLM output latent into a waveform.

图 1: IndexTTS 概述,该模型是基于提示语音和文本 Token 的文本到语音语言模型,生成声学 Token,BigVGAN2 解码器将大语言模型输出的潜在表示转换为波形。

Table 1: Preprocessing Examples for Training Samples Combining Chinese Characters and Pinyin

表 1: 结合汉字和拼音的预处理训练样本示例

| | |
| --- | --- |
| Input | 晕眩是一种感觉,I want to go to the supermarket! |
| Mix Pinyin | 晕XUAN4是一种GAN3觉,I WANT TO GO TO THE SUPERMARKET! |
| BPE Tokens | _晕, _XUAN4, _是, _一, _种, _GAN3, _觉, _,, I, _WANT, _TO, _GO, _TO, _THE, _SUPER, M, AR, KE, T, ! |

Table 2: Error and Correction Statistics for Polyphonic Character Pronunciation

表 2: 多音字发音错误与纠正统计

| | Sentences | Percentage |
| --- | --- | --- |
| Total | 2500 | 100% |
| A1 | 465 | 18.6% |
| A2 | 437 | 94.0% |

3.2.2. Speech Codec Training

3.2.2. 语音编解码器训练

In the training of the Speech codec, we only replace Vector Quantization with