[论文翻译]GigaSpeech: 一个持续演进、多领域的语音识别(ASR)语料库,包含10,000小时转写音频


原文地址:https://arxiv.org/pdf/2106.06909v1


GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

GigaSpeech: 一个持续演进、多领域的语音识别(ASR)语料库,包含10,000小时转写音频

Abstract

摘要

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

本文介绍了GigaSpeech,这是一个持续演进的、多领域英语语音识别语料库,包含10,000小时适合监督训练的高质量标注音频,以及总计40,000小时适合半监督和无监督训练的音频。我们首先从有声书、播客和YouTube收集了约40,000小时的转录音频,涵盖朗读和即兴发言两种说话风格,以及艺术、科学、体育等多种主题。我们提出了一种新的强制对齐和分段流程,用于创建适合语音识别训练的句子片段,并过滤掉转录质量较低的片段。在系统训练方面,GigaSpeech提供五种不同规模的子集:10小时、250小时、1000小时、2500小时和10000小时。对于我们的10,000小时XL训练子集,在过滤/验证阶段我们将词错误率上限设为4%,而其他所有较小训练子集的上限设为0%。另一方面,DEV和TEST评估集经过专业转录员的重新处理,以确保高转录质量。我们还为主流语音识别工具包(包括Athena、ESPnet、Kaldi和Pika)提供了基线系统。

Index Terms: corpus, forced alignment, segmentation, speech recognition

索引术语:语料库 (corpus)、强制对齐 (forced alignment)、分割 (segmentation)、语音识别 (speech recognition)

1. Introduction

1. 引言

Thanks to the rapid development of the neural network models, automatic speech recognition (ASR) has made tremendous progress in the past decade. Various system architectures, from hybrid [1] to end-to-end [2], are proposed, and state-of-the-art results on standard benchmarks are being frequently updated.

得益于神经网络模型的快速发展,自动语音识别(ASR)在过去十年取得了巨大进步。从混合架构[1]到端到端架构[2],各种系统架构不断被提出,标准基准测试的最先进结果也在持续刷新。

The mainstream speech recognition corpora, on the other hand, have not changed much in decades. To take the English speech recognition task as an example, the Wall Street Journal corpus, which consists of 80 hours of narrated news articles [3], is almost 30 years old, and has a word error rate (WER) of 2.32% on its eval92 benchmark [4]. The Switchboard and Fisher corpus, which consists of 262 and 1,698 hours of telephone conversational speech, is around 20 years old, and has a WER of 5.5% on the Switchboard portion of the Hub5'00 benchmark [5]. Even LibriSpeech [6], one of the most popular corpora for speech recognition tasks, is more than 5 years old, and has a WER of 1.9% on its test-clean benchmark [7]. It consists of 1,000 hours of read English speech. Due to the fast development of speech recognition techniques, ASR performance on those data sets appears to have saturated, making it difficult to track further improvements from new techniques.

主流语音识别语料库几十年来变化不大。以英语语音识别任务为例,包含80小时新闻播报内容的华尔街日报语料库[3]已有近30年历史,在其eval92基准测试[4]上的词错误率(WER)为2.32%。包含262小时和1,698小时电话对话内容的Switchboard和Fisher语料库已有约20年历史,在Hub5'00基准测试的Switchboard部分[5]上WER为5.5%。即便是语音识别任务中最流行的LibriSpeech语料库[6]也已有5年以上历史,其test-clean基准测试[7]上的WER为1.9%,该库包含1,000小时的英语朗读语音。由于语音识别技术的快速发展,这些数据集上的ASR性能似乎已经饱和,难以追踪新技术的进一步改进。

There is some progress on creating better corpora/benchmarks for English speech recognition, from both academia and industry. TED-LIUM [8] is a series of corpora created by the Ubiqus company and the University of Le Mans. It consists of 452 hours of audio from TED talks in its latest release, TED-LIUM 3. The corpus size, however, is less than 1,000 hours, making it not suitable for algorithms which demand a large amount of data. People's Speech [9], released by ML Commons, consists of 87,000 hours of audio, covering 59 different languages. Its source, however, is mostly audiobooks, lacking crucial acoustic diversity. Another work is SPGISpeech [10], a corpus released by Kensho Technologies. It consists of 5,000 hours of audio from earnings calls transcribed by S&P Global, Inc. The corpus by its nature gives an emphasis to the business domain.

在英语语音识别领域,学术界和工业界都在构建更优质的语料库和基准测试方面取得进展。Ubiqus公司与勒芒大学联合开发的TED-LIUM [8] 系列语料库,其最新版本TED-LIUM 3包含452小时的TED演讲音频。但该语料库规模不足1000小时,难以满足需要海量数据的算法需求。由ML Commons发布的People's Speech [9]包含87,000小时的音频,涵盖59种不同语言。然而其来源主要是有声读物,缺乏关键的声学多样性。另一项工作是SPGISpeech [10],这是由Kensho Technologies发布的语料库,包含5,000小时由标普全球(S&P Global, Inc.)转录的财报电话会议音频。该语料库本质上侧重于商业领域。

We release a complementary English speech recognition corpus named GigaSpeech, an evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. The initial release of GigaSpeech is complementary to the existing corpora in terms of its scale, its coverage of both read and spontaneous speaking styles, and its domain diversity.

我们发布了一个名为GigaSpeech的补充性英语语音识别语料库,这是一个不断演进的多领域自动语音识别(ASR)语料库,包含10,000小时的转录音频。GigaSpeech的初始版本在规模、对朗读与即兴两种说话风格的覆盖以及领域多样性方面,对现有语料库形成了补充。

We make two contributions in this work. First, we release an evolving, multi-domain speech recognition corpus with 10,000 hours of labeled audio. Second, we provide a scalable, reliable pipeline for generating speech recognition corpora.

我们在这项工作中做出了两项贡献。首先,我们发布了一个包含10,000小时标注音频的、不断演进的多领域语音识别语料库。其次,我们提供了一个可扩展且可靠的语音识别语料库生成流程。

The rest of the paper is organized as follows. Section 2 introduces the GigaSpeech corpus, and Section 3 presents the full pipeline to create the GigaSpeech corpus. We describe the speech recognition baseline systems for various toolkits, and provide experiment setup and results in Section 4. Finally, acknowledgements are given in Section 5.

本文其余部分组织结构如下。第2节介绍GigaSpeech语料库,第3节阐述构建GigaSpeech语料库的完整流程。第4节描述基于不同工具包的语音识别基线系统,并提供实验设置与结果。最后,第5节为致谢部分。

2. GigaSpeech Corpus

2. GigaSpeech语料库

This section explains the structure of the GigaSpeech corpus, including metadata, data partition, audio format, etc. Instructions and scripts for downloading GigaSpeech can be found in GigaSpeech's GitHub repository.

本节介绍GigaSpeech语料库的结构,包括元数据、数据分区、音频格式等。下载GigaSpeech的说明和脚本可在GigaSpeech的GitHub仓库中找到。

2.1. Metadata

2.1. 元数据

We save all the metadata information to a single JSON file named GigaSpeech.json. Figure 1 shows a snip of this file. For better presentation of this paper, we skip a lot of non-critical entries in the snip, such as “format”, “md5”, “source”, etc.

我们将所有元数据信息保存到一个名为GigaSpeech.json的JSON文件中。图 1 展示了该文件的片段。为了便于本文展示,我们跳过了片段中许多非关键条目,例如"format"、"md5"、"source"等。

To use the corpus, users are expected to extract the relevant information from GigaSpeech.json. For example, for the speech recognition task, one should first follow the "audios" entry, and work out a list of audio files. One can then follow the "url" entry to download the original audio file, or "path" if preprocessed audio files have been downloaded to the disk. After that, for each audio file, one can follow the "segments" entry, and work out the trainable audio segments, as well as their corresponding transcripts. Of course, we also have various supplementary entries, such as "subsets", "md5", which will also be helpful for your task.

要使用该语料库,用户需从GigaSpeech.json中提取相关信息。例如,针对语音识别任务,应先查找"audios"条目,整理出音频文件列表。随后可通过"url"条目下载原始音频文件,若预处理音频已下载至本地磁盘,则使用"path"条目。接着,对每个音频文件,可通过"segments"条目获取可训练音频片段及其对应文本转录。此外,我们还提供"subsets"、"md5"等辅助条目,这些也将对您的任务有所帮助。
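下面是一个读取该元数据文件的极简 Python 示意(基于上文描述的 "audios"、"segments"、"url"/"path" 等条目;其中片段级字段名如 begin_time、end_time、text 仅为示意用的假设,实际字段以官方仓库的说明为准):

```python
import json

# 读取元数据文件(假设已下载到当前目录)
with open("GigaSpeech.json", encoding="utf-8") as f:
    meta = json.load(f)

# 遍历 "audios" 条目,收集可训练的片段及其转录文本
train_pairs = []
for audio in meta["audios"]:
    # 若预处理音频已下载到磁盘则优先使用本地 "path",否则可按 "url" 下载
    audio_path = audio.get("path") or audio.get("url")
    for seg in audio.get("segments", []):
        # 假设每个片段包含起止时间与对应转录,可直接用于 ASR 训练
        train_pairs.append({
            "audio": audio_path,
            "begin": seg["begin_time"],
            "end": seg["end_time"],
            "text": seg["text"],
        })

print(f"共收集 {len(train_pairs)} 个训练片段")
```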


Figure 1: A snip of the metadata file GigaSpeech.json

图 1: 元数据文件GigaSpeech.json的片段

Table 1: GigaSpeech training subsets

表 1: GigaSpeech训练子集

子集 有声书 播客 YouTube 总计
XL 2,655小时 3,499小时 3,846小时 10,000小时
L 650小时 875小时 975小时 2,500小时
M 260小时 350小时 390小时 1,000小时
S 65小时 87.5小时 97.5小时 250小时
XS 2.6小时 3.5小时 3.9小时 10小时

The metadata file GigaSpeech.json is version controlled, and is supposed to get updated over time. In future releases, we plan to add speaker information to the metadata file, so that it will be suitable for speaker identification/verification tasks. We also plan to add more data from different sources to increase the diversity.

元数据文件GigaSpeech.json采用版本控制,并会随时间推移更新。在未来的版本中,我们计划向元数据文件添加说话人信息,使其适用于说话人识别/验证任务。我们还计划从不同来源添加更多数据以提高多样性。

2.2. Training Subsets

2.2. 训练子集

We provide 5 training subsets in GigaSpeech, namely XS, S, M, L and XL, listed here in order of increasing audio hours. Table 1 shows a detailed breakdown of the 5 GigaSpeech training subsets.

我们在GigaSpeech中提供了5个训练子集,分别是XS、S、M、L和XL,按音频小时数递增顺序排列。表 1 详细展示了这5个GigaSpeech训练子集的细分数据。

2.3. Evaluation Sets

2.3. 评估集

We provide 2 evaluation sets in GigaSpeech: a DEV set for development and tuning, which consists of 12.5 hours of audio, and a TEST set for final evaluation, which consists of 40.3 hours of audio.

我们在GigaSpeech中提供了两个评估集:一个用于开发和调优的DEV集,包含12.5小时的音频;另一个用于最终评估的TEST集,包含40.3小时的音频。

A breakdown of our evaluation sets is illustrated in Table 2. Note that our evaluation sets do not have coverage for audiobooks. We make sure that audio files from the LibriSpeech [6] evaluation sets (dev-clean, dev-other, test-clean and test-other) are not present in our corpus; therefore, the LibriSpeech evaluation sets can be used as our evaluation sets as well.

我们的评估集细目如表 2 所示。需要注意的是,我们的评估集未包含有声读物。我们确保语料库中不包含 LibriSpeech [6] 评估集 (dev-clean、dev-other、test-clean 和 test-other) 的音频文件,因此 LibriSpeech 评估集也可作为我们的评估集使用。

2.4. Audio Format

2.4. 音频格式

To reduce the file size of the GigaSpeech corpus, we compress the original audio using the Opus audio codec. Original audio files are first converted to 16 kHz sampling rate, single channel and 16-bit signed-integer format. Opus compression is then applied to achieve an output bit rate of 32 kbps, which results in a compression ratio of 8.

为减小GigaSpeech语料库的文件体积,我们使用Opus音频编解码器对原始音频进行压缩处理。原始音频文件首先被转换为16kHz采样率、单声道、16位有符号整型的格式,随后应用Opus压缩技术将输出比特率控制在32 kbps,最终实现8:1的压缩比。
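按照上述流程,可以用如下 Python 脚本调用 ffmpeg 完成转换与压缩(仅为按文中参数写出的示意,并非语料库发布方使用的原始脚本;假设系统已安装带 libopus 的 ffmpeg):

```python
import subprocess

def compress_to_opus(src: str, wav_tmp: str, dst: str) -> None:
    # 第一步:转为 16 kHz、单声道、16 位有符号整型的 WAV
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
         "-c:a", "pcm_s16le", wav_tmp],
        check=True,
    )
    # 第二步:以 32 kbps 的输出比特率进行 Opus 压缩(约 8 倍压缩比)
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_tmp, "-c:a", "libopus", "-b:a", "32k", dst],
        check=True,
    )

compress_to_opus("example.mp3", "example_16k.wav", "example.opus")
```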

Table 3 shows the impact of Opus audio compression in terms of WER (%). Kaldi systems (see Section 4.3, but without recurrent neural network language model rescoring) are built for our M (1000h) training subset, with or without Opus compression. These two systems are then used to decode the DEV and TEST evaluation sets, with or without Opus compression. From Table 3, it is clear that compressing the training data with the Opus codec at 32 kbps output bit rate has very small impact on the DEV and TEST sets (0.1-0.2% WER degradation).

表 3 展示了 Opus 音频压缩对 WER (%) 的影响。我们为 M (1000小时) 训练子集构建了 Kaldi 系统 (参见第 4.3 节,但未使用循环神经网络语言模型重打分),分别在有无 Opus 压缩的条件下训练。这两个系统随后被用于解码 DEV 和 TEST 评估集,评估音频同样分为经过和未经过 Opus 压缩两种情况。从表 3 可以明显看出,以 32 kbps 输出比特率对训练数据进行 Opus 压缩,对 DEV 和 TEST 集的影响非常小 (WER 仅上升 0.1-0.2%)。

Table 2: GigaSpeech evaluation sets

表 2: GigaSpeech评估集

Sets Podcast YouTube Total
DEV 6.3h 6.2h 12.5h
TEST 16.1h 24.2h 40.3h

Table 3: Impact of Opus audio compression (WER in %)

表 3: Opus音频压缩的影响 (WER单位为 %)

Eval condition: DEV (Opus / Wav), TEST (Opus / Wav)
Train M (Opus): 19.0
Train M (Wav): 18.8 18.5 18.3

3. GigaSpeech Creation Pipeline

3. GigaSpeech 创建流程

This section presents the detailed pipeline for creating the GigaSpeech corpus, which can be applied to other data generation tasks as well.

本节详细介绍了创建GigaSpeech语料库的完整流程,该流程同样适用于其他数据生成任务。

3.1. Stage 1: Audio Collection

3.1. 阶段1: 音频采集

We start the task by manually defining the categories that we are interested in. We selected 24 categories in total, namely Arts, Business, Education, Autos and Vehicles, Comedy, Crime, Entertainment, Film and Animation, Gaming, Health and Fitness, History, Howto and Style, Kids and Family, Leisure, Music, News and Politics, Nonprofits and Activism, People and Blogs, Pets and Animals, Religion and Spirituality, Science and Technology, Society and Culture, Sports, Travel and Events.

我们首先手动定义感兴趣的分类,共选取24个类别:艺术、商业、教育、汽车与交通工具、喜剧、犯罪、娱乐、电影与动画、游戏、健康与健身、历史、教程与风格、儿童与家庭、休闲、音乐、新闻与政治、非营利与行动主义、人物与博客、宠物与动物、宗教与灵性、科学与技术、社会与文化、体育、旅行与活动。

For podcasts, we follow the above categories, and select episodes that come with manual transcriptions. For YouTube, we use the above categories as seed keywords, and select videos with human-generated closed captions. For audiobooks, we do not enforce those categories.

对于播客,我们遵循上述分类标准,并选择带有人工转录文本的剧集。对于YouTube,我们将这些分类作为种子关键词,筛选含有人工生成字幕的视频。至于有声书,我们不做类别限制。

Once we have the list of audio files, we create tools and download all audio files with their corresponding transcripts.

获取音频文件列表后,我们会创建工具并下载所有音频文件及其对应文本转录。

3.2. Stage 2: Text Normalization

3.2. 阶段2: 文本规范化

The audio transcripts we download from various sources are created by different transcribers with diversified transcription standards and styles, therefore it is necessary to apply text normalization to the original transcripts. We perform standard text normalization, including case normalization, special symbol removal, number to word rewriting, date/time rewriting, etc.

我们从不同来源下载的音频转录文本由不同的转录员创建,其转录标准和风格各异,因此需要对原始文本进行标准化处理。我们执行的标准文本规范化包括:大小写统一、特殊符号去除、数字转文字、日期/时间重写等。
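下面用一小段 Python 示意这类规范化的思路(仅为粗略示意:规则集远不完整,数字逐位转词也只是占位做法,真实流程中的数字、日期改写要复杂得多):

```python
import re

# 一个极简的数字到英文单词映射,仅覆盖 0-9,用于示意"数字转文字"这一步
DIGITS = {"0": "ZERO", "1": "ONE", "2": "TWO", "3": "THREE", "4": "FOUR",
          "5": "FIVE", "6": "SIX", "7": "SEVEN", "8": "EIGHT", "9": "NINE"}

def normalize(text: str) -> str:
    text = text.upper()                          # 大小写统一
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)    # 去除特殊符号(保留撇号)
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group(0)] + " ", text)  # 数字逐位转词(示意)
    return re.sub(r"\s+", " ", text).strip()     # 压缩多余空白

print(normalize("He paid $12 on Jan. 5th!"))
# -> "HE PAID ONE TWO ON JAN FIVE TH"(真实流程会将 12 读作 TWELVE、5th 读作 FIFTH 等)
```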

For audiobooks and podcasts, transcripts are usually at the episode or chapter/book level. For speech recognition, however, smaller segments less than 20 seconds are needed for training. The next step is to segment the long audio file into smaller segments. For YouTube, closed captions are provided at the sentence level, but unfortunately we find that the timestamps of closed captions are not reliable for segmentation. As a result, we decide to splice the closed captions all together, and perform the same segmentation as audiobooks and podcasts.

对于有声书和播客,文本通常位于单集或章节/书籍级别。但在语音识别任务中,训练需要小于20秒的短片段。因此需要将长音频文件切分为小片段。YouTube虽然提供句子级别的隐藏字幕,但我们发现其时间戳并不可靠。最终我们选择将所有隐藏字幕拼接为整体,采用与有声书和播客相同的分段方式。

3.3. Stage 3: Forced Alignment

3.3. 阶段3:强制对齐

Our aligner is implemented with Kaldi [11], and the alignment procedure follows the work in [12], which adopts a divide and conquer strategy to tackle the alignment problem. First, both audio and transcript are uniformly chunked into smaller pieces. Second, audio segments are decoded with a biased language model (LM), and hypotheses with timestamps are generated. Third, each hypothesis segment is matched to one transcript segment via TF-IDF similarity. For each matched pair, the hypothesis is further aligned with the transcript segment using the Smith-Waterman algorithm [13]. Finally, through this alignment, timestamps are attached to the transcript segments, and eventually to the whole transcripts by stitching the independently aligned segments together. Note that we modify the Smith-Waterman algorithm to handle silence and punctuation, and this is essential to enable the sentence-based segmentation in the next section.

我们的对齐器基于Kaldi [11]实现,对齐流程遵循文献[12]提出的分治策略:首先将音频和文本统一分割为小片段;其次使用带偏置的语言模型(LM)解码音频片段,生成带时间戳的识别假设;然后通过TF-IDF相似度将每个假设片段与文本片段进行匹配;对于每个匹配对,采用Smith-Waterman算法[13]将假设与文本片段进行细粒度对齐;最终通过对齐结果为文本片段标注时间戳,并通过拼接独立对齐片段完成完整文本的时间标注。需要注意的是,我们改进了Smith-Waterman算法以处理静音和标点符号,这对实现下一节基于句子的分割至关重要。
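作为参考,下面给出词级 Smith-Waterman 局部对齐的一个简化 Python 实现(未包含文中针对静音和标点的改动,match/mismatch/gap 打分也只是假设的示例值):

```python
def smith_waterman(hyp, ref, match=2, mismatch=-1, gap=-1):
    """对两个词序列做局部对齐,返回最优局部对齐的得分与终点位置。"""
    n, m = len(hyp), len(ref)
    # score[i][j]: 以 hyp[i-1]、ref[j-1] 结尾的最优局部对齐得分
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if hyp[i-1] == ref[j-1] else mismatch)
            score[i][j] = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos

# 识别假设与参考文本片段(示例),局部对齐得分越高,两者越可能对应同一段音频
hyp = "the quick brown fox jumps".split()
ref = "a quick brown fox jumped over".split()
print(smith_waterman(hyp, ref))   # -> (6, (4, 4)),对应 quick brown fox 这段匹配
```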


Figure 2: A forced alignment graph for the sentence "A B C D E" (4-gram)

图 2: 句子 "A B C D E" 的强制对齐图 (4-gram)

To achieve better alignment performance, we first align and segment the downloaded audio with a close-domain acoustic model. We then train an in-domain acoustic model with the audio segments created (around 3,000 hours). This model is used to align the whole corpus.

为实现更好的对齐性能,我们首先使用近域声学模型对下载的音频进行对齐和分段。随后利用生成的音频片段(约3000小时)训练一个域内声学模型,该模型用于对整个语料库进行对齐。

3.4. Stage 4: Audio Segmentation

3.4. 阶段4:音频分割

We work out the audio segments from the alignment information above. Several rules are applied during the segmentation process:

我们根据上述对齐信息划分音频片段。在分割过程中应用了以下规则:

It is worth pointing out that we keep 4 types of punctuation, namely comma, period, question mark and exclamation mark, so that splits can happen at sentence boundaries (second rule above). We map them to the special words "<COMMA>", "<PERIOD>", "<QUESTIONMARK>" and "<EXCLAMATIONPOINT>" respectively. Besides, this also allows us to build end-to-end speech recognition systems that include punctuation tagging and endpoint detection.

值得注意的是,我们保留了4种标点符号(即逗号、句号、问号和感叹号),以便在句子边界处进行切分(遵循上述第二条规则)。我们将它们分别映射为特殊词"<COMMA>"、"<PERIOD>"、"<QUESTIONMARK>"和"<EXCLAMATIONPOINT>"。此外,这种做法还能让我们构建包含标点标注和端点检测的端到端语音识别系统。
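映射本身可以用一个很小的替换函数来示意(以下写法仅为示例,特殊词的具体拼写以语料库实际发布的词表为准):

```python
PUNCT_MAP = {
    ",": "<COMMA>",
    ".": "<PERIOD>",
    "?": "<QUESTIONMARK>",
    "!": "<EXCLAMATIONPOINT>",
}

def map_punct(text: str) -> str:
    # 把保留的四种标点替换为特殊词,便于在句子边界切分并训练带标点的端到端模型
    for p, tok in PUNCT_MAP.items():
        text = text.replace(p, " " + tok)
    return " ".join(text.split())

print(map_punct("Is it raining? Yes, it is."))
# -> "Is it raining <QUESTIONMARK> Yes <COMMA> it is <PERIOD>"
```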

3.5. Stage 5: Segment Validation

3.5. 阶段5:分段验证

The segmentation stage generates a list of candidate segments, but potentially with high transcription error rate. We therefore apply segment validation to filter out bad segments.

分割阶段会生成一系列候选片段,但可能存在较高的转录错误率。因此我们应用片段验证来过滤掉不良片段。

3.5.1. Forced Alignment Graph

3.5.1. 强制对齐图

To better detect transcription errors made by human transcribers, we propose a variation of the alignment graph in the n-gram framework, as shown in Figure 2. The bold arrow path represents a typical LM-free forced alignment graph. Each state on the forced alignment path has a dotted "leaky" arc (with weight) that allows the token to leak out of the forced alignment path, from higher order n-gram states down to lower order n-gram states, until reaching the null state. A garbage word loop (containing the top 1,000 uni-gram words) is added around the null state to consume additional acoustic frames. Besides, there are extra states and arcs that allow the token to return to the forced alignment path. In general, this alignment graph allows the decoder to perform insertions/deletions/substitutions with respect to the reference. This essentially brings more flexibility to the forced alignment stage, making it possible to capture the discrepancy between the audio and the corresponding transcript.

为了更好地检测人工转录员产生的转录错误,我们在n-gram框架中提出了一种对齐图 (alignment graph) 的变体,如图 2 所示。粗箭头路径代表典型的无语言模型强制对齐图。强制对齐路径上的每个状态都带有一条虚线"泄漏"弧线(含权重),允许token从高阶n-gram状态向低阶n-gram状态泄漏,直至到达空状态。空状态周围添加了一个垃圾词循环(包含前1000个uni-gram词)以消耗额外的声学帧。此外,还设有允许token返回强制对齐路径的额外状态和弧线。总体而言,这种对齐图使解码器能够对参考文本执行插入/删除/替换操作。这本质上为强制对齐阶段带来了更大灵活性,使其能够捕捉音频与对应转录文本之间的差异。

3.5.2. Validation Decoding Pass

3.5.2. 验证解码过程

During the validation decoding pass, we detect transcription errors and filter out segments with high error rates. Figure 3 gives two examples of how errors are detected. In the first example, the transcriber misses the word "YOU" in the transcript, which is caught by our decoder. In the second example, the transcriber makes a typo, which is also successfully detected.

在验证解码过程中,我们会检测转录错误并过滤掉错误率高的片段。图 3 展示了两个错误检测的示例。第一个示例中,转录员漏掉了文本中的单词 "YOU",这被我们的解码器成功捕获。第二个示例中,转录员出现了拼写错误,该错误同样被成功检测到。

For the podcast and YouTube portion of our XL training subset, we cap the maximum WER at 4%, and throw away all segments with higher WER. For the audiobook portion of the XL training subset, as well as all other smaller subsets, we cap the maximum WER at 0%, meaning we don't allow any transcription errors.

在我们的 XL 训练子集中,针对播客和 YouTube 部分,我们将最大词错误率 (WER) 上限设为 4%,并丢弃所有 WER 更高的片段。对于 XL 训练子集中的有声书部分以及其他所有较小子集,我们将最大 WER 上限设为 0%,这意味着不允许出现任何转录错误。
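片段过滤的逻辑可以用如下简化的 Python 片段示意(WER 按标准编辑距离计算;4%/0% 阈值取自正文,函数与参数名为示意用的假设):

```python
def wer(ref: str, hyp: str) -> float:
    """标准编辑距离 WER:(替换 + 删除 + 插入) / 参考词数。"""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def keep_segment(ref: str, hyp: str, source: str) -> bool:
    # XL 子集中的播客/YouTube 片段允许最多 4% 的 WER,其余情况要求完全一致
    cap = 0.04 if source in ("podcast", "youtube") else 0.0
    return wer(ref, hyp) <= cap
```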

3.5.3. Reference Rewriting

3.5.3. 参考重写

Investigation into the validated segments reveals three common types of transcriber errors:

对已验证片段的分析揭示了三种常见的转录错误类型:

• Fillers ignored, such as AH, UH, UM, ER, ERR, YOU KNOW, I MEAN, SORT OF, etc.
• Conjunctions ignored, such as AND, OR, BUT, etc.
• Disfluency removed, such as "It's it's it's a great thing!".

• 忽略填充词,例如 AH、UH、UM、ER、ERR、YOU KNOW、I MEAN、SORT OF 等。
• 忽略连词,例如 AND、OR、BUT 等。
• 去除不流畅表达,例如 "It's it's it's a great thing!"。

Discarding these segments will fundamentally limit the diversity of the corpus. To fix those common errors, we add a filler loop (see Figure 2) to the forced alignment graph, which contains the above common fillers and conjunctions. Besides, we also employ a disfluency detector. The filler loop and the disfluency detector may modify the reference (reference rewriting), and if that happens, those words will be counted as correct. We only apply reference rewriting to our XL subset (10,000h).

丢弃这些片段将从根本上限制语料库的多样性。为解决这些常见错误,我们在强制对齐图中添加了一个填充循环(见图 2),其中包含上述常见填充词和连词。此外,我们还采用了不流畅检测器。填充循环和不流畅检测器可能会修改参考文本(参考重写),若发生这种情况,这些词将被判定为正确。我们仅对 XL 子集(10,000 小时)应用参考重写。

3.6. Stage 6: Evaluation

3.6. 阶段6: 评估

Since our evaluation sets are manually processed by professional human transcribers, we take that as the ground truth, and use it to compute the frame level segmentation precision and recall, see Figure 4. Here recall tells us how many frames can be retrieved by our segmentation and validation pipeline, and precision tells us how many of these retrieved frames are correctly labeled (consistent with the human labels).

由于我们的评估集由专业转录员手动处理,我们将其视为基准事实,并用于计算帧级分割的精确率和召回率,见图 4。这里的召回率表示我们的分割和验证流程能检索到多少帧,而精确率则表示这些检索到的帧中有多少被正确标记(与人工标记一致)。
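帧级精确率/召回率的计算思路可用下面的简化 Python 示意说明(把人工分段与流程输出分段都展开为帧级标记再比较;这里只考虑片段覆盖,未涉及逐帧的转录一致性判断,帧移 0.01 秒为假设值):

```python
def to_frames(segments, total_frames, frame_shift=0.01):
    """segments: [(起始秒, 结束秒), ...];返回每一帧是否落在某个片段内的布尔列表。"""
    frames = [False] * total_frames
    for begin, end in segments:
        for t in range(int(begin / frame_shift), min(int(end / frame_shift), total_frames)):
            frames[t] = True
    return frames

def precision_recall(human_segs, pipeline_segs, total_frames):
    ref = to_frames(human_segs, total_frames)        # 人工标注(基准事实)
    hyp = to_frames(pipeline_segs, total_frames)     # 分割/验证流程的输出
    retrieved = sum(hyp)                              # 流程检索到的帧数
    correct = sum(r and h for r, h in zip(ref, hyp))  # 其中与人工标注一致的帧数
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / sum(ref) if sum(ref) else 0.0
    return precision, recall
```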

Figure 5 illustrates the precision-recall curve on the podcast portion of the DEV evaluation set. For the XL training subset, we select a working point that gives us 10,000 hours of validated audio data, while keeping the maximum WER under 4%. And for all our other training subsets, we select the working point that keeps the maximum WER at 0% (the leftmost working point in Figure 5).

图 5 展示了DEV评估集中播客部分的精确率-召回率曲线。对于XL训练子集,我们选择了一个工作点,该点可提供10,000小时的已验证音频数据,同时将最大词错误率(WER)控制在4%以下。而对于其他所有训练子集,我们选择将最大WER保持在0%的工作点(即图5中最左侧的工作点)。

4. Experiments

4. 实验

This section describes the baseline systems and experimental results for four popular speech recognition toolkits, namely Athena, ESPnet [14], Kaldi [11] and Pika [15].

本节介绍四种主流语音识别工具包的基线系统和实验结果,分别是Athena、ESPnet [14]、Kaldi [11]和Pika [15]。

4.1. Athena Baseline System

4.1. Athena 基线系统

The Athena baseline implements an encoder-decoder based transformer model, which is similar to [16], with parameters $\textit{e}=12$ , $d=6$ , $d_{\mathrm{model}}=1024$ , $d_{\mathrm{ff}}=2048$ and $d_{\mathrm{head}}=8$ .

Athena基线实现了一个基于编码器-解码器的Transformer模型,与[16]类似,参数为$\textit{e}=12$、$d=6$、$d_{\mathrm{model}}=1024$、$d_{\mathrm{ff}}=2048$和$d_{\mathrm{head}}=8$。

During the training, the output sequence of the encoder is also used for connectionist temporal classification (CTC) for joint training to enforce monotonic alignment between speech and label sequences. We use the Adam optimizer and a varied learning rate with a warmup schedule (warmup steps = 8000). The model is trained with a total batch size of 128 for 5 epochs.

在训练过程中,编码器的输出序列还用于连接时序分类(CTC)进行联合训练,以强制语音和标签序列之间保持单调对齐。我们使用Adam优化器和带预热计划的动态学习率(预热步数=8000)。模型以总批次大小128训练5个周期。
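原文未给出具体的学习率公式;若采用 Transformer 中常见的 Noam 预热调度(此处仅为合理假设,Athena 的实际实现可能不同),学习率随步数 $step$ 的变化大致为:

$$\mathrm{lr}(step)=d_{\mathrm{model}}^{-0.5}\cdot\min\!\left(step^{-0.5},\; step\cdot warmup^{-1.5}\right),\qquad warmup=8000$$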

During the decoding, beam search with a beam size of 8 is used, which combines the scores of the decoder, the CTC with weight 0.5 and the LM with weight 0.2. The LM is a RNN based model with two long short-term memory (LSTM) layers, each with 1024 nodes.

在解码过程中,采用束宽为8的束搜索(beam search),该方法综合了解码器得分、权重为0.5的CTC(Connectionist Temporal Classification)得分以及权重为0.2的语言模型(LM)得分。该语言模型是基于循环神经网络(RNN)的双层长短期记忆网络(LSTM)模型,每层包含1024个节点。
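若按常见的联合 CTC/注意力解码方式理解(对数线性加权组合,属于合理推测而非原文给出的公式),束搜索中候选序列 $Y$ 的综合得分可写作:

$$\mathrm{score}(Y)=\log P_{\mathrm{dec}}(Y\mid X)+0.5\,\log P_{\mathrm{ctc}}(Y\mid X)+0.2\,\log P_{\mathrm{lm}}(Y)$$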


Figure 3: Examples of transcription errors detected by forced alignment

图 3: 强制对齐检测到的转录错误示例


Figure 4: Frame level segmentation accuracy

图 4: 帧级分割准确率


Figure 5: Frame level precision-recall curve for the segmentation and validation pipeline

图 5: 分割与验证流程的帧级精确率-召回率曲线

4.2. ESPnet Baseline System

4.2. ESPnet基线系统

The ESPnet baseline uses Conformer (convolution-augmented Transformer) [7], which is a recently proposed architecture combining the local sensitivities of convolutional neural networks with the long-range interactions of Transformers [17]. We use the implementation provided in the ESPnet toolkit [18]. We use a set of 5k BPE tokens generated by the SentencePiece tokenizer [19]. The model has 12 conformer blocks with an output dimension of 512 and a kernel size of 31 in the encoder, and 6 transformer blocks in the decoder. Both encoder and decoder have 8 attention heads with a feed-forward unit dimension of 2048. The Conformer was trained using four 24 GB Titan RTX GPUs. The maximum number of training epochs is 20, and the mini-batch size is 35 million acoustic feature bins. The Adam optimizer with no weight decay was used. The Noam learning rate scheduler was set to 25k warmup steps with a learning rate of 0.0015. SpecAug used 2 frequency masks and 5 time masks. The last 10 best checkpoints were averaged as the final model.

ESPnet基线采用Conformer(卷积增强型Transformer)[7]架构,这是近期提出的结合卷积神经网络局部敏感性与Transformer长程交互能力的模型[17]。我们使用ESPnet工具包[18]提供的实现方案,采用SentencePiece分词器[19]生成的5k个BPE token。该模型编码器包含12个Conformer模块(输出维度512,卷积核大小31),解码器包含6个Transformer模块。编码器和解码器均配置8个注意力头,前馈单元维度为2048。训练使用四块24GB显存的Titan RTX显卡,最大训练轮次20,小批量规模为3500万声学特征单元。采用无权重衰减的Adam优化器,Noam学习率调度器设置25k预热步数,初始学习率0.0015。SpecAugment策略应用2个频域掩码和5个时域掩码,最终模型为最后10个最优检查点的平均结果。
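"最后 10 个最优 checkpoint 取平均"这一步的思路,可以用如下 PyTorch 风格的小函数示意(仅为示意,并非 ESPnet 自带脚本;假设各 checkpoint 保存的是结构相同的 state_dict):

```python
import torch

def average_checkpoints(paths):
    """对若干个 checkpoint 的参数逐元素取平均,返回平均后的 state_dict。"""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(paths)
    return avg

# 用法示意:把平均后的参数保存为最终模型(文件名为假设)
# torch.save(average_checkpoints([f"ckpt_{i}.pt" for i in range(10)]), "model_avg.pt")
```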

4.3. Kaldi Baseline System

4.3. Kaldi 基线系统

The Kaldi baseline implements a typical chain model. First, a GMM-HMM model is trained to obtain the alignments, with no data cleaning. Second, volume and speed augmentation techniques are applied. I-vectors are then extracted and appended to the basic acoustic features, as most Kaldi recipes do. Finally, a neural network is trained with both the cross-entropy and LF-MMI criteria. Our neural network stacks 6 convolutional neural network (CNN) layers, 10 TDNN-F layers, 1 attention-relu-renorm-layer, 1 TDNN-F layer, 1 fast-lstmp-layer, 1 TDNN-F layer and finally 1 more fast-lstmp-layer. During the decoding stage, a 4-gram LM is first used for decoding, followed by a recurrent neural network language model (RNNLM) rescoring pass.

Kaldi基线系统实现了一个典型的链式模型。首先训练GMM-HMM模型获取对齐结果(不进行数据清洗),随后应用音量与语速增强技术。如大多数Kaldi配方所示,接着提取i-vector并拼接到基础声学特征上。最终使用交叉熵和LF-MMI准则训练神经网络,该网络包含6层卷积神经网络(CNN)、10层TDNN-F层、1层attention-relu-renorm层、1层TDNN-F层、1层fast-lstmp层、1层TDNN-F层及最后1层fast-lstmp层。解码阶段先使用4-gram语言模型进行解码,再通过循环神经网络语言模型(RNNLM)进行重打分。

Table 4: GigaSpeech baselines for the XL training subset (WER in %)

表 4: GigaSpeech XL训练子集的基线 (WER单位为 %)

工具包 模型 DEV TEST
Athena Transformer-AED+RNNLM 13.60 12.70
ESPnet Conformer/Transformer-AED 10.90 10.80
Kaldi Chain+RNNLM 14.78 14.84
Pika RNN-T 12.30 12.30

Table 5: Kaldi baselines for GigaSpeech training subsets (WER in %)

表 5: GigaSpeech训练子集的Kaldi基线 (WER, %)

Subset DEV TEST
XL 14.78 14.84
L 16.60 16.28
M 17.96 17.53
S 22.59 22.14
XS N/A N/A

4.4. Pika Baseline System

4.4. Pika 基线系统

The Pika baseline adopts a convolution and transformer based architecture [15] for the encoder of our RNN-T system. A five-layer transformer is used for the decoder, and the hidden dimension of each layer is 512. We apply on-the-fly speed and volume perturbation during training, where speed rates are set to 0.9/1.0/1.1/1.2 and the volume range is from -55 dB to -10 dB. For input features, we use 80-dimensional log Fbanks. The targets of our RNN-T system are a set of English wordpieces plus the blank symbol, which leads to an output dimension of 5,000. The total number of parameters in the RNN-T is about 87M. MBR training and two extra two-layer transformer based forward/backward rescorers are also adopted. All training is conducted on 16 V100 GPUs. Our distributed training strategy is based on block-wise model-update filtering (BMUF) with a Nesterov momentum scheme, but with a different learning rate schedule [15], where both the initial and final learning rates are set before training and the number of training epochs is fixed (therefore there is no early stopping, and no development/validation set is used). A single sweep of the GigaSpeech XL training subset takes about 5 hours. For decoding, we set the beam size to 8 and the temperature of the softmax to 1.25.

Pika基线采用了基于卷积和Transformer的架构[15]作为RNN-T系统的编码器。解码器使用五层Transformer结构,每层隐藏维度为512。训练过程中应用了实时语速和音量扰动,语速系数设为$0.9/1.0/1.1/1.2$,音量范围从-55dB到-10dB。输入特征采用80维对数Fbank系数。RNN-T系统的输出目标为英语词片段集合加空白符号,最终输出维度为5000。该RNN-T模型总参数量约8700万。同时采用了最小贝叶斯风险(MBR)训练和两个额外的双层Transformer结构前向/后向重打分器。所有训练均在16块V100 GPU上进行,分布式训练策略基于带Nesterov动量方案的块级模型更新过滤(BMUF)方法[15],采用固定初始/最终学习率与训练轮次的调度策略(因此未使用早停机制和开发/验证集)。单次遍历GigaSpeech $XL$训练子集约需5小时。解码阶段设置束搜索宽度为8,softmax温度为1.25。

4.5. Experimental Results

4.5. 实验结果

Table 4 demonstrates baseline results for Athena, ESPnet, Kaldi and Pika. Results listed here are purely for the purpose of providing baseline systems for each toolkit. They do not reflect the state-of-the-art performance of each toolkit, and cannot be used to compare performance across toolkits.

表 4 展示了 Athena、ESPnet、Kaldi 和 Pika 的基线结果。此处列出的结果仅用于为每个工具包提供基准系统,不代表各工具包的最先进性能,也不能用于跨工具包性能比较。

Table 5 illustrates the Kaldi baseline results for 4 training subsets. Generally speaking, as the training subset gets bigger, the performance goes up. Our smallest XS training subset (10h) is designed for system building and debugging only, and is not expected to give strong performance.

表 5 展示了4个训练子集的Kaldi基线结果。总体而言,随着训练子集规模增大,性能也随之提升。我们最小的XS训练子集(10小时)仅用于系统构建和调试,预期性能有限。
