[论文翻译]GigaSpeech: 一个持续演进、多领域的语音识别(ASR)语料库,包含10,000小时转写音频


原文地址:https://arxiv.org/pdf/2106.06909v1


GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

GigaSpeech: 一个持续演进、多领域的语音识别(ASR)语料库,包含10,000小时转写音频

Abstract

摘要

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

本文介绍了GigaSpeech,这是一个持续演进的、多领域英语语音识别语料库,包含10,000小时适合监督训练的高质量标注音频,以及总计40,000小时适合半监督和无监督训练的音频。我们首先从有声书、播客和YouTube收集了约40,000小时的转录音频,涵盖朗读和即兴发言两种说话风格,以及艺术、科学、体育等多种主题。我们提出了一种新的强制对齐和分段流程,用于创建适合语音识别训练的句子片段,并过滤掉转录质量较低的片段。在系统训练方面,GigaSpeech提供五种不同规模的子集:10小时、250小时、1000小时、2500小时和10000小时。对于我们的10,000小时XL训练子集,在过滤/验证阶段我们将词错误率上限设为4%,而其他所有较小训练子集的上限设为0%。另一方面,DEV和TEST评估集经过专业转录员的重新处理,以确保高转录质量。我们还为主流语音识别工具包(包括Athena、ESPnet、Kaldi和Pika)提供了基线系统。

Index Terms: corpus, forced alignment, segmentation, speech recognition

索引术语:语料库 (corpus)、强制对齐 (forced alignment)、分割 (segmentation)、语音识别 (speech recognition)

1. Introduction

1. 引言

Thanks to the rapid development of the neural network models, automatic speech recognition (ASR) has made tremendous progress in the past decade. Various system architectures, from hybrid [1] to end-to-end [2], are proposed, and state-of-the-art results on standard benchmarks are being frequently updated.

得益于神经网络模型的快速发展,自动语音识别(ASR)在过去十年取得了巨大进步。从混合架构[1]到端到端架构[2],各种系统架构不断被提出,标准基准测试的最先进结果也在持续刷新。

The mainstream speech recognition corpora, on the other hand, have not changed much in decades. To take the English speech recognition task as an example, the Wall Street Journal corpus, which consists of 80 hours of narrated news articles [3], is almost 30 years old, and has a word error rate (WER) of 2.32% on its eval92 benchmark [4]. The Switchboard and Fisher corpus, which consists of 262 and 1,698 hours of telephone conversational speech, is around 20 years old, and has a WER of 5.5% on the Switchboard portion of the Hub5'00 benchmark [5]. Even LibriSpeech [6], one of the most popular corpora for speech recognition tasks, is more than 5 years old, and has a WER of 1.9% on its test-clean benchmark [7]. It consists of 1,000 hours of read English speech. Due to the fast development of speech recognition techniques, ASR performance on those data sets appears to have saturated, making it difficult to track further improvements from new techniques.

主流语音识别语料库几十年来变化不大。以英语语音识别任务为例,包含80小时新闻播报内容的华尔街日报语料库[3]已有近30年历史,在其eval92基准测试[4]上的词错误率(WER)为2.32%。包含262小时和1,698小时电话对话内容的Switchboard和Fisher语料库已有约20年历史,在Hub5'00基准测试的Switchboard部分[5]上WER为5.5%。即便是语音识别任务中最流行的LibriSpeech语料库[6]也已有5年以上历史,其test-clean基准测试[7]上的WER为1.9%,该库包含1,000小时的英语朗读语音。由于语音识别技术的快速发展,这些数据集上的ASR性能似乎已经饱和,难以追踪新技术的进一步改进。

There is some progress on creating better corpora/benchmarks for English speech recognition, from both academia and industry. TED-LIUM [8] is a series of corpora created by the Ubiqus company and the University of Le Mans. It consists of 452 hours of audio from TED talks in its latest release, TED-LIUM 3. The corpus size, however, is less than 1,000 hours, making it not suitable for algorithms which demand a large amount of data. People's Speech [9], released by ML Commons, consists of 87,000 hours of audio, covering 59 different languages. Its source, however, is mostly audiobooks, lacking crucial acoustic diversity. Another work is SPGISpeech [10], a corpus released by Kensho Technologies. It consists of 5,000 hours of audio from earnings calls transcribed by S&P Global, Inc. The corpus by its nature gives an emphasis to the business domain.

在英语语音识别领域,学术界和工业界都在构建更优质的语料库和基准测试方面取得进展。Ubiqus公司与勒芒大学联合开发的TED-LIUM [8] 系列语料库,其最新版本TED-LIUM 3包含452小时的TED演讲音频。但该语料库规模不足1000小时,难以满足需要海量数据的算法需求。由ML Commons发布的People's Speech [9]包含87,000小时的音频,涵盖59种不同语言。然而其来源主要是有声读物,缺乏关键的声学多样性。另一项工作是SPGISpeech [10],这是由Kensho Technologies发布的语料库,包含5,000小时由标普全球(S&P Global, Inc.)转录的财报电话会议音频。该语料库本质上侧重于商业领域。

We release a complementary English speech recognition corpus named GigaSpeech, an evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. The initial release of GigaSpeech is complementary to the existing corpora in terms of its scale, its coverage of both read and spontaneous speaking styles, and its domain diversity.

我们发布了一个名为GigaSpeech的补充性英语语音识别语料库,这是一个不断演进的多领域自动语音识别(ASR)语料库,包含10,000小时的转录音频。GigaSpeech的初始版本在规模、对朗读与即兴两种说话风格的覆盖以及领域多样性方面,对现有语料库形成了补充。

We make two contributions in this work. First, we release an evolving, multi-domain speech recognition corpus with 10,000 hours of labeled audio. Second, we provide a scalable, reliable pipeline for generating speech recognition corpora.

我们在这项工作中做出了两项贡献。首先,我们发布了一个包含10,000小时标注音频的、不断演进的多领域语音识别语料库。其次,我们提供了一个可扩展且可靠的语音识别语料库生成流程。

The rest of the paper is organized as follows. Section 2 introduces the GigaSpeech corpus, and Section 3 presents the full pipeline to create the GigaSpeech corpus. We describe the speech recognition baseline systems for various toolkits, and provide experiment setup and results in Section 4. Finally, acknowledgements are given in Section 5.

本文其余部分组织结构如下。第2节介绍GigaSpeech语料库,第3节阐述构建GigaSpeech语料库的完整流程。第4节描述基于不同工具包的语音识别基线系统,并提供实验设置与结果。最后,第5节为致谢部分。

2. GigaSpeech Corpus

2. GigaSpeech语料库

This section explains the structure of the GigaSpeech corpus, including metadata, data partition, audio format, etc. Instructions and scripts for downloading GigaSpeech can be found in GigaSpeech's GitHub repository.

本节介绍GigaSpeech语料库的结构,包括元数据、数据分区、音频格式等。下载GigaSpeech的说明和脚本可在GigaSpeech的GitHub仓库中找到。

2.1. Metadata

2.1. 元数据

We save all the metadata information to a single JSON file named GigaSpeech.json. Figure 1 shows a snip of this file. For better presentation of this paper, we skip a lot of non-critical entries in the snip, such as “format”, “md5”, “source”, etc.

我们将所有元数据信息保存到一个名为GigaSpeech.json的JSON文件中。图 1 展示了该文件的片段。为了便于本文展示,我们跳过了片段中许多非关键条目,例如"format"、"md5"、"source"等。

To use the corpus, users are expected to extract the relevant information from GigaSpeech.json. For example, for the speech recognition task, one should first follow the "audios" entry, and work out a list of audio files. One can then follow the "url" entry to download the original audio file, or "path" if preprocessed audio files have been downloaded to the disk. After that, for each audio file, one can follow the "segments" entry, and work out the trainable audio segments, as well as their corresponding transcripts. Of course, we also have various supplementary entries, such as "subsets", "md5", which will also be helpful for your task.

要使用该语料库,用户需从GigaSpeech.json中提取相关信息。例如,针对语音识别任务,应先查找"audios"条目,整理出音频文件列表。随后可通过"url"条目下载原始音频文件,若预处理音频已下载至本地磁盘,则使用"path"条目。接着,对每个音频文件,可通过"segments"条目获取可训练音频片段及其对应文本转录。此外,我们还提供"subsets"、"md5"等辅助条目,这些也将对您的任务有所帮助。
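下面是一个读取该元数据文件的极简 Python 示意(基于上文描述的 "audios"、"segments"、"url"/"path" 等条目;其中片段级字段名如 begin_time、end_time、text 仅为示意用的假设,实际字段以官方仓库的说明为准):

```python
import json

# 读取元数据文件(假设已下载到当前目录)
with open("GigaSpeech.json", encoding="utf-8") as f:
    meta = json.load(f)

# 遍历 "audios" 条目,收集可训练的片段及其转录文本
train_pairs = []
for audio in meta["audios"]:
    # 若预处理音频已下载到磁盘则优先使用本地 "path",否则可按 "url" 下载
    audio_path = audio.get("path") or audio.get("url")
    for seg in audio.get("segments", []):
        # 假设每个片段包含起止时间与对应转录,可直接用于 ASR 训练
        train_pairs.append({
            "audio": audio_path,
            "begin": seg["begin_time"],
            "end": seg["end_time"],
            "text": seg["text"],
        })

print(f"共收集 {len(train_pairs)} 个训练片段")
```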


Figure 1: A snip of the metadata file GigaSpeech.json

图 1: 元数据文件GigaSpeech.json的片段

Table 1: GigaSpeech training subsets

表 1: GigaSpeech训练子集

子集 有声书 播客 YouTube 总计
XL 2,655小时 3,499小时 3,846小时 10,000小时
L 650小时 875小时 975小时 2,500小时
M 260小时 350小时 390小时 1,000小时
S 65小时 87.5小时 97.5小时 250小时
XS 2.6小时 3.5小时 3.9小时 10小时

The metadata file GigaSpeech.json is version controlled, and is supposed to get updated over time. In future releases, we plan to add speaker information to the metadata file, so that it will be suitable for speaker identification/verification tasks. We also plan to add more data from different sources to increase the diversity.

元数据文件GigaSpeech.json采用版本控制,并会随时间推移更新。在未来的版本中,我们计划向元数据文件添加说话人信息,使其适用于说话人识别/验证任务。我们还计划从不同来源添加更多数据以提高多样性。

2.2. Training Subsets

2.2. 训练子集

We provide 5 training subsets in GigaSpeech, namely XS, S, M, L and XL, listed here in order of increasing audio hours. Table 1 shows a detailed breakdown of the 5 GigaSpeech training subsets.

我们在GigaSpeech中提供了5个训练子集,分别是XS、S、M、L和XL,按音频小时数递增顺序排列。表 1 详细展示了这5个GigaSpeech训练子集的细分数据。

2.3. Evaluation Sets

2.3. 评估集

We provide 2 evaluation sets in GigaSpeech: a DEV set for development and tuning, which consists of 12.5 hours of audio, and a TEST set for final evaluation, which consists of 40.3 hours of audio.

我们在GigaSpeech中提供了两个评估集:一个用于开发和调优的DEV集,包含12.5小时的音频;另一个用于最终评估的TEST集,包含40.3小时的音频。

A breakdown of our evaluation sets is illustrated in Table 2. Note that our evaluation sets do not have coverage for audiobooks. We make sure that audio files from the LibriSpeech [6] evaluation sets (dev-clean, dev-other, test-clean and test-other) are not present in our corpus; therefore, the LibriSpeech evaluation sets can be used as our evaluation sets as well.

我们的评估集细目如表 2 所示。需要注意的是,我们的评估集未包含有声读物。我们确保语料库中不包含 LibriSpeech [6] 评估集 (dev-clean、dev-other、test-clean 和 test-other) 的音频文件,因此 LibriSpeech 评估集也可作为我们的评估集使用。

2.4. Audio Format

2.4. 音频格式

To reduce the file size of the GigaSpeech corpus, we compress the original audio using the Opus audio codec. Original audio files are first converted to 16 kHz sampling rate, single channel and 16-bit signed-integer format. Opus compression is then applied to achieve an output bit rate of 32 kbps, which results in a compression ratio of 8.

为减小GigaSpeech语料库的文件体积,我们使用Opus音频编解码器对原始音频进行压缩处理。原始音频文件首先被转换为16kHz采样率、单声道、16位有符号整型的格式,随后应用Opus压缩技术将输出比特率控制在32 kbps,最终实现8:1的压缩比。
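按照上述流程,可以用如下 Python 脚本调用 ffmpeg 完成转换与压缩(仅为按文中参数写出的示意,并非语料库发布方使用的原始脚本;假设系统已安装带 libopus 的 ffmpeg):

```python
import subprocess

def compress_to_opus(src: str, wav_tmp: str, dst: str) -> None:
    # 第一步:转为 16 kHz、单声道、16 位有符号整型的 WAV
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
         "-c:a", "pcm_s16le", wav_tmp],
        check=True,
    )
    # 第二步:以 32 kbps 的输出比特率进行 Opus 压缩(约 8 倍压缩比)
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_tmp, "-c:a", "libopus", "-b:a", "32k", dst],
        check=True,
    )

compress_to_opus("example.mp3", "example_16k.wav", "example.opus")
```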

Table 3 shows the impact of Opus audio compression in terms of WER (%). Kaldi systems (see Section 4.3, but without recurrent neural network language model rescoring) are built for our M (1000h) training subset, with or without Opus compression. These two systems are then used to decode the DEV and TEST evaluation sets, with or without Opus compression. From Table 3, it is clear that compressing the training data with the Opus codec at 32 kbps output bit rate has very small impact on the DEV and TEST sets (0.1-0.2% WER degradation).

表 3 展示了 Opus 音频压缩对 WER (%) 的影响。我们为 M (1000小时) 训练子集构建了 Kaldi 系统 (参见第 4.3 节,但未使用循环神经网络语言模型重打分),分别在有无 Opus 压缩的条件下训练。这两个系统随后被用于解码 DEV 和 TEST 评估集,评估音频同样分为经过和未经过 Opus 压缩两种情况。从表 3 可以明显看出,以 32 kbps 输出比特率对训练数据进行 Opus 压缩,对 DEV 和 TEST 集的影响非常小 (WER 仅上升 0.1-0.2%)。

Table 2: GigaSpeech evaluation sets

表 2: GigaSpeech评估集

Sets Podcast YouTube Total
DEV 6.3h 6.2h 12.5h
TEST 16.1h 24.2h 40.3h

Table 3: Impact of Opus audio compression (WER in %)

表 3: Opus音频压缩的影响 (WER单位为 %)

Eval condition: DEV (Opus / Wav), TEST (Opus / Wav)
Train M (Opus): 19.0
Train M (Wav): 18.8 18.5 18.3

3. GigaSpeech Creation Pipeline

3. GigaSpeech 创建流程

This section presents the detailed pipeline for creating the GigaSpeech corpus, which can be applied to other data generation tasks as well.

本节详细介绍了创建GigaSpeech语料库的完整流程,该流程同样适用于其他数据生成任务。

3.1. Stage 1: Audio Collection

3.1. 阶段1: 音频采集

We start the task by manually defining the categories that we are interested in. We selected 24 categories in total, namely Arts, Business, Education, Autos and Vehicles, Comedy, Crime, Entertainment, Film and Animation, Gaming, Health and Fitness, History, Howto and Style, Kids and Family, Leisure, Music, News and Politics, Nonprofits and Activism, People and Blogs, Pets and Animals, Religion and Spirituality, Science and Technology, Society and Culture, Sports, Travel and Events.

我们首先手动定义感兴趣的分类,共选取24个类别:艺术、商业、教育、汽车与交通工具、喜剧、犯罪、娱乐、电影与动画、游戏、健康与健身、历史、教程与风格、儿童与家庭、休闲、音乐、新闻与政治、非营利与行动主义、人物与博客、宠物与动物、宗教与灵性、科学与技术、社会与文化、体育、旅行与活动。

For podcasts, we follow the above categories, and select episodes that come with manual transcriptions. For YouTube, we use the above categories as seed keywords, and select videos with human-generated closed captions. For audiobooks, we do not enforce those categories.

对于播客,我们遵循上述分类标准,并选择带有人工转录文本的剧集。对于YouTube,我们将这些分类作为种子关键词,筛选含有人工生成字幕的视频。至于有声书,我们不做类别限制。

Once we have the list of audio files, we create tools and download all audio files with their corresponding transcripts.

获取音频文件列表后,我们会创建工具并下载所有音频文件及其对应文本转录。

3.2. Stage 2: Text Normalization

3.2. 阶段2: 文本规范化

The audio transcripts we download from various sources are created by different transcribers with diversified transcription standards and styles, therefore it is necessary to apply text normalization to the original transcripts. We perform standard text normalization, including case normalization, special symbol removal, number to word rewriting, date/time rewriting, etc.

我们从不同来源下载的音频转录文本由不同的转录员创建,其转录标准和风格各异,因此需要对原始文本进行标准化处理。我们执行的标准文本规范化包括:大小写统一、特殊符号去除、数字转文字、日期/时间重写等。
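下面用一小段 Python 示意这类规范化的思路(仅为粗略示意:规则集远不完整,数字逐位转词也只是占位做法,真实流程中的数字、日期改写要复杂得多):

```python
import re

# 一个极简的数字到英文单词映射,仅覆盖 0-9,用于示意"数字转文字"这一步
DIGITS = {"0": "ZERO", "1": "ONE", "2": "TWO", "3": "THREE", "4": "FOUR",
          "5": "FIVE", "6": "SIX", "7": "SEVEN", "8": "EIGHT", "9": "NINE"}

def normalize(text: str) -> str:
    text = text.upper()                          # 大小写统一
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)    # 去除特殊符号(保留撇号)
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group(0)] + " ", text)  # 数字逐位转词(示意)
    return re.sub(r"\s+", " ", text).strip()     # 压缩多余空白

print(normalize("He paid $12 on Jan. 5th!"))
# -> "HE PAID ONE TWO ON JAN FIVE TH"(真实流程会将 12 读作 TWELVE、5th 读作 FIFTH 等)
```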

For audiobooks and podcasts, transcripts are usually at the episode or chapter/book level. For speech recognition, however, smaller segments less than 20 seconds are needed for training. The next step is to segment the long audio file into smaller segments. For YouTube, closed captions are provided at the sentence level, but unfortunately we find that the timestamps of closed captions are not reliable for segmentation. As a result, we decide to splice the closed captions all together, and perform the same segmentation as audiobooks and podcasts.

对于有声书和播客,文本通常位于单集或章节/书籍级别。但在语音识别任务中,训练需要小于20秒的短片段。因此需要将长音频文件切分为小片段。YouTube虽然提供句子级别的隐藏字幕,但我们发现其时间戳并不可靠。最终我们选择将所有隐藏字幕拼接为整体,采用与有声书和播客相同的分段方式。

3.3. Stage 3: Forced Alignment

3.3. 阶段3:强制对齐

Our aligner is implemented with Kaldi [11], and the alignment procedure follows the work in [12], which adopts a divide and conquer strategy to tackle the alignment problem. First, both audio and transcript are uniformly chunked into smaller pieces. Second, audio segments are decoded with a biased language model (LM), and hypotheses with timestamps are generated. Third, each hypothesis segment is matched to one transcript segment via TF-IDF similarity. For each matched pair, the hypothesis is further aligned with the transcript segment using the Smith-Waterman algorithm [13]. Finally, through this alignment, timestamps are attached to the transcript segments, and eventually to the whole transcripts by stitching the independently aligned segments together. Note that we modify the Smith-Waterman algorithm to handle silence and punctuation, and this is essential to enable the sentence-based segmentation in the next section.

我们的对齐器基于Kaldi [11]实现,对齐流程遵循文献[12]提出的分治策略:首先将音频和文本统一分割为小片段;其次使用带偏置的语言模型(LM)解码音频片段,生成带时间戳的识别假设;然后通过TF-IDF相似度将每个假设片段与文本片段进行匹配;对于每个匹配对,采用Smith-Waterman算法[13]将假设与文本片段进行细粒度对齐;最终通过对齐结果为文本片段标注时间戳,并通过拼接独立对齐片段完成完整文本的时间标注。需要注意的是,我们改进了Smith-Waterman算法以处理静音和标点符号,这对实现下一节基于句子的分割至关重要。
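作为参考,下面给出词级 Smith-Waterman 局部对齐的一个简化 Python 实现(未包含文中针对静音和标点的改动,match/mismatch/gap 打分也只是假设的示例值):

```python
def smith_waterman(hyp, ref, match=2, mismatch=-1, gap=-1):
    """对两个词序列做局部对齐,返回最优局部对齐的得分与终点位置。"""
    n, m = len(hyp), len(ref)
    # score[i][j]: 以 hyp[i-1]、ref[j-1] 结尾的最优局部对齐得分
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if hyp[i-1] == ref[j-1] else mismatch)
            score[i][j] = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return best, best_pos

# 识别假设与参考文本片段(示例),局部对齐得分越高,两者越可能对应同一段音频
hyp = "the quick brown fox jumps".split()
ref = "a quick brown fox jumped over".split()
print(smith_waterman(hyp, ref))   # -> (6, (4, 4)),对应 quick brown fox 这段匹配
```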


Figure 2: A forced alignment graph for the sentence "A B C D E" (4-gram)

图 2: 句子 "A B C D E" 的强制对齐图 (4-gram)

To achieve better alignment performance, we first align and segment the downloaded audio with a close-domain acoustic model. We then train an in-domain acoustic model with the audio segments created (around 3,000 hours). This model is used to align the whole corpus.

为实现更好的对齐性能,我们首先使用近域声学模型对下载的音频进行对齐和分段。随后利用生成的音频片段(约3000小时)训练一个域内声学模型,该模型用于对整个语料库进行对齐。

3.4. Stage 4: Audio Segmentation

3.4. 阶段4:音频分割

We work out the audio segments from the alignment information above. Several rules are applied during the segmentation process:

我们根据上述对齐信息划分音频片段。在分割过程中应用了以下规则:

It is worth pointing out that we keep 4 types of punctuation, namely comma, period, question mark and exclamation mark, so that splits can happen at sentence boundaries (second rule above). We map them to the special words "<COMMA>", "<PERIOD>", "<QUESTIONMARK>" and "<EXCLAMATIONPOINT>" respectively. Besides, this also allows us to build end-to-end speech recognition systems that include punctuation tagging and endpoint detection.

值得注意的是,我们保留了4种标点符号(即逗号、句号、问号和感叹号),以便在句子边界处进行切分(遵循上述第二条规则)。我们将它们分别映射为特殊词"<COMMA>"、"<PERIOD>"、"<QUESTIONMARK>"和"<EXCLAMATIONPOINT>"。此外,这种做法还能让我们构建包含标点标注和端点检测的端到端语音识别系统。
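映射本身可以用一个很小的替换函数来示意(以下写法仅为示例,特殊词的具体拼写以语料库实际发布的词表为准):

```python
PUNCT_MAP = {
    ",": "<COMMA>",
    ".": "<PERIOD>",
    "?": "<QUESTIONMARK>",
    "!": "<EXCLAMATIONPOINT>",
}

def map_punct(text: str) -> str:
    # 把保留的四种标点替换为特殊词,便于在句子边界切分并训练带标点的端到端模型
    for p, tok in PUNCT_MAP.items():
        text = text.replace(p, " " + tok)
    return " ".join(text.split())

print(map_punct("Is it raining? Yes, it is."))
# -> "Is it raining <QUESTIONMARK> Yes <COMMA> it is <PERIOD>"
```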

3.5. Stage 5: Segment Validation

3.5. 阶段5:分段验证

The segmentation stage generates a list of candidate segments, but potentially with high transcription error rate. We therefore apply segment validation to filter out bad segments.

分割阶段会生成一系列候选片段,但可能存在较高的转录错误率。因此我们应用片段验证来过滤掉不良片段。

3.5.1. Forced Alignment Graph

3.5.1. 强制对齐图

To better detect transcription errors made by human transcribers, we propose a variation of the alignment graph in the n-gram framework, as shown in Figure 2. The bold arrow path represents a typical LM-free forced alignment graph. Each state on the forced alignment path has a dotted "leaky" arc (with weight) that allows the token to leak out of the forced alignment path, from higher order n-gram states down to lower order n-gram states, until reaching the null state. A garbage word loop (containing the top 1,000 uni-gram words) is added around the null state to consume additional acoustic frames. Besides, there are extra states and arcs that allow the token to return to the forced alignment path. In general, this alignment graph allows the decoder to perform insertions/deletions/substitutions with respect to the reference. This essentially brings more flexibility to the forced alignment stage, making it possible to capture the discrepancy between the audio and the corresponding transcript.

为了更好地检测人工转录员产生的转录错误,我们在n-gram框架中提出了一种对齐图 (alignment graph) 的变体,如图 2 所示。粗箭头路径代表典型的无语言模型强制对齐图。强制对齐路径上的每个状态都带有一条虚线"泄漏"弧线(含权重),允许token从高阶n-gram状态向低阶n-gram状态泄漏,直至到达空状态。空状态周围添加了一个垃圾词循环(包含前1000个uni-gram词)以消耗额外的声学帧。此外,还设有允许token返回强制对齐路径的额外状态和弧线。总体而言,这种对齐图使解码器能够对参考文本执行插入/删除/替换操作。这本质上为强制对齐阶段带来了更大灵活性,使其能够捕捉音频与对应转录文本之间的差异。

3.5.2. Validation Decoding Pass

3.5.2. 验证解码过程

During the validation decoding pass, we detect transcription errors and filter out segments with high error rates. Figure 3 gives two examples of how errors are detected. In the first example, the transcriber misses the word "YOU" in the transcript, which is caught by our decoder. In the second example, the transcriber makes a typo, which is also successfully detected.

在验证解码过程中,我们会检测转录错误并过滤掉错误率高的片段。图 3 展示了两个错误检测的示例。第一个示例中,转录员漏掉了文本中的单词 "YOU",这被我们的解码器成功捕获。第二个示例中,转录员出现了拼写错误,该错误同样被成功检测到。

For the podcast and YouTube portion of our XL training subset, we cap the maximum WER at 4%, and throw away all segments with higher WER. For the audiobook portion of the XL training subset, as well as all other smaller subsets, we cap the maximum WER at 0%, meaning we don't allow any transcription errors.

在我们的 XL 训练子集中,针对播客和 YouTube 部分,我们将最大词错误率 (WER) 上限设为 4%,并丢弃所有 WER 更高的片段。对于 XL 训练子集中的有声书部分以及其他所有较小子集,我们将最大 WER 上限设为 0%,这意味着不允许出现任何转录错误。
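片段过滤的逻辑可以用如下简化的 Python 片段示意(WER 按标准编辑距离计算;4%/0% 阈值取自正文,函数与参数名为示意用的假设):

```python
def wer(ref: str, hyp: str) -> float:
    """标准编辑距离 WER:(替换 + 删除 + 插入) / 参考词数。"""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def keep_segment(ref: str, hyp: str, source: str) -> bool:
    # XL 子集中的播客/YouTube 片段允许最多 4% 的 WER,其余情况要求完全一致
    cap = 0.04 if source in ("podcast", "youtube") else 0.0
    return wer(ref, hyp) <= cap
```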

3.5.3. Reference Rewriting

3.5.3. 参考重写

Investigation into the validated segments reveals three common types of transcriber errors:

对已验证片段的分析揭示了三种常见的转录错误类型:

• Fillers ignored, such as AH, UH, UM, ER, ERR, YOU KNOW, I MEAN, SORT OF, etc.
• Conjunctions ignored, such as AND, OR, BUT, etc.
• Disfluency removed, such as "It's it's it's a great thing!".

• 忽略填充词,例如 AH、UH、UM、ER、ERR、YOU KNOW、I MEAN、SORT OF 等。
• 忽略连词,例如 AND、OR、BUT 等。
• 去除不流畅表达,例如 "It's it's it's a great thing!"。

Discarding these segments will fundamentally limit the diversity of the corpus. To fix those common errors, we add a filler loop (see Figure 2) to the forced alignment graph, which contains the above common fillers and conjunctions. Besides, we also employ a disfluency detector. The filler loop and the disfluency detector may modify the reference (reference rewriting), and if that happens, those words will be counted as correct. We only apply reference rewriting to our XL subset (10,000h).

丢弃这些片段将从根本上限制语料库的多样性。为解决这些常见错误,我们在强制对齐图中添加了一个填充循环(见图 2),其中包含上述常见填充词和连词。此外,我们还采用了不流畅检测器。填充循环和不流畅检测器可能会修改参考文本(参考重写),若发生这种情况,这些词将被判定为正确。我们仅对 XL 子集(10,000 小时)应用参考重写。

3.6. Stage 6: Evaluation

3.6. 阶段6: 评估

Since our evaluation sets are manually processed by professional human transcribers, we take that as the ground truth, and use it to compute the frame level segmentation precision and recall, see Figure 4. Here recall tells us how many frames can be retrieved by our segmentation and validation pipeline, and precision tells us how many of these retrieved frames are correctly labeled (consistent with the human labels).

由于我们的评估集由专业转录员手动处理,我们将其视为基准事实,并用于计算帧级分割的精确率和召回率,见图 4。这里的召回率表示我们的分割和验证流程能检索到多少帧,而精确率则表示这些检索到的帧中有多少被正确标记(与人工标记一致)。
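帧级精确率/召回率的计算思路可用下面的简化 Python 示意说明(把人工分段与流程输出分段都展开为帧级标记再比较;这里只考虑片段覆盖,未涉及逐帧的转录一致性判断,帧移 0.01 秒为假设值):

```python
def to_frames(segments, total_frames, frame_shift=0.01):
    """segments: [(起始秒, 结束秒), ...];返回每一帧是否落在某个片段内的布尔列表。"""
    frames = [False] * total_frames
    for begin, end in segments:
        for t in range(int(begin / frame_shift), min(int(end / frame_shift), total_frames)):
            frames[t] = True
    return frames

def precision_recall(human_segs, pipeline_segs, total_frames):
    ref = to_frames(human_segs, total_frames)        # 人工标注(基准事实)
    hyp = to_frames(pipeline_segs, total_frames)     # 分割/验证流程的输出
    retrieved = sum(hyp)                              # 流程检索到的帧数
    correct = sum(r and h for r, h in zip(ref, hyp))  # 其中与人工标注一致的帧数
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / sum(ref) if sum(ref) else 0.0
    return precision, recall
```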

Figure 5 illustrates the precision-recall curve on the podcast portion of the DEV evaluation set. For the XL training subset, we select a working point that gives us 10,000 hours of validated audio data, while keeping the maximum WER under 4%. And for all our other training subsets, we select the working point that keeps the maximum WER at 0% (the leftmost working point in Figure 5).

图 5 展示了DEV评估集中播客部分的精确率-召回率曲线。对于XL训练子集,我们选择了一个工作点,该点可提供10,000小时的已验证音频数据,同时将最大词错误率(WER)控制在4%以下。而对于其他所有训练子集,我们选择将最大WER保持在0%的工作点(即图5中最左侧的工作点)。

4. Experiments

4. 实验

This section describes the baseline systems and experimental results for four popular speech recognition toolkits, namely Athena, ESPnet [14], Kaldi [11] and Pika [15].

本节介绍四种主流语音识别工具包的基线系统和实验结果,分别是Athena、ESPnet [14]、Kaldi [11]和Pika [15]。

4.1. Athena Baseline System

4.1. Athena 基线系统

The Athena baseline implements an encoder-decoder based transformer model, which is similar to [16], with parameters $\textit{e}=12$ , $d=6$ , $d_{\mathrm{model}}=1024$ , $d_{\mathrm{ff}}=2048$ and $d_{\mathrm{head}}=8$ .

Athena基线实现了一个基于编码器-解码器的Transformer模型,与[16]类似,参数为$\textit{e}=12$、$d=6$、$d_{\mathrm{model}}=1024$、$d_{\mathrm{ff}}=2048$和$d_{\mathrm{head}}=8$。

During the training, the output sequence of the encoder is also used for connectionist temporal classification (CTC) for joint training to enforce monotonic alignment between speech and label sequences. We use the Adam optimizer and a varied learning rate with a warmup schedule (warmup steps = 8000). The model is trained with a total batch size of 128 for 5 epochs.

在训练过程中,编码器的输出序列还用于连接时序分类(CTC)进行联合训练,以强制语音和标签序列之间保持单调对齐。我们使用Adam优化器和带预热计划的动态学习率(预热步数=8000)。模型以总批次大小128训练5个周期。
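原文未给出具体的学习率公式;若采用 Transformer 中常见的 Noam 预热调度(此处仅为合理假设,Athena 的实际实现可能不同),学习率随步数 $step$ 的变化大致为:

$$\mathrm{lr}(step)=d_{\mathrm{model}}^{-0.5}\cdot\min\!\left(step^{-0.5},\; step\cdot warmup^{-1.5}\right),\qquad warmup=8000$$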

During the decoding, beam search with a beam size of 8 is used, which combines the scores of the decoder, the CTC with weight 0.5 and the LM with weight 0.2. The LM is a RNN based model with two long short-term memory (LSTM) layers, each with 1024 nodes.

在解码过程中,采用束宽为8的束搜索(beam search),该方法综合了解码器得分、权重为0.5的CTC(Connectionist Temporal Classification)得分以及权重为0.2的语言模型(LM)得分。该语言模型是基于循环神经网络(RNN)的双层长短期记忆网络(LSTM)模型,每层包含1024个节点。
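若按常见的联合 CTC/注意力解码方式理解(对数线性加权组合,属于合理推测而非原文给出的公式),束搜索中候选序列 $Y$ 的综合得分可写作:

$$\mathrm{score}(Y)=\log P_{\mathrm{dec}}(Y\mid X)+0.5\,\log P_{\mathrm{ctc}}(Y\mid X)+0.2\,\log P_{\mathrm{lm}}(Y)$$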


Figure 3: Examples of transcription errors detected by forced alignment

图 3: 强制对齐检测到的转录错误示例


Figure 4: Frame level segmentation accuracy

图 4: 帧级分割准确率


Figure 5: Frame level precision-recall curve for the segmentation and validation pipeline

图 5: 分割与验证流程的帧级精确率-召回率曲线

4.2. ESPnet Baseline System

4.2. ESPnet基线系统

The ESPnet baseline uses Conformer (convolution-augmented Transformer) [7], which is a recently proposed architecture combining the local sensitivities of convolutional neural networks with the long-range interactions of Transformers [17]. We use the implementation provided in the ESPnet toolkit [18]. We use a set of 5k BPE tokens generated by the SentencePiece tokenizer [19]. The model has 12 conformer blocks with an output dimension of 512 and a kernel size of 31 in the encoder, and 6 transformer blocks in the decoder. Both encoder and decoder have 8 attention heads with a feed-forward unit dimension of 2048. The Conformer was trained using four 24 GB Titan RTX GPUs. The maximum number of training epochs is 20, and the mini-batch size is 35 million acoustic feature bins. The Adam optimizer with no weight decay was used. The Noam learning rate scheduler was set to 25k warmup steps with a learning rate of 0.0015. SpecAug used 2 frequency masks and 5 time masks. The last 10 best checkpoints were averaged as the final model.

ESPnet基线采用Conformer(卷积增强型Transformer)[7]架构,这是近期提出的结合卷积神经网络局部敏感性与Transformer长程交互能力的模型[17]。我们使用ESPnet工具包[18]提供的实现方案,采用SentencePiece分词器[19]生成的5k个BPE token。该模型编码器包含12个Conformer模块(输出维度512,卷积核大小31),解码器包含6个Transformer模块。编码器和解码器均配置8个注意力头,前馈单元维度为2048。训练使用四块24GB显存的Titan RTX显卡,最大训练轮次20,小批量规模为3500万声学特征单元。采用无权重衰减的Adam优化器,Noam学习率调度器设置25k预热步数,初始学习率0.0015。SpecAugment策略应用2个频域掩码和5个时域掩码,最终模型为最后10个最优检查点的平均结果。
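"最后 10 个最优 checkpoint 取平均"这一步的思路,可以用如下 PyTorch 风格的小函数示意(仅为示意,并非 ESPnet 自带脚本;假设各 checkpoint 保存的是结构相同的 state_dict):

```python
import torch

def average_checkpoints(paths):
    """对若干个 checkpoint 的参数逐元素取平均,返回平均后的 state_dict。"""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    for k in avg:
        avg[k] /= len(paths)
    return avg

# 用法示意:把平均后的参数保存为最终模型(文件名为假设)
# torch.save(average_checkpoints([f"ckpt_{i}.pt" for i in range(10)]), "model_avg.pt")
```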

4.3. Kaldi Baseline System

4.3. Kaldi 基线系统

The Kaldi baseline implements a typical chain model. First, a GMM-HMM model is trained to obtain the alignments, with no data cleaning. Second, volume and speed augmentation techniques are applied. I-vectors are then extracted and appended to the basic acoustic features, as most Kaldi recipes do. Finally, a neural network is trained with both the cross-entropy and LF-MMI criteria. Our neural network stacks 6 convolutional neural network (CNN) layers, 10 TDNN-F layers, 1 attention-relu-renorm-layer, 1 TDNN-F layer, 1 fast-lstmp-layer, 1 TDNN-F layer and finally 1 more fast-lstmp-layer. During the decoding stage, a 4-gram LM is first used for decoding, followed by a recurrent neural network language model (RNNLM) rescoring pass.

Kaldi基线系统实现了一个典型的链式模型。首先训练GMM-HMM模型获取对齐结果(不进行数据清洗),随后应用音量与语速增强技术。如大多数Kaldi配方所示,接着提取i-vector并拼接到基础声学特征上。最终使用交叉熵和LF-MMI准则训练神经网络,该网络包含6层卷积神经网络(CNN)、10层TDNN-F层、1层attention-relu-renorm层、1层TDNN-F层、1层fast-lstmp层、1层TDNN-F层及最后1层fast-lstmp层。解码阶段先使用4-gram语言模型进行解码,再通过循环神经网络语言模型(RNNLM)进行重打分。

Table 4: GigaSpeech baselines for the XL training subset (WER in %)

表 4: GigaSpeech XL训练子集的基线 (WER单位为 %)

工具包 模型 DEV TEST
Athena Transformer-AED+RNNLM 13.60 12.70
ESPnet Conformer/Transformer-AED 10.90 10.80
Kaldi Chain+RNNLM 14.78 14.84
Pika RNN-T 12.30 12.30

Table 5: Kaldi baselines for GigaSpeech training subsets (WER in %)

表 5: GigaSpeech训练子集的Kaldi基线 (WER, %)

Subset DEV TEST
XL 14.78 14.84
L 16.60 16.28
M 17.96 17.53
S 22.59 22.14
XS N/A N/A

4.4. Pika Baseline System

4.4. Pika 基线系统

The Pika baseline adopts a convolution and transformer based architecture [15] for the encoder of our RNN-T system. A five-layer transformer is used for the decoder, and the hidden dimension of each layer is 512. We apply on-the-fly speed and volume perturbation during training, where speed rates are set to 0.9/1.0/1.1/1.2 and the volume range is from -55 dB to -10 dB. For input features, we use 80-dimensional log Fbanks. The targets of our RNN-T system are a set of English wordpieces plus the blank symbol, which leads to an output dimension of 5,000. The total number of parameters in the RNN-T is about 87M. MBR training and two extra two-layer transformer based forward/backward rescorers are also adopted. All training is conducted on 16 V100 GPUs. Our distributed training strategy is based on block-wise model-update filtering (BMUF) with a Nesterov momentum scheme, but with a different learning rate schedule [15], where both the initial and final learning rates are set before training and the number of training epochs is fixed (therefore there is no early stopping, and no development/validation set is used). A single sweep of the GigaSpeech XL training subset takes about 5 hours. For decoding, we set the beam size to 8 and the temperature of the softmax to 1.25.

Pika基线采用了基于卷积和Transformer的架构[15]作为RNN-T系统的编码器。解码器使用五层Transformer结构,每层隐藏维度为512。训练过程中应用了实时语速和音量扰动,语速系数设为$0.9/1.0/1.1/1.2$,音量范围从-55dB到-10dB。输入特征采用80维对数Fbank系数。RNN-T系统的输出目标为英语词片段集合加空白符号,最终输出维度为5000。该RNN-T模型总参数量约8700万。同时采用了最小贝叶斯风险(MBR)训练和两个额外的双层Transformer结构前向/后向重打分器。所有训练均在16块V100 GPU上进行,分布式训练策略基于带Nesterov动量方案的块级模型更新过滤(BMUF)方法[15],采用固定初始/最终学习率与训练轮次的调度策略(因此未使用早停机制和开发/验证集)。单次遍历GigaSpeech $XL$训练子集约需5小时。解码阶段设置束搜索宽度为8,softmax温度为1.25。

4.5. Experimental Results

4.5. 实验结果

Table 4 demonstrates baseline results for Athena, ESPnet, Kaldi and Pika. Results listed here are purely for the purpose of providing baseline systems for each toolkit. They do not reflect the state-of-the-art performance of each toolkit, and cannot be used to compare performance across toolkits.

表 4 展示了 Athena、ESPnet、Kaldi 和 Pika 的基线结果。此处列出的结果仅用于为每个工具包提供基准系统,不代表各工具包的最先进性能,也不能用于跨工具包性能比较。

Table 5 illustrates the Kaldi baseline results for 4 training subsets. Generally speaking, as the training subset gets bigger, the performance goes up. Our smallest XS training subset (10h) is designed for system building and debugging only, and is not expected to give strong performance.

表 5 展示了4个训练子集的Kaldi基线结果。总体而言,随着训练子集规模增大,性能也随之提升。我们最小的XS训练子集(10小时)仅用于系统构建和调试,预期性能有限。
