[Paper Translation] MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages


Original paper: https://arxiv.org/pdf/2204.08582v2


MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages

Abstract

We present the MASSIVE dataset—Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.

1 Introduction and Description

Natural Language Understanding (NLU) is a machine’s ability to understand the meaning and relevant entities from text. For instance, given the utterance what is the temperature in new york, an NLU model might classify the intent as weather_query and fill the slots as weather_descriptor: temperature and place_name: new york. Our particular focus of NLU is one component of Spoken Language Understanding (SLU), in which raw audio is first converted to text before NLU is performed (Young, 2002; Wang et al., 2005; Tur and Mori, 2011). SLU is the foundation of voice-based virtual assistants like Alexa, Siri, and Google Assistant. Though virtual assistants have advanced incredibly in the past decade, they still only support a small fraction of the world’s 7,000+ languages (Simons, 2022). Challenges for multilingualism span the software stack and a variety of operational considerations, but one difficulty in creating massively multilingual NLU models is the lack of labeled data for training and evaluation, particularly data that is realistic for the task and that is natural for each given language. High naturalness typically requires human-based vetting, which is often costly.

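The intent-and-slot output for the example utterance can be illustrated with a short sketch. The bracket-style annotation and the parser below are illustrative assumptions, not the exact released MASSIVE schema; the intent and slot names follow the paper's example.

```python
import re

def extract_slots(annot_utt: str) -> dict:
    """Return {slot_label: value} from a bracket-annotated utterance like
    'what is the [weather_descriptor : temperature] in [place_name : new york]'.
    Each bracketed span holds a slot label, a colon, and the slot value."""
    return {
        m.group(1).strip(): m.group(2).strip()
        for m in re.finditer(r"\[([^:\]]+):([^\]]+)\]", annot_utt)
    }

example = "what is the [weather_descriptor : temperature] in [place_name : new york]"
slots = extract_slots(example)
# slots == {"weather_descriptor": "temperature", "place_name": "new york"}
```

Combined with an intent label such as weather_query, this pair of outputs is the full NLU target for one utterance.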
We present MASSIVE (Multilingual Amazon SLU Resource Package (SLURP) for Slot filling, Intent classification, and Virtual assistant Evaluation), a new 1M-example dataset composed of realistic, human-created virtual assistant utterance text spanning 51 languages, 60 intents, 55 slot types, and 18 domains. With the English seed data included, there are 587k train utterances, 104k dev utterances, 152k test utterances, and 153k utterances currently held out for the MMNLU-22 competition, which will be released after the competition. We have released our data, code, and models.[1]

MASSIVE was created by localizing the SLURP NLU dataset (created only in English) in a parallel manner. SLURP is described further in Section 2, linguistic analyses of the dataset in Section 3, and the localization process in Section 4.3. Results for Massively Multilingual NLU (MMNLU) modeling, in which a single model can perform NLU on any of the incoming languages, are given in Section 5.

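The three metrics reported for the modeling results (exact match accuracy, intent classification accuracy, and slot-filling F1) could be computed along the following lines. This is a simplified sketch, not the authors' released evaluation code: exact match here treats each example as an (intent, slot-set) pair, and slot F1 is micro-averaged over (label, value) pairs, which may differ from span-based scoring.

```python
def intent_accuracy(pred_intents, gold_intents):
    """Fraction of utterances whose intent label is predicted correctly."""
    return sum(p == g for p, g in zip(pred_intents, gold_intents)) / len(gold_intents)

def exact_match_accuracy(preds, golds):
    """An utterance counts only if intent AND the full slot set match.
    Each element is an (intent, frozenset of (label, value) pairs) tuple."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def slot_micro_f1(pred_slots, gold_slots):
    """Micro-averaged F1 over (slot_label, value) pairs across all utterances."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_slots, gold_slots):
        tp += len(pred & gold)   # predicted pairs that are in the gold set
        fp += len(pred - gold)   # predicted pairs not in the gold set
        fn += len(gold - pred)   # gold pairs that were missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Micro-averaging pools true/false positives across all utterances before computing F1, so utterances with many slots weigh proportionally more than slot-free ones.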
2 Related Work

Prior researchers have emphasized the need to explore the unique challenges of low-resource languages (Simpson et al., 2008; Strassel and Tracey, 2016; Cruz and Cheng, 2020; Lakew et al., 2020; Marivate et al., 2020; Magueresse et al., 2020; Goyal et al., 2021), while the growing number and size of language models (mBERT (Devlin, 2018), RoBERTa (Liu et al., 2019b), XLM (Lample and Conneau, 2019), XLM-R (Conneau et al., 2020), mBART (Liu et al., 2020), MARGE (Lewis et al., 2020), and mT5 (Xue et al., 2021)) pre-trained on massively multilingual corpora have allowed for significant improvements in supporting them. However, the creation of evaluation datasets for specific tasks has not kept pace. Some tasks, such as Named Entity Recognition (NER) or translation, lend themselves to mining existing corpora (Tiedemann, 2012; Pan et al., 2017; Hu et al., 2020), while others, such as NLU, the focus here, require the creation of new data and schema-specific annotations. Beyond the cost, even identifying a sufficient number of speakers for data generation and quality control can be difficult. Most studies have thus focused on collecting data for one such low-resource language and determining the utility of multilingual models or cross-lingual learning from more readily available languages. Moreover, such datasets are often isolated collections, creating an environment of multiple datasets not easily comparable across the different languages or tasks. There have been exceptions, such as SQuAD (Rajpurkar et al., 2016) and XQuAD (Artetxe et al., 2019); ATIS (Price, 1990), its Hindi and Turkish extension (Upadhyay et al., 2018), and MultiATIS++ (Xu et al., 2020); and Snips (Coucke et al., 2018) with its addition of French (Saade et al., 2019), where researchers have extended popular English benchmark datasets to new languages. This work focuses on the general multi-domain NLU task and builds off the SLURP (Bastianelli et al., 2020) benchmark dataset to extend to an unprecedented 50 new languages.

For the task of NLU, the ATIS dataset has been popular in the NLP community since its first release. MultiATIS++ was one of the first efforts to extend an NLU dataset across a significant number of languages (nine), yet it remained in the limited domain of airline bookings. While it has proven an asset, researchers have questioned what is left to learn from such a dataset (Tur et al., 2010). Facebook released a general Intelligent Virtual Assistant (IVA) dataset across the domains of Alarm, Reminder, and Weather (Schuster et al., 2019) created for the purpose of demonstrating cross-lingual transfer learning; it therefore did not need to be parallel or have an equal number of datapoints, resulting in far fewer examples in Thai (5k) compared to Spanish (7.6k) and English (43k). The Snips datasets (both the original English-only and the English-and-French releases) are most similar to the NLU contained in the MASSIVE dataset, spanning smart home and music domains for a generic voice-based virtual assistant.

The first iteration for the foundation of the MASSIVE dataset was the NLU Evaluation Benchmarking Dataset, with 25k utterances across 18 domains (Liu et al., 2019a). The authors updated the dataset and added audio and ASR transcriptions in the release of the Spoken Language Understanding Resource Package (SLURP) (Bastianelli et al., 2020), allowing for full end-to-end Spoken Language Understanding (SLU) evaluation similar to the Fluent Speech Commands dataset (Lugosch et al., 2019) and Chinese Audio-Textual Spoken Language Understanding (CATSLU) (Zhu et al., 2019). An overview of selected existing NLU datasets can be seen in Table 1.

We release the MASSIVE dataset along with baselines from large pre-trained models fine-tuned on the NLU slot and intent prediction tasks. Early cross-lingual and multilingual NLU modeling approaches used projection or alignment methods (Yarowsky et al., 2001), focusing on string matching, edit distance, or consonant signatures (Ehrmann et al., 2011), lookup lexicons for low-resource languages (Mayhew et al., 2017), and aligning (Xie et al., 2018) or jointly training word embeddings (Singla et al., 2018). More recently, researchers have borrowed encoders from pre-trained neural translation models before building subsequent classifiers and NER models (Eriguchi et al., 2018; Schuster et al., 2019), also focusing on language-agnostic and language-specific features to learn what information to share between languages (Chen et al., 2019b). Generative parsing has been demonstrated using sequence-to-sequence models and pointer networks (Rongali et al., 2020). With the rise of BERT and large pre-trained language models, we have also seen impressive demonstrations of zero-shot performance, where subword tokenization (WordPiece) overlap helps but is not even necessary to realize improvements (Pires et al., 2019; K et al., 2020), as well as production multilingual NLU improvements with distillation and full fine-tuning (FitzGerald et al., 2022). The translation task has then been incorporated in the pretraining (Wang et al., 2021) of these models or even as part of the final NLU hypothesis for streamlined multilingual production systems (FitzGerald,

Table 1: Selected NLU benchmark datasets with number of languages, utterances per language, domain count, intent count, and slot count.

Name | Languages | Utterances per language | Domains | Intents | Slots
MASSIVE | 51 | 19,521 | 18 | 60 | 55
SLURP (Bastianelli et al., 2020) | 1 | 16,521 | 18 | 60 | 55
NLU Evaluation Data (Liu et al., 2019a) | 1 | 25,716 | 18 | 54 | 56
Air Travel Information System (ATIS) (Price, 1990) | 1 | 5,871 | 1 | 26 | 129
ATIS with Hindi and Turkish (Upadhyay et al., 2018) | 3 | 1,315-5,871 | 1 | 26 | 129
MultiATIS++ (Xu et al., 2020) | 9 | 1,422-5,897 | 1 | 21-26 | 99-140
Snips (Coucke et al., 2018) | 1 | 14,484 | - | 7 | 53
Snips with French (Saade et al., 2019) | 2 | 4,818 | 2 | 14-15 | 11-12
Task Oriented Parsing (TOP) (Gupta et al., 2018) | 1 | 44,873 | 2 | 25 | 36
Multilingual Task-Oriented Semantic Parsing (MTOP) (Li et al., 2021) | 6 | 15,195-22,288 | 11 | 104-113 | 72-75
Cross-lingual Multilingual Task Oriented Dialog (Schuster et al., 2019) | 3 | 5,083-43,323 | 3 | 12 | 11
Microsoft Dialogue Challenge (Li et al., 2018b) | 1 | 38,276 | 3 | 11 | 29
Fluent Speech Commands (FSC) (Lugosch et al., 2019) | 1 | 30,043 | - | 31 | -
Chinese Audio-Textual SLU (CATSLU) (Zhu et al., 2019) | 1 | 16,258 | 4 | - | 94

2020). Researchers have propped up training data by translating and projecting labels into the target language (Xu et al., 2020) and discovered more sophisticated approaches to alignment, such as translate-and-fill, using mT5 to train the filler (Nicosia et al., 2021). Recent work has even delved into the application of these techniques to lower-resource languages such as Persian. For example, ParsiNLU explores a variety of NLU tasks for Persian, fine-tuning mT5 models of various sizes (Khashabi et al., 2021). Similarly, these techniques have also been used, even a bit earlier, for text summarization (Farahani et al., 2021).

3 Language Selection and Linguistic Analysis

3.1 Language Selection

The languages in MASSIVE were chosen according to the following considerations. First, we acquired cost and worker availability estimates for over 100 languages, providing a constraint to our choices given our fixed budget. Second, we determined existing languages available in major virtual assistants, such that the dataset could be used to benchmark today’s systems. Third, we categorized the full pool of languages according to their genera as taken from the World Atlas of Language Structures (WALS) database (Dryer and Haspelmath, 2013), where a genus is a language group that is clear to most linguists without systematic comparative analysis. Genus is a better indicator of typological diversity, which we sought to maximize, than language family (Dryer, 1989). Fourth, we used the eigenvector centrality of Wikipedia articles, tweets, and book translations (Ronen et al., 2014) as proxies for the internet influence and thus the resource availability of a given language, particularly for self-supervised pretraining applications, and we chose languages spanning the breadth of resource availability. Fifth, we examined the script of each language, seeking to increase script diversity to drive experimentation in tokenization and normalization.

Ultimately, we created 50 new, distinct text corpora, representing 49 different spoken languages. Mandarin Chinese was collected twice, once with native speakers who use the traditional set of characters, and once with native speakers who use the modern simplified set of characters. There are 14 language families in the dataset. The term “language family” usually refers to a group of languages which are known to be genetically related, that is, they all descend from a common ancestor language. In MASSIVE, we also include “language isolates” as families. These are languages that have no clear relationship to any known language. Our choices are given in Table 2.

3.2 Scripts

There are 21 distinct scripts used in the dataset. The majority of languages in MASSIVE (28 including English) use some variety of the Latin alphabet, which is also the most widely used script in the world. The Arabic script is used for three languages, the Cyrillic script for two languages, and the remaining 18 languages have “unique” scripts, in the sense that only one language in the dataset uses that script. Fourteen scripts are unique to a single language, although they may belong to a larger family of writing systems. For example, the Dravidian languages in MASSIVE have their own scripts, but are all members of the general Brahmi class of scripts. The other two scripts are unique in that only one language in the dataset uses them, but they are more widely used in the real world: Ge’ez and Chinese. Ge’ez is represented by Amharic in the dataset, but is used for several languages in East Africa, such as Tigrinya. The Chinese script is represented by Mandarin, but is used by other languages in China such as Cantonese.

Table 2: The 51 languages of MASSIVE, including scripts and genera.

Code | Name | Script | Genus | Code | Name | Script | Genus | Code | Name | Script | Genus
af-ZA | Afrikaans | Latn | Germanic | hy-AM | Armenian | Armn | Armenian | pl-PL | Polish | Latn | Slavic
am-ET | Amharic | Ethi | Semitic | id-ID | Indonesian | Latn | Malayo-Sumbawan | pt-PT | Portuguese | Latn | Romance
ar-SA | Arabic | Arab | Semitic | is-IS | Icelandic | Latn | Germanic | ro-RO | Romanian | Latn | Romance
az-AZ | Azerbaijani | Latn | Turkic | it-IT | Italian | Latn | Romance | ru-RU | Russian | Cyrl | Slavic
bn-BD | Bengali | Beng | Indic | ja-JP | Japanese | Jpan | Japanese | sl-SI | Slovenian | Latn | Slavic
cy-GB | Welsh | Latn | Celtic | jv-ID | Javanese | Latn | Javanese | sq-AL | Albanian | Latn | Albanian
da-DK | Danish | Latn | Germanic | ka-GE | Georgian | Geor | Kartvelian | sv-SE | Swedish | Latn | Germanic
de-DE | German | Latn | Germanic | km-KH | Khmer | Khmr | Khmer | sw-KE | Swahili | Latn | Bantoid
el-GR | Greek | Grek | Greek | kn-IN | Kannada | Knda | Southern Dravidian | ta-IN | Tamil | Taml | Southern Dravidian
en-US | English | Latn | Germanic | ko-KR | Korean | Kore | Korean | te-IN | Telugu | Telu | South-Central Dravidian
es-ES | Spanish | Latn | Romance | lv-LV | Latvian | Latn | Baltic | th-TH | Thai | Thai | Kam-Tai
fa-IR | Persian | Arab | Iranian | ml-IN | Malayalam | Mlym | Southern Dravidian | tl-PH | Tagalog | Latn | Greater Central Philippine
fi-FI | Finnish | Latn | Finnic | mn-MN | Mongolian | Cyrl | Mongolic | tr-TR | Turkish | Latn | Turkic
fr-FR | French | Latn | Romance | ms-MY | Malay | Latn | Malayo-Sumbawan | ur-PK | Urdu | Arab | Indic
he-IL | Hebrew | Hebr | Semitic | my-MM | Burmese | Mymr | Burmese-Lolo | vi-VN | Vietnamese | Latn | Viet-Muong
hi-IN | Hindi | Deva | Indic | nb-NO | Norwegian | Latn | Germanic | zh-CN | Mandarin | Hans | Chinese
hu-HU | Hungarian | Latn | Ugric | nl-NL | Dutch | Latn | Germanic | zh-TW | Mandarin | Hant | Chinese

3.3 Sentence Types

MASSIVE consists of utterances directed at a device, rather than a person, which has some consequences for the type of linguistic patterns it contains. Specifically, the corpus primarily consists of interrogatives (i.e., questions) and imperatives (commands or requests). There are relatively few declarative utterances in the set. This is in contrast to many large datasets from other sources (e.g., Wikipedia, movie scripts, newspapers), which contain a high proportion of declaratives, since the language is collected from situations where humans are communicating with humans.

In the context of a voice assistant, a user typically asks a device to perform an action or answer a question, so declaratives are less common. For instance, a person might use an imperative “tell me if it calls for rain today” or ask a question “will it rain today,” but they would not tell their device “it’s raining today.” When declaratives are used with voice assistants, they generally have the pragmatic effect of a directive. For instance, a virtual assistant can respond to the declarative “it’s cold in here” by turning up the temperature (Thattai et al., 2020). Although syntactically it looks like a declarative, such an utterance has the force of an imperative.

The standard unit of analysis in linguistics is the declarative sentence, and there is relatively less known about imperatives and questions. MASSIVE presents an opportunity to study these sentence forms, and the parallel nature of the corpus makes cross-linguistic comparisons even easier.

3.4 Word Order

Languages have intricate rules for ordering words depending on the word-type and sentence-type. In English, the word order for statements (“you are leaving”) is different from questions (“are you leaving?”). This is not mandatory, and sometimes the pitch of the voice is enough to indicate a question (e.g. “you’re leaving?” with a rising intonation).

When considering word order at a typological level, it is common to simplify the situation and consider only affirmative declarative sentences and only three grammatical elements: the verb (V), its subject (S), and its object (O). This makes for six possible word orders: SVO, SOV, VOS, VSO, OVS, and OSV. All six orders have been documented, although the overwhelming majority of languages use Subject-initial ordering, while Object-initial ordering is extremely rare.

In MASSIVE, 39 languages are subject-initial (24 SVO and 15 SOV), while only three are verb-initial (VSO specifically). No object-initial languages are represented. Five languages are marked in WALS as having no preferred word order, and four do not have any word order data at all.

3.5 Imperative Marking

The languages in MASSIVE have a variety of ways of indicating the imperative mood of an utterance. The majority of them (33) use some kind of verb morphology, such as adding a suffix. About half of those languages (18) have distinct imperative marking for singular or plural addressees. The utterances in MASSIVE are technically directed at a single addressee, the voice assistant, but since some languages use the plural as an indicator of politeness (see below) all varieties of imperatives will likely occur in this dataset. There are ten languages without any special morphology, and they indicate imperative through other means, such as word order or vocabulary choice.

Ten languages in the dataset have a specialized distinction between imperatives, for commands directed at another individual, and “hortatives”, where the command also includes the speaker. English verbs are not directly marked for hortative, but the auxiliary verb “let” can convey the mood instead. For example, “write this down” is an imperative and only the addressee need write anything, while “let’s write this down” is a hortative and the speaker is also expected to write. The pervasiveness of hortatives in the context of a voice assistant is an open question.

Four languages have “optative” moods, which are subtly different from imperatives. In the optative, a speaker expresses a wish or desire, as opposed to giving a direct command. However, in the right context, an optative may carry the same pragmatic weight as an imperative, and strongly imply that someone ought to do something. English has no specific optative form, but a similar mood can be conveyed using conditionals. For example, “buy this bag for me” is an imperative while “if only someone would buy me this bag” is closer to an optative. Optative forms are not well studied in linguistics, as they require specific contexts which can be difficult to create during field work, but they may be more common in device-directed utterances.

Lastly, some languages distinguish between imperatives, when telling someone to do something, and “prohibitives,” when telling someone not to do something. In the MASSIVE set, there are 18 languages with specialized negative particles which can only co-occur with imperative verbs. Vietnamese, for instance, uses the words “chẳng” or “không” to negate declarative sentences, but uses “chớ” or “đừng” to negate imperatives. Another ten languages have special verbs for the prohibitive, although these may overlap with other grammatical features of the language. In Spanish, for example, the prohibitive form of a verb is the same as the subjunctive form.

3.6 Politeness

Many languages encode different levels of politeness through their use of pronouns. Many European languages distinguish between “familiar” and “formal” pronouns, with the “formal” pronouns often morphologically identical to a plural. In French, the second-person singular “tu” is used between friends, while the second-person plural “vous” is used when speaking to a group, or to an individual of higher social rank (such as an employee to a manager). These politeness systems are heavily influenced by social context, and the MASSIVE dataset gives us a chance to see how people adapt their language when speaking to a virtual assistant instead of another human.

Nearly half of the languages in MASSIVE (21) make a two-way formal/informal distinction in their second-person pronouns. This is probably due to the fact that most MASSIVE languages are European, and the binary politeness distinctions are the most common strategy in that family. A further eight languages have more than two levels of formality, such as informal, formal, and honorific. Seven languages have an “avoidance” strategy, which means that pronouns are omitted entirely in a polite scenario. Finally, eleven languages have no data on politeness in WALS at all.

4 Collection Setup and Execution

4.1 Heldout Evaluation Split

We randomly sampled a subset of the English seed data, which was then paraphrased by professional annotators, resulting in new, more challenging utterances, including 49% more slots per utterance. These utterances were localized along with the other splits to be used as a held-out evaluation set for the Massively Multilingual NLU-22 competition and workshop.[2]

4.2 Vendor Selection and Onboarding

The MASSIVE dataset was collected using a customized workflow powered by Amazon MTurk. We required a vendor pool with the capability and resources to collect a large multilingual dataset. Our original vendor pool consisted of five vendors adjudicated based on previous engagements. This vendor pool was reduced to three based on engagement and resource availability. Vendors for each language were selected based on their resource availability and proposed cost. A majority of languages were supported by a single vendor, while some languages required cross-vendor support to be completed with the required quality and within the required timeline.

We offered two mechanisms to vendors for evaluating workers to be selected for each language. The first, which was used to select workers for the translation task, was an Amazon MTurk-hosted fluency test where workers listened to questions and statements in the relevant language and were evaluated using a multiple-choice questionnaire. The second, which was used to select workers for the judgment task, was a test with a set of three judgments that the vendor could use to assess whether workers were able to detect issues in the translated utterances. In order to further improve worker selection quality, we created a translator quiz using the Amazon MTurk instructions that were created for translation and judgment tasks, coupled with customized local-language examples. The workers were required to prove that they understood the instructions for the project based on a series of questions.

Before commencing operations, an initial pilot run of this customized workflow was completed in three languages. A few workers per vendor were chosen to engage in this exercise. The pilot run helped improve clarity of instructions, determine reporting methods, and share open questions.

4.3 Collection Workflows

4.3 采集工作流

The collection was conducted by locale on an individual utterance level. Each utterance from the “train,” “dev,” “test,” and “heldout” splits of the SLURP dataset went through two sequential task workflows and a judgment workflow. The first task is slot translation or localization (see Figure 1). Workers are presented the entire utterance with colored highlighting of the slot values for the utterance (if any) and then presented with each slot value and its corresponding label individually. The worker is asked to either localize or translate the slot, depending on whether the value should be translated (e.g., “tomorrow”) or localized (e.g., the movie “La La Land,” which in French is “Pour l’amour d’Hollywood”). Other entities, such as regionally known songs or artists, could also be localized to a more relevant, known song or artist for that language or region. There is also an option to keep the slot as is, such as for names (e.g., “Taylor Swift”) or proper nouns where the original English spelling should be retained. The metadata of the released dataset includes whether the worker elected to “localize,” “translate,” or keep the slot “unchanged,” primarily for the purposes of researchers evaluating machine translation systems, where it would be unreasonable to expect the system to “localize” to a specific song name the worker selected.

数据收集以地区为单位在单个话语层面进行。SLURP数据集的"train"、"dev"、"test"和"heldout"分集中的每个话语都经过两个顺序任务流程和一个判断流程。第一个任务是槽位(slot)翻译或本地化(见图1)。工作人员会看到带有槽位值彩色高亮的完整话语(如有),然后单独查看每个槽位值及其对应标签。根据槽位值应被翻译(如"tomorrow")还是本地化(如电影《La La Land》在法语中译为《Pour l'amour d'Hollywood》),工作人员需进行相应操作。其他如地区知名歌曲或艺人等实体也可本地化为该语言或地区更相关的已知歌曲或艺人。此外还提供保留原槽位的选项,适用于人名(如"Taylor Swift")或应保留英文原拼写的专有名词。发布数据集的元数据还记录了工作者选择"本地化"、"翻译"或保持槽位"不变"的情况,主要供评估机器翻译系统的研究人员使用,因为期望系统"本地化"出工作者所选的特定歌曲名称并不合理。

After the slot task, the second worker is asked to translate or localize the entire phrase using the slot task output provided by the first worker (see Figure 2). The phrase worker can decide to keep the slot as it was translated, modify it, or remove it entirely if it is not relevant for the language in that scenario. This worker is also responsible for aligning grammatical genders or prepositional affixes to any of the slots.

槽位任务完成后,会要求第二位工作者利用第一位工作者提供的槽位任务输出对整个短语进行翻译或本地化处理 (见图 2)。短语工作者可以选择保留原有槽位翻译、修改它,或在该语言场景不相关时完全移除。该工作者还负责将语法性别或介词词缀与任意槽位进行匹配。

Note that this two-step system alleviates the annotation burden often encountered with such work. Traditionally in such collections, workers would be given a light annotation guide and asked to highlight spans of the slots in a translated or localized utterance. In this system, the first step of slot translation and subsequent insertion obviates the need for workers to understand nuanced span notation, which can be complex for highly inflected languages (prepositions outside the span in English would not be carried over in the localization, but would be in the traditional span annotation workflow).

需要注意的是,这种两步式系统减轻了此类工作中常见的标注负担。传统做法中,工作人员会收到简短的标注指南,并被要求在翻译或本地化的话语中标注槽位片段。而在本系统中,第一步的槽位翻译和后续插入操作免除了工作人员理解复杂片段标注的需求(对于高度屈折的语言而言尤为困难),例如英语中介词不会出现在本地化后的片段中,但在传统片段标注流程中却需要保留。
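The two-step output — translated slot values inserted back into a translated carrier phrase — can be sketched as template substitution over SLURP-style `[label : value]` annotations. This is a hypothetical illustration of the data format, not the actual collection tooling.

```python
import re

def insert_slots(template: str, slot_translations: dict) -> str:
    """Replace each [label : value] span in an annotated utterance with
    the slot translation produced in the first workflow step."""
    def substitute(match):
        label = match.group(1).strip()
        translated = slot_translations.get(label)
        # Keep the original value when no translation was provided
        # (e.g., the worker chose to leave the slot "unchanged").
        value = translated if translated is not None else match.group(2).strip()
        return f"[{label} : {value}]"
    return re.sub(r"\[([^:\]]+):([^\]]+)\]", substitute, template)

# Example: a French phrase worker receives the slot translation "demain"
annotated = "réveille-moi [date : tomorrow] à neuf heures"
print(insert_slots(annotated, {"date": "demain"}))
# -> réveille-moi [date : demain] à neuf heures
```

The phrase worker can still modify or delete the inserted slot afterwards, which is why the released metadata records the final per-slot decision.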

4.4 Quality Assurance

4.4 质量保证

The output of the second workflow (the fully localized utterance) is judged by three workers for (1) whether the utterance matches the intent semantically, (2) whether the slots match their labels semantically, (3) grammaticality and naturalness, (4) spelling, and (5) language identification—English or mixed utterances are acceptable if that is natural for the language, but localizations without any tokens in the target language were not accepted. See Figure 3 for how this is presented to the Amazon MTurk worker. These judgments are also included in the metadata of the dataset. In addition to the workers judging each other’s work, the collection system had alarms in place for workers with high rejection rates, high rates of slot deletion, and high rates of English tokens in the translations. Workers were also monitored to see if their tasks were primarily machine translated. Such workers were removed from the pool and all of their work was resubmitted to be completed by the other workers.

第二个工作流程的输出(完全本地化的语句)由三名工作人员进行评判,标准包括:(1) 语句在语义上是否符合意图,(2) 槽位是否与其标签语义匹配,(3) 语法正确性和自然度,(4) 拼写,以及(5) 语言识别——如果对目标语言而言是自然的,英语或混合语句也可接受,但完全不包含目标语言token的本地化结果不予接受。具体展示给Amazon MTurk工作人员的方式见图3。这些评判结果也会包含在数据集的元数据中。除了工作人员互相评判外,收集系统还对高拒绝率、高槽位删除率以及翻译中英语token高出现率的工作人员设置了警报。同时监测工作人员的任务是否主要依赖机器翻译。此类工作人员会被移出工作池,其所有任务会重新提交给其他工作人员完成。
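The alarm-based monitoring can be sketched as a simple threshold check over per-worker statistics. The field names and threshold values below are illustrative assumptions, not those used in the actual collection system.

```python
def flag_workers(stats, max_rejection=0.3, max_slot_deletion=0.2, max_english=0.5):
    """Flag workers whose aggregate statistics trip any quality alarm.

    `stats` maps a worker id to a dict with the fraction of rejected
    tasks, deleted slots, and English tokens in translations. The
    thresholds here are hypothetical defaults for illustration.
    """
    flagged = []
    for worker, s in stats.items():
        if (s["rejection_rate"] > max_rejection
                or s["slot_deletion_rate"] > max_slot_deletion
                or s["english_token_rate"] > max_english):
            flagged.append(worker)
    return flagged

stats = {
    "w1": {"rejection_rate": 0.05, "slot_deletion_rate": 0.01, "english_token_rate": 0.10},
    "w2": {"rejection_rate": 0.40, "slot_deletion_rate": 0.02, "english_token_rate": 0.05},
}
print(flag_workers(stats))  # -> ['w2']
```

Flagged workers would then be reviewed, and, as described above, removed from the pool with their work resubmitted if the problems were confirmed.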

Additionally, the authors performed several deep dives into languages with which they were familiar.

此外,作者还针对他们熟悉的几种语言进行了深入分析。

5 Model Benchmarking

5 模型基准测试

5.1 Setup

5.1 设置

As initial model benchmarks, we fine-tuned publicly-available pre-trained language models on the MASSIVE dataset and evaluated them on intent classification and slot filling. Our models of choice for this exercise were XLM-RoBERTa (XLM-R; Conneau et al. 2020) and mT5 (Xue et al., 2021).

作为初始模型基准测试,我们在MASSIVE数据集上对公开可用的预训练语言模型进行了微调,并在意图分类和槽填充任务上进行了评估。本实验选择的模型是XLM-Roberta (XLM-R; Conneau et al. 2020)和mT5 (Xue et al., 2021)。

In the case of XLM-R, we utilized the pretrained encoder with two separate classification heads trained from scratch, based on JointBERT (Chen et al., 2019a). The first classification head used the pooled output from the encoder to predict the intent, and the second used the sequence output to predict the slots. For pooling in the intent classification head, we experimented with using hidden states from the first position, averaged hidden states across the sequence, and the maximally large hidden state from the sequence.

对于XLM-R模型,我们采用了基于JointBERT (Chen et al., 2019a)的预训练编码器,并为其配备了两个独立训练的初始分类头。第一个分类头利用编码器的池化输出预测意图,第二个分类头则使用序列输出来预测槽位。在意图分类头的池化操作中,我们尝试了三种方案:使用首个位置的隐藏状态、序列隐藏状态的平均值,以及序列中的最大隐藏状态。
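These three pooling strategies can be illustrated on toy hidden states; in practice they operate on the encoder's output tensor, but the arithmetic is the same. Interpreting "maximally large" as an element-wise maximum is an assumption on our part.

```python
def pool_first(hidden_states):
    # Hidden state at the first sequence position (CLS-style pooling)
    return hidden_states[0]

def pool_mean(hidden_states):
    # Element-wise average of hidden states across the sequence
    dim = len(hidden_states[0])
    n = len(hidden_states)
    return [sum(h[i] for h in hidden_states) / n for i in range(dim)]

def pool_max(hidden_states):
    # Element-wise maximum across the sequence (max-pooling)
    dim = len(hidden_states[0])
    return [max(h[i] for h in hidden_states) for i in range(dim)]

# Toy sequence: 3 positions, hidden size 2
hs = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
first = pool_first(hs)  # [1.0, -2.0]
mean = pool_mean(hs)    # [2.0, 0.666...]
largest = pool_max(hs)  # [3.0, 4.0]
```

Whichever pooled vector is chosen is then fed to the intent head, while the unpooled per-token states go to the slot head.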

With mT5, we explored two separate architectures. In one architecture, we only used the pre-trained encoder extracted from mT5, and we trained two classification heads from scratch similarly to the XLM-R setup. We refer to this setup as mT5 Encoder-Only. In the other architecture, we used the full sequence-to-sequence mT5 model in text-to-text mode, where the input is “Annotate:” followed by the unlabeled utterance. The decoder output is a sequence of labels (including the Other label) for all of the tokens followed by the intent. We did not add the slots and intents to the vocabulary, but we instead allowed them to be tokenized into subwords. We refer to this model as mT5 Text-to-Text. For all models, we used the Base size, which corresponds to 270M parameters for XLM-R, 258M parameters for mT5 Encoder-Only, and 580M parameters for mT5 Text-to-Text, including 192M parameters for embeddings for all three.

使用mT5时,我们探索了两种不同的架构。在第一种架构中,我们仅使用从mT5提取的预训练编码器,并像XLM-R设置那样从头训练两个分类头。我们将此设置称为mT5 Encoder-Only。在另一种架构中,我们以文本到文本模式使用了完整的序列到序列mT5模型,其中输入是"Annotate:"后接未标注的语句。解码器输出是所有token的标签序列(包括Other标签)后接意图。我们没有将槽位和意图添加到词汇表中,而是允许它们被分词为子词。我们将此模型称为mT5 Text-to-Text。对于所有模型,我们使用了Base规模,对应XLM-R的2.7亿参数、mT5 Encoder-Only的2.58亿参数以及mT5 Text-to-Text的5.8亿参数,其中三者共享1.92亿嵌入参数。
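The text-to-text formulation can be sketched as a serialization step: the model input is the "Annotate:" prefix plus the raw utterance, and the target is one slot label per token followed by the intent. This is a minimal sketch assuming whitespace-tokenized input; the example labels are illustrative.

```python
def to_text_to_text(tokens, slot_labels, intent):
    """Serialize a labeled utterance into the (input, target) string pair
    used for sequence-to-sequence training: one label per input token
    (including the Other label), then the intent appended at the end."""
    source = "Annotate: " + " ".join(tokens)
    target = " ".join(slot_labels + [intent])
    return source, target

src, tgt = to_text_to_text(
    ["wake", "me", "up", "at", "nine", "am"],
    ["Other", "Other", "Other", "Other", "time", "time"],
    "alarm_set",
)
print(src)  # Annotate: wake me up at nine am
print(tgt)  # Other Other Other Other time time alarm_set
```

Because the slot and intent names are not added to the vocabulary, the decoder emits them as sequences of ordinary subwords.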

For each model, we performed 128 trials of hyperparameter tuning, using the Tree of Parzen Estimators (TPE) algorithm from the hyperopt library (Bergstra et al., 2013) for search and the Asynchronous Successive Halving Algorithm (ASHA; Li et al., 2018a) for scheduling, both via the ray[tune] library (Liaw et al., 2018), which is integrated into the Trainer from the transformers library (Wolf et al., 2020) that we used for modeling and for our pretrained models. Our hyperparameter search spaces, sampling types, and final choices are given in Table 5. We trained our models with the Adam optimizer (Kingma and Ba, 2017) and chose the best-performing model checkpoint based on overall exact match accuracy across all locales. Hyperparameter tuning and fine-tuning were performed using single p3dn.24xlarge instances (8x Nvidia V100) for XLM-R and mT5 Text-to-Text and a single g4dn.metal instance (8x Nvidia T4) for mT5 Encoder-Only. Hyperparameter tuning took less than 4 days per model and training took less than 1 day per model.

对于每个模型,我们进行了128次超参数调优试验:搜索算法采用hyperopt库(Bergstra et al., 2013)中的Tree of Parzen Estimators (TPE),调度算法采用异步连续减半算法(ASHA; Li et al., 2018a),两者均通过ray[tune]库(Liaw et al., 2018)调用,而ray[tune]又集成在transformers库(Wolf et al., 2020)的Trainer中,我们使用该库进行建模并获取预训练模型。我们的超参数搜索空间、采样类型和最终选择如表5所示。我们使用Adam优化器(Kingma and Ba, 2017)训练模型,并根据所有语言环境的整体精确匹配准确率选择性能最佳的模型检查点。超参数调优和微调使用单个p3dn.24xlarge实例(8x Nvidia V100)进行XLM-R和mT5 Text-to-Text训练,使用单个g4dn.metal实例(8x Nvidia T4)进行mT5 Encoder-Only训练。每个模型的超参数调优时间少于4天,训练时间少于1天。

Our dataset includes several languages where white spacing is not used as a word delimiter. In some cases, spaces do occur, but they might serve as phrase delimiters or denote the end of a sentence. Three of these written languages, Japanese, Chinese (Traditional), and Chinese (Simplified), do not use spaces anywhere except to identify the end of a sentence. For these languages, we separate each character in the unlabeled input with a whitespace. We leave exploration of more sophisticated techniques (such as MeCab for Japanese; Kudo 2005) to future work. We use the default spacing provided by annotators for all other languages.

我们的数据集包含几种不以空格作为单词分隔符的语言。在某些情况下,虽然会出现空格,但它们可能用作短语分隔符或表示句子结尾。其中三种书面语言——日语、中文(繁体)和中文(简体)——除标识句子结尾外,任何地方都不使用空格。对于这些语言,我们用空格分隔未标注输入中的每个字符。我们将更复杂的技术(如日语中的MeCab [Kudo 2005])的探索留给未来工作。对于其他所有语言,我们使用标注者提供的默认空格。
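The character-spacing preprocessing for Japanese and Chinese can be sketched in a few lines. Dropping any pre-existing whitespace (such as sentence-final spaces) is a simplifying assumption here.

```python
def space_characters(utterance: str) -> str:
    """Insert a space between every character, as done for Japanese and
    Chinese (Simplified and Traditional) inputs before tokenization.
    Pre-existing whitespace is collapsed into single separators."""
    return " ".join(ch for ch in utterance if not ch.isspace())

print(space_characters("明天天气怎么样"))
# -> 明 天 天 气 怎 么 样
```

As discussed in Section 5.2, this keeps slot boundaries aligned to individual characters, at the cost of diverging from the non-spaced text seen during pretraining.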

Zero-shot performance was also assessed, in which the models were trained on English data, validation was performed on all languages, and testing was performed on all non-English locales.

零样本性能也进行了评估,其中模型在英语数据上训练,在所有语言上进行验证,并在所有非英语地区进行测试。

5.2 Results and Analysis

5.2 结果与分析

Table 3 shows the results for each model and training setup, including those for the best performing locale, the worst performing locale, and locale-averaged results for intent accuracy, micro-averaged slot F1 score, and exact match accuracy. Zero-shot exact match performance is 25-37 points worse than that of full-dataset training runs. Additionally, the variance in task performance across locales is significantly greater for the zero-shot setup than for full-dataset training. For example, there is a 15 point difference in exact match accuracy between the highest and lowest locales for mT5 Text-to-Text when using the full training set, while the gap expands to 44 points with zero-shot.

表 3 展示了每个模型及训练配置的结果,包括表现最佳地区、表现最差地区以及地区平均的意图准确率、微观平均槽位 F1 分数和完全匹配准确率。零样本完全匹配性能比全数据集训练低 25-37 分。此外,零样本配置下各地区任务表现的方差明显大于全数据集训练。例如,使用完整训练集时,mT5 Text-to-Text 的最高和最低地区完全匹配准确率相差 15 分,而零样本情况下差距扩大至 44 分。
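The three reported metrics can be sketched as follows. This is a simplified, token-level approximation: the paper's slot F1 is micro-averaged, and span-based scoring (as in seqeval-style evaluation) may differ at slot boundaries. The `metrics` helper and label names are illustrative.

```python
def metrics(examples):
    """Compute (intent accuracy, micro-averaged slot F1, exact match
    accuracy) from (pred_intent, gold_intent, pred_slots, gold_slots)
    tuples, where slots are per-token label lists using "Other" for
    non-slot tokens."""
    intent_correct = exact = tp = fp = fn = 0
    for pred_intent, gold_intent, pred_slots, gold_slots in examples:
        intent_ok = pred_intent == gold_intent
        slots_ok = pred_slots == gold_slots
        intent_correct += intent_ok
        exact += intent_ok and slots_ok  # exact match needs both correct
        for p, g in zip(pred_slots, gold_slots):
            if p != "Other" and p == g:
                tp += 1
            elif p != "Other":
                fp += 1
                if g != "Other":
                    fn += 1  # predicted the wrong slot label
            elif g != "Other":
                fn += 1      # missed a gold slot token
    n = len(examples)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return intent_correct / n, f1, exact / n

examples = [
    ("alarm_set", "alarm_set", ["Other", "time"], ["Other", "time"]),
    ("weather_query", "alarm_set", ["place", "Other"], ["place", "Other"]),
]
print(metrics(examples))  # -> (0.5, 1.0, 0.5)
```

Note how exact match is the strictest of the three: the second example has perfect slots but a wrong intent, so it counts toward slot F1 yet not toward exact match.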

We compared the pretraining data quantities by language for XLM-R to its per-language task performance values. In the zero-shot setup, we found a Pearson correlation of 0.54 for exact match accuracy, 0.58 for intent accuracy, and 0.46 for micro-averaged slot F1 score. In the full-dataset training setup, the correlations decrease to 0.42 for exact match accuracy, 0.47 for intent accuracy, and 0.24 for micro-averaged slot F1 score. This suggests that the constant per-language data quantities in MASSIVE help to mitigate the effects of the language-skewed pretraining data distribution.

我们比较了XLM-R按语言划分的预训练数据量与其各语言任务性能指标,在零样本设置下发现:精确匹配准确率的皮尔逊相关系数为0.54,意图识别准确率为0.58,微观平均槽位F1得分为0.46。在全数据集训练设置中,这些相关性分别降至:精确匹配准确率0.42,意图识别准确率0.47,微观平均槽位F1得分0.24。这表明MASSIVE中恒定的各语言数据量有助于缓解预训练数据分布的语言偏斜效应。
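This correlation analysis uses the standard Pearson formula, which can be reproduced directly. The pretraining sizes and accuracies below are placeholder values for illustration, not the actual per-language statistics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-language pretraining sizes (GB) vs. exact match accuracy (%)
sizes = [300.0, 60.0, 10.0, 1.0]
accuracy = [40.0, 35.0, 30.0, 15.0]
print(round(pearson(sizes, accuracy), 2))  # -> 0.72
```

A correlation near zero would indicate that task performance is insensitive to pretraining data volume; the moderate positive values reported above show the skew still matters, but less so with full-dataset fine-tuning.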

(a) Test results when using the full training set

| 模型 | 意图准确率 (%) 高 | 意图准确率 (%) 低 | 意图准确率 (%) 平均 | 槽位 F1 (%) 高 | 槽位 F1 (%) 低 | 槽位 F1 (%) 平均 | 完全匹配准确率 (%) 高 | 完全匹配准确率 (%) 低 | 完全匹配准确率 (%) 平均 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mT5 Base Text-to-Text | 87.9 ± 1.2 | 79.0 ± 1.5 | 85.3 ± 0.2 | 86.8 ± 0.7 | 67.6 ± 0.4 | 76.8 ± 0.1 | 73.4 ± 1.6 | 58.3 ± 1.8 | 66.6 ± 0.2 |
| mT5 Base Encoder-Only | en-US 89.0 ± 1.1 | km-KH 79.1 ± 1.5 | 86.1 ± 0.2 | th-TH 85.7 ± 0.7 | ja-JP 64.5 ± 0.4 | 75.4 ± 0.1 | th-TH 72.3 ± 1.6 | ja-JP 57.8 ± 1.8 | 65.9 ± 0.2 |
| XLM-R Base | en-US 88.3 ± 1.2 | km-KH 77.2 ± 1.5 | 85.1 ± 0.2 | th-TH 83.5 ± 0.7 | ja-JP 63.3 ± 0.4 | 73.6 ± 0.1 | th-TH 70.1 ± 1.6 | ja-JP 55.8 ± 1.8 | 63.7 ± 0.2 |

(a) 使用完整训练集时的测试结果

Table 3: Modeling results for (a) training runs on the full training dataset and (b) zero-shot training runs, in which training was performed only with en-US data, validation was performed with all locales, and testing was performed on all locales except for en-US. Each table includes the highest locale, the lowest locale, and locale-averaged results for intent accuracy, micro-averaged slot F1 score, and exact match accuracy. Intervals for 95% confidence are given assuming normal distributions.

(b) Zero-shot test results after training only on en-US

表 3: (a) 在全量训练数据集上的训练结果和 (b) 零样本训练结果,其中训练仅使用 en-US 数据,验证使用所有地区数据,测试则使用除 en-US 外的所有地区数据。每个表格包含意图准确率、微观平均槽位 F1 分数和完全匹配准确率的最高地区、最低地区和地区平均结果。假设正态分布,给出了 95% 置信区间。

| 模型 | 意图准确率 (%) 最高 | 意图准确率 (%) 最低 | 意图准确率 (%) 平均 | 槽位 F1 (%) 最高 | 槽位 F1 (%) 最低 | 槽位 F1 (%) 平均 | 完全匹配准确率 (%) 最高 | 完全匹配准确率 (%) 最低 | 完全匹配准确率 (%) 平均 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mT5 Base Text-to-Text | nl-NL 79.9 ± 1.4 | ja-JP 25.7 ± 1.6 | 62.9 ± 0.2 | de-DE 64.3 ± 0.7 | ja-JP 13.9 ± 0.3 | 44.8 ± 0.1 | sv-SE 53.2 ± 1.8 | ja-JP 9.4 ± 1.0 | 34.7 ± 0.2 |
| mT5 Base Encoder-Only | nl-NL 76.4 ± 1.5 | ja-JP 27.1 ± 1.6 | 61.2 ± 0.2 | th-TH 59.5 ± 1.0 | ja-JP 6.3 ± 0.2 | 41.6 ± 0.1 | sv-SE 44.3 ± 1.8 | ja-JP 4.2 ± 0.7 | 28.8 ± 0.2 |
| XLM-R Base | sv-SE 85.2 ± 1.3 | ja-JP 44.8 ± 1.8 | 70.6 ± 0.2 | sv-SE 68.4 ± 0.7 | ja-JP 15.4 ± 0.3 | 50.3 ± 0.1 | sv-SE 57.9 ± 1.8 | ja-JP 9.8 ± 1.1 | 38.7 ± 0.2 |

(b) 仅在 en-US 数据上训练后的零样本测试结果

In Thai, for which spacing is optional, the model can learn from artificial spacing in the input (around where the slots will be) to improve task performance. For Khmer, the workers had a difficult time adapting their translations and localizations to properly-slotted outputs given the space-optional nature of the language. Additionally, for Japanese and Chinese, we added spaces between all characters when modeling. These single-character inputs differ from the non-spaced inputs used during pretraining, which would be chunked into groups of characters by the tokenizer with corresponding embeddings. By splitting into single characters, we do not allow the model to use the embeddings learned for chunks of characters. This is a likely major cause of the drop in exact match accuracy for Japanese from 58.3% when training on the full dataset to 9.4% for zero-shot. In the zero-shot setup, the model relies solely on pretrained data representations, and individually-spaced characters are rare in the pretraining data. That said, character spacing was necessary in order to properly assign the slots to the right characters. As mentioned in Section 5.1, we leave exploration of more sophisticated spacing techniques for slot filling (such as MeCab; Kudo 2005) to future work.

在泰语中(空格是可选的),模型可以通过学习输入中人工添加的空格(通常位于槽位附近)来提高任务性能。对于高棉语,由于该语言本身空格可选的特点,工作人员在调整翻译和本地化以适应正确槽位输出时遇到了困难。此外,对于日语和中文,我们在建模时在所有字符间都添加了空格。这种单字符输入方式与预训练时使用的无空格输入不同——预训练时输入会被分词器 (tokenizer) 切分成字符组并对应嵌入向量。通过强制拆分为单字符,模型无法使用预训练时为字符组学到的嵌入。这很可能是日语精确匹配准确率从全量数据训练的 58.3% 骤降至零样本场景下 9.4% 的主要原因:在零样本设置中,模型完全依赖预训练数据的表征方式,而单独分隔的字符在预训练数据中极为罕见。但需要说明的是,字符分隔对于正确分配槽位是必要的。如第5.1节所述,我们将更复杂的槽位填充空格处理技术(如MeCab;Kudo 2005)留给未来研究。

Discounting artificial spacing effects, Germanic genera and Latin scripts performed the best overall (see Appendix E), which is unsurprising given the amount of pretraining data for those genera and scripts, as well as the quantity of Germanic and Latin-script languages in MASSIVE. Within the Germanic genus, Swedish, English, Danish, Norwegian, and Dutch all performed comparably (within 95% confidence bounds) for exact match accuracy. Icelandic was the lowest-performing Germanic language, likely due to a lack of pretraining data, as well as to its linguistic evolution away from the other Germanic languages under isolated conditions.

排除人工空格效应的影响后,日耳曼语系和拉丁文字整体表现最佳(参见附录E),考虑到这些语系和文字在预训练数据中的占比,以及MASSIVE数据集中日耳曼语和拉丁文字语言的数量,这一结果并不令人意外。在日耳曼语系内部,瑞典语、英语、丹麦语、挪威语和荷兰语的精确匹配准确率表现相当(均在95%置信区间内)。冰岛语是日耳曼语系中表现最差的语言,这可能是由于预训练数据不足,以及其因孤立环境而与其他日耳曼语言产生的演化差异所致。

6 Conclusion

6 结论

We have released a truly MASSIVE multilingual dataset for NLU spanning 51 typologically diverse languages. Our hope is that MASSIVE will encourage many new innovations in massively multilingual NLU, other NLP tasks such as machine translation, and new linguistic analyses, such as with imperative morphologies.

我们发布了一个真正庞大的多语言数据集MASSIVE,涵盖51种类型多样的语言,用于自然语言理解(NLU)。我们希望MASSIVE能推动大规模多语言NLU、机器翻译等其他NLP任务的新创新,以及祈使形态学等新的语言分析研究。
