MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages
Abstract
We present the MASSIVE dataset: Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.
1 Introduction and Description
Natural Language Understanding (NLU) is a machine's ability to understand the meaning and relevant entities from text. For instance, given the utterance what is the temperature in new york, an NLU model might classify the intent as weather_query and fill the slots as weather_descriptor: temperature and place_name: new york. Our particular focus of NLU is one component of Spoken Language Understanding (SLU), in which raw audio is first converted to text before NLU is performed (Young, 2002; Wang et al., 2005; Tur and De Mori, 2011). SLU is the foundation of voice-based virtual assistants like Alexa, Siri, and Google Assistant. Though virtual assistants have advanced incredibly in the past decade, they still only support a small fraction of the world's 7,000+ languages (Simons, 2022). Challenges for multilingualism span the software stack and a variety of operational considerations, but one difficulty in creating massively multilingual NLU models is the lack of labeled data for training and evaluation, particularly data that is realistic for the task and that is natural for each given language. High naturalness typically requires human-based vetting, which is often costly.
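To make the annotation scheme concrete, the sketch below shows how an utterance, its intent, and its slots fit together. The field names and the bracketed slot notation are illustrative assumptions modeled on the SLURP-style convention and may not match the released file format exactly.

```python
import re

# Illustrative record only: field names and the bracket notation are
# assumptions modeled on SLURP-style annotation, not the exact schema.
example = {
    "utt": "what is the temperature in new york",
    "intent": "weather_query",
    "annot_utt": "what is the [weather_descriptor : temperature] "
                 "in [place_name : new york]",
}

def extract_slots(annot_utt):
    """Recover (slot type, value) pairs from bracket-annotated text."""
    return dict(re.findall(r"\[(\w+) : ([^\]]+)\]", annot_utt))

print(example["intent"])                   # weather_query
print(extract_slots(example["annot_utt"]))
# {'weather_descriptor': 'temperature', 'place_name': 'new york'}
```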
We present MASSIVE (Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation), a new 1M-example dataset composed of realistic, human-created virtual assistant utterance text spanning 51 languages, 60 intents, 55 slot types, and 18 domains. With the English seed data included, there are 587k train utterances, 104k dev utterances, 152k test utterances, and 153k utterances currently held out for the MMNLU-22 competition, which will be released after the competition. We have released our data, code, and models.¹
MASSIVE was created by localizing the SLURP NLU dataset (created only in English) in a parallel manner. SLURP is described further in Section 2, linguistic analyses of the dataset in Section 3, and the localization process in Section 4.3. Results for Massively Multilingual NLU (MMNLU) modeling, in which a single model can perform NLU on any of the incoming languages, are given in Section 5.
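Since the reported results include exact match accuracy, intent classification accuracy, and slot-filling F1, the following sketch shows one common way to compute these metrics. It assumes predictions and references arrive as parallel lists, with slots given as sets of (type, value) pairs per utterance; the paper's exact scoring code may differ.

```python
def intent_accuracy(pred_intents, gold_intents):
    """Fraction of utterances whose intent label is predicted correctly."""
    return sum(p == g for p, g in zip(pred_intents, gold_intents)) / len(gold_intents)

def exact_match_accuracy(pred_intents, gold_intents, pred_slots, gold_slots):
    """An utterance counts only if both the intent and the full slot set match."""
    hits = sum(
        pi == gi and ps == gs
        for pi, gi, ps, gs in zip(pred_intents, gold_intents, pred_slots, gold_slots)
    )
    return hits / len(gold_intents)

def micro_slot_f1(pred_slots, gold_slots):
    """Micro-averaged F1 over (slot type, value) pairs across all utterances."""
    tp = fp = fn = 0
    for ps, gs in zip(pred_slots, gold_slots):
        tp += len(ps & gs)
        fp += len(ps - gs)
        fn += len(gs - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```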
2 Related Work
Prior researchers have emphasized the need to explore the unique challenges of low-resource languages (Simpson et al., 2008; Strassel and Tracey, 2016; Cruz and Cheng, 2020; Lakew et al., 2020; Marivate et al., 2020; Magueresse et al., 2020; Goyal et al., 2021), while the growing number and size of language models pre-trained on massively multilingual corpora (mBERT (Devlin, 2018), RoBERTa (Liu et al., 2019b), XLM (Lample and Conneau, 2019), XLM-R (Conneau et al., 2020), mBART (Liu et al., 2020), MARGE (Lewis et al., 2020), and mT5 (Xue et al., 2021)) have allowed for significant improvements in supporting them. However, the creation of evaluation datasets for specific tasks has not kept pace. Some tasks, such as Named Entity Recognition (NER) or translation, lend themselves to mining existing corpora (Tiedemann, 2012; Pan et al., 2017; Hu et al., 2020), while others, such as NLU, the focus here, require the creation of new data and schema-specific annotations. Beyond the cost, even identifying a sufficient number of speakers for data generation and quality control can be difficult. Most studies have thus focused on collecting data for one such low-resource language and determining the utility of multilingual models or cross-lingual learning from more readily available languages. Moreover, such datasets are often isolated collections, creating an environment of multiple datasets that are not easily comparable across languages or tasks. There have been exceptions where researchers have extended popular English benchmark datasets to new languages, such as SQuAD (Rajpurkar et al., 2016) and XQuAD (Artetxe et al., 2019); ATIS (Price, 1990), its Hindi and Turkish extension (Upadhyay et al., 2018), and MultiATIS++ (Xu et al., 2020); and Snips (Coucke et al., 2018) with its addition of French (Saade et al., 2019). This work focuses on the general multi-domain NLU task and builds off the SLURP (Bastianelli et al., 2020) benchmark dataset to extend to an unprecedented 50 new languages.
For the task of NLU, the ATIS dataset has been popular in the NLP community since its first release. MultiATIS++ was one of the first efforts to extend an NLU dataset across a significant number of languages (nine), yet it remained in the limited domain of airline bookings. While it has proven an asset, researchers have questioned what is left to learn from such a dataset (Tur et al., 2010). Facebook released a general Intelligent Virtual Assistant (IVA) dataset across the domains of Alarm, Reminder, and Weather (Schuster et al., 2019), created for the purpose of demonstrating cross-lingual transfer learning; as such, it did not need to be parallel or have an equal number of datapoints, resulting in far fewer examples in Thai (5k) compared to Spanish (7.6k) and English (43k). The Snips datasets (both the original English-only release and the English and French release) are most similar to the NLU contained in the MASSIVE dataset, spanning smart home and music domains for a generic voice-based virtual assistant.
The first iteration of the foundation of the MASSIVE dataset was the NLU Evaluation Benchmarking Dataset, with 25k utterances across 18 domains (Liu et al., 2019a). The authors updated the dataset and added audio and ASR transcriptions in the release of the Spoken Language Understanding Resource Package (SLURP) (Bastianelli et al., 2020), allowing for full end-to-end Spoken Language Understanding (SLU) evaluation similar to the Fluent Speech Commands dataset (Lugosch et al., 2019) and Chinese Audio-Textual Spoken Language Understanding (CATSLU) (Zhu et al., 2019). An overview of selected existing NLU datasets can be seen in Table 1.
We release the MASSIVE dataset along with baselines from large pre-trained models fine-tuned on the NLU slot and intent prediction tasks. Early cross-lingual and multilingual NLU modeling approaches used projection or alignment methods (Yarowsky et al., 2001), focusing on string matching, edit distance, or consonant signatures (Ehrmann et al., 2011), lookup lexicons for low-resource languages (Mayhew et al., 2017), and aligning (Xie et al., 2018) or jointly training word embeddings (Singla et al., 2018). More recently, researchers have borrowed encoders from pre-trained neural translation models before building subsequent classifiers and NER models (Eriguchi et al., 2018; Schuster et al., 2019), also focusing on language-agnostic and language-specific features to learn what information to share between languages (Chen et al., 2019b). Generative parsing has been demonstrated using sequence-to-sequence models and pointer networks (Rongali et al., 2020). With the rise of BERT and large pre-trained language models, we have also seen impressive demonstrations of zero-shot performance, where subword (WordPiece) token overlap helps but is not even necessary to realize improvements (Pires et al., 2019; K et al., 2020), as well as production multilingual NLU improvements with distillation and full fine-tuning (FitzGerald et al., 2022). The translation task has also been incorporated into the pre-training of these models (Wang et al., 2021), or even used as part of the final NLU hypothesis for streamlined multilingual production systems (FitzGerald, 2020).
Table 1: Selected NLU benchmark datasets with number of languages, utterances per language, domain count, intent count, and slot count.
Name | Languages | Utterances per Language | Domains | Intents | Slots
---|---|---|---|---|---
MASSIVE | 51 | 19,521 | 18 | 60 | 55
SLURP (Bastianelli et al., 2020) | 1 | 16,521 | 18 | 60 | 55
NLU Evaluation Data (Liu et al., 2019a) | 1 | 25,716 | 18 | 54 | 56
Airline Travel Information System (ATIS) (Price, 1990) | 1 | 5,871 | 1 | 26 | 129
ATIS with Hindi and Turkish (Upadhyay et al., 2018) | 3 | 1,315-5,871 | 1 | 26 | 129
MultiATIS++ (Xu et al., 2020) | 9 | 1,422-5,897 | 1 | 21-26 | 99-140
Snips (Coucke et al., 2018) | 1 | 14,484 | - | 7 | 53
Snips with French (Saade et al., 2019) | 2 | 4,818 | 2 | 14-15 | 11-12
Task Oriented Parsing (TOP) (Gupta et al., 2018) | 1 | 44,873 | 2 | 25 | 36
Multilingual Task-Oriented Semantic Parsing (MTOP) (Li et al., 2021) | 6 | 15,195-22,288 | 11 | 104-113 | 72-75
Cross-lingual Multilingual Task Oriented Dialog (Schuster et al., 2019) | 3 | 5,083-43,323 | 3 | 12 | 11
Microsoft Dialogue Challenge (Li et al., 2018b) | 1 | 38,276 | 3 | 11 | 29
Fluent Speech Commands (FSC) (Lugosch et al., 2019) | 1 | 30,043 | - | 31 | -
Chinese Audio-Textual Spoken Language Understanding (CATSLU) (Zhu et al., 2019) | 1 | 16,258 | 4 | - | 94
Researchers have propped up training data by translating and projecting labels into the target language (Xu et al., 2020) and have discovered more sophisticated approaches to alignment, such as translate-and-fill, which uses mT5 to train the filler (Nicosia et al., 2021). Recent work has even delved into the application of these techniques to lower-resource languages such as Persian. For example, ParsiNLU explores a variety of NLU tasks for Persian, fine-tuning mT5 models of various sizes (Khashabi et al., 2021). Similarly, these techniques have also been used, even a bit earlier, for text summarization (Farahani et al., 2021).
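As a rough illustration of the baseline setup described above, the sketch below wires a shared XLM-R encoder to two classification heads, one for the utterance-level intent and one for per-token slot tags. This is a minimal sketch of one common joint-NLU architecture, not the released modeling code; the head sizes assume the dataset's 60 intents and 55 slot types plus an "Other" tag for unslotted tokens.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointNLUModel(nn.Module):
    """Shared encoder with an intent head (first token) and a slot head (all tokens)."""

    def __init__(self, model_name="xlm-roberta-base", num_intents=60, num_slots=56):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)
        self.slot_head = nn.Linear(hidden, num_slots)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                              # (batch, seq, hidden)
        intent_logits = self.intent_head(states[:, 0])   # <s> token summary
        slot_logits = self.slot_head(states)             # one tag per subword
        return intent_logits, slot_logits

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = JointNLUModel()
batch = tokenizer(["wake me up at five am"], return_tensors="pt")
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
```

In training, such a model would typically optimize a summed cross-entropy loss over the two heads, with slot labels aligned to subword tokens.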
3 Language Selection and Linguistic Analysis
3.1 Language Selection
The languages in MASSIVE were chosen according to the following considerations. First, we acquired cost and worker availability estimates for over 100 languages, providing a constraint on our choices given our fixed budget. Second, we determined the languages already available in major virtual assistants, such that the dataset could be used to benchmark today's systems. Third, we categorized the full pool of languages according to their genera as taken from the World Atlas of Language Structures (WALS) database (Dryer and Haspelmath, 2013), where a genus is a language group whose relatedness is clear to most linguists without systematic comparative analysis. Genus is a better indicator of typological diversity, which we sought to maximize, than language family (Dryer, 1989). Fourth, we used the eigenvector centrality of Wikipedia articles, tweets, and book translations (Ronen et al., 2014) as proxies for the internet influence, and thus the resource availability, of a given language, particularly for self-supervised pre-training applications, and we chose languages spanning the breadth of resource availability. Fifth, we examined the script of each language, seeking to increase script diversity to drive experimentation in tokenization and normalization.
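As an illustration of the fourth criterion, eigenvector centrality can be computed on a language-connection graph. The toy edges below are invented for demonstration only; Ronen et al. (2014) derive such graphs from real book-translation, Wikipedia, and Twitter data.

```python
import networkx as nx

# Hypothetical toy edges standing in for observed language connections
# (e.g., a book translated between the two languages).
edges = [
    ("en", "de"), ("en", "fr"), ("en", "sw"), ("fr", "de"),
    ("de", "is"), ("en", "th"), ("fr", "vi"),
]
graph = nx.Graph(edges)

# Eigenvector centrality as a proxy for a language's internet influence.
centrality = nx.eigenvector_centrality(graph)
for lang, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {score:.3f}")
```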
Ultimately, we created 50 new, distinct text corpora, representing 49 different spoken languages. Mandarin Chinese was collected twice, once with native speakers who use the traditional set of characters, and once with native speakers who use the modern simplified set of characters. There are 14 language families in the dataset. The term “language family” usually refers to a group of languages which are known to be genetically related, that is, they all descend from a common ancestor language. In MASSIVE, we also include “language isolates” as families. These are languages that have no clear relationship to any known language. Our choices are given in Table 2.
3.2 Scripts
There are 21 distinct scripts used in the dataset. The majority of languages in MASSIVE (28 including English) use some variety of the Latin alphabet, which is also the most widely used script in the world. The Arabic script is used for three languages, the Cyrillic script for two languages, and the remaining 18 languages have “unique” scripts, in the sense that only one language in the dataset uses that script. Fourteen scripts are unique to a single language, although they may belong to a larger family of writing systems. For example, the Dravidian languages in MASSIVE have their own scripts, but are all members of the general Brahmi class of scripts. The other two scripts are unique in that only one language in the dataset uses them, but they are more widely used in the real world: Ge’ez and Chinese. Ge’ez is represented by Amharic in the dataset, but is used for several languages in East Africa, such as Tigrinya. The Chinese script is represented by Mandarin, but is used by other languages in China such as Cantonese.
Table 2: The 51 languages of MASSIVE, including scripts and genera.
Code | Name | Script | Genus | Code | Name | Script | Genus | Code | Name | Script | Genus
---|---|---|---|---|---|---|---|---|---|---|---
af-ZA | Afrikaans | Latn | Germanic | hy-AM | Armenian | Armn | Armenian | pl-PL | Polish | Latn | Slavic
am-ET | Amharic | Ethi | Semitic | id-ID | Indonesian | Latn | Malayo-Sumbawan | pt-PT | Portuguese | Latn | Romance
ar-SA | Arabic | Arab | Semitic | is-IS | Icelandic | Latn | Germanic | ro-RO | Romanian | Latn | Romance
az-AZ | Azerbaijani | Latn | Turkic | it-IT | Italian | Latn | Romance | ru-RU | Russian | Cyrl | Slavic
bn-BD | Bengali | Beng | Indic | ja-JP | Japanese | Jpan | Japanese | sl-SI | Slovenian | Latn | Slavic
cy-GB | Welsh | Latn | Celtic | jv-ID | Javanese | Latn | Javanese | sq-AL | Albanian | Latn | Albanian
da-DK | Danish | Latn | Germanic | ka-GE | Georgian | Geor | Kartvelian | sv-SE | Swedish | Latn | Germanic
de-DE | German | Latn | Germanic | km-KH | Khmer | Khmr | Khmer | sw-KE | Swahili | Latn | Bantoid
el-GR | Greek | Grek | Greek | kn-IN | Kannada | Knda | Southern Dravidian | ta-IN | Tamil | Taml | Southern Dravidian
en-US | English | Latn | Germanic | ko-KR | Korean | Kore | Korean | te-IN | Telugu | Telu | South-Central Dravidian
es-ES | Spanish | Latn | Romance | lv-LV | Latvian | Latn | Baltic | th-TH | Thai | Thai | Kam-Tai
fa-IR | Persian | Arab | Iranian | ml-IN | Malayalam | Mlym | Southern Dravidian | tl-PH | Tagalog | Latn | Greater Central Philippine
fi-FI | Finnish | Latn | Finnic | mn-MN | Mongolian | Cyrl | Mongolic | tr-TR | Turkish | Latn | Turkic
fr-FR | French | Latn | Romance | ms-MY | Malay | Latn | Malayo-Sumbawan | ur-PK | Urdu | Arab | Indic
he-IL | Hebrew | Hebr | Semitic | my-MM | Burmese | Mymr | Burmese-Lolo | vi-VN | Vietnamese | Latn | Viet-Muong
hi-IN | Hindi | Deva | Indic | nb-NO | Norwegian | Latn | Germanic | zh-CN | Mandarin Chinese | Hans | Chinese
hu-HU | Hungarian | Latn | Ugric | nl-NL | Dutch | Latn | Germanic | zh-TW | Mandarin Chinese | Hant | Chinese
3.3 Sentence Types
MASSIVE consists of utterances directed at a device, rather than at a person, which has some consequences for the type of linguistic patterns it contains. Specifically, the corpus primarily consists of interrogatives (i.e., questions) and imperatives (commands or requests). There are relatively few declarative utterances in the set. This is in contrast to many large datasets from other sources (e.g., Wikipedia, movie scripts, newspapers), which contain a high proportion of declaratives, since their language is collected from situations where humans are communicating with humans.
In the context of a voice assistant, a user typically asks a device to perform an action or answer a question, so declaratives are less common. For instance, a person might use an imperative ("tell me if it calls for rain today") or ask a question ("will it rain today"), but they would not tell their device "it's raining today." When declaratives are used with voice assistants, they generally have the pragmatic effect of a directive. For instance, a virtual assistant can respond to the declarative "it's cold in here" by turning up the temperature (Thattai et al., 2020). Although syntactically it looks like a declarative, such an utterance has the force of an imperative.
The standard unit of analysis in linguistics is the declarative sentence, and relatively less is known about imperatives and questions. MASSIVE presents an opportunity to study these sentence forms, and the parallel nature of the corpus makes cross-linguistic comparisons even easier.
3.4 Word Order
Languages have intricate rules for ordering words depending on the word-type and sentence-type. In English, the word order for statements (“you are leaving”) is different from questions (“are you leaving?”). This is not mandatory, and sometimes the pitch of the voice is enough to indicate a question (e.g. “you’re leaving?” with a rising intonation).
When considering word order at a typological level, it is common to simplify the situation and consider only affirmative declarative sentences and only three grammatical elements: the verb (V), its subject (S), and its object (O). This makes for six possible word orders: SVO, SOV, VOS, VSO, OVS, and OSV. All six orders have been documented, although the overwhelming majority of languages use subject-initial ordering, while object-initial ordering is extremely rare.
In MASSIVE, 39 languages are subject-initial (24 SVO and 15 SOV), while only three are verb-initial (VSO specifically). No object-initial languages are represented. Five languages are marked in WALS as having no preferred word order, and four do not have any word order data at all.
3.5 Imperative Marking
The languages in MASSIVE have a variety of ways of indicating the imperative mood of an utterance. The majority of them (33) use some kind of verb morphology, such as adding a suffix. About half of those languages (18) have distinct imperative marking for singular or plural addressees. The utterances in MASSIVE are technically directed at a single addressee, the voice assistant, but since some languages use the plural as an indicator of politeness (see below) all varieties of imperatives will likely occur in this dataset. There are ten languages without any special morphology, and they indicate imperative through other means, such as word order or vocabulary choice.
Ten languages in the dataset have a specialized distinction between imperatives, for commands directed at another individual, and “hortatives”, where the command also includes the speaker. English verbs are not directly marked for hortative, but the auxiliary verb “let” can convey the mood instead. For example, “write this down” is an imperative and only the addressee need write anything, while “let’s write this down” is a hortative and the speaker is also expected to write. The pervasiveness of hortatives in the context of a voice assistant is an open question.
Four languages have “optative” moods, which are subtly different from imperatives. In the optative, a speaker expresses a wish or desire, as opposed to giving a direct command. However, in the right context, an optative may carry the same pragmatic weight as an imperative, and strongly imply that someone ought to do something. English has no specific optative form, but a similar mood can be conveyed using conditionals. For example, “buy this bag for me” is an imperative while “if only someone would buy me this bag” is closer to an optative. Optative forms are not well studied in linguistics, as they require specific contexts which can be difficult to create during field work, but they may be more common in device-directed utterances.
Lastly, some languages distinguish between imperatives, when telling someone to do something, and "prohibitives", when telling someone not to do something. In the MASSIVE set, there are 18 languages with specialized negative particles which can only co-occur with imperative verbs. Vietnamese, for instance, uses the words "chẳng" or "không" to negate declarative sentences, but uses "chớ" or "đừng" to negate imperatives. Another ten languages have special verbs for the prohibitive, although these may overlap with other grammatical features of the language. In Spanish, for example, the prohibitive form of a verb is the same as the subjunctive form.
3.6 Politeness
Many languages encode different levels of politeness through their use of pronouns. Many European languages distinguish between “familiar” and “formal” pronouns, with the “formal” pronouns often morphologically identical to a plural. In French, the second-person singular “tu” is used between friends, while the second-person plural “vous” is used when speaking to a group, or to an individual of higher social rank (such as an employee to a manager). These politeness systems are heavily influenced by social context, and the MASSIVE dataset gives us a chance to see how people adapt their language when speaking to a virtual assistant instead of another human.
Nearly half of the languages in MASSIVE (21) make a two-way formal/informal distinction in their second-person pronouns. This is probably due to the fact that most MASSIVE languages are European, and the binary politeness distinctions are the most common strategy in that family. A further eight languages have more than two levels of formality, such as informal, formal, and honorific. Seven languages have an “avoidance” strategy, which means that pronouns are omitted entirely in a polite scenario. Finally, eleven languages have no data on politeness in WALS at all.
4 Collection Setup and Execution
4.1 Heldout Evaluation Split
We randomly sampled a subset of the English seed data which was