Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information
Abstract
This work focuses on in-context data augmentation for intent detection. Having found that augmentation via in-context prompting of large pre-trained language models (PLMs) alone does not improve performance, we introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints – utterances that correspond to given intents. It then employs intent-aware filtering, based on PVI, to remove datapoints that are not helpful to the downstream intent classifier. Our method is thus able to leverage the expressive power of large language models to produce diverse training data. Empirical results demonstrate that our method can produce synthetic training data that achieve state-of-the-art performance on three challenging intent detection datasets under few-shot settings (1.28% absolute improvement in 5-shot and 1.18% absolute in 10-shot, on average) and perform on par with the state-of-the-art in full-shot settings (within 0.01% absolute, on average).
1 Introduction
Intent detection, defined as the identification of a user’s intent given an utterance, is a fundamental element in task-oriented dialogue systems, usually occurring within the Natural Language Understanding (NLU) component. One of the practical challenges of training and deploying NLU modules is data scarcity, due to various reasons, such as under-represented languages, privacy and ethical concerns, or simply the cost of collecting and annotating sufficiently large amounts of data for new intents. Consequently, accurately identifying intents in limited-resource scenarios has drawn attention from the community (Papangelis et al., 2021; Mehri and Eric, 2021; Zhang et al., 2021b, for example).
There are three main families of approaches that address the challenge of limited data for intent detection: data augmentation (Peng et al., 2021; Li et al., 2021), focusing on generating high-quality synthetic training and evaluation data; few-shot learning (Zhang et al., 2020, 2021b), focusing on creating learning algorithms that can cope with limited amounts of data; and transfer learning (Namazifar et al., 2021), focusing on learning algorithms that can generalize across domains (therefore not requiring in-domain data). In this work, we follow the data augmentation approach, which is a general method that attempts to augment a human-authored dataset with a large set of synthetically generated instances. Most recent work has suggested using Pre-trained Language Models (PLMs) for data augmentation under various setups, e.g., Peng et al. (2021), showing great improvements in performance. However, simply generating a large number of synthetic datapoints is not enough; we need to consider the quality of each datapoint, i.e., how beneficial it would be to the model's performance if that synthetic datapoint were added to the training set. This is an important issue, since the model might overfit to synthetic datapoints (which may be low quality, represent specific use cases, etc.) and thus under-perform on real data.
In this work, we propose to apply Pointwise $\mathcal{V}$-Information (PVI) (Ethayarajh et al., 2022) for data augmentation, in a way that leverages a PLM to generate synthetic examples that are relevant and beneficial for training the downstream model, which in our case is an intent classifier. Our contributions are as follows:
• We propose a novel filtering method based on PVI (Ethayarajh et al., 2022) to filter out examples that are not relevant or helpful to the desired intent.
• We conduct experiments on three challenging intent detection datasets and show that our method achieves state-of-the-art performance.
• We conduct an in-depth study and present a comprehensive analysis of the factors that influence performance, including ablation studies and comparisons with alternative methods.
The rest of the paper is organized as follows: In Section 2 we present relevant work, and in Section 3 we introduce our method. In Sections 4 and 5 we discuss training details, experiments, and results. In Section 6 we present our analysis and discuss alternative approaches we investigated. In Section 7 we conclude, and in the following sections we discuss limitations and ethical considerations.
2 Related Work
Intent Detection. Intent detection is the task of identifying the user's intent by mapping the user's natural language utterance into one of several predefined classes (Hemphill et al., 1990; Coucke et al., 2018). It is a critical component in the pipeline of task-oriented dialogue systems, as it is used to determine the user's goal and to trigger an appropriate system action (Raux et al., 2005; Young et al., 2013). Several datasets have been proposed to evaluate the performance of intent detection models (Casanueva et al., 2020; Liu et al., 2019a; Larson et al., 2019, for some recent examples). With the availability of such datasets, intent detection has been extensively studied in the literature. Recently, pre-trained language models (e.g., BERT (Devlin et al., 2019)) have been shown to be effective in intent detection (Bunk et al., 2020; Zhang et al., 2020, 2021a,b; Mehri and Eric, 2021).
Data Augmentation. Data augmentation is a widely used technique to address the problem of data scarcity. Paraphrasing is one of the most frequently used augmentation methods and can produce more diverse synthetic text with different word choices and sentence structures while preserving the meaning of the original text. Paraphrasing methods have been shown to be effective in many natural language processing tasks (Gupta et al., 2018; Edunov et al., 2018; Iyyer et al., 2018; Wei and Zou, 2019; Cai et al., 2020; Okur et al., 2022; Panda et al., 2021; Jolly et al., 2020). However, such methods often fail to generate the more challenging and semantically diverse sentences that are important for the robustness of downstream models.
Recently, conditional generation – using a PLM to produce text conditioned on some label – has become the dominant paradigm of data augmentation (Bowman et al., 2016; Kumar et al., 2019; Anaby-Tavor et al., 2020; Kumar et al., 2020; Yang et al., 2020a; Lee et al., 2021). This is usually achieved by fine-tuning a language model to produce the original text given the label.
In the field of intent detection, previous work has proposed using data augmentation techniques to generate synthetic training data (Sahu et al., 2022; Papangelis et al., 2021). Sahu et al. (2022) also used PLMs to generate augmented examples, but they require human effort for labeling. This is a challenging task since it is expensive to annotate large amounts of data.
Our approach involves data valuation, similar to the concepts of Ghorbani and Zou (2019) and Mindermann et al. (2022). However, our approach differs from such previous work in two key ways. First, Ghorbani and Zou (2019) only evaluated the quality of the training set after training on it, whereas we evaluate the synthetic examples before training the task model. Second, Mindermann et al. (2022) selected points that minimize the loss on a holdout set, whereas we select synthetic examples that are reasonably challenging to the task model. Our approach aims to address the problem of data scarcity by evaluating the synthetic examples generated by PLMs and selecting the most valuable examples to augment the training data.
In-context Learning. Large language models such as GPT-3 (Brown et al., 2020) and OPT (Zhang et al., 2022) have been shown to be able to perform many natural language processing tasks with in-context learning. In this paradigm, the model is provided with a few exemplars, based on which it performs the respective task.
In-context learning is a promising solution for few-shot learning. Because of the effectiveness in few-shot performance, in-context learning has been applied to a wide range of NLP tasks. For dialogue tasks, in-context learning has been applied to intent classification (Yu et al., 2021), semantic parsing (Shin and Durme, 2022), and dialogue state tracking (Hu et al., 2022).
However, PLMs require a large amount of computational resources, and the limitation on input length restricts the application of PLMs to intent detection tasks with large numbers of intents (e.g., 150 intents in CLINC (Larson et al., 2019)), where we cannot fit examples for each intent in the input. One solution would be to call the model multiple times, each time with a subset of the possible intents. This would lead to increased inference time and may also impact performance. Consequently, Yoo et al. (2021) and Sahu et al. (2022) leveraged in-context learning and PLMs to generate synthetic examples for intent detection, instead of directly deploying the PLM. However, they did not consider the quality of the generated examples, which may lead to the model overfitting on examples that are not relevant to the desired intent.

[Figure 1: panels "Prompt" and "Example Completions"; figure content not recoverable]
3 In-Context Data Augmentation
In this section, we describe our proposed two-stage method for data augmentation, which we refer to as In-Context Data Augmentation (ICDA). The overall procedure is summarized in Algorithm 1. We apply ICDA to the task of few-shot intent detection, which involves classifying a user utterance $x$ into an intent label $y\in Y$. ICDA aims to generate synthetic examples $x^{\prime}$ such that they belong to a given intent $y$.
3.1 Synthesizing Examples
The core idea is to use a large pre-trained language model such as GPT-3 (Brown et al., 2020) or OPT (Zhang et al., 2022) to generate synthetic data in the context of the training set. In particular, for each intent class, we create a natural language context (prompt) that contains the intent class name, a set of real training examples under the same intent class, and an incomplete example. For instance, the prompt for the intent class "refund not showing up" is shown in Figure 1. We feed the prompt to the language model and obtain a set of synthetic examples as outputs. In this work, we use OPT-66B (Zhang et al., 2022) as the language model to generate a set of examples for each intent class. We adopt typical decoding with $\tau=0.9$ (Meister et al., 2022) and set the repetition penalty to 1.1, following Keskar et al. (2019), to generate the synthetic examples. Because intents are fine-grained and sampling-based generation aims to produce a diverse set of datapoints, we expect some of the generated utterances not to match the given intent.
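The prompt construction described above can be sketched as follows. This is only an illustration under stated assumptions: the exact prompt wording of Figure 1 is not reproduced here, so the template in `build_prompt`, the helper name itself, and the two sample utterances are hypothetical. The sampling settings mirror the values in the text, written with the argument names that the Hugging Face `generate` method uses for typical decoding (`typical_p`) and repetition penalty (`repetition_penalty`).

```python
from typing import Dict, List


def build_prompt(intent_name: str, examples: List[str]) -> str:
    """Build a hypothetical in-context prompt for one intent class.

    The prompt holds the intent name, the real seed examples for that
    intent, and a final incomplete example for the model to continue.
    """
    lines = [f"The following sentences belong to the same category '{intent_name}':"]
    for index, utterance in enumerate(examples, start=1):
        lines.append(f"Example {index}: {utterance}")
    # Incomplete example: the language model is expected to complete this line.
    lines.append(f"Example {len(examples) + 1}:")
    return "\n".join(lines)


# Sampling settings taken from the text (typical decoding with tau = 0.9,
# repetition penalty 1.1). Passing them to a model is left out of this sketch.
GENERATION_KWARGS: Dict[str, object] = {
    "do_sample": True,
    "typical_p": 0.9,
    "repetition_penalty": 1.1,
}

# Made-up seed utterances, for illustration only.
prompt = build_prompt(
    "refund not showing up",
    ["I returned an item but the refund is missing.", "Where is my refund?"],
)
print(prompt)
```

Sampling the same prompt several times would then yield a set of candidate utterances per intent, some of which may not match the intent and are handled by the filtering stage.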
Note that our method leverages PLMs in a way that is orthogonal to the intent detection model. Unlike other methods that use the same model to directly predict the intent class of a user utterance, we use a PLM to generate synthetic training instances. These instances are then used to augment the actual training data and train a smaller intent detection model. This approach leverages the power of PLMs while preserving the independence of the intent detection model design.
3.2 PVI Filtering
As mentioned above, given the stochastic nature of synthetic data generation, we expect some of the synthetic utterances not to match the given intent. To address this phenomenon, we filter generated instances and retain only those that are relevant and helpful to the desired intent classes.
Following Ethayarajh et al. (2022), the PVI of an utterance $x$ with respect to its intent class $y$ is

$$\mathrm{PVI}(x \to y) = -\log_2 g^{*}[\varnothing](y) + \log_2 g^{\prime}[x](y),$$

where, in this work, $g^{\prime}$ and $g^{*}$ are the intent detection models fine-tuned with and without the input $x$, respectively, and $\varnothing$ is a special token that indicates the absence of an input utterance.
Intuitively, PVI measures the amount of information that the input $x$ provides to the intent detection model (compared to the absence of meaningful input). A high PVI value indicates that the input $x$ provides a lot of information to the model, and thus is more likely to be helpful when training the model to classify instances of the intent class $y$ . On the contrary, a low PVI value indicates that the input $x$ provides little information to the model, and thus is likely to be irrelevant to the intent class $y$ (Ethayarajh et al., 2022).
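As a small worked example of this intuition, the sketch below computes PVI from two probabilities, following the definition of Ethayarajh et al. (2022): the log-probability (base 2) that the model fine-tuned with inputs assigns to the gold intent, minus the log-probability assigned by the model fine-tuned on the null input. The function name and the probability values are illustrative assumptions, not outputs of the models used in this work.

```python
import math


def pvi(prob_with_input: float, prob_null_input: float) -> float:
    """Return PVI(x -> y) in bits.

    prob_with_input: probability of the gold intent y from the model
        fine-tuned on (utterance, intent) pairs, given the utterance x.
    prob_null_input: probability of y from the model fine-tuned on
        (null input, intent) pairs, given the null input.
    """
    return -math.log2(prob_null_input) + math.log2(prob_with_input)


# The utterance raises the probability of the gold intent from 0.125 to 0.5,
# so it carries 2 bits of usable information about the label.
informative = pvi(prob_with_input=0.5, prob_null_input=0.125)

# The utterance leaves the probability unchanged, so PVI is 0.
uninformative = pvi(prob_with_input=0.125, prob_null_input=0.125)

print(informative, uninformative)
```

A negative value would arise when the model is less confident in the gold intent with the utterance than without it, which is a sign that the utterance may not match the intent.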
Table 1: To assess the impact of the synthetic data size on performance, we experiment with several data multipliers (synthetic data size $=$ source data size x mult.).
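Algorithm 1: In-Context Data Augmentation with PVI Filtering
Input: task model $\mathcal{V}$, language model $\mathcal{PLM}$, data multiplier $m$, PVI threshold function $\epsilon$
Output: task model $g$
Data: seed data $D_{train} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}$
1: $g^{\prime} \leftarrow$ fine-tune $\mathcal{V}$ on $D_{train}$
2: $\varnothing \leftarrow$ empty string
…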
We set a threshold $\epsilon$ (a tunable parameter) to determine which $x$ are retained, and we conduct experiments to study the effect of the threshold in Section 6. Algorithm 1 defines $\epsilon$ as a function of $y$ to allow flexibility in its definition: either a fixed threshold for all intent classes, or a different threshold per intent class.
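The thresholding step can be illustrated with a minimal sketch. It assumes that PVI scores are already available for every example, that the per-intent threshold is the mean PVI of that intent in a validation set, and that a synthetic example is kept when its PVI is strictly greater than the threshold (the strict comparison is an assumption of this sketch). The helper names and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Scored = Tuple[str, str, float]  # (utterance, intent, PVI score)


def per_intent_thresholds(validation: List[Scored]) -> Dict[str, float]:
    """Threshold function epsilon(y): mean validation PVI for each intent."""
    scores_by_intent: Dict[str, List[float]] = defaultdict(list)
    for _, intent, score in validation:
        scores_by_intent[intent].append(score)
    return {intent: mean(scores) for intent, scores in scores_by_intent.items()}


def filter_synthetic(synthetic: List[Scored], thresholds: Dict[str, float]) -> List[Scored]:
    """Keep synthetic examples whose PVI exceeds the threshold of their intent."""
    return [
        (utterance, intent, score)
        for utterance, intent, score in synthetic
        if intent in thresholds and score > thresholds[intent]
    ]


# Toy scores, for illustration only.
validation = [("a", "refund", 2.0), ("b", "refund", 4.0), ("c", "card", 1.0)]
synthetic = [("x", "refund", 3.5), ("y", "refund", 2.9), ("z", "card", 1.0)]

thresholds = per_intent_thresholds(validation)
kept = filter_synthetic(synthetic, thresholds)
print(thresholds, kept)
```

A single global threshold corresponds to returning the same value for every intent in `per_intent_thresholds`.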
4 Experimental Setup
4.1 Datasets
To evaluate the effectiveness of our approach in intent detection in cases where we have a large number of often semantically similar intent labels, we chose the BANKING (Casanueva et al., 2020), HWU (Liu et al., 2019a), and CLINC (Larson et al., 2019) datasets and compare with recent state-of-the-art baselines. BANKING comprises 13,083 utterances in a single banking domain and 77 intents. HWU includes 25,716 utterances with 64 intents across 21 domains. CLINC contains 23,700 utterances with 150 intents across 20 domains.
Size | Full-shot mult. | Few-shot mult.
---|---|---
XS | | 1x
S | 1x | 4x
M | 2x | 16x
L | 4x | 64x
XL | | 128x
4.2 Training
In our experiments, we use RoBERTa-LARGE (Liu et al., 2019b) as the intent detection model $\mathcal{V}$ in Algorithm 1. We use OPT-66B (Zhang et al., 2022) as the language model $\mathcal{PLM}$ to generate synthetic examples and set the data multiplier $m$ to 128. We set the PVI threshold function $\epsilon$ to be the average PVI under each intent class in the validation set, where the PVI is computed using the same models as in Algorithm 1. We train RoBERTa-LARGE for 40 epochs with a batch size of 16, a learning rate of 1e-5, and the AdamW optimizer (Loshchilov and Hutter, 2019). We use the Hugging Face Transformers library (Wolf et al., 2020) for all experiments.
4.3 Baseline Models
We compare our proposed method with the following baselines:
RoBERTa-BASE + Classifier is a baseline that uses RoBERTa-BASE (Liu et al., 2019b) with a linear classifier on top (Zhang et al., 2020).
USE is a universal sentence encoder pre-trained on 16 languages supporting multiple down-stream tasks (Yang et al., 2020b).
CONVERT is an intent detection model fine-tuned from dual encoder models, which are pre-trained on (input, response) pairs from Reddit (Henderson et al., 2020).
CONVBERT fine-tunes BERT on a large open-domain dialogue corpus with 700 million conversations (Mehri et al., 2020).
CONVBERT + Combined is an intent detection model based on CONVBERT, with example-driven training based on similarity matching and observers for transformer attentions. It also conducts task-adaptive self-supervised learning with masked language modeling (MLM) on the intent detection datasets. Here, "Combined" represents the best MLM+Example+Observers setting in the referenced paper (Mehri and Eric, 2021).
DNNC (Discriminative Nearest-Neighbor Classification) is a discriminative nearest-neighbor model, which finds the best-matched example from the training set through similarity matching. The model conducts data augmentation during training and boosts performance by pre-training on three natural language inference tasks (Zhang et al., 2020).
CPFT (Contrastive Pre-training and Fine-Tuning) is the current state-of-the-art in few-shot intent detection on the selected datasets. It is pre-trained on multiple intent detection datasets in a self-supervised contrastive manner and then fine-tuned with supervised contrastive learning (Zhang et al., 2021b).
5 Experimental Results
We conduct experiments on three benchmark datasets to validate the effectiveness of our proposed method. We first use OPT-66B to generate augmentation examples and then apply our method to enhance a RoBERTa-Large model trained on three datasets. We repeat all experiments with 5 random seeds and report the average performance in Full-shot and Few-shot settings. To investigate the effect of the synthetic data size, we experiment with a variety of multipliers (see Table 1 for notations). Results are shown in Table 2.
Full-shot settings. In this setting, we use the entire training set for each domain. The proposed method achieves the best performance on BANKING and comparable results on HWU and CLINC. In particular, on BANKING, we improve on the CONVBERT + Combined baseline (Mehri and Eric, 2021) by 0.59% (absolute) and on the RoBERTa-Large baseline by 0.72% (absolute). Compared with CONVBERT + Combined, which is pre-trained on intent detection datasets in a self-supervised fashion and adds example-driven training and a specific model architecture design, our method achieves similar results with a much simpler model design. Furthermore, our method is orthogonal to model architectures and can be integrated with any other approach for further improvement.
We also find that ICDA improves the performance of the RoBERTa-Large model on HWU and CLINC. This highlights the effectiveness of our method for enhancing intent detection models.
Moreover, the state-of-the-art performance on BANKING with the proposed method and RoBERTa-Large shows that our method is capable of generating high-quality augmentation examples to enhance the RoBERTa-Large model on the most fine-grained intent detection task.
Few-shot settings. In this setting, we use only a small number of instances (datapoints) per class. We evaluate our method in both 5-shot and 10-shot settings and compare it with several strong baselines. Our proposed method outperforms all baselines on all datasets in both 5-shot and 10-shot settings. ICDA-M achieves the best performance in 5-shot settings on the BANKING dataset, and ICDA-XL achieves the best performance on the HWU and CLINC datasets in 5-shot settings and on all datasets in 10-shot settings. All configurations o