Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information
基于逐点V信息的选择性上下文数据增强在意图检测中的应用
Abstract
摘要
This work focuses on in-context data augmentation for intent detection. Having found that augmentation via in-context prompting of large pre-trained language models (PLMs) alone does not improve performance, we introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints – utterances that correspond to given intents. It then employs intent-aware filtering, based on PVI, to remove datapoints that are not helpful to the downstream intent classifier. Our method is thus able to leverage the expressive power of large language models to produce diverse training data. Empirical results demonstrate that our method can produce synthetic training data that achieve state-of-the-art performance on three challenging intent detection datasets under few-shot settings ($1.28%$ absolute improvement in 5-shot and $1.18%$ absolute in 10-shot, on average) and perform on par with the state-of-the-art in full-shot settings (within $0.01%$ absolute, on average).
本研究专注于意图检测的上下文数据增强。我们发现仅通过大型预训练语言模型(PLM)的上下文提示进行增强无法提升性能,因此提出了一种基于PLM和点式V信息(PVI)的新方法——PVI是一种能衡量数据点对模型训练有用性的指标。该方法首先在小规模种子训练数据上微调PLM,随后合成新数据点(即对应给定意图的话语),并基于PVI进行意图感知过滤,以剔除对下游意图分类器无益的数据点。通过这种方式,我们的方法能够利用大语言模型的表达能力生成多样化训练数据。实验结果表明:在少样本场景下(5-shot绝对提升1.28%,10-shot平均提升1.18%),本方法生成的合成训练数据能在三个高难度意图检测数据集上达到最先进性能;在全样本场景下(平均绝对差异0.01%以内)与当前最优方法表现相当。
1 Introduction
1 引言
Intent detection, defined as the identification of a user’s intent given an utterance, is a fundamental element in task-oriented dialogue systems, usually occurring within the Natural Language Understanding (NLU) component. One of the practical challenges of training and deploying NLU modules is data scarcity, due to various reasons, such as under-represented languages, privacy and ethical concerns, or simply the cost of collecting and annotating sufficiently large amounts of data for new intents. Consequently, accurately identifying intents in limited-resource scenarios has drawn attention from the community (Papangelis et al., 2021; Mehri and Eric, 2021; Zhang et al., 2021b, for example).
意图检测被定义为根据用户话语识别其意图,是任务导向对话系统中的基础要素,通常发生在自然语言理解(NLU)组件中。训练和部署NLU模块的实际挑战之一是数据稀缺,原因包括语言覆盖不足、隐私和伦理问题,或仅为新意图收集和标注足够大量数据的成本。因此,在资源有限场景中准确识别意图引起了学界关注 (例如Papangelis等, 2021; Mehri和Eric, 2021; Zhang等, 2021b)。
There are three main families of approaches that address the challenge of limited data for intent detection: data augmentation (Peng et al., 2021; Li et al., 2021), focusing on generating high-quality synthetic training and evaluation data; few-shot learning (Zhang et al., 2020, 2021b), focusing on creating learning algorithms that can cope with limited amounts of data; and transfer learning (Namazifar et al., 2021), focusing on learning algorithms that can generalize across domains (therefore not requiring in-domain data). In this work, we follow the data augmentation approach, which is a general method that attempts to augment a human-authored dataset with a large set of synthetically generated instances. Most recent work has suggested using Pre-trained Language Models (PLMs) for data augmentation under various setups, e.g., (Peng et al., 2021), showing great improvements in performance. However, simply generating a large number of synthetic data points is not enough; we need to consider the quality of each data point, i.e., how beneficial it would be to the model’s performance if that synthetic data point is added to the training set. This is an important issue since the model might learn to overfit to synthetic datapoints (which may be low quality, represent specific use cases, etc.) and thus under-perform on real data.
针对意图检测中数据有限的挑战,主要有三类解决思路:数据增强 (data augmentation) (Peng et al., 2021; Li et al., 2021) ,专注于生成高质量的合成训练与评估数据;少样本学习 (few-shot learning) (Zhang et al., 2020, 2021b) ,专注于开发能应对数据稀缺的学习算法;以及迁移学习 (transfer learning) (Namazifar et al., 2021) ,专注于实现跨领域泛化的学习算法(从而无需领域内数据)。本研究采用数据增强方法,这是一种通过大量合成生成实例来扩展人工标注数据集的通用方案。最新研究表明,在不同配置下使用预训练语言模型 (Pre-trained Language Models, PLMs) 进行数据增强能显著提升性能,例如 (Peng et al., 2021) 。但仅生成大量合成数据并不足够,还需考量每个数据点的质量——即该合成数据加入训练集后对模型性能的提升价值。这一考量至关重要,因为模型可能过度拟合合成数据(可能存在质量低下、仅代表特定用例等问题),导致在真实数据上表现不佳。
In this work, we propose to apply Pointwise $\mathcal{V}$-Information (PVI) (Ethayarajh et al., 2022) for data augmentation, in a way that leverages a PLM to generate synthetic examples that are relevant and beneficial for training the downstream model, which in our case is an intent classifier. Our contributions are as follows:
在本工作中,我们提出应用逐点$\mathcal{V}$-信息 (PVI) (Ethayarajh et al., 2022) 进行数据增强,通过利用预训练语言模型 (PLM) 生成对下游模型训练相关且有益的合成样本,此处下游模型为意图分类器。我们的贡献如下:
• We propose a novel filtering method based on PVI (Ethayarajh et al., 2022) to filter out examples that are not relevant or helpful to the desired intent.
• We conduct experiments on three challenging intent detection datasets and show that our method achieves state-of-the-art performance.
• 我们提出了一种基于PVI (Ethayarajh et al., 2022) 的新型过滤方法,用于筛除与目标意图无关或无帮助的示例。
• 我们在三个具有挑战性的意图检测数据集上进行了实验,结果表明我们的方法实现了最先进的性能。
• We conduct an in-depth study and present a comprehensive analysis of the factors that influence performance, including ablation studies and comparisons with alternative methods.
• 我们进行了深入研究,并对影响性能的因素进行了全面分析,包括消融实验和与其他替代方法的对比。
The rest of the paper is organized as follows: In Section 2 we present relevant work and in Section 3 we introduce our method. In Sections 4 and 5 we discuss training details, experiments, and results. In Section 6, we present our analysis and discuss alternative approaches we investigated. In Section 7 we conclude, and in the following sections we discuss limitations and ethical considerations.
本文其余部分组织如下:第2节介绍相关工作,第3节阐述我们的方法。第4节和第5节讨论训练细节、实验及结果。第6节进行分析并探讨我们研究的替代方案。第7节总结全文,后续章节讨论局限性和伦理考量。
2 Related Work
2 相关工作
Intent Detection. Intent detection is the task of identifying the user’s intent by mapping the user’s natural language utterance into one of several predefined classes (Hemphill et al., 1990; Coucke et al., 2018). It is a critical component in the pipeline of task-oriented dialogue systems, as it is used to determine the user’s goal and to trigger an appropriate system action (Raux et al., 2005; Young et al., 2013). Several datasets have been proposed to evaluate the performance of intent detection models (Casanueva et al., 2020; Liu et al., 2019a; Larson et al., 2019, for some recent examples). With the availability of such datasets, intent detection has been extensively studied in the literature. Recently, pre-trained language models (e.g., BERT (Devlin et al., 2019)) have been shown to be effective in intent detection (Bunk et al., 2020; Zhang et al., 2020, 2021a,b; Mehri and Eric, 2021).
意图检测
意图检测是通过将用户的自然语言表述映射到若干预定义类别之一来识别用户意图的任务 (Hemphill et al., 1990; Coucke et al., 2018)。作为任务导向对话系统中的关键组件,它用于确定用户目标并触发适当的系统响应 (Raux et al., 2005; Young et al., 2013)。目前已提出多个数据集用于评估意图检测模型的性能 (Casanueva et al., 2020; Liu et al., 2019a; Larson et al., 2019)。随着这些数据集的普及,意图检测在学界得到了广泛研究。近期研究表明,预训练语言模型 (如 BERT (Devlin et al., 2019)) 在意图检测中表现优异 (Bunk et al., 2020; Zhang et al., 2020, 2021a,b; Mehri and Eric, 2021)。
Data Augmentation. Data augmentation is a widely used technique to address the problem of data scarcity. Paraphrasing the data is one of the ways frequently used for augmentation and can produce more diverse synthetic text with different word choices and sentence structures while preserving the meaning of the original text. Paraphrasing methods have been shown to be effective in many natural language processing tasks (Gupta et al., 2018; Edunov et al., 2018; Iyyer et al., 2018; Wei and Zou, 2019; Cai et al., 2020; Okur et al., 2022; Panda et al., 2021; Jolly et al., 2020). However, such methods often fail to generate more challenging and semantically diverse sentences that are important for the robustness of the downstream models.
数据增强 (Data Augmentation)
数据增强是一种广泛用于解决数据稀缺问题的技术。对数据进行复述是常用的增强方式之一,它能在保留原文本含义的同时,通过不同词汇选择和句式结构生成更多样化的合成文本。研究表明,复述方法在众多自然语言处理任务中具有显著效果 (Gupta et al., 2018; Edunov et al., 2018; Iyyer et al., 2018; Wei and Zou, 2019; Cai et al., 2020; Okur et al., 2022; Panda et al., 2021; Jolly et al., 2020)。然而,这类方法往往难以生成对下游模型鲁棒性至关重要的、更具挑战性和语义多样性的句子。
Recently, conditional generation – using a PLM to produce text conditioned on some label – has become the dominant paradigm of data augmentation (Bowman et al., 2016; Kumar et al., 2019; Anaby-Tavor et al., 2020; Kumar et al., 2020; Yang et al., 2020a; Lee et al., 2021). This is usually achieved by fine-tuning a language model to produce the original text given the label.
近期,条件生成(conditional generation)——使用预训练语言模型(PLM)根据特定标签生成文本——已成为数据增强(data augmentation)的主流范式 [Bowman et al., 2016; Kumar et al., 2019; Anaby-Tavor et al., 2020; Kumar et al., 2020; Yang et al., 2020a; Lee et al., 2021]。该方法通常通过对语言模型进行微调,使其在给定标签的条件下生成原始文本来实现。
In the field of intent detection, previous work has proposed using data augmentation techniques to generate synthetic training data (Sahu et al., 2022; Papangelis et al., 2021). Sahu et al. (2022) also used PLMs to generate augmented examples, but their approach requires human effort for labeling, which is costly because annotating large amounts of data is expensive.
在意图检测领域,先前的研究提出了使用数据增强技术生成合成训练数据 (Sahu et al., 2022; Papangelis et al., 2021)。Sahu等人 (2022) 也使用了预训练语言模型 (PLM) 来生成增强样本,但需要人工标注。由于标注大量数据的成本高昂,这是一项具有挑战性的任务。
Our approach involves data valuation, similar to the concepts of Ghorbani and Zou (2019) and Mindermann et al. (2022). However, our approach differs from such previous work in two key ways. First, Ghorbani and Zou (2019) only evaluated the quality of the training set after training on it, whereas we evaluate the synthetic examples before training the task model. Second, Mindermann et al. (2022) selected points that minimize the loss on a holdout set, whereas we select synthetic examples that are reasonably challenging to the task model. Our approach aims to address the problem of data scarcity by evaluating the synthetic examples generated by PLMs and selecting the most valuable examples to augment the training data.
我们的方法涉及数据估值,类似于Ghorbani和Zou (2019)以及Mindermann等人 (2022)的概念。然而,我们的方法与之前的工作在两个方面存在关键差异。首先,Ghorbani和Zou (2019)仅在训练后评估训练集的质量,而我们在训练任务模型之前评估合成样本。其次,Mindermann等人 (2022)选择的是最小化保留集损失的样本,而我们选择的是对任务模型具有合理挑战性的合成样本。我们的方法旨在通过评估PLM生成的合成样本并选择最有价值的样本来扩充训练数据,从而解决数据稀缺问题。
In-context Learning. Large language models such as GPT-3 (Brown et al., 2020) and OPT (Zhang et al., 2022) have been shown to be able to perform many natural language processing tasks with in-context learning. In this paradigm, the model is provided with a few exemplars based on which it performs the respective task.
上下文学习
GPT-3 (Brown et al., 2020) 和 OPT (Zhang et al., 2022) 等大语言模型已展现出通过上下文学习执行多种自然语言处理任务的能力。在该范式下,模型会基于提供的少量示例执行相应任务。
In-context learning is a promising solution for few-shot learning. Because of the effectiveness in few-shot performance, in-context learning has been applied to a wide range of NLP tasks. For dialogue tasks, in-context learning has been applied to intent classification (Yu et al., 2021), semantic parsing (Shin and Durme, 2022), and dialogue state tracking (Hu et al., 2022).
上下文学习 (in-context learning) 是少样本学习的一种有效解决方案。由于其在少样本性能上的优异表现,上下文学习已被广泛应用于各类自然语言处理任务中。在对话任务领域,该方法已成功应用于意图分类 (Yu et al., 2021)、语义解析 (Shin and Durme, 2022) 以及对话状态追踪 (Hu et al., 2022) 等场景。
However, PLMs require a large amount of computational resources, and the limitation on input length restricts the application of PLMs to intent detection tasks with large numbers of intents (e.g., 150 intents in CLINC (Larson et al., 2019)), where we cannot fit examples for each intent in the input. One solution would be to call the model multiple times, each time with a subset of the possible intents. This would lead to increased inference time and may also impact performance. Consequently, Yoo et al. (2021) and Sahu et al. (2022) leveraged in-context learning and PLMs to generate synthetic examples for intent detection, instead of directly deploying the PLM. However, they did not consider the quality of the generated examples, which may lead to the model overfitting on examples that are not relevant to the desired intent.
然而,预训练语言模型 (PLM) 需要大量计算资源,且输入长度限制阻碍了其在意图数量庞大的检测任务中的应用 (例如 CLINC 数据集 (Larson et al., 2019) 中的 150 种意图),因为此时我们无法在输入中为每个意图都配备示例。一种解决方案是多次调用模型,每次使用可能的意图子集。但这会增加推理时间,还可能影响性能。因此,Yoo等人 (2021) 和 Sahu等人 (2022) 利用上下文学习 (in-context learning) 和预训练语言模型 (PLM) 为意图检测生成合成示例,而非直接部署预训练语言模型。然而,他们未考虑生成示例的质量,这可能导致模型在与目标意图无关的示例上过拟合。
[Figure 1: panels "Prompt" and "Example Completions"]
3 In-Context Data Augmentation
3 上下文数据增强
In this section, we describe our proposed two-stage method for data augmentation, which we refer to as In-Context Data Augmentation (ICDA). The overall procedure is summarized in Algorithm 1. We apply ICDA to the task of few-shot intent detection, which involves classifying a user utterance $x$ into an intent label $y\in Y$ . ICDA aims to generate synthetic examples $x^{\prime}$ such that they would belong to a given intent $y$ .
在下一节中,我们将介绍提出的两阶段数据增强方法,称为上下文数据增强 (In-Context Data Augmentation, ICDA)。整体流程如算法1所示。我们将ICDA应用于少样本意图检测任务,该任务涉及将用户话语$x$分类为意图标签$y\in Y$。ICDA旨在生成合成样本$x^{\prime}$,使其属于给定意图$y$。
3.1 Synthesizing Examples
3.1 合成示例
The core idea is to use a large pre-trained language model such as GPT-3 (Brown et al., 2020) or OPT (Zhang et al., 2022) to generate synthetic data in the context of the training set. In particular, for each intent class, we create a natural language context (prompt) that contains the intent class name, a set of real training examples under the same intent class, and an incomplete example. For instance, the prompt for the intent class refund not showing up is shown in Figure 1. We feed the prompt to the language model and obtain a set of synthetic examples as outputs. In this work, we use OPT-66B (Zhang et al., 2022) as the language model to generate a set of examples for each intent class. We adopt typical decoding with $\tau=0.9$ (Meister et al., 2022) and set repetition penalty to 1.1 following Keskar et al. (2019) to generate the synthetic examples.$^{1}$ Due to the fine-grained nature of intents, and the sampling-based generation aiming to produce a set of diverse datapoints, we expect some of the generated utterances not to match the given intent.
核心思想是利用GPT-3 (Brown等人,2020) 或OPT (Zhang等人,2022) 等大型预训练大语言模型,在训练集语境下生成合成数据。具体而言,针对每个意图类别,我们会创建一个包含该意图类名称、同类真实训练示例集以及不完整示例的自然语言上下文 (prompt)。例如,图1展示了退款未到账意图类别的prompt结构。我们将prompt输入大语言模型后,即可获得一组输出结果作为合成示例。本研究采用OPT-66B (Zhang等人,2022) 作为生成各意图类别示例集的大语言模型,使用$\tau=0.9$的典型解码策略 (Meister等人,2022),并参照Keskar等人 (2019) 将重复惩罚系数设为1.1来生成合成示例。由于意图的细粒度特性及基于采样的生成机制旨在产生多样化数据点,我们预期部分生成语句可能与给定意图不匹配。
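As a rough illustration of the prompt described above, the sketch below assembles a context from the intent name, a few seed utterances, and an open final slot. The exact wording and layout of the prompt, the helper name `build_prompt`, and the two example utterances are hypothetical; only the three ingredients and the sampling values (typical decoding at 0.9, repetition penalty 1.1) come from the text. The sampling settings are written as keyword arguments in the style of Hugging Face `generate`, which is an assumed mapping rather than the authors' code.

```python
from typing import Dict, List


def build_prompt(intent_name: str, seed_utterances: List[str]) -> str:
    """Assemble an in-context prompt for one intent class.

    Hypothetical layout: a header naming the intent, numbered real examples,
    and a final incomplete example for the language model to continue.
    """
    lines = [f"The following sentences belong to the same category '{intent_name}':"]
    for i, utterance in enumerate(seed_utterances, start=1):
        lines.append(f"Example {i}: {utterance}")
    # Incomplete example: the model's continuation becomes a synthetic utterance.
    lines.append(f"Example {len(seed_utterances) + 1}:")
    return "\n".join(lines)


# Sampling settings reported in the text (assumed mapping to generate kwargs).
GENERATION_KWARGS: Dict[str, object] = {
    "do_sample": True,          # sampling-based generation for diversity
    "typical_p": 0.9,           # typical decoding with tau = 0.9
    "repetition_penalty": 1.1,  # repetition penalty of 1.1
}

# Made-up seed utterances, for illustration only.
prompt = build_prompt(
    "refund not showing up",
    ["I returned an item but my refund is missing.", "Where is my refund?"],
)
print(prompt)
```

Calling the language model several times with the same prompt and these sampling settings would then yield a set of candidate utterances for the intent.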
Note that our method leverages PLMs in a way that is orthogonal to the intent detection model. Unlike other methods that use the same model to directly predict the intent class of a user utterance, we use a PLM to generate synthetic training instances. These instances are then used to augment the actual training data and train a smaller intent detection model. This approach leverages the power of PLMs while preserving the independence of the intent detection model design.
需要注意的是,我们的方法以一种与意图检测模型正交的方式利用了预训练语言模型(PLM)。不同于其他直接使用同一模型预测用户话语意图类别的方法,我们使用PLM生成合成训练实例。这些实例随后被用于增强实际训练数据,并训练一个更小型的意图检测模型。该方法在发挥PLM威力的同时,保持了意图检测模型设计的独立性。
3.2 PVI Filtering
3.2 PVI 过滤
As mentioned above, given the stochastic nature of synthetic data generation, we expect some of the synthetic utterances not to match the given intent. To address this phenomenon, we filter generated instances and retain only those that are relevant and helpful to the desired intent classes.
如上所述,考虑到合成数据生成的随机性,我们预计部分合成语句无法匹配给定意图。为解决这一现象,我们对生成的实例进行筛选,仅保留与目标意图类别相关且有用的样本。
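We use PVI (Ethayarajh et al., 2022) as the filtering criterion. The PVI of an utterance $x$ with respect to its intent class $y$ is defined as

$$\mathrm{PVI}(x \rightarrow y) = -\log_{2} g^{*}[\varnothing](y) + \log_{2} g^{\prime}[x](y),$$

where, in this work, $g^{\prime}$ and $g^{*}$ are the intent detection models fine-tuned with and without the input $x$ , respectively. $\varnothing$ is a special token that is used to indicate the absence of an input utterance.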
在本工作中,$g^{\prime}$ 和 $g^{*}$ 分别是经过输入 $x$ 微调和未经过输入微调的意图检测模型。$\varnothing$ 是一个特殊 token,用于表示输入话语的缺失。
Intuitively, PVI measures the amount of information that the input $x$ provides to the intent detection model (compared to the absence of meaningful input). A high PVI value indicates that the input $x$ provides a lot of information to the model, and thus is more likely to be helpful when training the model to classify instances of the intent class $y$ . On the contrary, a low PVI value indicates that the input $x$ provides little information to the model, and thus is likely to be irrelevant to the intent class $y$ (Ethayarajh et al., 2022).
直观地说,PVI衡量了输入$x$为意图检测模型提供的信息量(相较于无意义输入)。高PVI值表明输入$x$为模型提供了大量信息,因此在训练模型对意图类别$y$的实例进行分类时更可能有帮助。相反,低PVI值表明输入$x$为模型提供的信息很少,因此很可能与意图类别$y$无关 (Ethayarajh et al., 2022)。
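The intuition can be checked with a small numeric sketch. Following the definition in Ethayarajh et al. (2022), PVI is the gap in log2-probability of the gold intent between a model that sees the utterance and a model that sees only the null input. The function name `pvi` and the probabilities below are hypothetical; in practice both probabilities would come from the two fine-tuned classifiers.

```python
import math


def pvi(p_with_input: float, p_null_input: float) -> float:
    """PVI in bits: log2 p(y | x) under g' minus log2 p(y | null) under g*."""
    if not (0.0 < p_with_input <= 1.0 and 0.0 < p_null_input <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log2(p_with_input) - math.log2(p_null_input)


# Hypothetical case: the null-input model spreads its mass over 64 intents.
p_null = 1 / 64

informative = pvi(p_with_input=0.5, p_null_input=p_null)       # input helps
uninformative = pvi(p_with_input=1 / 64, p_null_input=p_null)  # input adds nothing
misleading = pvi(p_with_input=1 / 256, p_null_input=p_null)    # input points elsewhere

print(informative, uninformative, misleading)  # 5.0 0.0 -2.0
```

A negative value, as in the last case, means the model is less likely to predict the labeled intent after reading the utterance than before, which is the signature of a synthetic utterance that drifted to a different intent.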
Table 1: To assess the impact of the synthetic data size on performance, we experiment with several data multipliers (synthetic data size $=$ source data size $\times$ multiplier).
表 1: 为评估合成数据规模对性能的影响,我们实验了多种数据乘数 (合成数据规模 $=$ 源数据规模 x 乘数)。
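| Algorithm 1: In-Context Data Augmentation with PVI Filtering |
|---|
| Input: task model $\mathcal{V}$, language model $\mathcal{PLM}$, data multiplier $m$, PVI threshold function $\epsilon$ |
| Output: task model $g$ |
| Data: seed data $D_{\text{train}} = \{(\text{input } x_i, \text{gold label } y_i)\}_{i=1}^{n}$ |
| 1: $g^{\prime} \leftarrow$ fine-tune $\mathcal{V}$ on $D_{\text{train}}$ |
| 2: $\varnothing \leftarrow$ empty string |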
We set a threshold $\epsilon$ (tunable parameter) to determine which $x$ are retained and conduct experiments to study the effect of the threshold in Section 6. Algorithm 1 defines $\epsilon$ as a function of $y$ to allow flexibility in its definition: either a fixed threshold for all intent classes, or a different threshold per intent class.
我们设定一个阈值 $\epsilon$ (可调参数) 来决定哪些 $x$ 被保留,并在第6节通过实验研究该阈值的影响。算法1将 $\epsilon$ 定义为 $y$ 的函数以灵活调整其定义方式:既可为所有意图类别设置固定阈值,也可为每个意图类别设置不同阈值。
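The two threshold choices can be sketched as follows. This is a minimal illustration, assuming that PVI scores have already been computed for validation and synthetic utterances, and that an utterance is kept when its PVI is strictly above the threshold (the text does not say how ties are handled). The function names and toy scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, List, Tuple

Scored = Tuple[str, str, float]  # (utterance, intent, PVI score)


def per_intent_threshold(validation: List[Scored]) -> Callable[[str], float]:
    """epsilon(y): average validation PVI of intent y."""
    by_intent = defaultdict(list)
    for _, intent, score in validation:
        by_intent[intent].append(score)
    averages = {intent: mean(scores) for intent, scores in by_intent.items()}
    return lambda intent: averages[intent]


def global_threshold(validation: List[Scored]) -> Callable[[str], float]:
    """epsilon(y): one average validation PVI shared by every intent."""
    average = mean([score for _, _, score in validation])
    return lambda intent: average


def pvi_filter(synthetic: List[Scored], epsilon: Callable[[str], float]) -> List[Tuple[str, str]]:
    """Keep synthetic (utterance, intent) pairs whose PVI exceeds epsilon(intent)."""
    return [(u, y) for u, y, score in synthetic if score > epsilon(y)]


# Toy scores, for illustration only.
validation = [("a", "card_lost", 4.0), ("b", "card_lost", 6.0),
              ("c", "refund", 1.0), ("d", "refund", 3.0)]
synthetic = [("s1", "card_lost", 5.5), ("s2", "card_lost", 4.0),
             ("s3", "refund", 2.5), ("s4", "refund", 0.5)]

kept_per_intent = pvi_filter(synthetic, per_intent_threshold(validation))
kept_global = pvi_filter(synthetic, global_threshold(validation))
print(kept_per_intent)  # [('s1', 'card_lost'), ('s3', 'refund')]
print(kept_global)      # [('s1', 'card_lost'), ('s2', 'card_lost')]
```

In the toy data the "refund" intent has lower PVI overall, so a single global threshold discards all of its synthetic utterances, while the per-intent threshold still keeps the relatively best one.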
4 Experimental Setup
4 实验设置
4.1 Datasets
4.1 数据集
To evaluate the effectiveness of our approach in intent detection in cases where we have a large number of often semantically similar intent labels, we chose the BANKING (Casanueva et al., 2020), HWU (Liu et al., 2019a), and CLINC (Larson et al., 2019) datasets and compare with recent state-of-the-art baselines. BANKING comprises 13,083 utterances in a single banking domain and 77 intents. HWU includes 25,716 utterances with 64 intents across 21 domains. CLINC contains 23,700 utterances with 150 intents across 20 domains.
为了评估我们的方法在存在大量语义相似意图标签情况下的意图检测效果,我们选择了 BANKING (Casanueva et al., 2020)、HWU (Liu et al., 2019a) 和 CLINC (Larson et al., 2019) 数据集,并与当前最先进的基线方法进行对比。BANKING 包含单一银行领域的 13,083 条话语和 77 种意图。HWU 涵盖 21 个领域的 25,716 条话语和 64 种意图。CLINC 包含 20 个领域的 23,700 条话语和 150 种意图。
| Size | Full-shot multiplier | Few-shot multiplier |
|---|---|---|
| XS | - | 1x |
| S | 1x | 4x |
| M | 2x | 16x |
| L | 4x | 64x |
| XL | - | 128x |
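To make the multipliers concrete, the sketch below applies the relation from the Table 1 caption (synthetic data size = source data size x multiplier) to a few-shot seed set. It reads the few-shot multipliers as 1x, 4x, 16x, 64x, and 128x for XS through XL, and it assumes the multiplier fixes the number of generated utterances before any filtering; both points are interpretations rather than statements from the text.

```python
# Few-shot multipliers as read from Table 1 (assumed column assignment).
FEW_SHOT_MULTIPLIERS = {"XS": 1, "S": 4, "M": 16, "L": 64, "XL": 128}


def synthetic_size(num_intents: int, shots: int, multiplier: int) -> int:
    """Number of generated utterances: seed size times the data multiplier."""
    seed_size = num_intents * shots
    return seed_size * multiplier


# BANKING has 77 intents, so a 5-shot seed set holds 385 utterances.
banking_5_shot = {
    name: synthetic_size(num_intents=77, shots=5, multiplier=mult)
    for name, mult in FEW_SHOT_MULTIPLIERS.items()
}
print(banking_5_shot)
# {'XS': 385, 'S': 1540, 'M': 6160, 'L': 24640, 'XL': 49280}
```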
4.2 Training
4.2 训练
In our experiments, we use RoBERTa-LARGE (Liu et al., 2019b) as the intent detection model $\mathcal{V}$ in Algorithm 1. We use OPT-66B$^{2}$ (Zhang et al., 2022) as the language model $\mathcal{PLM}$ to generate synthetic examples and set the data multiplier $m$ to 128.$^{3}$ We set the PVI threshold function $\epsilon$ to be the average PVI under each intent class in the validation set, where the PVI is computed using the same models as in Algorithm 1. We train RoBERTa-LARGE for 40 epochs with a batch size of 16, a learning rate of $1 \times 10^{-5}$ , and the AdamW optimizer (Loshchilov and Hutter, 2019). We use the Hugging Face Transformers library (Wolf et al., 2020) for all experiments.
在我们的实验中,我们使用RoBERTa-LARGE (Liu等人,2019b)作为算法1中的意图检测模型$\mathcal{V}$。我们采用OPT-66B$^{2}$ (Zhang等人,2022)作为语言模型$\mathcal{PLM}$来生成合成示例,并将数据乘数$m$设为128。$^{3}$ PVI阈值函数$\epsilon$设置为验证集中每个意图类别下的平均PVI值,其中PVI计算使用的模型与算法1相同。我们使用批量大小为16、学习率为$1 \times 10^{-5}$的AdamW优化器(Loshchilov和Hutter,2019)对RoBERTa-LARGE进行了40轮训练。所有实验均基于Hugging Face Transformers库(Wolf等人,2020)实现。
4.3 Baseline Models
4.3 基线模型
We compare our proposed method with the following baselines:
我们将提出的方法与以下基线进行比较:
RoBERTa-BASE + Classifier is a baseline that uses RoBERTa-BASE (Liu et al., 2019b) with a linear classifier on top (Zhang et al., 2020).
RoBERTa-BASE + 分类器是一个基线模型,它在RoBERTa-BASE (Liu et al., 2019b) 基础上添加了一个线性分类器 (Zhang et al., 2020)。
USE is a universal sentence encoder pre-trained on 16 languages supporting multiple down-stream tasks (Yang et al., 2020b).
USE是一种预训练于16种语言的通用句子编码器,支持多种下游任务 (Yang et al., 2020b)。
CONVERT is an intent detection model fine-tuned from dual encoder models, which are pre-trained on (input, response) pairs from Reddit (Henderson et al., 2020).
CONVERT 是一种基于双编码器模型微调的意图检测模型,其预训练数据来自 Reddit 的 (输入,响应) 对话对 (Henderson et al., 2020)。
CONVBERT fine-tunes BERT on a large open-domain dialogue corpus with 700 million conversations (Mehri et al., 2020).
CONVBERT 在包含7亿次对话的大型开放领域对话语料库上对 BERT 进行了微调 (Mehri et al., 2020)。
CONVBERT + Combined is an intent detection model based on CONVBERT, with example-driven training based on similarity matching and observers for transformer attentions. It also conducts task-adaptive self-supervised learning with masked language modeling (MLM) on the intent detection datasets. Here, “Combined” represents the best MLM+Example+Observers setting in the referenced paper (Mehri and Eric, 2021).
CONVBERT $^+$ Combined 是基于 CONVBERT 的意图检测模型,采用基于相似度匹配的示例驱动训练和针对 Transformer 注意力的观察器。该模型还在意图检测数据集上通过掩码语言建模 (MLM) 进行任务自适应自监督学习。此处的 "Combined" 代表引用论文 (Mehri and Eric, 2021) 中的最佳 MLM+示例+观察器组合设置。
DNNC (Discriminative Nearest-Neighbor Classification) is a discriminative nearest-neighbor model, which finds the best-matched example from the training set through similarity matching. The model conducts data augmentation during training and boosts performance by pre-training on three natural language inference tasks (Zhang et al., 2020).
DNNC(Discriminative Nearest-Neighbor Classification)是一种判别式最近邻模型,通过相似度匹配从训练集中寻找最佳匹配样本。该模型在训练时进行数据增强,并通过在三个自然语言推理任务上的预训练来提升性能 (Zhang et al., 2020)。
CPFT (Contrastive Pre-training and Fine-Tuning) is the current state-of-the-art in few-shot intent detection on the selected datasets. It is pre-trained on multiple intent detection datasets in a self-supervised contrastive manner and then fine-tuned with supervised contrastive learning (Zhang et al., 2021b).
CPFT (Contrastive Pre-training and Fine-Tuning) 是当前在选定数据集上少样本意图检测的最先进方法。该方法通过自监督对比方式在多个意图检测数据集上进行预训练,随后采用监督对比学习进行微调 (Zhang et al., 2021b)。
5 Experimental Results
5 实验结果
We conduct experiments on three benchmark datasets to validate the effectiveness of our proposed method. We first use OPT-66B to generate augmentation examples and then apply our method to enhance a RoBERTa-Large model trained on three datasets. We repeat all experiments with 5 random seeds and report the average performance in Full-shot and Few-shot settings. To investigate the effect of the synthetic data size, we experiment with a variety of multipliers (see Table 1 for notations). Results are shown in Table 2.
我们在三个基准数据集上进行实验,以验证所提方法的有效性。首先使用OPT-66B生成增强样本,然后将我们的方法应用于在三个数据集上训练的RoBERTa-Large模型。所有实验重复5次随机种子,并报告全量样本和少样本设置下的平均性能。为探究合成数据规模的影响,我们测试了多种乘数(相关符号见表1),结果如表2所示。
Full-shot settings. In this setting, we use the entire training set for each domain. The proposed method achieves the best performance on BANKING and comparable results on HWU and CLINC. In particular, on BANKING, we improve the CONVBERT + Combined baseline (Mehri and Eric, 2021) by $0.59%$ (absolute) and the RoBERTa-Large baseline by $0.72%$ (absolute). Compared with CONVBERT + Combined, which is pre-trained on intent detection datasets in a self-supervised fashion and adds example-driven training and a specific model architectural design, our method achieves similar results with a much simpler model design. Furthermore, our method is orthogonal to model architectures and can be integrated with any other approach for further improvement.
全样本设置。在此设置中,我们使用每个域的完整训练集。所提出的方法在BANKING上取得了最佳性能,在HWU和CLINC上取得了相当的结果。特别是在BANKING上,我们将CONVBERT$^+$ Combined基线 (Mehri和Eric,2021) 提高了$0.59%$ (绝对值),将RoBERTaLarge基线提高了$0.72%$ (绝对值)。与CONVBERT$^+$ Combined相比,该方法通过自监督方式在意图检测数据集上进行预训练,并增加了示例驱动的训练和特定的模型架构设计,而我们的方法以更简单的模型设计实现了相似的结果。此外,我们的方法与模型架构正交,可以与任何其他方法集成以进一步改进。
We also find that ICDA improves the performance of the RoBERTa-Large model on HWU and CLINC. This highlights the effectiveness of our method for enhancing intent detection models.
我们还发现,ICDA提升了RoBERTa-Large模型在HWU和CLINC数据集上的表现。这凸显了我们方法在增强意图检测模型方面的有效性。
Moreover, state-of-the-art performance on BANKING with the proposed method and RoBERTa-Large shows that our method is capable of generating high-quality augmentation examples to enhance the RoBERTa-Large model on the most fine-grained intent detection task.
此外,采用所提方法与RoBERTaLarge在BANKING数据集上达到的最先进性能表明,我们的方法能够生成高质量的数据增强样本,从而在最具细粒度的意图检测任务中提升RoBERTa-Large模型的表现。
Few-shot settings. In this setting we only use a small number of instances (datapoints) per class. We evaluate our method in both 5-shot and 10-shot settings and compare it with several strong baselines. Our proposed method outperforms all baselines on all datasets in both 5-shot and 10-shot settings. ICDA-M achieves the best performance in 5-shot settings on the BANKING dataset, and ICDA-XL achieves the best performance on the HWU and CLINC datasets in 5-shot settings and on all datasets in 10-shot settings. All configurations of our method significantly improve the performance of a RoBERTa-Large model trained on any of the three datasets. Compared with CPFT (Zhang et al., 2021b), which utilizes contrastive learning for few-shot intent detection with extra data, our method achieves better performance without any additional human-annotated data. This showcases the advantage of our method for few-shot intent detection.
少样本设置。在此设置中,我们每个类别仅使用少量实例(数据点)。我们在5样本和10样本设置下评估了我们的方法,并与多个强基线进行了比较。我们提出的方法在5样本和10样本设置下的所有数据集上均优于所有基线。ICDA-M在BANKING数据集的5样本设置中表现最佳,而ICDA-XL在HWU和CLINC数据集的5样本设置以及所有数据集的10样本设置中表现最佳。我们方法的所有配置均显著提升了基于RoBERTa-Large模型在三个数据集上的性能。与利用对比学习进行少样本意图检测并需要额外数据的CPFT (Zhang et al., 2021b) 相比,我们的方法无需任何额外人工标注数据即可实现更优性能。这展示了我们方法在少样本意图检测中的优势。
We also observe that our method consistently improves the performance of the baseline model as the number of synthetic datapoints increases from XS to XL. This indicates that the generated instances from our method can gradually cover more and more information of real instances and are capable of providing more useful information for model training.
我们还观察到,随着合成数据点数量从XS增加到XL,我们的方法持续提升了基线模型的性能。这表明通过我们方法生成的实例能逐步覆盖真实实例的更多信息,并为模型训练提供更有用的数据。
6 Analysis and Discussion
6 分析与讨论
In this section, we analyze the performance of ICDA and other approaches we tried. We first identify several factors that affect performance, and then present evidence that ICDA works by transferring knowledge from the pretrained generator to the task model. We then discuss a data-relabelling experiment and an experiment using uncertainty measures or data cartography (Swayamdipta et al., 2020) as filters.
在本节中,我们分析了ICDA及其他尝试方法的性能表现。首先列举影响性能的若干因素,随后通过实验证明ICDA通过将预训练生成器(pretrained generator)的知识迁移至任务模型来实现效果提升。接着讨论数据重标注实验,以及使用不确定性度量或数据制图(Swayamdipta et al., 2020)作为过滤器的实验。
6.1 Factors that Affect ICDA Performance
ICDA is effective at various training sizes. Throughout this work, we conduct experiments with different seed data sizes$^{4}$ to study the effect of training size. By looking at the results in Table 2, we observe that our proposed method consistently improves the accuracy of the downstream model in all training sizes. Also, as the training size decreases, we see that the ICDA improvement increases significantly. For example, on BANKING, the improvement goes from $0.72%$ in the full-shot setting to $5.02%$ as the training size decreases to 5-shot. This indicates that ICDA is more effective when little training data is available.
6.1 影响ICDA性能的因素
ICDA在不同训练规模下均表现优异。本研究通过调整种子数据规模4进行多组实验,以探究训练规模的影响。如表2所示,我们提出的方法在所有训练规模下都能持续提升下游模型的准确率。值得注意的是,随着训练规模减小,ICDA带来的性能提升显著增加。例如在BANKING数据集上,改进幅度从全量训练( full shot )时的$0.72%$跃升至5样本训练( 5-shot )时的$5.02%$,这表明ICDA在训练数据稀缺时效果更为显著。
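The improvements quoted above can be reproduced from the BANKING columns of Table 2. The short calculation below takes the RoBERTa-Large + Classifier row as the baseline and, for each training size, the best ICDA configuration reported for BANKING (ICDA-M for 5-shot, ICDA-XL for 10-shot, ICDA-L for full-shot); the variable names are illustrative.

```python
# BANKING accuracies (%) taken from Table 2.
baseline = {"5-shot": 78.99, "10-shot": 86.08, "full": 93.70}   # RoBERTa-Large + Classifier
best_icda = {"5-shot": 84.01, "10-shot": 89.79, "full": 94.42}  # ICDA-M, ICDA-XL, ICDA-L

# Absolute gain in percentage points, rounded to two decimals.
gains = {size: round(best_icda[size] - baseline[size], 2) for size in baseline}
print(gains)  # {'5-shot': 5.02, '10-shot': 3.71, 'full': 0.72}
```

The gain shrinks steadily as more real data becomes available, which matches the trend described in the paragraph above.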
Table 2: Intent Detection Accuracy (in $%$ ) in few-/full-shot settings with augmented data from OPT-66B. Numbers in bold are the best results and numbers with ∗ are statistically significant by t-test $(p<0.05)$ compared to the baselines (5 / 10 examples per intent).
表 2: 使用OPT-66B增强数据在少样本/全样本设置下的意图检测准确率(单位: $%$)。加粗数字表示最佳结果,带∗数字表示与基线(每个意图5/10个示例)相比具有统计显著性 $(p<0.05)$。
| Model | BANKING 5 | BANKING 10 | BANKING Full | HWU 5 | HWU 10 | HWU Full | CLINC 5 | CLINC 10 | CLINC Full |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-Base + Classifier | 74.04 | 84.27 | - | 75.56 | 82.90 | - | 87.99 | 91.55 | - |
| USE | 76.29 | 84.23 | 92.81 | 77.79 | 83.75 | 91.25 | 87.82 | 90.85 | 95.06 |
| CONVERT | 75.32 | 83.32 | 93.01 | 76.95 | 82.65 | 91.24 | 89.22 | 92.62 | 97.16 |
| USE+CONVERT | 77.75 | 85.19 | 93.36 | 80.01 | 85.83 | 92.62 | 90.49 | 93.26 | 97.16 |
| CONVBERT | - | 83.63 | 92.95 | - | 83.77 | 90.43 | - | 92.10 | 97.07 |
| + MLM | - | 83.99 | 93.44 | - | 84.52 | 92.38 | - | 92.75 | 97.11 |
| + MLM + Example | - | 84.09 | 94.06 | - | 83.44 | 92.47 | - | 92.35 | 97.11 |
| + Combined | - | 85.95 | 93.83 | - | 86.28 | 93.03 | - | 93.97 | 97.31 |
| DNNC | 80.40 | 86.71 | - | 80.46 | 84.72 | - | 91.02 | 93.76 | - |
| CPFT | 80.86 | 87.20 | - | 82.03 | 87.13 | - | 92.34 | 94.18 | - |
| RoBERTa-Large + Classifier | 78.99 | 86.08 | 93.70 | 74.44 | 84.11 | 92.13 | 89.89 | 93.56 | 96.80 |
| + ICDA-XS | 80.29 | 86.72 | - | 81.32 | 85.59 | - | 91.16 | 93.71 | - |
| + ICDA-S | 81.95 | 87.37 | 93.66 | 81.97 | 86.25 | 92.33 | 91.22 | 93.98 | 96.97 |
| + ICDA-M | 84.01∗ | 88.64 | 93.73 | 81.84 | 87.36 | 92.12 | 91.93 | 94.71 | 97.06 |
| + ICDA-L | 83.90 | 89.12 | 94.42∗ | 81.97 | 86.94 | 92.57 | 92.41 | 94.73 | 97.12 |
| + ICDA-XL | 83.90 | 89.79∗ | - | 82.45∗ | 87.41∗ | - | 92.62∗ | 94.84∗ | - |
Table 3: Intent Detection Accuracy (in $%$ ) for RoBERTa-Large model in 10-shot settings with ICDA-M synthetic instances from OPT-66B. Numbers in bold are statistically significant by t-test $(p<0.05)$ . “All” represents using all synthetic data without PVI filtering, and “All w/ relabeling” represents using “All” and an oracle intent classifier to relabel the synthetic data.
表 3: RoBERTa-Large模型在10样本设置下使用OPT-66B生成的ICDAM合成实例的意图检测准确率(单位: $%$)。加粗数字表示通过t检验具有统计显著性 $(p<0.05)$。"All"表示使用全部合成数据且未经过PVI过滤,"All w/ relabeling"表示使用"All"数据并通过预言级意图分类器重新标注合成数据。
| Model | BANKING | HWU | CLINC |
|---|---|---|---|
| RoBERTa-Large | 86.08 | 84.11 | 93.56 |
| All | 84.19 | 84.57 | 94.24 |
| All w/ relabeling | 87.05 | 85.22 | 93.02 |
| P | |||
| Global Low PVI | 73.99 | 69.61 | 85.42 |
| Global High PVI | 87.38 | 86.27 | 94.27 |
| Per-Intent Low PVI | 76.49 | 71.84 | 89.33 |
| Per-Intent High PVI | 88.64 | 87.36 | 94.71 |
PVI filtering threshold. To study the effect of the threshold function $\epsilon$ , we conduct experiments with two different threshold functions: Global and Per-Intent. Global means that the PVI threshold is the same for all intent classes, which is the average PVI value in the validation set. Per-Intent means that the PVI threshold is different for each intent class, which is the average PVI value under each intent class in the validation set. As a sanity check, we also conduct experiments using synthetic instances with PVI values lower than the threshold (Low PVI) as opposed to the normal (High PVI) instances.
PVI过滤阈值。为了研究阈值函数$\epsilon$的影响,我们采用两种不同阈值函数进行实验:全局(Global)和按意图(Per-Intent)。全局阈值指所有意图类别使用相同的PVI阈值,即验证集中的平均PVI值;按意图阈值指每个意图类别采用不同阈值,即验证集中各意图类别下的平均PVI值。作为验证对照,我们还对PVI值低于阈值的合成实例(Low PVI)进行了实验,与常规(High PVI)实例形成对比。
We show the results in Table 3 (bottom half), where we see that Per-Intent High PVI filtering performs the best. Compared to using all synthetic training data without filtering (referred to as $A l l$ ), we see that High PVI filtering in general helps in improving accuracy. In BANKING, for example, when PVI filtering is applied with Per-Intent High $P V I$ , the accuracy is $88.64%$ with 10-shot training size, which is significantly better than the result without PVI filtering $(84.19%$ ) – the same holds for the other two datasets. For the Low PVI conditions, we observe that performance drops significantly. This indicates that the model overfits on those examples that are not relevant to the desired intent. We discuss the $A l l w$ relabelling condition in Section 6.3.
In Figure 2, we plot the F1 score against the PVI score of the test set instances grouped by intent, showing that some classes are harder than others, further supporting why we need a threshold per class rather than a global one.
6.2 Why Does ICDA Work?
PVI filtering discards mislabeled examples. We believe that the success of ICDA is due not only to the high diversity of the synthetic instances produced by the generator, but also to the fact that PVI filtering effectively discards instances that digress from the prompted intent. To verify this hypothesis, we randomly sample several synthetic instances from the OPT-66B generator and manually assess whether each instance follows the same intent as the prompt label. We show some examples in Table 4. We observe that instances that are relevant to the desired intent are assigned high PVI values, and instances that are not relevant are assigned low PVI values. This further indicates that the per-intent threshold function provides an effective indicator of relevance. For example, in the BANKING dataset, most relevant instances have PVI values greater than 5.79 (the threshold listed in Table 4), and most non-relevant instances have PVI values below it. This suggests that PVI filtering is an effective method for discarding mislabeled datapoints.
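To make the scale of these PVI values concrete, the sketch below computes PVI in bits from two label probabilities, following the standard definition of pointwise V-information: the log-probability of the label under a model that sees the input, minus the log-probability under a model that sees an empty input. The uniform null model over 77 intents and the probabilities 0.90 and 0.005 are illustrative assumptions, not values from the paper.

```python
import math


def pvi(p_with_input: float, p_without_input: float) -> float:
    """PVI(x -> y) in bits: log2 p(y | x) - log2 p(y | empty input)."""
    return math.log2(p_with_input) - math.log2(p_without_input)


# Assumption: a 77-way task (as in BANKING) with a null model that is
# uniform over intents, so the maximum attainable PVI is log2(77) bits.
null_prob = 1 / 77
max_pvi = pvi(1.0, null_prob)

# Hypothetical classifier probabilities for the prompt label.
confident = pvi(0.90, null_prob)   # model strongly supports the prompt label
doubtful = pvi(0.005, null_prob)   # model mostly rejects the prompt label
```

Under these assumptions, an instance whose prompt label receives probability 0.90 lands above the 5.79 threshold shown in Table 4, while one that receives 0.005 gets a negative PVI, which matches the pattern of relevant and non-relevant examples in the table.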
Table 4: Synthetic examples generated from OPT-66B. $\dagger$ indicates the sentences that belong to the same intent as the prompt label according to our manual assessment, and bold denotes PVI values over the threshold for the given label.
| Dataset | Prompt label | Generated sentence | PVI |
|---|---|---|---|
| BANKING | Refund not showing up (PVI threshold: 5.79) | I didn't see my refund appear on my statement. † | **6.10** |
| | | Where did your refund end up at? Please send it back immediately since I had a return and then refunded your purchase in full and I want it all returned back to my credit card. | **5.81** |
| | | Can we please cancel my return & resend my goods again | 3.97 |
| | | Please confirm what is the reason for delay in payment for my purchase? | -3.86 |
| HWU | alarm query (PVI threshold: 4.28) | show me all the alarms and remind me about them † | **5.69** |
| | | i want to be notified when alarm goes off. † | **4.36** |
| | | how do i delete or disable alarms. | 3.18 |
| | | list all the events on this date | -5.13 |
| CLINC | accept reservation (PVI threshold: 6.53) | does hanover steakhouse take reservations † | **6.74** |
| | | are there any restaurants that take reservations for dinner in philadelphia † | **6.58** |
| | | how many days prior is required for making reservations | |

Figure 2: Intent Detection F1 score per intent class (circle) of the BANKING test set, justifying why we need a PVI threshold per intent.
Table 5: Quantitative metrics of fluency and diversity of real and synthetic utterances in 10-shot settings, as measured with distinct-1 (D-1), distinct-2 (D-2), self-BLEU, and perplexity (PPL).
| Data | Split | D-1 ↑ | D-2 ↑ | Self-BLEU ↓ | PPL ↓ |
|---|---|---|---|---|---|
| BANKING | Test | - | - | - | 12.14 |
| | 10-shot | 0.15 | 0.54 | 0.24 | 17.34 |
| | ICDA | 0.21 | 0.66 | 0.11 | 21.33 |
| HWU | Test | - | - | - | 14.84 |
| | 10-shot | 0.25 | 0.71 | 0.07 | 26.97 |
| | ICDA | 0.30 | 0.78 | 0.03 | 28.52 |
| CLINC | Test | - | - | - | 14.77 |
| | 10-shot | 0.15 | 0.49 | 0.28 | 34.23 |
| | ICDA | 0.20 | 0.60 | 0.17 | 37.34 |
ICDA produces fluent and diverse utterances. We hypothesize that our proposed method is effective because it introduces more fluent and diverse utterances. We therefore compare synthetic data under the 10-shot XS condition (i.e., we generate 10 synthetic datapoints) with the original 10-shot datapoints taken from the training data. We use a GPT-2 model trained on the test set of each benchmark dataset to calculate the perplexity (PPL) of the utterances, and we compute distinct-1, distinct-2, and self-BLEU on the same sets to measure diversity. We report the results in Table 5 and observe that our proposed method generates more diverse utterances, as shown by higher distinct-1 and distinct-2 scores and lower self-BLEU. This indicates that our proposed method harnesses the generation power of the OPT-66B generator. Additionally, the perplexity of the synthetic utterances is only slightly higher than that of the human-annotated training set (e.g., 21.33 vs. 17.34 on BANKING). These results suggest that our proposed method generates more diverse utterances, which can help the task model learn a better representation.
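For reference, the distinct-n diversity metric used above can be sketched as the ratio of unique n-grams to total n-grams over a set of utterances. The lowercasing and whitespace tokenization below are assumptions, since the paper does not state its tokenization here, and the sample utterances are made up for illustration.

```python
def distinct_n(utterances, n):
    """Ratio of unique n-grams to total n-grams across all utterances."""
    ngrams = []
    for utt in utterances:
        tokens = utt.lower().split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # An empty input has no n-grams; report 0.0 rather than dividing by zero.
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Hypothetical utterances: 8 unigrams (5 unique) and 6 bigrams (4 unique).
sample = ["show me my alarms", "show me my reminders"]
d1 = distinct_n(sample, 1)
d2 = distinct_n(sample, 2)
```

Higher values mean less repetition across the set, which is why the ICDA rows in Table 5, with higher D-1 and D-2 than the 10-shot rows, are read as more diverse.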
6.3 Data Relabelling
Following Sahu et al. (2022), we investigate whether it is effective to use the available data to train an intent classifier and then use it to relabel the synthetic data. Intuitively, such a method would correct mistakes made in the generation process. To test the feasibility of this approach, we train an oracle classifier using the entire training data of each dataset and treat the result as an upper bound. The results are shown in Table 3 ("All w/ relabeling"), where we see that this approach, while promising, underperforms ICDA.
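The relabeling step can be sketched as replacing each prompt label with a classifier's prediction while keeping every synthetic utterance. The `toy_classify` function below is a hypothetical keyword-based stand-in for the oracle classifier, and the label names are used only for illustration.

```python
def relabel(synthetic, classify):
    """Replace each prompt label with the classifier's predicted label."""
    return [(text, classify(text)) for text, _prompt_label in synthetic]


def toy_classify(text):
    """Hypothetical stand-in for an oracle intent classifier."""
    return "alarm_query" if "alarm" in text else "calendar_query"


# Both utterances were generated with the prompt label "alarm_query".
synthetic = [
    ("show me all the alarms", "alarm_query"),
    ("list all the events on this date", "alarm_query"),
]
relabeled = relabel(synthetic, toy_classify)
```

Unlike PVI filtering, which discards the off-intent second utterance, relabeling keeps it and assigns it a different label, so the resulting training set depends on how accurate the relabeling classifier is.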
7 Conclusion
We introduced In-Context Data Augmentation (ICDA), a novel data augmentation framework that generates synthetic training data while preserving quality and diversity. We demonstrate that ICDA is effective on multiple intent detection benchmarks, with state-of-the-art few-shot performance. Our analysis shows that ICDA tends to perform better in low-resource settings and that our PVI filtering strategy is important for performance. Future work includes applying ICDA to other conversational understanding tasks such as slot filling and dialogue state tracking, and incorporating other filtering or data selection strategies for further performance gains.
Limitations
In this section we take BANKING as a case study to motivate PVI and discuss some of the limitations of our approach. Figure 3 shows how much we gain (or lose) in F1 score when we use a custom threshold for each class vs. a fixed threshold. While most classes benefit, there are clearly many that show performance degradation. Another limitation is the size of the model we use to generate synthetic instances (OPT-66B); in general the larger the model is, the better the generated data is.
Ethical Considerations
As with any work involving PLMs (or foundation models), due to the data and training methods, there is inherent risk of generating biased, toxic, harmful, or otherwise unwanted output. Regarding our work in particular, as we show in Figure 3, the model’s performance on some of the classes can degrade. More analysis needs to be done before deploying our approach, since it is unclear whether it will introduce a bias towards certain types of classes.

Figure 3: Difference in intent detection F1 score for each intent when using a per-class PVI threshold versus a fixed PVI threshold. See the larger figure in the Appendix.
