[论文翻译]随时主动学习


原文地址:https://cdn.aaai.org/ojs/9015/9015-13-12543-1-2-20201228.pdf


Anytime Active Learning

随时主动学习

Abstract

摘要

A common bottleneck in deploying supervised learning systems is collecting human-annotated examples. In many domains, annotators form an opinion about the label of an example incrementally — e.g., each additional word read from a document or each additional minute spent inspecting a video helps inform the annotation. In this paper, we investigate whether we can train learning systems more efficiently by requesting an annotation before inspection is fully complete — e.g., after reading only 25 words of a document. While doing so may reduce the overall annotation time, it also introduces the risk that the annotator might not be able to provide a label if interrupted too early. We propose an anytime active learning approach that optimizes the annotation time and response rate simultaneously. We conduct user studies on two document classification datasets and develop simulated annotators that mimic the users. Our simulated experiments show that anytime active learning outperforms several baselines on these two datasets. For example, with an annotation budget of one hour, training a classifier by annotating the first 25 words of each document reduces classification error by 17% over annotating the first 100 words of each document.

部署监督学习系统时的一个常见瓶颈是收集人工标注的样本。在许多领域,标注者会逐步形成对样本标签的判断——例如,从文档中每多读一个单词,或多花一分钟检查视频,都有助于完善标注。本文研究是否可以通过在检查未完全结束时(例如仅阅读文档的前25个单词)就请求标注,来更高效地训练学习系统。虽然这样做可能减少总体标注时间,但也存在标注者若被打断过早可能无法提供标签的风险。我们提出了一种随时主动学习(anytime active learning)方法,可同时优化标注时间和响应率。我们在两个文档分类数据集上进行了用户研究,并开发了模拟用户行为的模拟标注器。模拟实验表明,在这两个数据集上,随时主动学习优于多种基线方法。例如,在1小时的标注预算下,通过标注每份文档前25个单词训练的分类器,其分类错误率比标注每份文档前100个单词降低了17%。

Introduction

引言

Active learning is a machine learning approach that seeks to maximize classifier accuracy while minimizing the effort of human annotators (Settles 2012). This is typically done by prioritizing example annotation according to the utility to the classifier.

主动学习 (active learning) 是一种机器学习方法,旨在最大化分类器准确率的同时最小化人工标注的工作量 (Settles 2012)。其通常的做法是,根据样本对分类器的效用来确定标注的优先顺序。

In this paper, we begin with the simple observation that in many domains human annotators form an opinion about the label of an example incrementally. For example, while reading a document, an annotator makes a more informed decision about the topic assignment as each word is read. Similarly, in video classification the annotator becomes more certain of the class label the longer she watches the video.

在本文中,我们首先观察到:在许多领域中,人类标注者会逐步形成对样本标签的判断。例如,在阅读文档时,标注者每读一个单词就能对主题分配做出更准确的决策;同样,在视频分类任务中,标注者观看视频时间越长,对类别标签的确定性就越高。

The question we ask is whether we can more efficiently train a classifier by interrupting the annotator to ask for a label, rather than waiting until the annotator has fully completed her inspection. For example, in document classification the active learner may request the label after the annotator has read the first 50 words of the document. For video classification, the active learner may decide to show only a short clip. We refer to this approach as anytime active learning (AAL), by analogy to anytime algorithms, whose execution may be interrupted at any time to provide an answer.

我们探讨的问题是:通过打断标注者以请求标签,而非等待其完成全部检查,是否能更高效地训练分类器。例如,在文档分类任务中,主动学习器可能在标注者读完前50个词时就请求标签;在视频分类中,主动学习器可能决定只向标注者展示一个短片段。我们将这种方法称为随时主动学习 (anytime active learning, AAL),类比于可在任意时刻中断执行并给出答案的随时算法 (anytime algorithms)。

If the decision of when to interrupt the annotator is made optimally, we can expect to reduce total annotation effort by eliminating unnecessary inspection time that does not affect the returned label. However, the annotator may not be able to provide a label if interrupted too early — e.g., the annotator will not know how to label a document after seeing only the first word. AAL strategies, then, must balance two competing objectives: (1) the time spent annotating an instance (annotation cost); (2) the likelihood that the annotator will be able to produce a label (annotation response rate). In this paper, we propose and evaluate a number of anytime active learning strategies applied to the domain of document classification. In this domain, it is natural to implement this approach by revealing only the first $k$ words to the annotator, which we refer to as a subinstance.

如果能在最佳时机决定何时中断标注员,我们有望通过消除不影响最终标签的不必要检查时间来降低总体标注工作量。然而,若过早中断,标注员可能无法提供标签——例如仅看到文档第一个单词时,标注员将无从判断如何标注。因此,随时主动学习 (AAL) 策略需要平衡两个相互制约的目标:(1) 单个样本的标注耗时(标注成本);(2) 标注员能给出标签的概率(标注响应率)。本文提出并评估了多种适用于文档分类领域的随时主动学习策略。在该领域,通过仅向标注员展示前 $k$ 个单词(我们称之为子实例)来实现这一方法是自然的选择。

We first conduct user studies to estimate annotation times and response rates, and then create simulated oracles that mimic the human annotators. We perform simulated-oracle experiments on two document classification tasks, comparing two classes of anytime active learning strategies: (1) static strategies select subinstances of a fixed size; (2) dynamic strategies select subinstances of varying sizes, optimizing cost and response rate simultaneously. Our research questions and answers are as follows:

我们首先通过用户研究估算标注时间和响应率,随后创建模拟人类标注员的仿真预言机。我们在两个文档分类任务上进行仿真预言机实验,比较两类随时主动学习策略:(1) 静态策略选择固定大小的子实例;(2) 动态策略选择可变大小的子实例,同时优化成本和响应率。研究问题及结论如下:

RQ1. How does subinstance size affect human annotation time and response rate? We conducted a user study in which each user labeled 480 documents from two domains under different interruption conditions (e.g., seeing only the first $k$ words). We find that as subinstance sizes increase, both response rates and annotation times increase (non-linearly), and that the rate of increase varies by dataset.

RQ2. How do static AAL strategies compare with traditional active learning? We find that simple static strategies result in significantly more efficient learning, even with few words shown per document. For example, with an annotation budget of one hour, labeling only the first 25 words of each document reduces classification error by 17% compared with labeling the first 100 words of each document.

RQ1. 子实例规模如何影响人工标注时间和响应率?我们开展了一项用户研究,让每位用户在不同中断条件下(例如仅查看前$k$个词)标注来自两个领域的480份文档。研究发现随着子实例规模增大,响应率和标注时间均呈非线性增长,且增长率因数据集而异。
RQ2. 静态AAL策略与传统主动学习相比效果如何?研究发现即使每份文档仅显示少量词汇,简单静态策略也能实现显著更高效的学习。例如在1小时标注预算下,相比标注每份文档前100个词,仅标注前25个词可使分类错误率降低17%。

RQ3. How do dynamic AAL strategies compare with static strategies? The drawback of the static strategy is that we must select a subinstance size ahead of time; however, we find that the optimal size varies by dataset. Instead, we formulate a dynamic AAL algorithm to minimize cost while maximizing response rates. We find that this dynamic approach performs as well as or better than the best static strategy, without the need for additional tuning.

RQ3. 动态AAL策略与静态策略相比如何?静态策略的缺点是我们必须提前选择子实例大小,但我们发现最优大小因数据集而异。为此,我们设计了一种动态AAL算法,在最小化成本的同时最大化响应率。结果表明,这种动态方法无需额外调参,其表现与最佳静态策略相当或更优。

The remainder of the paper is organized as follows: we first formalize the anytime active learning problem, then propose static and dynamic solutions. Next, we describe our user studies and how they are used to inform our simulation experiments. Finally, we present the empirical results and discuss their implications.

本文的其余部分组织如下:我们首先形式化随时主动学习 (anytime active learning) 问题,然后提出静态和动态解决方案。接着描述用户研究及其对模拟实验的指导作用。最后展示实证结果并讨论其意义。

Anytime Active Learning (AAL)

随时主动学习 (AAL)

In this section, we first review standard active learning and then formulate our proposed anytime extension.

在本节中,我们首先回顾标准主动学习 (active learning) ,然后阐述我们提出的随时扩展方案。

Problem Formulation

问题描述

Let $\mathcal{L}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{l}$ be a labeled dataset where $\mathbf{x}_{i}\in\mathbb{R}^{d}$ is a $d$-dimensional feature vector and $y_{i}\in\{y^{0},y^{1}\}$ is its class label. Let $\mathcal{U}=\{\mathbf{x}_{i}\}_{i=l+1}^{m}$ be a set of unlabeled examples. Let $P_{\mathcal{L}}(y|\mathbf{x})$ be the conditional probability of $y$ given $\mathbf{x}$ according to a classifier trained on $\mathcal{L}$.

令 $\mathcal{L}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{l}$ 为带标签数据集,其中 $\mathbf{x}_{i}\in\mathbb{R}^{d}$ 是 $d$ 维特征向量,$y_{i}\in\{y^{0},y^{1}\}$ 是其类别标签。令 $\mathcal{U}=\{\mathbf{x}_{i}\}_{i=l+1}^{m}$ 为未标注样本集。设 $P_{\mathcal{L}}(y|\mathbf{x})$ 表示基于 $\mathcal{L}$ 训练的分类器给出的 $\mathbf{x}$ 条件下 $y$ 的条件概率。

Typical pool-based active learning selects instances $\mathcal{U}^{*}\subseteq\mathcal{U}$ to be labeled by a human annotator (oracle) and appended to $\mathcal{L}$. Assuming a prespecified annotation budget $B$ and an annotation cost function $C(\mathbf{x})$, the goal of the active learning algorithm (student) is to select $\mathcal{U}^{*}$ to minimize the classifier's generalization error subject to the budget constraints:

典型的池式主动学习选择实例 $\mathcal{U}^{*}\subseteq\mathcal{U}$ 交由人工标注员(oracle)进行标注并追加到 $\mathcal{L}$ 中。在给定预设标注预算 $B$ 和标注成本函数 $C(\mathbf{x})$ 的情况下,主动学习算法(student)的目标是选择 $\mathcal{U}^{*}$,以在预算约束下最小化分类器的泛化误差:

$$
\mathcal{U}^{*}\leftarrow\underset{\mathcal{U}_{i}\subseteq\mathcal{U}}{\operatorname{argmin}}\,Err(P_{\mathcal{L}\cup\mathcal{U}_{i}}(y|\mathbf{x}))\quad\mathrm{s.t.}\sum_{\mathbf{x}_{j}\in\mathcal{U}_{i}}C(\mathbf{x}_{j})\leq B
$$

$$
\mathcal{U}^{*}\leftarrow\underset{\mathcal{U}_{i}\subseteq\mathcal{U}}{\operatorname{argmin}}\,Err(P_{\mathcal{L}\cup\mathcal{U}_{i}}(y|\mathbf{x}))\quad\mathrm{s.t.}\sum_{\mathbf{x}_{j}\in\mathcal{U}_{i}}C(\mathbf{x}_{j})\leq B
$$

Equation 1 is typically optimized by greedy algorithms, selecting one or more examples at a time according to some heuristic criterion that estimates the utility of each labeled example. A common approach is to request a label for the unlabeled instance that maximizes the benefit-cost ratio: $\mathbf{x}_{i}^{*}\leftarrow\operatorname{argmax}_{\mathbf{x}_{i}\in\mathcal{U}}\frac{U(\mathbf{x}_{i})}{C(\mathbf{x}_{i})}$.

方程1通常通过贪心算法优化,每次根据某种启发式准则选择一个或多个样本,该准则用于估计每个已标注样本的效用。常见方法是请求效益成本比最大的未标注实例的标签:$\mathbf{x}_{i}^{*}\leftarrow\operatorname{argmax}_{\mathbf{x}_{i}\in\mathcal{U}}\frac{U(\mathbf{x}_{i})}{C(\mathbf{x}_{i})}$。
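As a concrete illustration, the greedy benefit-cost rule can be sketched as follows; the utility and cost tables are made-up numbers, not values from the paper:

```python
# Greedy benefit-cost selection: pick the unlabeled example maximizing
# U(x) / C(x). The utility/cost tables below are hypothetical.

def select_instance(unlabeled, utility, cost):
    """Return the example with the highest utility-to-cost ratio."""
    return max(unlabeled, key=lambda x: utility(x) / cost(x))

U = {"a": 0.9, "b": 0.5, "c": 0.8}   # U(x): estimated utility
C = {"a": 3.0, "b": 1.0, "c": 2.0}   # C(x): annotation cost (seconds)
best = select_instance(U, U.get, C.get)   # ratios: a=0.30, b=0.50, c=0.40
```

Note that a cheap example ("b") can win over a higher-utility one ("a") once cost is in the denominator.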

Various definitions of utility $U(\cdot)$ are used in the literature, such as expected error reduction (Roy and McCallum 2001) and classifier uncertainty (Lewis and Gale 1994).

文献中使用了多种效用函数 $U(\cdot)$ 的定义方式,例如预期误差减少 (Roy and McCallum 2001) 和分类器不确定性 (Lewis and Gale 1994)。

We propose an alternative formulation of the active learning problem in which the student has the added capability of interrupting the human oracle to request a label while the annotation of $\mathbf{x}_{i}$ is being performed. For example, in video classification, the student may request a label after the oracle has spent only one minute watching the video. Similarly, in document classification the student may request a label after the oracle has read only the first ten words of a document.

我们提出了一种主动学习问题的替代性表述,其中学生具备在标注$\mathbf{x}_{i}$过程中中断人类标注者并请求标签的附加能力。例如,在视频分类任务中,学生可以在标注者仅观看视频一分钟后就请求标签;同样,在文档分类场景下,学生可以在标注者刚阅读文档前十个单词时请求标签。

Let $\mathbf{x}_{i}^{k}$ indicate this abbreviated instance, which we call a subinstance. The nature of subinstances will vary by domain. For example, $k$ could indicate the time allotted to inspect the instance. In this paper, we focus on document classification, where it is natural to let $\mathbf{x}_{i}^{k}$ be the first $k$ words of document $\mathbf{x}_{i}$.

设 $\mathbf{x}_{i}^{k}$ 表示这个简化的实例,我们称之为子实例。子实例的性质会因领域而异。例如,$k$ 可以表示检查实例所分配的时间。在本文中,我们专注于文档分类任务,此时自然可将 $\mathbf{x}_{i}^{k}$ 定义为文档 $\mathbf{x}_{i}$ 的前 $k$ 个单词。

The potential savings from this approach arise from the assumption that $C(\mathbf{x}_{i}^{k})<C(\mathbf{x}_{i})$; that is, subinstances are less costly to label than full instances. While the magnitude of these savings is data-dependent, our user studies below show substantial savings for document classification.

这种方法潜在的节省源于一个假设:$C(\mathbf{x}_{i}^{k})<C(\mathbf{x}_{i})$,即子实例的标注成本低于完整实例。虽然节省的具体幅度取决于数据,但我们下面的用户研究表明,在文档分类任务中可以实现显著的节省。

The immediate problem with this approach is that $\mathbf{x}_{i}^{k}$ may be considerably more difficult for the oracle to label. We therefore must account for imperfect oracles (Donmez and Carbonell 2008; Yan et al. 2011). There are at least two scenarios to consider — (1) a noisy oracle produces a label for any $\mathbf{x}_{i}^{k}$, but that label may be incorrect; (2) a reluctant oracle may decide not to produce a label for some examples, but labels that are produced are assumed to be correct. Our user studies below suggest that the latter case is more common; thus, in this paper, we restrict our attention to reluctant oracles, leaving noisy oracles for future work.

这种方法直接面临的问题是,$\mathbf{x}_{i}^{k}$ 可能对标注者(oracle)来说标注难度显著增加。因此我们必须考虑不完美的标注者 (Donmez and Carbonell 2008; Yan et al. 2011)。至少有两种情况需要考虑:(1) 噪声标注者会给任何 $\mathbf{x}_{i}^{k}$ 生成标签,但标签可能是错误的;(2) 犹豫标注者可能拒绝为某些样本生成标签,但其生成的标签都假定是正确的。我们后续的用户研究表明后者更常见;因此本文仅关注犹豫标注者的情况,将噪声标注者留待未来研究。

In each interaction between the student and oracle, the student presents a subinstance $\mathbf{x}_{i}^{k}$ to the oracle, and the oracle returns an answer $a\in\{y^{0},y^{1},n\}$, where the answer can be either the correct label $y$ or neutral, $n$, which represents an "I don't know" answer. If the oracle returns a non-neutral answer $a$ for $\mathbf{x}_{i}^{k}$, the student adds $\mathbf{x}_{i}$ and the returned label ($y^{0}$ or $y^{1}$) to its training data and updates its classifier. If $n$ is returned, the labeled data is unchanged. In either case, the annotation cost $C(\mathbf{x}_{i}^{k})$ is deducted from the student's budget because the oracle spends time inspecting $\mathbf{x}_{i}^{k}$ even if she returns a neutral label. To choose the optimal subinstance, the student must consider both the cost of the subinstance and the likelihood that a non-neutral label will be returned. Below, we propose two AAL strategies.

在学生与预言机的每次交互中,学生会向预言机提交一个子实例 $\mathbf{x}_{i}^{k}$,预言机返回一个答案 $a\in\{y^{0},y^{1},n\}$,其中答案可以是正确标签 $y$,也可以是中性答案 $n$(表示"我不知道")。如果预言机为 $\mathbf{x}_{i}^{k}$ 返回非中性答案 $a$,学生会将 $\mathbf{x}_{i}$ 及返回的标签($y^{0}$ 或 $y^{1}$)添加到其训练数据中,并更新其分类器。如果返回 $n$,则标注数据保持不变。无论哪种情况,标注成本 $C(\mathbf{x}_{i}^{k})$ 都会从学生的预算中扣除,因为即使预言机返回中性标签,她仍需花费时间检查 $\mathbf{x}_{i}^{k}$。为了选择最优子实例,学生必须同时考虑子实例的成本以及返回非中性标签的可能性。接下来,我们提出两种AAL策略。
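A minimal sketch of this interaction protocol; all names are illustrative, and `NEUTRAL` stands for the oracle's $n$ answer:

```python
# One student-oracle round: the oracle answers with a label or NEUTRAL,
# and the subinstance cost is charged to the budget either way.

NEUTRAL = "n"

def aal_round(subinstance, full_instance, oracle, cost, state):
    """state holds 'budget' and 'labeled' (list of (instance, label) pairs)."""
    answer = oracle(subinstance)
    state["budget"] -= cost(subinstance)      # paid even for a neutral answer
    if answer != NEUTRAL:                     # non-neutral: keep the label
        state["labeled"].append((full_instance, answer))
    return answer

state = {"budget": 10.0, "labeled": []}
aal_round("first 10 words ...", "full document", lambda s: NEUTRAL,
          lambda s: 2.0, state)               # neutral: costs 2.0, no label
aal_round("first 50 words ...", "full document", lambda s: "y1",
          lambda s: 4.0, state)               # non-neutral: costs 4.0
```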

Static AAL Strategies

静态AAL策略

We first consider a simple, static approach to AAL that decides a priori on a fixed subinstance size $k$. For example, the student fixes $k=10$ and presents the oracle subinstances $\mathbf{x}_{i}^{10}$ (please see Algorithm 1).

我们首先考虑一种简单、静态的AAL方法,该方法预先确定一个固定的子实例大小$k$。例如,学生固定$k=10$,并向预言机提交子实例$\mathbf{x}_{i}^{10}$ (请参阅算法1)。

Let $\mathcal{U}^{k}=\{\mathbf{x}_{i}^{k}\}_{i=l+1}^{m}$ be the set of all unlabeled subinstances of fixed size $k$. In SELECTSUBINSTANCE (line 3), the student picks $\mathbf{x}_{i}^{k*}$ as follows:

设 $\mathcal{U}^{k}=\{\mathbf{x}_{i}^{k}\}_{i=l+1}^{m}$ 为所有固定大小为 $k$ 的未标注子实例的集合。在 SELECTSUBINSTANCE(第3行)中,学生按如下方式选择 $\mathbf{x}_{i}^{k*}$:

$$
\mathbf{x}_{i}^{k*}\leftarrow\operatorname*{argmax}_{\mathbf{x}_{i}^{k}\in\mathcal{U}^{k}}\frac{U(\mathbf{x}_{i})}{C(\mathbf{x}_{i}^{k})}
$$

$$
\mathbf{x}_{i}^{k*}\leftarrow\operatorname*{argmax}_{\mathbf{x}_{i}^{k}\in\mathcal{U}^{k}}\frac{U(\mathbf{x}_{i})}{C(\mathbf{x}_{i}^{k})}
$$

Note that the utility is computed from the full instance $\mathbf{x}_{i}$, not the subinstance, since $\mathbf{x}_{i}$ will be added to our labeled set (line 8). In our experiments, we consider two utility functions: uncertainty (static-k-unc), which sets $U(\mathbf{x}_{i}^{k})=1-\operatorname{max}_{y}P_{\mathcal{L}}(y|\mathbf{x}_{i})$, and constant (static-k-const), which sets the utility of each subinstance to one, $U(\mathbf{x}_{i}^{k})=1$. We use static-k-const as a baseline for other AAL methods because it is an anytime version of random sampling.

需要注意的是,效用是从完整实例 $\mathbf{x}_{i}$ 而非子实例计算的,因为 $\mathbf{x}_{i}$ 将被添加到我们的标注集中(第8行)。在实验中,我们考虑两种效用函数:不确定性 (static-k-unc),设定 $U(\mathbf{x}_{i}^{k})=1-\operatorname{max}_{y}P_{\mathcal{L}}(y|\mathbf{x}_{i})$;以及常量 (static-k-const),将每个子实例的效用设为1,即 $U(\mathbf{x}_{i}^{k})=1$。我们将 static-k-const 作为其他随时主动学习 (AAL) 方法的基线,因为它是随机采样的随时版本。
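A sketch of the static-k scoring rule (Equation 2) with the uncertainty utility; the posterior values and the cost are invented for illustration:

```python
# static-k selection: score each candidate by U(x_i) / C(x_i^k), where the
# utility is computed on the FULL instance and the cost on the subinstance.

def uncertainty_utility(posterior):
    """U(x) = 1 - max_y P(y|x): largest where the classifier is least sure."""
    return 1.0 - max(posterior)

def static_select(candidates, subinstance_cost):
    # candidates: (doc_id, posterior over the full document) pairs.
    # For a fixed k the cost is constant, so this ranks by utility alone.
    return max(candidates,
               key=lambda c: uncertainty_utility(c[1]) / subinstance_cost)[0]

docs = [("d1", [0.90, 0.10]), ("d2", [0.55, 0.45]), ("d3", [0.70, 0.30])]
picked = static_select(docs, subinstance_cost=8.2)   # d2 is most uncertain
```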

Algorithm 1 Static Anytime Active Learning

算法 1 静态随时主动学习

1: 输入: 已标注数据 $\mathcal{L}$; 未标注数据 $\mathcal{U}$; 预算 $B$; 分类器 $P(y|\mathbf{x})$; 子实例大小 $k$
2: while $B>0$ do
3: $\mathbf{x}_{i}^{k}\leftarrow$ SELECTSUBINSTANCE($\mathcal{U}, k$)
4: $\mathcal{U}\leftarrow\mathcal{U}\setminus\{\mathbf{x}_{i}\}$
5: $a\leftarrow$ QUERYORACLE($\mathbf{x}_{i}^{k}$)
6: $B\leftarrow B-C(\mathbf{x}_{i}^{k})$
7: if $a\neq n$ then //非中性响应
8: $\mathcal{L}\leftarrow\mathcal{L}\cup\{(\mathbf{x}_{i},a)\}$
9: $P(y|\mathbf{x})\leftarrow$ UPDATECLASSIFIER($\mathcal{L}, P$)
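Putting the pieces together, a runnable sketch of the static AAL loop with a toy reluctant oracle; all components (the oracle, costs, and utility) are stand-ins for those defined in the text:

```python
# Static AAL loop: repeatedly select a document, show its first k words to
# the oracle, charge the cost, and keep only non-neutral answers.

def static_aal(pool, k, budget, oracle, cost, utility):
    labeled, pool = [], dict(pool)           # pool: doc_id -> full text
    while budget > 0 and pool:
        doc_id, text = max(pool.items(), key=lambda kv: utility(kv[1]))
        sub = text.split()[:k]               # the subinstance x_i^k
        answer = oracle(sub)
        budget -= cost(sub)                  # charged even if neutral
        pool.pop(doc_id)
        if answer is not None:               # non-neutral response
            labeled.append((doc_id, answer))
    return labeled, budget

# Toy oracle: answers only when it sees at least 3 words.
toy_oracle = lambda sub: ("pos" if len(sub) >= 3 else None)
labeled, left = static_aal({"d1": "great fun movie", "d2": "bad"},
                           k=3, budget=10.0, oracle=toy_oracle,
                           cost=lambda s: 4.0, utility=len)
```

Both documents consume budget here, but only the one with a non-neutral answer reaches the labeled set, mirroring lines 6-8 of the algorithm.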

Algorithm 2 Dynamic Anytime Active Learning

算法 2 动态随时主动学习

1: 输入: 已标注数据 $\mathcal{L}$; 未标注数据 $\mathcal{U}$; 预算 $B$; 分类器 $P(y|\mathbf{x})$; 中立性分类器 $Q(z|\mathbf{x})$; 中性标注数据 $\mathcal{L}^{z}\leftarrow\emptyset$
2: while $B>0$ do
3: $\mathbf{x}_{i}^{k}\leftarrow$ SELECTSUBINSTANCE($\mathcal{U}$)
4: $\mathcal{U}\leftarrow\mathcal{U}\setminus\{\mathbf{x}_{i}\}$
5: $a\leftarrow$ QUERYORACLE($\mathbf{x}_{i}^{k}$)
6: $B\leftarrow B-C(\mathbf{x}_{i}^{k})$
7: if $a\neq n$ then //非中性响应
8: $\mathcal{L}\leftarrow\mathcal{L}\cup\{(\mathbf{x}_{i},a)\}$
9: $P(y|\mathbf{x})\leftarrow$ UPDATECLASSIFIER($\mathcal{L}, P$)
10: $\mathcal{L}^{z}\leftarrow\mathcal{L}^{z}\cup\{(\mathbf{x}_{i}^{k},$ ISNEUTRAL($a$)$)\}$
11: $Q(z|\mathbf{x})\leftarrow$ UPDATECLASSIFIER($\mathcal{L}^{z}, Q$)

Table 1: User study results reporting average annotation time in seconds and percent of neutral labels by subinstance size.

表 1: 用户研究结果报告不同子实例规模下的平均标注时间(秒)和中性标签百分比。

| 规模 | IMDB (秒) | SRAA (秒) | IMDB (% 中性) | SRAA (% 中性) |
| --- | --- | --- | --- | --- |
| 10 | 5.7 | 5.2 | 53 | 50 |
| 25 | 8.2 | 6.5 | 33 | 38 |
| 50 | 10.9 | 7.6 | 24 | 42 |
| 75 | 15.9 | 9.1 | 12 | 22 |
| 100 | 16.7 | 10.3 | 13 | 13 |

This dataset is used to train the neutrality classifier $Q(z|\mathbf{x}_{i}^{k})$ (line 11). Algorithm 2 outlines this approach, where ISNEUTRAL maps the oracle answer to $z$ (i.e., $n$ or $\neg n$). As in the static strategy, we consider two settings of the utility function: uncertainty (dynamic-unc) and constant (dynamic-const). While both dynamic-unc and dynamic-const consider the cost of annotation, dynamic-unc balances utility with the chance of receiving a non-neutral label, while dynamic-const simply maximizes the chance of a non-neutral label.

该数据集用于训练中立性分类器 $Q(z|\mathbf{x}_{i}^{k})$(第11行)。算法2概述了该方法,其中 ISNEUTRAL 将预言机的答案映射到 $z$(即 $n$ 或 $\neg n$)。与静态策略类似,我们考虑效用函数的两种设置:不确定性 (dynamic-unc) 和常量 (dynamic-const)。dynamic-unc 和 dynamic-const 都考虑了标注成本,但 dynamic-unc 会在效用与获得非中性标签的机会之间取得平衡,而 dynamic-const 则直接最大化获得非中性标签的概率。

Experimental Evaluation

实验评估

Our experiments use two datasets: (1) IMDB: A collection of 50K reviews from IMDB.com labeled with positive or negative sentiment (Maas et al. 2011); (2) SRAA: A collection of 73K Usenet articles labeled as related to aviation or auto documents (Nigam et al. 1998).

我们的实验使用了两个数据集:(1) IMDB:来自IMDB.com的5万条带有正面或负面情感标签的评论 (Maas et al. 2011);(2) SRAA:包含7.3万篇标注为航空或汽车相关文档的Usenet文章 (Nigam et al. 1998)。

Dynamic AAL Strategies

动态AAL策略

The static strategy ignores the impact that $k$ has on the likelihood of obtaining a neutral label from the oracle. In this section, we propose a dynamic strategy that models this probability directly and uses it to guide subinstance selection (see Algorithm 2).

静态策略忽略了$k$对从预言机获取中性标签概率的影响。本节提出一种动态策略,直接建模该概率并用于指导子实例选择(见算法2)。

Let $Q(z|\mathbf{x}_{i}^{k})$ be the probability distribution that models whether the oracle will return an "I don't know" answer (i.e., a neutral label) for the subinstance $\mathbf{x}_{i}^{k}$, where $z\in\{n,\neg n\}$. The objective of SELECTSUBINSTANCE (line 3) is to select the subinstance that maximizes utility and the probability of obtaining a non-neutral label, $\neg n$, while minimizing cost:

设 $Q(z|\mathbf{x}_{i}^{k})$ 为一个概率分布,用于建模预言机是否会对子实例 $\mathbf{x}_{i}^{k}$ 返回"我不知道"的答案(即中性标签),其中 $z\in\{n,\neg n\}$。SELECTSUBINSTANCE(第3行)的目标是选择既能最大化效用和获得非中性标签 $\neg n$ 的概率、又能最小化成本的子实例:

$$
\mathbf{x}_{i}^{k*}\leftarrow\operatorname*{argmax}_{\mathbf{x}_{i}^{k}\in\mathcal{U}^{k},\,\mathcal{U}^{k}\in\mathcal{S}}\frac{U(\mathbf{x}_{i})\,Q(z=\neg n|\mathbf{x}_{i}^{k})}{C(\mathbf{x}_{i}^{k})}
$$

$$
\mathbf{x}_{i}^{k*}\leftarrow\operatorname*{argmax}_{\mathbf{x}_{i}^{k}\in\mathcal{U}^{k},\,\mathcal{U}^{k}\in\mathcal{S}}\frac{U(\mathbf{x}_{i})\,Q(z=\neg n|\mathbf{x}_{i}^{k})}{C(\mathbf{x}_{i}^{k})}
$$

In contrast to the static approach, the dynamic algorithm searches over an expanded set of $p$ different subinstance sizes: $\mathcal{S}=\{\mathcal{U}^{k_{1}},\dots,\mathcal{U}^{k_{p}}\}$.

与静态方法不同,动态算法会在由 $p$ 种不同子实例规模构成的扩展集合 $\mathcal{S}=\{\mathcal{U}^{k_{1}},\dots,\mathcal{U}^{k_{p}}\}$ 上进行搜索。
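Equation 3 can be sketched as a search over (document, size) pairs; the utilities, response probabilities, and per-size costs below are invented numbers:

```python
# Dynamic selection: argmax over documents and sizes of
# U(x_i) * Q(not neutral | x_i^k) / C(x_i^k).

def dynamic_select(utilities, q_nonneutral, costs):
    # utilities: doc -> U(x_i); q_nonneutral: (doc, k) -> Q(neg n | x^k);
    # costs: k -> C(x^k). Returns the best (doc, k) pair.
    return max(q_nonneutral,
               key=lambda dk: utilities[dk[0]] * q_nonneutral[dk] / costs[dk[1]])

U = {"d1": 0.4, "d2": 0.3}
Q = {("d1", 10): 0.50, ("d1", 25): 0.80, ("d2", 10): 0.90, ("d2", 25): 0.95}
C = {10: 5.7, 25: 8.2}          # hypothetical avg annotation seconds per size
choice = dynamic_select(U, Q, C)
```

Here the lower-utility document wins at the small size because its high response probability and low cost dominate the ratio.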

The immediate question is how to estimate $Q(z|\mathbf{x}_{i}^{k})$. We propose a supervised learning approach using the previous interactions with the oracle as labeled examples. That is, we maintain an auxiliary binary labeled dataset $\mathcal{L}^{z}$ containing $(\mathbf{x}_{i}^{k},z_{i})$ pairs (line 10), indicating whether subinstance $\mathbf{x}_{i}^{k}$ received a neutral label or not.

当前的问题是如何估计 $Q(z|\mathbf{x}_{i}^{k})$。我们提出了一种监督学习方法,将先前与预言机的交互作为标注样本。具体而言,我们维护一个辅助的二元标注数据集 $\mathcal{L}^{z}$,其中包含 $(\mathbf{x}_{i}^{k},z_{i})$ 对(第10行),用于指示子实例 $\mathbf{x}_{i}^{k}$ 是否收到中性标签。
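A sketch of maintaining $\mathcal{L}^{z}$ and refitting $Q$ from past interactions; the bag-of-words features and logistic regression model are assumptions for illustration (the paper specifies only that $Q$ is trained on past (subinstance, neutral?) pairs):

```python
# Maintain the auxiliary dataset L^z of (subinstance, is_neutral) pairs and
# refit the neutrality classifier Q(z | x^k) on it.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

interactions = []                       # L^z

def record(subinstance, answer, neutral="n"):
    """ISNEUTRAL: map the oracle answer to z (1 = neutral, 0 = not)."""
    interactions.append((subinstance, int(answer == neutral)))

def fit_neutrality_model():
    texts, z = zip(*interactions)
    model = make_pipeline(CountVectorizer(binary=True),
                          LogisticRegression())
    return model.fit(texts, z)

record("this movie was", "n")
record("a wonderful heartfelt film that", "y1")
record("the", "n")
record("terrible boring waste of time honestly", "y0")
Q = fit_neutrality_model()
p_neutral = Q.predict_proba(["wonderful film"])[0][1]   # Q(z = n | x^k)
```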

User Studies

用户研究

To estimate the real-world relationships among sub instance size, annotation time, and response rate, we first performed several user studies in which subjects were shown document sub instances of varying sizes and asked to provide a correct label or an “I don’t know” answer (i.e., a neutral label).

为了评估子实例大小、标注时间和响应率之间的实际关系,我们首先进行了几项用户研究,向受试者展示不同大小的文档子实例,并要求其提供正确标签或"我不知道"答案(即中性标签)。

Each user performed six classification tasks per dataset, labeling document subinstances of sizes $\{10,25,50,75,100,\mathrm{All}\}$. For example, to create the 50-word task, we truncated the documents to the first 50 words. For each classification task, the users were asked to annotate 20 randomly-chosen documents from each class, resulting in 40 annotations per task. The documents were presented to the users in random order. For every subinstance, we recorded the annotation time, the number of words seen, and the label. We used the average over five users on the IMDB and three users on the SRAA data.

每位用户在每个数据集上执行六项分类任务,标注的文档子实例大小为 $\{10,25,50,75,100,\mathrm{All}\}$。例如,创建50词任务时,我们将文档截断至前50个词。对于每项分类任务,用户需为每个类别标注20篇随机选择的文档,即每项任务共40个标注。文档以随机顺序呈现给用户。针对每个子实例,我们记录了标注时间、已查看的词数以及标签。在IMDB数据集上取五位用户的平均值,SRAA数据集上取三位用户的平均值。

Table 1 shows the average annotation time (in seconds) and average percentage of neutral labels returned for each subinstance size. We find that the annotation time varies by subinstance size and dataset. For instance, in IMDB the annotation time for subinstances of size 50 is 25% greater than for subinstances of size 25. These responses are influenced by the user experience and domain knowledge familiarity, among other factors.

表 1 展示了每个子实例规模的平均标注时间(以秒为单位)和返回的中性标签平均百分比。我们发现标注时间会因子实例规模和数据集而异。例如在IMDB数据集中,规模为50的子实例标注时间比规模25的子实例高出25%。这些响应受到用户体验、领域知识熟悉度等因素的影响。

Intuitively, the annotator will be more likely to provide a non-neutral label when he can see more of a document. This intuition was confirmed by our user studies. Table 1 shows that the percentage of neutral labels decreases as subinstance size increases. However, the rate at which neutral answers decrease differs by dataset. For example, there was approximately a 50% neutral rate on both datasets for subinstances with 10 words; yet for 75 words the neutral responses were 12% on IMDB and 22% on SRAA. We speculate that the SRAA dataset is a more specialized domain, whereas classifying movie reviews (IMDB) is easier for non-expert human annotators.

直观上,标注者在能看到更多文档内容时更可能给出非中性标签。这一直觉在我们的用户研究中得到了验证。表1显示,中性标签的占比随着子实例规模的增大而下降。但中性答案的下降速率因数据集而异。例如,在10个单词的子实例上,两个数据集的中性率均约为50%;而在75个单词时,IMDB的中性回答率为12%,SRAA则为22%。我们推测SRAA数据集属于更专业的领域,而电影评论分类(IMDB)对非专业的人类标注者来说更容易判断。

Simulations

模拟

We use the results of the user study to inform our large-scale studies on the two datasets.

我们利用用户研究的结果来指导在两个数据集上进行的大规模研究。

Oracle In order to compare many AAL strategies at scale, it is necessary to simulate the actions of the human annotators. Specifically, we must simulate for which examples the annotator will return a neutral label. We wanted to reflect the fact that the lexical content of each subinstance influences its neutrality, rather than assigning neutrality at random — e.g., if a subinstance has strong sentiment words it is not likely to be labeled neutral. To accomplish this, we trained two oracles (one per dataset) that mimic the human annotators. We simulated the oracle with a classifier trained on held-out data; a neutral label is returned when the class posterior probability for a subinstance $\mathbf{x}_{i}^{k}$ is below a specified threshold. We tune this classifier so that the pattern of neutral labels matches that observed in the user study.

Oracle
为了大规模比较多种AAL策略,需要模拟人类标注者的行为。具体而言,我们必须模拟标注者会对哪些样本返回中性标签。我们希望更准确地反映每个子实例的词汇内容对其中立性的影响,而非随机中立性(例如,若子实例包含强烈情感词汇,则不太可能被标记为中性)。为此,我们训练了两个模拟人类标注者的oracle(每个数据集一个)。通过基于保留数据训练的分类器模拟oracle:当子实例$\mathbf{x}_{i}^{k}$的类别后验概率低于指定阈值时,返回中性标签。我们调整该分类器,使中性标签的分布模式与用户研究中观察到的模式一致。

At the start of each experiment we fit a logistic regression classifier on a held-out labeled dataset (25K examples for IMDB; 36K for SRAA). We use L1 regularization controlled by penalty $C$ to encourage sparsity. When the oracle is asked to label a subinstance $\mathbf{x}_{i}^{k}$, we compute the posterior probability with respect to this classifier and compute the oracle's uncertainty on $\mathbf{x}_{i}^{k}$ as $1-\operatorname{max}_{y}P(y|\mathbf{x}_{i}^{k})$. If the uncertainty is greater than a specified threshold $T$, then the oracle returns a neutral label. Otherwise, the true label is returned.

在每次实验开始时,我们会在一个预留的标注数据集上拟合一个逻辑回归分类器(IMDB数据集为25K样本,SRAA数据集为36K样本)。我们使用由惩罚参数 $C$ 控制的L1正则化来促进稀疏性。当请求预言机标注子实例 $\mathbf{x}_{i}^{k}$ 时,我们基于该分类器计算后验概率,并将预言机对 $\mathbf{x}_{i}^{k}$ 的不确定性量化为 $1-\operatorname{max}_{y}P(y|\mathbf{x}_{i}^{k})$。若不确定性超过指定阈值 $T$,则预言机返回中性标签,否则返回真实标签。
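A sketch of the simulated reluctant oracle under these definitions; the synthetic held-out data is invented for illustration:

```python
# Reluctant oracle: an L1-regularized logistic regression fit on held-out
# data; it abstains (returns None) when 1 - max_y P(y|x) exceeds T.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_held = rng.normal(size=(200, 5))          # synthetic "held-out" features
y_held = (X_held[:, 0] > 0).astype(int)     # label driven by feature 0

oracle_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.3)
oracle_clf.fit(X_held, y_held)

def oracle(x_sub, true_label, T=0.4):
    proba = oracle_clf.predict_proba(x_sub.reshape(1, -1))[0]
    if 1.0 - proba.max() > T:               # too uncertain: neutral answer
        return None
    return true_label                       # reluctant oracles never lie

answer = oracle(np.array([3.0, 0.0, 0.0, 0.0, 0.0]), true_label=1)
```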

For each of the datasets, we set $C$ and $T$ so that the distribution of neutral labels by subinstance size most closely matches the results of the user study. We searched values $C\in[0.001,3]$ with a 0.001 step and $T\in[0.3,0.45]$ with a 0.05 step, selecting $C=0.3$, $T=0.4$ for IMDB and $C=0.01$, $T=0.3$ for SRAA. Figures 1(a) and 1(b) show the simulated distribution of neutral labels by subinstance size over the same documents from the user study, indicating a close match with human behavior.

对于每个数据集,我们设定 $C$ 和 $T$ 的值,使得按子实例大小统计的中性标签分布最接近用户研究结果。我们以0.001为步长搜索 $C\in[0.001,3]$,以0.05为步长搜索 $T\in[0.3,0.45]$,最终为IMDB选择 $C=0.3$、$T=0.4$,为SRAA选择 $C=0.01$、$T=0.3$。图1(a)和图1(b)展示了用户研究中相同文档上按子实例大小的中性标签模拟分布,表明其与人类行为高度吻合。
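The tuning step can be sketched as a grid search minimizing the gap to the user-study neutral rates (the IMDB column of Table 1); `fake_simulator` is a stand-in for actually running the oracle over the user-study documents:

```python
# Grid search over (C, T): pick the pair whose simulated neutral-label
# rates per subinstance size best match the user study.

import itertools

TARGET_IMDB = {10: 0.53, 25: 0.33, 50: 0.24, 75: 0.12, 100: 0.13}  # Table 1

def tune(simulate_rates, Cs, Ts, target=TARGET_IMDB):
    def gap(params):
        rates = simulate_rates(*params)     # size -> neutral rate for (C, T)
        return sum(abs(rates[k] - target[k]) for k in target)
    return min(itertools.product(Cs, Ts), key=gap)

# Stand-in simulator: neutral rate falls with size, scales with T.
fake_simulator = lambda C, T: {k: min(1.0, T * 60.0 / (k + 20))
                               for k in TARGET_IMDB}
best_C, best_T = tune(fake_simulator, [0.01, 0.3], [0.30, 0.35, 0.40])
```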

To simulate the cost of each annotation, we used a fixed cost equal to the average annotation time from the user study for subinstances of that size — e.g., for all subinstances of size 10 for IMDB, the cost is the average annotation time for all subinstances of size 10 in the IMDB user study. In future work, we will consider modeling annotation time as a function of the lexical content of the subinstance.

为了模拟每次标注的成本,我们采用了用户研究中该尺寸子实例的平均标注时间作为固定成本。例如,对于IMDB中所有尺寸为10的子实例,其成本即为IMDB用户研究中尺寸为10的所有子实例的平均标注时间。在后续工作中,我们将考虑将标注时间建模为子实例词汇内容的函数。

Student For the student, we use a logistic regression classifier with L1 regularization using the default parameter $C=1$, seeded with a labeled set of two examples. At each round of active learning, a subsample of 250 examples is selected uniformly from the unlabeled set $\mathcal{U}$. Following the user study, subinstances of sizes $\{10,25,50,75,100\}$ are created for each example in the subsample and scored according to the appropriate strategy (Equation 2 for static; Equation 3 for dynamic). We reserve half of the data for testing, and use the remaining half to simulate active learning. For all methods, we report the average result of 10 trials.

学生
对于学生,我们使用带有L1正则化的逻辑回归分类器,默认参数 $C=1$,并使用两个标注样本作为初始种子集。在每轮主动学习中,从未标注集 $\mathcal{U}$ 中均匀抽取250个样本构成子集。根据用户研究,为子集中的每个样本生成大小为 $\{10,25,50,75,100\}$ 的子实例,并按相应策略(静态策略用公式2,动态策略用公式3)进行评分。我们保留一半数据用于测试,其余数据用于模拟主动学习。所有方法均报告10次试验的平均结果。
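One round of candidate generation for the simulated student can be sketched as follows (function and variable names are illustrative):

```python
# Per-round candidate generation: uniformly subsample 250 unlabeled
# documents and expand each into subinstances of the study sizes.

import random

SIZES = [10, 25, 50, 75, 100]

def build_candidates(unlabeled, sample_size=250, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
    # Each candidate: (full document, size k, first-k-words subinstance).
    return [(doc, k, " ".join(doc.split()[:k]))
            for doc in sample for k in SIZES]

docs = [("word%d " % i) * 120 for i in range(300)]   # toy corpus
cands = build_candidates(docs)                        # 250 docs x 5 sizes
```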

For both datasets, we used documents that contain at least 100 words. We created binary feature representations of the documents using stemmed n-grams (sizes one to three), pruning n-grams appearing in fewer than five documents. In SRAA, we filtered header information, preserving only the subject line and body of the messages.

对于这两个数据集,我们使用的文档至少包含100个单词。我们创建了文档的二元特征表示,采用词干提取的n元语法(大小为1到3),并剔除在少于五个文档中出现的n元语法。在SRAA数据集中,我们过滤了头部信息,仅保留消息的主题行和正文。
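The feature pipeline can be sketched with scikit-learn; the optional stemmer hook is an assumption, since the paper does not name a specific stemmer:

```python
# Binary bag-of-n-grams features: stemmed 1-3-grams, dropping n-grams that
# appear in fewer than 5 documents.

from sklearn.feature_extraction.text import CountVectorizer

def make_vectorizer(stem=None):
    if stem is None:
        return CountVectorizer(binary=True, ngram_range=(1, 3), min_df=5)
    # Wrap the default analyzer so every word in every n-gram is stemmed.
    base = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
    analyzer = lambda doc: [" ".join(stem(w) for w in g.split())
                            for g in base(doc)]
    return CountVectorizer(binary=True, min_df=5, analyzer=analyzer)

docs = ["the plane flew over the field"] * 6    # toy corpus (min_df = 5)
X = make_vectorizer().fit_transform(docs)       # sparse 0/1 matrix
```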

Results and Discussion

结果与讨论

With an oracle simulation and annotation cost in place, we explored the performance of several AAL strategies. We examined learning curves for accuracy and area under the ROC curve (AUC) and observed the same trends and behaviors for each; therefore we include only AUC results here. We summarize our findings below.

在设定好预言机模拟和标注成本后,我们探索了几种主动学习(AAL)策略的性能表现。我们分别考察了准确率和ROC曲线下面积(AUC)的学习曲线,观察到两者呈现相同的趋势和行为特征,因此本文仅展示AUC结果。主要发现如下:

Smaller subinstances generally outperform larger subinstances. Figure 2 shows the performance of static-k-const for the IMDB and SRAA datasets. These results consistently show that savings can be achieved by selecting smaller subinstances. For example, after an hour of annotation (3600 seconds) on IMDB, inspecting the first 100 words of each document results in an AUC of .752, whereas labeling only the first 25 words results in an AUC of .792, a 17% reduction in error. This suggests that while a smaller $k$ results in a high neutral percentage, the time saved by reading shorter documents more than makes up for the losses. The results for static-k-unc are similar but are omitted due to space limitations.

较小的子实例通常表现优于较大的子实例。图 2 展示了 static-k-const 在 IMDB 和 SRAA 数据集上的性能。这些结果一致表明,选择较小的子实例可以实现效率提升。例如,在 IMDB 上进行一小时标注 (3600 秒) 后,检查每篇文档的前 100 个词可获得 0.752 的 AUC 值;而仅标注前 25 个词则获得 0.792 的 AUC 值,误差降低了 17%。这表明虽然较小的 $k$ 值会导致较高的中性比例,但通过阅读更短的文档所节省的时间足以弥补这一损失。static-k-unc 的结果类似,但由于篇幅限制在此省略。

The optimal subinstance size varies by dataset. Comparing Figure 2(a) to Figure 2(b) indicates that the optimal $k^{*}$ varies by dataset ($k^{*}=25$ for IMDB, $k^{*}=50$ for SRAA). This follows from the observed differences between these datasets in the user study (Table 1); i.e., the annotation cost rises more slowly with subinstance size in SRAA. Thus, somewhat bigger subinstances are worth the small additional cost to reduce the likelihood of a neutral label.

最优子实例大小因数据集而异。对比图 2(a) 和图 2(b) 可以看出,最优的 $k^{*}$ 随数据集变化(IMDB 的 $k^{*}=25$,SRAA 的 $k^{*}=50$)。这与用户研究中观察到的数据集差异一致(表 1),即 SRAA 的标注成本随子实例大小增长更缓慢。因此,稍大的子实例值得付出少量额外成本来降低中性标签的出现概率。

dynamic-unc does better than or equal to the best static AAL algorithm. Given the fact that the optimal subinstance size varies by dataset, we examine how the dynamic approach compares to the static approach. Figure 3 compares the dynamic approaches with uncertainty and constant utility (dynamic-unc, dynamic-const) with the best static methods. We find that dynamic-unc outperforms the best static method for the IMDB dataset (Figure 3(a)). In Figure 3(b), the static and dynamic approaches are comparable; however, the advantage of dynamic-unc is that there is no need to specify $k$ ahead of time.

动态不确定性方法(dynamic-unc)优于或等同于最佳静态AAL算法。考虑到最优子实例大小因数据集而异,我们比较了动态方法与静态方法的性能。图3将带有不确定性和恒定效用的动态方法(dynamic-unc、dynamic-const)与最佳静态方法进行对比。我们发现dynamic-unc在IMDB数据集上表现优于最佳静态方法(图3(a))。在图3(b)中,静态与动态方法表现相当;但dynamic-unc的优势在于无需预先指定$k$值。

Dynamic approaches tend to pick a mixture of subinstance sizes. To better understand the behavior of the dynamic approaches, Figure 4 plots the distribution of subinstance sizes selected by both approaches. As we can see, the dynamic approaches select a mixture of subinstance sizes, but heavily favor smaller sizes. Combining this observation with the result that dynamic-unc is able to outperform or perform comparably to static approaches, this suggests that the dynamic approach is able to pick small subinstances that receive non-neutral labels.

动态方法倾向于选择不同子实例大小的混合。为了更好地理解动态方法的行为,图4绘制了两种方法选择的子实例大小分布。可以看出,动态方法选择了不同大小的子实例,但明显更倾向于较小的尺寸。结合动态-unc方法要么优于静态方法、要么表现相当的实验结果,这表明动态方法能够选择获得非中性标签的小子实例。



Figure 1: A comparison of the proportion of neutral responses by subinstance size for the user studies and the simulated oracles.

图 1: 用户研究与模拟预言机在不同子实例规模下中性回答比例的对比


Figure 2: AUC learning curves for static-k-const. The trends for static-k-unc are similar. The optimal $k^{*}$ depends on the domain. For IMDB, $k^{*}=25$ while for SRAA, $k^{*}=50$.

图 2: static-k-const 的AUC学习曲线。static-k-unc 的趋势与之相似。最优 $k^{*}$ 取决于具体领域:在IMDB数据集中 $k^{*}=25$,而在SRAA数据集中 $k^{*}=50$。

Figure 3: Comparing dynamic AAL to the best of the static AAL approaches. dynamic-unc outperforms all methods for IMDB whereas it is comparable to the best static approaches for SRAA.

图 3: 动态AAL与最佳静态AAL方法的对比。dynamic-unc在IMDB数据集上优于所有方法,而在SRAA数据集上与最佳静态方法表现相当。

Table 2 further investigates this by comparing the proportion of neutral labels observed with the expected proportion based on the user study. That is, by combining the data from Table 1 and Figure 4, we compute the proportion of neutral labels we expect to see for the observed distribution of subinstance sizes. We can see that the neutrality classifier $Q(z|\mathbf{x}_{i}^{k})$ enables the dynamic approach to select small subinstances while limiting the impact of neutral labels.

表 2 进一步分析了这一点:将观察到的中性标签比例与基于用户研究的预期比例进行比较。具体而言,通过结合表 1 和图 4 中的数据,我们计算了在观察到的子实例大小分布下预期出现中性标签的比例。可以看出,中立性分类器 $Q(z|\mathbf{x}_{i}^{k})$ 能让动态方法在选取小子实例的同时限制中性标签的影响。
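The expected proportion here is a weighted average: the user-study neutrality rate for each subinstance size, weighted by how often the dynamic method selected that size. A minimal sketch of this computation, using illustrative rates and size distributions (not the paper's actual data):

```python
# Hypothetical neutrality rates P(neutral | size k) from a user study,
# and the distribution of subinstance sizes a dynamic method selected.
# All numbers below are illustrative assumptions.
neutral_rate_by_size = {10: 0.60, 25: 0.45, 50: 0.30, 100: 0.20}
selected_size_dist   = {10: 0.50, 25: 0.30, 50: 0.15, 100: 0.05}

# Expected neutral proportion = sum_k P(select k) * P(neutral | k)
expected_neutral = sum(selected_size_dist[k] * neutral_rate_by_size[k]
                       for k in selected_size_dist)
print(f"Expected neutral proportion: {expected_neutral:.0%}")  # -> 49%
```

Comparing this expectation to the observed neutral rate (as Table 2 does) shows how far the neutrality model steers selection away from subinstances likely to be labeled neutral.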


IMDB - AAL Proportion of Subinstances by Size

SRAA - AAL Proportion of Subinstances by Size

Figure 4: Proportion of subinstance sizes selected by dynamic AAL.

图 4: 动态AAL选择的子实例规模比例

Table 2: The percentage of observed neutral labels for dynamic-unc and dynamic-const, compared with what is expected for subinstances of the observed sizes.

表 2: dynamic-unc 和 dynamic-const 观测到的中性标签百分比,与观测规模子实例预期值的对比。

                Obs.    Exp.
IMDB    const   15%     48%
        unc     34%     45%
SRAA    const    2%     50%
        unc     37%     45%

Uncertainty improves the exploration of AAL algorithms. In general, we find that using uncertainty outperforms constant utility. For example, in Figure 3, dynamic-unc outperforms dynamic-const on both datasets. This supports prior work showing the benefits of uncertainty sampling. Additionally, we find that the uncertainty term of the objective balances the neutrality term, enabling a better exploration of the sample space. For example, in the SRAA dataset, Figure 3(b) shows that dynamic-const stops learning after a few rounds of active learning. Interestingly, analysis of this result showed that in the first iterations the $Q(z|\mathbf{x}_{i}^{k})$ model quickly learned a strong correlation between non-neutral responses and certain terms, e.g., “GPL” (a proper name indicating class “auto”) in the subject line; dynamic-const thereafter selected subinstances containing those terms, obtaining non-neutral labels for them. Thus, $Q(z|\mathbf{x}_{i}^{k})$ was further reinforced to predict subinstances that contain the word “GPL” as non-neutral. This prevented the neutrality model from exploring other terms, and the student did not receive a diverse set of documents to improve learning. Including the uncertainty term encouraged the student to seek a more diverse set of examples, thus avoiding this problem.

不确定性提升了AAL算法的探索能力。总体而言,我们发现使用不确定性策略优于固定效用策略。例如,在图3中,dynamic-unc方法在两个数据集上的表现都优于dynamic-const方法。这印证了先前关于不确定性采样优势的研究。此外,我们发现目标函数中的不确定性项能平衡中性项,从而更好地探索样本空间。以SRAA数据集为例,图3(b)显示dynamic-const方法在几轮主动学习后便停止进步。深入分析表明,$Q(z|\mathbf{x}_{i}^{k})$模型在初始迭代中快速建立了非中性响应与特定术语(如主题行中的"GPL"(表示"auto"类别的专有名词))之间的强关联;此后dynamic-const持续选择包含这些术语的子实例,并为这些子实例获得非中性标签。因此,$Q(z|\mathbf{x}_{i}^{k})$被进一步强化为将包含"GPL"词汇的子实例预测为非中性,这阻碍了中性模型探索其他术语的可能性,学生也无法获得多样化的文档来促进学习。引入不确定性项后,系统会鼓励学生寻找更多样化的样本,从而避免这一问题。
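The failure mode above can be sketched with a toy comparison of the two scoring rules. This is a hypothetical illustration, not the paper's implementation: the function names, probabilities, and cost values are all assumptions chosen to make the contrast visible.

```python
def score_const(p_non_neutral, cost):
    # dynamic-const: utility is constant, so only the non-neutrality
    # probability and the annotation cost drive selection.
    return p_non_neutral / cost

def score_unc(p_non_neutral, p_pos, cost):
    # dynamic-unc: additionally weight by classifier uncertainty,
    # here 1 - max_y P(y|x) for a binary classifier.
    uncertainty = 1.0 - max(p_pos, 1.0 - p_pos)
    return p_non_neutral * uncertainty / cost

# A "GPL" subinstance: almost surely non-neutral, but already
# classified confidently, so it teaches the student little.
gpl = dict(p_non_neutral=0.98, p_pos=0.95, cost=10.0)
# An unfamiliar subinstance: riskier, but the classifier is unsure.
new = dict(p_non_neutral=0.60, p_pos=0.55, cost=10.0)

# Constant utility keeps picking the safe "GPL" examples...
assert score_const(gpl["p_non_neutral"], gpl["cost"]) > \
       score_const(new["p_non_neutral"], new["cost"])
# ...while the uncertainty term shifts preference to the unexplored one.
assert score_unc(gpl["p_non_neutral"], gpl["p_pos"], gpl["cost"]) < \
       score_unc(new["p_non_neutral"], new["p_pos"], new["cost"])
```

Under constant utility the confidently-classified but reliably non-neutral example always wins, which is exactly the homogeneity trap described above; the uncertainty factor demotes it.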

This is further supported by Table 2, which shows that dynamic-const primarily selects instances for which it is very likely to receive a non-neutral label. E.g., in SRAA, only 2% of instances receive a neutral label for dynamic-const; Figure 3(b) shows this to be an ineffective strategy, since the student observes only a homogeneous subset of examples. While dynamic-unc increases the rate of neutral labels, this additional cost is worth the greater diversity of labeled data.

表2进一步支持了这一点,其中显示dynamic-const主要选择极可能获得非中性标签的实例。例如,在SRAA中,dynamic-const仅有2%的实例获得中性标签;图3(b)表明这是一种低效策略,因为学生只能观察到同质化的示例子集。尽管dynamic-unc提高了中性标签的比例,但这一额外成本换来了标注数据更高的多样性。

Related Work

相关工作

There has been a significant amount of work on noisy and reluctant oracles for traditional active learning scenarios without anytime capability (Donmez and Carbonell 2008; Donmez, Carbonell, and Schneider 2009; Fang, Zhu, and Zhang 2012; Wallace et al. 2011; Yan et al. 2011). Much of this work considered which instance to show to which oracle, and the oracle’s quality is fixed. In the anytime framework that we propose in this paper, the student has control over how much time an oracle should spend on an instance, thus controlling the quality of the label.

在传统的、不具备随时(anytime)能力的主动学习场景中,已有大量关于含噪声标注者和不情愿标注者(oracle)的研究 (Donmez and Carbonell 2008; Donmez, Carbonell, and Schneider 2009; Fang, Zhu, and Zhang 2012; Wallace et al. 2011; Yan et al. 2011)。这些工作主要关注向哪个标注者展示哪个实例,且标注者的质量是固定的。而在本文提出的随时框架中,学习主体可以控制标注者为每个实例分配的时间,从而调控标注质量。

There has also been a significant amount of work on cost-sensitive active learning (Settles, Craven, and Friedland 2008; Donmez and Carbonell 2008; Haertel et al. 2008; Kapoor, Horvitz, and Basu 2007; Tomanek and Hahn 2010). The common strategy is to use a utility-cost ratio to determine the most cost-effective instances. We follow the same strategy and use utility-cost ratio, with the additional multiplicative factor of probability of non-neutrality.

在成本敏感主动学习领域已有大量研究成果 (Settles, Craven, and Friedland 2008; Donmez and Carbonell 2008; Haertel et al. 2008; Kapoor, Horvitz, and Basu 2007; Tomanek and Hahn 2010)。通用策略是通过效用-成本比来确定最具成本效益的实例。我们采用相同策略计算效用-成本比,并额外乘以非中性概率这一因子。
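The selection rule described here, picking the (document, subinstance size) pair that maximizes utility times the probability of non-neutrality divided by cost, can be sketched as follows. The candidate pool, utilities, probabilities, and costs are illustrative assumptions, not values from the paper:

```python
def select(candidates):
    """candidates: list of (doc_id, k, utility, p_non_neutral, cost) tuples.

    Returns the candidate maximizing utility * P(non-neutral) / cost.
    """
    return max(candidates, key=lambda c: c[2] * c[3] / c[4])

pool = [
    ("d1", 25,  0.40, 0.55,  5.0),  # short prefix: cheap but riskier
    ("d1", 100, 0.45, 0.90, 20.0),  # full read: safe but expensive
    ("d2", 25,  0.10, 0.80,  5.0),  # confidently classified doc: low utility
]
best = select(pool)
print(best[0], best[1])  # -> d1 25
```

With these numbers the cheap 25-word prefix wins despite its higher neutrality risk, since its score 0.40 * 0.55 / 5 exceeds the full read's 0.45 * 0.90 / 20.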

We build on our previous work (Ramirez-Loaiza, Culotta, and Bilgic 2013) on the problem of searching over subinstances, with some notable differences: i) we conducted user studies to determine annotation time, whereas they assumed a linear cost function; ii) we allow the oracles to return a neutral label and model neutrality.

我们在前期研究 (Ramirez-Loaiza, Culotta, and Bilgic 2013) 的基础上探讨子实例搜索问题,但存在以下显著差异:i) 我们通过用户研究确定标注时间,而他们假设了线性成本函数;ii) 我们允许标注者返回中性标签并对中性行为建模。

Conclusions and Future Work

结论与未来工作

We present an anytime active learning framework in which the student is allowed to interrupt the oracle to save annotation time. User studies were conducted to quantify the relationship between subinstance size, annotation time, and response rate. These were used to inform a large-scale simulated study on two document classification tasks, which showed that although interruption can cause the oracle to return neutral labels, interrupting at the right time can lead to significantly more efficient learning. We found that the optimal interruption time depends on the domain and proposed a dynamic AAL strategy that performs better than or comparably to the best static strategy that uses a fixed interruption time.

我们提出了一种可随时中断的主动学习框架,允许学习者中断标注过程以节省标注时间。通过用户研究量化了子实例大小、标注时间和响应率之间的关系,并基于此在两个文档分类任务上进行了大规模模拟研究。研究表明:虽然中断可能导致标注者返回中性标签,但在适当时机中断能显著提升学习效率。我们发现最优中断时机因领域而异,并提出了一种动态AAL策略,其性能优于或等同于采用固定中断时间的最佳静态策略。
