Anytime Active Learning
Abstract
A common bottleneck in deploying supervised learning systems is collecting human-annotated examples. In many domains, annotators form an opinion about the label of an example incrementally — e.g., each additional word read from a document or each additional minute spent inspecting a video helps inform the annotation. In this paper, we investigate whether we can train learning systems more efficiently by requesting an annotation before inspection is fully complete — e.g., after reading only 25 words of a document. While doing so may reduce the overall annotation time, it also introduces the risk that the annotator might not be able to provide a label if interrupted too early. We propose an anytime active learning approach that optimizes the annotation time and response rate simultaneously. We conduct user studies on two document classification datasets and develop simulated annotators that mimic the users. Our simulated experiments show that anytime active learning outperforms several baselines on these two datasets. For example, with an annotation budget of one hour, training a classifier by annotating the first 25 words of each document reduces classification error by $17\%$ over annotating the first 100 words of each document.
Introduction
Active learning is a machine learning approach that seeks to maximize classifier accuracy while minimizing the effort of human annotators (Settles 2012). This is typically done by prioritizing example annotation according to the utility to the classifier.
In this paper, we begin with the simple observation that in many domains human annotators form an opinion about the label of an example incrementally. For example, while reading a document, an annotator makes a more informed decision about the topic assignment as each word is read. Similarly, in video classification the annotator becomes more certain of the class label the longer she watches the video.
The question we ask is whether we can more efficiently train a classifier by interrupting the annotator to ask for a label, rather than waiting until the annotator has fully completed her inspection. For example, in document classification the active learner may request the label after the annotator has read the first 50 words of the document. For video classification, the active learner may decide to show only a short clip. We refer to this approach as anytime active learning (AAL), by analogy to anytime algorithms, whose execution may be interrupted at any time to provide an answer.
If the decision of when to interrupt the annotator is made optimally, we can expect to reduce total annotation effort by eliminating unnecessary inspection time that does not affect the returned label. However, the annotator may not be able to provide a label if interrupted too early – e.g., the annotator will not know how to label a document after seeing only the first word. AAL strategies, then, must balance two competing objectives: (1) the time spent annotating an instance (annotation cost); (2) the likelihood that the annotator will be able to produce a label (annotation response rate). In this paper, we propose and evaluate a number of anytime active learning strategies applied to the domain of document classification. In this domain, it is natural to implement this approach by revealing only the first $k$ words to the annotator, which we refer to as a subinstance.
We first conduct user studies to estimate annotation times and response rates, and then create simulated oracles that mimic the human annotators. We perform simulated-oracle experiments on two document classification tasks, comparing two classes of anytime active learning strategies: (1) static strategies select subinstances of a fixed size; (2) dynamic strategies select subinstances of varying sizes, optimizing cost and response rate simultaneously. Our research questions and answers are as follows:
RQ1. How does subinstance size affect human annotation time and response rate? We conducted a user study in which each user labeled 480 documents from two domains under different interruption conditions (e.g., seeing only the first $k$ words). We find that as subinstance sizes increase, both response rates and annotation times increase (non-linearly), and that the rate of increase varies by dataset.

RQ2. How do static AAL strategies compare with traditional active learning? We find that simple static strategies result in significantly more efficient learning, even with few words shown per document. For example, with an annotation budget of one hour, labeling only the first 25 words of each document reduces classification error by $17\%$ compared with labeling the first 100 words of each document.
RQ3. How do dynamic AAL strategies compare with static strategies? The drawback of the static strategy is that we must select a subinstance size ahead of time; however, we find that the optimal size varies by dataset. Instead, we formulate a dynamic AAL algorithm to minimize cost while maximizing response rates. We find that this dynamic approach performs as well as, or better than, the best static strategy, without the need for additional tuning.
The remainder of the paper is organized as follows: we first formalize the anytime active learning problem, then propose static and dynamic solutions. Next, we describe our user studies and how they are used to inform our simulation experiments. Finally, we present the empirical results and discuss their implications.
Anytime Active Learning (AAL)
In this section, we first review standard active learning and then formulate our proposed anytime extension.
Problem Formulation
Let $\mathcal{L}=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{l}$ be a labeled dataset where $\mathbf{x}_{i}\in\mathbb{R}^{d}$ is a $d$-dimensional feature vector and $y_{i}\in\{y^{0},y^{1}\}$ is its class label. Let $\mathcal{U}=\{\mathbf{x}_{i}\}_{i=l+1}^{m}$ be a set of unlabeled examples. Let $P_{\mathcal{L}}(y|\mathbf{x})$ be the conditional probability of $y$ given $\mathbf{x}$ according to a classifier trained on $\mathcal{L}$.
Typical pool-based active learning selects instances $\mathcal{U}^{*}\subseteq\mathcal{U}$ to be labeled by a human annotator (oracle) and appended to $\mathcal{L}$. Assuming a prespecified annotation budget $B$ and an annotation cost function $C(\mathbf{x})$, the goal of the active learning algorithm (student) is to select $\mathcal{U}^{*}$ to minimize the classifier's generalization error subject to the budget constraint:
$$
\mathcal{U}^{*}\leftarrow\underset{\mathcal{U}_{i}\subseteq\mathcal{U}}{\operatorname{argmin}}\; Err\big(P_{\mathcal{L}\cup\mathcal{U}_{i}}(y|\mathbf{x})\big)\quad\text{s.t.}\;\sum_{\mathbf{x}_{j}\in\mathcal{U}_{i}}C(\mathbf{x}_{j})\leq B
$$
Equation 1 is typically optimized by greedy algorithms, selecting one or more examples at a time according to some heuristic criterion that estimates the utility of each labeled example. A common approach is to request a label for the unlabeled instance that maximizes the benefit-cost ratio: $\mathbf{x}_{i}^{*}\gets\operatorname{argmax}_{\mathbf{x}_{i}\in\mathcal{U}}\frac{U(\mathbf{x}_{i})}{C(\mathbf{x}_{i})}$.
Various definitions of utility $U(\cdot)$ are used in the literature, such as expected error reduction (Roy and McCallum 2001) and classifier uncertainty (Lewis and Gale 1994).
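As a concrete illustration, the benefit-cost selection rule can be sketched in a few lines. The uncertainty utility $U(\mathbf{x})=1-\max_{y}P_{\mathcal{L}}(y|\mathbf{x})$ follows Lewis and Gale (1994); the array names and toy numbers below are illustrative only.

```python
import numpy as np

def uncertainty_utility(posteriors):
    # U(x) = 1 - max_y P(y|x): highest for instances the classifier is least sure about
    return 1.0 - posteriors.max(axis=1)

def select_by_benefit_cost(posteriors, costs):
    # pick the unlabeled instance maximizing U(x) / C(x)
    return int(np.argmax(uncertainty_utility(posteriors) / costs))

# toy pool of three unlabeled instances (class posteriors and annotation costs)
posteriors = np.array([[0.90, 0.10],   # confident -> low utility
                       [0.55, 0.45],   # uncertain -> high utility
                       [0.60, 0.40]])
costs = np.array([10.0, 10.0, 2.0])    # e.g., seconds to annotate
print(select_by_benefit_cost(posteriors, costs))  # -> 2 (cheap and fairly uncertain)
```

Note that the cheapest instance wins here despite not being the most uncertain, which is exactly the trade-off the ratio encodes.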
We propose an alternative formulation of the active learning problem in which the student has the added capability of interrupting the human oracle to request a label while the annotation of $\mathbf{x}_{i}$ is being performed. For example, in video classification, the student may request a label after the oracle has spent only one minute watching the video. Similarly, in document classification the student may request a label after the oracle has read only the first ten words of a document.
Let $\mathbf{x}_{i}^{k}$ indicate this abbreviated instance, which we call a subinstance. The nature of subinstances will vary by domain. For example, $k$ could indicate the time allotted to inspect the instance. In this paper, we focus on document classification, where it is natural to let $\mathbf{x}_{i}^{k}$ be the first $k$ words of document $\mathbf{x}_{i}$.
The potential savings from this approach arises from the assumption that $C(\mathbf{x}_{i}^{k})<C(\mathbf{x}_{i})$; that is, subinstances are less costly to label than instances. While the magnitude of these savings is data-dependent, our user studies below show substantial savings for document classification.
The immediate problem with this approach is that $\mathbf{x}_{i}^{k}$ may be considerably more difficult for the oracle to label. We therefore must account for imperfect oracles (Donmez and Carbonell 2008; Yan et al. 2011). There are at least two scenarios to consider — (1) a noisy oracle produces a label for any $\mathbf{x}_{i}^{k}$, but that label may be incorrect; (2) a reluctant oracle may decide not to produce a label for some examples, but labels that are produced are assumed to be correct. Our user studies below suggest that the latter case is more common; thus, in this paper, we restrict our attention to reluctant oracles, leaving noisy oracles for future work.
In each interaction between the student and oracle, the student presents a subinstance $\mathbf{x}_{i}^{k}$ to the oracle, and the oracle returns an answer $a\in\{y^{0},y^{1},n\}$, where the answer can be either the correct label $y$ or neutral, $n$, which represents an "I don't know" answer. If the oracle returns a non-neutral answer $a$ for $\mathbf{x}_{i}^{k}$, the student adds $\mathbf{x}_{i}$ and the returned label ($y^{0}$ or $y^{1}$) to its training data and updates its classifier. If $n$ is returned, the labeled data is unchanged. In either case, the annotation cost $C(\mathbf{x}_{i}^{k})$ is deducted from the student's budget because the oracle spends time inspecting $\mathbf{x}_{i}^{k}$ even if she returns a neutral label. To choose the optimal subinstance, the student must consider both the cost of the subinstance as well as the likelihood that a non-neutral label will be returned. Below, we propose two AAL strategies.
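This interaction protocol can be sketched as follows; the ReluctantOracle class, the per-word cost model, and the keyword rule are toy assumptions for illustration, not the paper's user model.

```python
class ReluctantOracle:
    """Toy oracle: answers only if the visible words contain a known keyword."""
    def __init__(self, keywords):
        self.keywords = keywords          # word -> label

    def label(self, words):
        for w in words:
            if w in self.keywords:
                return self.keywords[w]
        return None                       # neutral: "I don't know"

def aal_round(labeled, oracle, x, k, budget, cost_per_word=0.1):
    """One student-oracle interaction on the subinstance x[:k]."""
    sub = x[:k]
    budget -= cost_per_word * len(sub)    # cost charged even for a neutral answer
    answer = oracle.label(sub)
    if answer is not None:                # non-neutral: store the FULL instance
        labeled.append((x, answer))
    return budget

oracle = ReluctantOracle({"terrible": "neg", "great": "pos"})
labeled = []
doc = "a film that starts slowly but is great overall".split()
b = aal_round(labeled, oracle, doc, k=5, budget=10.0)  # first 5 words: neutral
b = aal_round(labeled, oracle, doc, k=8, budget=b)     # 'great' now visible
```

The first round returns neutral but still consumes budget; the second round reveals enough words to get a label, and the full document (not the truncated one) enters the training set.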
Static AAL Strategies
We first consider a simple, static approach to AAL that decides a priori on a fixed subinstance size $k$. For example, the student fixes $k=10$ and presents the oracle subinstances $\mathbf{x}_{i}^{10}$ (please see Algorithm 1).
Let $\mathcal{U}^{k}=\{\mathbf{x}_{i}^{k}\}_{i=l+1}^{m}$ be the set of all unlabeled subinstances of fixed size $k$. In SELECTSUBINSTANCE (line 3), the student picks $\mathbf{x}_{i}^{k*}$ as follows:
$$
\mathbf{x}_{i}^{k*}\gets\underset{\mathbf{x}_{i}^{k}\in\mathcal{U}^{k}}{\operatorname{argmax}}\frac{U(\mathbf{x}_{i})}{C(\mathbf{x}_{i}^{k})}
$$
Note that the utility is computed from the full instance $\mathbf{x}_{i}$, not the subinstance, since $\mathbf{x}_{i}$ will be added to our labeled set (line 8). In our experiments, we consider two utility functions: uncertainty (static-k-unc), which sets $U(\mathbf{x}_{i}^{k})=1-\max_{y}P_{\mathcal{L}}(y|\mathbf{x}_{i})$, and constant (static-k-const), which sets the utility of each subinstance to one, $U(\mathbf{x}_{i}^{k})=1$. We use static-k-const as a baseline for other AAL methods because it is an anytime version of random sampling.
Algorithm 1 Static Anytime Active Learning

1: Input: labeled data $\mathcal{L}$; unlabeled data $\mathcal{U}$; budget $B$; classifier $P(y|\mathbf{x})$; subinstance size $k$
2: while $B>0$ do
3:    $\mathbf{x}_{i}^{k}\gets$ SELECTSUBINSTANCE($\mathcal{U}$, $k$)
4:    $\mathcal{U}\gets\mathcal{U}\setminus\{\mathbf{x}_{i}\}$
5:    $a\gets$ QUERYORACLE($\mathbf{x}_{i}^{k}$)
6:    $B\gets B-C(\mathbf{x}_{i}^{k})$
7:    if $a\neq n$ then // non-neutral response
8:       $\mathcal{L}\gets\mathcal{L}\cup\{(\mathbf{x}_{i},a)\}$
9:       $P(y|\mathbf{x})\gets$ UPDATECLASSIFIER($\mathcal{L}$, $P$)
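Under the same toy assumptions as before (a callable oracle that returns None for a neutral answer, and word-count costs), Algorithm 1's loop might look like the sketch below; the helper names and toy data are illustrative.

```python
def static_aal(docs, oracle, budget, k, cost_fn, utility_fn):
    """Sketch of Algorithm 1: always query with the first k words.

    docs: list of word lists; oracle(sub) -> label or None (neutral);
    utility_fn(i) scores the FULL instance i (a constant for static-k-const).
    """
    pool = list(range(len(docs)))
    labeled = []
    while budget > 0 and pool:
        # line 3: argmax of U(x_i) / C(x_i^k) over the pool
        best = max(pool, key=lambda i: utility_fn(i) / cost_fn(docs[i][:k]))
        pool.remove(best)                        # line 4
        sub = docs[best][:k]
        budget -= cost_fn(sub)                   # line 6: charged either way
        answer = oracle(sub)                     # line 5
        if answer is not None:                   # lines 7-8: non-neutral
            labeled.append((docs[best], answer))
    return labeled, budget

docs = [["great", "movie"], ["dull"], ["awful", "plot", "twist"]]
oracle = lambda sub: ("pos" if "great" in sub else
                      "neg" if "awful" in sub else None)
labeled, left = static_aal(docs, oracle, budget=2.5, k=1,
                           cost_fn=len, utility_fn=lambda i: 1.0)
```

With $k=1$ the middle document yields a neutral answer, so its cost is paid without gaining a label — the failure mode the dynamic strategy below is designed to avoid.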
Algorithm 2 Dynamic Anytime Active Learning

1: Input: labeled data $\mathcal{L}$; unlabeled data $\mathcal{U}$; budget $B$; classifier $P(y|\mathbf{x})$; neutrality classifier $Q(z|\mathbf{x})$; neutral-labeled data $\mathcal{L}^{z}\gets\emptyset$
2: while $B>0$ do
3:    $\mathbf{x}_{i}^{k}\gets$ SELECTSUBINSTANCE($\mathcal{U}$)
4:    $\mathcal{U}\gets\mathcal{U}\setminus\{\mathbf{x}_{i}\}$
5:    $a\gets$ QUERYORACLE($\mathbf{x}_{i}^{k}$)
6:    $B\gets B-C(\mathbf{x}_{i}^{k})$
7:    if $a\neq n$ then // non-neutral response
8:       $\mathcal{L}\gets\mathcal{L}\cup\{(\mathbf{x}_{i},a)\}$
9:       $P(y|\mathbf{x})\gets$ UPDATECLASSIFIER($\mathcal{L}$, $P$)
10:   $\mathcal{L}^{z}\gets\mathcal{L}^{z}\cup\{(\mathbf{x}_{i}^{k},\text{ISNEUTRAL}(a))\}$
11:   $Q(z|\mathbf{x})\gets$ UPDATECLASSIFIER($\mathcal{L}^{z}$, $Q$)
Table 1: User study results reporting average annotation time in seconds and percent of neutral labels by subinstance size.
| Size | IMDB (secs) | SRAA (secs) | IMDB (% neutral) | SRAA (% neutral) |
|---|---|---|---|---|
| 10 | 5.7 | 5.2 | 53 | 50 |
| 25 | 8.2 | 6.5 | 33 | 38 |
| 50 | 10.9 | 7.6 | 24 | 42 |
| 75 | 15.9 | 9.1 | 12 | 22 |
| 100 | 16.7 | 10.3 | 13 | 13 |
used to train the neutrality classifier $Q(z|\mathbf{x}_{i}^{k})$ (line 11). Algorithm 2 outlines this approach, where ISNEUTRAL maps the oracle answer to $z$ (i.e., $n$ or $\neg n$). As in the static strategy, we consider two settings of the utility function: uncertainty (dynamic-unc) and constant (dynamic-const). While both dynamic-unc and dynamic-const consider the cost of annotation, dynamic-unc balances utility with the chance of receiving a non-neutral label, while dynamic-const simply maximizes the chance of a non-neutral label.
Experimental Evaluation
Our experiments use two datasets: (1) IMDB: a collection of 50K reviews from IMDB.com labeled with positive or negative sentiment (Maas et al. 2011); (2) SRAA: a collection of 73K Usenet articles labeled as related to aviation or autos (Nigam et al. 1998).
Dynamic AAL Strategies
The static strategy ignores the impact that $k$ has on the likelihood of obtaining a neutral label from the oracle. In this section, we propose a dynamic strategy that models this probability directly and uses it to guide subinstance selection (see Algorithm 2).
Let $Q(z|\mathbf{x}_{i}^{k})$ be the probability distribution that models whether the oracle will return an "I don't know" answer (i.e., a neutral label) for the subinstance $\mathbf{x}_{i}^{k}$, where $z\in\{n,\neg n\}$. The objective of SELECTSUBINSTANCE (line 3) is to select the subinstance that maximizes utility and the probability of obtaining a non-neutral label, $\neg n$, while minimizing cost:
$$
\mathbf{x}_{i}^{k*}\gets\underset{\mathbf{x}_{i}^{k}\in\mathcal{U}^{k},\,\mathcal{U}^{k}\in\mathcal{S}}{\operatorname{argmax}}\frac{U(\mathbf{x}_{i})\,Q(z=\neg n|\mathbf{x}_{i}^{k})}{C(\mathbf{x}_{i}^{k})}
$$
In contrast to the static approach, the dynamic algorithm searches over an expanded set of $p$ different subinstance sizes: $\mathcal{S}=\{\mathcal{U}^{k_{1}},\dots,\mathcal{U}^{k_{p}}\}$.
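A minimal sketch of this dynamic selection rule follows; the quadratic response-rate estimate, the per-document utilities, and the word-count cost are toy stand-ins for the learned $Q$ and the real cost model.

```python
def dynamic_select(docs, sizes, utility, q_nonneutral, cost):
    """Pick (doc index, k) maximizing U(x_i) * Q(not n | x_i^k) / C(x_i^k)."""
    best, best_score = None, float("-inf")
    for i, doc in enumerate(docs):
        for k in sizes:
            sub = doc[:k]
            score = utility(i) * q_nonneutral(sub) / cost(sub)
            if score > best_score:
                best, best_score = (i, k), score
    return best

docs = [["w"] * 30, ["v"] * 30]            # two 30-word documents
utilities = [0.2, 0.9]                      # e.g., classifier uncertainty per doc
pick = dynamic_select(
    docs, sizes=[10, 25],
    utility=lambda i: utilities[i],
    q_nonneutral=lambda sub: (len(sub) / 30) ** 2,  # toy: longer -> likelier answered
    cost=lambda sub: float(len(sub)))
print(pick)  # -> (1, 25)
```

Here the larger subinstance of the high-utility document wins: under this toy $Q$, the extra reading cost is more than repaid by the higher chance of actually getting a label.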
The immediate question is how to estimate $Q(z|\mathbf{x}_{i}^{k})$. We propose a supervised learning approach using the previous interactions with the oracle as labeled examples. That is, we maintain an auxiliary binary labeled dataset $\mathcal{L}^{z}$ containing $(\mathbf{x}_{i}^{k},z_{i})$ pairs (line 10), indicating whether subinstance $\mathbf{x}_{i}^{k}$ received a neutral label or not. This dataset is
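One way to realize such a neutrality model is sketched here with a one-feature logistic regression on subinstance length (scaled to [0, 1]); the feature choice, learning rate, and data are illustrative assumptions, since the text does not fix $Q$'s form.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_q(lengths, answered, lr=0.5, epochs=5000):
    """Fit Q(not n | length) = sigmoid(w * length/100 + b) by gradient descent
    on past (subinstance, got-an-answer) interactions stored in L^z."""
    x = np.asarray(lengths, dtype=float) / 100.0
    y = np.asarray(answered, dtype=float)      # 1 = labeled, 0 = neutral
    w, b = 0.0, 0.0
    for _ in range(epochs):
        p = sigmoid(w * x + b)
        w -= lr * np.mean((p - y) * x)         # logistic-loss gradients
        b -= lr * np.mean(p - y)
    return w, b

# toy L^z: short subinstances were mostly neutral, long ones mostly answered
lengths  = [10, 10, 25, 25, 50, 75, 100, 100]
answered = [0,  0,  0,  1,  1,  1,  1,   1]
w, b = fit_q(lengths, answered)
q = lambda k: sigmoid(w * k / 100.0 + b)       # estimated response rate at size k
```

The fitted curve reproduces the qualitative pattern of Table 1: the estimated response rate rises with subinstance size.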
User Studies
To estimate the real-world relationships among subinstance size, annotation time, and response rate, we first performed several user studies in which subjects were shown document subinstances of varying sizes and asked to provide a correct label or an "I don't know" answer (i.e., a neutral label).
Each user performed six classification tasks per dataset, labeling document subinstances of sizes $\{10,25,50,75,100,\mathrm{All}\}$. For example, to create the 50-word task, we truncated the documents to the first 50 words. For each classification task, the users were asked to annotate 20 randomly-chosen documents from each class, resulting in 40 annotations per task. The documents were presented to the users in random order. For every subinstance, we recorded the annotation time, the number of words seen, and the label. We used the average over five users on the IMDB data and three users on the SRAA data.
Table 1 shows the average annotation time (in seconds) and the average percentage of neutral labels returned for each subinstance size. We find that annotation time varies by subinstance size and dataset. For instance, on IMDB the annotation time for subinstances of size 50 is $25\%$ greater than for subinstances of size 25. These responses are influenced by user experience and familiarity with the domain, among other factors.
Intuitively, the annotator will be more likely to provide a non-neutral label when he can see more of a document. This intuition was confirmed by our user studies. Table 1 shows that the percentage of neutral labels decreases as subinstance size increases. However, the rate at which the neutral answers decrease differs by dataset. For example, there was approximately a $50\%$ neutral rate on both datasets for subinstances with 10 words; yet for 75 words the neutral responses were $12\%$ on IMDB and $22\%$ on SRAA. We speculate that the SRAA dataset is a more specialized domain, whereas classifying movie reviews (IMDB) is easier for non-expert human annotators.
Simulations
We use the results of the user study to inform our large-scale studies on the two datasets.
Oracle In order to compare many AAL strategies at scale, it is necessary to simulate the actions of the human annotators. Specifically, we must simulate for which examples the annotator will return a neutral label. We wanted the simulation to reflect the fact that the lexical content of each subinstance influences its neutrality, rather than assigning neutral labels at random — e.g., if a subinstance has strong sentiment words it is not likely to be labeled neutral. To accomplish this, we trained two oracles (one per dataset) that mimic the human annotators. We simulated the oracle with a classifier trained on held-out data; a neutral label is returned when the class posterior probability for a subinstance $\mathbf{x}_{i}^{k}$ is below a specified threshold. We tune this classifier so that the pattern of neutral labels matches that observed in the user study.
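A sketch of such a thresholded oracle; the 0.75 cutoff below is a placeholder, since the actual threshold is tuned so that the simulated neutral rates match those observed in the user study.

```python
import numpy as np

def simulated_oracle(posterior, threshold=0.75):
    """Reluctant-oracle simulation: return the most likely class for a
    subinstance if the held-out classifier is confident enough, else None
    (a neutral 'I don't know')."""
    posterior = np.asarray(posterior, dtype=float)
    if posterior.max() < threshold:
        return None                      # content too ambiguous at this length
    return int(posterior.argmax())

print(simulated_oracle([0.60, 0.40]))    # -> None (below threshold: neutral)
print(simulated_oracle([0.92, 0.08]))    # -> 0 (confident: returns the label)
```

Because the posterior is computed from the subinstance's actual words, subinstances with strong class-indicative terms clear the threshold at smaller sizes, mirroring the human behavior described above.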