# Data Programming using Semi-Supervision and Subset Selection

## Abstract

The paradigm of data programming has shown considerable promise in using weak supervision, in the form of rules and labelling functions, for learning in several text classification scenarios where labelled data is not available. Another approach that has shown a lot of promise is semi-supervised learning, wherein we augment small amounts of labelled data with a large unlabelled dataset. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performance, particularly when the labelling functions are noisy. The first contribution of this work is the study of a framework for joint learning that combines unsupervised consensus from labelling functions with semi-supervised learning. We learn a joint model that efficiently uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study a subset selection approach to select the set of examples that can be used as the labelled set, so that the labelled data can complement the labelling functions, thereby achieving the best of both worlds. We demonstrate that by effectively combining the semi-supervision, data programming, and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets.

## Introduction

Modern machine learning techniques rely heavily on large amounts of labelled training data for text classification tasks such as spam detection, (movie) genre classification, and sequence labelling. Supervised learning approaches have utilised such large amounts of labelled data, and this has resulted in huge successes over the last decade. However, the acquisition of labelled data in most cases entails a painstaking process requiring months of human effort. Several techniques, such as active learning, distant supervision, crowd-consensus learning, and semi-supervised learning, have been proposed to reduce the annotation effort. However, clean annotated labels continue to be critical for reliable results. Recently, the paradigm of data programming was proposed, in which several Labelling Functions (LFs) written by humans are used to weakly associate labels with instances. In data programming, users encode the weak supervision in the form of labelling functions. On the other hand, traditional semi-supervised learning methods combine a small amount of labelled data with large unlabelled data. In this paper, we leverage semi-supervision in the feature space for more effective data programming using labelling functions.


### Motivating Example

We illustrate LFs on one of the seven tasks that we experimented with: identifying spam/non-spam comments in YouTube reviews. For some applications, writing LFs is often as simple as a keyword lookup or a regex expression. In this specific case, users construct heuristic patterns as LFs for identifying whether a comment is spam or not spam (ham). Each LF takes a comment as input and provides a label as output: +1 indicates that the comment is spam, -1 indicates that it is not spam, and 0 means that the LF is unable to assert anything for this example (also called an abstain). The table below presents two example LFs for spam and non-spam (ham) classification.

| Id | Description |
| --- | --- |
| LF1 | If http or https occurs in the comment text, then return +1; otherwise ABSTAIN (return 0). |
| LF2 | If the length of the comment is less than 5 words, then return -1; otherwise ABSTAIN (return 0). (Non-spam comments are often short.) |
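Concretely, the two LFs in the table above can be sketched as simple Python functions. This is a minimal sketch: the regex and the word-count threshold mirror the descriptions above, but the function names and return conventions are otherwise illustrative.

```python
import re

ABSTAIN, SPAM, HAM = 0, 1, -1

def lf1(comment: str) -> int:
    """LF1: vote spam (+1) if the comment contains an http/https link."""
    return SPAM if re.search(r"https?", comment) else ABSTAIN

def lf2(comment: str) -> int:
    """LF2: vote non-spam (-1) if the comment has fewer than 5 words."""
    return HAM if len(comment.split()) < 5 else ABSTAIN
```

For example, `lf1` fires on a comment containing a link, while `lf2` fires on a very short comment such as "I love this song".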

In isolation, a particular LF may be neither always correct nor complete. Furthermore, the LFs may also produce conflicting labels. In the past, generative models such as Snorkel and CAGE have been proposed for performing consensus on the noisy and conflicting labels assigned by the discrete LFs, in order to determine the probability of the correct labels. Labels thus obtained could be used for training any supervised model/classifier and evaluated on a test set. We next highlight a challenge in doing data programming using only LFs, which we attempt to address. For each of the following sentences $S_1 \ldots S_6$, we state the value of the true label ($\pm 1$). While $S_1$ and $S_4$ are instances of spam comments, $S_2$ and $S_3$ are not. In fact, these examples constitute one of the canonical cases that we discovered during the analysis of our approach in the section on subset selection.

1. $\langle\boldsymbol{S_1},+1 \rangle$: Please help me go to college guys! Thanks from the bottom of my heart. https://www.indiegogo.com/projects/
2. $\langle\boldsymbol{S_2},-1\rangle$: I love this song
3. $\langle\boldsymbol{S_3},-1\rangle$: This song is very good... but the video makes no sense...
4. $\langle\boldsymbol{S_4},+1\rangle$: https://www.facebook.com/teeLaLaLa

Further, suppose we have two completely unseen test instances, $S_5$ and $S_6$, whose labels we would also like to predict effectively:

5. $\langle\boldsymbol{S_5},-1\rangle$: This song is prehistoric
6. $\langle\boldsymbol{S_6},+1\rangle$: Watch Maroon 5's latest ... www.youtube.com/watch?v=TQ046FuAu00

**Training data (LF outputs)**

| id | Label | LF1 (+1) | LF2 (-1) |
| --- | --- | --- | --- |
| $S_1$ | +1 | 1 | 0 |
| $S_2$ | -1 | 0 | 1 |
| $S_3$ | -1 | 0 | 0 |
| $S_4$ | +1 | 1 | 1 |

**Test data (LF outputs)**

| id | Label | LF1 (+1) | LF2 (-1) |
| --- | --- | --- | --- |
| $S_5$ | -1 | 0 | 1 |
| $S_6$ | +1 | 0 | 0 |

In the table above, we present the outputs of the LFs, as well as some n-gram features F1 ('.com') and F2 ('This song'), on the observed training examples $S_1$, $S_2$, $S_3$ and $S_4$, as well as on the unseen test examples $S_5$ and $S_6$. For $S_1$, correct consensus can easily be performed to output the true label +1, since LF1 (designed for class +1) is triggered whereas LF2 (designed for class -1) is not. Similarly, for $S_2$, LF2 is triggered whereas LF1 is not, making correct consensus easy to perform. Hence, we have treated $S_1$ and $S_2$ as unlabelled, indicating that we could learn a model based on LFs alone, without supervision, if all we observed were these two examples and the outputs of LF1 and LF2. However, correct consensus on $S_3$ and $S_4$ is challenging, since LF1 and LF2 either both fire or both do not. While the (n-gram based) features F1 and F2 appear to be informative and could potentially complement LF1 and LF2, correlating feature values with LF outputs is tricky in a completely unsupervised setup. To address this issue, we ask the following questions: (A) What if we were provided access to the true labels of a small subset of instances, in this case only $S_3$ and $S_4$? Could (i) the correlation of feature values (F1 and F2) with labels (+1 and -1 respectively), modelled via a small set of labelled instances ($S_3$ and $S_4$), in conjunction with (ii) the correlation of feature values (F1 and F2) with LFs (LF1 and LF2), modelled via a potentially larger set of unlabelled instances ($S_1$, $S_2$), help improve prediction of labels for hitherto unseen test instances $S_5$ and $S_6$? (B) Can we precisely determine the subset of the unlabelled data which, when labelled, would help us train a model (in conjunction with the labelling functions) that is most effective on the test set?
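The consensus difficulty on $S_3$ and $S_4$ can be reproduced with a naive majority vote over the two example LFs. This is a minimal sketch, not the generative models used by Snorkel or CAGE; `lf1`/`lf2` re-implement the illustrative LF1/LF2 from the motivating example.

```python
import re

def lf1(c):  # +1 if the comment contains an http/https link, else abstain (0)
    return 1 if re.search(r"https?", c) else 0

def lf2(c):  # -1 if the comment has fewer than 5 words, else abstain (0)
    return -1 if len(c.split()) < 5 else 0

def majority_vote(comment):
    """Unsupervised consensus: sum the LF votes and take the sign.
    Returns +1 (spam), -1 (ham), or 0 (tie / all abstain)."""
    total = lf1(comment) + lf2(comment)
    return (total > 0) - (total < 0)

examples = {
    "S1": "Please help me go to college guys! Thanks from the bottom of "
          "my heart. https://www.indiegogo.com/projects/",
    "S2": "I love this song",
    "S3": "This song is very good... but the video makes no sense...",
    "S4": "https://www.facebook.com/teeLaLaLa",
}

for sid, text in examples.items():
    print(sid, majority_vote(text))
```

The vote decides $S_1$ (+1) and $S_2$ (-1) correctly, but returns 0 (no decision) on $S_3$ and $S_4$, which is exactly where feature information and a few labels can help.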
In other words, instead of randomly choosing the labelled dataset for semi-supervised learning (part A), can we intelligently select the labelled subset to effectively complement the labelling functions (part B)? In the above example, choosing $\{S_3, S_4\}$ as the labelled set would be much more useful than choosing $\{S_1, S_2\}$. As a solution to (A), we present a new formulation in which the parameters over features and LFs are jointly trained in a semi-supervised manner. As for (B), we present subset selection algorithms with a recipe that recommends the subset of the data ($S_3$ and $S_4$) which, if labelled, would most benefit the joint learning framework.
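One way to make the intuition behind (B) concrete is a simple heuristic sketch, which is not our actual subset selection algorithm: prefer labelling instances on which the LF votes cancel out, since labels there complement what the LFs already resolve. The helper below is hypothetical and purely illustrative.

```python
import re

def lf1(c): return 1 if re.search(r"https?", c) else 0
def lf2(c): return -1 if len(c.split()) < 5 else 0

def select_for_labelling(unlabelled, budget):
    """Pick up to `budget` instances on which the two LF votes cancel out
    (both fire with opposite signs, or both abstain): the labelled set
    then complements, rather than repeats, the labelling functions."""
    undecided = [x for x in unlabelled if lf1(x) + lf2(x) == 0]
    return undecided[:budget]

pool = [
    "Please help me go to college guys! Thanks from the bottom of "
    "my heart. https://www.indiegogo.com/projects/",              # S1
    "I love this song",                                           # S2
    "This song is very good... but the video makes no sense...",  # S3
    "https://www.facebook.com/teeLaLaLa",                         # S4
]
print(select_for_labelling(pool, budget=2))
```

On this pool the heuristic selects $S_3$ and $S_4$, the two examples on which LF consensus is undecided.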


### Our Contributions

We summarise our main contributions:

- To address (A), we present a novel formulation for jointly learning the parameters over features and labelling functions in a semi-supervised manner. We jointly learn a parameterised graphical model and a classifier model to optimise our overall objective.
- To address (B), we study a subset selection approach to select the set of examples that can be used as the labelled set. We show, in particular, that through a principled data selection approach, we can achieve significantly higher accuracies than by randomly selecting the seed labelled set for semi-supervised learning with labelling functions. Moreover, the automatically selected subset performs comparably to, or even better than, a subset hand-picked by humans, further emphasising the benefit of subset selection for semi-supervised data programming.
- Unlike prior work, we do not assume that we are given rule exemplars or a seed labelled set. Our framework is agnostic to the underlying network architecture and can be applied using different underlying techniques without changing the meta-approach.
- Finally, we evaluate our model on seven publicly available datasets from domains such as spam detection and record classification, and show improvements over state-of-the-art techniques. We also draw insights from experiments in synthetic settings.
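As a rough illustration of the joint objective in (A), the sketch below combines a supervised cross-entropy term on a small labelled set with an agreement term that pulls a logistic feature classifier towards the LF consensus distribution on unlabelled data. This is an assumption-laden simplification: the actual model is a parameterised graphical model over the LFs, and `q_unlab` here merely stands in for its consensus posterior.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_loss(theta, X_lab, y_lab, X_unlab, q_unlab, lam=1.0):
    """Sketch of a joint semi-supervised objective:
      - cross-entropy of the feature classifier on the small labelled set;
      - agreement (cross-entropy) between the classifier and the LF
        consensus distribution q on the large unlabelled set.
    y_lab in {0, 1}; q_unlab[i] = P(y=1) from LF consensus for x_i."""
    eps = 1e-9
    p_lab = sigmoid(X_lab @ theta)
    ce_sup = -np.mean(y_lab * np.log(p_lab + eps)
                      + (1 - y_lab) * np.log(1 - p_lab + eps))
    p_unlab = sigmoid(X_unlab @ theta)
    ce_agree = -np.mean(q_unlab * np.log(p_unlab + eps)
                        + (1 - q_unlab) * np.log(1 - p_unlab + eps))
    return ce_sup + lam * ce_agree
```

Both terms are differentiable in `theta` (and, in the full model, in the graphical-model parameters), so the two components can be trained jointly by gradient descent.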

## Related Work

In this section, we review related work on data programming, learning with rules, unsupervised learning with labelling functions, and data subset selection. Snorkel was proposed as a generative model that determines the probability of the correct label using consensus on the noisy and conflicting labels assigned by discrete LFs. CAGE was proposed as a graphical model that uses continuous-valued LFs, with scores obtained using soft-match techniques such as cosine similarity of word vectors, TF-IDF scores, and distance among entity pairs. Owing to its generative model, Snorkel is highly sensitive to initialisation and hyper-parameters. The CAGE model, on the other hand, introduced user-controlled quality guides that incorporate labeller intuition into the model. However, both models completely disregard feature information that could provide additional signal for learning the (graphical) model, and both learn a combined model over the labelling functions in an unsupervised manner. In practical scenarios, however, some labelled data is always available (or could be made available by labelling a few instances), and hence a completely unsupervised approach might not be the best solution. In this work, we augment these data programming approaches by designing a semi-supervised model that incorporates feature information and LFs to jointly learn the parameters. Prior work proposed a student-teacher model that transfers rule information into the weight parameters of a neural network; however, it assigns a linear weight to each rule based on an agreement objective. The model we propose in this paper jointly learns parameters over features and rules in a semi-supervised manner, rather than merely weighing their outputs, and can therefore be more powerful.
The only work that, to our knowledge, combines rules with supervised learning in a joint framework leverages both rules and labelled data by associating each rule with exemplars of correct firings (i.e., instantiations) of that rule. Their joint training algorithm denoises over-generalised rules and trains a classification model. Our approach differs from theirs in two ways: (a) we do not have information about rule exemplars, and (b) we employ a semi-supervised framework combined with a graphical model for consensus amongst the LFs to train our model. Finally, as discussed in the next paragraph, we also study how to automatically select the seed set of labelled data, rather than having a human provide this seed set. Another approach that has been gaining a lot of attention recently is data subset selection, whose specific application depends on the goal at hand. Data subset selection techniques have been used to reduce end-to-end training time, to select unlabelled points to label in an active learning manner, or for robustness.


In this paper, we use the framework of data subset selection for selecting a subset of unlabelled examples for which to obtain labels, such that the labels are complementary to the labelling functions. In other words, we can use the information provided by the labelling functions, along with other aspects such as diversity, to intelligently select the labelled subset so that the labelled data and the labelling functions are together most informative.
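A minimal sketch of one such combination (illustrative only; our actual method uses principled subset selection objectives): restrict candidates to points flagged by the labelling functions, e.g. points where the LFs conflict or all abstain, then greedily pick a diverse subset in feature space by max-min distance.

```python
import numpy as np

def greedy_diverse_subset(X, candidate_idx, k):
    """Greedy max-min selection: starting from the first candidate,
    repeatedly add the candidate farthest (in feature space) from the
    points already chosen, so the labelled set stays diverse."""
    chosen = [candidate_idx[0]]
    while len(chosen) < k and len(chosen) < len(candidate_idx):
        best, best_d = None, -1.0
        for i in candidate_idx:
            if i in chosen:
                continue
            d = min(np.linalg.norm(X[i] - X[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen
```

With feature vectors clustered in two groups, the sketch picks one point from each group rather than two near-duplicates, which is the diversity behaviour described above.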

## Conclusion

We present how data programming can benefit from the use of labelled data, by learning a model that jointly optimises the consensus obtained from labelling functions in an unsupervised manner along with semi-supervised loss functions designed on the feature space. We empirically assess the performance of the different components of the loss in our joint learning framework. As another contribution, we also study subset selection approaches to guide the selection of the subset of examples that can be used as the labelled dataset. We present the performance of our models and insights on both synthetic and real datasets. While outperforming previous approaches, we are often able to beat even an exemplar-based (skyline) approach that uses the additional information of the association of rules with specific labelled examples.