Code: https://github.com/ayushbits/Semi-Supervised-LFs-Subset-Selection

Data Programming using Semi-Supervision and Subset Selection



The paradigm of data programming has shown a lot of promise in using weak supervision in the form of rules and labelling functions for learning in several text classification scenarios in which labelled data is not available. Another approach which has shown a lot of promise is that of semi-supervised learning wherein we augment small amounts of labelled data with a large unlabelled dataset. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performance, particularly, in cases when the labelling functions are noisy. The first contribution of this work is the study of a framework for joint learning that combines un-supervised consensus from labelling functions with semi-supervised learning. We learn a joint model that efficiently uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study a subset selection approach to select the set of examples that can be used as the labelled set, so that the labelled data can complement the labelling functions, thereby achieving the best of both worlds. We demonstrate that by effectively combining semi-supervision, data-programming and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets.




Modern machine learning techniques rely excessively on large amounts of labelled training data for text classification tasks such as spam detection, (movie) genre classification and sequence labelling. Supervised learning approaches have utilised such large amounts of labelled data, and this has resulted in huge successes in the last decade. However, acquisition of labelled data, in most cases, entails a painstaking process requiring months of human effort. Several techniques such as active learning, distant supervision, crowd-consensus learning, and semi-supervised learning have been proposed to reduce this effort. However, clean annotated labels continue to be critical for reliable results. Recently, the paradigm of data programming was proposed, in which several Labelling Functions (LFs) written by humans are used to weakly associate labels with the instances. In data programming, users encode the weak supervision in the form of labelling functions. On the other hand, traditional semi-supervised learning methods combine a small amount of labelled data with large unlabelled data. In this paper, we leverage semi-supervision in the feature space for more effective data programming using labelling functions.



Motivating Example

We illustrate the LFs on one of the seven tasks that we experimented with, that of identifying spam/no-spam comments in YouTube reviews. For some applications, writing an LF is often as simple as a keyword lookup or a regular expression. In this specific case, the users construct heuristic patterns as LFs for identifying a comment as spam or not spam (ham). Each LF takes a comment as input and provides a label as output: +1 indicates that the comment is spam, -1 indicates that it is not spam, and 0 means that the LF is unable to assert anything for this example (also called abstain). The table below presents a few example LFs for spam and non-spam (ham) classification.

Id     Description
LF1    If http or https in comment text, then return +1, otherwise ABSTAIN (return 0).
LF2    If length of comment is less than 5 words, then return -1, otherwise ABSTAIN (return 0). (Non-spam comments are often short.)
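The two LFs in the table can be written as ordinary functions. The following is a minimal sketch, assuming plain substring matching and whitespace tokenisation; the constant and function names (`SPAM`, `HAM`, `ABSTAIN`, `lf1`, `lf2`) are illustrative and are not taken from the released code.

```python
# Label convention from the text: +1 = spam, -1 = not spam (ham), 0 = abstain.
SPAM, HAM, ABSTAIN = 1, -1, 0


def lf1(comment: str) -> int:
    """LF1: vote spam if the comment contains 'http' (this also covers 'https')."""
    return SPAM if "http" in comment else ABSTAIN


def lf2(comment: str) -> int:
    """LF2: vote ham if the comment has fewer than 5 whitespace-separated words."""
    return HAM if len(comment.split()) < 5 else ABSTAIN


if __name__ == "__main__":
    for text in ["I love this song", "Get free gift http://swagbucks.com/p/register"]:
        print(text, "->", lf1(text), lf2(text))
```

Each LF either votes for its own class or abstains, which is why a single LF is neither always correct nor complete.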

In isolation, a particular LF may be neither always correct nor complete. Furthermore, the LFs may also produce conflicting labels. In the past, generative models such as Snorkel and CAGE have been proposed for reaching consensus on the noisy and conflicting labels assigned by the discrete LFs, in order to determine the probability of the correct labels. Labels thus obtained can be used to train any supervised model/classifier, which is then evaluated on a test set. We next highlight a challenge in doing data programming using only LFs, which we attempt to address. For each of the following sentences $ S_1 \ldots S_6 $ , we state the value of the true label ( $ \pm 1 $ ). While the candidates in $ S_1 $ and $ S_4 $ are instances of a spam comment, the ones in $ S_2 $ and $ S_3 $ are not. In fact, these examples constitute one of the canonical cases that we discovered during the analysis of our approach in the section on subset selection.

1. $ \langle\boldsymbol{S_1},+1\rangle $ : Please help me go to college guys! Thanks from the bottom of my heart. https://www.indiegogo.com/projects/
2. $ \langle\boldsymbol{S_2},-1\rangle $ : I love this song
3. $ \langle\boldsymbol{S_3},-1\rangle $ : This song is very good... but the video makes no sense...
4. $ \langle\boldsymbol{S_4},+1\rangle $ : https://www.facebook.com/teeLaLaLa

Further, let us say we have two completely unseen test instances, $ S_5 $ and $ S_6 $ , whose labels we would also like to predict effectively:

5. $ \langle\boldsymbol{S_5},-1\rangle $ : This song is prehistoric
6. $ \langle\boldsymbol{S_6},+1\rangle $ : Watch Maroon 5's latest ... www.youtube.com/watch?v=TQ046FuAu00

A second set of example comments, using the same identifiers:

- $ \langle S_1,+1\rangle $ : Please help me go to college guys! Thanks from the bottom of my heart. https://www.indiegogo.com/projects/
- $ \langle S_2,-1\rangle $ : Are those real animals
- $ \langle S_3,-1\rangle $ : this video is very inaccurate, a tiger would rip her face of
- $ \langle S_4,+1\rangle $ : Get free gift http://swagbucks.com/p/register
- $ \langle S_5,-1\rangle $ : prehistoric song..has been
- $ \langle S_6,+1\rangle $ : Come check out our parody of this!

                        LF outputs
id         Label    LF1 (+1)   LF2 (-1)
Training data
$ S_1 $    +1       1          0
$ S_2 $    -1       0          1
$ S_3 $    -1       0          0
$ S_4 $    +1       1          1
Test data
$ S_5 $    -1       0          1
$ S_6 $    +1       0          0

In the table above, we present the outputs of the LFs, along with the n-gram features F1 ('.com') and F2 ('This song'), on the observed training examples $ S_1 $ , $ S_2 $ , $ S_3 $ and $ S_4 $ as well as the unseen test examples $ S_5 $ and $ S_6 $ . For $ S_1 $ , correct consensus can easily be performed to output the true label +1, as LF1 (designed for class +1) gets triggered whereas LF2 (designed for class -1) is not triggered. Similarly, for $ S_2 $ , LF2 gets triggered whereas LF1 is not, making the correct consensus easy to perform. Hence, we have treated $ S_1 $ and $ S_2 $ as unlabelled, indicating that we could learn a model based on LFs alone, without supervision, if all we observed were these two examples and the outputs of LF1 and LF2. However, correct consensus on $ S_3 $ and $ S_4 $ is challenging, since LF1 and LF2 either both fire or both do not. While the (n-gram based) features F1 and F2 appear to be informative and could potentially complement LF1 and LF2, we can easily see that correlating feature values with LF outputs is tricky in a completely unsupervised setup. To address this issue, we ask the following questions:

(A) What if we were provided access to the true labels of a small subset of instances, in this case only $ S_3 $ and $ S_4 $ ? Could (i) the correlation of feature values (F1 and F2) with labels (+1 and -1 respectively), modelled via a small set of labelled instances ( $ S_3 $ and $ S_4 $ ), in conjunction with (ii) the correlation of feature values (F1 and F2) with LFs (LF1 and LF2), modelled via a potentially larger set of unlabelled instances ( $ S_1 $ , $ S_2 $ ), help improve the prediction of labels for hitherto unseen test instances $ S_5 $ and $ S_6 $ ?

(B) Can we precisely determine the subset of the unlabelled data which, when labelled, would help us train a model (in conjunction with the labelling functions) that is most effective on the test set? In other words, instead of randomly choosing the labelled dataset for semi-supervised learning (part A), can we intelligently select the labelled subset so that it effectively complements the labelling functions? In the above example, choosing the labelled set as $ S_3, S_4 $ would be much more useful than choosing $ S_1, S_2 $ .

As a solution to (A), we present a new formulation in which the parameters over features and LFs are jointly trained in a semi-supervised manner. As for (B), we present subset selection algorithms, with a recipe that recommends the subset of the data ( $ S_3 $ and $ S_4 $ ) which, if labelled, would most benefit the joint learning framework.
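The difficulty with $ S_3 $ and $ S_4 $ can be reproduced with a few lines of code. The sketch below is only an illustration of the argument, not the subset selection algorithm of the paper: it assumes an unweighted majority vote as a stand-in for the consensus model, and the helper names (`majority_vote`, `ambiguous_ids`) are hypothetical.

```python
SPAM, HAM, ABSTAIN = 1, -1, 0


def lf1(comment):
    """Vote spam when the comment contains a link."""
    return SPAM if "http" in comment else ABSTAIN


def lf2(comment):
    """Vote ham when the comment has fewer than 5 words."""
    return HAM if len(comment.split()) < 5 else ABSTAIN


# Training comments S1..S4 exactly as listed in the text (the S1 URL is truncated there).
comments = {
    "S1": "Please help me go to college guys! Thanks from the bottom of my heart. "
          "https://www.indiegogo.com/projects/",
    "S2": "I love this song",
    "S3": "This song is very good... but the video makes no sense...",
    "S4": "https://www.facebook.com/teeLaLaLa",
}


def majority_vote(votes):
    """Assumed stand-in for consensus: sign of the vote sum, 0 on a tie or when all abstain."""
    total = sum(votes)
    return SPAM if total > 0 else HAM if total < 0 else ABSTAIN


def ambiguous_ids(comments, lfs):
    """Return the ids on which the LFs alone cannot decide (conflict or no LF fires)."""
    return [cid for cid, text in comments.items()
            if majority_vote([lf(text) for lf in lfs]) == ABSTAIN]


if __name__ == "__main__":
    print(ambiguous_ids(comments, [lf1, lf2]))
```

Under these assumptions the LFs settle $ S_1 $ and $ S_2 $ on their own, while $ S_3 $ (no LF fires) and $ S_4 $ (both fire) remain undecided, which is why labelling exactly those two instances is the more useful choice.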


Our Contributions

We summarise our main contributions. To address (A), we present a novel formulation for jointly learning the parameters over features and Labelling Functions in a semi-supervised manner. We jointly learn a parameterized graphical model and a classifier model to optimise our overall objective. To address (B), we study a subset selection approach to select the set of examples which can be used as the labelled set. We show, in particular, that through a principled data selection approach, we can achieve significantly higher accuracies than by just randomly selecting the seed labelled set for semi-supervised learning with labelling functions. Moreover, we also show that the automatically selected subset performs comparably to, or even better than, a subset hand-picked by humans, further emphasizing the benefit of subset selection for semi-supervised data programming. Unlike prior work that relies on rule exemplars, we do not assume that we are given rule exemplars or a seed labelled set. Our framework is agnostic to the underlying network architecture and can be applied using different underlying techniques without change in the meta-approach. Finally, we evaluate our model on seven publicly available datasets from domains such as spam detection and record classification and show improvement over state-of-the-art techniques. We also draw insights from experiments in synthetic settings.

- Scarcity of labelled data.
- Two popular approaches, i) semi-supervised learning and ii) learning using rules or LFs, can be combined to get the best of both techniques.
- We propose a model which involves semi-supervised learning on features and incorporates LFs provided by the users.
- We show that this method works better than using the individual approaches.
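To make the idea of a joint objective concrete, the following sketch combines a supervised loss on labelled instances with two losses on unlabelled instances. It is an assumption-laden illustration, not the formulation of the paper: the linear classifier, the entropy term, the smoothed vote share used in place of the parameterized graphical model, the equal weighting of the terms, and the binary features (F1 = '.com', F2 = 'This song', matched case-insensitively) are all choices made here for the example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Linear classifier over features; weights = [w_ham, w_spam]."""
    return softmax([sum(wi * xi for wi, xi in zip(w, x)) for w in weights])


def lf_consensus(votes):
    """Assumed stand-in for the graphical model: add-one smoothed vote share [P(ham), P(spam)]."""
    ham = 1 + sum(1 for v in votes if v == -1)
    spam = 1 + sum(1 for v in votes if v == 1)
    return [ham / (ham + spam), spam / (ham + spam)]


def cross_entropy(target, pred, eps=1e-12):
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))


def joint_loss(weights, labelled, unlabelled):
    """labelled: [(x, y)] with y in {0: ham, 1: spam}; unlabelled: [(x, lf_votes)]."""
    supervised = sum(cross_entropy([1 - y, y], predict(weights, x))
                     for x, y in labelled) / len(labelled)
    entropy = agreement = 0.0
    for x, votes in unlabelled:
        p = predict(weights, x)
        entropy += cross_entropy(p, p)                      # confident predictions on unlabelled data
        agreement += cross_entropy(lf_consensus(votes), p)  # classifier should agree with the LFs
    n = len(unlabelled)
    return supervised + entropy / n + agreement / n


# Feature vectors [F1, F2] derived here from the example comments (an assumption).
labelled = [([0, 1], 0), ([1, 0], 1)]              # S3 (ham), S4 (spam)
unlabelled = [([1, 0], [1, 0]), ([0, 1], [0, -1])]  # S1, S2 with their LF votes

if __name__ == "__main__":
    print(joint_loss([[0.0, 0.0], [0.0, 0.0]], labelled, unlabelled))
    print(joint_loss([[-2.0, 2.0], [2.0, -2.0]], labelled, unlabelled))
```

With all-zero weights every term equals log 2; weights that tie F1 to spam and F2 to ham lower the total, because the labelled pair fixes the direction and the unlabelled pair reinforces it through LF agreement.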



In this section, we review related work on data programming, learning with rules, unsupervised learning with labelling functions and data subset selection.

Snorkel has been proposed as a generative model to determine correct label probability using consensus on the noisy and conflicting labels assigned by the discrete LFs. CAGE has been proposed as a graphical model that uses continuous-valued LFs with scores obtained using soft match techniques such as cosine similarity of word vectors, TF-IDF score, and distance among entity pairs. Owing to its generative model, Snorkel is highly sensitive to initialisation and hyper-parameters. On the other hand, the CAGE model introduced user-controlled quality guides that incorporate labeller intuition into the model. However, these models completely disregard feature information that could provide additional information to learn the (graphical) model. Both these models try to learn a combined model for the labelling functions in an unsupervised manner. However, in practical scenarios, some labelled data is always available (or could be made available by labelling a few instances), and hence a completely unsupervised approach might not be the best solution. In this work, we augment these data programming approaches by designing a semi-supervised model that incorporates feature information and LFs to jointly learn the parameters.

A student and teacher model has also been proposed that transfers rule information into the weight parameters of a neural network. However, it assigns a linear weight to each rule based on an agreement objective. The model we propose in this paper jointly learns parameters over features and rules in a semi-supervised manner, rather than just weighing their outputs, and can therefore be more powerful.

The only work which, to our knowledge, combines rules with supervised learning in a joint framework leverages both rules and labelled data by associating each rule with exemplars of correct firings (i.e., instantiations) of that rule. Its joint training algorithm denoises over-generalized rules and trains a classification model. Our approach differs from that work in two ways: a) we do not have information about rule exemplars, and b) we employ a semi-supervised framework combined with a graphical model for consensus amongst the LFs to train our model. In addition, as discussed in the next paragraph, we also study how to automatically select the seed set of labelled data, rather than having a human provide this seed set, as was done in that work.

Finally, another approach which has been gaining a lot of attention recently is data subset selection. The specific application of data subset selection depends on the goal at hand. Data subset selection techniques have been used to reduce end-to-end training time, to select unlabelled points to label in an active learning manner, or for robustness.



In this paper, we use the framework of data subset selection to select a subset of unlabelled examples for which to obtain labels, such that they are complementary to the labelling functions. In other words, we can use the information provided by the labelling functions, along with other aspects like diversity, to intelligently select the labelled subset so that both the labelled data and the labelling functions are most informative.

- Snorkel and CAGE model
- They try to learn a combined model for the labelling functions in an unsupervised manner.
- Due to the unsupervised nature, we found Snorkel to be highly sensitive to initialisation and hyper-parameters. With no supervised data, it is difficult to set these parameters correctly.
- CAGE addresses this using quality guides. But these guides come from the people generating the LFs and depend on them providing semi-accurate values.
- They also treat LFs as complete black boxes and completely disregard the features of the instances, which might provide additional information to learn the graphical model. (Note: This point is debatable, since they use the learnt graphical model and train a feature-based supervised model on the learnt labels.)
- Also, in a practical scenario some labelled data is always available (or could be made available by labelling a few instances), and hence a completely unsupervised approach might not be the best approach.
- This paper combines LFs with feature-based learning.
- They model the LFs individually. They do not try to learn a combined model of the LFs and hence might lose out on information about agreement/disagreement among the LFs.
- They only train the LFs on the supervised data.
- Unsupervised data is only used to send a weak signal to the neural network model through the third component of the loss. They do not utilize the unsupervised data fully.
- The LFs are created by observing the labelled examples and thus contain a lot of bias. As a result, the LFs might have very low conflict but not very high coverage.
- Also, the scenario is somewhat restrictive, since there is a strong correlation between the labelled data and the LFs; in reality, LFs might not come from any observed data, or the LFs might be given without the data from which they were created.



We present how data programming can benefit from the use of labelled data by learning a model that jointly optimises the consensus obtained from labelling functions in an unsupervised manner along with semi-supervised loss functions designed in the feature space. We empirically assess the performance of the different components of the loss in our joint learning framework. As another contribution, we also study some subset selection approaches to guide the selection of the subset of examples that can be used as the labelled dataset. We report the performance of our models and present insights on both synthetic and real datasets. While outperforming previous approaches, we are often able to better even an exemplar-based (skyline) approach that uses the additional information of the association of rules with specific labelled examples.