# Data Programming using Semi-Supervision and Subset Selection

## Abstract

The paradigm of data programming has shown considerable promise in using weak supervision, in the form of rules and labelling functions, for learning in several text classification scenarios where labelled data is not available. Another approach that has shown a lot of promise is semi-supervised learning, wherein we augment small amounts of labelled data with a large unlabelled dataset. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performance, particularly when the labelling functions are noisy. The first contribution of this work is the study of a framework for joint learning that combines unsupervised consensus from labelling functions with semi-supervised learning. We learn a joint model that efficiently uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study a subset selection approach to select the set of examples that can be used as the labelled set, so that the labelled data can complement the labelling functions, thereby achieving the best of both worlds. We demonstrate that by effectively combining the semi-supervision, data programming, and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets.

## Introduction

Modern machine learning techniques rely heavily on large amounts of labelled training data for text classification tasks such as spam detection, (movie) genre classification, and sequence labelling. Supervised learning approaches have utilised such large amounts of labelled data, and this has resulted in huge successes over the last decade. However, the acquisition of labelled data in most cases entails a painstaking process requiring months of human effort. Several techniques, such as active learning, distant supervision, crowd-consensus learning, and semi-supervised learning, have been proposed to reduce the annotation effort. However, clean annotated labels continue to be critical for reliable results. Recently, the paradigm of data programming was proposed, in which several Labelling Functions (LFs) written by humans are used to weakly associate labels with instances. In data programming, users encode the weak supervision in the form of labelling functions. On the other hand, traditional semi-supervised learning methods combine a small amount of labelled data with large unlabelled data. In this paper, we leverage semi-supervision in the feature space for more effective data programming using labelling functions.


### Motivating Example

We illustrate LFs on one of the seven tasks that we experimented with: identifying spam/non-spam comments in YouTube reviews. For some applications, writing LFs is often as simple as a keyword lookup or a regex expression. In this specific case, users construct heuristic patterns as LFs for identifying whether a comment is spam or not spam (ham). Each LF takes a comment as input and provides a label as output: +1 indicates that the comment is spam, -1 indicates that it is not spam, and 0 means that the LF is unable to assert anything for this example (also called an abstain). The table below presents two example LFs for spam and non-spam (ham) classification.

| Id | Description |
| --- | --- |
| LF1 | If http or https occurs in the comment text, then return +1; otherwise ABSTAIN (return 0). |
| LF2 | If the length of the comment is less than 5 words, then return -1; otherwise ABSTAIN (return 0). (Non-spam comments are often short.) |
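Concretely, the two LFs in the table above can be sketched as simple Python functions. This is a minimal sketch: the regex and the word-count threshold mirror the descriptions above, but the function names and return conventions are otherwise illustrative.

```python
import re

ABSTAIN, SPAM, HAM = 0, 1, -1

def lf1(comment: str) -> int:
    """LF1: vote spam (+1) if the comment contains an http/https link."""
    return SPAM if re.search(r"https?", comment) else ABSTAIN

def lf2(comment: str) -> int:
    """LF2: vote non-spam (-1) if the comment has fewer than 5 words."""
    return HAM if len(comment.split()) < 5 else ABSTAIN
```

For example, `lf1` fires on a comment containing a link, while `lf2` fires on a very short comment such as "I love this song".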

In isolation, a particular LF may be neither always correct nor complete. Furthermore, the LFs may also produce conflicting labels. In the past, generative models such as Snorkel and CAGE have been proposed for performing consensus on the noisy and conflicting labels assigned by the discrete LFs, in order to determine the probability of the correct labels. Labels thus obtained could be used for training any supervised model/classifier and evaluated on a test set. We next highlight a challenge in doing data programming using only LFs, which we attempt to address. For each of the following sentences $S_1 \ldots S_6$, we state the value of the true label ($\pm 1$). While $S_1$ and $S_4$ are instances of spam comments, $S_2$ and $S_3$ are not. In fact, these examples constitute one of the canonical cases that we discovered during the analysis of our approach in the section on subset selection.

1. $\langle\boldsymbol{S_1},+1 \rangle$: Please help me go to college guys! Thanks from the bottom of my heart. https://www.indiegogo.com/projects/
2. $\langle\boldsymbol{S_2},-1\rangle$: I love this song
3. $\langle\boldsymbol{S_3},-1\rangle$: This song is very good... but the video makes no sense...
4. $\langle\boldsymbol{S_4},+1\rangle$: https://www.facebook.com/teeLaLaLa

Further, suppose we have two completely unseen test instances, $S_5$ and $S_6$, whose labels we would also like to predict effectively:

5. $\langle\boldsymbol{S_5},-1\rangle$: This song is prehistoric
6. $\langle\boldsymbol{S_6},+1\rangle$: Watch Maroon 5's latest ... www.youtube.com/watch?v=TQ046FuAu00

**Training data (LF outputs)**

| id | Label | LF1 (+1) | LF2 (-1) |
| --- | --- | --- | --- |
| $S_1$ | +1 | 1 | 0 |
| $S_2$ | -1 | 0 | 1 |
| $S_3$ | -1 | 0 | 0 |
| $S_4$ | +1 | 1 | 1 |

**Test data (LF outputs)**

| id | Label | LF1 (+1) | LF2 (-1) |
| --- | --- | --- | --- |
| $S_5$ | -1 | 0 | 1 |
| $S_6$ | +1 | 0 | 0 |

In the table above, we present the outputs of the LFs, as well as some n-gram features F1 ('.com') and F2 ('This song'), on the observed training examples $S_1$, $S_2$, $S_3$ and $S_4$, as well as on the unseen test examples $S_5$ and $S_6$. For $S_1$, correct consensus can easily be performed to output the true label +1, since LF1 (designed for class +1) is triggered whereas LF2 (designed for class -1) is not. Similarly, for $S_2$, LF2 is triggered whereas LF1 is not, making correct consensus easy to perform. Hence, we have treated $S_1$ and $S_2$ as unlabelled, indicating that we could learn a model based on LFs alone, without supervision, if all we observed were these two examples and the outputs of LF1 and LF2. However, correct consensus on $S_3$ and $S_4$ is challenging, since LF1 and LF2 either both fire or both do not. While the (n-gram based) features F1 and F2 appear to be informative and could potentially complement LF1 and LF2, correlating feature values with LF outputs is tricky in a completely unsupervised setup. To address this issue, we ask the following questions: (A) What if we were provided access to the true labels of a small subset of instances, in this case only $S_3$ and $S_4$? Could (i) the correlation of feature values (F1 and F2) with labels (+1 and -1 respectively), modelled via a small set of labelled instances ($S_3$ and $S_4$), in conjunction with (ii) the correlation of feature values (F1 and F2) with LFs (LF1 and LF2), modelled via a potentially larger set of unlabelled instances ($S_1$, $S_2$), help improve prediction of labels for hitherto unseen test instances $S_5$ and $S_6$? (B) Can we precisely determine the subset of the unlabelled data which, when labelled, would help us train a model (in conjunction with the labelling functions) that is most effective on the test set?
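The consensus difficulty on $S_3$ and $S_4$ can be reproduced with a naive majority vote over the two example LFs. This is a minimal sketch, not the generative models used by Snorkel or CAGE; `lf1`/`lf2` re-implement the illustrative LF1/LF2 from the motivating example.

```python
import re

def lf1(c):  # +1 if the comment contains an http/https link, else abstain (0)
    return 1 if re.search(r"https?", c) else 0

def lf2(c):  # -1 if the comment has fewer than 5 words, else abstain (0)
    return -1 if len(c.split()) < 5 else 0

def majority_vote(comment):
    """Unsupervised consensus: sum the LF votes and take the sign.
    Returns +1 (spam), -1 (ham), or 0 (tie / all abstain)."""
    total = lf1(comment) + lf2(comment)
    return (total > 0) - (total < 0)

examples = {
    "S1": "Please help me go to college guys! Thanks from the bottom of "
          "my heart. https://www.indiegogo.com/projects/",
    "S2": "I love this song",
    "S3": "This song is very good... but the video makes no sense...",
    "S4": "https://www.facebook.com/teeLaLaLa",
}

for sid, text in examples.items():
    print(sid, majority_vote(text))
```

The vote decides $S_1$ (+1) and $S_2$ (-1) correctly, but returns 0 (no decision) on $S_3$ and $S_4$, which is exactly where feature information and a few labels can help.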
In other words, instead of randomly choosing the labelled dataset for semi-supervised learning (part A), can we intelligently select the labelled subset to effectively complement the labelling functions (part B)? In the above example, choosing $\{S_3, S_4\}$ as the labelled set would be much more useful than choosing $\{S_1, S_2\}$. As a solution to (A), we present a new formulation in which the parameters over features and LFs are jointly trained in a semi-supervised manner. As for (B), we present subset selection algorithms with a recipe that recommends the subset of the data ($S_3$ and $S_4$) which, if labelled, would most benefit the joint learning framework.
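One way to make the intuition behind (B) concrete is a simple heuristic sketch, which is not our actual subset selection algorithm: prefer labelling instances on which the LF votes cancel out, since labels there complement what the LFs already resolve. The helper below is hypothetical and purely illustrative.

```python
import re

def lf1(c): return 1 if re.search(r"https?", c) else 0
def lf2(c): return -1 if len(c.split()) < 5 else 0

def select_for_labelling(unlabelled, budget):
    """Pick up to `budget` instances on which the two LF votes cancel out
    (both fire with opposite signs, or both abstain): the labelled set
    then complements, rather than repeats, the labelling functions."""
    undecided = [x for x in unlabelled if lf1(x) + lf2(x) == 0]
    return undecided[:budget]

pool = [
    "Please help me go to college guys! Thanks from the bottom of "
    "my heart. https://www.indiegogo.com/projects/",              # S1
    "I love this song",                                           # S2
    "This song is very good... but the video makes no sense...",  # S3
    "https://www.facebook.com/teeLaLaLa",                         # S4
]
print(select_for_labelling(pool, budget=2))
```

On this pool the heuristic selects $S_3$ and $S_4$, the two examples on which LF consensus is undecided.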


### Our Contributions

We summarise our main contributions:

- To address (A), we present a novel formulation for jointly learning the parameters over features and labelling functions in a semi-supervised manner. We jointly learn a parameterised graphical model and a classifier model to optimise our overall objective.
- To address (B), we study a subset selection approach to select the set of examples that can be used as the labelled set. We show, in particular, that through a principled data selection approach, we can achieve significantly higher accuracies than by randomly selecting the seed labelled set for semi-supervised learning with labelling functions. Moreover, the automatically selected subset performs comparably to, or even better than, a subset hand-picked by humans, further emphasising the benefit of subset selection for semi-supervised data programming.
- Unlike prior work, we do not assume that we are given rule exemplars or a seed labelled set. Our framework is agnostic to the underlying network architecture and can be applied using different underlying techniques without changing the meta-approach.
- Finally, we evaluate our model on seven publicly available datasets from domains such as spam detection and record classification, and show improvements over state-of-the-art techniques. We also draw insights from experiments in synthetic settings.
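As a rough illustration of the joint objective in (A), the sketch below combines a supervised cross-entropy term on a small labelled set with an agreement term that pulls a logistic feature classifier towards the LF consensus distribution on unlabelled data. This is an assumption-laden simplification: the actual model is a parameterised graphical model over the LFs, and `q_unlab` here merely stands in for its consensus posterior.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_loss(theta, X_lab, y_lab, X_unlab, q_unlab, lam=1.0):
    """Sketch of a joint semi-supervised objective:
      - cross-entropy of the feature classifier on the small labelled set;
      - agreement (cross-entropy) between the classifier and the LF
        consensus distribution q on the large unlabelled set.
    y_lab in {0, 1}; q_unlab[i] = P(y=1) from LF consensus for x_i."""
    eps = 1e-9
    p_lab = sigmoid(X_lab @ theta)
    ce_sup = -np.mean(y_lab * np.log(p_lab + eps)
                      + (1 - y_lab) * np.log(1 - p_lab + eps))
    p_unlab = sigmoid(X_unlab @ theta)
    ce_agree = -np.mean(q_unlab * np.log(p_unlab + eps)
                        + (1 - q_unlab) * np.log(1 - p_unlab + eps))
    return ce_sup + lam * ce_agree
```

Both terms are differentiable in `theta` (and, in the full model, in the graphical-model parameters), so the two components can be trained jointly by gradient descent.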

## Related Work

In this section, we review related work on data programming, learning with rules, unsupervised learning with labelling functions, and data subset selection. Snorkel was proposed as a generative model that determines the probability of the correct label using consensus on the noisy and conflicting labels assigned by discrete LFs. CAGE was proposed as a graphical model that uses continuous-valued LFs, with scores obtained using soft-match techniques such as cosine similarity of word vectors, TF-IDF scores, and distance among entity pairs. Owing to its generative model, Snorkel is highly sensitive to initialisation and hyper-parameters. The CAGE model, on the other hand, introduced user-controlled quality guides that incorporate labeller intuition into the model. However, both models completely disregard feature information that could provide additional signal for learning the (graphical) model, and both learn a combined model over the labelling functions in an unsupervised manner. In practical scenarios, however, some labelled data is always available (or could be made available by labelling a few instances), and hence a completely unsupervised approach might not be the best solution. In this work, we augment these data programming approaches by designing a semi-supervised model that incorporates feature information and LFs to jointly learn the parameters. Prior work proposed a student-teacher model that transfers rule information into the weight parameters of a neural network; however, it assigns a linear weight to each rule based on an agreement objective. The model we propose in this paper jointly learns parameters over features and rules in a semi-supervised manner, rather than merely weighing their outputs, and can therefore be more powerful.
The only work that, to our knowledge, combines rules with supervised learning in a joint framework leverages both rules and labelled data by associating each rule with exemplars of correct firings (i.e., instantiations) of that rule. Their joint training algorithm denoises over-generalised rules and trains a classification model. Our approach differs from theirs in two ways: (a) we do not have information about rule exemplars, and (b) we employ a semi-supervised framework combined with a graphical model for consensus amongst the LFs to train our model. Finally, as discussed in the next paragraph, we also study how to automatically select the seed set of labelled data, rather than having a human provide this seed set. Another approach that has been gaining a lot of attention recently is data subset selection, whose specific application depends on the goal at hand. Data subset selection techniques have been used to reduce end-to-end training time, to select unlabelled points to label in an active learning manner, or for robustness.


In this paper, we use the framework of data subset selection for selecting a subset of unlabelled examples for which to obtain labels, such that the labels are complementary to the labelling functions. In other words, we can use the information provided by the labelling functions, along with other aspects such as diversity, to intelligently select the labelled subset so that the labelled data and the labelling functions are together most informative.
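A minimal sketch of one such combination (illustrative only; our actual method uses principled subset selection objectives): restrict candidates to points flagged by the labelling functions, e.g. points where the LFs conflict or all abstain, then greedily pick a diverse subset in feature space by max-min distance.

```python
import numpy as np

def greedy_diverse_subset(X, candidate_idx, k):
    """Greedy max-min selection: starting from the first candidate,
    repeatedly add the candidate farthest (in feature space) from the
    points already chosen, so the labelled set stays diverse."""
    chosen = [candidate_idx[0]]
    while len(chosen) < k and len(chosen) < len(candidate_idx):
        best, best_d = None, -1.0
        for i in candidate_idx:
            if i in chosen:
                continue
            d = min(np.linalg.norm(X[i] - X[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen
```

With feature vectors clustered in two groups, the sketch picks one point from each group rather than two near-duplicates, which is the diversity behaviour described above.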

## Conclusion

We present how data programming can benefit from the use of labelled data, by learning a model that jointly optimises the consensus obtained from labelling functions in an unsupervised manner along with semi-supervised loss functions designed on the feature space. We empirically assess the performance of the different components of the loss in our joint learning framework. As another contribution, we also study subset selection approaches to guide the selection of the subset of examples that can be used as the labelled dataset. We present the performance of our models and insights on both synthetic and real datasets. While outperforming previous approaches, we are often able to beat even an exemplar-based (skyline) approach that uses the additional information of the association of rules with specific labelled examples.