Automatic Historical Feature Generation through Tree-based Method in Ads Prediction
Abstract
Historical features are important in ads click-through rate (CTR) prediction, because they account for past engagements between users and ads. In this paper, we study how to efficiently construct historical features through counting features. The key challenge of this problem lies in how to automatically identify the counting keys. We propose a tree-based method for counting key selection. The intuition is that a decision tree naturally provides various combinations of features, which can be used as counting key candidates. In order to select personalized counting features, we train one decision tree model per user, and the counting keys are selected across different users with a frequency-based importance measure. To validate the effectiveness of the proposed solution, we conduct large-scale experiments on Twitter\footnote{Twitter and Tweets are registered marks of Twitter Inc.} video advertising data. In both online learning and offline training settings, the automatically identified counting features outperform the manually curated counting features.
Introduction
The data-rich advertising environment provides ample opportunities for machine learning to play a key role, and the ads CTR prediction problem is one of the most important tasks. At Twitter, ads are shown in the form of a Promoted Tweet. Twitter users usually consume Tweets in their home timeline, where the Tweets created by the people they follow are presented. Among the list of user-generated Tweets, Promoted Tweets are injected at a set of given positions. The selection of which ads to show is based on the combination of predicted CTR and advertising bidding. We use impression to refer to showing an ad to a user, and engagement to refer to a user acting on the ad, e.g., clicking on the link or watching the video ad.

Contextual features are the most straightforward features for CTR prediction; they account for the current context of an impression, including user information, ads content, nearby Tweets, etc. However, with contextual features alone it is very difficult to learn a CTR prediction model, due to two data sparsity issues. First, the feature space is huge, e.g., for ID-based features. Second, the number of samples for each user-ad pair is small.

To address the issues above, historical features are added to the prediction model. The historical features capture the past engagement information of the user-ad pair, while avoiding explicit matching of users or ads. For example, past CTR is a strong feature in a CTR model. Namely, even if user $ u $ only saw ad $ d $ once and engaged with it (a past CTR of $ 100\% $), the model is confident to predict a high CTR for showing $ d $ to $ u $ in the future. As a result, instead of maintaining the user ID and ad ID as contextual features, we could simply add one historical feature: the past CTR. Both contextual features and historical features are indispensable for CTR prediction. However, it has been shown that the historical features play a more important role than the contextual features.

We use counting feature to refer to the counting-based implementation of historical features. Furthermore, we could count the past engagements under different conditions, which are called counting keys. When selecting proper counting keys, there are a lot of choices. Basically, we can count on any categorical features, and even combinations of them. In this paper, we aim to efficiently extract useful counting keys. This task is challenging for the following three reasons. First, exhaustive search is impossible, given an exponential search space. Second, the counting keys suffer from sparsity, i.e., the coverage of most counting keys is low. Third, as the size of a counting key increases, it is hard to semantically interpret it.
In this paper, we propose a method to automatically select counting keys from tree-based models. The intuition is that a path in the tree, which consists of several features, is a good candidate for a counting key. To achieve this goal, we independently learn one tree-based model per user, for a selected set of users who have sufficient impression data. We use a frequency-based measure as the weighting mechanism for each path in the trees. The paths with the highest weights are selected as our counting keys, which guarantees that our counting features have good coverage. Through extensive experiments with Twitter's real ads traffic, we show that the selected counting features consistently improve the CTR prediction performance. The contributions of this paper are summarized as follows. First, we propose a tree-based method to automatically select counting keys. Second, we compare the automatically selected counting features with a set of human-curated features, and show better performance. Third, we evaluate our method in two widely used settings, i.e., online learning with logistic regression and offline training with neural network models, and show consistent results.
The rest of this paper is organized as follows. In Section 2, we present related work in the literature. In Section 3, the formal definition of our problem is given. The method is described in Section 4, and the experiment results are shown in Section 5. Finally, we conclude our paper with future work in Section 6.
Related Work
CTR Prediction is a well-studied problem. The logistic regression model was first used for CTR prediction at scale (Battelle, 2011; Richardson et al., 2007). Later, online learning was adopted in ads systems (Ciaramita et al., 2008; Graepel et al., 2010), due to the rapid growth of data volume. To highlight the relationship and differences between theoretical advances and practical engineering in CTR prediction, McMahan et al. (McMahan et al., 2013) at Google share insights and lessons learned from their large-scale online learning CTR prediction system. Facebook (He et al., 2014) employs a gradient boosted decision tree (GBDT) to learn a feature transformation, and shows that applying logistic regression on the transformed features achieves better performance. Li et al. (Li et al., 2015) show that a learning-to-rank model, which learns the relative order of ads, actually outperforms a model that independently learns the CTR of each impression, using Twitter timeline advertising data. As ads content is getting richer with images and videos, deep learning models are used to account for the visual information (Chen et al., 2016).
As we can see, feature engineering is crucial for improving CTR prediction. In this paper, we aim to achieve the same goal by constructing effective historical features.
Personalization is an important aspect of CTR prediction to improve long-term user engagement (Wu et al., 2017). Various probabilistic models have been proposed to model the latent factors on the user domain (Liu et al., 2010; Cheng and Cantú-Paz, 2010). For example, Shen et al. (Shen et al., 2012) propose to use graphical models and tensor decomposition techniques to learn the latent factors of users. In highly dynamic domains, such as news recommendation and digital advertisement, exploration-exploitation strategies are widely adopted for personalization. The multi-armed bandit is a typical algorithm of such strategies. Li et al. (Li et al., 2016) use a multi-armed bandit to dynamically group similar users.
Note that in this paper, we devise a counting feature selection method with personalization as a constraint. We differ from the personalization literature in that we are not trying to invent a new personalization method.
Attribute Interactions. Within a counting key, a group of features jointly act as a CTR indicator. A relevant concept is attribute interaction (Freitas, 2001), which refers to the fact that two or more attributes jointly present a strong correlation with the label, while each of them individually correlates loosely with the label. Here, attribute and feature are used interchangeably. There is literature focusing on the analysis of attribute interactions (Jakulin and Bratko, 2003), testing the significance of feature interactions (Jakulin and Bratko, 2004), etc.
We should note that counting key selection is more complicated than attribute interaction. Attribute interaction mainly studies the interactions of several features within one sample. Meanwhile, counting features are constructed from several features and aggregated over multiple historical samples.
Feature Selection. Finally, we want to emphasize that feature selection (Molina et al., 2002; Guyon and Elisseeff, 2003; Liu and Yu, 2005; Chandrashekar and Sahin, 2014) is another highly relevant research field. Feature selection aims at finding a subset of features that is representative of the original data. Various feature selection methods have been proposed in the literature, falling mainly into three categories (Chandrashekar and Sahin, 2014). (1) Filter-based methods (Blum and Langley, 1997) define a relevance score for each feature and select features with a certain threshold. (2) Wrapper methods use the predictor as a black box and the predictor performance as the objective function to evaluate feature subsets; examples are the Sequential Feature Selection method (Reunanen, 2003) and the Genetic Algorithm (Goldberg, 1989). (3) Unsupervised feature selection methods (Mitra et al., 2002; Li et al., 2012) have been proposed as well, where clustering is the primarily used technique.
These feature selection methods cannot be used for our problem, because they do not address our challenges. In addition to the exponential search space, implementing and storing the counting features is costly as well. What's more, conventional feature selection methods do not consider the sparsity of features at all.
Problem Definition
Given an online advertising system, we have millions of users, denoted as $ \mathcal{U} $, and tens of thousands of ads, denoted as $ \mathcal{D} $. The objective of an advertising system is to present the most relevant ads to users. CTR is the most commonly used measure of ads relevance, which measures the probability that a user clicks on an ad. Usually there is one prediction model $ f $ to be learned from billions of impressions. Each impression contains the information about showing ad $ d $ to user $ u $ within a certain context. It is important to differentiate different users and ads. To this end, we extract as many contextual features as possible for each impression, denoted as $ \mathbf{c} $. The contextual feature vector $ \mathbf{c} = [ c_1, c_2, \cdots, c_N ]\in\mathbb{R}^N $ is an $ N $-dimensional vector, and each dimension $ c_j $ corresponds to one feature. There are three types of contextual features: user features, ad features, and context features. The user features, such as user gender and age, differentiate users. The ad features are used to differentiate ads; examples are advertiser category and ads content. The context features track the context of the current impression, including time of impression, ad injection location, and nearby organic Tweets information. The ads relevance prediction problem is defined as follows.
Definition 3.1 (CTR prediction).
Given a set of impressions $ I $, each impression $ i $ denotes one event that ad $ d_i \in \mathcal{D} $ is shown to user $ u_i \in \mathcal{U} $. We extract contextual features $ \mathbf{c}_i $ for impression $ i $. The CTR prediction aims to learn a prediction model
$$ \hat{p}_{i} = f(\mathbf{c}_{i}), $$
where $\hat{p}_{i}$ is the predicted CTR.
Another important signal for CTR prediction is the user's past engagements, namely historical features, which are missing in Definition 3.1. We use counting features as one implementation of the historical engagement. To give the formal definition of counting features, we first define counting keys.
Definition 3.2 (Counting key).
A counting key $ k $ is a combination of two or more contextual features, denoted as a feature set $ k = \{ c_{p_1}, c_{p_2}, c_{p_3}, \cdots \} $. Without loss of generality, we assume the $ c_{p_i} $ are discrete features. Each counting key $ k $ is defined over a pair of user and ad $ \langle u, d \rangle $.
Definition 3.3 (Counting feature).
A counting feature accounts for the past engagements over a pair of user and ad $ \langle u, d \rangle $. Given a counting key $ k $, we are able to generate several counting features for the user-ad pair $ \langle u, d \rangle $, such as the number of past impressions $ h_{k}^{i} $, the number of past engagements $ h_{k}^{e} $, and the past CTR $ h_{k}^{p} $. For each impression $ i $ and a given counting key set $ K = \{ k_1, k_2, \cdots \} $, we extract all counting features as
$$ h_{i} = \{ h_{k_1}^{i}, h_{k_2}^{i}, \cdots, h_{k_1}^{e}, h_{k_2}^{e}, \cdots, h_{k_1}^{p}, h_{k_2}^{p}, \cdots \} $$
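To make the construction concrete, below is a minimal sketch of how the three counting features could be accumulated from a chronologically ordered impression log for one counting key. The record layout (`user`, `ad`, `context`, `clicked`) is an illustrative assumption, not the production schema.

```python
from collections import defaultdict

def counting_features(impressions, key):
    """Sketch: accumulate past impressions h^i_k, past engagements h^e_k,
    and past CTR h^p_k for one counting key (a set of feature names).
    `impressions` is assumed to be sorted by time."""
    counts = defaultdict(lambda: [0, 0])  # key value -> [impressions, engagements]
    features = []
    for imp in impressions:
        # The counting key value: user, ad, and the selected contextual features.
        bucket = (imp["user"], imp["ad"],
                  tuple(imp["context"][c] for c in sorted(key)))
        n_imp, n_eng = counts[bucket]
        features.append({
            "h_i": n_imp,                            # past impressions
            "h_e": n_eng,                            # past engagements
            "h_p": n_eng / n_imp if n_imp else 0.0,  # past CTR
        })
        # Update counts only after emitting the features, so every impression
        # sees strictly past data (no label leakage).
        counts[bucket][0] += 1
        counts[bucket][1] += int(imp["clicked"])
    return features
```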
In this paper, our goal is to automatically select counting keys.
Definition 3.4 (Counting feature keys selection).
The counting feature key selection problem aims to find a set of counting keys $ K $ from the given contextual features $ \mathbf{c} $. The goal is to generate counting features $ h_{i} $ for each impression $ i $, and to improve the CTR prediction model through
$$ \hat{p}_{i} = f(\mathbf{c}_{i}, h_{i}) $$
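As an illustration of the improved model $ \hat{p}_{i} = f(\mathbf{c}_{i}, h_{i}) $, here is a hedged sketch in which the counting features are simply concatenated with the contextual features and fed to an online logistic regression; the synthetic arrays stand in for real impression data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Tiny synthetic stand-ins: 8 contextual features c_i and 3 counting
# features h_i (past impressions, past engagements, past CTR).
c_batch = rng.random((256, 8))
h_batch = rng.random((256, 3))
y_batch = rng.integers(0, 2, 256)

# f(c_i, h_i): online logistic regression over the joined feature vector.
model = SGDClassifier(loss="log_loss")         # requires scikit-learn >= 1.1
x = np.hstack([c_batch, h_batch])              # join c_i and h_i
model.partial_fit(x, y_batch, classes=[0, 1])  # one online update

p_hat = model.predict_proba(x[:1])[0, 1]       # predicted CTR for one impression
```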
Method
Overview
We visualize the overall flow of our method in Figure 1. We take a stream of impressions as input, where the contextual features (green features, denoted as $ \mathbf{c}_i $) of impressions are available. Next, in order to generate the counting key candidates, we learn one tree-based binary classification model per user to predict the click label. The motivation behind training one tree model per user is personalization: the simplest approach to achieve personalization in the counting features is to keep the user ID as one component in the key. With the tree-based model, each path in the trees is a set of features that together act as a strong indicator of the click event. A counting key candidate is a set of features, constructed by adding the user ID into the feature set from one tree path. In the end, we have many counting key candidates from those tree-based models. In order to find counting keys that are discriminative for all users, we use a frequency-based measure as the weighting measure, as sketched below. Finally, we generate counting features for each pair of user and ad according to the selected counting keys. These counting features (yellow features, denoted as $ \mathbf{h}_i $) are joined with the contextual features to serve the CTR prediction.
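A minimal sketch of the frequency-based selection step described above, assuming each user's tree model yields a list of candidate paths (one feature set per path); the `"user_id"` component added at the end is illustrative.

```python
from collections import Counter

def select_counting_keys(paths_per_user, top_n):
    """Sketch: weight each candidate path (a set of contextual feature
    names) by how many users' trees it appears in, and keep the top ones.
    Paths shared across many users tend to have good coverage."""
    weights = Counter()
    for user_id, paths in paths_per_user.items():
        for feature_set in set(map(frozenset, paths)):  # count once per user
            weights[feature_set] += 1
    # Personalize the final keys by adding the user ID as a key component.
    return [fs | {"user_id"} for fs, _ in weights.most_common(top_n)]
```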
Counting Key Candidates
We use decision trees to generate counting key candidates. In order to implement personalization, we build one decision tree model $ M_i $ for each user $ u_i $. The input for model $ M_i $ consists of all impressions related to user $ u_i $. In our system, we use XGBoost to build one gradient boosted forest for each user. In a gradient boosted forest, trees are built in sequential order, where each newly built tree aims to minimize the errors of the previous trees. The earliest trees usually capture the most important features to differentiate the general cases, while the latest trees usually account for the specific features to differentiate the difficult cases. With these boosted decision trees, we can use all the paths as counting key candidates.
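The sketch below shows one way this could look with the XGBoost Python API: train a small boosted forest per user and read every root-to-leaf path out of `Booster.trees_to_dataframe()`. The data layout (`impressions_by_user`) and the hyperparameters are illustrative assumptions.

```python
import xgboost as xgb

def extract_paths(booster):
    """Collect the set of split features along each root-to-leaf path."""
    df = booster.trees_to_dataframe()
    paths = []
    for _, tree in df.groupby("Tree"):
        parent = {}  # child node ID -> (split feature, parent node ID)
        for _, row in tree.iterrows():
            if row["Feature"] != "Leaf":
                parent[row["Yes"]] = (row["Feature"], row["ID"])
                parent[row["No"]] = (row["Feature"], row["ID"])
        for _, leaf in tree[tree["Feature"] == "Leaf"].iterrows():
            feats, node = set(), leaf["ID"]
            while node in parent:  # walk from the leaf up to the root
                feat, node = parent[node]
                feats.add(feat)
            if feats:
                paths.append(frozenset(feats))
    return paths

# Illustrative driver: impressions_by_user maps a user ID to that user's
# impression feature matrix X_u and click labels y_u (hypothetical layout).
paths_per_user = {}
for user_id, (X_u, y_u) in impressions_by_user.items():
    booster = xgb.train(
        {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1},
        xgb.DMatrix(X_u, label=y_u),
        num_boost_round=20,
    )
    paths_per_user[user_id] = extract_paths(booster)
```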
Theoretical Support
Now we provide theoretical support to explain why a tree-based model can be used to select counting keys. The objective of counting key selection is to identify a set of features $ k = \{ c_{p_1}, c_{p_2}, \cdots \} $ that is discriminative for click prediction. We use the conditional entropy to measure the quality of a counting key $ k $. We use $ Y $ to denote the click label. If a counting key $ k $ is useful, then the conditional entropy $ H(Y|k) $ should be less than the marginal entropy $ H(Y) $, i.e., $ H(Y|k) < H(Y) $. In order to find the best counting key, we maximize the difference between the two. Formally, we have the following objective:
$$ \arg\max_k H(Y) - H(Y|k). $$
The conditional entropy $ H(Y|k) $ is defined as
$$ H(Y|k) = \sum_{ v \in vals(k) } p(v) H(Y|k = v), $$
where $ vals(k) $ refers to all possible values that the feature set $ k $ can take. The objective above is difficult to optimize, since $ k $ is a set of features with an exponential search space. Also, the goal of counting key selection is not to find the single best counting key, but rather to find as many good counting keys as possible. The selected counting keys should satisfy the condition $ H(Y) - H(Y|k) > \Delta $, where $ \Delta $ is a threshold on the entropy decrease.
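To make the selection criterion concrete, the following sketch estimates $ H(Y) - H(Y|k) $ empirically from logged impressions; a pandas DataFrame with one row per impression and a binary `clicked` column is an assumed layout.

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy (in bits) of a binary label with positive rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def information_gain(df, key, label="clicked"):
    """Estimate H(Y) - H(Y|k) for a candidate key (a list of columns):
    H(Y|k) is the sum over key values v of p(v) * H(Y | k = v)."""
    h_y = bernoulli_entropy(df[label].mean())
    h_y_given_k = 0.0
    for _, group in df.groupby(list(key)):
        h_y_given_k += (len(group) / len(df)) * bernoulli_entropy(group[label].mean())
    return h_y - h_y_given_k

# A candidate key k is kept only if the entropy drop clears the threshold:
# selected = [k for k in candidates if information_gain(df, k) > delta]
```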
The tree-based model selects featu