[Paper Translation] Automatic Historical Feature Generation through Tree-based Method in Ads Prediction


Original paper: https://arxiv.org/pdf/2012.15522v1.pdf


Automatic Historical Feature Generation through Tree-based Method in Ads Prediction

Abstract

Historical features are important in ads click-through rate (CTR) prediction, because they account for past engagements between users and ads. In this paper, we study how to efficiently construct historical features through counting features. The key challenge of this problem lies in how to automatically identify counting keys. We propose a tree-based method for counting key selection. The intuition is that a decision tree naturally provides various combinations of features, which can be used as counting key candidates. In order to select personalized counting features, we train one decision tree model per user, and the counting keys are selected across different users with a frequency-based importance measure. To validate the effectiveness of the proposed solution, we conduct large scale experiments on Twitter\footnote{Twitter and Tweets are registered marks of Twitter Inc.} video advertising data. In both online learning and offline training settings, the automatically identified counting features outperform the manually curated counting features.

Introduction

The data rich advertising environment provides ample opportunities for machine learning to play a key role. The ads CTR prediction problem is one of the most important tasks. At Twitter, ads are shown in the form of a Promoted Tweet. Twitter users usually consume Tweets in their home timeline, where the Tweets created by the people they follow are presented. Among the list of user generated Tweets, Promoted Tweets are injected at a set of given positions. The selection of which ads to show is based on the combination of predicted CTR and advertising bidding. We use impression to refer to showing an ad to a user, and engagement to refer to the user acting on the ad, e.g. clicking on the link, watching the video ad, etc.

Contextual features are the most straightforward features for CTR prediction, which account for the current context of an impression, including user information, ads content, nearby Tweets, etc. However, with contextual features alone it is very difficult to learn a CTR prediction model, due to two data sparsity issues. First, the feature space is huge, such as ID-based features. Second, the number of samples for each user-ad pair is small.

To address the issues above, historical features are added to the prediction model. The historical features capture the past engagement information of the user-ad pair, while avoiding explicit matching of users or ads. For example, past CTR is a strong feature in a CTR model. Namely, even if user $ u $ only saw ad $ d $ once and engaged with it (a past CTR of $ 100\% $), the model is confident to predict a high CTR for showing $ d $ to $ u $ in the future. As a result, instead of maintaining the user ID and ad ID as contextual features, we could simply add one historical feature -- the past CTR. Both contextual features and historical features are indispensable for CTR prediction (He et al., 2014). However, it has been shown that the historical features play a more important role than the contextual features (He et al., 2014).

We use counting feature to refer to the counting-based implementation of historical features. Furthermore, we could count the past engagements under different conditions, which are called counting keys. When selecting proper counting keys, there are a lot of choices. Basically, we can count on any categorical features, and even combinations of them. In this paper, we aim to efficiently extract useful counting keys. This task is challenging for the following three reasons. First, exhaustive search is impossible, given an exponential search space. Second, the counting keys suffer from sparsity, i.e. the coverage of most counting keys is low. Third, as the size of counting keys increases, it is hard to semantically interpret the counting keys.

In this paper, we propose a method to automatically select counting keys from tree-based models. The intuition is that a path in the tree, which consists of several features, is a good candidate for a counting key. To achieve this goal, we independently learn one tree-based model per user for a selected set of users who have sufficient impression data. We use a frequency-based measure as the weighting mechanism for each path in the tree. The paths with the highest weights are selected as our counting keys, which guarantees that our counting features have good coverage. Through extensive experiments with Twitter real ads traffic, we show that the selected counting features consistently improve the CTR prediction performance. The contributions of this paper are summarized as follows. First, we propose a tree-based method to automatically select counting keys. Second, we compare the automatically selected counting features with a set of human curated features, and show better performance. Third, we evaluate our method in two widely used settings, i.e. online learning with logistic regression and offline training with neural network models, and show consistent results.

The rest of this paper is organized as follows. In Section 2, we present related work in the literature. In Section 3, the formal definition of our problem is given. The method is described in Section 4, and the experiment results are shown in Section 5. Finally, we conclude our paper with future work in Section 6.

Related Work

CTR Prediction is a well-studied problem. The logistic regression model was first used for CTR prediction at scale (Battelle, 2011; Richardson et al., 2007). Later, online learning was adopted in ads systems (Ciaramita et al., 2008; Graepel et al., 2010), due to the rapid growth of data volume. To highlight the relationship and differences between theoretical advances and practical engineering in CTR prediction, McMahan et al. (McMahan et al., 2013) at Google share insights and lessons learned from a large-scale online learning CTR prediction system. Facebook (He et al., 2014) employs a gradient boosted decision tree (GBDT) to learn a feature transformation, and shows that applying logistic regression on the transformed features achieves better performance. Li et al. (Li et al., 2015) show that a learning-to-rank model, which learns the relative order of ads, actually outperforms a model that independently learns the CTR of each impression on Twitter timeline advertising data. As the content of ads is getting rich with images and videos, deep learning models are used to account for the visual information (Chen et al., 2016).

As we can see, feature engineering is crucial to improve CTR prediction. In this paper, we aim to achieve the same goal by constructing effective historical features.

Personalization is an important aspect of CTR prediction to improve long-term user engagement (Wu et al., 2017). Various probabilistic models have been proposed to model the latent factors on the user domain (Liu et al., 2010; Cheng and Cantú-Paz, 2010). For example, Shen et al. (Shen et al., 2012) propose to use graphical models and tensor decomposition techniques to learn the latent factors of users. In highly dynamic domains, such as news recommendation and digital advertisement, exploration-exploitation strategies are widely adopted for personalization. The multi-armed bandit is a typical algorithm of this strategy. Li et al. (Li et al., 2016) use multi-armed bandits to dynamically group similar users.

Note that in this paper, we devise a counting feature selection method with personalization as a constraint. We differ from the personalization literature in that we are not trying to propose a new personalization method.

Attribute Interactions. Within a counting key, a group of features jointly act as a CTR indicator. A relevant concept is attribute interaction (Freitas, 2001), which refers to the fact that two or more attributes jointly present a strong correlation with the label, while each of them alone correlates loosely with the label. Here attribute and feature are used interchangeably. There is literature focusing on the analysis of attribute interactions (Jakulin and Bratko, 2003), testing the significance of feature interactions (Jakulin and Bratko, 2004), etc.

We should note that counting key selection is more complicated than attribute interaction. Attribute interaction mainly studies the interactions of several features within one sample. Meanwhile, counting features are constructed from several features and aggregated over multiple historical samples.

Feature Selection. Finally, we want to emphasize that feature selection (Molina et al., 2002; Guyon and Elisseeff, 2003; Liu and Yu, 2005; Chandrashekar and Sahin, 2014) is another highly relevant research field. Feature selection aims at finding a subset of features that is representative of the original data. Various feature selection methods have been proposed in the literature, falling mainly into three categories (Chandrashekar and Sahin, 2014). (1) Filter-based methods (Blum and Langley, 1997) define a relevance score for each feature and select features with a certain threshold. (2) Wrapper methods use the predictor as a black box and the predictor performance as the objective function to evaluate the feature subset. Examples are the Sequential Feature Selection method (Reunanen, 2003) and the Genetic Algorithm (Goldberg, 1989). (3) Unsupervised feature selection methods (Mitra et al., 2002; Li et al., 2012) have been proposed as well, where clustering is the primarily used technique.

These feature selection methods cannot be used for our problem, because they do not address our challenges. In addition to the exponential search space, implementing and storing the counting features are costly as well. Moreover, conventional feature selection methods do not consider the sparsity of features at all.

Problem Definition

Given an online advertising system, we have millions of users denoted as $ \mathcal{U} $ and tens of thousands of ads denoted as $ \mathcal{D} $. The objective of an advertising system is to present the most relevant ads to users. CTR is the most commonly used measure of ads relevance, which measures the probability that a user clicks on an ad. Usually there is one prediction model $ f $ to be learned from billions of impressions. Each impression contains the information about showing ad $ d $ to user $ u $ within a certain context. It is important to differentiate different users and ads. To this end, we extract as many contextual features as possible for each impression, denoted as $ c $. The contextual feature vector $ c = [ c_1, c_2, \cdots, c_N ]\in\mathbb{R}^N $ is an $ N $-dimensional vector, and each dimension $ c_j $ corresponds to one feature. There are three types of contextual features: user features, ad features, and context features. The user features, such as user gender and age, differentiate users. The ad features are used to differentiate ads, examples being advertiser category and ads content. The context features track the context of the current impression, including the time of the impression, the ad injection location, and nearby organic Tweets information. The ads relevance prediction problem is defined as follows.

Definition 3.1 (CTR prediction).

Given a set of impressions $ I $, each impression $ i $ denotes one event that ad $ d_i \in \mathcal{D} $ is shown to user $ u_i \in \mathcal{U} $. We extract contextual features $ \mathbf{c}_i $ for impression $ i $. CTR prediction aims to learn a prediction model

$$ \hat{p_i} = f(\mathbf{c}_{i}) $$

where $\hat{p}_{i}$ is the predicted CTR.

Another important signal for CTR prediction is the user's past engagements, namely historical features, which are missing in Definition 3.1. We use counting features as one implementation of historical engagements. To give the formal definition of counting features, we first define counting keys.

Definition 3.2 (Counting key).

A counting key $ k $ is a combination of two or more contextual features, denoted as a feature set $ k = \{ c_{p1}, c_{p2}, c_{p3}, \cdots \} $. Without loss of generality, we assume $ c_{pi} $ are discrete features. Each counting key $ k $ is defined over a pair of user and ad $ \langle u, d \rangle $.

Definition 3.3 (Counting feature).

Counting features account for the past engagements over a pair of user and ad $ \langle u, d \rangle $. Given a counting key $ k $, we are able to generate several counting features for user-ad pair $ \langle u, d \rangle $, such as the number of past impressions $ h_{k}^{i} $, the number of past engagements $ h_{k}^{e} $, and the past CTR $ h_{k}^{p} $. For each impression $ i $ and a given counting key set $ K = \{ k_1, k_2, \cdots \} $, we extract all counting features as
$$ h_{i} = \{ h_{k_1}^{i}, h_{k_2}^{i}, \cdots, h_{k_1}^{e}, h_{k_2}^{e}, \cdots, h_{k_1}^{p}, h_{k_2}^{p}, \cdots \} $$
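
As a concrete illustration of Definitions 3.2 and 3.3, the minimal sketch below computes $ h_{k}^{i} $, $ h_{k}^{e} $, and $ h_{k}^{p} $ for user-ad pairs from a historical impression log; the field names (`user_id`, `hour_of_day`, `clicked`) are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def counting_features(history, key):
    """For one counting key k (a tuple of contextual feature names), return
    (impressions, engagements, past CTR) per (user, ad, key values) bucket."""
    stats = defaultdict(lambda: [0, 0])              # bucket -> [h^i, h^e]
    for row in history:
        bucket = (row["user_id"], row["ad_id"], tuple(row[f] for f in key))
        stats[bucket][0] += 1                        # one past impression
        stats[bucket][1] += int(row["clicked"])      # one past engagement if clicked
    return {b: (imp, eng, eng / imp) for b, (imp, eng) in stats.items()}

# toy historical log with hypothetical field names
log = [
    {"user_id": "u1", "ad_id": "d1", "hour_of_day": 9, "clicked": 1},
    {"user_id": "u1", "ad_id": "d1", "hour_of_day": 9, "clicked": 0},
    {"user_id": "u1", "ad_id": "d2", "hour_of_day": 21, "clicked": 0},
]
print(counting_features(log, key=("hour_of_day",)))  # h^i, h^e, h^p per bucket
```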

In this paper, our goal is to automatically select counting keys.

Definition 3.4 (Counting feature keys selection).

The counting feature key selection problem aims to find a set of counting keys $ K $ from the given contextual features $ c $. The goal is to generate counting features $ h $ for each impression, and to improve the CTR prediction model through

$$ \hat{p_i} = f(\mathbf{c}_{i}, \mathbf{h}_{i}) $$

Method

Overview

Figure 1. Overview of the proposed method.

We visualize the overall flow of our method in Figure 1. We take a stream of impressions as input, where the contextual features of the impressions (green features, denoted as $ \mathbf{c}_i $) are available. Next, in order to generate counting key candidates, we learn one tree-based binary classification model per user to predict the click label. The motivation behind training one tree model per user is personalization. The simplest approach to achieve personalization in the counting features is to keep the user ID as one component in the key. With the tree-based model, each path in the trees is a set of features that together act as a strong indicator of the click event. A counting key candidate is a set of features, which is constructed by adding the user ID into the feature set from one tree path. In the end, we have many counting key candidates from those tree-based models. In order to find counting keys that are discriminative for all users, we use a frequency-based measure as the weighting measure. Finally, we generate counting features for each pair of user and ad according to the selected counting keys. These counting features (yellow features, denoted as $ \mathbf{h}_i $) are joined with the contextual features to serve the CTR prediction.

Counting Key Candidates

We use decision trees to generate counting key candidates. In order to implement personalization, we build one decision tree model $ M_i $ for each user $ u_i $. The input for model $ M_i $ consists of all impressions related to user $ u_i $. In our system, we use XGBoost to build one gradient boosted forest for each user. In a gradient boosted forest, trees are built in sequential order, where the newly built trees aim to minimize the errors of the previous trees. The earliest trees usually capture the most important features to differentiate the general cases, while the latest trees usually account for the specific features that differentiate the difficult cases. With these boosted decision trees, we can use all the paths as counting key candidates.
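
A minimal sketch of this step, assuming one XGBoost classifier per user and recovering the feature set on each root-to-leaf path from the booster's node table; the toy data and feature names are illustrative, and the 20-tree / depth-3 settings mirror Section 5.2 rather than a definitive configuration.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

def path_feature_sets(booster):
    """Collect the feature set on every root-to-leaf path of every boosted tree."""
    nodes_df = booster.trees_to_dataframe()                    # one row per tree node
    paths = []
    for tree_id, tree in nodes_df.groupby("Tree"):
        nodes = tree.set_index("ID")

        def walk(node_id, feats):
            node = nodes.loc[node_id]
            if node["Feature"] == "Leaf":
                paths.append(frozenset(feats))
            else:
                walk(node["Yes"], feats | {node["Feature"]})
                walk(node["No"], feats | {node["Feature"]})

        walk(f"{tree_id}-0", frozenset())                      # node 0 is the root of each tree
    return paths

# one model per user with sufficient impression data (synthetic toy data below)
rng = np.random.default_rng(0)
feature_names = ["hour_of_day", "item_objective", "engagement_option", "device_type"]
candidate_keys = []
for user_id in ["u1", "u2"]:
    X = pd.DataFrame(rng.integers(0, 5, size=(500, 4)), columns=feature_names)
    y = (X["hour_of_day"] + X["item_objective"] > 5).astype(int)   # synthetic click labels
    model = xgb.XGBClassifier(n_estimators=20, max_depth=3)        # 20 trees, depth 3
    model.fit(X, y)
    for feats in path_feature_sets(model.get_booster()):
        candidate_keys.append(feats | {"user ID"})                 # add the user ID to every key
```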

Theoretical Support

Now we provide theoretical support to explain why a tree-based model can be used to select counting keys. The objective of counting key selection is to identify a set of features $ k = \{ c_{p1}, c_{p2}, \cdots \} $ that is discriminative for click prediction. We use the conditional entropy to measure the quality of a counting key $ k $. We use $ Y $ to denote the click label. If a feature key $ k $ is useful, then the conditional entropy $ H(Y|k) $ should be less than the marginal entropy $ H(Y) $, i.e. $ H(Y|k) < H(Y) $. In order to find the best counting key, we maximize the difference between the two. Formally, we have the following objective:

$$ \arg\max_k H(Y) - H(Y|k). $$
The conditional entropy $ H(Y|k) $ is defined as

$$ H(Y|k) = \sum_{ v \in vals(k) } p(v) H(Y|k = v), $$
where $ vals(k) $ refers to all possible values that the vector $ k $ could take. The objective above is difficult to optimize, since $ k $ is a set of features with an exponential search space. Also, the goal of counting key selection is not to find the single best counting key, but rather to find as many good counting keys as possible. The selected counting keys should satisfy the condition that $ H(Y) - H(Y|k) > \Delta $, where $ \Delta $ is a threshold on the entropy decrease.

The tree-based model selects the feature and split value that maximize the information gain. We note that the information gain in a tree-based model is calculated on a binary split, which is not exactly equivalent to the conditional entropy. However, we prove that the information gain of a binary split is a lower bound of $ H(Y) - H(Y|k) $.

Without loss of generality, we consider the tree-based model in Figure 2. The first split maximizes the information gain $ IG(Y, c_{p1}, a_1) $, which splits the samples by whether $ c_{p1} < a_1 $ and equals $ H(Y) $ minus the conditional entropy of $ Y $ given this binary split. Since the binary split only coarsens the information carried by $ c_{p1} $, the inequality property of conditional entropy tells us that conditioning on $ c_{p1} $ itself can only reduce the entropy further. Therefore, the information gain is a lower bound of $ H(Y) - H(Y \mid c_{p1}) $, i.e.

$$ IG(Y, c_{p1}, a_1) \le H(Y) - H(Y \mid c_{p1}). $$
Following the left path of the tree, the model recursively maximizes $ IG(Y, [c_{p1}, c_{p2}], [a_1, b_1]) $. By applying the same inequality recursively, we have $ IG(Y, [c_{p1}, c_{p2}], [a_1, b_1]) \le H(Y) - H(Y \mid c_{p1}, c_{p2}) $. Namely, the tree-based model maximizes a lower bound of $ H(Y) - H(Y|k) $.
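
This bound is easy to sanity-check numerically. The sketch below (illustrative toy data, not from the paper) compares the information gain of a single binary split on $ c_{p1} $ with the full conditional-entropy reduction $ H(Y) - H(Y \mid c_{p1}) $:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def cond_entropy(y, g):
    """H(Y | G) for a discrete grouping variable G."""
    n = len(y)
    return sum((y[g == v].size / n) * entropy(y[g == v]) for v in np.unique(g))

rng = np.random.default_rng(1)
c_p1 = rng.integers(0, 4, size=100_000)                        # a discrete contextual feature
y = (rng.random(100_000) < 0.05 + 0.1 * c_p1 / 3).astype(int)  # click label correlated with c_p1

h_y = entropy(y)
ig_binary = h_y - cond_entropy(y, (c_p1 < 2).astype(int))      # gain of one binary split at a_1 = 2
full_gain = h_y - cond_entropy(y, c_p1)                        # H(Y) - H(Y | c_p1)
assert ig_binary <= full_gain + 1e-12                          # the split's gain never exceeds the full gain
print(round(ig_binary, 5), round(full_gain, 5))
```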

Counting Key Quality Measure

Although we can limit the number and depth of the trees, there are still too many counting key candidates. We should also notice that the latest trees usually contain counting keys that are very specific to one user but do not appear in other users' models. Meanwhile, it is possible that the oldest trees provide counting keys that are too general to be useful. A quantitative measure of the relevance of each counting key candidate is needed. Motivated by the intuition above, we propose to use the term frequency and inverse document frequency (tf-idf) to measure the relevance of counting key candidates. We treat each user as one document, and the feature set of each path as a word. Notice that we omit the splitting value of each node, as well as the order of features in the path. The reason is that a counting feature key is a set, which does not differentiate the order of features in the key. We use $ \mathcal{M} = \{ M_i \} $ to denote all tree-based models, and $ \mathcal{M}_k = \{ M_i \mid k \models M_i \} $ to denote the set of models that contain candidate key set $ k $. The notation $ k \models M_i $ means there is a path in model $ M_i $ such that the feature set on that path is $ k $. The term frequency (tf) and the inverse document frequency (idf) follow the standard tf-idf definitions under this analogy, where $ |\cdot| $ is the cardinality of a set and $ f_j $ is the number of times $ k $ appears in model $ M_j \in \mathcal{M}_k $. Finally, our counting key weight measure is
$$ \text{tf-idf}(k, \mathcal{M}) = tf(k, \mathcal{M}) \cdot idf(k, \mathcal{M}) $$
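
A sketch of this weighting step is shown below. It uses one plausible instantiation of tf and idf (average within-model count for tf, a smoothed log ratio for idf); the paper's exact normalizations may differ.

```python
import math
from collections import Counter, defaultdict

def score_candidate_keys(keys_per_model):
    """keys_per_model: {user_id: list of frozenset key candidates from that user's trees}.
    Returns a tf-idf weight per candidate key (one plausible instantiation)."""
    n_models = len(keys_per_model)
    per_model_counts = defaultdict(list)                 # key -> [f_j for each model containing it]
    for _, keys in keys_per_model.items():
        for key, f_j in Counter(keys).items():
            per_model_counts[key].append(f_j)
    weights = {}
    for key, f_js in per_model_counts.items():
        tf = sum(f_js) / len(f_js)                       # average occurrences within the models of M_k
        idf = math.log(1 + n_models / len(f_js))         # standard idf behavior: rarer across users -> higher
        weights[key] = tf * idf
    return weights

# toy usage with hypothetical per-user candidates
toy = {
    "u1": [frozenset({"user ID", "hour of day"}), frozenset({"user ID", "hour of day"}),
           frozenset({"user ID", "item objective"})],
    "u2": [frozenset({"user ID", "hour of day"})],
}
weights = score_candidate_keys(toy)
top_keys = sorted(weights, key=weights.get, reverse=True)        # pick the top keys as K
```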

Applying Counting Features

With the tf-idf measure, we pick the top counting key candidates as $ \mathcal{K} $ to generate counting features. For each pair of user $ u $ and ad $ d $, we count three values on a given key $ k $: the number of impressions, the number of engagements, and the past CTR. Namely, we generate the historical features defined in Definition 3.3. As shown in the bottom left of Figure 1, the historical features $ \mathbf{h} $ are concatenated with the contextual features $ \mathbf{c} $ of the impression data. Together, they are used for CTR prediction.
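
A possible pandas sketch of this step is given below: for each selected key, past impressions, engagements, and CTR are aggregated from the historical log and left-joined onto the current impressions (column names such as `user_id` and `clicked` are hypothetical).

```python
import pandas as pd

def add_counting_features(impressions, history, keys):
    """Join h^i, h^e, h^p counting features for every selected counting key onto impressions."""
    out = impressions.copy()
    for key in keys:                                     # key: tuple of contextual column names
        cols = ["user_id", "ad_id", *key]
        agg = (history.groupby(cols)["clicked"]
                      .agg(impressions="size", engagements="sum")
                      .reset_index())
        agg["ctr"] = agg["engagements"] / agg["impressions"]
        suffix = "_".join(key)
        agg = agg.rename(columns={c: f"{c}_{suffix}" for c in ("impressions", "engagements", "ctr")})
        out = out.merge(agg, on=cols, how="left")        # no history -> NaN (key not covered)
    return out
```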

5. Experiments

5.1. Experiment Settings

We use the Twitter video advertisement data stream to evaluate our method. More specifically, we use data from 18 consecutive days in 2017. The data of the first 15 days are used to select the counting feature keys and generate the corresponding counting features for all users. Then, we use the data of the last 3 days to evaluate the automatically identified counting features.

We compare three different feature settings to validate the effectiveness of automatically identified counting features.

  1. Without user-ads engagement features (BASE), where we use only the contextual features $ c $ to predict the CTR.
  2. Human curated user-ads engagement features (HUMAN). These features, denoted as $ h_h $, are generated from human heuristics and past experiences. In this setting, we use $ c $ and $ h_h $ together to predict the CTR.
  3. Automatically selected counting features (ACF). Our method automatically extracts a set of counting features, denoted as $ h_a $. We use $ c $ and $ h_a $ together to predict the CTR.

With the three feature combinations, we show the CTR prediction results under two different models.

  1. An online learning model using logistic regression (LR). The ads engagement data is generated as a stream, and the online learning model constantly updates itself given new engagement data. A data sample (user-ad pair) first goes through a discretizer, a set of gradient boosted decision trees, to get a compact representation. The compact vectors are then fed into a logistic regression model to get the predicted CTR (see the sketch after this list).
  2. An offline training model using a wide and deep neural network (Cheng et al., 2016) (NN). The wide component has a single layer, which acts as a linear model on the given raw features. The deep component consists of multiple layers, which aim to automatically learn the interactions among features. The two components are jointly learned, and the final output is the predicted CTR.
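
The following is a hedged sketch of the first setting, with an XGBoost forest as the discretizer (its leaf indices are one-hot encoded) and an SGD-based logistic regression updated online; data, sizes, and hyperparameters are illustrative and not the production configuration.

```python
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(2)
X_warm = rng.random((5000, 10))                          # contextual + counting features (toy)
y_warm = (rng.random(5000) < 0.1).astype(int)            # click labels (toy)

# discretizer: gradient boosted trees learned once, independently of the logistic regression
gbdt = xgb.XGBClassifier(n_estimators=50, max_depth=4)
gbdt.fit(X_warm, y_warm)
encoder = OneHotEncoder(handle_unknown="ignore").fit(gbdt.apply(X_warm))  # one indicator per (tree, leaf)

# online logistic regression updated as new engagement data arrives
lr = SGDClassifier(loss="log_loss", alpha=1e-5)
for _ in range(20):                                      # each iteration stands in for a mini-batch of the stream
    X_batch = rng.random((100, 10))
    y_batch = (rng.random(100) < 0.1).astype(int)
    Z = encoder.transform(gbdt.apply(X_batch))           # compact leaf-index representation
    if hasattr(lr, "coef_"):
        p_ctr = lr.predict_proba(Z)[:, 1]                # predicted CTR reported before the update
    lr.partial_fit(Z, y_batch, classes=np.array([0, 1]))
```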

We use the relative cross entropy (RCE) (He et al., 2014) as our evaluation measure, mainly because we use cross entropy as our loss function.
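
RCE is commonly computed as the percentage reduction of the model's average cross entropy relative to a naive baseline that always predicts the empirical average CTR (higher is better); a small sketch under that assumption:

```python
import numpy as np

def rce(y_true, p_pred, eps=1e-12):
    """Relative cross entropy: percentage improvement of the model's log loss over a
    baseline that always predicts the empirical average CTR (higher is better)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    ce_model = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    ctr = np.clip(y.mean(), eps, 1 - eps)
    ce_base = -np.mean(y * np.log(ctr) + (1 - y) * np.log(1 - ctr))
    return 100.0 * (ce_base - ce_model) / ce_base
```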

5.2. Counting Features

Automatically Selected Counting Features – ACF

To obtain the counting feature candidates, we use the data of the first five days. Due to data sparsity, we only pick the top 100 users with the most impressions. We train one set of gradient boosted decision trees for each user. The number of trees is set to 20, and the maximum depth of each tree is set to 3. Here we notice that the size of a counting key should not be too large, otherwise the counting values for such a key will be too sparse. Considering that the user ID is one feature in the counting keys, we set the maximum tree depth to 3.

We pick the following counting feature keys, shown in Table 1, according to their corresponding tf-idf values. Given the counting feature keys, we count the number of impressions, number of engagements, and the corresponding average CTR as features. In our experiments, we generate the values for the counting features using the 15-day data. We assume the counting features won’t change in a short time. Therefore, we join these historical counting features with the next 3-day data for evaluation.

Counting Keys                      tf-idf
user ID, item objective
engagement option                  -
hour of day                        -
hour of day, item objective
hour of day, engagement option

Table 1. Automatically Extracted Counting Feature Keys

Human Curated Counting Features – HUMAN

The HUMAN features include the past CTR, number of impressions, and number of engagements on a set of human selected counting keys. These counting keys are selected based on human heuristics, where the selected features are all categorical. Some examples of selected features are advertiser ID, user age bucket, and device type. The total number of HUMAN features is 453. We should note that many of the HUMAN counting features are very sparse.

5.3. CTR Prediction with Online Learning

In this section we show the CTR prediction results in an online learning setting. This CTR prediction model consists of a gradient boosted decision tree and a logistic regression. The gradient boosted decision tree (GBDT) is used to generate new features, and it is learned independently of the subsequent logistic regression. The goal of the GBDT is to deal with feature non-linearity. The logistic regression is subsequently applied on the binary vectors from the GBDT. In the online learning setting, we update the logistic regression model as new data come in. In total, 1% of the data are sampled as testing data, and the rest are used for training. Impressions are fed to the model in chronological order. The testing samples are not used to update the model. When encountering a testing sample, we use the current model to report the predicted CTR, which is later used to calculate the testing RCE.

Considering that the model warm-up needs some time, we report the average RCE on the last day. Our ACF features have a 0.6% higher RCE than the BASE features, and the HUMAN features are actually worse than the BASE method. Although such an RCE gain seems small, the revenue impact is still huge at scale. Also, we should keep in mind that ACF only contains a dozen features, while the number of HUMAN features is close to 500. In order to look into the model warming-up process, we further plot the RCE curve over time in Figure 3.

Figure 3. The relative RCE improvement over the BASE method, evaluated with the logistic regression model in an online learning setting for different features.

In Figure 3, we report the relative RCE improvement over the BASE method over three days at a two-hour interval. It is clear that our ACF features consistently outperform the BASE features. Meanwhile, we notice that the HUMAN features are almost consistently worse than the BASE features. We further investigate this issue in Section 5.5, and find that the main reason is that HUMAN contains too many sparse features, which hurts the model performance.

5.4. CTR Prediction with Batch Training

In this section we show the CTR prediction results in a batch training setting. We evaluate our new features with the wide and deep NN model shown in Figure 4. More specifically, we use the first two days of data as training data for the NN model, and test on the last day.

Figure 4. Wide & Deep NN model structure.
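
A minimal tf.keras sketch of a model with this shape (layer sizes and input widths are illustrative, not taken from the paper):

```python
import tensorflow as tf

def build_wide_and_deep(n_wide, n_deep, hidden=(256, 128, 64)):
    """Wide linear part on raw/cross features plus a deep MLP, jointly trained to output CTR."""
    wide_in = tf.keras.Input(shape=(n_wide,), name="wide_features")
    deep_in = tf.keras.Input(shape=(n_deep,), name="deep_features")
    x = deep_in
    for units in hidden:                                  # deep component: stacked fully connected layers
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    merged = tf.keras.layers.concatenate([wide_in, x])    # wide component joins at the output
    ctr = tf.keras.layers.Dense(1, activation="sigmoid", name="predicted_ctr")(merged)
    model = tf.keras.Model(inputs=[wide_in, deep_in], outputs=ctr)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_wide_and_deep(n_wide=128, n_deep=64)        # illustrative feature widths
```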

ACF achieves the best performance, with a testing RCE 2.02% better than BASE. HUMAN outperforms BASE by 1.06% in RCE, mainly because the 5-layer fully connected model can learn new combinations of features.

5.5. Feature Analysis

In this section, we compare the feature quality of ACF and HUMAN.

Feature Coverage. We define the feature coverage as the percentage of samples that have a non-zero value for a given feature. We summarize the key coverage statistics for HUMAN and ACF in Table 2. Notice that there are HUMAN features that cover no impressions, because they are too sparse.

Feature type ACF HUMAN
Number of features 15 453
Average coverage 0.8411 0.0286
Maximum coverage 0.9768 0.9461
Minimum coverage 0.6179 0

Table 2. Feature coverage comparison.

Feature Correlations. We calculate the Pearson correlation between each feature and the click label. For each feature, since the coverage is not 100%, we only calculate the correlation on the subset of impressions where the feature value is not missing. A summary of the correlations of both feature sets is given in Table 3, from which it is easy to tell that the overall correlations of ACF are better than those of HUMAN (0.0451 vs. 0.0138).

Feature type ACF HUMAN
Number of features 15 102
Average correlation 0.0451 -0.0003
Maximum correlation 0.2026 0.0138
Minimum correlation 0.0005 -0.0165

Table 3. Feature correlation comparison.
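
Both statistics can be computed with a few lines of pandas once the counting features are joined onto the impressions; a sketch, assuming NaN marks a missing counting value and a hypothetical `clicked` label column:

```python
import pandas as pd

def feature_coverage(df, feature_cols):
    """Fraction of impressions where each feature is present and non-zero."""
    return {c: float((df[c].notna() & (df[c] != 0)).mean()) for c in feature_cols}

def feature_label_correlation(df, feature_cols, label="clicked"):
    """Pearson correlation with the click label, computed only where the feature is not missing."""
    out = {}
    for c in feature_cols:
        mask = df[c].notna()
        out[c] = df.loc[mask, c].corr(df.loc[mask, label])
    return out
```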

Conclusion

In this paper, we study how to automatically generate counting features in ads CTR prediction. We propose a tree-based method to automatically select counting feature keys. The proposed method is simple and efficient. Through extensive experiments with Twitter real data, we validate that the automatically extracted counting features have better coverage and higher correlations with the click labels, compared to human curated features. It is noteworthy that our approach can be easily generalized beyond user-centric counting features, for example to advertiser-centric and interest-centric counting features.
