# Automatic Historical Feature Generation through Tree-based Method in Ads Prediction

## Abstract

Historical features are important in ads click-through rate (CTR) prediction, because they account for past engagements between users and ads. In this paper, we study how to efficiently construct historical features through counting features. The key challenge of this problem lies in how to automatically identify counting keys. We propose a tree-based method for counting key selection. The intuition is that a decision tree naturally provides various combinations of features, which can be used as counting key candidates. In order to select personalized counting features, we train one decision tree model per user, and the counting keys are selected across different users with a frequency-based importance measure. To validate the effectiveness of the proposed solution, we conduct large-scale experiments on Twitter\footnote{Twitter and Tweets are registered marks of Twitter Inc.} video advertising data. In both online learning and offline training settings, the automatically identified counting features outperform the manually curated counting features.


## Introduction

The data-rich advertising environment provides ample opportunities for machine learning to play a key role, and ads CTR prediction is one of the most important tasks. At Twitter, ads are shown in the form of a Promoted Tweet. Twitter users usually consume Tweets in their home timeline, where the Tweets created by the people they follow are presented. Among the list of user-generated Tweets, Promoted Tweets are injected at a set of given positions. The selection of which ads to show is based on the combination of predicted CTR and advertising bidding. We use impression to refer to showing an ad to a user, and engagement to refer to the user acting on the ad, e.g. clicking on the link or watching the video ad.

Contextual features are the most straightforward features for CTR prediction; they account for the current context of an impression, including user information, ads content, nearby Tweets, etc. However, with contextual features alone it is very difficult to learn a CTR prediction model, due to two data sparsity issues. First, the feature space is huge, e.g. for ID-based features. Second, the number of samples for each user-ad pair is small. To address these issues, historical features are added to the prediction model. Historical features capture the past engagement information of a user-ad pair, while avoiding explicit matching of users or ads. For example, past CTR is a strong feature in a CTR model. Namely, even if user $u$ only saw ad $d$ once and engaged with it (a past CTR of $100\%$), the model is confident to predict a high CTR for showing $d$ to $u$ in the future. As a result, instead of maintaining the user ID and ad ID as contextual features, we can simply add one historical feature -- past CTR. Both contextual features and historical features are indispensable for CTR prediction. However, it has been shown that historical features play a more important role than contextual features.
We use counting feature to refer to the counting-based implementation of historical features. Furthermore, we can count the past engagements under different conditions, which are called counting keys. When selecting proper counting keys, there are many choices. Basically, we can count on any categorical feature, and even on combinations of them. In this paper, we aim to efficiently extract useful counting keys. This task is challenging for three reasons. First, exhaustive search is impossible, given an exponential search space. Second, counting keys suffer from sparsity, i.e. the coverage of most counting keys is low. Third, as the size of a counting key increases, it becomes hard to semantically interpret the counting key.

In this paper, we propose a method to automatically select counting keys from tree-based models. The intuition is that a path in a tree, which consists of several features, is a good candidate for a counting key. To achieve this goal, we independently learn one tree-based model per user for a selected set of users who have sufficient impression data. We use a frequency-based measure as the weighting mechanism for each path in the tree. The paths with the highest weights are selected as our counting keys, which guarantees that our counting features have good coverage. Through extensive experiments with real Twitter ads traffic, we show that the selected counting features consistently improve the CTR prediction performance. The contributions of this paper are summarized as follows. First, we propose a tree-based method to automatically select counting keys. Second, we compare the automatically selected counting features with a set of human-curated features, and show better performance. Third, we evaluate our method in two widely used settings, i.e. online learning with logistic regression and offline training with a neural network model, and show consistent results.

The rest of this paper is organized as follows. In Section 2, we present related work in the literature. In Section 3, the formal definition of our problem is given. The method is described in Section 4, and the experiment results are shown in Section 5. Finally, we conclude our paper with future work in Section 6.

## Related Work

CTR prediction is a well-studied problem. The logistic regression model was first used for CTR prediction at scale (Battelle, 2011; Richardson et al., 2007). Later, online learning was adopted in ads systems (Ciaramita et al., 2008; Graepel et al., 2010), due to the rapid growth of data volume. To highlight the relationship and differences between theoretical advances and practical engineering in CTR prediction, McMahan et al. (McMahan et al., 2013) at Google share insights and lessons learned from their large-scale online learning CTR prediction system. Facebook (He et al., 2014) employs gradient boosted decision trees (GBDT) to learn a feature transformation, and shows that applying logistic regression on the transformed features achieves better performance. Li et al. (Li et al., 2015) show that a learning-to-rank model, which learns the relative order of ads, actually outperforms a model that independently learns the CTR of each impression on Twitter timeline advertising data. As the content of ads becomes richer with images and videos, deep learning models are used to account for the visual information (Chen et al., 2016).

As we can see, feature engineering is crucial to improve CTR prediction. In this paper, we aim to achieve the same goal by constructing effective historical features.

Personalization is an important aspect of CTR prediction to improve long-term user engagement (Wu et al., 2017). Various probabilistic models have been proposed to model the latent factors on the user domain (Liu et al., 2010; Cheng and Cantú-Paz, 2010). For example, Shen et al. (Shen et al., 2012) propose to use graphical models and tensor decomposition techniques to learn the latent factors of users. In highly dynamic domains, such as news recommendation and digital advertising, exploration-exploitation strategies are widely adopted for personalization. The multi-armed bandit is a typical algorithm of this kind. Li et al. (Li et al., 2016) use multi-armed bandits to dynamically group similar users.

Note that in this paper, we devise a counting feature selection method with personalization as a constraint. We differ from the personalization literature in that we are not trying to invent a new personalization method.

Attribute Interactions. Within a counting key, a group of features jointly act as a CTR indicator. A relevant concept is attribute interaction (Freitas, 2001), which refers to the fact that two or more attributes jointly present a strong correlation with the label, while each of them individually correlates loosely with the label. Here attribute and feature are used interchangeably. There is literature focusing on the analysis of attribute interactions (Jakulin and Bratko, 2003), testing the significance of attribute interactions (Jakulin and Bratko, 2004), etc.

We should note that counting key selection is more complicated than attribute interaction. Attribute interaction mainly studies the interactions of several features within one sample, whereas counting features are constructed from several features and aggregated over multiple historical samples.

Feature Selection. Finally, we want to emphasize that feature selection (Molina et al., 2002; Guyon and Elisseeff, 2003; Liu and Yu, 2005; Chandrashekar and Sahin, 2014) is another highly relevant research field. Feature selection aims at finding a subset of features that are representative of the original data. Various feature selection methods have been proposed in the literature, falling mainly into three categories (Chandrashekar and Sahin, 2014). (1) Filter-based methods (Blum and Langley, 1997) define a relevance score for each feature and select features above a certain threshold. (2) Wrapper methods use the predictor as a black box and the predictor performance as the objective function to evaluate feature subsets. Examples are the Sequential Feature Selection method (Reunanen, 2003) and Genetic Algorithms (Goldberg, 1989). (3) Unsupervised feature selection methods (Mitra et al., 2002; Li et al., 2012) have been proposed as well, where clustering is the primarily used technique.

These feature selection methods cannot be used for our problem, because they do not address our challenges. In addition to the exponential search space, implementing and storing the counting features is costly as well. Moreover, conventional feature selection methods do not consider the sparsity of features at all.

## Problem Definition

Given an online advertising system, we have millions of users, denoted as $\mathcal{U}$, and tens of thousands of ads, denoted as $\mathcal{D}$. The objective of an advertising system is to present the most relevant ads to users. CTR is the most commonly used measure of ads relevance, which measures the probability that a user clicks on an ad. Usually there is one prediction model $f$ to be learned from billions of impressions. Each impression contains the information about showing ad $d$ to user $u$ within a certain context. It is important to differentiate different users and ads. To this end, we extract as many contextual features as possible for each impression, denoted as $c$. The contextual feature vector $c = [ c_1, c_2, \cdots, c_N ]\in\mathbb{R}^N$ is an $N$-dimensional vector, and each dimension $c_j$ corresponds to one feature. There are three types of contextual features: user features, ad features, and context features. The user features, such as user gender and age, differentiate users. The ad features differentiate ads; examples are advertiser category and ads content. The context features track the context of the current impression, including the time of the impression, the ad injection location, and nearby organic Tweet information. The ads relevance prediction problem is defined as follows.

###### Definition 3.1 (CTR prediction).

Given a set of impressions $\mathcal{I}$, each impression $i$ denotes one event that ad $d_i \in \mathcal{D}$ is shown to user $u_i \in \mathcal{U}$. We extract contextual features $\mathbf{c}_i$ for impression $i$. The CTR prediction aims to learn a prediction model

$$\hat{p_i} = f(\mathbf{c}_{i})$$

where $\hat{p}_{i}$ is the predicted CTR.

Another important signal for CTR prediction is the user's past engagements, namely historical features, which are missing from Definition 3.1. We use counting features as one implementation of historical engagement. To give the formal definition of counting features, we first define counting keys.

###### Definition 3.2 (Counting key).

A counting key $k$ is a combination of two or more contextual features, denoted as a feature set $k=\{c_{p_1},c_{p_2},c_{p_3},\cdots\}$. Without loss of generality, we assume the $c_{p_i}$ are discrete features. Each counting key $k$ is defined over a pair of user and ad $\langle u,d\rangle$.

###### Definition 3.3 (Counting feature).

A counting feature accounts for the past engagements over a pair of user and ad $\langle u,d\rangle$. Given a counting key $k$, we are able to generate several counting features for the user-ad pair $\langle u,d\rangle$, such as the number of past impressions $h_{k}^{i}$, the number of past engagements $h_{k}^{e}$, and the past CTR $h_{k}^{p}$. For each impression $i$ and a given counting key set $K=\{k_1,k_2,\cdots\}$, we extract all counting features as
$$h_{i}=\{h_{k_1}^{i},h_{k_2}^{i},\cdots,h_{k_1}^{e},h_{k_2}^{e},\cdots,h_{k_1}^{p},h_{k_2}^{p},\cdots\}$$

In this paper, our goal is to automatically select counting keys.

###### Definition 3.4 (Counting feature key selection).

The counting feature key selection problem aims to find a set of counting keys $K$ from the given contextual features $\mathbf{c}$. The selected keys are used to generate counting features $h$ for each impression, and to improve the CTR prediction model through

$$\hat{p_i} = f(\mathbf{c}_{i},h_{i})$$

## Method

### Overview

We visualize the overall flow of our method in Figure 1. We take a stream of impressions as input, where the contextual features (green features, denoted as $\mathbf{c}_i$) of impressions are available. Next, in order to generate counting key candidates, we learn one tree-based binary classification model per user to predict the click label. The motivation behind training one tree per user is personalization. The simplest approach to achieve personalization in the counting features is to keep the user ID as one component of the key. With the tree-based model, each path in the trees is a set of features that together act as a strong indicator for the click event. A counting key candidate is a set of features, constructed by adding the user ID to the feature set from one tree path. In the end, we have many counting key candidates from these tree-based models. In order to find counting keys that are discriminative for all users, we use a frequency-based measure as the weighting measure. Finally, we generate counting features for each pair of user and ad according to the selected counting keys. These counting features (yellow features, denoted as $h_i$) are joined with the contextual features to serve the CTR prediction.

### Counting Key Candidates

We use decision trees to generate counting key candidates. In order to implement personalization, we build one decision tree model $M_i$ for each user $u_i$. The input for model $M_i$ consists of all impressions related to user $u_i$. In our system, we use XGBoost to build one gradient boosted forest per user. In a gradient boosted forest, trees are built in sequential order, where each newly built tree aims to minimize the errors of the previous trees. The earliest trees usually capture the most important features that differentiate the general cases, while the latest trees usually account for the specific features that differentiate the difficult cases. With these boosted decision trees, we can use all the paths as counting key candidates.
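To make the path-to-key construction concrete, here is a minimal pure-Python sketch: each root-to-leaf path contributes the set of features it splits on, and the user ID is added to every candidate. The toy tree and feature names below are purely illustrative, not output from our system.

```python
# Sketch: enumerate root-to-leaf paths of a (toy) decision tree and turn
# each path's feature set, plus the user ID, into a counting key candidate.
# In the paper's setting the trees come from a per-user gradient boosted
# forest; here a single hand-built tree stands in for illustration.

def paths(node, prefix=()):
    """Yield the feature set of every root-to-leaf path."""
    if node is None or "feature" not in node:   # reached a leaf
        if prefix:
            yield frozenset(prefix)
        return
    for child in (node["left"], node["right"]):
        yield from paths(child, prefix + (node["feature"],))

# Toy tree for one user: splits on "hour of day", then "item objective".
tree = {
    "feature": "hour of day",
    "left": {"feature": "item objective", "left": None, "right": None},
    "right": None,
}

# Add the user ID to each path's feature set to form candidates.
candidates = {fs | {"user ID"} for fs in paths(tree)}
for key in sorted(candidates, key=lambda s: sorted(s)):
    print(sorted(key))
```

Duplicate paths across trees collapse into the same candidate set, which is what makes the later frequency-based weighting meaningful.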


#### Theoretical Support

Now we provide theoretical support to explain why a tree-based model can be used to select counting keys. The objective of counting key selection is to identify a set of features $k = \{c_{p_1}, c_{p_2}, \cdots\}$ that is discriminative for click prediction. We use the conditional entropy to measure the quality of a counting key $k$, and use $Y$ to denote the click label. If a counting key $k$ is useful, then the conditional entropy $H(Y|k)$ should be less than the marginal entropy $H(Y)$, i.e. $H(Y|k) < H(Y)$. In order to find the best counting key, we maximize the difference between the two, which is exactly the mutual information between the label and the key. Formally, we have the following objective:

$$\arg\max_k H(Y) - H(Y|k).$$
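To make the objective concrete, the following pure-Python sketch computes $H(Y) - H(Y|k)$ on toy impression data; the key values and labels are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) over a list of binary click labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(keys, labels):
    """H(Y | k): coverage-weighted label entropy within each key value."""
    n = len(labels)
    by_key = {}
    for k, y in zip(keys, labels):
        by_key.setdefault(k, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_key.values())

# Toy impressions: key value = (hour bucket, item objective), label = click.
keys   = [("am", "video"), ("am", "video"), ("pm", "app"), ("pm", "app")]
clicks = [1, 1, 0, 0]

gain = entropy(clicks) - conditional_entropy(keys, clicks)
print(round(gain, 4))  # the key perfectly separates clicks here, so gain = H(Y) = 1.0
```

A key whose values carry no information about clicks would give a gain of zero; the objective prefers keys that shift this quantity as high as possible.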

## 5. Experiments

### 5.1. Experiment Settings

We use the Twitter video advertisement data stream to evaluate our method. More specifically, we use data from 18 consecutive days in 2017. The first 15 days of data are used to select the counting feature keys and generate the corresponding counting features for all users. Then, we use the data of the last 3 days to evaluate the automatically identified counting features.

We compare three different feature settings to validate the effectiveness of automatically identified counting features.

1. Without user-ads engagement features (BASE), where we use only the contextual features $c$ to predict the CTR.
2. Human-curated user-ads engagement features (HUMAN). These features, denoted as $h^h$, are generated from human heuristics and past experience. In this setting, we use $c$ and $h^h$ together to predict the CTR.
3. Automatically selected counting features (ACF). Our method automatically extracts a set of counting features, denoted as $h^a$. We use $c$ and $h^a$ together to predict the CTR.

With the three feature combinations, we show the CTR prediction results under two different models.

1. An online learning model using logistic regression (LR). The ads engagement data is generated as a stream, and the online learning model constantly updates itself given new engagement data. A data sample (user-ad pair) first goes through a discretizer, which is a set of gradient boosted decision trees, to get a compact representation. Then the compact vectors are fed into a logistic regression model to get the predicted CTR.
2. An offline training model using a wide and deep neural network (Cheng et al., 2016) (NN). The wide component has a single layer, which acts as a linear model on the given raw features. The deep component consists of multiple layers, which aim to automatically learn the interactions among features. The two components are jointly learned and the final output is the predicted CTR.

We use the relative cross entropy (RCE) (He et al., 2014) as our evaluation measure, mainly because we use cross entropy as our loss function.
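Concretely, RCE measures the percentage reduction in cross entropy relative to a naive baseline that always predicts the average empirical CTR of the data, so higher is better and 0 means no improvement over the baseline. A minimal sketch with toy predictions:

```python
from math import log

def cross_entropy(preds, labels):
    """Average binary cross entropy (log loss)."""
    return -sum(y * log(p) + (1 - y) * log(1 - p)
                for p, y in zip(preds, labels)) / len(labels)

def rce(preds, labels):
    """Relative cross entropy: % improvement over predicting the average CTR."""
    avg_ctr = sum(labels) / len(labels)
    ce_ref = cross_entropy([avg_ctr] * len(labels), labels)
    return 100.0 * (ce_ref - cross_entropy(preds, labels)) / ce_ref

labels = [1, 0, 0, 0]                       # toy clicks, average CTR = 0.25
print(round(rce([0.7, 0.1, 0.2, 0.1], labels), 2))  # positive: better than baseline
```

Predicting the average CTR itself yields an RCE of exactly 0, which makes RCE comparable across datasets with different base rates.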


### 5.2. Counting Features

#### Automatically Selected Counting Features – ACF

To obtain the counting key candidates, we use the data from the first five days. Due to data sparsity, we only pick the top 100 users with the most impressions. We train one gradient boosted forest for each user. The number of trees is set to 20, and the maximum depth of each tree is set to 3. Here we note that the size of a counting key should not be too large; otherwise the counting values for that key will be too sparse. Considering that the user ID is one feature in every counting key, we set the maximum tree depth to 3.

We pick the counting feature keys shown in Table 1, according to their corresponding tf-idf values. Given the counting feature keys, we count the number of impressions, the number of engagements, and the corresponding average CTR as features. In our experiments, we generate the values of the counting features using the 15-day data. We assume the counting features do not change over a short time period. Therefore, we join these historical counting features with the next 3-day data for evaluation.
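The counting and joining steps can be sketched as follows; the counting key {user ID, hour of day} and the field names are illustrative choices, not the exact schema of our pipeline.

```python
from collections import defaultdict

# Sketch: generate counting features for the (hypothetical) counting key
# {user ID, hour of day} from a historical impression log, then join them
# onto new impressions.

history = [
    {"user": "u1", "hour": 9,  "click": 1},
    {"user": "u1", "hour": 9,  "click": 0},
    {"user": "u1", "hour": 21, "click": 0},
]

counts = defaultdict(lambda: [0, 0])      # key value -> [impressions, engagements]
for imp in history:
    c = counts[(imp["user"], imp["hour"])]
    c[0] += 1
    c[1] += imp["click"]

def counting_features(imp):
    """Join the historical counts onto one new impression."""
    imps, engs = counts.get((imp["user"], imp["hour"]), (0, 0))
    ctr = engs / imps if imps else 0.0    # past CTR; 0 when the key is unseen
    return {"h_imp": imps, "h_eng": engs, "h_ctr": ctr}

print(counting_features({"user": "u1", "hour": 9}))
# {'h_imp': 2, 'h_eng': 1, 'h_ctr': 0.5}
```

Because the counts are frozen over the evaluation window, this join is a simple lookup rather than a streaming update.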


| Counting Key | tf-idf |
| --- | --- |
| user ID, item objective | – |
| user ID, engagement option | – |
| user ID, hour of day | – |
| user ID, hour of day, item objective | – |
| user ID, hour of day, engagement option | – |

Table 1. Automatically Extracted Counting Feature Keys

#### Human Curated Counting Features – HUMAN

The HUMAN features include the past CTR, number of impressions, and number of engagements on a set of human selected counting keys. These counting keys are selected based on human heuristics, where the selected features are all categorical. Some examples of selected features are advertiser ID, user age bucket, and device type. The total number of HUMAN features is 453. We should note that many of the HUMAN counting features are very sparse.

### 5.3. CTR Prediction with Online Learning

In this section we show the CTR prediction results in an online learning setting. The CTR prediction model consists of gradient boosted decision trees and a logistic regression. The gradient boosted decision trees (GBDT) are used to generate new features, and they are learned independently from the subsequent logistic regression. The goal of the GBDT is to deal with feature non-linearity. The logistic regression is then applied on the binary vectors produced by the GBDT. In the online learning setting, we update the logistic regression model as new data comes in. In total, 1% of the data is sampled as testing data, and the rest is used for training. Impressions are fed to the model in chronological order. The testing samples are not used to update the model; when we encounter a testing sample, we use the current model to report the predicted CTR, which is later used to calculate the testing RCE.
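A highly simplified sketch of this online loop, using synthetic data and omitting the GBDT discretizer (features are assumed already binary); only the streaming train/holdout pattern is the point, with every 100th sample held out to mirror the 1% test split.

```python
import math, random

random.seed(0)
DIM, LEARN_RATE = 5, 0.1
w = [0.0] * DIM                             # logistic regression weights

def predict(x):
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

test_logloss, n_test = 0.0, 0
for t in range(5000):
    x = [random.randint(0, 1) for _ in range(DIM)]
    # Synthetic ground truth: feature 0 drives clicks.
    y = 1 if random.random() < (0.8 if x[0] else 0.1) else 0
    p = predict(x)
    if t % 100 == 0:                        # held-out sample: score, don't update
        test_logloss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n_test += 1
    else:                                   # online SGD update on the log loss
        g = p - y
        w = [wi - LEARN_RATE * g * xi for wi, xi in zip(w, x)]

print(round(test_logloss / n_test, 3))      # improves as the model warms up
```

Early held-out samples are scored by a barely trained model, which is exactly the warm-up effect discussed below for the reported RCE curves.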

Considering that the model needs some time to warm up, we report the average RCE on the last day. Our ACF features have a 0.6% higher RCE than the BASE features, and the HUMAN features actually perform worse than the BASE method. Although such an RCE gain seems small, the revenue impact is still huge at scale. Also, we should keep in mind that ACF contains only a dozen features, while HUMAN contains close to 500. In order to look into the model warming-up process, we further plot the RCE curve over time in Figure 3.

Figure 3. The relative RCE improvement over the BASE method, evaluated with a logistic regression model in an online learning setting for different features.

In Figure 3, we report the relative RCE improvement over the BASE method over three days at two-hour intervals. It is clear that our ACF features consistently outperform the BASE features. Meanwhile, we notice that the HUMAN features are almost consistently worse than the BASE features. We investigate this issue further in Section 5.5, and find that the main reason is that HUMAN contains too many sparse features, which hurts the model performance.

### 5.4. CTR Prediction with Batch Training

In this section we show the CTR prediction results in a batch training setting, evaluating our new features with the wide and deep NN model shown in Figure 4. More specifically, we use the first 2-day data as training data for the NN model, and test on the last day.

Figure 4. Wide & Deep NN model structure.

The ACF achieves the best performance, with a testing RCE 2.02% better than BASE. HUMAN also outperforms BASE, by 1.06% in RCE, mainly because the 5-layer fully connected model can learn new combinations of features.


### 5.5. Feature Analysis

In this section, we compare the feature quality of ACF and HUMAN.

Feature Coverage. We define the feature coverage as the percentage of samples that have a non-zero value for a given feature. We summarize the key coverage statistics for HUMAN and ACF in Table 2. Notice that there are HUMAN features that cover no impressions at all, because they are too sparse.

| Feature type | ACF | HUMAN |
| --- | --- | --- |
| Number of features | 15 | 453 |
| Average coverage | 0.8411 | 0.0286 |
| Maximum coverage | 0.9768 | 0.9461 |
| Minimum coverage | 0.6179 | 0 |

Table 2. Feature coverage comparison.

Feature Correlations. We calculate the Pearson correlation between each feature and the click label. For each feature, since the coverage is not 100%, we only calculate the correlation on the subset of impressions where the feature values are not missing. A summary of the correlations of both feature sets is given in Table 3, from which it is easy to tell that the correlations of ACF are overall better than those of HUMAN (e.g. an average correlation of 0.0451 for ACF vs. a maximum of 0.0138 for HUMAN).

| Feature type | ACF | HUMAN |
| --- | --- | --- |
| Number of features | 15 | 102 |
| Average correlation | 0.0451 | -0.0003 |
| Maximum correlation | 0.2026 | 0.0138 |
| Minimum correlation | 0.0005 | -0.0165 |

Table 3. Feature correlation comparison.
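The per-feature computation described above can be sketched as follows, with `None` marking impressions a feature does not cover; the feature values and labels are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Feature values, with None marking impressions the counting key does not cover.
feature = [0.5, None, 0.2, 0.9, None, 0.1]
clicks  = [1,   0,    0,   1,   1,    0]

# Restrict the correlation to the covered subset, as in the analysis above.
pairs = [(f, y) for f, y in zip(feature, clicks) if f is not None]
r = pearson([f for f, _ in pairs], [y for _, y in pairs])
print(round(r, 4))  # ≈ 0.8835 on this toy data
```

Restricting to the covered subset avoids rewarding a sparse feature for the impressions it never sees, which is why coverage is reported separately in Table 2.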

## Conclusion

In this paper, we study how to automatically generate counting features for ads CTR prediction. We propose a tree-based method to automatically select counting feature keys. The proposed method is simple and efficient. Through extensive experiments with real Twitter data, we validate that the automatically extracted counting features have better coverage and higher correlations with the click labels than human-curated features. It is noteworthy that our approach can easily generalize beyond user-centric counting features, e.g. to advertiser-centric and interest-centric counting features.