Propensity Modeling, Causal Inference, and Discovering Drivers of Growth
Imagine you just started a job at a new company. You watched World War Z recently, so you're in a skeptical mood, and given that your last two startups failed from what you believe to be a lack of data, you're giving everything an extra critical eye.
You start by thinking about the impact of the sales team. How much extra revenue are they generating for the company? The sales folks you've met say that over 90% of the leads they've talked to end up buying the company's product – but, you wonder, how many of those leads would have converted anyway?
You take a look at the logs, and notice something interesting: last week was hack week, and half the salesforce took time off from making calls to make Marauder's Maps instead, yet the rate of converted leads remained the same...*
Suddenly, one of your teammates drops by your desk. He's making a batch of Soylent, and he wants you to take a sip. It looks nasty, so you ask what the benefits are, and he responds that his friends who've been drinking it for the past few months just ran a marathon! Oh, did they just start running? Nope, they ran the marathon last year too!...
*Inspired by a true story.
Overview
How to use propensity models, how to draw causal inferences when an A/B test can't be run, and how to find the drivers of user growth.
Causal Inference
Causality is incredibly important, yet often extremely difficult to establish.
Do patients who self-select into taking a new drug get better because the drug works, or would they have gotten better anyway? Is your salesforce actually effective, or are they simply talking to the customers who already plan to convert? Is Soylent (or your company's million-dollar ad campaign) truly worth your time?
In an ideal world, we'd be able to run experiments – the gold standard for measuring causality – whenever we wish. In the real world, however, we can't. There are ethical qualms with giving certain patients placebos, or dangerous and untested drugs. Management may be unwilling to take a potential short-term revenue hit by assigning sales to random customers, and a team earning commission-based bonuses may rebel against the thought as well.
How can we understand causal lifts in the absence of an A/B test? This is where propensity modeling and other techniques of causal inference come into play.
Propensity Modeling
So suppose we want to model the effect of drinking Soylent using a propensity model technique. To explain the idea, let's start with a thought experiment.
Imagine Brad Pitt has a twin brother, indistinguishable in every way: Brad 1 and Brad 2 wake up at the same time, they eat the same foods, they exercise the same amount, and so on. One day, Brad 1 happens to receive the last batch of Soylent from a marketer on the street, while Brad 2 does not, and so only Brad 1 begins to incorporate Soylent into his diet. In this scenario, any subsequent difference in behavior between the twins is precisely the drink's effect.
Taking this scenario into the real world, one way to estimate Soylent's effect on health would be as follows:
- For every Soylent drinker, find a Soylent abstainer who's as close a match as possible. For example, we might match a Soylent-drinking Jay-Z with a non-Soylent Kanye, a Soylent-drinking Natalie Portman with a non-Soylent Keira Knightley, and a Soylent-drinking JK Rowling with a non-Soylent Stephenie Meyer.
- We measure Soylent's effect as the average difference in outcome within each twin pair.
However, finding closely matching twins is extremely difficult in practice. Is Jay-Z really a close match with Kanye, if Jay-Z sleeps one hour more on average? What about the Jonas Brothers and One Direction?
Propensity modeling, then, is a simplification of this twin matching procedure. Instead of matching pairs of people based on all the variables we have, we simply match all users based on a single number, the likelihood ("propensity") that they'll start to drink Soylent.
In more detail, here's how to build a propensity model.
- First, select which variables to use as features. (e.g., what foods people eat, when they sleep, where they live, etc.)
- Next, build a probabilistic model (say, a logistic regression) based on these variables to predict whether a user will start drinking Soylent or not. For example, our training set might consist of a set of people, some of whom ordered Soylent in the first week of March 2014, and we would train the classifier to predict which of them become Soylent users.
- The model's probabilistic estimate that a user will start drinking Soylent is called a propensity score.
- Form some number of buckets, say 10 buckets in total (one bucket covers users with a 0.0 - 0.1 propensity to take the drink, a second bucket covers users with a 0.1 - 0.2 propensity, and so on), and place people into each one.
- Finally, compare the drinkers and non-drinkers within each bucket (say, by measuring their subsequent physical activity, weight, or whatever measure of health) to estimate Soylent's causal effect.
For example, here's a hypothetical distribution of ages among Soylent drinkers and non-drinkers. We see that drinkers tend to be quite a bit older, and this confounding fact is one reason we can't simply run a correlational analysis.
After training a model to estimate Soylent propensity and grouping users into propensity buckets, this might be a graph of the effect that Soylent has on a person's weekly running mileage.
In the above (hypothetical) graph, each row represents a propensity bucket of people, and the exposure week denotes the first week of March, when the treatment group received their Soylent shipment. We see that prior to that week, both groups of people track quite well. After the treatment group (the Soylent drinkers) start their plan, however, their weekly running mileage ramps up, which forms our estimate of the drink's causal effect.
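The bucketing procedure described in this section can be sketched in a few lines of Python. This is only a minimal illustration on simulated data, not the analysis behind the graphs: the single feature (a standardized age), the outcome formula, the true effect of 2 miles, and the helper names (`fit_logistic`, `propensity`, `stratified_effect`) are all assumptions made up for the example. The logistic regression is fit with plain gradient descent so the sketch needs nothing beyond the standard library.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def propensity(w, x):
    """Modeled probability of treatment; w is [bias, w1, ...]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Logistic regression via batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = propensity(w, xi) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def mean(values):
    return sum(values) / len(values)

def stratified_effect(scores, treated, outcome, n_buckets=10):
    """Size-weighted average of (drinker mean - non-drinker mean) per bucket."""
    buckets = {}
    for s, t, o in zip(scores, treated, outcome):
        b = min(int(s * n_buckets), n_buckets - 1)
        buckets.setdefault(b, ([], []))[t].append(o)  # index 0: control, 1: drinkers
    total, weight = 0.0, 0
    for control, drinkers in buckets.values():
        if control and drinkers:  # skip buckets missing one of the two groups
            size = len(control) + len(drinkers)
            total += size * (mean(drinkers) - mean(control))
            weight += size
    return total / weight

# Simulated (made-up) population: age drives both drinking and running mileage.
random.seed(0)
TRUE_EFFECT = 2.0  # extra weekly miles caused by the drink in this simulation
features, treated, miles = [], [], []
for _ in range(2000):
    age_z = random.gauss(0, 1)                         # standardized age
    t = 1 if random.random() < sigmoid(age_z) else 0   # older people drink more
    y = 10 + 3 * age_z + TRUE_EFFECT * t + random.gauss(0, 1)
    features.append([age_z])
    treated.append(t)
    miles.append(y)

weights = fit_logistic(features, treated)
scores = [propensity(weights, x) for x in features]
naive = (mean([m for m, t in zip(miles, treated) if t])
         - mean([m for m, t in zip(miles, treated) if not t]))
adjusted = stratified_effect(scores, treated, miles)
print(f"naive: {naive:.2f}  bucketed: {adjusted:.2f}  truth: {TRUE_EFFECT}")
```

In this simulation the naive drinker-versus-abstainer comparison absorbs the age difference between the groups, while the within-bucket comparison should land much closer to the simulated effect. Buckets that contain only drinkers or only non-drinkers are skipped, which is one simple way (among several) to handle a lack of overlap.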
Other Methods of Causal Inference
Of course, there are many other methods of causal inference on observational data. I'll run through two of my favorites. (I originally wrote this post in response to a question on Quora, which is why I take my examples from there.)
Regression Discontinuity
Quora recently started displaying badges of status on the profiles of its Top Writers, so suppose we want to understand the effect of this feature. (Assume that it's impossible to run an A/B test now that the feature has been launched.) Specifically, does the badge itself cause users to gain more followers?
For simplicity, let's assume that the badge was given to all users who received at least 5000 upvotes in 2013. The idea behind a regression discontinuity design is that the difference between those users who just barely receive a Top Writer badge (e.g., receive 5000 upvotes) and those who just barely don't (e.g., receive 4999 upvotes) comes down more or less to random chance, so we can use this threshold to estimate a causal effect.
For example, in the imaginary graph below, the discontinuity at 5000 upvotes suggests that a Top Writer badge leads to around 100 more followers on average.
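To make the threshold idea concrete, here is a rough Python sketch of one common way to estimate such a jump: fit a separate straight line to users just below and just above the cutoff, and compare the two lines' predictions at the cutoff itself. The data is simulated, and the bandwidth of 1000 upvotes, the follower formula, and the function names (`fit_line`, `rd_estimate`) are assumptions chosen for illustration rather than anything taken from Quora.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def rd_estimate(upvotes, followers, threshold=5000, bandwidth=1000):
    """Jump in followers at the threshold, using a line fit on each side."""
    # Center upvotes at the threshold so each intercept is the fitted value there.
    left = [(u - threshold, f) for u, f in zip(upvotes, followers)
            if threshold - bandwidth <= u < threshold]
    right = [(u - threshold, f) for u, f in zip(upvotes, followers)
             if threshold <= u <= threshold + bandwidth]
    left_at_cutoff, _ = fit_line(*zip(*left))
    right_at_cutoff, _ = fit_line(*zip(*right))
    return right_at_cutoff - left_at_cutoff

# Simulated (made-up) users: followers grow smoothly with upvotes,
# plus a 100-follower bump for anyone at or above the badge threshold.
random.seed(1)
upvotes, followers = [], []
for _ in range(4000):
    u = random.randint(3000, 7000)
    badge = 1 if u >= 5000 else 0
    f = 200 + 0.05 * u + 100 * badge + random.gauss(0, 30)
    upvotes.append(u)
    followers.append(f)

jump = rd_estimate(upvotes, followers)
print(f"estimated jump at the threshold: {jump:.1f} followers")
```

The bandwidth is a judgment call: a narrow window keeps the compared users similar but leaves fewer data points, while a wide window does the opposite and leans harder on the straight-line assumption.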
Natural Experiments
Understanding the effect of a Top Writer badge is a fairly uninteresting question, though. (It just makes an easy example.) A deeper, more fundamental question could be to ask: what happens when a user discovers a new writer that they love? Does the writer inspire them to write some of their own content, to explore more of the same topics, and through curation lead them to engage with the site even more? How important, in other words, is the connection to a great user as opposed to the reading of random, great posts?
I studied an analogous question when I was at Google, so instead of making up an imaginary Quora case study, I'll describe some of that work here.
So let's suppose we want to understand what would happen if we were able to match users to the perfect YouTube channel. How much is the ultimate recommendation worth?
- Does falling in love with a new channel lead to engagement above and beyond activity on the channel itself, perhaps because users return to YouTube specifically for the new channel and stay to watch more? (a multiplicative effect) In the TV world, for example, perhaps many people stay at home on Sunday nights specifically to catch the latest episode of Real Housewives, and channel surf for even more entertainment once it's over.
- Does falling in love with a new channel simply increase activity on the channel alone? (an additive effect)
- Does a new channel replace existing engagement on YouTube? After all, maybe users only have a limited amount of time they can spend on the site. (a neutral effect)
- Does the perfect channel actually cause users to spend less time overall on the site, since maybe they spend less time idly browsing and channel surfing once they have concrete channels they know to turn to? (a negative effect)
As always, an A/B test would be ideal, but it's impossible to run in this case: we can't force users to fall in love with a channel (we can recommend them channels, but there's no guarantee they'll actually like them), and we can't forcibly block them from certain channels either.
One approach is to use a natural experiment (a scenario in which the universe itself somehow generates a random-like assignment) to study this effect. Here's the idea.
Consider a user who uploads a new video every Wednesday. One month, he lets his subscribers know that he won’t be uploading any new videos for a few weeks, while he goes on vacation.
How do his subscribers respond? Do they stop watching YouTube on Wednesdays, since his channel was the sole reason for their visits? Or is their activity relatively unaffected, since they only watch his content when it appears on the front page?
Imagine, instead, that the channel starts uploading a new video every Friday. Do his subscribers start to visit then as well? And now that they're on YouTube, do they merely stay for the new video, or does their visit lead to a sinkhole of searches and related content too?
As it turns out, these scenarios do happen. For example, here's a calendar of when one popular channel uploads videos. You can see that in 2011, it tended to upload videos on Tuesdays and Fridays, but it shifted to uploads on Wednesday and Saturday at the end of the year.
By using this shift as a natural experiment that "quasi-randomly" removes a well-loved channel on certain days and introduces it on others, we can try to understand the effect of successfully making the perfect recommendation.
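As a toy illustration of how such a schedule shift might be read, the Python sketch below compares how long subscribers spend on other channels on a day that lost its upload (Tuesday) and a day that gained one (Wednesday), before and after the shift. The log rows are invented numbers and the function name `mean_minutes` is hypothetical; a real analysis would also need to account for seasonality and for subscribers who joined or left during the period.

```python
from collections import defaultdict

# Invented rows: (period relative to the schedule shift, weekday,
# minutes one subscriber spent on OTHER channels that day).
log = [
    ("before", "Tue", 30), ("before", "Tue", 26),
    ("before", "Wed", 12), ("before", "Wed", 10),
    ("after", "Tue", 14), ("after", "Tue", 10),
    ("after", "Wed", 29), ("after", "Wed", 25),
]

def mean_minutes(rows):
    """Average minutes per (period, weekday) cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for period, weekday, minutes in rows:
        cell = sums[(period, weekday)]
        cell[0] += minutes
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

means = mean_minutes(log)
# Change in other-channel viewing on the day that lost uploads (Tue)
# and on the day that gained them (Wed).
change = {day: means[("after", day)] - means[("before", day)]
          for day in ("Tue", "Wed")}
print(change)
```

If viewing of other channels follows the upload day around, as it does in these made-up numbers, that would point toward the multiplicative story; if it stays flat, the channel's effect looks closer to additive.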
(This is probably a somewhat convoluted example of a natural experiment. For an example that perhaps illustrates the idea more clearly, suppose we want to understand the effect of income on mental health. We can't force some people to become poor or rich, and a correlational study is clearly flawed. This NY Times article describes a natural experiment when a group of Cherokee Indians distributed casino