
Propensity Modeling, Causal Inference, and Discovering Drivers of Growth

Imagine you just started a job at a new company. You watched World War Z recently, so you're in a skeptical mood, and given that your last two startups failed from what you believe to be a lack of data, you're giving everything an extra critical eye.

You start by thinking about the impact of the sales team. How much extra revenue are they generating for the company? The sales folks you've met say that over 90% of the leads they've talked to end up buying the company's product – but, you wonder, how many of those leads would have converted anyways?

You take a look at the logs, and notice something interesting: last week was hack week, and half the salesforce took time off from making calls to make Marauder's Maps instead, yet the rate of converted leads remained the same...*

Suddenly, one of your teammates drops by your desk. He's making a batch of Soylent, and he wants you to take a sip. It looks nasty, so you ask what the benefits are, and he responds that his friends who've been drinking it for the past few months just ran a marathon! Oh, did they just start running? Nope, they ran the marathon last year too!...

*Inspired by a true story.


Causal Inference

Causality is incredibly important, yet often extremely difficult to establish.

Do patients who self-select into taking a new drug get better because the drug works, or would they have gotten better anyways? Is your salesforce actually effective, or are they simply talking to the customers who already plan to convert? Is Soylent (or your company's million-dollar ad campaign) truly worth your time?

In an ideal world, we'd be able to run experiments – the gold standard for measuring causality – whenever we wish. In the real world, however, we can't. There are ethical qualms with giving certain patients placebos, or dangerous and untested drugs. Management may be unwilling to take a potential short-term revenue hit by assigning sales to random customers, and a team earning commission-based bonuses may rebel against the thought as well.

How can we understand causal lifts in the absence of an A/B test? This is where propensity modeling and other techniques of causal inference come into play.




Propensity Modeling

So suppose we want to model the effect of drinking Soylent using a propensity model technique. To explain the idea, let's start with a thought experiment.

Imagine Brad Pitt has a twin brother, indistinguishable in every way: Brad 1 and Brad 2 wake up at the same time, they eat the same foods, they exercise the same amount, and so on. One day, Brad 1 happens to receive the last batch of Soylent from a marketer on the street, while Brad 2 does not, and so only Brad 1 begins to incorporate Soylent into his diet. In this scenario, any subsequent difference in behavior between the twins is precisely the drink's effect.

Taking this scenario into the real world, one way to estimate Soylent's effect on health would be as follows:

  • For every Soylent drinker, find a Soylent abstainer who's as close a match as possible. For example, we might match a Soylent-drinking Jay-Z with a non-Soylent Kanye, a Soylent-drinking Natalie Portman with a non-Soylent Keira Knightley, and a Soylent-drinking JK Rowling with a non-Soylent Stephenie Meyer.
  • We measure Soylent's effect as the difference between each twin pair.

However, finding closely matching twins is extremely difficult in practice. Is Jay-Z really a close match with Kanye, if Jay-Z sleeps one hour more on average? What about the Jonas Brothers and One Direction?
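The matching procedure above is essentially a nearest-neighbor search over covariates. Here's a toy sketch of one-to-one matching, where the covariates (hours of sleep, weekly exercise hours) and all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up covariates (hours of sleep, weekly exercise hours) for five
# Soylent drinkers and eight abstainers.
drinkers = rng.normal([7.0, 3.0], 0.5, size=(5, 2))
abstainers = rng.normal([7.0, 3.0], 0.5, size=(8, 2))

def match_nearest(treated, control):
    """For each treated unit, return the index of the closest control
    unit in covariate space (Euclidean distance)."""
    dists = np.linalg.norm(treated[:, None, :] - control[None, :, :], axis=2)
    return dists.argmin(axis=1)

matches = match_nearest(drinkers, abstainers)
print(matches)  # one matched "twin" per drinker
```

The catch, of course, is that with many covariates the "nearest" abstainer can still be a poor twin, which is exactly the problem the next paragraph raises.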

Propensity modeling, then, is a simplification of this twin matching procedure. Instead of matching pairs of people based on all the variables we have, we simply match all users based on a single number, the likelihood ("propensity") that they'll start to drink Soylent.

In more detail, here's how to build a propensity model.

  • First, select which variables to use as features. (e.g., what foods people eat, when they sleep, where they live, etc.)
  • Next, build a probabilistic model (say, a logistic regression) based on these variables to predict whether a user will start drinking Soylent or not. For example, our training set might consist of a set of people, some of whom ordered Soylent in the first week of March 2014, and we would train the classifier to predict which users became Soylent drinkers.
  • The model's probabilistic estimate that a user will start drinking Soylent is called a propensity score.
  • Form some number of buckets, say 10 buckets in total (one bucket covers users with a 0.0 - 0.1 propensity to take the drink, a second bucket covers users with a 0.1 - 0.2 propensity, and so on), and place people into each one.
  • Finally, compare the drinkers and non-drinkers within each bucket (say, by measuring their subsequent physical activity, weight, or whatever measure of health) to estimate Soylent's causal effect.
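The steps above can be sketched end to end on synthetic data. Everything below is hypothetical (the confounder, the +2 miles/week effect size, and the sample are all invented), and the logistic regression is fit with plain gradient descent rather than any particular library:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic observational data (all numbers invented): age confounds
# the comparison -- older people both drink Soylent more often and
# run less.  The true causal effect is +2 miles/week.
age = rng.uniform(20, 60, n)
drinks = rng.random(n) < 1 / (1 + np.exp(-(age - 40) / 5))
miles = 30 - 0.4 * age + 2.0 * drinks + rng.normal(0, 2, n)

# Steps 1-3: fit a logistic regression P(drinks | age) by plain
# gradient descent; the fitted probability is the propensity score.
X = np.column_stack([np.ones(n), (age - age.mean()) / age.std()])
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - drinks) / n
propensity = 1 / (1 + np.exp(-X @ w))

# Steps 4-5: bucket users by propensity and compare drinkers with
# non-drinkers inside each bucket, then average the differences.
buckets = np.clip((propensity * 10).astype(int), 0, 9)
diffs = []
for b in range(10):
    t = (buckets == b) & drinks
    c = (buckets == b) & ~drinks
    if t.sum() > 5 and c.sum() > 5:  # skip sparsely populated cells
        diffs.append(miles[t].mean() - miles[c].mean())
effect = float(np.mean(diffs))
print(round(effect, 1))  # should land near the true +2 miles/week
```

Note that the naive comparison `miles[drinks].mean() - miles[~drinks].mean()` would actually come out negative on this data, since drinkers skew older and older people run less; the within-bucket comparison is what recovers the true effect.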

For example, here's a hypothetical distribution of Soylent and non-Soylent ages. We see that drinkers tend to be quite a bit older, and this confounding fact is one reason we can't simply run a correlational analysis.

[Figure: age distributions of Soylent drinkers and non-drinkers]

After training a model to estimate Soylent propensity and group users into propensity buckets, this might be a graph of the effect that Soylent has on a person's weekly running mileage.

[Figure: weekly running mileage of drinkers vs. non-drinkers, by propensity bucket]

In the above (hypothetical) graph, each row represents a propensity bucket of people, and the exposure week denotes the first week of March, when the treatment group received their Soylent shipment. We see that prior to that week, both groups of people track quite well. After the treatment group (the Soylent drinkers) start their plan, however, their weekly running mileage ramps up, which forms our estimate of the drink's causal effect.



Other Methods of Causal Inference

Of course, there are many other methods of causal inference on observational data. I'll run through two of my favorites. (I originally wrote this post in response to a question on Quora, which is why I take my examples from there.)

Regression Discontinuity

Quora recently started displaying badges of status on the profiles of its Top Writers, so suppose we want to understand the effect of this feature. (Assume that it's impossible to run an A/B test now that the feature has been launched.) Specifically, does the badge itself cause users to gain more followers?

For simplicity, let's assume that the badge was given to all users who received at least 5000 upvotes in 2013. The idea behind a regression discontinuity design is that the difference between those users who just barely receive a Top Writer badge (i.e., receive 5000 upvotes) and those who just barely don't (i.e., receive 4999 upvotes) is more or less random chance, so we can use this threshold to estimate a causal effect.

For example, in the imaginary graph below, the discontinuity at 5000 upvotes suggests that a Top Writer badge leads to around 100 more followers on average.
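A regression discontinuity estimate is straightforward to sketch on simulated data. Here the +100 jump at the cutoff is baked into the data generator (all numbers are invented), and we fit a separate line on each side of the threshold:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000

# Hypothetical writers: upvotes earned in 2013, and followers gained
# afterwards.  Followers grow smoothly with upvotes, plus a +100 jump
# at the 5000-upvote cutoff where the badge kicks in (all invented).
upvotes = rng.uniform(3000, 7000, n)
badge = upvotes >= 5000
followers = 0.05 * upvotes + 100 * badge + rng.normal(0, 30, n)

# Fit a separate line on each side of the cutoff, within a window,
# and read the effect off the gap between the two fits at 5000.
window = np.abs(upvotes - 5000) < 1000
fit_lo = np.polyfit(upvotes[window & ~badge], followers[window & ~badge], 1)
fit_hi = np.polyfit(upvotes[window & badge], followers[window & badge], 1)
jump = np.polyval(fit_hi, 5000) - np.polyval(fit_lo, 5000)
print(round(float(jump)))  # recovers roughly the +100 follower effect
```

The window restriction matters in practice: far from the cutoff, writers are no longer comparable, so real analyses fit only within a band around the threshold.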


[Figure: followers vs. 2013 upvotes, with a discontinuity at the 5000-upvote badge cutoff]

Natural Experiments

Understanding the effect of a Top Writer badge is a fairly uninteresting question, though. (It just makes an easy example.) A deeper, more fundamental question could be to ask: what happens when a user discovers a new writer that they love? Does the writer inspire them to write some of their own content, to explore more of the same topics, and through curation lead them to engage with the site even more? How important, in other words, is the connection to a great user as opposed to the reading of random, great posts?

I studied an analogous question when I was at Google, so instead of making up an imaginary Quora case study, I'll describe some of that work here.

So let's suppose we want to understand what would happen if we were able to match users to the perfect YouTube channel. How much is the ultimate recommendation worth?

  • Does falling in love with a new channel lead to engagement above and beyond activity on the channel itself, perhaps because users return to YouTube specifically for the new channel and stay to watch more? (a multiplicative effect) In the TV world, for example, perhaps many people stay at home on Sunday nights specifically to catch the latest episode of Real Housewives, and channel surf for even more entertainment once it's over.
  • Does falling in love with a new channel simply increase activity on the channel alone? (an additive effect)
  • Does a new channel replace existing engagement on YouTube? After all, maybe users only have a limited amount of time they can spend on the site. (a neutral effect)
  • Does the perfect channel actually cause users to spend less time overall on the site, since maybe they spend less time idly browsing and channel surfing once they have concrete channels they know how to turn to? (a negative effect)

As always, an A/B test would be ideal, but it's impossible in this case to run: we can't force users to fall in love with a channel (we can recommend them channels, but there's no guarantee they'll actually like them), and we can't forcibly block them from certain channels either.

One approach is to use a natural experiment (a scenario in which the universe itself somehow generates a random-like assignment) to study this effect. Here's the idea.

Consider a user who uploads a new video every Wednesday. One month, he lets his subscribers know that he won’t be uploading any new videos for a few weeks, while he goes on vacation.

How do his subscribers respond? Do they stop watching YouTube on Wednesdays, since his channel was the sole reason for their visits? Or is their activity relatively unaffected, since they only watch his content when it appears on the front page?

Imagine, instead, that the channel starts uploading a new video every Friday. Do his subscribers start to visit then as well? And now that they're on YouTube, do they merely stay for the new video, or does their visit lead to a sinkhole of searches and related content too?

As it turns out, these scenarios do happen. For example, here's a calendar of when one popular channel uploads videos. You can see that in 2011, it tended to upload videos on Tuesdays and Fridays, but it shifted to uploads on Wednesday and Saturday at the end of the year.

[Figure: calendar of the channel's upload days, showing the shift in schedule at the end of 2011]

By using this shift as a natural experiment that "quasi-randomly" removes a well-loved channel on certain days and introduces it on others, we can try to understand the effect of successfully making the perfect recommendation.
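One simple way to quantify this kind of natural experiment is a difference-in-differences comparison: how did subscriber activity change on the day that lost its uploads, relative to a day whose schedule never changed? A toy sketch with entirely invented watch-time numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
weeks = 50

# Hypothetical watch minutes for one subscriber cohort, by day.
# For the first 25 weeks the channel uploads on Tuesdays; then it
# stops.  Monday serves as an untreated comparison day.
minutes = rng.normal(20, 3, size=(weeks, 2))  # columns: [Mon, Tue]
uploads_on_tue = np.arange(weeks) < 25
minutes[:, 1] += 8 * uploads_on_tue           # upload-day lift

mon, tue = minutes[:, 0], minutes[:, 1]
before, after = uploads_on_tue, ~uploads_on_tue

# Difference-in-differences: the change on the treated day (Tuesday)
# minus the change on the comparison day (Monday).
did = (tue[after].mean() - tue[before].mean()) \
    - (mon[after].mean() - mon[before].mean())
print(round(did, 1))  # near -8: losing the upload day costs ~8 min
```

Subtracting the Monday change nets out any site-wide trend that affects both days, which is exactly what makes the "quasi-random" schedule shift usable as an experiment.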

(This is probably a somewhat convoluted example of a natural experiment. For an example that perhaps illustrates the idea more clearly, suppose we want to understand the effect of income on mental health. We can't force some people to become poor or rich, and a correlational study is clearly flawed. This NY Times article describes a natural experiment in which a group of Cherokee Indians distributed casino profits to its members, thereby "randomly" lifting some of them out of poverty.

Another example, assuming there's nothing special about the period in which hack week occurs, is the use of hack week as an instrument that quasi-randomly "prevents" the sales team from doing their job, as in the scenario I described above.)


Discovering Drivers of Growth

Let's go back to propensity modeling.

Imagine that we're on our company's Growth team, and we're tasked with figuring out how to turn casual users of the site into users that return every day. What do we do?

The propensity modeling approach might be the following. We could take a list of features (installing the mobile app, logging in, signing up for a newsletter, following certain users, etc.), and build a propensity model for each one. We could then rank each feature by its estimated causal effect on engagement, and use the ordered list of features to prioritize our next sprint. (Or we could use these numbers in order to convince the exec team that we need more resources.) This is a slightly more sophisticated version of the idea of building an engagement regression model (or a churn regression model), and examining the weights on each feature.
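As a sketch of that ranking idea, the estimator below stratifies on a single known confounder (a crude stand-in for full propensity bucketing), estimates each candidate feature's lift on engagement, and sorts. The features, effect sizes, and data are all invented: "installed_app" is built to be truly causal, while "newsletter" merely reflects already-engaged users:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000

# Toy observational data (all hypothetical): two candidate growth
# levers.  Only installing the app actually drives later engagement;
# newsletter signups just correlate with prior engagement.
prior = rng.random(n)
installed_app = rng.random(n) < 0.2 + 0.3 * prior
newsletter = rng.random(n) < 0.1 + 0.6 * prior
engagement = prior + 0.5 * installed_app + rng.normal(0, 0.1, n)

def stratified_effect(treated, outcome, confounder, k=10):
    """Average within-stratum treated-vs-control outcome difference,
    stratifying on the confounder (a stand-in for propensity buckets
    when there is a single known confounder)."""
    buckets = np.clip((confounder * k).astype(int), 0, k - 1)
    diffs = []
    for b in range(k):
        t = (buckets == b) & treated
        c = (buckets == b) & ~treated
        if t.sum() > 5 and c.sum() > 5:
            diffs.append(outcome[t].mean() - outcome[c].mean())
    return float(np.mean(diffs))

effects = {
    "installed_app": stratified_effect(installed_app, engagement, prior),
    "newsletter": stratified_effect(newsletter, engagement, prior),
}
ranked = sorted(effects, key=effects.get, reverse=True)
print(ranked)  # the truly causal feature should rank first
```

A raw correlation would rank the newsletter highly here; stratifying on prior engagement is what separates a genuine lever from a marker of users who were going to engage anyway.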

Despite writing this post, though, I admit I'm generally not a fan of propensity modeling for many applications in the tech world. (I haven't worked in the medical field, so I don't have a strong opinion on its usefulness there, though I think it's a little more necessary there.) I'll save more of my reasons for another time, but after all, causal inference is extremely difficult, and we're never going to be able to control for all the hidden influencers that can bias a treatment. Also, the mere fact that we have to choose which features to include in our model (and remember: building features is very time-consuming and difficult) means that we already have a strong prior belief on the usefulness of each feature, whereas what we'd really like to do is to discover hidden motivations of engagement that we've never thought of.

So what can we do instead?

If we're trying to understand what drives people to become heavy users of the site, why don't we simply ask them?

In more detail, let's do the following:

  • First, we'll run a survey on a couple hundred users.
  • In the survey, we'll ask them whether their engagement on the site has increased, decreased, or remained about the same over the past year. We'll also ask them to explain possible reasons for their change in activity, and to describe how they use the site currently. We can also ask for supplemental details, like their demographic information.
  • Finally, we can filter all responses down to those users who heavily increased their engagement over the past year (or who heavily decreased it, if we're trying to understand churn), and analyze their responses for the reasons behind the change.
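The filtering-and-tallying step is simple; here's a minimal sketch where the survey fields, values, and responses are all made up for illustration:

```python
from collections import Counter

# Hypothetical survey responses (fields and values are made up).
responses = [
    {"change": "increased", "reason": "new hobby", "usage": "daily"},
    {"change": "increased", "reason": "new hobby", "usage": "weekly"},
    {"change": "decreased", "reason": "too busy", "usage": "rarely"},
    {"change": "same", "reason": "habit", "usage": "weekly"},
    {"change": "increased", "reason": "friend recommendations", "usage": "daily"},
]

# Keep only users whose engagement went up, then tally their reasons.
increased = [r for r in responses if r["change"] == "increased"]
top_reasons = Counter(r["reason"] for r in increased).most_common()
print(top_reasons)  # [('new hobby', 2), ('friend recommendations', 1)]
```

In a real survey the reasons would be free-text rather than categories, so the tallying step usually involves a manual coding pass first.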

For example, here's one interesting response I got when I ran this study at YouTube.

"I have always been a big music fan, but recently I took up playing the guitar. Because of my new found passion (playing the guitar) my desire to watch concerts has increased. I started watching a whole lot of music festivals and concerts that are posted on Youtube and other music videos. I have spent a lot of time also watching guitar lessons on Youtube (from www.justinguitar.com)."

This response was representative of a general theme the survey uncovered: one big driver of engagement seemed to come from people discovering a new offline hobby, and using YouTube to increase their appreciation of it. People who wanted to start cooking at home would turn to YouTube for recipe videos, people who started playing tennis or some other sport would go to YouTube for lessons or other great shots, college students would look for channels like Khan Academy to supplement their lectures, and so on. In other words, offline activities were driving online growth, and instead of trying to figure out what kinds of online content people were interested in (which articles did they like on Facebook, who did they follow on Twitter, what did they read on Reddit), perhaps we should have been focusing on bringing their physical hobbies into the digital world.

This "offline hobby" idea certainly wouldn't have been a feature I would have thrown into any engagement model, even if only because it's a very difficult feature to create. (How do we know which videos are related to real-world behavior?) But now that we suspect it's a potentially big driver of growth ("potentially" because, of course, surveys aren't necessarily representative), it's something we can spend a lot more time studying in the logs.
















To summarize: propensity modeling is a powerful technique for measuring causal effects in the absence of a randomized experiment.

Purely correlational analyses on top of observational studies can be very dangerous, after all. To take my favorite example: if we find that cities with more policemen tend to have more crime, does this mean that we should try to reduce the size of our police forces in order to reduce the nation's amount of crime?

For another example, here's a post by Gelman on contradictory conclusions about hormone replacement therapy in the Harvard Nurses Study.

That said, remember that (as always) a model is only as good as the data you feed it. It's super difficult to account for all the hidden variables that might matter, and what you think is a well-designed causal model may in fact be missing many hidden factors. (I actually remember hearing that a propensity model on the nurses study generated a flawed conclusion, though I can't find any references to this at the moment.) So consider whether there are other approaches you can take, whether that's an easier-to-understand causal technique or even just asking your users. And even if a randomized experiment seems too difficult to run now, the effort may be well worth the trouble in the end.