
Propensity Modeling, Causal Inference, and Discovering Drivers of Growth

Imagine you just started a job at a new company. You watched World War Z recently, so you're in a skeptical mood, and given that your last two startups failed from what you believe to be a lack of data, you're giving everything an extra critical eye.

You start by thinking about the impact of the sales team. How much extra revenue are they generating for the company? The sales folks you've met say that over 90% of the leads they've talked to end up buying the company's product – but, you wonder, how many of those leads would have converted anyways?

You take a look at the logs, and notice something interesting: last week was hack week, and half the salesforce took time off from making calls to make Marauder's Maps instead, yet the rate of converted leads remained the same...*

Suddenly, one of your teammates drops by your desk. He's making a batch of Soylent, and he wants you to take a sip. It looks nasty, so you ask what the benefits are, and he responds that his friends who've been drinking it for the past few months just ran a marathon! Oh, did they just start running? Nope, they ran the marathon last year too!...

*Inspired by a true story.
Overview

How to use propensity models, how to perform causal inference when an A/B test isn't possible, and how to discover the drivers of user growth.


Causal Inference

Causality is incredibly important, yet often extremely difficult to establish.

Do patients who self-select into taking a new drug get better because the drug works, or would they have gotten better anyways? Is your salesforce actually effective, or are they simply talking to the customers who already plan to convert? Is Soylent (or your company's million-dollar ad campaign) truly worth your time?

In an ideal world, we'd be able to run experiments – the gold standard for measuring causality – whenever we wish. In the real world, however, we can't. There are ethical qualms with giving certain patients placebos, or dangerous and untested drugs. Management may be unwilling to take a potential short-term revenue hit by assigning sales to random customers, and a team earning commission-based bonuses may rebel against the thought as well.

How can we understand causal lifts in the absence of an A/B test? This is where propensity modeling, or other techniques of causal inference, comes into play.


Propensity Modeling

So suppose we want to model the effect of drinking Soylent using a propensity model technique. To explain the idea, let's start with a thought experiment.

Imagine Brad Pitt has a twin brother, indistinguishable in every way: Brad 1 and Brad 2 wake up at the same time, they eat the same foods, they exercise the same amount, and so on. One day, Brad 1 happens to receive the last batch of Soylent from a marketer on the street, while Brad 2 does not, and so only Brad 1 begins to incorporate Soylent into his diet. In this scenario, any subsequent difference in behavior between the twins is precisely the drink's effect.

Taking this scenario into the real world, one way to estimate Soylent's effect on health would be as follows:

  • For every Soylent drinker, find a Soylent abstainer who's as close a match as possible. For example, we might match a Soylent-drinking Jay-Z with a non-Soylent Kanye, a Soylent-drinking Natalie Portman with a non-Soylent Keira Knightley, and a Soylent-drinking JK Rowling with a non-Soylent Stephenie Meyer.
  • We measure Soylent's effect as the difference between each twin pair.

However, finding closely matching twins is extremely difficult in practice. Is Jay-Z really a close match with Kanye, if Jay-Z sleeps one hour more on average? What about the Jonas Brothers and One Direction?

Propensity modeling, then, is a simplification of this twin matching procedure. Instead of matching pairs of people based on all the variables we have, we simply match all users based on a single number, the likelihood ("propensity") that they'll start to drink Soylent.

In more detail, here's how to build a propensity model.

  • First, select which variables to use as features. (e.g., what foods people eat, when they sleep, where they live, etc.)
  • Next, build a probabilistic model (say, a logistic regression) based on these variables to predict whether a user will start drinking Soylent or not. For example, our training set might consist of a set of people, some of whom ordered Soylent in the first week of March 2014, and we would train the classifier to model which users become the Soylent users.
  • The model's probabilistic estimate that a user will start drinking Soylent is called a propensity score.
  • Form some number of buckets, say 10 buckets in total (one bucket covers users with a 0.0 - 0.1 propensity to take the drink, a second bucket covers users with a 0.1 - 0.2 propensity, and so on), and place people into each one.
  • Finally, compare the drinkers and non-drinkers within each bucket (say, by measuring their subsequent physical activity, weight, or whatever measure of health) to estimate Soylent's causal effect.
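The steps above can be sketched in a few lines of Python. This is a minimal illustration on simulated data: scikit-learn's `LogisticRegression` stands in for whatever probabilistic model you choose, and the features, uptake model, and true effect size (2 extra miles per week) are all invented.

```python
# A minimal sketch of propensity-score bucketing on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical observational data: user features, a treatment flag
# (did the user start drinking Soylent?), and an outcome
# (weekly running mileage after the exposure week).
n = 5000
X = rng.normal(size=(n, 3))                            # e.g. age, sleep, exercise
treated = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))   # confounded uptake
outcome = 2.0 * treated + X[:, 0] + rng.normal(size=n) # true effect is 2.0

# Steps 1-2: fit a model predicting treatment from the features.
model = LogisticRegression().fit(X, treated)

# Step 3: the propensity score is the predicted treatment probability.
propensity = model.predict_proba(X)[:, 1]

# Step 4: place users into 10 equal-width propensity buckets.
buckets = np.clip((propensity * 10).astype(int), 0, 9)

# Step 5: compare drinkers and non-drinkers within each bucket, and
# average the per-bucket differences to estimate the causal effect.
effects = []
for b in range(10):
    mask = buckets == b
    t, c = outcome[mask & treated], outcome[mask & ~treated]
    if len(t) and len(c):
        effects.append(t.mean() - c.mean())

estimated_effect = float(np.mean(effects))
print(round(estimated_effect, 2))  # should land near the true effect of 2.0
```

Note that a naive comparison of drinkers vs. non-drinkers here would be biased upward, since users with high `X[:, 0]` both drink more Soylent and run more anyway; the within-bucket comparison is what removes most of that confounding.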

For example, here's a hypothetical distribution of Soylent and non-Soylent ages. We see that drinkers tend to be quite a bit older, and this confounding fact is one reason we can't simply run a correlational analysis.

Age Distribution

After training a model to estimate Soylent propensity and group users into propensity buckets, this might be a graph of the effect that Soylent has on a person's weekly running mileage.

Soylent Effect

In the above (hypothetical) graph, each row represents a propensity bucket of people, and the exposure week denotes the first week of March, when the treatment group received their Soylent shipment. We see that prior to that week, both groups of people track quite well. After the treatment group (the Soylent drinkers) start their plan, however, their weekly running mileage ramps up, which forms our estimate of the drink's causal effect.

Other Methods of Causal Inference

Of course, there are many other methods of causal inference on observational data. I'll run through two of my favorites. (I originally wrote this post in response to a question on Quora, which is why I take my examples from there.)

Regression Discontinuity

Quora recently started displaying badges of status on the profiles of its Top Writers, so suppose we want to understand the effect of this feature. (Assume that it's impossible to run an A/B test now that the feature has been launched.) Specifically, does the badge itself cause users to gain more followers?

For simplicity, let's assume that the badge was given to all users who received at least 5000 upvotes in 2013. The idea behind a regression discontinuity design is that the difference between those users who just barely receive a Top Writer badge (i.e., receive 5000 upvotes) and those who just barely don't (i.e., receive 4999 upvotes) is more or less random chance, so we can use this threshold to estimate a causal effect.

For example, in the imaginary graph below, the discontinuity at 5000 upvotes suggests that a Top Writer badge leads to around 100 more followers on average.
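On simulated data, the regression discontinuity estimate can be computed by fitting a trend on each side of the cutoff and reading off the jump at the threshold. The upvote distribution, follower model, and bandwidth below are all invented for illustration.

```python
# A minimal sketch of a regression discontinuity estimate at the
# hypothetical 5000-upvote Top Writer threshold.
import numpy as np

rng = np.random.default_rng(1)

upvotes = rng.integers(3000, 7000, size=4000)
badge = upvotes >= 5000                      # treatment assigned at the cutoff
# Followers rise smoothly with upvotes, plus a 100-follower jump at the badge.
followers = 0.05 * upvotes + 100 * badge + rng.normal(0, 20, size=4000)

# Fit separate linear trends within a bandwidth on each side of the cutoff.
bandwidth = 1000
left = (upvotes >= 5000 - bandwidth) & ~badge
right = (upvotes < 5000 + bandwidth) & badge

left_fit = np.polyfit(upvotes[left], followers[left], 1)
right_fit = np.polyfit(upvotes[right], followers[right], 1)

# The estimated causal effect is the discontinuity at the threshold.
jump = float(np.polyval(right_fit, 5000) - np.polyval(left_fit, 5000))
print(round(jump))  # recovers the ~100-follower badge effect
```

The bandwidth choice matters in practice: too wide and the linear fits absorb curvature in the underlying trend, too narrow and the estimate gets noisy.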

Regression Discontinuity

Natural Experiments

Understanding the effect of a Top Writer badge is a fairly uninteresting question, though. (It just makes an easy example.) A deeper, more fundamental question could be to ask: what happens when a user discovers a new writer that they love? Does the writer inspire them to write some of their own content, to explore more of the same topics, and through curation lead them to engage with the site even more? How important, in other words, is the connection to a great user as opposed to the reading of random, great posts?

I studied an analogous question when I was at Google, so instead of making up an imaginary Quora case study, I'll describe some of that work here.

So let's suppose we want to understand what would happen if we were able to match users to the perfect YouTube channel. How much is the ultimate recommendation worth?

  • Does falling in love with a new channel lead to engagement above and beyond activity on the channel itself, perhaps because users return to YouTube specifically for the new channel and stay to watch more? (a multiplicative effect) In the TV world, for example, perhaps many people stay at home on Sunday nights specifically to catch the latest episode of Real Housewives, and channel surf for even more entertainment once it's over.
  • Does falling in love with a new channel simply increase activity on the channel alone? (an additive effect)
  • Does a new channel replace existing engagement on YouTube? After all, maybe users only have a limited amount of time they can spend on the site. (a neutral effect)
  • Does the perfect channel actually cause users to spend less time overall on the site, since maybe they spend less time idly browsing and channel surfing once they have concrete channels they know how to turn to? (a negative effect)

As always, an A/B test would be ideal, but it's impossible in this case to run: we can't force users to fall in love with a channel (we can recommend them channels, but there's no guarantee they'll actually like them), and we can't forcibly block them from certain channels either.

One approach is to use a natural experiment (a scenario in which the universe itself somehow generates a random-like assignment) to study this effect. Here's the idea.

Consider a user who uploads a new video every Wednesday. One month, he lets his subscribers know that he won’t be uploading any new videos for a few weeks, while he goes on vacation.

How do his subscribers respond? Do they stop watching YouTube on Wednesdays, since his channel was the sole reason for their visits? Or is their activity relatively unaffected, since they only watch his content when it appears on the front page?

Imagine, instead, that the channel starts uploading a new video every Friday. Do his subscribers start to visit then as well? And now that they're on YouTube, do they merely stay for the new video, or does their visit lead to a sinkhole of searches and related content too?

As it turns out, these scenarios do happen. For example, here's a calendar of when one popular channel uploads videos. You can see that in 2011, it tended to upload videos on Tuesdays and Fridays, but it shifted to uploads on Wednesday and Saturday at the end of the year.

Uploads Calendar

By using this shift as a natural experiment that "quasi-randomly" removes a well-loved channel on certain days and introduces it on others, we can try to understand the effect of successfully making the perfect recommendation.
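One hedged way to quantify this, assuming we had per-day watch-time logs for the channel's subscribers, is a difference-in-differences-style comparison: the change on other weekdays nets out site-wide trends, and the remaining Wednesday change is attributed to the schedule shift. All numbers below are invented.

```python
# Sketch: a difference-in-differences comparison around the upload shift.
import numpy as np

rng = np.random.default_rng(3)

# Simulated minutes subscribers watch on Wednesdays vs. other weekdays,
# before and after the channel starts uploading on Wednesdays.
pre_wed,  pre_other  = rng.normal(10, 2, 500), rng.normal(11, 2, 500)
post_wed, post_other = rng.normal(16, 2, 500), rng.normal(12, 2, 500)

# Subtract the other-weekday change (site-wide trend) from the Wednesday
# change; the remainder estimates the effect of the new Wednesday uploads.
did = float((post_wed.mean() - pre_wed.mean())
            - (post_other.mean() - pre_other.mean()))
print(round(did))  # ≈ 5 extra minutes attributable to the Wednesday uploads
```

Whether the extra Wednesday minutes exceed the length of the new video itself is what distinguishes the multiplicative effect from the merely additive one described above.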

(This is probably a somewhat convoluted example of a natural experiment. For an example that perhaps illustrates the idea more clearly, suppose we want to understand the effect of income on mental health. We can't force some people to become poor or rich, and a correlational study is clearly flawed. This NY Times article describes a natural experiment when a group of Cherokee Indians distributed casino profits to its members, thereby "randomly" lifting some of them out of poverty.

Another example, assuming there's nothing special about the period in which hack week occurs, is the use of hack week as an instrument that quasi-randomly "prevents" the sales team from doing their job, as in the scenario I described above.)

Discovering Drivers of Growth

Let's go back to propensity modeling.

Imagine that we're on our company's Growth team, and we're tasked with figuring out how to turn casual users of the site into users that return every day. What do we do?

The propensity modeling approach might be the following. We could take a list of features (installing the mobile app, logging in, signing up for a newsletter, following certain users, etc.), and build a propensity model for each one. We could then rank each feature by its estimated causal effect on engagement, and use the ordered list of features to prioritize our next sprint. (Or we could use these numbers in order to convince the exec team that we need more resources.) This is a slightly more sophisticated version of the idea of building an engagement regression model (or a churn regression model), and examining the weights on each feature.
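As a sketch, assuming hypothetical feature and column names, the per-feature ranking might look like the following: one propensity model per candidate feature, each stratified on a common set of control variables, then sorted by estimated lift. The toy data gives `installed_app` a larger true effect on visits than `newsletter`, and confounds it with account age.

```python
# Hypothetical sketch: rank candidate growth features by their
# propensity-estimated effect on engagement. Column names are invented.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def bucketed_effect(df, treatment, outcome, controls, n_buckets=10):
    """Estimate a treatment's effect on the outcome via propensity buckets."""
    model = LogisticRegression().fit(df[controls], df[treatment])
    score = model.predict_proba(df[controls])[:, 1]
    bucket = np.clip((score * n_buckets).astype(int), 0, n_buckets - 1)
    diffs = []
    for b in range(n_buckets):
        grp = df[bucket == b]
        t = grp.loc[grp[treatment] == 1, outcome]
        c = grp.loc[grp[treatment] == 0, outcome]
        if len(t) and len(c):
            diffs.append(t.mean() - c.mean())
    return float(np.mean(diffs))

# Toy data: app installs are confounded with account age; newsletter
# signups are random. True effects on visits are 3 and 1 respectively.
rng = np.random.default_rng(2)
n = 4000
age = rng.normal(size=n)
df = pd.DataFrame({
    "account_age": age,
    "installed_app": (rng.random(n) < 1 / (1 + np.exp(-age))).astype(int),
    "newsletter": (rng.random(n) < 0.3).astype(int),
})
df["weekly_visits"] = (3 * df["installed_app"] + df["newsletter"]
                       + age + rng.normal(size=n))

effects = {f: bucketed_effect(df, f, "weekly_visits", ["account_age"])
           for f in ["installed_app", "newsletter"]}
ranked = sorted(effects, key=effects.get, reverse=True)
print(ranked)  # installed_app should rank above newsletter
```

The output of such a ranking is only as trustworthy as the control set: any driver correlated with an omitted confounder will have its estimated lift inflated or deflated accordingly.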

Despite writing this post, though, I admit I'm generally not a fan of propensity modeling for many applications in the tech world. (I haven't worked in the medical field, so I don't have a strong opinion on its usefulness there, though I think it's a little more necessary there.) I'll save more of my reasons for another time, but after all, causal inference is extremely difficult, and we're never going to be able to control for all the hidden influencers that can bias a treatment. Also, the mere fact that we have to choose which features to include in our model (and remember: building features is very time-consuming and difficult) means that we already have a strong prior belief on the usefulness of each feature, whereas what we'd really like to do is to discover hidden motivations of engagement that we've never thought of.

So what can we do instead?

If we're trying to understand what drives people to become heavy users of the site, why don't we simply ask them?

In more detail, let's do the following:

  • First, we'll run a survey on a couple hundred users.
  • In the survey, we'll ask them whether their engagement on the site has increased, decreased, or remained about the same over the past year. We'll also ask them to explain possible reasons for their change in activity, and to describe how they use the site currently. We can also ask for supplemental details, like their demographic information.
  • Finally, we can filter all responses for those users who heavily increased their engagement over the past year (or who heavily decreased it, if we're trying to understand churn), and analyse their responses for the reasons.
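On a hypothetical response table, the filtering step in the last bullet is a one-liner (the column names and answers below are invented):

```python
# Sketch: filter survey responses down to the heavy-growth users.
import pandas as pd

responses = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "engagement_change": ["increased", "same", "increased", "decreased"],
    "reason": ["started learning guitar", "n/a",
               "new cooking hobby", "too busy at work"],
})

# Keep only users whose engagement increased (swap in "decreased" to
# study churn instead), then read through their stated reasons.
increased = responses[responses["engagement_change"] == "increased"]
print(increased["reason"].tolist())
```

The analysis itself is then qualitative: read the free-text reasons and look for recurring themes, like the offline-hobby driver described below.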

For example, here's one interesting response I got when I ran this study at YouTube.

"I have always been a big music fan, but recently I took up playing the guitar. Because of my new found passion (playing the guitar) my desire to watch concerts has increased. I started watching a whole lot of music festivals and concerts that are posted on Youtube and other music videos. I have spent a lot of time also watching guitar lessons on Youtube (from www.justinguitar.com)."

This response was representative of a general theme the survey uncovered: one big driver of engagement seemed to come from people discovering a new offline hobby, and using YouTube to increase their appreciation of it. People who wanted to start cooking at home would turn to YouTube for recipe videos, people who started playing tennis or some other sport would go to YouTube for lessons or other great shots, college students would look for channels like Khan Academy to supplement their lectures, and so on. In other words, offline activities were driving online growth, and instead of trying to figure out what kinds of online content people were interested in (which articles did they like on Facebook, who did they follow on Twitter, what did they read on Reddit), perhaps we should have been focusing on bringing their physical hobbies into the digital world.

This "offline hobby" idea certainly wouldn't have been a feature I would have thrown into any engagement model, even if only because it's a very difficult feature to create. (How do we know which videos are related to real-world behavior?) But now that we suspect it's a potentially big driver of growth ("potentially" because, of course, surveys aren't necessarily representative), it's something we can spend a lot more time studying in the logs.


End

To summarize: propensity modeling is a powerful technique for measuring causal effects in the absence of a randomized experiment.

Purely correlational analyses on top of observational studies can be very dangerous, after all. To take my favorite example: if we find that cities with more policemen tend to have more crime, does this mean that we should try to reduce the size of our police forces in order to reduce the nation's amount of crime?

For another example, here's a post by Gelman (http://andrewgelman.com/2005/01/07/could_propensit/) on contradictory conclusions about hormone replacement therapy in the Harvard Nurses Study.

That said, remember that (as always) a model is only as good as the data that you feed it. It's super difficult to account for all the hidden variables that might matter, and what you think might be a well-designed causal model might well in fact be missing many hidden factors. (I actually remember hearing that a propensity model on the nurses study generated a flawed conclusion, though I can't find any references to this at the moment.) So consider whether there are other approaches you can take, whether it's an easier-to-understand causal technique or even just asking your users, and even if a randomized experiment seems too difficult to run now, the effort may be well worth the trouble in the end.



英文原文:http://blog.echen.me/2014/08/15/propensity-modeling-causal-inference-and-discovering-drivers-of-growth/