Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising 反事实推理和学习系统:计算广告的例子

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson; 14(65):3207−3260, 2013.


This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.


causation, counterfactual reasoning, computational advertising




Statistical machine learning technologies in the real world are never without a purpose.Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified
the system design in the first place.


Such contradictions appear very clearly in the case of the learning systems that power Web scale applications such as search engines, ad placement engines, or recommendation systems.For instance, the placement of advertisement on the
result pages of Internet search engines depend on the bids of advertisers and on scores computed by statistical machine learning systems.Because the scores affect the contents of the result pages proposed to the users, they directly influence
the occurrence of clicks and the corresponding advertiser payments.They also have important indirect effects.Ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this Web site in the
future.They also impact the return on investment observed by the advertisers and therefore their future bids.Finally they change the nature of the data collected for training the statistical models in the future.

These complicated interactions are clarified by important theoretical works.Under
simplified assumptions, mechanism design (Myerson, 1981) leads to an insightful
account of the advertiser feedback loop (Varian, 2007;Edelman et al.,
2007).Under simplified assumptions, multiarmed bandits theory (Robbins, 1952;Auer
et al., 2002;Langford and Zhang, 2008) and reinforcement learning (Sutton and
Barto, 1998) describe the exploration/exploitation dilemma associated with the
training feedback loop.However, none of these approaches gives a complete
account of the complex interactions found in real-life systems.

重要的理论著作阐明了这些复杂的相互作用。在简化的假设下,机制设计(Myerson,1981)导致了对广告商反馈回路的深刻理解(Varian,2007;Edelman 等人,2007 年)。在简化假设下,多臂老虎机理论(罗宾斯,1952;Auer 等人,2002 年;兰福德和张,2008)和强化学习(萨顿和,1998)描述了与训练反馈回路相关的探索/开发困境。然而,这些方法都没有给出现实系统中复杂交互的完整描述。

This contribution proposes a novel approach: we view these complicated interactions
as man-ifestations of the fundamental difference that separates correlation and
causation.Using the ad placement example as a model of our problem class, we
therefore argue that the language and the methods of causal inference provide
flexible means to describe such complex machine learning sys-tems and give
sound answers to the practical questions facing the designer of such a
system.Is it useful to pass a new input signal to the statistical model?Is it
worthwhile to collect and label a new training set?What about changing the loss
function or the learning algorithm?In order to answer such questions and
improve the operational performance of the learning system, one needs to unravel
how the information produced by the statistical models traverses the Web of
causes and effects and eventually produces measurable performance metrics.


Readers with an interest in causal inference will find in this paper (i) a real world
example demonstrating the value of causal inference for large-scale machine
learning applications, (ii) causal inference techniques applicable to
continuously valued variables with meaningful confidence intervals, and (iii)
quasi-static analysis techniques for estimating how small interventions affect
cer-tain causal equilibria.Readers with an interest in real-life applications
will find (iv) a selection of practical counterfactual analysis techniques
applicable to many real-life machine learning systems.Readers with an interest
in computational advertising will find a principled framework that (v)
ex-plains how to soundly use machine learning techniques for ad placement, and
(vi) conceptually connects machine learning and auction theory in a compelling




The paper is organized as follows.Section 2 gives an overview of the advertisement placement
problem which serves as our main example.In particular, we stress some of the
difficulties encoun-tered when one approaches such a problem without a
principled perspective.Section 3 provides a condensed review of the essential
concepts of causal modeling and inference.Section 4 centers on formulating and
answering counterfactual questions such as "how would the system have
performed during the data collection period if certain interventions had been
carried out on the system ?"We describe importance sampling methods for
counterfactual analysis, with clear conditions of validity and confidence
intervals.Section 5 illustrates how the structure of the causal graph reveals
opportu-nities to exploit prior information and vastly improve the confidence
intervals.Section 6 describes how counterfactual analysis provides essential
signals that can drive learning algorithms.Assume that we have identified
interventions that would have caused the system to perform well during the data
collection period.Which guarantee can we obtain on the performance of these
same inter-ventions in the future?Section 7 presents counterfactual
differential techniques for the study of equlibria.Using data collected when
the system is at equilibrium, we can estimate how a small intervention
displaces the equilibrium.This provides an elegant and effective way to reason
about long-term feedback effects.Various appendices complete the main text with
information that we think more relevant to readers with specific backgrounds.

第 2 节概述了广告投放问题,这是我们的主要例子。特别是,我们强调当一个人在没有原则性观点的情况下处理这样一个问题时会遇到的一些困难。

第 3 节简要回顾了因果建模和推理的基本概念。

第 4 节集中阐述和回答反事实问题,如“如果对系统进行了某些干预,系统在数据收集期间会有什么表现?”我们描述了用于反事实分析的重要性抽样方法,具有明确的有效性条件和置信区间。

第 5 节说明了因果图的结构如何揭示利用先验信息和大大提高置信区间的机会。

第 6 节描述了反事实分析如何提供驱动学习算法的基本信号。假设我们已经确定了会导致系统在数据收集期间表现良好的干预措施。我们能从这些相同的干预措施的未来表现中获得什么保证?

第 7 节介绍了研究马库利亚的反事实微分技术。利用系统处于平衡时收集的数据,我们可以估计一个小的干预如何取代平衡。这为长期反馈效应提供了一种优雅而有效的推理方式。各种附录用我们认为与特定背景的读者更相关的信息完成了正文。



2.Causation Issues in Computational Advertising 计算广告中的因果关系问题

After giving an overview of the advertisement placement problem, which serves as our main
ex-ample, this section illustrates some of the difficulties that arise when one
does not pay sufficient attention to the causal structure of the learning


2.1 Advertisement Placemen 广告投放

All Internet users are now familiar with the advertisement messages that adorn popular Web
pages.Advertisements are particularly effective on search engine result pages
because users who are searching for something are good targets for advertisers
who have something to offer.Several actors take part in this Internet
advertisement game:


• Advertisers create advertisement messages, and place bids that describe how much they are
willing to pay to see their ads displayed or clicked.

广告商 创建广告信息,并出价描述他们愿意支付多少费用来进行广告显示或点击。

• Publishers
provide attractive Web services, such as, for instance, an Internet search
engine.They display selected ads and expect to receive payments from the
advertisers.The infras-tructure to collect the advertiser bids and select ads
is sometimes provided by an advertising network on behalf of its affiliated
publishers.For the purposes of this work, we simply con-sider a publisher large
enough to run its own infrastructure.


• Users
reveal information about their current interests, for instance, by entering a
query in a search engine.They are offered Web pages that contain a selection of
ads (Figure 1).Users sometimes click on an advertisement and are transported to
a Web site controlled by the ad-vertiser where they can initiate some business.

用户透露他们当前兴趣的信息,例如,通过在搜索引擎中输入查询。它们被提供包含精选广告的网页(图 1)。用户有时点击一个广告,然后被传送到由广告发布者控制的网站,在那里他们可以开始一些业务。

A conventional bidding language is necessary to precisely define under which
conditions an adver-tiser is willing to pay the bid amount.In the case of
Internet search advertisement, each bid specifies (a) the advertisement
message, (b) a set of keywords, (c) one of several possible matching criteria
between the keywords and the user query, and (d) the maximal price the
advertiser is willing to pay when a user clicks on the ad after entering a
query that matches the keywords according to the specified criterion.


Whenever a user visits a publisher Web page, an advertisement placement engine runs an
auction in real time in order to select winning ads, determine where to display
them in the page, and compute the prices charged to advertisers, should the
user click on their ad.Since the placement engine is operated by the publisher,
it is designed to further the interests of the publisher.Fortunately for
everyone else, the publisher must balance short term interests, namely the
immediate revenue brought by the ads displayed on each Web page, and long term
interests, namely the future revenues resulting from the continued satisfaction
of both users and advertisers.


Auction theory explains how to design a mechanism that optimizes the revenue of the
seller of a single object (Myerson, 1981;Milgrom, 2004) under various
assumptions about the information available to the buyers regarding the
intentions of the other buyers.In the case of the ad placement problem, the publisher
runs multiple auctions and sells opportunities to receive a click.When nearly
identical auctions occur thousand of times per second, it is tempting to
consider that the advertisers have perfect information about each other.This
assumption gives support to the popular generalized second price rank-score
auction (Varian, 2007;Edelman et al., 2007):

拍卖理论解释了如何设计一种机制来优化单个物品卖家的收入(Myerson,1981;Milgrom,2004)在关于买方可获得的关于其他买方意图的信息的各种假设下。在广告投放问题的情况下,发布者进行多次拍卖,并出售获得点击的机会。当几乎相同的拍卖每秒钟发生数千次时,人们很容易认为广告商拥有关于彼此的完美信息。这一假设支持了流行的广义第二价格等级分数拍卖(瓦里安,2007;Edelman 等人,2007 年):

Figure 1:
Mainline and sidebar ads on a search result page.Ads placed in the mainline are more
likely to be noticed, increasing both the chances of a click if the ad is relevant and the
risk of annoying the user if the ad is not relevant.

图 1:搜索结果页面上的主线和侧边栏广告。主线上的广告更可能会被注意到,这既增加了广告相关时点击的机会,也增加了广告不相关时惹恼用户的风险。

• Let x represent the auction context information, such as the user query, the user
profile, the date, the time, etc.The ad placement engine first determines all
eligible ads a1 ...an and the corresponding bids b1 ...bn on the basis of the
auction context x and of the matching criteria specified by the advertisers.

让 x 代表拍卖上下文信息,如用户查询、用户档案、日期、时间等。广告投放引擎首先确定所有合格的广告 a1...an 和相应的投标 b1...bn 基于拍卖上下文 x 和广告商指定的匹配标准。

• For each selected ad ai and each potential position p on the Web page, a statistical
model outputs the estimate qi,p(x) of the probability that ad ai displayed in
position p receives a user click.The rank-score ri,p(x) = biqi,p(x) then
represents the purported value associated with placing ad ai at position p.

对于每个选定的广告$a_i$ 和网页上的每个潜在位置 p,统计模型输出显示在位置 p 的广告$a_i$ 接收用户点击的概率的估计$q_i$ ,p(x)。等级分数 ri,p(x) = $b_i$ $q_i$ ,p(x)则表示与将广告 $a_i$ 放置在位置 p 相关联的声称值。

• Let L represent a possible ad layout, that is, a set of positions that can
simultaneously be populated with ads, and let $L$ be the set of possible ad layouts, including of course the
empty layout.The optimal layout and the corresponding ads are obtained by maximizing the total

让 L 代表一种可能的广告布局,即一组可以同时用广告填充,让$L$是一组可能的广告布局,当然也包括空的布局。通过最大化总排名分数来获得最佳布局和相应的广告


subject to
reserve constraints and also subject to diverse policy constraints, such as, for instance, preventing the

simultane-ous display of multiple ads belonging to the same advertiser.Under
mild assumptions, this discrete maximization problem is amenable to
computationally efficient greedy algorithms (see appendix A.)

受储备限制,并且还受到不同的政策约束,例如防止属于同一广告商的多个广告同时显示。在温和的假设下,这个离散的最大化问题服从于计算高效的贪婪算法(见附录 a)

• The
advertiser payment associated with a user click is computed using the
generalized second price (GSP) rule: the advertiser pays the smallest bid that
it could have entered without chang-ing the solution of the discrete
maximization problem, all other bids remaining equal.In other words, the
advertiser could not have manipulated its bid and obtained the same treatment
for a better price.


Under the
perfect information assumption, the analysis suggests that the publisher simply
needs to find which reserve prices Rp(x) yield the best revenue per
auction.However, the total revenue of the publisher also depends on the traffic
experienced by its Web site.Displaying an excessive number of irrelevant ads
can train users to ignore the ads, and can also drive them to competing Web
sites.Advertisers can artificially raise the rank-scores of irrelevant ads by
temporarily increasing the bids.Indelicate advertisers can create deceiving
advertisements that elicit many clicks but direct users to spam Web sites.Experience
shows that the continued satisfaction of the users is more important to the
publisher than it is to the advertisers.

在完美信息假设下,分析表明,发行商只需找到哪个保留价格 Rp(x)在每次拍卖中产生最佳收益。然而,出版商的总收入也取决于其网站的流量。显示过多不相关的广告可以训练用户忽略广告,也可以驱使他们去竞争网站。广告商可以通过临时提高出价来人为提高不相关广告的排名分数。无礼的广告商可以制造欺骗性的广告,吸引大量点击,但却将用户引向垃圾网站。经验表明,用户的持续满意度对出版商来说比对广告商更重要。

Therefore the
generalized second price rank-score auction has evolved.Rank-scores have been
augmented with terms that quantify the user satisfaction or the ad
relevance.Bids receive adaptive discounts in order to deal with situations
where the perfect information assumption is unrealistic.These adjustments are
driven by additional statistical models.The ad placement engine should
therefore be viewed as a complex learning system interacting with both users
and advertisers.


Controlled Experiments

2.2 对照实验

The designer
of such an ad placement engine faces the fundamental question of testing
whether a proposed modification of the ad placement engine results in an
improvement of the operational performance of the system.


The simplest
way to answer such a question is to try the modification.The basic idea is to
randomly split the users into treatment and control groups (Kohavi et al.,
2008).Users from the control group see Web pages generated using the unmodified
system.Users of the treatment groups see Web pages generated using alternate
versions of the system.Monitoring various performance metrics for a couple
months usually gives sufficient information to reliably decide which variant of
the system delivers the most satisfactory performance.

回答这样的问题最简单的方法就是尝试修改。基本思想是将用户随机分为治疗组和对照组(Kohavi 等人,2008 年)。控制组的用户可以看到使用未修改的系统生成的网页。治疗组的用户可以看到使用系统的替代版本生成的网页。监控几个月的各种性能指标通常可以提供足够的信息来可靠地决定系统的哪种变体提供最令人满意的性能。

Modifying an
advertisement placement engine elicits reactions from both the users and the
ad-vertisers.Whereas it is easy to split users into treatment and control
groups, splitting advertisers into treatment and control groups demands special
attention because each auction involves multi-ple advertisers (Charles et al.,
2012).Simultaneously controlling for both users and advertisers is probably

修改广告投放引擎会引起用户和广告发布者的反应。虽然很容易将用户分成治疗组和对照组,但将广告商分成治疗组和对照组需要特别注意,因为每次拍卖都涉及多个广告商(查尔斯等人,2012 年)。同时控制用户和广告商可能是不可能的。

experiments also suffer from several drawbacks.They are expensive because they
demand a complete implementation of the proposed modifications.They are slow
because each ex-periment typically demands a couple months.Finally, although
there are elegant ways to efficiently run overlapping controlled experiments on
the same traffic (Tang et al., 2010), they are limited by the volume of traffic
available for experimentation.

受控实验也有几个缺点。它们是昂贵的,因为它们要求完全实现提议的修改。它们很慢,因为每次实验通常需要几个月的时间。最后,虽然有很好的方法可以在相同的流量上有效地运行重叠的受控实验(Tang 等人,2010),但是它们受到可用于实验的流量的限制。

It is
therefore difficult to rely on controlled experiments during the conception
phase of potential improvements to the ad placement engine.It is similarly
difficult to use controlled experiments to drive the training algorithms
associated with click probability estimation models.Cheaper and faster
statistical methods are needed to drive these essential aspects of the
development of an ad placement engine.Unfortunately, interpreting cheap and
fast data can be very deceiving.


Confounding Data

2.3 混杂数据

Assessing the
consequence of an intervention using statistical data is generally challenging
because it is often difficult to determine whether the observed effect is a
simple consequence of the interven-tion or has other uncontrolled causes.


Overall 总数 Patients with small stones 小结石患者 Patients with large stones 大结石患者
Treatment A:Open surgery 开放手术 78% (273/350) 93% (81/87) 73% (192/263)
Treatment B:Percutaneous nephrolithotomy 经皮肾镜取石术 83% (289/350) 87% (234/270) 69% (55/80)

Table 1: A
classic example of Simpson's paradox.The table reports the success rates of two
treat-ments for
kidney stones (Charig et al., 1986, Tables I and II).Although the overall
success rate of treatment B seems better, treatment B performs worse than
treatment A on both patients with small kidney stones and patients with large
kidney stones.See Section 2.3.

表 1:辛普森悖论的经典例子。该表报告了两次治疗的成功率-
肾结石治疗(Charig 等人,1986 年,表一和表二)。尽管治疗 B 的总体成功率似乎更高,但无论是小肾结石患者还是大肾结石患者,治疗 B 的表现都比治疗 A 差。参见第 2.3 节。

For instance,
the empirical comparison of certain kidney stone treatments illustrates this
dif-ficulty (Charig et al., 1986).Table 2.3 reports the success rates observed
on two groups of 350 patients treated with respectively open surgery (treatment
A, with 78% success) and percutaneous nephrolithotomy (treatment B, with 83%
success).Although treatment B seems more successful, it was more frequently
prescribed to patients suffering from small kidney stones, a less serious
con-dition.Did treatment B achieve a high success rate because of its intrinsic
qualities or because it was preferentially applied to less severe cases?Further
splitting the data according to the size of the kidney stones reverses the
conclusion: treatment A now achieves the best success rate for both patients
suffering from large kidney stones and patients suffering from small kidney
stones.Such an inversion of the conclusion is called Simpson's paradox
(Simpson, 1951).

例如,某些肾结石治疗的经验比较说明了这一困难(Charig 等人,1986)。表 2.3 报告了两组分别接受开放手术(治疗 A,成功率 78%)和经皮肾镜取石术(治疗 B,成功率 83%)治疗的 350 名患者的成功率。虽然治疗乙似乎更成功,但它更经常被开给患有小肾结石的患者,这是一种不太严重的情况。B 治疗获得高成功率是因为它的内在质量还是因为它优先应用于不太严重的病例?根据肾结石的大小进一步分割数据会逆转结论:治疗 A 现在对患有大肾结石的患者和患有小肾结石的患者都达到了最佳成功率。这种结论的颠倒被称为辛普森悖论(辛普森,1951)。

The stone
size in this study is an example of a confounding variable, that is an
uncontrolled variable whose consequences pollute the effect of the
intervention.Doctors knew the size of the kidney stones, chose to treat the
healthier patients with the least invasive treatment B, and therefore caused
treatment B to appear more effective than it actually was.If we now decide to
apply treat-ment B to all patients irrespective of the stone size, we break the
causal path connecting the stone size to the outcome, we eliminate the
illusion, and we will experience disappointing results.

这项研究中的结石大小是一个混杂变量的例子,这是一个不受控制的变量,其结果污染了干预的效果。医生知道肾结石的大小,选择用侵入性最小的治疗 B 来治疗更健康的患者,因此导致治疗 B 看起来比实际更有效。如果我们现在决定对所有患者进行治疗,而不考虑结石的大小,我们就打破了结石大小和结果之间的因果关系,我们就消除了幻觉,我们将经历令人失望的结果。

When we
suspect the existence of a confounding variable, we can split the contingency
tables and reach improved conclusions.Unfortunately we cannot fully trust these
conclusions unless we are certain to have taken into account all confounding
variables.The real problem therefore comes from the confounding variables we do
not know.


experiments arguably provide the only correct solution to this problem (see
Stigler, 1992).The idea is to randomly chose whether the patient receives
treatment A or treatment B. Because this random choice is independent from all
the potential confounding variables, known and unknown, they cannot pollute the
observed effect of the treatments (see also Section 4.2).This is why controlled
experiments in ad placement (Section 2.2) randomly distribute users between
treatment and control groups, and this is also why, in the case of an ad
placement engine, we should be somehow concerned by the practical impossibility
to randomly distribute both users and advertisers.

随机实验可以说是这个问题唯一正确的解决方案(见斯蒂格勒,1992)。想法是随机选择患者是接受治疗 A 还是治疗 b。因为这种随机选择独立于所有已知和未知的潜在混杂变量,所以它们不能污染观察到的治疗效果(另见第 4.2 节)。这就是为什么广告投放的受控实验(第 2.2 节)在治疗组和对照组之间随机分配用户,这也是为什么在广告投放引擎的情况下,我们应该以某种方式关注随机分配用户和广告商的实际可能性。

Overall q2 low q2 high
q1 low 6.2%(124/2000) 5.1% (92/1823) 18.1% (32/176)
q1 high 7.5% (149/2000) 4.8% (71/1500) 15.6% (78/500)

Table 2:
Confounding data in ad placement.The table reports the click-through rates and
the click counts of the
second mainline ad.The overall counts suggest that the click-through rate of
the second mainline ad increases when the click probability estimate q1 of the
top ad is high.However, if we further split the pages according to the click
probability estimate q2 of the second mainline ad, we reach the opposite
conclusion.See Section 2.4.

表 2:广告投放的混杂数据。该表报告了点击率和点击量第二条主线广告的计数。总体计数表明,当顶部广告的点击概率估计值 q1 较高时,第二主线广告的点击率会增加。然而,如果我们根据第二条主线广告的点击概率估计 q2 进一步分割页面,我们会得出相反的结论。参见第 2.4 节。

Confounding Data in Ad Placement

2.4 广告投放中的混杂数据

Let us return
to the question of assessing the value of passing a new input signal to the ad
placement engine click prediction model.Section 2.1 outlines a placement method
where the click probability estimates qi,p(x) depend on the ad and the position
we consider, but do not depend on other ads displayed on the page.We now
consider replacing this model by a new model that additionally uses the
estimated click probability of the top mainline ad to estimate the click
probability of the second mainline ad (Figure 1).We would like to estimate the
effect of such an intervention using existing statistical data.

让我们回到评估向广告投放引擎点击预测模型传递新输入信号的价值的问题。第 2.1 节概述了一种布局方法,其中点击概率估计 qi,p(x)取决于广告和我们考虑的位置,但不取决于页面上显示的其他广告。我们现在考虑用一个新模型来代替这个模型,这个新模型额外使用顶部主线广告的估计点击概率来估计第二个主线广告的点击概率(图 1)。我们希望利用现有的统计数据来评估这种干预的效果。

We have
collected ad placement data for Bing search result pages served during three
consecu-tive hours on a certain slice of traffic.Let q1 and q2 denote the click
probability estimates computed by the existing model for respectively the top
mainline ad and the second mainline ad.After ex-cluding pages displaying fewer
than two mainline ads, we form two groups of 2000 pages randomly picked among
those satisfying the conditions q1 < 0.15 for the first group and q1 ≥ 0.15
for the second group.Table 2.4 reports the click counts and frequencies
observed on the second mainline ad in each group.Although the overall numbers
show that users click more often on the second mainline ad when the top
mainline ad has a high click probability estimate q1, this conclusion is
reversed when we further split the data according to the click probability
estimate q2 of the second mainline ad.

我们已经收集了必应搜索结果页面的广告投放数据,这些页面在连续三个小时的特定流量中提供服务。让 q1 和 q2 分别表示由现有模型为顶部主线广告和第二主线广告计算的点击概率估计。在包括显示少于两个主线广告的页面之后,我们形成两组 2000 页的页面,从满足第一组 q1 < 0.15 和第二组 q1 ≥ 0.15 的条件的页面中随机挑选。表 2.4 报告了每组中第二条主线广告的点击次数和频率。虽然总体数字显示,当顶级主线广告具有高点击概率估计值 q1 时,用户更频繁地点击第二主线广告,但当我们根据第二主线广告的点击概率估计值 q2 进一步分割数据时,这一结论被颠倒了。

superficial similarities, this example is considerably more difficult to
interpret than the kidney stone example.The overall click counts show that the
actual click-through rate of the second mainline ad is positively correlated
with the click probability estimate on the top mainline ad.Does this mean that
we can increase the total number of clicks by placing regular ads below
frequently clicked ads?


Remember that
the click probability estimates depend on the search query which itself depends
on the user intention.The most likely explanation is that pages with a high q1
are frequently asso-ciated with more commercial searches and therefore receive
more ad clicks on all positions.The observed correlation occurs because the
presence of a click and the magnitude of the click probabil-ity estimate q1
have a common cause: the user intention.Meanwhile, the click probability
estimate q2 returned by the current model for the second mainline ad also
depend on the query and therefore the user intention.Therefore, assuming that
this dependence has comparable strength, and assuming that there are no other
causal paths, splitting the counts according to the magnitude of q2 factors out
the effects of this common confounding cause.We then observe a negative
correlation which now suggests that
a frequently clicked top mainline ad has a negative impact on the click-through
rate of the second mainline ad.

请记住,点击概率估计取决于搜索查询,而搜索查询本身取决于用户意图。最可能的解释是,q1 高的页面通常与更多的商业搜索相关联,因此在所有位置都能获得更多的广告点击。出现观察到的相关性是因为点击的存在和点击概率估计值 q1 的大小有一个共同的原因:用户意图。同时,当前模型为第二主线广告返回的点击概率估计 q2 也取决于查询,因此也取决于用户意图。因此,假设这种依赖性具有相当的强度,并且假设没有其他因果路径,根据 q2 因子的大小将计数分开,排除这种常见混杂原因的影响。然后我们观察到一个负相关

If this is
correct, we would probably increase the accuracy of the click prediction model
by switching to the new model.This would decrease the click probability
estimates for ads placed in the second mainline position on commercial search
pages.These ads are then less likely to clear the reserve and therefore more likely
to be displayed in the less attractive sidebar.The net result is probably a
loss of clicks and a loss of money despite the higher quality of the click
probability model.Although we could tune the reserve prices to compensate this
unfortunate effect, nothing in this data tells us where the performance of the
ad placement engine will land.Furthermore, unknown confounding variables might
completely reverse our conclusions.


Making sense
out of such data is just too complex !


2.5 A Better Way

2.5 更好的方式

It should now
be obvious that we need a more principled way to reason about the effect of
potential interventions.We provide one such more principled approach using the
causal inference machinery (Section 3).The next step is then the identification
of a class of questions that are sufficiently ex-pressive to guide the designer
of a complex learning system, and sufficiently simple to be answered using data
collected in the past using adequate procedures (Section 4).

现在应该很明显,我们需要一种更有原则的方式来推理潜在干预的效果。我们使用因果推理机制提供了一种更有原则的方法(第 3 节)。然后,下一步是确定一类问题,这些问题足够有代表性,可以指导复杂学习系统的设计者,并且足够简单,可以使用过去使用适当程序收集的数据来回答(第 4 节)。

A machine
learning algorithm can then be viewed as an automated way to generate questions
about the parameters of a statistical model, obtain the corresponding answers,
and update the param-eters accordingly (Section 6).Learning algorithms derived
in this manner are very flexible: human designers and machine learning
algorithms can cooperate seamlessly because they rely on similar sources of

然后,机器学习算法可以被视为自动生成关于统计模型参数的问题、获得相应答案并相应更新参数的方法(第 6 节)。以这种方式导出的学习算法非常灵活:人类设计者和机器学习算法可以无缝协作,因为它们依赖于相似的信息源。

Causal Systems


When we point
out a causal relationship between two events, we describe what we expect to
happen to the event we call the effect, should an external operator manipulate
the event we call the cause.Manipulability theories of causation (von Wright,
1971;Woodward, 2005) raise this commonsense insight to the status of a
definition of the causal relation.Difficult adjustments are then needed to
interpret statements involving causes that we can only observe through their
effects, "because they love me," or that are not easily manipulated,
"because the earth is round."


statistical thinking makes a clear distinction between the statistical model
and the world.The actual mechanisms underlying the data are considered
unknown.The statistical models do not need to reproduce these mechanisms to
emulate the observable data (Breiman, 2001).Better models are sometimes
obtained by deliberately avoiding to reproduce the true mechanisms (Vapnik,
1982, Section 8.6).We can approach the manipulability puzzle in the same spirit
by viewing causation as a reasoning model (Bottou, 2011) rather than a property
of the world.Causes and effects are simply the pieces of an abstract reasoning
game.Causal statements that are not empirically testable acquire validity when
they are used as intermediate steps when one reasons about manipulations or
interventions amenable to experimental validation.

现代统计思维明确区分了统计模型和世界。数据背后的实际机制被认为是未知的。统计模型不需要再现这些机制来模拟可观察到的数据(Breiman,2001)。更好的模型有时是通过故意避免再现真实的机制而获得的(Vapnik,1982,第 8.6 节)。我们可以通过将因果关系视为一种推理模型(Bottou,2011)而不是世界的一种属性,以同样的精神来解决可操作性难题。原因和结果只是抽象推理游戏的片段。不可通过经验检验的因果陈述,当一个人提出可以通过实验验证的操作或干预时,当它们被用作中间步骤时,就获得了有效性。

This section
presents the rules of this reasoning game.We largely follow the framework
pro-posed by Pearl (2009) because it gives a clear account of the connections
between causal models and probabilistic models.



Query context
x from user intent u. Eligible ads (ai) from query x and inventory v.
Corresponding bids (bi).

来自用户意图的查询上下文 x。来自查询 x 和库存 v 的合格广告(ai)对应的出价(bi)。

(qi,p,Rp) from query x and ads a. Ad slate s from eligible ads a, scores q and
bids b. Corresponding click prices c.

来自查询 x 和广告 a 的分数(qi,p,Rp)。来自合格广告 a 的广告名单 s,分数 q 和出价 b。相应的点击价格 c。

User clicks y
from ad slate s and user intent u. Revenue z from clicks y and prices c.

用户从广告名单中点击 y,用户意图从点击 y 和价格 c 中获得收入

Figure 2: A structural
equation model for ad placement.The sequence of equations describes the

图 2:广告投放的结构方程模型。方程式序列描述了

flow of
information.The functions fk describe how effects depend on their direct
causes.The additional noise variables εk represent independent sources of
randomness useful to model probabilistic dependencies.

信息流。函数 fk 描述了效果如何依赖于它们的直接原因。额外的噪声变量 εk 代表独立的随机源,可用于建模概率相关性。

3.1 The Flow
of Information

3.1 信息流

Figure 2
gives a deterministic description of the operation of the ad placement
engine.Variable u represents the user and his or her intention in an
unspecified manner.The query and query context x is then expressed as an
unknown function of the u and of a noise variable ε1.Noise variables in this
framework are best viewed as independent sources of randomness useful for
modeling a nondeterministic causal dependency.We shall only mention them when
they play a specific role in the discussion.The set of eligible ads a and the
corresponding bids b are then derived from the query x and the ad inventory v
supplied by the advertisers.Statistical models then compute a collection of
scores q such as the click probability estimates qi,p and the reserves Rp
introduced in Section 2.1.The placement logic uses these scores to generate the
"ad slate" s, that is, the set of winning ads and their assigned
positions.The corresponding click prices c are computed.The set of user clicks
y is expressed as an unknown function of the ad slate s and the user intent u.
Finally the revenue z is expressed as another function of the clicks y and the
prices c.

图 2 给出了广告投放引擎操作的确定性描述。变量 u 以未指明的方式表示用户及其意图。然后,查询和查询上下文 x 被表示为 u 和噪声变量 ε1 的未知函数。这个框架中的噪声变量最好被视为独立的随机源,有助于对不确定的因果相关性进行建模。只有当它们在讨论中发挥具体作用时,我们才会提到它们。合格广告 a 的集合和相应的出价 b 然后从查询 x 和广告商提供的广告目录 v 中导出。统计模型然后计算分数 q 的集合,例如点击概率估计 qi、p 和第 2.1 节中介绍的储备 Rp。放置逻辑使用这些分数来生成“广告名单”,即一组获胜的广告及其分配的位置。计算相应的点击价格 c。用户点击量 y 的集合被表示为广告名单 s 和用户意图 u 的未知函数。最后,收入 z 被表示为点击量 y 和价格 c 的另一个函数

Such a system
of equations is named structural equation model (Wright, 1921).Each equation
asserts a functional dependency between an effect, appearing on the left hand
side of the equation, and its direct causes, appearing on the right hand side
as arguments of the function.Some of these causal dependencies are
unknown.Although we postulate that the effect can be expressed as some function
of its direct causes, we do not know the form of this function.For instance,
the designer of the ad placement engine knows functions f2 to f6 and f8 because
he has designed them.However, he does not know the functions f1 and f7 because
whoever designed the user did not leave sufficient documentation.

这样的方程组被称为结构方程模型(Wright,1921)。每一个方程都断言出现在方程左边的一个效应与其直接原因之间的函数依赖关系,直接原因作为函数的自变量出现在右边。其中一些因果依赖是未知的。虽然我们假设效果可以表示为其直接原因的某种函数,但我们不知道这种函数的形式。例如,广告投放引擎的设计者知道函数 f2 到 f6 和 f8,因为他已经设计了它们。然而,他不知道函数 f1 和 f7,因为设计用户的人没有留下足够的文档。

Figure 3
represents the directed causal graph associated with the structural equation
model.Each arrow connects a direct cause to its effect.The noise variables are
omitted for simplicity.The structure of this graph reveals fundamental
assumptions about our model.For instance, the user clicks y do not directly
depend on the scores q or the prices c because users do not have access to this

图 3 表示与结构方程模型相关的有向因果图。每个箭头都将直接原因与其结果联系起来。为简单起见,噪声变量被省略。这个图表的结构揭示了关于我们模型的基本假设。例如,用户点击 y 并不直接取决于分数 q 或价格 c,因为用户无法访问这些信息。

We hold as a
principle that causation obeys the arrow of time: causes always precede their
effects.Therefore the causal graph must be acyclic.Structural equation models
then support two fundamental operations, namely simulation and intervention.


Figure 3:
Causal graph associated with the structural equation model of Figure 2.The
mutually independent
noise variables ε1 to ε8 are implicit.The variables a, b, q, s, c, and z depend
on their direct causes in known ways.In contrast, the variables u and v are
exogenous and the variables x and y depend on their direct causes through
unknown functions.

图 3:与图 2 的结构方程模型相关的因果图。相互之间独立噪声变量 ε1 至 ε8 是隐式的。变量 a、b、q、s、c 和 z 以已知的方式依赖于它们的直接原因。相比之下,变量 u 和 v 是外生的,变量 x 和 y 通过未知函数依赖于它们的直接原因。

• Simulation

  • Let us assume that we know both the exact form of all functional dependencies
    and the value of all exogenous variables, that is, the variables that never
    appear in the left hand side of an equation.We can compute the values of all
    the remaining variables by applying the equations in their natural time


Intervention - As long as the causal graph remains acyclic, we can construct
derived structural equation models using arbitrary algebraic manipulations of
the system of equations.For instance, we can clamp a variable to a constant
value by rewriting the right-hand side of the corresponding equation as the
specified constant value.


The algebraic
manipulation of the structural equation models provides a powerful language to
describe interventions on a causal system.This is not a coincidence.Many
aspects of the mathe-matical notation were invented to support causal inference
in classical mechanics.However, we no longer have to interpret the variable
values as physical quantities: the equations simply describe the flow of
information in the causal model (Wiener, 1948).


3.2 The
Isolation Assumption

3.2 隔离假设

Let us now
turn our attention to the exogenous variables, that is, variables that never
appear in the left hand side of an equation of the structural model.Leibniz's
principle of sufficient reason claims that there are no facts without
causes.This suggests that the exogenous variables are the effects of a network
of causes not expressed by the structural equation model.For instance, the user
intent u and the ad inventory v in Figure 3 have temporal correlations because
both users and advertisers worry about their budgets when the end of the month
approaches.Any structural equation model should then be understood in the
context of a larger structural equation model potentially describing all things
in existence.

现在让我们将注意力转向外生变量,即从未出现在结构模型方程左侧的变量。莱布尼茨的充分理由原则主张没有原因就没有事实。这表明外生变量是结构方程模型没有表达的原因网络的影响。例如,图 3 中的用户意图 u 和广告清单 v 具有时间相关性,因为当月底临近时,用户和广告商都担心他们的预算。任何结构方程模型都应该在一个更大的结构方程模型的背景下理解,这个模型潜在地描述了所有存在的事物。

Ads served on
a particular page contribute to the continued satisfaction of both users and
ad-vertisers, and therefore have an effect on their willingness to use the
services of the publisher in the future.The ad placement structural equation
model shown in Figure 2 only describes the causal de-pendencies for a single
page and therefore cannot account for such effects.Consider however a very

在特定页面上提供的广告有助于用户和广告发布者的持续满意度,因此会影响他们将来使用出版商服务的意愿。图 2 所示的广告投放结构方程模型仅描述了单个页面的因果相关性,因此无法解释这种影响。然而,考虑一个非常





Figure 4:
Conceptually unrolling the user feedback loop by threading instances of the
single page

图 4:通过线程化单个页面的实例从概念上展开用户反馈循环

causal graph
(Figure 3).Both the ad slate st and user clicks yt have an indirect effect on
the user intent ut+1 associated with the next query.

因果图(图 3)。ad slate st 和用户点击 yt 都对与下一个查询相关联的用户意图 ut+1 有间接影响。

structural equation model containing a copy of the page-level model for every
Web page ever served by the publisher.Figure 4 shows how we can thread the
page-level models corresponding to pages served to the same user.Similarly we
could model how advertisers track the performance and the cost of their
advertisements and model how their satisfaction affects their future bids.The
resulting causal graphs can be very complex.Part of this complexity results
from time-scale dif-ferences.Thousands of search pages are served in a
second.Each page contributes a little to the continued satisfaction of one user
and a few advertisers.The accumulation of these contributions produces
measurable effects after a few weeks.

大型结构方程模型,包含发布者提供的每个网页的页面级模型副本。图 4 显示了我们如何线程化对应于服务于同一用户的页面的页面级模型。同样,我们可以模拟广告商如何跟踪他们广告的表现和成本,并模拟他们的满意度如何影响他们未来的出价。产生的因果图可能非常复杂。这种复杂性的一部分是由时间尺度的差异造成的。数以千计的搜索页面在一秒钟内被提供。每一页都对一个用户和几个广告商的持续满意度有一点贡献。几周后,这些贡献的积累会产生可衡量的效果。

Many of the
functional dependencies expressed by the structural equation model are left
un-specified.Without direct knowledge of these functions, we must reason using
statistical data.The most fundamental statistical data is collected from
repeated trials that are assumed independent.When we consider the large
structured equation model of everything, we can only have one large trial
producing a single data point.It is therefore desirable to identify repeated
patterns of identical equations that can be viewed as repeated independent
trials.Therefore, when we study a structural equation model representing such a
pattern, we need to make an additional assumption to expresses the idea that
the outcome of one trial does not affect the other trials.We call such an
assumption an isolation assumption by analogy with thermodynamics.This can be
achieved by assuming that the exogenous variables are independently drawn from
an unknown but fixed joint probability distribu-tion.This assumption cuts the
causation effects that could flow through the exogenous variables.


The noise
variables are also exogenous variables acting as independent source of
randomness.The noise variables are useful to represent the conditional
distribution P(effect|causes) using the equation effect= f(causes,
ε).Therefore, we also assume joint independence between all the noise variables
and any of the named exogenous variable.For instance, in the case of the ad

噪声变量也是作为随机性独立来源的外生变量。噪声变量有助于使用方程 effect= f(causes,ε)来表示条件分布 P(effect|causes)。因此,我们还假设所有噪声变量和任何一个已命名的外生变量之间具有联合独立性。例如,在广告投放的情况下

1.See also
the discussion on reinforcement learning, Section 3.5.

1.另请参见第 3.5 节中关于强化学习的讨论。

2.The concept
of isolation is pervasive in physics.An isolated system in thermodynamics
(Reichl, 1998, Section 2.D) or a closed system in mechanics (Landau and
Lifshitz, 1969, 5) evolves without exchanging mass or energy with its
surroundings.Experimental trials involving systems that are assumed isolated
may differ in their initial setup and therefore have different
outcomes.Assuming isolation implies that the outcome of each trial cannot
affect the other trials.

2.孤立的概念在物理学中很普遍。热力学中的孤立系统(赖歇尔,1998,第 2 节。d)或力学中的封闭系统(Landau 和 Lifshitz,1969,5)在不与周围环境交换质量或能量的情况下演化。涉及被假设为孤立的系统的实验性试验可能在它们的初始设置上有所不同,因此具有不同的结果。假设隔离意味着每个试验的结果不能影响其他试验。

3.Rather than
letting two noise variables display measurable statistical dependencies because
they share a common cause, we prefer to name the common cause and make the
dependency explicit in the graph.



外源性 vars。查询。






Figure 6:
Bayesian network associated with the Markov factorization shown in Figure 5.

图 6:与图 5 所示的马尔可夫分解相关联的贝叶斯网络。

model shown
in Figure 2, we assume that the joint distribution of the exogenous variables

模型如图 2 所示,我们假设外生变量的联合分布因子化



P(u, v,ε1,..., ε8) = P(u, v)P(ε1)...P(ε8).

Since an
isolation assumption is only true up to a point, it should be expressed clearly
and remain under constant scrutiny.We must therefore measure additional
performance metrics that reveal how the isolation assumption holds.For
instance, the ad placement structural equation model and the corresponding
causal graph (figures 2 and 3) do not take user feedback or advertiser feedback
into account.Measuring the revenue is not enough because we could easily
generate revenue at the expense of the satisfaction of the users and
advertisers.When we evaluate interventions under such an isolation assumption,
we also need to measure a battery of additional quantities that act as proxies
for the user and advertiser satisfaction.Noteworthy examples include ad
relevance estimated by human judges, and advertiser surplus estimated from the
auctions (Varian, 2009).

因为隔离假设在某种程度上是正确的,所以应该清楚地表达出来,并保持不断的审查。因此,我们必须衡量额外的性能指标,以揭示隔离假设是如何成立的。例如,广告投放结构方程模型和相应的因果图(图 2 和 3)没有考虑用户反馈或广告商反馈。衡量收入是不够的,因为我们很容易以牺牲用户和广告商的满意度为代价来创造收入。当我们在这样的隔离假设下评估干预时,我们还需要测量一组额外的量,作为用户和广告商满意度的代理。值得注意的例子包括由人类估计的广告相关性,以及从拍卖中估计的广告商盈余(瓦里安,2009)。

3.3 Markov Factorization 马尔可夫因子分解

we can draw a sample of the exogenous variables using the distribution
specified by the isolation assumption, and we can then generate values for all
the remaining variables by simulating the structural equation model.


This process
defines a generative probabilistic model representing the joint distribution of
all variables in the structural equation model.The distribution readily
factorizes as the product of the joint probability of the named exogenous
variables, and, for each equation in the structural equation model, the
conditional probability of the effect given its direct causes (Spirtes et al.,
1993;Pearl, 2000).As illustrated by figures 5 and 6, this Markov factorization
connects the structural equa-tion model that describes causation, and the
Bayesian network that describes the joint probability distribution followed by
the variables under the isolation assumption.

这个过程定义了一个表示结构方程模型中所有变量联合分布的生成概率模型。该分布很容易分解为所命名的外生变量的联合概率的乘积,对于结构方程模型中的每个方程,给出其直接原因的影响的条件概率(Spirtes 等人,1993;Pearl,2000)。如图 5 和图 6 所示,这种马尔可夫分解将描述因果关系的结构方程模型和描述隔离假设下变量遵循的联合概率分布的贝叶斯网络联系起来。

equation models and Bayesian networks appear so intimately connected that it
could be easy to forget the differences.The structural equation model is an
algebraic object.As long as the causal graph remains acyclic, algebraic
manipulations are interpreted as interventions on the causal system.The
Bayesian network is a generative statistical model representing a class of
joint probability distributions, and, as such, does not support algebraic
manipulations.However, the symbolic representation of its Markov factorization
is an algebraic object, essentially equivalent to the structural equation


Identification, Transportation, and Transfer Learning

3.4 识别、运输和转移学习

Consider a
causal system represented by a structural equation model with some unknown
functional dependencies.Subject to the isolation assumption, data collected
during the operation of this system follows the distribution described by the
corresponding Markov factorization.Let us first assume that this data is sufficient
to identify the joint distribution of the subset of variables we can observe.We
can intervene on the system by clamping the value of some variables.This
amounts to replacing the right-hand side of the corresponding structural
equations by constants.The joint distribution of the variables is then
described by a new Markov factorization that shares many factors with the
orig-inal Markov factorization.Which conditional probabilities associated with
this new distribution can we express using only conditional probabilities
identified during the observation of the original sys-tem?This is called the
identifiability problem.More generally, we can consider arbitrarily complex
manipulations of the structural equation model, and we can perform multiple
experiments involv-ing different manipulations of the causal system.Which
conditional probabilities pertaining to one experiment can be expressed using
only conditional probabilities identified during the observation of other
experiments?This is called the transportability problem.


do-calculus completely solves the identifiability problem and provides useful
tools to address many instances of the transportability problem (see Pearl,
2012).Assuming that we know the conditional probability distributions involving
observed variables in the original structural equa-tion model, do-calculus
allows us to derive conditional distributions pertaining to the manipulated structural
equation model.

Pearl 的 do-演算完全解决了可识别性问题,并提供了有用的工具来解决可运输性问题的许多实例(参见 Pearl,2012)。假设我们知道原始结构方程模型中包含观测变量的条件概率分布,微积分允许我们导出与操纵结构方程模型相关的条件分布。

we must further distinguish the conditional probabilities that we know (because
we designed them) from those that we estimate from empirical data.This
distinction is important because estimating the distribution of continuous or
high cardinality variables is notoriously dif-ficult.Furthermore, do-calculus
often combines the estimated probabilities in ways that amplify estimation
errors.This happens when the manipulated structural equation model exercises
the vari-ables in ways that were rarely observed in the data collected from the
original structural equation model.


networks are directed graphs representing the Markov factorization of a joint
probability distribution: the arrows no longer have a causal interpretation.


Therefore we
prefer to use much simpler causal inference techniques (see sections 4.1 and
4.2).Although these techniques do not have the completeness properties of
do-calculus, they combine estimation and transportation in a manner that
facilitates the derivation of useful confidence inter-vals.

因此,我们更喜欢使用简单得多的因果推断技术(参见第 4.1 节和第 4.2 节)。虽然这些技术不具备微积分的完备性,但它们以一种便于推导有用的置信区间的方式结合了估计和传输。

3.5 Special Cases

3.5 特殊情况

Three special
cases of causal models are particularly relevant to this work.


• In the
multi-armed bandit (Robbins, 1952), a user-defined policy function π determines
the distribution of action a ∈ {1...K}, and an unknown reward function r
determines the distri-bution of the outcome y given the action a (Figure 7).In
order to maximize the accumulated rewards, the player must construct policies π
that balance the exploration of the action space with the exploitation of the best
action identified so far (Auer et al., 2002;Audibert et al., 2007;Seldin et
al., 2012).

在多臂老虎机(Robbins,1952)中,用户定义的策略函数 π 决定了动作 a 的分布 ∑{ 1...一个未知的奖励函数 r 决定了给定动作 a 的结果 y 的分布(图 7)。为了使累积的奖励最大化,玩家必须构建策略 π,该策略 π 平衡对动作空间的探索和对迄今为止识别的最佳动作的开发(Auer 等人,2002;Audibert 等人,2007 年;Seldin 等人,2012 年)。

• The
contextual bandit problem (Langford and Zhang, 2008) significantly increases
the com-plexity of multi-armed bandits by adding one exogenous variable x to
the policy function π and the reward functions r (Figure 8).

上下文老虎机问题(Langford 和 Zhang,2008)通过在策略函数 π 和奖励函数 r 上增加一个外生变量 x,显著增加了多武装土匪的复杂性(图 8)。

• Both
multi-armed bandit and contextual bandit are special case of reinforcement
learning (Sutton and Barto, 1998).In essence, a Markov decision process is a
sequence of contextual bandits where the context is no longer an exogenous
variable but a state variable that depends on the previous states and actions
(Figure 9).Note that the policy function π, the reward function r, and the
transition function s are independent of time.All the time dependencies are expressed
using the states st .

多臂老虎机和情境老虎机都是强化学习的特例(萨顿和巴尔托,1998)。本质上,马尔可夫决策过程是一系列上下文老虎机,其中上下文不再是外生变量,而是依赖于先前状态和动作的状态变量(图 9)。注意,策略函数 π、奖励函数 r 和转移函数 s 与时间无关。所有的时间依赖关系都用状态 st 表示。

These special
cases have increasing generality.Many simple structural equation models can be
reduced to a contextual bandit problem using appropriate definitions of the
context x, the action a and the outcome y. For instance, assuming that the
prices c are discrete, the ad placement struc-tural equation model shown in
Figure 2 reduces to a contextual bandit problem with context (u, v), actions
(s, c) and reward z. Similarly, given a sufficiently intricate definition of
the state variables st , all structural equation models with discrete variables
can be reduced to a reinforcement learning problem.Such reductions lose the
fine structure of the causal graph.We show in Section 5 how this fine structure
can in fact be leveraged to obtain more information from the same experiments.

这些特例越来越具有普遍性。使用上下文 x、动作 a 和结果 y 的适当定义,许多简单的结构方程模型可以简化为上下文强盗问题。例如,假设价格 c 是离散的,图 2 所示的广告投放结构方程模型可以简化为具有上下文(u,v)、动作(s,c)和奖励 z 的上下文强盗问题。类似地,给定状态变量 st 的足够复杂的定义,所有具有离散变量的结构方程模型可以简化为强化学习问题。这种简化失去了因果图的精细结构。我们在第 5 节中展示了如何利用这种精细的结构从相同的实验中获得更多的信息。

reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the
assumption that the policy function, the reward function, the transition
function, and the distributions of the corresponding noise variables, are
independent from time.This invariance property provides great benefits when the
observed sequences of actions and rewards are long in comparison with the size
of the state space.Only Section 7 in this contribution presents methods that
take advantage of such an invariance.The general question of leveraging
arbitrary functional invariances in causal graphs is left for future work.

现代强化学习算法(见萨顿和巴尔托,1998)利用了政策函数、奖励函数、转移函数和相应噪声变量的分布与时间无关的假设。当观察到的动作和回报序列与状态空间的大小相比较长时,这种不变性提供了很大的好处。这篇文章中只有第 7 节介绍了利用这种不变性的方法。利用因果图中任意函数不变性的一般问题留待将来研究。

4.Counterfactual Analysis 反事实分析

We now return
to the problem of formulating and answering questions about the value of
proposed changes of a learning system.Assume for instance that we consider
replacing the score computation


a = π ( ε ) Action a ∈{1...K}

y = r ( a,ε’) Reward y ∈ R

动作 a∞{ 1...K}奖励 y ∈ R

Figure 7:
Structural equation model for the multi-armed bandit problem.The policy π
selects a discrete
action a, and the reward function r determines the outcome y. The noise
vari-ables ε and ε ′ represent independent sources of randomness useful to
model probabilistic dependencies.

图 7:多武装匪徒问题的结构方程模型。策略 π 选择一个离散动作 a,奖励函数 r 决定结果 y。噪声变量 ε 和 ε’代表独立的随机源,可用于建模概率相关性。

a = π ( x , ε) Action a ∈{1...K}

y = r ( x,a,ε’)Reward y ∈ R

Figure 8:
Structural equation model for contextual bandit problem.Both the action and the
reward depend on an exogenous context variable x.

图 8:上下文老虎机问题的结构方程模型。行动和奖励都依赖于一个外生的环境变量 x。

Figure 9:
Structural equation model for reinforcement learning.The above equations are

图 9:强化学习的结构方程模型。上面的等式是重复的

for all t ∈
{0...,T}.The context is now provided by a state variable st−1 that depends on
the previous states and actions.

对于所有 t∞{ 0...,T}。上下文现在由状态变量 ST 1 提供,该变量取决于先前的状态和动作。

model M of an
ad placement engine by an alternate model M. We seek an answer to the
conditional question:

模型 M 的一个广告投放引擎的一个替代模型 M。我们寻求一个条件问题的答案:

will the system perform if we replace model M by model M*?"

"如果我们用 M*代替 M 模型,系统会有什么表现?"

sufficient time and sufficient resources, we can obtain the answer using a
controlled experiment (Section 2.2).However, instead of carrying out a new
experiment, we would like to obtain an answer using data that we have already
collected in the past.

如果有足够的时间和足够的资源,我们可以通过一个受控实验(第 2.2 节)来获得答案。然而,我们不希望进行新的实验,而是希望使用我们过去已经收集的数据来获得答案。

would the system have performed if, when the data was collected, we had
replaced model M by model M*?"

“如果在收集数据时,我们用模型 M*替换了模型 M,系统会有什么表现?”

The answer of
this counterfactual question is of course a counterfactual statement that
describes the system performance subject to a condition that did not happen.


statements challenge ordinary logic because they depend on a condition that is
known to be false.Although assertion A ⇒ B is always true when assertion A is
false, we certainly do not mean for all counterfactual statements to be
true.Lewis (1973) navigates this paradox using a modal logic in which a
counterfactual statement describes the state of affairs in an alternate world
that resembles ours except for the specified differences.Counterfactuals indeed
offer many subtle ways to qualify such alternate worlds.For instance, we can
easily describe isolation assumptions (Section 3.2) in a counterfactual

反事实陈述挑战普通逻辑,因为它们依赖于一个已知为假的条件。虽然当断言 A 为假时,断言 A 总是真的,但我们当然不是说所有的反事实陈述都是真的。刘易斯(1973)使用模态逻辑来解释这个悖论,在模态逻辑中,一个反事实的陈述描述了一个与我们相似的另一个世界的事态,除了特定的差异。反事实确实提供了许多微妙的方法来限定这样的交替世界。例如,我们可以很容易地在一个反事实问题中描述隔离假设(第 3.2 节):

would the system have performed if, when the data was collected, we had
replaced model M by model Mwithout incurring user or advertiser

“如果在收集数据时,我们用 M*模型代替了 M 模型,而不会引起用户或广告商的反应,那么系统会有怎样的表现?”


Figure 10:
Causal graph for an image recognition system.We can estimate counterfactuals by
replaying data collected in the past.

图 10:图像识别系统的因果图。我们可以通过回放过去收集的数据来估计反事实。

Figure 11:
Causal graph for a randomized experiment.We can estimate certain
counterfactuals by reweighting data collected in the past.

图 11:随机实验的因果图。我们可以通过重新加权过去收集的数据来估计某些反事实。

The fact that
we could not have changed the model without incurring the user and advertiser
reac-tions does not matter any more than the fact that we did not replace model
M by model M* in the first place.This does not prevent us from using
counterfactual statements to reason about causes and effects.Counterfactual
questions and statements provide a natural framework to express and share our

事实上,我们不可能在不引起用户和广告商反应的情况下改变模型,这一点并不比我们一开始没有用 M*模型替换 M 模型更重要。这并不妨碍我们用反事实的陈述来推理因果。反事实的问题和陈述提供了一个自然的框架来表达和分享我们的结论。

The remaining
text in this section explains how we can answer certain counterfactual
questions using data collected in the past.More precisely, we seek to estimate
performance metrics that can be expressed as expectations with respect to the
distribution that would have been observed if the counterfactual conditions had
been in force.


4.1 Replaying
Empirical Data

4.1 重放经验数据

Figure 10
shows the causal graph associated with a simple image recognition system.The
classifier takes an image x and produces a prospective class label ˆy. The loss
measures the penalty associated with recognizing class ˆy while the true class
is y.

图 10 显示了与简单图像识别系统相关的因果图。分类器获取图像 x 并产生预期的类别标签 y。损失测量与识别类别 y 相关联的惩罚,而真实类别是 y

To estimate
the expected error of such a classifier, we collect a representative data set
composed of labelled images, run the classifier on each image, and average the
resulting losses.In other words, we replay the data set to estimate what (counterfactual)
performance would have been observed if we had used a different classifier.We
can then select in retrospect the classifier that would have worked the best
and hope that it will keep working well.This is the counterfactual viewpoint on
empirical risk minimization (Vapnik, 1982).


Replaying the
data set works because both the alternate classifier and the loss function are
known.More generally, to estimate a counterfactual by replaying a data set, we
need to know all the functional dependencies associated with all causal paths
connecting the intervention point to the measurement point.This is obviously
not always the case.


counterfactual expectations can be viewed as expectations of unit-level
counterfactuals (Pearl, 2009, Def-inition 4), they elude the semantic
subtleties of unit-level counterfactuals and can be measured with randomized
experiments (see Section 4.2.)

5.尽管反事实期望可以被视为单位级反事实的期望(Pearl,2009,Def-inition 4),但它们回避了单位级反事实的语义微妙性,并且可以通过随机实验来测量(参见第 4.2 节)。)

Reweighting Randomized Trials

4.2 重新加权随机试验

Figure 11
illustrates the randomized experiment suggested in Section 2.3.The patients are
randomly split into two equally sized groups receiving respectively treatments
A and B. The overall success rate for this experiment is therefore Y = (YA
+YB)/2 where YA and YB are the success rates observed for each group.We would
like to estimate which (counterfactual) overall success rate Y would have been
observed if we had selected treatment A with probability p and treatment B with
probability 1− p.

图 11 说明了第 2.3 节中建议的随机实验。患者被随机分成两个大小相等的组,分别接受治疗 A 和治疗 b。因此,本实验的总成功率为 Y = (YA +YB)/2,其中 YA 和 YB 是每组观察到的成功率。如果我们选择概率为 p 的治疗 A 和概率为 1 p 的治疗 B,我们希望估计观察到的(反事实的)总体成功率 Y。

Since we do
not know how the outcome depends on the treatment and the patient condition, we
cannot compute which outcome y would have been obtained if we had treated
patient x with a different treatment u . Therefore we cannot answer this
question by replaying the data as we did in Section 4.1.

由于我们不知道结果如何取决于治疗和患者状况,因此我们无法计算如果我们用不同的治疗 u 治疗患者 x 会获得哪种结果 y。因此,我们无法像在第 4.1 节中那样通过重放数据来回答这个问题。

observing different success rates YA and YB for the treatment groups reveals an
empir-ical correlation between the treatment u and the outcome y. Since the only
cause of the treatment u is an independent roll of the dices, this correlation
cannot result from any known or unknown con-founding common cause.Having
eliminated this possibility, we can reweight the observed out-comes and compute
the estimate Y ≈ pYA + (1− p)YB .

然而,观察治疗组的不同成功率 YA 和 YB 揭示了治疗 u 和结果 y 之间的经验相关性。由于治疗 u 的唯一原因是独立的骰子滚动,这种相关性不能由任何已知或未知的共同原因产生。排除这种可能性后,我们可以重新加权观察到的输出,并计算估计值 Y≈pyA+(1p)YB。

4.3 Markov
Factor Replacement

4.3 马尔可夫因子替换

reweighting approach can in fact be applied under much less stringent conditions.Let
us return to the ad placement problem to illustrate this point.


The average
number of ad clicks per page is often called click yield.Increasing the click
yield usually benefits both the advertiser and the publisher, whereas
increasing the revenue per page often benefits the publisher at the expense of
the advertiser.Click yield is therefore a very useful metric when we reason
with an isolation assumption that ignores the advertiser reactions to pricing changes.


Let ω be a
shorthand for all variables appearing in the Markov factorization of the ad
placement structural equation model,

假设 ω 是广告投放结构方程模型的马尔可夫分解中出现的所有变量的简写,

P(ω) =P ( u , v ) P( x |u ) P ( a|x , v ) P ( b|x , v ) P ( q|x , a )
× P ( s|a , q
, b ) P ( c |a , q , b ) P ( y |s , u ) P ( z|y , c ) .

Variable y
was defined in Section 3.1 as the set of user clicks.In the rest of the
document, we slightly abuse this notation by using the same letter y to
represent the number of clicks.We also write the expectation Y = Eω∼P(ω) [y]
using the integral notation

变量 y 在第 3.1 节中被定义为用户点击的集合。在文档的其余部分,我们稍微滥用了这个符号,使用相同的字母 y 来表示点击次数。我们还用积分表示法写出了期望 Y = Eω∞P(ω)[Y]

Y = Zω y P (ω ) .

We would like
to estimate what the expected click yield Y would have been if we had used a
different scoring function (Figure 12).This intervention amounts to replacing
the actual factor P(q|x,a) by a counterfactual factor P(q|x,a) in the Markov

如果我们使用不同的评分函数,我们希望估计预期的点击率是多少(图 12)。这种干预相当于用马尔可夫因子分解中的反事实因子 P(q|x,a)代替实际因子 P(q|x,a)。

P ( ω) = P (u , v ) P ( x |u ) P ( a|x , v ) P ( b|x , v ) P ( q| x , a )

× P ( s|a,q,b ) P ( c |a,q,b ) P ( y |s,u ) P ( z|x,c)。


6.See also
the discussion of Reichenbach's common cause principle and of its limitations
in Spirtes et al. (1993) and Spirtes and Scheines (2004).

6.另见 Spirtes 等人(1993 年)和 Spirtes 和 Scheines
(2004 年)对赖兴巴赫共同原因原则及其局限性的讨论。


Figure 12:
Estimating which average number of clicks per page would have been observed if
we had used a different scoring model.

图 12:如果我们使用不同的评分模型,估计每个页面的平均点击次数。

Let us
assume, for simplicity, that the actual factor P(q|x,a) is nonzero
everywhere.We can then estimate the counterfactual expected click yield Y using
the transformation

为了简单起见,我们假设实际的因子 P(q|x,a)处处非零。然后,我们可以使用转换来估计反事实的预期点击量 Y

Y = Zω y P (ω) = Zω y P ( q| x a ) P ( q|x , a ) P ( ω ) ≈ n n∑i=1 y i P ( q i |x i a i ) P
( q i |x i , a i )

where the
data set of tuples (ai , xi ,qi , yi) is distributed according to the actual
Markov factorization instead of the counterfactual Markov factorization.This
data could therefore have been collected during the normal operation of the ad
placement system.Each sample is reweighted to reflect its probability of
occurrence under the counterfactual conditions.


In general,
we can use importance sampling to estimate the counterfactual expectation of
any quantity ℓ(ω):

一般来说,我们可以使用重要性抽样来估计任何数量 ℓ(ω):的反事实期望

Y = Zω ℓ ( ω
) P ( ω) = Zω ℓ ( ω ) P ( ω ) P ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w i

with weights

Equation (6)
emphasizes the simplifications resulting from the algebraic similarities of the
actual and counterfactual Markov factorizations.Because of these
simplifications, the evaluation of the weights only requires the knowledge of
the few factors that differ between P(ω) and P(ω).Each data sample needs to
provide the value of ℓ(ωi) and the values of all variables needed to evaluate
the factors that do not cancel in the ratio (6).

等式(6)强调由实际和反事实马尔可夫因子分解的代数相似性产生的简化。由于这些简化,权重的计算只需要知道 P(ω)和 P(ω)之间的几个不同因素。每个数据样本需要提供 ℓ(ωi 值)和评估比率(6)中不抵消的因素所需的所有变量的值。

In contrast,
the replaying approach (Section 4.1) demands the knowledge of all factors of
P(ω) connecting the point of intervention to the point of measurement ℓ(ω).On
the other hand, it does not require the knowledge of factors appearing only in

相比之下,重放方法(第 4.1 节)要求了解连接干预点和测量点 ℓ(ω).的所有 P(ω)因子另一方面,它不需要只出现在 P(ω)中的因子的知识。

sampling relies on the assumption that all the factors appearing in the
denomina-tor of the reweighting ratio (6) are nonzero whenever the factors
appearing in the numerator are nonzero.Since these factors represents
conditional probabilities resulting from the effect of an independent noise
variable in the structural equation model, this assumption means that the data
must be
collected with an experiment involving active randomization.We must therefore
design cost-effective randomized experiments that yield enough information to
estimate many interesting counterfactual expectations with sufficient
accuracy.This problem cannot be solved without an-swering the confidence
interval question: given data collected with a certain level of randomization,
with which accuracy can we estimate a given counterfactual expectation?



Confidence Intervals

4.4 置信区间

At first
sight, we can invoke the law of large numbers and write


Y = Zω ℓ ( ω
) w ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w i .


For sufficiently
large n, the central limit theorem provides confidence intervals whose width
grows with the standard deviation of the product ℓ(ω)w(ω).

对于足够大的 n,中心极限定理提供了置信区间,其宽度随着乘积 ℓ(ω)w(ω).的标准偏差而增加

when P(ω) is small, the reweighting ratio w(ω) takes large values with low
probability.This heavy tailed distribution has annoying consequences because
the variance of the integrand could be very high or infinite.When the variance
is infinite, the central limit theorem does not hold.When the variance is
merely very large, the central limit convergence might occur too slowly to
justify such confidence intervals.Importance sampling works best when the
actual distribution and the counterfactual distribution overlap.

遗憾的是,当 P(ω)较小时,重加权比 w(ω)取大值的概率较低。这种重尾分布有令人讨厌的后果,因为被积函数的方差可能很大或无穷大。当方差无穷大时,中心极限定理不成立。当方差仅仅非常大时,中心极限收敛可能发生得太慢,不足以证明这样的置信区间。当实际分布和反事实分布重叠时,重要性抽样效果最好。

When the
counterfactual distribution has significant mass in domains where the actual
distribu-tion is small, the few samples available in these domains receive very
high weights.Their noisy contribution dominates the reweighted estimate (7).We
can obtain better confidence intervals by eliminating these few samples drawn
in poorly explored domains.The resulting bias can be bounded using prior
knowledge, for instance with an assumption about the range of values taken by

当反事实分布在实际分布很小的域中具有显著质量时,这些域中可用的少数样本获得非常高的权重。他们的噪声贡献主导了重新加权的估计(7)。我们可以通过消除在探索不足的领域提取的这几个样本来获得更好的置信区间。所产生的偏差可以使用先验知识来界定,例如利用关于 ℓ(ω 取值范围的假设),

∀ω ℓ(ω) ∈ [0,M].


Let us choose
the maximum weight value R deemed acceptable for the weights.We have ob-tained
very consistent results in practice with R equal to the fifth largest
reweighting ratio observed on the empirical data.We can then rely on clipped
weights to eliminate the contribution of the poorly explored domains,

让我们选择重量可以接受的最大重量值 R。我们在实践中获得了非常一致的结果,R 等于在经验数据上观察到的第五大再加权比。然后,我们可以依靠剪裁的权重来消除探索不充分的领域的贡献,

w (ω) = w(ω)
if P(ω) < RP(ω) 0 otherwise.

如果 P(ω) < RP(ω) 0,则 w (ω) = w(ω),否则。

The condition
P(ω) < RP(ω) ensures that the ratio has a nonzero denominator P(ω) and is
smaller than R. Let ΩR be the set of all values of ω associated with acceptable

条件 P(ω) < RP(ω)确保比率具有非零分母 P(ω)并且小于 R。假设 ωR 是与可接受比率相关的所有 ω 值的集合:

ΩR = {ω :
P(ω) < RP(ω)} .

ωR =
{ω:P(ω)< RP(ω)}。

We can
decompose Y in two terms:

我们可以将 Y 分解为两个项:

Y = Zω ∈ Ω R
ℓ ( ω ) P ( ω) + Zω ∈ Ω \ Ω R ℓ ( ω ) P ( ω) = Y + ( Y − Y ) .

7.This is in
fact a slight abuse because the theory calls for choosing R before seeing the

7.这实际上是一个轻微的滥用,因为理论要求在看到数据之前选择 R。

The first
term of this decomposition is the clipped expectation Y . Estimating the
clipped expec-tation Y is much easier than estimating Y from (7) because the
clipped weights w(ω) are bounded by R.

这个分解的第一项是截尾期望 Y。估计截尾期望 Y 比从(7)估计 Y 容易得多,因为截尾权重 w(ω)由 r 界定

Y = Z ω∈ΩR
ℓ(ω)P (ω) = Z ω ℓ(ω)w (ω) P(ω) ≈ Yb= n n ∑ i=1 ℓ(ωi)w (ωi).(10)

y =
zω∈rωℓ(ω)p(ω)= zωℓ(ω)w(ω)p(ω)≈Yb = n n∑I = 1 ℓ(ωi)w(ωI)。(10)

The second
term of Equation (9) can be bounded by leveraging assumption (8).The resulting
bound can then be conveniently estimated using only the clipped weights.


Y −Y = Z
ω∈Ω\ΩR ℓ(ω)P (ω) ∈ h M P (Ω\ΩR) i = h M (−W ) i with

W = P (ΩR) =
Z ω∈ΩR P (ω) = Z ω w (ω)P(ω) ≈ Wb= n n ∑ i=1 w (ωi).



Since the
clipped weights are bounded, the estimation errors associated with (10) and
(11) are well characterized using either the central limit theorem or using
empirical Bernstein bounds (see appendix B for details).Therefore we can derive
an outer confidence interval of the form

由于裁剪后的权重是有界的,因此使用中心极限定理或使用经验伯恩斯坦界限(详见附录 B)可以很好地表征与(10)和(11)相关的估计误差。因此,我们可以导出形式的外部置信区间

P n Yb−εR ≤ Y
≤ Yb+εR o ≥ −δ

and an inner
confidence interval of the form


P n Y ≤ Y ≤ Y

  • M(−Wb+ξR) o ≥ −δ.


The names
inner and outer are in fact related to our preferred way to visualize these
intervals (e.g., Figure 13).Since the bounds on Y −Y can be written as

内部和外部名称实际上与我们首选的可视化这些间隔的方式相关(例如,图 13)。因为 Y-Y 的界限可以写成

Y ≤ Y ≤ Y + M
(1−W ),

we can derive
our final confidence interval,


P n Yb−εR ≤ Y
≤ Yb+ M(−Wb+ξR) +εR o ≥ −δ.



conclusion, replacing the unbiased importance sampling estimator (7) by the
clipped impor-tance sampling estimator (10) with a suitable choice of R leads
to improved confidence intervals.Furthermore, since the derivation of these
confidence intervals does not rely on the assumption that P(ω) is nonzero
everywhere, the clipped importance sampling estimator remains valid when the
dis-tribution P(ω) has a limited support.This relaxes the main restriction
associated with importance sampling.

总之,用适当选择的 R 来代替无偏重要抽样估计量(7)的削波重要抽样估计量(10)导致置信区间的改善。此外,由于这些置信区间的推导不依赖于 P(ω)处处非零的假设,当分布 P(ω)的支持度有限时,截尾重要性抽样估计量仍然有效。这放松了与重要性抽样相关的主要限制。

Interpreting the Confidence Intervals

4.5 解释置信区间

estimation of the counterfactual expectation Y can be inaccurate because the
sample size is in-sufficient or because the sampling distribution P(ω) does not
sufficiently explore the counterfactual conditions of interest.

反事实期望 Y 的估计可能是不准确的,因为样本量是不充分的,或者因为抽样分布 P(ω)没有充分探索感兴趣的反事实条件。

construction, the clipped expectation Y ignores the domains poorly explored by
the sam-pling distribution P(ω).The difference Y −Y then reflects the
inaccuracy resulting from a lack of exploration.Therefore, assuming that the
bound R has been chosen competently, the relative sizes of the outer and inner
confidence intervals provide precious cues to determine whether we can continue
collecting data using the same experimental setup or should adjust the data
collection experiment in order to obtain a better coverage.

通过构造,裁剪期望 Y 忽略了抽样分布 P(ω)探索不充分的域。Y-Y 的差异反映了由于缺乏探索而导致的不准确性。因此,假设界限 R 已经被恰当地选择,外置信区间和内置信区间的相对大小提供了宝贵的线索,以确定我们是否可以使用相同的实验设置继续收集数据,或者应该调整数据收集实验以获得更好的覆盖。

• The inner
confidence interval (13) witnesses the uncertainty associated with the domain
GR insufficiently explored by the actual distribution.A large inner confidence
interval suggests that the most practical way to improve the estimate is to
adjust the data collection experiment in order to obtain a better coverage of
the counterfactual conditions of interest.

内部置信区间(13)见证了与实际分布未充分探索的域 GR 相关的不确定性。较大的内部置信区间表明,提高估计值的最实用方法是调整数据收集实验,以便更好地覆盖感兴趣的反事实条件。

• The outer
confidence interval (12) represents the uncertainty that results from the
limited sample size.A large outer confidence interval indicates that the sample
is too small.To improve the result, we simply need to continue collecting data
using the same experimental setup.


Experimenting with Mainline Reserves

4.6 主线储备试验

We return to
the ad placement problem to illustrate the reweighting approach and the
interpretation of the confidence intervals.Manipulating the reserves Rp(x)
associated with the mainline positions (Figure 1) controls which ads are
prominently displayed in the mainline or displaced into the sidebar.

我们回到广告投放问题来说明重新加权方法和置信区间的解释。操纵与主线位置相关联的保留区 Rp(x)控制哪些广告在主线中突出显示或者转移到侧边栏中。

We seek in
this section to answer counterfactual questions of the form:


would the ad placement system have performed if we had scaled the mainline
reserves by a constant factor ρ, without incurring user or advertiser

“如果我们以一个常数 ρ 来衡量主线储备,而不引起用户或广告商的反应,广告投放系统会有怎样的表现?”

was introduced using a modified version of the ad placement engine.Before
determining the ad layout (see Section 2.1), a random number ε is drawn
according to the standard

随机化是使用广告投放引擎的修改版本引入的。在确定广告布局之前(见第 2.1 节),根据标准绘制一个随机数 ε

distribution N (0,1), and all the mainline reserves are multiplied by m = ρe −σ

正态分布 N (0,1),所有主线储量乘以 m =ρeσ/+σε。这样的

follow a log-normal distributionwhose mean is ρ and whose width is controlled
by σ.This effectively provides a parametrization of the conditional score distribution
P(q|x,a) (see Figure 5.)

乘数遵循对数正态分布,其平均值为 ρ,宽度由 σ 控制。这有效地提供了条件分数分布 P(q|x,a)的参数化(见图 5)。)

The Bing
search platform offers many ways to select traffic for controlled experiments
(Sec-tion 2.2).In order to match our isolation assumption, individual page
views were randomly as-signed to traffic buckets without regard to the user
identity.The main treatment bucket was pro-cessed with mainline reserves
randomized by a multiplier drawn as explained above with ρ=1 and σ = 0.3.With
these parameters, the mean multiplier is exactly 1, and 95% of the multipliers
are in range [0.52,1.74].Samples describing 22 million search result pages were
collected during five consecutive weeks.

必应搜索平台提供了许多选择流量进行受控实验的方法(第 2.2 节)。为了与我们的隔离假设相匹配,单个页面视图被随机签名到流量桶,而不考虑用户身份。主处理桶用主线储量进行处理,主线储量由乘数随机化,乘数如上所述绘制,ρ=1,σ = 0.3。有了这些参数,平均乘数正好是 1,95% 的乘数都在[0.52,1.74]的范围内。连续五周收集了描述 2200 万个搜索结果页面的样本。

We then use
this data to estimate what would have been measured if the mainline reserve
mul-tipliers had been drawn according to a distribution determined by
parameters ρ and σ .This is achieved by reweighting each sample ωi with

然后,我们使用这些数据来估计如果主线储备 multipliers 是根据由参数 ρ 和 σ 确定的分布绘制的,将会测量到什么。这是通过用 ωi 重新加权每个样本来实现的

w i =
P ( q i |x i,a i ) P ( q
i |x i,a i)



p ( m i ;ρ ,
σ ) p ( m i ;ρ , σ )

where mi is
the multiplier drawn for this sample during the data collection experiment, and
p(t ;ρ,σ) is the density of the log-normal multiplier distribution.

其中 mi 是在数据收集实验期间为该样本绘制的乘数,p(t;ρ,σ)是对数正态乘数分布的密度。

Figure 13
reports results obtained by varying ρ while keeping σ =σ.This amounts to
esti-mating what would have been measured if all mainline reserves had been
multiplied by ρ while keeping the same randomization.The curves bound 95%
confidence intervals on the variations of the average number of mainline ads
displayed per page, the average number of ad clicks per page,

图 13 报告了在保持 σ =σ 的情况下,通过改变 ρ 获得的结果。这相当于在保持相同随机化的情况下,对所有主线储量乘以 ρ 后的测量值进行了估计。这些曲线在每页显示的主线广告的平均数量、每页广告点击的平均数量,

precisely, lnN ( ,σ ) with = σ /2+logρ.

8.更准确地说,lnN(,σ)具有= σ /2+logρ。

Figure 13:
Estimated variations of three performance metrics in response to mainline

图 13:响应主线储备的三个性能指标的估计变化

curves delimit 95% confidence intervals for the metrics we would have observed
if we had increased the mainline reserves by the percentage shown on the
horizontal axis.The filled areas represent the inner confidence intervals.The
hollow squares represent the metrics measured on the experimental data.The
hollow circles represent metrics measured on a second experimental bucket with
mainline reserves re-duced by 18%.The filled circles represent the metrics
effectively measured on a control bucket running without randomization.

变化。这些曲线为我们观察到的指标划定了 95% 的置信区间,如果我们将主线储量增加水平轴上显示的百分比。填充区域代表内部置信区间。空心方块代表在实验数据上测量的度量。空心圆圈代表在第二个实验桶上测量的指标,主线储量减少了 18%。实心圆圈代表在没有随机化的情况下运行的控制桶上有效测量的指标。





and the
average revenue per page, as functions of ρ .The inner confidence intervals,
represented by the filled areas, grow sharply when ρ leaves the range explored
during the data collection ex-periment.The average revenue per page has more
variance because a few very competitive queries command high prices.

和每页的平均收入,作为 ρ 的函数。当 ρ 离开数据收集实验期间探索的范围时,由填充区域表示的内部置信区间急剧增长。每页的平均收入差异更大,因为一些竞争非常激烈的查询要求很高的价格。

In order to
validate the accuracy of these counterfactual estimates, a second traffic
bucket of equal size was configured with mainline reserves reduced by about
18%.The hollow circles in Figure 13 represent the metrics effectively measured
on this bucket during the same time period.The effective measurements and the
counterfactual estimates match with high accuracy.

为了验证这些反事实估计的准确性,配置了第二个同等大小的流量桶,主线储量减少了约 18%。图 13 中的空心圆圈表示在同一时间段内在该桶上有效测量的度量。有效测量值和反事实估计值高度匹配。

Finally, in
order to measure the cost of the randomization, we also ran the unmodified ad
place-ment system on a control bucket.The brown filled circles in Figure 13
represent the metrics effec-tively measured on the control bucket during the
same time period.The randomization caused a small but statistically significant
increase of the number of mainline ads per page.The click yield and average
revenue differences are not significant.

最后,为了测量随机化的成本,我们还在控制桶上运行了未修改的广告投放系统。图 13 中用棕色填充的圆圈表示在同一时间段内在控制桶上有效测量的指标。随机化导致了每页主线广告数量的小幅但有统计学意义的增加。点击率和平均收入差异不显著。

experiment shows that we can obtain accurate counterfactual estimates with
affordable randomization strategies.However, this nice conclusion does not
capture the true practical value of the counterfactual estimation approach.


4.7 More on
Mainline Reserves

4.7 主线储备的更多信息

The main
benefit of the counterfactual estimation approach is the ability to use the
same data to answer a broad range of counterfactual questions.Here are a few
examples of counterfactual ques-tions that can be answered using data collected
using the simple mainline reserve randomization scheme described in the
previous section:


• Different
variances - Instead of estimating what would have been measured if we had
in-creased the mainline reserves without changing the randomization variance,
that is, letting σ = σ, we can use the same data to estimate what would have
been measured if we had also changed σ.This provides the means to determine
which level of randomization we can afford in future experiments.

不同的方差——如果我们在不改变随机化方差的情况下增加主线储量,也就是说,让 σ = σ,我们可以使用相同的数据来估计如果我们也改变 σ 会测量到什么,而不是估计会测量到什么。这提供了确定我们在未来实验中能够承受的随机化水平的方法。

• Pointwise
estimates - We often want to estimate what would have been measured if we had
set the mainline reserves to a specific value without randomization.Although
computing estimates for small values of σ often works well enough, very small
values lead to large confidence intervals.

逐点估算-如果我们将主线储量设置为特定值而不进行随机化,我们通常希望估算将要测量的值。虽然计算小 σ 值的估计值通常足够有效,但非常小的值会导致较大的置信区间。

Let Yν(ρ)
represent the expectation we would have observed if the multipliers m had mean
ρ and variance ν.We have then Yν(ρ) = Em[E[y|m]] = Em[Y0(m)].Assuming that the
pointwise value Y0 is smooth enough for a second order development,

让 Yν(ρ)代表如果乘数 m 有均值 ρ 和方差 ν,我们会观察到的期望。然后我们有 Yν(ρ) = Em[E[y|m]] = Em[Y0(m)]。假设逐点值 Y0 对于二阶展开足够平滑,

Yν ( ρ ) ≈ E

Yν ( ρ ) ≈ E

  • Y0 ( ρ) + (
    m − ρ ) Y ′0 ( ρ) + ( m − ρ ) Y ′′ 0 ( ρ ) / 2


= Y0 ( ρ) + ν
Y ′′ 0 ( ρ ) / 2 .


Although the
reweighting method cannot estimate the point-wise value Y0(ρ) directly, we can
use the reweighting method to estimate both Yν(ρ) and Y2ν(ρ) with acceptable
confidence intervals and write Y0(ρ) ≈ 2Yν(ρ)−Y2ν(ρ) (Goodwin, 2011).

虽然重加权法不能直接估计逐点值 Y0(ρ),但我们可以用重加权法以可接受的置信区间同时估计 Yν(ρ)和 Y2ν(ρ),写出 Y0(ρ)≈2Yν(ρ)-Y2ν(ρ)(Goodwin,2011)。

Query-dependent reserves - Compare for instance the queries "car
insurance" and "com-mon cause principle" in a Web search
engine.Since the advertising potential of a search






considerably with the query, it makes sense to investigate various ways to
define query-dependent reserves (Charles and Chickering, 2012).

随着查询的不同而有很大差异,研究定义查询相关储量的各种方法是有意义的(Charles 和 Chickering,2012)。

The data
collected using the simple mainline reserve randomization can also be used to es-timate
what would have been measured if we had increased all the mainline reserves by
a query-dependent multiplier ρ (x).This is simply achieved by reweighting each
sample ωi with

使用简单的主线储量随机化收集的数据也可以用来估计如果我们将所有主线储量增加一个依赖于查询的乘数 ρ (x)会测量到什么。这可以通过用 ωi 重新加权每个样本来实现

w i =

w i =

P ( q i |x i
, a i ) P ( q i |x i , a i )

P ( q i |x i,a i ) P ( q
i |x i,a i)



p ( m i ;ρ (
x i ) , σ )

p(m I;ρ ( x i),σ)

p ( m i ;, σ
) .

p(m I;, σ ) .

broader ranges of counterfactual questions can be answered when data is
collected using randomization schemes that explore more dimensions.For
instance, in the case of the ad placement problem, we could apply an
independent random multiplier for each score instead of applying a single
random multiplier to the mainline reserves only.However, the more dimensions we
randomize, the more data needs to be collected to effectively explore all these
dimensions.Fortunately, as discussed in section 5, the structure of the causal
graph reveals many ways to leverage a priori information and improve the
confidence intervals.

当使用探索更多维度的随机化方案收集数据时,可以回答范围更广的反事实问题。例如,在广告投放问题的情况下,我们可以对每个分数应用一个独立的随机乘数,而不是只对主线储备应用一个随机乘数。然而,我们随机化的维度越多,就需要收集更多的数据来有效地探索所有这些维度。幸运的是,正如第 5 节所讨论的,因果图的结构揭示了许多利用先验信息和提高置信区间的方法。

4.8 Related

4.8 相关工作

sampling is widely used to deal with covariate shifts (Shimodaira,
2000;Sugiyama et al., 2007).Since manipulating the causal graph changes the
data distribution, such an intervention can be viewed as a covariate shift
amenable to importance sampling.Importance sampling techniques have also been
proposed without causal interpretation for many of the problems that we view as
causal inference problems.In particular, the work presented in this section is
closely related to the Monte-Carlo approach of reinforcement learning (Sutton
and Barto, 1998, Chapter 5) and to the offline evaluation of contextual bandit
policies (Li et al., 2010, 2011).

重要性抽样被广泛用于处理协变量转移(Shimodaira,2000;Sugiyama 等人,2007 年)。由于操纵因果图会改变数据分布,这种干预可以被视为服从重要性抽样的协变量转移。对于我们认为是因果推理问题的许多问题,也提出了没有因果解释的重要性抽样技术。特别是,本节介绍的工作与强化学习的蒙特卡罗方法(Sutton 和,1998,第 5 章)和上下文土匪策略的离线评估(Li 等人,2010,2011)密切相关。

learning research traditionally focuses on control problems with relatively
small discrete state spaces and long sequences of observations.This focus
reduces the need for char-acterizing exploration with tight confidence
intervals.For instance, Sutton and Barto suggest to normalize the importance
sampling estimator by 1/∑i w(ωi) instead of 1/n. This would give erro-neous
results when the data collection distribution leaves parts of the state space
poorly explored.Contextual bandits are traditionally formulated with a finite
set of discrete actions.For instance, Li's (2011) unbiased policy evaluation
assumes that the data collection policy always selects an arbitrary policy with
probability greater than some small constant.This is not possible when the
action space is infinite.

强化学习研究传统上集中于具有相对小的离散状态空间和长的观测序列的控制问题。这种关注减少了对具有紧密置信区间的特征勘探的需要。例如,萨顿和巴尔托建议用 1/∑i w(ωi)而不是 1/n 来归一化重要性抽样估计量。当数据收集分布使得部分状态空间未被充分探索时,这将给出错误的结果。背景盗匪传统上是由一组有限的离散动作组成的。例如,李(2011)的无偏策略评估假设数据收集策略总是选择概率大于某个小常数的任意策略。当动作空间无限时,这是不可能的。

assumptions on the data collection distribution are often impractical.For
instance, certain ad placement policies are not worth exploring because they
cannot be implemented efficiently or are known to elicit fraudulent
behaviors.There are many practical situations in which one is only interested
in limited aspects of the ad placement policy involving continuous parameters
such as click prices or reserves.Discretizing such parameters eliminates useful
a priori knowledge: for instance, if we slightly increase a reserve, we can
reasonable believe that we are going to show slightly less ads.


Instead of
making assumptions on the data collection distribution, we construct a biased
estima-tor (10) and bound its bias.We then interpret the inner and outer
confidence intervals as resulting from a lack of exploration or an insufficient
sample size.






Finally, the
causal framework allows us to easily formulate counterfactual questions that
pertain to the practical ad placement problem and yet differ considerably in
complexity and exploration requirements.We can address specific problems
identified by the engineers without incurring the risks associated with a
complete redesign of the system.Each of these incremental steps helps
demonstrating the soundness of the approach.




This section
shows how the structure of the causal graph reveals many ways to leverage a
priori knowledge and improve the accuracy of our counterfactual
estimates.Displacing the reweighting point (Section 5.1) improves the inner
confidence interval and therefore reduce the need for explo-ration.Using a
prediction function (Section 5.2) essentially improve the outer confidence
interval and therefore reduce the sample size requirements.

本节展示了因果图的结构如何揭示利用先验知识和提高我们的反事实估计的准确性的许多方法。替换重加权点(第 5.1 节)提高了内部置信区间,因此减少了解释的需要。使用预测函数(第 5.2 节)本质上改善了外部置信区间,因此降低了样本量要求。

5.1 Better
Reweighting Variables

5.1 更好地重新加权变量

Many search
result pages come without eligible ads.We then know with certainty that such
pages will have zero mainline ads, receive zero clicks, and generate zero
revenue.This is true for the randomly selected value of the reserve, and this
would have been true for any other value of the reserve.We can exploit this
knowledge by pretending that the reserve was drawn from the coun-terfactual
distribution P(q|xi ,ai) instead of the actual distribution P(q|xi ,ai).The
ratio w(ωi) is therefore forced to the unity.This does not change the estimate
but reduces the size of the inner confidence interval.The results of Figure 13
were in fact helped by this little optimization.

许多搜索结果页面没有合格的广告。然后,我们可以肯定地知道,这样的页面将没有主线广告,没有点击,也没有收入。对于随机选择的储备值来说是这样,对于储备的任何其他值来说也是这样。我们可以利用这一知识,假设储备是从实际分布 P(q|xi,ai)中提取的,而不是实际分布 P(q|xi,ai)。因此,比值 w(ωi)被强制为 1。这不会改变估计值,但会减小内部置信区间的大小。图 13 的结果实际上得益于这个小小的优化。

There are in
fact many circumstances in which the observed outcome would have been the same
for other values of the randomized variables.This prior knowledge is in fact
encoded in the structure of the causal graph and can be exploited in a more
systematic manner.For instance, we know that users make click decisions without
knowing which scores were computed by the ad placement engine, and without
knowing the prices charged to advertisers.The ad placement causal graph encodes
this knowledge by showing the clicks y as direct effects of the user intent u
and the ad slate s. This implies that the exact value of the scores q does not
matter to the clicks y as long as the ad slate s remains the same.

事实上,在许多情况下,观察到的结果对于随机变量的其他值是相同的。这种先验知识事实上被编码在因果图的结构中,并且可以以更系统的方式加以利用。例如,我们知道用户在不知道广告投放引擎计算了哪些分数,也不知道向广告商收取的价格的情况下做出点击决定。广告投放因果图通过将点击量 y 显示为用户意图 u 和广告名单 s 的直接影响来编码这一知识。这意味着只要广告名单 s 保持不变,分数 q 的确切值对点击量 y 并不重要。

Because the
causal graph has this special structure, we can simplify both the actual and
counter-factual Markov factorizations (2) (3) without eliminating the variable
y whose expectation is sought.Successively eliminating variables z, c, and q

因为因果图具有这种特殊的结构,我们可以简化实际的和反事实的马尔可夫因子分解(2) (3),而不消除期望被寻求的变量 y。连续消除变量 z、c 和 q 可得出:

P ( u , v , x
, a , b , s , y) = P ( u , v ) P ( x |u ) P ( a|x , v ) P ( b|x , v ) P ( s|x ,
a , b ) P ( y |s , u ) , P ( u , v , x , a , b , s , y) = P ( u ,v ) P ( x |u )
P ( a|x , v ) P ( b|x , v ) P ( s|x , a , b ) P ( y |s , u ) .

P ( u,v,x,a,b,s,y) = P ( u,v ) P ( x |u
) P ( a|x,v ) P ( b|x,v ) P ( s|x,a,b ) P ( y |s,u),P ( u,v,x,a,b,s,y) = P ( u,P ( x |u ) P ( a|x,v ) P ( b|x,v ) P ( s|x,a,b ) P ( y |s,u)。

conditional distributions P(s|x,a,b) and P(s|x,a,b) did not originally appear
in the Markov factorization.They are defined by marginalization as a
consequence of the elimination of the vari-able q representing the scores.

条件分布 P(s|x,a,b)和 P(s|x,a,b)最初没有出现在马尔可夫分解中。他们被边缘化定义为代表分数的变量 q 被消除的结果。

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a )

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a)

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a ) .

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a)。





Figure 14:
Estimated variations of two performance metrics in response to mainline reserve

图 14:响应主线储备的两个性能指标的估计变化

estimates were obtained using the ad slates s as reweighting variable.Compare
the inner confidence intervals with those shown in Figure 13.

变化。这些估计是使用广告板作为重新加权变量获得的。将内部置信区间与图 13 所示的置信区间进行比较。

We can
estimate the counterfactual click yield Y using these simplified

我们可以使用这些简化的因式分解来估计反事实的点击率 Y:

= Z y P ( u v
x a b s y) = Z y P ( s|x a b ) P ( s|x , a , b ) P ( u v x a b s y )

= Z y P ( u v
x a b s y) = Z y P ( s|x a b ) P ( s|x,a,b ) P ( u v x a b s y)



n n∑i=1 y i P
( s i |x i a i b i ) P ( s i | x i , a i , b i ) .

n n∑i=1 y i P
( s i |x i a i b i ) P ( s i | x i,a i,b i)。



We have
reproduced the experiments described in Section 4.6 with the counterfactual
esti-mate (16) instead of (4).For each example ωi , we determine which range [m
max i ,m min i ] of mainline reserve multipliers could have produced the observed
ad slate si , and then compute the reweighting ratio using the formula:

我们用反事实估计(16)代替(4)再现了第 4.6 节中描述的实验。对于每个示例 ωi,我们确定主线储备乘数的哪个范围[m 最大值 I,m 最小值 i ]可以产生观察到的 ad slate si,然后使用以下公式计算再加权比率:

w i =

w i =

P ( s i | x i
, a i , b i ) P ( s i | x i , a i , b i )

P ( s i | x i,a i,b i ) P ( s
i | x i,a i,b i)



Ψ ( m max i
;ρ , σ ) − Ψ ( m min i ;ρ , σ ) Ψ ( mmax i ;ρ , σ ) − Ψ ( mmin i ;ρ , σ )

ψ(m 最大值 I;ρ,σ)ψ(m min I;ρ,σ)ψ(mmax I;ρ,σ)ψ(mmin I;ρ , σ )

Ψ(m;ρ,σ) is the cumulative of the log-normal multiplier distribution.Figure 14
shows coun-terfactual estimates obtained using the same data as Figure 13.The
obvious improvement of the

其中 ψ(m;ρ,σ)是对数正态乘数分布的累积。图 14 显示了使用与图 13 相同的数据获得的交互事实估计。的明显改善

Figure 16: A
distribution on the scores q induce a distribution on the possible ad slates s.
If the observed slate is slate2, the reweighting ratio is 34/22.

图 16:分数 q 的分布导致可能的广告候选名单的分布。如果观察到的候选名单是候选名单 2,则重新加权比率是 34/22。

confidence intervals significantly extends the range of mainline reserve
multipliers for which we can compute accurate counterfactual expectations using
this same data.


Comparing (4)
and (16) makes the difference very clear: instead of computing the ratio of the
probabilities of the observed scores under the counterfactual and actual
distributions, we compute the ratio of the probabilities of the observed ad
slates under the counterfactual and actual distribu-tions.As illustrated by
Figure 15, we now distinguish the reweighting variable (or variables) from the
intervention.In general, the corresponding manipulation of the Markov
factorization consists of marginalizing out all the variables that appear on
the causal paths connecting the point of interven-tion to the reweighting
variables and factoring all the independent terms out of the integral.This
simplification works whenever the reweighting variables intercept all the
causal paths connecting the point of intervention to the measurement variable.In
order to compute the new reweighting ratios, all the factors remaining inside
the integral, that is, all the factors appearing on the causal paths connecting
the point of intervention to the reweighting variables, have to be known.

比较(4)和(16)使区别非常明显:我们不是计算在反事实和实际分布下观察到的得分的概率比,而是计算在反事实和实际分布下观察到的广告的概率比。如图 15 所示,我们现在将重新加权变量(或多个变量)与干预区分开来。一般来说,马尔可夫因子分解的相应操作包括将出现在连接介入点和重加权变量的因果路径上的所有变量边缘化,并将所有独立项从积分中分解出来。每当重新加权的变量截取连接干预点和测量变量的所有因果路径时,这种简化就起作用了。为了计算新的重新加权比率,必须知道积分中剩余的所有因子,即在连接干预点和重新加权变量的因果路径上出现的所有因子。

Figure 14
does not report the average revenue per page because the revenue z also depends
on the scores q through the click prices c. This causal path is not intercepted
by the ad slate variable s alone.However, we can introduce a new variable ˜c =
f(c, y)that filters out the click prices computed

图 14 没有报告每页的平均收入,因为收入 z 也依赖于点击率 c 的得分 q。这个因果路径不是由单独的 ad slate 变量 s 截取的。但是,我们可以引入一个新的变量\\\\\\\\\\\\\\\\\





for ads that
did not receive a click.Markedly improved revenue estimates are then obtained
by reweighting according to the joint variable (s, c˜).

对于没有点击的广告。然后,通过根据联合变量(s,c ;)重新加权,可以获得显著改善的收入估计。

Figure 16
illustrates the same approach applied to the simultaneous randomization of all
the scores q using independent log-normal multipliers.The weight w(ωi) is the
ratio of the probabilities of the observed ad slate si under the counterfactual
and actual multiplier distributions.Computing these probabilities amounts to
integrating a multivariate Gaussian distribution (Genz, 1992).Details will be
provided in a forthcoming publication.

图 16 显示了使用独立对数正态乘数同时随机化所有分数 q 的相同方法。权重 w(ωi)是在反事实和实际乘数分布下观察到的 ad slate si 的概率比。计算这些概率相当于对多变量高斯分布进行积分(Genz,1992)。细节将在即将出版的出版物中提供。

5.2 Variance
Reduction with Predictors

5.2 使用预测因子减少方差

Although we
do not know exactly how the variable of interest ℓ(ω) depends on the measurable
vari-ables and are affected by interventions on the causal graph, we may have
strong a priori knowledge about this dependency.For instance, if we augment the
slate s with an ad that usually receives a lot of clicks, we can expect an
increase of the number of clicks.


Let the
invariant variables υ be all observed variables that are not direct or indirect
effects of variables affected by the intervention under consideration.This
definition implies that the distri-bution of the invariant variables is not
affected by the intervention.Therefore the values υi of the invariant variables
sampled during the actual experiment are also representative of the
distribution of the invariant variables under the counterfactual conditions.

设不变变量 υ 是所有观察到的变量,它们不是受所考虑的干预影响的变量的直接或间接影响。这个定义意味着不变变量的分布不受干预的影响。因此,在实际实验期间采样的不变变量的值 υi 也代表了在反事实条件下不变变量的分布。

We can
leverage a priori knowledge to construct a predictor ζ(ω) of the quantity ℓ(ω)
whose counterfactual expectation Y is sought.We assume that the predictor ζ(ω)
depends only on the invariant variables or on variables that depend on the
invariant variables through known functional dependencies.Given sampled values
υi of the invariant variables, we can replay both the original and manipulated
structural equation model as explained in Section 4.1 and obtain samples ζi and
ζ i that respectively follow the actual and counterfactual distributions

我们可以利用先验知识来构造量 ℓ(ω 的预测器 ζ(ω),其反事实期望 y 被寻求。我们假设预测因子 ζ(ω)仅依赖于不变变量或者依赖于通过已知函数依赖关系的不变变量的变量。给定不变变量的采样值 υi,我们可以如第 4.1 节所述重放原始和操纵的结构方程模型,并获得分别遵循实际和反事实分布的样本 ζi 和 ζ i

regardless of the quality of the predictor,


Y = Zω ℓ ( ω
) P ( ω) =

Y = Zω ℓ ( ω
) P ( ω) =

Z ω ζ(ω)P (ω)

  • Z ω (ℓ(ω)−ζ(ω))P (ω) ≈

Z ω ζ(ω)P (ω)

  • Z ω (ℓ(ω)−ζ(ω))P (ω) ≈








ζ i + n n ∑
i=1 (ℓ(ωi)−ζi)w(ωi).

ζ i + n n ∑
i=1 (ℓ(ωi)−ζi)w(ωi).



The first
term in this sum represents the counterfactual expectation of the predictor and
can be accurately estimated by averaging the simulated counterfactual samples ζ
i without resorting to po-tentially large importance weights.The second term in
this sum represents the counterfactual ex-pectation of the residuals ℓ(ω)−ζ(ω)
and must be estimated using importance sampling.Since the magnitude of the
residuals is hopefully smaller than that of ℓ(ω), the variance of
(ℓ(ω)−ζ(ω))w(ω) is reduced and the importance sampling estimator of the second
term has improved confidence in-tervals.The more accurate the predictor ζ(ω),
the more effective this variance reduction strategy.

这个和中的第一项表示预测器的反事实期望,并且可以通过对模拟的反事实样本 ζ i 求平均而不求助于潜在的大重要性权重来精确地估计。这个和中的第二项代表残差 ℓ(ω)−ζ(ω 的反事实预测,必须使用重要性抽样来估计。由于残差的幅度希望小于 ℓ(ω 的幅度,所以(ℓ(ω)−ζ(ω))w(ω)的方差减小了,并且第二项的重要性抽样估计量提高了置信区间。预测因子 ζ(ω)越精确,这种方差减小策略就越有效。

This variance
reduction technique is in fact identical to the doubly robust contextual bandit
eval-uation technique of Dudík et al. (2012).Doubly robust variance reduction
has also been extensively used for causal inference applied to biostatistics
(see Robins et al., 2000;Bang and Robins, 2005).We subjectively find that
viewing the predictor as a component of the causal graph (Figure 17) clar-ifies
how a well designed predictor can leverage prior knowledge.For instance, in
order to estimate the counterfactual performance of the ad placement system, we
can easily use a predictor that runs the ad auction and simulate the user
clicks using a click probability model trained offline.

这种方差减少技术实际上与 Dudík 等人(2012)的双重鲁棒上下文土匪评估技术相同。双重稳健方差缩减也被广泛用于应用于生物统计学的因果推断(参见 Robins 等人,2000;Bang and Robins,2005)。我们主观地发现,将预测器视为因果图的一个组成部分(图 17)说明了一个设计良好的预测器如何利用先验知识。例如,为了估计广告投放系统的反事实性能,我们可以很容易地使用运行广告拍卖的预测器,并使用离线训练的点击概率模型来模拟用户点击。





Figure 17:
Leveraging a predictor.Yellow nodes represent known functional relations in the

图 17:利用预测器。黄色节点代表结构中已知的功能关系

equation model.We can estimate the counterfactual expectation Y of the number
of clicks per page as the sum of the counterfactual expectations of a predictor
ζ, which is easy to estimate by replaying empirical data, and y−ζ, which has to
be estimated by importance sampling but has reduced variance.

自然方程模型。我们可以将每页点击次数的反事实期望 Y 估计为预测者 ζ 和 Yζ 的反事实期望之和,预测者 ζ 很容易通过重放经验数据来估计,Yζ 必须通过重要性抽样来估计,但方差减小。

Figure 18:
The two plots show the hourly click yield for two variants of the ad placement
engine.The daily variations dwarf the differences between the two treatments.

图 18:两个图显示了广告投放引擎的两个变体的每小时点击量。每天的变化使两种治疗之间的差异相形见绌。





5.3 Invariant

5.3 不变预测因子

In order to
evaluate which of two interventions is most likely to improve the system, the
designer of a learning system often seeks to estimate a counterfactual
difference, that is, the difference Y + −Y of the expectations of a same
quantity ℓ(ω) under two different counterfactual distributions P+(ω) and
P(ω).These expectations are often affected by variables whose value is left
unchanged by the interventions under consideration.For instance, seasonal
effects can have very large effects on the number of ad clicks (Figure 18) but
affect Y + and Y in similar ways.

为了评估两种干预中的哪一种最有可能改善系统,学习系统的设计者经常寻求估计一个反事实差异,即在两个不同的反事实分布 P+(ω)和 P(ω)下,相同数量 ℓ(ω 的期望的差异 y+y。这些期望经常受到变量的影响,这些变量的值在所考虑的干预中保持不变。例如,季节效应会对广告点击量产生非常大的影响(图 18),但会以类似的方式影响 Y + 和 Y。

better confidence intervals on the difference Y + −Y can be obtained using an
invariant predictor, that is, a predictor function that depends only on
invariant variables υ such as the time of the day.Since the invariant predictor
ζ(υ) is not affected by the interventions under consideration,

使用不变预测器,即仅依赖于不变变量 υ(例如一天中的时间)的预测器函数,可以获得关于差 Y+Y 的实质上更好的置信区间。由于不变预测因子 ζ(υ)不受所考虑的干预的影响,

Z ω ζ(υ)P (ω)
= Z ω ζ(υ)P +(ω).

Z ω ζ(υ)P (ω)
= Z ω ζ(υ)P +(ω)。





Y + −Y


= Zω ζ ( υ )
P + ( ω) + Zω ( ℓ ( ω ) − ζ ( υ)) P + ( ω )


− Z ω ζ(υ)P
(ω)− Z ω (ℓ(ω)−ζ(υ))P (ω) ≈









ℓ(ωi)−ζ(υi) P
+(ωi)−P (ωi) P(ωi) .


This direct
estimate of the counterfactual difference Y + −Y benefits from the same
variance reduc-tion effect as (17) without need to estimate the expectations
(18).Appendix C provide details on the computation of confidence intervals for
estimators of the counterfactual differences.Appendix D shows how the same
approach can be used to compute counterfactual derivatives that describe the
response of the system to very small interventions.

这种对反事实差异 Y+Y 的直接估计受益于与(17)相同的方差减小效应,而不需要估计期望值(18)。附录 C 提供了反事实差异估计量的置信区间计算的细节。附录 D 显示了如何使用相同的方法来计算描述系统对非常小的干预的反应的反事实导数。



The previous
sections deal with the identification and the measurement of interpretable
signals that can justify the actions of human decision makers.These same signals
can also justify the actions of machine learning algorithms.This section
explains why optimizing a counterfactual estimate is a sound learning


6.1 A
Learning Principle

6.1 学习原则

We consider
in this section interventions that depend on a parameter θ.For instance, we
might want to know what the performance of the ad placement engine would have
been if we had used different values for the parameter θ of the click scoring
model.Let Pθ (ω) denote the counterfactual Markov factorization associated with
this intervention.Let Y θ be the counterfactual expectation of ℓ(ω) under
distribution Pθ .Figure 19 illustrates our simple learning setup.Training data
is collected from a single experiment associated with an initial parameter
value θ chosen using prior knowledge acquired in an unspecified manner.A
preferred parameter value θ is then determined using the training data and
loaded into the system.The goal is of course to observe a good performance on
data collected during a test period that takes place after the switching point.

在本节中,我们考虑依赖于参数 θ 的干预。例如,我们可能想知道如果我们对点击评分模型的参数 θ 使用不同的值,广告投放引擎的性能会如何。让 Pθ (ω)表示与该干预相关的反事实马尔可夫因子分解。设 Y θ 是分布 Pθ 下 ℓ(ω 的反事实期望。图 19 展示了我们简单的学习设置。训练数据是从与初始参数值 θ 相关联的单个实验中收集的,该初始参数值 θ 是使用以未指定方式获得的先验知识选择的。然后使用训练数据确定优选的参数值 θ,并将其加载到系统中。目标当然是观察在切换点之后的测试期间收集的数据的良好性能。



Figure 19:
Single design - A preferred parameter value θ is determined using randomized
data collected in the past.Test data is collected after loading θ into the

图 19:单一设计-使用过去收集的随机数据确定首选参数值 θ。测试数据在 θ 载入系统后收集。

The isolation
assumption introduced in Section 3.2 states that the exogenous variables are
drawn from an unknown but fixed joint probability distribution.This
distribution induces a joint distribu-tion P(ω) on all the variables ω
appearing in the structural equation model associated with the parameter
θ.Therefore, if the isolation assumption remains valid during the test period,
the test data follows the same distribution Pθ (ω) that would have been
observed during the training data collection period if the system had been
using parameter θ all along.

第 3.2 节中引入的隔离假设指出,外生变量来自未知但固定的联合概率分布。这种分布在与参数 θ 相关的结构方程模型中出现的所有变量 ω 上引起联合分布 P(ω)。因此,如果隔离假设在测试期间保持有效,测试数据将遵循相同的分布 Pθ (ω),如果系统一直使用参数 θ,则在训练数据收集期间将会观察到该分布。

We can
therefore formulate this problem as the optimization of the expectation Y θ of
the reward ℓ(ω) with respect to the distribution Pθ (ω)

因此,我们可以将这个问题公式化为报酬 ℓ(ω 的期望 Y θ 相对于分布 Pθ (ω)的最优化

maxθ Y θ = Zω
ℓ ( ω ) P θ ( ω )

maxθ Y θ = Zω
ℓ ( ω ) P θ ( ω)

on the basis
of a finite set of training examples ω1,...,ωn sampled from P(ω).

基于有限组训练示例 ω1,...,ωn 从 P(ω)采样。

However, it
would be unwise to maximize the estimates obtained using approximation (5)
be-cause they could reach a maximum for a value of θ that is poorly explored by
the actual distribution.As explained in Section 4.5, the gap between the upper
and lower bound of inequality (14) reveals the uncertainty associated with
insufficient exploration.Maximizing an empirical estimate Ybθ of the lower
bound Y θ ensures that the optimization algorithm finds a trustworthy answer

然而,最大化使用近似值(5)获得的估计值是不明智的,因为它们可能达到 θ 值的最大值,而实际分布对此了解甚少。如第 4.5 节所述,不等式(14)的上限和下限之间的差距揭示了与探索不足相关的不确定性。最大化下界 Y θ 的经验估计 Ybθ 确保了优化算法找到可信的答案

θ = argmax θ
Yb θ .

θ = argmax θ
Yb θ。

We shall now
discuss the statistical basis of this learning principle.


6.2 Uniform
Confidence Intervals

6.2 统一置信区间

As discussed
in Section 4.4, inequality (14),

如第 4.4 节不等式(14)所述,

Y θ ≤ Y θ ≤ Y
θ + M ( 1 − W θ ) ,




Y θ = Zω ℓ (
ω ) w ( ω ) P ( ω ) ≈ Yb θ = n n∑i=1 ℓ ( ω i ) w ( ω i ) W θ = Zω w ( ω ) P ( ω
) ≈ Wbθ = n n∑i=1 w ( ω i )

zωℓ(ω)w(ω)p(ω)≈Ybθ= n n∑I = 1ℓ(ωI)w(ωI)wθ= zωw(ω)p(ω)≈WBθ= n∑I = 1w(ωI)



9.The idea of
maximizing the lower bound may surprise readers familiar with the UCB algorithm
for multi-armed bandits (Auer et al., 2002).UCB performs exploration by
maximizing the upper confidence interval bound and updating the confidence
intervals online.Exploration in our setup results from the active system
randomization during the offline data collection.See also Section 6.4.

9.下限最大化的想法可能会让熟悉多武装匪徒的 UCB 算法的读者感到惊讶(奥尔等人,2002)。UCB 通过最大化置信区间上限和在线更新置信区间来执行探索。我们设置中的探索源于离线数据收集期间的主动系统随机化。另请参见第 6.4 节。





leads to
confidence intervals (15) of the form


∀ δ > , ∀
θ P n Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( − Wbθ + ξ R) + ε R o ≥ − δ .

∀δ>∀θp n
Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr o≥δ。



Both εR and
ξR converge to zero in inverse proportion to the square root of the sample size
n. They also increase at most linearly in logδ and depend on both the capping
bound R and the parameter θ through the empirical variances (see appendix B.)

εR 和 ξR 都以与样本量 n 的平方根成反比的方式收敛到零。它们也最多以对数 δ 线性增加,并且通过经验方差同时依赖于上限 R 和参数 θ(见附录 b)

confidence intervals are insufficient to provide guarantees for a parameter
value θ that depends on the sample.In fact, the optimization (19) procedure is
likely to select values of θ for which the inequality is violated.We therefore
seek uniform confidence intervals (Vapnik and Chervonenkis, 1968), simultaneously
valid for all values of θ.

这种置信区间不足以保证依赖于样本的参数值 θ。事实上,优化(19)过程可能会选择违反不等式的 θ 值。因此,我们寻求对 θ 的所有值同时有效的统一置信区间(Vapnik 和 Chervonenkis,1968)。

• When the
parameter θ is chosen from a finite set F , applying the union bound to the

当参数 θ 从有限集合 F 中选择时,将并界应用于普通

(20) immediately gives the uniform confidence interval :


P n ∀ θ ∈ F ,
Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( −Wbθ + ξ R)+ ε R o ≥ −| F |δ .

P n ∀ θ ∈ F,Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr
o≥| f |δ。

• Following
the pioneering work of Vapnik and Chervonenkis, a broad choice of mathematical


tools have
been developed to construct uniform confidence intervals when the set F is

当集合 F 为无穷大时,已经开发了构造一致置信区间的工具。

For instance,
appendix E leverages uniform empirical Bernstein bounds (Maurer and Pontil,
2009) and obtains the uniform confidence interval

例如,附录 E 利用统一的经验伯恩斯坦界限(Maurer 和 Pontil,2009)并获得统一的置信区间

P n ∀ θ ∈ F ,
Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( −Wbθ + ξ R)+ ε R o ≥ − M ( n ) δ ,

P n ∀ θ ∈ F,Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr

where the
growth function M (n) measures the capacity of the family of functions

其中增长函数 M (n)测量函数族的容量

{ fθ : ω 7→
ℓ(ω)w (ω) , gθ : ω 7→ w (ω) , ∀θ ∈ F } .

{ fθ : ω 7→
ℓ(ω)w (ω),gθ : ω 7→ w (ω),∀θ ∈ F }。



practical choices of P(ω) lead to functions M (n) that grow polynomially with
the sample size.Because both εR and ξR are O(n −/logδ), they converge to zero
with the sample size when one maintains the confidence level 1−M (n)δ equal to
a predefined constant.

P(ω)的许多实际选择导致函数 M
(n)随样本量多项式增长。因为 εR 和 ξR 都是 0(n/logδ),当一个人保持置信水平 1M(n)δ 等于预定常数时,它们随着样本量收敛到零。

interpretation of the inner and outer confidence intervals (Section 4.5) also
applies to the uniform confidence interval (21).When the sample size is
sufficiently large and the capping bound R chosen appropriately, the inner
confidence interval reflects the upper and lower bound of inequal-ity (14).

内部和外部置信区间的解释(第 4.5 节)也适用于统一置信区间(21)。当样本量足够大并且适当选择上限时,内部置信区间反映了不等性(14)的上限和下限。

The uniform
confidence interval therefore ensures that Y θ is close to the maximum of the
lower bound of inequality (14) which essentially represents the best
performance that can be guaranteed using training data sampled from
P(ω).Meanwhile, the upper bound of this same inequality reveals which values of
θ could potentially offer better performance but have been insufficiently
probed by the sampling distribution (Figure 20.)

因此,均匀置信区间确保了 Y θ 接近不等式(14)的下限的最大值,这基本上代表了使用从 P(ω)采样的训练数据可以保证的最佳性能。与此同时,这个不等式的上限揭示了哪些 θ 值可能提供更好的性能,但采样分布没有充分探究(图 20。)

6.3 Tuning Ad
Placement Auctions

6.3 调整广告投放拍卖

We now
present an application of this learning principle to the optimization of
auction tuning pa-rameters in the ad placement engine.Despite increasingly
challenging engineering difficulties,






Figure 20:
The uniform inner confidence interval reveals where the best guaranteed Y θ is
reached and where additional exploration is needed.

图 20:统一的内部置信区间揭示了最佳保证 Y θ 达到的位置和需要额外探索的位置。

Figure 21: Level
curves associated with the average number of mainline ads per page (red curves

图 21:与平均每页主线广告数量相关的水平曲线(红色曲线

labelled from
−6% to +10%) and the average estimated advertisement value generated per page
(black curves, labelled with arbitrary units ranging from 164 to 169) that
would have been observed for a certain query cluster if we had changed the
mainline reserves by the multiplicative factor shown on the horizontal axis,
and if we had applied a squash-ing exponent α shown on the vertical axis to the
estimated click probabilities qi,p(x).

标记为从-6% 到 +10%)和每页生成的平均估计广告值(黑色曲线,标记为范围从 164 到 169 的任意单位),如果我们通过横轴上显示的乘法因子改变主线储备,并且如果我们将纵轴上显示的挤压指数 α 应用于估计的点击概率 qi,p(x ),将会观察到特定查询聚类。

optimization procedures can obviously be applied to larger numbers of tunable param-eters.


Lahaie and
McAfee (2011) propose to account for the uncertainty of the click probability
esti-mation by introducing a squashing exponent α to control the impact of the
estimated probabilities on the rank scores.Using the notations introduced in
Section 2.1, and assuming that the estimated probability of a click on ad i
placed at position p after query x has the form qip(x) = γp βi(x) (see

Lahaie 和 McAfee
(2011)提出通过引入挤压指数 α 来控制估计概率对排名分数的影响,从而解决点击概率估计的不确定性。使用第 2.1 节中介绍的符号,并假设在查询 x 之后点击放置在位置 p 的广告 I 的估计概率具有 qip(x) = γp βi(x)的形式(参见





appendix A),
they redefine the rank-score rip(x) as:

附录 A),他们将等级分数 rip(x)重新定义为:

rip ( x) = γp
b i β i ( x ) α .

rip ( x) = γp
b i β i ( x ) α。

Using a
squashing exponent α < 1 reduces the contribution of the estimated
probabilities and in-creases the reliance on the bids bi placed by the

使用挤压指数 α < 1 降低了估计概率的贡献,并增加了对广告商所提出的出价 bi 的依赖。

Because the
squashing exponent changes the rank-score scale, it is necessary to simultaneously
adjust the reserves in order to display comparable number of ads.In order to
estimate the counter-factual performance of the system under interventions
affecting both the squashing exponent and the mainline reserves, we have
collected data using a random squashing exponent following a normal
distribution, and a mainline reserve multiplier following a log-normal
distribution as described in Section 4.6.Samples describing 12 million search
result pages were collected during four consecu-tive weeks.

因为挤压指数改变了排名-评分等级,所以有必要同时调整储备,以便显示可比数量的广告。为了估计在影响挤压指数和主线储量的干预下系统的反事实性能,我们使用正态分布的随机挤压指数和对数正态分布的主线储量乘数收集数据,如第 4.6 节所述。描述 1200 万个搜索结果页面的样本是在连续四周内收集的。

Charles and Chickering (2012), we consider separate squashing coefficients αk
and mainline reserve multipliers ρk per query cluster k ∈ {1..K}, and, in order
to avoid negative user or advertiser reactions, we seek the auction tuning
parameters αk and ρk that maximize an estimate of the advertisement
valuesubject to a global constraint on the average number of ads displayed in
the mainline.Because maximizing the advertisement value instead of the
publisher revenue amounts to maximizing the size of the advertisement pie
instead of the publisher slice of the pie, this criterion is less likely to
simply raise the prices without improving the ads.Meanwhile the constraint
ensures that users are not exposed to excessive numbers of mainline ads.

在 Charles 和 Chickering (2012)之后,我们考虑了每个查询聚类 k∑{ 1 }的独立压缩系数 αk 和主线保留乘数 ρk..K},并且,为了避免负面的用户或广告商反应,我们寻求拍卖调整参数 αk 和 ρk,其最大化广告价值的估计,服从主线中显示的广告的平均数量的全局约束。因为最大化广告价值而不是发布者收入等于最大化广告饼的大小而不是饼的发布者份额,所以该标准不太可能简单地提高价格而不改进广告。同时,该约束确保用户不会接触到过多的主线广告。

We then use
the collected data to estimate bounds on the counterfactual expectations of the
advertiser value and the counterfactual expectation of the number of mainline
ads per page.Fig-ure 21 shows the corresponding level curves for a particular
query cluster.We can then run a simple optimization algorithm and determine the
optimal auction tuning parameters for each cluster sub-ject to the global mainline
footprint constraint.Appendix D describes how to estimate off-policy
counterfactual derivatives that greatly help the numerical optimization.

然后,我们使用收集的数据来估计广告商价值的反事实预期和每页主线广告数量的反事实预期的界限。图 21 显示了特定查询集群的相应级别曲线。然后,我们可以运行一个简单的优化算法,并为每个服从全局主线足迹约束的集群确定最优拍卖调整参数。附录 D 描述了如何估计非常有助于数值优化的政策外反事实导数。

The obvious
alternative (see Charles and Chickering, 2012) consists of replaying the
auctions with different parameters and simulating the user using a click
probability model.However, it may be unwise to rely on a click probability
model to estimate the best value of a squashing coefficient that is expected to
compensate for the uncertainty of the click prediction model itself.The
coun-terfactual approach described here avoids the problem because it does not
rely on a click prediction model to simulate users.Instead it estimates the
counterfactual performance of the system using the actual behavior of the users
collected under moderate randomization.


Sequential Design

6.4 顺序设计

intervals computed after a first randomized data collection experiment might
not offer sufficient accuracy to choose a final value of the parameter θ.It is
generally unwise to simply col-lect additional samples using the same
experimental setup because the current data already reveals information (Figure
20) that can be used to design a better data collection experiment.Therefore,
it seems natural to extend the learning principle discussed in Section 6.1 to a
sequence of data col-lection experiments.The parameter θt characterizing the
t-th experiment is then determined using samples collected during the previous
experiments (Figure 22).

第一次随机数据收集实验后计算的置信区间可能无法提供足够的精度来选择参数 θ 的最终值。使用相同的实验设置简单地选择额外的样本通常是不明智的,因为当前的数据已经揭示了可以用来设计更好的数据收集实验的信息(图 20)。因此,将第 6.1 节中讨论的学习原则扩展到一系列数据收集实验似乎是很自然的。表征第 t 次实验的参数 θt 随后使用之前实验中收集的样本来确定(图 22)。

10.The value
of an ad click from the point of view of the advertiser.The advertiser payment
then splits the advertisement value between the publisher and the advertiser.




Figure 22:
Sequential design - The parameter θt of each data collection experiment is determined
using data collected during the previous experiments.

图 22:顺序设计-每个数据收集实验的参数 θt 是使用之前实验中收集的数据确定的。

Although it
is relatively easy to construct convergent sequential design algorithms,
reaching the optimal learning performance is notoriously difficult (Wald, 1945)
because the selection of param-eter θt involves a trade-off between
exploitation, that is, the maximization of the immediate reward Y θt , and
exploration, that is, the collection of samples potentially leading to better Y
θ in the more distant future.

尽管构造收敛的序列设计算法相对容易,但达到最优学习性能是出了名的困难(Wald,1945),因为参数 θt 的选择涉及开发(即即时回报 Y θt 的最大化)和探索(即样本的收集,潜在地在更遥远的未来导致更好的 Y θ)之间的权衡。

The optimal
exploration exploitation trade-off for multi-armed bandits is well understood
(Git-tins, 1989;Auer et al., 2002;Audibert et al., 2007) because an essential
property of multi-armed bandits makes the analysis much simpler: the outcome
observed after performing a particular action brings no information about the
value of other actions.Such an assumption is both unrealistic and pessimistic.For
instance, the outcome observed after displaying a certain ad in response to a
certain query brings very useful information about the value of displaying
similar ads on similar queries.

多武装匪徒的最优勘探开发权衡是众所周知的(吉特-廷斯,1989;Auer 等人,2002 年;Audibert 等人,2007),因为多武装匪徒的一个基本属性使分析变得简单得多:在执行特定行动后观察到的结果不会带来关于其他行动价值的信息。这样的假设既不现实又悲观。例如,在响应某个查询显示某个广告之后观察到的结果带来了关于在类似查询上显示类似广告的价值的非常有用的信息。

contextual bandit approaches (Slivkins, 2011) account for similarities in the
context and action spaces but do not take advantage of all the additional
opportunities expressed by structural equation models.For instance, in the
contextual bandit formulation of the ad placement problem outlined in Section
3.5, actions are pairs (s, c) describing the ad slate s and the corresponding
click prices c, policies select actions by combining individual ad scores in
very specific ways, and actions determine the rewards through very specific

精炼的上下文强盗方法(Slivkins,2011)考虑了上下文和动作空间中的相似性,但没有利用结构方程模型表达的所有额外机会。例如,在第 3.5 节概述的广告投放问题的上下文土匪公式中,动作是描述广告名单和相应点击价格的对,策略通过以非常具体的方式组合单个广告分数来选择动作,动作通过非常具体的机制来确定奖励。

despite their suboptimal asymptotic properties, heuristic exploration
strategies per-form surprisingly well during the time span in which the problem
can be considered stationary.Even in the simple case of multi-armed bandits,
excellent empirical results have been obtained us-ing Thompson sampling
(Chapelle and Li, 2011) or fixed strategies (Vermorel and Mohri, 2005;Kuleshov
and Precup, 2010).Leveraging the problem structure seems more important in
practice than perfecting an otherwise sound exploration strategy.

与此同时,尽管启发式探索策略具有次优的渐近性质,但在问题可以被视为平稳的时间跨度内,它们表现得出奇地好。即使在多武装匪徒的简单情况下,使用汤普森抽样(Chapelle 和李,2011)或固定策略(Vermorel 和 Mohri,2005)也获得了出色的经验结果;库里肖夫和塞切普,2010 年)。在实践中,利用问题结构似乎比完善一个合理的探索策略更重要。

Therefore, in
the absence of sufficient theoretical guidance, it is both expedient and
practical to maximizing Ybθ at each round, as described in Section 6.1, subject
to additional ad-hoc constraints ensuring a minimum level of exploration.

因此,在缺乏足够的理论指导的情况下,如第 6.1 节所述,在每一轮最大化 Ybθ 既方便又实用,但要受到额外的特殊限制,以确保最低水平的勘探。



All the
methods discussed in this contribution rely on the isolation assumption
presented in Sec-tion 3.2.This assumption lets us interpret the samples as
repeated independent trials that follow the pattern defined by the structural
equation model and are amenable to statistical analysis.

本文中讨论的所有方法都依赖于第 3.2 节中提出的隔离假设。这一假设让我们将样本解释为重复的独立试验,这些试验遵循结构方程模型定义的模式,并且易于进行统计分析。

The isolation
assumption is in fact a component of the counterfactual conditions under
inves-tigation.For instance, in Section 4.6, we model single auctions (Figure
3) in order to empirically

隔离假设实际上是研究中反事实条件的一个组成部分。例如,在第 4.6 节中,我们对单一拍卖进行建模(图 3),以便根据经验





determine how
the ad placement system would have performed if we had changed the mainline
reserves without incurring a reaction from the users or the advertisers.


Since the
future publisher revenues depend on the continued satisfaction of users and
advertisers, lifting this restriction is highly desirable.


• We can in
principle work with larger structural equation models.For instance, Figure 4
suggests to thread single auction models with additional causal links
representing the impact of the displayed ads on the future user
goodwill.However, there are practical limits on the number of trials we can
consider at once.For instance, it is relatively easy to simultaneously model
all the auctions associated with the Web pages served to the same user during a
thirty minute Web session.On the other hand, it is practically impossible to
consider several weeks worth of auctions in order to model their accumulated
effect on the continued satisfaction of users and advertisers.

原则上,我们可以使用更大的结构方程模型。例如,图 4 建议将单个拍卖模型与表示显示的广告对未来用户好感度的影响的额外因果联系联系起来。然而,我们可以同时考虑的试验数量有实际限制。例如,在 30 分钟的网络会话期间,相对容易同时模拟与服务于同一用户的网页相关联的所有拍卖。另一方面,实际上不可能考虑几周的拍卖来模拟它们对用户和广告商持续满意度的累积影响。

• We can
sometimes use problem-specific knowledge to construct alternate performance
met-rics that anticipate the future effects of the feedback loops.For instance,
in Section 6.3, we optimize the advertisement value instead of the publisher
revenue.Since this alternative cri-terion takes the advertiser interests into
account, it can be viewed as a heuristic proxy for the future revenues of the

我们有时可以利用特定问题的知识来构建替代绩效指标,预测反馈循环的未来影响。例如,在第 6.3 节中,我们优化了广告价值,而不是出版商的收入。由于这种替代标准考虑到了广告客户的利益,它可以被看作是出版商未来收入的一种启发式代理。

This section
proposes an alternative way to account for such feedback loops using the
quasistatic equilibrium method familiar to physicists: we assume that the
publisher changes the parameter θ so slowly that the system remains at
equilibrium at all times.Using data collected while the system was at equilibrium,
we describe empirical methods to determine how an infinitesimal intervention dθ
on the model parameters would have displaced the equilibrium:

本节提出了一种替代方法,利用物理学家熟悉的准静态平衡方法来解释这种反馈回路:我们假设发布者缓慢地改变参数 θ,使得系统始终保持平衡。使用在系统处于平衡状态时收集的数据,我们描述了经验方法,以确定模型参数上的无限小干预 dθ 将如何取代平衡:

would the system have performed during the data collection period if a small
change dθ had been applied to the model parameter θ and the equilibrium had
been reached before the data collection period."

“如果对模型参数 θ 进行较小的改变,并且在数据收集期间之前达到平衡,那么系统在数据收集期间的表现会如何。”

A learning
algorithm can then update θ to improve selected performance metrics.

然后,学习算法可以更新 θ,以改进选定的性能指标。

7.1 Rational

7.1 理性广告商

The ad
placement system is an example of game where each actor furthers his or her
interests by controlling some aspects of the system: the publisher controls the
placement engine parameters, the advertisers control the bids, and the users
control the clicks.


As an example
of the general quasi-static approach, this section focuses on the reaction of
ratio-nal advertisers to small changes of the scoring functions driving the ad
placement system.Rational advertisers always select bids that maximize their
economic interests.Although there are more re-alistic ways to model
advertisers, this exercise is interesting because the auction theory approaches
also rely on the rational advertiser assumption (see Section 2.1).This analysis
seamlessly integrates the auction theory and machine learning perspectives.

作为一般准静态方法的一个例子,这一部分着重于比率广告客户对驱动广告投放系统的评分函数的微小变化的反应。理性的广告商总是选择最大化其经济利益的出价。虽然有更多真实的方式来模拟广告商,这个练习是有趣的,因为拍卖理论方法也依赖于理性的广告商假设(见第 2.1 节)。这种分析无缝地结合了拍卖理论和机器学习的观点。

illustrated in Figure 23, we treat the bid vector b⋆ = (b1 ...bA) ∈ [0,bmax] A
as the parameter of the conditional distribution Pb⋆ (b|x, v) of the bids
associated with the eligible ads.The vari-

如图 23 所示,我们处理出价向量 b⋆ = (b1...ba)∑0,bmax] A 作为与合格广告相关的投标的条件分布 Pb⋆ (b|x,v)的参数。vari-

measured when a feedback causal system reaches equilibrium often display
conditional independence patterns that cannot be represented with directed
acyclic graphs (Lauritzen and Richardson, 2002;Dash, 2003).Treating the
feedback loop as parameters instead of variables works around this difficulty
in a manner that appears sufficient to perform the quasi-static analysis.

11.当反馈因果系统达到平衡时测量的量通常显示条件独立模式,不能用有向无环图表示(Lauritzen 和 Richardson,2002;Dash,2003)。将反馈回路视为参数而非变量,以一种似乎足以执行准静态分析的方式解决了这一难题。





Figure 23:
Advertisers select the bid amounts ba on the basis of the past number of clicks
ya and the past prices za observed for the corresponding ads.

图 23:广告商根据过去的点击次数 ya 和相应广告的过去价格 za 选择出价金额 ba。

Figure 24:
Advertisers control the expected number of clicks Ya and expected prices Za by

图 24:广告商通过调整来控制预期点击量 Ya 和预期价格 Za

their bids
ba.Rational advertisers select bids that maximize the difference between the
value they see in the clicks and the price they pay.


ables ya in
the structural equation model represents the number of clicks received by ads
associated with bid ba.The variables za represents the amount charged for these
clicks to the corresponding advertiser.The advertisers select their bids ba
according to their anticipated impact on the number of resulting clicks ya and
on their cost za.

结构方程模型中的 ables ya 表示与 bid ba 关联的广告接收的点击量。变量 za 表示对相应广告商的这些点击收取的费用。广告商根据他们对最终点击量 ya 和他们的成本 za 的预期影响来选择他们的出价 ba。

Following the
pattern of the perfect information assumption (see Section 2.1), we assume that
the advertisers eventually acquire full knowledge of the expectations

遵循完美信息假设的模式(见第 2.1 节),我们假设广告商最终获得了预期的全部知识

Ya ( θ , b⋆)
= Zω y a P θ b⋆ ( ω )

Ya ( θ,b⋆) = Zω y a
P θ b⋆ ( ω)

and Za ( θ ,
b⋆) = Zω z a P θ b⋆ ( ω ) .

和 Za ( θ,b⋆) = Zω z a P θ b⋆ ( ω)。

Let Va denote
the value of a click for the corresponding advertiser.Rational advertiser seek
to maximize the difference between the value they see in the clicks and the
price they pay to the publisher, as illustrated in Figure 24.This is expressed
by the utility functions

让 Va 表示相应广告商的点击值。理性的广告商寻求最大化他们在点击中看到的价值和他们支付给出版商的价格之间的差异,如图 24 所示。这由效用函数来表示

U θ a (b⋆) =
VaYa(θ,b⋆)−Za(θ,b⋆) .

U θ a (b⋆) =





Athey and Nekipelov (2010), we argue that the injection of smooth random noise
into the auction mechanism changes the discrete problem into a continuous
problem amenable to standard differential methods.Mild regularity assumption on
the densities probability Pb⋆ (b|x, v) and Pθ (q|x,a) are in fact sufficient to
ensure that the expectations Ya(θ,b⋆) and Za(θ,b⋆) are con-tinuously
differentiable functions of the distribution parameters b⋆ and θ.Further
assuming that utility functions U θ a (b⋆) are diagonally quasiconcave, Athey
and Nekipelov establish the existence of a unique Nash equilibrium

在 Athey 和 Nekipelov (2010)之后,我们认为向拍卖机制中注入平滑随机噪声会将离散问题转化为可服从标准微分方法的连续问题。密度概率 Pb⋆ (b|x,v)和 Pθ (q|x,a)的温和正则性假设实际上足以保证期望 Ya(θ,b⋆)和 Za(θ,b⋆)是分布参数 b⋆ 和 θ 的连续可微函数。进一步假设效用函数 U θ a (b⋆)是对角拟凹的,Athey 和 Nekipelov 建立了唯一 Nash 均衡的存在性

∀a ba ∈

∀a ba ∈



U θ a

U θ a (b1,...、ba 1、b、ba+1、...,bA)

by its first order Karush-Kuhn-Tucker conditions


a Va


∂Ya ∂ba

∂Ya ∂ba

∂Za ∂ba

∂Za ∂ba

   ≤ 0 if
ba = ≥ 0 if ba = bmax= 0 if 0 < ba < bmax.

如果 ba = ≥ 0,则   ≤ 0 如果 ba = bmax= 0,则 0 < ba < bmax。



We use the
first order equilibrium conditions (22) for two related purposes.Section 7.2
explains how to complete the advertiser model by estimating the values Va.
Section 7.3 estimates how the equilibrium bids and the system performance
metrics respond to a small change dθ of the model parameters.

我们将一阶平衡条件(22)用于两个相关的目的。第 7.2 节解释了如何通过估计值 Va 来完成广告商模型。第 7.3 节估计了均衡出价和系统性能指标如何响应模型参数的微小变化 dθ。

this approach remains sensible when key assumptions of the equilibrium model
are violated.The perfect information assumption is unlikely to hold in
practice.The quasi-concavity of the utility functions is merely
plausible.However, after observing the operation of the stationary ad placement
system for a sufficiently long time, it is reasonable to assume that the most
active advertisers have tried small bid variations and have chosen locally
optimal ones.Less active adver-tisers may leave their bids unchanged for longer
time periods, but can also update them brutally if they experience a
significant change in return on investment.Therefore it makes sense to use data
collected when the system is stationary to estimate advertiser values Va that
are consistent with the first order equilibrium conditions.We then hope to
maintain the conditions that each advertisers had found sufficiently
attractive, by first estimating how a small change dθ displaces this posited
local equilibrium, then by using performance metrics that take this
displacement into account.

有趣的是,当平衡模型的关键假设被违反时,这种方法仍然是明智的。完美的信息假设在实践中不太可能成立。效用函数的拟凹性仅仅是似是而非的。然而,在观察固定广告投放系统的操作足够长的时间之后,可以合理地假设最活跃的广告商已经尝试了小的出价变化,并且已经选择了局部最优的出价变化。不太积极的反对者可能会在更长的时间内保持出价不变,但如果他们经历了投资回报的重大变化,也可能会残酷地更新出价。因此,使用系统静止时收集的数据来估计符合一阶平衡条件的广告商价值 Va 是有意义的。然后,我们希望通过首先估计一个小的变化 dθ 如何取代这个假定的局部平衡,然后使用考虑到这个位移的性能指标,来维持每个广告商已经发现足够有吸引力的条件。

Estimating Advertiser Values

7.2 估计广告客户价值

We first need
to estimate the partial derivatives appearing in the equilibrium condition
(22).These derivatives measure how the expectations Ya and Za would have been
changed if each advertiser had placed a slightly different bid ba.Such
quantities can be estimated by randomizing the bids and computing on-policy
counterfactual derivatives as explained in appendix D. Confidence intervals can
be derived with the usual tools.

我们首先需要估计平衡条件(22)中出现的偏导数。这些衍生产品衡量的是,如果每个广告客户提出稍微不同的出价 ba,预期 Ya 和 Za 会发生怎样的变化。如附录 d 所述,这些数量可以通过随机投标和计算政策反事实导数来估计。置信区间可以用常用工具推导出来。

the publisher is not allowed to directly randomize the bids because the
advertisers expect to pay prices computed using the bid they have specified and
not the potentially higher bids resulting from the randomization.However, the
publisher has full control on the estimated click probabilities qi,p(x).Since
the rank-scores ri,p(x) are the products of the bids and the estimated click
probabilities (see Section 2.1), a random multiplier applied to the bids can
also be interpreted as a random multiplier applied to the estimated click
probabilities.Under these two interpretations, the same ads are shown to the
users, but different click prices are charged to the advertisers.Therefore,

不幸的是,出版商不被允许直接随机化出价,因为广告商期望支付使用他们指定的出价计算的价格,而不是随机化导致的潜在更高的出价。然而,出版商完全控制估计的点击概率 qi,p(x)。由于等级分数 ri、p(x)是出价和估计点击概率的乘积(见第 2.1 节),所以应用于出价的随机乘数也可以解释为应用于估计点击概率的随机乘数。在这两种解释下,相同的广告被展示给用户,但是不同的点击价格被收取给广告商。因此,





the publisher
can simultaneously charge prices computed as if the multiplier had been applied
to the estimated click probabilities, and collect data as if the multiplier had
been applied to the bid.This data can then be used to estimate the derivatives.


Solving the
first order equilibrium equations then yields estimated advertiser values Va
that are consistent with the observed data.

然后,求解一阶平衡方程产生与观察数据一致的估计广告商值 Va。

Va ≈

Va ≈



∂ba .∂Za ∂ba

∂ba。∂Za ∂ba

There are
however a couple caveats:


• The
advertiser bid ba may be too small to cause ads to be displayed.In the absence
of data, we have no means to estimate a click value for these advertisers.

广告商出价 ba 可能太小,无法显示广告。在没有数据的情况下,我们无法估计这些广告商的点击量。

• Many ads
are not displayed often enough to obtain accurate estimates of the partial




∂ba and ∂Za
∂ba .This can be partially remediated by smartly aggregating the data of
advertisers deemed similar.

∂ba 和 ∂Za ∂ba。这可以通过智能地聚合被认为相似的广告商的数据来部分补救。

• Some
advertisers attempt to capture all the available ad opportunities by placing
extremely high bids and hoping to pay reasonable prices thanks to the
generalized second price rule.Both partial derivatives ∂Ya ∂ba and ∂Za ∂ba are
equal to zero in such cases.Therefore we cannot recover Va by solving the
equilibrium Equation (22).It is however possible to collect useful data by
selecting for these advertisers a maximum bid bmax that prevents them from
monop-olizing the eligible ad opportunities.Since the equilibrium condition is
an inequality when ba = bmax, we can only determine a lower bound of the values
Va for these advertisers.

一些广告客户试图通过出价极高来抓住所有可用的广告机会,并希望支付合理的价格,这得益于广义的第二价格规则。在这种情况下,偏导数 ∂Ya ∂ba 和 ∂Za ∂ba 都等于零。因此,我们不能通过求解平衡方程(22)来恢复 Va。然而,可以通过为这些广告客户选择最大出价 bmax 来收集有用的数据,以防止他们垄断合格的广告机会。因为当 ba = bmax 时,均衡条件是一个不等式,所以我们只能确定这些广告商的 Va 值的下限。

These caveats
in fact underline the limitations of the advertiser modelling assumptions.When
their ads are not displayed often enough, advertisers have no more chance to
acquire a full knowledge of the expectations Ya and Za than the publisher has a
chance to determine their value.Similarly, advertisers that place extremely
high bids are probably underestimating the risk to occasionally experience a
very high click price.A more realistic model of the advertiser information
acquisition is required to adequately handle these cases.

这些警告实际上强调了广告商建模假设的局限性。当他们的广告展示得不够频繁时,广告商就没有机会充分了解 Ya 和 Za 的期望,就像出版商没有机会确定它们的价值一样。类似地,出价极高的广告商可能低估了偶尔经历极高点击价格的风险。需要更现实的广告客户信息获取模型来充分处理这些情况。

Estimating the Equilibrium Response

7.3 估计平衡响应

Let A be the
set of the active advertisers, that is, the advertisers whose value can be

设 A 为活跃广告主的集合,即可以估计其价值的广告主

(or lower
bounded) with sufficient accuracy.Assuming that the other advertisers leave
their bids unchanged, we can estimate how the active advertisers adjust their
bids in response to an infinitesi-mal change dθ of the scoring model
parameters.This is achieved by differentiating the equilibrium equations (22):

(或下限)足够精确。假设其他广告客户保持他们的出价不变,我们可以估计活跃的广告客户如何响应评分模型参数的无穷小变化 dθ 来调整他们的出价。这通过微分平衡方程(22)来实现:

∀a ′ ∈ A,


= Va ′ ∂ Ya ′
∂ b a ′ ∂θ − ∂ Za ′ ∂ b a ′ ∂θ d θ + ∑a∈A Va ′ ∂ Ya ′ ∂ b a ′ ∂ b a − ∂ Za ′ ∂
b a ′ ∂ b a d b a .

= va′∂ya′∂b
a′∂θ-∂za′∂b a′∂θdθ+∑a∈a va′∂ya′∂b a′≈b a≈za′≈b a′≈b a d b a。



The partial
second derivatives must be estimated as described in appendix D. Solving this
linear system of equations then yields an expression of the form

偏二阶导数必须按照附录 d 中的描述进行估算。然后求解这个线性方程组,得到以下形式的表达式

dba = Ξa dθ.

DBA =ξa dθ。

approach is of course related to the value estimation method proposed by Athey
and Nekipelov (2010) but strictly relies on the explicit randomization of the
scores.In contrast, practical considerations force Athey and Nekipelov to rely
on the apparent noise and hope that the noise model accounts for all potential
confounding factors.

12.这种方法当然与 Athey 和 Nekipelov
(2010)提出的价值估算方法有关,但严格依赖于分数的显式随机化。相比之下,实际考虑迫使 Athey 和 Nekipelov 依赖于表观噪声,并希望噪声模型能够解释所有潜在的混杂因素。





expression can then be used to estimate how any counterfactual expectation Y of
interest changes when the publisher applies an infinitesimal change dθ to the
scoring parameter θ and the

然后,该表达式可用于估计当发布者对评分参数 θ 应用无穷小的变化 dθ 时,任何感兴趣的反事实期望 Y 如何变化

advertisers A rationally adjust their bids ba in response:

积极的广告客户 A 理性地调整他们的出价 ba 作为回应:

d Y = ∂ Y ∂θ

  • ∑a Ξ a ∂ Y ∂ b a d θ .

d y
=∂y∂θ+∑aνa∂y∂b a dθ。



Although this
expression provides useful information, one should remain aware of its
limita-tions.Because we only can estimate the reaction of active advertisers,
expression (24) does not includes the potentially positive reactions of
advertisers who did not bid but could have.Because we only can estimate a lower
bound of their values, this expression does not model the potential re-actions
of advertisers placing unrealistically high bids.Furthermore, one needs to be
very cautious when the system (23) approaches singularities.Singularities
indicate that the rational advertiser assumption is no longer sufficient to
determine the reactions of certain advertisers.This happens for instance when
advertisers cannot find bids that deliver a satisfactory return.The eventual
behavior of such advertisers then depends on factors not taken in consideration
by our model.


To alleviate
these issues, we could alter the auction mechanism in a manner that forces
advertis-ers to reveal more information, and we could enforce policies ensuring
that the system (23) remains safely nonsingular.We could also design
experiments revealing the impact of the fixed costs in-curred by advertisers
participating into new auctions.Although additional work is needed to design
such refinements, the quasistatic equilibrium approach provides a generic
framework to take such aspects into account.



7.4 讨论

The rational
advertiser assumption is the cornerstone of seminal works describing simplified
vari-ants of the ad placement problem using auction theory (Varian,
2007;Edelman et al., 2007).More sophisticated works account for more aspects of
the ad placement problem, such as the impact of click prediction learning
algorithms (Lahaie and McAfee, 2011), the repeated nature of the ad auc-tions
(Bergemann and Said, 2010), or for the fact that advertisers place bids valid
for multiple auc-tions (Athey and Nekipelov, 2010).Despite these advances, it
seems technically very challenging to use these methods and account for all the
effects that can be observed in practical ad placement systems.

理性广告客户假设是使用拍卖理论描述广告投放问题的简化变量的开创性工作的基石(瓦里安,2007;Edelman 等人,2007 年)。更复杂的作品解释了广告投放问题的更多方面,例如点击预测学习算法的影响(Lahaie 和 McAfee,2011),广告投放的重复性质(Bergemann 和 Said,2010),或者广告商对多次投放进行有效投标的事实(Athey 和 Nekipelov,2010)。尽管取得了这些进步,但使用这些方法并解释在实际广告投放系统中可以观察到的所有影响在技术上似乎非常具有挑战性。

We believe
that our counterfactual reasoning framework is best viewed as a modular toolkit
that lets us apply insights from auction theory and machine learning to
problems that are far more complex than those studied in any single paper.For
instance, the quasi-static equilibrium analysis technique illustrated in this
section extends naturally to the analysis of multiple simultaneous causal
feedback loops involving additional players:


• The first
step consists in designing ad-hoc experiments to identify the parameters that
deter-mine the equilibrium equation of each player.In the case of the
advertisers, we have shown how to use randomized scores to reveal the
advertiser values.In the case of the user feedback, we must carefully design
experiments that reveal how users respond to changes in the quality of the
displayed ads.


Differentiating all the equilibrium equations yields a linear system of
equations linking the variations of the parameter under our control, such as
dθ, and all the parameters under the

对所有平衡方程进行微分,得到一个线性方程组,该方程组将我们控制下的参数(如 dθ)的变化与下的所有参数联系起来





control of
the other players, such as the advertiser bids, or the user willingness to
visit the site and click on ads.Solving this system and writing the total
derivative of the performance measure gives the answer to our question.


Although this
programme has not yet been fully realized, the existence of a principled
frame-work to handle such complex interactions is remarkable.Furthermore,
thanks to the flexibility of the causal inference frameworks, these techniques
can be infinitely adapted to various modeling assumptions and various system




Using the ad
placement example, this work demonstrates the central role of causal inference
(Pearl, 2000;Spirtes et al., 1993) for the design of learning systems
interacting with their environment.Thanks to importance sampling techniques,
data collected during randomized experiments gives precious cues to assist the
designer of such learning systems and useful signals to drive learning

使用广告放置的例子,这项工作证明了因果推理的中心作用(珀尔,2000;Spirtes 等人,1993)的学习系统的设计与他们的环境互动。由于重要性抽样技术,在随机实验中收集的数据提供了宝贵的线索来帮助这种学习系统的设计者,并提供了有用的信号来驱动学习算法。

Two recurrent
themes structure this work.First, we maintain a sharp distinction between the
learning algorithms and the extraction of the signals that drive them.Since
real world learning systems often involve a mixture of human decision and
automated processes, it makes sense to separate the discussion of the learning
signals from the discussion of the learning algorithms that leverage
them.Second, we claim that the mathematical and philosophical tools developed
for the analysis of physical systems appear very effective for the analysis of
causal information systems and of their equilibria.These two themes are in fact
a vindication of cybernetics (Wiener, 1948).




We would like
to acknowledge extensive discussions with Susan Athey, Miroslav Dudík, Patrick
Jordan, John Langford, Lihong Li, Sebastien Lahaie, Shie Mannor, Chris Meek,
Alex Slivkins, and Paul Viola.We also thank the Microsoft adCenter RnR team for
giving us the invaluable opportunity to deploy these ideas at scale and prove
their worth.Finally we gratefully acknowledge the precious comments of our JMLR
editor and reviewers.

我们要感谢与苏珊·艾希、米罗斯拉夫·杜迪克、帕特里克·乔丹、约翰·兰福德、李立宏、塞巴斯蒂安·拉海耶、阏氏·曼诺尔、克里斯·米克、亚历克斯·斯利夫金斯和保罗·维奥拉进行的广泛讨论。我们也感谢微软 adCenter RnR 团队给了我们宝贵的机会,让我们大规模部署这些想法并证明它们的价值。最后,我们感谢 JMLR 编辑和审稿人的宝贵意见。

Appendix A.
Greedy Ad Placement Algorithms


Section 2.1
describes how to select and place ads on a Web page by maximizing the total
rank-score (1).Following (Varian, 2007;Edelman et al., 2007) , we assume that
the click probability estimates are expressed as the product of a positive
position term γp and a positive ad term βi(x).The rank-scores can therefore be
written as ri,p(x) = γpbiβi(x).We also assume that the policy constraints
simply state that a Web page should not display more than one ad belonging to
any given advertiser.The discrete maximization problem is then amenable to
computationally efficient greedy algorithms.

第 2.1 节描述了如何通过最大化总排名分数(1)来选择和放置网页上的广告。以下(瓦里安,2007;Edelman 等人,2007),我们假设点击概率估计被表示为正位置项 γp 和正 ad 项 βi(x)的乘积。因此,等级分数可以写成 ri,p(x) = γpbiβi(x)。我们还假设策略约束只是声明一个网页不应该显示一个以上属于任何给定广告商的广告。然后,离散最大化问题服从于计算高效的贪婪算法。

Let us fix a
layout L and focus on the inner maximization problem.Without loss of
generality, we can renumber the positions such that

让我们固定一个布局 L,专注于内部最大化问题。不失一般性,我们可以对这些位置重新编号

L =
{1,2,...N} and γ1 ≥ γ2 ≥ ≥ 0.

L = {1,2,...N}和 γ1≥γ2≥0。





and write the
inner maximization problem as






= ∑






subject to
the policy constraints and reserve constraints ri,p(x) ≥ Rp(x).

受政策约束和准备金约束 ri,p(x) ≥ Rp(x)。

Let Si denote
the advertiser owning ad i. The set of ads is then partitioned into subsets Is
= {i :

让 Si 表示拥有广告 I 的广告商。然后广告集合被分成子集 Is = {i:

Si = s}
gathering the ads belonging to the same advertiser s. The ads that maximize the


within set Is are called the best ads for advertiser s. If the solution of the
discrete maximiza-

集合 Is 内的双 βi(x)被称为广告商的最佳广告

tion problem
contains one ad belonging to advertiser s, then it is easy to see that this ad
must be one of the best ads for advertiser s: were it not the case, replacing
the offending ad by one of the best

问题包含一个属于广告客户 s 的广告,那么很容易看出这个广告一定是广告客户 s 最好的广告之一:如果不是这样,用最好的广告替换违规的广告

ads would
yield a higher RL without violating any of the constraints.It is also easy to
see that one could select any of the best ads for advertiser s without changing

广告将在不违反任何约束的情况下产生更高的 RL。很容易看出,人们可以在不改变 RL 的情况下为广告商选择任何最好的广告。

Let the set I
contain exactly one ad per advertiser, arbitrarily chosen among the best ads


advertiser.The inner maximization problem can then be simplified as:




i 1,..., iN ∈

i 1,...,iN ∈ I

RL ( i ,...,
i N) = ∑p∈L γp b i p β i p ( x )

RL ( i,...,i N) = ∑p∈L
γp b i p β i p ( x)

where all the
indices i1,...,iN are distinct, and subject to the reserve constraints.

其中所有的指数 i1,...,iN 是不同的,并且受保留约束。

Assume that
this maximization problem has a solution i1,...,iN, meaning that there is a

假设这个最大化问题有一个解 i1,...,iN,意思是有可行的

ad placement
solution for the layout L. For k = ...N, let us define I k ⊂ I as

布局 l 的广告投放解决方案。对于 k =...让我们把 ⊂ 定义为



I k = ArgMax

I k = ArgMax


我 ∈ 我{i1,...,ik 1 }

It is easy to
see that I k intersects {ik,...,iN} because, were it not the case, replacing ik
by any

很容易看出 I k 与{ik,...,iN}因为,如果不是这样,用任何

element of I
k would increase RL without violating any of the constraints.Furthermore it is
easy to

元素将在不违反任何约束的情况下增加 RL。此外,很容易

see that ik ∈
I k because, were it not the case, there would be h > k such that ih ∈ I k ,
and swapping

看到 ik ∈ I k,因为,如果不是这样,会有 h > k,这样 ih ∈ I k,和交换

ik and ih
would increase RL without violating any of the constraints.

ik 和 ih 将在不违反任何约束的情况下增加 RL。

Therefore, if
the inner maximization problem admits a solution, we can compute a solution by
recursively picking i1,...,iN from I 1 ,I 2 ,...,I N . This can be done
efficiently by first sorting the biβi(x) in decreasing order, and then greedily
assigning ads to the best positions subject to the reserve constraints.This
operation has to be repeated for all possible layouts, including of course the
empty layout.

因此,如果内部最大化问题允许一个解,我们可以通过递归地选择 i1 来计算一个解,...,iN 来自 I 1,I 2,...这可以通过首先以递减的顺序对 biβi(x)进行排序,然后贪婪地将广告分配给服从保留约束的最佳位置来有效地完成。必须对所有可能的布局重复该操作,当然包括空布局。

The same
analysis can be carried out for click prediction estimates expressed as
arbitrary mono-tone combination of a position term γp(x) and an ad term βi(x),
as shown, for instance, by Graepel et al. (2010).

可以对表示为位置项 γp(x)和广告项 βi(x)的任意单音组合的点击预测估计进行相同的分析,例如,如 Graepel 等人(2010)所示。

Appendix B.
Confidence Intervals

附录 b .置信区间

Section 4.4
explains how to obtain improved confidence intervals by replacing the unbiased
impor-tance sampling estimator (7) by the clipped importance sampling estimator
(10).This appendix provides details that could have obscured the main message.

第 4.4 节解释了如何通过用截尾重要性抽样估计量(10)代替无偏重要性抽样估计量(7)来获得改进的置信区间。本附录提供了可能掩盖主要信息的细节。





B.1 Outer
Confidence Interval

B.1 外部置信区间

We first
address the computation of the outer confidence interval (12) which describes
how the estimator Ybapproaches the clipped expectation Y .


Y = Zω ℓ ( ω
) w ( ω ) P ( ω ) ≈ Yb = n n∑i=1 ℓ ( ω i ) w ( ω i ) .

y =
zωℓ(ω)w(ω)p(ω)≈Yb = n n∑I = 1ℓ(ωI)w(ωI)。

Since the
samples ℓ(ωi)w (ωi) are independent and identically distributed, the central
limit theorem (e.g., Cramér, 1946, Section 17.4) states that the empirical
average Ybconverges in law to a normal distribution of mean Y = E[ℓ(ω)w (ω)]
and variance V = var[ℓ(ω)w (ω)].Since this convergence usually occurs quickly,
it is widely accepted to write

由于样本 ℓ(ωi)w (ωi)是独立且同分布的,中心极限定理(如 Cramér,1946,第 17.4 节)指出经验平均 Yb 收敛于均值 Y = E[ℓ(ω)w (ω)]和方差 V = var[ℓ(ω)w (ω)]的正态分布。由于这种趋同通常发生得很快,所以写作被广泛接受

P n Yb − ε R
≤ Y ≤ Yb + ε R o ≥ − δ ,

p n
YbεR≤Y≤Yb+εR o≥δ,



ε R = erf− (
1 − δ ) p 2 V .

ε R =电流变流体(1δ)p 2V。

and to
estimate the variance V using the sample variance Vb

并且使用样本方差 Vb 来估计方差 V

V ≈ Vb =

V ≈ Vb =

n − 1


n∑i=1 ℓ ( ω i
) w ( ω i ) − Yb .

n∑I =



This approach
works well when the ratio ceiling R is relatively small.However the presence of
a few very large ratios makes the variance estimation noisy and might slow down
the central limit convergence.


The first
remedy is to bound the variance more rigorously.For instance, the following
bound results from (Maurer and Pontil, 2009, Theorem 10).

第一个补救措施是更严格地约束方差。例如,下面的界限结果来自(莫勒和庞泰尔,2009,定理 10)。

P ( p V >
p Vb + ( M − m ) R r 2log ( / δ ) n − 1 ) ≤ δ

p(p V > p
Vb+(M M)R R 2 log(/δ)n 1)≤δ

this bound with (25) gives a confidence interval valid with probability greater
than 1−2δ.Although this approach eliminates the potential problems related to
the variance estimation, it does not address the potentially slow convergence
of the central limit theorem.

将该界限与(25)相结合,给出概率大于 12δ 的有效置信区间。尽管这种方法消除了与方差估计相关的潜在问题,但它没有解决中心极限定理的潜在缓慢收敛问题。

The next
remedy is to rely on empirical Bernstein bounds to derive rigorous confidence
intervals that leverage both the sample mean and the sample variance (Audibert
et al., 2007;Maurer and Pontil, 2009).

下一个补救措施是依靠经验伯恩斯坦界限来推导严格的置信区间,该置信区间利用样本均值和样本方差(奥迪伯特等人,2007;Maurer 和 Pontil,2009)。

Theorem 1
(Empirical Bernstein bound) (Maurer and Pontil, 2009, thm 4) Let X,X1,X2,...,Xn
be i.i.d. random variable with values in [a,b] and let δ > 0.Then, with
proba-bility at least 1−δ,

定理 1(经验 Bernstein 界)(Maurer 和 Pontil,2009,thm 4)让 X,X1,X2,...,Xn 为 i.i.d .随机变量,取值在[a,b]中,设 δ > 0。那么,概率至少为 1δ,

E [X ] − Mn ≤
r Vn log ( / δ ) n + ( b − a ) 7log ( / δ ) 3 ( n − 1 ) ,

e[X]Mn≤r Vn
log(/δ)n+(B- a)7 log(/δ)3(n-1),

where Mn and
Vn respectively are the sample mean and variance

其中 Mn 和 Vn 分别是样本均值和方差

Mn =

Mn =

n n∑i=1 Xi

n n∑i=1 Xi

Vn =

Vn =


n − 1


n∑i=1 ( Xi −
Mn ) .

n∑i=1 ( Xi 锰)。





Applying this
theorem to both ℓ(ωi)w (ωi) and −ℓ(ωi)w (ωi) provides confidence intervals that
hold for for the worst possible distribution of the variables ℓ(ω) and w(ω).

将该定理应用于 ℓ(ωi)w (ωi)和 −ℓ(ωi)w (ωi)提供了置信区间,该区间适用于变量 ℓ(ω(ω)和 w(ω)的最差可能分布。



P n Yb − ε R
≤ Y ≤ Yb + ε R o ≥ − δ

p n
YbεR≤Y≤Yb+εR o≥δ

ε R = s Vb
log ( / δ ) n + M R 7log ( / δ ) 3 ( n − 1 ) .

εR = s Vb
log(/δ)n+M R 7 log(/δ)3(n-1)。



Because they
hold for the worst possible distribution, confidence intervals obtained in this
way are less tight than confidence intervals based on the central limit
theorem.On the other hand, thanks to the Bernstein bound, they remains
reasonably competitive, and they provide a much stronger guarantee.


B.2 Inner
Confidence Interval

B.2 内部置信区间

confidence intervals are derived from inequality (14) which bounds the
difference between the counterfactual expectation Y and the clipped expectation
Y :

内部置信区间由不等式(14)导出,不等式(14)限制了反事实期望 Y 和限幅期望 Y 之间的差:

0 ≤ Y −Y ≤ M
(1−W ).


The constant
M is defined by assumption (8).The first step of the derivation consists in
obtaining a lower bound of W −Wbusing either the central limit theorem or an
empirical Bernstein bound.

常数 M 由假设(8)定义。推导的第一步是利用中心极限定理或经验伯恩斯坦界获得 W-WB 的下界。

For instance,
applying theorem 1 to −w (ωi) yields

例如,将定理 1 应用于 w (ωi)会产生



 W ≥ Wb− s
Vbw log ( / δ ) n − R 7log ( / δ ) 3 ( n − 1 )  ≥ − δ

w≥WB s vbw 对数(/δ)n r7 对数(/δ)3(n1)≥δ

where Vbw is
the sample variance of the clipped weights

其中 Vbw 是剪裁权重的样本方差

Vbw =

Vbw =

n − 1


n∑i=1 w ( ω i
) − Wb .

n∑I =

Replacing in
inequality (14) gives the outer confidence interval


P n Y ≤ Y ≤ Y

  • M ( − Wb+ ξ R ) o ≥ − δ .

p n



ξ R = s Vbw
log ( / δ ) n + R 7log ( / δ ) 3 ( n − 1 ) .

ξR = s Vbw
log(/δ)n+R 7 log(/δ)3(n-1)。



Note that 1
−Wb+ ξR can occasionally be negative.This occurs in the unlucky cases where the
confidence interval is violated, with probability smaller than δ.

请注意,1 Wb+ξR 有时可能为负。这发生在违反置信区间的不幸情况下,概率小于 δ。

together the inner and outer confidence intervals,


P n Yb − ε R
≤ Y ≤ Yb + M ( − Wb+ ξ R) + ε R o ≥ − δ ,

p n
YbεR≤Y≤Yb+M(Wb+ξR)+εR o≥δ,

with εR and
ξR computed as described in expressions (26) and (27).

按照表达式(26)和(27)计算 εR 和 ξR。





Appendix C.
Counterfactual Differences


We now seek
to estimate the difference Y + −Y of the expectations of a same quantity ℓ(ω)
under two different counterfactual distributions P+(ω) and P(ω).These
expectations are often affected by variables whose value is left unchanged by
the interventions under consideration.For instance, seasonal effects can have
very large effects on the number of ad clicks.When these variables affect both
Y + and Y in similar ways, we can obtain substantially better confidence
intervals for the difference Y + −Y .

我们现在寻求在两个不同的反事实分布 P+(ω)和 P(ω)下,估计相同量 ℓ(ω 期望的差 y+y。这些期望经常受到变量的影响,这些变量的值在所考虑的干预中保持不变。例如,季节效应会对广告点击量产生非常大的影响。当这些变量以类似的方式影响 Y + 和 Y 时,我们可以获得 Y+Y 差的更好的置信区间。

In addition
to the notation ω representing all the variables in the structural equation
model, we use notation υ to represent all the variables that are not direct or
indirect effects of variables affected by the interventions under

除了表示结构方程模型中所有变量的符号 ω 之外,我们还使用符号 υ 来表示不是受所考虑干预影响的变量的直接或间接影响的所有变量。

Let ζ(υ) be a
known function believed to be a good predictor of the quantity ℓ(ω) whose
coun-terfactual expectation is sought.Since P(υ) = P(υ), the following equality
holds regardless of the quality of this prediction:

设 ζ(υ)是一个已知的函数,它被认为是 ℓ(ω 量的一个很好的预测器,它的实际期望是被寻求的。因为 P(υ) = P(υ),所以不管这个预测的质量如何,下列等式都成立:

both Y + and Y in this way and computing the difference,

以这种方式分解 Y + 和 Y 并计算差值,

Y + − Y = Zω
[ ℓ ( ω ) − ζ ( υ)] ∆ w ( ω ) P ( ω ) ≈ n n∑i=1

y+y =
zω[ℓ(ω)ζ(υ)]w(ω)p(ω)≈n∑I = 1

  • ℓ ( ω i ) −
    ζ ( υ i )


∆ w ( ω i )

w ( ω i)

∆ w ( ω) = P

  • ( ω ) P ( ω ) − P ( ω ) P ( ω ) = P + ( ω ) − P ( ω ) P ( ω ) .

P+(ω)P(ω)P(ω)P(ω)= P+(ω)P(ω)。





The outer
confidence interval size is reduced if the variance of the residual ℓ(ω)−ζ(υ)is
smaller than the variance of the original variable ℓ(ω).For instance, a
suitable predictor function ζ(υ) can significantly capture the seasonal click
yield variations regardless of the interventions under consideration.Even a
constant predictor function can considerably change the variance of the outer
confidence interval.Therefore, in the absence of better predictor, we still can
( and always should ) center the integrand using a constant predictor.

如果残差 ℓ(ω)−ζ(υ)is 的方差小于原始变量 ℓ(ω).的方差,则减小外部置信区间的大小例如,一个合适的预测函数 ζ(υ)可以显著地捕捉季节性点击量的变化,而不考虑正在考虑的干预措施。即使是一个恒定的预测函数也能显著改变外部置信区间的方差。因此,在没有更好的预测器的情况下,我们仍然可以(并且总是应该)使用常数预测器对被积函数进行中心化。

The rest of
this appendix describes how to construct confidence intervals for the
estimation of counterfactual differences.Additional bookkeeping is required
because both the weights ∆w(ωi) and the integrand ℓ(ω)−ζ(υ) can be positive or
negative.We use the notation υ to represent the variables of the structural
equation model that are left unchanged by the intervention under
consid-erations.Such variables satisfy the relations P(υ) = P(υ) and P(ω) = P
(ω\υ|υ)P(υ), where we use notation ω\υ to denote all remaining variables in the
structural equation model.An invariant predictor is then a function ζ(υ) that
is believed to be a good predictor of ℓ(ω).In particular, it is expected that
var[ℓ(ω)−ζ(υ)] is smaller than var[ℓ(ω)].

本附录的其余部分描述了如何构建置信区间来估计反事实差异。需要额外记账,因为权重 w(ωi)和被积函数 ℓ(ω)−ζ(υ 都可以是正的或负的。我们使用符号 υ 来表示结构方程模型的变量,这些变量在考虑的情况下由于干预而保持不变。这样的变量满足关系 P(υ) = P(υ)和 P(ω) = P (ω\υ|υ)P(υ),这里我们用符号 ω\υ 来表示结构方程模型中所有剩余的变量。然后,不变预测器是被认为是 ℓ(ω).的良好预测器的函数 ζ(υ)特别是,预计 var[ℓ(ω)−ζ(υ)比 var[ℓ(ω)].小

C.1 Inner
Confidence Interval with Dependent Bounds

C.1 具有相关界限的内部置信区间

We first
describe how to construct finer inner confidence intervals by using more
refined bounds on ℓ(ω).In particular, instead of the simple bound (8), we can
use bounds that depend on invariant variables:

我们首先描述如何通过在 ℓ(ω).上使用更精细的边界来构造更精细的内部置信区间特别是,我们可以使用依赖于不变变量的界限来代替简单界限(8):



m ≤ m ( υ ) ≤
ℓ ( ω ) ≤ M ( υ ) ≤ M .

m ≤ m ( υ ) ≤
ℓ ( ω ) ≤ M ( υ ) ≤ M。





The key
observation is the equality


E [ w ( ω
)|υ] = Zω \ υ w ( ω ) P ( ω \ υ|υ) = Zω \ υ P ( ω \ υ|υ ) P ( υ ) P ( ω \ υ|υ )
P ( υ ) P ( ω \ υ|υ) = .

zω\υw(ω)P(ω\υ|υ)= zω\υP(ω\υ|υ)P(ω\υ|υ)P(υ)P(ω\υ|υ)=。

We can then


= Z ω

= Z ω

  • w (ω)−w (ω)


ℓ(ω) P(ω) ≤ Z
υ E[w (ω)−w (ω)|υ] M(υ) P(υ)


= Z υ (−E[w
(ω)|υ] ) M(υ) P(υ) = Z ω (−w (ω) ) M(υ) P(ω) = Bhi .

Zυ(E[w(ω)|υ])M(υ)P(υ)= Zω(w(ω))M(υ)P(ω)= Bhi。

Y −Y


Using a
similar derivation for the lower bound Blo, we obtain the inequality

利用下界 Blo 的类似推导,我们得到了不等式

With the


Blo ≤ Y −Y ≤


Bb lo = n
n∑i=1 ( − w ( ω i)) m ( υ i )

bb lo = n n∑I
= 1(w(ωI))m(υI)

Bb hi = n
n∑i=1 ( − w ( ω i)) M ( υ i )

bb hi = n n∑I
= 1(w(ωI))M(υI)

Vblo =

Vblo =

n − 1


n∑i=1 h ( − w
( ω i)) m ( υ i ) − Bb lo i Vbhi = n − 1 n∑i=1 h ( − w ( ω i)) M ( υ i ) − Bb
hi i

n∑I =
1h(w(ωI))M(υI)Bb lo I Vhibi = n 1n∑I = 1h(w(ωI))M(υI)Bb hi I

ξlo = s Vblo
log ( / δ ) n +|m|R 7log ( / δ ) 3 ( n − 1 ) ,

ξlo = s Vblo
log(/δ)n+| m | R 7 log(/δ)3(n 1),

ξhi = s Vbhi
log ( / δ ) n +|M|R 7log ( / δ ) 3 ( n − 1 ) ,

ξhi = s Vhibi
log(/δ)n+| M | R 7 log(/δ)3(n-1),

applications of theorem 1 give the inner confidence interval:

定理 1 的两个应用给出了内部置信区间:

P n Y + Bb lo
− ξlo ≤ Y ≤ Y + Bb hi + ξhi o ≥ − δ .

p n Y+Bb
loξlo≤Y≤Y+Bb hi+ξhi o≥δ。

Confidence Intervals for Counterfactual Differences

C2 反事实差异的置信区间

We now
describe how to leverage invariant predictors in order to construct tighter
confidence inter-vals for the difference of two counterfactual expectations.


Y + − Y ≈ n

Y+Y≈n n∑I = 1

  • ℓ ( ω i ) −
    ζ ( υ i )


∆ w ( ω i )

w ( ω i)

with ∆w(ω) =
P +(ω)−P (ω) P(ω) .

其中 ∏w(ω)= P+(ω)P(ω)P(ω)。

Let us define
the reweigthing ratios w +(ω) = P +(ω)/P(ω) and w (ω) = P (ω)/P(ω), their
clipped variants w +(ω) and w (ω), and the clipped centered expectations

让我们定义重加权比 w +(ω) = P +(ω)/P(ω)和 w (ω) = P (ω)/P(ω),它们的削波变量 w +(ω)和 w (ω),以及削波的中心期望值

Y +c = Zω [ ℓ
( ω ) − ζ ( υ)] w + ( ω ) P ( ω )

y+c =

and Y c = Zω
[ ℓ ( ω ) − ζ ( υ)] w ( ω ) P ( ω ) .

y c =

The outer
confidence interval is obtained by applying the techniques of Section B.1 to

外部置信区间是通过应用 B.1 节的技术获得的

Y +c − Y c =
Zω [ ℓ ( ω ) − ζ ( υ) ][ w + ( ω ) − w ( ω) ] P ( ω ) .

y+c y c =

Since the
weights w + −w can be positive or negative, adding or removing a constant to
ℓ(ω) can considerably change the variance of the outer confidence interval.This
means that one should

因为权重 w+w 可以是正的或负的,所以在 ℓ(ω 中添加或移除常数)可以显著改变外部置信区间的方差。这意味着一个人应该





always use a
predictor.Even a constant predictor can vastly improve the outer confidence
interval difference.


The inner
confidence interval is then obtained by writing the difference


Y +−Y − Y + c
−Y c

Y+Y-Y+c Y-c

= Zω

= Zω

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )
  • w + ( ω ) −
    w + ( ω )


P ( ω ) − Zω


  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )
  • w ( ω ) − w
    ( ω )


P ( ω )

P ( ω)

and bounding
both terms by leveraging υ-dependent bounds on the integrand:

并且通过利用被积函数的 υ 相关边界来限制这两个项:

− M ≤ −ζ(υ) ≤
ℓ(ω)−ζ(υ) ≤ M −ζ(υ) ≤ M .




This can be
achieved as shown in Section C.1.

这可以如第 1 节所示实现

Appendix D.
Counterfactual Derivatives

附录 d .反事实衍生品

We now
consider interventions that depend on a continuous parameter θ.For instance, we
might want to know what the performance of the ad placement engine would have
been if we had used a parametrized scoring model.Let Pθ (ω)represent the
counterfactual Markov factorization associated with this intervention.Let Y θ be
the counterfactual expectation of ℓ(ω) under distribution Pθ .

我们现在考虑依赖于连续参数 θ 的干预。例如,我们可能想知道如果我们使用参数化评分模型,广告投放引擎的性能会如何。让 Pθ (ω)代表与此干预相关的反事实马尔可夫因子分解。设 Y θ 是分布 Pθ 下 ℓ(ω 的反事实期望。

Computing the
derivative of (28) immediately gives


∂ Y θ

∂ Y θ



= Zw

= Zw

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )

w ′θ ( ω ) P
( ω ) ≈ n n∑i=1

= 1

  • ℓ ( ω i ) −
    ζ ( υ i )


w ′θ ( ω i )


with wθ(ω) =
P θ (ω) P(ω)

其中 wθ(ω) = P θ (ω) P(ω)

and w ′θ ( ω)
= ∂ w θ ( ω ) ∂θ = w θ ( ω ) ∂logP θ ( ω ) ∂θ .

和 w′θ(ω)=∂wθ(ω)∂θ= wθ(ω)∂logpθ(ω)∂θ。



Replacing the
expressions P(ω) and Pθ (ω) by the corresponding Markov factorizations gives
many opportunities to simplify the reweighting ratio w ′ θ (ω).The term wθ(ω)
simplifies as shown in (6).The derivative of log Pθ (ω) depends only on the
factors parametrized by θ.Therefore, in order to evaluate w ′ θ (ω), we only
need to know the few factors affected by the intervention.

用相应的马尔可夫因子分解代替表达式 P(ω)和 Pθ (ω)给出了许多简化重加权比 w′θ(ω)的机会。术语 wθ(ω)简化如(6)所示。对数 Pθ (ω)的导数仅取决于 θ 参数化的因子。因此,为了评估 w′θ(ω),我们只需要知道受干预影响的少数因素。

Higher order
derivatives can be estimated using the same approach.For instance,


∂ Y θ

∂ Y θ

∂θ i ∂θ j


= Zw

= Zw

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )

w ′′ i j ( ω
) P ( ω ) ≈ n n∑i=1


  • ℓ ( ω i ) −
    ζ ( υ i )


w ′′ i j ( ω
i )

w′’I j(ωI)

with w ′′ i
j(ω) = ∂ wθ(ω) ∂θi ∂θj = wθ(ω) ∂log Pθ (ω) ∂θi ∂logPθ (ω) ∂θj +wθ(ω) ∂ logPθ
(ω) ∂θi ∂θj



The second
term in w ′′ i j(ω) vanishes when θi and θj parametrize distinct factors in Pθ

当 θi 和 θj 参数化 Pθ (ω)中的不同因子时,w′I j(ω)中的第二项消失。

Infinitesimal Interventions and Policy Gradient

D1 极小干预和政策梯度

(29) becomes particularly attractive when P(ω) = P θ (ω), that is, when one
seeks deriva-tives that describe the effect of an infinitesimal intervention on
the system from which the data was collected.The resulting expression is then
identical to the celebrated policy gradient (Alek-sandrov et al., 1968;Glynn,
1987;Williams, 1992) which expresses how the accumulated rewards

当 P(ω) = P θ (ω)时,表达式(29)变得特别有吸引力,也就是说,当人们寻求描述无穷小的干预对收集数据的系统的影响的导数时。结果表达式与著名的政策梯度相同(Alek-sandrov 等人,1968 年;格林,1987 年;威廉姆斯,1992)表达了累积的回报





in a
reinforcement learning problem are affected by small changes of the parameters
of the policy function.


∂ Y θ

∂ Y θ



= Zω

= Zω

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )

w ′θ ( ω ) P
θ ( ω ) ≈ n n∑i=1

= 1

  • ℓ ( ω i ) −
    ζ ( υ i )


w ′θ ( ω i )


where ωi are
sampled i.i.d. from Pθ and w ′ θ (ω) = ∂logPθ (ω) ∂θ

其中,ωi 是从 Pθ 和 w’θ(ω)=∂logpθ(ω)∂θ 获得的 i.i.d 采样值


Sampling from
Pθ (ω) eliminates the potentially large ratio wθ(ω) that usually plagues
impor-tance sampling approaches.Choosing a parametrized distribution that
depends smoothly on θ is then sufficient to contain the size of the weights w ′
θ (ω).Since the weights can be positive or nega-tive, centering the integrand
with a prediction function ζ(υ) remains very important.Even a constant
predictor ζ can substantially reduce the variance

从 Pθ (ω)采样消除了通常困扰重要采样方法的潜在大比值 wθ(ω)。选择平滑依赖于 θ 的参数化分布就足以包含权重 w’θ(ω)的大小。由于权重可以是正的或负的,用预测函数 ζ(υ)对被积函数进行定心仍然非常重要。即使是常数预测因子 ζ 也能显著降低方差

(ℓ(ω)−ζ)w ′ θ (ω)] = var[ ℓ(ω)w ′ θ (ω)−ζw ′ θ (ω)]


= var[ℓ(ω)w ′
θ (ω)]−2ζcov[ ℓ(ω)w ′ θ (ω), w ′ θ (ω)] +ζ var[w ′ θ (ω)]


whose minimum
is reached for ζ = cov[ℓ(ω)w ′ θ (ω)w ′ θ (ω)]

对于 ζ= cov[ℓ(ω)w’θ(ω)w’θ(ω)达到最小值

var[w ′ θ
(ω)] =


E[ℓ(ω)w ′ θ
(ω) ]


E[w ′ θ (ω) ]


We sometimes
want to evaluate expectations under a counterfactual distribution that is too
far from the actual distribution to obtain reasonable confidence
intervals.Suppose, for instance, that we are unable to reliably estimate which
click yield would have been observed if we had used a certain parameter θ for
the scoring models.We still can estimate how quickly and in which direction the
click yield would have changed if we had slightly moved the current scoring
model parameters θ in the direction of the target θ .Although such an answer is
not as good as a reliable estimate of Y θ , it is certainly better than no

我们有时希望在与实际分布相差太远的反事实分布下评估期望,以获得合理的置信区间。例如,假设我们无法可靠地估计如果我们对评分模型使用某个参数 θ 会观察到哪种点击量。如果我们在目标 θ 的方向上稍微移动当前评分模型参数 θ,我们仍然可以估计点击率会以多快的速度和朝哪个方向变化。这样的答案虽然不如对 Y θ 的可靠估计,但肯定比没有答案要好。

Off-Policy Gradient

D.2 政策外梯度

We assume in
this subsection that the parametrized probability distribution Pθ (ω) is
regular enough to ensure that all the derivatives of interest are defined and
that the event {wθ(ω)=R} has probability zero.Furthermore, in order to simplify
the exposition, the following derivation does not leverage an invariant
predictor function.

在本小节中,我们假设参数化概率分布 Pθ (ω)足够规则,以确保定义了所有感兴趣的导数,并且事件{wθ(ω)=R}具有概率零。此外,为了简化说明,下面的推导没有利用不变预测函数。

derivatives using data sampled from a distribution P(ω) different from Pθ (ω)
is more challenging because the ratios wθ(ωi) in Equation (29) can take very
large values.However it is comparatively easy to estimate the derivatives of
lower and upper bounds using a slightly different way to clip the weights.Using
notation 1l(x) represent the indicator function, equal to one if condition x is
true and zero otherwise, let us define respectively the clipped weights w Z θ
and the capped weights w M θ :

使用从不同于 Pθ (ω)的分布 P(ω)采样的数据来估计导数更具挑战性,因为等式(29)中的比率 wθ(ωi)可能取非常大的值。然而,使用稍微不同的方法来裁剪权重,估计下限和上限的导数相对容易。使用符号 1l(x)表示指标函数,如果条件 x 为真,则等于 1,否则为零,让我们分别定义削波权重 w Z θ 和封顶权重 w M θ:

w Zθ ( ω) = w
θ ( ω )1l { P ( ω ) < R P ( ω ) } and w Mθ ( ω) = min { w θ ( ω ) , R } .

w Zθ ( ω) = w
θ ( ω )1l { P ( ω ) < R P ( ω ) }和 w Mθ ( ω) = min { w θ ( ω),R }。

Section 4.4 illustrates the use of clipped weights, the confidence interval
derivation can be easily extended to the capped weights.Defining the capped

尽管第 4.4 节说明了削波权重的使用,但置信区间推导可以很容易地扩展到封顶权重。定义上限数量

Y θ = Zω ℓ (
ω ) w Mθ ( ω ) P ( ω )

Y θ = Zω ℓ (
ω ) w Mθ ( ω ) P ( ω)

and W θ = Zω
w Mθ ( ω ) P ( ω )

而 W θ = Zω w Mθ ( ω ) P ( ω)





and writing


= Zω ∈ Ω \ Ω
R ℓ ( ω) ( P ( ω ) − R P ( ω) )

zω∈ω\ωrℓ(ω)(p(ω)r p(ω))

0 ≤ Y θ −Y θ


≤ M − P ( Ω R
) − R P ( Ω \ Ω R ) = M − Zω w Mθ ( ω ) P ( ω )

≤M P(ωR)R
P(ω\ωR)= M zωw Mθ(ω)P(ω)

yields the


Y θ ≤ Y θ ≤ Y
θ + M ( 1 − W θ ) .




In order to
obtain reliable estimates of the derivatives of these upper and lower bounds,
it is of course sufficient to obtain reliable estimates of the derivatives of Y
θ and W θ .By separately considering the cases wθ(ω) < R and wθ(ω) > R,
we easily obtain the relation

为了获得这些上限和下限的导数的可靠估计,获得 Y θ 和 W θ 的导数的可靠估计当然是足够的。通过分别考虑 wθ(ω) < R 和 wθ(ω) > R 的情况,我们很容易得到关系式

w M ′ θ ( ω)
= ∂ w Mθ ( ω ) ∂θ = w Zθ ( ω ) ∂log P θ ( ω ) ∂θ

w m’θ(ω)=∂w
mθ(ω)∂θ= w zθ(ω)∂log pθ(ω)∂θ

and, thanks
to the regularity assumptions, we can write


when wθ(ω) 6=

当 wθ(ω) 6= R 时

∂Y θ ∂θ

∂Y θ ∂θ

∂W θ ∂θ

∂W θ ∂θ

= Zω ℓ ( ω )
w M ′ θ ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w M ′ θ ( ω i )

= zωℓ(ω)w
m′θ(ω)p(ω)≈n∑I = 1ℓ(ωI)w m′θ(ωI)

= Zω w M ′ θ
( ω ) P ( ω ) ≈ n n∑i=1 w M ′ θ ( ω i )

M′θ(ω)P(ω)≈n∑I = 1w M′θ(ωI)

these derivatives is considerably easier than using approximation (29) because
they in-volve the bounded quantity w Z θ (ω) instead of the potentially large
ratio wθ(ω).It is still necessary to choose a sufficiently smooth sampling
distribution P(ω) to limit the magnitude of ∂log Pθ/∂θ.

估算这些导数比使用近似值(29)容易得多,因为它们涉及有界量 w Z θ (ω),而不是潜在的大比值 wθ(ω)。仍然需要选择足够平滑的采样分布 P(ω)来限制 ∂log Pθ/∂θ.的幅度

derivatives are very useful to drive optimization algorithms.Assume for
instance that we want to find the parameter θ that maximizes the counterfactual
expectation Y θ as illustrated in Section 6.3.Maximizing the estimate obtained
using approximation (5) could reach its maximum for a value of θ that is poorly
explored by the actual distribution.Maximizing an estimate of the lower bound
(30) ensures that the optimization algorithm finds a trustworthy answer.

这样的导数对于驱动优化算法非常有用。例如,假设我们想要找到使反事实期望最大化的参数 θ,如第 6.3 节所示。使用近似值(5)获得的最大估计值可能会在 θ 值达到最大值,而实际分布对此了解甚少。使下限(30)的估计最大化确保了优化算法找到可信的答案。

Appendix E.
Uniform Empirical Bernstein Bounds

附录 e .统一经验伯恩斯坦界限

This appendix
reviews the uniform empirical Bernstein bound given by Maurer and Pontil (2009)
and describes how it can be used to construct the uniform confidence interval
(21).The first step


consists of
characterizing the size of a family F of functions mapping a space X into the
in-terval [a,b] ⊂ R. Given n points x = (x1...xn) ∈ X n , the trace F (x) ∈ R n
is the set of vectors f(x1),..., f(xn) for all functions f ∈ F .

包括表征函数族 f 的大小,函数族 f 将空间 x 映射到区间[a,b] ⊂ R 中。给定 n 个点 x = (x1...xn)∈X ^ n,迹 F(X)∈R ^ n 是向量 f(x1)的集合,...,f(xn)对于所有函数 f ∈ F。

Definition 2
(Covering numbers, etc.) Given ε > 0, the covering number N (x, ε,F ) is the
small-est possible cardinality of a subset C ⊂ F (x) satisfying the condition

定义 2(涵盖数字等。)给定 ε > 0,覆盖数 N (x,ε,f)是满足条件的子集 C ⊂ F (x)的最小可能基数

∀ v ∈ F ( x )
∃ c ∈ C max i=1...n |v i − c i | ≤ ε

∀ v ∈ F ( x )
∃ c ∈ C 最大 i=1...n | v I c I |≤ε

and the
growth function N (n, ε,F ) is

生长函数 N (n,ε,F)为

N (n, ε,F ) =

N (n,ε,F ) = sup



N (x, ε,F ) .

N (x,ε,F)。





Thanks to a
famous combinatorial lemma (Vapnik and Chervonenkis, 1968, 1971;Sauer, 1972),

感谢一个著名的组合引理(Vapnik 和 Chervonenkis,1968,1971;绍尔,1972),为

many usual
parametric families F , the growth function N (n, ε,F ) increases at most

许多常用的参数族 F,增长函数 N (n,ε,F)至多多项式增长

with both n
and 1/ε.

对于 n 和 1/ε。

Theorem 3
(Uniform empirical Bernstein bound) (Maurer and Pontil, 2009, thm 6)

定理 3(统一经验伯恩斯坦界)(莫勒和庞泰尔,2009,thm 6)

Let δ ∈
(0,1), n >= 16.Let X,X1,...,Xn be i.i.d. random variables with values in X.
Let F be a set of functions mapping X into [a,b] ⊂ R and let M (n) = 10N (2n,F
,1/n).Then we probability

设 δ∑(0,1),n >= 16。设 X,X1,...设 f 是将 x 映射成[a,b] ⊂ R 的一组函数,设 M (n) = 10N (2n,f,1/n)。那么我们概率

at least 1−δ,

至少 1δ,

∀ f ∈ F , E [
f ( X)] − Mn ≤ r Vn log ( M ( n ) / δ ) n + ( b − a ) 15 log ( M ( n ) / δ ) n
− 1

∀ f ∈ F,e[f(x)]Mn≤r
VN log(m(n)/δ)n+(b a)15 log(m(n)/δ)n-1



where Mn and
Vn respectively are the sample mean and variance

其中 Mn 和 Vn 分别是样本均值和方差

Mn =

Mn =

n n∑i=1 f (
Xi )

n n∑i=1 f (

Vn =

Vn =


n − 1


n∑i=1 ( f (
Xi ) − Mn ) .

n∑I =

The statement
of this theorem emphasizes its similarity with the non-uniform empirical
Bern-stein bound (theorem 1).Although the constants are less attractive, the
uniform bound still con-

这个定理的陈述强调了它与非一致经验 Bern-stein 界(定理 1)的相似性。尽管常数没有那么吸引人,但统一界限仍然成立

verges to
zero when n increases, provided of course that M (n) = 10N (2n,F ,1/n) grows

当 n 增加时收敛到零,当然前提是 M (n) = 10N (2n,F,1/n)增长多项式

mially with

和 n 一起。

Let us then
define the family of functions


F = { fθ : ω
7→ ℓ(ω)w M θ (ω) , gθ : ω 7→ w M θ (ω) , ∀θ ∈ F } ,

F = { fθ : ω
7→ ℓ(ω)w M θ (ω),gθ : ω 7→ w M θ (ω),∀θ ∈ F },

and use the
uniform empirical Bernstein bound to derive an outer inequality similar to (26)
and an inner inequality similar to (27).The theorem implies that, with
probability 1−δ, both inequalities are simultaneously true for all values of
the parameter θ.The uniform confidence interval (21) then follows directly.

并且使用一致经验伯恩斯坦界导出类似于(26)的外部不等式和类似于(27)的内部不等式。该定理意味着,对于概率 1δ,两个不等式对于参数 θ 的所有值同时成立。然后,统一置信区间(21)直接跟随。



V M.
Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva.Stochastic
optimization.Engineering Cybernetics, 5:11-16, 1968.

V M.
Aleksandrov、V. I. Sysoyev 和 V. V. Shemeneva。随机优化。工程控制论,5:11-16,1968。

Susan Athey
and Denis Nekipelov.A structural model of sponsored search advertising.Working

苏珊·艾希和丹尼斯·内基佩洛夫。赞助搜索广告的结构模型。工作 pa-

per, 2010.URL
http://kuznets.harvard.edu/~athey/papers/Structural_Sponsored_ Search.pdf .

per,2010 年。网址:Search.pdf。

Audibert, Remi Munos, and Csaba Szepesvári.Tuning bandit algorithms in
stochastic en-


Proc.18th International Conference on Algorithmic Learning Theory (ALT 2007),
pages 150-165, 2007.

环境。在 Proc 中。第 18 届算法学习理论国际会议(ALT 2007),第 150-165 页,2007。

Peter Auer,
Nicolò Cesa-Bianchi, and Paul Fisher.Finite time analysis of the multiarmed
bandit problem.Machine Learning, 47(2-3):235-256, 2002.


13.For a
simple proof of this fact, slice [a,b] into intervals Sk of maximal width ε and
apply the lemma to the family of indicator functions (xi ,Sk) 7→ 1l{ f(xi) ∈

13.为了简单证明这个事实,将[a,b]切成最大宽度 ε 的区间 Sk,并将引理应用于指示函数族(xi,Sk) 7→ 1l{
f(xi) ∈ Sk}。





Heejung Bang
and James M. Robins.Doubly robust estimation in missing data and causal
inference models.Biometrics, 61:692-972, 2005.


Bergemann and Maher Said.Dynamic auctions: a survey.Discussion Paper 1757R,
Cowles Foundation for Research in Economics, Yale University, 2010.

德克·伯格曼和马希尔·赛义德。动态拍卖:综述。讨论文件 1757R,考尔斯经济学研究基金会,耶鲁大学,2010 年。

Bottou.From machine learning to machine reasoning.http://arxiv.org/abs/1102.
1808v3, Feb 2011.

莱昂·博图。从机器学习到机器推理。http://arxiv.org/abs/1102. 1808 v3,2011 年 2 月。

Breiman.Statistical modeling: The two cultures.Statistical Science,
16(3):199-231, 2001.


Chapelle and Lihong Li.An empirical evaluation of thompson sampling.In Advances
in Neural Information Processing Systems 24, pages 2249-2257.NIPS Foundation,

奥利维耶·查佩尔和李立宏。汤普森抽样的经验评估。神经信息处理系统进展 24,第 2249-2257 页。NIPS 基金会,2011 年。

C.R. Charig,
D. R. Webb, S. R. Payne, and J. E. A. Wickham.Comparison of treatment of renal


calculi by
open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave
lithotripsy.British Medical Journal (Clin Res Ed), 292(6254):879-882, 1986.


Denis X.
Charles and D. Max Chickering.Optimization for paid search auctions.Manuscript
in preparation, 2012.

丹尼斯·查尔斯十世和马克斯·奇克林。付费搜索拍卖的优化。手稿在准备中,2012 年。

Denis X.
Charles, D. Max Chickering, and Patrice Simard.Micro-market experimentation for
paid search.Manuscript in preparation, 2012.

丹尼斯·查尔斯十世、马克斯·奇克林和帕特里斯·西玛德。付费搜索的微观市场实验。手稿在准备中,2012 年。

Harald Cramér.
Mathematical Methods of Statistics.Princeton University Press, 1946.

统计的数学方法。普林斯顿大学出版社,1946 年。

Dash.Caveats for Causal Reasoning with Equilibrium Models.PhD thesis,
University of Pittsburgh, 2003.

丹佛短跑。平衡模型因果推理的注意事项。匹兹堡大学博士论文,2003 年。

Miroslav Dudík,
Dimitru Erhan, John Langford, and Lihong Li.Sample-efficient nonstationary-


evaluation for contextual bandits.In Proceedings of Uncertainty in Artificial
Intelligence (UAI), pages 247-254, 2012.

背景土匪的政策评估。《人工智能的不确定性论文集》(UAI),第 247-254 页,2012 年。

Edelman, Michael Ostrovsky, and Michael Schwarz.Internet advertising and the


alized second
price auction: Selling billions of dollars worth of keywords.American Economic
Review, 97(1):242-259, 2007.


Genz.Numerical computation of multivariate normal probabilities.Journal
Computation of Multivariate Normal Probabilities, 1:141-149, 1992.


John C.
Gittins.Bandit Processes and Dynamic Allocation Indices.Wiley, 1989.

约翰·吉廷斯。土匪过程和动态分配指数。威利,1989 年。

Peter W.
Glynn.Likelihood ratio gradient estimation: an overview.In Proceedings of the
1987 Winter Simulation Conference, pages 366-375, 1987.

彼得·格林。似然比梯度估计:综述。1987 年冬季模拟会议记录,第 366-375 页,1987 年。

Goodwin.Microsoft adCenter.Personal communication, 2011.

约翰·古德温。Microsoft adCenter。个人交流,2011 年。

Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf


click-through rate prediction for sponsored search advertising in Microsoft's
Bing search engine.In Proceedings of the 27th International Conference on
Machine Learning (ICML 2010), Invited Applications Track.Omnipress, 2010.

微软必应搜索引擎中赞助搜索广告的贝叶斯点击率预测。在第 27 届国际机器学习会议(2010 年,ICML)的会议记录中,邀请应用跟踪。Omnipress,2010 年。





Ron Kohavi,
Roger Longbotham, Dan Sommerfield, and Randal M. Henne.Controlled experiments


on the web:
Survey and practical guide.Data Mining and Knowledge Discovery, 18(1):140-181,
July 2008.

网上:调查和实践指南。数据挖掘和知识发现,18(1):140-181,2008 年 7 月。

Kuleshov and Doina Precup.Algorithms for multi-armed bandit problems, October

沃洛季米尔·库里肖夫和多伊娜·切普。多武装匪徒问题的算法,2010 年 10 月。http://www.cs.mcgill.ca/~vkules/bandits.pdf.

Lahaie and R. Preston McAfee.Efficient ranking in sponsored search.In Proc.7th

塞巴斯蒂安·拉海耶和普雷斯顿·迈克菲。赞助搜索中的有效排名。在 Proc 中。第七名

Workshop on Internet and Network Economics (WINE 2011), pages 254-265.LNCS
7090, Springer, 2011.

互联网和网络经济学国际研讨会(WINE 2011),第 254-265 页。LNCS 7090,斯普林格,2011 年。

Lev Landau
and Evgeny Lifshitz.Course in Theoretical Physics, Volume 1: Mechanics.Pergamon
Press, 1969.2nd edition.

列夫·兰道和叶夫根尼·利弗希茨。理论物理课程,第 1 卷:力学。佩尔加蒙出版社,1969 年。第二版。

John Langford
and Tong Zhang.The epoch-greedy algorithm for multi-armed bandits with side


Advances in Neural Information Processing Systems 20, pages 817-824.MIT Press,
Cambridge, MA, 2008.

信息。神经信息处理系统进展 20,第 817-824 页。麻省理工学院出版社,剑桥,麻省,2008。

Steffen L.
Lauritzen and Thomas S. Richardson.Chain graph models and their causal
interpretation.Journal of the Royal Statistical Society, Series B, 64:321-361,

斯蒂芬·劳里岑和托马斯·理查森。链图模型及其因果解释。皇家统计学会杂志,系列 B,64:321-361,2002。

David K.
Lewis.Counterfactuals.Harvard University Press, 1973.2nd edition:
Wiley-Blackwell, 2001.

大卫·刘易斯。反事实。哈佛大学出版社,1973 年。第二版:威利-布莱克威尔,2001 年。

Lihong Li,
Wei Chu, John Langford, and Robert E. Schapire.A contextual-bandit approach to


news article recommendation.In Proceedings of the 19th International Conference
on the World Wide Web (WWW 2010), pages 661-670.ACM, 2010.

个性化新闻文章推荐。《第 19 届万维网国际会议论文集》(WWW 2010),第 661-670 页。ACM,2010 年。

Lihong Li,
Wei Chu, John Langford, and Xuanhui Wang.Unbiased offline evaluation of


news article recommendation algorithms.In Proc.4th ACM International
Confer-ence on Web Search and Data Mining (WSDM 2011), pages 297-306, 2011.

基于土匪的新闻文章推荐算法。在 Proc 中。第四届美国计算机学会网络搜索和数据挖掘国际会议(WSDM,2011 年),第 297-306 页,2011 年。

Maurer and Massimiliano Pontil.Empirical bernstein bounds and sample-variance
penal-ization.In Proc.The 22nd Conference on Learning Theory (COLT 2009), 2009.

安德烈亚斯·莫伊雷尔和马西米利亚诺·庞泰尔。经验伯恩斯坦界限和样本方差惩罚。在 Proc 中。第 22 届学习理论会议(COLT 2009),2009 年。

Milgrom.Putting Auction Theory to Work.Cambridge University Press, 2004.


Roger B.
Myerson.Optimal auction design.Mathematics of Operations Research, 6(1):58-73,


Pearl.Causality: Models, Reasoning, and Inference.Cambridge University Press,
2000.2nd edition: 2009.

朱迪亚珍珠。因果关系:模型、推理和推论。剑桥大学出版社,2000 年。第二版:2009 年。

Pearl.Causal inference in statistics: an overview.Statistics Surveys, 3:96-146,

朱迪亚珍珠。统计学中的因果推断:综述。统计调查,3:96-146,2009 年。

Pearl.The do-calculus revisited.In Proc.Twenty-Eighth Conference on Uncertainty
in Artificial Intelligence (UAI-2012), pages 3-11, 2012.

朱迪亚珍珠。重新审视微积分。在 Proc 中。第二十八届人工智能不确定性会议(UAI-2012),第 3-11 页,2012 年。

Linda E.
Reichl.A Modern Course in Statistical Physics, 2nd Edition.Wiley, 1998.

琳达·雷克尔。统计物理现代教程,第二版。威利,1998 年。





Robbins.Some aspects of the sequential design of experiments.Bulletin of the
American Mathematical Society, 58(5):527-535, 1952.


James M.
Robins, Miguel Angel Hernan, and Babette Brumback.Marginal structural models
and causal inference in epidemiology.Epidemiology, 11(5):550-560, Sep 2000.

詹姆斯·罗宾斯、米格尔·安吉尔·埃尔南和巴贝特·布鲁姆巴克。流行病学中的边缘结构模型和因果推断。流行病学,11(5):550-560,2000 年 9 月。

Sauer.On the density of families of sets.Journal of Combinatorial Theory, 13:145-147,


Seldin, cois Laviolette Fran Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter


inequalities for martingales.IEEE Transactions on Information Theory, 58(12):
7086-7093, 2012.

鞅的泛贝叶斯不等式。《信息理论杂志》,58(12): 7086-7093,2012。

Shimodaira.Improving predictive inference under covariate shift by weighting
the log likelihood function.Journal of Statistical Planning and Inference, 90(2):227-244,


Edward H.
Simpson.The interpretation of interaction in contingency tables.Journal of the
Royal Statistical Society, Ser.B, 13:238-241, 1951.


Slivkins.Contextual bandits with similarity information.JMLR Conference and
Work-shop Proceedings, 19:679-702, 2011.

亚历山大·斯莱夫金斯。具有相似信息的上下文强盗。JMLR 会议和工作间会议录,19:679-702,2011。

Peter Spirtes
and Richard Scheines.Causal inference of ambiguous manipulations.Philosophy of
Science, 71(5):833-845, Dec 2004.

彼得·斯皮特斯和理查德·申斯。模糊操作的因果推理。科学哲学,71(5):833-845,2004 年 12 月。

Spirtes, Clark Glymour, and Richard Scheines.Causation, Prediction and
Search.Springer Verlag, New York, 1993.2nd edition: MIT Press, Cambridge
(Mass.), 2011.

彼得·斯皮特斯、克拉克·格利穆尔和理查德·申斯。因果关系、预测和搜索。斯普林格·弗拉格,纽约,1993 年。第二版:麻省理工学院出版社,剑桥(马萨诸塞州。), 2011.

Stephen M.
Stigler.A historical view of statistical concepts in psychology and educational
research.American Journal of Education, 101(1):60-70, Nov 1992.

斯蒂芬·斯蒂格勒。心理学和教育研究中统计概念的历史观点。美国教育杂志,101(1):60-70,1992 年 11 月。

Sugiyama, Matthias Krauledat, and Klaus-Robert Müller.Covariate shift
adaptation by im-portance weighted cross validation.Journal of Machine Learning
Research, 8:985-1005, 2007.


Rich S.
Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction.MIT Press,
Cam-bridge, MA, 1998.


Diane Tang,
Ashish Agarwal, Deirdre O'Brien, and Mike Meyer.Overlapping experiment infras-


More, better, faster experimentation.In Proceedings 16th Conference on
Knowledge Discovery and Data Mining (KDD 2010), pages 17-26, 2010.

结构:更多、更好、更快的实验。《第 16 届知识发现和数据挖掘会议论文集》(KDD,2010),第 17-26 页,2010 年。

Vladimir N.
Vapnik.Estimation of Dependences based on Empirical Data.Springer Series in
Statistics.Springer Verlag, Berlin, New York, 1982.

弗拉基米尔·瓦普尼克。基于经验数据的相关性估计。统计学中的斯普林格系列。斯普林格·弗拉格,柏林,纽约,1982 年。

Vladimir N.
Vapnik and Alexey Ya.Chervonenkis.Uniform convegence of the frequencies of


occurence of
events to their probabilities.Proc.Academy of Sciences of the USSR, 181(4),
1968.English translation: Soviet Mathematics - Doklady, 9:915-918, 1968.

事件发生的概率。继续。苏联科学院,181(4),1968 年。英文翻译:苏联数学-多克拉迪,9:915-918,1968。

Vladimir N.
Vapnik and Alexey Ya.Chervonenkis.On the uniform convergence of relative


cies of
events to their probabilities.Theory of Probability and its Applications,
16(2):264-280, 1971.






Hal R.
Varian.Position auctions.International Journal of Industrial Organization,
25:1163-1178, 2007.


Hal R.
Varian.Online ad auctions.American Economic Review, 99(2):430-434, 2009.


Vermorel and Mehryar Mohri.Multi-armed bandit algorithms and empirical
evaluation.In Proc.European Conference on Machine Learning, pages 437-448,

乔安妮·维尔莫勒和梅赫里亚·莫赫里。多武装匪徒算法和经验评估。在 Proc 中。欧洲机器学习会议,第 437-448 页,2005 年。

Georg H. von
Wright.Explanation and Understanding.Cornell University Press, 1971.

乔治·冯·赖特。解释和理解。康奈尔大学出版社,1971 年。

Abraham Wald.Sequential
tests of statistical hypotheses.The Annals of Mathematical Statistics,
16(2):117-186, 1945.


Wiener.Cybernetics, or Control and Communication in the Animal and the Machine.


Hermann et
Cie (Paris), MIT Press (Cambridge, Mass.), Wiley and Sons (New York), 1948.2nd
Edition (expanded): MIT Press, Wiley and Sons, 1961.

赫尔曼和西(巴黎),麻省理工学院出版社(剑桥,马萨诸塞州。),威利父子公司(纽约),1948 年。第二版(扩大版):麻省理工学院出版社,威利父子出版社,1961 年。

Ronald J.
Williams.Simple statistical gradient-following algorithms for connectionist
reinforce-ment learning.Machine Learning, 8(229-256), 1992.


Woodward.Making Things Happen.Oxford University Press, 2005.

詹姆斯·伍德沃德。让事情发生。牛津大学出版社,2005 年。

Sewall S.
Wright.Correlation and causation.Journal of Agricultural Research, 20:557-585,