[论文翻译]反事实推理和学习系统:计算广告的例子

Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising 反事实推理和学习系统:计算广告的例子

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson; 14(65):3207−3260, 2013.

Abstract

This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.

这项工作展示了如何利用因果推理来理解复杂学习系统与其环境交互的行为，并预测系统变化的后果。这样的预测使得人类和算法都可以选择能够改善系统性能的变化。这项工作是通过与必应搜索引擎相关的广告投放系统的实验来说明的。

Keywords:
causation, counterfactual reasoning, computational advertising

关键词:因果关系，反事实推理，计算广告

1.Introduction

1.介绍

Statistical machine learning technologies in the real world are never without a purpose.Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified
the system design in the first place.

现实世界中的统计机器学习技术从来都是有目的的。利用他们的预测，人类或机器做出的决策，其迂回的结果往往违反了最初证明系统设计合理性的建模假设。

Such contradictions appear very clearly in the case of the learning systems that power web scale applications such as search engines, ad placement engines, or recommendation systems.For instance, the placement of advertisement on the
result pages of Internet search engines depend on the bids of advertisers and on scores computed by statistical machine learning systems.Because the scores affect the contents of the result pages proposed to the users, they directly influence
the occurrence of clicks and the corresponding advertiser payments.They also have important indirect effects.Ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this web site in the
future.They also impact the return on investment observed by the advertisers and therefore their future bids.Finally they change the nature of the data collected for training the statistical models in the future.

这种矛盾在为网络规模的应用提供动力的学习系统中表现得非常明显，例如搜索引擎、广告投放引擎或推荐系统。例如，在互联网搜索引擎的结果页面上放置广告取决于广告商的出价和统计机器学习系统计算的分数。因为分数影响向用户建议的结果页面的内容，所以它们直接影响点击的发生和相应的广告商支付。它们也有重要的间接影响。广告投放决策会影响用户的满意度，从而影响他们将来频繁访问该网站的意愿。它们还会影响广告商和他们未来的出价。最后，它们改变了为将来训练统计模型而收集的数据的性质。
These complicated interactions are clarified by important theoretical works.Under
simplified assumptions, mechanism design (Myerson, 1981) leads to an insightful
account of the advertiser feedback loop (Varian, 2007;Edelman et al.,
2007).Under simplified assumptions, multiarmed bandits theory (Robbins, 1952;Auer
et al., 2002;Langford and Zhang, 2008) and reinforcement learning (Sutton and
Barto, 1998) describe the exploration/exploitation dilemma associated with the
training feedback loop.However, none of these approaches gives a complete
account of the complex interactions found in real-life systems.

重要的理论著作阐明了这些复杂的相互作用。在简化的假设下，机制设计(Myerson，1981)导致了对广告商反馈回路的深刻理解(Varian，2007；Edelman等人，2007年)。在简化假设下，多臂老虎机理论(罗宾斯，1952；Auer等人，2002年；兰福德和张，2008)和强化学习(萨顿和，1998)描述了与训练反馈回路相关的探索/开发困境。然而，这些方法都没有给出现实系统中复杂交互的完整描述。

This contribution proposes a novel approach: we view these complicated interactions
as man-ifestations of the fundamental difference that separates correlation and
causation.Using the ad placement example as a model of our problem class, we
therefore argue that the language and the methods of causal inference provide
flexible means to describe such complex machine learning sys-tems and give
sound answers to the practical questions facing the designer of such a
system.Is it useful to pass a new input signal to the statistical model?Is it
worthwhile to collect and label a new training set?What about changing the loss
function or the learning algorithm?In order to answer such questions and
improve the operational performance of the learning system, one needs to unravel
how the information produced by the statistical models traverses the web of
causes and effects and eventually produces measurable performance metrics.

这篇文章提出了一种新的方法:我们将这些复杂的相互作用视为区分相关性和因果关系的根本区别的人工陈述。因此，使用广告放置示例作为我们问题类的模型，我们认为因果推理的语言和方法提供了灵活的手段来描述这种复杂的机器学习系统，并给出了这种系统的设计者所面临的实际问题的合理答案。向统计模型传递新的输入信号有用吗？收集和标记一个新的训练集值得吗？改变损失函数或学习算法怎么样？为了回答这些问题并提高学习系统的运行性能，人们需要弄清楚统计模型产生的信息如何穿越因果网络并最终产生可衡量的性能指标。

Readers with an interest in causal inference will find in this paper (i) a real world
example demonstrating the value of causal inference for large-scale machine
learning applications, (ii) causal inference techniques applicable to
continuously valued variables with meaningful confidence intervals, and (iii)
quasi-static analysis techniques for estimating how small interventions affect
cer-tain causal equilibria.Readers with an interest in real-life applications
will find (iv) a selection of practical counterfactual analysis techniques
applicable to many real-life machine learning systems.Readers with an interest
in computational advertising will find a principled framework that (v)
ex-plains how to soundly use machine learning techniques for ad placement, and
(vi) conceptually connects machine learning and auction theory in a compelling
manner.

对因果推理感兴趣的读者将在本文中发现

(I)一个真实世界的例子，展示了因果推理对于大规模机器学习应用的价值，
(ii)适用于具有有意义置信区间的连续值变量的因果推理技术，以及
(iii)用于估计小干预如何影响特定因果均衡的准静态分析技术。

对现实应用感兴趣的读者将会发现
(iv)适用于许多现实机器学习系统的实用反事实分析技术的选择。对计算广告感兴趣的读者会发现一个有原则的框架，该框架(v)解释了如何合理地使用机器学习技术进行广告投放，以及(vi)以引人注目的方式在概念上将机器学习和拍卖理论联系起来。

The paper is organized as follows.Section 2 gives an overview of the advertisement placement
problem which serves as our main example.In particular, we stress some of the
difficulties encoun-tered when one approaches such a problem without a
principled perspective.Section 3 provides a condensed review of the essential
concepts of causal modeling and inference.Section 4 centers on formulating and
answering counterfactual questions such as "how would the system have
performed during the data collection period if certain interventions had been
carried out on the system ?"We describe importance sampling methods for
counterfactual analysis, with clear conditions of validity and confidence
intervals.Section 5 illustrates how the structure of the causal graph reveals
opportu-nities to exploit prior information and vastly improve the confidence
intervals.Section 6 describes how counterfactual analysis provides essential
signals that can drive learning algorithms.Assume that we have identified
interventions that would have caused the system to perform well during the data
collection period.Which guarantee can we obtain on the performance of these
same inter-ventions in the future?Section 7 presents counterfactual
differential techniques for the study of equlibria.Using data collected when
the system is at equilibrium, we can estimate how a small intervention
displaces the equilibrium.This provides an elegant and effective way to reason
about long-term feedback effects.Various appendices complete the main text with
information that we think more relevant to readers with specific backgrounds.

论文组织如下。
第2节概述了广告投放问题，这是我们的主要例子。特别是，我们强调当一个人在没有原则性观点的情况下处理这样一个问题时会遇到的一些困难。

第3节简要回顾了因果建模和推理的基本概念。

第4节集中阐述和回答反事实问题，如“如果对系统进行了某些干预，系统在数据收集期间会有什么表现？”我们描述了用于反事实分析的重要性抽样方法，具有明确的有效性条件和置信区间。

第5节说明了因果图的结构如何揭示利用先验信息和大大提高置信区间的机会。

第6节描述了反事实分析如何提供驱动学习算法的基本信号。假设我们已经确定了会导致系统在数据收集期间表现良好的干预措施。我们能从这些相同的干预措施的未来表现中获得什么保证？

第7节介绍了研究马库利亚的反事实微分技术。利用系统处于平衡时收集的数据，我们可以估计一个小的干预如何取代平衡。这为长期反馈效应提供了一种优雅而有效的推理方式。各种附录用我们认为与特定背景的读者更相关的信息完成了正文。

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

2.Causation Issues in Computational Advertising 计算广告中的因果关系问题

After giving an overview of the advertisement placement problem, which serves as our main
ex-ample, this section illustrates some of the difficulties that arise when one
does not pay sufficient attention to the causal structure of the learning
system.

在概述了作为我们主要例子的广告投放问题之后，本节说明了当人们没有充分注意学习系统的因果结构时出现的一些困难。

2.1 Advertisement Placemen广告投放

All Internet users are now familiar with the advertisement messages that adorn popular web
pages.Advertisements are particularly effective on search engine result pages
because users who are searching for something are good targets for advertisers
who have something to offer.Several actors take part in this Internet
advertisement game:

现在所有的互联网用户都熟悉流行网页上的广告信息。广告在搜索引擎结果页面上特别有效，因为搜索某样东西的用户是广告商的好目标。在这个网络广告游戏中有如下几个角色:

• Advertisers create advertisement messages, and place bids that describe how much they are
willing to pay to see their ads displayed or clicked.

广告商 创建广告信息，并出价描述他们愿意支付多少费用来进行广告显示或点击。

• Publishers
provide attractive web services, such as, for instance, an Internet search
engine.They display selected ads and expect to receive payments from the
advertisers.The infras-tructure to collect the advertiser bids and select ads
is sometimes provided by an advertising network on behalf of its affiliated
publishers.For the purposes of this work, we simply con-sider a publisher large
enough to run its own infrastructure.

出版商提供有吸引力的网络服务，例如互联网搜索引擎。他们展示选定的广告，并期望从广告商那里获得报酬。收集广告商出价和选择广告的基础设施有时是由广告网络代表其附属出版商提供的。为了这项工作的目的，我们只是考虑一个足够大的发行商来运行它自己的基础设施。

• Users
reveal information about their current interests, for instance, by entering a
query in a search engine.They are offered web pages that contain a selection of
ads (Figure 1).Users sometimes click on an advertisement and are transported to
a web site controlled by the ad-vertiser where they can initiate some business.

用户透露他们当前兴趣的信息，例如，通过在搜索引擎中输入查询。它们被提供包含精选广告的网页(图1)。用户有时点击一个广告，然后被传送到由广告发布者控制的网站，在那里他们可以开始一些业务。

A conventional bidding language is necessary to precisely define under which
conditions an adver-tiser is willing to pay the bid amount.In the case of
Internet search advertisement, each bid specifies (a) the advertisement
message, (b) a set of keywords, (c) one of several possible matching criteria
between the keywords and the user query, and (d) the maximal price the
advertiser is willing to pay when a user clicks on the ad after entering a
query that matches the keywords according to the specified criterion.

传统的投标语言是必要的，以准确定义在什么条件下，一个对手愿意支付投标金额。在互联网搜索广告的情况下，每个出价指定
(a)广告消息，
(b)一组关键词，
(c)关键词和用户查询之间的几个可能匹配标准之一，以及
(d)当用户在输入根据指定标准匹配关键词的查询之后点击广告时，广告商愿意支付的最大价格。

Whenever a user visits a publisher web page, an advertisement placement engine runs an
auction in real time in order to select winning ads, determine where to display
them in the page, and compute the prices charged to advertisers, should the
user click on their ad.Since the placement engine is operated by the publisher,
it is designed to further the interests of the publisher.Fortunately for
everyone else, the publisher must balance short term interests, namely the
immediate revenue brought by the ads displayed on each web page, and long term
interests, namely the future revenues resulting from the continued satisfaction
of both users and advertisers.

每当用户访问发布者网页时，广告投放引擎实时运行拍卖，以便选择获胜的广告，确定在页面中的何处显示它们，并计算如果用户点击他们的广告向广告商收取的价格。由于布局引擎是由发布者操作的，因此它旨在促进发布者的利益。对其他人来说幸运的是，出版商必须平衡短期利益，即每个网页上显示的广告带来的直接收入，和长期利益，即用户和广告商持续满意带来的未来收入。

Auction theory explains how to design a mechanism that optimizes the revenue of the
seller of a single object (Myerson, 1981;Milgrom, 2004) under various
assumptions about the information available to the buyers regarding the
intentions of the other buyers.In the case of the ad placement problem, the publisher
runs multiple auctions and sells opportunities to receive a click.When nearly
identical auctions occur thousand of times per second, it is tempting to
consider that the advertisers have perfect information about each other.This
assumption gives support to the popular generalized second price rank-score
auction (Varian, 2007;Edelman et al., 2007):

拍卖理论解释了如何设计一种机制来优化单个物品卖家的收入(Myerson，1981；Milgrom，2004)在关于买方可获得的关于其他买方意图的信息的各种假设下。在广告投放问题的情况下，发布者进行多次拍卖，并出售获得点击的机会。当几乎相同的拍卖每秒钟发生数千次时，人们很容易认为广告商拥有关于彼此的完美信息。这一假设支持了流行的广义第二价格等级分数拍卖(瓦里安，2007；Edelman等人，2007年):

Figure 1:
Mainline and sidebar ads on a search result page.Ads placed in the mainline are more
likely to be noticed, increasing both the chances of a click if the ad is relevant and the
risk of annoying the user if the ad is not relevant.

图1:搜索结果页面上的主线和侧边栏广告。主线上的广告更可能会被注意到，这既增加了广告相关时点击的机会，也增加了广告不相关时惹恼用户的风险。

• Let x represent the auction context information, such as the user query, the user
profile, the date, the time, etc.The ad placement engine first determines all
eligible ads a1 ...an and the corresponding bids b1 ...bn on the basis of the
auction context x and of the matching criteria specified by the advertisers.

让x代表拍卖上下文信息，如用户查询、用户档案、日期、时间等。广告投放引擎首先确定所有合格的广告a1...an和相应的投标b1...bn基于拍卖上下文x和广告商指定的匹配标准。

• For each selected ad ai and each potential position p on the web page, a statistical
model outputs the estimate qi,p(x) of the probability that ad ai displayed in
position p receives a user click.The rank-score ri,p(x) = biqi,p(x) then
represents the purported value associated with placing ad ai at position p.

对于每个选定的广告$a_i$ 和网页上的每个潜在位置p，统计模型输出显示在位置p的广告$a_i$ 接收用户点击的概率的估计$q_i$ ，p(x)。等级分数ri，p(x) = $b_i$ $q_i$ ，p(x)则表示与将广告 $a_i$ 放置在位置p相关联的声称值。

• Let L represent a possible ad layout, that is, a set of positions that can
simultaneously be populated with ads, and let $L$ be the set of possible ad layouts, including of course the
empty layout.The optimal layout and the corresponding ads are obtained by maximizing the total
rank-score

让L代表一种可能的广告布局，即一组可以同时用广告填充，让$L$是一组可能的广告布局，当然也包括空的布局。通过最大化总排名分数来获得最佳布局和相应的广告

subject to
reserve constraints and also subject to diverse policy constraints, such as, for instance, preventing the

simultane-ous display of multiple ads belonging to the same advertiser.Under
mild assumptions, this discrete maximization problem is amenable to
computationally efficient greedy algorithms (see appendix A.)

受储备限制,并且还受到不同的政策约束，例如防止属于同一广告商的多个广告同时显示。在温和的假设下，这个离散的最大化问题服从于计算高效的贪婪算法(见附录a)

• The
advertiser payment associated with a user click is computed using the
generalized second price (GSP) rule: the advertiser pays the smallest bid that
it could have entered without chang-ing the solution of the discrete
maximization problem, all other bids remaining equal.In other words, the
advertiser could not have manipulated its bid and obtained the same treatment
for a better price.

与用户点击相关联的广告商支付是使用广义第二价格(GSP)规则计算的:广告商支付其可能输入的最小出价，而不改变离散最大化问题的解决方案，所有其他出价保持相等。换句话说，广告商不可能操纵其出价，以更好的价格获得同样的待遇。

Under the
perfect information assumption, the analysis suggests that the publisher simply
needs to find which reserve prices Rp(x) yield the best revenue per
auction.However, the total revenue of the publisher also depends on the traffic
experienced by its web site.Displaying an excessive number of irrelevant ads
can train users to ignore the ads, and can also drive them to competing web
sites.Advertisers can artificially raise the rank-scores of irrelevant ads by
temporarily increasing the bids.Indelicate advertisers can create deceiving
advertisements that elicit many clicks but direct users to spam web sites.Experience
shows that the continued satisfaction of the users is more important to the
publisher than it is to the advertisers.

在完美信息假设下，分析表明，发行商只需找到哪个保留价格Rp(x)在每次拍卖中产生最佳收益。然而，出版商的总收入也取决于其网站的流量。显示过多不相关的广告可以训练用户忽略广告，也可以驱使他们去竞争网站。广告商可以通过临时提高出价来人为提高不相关广告的排名分数。无礼的广告商可以制造欺骗性的广告，吸引大量点击，但却将用户引向垃圾网站。经验表明，用户的持续满意度对出版商来说比对广告商更重要。

Therefore the
generalized second price rank-score auction has evolved.Rank-scores have been
augmented with terms that quantify the user satisfaction or the ad
relevance.Bids receive adaptive discounts in order to deal with situations
where the perfect information assumption is unrealistic.These adjustments are
driven by additional statistical models.The ad placement engine should
therefore be viewed as a complex learning system interacting with both users
and advertisers.

因此，广义第二价格等级评分拍卖已经演变。排名分数已经增加了量化用户满意度或广告相关性的术语。投标获得自适应折扣，以应对完美信息假设不现实的情况。这些调整由额外的统计模型驱动。因此，广告投放引擎应该被视为一个复杂的学习系统，与用户和广告商进行交互。

2.2
Controlled Experiments

2.2对照实验

The designer
of such an ad placement engine faces the fundamental question of testing
whether a proposed modification of the ad placement engine results in an
improvement of the operational performance of the system.

这种广告投放引擎的设计者面临一个基本问题，即测试广告投放引擎的建议修改是否导致系统操作性能的改善。

The simplest
way to answer such a question is to try the modification.The basic idea is to
randomly split the users into treatment and control groups (Kohavi et al.,
2008).Users from the control group see web pages generated using the unmodified
system.Users of the treatment groups see web pages generated using alternate
versions of the system.Monitoring various performance metrics for a couple
months usually gives sufficient information to reliably decide which variant of
the system delivers the most satisfactory performance.

回答这样的问题最简单的方法就是尝试修改。基本思想是将用户随机分为治疗组和对照组(Kohavi等人，2008年)。控制组的用户可以看到使用未修改的系统生成的网页。治疗组的用户可以看到使用系统的替代版本生成的网页。监控几个月的各种性能指标通常可以提供足够的信息来可靠地决定系统的哪种变体提供最令人满意的性能。

Modifying an
advertisement placement engine elicits reactions from both the users and the
ad-vertisers.Whereas it is easy to split users into treatment and control
groups, splitting advertisers into treatment and control groups demands special
attention because each auction involves multi-ple advertisers (Charles et al.,
2012).Simultaneously controlling for both users and advertisers is probably
impossible.

修改广告投放引擎会引起用户和广告发布者的反应。虽然很容易将用户分成治疗组和对照组，但将广告商分成治疗组和对照组需要特别注意，因为每次拍卖都涉及多个广告商(查尔斯等人，2012年)。同时控制用户和广告商可能是不可能的。

Controlled
experiments also suffer from several drawbacks.They are expensive because they
demand a complete implementation of the proposed modifications.They are slow
because each ex-periment typically demands a couple months.Finally, although
there are elegant ways to efficiently run overlapping controlled experiments on
the same traffic (Tang et al., 2010), they are limited by the volume of traffic
available for experimentation.

受控实验也有几个缺点。它们是昂贵的，因为它们要求完全实现提议的修改。它们很慢，因为每次实验通常需要几个月的时间。最后，虽然有很好的方法可以在相同的流量上有效地运行重叠的受控实验(Tang等人，2010)，但是它们受到可用于实验的流量的限制。

It is
therefore difficult to rely on controlled experiments during the conception
phase of potential improvements to the ad placement engine.It is similarly
difficult to use controlled experiments to drive the training algorithms
associated with click probability estimation models.Cheaper and faster
statistical methods are needed to drive these essential aspects of the
development of an ad placement engine.Unfortunately, interpreting cheap and
fast data can be very deceiving.

因此，在广告投放引擎潜在改进的构思阶段，很难依赖受控实验。同样难以使用受控实验来驱动与点击概率估计模型相关联的训练算法。需要更便宜、更快的统计方法来推动广告投放引擎发展的这些重要方面。不幸的是，解释廉价而快速的数据可能非常具有欺骗性。

2.3
Confounding Data

2.3混杂数据

Assessing the
consequence of an intervention using statistical data is generally challenging
because it is often difficult to determine whether the observed effect is a
simple consequence of the interven-tion or has other uncontrolled causes.

使用统计数据评估干预的结果通常具有挑战性，因为通常很难确定观察到的效果是干预的简单结果还是有其他不受控制的原因。

	Overall总数	Patients with small stones 小结石患者	Patients with large stones大结石患者
Treatment A:Open surgery开放手术	78% (273/350)	93% (81/87)	73% (192/263)
Treatment B:Percutaneous nephrolithotomy 经皮肾镜取石术	83% (289/350)	87% (234/270)	69% (55/80)

Table 1: A
classic example of Simpson's paradox.The table reports the success rates of two
treat-ments for
kidney stones (Charig et al., 1986, Tables I and II).Although the overall
success rate of treatment B seems better, treatment B performs worse than
treatment A on both patients with small kidney stones and patients with large
kidney stones.See Section 2.3.

表1:辛普森悖论的经典例子。该表报告了两次治疗的成功率-
肾结石治疗(Charig等人，1986年，表一和表二)。尽管治疗B的总体成功率似乎更高，但无论是小肾结石患者还是大肾结石患者，治疗B的表现都比治疗A差。参见第2.3节。

For instance,
the empirical comparison of certain kidney stone treatments illustrates this
dif-ficulty (Charig et al., 1986).Table 2.3 reports the success rates observed
on two groups of 350 patients treated with respectively open surgery (treatment
A, with 78% success) and percutaneous nephrolithotomy (treatment B, with 83%
success).Although treatment B seems more successful, it was more frequently
prescribed to patients suffering from small kidney stones, a less serious
con-dition.Did treatment B achieve a high success rate because of its intrinsic
qualities or because it was preferentially applied to less severe cases?Further
splitting the data according to the size of the kidney stones reverses the
conclusion: treatment A now achieves the best success rate for both patients
suffering from large kidney stones and patients suffering from small kidney
stones.Such an inversion of the conclusion is called Simpson's paradox
(Simpson, 1951).

例如，某些肾结石治疗的经验比较说明了这一困难(Charig等人，1986)。表2.3报告了两组分别接受开放手术(治疗A，成功率78%)和经皮肾镜取石术(治疗B，成功率83%)治疗的350名患者的成功率。虽然治疗乙似乎更成功，但它更经常被开给患有小肾结石的患者，这是一种不太严重的情况。B治疗获得高成功率是因为它的内在质量还是因为它优先应用于不太严重的病例？根据肾结石的大小进一步分割数据会逆转结论:治疗A现在对患有大肾结石的患者和患有小肾结石的患者都达到了最佳成功率。这种结论的颠倒被称为辛普森悖论(辛普森，1951)。

The stone
size in this study is an example of a confounding variable, that is an
uncontrolled variable whose consequences pollute the effect of the
intervention.Doctors knew the size of the kidney stones, chose to treat the
healthier patients with the least invasive treatment B, and therefore caused
treatment B to appear more effective than it actually was.If we now decide to
apply treat-ment B to all patients irrespective of the stone size, we break the
causal path connecting the stone size to the outcome, we eliminate the
illusion, and we will experience disappointing results.

这项研究中的结石大小是一个混杂变量的例子，这是一个不受控制的变量，其结果污染了干预的效果。医生知道肾结石的大小，选择用侵入性最小的治疗B来治疗更健康的患者，因此导致治疗B看起来比实际更有效。如果我们现在决定对所有患者进行治疗，而不考虑结石的大小，我们就打破了结石大小和结果之间的因果关系，我们就消除了幻觉，我们将经历令人失望的结果。

When we
suspect the existence of a confounding variable, we can split the contingency
tables and reach improved conclusions.Unfortunately we cannot fully trust these
conclusions unless we are certain to have taken into account all confounding
variables.The real problem therefore comes from the confounding variables we do
not know.

当我们怀疑混杂变量的存在时，我们可以拆分列联表并得出改进的结论。不幸的是，我们不能完全相信这些结论，除非我们确信已经考虑了所有的混杂变量。因此，真正的问题来自我们不知道的混杂变量。

Randomized
experiments arguably provide the only correct solution to this problem (see
Stigler, 1992).The idea is to randomly chose whether the patient receives
treatment A or treatment B. Because this random choice is independent from all
the potential confounding variables, known and unknown, they cannot pollute the
observed effect of the treatments (see also Section 4.2).This is why controlled
experiments in ad placement (Section 2.2) randomly distribute users between
treatment and control groups, and this is also why, in the case of an ad
placement engine, we should be somehow concerned by the practical impossibility
to randomly distribute both users and advertisers.

随机实验可以说是这个问题唯一正确的解决方案(见斯蒂格勒，1992)。想法是随机选择患者是接受治疗A还是治疗b。因为这种随机选择独立于所有已知和未知的潜在混杂变量，所以它们不能污染观察到的治疗效果(另见第4.2节)。这就是为什么广告投放的受控实验(第2.2节)在治疗组和对照组之间随机分配用户，这也是为什么在广告投放引擎的情况下，我们应该以某种方式关注随机分配用户和广告商的实际可能性。

	Overall	q2 low	q2 high
q1 low	6.2%(124/2000)	5.1% (92/1823)	18.1% (32/176)
q1 high	7.5% (149/2000)	4.8% (71/1500)	15.6% (78/500)

Table 2:
Confounding data in ad placement.The table reports the click-through rates and
the click counts of the
second mainline ad.The overall counts suggest that the click-through rate of
the second mainline ad increases when the click probability estimate q1 of the
top ad is high.However, if we further split the pages according to the click
probability estimate q2 of the second mainline ad, we reach the opposite
conclusion.See Section 2.4.

表2:广告投放的混杂数据。该表报告了点击率和点击量第二条主线广告的计数。总体计数表明，当顶部广告的点击概率估计值q1较高时，第二主线广告的点击率会增加。然而，如果我们根据第二条主线广告的点击概率估计q2进一步分割页面，我们会得出相反的结论。参见第2.4节。

2.4
Confounding Data in Ad Placement

2.4广告投放中的混杂数据

Let us return
to the question of assessing the value of passing a new input signal to the ad
placement engine click prediction model.Section 2.1 outlines a placement method
where the click probability estimates qi,p(x) depend on the ad and the position
we consider, but do not depend on other ads displayed on the page.We now
consider replacing this model by a new model that additionally uses the
estimated click probability of the top mainline ad to estimate the click
probability of the second mainline ad (Figure 1).We would like to estimate the
effect of such an intervention using existing statistical data.

让我们回到评估向广告投放引擎点击预测模型传递新输入信号的价值的问题。第2.1节概述了一种布局方法，其中点击概率估计qi，p(x)取决于广告和我们考虑的位置，但不取决于页面上显示的其他广告。我们现在考虑用一个新模型来代替这个模型，这个新模型额外使用顶部主线广告的估计点击概率来估计第二个主线广告的点击概率(图1)。我们希望利用现有的统计数据来评估这种干预的效果。

We have
collected ad placement data for Bing search result pages served during three
consecu-tive hours on a certain slice of traffic.Let q1 and q2 denote the click
probability estimates computed by the existing model for respectively the top
mainline ad and the second mainline ad.After ex-cluding pages displaying fewer
than two mainline ads, we form two groups of 2000 pages randomly picked among
those satisfying the conditions q1 < 0.15 for the first group and q1 ≥ 0.15
for the second group.Table 2.4 reports the click counts and frequencies
observed on the second mainline ad in each group.Although the overall numbers
show that users click more often on the second mainline ad when the top
mainline ad has a high click probability estimate q1, this conclusion is
reversed when we further split the data according to the click probability
estimate q2 of the second mainline ad.

我们已经收集了必应搜索结果页面的广告投放数据，这些页面在连续三个小时的特定流量中提供服务。让q1和q2分别表示由现有模型为顶部主线广告和第二主线广告计算的点击概率估计。在包括显示少于两个主线广告的页面之后，我们形成两组2000页的页面，从满足第一组q1 < 0.15和第二组q1 ≥ 0.15的条件的页面中随机挑选。表2.4报告了每组中第二条主线广告的点击次数和频率。虽然总体数字显示，当顶级主线广告具有高点击概率估计值q1时，用户更频繁地点击第二主线广告，但当我们根据第二主线广告的点击概率估计值q2进一步分割数据时，这一结论被颠倒了。

Despite
superficial similarities, this example is considerably more difficult to
interpret than the kidney stone example.The overall click counts show that the
actual click-through rate of the second mainline ad is positively correlated
with the click probability estimate on the top mainline ad.Does this mean that
we can increase the total number of clicks by placing regular ads below
frequently clicked ads?

尽管表面相似，这个例子比肾结石的例子更难解释。总体点击计数显示，第二条主线广告的实际点击率与顶部主线广告的点击概率估计值正相关。这是否意味着我们可以通过在经常点击的广告下方放置常规广告来增加总点击量？

Remember that
the click probability estimates depend on the search query which itself depends
on the user intention.The most likely explanation is that pages with a high q1
are frequently asso-ciated with more commercial searches and therefore receive
more ad clicks on all positions.The observed correlation occurs because the
presence of a click and the magnitude of the click probabil-ity estimate q1
have a common cause: the user intention.Meanwhile, the click probability
estimate q2 returned by the current model for the second mainline ad also
depend on the query and therefore the user intention.Therefore, assuming that
this dependence has comparable strength, and assuming that there are no other
causal paths, splitting the counts according to the magnitude of q2 factors out
the effects of this common confounding cause.We then observe a negative
correlation which now suggests that
a frequently clicked top mainline ad has a negative impact on the click-through
rate of the second mainline ad.

请记住，点击概率估计取决于搜索查询，而搜索查询本身取决于用户意图。最可能的解释是，q1高的页面通常与更多的商业搜索相关联，因此在所有位置都能获得更多的广告点击。出现观察到的相关性是因为点击的存在和点击概率估计值q1的大小有一个共同的原因:用户意图。同时，当前模型为第二主线广告返回的点击概率估计q2也取决于查询，因此也取决于用户意图。因此，假设这种依赖性具有相当的强度，并且假设没有其他因果路径，根据q2因子的大小将计数分开，排除这种常见混杂原因的影响。然后我们观察到一个负相关
表明频繁点击的顶级主线广告对第二条主线广告的点击率有负面影响。

If this is
correct, we would probably increase the accuracy of the click prediction model
by switching to the new model.This would decrease the click probability
estimates for ads placed in the second mainline position on commercial search
pages.These ads are then less likely to clear the reserve and therefore more likely
to be displayed in the less attractive sidebar.The net result is probably a
loss of clicks and a loss of money despite the higher quality of the click
probability model.Although we could tune the reserve prices to compensate this
unfortunate effect, nothing in this data tells us where the performance of the
ad placement engine will land.Furthermore, unknown confounding variables might
completely reverse our conclusions.

如果这是正确的，我们可能会通过切换到新模型来提高点击预测模型的准确性。这将降低广告在商业搜索页面第二主线位置的点击概率估计。然后，这些广告不太可能清除保留区，因此更有可能显示在不太吸引人的侧栏中。尽管点击概率模型的质量更高，但最终结果可能是点击量的损失和金钱的损失。尽管我们可以调整保留价格来补偿这种不幸的影响，但这些数据中没有任何内容告诉我们广告投放引擎的性能将落在哪里。此外，未知的混杂变量可能会完全推翻我们的结论。

Making sense
out of such data is just too complex !

从这样的数据中获得意义实在是太复杂了！

2.5 A Better Way

2.5更好的方式

It should now
be obvious that we need a more principled way to reason about the effect of
potential interventions.We provide one such more principled approach using the
causal inference machinery (Section 3).The next step is then the identification
of a class of questions that are sufficiently ex-pressive to guide the designer
of a complex learning system, and sufficiently simple to be answered using data
collected in the past using adequate procedures (Section 4).

现在应该很明显，我们需要一种更有原则的方式来推理潜在干预的效果。我们使用因果推理机制提供了一种更有原则的方法(第3节)。然后，下一步是确定一类问题，这些问题足够有代表性，可以指导复杂学习系统的设计者，并且足够简单，可以使用过去使用适当程序收集的数据来回答(第4节)。

A machine
learning algorithm can then be viewed as an automated way to generate questions
about the parameters of a statistical model, obtain the corresponding answers,
and update the param-eters accordingly (Section 6).Learning algorithms derived
in this manner are very flexible: human designers and machine learning
algorithms can cooperate seamlessly because they rely on similar sources of
information.

然后，机器学习算法可以被视为自动生成关于统计模型参数的问题、获得相应答案并相应更新参数的方法(第6节)。以这种方式导出的学习算法非常灵活:人类设计者和机器学习算法可以无缝协作，因为它们依赖于相似的信息源。

3.Modeling
Causal Systems

3.因果系统建模

When we point
out a causal relationship between two events, we describe what we expect to
happen to the event we call the effect, should an external operator manipulate
the event we call the cause.Manipulability theories of causation (von Wright,
1971;Woodward, 2005) raise this commonsense insight to the status of a
definition of the causal relation.Difficult adjustments are then needed to
interpret statements involving causes that we can only observe through their
effects, "because they love me," or that are not easily manipulated,
"because the earth is round."

当我们指出两个事件之间的因果关系时，我们描述了如果外部操作者操纵我们称为原因的事件，我们期望发生在我们称为结果的事件上的事情。因果关系的可操作性理论(冯·赖特，1971；伍德沃德，2005)将这种常识性的洞察力提升到因果关系定义的地位。然后需要困难的调整来解释涉及原因的陈述，这些原因我们只能通过它们的影响来观察，“因为它们爱我”，或者不容易被操纵，“因为地球是圆的。”

Modern
statistical thinking makes a clear distinction between the statistical model
and the world.The actual mechanisms underlying the data are considered
unknown.The statistical models do not need to reproduce these mechanisms to
emulate the observable data (Breiman, 2001).Better models are sometimes
obtained by deliberately avoiding to reproduce the true mechanisms (Vapnik,
1982, Section 8.6).We can approach the manipulability puzzle in the same spirit
by viewing causation as a reasoning model (Bottou, 2011) rather than a property
of the world.Causes and effects are simply the pieces of an abstract reasoning
game.Causal statements that are not empirically testable acquire validity when
they are used as intermediate steps when one reasons about manipulations or
interventions amenable to experimental validation.

现代统计思维明确区分了统计模型和世界。数据背后的实际机制被认为是未知的。统计模型不需要再现这些机制来模拟可观察到的数据(Breiman，2001)。更好的模型有时是通过故意避免再现真实的机制而获得的(Vapnik，1982，第8.6节)。我们可以通过将因果关系视为一种推理模型(Bottou，2011)而不是世界的一种属性，以同样的精神来解决可操作性难题。原因和结果只是抽象推理游戏的片段。不可通过经验检验的因果陈述，当一个人提出可以通过实验验证的操作或干预时，当它们被用作中间步骤时，就获得了有效性。

This section
presents the rules of this reasoning game.We largely follow the framework
pro-posed by Pearl (2009) because it gives a clear account of the connections
between causal models and probabilistic models.

本节介绍了这个推理游戏的规则。我们很大程度上遵循了珀尔(2009)提出的框架，因为它给出了因果模型和概率模型之间联系的清晰描述。

Query context
x from user intent u. Eligible ads (ai) from query x and inventory v.
Corresponding bids (bi).

来自用户意图的查询上下文x。来自查询x和库存v的合格广告(ai)对应的出价(bi)。

Scores
(qi,p,Rp) from query x and ads a. Ad slate s from eligible ads a, scores q and
bids b. Corresponding click prices c.

来自查询x和广告a的分数(qi，p，Rp)。来自合格广告a的广告名单s，分数q和出价b。相应的点击价格c。

User clicks y
from ad slate s and user intent u. Revenue z from clicks y and prices c.

用户从广告名单中点击y，用户意图从点击y和价格c中获得收入

Figure 2: A structural
equation model for ad placement.The sequence of equations describes the

图2:广告投放的结构方程模型。方程式序列描述了

flow of
information.The functions fk describe how effects depend on their direct
causes.The additional noise variables εk represent independent sources of
randomness useful to model probabilistic dependencies.

信息流。函数fk描述了效果如何依赖于它们的直接原因。额外的噪声变量εk代表独立的随机源，可用于建模概率相关性。

3.1 The Flow
of Information

3.1信息流

Figure 2
gives a deterministic description of the operation of the ad placement
engine.Variable u represents the user and his or her intention in an
unspecified manner.The query and query context x is then expressed as an
unknown function of the u and of a noise variable ε1.Noise variables in this
framework are best viewed as independent sources of randomness useful for
modeling a nondeterministic causal dependency.We shall only mention them when
they play a specific role in the discussion.The set of eligible ads a and the
corresponding bids b are then derived from the query x and the ad inventory v
supplied by the advertisers.Statistical models then compute a collection of
scores q such as the click probability estimates qi,p and the reserves Rp
introduced in Section 2.1.The placement logic uses these scores to generate the
"ad slate" s, that is, the set of winning ads and their assigned
positions.The corresponding click prices c are computed.The set of user clicks
y is expressed as an unknown function of the ad slate s and the user intent u.
Finally the revenue z is expressed as another function of the clicks y and the
prices c.

图2给出了广告投放引擎操作的确定性描述。变量u以未指明的方式表示用户及其意图。然后，查询和查询上下文x被表示为u和噪声变量ε1的未知函数。这个框架中的噪声变量最好被视为独立的随机源，有助于对不确定的因果相关性进行建模。只有当它们在讨论中发挥具体作用时，我们才会提到它们。合格广告a的集合和相应的出价b然后从查询x和广告商提供的广告目录v中导出。统计模型然后计算分数q的集合，例如点击概率估计qi、p和第2.1节中介绍的储备Rp。放置逻辑使用这些分数来生成“广告名单”，即一组获胜的广告及其分配的位置。计算相应的点击价格c。用户点击量y的集合被表示为广告名单s和用户意图u的未知函数。最后，收入z被表示为点击量y和价格c的另一个函数

Such a system
of equations is named structural equation model (Wright, 1921).Each equation
asserts a functional dependency between an effect, appearing on the left hand
side of the equation, and its direct causes, appearing on the right hand side
as arguments of the function.Some of these causal dependencies are
unknown.Although we postulate that the effect can be expressed as some function
of its direct causes, we do not know the form of this function.For instance,
the designer of the ad placement engine knows functions f2 to f6 and f8 because
he has designed them.However, he does not know the functions f1 and f7 because
whoever designed the user did not leave sufficient documentation.

这样的方程组被称为结构方程模型(Wright，1921)。每一个方程都断言出现在方程左边的一个效应与其直接原因之间的函数依赖关系，直接原因作为函数的自变量出现在右边。其中一些因果依赖是未知的。虽然我们假设效果可以表示为其直接原因的某种函数，但我们不知道这种函数的形式。例如，广告投放引擎的设计者知道函数f2到f6和f8，因为他已经设计了它们。然而，他不知道函数f1和f7，因为设计用户的人没有留下足够的文档。

Figure 3
represents the directed causal graph associated with the structural equation
model.Each arrow connects a direct cause to its effect.The noise variables are
omitted for simplicity.The structure of this graph reveals fundamental
assumptions about our model.For instance, the user clicks y do not directly
depend on the scores q or the prices c because users do not have access to this
information.

图3表示与结构方程模型相关的有向因果图。每个箭头都将直接原因与其结果联系起来。为简单起见，噪声变量被省略。这个图表的结构揭示了关于我们模型的基本假设。例如，用户点击y并不直接取决于分数q或价格c，因为用户无法访问这些信息。

We hold as a
principle that causation obeys the arrow of time: causes always precede their
effects.Therefore the causal graph must be acyclic.Structural equation models
then support two fundamental operations, namely simulation and intervention.

我们认为因果关系遵循时间之箭是一个原则:原因总是在结果之前。因此因果图必须是非循环的。结构方程模型支持两个基本操作，即模拟和干预。

Figure 3:
Causal graph associated with the structural equation model of Figure 2.The
mutually independent
noise variables ε1 to ε8 are implicit.The variables a, b, q, s, c, and z depend
on their direct causes in known ways.In contrast, the variables u and v are
exogenous and the variables x and y depend on their direct causes through
unknown functions.

图3:与图2的结构方程模型相关的因果图。相互之间独立噪声变量ε1至ε8是隐式的。变量a、b、q、s、c和z以已知的方式依赖于它们的直接原因。相比之下，变量u和v是外生的，变量x和y通过未知函数依赖于它们的直接原因。

• Simulation

Let us assume that we know both the exact form of all functional dependencies
and the value of all exogenous variables, that is, the variables that never
appear in the left hand side of an equation.We can compute the values of all
the remaining variables by applying the equations in their natural time
sequence.

模拟-让我们假设我们知道所有函数依赖的确切形式和所有外生变量的值，即从未出现在等式左侧的变量。我们可以通过在自然时间序列中应用这些方程来计算所有剩余变量的值。

•
Intervention - As long as the causal graph remains acyclic, we can construct
derived structural equation models using arbitrary algebraic manipulations of
the system of equations.For instance, we can clamp a variable to a constant
value by rewriting the right-hand side of the corresponding equation as the
specified constant value.

干预——只要因果图保持非循环，我们就可以使用方程组的任意代数操作来构建导出的结构方程模型。例如，我们可以通过将相应等式的右侧重写为指定的常数值，将变量钳制为常数值。

The algebraic
manipulation of the structural equation models provides a powerful language to
describe interventions on a causal system.This is not a coincidence.Many
aspects of the mathe-matical notation were invented to support causal inference
in classical mechanics.However, we no longer have to interpret the variable
values as physical quantities: the equations simply describe the flow of
information in the causal model (Wiener, 1948).

结构方程模型的代数操作为描述因果系统上的干预提供了一种强大的语言。这不是巧合。数学符号的许多方面是为了支持经典力学中的因果推理而发明的。然而，我们不再需要将变量值解释为物理量:方程简单地描述了因果模型中的信息流(维纳，1948)。

3.2 The
Isolation Assumption

3.2隔离假设

Let us now
turn our attention to the exogenous variables, that is, variables that never
appear in the left hand side of an equation of the structural model.Leibniz's
principle of sufficient reason claims that there are no facts without
causes.This suggests that the exogenous variables are the effects of a network
of causes not expressed by the structural equation model.For instance, the user
intent u and the ad inventory v in Figure 3 have temporal correlations because
both users and advertisers worry about their budgets when the end of the month
approaches.Any structural equation model should then be understood in the
context of a larger structural equation model potentially describing all things
in existence.

现在让我们将注意力转向外生变量，即从未出现在结构模型方程左侧的变量。莱布尼茨的充分理由原则主张没有原因就没有事实。这表明外生变量是结构方程模型没有表达的原因网络的影响。例如，图3中的用户意图u和广告清单v具有时间相关性，因为当月底临近时，用户和广告商都担心他们的预算。任何结构方程模型都应该在一个更大的结构方程模型的背景下理解，这个模型潜在地描述了所有存在的事物。

Ads served on
a particular page contribute to the continued satisfaction of both users and
ad-vertisers, and therefore have an effect on their willingness to use the
services of the publisher in the future.The ad placement structural equation
model shown in Figure 2 only describes the causal de-pendencies for a single
page and therefore cannot account for such effects.Consider however a very

在特定页面上提供的广告有助于用户和广告发布者的持续满意度，因此会影响他们将来使用出版商服务的意愿。图2所示的广告投放结构方程模型仅描述了单个页面的因果相关性，因此无法解释这种影响。然而，考虑一个非常

3216

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 4:
Conceptually unrolling the user feedback loop by threading instances of the
single page

图4:通过线程化单个页面的实例从概念上展开用户反馈循环

causal graph
(Figure 3).Both the ad slate st and user clicks yt have an indirect effect on
the user intent ut+1 associated with the next query.

因果图(图3)。ad slate st和用户点击yt都对与下一个查询相关联的用户意图ut+1有间接影响。

large
structural equation model containing a copy of the page-level model for every
web page ever served by the publisher.Figure 4 shows how we can thread the
page-level models corresponding to pages served to the same user.Similarly we
could model how advertisers track the performance and the cost of their
advertisements and model how their satisfaction affects their future bids.The
resulting causal graphs can be very complex.Part of this complexity results
from time-scale dif-ferences.Thousands of search pages are served in a
second.Each page contributes a little to the continued satisfaction of one user
and a few advertisers.The accumulation of these contributions produces
measurable effects after a few weeks.

大型结构方程模型，包含发布者提供的每个网页的页面级模型副本。图4显示了我们如何线程化对应于服务于同一用户的页面的页面级模型。同样，我们可以模拟广告商如何跟踪他们广告的表现和成本，并模拟他们的满意度如何影响他们未来的出价。产生的因果图可能非常复杂。这种复杂性的一部分是由时间尺度的差异造成的。数以千计的搜索页面在一秒钟内被提供。每一页都对一个用户和几个广告商的持续满意度有一点贡献。几周后，这些贡献的积累会产生可衡量的效果。

Many of the
functional dependencies expressed by the structural equation model are left
un-specified.Without direct knowledge of these functions, we must reason using
statistical data.The most fundamental statistical data is collected from
repeated trials that are assumed independent.When we consider the large
structured equation model of everything, we can only have one large trial
producing a single data point.It is therefore desirable to identify repeated
patterns of identical equations that can be viewed as repeated independent
trials.Therefore, when we study a structural equation model representing such a
pattern, we need to make an additional assumption to expresses the idea that
the outcome of one trial does not affect the other trials.We call such an
assumption an isolation assumption by analogy with thermodynamics.This can be
achieved by assuming that the exogenous variables are independently drawn from
an unknown but fixed joint probability distribu-tion.This assumption cuts the
causation effects that could flow through the exogenous variables.

结构方程模型所表达的许多函数依赖关系都没有明确说明。没有这些函数的直接知识，我们必须使用统计数据进行推理。最基本的统计数据是从假设独立的重复试验中收集的。当我们考虑所有事物的大型结构化方程模型时，我们只能进行一次大型试验来产生一个数据点。因此，希望识别可被视为重复独立试验的相同方程的重复模式。因此，当我们研究代表这种模式的结构方程模型时，我们需要做一个额外的假设来表达一个试验的结果不会影响其他试验的想法。我们通过类比热力学把这样的假设称为孤立假设。这可以通过假设外生变量独立地来自未知但固定的联合概率分布来实现。这一假设消除了可能流经外生变量的因果效应。

The noise
variables are also exogenous variables acting as independent source of
randomness.The noise variables are useful to represent the conditional
distribution P(effect|causes) using the equation effect= f(causes,
ε).Therefore, we also assume joint independence between all the noise variables
and any of the named exogenous variable.For instance, in the case of the ad
placement

噪声变量也是作为随机性独立来源的外生变量。噪声变量有助于使用方程effect= f(causes，ε)来表示条件分布 P(effect|causes)。因此，我们还假设所有噪声变量和任何一个已命名的外生变量之间具有联合独立性。例如，在广告投放的情况下

1.See also
the discussion on reinforcement learning, Section 3.5.

1.另请参见第3.5节中关于强化学习的讨论。

2.The concept
of isolation is pervasive in physics.An isolated system in thermodynamics
(Reichl, 1998, Section 2.D) or a closed system in mechanics (Landau and
Lifshitz, 1969, 5) evolves without exchanging mass or energy with its
surroundings.Experimental trials involving systems that are assumed isolated
may differ in their initial setup and therefore have different
outcomes.Assuming isolation implies that the outcome of each trial cannot
affect the other trials.

2.孤立的概念在物理学中很普遍。热力学中的孤立系统(赖歇尔，1998，第2节。d)或力学中的封闭系统(Landau和Lifshitz，1969，5)在不与周围环境交换质量或能量的情况下演化。涉及被假设为孤立的系统的实验性试验可能在它们的初始设置上有所不同，因此具有不同的结果。假设隔离意味着每个试验的结果不能影响其他试验。

3.Rather than
letting two noise variables display measurable statistical dependencies because
they share a common cause, we prefer to name the common cause and make the
dependency explicit in the graph.

3.与其让两个噪声变量显示可测量的统计相关性，因为它们有一个共同的原因，我们更喜欢命名共同的原因，并在图中明确相关性。

Exogenous
vars.Query.

外源性vars。查询。

Eligible
ads.Bids.

合格的广告。出价。

Scores.Ad
slate.Prices.Clicks.Revenue.

分数。广告石板。价格。点击。收入。

Figure 6:
Bayesian network associated with the Markov factorization shown in Figure 5.

图6:与图5所示的马尔可夫分解相关联的贝叶斯网络。

model shown
in Figure 2, we assume that the joint distribution of the exogenous variables
factorizes

模型如图2所示，我们假设外生变量的联合分布因子化

如同

P(u, v,ε1,..., ε8) = P(u, v)P(ε1)...P(ε8).

Since an
isolation assumption is only true up to a point, it should be expressed clearly
and remain under constant scrutiny.We must therefore measure additional
performance metrics that reveal how the isolation assumption holds.For
instance, the ad placement structural equation model and the corresponding
causal graph (figures 2 and 3) do not take user feedback or advertiser feedback
into account.Measuring the revenue is not enough because we could easily
generate revenue at the expense of the satisfaction of the users and
advertisers.When we evaluate interventions under such an isolation assumption,
we also need to measure a battery of additional quantities that act as proxies
for the user and advertiser satisfaction.Noteworthy examples include ad
relevance estimated by human judges, and advertiser surplus estimated from the
auctions (Varian, 2009).

因为隔离假设在某种程度上是正确的，所以应该清楚地表达出来，并保持不断的审查。因此，我们必须衡量额外的性能指标，以揭示隔离假设是如何成立的。例如，广告投放结构方程模型和相应的因果图(图2和3)没有考虑用户反馈或广告商反馈。衡量收入是不够的，因为我们很容易以牺牲用户和广告商的满意度为代价来创造收入。当我们在这样的隔离假设下评估干预时，我们还需要测量一组额外的量，作为用户和广告商满意度的代理。值得注意的例子包括由人类估计的广告相关性，以及从拍卖中估计的广告商盈余(瓦里安，2009)。

3.3 Markov Factorization马尔可夫因子分解

Conceptually,
we can draw a sample of the exogenous variables using the distribution
specified by the isolation assumption, and we can then generate values for all
the remaining variables by simulating the structural equation model.

从概念上讲，我们可以使用隔离假设指定的分布来绘制外生变量的样本，然后我们可以通过模拟结构方程模型来生成所有剩余变量的值。

This process
defines a generative probabilistic model representing the joint distribution of
all variables in the structural equation model.The distribution readily
factorizes as the product of the joint probability of the named exogenous
variables, and, for each equation in the structural equation model, the
conditional probability of the effect given its direct causes (Spirtes et al.,
1993;Pearl, 2000).As illustrated by figures 5 and 6, this Markov factorization
connects the structural equa-tion model that describes causation, and the
Bayesian network that describes the joint probability distribution followed by
the variables under the isolation assumption.

这个过程定义了一个表示结构方程模型中所有变量联合分布的生成概率模型。该分布很容易分解为所命名的外生变量的联合概率的乘积，对于结构方程模型中的每个方程，给出其直接原因的影响的条件概率(Spirtes等人，1993；Pearl，2000)。如图5和图6所示，这种马尔可夫分解将描述因果关系的结构方程模型和描述隔离假设下变量遵循的联合概率分布的贝叶斯网络联系起来。

Structural
equation models and Bayesian networks appear so intimately connected that it
could be easy to forget the differences.The structural equation model is an
algebraic object.As long as the causal graph remains acyclic, algebraic
manipulations are interpreted as interventions on the causal system.The
Bayesian network is a generative statistical model representing a class of
joint probability distributions, and, as such, does not support algebraic
manipulations.However, the symbolic representation of its Markov factorization
is an algebraic object, essentially equivalent to the structural equation
model.

结构方程模型和贝叶斯网络看起来联系如此紧密，以至于很容易忘记它们之间的差异。结构方程模型是一个代数对象。只要因果图保持非循环，代数操作就被解释为对因果系统的干预。贝叶斯网络是表示一类联合概率分布的生成统计模型，因此不支持代数操作。然而，其马尔可夫因子分解的符号表示是一个代数对象，本质上等价于结构方程模型。

3.4
Identification, Transportation, and Transfer Learning

3.4识别、运输和转移学习

Consider a
causal system represented by a structural equation model with some unknown
functional dependencies.Subject to the isolation assumption, data collected
during the operation of this system follows the distribution described by the
corresponding Markov factorization.Let us first assume that this data is sufficient
to identify the joint distribution of the subset of variables we can observe.We
can intervene on the system by clamping the value of some variables.This
amounts to replacing the right-hand side of the corresponding structural
equations by constants.The joint distribution of the variables is then
described by a new Markov factorization that shares many factors with the
orig-inal Markov factorization.Which conditional probabilities associated with
this new distribution can we express using only conditional probabilities
identified during the observation of the original sys-tem?This is called the
identifiability problem.More generally, we can consider arbitrarily complex
manipulations of the structural equation model, and we can perform multiple
experiments involv-ing different manipulations of the causal system.Which
conditional probabilities pertaining to one experiment can be expressed using
only conditional probabilities identified during the observation of other
experiments?This is called the transportability problem.

考虑一个由结构方程模型表示的因果系统，该模型具有一些未知的函数依赖关系。根据隔离假设，在该系统运行期间收集的数据遵循由相应的马尔可夫因子分解描述的分布。让我们首先假设这些数据足以识别我们可以观察到的变量子集的联合分布。我们可以通过钳制某些变量的值来干预系统。这相当于用常数替换了相应结构方程的右边。然后，变量的联合分布由一个新的马尔可夫因子分解来描述，它与原始的马尔可夫因子分解共享许多因子。我们可以仅使用在观察原始系统时识别的条件概率来表示与这种新分布相关的哪些条件概率？这就是所谓的可识别性问题。更一般地说，我们可以考虑结构方程模型的任意复杂操作，并且我们可以进行涉及因果系统的不同操作的多个实验。哪些与一个实验相关的条件概率可以仅用在观察其他实验时确定的条件概率来表示？这就是所谓的可运输性问题。

Pearl's
do-calculus completely solves the identifiability problem and provides useful
tools to address many instances of the transportability problem (see Pearl,
2012).Assuming that we know the conditional probability distributions involving
observed variables in the original structural equa-tion model, do-calculus
allows us to derive conditional distributions pertaining to the manipulated structural
equation model.

Pearl的do-演算完全解决了可识别性问题，并提供了有用的工具来解决可运输性问题的许多实例(参见Pearl，2012)。假设我们知道原始结构方程模型中包含观测变量的条件概率分布，微积分允许我们导出与操纵结构方程模型相关的条件分布。

Unfortunately,
we must further distinguish the conditional probabilities that we know (because
we designed them) from those that we estimate from empirical data.This
distinction is important because estimating the distribution of continuous or
high cardinality variables is notoriously dif-ficult.Furthermore, do-calculus
often combines the estimated probabilities in ways that amplify estimation
errors.This happens when the manipulated structural equation model exercises
the vari-ables in ways that were rarely observed in the data collected from the
original structural equation model.

不幸的是，我们必须进一步区分我们知道的条件概率(因为我们设计了它们)和我们根据经验数据估计的条件概率。这种区别很重要，因为估计连续或高基数变量的分布是出了名的困难。此外，微积分经常以放大估计误差的方式组合估计的概率。当被操纵的结构方程模型以从原始结构方程模型收集的数据中很少观察到的方式运用变量时，就会发生这种情况。

4.Bayesian
networks are directed graphs representing the Markov factorization of a joint
probability distribution: the arrows no longer have a causal interpretation.

4.贝叶斯网络是表示联合概率分布的马尔可夫因子分解的有向图:箭头不再具有因果解释。

Therefore we
prefer to use much simpler causal inference techniques (see sections 4.1 and
4.2).Although these techniques do not have the completeness properties of
do-calculus, they combine estimation and transportation in a manner that
facilitates the derivation of useful confidence inter-vals.

因此，我们更喜欢使用简单得多的因果推断技术(参见第4.1节和第4.2节)。虽然这些技术不具备微积分的完备性，但它们以一种便于推导有用的置信区间的方式结合了估计和传输。

3.5 Special Cases

3.5特殊情况

Three special
cases of causal models are particularly relevant to this work.

因果模型的三个特例与这项工作特别相关。

• In the
multi-armed bandit (Robbins, 1952), a user-defined policy function π determines
the distribution of action a ∈ {1...K}, and an unknown reward function r
determines the distri-bution of the outcome y given the action a (Figure 7).In
order to maximize the accumulated rewards, the player must construct policies π
that balance the exploration of the action space with the exploitation of the best
action identified so far (Auer et al., 2002;Audibert et al., 2007;Seldin et
al., 2012).

在多臂老虎机(Robbins，1952)中，用户定义的策略函数π决定了动作a的分布∑{ 1...一个未知的奖励函数r决定了给定动作a的结果y的分布(图7)。为了使累积的奖励最大化，玩家必须构建策略π，该策略π平衡对动作空间的探索和对迄今为止识别的最佳动作的开发(Auer等人，2002；Audibert等人，2007年；Seldin等人，2012年)。

• The
contextual bandit problem (Langford and Zhang, 2008) significantly increases
the com-plexity of multi-armed bandits by adding one exogenous variable x to
the policy function π and the reward functions r (Figure 8).

上下文老虎机问题(Langford和Zhang，2008)通过在策略函数π和奖励函数r上增加一个外生变量x，显著增加了多武装土匪的复杂性(图8)。

• Both
multi-armed bandit and contextual bandit are special case of reinforcement
learning (Sutton and Barto, 1998).In essence, a Markov decision process is a
sequence of contextual bandits where the context is no longer an exogenous
variable but a state variable that depends on the previous states and actions
(Figure 9).Note that the policy function π, the reward function r, and the
transition function s are independent of time.All the time dependencies are expressed
using the states st .

多臂老虎机和情境老虎机都是强化学习的特例(萨顿和巴尔托，1998)。本质上，马尔可夫决策过程是一系列上下文老虎机，其中上下文不再是外生变量，而是依赖于先前状态和动作的状态变量(图9)。注意，策略函数π、奖励函数r和转移函数s与时间无关。所有的时间依赖关系都用状态st表示。

These special
cases have increasing generality.Many simple structural equation models can be
reduced to a contextual bandit problem using appropriate definitions of the
context x, the action a and the outcome y. For instance, assuming that the
prices c are discrete, the ad placement struc-tural equation model shown in
Figure 2 reduces to a contextual bandit problem with context (u, v), actions
(s, c) and reward z. Similarly, given a sufficiently intricate definition of
the state variables st , all structural equation models with discrete variables
can be reduced to a reinforcement learning problem.Such reductions lose the
fine structure of the causal graph.We show in Section 5 how this fine structure
can in fact be leveraged to obtain more information from the same experiments.

这些特例越来越具有普遍性。使用上下文x、动作a和结果y的适当定义，许多简单的结构方程模型可以简化为上下文强盗问题。例如，假设价格c是离散的，图2所示的广告投放结构方程模型可以简化为具有上下文(u，v)、动作(s，c)和奖励z的上下文强盗问题。类似地，给定状态变量st的足够复杂的定义，所有具有离散变量的结构方程模型可以简化为强化学习问题。这种简化失去了因果图的精细结构。我们在第5节中展示了如何利用这种精细的结构从相同的实验中获得更多的信息。

Modern
reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the
assumption that the policy function, the reward function, the transition
function, and the distributions of the corresponding noise variables, are
independent from time.This invariance property provides great benefits when the
observed sequences of actions and rewards are long in comparison with the size
of the state space.Only Section 7 in this contribution presents methods that
take advantage of such an invariance.The general question of leveraging
arbitrary functional invariances in causal graphs is left for future work.

现代强化学习算法(见萨顿和巴尔托，1998)利用了政策函数、奖励函数、转移函数和相应噪声变量的分布与时间无关的假设。当观察到的动作和回报序列与状态空间的大小相比较长时，这种不变性提供了很大的好处。这篇文章中只有第7节介绍了利用这种不变性的方法。利用因果图中任意函数不变性的一般问题留待将来研究。

4.Counterfactual Analysis反事实分析

We now return
to the problem of formulating and answering questions about the value of
proposed changes of a learning system.Assume for instance that we consider
replacing the score computation

我们现在回到制定和回答关于学习系统的建议改变的价值的问题。例如，假设我们考虑替换分数计算

a = π ( ε ) Action a ∈{1...K}

y = r ( a，ε’) Reward y ∈ R

动作a∞{ 1...K}奖励y ∈ R

Figure 7:
Structural equation model for the multi-armed bandit problem.The policy π
selects a discrete
action a, and the reward function r determines the outcome y. The noise
vari-ables ε and ε ′ represent independent sources of randomness useful to
model probabilistic dependencies.

图7:多武装匪徒问题的结构方程模型。策略π选择一个离散动作a，奖励函数r决定结果y。噪声变量ε和ε’代表独立的随机源，可用于建模概率相关性。

a = π ( x , ε) Action a ∈{1...K}

y = r ( x，a，ε’)Reward y ∈ R

Figure 8:
Structural equation model for contextual bandit problem.Both the action and the
reward depend on an exogenous context variable x.

图8:上下文老虎机问题的结构方程模型。行动和奖励都依赖于一个外生的环境变量x。

Figure 9:
Structural equation model for reinforcement learning.The above equations are
replicated

图9:强化学习的结构方程模型。上面的等式是重复的

for all t ∈
{0...,T}.The context is now provided by a state variable st−1 that depends on
the previous states and actions.

对于所有t∞{ 0...，T}。上下文现在由状态变量ST 1提供，该变量取决于先前的状态和动作。

model M of an
ad placement engine by an alternate model M. We seek an answer to the
conditional question:

模型M的一个广告投放引擎的一个替代模型M。我们寻求一个条件问题的答案:

"How
will the system perform if we replace model M by model M*?"

"如果我们用M*代替M模型，系统会有什么表现？"

Given
sufficient time and sufficient resources, we can obtain the answer using a
controlled experiment (Section 2.2).However, instead of carrying out a new
experiment, we would like to obtain an answer using data that we have already
collected in the past.

如果有足够的时间和足够的资源，我们可以通过一个受控实验(第2.2节)来获得答案。然而，我们不希望进行新的实验，而是希望使用我们过去已经收集的数据来获得答案。

"How
would the system have performed if, when the data was collected, we had
replaced model M by model M*?"

“如果在收集数据时，我们用模型M*替换了模型M，系统会有什么表现？”

The answer of
this counterfactual question is of course a counterfactual statement that
describes the system performance subject to a condition that did not happen.

这个反事实问题的答案当然是一个反事实陈述，它描述了系统在一个没有发生的条件下的性能。

Counterfactual
statements challenge ordinary logic because they depend on a condition that is
known to be false.Although assertion A ⇒ B is always true when assertion A is
false, we certainly do not mean for all counterfactual statements to be
true.Lewis (1973) navigates this paradox using a modal logic in which a
counterfactual statement describes the state of affairs in an alternate world
that resembles ours except for the specified differences.Counterfactuals indeed
offer many subtle ways to qualify such alternate worlds.For instance, we can
easily describe isolation assumptions (Section 3.2) in a counterfactual
question:

反事实陈述挑战普通逻辑，因为它们依赖于一个已知为假的条件。虽然当断言A为假时，断言A总是真的，但我们当然不是说所有的反事实陈述都是真的。刘易斯(1973)使用模态逻辑来解释这个悖论，在模态逻辑中，一个反事实的陈述描述了一个与我们相似的另一个世界的事态，除了特定的差异。反事实确实提供了许多微妙的方法来限定这样的交替世界。例如，我们可以很容易地在一个反事实问题中描述隔离假设(第3.2节):

"How
would the system have performed if, when the data was collected, we had
replaced model M by model Mwithout incurring user or advertiser
reactions?"

“如果在收集数据时，我们用M*模型代替了M模型，而不会引起用户或广告商的反应，那么系统会有怎样的表现？”

Figure 10:
Causal graph for an image recognition system.We can estimate counterfactuals by
replaying data collected in the past.

图10:图像识别系统的因果图。我们可以通过回放过去收集的数据来估计反事实。

Figure 11:
Causal graph for a randomized experiment.We can estimate certain
counterfactuals by reweighting data collected in the past.

图11:随机实验的因果图。我们可以通过重新加权过去收集的数据来估计某些反事实。

The fact that
we could not have changed the model without incurring the user and advertiser
reac-tions does not matter any more than the fact that we did not replace model
M by model M* in the first place.This does not prevent us from using
counterfactual statements to reason about causes and effects.Counterfactual
questions and statements provide a natural framework to express and share our
conclusions.

事实上，我们不可能在不引起用户和广告商反应的情况下改变模型，这一点并不比我们一开始没有用M*模型替换M模型更重要。这并不妨碍我们用反事实的陈述来推理因果。反事实的问题和陈述提供了一个自然的框架来表达和分享我们的结论。

The remaining
text in this section explains how we can answer certain counterfactual
questions using data collected in the past.More precisely, we seek to estimate
performance metrics that can be expressed as expectations with respect to the
distribution that would have been observed if the counterfactual conditions had
been in force.

本节剩下的内容解释了我们如何使用过去收集的数据来回答某些反事实的问题。更准确地说，我们试图估计性能指标，这些指标可以表示为对分布的期望，如果反事实条件已经生效，就会观察到分布。

4.1 Replaying
Empirical Data

4.1重放经验数据

Figure 10
shows the causal graph associated with a simple image recognition system.The
classifier takes an image x and produces a prospective class label ˆy. The loss
measures the penalty associated with recognizing class ˆy while the true class
is y.

图10显示了与简单图像识别系统相关的因果图。分类器获取图像x并产生预期的类别标签y。损失测量与识别类别y相关联的惩罚，而真实类别是y

To estimate
the expected error of such a classifier, we collect a representative data set
composed of labelled images, run the classifier on each image, and average the
resulting losses.In other words, we replay the data set to estimate what (counterfactual)
performance would have been observed if we had used a different classifier.We
can then select in retrospect the classifier that would have worked the best
and hope that it will keep working well.This is the counterfactual viewpoint on
empirical risk minimization (Vapnik, 1982).

为了估计这种分类器的预期误差，我们收集由标记图像组成的代表性数据集，在每个图像上运行分类器，并对产生的损失进行平均。换句话说，我们重放数据集来估计如果我们使用不同的分类器会观察到什么样的(反事实的)性能。然后，我们可以在回顾中选择最有效的分类器，并希望它能继续有效。这是经验风险最小化的反事实观点(Vapnik，1982)。

Replaying the
data set works because both the alternate classifier and the loss function are
known.More generally, to estimate a counterfactual by replaying a data set, we
need to know all the functional dependencies associated with all causal paths
connecting the intervention point to the measurement point.This is obviously
not always the case.

重放数据集是可行的，因为备用分类器和损失函数都是已知的。更一般地说，为了通过重放数据集来估计反事实，我们需要知道与连接干预点和测量点的所有因果路径相关联的所有功能依赖。显然情况并非总是如此。

5.Although
counterfactual expectations can be viewed as expectations of unit-level
counterfactuals (Pearl, 2009, Def-inition 4), they elude the semantic
subtleties of unit-level counterfactuals and can be measured with randomized
experiments (see Section 4.2.)

5.尽管反事实期望可以被视为单位级反事实的期望(Pearl，2009，Def-inition 4)，但它们回避了单位级反事实的语义微妙性，并且可以通过随机实验来测量(参见第4.2节)。)

4.2
Reweighting Randomized Trials

4.2重新加权随机试验

Figure 11
illustrates the randomized experiment suggested in Section 2.3.The patients are
randomly split into two equally sized groups receiving respectively treatments
A and B. The overall success rate for this experiment is therefore Y = (YA
+YB)/2 where YA and YB are the success rates observed for each group.We would
like to estimate which (counterfactual) overall success rate Y would have been
observed if we had selected treatment A with probability p and treatment B with
probability 1− p.

图11说明了第2.3节中建议的随机实验。患者被随机分成两个大小相等的组，分别接受治疗A和治疗b。因此，本实验的总成功率为Y = (YA +YB)/2，其中YA和YB是每组观察到的成功率。如果我们选择概率为p的治疗A和概率为1 p的治疗B，我们希望估计观察到的(反事实的)总体成功率Y。

Since we do
not know how the outcome depends on the treatment and the patient condition, we
cannot compute which outcome y would have been obtained if we had treated
patient x with a different treatment u . Therefore we cannot answer this
question by replaying the data as we did in Section 4.1.

由于我们不知道结果如何取决于治疗和患者状况，因此我们无法计算如果我们用不同的治疗u治疗患者x会获得哪种结果y。因此，我们无法像在第4.1节中那样通过重放数据来回答这个问题。

However,
observing different success rates YA and YB for the treatment groups reveals an
empir-ical correlation between the treatment u and the outcome y. Since the only
cause of the treatment u is an independent roll of the dices, this correlation
cannot result from any known or unknown con-founding common cause.Having
eliminated this possibility, we can reweight the observed out-comes and compute
the estimate Y ≈ pYA + (1− p)YB .

然而，观察治疗组的不同成功率YA和YB揭示了治疗u和结果y之间的经验相关性。由于治疗u的唯一原因是独立的骰子滚动，这种相关性不能由任何已知或未知的共同原因产生。排除这种可能性后，我们可以重新加权观察到的输出，并计算估计值Y≈pyA+(1p)YB。

4.3 Markov
Factor Replacement

4.3马尔可夫因子替换

The
reweighting approach can in fact be applied under much less stringent conditions.Let
us return to the ad placement problem to illustrate this point.

事实上，再加权方法可以在不太严格的条件下应用。让我们回到广告投放问题来说明这一点。

The average
number of ad clicks per page is often called click yield.Increasing the click
yield usually benefits both the advertiser and the publisher, whereas
increasing the revenue per page often benefits the publisher at the expense of
the advertiser.Click yield is therefore a very useful metric when we reason
with an isolation assumption that ignores the advertiser reactions to pricing changes.

平均每页的广告点击量通常被称为点击量。增加点击率通常对广告商和出版商都有利，而增加每页的收入通常对出版商有利，但牺牲了广告商的利益。因此，当我们用一个孤立的假设来推理时，点击量是一个非常有用的指标，这个假设忽略了广告商对价格变化的反应。

Let ω be a
shorthand for all variables appearing in the Markov factorization of the ad
placement structural equation model,

假设ω是广告投放结构方程模型的马尔可夫分解中出现的所有变量的简写，

Variable y
was defined in Section 3.1 as the set of user clicks.In the rest of the
document, we slightly abuse this notation by using the same letter y to
represent the number of clicks.We also write the expectation Y = Eω∼P(ω) [y]
using the integral notation

变量y在第3.1节中被定义为用户点击的集合。在文档的其余部分，我们稍微滥用了这个符号，使用相同的字母y来表示点击次数。我们还用积分表示法写出了期望Y = Eω∞P(ω)[Y]

Y = Zω y P (ω ) .

We would like
to estimate what the expected click yield Y would have been if we had used a
different scoring function (Figure 12).This intervention amounts to replacing
the actual factor P(q|x,a) by a counterfactual factor P(q|x,a) in the Markov
factorization.

如果我们使用不同的评分函数，我们希望估计预期的点击率是多少(图12)。这种干预相当于用马尔可夫因子分解中的反事实因子P(q|x，a)代替实际因子P(q|x，a)。

P ( ω) = P (u , v ) P ( x |u ) P ( a|x , v ) P ( b|x , v ) P ( q| x , a )

× P ( s|a，q，b ) P ( c |a，q，b ) P ( y |s，u ) P ( z|x，c)。

(3)

6.See also
the discussion of Reichenbach's common cause principle and of its limitations
in Spirtes et al. (1993) and Spirtes and Scheines (2004).

6.另见Spirtes等人(1993年)和Spirtes和Scheines
(2004年)对赖兴巴赫共同原因原则及其局限性的讨论。

Figure 12:
Estimating which average number of clicks per page would have been observed if
we had used a different scoring model.

图12:如果我们使用不同的评分模型，估计每个页面的平均点击次数。

Let us
assume, for simplicity, that the actual factor P(q|x,a) is nonzero
everywhere.We can then estimate the counterfactual expected click yield Y using
the transformation

为了简单起见，我们假设实际的因子P(q|x，a)处处非零。然后，我们可以使用转换来估计反事实的预期点击量Y

Y = Zω y P (ω) = Zω y P ( q| x a ) P ( q|x , a ) P ( ω ) ≈ n n∑i=1 y i P ( q i |x i a i ) P
( q i |x i , a i )

where the
data set of tuples (ai , xi ,qi , yi) is distributed according to the actual
Markov factorization instead of the counterfactual Markov factorization.This
data could therefore have been collected during the normal operation of the ad
placement system.Each sample is reweighted to reflect its probability of
occurrence under the counterfactual conditions.

其中元组(ai、xi、qi、yi)的数据集根据实际的马尔可夫因子分解而不是反事实的马尔可夫因子分解来分布。因此，这些数据可能是在广告投放系统正常运行期间收集的。每个样本被重新加权，以反映其在反事实条件下出现的概率。

In general,
we can use importance sampling to estimate the counterfactual expectation of
any quantity ℓ(ω):

一般来说，我们可以使用重要性抽样来估计任何数量ℓ(ω):的反事实期望

Y = Zω ℓ ( ω
) P ( ω) = Zω ℓ ( ω ) P ( ω ) P ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w i

with weights

Equation (6)
emphasizes the simplifications resulting from the algebraic similarities of the
actual and counterfactual Markov factorizations.Because of these
simplifications, the evaluation of the weights only requires the knowledge of
the few factors that differ between P(ω) and P(ω).Each data sample needs to
provide the value of ℓ(ωi) and the values of all variables needed to evaluate
the factors that do not cancel in the ratio (6).

等式(6)强调由实际和反事实马尔可夫因子分解的代数相似性产生的简化。由于这些简化，权重的计算只需要知道P(ω)和P(ω)之间的几个不同因素。每个数据样本需要提供ℓ(ωi值)和评估比率(6)中不抵消的因素所需的所有变量的值。

In contrast,
the replaying approach (Section 4.1) demands the knowledge of all factors of
P(ω) connecting the point of intervention to the point of measurement ℓ(ω).On
the other hand, it does not require the knowledge of factors appearing only in
P(ω).

相比之下，重放方法(第4.1节)要求了解连接干预点和测量点ℓ(ω).的所有P(ω)因子另一方面，它不需要只出现在P(ω)中的因子的知识。

Importance
sampling relies on the assumption that all the factors appearing in the
denomina-tor of the reweighting ratio (6) are nonzero whenever the factors
appearing in the numerator are nonzero.Since these factors represents
conditional probabilities resulting from the effect of an independent noise
variable in the structural equation model, this assumption means that the data
must be
collected with an experiment involving active randomization.We must therefore
design cost-effective randomized experiments that yield enough information to
estimate many interesting counterfactual expectations with sufficient
accuracy.This problem cannot be solved without an-swering the confidence
interval question: given data collected with a certain level of randomization,
with which accuracy can we estimate a given counterfactual expectation?

重要性抽样依赖于这样的假设，即只要分子中出现的因子非零，重加权比(6)的分母中出现的所有因子都非零。由于这些因素代表了由结构方程模型中独立噪声变量的影响而产生的条件概率，这一假设意味着数据

必须通过涉及主动随机化的实验收集。因此，我们必须设计成本有效的随机实验，产生足够的信息，以足够的准确性估计许多有趣的反事实预期。如果不回答置信区间问题，这个问题就无法解决:给定以一定程度的随机化收集的数据，我们能以何种精度估计给定的反事实期望？

4.4
Confidence Intervals

4.4置信区间

At first
sight, we can invoke the law of large numbers and write

乍一看，我们可以援引大数定律来写

Y = Zω ℓ ( ω
) w ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w i .

(7)

For sufficiently
large n, the central limit theorem provides confidence intervals whose width
grows with the standard deviation of the product ℓ(ω)w(ω).

对于足够大的n，中心极限定理提供了置信区间，其宽度随着乘积ℓ(ω)w(ω).的标准偏差而增加

Unfortunately,
when P(ω) is small, the reweighting ratio w(ω) takes large values with low
probability.This heavy tailed distribution has annoying consequences because
the variance of the integrand could be very high or infinite.When the variance
is infinite, the central limit theorem does not hold.When the variance is
merely very large, the central limit convergence might occur too slowly to
justify such confidence intervals.Importance sampling works best when the
actual distribution and the counterfactual distribution overlap.

遗憾的是，当P(ω)较小时，重加权比w(ω)取大值的概率较低。这种重尾分布有令人讨厌的后果，因为被积函数的方差可能很大或无穷大。当方差无穷大时，中心极限定理不成立。当方差仅仅非常大时，中心极限收敛可能发生得太慢，不足以证明这样的置信区间。当实际分布和反事实分布重叠时，重要性抽样效果最好。

When the
counterfactual distribution has significant mass in domains where the actual
distribu-tion is small, the few samples available in these domains receive very
high weights.Their noisy contribution dominates the reweighted estimate (7).We
can obtain better confidence intervals by eliminating these few samples drawn
in poorly explored domains.The resulting bias can be bounded using prior
knowledge, for instance with an assumption about the range of values taken by
ℓ(ω),

当反事实分布在实际分布很小的域中具有显著质量时，这些域中可用的少数样本获得非常高的权重。他们的噪声贡献主导了重新加权的估计(7)。我们可以通过消除在探索不足的领域提取的这几个样本来获得更好的置信区间。所产生的偏差可以使用先验知识来界定，例如利用关于ℓ(ω取值范围的假设)，

∀ω ℓ(ω) ∈ [0,M].

(8)

Let us choose
the maximum weight value R deemed acceptable for the weights.We have ob-tained
very consistent results in practice with R equal to the fifth largest
reweighting ratio observed on the empirical data.We can then rely on clipped
weights to eliminate the contribution of the poorly explored domains,

让我们选择重量可以接受的最大重量值R。我们在实践中获得了非常一致的结果，R等于在经验数据上观察到的第五大再加权比。然后，我们可以依靠剪裁的权重来消除探索不充分的领域的贡献，

w (ω) = w(ω)
if P(ω) < RP(ω) 0 otherwise.

如果P(ω) < RP(ω) 0，则w (ω) = w(ω),否则。

The condition
P(ω) < RP(ω) ensures that the ratio has a nonzero denominator P(ω) and is
smaller than R. Let ΩR be the set of all values of ω associated with acceptable
ratios:

条件P(ω) < RP(ω)确保比率具有非零分母P(ω)并且小于R。假设ωR是与可接受比率相关的所有ω值的集合:

ΩR = {ω :
P(ω) < RP(ω)} .

ωR =
{ω:P(ω)< RP(ω)}。

We can
decompose Y in two terms:

我们可以将Y分解为两个项:

Y = Zω ∈ Ω R
ℓ ( ω ) P ( ω) + Zω ∈ Ω \ Ω R ℓ ( ω ) P ( ω) = Y + ( Y − Y ) .

7.This is in
fact a slight abuse because the theory calls for choosing R before seeing the
data.

7.这实际上是一个轻微的滥用，因为理论要求在看到数据之前选择R。

The first
term of this decomposition is the clipped expectation Y . Estimating the
clipped expec-tation Y is much easier than estimating Y from (7) because the
clipped weights w(ω) are bounded by R.

这个分解的第一项是截尾期望Y。估计截尾期望Y比从(7)估计Y容易得多，因为截尾权重w(ω)由r界定

Y = Z ω∈ΩR
ℓ(ω)P (ω) = Z ω ℓ(ω)w (ω) P(ω) ≈ Yb= n n ∑ i=1 ℓ(ωi)w (ωi).(10)

y =
zω∈rωℓ(ω)p(ω)= zωℓ(ω)w(ω)p(ω)≈Yb = n n∑I = 1 ℓ(ωi)w(ωI)。(10)

The second
term of Equation (9) can be bounded by leveraging assumption (8).The resulting
bound can then be conveniently estimated using only the clipped weights.

等式(9)的第二项可以由杠杆假设(8)来界定。然后，仅使用剪裁的权重就可以方便地估计得到的界限。

Y −Y = Z
ω∈Ω\ΩR ℓ(ω)P (ω) ∈ h M P (Ω\ΩR) i = h M (−W ) i with

W = P (ΩR) =
Z ω∈ΩR P (ω) = Z ω w (ω)P(ω) ≈ Wb= n n ∑ i=1 w (ωi).

(11)

Since the
clipped weights are bounded, the estimation errors associated with (10) and
(11) are well characterized using either the central limit theorem or using
empirical Bernstein bounds (see appendix B for details).Therefore we can derive
an outer confidence interval of the form

由于裁剪后的权重是有界的，因此使用中心极限定理或使用经验伯恩斯坦界限(详见附录B)可以很好地表征与(10)和(11)相关的估计误差。因此，我们可以导出形式的外部置信区间

P n Yb−εR ≤ Y
≤ Yb+εR o ≥ −δ

and an inner
confidence interval of the form

和形式的内部置信区间

P n Y ≤ Y ≤ Y

M(−Wb+ξR) o ≥ −δ.

(13)

The names
inner and outer are in fact related to our preferred way to visualize these
intervals (e.g., Figure 13).Since the bounds on Y −Y can be written as

内部和外部名称实际上与我们首选的可视化这些间隔的方式相关(例如，图13)。因为Y-Y的界限可以写成

Y ≤ Y ≤ Y + M
(1−W ),

we can derive
our final confidence interval,

我们可以推导出最终的置信区间，

P n Yb−εR ≤ Y
≤ Yb+ M(−Wb+ξR) +εR o ≥ −δ.

(15)

In
conclusion, replacing the unbiased importance sampling estimator (7) by the
clipped impor-tance sampling estimator (10) with a suitable choice of R leads
to improved confidence intervals.Furthermore, since the derivation of these
confidence intervals does not rely on the assumption that P(ω) is nonzero
everywhere, the clipped importance sampling estimator remains valid when the
dis-tribution P(ω) has a limited support.This relaxes the main restriction
associated with importance sampling.

总之，用适当选择的R来代替无偏重要抽样估计量(7)的削波重要抽样估计量(10)导致置信区间的改善。此外，由于这些置信区间的推导不依赖于P(ω)处处非零的假设，当分布P(ω)的支持度有限时，截尾重要性抽样估计量仍然有效。这放松了与重要性抽样相关的主要限制。

4.5
Interpreting the Confidence Intervals

4.5解释置信区间

The
estimation of the counterfactual expectation Y can be inaccurate because the
sample size is in-sufficient or because the sampling distribution P(ω) does not
sufficiently explore the counterfactual conditions of interest.

反事实期望Y的估计可能是不准确的，因为样本量是不充分的，或者因为抽样分布P(ω)没有充分探索感兴趣的反事实条件。

By
construction, the clipped expectation Y ignores the domains poorly explored by
the sam-pling distribution P(ω).The difference Y −Y then reflects the
inaccuracy resulting from a lack of exploration.Therefore, assuming that the
bound R has been chosen competently, the relative sizes of the outer and inner
confidence intervals provide precious cues to determine whether we can continue
collecting data using the same experimental setup or should adjust the data
collection experiment in order to obtain a better coverage.

通过构造，裁剪期望Y忽略了抽样分布P(ω)探索不充分的域。Y-Y的差异反映了由于缺乏探索而导致的不准确性。因此，假设界限R已经被恰当地选择，外置信区间和内置信区间的相对大小提供了宝贵的线索，以确定我们是否可以使用相同的实验设置继续收集数据，或者应该调整数据收集实验以获得更好的覆盖。

• The inner
confidence interval (13) witnesses the uncertainty associated with the domain
GR insufficiently explored by the actual distribution.A large inner confidence
interval suggests that the most practical way to improve the estimate is to
adjust the data collection experiment in order to obtain a better coverage of
the counterfactual conditions of interest.

内部置信区间(13)见证了与实际分布未充分探索的域GR相关的不确定性。较大的内部置信区间表明，提高估计值的最实用方法是调整数据收集实验，以便更好地覆盖感兴趣的反事实条件。

• The outer
confidence interval (12) represents the uncertainty that results from the
limited sample size.A large outer confidence interval indicates that the sample
is too small.To improve the result, we simply need to continue collecting data
using the same experimental setup.

外部置信区间(12)代表有限样本量导致的不确定性。较大的外部置信区间表明样本太小。为了改进结果，我们只需要使用相同的实验设置继续收集数据。

4.6
Experimenting with Mainline Reserves

4.6主线储备试验

We return to
the ad placement problem to illustrate the reweighting approach and the
interpretation of the confidence intervals.Manipulating the reserves Rp(x)
associated with the mainline positions (Figure 1) controls which ads are
prominently displayed in the mainline or displaced into the sidebar.

我们回到广告投放问题来说明重新加权方法和置信区间的解释。操纵与主线位置相关联的保留区Rp(x)控制哪些广告在主线中突出显示或者转移到侧边栏中。

We seek in
this section to answer counterfactual questions of the form:

我们在本节中寻求回答以下形式的反事实问题:

"How
would the ad placement system have performed if we had scaled the mainline
reserves by a constant factor ρ, without incurring user or advertiser
reactions?"

“如果我们以一个常数ρ来衡量主线储备，而不引起用户或广告商的反应，广告投放系统会有怎样的表现？”

Randomization
was introduced using a modified version of the ad placement engine.Before
determining the ad layout (see Section 2.1), a random number ε is drawn
according to the standard

随机化是使用广告投放引擎的修改版本引入的。在确定广告布局之前(见第2.1节)，根据标准绘制一个随机数ε

normal
distribution N (0,1), and all the mainline reserves are multiplied by m = ρe −σ
/+σε.Such

正态分布N (0，1)，所有主线储量乘以m =ρeσ/+σε。这样的

multipliers
follow a log-normal distributionwhose mean is ρ and whose width is controlled
by σ.This effectively provides a parametrization of the conditional score distribution
P(q|x,a) (see Figure 5.)

乘数遵循对数正态分布，其平均值为ρ，宽度由σ控制。这有效地提供了条件分数分布P(q|x，a)的参数化(见图5)。)

The Bing
search platform offers many ways to select traffic for controlled experiments
(Sec-tion 2.2).In order to match our isolation assumption, individual page
views were randomly as-signed to traffic buckets without regard to the user
identity.The main treatment bucket was pro-cessed with mainline reserves
randomized by a multiplier drawn as explained above with ρ=1 and σ = 0.3.With
these parameters, the mean multiplier is exactly 1, and 95% of the multipliers
are in range [0.52,1.74].Samples describing 22 million search result pages were
collected during five consecutive weeks.

必应搜索平台提供了许多选择流量进行受控实验的方法(第2.2节)。为了与我们的隔离假设相匹配，单个页面视图被随机签名到流量桶，而不考虑用户身份。主处理桶用主线储量进行处理，主线储量由乘数随机化，乘数如上所述绘制，ρ=1，σ = 0.3。有了这些参数，平均乘数正好是1，95%的乘数都在[0.52，1.74]的范围内。连续五周收集了描述2200万个搜索结果页面的样本。

We then use
this data to estimate what would have been measured if the mainline reserve
mul-tipliers had been drawn according to a distribution determined by
parameters ρ and σ .This is achieved by reweighting each sample ωi with

然后，我们使用这些数据来估计如果主线储备multipliers是根据由参数ρ和σ确定的分布绘制的，将会测量到什么。这是通过用ωi重新加权每个样本来实现的

w i =
P ( q i |x i，a i ) P ( q
i |x i，a i)

p ( m i ;ρ ,
σ ) p ( m i ;ρ , σ )

where mi is
the multiplier drawn for this sample during the data collection experiment, and
p(t ;ρ,σ) is the density of the log-normal multiplier distribution.

其中mi是在数据收集实验期间为该样本绘制的乘数，p(t；ρ，σ)是对数正态乘数分布的密度。

Figure 13
reports results obtained by varying ρ while keeping σ =σ.This amounts to
esti-mating what would have been measured if all mainline reserves had been
multiplied by ρ while keeping the same randomization.The curves bound 95%
confidence intervals on the variations of the average number of mainline ads
displayed per page, the average number of ad clicks per page,

图13报告了在保持σ =σ的情况下，通过改变ρ获得的结果。这相当于在保持相同随机化的情况下，对所有主线储量乘以ρ后的测量值进行了估计。这些曲线在每页显示的主线广告的平均数量、每页广告点击的平均数量，

8.More
precisely, lnN ( ,σ ) with = σ /2+logρ.

8.更准确地说，lnN(，σ)具有= σ /2+logρ。

Figure 13:
Estimated variations of three performance metrics in response to mainline
reserve

图13:响应主线储备的三个性能指标的估计变化

changes.The
curves delimit 95% confidence intervals for the metrics we would have observed
if we had increased the mainline reserves by the percentage shown on the
horizontal axis.The filled areas represent the inner confidence intervals.The
hollow squares represent the metrics measured on the experimental data.The
hollow circles represent metrics measured on a second experimental bucket with
mainline reserves re-duced by 18%.The filled circles represent the metrics
effectively measured on a control bucket running without randomization.

变化。这些曲线为我们观察到的指标划定了95%的置信区间，如果我们将主线储量增加水平轴上显示的百分比。填充区域代表内部置信区间。空心方块代表在实验数据上测量的度量。空心圆圈代表在第二个实验桶上测量的指标，主线储量减少了18%。实心圆圈代表在没有随机化的情况下运行的控制桶上有效测量的指标。

3228

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

and the
average revenue per page, as functions of ρ .The inner confidence intervals,
represented by the filled areas, grow sharply when ρ leaves the range explored
during the data collection ex-periment.The average revenue per page has more
variance because a few very competitive queries command high prices.

和每页的平均收入，作为ρ的函数。当ρ离开数据收集实验期间探索的范围时，由填充区域表示的内部置信区间急剧增长。每页的平均收入差异更大，因为一些竞争非常激烈的查询要求很高的价格。

In order to
validate the accuracy of these counterfactual estimates, a second traffic
bucket of equal size was configured with mainline reserves reduced by about
18%.The hollow circles in Figure 13 represent the metrics effectively measured
on this bucket during the same time period.The effective measurements and the
counterfactual estimates match with high accuracy.

为了验证这些反事实估计的准确性，配置了第二个同等大小的流量桶，主线储量减少了约18%。图13中的空心圆圈表示在同一时间段内在该桶上有效测量的度量。有效测量值和反事实估计值高度匹配。

Finally, in
order to measure the cost of the randomization, we also ran the unmodified ad
place-ment system on a control bucket.The brown filled circles in Figure 13
represent the metrics effec-tively measured on the control bucket during the
same time period.The randomization caused a small but statistically significant
increase of the number of mainline ads per page.The click yield and average
revenue differences are not significant.

最后，为了测量随机化的成本，我们还在控制桶上运行了未修改的广告投放系统。图13中用棕色填充的圆圈表示在同一时间段内在控制桶上有效测量的指标。随机化导致了每页主线广告数量的小幅但有统计学意义的增加。点击率和平均收入差异不显著。

This
experiment shows that we can obtain accurate counterfactual estimates with
affordable randomization strategies.However, this nice conclusion does not
capture the true practical value of the counterfactual estimation approach.

这个实验表明，我们可以用负担得起的随机化策略获得准确的反事实估计。然而，这个好结论并没有抓住反事实估计方法的真正实用价值。

4.7 More on
Mainline Reserves

4.7主线储备的更多信息

The main
benefit of the counterfactual estimation approach is the ability to use the
same data to answer a broad range of counterfactual questions.Here are a few
examples of counterfactual ques-tions that can be answered using data collected
using the simple mainline reserve randomization scheme described in the
previous section:

反事实估计方法的主要好处是能够使用相同的数据来回答广泛的反事实问题。这里有几个反事实问题的例子，可以通过使用上一节描述的简单主线储备随机化方案收集的数据来回答:

• Different
variances - Instead of estimating what would have been measured if we had
in-creased the mainline reserves without changing the randomization variance,
that is, letting σ = σ, we can use the same data to estimate what would have
been measured if we had also changed σ.This provides the means to determine
which level of randomization we can afford in future experiments.

不同的方差——如果我们在不改变随机化方差的情况下增加主线储量，也就是说，让σ = σ，我们可以使用相同的数据来估计如果我们也改变σ会测量到什么，而不是估计会测量到什么。这提供了确定我们在未来实验中能够承受的随机化水平的方法。

• Pointwise
estimates - We often want to estimate what would have been measured if we had
set the mainline reserves to a specific value without randomization.Although
computing estimates for small values of σ often works well enough, very small
values lead to large confidence intervals.

逐点估算-如果我们将主线储量设置为特定值而不进行随机化，我们通常希望估算将要测量的值。虽然计算小σ值的估计值通常足够有效，但非常小的值会导致较大的置信区间。

Let Yν(ρ)
represent the expectation we would have observed if the multipliers m had mean
ρ and variance ν.We have then Yν(ρ) = Em[E[y|m]] = Em[Y0(m)].Assuming that the
pointwise value Y0 is smooth enough for a second order development,

让Yν(ρ)代表如果乘数m有均值ρ和方差ν，我们会观察到的期望。然后我们有Yν(ρ) = Em[E[y|m]] = Em[Y0(m)]。假设逐点值Y0对于二阶展开足够平滑，

Yν ( ρ ) ≈ E
m

Y0 ( ρ) + (
m − ρ ) Y ′0 ( ρ) + ( m − ρ ) Y ′′ 0 ( ρ ) / 2

-Y0(ρ)+(mρ)Y′0(ρ)+(mρ)Y′0(ρ)/2

= Y0 ( ρ) + ν
Y ′′ 0 ( ρ ) / 2 .

=
Y0(ρ)+νY′’0(ρ)/2。

Although the
reweighting method cannot estimate the point-wise value Y0(ρ) directly, we can
use the reweighting method to estimate both Yν(ρ) and Y2ν(ρ) with acceptable
confidence intervals and write Y0(ρ) ≈ 2Yν(ρ)−Y2ν(ρ) (Goodwin, 2011).

虽然重加权法不能直接估计逐点值Y0(ρ)，但我们可以用重加权法以可接受的置信区间同时估计Yν(ρ)和Y2ν(ρ)，写出Y0(ρ)≈2Yν(ρ)-Y2ν(ρ)(Goodwin，2011)。

•
Query-dependent reserves - Compare for instance the queries "car
insurance" and "com-mon cause principle" in a web search
engine.Since the advertising potential of a search

查询相关准备金——例如，在网络搜索引擎中比较查询“汽车保险”和“共同原因原则”。因为搜索的广告潜力

3229

BOTTOU,
PETERS, ET AL.

波图，彼得斯，等。

varies
considerably with the query, it makes sense to investigate various ways to
define query-dependent reserves (Charles and Chickering, 2012).

随着查询的不同而有很大差异，研究定义查询相关储量的各种方法是有意义的(Charles和Chickering，2012)。

The data
collected using the simple mainline reserve randomization can also be used to es-timate
what would have been measured if we had increased all the mainline reserves by
a query-dependent multiplier ρ (x).This is simply achieved by reweighting each
sample ωi with

使用简单的主线储量随机化收集的数据也可以用来估计如果我们将所有主线储量增加一个依赖于查询的乘数ρ (x)会测量到什么。这可以通过用ωi重新加权每个样本来实现

w i =

P ( q i |x i
, a i ) P ( q i |x i , a i )

P ( q i |x i，a i ) P ( q
i |x i，a i)

p ( m i ;ρ (
x i ) , σ )

p(m I；ρ ( x i)，σ)

p ( m i ;, σ
) .

p(m I；, σ ) .

Considerably
broader ranges of counterfactual questions can be answered when data is
collected using randomization schemes that explore more dimensions.For
instance, in the case of the ad placement problem, we could apply an
independent random multiplier for each score instead of applying a single
random multiplier to the mainline reserves only.However, the more dimensions we
randomize, the more data needs to be collected to effectively explore all these
dimensions.Fortunately, as discussed in section 5, the structure of the causal
graph reveals many ways to leverage a priori information and improve the
confidence intervals.

当使用探索更多维度的随机化方案收集数据时，可以回答范围更广的反事实问题。例如，在广告投放问题的情况下，我们可以对每个分数应用一个独立的随机乘数，而不是只对主线储备应用一个随机乘数。然而，我们随机化的维度越多，就需要收集更多的数据来有效地探索所有这些维度。幸运的是，正如第5节所讨论的，因果图的结构揭示了许多利用先验信息和提高置信区间的方法。

4.8 Related
Work

4.8相关工作

Importance
sampling is widely used to deal with covariate shifts (Shimodaira,
2000;Sugiyama et al., 2007).Since manipulating the causal graph changes the
data distribution, such an intervention can be viewed as a covariate shift
amenable to importance sampling.Importance sampling techniques have also been
proposed without causal interpretation for many of the problems that we view as
causal inference problems.In particular, the work presented in this section is
closely related to the Monte-Carlo approach of reinforcement learning (Sutton
and Barto, 1998, Chapter 5) and to the offline evaluation of contextual bandit
policies (Li et al., 2010, 2011).

重要性抽样被广泛用于处理协变量转移(Shimodaira，2000；Sugiyama等人，2007年)。由于操纵因果图会改变数据分布，这种干预可以被视为服从重要性抽样的协变量转移。对于我们认为是因果推理问题的许多问题，也提出了没有因果解释的重要性抽样技术。特别是，本节介绍的工作与强化学习的蒙特卡罗方法(Sutton和，1998，第5章)和上下文土匪策略的离线评估(Li等人，2010，2011)密切相关。

Reinforcement
learning research traditionally focuses on control problems with relatively
small discrete state spaces and long sequences of observations.This focus
reduces the need for char-acterizing exploration with tight confidence
intervals.For instance, Sutton and Barto suggest to normalize the importance
sampling estimator by 1/∑i w(ωi) instead of 1/n. This would give erro-neous
results when the data collection distribution leaves parts of the state space
poorly explored.Contextual bandits are traditionally formulated with a finite
set of discrete actions.For instance, Li's (2011) unbiased policy evaluation
assumes that the data collection policy always selects an arbitrary policy with
probability greater than some small constant.This is not possible when the
action space is infinite.

强化学习研究传统上集中于具有相对小的离散状态空间和长的观测序列的控制问题。这种关注减少了对具有紧密置信区间的特征勘探的需要。例如，萨顿和巴尔托建议用1/∑i w(ωi)而不是1/n来归一化重要性抽样估计量。当数据收集分布使得部分状态空间未被充分探索时，这将给出错误的结果。背景盗匪传统上是由一组有限的离散动作组成的。例如，李(2011)的无偏策略评估假设数据收集策略总是选择概率大于某个小常数的任意策略。当动作空间无限时，这是不可能的。

Such
assumptions on the data collection distribution are often impractical.For
instance, certain ad placement policies are not worth exploring because they
cannot be implemented efficiently or are known to elicit fraudulent
behaviors.There are many practical situations in which one is only interested
in limited aspects of the ad placement policy involving continuous parameters
such as click prices or reserves.Discretizing such parameters eliminates useful
a priori knowledge: for instance, if we slightly increase a reserve, we can
reasonable believe that we are going to show slightly less ads.

这种关于数据收集分布的假设通常是不切实际的。例如，某些广告投放策略不值得探索，因为它们无法有效实施或已知会引发欺诈行为。有许多实际情况，人们只对广告投放政策的有限方面感兴趣，包括连续的参数，如点击价格或储备。将这些参数离散化消除了有用的先验知识:例如，如果我们稍微增加一个储备，我们可以合理地相信我们将展示稍微少一点的广告。

Instead of
making assumptions on the data collection distribution, we construct a biased
estima-tor (10) and bound its bias.We then interpret the inner and outer
confidence intervals as resulting from a lack of exploration or an insufficient
sample size.

我们不是对数据收集分布做假设，而是构造一个有偏估计量(10)并限制它的偏差。然后，我们将内部和外部置信区间解释为缺乏探索或样本量不足导致的。

3230

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Finally, the
causal framework allows us to easily formulate counterfactual questions that
pertain to the practical ad placement problem and yet differ considerably in
complexity and exploration requirements.We can address specific problems
identified by the engineers without incurring the risks associated with a
complete redesign of the system.Each of these incremental steps helps
demonstrating the soundness of the approach.

最后，因果框架允许我们很容易地提出与实际广告投放问题相关的反事实问题，但在复杂性和探索要求上有很大不同。我们可以解决工程师发现的具体问题，而不会产生与系统完全重新设计相关的风险。这些渐进步骤中的每一步都有助于证明该方法的合理性。

5.Structure

5.结构

This section
shows how the structure of the causal graph reveals many ways to leverage a
priori knowledge and improve the accuracy of our counterfactual
estimates.Displacing the reweighting point (Section 5.1) improves the inner
confidence interval and therefore reduce the need for explo-ration.Using a
prediction function (Section 5.2) essentially improve the outer confidence
interval and therefore reduce the sample size requirements.

本节展示了因果图的结构如何揭示利用先验知识和提高我们的反事实估计的准确性的许多方法。替换重加权点(第5.1节)提高了内部置信区间，因此减少了解释的需要。使用预测函数(第5.2节)本质上改善了外部置信区间，因此降低了样本量要求。

5.1 Better
Reweighting Variables

5.1更好地重新加权变量

Many search
result pages come without eligible ads.We then know with certainty that such
pages will have zero mainline ads, receive zero clicks, and generate zero
revenue.This is true for the randomly selected value of the reserve, and this
would have been true for any other value of the reserve.We can exploit this
knowledge by pretending that the reserve was drawn from the coun-terfactual
distribution P(q|xi ,ai) instead of the actual distribution P(q|xi ,ai).The
ratio w(ωi) is therefore forced to the unity.This does not change the estimate
but reduces the size of the inner confidence interval.The results of Figure 13
were in fact helped by this little optimization.

许多搜索结果页面没有合格的广告。然后，我们可以肯定地知道，这样的页面将没有主线广告，没有点击，也没有收入。对于随机选择的储备值来说是这样，对于储备的任何其他值来说也是这样。我们可以利用这一知识，假设储备是从实际分布P(q|xi，ai)中提取的，而不是实际分布P(q|xi，ai)。因此，比值w(ωi)被强制为1。这不会改变估计值，但会减小内部置信区间的大小。图13的结果实际上得益于这个小小的优化。

There are in
fact many circumstances in which the observed outcome would have been the same
for other values of the randomized variables.This prior knowledge is in fact
encoded in the structure of the causal graph and can be exploited in a more
systematic manner.For instance, we know that users make click decisions without
knowing which scores were computed by the ad placement engine, and without
knowing the prices charged to advertisers.The ad placement causal graph encodes
this knowledge by showing the clicks y as direct effects of the user intent u
and the ad slate s. This implies that the exact value of the scores q does not
matter to the clicks y as long as the ad slate s remains the same.

事实上，在许多情况下，观察到的结果对于随机变量的其他值是相同的。这种先验知识事实上被编码在因果图的结构中，并且可以以更系统的方式加以利用。例如，我们知道用户在不知道广告投放引擎计算了哪些分数，也不知道向广告商收取的价格的情况下做出点击决定。广告投放因果图通过将点击量y显示为用户意图u和广告名单s的直接影响来编码这一知识。这意味着只要广告名单s保持不变，分数q的确切值对点击量y并不重要。

Because the
causal graph has this special structure, we can simplify both the actual and
counter-factual Markov factorizations (2) (3) without eliminating the variable
y whose expectation is sought.Successively eliminating variables z, c, and q
gives:

因为因果图具有这种特殊的结构，我们可以简化实际的和反事实的马尔可夫因子分解(2) (3)，而不消除期望被寻求的变量y。连续消除变量z、c和q可得出:

The
conditional distributions P(s|x,a,b) and P(s|x,a,b) did not originally appear
in the Markov factorization.They are defined by marginalization as a
consequence of the elimination of the vari-able q representing the scores.

条件分布P(s|x，a，b)和P(s|x，a，b)最初没有出现在马尔可夫分解中。他们被边缘化定义为代表分数的变量q被消除的结果。

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a )

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a)

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a ) .

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a)。

3231

BOTTOU,
PETERS, ET AL.

波图，彼得斯，等。

Figure 14:
Estimated variations of two performance metrics in response to mainline reserve

图14:响应主线储备的两个性能指标的估计变化

changes.These
estimates were obtained using the ad slates s as reweighting variable.Compare
the inner confidence intervals with those shown in Figure 13.

变化。这些估计是使用广告板作为重新加权变量获得的。将内部置信区间与图13所示的置信区间进行比较。

We can
estimate the counterfactual click yield Y using these simplified
factorizations:

我们可以使用这些简化的因式分解来估计反事实的点击率Y:

= Z y P ( u v
x a b s y) = Z y P ( s|x a b ) P ( s|x , a , b ) P ( u v x a b s y )

= Z y P ( u v
x a b s y) = Z y P ( s|x a b ) P ( s|x，a，b ) P ( u v x a b s y)

≈

n n∑i=1 y i P
( s i |x i a i b i ) P ( s i | x i , a i , b i ) .

n n∑i=1 y i P
( s i |x i a i b i ) P ( s i | x i，a i，b i)。

(16)

We have
reproduced the experiments described in Section 4.6 with the counterfactual
esti-mate (16) instead of (4).For each example ωi , we determine which range [m
max i ,m min i ] of mainline reserve multipliers could have produced the observed
ad slate si , and then compute the reweighting ratio using the formula:

我们用反事实估计(16)代替(4)再现了第4.6节中描述的实验。对于每个示例ωi，我们确定主线储备乘数的哪个范围[m最大值I，m最小值i ]可以产生观察到的ad slate si，然后使用以下公式计算再加权比率:

w i =

P ( s i | x i
, a i , b i ) P ( s i | x i , a i , b i )

P ( s i | x i，a i，b i ) P ( s
i | x i，a i，b i)

Ψ ( m max i
;ρ , σ ) − Ψ ( m min i ;ρ , σ ) Ψ ( mmax i ;ρ , σ ) − Ψ ( mmin i ;ρ , σ )

ψ(m最大值I；ρ，σ)ψ(m min I；ρ，σ)ψ(mmax I；ρ，σ)ψ(mmin I；ρ , σ )

where
Ψ(m;ρ,σ) is the cumulative of the log-normal multiplier distribution.Figure 14
shows coun-terfactual estimates obtained using the same data as Figure 13.The
obvious improvement of the

其中ψ(m；ρ，σ)是对数正态乘数分布的累积。图14显示了使用与图13相同的数据获得的交互事实估计。的明显改善

Figure 16: A
distribution on the scores q induce a distribution on the possible ad slates s.
If the observed slate is slate2, the reweighting ratio is 34/22.

图16:分数q的分布导致可能的广告候选名单的分布。如果观察到的候选名单是候选名单2，则重新加权比率是34/22。

inner
confidence intervals significantly extends the range of mainline reserve
multipliers for which we can compute accurate counterfactual expectations using
this same data.

内部置信区间显著扩大了主线储备乘数的范围，我们可以使用相同的数据计算出准确的反事实预期。

Comparing (4)
and (16) makes the difference very clear: instead of computing the ratio of the
probabilities of the observed scores under the counterfactual and actual
distributions, we compute the ratio of the probabilities of the observed ad
slates under the counterfactual and actual distribu-tions.As illustrated by
Figure 15, we now distinguish the reweighting variable (or variables) from the
intervention.In general, the corresponding manipulation of the Markov
factorization consists of marginalizing out all the variables that appear on
the causal paths connecting the point of interven-tion to the reweighting
variables and factoring all the independent terms out of the integral.This
simplification works whenever the reweighting variables intercept all the
causal paths connecting the point of intervention to the measurement variable.In
order to compute the new reweighting ratios, all the factors remaining inside
the integral, that is, all the factors appearing on the causal paths connecting
the point of intervention to the reweighting variables, have to be known.

比较(4)和(16)使区别非常明显:我们不是计算在反事实和实际分布下观察到的得分的概率比，而是计算在反事实和实际分布下观察到的广告的概率比。如图15所示，我们现在将重新加权变量(或多个变量)与干预区分开来。一般来说，马尔可夫因子分解的相应操作包括将出现在连接介入点和重加权变量的因果路径上的所有变量边缘化，并将所有独立项从积分中分解出来。每当重新加权的变量截取连接干预点和测量变量的所有因果路径时，这种简化就起作用了。为了计算新的重新加权比率，必须知道积分中剩余的所有因子，即在连接干预点和重新加权变量的因果路径上出现的所有因子。

Figure 14
does not report the average revenue per page because the revenue z also depends
on the scores q through the click prices c. This causal path is not intercepted
by the ad slate variable s alone.However, we can introduce a new variable ˜c =
f(c, y)that filters out the click prices computed

图14没有报告每页的平均收入，因为收入z也依赖于点击率c的得分q。这个因果路径不是由单独的ad slate变量s截取的。但是，我们可以引入一个新的变量\\\\\\\\\\\\\\\\\

3233

BOTTOU,
PETERS, ET AL.

波图，彼得斯，等。

for ads that
did not receive a click.Markedly improved revenue estimates are then obtained
by reweighting according to the joint variable (s, c˜).

对于没有点击的广告。然后，通过根据联合变量(s，c ;)重新加权，可以获得显著改善的收入估计。

Figure 16
illustrates the same approach applied to the simultaneous randomization of all
the scores q using independent log-normal multipliers.The weight w(ωi) is the
ratio of the probabilities of the observed ad slate si under the counterfactual
and actual multiplier distributions.Computing these probabilities amounts to
integrating a multivariate Gaussian distribution (Genz, 1992).Details will be
provided in a forthcoming publication.

图16显示了使用独立对数正态乘数同时随机化所有分数q的相同方法。权重w(ωi)是在反事实和实际乘数分布下观察到的ad slate si的概率比。计算这些概率相当于对多变量高斯分布进行积分(Genz，1992)。细节将在即将出版的出版物中提供。

5.2 Variance
Reduction with Predictors

5.2使用预测因子减少方差

Although we
do not know exactly how the variable of interest ℓ(ω) depends on the measurable
vari-ables and are affected by interventions on the causal graph, we may have
strong a priori knowledge about this dependency.For instance, if we augment the
slate s with an ad that usually receives a lot of clicks, we can expect an
increase of the number of clicks.

虽然我们不知道感兴趣的变量(ℓ(ω)是如何依赖于可测量的变量并受到因果图上的干预的影响，但我们可能对这种依赖有很强的先验知识。例如，如果我们用一个通常会收到大量点击量的广告来增加候选名单，我们可以预期点击量会增加。

Let the
invariant variables υ be all observed variables that are not direct or indirect
effects of variables affected by the intervention under consideration.This
definition implies that the distri-bution of the invariant variables is not
affected by the intervention.Therefore the values υi of the invariant variables
sampled during the actual experiment are also representative of the
distribution of the invariant variables under the counterfactual conditions.

设不变变量υ是所有观察到的变量，它们不是受所考虑的干预影响的变量的直接或间接影响。这个定义意味着不变变量的分布不受干预的影响。因此，在实际实验期间采样的不变变量的值υi也代表了在反事实条件下不变变量的分布。

We can
leverage a priori knowledge to construct a predictor ζ(ω) of the quantity ℓ(ω)
whose counterfactual expectation Y is sought.We assume that the predictor ζ(ω)
depends only on the invariant variables or on variables that depend on the
invariant variables through known functional dependencies.Given sampled values
υi of the invariant variables, we can replay both the original and manipulated
structural equation model as explained in Section 4.1 and obtain samples ζi and
ζ i that respectively follow the actual and counterfactual distributions

我们可以利用先验知识来构造量ℓ(ω的预测器ζ(ω),其反事实期望y被寻求。我们假设预测因子ζ(ω)仅依赖于不变变量或者依赖于通过已知函数依赖关系的不变变量的变量。给定不变变量的采样值υi，我们可以如第4.1节所述重放原始和操纵的结构方程模型，并获得分别遵循实际和反事实分布的样本ζi和ζ i

Then,
regardless of the quality of the predictor,

然后，不管预测器的质量如何，

Y = Zω ℓ ( ω
) P ( ω) =

Z ω ζ(ω)P (ω)

Z ω (ℓ(ω)−ζ(ω))P (ω) ≈

Z ω ζ(ω)P (ω)

Z ω (ℓ(ω)−ζ(ω))P (ω) ≈

一

∑

i=1

ζ i + n n ∑
i=1 (ℓ(ωi)−ζi)w(ωi).

(17)

The first
term in this sum represents the counterfactual expectation of the predictor and
can be accurately estimated by averaging the simulated counterfactual samples ζ
i without resorting to po-tentially large importance weights.The second term in
this sum represents the counterfactual ex-pectation of the residuals ℓ(ω)−ζ(ω)
and must be estimated using importance sampling.Since the magnitude of the
residuals is hopefully smaller than that of ℓ(ω), the variance of
(ℓ(ω)−ζ(ω))w(ω) is reduced and the importance sampling estimator of the second
term has improved confidence in-tervals.The more accurate the predictor ζ(ω),
the more effective this variance reduction strategy.

这个和中的第一项表示预测器的反事实期望，并且可以通过对模拟的反事实样本ζ i求平均而不求助于潜在的大重要性权重来精确地估计。这个和中的第二项代表残差ℓ(ω)−ζ(ω的反事实预测，必须使用重要性抽样来估计。由于残差的幅度希望小于ℓ(ω的幅度，所以(ℓ(ω)−ζ(ω))w(ω)的方差减小了，并且第二项的重要性抽样估计量提高了置信区间。预测因子ζ(ω)越精确，这种方差减小策略就越有效。

This variance
reduction technique is in fact identical to the doubly robust contextual bandit
eval-uation technique of Dudík et al. (2012).Doubly robust variance reduction
has also been extensively used for causal inference applied to biostatistics
(see Robins et al., 2000;Bang and Robins, 2005).We subjectively find that
viewing the predictor as a component of the causal graph (Figure 17) clar-ifies
how a well designed predictor can leverage prior knowledge.For instance, in
order to estimate the counterfactual performance of the ad placement system, we
can easily use a predictor that runs the ad auction and simulate the user
clicks using a click probability model trained offline.

这种方差减少技术实际上与Dudík等人(2012)的双重鲁棒上下文土匪评估技术相同。双重稳健方差缩减也被广泛用于应用于生物统计学的因果推断(参见Robins等人，2000；Bang and Robins，2005)。我们主观地发现，将预测器视为因果图的一个组成部分(图17)说明了一个设计良好的预测器如何利用先验知识。例如，为了估计广告投放系统的反事实性能，我们可以很容易地使用运行广告拍卖的预测器，并使用离线训练的点击概率模型来模拟用户点击。

3234

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 17:
Leveraging a predictor.Yellow nodes represent known functional relations in the
struc-

图17:利用预测器。黄色节点代表结构中已知的功能关系

tural
equation model.We can estimate the counterfactual expectation Y of the number
of clicks per page as the sum of the counterfactual expectations of a predictor
ζ, which is easy to estimate by replaying empirical data, and y−ζ, which has to
be estimated by importance sampling but has reduced variance.

自然方程模型。我们可以将每页点击次数的反事实期望Y估计为预测者ζ和Yζ的反事实期望之和，预测者ζ很容易通过重放经验数据来估计，Yζ必须通过重要性抽样来估计，但方差减小。

Figure 18:
The two plots show the hourly click yield for two variants of the ad placement
engine.The daily variations dwarf the differences between the two treatments.

图18:两个图显示了广告投放引擎的两个变体的每小时点击量。每天的变化使两种治疗之间的差异相形见绌。

3235

BOTTOU,
PETERS, ET AL.

波图，彼得斯，等。

5.3 Invariant
Predictors

5.3不变预测因子

In order to
evaluate which of two interventions is most likely to improve the system, the
designer of a learning system often seeks to estimate a counterfactual
difference, that is, the difference Y + −Y of the expectations of a same
quantity ℓ(ω) under two different counterfactual distributions P+(ω) and
P(ω).These expectations are often affected by variables whose value is left
unchanged by the interventions under consideration.For instance, seasonal
effects can have very large effects on the number of ad clicks (Figure 18) but
affect Y + and Y in similar ways.

为了评估两种干预中的哪一种最有可能改善系统，学习系统的设计者经常寻求估计一个反事实差异，即在两个不同的反事实分布P+(ω)和P(ω)下，相同数量ℓ(ω的期望的差异y+y。这些期望经常受到变量的影响，这些变量的值在所考虑的干预中保持不变。例如，季节效应会对广告点击量产生非常大的影响(图18)，但会以类似的方式影响Y +和Y。

Substantially
better confidence intervals on the difference Y + −Y can be obtained using an
invariant predictor, that is, a predictor function that depends only on
invariant variables υ such as the time of the day.Since the invariant predictor
ζ(υ) is not affected by the interventions under consideration,

使用不变预测器，即仅依赖于不变变量υ(例如一天中的时间)的预测器函数，可以获得关于差Y+Y的实质上更好的置信区间。由于不变预测因子ζ(υ)不受所考虑的干预的影响，

Z ω ζ(υ)P (ω)
= Z ω ζ(υ)P +(ω).

Z ω ζ(υ)P (ω)
= Z ω ζ(υ)P +(ω)。

(18)

Therefore

因此

Y + −Y

Y+Y

= Zω ζ ( υ )
P + ( ω) + Zω ( ℓ ( ω ) − ζ ( υ)) P + ( ω )

=
zωζ(υ)p+(ω)+zω(ℓ(ω)ζ(υ))p+(ω)

− Z ω ζ(υ)P
(ω)− Z ω (ℓ(ω)−ζ(υ))P (ω) ≈

zωζ(υ)p(ω)zω(ℓ(ω)−ζ(υ))p(ω)≈

一

∑

i=1

ℓ(ωi)−ζ(υi) P
+(ωi)−P (ωi) P(ωi) .

ℓ(ωi)−ζ(υi)p+(ωI)p(ωI)p(ωI)。

This direct
estimate of the counterfactual difference Y + −Y benefits from the same
variance reduc-tion effect as (17) without need to estimate the expectations
(18).Appendix C provide details on the computation of confidence intervals for
estimators of the counterfactual differences.Appendix D shows how the same
approach can be used to compute counterfactual derivatives that describe the
response of the system to very small interventions.

这种对反事实差异Y+Y的直接估计受益于与(17)相同的方差减小效应，而不需要估计期望值(18)。附录C提供了反事实差异估计量的置信区间计算的细节。附录D显示了如何使用相同的方法来计算描述系统对非常小的干预的反应的反事实导数。

6.Learning

6.学问

The previous
sections deal with the identification and the measurement of interpretable
signals that can justify the actions of human decision makers.These same signals
can also justify the actions of machine learning algorithms.This section
explains why optimizing a counterfactual estimate is a sound learning
procedure.

前面几节讨论了可解释信号的识别和测量，这些信号可以证明人类决策者的行为是正确的。这些相同的信号也可以证明机器学习算法的作用。本节解释了为什么优化反事实估计是一个合理的学习过程。

6.1 A
Learning Principle

6.1学习原则

We consider
in this section interventions that depend on a parameter θ.For instance, we
might want to know what the performance of the ad placement engine would have
been if we had used different values for the parameter θ of the click scoring
model.Let Pθ (ω) denote the counterfactual Markov factorization associated with
this intervention.Let Y θ be the counterfactual expectation of ℓ(ω) under
distribution Pθ .Figure 19 illustrates our simple learning setup.Training data
is collected from a single experiment associated with an initial parameter
value θ chosen using prior knowledge acquired in an unspecified manner.A
preferred parameter value θ is then determined using the training data and
loaded into the system.The goal is of course to observe a good performance on
data collected during a test period that takes place after the switching point.

在本节中，我们考虑依赖于参数θ的干预。例如，我们可能想知道如果我们对点击评分模型的参数θ使用不同的值，广告投放引擎的性能会如何。让Pθ (ω)表示与该干预相关的反事实马尔可夫因子分解。设Y θ是分布Pθ下ℓ(ω的反事实期望。图19展示了我们简单的学习设置。训练数据是从与初始参数值θ相关联的单个实验中收集的，该初始参数值θ是使用以未指定方式获得的先验知识选择的。然后使用训练数据确定优选的参数值θ，并将其加载到系统中。目标当然是观察在切换点之后的测试期间收集的数据的良好性能。

3236

Figure 19:
Single design - A preferred parameter value θ is determined using randomized
data collected in the past.Test data is collected after loading θ into the
system.

图19:单一设计-使用过去收集的随机数据确定首选参数值θ。测试数据在θ载入系统后收集。

The isolation
assumption introduced in Section 3.2 states that the exogenous variables are
drawn from an unknown but fixed joint probability distribution.This
distribution induces a joint distribu-tion P(ω) on all the variables ω
appearing in the structural equation model associated with the parameter
θ.Therefore, if the isolation assumption remains valid during the test period,
the test data follows the same distribution Pθ (ω) that would have been
observed during the training data collection period if the system had been
using parameter θ all along.

第3.2节中引入的隔离假设指出，外生变量来自未知但固定的联合概率分布。这种分布在与参数θ相关的结构方程模型中出现的所有变量ω上引起联合分布P(ω)。因此，如果隔离假设在测试期间保持有效，测试数据将遵循相同的分布Pθ (ω)，如果系统一直使用参数θ，则在训练数据收集期间将会观察到该分布。

We can
therefore formulate this problem as the optimization of the expectation Y θ of
the reward ℓ(ω) with respect to the distribution Pθ (ω)

因此，我们可以将这个问题公式化为报酬ℓ(ω的期望Y θ相对于分布Pθ (ω)的最优化

maxθ Y θ = Zω
ℓ ( ω ) P θ ( ω )

maxθ Y θ = Zω
ℓ ( ω ) P θ ( ω)

on the basis
of a finite set of training examples ω1,...,ωn sampled from P(ω).

基于有限组训练示例ω1，...，ωn从P(ω)采样。

However, it
would be unwise to maximize the estimates obtained using approximation (5)
be-cause they could reach a maximum for a value of θ that is poorly explored by
the actual distribution.As explained in Section 4.5, the gap between the upper
and lower bound of inequality (14) reveals the uncertainty associated with
insufficient exploration.Maximizing an empirical estimate Ybθ of the lower
bound Y θ ensures that the optimization algorithm finds a trustworthy answer

然而，最大化使用近似值(5)获得的估计值是不明智的，因为它们可能达到θ值的最大值，而实际分布对此了解甚少。如第4.5节所述，不等式(14)的上限和下限之间的差距揭示了与探索不足相关的不确定性。最大化下界Y θ的经验估计Ybθ确保了优化算法找到可信的答案

θ = argmax θ
Yb θ .

θ = argmax θ
Yb θ。

We shall now
discuss the statistical basis of this learning principle.

我们现在将讨论这个学习原则的统计基础。

6.2 Uniform
Confidence Intervals

6.2统一置信区间

As discussed
in Section 4.4, inequality (14),

如第4.4节不等式(14)所述，

Y θ ≤ Y θ ≤ Y
θ + M ( 1 − W θ ) ,

Yθ≤Yθ≤Yθ+M(1wθ)，

where

在哪里

Y θ = Zω ℓ (
ω ) w ( ω ) P ( ω ) ≈ Yb θ = n n∑i=1 ℓ ( ω i ) w ( ω i ) W θ = Zω w ( ω ) P ( ω
) ≈ Wbθ = n n∑i=1 w ( ω i )

yθ=
zωℓ(ω)w(ω)p(ω)≈Ybθ= n n∑I = 1ℓ(ωI)w(ωI)wθ= zωw(ω)p(ω)≈WBθ= n∑I = 1w(ωI)

(19)

9.The idea of
maximizing the lower bound may surprise readers familiar with the UCB algorithm
for multi-armed bandits (Auer et al., 2002).UCB performs exploration by
maximizing the upper confidence interval bound and updating the confidence
intervals online.Exploration in our setup results from the active system
randomization during the offline data collection.See also Section 6.4.

9.下限最大化的想法可能会让熟悉多武装匪徒的UCB算法的读者感到惊讶(奥尔等人，2002)。UCB通过最大化置信区间上限和在线更新置信区间来执行探索。我们设置中的探索源于离线数据收集期间的主动系统随机化。另请参见第6.4节。

3237

BOTTOU,
PETERS, ET AL.

波图，彼得斯，等。

leads to
confidence intervals (15) of the form

导致形式的置信区间(15)

∀ δ > , ∀
θ P n Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( − Wbθ + ξ R) + ε R o ≥ − δ .

∀δ>∀θp n
Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr o≥δ。

(20)

Both εR and
ξR converge to zero in inverse proportion to the square root of the sample size
n. They also increase at most linearly in logδ and depend on both the capping
bound R and the parameter θ through the empirical variances (see appendix B.)

εR和ξR都以与样本量n的平方根成反比的方式收敛到零。它们也最多以对数δ线性增加，并且通过经验方差同时依赖于上限R和参数θ(见附录b)

Such
confidence intervals are insufficient to provide guarantees for a parameter
value θ that depends on the sample.In fact, the optimization (19) procedure is
likely to select values of θ for which the inequality is violated.We therefore
seek uniform confidence intervals (Vapnik and Chervonenkis, 1968), simultaneously
valid for all values of θ.

这种置信区间不足以保证依赖于样本的参数值θ。事实上，优化(19)过程可能会选择违反不等式的θ值。因此，我们寻求对θ的所有值同时有效的统一置信区间(Vapnik和Chervonenkis，1968)。

• When the
parameter θ is chosen from a finite set F , applying the union bound to the
ordinary

当参数θ从有限集合F中选择时，将并界应用于普通

intervals
(20) immediately gives the uniform confidence interval :

区间(20)立即给出统一的置信区间:

P n ∀ θ ∈ F ,
Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( −Wbθ + ξ R)+ ε R o ≥ −| F |δ .

P n ∀ θ ∈ F，Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr
o≥| f |δ。

• Following
the pioneering work of Vapnik and Chervonenkis, a broad choice of mathematical

遵循瓦普尼克和切尔沃嫩基斯的开创性工作，广泛选择数学

tools have
been developed to construct uniform confidence intervals when the set F is
infinite.

当集合F为无穷大时，已经开发了构造一致置信区间的工具。

For instance,
appendix E leverages uniform empirical Bernstein bounds (Maurer and Pontil,
2009) and obtains the uniform confidence interval

例如，附录E利用统一的经验伯恩斯坦界限(Maurer和Pontil，2009)并获得统一的置信区间

P n ∀ θ ∈ F ,
Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( −Wbθ + ξ R)+ ε R o ≥ − M ( n ) δ ,

P n ∀ θ ∈ F，Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr
o≥m(n)δ，

where the
growth function M (n) measures the capacity of the family of functions

其中增长函数M (n)测量函数族的容量

{ fθ : ω 7→
ℓ(ω)w (ω) , gθ : ω 7→ w (ω) , ∀θ ∈ F } .

{ fθ : ω 7→
ℓ(ω)w (ω)，gθ : ω 7→ w (ω)，∀θ ∈ F }。

(21)

Many
practical choices of P(ω) lead to functions M (n) that grow polynomially with
the sample size.Because both εR and ξR are O(n −/logδ), they converge to zero
with the sample size when one maintains the confidence level 1−M (n)δ equal to
a predefined constant.

P(ω)的许多实际选择导致函数M
(n)随样本量多项式增长。因为εR和ξR都是0(n/logδ)，当一个人保持置信水平1M(n)δ等于预定常数时，它们随着样本量收敛到零。

The
interpretation of the inner and outer confidence intervals (Section 4.5) also
applies to the uniform confidence interval (21).When the sample size is
sufficiently large and the capping bound R chosen appropriately, the inner
confidence interval reflects the upper and lower bound of inequal-ity (14).

内部和外部置信区间的解释(第4.5节)也适用于统一置信区间(21)。当样本量足够大并且适当选择上限时，内部置信区间反映了不等性(14)的上限和下限。

The uniform
confidence interval therefore ensures that Y θ is close to the maximum of the
lower bound of inequality (14) which essentially represents the best
performance that can be guaranteed using training data sampled from
P(ω).Meanwhile, the upper bound of this same inequality reveals which values of
θ could potentially offer better performance but have been insufficiently
probed by the sampling distribution (Figure 20.)

因此，均匀置信区间确保了Y θ接近不等式(14)的下限的最大值，这基本上代表了使用从P(ω)采样的训练数据可以保证的最佳性能。与此同时，这个不等式的上限揭示了哪些θ值可能提供更好的性能，但采样分布没有充分探究(图20。)

6.3 Tuning Ad
Placement Auctions

6.3调整广告投放拍卖

We now
present an application of this learning principle to the optimization of
auction tuning pa-rameters in the ad placement engine.Despite increasingly
challenging engineering difficulties,

我们现在将这一学习原理应用于广告投放引擎中拍卖调整参数的优化。尽管工程难度越来越大，

3238

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 20:
The uniform inner confidence interval reveals where the best guaranteed Y θ is
reached and where additional exploration is needed.

图20:统一的内部置信区间揭示了最佳保证Y θ达到的位置和需要额外探索的位置。

Figure 21: Level
curves associated with the average number of mainline ads per page (red curves

图21:与平均每页主线广告数量相关的水平曲线(红色曲线

labelled from
−6% to +10%) and the average estimated advertisement value generated per page
(black curves, labelled with arbitrary units ranging from 164 to 169) tha

[论文翻译]反事实推理和学习系统:计算广告的例子

原文地址：https://jmlr.org/papers/volume14/bottou13a/bottou13a.pdf

Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising 反事实推理和学习系统:计算广告的例子

Abstract