[论文翻译]反事实推理和学习系统:计算广告的例子


原文地址:https://jmlr.org/papers/volume14/bottou13a/bottou13a.pdf


Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising 反事实推理和学习系统:计算广告的例子

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson; 14(65):3207−3260, 2013.

Abstract

This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.

这项工作展示了如何利用因果推理来理解复杂学习系统与其环境交互的行为,并预测系统变化的后果。这样的预测使得人类和算法都可以选择能够改善系统性能的变化。这项工作是通过与必应搜索引擎相关的广告投放系统的实验来说明的。

Keywords:
causation, counterfactual reasoning, computational advertising

关键词:因果关系,反事实推理,计算广告

1.Introduction

1.介绍

Statistical machine learning technologies in the real world are never without a purpose.Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified
the system design in the first place.

现实世界中的统计机器学习技术从来都是有目的的。利用他们的预测,人类或机器做出的决策,其迂回的结果往往违反了最初证明系统设计合理性的建模假设。

Such contradictions appear very clearly in the case of the learning systems that power web scale applications such as search engines, ad placement engines, or recommendation systems.For instance, the placement of advertisement on the
result pages of Internet search engines depend on the bids of advertisers and on scores computed by statistical machine learning systems.Because the scores affect the contents of the result pages proposed to the users, they directly influence
the occurrence of clicks and the corresponding advertiser payments.They also have important indirect effects.Ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this web site in the
future.They also impact the return on investment observed by the advertisers and therefore their future bids.Finally they change the nature of the data collected for training the statistical models in the future.

这种矛盾在为网络规模的应用提供动力的学习系统中表现得非常明显,例如搜索引擎、广告投放引擎或推荐系统。例如,在互联网搜索引擎的结果页面上放置广告取决于广告商的出价和统计机器学习系统计算的分数。因为分数影响向用户建议的结果页面的内容,所以它们直接影响点击的发生和相应的广告商支付。它们也有重要的间接影响。广告投放决策会影响用户的满意度,从而影响他们将来频繁访问该网站的意愿。它们还会影响广告商和他们未来的出价。最后,它们改变了为将来训练统计模型而收集的数据的性质。
These complicated interactions are clarified by important theoretical works.Under
simplified assumptions, mechanism design (Myerson, 1981) leads to an insightful
account of the advertiser feedback loop (Varian, 2007;Edelman et al.,
2007).Under simplified assumptions, multiarmed bandits theory (Robbins, 1952;Auer
et al., 2002;Langford and Zhang, 2008) and reinforcement learning (Sutton and
Barto, 1998) describe the exploration/exploitation dilemma associated with the
training feedback loop.However, none of these approaches gives a complete
account of the complex interactions found in real-life systems.

重要的理论著作阐明了这些复杂的相互作用。在简化的假设下,机制设计(Myerson,1981)导致了对广告商反馈回路的深刻理解(Varian,2007;Edelman等人,2007年)。在简化假设下,多臂老虎机理论(罗宾斯,1952;Auer等人,2002年;兰福德和张,2008)和强化学习(萨顿和,1998)描述了与训练反馈回路相关的探索/开发困境。然而,这些方法都没有给出现实系统中复杂交互的完整描述。

This contribution proposes a novel approach: we view these complicated interactions
as man-ifestations of the fundamental difference that separates correlation and
causation.Using the ad placement example as a model of our problem class, we
therefore argue that the language and the methods of causal inference provide
flexible means to describe such complex machine learning sys-tems and give
sound answers to the practical questions facing the designer of such a
system.Is it useful to pass a new input signal to the statistical model?Is it
worthwhile to collect and label a new training set?What about changing the loss
function or the learning algorithm?In order to answer such questions and
improve the operational performance of the learning system, one needs to unravel
how the information produced by the statistical models traverses the web of
causes and effects and eventually produces measurable performance metrics.

这篇文章提出了一种新的方法:我们将这些复杂的相互作用视为区分相关性和因果关系的根本区别的人工陈述。因此,使用广告放置示例作为我们问题类的模型,我们认为因果推理的语言和方法提供了灵活的手段来描述这种复杂的机器学习系统,并给出了这种系统的设计者所面临的实际问题的合理答案。向统计模型传递新的输入信号有用吗?收集和标记一个新的训练集值得吗?改变损失函数或学习算法怎么样?为了回答这些问题并提高学习系统的运行性能,人们需要弄清楚统计模型产生的信息如何穿越因果网络并最终产生可衡量的性能指标。

Readers with an interest in causal inference will find in this paper (i) a real world
example demonstrating the value of causal inference for large-scale machine
learning applications, (ii) causal inference techniques applicable to
continuously valued variables with meaningful confidence intervals, and (iii)
quasi-static analysis techniques for estimating how small interventions affect
cer-tain causal equilibria.Readers with an interest in real-life applications
will find (iv) a selection of practical counterfactual analysis techniques
applicable to many real-life machine learning systems.Readers with an interest
in computational advertising will find a principled framework that (v)
ex-plains how to soundly use machine learning techniques for ad placement, and
(vi) conceptually connects machine learning and auction theory in a compelling
manner.

对因果推理感兴趣的读者将在本文中发现

(I)一个真实世界的例子,展示了因果推理对于大规模机器学习应用的价值,
(ii)适用于具有有意义置信区间的连续值变量的因果推理技术,以及
(iii)用于估计小干预如何影响特定因果均衡的准静态分析技术。

对现实应用感兴趣的读者将会发现
(iv)适用于许多现实机器学习系统的实用反事实分析技术的选择。对计算广告感兴趣的读者会发现一个有原则的框架,该框架(v)解释了如何合理地使用机器学习技术进行广告投放,以及(vi)以引人注目的方式在概念上将机器学习和拍卖理论联系起来。

The paper is organized as follows.Section 2 gives an overview of the advertisement placement
problem which serves as our main example.In particular, we stress some of the
difficulties encoun-tered when one approaches such a problem without a
principled perspective.Section 3 provides a condensed review of the essential
concepts of causal modeling and inference.Section 4 centers on formulating and
answering counterfactual questions such as "how would the system have
performed during the data collection period if certain interventions had been
carried out on the system ?"We describe importance sampling methods for
counterfactual analysis, with clear conditions of validity and confidence
intervals.Section 5 illustrates how the structure of the causal graph reveals
opportu-nities to exploit prior information and vastly improve the confidence
intervals.Section 6 describes how counterfactual analysis provides essential
signals that can drive learning algorithms.Assume that we have identified
interventions that would have caused the system to perform well during the data
collection period.Which guarantee can we obtain on the performance of these
same inter-ventions in the future?Section 7 presents counterfactual
differential techniques for the study of equlibria.Using data collected when
the system is at equilibrium, we can estimate how a small intervention
displaces the equilibrium.This provides an elegant and effective way to reason
about long-term feedback effects.Various appendices complete the main text with
information that we think more relevant to readers with specific backgrounds.

论文组织如下。
第2节概述了广告投放问题,这是我们的主要例子。特别是,我们强调当一个人在没有原则性观点的情况下处理这样一个问题时会遇到的一些困难。

第3节简要回顾了因果建模和推理的基本概念。

第4节集中阐述和回答反事实问题,如“如果对系统进行了某些干预,系统在数据收集期间会有什么表现?”我们描述了用于反事实分析的重要性抽样方法,具有明确的有效性条件和置信区间。

第5节说明了因果图的结构如何揭示利用先验信息和大大提高置信区间的机会。

第6节描述了反事实分析如何提供驱动学习算法的基本信号。假设我们已经确定了会导致系统在数据收集期间表现良好的干预措施。我们能从这些相同的干预措施的未来表现中获得什么保证?

第7节介绍了研究马库利亚的反事实微分技术。利用系统处于平衡时收集的数据,我们可以估计一个小的干预如何取代平衡。这为长期反馈效应提供了一种优雅而有效的推理方式。各种附录用我们认为与特定背景的读者更相关的信息完成了正文。

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

2.Causation Issues in Computational Advertising 计算广告中的因果关系问题

After giving an overview of the advertisement placement problem, which serves as our main
ex-ample, this section illustrates some of the difficulties that arise when one
does not pay sufficient attention to the causal structure of the learning
system.

在概述了作为我们主要例子的广告投放问题之后,本节说明了当人们没有充分注意学习系统的因果结构时出现的一些困难。

2.1 Advertisement Placemen广告投放

All Internet users are now familiar with the advertisement messages that adorn popular web
pages.Advertisements are particularly effective on search engine result pages
because users who are searching for something are good targets for advertisers
who have something to offer.Several actors take part in this Internet
advertisement game:

现在所有的互联网用户都熟悉流行网页上的广告信息。广告在搜索引擎结果页面上特别有效,因为搜索某样东西的用户是广告商的好目标。在这个网络广告游戏中有如下几个角色:

• Advertisers create advertisement messages, and place bids that describe how much they are
willing to pay to see their ads displayed or clicked.

广告商 创建广告信息,并出价描述他们愿意支付多少费用来进行广告显示或点击。

• Publishers
provide attractive web services, such as, for instance, an Internet search
engine.They display selected ads and expect to receive payments from the
advertisers.The infras-tructure to collect the advertiser bids and select ads
is sometimes provided by an advertising network on behalf of its affiliated
publishers.For the purposes of this work, we simply con-sider a publisher large
enough to run its own infrastructure.

出版商提供有吸引力的网络服务,例如互联网搜索引擎。他们展示选定的广告,并期望从广告商那里获得报酬。收集广告商出价和选择广告的基础设施有时是由广告网络代表其附属出版商提供的。为了这项工作的目的,我们只是考虑一个足够大的发行商来运行它自己的基础设施。

• Users
reveal information about their current interests, for instance, by entering a
query in a search engine.They are offered web pages that contain a selection of
ads (Figure 1).Users sometimes click on an advertisement and are transported to
a web site controlled by the ad-vertiser where they can initiate some business.

用户透露他们当前兴趣的信息,例如,通过在搜索引擎中输入查询。它们被提供包含精选广告的网页(图1)。用户有时点击一个广告,然后被传送到由广告发布者控制的网站,在那里他们可以开始一些业务。

A conventional bidding language is necessary to precisely define under which
conditions an adver-tiser is willing to pay the bid amount.In the case of
Internet search advertisement, each bid specifies (a) the advertisement
message, (b) a set of keywords, (c) one of several possible matching criteria
between the keywords and the user query, and (d) the maximal price the
advertiser is willing to pay when a user clicks on the ad after entering a
query that matches the keywords according to the specified criterion.

传统的投标语言是必要的,以准确定义在什么条件下,一个对手愿意支付投标金额。在互联网搜索广告的情况下,每个出价指定
(a)广告消息,
(b)一组关键词,
(c)关键词和用户查询之间的几个可能匹配标准之一,以及
(d)当用户在输入根据指定标准匹配关键词的查询之后点击广告时,广告商愿意支付的最大价格。

Whenever a user visits a publisher web page, an advertisement placement engine runs an
auction in real time in order to select winning ads, determine where to display
them in the page, and compute the prices charged to advertisers, should the
user click on their ad.Since the placement engine is operated by the publisher,
it is designed to further the interests of the publisher.Fortunately for
everyone else, the publisher must balance short term interests, namely the
immediate revenue brought by the ads displayed on each web page, and long term
interests, namely the future revenues resulting from the continued satisfaction
of both users and advertisers.

每当用户访问发布者网页时,广告投放引擎实时运行拍卖,以便选择获胜的广告,确定在页面中的何处显示它们,并计算如果用户点击他们的广告向广告商收取的价格。由于布局引擎是由发布者操作的,因此它旨在促进发布者的利益。对其他人来说幸运的是,出版商必须平衡短期利益,即每个网页上显示的广告带来的直接收入,和长期利益,即用户和广告商持续满意带来的未来收入。

Auction theory explains how to design a mechanism that optimizes the revenue of the
seller of a single object (Myerson, 1981;Milgrom, 2004) under various
assumptions about the information available to the buyers regarding the
intentions of the other buyers.In the case of the ad placement problem, the publisher
runs multiple auctions and sells opportunities to receive a click.When nearly
identical auctions occur thousand of times per second, it is tempting to
consider that the advertisers have perfect information about each other.This
assumption gives support to the popular generalized second price rank-score
auction (Varian, 2007;Edelman et al., 2007):

拍卖理论解释了如何设计一种机制来优化单个物品卖家的收入(Myerson,1981;Milgrom,2004)在关于买方可获得的关于其他买方意图的信息的各种假设下。在广告投放问题的情况下,发布者进行多次拍卖,并出售获得点击的机会。当几乎相同的拍卖每秒钟发生数千次时,人们很容易认为广告商拥有关于彼此的完美信息。这一假设支持了流行的广义第二价格等级分数拍卖(瓦里安,2007;Edelman等人,2007年):

Figure 1:
Mainline and sidebar ads on a search result page.Ads placed in the mainline are more
likely to be noticed, increasing both the chances of a click if the ad is relevant and the
risk of annoying the user if the ad is not relevant.

图1:搜索结果页面上的主线和侧边栏广告。主线上的广告更可能会被注意到,这既增加了广告相关时点击的机会,也增加了广告不相关时惹恼用户的风险。

• Let x represent the auction context information, such as the user query, the user
profile, the date, the time, etc.The ad placement engine first determines all
eligible ads a1 ...an and the corresponding bids b1 ...bn on the basis of the
auction context x and of the matching criteria specified by the advertisers.

让x代表拍卖上下文信息,如用户查询、用户档案、日期、时间等。广告投放引擎首先确定所有合格的广告a1...an和相应的投标b1...bn基于拍卖上下文x和广告商指定的匹配标准。

• For each selected ad ai and each potential position p on the web page, a statistical
model outputs the estimate qi,p(x) of the probability that ad ai displayed in
position p receives a user click.The rank-score ri,p(x) = biqi,p(x) then
represents the purported value associated with placing ad ai at position p.

对于每个选定的广告$a_i$ 和网页上的每个潜在位置p,统计模型输出显示在位置p的广告$a_i$ 接收用户点击的概率的估计$q_i$ ,p(x)。等级分数ri,p(x) = $b_i$ $q_i$ ,p(x)则表示与将广告 $a_i$ 放置在位置p相关联的声称值。

• Let L represent a possible ad layout, that is, a set of positions that can
simultaneously be populated with ads, and let $L$ be the set of possible ad layouts, including of course the
empty layout.The optimal layout and the corresponding ads are obtained by maximizing the total
rank-score

让L代表一种可能的广告布局,即一组可以同时用广告填充,让$L$是一组可能的广告布局,当然也包括空的布局。通过最大化总排名分数来获得最佳布局和相应的广告

image.png

subject to
reserve constraints and also subject to diverse policy constraints, such as, for instance, preventing the

simultane-ous display of multiple ads belonging to the same advertiser.Under
mild assumptions, this discrete maximization problem is amenable to
computationally efficient greedy algorithms (see appendix A.)

受储备限制,并且还受到不同的政策约束,例如防止属于同一广告商的多个广告同时显示。在温和的假设下,这个离散的最大化问题服从于计算高效的贪婪算法(见附录a)

• The
advertiser payment associated with a user click is computed using the
generalized second price (GSP) rule: the advertiser pays the smallest bid that
it could have entered without chang-ing the solution of the discrete
maximization problem, all other bids remaining equal.In other words, the
advertiser could not have manipulated its bid and obtained the same treatment
for a better price.

与用户点击相关联的广告商支付是使用广义第二价格(GSP)规则计算的:广告商支付其可能输入的最小出价,而不改变离散最大化问题的解决方案,所有其他出价保持相等。换句话说,广告商不可能操纵其出价,以更好的价格获得同样的待遇。

Under the
perfect information assumption, the analysis suggests that the publisher simply
needs to find which reserve prices Rp(x) yield the best revenue per
auction.However, the total revenue of the publisher also depends on the traffic
experienced by its web site.Displaying an excessive number of irrelevant ads
can train users to ignore the ads, and can also drive them to competing web
sites.Advertisers can artificially raise the rank-scores of irrelevant ads by
temporarily increasing the bids.Indelicate advertisers can create deceiving
advertisements that elicit many clicks but direct users to spam web sites.Experience
shows that the continued satisfaction of the users is more important to the
publisher than it is to the advertisers.

在完美信息假设下,分析表明,发行商只需找到哪个保留价格Rp(x)在每次拍卖中产生最佳收益。然而,出版商的总收入也取决于其网站的流量。显示过多不相关的广告可以训练用户忽略广告,也可以驱使他们去竞争网站。广告商可以通过临时提高出价来人为提高不相关广告的排名分数。无礼的广告商可以制造欺骗性的广告,吸引大量点击,但却将用户引向垃圾网站。经验表明,用户的持续满意度对出版商来说比对广告商更重要。

Therefore the
generalized second price rank-score auction has evolved.Rank-scores have been
augmented with terms that quantify the user satisfaction or the ad
relevance.Bids receive adaptive discounts in order to deal with situations
where the perfect information assumption is unrealistic.These adjustments are
driven by additional statistical models.The ad placement engine should
therefore be viewed as a complex learning system interacting with both users
and advertisers.

因此,广义第二价格等级评分拍卖已经演变。排名分数已经增加了量化用户满意度或广告相关性的术语。投标获得自适应折扣,以应对完美信息假设不现实的情况。这些调整由额外的统计模型驱动。因此,广告投放引擎应该被视为一个复杂的学习系统,与用户和广告商进行交互。

2.2
Controlled Experiments

2.2对照实验

The designer
of such an ad placement engine faces the fundamental question of testing
whether a proposed modification of the ad placement engine results in an
improvement of the operational performance of the system.

这种广告投放引擎的设计者面临一个基本问题,即测试广告投放引擎的建议修改是否导致系统操作性能的改善。

The simplest
way to answer such a question is to try the modification.The basic idea is to
randomly split the users into treatment and control groups (Kohavi et al.,
2008).Users from the control group see web pages generated using the unmodified
system.Users of the treatment groups see web pages generated using alternate
versions of the system.Monitoring various performance metrics for a couple
months usually gives sufficient information to reliably decide which variant of
the system delivers the most satisfactory performance.

回答这样的问题最简单的方法就是尝试修改。基本思想是将用户随机分为治疗组和对照组(Kohavi等人,2008年)。控制组的用户可以看到使用未修改的系统生成的网页。治疗组的用户可以看到使用系统的替代版本生成的网页。监控几个月的各种性能指标通常可以提供足够的信息来可靠地决定系统的哪种变体提供最令人满意的性能。

Modifying an
advertisement placement engine elicits reactions from both the users and the
ad-vertisers.Whereas it is easy to split users into treatment and control
groups, splitting advertisers into treatment and control groups demands special
attention because each auction involves multi-ple advertisers (Charles et al.,
2012).Simultaneously controlling for both users and advertisers is probably
impossible.

修改广告投放引擎会引起用户和广告发布者的反应。虽然很容易将用户分成治疗组和对照组,但将广告商分成治疗组和对照组需要特别注意,因为每次拍卖都涉及多个广告商(查尔斯等人,2012年)。同时控制用户和广告商可能是不可能的。

Controlled
experiments also suffer from several drawbacks.They are expensive because they
demand a complete implementation of the proposed modifications.They are slow
because each ex-periment typically demands a couple months.Finally, although
there are elegant ways to efficiently run overlapping controlled experiments on
the same traffic (Tang et al., 2010), they are limited by the volume of traffic
available for experimentation.

受控实验也有几个缺点。它们是昂贵的,因为它们要求完全实现提议的修改。它们很慢,因为每次实验通常需要几个月的时间。最后,虽然有很好的方法可以在相同的流量上有效地运行重叠的受控实验(Tang等人,2010),但是它们受到可用于实验的流量的限制。

It is
therefore difficult to rely on controlled experiments during the conception
phase of potential improvements to the ad placement engine.It is similarly
difficult to use controlled experiments to drive the training algorithms
associated with click probability estimation models.Cheaper and faster
statistical methods are needed to drive these essential aspects of the
development of an ad placement engine.Unfortunately, interpreting cheap and
fast data can be very deceiving.

因此,在广告投放引擎潜在改进的构思阶段,很难依赖受控实验。同样难以使用受控实验来驱动与点击概率估计模型相关联的训练算法。需要更便宜、更快的统计方法来推动广告投放引擎发展的这些重要方面。不幸的是,解释廉价而快速的数据可能非常具有欺骗性。

2.3
Confounding Data

2.3混杂数据

Assessing the
consequence of an intervention using statistical data is generally challenging
because it is often difficult to determine whether the observed effect is a
simple consequence of the interven-tion or has other uncontrolled causes.

使用统计数据评估干预的结果通常具有挑战性,因为通常很难确定观察到的效果是干预的简单结果还是有其他不受控制的原因。

Overall总数 Patients with small stones 小结石患者 Patients with large stones大结石患者
Treatment A:Open surgery开放手术 78% (273/350) 93% (81/87) 73% (192/263)
Treatment B:Percutaneous nephrolithotomy 经皮肾镜取石术 83% (289/350) 87% (234/270) 69% (55/80)

Table 1: A
classic example of Simpson's paradox.The table reports the success rates of two
treat-ments for
kidney stones (Charig et al., 1986, Tables I and II).Although the overall
success rate of treatment B seems better, treatment B performs worse than
treatment A on both patients with small kidney stones and patients with large
kidney stones.See Section 2.3.

表1:辛普森悖论的经典例子。该表报告了两次治疗的成功率-
肾结石治疗(Charig等人,1986年,表一和表二)。尽管治疗B的总体成功率似乎更高,但无论是小肾结石患者还是大肾结石患者,治疗B的表现都比治疗A差。参见第2.3节。

For instance,
the empirical comparison of certain kidney stone treatments illustrates this
dif-ficulty (Charig et al., 1986).Table 2.3 reports the success rates observed
on two groups of 350 patients treated with respectively open surgery (treatment
A, with 78% success) and percutaneous nephrolithotomy (treatment B, with 83%
success).Although treatment B seems more successful, it was more frequently
prescribed to patients suffering from small kidney stones, a less serious
con-dition.Did treatment B achieve a high success rate because of its intrinsic
qualities or because it was preferentially applied to less severe cases?Further
splitting the data according to the size of the kidney stones reverses the
conclusion: treatment A now achieves the best success rate for both patients
suffering from large kidney stones and patients suffering from small kidney
stones.Such an inversion of the conclusion is called Simpson's paradox
(Simpson, 1951).

例如,某些肾结石治疗的经验比较说明了这一困难(Charig等人,1986)。表2.3报告了两组分别接受开放手术(治疗A,成功率78%)和经皮肾镜取石术(治疗B,成功率83%)治疗的350名患者的成功率。虽然治疗乙似乎更成功,但它更经常被开给患有小肾结石的患者,这是一种不太严重的情况。B治疗获得高成功率是因为它的内在质量还是因为它优先应用于不太严重的病例?根据肾结石的大小进一步分割数据会逆转结论:治疗A现在对患有大肾结石的患者和患有小肾结石的患者都达到了最佳成功率。这种结论的颠倒被称为辛普森悖论(辛普森,1951)。

The stone
size in this study is an example of a confounding variable, that is an
uncontrolled variable whose consequences pollute the effect of the
intervention.Doctors knew the size of the kidney stones, chose to treat the
healthier patients with the least invasive treatment B, and therefore caused
treatment B to appear more effective than it actually was.If we now decide to
apply treat-ment B to all patients irrespective of the stone size, we break the
causal path connecting the stone size to the outcome, we eliminate the
illusion, and we will experience disappointing results.

这项研究中的结石大小是一个混杂变量的例子,这是一个不受控制的变量,其结果污染了干预的效果。医生知道肾结石的大小,选择用侵入性最小的治疗B来治疗更健康的患者,因此导致治疗B看起来比实际更有效。如果我们现在决定对所有患者进行治疗,而不考虑结石的大小,我们就打破了结石大小和结果之间的因果关系,我们就消除了幻觉,我们将经历令人失望的结果。

When we
suspect the existence of a confounding variable, we can split the contingency
tables and reach improved conclusions.Unfortunately we cannot fully trust these
conclusions unless we are certain to have taken into account all confounding
variables.The real problem therefore comes from the confounding variables we do
not know.

当我们怀疑混杂变量的存在时,我们可以拆分列联表并得出改进的结论。不幸的是,我们不能完全相信这些结论,除非我们确信已经考虑了所有的混杂变量。因此,真正的问题来自我们不知道的混杂变量。

Randomized
experiments arguably provide the only correct solution to this problem (see
Stigler, 1992).The idea is to randomly chose whether the patient receives
treatment A or treatment B. Because this random choice is independent from all
the potential confounding variables, known and unknown, they cannot pollute the
observed effect of the treatments (see also Section 4.2).This is why controlled
experiments in ad placement (Section 2.2) randomly distribute users between
treatment and control groups, and this is also why, in the case of an ad
placement engine, we should be somehow concerned by the practical impossibility
to randomly distribute both users and advertisers.

随机实验可以说是这个问题唯一正确的解决方案(见斯蒂格勒,1992)。想法是随机选择患者是接受治疗A还是治疗b。因为这种随机选择独立于所有已知和未知的潜在混杂变量,所以它们不能污染观察到的治疗效果(另见第4.2节)。这就是为什么广告投放的受控实验(第2.2节)在治疗组和对照组之间随机分配用户,这也是为什么在广告投放引擎的情况下,我们应该以某种方式关注随机分配用户和广告商的实际可能性。

Overall q2 low q2 high
q1 low 6.2%(124/2000) 5.1% (92/1823) 18.1% (32/176)
q1 high 7.5% (149/2000) 4.8% (71/1500) 15.6% (78/500)

Table 2:
Confounding data in ad placement.The table reports the click-through rates and
the click counts of the
second mainline ad.The overall counts suggest that the click-through rate of
the second mainline ad increases when the click probability estimate q1 of the
top ad is high.However, if we further split the pages according to the click
probability estimate q2 of the second mainline ad, we reach the opposite
conclusion.See Section 2.4.

表2:广告投放的混杂数据。该表报告了点击率和点击量第二条主线广告的计数。总体计数表明,当顶部广告的点击概率估计值q1较高时,第二主线广告的点击率会增加。然而,如果我们根据第二条主线广告的点击概率估计q2进一步分割页面,我们会得出相反的结论。参见第2.4节。

2.4
Confounding Data in Ad Placement

2.4广告投放中的混杂数据

Let us return
to the question of assessing the value of passing a new input signal to the ad
placement engine click prediction model.Section 2.1 outlines a placement method
where the click probability estimates qi,p(x) depend on the ad and the position
we consider, but do not depend on other ads displayed on the page.We now
consider replacing this model by a new model that additionally uses the
estimated click probability of the top mainline ad to estimate the click
probability of the second mainline ad (Figure 1).We would like to estimate the
effect of such an intervention using existing statistical data.

让我们回到评估向广告投放引擎点击预测模型传递新输入信号的价值的问题。第2.1节概述了一种布局方法,其中点击概率估计qi,p(x)取决于广告和我们考虑的位置,但不取决于页面上显示的其他广告。我们现在考虑用一个新模型来代替这个模型,这个新模型额外使用顶部主线广告的估计点击概率来估计第二个主线广告的点击概率(图1)。我们希望利用现有的统计数据来评估这种干预的效果。

We have
collected ad placement data for Bing search result pages served during three
consecu-tive hours on a certain slice of traffic.Let q1 and q2 denote the click
probability estimates computed by the existing model for respectively the top
mainline ad and the second mainline ad.After ex-cluding pages displaying fewer
than two mainline ads, we form two groups of 2000 pages randomly picked among
those satisfying the conditions q1 < 0.15 for the first group and q1 ≥ 0.15
for the second group.Table 2.4 reports the click counts and frequencies
observed on the second mainline ad in each group.Although the overall numbers
show that users click more often on the second mainline ad when the top
mainline ad has a high click probability estimate q1, this conclusion is
reversed when we further split the data according to the click probability
estimate q2 of the second mainline ad.

我们已经收集了必应搜索结果页面的广告投放数据,这些页面在连续三个小时的特定流量中提供服务。让q1和q2分别表示由现有模型为顶部主线广告和第二主线广告计算的点击概率估计。在包括显示少于两个主线广告的页面之后,我们形成两组2000页的页面,从满足第一组q1 < 0.15和第二组q1 ≥ 0.15的条件的页面中随机挑选。表2.4报告了每组中第二条主线广告的点击次数和频率。虽然总体数字显示,当顶级主线广告具有高点击概率估计值q1时,用户更频繁地点击第二主线广告,但当我们根据第二主线广告的点击概率估计值q2进一步分割数据时,这一结论被颠倒了。

Despite
superficial similarities, this example is considerably more difficult to
interpret than the kidney stone example.The overall click counts show that the
actual click-through rate of the second mainline ad is positively correlated
with the click probability estimate on the top mainline ad.Does this mean that
we can increase the total number of clicks by placing regular ads below
frequently clicked ads?

尽管表面相似,这个例子比肾结石的例子更难解释。总体点击计数显示,第二条主线广告的实际点击率与顶部主线广告的点击概率估计值正相关。这是否意味着我们可以通过在经常点击的广告下方放置常规广告来增加总点击量?

Remember that
the click probability estimates depend on the search query which itself depends
on the user intention.The most likely explanation is that pages with a high q1
are frequently asso-ciated with more commercial searches and therefore receive
more ad clicks on all positions.The observed correlation occurs because the
presence of a click and the magnitude of the click probabil-ity estimate q1
have a common cause: the user intention.Meanwhile, the click probability
estimate q2 returned by the current model for the second mainline ad also
depend on the query and therefore the user intention.Therefore, assuming that
this dependence has comparable strength, and assuming that there are no other
causal paths, splitting the counts according to the magnitude of q2 factors out
the effects of this common confounding cause.We then observe a negative
correlation which now suggests that
a frequently clicked top mainline ad has a negative impact on the click-through
rate of the second mainline ad.

请记住,点击概率估计取决于搜索查询,而搜索查询本身取决于用户意图。最可能的解释是,q1高的页面通常与更多的商业搜索相关联,因此在所有位置都能获得更多的广告点击。出现观察到的相关性是因为点击的存在和点击概率估计值q1的大小有一个共同的原因:用户意图。同时,当前模型为第二主线广告返回的点击概率估计q2也取决于查询,因此也取决于用户意图。因此,假设这种依赖性具有相当的强度,并且假设没有其他因果路径,根据q2因子的大小将计数分开,排除这种常见混杂原因的影响。然后我们观察到一个负相关
表明频繁点击的顶级主线广告对第二条主线广告的点击率有负面影响。

If this is
correct, we would probably increase the accuracy of the click prediction model
by switching to the new model.This would decrease the click probability
estimates for ads placed in the second mainline position on commercial search
pages.These ads are then less likely to clear the reserve and therefore more likely
to be displayed in the less attractive sidebar.The net result is probably a
loss of clicks and a loss of money despite the higher quality of the click
probability model.Although we could tune the reserve prices to compensate this
unfortunate effect, nothing in this data tells us where the performance of the
ad placement engine will land.Furthermore, unknown confounding variables might
completely reverse our conclusions.

如果这是正确的,我们可能会通过切换到新模型来提高点击预测模型的准确性。这将降低广告在商业搜索页面第二主线位置的点击概率估计。然后,这些广告不太可能清除保留区,因此更有可能显示在不太吸引人的侧栏中。尽管点击概率模型的质量更高,但最终结果可能是点击量的损失和金钱的损失。尽管我们可以调整保留价格来补偿这种不幸的影响,但这些数据中没有任何内容告诉我们广告投放引擎的性能将落在哪里。此外,未知的混杂变量可能会完全推翻我们的结论。

Making sense
out of such data is just too complex !

从这样的数据中获得意义实在是太复杂了!

2.5 A Better Way

2.5更好的方式

It should now
be obvious that we need a more principled way to reason about the effect of
potential interventions.We provide one such more principled approach using the
causal inference machinery (Section 3).The next step is then the identification
of a class of questions that are sufficiently ex-pressive to guide the designer
of a complex learning system, and sufficiently simple to be answered using data
collected in the past using adequate procedures (Section 4).

现在应该很明显,我们需要一种更有原则的方式来推理潜在干预的效果。我们使用因果推理机制提供了一种更有原则的方法(第3节)。然后,下一步是确定一类问题,这些问题足够有代表性,可以指导复杂学习系统的设计者,并且足够简单,可以使用过去使用适当程序收集的数据来回答(第4节)。

A machine
learning algorithm can then be viewed as an automated way to generate questions
about the parameters of a statistical model, obtain the corresponding answers,
and update the param-eters accordingly (Section 6).Learning algorithms derived
in this manner are very flexible: human designers and machine learning
algorithms can cooperate seamlessly because they rely on similar sources of
information.

然后,机器学习算法可以被视为自动生成关于统计模型参数的问题、获得相应答案并相应更新参数的方法(第6节)。以这种方式导出的学习算法非常灵活:人类设计者和机器学习算法可以无缝协作,因为它们依赖于相似的信息源。

3.Modeling
Causal Systems

3.因果系统建模

When we point
out a causal relationship between two events, we describe what we expect to
happen to the event we call the effect, should an external operator manipulate
the event we call the cause.Manipulability theories of causation (von Wright,
1971;Woodward, 2005) raise this commonsense insight to the status of a
definition of the causal relation.Difficult adjustments are then needed to
interpret statements involving causes that we can only observe through their
effects, "because they love me," or that are not easily manipulated,
"because the earth is round."

当我们指出两个事件之间的因果关系时,我们描述了如果外部操作者操纵我们称为原因的事件,我们期望发生在我们称为结果的事件上的事情。因果关系的可操作性理论(冯·赖特,1971;伍德沃德,2005)将这种常识性的洞察力提升到因果关系定义的地位。然后需要困难的调整来解释涉及原因的陈述,这些原因我们只能通过它们的影响来观察,“因为它们爱我”,或者不容易被操纵,“因为地球是圆的。”

Modern
statistical thinking makes a clear distinction between the statistical model
and the world.The actual mechanisms underlying the data are considered
unknown.The statistical models do not need to reproduce these mechanisms to
emulate the observable data (Breiman, 2001).Better models are sometimes
obtained by deliberately avoiding to reproduce the true mechanisms (Vapnik,
1982, Section 8.6).We can approach the manipulability puzzle in the same spirit
by viewing causation as a reasoning model (Bottou, 2011) rather than a property
of the world.Causes and effects are simply the pieces of an abstract reasoning
game.Causal statements that are not empirically testable acquire validity when
they are used as intermediate steps when one reasons about manipulations or
interventions amenable to experimental validation.

现代统计思维明确区分了统计模型和世界。数据背后的实际机制被认为是未知的。统计模型不需要再现这些机制来模拟可观察到的数据(Breiman,2001)。更好的模型有时是通过故意避免再现真实的机制而获得的(Vapnik,1982,第8.6节)。我们可以通过将因果关系视为一种推理模型(Bottou,2011)而不是世界的一种属性,以同样的精神来解决可操作性难题。原因和结果只是抽象推理游戏的片段。不可通过经验检验的因果陈述,当一个人提出可以通过实验验证的操作或干预时,当它们被用作中间步骤时,就获得了有效性。

This section
presents the rules of this reasoning game.We largely follow the framework
pro-posed by Pearl (2009) because it gives a clear account of the connections
between causal models and probabilistic models.

本节介绍了这个推理游戏的规则。我们很大程度上遵循了珀尔(2009)提出的框架,因为它给出了因果模型和概率模型之间联系的清晰描述。

image.png

Query context
x from user intent u. Eligible ads (ai) from query x and inventory v.
Corresponding bids (bi).

来自用户意图的查询上下文x。来自查询x和库存v的合格广告(ai)对应的出价(bi)。

Scores
(qi,p,Rp) from query x and ads a. Ad slate s from eligible ads a, scores q and
bids b. Corresponding click prices c.

来自查询x和广告a的分数(qi,p,Rp)。来自合格广告a的广告名单s,分数q和出价b。相应的点击价格c。

User clicks y
from ad slate s and user intent u. Revenue z from clicks y and prices c.

用户从广告名单中点击y,用户意图从点击y和价格c中获得收入

Figure 2: A structural
equation model for ad placement.The sequence of equations describes the

图2:广告投放的结构方程模型。方程式序列描述了

flow of
information.The functions fk describe how effects depend on their direct
causes.The additional noise variables εk represent independent sources of
randomness useful to model probabilistic dependencies.

信息流。函数fk描述了效果如何依赖于它们的直接原因。额外的噪声变量εk代表独立的随机源,可用于建模概率相关性。

3.1 The Flow
of Information

3.1信息流

Figure 2
gives a deterministic description of the operation of the ad placement
engine.Variable u represents the user and his or her intention in an
unspecified manner.The query and query context x is then expressed as an
unknown function of the u and of a noise variable ε1.Noise variables in this
framework are best viewed as independent sources of randomness useful for
modeling a nondeterministic causal dependency.We shall only mention them when
they play a specific role in the discussion.The set of eligible ads a and the
corresponding bids b are then derived from the query x and the ad inventory v
supplied by the advertisers.Statistical models then compute a collection of
scores q such as the click probability estimates qi,p and the reserves Rp
introduced in Section 2.1.The placement logic uses these scores to generate the
"ad slate" s, that is, the set of winning ads and their assigned
positions.The corresponding click prices c are computed.The set of user clicks
y is expressed as an unknown function of the ad slate s and the user intent u.
Finally the revenue z is expressed as another function of the clicks y and the
prices c.

图2给出了广告投放引擎操作的确定性描述。变量u以未指明的方式表示用户及其意图。然后,查询和查询上下文x被表示为u和噪声变量ε1的未知函数。这个框架中的噪声变量最好被视为独立的随机源,有助于对不确定的因果相关性进行建模。只有当它们在讨论中发挥具体作用时,我们才会提到它们。合格广告a的集合和相应的出价b然后从查询x和广告商提供的广告目录v中导出。统计模型然后计算分数q的集合,例如点击概率估计qi、p和第2.1节中介绍的储备Rp。放置逻辑使用这些分数来生成“广告名单”,即一组获胜的广告及其分配的位置。计算相应的点击价格c。用户点击量y的集合被表示为广告名单s和用户意图u的未知函数。最后,收入z被表示为点击量y和价格c的另一个函数

Such a system
of equations is named structural equation model (Wright, 1921).Each equation
asserts a functional dependency between an effect, appearing on the left hand
side of the equation, and its direct causes, appearing on the right hand side
as arguments of the function.Some of these causal dependencies are
unknown.Although we postulate that the effect can be expressed as some function
of its direct causes, we do not know the form of this function.For instance,
the designer of the ad placement engine knows functions f2 to f6 and f8 because
he has designed them.However, he does not know the functions f1 and f7 because
whoever designed the user did not leave sufficient documentation.

这样的方程组被称为结构方程模型(Wright,1921)。每一个方程都断言出现在方程左边的一个效应与其直接原因之间的函数依赖关系,直接原因作为函数的自变量出现在右边。其中一些因果依赖是未知的。虽然我们假设效果可以表示为其直接原因的某种函数,但我们不知道这种函数的形式。例如,广告投放引擎的设计者知道函数f2到f6和f8,因为他已经设计了它们。然而,他不知道函数f1和f7,因为设计用户的人没有留下足够的文档。

Figure 3
represents the directed causal graph associated with the structural equation
model.Each arrow connects a direct cause to its effect.The noise variables are
omitted for simplicity.The structure of this graph reveals fundamental
assumptions about our model.For instance, the user clicks y do not directly
depend on the scores q or the prices c because users do not have access to this
information.

图3表示与结构方程模型相关的有向因果图。每个箭头都将直接原因与其结果联系起来。为简单起见,噪声变量被省略。这个图表的结构揭示了关于我们模型的基本假设。例如,用户点击y并不直接取决于分数q或价格c,因为用户无法访问这些信息。

We hold as a
principle that causation obeys the arrow of time: causes always precede their
effects.Therefore the causal graph must be acyclic.Structural equation models
then support two fundamental operations, namely simulation and intervention.

我们认为因果关系遵循时间之箭是一个原则:原因总是在结果之前。因此因果图必须是非循环的。结构方程模型支持两个基本操作,即模拟和干预。

Figure 3:
Causal graph associated with the structural equation model of Figure 2.The
mutually independent
noise variables ε1 to ε8 are implicit.The variables a, b, q, s, c, and z depend
on their direct causes in known ways.In contrast, the variables u and v are
exogenous and the variables x and y depend on their direct causes through
unknown functions.

图3:与图2的结构方程模型相关的因果图。相互之间独立噪声变量ε1至ε8是隐式的。变量a、b、q、s、c和z以已知的方式依赖于它们的直接原因。相比之下,变量u和v是外生的,变量x和y通过未知函数依赖于它们的直接原因。

• Simulation

  • Let us assume that we know both the exact form of all functional dependencies
    and the value of all exogenous variables, that is, the variables that never
    appear in the left hand side of an equation.We can compute the values of all
    the remaining variables by applying the equations in their natural time
    sequence.

模拟-让我们假设我们知道所有函数依赖的确切形式和所有外生变量的值,即从未出现在等式左侧的变量。我们可以通过在自然时间序列中应用这些方程来计算所有剩余变量的值。


Intervention - As long as the causal graph remains acyclic, we can construct
derived structural equation models using arbitrary algebraic manipulations of
the system of equations.For instance, we can clamp a variable to a constant
value by rewriting the right-hand side of the corresponding equation as the
specified constant value.

干预——只要因果图保持非循环,我们就可以使用方程组的任意代数操作来构建导出的结构方程模型。例如,我们可以通过将相应等式的右侧重写为指定的常数值,将变量钳制为常数值。

The algebraic
manipulation of the structural equation models provides a powerful language to
describe interventions on a causal system.This is not a coincidence.Many
aspects of the mathe-matical notation were invented to support causal inference
in classical mechanics.However, we no longer have to interpret the variable
values as physical quantities: the equations simply describe the flow of
information in the causal model (Wiener, 1948).

结构方程模型的代数操作为描述因果系统上的干预提供了一种强大的语言。这不是巧合。数学符号的许多方面是为了支持经典力学中的因果推理而发明的。然而,我们不再需要将变量值解释为物理量:方程简单地描述了因果模型中的信息流(维纳,1948)。

3.2 The
Isolation Assumption

3.2隔离假设

Let us now
turn our attention to the exogenous variables, that is, variables that never
appear in the left hand side of an equation of the structural model.Leibniz's
principle of sufficient reason claims that there are no facts without
causes.This suggests that the exogenous variables are the effects of a network
of causes not expressed by the structural equation model.For instance, the user
intent u and the ad inventory v in Figure 3 have temporal correlations because
both users and advertisers worry about their budgets when the end of the month
approaches.Any structural equation model should then be understood in the
context of a larger structural equation model potentially describing all things
in existence.

现在让我们将注意力转向外生变量,即从未出现在结构模型方程左侧的变量。莱布尼茨的充分理由原则主张没有原因就没有事实。这表明外生变量是结构方程模型没有表达的原因网络的影响。例如,图3中的用户意图u和广告清单v具有时间相关性,因为当月底临近时,用户和广告商都担心他们的预算。任何结构方程模型都应该在一个更大的结构方程模型的背景下理解,这个模型潜在地描述了所有存在的事物。

Ads served on
a particular page contribute to the continued satisfaction of both users and
ad-vertisers, and therefore have an effect on their willingness to use the
services of the publisher in the future.The ad placement structural equation
model shown in Figure 2 only describes the causal de-pendencies for a single
page and therefore cannot account for such effects.Consider however a very

在特定页面上提供的广告有助于用户和广告发布者的持续满意度,因此会影响他们将来使用出版商服务的意愿。图2所示的广告投放结构方程模型仅描述了单个页面的因果相关性,因此无法解释这种影响。然而,考虑一个非常

3216

3216

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 4:
Conceptually unrolling the user feedback loop by threading instances of the
single page

图4:通过线程化单个页面的实例从概念上展开用户反馈循环

causal graph
(Figure 3).Both the ad slate st and user clicks yt have an indirect effect on
the user intent ut+1 associated with the next query.

因果图(图3)。ad slate st和用户点击yt都对与下一个查询相关联的用户意图ut+1有间接影响。

large
structural equation model containing a copy of the page-level model for every
web page ever served by the publisher.Figure 4 shows how we can thread the
page-level models corresponding to pages served to the same user.Similarly we
could model how advertisers track the performance and the cost of their
advertisements and model how their satisfaction affects their future bids.The
resulting causal graphs can be very complex.Part of this complexity results
from time-scale dif-ferences.Thousands of search pages are served in a
second.Each page contributes a little to the continued satisfaction of one user
and a few advertisers.The accumulation of these contributions produces
measurable effects after a few weeks.

大型结构方程模型,包含发布者提供的每个网页的页面级模型副本。图4显示了我们如何线程化对应于服务于同一用户的页面的页面级模型。同样,我们可以模拟广告商如何跟踪他们广告的表现和成本,并模拟他们的满意度如何影响他们未来的出价。产生的因果图可能非常复杂。这种复杂性的一部分是由时间尺度的差异造成的。数以千计的搜索页面在一秒钟内被提供。每一页都对一个用户和几个广告商的持续满意度有一点贡献。几周后,这些贡献的积累会产生可衡量的效果。

Many of the
functional dependencies expressed by the structural equation model are left
un-specified.Without direct knowledge of these functions, we must reason using
statistical data.The most fundamental statistical data is collected from
repeated trials that are assumed independent.When we consider the large
structured equation model of everything, we can only have one large trial
producing a single data point.It is therefore desirable to identify repeated
patterns of identical equations that can be viewed as repeated independent
trials.Therefore, when we study a structural equation model representing such a
pattern, we need to make an additional assumption to expresses the idea that
the outcome of one trial does not affect the other trials.We call such an
assumption an isolation assumption by analogy with thermodynamics.This can be
achieved by assuming that the exogenous variables are independently drawn from
an unknown but fixed joint probability distribu-tion.This assumption cuts the
causation effects that could flow through the exogenous variables.

结构方程模型所表达的许多函数依赖关系都没有明确说明。没有这些函数的直接知识,我们必须使用统计数据进行推理。最基本的统计数据是从假设独立的重复试验中收集的。当我们考虑所有事物的大型结构化方程模型时,我们只能进行一次大型试验来产生一个数据点。因此,希望识别可被视为重复独立试验的相同方程的重复模式。因此,当我们研究代表这种模式的结构方程模型时,我们需要做一个额外的假设来表达一个试验的结果不会影响其他试验的想法。我们通过类比热力学把这样的假设称为孤立假设。这可以通过假设外生变量独立地来自未知但固定的联合概率分布来实现。这一假设消除了可能流经外生变量的因果效应。

The noise
variables are also exogenous variables acting as independent source of
randomness.The noise variables are useful to represent the conditional
distribution P(effect|causes) using the equation effect= f(causes,
ε).Therefore, we also assume joint independence between all the noise variables
and any of the named exogenous variable.For instance, in the case of the ad
placement

噪声变量也是作为随机性独立来源的外生变量。噪声变量有助于使用方程effect= f(causes,ε)来表示条件分布 P(effect|causes)。因此,我们还假设所有噪声变量和任何一个已命名的外生变量之间具有联合独立性。例如,在广告投放的情况下

1.See also
the discussion on reinforcement learning, Section 3.5.

1.另请参见第3.5节中关于强化学习的讨论。

2.The concept
of isolation is pervasive in physics.An isolated system in thermodynamics
(Reichl, 1998, Section 2.D) or a closed system in mechanics (Landau and
Lifshitz, 1969, 5) evolves without exchanging mass or energy with its
surroundings.Experimental trials involving systems that are assumed isolated
may differ in their initial setup and therefore have different
outcomes.Assuming isolation implies that the outcome of each trial cannot
affect the other trials.

2.孤立的概念在物理学中很普遍。热力学中的孤立系统(赖歇尔,1998,第2节。d)或力学中的封闭系统(Landau和Lifshitz,1969,5)在不与周围环境交换质量或能量的情况下演化。涉及被假设为孤立的系统的实验性试验可能在它们的初始设置上有所不同,因此具有不同的结果。假设隔离意味着每个试验的结果不能影响其他试验。

3.Rather than
letting two noise variables display measurable statistical dependencies because
they share a common cause, we prefer to name the common cause and make the
dependency explicit in the graph.

3.与其让两个噪声变量显示可测量的统计相关性,因为它们有一个共同的原因,我们更喜欢命名共同的原因,并在图中明确相关性。

Exogenous
vars.Query.

外源性vars。查询。

Eligible
ads.Bids.

合格的广告。出价。

Scores.Ad
slate.Prices.Clicks.Revenue.

分数。广告石板。价格。点击。收入。

image.png

Figure 6:
Bayesian network associated with the Markov factorization shown in Figure 5.

图6:与图5所示的马尔可夫分解相关联的贝叶斯网络。

model shown
in Figure 2, we assume that the joint distribution of the exogenous variables
factorizes

模型如图2所示,我们假设外生变量的联合分布因子化

as

如同

P(u, v,ε1,..., ε8) = P(u, v)P(ε1)...P(ε8).

Since an
isolation assumption is only true up to a point, it should be expressed clearly
and remain under constant scrutiny.We must therefore measure additional
performance metrics that reveal how the isolation assumption holds.For
instance, the ad placement structural equation model and the corresponding
causal graph (figures 2 and 3) do not take user feedback or advertiser feedback
into account.Measuring the revenue is not enough because we could easily
generate revenue at the expense of the satisfaction of the users and
advertisers.When we evaluate interventions under such an isolation assumption,
we also need to measure a battery of additional quantities that act as proxies
for the user and advertiser satisfaction.Noteworthy examples include ad
relevance estimated by human judges, and advertiser surplus estimated from the
auctions (Varian, 2009).

因为隔离假设在某种程度上是正确的,所以应该清楚地表达出来,并保持不断的审查。因此,我们必须衡量额外的性能指标,以揭示隔离假设是如何成立的。例如,广告投放结构方程模型和相应的因果图(图2和3)没有考虑用户反馈或广告商反馈。衡量收入是不够的,因为我们很容易以牺牲用户和广告商的满意度为代价来创造收入。当我们在这样的隔离假设下评估干预时,我们还需要测量一组额外的量,作为用户和广告商满意度的代理。值得注意的例子包括由人类估计的广告相关性,以及从拍卖中估计的广告商盈余(瓦里安,2009)。

3.3 Markov Factorization马尔可夫因子分解

Conceptually,
we can draw a sample of the exogenous variables using the distribution
specified by the isolation assumption, and we can then generate values for all
the remaining variables by simulating the structural equation model.

从概念上讲,我们可以使用隔离假设指定的分布来绘制外生变量的样本,然后我们可以通过模拟结构方程模型来生成所有剩余变量的值。

This process
defines a generative probabilistic model representing the joint distribution of
all variables in the structural equation model.The distribution readily
factorizes as the product of the joint probability of the named exogenous
variables, and, for each equation in the structural equation model, the
conditional probability of the effect given its direct causes (Spirtes et al.,
1993;Pearl, 2000).As illustrated by figures 5 and 6, this Markov factorization
connects the structural equa-tion model that describes causation, and the
Bayesian network that describes the joint probability distribution followed by
the variables under the isolation assumption.

这个过程定义了一个表示结构方程模型中所有变量联合分布的生成概率模型。该分布很容易分解为所命名的外生变量的联合概率的乘积,对于结构方程模型中的每个方程,给出其直接原因的影响的条件概率(Spirtes等人,1993;Pearl,2000)。如图5和图6所示,这种马尔可夫分解将描述因果关系的结构方程模型和描述隔离假设下变量遵循的联合概率分布的贝叶斯网络联系起来。

Structural
equation models and Bayesian networks appear so intimately connected that it
could be easy to forget the differences.The structural equation model is an
algebraic object.As long as the causal graph remains acyclic, algebraic
manipulations are interpreted as interventions on the causal system.The
Bayesian network is a generative statistical model representing a class of
joint probability distributions, and, as such, does not support algebraic
manipulations.However, the symbolic representation of its Markov factorization
is an algebraic object, essentially equivalent to the structural equation
model.

结构方程模型和贝叶斯网络看起来联系如此紧密,以至于很容易忘记它们之间的差异。结构方程模型是一个代数对象。只要因果图保持非循环,代数操作就被解释为对因果系统的干预。贝叶斯网络是表示一类联合概率分布的生成统计模型,因此不支持代数操作。然而,其马尔可夫因子分解的符号表示是一个代数对象,本质上等价于结构方程模型。

3.4
Identification, Transportation, and Transfer Learning

3.4识别、运输和转移学习

Consider a
causal system represented by a structural equation model with some unknown
functional dependencies.Subject to the isolation assumption, data collected
during the operation of this system follows the distribution described by the
corresponding Markov factorization.Let us first assume that this data is sufficient
to identify the joint distribution of the subset of variables we can observe.We
can intervene on the system by clamping the value of some variables.This
amounts to replacing the right-hand side of the corresponding structural
equations by constants.The joint distribution of the variables is then
described by a new Markov factorization that shares many factors with the
orig-inal Markov factorization.Which conditional probabilities associated with
this new distribution can we express using only conditional probabilities
identified during the observation of the original sys-tem?This is called the
identifiability problem.More generally, we can consider arbitrarily complex
manipulations of the structural equation model, and we can perform multiple
experiments involv-ing different manipulations of the causal system.Which
conditional probabilities pertaining to one experiment can be expressed using
only conditional probabilities identified during the observation of other
experiments?This is called the transportability problem.

考虑一个由结构方程模型表示的因果系统,该模型具有一些未知的函数依赖关系。根据隔离假设,在该系统运行期间收集的数据遵循由相应的马尔可夫因子分解描述的分布。让我们首先假设这些数据足以识别我们可以观察到的变量子集的联合分布。我们可以通过钳制某些变量的值来干预系统。这相当于用常数替换了相应结构方程的右边。然后,变量的联合分布由一个新的马尔可夫因子分解来描述,它与原始的马尔可夫因子分解共享许多因子。我们可以仅使用在观察原始系统时识别的条件概率来表示与这种新分布相关的哪些条件概率?这就是所谓的可识别性问题。更一般地说,我们可以考虑结构方程模型的任意复杂操作,并且我们可以进行涉及因果系统的不同操作的多个实验。哪些与一个实验相关的条件概率可以仅用在观察其他实验时确定的条件概率来表示?这就是所谓的可运输性问题。

Pearl's
do-calculus completely solves the identifiability problem and provides useful
tools to address many instances of the transportability problem (see Pearl,
2012).Assuming that we know the conditional probability distributions involving
observed variables in the original structural equa-tion model, do-calculus
allows us to derive conditional distributions pertaining to the manipulated structural
equation model.

Pearl的do-演算完全解决了可识别性问题,并提供了有用的工具来解决可运输性问题的许多实例(参见Pearl,2012)。假设我们知道原始结构方程模型中包含观测变量的条件概率分布,微积分允许我们导出与操纵结构方程模型相关的条件分布。

Unfortunately,
we must further distinguish the conditional probabilities that we know (because
we designed them) from those that we estimate from empirical data.This
distinction is important because estimating the distribution of continuous or
high cardinality variables is notoriously dif-ficult.Furthermore, do-calculus
often combines the estimated probabilities in ways that amplify estimation
errors.This happens when the manipulated structural equation model exercises
the vari-ables in ways that were rarely observed in the data collected from the
original structural equation model.

不幸的是,我们必须进一步区分我们知道的条件概率(因为我们设计了它们)和我们根据经验数据估计的条件概率。这种区别很重要,因为估计连续或高基数变量的分布是出了名的困难。此外,微积分经常以放大估计误差的方式组合估计的概率。当被操纵的结构方程模型以从原始结构方程模型收集的数据中很少观察到的方式运用变量时,就会发生这种情况。

4.Bayesian
networks are directed graphs representing the Markov factorization of a joint
probability distribution: the arrows no longer have a causal interpretation.

4.贝叶斯网络是表示联合概率分布的马尔可夫因子分解的有向图:箭头不再具有因果解释。

Therefore we
prefer to use much simpler causal inference techniques (see sections 4.1 and
4.2).Although these techniques do not have the completeness properties of
do-calculus, they combine estimation and transportation in a manner that
facilitates the derivation of useful confidence inter-vals.

因此,我们更喜欢使用简单得多的因果推断技术(参见第4.1节和第4.2节)。虽然这些技术不具备微积分的完备性,但它们以一种便于推导有用的置信区间的方式结合了估计和传输。

3.5 Special Cases

3.5特殊情况

Three special
cases of causal models are particularly relevant to this work.

因果模型的三个特例与这项工作特别相关。

• In the
multi-armed bandit (Robbins, 1952), a user-defined policy function π determines
the distribution of action a ∈ {1...K}, and an unknown reward function r
determines the distri-bution of the outcome y given the action a (Figure 7).In
order to maximize the accumulated rewards, the player must construct policies π
that balance the exploration of the action space with the exploitation of the best
action identified so far (Auer et al., 2002;Audibert et al., 2007;Seldin et
al., 2012).

在多臂老虎机(Robbins,1952)中,用户定义的策略函数π决定了动作a的分布∑{ 1...一个未知的奖励函数r决定了给定动作a的结果y的分布(图7)。为了使累积的奖励最大化,玩家必须构建策略π,该策略π平衡对动作空间的探索和对迄今为止识别的最佳动作的开发(Auer等人,2002;Audibert等人,2007年;Seldin等人,2012年)。

• The
contextual bandit problem (Langford and Zhang, 2008) significantly increases
the com-plexity of multi-armed bandits by adding one exogenous variable x to
the policy function π and the reward functions r (Figure 8).

上下文老虎机问题(Langford和Zhang,2008)通过在策略函数π和奖励函数r上增加一个外生变量x,显著增加了多武装土匪的复杂性(图8)。

• Both
multi-armed bandit and contextual bandit are special case of reinforcement
learning (Sutton and Barto, 1998).In essence, a Markov decision process is a
sequence of contextual bandits where the context is no longer an exogenous
variable but a state variable that depends on the previous states and actions
(Figure 9).Note that the policy function π, the reward function r, and the
transition function s are independent of time.All the time dependencies are expressed
using the states st .

多臂老虎机和情境老虎机都是强化学习的特例(萨顿和巴尔托,1998)。本质上,马尔可夫决策过程是一系列上下文老虎机,其中上下文不再是外生变量,而是依赖于先前状态和动作的状态变量(图9)。注意,策略函数π、奖励函数r和转移函数s与时间无关。所有的时间依赖关系都用状态st表示。

These special
cases have increasing generality.Many simple structural equation models can be
reduced to a contextual bandit problem using appropriate definitions of the
context x, the action a and the outcome y. For instance, assuming that the
prices c are discrete, the ad placement struc-tural equation model shown in
Figure 2 reduces to a contextual bandit problem with context (u, v), actions
(s, c) and reward z. Similarly, given a sufficiently intricate definition of
the state variables st , all structural equation models with discrete variables
can be reduced to a reinforcement learning problem.Such reductions lose the
fine structure of the causal graph.We show in Section 5 how this fine structure
can in fact be leveraged to obtain more information from the same experiments.

这些特例越来越具有普遍性。使用上下文x、动作a和结果y的适当定义,许多简单的结构方程模型可以简化为上下文强盗问题。例如,假设价格c是离散的,图2所示的广告投放结构方程模型可以简化为具有上下文(u,v)、动作(s,c)和奖励z的上下文强盗问题。类似地,给定状态变量st的足够复杂的定义,所有具有离散变量的结构方程模型可以简化为强化学习问题。这种简化失去了因果图的精细结构。我们在第5节中展示了如何利用这种精细的结构从相同的实验中获得更多的信息。

Modern
reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the
assumption that the policy function, the reward function, the transition
function, and the distributions of the corresponding noise variables, are
independent from time.This invariance property provides great benefits when the
observed sequences of actions and rewards are long in comparison with the size
of the state space.Only Section 7 in this contribution presents methods that
take advantage of such an invariance.The general question of leveraging
arbitrary functional invariances in causal graphs is left for future work.

现代强化学习算法(见萨顿和巴尔托,1998)利用了政策函数、奖励函数、转移函数和相应噪声变量的分布与时间无关的假设。当观察到的动作和回报序列与状态空间的大小相比较长时,这种不变性提供了很大的好处。这篇文章中只有第7节介绍了利用这种不变性的方法。利用因果图中任意函数不变性的一般问题留待将来研究。

4.Counterfactual Analysis反事实分析

We now return
to the problem of formulating and answering questions about the value of
proposed changes of a learning system.Assume for instance that we consider
replacing the score computation

我们现在回到制定和回答关于学习系统的建议改变的价值的问题。例如,假设我们考虑替换分数计算

a = π ( ε ) Action a ∈{1...K}

y = r ( a,ε’) Reward y ∈ R

动作a∞{ 1...K}奖励y ∈ R

Figure 7:
Structural equation model for the multi-armed bandit problem.The policy π
selects a discrete
action a, and the reward function r determines the outcome y. The noise
vari-ables ε and ε ′ represent independent sources of randomness useful to
model probabilistic dependencies.

图7:多武装匪徒问题的结构方程模型。策略π选择一个离散动作a,奖励函数r决定结果y。噪声变量ε和ε’代表独立的随机源,可用于建模概率相关性。

a = π ( x , ε) Action a ∈{1...K}

y = r ( x,a,ε’)Reward y ∈ R

Figure 8:
Structural equation model for contextual bandit problem.Both the action and the
reward depend on an exogenous context variable x.

图8:上下文老虎机问题的结构方程模型。行动和奖励都依赖于一个外生的环境变量x。
image.png

Figure 9:
Structural equation model for reinforcement learning.The above equations are
replicated

图9:强化学习的结构方程模型。上面的等式是重复的

for all t ∈
{0...,T}.The context is now provided by a state variable st−1 that depends on
the previous states and actions.

对于所有t∞{ 0...,T}。上下文现在由状态变量ST 1提供,该变量取决于先前的状态和动作。

model M of an
ad placement engine by an alternate model M. We seek an answer to the
conditional question:

模型M的一个广告投放引擎的一个替代模型M。我们寻求一个条件问题的答案:

"How
will the system perform if we replace model M by model M*?"

"如果我们用M*代替M模型,系统会有什么表现?"

Given
sufficient time and sufficient resources, we can obtain the answer using a
controlled experiment (Section 2.2).However, instead of carrying out a new
experiment, we would like to obtain an answer using data that we have already
collected in the past.

如果有足够的时间和足够的资源,我们可以通过一个受控实验(第2.2节)来获得答案。然而,我们不希望进行新的实验,而是希望使用我们过去已经收集的数据来获得答案。

"How
would the system have performed if, when the data was collected, we had
replaced model M by model M*?"

“如果在收集数据时,我们用模型M*替换了模型M,系统会有什么表现?”

The answer of
this counterfactual question is of course a counterfactual statement that
describes the system performance subject to a condition that did not happen.

这个反事实问题的答案当然是一个反事实陈述,它描述了系统在一个没有发生的条件下的性能。

Counterfactual
statements challenge ordinary logic because they depend on a condition that is
known to be false.Although assertion A ⇒ B is always true when assertion A is
false, we certainly do not mean for all counterfactual statements to be
true.Lewis (1973) navigates this paradox using a modal logic in which a
counterfactual statement describes the state of affairs in an alternate world
that resembles ours except for the specified differences.Counterfactuals indeed
offer many subtle ways to qualify such alternate worlds.For instance, we can
easily describe isolation assumptions (Section 3.2) in a counterfactual
question:

反事实陈述挑战普通逻辑,因为它们依赖于一个已知为假的条件。虽然当断言A为假时,断言A总是真的,但我们当然不是说所有的反事实陈述都是真的。刘易斯(1973)使用模态逻辑来解释这个悖论,在模态逻辑中,一个反事实的陈述描述了一个与我们相似的另一个世界的事态,除了特定的差异。反事实确实提供了许多微妙的方法来限定这样的交替世界。例如,我们可以很容易地在一个反事实问题中描述隔离假设(第3.2节):

"How
would the system have performed if, when the data was collected, we had
replaced model M by model Mwithout incurring user or advertiser
reactions?"

“如果在收集数据时,我们用M*模型代替了M模型,而不会引起用户或广告商的反应,那么系统会有怎样的表现?”

image.png

Figure 10:
Causal graph for an image recognition system.We can estimate counterfactuals by
replaying data collected in the past.

图10:图像识别系统的因果图。我们可以通过回放过去收集的数据来估计反事实。

Figure 11:
Causal graph for a randomized experiment.We can estimate certain
counterfactuals by reweighting data collected in the past.

图11:随机实验的因果图。我们可以通过重新加权过去收集的数据来估计某些反事实。

The fact that
we could not have changed the model without incurring the user and advertiser
reac-tions does not matter any more than the fact that we did not replace model
M by model M* in the first place.This does not prevent us from using
counterfactual statements to reason about causes and effects.Counterfactual
questions and statements provide a natural framework to express and share our
conclusions.

事实上,我们不可能在不引起用户和广告商反应的情况下改变模型,这一点并不比我们一开始没有用M*模型替换M模型更重要。这并不妨碍我们用反事实的陈述来推理因果。反事实的问题和陈述提供了一个自然的框架来表达和分享我们的结论。

The remaining
text in this section explains how we can answer certain counterfactual
questions using data collected in the past.More precisely, we seek to estimate
performance metrics that can be expressed as expectations with respect to the
distribution that would have been observed if the counterfactual conditions had
been in force.

本节剩下的内容解释了我们如何使用过去收集的数据来回答某些反事实的问题。更准确地说,我们试图估计性能指标,这些指标可以表示为对分布的期望,如果反事实条件已经生效,就会观察到分布。

4.1 Replaying
Empirical Data

4.1重放经验数据

Figure 10
shows the causal graph associated with a simple image recognition system.The
classifier takes an image x and produces a prospective class label ˆy. The loss
measures the penalty associated with recognizing class ˆy while the true class
is y.

图10显示了与简单图像识别系统相关的因果图。分类器获取图像x并产生预期的类别标签y。损失测量与识别类别y相关联的惩罚,而真实类别是y

To estimate
the expected error of such a classifier, we collect a representative data set
composed of labelled images, run the classifier on each image, and average the
resulting losses.In other words, we replay the data set to estimate what (counterfactual)
performance would have been observed if we had used a different classifier.We
can then select in retrospect the classifier that would have worked the best
and hope that it will keep working well.This is the counterfactual viewpoint on
empirical risk minimization (Vapnik, 1982).

为了估计这种分类器的预期误差,我们收集由标记图像组成的代表性数据集,在每个图像上运行分类器,并对产生的损失进行平均。换句话说,我们重放数据集来估计如果我们使用不同的分类器会观察到什么样的(反事实的)性能。然后,我们可以在回顾中选择最有效的分类器,并希望它能继续有效。这是经验风险最小化的反事实观点(Vapnik,1982)。

Replaying the
data set works because both the alternate classifier and the loss function are
known.More generally, to estimate a counterfactual by replaying a data set, we
need to know all the functional dependencies associated with all causal paths
connecting the intervention point to the measurement point.This is obviously
not always the case.

重放数据集是可行的,因为备用分类器和损失函数都是已知的。更一般地说,为了通过重放数据集来估计反事实,我们需要知道与连接干预点和测量点的所有因果路径相关联的所有功能依赖。显然情况并非总是如此。

5.Although
counterfactual expectations can be viewed as expectations of unit-level
counterfactuals (Pearl, 2009, Def-inition 4), they elude the semantic
subtleties of unit-level counterfactuals and can be measured with randomized
experiments (see Section 4.2.)

5.尽管反事实期望可以被视为单位级反事实的期望(Pearl,2009,Def-inition 4),但它们回避了单位级反事实的语义微妙性,并且可以通过随机实验来测量(参见第4.2节)。)

4.2
Reweighting Randomized Trials

4.2重新加权随机试验

Figure 11
illustrates the randomized experiment suggested in Section 2.3.The patients are
randomly split into two equally sized groups receiving respectively treatments
A and B. The overall success rate for this experiment is therefore Y = (YA
+YB)/2 where YA and YB are the success rates observed for each group.We would
like to estimate which (counterfactual) overall success rate Y would have been
observed if we had selected treatment A with probability p and treatment B with
probability 1− p.

图11说明了第2.3节中建议的随机实验。患者被随机分成两个大小相等的组,分别接受治疗A和治疗b。因此,本实验的总成功率为Y = (YA +YB)/2,其中YA和YB是每组观察到的成功率。如果我们选择概率为p的治疗A和概率为1 p的治疗B,我们希望估计观察到的(反事实的)总体成功率Y。

Since we do
not know how the outcome depends on the treatment and the patient condition, we
cannot compute which outcome y would have been obtained if we had treated
patient x with a different treatment u . Therefore we cannot answer this
question by replaying the data as we did in Section 4.1.

由于我们不知道结果如何取决于治疗和患者状况,因此我们无法计算如果我们用不同的治疗u治疗患者x会获得哪种结果y。因此,我们无法像在第4.1节中那样通过重放数据来回答这个问题。

However,
observing different success rates YA and YB for the treatment groups reveals an
empir-ical correlation between the treatment u and the outcome y. Since the only
cause of the treatment u is an independent roll of the dices, this correlation
cannot result from any known or unknown con-founding common cause.Having
eliminated this possibility, we can reweight the observed out-comes and compute
the estimate Y ≈ pYA + (1− p)YB .

然而,观察治疗组的不同成功率YA和YB揭示了治疗u和结果y之间的经验相关性。由于治疗u的唯一原因是独立的骰子滚动,这种相关性不能由任何已知或未知的共同原因产生。排除这种可能性后,我们可以重新加权观察到的输出,并计算估计值Y≈pyA+(1p)YB。

4.3 Markov
Factor Replacement

4.3马尔可夫因子替换

The
reweighting approach can in fact be applied under much less stringent conditions.Let
us return to the ad placement problem to illustrate this point.

事实上,再加权方法可以在不太严格的条件下应用。让我们回到广告投放问题来说明这一点。

The average
number of ad clicks per page is often called click yield.Increasing the click
yield usually benefits both the advertiser and the publisher, whereas
increasing the revenue per page often benefits the publisher at the expense of
the advertiser.Click yield is therefore a very useful metric when we reason
with an isolation assumption that ignores the advertiser reactions to pricing changes.

平均每页的广告点击量通常被称为点击量。增加点击率通常对广告商和出版商都有利,而增加每页的收入通常对出版商有利,但牺牲了广告商的利益。因此,当我们用一个孤立的假设来推理时,点击量是一个非常有用的指标,这个假设忽略了广告商对价格变化的反应。

Let ω be a
shorthand for all variables appearing in the Markov factorization of the ad
placement structural equation model,

假设ω是广告投放结构方程模型的马尔可夫分解中出现的所有变量的简写,

P(ω) =P ( u , v ) P( x |u ) P ( a|x , v ) P ( b|x , v ) P ( q|x , a )
× P ( s|a , q
, b ) P ( c |a , q , b ) P ( y |s , u ) P ( z|y , c ) .

Variable y
was defined in Section 3.1 as the set of user clicks.In the rest of the
document, we slightly abuse this notation by using the same letter y to
represent the number of clicks.We also write the expectation Y = Eω∼P(ω) [y]
using the integral notation

变量y在第3.1节中被定义为用户点击的集合。在文档的其余部分,我们稍微滥用了这个符号,使用相同的字母y来表示点击次数。我们还用积分表示法写出了期望Y = Eω∞P(ω)[Y]

Y = Zω y P (ω ) .

We would like
to estimate what the expected click yield Y would have been if we had used a
different scoring function (Figure 12).This intervention amounts to replacing
the actual factor P(q|x,a) by a counterfactual factor P(q|x,a) in the Markov
factorization.

如果我们使用不同的评分函数,我们希望估计预期的点击率是多少(图12)。这种干预相当于用马尔可夫因子分解中的反事实因子P(q|x,a)代替实际因子P(q|x,a)。

P ( ω) = P (u , v ) P ( x |u ) P ( a|x , v ) P ( b|x , v ) P ( q| x , a )

× P ( s|a,q,b ) P ( c |a,q,b ) P ( y |s,u ) P ( z|x,c)。

(3)

6.See also
the discussion of Reichenbach's common cause principle and of its limitations
in Spirtes et al. (1993) and Spirtes and Scheines (2004).

6.另见Spirtes等人(1993年)和Spirtes和Scheines
(2004年)对赖兴巴赫共同原因原则及其局限性的讨论。

image.png

Figure 12:
Estimating which average number of clicks per page would have been observed if
we had used a different scoring model.

图12:如果我们使用不同的评分模型,估计每个页面的平均点击次数。

Let us
assume, for simplicity, that the actual factor P(q|x,a) is nonzero
everywhere.We can then estimate the counterfactual expected click yield Y using
the transformation

为了简单起见,我们假设实际的因子P(q|x,a)处处非零。然后,我们可以使用转换来估计反事实的预期点击量Y

Y = Zω y P (ω) = Zω y P ( q| x a ) P ( q|x , a ) P ( ω ) ≈ n n∑i=1 y i P ( q i |x i a i ) P
( q i |x i , a i )

where the
data set of tuples (ai , xi ,qi , yi) is distributed according to the actual
Markov factorization instead of the counterfactual Markov factorization.This
data could therefore have been collected during the normal operation of the ad
placement system.Each sample is reweighted to reflect its probability of
occurrence under the counterfactual conditions.

其中元组(ai、xi、qi、yi)的数据集根据实际的马尔可夫因子分解而不是反事实的马尔可夫因子分解来分布。因此,这些数据可能是在广告投放系统正常运行期间收集的。每个样本被重新加权,以反映其在反事实条件下出现的概率。

In general,
we can use importance sampling to estimate the counterfactual expectation of
any quantity ℓ(ω):

一般来说,我们可以使用重要性抽样来估计任何数量ℓ(ω):的反事实期望

Y = Zω ℓ ( ω
) P ( ω) = Zω ℓ ( ω ) P ( ω ) P ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w i

with weights
image.png

Equation (6)
emphasizes the simplifications resulting from the algebraic similarities of the
actual and counterfactual Markov factorizations.Because of these
simplifications, the evaluation of the weights only requires the knowledge of
the few factors that differ between P(ω) and P(ω).Each data sample needs to
provide the value of ℓ(ωi) and the values of all variables needed to evaluate
the factors that do not cancel in the ratio (6).

等式(6)强调由实际和反事实马尔可夫因子分解的代数相似性产生的简化。由于这些简化,权重的计算只需要知道P(ω)和P(ω)之间的几个不同因素。每个数据样本需要提供ℓ(ωi值)和评估比率(6)中不抵消的因素所需的所有变量的值。

In contrast,
the replaying approach (Section 4.1) demands the knowledge of all factors of
P(ω) connecting the point of intervention to the point of measurement ℓ(ω).On
the other hand, it does not require the knowledge of factors appearing only in
P(ω).

相比之下,重放方法(第4.1节)要求了解连接干预点和测量点ℓ(ω).的所有P(ω)因子另一方面,它不需要只出现在P(ω)中的因子的知识。

Importance
sampling relies on the assumption that all the factors appearing in the
denomina-tor of the reweighting ratio (6) are nonzero whenever the factors
appearing in the numerator are nonzero.Since these factors represents
conditional probabilities resulting from the effect of an independent noise
variable in the structural equation model, this assumption means that the data
must be
collected with an experiment involving active randomization.We must therefore
design cost-effective randomized experiments that yield enough information to
estimate many interesting counterfactual expectations with sufficient
accuracy.This problem cannot be solved without an-swering the confidence
interval question: given data collected with a certain level of randomization,
with which accuracy can we estimate a given counterfactual expectation?

重要性抽样依赖于这样的假设,即只要分子中出现的因子非零,重加权比(6)的分母中出现的所有因子都非零。由于这些因素代表了由结构方程模型中独立噪声变量的影响而产生的条件概率,这一假设意味着数据

必须通过涉及主动随机化的实验收集。因此,我们必须设计成本有效的随机实验,产生足够的信息,以足够的准确性估计许多有趣的反事实预期。如果不回答置信区间问题,这个问题就无法解决:给定以一定程度的随机化收集的数据,我们能以何种精度估计给定的反事实期望?

4.4
Confidence Intervals

4.4置信区间

At first
sight, we can invoke the law of large numbers and write

乍一看,我们可以援引大数定律来写

Y = Zω ℓ ( ω
) w ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w i .

(7)

For sufficiently
large n, the central limit theorem provides confidence intervals whose width
grows with the standard deviation of the product ℓ(ω)w(ω).

对于足够大的n,中心极限定理提供了置信区间,其宽度随着乘积ℓ(ω)w(ω).的标准偏差而增加

Unfortunately,
when P(ω) is small, the reweighting ratio w(ω) takes large values with low
probability.This heavy tailed distribution has annoying consequences because
the variance of the integrand could be very high or infinite.When the variance
is infinite, the central limit theorem does not hold.When the variance is
merely very large, the central limit convergence might occur too slowly to
justify such confidence intervals.Importance sampling works best when the
actual distribution and the counterfactual distribution overlap.

遗憾的是,当P(ω)较小时,重加权比w(ω)取大值的概率较低。这种重尾分布有令人讨厌的后果,因为被积函数的方差可能很大或无穷大。当方差无穷大时,中心极限定理不成立。当方差仅仅非常大时,中心极限收敛可能发生得太慢,不足以证明这样的置信区间。当实际分布和反事实分布重叠时,重要性抽样效果最好。

When the
counterfactual distribution has significant mass in domains where the actual
distribu-tion is small, the few samples available in these domains receive very
high weights.Their noisy contribution dominates the reweighted estimate (7).We
can obtain better confidence intervals by eliminating these few samples drawn
in poorly explored domains.The resulting bias can be bounded using prior
knowledge, for instance with an assumption about the range of values taken by
ℓ(ω),

当反事实分布在实际分布很小的域中具有显著质量时,这些域中可用的少数样本获得非常高的权重。他们的噪声贡献主导了重新加权的估计(7)。我们可以通过消除在探索不足的领域提取的这几个样本来获得更好的置信区间。所产生的偏差可以使用先验知识来界定,例如利用关于ℓ(ω取值范围的假设),

∀ω ℓ(ω) ∈ [0,M].

(8)

Let us choose
the maximum weight value R deemed acceptable for the weights.We have ob-tained
very consistent results in practice with R equal to the fifth largest
reweighting ratio observed on the empirical data.We can then rely on clipped
weights to eliminate the contribution of the poorly explored domains,

让我们选择重量可以接受的最大重量值R。我们在实践中获得了非常一致的结果,R等于在经验数据上观察到的第五大再加权比。然后,我们可以依靠剪裁的权重来消除探索不充分的领域的贡献,

w (ω) = w(ω)
if P(ω) < RP(ω) 0 otherwise.

如果P(ω) < RP(ω) 0,则w (ω) = w(ω),否则。

The condition
P(ω) < RP(ω) ensures that the ratio has a nonzero denominator P(ω) and is
smaller than R. Let ΩR be the set of all values of ω associated with acceptable
ratios:

条件P(ω) < RP(ω)确保比率具有非零分母P(ω)并且小于R。假设ωR是与可接受比率相关的所有ω值的集合:

ΩR = {ω :
P(ω) < RP(ω)} .

ωR =
{ω:P(ω)< RP(ω)}。

We can
decompose Y in two terms:

我们可以将Y分解为两个项:

Y = Zω ∈ Ω R
ℓ ( ω ) P ( ω) + Zω ∈ Ω \ Ω R ℓ ( ω ) P ( ω) = Y + ( Y − Y ) .

7.This is in
fact a slight abuse because the theory calls for choosing R before seeing the
data.

7.这实际上是一个轻微的滥用,因为理论要求在看到数据之前选择R。

The first
term of this decomposition is the clipped expectation Y . Estimating the
clipped expec-tation Y is much easier than estimating Y from (7) because the
clipped weights w(ω) are bounded by R.

这个分解的第一项是截尾期望Y。估计截尾期望Y比从(7)估计Y容易得多,因为截尾权重w(ω)由r界定

Y = Z ω∈ΩR
ℓ(ω)P (ω) = Z ω ℓ(ω)w (ω) P(ω) ≈ Yb= n n ∑ i=1 ℓ(ωi)w (ωi).(10)

y =
zω∈rωℓ(ω)p(ω)= zωℓ(ω)w(ω)p(ω)≈Yb = n n∑I = 1 ℓ(ωi)w(ωI)。(10)

The second
term of Equation (9) can be bounded by leveraging assumption (8).The resulting
bound can then be conveniently estimated using only the clipped weights.

等式(9)的第二项可以由杠杆假设(8)来界定。然后,仅使用剪裁的权重就可以方便地估计得到的界限。

Y −Y = Z
ω∈Ω\ΩR ℓ(ω)P (ω) ∈ h M P (Ω\ΩR) i = h M (−W ) i with

W = P (ΩR) =
Z ω∈ΩR P (ω) = Z ω w (ω)P(ω) ≈ Wb= n n ∑ i=1 w (ωi).

(11)

(11)

Since the
clipped weights are bounded, the estimation errors associated with (10) and
(11) are well characterized using either the central limit theorem or using
empirical Bernstein bounds (see appendix B for details).Therefore we can derive
an outer confidence interval of the form

由于裁剪后的权重是有界的,因此使用中心极限定理或使用经验伯恩斯坦界限(详见附录B)可以很好地表征与(10)和(11)相关的估计误差。因此,我们可以导出形式的外部置信区间

P n Yb−εR ≤ Y
≤ Yb+εR o ≥ −δ

and an inner
confidence interval of the form

和形式的内部置信区间

P n Y ≤ Y ≤ Y

  • M(−Wb+ξR) o ≥ −δ.

(13)

The names
inner and outer are in fact related to our preferred way to visualize these
intervals (e.g., Figure 13).Since the bounds on Y −Y can be written as

内部和外部名称实际上与我们首选的可视化这些间隔的方式相关(例如,图13)。因为Y-Y的界限可以写成

Y ≤ Y ≤ Y + M
(1−W ),

we can derive
our final confidence interval,

我们可以推导出最终的置信区间,

P n Yb−εR ≤ Y
≤ Yb+ M(−Wb+ξR) +εR o ≥ −δ.

(15)

(15)

In
conclusion, replacing the unbiased importance sampling estimator (7) by the
clipped impor-tance sampling estimator (10) with a suitable choice of R leads
to improved confidence intervals.Furthermore, since the derivation of these
confidence intervals does not rely on the assumption that P(ω) is nonzero
everywhere, the clipped importance sampling estimator remains valid when the
dis-tribution P(ω) has a limited support.This relaxes the main restriction
associated with importance sampling.

总之,用适当选择的R来代替无偏重要抽样估计量(7)的削波重要抽样估计量(10)导致置信区间的改善。此外,由于这些置信区间的推导不依赖于P(ω)处处非零的假设,当分布P(ω)的支持度有限时,截尾重要性抽样估计量仍然有效。这放松了与重要性抽样相关的主要限制。

4.5
Interpreting the Confidence Intervals

4.5解释置信区间

The
estimation of the counterfactual expectation Y can be inaccurate because the
sample size is in-sufficient or because the sampling distribution P(ω) does not
sufficiently explore the counterfactual conditions of interest.

反事实期望Y的估计可能是不准确的,因为样本量是不充分的,或者因为抽样分布P(ω)没有充分探索感兴趣的反事实条件。

By
construction, the clipped expectation Y ignores the domains poorly explored by
the sam-pling distribution P(ω).The difference Y −Y then reflects the
inaccuracy resulting from a lack of exploration.Therefore, assuming that the
bound R has been chosen competently, the relative sizes of the outer and inner
confidence intervals provide precious cues to determine whether we can continue
collecting data using the same experimental setup or should adjust the data
collection experiment in order to obtain a better coverage.

通过构造,裁剪期望Y忽略了抽样分布P(ω)探索不充分的域。Y-Y的差异反映了由于缺乏探索而导致的不准确性。因此,假设界限R已经被恰当地选择,外置信区间和内置信区间的相对大小提供了宝贵的线索,以确定我们是否可以使用相同的实验设置继续收集数据,或者应该调整数据收集实验以获得更好的覆盖。

• The inner
confidence interval (13) witnesses the uncertainty associated with the domain
GR insufficiently explored by the actual distribution.A large inner confidence
interval suggests that the most practical way to improve the estimate is to
adjust the data collection experiment in order to obtain a better coverage of
the counterfactual conditions of interest.

内部置信区间(13)见证了与实际分布未充分探索的域GR相关的不确定性。较大的内部置信区间表明,提高估计值的最实用方法是调整数据收集实验,以便更好地覆盖感兴趣的反事实条件。

• The outer
confidence interval (12) represents the uncertainty that results from the
limited sample size.A large outer confidence interval indicates that the sample
is too small.To improve the result, we simply need to continue collecting data
using the same experimental setup.

外部置信区间(12)代表有限样本量导致的不确定性。较大的外部置信区间表明样本太小。为了改进结果,我们只需要使用相同的实验设置继续收集数据。

4.6
Experimenting with Mainline Reserves

4.6主线储备试验

We return to
the ad placement problem to illustrate the reweighting approach and the
interpretation of the confidence intervals.Manipulating the reserves Rp(x)
associated with the mainline positions (Figure 1) controls which ads are
prominently displayed in the mainline or displaced into the sidebar.

我们回到广告投放问题来说明重新加权方法和置信区间的解释。操纵与主线位置相关联的保留区Rp(x)控制哪些广告在主线中突出显示或者转移到侧边栏中。

We seek in
this section to answer counterfactual questions of the form:

我们在本节中寻求回答以下形式的反事实问题:

"How
would the ad placement system have performed if we had scaled the mainline
reserves by a constant factor ρ, without incurring user or advertiser
reactions?"

“如果我们以一个常数ρ来衡量主线储备,而不引起用户或广告商的反应,广告投放系统会有怎样的表现?”

Randomization
was introduced using a modified version of the ad placement engine.Before
determining the ad layout (see Section 2.1), a random number ε is drawn
according to the standard

随机化是使用广告投放引擎的修改版本引入的。在确定广告布局之前(见第2.1节),根据标准绘制一个随机数ε

normal
distribution N (0,1), and all the mainline reserves are multiplied by m = ρe −σ
/+σε.Such

正态分布N (0,1),所有主线储量乘以m =ρeσ/+σε。这样的

multipliers
follow a log-normal distributionwhose mean is ρ and whose width is controlled
by σ.This effectively provides a parametrization of the conditional score distribution
P(q|x,a) (see Figure 5.)

乘数遵循对数正态分布,其平均值为ρ,宽度由σ控制。这有效地提供了条件分数分布P(q|x,a)的参数化(见图5)。)

The Bing
search platform offers many ways to select traffic for controlled experiments
(Sec-tion 2.2).In order to match our isolation assumption, individual page
views were randomly as-signed to traffic buckets without regard to the user
identity.The main treatment bucket was pro-cessed with mainline reserves
randomized by a multiplier drawn as explained above with ρ=1 and σ = 0.3.With
these parameters, the mean multiplier is exactly 1, and 95% of the multipliers
are in range [0.52,1.74].Samples describing 22 million search result pages were
collected during five consecutive weeks.

必应搜索平台提供了许多选择流量进行受控实验的方法(第2.2节)。为了与我们的隔离假设相匹配,单个页面视图被随机签名到流量桶,而不考虑用户身份。主处理桶用主线储量进行处理,主线储量由乘数随机化,乘数如上所述绘制,ρ=1,σ = 0.3。有了这些参数,平均乘数正好是1,95%的乘数都在[0.52,1.74]的范围内。连续五周收集了描述2200万个搜索结果页面的样本。

We then use
this data to estimate what would have been measured if the mainline reserve
mul-tipliers had been drawn according to a distribution determined by
parameters ρ and σ .This is achieved by reweighting each sample ωi with

然后,我们使用这些数据来估计如果主线储备multipliers是根据由参数ρ和σ确定的分布绘制的,将会测量到什么。这是通过用ωi重新加权每个样本来实现的

w i =
P ( q i |x i,a i ) P ( q
i |x i,a i)

=

=

p ( m i ;ρ ,
σ ) p ( m i ;ρ , σ )

where mi is
the multiplier drawn for this sample during the data collection experiment, and
p(t ;ρ,σ) is the density of the log-normal multiplier distribution.

其中mi是在数据收集实验期间为该样本绘制的乘数,p(t;ρ,σ)是对数正态乘数分布的密度。

Figure 13
reports results obtained by varying ρ while keeping σ =σ.This amounts to
esti-mating what would have been measured if all mainline reserves had been
multiplied by ρ while keeping the same randomization.The curves bound 95%
confidence intervals on the variations of the average number of mainline ads
displayed per page, the average number of ad clicks per page,

图13报告了在保持σ =σ的情况下,通过改变ρ获得的结果。这相当于在保持相同随机化的情况下,对所有主线储量乘以ρ后的测量值进行了估计。这些曲线在每页显示的主线广告的平均数量、每页广告点击的平均数量,

8.More
precisely, lnN ( ,σ ) with = σ /2+logρ.

8.更准确地说,lnN(,σ)具有= σ /2+logρ。

Figure 13:
Estimated variations of three performance metrics in response to mainline
reserve

图13:响应主线储备的三个性能指标的估计变化

changes.The
curves delimit 95% confidence intervals for the metrics we would have observed
if we had increased the mainline reserves by the percentage shown on the
horizontal axis.The filled areas represent the inner confidence intervals.The
hollow squares represent the metrics measured on the experimental data.The
hollow circles represent metrics measured on a second experimental bucket with
mainline reserves re-duced by 18%.The filled circles represent the metrics
effectively measured on a control bucket running without randomization.

变化。这些曲线为我们观察到的指标划定了95%的置信区间,如果我们将主线储量增加水平轴上显示的百分比。填充区域代表内部置信区间。空心方块代表在实验数据上测量的度量。空心圆圈代表在第二个实验桶上测量的指标,主线储量减少了18%。实心圆圈代表在没有随机化的情况下运行的控制桶上有效测量的指标。

3228

3228

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

and the
average revenue per page, as functions of ρ .The inner confidence intervals,
represented by the filled areas, grow sharply when ρ leaves the range explored
during the data collection ex-periment.The average revenue per page has more
variance because a few very competitive queries command high prices.

和每页的平均收入,作为ρ的函数。当ρ离开数据收集实验期间探索的范围时,由填充区域表示的内部置信区间急剧增长。每页的平均收入差异更大,因为一些竞争非常激烈的查询要求很高的价格。

In order to
validate the accuracy of these counterfactual estimates, a second traffic
bucket of equal size was configured with mainline reserves reduced by about
18%.The hollow circles in Figure 13 represent the metrics effectively measured
on this bucket during the same time period.The effective measurements and the
counterfactual estimates match with high accuracy.

为了验证这些反事实估计的准确性,配置了第二个同等大小的流量桶,主线储量减少了约18%。图13中的空心圆圈表示在同一时间段内在该桶上有效测量的度量。有效测量值和反事实估计值高度匹配。

Finally, in
order to measure the cost of the randomization, we also ran the unmodified ad
place-ment system on a control bucket.The brown filled circles in Figure 13
represent the metrics effec-tively measured on the control bucket during the
same time period.The randomization caused a small but statistically significant
increase of the number of mainline ads per page.The click yield and average
revenue differences are not significant.

最后,为了测量随机化的成本,我们还在控制桶上运行了未修改的广告投放系统。图13中用棕色填充的圆圈表示在同一时间段内在控制桶上有效测量的指标。随机化导致了每页主线广告数量的小幅但有统计学意义的增加。点击率和平均收入差异不显著。

This
experiment shows that we can obtain accurate counterfactual estimates with
affordable randomization strategies.However, this nice conclusion does not
capture the true practical value of the counterfactual estimation approach.

这个实验表明,我们可以用负担得起的随机化策略获得准确的反事实估计。然而,这个好结论并没有抓住反事实估计方法的真正实用价值。

4.7 More on
Mainline Reserves

4.7主线储备的更多信息

The main
benefit of the counterfactual estimation approach is the ability to use the
same data to answer a broad range of counterfactual questions.Here are a few
examples of counterfactual ques-tions that can be answered using data collected
using the simple mainline reserve randomization scheme described in the
previous section:

反事实估计方法的主要好处是能够使用相同的数据来回答广泛的反事实问题。这里有几个反事实问题的例子,可以通过使用上一节描述的简单主线储备随机化方案收集的数据来回答:

• Different
variances - Instead of estimating what would have been measured if we had
in-creased the mainline reserves without changing the randomization variance,
that is, letting σ = σ, we can use the same data to estimate what would have
been measured if we had also changed σ.This provides the means to determine
which level of randomization we can afford in future experiments.

不同的方差——如果我们在不改变随机化方差的情况下增加主线储量,也就是说,让σ = σ,我们可以使用相同的数据来估计如果我们也改变σ会测量到什么,而不是估计会测量到什么。这提供了确定我们在未来实验中能够承受的随机化水平的方法。

• Pointwise
estimates - We often want to estimate what would have been measured if we had
set the mainline reserves to a specific value without randomization.Although
computing estimates for small values of σ often works well enough, very small
values lead to large confidence intervals.

逐点估算-如果我们将主线储量设置为特定值而不进行随机化,我们通常希望估算将要测量的值。虽然计算小σ值的估计值通常足够有效,但非常小的值会导致较大的置信区间。

Let Yν(ρ)
represent the expectation we would have observed if the multipliers m had mean
ρ and variance ν.We have then Yν(ρ) = Em[E[y|m]] = Em[Y0(m)].Assuming that the
pointwise value Y0 is smooth enough for a second order development,

让Yν(ρ)代表如果乘数m有均值ρ和方差ν,我们会观察到的期望。然后我们有Yν(ρ) = Em[E[y|m]] = Em[Y0(m)]。假设逐点值Y0对于二阶展开足够平滑,

Yν ( ρ ) ≈ E
m

Yν ( ρ ) ≈ E
m

  • Y0 ( ρ) + (
    m − ρ ) Y ′0 ( ρ) + ( m − ρ ) Y ′′ 0 ( ρ ) / 2

-Y0(ρ)+(mρ)Y′0(ρ)+(mρ)Y′0(ρ)/2

= Y0 ( ρ) + ν
Y ′′ 0 ( ρ ) / 2 .

=
Y0(ρ)+νY′’0(ρ)/2。

Although the
reweighting method cannot estimate the point-wise value Y0(ρ) directly, we can
use the reweighting method to estimate both Yν(ρ) and Y2ν(ρ) with acceptable
confidence intervals and write Y0(ρ) ≈ 2Yν(ρ)−Y2ν(ρ) (Goodwin, 2011).

虽然重加权法不能直接估计逐点值Y0(ρ),但我们可以用重加权法以可接受的置信区间同时估计Yν(ρ)和Y2ν(ρ),写出Y0(ρ)≈2Yν(ρ)-Y2ν(ρ)(Goodwin,2011)。


Query-dependent reserves - Compare for instance the queries "car
insurance" and "com-mon cause principle" in a web search
engine.Since the advertising potential of a search

查询相关准备金——例如,在网络搜索引擎中比较查询“汽车保险”和“共同原因原则”。因为搜索的广告潜力

3229

3229

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

varies
considerably with the query, it makes sense to investigate various ways to
define query-dependent reserves (Charles and Chickering, 2012).

随着查询的不同而有很大差异,研究定义查询相关储量的各种方法是有意义的(Charles和Chickering,2012)。

The data
collected using the simple mainline reserve randomization can also be used to es-timate
what would have been measured if we had increased all the mainline reserves by
a query-dependent multiplier ρ (x).This is simply achieved by reweighting each
sample ωi with

使用简单的主线储量随机化收集的数据也可以用来估计如果我们将所有主线储量增加一个依赖于查询的乘数ρ (x)会测量到什么。这可以通过用ωi重新加权每个样本来实现

w i =

w i =

P ( q i |x i
, a i ) P ( q i |x i , a i )

P ( q i |x i,a i ) P ( q
i |x i,a i)

=

=

p ( m i ;ρ (
x i ) , σ )

p(m I;ρ ( x i),σ)

p ( m i ;, σ
) .

p(m I;, σ ) .

Considerably
broader ranges of counterfactual questions can be answered when data is
collected using randomization schemes that explore more dimensions.For
instance, in the case of the ad placement problem, we could apply an
independent random multiplier for each score instead of applying a single
random multiplier to the mainline reserves only.However, the more dimensions we
randomize, the more data needs to be collected to effectively explore all these
dimensions.Fortunately, as discussed in section 5, the structure of the causal
graph reveals many ways to leverage a priori information and improve the
confidence intervals.

当使用探索更多维度的随机化方案收集数据时,可以回答范围更广的反事实问题。例如,在广告投放问题的情况下,我们可以对每个分数应用一个独立的随机乘数,而不是只对主线储备应用一个随机乘数。然而,我们随机化的维度越多,就需要收集更多的数据来有效地探索所有这些维度。幸运的是,正如第5节所讨论的,因果图的结构揭示了许多利用先验信息和提高置信区间的方法。

4.8 Related
Work

4.8相关工作

Importance
sampling is widely used to deal with covariate shifts (Shimodaira,
2000;Sugiyama et al., 2007).Since manipulating the causal graph changes the
data distribution, such an intervention can be viewed as a covariate shift
amenable to importance sampling.Importance sampling techniques have also been
proposed without causal interpretation for many of the problems that we view as
causal inference problems.In particular, the work presented in this section is
closely related to the Monte-Carlo approach of reinforcement learning (Sutton
and Barto, 1998, Chapter 5) and to the offline evaluation of contextual bandit
policies (Li et al., 2010, 2011).

重要性抽样被广泛用于处理协变量转移(Shimodaira,2000;Sugiyama等人,2007年)。由于操纵因果图会改变数据分布,这种干预可以被视为服从重要性抽样的协变量转移。对于我们认为是因果推理问题的许多问题,也提出了没有因果解释的重要性抽样技术。特别是,本节介绍的工作与强化学习的蒙特卡罗方法(Sutton和,1998,第5章)和上下文土匪策略的离线评估(Li等人,2010,2011)密切相关。

Reinforcement
learning research traditionally focuses on control problems with relatively
small discrete state spaces and long sequences of observations.This focus
reduces the need for char-acterizing exploration with tight confidence
intervals.For instance, Sutton and Barto suggest to normalize the importance
sampling estimator by 1/∑i w(ωi) instead of 1/n. This would give erro-neous
results when the data collection distribution leaves parts of the state space
poorly explored.Contextual bandits are traditionally formulated with a finite
set of discrete actions.For instance, Li's (2011) unbiased policy evaluation
assumes that the data collection policy always selects an arbitrary policy with
probability greater than some small constant.This is not possible when the
action space is infinite.

强化学习研究传统上集中于具有相对小的离散状态空间和长的观测序列的控制问题。这种关注减少了对具有紧密置信区间的特征勘探的需要。例如,萨顿和巴尔托建议用1/∑i w(ωi)而不是1/n来归一化重要性抽样估计量。当数据收集分布使得部分状态空间未被充分探索时,这将给出错误的结果。背景盗匪传统上是由一组有限的离散动作组成的。例如,李(2011)的无偏策略评估假设数据收集策略总是选择概率大于某个小常数的任意策略。当动作空间无限时,这是不可能的。

Such
assumptions on the data collection distribution are often impractical.For
instance, certain ad placement policies are not worth exploring because they
cannot be implemented efficiently or are known to elicit fraudulent
behaviors.There are many practical situations in which one is only interested
in limited aspects of the ad placement policy involving continuous parameters
such as click prices or reserves.Discretizing such parameters eliminates useful
a priori knowledge: for instance, if we slightly increase a reserve, we can
reasonable believe that we are going to show slightly less ads.

这种关于数据收集分布的假设通常是不切实际的。例如,某些广告投放策略不值得探索,因为它们无法有效实施或已知会引发欺诈行为。有许多实际情况,人们只对广告投放政策的有限方面感兴趣,包括连续的参数,如点击价格或储备。将这些参数离散化消除了有用的先验知识:例如,如果我们稍微增加一个储备,我们可以合理地相信我们将展示稍微少一点的广告。

Instead of
making assumptions on the data collection distribution, we construct a biased
estima-tor (10) and bound its bias.We then interpret the inner and outer
confidence intervals as resulting from a lack of exploration or an insufficient
sample size.

我们不是对数据收集分布做假设,而是构造一个有偏估计量(10)并限制它的偏差。然后,我们将内部和外部置信区间解释为缺乏探索或样本量不足导致的。

3230

3230

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Finally, the
causal framework allows us to easily formulate counterfactual questions that
pertain to the practical ad placement problem and yet differ considerably in
complexity and exploration requirements.We can address specific problems
identified by the engineers without incurring the risks associated with a
complete redesign of the system.Each of these incremental steps helps
demonstrating the soundness of the approach.

最后,因果框架允许我们很容易地提出与实际广告投放问题相关的反事实问题,但在复杂性和探索要求上有很大不同。我们可以解决工程师发现的具体问题,而不会产生与系统完全重新设计相关的风险。这些渐进步骤中的每一步都有助于证明该方法的合理性。

5.Structure

5.结构

This section
shows how the structure of the causal graph reveals many ways to leverage a
priori knowledge and improve the accuracy of our counterfactual
estimates.Displacing the reweighting point (Section 5.1) improves the inner
confidence interval and therefore reduce the need for explo-ration.Using a
prediction function (Section 5.2) essentially improve the outer confidence
interval and therefore reduce the sample size requirements.

本节展示了因果图的结构如何揭示利用先验知识和提高我们的反事实估计的准确性的许多方法。替换重加权点(第5.1节)提高了内部置信区间,因此减少了解释的需要。使用预测函数(第5.2节)本质上改善了外部置信区间,因此降低了样本量要求。

5.1 Better
Reweighting Variables

5.1更好地重新加权变量

Many search
result pages come without eligible ads.We then know with certainty that such
pages will have zero mainline ads, receive zero clicks, and generate zero
revenue.This is true for the randomly selected value of the reserve, and this
would have been true for any other value of the reserve.We can exploit this
knowledge by pretending that the reserve was drawn from the coun-terfactual
distribution P(q|xi ,ai) instead of the actual distribution P(q|xi ,ai).The
ratio w(ωi) is therefore forced to the unity.This does not change the estimate
but reduces the size of the inner confidence interval.The results of Figure 13
were in fact helped by this little optimization.

许多搜索结果页面没有合格的广告。然后,我们可以肯定地知道,这样的页面将没有主线广告,没有点击,也没有收入。对于随机选择的储备值来说是这样,对于储备的任何其他值来说也是这样。我们可以利用这一知识,假设储备是从实际分布P(q|xi,ai)中提取的,而不是实际分布P(q|xi,ai)。因此,比值w(ωi)被强制为1。这不会改变估计值,但会减小内部置信区间的大小。图13的结果实际上得益于这个小小的优化。

There are in
fact many circumstances in which the observed outcome would have been the same
for other values of the randomized variables.This prior knowledge is in fact
encoded in the structure of the causal graph and can be exploited in a more
systematic manner.For instance, we know that users make click decisions without
knowing which scores were computed by the ad placement engine, and without
knowing the prices charged to advertisers.The ad placement causal graph encodes
this knowledge by showing the clicks y as direct effects of the user intent u
and the ad slate s. This implies that the exact value of the scores q does not
matter to the clicks y as long as the ad slate s remains the same.

事实上,在许多情况下,观察到的结果对于随机变量的其他值是相同的。这种先验知识事实上被编码在因果图的结构中,并且可以以更系统的方式加以利用。例如,我们知道用户在不知道广告投放引擎计算了哪些分数,也不知道向广告商收取的价格的情况下做出点击决定。广告投放因果图通过将点击量y显示为用户意图u和广告名单s的直接影响来编码这一知识。这意味着只要广告名单s保持不变,分数q的确切值对点击量y并不重要。

Because the
causal graph has this special structure, we can simplify both the actual and
counter-factual Markov factorizations (2) (3) without eliminating the variable
y whose expectation is sought.Successively eliminating variables z, c, and q
gives:

因为因果图具有这种特殊的结构,我们可以简化实际的和反事实的马尔可夫因子分解(2) (3),而不消除期望被寻求的变量y。连续消除变量z、c和q可得出:

P ( u , v , x
, a , b , s , y) = P ( u , v ) P ( x |u ) P ( a|x , v ) P ( b|x , v ) P ( s|x ,
a , b ) P ( y |s , u ) , P ( u , v , x , a , b , s , y) = P ( u ,v ) P ( x |u )
P ( a|x , v ) P ( b|x , v ) P ( s|x , a , b ) P ( y |s , u ) .

P ( u,v,x,a,b,s,y) = P ( u,v ) P ( x |u
) P ( a|x,v ) P ( b|x,v ) P ( s|x,a,b ) P ( y |s,u),P ( u,v,x,a,b,s,y) = P ( u,P ( x |u ) P ( a|x,v ) P ( b|x,v ) P ( s|x,a,b ) P ( y |s,u)。

The
conditional distributions P(s|x,a,b) and P(s|x,a,b) did not originally appear
in the Markov factorization.They are defined by marginalization as a
consequence of the elimination of the vari-able q representing the scores.

条件分布P(s|x,a,b)和P(s|x,a,b)最初没有出现在马尔可夫分解中。他们被边缘化定义为代表分数的变量q被消除的结果。

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a )

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a)

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a ) .

P ( s|x a b)
= Zq P ( s|a q b ) P ( q|x a)。

3231

3231

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

Figure 14:
Estimated variations of two performance metrics in response to mainline reserve

图14:响应主线储备的两个性能指标的估计变化

changes.These
estimates were obtained using the ad slates s as reweighting variable.Compare
the inner confidence intervals with those shown in Figure 13.

变化。这些估计是使用广告板作为重新加权变量获得的。将内部置信区间与图13所示的置信区间进行比较。

We can
estimate the counterfactual click yield Y using these simplified
factorizations:

我们可以使用这些简化的因式分解来估计反事实的点击率Y:

= Z y P ( u v
x a b s y) = Z y P ( s|x a b ) P ( s|x , a , b ) P ( u v x a b s y )

= Z y P ( u v
x a b s y) = Z y P ( s|x a b ) P ( s|x,a,b ) P ( u v x a b s y)

Y

Y

n n∑i=1 y i P
( s i |x i a i b i ) P ( s i | x i , a i , b i ) .

n n∑i=1 y i P
( s i |x i a i b i ) P ( s i | x i,a i,b i)。

(16)

(16)

We have
reproduced the experiments described in Section 4.6 with the counterfactual
esti-mate (16) instead of (4).For each example ωi , we determine which range [m
max i ,m min i ] of mainline reserve multipliers could have produced the observed
ad slate si , and then compute the reweighting ratio using the formula:

我们用反事实估计(16)代替(4)再现了第4.6节中描述的实验。对于每个示例ωi,我们确定主线储备乘数的哪个范围[m最大值I,m最小值i ]可以产生观察到的ad slate si,然后使用以下公式计算再加权比率:

w i =

w i =

P ( s i | x i
, a i , b i ) P ( s i | x i , a i , b i )

P ( s i | x i,a i,b i ) P ( s
i | x i,a i,b i)

=

=

Ψ ( m max i
;ρ , σ ) − Ψ ( m min i ;ρ , σ ) Ψ ( mmax i ;ρ , σ ) − Ψ ( mmin i ;ρ , σ )

ψ(m最大值I;ρ,σ)ψ(m min I;ρ,σ)ψ(mmax I;ρ,σ)ψ(mmin I;ρ , σ )

where
Ψ(m;ρ,σ) is the cumulative of the log-normal multiplier distribution.Figure 14
shows coun-terfactual estimates obtained using the same data as Figure 13.The
obvious improvement of the

其中ψ(m;ρ,σ)是对数正态乘数分布的累积。图14显示了使用与图13相同的数据获得的交互事实估计。的明显改善
image.png

Figure 16: A
distribution on the scores q induce a distribution on the possible ad slates s.
If the observed slate is slate2, the reweighting ratio is 34/22.

图16:分数q的分布导致可能的广告候选名单的分布。如果观察到的候选名单是候选名单2,则重新加权比率是34/22。

inner
confidence intervals significantly extends the range of mainline reserve
multipliers for which we can compute accurate counterfactual expectations using
this same data.

内部置信区间显著扩大了主线储备乘数的范围,我们可以使用相同的数据计算出准确的反事实预期。

Comparing (4)
and (16) makes the difference very clear: instead of computing the ratio of the
probabilities of the observed scores under the counterfactual and actual
distributions, we compute the ratio of the probabilities of the observed ad
slates under the counterfactual and actual distribu-tions.As illustrated by
Figure 15, we now distinguish the reweighting variable (or variables) from the
intervention.In general, the corresponding manipulation of the Markov
factorization consists of marginalizing out all the variables that appear on
the causal paths connecting the point of interven-tion to the reweighting
variables and factoring all the independent terms out of the integral.This
simplification works whenever the reweighting variables intercept all the
causal paths connecting the point of intervention to the measurement variable.In
order to compute the new reweighting ratios, all the factors remaining inside
the integral, that is, all the factors appearing on the causal paths connecting
the point of intervention to the reweighting variables, have to be known.

比较(4)和(16)使区别非常明显:我们不是计算在反事实和实际分布下观察到的得分的概率比,而是计算在反事实和实际分布下观察到的广告的概率比。如图15所示,我们现在将重新加权变量(或多个变量)与干预区分开来。一般来说,马尔可夫因子分解的相应操作包括将出现在连接介入点和重加权变量的因果路径上的所有变量边缘化,并将所有独立项从积分中分解出来。每当重新加权的变量截取连接干预点和测量变量的所有因果路径时,这种简化就起作用了。为了计算新的重新加权比率,必须知道积分中剩余的所有因子,即在连接干预点和重新加权变量的因果路径上出现的所有因子。

Figure 14
does not report the average revenue per page because the revenue z also depends
on the scores q through the click prices c. This causal path is not intercepted
by the ad slate variable s alone.However, we can introduce a new variable ˜c =
f(c, y)that filters out the click prices computed

图14没有报告每页的平均收入,因为收入z也依赖于点击率c的得分q。这个因果路径不是由单独的ad slate变量s截取的。但是,我们可以引入一个新的变量\\\\\\\\\\\\\\\\\

3233

3233

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

for ads that
did not receive a click.Markedly improved revenue estimates are then obtained
by reweighting according to the joint variable (s, c˜).

对于没有点击的广告。然后,通过根据联合变量(s,c ;)重新加权,可以获得显著改善的收入估计。

Figure 16
illustrates the same approach applied to the simultaneous randomization of all
the scores q using independent log-normal multipliers.The weight w(ωi) is the
ratio of the probabilities of the observed ad slate si under the counterfactual
and actual multiplier distributions.Computing these probabilities amounts to
integrating a multivariate Gaussian distribution (Genz, 1992).Details will be
provided in a forthcoming publication.

图16显示了使用独立对数正态乘数同时随机化所有分数q的相同方法。权重w(ωi)是在反事实和实际乘数分布下观察到的ad slate si的概率比。计算这些概率相当于对多变量高斯分布进行积分(Genz,1992)。细节将在即将出版的出版物中提供。

5.2 Variance
Reduction with Predictors

5.2使用预测因子减少方差

Although we
do not know exactly how the variable of interest ℓ(ω) depends on the measurable
vari-ables and are affected by interventions on the causal graph, we may have
strong a priori knowledge about this dependency.For instance, if we augment the
slate s with an ad that usually receives a lot of clicks, we can expect an
increase of the number of clicks.

虽然我们不知道感兴趣的变量(ℓ(ω)是如何依赖于可测量的变量并受到因果图上的干预的影响,但我们可能对这种依赖有很强的先验知识。例如,如果我们用一个通常会收到大量点击量的广告来增加候选名单,我们可以预期点击量会增加。

Let the
invariant variables υ be all observed variables that are not direct or indirect
effects of variables affected by the intervention under consideration.This
definition implies that the distri-bution of the invariant variables is not
affected by the intervention.Therefore the values υi of the invariant variables
sampled during the actual experiment are also representative of the
distribution of the invariant variables under the counterfactual conditions.

设不变变量υ是所有观察到的变量,它们不是受所考虑的干预影响的变量的直接或间接影响。这个定义意味着不变变量的分布不受干预的影响。因此,在实际实验期间采样的不变变量的值υi也代表了在反事实条件下不变变量的分布。

We can
leverage a priori knowledge to construct a predictor ζ(ω) of the quantity ℓ(ω)
whose counterfactual expectation Y is sought.We assume that the predictor ζ(ω)
depends only on the invariant variables or on variables that depend on the
invariant variables through known functional dependencies.Given sampled values
υi of the invariant variables, we can replay both the original and manipulated
structural equation model as explained in Section 4.1 and obtain samples ζi and
ζ i that respectively follow the actual and counterfactual distributions

我们可以利用先验知识来构造量ℓ(ω的预测器ζ(ω),其反事实期望y被寻求。我们假设预测因子ζ(ω)仅依赖于不变变量或者依赖于通过已知函数依赖关系的不变变量的变量。给定不变变量的采样值υi,我们可以如第4.1节所述重放原始和操纵的结构方程模型,并获得分别遵循实际和反事实分布的样本ζi和ζ i

Then,
regardless of the quality of the predictor,

然后,不管预测器的质量如何,

Y = Zω ℓ ( ω
) P ( ω) =

Y = Zω ℓ ( ω
) P ( ω) =

Z ω ζ(ω)P (ω)

  • Z ω (ℓ(ω)−ζ(ω))P (ω) ≈

Z ω ζ(ω)P (ω)

  • Z ω (ℓ(ω)−ζ(ω))P (ω) ≈

1

n

n

n

n

i=1

i=1

ζ i + n n ∑
i=1 (ℓ(ωi)−ζi)w(ωi).

ζ i + n n ∑
i=1 (ℓ(ωi)−ζi)w(ωi).

(17)

(17)

The first
term in this sum represents the counterfactual expectation of the predictor and
can be accurately estimated by averaging the simulated counterfactual samples ζ
i without resorting to po-tentially large importance weights.The second term in
this sum represents the counterfactual ex-pectation of the residuals ℓ(ω)−ζ(ω)
and must be estimated using importance sampling.Since the magnitude of the
residuals is hopefully smaller than that of ℓ(ω), the variance of
(ℓ(ω)−ζ(ω))w(ω) is reduced and the importance sampling estimator of the second
term has improved confidence in-tervals.The more accurate the predictor ζ(ω),
the more effective this variance reduction strategy.

这个和中的第一项表示预测器的反事实期望,并且可以通过对模拟的反事实样本ζ i求平均而不求助于潜在的大重要性权重来精确地估计。这个和中的第二项代表残差ℓ(ω)−ζ(ω的反事实预测,必须使用重要性抽样来估计。由于残差的幅度希望小于ℓ(ω的幅度,所以(ℓ(ω)−ζ(ω))w(ω)的方差减小了,并且第二项的重要性抽样估计量提高了置信区间。预测因子ζ(ω)越精确,这种方差减小策略就越有效。

This variance
reduction technique is in fact identical to the doubly robust contextual bandit
eval-uation technique of Dudík et al. (2012).Doubly robust variance reduction
has also been extensively used for causal inference applied to biostatistics
(see Robins et al., 2000;Bang and Robins, 2005).We subjectively find that
viewing the predictor as a component of the causal graph (Figure 17) clar-ifies
how a well designed predictor can leverage prior knowledge.For instance, in
order to estimate the counterfactual performance of the ad placement system, we
can easily use a predictor that runs the ad auction and simulate the user
clicks using a click probability model trained offline.

这种方差减少技术实际上与Dudík等人(2012)的双重鲁棒上下文土匪评估技术相同。双重稳健方差缩减也被广泛用于应用于生物统计学的因果推断(参见Robins等人,2000;Bang and Robins,2005)。我们主观地发现,将预测器视为因果图的一个组成部分(图17)说明了一个设计良好的预测器如何利用先验知识。例如,为了估计广告投放系统的反事实性能,我们可以很容易地使用运行广告拍卖的预测器,并使用离线训练的点击概率模型来模拟用户点击。

3234

3234

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 17:
Leveraging a predictor.Yellow nodes represent known functional relations in the
struc-

图17:利用预测器。黄色节点代表结构中已知的功能关系

tural
equation model.We can estimate the counterfactual expectation Y of the number
of clicks per page as the sum of the counterfactual expectations of a predictor
ζ, which is easy to estimate by replaying empirical data, and y−ζ, which has to
be estimated by importance sampling but has reduced variance.

自然方程模型。我们可以将每页点击次数的反事实期望Y估计为预测者ζ和Yζ的反事实期望之和,预测者ζ很容易通过重放经验数据来估计,Yζ必须通过重要性抽样来估计,但方差减小。

Figure 18:
The two plots show the hourly click yield for two variants of the ad placement
engine.The daily variations dwarf the differences between the two treatments.

图18:两个图显示了广告投放引擎的两个变体的每小时点击量。每天的变化使两种治疗之间的差异相形见绌。

3235

3235

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

5.3 Invariant
Predictors

5.3不变预测因子

In order to
evaluate which of two interventions is most likely to improve the system, the
designer of a learning system often seeks to estimate a counterfactual
difference, that is, the difference Y + −Y of the expectations of a same
quantity ℓ(ω) under two different counterfactual distributions P+(ω) and
P(ω).These expectations are often affected by variables whose value is left
unchanged by the interventions under consideration.For instance, seasonal
effects can have very large effects on the number of ad clicks (Figure 18) but
affect Y + and Y in similar ways.

为了评估两种干预中的哪一种最有可能改善系统,学习系统的设计者经常寻求估计一个反事实差异,即在两个不同的反事实分布P+(ω)和P(ω)下,相同数量ℓ(ω的期望的差异y+y。这些期望经常受到变量的影响,这些变量的值在所考虑的干预中保持不变。例如,季节效应会对广告点击量产生非常大的影响(图18),但会以类似的方式影响Y +和Y。

Substantially
better confidence intervals on the difference Y + −Y can be obtained using an
invariant predictor, that is, a predictor function that depends only on
invariant variables υ such as the time of the day.Since the invariant predictor
ζ(υ) is not affected by the interventions under consideration,

使用不变预测器,即仅依赖于不变变量υ(例如一天中的时间)的预测器函数,可以获得关于差Y+Y的实质上更好的置信区间。由于不变预测因子ζ(υ)不受所考虑的干预的影响,

Z ω ζ(υ)P (ω)
= Z ω ζ(υ)P +(ω).

Z ω ζ(υ)P (ω)
= Z ω ζ(υ)P +(ω)。

(18)

(18)

Therefore

因此

Y + −Y

Y+Y

= Zω ζ ( υ )
P + ( ω) + Zω ( ℓ ( ω ) − ζ ( υ)) P + ( ω )

=
zωζ(υ)p+(ω)+zω(ℓ(ω)ζ(υ))p+(ω)

− Z ω ζ(υ)P
(ω)− Z ω (ℓ(ω)−ζ(υ))P (ω) ≈

zωζ(υ)p(ω)zω(ℓ(ω)−ζ(υ))p(ω)≈

1

n

n

n

n

i=1

i=1

ℓ(ωi)−ζ(υi) P
+(ωi)−P (ωi) P(ωi) .

ℓ(ωi)−ζ(υi)p+(ωI)p(ωI)p(ωI)。

This direct
estimate of the counterfactual difference Y + −Y benefits from the same
variance reduc-tion effect as (17) without need to estimate the expectations
(18).Appendix C provide details on the computation of confidence intervals for
estimators of the counterfactual differences.Appendix D shows how the same
approach can be used to compute counterfactual derivatives that describe the
response of the system to very small interventions.

这种对反事实差异Y+Y的直接估计受益于与(17)相同的方差减小效应,而不需要估计期望值(18)。附录C提供了反事实差异估计量的置信区间计算的细节。附录D显示了如何使用相同的方法来计算描述系统对非常小的干预的反应的反事实导数。

6.Learning

6.学问

The previous
sections deal with the identification and the measurement of interpretable
signals that can justify the actions of human decision makers.These same signals
can also justify the actions of machine learning algorithms.This section
explains why optimizing a counterfactual estimate is a sound learning
procedure.

前面几节讨论了可解释信号的识别和测量,这些信号可以证明人类决策者的行为是正确的。这些相同的信号也可以证明机器学习算法的作用。本节解释了为什么优化反事实估计是一个合理的学习过程。

6.1 A
Learning Principle

6.1学习原则

We consider
in this section interventions that depend on a parameter θ.For instance, we
might want to know what the performance of the ad placement engine would have
been if we had used different values for the parameter θ of the click scoring
model.Let Pθ (ω) denote the counterfactual Markov factorization associated with
this intervention.Let Y θ be the counterfactual expectation of ℓ(ω) under
distribution Pθ .Figure 19 illustrates our simple learning setup.Training data
is collected from a single experiment associated with an initial parameter
value θ chosen using prior knowledge acquired in an unspecified manner.A
preferred parameter value θ is then determined using the training data and
loaded into the system.The goal is of course to observe a good performance on
data collected during a test period that takes place after the switching point.

在本节中,我们考虑依赖于参数θ的干预。例如,我们可能想知道如果我们对点击评分模型的参数θ使用不同的值,广告投放引擎的性能会如何。让Pθ (ω)表示与该干预相关的反事实马尔可夫因子分解。设Y θ是分布Pθ下ℓ(ω的反事实期望。图19展示了我们简单的学习设置。训练数据是从与初始参数值θ相关联的单个实验中收集的,该初始参数值θ是使用以未指定方式获得的先验知识选择的。然后使用训练数据确定优选的参数值θ,并将其加载到系统中。目标当然是观察在切换点之后的测试期间收集的数据的良好性能。

3236

3236

Figure 19:
Single design - A preferred parameter value θ is determined using randomized
data collected in the past.Test data is collected after loading θ into the
system.

图19:单一设计-使用过去收集的随机数据确定首选参数值θ。测试数据在θ载入系统后收集。

The isolation
assumption introduced in Section 3.2 states that the exogenous variables are
drawn from an unknown but fixed joint probability distribution.This
distribution induces a joint distribu-tion P(ω) on all the variables ω
appearing in the structural equation model associated with the parameter
θ.Therefore, if the isolation assumption remains valid during the test period,
the test data follows the same distribution Pθ (ω) that would have been
observed during the training data collection period if the system had been
using parameter θ all along.

第3.2节中引入的隔离假设指出,外生变量来自未知但固定的联合概率分布。这种分布在与参数θ相关的结构方程模型中出现的所有变量ω上引起联合分布P(ω)。因此,如果隔离假设在测试期间保持有效,测试数据将遵循相同的分布Pθ (ω),如果系统一直使用参数θ,则在训练数据收集期间将会观察到该分布。

We can
therefore formulate this problem as the optimization of the expectation Y θ of
the reward ℓ(ω) with respect to the distribution Pθ (ω)

因此,我们可以将这个问题公式化为报酬ℓ(ω的期望Y θ相对于分布Pθ (ω)的最优化

maxθ Y θ = Zω
ℓ ( ω ) P θ ( ω )

maxθ Y θ = Zω
ℓ ( ω ) P θ ( ω)

on the basis
of a finite set of training examples ω1,...,ωn sampled from P(ω).

基于有限组训练示例ω1,...,ωn从P(ω)采样。

However, it
would be unwise to maximize the estimates obtained using approximation (5)
be-cause they could reach a maximum for a value of θ that is poorly explored by
the actual distribution.As explained in Section 4.5, the gap between the upper
and lower bound of inequality (14) reveals the uncertainty associated with
insufficient exploration.Maximizing an empirical estimate Ybθ of the lower
bound Y θ ensures that the optimization algorithm finds a trustworthy answer

然而,最大化使用近似值(5)获得的估计值是不明智的,因为它们可能达到θ值的最大值,而实际分布对此了解甚少。如第4.5节所述,不等式(14)的上限和下限之间的差距揭示了与探索不足相关的不确定性。最大化下界Y θ的经验估计Ybθ确保了优化算法找到可信的答案

θ = argmax θ
Yb θ .

θ = argmax θ
Yb θ。

We shall now
discuss the statistical basis of this learning principle.

我们现在将讨论这个学习原则的统计基础。

6.2 Uniform
Confidence Intervals

6.2统一置信区间

As discussed
in Section 4.4, inequality (14),

如第4.4节不等式(14)所述,

Y θ ≤ Y θ ≤ Y
θ + M ( 1 − W θ ) ,

Yθ≤Yθ≤Yθ+M(1wθ),

where

在哪里

Y θ = Zω ℓ (
ω ) w ( ω ) P ( ω ) ≈ Yb θ = n n∑i=1 ℓ ( ω i ) w ( ω i ) W θ = Zω w ( ω ) P ( ω
) ≈ Wbθ = n n∑i=1 w ( ω i )

yθ=
zωℓ(ω)w(ω)p(ω)≈Ybθ= n n∑I = 1ℓ(ωI)w(ωI)wθ= zωw(ω)p(ω)≈WBθ= n∑I = 1w(ωI)

(19)

(19)

9.The idea of
maximizing the lower bound may surprise readers familiar with the UCB algorithm
for multi-armed bandits (Auer et al., 2002).UCB performs exploration by
maximizing the upper confidence interval bound and updating the confidence
intervals online.Exploration in our setup results from the active system
randomization during the offline data collection.See also Section 6.4.

9.下限最大化的想法可能会让熟悉多武装匪徒的UCB算法的读者感到惊讶(奥尔等人,2002)。UCB通过最大化置信区间上限和在线更新置信区间来执行探索。我们设置中的探索源于离线数据收集期间的主动系统随机化。另请参见第6.4节。

3237

3237

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

leads to
confidence intervals (15) of the form

导致形式的置信区间(15)

∀ δ > , ∀
θ P n Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( − Wbθ + ξ R) + ε R o ≥ − δ .

∀δ>∀θp n
Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr o≥δ。

(20)

(20)

Both εR and
ξR converge to zero in inverse proportion to the square root of the sample size
n. They also increase at most linearly in logδ and depend on both the capping
bound R and the parameter θ through the empirical variances (see appendix B.)

εR和ξR都以与样本量n的平方根成反比的方式收敛到零。它们也最多以对数δ线性增加,并且通过经验方差同时依赖于上限R和参数θ(见附录b)

Such
confidence intervals are insufficient to provide guarantees for a parameter
value θ that depends on the sample.In fact, the optimization (19) procedure is
likely to select values of θ for which the inequality is violated.We therefore
seek uniform confidence intervals (Vapnik and Chervonenkis, 1968), simultaneously
valid for all values of θ.

这种置信区间不足以保证依赖于样本的参数值θ。事实上,优化(19)过程可能会选择违反不等式的θ值。因此,我们寻求对θ的所有值同时有效的统一置信区间(Vapnik和Chervonenkis,1968)。

• When the
parameter θ is chosen from a finite set F , applying the union bound to the
ordinary

当参数θ从有限集合F中选择时,将并界应用于普通

intervals
(20) immediately gives the uniform confidence interval :

区间(20)立即给出统一的置信区间:

P n ∀ θ ∈ F ,
Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( −Wbθ + ξ R)+ ε R o ≥ −| F |δ .

P n ∀ θ ∈ F,Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr
o≥| f |δ。

• Following
the pioneering work of Vapnik and Chervonenkis, a broad choice of mathematical

遵循瓦普尼克和切尔沃嫩基斯的开创性工作,广泛选择数学

tools have
been developed to construct uniform confidence intervals when the set F is
infinite.

当集合F为无穷大时,已经开发了构造一致置信区间的工具。

For instance,
appendix E leverages uniform empirical Bernstein bounds (Maurer and Pontil,
2009) and obtains the uniform confidence interval

例如,附录E利用统一的经验伯恩斯坦界限(Maurer和Pontil,2009)并获得统一的置信区间

P n ∀ θ ∈ F ,
Yb θ − ε R ≤ Y θ ≤ Yb θ + M ( −Wbθ + ξ R)+ ε R o ≥ − M ( n ) δ ,

P n ∀ θ ∈ F,Ybθεr≤yθ≤Ybθ+m(WBθ+ξr)+εr
o≥m(n)δ,

where the
growth function M (n) measures the capacity of the family of functions

其中增长函数M (n)测量函数族的容量

{ fθ : ω 7→
ℓ(ω)w (ω) , gθ : ω 7→ w (ω) , ∀θ ∈ F } .

{ fθ : ω 7→
ℓ(ω)w (ω),gθ : ω 7→ w (ω),∀θ ∈ F }。

(21)

(21)

Many
practical choices of P(ω) lead to functions M (n) that grow polynomially with
the sample size.Because both εR and ξR are O(n −/logδ), they converge to zero
with the sample size when one maintains the confidence level 1−M (n)δ equal to
a predefined constant.

P(ω)的许多实际选择导致函数M
(n)随样本量多项式增长。因为εR和ξR都是0(n/logδ),当一个人保持置信水平1M(n)δ等于预定常数时,它们随着样本量收敛到零。

The
interpretation of the inner and outer confidence intervals (Section 4.5) also
applies to the uniform confidence interval (21).When the sample size is
sufficiently large and the capping bound R chosen appropriately, the inner
confidence interval reflects the upper and lower bound of inequal-ity (14).

内部和外部置信区间的解释(第4.5节)也适用于统一置信区间(21)。当样本量足够大并且适当选择上限时,内部置信区间反映了不等性(14)的上限和下限。

The uniform
confidence interval therefore ensures that Y θ is close to the maximum of the
lower bound of inequality (14) which essentially represents the best
performance that can be guaranteed using training data sampled from
P(ω).Meanwhile, the upper bound of this same inequality reveals which values of
θ could potentially offer better performance but have been insufficiently
probed by the sampling distribution (Figure 20.)

因此,均匀置信区间确保了Y θ接近不等式(14)的下限的最大值,这基本上代表了使用从P(ω)采样的训练数据可以保证的最佳性能。与此同时,这个不等式的上限揭示了哪些θ值可能提供更好的性能,但采样分布没有充分探究(图20。)

6.3 Tuning Ad
Placement Auctions

6.3调整广告投放拍卖

We now
present an application of this learning principle to the optimization of
auction tuning pa-rameters in the ad placement engine.Despite increasingly
challenging engineering difficulties,

我们现在将这一学习原理应用于广告投放引擎中拍卖调整参数的优化。尽管工程难度越来越大,

3238

3238

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 20:
The uniform inner confidence interval reveals where the best guaranteed Y θ is
reached and where additional exploration is needed.

图20:统一的内部置信区间揭示了最佳保证Y θ达到的位置和需要额外探索的位置。

Figure 21: Level
curves associated with the average number of mainline ads per page (red curves

图21:与平均每页主线广告数量相关的水平曲线(红色曲线

labelled from
−6% to +10%) and the average estimated advertisement value generated per page
(black curves, labelled with arbitrary units ranging from 164 to 169) that
would have been observed for a certain query cluster if we had changed the
mainline reserves by the multiplicative factor shown on the horizontal axis,
and if we had applied a squash-ing exponent α shown on the vertical axis to the
estimated click probabilities qi,p(x).

标记为从-6%到+10%)和每页生成的平均估计广告值(黑色曲线,标记为范围从164到169的任意单位),如果我们通过横轴上显示的乘法因子改变主线储备,并且如果我们将纵轴上显示的挤压指数α应用于估计的点击概率qi,p(x ),将会观察到特定查询聚类。

comparable
optimization procedures can obviously be applied to larger numbers of tunable param-eters.

类似的优化过程显然可以应用于大量的可调参数。

Lahaie and
McAfee (2011) propose to account for the uncertainty of the click probability
esti-mation by introducing a squashing exponent α to control the impact of the
estimated probabilities on the rank scores.Using the notations introduced in
Section 2.1, and assuming that the estimated probability of a click on ad i
placed at position p after query x has the form qip(x) = γp βi(x) (see

Lahaie和McAfee
(2011)提出通过引入挤压指数α来控制估计概率对排名分数的影响,从而解决点击概率估计的不确定性。使用第2.1节中介绍的符号,并假设在查询x之后点击放置在位置p的广告I的估计概率具有qip(x) = γp βi(x)的形式(参见

3239

3239

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

appendix A),
they redefine the rank-score rip(x) as:

附录A),他们将等级分数rip(x)重新定义为:

rip ( x) = γp
b i β i ( x ) α .

rip ( x) = γp
b i β i ( x ) α。

Using a
squashing exponent α < 1 reduces the contribution of the estimated
probabilities and in-creases the reliance on the bids bi placed by the
advertisers.

使用挤压指数α < 1降低了估计概率的贡献,并增加了对广告商所提出的出价bi的依赖。

Because the
squashing exponent changes the rank-score scale, it is necessary to simultaneously
adjust the reserves in order to display comparable number of ads.In order to
estimate the counter-factual performance of the system under interventions
affecting both the squashing exponent and the mainline reserves, we have
collected data using a random squashing exponent following a normal
distribution, and a mainline reserve multiplier following a log-normal
distribution as described in Section 4.6.Samples describing 12 million search
result pages were collected during four consecu-tive weeks.

因为挤压指数改变了排名-评分等级,所以有必要同时调整储备,以便显示可比数量的广告。为了估计在影响挤压指数和主线储量的干预下系统的反事实性能,我们使用正态分布的随机挤压指数和对数正态分布的主线储量乘数收集数据,如第4.6节所述。描述1200万个搜索结果页面的样本是在连续四周内收集的。

Following
Charles and Chickering (2012), we consider separate squashing coefficients αk
and mainline reserve multipliers ρk per query cluster k ∈ {1..K}, and, in order
to avoid negative user or advertiser reactions, we seek the auction tuning
parameters αk and ρk that maximize an estimate of the advertisement
valuesubject to a global constraint on the average number of ads displayed in
the mainline.Because maximizing the advertisement value instead of the
publisher revenue amounts to maximizing the size of the advertisement pie
instead of the publisher slice of the pie, this criterion is less likely to
simply raise the prices without improving the ads.Meanwhile the constraint
ensures that users are not exposed to excessive numbers of mainline ads.

在Charles和Chickering (2012)之后,我们考虑了每个查询聚类k∑{ 1 }的独立压缩系数αk和主线保留乘数ρk..K},并且,为了避免负面的用户或广告商反应,我们寻求拍卖调整参数αk和ρk,其最大化广告价值的估计,服从主线中显示的广告的平均数量的全局约束。因为最大化广告价值而不是发布者收入等于最大化广告饼的大小而不是饼的发布者份额,所以该标准不太可能简单地提高价格而不改进广告。同时,该约束确保用户不会接触到过多的主线广告。

We then use
the collected data to estimate bounds on the counterfactual expectations of the
advertiser value and the counterfactual expectation of the number of mainline
ads per page.Fig-ure 21 shows the corresponding level curves for a particular
query cluster.We can then run a simple optimization algorithm and determine the
optimal auction tuning parameters for each cluster sub-ject to the global mainline
footprint constraint.Appendix D describes how to estimate off-policy
counterfactual derivatives that greatly help the numerical optimization.

然后,我们使用收集的数据来估计广告商价值的反事实预期和每页主线广告数量的反事实预期的界限。图21显示了特定查询集群的相应级别曲线。然后,我们可以运行一个简单的优化算法,并为每个服从全局主线足迹约束的集群确定最优拍卖调整参数。附录D描述了如何估计非常有助于数值优化的政策外反事实导数。

The obvious
alternative (see Charles and Chickering, 2012) consists of replaying the
auctions with different parameters and simulating the user using a click
probability model.However, it may be unwise to rely on a click probability
model to estimate the best value of a squashing coefficient that is expected to
compensate for the uncertainty of the click prediction model itself.The
coun-terfactual approach described here avoids the problem because it does not
rely on a click prediction model to simulate users.Instead it estimates the
counterfactual performance of the system using the actual behavior of the users
collected under moderate randomization.

显而易见的替代方案(见查尔斯和奇克林,2012)包括用不同的参数重放拍卖,并使用点击概率模型模拟用户。然而,依赖点击概率模型来估计挤压系数的最佳值可能是不明智的,挤压系数被期望补偿点击预测模型本身的不确定性。这里描述的交互事实方法避免了这个问题,因为它不依赖点击预测模型来模拟用户。相反,它使用在适度随机化下收集的用户的实际行为来估计系统的反事实性能。

6.4
Sequential Design

6.4顺序设计

Confidence
intervals computed after a first randomized data collection experiment might
not offer sufficient accuracy to choose a final value of the parameter θ.It is
generally unwise to simply col-lect additional samples using the same
experimental setup because the current data already reveals information (Figure
20) that can be used to design a better data collection experiment.Therefore,
it seems natural to extend the learning principle discussed in Section 6.1 to a
sequence of data col-lection experiments.The parameter θt characterizing the
t-th experiment is then determined using samples collected during the previous
experiments (Figure 22).

第一次随机数据收集实验后计算的置信区间可能无法提供足够的精度来选择参数θ的最终值。使用相同的实验设置简单地选择额外的样本通常是不明智的,因为当前的数据已经揭示了可以用来设计更好的数据收集实验的信息(图20)。因此,将第6.1节中讨论的学习原则扩展到一系列数据收集实验似乎是很自然的。表征第t次实验的参数θt随后使用之前实验中收集的样本来确定(图22)。

10.The value
of an ad click from the point of view of the advertiser.The advertiser payment
then splits the advertisement value between the publisher and the advertiser.

10.从广告客户的角度来看,广告点击的价值。广告商支付然后在发布者和广告商之间分割广告价值。

3240

3240

Figure 22:
Sequential design - The parameter θt of each data collection experiment is determined
using data collected during the previous experiments.

图22:顺序设计-每个数据收集实验的参数θt是使用之前实验中收集的数据确定的。

Although it
is relatively easy to construct convergent sequential design algorithms,
reaching the optimal learning performance is notoriously difficult (Wald, 1945)
because the selection of param-eter θt involves a trade-off between
exploitation, that is, the maximization of the immediate reward Y θt , and
exploration, that is, the collection of samples potentially leading to better Y
θ in the more distant future.

尽管构造收敛的序列设计算法相对容易,但达到最优学习性能是出了名的困难(Wald,1945),因为参数θt的选择涉及开发(即即时回报Y θt的最大化)和探索(即样本的收集,潜在地在更遥远的未来导致更好的Y θ)之间的权衡。

The optimal
exploration exploitation trade-off for multi-armed bandits is well understood
(Git-tins, 1989;Auer et al., 2002;Audibert et al., 2007) because an essential
property of multi-armed bandits makes the analysis much simpler: the outcome
observed after performing a particular action brings no information about the
value of other actions.Such an assumption is both unrealistic and pessimistic.For
instance, the outcome observed after displaying a certain ad in response to a
certain query brings very useful information about the value of displaying
similar ads on similar queries.

多武装匪徒的最优勘探开发权衡是众所周知的(吉特-廷斯,1989;Auer等人,2002年;Audibert等人,2007),因为多武装匪徒的一个基本属性使分析变得简单得多:在执行特定行动后观察到的结果不会带来关于其他行动价值的信息。这样的假设既不现实又悲观。例如,在响应某个查询显示某个广告之后观察到的结果带来了关于在类似查询上显示类似广告的价值的非常有用的信息。

Refined
contextual bandit approaches (Slivkins, 2011) account for similarities in the
context and action spaces but do not take advantage of all the additional
opportunities expressed by structural equation models.For instance, in the
contextual bandit formulation of the ad placement problem outlined in Section
3.5, actions are pairs (s, c) describing the ad slate s and the corresponding
click prices c, policies select actions by combining individual ad scores in
very specific ways, and actions determine the rewards through very specific
mechanisms.

精炼的上下文强盗方法(Slivkins,2011)考虑了上下文和动作空间中的相似性,但没有利用结构方程模型表达的所有额外机会。例如,在第3.5节概述的广告投放问题的上下文土匪公式中,动作是描述广告名单和相应点击价格的对,策略通过以非常具体的方式组合单个广告分数来选择动作,动作通过非常具体的机制来确定奖励。

Meanwhile,
despite their suboptimal asymptotic properties, heuristic exploration
strategies per-form surprisingly well during the time span in which the problem
can be considered stationary.Even in the simple case of multi-armed bandits,
excellent empirical results have been obtained us-ing Thompson sampling
(Chapelle and Li, 2011) or fixed strategies (Vermorel and Mohri, 2005;Kuleshov
and Precup, 2010).Leveraging the problem structure seems more important in
practice than perfecting an otherwise sound exploration strategy.

与此同时,尽管启发式探索策略具有次优的渐近性质,但在问题可以被视为平稳的时间跨度内,它们表现得出奇地好。即使在多武装匪徒的简单情况下,使用汤普森抽样(Chapelle和李,2011)或固定策略(Vermorel和Mohri,2005)也获得了出色的经验结果;库里肖夫和塞切普,2010年)。在实践中,利用问题结构似乎比完善一个合理的探索策略更重要。

Therefore, in
the absence of sufficient theoretical guidance, it is both expedient and
practical to maximizing Ybθ at each round, as described in Section 6.1, subject
to additional ad-hoc constraints ensuring a minimum level of exploration.

因此,在缺乏足够的理论指导的情况下,如第6.1节所述,在每一轮最大化Ybθ既方便又实用,但要受到额外的特殊限制,以确保最低水平的勘探。

7.Equilibrium
Analysis

7.均衡分析

All the
methods discussed in this contribution rely on the isolation assumption
presented in Sec-tion 3.2.This assumption lets us interpret the samples as
repeated independent trials that follow the pattern defined by the structural
equation model and are amenable to statistical analysis.

本文中讨论的所有方法都依赖于第3.2节中提出的隔离假设。这一假设让我们将样本解释为重复的独立试验,这些试验遵循结构方程模型定义的模式,并且易于进行统计分析。

The isolation
assumption is in fact a component of the counterfactual conditions under
inves-tigation.For instance, in Section 4.6, we model single auctions (Figure
3) in order to empirically

隔离假设实际上是研究中反事实条件的一个组成部分。例如,在第4.6节中,我们对单一拍卖进行建模(图3),以便根据经验

3241

3241

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

determine how
the ad placement system would have performed if we had changed the mainline
reserves without incurring a reaction from the users or the advertisers.

确定如果我们在不引起用户或广告商反应的情况下改变主线储备,广告投放系统会有怎样的表现。

Since the
future publisher revenues depend on the continued satisfaction of users and
advertisers, lifting this restriction is highly desirable.

由于未来出版商的收入取决于用户和广告商的持续满意度,取消这一限制是非常可取的。

• We can in
principle work with larger structural equation models.For instance, Figure 4
suggests to thread single auction models with additional causal links
representing the impact of the displayed ads on the future user
goodwill.However, there are practical limits on the number of trials we can
consider at once.For instance, it is relatively easy to simultaneously model
all the auctions associated with the web pages served to the same user during a
thirty minute web session.On the other hand, it is practically impossible to
consider several weeks worth of auctions in order to model their accumulated
effect on the continued satisfaction of users and advertisers.

原则上,我们可以使用更大的结构方程模型。例如,图4建议将单个拍卖模型与表示显示的广告对未来用户好感度的影响的额外因果联系联系起来。然而,我们可以同时考虑的试验数量有实际限制。例如,在30分钟的网络会话期间,相对容易同时模拟与服务于同一用户的网页相关联的所有拍卖。另一方面,实际上不可能考虑几周的拍卖来模拟它们对用户和广告商持续满意度的累积影响。

• We can
sometimes use problem-specific knowledge to construct alternate performance
met-rics that anticipate the future effects of the feedback loops.For instance,
in Section 6.3, we optimize the advertisement value instead of the publisher
revenue.Since this alternative cri-terion takes the advertiser interests into
account, it can be viewed as a heuristic proxy for the future revenues of the
publisher.

我们有时可以利用特定问题的知识来构建替代绩效指标,预测反馈循环的未来影响。例如,在第6.3节中,我们优化了广告价值,而不是出版商的收入。由于这种替代标准考虑到了广告客户的利益,它可以被看作是出版商未来收入的一种启发式代理。

This section
proposes an alternative way to account for such feedback loops using the
quasistatic equilibrium method familiar to physicists: we assume that the
publisher changes the parameter θ so slowly that the system remains at
equilibrium at all times.Using data collected while the system was at equilibrium,
we describe empirical methods to determine how an infinitesimal intervention dθ
on the model parameters would have displaced the equilibrium:

本节提出了一种替代方法,利用物理学家熟悉的准静态平衡方法来解释这种反馈回路:我们假设发布者缓慢地改变参数θ,使得系统始终保持平衡。使用在系统处于平衡状态时收集的数据,我们描述了经验方法,以确定模型参数上的无限小干预dθ将如何取代平衡:

"How
would the system have performed during the data collection period if a small
change dθ had been applied to the model parameter θ and the equilibrium had
been reached before the data collection period."

“如果对模型参数θ进行较小的改变,并且在数据收集期间之前达到平衡,那么系统在数据收集期间的表现会如何。”

A learning
algorithm can then update θ to improve selected performance metrics.

然后,学习算法可以更新θ,以改进选定的性能指标。

7.1 Rational
Advertisers

7.1理性广告商

The ad
placement system is an example of game where each actor furthers his or her
interests by controlling some aspects of the system: the publisher controls the
placement engine parameters, the advertisers control the bids, and the users
control the clicks.

广告投放系统是一个游戏的例子,其中每个参与者通过控制系统的某些方面来促进他或她的兴趣:发布者控制投放引擎参数,广告商控制出价,用户控制点击。

As an example
of the general quasi-static approach, this section focuses on the reaction of
ratio-nal advertisers to small changes of the scoring functions driving the ad
placement system.Rational advertisers always select bids that maximize their
economic interests.Although there are more re-alistic ways to model
advertisers, this exercise is interesting because the auction theory approaches
also rely on the rational advertiser assumption (see Section 2.1).This analysis
seamlessly integrates the auction theory and machine learning perspectives.

作为一般准静态方法的一个例子,这一部分着重于比率广告客户对驱动广告投放系统的评分函数的微小变化的反应。理性的广告商总是选择最大化其经济利益的出价。虽然有更多真实的方式来模拟广告商,这个练习是有趣的,因为拍卖理论方法也依赖于理性的广告商假设(见第2.1节)。这种分析无缝地结合了拍卖理论和机器学习的观点。

As
illustrated in Figure 23, we treat the bid vector b⋆ = (b1 ...bA) ∈ [0,bmax] A
as the parameter of the conditional distribution Pb⋆ (b|x, v) of the bids
associated with the eligible ads.The vari-

如图23所示,我们处理出价向量b⋆ = (b1...ba)∑0,bmax] A作为与合格广告相关的投标的条件分布Pb⋆ (b|x,v)的参数。vari-

11.Quantities
measured when a feedback causal system reaches equilibrium often display
conditional independence patterns that cannot be represented with directed
acyclic graphs (Lauritzen and Richardson, 2002;Dash, 2003).Treating the
feedback loop as parameters instead of variables works around this difficulty
in a manner that appears sufficient to perform the quasi-static analysis.

11.当反馈因果系统达到平衡时测量的量通常显示条件独立模式,不能用有向无环图表示(Lauritzen和Richardson,2002;Dash,2003)。将反馈回路视为参数而非变量,以一种似乎足以执行准静态分析的方式解决了这一难题。

3242

3242

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Figure 23:
Advertisers select the bid amounts ba on the basis of the past number of clicks
ya and the past prices za observed for the corresponding ads.

图23:广告商根据过去的点击次数ya和相应广告的过去价格za选择出价金额ba。

Figure 24:
Advertisers control the expected number of clicks Ya and expected prices Za by
adjusting

图24:广告商通过调整来控制预期点击量Ya和预期价格Za

their bids
ba.Rational advertisers select bids that maximize the difference between the
value they see in the clicks and the price they pay.

他们的出价。理性的广告客户选择出价最大化他们在点击中看到的价值和他们支付的价格之间的差异。

ables ya in
the structural equation model represents the number of clicks received by ads
associated with bid ba.The variables za represents the amount charged for these
clicks to the corresponding advertiser.The advertisers select their bids ba
according to their anticipated impact on the number of resulting clicks ya and
on their cost za.

结构方程模型中的ables ya表示与bid ba关联的广告接收的点击量。变量za表示对相应广告商的这些点击收取的费用。广告商根据他们对最终点击量ya和他们的成本za的预期影响来选择他们的出价ba。

Following the
pattern of the perfect information assumption (see Section 2.1), we assume that
the advertisers eventually acquire full knowledge of the expectations

遵循完美信息假设的模式(见第2.1节),我们假设广告商最终获得了预期的全部知识

Ya ( θ , b⋆)
= Zω y a P θ b⋆ ( ω )

Ya ( θ,b⋆) = Zω y a
P θ b⋆ ( ω)

and Za ( θ ,
b⋆) = Zω z a P θ b⋆ ( ω ) .

和Za ( θ,b⋆) = Zω z a P θ b⋆ ( ω)。

Let Va denote
the value of a click for the corresponding advertiser.Rational advertiser seek
to maximize the difference between the value they see in the clicks and the
price they pay to the publisher, as illustrated in Figure 24.This is expressed
by the utility functions

让Va表示相应广告商的点击值。理性的广告商寻求最大化他们在点击中看到的价值和他们支付给出版商的价格之间的差异,如图24所示。这由效用函数来表示

U θ a (b⋆) =
VaYa(θ,b⋆)−Za(θ,b⋆) .

U θ a (b⋆) =
VaYa(θ,b⋆)−Za(θ,b⋆。

3243

3243

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

Following
Athey and Nekipelov (2010), we argue that the injection of smooth random noise
into the auction mechanism changes the discrete problem into a continuous
problem amenable to standard differential methods.Mild regularity assumption on
the densities probability Pb⋆ (b|x, v) and Pθ (q|x,a) are in fact sufficient to
ensure that the expectations Ya(θ,b⋆) and Za(θ,b⋆) are con-tinuously
differentiable functions of the distribution parameters b⋆ and θ.Further
assuming that utility functions U θ a (b⋆) are diagonally quasiconcave, Athey
and Nekipelov establish the existence of a unique Nash equilibrium

在Athey和Nekipelov (2010)之后,我们认为向拍卖机制中注入平滑随机噪声会将离散问题转化为可服从标准微分方法的连续问题。密度概率Pb⋆ (b|x,v)和Pθ (q|x,a)的温和正则性假设实际上足以保证期望Ya(θ,b⋆)和Za(θ,b⋆)是分布参数b⋆和θ的连续可微函数。进一步假设效用函数U θ a (b⋆)是对角拟凹的,Athey和Nekipelov建立了唯一Nash均衡的存在性

∀a ba ∈
ArgMax

∀a ba ∈
ArgMax

b

b

U θ a
(b1,...,ba−1,b,ba+1,...,bA)

U θ a (b1,...、ba 1、b、ba+1、...,bA)

characterized
by its first order Karush-Kuhn-Tucker conditions

以其一阶卡鲁什-库恩-塔克条件为特征

a Va

Va

∂Ya ∂ba

∂Ya ∂ba

∂Za ∂ba

∂Za ∂ba

   ≤ 0 if
ba = ≥ 0 if ba = bmax= 0 if 0 < ba < bmax.

如果ba = ≥ 0,则   ≤ 0如果ba = bmax= 0,则0 < ba < bmax。

(22)

(22)

We use the
first order equilibrium conditions (22) for two related purposes.Section 7.2
explains how to complete the advertiser model by estimating the values Va.
Section 7.3 estimates how the equilibrium bids and the system performance
metrics respond to a small change dθ of the model parameters.

我们将一阶平衡条件(22)用于两个相关的目的。第7.2节解释了如何通过估计值Va来完成广告商模型。第7.3节估计了均衡出价和系统性能指标如何响应模型参数的微小变化dθ。

Interestingly,
this approach remains sensible when key assumptions of the equilibrium model
are violated.The perfect information assumption is unlikely to hold in
practice.The quasi-concavity of the utility functions is merely
plausible.However, after observing the operation of the stationary ad placement
system for a sufficiently long time, it is reasonable to assume that the most
active advertisers have tried small bid variations and have chosen locally
optimal ones.Less active adver-tisers may leave their bids unchanged for longer
time periods, but can also update them brutally if they experience a
significant change in return on investment.Therefore it makes sense to use data
collected when the system is stationary to estimate advertiser values Va that
are consistent with the first order equilibrium conditions.We then hope to
maintain the conditions that each advertisers had found sufficiently
attractive, by first estimating how a small change dθ displaces this posited
local equilibrium, then by using performance metrics that take this
displacement into account.

有趣的是,当平衡模型的关键假设被违反时,这种方法仍然是明智的。完美的信息假设在实践中不太可能成立。效用函数的拟凹性仅仅是似是而非的。然而,在观察固定广告投放系统的操作足够长的时间之后,可以合理地假设最活跃的广告商已经尝试了小的出价变化,并且已经选择了局部最优的出价变化。不太积极的反对者可能会在更长的时间内保持出价不变,但如果他们经历了投资回报的重大变化,也可能会残酷地更新出价。因此,使用系统静止时收集的数据来估计符合一阶平衡条件的广告商价值Va是有意义的。然后,我们希望通过首先估计一个小的变化dθ如何取代这个假定的局部平衡,然后使用考虑到这个位移的性能指标,来维持每个广告商已经发现足够有吸引力的条件。

7.2
Estimating Advertiser Values

7.2估计广告客户价值

We first need
to estimate the partial derivatives appearing in the equilibrium condition
(22).These derivatives measure how the expectations Ya and Za would have been
changed if each advertiser had placed a slightly different bid ba.Such
quantities can be estimated by randomizing the bids and computing on-policy
counterfactual derivatives as explained in appendix D. Confidence intervals can
be derived with the usual tools.

我们首先需要估计平衡条件(22)中出现的偏导数。这些衍生产品衡量的是,如果每个广告客户提出稍微不同的出价ba,预期Ya和Za会发生怎样的变化。如附录d所述,这些数量可以通过随机投标和计算政策反事实导数来估计。置信区间可以用常用工具推导出来。

Unfortunately,
the publisher is not allowed to directly randomize the bids because the
advertisers expect to pay prices computed using the bid they have specified and
not the potentially higher bids resulting from the randomization.However, the
publisher has full control on the estimated click probabilities qi,p(x).Since
the rank-scores ri,p(x) are the products of the bids and the estimated click
probabilities (see Section 2.1), a random multiplier applied to the bids can
also be interpreted as a random multiplier applied to the estimated click
probabilities.Under these two interpretations, the same ads are shown to the
users, but different click prices are charged to the advertisers.Therefore,

不幸的是,出版商不被允许直接随机化出价,因为广告商期望支付使用他们指定的出价计算的价格,而不是随机化导致的潜在更高的出价。然而,出版商完全控制估计的点击概率qi,p(x)。由于等级分数ri、p(x)是出价和估计点击概率的乘积(见第2.1节),所以应用于出价的随机乘数也可以解释为应用于估计点击概率的随机乘数。在这两种解释下,相同的广告被展示给用户,但是不同的点击价格被收取给广告商。因此,

3244

3244

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

the publisher
can simultaneously charge prices computed as if the multiplier had been applied
to the estimated click probabilities, and collect data as if the multiplier had
been applied to the bid.This data can then be used to estimate the derivatives.

发布者可以同时对计算出的价格收费,就好像乘数已经应用于估计的点击概率,并且收集数据,就好像乘数已经应用于出价。这些数据可以用来估算导数。

Solving the
first order equilibrium equations then yields estimated advertiser values Va
that are consistent with the observed data.

然后,求解一阶平衡方程产生与观察数据一致的估计广告商值Va。

Va ≈

Va ≈

∂Ya

∂Ya

∂ba .∂Za ∂ba

∂ba。∂Za ∂ba

There are
however a couple caveats:

然而,有几个警告:

• The
advertiser bid ba may be too small to cause ads to be displayed.In the absence
of data, we have no means to estimate a click value for these advertisers.

广告商出价ba可能太小,无法显示广告。在没有数据的情况下,我们无法估计这些广告商的点击量。

• Many ads
are not displayed often enough to obtain accurate estimates of the partial
derivatives

许多广告显示频率不足以获得偏导数的准确估计

∂Ya

∂Ya

∂ba and ∂Za
∂ba .This can be partially remediated by smartly aggregating the data of
advertisers deemed similar.

∂ba和∂Za ∂ba。这可以通过智能地聚合被认为相似的广告商的数据来部分补救。

• Some
advertisers attempt to capture all the available ad opportunities by placing
extremely high bids and hoping to pay reasonable prices thanks to the
generalized second price rule.Both partial derivatives ∂Ya ∂ba and ∂Za ∂ba are
equal to zero in such cases.Therefore we cannot recover Va by solving the
equilibrium Equation (22).It is however possible to collect useful data by
selecting for these advertisers a maximum bid bmax that prevents them from
monop-olizing the eligible ad opportunities.Since the equilibrium condition is
an inequality when ba = bmax, we can only determine a lower bound of the values
Va for these advertisers.

一些广告客户试图通过出价极高来抓住所有可用的广告机会,并希望支付合理的价格,这得益于广义的第二价格规则。在这种情况下,偏导数∂Ya ∂ba和∂Za ∂ba都等于零。因此,我们不能通过求解平衡方程(22)来恢复Va。然而,可以通过为这些广告客户选择最大出价bmax来收集有用的数据,以防止他们垄断合格的广告机会。因为当ba = bmax时,均衡条件是一个不等式,所以我们只能确定这些广告商的Va值的下限。

These caveats
in fact underline the limitations of the advertiser modelling assumptions.When
their ads are not displayed often enough, advertisers have no more chance to
acquire a full knowledge of the expectations Ya and Za than the publisher has a
chance to determine their value.Similarly, advertisers that place extremely
high bids are probably underestimating the risk to occasionally experience a
very high click price.A more realistic model of the advertiser information
acquisition is required to adequately handle these cases.

这些警告实际上强调了广告商建模假设的局限性。当他们的广告展示得不够频繁时,广告商就没有机会充分了解Ya和Za的期望,就像出版商没有机会确定它们的价值一样。类似地,出价极高的广告商可能低估了偶尔经历极高点击价格的风险。需要更现实的广告客户信息获取模型来充分处理这些情况。

7.3
Estimating the Equilibrium Response

7.3估计平衡响应

Let A be the
set of the active advertisers, that is, the advertisers whose value can be
estimated

设A为活跃广告主的集合,即可以估计其价值的广告主

(or lower
bounded) with sufficient accuracy.Assuming that the other advertisers leave
their bids unchanged, we can estimate how the active advertisers adjust their
bids in response to an infinitesi-mal change dθ of the scoring model
parameters.This is achieved by differentiating the equilibrium equations (22):

(或下限)足够精确。假设其他广告客户保持他们的出价不变,我们可以估计活跃的广告客户如何响应评分模型参数的无穷小变化dθ来调整他们的出价。这通过微分平衡方程(22)来实现:

∀a ′ ∈ A,

∀a′∈a,

= Va ′ ∂ Ya ′
∂ b a ′ ∂θ − ∂ Za ′ ∂ b a ′ ∂θ d θ + ∑a∈A Va ′ ∂ Ya ′ ∂ b a ′ ∂ b a − ∂ Za ′ ∂
b a ′ ∂ b a d b a .

= va′∂ya′∂b
a′∂θ-∂za′∂b a′∂θdθ+∑a∈a va′∂ya′∂b a′≈b a≈za′≈b a′≈b a d b a。

(23)

(23)

The partial
second derivatives must be estimated as described in appendix D. Solving this
linear system of equations then yields an expression of the form

偏二阶导数必须按照附录d中的描述进行估算。然后求解这个线性方程组,得到以下形式的表达式

dba = Ξa dθ.

DBA =ξa dθ。

12.This
approach is of course related to the value estimation method proposed by Athey
and Nekipelov (2010) but strictly relies on the explicit randomization of the
scores.In contrast, practical considerations force Athey and Nekipelov to rely
on the apparent noise and hope that the noise model accounts for all potential
confounding factors.

12.这种方法当然与Athey和Nekipelov
(2010)提出的价值估算方法有关,但严格依赖于分数的显式随机化。相比之下,实际考虑迫使Athey和Nekipelov依赖于表观噪声,并希望噪声模型能够解释所有潜在的混杂因素。

3245

3245

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

This
expression can then be used to estimate how any counterfactual expectation Y of
interest changes when the publisher applies an infinitesimal change dθ to the
scoring parameter θ and the

然后,该表达式可用于估计当发布者对评分参数θ应用无穷小的变化dθ时,任何感兴趣的反事实期望Y如何变化

active
advertisers A rationally adjust their bids ba in response:

积极的广告客户A理性地调整他们的出价ba作为回应:

d Y = ∂ Y ∂θ

  • ∑a Ξ a ∂ Y ∂ b a d θ .

d y
=∂y∂θ+∑aνa∂y∂b a dθ。

(24)

(24)

Although this
expression provides useful information, one should remain aware of its
limita-tions.Because we only can estimate the reaction of active advertisers,
expression (24) does not includes the potentially positive reactions of
advertisers who did not bid but could have.Because we only can estimate a lower
bound of their values, this expression does not model the potential re-actions
of advertisers placing unrealistically high bids.Furthermore, one needs to be
very cautious when the system (23) approaches singularities.Singularities
indicate that the rational advertiser assumption is no longer sufficient to
determine the reactions of certain advertisers.This happens for instance when
advertisers cannot find bids that deliver a satisfactory return.The eventual
behavior of such advertisers then depends on factors not taken in consideration
by our model.

虽然这个表达提供了有用的信息,但人们应该意识到它的局限性。因为我们只能估计活跃广告商的反应,所以表达式(24)不包括没有出价但可能出价的广告商的潜在积极反应。因为我们只能估计他们价值的下限,这个表达式不能模拟广告商提出不切实际的高价的潜在再行动。此外,当系统(23)接近奇点时,需要非常谨慎。奇异性表明理性的广告商假设不再足以决定某些广告商的反应。例如,当广告商找不到能带来满意回报的出价时,就会出现这种情况。这类广告商的最终行为取决于我们的模型没有考虑的因素。

To alleviate
these issues, we could alter the auction mechanism in a manner that forces
advertis-ers to reveal more information, and we could enforce policies ensuring
that the system (23) remains safely nonsingular.We could also design
experiments revealing the impact of the fixed costs in-curred by advertisers
participating into new auctions.Although additional work is needed to design
such refinements, the quasistatic equilibrium approach provides a generic
framework to take such aspects into account.

为了缓解这些问题,我们可以改变拍卖机制,迫使广告者透露更多的信息,我们还可以强制执行确保系统(23)安全地保持非奇异的策略。我们还可以设计实验,揭示广告商参与新拍卖所产生的固定成本的影响。虽然需要额外的工作来设计这种改进,但准静态平衡方法提供了一个通用的框架来考虑这些方面。

7.4
Discussion

7.4讨论

The rational
advertiser assumption is the cornerstone of seminal works describing simplified
vari-ants of the ad placement problem using auction theory (Varian,
2007;Edelman et al., 2007).More sophisticated works account for more aspects of
the ad placement problem, such as the impact of click prediction learning
algorithms (Lahaie and McAfee, 2011), the repeated nature of the ad auc-tions
(Bergemann and Said, 2010), or for the fact that advertisers place bids valid
for multiple auc-tions (Athey and Nekipelov, 2010).Despite these advances, it
seems technically very challenging to use these methods and account for all the
effects that can be observed in practical ad placement systems.

理性广告客户假设是使用拍卖理论描述广告投放问题的简化变量的开创性工作的基石(瓦里安,2007;Edelman等人,2007年)。更复杂的作品解释了广告投放问题的更多方面,例如点击预测学习算法的影响(Lahaie和McAfee,2011),广告投放的重复性质(Bergemann和Said,2010),或者广告商对多次投放进行有效投标的事实(Athey和Nekipelov,2010)。尽管取得了这些进步,但使用这些方法并解释在实际广告投放系统中可以观察到的所有影响在技术上似乎非常具有挑战性。

We believe
that our counterfactual reasoning framework is best viewed as a modular toolkit
that lets us apply insights from auction theory and machine learning to
problems that are far more complex than those studied in any single paper.For
instance, the quasi-static equilibrium analysis technique illustrated in this
section extends naturally to the analysis of multiple simultaneous causal
feedback loops involving additional players:

我们认为,我们的反事实推理框架最好被视为一个模块化工具包,让我们能够将拍卖理论和机器学习的见解应用于比任何一篇论文中研究的问题都复杂得多的问题。例如,本节中说明的准静态平衡分析技术自然延伸到涉及其他参与者的多个同时因果反馈循环的分析:

• The first
step consists in designing ad-hoc experiments to identify the parameters that
deter-mine the equilibrium equation of each player.In the case of the
advertisers, we have shown how to use randomized scores to reveal the
advertiser values.In the case of the user feedback, we must carefully design
experiments that reveal how users respond to changes in the quality of the
displayed ads.

第一步包括设计特别实验,以确定决定每个参与者平衡方程的参数。就广告商而言,我们已经展示了如何使用随机分数来揭示广告商的价值。在用户反馈的情况下,我们必须仔细设计实验,揭示用户如何对显示广告质量的变化做出反应。


Differentiating all the equilibrium equations yields a linear system of
equations linking the variations of the parameter under our control, such as
dθ, and all the parameters under the

对所有平衡方程进行微分,得到一个线性方程组,该方程组将我们控制下的参数(如dθ)的变化与下的所有参数联系起来

3246

3246

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

control of
the other players, such as the advertiser bids, or the user willingness to
visit the site and click on ads.Solving this system and writing the total
derivative of the performance measure gives the answer to our question.

对其他玩家的控制,例如广告商出价,或者用户访问网站并点击广告的意愿。求解这个系统并写出性能度量的总导数就给出了我们问题的答案。

Although this
programme has not yet been fully realized, the existence of a principled
frame-work to handle such complex interactions is remarkable.Furthermore,
thanks to the flexibility of the causal inference frameworks, these techniques
can be infinitely adapted to various modeling assumptions and various system
complexities.

尽管这一方案尚未完全实现,但处理这种复杂互动的原则性框架的存在是值得注意的。此外,由于因果推理框架的灵活性,这些技术可以无限地适应各种建模假设和各种系统复杂性。

8.Conclusion

8.结论

Using the ad
placement example, this work demonstrates the central role of causal inference
(Pearl, 2000;Spirtes et al., 1993) for the design of learning systems
interacting with their environment.Thanks to importance sampling techniques,
data collected during randomized experiments gives precious cues to assist the
designer of such learning systems and useful signals to drive learning
algorithms.

使用广告放置的例子,这项工作证明了因果推理的中心作用(珀尔,2000;Spirtes等人,1993)的学习系统的设计与他们的环境互动。由于重要性抽样技术,在随机实验中收集的数据提供了宝贵的线索来帮助这种学习系统的设计者,并提供了有用的信号来驱动学习算法。

Two recurrent
themes structure this work.First, we maintain a sharp distinction between the
learning algorithms and the extraction of the signals that drive them.Since
real world learning systems often involve a mixture of human decision and
automated processes, it makes sense to separate the discussion of the learning
signals from the discussion of the learning algorithms that leverage
them.Second, we claim that the mathematical and philosophical tools developed
for the analysis of physical systems appear very effective for the analysis of
causal information systems and of their equilibria.These two themes are in fact
a vindication of cybernetics (Wiener, 1948).

这部作品有两个反复出现的主题。首先,我们在学习算法和驱动它们的信号提取之间保持鲜明的区别。由于现实世界的学习系统通常涉及人类决策和自动化过程的混合,因此将学习信号的讨论与利用它们的学习算法的讨论分开是有意义的。其次,我们声称,为分析物理系统而开发的数学和哲学工具对于分析因果信息系统及其平衡似乎非常有效。这两个主题实际上是控制论的辩护(维纳,1948)。

Acknowledgments

感谢

We would like
to acknowledge extensive discussions with Susan Athey, Miroslav Dudík, Patrick
Jordan, John Langford, Lihong Li, Sebastien Lahaie, Shie Mannor, Chris Meek,
Alex Slivkins, and Paul Viola.We also thank the Microsoft adCenter RnR team for
giving us the invaluable opportunity to deploy these ideas at scale and prove
their worth.Finally we gratefully acknowledge the precious comments of our JMLR
editor and reviewers.

我们要感谢与苏珊·艾希、米罗斯拉夫·杜迪克、帕特里克·乔丹、约翰·兰福德、李立宏、塞巴斯蒂安·拉海耶、阏氏·曼诺尔、克里斯·米克、亚历克斯·斯利夫金斯和保罗·维奥拉进行的广泛讨论。我们也感谢微软adCenter RnR团队给了我们宝贵的机会,让我们大规模部署这些想法并证明它们的价值。最后,我们感谢JMLR编辑和审稿人的宝贵意见。

Appendix A.
Greedy Ad Placement Algorithms

附录一.贪婪广告投放算法

Section 2.1
describes how to select and place ads on a web page by maximizing the total
rank-score (1).Following (Varian, 2007;Edelman et al., 2007) , we assume that
the click probability estimates are expressed as the product of a positive
position term γp and a positive ad term βi(x).The rank-scores can therefore be
written as ri,p(x) = γpbiβi(x).We also assume that the policy constraints
simply state that a web page should not display more than one ad belonging to
any given advertiser.The discrete maximization problem is then amenable to
computationally efficient greedy algorithms.

第2.1节描述了如何通过最大化总排名分数(1)来选择和放置网页上的广告。以下(瓦里安,2007;Edelman等人,2007),我们假设点击概率估计被表示为正位置项γp和正ad项βi(x)的乘积。因此,等级分数可以写成ri,p(x) = γpbiβi(x)。我们还假设策略约束只是声明一个网页不应该显示一个以上属于任何给定广告商的广告。然后,离散最大化问题服从于计算高效的贪婪算法。

Let us fix a
layout L and focus on the inner maximization problem.Without loss of
generality, we can renumber the positions such that

让我们固定一个布局L,专注于内部最大化问题。不失一般性,我们可以对这些位置重新编号

L =
{1,2,...N} and γ1 ≥ γ2 ≥ ≥ 0.

L = {1,2,...N}和γ1≥γ2≥0。

3247

3247

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

and write the
inner maximization problem as

并将内部最大化问题写成

max

最大

i1,...,iN

i1,...,iN

RL(i1,...,iN)
= ∑

RL(i1,...,iN)=∑2

p∈L

p∈L

rip,p(x)

rip,p(x)

subject to
the policy constraints and reserve constraints ri,p(x) ≥ Rp(x).

受政策约束和准备金约束ri,p(x) ≥ Rp(x)。

Let Si denote
the advertiser owning ad i. The set of ads is then partitioned into subsets Is
= {i :

让Si表示拥有广告I的广告商。然后广告集合被分成子集Is = {i:

Si = s}
gathering the ads belonging to the same advertiser s. The ads that maximize the
product

收集属于同一广告客户的广告。使产品最大化的广告

biβi(x)
within set Is are called the best ads for advertiser s. If the solution of the
discrete maximiza-

集合Is内的双βi(x)被称为广告商的最佳广告

tion problem
contains one ad belonging to advertiser s, then it is easy to see that this ad
must be one of the best ads for advertiser s: were it not the case, replacing
the offending ad by one of the best

问题包含一个属于广告客户s的广告,那么很容易看出这个广告一定是广告客户s最好的广告之一:如果不是这样,用最好的广告替换违规的广告

ads would
yield a higher RL without violating any of the constraints.It is also easy to
see that one could select any of the best ads for advertiser s without changing
RL.

广告将在不违反任何约束的情况下产生更高的RL。很容易看出,人们可以在不改变RL的情况下为广告商选择任何最好的广告。

Let the set I
contain exactly one ad per advertiser, arbitrarily chosen among the best ads
for

让我的广告集每个广告客户只包含一个广告,从最好的广告中任意选择

this
advertiser.The inner maximization problem can then be simplified as:

这个广告商。内部最大化问题可以简化为:

max

最大

i 1,..., iN ∈
I

i 1,...,iN ∈ I

RL ( i ,...,
i N) = ∑p∈L γp b i p β i p ( x )

RL ( i,...,i N) = ∑p∈L
γp b i p β i p ( x)

where all the
indices i1,...,iN are distinct, and subject to the reserve constraints.

其中所有的指数i1,...,iN是不同的,并且受保留约束。

Assume that
this maximization problem has a solution i1,...,iN, meaning that there is a
feasible

假设这个最大化问题有一个解i1,...,iN,意思是有可行的

ad placement
solution for the layout L. For k = ...N, let us define I k ⊂ I as

布局l的广告投放解决方案。对于k =...让我们把⊂定义为

biβi(x).

biβi(x)。

I k = ArgMax

I k = ArgMax

i∈I
{i1,...,ik−1}

我∈我{i1,...,ik 1 }

It is easy to
see that I k intersects {ik,...,iN} because, were it not the case, replacing ik
by any

很容易看出I k与{ik,...,iN}因为,如果不是这样,用任何

element of I
k would increase RL without violating any of the constraints.Furthermore it is
easy to

元素将在不违反任何约束的情况下增加RL。此外,很容易

see that ik ∈
I k because, were it not the case, there would be h > k such that ih ∈ I k ,
and swapping

看到ik ∈ I k,因为,如果不是这样,会有h > k,这样ih ∈ I k,和交换

ik and ih
would increase RL without violating any of the constraints.

ik和ih将在不违反任何约束的情况下增加RL。

Therefore, if
the inner maximization problem admits a solution, we can compute a solution by
recursively picking i1,...,iN from I 1 ,I 2 ,...,I N . This can be done
efficiently by first sorting the biβi(x) in decreasing order, and then greedily
assigning ads to the best positions subject to the reserve constraints.This
operation has to be repeated for all possible layouts, including of course the
empty layout.

因此,如果内部最大化问题允许一个解,我们可以通过递归地选择i1来计算一个解,...,iN来自I 1,I 2,...这可以通过首先以递减的顺序对biβi(x)进行排序,然后贪婪地将广告分配给服从保留约束的最佳位置来有效地完成。必须对所有可能的布局重复该操作,当然包括空布局。

The same
analysis can be carried out for click prediction estimates expressed as
arbitrary mono-tone combination of a position term γp(x) and an ad term βi(x),
as shown, for instance, by Graepel et al. (2010).

可以对表示为位置项γp(x)和广告项βi(x)的任意单音组合的点击预测估计进行相同的分析,例如,如Graepel等人(2010)所示。

Appendix B.
Confidence Intervals

附录b .置信区间

Section 4.4
explains how to obtain improved confidence intervals by replacing the unbiased
impor-tance sampling estimator (7) by the clipped importance sampling estimator
(10).This appendix provides details that could have obscured the main message.

第4.4节解释了如何通过用截尾重要性抽样估计量(10)代替无偏重要性抽样估计量(7)来获得改进的置信区间。本附录提供了可能掩盖主要信息的细节。

3248

3248

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

B.1 Outer
Confidence Interval

B.1外部置信区间

We first
address the computation of the outer confidence interval (12) which describes
how the estimator Ybapproaches the clipped expectation Y .

我们首先讨论外置信区间(12)的计算,它描述了估计量如何逼近截尾期望。

Y = Zω ℓ ( ω
) w ( ω ) P ( ω ) ≈ Yb = n n∑i=1 ℓ ( ω i ) w ( ω i ) .

y =
zωℓ(ω)w(ω)p(ω)≈Yb = n n∑I = 1ℓ(ωI)w(ωI)。

Since the
samples ℓ(ωi)w (ωi) are independent and identically distributed, the central
limit theorem (e.g., Cramér, 1946, Section 17.4) states that the empirical
average Ybconverges in law to a normal distribution of mean Y = E[ℓ(ω)w (ω)]
and variance V = var[ℓ(ω)w (ω)].Since this convergence usually occurs quickly,
it is widely accepted to write

由于样本ℓ(ωi)w (ωi)是独立且同分布的,中心极限定理(如Cramér,1946,第17.4节)指出经验平均Yb收敛于均值Y = E[ℓ(ω)w (ω)]和方差V = var[ℓ(ω)w (ω)]的正态分布。由于这种趋同通常发生得很快,所以写作被广泛接受

P n Yb − ε R
≤ Y ≤ Yb + ε R o ≥ − δ ,

p n
YbεR≤Y≤Yb+εR o≥δ,

with

随着

ε R = erf− (
1 − δ ) p 2 V .

ε R =电流变流体(1δ)p 2V。

and to
estimate the variance V using the sample variance Vb

并且使用样本方差Vb来估计方差V

V ≈ Vb =

V ≈ Vb =

n − 1

n1

n∑i=1 ℓ ( ω i
) w ( ω i ) − Yb .

n∑I =
1ℓ(ωI)w(ωI)Yb。

(25)

(25)

This approach
works well when the ratio ceiling R is relatively small.However the presence of
a few very large ratios makes the variance estimation noisy and might slow down
the central limit convergence.

当比率上限相对较小时,这种方法效果很好。然而,一些非常大的比率的存在使得方差估计有噪声,并且可能减慢中心极限收敛。

The first
remedy is to bound the variance more rigorously.For instance, the following
bound results from (Maurer and Pontil, 2009, Theorem 10).

第一个补救措施是更严格地约束方差。例如,下面的界限结果来自(莫勒和庞泰尔,2009,定理10)。

P ( p V >
p Vb + ( M − m ) R r 2log ( / δ ) n − 1 ) ≤ δ

p(p V > p
Vb+(M M)R R 2 log(/δ)n 1)≤δ

Combining
this bound with (25) gives a confidence interval valid with probability greater
than 1−2δ.Although this approach eliminates the potential problems related to
the variance estimation, it does not address the potentially slow convergence
of the central limit theorem.

将该界限与(25)相结合,给出概率大于12δ的有效置信区间。尽管这种方法消除了与方差估计相关的潜在问题,但它没有解决中心极限定理的潜在缓慢收敛问题。

The next
remedy is to rely on empirical Bernstein bounds to derive rigorous confidence
intervals that leverage both the sample mean and the sample variance (Audibert
et al., 2007;Maurer and Pontil, 2009).

下一个补救措施是依靠经验伯恩斯坦界限来推导严格的置信区间,该置信区间利用样本均值和样本方差(奥迪伯特等人,2007;Maurer和Pontil,2009)。

Theorem 1
(Empirical Bernstein bound) (Maurer and Pontil, 2009, thm 4) Let X,X1,X2,...,Xn
be i.i.d. random variable with values in [a,b] and let δ > 0.Then, with
proba-bility at least 1−δ,

定理1(经验Bernstein界)(Maurer和Pontil,2009,thm 4)让X,X1,X2,...,Xn为i.i.d .随机变量,取值在[a,b]中,设δ > 0。那么,概率至少为1δ,

E [X ] − Mn ≤
r Vn log ( / δ ) n + ( b − a ) 7log ( / δ ) 3 ( n − 1 ) ,

e[X]Mn≤r Vn
log(/δ)n+(B- a)7 log(/δ)3(n-1),

where Mn and
Vn respectively are the sample mean and variance

其中Mn和Vn分别是样本均值和方差

Mn =

Mn =

n n∑i=1 Xi

n n∑i=1 Xi

Vn =

Vn =

1

n − 1

n1

n∑i=1 ( Xi −
Mn ) .

n∑i=1 ( Xi锰)。

3249

3249

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

Applying this
theorem to both ℓ(ωi)w (ωi) and −ℓ(ωi)w (ωi) provides confidence intervals that
hold for for the worst possible distribution of the variables ℓ(ω) and w(ω).

将该定理应用于ℓ(ωi)w (ωi)和−ℓ(ωi)w (ωi)提供了置信区间,该区间适用于变量ℓ(ω(ω)和w(ω)的最差可能分布。

where

在哪里

P n Yb − ε R
≤ Y ≤ Yb + ε R o ≥ − δ

p n
YbεR≤Y≤Yb+εR o≥δ

ε R = s Vb
log ( / δ ) n + M R 7log ( / δ ) 3 ( n − 1 ) .

εR = s Vb
log(/δ)n+M R 7 log(/δ)3(n-1)。

(26)

(26)

Because they
hold for the worst possible distribution, confidence intervals obtained in this
way are less tight than confidence intervals based on the central limit
theorem.On the other hand, thanks to the Bernstein bound, they remains
reasonably competitive, and they provide a much stronger guarantee.

因为它们适用于最差的可能分布,所以以这种方式获得的置信区间没有基于中心极限定理的置信区间紧。另一方面,由于受到伯恩斯坦约束,它们保持了合理的竞争力,并提供了更强有力的保证。

B.2 Inner
Confidence Interval

B.2内部置信区间

Inner
confidence intervals are derived from inequality (14) which bounds the
difference between the counterfactual expectation Y and the clipped expectation
Y :

内部置信区间由不等式(14)导出,不等式(14)限制了反事实期望Y和限幅期望Y之间的差:

0 ≤ Y −Y ≤ M
(1−W ).

0≤Y-Y≤M(1w)。

The constant
M is defined by assumption (8).The first step of the derivation consists in
obtaining a lower bound of W −Wbusing either the central limit theorem or an
empirical Bernstein bound.

常数M由假设(8)定义。推导的第一步是利用中心极限定理或经验伯恩斯坦界获得W-WB的下界。

For instance,
applying theorem 1 to −w (ωi) yields

例如,将定理1应用于w (ωi)会产生

P

P

 W ≥ Wb− s
Vbw log ( / δ ) n − R 7log ( / δ ) 3 ( n − 1 )  ≥ − δ

w≥WB s vbw对数(/δ)n r7对数(/δ)3(n1)≥δ

where Vbw is
the sample variance of the clipped weights

其中Vbw是剪裁权重的样本方差

Vbw =

Vbw =

n − 1

n1

n∑i=1 w ( ω i
) − Wb .

n∑I =
1w(ωI)Wb。

Replacing in
inequality (14) gives the outer confidence interval

不等式(14)中的替换给出了外置信区间

P n Y ≤ Y ≤ Y

  • M ( − Wb+ ξ R ) o ≥ − δ .

p n
Y≤Y≤Y+M(Wb+ξR)o≥δ。

with

随着

ξ R = s Vbw
log ( / δ ) n + R 7log ( / δ ) 3 ( n − 1 ) .

ξR = s Vbw
log(/δ)n+R 7 log(/δ)3(n-1)。

(27)

(27)

Note that 1
−Wb+ ξR can occasionally be negative.This occurs in the unlucky cases where the
confidence interval is violated, with probability smaller than δ.

请注意,1 Wb+ξR有时可能为负。这发生在违反置信区间的不幸情况下,概率小于δ。

Putting
together the inner and outer confidence intervals,

将内部和外部置信区间放在一起,

P n Yb − ε R
≤ Y ≤ Yb + M ( − Wb+ ξ R) + ε R o ≥ − δ ,

p n
YbεR≤Y≤Yb+M(Wb+ξR)+εR o≥δ,

with εR and
ξR computed as described in expressions (26) and (27).

按照表达式(26)和(27)计算εR和ξR。

3250

3250

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Appendix C.
Counterfactual Differences

附录三.反事实差异

We now seek
to estimate the difference Y + −Y of the expectations of a same quantity ℓ(ω)
under two different counterfactual distributions P+(ω) and P(ω).These
expectations are often affected by variables whose value is left unchanged by
the interventions under consideration.For instance, seasonal effects can have
very large effects on the number of ad clicks.When these variables affect both
Y + and Y in similar ways, we can obtain substantially better confidence
intervals for the difference Y + −Y .

我们现在寻求在两个不同的反事实分布P+(ω)和P(ω)下,估计相同量ℓ(ω期望的差y+y。这些期望经常受到变量的影响,这些变量的值在所考虑的干预中保持不变。例如,季节效应会对广告点击量产生非常大的影响。当这些变量以类似的方式影响Y +和Y时,我们可以获得Y+Y差的更好的置信区间。

In addition
to the notation ω representing all the variables in the structural equation
model, we use notation υ to represent all the variables that are not direct or
indirect effects of variables affected by the interventions under
consideration.

除了表示结构方程模型中所有变量的符号ω之外,我们还使用符号υ来表示不是受所考虑干预影响的变量的直接或间接影响的所有变量。

Let ζ(υ) be a
known function believed to be a good predictor of the quantity ℓ(ω) whose
coun-terfactual expectation is sought.Since P(υ) = P(υ), the following equality
holds regardless of the quality of this prediction:

设ζ(υ)是一个已知的函数,它被认为是ℓ(ω量的一个很好的预测器,它的实际期望是被寻求的。因为P(υ) = P(υ),所以不管这个预测的质量如何,下列等式都成立:

Decomposing
both Y + and Y in this way and computing the difference,

以这种方式分解Y +和Y并计算差值,

Y + − Y = Zω
[ ℓ ( ω ) − ζ ( υ)] ∆ w ( ω ) P ( ω ) ≈ n n∑i=1

y+y =
zω[ℓ(ω)ζ(υ)]w(ω)p(ω)≈n∑I = 1

  • ℓ ( ω i ) −
    ζ ( υ i )

-ℓ(ωI)-ζ(υI)

∆ w ( ω i )

w ( ω i)

∆ w ( ω) = P

  • ( ω ) P ( ω ) − P ( ω ) P ( ω ) = P + ( ω ) − P ( ω ) P ( ω ) .

⇼w(ω)=
P+(ω)P(ω)P(ω)P(ω)= P+(ω)P(ω)。

with

随着

(28)

(28)

The outer
confidence interval size is reduced if the variance of the residual ℓ(ω)−ζ(υ)is
smaller than the variance of the original variable ℓ(ω).For instance, a
suitable predictor function ζ(υ) can significantly capture the seasonal click
yield variations regardless of the interventions under consideration.Even a
constant predictor function can considerably change the variance of the outer
confidence interval.Therefore, in the absence of better predictor, we still can
( and always should ) center the integrand using a constant predictor.

如果残差ℓ(ω)−ζ(υ)is的方差小于原始变量ℓ(ω).的方差,则减小外部置信区间的大小例如,一个合适的预测函数ζ(υ)可以显著地捕捉季节性点击量的变化,而不考虑正在考虑的干预措施。即使是一个恒定的预测函数也能显著改变外部置信区间的方差。因此,在没有更好的预测器的情况下,我们仍然可以(并且总是应该)使用常数预测器对被积函数进行中心化。

The rest of
this appendix describes how to construct confidence intervals for the
estimation of counterfactual differences.Additional bookkeeping is required
because both the weights ∆w(ωi) and the integrand ℓ(ω)−ζ(υ) can be positive or
negative.We use the notation υ to represent the variables of the structural
equation model that are left unchanged by the intervention under
consid-erations.Such variables satisfy the relations P(υ) = P(υ) and P(ω) = P
(ω\υ|υ)P(υ), where we use notation ω\υ to denote all remaining variables in the
structural equation model.An invariant predictor is then a function ζ(υ) that
is believed to be a good predictor of ℓ(ω).In particular, it is expected that
var[ℓ(ω)−ζ(υ)] is smaller than var[ℓ(ω)].

本附录的其余部分描述了如何构建置信区间来估计反事实差异。需要额外记账,因为权重w(ωi)和被积函数ℓ(ω)−ζ(υ都可以是正的或负的。我们使用符号υ来表示结构方程模型的变量,这些变量在考虑的情况下由于干预而保持不变。这样的变量满足关系P(υ) = P(υ)和P(ω) = P (ω\υ|υ)P(υ),这里我们用符号ω\υ来表示结构方程模型中所有剩余的变量。然后,不变预测器是被认为是ℓ(ω).的良好预测器的函数ζ(υ)特别是,预计var[ℓ(ω)−ζ(υ)比var[ℓ(ω)].小

C.1 Inner
Confidence Interval with Dependent Bounds

C.1具有相关界限的内部置信区间

We first
describe how to construct finer inner confidence intervals by using more
refined bounds on ℓ(ω).In particular, instead of the simple bound (8), we can
use bounds that depend on invariant variables:

我们首先描述如何通过在ℓ(ω).上使用更精细的边界来构造更精细的内部置信区间特别是,我们可以使用依赖于不变变量的界限来代替简单界限(8):

∀ω

∀ω

m ≤ m ( υ ) ≤
ℓ ( ω ) ≤ M ( υ ) ≤ M .

m ≤ m ( υ ) ≤
ℓ ( ω ) ≤ M ( υ ) ≤ M。

3251

3251

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

The key
observation is the equality

关键的观察是平等

E [ w ( ω
)|υ] = Zω \ υ w ( ω ) P ( ω \ υ|υ) = Zω \ υ P ( ω \ υ|υ ) P ( υ ) P ( ω \ υ|υ )
P ( υ ) P ( ω \ υ|υ) = .

e[w(ω)|υ]=
zω\υw(ω)P(ω\υ|υ)= zω\υP(ω\υ|υ)P(ω\υ|υ)P(υ)P(ω\υ|υ)=。

We can then
write

然后我们可以写

= Z ω

= Z ω

  • w (ω)−w (ω)

-w(ω)-w(ω)

ℓ(ω) P(ω) ≤ Z
υ E[w (ω)−w (ω)|υ] M(υ) P(υ)

ℓ(ω)p(ω)≤zυe[w(ω)w(ω)|υ]m(υ)p(υ)

= Z υ (−E[w
(ω)|υ] ) M(υ) P(υ) = Z ω (−w (ω) ) M(υ) P(ω) = Bhi .

=
Zυ(E[w(ω)|υ])M(υ)P(υ)= Zω(w(ω))M(υ)P(ω)= Bhi。

Y −Y

是的

Using a
similar derivation for the lower bound Blo, we obtain the inequality

利用下界Blo的类似推导,我们得到了不等式

With the
notations

用符号

Blo ≤ Y −Y ≤
Bhi

blo≤Y-Y≤Bhi

Bb lo = n
n∑i=1 ( − w ( ω i)) m ( υ i )

bb lo = n n∑I
= 1(w(ωI))m(υI)

Bb hi = n
n∑i=1 ( − w ( ω i)) M ( υ i )

bb hi = n n∑I
= 1(w(ωI))M(υI)

Vblo =

Vblo =

n − 1

n1

n∑i=1 h ( − w
( ω i)) m ( υ i ) − Bb lo i Vbhi = n − 1 n∑i=1 h ( − w ( ω i)) M ( υ i ) − Bb
hi i

n∑I =
1h(w(ωI))M(υI)Bb lo I Vhibi = n 1n∑I = 1h(w(ωI))M(υI)Bb hi I

ξlo = s Vblo
log ( / δ ) n +|m|R 7log ( / δ ) 3 ( n − 1 ) ,

ξlo = s Vblo
log(/δ)n+| m | R 7 log(/δ)3(n 1),

ξhi = s Vbhi
log ( / δ ) n +|M|R 7log ( / δ ) 3 ( n − 1 ) ,

ξhi = s Vhibi
log(/δ)n+| M | R 7 log(/δ)3(n-1),

two
applications of theorem 1 give the inner confidence interval:

定理1的两个应用给出了内部置信区间:

P n Y + Bb lo
− ξlo ≤ Y ≤ Y + Bb hi + ξhi o ≥ − δ .

p n Y+Bb
loξlo≤Y≤Y+Bb hi+ξhi o≥δ。

C.2
Confidence Intervals for Counterfactual Differences

C2反事实差异的置信区间

We now
describe how to leverage invariant predictors in order to construct tighter
confidence inter-vals for the difference of two counterfactual expectations.

我们现在描述如何利用不变预测因子来为两个反事实预期的差异构建更紧密的置信区间。

Y + − Y ≈ n
n∑i=1

Y+Y≈n n∑I = 1

  • ℓ ( ω i ) −
    ζ ( υ i )

-ℓ(ωI)-ζ(υI)

∆ w ( ω i )

w ( ω i)

with ∆w(ω) =
P +(ω)−P (ω) P(ω) .

其中∏w(ω)= P+(ω)P(ω)P(ω)。

Let us define
the reweigthing ratios w +(ω) = P +(ω)/P(ω) and w (ω) = P (ω)/P(ω), their
clipped variants w +(ω) and w (ω), and the clipped centered expectations

让我们定义重加权比w +(ω) = P +(ω)/P(ω)和w (ω) = P (ω)/P(ω),它们的削波变量w +(ω)和w (ω),以及削波的中心期望值

Y +c = Zω [ ℓ
( ω ) − ζ ( υ)] w + ( ω ) P ( ω )

y+c =
zω[ℓ(ω)-ζ(υ)]w+(ω)p(ω)

and Y c = Zω
[ ℓ ( ω ) − ζ ( υ)] w ( ω ) P ( ω ) .

y c =
zω[ℓ(ω)-ζ(υ)]w(ω)p(ω)。

The outer
confidence interval is obtained by applying the techniques of Section B.1 to

外部置信区间是通过应用B.1节的技术获得的

Y +c − Y c =
Zω [ ℓ ( ω ) − ζ ( υ) ][ w + ( ω ) − w ( ω) ] P ( ω ) .

y+c y c =
zω[ℓ(ω)ζ(υ)][w+(ω)w(ω)]p(ω)。

Since the
weights w + −w can be positive or negative, adding or removing a constant to
ℓ(ω) can considerably change the variance of the outer confidence interval.This
means that one should

因为权重w+w可以是正的或负的,所以在ℓ(ω中添加或移除常数)可以显著改变外部置信区间的方差。这意味着一个人应该

3252

3252

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

always use a
predictor.Even a constant predictor can vastly improve the outer confidence
interval difference.

总是使用预测值。即使是一个常数预测器也可以极大地改善外部置信区间的差异。

The inner
confidence interval is then obtained by writing the difference

然后通过写入差值获得内部置信区间

Y +−Y − Y + c
−Y c

Y+Y-Y+c Y-c

= Zω

= Zω

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )
  • w + ( ω ) −
    w + ( ω )

-w+(ω)-w+(ω)

P ( ω ) − Zω

p(ω)-Zω

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )
  • w ( ω ) − w
    ( ω )

-w(ω)-w(ω)

P ( ω )

P ( ω)

and bounding
both terms by leveraging υ-dependent bounds on the integrand:

并且通过利用被积函数的υ相关边界来限制这两个项:

− M ≤ −ζ(υ) ≤
ℓ(ω)−ζ(υ) ≤ M −ζ(υ) ≤ M .

m≤ζ(υ)≤ℓ(ω)−ζ(υ)≤m≤ζ(υ)≤m。

∀ω

∀ω

This can be
achieved as shown in Section C.1.

这可以如第1节所示实现

Appendix D.
Counterfactual Derivatives

附录d .反事实衍生品

We now
consider interventions that depend on a continuous parameter θ.For instance, we
might want to know what the performance of the ad placement engine would have
been if we had used a parametrized scoring model.Let Pθ (ω)represent the
counterfactual Markov factorization associated with this intervention.Let Y θ be
the counterfactual expectation of ℓ(ω) under distribution Pθ .

我们现在考虑依赖于连续参数θ的干预。例如,我们可能想知道如果我们使用参数化评分模型,广告投放引擎的性能会如何。让Pθ (ω)代表与此干预相关的反事实马尔可夫因子分解。设Y θ是分布Pθ下ℓ(ω的反事实期望。

Computing the
derivative of (28) immediately gives

计算(28)的导数立即给出

∂ Y θ

∂ Y θ

∂θ

∂θ

= Zw

= Zw

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )

w ′θ ( ω ) P
( ω ) ≈ n n∑i=1

w′θ(ω)P(ω)≈n∑I
= 1

  • ℓ ( ω i ) −
    ζ ( υ i )

-ℓ(ωI)-ζ(υI)

w ′θ ( ω i )

w′θ(ωI)

with wθ(ω) =
P θ (ω) P(ω)

其中wθ(ω) = P θ (ω) P(ω)

and w ′θ ( ω)
= ∂ w θ ( ω ) ∂θ = w θ ( ω ) ∂logP θ ( ω ) ∂θ .

和w′θ(ω)=∂wθ(ω)∂θ= wθ(ω)∂logpθ(ω)∂θ。

(29)

(29)

Replacing the
expressions P(ω) and Pθ (ω) by the corresponding Markov factorizations gives
many opportunities to simplify the reweighting ratio w ′ θ (ω).The term wθ(ω)
simplifies as shown in (6).The derivative of log Pθ (ω) depends only on the
factors parametrized by θ.Therefore, in order to evaluate w ′ θ (ω), we only
need to know the few factors affected by the intervention.

用相应的马尔可夫因子分解代替表达式P(ω)和Pθ (ω)给出了许多简化重加权比w′θ(ω)的机会。术语wθ(ω)简化如(6)所示。对数Pθ (ω)的导数仅取决于θ参数化的因子。因此,为了评估w′θ(ω),我们只需要知道受干预影响的少数因素。

Higher order
derivatives can be estimated using the same approach.For instance,

高阶导数可以用同样的方法估算。例如,

∂ Y θ

∂ Y θ

∂θ i ∂θ j

∂θ·∂θ

= Zw

= Zw

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )

w ′′ i j ( ω
) P ( ω ) ≈ n n∑i=1

w′

  • ℓ ( ω i ) −
    ζ ( υ i )

-ℓ(ωI)-ζ(υI)

w ′′ i j ( ω
i )

w′’I j(ωI)

with w ′′ i
j(ω) = ∂ wθ(ω) ∂θi ∂θj = wθ(ω) ∂log Pθ (ω) ∂θi ∂logPθ (ω) ∂θj +wθ(ω) ∂ logPθ
(ω) ∂θi ∂θj

w′

.

The second
term in w ′′ i j(ω) vanishes when θi and θj parametrize distinct factors in Pθ
(ω).

当θi和θj参数化Pθ (ω)中的不同因子时,w′I j(ω)中的第二项消失。

D.1
Infinitesimal Interventions and Policy Gradient

D1极小干预和政策梯度

Expression
(29) becomes particularly attractive when P(ω) = P θ (ω), that is, when one
seeks deriva-tives that describe the effect of an infinitesimal intervention on
the system from which the data was collected.The resulting expression is then
identical to the celebrated policy gradient (Alek-sandrov et al., 1968;Glynn,
1987;Williams, 1992) which expresses how the accumulated rewards

当P(ω) = P θ (ω)时,表达式(29)变得特别有吸引力,也就是说,当人们寻求描述无穷小的干预对收集数据的系统的影响的导数时。结果表达式与著名的政策梯度相同(Alek-sandrov等人,1968年;格林,1987年;威廉姆斯,1992)表达了累积的回报

3253

3253

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

in a
reinforcement learning problem are affected by small changes of the parameters
of the policy function.

在强化学习问题中会受到策略函数参数的微小变化的影响。

∂ Y θ

∂ Y θ

∂θ

∂θ

= Zω

= Zω

  • ℓ ( ω ) − ζ
    ( υ )
  • ℓ ( ω ) − ζ
    ( υ )

w ′θ ( ω ) P
θ ( ω ) ≈ n n∑i=1

w′θ(ω)Pθ(ω)≈n∑I
= 1

  • ℓ ( ω i ) −
    ζ ( υ i )

-ℓ(ωI)-ζ(υI)

w ′θ ( ω i )

w′θ(ωI)

where ωi are
sampled i.i.d. from Pθ and w ′ θ (ω) = ∂logPθ (ω) ∂θ

其中,ωi是从Pθ和w’θ(ω)=∂logpθ(ω)∂θ获得的i.i.d采样值

.

Sampling from
Pθ (ω) eliminates the potentially large ratio wθ(ω) that usually plagues
impor-tance sampling approaches.Choosing a parametrized distribution that
depends smoothly on θ is then sufficient to contain the size of the weights w ′
θ (ω).Since the weights can be positive or nega-tive, centering the integrand
with a prediction function ζ(υ) remains very important.Even a constant
predictor ζ can substantially reduce the variance

从Pθ (ω)采样消除了通常困扰重要采样方法的潜在大比值wθ(ω)。选择平滑依赖于θ的参数化分布就足以包含权重w’θ(ω)的大小。由于权重可以是正的或负的,用预测函数ζ(υ)对被积函数进行定心仍然非常重要。即使是常数预测因子ζ也能显著降低方差

var[
(ℓ(ω)−ζ)w ′ θ (ω)] = var[ ℓ(ω)w ′ θ (ω)−ζw ′ θ (ω)]

var[(ℓ(ω)−ζ)w’θ(ω)]=
var[ℓ(ω)w’θ(ω)ζw’θ(ω)]

= var[ℓ(ω)w ′
θ (ω)]−2ζcov[ ℓ(ω)w ′ θ (ω), w ′ θ (ω)] +ζ var[w ′ θ (ω)]

=
var[ℓ(ω)w’θ(ω)]2ζcov[ℓ(ω)w’θ(ω),w’θ(ω)]+ζvar[w’θ(ω)]

whose minimum
is reached for ζ = cov[ℓ(ω)w ′ θ (ω)w ′ θ (ω)]

对于ζ= cov[ℓ(ω)w’θ(ω)w’θ(ω)达到最小值

var[w ′ θ
(ω)] =

var[w’θ(ω)]=

E[ℓ(ω)w ′ θ
(ω) ]

e[ℓ(ω)w
'θ(ω)]

E[w ′ θ (ω) ]
.

e[w′θ(ω)]。

We sometimes
want to evaluate expectations under a counterfactual distribution that is too
far from the actual distribution to obtain reasonable confidence
intervals.Suppose, for instance, that we are unable to reliably estimate which
click yield would have been observed if we had used a certain parameter θ for
the scoring models.We still can estimate how quickly and in which direction the
click yield would have changed if we had slightly moved the current scoring
model parameters θ in the direction of the target θ .Although such an answer is
not as good as a reliable estimate of Y θ , it is certainly better than no
answer.

我们有时希望在与实际分布相差太远的反事实分布下评估期望,以获得合理的置信区间。例如,假设我们无法可靠地估计如果我们对评分模型使用某个参数θ会观察到哪种点击量。如果我们在目标θ的方向上稍微移动当前评分模型参数θ,我们仍然可以估计点击率会以多快的速度和朝哪个方向变化。这样的答案虽然不如对Y θ的可靠估计,但肯定比没有答案要好。

D.2
Off-Policy Gradient

D.2政策外梯度

We assume in
this subsection that the parametrized probability distribution Pθ (ω) is
regular enough to ensure that all the derivatives of interest are defined and
that the event {wθ(ω)=R} has probability zero.Furthermore, in order to simplify
the exposition, the following derivation does not leverage an invariant
predictor function.

在本小节中,我们假设参数化概率分布Pθ (ω)足够规则,以确保定义了所有感兴趣的导数,并且事件{wθ(ω)=R}具有概率零。此外,为了简化说明,下面的推导没有利用不变预测函数。

Estimating
derivatives using data sampled from a distribution P(ω) different from Pθ (ω)
is more challenging because the ratios wθ(ωi) in Equation (29) can take very
large values.However it is comparatively easy to estimate the derivatives of
lower and upper bounds using a slightly different way to clip the weights.Using
notation 1l(x) represent the indicator function, equal to one if condition x is
true and zero otherwise, let us define respectively the clipped weights w Z θ
and the capped weights w M θ :

使用从不同于Pθ (ω)的分布P(ω)采样的数据来估计导数更具挑战性,因为等式(29)中的比率wθ(ωi)可能取非常大的值。然而,使用稍微不同的方法来裁剪权重,估计下限和上限的导数相对容易。使用符号1l(x)表示指标函数,如果条件x为真,则等于1,否则为零,让我们分别定义削波权重w Z θ和封顶权重w M θ:

w Zθ ( ω) = w
θ ( ω )1l { P ( ω ) < R P ( ω ) } and w Mθ ( ω) = min { w θ ( ω ) , R } .

w Zθ ( ω) = w
θ ( ω )1l { P ( ω ) < R P ( ω ) }和w Mθ ( ω) = min { w θ ( ω),R }。

Although
Section 4.4 illustrates the use of clipped weights, the confidence interval
derivation can be easily extended to the capped weights.Defining the capped
quantities

尽管第4.4节说明了削波权重的使用,但置信区间推导可以很容易地扩展到封顶权重。定义上限数量

Y θ = Zω ℓ (
ω ) w Mθ ( ω ) P ( ω )

Y θ = Zω ℓ (
ω ) w Mθ ( ω ) P ( ω)

and W θ = Zω
w Mθ ( ω ) P ( ω )

而W θ = Zω w Mθ ( ω ) P ( ω)

3254

3254

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

and writing

和写作

= Zω ∈ Ω \ Ω
R ℓ ( ω) ( P ( ω ) − R P ( ω) )

=
zω∈ω\ωrℓ(ω)(p(ω)r p(ω))

0 ≤ Y θ −Y θ

0≤Yθ-Yθ

≤ M − P ( Ω R
) − R P ( Ω \ Ω R ) = M − Zω w Mθ ( ω ) P ( ω )

≤M P(ωR)R
P(ω\ωR)= M zωw Mθ(ω)P(ω)

yields the
inequality

产生了不平等

Y θ ≤ Y θ ≤ Y
θ + M ( 1 − W θ ) .

Yθ≤Yθ≤Yθ+M(1wθ)。

(30)

(30)

In order to
obtain reliable estimates of the derivatives of these upper and lower bounds,
it is of course sufficient to obtain reliable estimates of the derivatives of Y
θ and W θ .By separately considering the cases wθ(ω) < R and wθ(ω) > R,
we easily obtain the relation

为了获得这些上限和下限的导数的可靠估计,获得Y θ和W θ的导数的可靠估计当然是足够的。通过分别考虑wθ(ω) < R和wθ(ω) > R的情况,我们很容易得到关系式

w M ′ θ ( ω)
= ∂ w Mθ ( ω ) ∂θ = w Zθ ( ω ) ∂log P θ ( ω ) ∂θ

w m’θ(ω)=∂w
mθ(ω)∂θ= w zθ(ω)∂log pθ(ω)∂θ

and, thanks
to the regularity assumptions, we can write

而且,由于规律性假设,我们可以写

when wθ(ω) 6=
R

当wθ(ω) 6= R时

∂Y θ ∂θ

∂Y θ ∂θ

∂W θ ∂θ

∂W θ ∂θ

= Zω ℓ ( ω )
w M ′ θ ( ω ) P ( ω ) ≈ n n∑i=1 ℓ ( ω i ) w M ′ θ ( ω i )

= zωℓ(ω)w
m′θ(ω)p(ω)≈n∑I = 1ℓ(ωI)w m′θ(ωI)

= Zω w M ′ θ
( ω ) P ( ω ) ≈ n n∑i=1 w M ′ θ ( ω i )

=Zωw
M′θ(ω)P(ω)≈n∑I = 1w M′θ(ωI)

Estimating
these derivatives is considerably easier than using approximation (29) because
they in-volve the bounded quantity w Z θ (ω) instead of the potentially large
ratio wθ(ω).It is still necessary to choose a sufficiently smooth sampling
distribution P(ω) to limit the magnitude of ∂log Pθ/∂θ.

估算这些导数比使用近似值(29)容易得多,因为它们涉及有界量w Z θ (ω),而不是潜在的大比值wθ(ω)。仍然需要选择足够平滑的采样分布P(ω)来限制∂log Pθ/∂θ.的幅度

Such
derivatives are very useful to drive optimization algorithms.Assume for
instance that we want to find the parameter θ that maximizes the counterfactual
expectation Y θ as illustrated in Section 6.3.Maximizing the estimate obtained
using approximation (5) could reach its maximum for a value of θ that is poorly
explored by the actual distribution.Maximizing an estimate of the lower bound
(30) ensures that the optimization algorithm finds a trustworthy answer.

这样的导数对于驱动优化算法非常有用。例如,假设我们想要找到使反事实期望最大化的参数θ,如第6.3节所示。使用近似值(5)获得的最大估计值可能会在θ值达到最大值,而实际分布对此了解甚少。使下限(30)的估计最大化确保了优化算法找到可信的答案。

Appendix E.
Uniform Empirical Bernstein Bounds

附录e .统一经验伯恩斯坦界限

This appendix
reviews the uniform empirical Bernstein bound given by Maurer and Pontil (2009)
and describes how it can be used to construct the uniform confidence interval
(21).The first step

本附录回顾了毛雷尔和庞泰尔(2009)给出的统一经验伯恩斯坦界限,并描述了如何使用它来构造统一置信区间(21)。第一步

consists of
characterizing the size of a family F of functions mapping a space X into the
in-terval [a,b] ⊂ R. Given n points x = (x1...xn) ∈ X n , the trace F (x) ∈ R n
is the set of vectors f(x1),..., f(xn) for all functions f ∈ F .

包括表征函数族f的大小,函数族f将空间x映射到区间[a,b] ⊂ R中。给定n个点x = (x1...xn)∈X ^ n,迹F(X)∈R ^ n是向量f(x1)的集合,...,f(xn)对于所有函数f ∈ F。

Definition 2
(Covering numbers, etc.) Given ε > 0, the covering number N (x, ε,F ) is the
small-est possible cardinality of a subset C ⊂ F (x) satisfying the condition

定义2(涵盖数字等。)给定ε > 0,覆盖数N (x,ε,f)是满足条件的子集C ⊂ F (x)的最小可能基数

∀ v ∈ F ( x )
∃ c ∈ C max i=1...n |v i − c i | ≤ ε

∀ v ∈ F ( x )
∃ c ∈ C最大i=1...n | v I c I |≤ε

and the
growth function N (n, ε,F ) is

生长函数N (n,ε,F)为

N (n, ε,F ) =
sup

N (n,ε,F ) = sup

x∈Xn

x∈Xn

N (x, ε,F ) .

N (x,ε,F)。

3255

3255

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

Thanks to a
famous combinatorial lemma (Vapnik and Chervonenkis, 1968, 1971;Sauer, 1972),
for

感谢一个著名的组合引理(Vapnik和Chervonenkis,1968,1971;绍尔,1972),为

many usual
parametric families F , the growth function N (n, ε,F ) increases at most
polynomially

许多常用的参数族F,增长函数N (n,ε,F)至多多项式增长

with both n
and 1/ε.

对于n和1/ε。

Theorem 3
(Uniform empirical Bernstein bound) (Maurer and Pontil, 2009, thm 6)

定理3(统一经验伯恩斯坦界)(莫勒和庞泰尔,2009,thm 6)

Let δ ∈
(0,1), n >= 16.Let X,X1,...,Xn be i.i.d. random variables with values in X.
Let F be a set of functions mapping X into [a,b] ⊂ R and let M (n) = 10N (2n,F
,1/n).Then we probability

设δ∑(0,1),n >= 16。设X,X1,...设f是将x映射成[a,b] ⊂ R的一组函数,设M (n) = 10N (2n,f,1/n)。那么我们概率

at least 1−δ,

至少1δ,

∀ f ∈ F , E [
f ( X)] − Mn ≤ r Vn log ( M ( n ) / δ ) n + ( b − a ) 15 log ( M ( n ) / δ ) n
− 1

∀ f ∈ F,e[f(x)]Mn≤r
VN log(m(n)/δ)n+(b a)15 log(m(n)/δ)n-1

,

,

where Mn and
Vn respectively are the sample mean and variance

其中Mn和Vn分别是样本均值和方差

Mn =

Mn =

n n∑i=1 f (
Xi )

n n∑i=1 f (
Xi)

Vn =

Vn =

1

n − 1

n1

n∑i=1 ( f (
Xi ) − Mn ) .

n∑I =
1(f(Xi)Mn)。

The statement
of this theorem emphasizes its similarity with the non-uniform empirical
Bern-stein bound (theorem 1).Although the constants are less attractive, the
uniform bound still con-

这个定理的陈述强调了它与非一致经验Bern-stein界(定理1)的相似性。尽管常数没有那么吸引人,但统一界限仍然成立

verges to
zero when n increases, provided of course that M (n) = 10N (2n,F ,1/n) grows
polyno-

当n增加时收敛到零,当然前提是M (n) = 10N (2n,F,1/n)增长多项式

mially with
n.

和n一起。

Let us then
define the family of functions

让我们定义函数族

F = { fθ : ω
7→ ℓ(ω)w M θ (ω) , gθ : ω 7→ w M θ (ω) , ∀θ ∈ F } ,

F = { fθ : ω
7→ ℓ(ω)w M θ (ω),gθ : ω 7→ w M θ (ω),∀θ ∈ F },

and use the
uniform empirical Bernstein bound to derive an outer inequality similar to (26)
and an inner inequality similar to (27).The theorem implies that, with
probability 1−δ, both inequalities are simultaneously true for all values of
the parameter θ.The uniform confidence interval (21) then follows directly.

并且使用一致经验伯恩斯坦界导出类似于(26)的外部不等式和类似于(27)的内部不等式。该定理意味着,对于概率1δ,两个不等式对于参数θ的所有值同时成立。然后,统一置信区间(21)直接跟随。

References

参考

V M.
Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva.Stochastic
optimization.Engineering Cybernetics, 5:11-16, 1968.

V M.
Aleksandrov、V. I. Sysoyev和V. V. Shemeneva。随机优化。工程控制论,5:11-16,1968。

Susan Athey
and Denis Nekipelov.A structural model of sponsored search advertising.Working
pa-

苏珊·艾希和丹尼斯·内基佩洛夫。赞助搜索广告的结构模型。工作pa-

per, 2010.URL
http://kuznets.harvard.edu/~athey/papers/Structural_Sponsored_ Search.pdf .

per,2010年。网址:Search.pdf。

Jean-Yves
Audibert, Remi Munos, and Csaba Szepesvári.Tuning bandit algorithms in
stochastic en-

让-伊夫·奥迪伯特、雷米·穆尼奥斯和乔萨巴·塞佩瓦斯瓦里。随机环境下的土匪算法调整

vironments.In
Proc.18th International Conference on Algorithmic Learning Theory (ALT 2007),
pages 150-165, 2007.

环境。在Proc中。第18届算法学习理论国际会议(ALT 2007),第150-165页,2007。

Peter Auer,
Nicolò Cesa-Bianchi, and Paul Fisher.Finite time analysis of the multiarmed
bandit problem.Machine Learning, 47(2-3):235-256, 2002.

彼得·奥尔、尼科洛·塞萨-比安奇和保罗·费舍尔。多参数土匪问题的有限时间分析。机器学习,47(2-3):235-256,2002。

13.For a
simple proof of this fact, slice [a,b] into intervals Sk of maximal width ε and
apply the lemma to the family of indicator functions (xi ,Sk) 7→ 1l{ f(xi) ∈
Sk}.

13.为了简单证明这个事实,将[a,b]切成最大宽度ε的区间Sk,并将引理应用于指示函数族(xi,Sk) 7→ 1l{
f(xi) ∈ Sk}。

3256

3256

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Heejung Bang
and James M. Robins.Doubly robust estimation in missing data and causal
inference models.Biometrics, 61:692-972, 2005.

希荣邦和詹姆斯·罗宾斯。缺失数据和因果推理模型中的双重稳健估计。生物统计学,61:692-972,2005。

Dirk
Bergemann and Maher Said.Dynamic auctions: a survey.Discussion Paper 1757R,
Cowles Foundation for Research in Economics, Yale University, 2010.

德克·伯格曼和马希尔·赛义德。动态拍卖:综述。讨论文件1757R,考尔斯经济学研究基金会,耶鲁大学,2010年。

Léon
Bottou.From machine learning to machine reasoning.http://arxiv.org/abs/1102.
1808v3, Feb 2011.

莱昂·博图。从机器学习到机器推理。http://arxiv.org/abs/1102. 1808 v3,2011年2月。

Leo
Breiman.Statistical modeling: The two cultures.Statistical Science,
16(3):199-231, 2001.

利奥·布里曼。统计建模:两种文化。统计科学,16(3):199-231,2001。

Olivier
Chapelle and Lihong Li.An empirical evaluation of thompson sampling.In Advances
in Neural Information Processing Systems 24, pages 2249-2257.NIPS Foundation,
2011.

奥利维耶·查佩尔和李立宏。汤普森抽样的经验评估。神经信息处理系统进展24,第2249-2257页。NIPS基金会,2011年。

C.R. Charig,
D. R. Webb, S. R. Payne, and J. E. A. Wickham.Comparison of treatment of renal

C.查理格、韦伯、佩恩和韦翰。肾脏疾病治疗方法的比较

calculi by
open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave
lithotripsy.British Medical Journal (Clin Res Ed), 292(6254):879-882, 1986.

开放手术、经皮肾镜取石术和体外冲击波碎石术治疗结石。英国医学杂志(临床研究版),292(6254):879-882,1986。

Denis X.
Charles and D. Max Chickering.Optimization for paid search auctions.Manuscript
in preparation, 2012.

丹尼斯·查尔斯十世和马克斯·奇克林。付费搜索拍卖的优化。手稿在准备中,2012年。

Denis X.
Charles, D. Max Chickering, and Patrice Simard.Micro-market experimentation for
paid search.Manuscript in preparation, 2012.

丹尼斯·查尔斯十世、马克斯·奇克林和帕特里斯·西玛德。付费搜索的微观市场实验。手稿在准备中,2012年。

Harald Cramér.
Mathematical Methods of Statistics.Princeton University Press, 1946.

统计的数学方法。普林斯顿大学出版社,1946年。

Denver
Dash.Caveats for Causal Reasoning with Equilibrium Models.PhD thesis,
University of Pittsburgh, 2003.

丹佛短跑。平衡模型因果推理的注意事项。匹兹堡大学博士论文,2003年。

Miroslav Dudík,
Dimitru Erhan, John Langford, and Lihong Li.Sample-efficient nonstationary-

米罗斯拉夫·杜迪克、迪米特鲁·埃汉、约翰·兰福德和李立宏。样本有效的非平稳-

policy
evaluation for contextual bandits.In Proceedings of Uncertainty in Artificial
Intelligence (UAI), pages 247-254, 2012.

背景土匪的政策评估。《人工智能的不确定性论文集》(UAI),第247-254页,2012年。

Benjamin
Edelman, Michael Ostrovsky, and Michael Schwarz.Internet advertising and the
gener-

本杰明·埃德尔曼、迈克尔·奥斯特洛夫斯基和迈克尔·施瓦茨。网络广告和一般的

alized second
price auction: Selling billions of dollars worth of keywords.American Economic
Review, 97(1):242-259, 2007.

实现二次价格拍卖:出售价值数十亿美元的关键词。《美国经济评论》,97(1):242-259,2007。

Alan
Genz.Numerical computation of multivariate normal probabilities.Journal
Computation of Multivariate Normal Probabilities, 1:141-149, 1992.

艾伦·根茨。多元正态概率的数值计算。多元正态概率计算杂志,1:141-149,1992。

John C.
Gittins.Bandit Processes and Dynamic Allocation Indices.Wiley, 1989.

约翰·吉廷斯。土匪过程和动态分配指数。威利,1989年。

Peter W.
Glynn.Likelihood ratio gradient estimation: an overview.In Proceedings of the
1987 Winter Simulation Conference, pages 366-375, 1987.

彼得·格林。似然比梯度估计:综述。1987年冬季模拟会议记录,第366-375页,1987年。

John
Goodwin.Microsoft adCenter.Personal communication, 2011.

约翰·古德温。Microsoft adCenter。个人交流,2011年。

Thore
Graepel, Joaquin Quinonero Candela, Thomas Borchert, and Ralf
Herbrich.Web-scale

托雷·格莱佩尔、华金·奎诺内罗·坎德拉、托马斯·伯切特和拉尔夫·赫布里奇。网络规模

Bayesian
click-through rate prediction for sponsored search advertising in Microsoft's
Bing search engine.In Proceedings of the 27th International Conference on
Machine Learning (ICML 2010), Invited Applications Track.Omnipress, 2010.

微软必应搜索引擎中赞助搜索广告的贝叶斯点击率预测。在第27届国际机器学习会议(2010年,ICML)的会议记录中,邀请应用跟踪。Omnipress,2010年。

3257

3257

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

Ron Kohavi,
Roger Longbotham, Dan Sommerfield, and Randal M. Henne.Controlled experiments

罗恩·科哈维、罗杰·朗博瑟姆、丹·索默菲尔德和兰德尔·亨内。可控试验

on the web:
Survey and practical guide.Data Mining and Knowledge Discovery, 18(1):140-181,
July 2008.

网上:调查和实践指南。数据挖掘和知识发现,18(1):140-181,2008年7月。

Volodymyr
Kuleshov and Doina Precup.Algorithms for multi-armed bandit problems, October
2010.http://www.cs.mcgill.ca/~vkules/bandits.pdf.

沃洛季米尔·库里肖夫和多伊娜·切普。多武装匪徒问题的算法,2010年10月。http://www.cs.mcgill.ca/~vkules/bandits.pdf.

Sébastien
Lahaie and R. Preston McAfee.Efficient ranking in sponsored search.In Proc.7th
In-

塞巴斯蒂安·拉海耶和普雷斯顿·迈克菲。赞助搜索中的有效排名。在Proc中。第七名

ternational
Workshop on Internet and Network Economics (WINE 2011), pages 254-265.LNCS
7090, Springer, 2011.

互联网和网络经济学国际研讨会(WINE 2011),第254-265页。LNCS 7090,斯普林格,2011年。

Lev Landau
and Evgeny Lifshitz.Course in Theoretical Physics, Volume 1: Mechanics.Pergamon
Press, 1969.2nd edition.

列夫·兰道和叶夫根尼·利弗希茨。理论物理课程,第1卷:力学。佩尔加蒙出版社,1969年。第二版。

John Langford
and Tong Zhang.The epoch-greedy algorithm for multi-armed bandits with side

约翰·兰福德和张彤。带边多武装土匪的历元贪婪算法

information.In
Advances in Neural Information Processing Systems 20, pages 817-824.MIT Press,
Cambridge, MA, 2008.

信息。神经信息处理系统进展20,第817-824页。麻省理工学院出版社,剑桥,麻省,2008。

Steffen L.
Lauritzen and Thomas S. Richardson.Chain graph models and their causal
interpretation.Journal of the Royal Statistical Society, Series B, 64:321-361,
2002.

斯蒂芬·劳里岑和托马斯·理查森。链图模型及其因果解释。皇家统计学会杂志,系列B,64:321-361,2002。

David K.
Lewis.Counterfactuals.Harvard University Press, 1973.2nd edition:
Wiley-Blackwell, 2001.

大卫·刘易斯。反事实。哈佛大学出版社,1973年。第二版:威利-布莱克威尔,2001年。

Lihong Li,
Wei Chu, John Langford, and Robert E. Schapire.A contextual-bandit approach to

李立宏、魏初、约翰·兰福德和罗伯特·沙皮雷。对…的上下文无关的方法

personalized
news article recommendation.In Proceedings of the 19th International Conference
on the World Wide Web (WWW 2010), pages 661-670.ACM, 2010.

个性化新闻文章推荐。《第19届万维网国际会议论文集》(WWW 2010),第661-670页。ACM,2010年。

Lihong Li,
Wei Chu, John Langford, and Xuanhui Wang.Unbiased offline evaluation of
contextual-

、魏初、约翰·兰福和王宣辉。上下文无关的无偏见离线评估

bandit-based
news article recommendation algorithms.In Proc.4th ACM International
Confer-ence on Web Search and Data Mining (WSDM 2011), pages 297-306, 2011.

基于土匪的新闻文章推荐算法。在Proc中。第四届美国计算机学会网络搜索和数据挖掘国际会议(WSDM,2011年),第297-306页,2011年。

Andreas
Maurer and Massimiliano Pontil.Empirical bernstein bounds and sample-variance
penal-ization.In Proc.The 22nd Conference on Learning Theory (COLT 2009), 2009.

安德烈亚斯·莫伊雷尔和马西米利亚诺·庞泰尔。经验伯恩斯坦界限和样本方差惩罚。在Proc中。第22届学习理论会议(COLT 2009),2009年。

Paul
Milgrom.Putting Auction Theory to Work.Cambridge University Press, 2004.

保罗·米尔格罗姆。将拍卖理论付诸实践。剑桥大学出版社,2004。

Roger B.
Myerson.Optimal auction design.Mathematics of Operations Research, 6(1):58-73,
1981.

罗杰·B·迈尔森。最优拍卖设计。运筹学数学,6(1):58-73,1981。

Judea
Pearl.Causality: Models, Reasoning, and Inference.Cambridge University Press,
2000.2nd edition: 2009.

朱迪亚珍珠。因果关系:模型、推理和推论。剑桥大学出版社,2000年。第二版:2009年。

Judea
Pearl.Causal inference in statistics: an overview.Statistics Surveys, 3:96-146,
2009.

朱迪亚珍珠。统计学中的因果推断:综述。统计调查,3:96-146,2009年。

Judea
Pearl.The do-calculus revisited.In Proc.Twenty-Eighth Conference on Uncertainty
in Artificial Intelligence (UAI-2012), pages 3-11, 2012.

朱迪亚珍珠。重新审视微积分。在Proc中。第二十八届人工智能不确定性会议(UAI-2012),第3-11页,2012年。

Linda E.
Reichl.A Modern Course in Statistical Physics, 2nd Edition.Wiley, 1998.

琳达·雷克尔。统计物理现代教程,第二版。威利,1998年。

3258

3258

COUNTERFACTUAL
REASONING AND LEARNING SYSTEMS

反事实推理和学习系统

Herbert
Robbins.Some aspects of the sequential design of experiments.Bulletin of the
American Mathematical Society, 58(5):527-535, 1952.

赫伯特·罗宾斯。实验顺序设计的某些方面。美国数学学会公报,58(5):527-535,1952。

James M.
Robins, Miguel Angel Hernan, and Babette Brumback.Marginal structural models
and causal inference in epidemiology.Epidemiology, 11(5):550-560, Sep 2000.

詹姆斯·罗宾斯、米格尔·安吉尔·埃尔南和巴贝特·布鲁姆巴克。流行病学中的边缘结构模型和因果推断。流行病学,11(5):550-560,2000年9月。

Norbert
Sauer.On the density of families of sets.Journal of Combinatorial Theory, 13:145-147,
1972.

诺伯特·索尔。关于集合族的密度。组合理论杂志,13:145-147,1972。

Yevgeny
Seldin, cois Laviolette Fran Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter
Auer.

叶夫根尼·塞尔丁、科伊斯·拉维莱特·弗兰·尼科洛·切萨-比安奇、约翰·肖-泰勒和彼得·奥尔。

PAC-Bayesian
inequalities for martingales.IEEE Transactions on Information Theory, 58(12):
7086-7093, 2012.

鞅的泛贝叶斯不等式。《信息理论杂志》,58(12): 7086-7093,2012。

Hidetoshi
Shimodaira.Improving predictive inference under covariate shift by weighting
the log likelihood function.Journal of Statistical Planning and Inference, 90(2):227-244,
2000.

Hidetoshi
Shimodaira。通过加权对数似然函数改进协变量移动下的预测推断。统计规划与推理杂志,90(2):227-244,2000。

Edward H.
Simpson.The interpretation of interaction in contingency tables.Journal of the
Royal Statistical Society, Ser.B, 13:238-241, 1951.

爱德华·辛普森。列联表中相互作用的解释。皇家统计学会杂志。b,13:238-241,1951。

Alexsandrs
Slivkins.Contextual bandits with similarity information.JMLR Conference and
Work-shop Proceedings, 19:679-702, 2011.

亚历山大·斯莱夫金斯。具有相似信息的上下文强盗。JMLR会议和工作间会议录,19:679-702,2011。

Peter Spirtes
and Richard Scheines.Causal inference of ambiguous manipulations.Philosophy of
Science, 71(5):833-845, Dec 2004.

彼得·斯皮特斯和理查德·申斯。模糊操作的因果推理。科学哲学,71(5):833-845,2004年12月。

Peter
Spirtes, Clark Glymour, and Richard Scheines.Causation, Prediction and
Search.Springer Verlag, New York, 1993.2nd edition: MIT Press, Cambridge
(Mass.), 2011.

彼得·斯皮特斯、克拉克·格利穆尔和理查德·申斯。因果关系、预测和搜索。斯普林格·弗拉格,纽约,1993年。第二版:麻省理工学院出版社,剑桥(马萨诸塞州。), 2011.

Stephen M.
Stigler.A historical view of statistical concepts in psychology and educational
research.American Journal of Education, 101(1):60-70, Nov 1992.

斯蒂芬·斯蒂格勒。心理学和教育研究中统计概念的历史观点。美国教育杂志,101(1):60-70,1992年11月。

Masashi
Sugiyama, Matthias Krauledat, and Klaus-Robert Müller.Covariate shift
adaptation by im-portance weighted cross validation.Journal of Machine Learning
Research, 8:985-1005, 2007.

杉山正史、马提亚斯·克劳莱达特和克劳斯-罗伯特·穆勒。通过重要性加权交叉验证的协变量移位适应。机器学习研究杂志,8:985-1005,2007。

Rich S.
Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction.MIT Press,
Cam-bridge, MA, 1998.

里奇·萨顿和安德鲁·巴尔托。强化学习:导论。麻省理工学院出版社,剑桥,麻省,1998。

Diane Tang,
Ashish Agarwal, Deirdre O'Brien, and Mike Meyer.Overlapping experiment infras-

黛安·唐,阿希什·阿加瓦尔,迪尔德丽·奥布莱恩和迈克·迈耶。重叠实验基础

tructure:
More, better, faster experimentation.In Proceedings 16th Conference on
Knowledge Discovery and Data Mining (KDD 2010), pages 17-26, 2010.

结构:更多、更好、更快的实验。《第16届知识发现和数据挖掘会议论文集》(KDD,2010),第17-26页,2010年。

Vladimir N.
Vapnik.Estimation of Dependences based on Empirical Data.Springer Series in
Statistics.Springer Verlag, Berlin, New York, 1982.

弗拉基米尔·瓦普尼克。基于经验数据的相关性估计。统计学中的斯普林格系列。斯普林格·弗拉格,柏林,纽约,1982年。

Vladimir N.
Vapnik and Alexey Ya.Chervonenkis.Uniform convegence of the frequencies of

弗拉基米尔·瓦普尼克和阿列克谢·亚。切尔沃嫩基斯。频率的一致一致性

occurence of
events to their probabilities.Proc.Academy of Sciences of the USSR, 181(4),
1968.English translation: Soviet Mathematics - Doklady, 9:915-918, 1968.

事件发生的概率。继续。苏联科学院,181(4),1968年。英文翻译:苏联数学-多克拉迪,9:915-918,1968。

Vladimir N.
Vapnik and Alexey Ya.Chervonenkis.On the uniform convergence of relative
frequen-

弗拉基米尔·瓦普尼克和阿列克谢·亚。切尔沃嫩基斯。关于相对频率的一致收敛性

cies of
events to their probabilities.Theory of Probability and its Applications,
16(2):264-280, 1971.

事件的可能性。概率论及其应用,16(2):264-280,1971。

3259

3259

BOTTOU,
PETERS, ET AL.

波图,彼得斯,等。

Hal R.
Varian.Position auctions.International Journal of Industrial Organization,
25:1163-1178, 2007.

哈尔·瓦里安。位置拍卖。国际工业组织杂志,25:1163-1178,2007。

Hal R.
Varian.Online ad auctions.American Economic Review, 99(2):430-434, 2009.

哈尔·瓦里安。在线广告拍卖。《美国经济评论》,99(2):430-434,2009。

Joannes
Vermorel and Mehryar Mohri.Multi-armed bandit algorithms and empirical
evaluation.In Proc.European Conference on Machine Learning, pages 437-448,
2005.

乔安妮·维尔莫勒和梅赫里亚·莫赫里。多武装匪徒算法和经验评估。在Proc中。欧洲机器学习会议,第437-448页,2005年。

Georg H. von
Wright.Explanation and Understanding.Cornell University Press, 1971.

乔治·冯·赖特。解释和理解。康奈尔大学出版社,1971年。

Abraham Wald.Sequential
tests of statistical hypotheses.The Annals of Mathematical Statistics,
16(2):117-186, 1945.

亚伯拉罕·瓦尔德。统计假设的顺序检验。数理统计年鉴,16(2):117-186,1945。

Norbert
Wiener.Cybernetics, or Control and Communication in the Animal and the Machine.

诺伯特·维纳。控制论,或动物和机器中的控制和交流。

Hermann et
Cie (Paris), MIT Press (Cambridge, Mass.), Wiley and Sons (New York), 1948.2nd
Edition (expanded): MIT Press, Wiley and Sons, 1961.

赫尔曼和西(巴黎),麻省理工学院出版社(剑桥,马萨诸塞州。),威利父子公司(纽约),1948年。第二版(扩大版):麻省理工学院出版社,威利父子出版社,1961年。

Ronald J.
Williams.Simple statistical gradient-following algorithms for connectionist
reinforce-ment learning.Machine Learning, 8(229-256), 1992.

罗纳德·威廉姆斯。连接强化学习的简单统计梯度跟踪算法。机器学习,8(229-256),1992。

James
Woodward.Making Things Happen.Oxford University Press, 2005.

詹姆斯·伍德沃德。让事情发生。牛津大学出版社,2005年。

Sewall S.
Wright.Correlation and causation.Journal of Agricultural Research, 20:557-585,
1921.

苏沃尔·s·赖特。相关性和因果关系。农业研究杂志,20:557-585,1921。

3260

3260