# Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson; 14(65):3207−3260, 2013.

### Abstract

This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.


## 1. Introduction

Statistical machine learning technologies in the real world are never without a purpose. Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified these predictions in the first place.

Such contradictions appear very clearly in the case of the learning systems that power Web-scale applications such as search engines, ad placement engines, or recommendation systems. For instance, the placement of advertisements on the result pages of Internet search engines depends on the bids of advertisers and on scores computed by statistical machine learning systems. Because the scores affect the contents of the result pages proposed to the users, they directly influence the occurrence of clicks and the corresponding advertiser payments. They also have important indirect effects. Ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this Web site in the future. They also impact the return on investment observed by the advertisers and therefore their future bids. Finally, they change the nature of the data collected for training the statistical models in the future.

These complicated interactions are clarified by important theoretical works. Under simplified assumptions, mechanism design (Myerson, 1981) leads to an insightful account of the advertiser feedback loop (Varian, 2007; Edelman et al., 2007). Under simplified assumptions, multi-armed bandit theory (Robbins, 1952; Auer et al., 2002; Langford and Zhang, 2008) and reinforcement learning (Sutton and Barto, 1998) describe the exploration/exploitation dilemma associated with the training feedback loop. However, none of these approaches gives a complete account of the complex interactions found in real-life systems.

This contribution proposes a novel approach: we view these complicated interactions as manifestations of the fundamental difference that separates correlation and causation. Using the ad placement example as a model of our problem class, we therefore argue that the language and the methods of causal inference provide flexible means to describe such complex machine learning systems and give sound answers to the practical questions facing the designer of such a system. Is it useful to pass a new input signal to the statistical model? Is it worthwhile to collect and label a new training set? What about changing the loss function or the learning algorithm? In order to answer such questions and improve the operational performance of the learning system, one needs to unravel how the information produced by the statistical models traverses the web of causes and effects and eventually produces measurable performance metrics.

Readers with an interest in causal inference will find in this paper (i) a real-world example demonstrating the value of causal inference for large-scale machine learning applications, (ii) causal inference techniques applicable to continuously valued variables with meaningful confidence intervals, and (iii) quasi-static analysis techniques for estimating how small interventions affect certain causal equilibria. Readers with an interest in real-life applications will find (iv) a selection of practical counterfactual analysis techniques applicable to many real-life machine learning systems. Readers with an interest in computational advertising will find a principled framework that (v) explains how to soundly use machine learning techniques for ad placement, and (vi) conceptually connects machine learning and auction theory in a compelling manner.


The paper is organized as follows. Section 2 gives an overview of the advertisement placement problem which serves as our main example. In particular, we stress some of the difficulties encountered when one approaches such a problem without a principled perspective. Section 3 provides a condensed review of the essential concepts of causal modeling and inference. Section 4 centers on formulating and answering counterfactual questions such as "how would the system have performed during the data collection period if certain interventions had been carried out on the system?" We describe importance sampling methods for counterfactual analysis, with clear conditions of validity and confidence intervals. Section 5 illustrates how the structure of the causal graph reveals opportunities to exploit prior information and vastly improve the confidence intervals. Section 6 describes how counterfactual analysis provides essential signals that can drive learning algorithms. Assume that we have identified interventions that would have caused the system to perform well during the data collection period. Which guarantee can we obtain on the performance of these same interventions in the future? Section 7 presents counterfactual differential techniques for the study of equilibria. Using data collected when the system is at equilibrium, we can estimate how a small intervention displaces the equilibrium. This provides an elegant and effective way to reason about long-term feedback effects. Various appendices complete the main text with information that we think more relevant to readers with specific backgrounds.


## 2. Causation Issues in Computational Advertising

After giving an overview of the advertisement placement problem, which serves as our main example, this section illustrates some of the difficulties that arise when one does not pay sufficient attention to the causal structure of the learning system.

### 2.1 Advertisement Placement

Online advertising works because users who are searching for something are good targets for advertisers who have something to offer. Several actors take part in this Internet advertising game:

• Advertisers are willing to pay to see their ads displayed or clicked.

• Publishers provide attractive Web services, such as, for instance, an Internet search engine. The ad placement service is sometimes provided by an advertising network on behalf of its affiliated publishers. For the purposes of this work, we simply consider a publisher large enough to run its own infrastructure.

• Users reveal information about their current interests, for instance, by entering a query in a search engine. They are offered Web pages that contain a selection of ads. Clicking an ad takes them to a Web site controlled by the advertiser where they can initiate some business.

A conventional bidding language is necessary to precisely define under which conditions an advertiser is willing to pay the bid amount. In the case of search advertising, a bid typically specifies (a) the ad message, (b) a set of keywords, (c) one of several possible matching criteria between the keywords and the user query, and (d) the maximal price the advertiser is willing to pay when a user clicks on the ad after entering a query that matches the keywords according to the specified criterion.


Whenever a user visits a publisher Web page, an advertisement placement engine runs an auction in real time in order to select winning ads, determine where to display them in the page, and compute the prices charged to advertisers, should the user click on their ad. Since the placement engine is operated by the publisher, it is designed to further the interests of the publisher. Fortunately for everyone else, the publisher must balance short-term interests, namely the immediate revenue brought by the ads displayed on each Web page, and long-term interests, namely the future revenues resulting from the continued satisfaction of the users.

Auction theory explains how to design a mechanism that optimizes the revenue of the seller of a single object (Myerson, 1981; Milgrom, 2004) under various assumptions about the information available to each participant regarding the intentions of the other buyers. In the case of the ad placement problem, the publisher runs multiple auctions and sells opportunities to receive a click. When nearly identical auctions occur thousands of times per second, it is tempting to assume that the advertisers have perfect information about each other. This assumption gives support to the popular generalized second price rank-score auction (Varian, 2007; Edelman et al., 2007):

Figure 1: Mainline and sidebar ads on a search result page. Ads placed in the mainline are more likely to be noticed, increasing both the chances of a click if the ad is relevant and the risk of annoying the user if the ad is not relevant.

• Let $x$ represent the auction context information, such as the user query, the user profile, the date, the time, etc. The ad placement engine first determines all eligible ads $a_1 \dots a_n$ and the corresponding bids $b_1 \dots b_n$ on the basis of the auction context $x$ and of the matching criteria specified by the advertisers.

• For each selected ad $a_i$ and each potential position $p$ on the Web page, a statistical model outputs the estimate $q_{i,p}(x)$ of the probability that ad $a_i$ displayed in position $p$ receives a user click. The rank-score $r_{i,p}(x) = b_i\, q_{i,p}(x)$ then represents the purported value associated with placing ad $a_i$ at position $p$.

• Let $L$ represent a possible ad layout, that is, a set of positions that can simultaneously be populated with ads, and let $\mathcal{L}$ be the set of possible ad layouts, including of course the empty layout. The optimal layout and the corresponding ads are obtained by maximizing the total rank-score

$$ \max_{L \in \mathcal{L}} \;\; \max_{i_1, \dots, i_{|L|}} \;\; \sum_{p \in L} r_{i_p,\, p}(x), $$

subject to the reserve constraints $\forall p \in L,\ r_{i_p,p}(x) \geq R_p(x)$, and also subject to diverse policy constraints, such as, for instance, preventing the simultaneous display of multiple ads from the same advertiser. Under mild assumptions, this discrete maximization problem is amenable to computationally efficient greedy algorithms (see appendix A).

• The advertiser payment associated with a user click is computed using the generalized second price (GSP) rule: the advertiser pays the smallest bid that it could have entered without changing the solution of the discrete maximization problem, all other bids remaining equal. In other words, the advertiser could not have manipulated its bid and obtained the same treatment for a better price.
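As a concrete sketch of the mechanism described above, the snippet below ranks ads by rank-score and applies the GSP pricing rule. It assumes, for brevity only, that the click probability does not depend on the position, so that sorting by $r_i = b_i q_i$ is the greedy solution; all function and variable names are invented for illustration and do not come from the paper.

```python
def run_auction(ads, n_slots, reserve):
    """Toy rank-score auction with generalized second price (GSP) pricing.

    ads: list of (name, bid, click_prob) tuples.  We assume the click
    probability is position independent, so ranking the ads by the
    rank-score r_i = b_i * q_i solves the layout maximization greedily.
    """
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    placements = []
    for k, (name, bid, q) in enumerate(ranked[:n_slots]):
        if bid * q < reserve:
            break  # the rank-score must clear the reserve to win a slot
        # GSP rule: charge the smallest bid that keeps the same outcome,
        # i.e. the bid that just matches the next-best rank-score.
        next_score = ranked[k + 1][1] * ranked[k + 1][2] if k + 1 < len(ranked) else 0.0
        price_per_click = max(next_score, reserve) / q
        placements.append((name, k, round(price_per_click, 4)))
    return placements

ads = [("a1", 2.0, 0.10), ("a2", 1.0, 0.15), ("a3", 4.0, 0.02)]
print(run_auction(ads, n_slots=2, reserve=0.05))
# a1 wins the top slot and pays 1.5 per click: any lower bid would have
# dropped its rank-score 0.10 * b below a2's rank-score 0.15.
```

Note how the winner's payment depends only on the competing rank-scores, which is exactly the "smallest bid preserving the outcome" property stated above.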

Under the perfect information assumption, the analysis suggests that the publisher simply needs to find which reserve prices $R_p(x)$ yield the best revenue per auction. However, the total revenue of the publisher also depends on the traffic experienced by its Web site. Displaying an excessive number of irrelevant ads can train users to ignore the ads, and can also drive them to competing Web sites. Advertisers can also distort the picture by temporarily increasing the bids. Indelicate advertisers can create deceiving advertisements that elicit many clicks but direct users to spam Web sites. Experience shows that the continued satisfaction of the users is more important to the publisher than it is to the advertisers.

Therefore the generalized second price rank-score auction has evolved. Rank-scores have been augmented with terms that quantify the user satisfaction or the ad relevance, precisely because the perfect information assumption is unrealistic. These adjustments are difficult to design by hand, and the whole ad placement engine should therefore be viewed as a complex learning system interacting with both users and advertisers.

### 2.2 Controlled Experiments

The designer of such an ad placement engine faces the fundamental question of testing whether a proposed modification of the ad placement engine results in an improvement of the operational performance of the system.

The simplest way to answer such a question is to try the modification. The basic idea is to randomly split the users into treatment and control groups (Kohavi et al., 2008). Users from the control group see Web pages generated using the unmodified system. Users of the treatment groups see Web pages generated using alternate versions of the system. Monitoring various performance metrics for a couple of months usually gives sufficient information to reliably decide which variant of the system delivers the most satisfactory performance.

Modifying an ad placement engine also affects the advertisers. Whereas it is easy to split users into treatment and control groups, splitting advertisers into treatment and control groups demands special attention because each auction involves multiple advertisers (Charles et al., 2012). Simultaneously controlling for both users and advertisers is probably impossible.

Controlled experiments also suffer from several drawbacks. They are expensive because they demand a complete implementation of the proposed modifications. They are slow because each experiment typically demands a couple of months. Finally, although there are elegant ways to efficiently run overlapping controlled experiments on the same traffic (Tang et al., 2010), they are limited by the volume of traffic available for experimentation.

It is therefore difficult to rely on controlled experiments during the conception phase of potential improvements to the ad placement engine. It is similarly difficult to use controlled experiments to drive the training algorithms associated with click probability estimation models. Cheaper and faster statistical methods are needed to drive these essential aspects of the development of an ad placement engine. Unfortunately, interpreting cheap and fast data can be very deceiving.

### 2.3 Confounding Data

Assessing the consequence of an intervention using statistical data is generally challenging because it is often difficult to determine whether the observed effect is a simple consequence of the intervention or has other uncontrolled causes.

| | Overall | Patients with small stones | Patients with large stones |
| --- | --- | --- | --- |
| Treatment A: Open surgery | 78% (273/350) | 93% (81/87) | 73% (192/263) |
| Treatment B: Percutaneous nephrolithotomy | 83% (289/350) | 87% (234/270) | 69% (55/80) |

Table 1: A classic example of Simpson's paradox. The table reports the success rates of two treatments for kidney stones (Charig et al., 1986, Tables I and II). Although the overall success rate of treatment B seems better, treatment B performs worse than treatment A on both patients with small kidney stones and patients with large kidney stones. See Section 2.3.

For instance, the empirical comparison of certain kidney stone treatments illustrates this difficulty (Charig et al., 1986). Table 1 reports the success rates observed on two groups of 350 patients treated with respectively open surgery (treatment A, with 78% success) and percutaneous nephrolithotomy (treatment B, with 83% success). Although treatment B seems more successful, it was more frequently prescribed to patients suffering from small kidney stones, a less serious condition. Did treatment B achieve a high success rate because of its intrinsic qualities or because it was preferentially applied to less severe cases? Further splitting the data according to the size of the kidney stones reverses the conclusion: treatment A now achieves the best success rate for both patients suffering from large kidney stones and patients suffering from small kidney stones. Such an inversion of the conclusion is called Simpson's paradox (Simpson, 1951).
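The reversal is easy to verify directly from the counts in Table 1. The short sketch below recomputes the success rates; the numbers are those of Charig et al. (1986), while the code itself is just an illustration:

```python
# Success counts from Table 1: (successes, patients) per treatment and stone size.
outcomes = {
    ("A", "small"): (81, 87),   ("A", "large"): (192, 263),
    ("B", "small"): (234, 270), ("B", "large"): (55, 80),
}

def success_rate(treatment, sizes=("small", "large")):
    """Success rate of a treatment, pooled over the given stone sizes."""
    wins = sum(outcomes[(treatment, s)][0] for s in sizes)
    total = sum(outcomes[(treatment, s)][1] for s in sizes)
    return wins / total

# Pooled over all patients, treatment B looks better ...
print(f"A: {success_rate('A'):.0%}  B: {success_rate('B'):.0%}")
# ... yet treatment A wins within each stone-size group: Simpson's paradox.
for size in ("small", "large"):
    print(size, f"A: {success_rate('A', [size]):.0%}  B: {success_rate('B', [size]):.0%}")
```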

The stone size in this study is an example of a confounding variable, that is, an uncontrolled variable whose consequences pollute the effect of the intervention. Doctors knew the size of the kidney stones, chose to treat the healthier patients with the least invasive treatment B, and therefore caused treatment B to appear more effective than it actually was. If we now decide to apply treatment B to all patients irrespective of the stone size, we break the causal path connecting the stone size to the treatment, we eliminate the illusion, and we will experience disappointing results.

When we suspect the existence of a confounding variable, we can split the contingency tables and reach improved conclusions. Unfortunately, we cannot fully trust these conclusions unless we are certain to have taken into account all confounding variables. The real problem therefore comes from the confounding variables we do not know.

Randomized experiments arguably provide the only correct solution to this problem (see Stigler, 1992). The idea is to randomly choose whether the patient receives treatment A or treatment B. Because this random choice is independent of all the potential confounding variables, known and unknown, they cannot pollute the observed effect of the treatments (see also Section 4.2). This is why controlled experiments in ad placement (Section 2.2) randomly distribute users between treatment and control groups, and this is also why, in the case of an ad placement engine, we should be somehow concerned by the practical impossibility to randomly distribute both users and advertisers.

| | Overall | q2 low | q2 high |
| --- | --- | --- | --- |
| q1 low | 6.2% (124/2000) | 5.1% (92/1823) | 18.1% (32/176) |
| q1 high | 7.5% (149/2000) | 4.8% (71/1500) | 15.6% (78/500) |

Table 2: Confounding data in ad placement. The table reports the click-through rates and the click counts of the second mainline ad. The overall counts suggest that the click-through rate of the second mainline ad increases when the click probability estimate q1 of the top ad is high. However, if we further split the pages according to the click probability estimate q2 of the second mainline ad, we reach the opposite conclusion. See Section 2.4.

### 2.4 Confounding Data in Ad Placement

Let us return to the question of assessing the value of passing a new input signal to the ad placement engine click prediction model. Section 2.1 outlines a placement method where the click probability estimates $q_{i,p}(x)$ depend on the ad and the position we consider, but do not depend on other ads displayed on the page. We now consider replacing this model by a new model that additionally uses the estimated click probability of the top mainline ad to estimate the click probability of the second mainline ad (Figure 1). We would like to estimate the effect of such an intervention using existing statistical data.

We have collected ad placement data for Bing search result pages served during three consecutive hours on a certain slice of traffic. Let q1 and q2 denote the click probability estimates computed by the existing model for respectively the top and the second mainline ads. After excluding pages displaying fewer than two mainline ads, we form two groups of 2000 pages randomly picked among those satisfying the conditions q1 < 0.15 for the first group and q1 ≥ 0.15 for the second group. Table 2 reports the click counts and frequencies observed on the second mainline ad in each group. Although the overall numbers show that users click more often on the second mainline ad when the top mainline ad has a high click probability estimate q1, this conclusion is reversed when we further split the data according to the click probability estimate q2 of the second mainline ad.
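The same arithmetic check as in the kidney stone example applies to the counts of Table 2: pooling the pages and stratifying them by q2 lead to opposite conclusions. A minimal sketch using the reported counts (the code and its names are illustrative only):

```python
# Click counts of the second mainline ad, from Table 2: (clicks, pages).
counts = {
    ("q1 low", "q2 low"): (92, 1823),   ("q1 low", "q2 high"): (32, 176),
    ("q1 high", "q2 low"): (71, 1500),  ("q1 high", "q2 high"): (78, 500),
}

def ctr(q1, q2_values=("q2 low", "q2 high")):
    """Click-through rate for a q1 group, pooled over the given q2 strata."""
    clicks = sum(counts[(q1, q2)][0] for q2 in q2_values)
    pages = sum(counts[(q1, q2)][1] for q2 in q2_values)
    return clicks / pages

# Pooled pages: higher CTR when q1 is high (positive correlation) ...
print(f"{ctr('q1 low'):.1%} vs {ctr('q1 high'):.1%}")
# ... but within each q2 stratum the comparison flips sign.
for q2 in ("q2 low", "q2 high"):
    print(q2, f"{ctr('q1 low', [q2]):.1%} vs {ctr('q1 high', [q2]):.1%}")
```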

Despite superficial similarities, this example is considerably more difficult to interpret than the kidney stone example. The overall click counts show that the actual click-through rate of the second mainline ad is positively correlated with the click probability estimate on the top mainline ad. Does this mean that we can increase the total number of clicks by placing regular ads below frequently clicked ads?

Remember that the click probability estimates depend on the search query which itself depends on the user intention. The most likely explanation is that pages with a high q1 are frequently associated with more commercial searches and therefore receive more ad clicks on all positions. The observed correlation occurs because the presence of a click and the magnitude of the click probability estimate q1 have a common cause: the user intention. Meanwhile, the click probability estimate q2 returned by the current model for the second mainline ad also depends on the query and therefore on the user intention. Therefore, assuming that this dependence has comparable strength, and assuming that there are no other causal paths, splitting the counts according to the magnitude of q2 factors out the effects of this common confounding cause. We then observe a negative correlation which now suggests that a frequently clicked top mainline ad has a negative impact on the click-through rate of the second mainline ad.

If this is correct, we would probably increase the accuracy of the click prediction model by switching to the new model. This would decrease the click probability estimates for ads placed in the second mainline position on commercial search pages. These ads are then less likely to clear the reserve and therefore more likely to be displayed in the less attractive sidebar. The net result is probably a loss of clicks and a loss of money despite the higher quality of the click probability model. Although we could tune the reserve prices to compensate for this unfortunate effect, nothing in this data tells us where the performance of the ad placement engine will land. Furthermore, unknown confounding variables might completely reverse our conclusions.

Making sense out of such data is just too complex!

### 2.5 A Better Way

It should now be obvious that we need a more principled way to reason about the effect of potential interventions. We provide one such more principled approach using the causal inference machinery (Section 3). The next step is then the identification of a class of questions that are sufficiently expressive to guide the designer of a complex learning system, and sufficiently simple to be answered using data collected in the past using adequate procedures (Section 4).

A machine learning algorithm can then be viewed as an automated way to generate such questions and update the parameters accordingly (Section 6). Learning algorithms derived in this manner are very flexible: human designers and machine learning algorithms can cooperate seamlessly because they rely on similar sources of information.

## 3. Modeling Causal Systems

When we point out a causal relationship between two events, we describe what we expect to happen to the event we call the effect, should an external operator manipulate the event we call the cause. Manipulability theories of causation (von Wright, 1971; Woodward, 2005) raise this commonsense insight to the status of a definition of the causal relation. Difficult adjustments are then needed to interpret statements involving causes that we can only observe through their effects, "because they love me," or that are not easily manipulated, "because the earth is round."

Modern statistical thinking makes a clear distinction between the statistical model and the world. The actual mechanisms underlying the data are considered unknown. The statistical models do not need to reproduce these mechanisms to emulate the observable data (Breiman, 2001). Better models are sometimes obtained by deliberately avoiding the reproduction of the true mechanisms (Vapnik, 1982, Section 8.6). We can approach the manipulability puzzle in the same spirit by viewing causation as a reasoning model (Bottou, 2011) rather than a property of the world. Causes and effects are simply the pieces of an abstract reasoning game. Causal statements that are not empirically testable acquire validity when they are used as intermediate steps when one reasons about manipulations or interventions amenable to experimental validation.

This section presents the rules of this reasoning game. We largely follow the framework proposed by Pearl (2009) because it gives a clear account of the connections between causal models and probabilistic models.

Figure 2: A structural equation model for ad placement. The sequence of equations describes the flow of information:

- Query context $x = f_1(u, \varepsilon_1)$ from user intent $u$.
- Eligible ads $(a_i)$: $a = f_2(x, v, \varepsilon_2)$ from query $x$ and inventory $v$.
- Corresponding bids $(b_i)$: $b = f_3(x, v, \varepsilon_3)$.
- Scores $(q_{i,p}, R_p)$: $q = f_4(x, a, \varepsilon_4)$ from query $x$ and ads $a$.
- Ad slate $s = f_5(a, q, b, \varepsilon_5)$ from eligible ads $a$, scores $q$ and bids $b$.
- Corresponding click prices $c = f_6(a, q, b, \varepsilon_6)$.
- User clicks $y = f_7(s, u, \varepsilon_7)$ from ad slate $s$ and user intent $u$.
- Revenue $z = f_8(y, c, \varepsilon_8)$ from clicks $y$ and prices $c$.

The functions $f_k$ describe how effects depend on their direct causes. The additional noise variables $\varepsilon_k$ represent independent sources of randomness useful to model probabilistic dependencies.

### 3.1 The Flow of Information

Figure 2 gives a deterministic description of the operation of the ad placement engine. Variable u represents the user and his or her intention in an unspecified manner. The query and query context x is then expressed as an unknown function of u and of a noise variable ε1. Noise variables in this framework are best viewed as independent sources of randomness useful for modeling a nondeterministic causal dependency. We shall only mention them when they play a specific role in the discussion. The set of eligible ads a and the corresponding bids b are then derived from the query x and the ad inventory v supplied by the advertisers. Statistical models then compute a collection of scores q such as the click probability estimates qi,p and the reserves Rp introduced in Section 2.1. The placement logic uses these scores to generate the "ad slate" s, that is, the set of winning ads and their assigned positions. The corresponding click prices c are computed. The set of user clicks y is expressed as an unknown function of the ad slate s and the user intent u. Finally, the revenue z is expressed as another function of the clicks y and the prices c.

Such a system of equations is called a structural equation model (Wright, 1921). Each equation asserts a functional dependency between an effect, appearing on the left-hand side of the equation, and its direct causes, appearing on the right-hand side as arguments of the function. Some of these causal dependencies are unknown. Although we postulate that the effect can be expressed as some function of its direct causes, we do not know the form of this function. For instance, the designer of the ad placement engine knows functions f2 to f6 and f8 because he has designed them. However, he does not know the functions f1 and f7 because whoever designed the user did not leave sufficient documentation.

Figure 3 represents the directed causal graph associated with the structural equation model. Each arrow connects a direct cause to its effect. The noise variables are omitted for simplicity. The structure of this graph reveals fundamental assumptions about our model. For instance, the user clicks y do not directly depend on the scores q or the prices c because users do not have access to this information.

We hold as a principle that causation obeys the arrow of time: causes always precede their effects. Therefore the causal graph must be acyclic. Structural equation models then support two fundamental operations, namely simulation and intervention.

Figure 3: Causal graph associated with the structural equation model of Figure 2. The mutually independent noise variables ε1 to ε8 are implicit. The variables a, b, q, s, c, and z depend on their direct causes in known ways. In contrast, the variables u and v are exogenous and the variables x and y depend on their direct causes through unknown functions.

• Simulation – Let us assume that we know both the exact form of all functional dependencies and the value of all exogenous variables, that is, the variables that never appear in the left-hand side of an equation. We can compute the values of all the remaining variables by applying the equations in their natural time sequence.

• Intervention – As long as the causal graph remains acyclic, we can construct derived structural equation models using arbitrary algebraic manipulations of the system of equations. For instance, we can clamp a variable to a constant value by rewriting the right-hand side of the corresponding equation as the specified constant value.
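The two operations can be sketched in a few lines of code. The structural equations below are invented toy functions (scalars standing in for the variables of Figure 2, with arbitrary coefficients), not the actual Bing models; the point is only that simulation evaluates the equations in causal order, and that an intervention is an algebraic rewrite, here clamping q to a constant:

```python
import random

def simulate(clamp_q=None, seed=0):
    """Evaluate a toy structural equation model in its natural time sequence.

    clamp_q implements the intervention described above: the equation
    defining q is rewritten as the constant clamp_q.
    """
    rng = random.Random(seed)
    u = rng.random()                     # exogenous: user intent
    v = rng.random()                     # exogenous: ad inventory
    x = 0.5 * u + 0.1 * rng.random()     # x = f1(u, eps1): query context
    a = x + v                            # a = f2(x, v): eligible ads (toy scalar)
    b = 2.0 * v                          # b = f3(x, v): bids (toy: uses only v)
    q = 0.3 * x + 0.2 * a                # q = f4(x, a): scores
    if clamp_q is not None:
        q = clamp_q                      # intervention: clamp q to a constant
    s = q * b                            # s = f5(a, q, b): ad slate (toy: q * b)
    c = 0.8 * b                          # c = f6(a, q, b): click prices (toy: 0.8 b)
    y = 1.0 if s * u > 0.1 else 0.0      # y = f7(s, u): user clicks
    z = y * c                            # z = f8(y, c): revenue
    return z

print(simulate())             # revenue of the unmodified system
print(simulate(clamp_q=0.0))  # derived model: no scores, hence no revenue
```

Because the graph is acyclic, each variable can be computed as soon as its direct causes are available, which is exactly the "natural time sequence" evaluation described in the Simulation bullet.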

The algebraic manipulation of the structural equation models provides a powerful language to describe interventions on a causal system. This is not a coincidence. Many aspects of the mathematical notation were invented to support causal inference in classical mechanics. However, we no longer have to interpret the variable values as physical quantities: the equations simply describe the flow of information in the causal model (Wiener, 1948).

### 3.2 The Isolation Assumption

Let us now turn our attention to the exogenous variables, that is, variables that never appear in the left-hand side of an equation of the structural model. Leibniz's principle of sufficient reason claims that there are no facts without causes. This suggests that the exogenous variables are the effects of a network of causes not expressed by the structural equation model. For instance, the user intent u and the ad inventory v in Figure 3 have temporal correlations because both users and advertisers worry about their budgets when the end of the month approaches. Any structural equation model should then be understood in the context of a larger structural equation model potentially describing all things in existence.

The ads shown on a particular page contribute to the continued satisfaction of both users and advertisers, and therefore have an effect on their willingness to use the services of the publisher in the future. The ad placement structural equation model shown in Figure 2 only describes the causal dependencies for a single page and therefore cannot account for such effects. Consider however a very

Figure 4: Conceptually unrolling the user feedback loop by threading instances of the single-page causal graph (Figure 3). Both the ad slate $s_t$ and user clicks $y_t$ have an indirect effect on the user intent $u_{t+1}$ associated with the next query.

large structural equation model containing a copy of the page-level model for every Web page ever served by the publisher. Figure 4 shows how we can thread the page-level models corresponding to pages served to the same user. Similarly, we could model how advertisers track the performance and the cost of their advertisement campaigns. The resulting causal graphs can be very complex. Part of this complexity results from time-scale differences. Thousands of search pages are served in a second. Each page contributes a little to the continued satisfaction of one user and a few advertisers. The accumulation of these contributions produces measurable effects after a few weeks.

Many of the functional dependencies expressed by the structural equation model are left unspecified. Without direct knowledge of these functions, we must reason using statistical data. The most fundamental statistical data is collected from repeated trials that are assumed independent. When we consider the large structural equation model of everything, we can only have one large trial producing a single data point. It is therefore desirable to identify repeated patterns of identical equations that can be viewed as repeated independent trials. Therefore, when we study a structural equation model representing such a pattern, we need to make an additional assumption that expresses the idea that the outcome of one trial does not affect the other trials. We call such an assumption an isolation assumption by analogy with thermodynamics. This can be achieved by assuming that the exogenous variables are independently drawn from an unknown but fixed joint probability distribution. This assumption cuts the causation effects that could flow through the exogenous variables.

The noise variables are also exogenous variables acting as independent sources of randomness. The noise variables are useful to represent the conditional distribution P(effect | causes) using the equation effect = f(causes, ε). Therefore, we also assume joint independence between all the noise variables and any of the named exogenous variables. For instance, in the case of the ad placement

1. See also the discussion on reinforcement learning, Section 3.5.

2. The concept of isolation is pervasive in physics. An isolated system in thermodynamics
(Reichl, 1998, Section 2.D) or a closed system in mechanics (Landau and
Lifshitz, 1969, §5) evolves without exchanging mass or energy with its
surroundings. Experimental trials involving systems that are assumed isolated
may differ in their initial setup and therefore have different
outcomes. Assuming isolation implies that the outcome of each trial cannot
affect the other trials.


3. Rather than letting two noise variables display measurable statistical dependencies because
they share a common cause, we prefer to name the common cause and make the
dependency explicit in the graph.


Figure 6: Bayesian network associated with the Markov factorization shown in Figure 5.
(Node labels recovered from the figure: exogenous variables, query, eligible ads,
slate, prices, clicks, revenue.)

model shown in Figure 2, we assume that the joint distribution of the exogenous variables
factorizes as

P(u, v, ε1, …, ε8) = P(u, v) P(ε1) ⋯ P(ε8).

Since an isolation assumption is only true up to a point, it should be expressed clearly
and remain under constant scrutiny. We must therefore measure additional
performance metrics that reveal how well the isolation assumption holds. For
instance, the ad placement structural equation model and the corresponding
causal graph (figures 2 and 3) do not take user feedback or advertiser feedback
into account. Measuring the revenue is not enough because we could easily
generate revenue at the expense of the satisfaction of the users and
advertisers. When we evaluate interventions under such an isolation assumption,
we also need to measure a battery of additional quantities that act as proxies
for these feedback effects, such as the ad relevance estimated by human judges,
and the advertiser surplus estimated from the auctions (Varian, 2009).

3.3 Markov Factorization 马尔可夫因子分解

Conceptually,
we can draw a sample of the exogenous variables using the distribution
specified by the isolation assumption, and we can then generate values for all
the remaining variables by simulating the structural equation model.
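The generative process just described can be sketched in a few lines of code. This is an illustrative toy model only: the variable names loosely follow Figure 2, but the functional dependencies (linear equations, Gaussian noise, a click threshold) are invented stand-ins for the unspecified functions of the real system.

```python
import random

def sample_sem(rng):
    # Exogenous variables: independent draws (isolation assumption).
    u = rng.gauss(0.0, 1.0)                        # user state
    v = rng.gauss(0.0, 1.0)                        # advertiser state
    e0, e1, e2 = (rng.gauss(0.0, 1.0) for _ in range(3))  # noise variables

    # Structural equations, evaluated in causal order: effect = f(causes, noise).
    x = u + e0                         # query depends on the user
    q = 0.5 * x + e1                   # score depends on the query
    y = 1.0 if q + e2 > 1.0 else 0.0   # click depends on the displayed slate
    return {"u": u, "v": v, "x": x, "q": q, "y": y}

rng = random.Random(0)
samples = [sample_sem(rng) for _ in range(10000)]
click_yield = sum(s["y"] for s in samples) / len(samples)
```

Repeating the simulation yields independent trials, exactly the repeated-pattern structure the isolation assumption is meant to justify.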

This process defines a generative probabilistic model representing the joint distribution of
all variables in the structural equation model. The distribution readily
factorizes as the product of the joint probability of the named exogenous
variables, and, for each equation in the structural equation model, the
conditional probability of the effect given its direct causes (Spirtes et al.,
1993; Pearl, 2000). As illustrated by figures 5 and 6, this Markov factorization
connects the structural equation model that describes causation, and the
Bayesian network that describes the joint probability distribution followed by
the variables under the isolation assumption.

Structural equation models and Bayesian networks appear so intimately connected that it
could be easy to forget the differences. The structural equation model is an
algebraic object. As long as the causal graph remains acyclic, algebraic
manipulations are interpreted as interventions on the causal system. The
Bayesian network is a generative statistical model representing a class of
joint probability distributions, and, as such, does not support algebraic
manipulations. However, the symbolic representation of its Markov factorization
is an algebraic object, essentially equivalent to the structural equation
model.

3.4 Identification, Transportation, and Transfer Learning 识别、运输和转移学习

Consider a causal system represented by a structural equation model with some unknown
functional dependencies. Subject to the isolation assumption, data collected
during the operation of this system follows the distribution described by the
corresponding Markov factorization. Let us first assume that this data is sufficient
to identify the joint distribution of the subset of variables we can observe. We
can intervene on the system by clamping the value of some variables. This
amounts to replacing the right-hand side of the corresponding structural
equations by constants. The joint distribution of the variables is then
described by a new Markov factorization that shares many factors with the
original Markov factorization. Which conditional probabilities associated with
this new distribution can we express using only conditional probabilities
identified during the observation of the original system? This is called the
identifiability problem. More generally, we can consider arbitrarily complex
manipulations of the structural equation model, and we can perform multiple
experiments involving different manipulations of the causal system. Which
conditional probabilities pertaining to one experiment can be expressed using
only conditional probabilities identified during the observation of other
experiments? This is called the transportability problem.

Pearl's do-calculus completely solves the identifiability problem and provides useful
tools to address many instances of the transportability problem (see Pearl,
2012). Assuming that we know the conditional probability distributions involving
observed variables in the original structural equation model, do-calculus
allows us to derive conditional distributions pertaining to the manipulated
structural equation model.


Unfortunately, we must further distinguish the conditional probabilities that we know (because
we designed them) from those that we estimate from empirical data. This
distinction is important because estimating the distribution of continuous or
high cardinality variables is notoriously difficult. Furthermore, do-calculus
often combines the estimated probabilities in ways that amplify estimation
errors. This happens when the manipulated structural equation model exercises
the variables in ways that were rarely observed in the data collected from the
original structural equation model.

4. Bayesian networks are directed graphs representing the Markov factorization of a joint
probability distribution: the arrows no longer have a causal interpretation.


Therefore we prefer to use much simpler causal inference techniques (see sections 4.1 and
4.2). Although these techniques do not have the completeness properties of
do-calculus, they combine estimation and transportation in a manner that
facilitates the derivation of useful confidence intervals.

3.5 Special Cases 特殊情况

Three special
cases of causal models are particularly relevant to this work.

• In the multi-armed bandit (Robbins, 1952), a user-defined policy function π determines
the distribution of the action a ∈ {1, …, K}, and an unknown reward function r
determines the distribution of the outcome y given the action a (Figure 7). In
order to maximize the accumulated rewards, the player must construct policies π
that balance the exploration of the action space with the exploitation of the best
action identified so far (Auer et al., 2002; Audibert et al., 2007; Seldin et
al., 2012).

• The contextual bandit problem (Langford and Zhang, 2008) significantly increases
the complexity of multi-armed bandits by adding one exogenous variable x to
the policy function π and the reward function r (Figure 8).

• Both multi-armed bandit and contextual bandit are special cases of reinforcement
learning (Sutton and Barto, 1998). In essence, a Markov decision process is a
sequence of contextual bandits where the context is no longer an exogenous
variable but a state variable that depends on the previous states and actions
(Figure 9). Note that the policy function π, the reward function r, and the
transition function s are independent of time. All the time dependencies are
expressed using the states s_t.
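To make the exploration/exploitation trade-off of the multi-armed bandit concrete, here is a minimal epsilon-greedy player, one of the simplest policies for this problem (the works cited above analyze more refined strategies). The arm success probabilities are invented for the example.

```python
import random

def run_bandit(probs, steps=20000, epsilon=0.1, seed=0):
    """Epsilon-greedy policy: explore a random arm with prob. epsilon,
    otherwise exploit the arm with the best estimated reward."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    values = [0.0] * len(probs)   # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(len(probs))                        # explore
        else:
            a = max(range(len(probs)), key=lambda k: values[k])  # exploit
        reward = 1.0 if rng.random() < probs[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # incremental mean
        total += reward
    return values, total / steps

# Three arms with invented success probabilities; the best arm is the last one.
values, avg_reward = run_bandit([0.2, 0.5, 0.8])
```

After enough steps the policy identifies the best arm while still paying a small exploration cost, which is exactly the balance discussed above.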

These special cases have increasing generality. Many simple structural equation models can be
reduced to a contextual bandit problem using appropriate definitions of the
context x, the action a and the outcome y. For instance, assuming that the
prices c are discrete, the ad placement structural equation model shown in
Figure 2 reduces to a contextual bandit problem with context (u, v), actions
(s, c) and reward z. Similarly, given a sufficiently intricate definition of
the state variables s_t, all structural equation models with discrete variables
can be reduced to a reinforcement learning problem. Such reductions lose the
fine structure of the causal graph. We show in Section 5 how this fine structure
can in fact be leveraged to obtain more information from the same experiments.

Modern reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the
assumption that the policy function, the reward function, the transition
function, and the distributions of the corresponding noise variables, are
independent of time. This invariance property provides great benefits when the
observed sequences of actions and rewards are long in comparison with the size
of the state space. Only Section 7 in this contribution presents methods that
take advantage of such an invariance. The general question of leveraging
arbitrary functional invariances in causal graphs is left for future work.

4. Counterfactual Analysis 反事实分析

We now return to the problem of formulating and answering questions about the value of
proposed changes of a learning system. Assume for instance that we consider
replacing the score computation

a = π(ε)    Action a ∈ {1, …, K}

y = r(a, ε′)    Reward y ∈ ℝ

Figure 7: Structural equation model for the multi-armed bandit problem. The policy π
selects a discrete action a, and the reward function r determines the outcome y.
The noise variables ε and ε′ represent independent sources of randomness useful
to model probabilistic dependencies.

a = π(x, ε)    Action a ∈ {1, …, K}

y = r(x, a, ε′)    Reward y ∈ ℝ

Figure 8: Structural equation model for the contextual bandit problem. Both the action
and the reward depend on an exogenous context variable x.

Figure 9: Structural equation model for reinforcement learning. The above equations are
replicated for all t ∈ {0, …, T}. The context is now provided by a state
variable s_{t−1} that depends on the previous states and actions.

model M of an ad placement engine by an alternate model M*. We seek an answer to the
conditional question:

"How
will the system perform if we replace model M by model M*?"


Given sufficient time and sufficient resources, we can obtain the answer using a
controlled experiment (Section 2.2). However, instead of carrying out a new
experiment, we would like to obtain an answer using data that we have already
collected in the past.

"How
would the system have performed if, when the data was collected, we had
replaced model M by model M*?"


The answer to this counterfactual question is of course a counterfactual statement that
describes the system performance subject to a condition that did not happen.

Counterfactual statements challenge ordinary logic because they depend on a condition that is
known to be false. Although assertion A ⇒ B is always true when assertion A is
false, we certainly do not mean for all counterfactual statements to be
true. Lewis (1973) navigates this paradox using a modal logic in which a
counterfactual statement describes the state of affairs in an alternate world
that resembles ours except for the specified differences. Counterfactuals indeed
offer many subtle ways to qualify such alternate worlds. For instance, we can
easily describe isolation assumptions (Section 3.2) in a counterfactual
question:

"How would the system have performed if, when the data was collected, we had
replaced model M by model M* without incurring user or advertiser
reactions?"


Figure 10: Causal graph for an image recognition system. We can estimate counterfactuals by
replaying data collected in the past.

Figure 11: Causal graph for a randomized experiment. We can estimate certain
counterfactuals by reweighting data collected in the past.

The fact that we could not have changed the model without incurring the user and advertiser
reactions does not matter any more than the fact that we did not replace model
M by model M* in the first place. This does not prevent us from using
counterfactual statements to reason about causes and effects. Counterfactual
questions and statements provide a natural framework to express and share our
conclusions.

The remaining text in this section explains how we can answer certain counterfactual
questions using data collected in the past. More precisely, we seek to estimate
performance metrics that can be expressed as expectations with respect to the
distribution that would have been observed if the counterfactual conditions had
been in force.

4.1 Replaying Empirical Data 重放经验数据

Figure 10 shows the causal graph associated with a simple image recognition system. The
classifier takes an image x and produces a prospective class label ŷ. The loss
measures the penalty associated with recognizing class ŷ while the true class
is y.

To estimate the expected error of such a classifier, we collect a representative data set
composed of labelled images, run the classifier on each image, and average the
resulting losses. In other words, we replay the data set to estimate what
(counterfactual) performance would have been observed if we had used a
different classifier. We can then select in retrospect the classifier that
would have worked the best and hope that it will keep working well. This is the
counterfactual viewpoint on empirical risk minimization (Vapnik, 1982).
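The replay estimate is straightforward because the alternate classifier and the loss are known functions. A minimal sketch, with an invented toy data set and two hypothetical threshold classifiers:

```python
# Invented toy data set: (feature, true label) pairs.
data = [(-2.0, 0), (-0.5, 0), (0.3, 1), (1.5, 1), (-0.2, 1), (0.1, 0)]

def classifier_a(x):
    # Hypothetical deployed model.
    return 1 if x > 0.0 else 0

def classifier_b(x):
    # Hypothetical candidate model that never ran in production.
    return 1 if x > -0.3 else 0

def replay_risk(classifier, data):
    # Counterfactual estimate: the average zero-one loss we *would have*
    # observed had this classifier been used when the data was collected.
    return sum(0.0 if classifier(x) == y else 1.0 for x, y in data) / len(data)

risk_a = replay_risk(classifier_a, data)
risk_b = replay_risk(classifier_b, data)
```

Selecting the classifier with the lowest replayed risk is exactly empirical risk minimization seen through the counterfactual lens.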

Replaying the data set works because both the alternate classifier and the loss function are
known. More generally, to estimate a counterfactual by replaying a data set, we
need to know all the functional dependencies associated with all causal paths
connecting the intervention point to the measurement point. This is obviously
not always the case.

5. Although counterfactual expectations can be viewed as expectations of unit-level
counterfactuals (Pearl, 2009, Definition 4), they elude the semantic
subtleties of unit-level counterfactuals and can be measured with randomized
experiments (see Section 4.2).


4.2 Reweighting Randomized Trials 重新加权随机试验

Figure 11 illustrates the randomized experiment suggested in Section 2.3. The patients are
randomly split into two equally sized groups receiving respectively treatments
A and B. The overall success rate for this experiment is therefore
Y = (Y_A + Y_B)/2 where Y_A and Y_B are the success rates observed for each group. We would
like to estimate which (counterfactual) overall success rate Y* would have been
observed if we had selected treatment A with probability p and treatment B with
probability 1 − p.

Since we do not know how the outcome depends on the treatment and the patient condition, we
cannot compute which outcome y would have been obtained if we had treated
patient x with a different treatment u. Therefore we cannot answer this
question by replaying the data as we did in Section 4.1.

However, observing different success rates Y_A and Y_B for the treatment groups reveals an
empirical correlation between the treatment u and the outcome y. Since the only
cause of the treatment u is an independent roll of the dice, this correlation
cannot result from any known or unknown confounding common cause. Having
eliminated this possibility, we can reweight the observed outcomes and compute
the estimate Y* ≈ pY_A + (1 − p)Y_B.
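The reweighting argument above can be sketched numerically. The trial data is simulated (an invented treatment A succeeding 60% of the time, B 40%); each observed outcome is reweighted by the ratio of the counterfactual assignment probability to the actual probability 1/2:

```python
import random

# Simulated 50/50 trial with invented success rates (A: 60%, B: 40%).
rng = random.Random(1)
trial = []
for _ in range(100000):
    t = "A" if rng.random() < 0.5 else "B"
    p_success = 0.6 if t == "A" else 0.4
    trial.append((t, 1 if rng.random() < p_success else 0))

def counterfactual_rate(trial, p):
    # Reweight each outcome by (counterfactual assignment prob.) / (actual 1/2).
    total = 0.0
    for t, y in trial:
        w = (p if t == "A" else 1.0 - p) / 0.5
        total += w * y
    return total / len(trial)

y_star = counterfactual_rate(trial, 0.8)   # mostly treatment A
```

With p = 0.8 the estimate approaches 0.8·0.6 + 0.2·0.4 = 0.56, the rate a trial run with those assignment probabilities would have measured.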

4.3 Markov Factor Replacement 马尔可夫因子替换

The reweighting approach can in fact be applied under much less stringent conditions.

The average number of ad clicks per page is often called the click yield. Increasing the click
yield usually benefits both the advertiser and the publisher, whereas
increasing the revenue per page often benefits the publisher at the expense of
the advertiser. Click yield is therefore a very useful metric when we reason
with an isolation assumption that ignores the advertiser reactions to pricing
changes.

Let ω be a shorthand for all variables appearing in the Markov factorization of the ad
placement structural equation model,

P(ω) = P(u, v) P(x|u) P(a|x, v) P(b|x, v) P(q|x, a) × P(s|a, q, b) P(c|a, q, b) P(y|s, u) P(z|y, c).

Variable y was defined in Section 3.1 as the set of user clicks. In the rest of the
document, we slightly abuse this notation by using the same letter y to
represent the number of clicks. We also write the expectation Y = E_{ω∼P(ω)}[y]
using the integral notation

Y = ∫_ω y P(ω).

We would like to estimate what the expected click yield Y* would have been if we had used a
different scoring function (Figure 12). This intervention amounts to replacing
the actual factor P(q|x, a) by a counterfactual factor P*(q|x, a) in the Markov
factorization:

P*(ω) = P(u, v) P(x|u) P(a|x, v) P(b|x, v) P*(q|x, a) × P(s|a, q, b) P(c|a, q, b) P(y|s, u) P(z|y, c).   (3)

6. See also the discussion of Reichenbach's common cause principle and of its limitations
in Spirtes et al. (1993) and Spirtes and Scheines (2004).

Figure 12: Estimating which average number of clicks per page would have been observed if
we had used a different scoring model.

Let us assume, for simplicity, that the actual factor P(q|x, a) is nonzero
everywhere. We can then estimate the counterfactual expected click yield Y* using
the transformation

Y* = ∫_ω y P*(ω) = ∫_ω y [P*(q|x, a) / P(q|x, a)] P(ω) ≈ (1/n) ∑_{i=1}^{n} y_i P*(q_i|x_i, a_i) / P(q_i|x_i, a_i),   (4)

where the data set of tuples (a_i, x_i, q_i, y_i) is distributed according to the actual
Markov factorization instead of the counterfactual Markov factorization. This
data could therefore have been collected during the normal operation of the ad
placement system. Each sample is reweighted to reflect its probability of
occurrence under the counterfactual conditions.

In general, we can use importance sampling to estimate the counterfactual expectation of
any quantity ℓ(ω):

Y* = ∫_ω ℓ(ω) P*(ω) = ∫_ω ℓ(ω) [P*(ω) / P(ω)] P(ω) ≈ (1/n) ∑_{i=1}^{n} ℓ(ω_i) w_i   (5)

with weights

w_i = w(ω_i) = P*(ω_i) / P(ω_i) = P*(q_i|x_i, a_i) / P(q_i|x_i, a_i).   (6)

Equation (6) emphasizes the simplifications resulting from the algebraic similarities of the
actual and counterfactual Markov factorizations. Because of these
simplifications, the evaluation of the weights only requires the knowledge of
the few factors that differ between P(ω) and P*(ω). Each data sample needs to
provide the value of ℓ(ω_i) and the values of all variables needed to evaluate
the factors that do not cancel in the ratio (6).
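The cancellation of common factors can be checked on a toy factorization. In this sketch (all distributions invented), the actual and counterfactual systems differ only in the factor P(q|x), so the importance weight reduces to the ratio of the two scoring factors; the estimate is compared against a direct simulation of the counterfactual system:

```python
import random

rng = random.Random(2)

def draw(q_match_prob):
    # P(x): biased coin. P(q|x): q copies x with prob. q_match_prob.
    # P(y|q): click probability depends only on the score q.
    x = 1 if rng.random() < 0.8 else 0
    q = x if rng.random() < q_match_prob else 1 - x
    y = 1 if rng.random() < (0.9 if q == 1 else 0.2) else 0
    return x, q, y

# Data collected under the actual system, where P(q = x) = 0.7.
data = [draw(0.7) for _ in range(200000)]

def weight(x, q):
    # Only the differing factor survives in the ratio: P*(q|x) / P(q|x);
    # P(x) and P(y|q) are identical in both factorizations and cancel.
    p = 0.7 if q == x else 0.3
    p_star = 0.4 if q == x else 0.6
    return p_star / p

# Importance-sampling estimate of the counterfactual click yield.
y_star_is = sum(y * weight(x, q) for x, q, y in data) / len(data)

# Ground truth obtained by simulating the counterfactual system directly.
truth = [draw(0.4) for _ in range(200000)]
y_star_direct = sum(y for _, _, y in truth) / len(truth)
```

The two estimates agree even though the weight never touches P(x) or P(y|q), which is the practical point of the factor cancellation.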

In contrast, the replaying approach (Section 4.1) demands the knowledge of all factors of
P*(ω) connecting the point of intervention to the point of measurement ℓ(ω). On
the other hand, it does not require the knowledge of factors appearing only in
P(ω).

Importance sampling relies on the assumption that all the factors appearing in the
denominator of the reweighting ratio (6) are nonzero whenever the factors
appearing in the numerator are nonzero. Since these factors represent
conditional probabilities resulting from the effect of an independent noise
variable in the structural equation model, this assumption means that the data
must be collected with an experiment involving active randomization. We must
therefore design cost-effective randomized experiments that yield enough
information to estimate many interesting counterfactual expectations with
sufficient accuracy. This problem cannot be solved without answering the
confidence interval question: given data collected with a certain level of
randomization, with which accuracy can we estimate a given counterfactual
expectation?

4.4 Confidence Intervals 置信区间

At first sight, we can invoke the law of large numbers and write

Y* = ∫_ω ℓ(ω) w(ω) P(ω) ≈ (1/n) ∑_{i=1}^{n} ℓ(ω_i) w_i.   (7)

For sufficiently
large n, the central limit theorem provides confidence intervals whose width
grows with the standard deviation of the product ℓ(ω)w(ω).

Unfortunately, when P(ω) is small, the reweighting ratio w(ω) takes large values with low
probability. This heavy tailed distribution has annoying consequences because
the variance of the integrand could be very high or infinite. When the variance
is infinite, the central limit theorem does not hold. When the variance is
merely very large, the central limit convergence might occur too slowly to
justify such confidence intervals. Importance sampling works best when the
actual distribution and the counterfactual distribution overlap.

When the counterfactual distribution has significant mass in domains where the actual
distribution is small, the few samples available in these domains receive very
high weights. Their noisy contribution dominates the reweighted estimate (7). We
can obtain better confidence intervals by eliminating these few samples drawn
in poorly explored domains. The resulting bias can be bounded using prior
knowledge, for instance with an assumption about the range of values taken by
ℓ(ω),

∀ω  ℓ(ω) ∈ [0, M].   (8)

Let us choose the maximum value R deemed acceptable for the weights. We have obtained
very consistent results in practice with R equal to the fifth largest
reweighting ratio observed on the empirical data. We can then rely on clipped
weights to eliminate the contribution of the poorly explored domains,

w̄(ω) = w(ω) if P*(ω) < R P(ω), and w̄(ω) = 0 otherwise.

The condition P*(ω) < R P(ω) ensures that the ratio has a nonzero denominator P(ω) and is
smaller than R. Let Ω_R be the set of all values of ω associated with acceptable
ratios:

Ω_R = { ω : P*(ω) < R P(ω) }.

We can decompose Y* in two terms:

Y* = ∫_{ω∈Ω_R} ℓ(ω) P*(ω) + ∫_{ω∈Ω∖Ω_R} ℓ(ω) P*(ω) = Ȳ* + (Y* − Ȳ*).   (9)

7. This is in fact a slight abuse because the theory calls for choosing R before seeing the
data.


The first term of this decomposition is the clipped expectation Ȳ*. Estimating the clipped
expectation Ȳ* is much easier than estimating Y* from (7) because the
clipped weights w̄(ω) are bounded by R.

Ȳ* = ∫_{ω∈Ω_R} ℓ(ω) P*(ω) = ∫_ω ℓ(ω) w̄(ω) P(ω) ≈ Ŷ* = (1/n) ∑_{i=1}^{n} ℓ(ω_i) w̄(ω_i).   (10)

The second term of Equation (9) can be bounded by leveraging assumption (8). The resulting
bound can then be conveniently estimated using only the clipped weights.

Y* − Ȳ* = ∫_{ω∈Ω∖Ω_R} ℓ(ω) P*(ω) ∈ [ 0, M P*(Ω∖Ω_R) ] = [ 0, M(1 − W̄*) ]   with

W̄* = P*(Ω_R) = ∫_{ω∈Ω_R} P*(ω) = ∫_ω w̄(ω) P(ω) ≈ Ŵ* = (1/n) ∑_{i=1}^{n} w̄(ω_i).   (11)

Since the clipped weights are bounded, the estimation errors associated with (10) and
(11) are well characterized using either the central limit theorem or using
empirical Bernstein bounds (see Appendix B for details). Therefore we can derive
an outer confidence interval of the form

P{ Ŷ* − ε_R ≤ Ȳ* ≤ Ŷ* + ε_R } ≥ 1 − δ   (12)

and an inner confidence interval of the form

P{ Ȳ* ≤ Y* ≤ Ȳ* + M(1 − Ŵ* + ξ_R) } ≥ 1 − δ.   (13)

The names inner and outer are in fact related to our preferred way to visualize these
intervals (e.g., Figure 13). Since the bounds on Y* − Ȳ* can be written as

Ȳ* ≤ Y* ≤ Ȳ* + M(1 − W̄*),   (14)

we can derive our final confidence interval,

P{ Ŷ* − ε_R ≤ Y* ≤ Ŷ* + M(1 − Ŵ* + ξ_R) + ε_R } ≥ 1 − δ.   (15)

In conclusion, replacing the unbiased importance sampling estimator (7) by the
clipped importance sampling estimator (10) with a suitable choice of R leads
to improved confidence intervals. Furthermore, since the derivation of these
confidence intervals does not rely on the assumption that P(ω) is nonzero
everywhere, the clipped importance sampling estimator remains valid when the
distribution P(ω) has a limited support. This relaxes the main restriction
associated with importance sampling.

4.5 Interpreting the Confidence Intervals 解释置信区间

The estimation of the counterfactual expectation Y* can be inaccurate because the
sample size is insufficient or because the sampling distribution P(ω) does not
sufficiently explore the counterfactual conditions of interest.

By construction, the clipped expectation Ȳ* ignores the domains poorly explored by
the sampling distribution P(ω). The difference Y* − Ȳ* then reflects the
inaccuracy resulting from a lack of exploration. Therefore, assuming that the
bound R has been chosen competently, the relative sizes of the outer and inner
confidence intervals provide precious cues to determine whether we can continue
collecting data using the same experimental setup or should adjust the data
collection experiment in order to obtain a better coverage.

• The inner confidence interval (13) witnesses the uncertainty associated with the domain
Ω∖Ω_R insufficiently explored by the actual distribution. A large inner confidence
interval suggests that the most practical way to improve the estimate is to
adjust the data collection experiment in order to obtain a better coverage of
the counterfactual conditions of interest.

• The outer confidence interval (12) represents the uncertainty that results from the
limited sample size. A large outer confidence interval indicates that the sample
is too small. To improve the result, we simply need to continue collecting data
using the same experimental setup.

4.6 Experimenting with Mainline Reserves 主线储备试验

We now return to the ad placement problem to illustrate the reweighting approach and the
interpretation of the confidence intervals. Manipulating the reserves R_p(x)
associated with the mainline positions (Figure 1) controls which ads are
prominently displayed in the mainline or displaced into the sidebar.

We seek in
this section to answer counterfactual questions of the form:

"How
would the ad placement system have performed if we had scaled the mainline
reserves by a constant factor ρ, without incurring user or advertiser
reactions?"


Randomization was introduced using a modified version of the ad placement engine. Before
determining the ad layout (see Section 2.1), a random number ε is drawn
according to the standard normal distribution N(0, 1), and all the mainline
reserves are multiplied by m = ρ exp(−σ²/2 + σε). Such multipliers follow a
log-normal distribution whose mean is ρ and whose width is controlled by
σ. This effectively provides a parametrization of the conditional score
distribution P(q|x, a) (see Figure 5).

The Bing search platform offers many ways to select traffic for controlled experiments
(Section 2.2). In order to match our isolation assumption, individual page
views were randomly assigned to traffic buckets without regard to the user
identity. The main treatment bucket was processed with mainline reserves
randomized by a multiplier drawn as explained above with ρ = 1 and σ = 0.3. With
these parameters, the mean multiplier is exactly 1, and 95% of the multipliers
are in range [0.52, 1.74]. Samples describing 22 million search result pages were
collected during five consecutive weeks.

We then use this data to estimate what would have been measured if the mainline reserve
multipliers had been drawn according to a distribution determined by
parameters ρ* and σ*. This is achieved by reweighting each sample ω_i with

w_i = P*(q_i|x_i, a_i) / P(q_i|x_i, a_i) = p(m_i; ρ*, σ*) / p(m_i; ρ, σ),

where m_i is the multiplier drawn for this sample during the data collection experiment, and
p(t; ρ, σ) is the density of the log-normal multiplier distribution.
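The reweighting ratio above depends only on the log-normal multiplier densities, so it is easy to sketch. The parameters below mirror the data collection experiment (ρ = 1, σ = 0.3) with an invented counterfactual target ρ* = 1.2; the mean importance weight should be close to 1, and the reweighted mean multiplier close to ρ*.

```python
import math
import random

def lognormal_pdf(t, rho, sigma):
    # Density of m = rho * exp(-sigma^2/2 + sigma * eps), eps ~ N(0, 1),
    # i.e., log m ~ N(log(rho) - sigma^2/2, sigma^2), so that E[m] = rho.
    mu = math.log(rho) - 0.5 * sigma * sigma
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(4)
rho, sigma = 1.0, 0.3            # data-collection parameters
rho_star, sigma_star = 1.2, 0.3  # hypothetical counterfactual parameters

multipliers = [rho * math.exp(-0.5 * sigma**2 + sigma * rng.gauss(0.0, 1.0))
               for _ in range(100000)]

# Importance weight for each logged sample.
weights = [lognormal_pdf(m, rho_star, sigma_star) / lognormal_pdf(m, rho, sigma)
           for m in multipliers]

mean_weight = sum(weights) / len(weights)
reweighted_mean = (sum(w * m for w, m in zip(weights, multipliers))
                   / len(multipliers))
```

In the real system each weight w_i would multiply the logged page metrics (mainline ads, clicks, revenue) rather than the multiplier itself.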

Figure 13 reports results obtained by varying ρ* while keeping σ* = σ. This amounts to
estimating what would have been measured if all mainline reserves had been
multiplied by ρ* while keeping the same randomization. The curves bound 95%
confidence intervals on the variations of the average number of mainline ads
displayed per page, the average number of ad clicks per page,

8. More precisely, lnN(μ, σ) with μ = −σ²/2 + log ρ.


Figure 13: Estimated variations of three performance metrics in response to mainline
reserve changes. The curves delimit 95% confidence intervals for the metrics we
would have observed if we had increased the mainline reserves by the percentage
shown on the horizontal axis. The filled areas represent the inner confidence
intervals. The hollow squares represent the metrics measured on the
experimental data. The hollow circles represent metrics measured on a second
experimental bucket with mainline reserves reduced by 18%. The filled circles
represent the metrics effectively measured on a control bucket running without
randomization.


and the average revenue per page, as functions of ρ*. The inner confidence intervals,
represented by the filled areas, grow sharply when ρ* leaves the range explored
during the data collection experiment. The average revenue per page has more
variance because a few very competitive queries command high prices.

In order to validate the accuracy of these counterfactual estimates, a second traffic
bucket of equal size was configured with mainline reserves reduced by about
18%. The hollow circles in Figure 13 represent the metrics effectively measured
on this bucket during the same time period. The effective measurements and the
counterfactual estimates match with high accuracy.

Finally, in order to measure the cost of the randomization, we also ran the unmodified ad
placement system on a control bucket. The brown filled circles in Figure 13
represent the metrics effectively measured on the control bucket during the
same time period. The randomization caused a small but statistically significant
increase of the number of mainline ads per page. The click yield and average
revenue differences are not significant.

This experiment shows that we can obtain accurate counterfactual estimates with
affordable randomization strategies. However, this nice conclusion does not
capture the true practical value of the counterfactual estimation approach.

4.7 More on Mainline Reserves 主线储备的更多信息

The main benefit of the counterfactual estimation approach is the ability to use the
same data to answer a broad range of counterfactual questions. Here are a few
examples of counterfactual questions that can be answered using data collected
using the simple mainline reserve randomization scheme described in the
previous section:

• Different variances - Instead of estimating what would have been measured if we had
increased the mainline reserves without changing the randomization variance,
that is, letting σ* = σ, we can use the same data to estimate what would have
been measured if we had also changed σ*. This provides the means to determine
which level of randomization we can afford in future experiments.

• Pointwise estimates - We often want to estimate what would have been measured if we had
set the mainline reserves to a specific value without randomization. Although
computing estimates for small values of σ* often works well enough, very small
values lead to large confidence intervals.

Let Y_ν(ρ) represent the expectation we would have observed if the multipliers m had mean
ρ and variance ν. We have then Y_ν(ρ) = E_m[E[y|m]] = E_m[Y_0(m)]. Assuming that the
pointwise value Y_0 is smooth enough for a second order development,

Y_ν(ρ) ≈ E_m[ Y_0(ρ) + (m − ρ) Y_0′(ρ) + (m − ρ)² Y_0″(ρ)/2 ] = Y_0(ρ) + ν Y_0″(ρ)/2.

Although the reweighting method cannot estimate the pointwise value Y_0(ρ) directly, we can
use the reweighting method to estimate both Y_ν(ρ) and Y_2ν(ρ) with acceptable
confidence intervals and write Y_0(ρ) ≈ 2Y_ν(ρ) − Y_2ν(ρ) (Goodwin, 2011).

• Query-dependent reserves - Compare for instance the queries "car insurance" and "common cause principle" in a Web search engine. Since the advertising potential of a search varies considerably with the query, it makes sense to investigate various ways to define query-dependent reserves (Charles and Chickering, 2012).

The data collected using the simple mainline reserve randomization can also be used to estimate what would have been measured if we had increased all the mainline reserves by a query-dependent multiplier $\rho^*(x)$. This is simply achieved by reweighting each sample $\omega_i$ with

$$w_i \;=\; \frac{P^*(q_i|x_i,a_i)}{P(q_i|x_i,a_i)} \;=\; \frac{p\big(m_i;\,\rho^*(x_i),\,\bar\sigma\big)}{p\big(m_i;\,\bar\rho,\,\bar\sigma\big)}\,,$$

where $p(m;\rho,\sigma)$ denotes the density of the multiplier distribution with mean $\rho$ and standard deviation $\sigma$.
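For intuition, the extrapolation $Y_0(\rho) \approx 2\,Y_\nu(\rho) - Y_{2\nu}(\rho)$ can be checked on a toy simulator. The sketch below is purely illustrative: it replaces the ad system by a hypothetical pointwise response $Y_0(m)=m^2$, uses Gaussian rather than log-normal multipliers, and collects data at variance $2\nu$ so that reweighting toward the narrower variance $\nu$ has bounded weights.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(0)
rho, nu, n = 1.0, 0.04, 200_000

# Collect data with randomization variance 2*nu (toy stand-in for the actual
# multiplier randomization; the paper uses log-normal multipliers).
sigma_coll = math.sqrt(2 * nu)
ms = [rng.gauss(rho, sigma_coll) for _ in range(n)]
ys = [m * m for m in ms]          # hypothetical pointwise response Y0(m) = m^2

# Y_{2nu}(rho): plain average, since the data was collected at variance 2*nu.
Y_2nu = sum(ys) / n

# Y_nu(rho): importance reweighting toward the narrower N(rho, nu) target.
# The density ratio is bounded because we reweight toward a *smaller* variance.
Y_nu = sum(y * gauss_pdf(m, rho, math.sqrt(nu)) / gauss_pdf(m, rho, sigma_coll)
           for m, y in zip(ms, ys)) / n

# Second-order extrapolation toward the unrandomized pointwise value.
Y0_hat = 2 * Y_nu - Y_2nu   # true value: rho^2 = 1.0, since Y_nu = rho^2 + nu
```

Because the Taylor remainder vanishes for a quadratic response, the extrapolated value matches $Y_0(\rho)=\rho^2$ up to Monte Carlo noise.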

Considerably broader families of counterfactual questions can be answered using data collected with randomization schemes that explore more dimensions. For instance, in the case of the ad placement problem, we could apply an independent random multiplier for each score instead of applying a single random multiplier to the mainline reserves only. However, the more dimensions we randomize, the more data needs to be collected to effectively explore all these dimensions. Fortunately, as discussed in Section 5, the structure of the causal graph reveals many ways to leverage a priori information and improve the confidence intervals.
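The query-dependent reweighting rule can be sketched in the same toy setting. Everything below is hypothetical: Gaussian instead of log-normal multipliers, a made-up two-cluster query distribution, and a made-up outcome model; only the weight ratio $p(m_i;\rho^*(x_i),\bar\sigma)\,/\,p(m_i;\bar\rho,\bar\sigma)$ mirrors the formula above.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(1)
rho_bar, sigma_bar, n = 1.0, 0.2, 200_000

# Hypothetical query-dependent counterfactual multiplier mean rho*(x):
# the "commercial" cluster gets higher reserves than the "rare" cluster.
def rho_star(x):
    return 1.2 if x == "commercial" else 0.95

samples = []
for _ in range(n):
    x = "commercial" if rng.random() < 0.5 else "rare"
    m = rng.gauss(rho_bar, sigma_bar)            # actual randomized multiplier
    y = max(0.0, 2.0 - m) + rng.gauss(0, 0.1)    # toy outcome: higher reserve, fewer clicks
    samples.append((x, m, y))

# Reweight each sample by the counterfactual-to-actual multiplier density ratio.
Y_star = sum(y * gauss_pdf(m, rho_star(x), sigma_bar) / gauss_pdf(m, rho_bar, sigma_bar)
             for x, m, y in samples) / n   # outcome under query-dependent reserves
```

Under this toy model the counterfactual expectation is $0.5\,(2-1.2) + 0.5\,(2-0.95) = 0.925$, below the actual value of $1.0$, because the commercial cluster receives the larger reserve increase.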

4.8 Related Work


Importance sampling is widely used to deal with covariate shifts (Shimodaira, 2000; Sugiyama et al., 2007). Since manipulating the causal graph changes the data distribution, such an intervention can be viewed as a covariate shift amenable to importance sampling. Importance sampling techniques have also been proposed without causal interpretation for many of the problems that we view as causal inference problems. In particular, the work presented in this section is closely related to the Monte-Carlo approach of reinforcement learning (Sutton and Barto, 1998, Chapter 5) and to the offline evaluation of contextual bandit policies (Li et al., 2010, 2011).

Reinforcement learning research traditionally focuses on control problems with relatively small discrete state spaces and long sequences of observations. This focus reduces the need for characterizing exploration with tight confidence intervals. For instance, Sutton and Barto suggest normalizing the importance sampling estimator by $1/\sum_i w(\omega_i)$ instead of $1/n$. This would give erroneous results when the data collection distribution leaves parts of the state space poorly explored. Contextual bandits are traditionally formulated with a finite set of discrete actions. For instance, Li's (2011) unbiased policy evaluation assumes that the data collection policy always selects an arbitrary action with probability greater than some small constant. This is not possible when the action space is infinite.

Such assumptions on the data collection distribution are often impractical. For instance, certain ad placement policies are not worth exploring because they cannot be implemented efficiently or are known to elicit fraudulent behaviors. There are many practical situations in which one is only interested in limited aspects of the ad placement policy involving continuous parameters such as click prices or reserves. Discretizing such parameters eliminates useful a priori knowledge: for instance, if we slightly increase a reserve, we can reasonably believe that we are going to show slightly fewer ads.

Instead of making assumptions on the data collection distribution, we construct a biased estimator (10) and bound its bias. We then interpret the inner and outer confidence intervals as resulting from a lack of exploration or an insufficient sample size.


Finally, the causal framework allows us to easily formulate counterfactual questions that pertain to the practical ad placement problem and yet differ considerably in complexity and exploration requirements. We can address specific problems identified by the engineers without incurring the risks associated with a complete redesign of the system. Each of these incremental steps helps demonstrate the soundness of the approach.

5. Structure


This section shows how the structure of the causal graph reveals many ways to leverage a priori knowledge and improve the accuracy of our counterfactual estimates. Displacing the reweighting point (Section 5.1) improves the inner confidence interval and therefore reduces the need for exploration. Using a prediction function (Section 5.2) essentially improves the outer confidence interval and therefore reduces the sample size requirements.

5.1 Better Reweighting Variables


Many search result pages come without eligible ads. We then know with certainty that such pages will have zero mainline ads, receive zero clicks, and generate zero revenue. This is true for the randomly selected value of the reserve, and this would have been true for any other value of the reserve. We can exploit this knowledge by pretending that the reserve was drawn from the counterfactual distribution $P^*(q|x_i,a_i)$ instead of the actual distribution $P(q|x_i,a_i)$. The ratio $w(\omega_i)$ is therefore forced to unity. This does not change the estimate but reduces the size of the inner confidence interval. The results of Figure 13 were in fact helped by this little optimization.

There are in fact many circumstances in which the observed outcome would have been the same for other values of the randomized variables. This prior knowledge is in fact encoded in the structure of the causal graph and can be exploited in a more systematic manner. For instance, we know that users make click decisions without knowing which scores were computed by the ad placement engine. The causal graph encodes this knowledge by showing the clicks $y$ as direct effects of the user intent $u$ and the ad slate $s$. This implies that the exact value of the scores $q$ does not matter to the clicks $y$ as long as the ad slate $s$ remains the same.

Because the causal graph has this special structure, we can simplify both the actual and counterfactual Markov factorizations (2) and (3) without eliminating the variable $y$ whose expectation is sought. Successively eliminating the variables $z$, $c$, and $q$ gives:

$$P(u,v,x,a,b,s,y) \;=\; P(u,v)\, P(x|u)\, P(a|x,v)\, P(b|x,v)\, P(s|x,a,b)\, P(y|s,u)\,,$$
$$P^*(u,v,x,a,b,s,y) \;=\; P(u,v)\, P(x|u)\, P(a|x,v)\, P(b|x,v)\, P^*(s|x,a,b)\, P(y|s,u)\,.$$

The conditional distributions $P(s|x,a,b)$ and $P^*(s|x,a,b)$ did not originally appear in the Markov factorization. They are defined by marginalization as a consequence of the elimination of the variable $q$ representing the scores:

$$P(s|x,a,b) \;=\; \int_q P(s|a,q,b)\, P(q|x,a)\,, \qquad P^*(s|x,a,b) \;=\; \int_q P(s|a,q,b)\, P^*(q|x,a)\,.$$


Figure 14: Estimated variations of two performance metrics in response to mainline reserve changes. These estimates were obtained using the ad slates $s$ as reweighting variable. Compare the inner confidence intervals with those shown in Figure 13.

We can estimate the counterfactual click yield $Y^*$ using these simplified factorizations:

$$Y^* \;=\; \int_\omega y\; P^*(u,v,x,a,b,s,y) \;=\; \int_\omega y\; \frac{P^*(s|x,a,b)}{P(s|x,a,b)}\; P(u,v,x,a,b,s,y) \;\approx\; \frac{1}{n}\sum_{i=1}^{n} y_i\, \frac{P^*(s_i|x_i,a_i,b_i)}{P(s_i|x_i,a_i,b_i)}\,. \tag{16}$$

We have reproduced the experiments described in Section 4.6 with the counterfactual estimate (16) instead of (4). For each example $\omega_i$, we determine which range $[m_i^{\min}, m_i^{\max}]$ of mainline reserve multipliers could have produced the observed ad slate $s_i$, and then compute the reweighting ratio using the formula:

$$w_i \;=\; \frac{P^*(s_i|x_i,a_i,b_i)}{P(s_i|x_i,a_i,b_i)} \;=\; \frac{\Psi(m_i^{\max};\rho^*,\sigma^*) \,-\, \Psi(m_i^{\min};\rho^*,\sigma^*)}{\Psi(m_i^{\max};\bar\rho,\bar\sigma) \,-\, \Psi(m_i^{\min};\bar\rho,\bar\sigma)}\,,$$

where $\Psi(m;\rho,\sigma)$ is the cumulative distribution function of the log-normal multiplier distribution. Figure 14 shows counterfactual estimates obtained using the same data as Figure 13. The obvious improvement of the inner confidence intervals significantly extends the range of mainline reserve multipliers for which we can compute accurate counterfactual expectations using this same data.

Figure 16: A distribution on the scores $q$ induces a distribution on the possible ad slates $s$. If the observed slate is slate2, the reweighting ratio is 34/22.
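A minimal sketch of these interval weights, assuming multipliers of the form $m = \rho\,e^{\sigma\varepsilon - \sigma^2/2}$ with $\varepsilon \sim N(0,1)$ so that $\Psi$ has a closed form via the Gaussian CDF; this mean-$\rho$ parametrization, the interval bounds, and the parameter values are all illustrative assumptions.

```python
import math

def Psi(m, rho, sigma):
    """CDF of a log-normal multiplier m = rho * exp(sigma*eps - sigma^2/2),
    eps ~ N(0,1). This mean-rho parametrization is an assumption of the sketch."""
    z = (math.log(m / rho) + sigma * sigma / 2) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def slate_weight(m_min, m_max, rho_cf, rho_act, sigma):
    """Probability ratio of observing a slate compatible with multipliers in
    [m_min, m_max] under counterfactual vs. actual multiplier distributions."""
    num = Psi(m_max, rho_cf, sigma) - Psi(m_min, rho_cf, sigma)
    den = Psi(m_max, rho_act, sigma) - Psi(m_min, rho_act, sigma)
    return num / den

# Identical distributions give weight 1; raising the counterfactual mean makes
# slates produced by high multipliers more likely, so their weight exceeds 1.
w_same = slate_weight(1.1, 1.3, 1.0, 1.0, 0.1)
w_up   = slate_weight(1.1, 1.3, 1.1, 1.0, 0.1)
```

The weight depends only on the multiplier interval compatible with the observed slate, not on the exact multiplier drawn, which is what shrinks the inner confidence interval.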

Comparing (4) and (16) makes the difference very clear: instead of computing the ratio of the probabilities of the observed scores under the counterfactual and actual distributions, we compute the ratio of the probabilities of the observed ad slates under the counterfactual and actual distributions. As illustrated by Figure 15, we now distinguish the reweighting variable (or variables) from the intervention. In general, the corresponding manipulation of the Markov factorization consists of marginalizing out all the variables that appear on the causal paths connecting the point of intervention to the reweighting variables and factoring all the independent terms out of the integral. This simplification works whenever the reweighting variables intercept all the causal paths connecting the point of intervention to the measurement variable. In order to compute the new reweighting ratios, all the factors remaining inside the integral, that is, all the factors appearing on the causal paths connecting the point of intervention to the reweighting variables, have to be known.

Figure 14 does not report the average revenue per page because the revenue $z$ also depends on the scores $q$ through the click prices $c$. This causal path is not intercepted by the ad slate variable $s$ alone. However, we can introduce a new variable $\tilde c = f(c,y)$ that filters out the click prices computed for ads that did not receive a click. Markedly improved revenue estimates are then obtained by reweighting according to the joint variable $(s,\tilde c)$.

Figure 16 illustrates the same approach applied to the simultaneous randomization of all the scores $q$ using independent log-normal multipliers. The weight $w(\omega_i)$ is the ratio of the probabilities of the observed ad slate $s_i$ under the counterfactual and actual multiplier distributions. Computing these probabilities amounts to integrating a multivariate Gaussian distribution (Genz, 1992). Details will be provided in a forthcoming publication.

5.2 Variance Reduction with Predictors


Although we do not know exactly how the variable of interest $\ell(\omega)$ depends on the measurable variables and is affected by interventions on the causal graph, we may have strong a priori knowledge about this dependency. For instance, if we replace the ad slate $s$ with an ad slate containing an ad that usually receives a lot of clicks, we can expect an increase of the number of clicks.

Let the invariant variables $\upsilon$ be all observed variables that are not direct or indirect effects of variables affected by the intervention under consideration. This definition implies that the distribution of the invariant variables is not affected by the intervention. Therefore the values $\upsilon_i$ of the invariant variables sampled during the actual experiment are also representative of the distribution of the invariant variables under the counterfactual conditions.

We can leverage a priori knowledge to construct a predictor $\zeta(\omega)$ of the quantity $\ell(\omega)$ whose counterfactual expectation $Y^*$ is sought. We assume that the predictor $\zeta(\omega)$ depends only on the invariant variables or on variables that depend on the invariant variables through known functional dependencies. Given sampled values $\upsilon_i$ of the invariant variables, we can replay both the original and manipulated structural equation model as explained in Section 4.1 and obtain samples $\zeta_i$ and $\zeta_i^*$ that respectively follow the actual and counterfactual distributions.

Then, regardless of the quality of the predictor,

$$Y^* \;=\; \int_\omega \ell(\omega)\, P^*(\omega) \;=\; \int_\omega \zeta(\omega)\, P^*(\omega) \;+\; \int_\omega \big(\ell(\omega)-\zeta(\omega)\big)\, P^*(\omega) \;\approx\; \frac{1}{n}\sum_{i=1}^{n} \zeta_i^* \;+\; \frac{1}{n}\sum_{i=1}^{n} \big(\ell(\omega_i)-\zeta_i\big)\, w(\omega_i)\,. \tag{17}$$

The first term in this sum represents the counterfactual expectation of the predictor and can be accurately estimated by averaging the simulated counterfactual samples $\zeta_i^*$ without resorting to potentially large importance weights. The second term represents the counterfactual expectation of the residuals $\ell(\omega)-\zeta(\omega)$ and must be estimated using importance sampling. Since the magnitude of the residuals is hopefully smaller than that of $\ell(\omega)$, the variance of $(\ell(\omega)-\zeta(\omega))\,w(\omega)$ is reduced and the importance sampling estimator of the second term has improved confidence intervals. The more accurate the predictor $\zeta(\omega)$, the more effective this variance reduction strategy.
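The estimator (17) can be sketched on a toy structural equation model. All functional forms below are hypothetical stand-ins; the point is that a deliberately biased predictor does not bias the final estimate, because the reweighted residual term corrects it.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(2)
theta_act, theta_cf, sigma, n = 1.0, 1.2, 0.3, 200_000

# Actual experiment: randomized parameter m, noisy outcome l(omega) = m^2 + noise.
ms = [rng.gauss(theta_act, sigma) for _ in range(n)]
ls = [m * m + rng.gauss(0, 0.1) for m in ms]

# Imperfect predictor built from prior knowledge (deliberately biased by -0.1).
def zeta(m):
    return m * m - 0.1

# Replay: reuse the invariant exogenous noise (common random numbers) to draw
# counterfactual parameter values and the predictor samples zeta_i^*.
zeta_cf = [zeta(m + (theta_cf - theta_act)) for m in ms]

# Estimator (17): counterfactual predictor mean plus reweighted residuals.
w = [gauss_pdf(m, theta_cf, sigma) / gauss_pdf(m, theta_act, sigma) for m in ms]
Y_star = (sum(zeta_cf) / n
          + sum((l - zeta(m)) * wi for m, l, wi in zip(ms, ls, w)) / n)
# True counterfactual expectation here: theta_cf^2 + sigma^2 = 1.53
```

The predictor-only average settles near 1.43 because of its built-in bias; adding the reweighted residual term recovers the unbiased value 1.53, with most of the variance carried by the low-variance first term.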

This variance reduction technique is in fact identical to the doubly robust contextual bandit evaluation technique of Dudík et al. (2012). Doubly robust variance reduction has also been extensively used for causal inference applied to biostatistics (see Robins et al., 2000; Bang and Robins, 2005). We subjectively find that viewing the predictor as a component of the causal graph (Figure 17) clarifies how a well designed predictor can leverage prior knowledge. For instance, in order to estimate the counterfactual performance of the ad placement system, we can easily use a predictor that runs the ad auction and simulates the user clicks using a click probability model trained offline.


Figure 17: Leveraging a predictor. Yellow nodes represent known functional relations in the structural equation model. We can estimate the counterfactual expectation $Y^*$ of the number of clicks per page as the sum of the counterfactual expectations of a predictor $\zeta$, which is easy to estimate by replaying empirical data, and $y-\zeta$, which has to be estimated by importance sampling but has reduced variance.

Figure 18: The two plots show the hourly click yield for two variants of the ad placement engine. The daily variations dwarf the differences between the two treatments.


5.3 Invariant Predictors


In order to evaluate which of two interventions is most likely to improve the system, the designer of a learning system often seeks to estimate a counterfactual difference, that is, the difference $Y^+ - Y^*$ of the expectations of a same quantity $\ell(\omega)$ under two different counterfactual distributions $P^+(\omega)$ and $P^*(\omega)$. These expectations are often affected by variables whose value is left unchanged by the interventions under consideration. For instance, seasonal effects can have a very large impact on the number of ad clicks (Figure 18) but affect $Y^+$ and $Y^*$ in similar ways.

Substantially better confidence intervals on the difference $Y^+ - Y^*$ can be obtained using an invariant predictor, that is, a predictor function that depends only on invariant variables $\upsilon$ such as the time of the day. Since the invariant predictor $\zeta(\upsilon)$ is not affected by the interventions under consideration,

$$\int_\omega \zeta(\upsilon)\, P^*(\omega) \;=\; \int_\omega \zeta(\upsilon)\, P^+(\omega)\,. \tag{18}$$

Therefore

$$Y^+ - Y^* \;=\; \int_\omega \zeta(\upsilon)\, P^+(\omega) \;+\; \int_\omega \big(\ell(\omega)-\zeta(\upsilon)\big)\, P^+(\omega) \;-\; \int_\omega \zeta(\upsilon)\, P^*(\omega) \;-\; \int_\omega \big(\ell(\omega)-\zeta(\upsilon)\big)\, P^*(\omega)$$
$$\approx\;\; \frac{1}{n}\sum_{i=1}^{n} \big(\ell(\omega_i)-\zeta(\upsilon_i)\big)\; \frac{P^+(\omega_i)-P^*(\omega_i)}{P(\omega_i)}\,.$$

This direct estimate of the counterfactual difference $Y^+ - Y^*$ benefits from the same variance reduction effect as (17) without requiring estimates of the expectations (18). Appendix C provides details on the computation of confidence intervals for estimators of counterfactual differences. Appendix D shows how the same approach can be used to compute counterfactual derivatives that describe the response of the system to very small interventions.
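A toy sketch of the direct difference estimate, with a hypothetical outcome dominated by a large invariant seasonal term in the spirit of Figure 18. Subtracting the invariant predictor cancels the seasonal variance; the naive reweighted difference is computed alongside for comparison. Every functional form and constant is an assumption of the sketch.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(3)
theta_act, theta_plus, theta_star, sigma, n = 1.0, 1.1, 0.9, 0.3, 200_000

data = []
for _ in range(n):
    season = rng.gauss(0, 5.0)            # invariant variable: large seasonal swings
    m = rng.gauss(theta_act, sigma)       # randomized parameter
    l = season + m + rng.gauss(0, 0.1)    # outcome dominated by the seasonal term
    data.append((season, m, l))

def zeta(season):                          # invariant predictor: seasonal part only
    return season

def weight_diff(m):
    return ((gauss_pdf(m, theta_plus, sigma) - gauss_pdf(m, theta_star, sigma))
            / gauss_pdf(m, theta_act, sigma))

# Direct difference estimate: the invariant predictor cancels between the two
# counterfactuals, leaving only reweighted residuals.
terms_pred = [(l - zeta(s)) * weight_diff(m) for s, m, l in data]
# Naive version without the predictor: same expectation, much larger variance.
terms_naive = [l * weight_diff(m) for s, m, l in data]

est = sum(terms_pred) / n   # true difference here: theta_plus - theta_star = 0.2

def spread(ts):
    mu = sum(ts) / len(ts)
    return math.sqrt(sum((t - mu) ** 2 for t in ts) / len(ts))
```

Both term sequences average to the same difference, but the seasonal term inflates the spread of the naive terms by roughly the seasonal standard deviation.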

6. Learning


The previous sections deal with the identification and the measurement of interpretable signals that can justify the actions of human decision makers. These same signals can also justify the actions of machine learning algorithms. This section explains why optimizing a counterfactual estimate is a sound learning procedure.

6.1 A Learning Principle


We consider in this section interventions that depend on a parameter $\theta$. For instance, we might want to know what the performance of the ad placement engine would have been if we had used different values for the parameter $\theta$ of the click scoring model. Let $P^*_\theta(\omega)$ denote the counterfactual Markov factorization associated with this intervention, and let $Y^*_\theta$ be the counterfactual expectation of $\ell(\omega)$ under the distribution $P^*_\theta$. Figure 19 illustrates our simple learning setup. Training data is collected from a single experiment associated with an initial parameter value $\bar\theta$ chosen using prior knowledge acquired in an unspecified manner. A preferred parameter value $\theta^*$ is then determined using the training data and loaded into the system. The goal is of course to observe a good performance on data collected during a test period that takes place after the switching point.


Figure 19: Single design - A preferred parameter value $\theta^*$ is determined using randomized data collected in the past. Test data is collected after loading $\theta^*$ into the system.

The isolation assumption introduced in Section 3.2 states that the exogenous variables are drawn from an unknown but fixed joint probability distribution. This distribution induces a joint distribution on all the variables $\omega$ appearing in the structural equation model associated with the parameter $\theta$. Therefore, if the isolation assumption remains valid during the test period, the test data follows the same distribution $P^*_{\theta^*}(\omega)$ that would have been observed during the training data collection period if the system had been using parameter $\theta^*$ all along.

We can therefore formulate this problem as the optimization of the expectation $Y^*_\theta$ of the reward $\ell(\omega)$ with respect to the distribution $P^*_\theta(\omega)$,

$$\max_\theta\; Y^*_\theta \;=\; \int_\omega \ell(\omega)\, P^*_\theta(\omega)\,,$$

on the basis of a finite set of training examples $\omega_1,\dots,\omega_n$ sampled from $P(\omega)$.

However, it would be unwise to maximize the estimates obtained using approximation (5) because they could reach a maximum for a value of $\theta$ that is poorly explored by the actual distribution. As explained in Section 4.5, the gap between the upper and lower bounds of inequality (14) reveals the uncertainty associated with insufficient exploration. Maximizing an empirical estimate $\hat Y_\theta$ of the lower bound $\bar Y_\theta$ ensures that the optimization algorithm finds a trustworthy answer

$$\theta^* \;=\; \operatorname*{argmax}_\theta\; \hat Y_\theta\,. \tag{19}$$

We shall now discuss the statistical basis of this learning principle.

6.2 Uniform Confidence Intervals

As discussed in Section 4.4, inequality (14),

$$\bar Y_\theta \;\le\; Y^*_\theta \;\le\; \bar Y_\theta \,+\, M\,\big(1 - \bar W_\theta\big)\,,$$

where

$$\bar Y_\theta = \int_\omega \ell(\omega)\,\bar w(\omega)\, P(\omega) \,\approx\, \hat Y_\theta = \frac{1}{n}\sum_{i=1}^{n} \ell(\omega_i)\,\bar w(\omega_i)\,, \qquad \bar W_\theta = \int_\omega \bar w(\omega)\, P(\omega) \,\approx\, \hat W_\theta = \frac{1}{n}\sum_{i=1}^{n} \bar w(\omega_i)\,,$$

9. The idea of maximizing the lower bound may surprise readers familiar with the UCB algorithm for multi-armed bandits (Auer et al., 2002). UCB performs exploration by maximizing the upper confidence interval bound and updating the confidence intervals online. Exploration in our setup results from the active system randomization during the offline data collection. See also Section 6.4.

PETERS, ET AL.

lead to confidence intervals (15) of the form

$$\forall\,\delta>0\,,\;\; \forall\,\theta\,,\qquad P\Big\{\, \hat Y_\theta - \varepsilon_R \;\le\; Y^*_\theta \;\le\; \hat Y_\theta + M\big(1 - \hat W_\theta + \xi_R\big) + \varepsilon_R \,\Big\} \;\ge\; 1-\delta\,. \tag{20}$$

Both $\varepsilon_R$ and $\xi_R$ converge to zero in inverse proportion to the square root of the sample size $n$. They also increase at most linearly in $\log\delta^{-1}$ and depend on both the capping bound $R$ and the parameter $\theta$ through the empirical variances (see Appendix B).

Such confidence intervals are insufficient to provide guarantees for a parameter value $\theta^*$ that depends on the sample. In fact, the optimization procedure (19) is likely to select values of $\theta$ for which the inequality is violated. We therefore seek uniform confidence intervals (Vapnik and Chervonenkis, 1968), simultaneously valid for all values of $\theta$.

• When the parameter $\theta$ is chosen from a finite set $\mathcal{F}$, applying the union bound to the ordinary intervals (20) immediately gives the uniform confidence interval

$$P\Big\{\, \forall\theta\in\mathcal{F}\,,\;\; \hat Y_\theta - \varepsilon_R \;\le\; Y^*_\theta \;\le\; \hat Y_\theta + M\big(1 - \hat W_\theta + \xi_R\big) + \varepsilon_R \,\Big\} \;\ge\; 1-|\mathcal{F}|\,\delta\,.$$

• Following the pioneering work of Vapnik and Chervonenkis, a broad choice of mathematical tools has been developed to construct uniform confidence intervals when the set $\mathcal{F}$ is infinite. For instance, Appendix E leverages uniform empirical Bernstein bounds (Maurer and Pontil, 2009) and obtains the uniform confidence interval

$$P\Big\{\, \forall\theta\in\mathcal{F}\,,\;\; \hat Y_\theta - \varepsilon_R \;\le\; Y^*_\theta \;\le\; \hat Y_\theta + M\big(1 - \hat W_\theta + \xi_R\big) + \varepsilon_R \,\Big\} \;\ge\; 1-\mathcal{M}(n)\,\delta\,, \tag{21}$$

where the growth function $\mathcal{M}(n)$ measures the capacity of the family of functions

$$\big\{\, f_\theta:\omega\mapsto \ell(\omega)\,\bar w(\omega)\,,\;\; g_\theta:\omega\mapsto \bar w(\omega)\,,\;\; \forall\theta\in\mathcal{F} \,\big\}\,.$$

Many practical choices of $P(\omega)$ lead to functions $\mathcal{M}(n)$ that grow polynomially with the sample size. Because both $\varepsilon_R$ and $\xi_R$ are $O(n^{-1/2}\log\delta^{-1})$, they converge to zero with the sample size when one maintains the confidence level $1-\mathcal{M}(n)\,\delta$ equal to a predefined constant.

The interpretation of the inner and outer confidence intervals (Section 4.5) also applies to the uniform confidence interval (21). When the sample size is sufficiently large and the capping bound $R$ chosen appropriately, the inner confidence interval reflects the upper and lower bounds of inequality (14).

The uniform confidence interval therefore ensures that $Y^*_{\theta^*}$ is close to the maximum of the lower bound of inequality (14), which essentially represents the best performance that can be guaranteed using training data sampled from $P(\omega)$. Meanwhile, the upper bound of this same inequality reveals which values of $\theta$ could potentially offer better performance but have been insufficiently probed by the sampling distribution (Figure 20).
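The learning principle can be sketched as maximizing a capped importance sampling lower bound over a grid of candidate parameters. The reward model, capping bound, and grid below are all hypothetical; the sketch only illustrates how clipping the weights penalizes candidates that the data collection distribution barely explores.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(4)
theta_act, sigma, R, n = 1.0, 0.2, 10.0, 100_000

ms = [rng.gauss(theta_act, sigma) for _ in range(n)]
# Toy reward in [0, 1], maximized around m = 1.3 (hypothetical stand-in).
ls = [max(0.0, 1.0 - (m - 1.3) ** 2) for m in ms]

def lower_bound(theta):
    """Capped importance sampling estimate: a lower-bound estimate of the
    counterfactual expectation when weights are clipped at R."""
    wbar = [min(R, gauss_pdf(m, theta, sigma) / gauss_pdf(m, theta_act, sigma))
            for m in ms]
    return sum(l * w for l, w in zip(ls, wbar)) / n

grid = [0.8 + 0.1 * k for k in range(9)]        # candidate parameters 0.8 .. 1.6
best_theta = max(grid, key=lower_bound)
# Distant candidates (e.g. 1.6) may truly perform well, but their clipped
# weights make the guaranteed lower bound small: exploration there is insufficient.
```

The maximizer lands near 1.2 rather than at the true reward peak 1.3, because the lower bound only credits performance that the collected data can actually certify.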

6.3 Tuning Ad Placement Auctions

We now present an application of this learning principle to the optimization of auction tuning parameters in the ad placement engine. Despite increasingly challenging engineering difficulties, comparable optimization procedures can obviously be applied to larger numbers of tunable parameters.

Figure 20: The uniform inner confidence interval reveals where the best guaranteed $Y^*_\theta$ is reached and where additional exploration is needed.

Figure 21: Level curves associated with the average number of mainline ads per page (red curves labelled from −6% to +10%) and the average estimated advertisement value generated per page (black curves, labelled with arbitrary units ranging from 164 to 169) that would have been observed for a certain query cluster if we had changed the mainline reserves by the multiplicative factor shown on the horizontal axis, and if we had applied a squashing exponent $\alpha$ shown on the vertical axis to the estimated click probabilities $q_{i,p}(x)$.

Lahaie and McAfee (2011) propose to account for the uncertainty of the click probability estimation by introducing a squashing exponent $\alpha$ to control the impact of the estimated probabilities on the rank scores. Using the notations introduced in Section 2.1, and assuming that the estimated probability of a click on ad $i$ placed at position $p$ after query $x$ has the form $q_{ip}(x) = \gamma_p\,\beta_i(x)$ (see Appendix A), they redefine the rank-score $r_{ip}(x)$ as:

$$r_{ip}(x) \;=\; \gamma_p\; b_i\; \beta_i(x)^{\alpha}\,.$$

Using a squashing exponent $\alpha < 1$ reduces the contribution of the estimated probabilities and increases the reliance on the bids $b_i$ placed by the advertisers.

Because the squashing exponent changes the rank-score scale, it is necessary to simultaneously adjust the reserves in order to display comparable numbers of ads. In order to estimate the counterfactual performance of the system under interventions affecting both the squashing exponent and the mainline reserves, we have collected data using a random squashing exponent following a normal distribution, and a mainline reserve multiplier following a log-normal distribution as described in Section 4.6. Samples describing 12 million search result pages were collected during four consecutive weeks.

Following Charles and Chickering (2012), we consider separate squashing coefficients $\alpha_k$ and mainline reserve multipliers $\rho_k$ per query cluster $k \in \{1..K\}$, and, in order to avoid negative user or advertiser reactions, we seek the auction tuning values that maximize an estimate of the advertisement value¹⁰ subject to a global constraint on the average number of mainline ads displayed per page. Because this criterion focuses on the value received by the advertisers instead of the publisher slice of the pie, it is less likely to simply raise the prices without improving the ads. Meanwhile the constraint ensures that users are not exposed to excessive numbers of mainline ads.

We then use the collected data to estimate bounds on the counterfactual expectations of the advertisement value and the counterfactual expectation of the number of mainline ads per page. Figure 21 shows the corresponding level curves for a particular query cluster. We can then run a simple optimization algorithm and determine the optimal auction tuning parameters for each cluster subject to the global mainline footprint constraint. Appendix D describes how to estimate off-policy counterfactual derivatives that greatly help the numerical optimization.
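The per-cluster tuning search under a global mainline footprint constraint can be sketched as a small combinatorial optimization. The per-candidate value and footprint numbers below are entirely made up stand-ins for counterfactual estimates such as those of Figure 21; with only a few candidates per cluster, brute-force enumeration suffices.

```python
import itertools

# Hypothetical counterfactual estimates per query cluster: for each candidate
# auction tuning (squashing exponent alpha, reserve multiplier rho), the
# estimated advertisement value and average mainline ads per page.
estimates = {
    "commercial": [
        {"alpha": 0.8, "rho": 0.9, "value": 169.0, "ads": 1.40},
        {"alpha": 0.8, "rho": 1.1, "value": 167.0, "ads": 1.10},
        {"alpha": 1.0, "rho": 0.9, "value": 166.0, "ads": 1.35},
        {"alpha": 1.0, "rho": 1.1, "value": 164.0, "ads": 1.05},
    ],
    "rare": [
        {"alpha": 0.8, "rho": 0.9, "value": 150.0, "ads": 1.20},
        {"alpha": 0.8, "rho": 1.1, "value": 149.5, "ads": 0.95},
        {"alpha": 1.0, "rho": 0.9, "value": 151.0, "ads": 1.15},
        {"alpha": 1.0, "rho": 1.1, "value": 150.5, "ads": 0.90},
    ],
}
ads_budget = 1.16   # global constraint on the average mainline footprint

# Enumerate one candidate per cluster; keep the feasible combination that
# maximizes the total estimated advertisement value.
best_value, best_choice = float("-inf"), None
for combo in itertools.product(*estimates.values()):
    avg_ads = sum(c["ads"] for c in combo) / len(combo)   # equal-traffic clusters
    total_value = sum(c["value"] for c in combo)
    if avg_ads <= ads_budget and total_value > best_value:
        best_value, best_choice = total_value, dict(zip(estimates, combo))
```

The constraint lets the high-value cluster spend more of the shared footprint: the best feasible choice pairs an aggressive setting on "commercial" with a conservative one on "rare".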

The obvious alternative (see Charles and Chickering, 2012) consists of replaying the auctions with different parameters and simulating the user with a click probability model. However, it may be unwise to rely on a click probability model to estimate the best value of a squashing coefficient that is expected to compensate for the uncertainty of the click prediction model itself. The counterfactual approach described here avoids the problem because it does not rely on a click prediction model to simulate users. Instead it estimates the counterfactual performance of the system using the actual behavior of the users collected under moderate randomization.

6.4 Sequential Design


Confidence intervals computed after a first randomized data collection experiment might not offer sufficient accuracy to choose a final value of the parameter $\theta$. It is generally unwise to simply collect additional samples using the same experimental setup because the current data already reveals information (Figure 20) that can be used to design a better data collection experiment. Therefore, it seems natural to extend the learning principle discussed in Section 6.1 to a sequence of data collection experiments. The parameter $\theta_t$ characterizing the $t$-th experiment is then determined using samples collected during the previous experiments (Figure 22).

10. The value of an ad click seen from the advertiser's perspective. The advertiser payment then splits the advertisement value between the publisher and the advertiser.


Figure 22: Sequential design - The parameter $\theta_t$ of each data collection experiment is determined using data collected during the previous experiments.

Although it is relatively easy to construct convergent sequential design algorithms, reaching the optimal learning performance is notoriously difficult (Wald, 1945) because the selection of the parameter $\theta_t$ involves a trade-off between exploitation, that is, the maximization of the immediate reward $Y^*_{\theta_t}$, and exploration, that is, the collection of samples potentially leading to better $Y^*_\theta$ in the more distant future.

The optimal exploration-exploitation trade-off for multi-armed bandits is well understood (Gittins, 1989; Auer et al., 2002; Audibert et al., 2007) because an essential property of multi-armed bandits makes the analysis much simpler: the outcome observed after performing a particular action brings no information about the value of the other actions. Such an assumption is both unrealistic and pessimistic. For instance, the outcome observed after displaying a certain ad in response to a certain query brings very useful information about the value of displaying similar ads in response to similar queries.

Refined contextual bandit approaches (Slivkins, 2011) account for similarities in the context and action spaces but do not take advantage of all the additional opportunities expressed by structural equation models. For instance, in the contextual bandit formulation of the ad placement problem outlined in Section 3.5, actions are pairs $(s,c)$ describing the ad slate $s$ and the corresponding click prices $c$, policies select actions by combining individual ad scores in very specific ways, and actions determine the rewards through very specific mechanisms.

Meanwhile, despite their suboptimal asymptotic properties, heuristic exploration strategies perform surprisingly well during the time span in which the problem can be considered stationary. Even in the simple case of multi-armed bandits, excellent empirical results have been obtained using Thompson sampling (Chapelle and Li, 2011) or fixed strategies (Vermorel and Mohri, 2005; Kuleshov and Precup, 2010). Leveraging the problem structure seems more important in practice than perfecting an otherwise sound exploration strategy.

Therefore, in the absence of sufficient theoretical guidance, it is both expedient and practical to maximize $\hat Y_\theta$ at each round, as described in Section 6.1, subject to sufficient randomization to support the estimation needs of the next round.

7. Equilibrium Analysis


All the methods discussed in this contribution rely on the isolation assumption presented in Section 3.2. This assumption lets us interpret the samples as repeated independent trials that follow the pattern defined by the structural equation model and are amenable to statistical analysis.

The isolation assumption is in fact a component of the counterfactual conditions under investigation. For instance, in Section 4.6, we model single auctions (Figure 3) in order to empirically determine how the ad placement system would have performed if we had changed the mainline reserves without incurring a reaction from the users or the advertisers.

Since the
future publisher revenues depend on the continued satisfaction of users and
advertisers, lifting this restriction is highly desirable.

• We can in principle work with larger structural equation models. For instance, Figure 4 suggests augmenting the single-auction model with variables representing the impact of the displayed ads on the future user goodwill. However, there are practical limits on the number of trials we can consider at once. For instance, it is relatively easy to simultaneously model all the auctions associated with the Web pages served to the same user during a thirty minute Web session. On the other hand, it is practically impossible to consider several weeks worth of auctions in order to model their accumulated effect on the continued satisfaction of users and advertisers.

• We can sometimes use problem-specific knowledge to construct alternate performance metrics that anticipate the future effects of the feedback loops. For instance, Section 6.3 optimizes the advertisement value instead of the immediate revenue. Since this alternative criterion takes the advertiser interests into account, it can be viewed as a heuristic proxy for the future revenues of the publisher.

This section proposes an alternative way to account for such feedback loops using the quasistatic equilibrium method familiar to physicists: we assume that the publisher changes the parameter $\theta$ so slowly that the system remains at equilibrium at all times. Using data collected while the system was at equilibrium, we describe empirical methods to determine how an infinitesimal intervention $d\theta$ on the model parameters would have displaced the equilibrium:

"How would the system have performed during the data collection period if a small change $d\theta$ had been applied to the model parameter $\theta$ and the equilibrium had been reached before the data collection period?"

A learning
algorithm can then update θ to improve selected performance metrics.

7.1 Rational Advertisers

The ad placement system is an example of a game where each actor furthers his or her interests by controlling some aspects of the system: the publisher controls the placement engine parameters, the advertisers control the bids, and the users control the clicks.

As an example of the general quasi-static approach, this section focuses on the reaction of rational advertisers to small changes of the scoring functions driving the ad placement system. Rational advertisers always select bids that maximize their economic interests. Although there are more realistic ways to model advertisers, this exercise is interesting because the auction theory approaches also rely on the rational advertiser assumption (see Section 2.1). This analysis seamlessly integrates the auction theory and machine learning perspectives.

As illustrated in Figure 23, we treat the bid vector $b^\star = (b_1 \dots b_A) \in [0, b_{\max}]^A$ as the parameter of the conditional distribution $P_{b^\star}(b\,|\,x,v)$ of the bids associated with the eligible ads.

11. Quantities measured when a feedback causal system reaches equilibrium often display conditional independence patterns that cannot be represented with directed acyclic graphs (Lauritzen and Richardson, 2002; Dash, 2003). Treating the feedback loop as parameters instead of variables works around this difficulty in a manner that appears sufficient to perform the quasi-static analysis.


Figure 23:
Advertisers select the bid amounts ba on the basis of the past number of clicks
ya and the past prices za observed for the corresponding ads.

Figure 24: Advertisers control the expected number of clicks $Y_a$ and expected prices $Z_a$ by their bids $b_a$. Rational advertisers select bids that maximize the difference between the value they see in the clicks and the price they pay.

The variables $y_a$ in the structural equation model represent the number of clicks received by the ads associated with bid $b_a$. The variables $z_a$ represent the amounts charged for these clicks. Rational advertisers select their bids according to their anticipated impact on the number of resulting clicks $y_a$ and on their cost $z_a$.

Following the pattern of the perfect information assumption (see Section 2.1), we assume that the advertisers eventually acquire full knowledge of the expectations

$$Y_a(\theta, b^\star) = \int_\omega y_a \, P^\theta_{b^\star}(\omega) \qquad\text{and}\qquad Z_a(\theta, b^\star) = \int_\omega z_a \, P^\theta_{b^\star}(\omega)\,.$$

Let $V_a$ denote the value of a click for advertiser $a$. Rational advertisers select bids that maximize the difference between the value they see in the clicks and the price they pay to the publisher, as illustrated in Figure 24. This is expressed by the utility functions

$$U^\theta_a(b^\star) = V_a\, Y_a(\theta, b^\star) - Z_a(\theta, b^\star)\,.$$


Following Athey and Nekipelov (2010), we argue that the injection of smooth random noise into the auction mechanism changes the discrete problem into a continuous problem amenable to standard differential methods. Mild regularity assumptions on the probability densities $P_{b^\star}(b\,|\,x,v)$ and $P_\theta(q\,|\,x,a)$ are in fact sufficient to ensure that the expectations $Y_a(\theta, b^\star)$ and $Z_a(\theta, b^\star)$ are continuously differentiable functions of the distribution parameters $b^\star$ and $\theta$. Further assuming that the utility functions $U^\theta_a(b^\star)$ are diagonally quasiconcave, Athey and Nekipelov establish the existence of a unique Nash equilibrium

$$\forall a \quad b_a \in \operatorname*{ArgMax}_{b}\; U^\theta_a(b_1, \dots, b_{a-1}, b, b_{a+1}, \dots, b_A)$$

characterized by its first order Karush-Kuhn-Tucker conditions

$$\forall a \quad V_a \frac{\partial Y_a}{\partial b_a} - \frac{\partial Z_a}{\partial b_a}
\;\begin{cases} \le 0 & \text{if } b_a = 0 \\ \ge 0 & \text{if } b_a = b_{\max} \\ = 0 & \text{if } 0 < b_a < b_{\max}. \end{cases} \qquad (22)$$

We use the first order equilibrium conditions (22) for two related purposes. Section 7.2 explains how to complete the advertiser model by estimating the values $V_a$. Section 7.3 estimates how the equilibrium bids and the system performance metrics respond to a small change $d\theta$ of the model parameters.
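To make the first-order condition (22) concrete, here is a minimal sketch of how a rational advertiser's best-response bid can be located by bisection on the derivative of the utility $U^\theta_a(b) = V_a Y_a(b) - Z_a(b)$. The function names and the smooth toy click and price curves below are illustrative assumptions, not the paper's actual system:

```python
import math

def utility(b, V, Y, Z):
    """U_a(b) = V_a * Y_a(b) - Z_a(b): value of the clicks minus the price paid."""
    return V * Y(b) - Z(b)

def best_response(V, Y, Z, b_max, eps=1e-6):
    """Locate the bid satisfying the first-order condition (22) by bisection
    on dU/db, assuming the utility is concave in the bid."""
    def dU(b, h=1e-5):
        return (utility(b + h, V, Y, Z) - utility(b - h, V, Y, Z)) / (2 * h)
    lo, hi = 0.0, b_max
    if dU(lo) <= 0:       # corner b_a = 0: derivative already non-positive
        return lo
    if dU(hi) >= 0:       # corner b_a = b_max: derivative still non-negative
        return hi
    while hi - lo > eps:  # interior case: dU changes sign, bisect
        mid = 0.5 * (lo + hi)
        if dU(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical smooth expectations: clicks saturate, price grows linearly.
Y = lambda b: 1.0 - math.exp(-b)
Z = lambda b: 0.3 * b
b_star = best_response(V=1.5, Y=Y, Z=Z, b_max=10.0)
# Interior condition: 1.5 * exp(-b) = 0.3, hence b_star = ln(5).
```

The two `if` branches mirror the inequality cases of (22): a corner solution at zero or at $b_{\max}$ when the derivative does not change sign inside the interval.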

Interestingly, this approach remains sensible when key assumptions of the equilibrium model are violated. The perfect information assumption is unlikely to hold in practice. The quasi-concavity of the utility functions is merely plausible. However, after observing the operation of the stationary ad placement system for a sufficiently long time, it is reasonable to assume that the most active advertisers have tried small bid variations and have chosen locally optimal ones. Less active advertisers may leave their bids unchanged for longer time periods, but can also update them brutally if they experience a significant change in return on investment. Therefore it makes sense to use data collected when the system is stationary to estimate advertiser values $V_a$ that are consistent with the first order equilibrium conditions. We then hope to make the system more attractive to the advertisers, by first estimating how a small change $d\theta$ displaces this posited local equilibrium, then by using performance metrics that take this displacement into account.

7.2 Estimating Advertiser Values

We first need to estimate the partial derivatives appearing in the equilibrium condition (22). These derivatives measure how the expectations $Y_a$ and $Z_a$ would have been affected by different advertiser bids. Such quantities can be estimated by randomizing the bids and computing on-policy counterfactual derivatives as explained in Appendix D. Confidence intervals can be derived with the usual tools.

Unfortunately, the publisher is not allowed to directly randomize the bids because the advertisers expect to pay prices computed using the bids they have specified and not the potentially higher bids resulting from the randomization. However, the publisher has full control over the estimated click probabilities $q_{i,p}(x)$. Since the rank-scores $r_{i,p}(x)$ are the products of the bids and the estimated click probabilities (see Section 2.1), a random multiplier applied to the bids can also be interpreted as a random multiplier applied to the estimated click probabilities. Under these two interpretations, the same ads are shown to the users, but different click prices are charged to the advertisers. Therefore,


the publisher can simultaneously charge prices computed as if the multiplier had been applied to the estimated click probabilities, and collect data as if the multiplier had been applied to the bids. This data can then be used to estimate the derivatives.
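The multiplier argument can be checked on a toy example. The sketch below, with hypothetical numbers and a deliberately simplified generalized-second-price rule, shows that a multiplier applied to the bid or to the estimated click probability yields identical rank-scores (hence the same ads shown), while the two interpretations lead to different click prices:

```python
def rank_scores(bids, q, mult):
    """Rank-scores r_i = b_i * q_i with a random multiplier m_i: applying
    m_i to the bid or to the click probability estimate gives the same scores."""
    applied_to_bid  = [(m * b) * qi for b, qi, m in zip(bids, q, mult)]
    applied_to_prob = [b * (m * qi) for b, qi, m in zip(bids, q, mult)]
    return applied_to_bid, applied_to_prob

def gsp_click_price(runner_up_score, q_effective):
    """Simplified generalized second price: the winner pays the smallest bid
    that would still beat the runner-up score, given its effective click
    probability estimate."""
    return runner_up_score / q_effective

bids, q, mult = [2.0, 1.0], [0.10, 0.15], [1.2, 0.9]
s_bid, s_prob = rank_scores(bids, q, mult)
# Same ranking either way; charging under the "multiplier on q" interpretation
# uses q_eff = m * q, which keeps the price within the stated bid.
price_on_q   = gsp_click_price(s_bid[1], mult[0] * q[0])
price_on_bid = gsp_click_price(s_bid[1], q[0])
```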

Solving the first order equilibrium equations then yields estimated advertiser values $V_a$ that are consistent with the observed data:

$$V_a \;\approx\; \frac{\partial Z_a}{\partial b_a} \bigg/ \frac{\partial Y_a}{\partial b_a}\,.$$
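A minimal sketch of this value recovery, assuming the partial derivatives have already been estimated from randomized data (the numbers below are hypothetical):

```python
def advertiser_value(dY_db, dZ_db):
    """Solve the interior first-order condition V_a * dY_a/db_a = dZ_a/db_a
    for the value V_a the advertiser assigns to one click."""
    if dY_db <= 0:
        raise ValueError("non-increasing click response: V_a not identifiable")
    return dZ_db / dY_db

# Hypothetical derivative estimates at the observed equilibrium bid.
V = advertiser_value(dY_db=0.2, dZ_db=0.3)   # -> 1.5
```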

There are however a couple of caveats:

• The advertiser bid $b_a$ may be too small to cause ads to be displayed. In the absence of data, we have no means to estimate a click value for these advertisers.

• Ads may not be displayed often enough to obtain accurate estimates of the partial derivatives $\partial Y_a / \partial b_a$ and $\partial Z_a / \partial b_a$. This can be partially remedied by smartly aggregating the data of advertisers deemed similar.

• Some advertisers place extremely high bids, hoping to pay reasonable prices thanks to the generalized second price rule. Both partial derivatives $\partial Y_a / \partial b_a$ and $\partial Z_a / \partial b_a$ are equal to zero in such cases. Therefore we cannot recover $V_a$ by solving the equilibrium Equation (22). It is however possible to collect useful data by selecting for these advertisers a maximum bid $b_{\max}$ that prevents them from monopolizing the eligible ad opportunities. Since the equilibrium condition is an inequality when $b_a = b_{\max}$, we can only determine a lower bound of the values $V_a$.

These caveats in fact underline the limitations of the advertiser modelling assumptions. When their ads are not displayed often enough, advertisers have no more chance to acquire a full knowledge of the expectations $Y_a$ and $Z_a$ than the publisher has a chance to determine their value. Similarly, advertisers that place extremely high bids are probably underestimating the risk of occasionally experiencing a very high click price. A more realistic model of the advertiser information acquisition is required to adequately handle these cases.

7.3 Estimating the Equilibrium Response

Let $\mathcal{A}$ be the set of the active advertisers, that is, the advertisers whose value can be estimated (or lower bounded) with sufficient accuracy. Assuming that the other advertisers leave their bids unchanged, we can estimate how the active advertisers adjust their bids in response to an infinitesimal change $d\theta$ of the scoring model parameters. This is achieved by differentiating the equilibrium equations (22):

$$\forall a' \in \mathcal{A}, \quad 0 = \left( V_{a'}\, \frac{\partial^2 Y_{a'}}{\partial b_{a'}\, \partial \theta} - \frac{\partial^2 Z_{a'}}{\partial b_{a'}\, \partial \theta} \right) d\theta
\;+\; \sum_{a \in \mathcal{A}} \left( V_{a'}\, \frac{\partial^2 Y_{a'}}{\partial b_{a'}\, \partial b_a} - \frac{\partial^2 Z_{a'}}{\partial b_{a'}\, \partial b_a} \right) db_a\,. \qquad (23)$$

The partial second derivatives must be estimated as described in Appendix D. Solving this linear system of equations then yields an expression of the form

$$db_a = \Xi_a\, d\theta\,.$$
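For a single active advertiser, the linear system (23) reduces to one scalar equation and $\Xi_a$ has a closed form; the total derivative (24) then combines the direct effect of $d\theta$ with the bid adjustment. A minimal sketch with hypothetical second-derivative estimates:

```python
def bid_displacement(V, d2Y_db_dtheta, d2Z_db_dtheta, d2Y_db2, d2Z_db2):
    """Single-advertiser instance of the differentiated equilibrium (23):
        0 = (V*Y_btheta - Z_btheta) dtheta + (V*Y_bb - Z_bb) db,
    solved for Xi = db/dtheta."""
    denom = V * d2Y_db2 - d2Z_db2
    if denom == 0:
        raise ZeroDivisionError("singular equilibrium system")
    return -(V * d2Y_db_dtheta - d2Z_db_dtheta) / denom

def total_response(dY_dtheta, dY_db, Xi):
    """Total derivative combining the direct effect of dtheta with the
    advertiser's bid adjustment: dY = (dY/dtheta + Xi * dY/db) dtheta."""
    return dY_dtheta + Xi * dY_db

# Hypothetical estimates; the concave term d2Y_db2 < 0 keeps the system
# safely away from the singularities discussed below.
Xi = bid_displacement(V=1.5, d2Y_db_dtheta=0.1, d2Z_db_dtheta=0.05,
                      d2Y_db2=-0.2, d2Z_db2=0.0)
```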

12. This approach is of course related to the value estimation method proposed by Athey and Nekipelov (2010) but strictly relies on the explicit randomization of the scores. In contrast, practical considerations force Athey and Nekipelov to rely on the apparent noise and hope that the noise model accounts for all potential confounding factors.


This expression can then be used to estimate how any counterfactual expectation $Y$ of interest changes when the publisher applies an infinitesimal change $d\theta$ to the scoring parameter $\theta$ and the active advertisers adjust their bids in response:

$$dY \;=\; \left( \frac{\partial Y}{\partial \theta} + \sum_{a \in \mathcal{A}} \Xi_a\, \frac{\partial Y}{\partial b_a} \right) d\theta\,. \qquad (24)$$

Although this expression provides useful information, one should remain aware of its limitations. Because we can only estimate the reaction of active advertisers, expression (24) does not include the potentially positive reactions of advertisers who did not bid but could have. Because we can only estimate a lower bound of their values, this expression does not model the potential reactions of advertisers placing unrealistically high bids. Furthermore, one needs to be very cautious when the system (23) approaches singularities. Singularities indicate that the rational advertiser assumption is no longer sufficient to determine the reactions of certain advertisers. This happens for instance when advertisers cannot find bids that deliver a satisfactory return. The eventual behavior of such advertisers then depends on factors not taken into consideration by our model.

To alleviate these issues, we could alter the auction mechanism in a manner that ensures that the system (23) remains safely nonsingular. We could also design experiments revealing the impact of the fixed costs incurred by advertisers participating in new auctions. Although additional work is needed to design such refinements, the quasistatic equilibrium approach provides a generic framework to take such aspects into account.

7.4 Discussion

The rational advertiser assumption is the cornerstone of seminal works describing simplified variants of the ad placement problem using auction theory (Varian, 2007; Edelman et al., 2007). More sophisticated works account for more aspects of the ad placement problem, such as the impact of click prediction learning algorithms (Lahaie and McAfee, 2011), the repeated nature of the ad auctions (Bergemann and Said, 2010), or the fact that advertisers place bids valid for multiple auctions (Athey and Nekipelov, 2010). Despite these advances, it seems technically very challenging to use these methods and account for all the effects that can be observed in practical ad placement systems.

We believe that our counterfactual reasoning framework is best viewed as a modular toolkit that lets us apply insights from auction theory and machine learning to problems that are far more complex than those studied in any single paper. For instance, the quasi-static equilibrium analysis technique illustrated in this section extends naturally to the analysis of multiple simultaneous causal feedback loops.

• The first step consists in designing ad hoc experiments to identify the parameters that determine the equilibrium equation of each player. In the case of the advertisers, we have shown how to use randomized scores to reveal the advertiser values. In the case of the user feedback, we must carefully design experiments that reveal how users respond to changes in the quality of the displayed ads.

• The second step consists in differentiating all the equilibrium equations to obtain a linear system of equations linking the variations of the parameter under our control, such as $d\theta$, and all the parameters under the


control of the other players, such as the advertiser bids, or the user willingness to visit the site and click on ads. Solving this system and writing the total derivative of the performance measure gives the answer to our question.

Although this programme has not yet been fully realized, the existence of a principled framework to handle such complex interactions is remarkable. Furthermore, thanks to the flexibility of the causal inference frameworks, these techniques can be infinitely adapted to various modeling assumptions and various system complexities.

8. Conclusion

Using the ad placement example, this work demonstrates the central role of causal inference (Pearl, 2000; Spirtes et al., 1993) for the design of learning systems interacting with their environment. Thanks to importance sampling techniques, data collected during randomized experiments gives precious cues to assist the designer of such learning systems and useful signals to drive learning algorithms.

Two recurrent themes structure this work. First, we maintain a sharp distinction between the learning algorithms and the extraction of the signals that drive them. Since real world learning systems often involve a mixture of human decisions and automated processes, it makes sense to separate the discussion of the learning signals from the discussion of the learning algorithms that leverage them. Second, we claim that the mathematical and philosophical tools developed for the analysis of physical systems appear very effective for the analysis of causal information systems and of their equilibria. These two themes are in fact a vindication of cybernetics (Wiener, 1948).

Acknowledgments

We would like to acknowledge extensive discussions with Susan Athey, Miroslav Dudík, Patrick Jordan, John Langford, Lihong Li, Sebastien Lahaie, Shie Mannor, Chris Meek, Alex Slivkins, and Paul Viola. We also thank the Microsoft adCenter RnR team for giving us the invaluable opportunity to deploy these ideas at scale and prove their worth. Finally we gratefully acknowledge the precious comments of our JMLR editor and reviewers.

Appendix A.

Section 2.1 describes how to select and place ads on a Web page by maximizing the total rank-score (1). Following Varian (2007) and Edelman et al. (2007), we assume that the click probability estimates are expressed as the product of a positive position term $\gamma_p$ and a positive ad term $\beta_i(x)$. The rank-scores can therefore be written as $r_{i,p}(x) = \gamma_p\, b_i\, \beta_i(x)$. We also assume that the policy constraints simply state that a Web page should not display more than one ad belonging to any given advertiser. The discrete maximization problem is then amenable to computationally efficient greedy algorithms.

Let us fix a layout $L$ and focus on the inner maximization problem. Without loss of generality, we can renumber the positions such that

$$L = \{1, 2, \dots, N\} \quad\text{and}\quad \gamma_1 \ge \gamma_2 \ge \dots \ge \gamma_N \ge 0$$


and write the inner maximization problem as

$$\max_{i_1, \dots, i_N}\; R_L(i_1, \dots, i_N) = \sum_{p \in L} r_{i_p, p}(x)$$

subject to the policy constraints and reserve constraints $r_{i,p}(x) \ge R_p(x)$.

Let $S_i$ denote the advertiser owning ad $i$. The ads are partitioned into subsets $I_s = \{i : S_i = s\}$ gathering the ads of advertiser $s$. The ads that maximize the product $b_i \beta_i(x)$ within set $I_s$ are called the best ads for advertiser $s$. If the solution of the discrete maximization problem contains an ad of advertiser $s$, this ad must be one of the best ads for advertiser $s$: were it not the case, replacing the offending ad by one of the best ads would yield a higher $R_L$ without violating any of the constraints. It is also easy to see that one could select any of the best ads for advertiser $s$ without changing $R_L$.

Let the set $I$ contain exactly one of the best ads for each advertiser. The inner maximization problem can then be simplified as:

$$\max_{i_1, \dots, i_N \in I}\; R_L(i_1, \dots, i_N) = \sum_{p \in L} \gamma_p\, b_{i_p}\, \beta_{i_p}(x)$$

where all the indices $i_1, \dots, i_N$ are distinct, and subject to the reserve constraints.

Assume that this maximization problem has a solution $i_1, \dots, i_N$, meaning that there is a feasible solution for the layout $L$. For $k = 1 \dots N$, let us define $I_k \subset I$ as

$$I_k = \operatorname*{ArgMax}_{i \in I \setminus \{i_1, \dots, i_{k-1}\}}\; b_i\, \beta_i(x)\,.$$

It is easy to see that $I_k$ intersects $\{i_k, \dots, i_N\}$ because, were it not the case, replacing $i_k$ by any element of $I_k$ would increase $R_L$ without violating any of the constraints. Furthermore it is easy to see that $i_k \in I_k$ because, were it not the case, there would be $h > k$ such that $i_h \in I_k$, and swapping $i_k$ and $i_h$ would increase $R_L$ without violating any of the constraints.

Therefore, if the inner maximization problem admits a solution, we can compute a solution by recursively picking $i_1, \dots, i_N$ from $I_1, I_2, \dots, I_N$. This can be done efficiently by first sorting the $b_i \beta_i(x)$ in decreasing order, and then greedily assigning ads to the best positions subject to the reserve constraints. This operation has to be repeated for all possible layouts, including of course the empty layout.
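The greedy procedure can be sketched as follows. This is a simplified illustration (a single fixed layout, positions pre-sorted by $\gamma_p$, one best ad kept per advertiser, and a position whose reserve cannot be met simply left empty), not the production algorithm:

```python
def place_ads(ads, gammas, reserves):
    """Greedy sketch of the inner maximization: keep the best ad of each
    advertiser (largest b_i * beta_i), sort these scores in decreasing
    order, then fill positions ordered so that gamma_1 >= gamma_2 >= ...,
    subject to the reserve constraint gamma_p * b_i * beta_i >= R_p.
    `ads` is a list of (advertiser, bid, beta) triples."""
    best = {}
    for adv, bid, beta in ads:           # policy constraint: one ad per advertiser
        score = bid * beta
        if score > best.get(adv, 0.0):
            best[adv] = score
    scores = sorted(best.values(), reverse=True)
    placement = []
    for p, (gamma, reserve) in enumerate(zip(gammas, reserves)):
        if p < len(scores) and gamma * scores[p] >= reserve:
            placement.append(scores[p])  # ad score assigned to position p
        else:
            placement.append(None)       # reserve not met: position left empty
    return placement
```

With hypothetical inputs `place_ads([("s1", 2.0, 0.1), ("s1", 1.0, 0.3), ("s2", 1.5, 0.1)], [1.0, 0.5], [0.1, 0.1])`, advertiser `s1` contributes only its best ad (score 0.3), and the second position stays empty because its reserve cannot be met.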

The same analysis can be carried out for click prediction estimates expressed as an arbitrary monotone combination of a position term $\gamma_p(x)$ and an ad term $\beta_i(x)$, as shown, for instance, by Graepel et al. (2010).

Appendix B. Confidence Intervals

Section 4.4 explains how to obtain improved confidence intervals by replacing the unbiased importance sampling estimator (7) by the clipped importance sampling estimator (10). This appendix provides details that could have obscured the main message.


B.1 Outer Confidence Interval

We first address the computation of the outer confidence interval (12) which describes how the estimator $\hat{Y}$ approaches the clipped expectation $\bar{Y}$:

$$\bar{Y} = \int_\omega \ell(\omega)\, \bar{w}(\omega)\, P(\omega) \;\approx\; \hat{Y} = \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, \bar{w}(\omega_i)\,.$$

Since the samples $\ell(\omega_i)\,\bar{w}(\omega_i)$ are independent and identically distributed, the central limit theorem (e.g., Cramér, 1946, Section 17.4) states that the empirical average $\hat{Y}$ converges in law to a normal distribution of mean $\bar{Y} = \mathbb{E}[\ell(\omega)\bar{w}(\omega)]$ and variance $V/n$ with $V = \mathrm{var}[\ell(\omega)\bar{w}(\omega)]$. Since this convergence usually occurs quickly, it is widely accepted to write

$$\mathbb{P}\left\{\, \hat{Y} - \varepsilon_R \le \bar{Y} \le \hat{Y} + \varepsilon_R \,\right\} \ge 1 - \delta\,, \quad\text{with}\quad \varepsilon_R = \mathrm{erf}^{-1}(1 - \delta)\, \sqrt{2V/n}\,,$$

and to estimate the variance $V$ using the sample variance

$$V \approx \hat{V} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \ell(\omega_i)\, \bar{w}(\omega_i) - \hat{Y} \right)^2. \qquad (25)$$
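A minimal sketch of the resulting interval computation, writing $\mathrm{erf}^{-1}(1-\delta)\sqrt{2}$ as the standard normal quantile $z_{1-\delta/2}$ (the data below are hypothetical):

```python
import statistics
from statistics import NormalDist

def outer_interval(l_values, clipped_weights, delta):
    """CLT-based outer confidence interval for the clipped estimator:
    Yhat +/- eps with eps = erfinv(1-delta) * sqrt(2*Vhat/n), computed here
    through the equivalent standard normal quantile z_{1-delta/2}."""
    x = [l * w for l, w in zip(l_values, clipped_weights)]
    n = len(x)
    y_hat = statistics.fmean(x)
    v_hat = statistics.variance(x)   # sample variance, as in (25)
    eps = NormalDist().inv_cdf(1 - delta / 2) * (v_hat / n) ** 0.5
    return y_hat - eps, y_hat + eps

# Hypothetical clicks l(omega_i) and clipped weights w(omega_i).
lo, hi = outer_interval([1, 0, 1, 0], [1.0, 1.0, 1.0, 1.0], delta=0.05)
```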

This approach works well when the ratio ceiling $R$ is relatively small. However the presence of a few very large ratios makes the variance estimation noisy and might slow down the central limit convergence.

The first remedy is to bound the variance more rigorously. For instance, the following bound results from (Maurer and Pontil, 2009, Theorem 10):

$$\mathbb{P}\left\{\, \sqrt{V} > \sqrt{\hat{V}} + (M - m)\, R\, \sqrt{\frac{2 \log(1/\delta)}{n-1}} \,\right\} \le \delta\,.$$

Combining this bound with (25) gives a confidence interval valid with probability greater than $1 - 2\delta$. Although this approach eliminates the potential problems related to the variance estimation, it does not address the potentially slow convergence of the central limit theorem.

The next remedy is to rely on empirical Bernstein bounds to derive rigorous confidence intervals that leverage both the sample mean and the sample variance (Audibert et al., 2007; Maurer and Pontil, 2009).

Theorem 1 (Empirical Bernstein bound) (Maurer and Pontil, 2009, thm 4) Let $X, X_1, X_2, \dots, X_n$ be i.i.d. random variables with values in $[a, b]$ and let $\delta > 0$. Then, with probability at least $1 - \delta$,

$$\mathbb{E}[X] - M_n \;\le\; \sqrt{\frac{2\, V_n \log(2/\delta)}{n}} + \frac{7\,(b-a)\,\log(2/\delta)}{3\,(n-1)}\,,$$

where $M_n$ and $V_n$ respectively are the sample mean and variance

$$M_n = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad V_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - M_n \right)^2.$$


Applying this theorem to both $\ell(\omega_i)\,\bar{w}(\omega_i)$ and $-\ell(\omega_i)\,\bar{w}(\omega_i)$ provides confidence intervals that hold for the worst possible distribution of the variables $\ell(\omega)$ and $w(\omega)$:

$$\mathbb{P}\left\{\, \hat{Y} - \varepsilon_R \le \bar{Y} \le \hat{Y} + \varepsilon_R \,\right\} \ge 1 - 2\delta\,, \quad\text{where}\quad \varepsilon_R = \sqrt{\frac{2\,\hat{V} \log(2/\delta)}{n}} + \frac{7\, M R\, \log(2/\delta)}{3\,(n-1)}\,. \qquad (26)$$

Because they hold for the worst possible distribution, confidence intervals obtained in this way are less tight than confidence intervals based on the central limit theorem. On the other hand, thanks to the Bernstein bound, they remain reasonably competitive, and they provide a much stronger guarantee.
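The bound of Theorem 1 is straightforward to compute. The sketch below (helper name ours) returns the one-sided deviation radius; applying it to both the samples and their negations gives the two-sided interval (26) at level $1-2\delta$:

```python
import math
import statistics

def bernstein_radius(xs, lo, hi, delta):
    """One-sided empirical Bernstein deviation (Theorem 1): with probability
    at least 1-delta,
        E[X] - mean(xs) <= sqrt(2*Vn*log(2/delta)/n)
                           + 7*(hi-lo)*log(2/delta)/(3*(n-1))."""
    n = len(xs)
    vn = statistics.variance(xs)
    log_term = math.log(2 / delta)
    return (math.sqrt(2 * vn * log_term / n)
            + 7 * (hi - lo) * log_term / (3 * (n - 1)))
```

Unlike the plain Hoeffding radius, this one shrinks toward the CLT rate when the sample variance is small, which is exactly the regime created by clipping.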

B.2 Inner Confidence Interval

Inner confidence intervals are derived from inequality (14) which bounds the difference between the counterfactual expectation $Y$ and the clipped expectation $\bar{Y}$:

$$0 \;\le\; Y - \bar{Y} \;\le\; M\,(1 - \bar{W})\,.$$

The constant $M$ is defined by assumption (8). The first step of the derivation consists in obtaining a lower bound of $\bar{W}$ using either the central limit theorem or an empirical Bernstein bound.

For instance, applying theorem 1 to $-\bar{w}(\omega_i)$ yields

$$\mathbb{P}\left\{\, \bar{W} \ge \hat{W} - \sqrt{\frac{2\,\hat{V}_w \log(2/\delta)}{n}} - \frac{7\, R\, \log(2/\delta)}{3\,(n-1)} \,\right\} \ge 1 - \delta$$

where $\hat{V}_w$ is the sample variance of the clipped weights

$$\hat{V}_w = \frac{1}{n-1} \sum_{i=1}^{n} \left( \bar{w}(\omega_i) - \hat{W} \right)^2.$$

Replacing in inequality (14) gives the inner confidence interval

$$\mathbb{P}\left\{\, \bar{Y} \le Y \le \bar{Y} + M\left(1 - \hat{W} + \xi_R\right) \right\} \ge 1 - \delta\,, \quad\text{with}\quad \xi_R = \sqrt{\frac{2\,\hat{V}_w \log(2/\delta)}{n}} + \frac{7\, R\, \log(2/\delta)}{3\,(n-1)}\,. \qquad (27)$$

Note that $1 - \hat{W} + \xi_R$ can occasionally be negative. This occurs in the unlucky cases where the confidence interval is violated, with probability smaller than $\delta$.

Putting together the inner and outer confidence intervals,

$$\mathbb{P}\left\{\, \hat{Y} - \varepsilon_R \le Y \le \hat{Y} + M\left(1 - \hat{W} + \xi_R\right) + \varepsilon_R \,\right\} \ge 1 - 3\delta\,,$$

with $\varepsilon_R$ and $\xi_R$ computed as described in expressions (26) and (27).


Appendix C. Counterfactual Differences

We now seek to estimate the difference $Y^+ - Y^*$ of the expectations of a same quantity $\ell(\omega)$ under two different counterfactual distributions $P^+(\omega)$ and $P^*(\omega)$. These expectations are often affected by variables whose value is left unchanged by the interventions under consideration. For instance, seasonal effects can have very large effects on the number of ad clicks. When these variables affect both $Y^+$ and $Y^*$ in similar ways, we can obtain substantially better confidence intervals for the difference $Y^+ - Y^*$.

In addition to the notation $\omega$ representing all the variables in the structural equation model, we use the notation $\upsilon$ to represent all the variables that are not direct or indirect effects of variables affected by the interventions under consideration.

Let $\zeta(\upsilon)$ be a known function believed to be a good predictor of the quantity $\ell(\omega)$ whose counterfactual expectation is sought. Since $P^*(\upsilon) = P(\upsilon)$, the following equality holds regardless of the quality of this prediction:

$$Y^* = \int_\upsilon \zeta(\upsilon)\, P(\upsilon) + \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w^*(\omega)\, P(\omega)\,.$$

Decomposing both $Y^+$ and $Y^*$ in this way and computing the difference,

$$Y^+ - Y^* = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \Delta w(\omega)\, P(\omega) \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] \Delta w(\omega_i) \qquad (28)$$

with

$$\Delta w(\omega) = \frac{P^+(\omega)}{P(\omega)} - \frac{P^*(\omega)}{P(\omega)} = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)}\,.$$

The outer confidence interval size is reduced if the variance of the residual $\ell(\omega) - \zeta(\upsilon)$ is smaller than the variance of the original variable $\ell(\omega)$. For instance, a suitable predictor function $\zeta(\upsilon)$ can significantly capture the seasonal click yield variations regardless of the interventions under consideration. Even a constant predictor function can considerably change the variance of the outer confidence interval. Therefore, in the absence of a better predictor, we still can (and always should) center the integrand using a constant predictor.
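The effect of centering can be seen on a toy computation: with signed weight differences that sum to zero, subtracting a constant predictor leaves the estimate of (28) unchanged but can shrink the sample variance dramatically. All numbers below are hypothetical:

```python
import statistics

def centered_terms(l_values, dw, zeta):
    """Per-sample terms of the difference estimator (28) with a constant
    predictor zeta: (l(omega_i) - zeta) * dw_i, where dw_i is the signed
    weight difference for sample i."""
    return [(l - zeta) * d for l, d in zip(l_values, dw)]

# Hypothetical click counts fluctuating around 10, and signed weight
# differences summing to zero (both reweighting ratios average to one).
l_values = [10.1, 9.9, 10.2, 9.8]
dw = [0.5, -0.5, 0.4, -0.4]
raw      = centered_terms(l_values, dw, zeta=0.0)
centered = centered_terms(l_values, dw, zeta=10.0)
```

Because the weight differences sum to zero, the mean of `raw` and `centered` is the same, while the variance of `centered` is orders of magnitude smaller.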

The rest of this appendix describes how to construct confidence intervals for the estimation of counterfactual differences. Additional bookkeeping is required because both the weights $\Delta w(\omega_i)$ and the integrand $\ell(\omega) - \zeta(\upsilon)$ can be positive or negative. We use the notation $\upsilon$ to represent the variables of the structural equation model that are left unchanged by the intervention under consideration. Such variables satisfy the relations $P^*(\upsilon) = P(\upsilon)$ and $P^*(\omega) = P^*(\omega \setminus \upsilon \,|\, \upsilon)\, P(\upsilon)$, where we use the notation $\omega \setminus \upsilon$ to denote all remaining variables in the structural equation model. An invariant predictor is then a function $\zeta(\upsilon)$ that is believed to be a good predictor of $\ell(\omega)$. In particular, it is expected that $\mathrm{var}[\ell(\omega) - \zeta(\upsilon)]$ is smaller than $\mathrm{var}[\ell(\omega)]$.

C.1 Inner Confidence Interval with Dependent Bounds

We first describe how to construct finer inner confidence intervals by using more refined bounds on $\ell(\omega)$. In particular, instead of the simple bound (8), we can use bounds that depend on invariant variables:

$$\forall \omega \qquad m \le m(\upsilon) \le \ell(\omega) \le M(\upsilon) \le M\,.$$


The key observation is the equality

$$\mathbb{E}\left[\, w(\omega) \,\middle|\, \upsilon \,\right] = \int_{\omega \setminus \upsilon} w(\omega)\, P(\omega \setminus \upsilon \,|\, \upsilon) = \int_{\omega \setminus \upsilon} \frac{P^*(\omega \setminus \upsilon \,|\, \upsilon)\, P(\upsilon)}{P(\omega \setminus \upsilon \,|\, \upsilon)\, P(\upsilon)}\; P(\omega \setminus \upsilon \,|\, \upsilon) = 1\,.$$

We can then write

$$Y - \bar{Y} = \int_\omega \left[ w(\omega) - \bar{w}(\omega) \right] \ell(\omega)\, P(\omega)
\;\le\; \int_\upsilon \mathbb{E}\left[\, w(\omega) - \bar{w}(\omega) \,\middle|\, \upsilon \,\right] M(\upsilon)\, P(\upsilon)
= \int_\upsilon \left( 1 - \mathbb{E}\left[\, \bar{w}(\omega) \,\middle|\, \upsilon \,\right] \right) M(\upsilon)\, P(\upsilon)
= \int_\omega \left( 1 - \bar{w}(\omega) \right) M(\upsilon)\, P(\omega) = B_{hi}\,.$$

Using a similar derivation for the lower bound $B_{lo}$, we obtain the inequality

$$B_{lo} \;\le\; Y - \bar{Y} \;\le\; B_{hi}\,.$$

With the notations

$$\hat{B}_{lo} = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \bar{w}(\omega_i) \right) m(\upsilon_i)
\qquad
\hat{B}_{hi} = \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \bar{w}(\omega_i) \right) M(\upsilon_i)$$

$$\hat{V}_{lo} = \frac{1}{n-1} \sum_{i=1}^{n} \left[ \left( 1 - \bar{w}(\omega_i) \right) m(\upsilon_i) - \hat{B}_{lo} \right]^2
\qquad
\hat{V}_{hi} = \frac{1}{n-1} \sum_{i=1}^{n} \left[ \left( 1 - \bar{w}(\omega_i) \right) M(\upsilon_i) - \hat{B}_{hi} \right]^2$$

$$\xi_{lo} = \sqrt{\frac{2\,\hat{V}_{lo} \log(2/\delta)}{n}} + \frac{7\,|m|\,R\,\log(2/\delta)}{3\,(n-1)}\,,
\qquad
\xi_{hi} = \sqrt{\frac{2\,\hat{V}_{hi} \log(2/\delta)}{n}} + \frac{7\,|M|\,R\,\log(2/\delta)}{3\,(n-1)}\,,$$

two applications of theorem 1 give the inner confidence interval:

$$\mathbb{P}\left\{\, \bar{Y} + \hat{B}_{lo} - \xi_{lo} \le Y \le \bar{Y} + \hat{B}_{hi} + \xi_{hi} \,\right\} \ge 1 - 2\delta\,.$$

C.2 Confidence Intervals for Counterfactual Differences

We now describe how to leverage invariant predictors in order to construct tighter confidence intervals for the difference of two counterfactual expectations

$$Y^+ - Y^* \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] \Delta w(\omega_i)
\quad\text{with}\quad
\Delta w(\omega) = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)}\,.$$

Let us define the reweighting ratios $w^+(\omega) = P^+(\omega)/P(\omega)$ and $w^*(\omega) = P^*(\omega)/P(\omega)$, their clipped variants $\bar{w}^+(\omega)$ and $\bar{w}^*(\omega)$, and the clipped centered expectations

$$\bar{Y}^+_c = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \bar{w}^+(\omega)\, P(\omega)
\quad\text{and}\quad
\bar{Y}^*_c = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \bar{w}^*(\omega)\, P(\omega)\,.$$

The outer confidence interval is obtained by applying the techniques of Section B.1 to

$$\bar{Y}^+_c - \bar{Y}^*_c = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ \bar{w}^+(\omega) - \bar{w}^*(\omega) \right] P(\omega)\,.$$

Since the weights $\bar{w}^+ - \bar{w}^*$ can be positive or negative, adding or removing a constant to $\ell(\omega)$ can considerably change the variance of the outer confidence interval. This means that one should


always use a predictor. Even a constant predictor can vastly improve the outer confidence interval of the difference.

The inner confidence interval is then obtained by writing the difference

$$\left( Y^+ - Y^* \right) - \left( \bar{Y}^+_c - \bar{Y}^*_c \right)
= \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ w^+(\omega) - \bar{w}^+(\omega) \right] P(\omega)
\;-\; \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \left[ w^*(\omega) - \bar{w}^*(\omega) \right] P(\omega)$$

and bounding both terms by leveraging $\upsilon$-dependent bounds on the integrand:

$$\forall \omega \qquad m - M \;\le\; m(\upsilon) - \zeta(\upsilon) \;\le\; \ell(\omega) - \zeta(\upsilon) \;\le\; M(\upsilon) - \zeta(\upsilon) \;\le\; M - m\,.$$

This can be achieved as shown in Section C.1.

Appendix D.
Counterfactual Derivatives

We now
consider interventions that depend on a continuous parameter θ.For instance, we
might want to know what the performance of the ad placement engine would have
been if we had used a parametrized scoring model.Let Pθ (ω)represent the
counterfactual Markov factorization associated with this intervention.Let Y θ be
the counterfactual expectation of ℓ(ω) under distribution Pθ .

Computing the derivative of (28) immediately gives

$$\frac{\partial Y_\theta}{\partial \theta} = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w'_\theta(\omega)\, P(\omega)
\;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] w'_\theta(\omega_i) \qquad (29)$$

with

$$w_\theta(\omega) = \frac{P_\theta(\omega)}{P(\omega)}
\quad\text{and}\quad
w'_\theta(\omega) = \frac{\partial w_\theta(\omega)}{\partial \theta} = w_\theta(\omega)\, \frac{\partial \log P_\theta(\omega)}{\partial \theta}\,.$$

Replacing the expressions $P(\omega)$ and $P_\theta(\omega)$ by the corresponding Markov factorizations gives many opportunities to simplify the reweighting ratio $w'_\theta(\omega)$. The term $w_\theta(\omega)$ simplifies as shown in (6). The derivative of $\log P_\theta(\omega)$ depends only on the factors parametrized by $\theta$. Therefore, in order to evaluate $w'_\theta(\omega)$, we only need to know the few factors affected by the intervention.
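As an illustration, consider a toy structural model with a single Bernoulli factor: data collected with success probability $p_0$, counterfactual $P_\theta = \mathrm{Bernoulli}(\theta)$, and $\ell(\omega) = \omega$, so that $Y_\theta = \theta$ and the true derivative equals 1. This sketch is ours, not the paper's system:

```python
import random

def dY_dtheta(samples, p0, theta):
    """Monte Carlo estimate of the derivative (29) for a toy one-factor
    model. Uses w'_theta(omega) = w_theta(omega) * dlogP_theta(omega)/dtheta,
    with data omega_i drawn from Bernoulli(p0)."""
    total = 0.0
    for omega in samples:
        if omega == 1:
            w_theta = theta / p0                    # P_theta(1) / P(1)
            dlog = 1.0 / theta                      # d log(theta) / d theta
        else:
            w_theta = (1.0 - theta) / (1.0 - p0)    # P_theta(0) / P(0)
            dlog = -1.0 / (1.0 - theta)             # d log(1-theta) / d theta
        total += omega * w_theta * dlog             # l(omega) = omega
    return total / len(samples)

random.seed(0)
samples = [1 if random.random() < 0.3 else 0 for _ in range(100000)]
est = dY_dtheta(samples, p0=0.3, theta=0.5)         # should be close to 1
```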

Higher order derivatives can be estimated using the same approach. For instance,

$$\frac{\partial^2 Y_\theta}{\partial \theta_i\, \partial \theta_j} = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w''_{ij}(\omega)\, P(\omega)
\;\approx\; \frac{1}{n} \sum_{k=1}^{n} \left[ \ell(\omega_k) - \zeta(\upsilon_k) \right] w''_{ij}(\omega_k)$$

with

$$w''_{ij}(\omega) = \frac{\partial^2 w_\theta(\omega)}{\partial \theta_i\, \partial \theta_j}
= w_\theta(\omega)\, \frac{\partial \log P_\theta(\omega)}{\partial \theta_i}\, \frac{\partial \log P_\theta(\omega)}{\partial \theta_j}
+ w_\theta(\omega)\, \frac{\partial^2 \log P_\theta(\omega)}{\partial \theta_i\, \partial \theta_j}\,.$$

The second term in $w''_{ij}(\omega)$ vanishes when $\theta_i$ and $\theta_j$ parametrize distinct factors of $P_\theta(\omega)$.

D.1 Infinitesimal Interventions and Policy Gradient

Expression (29) becomes particularly attractive when $P(\omega) = P_\theta(\omega)$, that is, when one seeks derivatives that describe the effect of an infinitesimal intervention on the system from which the data was collected. The resulting expression is then identical to the celebrated policy gradient (Aleksandrov et al., 1968; Glynn, 1987; Williams, 1992) which expresses how the accumulated rewards


in a reinforcement learning problem are affected by small changes of the parameters of the policy function:

$$\frac{\partial Y_\theta}{\partial \theta} = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w'_\theta(\omega)\, P_\theta(\omega)
\;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] w'_\theta(\omega_i)$$

where the $\omega_i$ are sampled i.i.d. from $P_\theta$ and $w'_\theta(\omega) = \dfrac{\partial \log P_\theta(\omega)}{\partial \theta}$.

Sampling from P_θ(ω) eliminates the potentially large ratio w_θ(ω) that usually plagues importance sampling approaches. Choosing a parametrized distribution that depends smoothly on θ is then sufficient to contain the size of the weights w'_θ(ω). Since the weights can be positive or negative, centering the integrand with a prediction function ζ(υ) remains very important. Even a constant predictor ζ can substantially reduce the variance

$$
\mathrm{var}\big[(\ell(\omega)-\zeta)\,w'_\theta(\omega)\big]
= \mathrm{var}\big[\ell(\omega)\,w'_\theta(\omega) - \zeta\,w'_\theta(\omega)\big]
= \mathrm{var}\big[\ell(\omega)\,w'_\theta(\omega)\big]
- 2\zeta\,\mathrm{cov}\big[\ell(\omega)\,w'_\theta(\omega),\,w'_\theta(\omega)\big]
+ \zeta^2\,\mathrm{var}\big[w'_\theta(\omega)\big]\,,
$$

whose minimum is reached for

$$
\zeta = \frac{\mathrm{cov}\big[\ell(\omega)\,w'_\theta(\omega),\,w'_\theta(\omega)\big]}
             {\mathrm{var}\big[w'_\theta(\omega)\big]}
= \frac{\mathbb{E}\big[\ell(\omega)\,w'_\theta(\omega)^2\big]}
       {\mathbb{E}\big[w'_\theta(\omega)^2\big]}\,.
$$
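The policy gradient estimator and the optimal constant predictor ζ can be sketched on the same kind of toy Bernoulli policy (all model choices below are hypothetical, picked so the exact gradient is known):

```python
import numpy as np

# Sketch of the policy-gradient estimator of Appendix D.1 with the optimal
# constant baseline zeta = E[l (w')^2] / E[(w')^2].  The Bernoulli "policy"
# and the loss l(omega) = omega are hypothetical stand-ins.

rng = np.random.default_rng(1)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = 0.3
p = sigmoid(theta)
n = 100_000
omega = (rng.random(n) < p).astype(float)   # samples from P_theta itself

loss = omega                                # l(omega) = omega, so Y_theta = p
score = omega - p                           # w'_theta = d log P_theta / d theta

# Optimal constant predictor, estimated from the same sample.
zeta = np.mean(loss * score**2) / np.mean(score**2)

plain = loss * score                        # integrand without centering
centered = (loss - zeta) * score            # integrand centered with zeta

grad_plain = plain.mean()
grad_centered = centered.mean()
exact = p * (1 - p)                         # closed form of dY_theta/dtheta

print(grad_plain, grad_centered, exact)
print(plain.var(), centered.var())          # centering shrinks the variance
```

In this toy model the centered integrand is nearly constant, so the variance reduction is dramatic; in general the improvement depends on how well ζ(υ) predicts ℓ(ω).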

We sometimes want to evaluate expectations under a counterfactual distribution that is too far from the actual distribution to obtain reasonable confidence intervals. Suppose, for instance, that we are unable to reliably estimate which click yield would have been observed if we had used a certain parameter θ for the scoring models. We can still estimate how quickly and in which direction the click yield would have changed if we had slightly moved the current scoring model parameters in the direction of the target θ. Although such an answer is not as good as a reliable estimate of Y*_θ, it is certainly better than no answer at all.

D.2 Off-Policy Gradient

We assume in this subsection that the parametrized probability distribution P*_θ(ω) is regular enough to ensure that all the derivatives of interest are defined and that the event { w_θ(ω) = R } has probability zero. Furthermore, in order to simplify the exposition, the following derivation does not leverage an invariant predictor function.

Estimating derivatives using data sampled from a distribution P(ω) different from P*_θ(ω) is more challenging because the ratios w_θ(ω_i) in Equation (29) can take very large values. However it is comparatively easy to estimate the derivatives of lower and upper bounds using a slightly different way to clip the weights. Using the notation 1l{x} for the indicator function, equal to one if condition x is true and zero otherwise, let us define respectively the clipped weights w̄_θ and the capped weights w̄^M_θ:

$$
\bar{w}_\theta(\omega) = w_\theta(\omega)\;\mathbb{1}\{P^*_\theta(\omega) < R\,P(\omega)\}
\qquad\text{and}\qquad
\bar{w}^M_\theta(\omega) = \min\{\, w_\theta(\omega),\; R \,\}\,.
$$

Although Section 4.4 illustrates the use of clipped weights, the confidence interval derivation can be easily extended to the capped weights. Defining the capped quantities

$$
\bar{Y}_\theta = \int_\omega \ell(\omega)\,\bar{w}^M_\theta(\omega)\,P(\omega)
\qquad\text{and}\qquad
\bar{W}_\theta = \int_\omega \bar{w}^M_\theta(\omega)\,P(\omega)
$$

and writing

$$
0 \;\le\; Y^*_\theta - \bar{Y}_\theta
= \int_{\omega\in\Omega\setminus\Omega_R} \ell(\omega)\,\big(P^*_\theta(\omega) - R\,P(\omega)\big)
\;\le\; M\,\big(P^*_\theta(\Omega\setminus\Omega_R) - R\,P(\Omega\setminus\Omega_R)\big)
= M\,\Big(1 - \int_\omega \bar{w}^M_\theta(\omega)\,P(\omega)\Big)
$$

yields the inequality

$$
\bar{Y}_\theta \;\le\; Y^*_\theta \;\le\; \bar{Y}_\theta + M\,\big(1 - \bar{W}_\theta\big)\,.
\tag{30}
$$
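The bound (30) can be exercised numerically. The sketch below uses two hypothetical Bernoulli distributions, an invented cap R and a bounded loss, and checks on simulated data that the true counterfactual expectation falls between the estimated bounds:

```python
import numpy as np

# Sketch of the capped-weight bounds (30): Ybar <= Y* <= Ybar + M (1 - Wbar).
# The Bernoulli distributions, the cap R and the loss are hypothetical
# stand-ins chosen so Y* is known exactly.

rng = np.random.default_rng(2)

p, q = 0.5, 0.9      # logging P(omega=1) and counterfactual P*(omega=1)
R, M = 1.5, 1.0      # weight cap and loss bound: l(omega) in [0, M]
n = 200_000

omega = (rng.random(n) < p).astype(float)             # data collected under P
loss = 0.7 * omega                                    # l(omega), so Y* = 0.7 q

w = np.where(omega == 1, q / p, (1 - q) / (1 - p))    # w = P* / P, here 1.8 or 0.2
w_capped = np.minimum(w, R)                           # capped weights wbar^M

Y_bar = np.mean(loss * w_capped)                      # estimate of Ybar
W_bar = np.mean(w_capped)                             # estimate of Wbar
lower, upper = Y_bar, Y_bar + M * (1 - W_bar)

print(lower, 0.7 * q, upper)   # the true Y* lies inside [lower, upper]
```

Note that the interval width M (1 − W̄_θ) directly measures how much probability mass of P*_θ was lost to capping.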

In order to obtain reliable estimates of the derivatives of these upper and lower bounds, it is of course sufficient to obtain reliable estimates of the derivatives of Ȳ_θ and W̄_θ. By separately considering the cases w_θ(ω) < R and w_θ(ω) > R, we easily obtain the relation

$$
\bar{w}^{M\prime}_\theta(\omega)
= \frac{\partial \bar{w}^M_\theta(\omega)}{\partial\theta}
= \bar{w}_\theta(\omega)\,\frac{\partial \log P^*_\theta(\omega)}{\partial\theta}
\qquad\text{when } w_\theta(\omega) \neq R\,,
$$

and, thanks to the regularity assumptions, we can write

$$
\frac{\partial \bar{Y}_\theta}{\partial\theta}
= \int_\omega \ell(\omega)\,\bar{w}^{M\prime}_\theta(\omega)\,P(\omega)
\;\approx\; \frac{1}{n}\sum_{i=1}^{n} \ell(\omega_i)\,\bar{w}^{M\prime}_\theta(\omega_i)\,,
\qquad
\frac{\partial \bar{W}_\theta}{\partial\theta}
= \int_\omega \bar{w}^{M\prime}_\theta(\omega)\,P(\omega)
\;\approx\; \frac{1}{n}\sum_{i=1}^{n} \bar{w}^{M\prime}_\theta(\omega_i)\,.
$$

Estimating these derivatives is considerably easier than using approximation (29) because they involve the bounded quantity w̄_θ(ω) instead of the potentially large ratio w_θ(ω). It is still necessary to choose a sufficiently smooth sampling distribution P(ω) to limit the magnitude of ∂log P*_θ/∂θ.

Such derivatives are very useful to drive optimization algorithms. Assume for instance that we want to find the parameter θ that maximizes the counterfactual expectation Y*_θ as illustrated in Section 6.3. The estimate obtained with approximation (5) could reach its maximum for a value of θ that is poorly explored by the actual distribution. Maximizing an estimate of the lower bound (30) ensures that the optimization algorithm finds a trustworthy answer.

Appendix E. Uniform Empirical Bernstein Bounds

This appendix reviews the uniform empirical Bernstein bound given by Maurer and Pontil (2009) and describes how it can be used to construct the uniform confidence interval (21). The first step consists of characterizing the size of a family F of functions mapping a space X into the interval [a,b] ⊂ R. Given n points x = (x_1 ... x_n) ∈ X^n, the trace F(x) ⊂ R^n is the set of vectors (f(x_1), ..., f(x_n)) for all functions f ∈ F.

Definition 2 (Covering numbers, etc.) Given ε > 0, the covering number N(x, ε, F) is the smallest possible cardinality of a subset C ⊂ F(x) satisfying the condition

$$
\forall v \in F(x) \;\; \exists c \in C \quad \max_{i=1\ldots n} |v_i - c_i| \le \varepsilon\,,
$$

and the growth function N(n, ε, F) is

$$
N(n, \varepsilon, F) \;=\; \sup_{x \in X^n} N(x, \varepsilon, F)\,.
$$
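Definition 2 can be made concrete by computing covering numbers by brute force for a tiny, hypothetical family of threshold functions (every name and parameter below is invented for illustration):

```python
import itertools
import numpy as np

def covering_number(trace, eps):
    """Smallest cardinality of a subset C of the trace such that every vector
    of the trace is within eps of some c in C in the sup norm (Definition 2)."""
    vectors = [np.asarray(v, dtype=float) for v in trace]
    for size in range(1, len(vectors) + 1):
        for C in itertools.combinations(vectors, size):
            if all(min(np.max(np.abs(v - c)) for c in C) <= eps for v in vectors):
                return size
    return len(vectors)

# Hypothetical family of threshold functions f_t(x) = 1l{x >= t} evaluated on
# nine integer points; its trace contains the ten distinct step vectors.
x = np.arange(1, 10)
trace = {tuple((x >= t).astype(float)) for t in range(0, 11)}

n0 = covering_number(trace, eps=0.0)   # every distinct vector is needed
n1 = covering_number(trace, eps=1.0)   # a single vector eps-covers them all
print(n0, n1)
```

This exhaustive search is only feasible for toy traces; the point of the growth function is precisely to bound such counts combinatorially rather than compute them.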

Thanks to a famous combinatorial lemma (Vapnik and Chervonenkis, 1968, 1971; Sauer, 1972), for many usual parametric families F, the growth function N(n, ε, F) increases at most polynomially with both n and 1/ε.

Theorem 3 (Uniform empirical Bernstein bound) (Maurer and Pontil, 2009, thm 6)
Let δ ∈ (0,1) and n ≥ 16. Let X, X_1, ..., X_n be i.i.d. random variables with values in X. Let F be a set of functions mapping X into [a,b] ⊂ R and let M(n) = 10 N(2n, F, 1/n). Then, with probability at least 1−δ,

$$
\forall f \in F\,, \quad
\mathbb{E}[f(X)] - M_n \;\le\; \sqrt{\frac{18\, V_n \log(M(n)/\delta)}{n}}
\;+\; (b-a)\,\frac{15 \log(M(n)/\delta)}{n-1}\,,
$$

where M_n and V_n respectively are the sample mean and variance

$$
M_n = \frac{1}{n}\sum_{i=1}^{n} f(X_i)\,,
\qquad
V_n = \frac{1}{n-1}\sum_{i=1}^{n} \big(f(X_i) - M_n\big)^2\,.
$$

The statement of this theorem emphasizes its similarity with the non-uniform empirical Bernstein bound (theorem 1). Although the constants are less attractive, the uniform bound still converges to zero when n increases, provided of course that M(n) = 10 N(2n, F, 1/n) grows polynomially with n.

Let us then define the family of functions

$$
F = \big\{\, f_\theta: \omega \mapsto \ell(\omega)\,\bar{w}^M_\theta(\omega)\,;\;\;
g_\theta: \omega \mapsto \bar{w}^M_\theta(\omega)\,,\;\; \forall\theta \in F \,\big\\}\,,
$$

and use the uniform empirical Bernstein bound to derive an outer inequality similar to (26) and an inner inequality similar to (27). The theorem implies that, with probability 1−δ, both inequalities are simultaneously true for all values of the parameter θ. The uniform confidence interval (21) then follows directly.

References

V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. Stochastic optimization. Engineering Cybernetics, 5:11-16, 1968.

Susan Athey and Denis Nekipelov. A structural model of sponsored search advertising auctions. Working paper, 2010.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Proc. 18th International Conference on Algorithmic Learning Theory (ALT 2007), pages 150-165, 2007.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

13. For a simple proof of this fact, slice [a,b] into intervals S_k of maximal width ε and apply the lemma to the family of indicator functions x_i ↦ 1l{ f(x_i) ∈ S_k }.


Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962-972, 2005.

Dirk Bergemann and Maher Said. Dynamic auctions: a survey. Discussion Paper 1757R, Cowles Foundation for Research in Economics, Yale University, 2010.

Léon Bottou. From machine learning to machine reasoning. http://arxiv.org/abs/1102.1808v3, Feb 2011.

Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199-231, 2001.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249-2257. NIPS Foundation, 2011.

C. R. Charig, D. R. Webb, S. R. Payne, and J. E. A. Wickham. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. British Medical Journal (Clin Res Ed), 292(6524):879-882, 1986.

Denis X. Charles and D. Max Chickering. Optimization for paid search auctions. Manuscript in preparation, 2012.

Denis X. Charles, D. Max Chickering, and Patrice Simard. Micro-market experimentation for paid search. Manuscript in preparation, 2012.

Harald Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.

Denver Dash. Caveats for Causal Reasoning with Equilibrium Models. PhD thesis, University of Pittsburgh, 2003.

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Sample-efficient nonstationary-policy evaluation for contextual bandits. In Proceedings of Uncertainty in Artificial Intelligence (UAI), pages 247-254, 2012.

Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. American Economic Review, 97(1):242-259, 2007.

Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141-149, 1992.

John C. Gittins. Bandit Processes and Dynamic Allocation Indices. Wiley, 1989.

Peter W. Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 1987 Winter Simulation Conference, pages 366-375, 1987.

Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track. Omnipress, 2010.


Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140-181, July 2008.

Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems, October 2010. http://www.cs.mcgill.ca/~vkules/bandits.pdf.

Sébastien Lahaie and R. Preston McAfee. Efficient ranking in sponsored search. In Proc. 7th International Workshop on Internet and Network Economics (WINE 2011), pages 254-265. LNCS 7090, Springer, 2011.

Lev Landau and Evgeny Lifshitz. Course in Theoretical Physics, Volume 1: Mechanics. Pergamon Press, 1969. 2nd edition.

John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, pages 817-824. MIT Press, Cambridge, MA, 2008.

Steffen L. Lauritzen and Thomas S. Richardson. Chain graph models and their causal interpretation. Journal of the Royal Statistical Society, Series B, 64:321-361, 2002.

David K. Lewis. Counterfactuals. Harvard University Press, 1973. 2nd edition: Wiley-Blackwell, 2001.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 661-670. ACM, 2010.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM International Conference on Web Search and Data Mining (WSDM 2011), pages 297-306, 2011.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proc. The 22nd Conference on Learning Theory (COLT 2009), 2009.

Paul Milgrom. Putting Auction Theory to Work. Cambridge University Press, 2004.

Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58-73, 1981.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000. 2nd edition: 2009.

Judea Pearl. Causal inference in statistics: an overview. Statistics Surveys, 3:96-146, 2009.

Judea Pearl. The do-calculus revisited. In Proc. Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI-2012), pages 3-11, 2012.

Linda E. Reichl. A Modern Course in Statistical Physics, 2nd Edition. Wiley, 1998.


Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.

James M. Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550-560, Sep 2000.

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145-147, 1972.

Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.

Edward H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Series B, 13:238-241, 1951.

Aleksandrs Slivkins. Contextual bandits with similarity information. JMLR Conference and Workshop Proceedings, 19:679-702, 2011.

Peter Spirtes and Richard Scheines. Causal inference of ambiguous manipulations. Philosophy of Science, 71(5):833-845, Dec 2004.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction and Search. Springer Verlag, New York, 1993. 2nd edition: MIT Press, Cambridge (Mass.), 2011.

Stephen M. Stigler. A historical view of statistical concepts in psychology and educational research. American Journal of Education, 101(1):60-70, Nov 1992.

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, 2007.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings 16th Conference on Knowledge Discovery and Data Mining (KDD 2010), pages 17-26, 2010.

Vladimir N. Vapnik. Estimation of Dependences based on Empirical Data. Springer Series in Statistics. Springer Verlag, Berlin, New York, 1982.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. Uniform convergence of the frequencies of occurrence of events to their probabilities. Proc. Academy of Sciences of the USSR, 181(4), 1968. English translation: Soviet Mathematics - Doklady, 9:915-918, 1968.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.


Hal R. Varian. Position auctions. International Journal of Industrial Organization, 25:1163-1178, 2007.

Hal R. Varian. Online ad auctions. American Economic Review, 99(2):430-434, 2009.

Joannes Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evaluation. In Proc. European Conference on Machine Learning, pages 437-448, 2005.

Georg H. von Wright. Explanation and Understanding. Cornell University Press, 1971.

Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117-186, 1945.

Norbert Wiener. Cybernetics, or Control and Communication in the Animal and the Machine. Hermann et Cie (Paris), MIT Press (Cambridge, Mass.), Wiley and Sons (New York), 1948. 2nd Edition (expanded): MIT Press, Wiley and Sons, 1961.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

James Woodward. Making Things Happen. Oxford University Press, 2005.

Sewall S. Wright. Correlation and causation. Journal of Agricultural Research, 20:557-585, 1921.
