Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, Ed Snelson; 14(65):3207−3260, 2013.


This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select the changes that would have improved the system performance. This work is illustrated by experiments on the ad placement system associated with the Bing search engine.


Keywords: causation, counterfactual reasoning, computational advertising




Statistical machine learning technologies in the real world are never without a purpose. Using their predictions, humans or machines make decisions whose circuitous consequences often violate the modeling assumptions that justified the system design in the first place.


Such contradictions appear very clearly in the case of the learning systems that power web-scale applications such as search engines, ad placement engines, or recommendation systems. For instance, the placement of advertisements on the result pages of Internet search engines depends on the bids of advertisers and on scores computed by statistical machine learning systems. Because the scores affect the contents of the result pages proposed to the users, they directly influence the occurrence of clicks and the corresponding advertiser payments. They also have important indirect effects. Ad placement decisions impact the satisfaction of the users and therefore their willingness to frequent this web site in the future. They also impact the return on investment observed by the advertisers and therefore their future bids. Finally, they change the nature of the data collected for training the statistical models in the future.

These complicated interactions are clarified by important theoretical works. Under simplified assumptions, mechanism design (Myerson, 1981) leads to an insightful account of the advertiser feedback loop (Varian, 2007; Edelman et al., 2007). Under simplified assumptions, multi-armed bandit theory (Robbins, 1952; Auer et al., 2002; Langford and Zhang, 2008) and reinforcement learning (Sutton and Barto, 1998) describe the exploration/exploitation dilemma associated with the training feedback loop. However, none of these approaches gives a complete account of the complex interactions found in real-life systems.


This contribution proposes a novel approach: we view these complicated interactions as manifestations of the fundamental difference that separates correlation and causation. Using the ad placement example as a model of our problem class, we therefore argue that the language and the methods of causal inference provide flexible means to describe such complex machine learning systems and give sound answers to the practical questions facing the designer of such a system. Is it useful to pass a new input signal to the statistical model? Is it worthwhile to collect and label a new training set? What about changing the loss function or the learning algorithm? In order to answer such questions and improve the operational performance of the learning system, one needs to unravel how the information produced by the statistical models traverses the web of causes and effects and eventually produces measurable performance metrics.


Readers with an interest in causal inference will find in this paper (i) a real-world example demonstrating the value of causal inference for large-scale machine learning applications, (ii) causal inference techniques applicable to continuously valued variables with meaningful confidence intervals, and (iii) quasi-static analysis techniques for estimating how small interventions affect certain causal equilibria. Readers with an interest in real-life applications will find (iv) a selection of practical counterfactual analysis techniques applicable to many real-life machine learning systems. Readers with an interest in computational advertising will find a principled framework that (v) explains how to soundly use machine learning techniques for ad placement, and (vi) conceptually connects machine learning and auction theory in a compelling manner.




The paper is organized as follows. Section 2 gives an overview of the advertisement placement problem which serves as our main example. In particular, we stress some of the difficulties encountered when one approaches such a problem without a principled perspective. Section 3 provides a condensed review of the essential concepts of causal modeling and inference. Section 4 centers on formulating and answering counterfactual questions such as "how would the system have performed during the data collection period if certain interventions had been carried out on the system?" We describe importance sampling methods for counterfactual analysis, with clear conditions of validity and confidence intervals. Section 5 illustrates how the structure of the causal graph reveals opportunities to exploit prior information and vastly improve the confidence intervals. Section 6 describes how counterfactual analysis provides essential signals that can drive learning algorithms. Assume that we have identified interventions that would have caused the system to perform well during the data collection period. Which guarantee can we obtain on the performance of these same interventions in the future? Section 7 presents counterfactual differential techniques for the study of equilibria. Using data collected when the system is at equilibrium, we can estimate how a small intervention displaces the equilibrium. This provides an elegant and effective way to reason about long-term feedback effects. Various appendices complete the main text with information that we think more relevant to readers with specific backgrounds.









2. Causation Issues in Computational Advertising

After giving an overview of the advertisement placement problem, which serves as our main example, this section illustrates some of the difficulties that arise when one does not pay sufficient attention to the causal structure of the learning system.

2.1 Advertisement Placement

All Internet users are now familiar with the advertisement messages that adorn popular web pages. Advertisements are particularly effective on search engine result pages because users who are searching for something are good targets for advertisers who have something to offer. Several actors take part in this Internet advertisement game:


• Advertisers create advertisement messages, and place bids that describe how much they are
willing to pay to see their ads displayed or clicked.


• Publishers provide attractive web services, such as, for instance, an Internet search engine. They display selected ads and expect to receive payments from the advertisers. The infrastructure to collect the advertiser bids and select ads is sometimes provided by an advertising network on behalf of its affiliated publishers. For the purposes of this work, we simply consider a publisher large enough to run its own infrastructure.


• Users reveal information about their current interests, for instance, by entering a query in a search engine. They are offered web pages that contain a selection of ads (Figure 1). Users sometimes click on an advertisement and are transported to a web site controlled by the advertiser where they can initiate some business.


A conventional bidding language is necessary to precisely define under which conditions an advertiser is willing to pay the bid amount. In the case of Internet search advertisement, each bid specifies (a) the advertisement message, (b) a set of keywords, (c) one of several possible matching criteria between the keywords and the user query, and (d) the maximal price the advertiser is willing to pay when a user clicks on the ad after entering a query that matches the keywords according to the specified criterion.
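The four-part bid just described can be sketched as a small data structure. This is an illustrative toy, not the actual bidding schema of any real engine: the field names, the two matching criteria, and the matching logic are all assumptions made for exposition.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    """Toy model of the bidding language: parts (a)-(d) described above.

    Field names and the "exact"/"broad" criteria are illustrative choices,
    not the schema of a real ad platform.
    """
    ad_message: str    # (a) the advertisement message
    keywords: set      # (b) the set of keywords
    match: str         # (c) matching criterion: "exact" or "broad"
    max_cpc: float     # (d) maximal price paid for a click

    def matches(self, query: str) -> bool:
        """True when the query satisfies the bid's matching criterion."""
        terms = set(query.lower().split())
        kw = {k.lower() for k in self.keywords}
        if self.match == "exact":
            return terms == kw
        return kw <= terms  # "broad": every keyword appears in the query

bid = Bid("Cheap flights to Paris", {"flights", "paris"}, "broad", 0.50)
assert bid.matches("cheap flights to paris")
assert not bid.matches("hotels in paris")
```

Under this sketch, an ad is eligible for an auction exactly when some bid of its advertiser matches the incoming query.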


Whenever a user visits a publisher web page, an advertisement placement engine runs an auction in real time in order to select winning ads, determine where to display them in the page, and compute the prices charged to advertisers, should the user click on their ad. Since the placement engine is operated by the publisher, it is designed to further the interests of the publisher. Fortunately for everyone else, the publisher must balance short-term interests, namely the immediate revenue brought by the ads displayed on each web page, and long-term interests, namely the future revenues resulting from the continued satisfaction of both users and advertisers.


Auction theory explains how to design a mechanism that optimizes the revenue of the seller of a single object (Myerson, 1981; Milgrom, 2004) under various assumptions about the information available to the buyers regarding the intentions of the other buyers. In the case of the ad placement problem, the publisher runs multiple auctions and sells opportunities to receive a click. When nearly identical auctions occur thousands of times per second, it is tempting to consider that the advertisers have perfect information about each other. This assumption gives support to the popular generalized second price rank-score auction (Varian, 2007; Edelman et al., 2007):


Figure 1: Mainline and sidebar ads on a search result page. Ads placed in the mainline are more likely to be noticed, increasing both the chances of a click if the ad is relevant and the risk of annoying the user if the ad is not relevant.


• Let x represent the auction context information, such as the user query, the user profile, the date, the time, etc. The ad placement engine first determines all eligible ads a1 ... an and the corresponding bids b1 ... bn on the basis of the auction context x and of the matching criteria specified by the advertisers.


• For each selected ad ai and each potential position p on the web page, a statistical model outputs the estimate qi,p(x) of the probability that ad ai displayed in position p receives a user click. The rank-score ri,p(x) = bi qi,p(x) then represents the purported value associated with placing ad ai at position p.

• Let L represent a possible ad layout, that is, a set of positions that can simultaneously be populated with ads, and let $\mathcal{L}$ be the set of possible ad layouts, including of course the empty layout. The optimal layout and the corresponding ads are obtained by maximizing the total rank-score

    $\max_{L \in \mathcal{L}} \;\max_{i_1,\dots,i_{|L|}} \;\sum_{p \in L} r_{i_p,p}(x)$

subject to reserve constraints $\forall p \in L,\; r_{i_p,p}(x) \ge R_p(x)$, and also subject to diverse policy constraints, such as, for instance, preventing the simultaneous display of multiple ads belonging to the same advertiser. Under mild assumptions, this discrete maximization problem is amenable to computationally efficient greedy algorithms (see appendix A).


• The advertiser payment associated with a user click is computed using the generalized second price (GSP) rule: the advertiser pays the smallest bid that it could have entered without changing the solution of the discrete maximization problem, all other bids remaining equal. In other words, the advertiser could not have manipulated its bid and obtained the same treatment for a better price.
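The selection and pricing steps above can be sketched for the simplest case of a single ad position. This toy collapses the layout optimization to choosing one winner, so the GSP price reduces to the bid whose rank-score just matches the runner-up score or the reserve; the real engine optimizes over whole layouts, as described above.

```python
def run_auction(bids, click_probs, reserve):
    """Single-position sketch of the rank-score auction with GSP pricing.

    bids[i] is advertiser i's bid b_i and click_probs[i] the estimate q_i,
    so the rank-score is r_i = b_i * q_i.  Illustrative simplification of
    the multi-position problem described in the text.
    """
    scores = [b * q for b, q in zip(bids, click_probs)]
    winner = max(range(len(bids)), key=lambda i: scores[i])
    if scores[winner] < reserve:
        return None, 0.0  # empty layout: no ad clears the reserve
    # GSP: the smallest bid that still wins is the one whose rank-score
    # exactly matches the best competing score (or the reserve).
    runner_up = max([reserve] + [s for i, s in enumerate(scores) if i != winner])
    price = runner_up / click_probs[winner]
    return winner, price

winner, price = run_auction(bids=[1.0, 0.8], click_probs=[0.10, 0.05], reserve=0.02)
assert winner == 0               # rank-scores: 0.10 vs 0.04
assert abs(price - 0.40) < 1e-9  # pays 0.04 / 0.10, well below its bid of 1.0
```

Note that the winner's price never exceeds its own bid: lowering the bid to the computed price would leave the outcome unchanged, which is exactly the GSP property stated above.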


Under the perfect information assumption, the analysis suggests that the publisher simply needs to find which reserve prices Rp(x) yield the best revenue per auction. However, the total revenue of the publisher also depends on the traffic experienced by its web site. Displaying an excessive number of irrelevant ads can train users to ignore the ads, and can also drive them to competing web sites. Advertisers can artificially raise the rank-scores of irrelevant ads by temporarily increasing the bids. Indelicate advertisers can create deceiving advertisements that elicit many clicks but direct users to spam web sites. Experience shows that the continued satisfaction of the users is more important to the publisher than it is to the advertisers.


Therefore the generalized second price rank-score auction has evolved. Rank-scores have been augmented with terms that quantify the user satisfaction or the ad relevance. Bids receive adaptive discounts in order to deal with situations where the perfect information assumption is unrealistic. These adjustments are driven by additional statistical models. The ad placement engine should therefore be viewed as a complex learning system interacting with both users and advertisers.


2.2 Controlled Experiments


The designer
of such an ad placement engine faces the fundamental question of testing
whether a proposed modification of the ad placement engine results in an
improvement of the operational performance of the system.


The simplest way to answer such a question is to try the modification. The basic idea is to randomly split the users into treatment and control groups (Kohavi et al., 2008). Users from the control group see web pages generated using the unmodified system. Users of the treatment groups see web pages generated using alternate versions of the system. Monitoring various performance metrics for a couple months usually gives sufficient information to reliably decide which variant of the system delivers the most satisfactory performance.
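The random split described above is commonly implemented by hashing a user identifier rather than drawing fresh random numbers, so that every request from the same user consistently lands in the same group. The following sketch is one standard way to do this; the bucket count and identifiers are illustrative, not the method of any particular platform.

```python
import hashlib

def assign_group(user_id: str, treatment_fraction: float = 0.5) -> str:
    """Deterministically assign a user to the treatment or control group.

    Hashing the user id keeps each user in the same group across requests,
    so both groups experience a consistent variant for the whole experiment.
    """
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    bucket = (h % 10000) / 10000.0  # pseudo-uniform value in [0, 1)
    return "treatment" if bucket < treatment_fraction else "control"

# The assignment is stable and roughly balanced across many users.
assert assign_group("user-42") == assign_group("user-42")
groups = [assign_group(f"user-{i}") for i in range(10000)]
frac = groups.count("treatment") / len(groups)
assert 0.45 < frac < 0.55
```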


Modifying an advertisement placement engine elicits reactions from both the users and the advertisers. Whereas it is easy to split users into treatment and control groups, splitting advertisers into treatment and control groups demands special attention because each auction involves multiple advertisers (Charles et al., 2012). Simultaneously controlling for both users and advertisers is probably not feasible in practice.


Controlled experiments also suffer from several drawbacks. They are expensive because they demand a complete implementation of the proposed modifications. They are slow because each experiment typically demands a couple months. Finally, although there are elegant ways to efficiently run overlapping controlled experiments on the same traffic (Tang et al., 2010), they are limited by the volume of traffic available for experimentation.


It is therefore difficult to rely on controlled experiments during the conception phase of potential improvements to the ad placement engine. It is similarly difficult to use controlled experiments to drive the training algorithms associated with click probability estimation models. Cheaper and faster statistical methods are needed to drive these essential aspects of the development of an ad placement engine. Unfortunately, interpreting cheap and fast data can be very deceiving.


2.3 Confounding Data


Assessing the consequence of an intervention using statistical data is generally challenging because it is often difficult to determine whether the observed effect is a simple consequence of the intervention or has other uncontrolled causes.


                                          Overall          Patients with    Patients with
                                                           small stones     large stones
Treatment A: Open surgery                 78% (273/350)    93% (81/87)      73% (192/263)
Treatment B: Percutaneous nephrolithotomy 83% (289/350)    87% (234/270)    69% (55/80)

Table 1: A classic example of Simpson's paradox. The table reports the success rates of two treatments for kidney stones (Charig et al., 1986, Tables I and II). Although the overall success rate of treatment B seems better, treatment B performs worse than treatment A on both patients with small kidney stones and patients with large kidney stones. See Section 2.3.


For instance, the empirical comparison of certain kidney stone treatments illustrates this difficulty (Charig et al., 1986). Table 1 reports the success rates observed on two groups of 350 patients treated with respectively open surgery (treatment A, with 78% success) and percutaneous nephrolithotomy (treatment B, with 83% success). Although treatment B seems more successful, it was more frequently prescribed to patients suffering from small kidney stones, a less serious condition. Did treatment B achieve a high success rate because of its intrinsic qualities or because it was preferentially applied to less severe cases? Further splitting the data according to the size of the kidney stones reverses the conclusion: treatment A now achieves the best success rate for both patients suffering from large kidney stones and patients suffering from small kidney stones. Such an inversion of the conclusion is called Simpson's paradox (Simpson, 1951).
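The reversal is easy to verify numerically from the counts in Table 1. The snippet below recomputes the stratified and pooled success rates and checks that A wins within each stratum while B wins overall, precisely because A was mostly applied to the severe (large-stone) cases.

```python
# Success counts (successes, patients) from Table 1 (Charig et al., 1986).
small = {"A": (81, 87),   "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def rate(successes, total):
    return successes / total

# Treatment A wins within each stratum...
assert rate(*small["A"]) > rate(*small["B"])   # 93.1% > 86.7%
assert rate(*large["A"]) > rate(*large["B"])   # 73.0% > 68.8%

# ...yet loses overall: the pooled rates mix the strata with very
# different weights (A treated 263/350 large-stone patients, B only 80/350).
overall_A = rate(81 + 192, 350)   # 78.0%
overall_B = rate(234 + 55, 350)   # 82.6%
assert overall_A < overall_B
```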


The stone size in this study is an example of a confounding variable, that is, an uncontrolled variable whose consequences pollute the effect of the intervention. Doctors knew the size of the kidney stones, chose to treat the healthier patients with the least invasive treatment B, and therefore caused treatment B to appear more effective than it actually was. If we now decide to apply treatment B to all patients irrespective of the stone size, we break the causal path connecting the stone size to the outcome, we eliminate the illusion, and we will experience disappointing results.


When we suspect the existence of a confounding variable, we can split the contingency tables and reach improved conclusions. Unfortunately we cannot fully trust these conclusions unless we are certain to have taken into account all confounding variables. The real problem therefore comes from the confounding variables we do not know.


Randomized experiments arguably provide the only correct solution to this problem (see Stigler, 1992). The idea is to randomly choose whether the patient receives treatment A or treatment B. Because this random choice is independent of all the potential confounding variables, known and unknown, they cannot pollute the observed effect of the treatments (see also Section 4.2). This is why controlled experiments in ad placement (Section 2.2) randomly distribute users between treatment and control groups, and this is also why, in the case of an ad placement engine, we should be somehow concerned by the practical impossibility to randomly distribute both users and advertisers.


           Overall            q2 low            q2 high
q1 low     6.2% (124/2000)    5.1% (92/1823)    18.1% (32/176)
q1 high    7.5% (149/2000)    4.8% (71/1500)    15.6% (78/500)

Table 2: Confounding data in ad placement. The table reports the click-through rates and the click counts of the second mainline ad. The overall counts suggest that the click-through rate of the second mainline ad increases when the click probability estimate q1 of the top ad is high. However, if we further split the pages according to the click probability estimate q2 of the second mainline ad, we reach the opposite conclusion. See Section 2.4.


2.4 Confounding Data in Ad Placement


Let us return to the question of assessing the value of passing a new input signal to the ad placement engine click prediction model. Section 2.1 outlines a placement method where the click probability estimates qi,p(x) depend on the ad and the position we consider, but do not depend on other ads displayed on the page. We now consider replacing this model by a new model that additionally uses the estimated click probability of the top mainline ad to estimate the click probability of the second mainline ad (Figure 1). We would like to estimate the effect of such an intervention using existing statistical data.


We have collected ad placement data for Bing search result pages served during three consecutive hours on a certain slice of traffic. Let q1 and q2 denote the click probability estimates computed by the existing model for respectively the top mainline ad and the second mainline ad. After excluding pages displaying fewer than two mainline ads, we form two groups of 2000 pages randomly picked among those satisfying the conditions q1 < 0.15 for the first group and q1 ≥ 0.15 for the second group. Table 2 reports the click counts and frequencies observed on the second mainline ad in each group. Although the overall numbers show that users click more often on the second mainline ad when the top mainline ad has a high click probability estimate q1, this conclusion is reversed when we further split the data according to the click probability estimate q2 of the second mainline ad.
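As with Table 1, the reversal in Table 2 can be checked directly from the counts. The snippet below also makes the mechanism visible: the q1-high group concentrates far more of its pages in the high-q2 stratum (500 versus 176), so the pooled rates mix the strata with very different weights.

```python
# Click counts (clicks, pages) for the second mainline ad, from Table 2.
q2_low  = {"q1_low": (92, 1823), "q1_high": (71, 1500)}
q2_high = {"q1_low": (32, 176),  "q1_high": (78, 500)}

def ctr(clicks, pages):
    return clicks / pages

# Overall, a high q1 looks beneficial...
assert ctr(92 + 32, 2000) < ctr(71 + 78, 2000)     # 6.2% < 7.45%

# ...but within each q2 stratum the comparison reverses.
assert ctr(*q2_low["q1_low"]) > ctr(*q2_low["q1_high"])     # 5.05% > 4.73%
assert ctr(*q2_high["q1_low"]) > ctr(*q2_high["q1_high"])   # 18.2% > 15.6%

# The pooled comparison is dominated by the stratum weights:
assert 500 / 2000 > 176 / 2000  # q1-high pages are far more often high-q2
```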


Despite superficial similarities, this example is considerably more difficult to interpret than the kidney stone example. The overall click counts show that the actual click-through rate of the second mainline ad is positively correlated with the click probability estimate on the top mainline ad. Does this mean that we can increase the total number of clicks by placing regular ads below frequently clicked ads?


Remember that the click probability estimates depend on the search query which itself depends on the user intention. The most likely explanation is that pages with a high q1 are frequently associated with more commercial searches and therefore receive more ad clicks on all positions. The observed correlation occurs because the presence of a click and the magnitude of the click probability estimate q1 have a common cause: the user intention. Meanwhile, the click probability estimate q2 returned by the current model for the second mainline ad also depends on the query and therefore the user intention. Therefore, assuming that this dependence has comparable strength, and assuming that there are no other causal paths, splitting the counts according to the magnitude of q2 factors out the effects of this common confounding cause. We then observe a negative correlation which now suggests that a frequently clicked top mainline ad has a negative impact on the click-through rate of the second mainline ad.


If this is correct, we would probably increase the accuracy of the click prediction model by switching to the new model. This would decrease the click probability estimates for ads placed in the second mainline position on commercial search pages. These ads are then less likely to clear the reserve and therefore more likely to be displayed in the less attractive sidebar. The net result is probably a loss of clicks and a loss of money despite the higher quality of the click probability model. Although we could tune the reserve prices to compensate this unfortunate effect, nothing in this data tells us where the performance of the ad placement engine will land. Furthermore, unknown confounding variables might completely reverse our conclusions.


Making sense out of such data is just too complex!


2.5 A Better Way


It should now be obvious that we need a more principled way to reason about the effect of potential interventions. We provide one such more principled approach using the causal inference machinery (Section 3). The next step is then the identification of a class of questions that are sufficiently expressive to guide the designer of a complex learning system, and sufficiently simple to be answered using data collected in the past using adequate procedures (Section 4).


A machine learning algorithm can then be viewed as an automated way to generate questions about the parameters of a statistical model, obtain the corresponding answers, and update the parameters accordingly (Section 6). Learning algorithms derived in this manner are very flexible: human designers and machine learning algorithms can cooperate seamlessly because they rely on similar sources of information.

3. Modeling Causal Systems


When we point out a causal relationship between two events, we describe what we expect to happen to the event we call the effect, should an external operator manipulate the event we call the cause. Manipulability theories of causation (von Wright, 1971; Woodward, 2005) raise this commonsense insight to the status of a definition of the causal relation. Difficult adjustments are then needed to interpret statements involving causes that we can only observe through their effects, "because they love me," or that are not easily manipulated, "because the earth is round."


Modern statistical thinking makes a clear distinction between the statistical model and the world. The actual mechanisms underlying the data are considered unknown. The statistical models do not need to reproduce these mechanisms to emulate the observable data (Breiman, 2001). Better models are sometimes obtained by deliberately avoiding the reproduction of the true mechanisms (Vapnik, 1982, Section 8.6). We can approach the manipulability puzzle in the same spirit by viewing causation as a reasoning model (Bottou, 2011) rather than a property of the world. Causes and effects are simply the pieces of an abstract reasoning game. Causal statements that are not empirically testable acquire validity when they are used as intermediate steps when one reasons about manipulations or interventions amenable to experimental validation.


This section presents the rules of this reasoning game. We largely follow the framework proposed by Pearl (2009) because it gives a clear account of the connections between causal models and probabilistic models.



Figure 2: A structural equation model for ad placement.

    x = f1(u, ε1)          Query context x from user intent u.
    a = f2(x, v, ε2)       Eligible ads (ai) from query x and inventory v.
    b = f3(x, v, ε3)       Corresponding bids (bi).
    q = f4(x, a, ε4)       Scores (qi,p, Rp) from query x and ads a.
    s = f5(a, q, b, ε5)    Ad slate s from eligible ads a, scores q and bids b.
    c = f6(a, q, b, ε6)    Corresponding click prices c.
    y = f7(s, u, ε7)       User clicks y from ad slate s and user intent u.
    z = f8(y, c, ε8)       Revenue z from clicks y and prices c.

The sequence of equations describes the flow of information. The functions fk describe how effects depend on their direct causes. The additional noise variables εk represent independent sources of randomness useful to model probabilistic dependencies.


3.1 The Flow
of Information


Figure 2 gives a deterministic description of the operation of the ad placement engine. Variable u represents the user and his or her intention in an unspecified manner. The query and query context x is then expressed as an unknown function of u and of a noise variable ε1. Noise variables in this framework are best viewed as independent sources of randomness useful for modeling a nondeterministic causal dependency. We shall only mention them when they play a specific role in the discussion. The set of eligible ads a and the corresponding bids b are then derived from the query x and the ad inventory v supplied by the advertisers. Statistical models then compute a collection of scores q such as the click probability estimates qi,p and the reserves Rp introduced in Section 2.1. The placement logic uses these scores to generate the "ad slate" s, that is, the set of winning ads and their assigned positions. The corresponding click prices c are computed. The set of user clicks y is expressed as an unknown function of the ad slate s and the user intent u. Finally the revenue z is expressed as another function of the clicks y and the prices c.


Such a system of equations is called a structural equation model (Wright, 1921). Each equation asserts a functional dependency between an effect, appearing on the left-hand side of the equation, and its direct causes, appearing on the right-hand side as arguments of the function. Some of these causal dependencies are unknown. Although we postulate that the effect can be expressed as some function of its direct causes, we do not know the form of this function. For instance, the designer of the ad placement engine knows functions f2 to f6 and f8 because he has designed them. However, he does not know the functions f1 and f7 because whoever designed the user did not leave sufficient documentation.


Figure 3 represents the directed causal graph associated with the structural equation model. Each arrow connects a direct cause to its effect. The noise variables are omitted for simplicity. The structure of this graph reveals fundamental assumptions about our model. For instance, the user clicks y do not directly depend on the scores q or the prices c because users do not have access to this information.

We hold as a principle that causation obeys the arrow of time: causes always precede their effects. Therefore the causal graph must be acyclic. Structural equation models then support two fundamental operations, namely simulation and intervention.


Figure 3: Causal graph associated with the structural equation model of Figure 2. The mutually independent noise variables ε1 to ε8 are implicit. The variables a, b, q, s, c, and z depend on their direct causes in known ways. In contrast, the variables u and v are exogenous and the variables x and y depend on their direct causes through unknown functions.


• Simulation - Let us assume that we know both the exact form of all functional dependencies and the value of all exogenous variables, that is, the variables that never appear in the left-hand side of an equation. We can compute the values of all the remaining variables by applying the equations in their natural time order.

• Intervention - As long as the causal graph remains acyclic, we can construct derived structural equation models using arbitrary algebraic manipulations of the system of equations. For instance, we can clamp a variable to a constant value by rewriting the right-hand side of the corresponding equation as the specified constant value.
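Both operations can be made concrete on a toy structural equation model. The functional forms below are invented for illustration (a three-variable chain in the spirit of Figure 2, with the noise variables omitted for determinism); simulation evaluates the equations in causal order, and an intervention simply replaces one equation's right-hand side by a constant.

```python
# Toy structural equation model: each variable is an effect computed from
# its direct causes.  Invented functional forms, for illustration only.
equations = {
    "x": lambda v: v["u"],                             # query context from intent
    "q": lambda v: max(0.0, min(1.0, 0.5 * v["x"])),   # score from context
    "y": lambda v: 1.0 if v["q"] > 0.3 else 0.0,       # click from score
}

def simulate(eqs, u):
    """Simulation: evaluate the equations in their natural (causal) order."""
    values = {"u": u}
    for name, f in eqs.items():  # dict insertion order == causal order
        values[name] = f(values)
    return values

# Intervention: clamp q to a constant by rewriting its right-hand side.
clamped = dict(equations, q=lambda v: 0.9)

assert simulate(equations, u=0.4)["y"] == 0.0  # q = 0.2: no click
assert simulate(clamped, u=0.4)["y"] == 1.0    # clamped score forces the click
```

Note that the clamped model is a new, derived structural equation model: the original equations are untouched, which mirrors the algebraic view of interventions described above.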


The algebraic manipulation of the structural equation models provides a powerful language to describe interventions on a causal system. This is not a coincidence. Many aspects of the mathematical notation were invented to support causal inference in classical mechanics. However, we no longer have to interpret the variable values as physical quantities: the equations simply describe the flow of information in the causal model (Wiener, 1948).


3.2 The
Isolation Assumption


Let us now turn our attention to the exogenous variables, that is, variables that never appear in the left-hand side of an equation of the structural model. Leibniz's principle of sufficient reason claims that there are no facts without causes. This suggests that the exogenous variables are the effects of a network of causes not expressed by the structural equation model. For instance, the user intent u and the ad inventory v in Figure 3 have temporal correlations because both users and advertisers worry about their budgets when the end of the month approaches. Any structural equation model should then be understood in the context of a larger structural equation model potentially describing all things in existence.


Ads served on a particular page contribute to the continued satisfaction of both users and advertisers, and therefore have an effect on their willingness to use the services of the publisher in the future. The ad placement structural equation model shown in Figure 2 only describes the causal dependencies for a single page and therefore cannot account for such effects. Consider however a very large






Figure 4: Conceptually unrolling the user feedback loop by threading instances of the single page causal graph (Figure 3). Both the ad slate st and user clicks yt have an indirect effect on the user intent ut+1 associated with the next query.

structural equation model containing a copy of the page-level model for every web page ever served by the publisher. Figure 4 shows how we can thread the page-level models corresponding to pages served to the same user. Similarly we could model how advertisers track the performance and the cost of their advertisements and model how their satisfaction affects their future bids. The resulting causal graphs can be very complex. Part of this complexity results from time-scale differences. Thousands of search pages are served in a second. Each page contributes a little to the continued satisfaction of one user and a few advertisers. The accumulation of these contributions produces measurable effects after a few weeks.


Many of the functional dependencies expressed by the structural equation model are left unspecified. Without direct knowledge of these functions, we must reason using statistical data. The most fundamental statistical data is collected from repeated trials that are assumed independent. When we consider the large structural equation model of everything, we can only have one large trial producing a single data point. It is therefore desirable to identify repeated patterns of identical equations that can be viewed as repeated independent trials. Therefore, when we study a structural equation model representing such a pattern, we need to make an additional assumption to express the idea that the outcome of one trial does not affect the other trials. We call such an assumption an isolation assumption, by analogy with thermodynamics. This can be achieved by assuming that the exogenous variables are independently drawn from an unknown but fixed joint probability distribution. This assumption cuts the causation effects that could flow through the exogenous variables.


The noise variables are also exogenous variables acting as independent sources of randomness. The noise variables are useful to represent the conditional distribution P(effect | causes) using the equation effect = f(causes, ε). Therefore, we also assume joint independence between all the noise variables and any of the named exogenous variables. For instance, in the case of the ad


1. See also the discussion on reinforcement learning, Section 3.5.


2. The concept of isolation is pervasive in physics. An isolated system in thermodynamics (Reichl, 1998, Section 2.D) or a closed system in mechanics (Landau and Lifshitz, 1969, 5) evolves without exchanging mass or energy with its surroundings. Experimental trials involving systems that are assumed isolated may differ in their initial setup and therefore have different outcomes. Assuming isolation implies that the outcome of each trial cannot affect the other trials.


3. Rather than letting two noise variables display measurable statistical dependencies because they share a common cause, we prefer to name the common cause and make the dependency explicit in the graph.









Figure 6:
Bayesian network associated with the Markov factorization shown in Figure 5.


placement model shown in Figure 2, we assume that the joint distribution of the exogenous variables factorizes as

P(u, v, ε1, ..., ε8) = P(u, v) P(ε1) ··· P(ε8).

Since an isolation assumption is only true up to a point, it should be expressed clearly and remain under constant scrutiny. We must therefore measure additional performance metrics that reveal how well the isolation assumption holds. For instance, the ad placement structural equation model and the corresponding causal graph (figures 2 and 3) do not take user feedback or advertiser feedback into account. Measuring the revenue is not enough because we could easily generate revenue at the expense of the satisfaction of the users and advertisers. When we evaluate interventions under such an isolation assumption, we also need to measure a battery of additional quantities that act as proxies for the user and advertiser satisfaction. Noteworthy examples include ad relevance estimated by human judges, and advertiser surplus estimated from the auctions (Varian, 2009).


3.3 Markov Factorization

We can draw a sample of the exogenous variables using the distribution specified by the isolation assumption, and we can then generate values for all the remaining variables by simulating the structural equation model.


This process defines a generative probabilistic model representing the joint distribution of all variables in the structural equation model. The distribution readily factorizes as the product of the joint probability of the named exogenous variables, and, for each equation in the structural equation model, the conditional probability of the effect given its direct causes (Spirtes et al., 1993; Pearl, 2000). As illustrated by figures 5 and 6, this Markov factorization connects the structural equation model that describes causation, and the Bayesian network that describes the joint probability distribution followed by the variables under the isolation assumption.
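This sampling process can be sketched in a few lines of code. The two-equation structural model below is a hypothetical illustration, not the ad placement model: the exogenous variables are drawn first, then each equation is evaluated in causal order, producing one sample of the joint distribution.

```python
import random

# Hypothetical two-equation structural model: a = f(x, e1), y = g(a, e2).
# Drawing the exogenous variables first and then evaluating each equation in
# causal order yields one sample of the joint distribution P(x, a, y).
def sample_once(rng):
    x = rng.gauss(0.0, 1.0)           # named exogenous variable
    e1 = rng.gauss(0.0, 0.1)          # independent noise variables
    e2 = rng.gauss(0.0, 0.1)
    a = 1.0 if x + e1 > 0 else 0.0    # a = f(x, e1)
    y = 2.0 * a + x + e2              # y = g(a, e2)
    return x, a, y

rng = random.Random(0)
samples = [sample_once(rng) for _ in range(10000)]
# the empirical average of y is a Monte Carlo estimate of E[y] under the
# factorized distribution P(x) P(e1) P(e2) of the exogenous variables
mean_y = sum(s[2] for s in samples) / len(samples)
```

Averaging over many such samples estimates expectations under the Markov factorization discussed above.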


Structural equation models and Bayesian networks appear so intimately connected that it could be easy to forget the differences. The structural equation model is an algebraic object. As long as the causal graph remains acyclic, algebraic manipulations are interpreted as interventions on the causal system. The Bayesian network is a generative statistical model representing a class of joint probability distributions, and, as such, does not support algebraic manipulations. However, the symbolic representation of its Markov factorization is an algebraic object, essentially equivalent to the structural equation model.

3.4 Identification, Transportation, and Transfer Learning


Consider a causal system represented by a structural equation model with some unknown functional dependencies. Subject to the isolation assumption, data collected during the operation of this system follows the distribution described by the corresponding Markov factorization. Let us first assume that this data is sufficient to identify the joint distribution of the subset of variables we can observe. We can intervene on the system by clamping the value of some variables. This amounts to replacing the right-hand side of the corresponding structural equations by constants. The joint distribution of the variables is then described by a new Markov factorization that shares many factors with the original Markov factorization. Which conditional probabilities associated with this new distribution can we express using only conditional probabilities identified during the observation of the original system? This is called the identifiability problem. More generally, we can consider arbitrarily complex manipulations of the structural equation model, and we can perform multiple experiments involving different manipulations of the causal system. Which conditional probabilities pertaining to one experiment can be expressed using only conditional probabilities identified during the observation of other experiments? This is called the transportability problem.


Pearl's do-calculus completely solves the identifiability problem and provides useful tools to address many instances of the transportability problem (see Pearl, 2012). Assuming that we know the conditional probability distributions involving observed variables in the original structural equation model, do-calculus allows us to derive conditional distributions pertaining to the manipulated structural equation model.


In practice, we must further distinguish the conditional probabilities that we know (because we designed them) from those that we estimate from empirical data. This distinction is important because estimating the distribution of continuous or high cardinality variables is notoriously difficult. Furthermore, do-calculus often combines the estimated probabilities in ways that amplify estimation errors. This happens when the manipulated structural equation model exercises the variables in ways that were rarely observed in the data collected from the original structural equation model.


4. Bayesian networks are directed graphs representing the Markov factorization of a joint probability distribution: the arrows no longer have a causal interpretation.


Therefore we prefer to use much simpler causal inference techniques (see sections 4.1 and 4.2). Although these techniques do not have the completeness properties of do-calculus, they combine estimation and transportation in a manner that facilitates the derivation of useful confidence intervals.


3.5 Special Cases


Three special
cases of causal models are particularly relevant to this work.


• In the multi-armed bandit (Robbins, 1952), a user-defined policy function π determines the distribution of actions a ∈ {1, ..., K}, and an unknown reward function r determines the distribution of the outcome y given the action a (Figure 7). In order to maximize the accumulated rewards, the player must construct policies π that balance the exploration of the action space with the exploitation of the best action identified so far (Auer et al., 2002; Audibert et al., 2007; Seldin et al., 2012).


• The contextual bandit problem (Langford and Zhang, 2008) significantly increases the complexity of multi-armed bandits by adding one exogenous variable x to the policy function π and the reward function r (Figure 8).


• Both multi-armed bandits and contextual bandits are special cases of reinforcement learning (Sutton and Barto, 1998). In essence, a Markov decision process is a sequence of contextual bandits where the context is no longer an exogenous variable but a state variable that depends on the previous states and actions (Figure 9). Note that the policy function π, the reward function r, and the transition function s are independent of time. All the time dependencies are expressed using the states st.
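The multi-armed bandit structural equations can be simulated directly. In the sketch below, an ε-greedy policy stands in for π and the arm means are invented for illustration; nothing here is prescribed by the paper.

```python
import random

# Hypothetical 3-armed bandit following the structural equations
# a = pi(eps), y = r(a, eps'); an epsilon-greedy policy stands in for pi
# and the arm means below are invented for the simulation.
K = 3
TRUE_MEANS = [0.2, 0.5, 0.8]

def run(steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * K                   # pulls per arm
    values = [0.0] * K                 # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:     # explore: uniform random arm
            a = rng.randrange(K)
        else:                          # exploit: best arm identified so far
            a = max(range(K), key=lambda i: values[i])
        y = TRUE_MEANS[a] + rng.gauss(0.0, 0.1)   # y = r(a, eps')
        counts[a] += 1
        values[a] += (y - values[a]) / counts[a]  # incremental mean update
        total += y
    return total / steps, counts

avg_reward, counts = run()
```

The exploration term is exactly the "independent roll of the dice" that later makes reweighting possible.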


These special cases have increasing generality. Many simple structural equation models can be reduced to a contextual bandit problem using appropriate definitions of the context x, the action a and the outcome y. For instance, assuming that the prices c are discrete, the ad placement structural equation model shown in Figure 2 reduces to a contextual bandit problem with context (u, v), actions (s, c) and reward z. Similarly, given a sufficiently intricate definition of the state variables st, all structural equation models with discrete variables can be reduced to a reinforcement learning problem. Such reductions lose the fine structure of the causal graph. We show in Section 5 how this fine structure can in fact be leveraged to obtain more information from the same experiments.


Reinforcement learning algorithms (see Sutton and Barto, 1998) leverage the assumption that the policy function, the reward function, the transition function, and the distributions of the corresponding noise variables, are independent from time. This invariance property provides great benefits when the observed sequences of actions and rewards are long in comparison with the size of the state space. Only Section 7 in this contribution presents methods that take advantage of such an invariance. The general question of leveraging arbitrary functional invariances in causal graphs is left for future work.


4. Counterfactual Analysis

We now return to the problem of formulating and answering questions about the value of proposed changes of a learning system. Assume for instance that we consider replacing the score computation


a = π(ε)        Action a ∈ {1, ..., K}
y = r(a, ε′)    Reward y ∈ R

Figure 7: Structural equation model for the multi-armed bandit problem. The policy π selects a discrete action a, and the reward function r determines the outcome y. The noise variables ε and ε′ represent independent sources of randomness useful to model probabilistic dependencies.


a = π(x, ε)       Action a ∈ {1, ..., K}
y = r(x, a, ε′)   Reward y ∈ R

Figure 8: Structural equation model for the contextual bandit problem. Both the action and the reward depend on an exogenous context variable x.


Figure 9: Structural equation model for reinforcement learning. The above equations are replicated for all t ∈ {0, ..., T}. The context is now provided by a state variable st−1 that depends on the previous states and actions.

model M of an ad placement engine by an alternate model M*. We seek an answer to the conditional question:


"How will the system perform if we replace model M by model M*?"


With sufficient time and sufficient resources, we can obtain the answer using a controlled experiment (Section 2.2). However, instead of carrying out a new experiment, we would like to obtain an answer using data that we have already collected in the past.


"How would the system have performed if, when the data was collected, we had replaced model M by model M*?"


The answer to this counterfactual question is of course a counterfactual statement that describes the system performance subject to a condition that did not happen.


Counterfactual statements challenge ordinary logic because they depend on a condition that is known to be false. Although the assertion A ⇒ B is always true when assertion A is false, we certainly do not mean for all counterfactual statements to be true. Lewis (1973) navigates this paradox using a modal logic in which a counterfactual statement describes the state of affairs in an alternate world that resembles ours except for the specified differences. Counterfactuals indeed offer many subtle ways to qualify such alternate worlds. For instance, we can easily describe isolation assumptions (Section 3.2) in a counterfactual question:


"How would the system have performed if, when the data was collected, we had replaced model M by model M* without incurring user or advertiser reactions?"



Figure 10: Causal graph for an image recognition system. We can estimate counterfactuals by replaying data collected in the past.


Figure 11: Causal graph for a randomized experiment. We can estimate certain counterfactuals by reweighting data collected in the past.


The fact that we could not have changed the model without incurring the user and advertiser reactions does not matter any more than the fact that we did not replace model M by model M* in the first place. This does not prevent us from using counterfactual statements to reason about causes and effects. Counterfactual questions and statements provide a natural framework to express and share our conclusions.


The remaining text in this section explains how we can answer certain counterfactual questions using data collected in the past. More precisely, we seek to estimate performance metrics that can be expressed as expectations with respect to the distribution that would have been observed if the counterfactual conditions had been in force.


4.1 Replaying Empirical Data


Figure 10 shows the causal graph associated with a simple image recognition system. The classifier takes an image x and produces a prospective class label ŷ. The loss measures the penalty associated with recognizing class ŷ while the true class is y.


To estimate the expected error of such a classifier, we collect a representative data set composed of labelled images, run the classifier on each image, and average the resulting losses. In other words, we replay the data set to estimate what (counterfactual) performance would have been observed if we had used a different classifier. We can then select in retrospect the classifier that would have worked the best and hope that it will keep working well. This is the counterfactual viewpoint on empirical risk minimization (Vapnik, 1982).


Replaying the data set works because both the alternate classifier and the loss function are known. More generally, to estimate a counterfactual by replaying a data set, we need to know all the functional dependencies associated with all causal paths connecting the intervention point to the measurement point. This is obviously not always the case.


5. Although counterfactual expectations can be viewed as expectations of unit-level counterfactuals (Pearl, 2009, Definition 4), they elude the semantic subtleties of unit-level counterfactuals and can be measured with randomized experiments (see Section 4.2).

4.2 Reweighting Randomized Trials


Figure 11 illustrates the randomized experiment suggested in Section 2.3. The patients are randomly split into two equally sized groups receiving respectively treatments A and B. The overall success rate for this experiment is therefore Y = (YA + YB)/2 where YA and YB are the success rates observed for each group. We would like to estimate which (counterfactual) overall success rate Y* would have been observed if we had selected treatment A with probability p and treatment B with probability 1 − p.

Since we do not know how the outcome depends on the treatment and the patient condition, we cannot compute which outcome y would have been obtained if we had treated patient x with a different treatment u. Therefore we cannot answer this question by replaying the data as we did in Section 4.1.


However, observing different success rates YA and YB for the treatment groups reveals an empirical correlation between the treatment u and the outcome y. Since the only cause of the treatment u is an independent roll of the dice, this correlation cannot result from any known or unknown confounding common cause. Having eliminated this possibility, we can reweight the observed outcomes and compute the estimate Y* ≈ pYA + (1 − p)YB.
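This reweighting argument can be checked numerically. The sketch below simulates a synthetic trial with invented success probabilities (unknown to the estimator) and recovers the counterfactual success rate by reweighting each observed outcome by the ratio of counterfactual to actual assignment probabilities.

```python
import random

# Synthetic randomized trial: treatment A or B assigned with probability 1/2;
# the success probabilities (0.7 and 0.4) are invented and unknown to the
# estimator.  Reweighting by the assignment-probability ratio recovers the
# counterfactual success rate Y* = p Y_A + (1 - p) Y_B.
rng = random.Random(1)
data = []
for _ in range(20000):
    u = 'A' if rng.random() < 0.5 else 'B'         # actual assignment
    y = rng.random() < (0.7 if u == 'A' else 0.4)  # unknown outcome mechanism
    data.append((u, y))

p_star = 0.9                                       # counterfactual P(A)
actual = {'A': 0.5, 'B': 0.5}
counter = {'A': p_star, 'B': 1.0 - p_star}
y_star = sum((counter[u] / actual[u]) * y for u, y in data) / len(data)

# cross-check against the group success rates Y_A and Y_B
ya = [y for u, y in data if u == 'A']
yb = [y for u, y in data if u == 'B']
direct = p_star * (sum(ya) / len(ya)) + (1.0 - p_star) * (sum(yb) / len(yb))
```

Both estimates agree up to sampling noise, as the argument above predicts.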


4.3 Markov
Factor Replacement


The reweighting approach can in fact be applied under much less stringent conditions. Let us return to the ad placement problem to illustrate this point.


The average number of ad clicks per page is often called the click yield. Increasing the click yield usually benefits both the advertiser and the publisher, whereas increasing the revenue per page often benefits the publisher at the expense of the advertiser. Click yield is therefore a very useful metric when we reason with an isolation assumption that ignores the advertiser reactions to pricing changes.


Let ω be a shorthand for all variables appearing in the Markov factorization of the ad placement structural equation model,

P(ω) = P(u, v) P(x|u) P(a|x, v) P(b|x, v) P(q|x, a) P(s|a, q, b) P(c|a, q, b) P(y|s, u) P(z|y, c).

Variable y was defined in Section 3.1 as the set of user clicks. In the rest of the document, we slightly abuse this notation by using the same letter y to represent the number of clicks. We also write the expectation Y = Eω∼P(ω)[y] using the integral notation

Y = ∫ω y P(ω).

We would like to estimate what the expected click yield Y* would have been if we had used a different scoring function (Figure 12). This intervention amounts to replacing the actual factor P(q|x, a) by a counterfactual factor P*(q|x, a) in the Markov factorization:

P*(ω) = P(u, v) P(x|u) P(a|x, v) P(b|x, v) P*(q|x, a) P(s|a, q, b) P(c|a, q, b) P(y|s, u) P(z|y, c).


6. See also the discussion of Reichenbach's common cause principle and of its limitations in Spirtes et al. (1993) and Spirtes and Scheines (2004).



Figure 12:
Estimating which average number of clicks per page would have been observed if
we had used a different scoring model.


Let us assume, for simplicity, that the actual factor P(q|x, a) is nonzero everywhere. We can then estimate the counterfactual expected click yield Y* using the transformation


Y* = ∫ω y P*(ω) = ∫ω y [P*(q|x, a) / P(q|x, a)] P(ω) ≈ (1/n) ∑i yi P*(qi|xi, ai) / P(qi|xi, ai)

where the data set of tuples (ai, xi, qi, yi) is distributed according to the actual Markov factorization instead of the counterfactual Markov factorization. This data could therefore have been collected during the normal operation of the ad placement system. Each sample is reweighted to reflect its probability of occurrence under the counterfactual conditions.


In general,
we can use importance sampling to estimate the counterfactual expectation of
any quantity ℓ(ω):


Y* = ∫ω ℓ(ω) P*(ω) = ∫ω ℓ(ω) [P*(ω)/P(ω)] P(ω) ≈ (1/n) ∑i ℓ(ωi) wi

with weights

wi = w(ωi) = P*(ωi)/P(ωi) = P*(qi|xi, ai)/P(qi|xi, ai).   (6)
Equation (6) emphasizes the simplifications resulting from the algebraic similarities of the actual and counterfactual Markov factorizations. Because of these simplifications, the evaluation of the weights only requires the knowledge of the few factors that differ between P(ω) and P*(ω). Each data sample needs to provide the value of ℓ(ωi) and the values of all variables needed to evaluate the factors that do not cancel in the ratio (6).
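The factor-replacement weights can be exercised on a toy chain. Everything below (the Gaussian score factors, the threshold click model, the shift of 0.5) is a hypothetical stand-in for the ad placement factors; the point is that all factors except the score factor cancel in the weight.

```python
import math
import random

# Toy Markov factor replacement on a chain x -> q -> y: the actual score
# factor is q ~ N(x, 1), the counterfactual factor is q ~ N(x + 0.5, 1).
# All other factors cancel in the weight, leaving a ratio of score densities.
def gauss_pdf(t, mu):
    return math.exp(-0.5 * (t - mu) ** 2) / math.sqrt(2.0 * math.pi)

rng = random.Random(2)
n, total = 50000, 0.0
for _ in range(n):
    x = rng.gauss(0.0, 1.0)            # context, same in both factorizations
    q = rng.gauss(x, 1.0)              # actual factor P(q | x)
    y = 1.0 if q > 0 else 0.0          # downstream click indicator (assumed)
    w = gauss_pdf(q, x + 0.5) / gauss_pdf(q, x)   # P*(q | x) / P(q | x)
    total += y * w
y_star = total / n                     # counterfactual estimate of E*[y]
```

Only the logged values of x and q are needed to compute the weight, mirroring the remark above about the factors that do not cancel.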


In contrast, the replaying approach (Section 4.1) demands the knowledge of all factors of P(ω) connecting the point of intervention to the point of measurement ℓ(ω). On the other hand, it does not require the knowledge of factors appearing only in the reweighting ratio (6).

Importance sampling relies on the assumption that all the factors appearing in the denominator of the reweighting ratio (6) are nonzero whenever the factors appearing in the numerator are nonzero. Since these factors represent conditional probabilities resulting from the effect of an independent noise variable in the structural equation model, this assumption means that the data must be collected with an experiment involving active randomization. We must therefore design cost-effective randomized experiments that yield enough information to estimate many interesting counterfactual expectations with sufficient accuracy. This problem cannot be solved without answering the confidence interval question: given data collected with a certain level of randomization, with which accuracy can we estimate a given counterfactual expectation?



4.4 Confidence Intervals


At first sight, we can invoke the law of large numbers and write

Y* = ∫ω ℓ(ω) w(ω) P(ω) ≈ (1/n) ∑i ℓ(ωi) wi.   (7)


For sufficiently
large n, the central limit theorem provides confidence intervals whose width
grows with the standard deviation of the product ℓ(ω)w(ω).


Unfortunately, when P(ω) is small, the reweighting ratio w(ω) takes large values with low probability. This heavy tailed distribution has annoying consequences because the variance of the integrand could be very high or infinite. When the variance is infinite, the central limit theorem does not hold. When the variance is merely very large, the central limit convergence might occur too slowly to justify such confidence intervals. Importance sampling works best when the actual distribution and the counterfactual distribution overlap.


When the counterfactual distribution has significant mass in domains where the actual distribution is small, the few samples available in these domains receive very high weights. Their noisy contribution dominates the reweighted estimate (7). We can obtain better confidence intervals by eliminating these few samples drawn in poorly explored domains. The resulting bias can be bounded using prior knowledge, for instance with an assumption about the range of values taken by ℓ(ω),

∀ω  ℓ(ω) ∈ [0, M].   (8)


Let us choose the maximum weight value R deemed acceptable for the weights. We have obtained very consistent results in practice with R equal to the fifth largest reweighting ratio observed on the empirical data. We can then rely on clipped weights to eliminate the contribution of the poorly explored domains,


w̄(ω) = w(ω) if P*(ω) < R P(ω), and w̄(ω) = 0 otherwise.

The condition P*(ω) < R P(ω) ensures that the ratio has a nonzero denominator P(ω) and is smaller than R. Let ΩR be the set of all values of ω associated with acceptable ratios:

ΩR = { ω : P*(ω) < R P(ω) }.

We can decompose Y* in two terms:

Y* = ∫ω∈ΩR ℓ(ω) P*(ω) + ∫ω∈Ω\ΩR ℓ(ω) P*(ω) = Ȳ* + (Y* − Ȳ*).   (9)

7. This is in fact a slight abuse because the theory calls for choosing R before seeing the data.

The first term of this decomposition is the clipped expectation Ȳ*. Estimating the clipped expectation Ȳ* is much easier than estimating Y* from (7) because the clipped weights w̄(ω) are bounded by R.


Ȳ* = ∫ω∈ΩR ℓ(ω) P*(ω) = ∫ω ℓ(ω) w̄(ω) P(ω) ≈ Ŷ* = (1/n) ∑i ℓ(ωi) w̄(ωi).   (10)

The second term of Equation (9) can be bounded by leveraging assumption (8). The resulting bound can then be conveniently estimated using only the clipped weights:


Y* − Ȳ* = ∫ω∈Ω\ΩR ℓ(ω) P*(ω) ∈ [ 0, M P*(Ω\ΩR) ] = [ 0, M (1 − W̄*) ]   with

W̄* = P*(ΩR) = ∫ω∈ΩR P*(ω) = ∫ω w̄(ω) P(ω) ≈ Ŵ* = (1/n) ∑i w̄(ωi).   (11)



Since the clipped weights are bounded, the estimation errors associated with (10) and (11) are well characterized using either the central limit theorem or using empirical Bernstein bounds (see appendix B for details). Therefore we can derive an outer confidence interval of the form


P{ Ŷ* − εR ≤ Ȳ* ≤ Ŷ* + εR } ≥ 1 − δ   (12)

and an inner
confidence interval of the form


P{ Ȳ* ≤ Y* ≤ Ȳ* + M (1 − Ŵ* + ξR) } ≥ 1 − δ.   (13)


The names inner and outer are in fact related to our preferred way to visualize these intervals (e.g., Figure 13). Since the bounds on Y* − Ȳ* can be written as

Ȳ* ≤ Y* ≤ Ȳ* + M (1 − W̄*),

we can derive our final confidence interval,

P{ Ŷ* − εR ≤ Y* ≤ Ŷ* + M (1 − Ŵ* + ξR) + εR } ≥ 1 − 2δ.   (14)



In conclusion, replacing the unbiased importance sampling estimator (7) by the clipped importance sampling estimator (10) with a suitable choice of R leads to improved confidence intervals. Furthermore, since the derivation of these confidence intervals does not rely on the assumption that P(ω) is nonzero everywhere, the clipped importance sampling estimator remains valid when the distribution P(ω) has a limited support. This relaxes the main restriction associated with importance sampling.


4.5 Interpreting the Confidence Intervals


The estimation of the counterfactual expectation Y* can be inaccurate because the sample size is insufficient or because the sampling distribution P(ω) does not sufficiently explore the counterfactual conditions of interest.


By construction, the clipped expectation Ȳ* ignores the domains poorly explored by the sampling distribution P(ω). The difference Y* − Ȳ* then reflects the inaccuracy resulting from a lack of exploration. Therefore, assuming that the bound R has been chosen competently, the relative sizes of the outer and inner confidence intervals provide precious cues to determine whether we can continue collecting data using the same experimental setup or should adjust the data collection experiment in order to obtain a better coverage.


• The inner confidence interval (13) witnesses the uncertainty associated with the domain Ω\ΩR insufficiently explored by the actual distribution. A large inner confidence interval suggests that the most practical way to improve the estimate is to adjust the data collection experiment in order to obtain a better coverage of the counterfactual conditions of interest.


• The outer confidence interval (12) represents the uncertainty that results from the limited sample size. A large outer confidence interval indicates that the sample is too small. To improve the result, we simply need to continue collecting data using the same experimental setup.


4.6 Experimenting with Mainline Reserves


We return to the ad placement problem to illustrate the reweighting approach and the interpretation of the confidence intervals. Manipulating the reserves Rp(x) associated with the mainline positions (Figure 1) controls which ads are prominently displayed in the mainline or displaced into the sidebar.


We seek in this section to answer counterfactual questions of the form:

"How would the ad placement system have performed if we had scaled the mainline reserves by a constant factor ρ, without incurring user or advertiser reactions?"

Randomization was introduced using a modified version of the ad placement engine. Before determining the ad layout (see Section 2.1), a random number ε is drawn according to the standard normal distribution N(0, 1), and all the mainline reserves are multiplied by m = ρ exp(−σ²/2 + σε). Such multipliers follow a log-normal distribution whose mean is ρ and whose width is controlled by σ. This effectively provides a parametrization of the conditional score distribution P(q|x, a) (see Figure 5.)


The Bing search platform offers many ways to select traffic for controlled experiments (Section 2.2). In order to match our isolation assumption, individual page views were randomly assigned to traffic buckets without regard to the user identity. The main treatment bucket was processed with mainline reserves randomized by a multiplier drawn as explained above with ρ = 1 and σ = 0.3. With these parameters, the mean multiplier is exactly 1, and 95% of the multipliers are in range [0.52, 1.74]. Samples describing 22 million search result pages were collected during five consecutive weeks.

We then use this data to estimate what would have been measured if the mainline reserve multipliers had been drawn according to a distribution determined by parameters ρ* and σ*. This is achieved by reweighting each sample ωi with


wi = P*(qi|xi, ai) / P(qi|xi, ai) = p(mi; ρ*, σ*) / p(mi; ρ, σ)

where mi is the multiplier drawn for this sample during the data collection experiment, and p(t; ρ, σ) is the density of the log-normal multiplier distribution.


Figure 13 reports results obtained by varying ρ* while keeping σ* = σ. This amounts to estimating what would have been measured if all mainline reserves had been multiplied by ρ* while keeping the same randomization. The curves bound 95% confidence intervals on the variations of the average number of mainline ads displayed per page, the average number of ad clicks per page,

8. More precisely, ln N(μ, σ²) with μ = −σ²/2 + log ρ.

Figure 13: Estimated variations of three performance metrics in response to mainline reserve changes. The curves delimit 95% confidence intervals for the metrics we would have observed if we had increased the mainline reserves by the percentage shown on the horizontal axis. The filled areas represent the inner confidence intervals. The hollow squares represent the metrics measured on the experimental data. The hollow circles represent metrics measured on a second experimental bucket with mainline reserves reduced by 18%. The filled circles represent the metrics effectively measured on a control bucket running without randomization.






and the average revenue per page, as functions of ρ*. The inner confidence intervals, represented by the filled areas, grow sharply when ρ* leaves the range explored during the data collection experiment. The average revenue per page has more variance because a few very competitive queries command high prices.


In order to validate the accuracy of these counterfactual estimates, a second traffic bucket of equal size was configured with mainline reserves reduced by about 18%. The hollow circles in Figure 13 represent the metrics effectively measured on this bucket during the same time period. The effective measurements and the counterfactual estimates match with high accuracy.


Finally, in order to measure the cost of the randomization, we also ran the unmodified ad placement system on a control bucket. The brown filled circles in Figure 13 represent the metrics effectively measured on the control bucket during the same time period. The randomization caused a small but statistically significant increase of the number of mainline ads per page. The click yield and average revenue differences are not significant.


This experiment shows that we can obtain accurate counterfactual estimates with affordable randomization strategies. However, this nice conclusion does not capture the true practical value of the counterfactual estimation approach.


4.7 More on Mainline Reserves


The main benefit of the counterfactual estimation approach is the ability to use the same data to answer a broad range of counterfactual questions. Here are a few examples of counterfactual questions that can be answered using data collected using the simple mainline reserve randomization scheme described in the previous section:


• Different variances - Instead of estimating what would have been measured if we had increased the mainline reserves without changing the randomization variance, that is, letting σ⋆ = σ, we can use the same data to estimate what would have been measured if we had also changed σ⋆. This provides the means to determine which level of randomization we can afford in future experiments.


• Pointwise estimates - We often want to estimate what would have been measured if we had set the mainline reserves to a specific value without randomization. Although computing estimates for small values of σ often works well enough, very small values lead to large confidence intervals.


Let Yν(ρ) represent the expectation we would have observed if the multipliers m had mean ρ and variance ν. We have then Yν(ρ) = Em[E[y|m]] = Em[Y0(m)]. Assuming that the pointwise value Y0 is smooth enough for a second order development,

Yν(ρ) ≈ Em[ Y0(ρ) + (m − ρ) Y′0(ρ) + (m − ρ)² Y′′0(ρ)/2 ] = Y0(ρ) + ν Y′′0(ρ)/2 .


Although the reweighting method cannot estimate the pointwise value Y0(ρ) directly, we can use the reweighting method to estimate both Yν(ρ) and Y2ν(ρ) with acceptable confidence intervals and write Y0(ρ) ≈ 2Yν(ρ) − Y2ν(ρ) (Goodwin, 2011).
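As a numeric sanity check of this extrapolation, the sketch below uses a hypothetical smooth metric Y0 (an illustrative function, not anything measured in our experiments), computes Yν(ρ) by quadrature, and verifies that 2Yν(ρ) − Y2ν(ρ) cancels the leading O(ν) bias:

```python
import math

def Y0(rho):
    # hypothetical smooth pointwise metric (illustrative only)
    return rho * math.exp(-rho)

def Y_nu(rho, nu, n=20001):
    # E_m[Y0(m)] for m ~ Normal(rho, nu), by Riemann-sum quadrature
    sigma = math.sqrt(nu)
    lo, hi = rho - 6.0 * sigma, rho + 6.0 * sigma
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        m = lo + i * h
        density = math.exp(-(m - rho) ** 2 / (2.0 * nu)) / math.sqrt(2.0 * math.pi * nu)
        total += Y0(m) * density * h
    return total

rho, nu = 1.0, 0.04
naive = Y_nu(rho, nu)                                     # biased by ~ nu * Y0''(rho)/2
extrapolated = 2.0 * Y_nu(rho, nu) - Y_nu(rho, 2.0 * nu)  # cancels the O(nu) term
assert abs(extrapolated - Y0(rho)) < abs(naive - Y0(rho))
```

The residual error of the extrapolated value is O(ν²), dominated by the fourth derivative of Y0.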


• Query-dependent reserves - Compare for instance the queries "car insurance" and "common cause principle" in a web search engine. Since the advertising potential of a search varies considerably with the query, it makes sense to investigate various ways to define query-dependent reserves (Charles and Chickering, 2012).


The data collected using the simple mainline reserve randomization can also be used to estimate what would have been measured if we had increased all the mainline reserves by a query-dependent multiplier ρ⋆(x). This is simply achieved by reweighting each sample ωi with


wi = P⋆(qi|xi, ai) / P(qi|xi, ai) = p(mi; ρ⋆(xi), σ) / p(mi; ρ, σ) .
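This reweighting can be sketched as follows. The log-normal parameterization (log-location log ρ) and the sample layout are illustrative assumptions, not the production code:

```python
import math

def lognormal_pdf(m, rho, sigma):
    # density p(m; rho, sigma) of exp(N(log rho, sigma^2));
    # the exact parameterization used in production may differ
    return (math.exp(-(math.log(m) - math.log(rho)) ** 2 / (2.0 * sigma ** 2))
            / (m * sigma * math.sqrt(2.0 * math.pi)))

def counterfactual_estimate(samples, rho, sigma, rho_star):
    # samples: (x_i, m_i, y_i) with m_i drawn from p(m; rho, sigma);
    # rho_star(x) is the hypothetical query-dependent multiplier
    ws = [lognormal_pdf(m, rho_star(x), sigma) / lognormal_pdf(m, rho, sigma)
          for x, m, _ in samples]
    return sum(w * y for w, (_, _, y) in zip(ws, samples)) / len(samples)
```

With rho_star(x) = ρ the weights are identically one and the estimator reduces to the empirical average.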

Even broader ranges of counterfactual questions can be answered when data is collected using randomization schemes that explore more dimensions. For instance, in the case of the ad placement problem, we could apply an independent random multiplier for each score instead of applying a single random multiplier to the mainline reserves only. However, the more dimensions we randomize, the more data needs to be collected to effectively explore all these dimensions. Fortunately, as discussed in Section 5, the structure of the causal graph reveals many ways to leverage a priori information and improve the confidence intervals.


4.8 Related Work

Importance sampling is widely used to deal with covariate shifts (Shimodaira, 2000; Sugiyama et al., 2007). Since manipulating the causal graph changes the data distribution, such an intervention can be viewed as a covariate shift amenable to importance sampling. Importance sampling techniques have also been proposed without causal interpretation for many of the problems that we view as causal inference problems. In particular, the work presented in this section is closely related to the Monte-Carlo approach of reinforcement learning (Sutton and Barto, 1998, Chapter 5) and to the offline evaluation of contextual bandit policies (Li et al., 2010, 2011).


Reinforcement learning research traditionally focuses on control problems with relatively small discrete state spaces and long sequences of observations. This focus reduces the need for characterizing exploration with tight confidence intervals. For instance, Sutton and Barto suggest to normalize the importance sampling estimator by 1/∑i w(ωi) instead of 1/n. This would give erroneous results when the data collection distribution leaves parts of the state space poorly explored. Contextual bandits are traditionally formulated with a finite set of discrete actions. For instance, Li's (2011) unbiased policy evaluation assumes that the data collection policy always selects an arbitrary action with probability greater than some small constant. This is not possible when the action space is infinite.


Such assumptions on the data collection distribution are often impractical. For instance, certain ad placement policies are not worth exploring because they cannot be implemented efficiently or are known to elicit fraudulent behaviors. There are many practical situations in which one is only interested in limited aspects of the ad placement policy involving continuous parameters such as click prices or reserves. Discretizing such parameters eliminates useful a priori knowledge: for instance, if we slightly increase a reserve, we can reasonably believe that we are going to show slightly fewer ads.


Instead of making assumptions on the data collection distribution, we construct a biased estimator (10) and bound its bias. We then interpret the inner and outer confidence intervals as resulting from a lack of exploration or an insufficient sample size.






Finally, the causal framework allows us to easily formulate counterfactual questions that pertain to the practical ad placement problem and yet differ considerably in complexity and exploration requirements. We can address specific problems identified by the engineers without incurring the risks associated with a complete redesign of the system. Each of these incremental steps helps demonstrate the soundness of the approach.




5. Structure

This section shows how the structure of the causal graph reveals many ways to leverage a priori knowledge and improve the accuracy of our counterfactual estimates. Displacing the reweighting point (Section 5.1) improves the inner confidence interval and therefore reduces the need for exploration. Using a prediction function (Section 5.2) essentially improves the outer confidence interval and therefore reduces the sample size requirements.


5.1 Better
Reweighting Variables


Many search result pages come without eligible ads. We then know with certainty that such pages will have zero mainline ads, receive zero clicks, and generate zero revenue. This is true for the randomly selected value of the reserve, and this would have been true for any other value of the reserve. We can exploit this knowledge by pretending that the reserve was drawn from the counterfactual distribution P⋆(q|xi, ai) instead of the actual distribution P(q|xi, ai). The ratio w(ωi) is therefore forced to the unity. This does not change the estimate but reduces the size of the inner confidence interval. The results of Figure 13 were in fact helped by this little optimization.


There are in fact many circumstances in which the observed outcome would have been the same for other values of the randomized variables. This prior knowledge is in fact encoded in the structure of the causal graph and can be exploited in a more systematic manner. For instance, we know that users make click decisions without knowing which scores were computed by the ad placement engine, and without knowing the prices charged to advertisers. The ad placement causal graph encodes this knowledge by showing the clicks y as direct effects of the user intent u and the ad slate s. This implies that the exact value of the scores q does not matter to the clicks y as long as the ad slate s remains the same.


Because the causal graph has this special structure, we can simplify both the actual and counterfactual Markov factorizations (2) (3) without eliminating the variable y whose expectation is sought. Successively eliminating the variables z, c, and q yields:

P(u, v, x, a, b, s, y) = P(u, v) P(x|u) P(a|x, v) P(b|x, v) P(s|x, a, b) P(y|s, u) ,
P⋆(u, v, x, a, b, s, y) = P(u, v) P(x|u) P(a|x, v) P(b|x, v) P⋆(s|x, a, b) P(y|s, u) .

The conditional distributions P(s|x, a, b) and P⋆(s|x, a, b) did not originally appear in the Markov factorization. They are defined by marginalization as a consequence of the elimination of the variable q representing the scores.

P(s|x, a, b) = ∫q P(s|a, q, b) P(q|x, a) ,    P⋆(s|x, a, b) = ∫q P(s|a, q, b) P⋆(q|x, a) .





Figure 14: Estimated variations of two performance metrics in response to mainline reserve changes. These estimates were obtained using the ad slates s as reweighting variable. Compare the inner confidence intervals with those shown in Figure 13.


We can estimate the counterfactual click yield Y⋆ using these simplified factorizations:

Y⋆ = ∫ y P⋆(u, v, x, a, b, s, y) = ∫ y [ P⋆(s|x, a, b) / P(s|x, a, b) ] P(u, v, x, a, b, s, y) ≈ (1/n) ∑i yi P⋆(si|xi, ai, bi) / P(si|xi, ai, bi) .    (16)



We have reproduced the experiments described in Section 4.6 with the counterfactual estimate (16) instead of (4). For each example ωi, we determine which range [mi^min, mi^max] of mainline reserve multipliers could have produced the observed ad slate si, and then compute the reweighting ratio using the formula:


wi = P⋆(si|xi, ai, bi) / P(si|xi, ai, bi) = [ Ψ(mi^max; ρ⋆, σ) − Ψ(mi^min; ρ⋆, σ) ] / [ Ψ(mi^max; ρ, σ) − Ψ(mi^min; ρ, σ) ] ,

where Ψ(m; ρ, σ) is the cumulative distribution function of the log-normal multiplier distribution. Figure 14 shows counterfactual estimates obtained using the same data as Figure 13. The obvious improvement of the confidence intervals significantly extends the range of mainline reserve multipliers for which we can compute accurate counterfactual expectations using this same data.

Figure 16: A distribution on the scores q induces a distribution on the possible ad slates s. If the observed slate is slate2, the reweighting ratio is 34/22.
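The reweighting ratio based on the cumulative Ψ of the log-normal multiplier distribution can be sketched as follows; the parameterization (log-location log ρ) and the helper names are assumptions for illustration:

```python
import math

def lognormal_cdf(m, rho, sigma):
    # Psi(m; rho, sigma): cumulative distribution of a multiplier
    # distributed as exp(N(log rho, sigma^2)) (assumed parameterization)
    z = (math.log(m) - math.log(rho)) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def slate_weight(m_min, m_max, rho, rho_star, sigma):
    # ratio of the probabilities that a multiplier drawn from the
    # counterfactual (rho_star) and actual (rho) distributions falls
    # in the interval [m_min, m_max] producing the observed ad slate
    num = lognormal_cdf(m_max, rho_star, sigma) - lognormal_cdf(m_min, rho_star, sigma)
    den = lognormal_cdf(m_max, rho, sigma) - lognormal_cdf(m_min, rho, sigma)
    return num / den
```

When ρ⋆ = ρ the ratio is one for every slate, as expected.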


Comparing (4) and (16) makes the difference very clear: instead of computing the ratio of the probabilities of the observed scores under the counterfactual and actual distributions, we compute the ratio of the probabilities of the observed ad slates under the counterfactual and actual distributions. As illustrated by Figure 15, we now distinguish the reweighting variable (or variables) from the intervention. In general, the corresponding manipulation of the Markov factorization consists of marginalizing out all the variables that appear on the causal paths connecting the point of intervention to the reweighting variables and factoring all the independent terms out of the integral. This simplification works whenever the reweighting variables intercept all the causal paths connecting the point of intervention to the measurement variable. In order to compute the new reweighting ratios, all the factors remaining inside the integral, that is, all the factors appearing on the causal paths connecting the point of intervention to the reweighting variables, have to be known.


Figure 14 does not report the average revenue per page because the revenue z also depends on the scores q through the click prices c. This causal path is not intercepted by the ad slate variable s alone. However, we can introduce a new variable c̃ = f(c, y) that filters out the click prices computed for ads that did not receive a click. Markedly improved revenue estimates are then obtained by reweighting according to the joint variable (s, c̃).

Figure 16 illustrates the same approach applied to the simultaneous randomization of all the scores q using independent log-normal multipliers. The weight w(ωi) is the ratio of the probabilities of the observed ad slate si under the counterfactual and actual multiplier distributions. Computing these probabilities amounts to integrating a multivariate Gaussian distribution (Genz, 1992). Details will be provided in a forthcoming publication.

5.2 Variance
Reduction with Predictors


Although we do not know exactly how the variable of interest ℓ(ω) depends on the measurable variables and is affected by interventions on the causal graph, we may have strong a priori knowledge about this dependency. For instance, if we augment the slate s with an ad that usually receives a lot of clicks, we can expect an increase of the number of clicks.


Let the invariant variables υ be all observed variables that are not direct or indirect effects of variables affected by the intervention under consideration. This definition implies that the distribution of the invariant variables is not affected by the intervention. Therefore the values υi of the invariant variables sampled during the actual experiment are also representative of the distribution of the invariant variables under the counterfactual conditions.


We can leverage a priori knowledge to construct a predictor ζ(ω) of the quantity ℓ(ω) whose counterfactual expectation Y⋆ is sought. We assume that the predictor ζ(ω) depends only on the invariant variables or on variables that depend on the invariant variables through known functional dependencies. Given sampled values υi of the invariant variables, we can replay both the original and manipulated structural equation model as explained in Section 4.1 and obtain samples ζi and ζ⋆i that respectively follow the actual and counterfactual distributions.

Then, regardless of the quality of the predictor,

Y⋆ = ∫ω ℓ(ω) P⋆(ω) = ∫ω ζ(ω) P⋆(ω) + ∫ω ( ℓ(ω) − ζ(ω) ) P⋆(ω) ≈ (1/n) ∑i ζ⋆i + (1/n) ∑i ( ℓ(ωi) − ζ(ωi) ) w(ωi) .    (17)



The first term in this sum represents the counterfactual expectation of the predictor and can be accurately estimated by averaging the simulated counterfactual samples ζ⋆i without resorting to potentially large importance weights. The second term in this sum represents the counterfactual expectation of the residuals ℓ(ω) − ζ(ω) and must be estimated using importance sampling. Since the magnitude of the residuals is hopefully smaller than that of ℓ(ω), the variance of (ℓ(ω) − ζ(ω)) w(ω) is reduced and the importance sampling estimator of the second term has improved confidence intervals. The more accurate the predictor ζ(ω), the more effective this variance reduction strategy.
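Schematically, this estimator can be written as below; `ell`, `zeta`, `zeta_star`, and `w` are caller-supplied callables standing in for the quantities of the text, not any specific system code:

```python
def predictor_adjusted_estimate(omegas, ell, zeta, zeta_star, w):
    # (1/n) sum_i zeta*_i + (1/n) sum_i (ell(omega_i) - zeta(omega_i)) w(omega_i).
    # The first average needs no importance weights because zeta* is replayed
    # under the counterfactual model; only the residuals are reweighted.
    n = len(omegas)
    return (sum(zeta_star(o) for o in omegas) / n
            + sum((ell(o) - zeta(o)) * w(o) for o in omegas) / n)
```

With a perfect predictor (ζ equal to ℓ) the reweighted residual term vanishes entirely, and the remaining variance comes only from the replayed counterfactual samples.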


This variance reduction technique is in fact identical to the doubly robust contextual bandit evaluation technique of Dudík et al. (2012). Doubly robust variance reduction has also been extensively used for causal inference applied to biostatistics (see Robins et al., 2000; Bang and Robins, 2005). We subjectively find that viewing the predictor as a component of the causal graph (Figure 17) clarifies how a well designed predictor can leverage prior knowledge. For instance, in order to estimate the counterfactual performance of the ad placement system, we can easily use a predictor that runs the ad auction and simulate the user clicks using a click probability model trained offline.






Figure 17: Leveraging a predictor. Yellow nodes represent known functional relations in the structural equation model. We can estimate the counterfactual expectation Y⋆ of the number of clicks per page as the sum of the counterfactual expectations of a predictor ζ, which is easy to estimate by replaying empirical data, and y − ζ, which has to be estimated by importance sampling but has reduced variance.


Figure 18: The two plots show the hourly click yield for two variants of the ad placement engine. The daily variations dwarf the differences between the two treatments.






5.3 Invariant Predictors

In order to evaluate which of two interventions is most likely to improve the system, the designer of a learning system often seeks to estimate a counterfactual difference, that is, the difference Y⁺ − Y⋆ of the expectations of a same quantity ℓ(ω) under two different counterfactual distributions P⁺(ω) and P⋆(ω). These expectations are often affected by variables whose value is left unchanged by the interventions under consideration. For instance, seasonal effects can have very large effects on the number of ad clicks (Figure 18) but affect Y⁺ and Y⋆ in similar ways.


Slightly better confidence intervals on the difference Y⁺ − Y⋆ can be obtained using an invariant predictor, that is, a predictor function that depends only on invariant variables υ such as the time of the day. Since the invariant predictor ζ(υ) is not affected by the interventions under consideration,


∫ω ζ(υ) P⋆(ω) = ∫ω ζ(υ) P⁺(ω) .    (18)





Y⁺ − Y⋆ = ∫ω ζ(υ) P⁺(ω) + ∫ω ( ℓ(ω) − ζ(υ) ) P⁺(ω) − ∫ω ζ(υ) P⋆(ω) − ∫ω ( ℓ(ω) − ζ(υ) ) P⋆(ω) ≈ (1/n) ∑i ( ℓ(ωi) − ζ(υi) ) ( P⁺(ωi) − P⋆(ωi) ) / P(ωi) .


This direct estimate of the counterfactual difference Y⁺ − Y⋆ benefits from the same variance reduction effect as (17) without need to estimate the expectations (18). Appendix C provides details on the computation of confidence intervals for estimators of the counterfactual differences. Appendix D shows how the same approach can be used to compute counterfactual derivatives that describe the response of the system to very small interventions.
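The direct difference estimate can be sketched as follows; the density callables `p_plus`, `p_star`, and `p` are placeholders for the three distributions, and `zeta_inv` is the invariant predictor:

```python
def counterfactual_difference(omegas, ell, zeta_inv, p_plus, p_star, p):
    # (1/n) sum_i (ell_i - zeta(v_i)) (P+(omega_i) - P*(omega_i)) / P(omega_i).
    # zeta_inv depends only on variables unaffected by either intervention,
    # so its counterfactual expectations cancel in the difference.
    n = len(omegas)
    return sum((ell(o) - zeta_inv(o)) * (p_plus(o) - p_star(o)) / p(o)
               for o in omegas) / n
```

When the two counterfactual distributions coincide, the estimated difference is exactly zero, sample by sample.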




6. Learning

The previous sections deal with the identification and the measurement of interpretable signals that can justify the actions of human decision makers. These same signals can also justify the actions of machine learning algorithms. This section explains why optimizing a counterfactual estimate is a sound learning principle.


6.1 A
Learning Principle


We consider in this section interventions that depend on a parameter θ. For instance, we might want to know what the performance of the ad placement engine would have been if we had used different values for the parameter θ of the click scoring model. Let Pθ(ω) denote the counterfactual Markov factorization associated with this intervention. Let Yθ be the counterfactual expectation of ℓ(ω) under distribution Pθ. Figure 19 illustrates our simple learning setup. Training data is collected from a single experiment associated with an initial parameter value θ chosen using prior knowledge acquired in an unspecified manner. A preferred parameter value θ⋆ is then determined using the training data and loaded into the system. The goal is of course to observe a good performance on data collected during a test period that takes place after the switching point.



Figure 19: Single design - A preferred parameter value θ⋆ is determined using randomized data collected in the past. Test data is collected after loading θ⋆ into the system.

The isolation assumption introduced in Section 3.2 states that the exogenous variables are drawn from an unknown but fixed joint probability distribution. This distribution induces a joint distribution Pθ(ω) on all the variables ω appearing in the structural equation model associated with the parameter θ. Therefore, if the isolation assumption remains valid during the test period, the test data follows the same distribution Pθ⋆(ω) that would have been observed during the training data collection period if the system had been using parameter θ⋆ all along.

We can therefore formulate this problem as the optimization of the expectation Yθ of the reward ℓ(ω) with respect to the distribution Pθ(ω),

maxθ Yθ = ∫ω ℓ(ω) Pθ(ω) ,

on the basis of a finite set of training examples ω1, …, ωn sampled from P(ω).


However, it would be unwise to maximize the estimates obtained using approximation (5) because they could reach a maximum for a value of θ that is poorly explored by the actual distribution. As explained in Section 4.5, the gap between the upper and lower bound of inequality (14) reveals the uncertainty associated with insufficient exploration. Maximizing an empirical estimate Ŷθ of the lower bound Ȳθ ensures that the optimization algorithm finds a trustworthy answer


θ⋆ = argmaxθ Ŷθ .    (19)
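A minimal sketch of this principle on the mainline-reserve example: candidate parameters θ are alternative multiplier means, importance weights are capped at R, and we keep the candidate with the best capped (lower-bound) estimate. The synthetic data and the log-normal parameterization are illustrative assumptions:

```python
import math

def lognormal_pdf(m, rho, sigma):
    # assumed parameterization: m ~ exp(N(log rho, sigma^2))
    return (math.exp(-(math.log(m) - math.log(rho)) ** 2 / (2.0 * sigma ** 2))
            / (m * sigma * math.sqrt(2.0 * math.pi)))

def clipped_estimate(samples, theta, rho0, sigma, R):
    # empirical estimate with capped importance weights: up to eps_R, a
    # high-probability lower bound on the counterfactual expectation Y_theta
    return sum(y * min(R, lognormal_pdf(m, theta, sigma) / lognormal_pdf(m, rho0, sigma))
               for m, y in samples) / len(samples)

def pick_theta(samples, candidates, rho0, sigma, R=10.0):
    # learning principle (19): keep the candidate with the best guaranteed value
    return max(candidates, key=lambda th: clipped_estimate(samples, th, rho0, sigma, R))
```

Candidates far from the logging distribution receive tiny capped weights and therefore small guaranteed estimates, so the procedure never selects a poorly explored parameter on the strength of a few lucky samples.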

We shall now
discuss the statistical basis of this learning principle.


6.2 Uniform
Confidence Intervals


As discussed in Section 4.4, inequality (14),

Ȳθ ≤ Yθ ≤ Ȳθ + M (1 − W̄θ) ,

where

Ȳθ = ∫ω ℓ(ω) w(ω) P(ω) ≈ Ŷθ = (1/n) ∑i ℓ(ωi) w(ωi) ,    W̄θ = ∫ω w(ω) P(ω) ≈ Ŵθ = (1/n) ∑i w(ωi) ,


9. The idea of maximizing the lower bound may surprise readers familiar with the UCB algorithm for multi-armed bandits (Auer et al., 2002). UCB performs exploration by maximizing the upper confidence interval bound and updating the confidence intervals online. Exploration in our setup results from the active system randomization during the offline data collection. See also Section 6.4.






leads to confidence intervals (15) of the form

∀δ > 0, ∀θ,   P{ Ŷθ − εR ≤ Yθ ≤ Ŷθ + M (1 − Ŵθ + ξR) + εR } ≥ 1 − δ .    (20)



Both εR and ξR converge to zero in inverse proportion to the square root of the sample size n. They also increase at most linearly in log(1/δ) and depend on both the capping bound R and the parameter θ through the empirical variances (see appendix B).
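For concreteness, an εR-style radius can be computed from the empirical Bernstein bound of Maurer and Pontil (2009): a sample-variance term plus a range term, with the constants of their bound, applied here to values already capped in [0, R]. This is a generic sketch, not the exact expression derived in appendix B:

```python
import math

def empirical_bernstein_radius(values, R, delta):
    # sqrt(2 V_n log(2/delta) / n) + 7 R log(2/delta) / (3 (n - 1)),
    # where V_n is the sample variance of the capped values in [0, R]
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (math.sqrt(2.0 * var * math.log(2.0 / delta) / n)
            + 7.0 * R * math.log(2.0 / delta) / (3.0 * (n - 1)))
```

The radius shrinks roughly as n^(−1/2), matching the convergence of εR and ξR discussed above.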


These confidence intervals are insufficient to provide guarantees for a parameter value θ that depends on the sample. In fact, the optimization procedure (19) is likely to select values of θ for which the inequality is violated. We therefore seek uniform confidence intervals (Vapnik and Chervonenkis, 1968), simultaneously valid for all values of θ.


• When the parameter θ is chosen from a finite set F, applying the union bound to the intervals (20) immediately gives the uniform confidence interval:

P{ ∀θ ∈ F,  Ŷθ − εR ≤ Yθ ≤ Ŷθ + M (1 − Ŵθ + ξR) + εR } ≥ 1 − |F| δ .

• Following the pioneering work of Vapnik and Chervonenkis, a broad choice of mathematical tools have been developed to construct uniform confidence intervals when the set F is infinite. For instance, appendix E leverages uniform empirical Bernstein bounds (Maurer and Pontil, 2009) and obtains the uniform confidence interval

P{ ∀θ ∈ F,  Ŷθ − εR ≤ Yθ ≤ Ŷθ + M (1 − Ŵθ + ξR) + εR } ≥ 1 − M(n) δ ,    (21)

where the growth function M(n) measures the capacity of the family of functions

{ fθ : ω ↦ ℓ(ω) w(ω) ,  gθ : ω ↦ w(ω) ,  ∀θ ∈ F } .



practical choices of P(ω) lead to functions M (n) that grow polynomially with
the sample size.Because both εR and ξR are O(n −/logδ), they converge to zero
with the sample size when one maintains the confidence level 1−M (n)δ equal to
a predefined constant.


The interpretation of the inner and outer confidence intervals (Section 4.5) also applies to the uniform confidence interval (21). When the sample size is sufficiently large and the capping bound R chosen appropriately, the inner confidence interval reflects the upper and lower bound of inequality (14).


The uniform confidence interval therefore ensures that Yθ⋆ is close to the maximum of the lower bound of inequality (14), which essentially represents the best performance that can be guaranteed using training data sampled from P(ω). Meanwhile, the upper bound of this same inequality reveals which values of θ could potentially offer better performance but have been insufficiently probed by the sampling distribution (Figure 20).

6.3 Tuning Ad
Placement Auctions


We now present an application of this learning principle to the optimization of auction tuning parameters in the ad placement engine. Despite increasingly challenging engineering difficulties, comparable optimization procedures can obviously be applied to larger numbers of tunable parameters.

Figure 20: The uniform inner confidence interval reveals where the best guaranteed Yθ is reached and where additional exploration is needed.

Figure 21: Level curves associated with the average number of mainline ads per page (red curves labelled from −6% to +10%) and the average estimated advertisement value generated per page (black curves, labelled with arbitrary units ranging from 164 to 169) that would have been observed for a certain query cluster if we had changed the mainline reserves by the multiplicative factor shown on the horizontal axis, and if we had applied a squashing exponent α shown on the vertical axis to the estimated click probabilities qi,p(x).


Lahaie and McAfee (2011) propose to account for the uncertainty of the click probability estimation by introducing a squashing exponent α to control the impact of the estimated probabilities on the rank scores. Using the notations introduced in Section 2.1, and assuming that the estimated probability of a click on ad i placed at position p after query x has the form qip(x) = γp βi(x) (see appendix A), they redefine the rank-score rip(x) as:

rip(x) = γp bi βi(x)^α .

Using a squashing exponent α < 1 reduces the contribution of the estimated probabilities and increases the reliance on the bids bi placed by the advertisers.

Because the squashing exponent changes the rank-score scale, it is necessary to simultaneously adjust the reserves in order to display comparable numbers of ads. In order to estimate the counterfactual performance of the system under interventions affecting both the squashing exponent and the mainline reserves, we have collected data using a random squashing exponent following a normal distribution, and a mainline reserve multiplier following a log-normal distribution as described in Section 4.6. Samples describing 12 million search result pages were collected during four consecutive weeks.


Following Charles and Chickering (2012), we consider separate squashing coefficients αk and mainline reserve multipliers ρk per query cluster k ∈ {1..K}, and, in order to avoid negative user or advertiser reactions, we seek the auction tuning parameters αk and ρk that maximize an estimate of the advertisement value subject to a global constraint on the average number of ads displayed in the mainline. Because maximizing the advertisement value instead of the publisher revenue amounts to maximizing the size of the advertisement pie instead of the publisher slice of the pie, this criterion is less likely to simply raise the prices without improving the ads. Meanwhile the constraint ensures that users are not exposed to excessive numbers of mainline ads.

We then use the collected data to estimate bounds on the counterfactual expectations of the advertisement value and the counterfactual expectation of the number of mainline ads per page. Figure 21 shows the corresponding level curves for a particular query cluster. We can then run a simple optimization algorithm and determine the optimal auction tuning parameters for each cluster subject to the global mainline footprint constraint. Appendix D describes how to estimate off-policy counterfactual derivatives that greatly help the numerical optimization.
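One simple way to carry out such a constrained per-cluster optimization is a Lagrangian sweep, sketched below with made-up grid estimates; this illustrates the optimization step only, and is not necessarily the exact procedure used in production:

```python
def tune_auction(estimates, lam):
    # estimates[k][(alpha, rho)] = (value, mainline_ads): counterfactual
    # estimates per query cluster k over a grid of tuning parameters.
    # For a given multiplier lam, pick in each cluster the parameters
    # maximizing value - lam * ads; sweeping lam traces out the global
    # mainline footprint constraint.
    return {k: max(grid, key=lambda p: grid[p][0] - lam * grid[p][1])
            for k, grid in estimates.items()}
```

Increasing the multiplier lam penalizes mainline ads more heavily, so the per-cluster choices shift toward configurations with a smaller footprint until the global constraint is satisfied.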


The obvious alternative (see Charles and Chickering, 2012) consists of replaying the auctions with different parameters and simulating the user using a click probability model. However, it may be unwise to rely on a click probability model to estimate the best value of a squashing coefficient that is expected to compensate for the uncertainty of the click prediction model itself. The counterfactual approach described here avoids the problem because it does not rely on a click prediction model to simulate users. Instead it estimates the counterfactual performance of the system using the actual behavior of the users collected under moderate randomization.


6.4 Sequential Design


The confidence intervals computed after a first randomized data collection experiment might not offer sufficient accuracy to choose a final value of the parameter θ. It is generally unwise to simply collect additional samples using the same experimental setup because the current data already reveals information (Figure 20) that can be used to design a better data collection experiment. Therefore, it seems natural to extend the learning principle discussed in Section 6.1 to a sequence of data collection experiments. The parameter θt characterizing the t-th experiment is then determined using samples collected during the previous experiments (Figure 22).


10. The value of an ad click from the point of view of the advertiser. The advertiser payment then splits the advertisement value between the publisher and the advertiser.




Figure 22:
Sequential design - The parameter θt of each data collection experiment is determined
using data collected during the previous experiments.


Although it is relatively easy to construct convergent sequential design algorithms, reaching the optimal learning performance is notoriously difficult (Wald, 1945) because the selection of parameter θt involves a trade-off between exploitation, that is, the maximization of the immediate reward Y^{θt}, and exploration, that is, the collection of samples potentially leading to better Y^θ in the more distant future.


The optimal exploration-exploitation trade-off for multi-armed bandits is well understood (Gittins, 1989; Auer et al., 2002; Audibert et al., 2007) because an essential property of multi-armed bandits makes the analysis much simpler: the outcome observed after performing a particular action brings no information about the value of other actions. Such an assumption is both unrealistic and pessimistic. For instance, the outcome observed after displaying a certain ad in response to a certain query brings very useful information about the value of displaying similar ads on similar queries.


Contextual bandit approaches (Slivkins, 2011) account for similarities in the context and action spaces but do not take advantage of all the additional opportunities expressed by structural equation models. For instance, in the contextual bandit formulation of the ad placement problem outlined in Section 3.5, actions are pairs (s, c) describing the ad slate s and the corresponding click prices c, policies select actions by combining individual ad scores in very specific ways, and actions determine the rewards through very specific mechanisms.

Despite their suboptimal asymptotic properties, heuristic exploration strategies perform surprisingly well during the time span in which the problem can be considered stationary. Even in the simple case of multi-armed bandits, excellent empirical results have been obtained using Thompson sampling (Chapelle and Li, 2011) or fixed strategies (Vermorel and Mohri, 2005; Kuleshov and Precup, 2010). Leveraging the problem structure seems more important in practice than perfecting an otherwise sound exploration strategy.
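As a concrete illustration of the first of these heuristics, here is a minimal Thompson sampling sketch for Bernoulli arms; the two click probabilities are invented for the example and do not come from the paper.

```python
import random

def thompson_step(successes, failures):
    """One Thompson sampling round for Bernoulli arms: sample each arm's
    Beta posterior and play the arm with the largest draw."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

random.seed(0)
p = [0.05, 0.10]                 # hypothetical true click probabilities
wins, losses = [0, 0], [0, 0]
for _ in range(5000):
    a = thompson_step(wins, losses)
    if random.random() < p[a]:   # simulate the user's click
        wins[a] += 1
    else:
        losses[a] += 1
```

Exploration automatically concentrates on the better arm while still occasionally probing the other, which is precisely the behavior that makes this heuristic empirically attractive.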


Therefore, in the absence of sufficient theoretical guidance, it is both expedient and practical to maximize Ŷθ at each round, as described in Section 6.1, subject to additional ad hoc constraints ensuring a minimum level of exploration.




All the methods discussed in this contribution rely on the isolation assumption presented in Section 3.2. This assumption lets us interpret the samples as repeated independent trials that follow the pattern defined by the structural equation model and are amenable to statistical analysis.


The isolation assumption is in fact a component of the counterfactual conditions under investigation. For instance, in Section 4.6, we model single auctions (Figure 3) in order to empirically determine how the ad placement system would have performed if we had changed the mainline reserves without incurring a reaction from the users or the advertisers.


Since the
future publisher revenues depend on the continued satisfaction of users and
advertisers, lifting this restriction is highly desirable.


• We can in principle work with larger structural equation models. For instance, Figure 4 suggests threading single auction models with additional causal links representing the impact of the displayed ads on the future user goodwill. However, there are practical limits on the number of trials we can consider at once. For instance, it is relatively easy to simultaneously model all the auctions associated with the web pages served to the same user during a thirty minute web session. On the other hand, it is practically impossible to consider several weeks worth of auctions in order to model their accumulated effect on the continued satisfaction of users and advertisers.


• We can sometimes use problem-specific knowledge to construct alternate performance metrics that anticipate the future effects of the feedback loops. For instance, in Section 6.3, we optimize the advertisement value instead of the publisher revenue. Since this alternative criterion takes the advertiser interests into account, it can be viewed as a heuristic proxy for the future revenues of the publisher.

This section proposes an alternative way to account for such feedback loops using the quasistatic equilibrium method familiar to physicists: we assume that the publisher changes the parameter θ so slowly that the system remains at equilibrium at all times. Using data collected while the system was at equilibrium, we describe empirical methods to determine how an infinitesimal intervention dθ on the model parameters would have displaced the equilibrium: "How would the system have performed during the data collection period if a small change dθ had been applied to the model parameter θ and the equilibrium had been reached before the data collection period?"


A learning
algorithm can then update θ to improve selected performance metrics.


7.1 Rational Advertisers


The ad placement system is an example of a game where each actor furthers his or her interests by controlling some aspects of the system: the publisher controls the placement engine parameters, the advertisers control the bids, and the users control the clicks.


As an example of the general quasi-static approach, this section focuses on the reaction of rational advertisers to small changes of the scoring functions driving the ad placement system. Rational advertisers always select bids that maximize their economic interests. Although there are more realistic ways to model advertisers, this exercise is interesting because the auction theory approaches also rely on the rational advertiser assumption (see Section 2.1). This analysis seamlessly integrates the auction theory and machine learning perspectives.


As illustrated in Figure 23, we treat the bid vector b⋆ = (b1, ..., bA) ∈ [0, bmax]^A as the parameter of the conditional distribution Pb⋆(b|x,v) of the bids associated with the eligible ads.

Variables measured when a feedback causal system reaches equilibrium often display conditional independence patterns that cannot be represented with directed acyclic graphs (Lauritzen and Richardson, 2002; Dash, 2003). Treating the feedback loop as parameters instead of variables works around this difficulty in a manner that appears sufficient to perform the quasi-static analysis.






Figure 23:
Advertisers select the bid amounts ba on the basis of the past number of clicks
ya and the past prices za observed for the corresponding ads.


Figure 24: Advertisers control the expected number of clicks Ya and expected prices Za by choosing their bids ba. Rational advertisers select bids that maximize the difference between the value they see in the clicks and the price they pay.


The variables ya in the structural equation model represent the number of clicks received by ads associated with bid ba. The variables za represent the amount charged for these clicks to the corresponding advertiser. The advertisers select their bids ba according to their anticipated impact on the number of resulting clicks ya and on their cost za.

Following the
pattern of the perfect information assumption (see Section 2.1), we assume that
the advertisers eventually acquire full knowledge of the expectations


$$ Y_a(\theta, b^\star) = \int_\omega y_a \, P_{\theta, b^\star}(\omega) \qquad \text{and} \qquad Z_a(\theta, b^\star) = \int_\omega z_a \, P_{\theta, b^\star}(\omega) \,. $$

Let Va denote the value of a click for the corresponding advertiser. Rational advertisers seek to maximize the difference between the value they see in the clicks and the price they pay to the publisher, as illustrated in Figure 24. This is expressed by the utility functions


$$ U_a^\theta(b^\star) = V_a\, Y_a(\theta, b^\star) - Z_a(\theta, b^\star) \,. $$




Following Athey and Nekipelov (2010), we argue that the injection of smooth random noise into the auction mechanism changes the discrete problem into a continuous problem amenable to standard differential methods. Mild regularity assumptions on the probability densities Pb⋆(b|x,v) and Pθ(q|x,a) are in fact sufficient to ensure that the expectations Ya(θ,b⋆) and Za(θ,b⋆) are continuously differentiable functions of the distribution parameters b⋆ and θ. Further assuming that the utility functions U_a^θ(b⋆) are diagonally quasiconcave, Athey and Nekipelov establish the existence of a unique Nash equilibrium

$$ \forall a \quad b_a \in \mathop{\mathrm{ArgMax}}_{b \,\in\, [0,\, b^{max}]} \; U_a^\theta(b_1, \dots, b_{a-1},\, b,\, b_{a+1}, \dots, b_A) \,, $$

characterized by its first order Karush-Kuhn-Tucker conditions

$$ \forall a \quad V_a \frac{\partial Y_a}{\partial b_a} - \frac{\partial Z_a}{\partial b_a} \;\; \begin{cases} \le 0 & \text{if } b_a = 0 \,, \\ \ge 0 & \text{if } b_a = b^{max} \,, \\ = 0 & \text{if } 0 < b_a < b^{max} \,. \end{cases} \qquad (22) $$



We use the first order equilibrium conditions (22) for two related purposes. Section 7.2 explains how to complete the advertiser model by estimating the values Va. Section 7.3 estimates how the equilibrium bids and the system performance metrics respond to a small change dθ of the model parameters.


This approach remains sensible even when key assumptions of the equilibrium model are violated. The perfect information assumption is unlikely to hold in practice. The quasi-concavity of the utility functions is merely plausible. However, after observing the operation of the stationary ad placement system for a sufficiently long time, it is reasonable to assume that the most active advertisers have tried small bid variations and have chosen locally optimal ones. Less active advertisers may leave their bids unchanged for longer time periods, but can also update them brutally if they experience a significant change in return on investment. Therefore it makes sense to use data collected when the system is stationary to estimate advertiser values Va that are consistent with the first order equilibrium conditions. We then hope to maintain the conditions that each advertiser had found sufficiently attractive, by first estimating how a small change dθ displaces this posited local equilibrium, then by using performance metrics that take this displacement into account.


7.2 Estimating Advertiser Values


We first need to estimate the partial derivatives appearing in the equilibrium condition (22). These derivatives measure how the expectations Ya and Za would have been changed if each advertiser had placed a slightly different bid ba. Such quantities can be estimated by randomizing the bids and computing on-policy counterfactual derivatives as explained in appendix D. Confidence intervals can be derived with the usual tools.


The publisher is not allowed to directly randomize the bids because the advertisers expect to pay prices computed using the bid they have specified and not the potentially higher bids resulting from the randomization. However, the publisher has full control on the estimated click probabilities qi,p(x). Since the rank-scores ri,p(x) are the products of the bids and the estimated click probabilities (see Section 2.1), a random multiplier applied to the bids can also be interpreted as a random multiplier applied to the estimated click probabilities. Under these two interpretations, the same ads are shown to the users, but different click prices are charged to the advertisers. Therefore, the publisher can simultaneously charge prices computed as if the multiplier had been applied to the estimated click probabilities, and collect data as if the multiplier had been applied to the bid. This data can then be used to estimate the derivatives.


Solving the first order equilibrium equations then yields estimated advertiser values Va that are consistent with the observed data:

$$ V_a \approx \frac{\partial Z_a}{\partial b_a} \Big/ \frac{\partial Y_a}{\partial b_a} \,. $$
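A minimal numerical sketch of this estimation, under invented assumptions: the logged triples (bid, clicks, cost) are synthetic stand-ins for data produced by the multiplier randomization described above, the true responses are taken linear in the bid, and the partial derivatives are recovered by local linear regression.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical randomized logs for one advertiser: bid multipliers drawn
# around 1, with observed clicks y and costs z for each auction batch.
base_bid = 2.0
bids = base_bid * rng.uniform(0.9, 1.1, size=20000)
y = 0.30 * bids + rng.normal(0.0, 0.05, size=bids.size)   # clicks
z = 0.45 * bids + rng.normal(0.0, 0.05, size=bids.size)   # costs

# Local linear regressions estimate dYa/dba and dZa/dba at the
# advertiser's operating point.
dY_db = np.polyfit(bids, y, 1)[0]
dZ_db = np.polyfit(bids, z, 1)[0]

# Interior first order condition  Va dYa/dba = dZa/dba  gives the value.
V_a = dZ_db / dY_db
```

With the slopes used above, the recovered value comes out close to 0.45/0.30 = 1.5.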

There are however a couple of caveats:


• The advertiser bid ba may be too small to cause ads to be displayed. In the absence of data, we have no means to estimate a click value for these advertisers.


• Many ads are not displayed often enough to obtain accurate estimates of the partial derivatives ∂Ya/∂ba and ∂Za/∂ba. This can be partially remediated by smartly aggregating the data of advertisers deemed similar.

• Some advertisers attempt to capture all the available ad opportunities by placing extremely high bids and hoping to pay reasonable prices thanks to the generalized second price rule. Both partial derivatives ∂Ya/∂ba and ∂Za/∂ba are equal to zero in such cases. Therefore we cannot recover Va by solving the equilibrium Equation (22). It is however possible to collect useful data by selecting for these advertisers a maximum bid bmax that prevents them from monopolizing the eligible ad opportunities. Since the equilibrium condition is an inequality when ba = bmax, we can only determine a lower bound of the values Va for these advertisers.

These caveats in fact underline the limitations of the advertiser modelling assumptions. When their ads are not displayed often enough, advertisers have no more chance to acquire a full knowledge of the expectations Ya and Za than the publisher has a chance to determine their value. Similarly, advertisers that place extremely high bids are probably underestimating the risk of occasionally experiencing a very high click price. A more realistic model of the advertiser information acquisition is required to adequately handle these cases.


7.3 Estimating the Equilibrium Response


Let A be the set of the active advertisers, that is, the advertisers whose value can be estimated (or lower bounded) with sufficient accuracy. Assuming that the other advertisers leave their bids unchanged, we can estimate how the active advertisers adjust their bids in response to an infinitesimal change dθ of the scoring model parameters. This is achieved by differentiating the equilibrium equations (22):


$$ \forall a' \in A, \quad 0 = \left( V_{a'} \frac{\partial^2 Y_{a'}}{\partial b_{a'} \partial\theta} - \frac{\partial^2 Z_{a'}}{\partial b_{a'} \partial\theta} \right) d\theta \;+\; \sum_{a \in A} \left( V_{a'} \frac{\partial^2 Y_{a'}}{\partial b_{a'} \partial b_a} - \frac{\partial^2 Z_{a'}}{\partial b_{a'} \partial b_a} \right) db_a \,. \qquad (23) $$



The partial second derivatives must be estimated as described in appendix D. Solving this linear system of equations then yields an expression of the form

$$ db_a = \Xi_a \, d\theta \,. $$

This approach is of course related to the value estimation method proposed by Athey and Nekipelov (2010) but strictly relies on the explicit randomization of the scores. In contrast, practical considerations force Athey and Nekipelov to rely on the apparent noise and hope that the noise model accounts for all potential confounding factors.






This expression can then be used to estimate how any counterfactual expectation Y of interest changes when the publisher applies an infinitesimal change dθ to the scoring parameter θ and the advertisers A rationally adjust their bids ba in response:


$$ dY = \left( \frac{\partial Y}{\partial \theta} + \sum_{a \in A} \Xi_a \frac{\partial Y}{\partial b_a} \right) d\theta \,. \qquad (24) $$
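Numerically, (23) and (24) amount to a single linear solve. The sketch below uses invented second derivative estimates for three active advertisers; in a real deployment these numbers would come from the randomized estimation of appendix D.

```python
import numpy as np

# H[a', a] = V_a' d2Y_a'/(db_a' db_a) - d2Z_a'/(db_a' db_a)  (hypothetical)
H = np.array([[-2.0,  0.3,  0.1],
              [ 0.2, -1.5,  0.2],
              [ 0.1,  0.4, -1.8]])
# c[a'] = V_a' d2Y_a'/(db_a' dtheta) - d2Z_a'/(db_a' dtheta)  (hypothetical)
c = np.array([0.5, -0.2, 0.3])

# Differentiated equilibrium (23):  0 = c dtheta + H db  =>  db = Xi dtheta.
Xi = np.linalg.solve(H, -c)

# Total derivative (24) of a performance metric Y, again with invented
# direct and bid-mediated sensitivities.
dY_dtheta_direct = 1.0
dY_db = np.array([0.4, 0.6, 0.5])
dY_dtheta_total = dY_dtheta_direct + Xi @ dY_db
```

The sign of `dY_dtheta_total` then tells the publisher which way to nudge θ once the rational reaction of the advertisers is taken into account.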



Although this expression provides useful information, one should remain aware of its limitations. Because we can only estimate the reaction of active advertisers, expression (24) does not include the potentially positive reactions of advertisers who did not bid but could have. Because we can only estimate a lower bound of their values, this expression does not model the potential reactions of advertisers placing unrealistically high bids. Furthermore, one needs to be very cautious when the system (23) approaches singularities. Singularities indicate that the rational advertiser assumption is no longer sufficient to determine the reactions of certain advertisers. This happens for instance when advertisers cannot find bids that deliver a satisfactory return. The eventual behavior of such advertisers then depends on factors not taken into consideration by our model.


To alleviate these issues, we could alter the auction mechanism in a manner that forces advertisers to reveal more information, and we could enforce policies ensuring that the system (23) remains safely nonsingular. We could also design experiments revealing the impact of the fixed costs incurred by advertisers participating into new auctions. Although additional work is needed to design such refinements, the quasistatic equilibrium approach provides a generic framework to take such aspects into account.




The rational advertiser assumption is the cornerstone of seminal works describing simplified variants of the ad placement problem using auction theory (Varian, 2007; Edelman et al., 2007). More sophisticated works account for more aspects of the ad placement problem, such as the impact of click prediction learning algorithms (Lahaie and McAfee, 2011), the repeated nature of the ad auctions (Bergemann and Said, 2010), or for the fact that advertisers place bids valid for multiple auctions (Athey and Nekipelov, 2010). Despite these advances, it seems technically very challenging to use these methods and account for all the effects that can be observed in practical ad placement systems.


We believe that our counterfactual reasoning framework is best viewed as a modular toolkit that lets us apply insights from auction theory and machine learning to problems that are far more complex than those studied in any single paper. For instance, the quasi-static equilibrium analysis technique illustrated in this section extends naturally to the analysis of multiple simultaneous causal feedback loops involving additional players:


• The first step consists in designing ad-hoc experiments to identify the parameters that determine the equilibrium equation of each player. In the case of the advertisers, we have shown how to use randomized scores to reveal the advertiser values. In the case of the user feedback, we must carefully design experiments that reveal how users respond to changes in the quality of the displayed ads.


• Differentiating all the equilibrium equations yields a linear system of equations linking the variations of the parameter under our control, such as dθ, and all the parameters under the control of the other players, such as the advertiser bids, or the user willingness to visit the site and click on ads. Solving this system and writing the total derivative of the performance measure gives the answer to our question.


Although this programme has not yet been fully realized, the existence of a principled framework to handle such complex interactions is remarkable. Furthermore, thanks to the flexibility of the causal inference frameworks, these techniques can be infinitely adapted to various modeling assumptions and various system designs.



Conclusion

Using the ad placement example, this work demonstrates the central role of causal inference (Pearl, 2000; Spirtes et al., 1993) for the design of learning systems interacting with their environment. Thanks to importance sampling techniques, data collected during randomized experiments gives precious cues to assist the designer of such learning systems and useful signals to drive learning algorithms.

Two recurrent themes structure this work. First, we maintain a sharp distinction between the learning algorithms and the extraction of the signals that drive them. Since real world learning systems often involve a mixture of human decision and automated processes, it makes sense to separate the discussion of the learning signals from the discussion of the learning algorithms that leverage them. Second, we claim that the mathematical and philosophical tools developed for the analysis of physical systems appear very effective for the analysis of causal information systems and of their equilibria. These two themes are in fact a vindication of cybernetics (Wiener, 1948).




Acknowledgments

We would like to acknowledge extensive discussions with Susan Athey, Miroslav Dudík, Patrick Jordan, John Langford, Lihong Li, Sebastien Lahaie, Shie Mannor, Chris Meek, Alex Slivkins, and Paul Viola. We also thank the Microsoft adCenter RnR team for giving us the invaluable opportunity to deploy these ideas at scale and prove their worth. Finally we gratefully acknowledge the precious comments of our JMLR editor and reviewers.


Appendix A.
Greedy Ad Placement Algorithms


Section 2.1 describes how to select and place ads on a web page by maximizing the total rank-score (1). Following Varian (2007) and Edelman et al. (2007), we assume that the click probability estimates are expressed as the product of a positive position term γp and a positive ad term βi(x). The rank-scores can therefore be written as ri,p(x) = γp bi βi(x). We also assume that the policy constraints simply state that a web page should not display more than one ad belonging to any given advertiser. The discrete maximization problem is then amenable to computationally efficient greedy algorithms.

Let us fix a layout L and focus on the inner maximization problem. Without loss of generality, we can renumber the positions such that

$$ L = \{1, 2, \dots, N\} \qquad \text{and} \qquad \gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_N \ge 0 \,, $$





and write the inner maximization problem as

$$ \max_{i_1, \dots, i_N} \; R_L(i_1, \dots, i_N) = \sum_{p \in L} \gamma_p \, b_{i_p} \beta_{i_p}(x) \,, $$

subject to the policy constraints and the reserve constraints ri,p(x) ≥ Rp(x).

Let Si denote the advertiser owning ad i. The set of ads is then partitioned into subsets Is = {i : Si = s} gathering the ads belonging to the same advertiser s. The ads that maximize the product bi βi(x) within the set Is are called the best ads for advertiser s. If the solution of the discrete maximization problem contains one ad belonging to advertiser s, then it is easy to see that this ad must be one of the best ads for advertiser s: were it not the case, replacing the offending ad by one of the best ads would yield a higher RL without violating any of the constraints. It is also easy to see that one could select any of the best ads for advertiser s without changing RL.


Let the set I contain exactly one ad per advertiser, arbitrarily chosen among the best ads for each advertiser. The inner maximization problem can then be simplified as:




i 1,..., iN ∈

i 1,...,iN ∈ I

RL ( i ,...,
i N) = ∑p∈L γp b i p β i p ( x )

RL ( i,...,i N) = ∑p∈L
γp b i p β i p ( x)

where all the
indices i1,...,iN are distinct, and subject to the reserve constraints.


Assume that this maximization problem has a solution i1, ..., iN, meaning that there is a valid ad placement solution for the layout L. For k = 1, ..., N, let us define I_k ⊂ I as

$$ I_k = \mathop{\mathrm{ArgMax}}_{i \,\in\, I \,\setminus\, \{i_1, \dots, i_{k-1}\}} \; b_i \beta_i(x) \,. $$

It is easy to see that I_k intersects {ik, ..., iN} because, were it not the case, replacing ik by any element of I_k would increase RL without violating any of the constraints. Furthermore it is easy to see that ik ∈ I_k because, were it not the case, there would be h > k such that ih ∈ I_k, and swapping ik and ih would increase RL without violating any of the constraints.


Therefore, if the inner maximization problem admits a solution, we can compute a solution by recursively picking i1, ..., iN from I_1, I_2, ..., I_N. This can be done efficiently by first sorting the bi βi(x) in decreasing order, and then greedily assigning ads to the best positions subject to the reserve constraints. This operation has to be repeated for all possible layouts, including of course the empty layout.

The same analysis can be carried out for click prediction estimates expressed as an arbitrary monotone combination of a position term γp(x) and an ad term βi(x), as shown, for instance, by Graepel et al. (2010).


Appendix B.
Confidence Intervals


Section 4.4 explains how to obtain improved confidence intervals by replacing the unbiased importance sampling estimator (7) by the clipped importance sampling estimator (10). This appendix provides details that could have obscured the main message.






B.1 Outer
Confidence Interval


We first address the computation of the outer confidence interval (12) which describes how the estimator Ŷ approaches the clipped expectation Ȳ:

$$ \bar{Y} = \int_\omega \ell(\omega)\, \bar{w}(\omega)\, P(\omega) \;\approx\; \hat{Y} = \frac{1}{n} \sum_{i=1}^{n} \ell(\omega_i)\, \bar{w}(\omega_i) \,. $$

Since the samples ℓ(ωi) w̄(ωi) are independent and identically distributed, the central limit theorem (e.g., Cramér, 1946, Section 17.4) states that the empirical average Ŷ converges in law to a normal distribution of mean Ȳ = E[ℓ(ω) w̄(ω)] and variance V = var[ℓ(ω) w̄(ω)]. Since this convergence usually occurs quickly, it is widely accepted to write

$$ P\left\{ \hat{Y} - \epsilon_R \le \bar{Y} \le \hat{Y} + \epsilon_R \right\} \ge 1 - \delta \qquad (25) $$

with

$$ \epsilon_R = \mathrm{erf}^{-1}(1-\delta)\, \sqrt{2V/n} \,, $$

and to estimate the variance V using the sample variance

$$ V \approx \hat{V} = \frac{1}{n-1} \sum_{i=1}^{n} \left( \ell(\omega_i)\, \bar{w}(\omega_i) - \hat{Y} \right)^2 \,. $$


This approach works well when the ratio ceiling R is relatively small. However the presence of a few very large ratios makes the variance estimation noisy and might slow down the central limit convergence.
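The whole construction fits in a few lines. The logged data below is synthetic (log-normal importance ratios, uniform outcomes), and the 1.96 factor corresponds to the choice δ = 0.05 in the central limit interval.

```python
import math
import random

random.seed(0)
R, n = 10.0, 50000
samples = []
for _ in range(n):
    w = math.exp(random.gauss(-0.125, 0.5))    # synthetic importance ratio
    l = random.random()                        # synthetic outcome in [0, 1]
    samples.append((l, min(w, R)))             # clip the ratio at R

# Clipped importance sampling estimate and its sample variance.
Y_hat = sum(l * w for l, w in samples) / n
V_hat = sum((l * w - Y_hat) ** 2 for l, w in samples) / (n - 1)

# CLT-based outer confidence interval at the 95% level.
eps = 1.96 * math.sqrt(V_hat / n)
lo, hi = Y_hat - eps, Y_hat + eps
```

Raising the ceiling R widens the interval through the sample variance, which is exactly the bias-variance trade-off that motivates clipping in the first place.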


The first remedy is to bound the variance more rigorously. For instance, the following bound results from (Maurer and Pontil, 2009, Theorem 10):


$$ P\left( \sqrt{V} > \sqrt{\hat{V}} + (M-m)\, R\, \sqrt{\frac{2 \log(2/\delta)}{n-1}} \right) \le \delta \,. $$

Combining this bound with (25) gives a confidence interval valid with probability greater than 1−2δ. Although this approach eliminates the potential problems related to the variance estimation, it does not address the potentially slow convergence of the central limit theorem.


The next remedy is to rely on empirical Bernstein bounds to derive rigorous confidence intervals that leverage both the sample mean and the sample variance (Audibert et al., 2007; Maurer and Pontil, 2009).


Theorem 1 (Empirical Bernstein bound) (Maurer and Pontil, 2009, thm 4) Let X, X1, X2, ..., Xn be i.i.d. random variables with values in [a, b] and let δ > 0. Then, with probability at least 1−δ,

$$ E[X] - M_n \le \sqrt{\frac{2\, V_n \log(2/\delta)}{n}} + (b-a)\, \frac{7 \log(2/\delta)}{3(n-1)} \,, $$

where Mn and Vn respectively are the sample mean and variance

$$ M_n = \frac{1}{n} \sum_{i=1}^{n} X_i \qquad \text{and} \qquad V_n = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - M_n \right)^2 \,. $$





Applying this theorem to both ℓ(ωi) w̄(ωi) and −ℓ(ωi) w̄(ωi) provides confidence intervals that hold for the worst possible distribution of the variables ℓ(ω) and w̄(ω):



$$ P\left\{ \hat{Y} - \epsilon_R \le \bar{Y} \le \hat{Y} + \epsilon_R \right\} \ge 1 - 2\delta \qquad (26) $$

with

$$ \epsilon_R = \sqrt{\frac{2\, \hat{V} \log(2/\delta)}{n}} + M R\, \frac{7 \log(2/\delta)}{3(n-1)} \,. $$



Because they hold for the worst possible distribution, confidence intervals obtained in this way are less tight than confidence intervals based on the central limit theorem. On the other hand, thanks to the Bernstein bound, they remain reasonably competitive, and they provide a much stronger guarantee.
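Theorem 1 translates directly into a deviation radius that can be computed from the sample; the helper name below is ours, not the paper's.

```python
import math

def empirical_bernstein_radius(xs, width, delta):
    """Deviation radius of Theorem 1: with probability at least 1 - delta,
    E[X] - mean(xs) <= sqrt(2 V log(2/delta) / n)
                       + width * 7 log(2/delta) / (3 (n - 1)),
    where V is the sample variance and width = b - a."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * var * log_term / n)
            + width * 7.0 * log_term / (3.0 * (n - 1)))
```

Applying it to both the samples and their negations yields the two-sided interval (26). Note how the radius combines a variance-driven term that shrinks like 1/√n with a range-driven term that shrinks like 1/n.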


B.2 Inner
Confidence Interval


Inner confidence intervals are derived from inequality (14) which bounds the difference between the counterfactual expectation Y and the clipped expectation Ȳ:

$$ 0 \le Y - \bar{Y} \le M\,(1 - \bar{W}) \,. $$


The constant M is defined by assumption (8). The first step of the derivation consists in obtaining a lower bound of the clipped weight expectation W̄ using either the central limit theorem or an empirical Bernstein bound.


For instance, applying theorem 1 to −w̄(ωi) yields

$$ P\left\{ \bar{W} \ge \hat{W} - \sqrt{\frac{2\, \hat{V}_w \log(2/\delta)}{n}} - R\, \frac{7 \log(2/\delta)}{3(n-1)} \right\} \ge 1 - \delta \,, $$

where V̂w is the sample variance of the clipped weights

$$ \hat{V}_w = \frac{1}{n-1} \sum_{i=1}^{n} \left( \bar{w}(\omega_i) - \hat{W} \right)^2 \,. $$

Replacing in inequality (14) gives the inner confidence interval

$$ P\left\{ \bar{Y} \le Y \le \bar{Y} + M\,(1 - \hat{W} + \xi_R) \right\} \ge 1 - \delta \qquad (27) $$

with

$$ \xi_R = \sqrt{\frac{2\, \hat{V}_w \log(2/\delta)}{n}} + R\, \frac{7 \log(2/\delta)}{3(n-1)} \,. $$



Note that 1 − Ŵ + ξR can occasionally be negative. This occurs in the unlucky cases where the confidence interval is violated, with probability smaller than δ.

Putting together the inner and outer confidence intervals,

$$ P\left\{ \hat{Y} - \epsilon_R \le Y \le \hat{Y} + M\,(1 - \hat{W} + \xi_R) + \epsilon_R \right\} \ge 1 - 3\delta \,, $$

with εR and ξR computed as described in expressions (26) and (27).






Appendix C.
Counterfactual Differences


We now seek to estimate the difference Y⁺ − Y* of the expectations of a same quantity ℓ(ω) under two different counterfactual distributions P⁺(ω) and P*(ω). These expectations are often affected by variables whose value is left unchanged by the interventions under consideration. For instance, seasonal effects can have very large effects on the number of ad clicks. When these variables affect both Y⁺ and Y* in similar ways, we can obtain substantially better confidence intervals for the difference Y⁺ − Y*.

In addition to the notation ω representing all the variables in the structural equation model, we use notation υ to represent all the variables that are not direct or indirect effects of variables affected by the interventions under consideration.

Let ζ(υ) be a known function believed to be a good predictor of the quantity ℓ(ω) whose counterfactual expectation is sought. Since P⁺(υ) = P*(υ) = P(υ), the following equality holds regardless of the quality of this prediction:

$$ Y^* = \int_\upsilon \zeta(\upsilon)\, P^*(\upsilon) \,+\, \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] w^*(\omega)\, P(\omega) \,. $$

Decomposing both Y⁺ and Y* in this way and computing the difference,

$$ Y^+ - Y^* = \int_\omega \left[ \ell(\omega) - \zeta(\upsilon) \right] \Delta w(\omega)\, P(\omega) \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \left[ \ell(\omega_i) - \zeta(\upsilon_i) \right] \Delta w(\omega_i) $$

with

$$ \Delta w(\omega) = \frac{P^+(\omega)}{P(\omega)} - \frac{P^*(\omega)}{P(\omega)} = \frac{P^+(\omega) - P^*(\omega)}{P(\omega)} \,. $$





The outer confidence interval size is reduced if the variance of the residual ℓ(ω) − ζ(υ) is smaller than the variance of the original variable ℓ(ω). For instance, a suitable predictor function ζ(υ) can significantly capture the seasonal click yield variations regardless of the interventions under consideration. Even a constant predictor function can considerably change the variance of the outer confidence interval. Therefore, in the absence of a better predictor, we still can (and always should) center the integrand using a constant predictor.
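A synthetic illustration of this variance reduction, with every number invented: a seasonal variable v drives most of the outcome, the weight differences ∆w average to zero, and centering with the invariant predictor ζ(v) = 2v shrinks the sample variance of the difference estimator dramatically.

```python
import random

random.seed(0)
n = 20000

# Synthetic logs: invariant seasonal variable v, outcome l mostly driven
# by v, and weight difference dw = w+ - w* for each sample.
data = []
for _ in range(n):
    v = random.gauss(0.0, 1.0)
    l = 2.0 * v + random.gauss(0.2, 0.3)
    dw = random.gauss(0.0, 1.0)
    data.append((l, v, dw))

def diff_variance(zeta):
    """Sample variance of the centered difference estimator terms
    (l(w) - zeta(v)) * dw."""
    terms = [(l - zeta(v)) * dw for l, v, dw in data]
    mean = sum(terms) / n
    return sum((t - mean) ** 2 for t in terms) / (n - 1)

var_plain = diff_variance(lambda v: 0.0)         # no predictor
var_centered = diff_variance(lambda v: 2.0 * v)  # invariant predictor
```

Since the confidence interval width scales with the square root of this variance, the centered estimator yields a markedly tighter interval for the same sample size.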


The rest of this appendix describes how to construct confidence intervals for the estimation of counterfactual differences. Additional bookkeeping is required because both the weights ∆w(ωi) and the integrand ℓ(ω) − ζ(υ) can be positive or negative. We use the notation υ to represent the variables of the structural equation model that are left unchanged by the intervention under consideration. Such variables satisfy the relations P*(υ) = P(υ) and P*(ω) = P*(ω\υ|υ) P(υ), where we use notation ω\υ to denote all remaining variables in the structural equation model. An invariant predictor is then a function ζ(υ) that is believed to be a good predictor of ℓ(ω). In particular, it is expected that var[ℓ(ω) − ζ(υ)] is smaller than var[ℓ(ω)].

C.1 Inner Confidence Interval with Dependent Bounds


We first describe how to construct finer inner confidence intervals by using more refined bounds on ℓ(ω). In particular, instead of the simple bound (8), we can use bounds that depend on the invariant variables:




m ≤ m(υ) ≤ ℓ(ω) ≤ M(υ) ≤ M .





The key observation is the equality

E[ w(ω) | υ ] = ∫_{ω\υ} w(ω) P(ω\υ|υ) = ∫_{ω\υ} ( P*(ω\υ|υ) P*(υ) ) / ( P(ω\υ|υ) P(υ) ) P(ω\υ|υ) = ∫_{ω\υ} P*(ω\υ|υ) = 1 .
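This equality can be checked numerically. The sketch below uses a hypothetical toy model in which only one remaining variable z changes under the intervention; conditioning on υ, the density ratio averages to one:

```python
import math, random, statistics

random.seed(2)

# Toy setup (hypothetical): conditionally on upsilon = u, the remaining
# variable z has density P(z|u) = Normal(0,1) and P*(z|u) = Normal(u,1);
# the marginal of upsilon is untouched by the intervention, so it cancels.
def w(u, z):
    """Density ratio w(omega) = P*(omega)/P(omega) given upsilon = u."""
    return math.exp(-((z - u) ** 2 - z ** 2) / 2)

u = 0.7                                    # condition on one value of upsilon
ws = [w(u, random.gauss(0, 1)) for _ in range(200000)]
print(statistics.mean(ws))                 # E[w(omega)|upsilon] should be near 1
```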

We can then write

Y − Ȳ = ∫_ω [ w(ω) − w̄(ω) ] ℓ(ω) P(ω) ≤ ∫_υ E[ w(ω) − w̄(ω) | υ ] M(υ) P(υ)
    = ∫_υ ( 1 − E[ w̄(ω) | υ ] ) M(υ) P(υ) = ∫_ω ( 1 − w̄(ω) ) M(υ) P(ω) = Bhi .


Using a similar derivation for the lower bound Blo, we obtain the inequality

Blo ≤ Y − Ȳ ≤ Bhi .

With the empirical estimates

B̂lo = (1/n) ∑_{i=1}^{n} ( 1 − w̄(ωᵢ) ) m(υᵢ) ,    B̂hi = (1/n) ∑_{i=1}^{n} ( 1 − w̄(ωᵢ) ) M(υᵢ) ,

V̂lo = (1/(n−1)) ∑_{i=1}^{n} [ ( 1 − w̄(ωᵢ) ) m(υᵢ) − B̂lo ]² ,    V̂hi = (1/(n−1)) ∑_{i=1}^{n} [ ( 1 − w̄(ωᵢ) ) M(υᵢ) − B̂hi ]² ,

ξlo = √( 2 V̂lo log(2/δ) / n ) + |m| R · 7 log(2/δ) / ( 3(n−1) ) ,
ξhi = √( 2 V̂hi log(2/δ) / n ) + |M| R · 7 log(2/δ) / ( 3(n−1) ) ,

two applications of theorem 1 give the inner confidence interval:

P{ Ŷ + B̂lo − ξlo ≤ Y ≤ Ŷ + B̂hi + ξhi } ≥ 1 − 2δ .
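This recipe takes only a few lines of code. The sketch below uses synthetic clipped weights and hypothetical invariant bounds m(υᵢ) and M(υᵢ), with δ = 0.05; the clipped point estimate Ŷ is also hypothetical:

```python
import math, random

random.seed(1)
n, R, delta = 5000, 10.0, 0.05
# Synthetic inputs (hypothetical): clipped weights in [0, R] and invariant
# bounds m(upsilon_i) <= l(omega_i) <= M(upsilon_i).
wbar = [min(random.expovariate(1.0), R) for _ in range(n)]
m_v = [0.1] * n
M_v = [1.0 + random.random() for _ in range(n)]

def bound_and_xi(terms, scale):
    """Empirical mean of the bound terms and its Bernstein deviation xi."""
    B = sum(terms) / n
    V = sum((t - B) ** 2 for t in terms) / (n - 1)
    xi = math.sqrt(2 * V * math.log(2 / delta) / n) \
        + scale * R * 7 * math.log(2 / delta) / (3 * (n - 1))
    return B, xi

B_lo, xi_lo = bound_and_xi([(1 - w) * m for w, m in zip(wbar, m_v)],
                           scale=max(abs(m) for m in m_v))
B_hi, xi_hi = bound_and_xi([(1 - w) * M for w, M in zip(wbar, M_v)],
                           scale=max(abs(M) for M in M_v))

Y_hat = 0.9   # clipped point estimate of Y (hypothetical)
print(Y_hat + B_lo - xi_lo, Y_hat + B_hi + xi_hi)   # inner confidence interval
```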

C.2 Confidence Intervals for Counterfactual Differences


We now describe how to leverage invariant predictors in order to construct tighter confidence intervals for the difference of two counterfactual expectations.


Y⁺ − Y* ≈ (1/n) ∑_{i=1}^{n} [ ℓ(ωᵢ) − ζ(υᵢ) ] ∆w(ωᵢ)    with ∆w(ω) = ( P⁺(ω) − P*(ω) ) / P(ω) .

Let us define the reweighting ratios w⁺(ω) = P⁺(ω)/P(ω) and w*(ω) = P*(ω)/P(ω), their clipped variants w̄⁺(ω) and w̄*(ω), and the clipped centered expectations

Ȳ⁺c = ∫_ω [ ℓ(ω) − ζ(υ) ] w̄⁺(ω) P(ω)    and    Ȳ*c = ∫_ω [ ℓ(ω) − ζ(υ) ] w̄*(ω) P(ω) .

The outer confidence interval is obtained by applying the techniques of Section B.1 to

Ȳ⁺c − Ȳ*c = ∫_ω [ ℓ(ω) − ζ(υ) ] [ w̄⁺(ω) − w̄*(ω) ] P(ω) .

Since the weights w̄⁺ − w̄* can be positive or negative, adding or removing a constant to ℓ(ω) can considerably change the variance of the outer confidence interval. This means that one should always use a predictor. Even a constant predictor can vastly improve the outer confidence interval of the difference.


The inner confidence interval is then obtained by writing the difference

( Y⁺ − Y* ) − ( Ȳ⁺c − Ȳ*c ) = ∫_ω [ ℓ(ω) − ζ(υ) ] [ w⁺(ω) − w̄⁺(ω) ] P(ω) − ∫_ω [ ℓ(ω) − ζ(υ) ] [ w*(ω) − w̄*(ω) ] P(ω)

and bounding both terms by leveraging υ-dependent bounds on the integrand:

m − M ≤ m(υ) − ζ(υ) ≤ ℓ(ω) − ζ(υ) ≤ M(υ) − ζ(υ) ≤ M − m .




This can be
achieved as shown in Section C.1.


Appendix D. Counterfactual Derivatives

We now consider interventions that depend on a continuous parameter θ. For instance, we might want to know what the performance of the ad placement engine would have been if we had used a parametrized scoring model. Let Pθ(ω) represent the counterfactual Markov factorization associated with this intervention. Let Yθ be the counterfactual expectation of ℓ(ω) under distribution Pθ.

Computing the derivative of (28) immediately gives

∂Yθ/∂θ = ∫_ω [ ℓ(ω) − ζ(υ) ] w′θ(ω) P(ω) ≈ (1/n) ∑_{i=1}^{n} [ ℓ(ωᵢ) − ζ(υᵢ) ] w′θ(ωᵢ)

with wθ(ω) = Pθ(ω)/P(ω)    and    w′θ(ω) = ∂wθ(ω)/∂θ = wθ(ω) ∂log Pθ(ω)/∂θ .



Replacing the expressions P(ω) and Pθ(ω) by the corresponding Markov factorizations gives many opportunities to simplify the reweighting ratio w′θ(ω). The term wθ(ω) simplifies as shown in (6). The derivative of log Pθ(ω) depends only on the factors parametrized by θ. Therefore, in order to evaluate w′θ(ω), we only need to know the few factors affected by the intervention.
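As a sanity check of this recipe, here is a toy sketch (hypothetical Gaussian logging and counterfactual distributions, not the paper's ad setup) where the true derivative is known to equal 1:

```python
import math, random, statistics

random.seed(3)
theta = 0.5

# Toy setup (hypothetical): data logged under P = Normal(0,1); the intervention
# parametrized by theta gives P_theta = Normal(theta,1); outcome l(omega) = omega,
# so the true counterfactual expectation is Y_theta = theta and dY/dtheta = 1.
def w(omega):
    """Density ratio w_theta(omega) = P_theta(omega) / P(omega)."""
    return math.exp(theta * omega - theta ** 2 / 2)

def dlogp(omega):
    """Score: d log P_theta(omega) / d theta."""
    return omega - theta

zeta = theta      # constant invariant predictor (here: a decent guess of E[l])
terms = []
for _ in range(200000):
    omega = random.gauss(0, 1)                        # omega_i ~ P
    terms.append((omega - zeta) * w(omega) * dlogp(omega))

print(statistics.mean(terms))   # approximates dY_theta/dtheta = 1
```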

Higher order derivatives can be estimated using the same approach. For instance,

∂²Yθ/(∂θᵢ ∂θⱼ) = ∫_ω [ ℓ(ω) − ζ(υ) ] w″ᵢⱼ(ω) P(ω) ≈ (1/n) ∑_{k=1}^{n} [ ℓ(ω_k) − ζ(υ_k) ] w″ᵢⱼ(ω_k)

with w″ᵢⱼ(ω) = ∂²wθ(ω)/(∂θᵢ ∂θⱼ) = wθ(ω) ( ∂log Pθ(ω)/∂θᵢ )( ∂log Pθ(ω)/∂θⱼ ) + wθ(ω) ∂²log Pθ(ω)/(∂θᵢ ∂θⱼ) .

The second term in w″ᵢⱼ(ω) vanishes when θᵢ and θⱼ parametrize distinct factors in Pθ(ω).

D.1 Infinitesimal Interventions and Policy Gradient


Expression (29) becomes particularly attractive when P(ω) = Pθ(ω), that is, when one seeks derivatives that describe the effect of an infinitesimal intervention on the system from which the data was collected. The resulting expression is then identical to the celebrated policy gradient (Aleksandrov et al., 1968; Glynn, 1987; Williams, 1992) which expresses how the accumulated rewards in a reinforcement learning problem are affected by small changes of the parameters of the policy function.


∂Yθ/∂θ = ∫_ω [ ℓ(ω) − ζ(υ) ] w′θ(ω) Pθ(ω) ≈ (1/n) ∑_{i=1}^{n} [ ℓ(ωᵢ) − ζ(υᵢ) ] w′θ(ωᵢ)

where the ωᵢ are sampled i.i.d. from Pθ and w′θ(ω) = ∂log Pθ(ω)/∂θ .



Sampling from Pθ(ω) eliminates the potentially large ratio wθ(ω) that usually plagues importance sampling approaches. Choosing a parametrized distribution that depends smoothly on θ is then sufficient to contain the size of the weights w′θ(ω). Since the weights can be positive or negative, centering the integrand with a prediction function ζ(υ) remains very important. Even a constant predictor ζ can substantially reduce the variance

var[ ( ℓ(ω) − ζ ) w′θ(ω) ] = var[ ℓ(ω) w′θ(ω) − ζ w′θ(ω) ]
    = var[ ℓ(ω) w′θ(ω) ] − 2ζ cov[ ℓ(ω) w′θ(ω), w′θ(ω) ] + ζ² var[ w′θ(ω) ]

whose minimum is reached for

ζ = cov[ ℓ(ω) w′θ(ω), w′θ(ω) ] / var[ w′θ(ω) ] = E[ ℓ(ω) w′θ(ω)² ] / E[ w′θ(ω)² ] ,

where the second equality uses the fact that E[w′θ(ω)] = 0 when sampling from Pθ.
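A small on-policy sketch (hypothetical Gaussian policy and quadratic reward, chosen so that the true gradient is 2θ = 2) showing the optimal constant baseline at work:

```python
import random, statistics

random.seed(4)
theta = 1.0
# On-policy toy example (hypothetical): omega ~ P_theta = Normal(theta, 1),
# reward l(omega) = omega**2, score w'_theta(omega) = omega - theta.
samples = [random.gauss(theta, 1) for _ in range(100000)]
score = [om - theta for om in samples]
ell = [om ** 2 for om in samples]

# Optimal constant baseline zeta = E[l w'^2] / E[w'^2]
zeta = statistics.mean(l * s * s for l, s in zip(ell, score)) / \
       statistics.mean(s * s for s in score)

grad_raw = [l * s for l, s in zip(ell, score)]
grad_cen = [(l - zeta) * s for l, s in zip(ell, score)]

# Same expectation (d E[omega^2]/d theta = 2*theta = 2), lower variance centered.
print(statistics.mean(grad_raw), statistics.mean(grad_cen))
print(statistics.variance(grad_raw), statistics.variance(grad_cen))
```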


We sometimes want to evaluate expectations under a counterfactual distribution that is too far from the actual distribution to obtain reasonable confidence intervals. Suppose, for instance, that we are unable to reliably estimate which click yield would have been observed if we had used a certain parameter θ* for the scoring models. We still can estimate how quickly and in which direction the click yield would have changed if we had slightly moved the current scoring model parameters θ in the direction of the target θ*. Although such an answer is not as good as a reliable estimate of Yθ*, it is certainly better than no answer at all.

D.2 Off-Policy Gradient


We assume in this subsection that the parametrized probability distribution Pθ(ω) is regular enough to ensure that all the derivatives of interest are defined and that the event {wθ(ω) = R} has probability zero. Furthermore, in order to simplify the exposition, the following derivation does not leverage an invariant predictor function.

Estimating derivatives using data sampled from a distribution P(ω) different from Pθ(ω) is more challenging because the ratios wθ(ωᵢ) in Equation (29) can take very large values. However, it is comparatively easy to estimate the derivatives of lower and upper bounds using a slightly different way to clip the weights. Using the notation 1l{x} to represent the indicator function, equal to one if condition x is true and zero otherwise, let us define respectively the clipped weights wZθ and the capped weights wMθ:

wZθ(ω) = wθ(ω) 1l{ Pθ(ω) < R P(ω) }    and    wMθ(ω) = min{ wθ(ω), R } .

Although Section 4.4 illustrates the use of clipped weights, the confidence interval derivation can easily be extended to the capped weights. Defining the capped quantities

Ȳθ = ∫_ω ℓ(ω) wMθ(ω) P(ω)    and    W̄θ = ∫_ω wMθ(ω) P(ω)





and writing

0 ≤ Yθ − Ȳθ = ∫_{ω∈Ω\Ω_R} ℓ(ω) ( Pθ(ω) − R P(ω) )
    ≤ M ( 1 − Pθ(Ω_R) − R P(Ω\Ω_R) ) = M ( 1 − ∫_ω wMθ(ω) P(ω) )

yields the inequality

Ȳθ ≤ Yθ ≤ Ȳθ + M ( 1 − W̄θ ) .
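The capped-weight bounds can be computed directly from logged samples. A toy sketch (hypothetical Gaussian logging and counterfactual distributions, and a sigmoid outcome bounded by M = 1):

```python
import math, random

random.seed(5)
theta, R, n = 1.0, 5.0, 100000
# Toy example (hypothetical): logging distribution P = Normal(0,1),
# counterfactual P_theta = Normal(theta,1), outcome l(omega) in [0, M=1].
Ybar = Wbar = 0.0
for _ in range(n):
    om = random.gauss(0, 1)                        # om ~ P
    w = math.exp(theta * om - theta ** 2 / 2)      # P_theta(om) / P(om)
    wM = min(w, R)                                 # capped weight
    Ybar += (1 / (1 + math.exp(-om))) * wM / n     # estimate of Ybar_theta
    Wbar += wM / n                                 # estimate of Wbar_theta

M = 1.0
lower, upper = Ybar, Ybar + M * (1 - Wbar)
print(lower, upper)   # estimated bounds; their gap is M * (1 - Wbar)
```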




In order to obtain reliable estimates of the derivatives of these upper and lower bounds, it is of course sufficient to obtain reliable estimates of the derivatives of Ȳθ and W̄θ. By separately considering the cases wθ(ω) < R and wθ(ω) > R, we easily obtain the relation

wM′θ(ω) = ∂wMθ(ω)/∂θ = wZθ(ω) ∂log Pθ(ω)/∂θ    when wθ(ω) ≠ R ,

and, thanks to the regularity assumptions, we can write

∂Ȳθ/∂θ = ∫_ω ℓ(ω) wM′θ(ω) P(ω) ≈ (1/n) ∑_{i=1}^{n} ℓ(ωᵢ) wM′θ(ωᵢ)

∂W̄θ/∂θ = ∫_ω wM′θ(ω) P(ω) ≈ (1/n) ∑_{i=1}^{n} wM′θ(ωᵢ)

Estimating these derivatives is considerably easier than using approximation (29) because they involve the bounded quantity wZθ(ω) instead of the potentially large ratio wθ(ω). It is still necessary to choose a sufficiently smooth sampling distribution P(ω) to limit the magnitude of ∂log Pθ/∂θ.
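A sketch of these clipped-weight derivative estimates on a toy problem (hypothetical Gaussian P and Pθ with a quadratic outcome, chosen only so the answers are easy to reason about):

```python
import math, random

random.seed(6)
theta, R, n = 0.5, 4.0, 50000
# Hypothetical setup: P = Normal(0,1), P_theta = Normal(theta,1),
# outcome l(omega) = omega**2.
dY = dW = 0.0
for _ in range(n):
    om = random.gauss(0, 1)                        # om ~ P
    w = math.exp(theta * om - theta ** 2 / 2)      # w_theta(omega)
    wZ = w if w < R else 0.0                       # clipped weight w^Z_theta
    score = om - theta                             # d log P_theta / d theta
    dY += (om ** 2) * wZ * score / n               # estimate of d Ybar_theta / d theta
    dW += wZ * score / n                           # estimate of d Wbar_theta / d theta
print(dY, dW)
```

The terms involve only the bounded quantity wZθ(ω), so the estimates stay well behaved even though the raw ratio wθ(ω) is unbounded.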

These derivatives are very useful to drive optimization algorithms. Assume for instance that we want to find the parameter θ that maximizes the counterfactual expectation Yθ as illustrated in Section 6.3. Maximizing the estimate obtained using approximation (5) could reach its maximum for a value of θ that is poorly explored by the actual distribution. Maximizing an estimate of the lower bound (30) ensures that the optimization algorithm finds a trustworthy answer.


Appendix E. Uniform Empirical Bernstein Bounds

This appendix reviews the uniform empirical Bernstein bound given by Maurer and Pontil (2009) and describes how it can be used to construct the uniform confidence interval (21). The first step consists of characterizing the size of a family F of functions mapping a space X into the interval [a,b] ⊂ ℝ. Given n points x = (x₁ … x_n) ∈ Xⁿ, the trace F(x) ⊂ ℝⁿ is the set of vectors ( f(x₁), …, f(x_n) ) for all functions f ∈ F.

Definition 2 (Covering numbers, etc.) Given ε > 0, the covering number N(x, ε, F) is the smallest possible cardinality of a subset C ⊂ F(x) satisfying the condition

∀v ∈ F(x)  ∃c ∈ C   max_{i=1…n} | vᵢ − cᵢ | ≤ ε ,

and the growth function N(n, ε, F) is

N(n, ε, F) = sup_{x ∈ Xⁿ} N(x, ε, F) .





Thanks to a famous combinatorial lemma (Vapnik and Chervonenkis, 1968, 1971; Sauer, 1972), for many usual parametric families F, the growth function N(n, ε, F) increases at most polynomially with both n and 1/ε.


Theorem 3 (Uniform empirical Bernstein bound) (Maurer and Pontil, 2009, thm 6)

Let δ ∈ (0,1) and n ≥ 16. Let X, X₁, …, X_n be i.i.d. random variables with values in X. Let F be a set of functions mapping X into [a,b] ⊂ ℝ and let M(n) = 10 N(2n, F, 1/n). Then, with probability at least 1 − δ,

∀f ∈ F ,   E[ f(X) ] − Mn ≤ √( 2 Vn log( M(n)/δ ) / n ) + ( b − a ) 15 log( M(n)/δ ) / ( n − 1 )



where Mn and Vn respectively are the sample mean and variance

Mn = (1/n) ∑_{i=1}^{n} f(Xᵢ)    and    Vn = (1/(n−1)) ∑_{i=1}^{n} ( f(Xᵢ) − Mn )² .
The statement of this theorem emphasizes its similarity with the non-uniform empirical Bernstein bound (theorem 1). Although the constants are less attractive, the uniform bound still converges to zero when n increases, provided of course that M(n) = 10 N(2n, F, 1/n) grows polynomially with n.

Let us then define the family of functions

F = { fθ : ω ↦ ℓ(ω) wMθ(ω) ,  gθ : ω ↦ wMθ(ω) ,  θ ∈ Θ } ,

and use the uniform empirical Bernstein bound to derive an outer inequality similar to (26) and an inner inequality similar to (27). The theorem implies that, with probability 1 − δ, both inequalities are simultaneously true for all values of the parameter θ. The uniform confidence interval (21) then follows directly.




References

V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. Stochastic optimization. Engineering Cybernetics, 5:11-16, 1968.

Susan Athey and Denis Nekipelov. A structural model of sponsored search advertising. Working paper, 2010. URL http://kuznets.harvard.edu/~athey/papers/Structural_Sponsored_Search.pdf.


Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Tuning bandit algorithms in stochastic environments. In Proc. 18th International Conference on Algorithmic Learning Theory (ALT 2007), pages 150-165, 2007.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.


13. For a simple proof of this fact, slice [a,b] into intervals Sk of maximal width ε and apply the lemma to the family of indicator functions (xᵢ, Sk) ↦ 1l{ f(xᵢ) ∈ Sk }.





Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:692-972, 2005.

Dirk Bergemann and Maher Said. Dynamic auctions: a survey. Discussion Paper 1757R, Cowles Foundation for Research in Economics, Yale University, 2010.

Léon Bottou. From machine learning to machine reasoning. http://arxiv.org/abs/1102.1808v3, Feb 2011.

Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199-231, 2001.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249-2257. NIPS Foundation, 2011.

C. R. Charig, D. R. Webb, S. R. Payne, and J. E. A. Wickham. Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. British Medical Journal (Clin Res Ed), 292(6254):879-882, 1986.

Denis X. Charles and D. Max Chickering. Optimization for paid search auctions. Manuscript in preparation, 2012.

Denis X. Charles, D. Max Chickering, and Patrice Simard. Micro-market experimentation for paid search. Manuscript in preparation, 2012.

Harald Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.


Denver Dash. Caveats for Causal Reasoning with Equilibrium Models. PhD thesis, University of Pittsburgh, 2003.

Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Sample-efficient nonstationary-policy evaluation for contextual bandits. In Proceedings of Uncertainty in Artificial Intelligence (UAI), pages 247-254, 2012.

Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. American Economic Review, 97(1):242-259, 2007.

Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141-149, 1992.


John C. Gittins. Bandit Processes and Dynamic Allocation Indices. Wiley, 1989.

Peter W. Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 1987 Winter Simulation Conference, pages 366-375, 1987.

John Goodwin. Microsoft adCenter. Personal communication, 2011.

Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Invited Applications Track. Omnipress, 2010.






Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140-181, July 2008.

Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems, October 2010.

Sébastien Lahaie and R. Preston McAfee. Efficient ranking in sponsored search. In Proc. 7th International Workshop on Internet and Network Economics (WINE 2011), pages 254-265. LNCS 7090, Springer, 2011.

Lev Landau and Evgeny Lifshitz. Course in Theoretical Physics, Volume 1: Mechanics. Pergamon Press, 1969. 2nd edition.


John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, pages 817-824. MIT Press, Cambridge, MA, 2008.

Steffen L. Lauritzen and Thomas S. Richardson. Chain graph models and their causal interpretation. Journal of the Royal Statistical Society, Series B, 64:321-361, 2002.

David K. Lewis. Counterfactuals. Harvard University Press, 1973. 2nd edition: Wiley-Blackwell, 2001.

Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on the World Wide Web (WWW 2010), pages 661-670. ACM, 2010.

Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proc. 4th ACM International Conference on Web Search and Data Mining (WSDM 2011), pages 297-306, 2011.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proc. 22nd Conference on Learning Theory (COLT 2009), 2009.

Paul Milgrom. Putting Auction Theory to Work. Cambridge University Press, 2004.


Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58-73, 1981.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000. 2nd edition: 2009.

Judea Pearl. Causal inference in statistics: an overview. Statistics Surveys, 3:96-146, 2009.

Judea Pearl. The do-calculus revisited. In Proc. Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI-2012), pages 3-11, 2012.

Linda E. Reichl. A Modern Course in Statistical Physics, 2nd Edition. Wiley, 1998.






Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.

James M. Robins, Miguel Angel Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550-560, Sep 2000.

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145-147, 1972.

Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086-7093, 2012.

Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.

Edward H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Ser. B, 13:238-241, 1951.

Aleksandrs Slivkins. Contextual bandits with similarity information. JMLR Conference and Workshop Proceedings, 19:679-702, 2011.

Peter Spirtes and Richard Scheines. Causal inference of ambiguous manipulations. Philosophy of Science, 71(5):833-845, Dec 2004.

Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction and Search. Springer Verlag, New York, 1993. 2nd edition: MIT Press, Cambridge (Mass.), 2011.

Stephen M. Stigler. A historical view of statistical concepts in psychology and educational research. American Journal of Education, 101(1):60-70, Nov 1992.

Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, 2007.

Rich S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings 16th Conference on Knowledge Discovery and Data Mining (KDD 2010), pages 17-26, 2010.


Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer Verlag, Berlin, New York, 1982.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. Uniform convergence of the frequencies of occurrence of events to their probabilities. Proc. Academy of Sciences of the USSR, 181(4), 1968. English translation: Soviet Mathematics - Doklady, 9:915-918, 1968.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.






Hal R. Varian. Position auctions. International Journal of Industrial Organization, 25:1163-1178, 2007.

Hal R. Varian. Online ad auctions. American Economic Review, 99(2):430-434, 2009.

Joannès Vermorel and Mehryar Mohri. Multi-armed bandit algorithms and empirical evaluation. In Proc. European Conference on Machine Learning, pages 437-448, 2005.

Georg H. von Wright. Explanation and Understanding. Cornell University Press, 1971.

Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117-186, 1945.

Norbert Wiener. Cybernetics, or Control and Communication in the Animal and the Machine. Hermann et Cie (Paris), MIT Press (Cambridge, Mass.), Wiley and Sons (New York), 1948. 2nd Edition (expanded): MIT Press, Wiley and Sons, 1961.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.

James Woodward. Making Things Happen. Oxford University Press, 2005.

Sewall S. Wright. Correlation and causation. Journal of Agricultural Research, 20:557-585, 1921.