Augmenting Reinforcement Learning with Human Feedback

通过人类反馈增强强化学习

W. Bradley Knox

BRADKNOX $@$ CS.UTEXAS.EDU

University of Texas at Austin, Department of Computer Science

德克萨斯大学奥斯汀分校，计算机科学系

Peter Stone

PSTONE@CS.UTEXAS.EDU

University of Texas at Austin, Department of Computer Science

德克萨斯大学奥斯汀分校，计算机科学系

Abstract

摘要

1. Introduction

1. 引言

As computational agents are increasingly used beyond research labs, their success will depend on their ability to learn new skills and adapt to their dynamic, complex environments. If human users — without programming skills — can transfer their task knowledge to agents, learning can accelerate dramatically, reducing costly trials. The TAMER framework guides the design of agents whose behavior can be shaped through signals of approval and disapproval, a natural form of human feedback. More recently, TAMER $.+\mathrm{RL}$ was introduced to enable human feedback to augment a traditional reinforcement learning (RL) agent that learns from a Markov decision process’s (MDP) reward signal. Using a re implementation of TAMER and ${\mathrm{TAMER}}{+}{\mathrm{RL}}$ , we address limitations of prior work, contributing in two critical directions. First, the four successful techniques for combining a human reinforcement with RL from prior $\mathrm{TAMER+RL}$ work are tested on a second task, and these techniques’ sensitivities to parameter changes are analyzed. Together, these examinations yield more general and prescriptive conclusions to guide others who wish to incorporate human knowledge into an RL algorithm. Second, $\mathrm{TAMER+RL}$ has thus far been limited to a sequential setting, in which training occurs before learning from MDP reward. We modify the sequential algorithms to learn simultaneously from both sources, enabling the human feedback to come at any time during the reinforcement learning process. To enable simultaneous learning, we introduce a new technique that appropriately determines the magnitude of the human model’s influence on the RL algorithm throughout time and state-action space.

随着计算智能体在实验室之外的应用越来越广泛，它们的成功将取决于其学习新技能和适应动态复杂环境的能力。如果人类用户——无需编程技能——能够将任务知识传递给智能体，学习速度将显著加快，减少昂贵的试错成本。TAMER框架指导了智能体的设计，这些智能体的行为可以通过批准和反对信号来塑造，这是一种自然的人类反馈形式。最近，TAMER+RL被引入，使人类反馈能够增强传统的强化学习（RL）智能体，该智能体从马尔可夫决策过程（MDP）的奖励信号中学习。通过重新实现TAMER和TAMER+RL，我们解决了先前工作的局限性，并在两个关键方向上做出了贡献。首先，先前TAMER+RL工作中将人类强化与RL结合的四种成功技术在第二个任务上进行了测试，并分析了这些技术对参数变化的敏感性。这些测试共同得出了更具普遍性和指导性的结论，以指导那些希望将人类知识融入RL算法的人。其次，TAMER+RL迄今为止仅限于顺序设置，即在从MDP奖励学习之前进行训练。我们修改了顺序算法，使其能够同时从两个来源学习，使人类反馈能够在强化学习过程中的任何时间到来。为了实现同时学习，我们引入了一种新技术，该技术适当地确定了人类模型在时间和状态-动作空间中对RL算法的影响程度。

Computational agents may soon be prevalent in society, and many of their end users will want these agents to learn to perform new tasks. For many of these tasks, the human user will already have significant task knowledge. Consequently, we seek to enable non-technical users to transfer their knowledge to the agent, reducing the cost of learning without hurting the agent’s final, asymptotic performance.

计算智能体可能很快会在社会中普及，许多终端用户将希望这些智能体能够学习执行新任务。对于许多任务，人类用户已经具备丰富的任务知识。因此，我们致力于让非技术用户能够将其知识转移给智能体，从而在不影响智能体最终渐进性能的情况下降低学习成本。

In this vein, the TAMER framework guides the design of agents that learn by shaping — using signals of approval and disapproval to teach an agent a desired behavior (Knox and Stone, 2009). As originally formulated, TAMER was limited to learn exclusively from the human feedback. More recently, TAMER $+\mathrm{RL}$ was introduced with the goal of enabling the human feedback to augment a traditional reinforcement learning (RL) agent that learns from an MDP reward signal (Knox and Stone, 2010). However, TAMER $.+\mathrm{RL}$ has previously only been tested on a single domain, and it has been limited to the case where the learning from human feedback happens only prior to RL: sequential TAMER $+\mathrm{RL}$ . Using a re implementation of TAMER and ${\mathrm{TAMER}}{+}{\mathrm{RL}}$ , we address these limitations by improving upon prior work in two crucial directions.

本着这一思路，TAMER框架指导了通过塑造学习的智能体设计——使用批准和反对的信号来教导智能体期望的行为 (Knox and Stone, 2009)。最初提出的TAMER仅限于从人类反馈中学习。最近，TAMER $+\mathrm{RL}$ 被引入，旨在使人类反馈能够增强传统的强化学习 (Reinforcement Learning, RL) 智能体，该智能体从MDP奖励信号中学习 (Knox and Stone, 2010)。然而，TAMER $+\mathrm{RL}$ 之前仅在一个领域进行了测试，并且仅限于在RL之前从人类反馈中学习的情况：顺序TAMER $+\mathrm{RL}$。通过对TAMER和 ${\mathrm{TAMER}}{+}{\mathrm{RL}}$ 的重新实现，我们在两个关键方向上改进了先前的工作，解决了这些限制。

First, in Section 3, we continue with the sequential TAMER $+\mathrm{RL}$ approach, testing the four TAMER $.+\mathrm{RL}$ techniques that were previously found to be successful. We test on two tasks — one identical to the single prior $\mathrm{TAMER+RL}$ task and a new task. We also provide a novel examination of each technique’s performance at a range of combination parameter values to determine the ease of setting each pa- rameter effectively, a critical aspect of using TAMER $.+\mathrm{RL}$ algorithms in practice that has been previously sidestepped. Together, these analyses yield stronger, more prescriptive conclusions than were possible from prior work. Two similar combination techniques, for the first time, clearly stand out as the most effective, and we consistently observe that manipulating action selection is more effective than altering the RL update.

首先，在第 3 节中，我们继续采用顺序 TAMER $+\mathrm{RL}$ 方法，测试了之前发现成功的四种 TAMER $.+\mathrm{RL}$ 技术。我们在两个任务上进行了测试——一个与之前的单一 $\mathrm{TAMER+RL}$ 任务相同，另一个是新任务。我们还对每种技术在不同组合参数值下的性能进行了新颖的考察，以确定有效设置每个参数的难易程度，这是在实践中使用 TAMER $.+\mathrm{RL}$ 算法时的一个关键方面，而之前一直被回避。这些分析共同得出了比之前工作更强大、更具指导性的结论。两种相似的组合技术首次明显脱颖而出，成为最有效的技术，并且我们一致观察到，操纵动作选择比改变 RL 更新更有效。

Second, in Section 4 we move from the sequential setting of first learning only from the human and then learning from MDP reward to learning from both simultaneously. The principal benefit of simultaneous learning is its flexibility; it gives a trainer the important ability to step in as desired to alter the course of reinforcement learning while it is in progress. We demonstrate the success of the two best-performing techniques from the sequential experiments, action biasing and control sharing, in this simultaneous setting. To meet demands introduced by the simultaneous setting, we use a novel method to moderate the influence of the model of human reinforcement on the RL algorithm. Our method increases influence in areas of the state-action space that have recently received training and slowly decreases influence in the absence of training, leaving the original MDP reward and base RL agent to learn autonomously in the limit. Without this improvement, the sequential techniques would be too brittle for simultaneous learning.

其次，在第4节中，我们从仅从人类学习然后从MDP奖励学习的顺序设置转向同时从两者学习。同时学习的主要好处是其灵活性；它使训练者能够在强化学习进行过程中根据需要介入以改变其进程。我们展示了在顺序实验中表现最好的两种技术——动作偏置和控制共享——在这种同时设置中的成功。为了满足同时设置引入的需求，我们使用了一种新方法来调节人类强化模型对RL算法的影响。我们的方法在最近接受训练的状态-动作空间区域增加影响，并在没有训练的情况下缓慢减少影响，最终让原始的MDP奖励和基础RL智能体在极限情况下自主学习。如果没有这一改进，顺序技术对于同时学习来说将过于脆弱。

2. Preliminaries

2. 预备知识

In this section, we briefly introduce reinforcement learning and the TAMER Framework.

在本节中，我们简要介绍强化学习和 TAMER 框架。

2.1. Reinforcement Learning

2.1. 强化学习

We assume that the task environment is a Markov decision process (MDP) specified by the tuple $(S,,A,,T,,\gamma,,D,,R)$ . $S$ and $A$ are respectively the sets of possible states and actions. $T$ is a transition function, $T:,S\times A\times S,\rightarrow,\mathbb{R}.$ , which gives the probability, given a state $s_{t}$ and an action $a_{t}$ , of transitioning to state $s_{t+1}$ . $\gamma$ , the discount factor, exponentially decreases the value of a future reward. $D$ is the distribution of start states. $R$ is a reward function, $R:S\times A\times S\to\mathbb{R}$ , where the reward is a function of $s_{t}$ , $a_{t}$ , and $s_{t+1}$ . We will also consider reward that is a function of only $s_{t}$ and $a_{t}$ .

我们假设任务环境是一个由元组 $(S,,A,,T,,\gamma,,D,,R)$ 指定的马尔可夫决策过程 (MDP)。$S$ 和 $A$ 分别是可能的状态和动作的集合。$T$ 是一个转移函数，$T:,S\times A\times S,\rightarrow,\mathbb{R}$，它给出了在给定状态 $s_{t}$ 和动作 $a_{t}$ 的情况下，转移到状态 $s_{t+1}$ 的概率。$\gamma$ 是折扣因子，它按指数级减少未来奖励的价值。$D$ 是起始状态的分布。$R$ 是一个奖励函数，$R:S\times A\times S\to\mathbb{R}$，其中奖励是 $s_{t}$、$a_{t}$ 和 $s_{t+1}$ 的函数。我们还将考虑仅作为 $s_{t}$ 和 $a_{t}$ 的函数的奖励。

Reinforcement learning algorithms (see Sutton and Barto (1998)), seek to learn policies $(\pi,:,S,\rightarrow,A)$ ) for an MDP that maximize return from each state-action pair, where $\begin{array}{r}{\begin{array}{r}{r e t u r n,=,\sum_{t=0}^{T}E[\gamma^{t}R(s_{t},a_{t},s_{t+1})]}\end{array}}\end{array}$ . In this paper, we focus on us ing a value-function-based RL method, namely $\mathrm{SARSA}(\lambda)$ (Sutton and Barto, 1998), augmented by the TAMER-based learning that can be done directly from a human’s reinforcement signal. Though more sophisticated RL methods exist, we use $\mathrm{SARSA}(\lambda)$ for its popularity and representative ness, and because we are not concerned with finding the best overall algorithm for our experimental tasks but rather with determining how various methods for including a human model change the base RL algorithm’s performance.

强化学习算法（参见 Sutton 和 Barto (1998)）旨在学习最大化每个状态-动作对回报的 MDP 策略 $(\pi,:,S,\rightarrow,A)$，其中 $\begin{array}{r}{\begin{array}{r}{r e t u r n,=,\sum_{t=0}^{T}E[\gamma^{t}R(s_{t},a_{t},s_{t+1})]}\end{array}}\end{array}$。在本文中，我们重点使用基于价值函数的强化学习方法，即 $\mathrm{SARSA}(\lambda)$（Sutton 和 Barto, 1998），并通过基于 TAMER 的学习进行增强，该方法可以直接从人类的强化信号中学习。尽管存在更复杂的强化学习方法，但我们使用 $\mathrm{SARSA}(\lambda)$ 是因为它的普及性和代表性，并且我们并不关心为我们的实验任务找到最佳的整体算法，而是关注如何通过包含人类模型的各种方法来改变基础强化学习算法的性能。

2.2. The TAMER Framework for Interactive Shaping

2.2. 用于交互式塑造的 TAMER 框架

The TAMER Framework, introduced by Knox and Stone (2009) is an approach to the problem of how an agent should learn from numerically mapped reinforcement signals. Specifically, these feedback signals are delivered by an observing human trainer as the agent attempts to perform a task.1 TAMER is motivated by two insights about human reinforcement. First, reinforcement is trivially delayed, slowed only by the time it takes the trainer to assess behavior and deliver feedback. Second, the trainer observes the agent’s behavior with a model of that behavior’s longterm effects, so the reinforcement is assumed to be fully informative about the quality of recent behavior. Human rein for cement is more similar to an action value (sometimes called a Q-value), albeit a noisy and trivially delayed one, than MDP reward. Consequently, TAMER assumes human reinforcement to be fully informative about the quality of an action given the current state, and it models a hypothetical human reinforcement function, $H:S\times A\to\mathbb{R}$ , as $\hat{H}$ in real time by regression. In the simplest form of credit assignment, each reinforcement creates a label for the last state-action pair.2 The output of the resultant $\hat{H}$ function — changing as the agent gains experience — determines the relative quality of potential actions, so that the exploitative action is $\stackrel{.}{a}=\stackrel{.}{a r g m a x}_{a}[\hat{H}(s,a)]$ .

TAMER 框架由 Knox 和 Stone (2009) 提出，旨在解决智能体如何从数值映射的强化信号中学习的问题。具体来说，这些反馈信号是由观察者（人类训练师）在智能体尝试执行任务时提供的。TAMER 的动机源于对人类强化的两个洞察：首先，强化信号几乎没有延迟，仅受训练师评估行为和提供反馈所需时间的限制；其次，训练师通过观察智能体的行为及其长期影响模型来提供反馈，因此假设强化信号能够充分反映近期行为质量。与 MDP（马尔可夫决策过程）中的奖励相比，人类强化信号更类似于动作值（有时称为 Q 值），尽管它带有噪声且几乎没有延迟。因此，TAMER 假设人类强化信号能够充分反映当前状态下动作的质量，并通过回归实时建模一个假设的人类强化函数 $H:S\times A\to\mathbb{R}$ 为 $\hat{H}$。在最简单的信用分配形式中，每次强化信号为最近的状态-动作对创建一个标签。生成的 $\hat{H}$ 函数的输出——随着智能体经验的积累而变化——决定了潜在动作的相对质量，因此利用性动作为 $\stackrel{.}{a}=\stackrel{.}{a r g m a x}_{a}[\hat{H}(s,a)]$。

3. Sequential TAMER $\mathbf$

3. 顺序 TAMER $\mathbf$

Noting that TAMER agents typically learn faster than agents learning from MDP reward but to a lower performance plateau, Knox and Stone combined TAMER and SARSA $\lfloor\lambda)$ (2010). Their aim was to complement TAMER’s fast learning with RL’s ability to often learn better policies in the long run. These conjoined $\mathrm{TAMER+RL}$ algorithms address a scenario in which a human trains an agent, leaving a model $\hat{H}$ of reinforcement, and then $\hat{H}$ is used to influence the base RL algorithm somehow. We call this scenario and the algorithms that address it sequential TAMER $+\mathrm{RL}$ . For all $\mathrm{TAMER+RL}$ approaches, only MDP reward is considered to specify optimal behavior. $\dot{H}$ provides guidance but not an objective. In this section, we reproduce and then extend prior investigations of sequential $\mathrm{TAMER+RL}$ , yielding more prescriptive and general conclusions than prior work allowed.

注意到 TAMER 智能体通常比从 MDP 奖励中学习的智能体学习得更快，但达到的性能平台较低，Knox 和 Stone 将 TAMER 和 SARSA $\lfloor\lambda)$ 结合起来 (2010)。他们的目标是通过 RL 长期学习更好策略的能力来补充 TAMER 的快速学习。这些结合的 $\mathrm{TAMER+RL}$ 算法解决了一个场景，其中人类训练一个智能体，留下一个强化模型 $\hat{H}$，然后 $\hat{H}$ 以某种方式影响基础的 RL 算法。我们将这种场景及其算法称为顺序 TAMER $+\mathrm{RL}$。对于所有 $\mathrm{TAMER+RL}$ 方法，只有 MDP 奖励被认为是指定最优行为的。$\dot{H}$ 提供指导，但不提供目标。在本节中，我们重现并扩展了之前对顺序 $\mathrm{TAMER+RL}$ 的研究，得出了比之前工作更具规定性和普遍性的结论。

3.1. Combination techniques

3.1. 组合技术

Knox and Stone tested eight $\mathrm{TAMER+RL}$ techniques that each use $\hat{H}$ to affect the RL algorithm in a different way. Four were largely effective when compared to the SARSA $(\lambda)$ -only and TAMER-only agents on both mean reward over a run and performance at the end of the run. We focus on those four techniques, which can be used on any RL algorithm that uses an action-value function. Below, we list them with names we have created. In our notation, a prime (e.g., $Q^{\prime}$ ) after a function means the function replaces its non-prime counterpart in the base RL algorithm.

Knox 和 Stone 测试了八种 $\mathrm{TAMER+RL}$ 技术，每种技术都以不同的方式使用 $\hat{H}$ 来影响 RL 算法。与仅使用 SARSA $(\lambda)$ 和仅使用 TAMER 的智能体相比，其中四种技术在运行期间的平均奖励和运行结束时的性能方面表现出显著效果。我们重点关注这四种技术，它们可以应用于任何使用动作值函数的 RL 算法。下面，我们用自创的名称列出这些技术。在我们的符号中，函数后的撇号（例如 $Q^{\prime}$）表示该函数在基础 RL 算法中替换其非撇号对应项。

• Action biasing: $Q^{\prime}(s,a),=,Q(s,a)+(\beta*{\hat{H}}(s,a))$ only during action selection

• 动作偏置 (Action biasing)：仅在动作选择期间，$Q^{\prime}(s,a),=,Q(s,a)+(\beta*{\hat{H}}(s,a))$

Control sharing: $\begin{array}{r l}{P(a{=}a r g m a x_{a}[\hat{H}(s,a)])}&{{}=}\end{array}$ $m i n(\beta,1)$ . Otherwise use base RL agent’s action selection mechanism.

控制共享：$\begin{array}{r l}{P(a{=}a r g m a x_{a}[\hat{H}(s,a)])}&{{}=}\end{array}$ $m i n(\beta,1)$。否则使用基础 RL 智能体的动作选择机制。

These four techniques are numbered 1, 4, 6, and 7 in Knox and Stone (2010). We altered action biasing to generalize it, but the $\epsilon_{}$ -greedy policies we use in our experiments are not affected. In the descriptions above, $\beta$ is a predefined combination parameter. In our sequential $\mathrm{TAMER+RL}$ experiments, $\beta$ is annealed by a predefined factor after each episode for all techniques other than Q augmentation.

这四种技术在 Knox 和 Stone (2010) 中编号为 1、4、6 和 7。我们对动作偏置进行了修改以使其更通用，但我们在实验中使用的 $\epsilon_{}$ -贪婪策略不受影响。在上述描述中，$\beta$ 是一个预定义的组合参数。在我们的顺序 $\mathrm{TAMER+RL}$ 实验中，除了 Q 增强之外的所有技术，$\beta$ 在每集之后都会按预定义因子进行退火。

We now briefly discuss these techniques and situate them within related work. In the RL literature, reward shaping adds the output of a shaping function to the original MDP reward, creating a new reward to learn from instead (Dorigo and Colombetti, 1994; Mataric, 1994). As we confirm in the coming paragraph on Q augmentation, our reward shaping technique is not the only way to do reward shaping, but it is the most direct use of $\dot{H}$ for reward shaping.

我们现在简要讨论这些技术，并将它们置于相关工作的背景中。在强化学习（RL）文献中，奖励塑形（reward shaping）通过将塑形函数的输出添加到原始马尔可夫决策过程（MDP）奖励中，创建一个新的奖励以供学习（Dorigo and Colombetti, 1994; Mataric, 1994）。正如我们在接下来关于Q增强的段落中所确认的那样，我们的奖励塑形技术并不是唯一的方法，但它是最直接使用 $\dot{H}$ 进行奖励塑形的方式。

If $\hat{H}$ is considered a heuristic function, action biasing is the same action selection method used in Bianchi et al.’s Heuristic ally Accelerated Q-Learning (HAQL) algorithm (Bianchi et al., 2004). Control sharing is equivalent to Ferna´ndez and Veloso’s $\pi$ -reuse exploration strat-egy (2006). Note that both control sharing and action bi- asing only affect action selection and can be interpreted as directly guiding exploration toward human-favored stateaction pairs.

如果 $\hat{H}$ 被视为启发式函数，那么动作偏向 (action biasing) 与 Bianchi 等人提出的启发式加速 Q 学习 (Heuristic ally Accelerated Q-Learning, HAQL) 算法中使用的动作选择方法相同 (Bianchi et al., 2004)。控制共享 (control sharing) 则等同于 Ferna´ndez 和 Veloso 提出的 $\pi$ -重用探索策略 (2006)。需要注意的是，控制共享和动作偏向仅影响动作选择，并且可以解释为直接引导探索朝向人类偏好的状态-动作对。

$\mathrm{Q}$ augmentation is action biasing with additional use of $\hat{H}$ during the Q-function’s update. Wiewiora et al.’s related look-ahead advice (2003) uses a discounted change in the output of a state-action potential function, $\gamma\phi\big(s_{t+1},a_{t+1}\big)-\phi\big(s_{t},a_{t}\big)$ , for reward shaping and to augment action values during action selection. Interestingly, look-ahead advice is equivalent to $\mathrm{Q}$ augmentation when $\hat{H}$ is used for $\phi$ , the state and action space are finite, and the policy is invariant to adding a constant to all action values in the current state (e.g., $\epsilon$ -greedy and soft-max).

$\mathrm{Q}$ 增强是在 Q 函数更新过程中额外使用 $\hat{H}$ 来进行动作偏置。Wiewiora 等人 (2003) 提出的相关前瞻建议 (look-ahead advice) 使用状态-动作势函数的输出的折扣变化 $\gamma\phi\big(s_{t+1},a_{t+1}\big)-\phi\big(s_{t},a_{t}\big)$ 来进行奖励塑造，并在动作选择期间增强动作值。有趣的是，当 $\hat{H}$ 被用作 $\phi$ 时，前瞻建议等同于 $\mathrm{Q}$ 增强，前提是状态和动作空间是有限的，并且策略对当前状态下所有动作值添加常数保持不变（例如，$\epsilon$-贪婪策略和 soft-max 策略）。

3.2. Sequential learning experiments

3.2. 序列学习实验

We now describe our sequential ${\mathrm{TAMER}}{+}{\mathrm{RL}}$ experiments. We first validate our re implementation of TAMER and TAMER $+\mathrm{RL}$ by reproducing Knox and Stone’s results on the single task they tested. We then evaluate the algorithms’ effectiveness on a different task. Additionally, we analyze our results at a range of combination parameter values ( $\beta$ values) to identify challenges to setting $\beta$ ’s value without prior testing.

我们现在描述我们的顺序 ${\mathrm{TAMER}}{+}{\mathrm{RL}}$ 实验。我们首先通过复现 Knox 和 Stone 在他们测试的单一任务上的结果，验证我们对 TAMER 和 TAMER $+\mathrm{RL}$ 的重新实现。然后，我们在不同的任务上评估这些算法的有效性。此外，我们在一系列组合参数值（$\beta$ 值）下分析我们的结果，以识别在没有预先测试的情况下设置 $\beta$ 值的挑战。

Following past work on TAMER and $\mathrm{TAMER+RL}$ , we implemented the corresponding algorithms as exactly as we could, excepting some changes to the credit assignment technique in Knox and Stone (2009).4 Using the original $\hat{H}$ representation (linear model of RBF features), task settings, $\mathrm{SARSA}(\lambda)$ parameters, and training records from Knox and Stone (2010),5 we repeat their experiments on the Mountain Car task,6 using all four combination techniques found to be successful in their experiments and a range of $\beta$ combination parameters. We then test these TAMER $+\mathrm{RL}$ techniques on a second task, Cart Pole, using an $\hat{H}$ model trained by an author. We again use SARSA $(\lambda)$ , choosing parameters that perform well but sacrifice some performance for episode-to-episode stability and the ability to evaluate policies that might otherwise balance the pole for too long to finish a run. In Mountain Car, the goal is to quickly move the car up a hill to the goal. The agent receives -1 reward for all transitions to non-absorbing states. In Cart Pole, the goal is to move a cart so that an attached, upright pole maintains balance as long as possible. The agent receives $+1$ reward for all transitions that keep the pole within a specified range of vertical. The $\hat{H}$ for Cart Pole was learned by $\boldsymbol{\mathrm{k}}$ -Nearest Neighbor. For both tasks, we use Gaussian RBF features for $\mathrm{SARSA}(\lambda)$ and initialize $Q$ pessimistically, as was found effective in Knox and Stone (2010). In these and later experiments, $\hat{H}$ outputs are typically in the range [-2, 2].

遵循过去关于 TAMER 和 $\mathrm{TAMER+RL}$ 的工作，我们尽可能准确地实现了相应的算法，除了对 Knox 和 Stone (2009) 中的信用分配技术进行了一些修改。使用原始的 $\hat{H}$ 表示（RBF 特征的线性模型）、任务设置、$\mathrm{SARSA}(\lambda)$ 参数以及 Knox 和 Stone (2010) 的训练记录，我们在 Mountain Car 任务上重复了他们的实验，使用了在他们的实验中被证明成功的所有四种组合技术以及一系列 $\beta$ 组合参数。然后，我们在第二个任务 Cart Pole 上测试了这些 TAMER $+\mathrm{RL}$ 技术，使用了由一位作者训练的 $\hat{H}$ 模型。我们再次使用 SARSA $(\lambda)$，选择了表现良好但在剧集间稳定性和评估策略能力上牺牲了一些性能的参数，这些策略可能会使杆子保持平衡时间过长而无法完成一次运行。在 Mountain Car 中，目标是快速将车移动到山上的目标位置。代理在所有非吸收状态的转换中都会收到 -1 的奖励。在 Cart Pole 中，目标是移动小车，使连接的直立杆尽可能长时间保持平衡。代理在所有使杆保持在指定垂直范围内的转换中都会收到 $+1$ 的奖励。Cart Pole 的 $\hat{H}$ 是通过 $\boldsymbol{\mathrm{k}}$ -近邻算法学习的。对于这两个任务，我们使用高斯 RBF 特征进行 $\mathrm{SARSA}(\lambda)$，并悲观地初始化 $Q$，正如 Knox 和 Stone (2010) 中发现的那样有效。在这些以及后续的实验中，$\hat{H}$ 的输出通常在 [-2, 2] 范围内。

We evaluate each combination technique on four criteria; full success requires outperforming the corresponding $\hat{H}$ ’s TAMER-only policy and $\mathrm{SARSA}(\lambda)$ -only both in endrun performance and cumulative reward (or mean reward across full runs, equivalently).

我们根据四个标准评估每种组合技术；完全成功需要在最终运行性能和累积奖励（或完整运行的平均奖励，等价）方面均优于相应的 $\hat{H}$ 的仅 TAMER 策略和仅 $\mathrm{SARSA}(\lambda)$ 策略。

4For space considerations, we we will not fully describe these changes. Briefly, for each reinforcement signal received, Knox and Stone create a learning sample for every time step within a window of recent experience, resulting in many samples per reinforcement in fast domains. We instead create one sample per time step, using all crediting reinforcements to create one label.

出于篇幅考虑，我们不会详细描述这些变化。简而言之，对于每个接收到的强化信号，Knox 和 Stone 会在最近经验窗口内的每个时间步创建一个学习样本，从而在快速领域中每个强化信号产生许多样本。而我们则每个时间步创建一个样本，使用所有信用强化信号来生成一个标签。

5The models we create — $\hat{H}{1}$ and $\hat{H}{2}$ — from the original training trajectories perform a bit better than those from Knox and Stone’s experiments, which points to small implementation differences.

我们创建的模型——$\hat{H}{1}$ 和 $\hat{H}{2}$——在原始训练轨迹上的表现略优于 Knox 和 Stone 的实验结果，这表明存在一些小的实现差异。

6Tasks are adapted from RL-Library (Tanner and White, 2009).

任务改编自RL-Library (Tanner and White, 2009)。

Figure 1. Comparison of TAMER $\mathbf{+RL}$ techniques with SARSA(λ) and the TAMER-only policy on Mountain Car over 40 or more runs of 500 episodes. $\hat{H_{1}}$ and $\hat{H}_{2}$ are models from two different human trainers. The top chart considers reward over the entire run, and the bottom chart evaluates reward over the final 10 episodes. Error bars show standard error.

图 1: TAMER $\mathbf{+RL}$ 技术与 SARSA(λ) 和仅使用 TAMER 的策略在 Mountain Car 上的比较，基于 40 次或更多次 500 回合的运行。$\hat{H_{1}}$ 和 $\hat{H}_{2}$ 是来自两位不同人类训练者的模型。顶部图表考虑了整个运行期间的奖励，底部图表评估了最后 10 回合的奖励。误差条表示标准误差。

Figure 2. The same TAMER $\mathbf{+RL}$ comparisons as in Figure 1, but on Cart Pole over runs of 150 episodes. A single $\hat{H}$ was used. End-run performance is the mean reward during the last 5 episodes.

图 2: 与图 1 相同的 TAMER $\mathbf{+RL}$ 对比，但在 Cart Pole 任务中运行了 150 个回合。使用了单一的 $\hat{H}$。最终性能是最后 5 个回合的平均奖励。

3.3. Sequential learning results and discussion

3.3. 顺序学习结果与讨论

Figures 1 and 2 show the results of our experiments for sequential TAMER $+\mathrm{RL}$ . For now, we only show results for the $\beta$ combination parameters that accrue the highest cumulative reward for their corresponding technique. Figure 2 additionally shows learning curves for the first 30 episodes of the Cart Pole run. (Our early-run results for Mountain Car are similar to those shown by Knox and Stone (2010)).

图 1 和图 2 展示了我们针对顺序 TAMER $+\mathrm{RL}$ 的实验结果。目前，我们仅展示了对应技术中累积奖励最高的 $\beta$ 组合参数的结果。图 2 还展示了 Cart Pole 运行前 30 个回合的学习曲线。（我们在 Mountain Car 上的早期运行结果与 Knox 和 Stone (2010) 展示的结果相似）。

Qualitatively, our Mountain Car results agree with previous work. Action biasing and control sharing succeed on all four criteria and significantly outperform other techniques in cumulative reward. Reward shaping and Q augmentation also improve over SARSA $(\lambda)$ -only by both metrics and over the TAMER-only policies in end-run reward.

定性地看，我们的 Mountain Car 结果与之前的工作一致。动作偏置 (action biasing) 和控制共享 (control sharing) 在所有四个标准上都取得了成功，并且在累积奖励方面显著优于其他技术。奖励塑形 (reward shaping) 和 Q 增强 (Q augmentation) 也在两个指标上优于仅使用 SARSA $(\lambda)$ 的策略，并且在最终奖励方面优于仅使用 TAMER 的策略。

On Cart Pole, action biasing and control sharing again succeed fully. This time, Q augmentation also meets the criteria for success, though it performs significantly worse than action biasing and control sharing. Most interestingly, reward shaping, at its best tested parameter, does not significantly alter SARSA $(\lambda)$ ’s performance on either metric.

在 Cart Pole 任务中，动作偏置 (action biasing) 和控制共享 (control sharing) 再次完全成功。这一次，Q 增强 (Q augmentation) 也达到了成功的标准，尽管它的表现明显不如动作偏置和控制共享。最有趣的是，在测试的最佳参数下，奖励塑形 (reward shaping) 并没有显著改变 SARSA $(\lambda)$ 在任一指标上的表现。

By choosing the best $\beta$ parameter value for each technique, prior TAMER $\mathbf{+RL}$ experiments sidestep the issue of using an effective value without first testing a range of values. With experiments in two tasks, we can begin to address this problem by examining each technique’s sensitivity to $\beta$ parameter changes and whether certain ranges of $\beta$ are effective across different tasks. In Figure 3, we show the mean performance of each combination technique as $\beta$ varies. Examining the charts, we consider several criteria:

通过为每种技术选择最佳的 $\beta$ 参数值，之前的 TAMER $\mathbf{+RL}$ 实验避免了在没有首先测试一系列值的情况下使用有效值的问题。通过在两个任务中的实验，我们可以通过检查每种技术对 $\beta$ 参数变化的敏感性以及某些 $\beta$ 范围是否在不同任务中有效来开始解决这个问题。在图 3 中，我们展示了每种组合技术在 $\beta$ 变化时的平均性能。通过检查图表，我们考虑了以下几个标准：

Evaluating the techniques on these three criteria creates a consistent story that fits with our analysis of the techniques at their best $\beta$ parameter values (in Figures 1 and 2). The two methods that only affect action selection — action biasing and control sharing — emerge as the most effective techniques without a clear leader between them, and they are followed by Q augmentation and then shaping rewards.

根据这三个标准评估这些技术，得出了一个与我们分析这些技术在其最佳 $\beta$ 参数值（见图 1 和图 2）时一致的结果。仅影响动作选择的两种方法——动作偏向 (action biasing) 和控制共享 (control sharing)——被证明是最有效的技术，两者之间没有明显的领先者，其次是 Q 增强 (Q augmentation)，然后是塑造奖励 (shaping rewards)。

From an RL perspective, the weakness of reward shaping may be counter intuitive. When researchers discuss combining human reinforcement with RL in the literature, reward shaping is predominantly suggested (Thomaz and Breazeal, 2006; Isbell et al., 2006), possibly because human “reward” is seen as an analog to MDP reward that should be used similarly. However, though reward shaping is generally cast as a guide for exploration, it only affects exploration indirectly through precariously tampering with the reward signal. Action biasing and control sharing affect exploration directly, without manipulating reward. Thus, they achieve the stated goal of reward shaping while leaving the agent to learn accurate values from its experience. Following this line of thought, Q augmentation is identical to action biasing during action selection, boosting each action’s Q-value by the weighted prediction of human rein for cement. In addition to this direct guidance on exploration, Q augmentation also changes the Q-value during the SARSA(λ) update’s calculation of temporal difference error. As discussed in Section 3.1, Q augmentation is nearly equivalent to a form of reward shaping called look-ahead advice (Wiewiora et al., 2003). In short, we observe that the more a technique directly affects action selection, the better it does, and the more it affects the update to the $\mathrm{Q}$ function for each transition experience, the worse it does. Q augmentation does both and performs between the techniques that do only one.

从强化学习（RL）的角度来看，奖励塑形的弱点可能有些反直觉。当研究人员在文献中讨论将人类强化与RL结合时，奖励塑形通常是被建议的主要方法（Thomaz 和 Breazeal, 2006; Isbell 等, 2006），可能是因为人类的“奖励”被视为与MDP奖励类似，应该以相似的方式使用。然而，尽管奖励塑形通常被视为探索的引导，它只是通过不稳定的奖励信号间接影响探索。动作偏置和控制共享则直接影响探索，而无需操纵奖励。因此，它们在实现奖励塑形的既定目标的同时，让智能体从其经验中学习准确的值。按照这一思路，Q增强在动作选择期间与动作偏置相同，通过加权预测人类强化来提升每个动作的Q值。除了对探索的直接引导外，Q增强还在SARSA(λ)更新的时间差分误差计算中改变了Q值。正如第3.1节所讨论的，Q增强几乎等同于一种称为前瞻建议的奖励塑形形式（Wiewiora 等, 2003）。简而言之，我们观察到，一种技术对动作选择的影响越直接，效果越好；而对每次转移经验的Q函数更新的影响越大，效果越差。Q增强同时影响两者，其表现介于只影响一种的技术之间。

Figure 3. Performance of each technique with each tested $\hat{H}$ over a range of $\beta$ parameters on two tasks: Cart Pole (CP) and Mountain Car (MC). Note changes in y-axis scaling.

图 3: 每种技术在两个任务（Cart Pole (CP) 和 Mountain Car (MC)）上，针对不同 $\beta$ 参数的 $\hat{H}$ 的性能表现。注意 y 轴比例的变化。

Taken together, these experiments validate Knox and Stone’s conclusions and yield new, firmer conclusions about the relative effectiveness of each technique, endorsing action biasing and control sharing over the two other previously successful techniques. And more gene

[论文翻译]通过人类反馈增强强化学习

原文地址：https://www.cs.utexas.edu/~ai-lab/pubs/ICML_IL11-knox.pdf

Augmenting Reinforcement Learning with Human Feedback

通过人类反馈增强强化学习

W. Bradley Knox

Peter Stone

Peter Stone

Abstract

摘要

1. Introduction

1. 引言

2. Preliminaries

2. 预备知识

2.1. Reinforcement Learning

2.1. 强化学习

2.2. The TAMER Framework for Interactive Shaping

2.2. 用于交互式塑造的 TAMER 框架

3. Sequential TAMER $\mathbf$

3. 顺序 TAMER $\mathbf$

3.1. Combination techniques

3.1. 组合技术

3.2. Sequential learning experiments

3.2. 序列学习实验

3.3. Sequential learning results and discussion

3.3. 顺序学习结果与讨论