Augmenting Reinforcement Learning with Human Feedback
W. Bradley Knox
BRADKNOX@CS.UTEXAS.EDU
University of Texas at Austin, Department of Computer Science
Peter Stone
University of Texas at Austin, Department of Computer Science
Abstract
As computational agents are increasingly used beyond research labs, their success will depend on their ability to learn new skills and adapt to their dynamic, complex environments. If human users, even those without programming skills, can transfer their task knowledge to agents, learning can accelerate dramatically, reducing costly trials. The TAMER framework guides the design of agents whose behavior can be shaped through signals of approval and disapproval, a natural form of human feedback. More recently, TAMER+RL was introduced to enable human feedback to augment a traditional reinforcement learning (RL) agent that learns from a Markov decision process's (MDP) reward signal. Using a reimplementation of TAMER and TAMER+RL, we address limitations of prior work, contributing in two critical directions. First, the four successful techniques for combining human reinforcement with RL from prior TAMER+RL work are tested on a second task, and these techniques' sensitivities to parameter changes are analyzed. Together, these examinations yield more general and prescriptive conclusions to guide others who wish to incorporate human knowledge into an RL algorithm. Second, TAMER+RL has thus far been limited to a sequential setting, in which training occurs before learning from MDP reward. We modify the sequential algorithms to learn simultaneously from both sources, enabling the human feedback to come at any time during the reinforcement learning process. To enable simultaneous learning, we introduce a new technique that appropriately determines the magnitude of the human model's influence on the RL algorithm throughout time and state-action space.

1. Introduction
Computational agents may soon be prevalent in society, and many of their end users will want these agents to learn to perform new tasks. For many of these tasks, the human user will already have significant task knowledge. Consequently, we seek to enable non-technical users to transfer their knowledge to the agent, reducing the cost of learning without hurting the agent’s final, asymptotic performance.
In this vein, the TAMER framework guides the design of agents that learn by shaping, that is, using signals of approval and disapproval to teach an agent a desired behavior (Knox and Stone, 2009). As originally formulated, TAMER was limited to learning exclusively from the human feedback. More recently, TAMER+RL was introduced with the goal of enabling the human feedback to augment a traditional reinforcement learning (RL) agent that learns from an MDP reward signal (Knox and Stone, 2010). However, TAMER+RL has previously only been tested on a single domain, and it has been limited to the case where the learning from human feedback happens only prior to RL: sequential TAMER+RL. Using a reimplementation of TAMER and TAMER+RL, we address these limitations by improving upon prior work in two crucial directions.
First, in Section 3, we continue with the sequential TAMER+RL approach, testing the four TAMER+RL techniques that were previously found to be successful. We test on two tasks: one identical to the single prior TAMER+RL task and one new task. We also provide a novel examination of each technique's performance at a range of combination parameter values to determine the ease of setting each parameter effectively, a critical aspect of using TAMER+RL algorithms in practice that has previously been sidestepped. Together, these analyses yield stronger, more prescriptive conclusions than were possible from prior work. Two similar combination techniques, for the first time, clearly stand out as the most effective, and we consistently observe that manipulating action selection is more effective than altering the RL update.
Second, in Section 4 we move from the sequential setting of first learning only from the human and then learning from MDP reward to learning from both simultaneously. The principal benefit of simultaneous learning is its flexibility; it gives a trainer the important ability to step in as desired to alter the course of reinforcement learning while it is in progress. We demonstrate the success of the two best-performing techniques from the sequential experiments, action biasing and control sharing, in this simultaneous setting. To meet demands introduced by the simultaneous setting, we use a novel method to moderate the influence of the model of human reinforcement on the RL algorithm. Our method increases influence in areas of the state-action space that have recently received training and slowly decreases influence in the absence of training, leaving the original MDP reward and base RL agent to learn autonomously in the limit. Without this improvement, the sequential techniques would be too brittle for simultaneous learning.
2. Preliminaries
In this section, we briefly introduce reinforcement learning and the TAMER Framework.
2.1. Reinforcement Learning
We assume that the task environment is a Markov decision process (MDP) specified by the tuple $(S, A, T, \gamma, D, R)$. $S$ and $A$ are respectively the sets of possible states and actions. $T$ is a transition function, $T: S \times A \times S \rightarrow \mathbb{R}$, which gives the probability, given a state $s_t$ and an action $a_t$, of transitioning to state $s_{t+1}$. $\gamma$, the discount factor, exponentially decreases the value of a future reward. $D$ is the distribution of start states. $R$ is a reward function, $R: S \times A \times S \rightarrow \mathbb{R}$, where the reward is a function of $s_t$, $a_t$, and $s_{t+1}$. We will also consider reward that is a function of only $s_t$ and $a_t$.
Reinforcement learning algorithms (see Sutton and Barto (1998)) seek to learn policies $\pi: S \rightarrow A$ for an MDP that maximize return from each state-action pair, where $return = \sum_{t=0}^{T} E[\gamma^{t} R(s_t, a_t, s_{t+1})]$. In this paper, we focus on using a value-function-based RL method, namely $\mathrm{SARSA}(\lambda)$ (Sutton and Barto, 1998), augmented by the TAMER-based learning that can be done directly from a human's reinforcement signal. Though more sophisticated RL methods exist, we use $\mathrm{SARSA}(\lambda)$ for its popularity and representativeness, and because we are not concerned with finding the best overall algorithm for our experimental tasks but rather with determining how various methods for incorporating a human model change the base RL algorithm's performance.
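To make the base algorithm concrete, the following is a minimal sketch of one $\mathrm{SARSA}(\lambda)$ episode with linear function approximation over state-action features, matching the general setting used here (RBF features, $\epsilon$-greedy action selection). The environment interface, `phi`, and all parameter values are illustrative assumptions rather than details of our implementation.

```python
import numpy as np

def sarsa_lambda_episode(env, phi, weights, actions,
                         alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """Run one episode of SARSA(lambda) with linear function approximation:
    Q(s, a) = weights . phi(s, a). `env` is assumed to expose
    reset() -> s and step(a) -> (s_next, reward, done)."""
    def q(s, a):
        return weights.dot(phi(s, a))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    e = np.zeros_like(weights)           # eligibility traces over weights
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, reward, done = env.step(a)
        a_next = None if done else epsilon_greedy(s_next)
        target = reward if done else reward + gamma * q(s_next, a_next)
        delta = target - q(s, a)         # temporal-difference error
        e = gamma * lam * e + phi(s, a)  # accumulating traces
        weights = weights + alpha * delta * e
        s, a = s_next, a_next
    return weights
```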
2.2. The TAMER Framework for Interactive Shaping
The TAMER Framework, introduced by Knox and Stone (2009), is an approach to the problem of how an agent should learn from numerically mapped reinforcement signals. Specifically, these feedback signals are delivered by an observing human trainer as the agent attempts to perform a task.1 TAMER is motivated by two insights about human reinforcement. First, human reinforcement is only trivially delayed, slowed just by the time it takes the trainer to assess behavior and deliver feedback. Second, the trainer observes the agent's behavior with a model of that behavior's long-term effects, so the reinforcement is assumed to be fully informative about the quality of recent behavior. Human reinforcement is more similar to an action value (sometimes called a Q-value), albeit a noisy and trivially delayed one, than to MDP reward. Consequently, TAMER assumes human reinforcement to be fully informative about the quality of an action given the current state, and it models a hypothetical human reinforcement function, $H: S \times A \rightarrow \mathbb{R}$, as $\hat{H}$ in real time by regression. In the simplest form of credit assignment, each reinforcement creates a label for the last state-action pair.2 The output of the resultant $\hat{H}$ function, which changes as the agent gains experience, determines the relative quality of potential actions, so that the exploitative action is $a = \mathrm{argmax}_{a}[\hat{H}(s,a)]$.
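As a concrete illustration, here is a minimal sketch of a TAMER learner under the simplest credit assignment described above, assuming a linear model of $\hat{H}$ over state-action features; the class structure and names (`phi`, `lr`) are ours and purely illustrative.

```python
import numpy as np

class Tamer:
    """Minimal TAMER sketch: H_hat(s, a) = w . phi(s, a), updated by SGD
    whenever the trainer delivers a reinforcement value h."""

    def __init__(self, phi, n_features, actions, lr=0.01):
        self.phi = phi
        self.w = np.zeros(n_features)
        self.actions = actions
        self.lr = lr

    def h_hat(self, s, a):
        return self.w.dot(self.phi(s, a))

    def update(self, s, a, h):
        # Simplest credit assignment: h labels the last state-action pair.
        error = h - self.h_hat(s, a)
        self.w += self.lr * error * self.phi(s, a)

    def act(self, s):
        # Exploitative action: argmax_a H_hat(s, a)
        return max(self.actions, key=lambda a: self.h_hat(s, a))
```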
3. Sequential TAMER+RL
Noting that TAMER agents typically learn faster than agents learning from MDP reward but plateau at a lower performance level, Knox and Stone (2010) combined TAMER and $\mathrm{SARSA}(\lambda)$. Their aim was to complement TAMER's fast learning with RL's ability to often learn better policies in the long run. These conjoined TAMER+RL algorithms address a scenario in which a human trains an agent, leaving a model $\hat{H}$ of reinforcement, and then $\hat{H}$ is used to influence the base RL algorithm in some way. We call this scenario and the algorithms that address it sequential TAMER+RL. For all TAMER+RL approaches, only MDP reward is considered to specify optimal behavior; $\hat{H}$ provides guidance but not an objective. In this section, we reproduce and then extend prior investigations of sequential TAMER+RL, yielding more prescriptive and general conclusions than prior work allowed.
3.1. Combination techniques
Knox and Stone tested eight TAMER+RL techniques that each use $\hat{H}$ to affect the RL algorithm in a different way. Four were largely effective when compared to the $\mathrm{SARSA}(\lambda)$-only and TAMER-only agents on both mean reward over a run and performance at the end of the run. We focus on those four techniques, which can be used with any RL algorithm that employs an action-value function. Below, we list them with names we have created. In our notation, a prime (e.g., $Q'$) after a function means the function replaces its non-prime counterpart in the base RL algorithm.
• Reward shaping: $R'(s,a) = R(s,a) + (\beta * \hat{H}(s,a))$, replacing MDP reward in the update

• Q augmentation: $Q'(s,a) = Q(s,a) + (\beta * \hat{H}(s,a))$ during both action selection and the Q-function's update

• Action biasing: $Q'(s,a) = Q(s,a) + (\beta * \hat{H}(s,a))$ only during action selection

• Control sharing: $P(a = \mathrm{argmax}_{a}[\hat{H}(s,a)]) = \min(\beta, 1)$; otherwise use the base RL agent's action selection mechanism
These four techniques are numbered 1, 4, 6, and 7 in Knox and Stone (2010). We altered action biasing to generalize it, but the $\epsilon$-greedy policies we use in our experiments are not affected. In the descriptions above, $\beta$ is a predefined combination parameter. In our sequential TAMER+RL experiments, $\beta$ is annealed by a predefined factor after each episode for all techniques other than Q augmentation.
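The four combination rules are small enough to state directly in code. The sketch below is illustrative only: `q`, `h_hat`, and `rl_policy` are assumed callables, and the handling of $\beta$ (a predefined parameter, annealed per episode in our sequential experiments) is left to the caller.

```python
import numpy as np

def reward_shaping(r, h_hat_sa, beta):
    """R'(s,a) = R(s,a) + beta * H_hat(s,a): the agent learns from the
    shaped reward in place of the original MDP reward."""
    return r + beta * h_hat_sa

def q_augmentation(q_sa, h_hat_sa, beta):
    """Q'(s,a) = Q(s,a) + beta * H_hat(s,a), used during action selection
    AND within the Q-function's TD update."""
    return q_sa + beta * h_hat_sa

def action_biasing(s, actions, q, h_hat, beta):
    """Greedy step over Q(s,a) + beta * H_hat(s,a); the bias applies only
    during action selection, so Q itself is learned unaltered."""
    return max(actions, key=lambda a: q(s, a) + beta * h_hat(s, a))

def control_sharing(s, actions, rl_policy, h_hat, beta):
    """With probability min(beta, 1), take argmax_a H_hat(s,a);
    otherwise defer to the base RL agent's action selection."""
    if np.random.rand() < min(beta, 1.0):
        return max(actions, key=lambda a: h_hat(s, a))
    return rl_policy(s)
```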
We now briefly discuss these techniques and situate them within related work. In the RL literature, reward shaping adds the output of a shaping function to the original MDP reward, creating a new reward to learn from instead (Dorigo and Colombetti, 1994; Mataric, 1994). As we confirm in the coming paragraph on Q augmentation, our reward shaping technique is not the only way to do reward shaping, but it is the most direct use of $\hat{H}$ for reward shaping.
If $\hat{H}$ is considered a heuristic function, action biasing is the same action selection method used in Bianchi et al.'s Heuristically Accelerated Q-Learning (HAQL) algorithm (Bianchi et al., 2004). Control sharing is equivalent to Fernández and Veloso's $\pi$-reuse exploration strategy (2006). Note that both control sharing and action biasing only affect action selection and can be interpreted as directly guiding exploration toward human-favored state-action pairs.
Q augmentation is action biasing with additional use of $\hat{H}$ during the Q-function's update. Wiewiora et al.'s related look-ahead advice (2003) uses a discounted change in the output of a state-action potential function, $\gamma\phi(s_{t+1},a_{t+1}) - \phi(s_t,a_t)$, for reward shaping and to augment action values during action selection. Interestingly, look-ahead advice is equivalent to Q augmentation when $\hat{H}$ is used for $\phi$, the state and action spaces are finite, and the policy is invariant to adding a constant to all action values in the current state (e.g., $\epsilon$-greedy and soft-max).
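To see this near-equivalence concretely, consider the following reconstruction (ours), ignoring eligibility traces for clarity. Substituting $Q'(s,a) = Q(s,a) + \beta\hat{H}(s,a)$ into the SARSA temporal-difference error gives

$\delta = r_t + \gamma Q'(s_{t+1},a_{t+1}) - Q'(s_t,a_t) = \big[r_t + \beta\big(\gamma\hat{H}(s_{t+1},a_{t+1}) - \hat{H}(s_t,a_t)\big)\big] + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t),$

an ordinary SARSA update on the shaped reward $r_t + \gamma\phi(s_{t+1},a_{t+1}) - \phi(s_t,a_t)$ with $\phi = \beta\hat{H}$, which is exactly the look-ahead advice shaping term; the remaining use of $Q'$ during action selection matches look-ahead advice's augmentation of action values.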
3.2. Sequential learning experiments
We now describe our sequential TAMER+RL experiments. We first validate our reimplementation of TAMER and TAMER+RL by reproducing Knox and Stone's results on the single task they tested. We then evaluate the algorithms' effectiveness on a different task. Additionally, we analyze our results at a range of combination parameter values ($\beta$ values) to identify challenges to setting $\beta$'s value without prior testing.
Following past work on TAMER and TAMER+RL, we implemented the corresponding algorithms as exactly as we could, excepting some changes to the credit assignment technique in Knox and Stone (2009).4 Using the original $\hat{H}$ representation (a linear model over RBF features), task settings, $\mathrm{SARSA}(\lambda)$ parameters, and training records from Knox and Stone (2010),5 we repeat their experiments on the Mountain Car task,6 using all four combination techniques found to be successful in their experiments and a range of $\beta$ combination parameters. We then test these TAMER+RL techniques on a second task, Cart Pole, using an $\hat{H}$ model trained by an author. We again use $\mathrm{SARSA}(\lambda)$, choosing parameters that perform well but sacrifice some performance for episode-to-episode stability and for the ability to evaluate policies that might otherwise balance the pole too long to finish a run. In Mountain Car, the goal is to quickly move the car up a hill to the goal. The agent receives -1 reward for all transitions to non-absorbing states. In Cart Pole, the goal is to move a cart so that an attached, upright pole maintains balance as long as possible. The agent receives +1 reward for all transitions that keep the pole within a specified range of vertical. The $\hat{H}$ for Cart Pole was learned by k-Nearest Neighbors. For both tasks, we use Gaussian RBF features for $\mathrm{SARSA}(\lambda)$ and initialize $Q$ pessimistically, as was found effective in Knox and Stone (2010). In these and later experiments, $\hat{H}$ outputs are typically in the range [-2, 2].
We evaluate each combination technique on four criteria; full success requires outperforming both the corresponding $\hat{H}$'s TAMER-only policy and $\mathrm{SARSA}(\lambda)$-only, in both end-run performance and cumulative reward (or, equivalently, mean reward across full runs).
4For space considerations, we will not fully describe these changes. Briefly, for each reinforcement signal received, Knox and Stone create a learning sample for every time step within a window of recent experience, resulting in many samples per reinforcement in fast domains. We instead create one sample per time step, using all crediting reinforcements to create one label.
5The models we create from the original training trajectories, $\hat{H}_1$ and $\hat{H}_2$, perform a bit better than those from Knox and Stone's experiments, which points to small implementation differences.
6Tasks are adapted from RL-Library (Tanner and White, 2009).

Figure 1. Comparison of TAMER+RL techniques with $\mathrm{SARSA}(\lambda)$ and the TAMER-only policy on Mountain Car over 40 or more runs of 500 episodes. $\hat{H}_1$ and $\hat{H}_2$ are models from two different human trainers. The top chart considers reward over the entire run, and the bottom chart evaluates reward over the final 10 episodes. Error bars show standard error.


Figure 2. The same TAMER+RL comparisons as in Figure 1, but on Cart Pole over runs of 150 episodes. A single $\hat{H}$ was used. End-run performance is the mean reward during the last 5 episodes.
3.3. Sequential learning results and discussion
Figures 1 and 2 show the results of our experiments for sequential TAMER+RL. For now, we only show results for the $\beta$ combination parameters that accrue the highest cumulative reward for their corresponding technique. Figure 2 additionally shows learning curves for the first 30 episodes of the Cart Pole run. (Our early-run results for Mountain Car are similar to those shown by Knox and Stone (2010).)
Qualitatively, our Mountain Car results agree with previous work. Action biasing and control sharing succeed on all four criteria and significantly outperform the other techniques in cumulative reward. Reward shaping and Q augmentation also improve over $\mathrm{SARSA}(\lambda)$-only by both metrics and over the TAMER-only policies in end-run reward.
On Cart Pole, action biasing and control sharing again succeed fully. This time, Q augmentation also meets the criteria for success, though it performs significantly worse than action biasing and control sharing. Most interestingly, reward shaping, at its best tested parameter, does not significantly alter $\mathrm{SARSA}(\lambda)$'s performance on either metric.
By choosing the best $\beta$ parameter value for each technique, prior TAMER+RL experiments sidestep the issue of using an effective value without first testing a range of values. With experiments in two tasks, we can begin to address this problem by examining each technique's sensitivity to $\beta$ parameter changes and whether certain ranges of $\beta$ are effective across different tasks. In Figure 3, we show the mean performance of each combination technique as $\beta$ varies. Examining the charts, we consider three criteria: each technique's performance at its best tested $\beta$, its sensitivity to changes in $\beta$, and whether a single range of $\beta$ values is effective across both tasks.
Evaluating the techniques on these three criteria creates a consistent story that fits with our analysis of the techniques at their best $\beta$ parameter values (in Figures 1 and 2). The two methods that only affect action selection, action biasing and control sharing, emerge as the most effective techniques, without a clear leader between them; they are followed by Q augmentation and then reward shaping.
From an RL perspective, the weakness of reward shaping may be counterintuitive. When researchers discuss combining human reinforcement with RL in the literature, reward shaping is predominantly suggested (Thomaz and Breazeal, 2006; Isbell et al., 2006), possibly because human "reward" is seen as an analog to MDP reward that should be used similarly. However, though reward shaping is generally cast as a guide for exploration, it only affects exploration indirectly, by precariously tampering with the reward signal. Action biasing and control sharing affect exploration directly, without manipulating reward. Thus, they achieve the stated goal of reward shaping while leaving the agent to learn accurate values from its experience. Following this line of thought, Q augmentation is identical to action biasing during action selection, boosting each action's Q-value by the weighted prediction of human reinforcement. In addition to this direct guidance of exploration, Q augmentation also changes the Q-value during the $\mathrm{SARSA}(\lambda)$ update's calculation of temporal-difference error. As discussed in Section 3.1, Q augmentation is nearly equivalent to a form of reward shaping called look-ahead advice (Wiewiora et al., 2003). In short, we observe that the more a technique directly affects action selection, the better it does, and the more it affects the update to the Q-function for each transition experience, the worse it does. Q augmentation does both and performs between the techniques that do only one.

Figure 3. Performance of each technique with each tested $\hat{H}$ over a range of $\beta$ parameters on two tasks: Cart Pole (CP) and Mountain Car (MC). Note changes in y-axis scaling.

Taken together, these experiments validate Knox and Stone's conclusions and yield new, firmer conclusions about the relative effectiveness of each technique, endorsing action biasing and control sharing over the two other previously successful techniques. More generally, these results endorse manipulating action selection and leaving the action-value function's update unmolested.
4. Simultaneous TAMER+RL
To this point, similarly to all prior work on TAMER, we have assumed that the human training was finished prior to any reinforcement learning. This "sequential" learning is sometimes appropriate; for instance, when a difficult-to-simulate reward function is tied to potentially costly learning trials and the agent can train in simulation without significant cost. However, in other scenarios this assumption can be limiting. In this section, we investigate how to modify sequential TAMER+RL algorithms to allow a trainer to step in as desired to alter the course of reinforcement learning while it is in progress. We call this scenario and the algorithms that address it "simultaneous" TAMER+RL. Specifically, the agent should learn simultaneously from two feedback modalities, human reinforcement and MDP reward, as one fully integrated system. As in the sequential TAMER+RL approaches, we examine techniques that use only $\hat{H}$ from TAMER in the RL algorithm, otherwise leaving the two algorithms as separate modules.
Since TAMER empirically compares most favorably against RL algorithms in early learning (Knox and Stone, 2009), we expect the greatest gains to come from training near the beginning of learning. However, training at any suboptimal point along the learning curve should benefit the agent, and we hope to do little harm if the agent is already performing optimally and the trainer’s feedback cannot help.
Some desirable characteristics for simultaneous learning are:
• the agent's behavior is consistent enough that the trainer can evaluate it and deliver clear feedback;
• the trainer's reinforcement has a strong and prompt effect on the agent's behavior; and
• in the absence of training, the influence of past reinforcement fades, leaving the agent to learn autonomously from MDP reward in the limit.
Simultaneous learning, with its inclusion of RL-based action selection during training, presents new challenges for maintaining behavioral consistency. For instance, control sharing abruptly shifts between two policies, which can create erratic behavior with many different actions (both good and bad) in a small time period, increasing the difficulty of giving clear feedback. Also note that the second and third characteristics above are in opposition: fully responding to the trainer's reinforcement requires abandoning the policy learned from MDP reward. Our module for determining human influence, described in the following section, strikes a balance by ramping up the influence of $\hat{H}$ with increased reinforcement while keeping the RL policy early on.
4.1. Determining the immediate influence of $\hat{H}$
Simultaneous learning allows human trainers to insert themselves at any point of the learning process. Consequently, $\hat{H}$'s influence should increase in areas of the state-action space with recent reinforcement (but not in areas that have not been targeted with feedback) and decrease in the absence of reinforcement, leaving the set of optimal policies unchanged in the limit. Thus, we must do more than anneal a combination parameter, as is done in sequential learning.
We determine $\hat{H}$'s influence through a novel adaptation of the eligibility traces often used in reinforcement learning (Sutton and Barto, 1998). We will refer to it as the eligibility module. The general idea of this eligibility module is that we maintain an eligibility trace for each state-action feature,7 normalized between 0 and 1, that represents the recency of training while that feature was active (i.e., nonzero). Then, the eligibility traces and a time step's feature vector together calculate a measure of the recency of training for similar feature vectors. That measure, multiplied by a constant scaling parameter $c_s$, is used as the $\beta$ term introduced in Section 3.1. The implementation follows.
Let $\vec{e}$ be the vector of traces and $\vec{f}_n$ be the feature vector normalized such that each element of $\vec{f}_n$ lies within the range $[0,1]$. The eligibility module is designed to make $\beta$ a function of $\vec{e}$, $\vec{f}_n$, and $c_s$ with range $[0, c_s]$. A guiding design constraint is that when $\vec{e} = \vec{1}$ (i.e., each element of $\vec{e}$ is the maximum allowed), the normalized dot product of $\vec{e}$ and any $\vec{f}_n$, denoted $n(\vec{e} \cdot \vec{f}_n)$, should equal 1 (since it weights the influence of $\hat{H}$). To achieve this, we define $n(\vec{e} \cdot \vec{f}_n) = \vec{e} \cdot (\vec{f}_n / \|\vec{f}_n\|_1) = (\vec{e} \cdot \vec{f}_n) / \|\vec{f}_n\|_1 = \beta / c_s$. Thus, at any time step with normalized features $\vec{f}_n$, the influence of $\hat{H}$ is calculated as $\beta = c_s (\vec{e} \cdot \vec{f}_n) / \|\vec{f}_n\|_1$. This formula has a desirable mathematical characteristic: for a given $\vec{e}$, $\beta$ is higher when relatively large feature values correspond to large trace values, indicating that the current state-action pair is similar to recently trained state-action pairs, and $\beta$ is smaller when large feature values correspond to small trace values.
Using accumulating traces capped at 1, the trace is updated with $\vec{f}_n$ during training: $e_i := \min(1, e_i + (f_{n,i} \cdot a))$, where $e_i$ and $f_{n,i}$ are the $i$th elements of $\vec{e}$ and $\vec{f}_n$, respectively, and $a$ is a constant factor that moderates the speed of accumulation. During time steps without training, $\vec{e} := decayFactor \cdot \vec{e}$.
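A minimal sketch of the eligibility module under the notation above may be helpful; the class structure, the parameter defaults, and the action-biasing hook in the final comment are our illustrative assumptions (in particular, no value for decayFactor is specified above).

```python
import numpy as np

class EligibilityModule:
    """Per-feature traces e in [0, 1] record how recently training occurred
    while each state-action feature was active; they are mapped to the
    combination parameter beta in [0, c_s]."""

    def __init__(self, n_features, c_s, accum=0.2, decay=0.999):
        self.e = np.zeros(n_features)  # one trace per state-action feature
        self.c_s = c_s                 # scaling parameter c_s
        self.accum = accum             # accumulation factor a
        self.decay = decay             # decayFactor (illustrative value)

    def beta(self, f_n):
        """beta = c_s * (e . f_n) / ||f_n||_1 for normalized features f_n."""
        norm = np.abs(f_n).sum()
        if norm == 0.0:
            return 0.0
        return self.c_s * self.e.dot(f_n) / norm

    def step(self, f_n, trained):
        """Accumulate traces (capped at 1) on time steps with training;
        otherwise decay toward 0 so RL regains autonomy in the limit."""
        if trained:
            self.e = np.minimum(1.0, self.e + f_n * self.accum)
        else:
            self.e = self.decay * self.e

# Illustrative hook into action biasing, with a per-step beta:
#   a = argmax_a [ Q(s, a) + module.beta(f_n(s, a)) * h_hat(s, a) ]
```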
4.2. Simultaneous learning experiments
Our experiments test the effectiveness of simultaneous TAMER+RL when training starts either at the beginning of learning or after some learning has occurred. We again use Mountain Car and Cart Pole, and we focus on the two best-performing combination techniques, action biasing and control sharing. For the eligibility module, the scaling parameter $c_s$ for Mountain Car and Cart Pole is respectively 100 and 200 for action biasing and 2 and 1 for control sharing. These values were chosen to be on the upper end of each method's effective $\beta$ values in Figure 3. The accumulation factor $a$ for eligibility is 0.2. Training in Mountain Car occurs either for 16 episodes, starting at episode 1, or for 12 episodes after 20 episodes of $\mathrm{SARSA}(\lambda)$-only learning. In Cart Pole, training at the start occurs for 12 episodes, and training after 25 episodes of $\mathrm{SARSA}(\lambda)$-only learning lasts 8 episodes. The start times are chosen to represent the beginning of learning and also a point at which the $\mathrm{SARSA}(\lambda)$ agent has learned a policy that is much improved but still quite flawed.8 The number of episodes corresponds to an informal assessment of how many episodes are needed to satisfactorily train the agent; training at later start times progresses more quickly. The trainer has a button that starts and stops training during the designated training episodes, letting the human observe without the agent updating $\hat{H}$ or the eligibility module.

Figure 4. Simultaneous TAMER+RL results. Mean reward is calculated over runs of 500 episodes in Mountain Car and 150 episodes in Cart Pole. Standard error is shown.
An added experimental challenge is that the training is inextricably bound to one specific run, whereas sequential experiments can reuse the same training session for any number of parameters and combination techniques, limiting the depth of analysis that can be done for a set number of trainer-hours. Mountain Car and Cart Pole training sessions typically took around 8 minutes and 15 minutes each, respectively. Consequently, each experimental condition was limited to 3 runs of training for a total of 12 runs on each task.
4.3. Simultaneous learning results and discussion
The results of our simultaneous TAMER+RL experiments are shown in Figure 4. Though the sample size is too small to show statistical significance, there is a clear pattern of both action biasing and control sharing outperforming $\mathrm{SARSA}(\lambda)$. The condition that is closest to $\mathrm{SARSA}(\lambda)$ in terms of standard error, control sharing on Cart Pole where training begins after 25 episodes, still receives almost twice the reward of $\mathrm{SARSA}(\lambda)$. We also observe that training at the beginning of learning is more effective than training after some autonomous learning, as we expected. Seeing this, one might ask whether the $n$ episodes of RL-only learning before training help or whether the prior learning should be abandoned to start from scratch. We can test this: starting from scratch after $n$ episodes is the same as simply training from the start and stopping $n$ episodes early. So if we ignore the first $n$ episodes of the later-training group and the last $n$ episodes of the training-at-start group, comparing the groups' mean reward addresses this question. Of four such comparisons (2 techniques × 2 tasks), the later-training group outperforms three times and is roughly equal once, suggesting that the prior learning does indeed help. These results serve as a proof of concept for the effectiveness of simultaneous TAMER+RL with our eligibility module.
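The truncated comparison described above is straightforward to compute from per-episode reward logs; the sketch below assumes each group's rewards are stored as a (runs x episodes) array and that n > 0.

```python
import numpy as np

def compare_truncated(later_training, train_at_start, n):
    """Drop the first n episodes of the later-training runs and the last n
    episodes of the train-at-start runs, so both groups cover the same
    amount of post-training experience, then compare mean reward."""
    later = np.asarray(later_training)[:, n:]
    at_start = np.asarray(train_at_start)[:, :-n]
    return later.mean(), at_start.mean()
```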
5. Related Work
In this section, we situate our work within prior research on naturally transferring knowledge to a reinforcement learning agent. We focus on work not already mentioned in Section 3.1 or in the previous papers on TAMER (Knox and Stone, 2009; 2010).
In the only other example of an agent learning simultaneously from human reinforcement and MDP reward, Thomaz and Breazeal (2006) interfaced a human trainer with a table-based Q-learning agent in a virtual kitchen environment. Their agent seeks to maximize its discounted total reward, which for any time step is the sum of human reinforcement and environmental reward. Their approach is a form of reward shaping, differing from ours in that Thomaz and Breazeal directly apply the human reinforcement value to the current reward (instead of modeling reinforcement and using the output of the model as supplemental reward).
Judah et al. (2010) consider a learning scenario that alternates between "practice", where actual world experience is gathered, and offline labeling of actions as good or bad by a human critic. Using an elegant probabilistic technique with a few assumptions, the human criticism is input to a loss function that lessens the expected value of candidate policies while also automatically determining the level of influence given to the criticism. From some mixed results and comments from frustrated subjects, they predicted that redesigning their system to be more interactive and to let the human train periodically (characteristics of simultaneous TAMER+RL) would improve performance.
Imitation learning, or programming by demonstration, has also been used to improve reinforcement learning, using preprogrammed policies (Price and Boutilier, 2003) or humans (Taylor et al., 2011; Smart and Kaelbling, 2000) to provide demonstrations for an agent that observes and learns. These methods are similar to control sharing. An advantage of reinforcement over demonstration, though, is that reinforcement permits learning the relative values of actions, allowing techniques like action biasing to gently push the behavior of the RL agent towards the policy endorsed by $\hat{H}$, whereas pure demonstration is all or nothing: either the demonstrator or the learning agent chooses the action. Additionally, trainers can reinforce state-action pairs visited by the agent's policy, whereas demonstrations might never visit areas of the state space that the imitation learning algorithm visits.
6. Conclusion
Prior work on TAMER+RL is limited by having tested on only a single domain and by simply taking the best $\beta$ combination parameter from testing. Further, past TAMER+RL algorithms were designed for sequential learning and were unsuitable for simultaneously learning from the trainer and the MDP reward signal. This paper addresses these limitations, giving a clear endorsement of using $\hat{H}$ to affect action selection and, for the first time, enabling a human trainer to interactively provide feedback at any time during the learning process, a critical improvement towards the practicality and widespread applicability of the TAMER framework.
Acknowledgments
This work has taken place in the Learning Agents Research Group (LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG research is supported in part by grants from the NSF (IIS-0917122), ONR (N00014-09-1-0658), and the Federal Highway Administration (DTFH61-07-H-00030). An NSF Graduate Research Fellowship supports the first author.
References
R.A.C. Bianchi, C.H.C. Ribeiro, and A.H.R. Costa. Heuristically Accelerated Q-Learning: a new approach to speed up reinforcement learning. Advances in AI – SBIA, 2004.
M. Dorigo and M. Colombetti. Robot shaping: Developing situated agents through learning. Artificial Intelligence, 1994.
F. Fernández and M. Veloso. Probabilistic policy reuse in a reinforcement learning agent. AAMAS, 2006.
C.L. Isbell, M. Kearns, S. Singh, C.R. Shelton, P. Stone, and D. Kormann. Cobot in LambdaMOO: An adaptive social statistics agent. AAMAS, 2006.
K. Judah, S. Roy, A. Fern, and T.G. Dietterich. Reinforcement learning via practice and critique advice. AAAI, 2010.
W.B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. K-CAP, 2009.
W.B. Knox and P. Stone. Combining manual feedback with subsequent MDP reward signals for reinforcement learning. AAMAS, 2010.
M.J. Mataric. Reward functions for accelerated learning. ICML, 1994.
B. Price and C. Boutilier. Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research, 2003.
W.D. Smart and L.P. Kaelbling. Practical reinforcement learning in continuous spaces. ICML, 2000.
R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
B. Tanner and A. White. RL-Glue: Language-independent software for reinforcement-learning experiments. JMLR, 2009.
M.E. Taylor, H.B. Suay, and S. Chernova. Integrating reinforcement learning with human demonstrations of varying ability. AAMAS, 2011.
A.L. Thomaz and C. Breazeal. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. AAAI, 2006.
E. Wiewiora, G. Cottrell, and C. Elkan. Principled methods for advising reinforcement learning agents. ICML, 2003.
