[论文翻译]基于人类偏好的深度强化学习


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/RLHF%E8%AE%BA%E6%96%87%E9%9B%86/NIPS-2017-deep-reinforcement-learning-from-human-preferences-Paper.pdf


Deep Reinforcement Learning from Human Preferences

基于人类偏好的深度强化学习

Abstract

摘要

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than $1\%$ of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.

为了让复杂的强化学习 (Reinforcement Learning, RL) 系统能够有效地与现实世界环境交互,我们需要向这些系统传达复杂的目标。在这项工作中,我们探索了基于(非专家)人类对轨迹片段对的偏好来定义目标的方法。我们展示了这种方法可以有效地解决复杂的 RL 任务,而无需访问奖励函数,包括 Atari 游戏和模拟机器人运动,同时仅需对智能体与环境的交互提供不到 $1\%$ 的反馈。这大大降低了人类监督的成本,使其能够实际应用于最先进的 RL 系统。为了展示我们方法的灵活性,我们展示了可以在大约一小时内成功训练出复杂的新行为。这些行为和环境比之前从人类反馈中学习的任何内容都要复杂得多。

1 Introduction

1 引言

Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify. Overcoming this limitation would greatly expand the possible impact of deep RL and could increase the reach of machine learning more broadly.

最近在将强化学习 (Reinforcement Learning, RL) 扩展到大规模问题上取得的成功,主要得益于那些具有明确奖励函数的领域 (Mnih et al., 2015, 2016; Silver et al., 2016)。然而,许多任务涉及的目标复杂、定义不清或难以明确。克服这一限制将极大地扩展深度强化学习的潜在影响,并可能更广泛地提升机器学习的应用范围。

For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It’s not clear how to construct a suitable reward function, which will need to be a function of the robot’s sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences. This difficulty underlies recent concerns about misalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.

例如,假设我们想使用强化学习来训练一个机器人来清洁桌子或炒鸡蛋。目前尚不清楚如何构建一个合适的奖励函数,该函数需要基于机器人的传感器。我们可以尝试设计一个简单的奖励函数,近似捕捉预期的行为,但这通常会导致行为优化了我们的奖励函数,而实际上并未满足我们的偏好。这一困难正是最近关于我们的价值观与强化学习系统目标之间不一致的担忧的基础 (Bostrom, 2014; Russell, 2016; Amodei et al., 2016)。如果我们能够成功地将我们的实际目标传达给我们的AI智能体,这将是解决这些担忧的重要一步。

If we have demonstrations of the desired task, we can use inverse reinforcement learning (Ng and Russell, 2000) or imitation learning to copy the demonstrated behavior. But these approaches are not directly applicable to behaviors that are difficult for humans to demonstrate (such as controlling a robot with many degrees of freedom but non-human morphology).

如果我们有期望任务的演示,可以使用逆强化学习 [Ng 和 Russell, 2000] 或模仿学习来复制演示的行为。但这些方法并不直接适用于人类难以演示的行为(例如控制具有许多自由度但非人类形态的机器人)。

An alternative approach is to allow a human to provide feedback on our system’s current behavior and to use this feedback to define the task. In principle this fits within the paradigm of reinforcement learning, but using human feedback directly as a reward function is prohibitively expensive for RL systems that require hundreds or thousands of hours of experience. In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude.

另一种方法是允许人类对我们系统的当前行为提供反馈,并利用这些反馈来定义任务。原则上,这符合强化学习的范式,但对于需要数百或数千小时经验的强化学习系统来说,直接使用人类反馈作为奖励函数的成本过高。为了实际利用人类反馈训练深度强化学习系统,我们需要将所需的反馈量减少几个数量级。

We overcome this difficulty by asking humans to compare possible trajectories of the agent, using that data to learn a reward function, and optimizing the learned reward function with RL.

我们通过让人类比较智能体可能的轨迹,利用这些数据学习奖励函数,并通过强化学习优化学习到的奖励函数来克服这一困难。

This basic approach has been explored in the past, but we confront the challenges involved in scaling it up to modern deep RL and demonstrate by far the most complex behaviors yet learned from human feedback.

过去已经探索过这种基本方法,但我们面临将其扩展到现代深度强化学习(RL)中的挑战,并展示了迄今为止从人类反馈中学到的最复杂行为。

Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). We show that a small amount of feedback from a non-expert human, ranging from fifteen minutes to five hours, suffices to learn both standard RL tasks and novel hard-to-specify behaviors such as performing a backflip or driving with the flow of traffic.

我们的实验在两个领域进行:Arcade Learning Environment (Bellemare et al., 2013) 中的 Atari 游戏,以及物理模拟器 MuJoCo (Todorov et al., 2012) 中的机器人任务。我们展示了来自非专家人类的少量反馈(从十五分钟到五小时不等)足以学习标准的强化学习任务以及难以指定的新行为,例如后空翻或随车流驾驶。

1.1 Related Work

1.1 相关工作

A long line of work studies reinforcement learning from human ratings or rankings, including Akrour et al. (2011), Pilarski et al. (2011), Akrour et al. (2012), Wilson et al. (2012), Sugiyama et al. (2012), Wirth and Fürnkranz (2013), Daniel et al. (2015), El Asri et al. (2016), Wang et al. (2016), and Wirth et al. (2016). Other lines of research consider the general problem of reinforcement learning from preferences rather than absolute reward values (Fürnkranz et al., 2012; Akrour et al., 2014; Wirth et al., 2016), and optimizing using human preferences in settings other than reinforcement learning (Machwe and Parmee, 2006; Secretan et al., 2008; Brochu et al., 2010; Sørensen et al., 2016).

一系列研究工作探讨了基于人类评分或排序的强化学习,包括 Akrour 等人 (2011)、Pilarski 等人 (2011)、Akrour 等人 (2012)、Wilson 等人 (2012)、Sugiyama 等人 (2012)、Wirth 和 Fürnkranz (2013)、Daniel 等人 (2015)、El Asri 等人 (2016)、Wang 等人 (2016) 以及 Wirth 等人 (2016)。其他研究方向则考虑了基于偏好而非绝对奖励值的强化学习一般问题 (Fürnkranz 等人, 2012; Akrour 等人, 2014; Wirth 等人, 2016),以及在非强化学习环境中使用人类偏好进行优化的研究 (Machwe 和 Parmee, 2006; Secretan 等人, 2008; Brochu 等人, 2010; Sørensen 等人, 2016)。

Our algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014), but considers much more complex domains and behaviors. The complexity of our environments forces us to use different RL algorithms, reward models, and training strategies. One notable difference is that Akrour et al. (2012) and Akrour et al. (2014) elicit preferences over whole trajectories rather than short clips, and so would require about an order of magnitude more human time per data point. Our approach to feedback elicitation closely follows Wilson et al. (2012). However, Wilson et al. (2012) assumes that the reward function is the distance to some unknown (linear) “target” policy, and is never tested with real human feedback.

我们的算法遵循与 Akrour 等人 (2012) 和 Akrour 等人 (2014) 相同的基本方法,但考虑了更复杂的领域和行为。环境的复杂性迫使我们使用不同的强化学习 (RL) 算法、奖励模型和训练策略。一个显著的差异是,Akrour 等人 (2012) 和 Akrour 等人 (2014) 获取的是对整个轨迹而非短片段的偏好,因此每个数据点需要大约多一个数量级的人类时间。我们的反馈获取方法紧密遵循 Wilson 等人 (2012)。然而,Wilson 等人 (2012) 假设奖励函数是到某个未知(线性)“目标”策略的距离,并且从未使用真实的人类反馈进行测试。

TAMER (Knox, 2012; Knox and Stone, 2013) also learns a reward function from human feedback, but learns from ratings rather than comparisons, has the human observe the agent as it behaves, and has been applied to settings where the desired policy can be learned orders of magnitude more quickly.

TAMER (Knox, 2012; Knox and Stone, 2013) 也从人类反馈中学习奖励函数,但它从评分而非比较中学习,让人类观察智能体的行为,并且已经应用于可以更快学习期望策略的场景。

Compared to all prior work, our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors. This fits into a recent trend of scaling reward learning methods to large deep learning systems, for example inverse RL (Finn et al., 2016), imitation learning (Ho and Ermon, 2016; Stadie et al., 2017), semi-supervised skill generalization (Finn et al., 2017), and bootstrapping RL from demonstrations (Silver et al., 2016; Hester et al., 2017).

与之前的所有工作相比,我们的关键贡献是将人类反馈扩展到深度强化学习中,并学习更复杂的行为。这符合最近将奖励学习方法扩展到大型深度学习系统的趋势,例如逆向强化学习 (Finn et al., 2016)、模仿学习 (Ho and Ermon, 2016; Stadie et al., 2017)、半监督技能泛化 (Finn et al., 2017) 以及从演示中进行引导强化学习 (Silver et al., 2016; Hester et al., 2017)。

2 Preliminaries and Method

2 预备知识与方法

2.1 Setting and Goal

2.1 设置与目标

We consider an agent interacting with an environment over a sequence of steps; at each time $t$ the agent receives an observation $o_{t}\in\mathcal{O}$ from the environment and then sends an action $a_{t}\in\mathcal A$ to the environment.

我们考虑一个智能体在一系列步骤中与环境进行交互;在每个时间 $t$,智能体从环境中接收到一个观测值 $o_{t}\in\mathcal{O}$,然后向环境发送一个动作 $a_{t}\in\mathcal A$。

In traditional reinforcement learning, the environment would also supply a reward $r_{t}\in\mathbb{R}$ and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. A trajectory segment is a sequence of observations and actions, $\sigma=((o_{0},a_{0}),(o_{1},a_{1}),\ldots,(o_{k-1},a_{k-1}))\in(\mathcal{O}\times\mathcal{A})^{k}$. Write $\sigma^{1}\succ\sigma^{2}$ to indicate that the human preferred trajectory segment $\sigma^{1}$ to trajectory segment $\sigma^{2}$. Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human.

在传统的强化学习中,环境还会提供一个奖励 $r_{t}\in\mathbb{R}$,而智能体的目标是最大化奖励的折现总和。我们不再假设环境会产生奖励信号,而是假设有一个人类监督者可以在轨迹片段之间表达偏好。轨迹片段是一系列观察和动作的序列,$\sigma=((o_{0},a_{0}),(o_{1},a_{1}),\ldots,(o_{k-1},a_{k-1}))\in(\mathcal{O}\times\mathcal{A})^{k}$。用 $\sigma^{1}\succ\sigma^{2}$ 表示人类更偏好轨迹片段 $\sigma^{1}$ 而不是 $\sigma^{2}$。非正式地说,智能体的目标是生成人类偏好的轨迹,同时尽可能少地向人类进行查询。

More precisely, we will evaluate our algorithms’ behavior in two ways:

更准确地说,我们将通过两种方式评估算法的行为:

Quantitative: We say that preferences $\succ$ are generated by a reward function $r:{\mathcal{O}}\times{\mathcal{A}}\to\mathbb{R}$ if

定量分析:我们说偏好 $\succ$ 是由奖励函数 $r:{\mathcal{O}}\times{\mathcal{A}}\to\mathbb{R}$ 生成的,如果

$$
\left(\left(o_{0}^{1},a_{0}^{1}\right),\ldots,\left(o_{k-1}^{1},a_{k-1}^{1}\right)\right)\succ\left(\left(o_{0}^{2},a_{0}^{2}\right),\ldots,\left(o_{k-1}^{2},a_{k-1}^{2}\right)\right)
$$

whenever

每当

$$
r\bigl(o_{0}^{1},a_{0}^{1}\bigr)+\cdots+r\bigl(o_{k-1}^{1},a_{k-1}^{1}\bigr)>r\bigl(o_{0}^{2},a_{0}^{2}\bigr)+\cdots+r\bigl(o_{k-1}^{2},a_{k-1}^{2}\bigr).
$$

If the human’s preferences are generated by a reward function $r$, then our agent ought to receive a high total reward according to $r$. So if we know the reward function $r$, we can evaluate the agent quantitatively. Ideally the agent will achieve reward nearly as high as if it had been using RL to optimize $r$.

如果人类的偏好是由奖励函数 $r$ 生成的,那么我们的智能体应该根据 $r$ 获得较高的总奖励。因此,如果我们知道奖励函数 $r$,我们就可以定量地评估智能体。理想情况下,智能体将获得几乎与使用强化学习 (RL) 优化 $r$ 时一样高的奖励。
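The definition above can be illustrated with a minimal sketch of a synthetic "oracle" that prefers whichever segment has the higher summed reward. The scalar observations and actions, the `toy_reward` function, and all names below are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[float, float]  # (observation, action); scalars purely for illustration
Segment = Sequence[Step]


def segment_return(r: Callable[[float, float], float], segment: Segment) -> float:
    """Sum the reward r(o, a) over every step of a trajectory segment."""
    return sum(r(o, a) for o, a in segment)


def oracle_prefers_first(r: Callable[[float, float], float],
                         seg1: Segment, seg2: Segment) -> bool:
    """True when seg1 has strictly higher summed reward than seg2."""
    return segment_return(r, seg1) > segment_return(r, seg2)


def toy_reward(o: float, a: float) -> float:
    """Hypothetical reward: actions close to the observation score higher."""
    return -abs(o - a)


seg_a = [(0.0, 0.1), (1.0, 0.9)]  # actions track observations closely
seg_b = [(0.0, 1.0), (1.0, 0.0)]  # actions far from observations
print(oracle_prefers_first(toy_reward, seg_a, seg_b))
```

Under this toy reward, `seg_a` accumulates a summed reward near -0.2 and `seg_b` exactly -2.0, so the oracle prefers `seg_a`.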

Qualitative: Sometimes we have no reward function by which we can quantitatively evaluate behavior (this is the situation where our approach would be practically useful). In these cases, all we can do is qualitatively evaluate how well the agent satisfies the human’s preferences. In this paper, we will start from a goal expressed in natural language, ask a human to evaluate the agent’s behavior based on how well it fulfills that goal, and then present videos of agents attempting to fulfill that goal.

定性评估:有时我们没有奖励函数来定量评估行为(这正是我们的方法在实际中会有用的场景)。在这些情况下,我们只能定性评估智能体满足人类偏好的程度。在本文中,我们将从自然语言表达的目标出发,要求人类根据智能体实现该目标的程度来评估其行为,然后展示智能体尝试实现该目标的视频。

Our model based on trajectory segment comparisons is very similar to the trajectory preference queries used in Wilson et al. (2012), except that we don’t assume that we can reset the system to an arbitrary state and so our segments generally begin from different states. This complicates the interpretation of human comparisons, but we show that our algorithm overcomes this difficulty even when the human raters have no understanding of our algorithm.

我们基于轨迹段比较的模型与 Wilson 等人 (2012) 中使用的轨迹偏好查询非常相似,不同之处在于我们不假设可以将系统重置到任意状态,因此我们的轨迹段通常从不同的状态开始。这使得人类比较的解释变得复杂,但我们展示了即使人类评分者不理解我们的算法,我们的算法也能克服这一困难。

2.2 Our Method

2.2 我们的方法

At each point in time our method maintains a policy $\pi:\mathcal{O}\rightarrow\mathcal{A}$ and a reward function estimate $\hat{r}:\mathcal{O}\times\mathcal{A}\rightarrow\mathbb{R}$, each parametrized by deep neural networks.

在每个时间点,我们的方法维护一个策略 $\pi:\mathcal{O}\rightarrow\mathcal{A}$ 和一个奖励函数估计 $\hat{r}:\mathcal{O}\times\mathcal{A}\rightarrow\mathbb{R}$,每个都由深度神经网络参数化。

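These networks are updated by three processes:

(1) The policy $\pi$ interacts with the environment to produce a set of trajectories. The parameters of $\pi$ are updated by a traditional RL algorithm to maximize the sum of the predicted rewards $r_{t}=\hat{r}(o_{t},a_{t})$.

(2) Pairs of segments $(\sigma^{1},\sigma^{2})$ are selected from these trajectories and sent to a human for comparison.

(3) The parameters of $\hat{r}$ are optimized via supervised learning to fit the comparisons collected from the human so far.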

这些网络通过三个过程进行更新:

These processes run asynchronously, with trajectories flowing from process (1) to process (2), human comparisons flowing from process (2) to process (3), and parameters for $\hat{r}$ flowing from process (3) to process (1). The following subsections provide details on each of these processes.

这些进程异步运行,轨迹从进程 (1) 流向进程 (2),人类比较从进程 (2) 流向进程 (3),而 $\hat{r}$ 的参数从进程 (3) 流向进程 (1)。以下小节将详细介绍这些进程中的每一个。

2.2.1 Optimizing the Policy

2.2.1 策略优化

After using $\hat{r}$ to compute rewards, we are left with a traditional reinforcement learning problem. We can solve this problem using any RL algorithm that is appropriate for the domain. One subtlety is that the reward function $\hat{r}$ may be non-stationary, which leads us to prefer methods which are robust to changes in the reward function. This led us to focus on policy gradient methods, which have been applied successfully for such problems (Ho and Ermon, 2016).

在使用 $\hat{r}$ 计算奖励后,我们面临一个传统的强化学习问题。我们可以使用适用于该领域的任何强化学习算法来解决这个问题。一个微妙之处在于,奖励函数 $\hat{r}$ 可能是非平稳的,这促使我们倾向于使用对奖励函数变化具有鲁棒性的方法。这使我们专注于策略梯度方法,这些方法已成功应用于此类问题 (Ho and Ermon, 2016)。

In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. This is because TRPO relies on the trust region to ensure adequate exploration, which can lead to inadequate exploration if the reward function is changing.

在本文中,我们使用优势演员-评论家 (A2C; Mnih et al., 2016) 来玩 Atari 游戏,并使用信任区域策略优化 (TRPO; Schulman et al., 2015) 来执行模拟机器人任务。在每种情况下,我们都使用了已被证明在传统强化学习任务中表现良好的参数设置。我们唯一调整的超参数是 TRPO 的熵奖励。这是因为 TRPO 依赖信任区域来确保充分的探索,如果奖励函数发生变化,可能会导致探索不足。

We normalized the rewards produced by $\hat{r}$ to have zero mean and constant standard deviation. This is a typical preprocessing step which is particularly appropriate here since the position of the rewards is underdetermined by our learning problem.

我们将 $\hat{r}$ 生成的奖励归一化,使其均值为零且标准差恒定。这是一个典型的预处理步骤,在这里特别适用,因为奖励的位置在我们的学习问题中是不确定的。
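As a rough illustration of this normalization step, the sketch below shifts a batch of predicted rewards to zero mean and rescales it to a fixed standard deviation. The function name, the `target_std` default, and the `eps` guard are assumptions for this example; the text above does not specify them.

```python
import statistics
from typing import List, Sequence


def normalize_rewards(rewards: Sequence[float],
                      target_std: float = 1.0,
                      eps: float = 1e-8) -> List[float]:
    """Shift rewards to zero mean and rescale them to (approximately) target_std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the batch
    # eps avoids division by zero when every reward in the batch is identical
    return [(x - mean) / (std + eps) * target_std for x in rewards]


predicted = [1.0, 2.0, 3.0, 4.0]  # hypothetical outputs of the reward model
normalized = normalize_rewards(predicted)
print(normalized)
```

Because only comparisons between segments constrain the learned reward, adding a constant to every reward leaves the fitted preferences unchanged, which is why fixing the mean loses no information.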

2.2.2 Preference Elicitation

2.2.2 偏好引出

The human overseer is given a visualization of two trajectory segments, in the form of short movie clips. In all of our experiments, these clips are between 1 and 2 seconds long.

人类监督者会以短视频片段的形式看到两个轨迹段的可视化。在我们所有的实验中,这些片段的长度在1到2秒之间。

The human then indicates which segment they prefer, that the two segments are equally good, or that they are unable to compare the two segments.

然后,人类指示他们更喜欢哪个片段,两个片段同样好,或者他们无法比较这两个片段。

The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^{1},\sigma^{2},\mu)$, where $\sigma^{1}$ and $\sigma^{2}$ are the two segments and $\mu$ is a distribution over $\{1,2\}$ indicating which segment the user preferred. If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice. If the human marks the segments as equally preferable, then $\mu$ is uniform. Finally, if the human marks the segments as incomparable, then the comparison is not included in the database.

人类判断记录在一个数据库 $\mathcal{D}$ 中,其中包含三元组 $(\sigma^{1},\sigma^{2},\mu)$,其中 $\sigma^{1}$ 和 $\sigma^{2}$ 是两个片段,$\mu$ 是 $\{1,2\}$ 上的一个分布,表示用户偏好的片段。如果人类选择其中一个片段作为偏好,则 $\mu$ 将所有质量放在该选择上。如果人类将片段标记为同等偏好,则 $\mu$ 是均匀的。最后,如果人类将片段标记为不可比较,则该比较不包含在数据库中。
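The mapping from a human response to the distribution $\mu$ can be sketched as follows. The response strings (`"first"`, `"second"`, `"equal"`, `"incomparable"`), the function names, and the use of a plain list as the database are hypothetical choices for illustration.

```python
from typing import Any, List, Optional, Tuple

Mu = Tuple[float, float]  # (mass on segment 1, mass on segment 2)


def label_to_mu(choice: str) -> Optional[Mu]:
    """Translate a human response into mu, or None if the pair is incomparable."""
    if choice == "first":
        return (1.0, 0.0)   # all mass on segment 1
    if choice == "second":
        return (0.0, 1.0)   # all mass on segment 2
    if choice == "equal":
        return (0.5, 0.5)   # uniform over {1, 2}
    if choice == "incomparable":
        return None         # the comparison is dropped
    raise ValueError(f"unknown response: {choice!r}")


def record_judgment(database: List[Tuple[Any, Any, Mu]],
                    seg1: Any, seg2: Any, choice: str) -> None:
    """Append (seg1, seg2, mu) to the database unless the pair is incomparable."""
    mu = label_to_mu(choice)
    if mu is not None:
        database.append((seg1, seg2, mu))


database: List[Tuple[Any, Any, Mu]] = []
record_judgment(database, "clip_a", "clip_b", "first")
record_judgment(database, "clip_c", "clip_d", "equal")
record_judgment(database, "clip_e", "clip_f", "incomparable")
print(len(database))  # the incomparable pair is not stored
```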

2.2.3 Fitting the Reward Function

2.2.3 拟合奖励函数

We can interpret a reward function estimate $\hat{r}$ as a preference-predictor if we view $\hat{r}$ as a latent factor explaining the human’s judgments and assume that the human’s probability of preferring a segment $\sigma^{i}$ depends exponentially on the value of the latent reward summed over the length of the clip:

我们可以将奖励函数估计 $\hat{r}$ 解释为偏好预测器,如果我们将其视为解释人类判断的潜在因素,并假设人类偏好片段 $\sigma^{i}$ 的概率取决于片段长度上潜在奖励值的指数和:

$$
\hat{P}\big[\sigma^{1}\succ\sigma^{2}\big]=\frac{\exp\sum\hat{r}\big(o_{t}^{1},a_{t}^{1}\big)}{\exp\sum\hat{r}\big(o_{t}^{1},a_{t}^{1}\big)+\exp\sum\hat{r}\big(o_{t}^{2},a_{t}^{2}\big)}.
$$

We choose $\hat{r}$ to minimize the cross-entropy loss between these predictions and the actual human labels:

我们选择 $\hat{r}$ 以最小化这些预测与实际人类标签之间的交叉熵损失:

$$
\mathrm{loss}(\hat{r})=-\sum_{(\sigma^{1},\sigma^{2},\mu)\in\mathcal{D}}\mu(1)\log\hat{P}\big[\sigma^{1}\succ\sigma^{2}\big]+\mu(2)\log\hat{P}\big[\sigma^{2}\succ\sigma^{1}\big].
$$

This follows the Bradley-Terry model (Bradley and Terry, 1952) for estimating score functions from pairwise preferences, and is the specialization of the Luce-Shepard choice rule (Luce, 2005; Shepard, 1957) to preferences over trajectory segments.

这遵循了 Bradley-Terry 模型 (Bradley and Terry, 1952) 用于从成对偏好中估计评分函数,并且是 Luce-Shephard 选择规则 (Luce, 2005; Shepard, 1957) 在轨迹段偏好上的特化。
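The preference predictor and the cross-entropy loss above can be sketched in plain Python. The sketch assumes the summed predicted rewards of each segment have already been computed, and it works in log space because the softmax over two summed rewards reduces to a logistic function of their difference. All function names are illustrative, not the authors' code.

```python
import math
from typing import Iterable, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_preference_prob(sum1: float, sum2: float) -> float:
    """log P[sigma1 > sigma2], where sum_i is the summed predicted reward of segment i.

    exp(sum1) / (exp(sum1) + exp(sum2)) equals 1 / (1 + exp(sum2 - sum1)),
    so its logarithm is -softplus(sum2 - sum1).
    """
    return -softplus(sum2 - sum1)


def preference_loss(examples: Iterable[Tuple[float, float, Tuple[float, float]]]) -> float:
    """Cross-entropy between predicted preferences and the labels mu = (mu1, mu2)."""
    total = 0.0
    for sum1, sum2, (mu1, mu2) in examples:
        total -= mu1 * log_preference_prob(sum1, sum2)
        total -= mu2 * log_preference_prob(sum2, sum1)
    return total


# Hypothetical data: (summed reward of segment 1, summed reward of segment 2, mu)
data = [
    (2.0, 0.0, (1.0, 0.0)),  # human preferred segment 1, and the model agrees
    (0.0, 0.0, (0.5, 0.5)),  # human marked the segments as equally good
]
print(preference_loss(data))
```

In a full implementation the summed rewards would come from a neural network and the loss would be minimized by gradient descent; this sketch only evaluates the loss for fixed predictions.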

Our actual algorithm incorporates a number of modifications to this basic approach, which early experiments discovered to be helpful and which are analyzed in Section 3.3:

我们的实际算法对这一基本方法进行了多项修改,这些修改在早期实验中被发现是有帮助的,并在第 3.3 节中进行了分析:

• Rather than applying a softmax directly as described in Equation 1, we assume there is a $10\%$ chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn’t decay to 0 as the difference in reward becomes extreme.

• 我们假设有 $10\%$ 的概率人类会随机均匀响应,而不是直接应用公式 1 中的 softmax。从概念上讲,这种调整是必要的,因为人类评分者存在一个恒定的错误概率,这种概率不会随着奖励差异的极端化而衰减到 0。
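One way to read this adjustment is as a mixture: with probability 0.9 the human follows the softmax prediction, and with probability 0.1 they pick either segment with equal chance. The sketch below implements that reading; the mixture form and the names are an interpretation of the sentence above, not a transcription of the authors' code.

```python
import math


def bradley_terry_prob(sum1: float, sum2: float) -> float:
    """P[sigma1 > sigma2] under the unadjusted softmax, computed stably."""
    d = sum1 - sum2
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def noisy_preference_prob(sum1: float, sum2: float, error_rate: float = 0.1) -> float:
    """Mix the softmax prediction with a uniformly random response.

    error_rate is the assumed chance that the human answers uniformly at random.
    """
    return (1.0 - error_rate) * bradley_terry_prob(sum1, sum2) + error_rate * 0.5


# Even for an extreme reward gap the predicted probability stays inside [0.05, 0.95],
# so a single mislabeled pair contributes a bounded cross-entropy penalty.
print(noisy_preference_prob(1000.0, 0.0))
print(-math.log(noisy_preference_prob(-1000.0, 0.0)))
```

With the unadjusted softmax, a label that contradicts a very confident prediction would incur an arbitrarily large loss; under this mixture the penalty is capped at $-\log 0.05$, roughly 3.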

2.2.4 Selecting Queries

2.2.4 查询选择

We decide how to query preferences based on an approximation to the uncertainty in the reward function estimator, similar to Daniel et al. (2014): we sample a large number of pairs of trajectory segments of length $k$ from the latest agent-environment interactions, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members. This is a crude approximation and the ablation experiments in Section 3 show that in some tasks it actually impairs performance. Ideally, we would want to query based on the expected value of information of the query (Akrour et al., 2012; Krueger et al., 2016), but we leave it to future work to explore this direction further.

我们根据奖励函数估计器的不确定性近似值来决定如何查询偏好,类似于 Daniel 等人 (2014) 的方法:我们从最新的智能体-环境交互中采样大量长度为 $k$ 的轨迹片段对,使用集成中的每个奖励预测器来预测每对中哪个片段会被偏好,然后选择那些在集成成员中预测方差最大的轨迹。这是一个粗略的近似,第 3 节中的消融实验表明,在某些任务中,它实际上会损害性能。理想情况下,我们希望基于查询的预期信息价值进行查询 (Akrour 等人, 2012; Krueger 等人, 2016),但我们将其留给未来的工作来进一步探索这个方向。
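The selection rule described above can be sketched as follows: every ensemble member predicts the preference probability for each candidate pair, and the pairs with the highest variance across members are kept. The toy ensemble of weighted sums and all names are assumptions for illustration; in practice each member would be a trained reward network.

```python
import math
import statistics
from typing import Callable, List, Sequence, Tuple

Segment = Sequence[float]  # toy segment: one scalar feature per step
RewardModel = Callable[[Segment], float]  # maps a segment to its summed predicted reward


def preference_prob(sum1: float, sum2: float) -> float:
    """P[segment 1 preferred] as a logistic function of the reward difference."""
    d = sum1 - sum2
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def select_queries(pairs: List[Tuple[Segment, Segment]],
                   ensemble: List[RewardModel],
                   num_queries: int) -> List[Tuple[Segment, Segment]]:
    """Return the pairs whose predicted preference varies most across the ensemble."""
    scored = []
    for index, (seg1, seg2) in enumerate(pairs):
        probs = [preference_prob(model(seg1), model(seg2)) for model in ensemble]
        scored.append((statistics.pvariance(probs), index))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pairs[index] for _, index in scored[:num_queries]]


# Hypothetical ensemble: members disagree about the sign and scale of the reward.
ensemble = [lambda seg, w=w: w * sum(seg) for w in (1.0, -1.0, 0.5)]
pairs = [([0.0], [0.0]), ([3.0], [0.0])]
print(select_queries(pairs, ensemble, num_queries=1))
```

In this toy example all members agree on the first pair (probability 0.5 each), so only the second pair, on which they disagree, is sent to the human.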

3 Experimental Results

3 实验结果

We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016).

我们在 TensorFlow (Abadi et al., 2016) 中实现了我们的算法。我们通过 OpenAI Gym (Brockman et al., 2016) 与 MuJoCo (Todorov et al., 2012) 和 Arcade Learning Environment (Bellemare et al., 2013) 进行交互。

3.1 Reinforcement Learning Tasks with Unobserved Rewards

3.1 未观测奖励的强化学习任务

In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. Instead, the agent learns about the goal of the task only by asking a human which of two trajectory segments is better. Our goal is to solve the task in a reasonable amount of time using as few queries as possible.

在我们的第一组实验中,我们尝试解决一系列深度强化学习(RL)的基准任务,而不观察真实的奖励。相反,AI智能体仅通过询问人类两个轨迹片段中哪个更好来了解任务的目标。我们的目标是在合理的时间内使用尽可能少的查询来解决任务。

In our experiments, feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task (see Appendix B for the exact instructions given to contractors). Each trajectory segment is between 1 and 2 seconds long. Contractors responded to the average query in 3-5 seconds, and so the experiments involving real human feedback required between 30 minutes and 5 hours of human time.

在我们的实验中,反馈由承包商提供,他们在被要求比较数百到数千对任务轨迹段之前,会获得每个任务的1-2句描述(参见附录B中提供给承包商的确切说明)。每个轨迹段的长度在1到2秒之间。承包商平均在3-5秒内响应每个查询,因此涉及真实人类反馈的实验需要30分钟到5小时的人类时间。

For comparison, we also run experiments using a synthetic oracle whose preferences are generated (in the sense of Section 2.1) by the real reward. We also compare to the baseline of RL training using the real reward. Our aim here is not to outperform but rather to do nearly as well as RL without access to reward information and instead relying on much scarcer feedback. Nevertheless, note that feedback from real humans does have the potential to outperform RL (and as shown below it actually does so on some tasks), because the human feedback might provide a better-shaped reward.

为了进行比较,我们还使用了一个合成预言机进行实验,其偏好由真实奖励生成(如第2.1节所述)。我们还与使用真实奖励的强化学习(RL)训练基线进行了比较。我们的目标不是超越RL,而是在没有奖励信息的情况下,依靠更稀缺的反馈,做到与RL几乎一样好。然而,值得注意的是,来自真实人类的反馈确实有可能超越RL(如下所示,在某些任务中确实如此),因为人类反馈可能提供更好的奖励形状。

We describe the details of our experiments in Appendix A, including model architectures, modifications to the environment, and the RL algorithms used to optimize the policy.

我们在附录 A 中详细描述了实验的细节,包括模型架构、对环境的修改以及用于优化策略的强化学习 (RL) 算法。

3.1.1 Simulated Robotics

3.1.1 模拟机器人学

The first tasks we consider are eight simulated robotics tasks, implemented in MuJoCo (Todorov et al., 2012), and included in OpenAI Gym (Brockman et al., 2016). We made small modifications to these tasks in order to avoid encoding information about the task in the environment itself (the modifications are described in detail in Appendix A). The reward functions in these tasks are quadratic functions of distances, positions and velocities, and most are linear. We included a simple cartpole task (“pendulum”) for comparison, since this is representative of the complexity of tasks studied in prior work.

我们首先考虑的任务是八个模拟机器人任务,这些任务在 MuJoCo (Todorov et al., 2012) 中实现,并包含在 OpenAI Gym (Brockman et al., 2016) 中。我们对这些任务进行了小幅修改,以避免在环境本身中编码任务信息(修改的详细描述见附录 A)。这些任务中的奖励函数是距离、位置和速度的二次函数,大多数是线性的。我们包含了一个简单的倒立摆任务(“pendulum”)作为对比,因为这是之前工作中研究的任务复杂度的代表。


Figure 1: Results on MuJoCo simulated robotics as measured on the tasks’ true reward. We compare our method using real human feedback (purple), our method using synthetic feedback provided by an oracle (shades of blue), and reinforcement learning using the true reward function (orange). All curves are the average of 5 runs, except for the real human feedback, which is a single run, and each point is the average reward over five consecutive batches. For Reacher and Cheetah feedback was provided by an author due to time constraints. For all other tasks, feedback was provided by contractors unfamiliar with the environments and with our algorithm. The irregular progress on Hopper is due to one contractor deviating from the typical labeling schedule.

图 1: 在 MuJoCo 模拟机器人任务上以真实奖励衡量的结果。我们比较了使用真实人类反馈的方法(紫色)、使用由预言机提供的合成反馈的方法(蓝色阴影)以及使用真实奖励函数的强化学习方法(橙色)。所有曲线均为 5 次运行的平均值,除了真实人类反馈是单次运行,每个点是连续五个批次的平均奖励。对于 Reacher 和 Cheetah 任务,由于时间限制,反馈由作者提供。对于所有其他任务,反馈由不熟悉环境和算法的承包商提供。Hopper 任务上的不规则进展是由于一位承包商偏离了典型的标注计划。

Figure 1 shows the results of training our agent with 700 queries to a human rater, compared to learning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward. With 700 labels we are able to nearly match reinforcement learning on all of these tasks. Training with learned reward functions tends to be less stable and higher variance, while having a comparable mean performance.

图 1 展示了我们使用 700 条人工评分查询训练 AI 智能体的结果,并与从 350、700 或 1400 条合成查询中学习的结果,以及从真实奖励中进行强化学习 (RL) 的结果进行了比较。使用 700 条标签,我们能够在所有这些任务上几乎与强化学习相匹配。使用学习到的奖励函数进行训练往往稳定性较低且方差较大,但平均性能相当。

Surprisingly, by 1400 labels our algorithm performs slightly better than if it had simply been given the true reward, perhaps because the learned reward function is slightly better shaped: the reward learning procedure assigns positive rewards to all behaviors that are typically followed by high reward. The difference may also be due to subtle changes in the relative scale of rewards or our use of entropy regularization.

令人惊讶的是,在使用了1400个标签后,我们的算法表现略优于直接使用真实奖励的情况,这可能是因为学习到的奖励函数形状略好——奖励学习过程为所有通常伴随高奖励的行为分配了正奖励。这种差异也可能源于奖励相对尺度的微妙变化或我们对熵正则化的使用。

Real human feedback is typically only slightly less effective than the synthetic feedback; depending on the task human feedback ranged from being half as efficient as ground truth feedback to being equally efficient. On the Ant task the human feedback significantly outperformed the synthetic feedback, apparently because we asked humans to prefer trajectories where the robot was “standing upright,” which proved to be useful reward shaping. (There was a similar bonus in the RL reward function to encourage the robot to remain upright, but the simple hand-crafted bonus was not as useful.)

真实的人类反馈通常只比合成反馈稍微低效一些;根据任务的不同,人类反馈的效率从仅为真实反馈的一半到与真实反馈同等效率不等。在 Ant 任务中,人类反馈显著优于合成反馈,显然是因为我们要求人类偏好机器人“直立站立”的轨迹,这被证明是有用的奖励塑造。(在 RL 奖励函数中也有类似的奖励,以鼓励机器人保持直立,但简单的手工奖励效果不佳。)

3.1.2 Atari

3.1.2 Atari

The second set of tasks we consider is a set of seven Atari games in the Arcade Learning Environment (Bellemare et al., 2013), the same games presented in Mnih et al. (2013).

我们考虑的第二组任务是 Arcade Learning Environment (Bellemare et al., 2013) 中的七款 Atari 游戏,与 Mnih et al., 2013 中展示的游戏相同。

Figure 2 shows the results of training our agent with 5,500 queries to a human rater, compared to learning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward. Our method has more difficulty matching RL in these challenging environments, but nevertheless it displays substantial learning on most of them and matches or even exceeds RL on some. Specifically, on BeamRider and Pong, synthetic labels match or come close to RL even with only 3,300 such labels. On Seaquest and Qbert synthetic feedback eventually performs near the level of RL but learns more slowly. On Space Invaders and Breakout synthetic feedback never matches RL, but nevertheless the agent improves substantially, often passing the first level in Space Invaders and reaching a score of 20 on Breakout, or 50 with enough labels.

图 2 展示了我们的智能体在与人类评分者进行 5,500 次查询后的训练结果,与从 350、700 或 1400 次合成查询中学习的结果,以及从真实奖励中进行强化学习 (RL) 的结果进行了比较。在这些具有挑战性的环境中,我们的方法在匹配 RL 方面存在更多困难,但在大多数情况下仍然显示出显著的学习效果,并且在某些情况下甚至超过了 RL。具体来说,在 BeamRider 和 Pong 上,即使只有 3,300 个合成标签,合成标签也能匹配或接近 RL 的水平。在 Seaquest 和 Qbert 上,合成反馈最终表现接近 RL 水平,但学习速度较慢。在 Space Invaders 和 Breakout 上,合成反馈从未匹配 RL,但智能体仍然有显著提升,通常在 Space Invaders 中通过第一关,在 Breakout 中达到 20 分,或者在足够标签的情况下达到 50 分。


Figure 2: Results on Atari games as measured on the tasks’ true reward. We compare our method using real human feedback (purple), our method using synthetic feedback provided by an oracle (shades of blue), and reinforcement learning using the true reward function (orange). All curves are the average of 3 runs, except for the real human feedback which is a single run, and each point is the average reward over about 150,000 consecutive frames.

图 2: Atari 游戏中的结果,以任务的真实奖励衡量。我们比较了使用真实人类反馈(紫色)、使用由预言机提供的合成反馈(蓝色调)的方法,以及使用真实奖励函数的强化学习(橙色)。所有曲线均为 3 次运行的平均值,除了真实人类反馈为单次运行,每个点代表约 150,000 连续帧的平均奖励。

On most of the games real human feedback performs similar to or slightly worse than synthetic feedback with the same number of labels, and often comparably to synthetic feedback that has $40\%$ fewer labels. On Qbert, our method fails to learn to beat the first level with real human feedback; this may be because short clips in Qbert can be confusing and difficult to evaluate. Finally, Enduro is difficult for A3C to learn due to the difficulty of successfully passing other cars through random exploration, and is correspondingly difficult to learn with synthetic labels, but human labelers tend to reward any progress towards passing cars, essentially shaping the reward and thus outperforming A3C in this game (the results are comparable to those achieved with DQN).

在大多数游戏中,真实人类反馈的表现与相同数量标签的合成反馈相似或略差,并且通常与标签数量少 $40\%$ 的合成反馈相当。在 Qbert 游戏中,我们的方法未能通过真实人类反馈学会通过第一关;这可能是因为 Qbert 中的短片段容易让人困惑且难以评估。最后,Enduro 游戏对 A3C 来说很难学习,因为通过随机探索成功超越其他车辆非常困难,因此使用合成标签学习也很困难,但人类标注者倾向于奖励任何超越车辆的进展,这实质上塑造了奖励机制,从而在该游戏中超越了 A3C(结果与使用 DQN 取得的结果相当)。

3.2 Novel behaviors

3.2 新颖行为

Experiments with traditional RL tasks help us understand whether our method is effective, but the ultimate purpose of human interaction is to solve tasks for which no reward function is available.

在传统强化学习任务上的实验帮助我们理解我们的方法是否有效,但人类交互的最终目的是解决那些没有奖励函数的任务。

Using the same parameters as in the previous experiments, we show that our algorithm can learn novel complex behaviors. We demonstrate:

使用与之前实验相同的参数,我们展示了我们的算法能够学习新颖的复杂行为。我们展示了:


Figure 3: Performance of our algorithm on MuJoCo tasks after removing various components, as described in Section 3.3. All graphs are averaged over 5 runs, using 700 synthetic labels each.

图 3: 在移除各种组件后,我们的算法在 MuJoCo 任务上的表现,如第 3.3 节所述。所有图表均为 5 次运行的平均值,每次使用 700 个合成标签。

Videos of these behaviors can be found at https://goo.gl/MhgvIU. These behaviors were trained using feedback from the authors.

这些行为的视频可以在 https://goo.gl/MhgvIU 找到。这些行为是通过作者的反馈进行训练的。

3.3 Ablation Studies

3.3 消融研究

In order to better understand the performance of our algorithm, we consider a range of modifications:

为了更好地理解我们算法的性能,我们考虑了一系列的修改:

The results are presented in Figure 3 for MuJoCo and Figure 4 for Atari.

结果如图 3 所示(MuJoCo)和图 4 所示(Atari)。

Training the reward predictor offline can lead to bizarre behavior that is undesirable as measured by the true reward (Amodei et al., 2016). For instance, on Pong offline training sometimes leads our agent to avoid losing points but not to score points; this can result in extremely long volleys (videos at https://goo.gl/L5eAbk). This type of behavior demonstrates that in general human feedback needs to be intertwined with RL rather than provided statically.

离线训练奖励预测器可能会导致一些不符合真实奖励期望的怪异行为 (Amodei et al., 2016)。例如,在 Pong 游戏中,离线训练有时会导致我们的 AI 智能体避免失分,但不去得分;这可能会导致极长的对打回合(视频见 https://goo.gl/L5eAbk)。这种行为表明,通常需要将人类反馈与强化学习 (RL) 交织在一起,而不是静态地提供反馈。

Our main motivation for eliciting comparisons rather than absolute scores was that we found it much easier for humans to provide consistent comparisons than consistent absolute scores, especially on the continuous control tasks and on the qualitative tasks in Section 3.2; nevertheless it seems important to understand how using comparisons affects performance. For continuous control tasks we found that predicting comparisons worked much better than predicting scores. This is likely because the scale of rewards varies substantially and this complicates the regression problem, which is smoothed significantly when we only need to predict comparisons. In the Atari tasks we clipped rewards and effectively only predicted the sign, avoiding these difficulties (this is not a suitable solution for the continuous control tasks because the magnitude of the reward is important to learning). In these tasks comparisons and targets had significantly different performance, but neither consistently outperformed the other.

Figure 4: Performance of our algorithm on Atari tasks after removing various components, as described in Section 3.3. All curves are an average of 3 runs using 5,500 synthetic labels (see minor exceptions in Section A.2).

We also observed large performance differences when using single frames rather than clips. In order to obtain the same results using single frames we would need to have collected significantly more comparisons. In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame. Shrinking the clip length below 1-2 seconds did not significantly decrease the human time required to label each clip in early experiments, and so seems less efficient per second of human time. In the Atari environments we also found that it was often easier to compare longer clips because they provide more context than single frames.

4 Discussion and Conclusions

Agent-environment interactions are often radically cheaper than human interaction. We show that by learning a separate reward model using supervised learning, it is possible to reduce the interaction complexity by roughly 3 orders of magnitude.

Although there is a large literature on preference elicitation and reinforcement learning from unknown reward functions, we provide the first evidence that these techniques can be economically scaled up to state-of-the-art reinforcement learning systems. This represents a step towards practical applications of deep RL to complex real-world tasks.

In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied in the service of complex human values rather than low-complexity goals.

Acknowledgments

We thank Olivier Pietquin, Bilal Piot, Laurent Orseau, Pedro Ortega, Victoria Krakovna, Owain Evans, Andrej Karpathy, Igor Mordatch, and Jack Clark for reading drafts of the paper. We thank Tyler Adkisson, Mandy Beri, Jessica Richards, Heather Tran, and other contractors for providing the
