[论文翻译]基于人类偏好的深度强化学习


原文地址:https://github.com/dalinvip/Awesome-ChatGPT/blob/main/PDF/RLHF%E8%AE%BA%E6%96%87%E9%9B%86/NIPS-2017-deep-reinforcement-learning-from-human-preferences-Paper.pdf


Deep Reinforcement Learning from Human Preferences

基于人类偏好的深度强化学习

Abstract

摘要

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than $1\%$ of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.

为了让复杂的强化学习 (Reinforcement Learning, RL) 系统能够有效地与现实世界环境交互,我们需要向这些系统传达复杂的目标。在这项工作中,我们探索了基于(非专家)人类对轨迹片段对的偏好来定义目标的方法。我们展示了这种方法可以有效地解决复杂的 RL 任务,而无需访问奖励函数,包括 Atari 游戏和模拟机器人运动,同时仅需对智能体与环境的交互提供不到 $1\%$ 的反馈。这大大降低了人类监督的成本,使其能够实际应用于最先进的 RL 系统。为了展示我们方法的灵活性,我们展示了可以在大约一小时的人类时间内成功训练出复杂的新行为。这些行为和环境比之前从人类反馈中学习的任何内容都要复杂得多。

1 Introduction

1 引言

Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify. Overcoming this limitation would greatly expand the possible impact of deep RL and could increase the reach of machine learning more broadly.

最近在将强化学习 (Reinforcement Learning, RL) 扩展到大规模问题上取得的成功,主要得益于那些具有明确奖励函数的领域 (Mnih et al., 2015, 2016; Silver et al., 2016)。然而,许多任务涉及的目标复杂、定义不清或难以明确。克服这一限制将极大地扩展深度强化学习的潜在影响,并可能更广泛地提升机器学习的应用范围。

For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It’s not clear how to construct a suitable reward function, which will need to be a function of the robot’s sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences. This difficulty underlies recent concerns about misalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.

例如,假设我们想使用强化学习来训练一个机器人来清洁桌子或炒鸡蛋。目前尚不清楚如何构建一个合适的奖励函数,该函数需要基于机器人的传感器。我们可以尝试设计一个简单的奖励函数,近似捕捉预期的行为,但这通常会导致行为优化了我们的奖励函数,而实际上并未满足我们的偏好。这一困难正是最近关于我们的价值观与强化学习系统目标之间不一致的担忧的基础 (Bostrom, 2014; Russell, 2016; Amodei et al., 2016)。如果我们能够成功地将我们的实际目标传达给我们的AI智能体,这将是解决这些担忧的重要一步。

If we have demonstrations of the desired task, we can use inverse reinforcement learning (Ng and Russell, 2000) or imitation learning to copy the demonstrated behavior. But these approaches are not directly applicable to behaviors that are difficult for humans to demonstrate (such as controlling a robot with many degrees of freedom but non-human morphology).

如果我们有期望任务的演示,可以使用逆强化学习 [Ng 和 Russell, 2000] 或模仿学习来复制演示的行为。但这些方法并不直接适用于人类难以演示的行为(例如控制具有许多自由度但非人类形态的机器人)。

An alternative approach is to allow a human to provide feedback on our system’s current behavior and to use this feedback to define the task. In principle this fits within the paradigm of reinforcement learning, but using human feedback directly as a reward function is prohibitively expensive for RL systems that require hundreds or thousands of hours of experience. In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude.

另一种方法是允许人类对我们系统的当前行为提供反馈,并利用这些反馈来定义任务。原则上,这符合强化学习的范式,但对于需要数百或数千小时经验的强化学习系统来说,直接使用人类反馈作为奖励函数的成本过高。为了实际利用人类反馈训练深度强化学习系统,我们需要将所需的反馈量减少几个数量级。

We overcome this difficulty by asking humans to compare possible trajectories of the agent, using that data to learn a reward function, and optimizing the learned reward function with RL.

我们通过让人类比较智能体可能的轨迹,利用这些数据学习奖励函数,并通过强化学习优化学习到的奖励函数来克服这一困难。

This basic approach has been explored in the past, but we confront the challenges involved in scaling it up to modern deep RL and demonstrate by far the most complex behaviors yet learned from human feedback.

过去已经探索过这种基本方法,但我们面临将其扩展到现代深度强化学习(RL)中的挑战,并展示了迄今为止从人类反馈中学到的最复杂行为。

Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). We show that a small amount of feedback from a non-expert human, ranging from fifteen minutes to five hours, suffices to learn both standard RL tasks and novel hard-to-specify behaviors such as performing a backflip or driving with the flow of traffic.

我们的实验在两个领域进行:Arcade Learning Environment (Bellemare et al., 2013) 中的 Atari 游戏,以及物理模拟器 MuJoCo (Todorov et al., 2012) 中的机器人任务。我们展示了来自非专家人类的少量反馈(从十五分钟到五小时不等)足以学习标准的强化学习任务以及难以指定的新行为,例如后空翻或随车流驾驶。

1.1 Related Work

1.1 相关工作

A long line of work studies reinforcement learning from human ratings or rankings, including Akrour et al. (2011), Pilarski et al. (2011), Akrour et al. (2012), Wilson et al. (2012), Sugiyama et al. (2012), Wirth and Fürnkranz (2013), Daniel et al. (2015), El Asri et al. (2016), Wang et al. (2016), and Wirth et al. (2016). Other lines of research consider the general problem of reinforcement learning from preferences rather than absolute reward values (Fürnkranz et al., 2012; Akrour et al., 2014; Wirth et al., 2016), and optimizing using human preferences in settings other than reinforcement learning (Machwe and Parmee, 2006; Secretan et al., 2008; Brochu et al., 2010; Sørensen et al., 2016).

一系列研究工作探讨了基于人类评分或排序的强化学习,包括 Akrour 等人 (2011)、Pilarski 等人 (2011)、Akrour 等人 (2012)、Wilson 等人 (2012)、Sugiyama 等人 (2012)、Wirth 和 Fürnkranz (2013)、Daniel 等人 (2015)、El Asri 等人 (2016)、Wang 等人 (2016) 以及 Wirth 等人 (2016)。其他研究方向则考虑了基于偏好而非绝对奖励值的强化学习一般问题 (Fürnkranz 等人, 2012; Akrour 等人, 2014; Wirth 等人, 2016),以及在非强化学习环境中使用人类偏好进行优化的研究 (Machwe 和 Parmee, 2006; Secretan 等人, 2008; Brochu 等人, 2010; Sørensen 等人, 2016)。

Our algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014), but considers much more complex domains and behaviors. The complexity of our environments forces us to use different RL algorithms, reward models, and training strategies. One notable difference is that Akrour et al. (2012) and Akrour et al. (2014) elicit preferences over whole trajectories rather than short clips, and so would require about an order of magnitude more human time per data point. Our approach to feedback elicitation closely follows Wilson et al. (2012). However, Wilson et al. (2012) assumes that the reward function is the distance to some unknown (linear) “target” policy, and is never tested with real human feedback.

我们的算法遵循与 Akrour 等人 (2012) 和 Akrour 等人 (2014) 相同的基本方法,但考虑了更复杂的领域和行为。环境的复杂性迫使我们使用不同的强化学习 (RL) 算法、奖励模型和训练策略。一个显著的差异是,Akrour 等人 (2012) 和 Akrour 等人 (2014) 获取的是对整个轨迹而非短片段的偏好,因此每个数据点需要大约多一个数量级的人类时间。我们的反馈获取方法紧密遵循 Wilson 等人 (2012)。然而,Wilson 等人 (2012) 假设奖励函数是到某个未知(线性)“目标”策略的距离,并且从未使用真实的人类反馈进行测试。

TAMER (Knox, 2012; Knox and Stone, 2013) also learns a reward function from human feedback, but learns from ratings rather than comparisons, has the human observe the agent as it behaves, and has been applied to settings where the desired policy can be learned orders of magnitude more quickly.

TAMER (Knox, 2012; Knox and Stone, 2013) 也从人类反馈中学习奖励函数,但它从评分而非比较中学习,让人类在智能体行动时对其进行观察,并且已被应用于期望策略的学习速度可以快上若干个数量级的场景。

Compared to all prior work, our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors. This fits into a recent trend of scaling reward learning methods to large deep learning systems, for example inverse RL (Finn et al., 2016), imitation learning (Ho and Ermon, 2016; Stadie et al., 2017), semi-supervised skill generalization (Finn et al., 2017), and bootstrapping RL from demonstrations (Silver et al., 2016; Hester et al., 2017).

与之前的所有工作相比,我们的关键贡献是将人类反馈扩展到深度强化学习中,并学习更复杂的行为。这符合最近将奖励学习方法扩展到大型深度学习系统的趋势,例如逆向强化学习 (Finn et al., 2016)、模仿学习 (Ho and Ermon, 2016; Stadie et al., 2017)、半监督技能泛化 (Finn et al., 2017) 以及从演示中进行引导强化学习 (Silver et al., 2016; Hester et al., 2017)。

2 Preliminaries and Method

2 预备知识与方法

2.1 Setting and Goal

2.1 设置与目标

We consider an agent interacting with an environment over a sequence of steps; at each time $t$ the agent receives an observation $o_{t}\in\mathcal{O}$ from the environment and then sends an action $a_{t}\in\mathcal A$ to the environment.

我们考虑一个智能体在一系列步骤中与环境进行交互;在每个时间 $t$,智能体从环境中接收到一个观测值 $o_{t}\in\mathcal{O}$,然后向环境发送一个动作 $a_{t}\in\mathcal A$。

In traditional reinforcement learning, the environment would also supply a reward $r_{t}\in\mathbb{R}$ and the agent’s goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. A trajectory segment is a sequence of observations and actions, $\sigma=((o_{0},a_{0}),(o_{1},a_{1}),\ldots,(o_{k-1},a_{k-1}))\in(\mathcal{O}\times\mathcal{A})^{k}$. Write $\sigma^{1}\succ\sigma^{2}$ to indicate that the human preferred trajectory segment $\sigma^{1}$ to trajectory segment $\sigma^{2}$. Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human.

在传统的强化学习中,环境还会提供一个奖励 $r_{t}\in\mathbb{R}$,而智能体的目标是最大化奖励的折现总和。我们不再假设环境会产生奖励信号,而是假设有一个人类监督者可以在轨迹片段之间表达偏好。轨迹片段是一系列观察和动作的序列,$\sigma=((o_{0},a_{0}),(o_{1},a_{1}),\ldots,(o_{k-1},a_{k-1}))\in(\mathcal{O}\times\mathcal{A})^{k}$。用 $\sigma^{1}\succ\sigma^{2}$ 表示人类更偏好轨迹片段 $\sigma^{1}$ 而不是 $\sigma^{2}$。非正式地说,智能体的目标是生成人类偏好的轨迹,同时尽可能少地向人类进行查询。
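Below is a minimal Python sketch (our illustration, not the paper's code) of the objects just defined: a trajectory segment $\sigma$ is a length-$k$ sequence of (observation, action) pairs. The names `Step`, `Segment` and `make_segment` are assumptions introduced only for this example.

```python
from typing import List, Tuple

import numpy as np

Step = Tuple[np.ndarray, np.ndarray]   # one (o_t, a_t) pair
Segment = List[Step]                   # sigma = ((o_0, a_0), ..., (o_{k-1}, a_{k-1}))


def make_segment(observations: List[np.ndarray],
                 actions: List[np.ndarray],
                 k: int) -> Segment:
    """Take the first k (observation, action) pairs of a trajectory as one segment."""
    return list(zip(observations[:k], actions[:k]))
```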

More precisely, we will evaluate our algorithms’ behavior in two ways:

更准确地说,我们将通过两种方式评估算法的行为:

Quantitative: We say that preferences $\succ$ are generated by a reward function $r:{\mathcal{O}}\times{\mathcal{A}}\to\mathbb{R}$ if

定量分析:我们说偏好 $\succ$ 是由奖励函数 $r:{\mathcal{O}}\times{\mathcal{A}}\to\mathbb{R}$ 生成的,如果

$$
\left(\left(o_{0}^{1},a_{0}^{1}\right),\ldots,\left(o_{k-1}^{1},a_{k-1}^{1}\right)\right)\succ\left(\left(o_{0}^{2},a_{0}^{2}\right),\ldots,\left(o_{k-1}^{2},a_{k-1}^{2}\right)\right)
$$

$$
\left(\left(o_{0}^{1},a_{0}^{1}\right),\ldots,\left(o_{k-1}^{1},a_{k-1}^{1}\right)\right)\succ\left(\left(o_{0}^{2},a_{0}^{2}\right),\ldots,\left(o_{k-1}^{2},a_{k-1}^{2}\right)\right)
$$

whenever

每当

$$
r\bigl(o_{0}^{1},a_{0}^{1}\bigr)+\cdots+r\bigl(o_{k-1}^{1},a_{k-1}^{1}\bigr)>r\bigl(o_{0}^{2},a_{0}^{2}\bigr)+\cdots+r\bigl(o_{k-1}^{2},a_{k-1}^{2}\bigr).
$$

$$
r\bigl(o_{0}^{1},a_{0}^{1}\bigr)+\cdots+r\bigl(o_{k-1}^{1},a_{k-1}^{1}\bigr)>r\bigl(o_{0}^{2},a_{0}^{2}\bigr)+\cdots+r\bigl(o_{k-1}^{2},a_{k-1}^{2}\bigr).
$$

If the human’s preferences are generated by a reward function $r$ , then our agent ought to receive a high total reward according to $r$ . So if we know the reward function $r$ , we can evaluate the agent quantitatively. Ideally the agent will achieve reward nearly as high as if it had been using RL to optimize $r$ .

如果人类的偏好是由奖励函数 $r$ 生成的,那么我们的智能体应该根据 $r$ 获得较高的总奖励。因此,如果我们知道奖励函数 $r$,我们就可以定量地评估智能体。理想情况下,智能体将获得几乎与使用强化学习 (RL) 优化 $r$ 时一样高的奖励。
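As a concrete check of this quantitative criterion, the following sketch (ours, not from the paper) compares two segments under a known reward function; `reward_fn` is an assumed callable implementing $r(o, a)$, and segments are sequences of (observation, action) pairs:

```python
def segment_return(segment, reward_fn):
    """Undiscounted sum of rewards over a trajectory segment."""
    return sum(reward_fn(o, a) for o, a in segment)


def preferred_under(reward_fn, sigma1, sigma2):
    """True iff sigma1 should be preferred to sigma2 when preferences
    are generated by reward_fn (the summed-reward criterion above)."""
    return segment_return(sigma1, reward_fn) > segment_return(sigma2, reward_fn)
```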

Qualitative: Sometimes we have no reward function by which we can quantitatively evaluate behavior (this is the situation where our approach would be practically useful). In these cases, all we can do is qualitatively evaluate how well the agent satisfies the human’s preferences. In this paper, we will start from a goal expressed in natural language, ask a human to evaluate the agent’s behavior based on how well it fulfills that goal, and then present videos of agents attempting to fulfill that goal.

定性评估:有时我们没有奖励函数来定量评估行为(这正是我们的方法在实际中会有用的场景)。在这些情况下,我们只能定性评估智能体满足人类偏好的程度。在本文中,我们将从自然语言表达的目标出发,要求人类根据智能体实现该目标的程度来评估其行为,然后展示智能体尝试实现该目标的视频。

Our model based on trajectory segment comparisons is very similar to the trajectory preference queries used in Wilson et al. (2012), except that we don’t assume that we can reset the system to an arbitrary state and so our segments generally begin from different states. This complicates the interpretation of human comparisons, but we show that our algorithm overcomes this difficulty even when the human raters have no understanding of our algorithm.

我们基于轨迹段比较的模型与 Wilson 等人 (2012) 中使用的轨迹偏好查询非常相似,不同之处在于我们不假设可以将系统重置到任意状态,因此我们的轨迹段通常从不同的状态开始。这使得人类比较的解释变得复杂,但我们展示了即使人类评分者不理解我们的算法,我们的算法也能克服这一困难。

2.2 Our Method

2.2 我们的方法

At each point in time our method maintains a policy $\pi:\mathcal{O}\rightarrow\mathcal{A}$ and a reward function estimate $\hat{r}:\mathcal{O}\times\mathcal{A}\rightarrow\mathbb{R}$, each parametrized by deep neural networks.

在每个时间点,我们的方法维护一个策略 $\pi:\mathcal{O}\rightarrow\mathcal{A}$ 和一个奖励函数估计 $\hat{r}:\mathcal{O}\times\mathcal{A}\rightarrow\mathbb{R}$,每个都由深度神经网络参数化。

These networks are updated by three processes:

这些网络通过三个过程进行更新:

1. The policy $\pi$ interacts with the environment to produce a set of trajectories $\{\tau^{1},\ldots,\tau^{i}\}$. The parameters of $\pi$ are updated by a traditional reinforcement learning algorithm, in order to maximize the sum of the predicted rewards $r_{t}=\hat{r}(o_{t},a_{t})$.
2. We select pairs of segments $(\sigma^{1},\sigma^{2})$ from the trajectories $\{\tau^{1},\ldots,\tau^{i}\}$ produced in step 1, and send them to a human for comparison.
3. The parameters of the mapping $\hat{r}$ are optimized via supervised learning to fit the comparisons collected from the human so far.

1. 策略 $\pi$ 与环境交互,产生一组轨迹 $\{\tau^{1},\ldots,\tau^{i}\}$。$\pi$ 的参数由传统的强化学习算法更新,以最大化预测奖励 $r_{t}=\hat{r}(o_{t},a_{t})$ 之和。
2. 我们从步骤 1 产生的轨迹 $\{\tau^{1},\ldots,\tau^{i}\}$ 中选取片段对 $(\sigma^{1},\sigma^{2})$,并将它们发送给人类进行比较。
3. 映射 $\hat{r}$ 的参数通过监督学习进行优化,以拟合迄今为止从人类处收集到的比较数据。

These processes run asynchronously, with trajectories flowing from process (1) to process (2), human comparisons flowing from process (2) to process (3), and parameters for $\hat{r}$ flowing from process (3) to process (1). The following subsections provide details on each of these processes.

这些进程异步运行,轨迹从进程 (1) 流向进程 (2),人类比较从进程 (2) 流向进程 (3),而 $\hat{r}$ 的参数从进程 (3) 流向进程 (1)。以下小节将详细介绍这些进程中的每一个。
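The sketch below (an illustration under our own simplifying assumptions, not the paper's implementation) mimics this asynchronous data flow with three threads and two queues; the RL update, the human query, and the supervised fit are all replaced by trivial stubs.

```python
import queue
import random
import threading
import time

trajectory_queue = queue.Queue()   # process (1) -> process (2): trajectory segments
comparison_queue = queue.Queue()   # process (2) -> process (3): human comparisons
reward_params = {"version": 0}     # process (3) -> process (1): parameters of r_hat


def policy_process():
    """Process (1): run the policy and emit trajectory segments (stubbed)."""
    while True:
        segment = [("obs", "act")] * 25            # placeholder segment of length 25
        trajectory_queue.put(segment)
        time.sleep(0.001)


def preference_process():
    """Process (2): show pairs of segments to a human and record the judgment (stubbed)."""
    while True:
        pair = (trajectory_queue.get(), trajectory_queue.get())
        label = random.choice([1, 2])              # stand-in for the human's preference
        comparison_queue.put((pair, label))


def reward_fitting_process():
    """Process (3): fit r_hat to the comparisons collected so far (stubbed)."""
    while True:
        comparison_queue.get()
        reward_params["version"] += 1              # pretend we updated r_hat's parameters


for fn in (policy_process, preference_process, reward_fitting_process):
    threading.Thread(target=fn, daemon=True).start()

time.sleep(0.1)
print("r_hat parameter version after 0.1 s:", reward_params["version"])
```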

2.2.1 Optimizing the Policy

2.2.1 策略优化

After using $\hat{r}$ to compute rewards, we are left with a traditional reinforcement learning problem. We can solve this problem using any RL algorithm that is appropriate for the domain. One subtlety is that the reward function $\hat{r}$ may be non-stationary, which leads us to prefer methods which are robust to changes in the reward function. This led us to focus on policy gradient methods, which have been applied successfully for such problems (Ho and Ermon, 2016).

在使用 $\hat{r}$ 计算奖励后,我们面临一个传统的强化学习问题。我们可以使用适用于该领域的任何强化学习算法来解决这个问题。一个微妙之处在于,奖励函数 $\hat{r}$ 可能是非平稳的,这促使我们倾向于使用对奖励函数变化具有鲁棒性的方法。这使我们专注于策略梯度方法,这些方法已成功应用于此类问题 (Ho and Ermon, 2016)。

In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. This is because TRPO relies on the trust region to ensure adequate exploration, which can lead to inadequate exploration if the reward function is changing.

在本文中,我们使用优势演员-评论家 (A2C; Mnih et al., 2016) 来玩 Atari 游戏,并使用信任区域策略优化 (TRPO; Schulman et al., 2015) 来执行模拟机器人任务。在每种情况下,我们都使用了已被证明在传统强化学习任务中表现良好的参数设置。我们唯一调整的超参数是 TRPO 的熵奖励。这是因为 TRPO 依赖信任区域来确保充分的探索,如果奖励函数发生变化,可能会导致探索不足。

We normalized the rewards produced by $\hat{r}$ to have zero mean and constant standard deviation. This is a typical preprocessing step which is particularly appropriate here since the position of the rewards is underdetermined by our learning problem.

我们将 $\hat{r}$ 生成的奖励归一化,使其均值为零且标准差恒定。这是一个典型的预处理步骤,在这里尤其合适,因为奖励的绝对位置并没有被我们的学习问题唯一确定。
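A minimal sketch of this normalization step (ours, not the paper's code); `target_std` is an assumed constant fixing the scale handed to the RL algorithm:

```python
import numpy as np


def normalize_rewards(raw_rewards: np.ndarray, target_std: float = 1.0) -> np.ndarray:
    """Shift and rescale predicted rewards to zero mean and a constant standard deviation."""
    centered = raw_rewards - raw_rewards.mean()
    return centered / (raw_rewards.std() + 1e-8) * target_std


print(normalize_rewards(np.array([0.3, 1.7, -2.0, 0.5])))
```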

2.2.2 Preference Elicitation

2.2.2 偏好引出

The human overseer is given a visualization of two trajectory segments, in the form of short movie clips. In all of our experiments, these clips are between 1 and 2 seconds long.

人类监督者会以短视频片段的形式看到两个轨迹段的可视化。在我们所有的实验中,这些片段的长度在1到2秒之间。

The human then indicates which segment they prefer, that the two segments are equally good, or that they are unable to compare the two segments.

然后,人类指示他们更喜欢哪个片段,两个片段同样好,或者他们无法比较这两个片段。

The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^{1},\sigma^{2},\mu)$, where $\sigma^{1}$ and $\sigma^{2}$ are the two segments and $\mu$ is a distribution over $\{1,2\}$ indicating which segment the user preferred. If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice. If the human marks the segments as equally preferable, then $\mu$ is uniform. Finally, if the human marks the segments as incomparable, then the comparison is not included in the database.

人类判断记录在一个数据库 $\mathcal{D}$ 中,其中包含三元组 $(\sigma^{1},\sigma^{2},\mu)$,其中 $\sigma^{1}$ 和 $\sigma^{2}$ 是两个片段,$\mu$ 是 $\{1,2\}$ 上的一个分布,表示用户偏好的片段。如果人类选择其中一个片段作为偏好,则 $\mu$ 将所有质量放在该选择上。如果人类将片段标记为同等偏好,则 $\mu$ 是均匀的。最后,如果人类将片段标记为不可比较,则该比较不包含在数据库中。
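The following sketch (an illustration, not the paper's code) shows how such a triple might be built from the human's response; the encoding of `judgment` as 1, 2, "equal" or "incomparable" is our own assumption:

```python
def make_triple(sigma1, sigma2, judgment):
    """Return (sigma1, sigma2, mu) for the database D, or None if incomparable."""
    if judgment == "incomparable":
        return None                  # not added to D
    if judgment == 1:
        mu = (1.0, 0.0)              # all mass on segment 1
    elif judgment == 2:
        mu = (0.0, 1.0)              # all mass on segment 2
    else:                            # "equal": segments judged equally preferable
        mu = (0.5, 0.5)              # uniform distribution over {1, 2}
    return (sigma1, sigma2, mu)


D = []
for s1, s2, j in [("a", "b", 1), ("c", "d", "equal"), ("e", "f", "incomparable")]:
    triple = make_triple(s1, s2, j)
    if triple is not None:
        D.append(triple)

print(D)   # [('a', 'b', (1.0, 0.0)), ('c', 'd', (0.5, 0.5))]
```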

2.2.3 Fitting the Reward Function

2.2.3 拟合奖励函数

We can interpret a reward function estimate $\hat{r}$ as a preference-predictor if we view $\hat{r}$ as a latent factor explaining the human’s judgments and assume that the human’s probability of preferring a segment $\sigma^{i}$ depends exponentially on the value of the latent reward summed over the length of the clip:

我们可以将奖励函数估计 $\hat{r}$ 解释为一个偏好预测器:将 $\hat{r}$ 视为解释人类判断的潜在因素,并假设人类偏好片段 $\sigma^{i}$ 的概率以指数方式取决于在片段全长上求和的潜在奖励值:

$$
\hat{P}\big[\sigma^{1}\succ\sigma^{2}\big]=\frac{\exp\sum\hat{r}\big(o_{t}^{1},a_{t}^{1}\big)}{\exp\sum\hat{r}\big(o_{t}^{1},a_{t}^{1}\big)+\exp\sum\hat{r}\big(o_{t}^{2},a_{t}^{2}\big)}.
$$

$$
\hat{P}\big[\sigma^{1}\succ\sigma^{2}\big]=\frac{\exp\sum\hat{r}\big(o_{t}^{1},a_{t}^{1}\big)}{\exp\sum\hat{r}\big(o_{t}^{1},a_{t}^{1}\big)+\exp\sum\hat{r}\big(o_{t}^{2},a_{t}^{2}\big)}.
$$

We choose $\hat{r}$ to minimize the cross-entropy loss between these predictions and the actual human labels:

我们选择 $\hat{r}$ 以最小化这些预测与实际人类标签之间的交叉熵损失:

$$
\mathrm{loss}(\hat{r})=-\sum_{(\sigma^{1},\sigma^{2},\mu)\in\mathcal{D}}\mu(1)\log\hat{P}\big[\sigma^{1}\succ\sigma^{2}\big]+\mu(2)\log\hat{P}\big[\sigma^{2}\succ\sigma^{1}\big].
$$

$$
\mathrm{loss}(\hat{r})=-\sum_{(\sigma^{1},\sigma^{2},\mu)\in\mathcal{D}}\mu(1)\log\hat{P}\big[\sigma^{1}\succ\sigma^{2}\big]+\mu(2)\log\hat{P}\big[\sigma^{2}\succ\sigma^{1}\big].
$$

This follows the Bradley-Terry model (Bradley and Terry, 1952) for estimating score functions from pairwise preferences, and is the specialization of the Luce-Shepard choice rule (Luce, 2005; Shepard, 1957) to preferences over trajectory segments.

这遵循了用于从成对偏好中估计评分函数的 Bradley-Terry 模型 (Bradley and Terry, 1952),并且是 Luce-Shepard 选择规则 (Luce, 2005; Shepard, 1957) 在轨迹段偏好上的特化。
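The two equations above can be written in a few lines of NumPy. This is a sketch of the loss (ours, not the paper's TensorFlow implementation); `reward_fn` plays the role of $\hat{r}$ and is an assumed callable from an (observation, action) pair to a scalar:

```python
import numpy as np


def preference_prob(reward_fn, sigma1, sigma2):
    """P_hat[sigma1 > sigma2]: softmax over the two summed predicted rewards."""
    r1 = sum(reward_fn(o, a) for o, a in sigma1)
    r2 = sum(reward_fn(o, a) for o, a in sigma2)
    m = max(r1, r2)                               # subtract the max for numerical stability
    e1, e2 = np.exp(r1 - m), np.exp(r2 - m)
    return e1 / (e1 + e2)


def preference_loss(reward_fn, dataset):
    """Cross-entropy between predicted preferences and the human labels mu."""
    loss = 0.0
    for sigma1, sigma2, mu in dataset:            # mu = (mu(1), mu(2))
        p1 = preference_prob(reward_fn, sigma1, sigma2)
        loss -= mu[0] * np.log(p1 + 1e-12) + mu[1] * np.log(1.0 - p1 + 1e-12)
    return loss
```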

Our actual algorithm incorporates a number of modifications to this basic approach, which early experiments discovered to be helpful and which are analyzed in Section 3.3:

我们的实际算法对这一基本方法进行了多项修改,这些修改在早期实验中被发现是有帮助的,并在第 3.3 节中进行了分析:

• We fit an ensemble of predictors, each trained on $|\mathcal{D}|$ triples sampled from $\mathcal{D}$ with replacement. The estimate $\hat{r}$ is defined by independently normalizing each of these predictors and then averaging the results.

• 我们拟合一个由多个预测器组成的集成,其中每个预测器都在从 $\mathcal{D}$ 中有放回采样得到的 $|\mathcal{D}|$ 个三元组上训练。估计值 $\hat{r}$ 通过对这些预测器分别独立归一化后取平均来定义。

• Rather than applying a softmax directly as described in Equation 1, we assume there is a $10\%$ chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn’t decay to 0 as the difference in reward becomes extreme (see the short sketch after this list).

• 我们假设有 $10\%$ 的概率人类会随机均匀响应,而不是直接应用公式 1 中的 softmax。从概念上讲,这种调整是必要的,因为人类评分者存在一个恒定的出错概率,这种概率不会随着奖励差异的极端化而衰减到 0。
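A short numeric sketch of the adjustment in the last bullet (ours, not the paper's code): the modeled probability is a mixture of the Bradley-Terry prediction and a uniform response, so it never exceeds $0.95$ when the assumed error rate is $10\%$.

```python
def adjusted_preference_prob(softmax_prob: float, error_rate: float = 0.1) -> float:
    """Mix the Bradley-Terry probability with a uniformly random response."""
    return error_rate * 0.5 + (1.0 - error_rate) * softmax_prob


print(adjusted_preference_prob(0.99))   # 0.941: capped well below 1 even for extreme reward gaps
```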

2.2.4 Selecting Queries

2.2.4 查询选择

We decide how to query preferences based on an approximation to the uncertainty in the reward function estimator, similar to Daniel et al. (2014): we sample a large number of pairs of trajectory segments of length $k$ from the latest agent-environment interactions, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members. This is a crude approximation and the ablation experiments in Section 3 show that in some tasks it actually impairs performance. Ideally, we would want to query based on the expected value of information of the query (Akrour et al., 2012; Krueger et al., 2016), but we leave it to future work to explore this direction further.

我们根据奖励函数估计器的不确定性近似值来决定如何查询偏好,类似于 Daniel 等人 (2014) 的方法:我们从最新的智能体-环境交互中采样大量长度为 $k$ 的轨迹片段对,使用集成中的每个奖励预测器来预测每对中哪个片段会被偏好,然后选择那些在集成成员中预测方差最大的轨迹。这是一个粗略的近似,第 3 节中的消融实验表明,在某些任务中,它实际上会损害性能。理想情况下,我们希望基于查询的预期信息价值进行查询 (Akrour 等人, 2012; Krueger 等人, 2016),但我们将其留给未来的工作来进一步探索这个方向。
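The sketch below (ours, not the paper's code) illustrates this selection heuristic: each ensemble member votes on which segment of a pair it expects the human to prefer, and the pairs with the highest disagreement are sent to the human. `ensemble` is assumed to be a list of callables mapping a segment to its predicted total reward.

```python
import numpy as np


def select_queries(candidate_pairs, ensemble, num_queries):
    """Return the candidate pairs whose preference predictions vary most across the ensemble."""
    variances = []
    for sigma1, sigma2 in candidate_pairs:
        # 1.0 if this member predicts sigma1 would be preferred, else 0.0
        votes = [float(member(sigma1) > member(sigma2)) for member in ensemble]
        variances.append(np.var(votes))
    ranked = np.argsort(variances)[::-1]           # highest-variance pairs first
    return [candidate_pairs[i] for i in ranked[:num_queries]]
```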

3 Experimental Results

3 实验结果

We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016).

我们在 TensorFlow (Abadi et al., 2016) 中实现了我们的算法。我们通过 OpenAI Gym (Brockman et al., 2016) 与 MuJoCo (Todorov et al., 2012) 和 Arcade Learning Environment (Bellemare et al., 2013) 进行交互。

3.1 Reinforcement Learning Tasks with Unobserved Rewards

3.1 未观测奖励的强化学习任务

In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. Instead, the agent learns about the goal of the task only by asking a human which of two trajectory segments is better. Our goal is to solve the task in a reasonable amount of time using as few queries as possible.

在我们的第一组实验中,我们尝试解决一系列深度强化学习(RL)的基准任务,而不观察真实的奖励。相反,AI智能体仅通过询问人类两个轨迹片段中哪个更好来了解任务的目标。我们的目标是在合理的时间内使用尽可能少的查询来解决任务。

In our experiments, feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task (see Appendix B for the exact instructions given to contractors). Each trajectory segment is between 1 and 2 seconds long. Contractors responded to the average query in 3-5 seconds, and so the experiments involving real human feedback required between 30 minutes and 5 hours of human time.

在我们的实验中,反馈由承包商提供,他们在被要求比较数百到数千对任务轨迹段之前,会获得每个任务的1-2句描述(参见附录B中提供给承包商的确切说明)。每个轨迹段的长度在1到2秒之间。承包商平均在3-5秒内响应每个查询,因此涉及真实人类反馈的实验需要30分钟到5小时的人类时间。

For comparison, we also run experiments using a synthetic oracle whose preference