[论文翻译]在智能体中创建需求的分层配置


原文地址:https://arxiv.org/pdf/2412.00044v1


Creating Hierarchical Dispositions of Needs in an Agent

在智能体中创建需求的分层配置

Abstract—We present a novel method for learning hierarchical abstractions that prioritize competing objectives, leading to improved global expected rewards. Our approach employs a secondary rewarding agent with multiple scalar outputs, each associated with a distinct level of abstraction. The traditional agent then learns to maximize these outputs in a hierarchical manner, conditioning each level on the maximization of the preceding level. We derive an equation that orders these scalar values and the global reward by priority, inducing a hierarchy of needs that informs goal formation. Experimental results on the Pendulum-v1 environment demonstrate superior performance compared to a baseline implementation, achieving state-of-the-art results.

摘要—我们提出了一种新颖的学习层次化抽象方法,该方法能优先处理相互竞争的目标,从而提升全局预期奖励。我们的方法采用了一个具有多标量输出的次级奖励智能体,每个输出对应不同抽象层级。传统智能体则以层级化方式学习最大化这些输出,使每个层级的优化都基于前一层级的最大化。我们推导出了一个按优先级排序这些标量值与全局奖励的方程,由此构建的需求层级可为目标形成提供依据。在Pendulum v1环境中的实验结果表明,该方法优于基线实现,达到了当前最优水平。

I. INTRODUCTION

I. 引言

This paper presents a novel approach to inducing hierarchical reward structures in artificial agents. Our method involves introducing a secondary rewarding agent that parallels the traditional agent, receiving identical state inputs. The rewarding agent features a continuous action output layer, wherein the outputs serve as signals rather than control inputs.

本文提出了一种在人工智能体中诱导分层奖励结构的新方法。我们的方法引入了一个与传统智能体并行的次级奖励智能体,接收相同的状态输入。该奖励智能体具有连续动作输出层,其输出作为信号而非控制输入。

We propose an equation that integrates these signals, yielding a reward signal that is used to reinforce the traditional agent. This framework is designed to elicit a hierarchical organization of needs within the traditional agent, promoting more effective and efficient learning.

我们提出一个整合这些信号的方程,生成一个用于强化传统智能体的奖励信号。该框架旨在激发传统智能体内需求的分层组织,从而促进更高效的学习。

The complexity of a policy learned by reinforcement learning (RL) algorithms is inherently bounded by the complexity of the reward function. Consequently, significant efforts have been devoted to crafting intricate reward functions that can guide RL agents towards sophisticated behaviors. In contrast, humans and other animals appear to develop complex behaviors through a hierarchical process, wherein an initially simple reward function focused on fundamental drives such as pain avoidance and pleasure seeking serves as the foundation for a layered structure of dispositions.

强化学习(RL)算法习得策略的复杂度本质上受限于奖励函数的复杂度。因此,研究者们投入大量精力设计复杂的奖励函数,以引导RL智能体产生精细行为。相比之下,人类和其他动物似乎通过分层过程发展复杂行为:最初以避痛趋乐等基本驱力为核心的简单奖励函数,会演化为分层倾向性结构的基础。

Each level in this hierarchy is oriented towards satisfying the preceding levels, ultimately referencing the base reward function.

该层级中的每一层都旨在满足前一层级的需求,最终指向基础奖励函数。

The mechanisms underlying this process remain unclear. However, if we could induce artificial agents to learn hierarchical reward functions, it would enable the specification of simple base reward functions, allowing the algorithm to autonomously develop complex goals. A hierarchical reward function would confer upon the agent the capacity to pursue intricate objectives. The hierarchical structure of human needs has been extensively studied, yielding frameworks such as Maslow’s Hierarchy of Needs. This hierarchy progresses from fundamental, essential needs to more abstract and complex requirements, including self-actualization.

这一过程的潜在机制尚不明确。然而,若能引导AI智能体学习分层奖励函数 (hierarchical reward functions),就能通过指定简单的基础奖励函数,使算法自主发展出复杂目标。分层奖励函数将赋予智能体追求复杂目标的能力。人类需求的分层结构已被广泛研究,产生了诸如马斯洛需求层次理论等框架。该层次从基本生存需求逐步过渡到更抽象复杂的自我实现需求。

Our approach offers two primary benefits: enhanced stability throughout the training process and improved accuracy in the learned policy.

我们的方法具有两大主要优势:训练过程稳定性更高,所学策略的准确性更优。

II. BACKGROUND

II. 背景

A. Markov decision process

A. 马尔可夫决策过程

We formulate our model of continuous control reinforcement learning within the framework of a finite Markov Decision Process (MDP). An MDP is defined by the tuple $M=\langle S,A,s_{0},r\rangle$, where $S$ denotes the state space, $A$ denotes the action space, $s_{0}\in S$ denotes the initial state, and $r(s,a):S\times A\longrightarrow\mathbb{R}$ denotes the reward function, which assigns a scalar value to each state-action pair. At each time step $t$, the agent selects an action $a_{t}$ according to a policy $\pi:S\longrightarrow A$, which can be either stochastic or deterministic. A stochastic policy is defined as a probability distribution over actions given a state, $\pi(a\mid s):S\longrightarrow P(A)$, where $P(A)$ denotes the set of probability distributions over $A$. The objective of the agent is to maximize its future expected reward: $\max_{\pi}\mathbb{E}\big[\sum_{t=0}^{\infty}r(s_{t},a_{t})\big]$.

我们在有限马尔可夫决策过程(MDP)的框架下构建了连续控制强化学习模型。MDP由元组 $M=\langle S,A,s_{0},r\rangle$ 定义,其中$S$表示状态空间,$A$表示动作空间,$s_{0}\in S$表示初始状态,$r(s,a):S\times A\longrightarrow\mathbb{R}$表示奖励函数,为每个状态-动作对分配标量值。在每一时间步$t$,智能体根据策略$\pi:S\longrightarrow A$选择动作$a_{t}$,该策略可以是随机或确定性的。随机策略定义为给定状态下动作的概率分布$\pi(a\mid s):S\longrightarrow P(A)$,其中$P(A)$表示$A$上的概率分布集。智能体的目标是最大化未来期望奖励:$\max_{\pi}\mathbb{E}\big[\sum_{t=0}^{\infty}r(s_{t},a_{t})\big]$。
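For concreteness, the sketch below (not from the paper) estimates this expected-return objective by Monte-Carlo rollouts; `env` and `policy` are assumed placeholders with a Gymnasium-style interface.

```python
# Minimal sketch (not from the paper): estimating the expected return of a
# policy in a Gymnasium-style MDP by averaging over sampled episodes.

def estimate_return(env, policy, episodes=10, gamma=0.99):
    """Average discounted return over a few sampled episodes."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done, ret, discount = False, 0.0, 1.0
        while not done:
            action = policy(obs)                               # pi: S -> A
            obs, reward, terminated, truncated, _ = env.step(action)
            ret += discount * reward                           # accumulate r(s_t, a_t)
            discount *= gamma
            done = terminated or truncated
        total += ret
    return total / episodes
```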

B. Policy Gradient Methods

B. 策略梯度方法

Policy gradient methods are a type of reinforcement learning algorithm that learns to optimize the policy directly, rather than learning the value function. The policy gradient theorem provides the foundation for policy gradient methods. It states that the gradient of the expected cumulative reward with respect to the policy parameters can be computed as:

策略梯度方法是一种直接优化策略而非学习价值函数的强化学习算法。策略梯度定理为这类方法提供了理论基础,其核心公式表明:预期累积奖励关于策略参数的梯度可表示为:

$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{s\sim\mu_{\pi},\,a\sim\pi}\Big[\nabla_{\theta}\log\pi(a\mid s)\,Q_{\pi}(s,a)\Big],
$$

where $J(\pi_{\theta})$ is the expected cumulative reward, $\pi_{\theta}$ is the policy parameterized by $\theta$, $\tau$ is a trajectory sampled from the policy, $s_{t}$ and $a_{t}$ are the state and action at time $t$, $Q_{\pi_{\theta}}(s_{t},a_{t})$ is the action-value function, and $\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})$ is the gradient of the log-probability of the action.

其中 $J(\pi_{\theta})$ 是期望累积奖励,$\pi_{\theta}$ 是由 $\theta$ 参数化的策略,$\tau$ 是从策略中采样的轨迹,$s_{t}$ 和 $a_{t}$ 是时刻 $t$ 的状态和动作,$Q_{\pi_{\theta}}(s_{t},a_{t})$ 是动作价值函数,$\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})$ 是动作对数概率的梯度。
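As a hedged illustration of this estimator (not code from the paper), the snippet below builds a single-sample surrogate whose gradient matches the expression above; a `policy_net` returning a Gaussian mean and log-standard-deviation is an assumption.

```python
# Illustrative sketch: one-sample policy gradient surrogate in PyTorch.
# `q_value` stands in for Q_pi(s, a); in practice it comes from a critic
# or an advantage estimate.
import torch

def policy_gradient_loss(policy_net, state, action, q_value):
    mean, log_std = policy_net(state)                  # assumed Gaussian policy head
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_prob = dist.log_prob(action).sum(-1)           # log pi_theta(a|s)
    # Minimising the negative weighted log-probability ascends J(theta).
    return -(log_prob * q_value.detach()).mean()
```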

1) Actor-Critic Methods: Actor-critic methods are a type of policy gradient method that uses an actor to represent the policy and a critic to estimate the value function. The actor and critic are updated simultaneously using the policy gradient theorem. In this work we use a PPO implementation based on an actor-critic network.

1) 演员-评论家方法 (Actor-Critic Methods): 演员-评论家方法是一种策略梯度方法,它使用演员来表示策略,评论家来估计价值函数。演员和评论家通过策略梯度定理同时更新。在本工作中,我们使用了基于演员-评论家网络的PPO实现。

2) Proximal Policy Optimization: We implement a policy gradient method using a truncated version of the generalized advantage estimator (GAE). The GAE is computed as:

2) 近端策略优化 (Proximal Policy Optimization): 我们采用截断版广义优势估计器 (GAE) 实现策略梯度方法。GAE计算公式如下:

$$
\hat{A}_{t}=\delta_{t}+(\gamma\lambda)\delta_{t+1}+\cdots+(\gamma\lambda)^{T-t+1}\delta_{T-1},
$$

where

其中

$$
\delta_{t}=r_{t}+\gamma V\big(s_{t+1}\big)-V\big(s_{t}\big).
$$
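A minimal sketch of this truncated estimator, under the assumption that `rewards` holds $T$ rewards and `values` holds $T+1$ value estimates:

```python
# Sketch of the truncated GAE above (an illustration, not the authors' code).
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
        gae = delta + gamma * lam * gae                          # recursive form of A_hat_t
        advantages[t] = gae
    return advantages
```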

The policy is run for $T$ timesteps, with $T$ less than the episode size. We use the standard notation for the discount factor $\gamma$ and GAE parameter $\lambda$ . To perform a policy update, each of $N$ parallel actors collects $T$ timesteps of data. We then construct the surrogate loss on these $N T$ timesteps of data and optimize it using the ADAM algorithm with a learning rate $\alpha$ . We use mini-batches of size $m\le N T$ for $K$ epochs.

策略运行 $T$ 个时间步长,其中 $T$ 小于回合长度。我们采用标准符号表示折扣因子 $\gamma$ 和 GAE 参数 $\lambda$。执行策略更新时,$N$ 个并行执行体各收集 $T$ 个时间步长的数据,随后基于这 $N T$ 个时间步长的数据构建替代损失函数,并使用学习率为 $\alpha$ 的 ADAM 算法进行优化。训练过程采用小批量数据(批量大小 $m\le N T$)进行 $K$ 个周期迭代。
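The update schedule described above might be organized roughly as follows (a sketch under assumptions; the actual structure of the referenced implementation may differ):

```python
# Rough outline of the update schedule: N*T collected timesteps, K epochs of
# Adam updates on minibatches of size m. `loss_fn` is the surrogate loss and
# `optimizer` is assumed to be Adam with learning rate alpha.
import torch

def ppo_update(loss_fn, batch, optimizer, m=64, K=10):
    n = batch["obs"].shape[0]                  # n = N * T timesteps
    for _ in range(K):                         # K epochs over the same data
        perm = torch.randperm(n)
        for start in range(0, n, m):           # minibatches of size m <= N*T
            idx = perm[start:start + m]
            minibatch = {k: v[idx] for k, v in batch.items()}
            optimizer.zero_grad()
            loss_fn(minibatch).backward()
            optimizer.step()
```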

We use a combined loss function that includes the policy surrogate, value function error term, and entropy term:

我们使用包含策略替代、价值函数误差项和熵项的复合损失函数:

$$
L_{t}^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_{t}\Big[L_{t}^{CLIP}(\theta)-c_{1}L_{t}^{VF}(\theta)+c_{2}S[\pi_{\theta}](s_{t})\Big],
$$

where $S$ denotes the entropy bonus, $L_{t}^{VF}$ is the value function squared-error loss, and $c_{1}$ and $c_{2}$ are coefficients for the value function loss and entropy term, respectively.

其中$S$表示熵奖励,$L_{t}^{VF}$为价值函数平方误差损失,$c_{1}$和$c_{2}$分别为价值函数损失项和熵项的系数。
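A hedged sketch of this combined objective, with the clipped surrogate written out explicitly; the clip range `eps` and the tensor names are illustrative assumptions rather than the paper's code.

```python
# Combined PPO objective: clipped surrogate minus weighted value loss plus
# entropy bonus, negated so a minimizer performs gradient ascent on it.
import torch

def ppo_loss(new_logp, old_logp, adv, values, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    ratio = (new_logp - old_logp).exp()                     # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    l_clip = torch.min(ratio * adv, clipped * adv).mean()   # L^CLIP
    l_vf = ((values - returns) ** 2).mean()                 # L^VF squared-error loss
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())
```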

III. HIERARCHICAL REWARD FUNCTIONS

III. 分层奖励函数

When creating a reward function that fosters a hierarchical structure of dispositions, it is crucial to establish a sensitive relationship between variables. This can be achieved through a sequential approach. The reward function’s equation provides insight into this relationship:

在创建奖励函数以培养分层行为倾向时,建立变量间的敏感关系至关重要。这可以通过顺序方法实现。奖励函数的公式揭示了这种关系:

$$
r = R\,r_{1} + R,
$$

all the while training the critic solely with the global reward $R$.

同时,在训练评论家时仅使用全局奖励 $R$。

Here $r$ represents the final scalar reward value derived from the sole output of the rewarding agent, $R$ denotes the global reward at time step $t$, and $r_{1}$ symbolizes the reward component derived from the reward critic's sole output.

其中 $r$ 表示从奖励智能体的唯一输出中获得的最终标量奖励值,$R$ 表示时间步 $t$ 的全局奖励,$r_{1}$ 代表从奖励评论者唯一输出中获得的奖励分量。

A closer examination of the equation reveals that the final reward value $r$ can only be increased by considering the impact of $r_{1}$ on $R$. Since $r_{1}$ and $R$ are correlated, actions that optimize $r_{1}$ at the expense of $R$ will lead to suboptimal values of the final reward $r$.

仔细审视该方程可以发现,最终奖励值 $r$ 只能通过考虑 $r_{1}$ 对 $R$ 的影响来提升。由于 $r_{1}$ 与 $R$ 存在相关性,以牺牲 $R$ 为代价优化 $r_{1}$ 的行为将导致最终奖励 $r$ 的次优值。

This correlation between $r_{1}$ and $R$ sets up a two-stage hierarchical structure:

$r_{1}$ 与 $R$ 之间的这种相关性建立了一个两阶段层次结构:

Optimizing $r_{1}$: The agent must perform actions that optimize the reward component $r_{1}$.

优化 $r_{1}$:智能体必须执行能优化奖励分量 $r_{1}$ 的动作。

Optimizing $R$: Simultaneously, the agent must ensure that these actions also optimize the global reward $R$, ultimately leading to an optimal final reward $r$.

优化 $R$:同时,智能体必须确保这些动作也能优化全局奖励 $R$,最终获得最优的最终奖励 $r$。

This hierarchical structure encourages the agent to develop a nuanced understanding of the relationships between variables and to make decisions that balance competing objectives.

这种层级结构鼓励智能体深入理解变量间的关系,并做出平衡多重竞争目标的决策。
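The paper gives no reference implementation; the following is one possible reading of how the shaped signal could be computed and of the critic target using only $R$. The module name `reward_net` and the exact combination follow the equation as reconstructed above and are assumptions.

```python
# Interpretation-level sketch of Section III (not the authors' code): the
# acting agent is reinforced with the combined signal r, while the value
# critic is regressed on the global environment reward R only. `reward_net`
# is an assumed module whose single scalar output plays the role of r_1.

def combined_reward(reward_net, state, global_reward_R):
    r1 = reward_net(state).squeeze(-1)           # secondary agent's scalar output
    r = global_reward_R * r1 + global_reward_R   # combined form given above (an assumption)
    return r

def critic_targets(global_rewards_R, values, gamma=0.99):
    # The critic's bootstrapped target uses the global reward R, not the shaped reward r.
    return global_rewards_R + gamma * values[1:]
```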

IV. EXPERIMENTS

IV. 实验

This section presents the results of our experimental evaluation of the proposed hierarchical reward function on a continuous control problem from the OpenAI Gym suite: the Pendulum-v1 environment with a low-dimensional state space.

本节展示了我们在OpenAI Gym套件中的连续控制问题上对所提出的分层奖励函数进行实验评估的结果:低维状态空间的Pendulum-v1环境。

The architecture of our experimental setup for the Pendulum-v1 environment consisted of a neural network with a final layer outputting a 1-dimensional real-valued vector. Our implementation of the Proximal Policy Optimization (PPO) algorithm was based on a publicly available GitHub repository.

我们在Pendulum-v1环境中的实验架构由一个神经网络组成,其最后一层输出一个1维实值向量。我们对近端策略优化(PPO)算法的实现基于一个公开的GitHub代码库。
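Since the repository is not named, the following actor-critic module is only an illustrative assumption consistent with the stated output dimensionality.

```python
# Illustrative actor-critic architecture (an assumption; the exact layers of
# the referenced repository are not specified): a 3-dimensional Pendulum
# observation in, a 1-dimensional action mean and a scalar value estimate out.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=3, act_dim=1, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim))           # final 1-dimensional real-valued output
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))                 # state-value estimate
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return self.actor(obs), self.log_std, self.critic(obs)
```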

For each environment, we trained five models using different random seeds for a fixed total number of time steps. Following completion of training, each model was evaluated over 100 consecutive episodes to assess its performance.

对于每个环境,我们使用不同的随机种子训练了五个模型,总训练步数固定。训练完成后,每个模型通过连续100轮次进行评估以衡量其性能。

The performance of each model was evaluated using the cumulative reward obtained over the 100 evaluation episodes. This metric provides a comprehensive assessment of the model’s ability to maximize the reward function while adapting to the environment’s dynamics.

每个模型的性能通过100次评估回合中获得的累计奖励来评估。该指标全面衡量了模型在适应环境动态的同时最大化奖励函数的能力。
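A sketch of this evaluation protocol using the Gymnasium API; `policy` is an assumed deterministic mapping from observations to torques.

```python
# Evaluation over 100 consecutive episodes, reporting the cumulative reward.
import gymnasium as gym

def evaluate(policy, episodes=100):
    env = gym.make("Pendulum-v1")
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward                       # cumulative reward metric
            done = terminated or truncated
    env.close()
    return total
```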

A. Pendulum-v1

A. Pendulum-v1

The Pendulum-v1 environment is a well-established continuous control task from the OpenAI Gym suite. The primary objective of this task is to stabilize a pendulum by applying a torque, effectively balancing the pole in an upright position.

Pendulum-v1环境是OpenAI Gym套件中一个成熟的连续控制任务。该任务的主要目标是通过施加扭矩来稳定钟摆,有效地将杆保持在直立位置。

The Pendulum-v1 environment is characterized by:

- An unbounded, 3-dimensional observation space
- A 1-dimensional action space, where actions represent the torque applied to the pendulum
- Bounded actions within the interval $[-2,2]$

Pendulum-v1环境的特点包括:

- 无界的3维观测空间
- 1维动作空间,动作表示施加在钟摆上的扭矩
- 动作被限制在区间 $[-2,2]$ 内
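For reference, the spaces described above can be inspected directly (a small illustrative snippet, not from the paper).

```python
# Quick check of the Pendulum-v1 spaces using Gymnasium.
import gymnasium as gym

env = gym.make("Pendulum-v1")
print(env.observation_space)   # 3-dimensional observation
print(env.action_space)        # 1-dimensional torque action bounded in [-2, 2]
env.close()
```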