Creating Hierarchical Dispositions of Needs in an Agent
Abstract—We present a novel method for learning hierarchical abstractions that prioritize competing objectives, leading to improved global expected rewards. Our approach employs a secondary rewarding agent with multiple scalar outputs, each associated with a distinct level of abstraction. The traditional agent then learns to maximize these outputs in a hierarchical manner, conditioning each level on the maximization of the preceding level. We derive an equation that orders these scalar values and the global reward by priority, inducing a hierarchy of needs that informs goal formation. Experimental results on the Pendulum-v1 environment demonstrate superior performance compared to a baseline implementation, achieving state-of-the-art results.
I. INTRODUCTION
This paper presents a novel approach to inducing hierarchical reward structures in artificial agents. Our method involves introducing a secondary rewarding agent that parallels the traditional agent, receiving identical state inputs. The rewarding agent features a continuous action output layer, wherein the outputs serve as signals rather than control inputs.
We propose an equation that integrates these signals, yielding a reward signal that is used to reinforce the traditional agent. This framework is designed to elicit a hierarchical organization of needs within the traditional agent, promoting more effective and efficient learning.
The complexity of a policy learned by reinforcement learning (RL) algorithms is inherently bounded by the complexity of the reward function. Consequently, significant efforts have been devoted to crafting intricate reward functions that can guide RL agents towards sophisticated behaviors. In contrast, humans and other animals appear to develop complex behaviors through a hierarchical process, wherein an initially simple reward function focused on fundamental drives such as pain avoidance and pleasure seeking serves as the foundation for a layered structure of dispositions.
Each level in this hierarchy is oriented towards satisfying the preceding levels, ultimately referencing the base reward function.
The mechanisms underlying this process remain unclear. However, if we could induce artificial agents to learn hierarchical reward functions, it would enable the specification of simple base reward functions, allowing the algorithm to autonomously develop complex goals. A hierarchical reward function would confer upon the agent the capacity to pursue intricate objectives. The hierarchical structure of human needs has been extensively studied, yielding frameworks such as Maslow’s Hierarchy of Needs. This hierarchy progresses from fundamental, essential needs to more abstract and complex requirements, including self-actualization.
Our approach offers two primary benefits: enhanced stability throughout the training process and improved accuracy in the learned policy.
II. BACKGROUND
A. Markov decision process
We formulate our model of continuous control reinforcement learning within the framework of a finite Markov Decision Process (MDP). An MDP is defined by the tuple $M=\langle S,A,s_{0},r\rangle$, where $S$ denotes the state space, $A$ denotes the action space, $s_{0}\in S$ denotes the initial state, and $r(s,a):S\times A\longrightarrow\mathbb{R}$ denotes the reward function, which assigns a scalar value to each state-action pair. At each time step $t$, the agent selects an action $a_{t}$ according to a policy $\pi:S\longrightarrow A$, which can be either stochastic or deterministic. A stochastic policy is defined as a probability distribution over actions given a state, $\pi(a\mid s):S\longrightarrow P(A)$, where $P(A)$ denotes the set of probability distributions over $A$. The objective of the agent is to maximize its future expected reward: $\max_{\pi}\mathbb{E}\big[\sum_{t=0}^{\infty}r(s_{t},a_{t})\big]$.
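A minimal sketch of this interaction loop, assuming the Gymnasium API and using the Pendulum-v1 task studied later; a random policy stands in for $\pi$:

```python
import gymnasium as gym

# One episode of the MDP interaction loop: observe s_t, pick a_t, receive r(s_t, a_t).
env = gym.make("Pendulum-v1")
state, _ = env.reset(seed=0)

total_reward = 0.0
for t in range(200):
    action = env.action_space.sample()               # stand-in for a_t ~ pi(a | s_t)
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward                           # running sum of r(s_t, a_t)
    if terminated or truncated:
        break

print(f"episode return: {total_reward:.2f}")
```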
B. Policy Gradient Methods
Policy gradient methods are a type of reinforcement learning algorithm that learns to optimize the policy directly, rather than learning the value function. The policy gradient theorem provides the foundation for policy gradient methods. It states that the gradient of the expected cumulative reward with respect to the policy parameters can be computed as:
$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{s\sim\mu_{\pi},\,a\sim\pi}\Big[\nabla_{\theta}\log\pi(a\mid s)\,Q_{\pi}(s,a)\Big],
$$
where $J(\theta)$ is the expected cumulative reward, $\pi_{\theta}$ is the policy parameterized by $\theta$, $\tau$ is a trajectory sampled from the policy, $s_{t}$ and $a_{t}$ are the state and action at time $t$, $Q_{\pi_{\theta}}(s_{t},a_{t})$ is the action-value function, and $\nabla_{\theta}\log\pi_{\theta}(a_{t}\mid s_{t})$ is the gradient of the log-probability of the action.
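As a minimal sketch, assuming PyTorch tensors of sampled log-probabilities and action-value estimates, the surrogate objective implied by the theorem can be formed as follows (minimizing the negative expectation ascends the gradient above):

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, q_values: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo surrogate for -J(theta) over sampled (s_t, a_t) pairs.

    log_probs: log pi_theta(a_t | s_t) for the actions actually taken, shape [T]
    q_values:  estimates of Q_pi(s_t, a_t) (e.g. sampled returns), shape [T]
    """
    # Minimizing this loss follows the policy gradient of the theorem above.
    return -(log_probs * q_values.detach()).mean()
```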
- Actor-Critic Methods: Actor-critic methods are a type of policy gradient method that uses an actor to represent the policy and a critic to estimate the value function. The actor and critic are updated simultaneously using the policy gradient theorem. In this work we used a PPO implementation based on an actor-critic network.
- Proximal Policy Optimization: We implement a policy gradient method using a truncated version of the generalized advantage estimator (GAE). The GAE is computed as:
$$
\hat{A}_{t}=\delta_{t}+(\gamma\lambda)\delta_{t+1}+\cdots+(\gamma\lambda)^{T-t+1}\delta_{T-1},
$$
where
$$
\delta_{t}=r_{t}+\gamma V\big(s_{t+1}\big)-V\big(s_{t}\big).
$$
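A minimal sketch of the truncated GAE recursion implied by the two equations above, assuming a $T$-step rollout with an extra bootstrap value $V(s_T)$:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimates over a T-step rollout.

    rewards: r_t for t = 0, ..., T-1
    values:  V(s_t) for t = 0, ..., T (one extra bootstrap value V(s_T))
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t
        gae = delta + gamma * lam * gae                          # accumulates (gamma*lambda)^k terms
        advantages[t] = gae
    return advantages
```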
The policy is run for $T$ timesteps, with $T$ less than the episode size. We use the standard notation for the discount factor $\gamma$ and GAE parameter $\lambda$ . To perform a policy update, each of $N$ parallel actors collects $T$ timesteps of data. We then construct the surrogate loss on these $N T$ timesteps of data and optimize it using the ADAM algorithm with a learning rate $\alpha$ . We use mini-batches of size $m\le N T$ for $K$ epochs.
We use a combined loss function that includes the policy surrogate, value function error term, and entropy term:
$$
L_{t}^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_{t}\Big[L_{t}^{CLIP}(\theta)-c_{1}L_{t}^{VF}(\theta)+c_{2}S[\pi_{\theta}](s_{t})\Big],
$$
where $S$ denotes the entropy bonus, $L_{t}^{VF}$ is the value-function squared-error loss, and $c_{1}$ and $c_{2}$ are coefficients for the value function loss and entropy term, respectively.
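A minimal sketch of this combined objective, assuming PyTorch and the standard clipped-ratio form of $L^{CLIP}$ with clipping parameter $\epsilon$ (function and argument names are ours):

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Negated combined objective L^{CLIP+VF+S} for gradient descent."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = torch.min(ratio * advantages, clipped * advantages).mean()  # L^CLIP
    l_vf = F.mse_loss(values, returns)                                   # value-function error
    l_s = entropy.mean()                                                 # entropy bonus S
    return -(l_clip - c1 * l_vf + c2 * l_s)
```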
III. HIERARCHICAL REWARD FUNCTIONS
When creating a reward function that fosters a hierarchical structure of dispositions, it is crucial to establish a sensitive relationship between variables. This can be achieved through a sequential approach. The reward function’s equation provides insight into this relationship:
$$
r=R\,r_{1}+R,
$$
where $r$ is the final scalar reward used to reinforce the traditional agent, $R$ denotes the global reward at time step $t$, and $r_{1}$ is the reward component derived from the rewarding agent's sole output. Throughout, the critic is trained solely with the global reward $R$.

A closer examination of the equation reveals that the final reward value $r$ can only be increased by considering the impact of $r_{1}$ on $R$. Since $r_{1}$ and $R$ are correlated, actions that optimize $r_{1}$ at the expense of $R$ will lead to suboptimal values of the final reward $r$.

This correlation between $r_{1}$ and $R$ sets up a two-stage hierarchical structure:

- Optimizing $r_{1}$: The agent must perform actions that optimize the reward component $r_{1}$.
- Optimizing $R$: Simultaneously, the agent must ensure that these actions also optimize the global reward $R$, ultimately leading to an optimal final reward $r$.

This hierarchical structure encourages the agent to develop a nuanced understanding of the relationships between variables and to make decisions that balance competing objectives.
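A minimal sketch of how this composition might be applied at each time step, assuming the environment's reward $R$ and the rewarding agent's single output $r_1$ are available as scalars (names are illustrative):

```python
def composed_reward(R: float, r1: float) -> float:
    """Two-level hierarchical reward r = R * r1 + R fed to the traditional agent."""
    return R * r1 + R

# Illustrative use inside a training step (names are placeholders):
#   r = composed_reward(R, r1)            # reinforces the actor
#   critic_target = R + gamma * V_next    # the critic is trained on R alone
```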
IV. EXPERIMENTS
This section presents the results of our experimental evaluation of the proposed hierarchical reward function on a continuous control problem from the OpenAI Gym suite: the Pendulum-v1 environment with a low-dimensional state space.
The architecture of our experimental setup for the Pendulum-v1 environment consisted of a neural network with a final layer outputting a 1-dimensional real-valued vector. Our implementation of the Proximal Policy Optimization (PPO) algorithm was based on a publicly available GitHub repository.
For each environment, we trained five models using different random seeds for a fixed total number of time steps. Following completion of training, each model was evaluated over 100 consecutive episodes to assess its performance.
The performance of each model was evaluated using the cumulative reward obtained over the 100 evaluation episodes. This metric provides a comprehensive assessment of the model’s ability to maximize the reward function while adapting to the environment’s dynamics.
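A minimal sketch of this evaluation protocol, assuming Gymnasium and a trained `policy` callable that maps a state to an action:

```python
import gymnasium as gym
import numpy as np

def evaluate(policy, n_episodes=100, seed=0):
    """Roll out a trained policy for consecutive episodes and report cumulative reward."""
    env = gym.make("Pendulum-v1")
    returns = []
    for ep in range(n_episodes):
        state, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            state, reward, terminated, truncated, _ = env.step(policy(state))
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    return float(np.sum(returns)), float(np.mean(returns))
```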
A. Pendulum-v1
The Pendulum-v1 environment is a well-established continuous control task from the OpenAI Gym suite. The primary objective of this task is to stabilize a pendulum by applying a torque, balancing it in an upright position.
The Pendulum-v1 environment is characterized by:
- an unbounded, 3-dimensional observation space;
- a 1-dimensional action space, where actions represent the torque applied to the pendulum;
- actions bounded within the interval $[-2,2]$.
The agent follows an actor-critic framework. The actor $\pi_{\boldsymbol{\theta}}(\boldsymbol{a}|\boldsymbol{s})$ consists of a neural network made of 3 fully-connected layers of 64 units each, with tanh activation functions. The output layer has 1 linear neuron. The critic $V_{\boldsymbol{\theta}_{v}}(s)$ does not share layers with the actor, but has an equivalent architecture of 3 hidden layers, and one output neuron representing the value function.
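A minimal PyTorch sketch of the described actor and critic, assuming Pendulum-v1's 3-dimensional observations and 1-dimensional actions (the `mlp` helper is ours):

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64, depth=3):
    """`depth` fully-connected tanh layers of `hidden` units, then a linear output layer."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 3, 1        # Pendulum-v1 observation / action dimensions
actor = mlp(obs_dim, act_dim)  # pi_theta(a | s): single linear output neuron
critic = mlp(obs_dim, 1)       # V(s): separate network, no shared layers
```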
B. Reward agent
(Figure: Average rewards over a 10-episode window for the Pendulum task; our algorithm in blue, the established method in black.)
The reward agent follows an actor-critic framework. The actor $\pi_{\boldsymbol{\theta}}(\boldsymbol{a}|\boldsymbol{s})$ consists of a neural network made of 5 fully-connected layers of 64 units each, with tanh activation functions. The output layer has 3 linear neurons, because we wanted to set up a 3-level hierarchy. The equation we used has the form of the equation we presented earlier, but extended to a hierarchy of 3 steps. It takes the following form:

$$
r=R\big(r_{1}(r_{2}r_{3}+r_{2})+r_{1}\big)+R
$$

As you can see, the equation is a rewrite of the earlier equation, with a replacement of terms.

Additionally, in another run, we were able to beat the state of the art with the TLA model after adapting its GitHub code to include our reward function. The previous state of the art was held by the vanilla version of the TLA algorithm, which obtained -154 reward points, while ours achieved a higher score of -125. A comparison of the two methods' results is shown below.
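A minimal PyTorch sketch of the reward agent's network and the three-level composition above, assuming 3-dimensional observations; the module layout and function names are illustrative:

```python
import torch.nn as nn

# Reward agent actor: 5 fully-connected tanh layers of 64 units,
# 3 linear output neurons read as the hierarchy levels r1, r2, r3.
reward_agent = nn.Sequential(
    nn.Linear(3, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 3),
)

def composed_reward_3level(R, r1, r2, r3):
    """Three-level hierarchical reward r = R*(r1*(r2*r3 + r2) + r1) + R."""
    return R * (r1 * (r2 * r3 + r2) + r1) + R
```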
V. RESULTS AND DISCUSSION
A. Pendulum-v1
For the Pendulum-v1 environment, we observe that our method learned faster, with greater stability and higher rewards than the PPO method without our adjustments (see Fig. 2).
VI. FUTURE WORK
In our initial approach, we assumed a linear reward function, where scalar values are prioritized and each level is multiplied by the level above, with an additional term. However, this simplistic model may not accurately capture the complexities of real-world systems.
A more comprehensive approach would involve using a graph to model the reward dynamics of the system. In this framework, nodes at the same depth would be summed, rather than multiplied, to capture the cumulative effects of different factors, while nodes above would be multiplied to represent the hierarchical relationships between different components. This graph-based approach would enable a more nuanced and accurate representation of the reward function.
To implement this graph-based reward modeling, we propose a reward critic architecture that takes the state of the agent as input and outputs a graph representing the reward dynamics. We would then trace each leaf node up to the root, collecting values in an array to form a line-based partial reward function for the traditional agent and sum the rewards over all leaves to obtain the final reward.
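One possible reading of this traversal, sketched below with a hypothetical node structure (the paper does not fix a representation): values are multiplied along each root-to-leaf path and the per-leaf products are summed.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RewardNode:
    value: float
    children: List["RewardNode"] = field(default_factory=list)

def graph_reward(root: RewardNode) -> float:
    """Sum, over all leaves, of the product of node values along the root-to-leaf path."""
    def collect(node, path_product):
        path_product *= node.value
        if not node.children:
            return path_product                     # leaf contributes its path product
        return sum(collect(c, path_product) for c in node.children)
    return collect(root, 1.0)

# Example: root R with children r1 and r2, where r1 has a child r3,
# yields reward = R*r1*r3 + R*r2.
```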
To further enhance the reward critic, we can modify it to take into account the actions of the traditional agent, in addition to the state. This would enable the reward critic to evaluate both states and actions, providing a more comprehensive assessment of the agent's behavior. By exploring these graph-based reward modeling and reward critic architectures, we can develop more sophisticated and accurate reward functions that capture the complexities of real-world systems.
VII. CONCLUSIONS
In this study, we conducted a comprehensive evaluation of the effectiveness of hierarchical reward functions in reinforcement learning. Our results demonstrate that agents trained with hierarchical reward functions exhibit faster convergence, improved stability, and higher final rewards compared to agents implementing standard Proximal Policy Optimization (PPO) algorithms.
A comparative analysis of the performance of agents trained with hierarchical reward functions and standard PPO algorithms reveals significant advantages of the former approach. Specifically, agents trained with hierarchical reward functions exhibit faster convergence rates, achieving optimal performance in fewer iterations; their stability is improved, with reduced variance in performance across different trials; and the final rewards they obtain are consistently higher than those achieved by agents implementing standard PPO algorithms.
Our results suggest that the proposed method of implementing hierarchical reward functions is effective for simple cases. To further establish the scalability and generalization of this approach, we plan to extend our experiments to more complex environments with intricate dynamics.
We propose to develop more complex graph-based hierarchical reward functions to capture nuanced relationships between different components. This will enable the creation of more sophisticated reward functions that can effectively guide the learning process in complex environments.
A key advantage of the proposed hierarchical reward function approach is the potential for component reuse and transfer learning. By fostering the reuse of components learned early on in the development process, we can accelerate the learning process and improve the overall performance of the agent.
The proposed hierarchical reward function approach has significant implications for complex goal formation and navigation in real-world environments. By emulating the hierarchy of needs exhibited by humans, we can create agents that are capable of navigating complex environments and achieving sophisticated goals.
Future work will focus on extending the proposed hierarchical reward function approach to more complex environments and developing more sophisticated graph-based reward functions.
