[Paper Translation] SOON: Scenario Oriented Object Navigation with Graph-based Exploration


Original paper: https://arxiv.org/pdf/2103.17138v1.pdf


SOON: Scenario Oriented Object Navigation with Graph-based Exploration

Abstract

The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots. Most visual navigation benchmarks, however, focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depicts the route step by step. This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere. Accordingly, in this paper we introduce a Scenario Oriented Object Navigation (SOON) task, in which an agent is required to navigate from an arbitrary position in a 3D embodied environment and localize a target following a scene description. To give a promising direction for solving this task, we propose a novel Graph-Based Exploration (GBE) method, which models the navigation state as a graph, learns knowledge from the graph, and stabilizes training by learning from sub-optimal trajectories. We also propose a new large-scale benchmark, the From Anywhere to Object (FAO) dataset. To avoid target ambiguity, the descriptions in FAO provide rich semantic scene information, including object attributes, object relationships, region descriptions, and nearby region descriptions. Our experiments show that the proposed GBE outperforms various state-of-the-art methods on both the FAO and R2R datasets, and ablation studies on FAO validate the quality of the dataset.

Introduction

Recent research efforts have achieved great success in embodied navigation tasks. An agent is able to reach a target by following a variety of instructions, such as a word (e.g. an object name or room name) [49, 40], a question-answer pair [11, 18], a natural language sentence [3], or a dialogue consisting of multiple sentences [45]. However, these navigation approaches are still far from real-world navigation activities. Current vision-language navigation tasks, such as Vision-and-Language Navigation (VLN) [3] and Navigation from Dialog History (NDH) [45], focus on navigating to a target along a fixed trajectory, guided by an elaborate set of instructions that outlines every step. They fail to consider the case in which a complex instruction provides only a target description while the starting point is not fixed. In real-world applications, people often do not provide detailed step-by-step instructions and expect the robot to be capable of self-exploration and autonomous decision-making. We claim that the ability to navigate towards a language-guided target from anywhere in a 3D embodied environment, as a human does, is of great importance to an intelligent robot.

Figure 1: An example of the navigation process in SOON. The agent receives a complex natural language instruction composed of multiple kinds of descriptions (left). While navigating between rooms, the agent first searches a larger region and then gradually narrows the search scope according to the visual scene and the instruction.

To address these problems, we propose a new task, named Scenario Oriented Object Navigation (SOON), in which an agent is instructed to find a thoroughly described target object inside a house. The navigation instructions in SOON are target-oriented rather than step-by-step "babysitting" instructions as in previous benchmarks. Two major features make our task unique: target orienting and starting independence. A brief example of a navigation process in SOON is illustrated in Fig. 1. Firstly, different from conventional object navigation tasks [49, 40], instructions in SOON play a guidance role in addition to distinguishing a target object class. An instruction contains thorough descriptions that guide the agent to find a unique object from anywhere in the house. After receiving an instruction, the agent first searches a larger-scale area according to the region descriptions in the instruction, and then gradually narrows the search space down to the target area.

Compared with step-by-step navigation settings [3] or object-goal navigation settings [49], this coarse-to-fine navigation process more closely resembles a real-world situation. Moreover, the SOON task is starting-independent. Since the language instructions contain geographic region descriptions rather than trajectory-specific descriptions, they do not limit how the agent finds the target. By contrast, in step-by-step navigation tasks such as Vision-and-Language Navigation [3] or Cooperative Vision-and-Dialog Navigation [45], any deviation from the directed path may be considered an error [25]. We present an overall comparison between the SOON task and existing embodied navigation tasks in Tab. 1.

| Dataset | Human | Content | Unamb. | Real-world | Temporal | Starting Independent | Target Oriented |
|---|---|---|---|---|---|---|---|
| House3D [49] | | Room Name | | | Dynamic | | |
| MINOS [40] | | Object Name | | | Dynamic | | |
| EQA [11], IQA [18] | | QA | | | Dynamic | | |
| MARCO [31], DRIF [5] | | Instruction | | | Dynamic | | |
| R2R [3] | | Instruction | | | Dynamic | | |
| TouchDown [10] | | Instruction | | | Dynamic | | |
| VLNA [37], HANNA [36] | | Dialog | | | Dynamic | | |
| TtW [13] | | Dialog | | | Dynamic | | |
| CVDN [45] | | Dialog | | | Dynamic | | |
| REVERIE [39] | | Instruction | | | Dynamic | | |
| FAO (Ours) | | Instruction | | | Dynamic | | |

Table 1: Comparison of FAO with existing datasets for embodied vision-and-language tasks.

In this work, we propose a novel Graph-based Semantic Exploration (GBE) method to suggest a promising direction for approaching SOON. The proposed GBE has two advantages over previous navigation works. Firstly, GBE models the navigation process as a graph, which enables the navigation agent to obtain a comprehensive and structured understanding of the observed information. It adopts a graph action space, which merges the multiple actions of conventional sequence-to-sequence models into a single-step decision. Merging actions reduces the number of predictions in a navigation episode, which makes model training more stable. Secondly, different from other graph-based navigation models that use either imitation learning or reinforcement learning to learn the navigation policy, the proposed GBE combines the two learning approaches and introduces a novel exploration approach that stabilizes training by learning from sub-optimal trajectories. In imitation learning, the agent learns to navigate step by step under the supervision of ground-truth labels; this causes a severe overfitting problem, since labeled trajectories occupy only a small proportion of the large trajectory space. In reinforcement learning, the navigation agent explores the large trajectory space and learns to maximize the discounted reward, leveraging sub-optimal trajectories to improve generalizability.

However, reinforcement learning is not an end-to-end optimization method, which makes it difficult for the agent to converge to a robust policy. We therefore propose to learn the optimal actions in trajectories sampled from the imperfect GBE policy, stabilizing training while exploring. Different from other RL exploration methods, the proposed exploration is based on the semantic graph, which is dynamically built during navigation; it thus helps the agent learn a robust policy while navigating based on a graph. To investigate the SOON task, we propose the large-scale From Anywhere to Object (FAO) benchmark. The benchmark is built on the Matterport3D simulator, which comprises 90 different housing environments with real image panoramas. FAO provides 4K sets of annotated instructions with 40K trajectories. As Fig. 1 (left) shows, one set of instructions contains three sentences covering four levels of description: i) the color and shape of the object; ii) the surrounding objects along with the relationships between these objects and the target object; iii) the area in which the target object is located; and iv) the neighboring areas. The average number of words per instruction is 38 (versus 26 for R2R), and the average number of hops in the labeled trajectories is 9.6 (versus 6.0 for R2R), which makes our dataset more challenging than other tasks. We present experimental analyses on both the R2R and FAO datasets to validate the performance of the proposed GBE and the quality of the FAO dataset. The proposed GBE significantly outperforms previous VLN methods without pretraining or auxiliary tasks on the R2R and SOON tasks. We further provide human performance on the test set of FAO to quantify the human-machine gap. Moreover, by ablating vision and language modalities at different granularities, we validate that our FAO dataset contains rich information that enables the agent to successfully locate the target.

Vision-Language Navigation. Navigation with vision-language information has attracted widespread attention, since it is both widely applicable and challenging. Anderson et al. [3] propose the Room-to-Room (R2R) dataset, the first Vision-and-Language Navigation (VLN) benchmark combining real imagery [7] and natural language navigation instructions. In addition, the TOUCHDOWN dataset [10] with natural language instructions is proposed for street navigation. To address the VLN task, Fried et al. [17] propose a speaker-follower framework for data augmentation and reasoning in supervised learning, along with the concept of a "panoramic action space" to facilitate optimization. Wang et al. [47] demonstrate the benefit of combining imitation learning and reinforcement learning. Other methods have been proposed to solve VLN tasks from various angles. Inspired by the success of VLN, many datasets based on natural language instructions or dialogues have been proposed. VLNA [37] and HANNA [36] are environments in which an agent receives assistance when it gets lost. TtW [13] and CVDN [45] provide dialogues created by communication between two people to reach the target position. Unlike the above methods, REVERIE [39] introduces a remote object localization task, in which an agent is required to find an object in another room that it cannot see at the beginning. The proposed SOON task is a coarse-to-fine navigation process, which navigates towards a target from anywhere following a complex scene description. An overall comparison between the SOON task and existing embodied navigation tasks is shown in Tab. 1.

Mapping and Planning. Classical SLAM-based methods build a 3D map from LIDAR, depth or structure, and then plan navigation routes based on this map. Owing to the development of photo-realistic environments and efficient simulators, deep-learning-based methods have become feasible ways of training a navigation agent. Since deep learning methods have demonstrated their strength in feature engineering, end-to-end agents are becoming popular. Later works adopt the idea of SLAM and introduce a memory mechanism, combining classical mapping methods with deep learning for generalization and long-trajectory navigation. Recent works model the navigation semantics with graphs and achieve great success in embodied navigation tasks. Different from previous work [14], which trains the agent only on labeled trajectories via imitation learning, our work introduces reinforcement learning into policy learning and proposes a novel exploration method to learn a robust policy.

Scenario Oriented Object Navigation

Task Definition. We propose a new Scenario Oriented Object Navigation (SOON) task, in which an agent navigates from an arbitrary position in a 3D embodied environment to localize a target object following an instruction. The task includes two sub-tasks: navigation and localization. We consider the navigation a success if the agent stops at a position close to the target (<3m), and we consider the localization a success if, given a successful navigation, the agent correctly locates the target object in the panoramic view. To ensure that the target object can be found regardless of the agent's starting point, the instruction consists of several parts: i) object attribute, ii) object relationship, iii) area description, and iv) neighbor area descriptions. An example demonstrating the different parts of a description is shown in Fig. 2.

At step $ t $ of navigation, the agent observes a panoramic view $ v_t $ containing RGB and depth information. Meanwhile, the agent receives the neighbor node observations $ U_t=\{u^1_t,...,u^k_t\} $, which are the observations of the $ k $ positions reachable from the current position. All reachable positions in a house scan are discretized into a navigation graph, and the agent navigates between nodes of this graph. At each step, the agent takes an action $ a $ to move from the current position to a neighbor node, or stops. In addition to the RGB-D sensor, the simulator provides a GPS sensor that informs the agent of its x, y coordinates, as well as the indexes of the current node and the candidate nodes.

REVERIE annotates 2D bounding boxes in 2D views to represent the locations of objects, where the 2D views are separated out from the panoramic views of the embodied simulator. This way of labeling has two disadvantages: 1) objects that are split across 2D view boundaries are not labeled; 2) 2D image distortion introduces labeling noise. We instead adopt the idea of point detection and represent the location in polar coordinates, as shown in Fig. 3. First, we annotate the object bounding box with four vertices $ \{ p_1, p_2, p_3, p_4 \} $. Then, we calculate the center point $ p_c $. Finally, we convert the 2D coordinates into the angle difference between the original camera ray $ \alpha $ and the adjusted camera ray $ \alpha{}' $.
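To make the bounding-box-to-polar conversion concrete, the following is a minimal sketch of the annotation step described above. The helper name, the field-of-view parameters, and the small-angle model are illustrative assumptions, not part of the released FAO toolkit.

```python
import numpy as np

def bbox_to_polar(vertices, view_heading, view_elevation,
                  image_size=(640, 480), fov_deg=60.0):
    """Convert a 4-vertex 2D bounding box into a polar target location.

    vertices: four (x, y) pixel coordinates {p_1..p_4}.
    view_heading, view_elevation: original camera ray (alpha), in radians.
    Returns the adjusted camera ray (alpha') pointing at the box center.
    The pinhole-style angle model below is an assumption for illustration.
    """
    pts = np.asarray(vertices, dtype=np.float32)
    p_c = pts.mean(axis=0)                      # center point p_c

    w, h = image_size
    fov = np.deg2rad(fov_deg)
    # Angle offsets of p_c from the image center (small-angle approximation).
    d_heading = (p_c[0] - w / 2.0) / w * fov
    d_elevation = -(p_c[1] - h / 2.0) / h * fov * (h / w)

    # Adjusted camera ray alpha' = alpha + angle difference.
    return view_heading + d_heading, view_elevation + d_elevation

# Example: a box in the upper-right quarter of a view facing heading 0.
print(bbox_to_polar([(400, 100), (500, 100), (500, 200), (400, 200)], 0.0, 0.0))
```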

Figure 4: Overview of the Graph-based Semantic Exploration (GBE) model. Visual views are encoded by the visual encoder and the instruction is encoded by the language encoder. The graph planner models the room semantics based on the visual embeddings and the room structure information. GBE embeds the graph nodes with a GCN and outputs a graph embedding, then produces a cross-modal feature from the graph embedding and the language feature. Finally, GBE uses the cross-modal feature to predict the navigation action and regress the target location.

Graph-based Semantic Exploration

We present the Graph-based Semantic Exploration (GBE) method in this section. The pipeline of GBE is shown in Fig. 4. Our visual encoder $ g $ and language encoder $ h $ follow common practice in vision-language navigation. We first introduce the graph planner in GBE, which models the structured semantics of visited places, and then our exploration method built on the graph planner.

Memorizing viewed scenes and explicitly modeling the navigation environment are helpful for long-term navigation. We therefore introduce a graph planner that memorizes the observed features and models the explored areas as a feature graph. The graph planner maintains a node feature set $ \mathcal{V} $, an edge set $ \mathcal{E} $ and a node embedding set $ \mathcal{M} $. The node feature set $ \mathcal{V} $ stores the node features and candidate features generated by the visual encoder $ g $. The edge set $ \mathcal{E} $ is dynamically updated to represent the explored navigation graph. The embedding set $ \mathcal{M} $ stores the intermediate node embeddings, which are updated by a GCN. A node feature in $ \mathcal{M} $, denoted $ f^{\mathcal{M}}_{n_i} $, is initialized with the feature of the same position in $ \mathcal{V} $. At step $ t $, the agent navigates to the position whose node index is $ n_0 $, receives a visual observation $ v_t $, and obtains the observations of the neighbor nodes $ U_t=\{u^1_t,...,u^k_t\} $, where $ k $ is the number of neighbors and $ N_t=\{n_1,...,n_k\} $ are the node indexes of the neighbors. The visual observation and the neighbor observations are embedded by the visual encoder $ g $:

$$ f^v_{n_0} = g(v_t), \qquad f^{u}_{n_i} = g(u^i_t), \quad 1 \le i \le k, $$

where $ n_0 $ stands for the current node and $ n_i (1 \le i \le k) $ are the nodes it connects with. The graph planner adds $ f^v_{n_0} $ and $ f^{u}_{n_i} $ into $ \mathcal{V} $:

$$ \mathcal{V}\leftarrow\mathcal{V}\cup\{f^v_{n_0}, f^{u}_{n_1},\dots,f^{u}_{n_k}\}. $$

For an arbitrary node $ n_i $ in the navigation graph, its node feature $ f_{n_i} $ is represented using $ \mathcal{V} $ according to the following rules: 1) if node $ n_i $ has been visited, its feature is represented by $ f^v_{n_i} $; 2) if node $ n_i $ has not been visited but only observed, its feature is represented by $ f^u_{n_i} $; 3) since a navigable position can be observed from multiple different views, an unvisited node feature is represented by the average of all its observed features. The graph planner also updates the edge set $ \mathcal{E} $ by:

$$ \mathcal{E}\leftarrow\mathcal{E}\cup\{(n_0, n_1),(n_0, n_2),\dots,(n_0, n_k)\}. $$

An edge is represented by a tuple of two node indexes, indicating that the two nodes are connected. Then, $ \mathcal{M} $ is updated by the GCN based on $ \mathcal{V} $ and $ \mathcal{E} $:

$$ \mathcal{M}\leftarrow\mathrm{GCN}(\mathcal{M}, \mathcal{E}). $$
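For concreteness, a single propagation layer of this GCN update can be sketched as follows; the symmetrically normalized adjacency with self-loops and the random weights are standard assumptions, not the paper's exact architecture.

```python
import numpy as np

def gcn_layer(M, edges, W):
    """One GCN propagation step over node embeddings M (n x d) using edge list E.

    Uses the common symmetrically normalized adjacency with self-loops:
    M' = ReLU(D^{-1/2} (A + I) D^{-1/2} M W).
    """
    n = M.shape[0]
    A = np.eye(n)                       # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ M @ W, 0.0)

# Toy example: 4 nodes, 8-d embeddings, edges of a small explored graph.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8))
M = gcn_layer(M, edges=[(0, 1), (1, 2), (1, 3)], W=W)   # M <- GCN(M, E)
print(M.shape)
```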

To obtain a comprehensive understanding of the current position and the nearby scene, we define the output of the graph planner as:

$$ f^g_t = \frac{1}{k+1}\sum_{i=0}^{k} f^{\mathcal{M}}_{n_i}, $$
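The sketch below ties the pieces together: the bookkeeping of $ \mathcal{V} $ and $ \mathcal{E} $, the visited/observed node-feature rules, and the averaged graph output $ f^g_t $. It is an illustrative assumption of one possible implementation; for brevity the embedding refresh reuses the raw features instead of running the GCN layer shown above.

```python
import numpy as np
from collections import defaultdict

class GraphPlanner:
    """Illustrative bookkeeping for the node feature set V, edge set E and output f^g_t."""

    def __init__(self):
        self.visited_feat = {}                    # f^v_{n_i} for visited nodes
        self.observed_feats = defaultdict(list)   # observations f^u_{n_i} of unvisited nodes
        self.edges = set()                        # E, as (node, node) tuples

    def node_feature(self, n):
        # Rules 1-3: visited feature if available, otherwise the average of all observations.
        if n in self.visited_feat:
            return self.visited_feat[n]
        return np.mean(self.observed_feats[n], axis=0)

    def step(self, n0, f_v, neighbors, neighbor_feats):
        """Add the current node n0 and its neighbor observations to V and E."""
        self.visited_feat[n0] = f_v
        for n_i, f_u in zip(neighbors, neighbor_feats):
            self.observed_feats[n_i].append(f_u)
            self.edges.add((n0, n_i))
        # In the full model, the embedding set M would be refreshed here via the GCN
        # (see the gcn_layer sketch above); we reuse the raw node features for brevity.
        return {n: self.node_feature(n) for n in [n0, *neighbors]}

    def graph_output(self, n0, neighbors):
        # f^g_t: average embedding over the current node and its k neighbors.
        feats = [self.node_feature(n) for n in [n0, *neighbors]]
        return np.mean(feats, axis=0)

# Toy usage: one step with two neighbor observations of 4-d features.
gp = GraphPlanner()
gp.step(0, np.ones(4), [1, 2], [np.zeros(4), np.full(4, 2.0)])
print(gp.graph_output(0, [1, 2]))
```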

The graph output $ f^g_t $ and the language feature $ f_t^l $ then perform cross-modal matching, producing $ \tilde{f_t} $. GBE uses $ \tilde{f_t} $ for two tasks: navigation action prediction and target object localization. The navigation candidates are all nodes that have been observed but not visited, whose indexes are $ C = \{c_1,...,c_{|C|}\} $, where $ |C| $ is the number of candidates. The candidate features, denoted $ \{f_{c_1},...,f_{c_{|C|}}\} $, are extracted from $ \mathcal{V} $. The agent generates a probability distribution $ p_t $ over the candidates for action prediction, and outputs regression results $ \hat{l^h_i} $ and $ \hat{l^e_i} $, the heading and elevation values for localization, for $ 0\le i \le |C| $. The logits $ z_i $ are generated by a fully connected layer with parameters $ W_{nav} $, and $ a_{c_0} $ denotes the stop action. The action space size $ |\mathcal{A}| = |C| + 1 $ therefore varies with the dynamically built graph.
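A minimal sketch of this variable-size action head is given below. The text does not fully specify how $ \tilde{f_t} $ is fused with each candidate feature, so the concatenation-plus-linear-layer design, the learned stop embedding, and all tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class NavHead(nn.Module):
    """Scores a stop action plus a variable number of candidate nodes,
    and regresses heading/elevation for localization (illustrative sketch)."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.w_nav = nn.Linear(2 * feat_dim, 1)   # logit z_i from [f_tilde; f_{c_i}]
        self.loc = nn.Linear(2 * feat_dim, 2)     # (heading, elevation) regression
        self.stop_feat = nn.Parameter(torch.zeros(feat_dim))  # stands in for a_{c_0}

    def forward(self, f_tilde, cand_feats):
        # f_tilde: (feat_dim,); cand_feats: (|C|, feat_dim)
        feats = torch.cat([self.stop_feat.unsqueeze(0), cand_feats], dim=0)  # (|C|+1, d)
        pairs = torch.cat([f_tilde.expand_as(feats), feats], dim=-1)
        logits = self.w_nav(pairs).squeeze(-1)          # z_i, shape (|C|+1,)
        p_t = torch.softmax(logits, dim=-1)             # distribution over A = C + {stop}
        loc = self.loc(pairs)                           # heading/elevation per candidate
        return p_t, loc

head = NavHead(feat_dim=512)
p, loc = head(torch.randn(512), torch.randn(5, 512))    # 5 candidates -> 6 actions
print(p.shape, loc.shape)
```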



Figure 5: Statistical analysis of the FAO dataset.

Graph-based Exploration. Seq2seq navigation models such as the speaker-follower perceive only the current observation and an encoding of the historical information, and existing exploration methods focus on data augmentation, heuristic-aided approaches and auxiliary tasks. With the dynamically built semantic graph, by contrast, the navigation agent is able to memorize all the nodes it has observed but not yet visited. We therefore propose to use the semantic graph to facilitate exploration. As shown in Fig. 4 (yellow box), the graph planner builds the navigation semantic graph during exploration. In imitation learning, the navigation agent uses the ground-truth action $ a^*_t $ to sample the trajectory. In graph-based exploration, however, at each step $ t $ the navigation action $ a_t $ is sampled from the predicted probability distribution $ p_t $ over the candidates. The graph planner computes the Dijkstra distance from each candidate to the target, and the teacher action $ \hat{a}_t $ is to move to the candidate closest to the target.
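Below is a minimal sketch of this graph-based teacher, assuming the navigation graph of the scan is available at training time and is kept in networkx; the function name, edge weights, and the "stop" sentinel are illustrative assumptions. The minimum over targets anticipates the multi-target case formalized next.

```python
import networkx as nx

def teacher_action(graph, current, candidates, targets, weight="weight"):
    """Graph-based exploration teacher: pick the observed-but-unvisited candidate
    with the smallest Dijkstra distance to any target node (illustrative sketch)."""
    if current in targets:
        return "stop"
    best, best_dist = None, float("inf")
    for c in candidates:
        # min over the m target nodes, as in the teacher-action equation below
        d = min(nx.dijkstra_path_length(graph, c, t, weight=weight) for t in targets)
        if d < best_dist:
            best, best_dist = c, d
    return best

# Tiny example on a toy navigation graph (edge weights in metres, assumed values).
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 1.5), (1, 3, 4.0), (3, 4, 1.0)])
print(teacher_action(G, current=0, candidates=[1], targets=[2, 4]))  # -> 1
```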

Each trajectory in the Room-to-Room (R2R) dataset has only one target position. In the SOON task, however, since the target object may be observed from multiple positions, a trajectory can have multiple target positions. The teacher action $ \hat{a}_t $ is calculated by:

$$ \hat{a}_t = \underset{a_t^{n_i}}{\mathrm{argmin}}\left[ \min\left( \mathrm{D}(n_i, n_{T_1}),\dots,\mathrm{D}(n_i, n_{T_m}) \right) \right], $$

where $ n_{T_1},...,n_{T_m} $ are the indexes of the $ m $ targets, and $ a_t^{n_i} $ denotes the action from the current position to node $ n_i $. $ \mathrm{D}(n_i, n_j) $ is the function that calculates the Dijkstra distance between nodes $ n_i $ and $ n_j $. Note that the target positions are visible during training for computing the teacher action, but not during testing. If the current position is one of the target nodes, the teacher action $ \hat{a}_t $ is the stop action. Sampling and ex