[Paper Translation] SOON: Scenario Oriented Object Navigation with Graph-based Exploration


SOON: Scenario Oriented Object Navigation with Graph-based Exploration


The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the "holy grail" goals of intelligent robots. Most visual navigation benchmarks, however, focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that describes the route step by step. This setting deviates from real-world problems, in which a human only describes what the object and its surroundings look like and asks the robot to start navigating from anywhere. Accordingly, in this paper, we introduce a Scenario Oriented Object Navigation (SOON) task. In this task, an agent is required to navigate from an arbitrary position in a 3D embodied environment to localize a target following a scene description. To suggest a promising direction for solving this task, we propose a novel graph-based exploration (GBE) method, which models the navigation state as a graph and introduces a novel graph-based exploration approach to learn knowledge from the graph and stabilize training by learning from sub-optimal trajectories. We also propose a new large-scale benchmark named the From Anywhere to Object (FAO) dataset. To avoid target ambiguity, the descriptions in FAO provide rich semantic scene information, including object attributes, object relationships, region descriptions, and nearby region descriptions. Our experiments reveal that the proposed GBE outperforms various state-of-the-art methods on both the FAO and R2R datasets, and the ablation studies on FAO validate the quality of the dataset.


Introduction

Recent research efforts have achieved great success in embodied navigation tasks. The agent is able to reach the target by following a variety of instructions, such as a word (e.g. an object name or room name) [49, 40], a question-answer pair [11, 18], a natural language sentence [3], or a dialogue consisting of multiple sentences [45, 55]. However, these navigation approaches are still far from real-world navigation activities. Current vision-language navigation tasks, such as Vision-language Navigation (VLN) [3] and Navigation from Dialog History (NDH) [45], focus on navigating to a target along a fixed trajectory, guided by an elaborate set of instructions that outlines every step. These approaches fail to consider the case in which the instruction provides only a description of the target and the starting point is not fixed. In real-world applications, people often do not provide detailed step-by-step instructions and expect the robot to be capable of self-exploration and autonomous decision-making. We claim that the ability to navigate, like a human, towards a language-guided target from anywhere in a 3D embodied environment would be of great importance to an intelligent robot.

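Figure 1: An example of the navigation process in SOON. The agent receives a complex natural language instruction consisting of several kinds of description (left). As the agent navigates between rooms, it first searches a larger area and then gradually narrows the search scope according to the visual scene and the instruction.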

To address these problems, we propose a new task, named Scenario Oriented Object Navigation (SOON), in which an agent is instructed to find a thoroughly described target object inside a house. The navigation instructions in SOON are target-oriented rather than step-by-step guidance as in previous benchmarks. Two major features make our task unique: target orientation and starting independence. A brief example of a navigation process in SOON is illustrated in Fig. 1. Firstly, unlike the conventional object navigation tasks defined in [49, 40], instructions in SOON play a guidance role in addition to distinguishing the target object class. An instruction contains thorough descriptions that guide the agent to find a unique object from anywhere in the house. After receiving an instruction in SOON, the agent first searches a larger-scale area according to the region descriptions in the instruction, and then gradually narrows the search space to the target area.

Compared with step-by-step navigation settings [3] or object-goal navigation settings [49], this kind of coarse-to-fine navigation process more closely resembles a real-world situation. Moreover, the SOON task is starting-independent. Since the language instructions contain geographic region descriptions rather than trajectory-specific descriptions, they do not limit how the agent finds the target. By contrast, in step-by-step navigation tasks such as Vision Language Navigation [3] or Cooperative Vision-and-Dialog Navigation [45], any deviation from the directed path may be considered an error [25]. We present an overall comparison between the SOON task and existing embodied navigation tasks in Table 1.

Dataset | Instruction content | Temporal visual context
--- | --- | ---
House3D [49] | Room name | Dynamic
MINOS [40] | Object name | Dynamic
EQA [11], IQA [18] | QA | Dynamic
MARCO [31], DRIF [5] | Instruction | Dynamic
R2R [3] | Instruction | Dynamic
TouchDown [10] | Instruction | Dynamic
VLNA [37], HANNA [36] | Dialog | Dynamic
TtW [13] | Dialog | Dynamic
CVDN [45] | Dialog | Dynamic
REVERIE [39] | Instruction | Dynamic
FAO (Ours) | Instruction | Dynamic

Table 1: Comparison with existing datasets involving embodied vision and language tasks. The full comparison covers instruction context (human, content, unambiguous), visual context (real-world, temporal), starting independence, and target orientation; the content and temporal columns are listed here.

In this work, we propose a novel Graph-based Semantic Exploration (GBE) method to suggest a promising direction for approaching SOON. The proposed GBE has two advantages over previous navigation works [3, 17, 47]. Firstly, GBE models the navigation process as a graph, which enables the navigation agent to obtain a comprehensive and structured understanding of the observed information. It adopts a graph action space that merges the multiple actions of conventional sequence-to-sequence models [3, 17, 47] into a one-step decision. Merging actions reduces the number of predictions in a navigation process, which makes model training more stable. Secondly, unlike other graph-based navigation models [14, 9] that use either imitation learning or reinforcement learning to learn a navigation policy, the proposed GBE combines the two learning approaches and proposes a novel exploration approach that stabilizes training by learning from sub-optimal trajectories. In imitation learning, the agent learns to navigate step by step under the supervision of ground truth labels. This causes a severe overfitting problem, since labeled trajectories occupy only a small proportion of the large trajectory space. In reinforcement learning, the navigation agent explores the large trajectory space and learns to maximize the discounted reward. Reinforcement learning thus leverages sub-optimal trajectories to improve generalizability.

However, reinforcement learning is not an end-to-end optimization method, which makes it difficult for the agent to converge and learn a robust policy. We propose to learn the optimal actions along trajectories sampled from the imperfect GBE policy in order to stabilize training while exploring. Unlike other RL exploration methods, the proposed exploration method is based on the semantic graph, which is dynamically built during navigation. It thus helps the agent learn a robust policy while navigating on a graph.

To investigate the SOON task, we propose a large-scale From Anywhere to Object (FAO) benchmark. This benchmark is built on the Matterport3D simulator, which comprises 90 different housing environments with real image panoramas. FAO provides 4K sets of annotated instructions with 40K trajectories. As Fig. 1 (left) shows, one set of instructions contains three sentences that cover four levels of description: i) the color and shape of the object; ii) the surrounding objects, along with the relationships between these objects and the target object; iii) the area in which the target object is located; and iv) the neighbouring areas. The average instruction length is 38 words (R2R is 26), and the average number of hops in the labeled trajectories is 9.6 (R2R is 6.0), so our dataset is more challenging than existing ones.

We present experimental analyses on both the R2R and FAO datasets to validate the performance of the proposed GBE and the quality of the FAO dataset. The proposed GBE significantly outperforms previous VLN methods without pretraining or auxiliary tasks on the R2R and SOON tasks. We further provide human performance on the test set of FAO to quantify the human-machine gap. Moreover, by ablating the vision and language modalities at different granularities, we validate that our FAO dataset contains rich information that enables the agent to successfully locate the target.

Vision Language Navigation. Navigation with vision-language information has attracted widespread attention, since it is both widely applicable and challenging. Anderson et al. [3] propose the Room-to-Room (R2R) dataset, which is the first Vision-Language Navigation (VLN) benchmark combining real imagery [7] and natural language navigation instructions. In addition, the TOUCHDOWN dataset [10] with natural language instructions is proposed for street navigation. To address the VLN task, Fried et al. propose a speaker-follower framework [17] for data augmentation and reasoning in supervised learning, along with a concept named "panoramic action space" that facilitates optimization. Wang et al. [47] demonstrate the benefit of combining imitation learning [6, 22] and reinforcement learning [34, 42]. Other methods [48, 29, 30, 44, 26, 24] have been proposed to solve the VLN tasks from various angles. Inspired by the success of VLN, many datasets based on natural language instructions or dialogues have been proposed. VLNA [37] and HANNA [36] are environments in which an agent receives assistance when it gets lost. TtW [13] and CVDN [45] provide dialogues created by communication between two people to reach the target position. Unlike the above, REVERIE [39] introduces a remote object localization task, in which an agent is required to find an object in another room that cannot be seen at the beginning. The proposed SOON task is a coarse-to-fine navigation process, in which the agent navigates towards a target from anywhere following a complex scene description. An overall comparison between the SOON task and existing embodied navigation tasks is shown in Table 1.

Mapping and Planning. Classical SLAM-based methods [46, 12, 19, 16, 21, 4] build a 3D map with LIDAR, depth or structure, and then plan navigation routes based on this map. Owing to the development of photo-realistic environments [3, 10, 50] and efficient simulators [15, 40, 41], deep learning-based methods [35, 28, 53] have become feasible ways of training a navigation agent. Since deep learning methods have shown their ability in feature engineering, end-to-end agents are becoming popular. Later works [16, 51, 33] adopt the idea of SLAM and introduce a memory mechanism, combining classical mapping methods with deep learning methods for generalization and long-trajectory navigation. Recent works [9, 14, 8] model the navigation semantics in graphs and achieve great success in embodied navigation tasks. Unlike previous work [14] that trains the agent only on labeled trajectories by imitation learning, our work introduces reinforcement learning into policy learning and proposes a novel exploration method to learn a robust policy.

Scenario Oriented Object Navigation

Task definition. We propose a new Scenario Oriented Object Navigation (SOON) task, in which an agent navigates from an arbitrary position in a 3D embodied environment to localize a target object following an instruction. The task includes two sub-tasks: navigation and localization. We consider navigation a success if the agent navigates to a position close to the target (<3m), and we consider localization a success if, given successful navigation, the agent correctly locates the target object in the panoramic view. To ensure that the target object can be found regardless of the agent's starting point, the instruction consists of several parts: i) object attribute, ii) object relationship, iii) area description, and iv) neighbor area descriptions. An example demonstrating the different parts of a description is shown in Fig.instruction. At step $ t $ of navigation, the agent observes a panoramic view $ v_t $ containing RGB and depth information. Meanwhile, the agent receives neighbour node observations $ U_t=\{u^1_t,...,u^k_t\} $, which are the observations of the $ k $ positions reachable from the current position. All reachable positions in a house scan are discretized into a navigation graph, and the agent navigates between nodes in the graph. At each step, the agent takes an action $ a $ to move from the current position to a neighbor node or to stop.

In addition to the RGB-D sensor, the simulator provides a GPS sensor to inform the agent of its x, y coordinates. The simulator also provides the indexes of the current node and the candidate nodes. REVERIE annotates 2D bounding boxes in 2D views to represent the location of objects. These 2D views are separated from the panoramic views of the embodied simulator. This way of labeling has two disadvantages: 1) objects that are split across 2D views are not labeled; 2) 2D image distortion introduces labeling noise. We adopt the idea of Point Detection and represent the location by polar coordinates, as shown in Fig.ploar. First, we annotate the object bounding box with four vertices $ \{ p_1, p_2, p_3, p_4 \} $. Then, we calculate the center point $ p_c $. After that, we convert the 2D coordinates into an angle difference between the original camera ray $ \alpha $ and the adjusted camera ray $ \alpha' $.
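The conversion from a bounding box to an angular offset can be illustrated with a minimal sketch. The source does not give the formula, so the following assumes a pinhole camera with square pixels and a known horizontal field of view; the function names `bbox_center` and `pixel_to_polar` and all parameters are hypothetical, not part of the paper or the simulator.

```python
import math

def bbox_center(vertices):
    """Center p_c of a bounding box given its four (x, y) pixel vertices."""
    xs = [p[0] for p in vertices]
    ys = [p[1] for p in vertices]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def pixel_to_polar(point, width, height, hfov_deg):
    """Angular offset (heading, elevation), in radians, of a pixel's ray
    relative to the camera's optical axis.

    Assumption: pinhole camera, square pixels, x grows rightwards and
    y grows downwards in the image.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    dx = point[0] - width / 2
    dy = height / 2 - point[1]          # flip so that "up" is positive
    d_heading = math.atan2(dx, focal)
    d_elevation = math.atan2(dy, math.hypot(dx, focal))
    return d_heading, d_elevation
```

A box centered in the image maps to a zero offset, and a point on the right image edge maps to a heading offset of half the horizontal field of view.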

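Figure 4: Overview of the Graph-based Semantic Exploration (GBE) model. The visual views are encoded by the vision encoder, and the instruction is encoded by the language encoder. The graph planner models the room semantics based on the visual embeddings and the room structure information. GBE embeds the graph nodes with a GCN and outputs the graph embedding. GBE then outputs cross-modal features based on the graph embedding features and the language features, and uses these cross-modal features to predict the navigation action and regress the target location.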

Graph-based Semantic Exploration

We present the Graph-based Semantic Exploration (GBE) method in this section. The pipeline of GBE is shown in Fig. 4. Our vision encoder $ g $ and language encoder $ h $ follow common practice in vision language navigation. We first introduce the graph planner in GBE, which models the structured semantics of visited places, and then introduce our exploration method based on the graph planner.

Memorizing viewed scenes and explicitly modeling the navigation environment are helpful for long-term navigation. Thus, we introduce a graph planner to memorize the observed features and model the explored areas as a feature graph. The graph planner maintains a node feature set $ \mathcal{V} $, an edge set $ \mathcal{E} $ and a node embedding set $ \mathcal{M} $. The node feature set $ \mathcal{V} $ stores the node features and candidate features generated by the visual encoder $ g $. The edge set $ \mathcal{E} $ is dynamically updated to represent the explored navigation graph. The embedding set $ \mathcal{M} $ stores the intermediate node embeddings, which are updated by a GCN. The node features in $ \mathcal{M} $, denoted $ f^{\mathcal{M}}_{n_i} $, are initialized by the features of the same positions in $ \mathcal{V} $.

At step $ t $, the agent navigates to a position whose index is $ n_0 $ and receives a visual observation $ v_t $ together with the observations of the neighbor nodes $ U_t=\{u^1_t,...,u^k_t\} $, where $ k $ is the number of neighbors and $ N_t=\{n_1,...,n_k\} $ are the node indexes of the neighbors. The visual observation and the neighbor observations are embedded by the visual encoder $ g $:

$$ f^v_{n_0} = g(v_t), \quad f^u_{n_i} = g(u^i_t), $$

where $ n_0 $ stands for the current node, and $ n_i\ (1 \le i \le k) $ are the nodes it connects with. The graph planner adds $ f^v_{n_0} $ and $ f^u_{n_i} $ into $ \mathcal{V} $:

$$ \mathcal{V}\leftarrow\mathcal{V}\cup\{f^v_{n_0}, f^{u}_{n_1},...,f^{u}_{n_k}\}. $$

For an arbitrary node $ n_i $ in the navigation graph, its node feature is represented in $ \mathcal{V} $ following three rules: 1) if a node $ n_i $ has been visited, its feature $ f_{n_i} $ is represented by $ f^v_{n_i} $; 2) if a node $ n_i $ has not been visited but only observed, its feature is represented by $ f^u_{n_i} $; 3) since a navigable position can be observed from multiple different views, the feature of an unvisited node is the average of all its observed features. The graph planner also updates the edge set $ \mathcal{E} $ by:

$$ \mathcal{E}\leftarrow\mathcal{E}\cup\{(n_0, n_1),(n_0, n_2),\ldots,(n_0, n_k)\}. $$

An edge is represented by a tuple consisting of two node indexes, indicating that the two nodes are connected. Then, $ \mathcal{M} $ is updated by the GCN based on $ \mathcal{V} $ and $ \mathcal{E} $:

$$ \mathcal{M}\leftarrow\mathrm{GCN}(\mathcal{M}, \mathcal{E}). $$

To obtain a comprehensive understanding of the current position and the nearby scene, we define the output of the graph planner as:

$$ f^g_t = \frac{1}{k+1}\sum_{i=0}^{k} f^{\mathcal{M}}_{n_i}. $$
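The bookkeeping performed by the graph planner can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: features are plain lists of floats, the class and method names are hypothetical, and the learned GCN is replaced by a single round of parameter-free mean aggregation over each node and its neighbours, which is a stand-in and not the layer used in the paper.

```python
from collections import defaultdict

def _mean(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

class GraphPlanner:
    def __init__(self):
        self.visited = {}                   # node -> f^v, seen from the node itself
        self.observed = defaultdict(list)   # node -> all f^u, seen from neighbours
        self.edges = set()                  # E, undirected edges as sorted tuples

    def update(self, current, current_feat, neighbours):
        """Add the current node, its observed neighbours and the new edges."""
        self.visited[current] = current_feat
        for node, feat in neighbours.items():
            self.observed[node].append(feat)
            self.edges.add(tuple(sorted((current, node))))

    def node_feature(self, node):
        """Visited nodes use f^v; unvisited ones use the mean of their f^u."""
        if node in self.visited:
            return self.visited[node]
        return _mean(self.observed[node])

    def neighbours(self, node):
        return sorted({b if a == node else a
                       for a, b in self.edges if node in (a, b)})

    def propagate(self):
        """Stand-in for the GCN: average each node with its neighbours."""
        nodes = set(self.visited) | set(self.observed)
        feats = {n: self.node_feature(n) for n in nodes}
        return {n: _mean([feats[n]] + [feats[m] for m in self.neighbours(n)])
                for n in nodes}

    def output(self, current):
        """f^g_t: mean embedding of the current node and its neighbours."""
        emb = self.propagate()
        return _mean([emb[current]] + [emb[m] for m in self.neighbours(current)])
```

Storing every observed feature per node, rather than a running average, keeps the averaging rule for unvisited nodes explicit at the cost of some memory.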

The graph feature $ f^g_t $ and the language feature $ f_t^l $ are fused by cross-modal matching, which outputs $ \tilde{f}_t $. GBE uses $ \tilde{f}_t $ for two tasks: navigation action prediction and target object localization. The candidates to navigate to are all nodes that have been observed but not visited, whose indexes are $ C = \{c_1,...,c_{|C|}\} $, where $ |C| $ is the number of candidates. The candidate features are extracted from $ \mathcal{V} $ and denoted $ \{f_{c_1},...,f_{c_{|C|}}\} $. The agent generates a probability distribution $ p_t $ over the candidates for action prediction, and outputs regression results $ \hat{l^h_i} $ and $ \hat{l^e_i} $, the heading and elevation values used for localization, with $ 0\le i \le |C| $. The $ z_i $ are logits generated by a fully connected layer whose parameter is $ W_{nav} $, and $ a_{c_0} $ indicates the stop action. Thus the size of the action space, $ |\mathcal{A}| = |C| + 1 $, varies with the dynamically built graph.
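As a rough illustration of a variable-size action space, the sketch below turns $ |C| + 1 $ logits into a probability distribution and samples an action from it. It assumes that $ p_t $ is a softmax over the logits $ z_i $ with the stop action at index 0; the source text does not spell out this formula, and both function names are hypothetical.

```python
import math
import random

def action_distribution(logits):
    """Softmax over [stop, candidate_1, ..., candidate_|C|] logits."""
    peak = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng=random):
    """Draw an action index from the distribution; index 0 means stop."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because the list of logits is rebuilt at every step, the same two functions work whether the graph currently exposes two candidates or twenty.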

Figure 5: Statistical analysis across FAO

Graph-based Exploration. Seq2seq navigation models such as speaker-follower perceive only the current observation and an encoding of the historical information, and existing exploration methods focus on data augmentation, heuristic-aided approaches and auxiliary tasks. With the dynamically built semantic graph, however, the navigation agent is able to memorize all the nodes that it has observed but not visited. We therefore propose to use the semantic graph to facilitate exploration. As shown in Fig. 4 (yellow box), the graph planner builds the navigation semantic graph during exploration. In imitation learning, the navigation agent uses the ground truth action $ a^*_t $ to sample the trajectory. In graph-based exploration, by contrast, the navigation action $ a_t $ at each step $ t $ is sampled from the predicted probability distribution $ p_t $ over the candidates. The graph planner calculates the Dijkstra distance from each candidate to the target. The teacher action $ \hat{a}_t $ is to move to the candidate that is closest to the target.

Each trajectory in the Room-to-Room (R2R) dataset has only one target position. In the SOON task, however, the target object can be observed from multiple positions, so a trajectory can have multiple target positions. The teacher action $ \hat{a_t} $ is calculated by:

$$ \hat{a_t} = \underset{a_t^{n_i}}{\mathrm{argmin}}\left[ \mathrm{min}\left ( \mathrm{D}(n_i, n_{T_1}),...,\mathrm{D}(n_i, n_{T_m}) \right ) \right ], $$

where $ n_{T_1},...,n_{T_m} $ are the indexes of the $ m $ targets, and the action from the current position to node $ n_i $ is denoted by $ a_t^{n_i} $. $ \mathrm{D}(n_i, n_j) $ is the function that calculates the Dijkstra distance between nodes $ n_i $ and $ n_j $. Note that the target positions are visible during training, in order to calculate the teacher action, but not during testing. If the current position is one of the target nodes, the teacher action $ \hat{a_t} $ is the stop action. Sampling and executing an action $ a_t $ from the imperfect navigation policy enables the agent to explore the room, while supervising with the optimal action $ \hat{a_t} $ helps it learn a robust policy.

We introduce two training objectives: i) the navigation objective $ L_{nav} $; and ii) the object localization objective $ L_{loc} $. The GBE model is jointly optimized by these two objectives. In imitation learning, our navigation agent learns from the ground truth action $ a^*_t $. In reinforcement learning, the agent learns to navigate by maximizing the discounted reward when taking action $ a_t $. In graph-based exploration, the graph planner finds the candidate that is closest to the target, and the action that moves to this candidate is set as $ \hat{a_t} $. $ L_{nav} $ combines these three learning approaches, where $ A_t $ is the advantage defined in A2C. The reward for reinforcement learning is calculated from the Dijkstra distance between the current position and the target. $ \lambda_1 $, $ \lambda_2 $ and $ \lambda_3 $ are the loss weights for imitation learning, reinforcement learning and graph-based exploration, respectively.

Our agent also learns a localization branch that is supervised by the center position of the target. Since we map the 2D bounding box position into a polar representation, the label consists of two linear values, namely heading $ l^h $ and elevation $ l^e $. We use Mean Square Error (MSE) to optimize the predictions:

$$ L_{loc} = \frac{1}{N}\sum_{i=1}^{N}\left[ (\hat{l^h_i} - l^h_i)^2 + (\hat{l^e_i} - l^e_i)^2 \right ]. $$
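The teacher-action rule defined earlier in this section (stop at a target node, otherwise move to the candidate with the smallest Dijkstra distance to any of the $ m $ targets) can be sketched as follows. The graph representation as a dictionary of weighted adjacency dictionaries, the `STOP` sentinel and the function names are illustrative assumptions rather than the authors' implementation.

```python
import heapq

STOP = "stop"

def dijkstra(graph, source):
    """Shortest-path distances from source.

    graph maps each node to a dict {neighbour: edge_length}.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                        # stale heap entry
        for nxt, length in graph[node].items():
            nd = d + length
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist

def teacher_action(graph, current, candidates, targets):
    """Return STOP at a target node, else the candidate nearest to any target."""
    if current in targets:
        return STOP

    def distance_to_nearest_target(candidate):
        dist = dijkstra(graph, candidate)
        return min(dist.get(t, float("inf")) for t in targets)

    return min(candidates, key=distance_to_nearest_target)
```

Running Dijkstra once per candidate is the simplest formulation; running it once per target and reusing the distances would be cheaper when there are many candidates.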

Experiments

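Method | Val NE ↓ | Val OSR ↑ | Val SR ↑ | Val SPL ↑ | Test NE ↓ | Test OSR ↑ | Test SR ↑ | Test SPL ↑
--- | --- | --- | --- | --- | --- | --- | --- | ---
Seq2Seq [3] | 7.81 | 28.4 | 21.8 | - | 7.85 | 26.6 | 20.4 | -
Ghost [2] | 7.20 | 44 | 35 | 31 | 7.83 | 42 | 33 | 30
Speaker-Follower [17] | 6.62 | 43.1 | 34.5 | - | 6.62 | 44.5 | 35.1 | -
RCM [47] | 5.88 | 51.9 | 42.5 | - | 6.12 | 49.5 | 43.0 | 38
Monitor* [29] | 5.52 | 56 | 45 | 32 | 5.67 | 59 | 48 | 35
Regretful* [30] | 5.32 | 59 | 50 | 41 | 5.69 | 56 | 48 | 40
EGP [14] | 5.34 | 65 | 52 | 41 | - | - | - | -
EGP* [14] | 4.83 | 64 | 56 | 44 | 5.34 | 61 | 53 | 42
GBE (Ours) | 5.20 | 67.0 | 53.9 | 43.4 | 5.18 | 64.1 | 53.0 | 43.4

Val and Test denote the Unseen House (Val) and Unseen House (Test) splits.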

Table 2: Results of GBE and previous state-of-the-art methods on R2R (*: model uses additional synthetic data).


From Anywhere to Object (FAO) Dataset

We provide 3,848 sets of natural language instructions, each describing an absolute location in a 3D environment. We further collect 6,326 bounding boxes for 3,923 objects across 90 Matterport scenes. Although our task places no limitation on the agent's starting position, we provide over 30K long-distance trajectories in our dataset to validate the effectiveness of our task. Each instruction contains attributes, relationships and region descriptions that single out the unique target object when there are multiple objects. Please refer to the supplementary materials for more details of our FAO dataset and experimental analysis.

Data splits. The training split contains 3,085 sets of instructions with 28,015 trajectories over 38 houses. We propose a new split named validation on seen instructions, which is a validation set containing the same instructions in the same houses with different starting positions. The validation set for seen instructions contains 245 instructions with 1,225 trajectories. The validation set for seen houses with different instructions contains 195 instructions with 1,950 trajectories. The validation set for unseen houses contains 205 instructions with 2,040 trajectories.

Data collection. We first label bounding boxes for objects in panoramic views. Then we convert the bounding box labels into polar representations as described in Sec.polar. Note that an object can be reached from multiple positions; we annotate all these positions to reduce dataset bias. To collect diverse instructions with hierarchical descriptions, we divide the language annotation task into five subtasks, as shown in Fig.instruction: 1) describe the attributes of the target, such as its color, size or shape; 2) find at least two objects related to the target and describe their relationships; 3) explore in the simulator and describe the region in which the target is located; 4) explore and describe the nearby regions; 5) rewrite all descriptions within three sentences. The first four steps ensure language complexity and diversity, and the rewriting step makes the language instruction coherent and natural.

Finally, we generate long navigation trajectories using the navigation graph of each scene. To make the task sufficiently challenging, we first set a threshold of 18 meters. For each instruction and object pair, we fix the target viewpoint and sample the starting viewpoint. We regard a trajectory as valid if the Dijkstra distance between the two viewpoints exceeds the threshold. In some houses, long trajectories are difficult to find or may not exist at all. Thus, we discount the threshold by a factor of 0.8 after every five sampling failures.

Fig. 5 (left) illustrates the distribution of the number of words per instruction. The FAO dataset contains 3,848 instructions with a vocabulary of 1,649 words. The average number of words in an instruction set is 38.6, compared with 26.3 in REVERIE and 18.3 in R2R. Most instructions contain between 20 and 60 words, which gives them sufficient descriptive power. Moreover, the variance in instruction length makes the descriptions more diverse. The trajectory length ranges from 15 meters to more than 60 meters. In R2R and REVERIE most trajectories are within 8 hops; as shown in Fig. 5 (middle), FAO provides many more long trajectories, which makes the dataset more challenging. Fig. 5 (right) illustrates the proportion of words contributed by each of the four instruction annotation steps. The more words an annotation contains, the richer the information it carries. We therefore infer that the object relationship and nearby region descriptions contain the richest information, and that an agent should pay more attention to these two parts in order to achieve good performance.
数据分析 图 5(左)说明了指令中单词编号的分布。FAO 数据集包含 3848 条指令,词汇量为 1649 个单词。指令集中的平均单词数为 38.6,而在 REVERIE 中为 26.3,在 R2R 中为 18.3。大多数指令范围从 20 字到 60 字,这确保了表示的力量。此外,指令长度的变化使描述更加多样化。弹道长度从 15 米到 60 多米不等。与 R2R 和 REVERIE 相比,大多数轨迹都在 8 个跳内,如图 5(中)所示,FAO 提供了更多的长期轨迹,这使数据集更具挑战性。图 5(右)说明了四个指令注释步骤中单词编号的比例。注释中的单词越多,其中包含的信息就越丰富。因此,我们可以推断出对象关系和附近区域包含最丰富的信息。因此,代理人应该更加注意这两个部分,以实现良好的性能。

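| Models | Val Seen Instruction OSR | Val Seen Instruction SR | Val Seen Instruction SPL | Val Seen Instruction SFPL | Val Seen House OSR | Val Seen House SR | Val Seen House SPL | Val Seen House SFPL | Unseen House (Test) OSR | Unseen House (Test) SR | Unseen House (Test) SPL | Unseen House (Test) SFPL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | - | - | - | - | - | - | 91.4 | 90.4 | 59.2 | 51.1 |
| Random | 0.1 | 0.0 | 1.5 | 1.4 | 0.4 | 0.1 | 0.0 | 0.9 | 2.7 | 2.1 | 0.4 | 0.0 |
| Speaker-Follower [17] | 97.8 | 97.9 | 97.7 | 24.5 | 69.4 | 61.2 | 60.4 | 9.1 | 9.8 | 7.0 | 6.1 | 0.6 |
| RCM [47] | 89.1 | 84.0 | 82.6 | 10.9 | 72.7 | 62.4 | 60.9 | 7.8 | 12.4 | 7.4 | 6.2 | 0.7 |
| AuxRN [52] | 98.7 | 98.4 | 97.4 | 13.7 | 78.5 | 68.8 | 67.3 | 8.3 | 11.0 | 8.1 | 6.7 | 0.5 |
| GBE w/o GE | 91.8 | 89.5 | 88.3 | 24.2 | 73 | 62.5 | 60.8 | 6.7 | 18.8 | 11.4 | 8.7 | 0.8 |
| GBE (Ours) | 98.6 | 98.4 | 97.9 | 44.2 | 64.1 | 76.3 | 62.5 | 7.3 | 19.5 | 11.9 | 10.2 | 1.4 |

Table 3: Results for the baselines and our model on the two validation sets and the test set.

| Models | Vision | Language | SR | SPL | SFPL |
|---|---|---|---|---|---|
| GBE | ✗ | ✗ | 0.6 | 0.4 | 0.0 |
| GBE | ✓ | ✗ | 9.8 | 8.1 | 0.5 |
| GBE | ✗ | ✓ | 1.8 | 1.5 | 0.2 |
| GBE | ✓ | ✓ | 11.9 | 10.2 | 1.4 |

Table 4: Ablation of unimodal inputs.

| Models | SR | SPL | SFPL |
|---|---|---|---|
| GBE+ | 7.3 | 6.2 | 0.5 |
| GBE++ | 6.2 | 4.9 | 0.7 |
| GBE+++ | 6.6 | 5.5 | 0.8 |
| GBE (rewritten instructions) | 11.9 | 10.2 | 1.4 |

Table 5: Ablation of granularity levels.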


Experimental Results

We evaluate the GBE model on the R2R and FAO datasets. We split our dataset into five components: 1) training; 2) validation on seen instructions (on seen houses as well); 3) validation on seen houses but unseen instructions; 4) validation on unseen houses; and 5) testing. Compared with the standard VLN benchmark, we add a new validation set in FAO, the validation on seen instructions, because the task is independent of the starting position. We evaluate performance from two aspects: navigation performance and localization performance. Navigation performance is evaluated via commonly used VLN metrics, including Navigation Error (NE), Success Rate (SR), Oracle Success Rate (OSR) and Success Rate weighted by Path Length (SPL). Localization performance is evaluated by a success rate that indicates whether the predicted direction falls inside the bounding box. We combine SPL and localization success to propose the success rate of finding weighted by path length (SFPL):
$$ \textnormal{SFPL} = \frac{1}{N}\sum_{i=1}^{N} S^{nav}_i S^{loc}_i \frac{l^{nav}_i}{\textnormal{max}(l^{nav}_i, l^{gt}_i)}, $$
where $ S^{nav}_i $ and $ S^{loc}_i $ are indicators of whether the agent has successfully navigated to and localized the target, respectively, $ l^{nav}_i $ is the length of the navigation trajectory, and $ l^{gt}_i $ is the shortest distance between the ground truth target and the starting position.

We compare the proposed model with several baselines: 1) a random policy; 2) Speaker-Follower, an imitation learning method; 3) RCM, a method combining imitation learning and reinforcement learning; 4) AuxRN, a model with auxiliary tasks; 5) the Hierarchical Memory Network. All five models employ the same vision-language navigation backbone introduced in Sec. 4. The visual encoder $ g $ is implemented by a ResNet-101 and the language encoder $ h $ is a combination of a word embedding layer and an LSTM layer. We train all models on the training split for 10K interactions to ensure that all models are sufficiently trained. The optimizer we use is RMSProp and the learning rate is $ 10^{-4} $.

In Tab. 2, we compare the GBE model with state-of-the-art models that use neither pretraining nor auxiliary tasks. On the unseen house validation set, GBE outperforms all models that do not use additional data. It outperforms EGP, another graph-based navigation method, by 2.4% in SPL. On the test set, GBE outperforms previous models on all evaluation metrics. It outperforms RCM, a seq2seq model trained with imitation learning and reinforcement learning, by 5.4% in SPL.

The experimental results on FAO are presented in Tab. 3. The performance of the baseline models reveals some unique features of the FAO dataset. Firstly, human performance far exceeds that of all models. This human-machine gap suggests that current methods are not able to solve this new task. The random policy performs poorly on all metrics, which indicates that our dataset is not biased. Moreover, Reinforced Cross-Modal Matching (RCM), a method that combines imitation learning and reinforcement learning, outperforms the pure imitation learning method (Speaker-Follower) on the unseen house set. This indicates that reinforcement learning helps avoid overfitting on our dataset. Our experiment with AuxRN shows that the auxiliary tasks that work on R2R are not beneficial on FAO, which indicates that SOON is a distinct task. We test the performance of GBE with and without graph-based exploration, and observe that graph-based exploration gives the model better generalization ability. On the test set, the final model is 0.7% higher in Oracle success rate, 0.5% higher in success rate, 1.5% higher in SPL and 0.6% higher in SFPL than the model without graph-based exploration. We also find that models perform well on the seen instruction set but poorly on the other two sets. Since the domain of the seen instruction set is close to that of the training set, this indicates that the models fit the training data well but lack generalizability.

We ablate the FAO dataset from two aspects: 1) the effect of the vision and language modalities and 2) the effect of different granularity levels.


我们评估 R2R 和粮农组织数据集的 GBE 模型。我们将 DataSet 拆分为五个组件:1)培训; 2)关于所看到的指示(也是在所见的房屋上); 3)在看见房屋但看不见的指示; 4)关于看不见的房屋的验证;和 5)测试。与标准 VLN 基准测试相比,我们在粮农组织中添加了一个新的验证,由于任务独立于任务,在看出指令上的验证。我们评估了两个方面的性能:导航性能和本地化性能。通过常用的 VLN 度量评估导航性能,包括导航误差(NE),成功率(SR),Oracle 成功率(OSR)以及由路径长度(SPL)加权的成功率。通过成功率来评估本地化性能,指示预测方向是否位于边界框中。我们将 SPL 和本地化成功结合起来提出了通过路径长度(SFPL)加权的成功率(SFPL):
$$ \textnormal{SFPL} = \frac{1}{N}\sum_{i=1}^{N} S^{nav}_i S^{loc}_i \frac{l^{nav}_i}{\textnormal{max}(l^{nav}_i, l^{gt}_i)}, $$
其中$ S^{nav}_i $和$ S^{loc}_i $是分别成功导航到或本地化目标的指示符。 $ l^{nav}_i $是导航轨迹的长度,而$ l^{gt}_i $是地面真实目标和起始位置之间的最短距离。
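The SFPL metric defined above can be computed per episode and averaged. The sketch below is illustrative only: the function name `sfpl`, its argument names, and the handling of zero-length episodes are assumptions. By default it follows the formula exactly as printed, with $ l^{nav}_i $ in the numerator. The conventional SPL definition places the shortest-path length in the numerator instead, so a hypothetical `numerator` switch is included to allow both readings to be compared.

```python
from typing import Sequence


def sfpl(nav_success: Sequence[int],
         loc_success: Sequence[int],
         nav_lengths: Sequence[float],
         gt_lengths: Sequence[float],
         numerator: str = "nav") -> float:
    """Success rate of finding weighted by path length.

    nav_success[i], loc_success[i]: 0/1 indicators S^nav_i and S^loc_i.
    nav_lengths[i]: length l^nav_i of the executed trajectory.
    gt_lengths[i]:  shortest start-to-target distance l^gt_i.
    numerator: "nav" reproduces the formula as printed; "gt" uses the
               conventional SPL weighting (assumption, for comparison).
    """
    n = len(nav_success)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (len(loc_success) == len(nav_lengths) == len(gt_lengths) == n):
        raise ValueError("all inputs must have the same length")
    if numerator not in ("nav", "gt"):
        raise ValueError("numerator must be 'nav' or 'gt'")

    total = 0.0
    for s_nav, s_loc, l_nav, l_gt in zip(nav_success, loc_success,
                                         nav_lengths, gt_lengths):
        denom = max(l_nav, l_gt)
        if denom <= 0:
            continue  # degenerate episode contributes zero (assumption)
        top = l_nav if numerator == "nav" else l_gt
        total += s_nav * s_loc * top / denom
    return total / n


if __name__ == "__main__":
    # Three toy episodes: only the first both reaches and localizes the target.
    print(sfpl([1, 1, 0], [1, 0, 1], [20.0, 10.0, 5.0], [10.0, 10.0, 5.0]))
    print(sfpl([1, 1, 0], [1, 0, 1], [20.0, 10.0, 5.0], [10.0, 10.0, 5.0],
               numerator="gt"))
```

An episode counts toward the score only when both indicators are 1, so a model that reaches the right location but points at the wrong object receives no credit for that episode.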

实施细节 我们将提出的模型与几个基线进行比较:1)随机策略;2)Speaker-Follower [17],一种模仿学习方法;3)RCM [47],一种结合模仿学习和强化学习的方法;4)AuxRN [52],具有辅助任务的模型;5)分层记忆网络(Hierarchical Memory Network)。这五个模型都采用了第 4 节中介绍的相同视觉语言导航主干。视觉编码器 $ g $ 由 ResNet-101 [20] 实现,语言编码器 $ h $ 是词嵌入层和 LSTM [23] 层的组合。我们在训练集上对所有模型训练 10K 次交互,以确保所有模型都得到充分训练。我们使用的优化器是 RMSProp,学习率是 $ 10^{-4} $。

R2R 上的结果 在表 2 中,我们将 GBE 模型与不使用预训练和辅助任务的最新模型进行了比较。在未见房屋验证集上,GBE 在不使用额外数据的情况下胜过所有模型。在 SPL 上,它比另一种基于图的导航方法 EGP 高出 2.4%。在测试集上,GBE 在所有评估指标上均优于以往的模型。在 SPL 上,它比 RCM(一种结合模仿学习和强化学习的 seq2seq 模型)高出 5.4%。

FAO 上的结果 实验结果列在表 3 中。基线模型的性能揭示了 FAO 数据集的一些独特特征。首先,人类的表现大大优于所有模型。这种人机差距的存在表明,当前的方法无法解决这一新任务。随机策略在所有指标上的表现都很差,这表明我们的数据集没有偏差。此外,强化交叉模态匹配(RCM)是一种将模仿学习和强化学习相结合的方法,在未见房屋集合上优于纯模仿学习方法(Speaker-Follower)。这表明强化学习有助于避免在我们的数据集上过拟合。我们对 AuxRN 的实验表明,在 R2R 上有效的辅助任务在 FAO 上没有帮助,这表明 SOON 任务是独特的。我们测试了 GBE 以及不使用基于图的探索的 GBE 的性能。我们观察到,通过基于图的探索,模型获得了更好的泛化能力。在测试集上,最终模型与不使用基于图的探索的模型相比,Oracle 成功率高出 0.7%,成功率高出 0.5%,SPL 高出 1.5%,SFPL 高出 0.6%。我们发现模型在已见指令集上表现良好,而在其他两个集合上表现不佳。由于已见指令集的分布接近训练集,这表明模型很好地拟合了训练数据,但缺乏泛化能力。

FAO 的消融研究 我们从两个方面对 FAO 数据集进行消融:1)视觉和语言模态的影响,以及 2)不同粒度级别的影响。

The ablation results for the input modalities are shown in Tab. 4. We observe that the model without vision and language input performs the worst, so the SOON task cannot be completed without the vision and language modalities. The model with vision only performs better than the model with language only, from which we infer that vision is more important than language in the SOON task. Some objects, such as 'chair', exist in all houses, while others, such as 'flower', are less common; the model can therefore learn prior knowledge that lets it find common objects without language. Finally, we find that the model with both vision and language performs the best, indicating that the two modalities are related and that both are important.

The ablation results for granularity levels are shown in Tab. 5. We train GBE with different annotation granularity levels: object names, object attributes and relationships, region information, and rewritten instructions. Note that the model with object names (GBE+) is equivalent to ObjectGoal navigation. We find that the model trained in the ObjectGoal setting performs worse than the models trained with more information. There are two reasons: 1) more than one object may belong to the same class, so navigating with only the object name causes ambiguity; 2) navigating without scene and region information makes it harder for the agent to find the final location. By comparing the first three experiments, we infer that the object name, the object attributes and relationships, and the region descriptions all contribute to SOON navigation. Lastly, we find that the model with rewritten instructions performs the best (0.6% higher in SFPL than GBE+++). We infer that a well-formed natural language instruction is easier for the agent to comprehend.

输入模态的消融结果显示在表 4 中。我们观察到,没有视觉和语言输入的模型表现最差。因此,没有视觉和语言模态就不可能完成 SOON 任务。仅有视觉的模型比仅有语言的模型表现更好。我们推断,在 SOON 任务中,视觉比语言更重要。一些像“椅子”这样的对象存在于所有房屋中,而其他诸如“花”这样的对象并不普遍存在。因此,模型可以学习先验知识,在没有语言的导航中找到常见的对象。最后,我们发现同时具有视觉和语言的模型表现最佳,表明这两种模态是相关的,并且都很重要。

粒度级别的消融结果显示在表 5 中。我们用不同的注释粒度级别训练 GBE:对象名称,对象属性和关系,区域信息,重写指令。请注意,带有对象名称的模型(GBE+)等效于 ObjectGoal 导航。我们发现,在 ObjectGoal 设置下训练的模型比使用更多信息训练的模型表现更差。这有两个原因:1)属于同一类的对象不止一个,仅使用对象名称导航会引起歧义;2)在没有场景和区域信息的情况下导航会使代理更难找到最终位置。通过比较前三个实验,我们推断出对象名称、对象属性和关系以及区域描述都有助于 SOON 导航。最后,我们发现使用重写指令的模型表现最佳(SFPL 比 GBE+++ 高 0.6%)。我们推断,表述完善的自然语言指令更便于代理理解。

Conclusion 结论

In this paper, we have proposed a task named Scenario Oriented Object Navigation (SOON), in which an agent is instructed to find an object in a house from an arbitrary starting position. To accompany this, we have constructed a dataset named From Anywhere to Object (FAO) with 3K descriptive natural language instructions. To suggest a promising direction for approaching this task, we propose GBE, a model that explicitly represents the explored areas as a feature graph and introduces a graph-based exploration approach to obtain a robust policy. Our model outperforms all previous state-of-the-art models on the R2R and FAO datasets. We hope that the SOON task can help the community approach real-world navigation problems.

在本文中,我们提出了一个名为“面向场景的对象导航”(SOON)的任务,其中指示代理从任意起始位置在房屋中查找对象。为此,我们使用 3K 描述性自然语言指令构建了一个名为“从任何地方到对象”(FAO)的数据集。为了提出解决该问题的有希望的方向,我们提出了 GBE 模型,该模型将探索区域明确地建模为特征图,并引入了基于图的探索方法来获得可靠的策略。我们的模型优于 R2R 和 FAO 数据集上所有以前的最新模型。我们希望 SOON 任务可以帮助社区解决现实世界中的导航问题。


This work was supported in part by National Key R&D Program of China under Grant No. 2020AAA0109700, Natural Science Foundation of China (NSFC) under Grant No. U19A2073, No. 61976233 and No. 61906109, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, Shenzhen Outstanding Youth Research Project (Project No. RCYX20200714114642083), Shenzhen Basic Research Project (Project No. JCYJ20190807154211365), Zhijiang Lab's Open Fund (No. 2020AA3AB14) and CSIG Young Fellow Support Fund, and by the Australian Research Council Discovery Early Career Researcher Award (DE190100626).


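这项工作得到了以下项目的部分支持:中国国家重点研发计划(批准号 2020AAA0109700),国家自然科学基金(NSFC)(批准号 U19A2073、61976233 和 61906109),广东省基础与应用基础研究(区域联合基金-重点)项目(批准号 2019B1515120039),深圳市优秀青年研究项目(项目编号 RCYX20200714114642083),深圳市基础研究项目(项目编号 JCYJ20190807154211365),之江实验室开放基金(编号 2020AA3AB14)和 CSIG 青年学者支持基金,以及澳大利亚研究委员会早期职业研究员发现奖(DE190100626)。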