# SOON: Scenario Oriented Object Navigation with Graph-based Exploration

## Abstract

The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots. Most visual navigation benchmarks, however, focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depicts the route step by step. This setting deviates from real-world problems, in which humans only describe what the object and its surroundings look like and ask the robot to start navigation from anywhere. Accordingly, in this paper we introduce a Scenario Oriented Object Navigation (SOON) task, in which an agent is required to navigate from an arbitrary position in a 3D embodied environment to localize a target following a scene description. To suggest a promising direction for solving this task, we propose a novel Graph-based Exploration (GBE) method, which models the navigation state as a graph and introduces a novel graph-based exploration approach that learns knowledge from the graph and stabilizes training by learning from sub-optimal trajectories. We also propose a new large-scale benchmark named the From Anywhere to Object (FAO) dataset. To avoid target ambiguity, the descriptions in FAO provide rich semantic scene information, including object attributes, object relationships, region descriptions, and nearby region descriptions. Our experiments reveal that the proposed GBE outperforms various state-of-the-art methods on both the FAO and R2R datasets, and ablation studies on FAO validate the quality of the dataset.


## Introduction

Recent research efforts have achieved great success in embodied navigation tasks. The agent is able to reach the target by following a variety of instructions, such as a word (e.g. an object name or room name), a question-answer pair, a natural language sentence, or a dialogue consisting of multiple sentences. However, these navigation approaches are still far from real-world navigation activities. Current vision-language navigation tasks, such as Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH), focus on navigating to a target along a fixed trajectory, guided by an elaborate set of instructions that outlines every step. These approaches fail to consider the case in which the instruction provides only a target description while the starting point is not fixed. In real-world applications, people often do not provide detailed step-by-step instructions and expect the robot to be capable of self-exploration and autonomous decision-making. We argue that the ability to navigate towards a language-guided target from anywhere in a 3D embodied environment, as a human would, is of great importance for an intelligent robot.

To address these problems, we propose a new task, named Scenario Oriented Object Navigation (SOON), in which an agent is instructed to find a thoroughly described target object inside a house. The navigation instructions in SOON are target-oriented rather than the step-by-step, babysitter-style instructions of previous benchmarks. Two major features make our task unique: target orientation and starting independence. A brief example of a navigation process in SOON is illustrated in Fig.overview. Different from conventional object navigation tasks, instructions in SOON play a guiding role beyond merely specifying a target object class: an instruction contains thorough descriptions that guide the agent to find a unique object from anywhere in the house. After receiving an instruction, the agent first searches a larger-scale area according to the region descriptions in the instruction, and then gradually narrows the search space down to the target area.

| Dataset | Instruction Content | Visual Context |
| --- | --- | --- |
| House3D [49] | Room Name | Dynamic |
| MINOS [40] | Object Name | Dynamic |
| EQA [11], IQA [18] | QA | Dynamic |
| MARCO [31], DRIF [5] | Instruction | Dynamic |
| R2R [3] | Instruction | Dynamic |
| TouchDown [10] | Instruction | Dynamic |
| VLNA [37], HANNA [36] | Dialog | Dynamic |
| TtW [13] | Dialog | Dynamic |
| CVDN [45] | Dialog | Dynamic |
| REVERIE [39] | Instruction | Dynamic |
| FAO (Ours) | Instruction | Dynamic |

Table 1: Comparison with existing datasets for embodied vision-and-language tasks, along the dimensions of instruction context (human-generated, content, unambiguity), visual context (real-world imagery, temporal dynamics), starting independence, and target orientation.

In this work, we propose a novel Graph-based Semantic Exploration (GBE) method to suggest a promising direction for approaching SOON. The proposed GBE has two advantages over previous navigation works. First, GBE models the navigation process as a graph, which enables the navigation agent to obtain a comprehensive and structured understanding of observed information. It adopts a graph action space that merges the multiple actions of conventional sequence-to-sequence models into a single-step decision. Merging actions reduces the number of predictions in a navigation episode, which makes model training more stable. Second, unlike other graph-based navigation models that use either imitation learning or reinforcement learning to learn the navigation policy, the proposed GBE combines the two and introduces a novel exploration approach that stabilizes training by learning from sub-optimal trajectories. In imitation learning, the agent learns to navigate step by step under the supervision of ground-truth labels; this causes severe overfitting, since labeled trajectories occupy only a small proportion of the large trajectory space. In reinforcement learning, the navigation agent explores the large trajectory space and learns to maximize the discounted reward, leveraging sub-optimal trajectories to improve generalizability.

However, reinforcement learning is not an end-to-end optimization method, which makes it difficult for the agent to converge to a robust policy. We propose to learn the optimal actions in trajectories sampled from the imperfect GBE policy, stabilizing training while exploring. Unlike other RL exploration methods, the proposed exploration method is based on the semantic graph, which is built dynamically during navigation; it thus helps the agent learn a robust policy while navigating based on the graph. To investigate the SOON task, we propose a large-scale From Anywhere to Object (FAO) benchmark. The benchmark is built on the Matterport3D simulator, which comprises 90 different housing environments with real image panoramas. FAO provides 4K sets of annotated instructions with 40K trajectories. As Fig.overview (left) shows, one instruction set contains three sentences covering four levels of description: i) the color and shape of the object; ii) the surrounding objects, along with the relationships between these objects and the target object; iii) the area in which the target object is located; and iv) the neighboring areas. The average number of words per instruction is 38 (vs. 26 for R2R), and the average number of hops in the labeled trajectories is 9.6 (vs. 6.0 for R2R), making our dataset more challenging than existing ones. We present experimental analyses on both the R2R and FAO datasets to validate the performance of the proposed GBE and the quality of the FAO dataset. The proposed GBE significantly outperforms previous VLN methods without pretraining or auxiliary tasks on the R2R and SOON tasks. We further provide human performance on the FAO test set to quantify the human-machine gap. Moreover, by ablating the vision and language modalities at different granularities, we validate that the FAO dataset contains rich information that enables the agent to successfully locate the target.

Classical SLAM-based methods build a 3D map with LIDAR, depth, or structure information, and then plan navigation routes based on this map. Due to the development of photo-realistic environments and efficient simulators, deep learning-based methods have become feasible ways of training a navigation agent. Since deep learning methods have demonstrated their ability in feature engineering, end-to-end agents are becoming popular. Later works adopt the idea of SLAM and introduce a memory mechanism, combining classical mapping methods and deep learning for generalization and long-trajectory navigation. Recent works model the navigation semantics in graphs and achieve great success in embodied navigation tasks. Different from previous works that train the agent only on labeled trajectories via imitation learning, our work introduces reinforcement learning into policy learning and proposes a novel exploration method to learn a robust policy.

## Scenario Oriented Object Navigation

We propose a new Scenario Oriented Object Navigation (SOON) task, in which an agent navigates from an arbitrary position in a 3D embodied environment to localize a target object following an instruction. The task includes two sub-tasks: navigation and localization. We consider navigation a success if the agent reaches a position close to the target (<3m), and we consider localization a success if, given successful navigation, the agent correctly locates the target object in the panoramic view. To ensure that the target object can be found regardless of the agent's starting point, the instruction consists of several parts: i) object attributes, ii) object relationships, iii) area description, and iv) neighbor area descriptions. An example demonstrating the different parts of a description is shown in Fig.instruction.

At step $t$ of navigation, the agent observes a panoramic view $v_t$ containing RGB and depth information. Meanwhile, the agent receives neighbor node observations $U_t=\{u^1_t,...,u^k_t\}$, the observations of the $k$ positions reachable from the current position. All reachable positions in a house scan are discretized into a navigation graph, and the agent navigates between nodes in the graph. At each step, the agent takes an action $a$ to move from the current position to a neighbor node, or to stop. In addition to the RGB-D sensor, the simulator provides a GPS sensor to inform the agent of its x, y coordinates, as well as the indexes of the current node and the candidate nodes.

REVERIE annotates 2D bounding boxes in 2D views to represent the locations of objects, where the 2D views are separate from the panoramic views of the embodied simulator. This way of labeling has two disadvantages: 1) objects split across 2D views may not be labeled; 2) 2D image distortion introduces labeling noise. We instead adopt the idea of point detection and represent the location in polar coordinates, as shown in Fig.polar. First, we annotate the object bounding box with four vertices $\{ p_1, p_2, p_3, p_4 \}$. Then, we calculate the center point $p_c$ of the bounding box. Finally, we convert the 2D coordinates into an angle difference between the original camera ray $\alpha$ and the adjusted camera ray $\alpha'$.
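To make the polar representation concrete, here is a minimal sketch of the bounding-box-to-polar conversion, assuming a pinhole view with known fields of view; the helper name `bbox_to_polar` and the intrinsics are illustrative assumptions, not the dataset's actual annotation tooling.

```python
import math

def bbox_to_polar(vertices, img_w, img_h, hfov_deg=90.0, vfov_deg=90.0):
    """Convert a 2D bounding box (four (x, y) pixel vertices) into
    (heading, elevation) offsets relative to the current camera ray.

    The field-of-view values and pinhole assumptions are illustrative.
    """
    # Center point p_c as the mean of the four vertices.
    cx = sum(x for x, _ in vertices) / 4.0
    cy = sum(y for _, y in vertices) / 4.0

    # Pixel offsets from the image center (the original camera ray alpha).
    dx = cx - img_w / 2.0
    dy = img_h / 2.0 - cy  # image y grows downward; elevation grows upward

    # Angle difference between the original ray alpha and the adjusted ray alpha'.
    fx = (img_w / 2.0) / math.tan(math.radians(hfov_deg / 2.0))
    fy = (img_h / 2.0) / math.tan(math.radians(vfov_deg / 2.0))
    heading = math.atan2(dx, fx)    # radians, positive to the right
    elevation = math.atan2(dy, fy)  # radians, positive upward
    return heading, elevation

# Example: a box centered right of the image center yields a positive heading.
h, e = bbox_to_polar([(620, 300), (700, 300), (700, 380), (620, 380)], 1024, 768)
```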

## Graph-based Semantic Exploration

We present the Graph-based Semantic Exploration (GBE) method in this section. The pipeline of GBE is shown in Fig.model. Our vision encoder $g$ and language encoder $h$ follow common practice in vision-language navigation. We first introduce the graph planner in GBE, which models the structured semantics of visited places, and then introduce our exploration method based on the graph planner.

Memorizing viewed scenes and explicitly modeling the navigation environment are helpful for long-term navigation. We therefore introduce a graph planner that memorizes observed features and models the explored areas as a feature graph. The graph planner maintains a node feature set $\mathcal{V}$, an edge set $\mathcal{E}$, and a node embedding set $\mathcal{M}$. The node feature set $\mathcal{V}$ stores node features and candidate features generated by the visual encoder $g$. The edge set $\mathcal{E}$ is dynamically updated to represent the explored navigation graph. The embedding set $\mathcal{M}$ stores the intermediate node embeddings, which are updated by a GCN; the node features in $\mathcal{M}$, denoted $f^{\mathcal{M}}_{n_i}$, are initialized with the features of the same positions in $\mathcal{V}$. At step $t$, the agent navigates to the position with index $n_0$ and receives a visual observation $v_t$ together with the neighbor observations $U_t=\{u^1_t,...,u^k_t\}$, where $k$ is the number of neighbors and $N_t=\{n_1,...,n_k\}$ are the node indexes of the neighbors; $n_0$ stands for the current node, and $n_i (1 \le i \le k)$ are the nodes it connects with. The visual observation and the neighbor observations are embedded by the visual encoder $g$ as $f^v_{n_0} = g(v_t)$ and $f^u_{n_i} = g(u^i_t)$. The graph planner adds $f^v_{n_0}$ and $f^u_{n_i}$ into $\mathcal{V}$:

$$\mathcal{V}\leftarrow\mathcal{V}\cup\{f^v_{n_0}, f^{u}_{n_1},\dots,f^{u}_{n_k}\}.$$

For an arbitrary node $n_i$ in the navigation graph, its node feature is represented by $\mathcal{V}$ following three rules: 1) if a node $n_i$ has been visited, its feature $f_{n_i}$ is represented by $f^v_{n_i}$; 2) if a node $n_i$ has not been visited but has been observed, its feature is represented by $f^u_{n_i}$; 3) since a navigable position can be observed from multiple different views, an unvisited node's feature is represented by the average of all its observed features. The graph planner also updates the edge set $\mathcal{E}$ by:

$$\mathcal{E}\leftarrow\mathcal{E}\cup\{(n_0, n_1),(n_0, n_2),\dots,(n_0, n_k)\}.$$

An edge is represented by a tuple consisting of two node indexes, indicating that the two nodes are connected. Then, $\mathcal{M}$ is updated by the GCN based on $\mathcal{V}$ and $\mathcal{E}$:

$$\mathcal{M}\leftarrow\mathrm{GCN}(\mathcal{M}, \mathcal{E}).$$

To obtain a comprehensive understanding of the current position and the nearby scene, we define the output of the graph planner as:

$$f^g_t = \frac{1}{k+1}\sum_{i=0}^{k} f^{\mathcal{M}}_{n_i},$$
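The following is a minimal sketch of the graph planner bookkeeping described above, assuming plain NumPy feature vectors and a single round of mean aggregation as a stand-in for the GCN update; the class name `GraphPlanner` and its interface are illustrative, not the authors' released code.

```python
import numpy as np
from collections import defaultdict

class GraphPlanner:
    """Illustrative bookkeeping for V (node features), E (edges), M (embeddings)."""

    def __init__(self):
        self.visited = {}                  # node index -> visited feature f^v
        self.observed = defaultdict(list)  # node index -> observed features f^u
        self.edges = set()                 # set of (n_i, n_j) tuples
        self.embeddings = {}               # node index -> f^M

    def node_feature(self, n):
        # Rule 1: visited nodes use f^v; rules 2-3: unvisited nodes use the
        # average of all features observed from different viewpoints.
        if n in self.visited:
            return self.visited[n]
        return np.mean(self.observed[n], axis=0)

    def update(self, n0, f_v, neighbors, f_u):
        """Add the current node and its neighbor observations, then refresh M."""
        self.visited[n0] = f_v
        for n_i, f in zip(neighbors, f_u):
            self.observed[n_i].append(f)
            self.edges.add((n0, n_i))
        # Stand-in for the GCN: one round of mean aggregation over neighbors.
        adj = defaultdict(set)
        for a, b in self.edges:
            adj[a].add(b)
            adj[b].add(a)
        for n in set(self.visited) | set(self.observed):
            neigh = [self.node_feature(m) for m in adj[n]] + [self.node_feature(n)]
            self.embeddings[n] = np.mean(neigh, axis=0)

    def output(self, n0, neighbors):
        # f^g_t: average embedding of the current node and its k neighbors.
        return np.mean([self.embeddings[n] for n in [n0] + list(neighbors)], axis=0)
```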

$f^g_t$ and the language feature $f_t^l$ are fused by cross-modal matching to output $\tilde{f_t}$. GBE uses $\tilde{f_t}$ for two tasks: navigation action prediction and target object localization. The navigation candidates are all observed but unvisited nodes, with indexes $C = \{c_1,...,c_{|C|}\}$, where $|C|$ is the number of candidates. The candidate features are extracted from $\mathcal{V}$ and denoted $\{f_{c_1},...,f_{c_{|C|}}\}$. The agent generates a probability distribution $p_t$ over candidates for action prediction, and outputs regression results $\hat{l^h_i}$ and $\hat{l^e_i}$, the heading and elevation values for localization, for $0\le i \le |C|$. The logits $z_i$ are generated by a fully connected layer with parameters $W_{nav}$, and $a_{c_0}$ denotes the stop action. The size of the action space, $|\mathcal{A}| = |C| + 1$, thus varies with the dynamically built graph.

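A minimal sketch of action prediction over this variable-size action space, in PyTorch; scoring candidates against the fused feature through a linear layer and a learned stop embedding, and regressing a single heading/elevation pair from $\tilde{f_t}$, are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    """Score [stop] + each candidate node; the action space grows with the graph."""

    def __init__(self, fused_dim, cand_dim):
        super().__init__()
        self.w_nav = nn.Linear(fused_dim, cand_dim)      # plays the role of W_nav
        self.stop = nn.Parameter(torch.zeros(cand_dim))  # learned stop embedding
        self.loc_head = nn.Linear(fused_dim, 2)          # heading/elevation regression

    def forward(self, fused, cand_feats):
        # fused: (fused_dim,); cand_feats: (|C|, cand_dim), taken from V.
        query = self.w_nav(fused)                                      # (cand_dim,)
        keys = torch.cat([self.stop.unsqueeze(0), cand_feats], dim=0)  # (|C|+1, cand_dim)
        z = keys @ query                       # logits; index 0 is the stop action
        p_t = torch.softmax(z, dim=0)          # distribution over |A| = |C| + 1
        l_h, l_e = self.loc_head(fused)        # localization outputs
        return p_t, l_h, l_e

# Usage: with 5 candidate nodes the agent chooses among 6 actions.
pred = ActionPredictor(fused_dim=512, cand_dim=512)
p, lh, le = pred(torch.randn(512), torch.randn(5, 512))
a_t = torch.multinomial(p, 1)  # sample an action during exploration
```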


Seq2seq navigation models such as Speaker-Follower perceive only the current observation and an encoding of the historical information, and existing exploration methods focus on data augmentation, heuristic-aided approaches, and auxiliary tasks. With the dynamically built semantic graph, however, the navigation agent is able to memorize all the nodes it has observed but not visited. We therefore propose to use the semantic graph to facilitate exploration. As shown in Fig.model (yellow box), the graph planner builds the navigation semantic graph during exploration. In imitation learning, the navigation agent uses the ground-truth action $a^*_t$ to sample the trajectory. In graph-based exploration, by contrast, the navigation action $a_t$ at each step $t$ is sampled from the predicted probability distribution over candidates in Eq.prediction. The graph planner calculates the Dijkstra distance from each candidate to the target, and the teacher action $\hat{a}_t$ is the one reaching the candidate closest to the target, as sketched below.
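A minimal sketch of computing the teacher action from the dynamically built graph, assuming an adjacency-list graph with explicit edge weights and a plain Dijkstra search; the helper names are illustrative. It already handles the multi-target case formalized in the equation below.

```python
import heapq

def dijkstra_distances(edges, source):
    """Shortest-path distances from `source` over an undirected weighted graph.
    `edges` maps node -> list of (neighbor, weight) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, float("inf")):
            continue
        for m, w in edges.get(n, []):
            nd = d + w
            if nd < dist.get(m, float("inf")):
                dist[m] = nd
                heapq.heappush(heap, (nd, m))
    return dist

def teacher_action(edges, candidates, targets, current):
    """Return the index of the candidate closest to any target node,
    or None (the stop action) if the agent already stands on a target."""
    if current in targets:
        return None
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(candidates):
        dist = dijkstra_distances(edges, c)
        d = min(dist.get(t, float("inf")) for t in targets)
        if d < best_d:
            best_i, best_d = i, d
    return best_i
```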

Each trajectory in the Room-to-Room (R2R) dataset has only one target position. In the SOON task, however, since the target object can be observed from multiple positions, a trajectory can have multiple target positions. The teacher action $\hat{a}_t$ is calculated by:

$$\hat{a}_t = \underset{a_t^{c_i}}{\mathrm{argmin}}\left[ \min\left( \mathrm{D}(c_i, n_{T_1}),\dots,\mathrm{D}(c_i, n_{T_m}) \right) \right],$$

where $n_{T_1},...,n_{T_m}$ are the indexes of the $m$ targets, and $a_t^{c_i}$ denotes the action from the current position to candidate node $c_i$. $\mathrm{D}(n_i, n_j)$ is the function that calculates the Dijkstra distance between nodes $n_i$ and $n_j$. Note that the target positions are visible during training, in order to calculate the teacher action, but not during testing. If the current position is one of the target nodes, the teacher action $\hat{a}_t$ is the stop action. Sampling and executing actions $a_t$ from the imperfect navigation policy enables the agent to explore the room, while using the optimal action $\hat{a}_t$ helps it learn a robust policy.

We introduce two training objectives: i) the navigation objective $L_{nav}$ and ii) the object localization objective $L_{loc}$. The GBE model is jointly optimized with these two objectives. In imitation learning, the navigation agent learns from the ground-truth action $a^*_t$. In reinforcement learning, the agent learns to navigate by maximizing the discounted reward when taking action $a_t$; the reward is calculated from the Dijkstra distance between the current position and the target, and $A_t$ is the advantage defined in A2C. In graph-based exploration, the graph planner computes the candidate closest to the target, and the action moving to that candidate is set as $\hat{a}_t$. $L_{nav}$ is the combination of the above three learning approaches:

$$L_{nav} = -\lambda_1 \sum_t \log p_t(a^*_t) - \lambda_2 \sum_t A_t \log p_t(a_t) - \lambda_3 \sum_t \log p_t(\hat{a}_t),$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the loss weights for imitation learning, reinforcement learning, and graph-based exploration, respectively. Our agent also learns a localization branch that is supervised by the center position of the target. Since we map the 2D bounding box position into the polar representation, the label consists of two scalar values, the heading $l^h$ and the elevation $l^e$. We use the Mean Squared Error (MSE) to optimize the predictions:

$$L_{loc} = \frac{1}{N}\sum_{i=1}^{N}\left[ (\hat{l^h_i} - l^h_i)^2 + (\hat{l^e_i} - l^e_i)^2 \right].$$

## Experiments

| Method | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ | NE ↓ | OSR ↑ | SR ↑ | SPL ↑ |
|---|---|---|---|---|---|---|---|---|
| Seq2Seq [3] | 7.81 | 28.4 | 21.8 | - | 7.85 | 26.6 | 20.4 | - |
| Ghost [2] | 7.20 | 44 | 35 | 31 | 7.83 | 42 | 33 | 30 |
| Speaker-Follower [17] | 6.62 | 43.1 | 34.5 | - | 6.62 | 44.5 | 35.1 | - |
| RCM [47] | 5.88 | 51.9 | 42.5 | - | 6.12 | 49.5 | 43.0 | 38 |
| Monitor* [29] | 5.52 | 56 | 45 | 32 | 5.67 | 59 | 48 | 35 |
| Regretful* [30] | 5.32 | 59 | 50 | 41 | 5.69 | 56 | 48 | 40 |
| EGP [14] | 5.34 | 65 | 52 | 41 | - | - | - | - |
| EGP* [14] | 4.83 | 64 | 56 | 44 | 5.34 | 61 | 53 | 42 |
| GBE (Ours) | 5.20 | 67.0 | 53.9 | 43.4 | 5.18 | 64.1 | 53.0 | 43.4 |

Table 2: Results of GBE and previous state-of-the-art methods on R2R; the first four metric columns are on the Unseen House (Val) split and the last four on the Unseen House (Test) split (*: model uses additional synthetic data).

### From Anywhere to Object (FAO) Dataset

We provide 3,848 sets of natural language instructions, describing absolute locations in 3D environments. We further collect 6,326 bounding boxes for 3,923 objects across 90 Matterport scenes. Although our task places no limitation on the agent's starting position, we provide over 30K long-distance trajectories in our dataset to validate the effectiveness of the task. Each instruction contains attributes, relationships, and region descriptions so as to filter out the unique target object when multiple objects are present. Please refer to the supplementary materials for more details of the FAO dataset and experimental analysis.

The training split contains 3,085 sets of instructions with 28,015 trajectories over 38 houses. We propose a new split named validation on seen instructions: a validation set containing the same instructions in the same houses but with different starting positions, comprising 245 instructions with 1,225 trajectories. The validation set for seen houses with different instructions contains 195 instructions with 1,950 trajectories, and the validation set for unseen houses contains 205 instructions with 2,040 trajectories.

We first label bounding boxes for objects in panoramic views, and then convert the bounding box labels into polar representations as described in Sec.polar. Note that an object can be reached from multiple positions; we annotate all of these positions to reduce dataset bias. To collect diverse instructions with hierarchical descriptions, we divide the language annotation task into five subtasks, as shown in Fig.instruction: 1) describe the attributes, such as the color, size, or shape, of the target; 2) find at least two objects related to the target and describe their relationships; 3) explore the simulator to describe the region in which the target is located; 4) explore and describe the nearby regions; 5) rewrite all descriptions within three sentences.
The first four steps ensure language complexity and diversity, and the rewriting step makes the language instruction coherent and natural. Finally, we generate long navigation trajectories using the navigation graph of each scene. To make the task sufficiently challenging, we first set a threshold of 18 meters. For each instruction-object pair, we fix the target viewpoint and sample the starting viewpoint, and we accept a trajectory as valid if the Dijkstra distance between the two viewpoints exceeds the threshold. In some houses, long trajectories are difficult to find or may not even exist, so we discount the threshold by a factor of 0.8 after every five sampling failures; a sketch of this procedure follows.
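A minimal sketch of the trajectory sampling procedure with the decaying distance threshold, reusing the `dijkstra_distances` helper from the exploration sketch above; the function name and graph representation are illustrative assumptions.

```python
import random

def sample_long_trajectory(nav_graph, target, viewpoints,
                           init_threshold=18.0, decay=0.8, patience=5):
    """Sample a starting viewpoint whose Dijkstra distance to the fixed
    target exceeds a threshold; relax the threshold after repeated failures."""
    dist = dijkstra_distances(nav_graph, target)  # distances from the target
    threshold, failures = init_threshold, 0
    while True:
        start = random.choice(viewpoints)
        if dist.get(start, 0.0) > threshold:
            return start, threshold
        failures += 1
        if failures % patience == 0:  # every five failures...
            threshold *= decay        # ...discount the threshold by 0.8
```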
Figure 5: Statistical analysis across FAO.

Fig.analysis (left) illustrates the distribution of word counts in the instructions. The FAO dataset contains 3,848 instructions with a vocabulary of 1,649 words. The average number of words per instruction set is 38.6, compared with 26.3 in REVERIE and 18.3 in R2R. Most instructions range from 20 to 60 words, which ensures representational power, and the variance in instruction length makes the descriptions more diverse. The trajectory lengths range from 15 meters to more than 60 meters. Compared with R2R and REVERIE, where most trajectories are within 8 hops, FAO provides far more long-term trajectories, as shown in Fig.analysis (middle), which makes the dataset more challenging. Fig.analysis (right) illustrates the proportion of words in each of the four instruction annotation steps. The more words an annotation part contains, the richer the information it carries; we can thus infer that the object relationships and nearby regions contain the richest information, and an agent should pay more attention to these two parts in order to achieve good performance.

| Method | OSR | SR | SPL | SFPL | OSR | SR | SPL | SFPL | OSR | SR | SPL | SFPL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | - | - | - | - | - | - | - | - | 91.4 | 90.4 | 59.2 | 51.1 |
| Random | 0.1 | 0.0 | 1.5 | 1.4 | 0.4 | 0.1 | 0.0 | 0.9 | 2.7 | 2.1 | 0.4 | 0.0 |
| Speaker-Follower [17] | 97.8 | 97.9 | 97.7 | 24.5 | 69.4 | 61.2 | 60.4 | 9.1 | 9.8 | 7.0 | 6.1 | 0.6 |
| RCM [47] | 89.1 | 84.0 | 82.6 | 10.9 | 72.7 | 62.4 | 60.9 | 7.8 | 12.4 | 7.4 | 6.2 | 0.7 |
| AuxRN [52] | 98.7 | 98.4 | 97.4 | 13.7 | 78.5 | 68.8 | 67.3 | 8.3 | 11.0 | 8.1 | 6.7 | 0.5 |
| GBE w/o GE | 91.8 | 89.5 | 88.3 | 24.2 | 73.0 | 62.5 | 60.8 | 6.7 | 18.8 | 11.4 | 8.7 | 0.8 |
| GBE (Ours) | 98.6 | 98.4 | 97.9 | 44.2 | 64.1 | 76.3 | 62.5 | 7.3 | 19.5 | 11.9 | 10.2 | 1.4 |

Table 3: Results for the baselines and our model on the two validation sets and the test set; the metric blocks are, from left to right, Val Seen Instruction, Val Seen House, and Unseen House (Test).

| Model | Vision | Language | SR | SPL | SFPL |
|---|---|---|---|---|---|
| GBE | ✗ | ✗ | 0.6 | 0.4 | 0.0 |
| GBE | ✓ | ✗ | 9.8 | 8.1 | 0.5 |
| GBE | ✗ | ✓ | 1.8 | 1.5 | 0.2 |
| GBE | ✓ | ✓ | 11.9 | 10.2 | 1.4 |

Table 4: Ablation of unimodal inputs.

| Model | SR | SPL | SFPL |
|---|---|---|---|
| GBE+ | 7.3 | 6.2 | 0.5 |
| GBE++ | 6.2 | 4.9 | 0.7 |
| GBE+++ | 6.6 | 5.5 | 0.8 |
| GBE++++ | 11.9 | 10.2 | 1.4 |

Table 5: Ablation of granularity levels.

### Experimental Results

We evaluate the GBE model on the R2R and FAO datasets. We split our dataset into five components: 1) training; 2) validation on seen instructions (in seen houses); 3) validation on seen houses but unseen instructions; 4) validation on unseen houses; and 5) testing. Compared with the standard VLN benchmark, we add a new validation set in FAO, validation on seen instructions, because the task is starting-independent. We evaluate performance from two aspects: navigation and localization. Navigation performance is evaluated via the commonly used VLN metrics, including Navigation Error (NE), Success Rate (SR), Oracle Success Rate (OSR), and Success Rate weighted by Path Length (SPL). Localization performance is evaluated by the success rate of the predicted direction falling inside the bounding box. We combine SPL and localization success to propose a success rate of finding weighted by path length (SFPL):

$$\textnormal{SFPL} = \frac{1}{N}\sum_{i=1}^{N} S^{nav}_i S^{loc}_i \frac{l^{nav}_i}{\textnormal{max}(l^{nav}_i, l^{gt}_i)},$$

where $S^{nav}_i$ and $S^{loc}_i$ indicate whether the agent has successfully navigated to and localized the target, respectively; $l^{nav}_i$ is the length of the navigation trajectory, while $l^{gt}_i$ is the shortest distance between the ground-truth target and the starting position.

We compare the proposed model with several baselines: 1) a random policy; 2) Speaker-Follower, an imitation learning method; 3) RCM, an imitation plus reinforcement learning method; 4) AuxRN, a model with auxiliary tasks; and 5) the Hierarchical Memory Network. All five models employ the same vision-language navigation backbone introduced in Sec.GBE.
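A minimal sketch of the SFPL metric defined above, assuming per-episode success indicators and path lengths are already available; the function name is illustrative.

```python
def sfpl(nav_success, loc_success, nav_len, gt_len):
    """Success rate of Finding weighted by Path Length, mirroring the SFPL
    equation above: mean of S_nav * S_loc * l_nav / max(l_nav, l_gt)."""
    terms = [
        s_nav * s_loc * l_nav / max(l_nav, l_gt)
        for s_nav, s_loc, l_nav, l_gt in zip(nav_success, loc_success, nav_len, gt_len)
    ]
    return sum(terms) / len(terms)

# Example: one found target via a near-optimal path, one failed episode.
print(sfpl([1, 0], [1, 1], [12.0, 20.0], [11.0, 18.0]))  # 0.5
```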
The visual encoder $g$ is implemented with a ResNet-101, and the language encoder $h$ is a combination of a word embedding layer and an LSTM layer. We train all models on the training split for 10K iterations to ensure that all models are sufficiently trained, using the RMSProp optimizer with a learning rate of $10^{-4}$.

In Tab.result_r2r, we compare the GBE model with state-of-the-art models without pretraining or auxiliary tasks. On the unseen-house validation set, GBE outperforms all models that use no additional data; it outperforms EGP, another graph-based navigation method, by 2.4% in SPL. On the test set, GBE outperforms previous models on all evaluation metrics; it outperforms RCM, a seq2seq model trained with imitation and reinforcement learning, by 5.4% in SPL.

The FAO results are presented in Tab.result. The performance of the baseline models reveals some unique features of the FAO dataset. First, human performance largely outperforms all models; the existence of this human-machine gap suggests that current methods cannot yet solve this new task. The random policy performs poorly on all metrics, which suggests that our dataset is not biased. Moreover, Reinforced Cross-Modal Matching (RCM), which combines imitation learning and reinforcement learning, outperforms the pure imitation learning method (Speaker-Follower) on the unseen house set, indicating that reinforcement learning helps avoid overfitting on our dataset. Our experiment with AuxRN shows that the auxiliary tasks that work on R2R are not beneficial on FAO, which indicates that SOON is unique. We also test GBE with and without graph-based exploration and observe that graph-based exploration yields better generalization: the final model is 0.7% higher in Oracle success rate, 0.5% higher in success rate, 1.5% higher in SPL, and 0.6% higher in SFPL on the test set than the variant without graph-based exploration. Finally, we find that models perform well on the seen instruction set but poorly on the other two sets; since the domain of the seen instruction set is close to the training set, this indicates that the models fit the training data well but lack generalizability. We ablate the FAO dataset from two aspects: 1) the effect of the vision and language modalities, and 2) the effect of different granularity levels.
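A minimal sketch of the encoder and optimizer setup described at the start of this subsection, in PyTorch; the hidden sizes and the use of the final pooled ResNet feature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Vision encoder g: ResNet-101 with the classification head removed,
# producing 2048-d visual features.
resnet = models.resnet101(weights=None)
g = nn.Sequential(*list(resnet.children())[:-1])

# Language encoder h: a word embedding layer followed by an LSTM.
vocab_size, embed_dim, hidden_dim = 1649, 256, 512  # 1,649 = FAO vocabulary size
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

def h(token_ids):
    """Encode a batch of instructions: per-word features + sentence feature."""
    emb = embedding(token_ids)      # (batch, seq_len, embed_dim)
    outputs, (h_n, _) = lstm(emb)
    return outputs, h_n.squeeze(0)

# RMSProp with learning rate 1e-4, matching the training setup above.
params = list(g.parameters()) + list(embedding.parameters()) + list(lstm.parameters())
optimizer = torch.optim.RMSprop(params, lr=1e-4)
```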

The ablation results for input modalities are shown in Tab.abla_input. The model without vision or language input performs the worst; it is impossible to finish the SOON task without vision-language modalities. The model with vision only performs better than the model with language only, so we infer that vision is more important than language in the SOON task. Finally, the model with both vision and language performs the best, indicating that the two modalities are related and both are important. Some objects, like 'chair', exist in all houses, while others, like 'flower', are less common; without language, the model learns prior knowledge to find common objects during navigation. The ablation results for granularity levels are shown in Tab.abla_granularity. We train GBE with different annotation granularity levels: object names, object attributes and relationships, region information, and rewritten instructions. Note that the model with object names only (GBE+) is equivalent to ObjectGoal navigation. We find that the model trained in the ObjectGoal setting performs worse than models trained with more information, for two reasons: 1) more than one object may belong to the same class, so navigating with only an object name causes ambiguity; 2) navigating without scene and region information makes it harder for the agent to find the final location. Comparing the first three experiments, we infer that the object name (GBE+), the object attributes and relationships (GBE++), and the region descriptions (GBE+++) all contribute to SOON navigation. Finally, the model with rewritten instructions performs the best (0.6% higher in SFPL than GBE+++); we infer that a well-developed natural language instruction helps the agent comprehend the task.

## Conclusion

In this paper, we have proposed a task named Scenario Oriented Object Navigation (SOON), in which an agent is instructed to find an object in a house from an arbitrary starting position. To accompany the task, we have constructed a dataset named From Anywhere to Object (FAO) with 3,848 sets of descriptive natural language instructions. To suggest a promising direction for approaching this task, we have proposed GBE, a model that explicitly represents the explored areas as a feature graph and introduces a graph-based exploration approach to obtain a robust policy. Our model outperforms all previous state-of-the-art models on the R2R and FAO datasets. We hope that the SOON task can help the community approach real-world navigation problems.

## Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700; the Natural Science Foundation of China (NSFC) under Grants No. U19A2073, No. 61976233, and No. 61906109; the Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039; the Shenzhen Outstanding Youth Research Project (No. RCYX20200714114642083); the Shenzhen Basic Research Project (No. JCYJ20190807154211365); Zhijiang Lab's Open Fund (No. 2020AA3AB14); the CSIG Young Fellow Support Fund; and the Australian Research Council Discovery Early Career Researcher Award (DE190100626).