\(NavA^3\): Understanding Any Instruction, Navigating Anywhere, Finding Anything

Lingfeng Zhang1,2,*, Xiaoshuai Hao2,*,†, Yingbo Tang3, Haoxiang Fu6, Xinyu Zheng4, Pengwei Wang2, Zhongyuan Wang2, Wenbo Ding1,✉, Shanghang Zhang2,5,✉

1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Beijing Academy of Artificial Intelligence (BAAI)
3Institute of Automation, Chinese Academy of Sciences
4National Maglev Transportation Engineering Research and Development Center, Tongji University
5Peking University
6National University of Singapore

*Co-first Authors    †Project Leader    ✉Corresponding Authors
[Pipeline figure]

Execution Process of \(NavA^3\). The global policy employs Reasoning-VLM to interpret high-level instructions (e.g., “hang out the clothes” → clothes hanger) and identify the target location (balcony) through 3D scene understanding. The local policy uses Pointing-VLM to navigate between waypoints and localize objects with our NaviAfford model, which leverages spatial affordance understanding to pinpoint the target object (clothes hanger).
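For concreteness, the hedged Python sketch below shows one way the global policy's query could look: the Reasoning-VLM receives the high-level instruction together with an annotated top-down view of the scene and is asked to name the goal object and the region most likely to contain it. The prompt wording, the `GlobalGoal` structure, and the `reasoning_vlm.chat` interface are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of the global-policy query; names and interfaces are assumptions.
import json
from dataclasses import dataclass

@dataclass
class GlobalGoal:
    goal_object: str    # e.g., "clothes hanger"
    target_region: str  # e.g., "balcony"

PROMPT_TEMPLATE = (
    "You are a navigation planner. The image is an annotated top-down view of the scene "
    "with region labels (e.g., kitchen, balcony, bedroom).\n"
    "Instruction: {instruction}\n"
    'Reply with JSON: {{"goal_object": ..., "target_region": ...}}'
)

def plan_global_goal(instruction, topdown_view, reasoning_vlm):
    """Ask the Reasoning-VLM which object satisfies the instruction and which region to search."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    reply = reasoning_vlm.chat(images=[topdown_view], text=prompt)  # assumed VLM chat interface
    return GlobalGoal(**json.loads(reply))
```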

Demo Videos (one continuous shot, no cuts)


Cross-embodiment Demos (one continuous shot, no cuts)

Abstract

Embodied navigation is a fundamental capability of embodied intelligence, enabling robots to move and interact within physical environments. However, existing navigation tasks primarily focus on predefined object navigation or instruction following, which differs significantly from human needs in real-world scenarios involving complex, open-ended scenes. To bridge this gap, we introduce a challenging long-horizon navigation task that requires understanding high-level human instructions and performing spatial-aware object navigation in real-world environments. Existing embodied navigation methods struggle with such tasks because of their limited ability to comprehend high-level human instructions and to localize objects with an open vocabulary. In this paper, we propose \(NavA^3\), a hierarchical framework divided into two stages: a global policy and a local policy. In the global policy, we leverage the reasoning capabilities of a Reasoning-VLM to parse high-level human instructions and integrate them with global 3D scene views, allowing us to reason about and navigate to the regions most likely to contain the goal object. In the local policy, we collect a dataset of 1.0 million spatial-aware object affordance samples to train the NaviAfford model (a Pointing-VLM), which provides robust open-vocabulary object localization and spatial awareness for precise goal identification and navigation in complex environments. Extensive experiments demonstrate that \(NavA^3\) achieves state-of-the-art (SOTA) navigation performance and successfully completes long-horizon navigation tasks across different robot embodiments in real-world settings, paving the way for universal embodied navigation. The dataset and code will be made available.

Our \(NavA^3\) Framework

[Framework overview figure]

Our hierarchical approach consists of two stages: the global policy uses Reasoning-VLM to interpret high-level human instructions and mark the most probable area in the 3D scene. Upon reaching the target area, the local policy employs Pointing-VLM to search for the goal object at each waypoint. If the object is not found, it predicts the next waypoint; if it is detected, it marks the object on the egocentric image and navigates to the final destination.
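A minimal sketch of this two-stage loop is given below, assuming hypothetical `robot`, `scene`, and VLM interfaces; it reuses `plan_global_goal` from the sketch above and `pixel_to_robot_frame` from the deployment sketch further down, and is meant only to clarify the control flow, not to reproduce the authors' code.

```python
def navigate(instruction, scene, robot, reasoning_vlm, pointing_vlm, max_waypoints=20):
    # Global policy: parse the instruction and move to the most probable region of the 3D scene.
    goal = plan_global_goal(instruction, scene.topdown_view(), reasoning_vlm)
    robot.go_to(scene.region_pose(goal.target_region))

    # Local policy: at each waypoint, try to point at the goal object in the egocentric view.
    for _ in range(max_waypoints):
        rgb = robot.capture_egocentric_rgb()
        point = pointing_vlm.locate(rgb, goal.goal_object)    # pixel (u, v) or None
        if point is not None:
            robot.go_to(pixel_to_robot_frame(point, robot))   # found: navigate to the object
            return True
        robot.go_to(pointing_vlm.next_waypoint(rgb))          # not found: propose the next waypoint
    return False
```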

Construction Process of 3D Scenes

[3D scene construction figure]

We reconstruct 3D scenes from RGB scan images using 2D-to-3D reconstruction techniques. These scenes are then rendered as annotated top-down views, which are processed by Vision-Language Models (VLMs) for navigation planning. This representation improves both the accuracy and the efficiency of navigation planning.
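As an illustration of the top-down step only (the 2D-to-3D reconstruction itself is outside the scope of this sketch), the code below projects an already-reconstructed point cloud onto a height map that can be rendered, annotated with region labels, and handed to the VLM; the function name and grid resolution are assumptions.

```python
import numpy as np

def build_topdown_height_map(points_xyz: np.ndarray, resolution: float = 0.05) -> np.ndarray:
    """points_xyz: (N, 3) reconstructed points in a gravity-aligned world frame, in meters."""
    xy = points_xyz[:, :2]
    cells = np.floor((xy - xy.min(axis=0)) / resolution).astype(int)   # grid cell per point
    h, w = cells[:, 1].max() + 1, cells[:, 0].max() + 1
    height_map = np.zeros((h, w), dtype=np.float32)
    # Keep the highest point per cell so walls and furniture remain visible from above.
    np.maximum.at(height_map, (cells[:, 1], cells[:, 0]), points_xyz[:, 2])
    return height_map  # render to an image and overlay region labels before querying the VLM
```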

NaviAfford Model Training and Deployment Process

[NaviAfford training and deployment figure]

The NaviAfford model learns object and spatial affordances from diverse indoor scenes and outputs precise point coordinates. During navigation, it localizes objects in real time and generates target points, which the local policy converts into robot coordinates so the robot can reach the goal object.
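Converting a predicted image point into robot coordinates is standard pinhole back-projection; the hedged sketch below spells it out, assuming an aligned depth image and known camera intrinsics and extrinsics exposed through a hypothetical `robot` interface.

```python
import numpy as np

def pixel_to_robot_frame(point_uv, robot):
    """Lift a 2D point predicted on the egocentric RGB image into the robot base frame."""
    u, v = point_uv
    depth = robot.capture_depth()            # depth image aligned to the RGB frame, in meters
    K = robot.camera_intrinsics()            # 3x3 intrinsic matrix
    T_base_cam = robot.camera_extrinsics()   # 4x4 camera-to-base transform
    z = float(depth[int(v), int(u)])
    # Back-project the pixel into the camera frame, then transform into the base frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (T_base_cam @ p_cam)[:3]          # goal position for the local navigation policy
```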


Experimental Results

Comparison with SOTA methods

[Table: comparison with SOTA methods]

\(^*\) denotes that we modified the method so it can complete our task. Our \(NavA^3\) outperforms all SOTA methods in navigation performance.

Ablation Study on Annotation Components

[Table: ablation study on annotation components]

Ablation Study on Annotation Components.

Ablation Study on Different Reasoning-VLMs

[Table: ablation study on different Reasoning-VLMs]

Ablation Study on Different Reasoning-VLMs.

Ablation Study on Different Pointing-VLMs

[Table: ablation study on different Pointing-VLMs]

Ablation Study on Different Pointing-VLMs.


Qualitative Analysis of NaviAfford and \(NavA^3\)

[Qualitative results figure]

The affordance visualization shows the NaviAfford model's performance on both object affordance and spatial affordance. The long-horizon navigation visualization shows the \(NavA^3\) hierarchical system operating in real-world environments, along with its cross-embodiment deployment capabilities.

BibTeX

BibTeX code here