\(NavA^3\): Understanding Any Instruction, Navigating Anywhere, Finding Anything

Lingfeng Zhang1,2,*, Xiaoshuai Hao2,*,†, Yingbo Tang3, Haoxiang Fu6, Xinyu Zheng4, Pengwei Wang2, Zhongyuan Wang2, Wenbo Ding1,✉, Shanghang Zhang2,5,✉

1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Beijing Academy of Artificial Intelligence (BAAI)
3Institute of Automation, Chinese Academy of Sciences
4National Maglev Transportation Engineering Research and Development Center, Tongji University
5Peking University
6National University of Singapore

*Co-first Authors    †Project Leader    ✉Corresponding Authors
[Pipeline figure]

Execution Process of \(NavA^3\). The global policy employs Reasoning-VLM to interpret high-level instructions (e.g., “hang out the clothes” → clothes hanger) and identify the target location (balcony) through 3D scene understanding. The local policy uses Pointing-VLM to navigate between waypoints and localize objects with our NaviAfford model, which leverages spatial affordance understanding to pinpoint the target object (clothes hanger).
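For concreteness, the hedged Python sketch below shows one way the global policy's query could look: the Reasoning-VLM receives the high-level instruction together with an annotated top-down view of the scene and is asked to name the goal object and the region most likely to contain it. The prompt wording, the `GlobalGoal` structure, and the `reasoning_vlm.chat` interface are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of the global-policy query; names and interfaces are assumptions.
import json
from dataclasses import dataclass

@dataclass
class GlobalGoal:
    goal_object: str    # e.g., "clothes hanger"
    target_region: str  # e.g., "balcony"

PROMPT_TEMPLATE = (
    "You are a navigation planner. The image is an annotated top-down view of the scene "
    "with region labels (e.g., kitchen, balcony, bedroom).\n"
    "Instruction: {instruction}\n"
    'Reply with JSON: {{"goal_object": ..., "target_region": ...}}'
)

def plan_global_goal(instruction, topdown_view, reasoning_vlm):
    """Ask the Reasoning-VLM which object satisfies the instruction and which region to search."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    reply = reasoning_vlm.chat(images=[topdown_view], text=prompt)  # assumed VLM chat interface
    return GlobalGoal(**json.loads(reply))
```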

Demo Videos (one continuous shot, no cuts)


Cross-embodiment Demos (one continuous shot, no cuts)

Abstract

Embodied navigation is a fundamental capability of embodied intelligence, enabling robots to move and interact within physical environments. However, existing navigation tasks primarily focus on predefined object navigation or instruction following, which differs significantly from human needs in real-world scenarios involving complex, open-ended scenes. To bridge this gap, we introduce a challenging long-horizon navigation task that requires understanding high-level human instructions and performing spatial-aware object navigation in real-world environments. Existing embodied navigation methods struggle with such tasks because of their limited ability to comprehend high-level human instructions and to localize objects with an open vocabulary. In this paper, we propose \(NavA^3\), a hierarchical framework divided into two stages: a global policy and a local policy. In the global policy, we leverage the reasoning capabilities of a Reasoning-VLM to parse high-level human instructions and integrate them with global 3D scene views, allowing us to reason about and navigate to the regions most likely to contain the goal object. In the local policy, we collect a dataset of 1.0 million spatial-aware object affordance samples to train the NaviAfford model (a Pointing-VLM), which provides robust open-vocabulary object localization and spatial awareness for precise goal identification and navigation in complex environments. Extensive experiments demonstrate that \(NavA^3\) achieves state-of-the-art (SOTA) navigation performance and successfully completes long-horizon navigation tasks across different robot embodiments in real-world settings, paving the way for universal embodied navigation. The dataset and code will be made available.

Our \(NavA^3\) Framework

[Framework overview figure]

Our hierarchical approach consists of two stages: the global policy uses Reasoning-VLM to interpret high-level human instructions and mark the most probable area in the 3D scene. Upon reaching the target area, the local policy employs Pointing-VLM to search for the goal object at each waypoint. If the object is not found, it predicts the next waypoint; if it is detected, it marks the object on the egocentric image and navigates to the final destination.
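A minimal sketch of this two-stage loop is given below, assuming hypothetical `robot`, `scene`, and VLM interfaces; it reuses `plan_global_goal` from the sketch above and `pixel_to_robot_frame` from the deployment sketch further down, and is meant only to clarify the control flow, not to reproduce the authors' code.

```python
def navigate(instruction, scene, robot, reasoning_vlm, pointing_vlm, max_waypoints=20):
    # Global policy: parse the instruction and move to the most probable region of the 3D scene.
    goal = plan_global_goal(instruction, scene.topdown_view(), reasoning_vlm)
    robot.go_to(scene.region_pose(goal.target_region))

    # Local policy: at each waypoint, try to point at the goal object in the egocentric view.
    for _ in range(max_waypoints):
        rgb = robot.capture_egocentric_rgb()
        point = pointing_vlm.locate(rgb, goal.goal_object)    # pixel (u, v) or None
        if point is not None:
            robot.go_to(pixel_to_robot_frame(point, robot))   # found: navigate to the object
            return True
        robot.go_to(pointing_vlm.next_waypoint(rgb))          # not found: propose the next waypoint
    return False
```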

Construction Process of 3D Scenes

[3D scene construction figure]

We reconstruct 3D scenes from RGB scan images using 2D-to-3D reconstruction techniques. These scenes are then rendered as annotated top-down views, which are processed by Vision-Language Models (VLMs) for navigation planning. This representation improves both the accuracy and the efficiency of navigation planning.
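As an illustration of the top-down step only (the 2D-to-3D reconstruction itself is outside the scope of this sketch), the code below projects an already-reconstructed point cloud onto a height map that can be rendered, annotated with region labels, and handed to the VLM; the function name and grid resolution are assumptions.

```python
import numpy as np

def build_topdown_height_map(points_xyz: np.ndarray, resolution: float = 0.05) -> np.ndarray:
    """points_xyz: (N, 3) reconstructed points in a gravity-aligned world frame, in meters."""
    xy = points_xyz[:, :2]
    cells = np.floor((xy - xy.min(axis=0)) / resolution).astype(int)   # grid cell per point
    h, w = cells[:, 1].max() + 1, cells[:, 0].max() + 1
    height_map = np.zeros((h, w), dtype=np.float32)
    # Keep the highest point per cell so walls and furniture remain visible from above.
    np.maximum.at(height_map, (cells[:, 1], cells[:, 0]), points_xyz[:, 2])
    return height_map  # render to an image and overlay region labels before querying the VLM
```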

NaviAfford Model Training and Deployment Process

[NaviAfford training and deployment figure]

The NaviAfford model learns object and spatial affordances from diverse indoor scenes and outputs precise point coordinates. During navigation, it localizes objects in real time and generates target points, which the local policy converts into robot coordinates so the robot can reach the goal object.
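Converting a predicted image point into robot coordinates is standard pinhole back-projection; the hedged sketch below spells it out, assuming an aligned depth image and known camera intrinsics and extrinsics exposed through a hypothetical `robot` interface.

```python
import numpy as np

def pixel_to_robot_frame(point_uv, robot):
    """Lift a 2D point predicted on the egocentric RGB image into the robot base frame."""
    u, v = point_uv
    depth = robot.capture_depth()            # depth image aligned to the RGB frame, in meters
    K = robot.camera_intrinsics()            # 3x3 intrinsic matrix
    T_base_cam = robot.camera_extrinsics()   # 4x4 camera-to-base transform
    z = float(depth[int(v), int(u)])
    # Back-project the pixel into the camera frame, then transform into the base frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    p_cam = np.array([x, y, z, 1.0])
    return (T_base_cam @ p_cam)[:3]          # goal position for the local navigation policy
```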


Experimental Results

Comparison with SOTA methods

[Table: comparison with SOTA methods]

\(^*\) denotes that we modified the method so it can complete our task. Our \(NavA^3\) outperforms all SOTA methods in navigation performance.

Ablation Study on Annotation Components

[Table: ablation study on annotation components]

Ablation Study on Annotation Components.

Ablation Study on Different Reasoning-VLMs

[Table: ablation study on different Reasoning-VLMs]

Ablation Study on Different Reasoning-VLMs.

Ablation Study on Different Pointing-VLMs

[Table: ablation study on different Pointing-VLMs]

Ablation Study on Different Pointing-VLMs.


Qualitative Analysis of NaviAfford and \(NavA^3\)

[Qualitative results figure]

The affordance visualization shows the NaviAfford model's performance on both object affordance and spatial affordance. The long-horizon navigation visualization shows the \(NavA^3\) hierarchical system operating in real-world environments, along with its cross-embodiment deployment capabilities.

BibTeX

BibTeX code here