
ViewInfer3D: 3D Visual Grounding Based on Embodied Viewpoint Inference


Abstract:

3D Visual Grounding (3D VG) is a fundamental task in embodied intelligence, which entails robots interpreting natural language descriptions to locate objects within 3D environments. The complexity of this task emerges as robots perceive the spatial relationships of objects differently depending on their observational viewpoints. In this work, we propose ViewInfer3D, a framework that leverages Large Language Models (LLMs) to infer embodied viewpoints, thereby avoiding incorrect observational viewpoints. To enhance the reliability and speed of reasoning from embodied viewpoints, we have designed three sub-strategies: constructing a hierarchical 3D scene graph, implementing embodied viewpoint parsing, and applying scene graph reasoning. Through extensive experiments, we demonstrate that this framework can improve performance in 3D Visual Grounding tasks through embodied viewpoint reasoning. Our framework achieves the best performance among all zero-shot methods on the ScanRefer and Nr3D/Sr3D datasets, without significantly increasing inference time.
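The abstract describes the three sub-strategies only at a high level. As an illustration of why embodied viewpoint inference matters for scene-graph reasoning, the following is a minimal sketch (not the paper's implementation; all class names and the 2D left/right test are assumptions for illustration): the same spatial relation, e.g. "the chair to the left of the table," can flip depending on where the observer stands.

```python
from dataclasses import dataclass, field

@dataclass
class Obj:
    """An object node in a simplified scene graph, projected to the floor plane."""
    name: str
    x: float
    y: float

@dataclass
class SceneGraph:
    """Minimal scene graph supporting a viewpoint-dependent spatial query."""
    objects: dict = field(default_factory=dict)

    def add(self, obj: Obj) -> None:
        self.objects[obj.name] = obj

    def left_of(self, a_name: str, b_name: str, viewer: tuple) -> bool:
        """True if object `a` appears to the LEFT of object `b` for a viewer
        standing at `viewer` = (x, y) and facing `b`. Decided by the sign of
        the 2D cross product of the forward direction with the direction to `a`."""
        a, b = self.objects[a_name], self.objects[b_name]
        vx, vy = viewer
        fwd = (b.x - vx, b.y - vy)       # viewer's forward direction (toward b)
        to_a = (a.x - vx, a.y - vy)      # direction from viewer to a
        cross = fwd[0] * to_a[1] - fwd[1] * to_a[0]
        return cross > 0                 # positive cross product => a lies left of the ray

# Example: the same chair/table pair judged from two opposite viewpoints.
g = SceneGraph()
g.add(Obj("table", 0.0, 5.0))
g.add(Obj("chair", -1.0, 3.0))

print(g.left_of("chair", "table", viewer=(0.0, 0.0)))   # True: chair appears on the left
print(g.left_of("chair", "table", viewer=(0.0, 10.0)))  # False: relation flips from the far side
```

The flipped result from the second viewpoint is exactly the ambiguity the paper targets: a grounding system that assumes a fixed observational viewpoint will misresolve such descriptions, which motivates inferring the embodied viewpoint before reasoning over the scene graph.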
Published in: IEEE Robotics and Automation Letters (Volume: 9, Issue: 9, September 2024)
Page(s): 7469 - 7476
Date of Publication: 10 July 2024




