
Learning Cross-Modal Visuomotor Policies for Autonomous Drone Navigation


Abstract:

Developing effective vision-based navigation algorithms that adapt to various scenarios is a significant challenge for autonomous drone systems, with vast potential in diverse real-world applications. This paper proposes a novel visuomotor policy learning framework for monocular autonomous navigation, combining cross-modal contrastive learning with deep reinforcement learning (DRL) to train a visuomotor policy. Our approach first leverages contrastive learning to extract consistent, task-focused visual representations from high-dimensional RGB images by aligning them with depth images, and then directly maps these representations to action commands with DRL. This framework enables RGB images to capture structural and spatial information similar to depth images, which remains largely invariant under changes in lighting and texture, thereby maintaining robustness across various environments. We evaluate our approach through simulated and physical experiments, showing that our visuomotor policy outperforms baseline methods in both effectiveness and resilience to unseen visual disturbances. Our findings suggest that the key to enhancing transferability in monocular RGB-based navigation lies in achieving consistent, well-aligned visual representations across scenarios, an aspect often lacking in traditional end-to-end approaches.
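
As a rough illustration of the cross-modal alignment step described above (a sketch under assumed details, not the authors' implementation), the following PyTorch snippet trains an RGB encoder and a depth encoder with a symmetric InfoNCE loss so that the embedding of an RGB frame is pulled toward the embedding of its paired depth frame. The encoder architecture, embedding size, and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvEncoder(nn.Module):
    """Small CNN that maps an image to a unit-norm embedding."""

    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def info_nce(z_rgb: torch.Tensor, z_depth: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE: the i-th RGB and i-th depth frame form a positive
    pair; every other pairing in the batch serves as a negative."""
    logits = z_rgb @ z_depth.t() / tau            # (B, B) similarities / temperature
    targets = torch.arange(z_rgb.size(0), device=z_rgb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    rgb_enc, depth_enc = ConvEncoder(in_channels=3), ConvEncoder(in_channels=1)
    rgb = torch.randn(8, 3, 96, 96)               # batch of RGB frames
    depth = torch.randn(8, 1, 96, 96)             # paired depth frames (same timesteps)
    loss = info_nce(rgb_enc(rgb), depth_enc(depth))
    loss.backward()                               # gradients flow into both encoders
    print(f"contrastive loss: {loss.item():.3f}")
```

Under this reading, the contrastive objective shapes the RGB encoder toward depth-like structure, and the DRL stage then learns to map the resulting embeddings to action commands.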
Published in: IEEE Robotics and Automation Letters (Volume: 10, Issue: 6, June 2025)
Page(s): 5425 - 5432
Date of Publication: 10 April 2025


I. Introduction

Autonomous drone control [1], particularly navigation using onboard monocular cameras, presents significant challenges, especially in complex and unseen environments. In contrast, expert human pilots can reliably control drones from first-person-view images, even under noise and latency. This ability stems from efficiently extracting essential low-level properties from diverse perceptual inputs and using them as internal representations to guide actions [2]. For autonomous navigation, it is therefore crucial to develop a feature extractor that remains robust to environmental changes while prioritizing task-relevant features in high-dimensional, dynamic visual observations.
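
To make the last point concrete, the sketch below shows one way such a feature extractor could feed a control policy (again an assumption-laden illustration, not the paper's exact network): a pretrained RGB encoder is frozen and a small MLP actor maps its embedding to normalized velocity and yaw-rate commands.

```python
import torch
import torch.nn as nn


class VisuomotorActor(nn.Module):
    """Maps a task-focused visual embedding to continuous action commands."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 128, action_dim: int = 4):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():        # keep the learned representation fixed
            p.requires_grad_(False)
        self.policy = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # commands normalized to [-1, 1]
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(rgb)                  # embedding from the RGB encoder
        return self.policy(z)                      # e.g., (vx, vy, vz, yaw_rate)


if __name__ == "__main__":
    # Stand-in encoder; in practice this would be the contrastively pretrained RGB encoder.
    encoder = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 128),
    )
    actor = VisuomotorActor(encoder)
    action = actor(torch.randn(1, 3, 96, 96))
    print(action.shape)                            # torch.Size([1, 4])
```

In a DRL setting, an actor of this form would be optimized with an off-the-shelf algorithm (e.g., SAC or PPO) against the navigation reward; freezing the encoder is one possible design for keeping the visual representation consistent across scenarios.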
