I. Introduction
Autonomous drone control [1], particularly navigation using onboard monocular cameras, presents significant challenges, especially in complex and previously unseen environments. In contrast, expert human pilots can reliably control drones from first-person-view images, even under noise and latency. This ability stems from efficiently extracting essential low-level properties from diverse perceptual inputs and using them as internal representations to guide actions [2]. For autonomous navigation, it is therefore crucial to develop a feature extractor that remains robust to environmental changes while prioritizing task-relevant features in high-dimensional, dynamic visual observations.