VFM-Depth: Leveraging Vision Foundation Model for Self-Supervised Monocular Depth Estimation


Abstract:

Self-supervised monocular depth estimation has exploited semantics to reduce depth ambiguities in texture-less regions and at object boundaries. However, existing methods struggle to obtain universal semantics across scenes for effective depth estimation. This paper proposes VFM-Depth, a novel self-supervised teacher-student framework that leverages a vision foundation model as semantic regularization to significantly improve the accuracy of monocular depth estimation. First, we propose a novel Geometric-Semantic Aggregation Encoding that integrates universal semantic constraints from the foundation model to reduce ambiguities in the teacher model. Specifically, semantic features from the foundation model and geometric features from the depth model are first encoded and then fused through cross-modal aggregation. Second, we introduce a novel Multi-Alignment for Depth Distillation to distill semantic constraints from the teacher, further leveraging knowledge from the foundation model. We obtain a lightweight yet effective student model through an approach that combines distance-category alignment with complementary feature and depth imitation. Extensive experiments on the KITTI, Cityscapes, and Make3D datasets demonstrate that VFM-Depth (both teacher and student) outperforms state-of-the-art self-supervised methods by a large margin.
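The abstract outlines two components: a cross-modal aggregation that fuses foundation-model semantic features with depth-encoder geometric features, and a teacher-student distillation that combines distance-category alignment with feature and depth imitation. The sketch below is not the authors' code; it is a minimal PyTorch illustration of how such a fusion block and combined distillation loss could look. All module names, tensor shapes, bin ranges, and loss weights are assumptions made for illustration only.

```python
# Hypothetical sketch of (a) cross-modal aggregation of semantic and geometric
# features and (b) a multi-term distillation loss (feature imitation, depth
# imitation, distance-category alignment). Not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAggregation(nn.Module):
    """Fuse semantic tokens (frozen foundation model) with geometric tokens (depth encoder)."""

    def __init__(self, sem_dim: int, geo_dim: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, dim)   # encode semantic features
        self.geo_proj = nn.Linear(geo_dim, dim)   # encode geometric features
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, geo_dim)        # project back to the depth branch

    def forward(self, geo_tokens, sem_tokens):
        # geo_tokens: (B, N, geo_dim), sem_tokens: (B, M, sem_dim)
        q = self.geo_proj(geo_tokens)
        kv = self.sem_proj(sem_tokens)
        fused, _ = self.attn(q, kv, kv)           # geometry queries semantics
        return geo_tokens + self.out(fused)       # residual cross-modal fusion


def soft_bin_probs(depth, centers, temperature: float = 1.0):
    """Soft assignment of per-pixel depth to discrete distance categories."""
    dist = -(depth.flatten().unsqueeze(1) - centers.unsqueeze(0)).abs() / temperature
    return F.softmax(dist, dim=1)  # (num_pixels, num_bins)


def distillation_loss(student_feat, teacher_feat, student_depth, teacher_depth,
                      num_bins: int = 64, w_feat: float = 1.0,
                      w_depth: float = 1.0, w_cat: float = 0.1):
    """Combine feature imitation, depth imitation, and distance-category alignment.
    Weights, bin count, and depth range are placeholder values."""
    # Feature imitation: match intermediate representations of teacher and student.
    l_feat = F.mse_loss(student_feat, teacher_feat.detach())
    # Depth imitation: match the predicted depth maps.
    l_depth = F.l1_loss(student_depth, teacher_depth.detach())
    # Distance-category alignment: align soft depth-bin distributions.
    centers = torch.linspace(0.1, 100.0, num_bins, device=student_depth.device)
    s_probs = soft_bin_probs(student_depth, centers)
    t_probs = soft_bin_probs(teacher_depth.detach(), centers)
    l_cat = F.kl_div(s_probs.clamp_min(1e-8).log(), t_probs, reduction="batchmean")
    return w_feat * l_feat + w_depth * l_depth + w_cat * l_cat
```

In this sketch the depth branch queries the semantic branch through cross-attention, so universal semantics act as a soft constraint on geometry rather than replacing it; the distillation loss then transfers that regularized teacher to a lightweight student through the three aligned terms.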
Date of Publication: 27 December 2024
