
High-Order Multi-Scale Attention and Vertical Discriminator Enhanced CLIP for Monocular Depth Estimation


Abstract:

Multimodal monocular depth estimation methods based on deep learning have achieved competitive performance in recent years. However, existing Contrastive Language-Image Pre-training (CLIP)-based multimodal networks often suffer from incomplete fusion of the two modalities and lack multi-scale contextual information. To remedy these issues, this paper proposes HoCLIP, a high-order feature and attention-assisted CLIP model for monocular depth estimation. Specifically, with the CLIP model as the backbone, the Matrix Power Normalized Covariance pooling (MPN-COV) technique is employed in the visual encoder for high-order statistical modeling of image features. These features are then combined with learnable deep prompts before being fed into the text encoder, strengthening the fusion of text and image and enabling the extraction of richer statistical information and spatial structure. Furthermore, the Efficient Multi-Scale Attention (EMA)-Decoder is used to reconstruct depth maps; it captures contextual information across different scales, establishes long-range dependencies between features, and preserves spatial position information. Finally, a vertical discriminator with embedded vertical attention is integrated into the model's final stages to capture vertical features and refine depth map generation. Extensive experiments are conducted on the NYU Depth V2 and KITTI datasets, and the results show that the proposed method delivers a decisive improvement over state-of-the-art multimodal methods and remains strongly competitive across all metrics.
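As a rough illustration of the high-order pooling step described above, the sketch below shows a generic matrix power normalized covariance (MPN-COV) pooling layer in PyTorch. The function name mpn_cov, the exponent alpha = 0.5, the epsilon regularizer, and the eigendecomposition-based normalization are assumptions for illustration, not the paper's exact implementation.

```python
import torch


def mpn_cov(feat: torch.Tensor, alpha: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """Matrix power normalized covariance pooling of a visual feature map.

    feat: (B, C, H, W) features from the visual encoder.
    Returns: (B, C, C) power-normalized covariance matrices (second-order statistics).
    """
    b, c, h, w = feat.shape
    n = h * w
    x = feat.reshape(b, c, n)
    # Sample covariance over spatial positions captures second-order statistics.
    x = x - x.mean(dim=2, keepdim=True)
    cov = x @ x.transpose(1, 2) / (n - 1)
    # Regularize for numerical stability, then apply matrix power normalization
    # Sigma^alpha via eigendecomposition.
    cov = cov + eps * torch.eye(c, device=feat.device).unsqueeze(0)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    eigvals = eigvals.clamp_min(eps).pow(alpha)
    return eigvecs @ torch.diag_embed(eigvals) @ eigvecs.transpose(1, 2)
```

In a CLIP-style pipeline, the pooled (C, C) matrix would typically be flattened (for example, its upper triangle) and projected before being combined with the learnable deep prompts fed to the text encoder; that wiring is an assumption here rather than a detail taken from the paper.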
Date of Publication: 21 February 2025
