Cross-Modal Prealigned Method With Global and Local Information for Remote Sensing Image and Text Retrieval


Abstract:

In recent years, remote sensing cross-modal text-image retrieval (RSCTIR) has attracted considerable attention owing to its convenience and information mining capabilities. However, two significant challenges persist: effectively integrating global and local information during feature extraction, given the substantial variations in remote sensing imagery, and the failure of existing methods to adequately prealign features before modal fusion, which results in complex modal interactions that degrade retrieval accuracy and efficiency. To address these challenges, we propose a cross-modal prealigned method with global and local information (CMPAGL) for remote sensing image-text retrieval. Specifically, we design a global-Swin (Gswin) Transformer block, which introduces a global information window on top of the local window attention mechanism, combining local window self-attention and global-local window cross-attention to effectively capture multiscale features of remote sensing images. In addition, our approach incorporates a prealignment mechanism to mitigate the training difficulty of modal fusion, thereby enhancing retrieval accuracy. Moreover, we propose a similarity matrix reweighting (SMR) reranking algorithm to more deeply exploit the information in the similarity matrix during retrieval: it combines forward and backward ranking, the extreme difference ratio, and other factors to reweight the similarity matrix, further improving retrieval accuracy. Finally, we optimize the triplet loss function by introducing an intraclass distance term for matched image-text pairs, not only enforcing the relative distance between matched and unmatched pairs but also minimizing the distance within matched pairs. Experiments on four public remote sensing text-image datasets, including RSICD, RSITMD, UCM-Captions, and Sydney-Captions, demonstrate the effectiveness of the proposed method and show improvements over state-of-the-art methods.
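
The Gswin block described above pairs local window self-attention with cross-attention between the local windows and a global information window. The following PyTorch sketch illustrates that structure; the window size, the mean-pooled global window, and the class name `GswinBlockSketch` are assumptions for illustration, not the paper's actual implementation (which the abstract does not detail).

```python
import torch
import torch.nn as nn

class GswinBlockSketch(nn.Module):
    """Sketch of a Gswin-style block: local window self-attention followed
    by global-local window cross-attention. Window shifting, positional
    bias, and MLP sublayers are omitted for brevity."""

    def __init__(self, dim=96, window=49, heads=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, tokens, dim); tokens must be divisible by the window size.
        b, n, d = x.shape
        w = x.view(b * (n // self.window), self.window, d)
        # Local window self-attention, applied to each window independently.
        wn = self.norm1(w)
        w = w + self.local_attn(wn, wn, wn)[0]
        x = w.view(b, n, d)
        # Global information window: here, one mean-pooled token per local window.
        g = x.view(b, n // self.window, self.window, d).mean(dim=2)
        # Global-local cross-attention: local tokens attend to the global window.
        x = x + self.cross_attn(self.norm2(x), g, g)[0]
        return x

# Usage: 196 tokens (e.g., a 14x14 grid) split into four 49-token windows.
y = GswinBlockSketch()(torch.randn(2, 196, 96))
```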
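The prealignment mechanism aligns image and text features before they enter the fusion module, which eases the learning of cross-modal interactions. The abstract does not specify its exact form; the sketch below assumes an InfoNCE-style image-text contrastive objective, a common choice for prealigning embedding spaces, with an illustrative temperature value.

```python
import torch
import torch.nn.functional as F

def prealignment_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs.
    Row i of img_emb corresponds to row i of txt_emb."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Cross-entropy in both retrieval directions: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```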
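SMR reranking reweights the raw similarity matrix using forward ranking, backward ranking, and the extreme difference ratio. The paper's exact formula is not given in the abstract, so the NumPy sketch below assumes a multiplicative combination of bidirectional ranks scaled by a top-1/top-2 separation ratio per query, purely as an illustration of the idea.

```python
import numpy as np

def smr_rerank_sketch(sim, beta=0.5):
    """Reweight a (num_images, num_texts) similarity matrix.

    Assumed interpretation: entries that rank highly in both the
    image-to-text (forward) and text-to-image (backward) directions are
    boosted, scaled by how separated each query's top match is."""
    # Forward ranking: for each image (row), 1-based rank of every text.
    fwd_rank = np.argsort(np.argsort(-sim, axis=1), axis=1) + 1
    # Backward ranking: for each text (column), 1-based rank of every image.
    bwd_rank = np.argsort(np.argsort(-sim, axis=0), axis=0) + 1
    # Extreme difference ratio per row: gap between the top two candidates.
    top2 = -np.sort(-sim, axis=1)[:, :2]
    edr = (top2[:, 0] - top2[:, 1]) / (np.abs(top2[:, 0]) + 1e-8)
    # Reweight: large when both directional ranks are small (i.e., near the top).
    weight = 1.0 / (fwd_rank * bwd_rank)
    return sim + beta * edr[:, None] * weight
```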
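The optimized triplet loss adds an intraclass distance term for matched pairs on top of the usual margin-based ranking term, so the loss both separates matched from unmatched pairs and pulls matched pairs together. A minimal sketch, assuming cosine distance and a hypothetical weighting hyperparameter `alpha`:

```python
import torch
import torch.nn.functional as F

def triplet_loss_with_intraclass(anchor, positive, negative, margin=0.2, alpha=0.1):
    """Triplet loss augmented with an intraclass distance term."""
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)  # distance within matched pair
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)  # distance to unmatched pair
    # Relative-distance term: matched pairs closer than unmatched by a margin.
    ranking = torch.clamp(d_pos - d_neg + margin, min=0.0)
    # Intraclass term: additionally minimize the matched-pair distance itself.
    return (ranking + alpha * d_pos).mean()
```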
Article Sequence Number: 4709118
Date of Publication: 01 November 2024
