Abstract:
In recent years, remote sensing cross-modal text-image retrieval (RSCTIR) has attracted considerable attention owing to its convenience and information mining capabilities. However, two significant challenges persist. First, integrating global and local information effectively during feature extraction is difficult because remote sensing imagery varies substantially. Second, existing methods do not adequately prealign features before modal fusion, which leads to complex modal interactions that degrade retrieval accuracy and efficiency. To address these challenges, we propose a cross-modal prealigned method with global and local information (CMPAGL) for remote sensing imagery. Specifically, we design a global-Swin (Gswin) Transformer block, which introduces a global information window on top of the local window attention mechanism and combines local window self-attention with global-local window cross-attention to capture multiscale features of remote sensing images. In addition, our approach incorporates a prealignment mechanism that reduces the training difficulty of modal fusion, thereby enhancing retrieval accuracy. Moreover, we propose a similarity matrix reweighting (SMR) reranking algorithm that exploits the information in the similarity matrix more fully during retrieval. This algorithm combines forward and backward ranking, extreme difference ratio, and other factors to reweight the similarity matrix, further enhancing retrieval accuracy. Finally, we optimize the triplet loss function by introducing an intraclass distance term for matched image-text pairs, so that the loss accounts for the relative distance between matched and unmatched pairs and also minimizes the distance within matched pairs.
Experiments on four public remote sensing text-image datasets (RSICD, RSITMD, UCM-Captions, and Sydney-Captions) demonstrate the effectiveness of our proposed method, achieving improvements over state-of-the-art methods,...
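The abstract does not give the exact form of the modified triplet loss, so the following is only a minimal sketch of the general idea: a bidirectional hinge-based triplet loss over a batch, plus an extra term that penalizes the distance within each matched image-text pair. The function names, the use of cosine similarity, the sum over all negatives, and the hyperparameters `margin` and `intra_weight` are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_loss_with_intra_term(img_embs, txt_embs, margin=0.2, intra_weight=0.5):
    """Bidirectional hinge triplet loss plus a matched-pair distance term.

    img_embs[i] and txt_embs[i] are assumed to form a matched pair.
    margin and intra_weight are hypothetical hyperparameters.
    """
    n = len(img_embs)
    # Similarity matrix: rows index images, columns index texts.
    sim = [[cosine(img_embs[i], txt_embs[j]) for j in range(n)] for i in range(n)]

    relative = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should be closer to its own text than to text j.
            relative += max(0.0, margin - pos + sim[i][j])
            # Text i should be closer to its own image than to image j.
            relative += max(0.0, margin - pos + sim[j][i])

    # Intra-pair term: cosine distance (1 - similarity) of each matched pair.
    intra = sum(1.0 - sim[i][i] for i in range(n))

    return (relative + intra_weight * intra) / n
```

With `intra_weight=0` this reduces to a plain relative-distance triplet loss. A positive weight keeps pulling matched pairs together even after the margin constraint is satisfied, which is the behavior the abstract attributes to the added term.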
Published in: IEEE Transactions on Geoscience and Remote Sensing (Volume: 62)