Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval | IEEE Journals & Magazine | IEEE Xplore

Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval


Abstract:

Remote sensing cross-modal text-image retrieval (RSCTIR) has recently received unprecedented attention due to its advantages of flexible input and efficient querying over enormous collections of remote sensing (RS) images. However, most RSCTIR methods focus heavily on cross-modal semantic alignment between the text and image modalities and easily become stuck in the dilemma of information redundancy, which degrades retrieval accuracy. To address these issues, we construct a novel RSCTIR framework based on mask-guided relation modeling with entity loss (MGRM-EL) that fully explores uni-modal feature learning on entities and relations during cross-modal model learning. Specifically, we take advantage of the Transformer encoder architecture for its ability to capture long-distance dependencies from a global view, and build two uni-modal (visual and textual) Transformer encoders combined with a convolutional neural network (CNN) to extract the spatial interregion relations of images as well as the long-term interword relations of texts for prominent feature embedding of visual and semantic representations. A mask-guided attention strategy is further introduced to learn the salient regions and words, with the aim of enhancing the RSCTIR model's uni-modal learning ability and eliminating unnecessary and redundant information in each modality. Unlike existing methods that simply compute the semantic similarity between images and texts in their loss functions, we present a novel uni-modal entity loss, which treats each image as an image entity and merges similar texts into a text entity, to learn the independent distribution of entities in each modality. Extensive experiments on public RSCTIR benchmarks, including the RSICD and RSITMD datasets, demonstrate the state-of-the-art performance of the proposed method on the RSCTIR task.
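The abstract's uni-modal entity loss (each image as its own entity, similar texts merged into one text entity) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `entity_loss`, the use of mean pooling to merge texts, the cosine-similarity softmax formulation, and the `temperature` value are all hypothetical choices made here for clarity.

```python
import numpy as np

def entity_loss(image_embs, text_embs, entity_ids, temperature=0.07):
    """Hypothetical sketch of an entity-level retrieval loss: texts sharing
    an entity id (i.e., similar captions) are merged into one text-entity
    embedding, and image i is pulled toward text entity i via a
    softmax cross-entropy over cosine similarities.
    """
    # Merge texts with the same entity id into a single mean embedding
    # (one assumed way to form the "text entity" described in the abstract).
    unique_ids = sorted(set(entity_ids))
    text_entities = np.stack([
        text_embs[[i for i, e in enumerate(entity_ids) if e == uid]].mean(axis=0)
        for uid in unique_ids
    ])
    # L2-normalize so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_entities / np.linalg.norm(text_entities, axis=1, keepdims=True)
    # Temperature-scaled similarity logits; image i is assumed to
    # correspond to the i-th unique entity id.
    logits = img @ txt.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(img))
    return -np.mean(np.log(probs[idx, idx]))
```

Under this sketch, well-aligned image/text-entity pairs yield a small loss, while shuffled pairs yield a larger one, which is the qualitative behavior an entity-level retrieval loss should exhibit.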
Article Sequence Number: 5626317
Date of Publication: 16 November 2023

