Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval | IEEE Journals & Magazine | IEEE Xplore

Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval


Abstract:

Remote sensing cross-modal text-image retrieval (RSCTIR) has recently received unprecedented attention due to its advantages of flexible input and efficient querying over enormous collections of remote sensing (RS) images. However, most RSCTIR methods focus heavily on cross-modal semantic alignment between the text and image modalities and easily become stuck in the dilemma of information redundancy, which degrades retrieval accuracy. To address these issues, we construct a novel RSCTIR framework based on mask-guided relation modeling with entity loss (MGRM-EL) that fully explores uni-modal feature learning on entities and relations during cross-modal model learning. Specifically, we take advantage of the Transformer encoder architecture for its ability to capture long-distance dependencies from a global view, and build two uni-modal (visual and textual) Transformer encoders combined with a convolutional neural network (CNN) to extract the spatial interregion relations of images as well as the long-term interword relations of texts for prominent feature embedding of visual and semantic representations. A mask-guided attention strategy is further introduced to learn the salient regions and words, with the aim of enhancing the RSCTIR model's uni-modal learning ability and eliminating unnecessary and redundant information in each modality. Unlike existing methods that simply compute the semantic similarity between images and texts in their loss functions, we present a novel uni-modal entity loss, which treats each image as an image entity and merges similar texts into a text entity, to learn the independent distribution of entities in each modality. Extensive experiments on public RSCTIR benchmarks, including the RSICD and RSITMD datasets, demonstrate the state-of-the-art performance of the proposed method on the RSCTIR task.
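The abstract's uni-modal entity loss (each image as its own entity, similar texts merged into one text entity) can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `entity_loss`, the use of mean pooling to merge texts, the cosine-similarity softmax formulation, and the `temperature` value are all hypothetical choices made here for clarity.

```python
import numpy as np

def entity_loss(image_embs, text_embs, entity_ids, temperature=0.07):
    """Hypothetical sketch of an entity-level retrieval loss: texts sharing
    an entity id (i.e., similar captions) are merged into one text-entity
    embedding, and image i is pulled toward text entity i via a
    softmax cross-entropy over cosine similarities.
    """
    # Merge texts with the same entity id into a single mean embedding
    # (one assumed way to form the "text entity" described in the abstract).
    unique_ids = sorted(set(entity_ids))
    text_entities = np.stack([
        text_embs[[i for i, e in enumerate(entity_ids) if e == uid]].mean(axis=0)
        for uid in unique_ids
    ])
    # L2-normalize so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_entities / np.linalg.norm(text_entities, axis=1, keepdims=True)
    # Temperature-scaled similarity logits; image i is assumed to
    # correspond to the i-th unique entity id.
    logits = img @ txt.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    idx = np.arange(len(img))
    return -np.mean(np.log(probs[idx, idx]))
```

Under this sketch, well-aligned image/text-entity pairs yield a small loss, while shuffled pairs yield a larger one, which is the qualitative behavior an entity-level retrieval loss should exhibit.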
Article Sequence Number: 5626317
Date of Publication: 16 November 2023

