
Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval



Abstract:

Remote sensing cross-modal text-image retrieval (RSCTIR) has recently received unprecedented attention due to its advantages of flexible input and efficient query on enormous remote sensing (RS) images. However, most RSCTIR methods focus excessively on the cross-modal semantic alignment between text and image modalities and easily suffer from information redundancy, which degrades retrieval accuracy. To address these issues, we construct a novel RSCTIR framework based on mask-guided relation modeling with entity loss (MGRM-EL), to fully explore uni-modal feature learning on entities and relations during cross-modal model learning. Specifically, we exploit the Transformer encoder architecture, which captures long-distance dependencies from a global view, and build two uni-modal (visual and textual) Transformer encoders combined with a convolutional neural network (CNN) to extract the spatial interregion relations of images and the long-term interword relations of texts, yielding prominent visual and semantic feature embeddings. A mask-guided attention strategy is further introduced to learn the salient regions and words, so as to enhance the RSCTIR model’s uni-modal learning ability and eliminate unnecessary and redundant information in each modality. Unlike existing methods that simply compute the semantic similarity between images and texts in their loss functions, we present a novel uni-modal entity loss, which treats each image as an image entity and merges similar texts into a text entity, to learn the independent distribution of entities in each modality. We conduct extensive experiments on the public RSCTIR benchmarks RSICD and RSITMD, which demonstrate the state-of-the-art performance of the proposed method on the RSCTIR task.
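To make the uni-modal entity loss more concrete, the sketch below shows one possible reading of it: captions describing the same image are merged into a single text entity, and an InfoNCE-style term within each modality pushes distinct entities apart. The merging rule (mean pooling), the contrastive formulation, the function name unimodal_entity_loss, and the temperature parameter are our own assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def unimodal_entity_loss(image_emb, caption_emb, entity_ids, temperature=0.07):
    """Hypothetical sketch of a uni-modal entity loss.

    image_emb:   (N, D) one embedding per image entity.
    caption_emb: (M, D) caption embeddings; several captions may describe one image.
    entity_ids:  (M,)   long tensor mapping each caption to its image entity.
    """
    N, D = image_emb.shape

    # Merge captions of the same image into one text entity by mean pooling
    # (assumption: the abstract does not specify the merging rule).
    text_entity = torch.zeros(N, D, device=caption_emb.device, dtype=caption_emb.dtype)
    text_entity.index_add_(0, entity_ids, caption_emb)
    counts = torch.bincount(entity_ids, minlength=N).clamp(min=1).to(caption_emb.dtype)
    text_entity = text_entity / counts.unsqueeze(1)

    img = F.normalize(image_emb, dim=1)
    txt = F.normalize(text_entity, dim=1)
    labels = torch.arange(N, device=img.device)

    # InfoNCE-style terms computed within each modality: every entity should be
    # most similar to itself, which pushes distinct entities apart in embedding space.
    loss_img = F.cross_entropy(img @ img.t() / temperature, labels)
    loss_txt = F.cross_entropy(txt @ txt.t() / temperature, labels)
    return 0.5 * (loss_img + loss_txt)
```

In practice such a term would be combined with a cross-modal alignment loss; the sketch only illustrates the per-modality entity separation idea described in the abstract.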
Article Sequence Number: 5626317
Date of Publication: 16 November 2023


I. Introduction

Remote sensing (RS) is the science and technology of acquiring and recording information about physical objects, areas, or phenomena from a distance, typically from aircraft or satellites [1]. In recent years, the rapid development of RS applications in agriculture, forestry, environmental monitoring, and urban planning has led to an explosive growth of RS image data [2]. However, RS data processing capability lags far behind RS image acquisition capacity and easily falls into the paradox of “big data, little knowledge” [3], [4]. To obtain valuable knowledge from enormous RS data in a convenient and flexible way, effective automatic RS cross-modal retrieval methods that exploit information from different modalities are increasingly needed [5]. Owing to its fast extraction and filtering of image information and its flexible human–computer interaction, RS cross-modal text-image retrieval (RSCTIR) [6], [7] offers a way to explore the growing amount of cross-modal RS data and has become an emerging research area at the intersection of natural language processing (NLP) and computer vision (CV).
