Multimodal Fusion Remote Sensing Image–Audio Retrieval

Remote sensing image–audio retrieval (RSIAR) has been an emerging research topic in recent years, and many different methods have been proposed for it. These RSIAR methods have achieved good retrieval results, but two problems remain: the lack of discriminability of the audio modality and the existence of a heterogeneous gap between audio and image. These two problems make the cross-modal common embedding space for audio and images suboptimal, which often prevents superior retrieval. This article proposes a novel RSIAR method named multimodal fusion remote sensing image–audio retrieval (MMFR) to address these two problems. MMFR first converts the original audio input to text. Then, MMFR uses a feature fusion module to obtain a fusion representation that incorporates the text information, instead of relying on the original audio representation alone. Fusing text information makes the pronunciation-based audio feature more semantically discriminable and converts it into a more "high-level" fusion feature that crosses the heterogeneous gap. Seven different fusion methods are tried in the feature fusion module. In addition, the triplet loss, the semantic loss, and the consistency loss are used to optimize the common retrieval space. Extensive experiments conducted on the UCM_IV, RSICD_IV, and Sydney_IV datasets demonstrate that our MMFR method outperforms state-of-the-art methods.


I. INTRODUCTION
Remote sensing (RS) images have been widely used in many fields such as resource exploration, building planning, and disaster monitoring. With the development of RS technology, RS images have exploded in recent years, and the amount of RS image data now exceeds the zettabyte level. There is no doubt that it is essential to mine information reasonably and adequately from massive RS images. As it is very time consuming to manually mine information from zettabyte-level data, image retrieval technology [1]-[10], which can mine helpful information automatically, is urgently needed. Image retrieval is a technology that, given a query, aims to return a ranking of the images that best match the query. Due to the demand for automatic information mining in huge volumes of RS image data, image retrieval techniques in the field of RS images are gradually being studied [11]-[13]. RS image retrieval methods can be roughly divided into unimodal remote sensing image retrieval (UMRSIR) and cross-modal remote sensing image retrieval (CMRSIR). UMRSIR is an RS image retrieval technique in which the query and the retrieval results share the same image modality. For example, Li et al. [11] proposed a large-scale RS image retrieval framework, which maps RS images into binary representations to realize RS image-to-image retrieval. UMRSIR enables the automatic mining of RS image information, but its flexibility is limited because the query and the results are homogeneous in modality.
CMRSIR refers to retrieving results corresponding to the query from a different modality, which is more flexible and practical than UMRSIR. CMRSIR can be divided into remote sensing image-text retrieval (RSITR) and remote sensing image-audio retrieval (RSIAR). RSITR [14] refers to using text (image) as the input to retrieve the corresponding images (texts). In contrast, RSIAR [15]-[19] uses audio (image) as the input to retrieve images (audio), which better matches human input habits and is especially suitable for emergencies such as military target detection, military speech intelligence generation, and disaster detection [19].
In recent years, more and more people have paid attention to RSIAR. Mao et al. [15] proposed a framework based on convolutional neural networks (CNNs), which combines RS image features and corresponding speech features and outputs a number to indicate match or mismatch. Chaudhuri et al. [16] designed a deep-neural-network-based architecture that learns a discriminative shared feature space of images and corresponding audio. Chen et al. [17] proposed a multiscale cross-modal hash method, which aggregates the features of different scales of images and audio, respectively, and then maps the image and audio aggregation features into binary representations to retrieve in Hamming space. Ning et al. [19] proposed a CNN-based retrieval framework that considers pairwise consistency, intramodal consistency, and intermodal unpaired consistency to learn better cross-modal feature representations. The above RSIAR methods have achieved good retrieval results, but two problems remain.

Fig. 1. Illustration of the lack of discriminability of the audio modality. There are two speech sentences, "there is a port" and "there is a pond." The distance between these two sentences in the audio-modal feature space is very small because their pronunciations are similar. However, the images or texts corresponding to these two speech sentences are far apart, because their real semantics are different. The text and image modalities are more discriminative than the audio modality in RSIAR.
The first is the lack of semantic discriminability of the audio modality, which makes the sole audio input confusing. Most of the existing RSIAR methods [15]-[19] extract sound features from the Mel spectrogram, and the feature similarity of the Mel spectrograms of two sentences with close pronunciation is very high, as shown in Fig. 1. In other words, the audio feature is pronunciation-based rather than semantics-based. This may introduce many wrong objects into the retrieval results. For example, the queries "there is a port" and "there is a pond" may both return images of ponds, which does not meet the expectations for retrieval.
Second, image and audio are two completely different data modalities, and there is a vast heterogeneous gap between them, which makes the modality representations inconsistent. The above works [15]-[18] focus more on paired data matching or semantic alignment but do not explicitly address the heterogeneous gap, so it is not easy to achieve the optimal retrieval effect.
We think fusing additional text features can bring two benefits to addressing these problems. First, the text feature is extracted based on semantics, which means that the text feature has more semantic characteristics. Fusing the text feature into the audio feature can make the pronunciation-based audio feature match the semantic distribution and thus become more discriminative. Fig. 2 shows the distributions of the audio feature and the audio feature fused with the text feature in our retrieval method. The T-SNE visualization of the audio features of the port and pond categories in the RSICD dataset shows that the audio feature fused with the text feature has better semantic discriminability: the samples in the two categories have distinct feature distribution boundaries. In contrast, the semantic discriminability of the audio feature without text feature fusion is relatively insufficient.
Second, introducing semantics-based text features can convert pronunciation-based audio features into more "high-level" fusion features. In addition, the image feature is based on content and semantics, so it can also be seen as a "high-level" feature. Using such semantics-based "high-level" features to achieve RSIAR can cross the modality heterogeneity gap: retrieval then relies more on the actual semantics of the data samples than on low-level cues such as pronunciation details or image texture.
Specifically, we designed a multimodal fusion remote sensing image-audio retrieval (MMFR) method to enhance the discrimination of audio and eliminate the heterogeneity gap between modalities. It includes four parts: feature extraction, feature mapping, feature fusion, and optimization objective.
1) Feature extraction: We use a pretrained ResNet [20] to extract RS image features and VGGish [21] to extract features of the audio description. In addition, we input the audio description of the image into Google's publicly available speech recognition platform to obtain speech-to-text recognition text. We then use pretrained BERT [22] to extract text features.
2) Feature mapping: Feature mapping has three branches. The first branch maps image features to obtain image representations. The second branch maps audio features to obtain audio representations. The last branch maps text features to obtain text representations. The three branches are three corresponding multilayer perceptrons.
3) Feature fusion: We input audio and text representations into the multimodal feature fusion module to obtain a fusion feature representation. We try seven fusion styles to find the best fusion method.
4) Optimization objective: The optimization objective includes the modal consistency loss, the triplet loss, and the semantic loss. Minimizing the modal consistency loss and the triplet loss enhances intermodal pairwise semantic consistency. Minimizing the semantic loss trains intramodal semantic discriminability.
The proposed framework solves the two problems through the above parts to learn better cross-modal feature representations. In general, our contributions are as follows.
1) We propose an RSIAR method with three modalities of text, image, and audio, which essentially eliminates the ambiguity of audio. We propose a multimodal feature fusion module and also try a variety of feature fusion styles. As far as we know, this is the first RSIAR method that integrates text information into audio features.
2) Our method uses the triplet loss and the semantic loss to learn intermodal pairwise semantic consistency and intramodal semantic discriminability, respectively. We use the cross-modal consistency loss to further enhance the consistency between modalities and eliminate the heterogeneous gap to the greatest extent, so that different modal data with the same semantics have cross-modal invariance of feature representation.
3) Extensive experiments have been carried out on the UCM_IV, RSICD_IV, and Sydney_IV datasets [15]. The experimental results on the three datasets verify our method's effectiveness and show that our method achieves the best performance in the current RSIAR field.
The rest of this article is organized as follows. In Section II, we introduce related work. In Section III, we introduce the proposed framework in detail. In Section IV, we introduce the experiments. Finally, Section V concludes this article.

II. RELATED WORK

A. Unimodal Remote Sensing Image Retrieval
UMRSIR can be divided into traditional methods and deep-learning-based methods. Traditional methods refer to earlier works that used hand-designed feature extractors to extract image features [23]-[25]. For example, Luo et al. [23] proposed an RS image retrieval method by comparing multiresolution wavelet features. Aptoula [24] proposed a method that uses global morphological texture descriptors for RS image retrieval. Yang and Newsam [25] suggested using local invariant features for RS image retrieval.
In recent years, with the incredible popularity of deep learning in computer vision and natural language processing [26]-[37], many deep-learning-based methods have also appeared in UMRSIR. The current deep-learning-based UMRSIR methods mainly target three aspects.
The first targets the relationship between the categories and objects covered by RS image features. For example, Shao et al. [38] proposed a retrieval framework based on a fully convolutional network that considers complex feature scenes of RS images. Chaudhuri et al. [39] introduced graph theory into multilabel RS image retrieval and proposed a retrieval method considering multiple feature labels and their adjacency. Kang et al. [40] modeled the relationships between samples (or scenes) with graph structures and fed them into network learning to achieve RS image retrieval.
The second targets the case of few annotated RS image samples. For example, Ye et al. [41] proposed a domain adaptation method that transfers knowledge from visible-light images to SAR images to achieve SAR image retrieval, which does not rely on the manual annotation of SAR images. Liu et al. [42] proposed an unsupervised transfer learning method that converts similarity learning into rank classification to determine the similarity of images and provide pseudo labels.
The third is for the real-time demand of RS image retrieval. Li et al. [11] proposed a large-scale RS image hashing retrieval framework that maps RS images to binary representations, thus enabling fast RS image retrieval.

B. Cross-Modal Remote Sensing Image Retrieval
Unlike UMRSIR, which can only use image queries, CMRSIR can retrieve images with queries from other modalities. This more flexible retrieval has attracted the attention of researchers. Current CMRSIR methods can be divided into image-text and image-audio retrieval. Regarding image-text retrieval, Abdullah et al. [43] proposed a new deep bidirectional triplet network for text-to-image matching to achieve RS image retrieval. Cheng et al. [14] proposed a new cross-modal image-text retrieval network, which designs a semantic alignment module to directly link RS images and their paired text data and thus realizes image-text matching for RS images. Yuan et al. [44] constructed a fine-grained and more challenging RS image-text matching dataset and proposed a new asymmetric multimodal feature matching network for the multiscale scarcity and target redundancy problems in RS multimodal retrieval. Later, Yuan et al. [45] proposed a framework for RS image-text matching based on global and local information and designed a multilevel information dynamic fusion module to effectively integrate features at different levels.
In recent years, image-audio retrieval methods have gradually emerged in CMRSIR. Mao et al. [15] proposed the first framework for RS image-audio retrieval using the CNN to learn the correlation between images and audio. Based on this work, many subsequent works were developed. Guo et al. [13] extended the method in [15] by proposing a two-branch network that fuses audio features and image features to determine whether they match or not. Chen and Lu [18] designed a triplet hash network to compute hash codes of positive and negative sample pairs. In addition, Chen et al. [17] developed a cross-modal hashing method to input RS images and audio separately into a feature extraction network, in which multiscale features of images are extracted, and this network achieves retrieval by computing hash codes of image features and audio features. Chaudhuri et al. [16] proposed a deep neural network to learn co-embeddings of RS images and multiple words instead of complete spoken sentences. Ning et al. [19] proposed a CNN-based retrieval framework that considers pairwise consistency, intramodal consistency, and intermodal nonpairwise consistency to learn better cross-modal feature representations. Chen et al. [46] used a novel deep-quadratic-hash-based method to mine relative semantic similarity relations for RS image-to-speech retrieval.

C. Multimodal Fusion
Multimodal fusion is a technique that aggregates information from multiple modalities to achieve more robust outputs, complementary information, and solutions when some modalities are missing [47]. Multimodal fusion can be divided into model-agnostic and model-based approaches [47].
Most of the past multimodal fusion methods are model-agnostic. These methods fuse the inputs at the data input stage or fuse the output results to make fusion decisions over multiple information modalities [48]-[50].
Unlike model-agnostic approaches, model-based approaches perform the fusion of multimodal information within the model, which is specific to the structure of the model and to different tasks. Model-based approaches can be further divided into kernel-based approaches [51], graph-based approaches [52], and neural-network-based approaches [47], with neural-network-based approaches receiving the most attention in recent years. Gao et al. [53] designed a fusion component that fuses image features from convolutional networks and text features from long short-term memory networks to implement a multimedia question answering task. Neverova et al. [54] fused features of multiple modalities at multiple spatial and temporal scales to achieve gesture recognition. Kahou et al. [55] used multiple neural networks to learn audio, face, and mouth features separately and fused these features to achieve emotion recognition in videos. Nagrani et al. [56] introduced a novel transformer-based architecture that uses "attention bottlenecks" for modality fusion at multiple layers and demonstrated the effectiveness of this architecture on multiple audio-visual classification benchmarks. Qu et al. [57] gave a theoretical estimate of the performance of the fusion model by measuring the performance of each branch model and the distance between branches, providing a basis for how to choose modalities in multimodal fusion.
Our MMFR is designed for CMRSIR, but unlike the works above, we use a total of three modalities in the retrieval process: image, audio, and text. The text is converted from the audio, and we fuse the audio and text modalities to improve the discrimination of the audio modality. For multimodal fusion, we propose a multimodal feature fusion module and also try a variety of feature fusion styles. In addition, we introduce a consistency loss to improve the consistency of the cross-modal representations.

III. PROPOSED METHOD

A. Preliminary
In this subsection, we define the variables and concepts used.
(I_i, A_i, T_i) represents a triplet consisting of the RS image, the corresponding audio description, and the text description, where i ∈ (1, N) and N is the number of samples. The text description is converted from the audio description as

T_i = Speech_to_Text(A_i)    (1)

where Speech_to_Text() indicates the speech recognition function.
In this article, we use Google's publicly available speech recognition API [58] as the speech recognition function. Then, we use (2)-(4) for feature extraction of I, A, and T, respectively,

V_I = E_I(I_i)    (2)
V_A = E_A(A_i)    (3)
V_T = E_T(T_i)    (4)

where E_I, E_A, and E_T denote the image, audio, and text feature extractors and V_I, V_A, and V_T denote the extracted features. We map the features using (5)-(7) to obtain the image representation, audio representation, and text representation, respectively,

P_I = M_I(V_I)    (5)
P_A = M_A(V_A)    (6)
P_T = M_T(V_T)    (7)

where P_I, P_A, and P_T denote the representations of image, audio, and text, respectively, and M_I, M_A, and M_T denote the learnable feature mapping modules corresponding to images, audio, and text, respectively. P_I, P_A, P_T ∈ R^{d_k}, where d_k denotes the dimension of the representation.
Next, (8) is performed for the fusion of the audio and text representations

P_F = F(P_A, P_T)    (8)

where P_F ∈ R^{d_k} denotes the fused feature representation and F denotes the feature fusion module. The remaining problem is how to construct the feature extraction module E, the feature mapping module M, and the feature fusion module F to build a suitable common space. In this space, the semantically related paired image representation P_I and fusion representation P_F are close to each other, and vice versa. In this article, we design suitable modules and loss functions to solve these problems, which are introduced in the rest of this section.
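To make the notation above concrete, the following minimal PyTorch sketch mirrors (1)-(8) on random tensors; the identity extractors, single-linear mapping modules, additive fusion, and the dimension d_k = 256 are illustrative placeholders rather than the actual architecture.

```python
import torch
import torch.nn as nn

d_k = 256                              # dimension of the common representation (illustrative)

E_I, E_A, E_T = nn.Identity(), nn.Identity(), nn.Identity()   # placeholder extractors E_I, E_A, E_T

M_I = nn.Linear(2048, d_k)             # feature mapping M_I (image features are 2048-d)
M_A = nn.Linear(1024, d_k)             # feature mapping M_A (audio features are 1024-d)
M_T = nn.Linear(768, d_k)              # feature mapping M_T (text features are 768-d)

def F_fuse(P_A, P_T):                  # feature fusion module F, here simple addition
    return P_A + P_T

V_I = E_I(torch.randn(8, 2048))        # (2) image features V_I
V_A = E_A(torch.randn(8, 1024))        # (3) audio features V_A
V_T = E_T(torch.randn(8, 768))         # (4) text features V_T (from the recognized text, (1))

P_I, P_A, P_T = M_I(V_I), M_A(V_A), M_T(V_T)   # (5)-(7) representations in R^{d_k}
P_F = F_fuse(P_A, P_T)                          # (8) fused representation
```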

B. Framework
The proposed method in this article has four main parts: feature extraction module, feature mapping module, feature fusion module, and optimization objective. The proposed framework is shown in Fig. 3.
1) Feature Extraction: In this article, an image feature extractor, a speech feature extractor, and a text feature extractor are deployed for the RS image, the audio description, and the recognized text, respectively.
We use the ResNet50 [20] network pretrained on ImageNet [59] as the image feature extractor, and we average the output feature map of the last residual block to obtain the image feature V_I. The audio features V_A are extracted using the VGGish [21] network pretrained on AudioSet [60]. We use the pretrained BERT [22] model, which is widely used in natural language processing, to extract the text features V_T.
We use the RS images in the training sets of UCM_IV, RSICD_IV, and Sydney_IV to fine-tune the image feature extractor ResNet50 after loading the pretrained weights. After fine-tuning ResNet50, we extract the image features V_I, while we do not perform any fine-tuning for the feature extraction of the audio and text descriptions.
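As a rough illustration of this feature extraction step, the sketch below uses torchvision's ResNet50, the SpeechRecognition package for Google speech-to-text, and Hugging Face's bert-base-uncased as a stand-in for the pretrained BERT; the audio file name, the mean pooling of BERT outputs, and the omission of the VGGish call are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import speech_recognition as sr
from transformers import BertModel, BertTokenizer

# Image features V_I: ResNet50 pretrained on ImageNet; averaging the last feature
# map is equivalent to keeping the network up to its global average pooling layer.
resnet = models.resnet50(pretrained=True)
image_encoder = nn.Sequential(*list(resnet.children())[:-1]).eval()
with torch.no_grad():
    V_I = image_encoder(torch.randn(1, 3, 224, 224)).flatten(1)   # shape (1, 2048)

# Speech-to-text: Google's public speech recognition service via SpeechRecognition.
recognizer = sr.Recognizer()
with sr.AudioFile("audio_description.wav") as source:             # hypothetical file name
    audio = recognizer.record(source)
text = recognizer.recognize_google(audio)

# Text features V_T: pretrained BERT (bert-base-uncased as a stand-in), mean-pooled.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
with torch.no_grad():
    V_T = bert(**tokenizer(text, return_tensors="pt")).last_hidden_state.mean(dim=1)  # (1, 768)

# Audio features V_A (1024-d) come from a VGGish network pretrained on AudioSet (not shown here).
```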
2) Feature Mapping: After extracting the features of image, audio, and text, we design the feature mapping modules to perform further mapping of these features. The structures of the feature mapping modules for image, speech, and text are all multilayer perceptrons (MLPs). The structure of each MLP is shown in Table I. We tried three settings: feature_dim-256-128-256, feature_dim-256-256-256, and feature_dim-256-256. Experiments on the RSICD_IV dataset indicated that the feature_dim-256-256 and feature_dim-256-256-256 settings performed better. The feature_dim-256-256 setting has fewer parameters than the feature_dim-256-256-256 setting, so we adopt it. In these three MLPs, we use the ReLU function as the activation function. After passing the features through the feature mapping modules, we obtain P_I, P_A, and P_T, as shown in (5)-(7). The next subsection describes how to fuse P_A and P_T.
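A minimal sketch of one mapping branch with the adopted feature_dim-256-256 setting is given below; whether an activation follows the last layer and other Table I details are assumptions.

```python
import torch.nn as nn

class MappingMLP(nn.Module):
    """One feature mapping branch with the feature_dim-256-256 structure."""
    def __init__(self, feature_dim, hidden_dim=256, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, v):
        return self.net(v)

# One branch per modality: image (2048-d), audio (1024-d), and text (768-d) features.
M_I, M_A, M_T = MappingMLP(2048), MappingMLP(1024), MappingMLP(768)
```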
3) Feature Fusion: ADD is the pointwise adding of P_A and P_T, as shown in the following equation:

P_F = P_A + P_T.    (9)

MUL refers to the Hadamard product operation of P_A and P_T, as shown in the following equation:

P_F = P_A ⊙ P_T.    (10)

CON is the concatenation of P_A and P_T, as shown in the following equation:

P_F = CON(P_A, P_T)    (11)

where CON is the concatenating operation. ATT refers to constructing two attention mechanisms for P_A and P_T separately and then performing pointwise adding of the representations after the attention mechanism selection, as shown in (12)-(14), where L_1 and L_2 denote the two fully connected layers. ATT_SHARE is a variant of ATT, in which L_1 and L_2 share parameters.

Fig. 3. Structure of MMFR. MMFR contains four parts: feature extraction, feature mapping, feature fusion, and objective function. First, we use a pretrained ResNet [20] to extract RS image features and VGGish [21] to extract features of the audio description. In addition, we input the audio description of the image into Google's publicly available speech recognition platform to obtain speech-to-text recognition text. We then use pretrained BERT [22] to extract text features. Second, image, audio, and text features are fed into the feature mapping module to obtain image, audio, and text representations, respectively. Third, audio and text representations are fed into the feature fusion module to get the fusion representation. Finally, minimizing the loss functions allows MMFR to learn the intermodal consistency, pairwise semantic consistency, and intramodal semantic consistency.
ATT_CON is the second variant of ATT, as shown in (15)-(17). First, P_A and P_T are concatenated. They go through a fully connected layer with equal input and output dimensions. Then, they are activated by a sigmoid function to get the weight vector. Finally, we perform a Hadamard product operation on the first half of the weight vector and P_A and a Hadamard product operation on the second half and P_T. The output is the summation of the two Hadamard products

A_1 = Up_half(Sigmoid(L_1(CON(P_A, P_T))))    (15)
A_2 = Down_half(Sigmoid(L_1(CON(P_A, P_T))))    (16)
P_F = A_1 ⊙ P_A + A_2 ⊙ P_T    (17)

where Up_half means obtaining the up-half of the vector and Down_half means obtaining the down-half of the vector. ATT_RES is another variant of ATT, which introduces a residual structure into the fusion process, as shown in (18)-(20).
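The following hedged sketch illustrates two of the fusion styles, MUL and ATT_CON, as described above; the layer sizes and the exact splitting of the weight vector are assumptions.

```python
import torch
import torch.nn as nn

def fuse_mul(P_A, P_T):
    """MUL: Hadamard product of the audio and text representations."""
    return P_A * P_T

class AttConFusion(nn.Module):
    """ATT_CON: concatenate P_A and P_T, pass them through one fully connected layer
    with equal input/output size, activate with a sigmoid, split the weight vector
    into an up half and a down half, and use the halves to weight P_A and P_T."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 2 * dim)

    def forward(self, P_A, P_T):
        w = torch.sigmoid(self.fc(torch.cat([P_A, P_T], dim=1)))
        A_1, A_2 = w.chunk(2, dim=1)          # up half / down half of the weight vector
        return A_1 * P_A + A_2 * P_T          # summation of the two Hadamard products
```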

4) Optimization Objective:
In this article, we design three loss functions to optimize our model.
First, we introduce the triplet loss, which pulls together image-fusion pairs with the same semantics while pushing apart image-fusion pairs with different semantics. By optimizing the triplet loss, we can obtain better semantic consistency across modalities. Here, S is the cosine similarity, d is the margin, λ denotes the control hyperparameter, and j and k are the indexes of the positive and hardest negative samples, respectively. We use an online hard sample mining strategy. In each batch with N samples, the similarity matrix S between the fusion features and the image features is computed, and the element s_ij of the matrix represents the similarity of the ith fusion feature and the jth image feature. The diagonal elements of the matrix represent the positive sample pairs, and the nondiagonal elements represent the negative sample pairs. For image-to-audio retrieval, we select, in each column, the nondiagonal element with the greatest similarity among samples from different classes as the negative sample pair. For audio-to-image retrieval, we select, in each row, the nondiagonal element with the greatest similarity among samples from different classes as the negative sample pair. Based on this strategy, we construct the bidirectional triplet loss.
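A hedged sketch of this bidirectional triplet loss with online hard sample mining is given below; the exact margin formulation and the placement of λ are assumptions, and the default margin and weight are illustrative (the article reports d = 300 and λ = 5).

```python
import torch
import torch.nn.functional as F

def triplet_loss(P_F, P_I, labels, margin=0.3, lam=1.0):
    """Cosine similarities with hardest negatives mined per row and per column
    among different-class samples; margin/lam defaults are illustrative."""
    S = F.normalize(P_F, dim=1) @ F.normalize(P_I, dim=1).t()   # s_ij: ith fusion vs jth image
    pos = S.diag()                                              # diagonal: positive pairs

    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)     # mask out same-class entries
    S_neg = S.masked_fill(same_class, float("-inf"))

    hard_neg_per_column = S_neg.max(dim=0).values               # image-to-audio negatives
    hard_neg_per_row = S_neg.max(dim=1).values                  # audio-to-image negatives

    loss_i2a = F.relu(margin - pos + hard_neg_per_column)
    loss_a2i = F.relu(margin - pos + hard_neg_per_row)
    return lam * (loss_i2a + loss_a2i).mean()
```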
Next, the semantic loss is introduced. We compute the cross-entropy loss for the image representation P_I and the fusion representation P_F with the corresponding labels. Calculating the semantic loss allows us to construct the semantic space of the image modality and the fusion modality, increasing the semantic discriminability within each modality. Here, l_i is the semantic label of the sample pair and C is the semantic classifier. Finally, we introduce the consistency loss, which minimizes the difference between image representations and fusion representations. Minimizing this difference can eliminate the heterogeneity gap between the image modality and the fusion modality during optimization. Here, m is the batch size and ||·|| denotes the L2 norm operation.
The final loss function is the weighted sum of L_csy, L_tri, and L_class, shown as

L_final = L_tri + η_1 * L_class + η_2 * L_csy    (26)

where η_1 and η_2 are the hyperparameters.
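The remaining objectives and their weighted combination in (26) can be sketched as follows; the linear semantic classifier C shared by P_I and P_F, and the exact form of each term, are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

classifier = nn.Linear(256, 31)   # semantic classifier C (e.g., 31 categories in RSICD_IV); shared here by assumption

def semantic_loss(P_I, P_F, labels):
    """Cross-entropy on both the image and the fusion representations."""
    return F.cross_entropy(classifier(P_I), labels) + F.cross_entropy(classifier(P_F), labels)

def consistency_loss(P_I, P_F):
    """Mean L2 distance between paired image and fusion representations."""
    return (P_I - P_F).norm(p=2, dim=1).mean()

def final_loss(L_tri, L_class, L_csy, eta1=0.0001, eta2=10.0):
    """Weighted sum in (26); eta defaults follow the RSICD_IV setting reported later."""
    return L_tri + eta1 * L_class + eta2 * L_csy
```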

C. Optimizing
Our optimizing strategy is shown in Algorithm 1. The inputs of Algorithm 1 are the training set, the learning rate, the batch size, and the training epoch K. I_i, A_i, T_i, and l_i represent the image, the audio, the text recognized from the audio, and the semantic label, respectively. The outputs are the parameters of the feature mapping modules and the feature fusion module. Before training, we initialize the parameters of the feature mapping modules and the feature fusion module with a normal distribution. The training consists of a number of cycles, each with six steps. Step 1 samples training samples according to the batch size. Steps 2-6 train the feature mapping modules and the feature fusion module. When the training reaches the Kth epoch or the network has converged, we end the training to obtain the trained parameters Θ_MI, Θ_MA, Θ_MT, and Θ_F.

IV. EXPERIMENTS

A. Datasets
In this article, three datasets are used to validate the effectiveness of the proposed method, namely, Sydney_IV, UCM_IV, and RSICD_IV datasets.

Algorithm 1: Training Procedure of MMFR.

Input:
The training set {(I_i, A_i, T_i, l_i)}, the learning rate, the batch size, and the training epoch K.

Output:
The parameters Θ_MI of M_I, the parameters Θ_MA of M_A, the parameters Θ_MT of M_T, and the parameters Θ_F of F.

Initialization:
The parameters Θ_MI, Θ_MA, Θ_MT, and Θ_F are randomly initialized with a normal distribution.

Repeat:
Step 1: Sample the input RS image-(audio, text) pairs according to the batch size;
Step 2: Calculate the image features V_I via (2), and the audio features V_A and text features V_T according to (3) and (4), respectively;
Step 3: Map the features using (5)-(7) to obtain the image, audio, and text representations, respectively;
Step 4: Fuse the audio and text representations via (8) to obtain the fusion representation P_F;
Step 5: Calculate the losses L_tri, L_class, and L_csy according to (21)-(26);
Step 6: Update Θ_MI, Θ_MA, Θ_MT, and Θ_F using the Adam optimizer.
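As a usage note, the sketch below strings together the modules and losses from the earlier sketches in the order of the steps above; the synthetic train_loader is an assumed stand-in for batches of pre-extracted features and labels, and the optimizer settings follow the implementation details reported later.

```python
import torch

# Synthetic stand-in for the training loader (Step 1): pre-extracted image, audio,
# and text features with semantic labels, batch size 64.
train_loader = [(torch.randn(64, 2048), torch.randn(64, 1024), torch.randn(64, 768),
                 torch.randint(0, 31, (64,))) for _ in range(10)]

params = (list(M_I.parameters()) + list(M_A.parameters()) +
          list(M_T.parameters()) + list(classifier.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.5, 0.999))

for epoch in range(100):                                   # up to K epochs
    for V_I, V_A, V_T, labels in train_loader:             # Step 1: sample a batch
        P_I, P_A, P_T = M_I(V_I), M_A(V_A), M_T(V_T)       # Steps 2-3: features and representations
        P_F = fuse_mul(P_A, P_T)                           # Step 4: fusion representation
        loss = final_loss(triplet_loss(P_F, P_I, labels),  # Step 5: losses
                          semantic_loss(P_I, P_F, labels),
                          consistency_loss(P_I, P_F))
        optimizer.zero_grad()
        loss.backward()                                    # Step 6: update with Adam
        optimizer.step()
```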

1) Sydney_IV dataset: This dataset is adapted from the Sydney RS image description dataset [15] and contains 613 images from seven semantic categories. Each image corresponds to five audio descriptions.
2) UCM_IV dataset: This dataset is adapted from the UCM caption dataset [15] and has 2100 images covering 21 semantic categories. Each image corresponds to five audio descriptions.
3) RSICD_IV dataset: This dataset is adapted from the RSICD dataset [15], with a total of 10 921 images, each corresponding to five audio descriptions. This dataset contains a total of 31 semantic categories.
Following [19], we randomly select one of the five audio descriptions to obtain the image-audio sample pair. We randomly select 80% of each dataset as the training set and the remaining 20% for testing.

B. Implementation Details
We used ResNet50 [20] pretrained on ImageNet [59] as the image feature extractor. Before extracting features for each dataset, the image feature extractor is fine-tuned on the training set for the classification task. The dimension of the extracted image features is 2048. We used VGGish pretrained on AudioSet [60] as the audio feature extractor, and the extracted audio feature dimension is 1024. We used a 12-layer 8-head BERT pretrained by Google [61] as the text feature extractor, and the extracted text feature dimension is 768. The image, audio, and text feature extractors are fixed at the training stage. For all the experiments, the number of training epochs is set to 100. We use the Adam optimizer with a learning rate of 0.0001 and beta values set to (0.5, 0.999). The batch size is set to 64. The margin d and the control hyperparameter λ are set to 300 and 5, respectively. We experimented on a computer with an NVIDIA RTX 3090 GPU and 64-GB RAM. We implement the method with the PyTorch framework.
The number of parameters for the proposed model in this article is about 1.2M. We counted the running time on the RSICD_IV dataset. The time to train 100 epochs on an NVIDIA RTX 3090 GPU is 8 min. The time to complete one retrieval per sample is 0.8 ms during testing.

C. Metrics
Following [19], we use the mean average precision (mAP) [62] and the precision of the top-k ranking results P@k [63], k ∈ {1, 5, 10}, as evaluation metrics. The mAP is an evaluation metric that combines retrieval precision and recall, while P@k reflects the query precision of the retrieval. The mAP value is calculated as shown in (27) and (28)

AP = (1/R) Σ_{i=1}^{I} ((Σ_{j=1}^{i} ϕ(j)) / i) ϕ(i)    (27)
mAP = (1/N) Σ_{q=1}^{N} AP_q    (28)

where N is the number of retrievals (queries), R is the number of relevant samples, I is the number of returned samples, i is the rank, and ϕ(i) ∈ {0, 1} is the indicator function: if the sample at rank i is relevant, ϕ(i) is 1; otherwise, ϕ(i) is 0. P@k is calculated as

P@k = (1/N) Σ_{q=1}^{N} (1/k) Σ_{i=1}^{k} ϕ(i)    (29)

where k indicates the top-k results of the retrieval.
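A small sketch of these metrics is given below; it assumes each query is summarized by a 0/1 relevance vector ϕ ordered by decreasing similarity.

```python
import numpy as np

def average_precision(phi):
    """AP in (27): phi is a 0/1 relevance vector ordered by decreasing similarity."""
    phi = np.asarray(phi, dtype=float)
    R = phi.sum()
    if R == 0:
        return 0.0
    precision_at_i = np.cumsum(phi) / np.arange(1, len(phi) + 1)
    return float((precision_at_i * phi).sum() / R)

def mean_average_precision(rankings):
    """mAP in (28): rankings is a list of relevance vectors, one per query."""
    return float(np.mean([average_precision(phi) for phi in rankings]))

def precision_at_k(rankings, k):
    """P@k in (29): fraction of relevant samples among the top-k results, averaged over queries."""
    return float(np.mean([np.asarray(phi[:k], dtype=float).mean() for phi in rankings]))
```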
In this article, we set four experiment protocols: I_A, A_I, I_T, and T_I. I_A (A_I) indicates using image (audio) as the retrieval query and audio (images) as the return result. I_T (T_I) indicates using image (text) as the query and texts (images) as the return result of the retrieval.
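For concreteness, the following sketch shows how one protocol (A_I) can be evaluated by ranking gallery images against a query in the common space; the query and gallery representations are random stand-ins for the outputs of the trained mapping and fusion modules.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for one fused audio(+text) query and a gallery of image representations.
P_F_query = F.normalize(torch.randn(1, 256), dim=1)
P_I_gallery = F.normalize(torch.randn(500, 256), dim=1)

scores = (P_F_query @ P_I_gallery.t()).squeeze(0)   # cosine similarity to every gallery image
rank = scores.argsort(descending=True)              # gallery indices, best match first
top10 = rank[:10]                                   # candidates evaluated by, e.g., P@10
```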

D. Ablation Studies
To verify the effectiveness of different feature fusion styles and impacts of different loss functions, we conduct two ablation studies. In this subsection, we empirically set the weights of L class to 1 and the weights of L tri and L csy to 0.0001.

1) Effectiveness of Different Fusion Methods:
We evaluate seven feature fusion methods for MMFR, which are ADD, CON, MUL, ATT, ATT_SHARE, ATT_CON, and ATT_RES. In addition, we set two variants: ONLY_AUDIO and ONLY_TEXT. ONLY_AUDIO does not fuse the text information but only uses audio information. In contrast, ONLY_TEXT only uses text information instead of audio information. Tables II and III show the results of this ablation on the RSICD_IV dataset and the UCM_IV dataset, respectively. From Table II, we can find the following.
1) The mAP and P@k values of ONLY_AUDIO are the lowest compared with the text-using methods. This observation supports our viewpoint that the audio representation lacks discrimination. 2) ONLY_TEXT surprisingly achieves a promising retrieval performance. This result shows that the textual representation has strong discriminative properties, and the use of textual information can improve the retrieval performance to a great extent.
3) MMFR_MUL achieves the highest mAP and P@k values in this experiment. This is because the fusion of text and audio is unbalanced, with text playing a dominant role in the task. As a result, in this scenario, audio can act as noise to the signal of the text. MMFR_MUL is a multiplicative fusion method that helps suppress some useless feature points and enhance the feature points that help to discriminate during the fusion process, thus learning more discriminative fusion features. From Table III, we can observe that ONLY_AUDIO performs the worst and that the mAP and P@k values of all the fusion methods are close to those of ONLY_TEXT, with MMFR_MUL performing slightly better than ONLY_TEXT.
The experiment on the UCM_IV dataset shows that MMFR_MUL fusion is superior to ONLY_TEXT, but the numerical difference is not significant. To demonstrate the superiority of MMFR_MUL fusion and highlight the role of audio, we conduct a T-SNE visualization experiment on the UCM_IV dataset. Specifically, we randomly select 100 samples from the test set. Then, we map the selected samples into the common feature space using the MMFR_MUL model and the ONLY_TEXT model separately. Next, we use the T-SNE algorithm to reduce the dimensionality of the features and conduct the visualization. Fig. 5 shows the visualization results. Fig. 5(a) is the T-SNE visualization of MMFR_MUL fusion, and Fig. 5(b) is the T-SNE visualization of ONLY_TEXT. The solid red boxes in Fig. 5 indicate the ambiguous areas in the common space. Samples in these areas have poor semantic discriminability, and it is not easy to make valid retrievals for them. From Fig. 5, we can find that MMFR_MUL learns a more discriminative common space than ONLY_TEXT, because the common space in Fig. 5(a) has only one ambiguous area, whereas the common space in Fig. 5(b) has two.
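A minimal sketch of this visualization protocol is shown below; the random features and labels stand in for the 100 common-space representations and their UCM_IV categories.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.random.randn(100, 256)          # stand-in for 100 common-space representations
labels = np.random.randint(0, 21, 100)     # stand-in for UCM_IV category labels

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=12)
plt.title("T-SNE of common-space representations (illustrative)")
plt.show()
```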
From the quantitative experiments and the feature visualization, we can conclude that the introduction of text information is very effective for the RSIAR task, and the MUL fusion style can fuse audio and text features to achieve better retrieval results. In the subsequent experiments, we use the MUL fusion style.
2) Impacts of Different Loss Functions: To examine the impacts of different loss functions, we set six variants consisting of different combinations of losses, namely, MMFR_tri, MMFR_class, MMFR_csy, MMFR_tri+csy, MMFR_class+csy, and MMFR_tri+class. MMFR_tri indicates that the MMFR only uses triplet loss. MMFR_class indicates that the MMFR only uses semantic loss. MMFR_csy indicates that the MMFR only uses consistency loss. MMFR_tri+csy is MMFR using triplet loss and consistency loss. MMFR_class+csy means that MMFR uses semantic loss and consistency loss. MMFR_tri+class indicates that the MMFR uses triplet loss and semantic loss. MMFR_all indicates our MMFR using all three loss functions: triplet loss, semantic loss, and consistency loss.
From Tables IV and V, we can find the following. 1) MMFR_csy and MMFR_tri perform poorly in terms of mAP and P@k values. This means that it is not enough to use only the consistency loss or only the triplet loss. 2) MMFR_class improves the mAP and P@k values, but they are still relatively low.

3) MMFR_tri+csy does not perform better than MMFR_tri.
This indicates that adding the consistency loss to the triplet loss does not further optimize the feature space for effective retrieval in this article. 4) MMFR_class+csy makes a significant improvement in both mAP and P@k values compared to MMFR_class, which shows that the consistency loss and the semantic loss can work well together to achieve the retrieval task. This is because the consistency loss mitigates the heterogeneous gap.

5) MMFR_tri+class substantially improves mAP and P @k
values. This demonstrates that the triplet loss and the semantic loss play major roles in retrieval and can work well together. 6) MMFR_all achieves outstanding retrieval metric values, which demonstrates the validity of our proposed combination of the three loss functions. Moreover, there is an interesting phenomenon in Tables IV and V. MMFR_class for the task I-A has a very low mAP value but high P@k values. The reason for such a result is that the retrieval returns some relevant samples in the top ten, while the remaining relevant samples are ranked much further back. In addition, the mAP of MMFR_class for the task A-I is relatively high, while P@k is low. The reason is that many relevant results are ranked relatively early in the returned list, but not enough relevant samples fall within the top ten; most of the relevant samples may be ranked in the middle positions.
Based on the above observations, we can conclude that the roles of these loss functions are different. The semantic loss makes the intramodal feature distribution follow the semantic categories, that is, samples from the same class are grouped into a cluster. The triplet loss maintains intermodal semantic consistency, so that semantically matched pairs of cross-modal samples are close to each other and semantically different cross-modal samples are far from each other. The consistency loss makes the data distributions of different modalities consistent. Among them, the semantic loss and the triplet loss play the main role in retrieval by exploiting semantic and pairwise matching information, while the consistency loss mainly alleviates the heterogeneity gap between modalities and plays a supplementary role in retrieval. Thus, we use the weighted sum of the three losses as the optimization objective. The specific analysis of the weighting parameters is discussed in the next subsection.

E. Parameter Analysis
In this subsection, we conduct experiments on the RSICD_IV and UCM_IV datasets to analyze the impact of the hyperparameters η_1 and η_2. η_1 controls the contribution of the triplet loss L_tri. η_2 controls the contribution of the consistency loss L_csy. We use a grid search strategy to find the best values of these two hyperparameters. The sets of search values of η_1 and η_2 are both {0, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}. In this experiment, we use the average of the mAP values of A-I and I-A as the metric.
First, we search for the best value of η_1, with η_2 fixed at 0.0001 empirically. For the RSICD_IV dataset, it can be seen from Fig. 6 that the highest average mAP is obtained when η_1 is 0.0001. When η_1 is much smaller than 0.0001, the average mAP decreases significantly because the loss of matching information between samples leads to the misalignment of features. When η_1 is much greater than 0.0001, the mAP value also declines because an excessively large η_1 influences the optimization of the other loss functions. For the UCM_IV dataset, it can be seen from Fig. 6 that the highest average mAP is also obtained when η_1 is 0.0001.
Next, we find the best value of η 2 . In this experiment, η 1 is fixed at 0.0001 for the RSICD_IV dataset and the UCM_IV dataset based on the search for η 1 . From Fig. 7, we can see that the average mAP does not vary dramatically but gently when η 2 is lower than 10. For the RSICD_IV dataset, the best η 2 is 10. For the UCM_IV dataset, the best η 2 is 0.0001. There is a slight decrease in average mAP when η 2 is too small and a significant decrease when η 2 is too large, which could be because small η 2 creates a modality gap and large η 2 influences the optimization of other loss functions.
Consequently, the best hyperparameter values are η_1 = 0.0001 and η_2 = 10 for the RSICD_IV dataset, and η_1 = 0.0001 and η_2 = 0.0001 for the UCM_IV dataset. In the later experiments, we fix η_1 at 0.0001 and η_2 at 10 for the RSICD_IV dataset, and we set η_1 to 0.0001 and η_2 to 0.0001 for the UCM_IV dataset. For the Sydney_IV dataset, we set η_1 to 0.0001 and η_2 to 10, following the RSICD_IV dataset.

F. Results
Aiming to obtain a comprehensive experimental analysis of our MMFR method, we compare against eight methods: SCRL [19], DTBH [18], DIVR [17], CMIR-NET [16], DVAN [15], CNN + SPEC [64], DBLP [65], and SIFT + M [66]. SCRL [19] considers the pairwise consistency, intramodality consistency, and nonpaired intermodality consistency in RSIAR. DTBH [18] is a hashing-based method that uses a triplet similarity loss function. DIVR [17] is another hashing-based method that uses multiscale image features. CMIR-NET [16] presents a network that learns a discriminative shared feature space for different input modalities. DVAN [15] fuses the image and audio features to output a score indicating whether the image-audio pair matches. CNN + SPEC [64] is an unsupervised method that trains a network to judge whether the image-audio pair matches. DBLP [65] is another unsupervised method that outputs a similarity score between an image and a speech caption. SIFT + M [66] uses the SIFT features of images and the MFCC features of audio to achieve cross-modal retrieval.
We conduct experiments on three datasets to compare our MMFR with the eight comparison methods mentioned above. We show the experimental results of mAP and P@k in Tables VI-VIII. In addition, we show visualizations of retrieval results in Figs. 8-10 to give an intuitive presentation of the MMFR method.

1) Comparison With State of the Art on the UCM_IV Dataset: Fig. 8 shows some retrieval samples using our MMFR method. The first column shows the image query and the top ten audio results of the retrieval rank for the I-A protocol, whereas the second column shows the audio query and the top ten image results of the retrieval rank for the A-I protocol. From the first column of Fig. 8, we can observe that the image query is about a building. The retrieval results are all relevant to the query, although some of them are inaccurate, such as the fourth. Examination of the second column shows that the results are highly relevant to the query.

The quantitative experimental metrics on the UCM_IV dataset demonstrate the superiority of the MMFR method over the compared methods, and the visual results also illustrate the effectiveness of our proposed method.
2) Comparison With State of the Art on the RSICD_IV Dataset: Table VII shows the results on the RSICD_IV dataset.
Fig. 9 shows some samples of the retrieval results. The first column shows the result of the I-A protocol, and the last two columns show the result of the A-I protocol. The query in the first column is an image of a river. The retrieval results are all about the river and are highly relevant to the query. The query in the second column is an audio description of a playground. The results are all playground images and are very relevant to the query. Although the retrieved results are all highly relevant to the query, some details of the returned results do not match the query, as shown in the fifth and eighth items of the first column. This indicates that the fine-grained alignment of RSIAR needs to be further explored. We will focus on this issue in our future work.
The quantitative experimental metrics on the RSICD_IV dataset demonstrate the superiority of the MMFR method over the compared methods, and the visual presentation also illustrates the effectiveness of our proposed method.

Fig. 10. Some samples of retrieval results using the MMFR method on the Sydney_IV dataset. The first row shows the queries, including image queries and audio queries. The second row shows the retrieval results. For both the I-A and A-I experimental protocols, we show the top ten samples of the retrieval rank.
3) Comparison With State of the Art on the Sydney_IV Dataset: Table VIII shows the results on the Sydney_IV dataset. We find that the MMFR method achieves the best mAP values for both the I-A and A-I protocols. For the A-I protocol, the MMFR method also achieves the highest P@k values. For the I-A protocol, SCRL [19] and DTBH [18] still maintain the best performance on the P@k metrics, slightly outperforming our MMFR method. For retrieval methods, the mAP value considers both precision and ranking results, while the P@k score measures the precision of the retrieved top-k samples [19]. Therefore, our MMFR still makes progress in overall performance compared to state-of-the-art methods. Fig. 10 shows some retrieval examples on the Sydney_IV dataset. The first column indicates the I-A protocol, and the second the A-I protocol. From the first column of Fig. 10, we can find that all the retrieval results are very relevant to the image query, which is about a runway. The second column illustrates that all the image results are highly relevant to the audio query. This example demonstrates that our MMFR can retrieve the relevant results very accurately.
Both the quantitative and qualitative results illustrate that our MMFR method achieves good retrieval results on the Sydney_IV dataset.

V. CONCLUSION
In this article, we proposed a novel RSIAR method named MMFR to address the lack of discrimination of the audio modality and the heterogeneous gap between audio and image. MMFR uses an audio feature representation fused with text information (instead of the original audio representation) to enhance the discriminability of the audio modality. In addition, the triplet loss, the semantic loss, and the consistency loss are used to optimize the common retrieval space. Extensive experiments prove that our method exceeds the state-of-the-art methods. Quantitative and qualitative experimental results show that the MMFR method has adequate RSIAR performance.