Remote Sensing Image Retrieval in the Past Decade: Achievements, Challenges, and Future Directions

Remote sensing image retrieval (RSIR) aims to search for and retrieve images of interest from a large remote sensing image archive and has remained a hot topic over the past decade. Benefiting from the advent and progress of deep learning, RSIR has advanced through novel approaches, newly constructed datasets, and emerging applications. To the best of our knowledge, however, there is no comprehensive review of RSIR achievements that systematically and hierarchically categorizes the RSIR methods and benchmark datasets of the past decade. This article therefore provides a systematic survey of recently published RSIR methods and benchmarks by reviewing more than 200 papers. Specifically, in terms of image source, label, and modality, we first group the RSIR methods into hierarchical categories, each of which is reviewed in detail. Following this categorization, we list the benchmark datasets publicly available for performance evaluation and present our newly collected RSIR dataset. Moreover, some of the existing RSIR methods are selected and evaluated on representative benchmark datasets. The results demonstrate that deep learning-based methods are currently the dominant RSIR approaches and outperform handcrafted feature-based methods by a significant margin. Finally, we discuss the main challenges of RSIR and point out potential directions for future RSIR research.


I. INTRODUCTION
Over the past decades, remote sensing (RS) earth observation has reached an unprecedented level, and the available RS data have grown exponentially; however, due to the limitations of data processing techniques, we are overwhelmed by massive data containing much useless information. Therefore, in the era of RS big data, efficiently organizing and managing large RS archives and quickly searching for and retrieving the data of interest remain significant challenges for the RS community. Remote sensing image retrieval (RSIR), which aims to search for and retrieve images of interest from a large RS image archive, is an effective technique for addressing these problems [1]. In the early years, many RSIR methods were derived from text-based image retrieval and particularly content-based image retrieval (CBIR) in the computer vision (CV) field [2], [3], [4]. Although RSIR can be regarded as the application of CBIR in the RS field, RSIR is a more challenging task than CBIR due to the high complexity of RS images, including multiscale objects, varied resolutions, different imaging modes, and so forth. To advance RSIR, the RS literature has invested significant effort in developing RS image-specific methods, making RSIR an active research topic, as shown in Fig. 1. The number of publications on RSIR has increased dramatically over the past decade, especially since 2012. On the one hand, deep learning achieved remarkable performance on the large-scale ImageNet dataset in 2012, which has drawn much attention from the RS community since then. On the other hand, RS benchmark datasets have been increasingly constructed and made publicly available, making it possible to develop deep learning-based RSIR methods.
In the early years, RSIR was performed based on metadata (hence also called metadata- or text-based RSIR), such as image resolution, geographic coordinates, sensor type, and image acquisition time; however, such retrieval does not take image content (e.g., buildings, roads, and rivers) into consideration [5]. Therefore, content-based RSIR (CBRSIR), which searches for and retrieves RS images using low-level visual features [6], [7], [8], [9], [10], [11], [12], [13], [14], has gained much attention. Nevertheless, CBRSIR faces two significant challenges for large RS image archives. First, RS images usually contain a few object classes with varied scales; thus, low-level features are insufficient for accurately representing image content, and RSIR based on a single low-level feature (e.g., a spectral feature) or combined low-level features (e.g., spectral and texture features) usually yields unsatisfactory performance. Second, low-level features, known as handcrafted features, require laborious engineering effort, and it is not feasible to develop an effective handcrafted representation suitable for RS images with different resolutions, different object types, and different image complexities. Thus, for CBRSIR, it is necessary to shift our attention from previously hand-engineered features to learned features. From here on, RSIR refers to CBRSIR, unless otherwise stated.
Since 2012, deep learning has gradually developed into a dominant technique for feature extraction due to its remarkable performance on recognition tasks [15]. Accordingly, as an alternative to handcrafted features, deep learning has also been explored by the RS community [16], [17], [18], [19], [20]. Yuan et al. [16] analyzed the potential of deep learning for environmental RS tasks (e.g., land cover mapping, environmental parameter retrieval, data fusion and downscaling, as well as information reconstruction and prediction). Ball et al. [17] gave a detailed survey of deep learning for RS tasks in terms of theories, tools, and challenges. Zhang et al. [18] provided a technical tutorial on the state-of-the-art (SOTA) deep learning techniques for RS big data from the four perspectives of image processing, pixel-based classification, target recognition, and scene understanding. Zhu et al. [19] analyzed the challenges of using deep learning for RS data analysis, reviewed the recent advances, and provided resources to facilitate deep learning research in the RS domain. Ma et al. [20] summarized several main subfields of deep learning used in RS and conducted an in-depth review to describe and discuss the techniques in these subfields.
These works demonstrate that deep learning has become one of the dominant techniques for RS tasks. Driven by deep learning, a great number of RSIR methods have been presented; readers are referred to the reviews on RSIR [21], [22], [23], [24], [25]. Sudha and Aji [21] conducted a systematic study of the existing RSIR methods to guide new researchers in the RS domain in choosing effective methods for improving the performance of RSIR systems in different schemes. Gu et al. [22] comprehensively reviewed deep learning-based methods for RS image understanding and pointed out some future directions and potential applications. Sudha and Aji [23] concentrated on the advancements and current trends related to deep learning-based RSIR and analyzed how to use deep learning techniques and frameworks to address the challenges. Tong et al. [24] focused on three core issues of RSIR, i.e., feature extraction, similarity metric, and relevance feedback, and systematically investigated deep features for RSIR. In recent work, Li et al. [25] systematically reviewed the emerging achievements of RSIR and discussed its applications, including fusion-oriented RS image processing, geolocalization, and disaster rescue. To date, it appears to be the most systematic and comprehensive review of RSIR. However, the existing RSIR methods were coarsely categorized into CBRSIR, hash-based RSIR, cross-modal RSIR (CMRSIR), and interactive RSIR, which is not a reasonable division: in most studies, CBRSIR actually subsumes hash-based RSIR, CMRSIR, and interactive RSIR, whereas in [25] they were treated as four parallel categories. Furthermore, the existing benchmark datasets were also coarsely categorized into single-modality and multimodality data.
We therefore provide a comprehensive review of RSIR achievements of the past decade in this article, covering both RSIR methods and benchmark datasets. In addition, we release a new dataset and present an RSIR method evaluated on it. Our work organizes the existing RSIR methods into a hierarchical taxonomy, and the benchmark datasets are categorized accordingly. To the best of our knowledge, this is the most sophisticated categorization of RSIR methods and benchmark datasets, and it is complementary to the existing reviews. The comparisons between several existing review works and ours are summarized in Table I.
The rest of this article is organized as follows. Section II surveys the conventional and deep learning RSIR methods of different categories. Section III introduces the benchmark datasets for the performance evaluation of RSIR. The performance metrics and results of RSIR methods are presented in Section IV. Section V discusses the current challenges and potential solutions for RSIR. Section VI points out some future directions of RSIR. Finally, Section VII concludes this article.

II. CONVENTIONAL AND DEEP LEARNING METHODS FOR RSIR
Over the past decade, the RS community has witnessed the rapid development of RSIR methods, including handcrafted feature-based approaches and, more recently, deep learning-based ones. To detail these methods, we organize the existing RSIR methods into a four-level hierarchical category network in terms of data, as shown in Fig. 2. To the best of our knowledge, this hierarchical categorization network provides a sophisticated organization of RSIR methods.
Similar to the scheme in [25], our hierarchical RSIR category network is composed of unisource retrieval and cross-source retrieval at the first level. The categorization criterion is whether the query image and the retrieved images come from the same source. Specifically, for unisource retrieval, the query image and the retrieved images are from the same source, while for cross-source retrieval, they are from different sources, generally two. At the second level, unisource retrieval is further categorized into single-label RSIR (SLRSIR) and multilabel RSIR (MLRSIR), depending on whether an image is associated with one label or multiple labels. Cross-source retrieval is categorized into three subcategories, namely CMRSIR, retrieval across different sensors (RASRSIR), and cross-view RSIR (CVRSIR), depending on whether the images are from the same modality, sensor, or view. Note that for CMRSIR and CVRSIR, we also provide subcategories to cover the existing RSIR methods. Specifically, CMRSIR consists of retrieval between sketches and images, between optical and SAR images, between text and images, and between audio and images. CVRSIR contains retrieval between ground images, between ground and overhead images (e.g., ground-satellite, ground-aerial, and ground-drone), and between overhead images (e.g., drone-satellite). Table II presents the categorization criteria and a detailed description of each RSIR subcategory.
In the following, we review the existing SLRSIR, MLRSIR, RASRSIR, CMRSIR, and CVRSIR methods of the past decade, respectively. Because of the remarkable performance of deep learning for RSIR, we mainly focus on deep learning-based RSIR methods.

A. SLRSIR Methods
SLRSIR performs retrieval with single-label images and has been the dominant RSIR paradigm. In this scenario, each of the query and retrieved images belongs to only one image category. Generally, a typical RSIR system comprises three modules: feature extraction, feature indexing, and similarity measurement [25]. In practice, feature extraction and similarity measurement are the two indispensable parts because feature indexing is mostly used for large-scale RS image archives. Fig. 3 illustrates the flowchart of the SLRSIR method, which mainly contains these two steps. For feature extraction, handcrafted features and deep learning features are available for representing the query image and the images in the database. In the second step, the query image is compared with each image in the database by calculating their feature similarity. The similarity values are then sorted in descending order to return the top-k similar images. The kth retrieved image is regarded as a correct retrieval if it has the same category as the query image. Moreover, for one query, higher ranks of the correctly retrieved images indicate better performance of the SLRSIR method. Readers are referred to Section IV for SLRSIR performance metrics in detail.
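The two-step flow in Fig. 3 can be sketched in a few lines. The following is a minimal, illustrative NumPy example; the toy 2-D features and the function name are our own placeholders, not taken from any surveyed method:

```python
import numpy as np

def retrieve_top_k(query_feat, db_feats, k=5):
    """Rank database images by cosine similarity to the query (descending)."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to each database image
    order = np.argsort(-sims)          # sort in descending similarity
    return order[:k], sims[order[:k]]

# toy example: three database images with hypothetical 2-D features
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
idx, scores = retrieve_top_k(np.array([1.0, 0.05]), db, k=2)
```

Replacing the toy vectors with handcrafted or CNN features, and cosine similarity with another metric, yields the basic SLRSIR loop described above.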
The literature has been committed to SLRSIR over the past decade, and a number of advanced methods have been proposed, especially methods driven by deep learning. Before the advent of deep learning in the RS community, SLRSIR mainly relied on handcrafted features, such as spectral features, texture features, and combined features. Shao et al. [26] have proposed improved color texture descriptors for RSIR by taking color information into consideration. Bosilj et al. [27] have presented pattern spectra descriptors, computationally efficient histogram-like structures that describe the global distribution of arbitrarily defined attributes of connected image components. Aptoula [28] exploited global morphological texture descriptors for RSIR, which outperform the best-known retrieval scores despite their shorter feature vector length. Chen et al. [29] have proposed a radar RSIR algorithm to address the time-consumption problem. Kavitha and Vidhya Saraswathi [30] have proposed a fuzzy multicharacteristic clustering technique to provide retrieval outcomes with elevated retrieval accuracy. Sunitha and Sivarani [31] have proposed an efficient RSIR system utilizing weighted Brownian motion-based monarch butterfly optimization to improve retrieval accuracy while reducing computational complexity. Ben-Ahmed et al. [32] have focused on the most relevant channels and studied spectral sensitivity functions for constructing discriminative representations for hyperspectral image retrieval. Tekeste and Demir [33] have introduced local binary pattern (LBP) variants for the first time in the framework of RSIR and presented a comparative study to analyze and compare advanced LBP variants for RSIR. Zhang et al. [34] have proposed a hyperspectral RSIR system based on spectral and texture features. Du et al. [35] have considered the topological structure of local features and proposed a new representation method that takes the structural information of local features into consideration. Sukhia et al. [36] have utilized local ternary patterns to obtain upper and lower texture images and divided them into dense patches to build a final histogram representation. Yang et al. [37] have proposed a simple method to improve the recognition performance of the typical bag-of-words (BoW) framework by representing images with local features extracted from base images in a large-scale image database. To further improve RSIR performance, combined features have also been explored. For instance, Chaudhuri et al. [38] have introduced an unsupervised graph-theoretic approach for region-based retrieval, using intensity, texture, and shape features extracted from the regions to describe the node attributes of the graph. Ye et al. [39] have proposed an RSIR method based on query-adaptive feature weights to fuse features and utilized two image similarities to improve retrieval performance. Other works have focused on RSIR in compressed image archives [40], image reranking [41], feature hashing [42], and relevance feedback [43].
Although handcrafted features have demonstrated their capacity for RSIR, it is difficult to further improve retrieval performance due to the limitations of low-level features. The popularity and success of deep learning, particularly convolutional neural networks (CNNs), in the RS community [16], [17], [18], [19], [20] have prompted the development of a great number of deep learning-driven methods for RSIR. The existing methods can be coarsely divided into five groups: feature extraction-based, novel network-based, attention-based, metric learning-based, and hashing-based methods.
1) Feature Extraction-Based Methods: Feature extraction-based RSIR treats pretrained deep networks as feature extractors or fine-tunes these networks to extract image features. To ensure high efficiency and accuracy, Cheng et al. [44] have proposed a distributed system architecture for high-resolution satellite image retrieval by combining deep and traditional handcrafted features. Ye et al. [45] have exploited a CNN regression model to develop a query-adaptive feature fusion method to alleviate the huge variation in retrieval performance among different image queries. In another work, they calculated the fuzzy class membership of images to reduce the overall search time [46]. Fan et al. [47] have used a CNN to extract the effective coverage information of images and presented an automatic and accurate high-resolution RSIR method. Vharkate and Musande [48] have proposed a hybrid visual geometry group network by integrating dimensionality reduction, feature extraction, loss function optimization, matching, and relevance feedback for the appropriate retrieval of RS images. Zhuo and Zhou [49] have focused on feature dimension reduction and extracted low-dimensional, representative, and discriminative features from the fully connected layers of a CNN using an extended method. Similarly, Hou et al. [50] have extracted low-dimensional features from the fully connected layers by fine-tuning the pretrained MobileNets, Sadeghi-Tehran et al. [51] have derived feature representations via a CNN feature extractor, and Ye et al. [52] have fine-tuned pretrained CNNs to extract features.
Most of the aforementioned works regard CNNs as feature extractors and extract features from the fully connected layers. CNNs are, however, also capable of extracting deep local features from the convolutional layers, a process similar to that of the scale-invariant feature transform [53]. Ge et al. [54] have aggregated the outputs of midlevel layers by means of average pooling with different pooling regions to extract CNN features for high-resolution RSIR. Imbriaco et al. [55] have presented a pipeline that uses attentive local convolutional features and aggregates them with the vector of locally aggregated descriptors (VLAD) to produce a global descriptor. Tang et al. [56] have conducted similar work: they proposed an unsupervised deep learning method using a deep convolutional autoencoder, and the learned features were aggregated using the BoW framework to obtain the final feature vector. Hu et al. [57] have provided a comparative study on deep representations extracted from either fully connected or convolutional layers. Napoletano [58] has compared a few handcrafted features with CNN features on two benchmark datasets; the obtained results indicate that CNN features achieve overall better performance. The rest of the related works can be found in [59], [60], [61], and [62].
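To make the aggregation step concrete, the following is a minimal NumPy sketch of VLAD encoding in the spirit of pipelines like [55]; the descriptors and cluster centers here are hypothetical placeholders (in practice the centers would come from k-means over training descriptors):

```python
import numpy as np

def vlad(local_feats, centers):
    """Aggregate local descriptors (n, d) into one global VLAD vector:
    assign each descriptor to its nearest center (k, d), sum the residuals
    per center, then apply power- and L2-normalization."""
    d2 = ((local_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                 # hard assignment to centers
    v = np.zeros_like(centers, dtype=float)
    for i, c in enumerate(assign):
        v[c] += local_feats[i] - centers[c]    # residual accumulation
    v = np.sign(v) * np.sqrt(np.abs(v))        # power normalization
    v = v.ravel()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# toy example: one local descriptor near the first of two centers
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
g = vlad(np.array([[1.0, 0.0]]), centers)      # 4-D global descriptor
```

The resulting fixed-length vector can then be compared with cosine or Euclidean distance like any other global feature.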
Treating off-the-shelf networks as feature extractors often performs well on small-scale target datasets, particularly those similar to the source dataset on which the networks were pretrained. Fine-tuning the off-the-shelf networks is often used to further improve performance, especially for datasets with limited labeled images.
2) Novel Network-Based Methods: The novel network-based methods focus on designing new CNN architectures trained from scratch to learn powerful features. Zhou et al. [63] have compared the performance of various CNN features and proposed a low-dimensional CNN (LDCNN) for high-resolution RSIR, which outperforms fine-tuned CNNs. Boualleg et al. [64] have combined the LDCNN model [63] with the triplet loss and proposed a triplet LDCNN. Zhang et al. [65] constructed a triplet nonlocal neural network with a dual-anchor triplet loss for high-resolution RSIR. Zhuo and Zhou [66] proposed an RSIR method for high-resolution RS images with a Gabor-CA-ResNet and a split-based deep feature transform network. Wu et al. [67] have developed and investigated two new rotation-aware CNN-based RSIR methods to learn rotation-aware representations. Liu et al. [68] have introduced an easy way to organize the semantic relationships among classes as a category tree and proposed a tree-triplet-classification network. Wang et al. [69] have proposed a learnable joint spatial and spectral transformation model composed of a parameter generation network, a spatial conversion module, and a spectral conversion module for RSIR. Sumbul and Demir [70] have proposed a novel plasticity-stability preserving multitask learning approach to ensure the plasticity and stability conditions of the whole learning procedure independently of the number and type of tasks.
The works surveyed above focus on conventional CNN architectures that take images as input; other works operate on image graphs. Wang et al. [71] have developed a graph-based learning method for effectively retrieving RS images, utilizing a three-layer framework that integrates the strengths of query expansion and the fusion of holistic and local features. Chaudhuri et al. [72] have argued for the effectiveness of region adjacency graph-based representations of very high resolution (VHR) RS images in terms of localized regions and proposed a Siamese CNN architecture for assessing the similarity between a pair of graphs. Compared with images, graphs are capable of capturing contextual information and thus have the potential to improve RSIR performance.
Training novel architectures from scratch tends to achieve more remarkable performance than feature extraction-based methods that use pretrained or fine-tuned networks for feature extraction. However, a large number of labeled samples are often required to train a successful CNN. Moreover, designing a powerful CNN architecture is laborious even when pretrained CNNs are taken as the backbone of the new architecture.

3) Attention-Based Methods: Generally, attention-based methods integrate attention modules into the network architecture to learn more discriminative features. There have been several representative works on this topic [73], [74], [75], [76], [77]. Wang et al. [73] have proposed a multiattention fusion network with dilated convolution and label smoothing to force the network to learn discriminative features of important objects. Wang et al. [74] have presented a wide-context attention network that leverages two attention modules to adaptively learn local features correlated in the spatial and channel dimensions. Wang et al. [75] have introduced a second-order pooling, named compact bilinear pooling, into a CNN pipeline containing three stages, i.e., pretraining, fine-tuning, and retrieval. Xiong et al. [76] have proposed two effective schemes for generating discriminative features for RSIR: in the first, the attention mechanism and a new attention module were introduced into the CNN architecture; in the second, a multitask learning network structure was proposed to force the features to be more discriminative. Unlike the above works, which focus on spatial or channel attention to learn discriminative features, other works focus on edge and node attention to highlight important image context features by using image graphs as input. For example, Chaudhuri et al. [77] have proposed an attention-driven graph CNN for RSIR that attends over the edge matrix to highlight the interactions among meaningful regions and exploits this edge attention mechanism together with node attention to highlight essential image context.
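As a concrete illustration of channel attention (the generic squeeze-and-excitation pattern, not the exact module of any work cited above), the following NumPy sketch reweights the channels of a feature map with gates produced by a small bottleneck; the weights are random placeholders rather than learned parameters:

```python
import numpy as np

def channel_attention(fmap, w1, w2):
    """Squeeze-and-excitation style channel attention on an (H, W, C) map:
    global average pool -> small bottleneck MLP -> sigmoid channel gates."""
    z = fmap.mean(axis=(0, 1))                 # squeeze: one value per channel
    h = np.maximum(0.0, w1 @ z)                # excitation: ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))        # sigmoid gate in (0, 1) per channel
    return fmap * s                            # reweight channels

rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 4, 8))          # toy 4x4 feature map, 8 channels
w1 = rng.standard_normal((2, 8))               # bottleneck: 8 -> 2
w2 = rng.standard_normal((8, 2))               # expand: 2 -> 8
out = channel_attention(fmap, w1, w2)
```

In a trained network, `w1` and `w2` would be learned end-to-end so that informative channels receive gates near 1 and uninformative ones are suppressed.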
Integrating attention modules into CNNs provides a new way to extract more discriminative image features. However, the existing attention-based methods mainly focus on extracting powerful features and ignore the similarity measure, the other indispensable part of an RSIR system. Therefore, there is still room for performance improvement when more sophisticated attention modules that take both feature extraction and similarity measurement into account are designed.

4) Metric Learning-Based Methods: Metric learning learns a distance metric for the input space from a given collection of pairs of points [78] and can be combined with loss functions, such as the contrastive loss [79], to improve classification performance [80]. Since the similarity measure is an indispensable part of an RSIR system, a great number of metric learning-based methods have been developed for RSIR, which can be coarsely categorized into two groups. The first integrates metric learning into a CNN to improve performance. Zhao et al. [81] have proposed global-aware ranking deep metric learning with intraclass space sample mining and a cost-sensitive loss. Cao et al. [82] have developed a deep metric learning approach with generative adversarial network (GAN) regularization, aiming to obtain more accurate retrieval performance with small training samples. Cao et al. [83] have constructed a triplet network with metric learning to extract representative features in a semantic space where images from the same class are close to each other while those from different classes are far apart, thus enhancing RSIR. The second combines metric learning and attention mechanisms to learn discriminative features and thus achieve better performance. Cheng et al. [84] have proposed an ensemble architecture of residual attention-based deep metric learning for RSIR to improve feature distinguishability and retrieval efficiency. Chung et al. [85] have introduced a method for retrieving aerial images by merging group convolution with an attention mechanism and metric learning, resulting in robustness to rotational variations. Previous research on RSIR had ignored the advantages of jointly optimizing RSIR and scene classification; to overcome this limitation, Liu et al. [86] have presented an eagle-eyed multitask CNN integrating three tasks, i.e., center-metric learning, similarity distribution learning, and aerial scene classification, in a single network.
Extensive experiments on four public aerial image sets demonstrate its superior performance over the existing methods. Other metric learning-based methods have focused on defining novel losses [87], [88], [89], similarity learning [90], and reranking [91] to improve RSIR performance.
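The triplet objective underlying several of the methods above (e.g., the triplet networks in [83]) can be written in a few lines. This is a generic sketch with toy 2-D features, not the exact loss of any cited paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull same-class pairs together and push
    different-class pairs at least `margin` farther apart in feature space."""
    d_pos = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)

# toy embeddings: the positive is close to the anchor, the negative is far
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same class
n = np.array([1.0, 0.0])   # different class
```

When the loss is zero, the triplet already satisfies the margin; minimizing it over many triplets shapes the embedding space so that ranking by distance agrees with class membership.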
RS images usually have different resolutions, object types, and complexities. Therefore, data-driven metric learning is well suited to deep learning-based RSIR methods, since both the features and the similarity measure are learned from data. Moreover, RSIR performance can be further improved when the network is trained in an end-to-end manner.

5) Hashing-Based Methods: For RSIR with large-scale archives, storage cost and retrieval efficiency are two factors to be considered. Hashing-based methods aim to perform RSIR with short binary codes, which have low storage cost and high retrieval efficiency. The existing hashing-based methods can be divided into unsupervised and supervised hashing [25].
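To illustrate why binary codes are cheap to store and fast to compare, the following NumPy sketch binarizes features with random hyperplane projections (an unsupervised, LSH-style baseline, not one of the learned deep hashing models reviewed below) and ranks by Hamming distance:

```python
import numpy as np

def hash_codes(feats, proj):
    """Binarize real-valued features with random hyperplane projections."""
    return (feats @ proj > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code (ascending)."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists), dists

rng = np.random.default_rng(0)
proj = rng.standard_normal((128, 32))       # 128-D features -> 32-bit codes
db_feats = rng.standard_normal((1000, 128))
codes = hash_codes(db_feats, proj)          # 32 bits per image instead of 128 floats
order, dists = hamming_rank(codes[0], codes)  # query with image 0's own code
```

Learned hashing methods replace the random projection with a trained function so that Hamming distance better preserves semantic similarity, but the storage and lookup costs remain those of short binary codes.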
Unsupervised hashing methods rely on unlabeled data to generate binary hash codes; there have been relatively few of them in the RS community. Chen and Lu [92] have proposed an unsupervised multispectral RSIR method that makes use of the unsupervised representation learning ability of GANs. Reato et al. [93] have presented a simple yet effective unsupervised RSIR method that represents each image with primitive-cluster-sensitive multihash codes. Lukac et al. [94] have improved the well-known kernelized locality-sensitive hashing method using graphics processing units to make it amenable to parallelization, thus performing fast parallel image retrieval. Kong et al. [95] have proposed a low-rank hypergraph hashing method for large-scale RSIR. To improve the performance of unsupervised hashing, self-supervised methods, semi-supervised methods, and methods relying on pseudolabels have been explored. For example, Tan et al. [96] have proposed a deep contrastive self-supervised hashing method, which uses unlabeled images to learn accurate hash codes. Tang et al. [97], [98] have proposed a semi-supervised deep hashing method based on an adversarial autoencoder network for RSIR. Sun et al. [99] have proposed a soft-pseudolabel-based unsupervised deep hashing method that well reflects the semantic distance between intercluster images.
Although unsupervised hashing methods are simple and effective for generating binary codes for large-scale RSIR, their performance improvement is limited by the lack of supervised information. In contrast, supervised hashing methods often achieve better performance. To address the problem that deep hashing networks tend to be highly expensive in terms of storage space and computing resources, Li et al. [100] have developed a quantized deep learning to hash framework for large-scale RSIR. Song et al. [101] have proposed asymmetric hash code learning for RSIR, improving on the conventional practice of learning one hash function for both the query and database samples in a symmetric way. Motivated by residual networks, Han et al. [102] have developed a cohesion-intensive deep hashing model for RSIR. Liu et al. [103] have presented a deep supervised hashing model for RSIR in the framework of GANs, named GAN-assisted hashing. Li et al. [104] proposed a large-scale RSIR method based on a deep hashing neural network. Ye et al. [105] have investigated multiple-feature hashing learning for large-scale RSIR. Tang et al. [106] have developed a new supervised hash learning method for large-scale high-resolution RSIR based on metalearning. Wang et al. [107] have proposed a novel triplet ordinal cross-entropy hashing method to fix the problem that most existing hashing algorithms only emphasize preserving pointwise or pairwise similarity. Shan et al. [108] have presented a proxy-based hash retrieval method, called deep hashing using proxy loss, which combines hash code learning with proxy-based metric learning in a CNN. Liu et al. [109] proposed a new RSIR method named feature and hash learning, which consists of a deep feature learning model and an adversarial hash learning model.
For most existing hashing methods, the hash functions are learned once and then kept fixed, which is not suitable for ever-growing collections of new RS images. Li et al. [110] therefore proposed a new online hashing method that learns and adapts the hashing functions with respect to newly incoming RS images.
The above supervised hashing methods provide new ways of performing large-scale RSIR. However, a large number of labeled images are needed to train a successful network, and learning to generate binary hash codes is also time-consuming.

B. MLRSIR Methods
For SLRSIR, both the query images and the other images in the database are single-labeled, under the assumption that each image is annotated with a single label representing its most significant semantic content. However, in practical scenarios, RS images may contain multiple classes (e.g., buildings, roads, and trees). MLRSIR, a more challenging task than SLRSIR, assumes that each image is associated with multiple labels (also known as primitive classes) and is thus suitable for addressing this problem. As illustrated in Fig. 4, the process of MLRSIR is similar to that of SLRSIR but differs slightly in feature extraction and performance evaluation. Specifically, for feature extraction, we need to extract the features of each primitive class contained in the image. With respect to performance evaluation, there are no ground truth images for each multilabel query image; the retrieved images are ranked according to the similarity scores between the query image and the images in the database. Thus, the metrics for SLRSIR are not applicable to evaluating MLRSIR methods. The literature has committed effort to developing MLRSIR methods, including handcrafted feature-based methods [111], [112], [113], [114] and deep learning-based ones [115], [116], [117], [118], [119], [120], [121]. For handcrafted methods, Chaudhuri et al. [111] have introduced a semi-supervised graph-theoretic method in the framework of MLRSIR. Dai et al. have presented a novel hyperspectral RSIR system consisting of a spatial and spectral image description scheme and a sparsity-based supervised retrieval method [112], which was improved in their later work [113]. Shao et al. [114] have conducted a comparative performance evaluation of SLRSIR and MLRSIR methods on a newly collected multilabeled dataset termed DLRSD, providing the literature with a benchmark dataset along with baseline results for MLRSIR research.
Regarding deep learning-based methods, Kang et al. [115] have proposed a new graph relation network that models the relations between samples with a graph structure for multilabel RS scene categorization and retrieval. Hua et al. [122] took advantage of pairwise label relations to infer multiple object labels of a high-resolution aerial image and proposed an innovative inference network. Sumbul and Demir [117] have proposed a novel graph-theoretic deep representation learning method in the framework of MLRSIR problems, aiming to extract and exploit the multilabel co-occurrence relationships associated with each RS image. In another work [116], they proposed a novel triplet sampling method in the framework of deep neural networks defined for MLRSIR to obtain informative and representative triplet selections, which is an improved version of the previous work [114]. To increase retrieval efficiency and reduce feature storage while preserving semantic information, Cheng et al. [119] have presented a new semantic-preserving deep hashing model for MLRSIR. Shao et al. [121] have proposed a novel MLRSIR approach based on the fully convolutional network, where single-scale and multiscale region features are extracted to perform region-based MLRSIR. Although these methods achieve satisfactory performance, they mainly focus on extracting powerful features; the similarity measure as well as the evaluation metric for MLRSIR are not considered. As an alternative, Imbriaco et al. [120] have defined protocols for performance evaluation using new metrics and studied the impact of commonly used losses as well as reranking methods for MLRSIR, providing a direction for similarity measurement between multilabel images.
In contrast to SLRSIR, MLRSIR is still a new topic, and most of the existing works focus on feature extraction. However, as stated above, there are no ground truth images for each query image. Therefore, more attention should be paid to defining protocols for similarity measurement and performance evaluation.

C. RASRSIR Methods
RASRSIR performs RSIR between images that are captured by different sensors and, thus, have different resolutions (e.g., multispectral and panchromatic images, or multispectral and hyperspectral images). A special case is retrieval between optical and SAR images, which is generally categorized as CMRSIR, as illustrated in Table I. Given a multispectral (MUL) image as the query and panchromatic images as the database images, Fig. 5 illustrates the flowchart of RASRSIR. The process is also similar to that of SLRSIR but differs in feature extraction. Specifically, to avoid the difference caused by image resolution, the features of MUL and panchromatic (PAN) images are mapped into the same feature space before conducting the similarity measure. There is continually increasing interest in developing RASRSIR methods for RS images from different sensors. Li et al. [123] have proposed a source-invariant deep hashing CNN for RASRSIR between MUL and PAN images, which is optimized in an end-to-end manner using a series of well-designed optimization constraints. To maintain the source discrepancy at the classifier level, Ma et al. [124] have presented a teacher-ensemble learning method with knowledge distillation. In [125], a discriminative distillation network was also proposed to address the inconsistency between different image sources. Gao et al. [126] have proposed a method dubbed multiview graph convolutional hashing to fuse multisource RS images. Xiong et al. [127] have explicitly addressed the data drift problem by mapping the source domain to the target domain in an image translation-based framework. Ma et al. [128] have developed a dual-modality collaborative learning model to fully explore the specific information of diverse RS images.
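The shared-feature-space idea can be sketched as follows (a hypothetical NumPy illustration: the feature dimensions and the random linear projections stand in for modality-specific encoders that would, in practice, be learned jointly, e.g., under constraints such as those in [123]):

```python
import numpy as np

def to_shared_space(feat, W):
    """Project a modality-specific feature into the shared space and
    L2-normalize it so that similarities are comparable across sources."""
    z = W @ feat
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
# Hypothetical dimensions: 512-D MUL features, 256-D PAN features,
# both mapped into a 128-D shared space.
W_mul = rng.normal(size=(128, 512))
W_pan = rng.normal(size=(128, 256))
z_mul = to_shared_space(rng.normal(size=512), W_mul)
z_pan = to_shared_space(rng.normal(size=256), W_pan)
similarity = float(z_mul @ z_pan)   # cross-resolution similarity score
```

Once both modalities live in the same normalized space, the retrieval step reduces to the usual similarity ranking.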
RASRSIR is a promising direction for RSIR, and the core problem is to alleviate the effect of inconsistency between different image sources.
D. CMRSIR Methods

The existing CMRSIR methods for sketch-image and optical-SAR retrieval are fewer than those for text-image and audio-image retrieval. Jiang et al. [129] have taken free-hand sketches into account and addressed the problem of sketch-based aerial image retrieval. Chaudhuri et al. [130] have exploited the data modality comprising more spatial information (sketch) to extract the features of the other modality (image) with cross-attention networks. Xu et al. [131] have developed a sketch-based RSIR model to learn a deep joint embedding space with discriminative losses, which was then evaluated on a sketch RS image dataset. Regarding optical-SAR CMRSIR, Xiong et al. [132] have addressed the prominent modality discrepancy caused by different imaging mechanisms in a deep cross-modality hashing network. To effectively deal with the discrepancies, Sun et al. [133] have conducted a similar work, proposing a multisensor fusion and explicit semantic-preserving-based deep hashing method. Sumbul et al. [134] have collected the multimodal BigEarthNet (BigEarthNet-MM) benchmark archive containing pairs of Sentinel-1 and Sentinel-2 image patches.
For text-image and audio-image CMRSIR, a great number of methods have been developed in recent years. Regarding text-image CMRSIR, to bridge the modality gap, Lv et al. [135] have proposed a fusion-based correlation learning model for text-image retrieval. Yuan et al. [136] have presented a text-image retrieval framework based on global and local information and designed a multilevel information dynamic fusion module to efficiently integrate features of different levels. In their other works, an asymmetric multimodal feature matching (FM) network was developed to explore a fine-grained multiscale method for text-image retrieval [137], and a concise but effective cross-modal retrieval model was designed by considering the multiscale and target-redundancy characteristics of RS [138]. Cheng et al. [139] have established a direct relationship between RS images and paired text data. To address the limitation that existing approaches require a high number of labeled training samples, Mikriukov et al. [140] have proposed an unsupervised cross-modal contrastive hashing method for text-image retrieval. With respect to audio-image CMRSIR, Chen et al. [141], [142], [143] have proposed a few audio-image retrieval methods. Guo et al. have proposed a CMRSIR method for RS images and spoken audio [144], which was improved in a later work [145]. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship. To overcome this limitation, Ning et al. [146] have proposed a semantics-consistent representation learning method for image-voice retrieval. Yang et al. [147] have proposed a cross-modal feature fusion retrieval model, which provides a more optimized cross-modal common feature space than previous models and, thus, improves retrieval performance. In a later work [148], they have presented a multifusion method.
As an emerging research topic, a great number of CMRSIR methods have been developed to perform sketch-image, optical-SAR, text-image, and audio-image retrieval. The performance of CMRSIR methods would be further improved as long as the effect of modality difference is well alleviated.

E. CVRSIR Methods
CVRSIR is also known as image geolocalization [149] in the literature, aiming to determine the geographic information of an image (e.g., a ground-view image) by referencing a geotagged image of another view (e.g., an overhead-view image). Therefore, image geolocalization is essentially one kind of RSIR application. Fig. 6 illustrates the basic flowchart of CVRSIR between a ground-view image and overhead-view images. It is notable that CVRSIR is similar to SLRSIR in terms of the retrieval process; the differences lie in that the query image and the images in the database are captured from different views and that, for each query image, there is generally only one ground truth retrieved image.
The existing CVRSIR methods consist of ground-ground, ground-overhead, and overhead-overhead, as listed in Table I. Most CVRSIR methods are ground-overhead since an overhead image can be obtained for any location on the earth. Zhang et al. [150] have proposed a deep network that embeds the spatial configuration of scenes into the feature representation. Zemene et al. [151] have used image matching in a structured database of city-wide reference images with known GPS coordinates. Inspired by the human visual system, Lin et al. [152] have proposed a framework to jointly learn discriminative representations and detect salient key points with a single network. Rodrigues and Tani [153] have retrieved corresponding aerial views from a large database of geotagged aerial imagery. Hu et al. [154] have leveraged the recent success of deep learning and proposed a cross-view matching network for the ground-to-aerial geolocalization task. Ground-overhead geolocalization is the most challenging geolocalization task due to the large variation of viewpoint and irrelevant content. To address this issue, Zeng et al. [155] have taken drone-view information as a bridge between the ground-view and satellite-view domains and proposed a peer learning and cross diffusion framework. The remaining works can be found in [156], [157], [158], [159], [160], [161], and [162].
The aforementioned CVRSIR works mainly focus on ground-satellite and ground-aerial geolocalization; we here introduce a novel similarity learning-based CVRSIR (SL-CVRSIR) method for ground-drone geolocalization. To evaluate the performance of SL-CVRSIR, we also collect a new dataset named CVGD, which will be released later for noncommercial purposes. Fig. 7 illustrates the architecture of SL-CVRSIR, which has two identical subnetworks (i.e., CNN1 and CNN2) without shared weights, designed to extract the features of overhead-view images and ground-view images, respectively. SL-CVRSIR takes positive and negative image pairs as input, where positive pairs are composed of images from the same location and negative pairs are composed of images from different locations. The outputs of the fully connected layers (i.e., F_o and F_g) of the two subnetworks are combined through subtraction, and the result is passed through a fully connected (Fc) layer with a single output. A sigmoid operation converts the output of the Fc layer to a similarity score between 0 and 1, indicating whether the overhead-view and ground-view images in the pair are from the same location or not.
To train SL-CVRSIR, the combined loss L is defined as follows:

L = L_c + L_ce

L_c = y · d_og^2 + (1 − y) · max(m − d_og, 0)^2

L_ce = −[q · log(p) + (1 − q) · log(1 − p)]

where L_c is the contrastive loss [79], aiming to compare the similarity between overhead-view and ground-view images, and L_ce is the cross-entropy loss, aiming to measure the difference between the label and the similarity score. y and q are the labels of the image pair, with 1 for a positive pair and 0 for a negative pair, and p is the similarity score. d_og is the Euclidean distance between the overhead-view image and the ground-view image in the pair, and m is the margin. Once SL-CVRSIR is trained, the similarity score of each image pair is computed to perform cross-view retrieval. Specifically, given one ground-view image (i.e., the query image), the overhead-view images are sorted in descending order by their similarity to the query. The query is regarded as correct if the overhead-view image from the same location as the ground-view image is within the top K% retrieved images.
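A minimal NumPy sketch of this combined loss is given below (hypothetical; it follows the standard forms of the contrastive and cross-entropy losses and, for simplicity, uses a single pair label y where the text distinguishes y and q):

```python
import numpy as np

def combined_loss(d_og, p, y, margin=1.0):
    """L = L_c + L_ce for one image pair.
    d_og: Euclidean distance between the two features; p: similarity
    score in (0, 1); y: 1 for a positive pair, 0 for a negative pair."""
    l_c = y * d_og ** 2 + (1 - y) * max(margin - d_og, 0.0) ** 2
    eps = 1e-12  # numerical guard for the logarithms
    l_ce = -(y * np.log(p + eps) + (1 - y) * np.log(1.0 - p + eps))
    return l_c + l_ce
```

For a well-separated positive pair (small distance, score near 1) or negative pair (distance beyond the margin, score near 0), the loss approaches zero, as intended.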

III. BENCHMARK DATASETS FOR PERFORMANCE EVALUATION OF RSIR
Benchmark datasets are indispensable for advancing RSIR approaches and for performance evaluation. With the development of RS technology, the literature has witnessed remarkable progress in constructing publicly available datasets for RSIR. These datasets are collected for developing different RSIR methods and, thus, can be divided into different categories, including SLRSIR datasets, MLRSIR datasets, RASRSIR datasets, CMRSIR datasets, and CVRSIR datasets. We, therefore, survey the publicly available benchmark datasets presented for different RSIR methods in recent years, as shown in Table III. In the table, we list the basic characteristics of each dataset for simple comparison; the readers are referred to the corresponding datasets for more details. It is worth noting that not all of these datasets were originally collected for RSIR. In the following section, we select and detail several representative datasets for each RSIR method category, i.e., SLRSIR, MLRSIR, RASRSIR, CMRSIR, and CVRSIR. These representative benchmark datasets will be further used for performance evaluation in Section IV.
A. SLRSIR Datasets

We select three representative datasets, i.e., UC Merced, WHU-RS19, and PatternNet, and introduce them in detail.
1) UC Merced: The UC Merced [163] dataset was originally collected for land use/land cover classification with 21 categories, including agricultural, airplane, baseball diamond, beach, building, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tank, and tennis court. Each category contains 100 images cropped from United States Geological Survey (USGS) aerial imagery. Each image in the UC Merced dataset has a size of 256×256 pixels and a spatial resolution of 0.3 m. As the first publicly available dataset of its kind, UC Merced has been widely used for RSIR performance evaluation.
2) WHU-RS19: The WHU-RS19 [164] dataset is collected from Google Earth and categorized into 19 classes, including airport, beach, bridge, commercial area, desert, farmland, football field, forest, industrial area, meadow, mountain, park, parking, pond, port, railway station, residential area, river, and viaduct. WHU-RS19 contains 1005 images with a size of 600×600 pixels and spatial resolutions of up to 0.5 m. Compared with the UC Merced dataset, it is a more challenging dataset for RSIR because of its varied spatial resolutions.
3) PatternNet: The PatternNet [167] dataset is a large-scale dataset collected from U.S. cities via Google Earth and the Google Map API for developing RSIR methods, particularly deep learning-based ones. It contains a total of 30 400 images evenly divided into 38 categories, including airplane, baseball field, basketball court, beach, bridge, cemetery, chaparral, Christmas tree farm, closed road, coastal mansion, crosswalk, dense residential, ferry terminal, football field, forest, freeway, golf course, harbor, intersection, mobile home park, nursing home, oil gas field, oil well, overpass, parking lot, parking space, railway, river, runway, runway marking, shipping yard, solar panel, sparse residential, storage tank, swimming pool, tennis court, transformer station, and wastewater treatment plant. The images in the dataset measure 256×256 pixels, with spatial resolutions ranging from 4.69 to 0.06 m. PatternNet was released to overcome the limitations of existing datasets, which are small in scale and whose images contain a large amount of background that might distract accurate retrieval.

B. MLRSIR Datasets
It is time-consuming and laborious to construct a multilabel archive for RSIR since each image in an MLRSIR dataset is associated with at least one primitive class (i.e., label). Thanks to the literature's efforts, several publicly available MLRSIR datasets have been collected and released over the past five years. The existing MLRSIR datasets include DLRSD [114], WHDLD [121], MLRSNet [171], ML-AID [172], MultiScene [173], and BigEarthNet [174].
Among these datasets, three representative archives, i.e., DLRSD, WHDLD, and MLRSNet, are selected and introduced in detail.

1) DLRSD:
The DLRSD [114] dataset is labeled based on the UC Merced archive [163] and, therefore, also consists of 2100 images with a size of 256×256 pixels and a spatial resolution of 0.3 m. DLRSD is a dense labeling dataset, where each pixel of each image in the UC Merced dataset is annotated with one of 17 primitive classes (labels), including airplane, bare soil, building, car, chaparral, court, dock, field, grass, mobile home, pavement, sand, sea, ship, tank, tree, and water. It is an improved version of the multilabel archive of [111] and is available not only for image-level tasks, such as SLRSIR and MLRSIR, but also for pixel-level tasks, such as semantic segmentation.
2) WHDLD: Similar to DLRSD, the WHDLD [121] dataset is also a pixel-level dense labeling dataset, where the images are cropped from a large mosaic image acquired by the Gaofen-1 and Ziyuan-3 satellites. Each pixel in WHDLD is annotated with one of the following six primitive classes (labels): building, road, pavement, vegetation, bare soil, and water. WHDLD contains 4940 images with a size of 256×256 pixels and a spatial resolution of 2 m. WHDLD differs from DLRSD in terms of the number of images, image labels, and spatial resolution. Therefore, it is treated as a complementary dataset to DLRSD for both image-level and pixel-level tasks.
3) MLRSNet: The MLRSNet [171] dataset is composed of 109 161 images with the size of 256×256 pixels, and the spatial resolution ranges from 10 to 0.1 m. The images are divided into 46 broad categories, and the number of images in each category varies from 1500 to 3000. In addition, there are 60 predefined primitive classes (labels), containing airplane, airport, bare soil, baseball diamond, basketball court, beach, bridge, buildings, cars, chaparral, cloud, containers, cross walk, dense residential area, desert, dock, factory, field, football field, forest, freeway, golf course, grass, greenhouse, gully, harbor, intersection, island, lake, mobile home, mountain, overpass, park, parking lot, parkway, pavement, railway, railway station, river, road, roundabout, runway, sand, sea, ships, snow, snowberg, sparse residential area, stadium, swimming pool, tanks, tennis court, terrace, track, trail, transmission tower, trees, water, wetland, and wind turbine. Each image in MLRSNet is associated with at least one of the 60 labels. In contrast to DLRSD and WHDLD, MLRSNet has the characteristics of hierarchy, large scale, and high diversity. However, as an image-level multilabel dataset, it cannot be used for pixel-level tasks.

C. RASRSIR Dataset
The existing public datasets for RASRSIR are not as numerous as those for SLRSIR and MLRSIR, and the DSRSID [123] dataset is one of the open datasets. The images in DSRSID are tiled from panchromatic and multispectral images acquired by the Gaofen-1 optical satellite and are grouped into image pairs, where each pair is a combination of one panchromatic image and one multispectral image. The one-channel panchromatic image has a size of 256×256 pixels and a spatial resolution of 2 m, and the four-channel multispectral image has a size of 64×64 pixels with a spatial resolution of 8 m. Additionally, DSRSID consists of eight classes, i.e., aquafarm, cloud, forest, high building, low building, farm land, river, and water, where each class contains 10 000 panchromatic and multispectral image pairs.

D. CMRSIR Datasets

1) RSketch:
The RSketch [131] dataset is collected for CMRSIR between RS images and sketches. It is composed of 20 categories, including airplane, baseball diamond, basketball court, beach, bridge, closed road, crosswalk, football field, golf course, intersection, oil gas field, overpass, railway, river, runway, runway marking, storage tank, swimming pool, tennis court, and wastewater treatment plant, and each category contains 200 RS images and 45 sketches. Both the RS images and sketches have a fixed size of 256×256 pixels. The RS images are collected from existing datasets, such as UC Merced [163], WHU-RS19 [164], AID [166], and PatternNet [167].
2) UCM-/Sydney-/RSICD-Audio: The UCM-/Sydney-/RSICD-audio [144] datasets are collected based on the existing UCM-/Sydney-Captions [175] and RSICD [176] datasets, respectively. To construct the datasets, each image is given five sentences, and the five sentences of an image are spoken by five different speakers. The spoken audios have varied lengths ranging from 1 to 15 s, which makes the datasets challenging.
3) TextRS: The images in the TextRS [177] dataset are collected from four existing datasets, namely, UC Merced [163], AID [166], PatternNet [167], and NWPU-45 [170]. TextRS contains 2144 images randomly selected from the four datasets, and each image is annotated with five sentences generated by five different people to guarantee diversity.

4) CBRSIR_VS:
The CBRSIR_VS [133] dataset is an optical and SAR dual-modality RS image dataset. It consists of ten class labels and 26 901 pairs of optical and SAR images. The optical images are VHR images with the size of 256×256 and the resolution of 1 m, and the SAR images are from Sentinel-1 imagery with the size of 64×64 and the resolution of 10 m.

E. CVRSIR Datasets
The images in a CVRSIR dataset are usually collected from ground-aerial or ground-satellite viewpoints. The images of the two views captured at the same location are aligned to form image pairs. CVRSIR has been a hot topic in recent years, and thus, the literature has constructed benchmarks of different viewpoints to advance CVRSIR research, including University-1652 [178], CVACT [179], AiRound/CvBrCT [180], the CVUSA subset [181], VIGOR [182], Vo and Hays [183], and the two cities dataset [184]. Besides, we also release a novel dataset termed cross-view between ground and drone (CVGD) to perform CVRSIR between ground-view and drone-view images. The readers are referred to the following section for more details on CVGD.
We select University-1652, CVACT, CVUSA subset, and the newly constructed CVGD as representative benchmarks and introduce them in detail.

1) University-1652:
The University-1652 [178] dataset is a multiview multisource benchmark for drone-based geolocalization. It has ground-view, drone-view, and satellite-view images collected from 1652 buildings of 72 universities. The training set contains 701 buildings of 33 universities, and the testing set contains 701 buildings from the remaining 39 universities. It should be noted that the training set has 71.64 images on average per location, while existing datasets generally contain two images per location. University-1652 is thus characterized by multiple sources, multiple views, and more images per class.
2) CVACT: The CVACT [179] dataset is a city-scale, fully GPS-tagged cross-view dataset consisting of ground-view panoramas collected via the Google Street View API and the corresponding satellite-view images. The resolution of the ground-view images is 1664×832 pixels, and that of the satellite images is 1200×1200 pixels. There are 35 532 image pairs in the training set and 92 802 in the testing set; besides, a validation set with 8884 image pairs is also provided. Note that the training/validation split sizes follow those of the CVUSA subset [181].
3) CVUSA Subset: The CVUSA subset [181] is a smaller version of the original CVUSA [158] dataset. Specifically, panoramas of CVUSA are selected to form the subset's ground-view images, and for each of them, the aerial image at zoom level 19 covering the same geographic area is downloaded from Bing Maps. The panoramas without available corresponding aerial images are filtered out, yielding 35 532 training image pairs and 8884 testing image pairs. The resolutions of the ground-view and satellite images are 1232×224 pixels and 750×750 pixels, respectively.

4) CVGD:
Most existing CVRSIR datasets are ground-and-aerial-view or ground-and-satellite-view; no ground-and-drone-view images are included except in the University-1652 dataset. However, the drone images in University-1652 are actually simulated drone images collected from Google Earth. Moreover, all the ground-drone images in University-1652 are building images without any other objects, and thus, the literature needs a dataset with high diversity, i.e., images containing different objects. To this end, we collect a ground-and-drone-view dataset named CVGD. The images in CVGD are collected from 100 locations in a university. Considering that, in a real CVRSIR task, there is possibly more than one ground-drone image pair indicating the same location but captured from different viewpoints, we collect 2-6 drone images and 2-7 ground images per location, as shown in Fig. 8. Fig. 9 illustrates nine examples of ground-drone pairs, and it can be observed that the images contain different types of objects, such as buildings, trees, roads, grass, and lakes; thus, CVGD is more challenging than University-1652.

IV. PERFORMANCE EVALUATION OF RSIR METHODS ON BENCHMARK DATASETS
In this section, we first introduce the performance metrics for different RSIR methods and then present the results of representative RSIR methods.

A. Performance Metrics for RSIR
Performance metrics are crucial for performance evaluation of RSIR methods. Considering the fact that different RSIR methods need their own measures, we categorize the existing measures into SLRSIR, MLRSIR, RASRSIR, CMRSIR, and CVRSIR, and then introduce them in detail.
1) Metrics for SLRSIR: Several metrics are commonly used for performance evaluation of SLRSIR: average normalized modified retrieval rank (ANMRR), mean average precision (MAP), precision at k (P@K), and the precision-recall (PR) curve [58]. The definitions of these metrics are presented in the following text.
ANMRR is a metric that takes the rank of each returned image into account and is possibly the most widely used measure for SLRSIR. Let q be one query image and ng(q) the number of its similar images. ANMRR is defined by

ANMRR = (1 / n_q) Σ_{q=1}^{n_q} NMRR(q)

where n_q is the number of queries, and NMRR(q) is defined as follows:

NMRR(q) = [ar(q) − 0.5(1 + ng(q))] / [1.25 k(q) − 0.5(1 + ng(q))]

where ar(q) = (1 / ng(q)) Σ_{k=1}^{ng(q)} r(k) is the average rank, and r(k) is the retrieved rank of the kth similar image, which is defined as follows:

r(k) = rank(k), if rank(k) ≤ k(q); otherwise, r(k) = 1.25 k(q)

where k(q) = 2 ng(q). ANMRR ranges between zero and one, and the lower the value, the better the performance.
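For illustration, ANMRR can be computed with a short NumPy routine (a sketch under the assumption that the 1-based retrieval ranks of each query's ground-truth images are already known):

```python
import numpy as np

def nmrr(ranks, ng):
    """NMRR for one query: `ranks` are the 1-based retrieval ranks of
    the ng ground-truth images; ranks beyond k(q) = 2*ng are penalized."""
    k = 2 * ng
    r = np.where(np.asarray(ranks) <= k, ranks, 1.25 * k)  # penalized rank r(k)
    avr = r.mean()                                          # average rank ar(q)
    return (avr - 0.5 * (1 + ng)) / (1.25 * k - 0.5 * (1 + ng))

def anmrr(per_query_ranks):
    """Mean NMRR over all queries; lower is better (range [0, 1])."""
    return float(np.mean([nmrr(rk, len(rk)) for rk in per_query_ranks]))
```

Perfect retrieval (all ground-truth images at the top) yields 0, while missing all of them yields 1.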
To explain the other three metrics clearly, we first introduce precision and recall. Precision is the fraction of the retrieved images that are similar to the query image, while recall is the fraction of the images similar to the query image that are successfully retrieved.
MAP is a commonly used performance metric and is defined as follows:

MAP = (1 / n_q) Σ_{q=1}^{n_q} AveP(q)

where AveP is the average precision

AveP = [Σ_{k=1}^{n} P(k) rel(k)] / ng(q)

P(k) is the precision at cutoff k, i.e., the metric P@K, and rel(k) is an indicator function with the value of 1 if the image at rank k is relevant and 0 otherwise. The PR curve can be obtained by plotting precision against recall. In practice, the 11-point interpolated PR curve is usually selected, which is achieved by plotting the interpolated precision at 11 recall levels (i.e., 0, 0.1, 0.2, ..., 1). The interpolated precision P_inter at recall level k is defined as the highest precision at any recall level k′ ≥ k:

P_inter(k) = max_{k′ ≥ k} P(k′).
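These definitions can be sketched in NumPy as follows (hypothetical helper names; each query is represented by its binary relevance list ordered by rank):

```python
import numpy as np

def average_precision(rel):
    """AveP from a binary relevance list (rel[k] = 1 if the (k+1)-th
    retrieved image is relevant), normalized by the number of relevant images."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    p_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # P(k)
    return float((p_at_k * rel).sum() / rel.sum())

def mean_average_precision(rel_lists):
    return float(np.mean([average_precision(r) for r in rel_lists]))

def interpolated_precision(precision, recall, levels=np.linspace(0, 1, 11)):
    """11-point interpolated PR: the highest precision at any recall >= level."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    return [float(precision[recall >= lv].max()) if (recall >= lv).any() else 0.0
            for lv in levels]
```

For example, a relevance list [1, 0, 1] gives AveP = (1 + 2/3) / 2 = 5/6.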
2) Metrics for MLRSIR: The existing evaluation measures for SLRSIR are not suitable for MLRSIR since there is no ground truth (i.e., set of similar images) for each multilabel image in an MLRSIR dataset. To address this limitation, the measures designed for multilabel classification are used for performance evaluation of MLRSIR [111]. These measures are accuracy, precision, recall, hamming loss (HL), and F1-measure (F1), which are defined as follows:

Accuracy = (1 / m) Σ_{i=1}^{m} |l_q ∩ l_{r_i}| / |l_q ∪ l_{r_i}|

Precision = (1 / m) Σ_{i=1}^{m} |l_q ∩ l_{r_i}| / |l_{r_i}|

Recall = (1 / m) Σ_{i=1}^{m} |l_q ∩ l_{r_i}| / |l_q|

HL = (1 / m) Σ_{i=1}^{m} |l_q ⊕ l_{r_i}| / n

F1 = 2 · Precision · Recall / (Precision + Recall)

where ∩, ∪, and ⊕ are the logical AND, logical OR, and logical XOR operations, respectively, | · | is the number of nonzeros, l_q is the multilabel vector of the query image, l_{r_i} is that of the ith retrieved image, n is the number of labels (i.e., primitive classes), and m is the number of retrieved images. Although the abovementioned measures can be used for performance evaluation, they were originally designed for multilabel classification and, thus, are not well suited to MLRSIR. To circumvent this limitation, Imbriaco et al. [120] presented a novel and effective metric for MLRSIR.
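These multilabel measures can be sketched directly from their set-operation definitions (a minimal NumPy illustration; it assumes every label vector has at least one nonzero entry so the per-image denominators are well defined):

```python
import numpy as np

def multilabel_metrics(l_q, retrieved):
    """Multilabel retrieval measures for one query.
    l_q: binary label vector of the query; retrieved: list of m binary
    label vectors of the retrieved images (n labels each)."""
    l_q = np.asarray(l_q, dtype=bool)
    m, n = len(retrieved), l_q.size
    acc = prec = rec = hl = 0.0
    for lr in retrieved:
        lr = np.asarray(lr, dtype=bool)
        inter = (l_q & lr).sum()
        acc += inter / (l_q | lr).sum()   # |AND| / |OR|
        prec += inter / lr.sum()          # |AND| / |l_r|
        rec += inter / l_q.sum()          # |AND| / |l_q|
        hl += (l_q ^ lr).sum() / n        # |XOR| / n
    acc, prec, rec, hl = acc / m, prec / m, rec / m, hl / m
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "hl": hl, "f1": f1}
```

Lower HL and higher values of the remaining measures indicate better retrieval.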
3) Metrics for RASRSIR and CMRSIR: Regarding RASRSIR and CMRSIR, there is ground truth for each query image; therefore, the metrics for SLRSIR can be used for performance evaluation. It is worth noting that, for text-image retrieval in CMRSIR, recall at k (R@K, K = 1, 5, and 10) is the widely used measure [177], which is defined as the fraction of queries for which a relevant result appears among the top K retrieved items.

4) Metrics for CVRSIR:
The most commonly used metric for evaluating the performance of CVRSIR is recall at K (R@K, K = 1, 2, 3, ...) [179]. A special case of R@K arises when K equals 1% × N, where N is the number of gallery images; R@K then becomes R@top 1%, another widely used metric for performance evaluation.
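Under the convention that each query has a single ground-truth image, R@K and R@top 1% reduce to a few lines (a hypothetical sketch; `ranks` holds the 1-based rank of the ground truth for each query):

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth image appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def recall_at_top1pct(ranks, num_gallery):
    """R@top 1%: K is set to 1% of the gallery size (at least 1)."""
    return recall_at_k(ranks, max(1, round(0.01 * num_gallery)))
```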

B. Results of RSIR Methods
This section presents the performance comparisons of representative RSIR methods of each category on the corresponding benchmark datasets. It is notable that these results (except for those on the CVGD dataset) are collected from the published works since it is impossible to reimplement all of the algorithms, many of which are not publicly available. To give a fair comparison, all the results are obtained on the same retrieval dataset. Besides, for RASRSIR and CMRSIR methods, there are two retrieval scenarios, i.e., "A→B" (retrieve B with A as the query) and "B→A" (retrieve A with B as the query).
1) Results of SLRSIR: As can be observed from Table IV, deep learning-based methods outperform handcrafted feature-based methods by a significant margin on the three datasets. However, UFL achieves only comparable performance with handcrafted feature-based methods on the PatternNet dataset. This result makes sense since UFL is an unsupervised deep learning method. Among the deep learning-based methods, LDCNN (belonging to the novel network-based methods) achieves remarkable performance. Specifically, LDCNN outperforms CaffeRef_F and ResNet50_P (belonging to the feature extraction-based methods) on the WHU-RS19 and PatternNet datasets, respectively, indicating that training novel CNNs from scratch is capable of learning more discriminative features. On the UC Merced dataset, LDCNN obtains slightly worse performance than VGGM_P and VGGM_F. This is because LDCNN is trained on the AID [166] dataset, which differs considerably from the UC Merced dataset. Therefore, the performance of LDCNN could be further improved by training on images that are similar to the target dataset, and in particular on the target dataset itself [163]. The results demonstrate that CNN features improve SLRSIR performance by a significant margin compared with handcrafted features. Among the CNN-based methods, the LDCNN network achieves the best performance on both the WHU-RS19 and PatternNet datasets and has the potential to perform best on the UC Merced dataset when the target dataset is used for training. Additionally, ResNet often performs better for RSIR than other pretrained CNNs and would be a potential CNN backbone for practical application scenarios.
2) Results of MLRSIR: The performance of MLRSIR methods is evaluated on the DLRSD and WHDLD datasets, and the results are shown in Table V. It is obvious that the deep learning-based methods (i.e., SSRCF and MSRCF) outperform the handcrafted feature-based methods (i.e., BoVW, LBP, Gabor texture, and MLIR) in terms of the HL, accuracy, precision, recall, and F1 metrics. Among the deep learning-based methods, the multiscale region feature (i.e., MSRCF) performs better than the single-scale region feature (i.e., SSRCF), as expected; thus, MLRSIR could benefit from combining CNN features extracted from different layers. Specifically, MSRCF achieves about 3% and 5% improvement in accuracy on DLRSD and WHDLD, respectively. Among the handcrafted features, BoVW obtains the best performance, with accuracy values of 0.5454 on DLRSD and 0.7013 on WHDLD.
According to the results in Table V, deep learning has demonstrated its capacity for promoting MLRSIR. It is notable that both the SSRCF and MSRCF features are extracted based on the FCN network [196]; thus, the performance could be further improved if a more sophisticated network were exploited.
The results indicate that CNN features improve MLRSIR performance by a significant margin compared with handcrafted features. Of the two CNN-based methods (i.e., SSRCF and MSRCF), MSRCF combines features from different layers to obtain multiscale features and thus achieves better performance, making it a promising method for practical application scenarios.
3) Results of RASRSIR: The performance comparisons of RASRSIR methods for retrieval between MUL and PAN images on the DSRSID dataset are summarized in Table VI. As the results show, SIDHCNNs achieves remarkable performance for both the "PAN→MUL" and "MUL→PAN" retrieval scenarios and outperforms the other methods by a large margin in terms of MAP values.
To be specific, SIDHCNNs improves on the worst performing method, CCA, by 80% for "PAN→MUL" and 81% for "MUL→PAN." CCA performs the worst because it works in an unsupervised way. In contrast, the supervised method SCM performs better than CCA, achieving about 22% and 23% improvement for "PAN→MUL" and "MUL→PAN," respectively. DCHM is the second-best performing method, benefiting from deep learning, which indicates that deep learning can contribute to developing more effective RASRSIR methods. The remarkable performance of SIDHCNNs indicates that a CNN integrated with hash learning is a promising solution for RASRSIR.
4) Results of CMRSIR: The performance comparisons of CMRSIR methods are summarized in Table VII. For the sketch-image CMRSIR methods, SBRSIR performs best on the RSketch dataset. Specifically, its MAP values are 0.9091 and 0.5008 for seen and unseen categories, respectively (seen categories are those whose samples are used for training, while unseen categories are the opposite). Sketch-a-Net performs poorly on both seen and unseen categories due to its shallow network architecture. Although DSM improves on the performance of Sketch-a-Net, the improvement is limited. As the second-best method, LDF-CLS performs slightly worse than SBRSIR but improves on Sketch-a-Net and DSM by a remarkable margin. The results indicate that the deep hash-based method (i.e., SBRSIR) is a promising solution for sketch-image retrieval whether the categories are seen or unseen. Regarding CMRSIR between VHR and SAR images, MsEspH outperforms the other three methods for both the "VHR→SAR" and "SAR→VHR" retrieval scenarios. For example, compared with the second-best method, DSMHN, MsEspH achieves about 4% improvement and would be a promising solution for retrieval between VHR and SAR images.
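The MAP values reported in Tables VI and VII are the mean over queries of the average precision of each ranked result list. A minimal reference computation (the function names and toy relevance lists are ours):

```python
def average_precision(ranked_relevance):
    """Average precision for one query given a ranked 0/1 relevance list."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(relevance_lists):
    """MAP over several queries, as reported for the retrieval scenarios."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

# Two toy queries: the first retrieves relevant items at ranks 1 and 3,
# the second at ranks 2 and 3.
map_value = mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]])
```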
For CMRSIR between audio (A) and image (I), CNN+M, CNN+ΔM, and CNN+Δ²M obtain comparable performance for "I→A" on the UCM-audio, Sydney-audio, and RSICD-audio datasets. For "A→I", CNN+M is overall the best performing method. It is notable that CNN+SPEC performs the worst for both the "A→I" and "I→A" retrieval scenarios; therefore, discriminative features for audio and image are crucial for obtaining better performance. With respect to CMRSIR between text and image, DBTN_EfficientNet overall outperforms the other methods in terms of R@1, R@5, and R@10 values. These results indicate that the deeper CNNs (i.e., ResNet50, Inception_v3, and VGG16) do not achieve remarkable performance when combined with DBTN.
5) Results of CVRSIR: Table VIII presents the results of CVRSIR on three benchmark datasets, i.e., University-1652, CVACT, and the CVUSA subset. For the University-1652 dataset, the method proposed by Wang et al. outperforms the other methods by a significant margin for the "Drone → Satellite" and "Satellite → Drone" retrieval scenarios in terms of R@1 value. Besides, it can be observed that each method performs better for the "Satellite → Drone" scenario than for "Drone → Satellite."
In addition to the benchmark datasets mentioned above, we also report the performance of SL-CVRSIR on our CVGD dataset, as shown in Table IX. The presented SL-CVRSIR is compared with two methods: feature matching (FM) and improved feature matching (IFM). For the FM method, we extract the features of overhead-view and ground-view images from the first fully connected layer of the subnetwork pretrained on ImageNet and perform CVRSIR following the workflow illustrated in Fig. 6. The IFM method is similar to FM; the only difference is that its subnetwork comes from SL-CVRSIR.
To train SL-CVRSIR, the 100 locations in CVGD are randomly split into training set, validation set, and testing set with the ratio of 6:2:2, and then the images of each location constitute the positive and negative image pairs. Considering the fact that CVGD is a small-scale dataset, we increase the training set by flip and rotation, and the weights of convolutional layers of the subnetwork are kept frozen. The initial learning rate is set to 3e-5 and is decreased to 0.9 times of the former learning rate every ten epochs, and the batch size is set to 128 with 64 positive pairs and 64 negative pairs.
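The schedule and batch composition described above can be sketched as follows (a simplified illustration of our setup, not the full training loop):

```python
import random

def lr_at_epoch(epoch, base_lr=3e-5, gamma=0.9, step=10):
    """Step schedule from the text: multiply the rate by 0.9 every ten epochs."""
    return base_lr * gamma ** (epoch // step)

def balanced_batch(pos_pairs, neg_pairs, batch_size=128):
    """Draw half the batch from positive pairs and half from negative pairs,
    matching the 64/64 composition used for SL-CVRSIR."""
    half = batch_size // 2
    return random.sample(pos_pairs, half) + random.sample(neg_pairs, half)
```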

It can be observed that SL-CVRSIR outperforms FM and IFM, and particularly FM, in terms of the recall at top K% (K = 1, 5, 10, 15, 20) metric, indicating that the proposed similarity learning-based method is an effective approach for CVRSIR. Regarding FM and IFM, it is notable that the trained subnetwork performs better than the pretrained subnetwork, as expected. Furthermore, the VGG16 subnetwork achieves overall better performance than AlexNet for all three methods; therefore, the performance could likely be further improved by using a deeper subnetwork.
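The recall at top K% metric counts a query as a hit when its ground-truth match appears within the first K% of the ranked reference list; a minimal computation (the function and variable names are illustrative):

```python
def recall_at_top_k_percent(ranked_ids, relevant_ids, k_percent):
    """Fraction of queries whose ground-truth match falls in the top K%.

    ranked_ids: per-query ranked lists over the reference set;
    relevant_ids: the single ground-truth match for each query.
    """
    hits = 0
    for ranking, target in zip(ranked_ids, relevant_ids):
        cutoff = max(1, round(len(ranking) * k_percent / 100))
        if target in ranking[:cutoff]:
            hits += 1
    return hits / len(ranked_ids)
```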

V. CHALLENGES AND POTENTIAL SOLUTIONS FOR RSIR
Over the past decade, the RS community has witnessed significant progress in RSIR, both in developing novel methods and in constructing new benchmark datasets. However, RSIR still faces several challenges that need to be addressed to further promote RSIR research. Here, we present five main challenges for current RSIR: the lack of large-scale RSIR datasets, the large difference between RS images, the difficulty in reproducing results of existing methods, inconsistent dataset splits and evaluation protocols, and the semantic gap in CBRSIR systems.

A. Lack of Large-Scale RSIR Datasets
RSIR datasets are crucial for developing RSIR algorithms, especially data-driven (i.e., deep learning) methods. Much effort has been devoted to constructing new datasets for the different categories of RSIR methods over the past decade, and a number of benchmark datasets are publicly available, as shown in Table II. Unfortunately, the existing datasets have the following limitations. First, the commonly used datasets collected in the early years are often small in scale, and results on them have gradually saturated. Moreover, these datasets are too small for developing deep learning based approaches. Second, some existing datasets, such as WHU-RS19 [164], AID [166], and NWPU-45 [170], were originally collected for scene classification rather than RSIR. The problem is that their images contain large amounts of background unrelated to the image category, which may mislead RSIR. Third, although the literature has collected some large-scale datasets, they are still not large enough for training deep learning networks from scratch. In practice, transfer learning remains a commonly accepted strategy to overcome this limitation.
To address these dataset-related issues, much larger RSIR datasets are required. Besides, we should keep in mind the differences between the categories of RSIR methods when creating datasets and follow the guidance on creating benchmark datasets for RS image interpretation [208].

B. Large Difference Between RS Images
With the rapid development of RS technology, more and more RS images have become available. Generally, these images have varied resolutions, complexities, and even modalities, presenting great challenges for RSIR, and in particular for cross-source RSIR. To be specific, for unisource RSIR (i.e., SLRSIR and MLRSIR), although the query images and the images to be retrieved come from the same source, they might differ in image size, resolution, scale, etc., leading to the problem of large intraclass diversity and high interclass similarity. For cross-source RSIR (i.e., RASRSIR, CMRSIR, and CVRSIR), retrieval is performed between two sources, and the large differences between RS images may further degrade performance. For example, RASRSIR generally performs retrieval between images captured by two different sensors (e.g., the multispectral and panchromatic sensors of one satellite), which is the case for the existing DSRSID dataset [123]. However, RASRSIR becomes more challenging when the multispectral and panchromatic images come from more than one satellite.
Considering the large difference between RS images, there are two potential solutions: First, constructing a large dataset containing images with varied resolutions, complexities, and scales. Second, developing novel RSIR algorithms and particularly deep learning based approaches to learn more powerful and discriminative features.

C. Difficulty in Reproducing Results of Existing Methods
A large number of RSIR methods have been developed over the past decade, and some of them have achieved SOTA performance. Some of these methods, however, cannot be reimplemented because of the lack of open implementations. Besides, it is difficult to replicate the results presented in published works when the necessary implementation details are not provided. For example, training a successful CNN often relies on tricks such as data augmentation and dropout, and it is impossible to replicate exactly the same results without these details.
The RS community may learn from the CV domain, where open-source implementations and the necessary details are often provided.

D. Inconsistent Dataset Split and Evaluation Protocol
A newly developed algorithm needs to be compared with existing methods to demonstrate its performance. To this end, we need to either reimplement these methods or collect the results from published works. The former is a challenge, as discussed above. As for the latter, the results presented in related works may not be comparable due to inconsistent dataset splits and evaluation protocols. For example, most existing RSIR datasets do not provide dataset splits, such as training, validation, and testing sets. In practice, these splits are often obtained by randomly dividing the dataset into different parts, resulting in different splits across works. Besides, the evaluation protocol may also differ. For example, before evaluating an RSIR method, we need to select the query images and performance metrics. It is not fair to compare two RSIR methods evaluated with totally different query images. Moreover, the methods may be evaluated using different metrics, making it impossible to compare them directly.
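A simple remedy is to publish deterministic splits alongside the dataset, e.g., by fixing the shuffle seed. The sketch below is illustrative (the 6:2:2 ratio mirrors our CVGD setup; the function name and defaults are ours):

```python
import random

def fixed_split(image_ids, seed=42, ratios=(0.6, 0.2, 0.2)):
    """Deterministic train/val/test split so results are comparable.

    Publishing the seed and ratios (or the split files themselves) lets
    other groups evaluate on exactly the same partitions.
    """
    ids = sorted(image_ids)            # canonical order before shuffling
    random.Random(seed).shuffle(ids)   # seeded, hence reproducible
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```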
The above-mentioned issues could be overcome if we provide the dataset splits and evaluation protocol when constructing new RSIR datasets, as is common practice in the CV domain.

E. Existence of Semantic Gap
Most of the existing RSIR methods operate at the feature level, relying on visual features to compare the similarity between RS images. However, the results cannot well reflect users' real query intentions due to the "semantic gap" between visual features, in particular low-level features, and the richness of human semantics [209]. As the current mainstream technique for RSIR, deep learning is able to extract high-level features containing some semantic information, but the "semantic gap" problem is still not well explored.
To reduce the effect of the "semantic gap," a potential solution is to combine deep learning with traditional techniques, such as relevance feedback, to learn users' intentions.
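Relevance feedback can be wired into a feature-level CBRSIR system with a Rocchio-style update: after the user marks retrieved images as relevant or irrelevant, the query feature is shifted toward the relevant ones and away from the irrelevant ones before the next search round. A minimal sketch (the weights are the classic Rocchio defaults, not values tuned for RS):

```python
import numpy as np

def rocchio_refine(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query refinement from user feedback.

    Moves the query feature toward the mean of images marked relevant
    and away from the mean of those marked irrelevant.
    """
    new_q = alpha * query
    if len(relevant):
        new_q = new_q + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        new_q = new_q - gamma * np.mean(irrelevant, axis=0)
    return new_q

q = np.array([1.0, 0.0])          # initial query feature
rel = np.array([[0.0, 2.0]])      # feature(s) the user marked relevant
irr = np.array([[2.0, 0.0]])      # feature(s) the user marked irrelevant
refined = rocchio_refine(q, rel, irr)
```

The refined query is then re-submitted to the same feature-level ranking, closing the feedback loop.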

VI. FUTURE DIRECTIONS FOR RSIR
RSIR is an effective technique for organizing and managing large RS image archives, and the RS community has committed great efforts to promoting RSIR over the past decade. Thanks to deep learning, the literature has achieved significant progress in terms of new RSIR methods and benchmark datasets. However, some RSIR issues still need to be addressed. In this section, we therefore point out several potential directions for RSIR.

A. Constructing Larger and More Challenging Datasets
Deep learning has become the mainstream technique for RSIR. Training a successful CNN requires a large volume of labeled images; however, the existing RSIR datasets (as shown in Table II) are still not large enough to train CNNs from scratch. Besides, RS images generally have varied resolutions, complexities, and modalities in real RSIR scenarios. These factors should therefore be taken into account when constructing RSIR datasets so that they reflect real RSIR scenarios.

B. Few-and Zero-Shot Learning (FSL and ZSL) for RSIR
Current RSIR methods focus on developing novel supervised algorithms, in particular CNN-based methods, whose prerequisite is a large-scale labeled image archive. However, it is time-consuming and laborious to annotate a huge volume of RS images. FSL is a type of machine learning in which the training set contains only a limited number of labeled samples, and ZSL requires no labeled samples at all; both have become commonly used techniques for RS tasks such as scene classification [210], [211], [212]. FSL and ZSL provide the RS community with a promising direction for developing effective RSIR methods without large-scale labeled images.
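As one flavor of how FSL could plug into RSIR, a prototypical-network-style scheme builds one prototype per class from a few labeled support images and assigns queries to the nearest prototype. This is a hypothetical sketch under the assumption that feature extraction is already done; the names and toy 2-D "features" are ours:

```python
import numpy as np

def class_prototypes(support_feats, support_labels):
    """Mean feature (prototype) per class from a few labeled support samples."""
    classes = sorted(set(support_labels))
    protos = np.stack([
        np.mean([f for f, y in zip(support_feats, support_labels) if y == c], axis=0)
        for c in classes
    ])
    return protos, classes

def nearest_prototype(query_feat, protos, classes):
    """Label the query with the class of the closest prototype (Euclidean)."""
    dists = np.linalg.norm(protos - query_feat, axis=1)
    return classes[int(np.argmin(dists))]

# Two classes with two labeled support samples each.
support = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = ["airport", "airport", "harbor", "harbor"]
protos, classes = class_prototypes(support, labels)
```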

C. Developing Novel MLRSIR and CMRSIR Methods
MLRSIR and CMRSIR are still new topics and have some advantages over SLRSIR. For example, MLRSIR is able to perform fine-grained retrieval for users, which is not possible with SLRSIR. Consider a situation in which one intends to search for an outdoor basketball court with a parking lot and a restaurant nearby. MLRSIR conducts retrieval between multilabel images and is thus well suited to this task. Regarding CMRSIR, the text-image and audio-image CMRSIR methods are friendly to users without expert knowledge. Therefore, developing novel MLRSIR and CMRSIR methods will promote the application of RSIR.
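The basketball-court scenario above can be served by ranking archive images by the overlap between their label sets and the query's labels; a minimal sketch using Jaccard similarity (the label names and scoring choice are illustrative, not a surveyed method):

```python
def multilabel_score(query_labels, image_labels):
    """Jaccard overlap between the query's label set and an image's labels."""
    q, i = set(query_labels), set(image_labels)
    return len(q & i) / len(q | i)

# Query: basketball court with a parking lot and restaurant nearby.
query = {"basketball_court", "parking_lot", "restaurant"}
archive = {
    "img1": {"basketball_court", "parking_lot", "restaurant", "trees"},
    "img2": {"basketball_court", "trees"},
    "img3": {"harbor", "ships"},
}
ranked = sorted(archive, key=lambda k: multilabel_score(query, archive[k]),
                reverse=True)
```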

D. Incremental Learning for RSIR
Current RSIR methods are trained and evaluated on static RS datasets and are thus not suited to incremental scenarios [213]. Specifically, most RSIR methods assume that the trained model has seen all the image categories, which is not the case in real-world applications, as new RS images are constantly emerging. This is also the reason why RSIR methods generally achieve worse performance when transferred to unseen images. Therefore, one potential direction is to develop incremental learning methods for RSIR that can deal with incremental streams of new RS images.

E. Hashing Methods for Large-Scale RSIR
In practical RSIR applications, we face large volumes of RS images. RSIR in a large-scale scenario is challenging mainly in two respects. On the one hand, more storage space is needed to store RS images and features. On the other hand, it is time-consuming to perform a query over a large-scale RS archive. Hashing methods are able to generate compact binary codes for RSIR, which not only save storage cost but also enable highly efficient retrieval, providing a promising direction for large-scale RSIR. The key is to balance the tradeoff between efficiency and accuracy.
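The idea behind hashing-based retrieval can be illustrated with the simplest variant, sign-of-random-projection (LSH) codes; learned deep hashing methods such as SIDHCNNs replace the random hyperplanes with trained ones, but the Hamming-distance search is the same. This sketch is ours, not any surveyed method:

```python
import numpy as np

def hash_codes(feats, n_bits=16, seed=0):
    """LSH-style binary codes: sign of random hyperplane projections."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((feats.shape[1], n_bits))
    return (feats @ planes > 0).astype(np.uint8)

def hamming_rank(query_code, archive_codes):
    """Rank archive items by Hamming distance to the query code."""
    dists = np.count_nonzero(archive_codes != query_code, axis=1)
    return np.argsort(dists)

# Two nearly identical features and one opposite feature.
feats = np.array([[1.0, 0.0], [0.99, 0.01], [-1.0, 0.0]])
codes = hash_codes(feats)
order = hamming_rank(codes[0], codes)   # archive ranked for the first item
```

Bit comparisons and compact `uint8` storage are what make the query step cheap at archive scale.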

VII. CONCLUSION
As one of the active research topics in the RS community, RSIR has seen great improvements in terms of methods and benchmark datasets over the past decade. In this article, we therefore provide a comprehensive and systematic survey of the recent achievements of RSIR and discuss its challenges and potential future directions. To be specific, we first group the existing RSIR methods into hierarchical categories and review the related works from five RSIR categories, including SLRSIR, MLRSIR, RASRSIR, CMRSIR, and CVRSIR. Then, we present the benchmark datasets for each RSIR category. To promote CVRSIR, we proposed an effective method based on similarity learning and constructed a new ground-drone dataset for performance evaluation. Besides, we compared the performance of representative RSIR methods of each category on some benchmark datasets. Finally, we discussed the challenges and potential directions for RSIR.

Mahmoud R. Delavar received the B.Sc. degree in civil engineering and surveying from Khajeh Nasir Toosi University, Tehran, Iran, in 1988, the M.Sc. degree in civil engineering photogrammetry and remote sensing from the University of Roorkee (currently IIT Roorkee), Roorkee, India, in 1992, and the Ph.D. degree in geomatic engineering and geographic information system (GIS) from the University of New South Wales, Sydney, NSW, Australia, in 1997.
In 1998, he joined the College of Engineering, University of Tehran, Tehran, Iran, where he is currently a Full Professor of GIS and the Director of the Center of Excellence in Geomatic Engineering in Disaster Management, School of Surveying and Geospatial Engineering. He founded the Iranian Society of Surveying and Geomatic Engineering in 2001. He has been the national representative of the International Society of Urban Data Management since 2006, was the Scientific Secretary of the International Society for Photogrammetry and Remote Sensing WG II/4 (uncertainty modeling and quality control for spatial data) in 2008-2012, and chaired ISPRS WG IV/3 (spatial data analysis, statistics, and uncertainty modeling) in 2016-2022. He is the representative of the University of Tehran in the International Geomatic network for networks. He is on the editorial board of a number of national and international scientific journals, such as ISPRS International Journal of Geo-Information, Spatial Statistics (Ex. Editorial Board), and International Journal of Geo-spatial Information Science. He has authored or coauthored more than 373 papers in national and international conferences and scientific journals. He has supervised 95 M.Sc. students, 13 Ph.D. students, and 1 postdoctoral researcher so far. His research interests include spatial data quality and uncertainty modeling, spatiotemporal GIS, disaster management, smart cities, land administration, SDI, spatial data fusion, spatiotemporal data mining, spatial data science, spatial big data, urban growth modeling, land use and land cover change modeling, remote sensing, and GIS integration.