Introduction
Visual localization is an important task in remote sensing [1], place recognition [2], and aircraft navigation [3]; it is achieved by estimating the correspondence between query images and a reference database. This task is typically addressed as an image retrieval problem based on visual similarity. While existing image-based retrieval methods have shown promising performance in scenarios where images are captured by optical cameras, they rely heavily on the assumption that optical images can always reliably capture the necessary information. However, this assumption may not hold in challenging conditions such as low-light environments or adverse weather. Hence, it becomes imperative to explore alternative and more robust information sources that can support stable image correspondence even in such challenging scenarios.
Synthetic aperture radar (SAR) offers stable imaging during both day and night, making it robust to illumination changes and variable weather in remote sensing. At the same time, optical satellite images remain the most popular and accessible archive and can serve as the reference database for localization. Combining the imaging stability of SAR with the accessibility of optical satellite image archives makes SAR-Optical patch correspondence a task with great potential for visual localization. However, SAR-Optical patch correspondence remains an underexplored research area. Optical sensors capture images by detecting reflected sunlight, whereas SAR sensors produce images by detecting the backscattered returns of microwave signals. The two types of images differ in radiometry, noise level, and imaging geometry, resulting in distinct modality discrepancies. Nevertheless, images of the same target/scene captured by different sensors should inherently carry consistent semantic information, which can be extracted by deep networks as modal-invariant features. Although existing methods [4], [5], [6] proposed for cross-modal retrieval have succeeded in identifying the category of query images, they primarily treat retrieval as an image classification task, which cannot distinguish between different places within the same category and thus cannot meet the requirement of patch correspondence. Therefore, new methodologies that go beyond the existing retrieval task are needed to fully unlock the potential of SAR-Optical patch correspondence.
Concretely, the two main challenges of this task are as follows.
Poor instance discriminability: The patch correspondence requires optimal matching between the query and only one target. However, similar visual features lack instance discriminability, posing the challenge of identifying the best correspondence from the retrieved candidates, as shown in query1 of Fig. 1. This limitation can lead to inaccurate and unreliable results in scenarios where the retrieved candidates have a high degree of similarity.
Distinct feature distribution: The fundamental differences in the imaging principles of SAR and optical modalities lead to variations in the appearance and structure of the same object across the two modalities, which makes it difficult to find a common representation for cross-modal features. As shown in query2 of Fig. 1, the negative sample appears more similar than the true positive in the embedding space due to the distinct feature distributions. This poses a challenge for learning-based methods that rely on feature distance metrics. Overcoming it requires techniques that can efficiently model and bridge the feature distribution gap between SAR and optical images.
Challenges of the CMPC task. Query1 illustrates challenge 1: the poor discriminability of the retrieved candidates in one-to-one correspondence. Query2 illustrates challenge 2: the modal discrepancy leads to distinct feature distributions.
Matching images for localization involves extracting feature descriptors from the images and computing similarity metrics to find the best correspondence. Over the last decades, feature description methods have been developed and proven helpful for retrieval between ground-to-satellite optical images [7], [8], [9], UAV-satellite optical images [10], [11], and cross-time place recognition [12], [13]. These methods are quite effective when the query and reference images are both captured by optical sensors. However, their limitations become evident when facing inconsistent feature distributions across modalities, owing to the large differences in imaging principles between SAR and optical images.
To address this challenge, several cross-modal retrieval methods [4], [14], [15], [16], [17] have been proposed. While these methods have shown remarkable performance in retrieving images across modalities within scenes of specific categories, they retrieve multiple similar category-level candidates rather than a unique ground instance. This poses a challenge for applications that require precise localization and identification of specific objects. Since GPS information is available in the reference archive, it offers a way to refine the coarse retrieval to the optimal correspondence by leveraging the location information of the reference candidates. This strategy can provide helpful spatial information to address the limited instance discriminability of category-level methods.
To overcome the aforementioned difficulties of SAR-Optical patch correspondence, we propose a coarse-to-fine correspondence scheme to explore the feasibility of instance-level cross-modal patch correspondence (CMPC). The proposed scheme comprises a cross-modal coarse search module and a refinement module. The coarse search module adopts adversarial learning to narrow the modal gap and extract modal-invariant features to retrieve the candidates. The refinement module turns the embedding features and the candidates' GPS information into a graph representation and then selects the optimal correspondence by updating the graph via an attention message propagation. To evaluate the performance of our proposed scheme, we also construct three SAR-Optical patch correspondence datasets.
In summary, our contributions are listed as follows.
We introduce a coarse-to-fine scheme for SAR-Optical remote sensing CMPC to find the optimal correspondence between SAR and optical images.
We explicitly model the cross-modal feature distribution gap as a Wasserstein distance and propose a cross-modal adversarial learning strategy to learn modal-invariant features.
We propose a graph representation that incorporates the visual feature and spatial information to improve the discriminability of the retrieved candidates and refine the coarse retrieval to optimal correspondence.
We construct three datasets to evaluate the feasibility of various methods on the CMPC task and on localization applications. Our proposed scheme achieves state-of-the-art results on these datasets.
The rest of this article is structured as follows. Section II offers a succinct survey of the related works. In Section III, we provide a comprehensive exposition of our proposed scheme, including the overview of the scheme, the cross-modal coarse search module, and the refinement module. Section IV presents and analyzes the experimental results. Section V discusses the limitations of the proposed scheme, as well as possible avenues for future research. Finally, Section VI concludes this article.
Related Work
In this section, we review recent progress in image-based retrieval, cross-modal category-level retrieval, and cross-modal instance-level retrieval.
A. Image-Based Retrieval
The task of image retrieval involves finding relevant images in a database given a query image [18], and it has received significant attention from the research community in recent years. Since deep learning has been widely used to extract robust image features, Gong et al. [19] showed that a convolutional neural network (CNN) can effectively embed images into global features for retrieval. Rather than directly employing an off-the-shelf model, Noh et al. [20] designed an attention module on top of the vanilla model to strengthen the local features of the image. To leverage both local and global information, Song et al. [21] combined local and global features to align different images and further improve retrieval effectiveness. To guide the network toward embedding discriminative features, a large number of metric learning methods have been proposed to regularize the distance between positive and negative samples. The core idea of these loss functions is to reduce the feature distance between positive samples and to enlarge the feature distance between negative samples. Wen et al. [22] proposed the center loss, which learns a feature center for each target class during training. Schroff et al. [23] proposed a triplet loss that guides the network to learn an embedding in which positive samples lie closer to the anchor than negative samples. Since hard samples deteriorate the performance of the vanilla triplet loss, Hermans et al. [24] introduced a hard-case mining strategy to make the triplet loss focus on the challenging samples. Sun et al. [25] unified the classification-based and distance-based loss functions to improve retrieval effectiveness. However, these methods suffer from the domain gap when applied to cross-modal retrieval tasks.
B. Cross-Modal Retrieval
Cross-modal image retrieval measures the similarity between images involving more than one modality. Due to the large visual appearance changes between images from different types of sensors, hand-crafted feature descriptor methods [26], [27], [28] hit a bottleneck in the development of cross-modal image retrieval. Benefiting from the development of deep learning, recent works [29], [30] focus on learning modal-invariant features for both query and reference images from different modalities to improve matching performance. To bridge the modal gap, Khokhlova et al. [31] adopted a Siamese network to extract modal-invariant descriptors of multimodal images. In addition to extracting modal-shared features, Liu et al. [13] proposed a separation network to extract modal-exclusive features of images from different domains. Ye et al. [32] employed a channel exchange strategy to convert the RGB image into a single-channel infrared-like image and reduce the color discrepancy between the two modalities. Jing et al. [12] improved a cross-modal center loss via a multilayer perceptron (MLP) that maps the features of different modalities into a mutual metric space. Huang et al. [33] observed that the positional relationships of regions are stable across modalities and aligned the positional features to improve cross-modal matching accuracy. For remote sensing sources, Li et al. [14] first proposed a cross-modal remote sensing category-level image retrieval dataset and employed a CNN to classify panchromatic and multispectral images. The hash network of [4] first addressed the SAR-Optical category-level retrieval task by transforming the paired images to train the embedding network. However, these works focus on feature representations for classification, which cannot discriminate between instances, and thus are not suitable for instance-level retrieval.
C. Instance-Level Correspondence
Instance-level correspondence aims to identify specific instances from images, such as localizing a street-view image on a satellite map. The main goal of these works is to propose accurate metric learning techniques to discriminate between instances. Several works [34], [35] focus on designing learning strategies for mining the discrimination between instances. VIGOR [11] trained a network with the help of the neighbor patches of the target patches to estimate the image correspondence. For the situation in which query images come from a new category, Yang et al. [36] showed that mapping the images into a uniform space would distort the manifolds of unseen classes, and therefore designed a graph scheme to represent the feature space. With the help of the Transformer [37], Tan et al. [38] mined the relationship between retrieved candidates by patch-based attention to rerank the retrieval results. In the SAR-Optical correspondence task, Hughes et al. [39] designed a pseudo-Siamese CNN to verify established SAR-Optical patch correspondences. However, the discrepancy between SAR and optical images is too large for these methods to establish an optimal correspondence, leading to unsatisfactory performance.
Methodology
We propose a coarse-to-fine scheme to solve the task of SAR-Optical patch correspondence. The cross-modal coarse search module can be viewed as a cross-modal retrieval task, and the refinement module can be viewed as an inlier estimation task. In this section, we present the overview of the proposed scheme, followed by a detailed description of the cross-modal coarse search module and the refinement module. Fig. 2 shows the overall flowchart of the proposed scheme.
A. Overview
In the retrieval step, deep cross-modal methods mainly reduce the impact of radiometric differences and speckle noise by dedicated network structures. However, designing a specific network module would increase the model's complexity and reduce the model's robustness. Therefore, we propose to suppress these impacts in the training strategy without designing additional network modules.
First of all, we adopt random channel exchange transformation and image normalization as data augmentation. The random channel exchange forces the network to focus on the contour and texture information shared by optical and SAR images. Moreover, image normalization mitigates the impact of speckle noise by narrowing the dynamic range of the images, making them more visually interpretable and suitable for subsequent processing.
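To make the augmentation concrete, the following is a minimal sketch in PyTorch, assuming 3-channel optical patches stored as CHW tensors; the function names and the exchange probability are ours, not the paper's.

```python
import torch

def random_channel_exchange(img: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Randomly permute the channels of a 3-channel optical patch.

    Seeing shuffled colors pushes the network toward the contour and
    texture cues that optical and (single-channel) SAR images share.
    """
    if img.dim() == 3 and img.size(0) == 3 and torch.rand(1).item() < p:
        img = img[torch.randperm(3)]
    return img

def normalize_patch(img: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-image normalization; narrowing the dynamic range damps the
    influence of SAR speckle noise on the embedding."""
    return (img - img.mean()) / (img.std() + eps)
```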
Second, we propose a cross-modal training strategy (see details in Section III-B) to guide the cross-modal embedding network
\begin{equation*}
\boldsymbol{x}_{i} = f_{\text{emb}}(I_{i}) \tag{1}
\end{equation*}
which maps an input image $I_{i}$ into an embedding feature $\boldsymbol{x}_{i}$. During inference, the query embedding $\boldsymbol{x}^{\text{que}}$ is compared with the reference embeddings $\boldsymbol{X}^{\text{ref}}$, and the $K_{n}$ nearest candidates are retrieved as
\begin{equation*}
D = \lbrace k \mid k\leq K_{n}, \boldsymbol{x}^{\text{ref}}_{k} \in \text{sort}(d(\boldsymbol{X}^{\text{ref}}, \boldsymbol{x}^{\text{que}}))\rbrace \tag{2}
\end{equation*}
where $d(\cdot,\cdot)$ denotes the feature distance and $\text{sort}(\cdot)$ ranks the references in ascending distance order.
Due to the poor instance discriminability between the query and reference images, the top retrieved candidate might not be the optimal correspondence for the query. Considering that the reference patches in the optical database typically carry correct GPS locations, we can leverage this location information to increase the discriminability of the retrieved candidates. Therefore, a refinement module (see details in Section III-C) is employed to address this issue and improve the initial retrieval results. Practically, we propose a graph representation
\begin{equation*}
\hat{\boldsymbol{y}} = f_{\text{fine}} (G(\boldsymbol{x}^{\text{que}}, \boldsymbol{X}^{\text{ref}}, \boldsymbol{P}^{\text{ref}})) \tag{3}
\end{equation*}
where $G(\cdot)$ constructs a graph from the query feature $\boldsymbol{x}^{\text{que}}$, the candidate features $\boldsymbol{X}^{\text{ref}}$, and the candidates' GPS positions $\boldsymbol{P}^{\text{ref}}$, and $f_{\text{fine}}$ predicts the inlier scores $\hat{\boldsymbol{y}}$. The final correspondence is the candidate with the highest score
\begin{equation*}
I^{\text{opt}}_{\text{match}} = \lbrace I_{i} \mid \arg \max_{i} \hat{y}_{i}, i\in D\rbrace. \tag{4}
\end{equation*}
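The following sketch makes the inference path of (1)-(4) concrete. It is illustrative only: `f_emb`, `build_graph`, and `f_fine` stand in for the embedding network, the graph construction $G(\cdot)$, and the refinement network, and their exact interfaces are our assumptions.

```python
import torch

@torch.no_grad()
def coarse_to_fine_match(f_emb, build_graph, f_fine,
                         query_img, ref_feats, ref_gps, k_n: int = 10) -> int:
    """Return the index of the optimal reference patch for one SAR query."""
    x_que = f_emb(query_img.unsqueeze(0)).squeeze(0)          # (1) embed the query
    dists = torch.cdist(x_que.unsqueeze(0), ref_feats)[0]     # distances to all references
    cand = torch.topk(dists, k_n, largest=False).indices      # (2) top-K_n candidate set D
    nodes, edges = build_graph(x_que, ref_feats[cand], ref_gps[cand])  # G(.)
    scores = f_fine(nodes, edges)                             # (3) inlier scores y_hat
    return cand[scores.argmax()].item()                       # (4) best-scoring candidate
```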
B. Cross-Modal Coarse Search Module
To overcome the distinct modal discrepancy and extract modal-invariant features, we train the CNN with a Wasserstein adversarial learning strategy, combined with a hard-mining triplet loss and a feature projector, aiming to directly narrow the modal gap and learn from hard cross-modal samples. The coarse search module is shown in Fig. 3. The network weights are shared between the SAR and optical image embeddings to enable the extraction of mutual information from SAR and optical images.
Pipeline of the cross-modal coarse search module in the training and testing phases. The training strategies are shown in the dashed red box, and the testing phase (coarse retrieval inference) is shown in the solid green box.
1) Wasserstein Adversarial Training
In the cross-modal feature embedding, the significant disparity between two modalities results in differences between feature distributions, causing instability in cross-modal feature similarity measurement.
To address the challenge of the modal discrepancy in cross-modal feature representations, it is essential to model and reduce the gap explicitly. Besides employing the same shared network to extract mutual information from SAR and optical images, we employ an adversarial discriminator to minimize the distance between the features extracted from the two modalities. A traditional classification discriminator only differentiates the modality to which a feature belongs and does not measure the feature discrepancy. Instead, we introduce a Wasserstein discriminator to directly estimate the discrepancy between modalities. As cross-modal features belong to the distributions of their respective modalities, the Wasserstein distance can represent the discrepancy between the modal distributions by solving the earth mover's problem. Therefore, we employ the 1-D Wasserstein distance to explicitly model the cross-modal gap and introduce Wasserstein adversarial learning to minimize the discrepancy between the modalities.
The 1-D Wasserstein distance between distributions $\mathscr{P}_{s}$ and $\mathscr{P}_{t}$ is defined as
\begin{equation*}
W_{1}(\mathscr{P}_{s}, \mathscr{P}_{t}) = \inf_{\gamma \in \Pi (\mathscr{P}_{s}, \mathscr{P}_{t})} \mathbb{E}_{(x,y)\sim \gamma }[\Vert x-y\Vert] \tag{5}
\end{equation*}
where $\Pi(\mathscr{P}_{s}, \mathscr{P}_{t})$ denotes the set of all joint distributions $\gamma$ whose marginals are $\mathscr{P}_{s}$ and $\mathscr{P}_{t}$.
In practice, we adopt the Kantorovich–Rubinstein duality [40] to approximate the original optimal transport problem (5), which avoids solving the bipartite matching problem iteratively
\begin{equation*}
W_{1}(\mathscr{P}_{s}, \mathscr{P}_{t}) = \sup_{\Vert f_{w}\Vert_{L} \leq 1} \mathbb{E}_{x\sim \mathscr{P}_{s}}[f_{w}(x)] - \mathbb{E}_{y\sim \mathscr{P}_{t}}[f_{w}(y)] \tag{6}
\end{equation*}
where $f_{w}$ is a 1-Lipschitz function realized by the discriminator network.
In our case, the two distributions are the SAR feature set $\mathcal{X}_{\text{sar}}$ and the optical feature set $\mathcal{X}_{\text{opt}}$, and the discriminator $f_{w}$ is trained by minimizing
\begin{align*}
L_{\text{dis}}(\boldsymbol{x}) & = - W_{1}(\mathcal{X}_{\text{sar}}, \mathcal{X}_{\text{opt}}) \\
& = \sum_{\boldsymbol{x}_{j} \in \mathcal{X}_{\text{opt}}} f_{w}(\boldsymbol{x}_{j}) - \sum_{\boldsymbol{x}_{i} \in \mathcal{X}_{\text{sar}}} f_{w}(\boldsymbol{x}_{i}). \tag{7}
\end{align*}
After the discriminator has been updated to estimate the modal discrepancy, the embedding network is trained adversarially to minimize it through
\begin{equation*}
L_{w}(\boldsymbol{x}) = W_{1}(\mathcal{X}_{\text{sar}}, \mathcal{X}_{\text{opt}}) = - L_{\text{dis}}(\boldsymbol{x}). \tag{8}
\end{equation*}
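A minimal sketch of the two adversarial objectives is given below. The critic architecture and the weight-clipping step are our assumptions: clipping is one common way to enforce the Lipschitz constraint $\Vert f_{w}\Vert_{L}\leq 1$ of (6), and the paper does not specify the mechanism it uses.

```python
import torch
import torch.nn as nn

class WassersteinCritic(nn.Module):
    """Scalar critic f_w approximating the dual form (6)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim // 2), nn.ReLU(),
                                 nn.Linear(dim // 2, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def critic_loss(f_w, x_sar, x_opt):
    # L_dis = -W1, per (7) (batch-averaged): the critic maximizes the gap estimate.
    return f_w(x_opt).mean() - f_w(x_sar).mean()

def adversarial_loss(f_w, x_sar, x_opt):
    # L_w = W1 = -L_dis, per (8): the embedding network minimizes the gap.
    return f_w(x_sar).mean() - f_w(x_opt).mean()

def clip_critic(f_w, c: float = 0.01):
    # One way (assumed) to keep f_w approximately 1-Lipschitz, as in WGAN clipping.
    for p in f_w.parameters():
        p.data.clamp_(-c, c)
```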
2) Hard Mining Triplet Loss
The distinct feature distribution across modalities can also lead to ambiguity between positive and negative samples. This issue may pose a challenge for the network to distinguish between the hard negative samples and positive samples. To address this issue, we employ the hard mining triplet loss, which selects only the hardest negative samples for each anchor sample. By focusing on the hardest negative sample, the model is forced to embed the discriminative features that can better distinguish between positive and negative samples in challenging cross-modal scenarios. In addition, the hard mining triplet loss reduces the computational complexity of the training process by eliminating common negative samples that are less informative.
Specifically, a training batch typically consists of $B$ matched SAR-optical pairs, i.e., $2B$ images in total. Taking each image as an anchor $\boldsymbol{x}_{a}^{i}$, its matched cross-modal sample as the positive $\boldsymbol{x}_{p}^{i}$, and the hardest (closest) nonmatching sample in the batch as the negative $\boldsymbol{x}_{n}^{i}$, the loss is defined as
\begin{equation*}
L_{\text{tri}} = \frac{1}{2B} \sum_{i=1}^{2B} \max (\Vert\boldsymbol{x}_{a}^{i} - \boldsymbol{x}_{p}^{i}\Vert^{2}_{2} - \Vert\boldsymbol{x}_{a}^{i} - \boldsymbol{x}_{n}^{i}\Vert^{2}_{2} + \beta, 0) \tag{9}
\end{equation*}
where $\beta$ is the margin.
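Below is a hedged sketch of the batch-hard mining of (9), assuming each batch holds $B$ matched SAR-optical pairs with the matched cross-modal sample as positive; the margin value is illustrative.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet(x_sar: torch.Tensor, x_opt: torch.Tensor,
                       margin: float = 0.3) -> torch.Tensor:
    """Hard-mining triplet over 2B anchors, per (9)."""
    b = x_sar.size(0)
    x = torch.cat([x_sar, x_opt], dim=0)              # 2B embeddings
    labels = torch.arange(b).repeat(2)                # pair-identity labels
    d = torch.cdist(x, x).pow(2)                      # squared L2 distances
    idx = torch.arange(b)
    # Positive: the anchor's matched cross-modal sample.
    pos = torch.cat([d[idx, idx + b], d[idx + b, idx]])
    # Hardest negative: closest sample with a different pair identity.
    neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)
    hardest_neg = d.masked_fill(~neg_mask, float('inf')).min(dim=1).values
    return F.relu(pos - hardest_neg + margin).mean()  # average over 2B anchors
```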
3) Feature Projector
The distinct feature discrepancy between SAR and optical images reduces the network's robustness and makes it prone to overfitting. Therefore, a learnable feature projector is appended to the embedding network to map the features onto a normalized metric space
\begin{equation*}
f_{p}(\boldsymbol{x}) = \frac{\text{Norm}(\boldsymbol{W}\boldsymbol{x})}{\Vert\text{Norm}(\boldsymbol{W}\boldsymbol{x})\Vert_{2}},\quad \boldsymbol{W} \in \mathbb{R}^{d \times d^{\prime}} \tag{10}
\end{equation*}
where $\boldsymbol{W}$ is a learnable projection matrix and $\text{Norm}(\cdot)$ denotes a normalization layer.
Our final loss function for the embedding network is
\begin{equation*}
L_{\text{emb}} = L_{\text{tri}}(f_{p}(\boldsymbol{x})) + \lambda L_{w}(f_{p}(\boldsymbol{x})) \tag{11}
\end{equation*}
where $\lambda$ is a weighting hyperparameter balancing the triplet loss and the adversarial loss.
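Putting the pieces together, a sketch of the projector (10) and the combined objective (11) follows; reading Norm as batch normalization is our assumption, and `batch_hard_triplet` / `adversarial_loss` refer to the sketches above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """f_p of (10): linear map W, normalization, then L2 unit-sphere projection."""
    def __init__(self, d: int, d_proj: int):
        super().__init__()
        self.proj = nn.Linear(d, d_proj, bias=False)  # W in R^{d x d'}
        self.norm = nn.BatchNorm1d(d_proj)            # assumed reading of Norm(.)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.norm(self.proj(x)), dim=-1)

def embedding_loss(x_sar, x_opt, projector, f_w, lam: float = 0.1):
    """L_emb = L_tri(f_p(x)) + lambda * L_w(f_p(x)), per (11)."""
    z_sar, z_opt = projector(x_sar), projector(x_opt)
    return batch_hard_triplet(z_sar, z_opt) + lam * adversarial_loss(f_w, z_sar, z_opt)
```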
C. Correspondence Refinement Module
Benefiting from the modal-invariant feature embedding, the features can overcome the modal discrepancy and retrieve candidates belonging to the same category. However, a visual feature cannot be clearly associated with a single instance within the same category, due to the poor discriminability of visual and texture features. Considering that the GPS tag of each reference patch is provided and represents the distinct location of the retrieved candidates, we can mine the positional relationships between the retrieved candidates to obtain the final match. We also concatenate the embedded features of both the query and the reference as node features to mine their mutual information.
Specifically, for each query image and its top-$K_{n}$ retrieved candidates, we construct a graph in which each node represents a query-candidate matching pair and the edges encode the spatial relationships between the candidates' GPS locations.
Proposed refinement module. The graph representation combines the information of feature pairs and reference locations. The attention network updates the nodes to predict the inlier as the final optimal correspondence between the query and the references.
1) Graph Representation for Matching Pairs
The coarse search module produces a set of top-$K_{n}$ candidates for each query. We concatenate the query feature with each candidate feature and fuse them through a learnable layer $C_{1}$ with an activation $\sigma$ to obtain the initial node features
\begin{equation*}
\boldsymbol{x}_{k}^{0} = \sigma (C_{1}(\boldsymbol{x}^{\text{que}}\Vert\boldsymbol{x}_{k}^{\text{opt}})),\quad k\in D \tag{12}
\end{equation*}
where $\Vert$ denotes concatenation.
After aggregating the matching pairs into input features for the refinement module, the node with the highest score represents the optimal correspondence among the retrieved candidates. Although the queries do not contain location information, the reference patch database provides correct GPS locations. Therefore, we design the graph representation to take advantage of the reference GPS locations with their rich geometric information. We extract the position information from the retrieved candidates' GPS tags and set it as the geometric position of the node. We calculate the distances between nodes and sort them in ascending order. The $K_{e}$ nearest nodes of node $k$ form its neighborhood $\mathcal{N}(k)$, and the edge weight between nodes $k$ and $l$ is defined as
\begin{equation*}
e_{kl} = \left\lbrace \begin{array}{ll}
\exp \left(\frac{-\Vert\boldsymbol{p}_{k}-\boldsymbol{p}_{l}\Vert^{2}_{2}}{\sigma_{e}}\right), & l \in \mathcal{N}(k)\\
0, & \text{otherwise} \end{array}\right. \tag{13}
\end{equation*}
where $\boldsymbol{p}_{k}$ denotes the GPS position of node $k$ and $\sigma_{e}$ is a scale parameter.
After defining the node features and edge features, a graph representation can be built for the retrieval results, which transforms the refinement problem into an inlier selection problem.
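The graph construction of (12) and (13) can be sketched as follows; taking $\sigma$ as ReLU, treating $C_{1}$ as a learnable fusion layer passed in by the caller, and the default $K_{e}$ and $\sigma_{e}$ values are our assumptions.

```python
import torch

def build_graph(x_que, x_cand, p_cand, c1, k_e: int = 4, sigma_e: float = 1.0):
    """Nodes fuse (query, candidate) pairs per (12); edges link each node to
    its K_e nearest candidates in GPS space with Gaussian weights per (13)."""
    k_n = x_cand.size(0)
    pairs = torch.cat([x_que.unsqueeze(0).expand(k_n, -1), x_cand], dim=-1)
    nodes = torch.relu(c1(pairs))                         # (12), sigma = ReLU (assumed)
    d2 = torch.cdist(p_cand, p_cand).pow(2)               # squared GPS distances
    knn = d2.argsort(dim=1)[:, 1:k_e + 1]                 # K_e nearest, skipping self
    edges = torch.zeros(k_n, k_n)
    rows = torch.arange(k_n).unsqueeze(1).expand(-1, k_e)
    edges[rows, knn] = torch.exp(-d2[rows, knn] / sigma_e)  # (13)
    return nodes, edges
```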
2) Graph Attention Network
The ultimate goal of SAR-Optical patch correspondence is to find the best correspondence prediction from the reference patches for every query. After transforming the matching pairs into a graph representation with nodes $\boldsymbol{X}$ and edges $\boldsymbol{E}$, we employ a graph attention network to predict the inlier scores
\begin{equation*}
\hat{\boldsymbol{y}} = f_{\text{gnn}}(\boldsymbol{X}, \boldsymbol{E}). \tag{14}
\end{equation*}
In each attention layer, the node features are first projected into queries, keys, and values
\begin{align*}
\boldsymbol{q}_{k}&= \boldsymbol{W}_{q} \boldsymbol{x}_{k} \\
\boldsymbol{k}_{l}&= \boldsymbol{W}_{k} \boldsymbol{x}_{l} \\
\boldsymbol{v}_{l}&= \boldsymbol{W}_{v} \boldsymbol{x}_{l}. \tag{15}
\end{align*}
The attention weight between node $k$ and its neighbor $l$ is modulated by the edge weight $e_{kl}$
\begin{equation*}
\alpha_{kl} = \text{softmax}(e_{kl} \boldsymbol{q}_{k} \boldsymbol{k}_{l}^\top). \tag{16}
\end{equation*}
Each node is then updated by aggregating the messages from its neighbors
\begin{equation*}
\boldsymbol{x}^{t+1}_{k} = f_\theta \left(\boldsymbol{x}^{t}_{k} + \sum_{l\in \mathcal{N}(k)}\alpha_{kl} \boldsymbol{v}_{l}\right) \tag{17}
\end{equation*}
where $f_\theta$ is a learnable update function and $t$ indexes the layers. After $T$ updates, a linear head maps the final node features to the inlier scores
\begin{equation*}
\hat{y} = W_{f} \boldsymbol{x}^{T}. \tag{18}
\end{equation*}
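A compact sketch of (15)-(18) is shown below, assuming a single attention head and a small MLP for $f_\theta$; the layer count and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EdgeWeightedAttention(nn.Module):
    """One message-passing layer: edge weights e_kl gate the attention
    logits (16) and f_theta updates the aggregated nodes (17)."""
    def __init__(self, d: int):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)   # (15)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.f_theta = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        logits = edges * (q @ k.t())                            # e_kl * q_k k_l^T
        logits = logits.masked_fill(edges == 0, float('-inf'))  # restrict to N(k)
        alpha = torch.softmax(logits, dim=-1).nan_to_num()      # (16); guard empty rows
        return self.f_theta(x + alpha @ v)                      # (17)

class RefinementNet(nn.Module):
    """T stacked layers followed by the linear scoring head W_f of (18)."""
    def __init__(self, d: int, t: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(EdgeWeightedAttention(d) for _ in range(t))
        self.w_f = nn.Linear(d, 1)

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            nodes = layer(nodes, edges)
        return self.w_f(nodes).squeeze(-1)                      # inlier scores y_hat
```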
3) Inlier Loss
The refinement module aims to estimate the inlier node, which represents the optimal correspondence among the retrieved candidates. To accomplish this, we utilize the binary cross-entropy loss over the $K_{n}$ candidate nodes
\begin{equation*}
L_{\text{ce}} = - \sum_{k=1}^{K_{n}} (y_{k} \log (\hat{y}_{k}) + (1-y_{k}) \log (1-\hat{y}_{k})) \tag{19}
\end{equation*}
where $y_{k} = 1$ if node $k$ is the true correspondence and $y_{k} = 0$ otherwise.
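A one-function sketch of (19) follows; folding the sigmoid into a with-logits formulation is our implementation choice for numerical stability, not something the paper states.

```python
import torch
import torch.nn.functional as F

def inlier_loss(scores: torch.Tensor, inlier_index: int) -> torch.Tensor:
    """Binary cross-entropy over the K_n candidate nodes, per (19);
    exactly one node (the true correspondence) is labeled 1."""
    target = torch.zeros_like(scores)
    target[inlier_index] = 1.0
    return F.binary_cross_entropy_with_logits(scores, target, reduction='sum')
```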
Experiments
A. Experimental Setup
1) Patch Correspondence Datasets
Our proposed patch correspondence dataset is based on the SpaceNet 6 data [42], which consists of 204 different strips of SAR data collected over Rotterdam, The Netherlands, covering over 120 km$^{2}$.
Sampling strategy of the proposed datasets. (a) SAR and optical image data. (b) Nonoverlap region protocol for the training set and the test set. (c) Cropping strategy of optical patches. (d) Patch correspondence definitions.
a) Optical patch archive:
Given an optical map of the city, our objective is to match a SAR query against this map by searching the reference patches cropped from it. To ensure that every query can be matched with a reference patch, the reference patches must cover the entire map seamlessly. As shown in Fig. 5(c), the reference patches are cropped on a grid without any overlap. Every patch has a unique identification (ID) and a fixed size of $\text{200} \times \text{200}$ pixels.
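As an illustration of this tiling, a minimal sketch follows; the 200-pixel patch size mirrors the SAR grid described below, and the (row, column) ID convention is our assumption.

```python
import numpy as np

def crop_reference_grid(optical_map: np.ndarray, patch: int = 200) -> dict:
    """Tile the city map into nonoverlapping reference patches that cover
    it seamlessly; the grid index serves as the patch ID."""
    h, w = optical_map.shape[:2]
    patches = {}
    for r in range(h // patch):
        for c in range(w // patch):
            patches[(r, c)] = optical_map[r * patch:(r + 1) * patch,
                                          c * patch:(c + 1) * patch]
    return patches
```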
b) SAR patch queries: In real-world applications, the SAR strips captured by aircraft are not perfectly aligned with the optical map and have varying sizes. We design two types of SAR patches: aligned patches and nonaligned patches to construct the matching pairs. The IDs of SAR patches are the same as the matched optical patches.
Aligned pairs: The cropped SAR patches are aligned with the optical patches and have the same image size. SAR patches that fall on the boundary of a strip are discarded because the image is incomplete. The optical patches aligned with the SAR patches are set as the positive samples of the SAR query, as shown at the top of Fig. 5(d).
Nonaligned pairs: All the SAR patches are cropped on a grid of size $\text{200} \times \text{200}$ pixels without being aligned to the optical map. These SAR patches are labeled with ground-truth GPS tags, which are used only for correspondence identification. As shown at the bottom of Fig. 5(d), the green patch is considered the ground truth, as it has the nearest GPS location to the SAR query and shares the most objects with the query image.
c) Dataset protocol: We design two protocols for assigning the training set and the test set in the experiments, according to different application scenarios: the overlap setting and the nonoverlap setting.
Region overlap: All the optical patches are included in both the training phase and the testing phase, and the SAR patches are randomly split into two disjoint sets. This setting evaluates the methods when data from the city are available during training.
Region nonoverlap: To assess the generalizability of the proposed scheme to new cities, we design the dataset protocol to ensure that the test region is not learned in the training phase. Fig. 5(b) shows that the patches are separated into two regions.
In summary, we design three datasets depending on whether the matching pairs are aligned and whether the testing region is available during training, as shown in Table I.
The Aligned dataset contains the paired SAR query aligned with the optical reference patches, which aims to evaluate the instance consistency of the cross-modal features.
The Same-Area dataset contains the nonaligned pairs, where all the optical patches are available in both the training and testing phases, which is focused on application scenarios when the city data is available for training.
The Cross-Area dataset represents a general challenge of matching pairs in a new region: it contains nonaligned pairs, and the test set and the training set come from nonoverlapping regions.
2) Evaluation Metrics
For the retrieval performance, we adopt the precision of top K in retrieval (P@K) and mean average precision (mAP) as the evaluation metrics. For geolocalization, we employ meter-level accuracy to evaluate the localization capability of nonaligned SAR patches. During the experiments, we evaluate the retrieval and geolocalization performance of the methods on three datasets. The best results in the experiments are bolded. All methods are trained using the training set and tested on the test set of each corresponding dataset.
Top-K precision: P@K computes the fraction of queries for which the ground-truth label appears among the top $k$ retrieved candidates. The definition is given by
\begin{equation*}
\text{P@K} = \frac{\sum_{i=1}^{M} Acc_{k}}{M} \tag{20}
\end{equation*}
where $Acc_{k}$ equals 1 if the ground-truth matched target is included in the top $k$ retrieval candidates, and 0 otherwise, and $M$ is the number of queries.

mAP: The average precision (AP) of each query $i$ accounts for the order in which the retrieved targets are presented, and the mAP is the mean AP over a set of queries. The definitions are
\begin{align*}
\text{AP}_{i} &= \sum^{N}_{k=1} \frac{\text{rel}(k)_{i}}{k} \\
\text{mAP} &= \frac{1}{M} \sum^{M}_{i=1} \text{AP}_{i} \tag{21}
\end{align*}
where $\operatorname{rel}(k)_{i}$ is an indicator function equaling 1 if the item at rank $k$ is a ground-truth target of query $i$, and 0 otherwise, and $N$ is the number of patches in the reference archive.

Meter-level localization accuracy: Localization accuracy evaluates the real-world distance between the predicted location and the ground-truth GPS location of the SAR query.
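For reference, a sketch of the three metrics follows, specialized to the single-ground-truth case of our datasets (where (21) reduces to the reciprocal rank of the true match); GPS coordinates are assumed to be projected into a metric system.

```python
import numpy as np

def precision_at_k(rankings, gt, k: int) -> float:
    """P@K per (20): fraction of queries whose ground truth is in the top k."""
    return float(np.mean([g in r[:k] for r, g in zip(rankings, gt)]))

def mean_average_precision(rankings, gt) -> float:
    """mAP per (21) with one relevant item per query: mean reciprocal rank."""
    return float(np.mean([1.0 / (list(r).index(g) + 1)
                          for r, g in zip(rankings, gt)]))

def meter_accuracy(pred_xy, gt_xy, threshold_m: float = 50.0) -> float:
    """Fraction of queries localized within threshold_m meters."""
    d = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=1)
    return float(np.mean(d <= threshold_m))
```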
3) Implementation Details
The experiments are conducted on a platform equipped with four NVIDIA TITAN Xp GPUs. We adopt stochastic gradient descent as the optimizer of the proposed scheme. All backbones are pretrained on the ImageNet dataset [43] and then trained for 60 epochs with a batch size of 24. During the evaluation, we set the geolocation of the Top-1 matched target as the localization prediction for the query patch.
We utilize ResNet-50 as the feature embedding network of the coarse search module.
Benchmarking Results
1) Comparison With Retrieval Methods
To evaluate the performance of our proposed CMPC, we compare it with several retrieval benchmarks on our proposed SAR-Optical patch correspondence datasets, including the geolocalization methods and the cross-modal methods.
Regarding geolocalization methods, we use RK-Net [48] and VIGOR [11]. RK-Net is a UAV-Satellite cross-view geolocalization method that aims to classify the building from the query. We modify the instance loss with triplet loss to adapt RK-Net for the cross-modal task. VIGOR is a ground-to-satellite geolocalization method that focuses on predicting the geolocation of the query. We use neighbor patches as semipositive samples defined in VIGOR and follow the original training process to train the VIGOR model.
For cross-modal methods, we use ReIDSB [45], X-modality [46], and DCMHN [4]. ReIDSB is a strong retrieval baseline for IR-RGB person reidentification, while X-modality is a cross-modal IR-RGB retrieval method focused on modal adaptation. DCMHN is a SAR-Optical retrieval method for the classification of area categories.
In the experiment on the Aligned dataset, we compare our proposed method with previous cross-modal retrieval methods by evaluating their retrieval performance on this task, as shown in Table II. Our method achieves 81.95 in Top-1 precision and 85.98 in mAP. This improved performance can be attributed to the fact that our method directly models the distribution gap and focuses on the instance-level retrieval objective, rather than simply extracting cross-modal features from aligned pairs. By doing so, we achieve better accuracy and overcome the modal discrepancy.
In the Same-Area dataset, we further focus on the localization performance of the methods to predict the query location. Our proposed scheme achieves 57.59 and 40.79 in top-1 precision and 50-m accuracy respectively, as shown in Table III. The top part of the table lists the geolocalization retrieval methods that directly learn the embedded feature distance between query and reference patches for locating the reference image in the same area. RK-Net, which focuses on learning the position shift between query and reference patches, suffers from unstable position relations due to the gap between SAR and optical imaging modalities. VIGOR, which benefits from overlapping semipositive patches, outperforms other methods in Top-10 retrieval precision, but still suffers from modality discrepancy, resulting in unsatisfactory results in Top-1 matching precision. The bottom part of the table lists cross-modal retrieval methods designed to overcome the gap between modalities. Image transformation used in DCMHN fails to address the gap between SAR and optical modalities, and similarity reranking strategies, such as those used in X-modality, do not provide additional information to improve matching accuracy, resulting in poor performance. In contrast, our proposed CMPC scheme first overcomes the modal gap in the coarse search and then refines the matching prediction by considering the location information of retrieved candidates, achieving state-of-the-art performance in this task.
In the experiment of the Cross-Area dataset, as presented in Table IV, all methods exhibit a substantial decrease in accuracy, highlighting the difficulty of this dataset. This dataset includes nonaligned pairs between the cross-modal patches, and the test data come from a new region that did not appear in the training phase, making it more challenging than the previous two datasets. Despite this challenge, the proposed scheme still outperforms other methods in this task. However, the improvement is not as significant, as the refinement module cannot learn sufficient information in the coarse retrieval phase, which contains a vast number of outliers. Nevertheless, the proposed method still demonstrates its effectiveness in handling CMPC, even under the challenging conditions of the Cross-Area dataset.
To evaluate whether our method can also work on lower resolution data, we conduct experiments on OS-Dataset [49]. The OS-Dataset comprises 2673 nonoverlapping aligned patch pairs of 512 × 512 pixels with 1-m spatial resolution. We downsample the images to 200 × 200 pixels, making them lower than 2-m spatial resolution. When using the same image size as our proposed dataset, the low-resolution images from the OS-Dataset can capture a larger region, providing a broader field of view. Therefore, the retrieval accuracy shown in Table V is higher in general. Notably, our proposed method still outperforms other methods with a top-1 precision of 96.93 and an mAP of 97.94.
2) Comparison With Feature Correspondence Methods
Following the coarse search, previous works employ geometric verification via feature correspondence [50], [51] and correlation verification [52] to rerank putative retrieval results. Therefore, we also compare our method with these methods [52], [53], [54] on the Same-Area dataset. In this experiment, the global features are extracted and the coarse search is performed; we then conduct a comparative analysis of these methods in the refinement stage. Notably, RIFT [53] is a multimodal image matching approach based on radiation-invariant feature descriptors, SuperGlue [54] is a deep-learning-based method for feature correspondence, and CVNet [52] serves as a robust deep verification network for image retrieval tasks. In addition, we compare the computational speeds of these methods, measured in frames per second (FPS). The results in Table VI demonstrate that our approach achieves superior accuracy at a higher computational speed.
3) Qualitative Analysis
We visualize the retrieval and matching results in the nonaligned pairs setting and compare with the state-of-the-art instance-level retrieval method VIGOR [11] and the cross-modal method X-modality [46]. Given the SAR queries shown on the left of Fig. 6, the compared methods fail to find the real match and sometimes cannot even retrieve the corresponding target among the top-ranked candidates. One reason is that they retrieve wrong candidates due to the modal discrepancy between queries and references, and thus match wrong targets that are similar in visual appearance but distinct in semantics. Another reason is that they do not refine the correspondence prediction from the retrieved candidates. Compared with previous methods, our proposed CMPC extracts more instance-discriminative features; therefore, the retrieved patches have higher semantic similarity and belong to the same region category. Moreover, the proposed CMPC retrieves as many neighbors as possible in the coarse search module (marked with yellow boxes) and then takes into account the location information of these retrieved candidates, adopting deep graph learning to mine the relationships between the candidates. The visualization results show that the proposed method can match the correct target from the cross-modal patch database.
Visualization of the retrieval results on the Same-Area dataset. The patches with green boxes are real matches, while the yellow boxes mark neighbors with region overlaps. The numbers under the patches are the distances between the locations of the query and the retrieved optical patches.
Ablation Study
In this article, we propose the cross-modal retrieval module, the modal-invariant adversarial learning strategy, and the refinement module for the task of SAR-Optical patch correspondence. In this section, we analyze the impact of these modules and explore their effects under different configurations to better understand their contributions to this task.
Table VII presents the performance of our proposed methods on the Same-Area and Cross-Area datasets, evaluated under various configurations. The baseline is a pretrained ResNet-50 supervised with the soft-margin triplet loss [34]. Compared with the soft-margin triplet, the hard-mining triplet loss improves the performance of the retrieval module. With the proposed feature projector, the network gains 16.76 and 8.47 points in Top-1 precision on the Same-Area and Cross-Area datasets, respectively. Combined with the Wasserstein adversarial learning, the network further improves on the Cross-Area dataset but decreases slightly on the Same-Area dataset. By applying the refinement module, the full scheme reaches 57.59 and 28.90 points in Top-1 precision on the two nonaligned datasets, respectively.
We further analyze the effectiveness of the Wasserstein learning and refinement module.
1) Wasserstein Discriminator
We compare our proposed Wasserstein discriminator with the traditional classifier discriminator [1] and analyze how the different types of discriminators affect retrieval performance. We conduct the experiments on the Aligned dataset and the Cross-Area dataset. The baseline is the proposed embedding network trained without adversarial learning, while the compared discriminator is a classifier discriminator that outputs a binary modality prediction (SAR or optical).
The experiments on the Aligned dataset are shown in Table VIII. We observe that, compared with the baseline, applying adversarial training with either the classifier discriminator or the Wasserstein discriminator improves retrieval performance on the Aligned dataset. This indicates that the adversarial strategy can narrow the modal gap between query and reference images when the position shift is not large. The Wasserstein discriminator performs better than the classifier discriminator, indicating that Wasserstein adversarial learning directly models the gap between the two modalities and achieves a narrower feature gap between SAR and optical.
We also analyze the retrieval and localization performance on the Cross-Area dataset, as shown in Table IX. The results indicate that the improvement on nonaligned pairs is smaller than on aligned pairs because the difference between query and reference images includes not only the modal discrepancy but also a positional shift. In this setting, the classifier discriminator causes an accuracy decrease. A possible explanation is that classification-based discriminators do not learn the metric space and are not compatible with the metric-based scheme. The proposed Wasserstein adversarial learning models the modality gap in a regression manner, which improves performance over the classification-based method. Moreover, the results also imply that modal-invariant feature extraction alone is not enough for this task when position shifts occur, and refinement of the coarse retrieval needs to be considered.
We also conduct comparison experiments on OS-Dataset [49] to evaluate the performance of different adversarial learning strategies. The experimental results shown in Table X demonstrate that our proposed Wasserstein adversarial strategy can achieve better performance than the classification adversarial one.
2) Graph Representation for Matching Pairs
The refinement module is a crucial part of the proposed method, which greatly improves the matching performance. We further explore the optimal configurations for different types of connections and the number of candidates and edges.
We first explore the edge connection configuration of the graph and its impact on message transmission and relationship mining, which leads to different matching performances. Hence, we compare our graph construction method with several inlier prediction methods applied to the coarse retrieval results in Table XI.
MLP updates the concatenated features to predict the inlier.
Full connect connects all the nodes with the weight of the edges equal to 1, which ensures that all nodes can equally propagate messages from all other nodes.
Position embedding embeds the 2-D location coordinates into vectors of the same dimension as the node features using MLP and adds them to the node features.
Feature KNN connects each node to its $K_{e}$ nearest neighbors with the highest inner product between node features.
Compared to the coarse search, the MLP and full connect settings cause a decrease in accuracy. The results imply that the MLP cannot learn the relations between nodes, while the full connect setting loses the topological discrimination. Compared to the above settings, the position embedding setting improves performance by incorporating location information into the node features. However, it does not model the spatial information as a graph structure, limiting the potential of graph networks. The feature KNN models feature similarity in the graph structure and improves matching performance with the graph attention network; however, it does not consider the location information between nodes, yielding only a slight improvement over the coarse search. Our proposed location KNN further considers the positional relationships and models them directly in the graph representation, taking full advantage of the graph network. The results indicate that modeling location information directly in the graph topology can significantly improve the performance of the proposed graph refinement module.
We also conduct experiments to determine the optimal number of graph nodes $K_{n}$ and neighbor edges $K_{e}$. The experimental results indicate how varying $K_{n}$ and $K_{e}$ affects the matching performance and guide the configuration adopted in our scheme.
Discussion
A. Wasserstein Adversarial Learning
In the proposed scheme, we design a cross-modal retrieval module based on adversarial learning with the Wasserstein discriminator. In this module, we assume that the output of the discriminator can represent the Wasserstein discrepancy between features from different modalities. During the training process, the embedding network minimizes the modal gap by both triplet loss and discrepancy loss. The triplet loss minimizes the distance of positive pairs, which are features from different modalities. As such, it can also minimize the Wasserstein discrepancy.
To verify this assumption, we compare the Wasserstein distance, as computed by the discriminator during the training phase, under various loss functions, as illustrated in Fig. 7. The minus output of the discriminator loss, $-L_{\text{dis}}$, serves as the estimate of the Wasserstein distance between the feature distributions of the two modalities.
Wasserstein distance learned by the discriminator. The figure shows that utilizing only the triplet loss also reduces the Wasserstein distance, while adding the adversarial loss $L_{w}$ narrows the modal gap further.
B. Graph Representation
We present a visualization of the graph structure samples of the refinement module to demonstrate how the network learns knowledge from the graph. Fig. 8 illustrates the graph samples that can successfully estimate the true inlier. The color of the nodes represents the feature similarity between the query and the corresponding candidate, and the color depth of the edge is calculated using (13). These graphs are then input into the attention network to predict one inlier from the nodes, whose boundary is marked in green.
Illustration of the graph samples; the green circles indicate the true positive inliers, while the orange circles indicate the top-1 retrieved item from the coarse search module. The color of the nodes indicates the feature similarity of matching pairs.
Our results show that inliers tend to have closer neighbors than outliers, indicating that inliers have different graph topology distributions from the outliers. The results demonstrate that the graph neural network can mine the location relationships from the graph structure and learn to discriminate the inlier for the different distributions.
C. Limitations and Future Works
The proposed CMPC method has demonstrated promising results in the challenging scenario of SAR to optical patch correspondence. However, we have not yet explored the methods in optical to SAR correspondence scenarios due to the lack of sufficient data. Future works will focus on collecting more diverse and comprehensive datasets that cover a wider range of scenarios and modalities to comprehensively evaluate the effectiveness of the methods. Such datasets will enable us to conduct rigorous experiments and comparative analysis to establish the generalizability and robustness of the proposed method in various cross-modal correspondence scenarios.
In addition, in real-world applications, we may face situations where optical images and SAR images have very different resolutions. However, in this work, our primary focus is on addressing the cross-modal discrepancy challenge in visual localization, and we have not yet specifically provided a solution for resolution differences. We assume that, during visual localization, reference images with a spatial resolution close to that of the query image are readily available, thus mitigating the impact of resolution differences. We will investigate significant resolution differences in our future work.
Finally, while the proposed method achieved promising results on the Aligned dataset and the Same-Area dataset, the performance on the Cross-Area dataset is still far from meeting real-world demands. Therefore, enhancing the robustness and generalizability of our proposed scheme is imperative for it to perform well in the wild. This could be achieved by developing techniques to handle positional shifts on the feature embedding to account for nonaligned patch pairs. Furthermore, the design of semisupervised learning for cross-modal methods could be explored to tackle the challenge of insufficient data.
Conclusion
This article presented a novel coarse-to-fine scheme to tackle the challenging problem of CMPC between optical and SAR images. The proposed scheme consisted of two modules: the first retrieved candidate patches based on modal-invariant features, while the second refined the retrieved candidates to identify the optimal correspondence. To enhance cross-modal retrieval performance, we introduced Wasserstein adversarial learning to directly model the gap between the modal distributions and trained the embedding network to learn modal-invariant features. In addition, we designed a graph representation based on the topology of the reference GPS coordinates to model the matched features from the coarse search and proposed graph attention layers to predict the optimal correspondence from the graph representation. Through extensive experiments on three SAR-Optical patch correspondence datasets of varying difficulty, we demonstrated the effectiveness and superiority of our proposed method.
ACKNOWLEDGMENT
The authors would like to extend their sincere gratitude to the anonymous reviewers for their insightful comments and contributions to improving the quality of this article. The authors would also like to thank our colleagues for their valuable suggestions. Furthermore, the numerical calculations presented in this article were performed on the supercomputing system at the Supercomputing Center of Wuhan University.