Intelligent Matching Method for Heterogeneous Remote Sensing Images Based on Style Transfer

Intelligent matching of heterogeneous remote sensing images is a common basic problem in the field of intelligent remote sensing image processing. Aiming at the difficulty of matching satellite-aerial remote sensing images, this article proposes an intelligent matching method for heterogeneous remote sensing images based on style transfer. First, based on the idea of image style transfer with generative adversarial networks, the method improves the conversion effect of the model on heterogeneous images by constructing a new generative network loss function and converts satellite images into aerial images. Then, the advanced deep learning-based matching algorithms D2-Net and LoFTR are used to match the generated aerial image against the original aerial image. Finally, this transformation relationship is mapped to the corresponding satellite-aerial image pair to obtain the final matching result. The image style transfer experiments and matching experiments carried out on different test datasets show that the smooth cycle-consistent generative adversarial network proposed in this article can effectively reduce the complexity of the algorithm and improve the quality of image generation. In addition, combining it with deep learning-based feature-matching methods can effectively improve the accuracy and robustness of the matching algorithm. Our code and data can be found at: https://gitee.com/AZQZ/intelligent-matching.


I. INTRODUCTION
With the rapid development of aerospace technology, remote sensing images are now widely used in fields such as environment, transportation, resources, and national defense. This has spurred research on intelligent processing of remote sensing images, cognitive navigation, intelligent aircraft, target positioning, and other topics. How to use platforms such as unmanned aerial vehicles (UAVs) to intelligently match the collected heterologous remote sensing images is a key common problem in the above fields.
Traditional image matching algorithms can be divided into feature-based methods, region-based methods, and combined methods based on both region and feature [1]. Feature-based methods such as the scale-invariant feature transform (SIFT) [2], speeded-up robust features (SURF) [3], and the radiation-variation insensitive feature transform (RIFT) [4] achieve matching by extracting local invariant features such as points [5], lines [6], and surfaces [7]. These methods require little computation and are robust to many kinds of change. However, they do not cope well with large changes in image appearance, large angle transformations, or complex transformation models with many parameters. Moreover, feature extraction is complex and only shallow features can be extracted [8], making it difficult to obtain deeper and more expressive features. On the other hand, region-based methods such as the sum of squared differences [9], normalized cross-correlation [10], and mutual information [11] usually use the grayscale and phase information of the image. They adjust the parameters of the transformation model according to a preestablished similarity measure and regard the matching problem as an optimization problem. The principle of this approach is simple, but it is computationally intensive and time-consuming, making real-time performance difficult to ensure in practical applications. Moreover, most similarity measures have many local minima, so a globally optimal solution is hard to obtain [12]. Region-based and feature-based methods thus have complementary strengths: feature-based methods extract reliable correspondences between two images and can handle rotation, translation, scale difference, and geometric distortion, while region-based similarity measures can effectively suppress nonlinear radiation differences.
Based on the above reasons, researchers have combined these two approaches to achieve more accurate matching [13], [14]. The above methods have achieved good results in matching homologous images. However, for heterologous images, because the functions and types of image acquisition equipment vary, as do the imaging principles and spatial positions of the sensors, the differences between the collected images are greater. This makes it difficult to apply traditional homologous matching methods directly to heterologous images, and achieving heterologous remote sensing image matching in the presence of complex features is harder still. In recent years, machine learning methods, especially deep learning methods, have made significant progress in the field of computer vision. Deep learning can analyze and process big data through its deep multilevel structure and automatically learn the characteristics of specific objects from training data, accurately capturing the characteristics of target objects and understanding the contents of images. With the continuous development of deep learning technology, an increasing number of neural networks are used in the field of image processing. The convolutional neural networks (CNNs) proposed by LeCun et al. [15], the fully convolutional networks (FCN) proposed by Shelhamer et al. [16], and the siamese networks proposed by Chopra et al. [17] are commonly used network structures in image matching. The generative adversarial networks (GANs) proposed by Goodfellow et al. [18] play an important role in image generation, style transfer, and other visual tasks, and a large number of improved networks, such as deep convolutional GANs [19], Wasserstein GANs [20], cycle-consistent GANs (CycleGAN) [21], and dual-path network-CycleGAN [22], continue to emerge. Li et al. [23] used deep translation to convert optical images into synthetic aperture radar (SAR) images, and Song et al. [24] used a GAN to convert optical images into maps, reflecting the feasibility and practicality of applying GANs to remote sensing image processing.
Based on the above analysis, taking the satellite-aerial remote sensing images as an example, the problem of large differences in their features can be solved in two ways: One is to design a more expressive feature extraction or description network and the other is to reduce the difference between two images based on GANs. This article chooses the second approach and then uses the state-of-the-art deep learning-based matching algorithm to achieve the matching of satellite-aerial remote sensing images.
The main contributions of this article are as follows. 1) A general matching framework is proposed to match satellite images and aerial images with large feature differences. The core idea is to first convert satellite images and aerial images into the same feature domain, and then input them to a feature-matching network to achieve image matching.
2) To reduce the feature differences between satellite images and aerial images, a smoothing loss function is proposed, which can accelerate the convergence of the network, improve the stability of the network model, and improve the quality of image generation.
3) This work proves that style transfer preprocessing methods can improve matching performance. Moreover, it is demonstrated that a better style transfer model can improve the generated images, thus improving the matching results.
The rest of this article is structured as follows. Section II describes the proposed method. Section III presents the experimental results and analysis. Finally, Section IV concludes this article.

II. METHODOLOGY
The purpose of matching is to obtain the best transformation matrix from aerial image B to satellite image A. In this section, we propose a novel method for matching heterogeneous remote sensing images. The proposed method comprises two steps: 1) style transfer based on a GAN is applied to reduce the imaging difference between satellite image A and aerial image B; 2) the aerial image generated from A by style transfer is matched with the original aerial image B, and the resulting transformation relationship is mapped to the original image pair B-A to obtain the final matching result. The basic pipeline of the method is shown in Fig. 1.

A. Image Transfer Method
This article uses CycleGAN for image transfer. CycleGAN is an improved variant of the GAN. It uses two symmetrical GANs to form a ring network consisting of two generators and two discriminators. The purpose of this network is to realize mutual conversion between the source domain X and the target domain Y. Its structure is shown in Fig. 2, where x is an image from the X domain and y is an image from the Y domain. By training the model with the two mappings (G: X → Y, F: Y → X), G(x) is made to approximate the images of the target domain Y as closely as possible, and the detailed information of the generated image is then further refined by the discriminator. CycleGAN has a unique cycle-consistent adversarial learning capability, so that the input image can still be reconstructed after passing through the two generators in sequence. This idea not only fits the style distribution of the target domain but also retains the content characteristics of the source domain, which effectively reduces the introduction of wrong or useless information during conversion. The cycle consistency constraint prevents the generators (G, F) from contradicting each other, alleviates model collapse and gradient vanishing, enhances the overall conversion effect between domains, realizes bidirectional conversion of heterogeneous image styles, and makes training more stable. This makes it convenient to carry out the matching between aerial and satellite images. However, CycleGAN uses the mean square error (MSE) as the loss function of the generative network, which causes outliers to obtain higher weights at the expense of other normal samples, thus reducing the performance of the overall model.
Therefore, we designed a smooth L1 loss function to replace the MSE, in order to reduce the sensitivity of the CycleGAN model to outliers, accelerate network convergence, and improve the stability of the network. The resulting model is called the smooth cycle-consistent GAN (SCycleGAN). For outliers, when the residual is greater than 1, the smoothing function grows only linearly, reducing the sensitivity to outliers and improving the robustness of the network. As the error decreases below 1, the squared branch makes the gradient smoother near zero, and the network converges faster. The formula is as follows:

smooth(x) = 0.5 x², if |x| < 1; |x| − 0.5, otherwise.   (1)

The forward adversarial loss, backward adversarial loss, cycle-consistent loss, and objective function of SCycleGAN are consistent with CycleGAN, as shown below:

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]   (2)

L_GAN(F, D_X, Y, X) = E_{x∼p_data(x)}[log D_X(x)] + E_{y∼p_data(y)}[log(1 − D_X(F(y)))]   (3)

L_cyc(G, F) = E_{x∼p_data(x)}[‖F(G(x)) − x‖₁] + E_{y∼p_data(y)}[‖G(F(y)) − y‖₁]   (4)

L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F)   (5)

In the above equations, x and y are real images in the source and target domains, G(x) and F(y) are the generated images, p_data(x) and p_data(y) are the distributions of the real images in the source and target domains, ∼ denotes that a sample follows the given distribution, E is the expectation, and λ controls the relative importance of the two objectives.
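To make the behavior of the smoothing term concrete, the following is a minimal numpy sketch of the smooth L1 penalty described above; the function name and the `beta` threshold parameter are illustrative, not from the paper.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty: quadratic for |x| < beta, linear beyond.

    Near zero the 0.5*x^2 branch gives a gradient proportional to x
    (smooth, tending to 0), while for large residuals the |x| - 0.5*beta
    branch caps the gradient magnitude at 1, so outliers no longer
    dominate the update the way they do under MSE.
    """
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < beta, 0.5 * x ** 2 / beta, absx - 0.5 * beta)

# Compare against MSE on an outlier residual of 10:
# MSE would contribute 100, smooth L1 only 9.5.
residuals = np.array([0.5, -0.5, 10.0])
print(smooth_l1(residuals))   # [0.125 0.125 9.5]
```

This is the same piecewise form exposed by common deep learning frameworks (e.g., a Huber-style loss), which is why it drops in as a replacement for the MSE term of the generative network.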

1) Forward Adversarial Loss:
We aim to convert X to Y. Therefore, the objective is to learn the mapping from X to Y and assign this mapping to G, which corresponds to the generator in a GAN. For the generated images, we also need the discriminator D_Y to determine whether an image is real, so as to form an adversarial generative network. The generator G aims to minimize the objective function against the adversarial discriminator D_Y, which tries to maximize it. We express the objective as (2).
2) Backward Adversarial Loss: The backward adversarial loss has the same form as the forward adversarial loss, as expressed in (3).
3) Cycle-Consistent Loss: In practice, it is difficult to train the whole network using the adversarial losses alone. The reason is that the mapping G: X → Y could map every x in domain X to one and the same image in domain Y, which would render the loss ineffective. The cycle-consistent loss therefore requires the reverse mapping to recover the input, i.e., F(G(x)) ≈ x and G(F(y)) ≈ y. We express this objective as (4).

4) Objective Function: By combining the forward adversarial loss, backward adversarial loss, and cycle-consistent loss, the objective function is obtained as (5).
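As a sketch of how the adversarial and cycle-consistency terms combine into the full objective, the toy example below substitutes scalar affine maps for the real generator networks and sigmoids for the discriminators; λ = 10 is a common CycleGAN default, and all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two generators and discriminators. The real
# models are CNNs; scalar maps are enough to show how the terms combine.
G = lambda x: 2.0 * x                        # G: X -> Y
F = lambda y: 0.5 * y                        # F: Y -> X (inverse of G)
D_Y = lambda y: 1.0 / (1.0 + np.exp(-y))     # discriminator on domain Y
D_X = lambda x: 1.0 / (1.0 + np.exp(-x))     # discriminator on domain X

x = rng.normal(size=8)                       # samples from p_data(x)
y = rng.normal(size=8)                       # samples from p_data(y)

# Forward/backward adversarial objectives: the discriminator maximizes
# these values while the opposing generator minimizes them.
adv_fwd = np.mean(np.log(D_Y(y))) + np.mean(np.log(1.0 - D_Y(G(x))))
adv_bwd = np.mean(np.log(D_X(x))) + np.mean(np.log(1.0 - D_X(F(y))))

# Cycle consistency: F(G(x)) should recover x, G(F(y)) should recover y.
cyc = np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

lam = 10.0                                   # weight lambda in the objective
total = adv_fwd + adv_bwd + lam * cyc
print(cyc, total)
```

Because the toy F exactly inverts G, the cycle term is zero here; with real networks it is the term that forces the generated image to stay anchored to the source content.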

B. Image Matching Method
After style transfer, the generated aerial image and the original aerial image B are similar in texture and imaging characteristics, so they are much easier to match; the problem of large differences between the images is thereby removed. This article then chooses two state-of-the-art network models, D2-Net [25] and LoFTR [26], to achieve matching.
1) D2-Net: Traditional keypoint detection first generates feature descriptors and then uses postprocessing to find keypoints from them. Because descriptors are obtained from larger image regions while detection points come from low-level information in small regions (e.g., corner points), the detection results are unstable. D2-Net, on the other hand, performs keypoint detection directly from the feature descriptors. In other words, the feature detection module is tightly coupled with the description module, and the resulting descriptors are therefore well suited for matching. The pipeline of the matching method is shown in Fig. 3.

a) Network architecture: D2-Net selects the classic VGG16 architecture and improves it [27]. To make the feature points sufficiently abstract while retaining high localization accuracy, the feature network discards the later layers of VGG16 and uses the last convolutional layer of the fourth block (conv4_3) as the feature map for keypoint extraction. The network architecture is shown in Fig. 3. The feature map is the output of the original image after the multiple convolution and pooling layers of the CNN, so its resolution generally decreases. To maintain the resolution of the feature map, the stride of the last pooling layer is changed from 2 pixels to 1 pixel, and max pooling is replaced with average pooling. The three convolutions of the fourth block (conv4_1 to conv4_3) use dilated convolutions with a dilation rate of 2, which expands the receptive field and improves the generalization ability of the network.

b) Matching strategy: The features extracted by the feature network are often too dense, and most are not significant enough. Hence, D2-Net proposes to detect key features directly from the feature descriptors.
For the feature map D, a location (i, j) is a keypoint if, along the channel dimension, the channel k with the largest detector response at that location is selected and, along the spatial dimension, the location is a local maximum on the 2-D feature map of that channel:

(i, j) is a detection ⟺ D^k_{ij} is a local maximum over N(i, j), with k = argmax_t D^t_{ij}   (6)

where (i, j) is the detected pixel, D^k is the feature map of the kth channel, and D^k_{ij} is the feature value at pixel (i, j) on that channel.
To make (6) differentiable, define

α^k_{ij} = exp(D^k_{ij}) / Σ_{(i′,j′)∈N(i,j)} exp(D^k_{i′j′})   (7)

β^k_{ij} = D^k_{ij} / max_t D^t_{ij}   (8)

γ_{ij} = max_k (α^k_{ij} β^k_{ij})   (9)

s_{ij} = γ_{ij} / Σ_{(i′,j′)} γ_{i′j′}   (10)

where α^k_{ij} is the spatial response score, N(i, j) is the set of nine neighborhood pixels of (i, j), β^k_{ij} is the channel response weight, γ_{ij} is the score of each pixel being a keypoint, and s_{ij} is that score after image-level normalization.
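The soft detection above can be sketched in numpy as follows. This is an illustrative re-implementation, not the authors' code: it assumes a positive-valued feature map, handles borders by edge padding, and loops over the 3 × 3 neighborhood with array shifts instead of a convolution.

```python
import numpy as np

def soft_detection_scores(D):
    """Soft keypoint scores from a feature map D of shape (K, H, W).

    alpha: soft local-max over the 3x3 spatial neighborhood, per channel.
    beta:  channel-wise ratio-to-max response at each pixel.
    gamma: best alpha*beta across channels; the return value normalizes
    gamma over the whole image so the scores sum to 1.
    """
    K, H, W = D.shape
    expD = np.exp(D)
    # Sum of exp over the 9-neighbourhood via padded shifts (edge-padded).
    pad = np.pad(expD, ((0, 0), (1, 1), (1, 1)), mode="edge")
    neigh = np.zeros_like(expD)
    for di in range(3):
        for dj in range(3):
            neigh += pad[:, di:di + H, dj:dj + W]
    alpha = expD / neigh
    beta = D / D.max(axis=0, keepdims=True)
    gamma = (alpha * beta).max(axis=0)
    return gamma / gamma.sum()

rng = np.random.default_rng(1)
s = soft_detection_scores(rng.random((4, 8, 8)) + 0.1)
print(s.shape, s.sum())   # (8, 8), scores summing to 1
```

A pixel thus scores highly only when it both stands out from its spatial neighborhood and dominates across channels, mirroring the hard detection rule it relaxes.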
After detecting and extracting the key features of the image pairs, to obtain more accurate keypoint locations, subpixel localization accuracy is obtained by local interpolation of the feature map as in the SIFT algorithm, while the descriptors are obtained by bilinear interpolation in the neighborhood. Finally, the combination of the fast library for approximate nearest neighbors (FLANN) and random sample consensus (RANSAC) constraints is used to eliminate false matching points and obtain the final matching result.

c) Loss function: D2-Net adopts the triplet margin ranking function as the loss function. In the feature detection process, it is desired that the feature points adapt to the effects of different ambient light intensities and geometric differences; at the same time, the feature vectors should be as distinctive as possible in order to find homonymous image points. To this end, the triplet margin ranking loss enhances the distinctiveness of related descriptors by penalizing any irrelevant descriptors that lead to false matches. In addition, to achieve repeatability of the detected features, the detection scores are added to the loss function, as shown in (11).
L(I_1, I_2) = Σ_{c∈C} [ s^(1)_c s^(2)_c / Σ_{q∈C} s^(1)_q s^(2)_q ] max(0, m + p(c)² − n(c)²)   (11)

where s^(1)_c and s^(2)_c are the feature detection scores of the two points of correspondence c on images I_1 and I_2, respectively, C is the set of all point-to-point correspondences between I_1 and I_2, m is the margin, and p(c) and n(c) are the positive and negative descriptor distances of the corresponding points, respectively.
The above loss function is a weighted average of the margin terms, weighted by the detection scores of all matches. Therefore, to minimize the loss, the most relevant correspondences (those with lower margin terms) obtain higher relative scores, and correspondences with higher relative scores are driven toward descriptors that are similar to each other and distinct from the rest of the features, thereby improving the robustness of the matching.
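A compact numpy version of this detection-weighted ranking loss might look as follows; the distances and scores are illustrative toy values, and `margin` plays the role of the margin term m.

```python
import numpy as np

def d2net_loss(p, n, s1, s2, margin=1.0):
    """Detection-weighted triplet margin ranking loss (a sketch).

    p, n   : positive / hardest-negative descriptor distances for each
             correspondence c.
    s1, s2 : detection scores of the two matched points on I1 and I2.
    Correspondences with a high joint score s1*s2 receive more weight,
    so the network must both detect them repeatably and describe them
    distinctively to drive the loss down.
    """
    margin_terms = np.maximum(0.0, margin + p ** 2 - n ** 2)
    weights = (s1 * s2) / np.sum(s1 * s2)
    return np.sum(weights * margin_terms)

p  = np.array([0.2, 0.8])     # positive (matching) distances
n  = np.array([1.5, 0.9])     # negative (non-matching) distances
s1 = np.array([0.6, 0.4])
s2 = np.array([0.5, 0.5])
print(d2net_loss(p, n, s1, s2))
```

The first correspondence already separates positives from negatives by more than the margin, so only the second, harder one contributes to the loss.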
2) LoFTR: LoFTR is an end-to-end feature-matching scheme that does not rely on keypoint detection. It uses self-attention and cross-attention mechanisms to build dense feature-wise coarse matches directly on low-resolution feature maps and then refines them on high-resolution feature maps to find subpixel-level fine matches. Compared with traditional methods, LoFTR uses a transformer to construct features with a global receptive field over both images, which allows accurate matches even in low-texture areas. The pipeline of the matching method is shown in Fig. 4.

a) Matching strategy: LoFTR adopts a relatively simple network architecture, using a combination of ResNet-18 and an FPN to extract coarse-level feature maps at 1/8 of the original image dimension and fine-level feature maps at 1/2 of the original image dimension. The coarse-level feature maps are summed with their respective positional encodings and input to the transformer for coarse-level feature extraction. The transformer consists of several alternating self-attention and cross-attention layers: the self-attention layer makes each point attend to all points around it, while the cross-attention layer makes each point attend to all points on the other image. In the coarse-level matching process, the matching score matrix of all positions is first calculated via inner products, the optimal matches are then computed either by the optimal transport algorithm or by the dual-softmax method, and outlier matching point pairs are finally filtered out by the mutual nearest neighbor criterion.
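The coarse matching step (score matrix, dual-softmax, mutual nearest neighbors) can be sketched as follows; the temperature value, feature dimensions, and permutation test are illustrative and do not come from the LoFTR implementation.

```python
import numpy as np

def dual_softmax_matches(feat_a, feat_b, temp=0.1):
    """Coarse matching by dual-softmax plus mutual-nearest-neighbour
    filtering (a sketch of the step described above).

    feat_a: (N, D), feat_b: (M, D) L2-normalised coarse features.
    Returns index pairs (i, j) that are mutual best matches.
    """
    sim = feat_a @ feat_b.T / temp                       # score matrix

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    conf = softmax(sim, axis=1) * softmax(sim, axis=0)   # dual-softmax
    rows = conf.argmax(axis=1)                           # best j for each i
    cols = conf.argmax(axis=0)                           # best i for each j
    return [(i, j) for i, j in enumerate(rows) if cols[j] == i]

# Sanity check: feat_b is a shuffled copy of feat_a, so mutual nearest
# neighbours should recover the permutation exactly.
rng = np.random.default_rng(2)
f = rng.normal(size=(5, 16))
f /= np.linalg.norm(f, axis=1, keepdims=True)
perm = np.array([3, 0, 4, 1, 2])
matches = dual_softmax_matches(f, f[perm])
print(matches)
```

Multiplying the row-wise and column-wise softmaxes means a pair only gets a high confidence when each feature prefers the other, which is what makes the subsequent mutual-nearest-neighbor filter effective.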
In the fine-level matching process, each coarse match (ĩ, j̃) obtained by coarse-level matching is mapped to its position (î, ĵ) on the fine-level feature maps, and two local windows of size w × w centered on î and ĵ are cropped (equivalent to cropping out the features at w × w positions). These are input to the fine-level transformer to extract matching features, yielding two transformed local feature maps F̂^A_tr(î) and F̂^B_tr(ĵ) centered on î and ĵ, respectively. The matching probability (i.e., similarity) between the central feature of F̂^A_tr(î) and all the features in F̂^B_tr(ĵ) is then calculated, and the expectation over this probability distribution determines the matching point position with subpixel accuracy in F̂^B_tr(ĵ). Finally, the final matching results are obtained using the RANSAC constraint method.
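The expectation step can be illustrated with a small sketch: a softmax over the window similarities yields a probability distribution, and its mean coordinate is the subpixel match position. The window size and similarity values below are illustrative.

```python
import numpy as np

def expected_position(sim, coords):
    """Subpixel match position as the expectation over the match
    probability distribution of a w x w window.

    sim    : (w*w,) similarities of the center feature of one window
             against every feature of the other window.
    coords : (w*w, 2) pixel coordinates of those window positions.
    """
    p = np.exp(sim - sim.max())     # stable softmax
    p /= p.sum()
    return p @ coords               # probability-weighted mean coordinate

# 3x3 window of coordinates (x and y each in 0..2).
ys, xs = np.mgrid[0:3, 0:3]
coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)

# Uniform similarities -> the expectation is the window center (1, 1).
print(expected_position(np.zeros(9), coords))
```

Because the output is an expectation rather than an argmax, the predicted position can land between grid cells, which is the source of the subpixel accuracy.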
b) Loss function: The loss function of LoFTR includes the negative log-likelihood loss of the coarse-level matching confidence matrix as well as the loss of the fine-level matching coordinates. In the coarse-level matching layer, LoFTR uses a negative log-likelihood loss to supervise the dense confidence matrix produced by the differentiable matching layer; that is, the negative log-likelihood over positions with ground-truth matches is minimized. Since the differentiable matching layer ensures that gradients are efficiently passed back to all features, no error-matching supervision is needed. In the fine-level matching layer, LoFTR calculates the variance σ²(î) of the heatmap generated for each point î to measure its uncertainty and thereby improve the accuracy of the fine-level matching position. The formulas are as follows:

L = L_c + L_f   (12)

L_c = −(1/|M^gt_c|) Σ_{(ĩ,j̃)∈M^gt_c} log P_c(ĩ, j̃)   (13)

L_f = (1/|M_f|) Σ_{(î,ĵ)∈M_f} (1/σ²(î)) ‖ĵ − ĵ_gt‖₂   (14)

where L is the total loss, L_c is the coarse-level loss, and L_f is the fine-level loss. M^gt_c is the ground-truth coarse-level matching, defined as the mutual nearest neighbors of the two sets of 1/8-resolution grids, P_c is the confidence matrix returned by the optimal transport layer or the dual-softmax operator, M_f is the final fine-level matching, ĵ is the final fine-level matching position, and ĵ_gt is the ground-truth matching position corresponding to î.
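The two-part loss described above can be sketched in numpy as follows; this is an illustrative re-implementation (not the official LoFTR code), and the toy confidence matrix, positions, and variances are made up for the example.

```python
import numpy as np

def loftr_loss(conf, gt_pairs, pred_fine, gt_fine, var):
    """Sketch of the two-part LoFTR loss.

    conf      : (N, M) coarse confidence matrix P_c.
    gt_pairs  : list of ground-truth coarse matches (i, j).
    pred_fine : (K, 2) refined positions; gt_fine: (K, 2) ground truth.
    var       : (K,) heatmap variance sigma^2 per fine match.
    """
    # Coarse term: negative log-likelihood at ground-truth match cells.
    L_c = -np.mean([np.log(conf[i, j]) for i, j in gt_pairs])
    # Fine term: variance-weighted l2 error of the refined coordinates,
    # so uncertain (high-variance) matches are penalised less.
    L_f = np.mean(np.linalg.norm(pred_fine - gt_fine, axis=1) / var)
    return L_c + L_f

conf = np.full((2, 2), 0.25)                    # uniform 2x2 confidence
loss = loftr_loss(conf, [(0, 0), (1, 1)],
                  np.array([[0.5, 0.5], [1.0, 1.0]]),
                  np.array([[0.5, 0.5], [1.0, 2.0]]),
                  np.array([1.0, 2.0]))
print(loss)                                     # log(4) + 0.25
```

With the uniform confidence matrix, the coarse term evaluates to log 4, and the single 1-pixel fine error with variance 2 contributes 0.25.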

III. EXPERIMENTAL VERIFICATION
The experiments were conducted on a hardware platform equipped with an NVIDIA Quadro P4000 GPU running Ubuntu 20.04, using the open-source deep learning framework PyTorch 1.11.0 with CUDA 10.2.

A. Original Image Data Introduction
The original image data are a set of satellite-aerial image pairs of size 5896 × 17204 pixels. Both the satellite and aerial images are aligned to the EPSG:32649 coordinate reference system (WGS 84 / UTM zone 49N). The aerial image was taken by a UAV from a nadir (downward-looking) view, and the two-dimensional orthophoto was generated in real time using the Dajiang Zhitu software, with a spatial resolution of 0.25 m. The satellite image is sourced from the Google Satellite online map, acquired through the QGIS software and resampled to a spatial resolution of 0.5 m. The image data selected in this article cover rural and urban areas and have the advantages of high spatial resolution and rich feature types. The original image data are shown in Fig. 5.

B. Evaluation Criterion
Here, the number of correct matching points (NCM), matching success rate (SR), and matching end point error (EPE) are used to evaluate performance. The position of a feature point matched by the algorithm on the target image is (x′_i, y′_i), the position of the corresponding feature point on the reference image is (x_i, y_i), and the position of the latter after the ground-truth homography transformation is (x̂_i, ŷ_i). A match is judged correct if

sqrt((x′_i − x̂_i)² + (y′_i − ŷ_i)²) < t   (15)

EPE = (1/n) Σ_{i=1}^{n} sqrt((x′_i − x̂_i)² + (y′_i − ŷ_i)²)   (16)

where t is the accuracy threshold and n is the number of correct matching points. NCM is the number of matching points on the whole image that satisfy (15), which reflects the robustness of the matching algorithm on different image pairs, and EPE reflects the accuracy of the algorithm's matching results on different image pairs. SR is the percentage of NCM in the total number of matching points (NTP) returned by the algorithm, which reflects the success rate of the algorithm's matched point pairs on different image pairs.
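The three criteria can be computed together in a few lines; the sketch below is an illustrative implementation, with the identity homography and point lists as toy inputs.

```python
import numpy as np

def match_metrics(pts_ref, pts_tgt, H_gt, thresh=5.0):
    """NCM, SR, and EPE for putative matches under a ground-truth homography.

    pts_ref: (N, 2) points on the reference image; pts_tgt: (N, 2) the
    positions the algorithm matched them to on the target image. A match
    counts as correct when the target point lies within `thresh` pixels
    of the reference point warped by H_gt; EPE averages the residual
    over the correct matches.
    """
    ones = np.ones((len(pts_ref), 1))
    warped = (H_gt @ np.hstack([pts_ref, ones]).T).T
    warped = warped[:, :2] / warped[:, 2:3]          # (x_hat, y_hat)
    err = np.linalg.norm(pts_tgt - warped, axis=1)
    correct = err < thresh
    ncm = int(correct.sum())                          # NCM
    sr = ncm / len(pts_ref)                           # SR (NCM / NTP)
    epe = float(err[correct].mean()) if ncm else float("inf")
    return ncm, sr, epe

H = np.eye(3)                                         # identity ground truth
ref = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 90.0]])
tgt = ref + np.array([[1.0, 0.0], [0.0, 3.0], [20.0, 0.0]])
print(match_metrics(ref, tgt, H))                     # (2, 0.666..., 2.0)
```

With a 5-pixel threshold, the 1- and 3-pixel errors count as correct and the 20-pixel error does not, giving NCM = 2, SR = 2/3, and EPE = 2.0.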

C. Experiment Preparation
Image style transfer network training is performed before the matching experiment. To construct a style transfer image dataset, the original image data are cropped by randomly sampling center points. All image data are cropped into satellite-aerial image pairs of size 256 × 256. The satellite and aerial images correspond one to one, with a total of 2800 pairs, divided into a training set of 2600 pairs and a test set of 200 pairs. To verify the effectiveness of the proposed model, CycleGAN and SCycleGAN were trained and tested on the constructed datasets. The training and testing results are shown in Figs. 6 and 7. At the same time, we calculated the learned perceptual image patch similarity (LPIPS) [28], peak signal-to-noise ratio (PSNR) [29], and perceptual hash (pHash) [30] similarity between the satellite images, the generated aerial images, and the original aerial images, recorded the average values over all image pairs in the test set, and quantitatively analyzed the image conversion effect. The results are shown in Table I, which compares the corresponding values under the three settings (original images, CycleGAN, SCycleGAN).

The experimental results show that the proposed training method shortens the training time (CycleGAN: 2730 min, SCycleGAN: 1860 min), and the loss fluctuation range of the SCycleGAN generative network is significantly smaller than that of CycleGAN. However, due to the complex characteristics of remote sensing images and the different shooting times and angles of the satellite and aerial images, there are landform changes and distortions, resulting in large fluctuations in the training curve. This reflects the game process between the generator and the discriminator. From Fig. 7, it can be seen that the style transfer network model does not change the scale, perspective, or target morphology of the original image but only performs a modal change.
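Of the three similarity measures, PSNR is the simplest to state; a minimal sketch (with made-up constant images as input) is given below for reference.

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two images in dB; higher
    values mean the images are closer."""
    mse = np.mean((img_a.astype(float) - img_b.astype(float)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((4, 4), 100.0)
b = np.full((4, 4), 110.0)           # constant offset of 10 -> MSE = 100
print(psnr(a, b))                    # 10 * log10(255^2 / 100) ~ 28.13 dB
```

LPIPS and pHash, by contrast, compare learned deep features and frequency-domain fingerprints respectively, which is why the three measures together cover pixel-level, perceptual, and structural similarity.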
Table I shows that CycleGAN and SCycleGAN can improve the similarity between the original satellite images and aerial images to a certain extent, and the images converted by SCycleGAN are more similar to the original aerial images in terms of structure and color. Overall, the aerial images generated by the proposed model are of high quality and similar to the original aerial images, with complete structure, color, and detail. There are no large-area distortions, artifacts, or other such phenomena, and the conversion effect is better.

D. Matching Experimental Results and Analysis
To verify the feasibility of using the image conversion mechanism as a preprocessing method for heterologous matching, we selected two state-of-the-art network models, D2-Net and LoFTR, and evaluated the matching performance of the trained models when applied to a new test dataset. The test dataset was produced similarly to the style transfer dataset in Section III-C, except that satellite-aerial image pairs of size 960 × 540 are cropped out. Since the matching transformation model between satellite and aerial images can be expressed as a projective transformation, the matching test dataset is generated by applying a random projective transformation based on four-point disturbance to the obtained image pairs [31]. It contains 1000 satellite-aerial image pairs with the corresponding homography ground truth; the image size is 256 × 256, and the dataset is called SA1. Additionally, we use the style transfer models trained by CycleGAN and SCycleGAN to perform style transfer on the satellite images in SA1 and form new test datasets from the generated aerial images, the original aerial images, and the corresponding homography ground truth, called SA2 and SA3, respectively. The experiment reports the average NCM, SR, and EPE for each dataset under the different network models. The accuracy threshold is 5 pixels, and the results are shown in Table II. The visualization results are shown in Figs. 8 and 9.
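The four-point disturbance scheme can be sketched as follows: jitter the four image corners by up to `rho` pixels and solve for the homography mapping the original corners to the perturbed ones (the direct linear transform). The `rho` value, seed, and function names are illustrative; the exact disturbance scheme of [31] may differ, and in practice `cv2.getPerspectiveTransform` does the same 4-point solve.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve the 3x3 homography mapping 4 src points to 4 dst points
    via the direct linear transform (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null space of A (smallest singular vector).
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def random_perturbed_homography(size=256, rho=32, seed=0):
    """Random projective transform from jittering the 4 image corners
    by up to `rho` pixels (the four-point disturbance idea)."""
    rng = np.random.default_rng(seed)
    corners = np.array([[0, 0], [size - 1, 0],
                        [size - 1, size - 1], [0, size - 1]], float)
    perturbed = corners + rng.uniform(-rho, rho, corners.shape)
    return homography_from_points(corners, perturbed)

H = random_perturbed_homography()
c = H @ np.array([0.0, 0.0, 1.0])
print(c[:2] / c[2])   # the origin lands on its perturbed corner
```

Applying the resulting H to one image of each pair yields a matching pair with an exactly known ground-truth homography, which is what makes the NCM/SR/EPE evaluation possible.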
It can be seen from the figure that the image style transfer preprocessing method extracts more uniform features than the direct matching algorithm, and the number of matching points between the converted images increases significantly. Combined with Table II, it can be seen that the NCM after the image style transfer preprocessing method is more than that after the direct matching algorithm. This shows that this method can improve the robustness of the matching algorithm. From the results of SR and EPE, the image style transfer preprocessing method can  improve the matching accuracy and the image matching point pair success rate. In addition, accuracies after preprocessing using SCycleGAN are higher than those of CycleGAN, which further proves that the image style transfer method proposed in this article can improve the matching effect between satellite images and aerial images.
In addition, we conducted matching experiments with three classical algorithms: SIFT, SURF, and oriented FAST and rotated BRIEF (ORB) [32]. In all cases, the number of correct matching points is very small and the image matching fails. Example visualization results are shown in Figs. 10-12. The experimental results show that the style transfer method combined with classical matching methods cannot achieve matching between satellite images and aerial images. The reason is that style transfer only reduces the differences between images; the transferred images are still heterogeneous, and differences in resolution and geometric distortion remain between them. This proves the necessity of using deep learning methods to achieve matching in the method of this article.

IV. CONCLUSION
In this article, a satellite-aerial remote sensing image matching method based on style transfer is proposed to address the large differences between satellite images and aerial images, such as different imaging principles and resolutions. The experimental results show that preprocessing heterologous images with the style transfer method to reduce the difference between images, combined with a deep learning-based feature-matching method, can effectively improve the accuracy and robustness of the matching algorithm. This establishes the feasibility of using the image style transfer idea for heterogeneous image matching. In addition, this article improves the loss function of the original CycleGAN generative network, effectively reduces the complexity of the algorithm, improves the quality of image generation, and provides an effective reference for solving the matching problem of heterogeneous images and the processing of heterogeneous image data.
For future work, we plan to apply this method to other types of heterogeneous remote sensing image matching, such as optical-infrared and optical-SAR images, and further use it as a building block for change detection or fusion of remote sensing images.