Feature Matching of Multimodal Images Based on Nonlinear Diffusion and Progressive Filtering

Traditional image feature matching methods cannot obtain satisfactory results for multimodal images in most cases because different imaging mechanisms bring significant nonlinear radiation distortion differences and geometric distortion. The key to multimodal image matching is trying to eliminate the nonlinear radiation distortion and extract more robust features. This article proposes a new robust feature matching method for multimodal images. Our method starts by detecting feature points on phase congruency maps in nonlinear scale space and then removing mismatches by progressive filtering. Specifically, the phase congruency maps are generated by the Log-Gabor filter (LGF). Then, the feature points on phase congruency maps are detected in nonlinear scale space constructed by the nonlinear diffusion filter. Subsequently, the structure descriptor is established by the LGF, and the initial correspondences are constructed by bilateral matching. Finally, an iterative strategy is used to remove mismatches by progressive filtering. We perform comparison experiments on our proposed method with the SIFT, RIFT, VFC, LLT, LPM, and mTopKPR methods using multimodal images. The algorithms of each method are comprehensively evaluated both qualitatively and quantitatively. Our experimental results indicate the superiority of our method over the other six matching methods.

angles are called multimodal images, which are more prone to nonlinear radiation distortion [6]. Traditional image matching methods cannot solve the nonlinear radiation distortion between multimodal images. Developing algorithms for generic models, capable of working across multimodal images, will be essential to employing remote sensing as a practical and high-through tool to assist in image matching.
In the past few decades, there have been many image matching methods. Existing matching methods can be broadly classified into feature based, area based, and learning based.
Feature-based matching methods extract salient features such as points, lines, and surfaces and then establish the corresponding reliable relationship according to the similarity descriptors. Point features are the most widely used in feature matching. The traditional point feature refers to the corner point because it contains the local gray feature. Moravec [7] first proposed the corner point extraction algorithm in 1977, Harris and Stephens [8] improved the Moravec operator and then proposed the Harris operator [9]. SIFT [10] is one of the most widespread and effective feature-based matching methods, which extracts feature points in the Gaussian scale space. In recent years, a series of optimized SIFT algorithms have been proposed, such as SURF [11], PCA-SIFT [12], ASIFT [13], UC-SIFT [14], SAR-SIFT [15], and AB-SIFT [16]. However, image matching based on feature is faced with two problems: 1) in the feature description stage, when the principal direction estimation (PDE) is used to resist rotation distortion, a lot of homonymous points will be removed due to incorrect PDE. 2) In the feature matching stage, due to the nonlinear radiation difference of multimodal images, the corresponding points are often unable to be recognized, resulting in numerous anomalies in the matching results.
Area-based matching, otherwise known as template matching, is achieved by searching the reference image through the predefined template window on the input image, which utilizes the similarity measure of the image without involving feature detection [17]. There are some common similarity measures, such as the sum of squares of gray difference (SSD) [18], correlation coefficient (CC) [19], and mutual information (MI) [20]. Although SSD is simple and efficient, it is very susceptible to gray differences. CC has been extensively used because of its linear intensity changes and high computational efficiency. Despite the fact that MI can resist the intensity difference between images well, the large amount of computation limits the wide application of image matching. However, these area-based matching methods usually achieve locally optimal solutions, This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ especially when there are nonlinear radiation distortions among multimodal images. At the same time, the local optimization process of image matching has high computational complexity.
The learning-based method needs to use the manual feature extractor to extract the key points on the image block as the matching candidate points, and then take the image block of a specific size as the center to input the convolutional neural network (CNN) for feature description. CNN has been applied to learn the similarity between image blocks from the reference image and feature matching image, respectively, and the center point corresponding to the image block meeting the similarity condition is regarded as the corresponding point. There are some representative learning-based matching methods, such as LIRT [21], NCN [22], and LF-NET [23]. However, studies of multimodal image matching by the learning-based method are relatively rare. The main reasons are as follows.
1) The nonlinear radiation distortions of multimodal images will lead to the inability of learning-based methods relying on low-level structure and texture information to extract the corresponding features. 2) It needs a lot of expert knowledge and manpower for training, but manually annotated multimodal image datasets are very rare.
3) It is difficult to use a model trained on one modal image to match another modal image. Recently, Zhou et al. [24] used the deep learning method to refine the image structure features and put the multidirectional gradient features of the image pair into the pseudo-Siamese neural network. Although it can capture the finer common features between SAR images and optical images, it does not have rotation and scale invariance. Ye et al. [25] proposed a multiscale framework with unsupervised learning (MU-Net) for remote sensing image registration. MU-Net does not need ground truth labels, and directly learns the end-to-end mapping from image pairs to their transformation parameters. However, extracting the structural features of image pairs with large-scale differences and unclear textures is difficult.
Several recent kinds of research have reported that the structural information is more stable than the gradient or intensity information between multimodal images [26], [27], [28]. LNIFT [29] converts different modes into the same modal based on the local normalized filter, which is robust to severe nonlinear radiation distortion. Although it realizes rotation invariance, it does not have scale invariance. LAM [30] is a robust and effective mismatching elimination algorithm suitable for rigid and nonrigid image matching. The limitation of LAM is that it only considers geometric constraints and is not suitable for image matching that does not meet local affine constraints. Phase congruency (PC) image [31] has the structural information, and there are already lots of methods for feature descriptors based on PC, such as EOH [32], PIIFD [33], and DLSS [34]. However, image noise seriously affects the detection accuracy of PC [35]. Some researchers used Log-Gabor filter (LGF) to improve PC models, such as HOPC [36], PCSD [37], and RIFT [6]. HOPC and RIFT algorithms are typically matching methods of multimodal images. The HOPC algorithm extends the phase consistency algorithm and adds phase direction statistics to enhance the robustness of the description process. However, HOPC has three significant deficiencies as follows.
1) It needs to know the geographic information for the image. However, various multimodal images do not have geographic information. 2) It is sensitive to geometric distortions such as rotation and scale.
3) It is susceptible to nonlinear radiation distortions because it detects feature points by the Harris operator. The RIFT algorithm uses the maximum index map for feature description, which takes into account the rotation invariance. However, RIFT has two disadvantages: 1) it uses the "convolution sequence ring" to deal with the rotation distortion of the image, which may lose some spatial information, resulting in the lack of rotation invariance and unfavorable feature matching; and 2) it does not consider the scale invariance and is easily affected by the image scale distortion.
Although numerous image matching methods have appeared in the past few decades, there is still no unified feature matching framework with rotation-radiation-scale-invariant to automatically match multimodal images. In this article, we propose an automatic feature matching method for multimodal images based on progressive filtering. The main advantages of the proposed method are as follows: 1) A new feature descriptor is established according to the orientation of the average oriented amplitude map based on the two-dimensional (2-D) LGF (2D-LGF), which is more robust in nonlinear radiation distortion differences. 2) It converts the initial correspondence to a convolution filtering and removes the false correspondences by progressive filtering, and then restores the structural consistency of the motion vectors.
The rest of this article is organized as follows. Section II introduces the methodology of the proposed method. Section III presents the experimental results and discussions. Finally, Section IV concludes this article.

II. METHODOLOGY
The implementation of our method is illustrated in Fig. 1. It consists of four steps as follows.

A. Phase Congruency Based on 2D-LGF
Sun et al. [38] first proposed the phase congruency (PC) theory, which points out that the perception of image features by the eyes mainly depends on phase rather than amplitude. Different from the feature detection method based on the gradient in the spatial domain, PC is a feature perception model that detects local energy in the frequency domain. Although the PC model is robust to illumination and contrast differences, the noise and aliasing effects of the edge structures greatly affect the accuracy of feature detection. Do and Vetterli [39] extended PC to be robust to noise by the 2D-LGF. 2D-LGF is constructed by a Gaussian function in multiple directions. In the frequency domain, 2D-LGF is defined as (1) where ρ denotes radius and θ denotes angle in log-polar, ρ s is the center frequency, θ so is the center orientation of 2D-LGF at scale s and orientation o, σ ρ is the width parameter of ρ, and σ θ is the width parameter of θ.
The equation of 2D-LGF in the spatial domain is defined as follows: where L e so (x, y) represents the even-symmetric Log-Gabor wavelet and L o so (x, y) represents the odd-symmetric Log-Gabor wavelet.
The input image I(x, y) is convolved with 2D-LGF to construct the response components at different scales and orientations. The convolution is defined as follows: so (x, y) and L o no (x, y) at scale s and orientation o. The amplitude response component of I(x, y) can be obtained by The phase response component of I(x, y) can be obtained by Based on the A so (x, y) and the ϕ so (x, y), PC is calculated as the ratio of the sum of scale weighted and energy of noise compensation in all directions at point (x, y) to the sum of the average direction and amplitude on the filter response. The final PC model is as follows: is a phase deviation function, and the operator · prevents the enclosed quantity from being negative.

B. Feature Detection
Traditional feature detection algorithms (e.g., SIFT and SURF) construct linear Gaussian scale space. However, Gaussian scale space will result in the loss of detailed information about the image. Nonlinear scale space (NSS) is expected to solve this problem, but the traditional method based on forwarding the Euler scheme has a very short iterative convergence step [40]. In this article, the nonlinear diffusion filter is used to construct a stable NSS, it describes the illumination variation at different scales, which can be expressed as where div is the divergence operator, ∇ is the gradient operator, k is the contrast factor, and ∇I σ represents the gradient after Gaussian filtering. Then, the differential equation is approximated using dispersion analysis. Thus, the discretization of (7) can be expressed as where l indicates the direction, A l is the derivative along l, and Δt is the time step. Scale levels increased logarithmically when a nonlinear scale space is constructed. Different from SIFT, the same resolution is used for all levels of the nonlinear scale space, and the scale between each level is described as where o represents the index of octave O, s represents the index of sub-leave S, and σ 0 is the initial value of scale σ i . As shown in Fig. 2, the image of the last sublevel in each octave is down sampled, and the down sampled image is used as the initial image for the next octave. After the creation of the NSS, attained multiscale images can preserve structural and detailed information. Therefore, it expects that our proposed method can detect more feature points. The NSS is based on the time unit, so we need to convert scale unit σ i to time unit t i , and the relationship is as follows: Since Δt = t i+1 − t i , (8) can be described as where I i+1 represents the solution of the nonlinear diffusion equation.
After constructing the NSS, feature points can be detected by searching local maximum points of the Hessian matrix. The Hessian matrix is calculated as follows: where I xx is the second order derivative in x direction, I yy is the second order derivative in y direction, and I xy is the second order cross derivative in x and y directions.
C. Feature Description 1) Average Oriented Amplitude Map (AAM): Aguilera and Sappa [41] concluded that the distribution of the high-frequency amplitude components is robust to nonlinear radiation variations. The RIFT descriptor calculates the sum amplitudes of all scales to obtain a Log-Gabor layer. Different from RIFT, we use the average amplitude of the different scales to calculate the distribution. The average amplitude is calculated by adding the amplitudes of all scales for each orientation o and then dividing N s , and the formula is as follows: where 2) Direction of Phase Congruency: The odd-symmetric wavelet of 2D-LGF is a smooth derivative filter whose convolution result represents the energy change. Therefore, the direction of PC can be constructed by using the odd-symmetric wavelet of 2D-LGF. The odd-symmetric wavelet can obtain o convolution results according to different directions, which are projected to the x-and y-axes, respectively. The direction of the PC is defined as where PC so (θ) is the convolution result of odd-symmetric wavelet at the direction θ and δ is a minimum in case the denominator is zero.
For multimodal images, gradient inversion may occur when the PC direction exceeds π. In this article, the direction of PC is restricted to 0 to π as 3) Feature Descriptor: For each feature point, a local area with P×P pixels is divided into 6×6 patches, and the block distribution histograms are established by the average oriented amplitude maps and amplitude maps. The local feature descriptors are generated by merging the histograms of each patch. The histogram is divided into 36 equal parts at intervals of 10 ࢪ , and the phase consistency gradients of each equal part are counted. The peak direction of the histogram is selected as the main direction of the features. The specific construction process is shown in Fig. 3, and the main steps of descriptor construction are as follows.
1) The odd and even convolution at scale s and direction o is calculated using 2D-LGF.
2) The average oriented amplitude map is calculated by (13).

D. Feature Matching
Jiang et al. [42] used linear adaptive filtering (LAF) to eliminate mismatches, which takes the local structural consistency as the constraint condition and transforms the local topology space into domain sequence space for consistency measurement to improve the matching accuracy. However, the LAF algorithm uses SIFT algorithm to establish the initial matching set, which largely depends on the local consistency between the potential real inliers. Due to the large nonlinear radiation difference between multimodal images, the initial matching obtained by SIFT algorithm contains a large number of mismatches, and the assumption of local structural consistency cannot satisfy. Therefore, LAF is not suitable for multimodal image matching.
In this article, we use progressive filtering to eliminate mismatches by the structural consistency of the motion vector. The motion vector is defined as the difference between the coordinates of two points with a matching relationship in the sensed image and reference image. The motion vectors formed by correct matches of adjacent pixels have strong structural consistency. The motion vectors formed by mismatches are different from the correct matches and can be regarded as outliers in the smooth field. Therefore, for multimodal image matching, we focus on how to preserve or recover the correctly matched smooth field with a large number of outliers. In this article, we transform the initial matching set into a matrix and removed outliers by kernel convolution. The elements in the matrix can represent the spatial properties of the initial matches, so we can solve the matching problem with kernel convolution filtering.
For two images I and I with overlapping areas, the initial matching set D = {(I i , I i )} N i=1 is extracted by using the proposed method, where I i = (x i , y i ) T and I i = (x i , y i ) T are the coordinates of the ith feature matching point on images I and I . The difference between I i and I i represents motion vector, which is defined as m i = I i − I i . The initial matching set D can be transformed into , then the correct matches are found from the initial correspondences according to the structure congruence of motion vectors.
It is practicable to calculate the average motion vectors of initial correspondences and eliminate mismatches by checking the consistency between each image pair. Therefore, we divide D into c × c nonoverlapping areas and define the average motion vector in the (m, n)th area as where A m,n is the number of motion vectors in the (m, n)th area. The initial set of matches is converted into the estimation of average motion matrixm m,n . We define the deviation between the initial motion vectors and the average motion vectors as . Due to the random distribution of initial motion vectors, we assume that the inlier set obeys the normal distribution e inlier ∼ N (0, σ 2 I) and the outlier set obeys the uniform distribution e outlier ∼ U (−bI, bI), where 0 is the 2-D 0 vector, I is a 2×2 identity matrix.

Algorithm 1: The FMPF Algorithm.
Input: Multi-modal images Output: Correct matching set I * 1 Calculate PC maps by using (1) -(6); 2 Obtain the initial matching set D = {(x i , y i )} N i=1 by using (7)-(12); 3 Initialize c, k and τ ; 4 Convert D to D' and divide it into c × c grids; Iteration: 5 Calculatem by using (16); 6 Calculatem by using (17); 7 Calculate d i by using (19); 8 Determine correct matching set I * by using (20); 9 Update τ ; Until Convergence; 10 Return I * To link the initial correspondences in the surrounding units and improve the robustness and filtering performance, we calculate the typical motion vector of the current unit by convolution theory. The typical motion vectorm m,n is defined as follows: where w is a counting matrix, |w m,n | = A m,n , δ is a minimum in case the denominator is zero. k is a k × k Gaussian kernel distance matrix, which is defined as where c i,j = (i, j) T and c * = (k/2, k/2) T . The deviation between m i andm m,n is defined as Then by comparing the deviation d i with the threshold τ , the correct matching set I * can be obtained as As shown in Fig. 1(d), the initial matching set contains many outliers, which makes it difficult to separate the outliers from the initial matches. As shown in Fig. 1(e), only a litter of mismatches can be filtered out by the given threshold τ . To solve this problem, an iterative method is used to anneal the threshold τ to remove outliers until convergence.
As shown in Fig. 1(f)-(h), mismatches are filtered out gradually as the iterations proceed until the correct matching set is obtained. Meanwhile, the motion vectors corresponding to the correct matching set have structural consistency. Since the feature matching method is based on progressive filtering, the algorithm is named FMPF and the whole algorithm is summarized in Algorithm 1.

III. EXPERIMENTAL SETTINGS AND RESULTS
In this section, we perform extensive experiments to test the performance of the proposed method and compare it with six advanced feature matching methods such as SIFT [10], RIFT [6], VFC [43], LLT [44], LPM [45], and mTopKRP [46]. All experiments are implemented on a laptop with 32 GB RAM, 2.6 GHz Intel Core i7-10750h CPU, and MATLAB R2021a compiler.
A. Settings 1) Evaluation Criteria: Repetition rate (RR) is used to illustrate the robustness of the feature detection of FMPF. RR describes the percentage of repeatable features detected in the image pair and is defined as RR = 2 · n n 1 + n 2 (21) where n is the number of homonymous points, n 1 is the number of feature points on the reference image, and n 2 is the number of feature points on the target image. Precision, recall, and F-score are used to evaluate the feature matching performance with the following definitions: where TP, FP, and FN represent true positive, false positive, and false negative, respectively. In addition, root-mean-square error (RMSE), MEAN, and standard deviation (Std) are used to measure the registration accuracy with the following definitions: 27) where N represents the number of correct matches and (x r i , y r i ) and (x s i , y s i ) are the coordinate of the ith feature matching point on the reference and registration image.
2) Parameter Settings: When using progressive filtering to eliminate mismatches, parameters c, k, and τ severely influence the filtering results. We restrict c to an odd number between 15 and 30, and set k not greater than c/3, as follows: where N is the number of initial correspondences, [] rounds the elements, and odd(c/3) denotes the odd number not greater than c/3. As shown in Fig. 1(d), there are many mismatches in the initial matches. A progressive iteration method is used to gradually separate outliers and outliers, which is similar to the simulated annealing strategy. We set a larger threshold at the beginning of the iteration, and gradually lower the threshold as the number of iterations increases. It eliminates outliers from coarse to fine and optimizes the threshold τ . The inlier set is approximated with the result of each iteration until convergence. In our experiments, reliable matching performance is obtained in four iterations, and the threshold for each iteration can be set to 0.8, 0.2, 0.1, and 0.01. Fig. 4(a)], and the feature points are detected by five methods (i.e., SIFT, OS-SIFT, SURF, RIFT, and FMPF). The feature detection results are shown in Fig. 4(b)-(f). In Fig. 4(b), the feature points detected by SIFT are unevenly distributed. In Fig. 4(c) and (d), the number of feature points detected on the near-infrared image is very small, resulting in a low repetition rate. In Fig. 4(e) and (f), both RIFT and FMPF can detect more feature points, but the feature points detected by FMPF are more evenly distributed. Repetition rates of five methods are computed by using (21), which are 0.1260, 0.0147, 0.0996, 0.1308, and 0.2512, respectively. The repetition rate of FMPF is higher than that of the other four methods, which indicates that the feature detection performance of FMPF is more robust to multimodal images.

B. Detector Evaluation 1) Repetition Rate: A pair of visible-near-infrared images are used as test images [see
2) Rotation Invariance: To verify the rotation invariance of FMPF, a pair of visible-NIR images are chosen for testing. Thirty-six NIR images are obtained by rotating the NIR images from 0-350°with an interval of 10°. These 36 near-infrared images and the visible image consist of 36 image pairs. SIFT, OS-SIFT, SURF, RIFT, and FMPF are used to match those image pairs. Fig. 5 shows the results of the TP and precision on all image pairs. It can be found that the TP and precision of FMPF are higher than that of the other four methods. Specifically, although TP and precision of FMPF are different at different rotation angles, all TPs are greater than 200 and all precisions are greater than 0.2. The results show that the FMPF has rotation invariance in the whole 360°range. In order to qualitatively analyze the matching performance, we selected the matching results at six rotation angles (i.e., 30°, 90°, 150°, 210°, 270°, and 330°) of each method, as shown in Fig. 6. We can find that the matching performance of RIFT is the worst, and there are several-for-one correspondences among the six matching results. Compared with SIFT, OS-SIFT, SURF, and RIFT, FMPF shows the best matching performance.
3) Scale Invariance: To study the impact of scale change, a group of SAR-optical images with different scales are chosen for testing. The scales are 0.5, 1, 1.5, 2, and 2.5, respectively. The matching results of FMPF are shown in Fig. 7, it can be seen that the matching is successful when the scales are 0.5, 1, 1.5, 2, and 2.5, respectively. Although the number of correct matching decreases with the increase of scale, the distribution of feature matching points is relatively uniform. It indicates that FMPF has scale invariance.
1) Feature Matching: We first give the result of feature matching and motion vectors of our FMPF on eight multimodal image pairs in Fig. 8. These image pairs are chosen from the aforementioned datasets, and each dataset contains two pairs. It can be found that our PMFP method can remove most of the mismatches are removed and realize structural congruence from  The feature matching performance of FMPF is quantitatively compared with four advanced feature matching methods, including VFC, LLT, LPM, and mTopKPR. The local features of the four comparison methods are detected in the nonlinear scale space, which is the same as FMPF. We calculate the precision, recall, and F-score on all MIDs, and then plot precision-recall and cumulative distribution curves of F-score, as shown in Fig. 9.
The cumulative distribution curve indicates that 100×x% of the image pairs have performance values not greater than y. The larger the precision, recall, and F-score, the better the matching performance. The average precision, recall and F-score of FMPF are (73.55%, 95.50%, 0.8136), (58.13%, 76.83%, 0.6508), (54.97%, 88.96%, 0.6627), and (43.21%, 78.12%, 0.5286). The feature matching performances are characterized by the cumulative distribution of the F-score, and it can be noted that our FMPF method is superior to the other four advanced feature matching methods.
2) Image Registration: The key point of image registration is whether the transformed image can maximize the alignment of overlapping regions. We use the affine transformation model to transform the sensed image after feature matching. Fig. 10 shows the eight multimodal image pairs (left) and the visual  registration results (right). It can be found that FMPF achieves satisfactory performance and obtain high registration accuracy.
Then, the registration accuracies of VFC, LLT, LPM, and mTopKPR are quantitatively compared and the cumulative distribution curves are plotted. The cumulative distribution curve indicates that 100×x% of the image pairs have performance values not greater than y (i.e., RMSE, MEAN, and Std). The less the RMSE, MEAN, and Std, the better registration performance. As shown in Fig. 11, the average RMSE, MEAN, and Std of mTopKPR are the largest, while the average RMSE, MEAN, and Std of FMPF are the smallest on four multimodal image datasets. It can be concluded that the accuracy of FMPF on RMSE, MEAN, and Std is higher than that of the other four algorithms.

D. Experimental Results on MRSIPs
In Section III-C, all MIDs have no coordinate information. In this part, six types of multimodal remote sensing image pairs (MRSIPs) with geographic coordinate information are selected for testing. These MRSIPs include OPT-OPT images, SARoptical images, day-night images, map-optical images, infraredoptical images, and depth-optical images, which consider almost all applications of multimodal remote sensing images (MRSI) matching, such as image fusion, image interpretation, and image registration. The six MRSIPs are displayed in Fig. 12. The size of MRSI ranges from 400×400 pixels to 1000×1000 pixels. We can find that the geometric distortions and radiation distortions of the MRSIPs are much more serious than those of MIDs. Therefore, it is essential to test our proposed method's robustness on multimodal remote sensing image matching. Qualitative and quantitative experiments are conducted to evaluate the accuracy of feature matching and image registration on these MRSIPs. 1) Feature Matching: Two descriptors (SIFT and RIFT) and four mismatch removal methods (VFC, LLT, LPM, and mTop-KPR) are used to compare with the proposed FMPF method.
a) The qualitative results of feature matching are shown in Fig. 13. SIFT failed to match the MRSIPs in Fig. 13(a2), (a5), and (a6) while successfully matching in Fig. 13(a1), (a3), and (a4). SIFT performs blur processing on the MRSI using the Gaussian pyramid to construct the image scale space, which leads to the weakening of image texture edge features and the difficulty of extracting contour edge features. In summary, the traditional Gaussian pyramid scale space is not conducive to MRSI matching.
Although the LPM algorithm matches successfully, there are a small number of FPs in Fig. 13(e2), (e3), and (e4) (i.e., 2, 2, and 1, respectively). If we use FPs to transform the image, it will increase the error of image matching. Moreover, the average TP of the LPM algorithm is 38.83, which is less than that of RIFT, VFC, and FMPF. In summary, the LPM algorithm is not good for MRSI matching.
The matching results of VFC, LLT, mTopKPR, and FMPF are all successful, and the average TPs are 47.83, 32.67, 29.83, and 220, respectively. LLT has the least TPs on the OPT-OPT dataset, SAR-optical dataset, day-night dataset, and map-optical   Tables I and II. SIFT performs better on the OPT-OPT dataset, the day-night dataset, and the map-optical dataset than on the other three datasets. In three successfully matched image pairs, all F-scores are very small (smaller than 0.02). It proves that SIFT is not suitable for matching multimodal remote sensing images.
The F-score of LPM on six data sets is much lower than that of the other four mismatch removal algorithms. The main reason may be that TP contains a small amount of NP, resulting in a   significant decrease in the F-score. We can clearly observe that the F-score of FMPF reaches the maximum, which demonstrates its robustness and effectiveness on MRSIPs.
2) Image Registration: The affine transformation model is utilized to transform the sensed image. We compare the registration results of SIFT and FMPF, as shown in Fig.14. Since the TP of Fig. 13(a2), (a5), and (a6) is 0, there are no true matches to compute the homography model, resulting in the failure of image registration of Fig. 14(a2), (a5), and (a6). In Fig. 14(a1), (a3), and (a4), SIFT performed better registration performance on the OPT-OPT dataset, the day-night dataset, and the map-optical dataset because of its resistance to illumination changes. In Fig. 14(b1)-(b6), all the registered image has achieved a good alignment effect with our FMPF method, and the registration performance is better than SIFT.
We use the affine transformation model to transform the sensed image after feature matching. We select 20 pairs of landmarks manually as the truth values, they are uniformly distributed around the region of interest of each multimodal image pair. The RMSE of the transformation residual error is computed by the homography model. The less the RMSE, the better the registration performance of corresponding points. The RMSE of each method of MRSIPs is shown in Tables III and IV. As can be seen, the proposed FMPF method achieves the best performance followed by RIFT. SIFT obtains the worst RMSE performance, which proves that it is not suitable for multimodal image registration. The average RMSE of VFC, LLT, LPM, and mTopKPR is 122.83, 47.83, 32.67, 29.83, and 220, respectively. LPM and mTopKPR have competitive performance because they can maintain reliable matches, which can estimate the transformation correctly. As for VFC and LLT, some obvious false correspondences may be preserved, resulting in relatively poor registration performance.

IV. CONCLUSION
We propose a feature matching method for multimodal images based on progressive filtering in this article. We conduct a series of experiments on MIDs and MRSIPs, the results show that our proposed FMPF method outperforms the other six advanced feature matching methods. However, the proposed FMPF method is tested only on rigid multimodal images. For non-rigid multimodal images, there may be only a small number of correct matches. The locations of the inliers can be very scattered, so the initial correspondences may not have structural consistency. We plan to use different feature detection and description methods to create more effective correspondences on nonrigid multimodal images in the feature work. In the feature matching stage, there are four empirical thresholds τ 1 -τ 4 . We use an iterative strategy to remove the mismatches progressively, which is similar to deterministic annealing. We use the same threshold for the MDIs and MRSIPs. We plan to use the dynamic adaptive threshold to remove the mismatches for different datasets in the feature work.