Introduction
Template matching is a foundational process for aligning two images of the same scene that differ in size: a smaller template image is positioned within a larger search image. This procedure geometrically aligns the template and search images, which may have been captured at different times using the same or different sensors. Template matching stands as a critical preprocessing step in various remote sensing applications, including image recognition [1], [2], Earth observation [3], and plant breeding [4].
Optical-SAR image registration, a key application, aligns visually distinct images captured by optical and radar sensors, enabling comprehensive land cover mapping and all-weather monitoring. Change detection [5], [6], [7] utilizes multitemporal images to identify land alterations, while image fusion [8], [9], [10] combines multiple image sources to produce enhanced, more informative representations. These applications benefit significantly from the integration of optical and SAR data, leveraging their complementary characteristics.
Notably, multisensor template matching makes it possible to exploit images from multiple sensors simultaneously, resulting in increased data volume, richer information, and the exploitation of complementary characteristics. The combined use of heterogeneous images enhances the efficacy of Earth observation [11], [12], [13], enabling more comprehensive and accurate analysis across diverse remote sensing applications.
Synthetic aperture radar (SAR) and optical images are the two most widely employed modalities in multisensor remote sensing for Earth observation. As depicted in Fig. 1, optical images provide color and brightness information conducive to direct human observation, albeit susceptible to environmental influences. Conversely, SAR images offer the advantage of all-weather and all-day observation. Integrating these two image types capitalizes on object radiation data, enhancing remote sensing interpretation capabilities [14]. Furthermore, their combined utilization facilitates the identification of concealed features within target areas. Fig. 1(e) and (f) illustrates the deep features extracted by the feature extraction module from the SAR and optical images. These features encapsulate hierarchical and spatial information derived from the input images, providing a visual representation of each modality's content. The visualizations highlight the significant modality differences between SAR and optical images. As shown in Fig. 2, our method accurately locates the corresponding region despite these modality differences.
Example of SAR and optical images. Optical images encode information about the nature of surface materials, while SAR images measure physical surface characteristics. A comparison of Figures (c), (d) and (e), (f) highlights the significant differences between SAR and optical images. (a) SAR image. (b) Optical image (grayscale). (c) Pixel value distribution of the SAR image. (d) Pixel value distribution of the optical image. (e) Feature of the SAR image. (f) Feature of the optical image.
Example result of heterogeneous template matching. This figure illustrates the process and outcome of matching an SAR template to an optical image. (a) Grayscale SAR template image. (b) Optical search image with a gray region representing the search space, where red dashed arrows indicate potential matching locations and a solid red arrow points to the correct match. (c) GT matching region highlighted by a red box in the optical image. (d) Final result showing the SAR template correctly positioned within the optical image. This example demonstrates successful heterogeneous template matching, accurately locating an SAR image template within a larger optical image despite the differences in imaging modalities. (a) Template image. (b) Search image. (c) GT matching region. (d) Matching result.
Numerous approaches have emerged to address the template matching problem. Dekel et al. [15] introduced the Best-Buddies Similarity measure, while deformable diversity similarity explicitly accounts for potential template deformations [16]. The above-mentioned methods employ distinct transformations to convert template and search features into a matching score map, constituting the prevailing general architecture capable of yielding satisfactory results on homologous datasets. However, these similarity-based methods often falter when applied to heterogeneous template matching. Variations in imaging models and sensors engender significant differences in heterogeneous images. These pronounced modality differences directly undermine feature similarity, leading to suboptimal performance of similarity-based methods. Ye et al. [17] noted this problem and proposed an optical-to-SAR registration method that integrates the fast Fourier transform and a weighted edge density map. Recent advancements in SAR image coregistration have also contributed to the field of image alignment. Pallotta et al. [18], [19] proposed extensions to the constrained least-squares optimization method, addressing challenges such as joint rotation effects, range/azimuth shifts, and trajectory sensor inaccuracies in SAR image coregistration. These developments offer valuable insights for improving template matching across different modalities.
In this study, we introduce the cosine similarity template matching network (CSTM-Net), an advanced heterogeneous template matching method. CSTM-Net comprises three main modules: a feature extraction module, a cosine similarity (CS) module, and a matching module (MM). The feature extraction module employs a symmetric network with ResNet18 backbones to extract multiscale features that address the challenges posed by modality differences. The CS module constructs the cost volume by calculating the correlation between SAR and optical image features using a CS algorithm. Finally, the MM processes the cost volume to regress predicted heatmaps and generate the final matching points. In addition, we propose a novel pooling heatmap loss function to supervise the predicted heatmaps, which enhances gradient descent smoothness. Our contributions are summarized as follows.
We propose a CS-based correlation framework that constructs cost volumes by focusing on feature vector directions rather than magnitudes, effectively capturing feature correspondences and mitigating the impact of modality gaps. Coupled with spatial search operations, our framework achieves state-of-the-art accuracy in heterogeneous template matching.
We devise the pooling heatmap loss function to supervise heatmap predictions, resulting in smoother gradient descent and improved convergence compared to direct supervision of matching points, thereby enhancing the accuracy of locating matching regions.
We contribute a novel heterogeneous image dataset spanning four seasons: spring, summer, fall, and winter. It comprises SAR and optical image pairs, providing a robust benchmark for evaluating cross-modality template matching under diverse environmental conditions.
Related Work
This section provides a concise overview of prior research pertinent to template matching. Template matching has evolved significantly over the years, addressing various challenges in image processing and computer vision. We categorize these methods into three main groups: similarity-based, local feature-based, and deep learning-based. Each category has its strengths and limitations, and understanding their progression is crucial for contextualizing our proposed CSTM-Net.
A. Similarity-Based Methods
Similarity-based template matching methods aim to determine the optimal transformation that maximizes the similarity between the target template image and the corresponding region in the search image. These methods have been beneficial in scenarios where global image characteristics are more important than local features.
Early approaches in this category focused on simple similarity measures, such as cross-correlation. However, as the field progressed, more sophisticated techniques emerged to address challenges such as sensor variance and complex image transformations. For instance, mutual information-based methods [20] have shown high effectiveness in dealing with multimodal images. Gao et al. [21] proposed a template matching method utilizing differentiable coarse-to-fine correspondence refinement. Ye et al. [11] employed a hybrid matching method using attention-enhanced structural features, combining the advantages of handcrafted-based and learning-based methods to improve the accuracy of optical and SAR image matching. Xiong et al. [22] applied a registration algorithm for optical and SAR images via adjacent self-similarity. Zhang et al. [23] supervised pixel-level dense features of local optical and SAR image blocks using SSD loss. Corona et al. [24] proposed a novel template matching approach for image rotation scenarios. While these similarity-based methods have shown promise, they often struggle with complex transformations and large appearance differences between modalities. Our CSTM-Net addresses these limitations by incorporating a CS algorithm within a deep neural network architecture.
B. Local Feature-Based Methods
In contrast to similarity-based methods, local feature-based approaches achieve matching by extracting prominent features such as points, lines, edges, and contours. These methods are designed to be invariant to changes, such as scale, rotation, and illumination, making them particularly useful for matching images with significant geometric differences.
Numerous researchers have proposed solutions combining these features to enhance template matching. Wang et al. [25] introduced a hybrid cGAN that combines the strengths of convolutions and vision transformers. Vora et al. [26] presented a volume rendering-based neural surface reconstruction method that can not only complete the surface geometry but also reconstruct surface details to a reasonable extent from a few disparate input views. Liu et al. [27] presented a new multimodal image matching method, addressing high contrast and noise issues. Xiang et al. [13] focused on geometric disparities and proposed a robust global-to-local registration algorithm. Wang et al. [28] proposed a hierarchical extract-and-match transformer. Jhan et al. [29] introduced a normalized speeded-up robust features approach, significantly increasing the correct matching points among different image pairs and enabling one-step image registration. Nguyen et al. [30] devised a method capable of recognizing new objects and estimating their 3D pose in optical images, even under partial occlusions. Ye et al. [31] designed a feature descriptor based on the histogram of oriented phase congruency to match corner points across multimodal remote sensing images. Misra et al. [32] provided a comprehensive review of feature-based remote sensing image registration techniques, comparing various feature detection, description, and outlier removal methods for multitemporal and multimodal images. Despite the robustness of local feature-based methods to certain image transformations, they can fail when dealing with significant appearance changes between different modalities. Our proposed CSTM-Net overcomes this challenge by learning to extract modality-robust features that capture the essential structural information across different sensor types.
C. Deep Learning-Based Methods
Different from previous algorithms, deep learning methods eliminate the need for specialized knowledge and the extensive design time required for handcrafted feature extraction operators. In recent years, they have made significant strides in numerous remote sensing applications, offering enhanced accuracy and robustness in object classification and detection tasks [33].
The divisive input modulation algorithm extracts additional templates from the background and pits them against each other in matching competition [34]. Li et al. [35] proposed a method based on contrastive learning to perform dense and consistent InfoNCE loss during matching. Gao et al. [36] investigated whether enhancing the CNN's encoding of shape information can yield more distinguishable features to improve template matching performance. Wei et al. [37] introduced cross-fusion reasoning and wavelet decomposition generative adversarial networks to preserve structural details and enhance high-frequency band information. Gazzea et al. [38] presented an end-to-end machine learning method for accurate SAR-optical image matching, using a siamese multiscale attention-gated residual U-Net for feature extraction. Jonghee et al. [39] presented a robust and efficient template matching method based on scale-adaptive deep convolutional features. Rosa et al. [40] proposed an online deep clustering method utilizing crop label proportions as priors for learning a sample-level classifier. Fang et al. [41] developed an end-to-end deep learning model for SAR-optical matching based on a siamese U-Net with an FFT correlation layer. While these deep learning approaches have demonstrated impressive results, many still struggle with heterogeneous data, particularly in SAR-optical image matching. Our CSTM-Net builds on these advances, addressing cross-modal matching challenges through a novel architecture that combines the strengths of stereo matching techniques with deep learning.
The evolution of template matching methods has seen a progression from simple similarity measures to sophisticated local feature extraction techniques, and now to powerful deep learning approaches. However, challenges remain, particularly in dealing with cross-modal data.
Our proposed CSTM-Net addresses these challenges by incorporating a CS algorithm within a deep learning framework, thus combining the strengths of similarity-based and deep learning-based methods. It learns to extract modality-robust features, overcoming the limitations of traditional local feature-based methods. In addition, CSTM-Net utilizes a novel architecture inspired by stereo matching techniques, which proves particularly effective for cross-modal matching tasks. By building on the strengths of previous approaches while addressing their limitations, CSTM-Net achieves state-of-the-art performance in template matching, particularly for heterogeneous datasets.
Method
This section presents our proposed CSTM-Net, illustrated in Fig. 3, which comprises interconnected modules to achieve efficient and accurate template matching. First, we give detailed explanations of these modules in Section III-A. Next, we introduce the custom pooling heatmap loss function, designed to improve the convergence and smoothness of CSTM-Net, in Section III-B. Finally, we present our proposed dataset, covering its image distribution and download links, in Section III-C. For ease of description, we assume the coordinate system's origin is the upper left corner of the search image. The sizes of the input template and search images are $h \times w$ and $H \times W$, respectively.
Overview of the proposed CSTM-Net architecture. First, we apply a symmetric network structure to extract multiscale features. The extracted image features then form a cost volume and enter the MM for regularization, refinement, and regression. Our approach uses spatial search operations and CS to build the cost volume, preserving the feature map's spatial resolution for better results.
A. Module Details
Feature extraction module: We employ a symmetric network structure to reduce modality differences between input images. This structure consists of two independent neural networks, each processing an input image separately and transforming it into a more similar feature space. We use ResNet18 [42] as our feature extraction module due to its ability to preserve low-resolution representation information. The deep structure and skip connections in ResNet18 enable effective learning and extraction of complex features, maintaining strong performance even with lower resolution.
Given ResNet18’s success in feature extraction and the need to handle images from different modalities with substantial imaging differences, we adopt a symmetric ResNet18-based architecture for feature extraction. The two ResNet18 networks have identical structures but do not share weights. This design allows the extraction of standardized similarity features, partially bridging the gap between diverse input modalities. This operation can be concisely represented as follows:
\begin{align*}
F_{t} &= \text{ResNet18}(T; \theta _{t}) \\
F_{s} &= \text{ResNet18}(S; \theta _{s}) \tag{1}
\end{align*}
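As a concrete illustration of (1), the PyTorch sketch below instantiates two ResNet18 branches with identical structure but unshared weights. The truncation after the first residual stage (chosen so that the output stays at 1/4 of the input resolution, consistent with the $H/4$ and $W/4$ terms used later), the module name `SymmetricExtractor`, and the example input sizes are our illustrative assumptions, not the exact CSTM-Net configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SymmetricExtractor(nn.Module):
    """Two ResNet18 branches with identical structure but unshared weights,
    mirroring Eq. (1). Hypothetical sketch: layers are kept up to `layer1` so
    the features stay at 1/4 resolution; the paper's truncation point may differ."""
    def __init__(self):
        super().__init__()
        self.branch_t = self._make_branch()  # theta_t: template (SAR) branch
        self.branch_s = self._make_branch()  # theta_s: search (optical) branch

    @staticmethod
    def _make_branch():
        net = resnet18(weights=None)
        # Stem plus the first residual stage: overall stride 4, 64 channels.
        return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool, net.layer1)

    def forward(self, template, search):
        f_t = self.branch_t(template)  # F_t in Eq. (1)
        f_s = self.branch_s(search)    # F_s in Eq. (1)
        return f_t, f_s

if __name__ == "__main__":
    extractor = SymmetricExtractor()
    t = torch.randn(1, 3, 256, 256)   # SAR template (replicated to 3 channels)
    s = torch.randn(1, 3, 512, 512)   # optical search image (illustrative size)
    f_t, f_s = extractor(t, s)
    print(f_t.shape, f_s.shape)       # (1, 64, 64, 64) and (1, 64, 128, 128)
```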
CS module: Following feature extraction, a formalization module is required to establish correlation information between image features. Traditionally, various methods employ unique transformations to convert template and search features into a matching score map, representing a prevailing architecture. In this study, we employ the CS algorithm to accomplish this task due to its ability to capture the angular similarity between feature vectors, making it particularly suitable for comparing heterogeneous data, such as SAR and optical images. The operational details of the CS module are elucidated in the following.
As depicted in Fig. 4, the template image feature and search image feature are denoted as $F_{t}$ and $F_{s}$, respectively. At the feature resolution, the size of the limited search space, i.e., the number of candidate matching positions, is
\begin{align*}
S = \left(\frac{H}{4} - \frac{h}{4}\right) \times \left(\frac{W}{4} - \frac{w}{4}\right). \tag{2}
\end{align*}
Structure of CS module. The module applies a CS algorithm and spatial search operators to construct the cost volume.
The CS module constructs what we refer to as a “cost volume,” borrowing from the concept in stereo matching. Although the output is a 2-D matrix, the term “cost volume” reflects the encoding of similarity scores across all feasible matching regions. For each position $(i, j)$ in the search space, the CS between the template feature $F_{t}$ and the corresponding search patch $P_{ij}$ is computed as
\begin{align*}
C_{ij} = \frac{\sum _{x,y,c} F_{t}(x,y,c) \cdot P_{ij}(x,y,c)}{\sqrt{\sum _{x,y,c} F_{t}(x,y,c)^{2}} \cdot \sqrt{\sum _{x,y,c} P_{ij}(x,y,c)^{2}}} \tag{3}
\end{align*}
At the outset of the CS module, we traverse the limited search space and extract patches $P_{ij}$ from the search feature $F_{s}$, each with the same spatial size as the template feature $F_{t}$. Each patch is then compared with the template feature using (3), and the resulting scores fill the corresponding entries of the cost volume.
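A minimal sketch of this patch-wise traversal is given below, assuming a stride-1 spatial search over the feature maps; the unfold-based vectorization is our own convenience and not necessarily how the CS module is implemented in practice.

```python
import torch
import torch.nn.functional as F

def cosine_cost_volume(f_t: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
    """Build the 2-D cost volume of Eq. (3) by sliding the template feature
    over the search feature and computing cosine similarity at each offset.

    f_t: template features, shape (B, C, h, w)
    f_s: search features,   shape (B, C, H, W), with H >= h and W >= w
    Returns a similarity map of shape (B, H - h + 1, W - w + 1)."""
    b, c, h, w = f_t.shape
    patches = F.unfold(f_s, kernel_size=(h, w))           # (B, C*h*w, L), one column per P_ij
    template = f_t.reshape(b, c * h * w, 1)               # (B, C*h*w, 1)
    sim = F.cosine_similarity(patches, template, dim=1)   # Eq. (3), evaluated for all (i, j)
    H, W = f_s.shape[-2:]
    return sim.view(b, H - h + 1, W - w + 1)
```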
MM: Upon obtaining the initial cost volume through a correlation-based computation, the subsequent step involves encoding and decoding it to generate the final prediction heatmap. As illustrated in Fig. 5, the MM incorporates a two-stage hierarchical decoding approach to process the feature maps, maximizing the utilization of information across height, width, and channel dimensions.
Structure of MM. The module applies 2-D deconvolution to progressively upsample and decode the cost volume, followed by a final convolution layer to regress the heatmap. Here, b, c, h, and w represent batch size, channel, height, and width, respectively.
In each decoding stage, the feature maps undergo a series of transformations. Initially, a convolution operation reduces the channel dimension. This is followed by a deconvolution that doubles the spatial dimensions while halving the channel count. Each of these operations is accompanied by normalization and nonlinear activation, ensuring the network can learn complex, nonlinear relationships in the data. After the second decoding stage, which further increases the spatial resolution, the resulting feature map is processed by a final layer.
A key aspect of this architecture is its ability to generate multiple heatmaps at different scales. These include the output of the cost volume encoding, an intermediate output after the first decoding stage, and the final output after the complete decoding process. The model is trained using a multiscale loss function that leverages these multiple outputs. This loss function computes the absolute difference between the predicted heatmaps and the ground truth (GT) label at each scale. To facilitate this comparison, the GT is progressively downsampled to match the resolution of each predicted heatmap. The loss at each scale is weighted differently, with increasing importance assigned to the finer scales, thereby emphasizing the final high-resolution prediction while still considering the coarser predictions. This hierarchical decoding approach, combined with the multiscale loss, allows the model to learn to make accurate predictions at various resolutions.
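The sketch below outlines one possible realization of this two-stage decoder with multiscale heatmap heads. Channel widths, kernel sizes, and the class name `MatchingModule` are illustrative assumptions; only the overall encode, conv-deconv, and regress pattern follows the description above.

```python
import torch
import torch.nn as nn

class MatchingModule(nn.Module):
    """Hypothetical sketch of the MM: a light encoder lifts the 2-D cost volume
    to `c` channels, then two decoding stages each apply a convolution followed
    by a deconvolution that doubles the spatial size while halving the channels.
    A 1x1 convolution regresses a heatmap at every scale for the multiscale loss."""

    def __init__(self, c: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, c, 3, padding=1),
                                     nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        self.stage1 = self._stage(c, c // 2)
        self.stage2 = self._stage(c // 2, c // 4)
        self.head0 = nn.Conv2d(c, 1, 1)       # heatmap from the cost-volume encoding
        self.head1 = nn.Conv2d(c // 2, 1, 1)  # intermediate heatmap
        self.head2 = nn.Conv2d(c // 4, 1, 1)  # final high-resolution heatmap

    @staticmethod
    def _stage(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )

    def forward(self, cost):                  # cost: (B, 1, H', W')
        e = self.encoder(cost)
        d1 = self.stage1(e)
        d2 = self.stage2(d1)
        # Three heatmaps at increasing resolution for the multiscale loss.
        return self.head0(e), self.head1(d1), self.head2(d2)
```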
B. Loss Function
After obtaining the predicted heatmap in the preceding sections, the final step involves regressing the matching points from the predicted heatmap. While directly supervising the predicted matching points through a differentiable regression formula on the heatmap has been a common approach, it possesses several drawbacks. Chiefly, this approach can lead to unstable convergence, especially in complex scenarios where the heatmap distribution is far from ideal. Direct point supervision tends to focus narrowly on the predicted point, ignoring the contextual information encoded in the heatmap. As a result, the model may fail to fully capture the underlying spatial relationships and structural patterns crucial for accurate matching. Moreover, directly supervising the predicted point restricts the choice of regression methods, limiting the flexibility and effectiveness of the overall approach. Different regression techniques may offer advantages in different scenarios, but traditional point supervision methods do not readily accommodate the integration of various regression methods. This constraint can hinder the exploration of more optimal solutions tailored to specific tasks.
In contrast, the proposed loss function that supervises the predicted heatmap avoids these issues for the following reasons.
Focusing on the heatmap as a whole allows the model to learn from the contextual information and spatial relationships encoded within it, obtaining smoother convergence.
The loss function can also obtain the matching point without affecting the backpropagation process.
This flexibility opens up new possibilities for exploring and leveraging the most appropriate regression techniques for a given task. The detailed components of this approach are outlined as follows.
Classification part: We employ the ArgMax operation on the predicted heatmap to determine the matching point $(\hat{d}_{x}, \hat{d}_{y})$ as follows:
\begin{align*}
\hat{d}_{x}&= \operatorname{ArgMax}\left(\sigma \left(\hat{d}, x\right)\right) \\
\hat{d}_{y}&= \operatorname{ArgMax}\left(\sigma \left(\hat{d}, y\right)\right) \tag{4}
\end{align*}
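Interpreting $\sigma(\hat{d}, \cdot)$ in (4) as a SoftMax normalization of the heatmap followed by marginalization over each axis (our reading, consistent with the post-SoftMax distributions discussed later), the classification step can be written compactly as follows.

```python
import torch

def heatmap_argmax(heatmap: torch.Tensor):
    """Recover the matching point from a predicted heatmap in the spirit of
    Eq. (4): SoftMax-normalize, marginalize over each axis, then take ArgMax.
    `heatmap` has shape (H', W'); a minimal sketch of the classification step."""
    prob = torch.softmax(heatmap.flatten(), dim=0).view_as(heatmap)
    d_y = prob.sum(dim=1).argmax()   # marginal over columns -> row index (y)
    d_x = prob.sum(dim=0).argmax()   # marginal over rows    -> column index (x)
    return int(d_x), int(d_y)
```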
Point transformation part: As depicted in Fig. 6, the initial GT data encompasses a single matching point. To effectively supervise the predicted heatmap, it is necessary to convert this GT point into a corresponding labeled heatmap. The process is represented as follows:
\begin{align*}
d_{i, j}=\frac{\exp \left(-\sqrt{(i-x)^{2}+(j-y)^{2}} / \omega \right)}{\sum _{i, j} \exp \left(-\sqrt{(i-x)^{2}+(j-y)^{2}} / \omega \right)} \tag{5}
\end{align*}
Example of transforming a matching point into a heatmap. By employing this operation, we transform the original GT point into a heatmap. Compared to the point label, our heatmap label varies more smoothly, which makes training easier and allows smoother convergence.
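For reference, the label-generation step of (5) can be sketched as below. The coordinate convention (x indexing columns, y indexing rows) and the default decay weight of 1.2 (the value reported as best in Fig. 10) are our assumptions about reasonable defaults.

```python
import torch

def point_to_heatmap(x: int, y: int, height: int, width: int,
                     omega: float = 1.2) -> torch.Tensor:
    """Convert a GT matching point into a label heatmap following Eq. (5):
    an exponentially decaying distance kernel, normalized to sum to one."""
    jj, ii = torch.meshgrid(torch.arange(height, dtype=torch.float32),
                            torch.arange(width, dtype=torch.float32),
                            indexing="ij")
    dist = torch.sqrt((ii - x) ** 2 + (jj - y) ** 2)   # Euclidean distance to (x, y)
    heat = torch.exp(-dist / omega)
    return heat / heat.sum()                           # normalized label heatmap d_{i,j}
```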
Loss function part: Since the original label point has been converted into a label heatmap, we can use the GT heatmap to supervise the predicted heatmap rather than the regressed point. Meanwhile, the nondifferentiability issue that previously blocked backpropagation no longer arises. Since template matching is treated as a regression problem in this article, we utilize the L1 loss function to learn the heatmap prediction. The process can be represented as follows:
\begin{align*}
\mathcal {L}=\frac{1}{N} \sum _{n=1}^{N}\left\Vert \hat{d}_{x, y}-d_{x, y}\right\Vert \tag{6}
\end{align*}
Moreover, we introduce a pooling heatmap loss function to enhance the heatmap loss function. This approach utilizes max pooling operations on the predicted and GT heatmaps to capture salient features at varying scales. In our implementation, we first calculate the L1 loss between the predicted heatmap and the corresponding original label heatmap, resulting in the loss $\mathcal {L}_{1}$. We then apply max pooling to both the predicted and label heatmaps and recompute the L1 loss at two successively pooled scales, yielding $\mathcal {L}_{2}$ and $\mathcal {L}_{3}$. The total pooling heatmap loss is
\begin{align*}
\mathcal {L}_{\text{total}}=\mathcal {L}_{1}+2\mathcal {L}_{2}+4\mathcal {L}_{3}. \tag{7}
\end{align*}
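A minimal sketch of this combined loss is given below, assuming 2 x 2 max pooling between scales; the pooling kernel size is our assumption, while the 1 : 2 : 4 weighting follows (7).

```python
import torch
import torch.nn.functional as F

def pooling_heatmap_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Sketch of the pooling heatmap loss of Eq. (7): L1 loss on the original
    heatmaps (L1), plus L1 losses on max-pooled versions of both heatmaps at
    two coarser scales (L2, L3), combined with weights 1 : 2 : 4.
    pred, label: tensors of shape (B, 1, H', W')."""
    l1 = F.l1_loss(pred, label)
    pred2, label2 = F.max_pool2d(pred, 2), F.max_pool2d(label, 2)
    l2 = F.l1_loss(pred2, label2)
    pred3, label3 = F.max_pool2d(pred2, 2), F.max_pool2d(label2, 2)
    l3 = F.l1_loss(pred3, label3)
    return l1 + 2.0 * l2 + 4.0 * l3
```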
C. Datasets
Excellent datasets are crucial for advancing remote sensing applications. However, we have observed a significant shortage of publicly available large-scale template matching datasets, which hinders the progress of template matching research, especially concerning deep learning algorithms. To address this gap and strengthen the benchmark for evaluating template matching algorithms, we propose the spring, summer, fall, and winter datasets. These datasets are derived from the SEN1-2 dataset [43], a comprehensive image fusion dataset. Due to substantial seasonal variations, we treat and evaluate each dataset separately.
Our datasets are categorized into four seasons: Spring, summer, fall, and winter. Within each season, we further divided the data into multiple groups, each consisting of 289 SAR images and one optical image. Thus, for each group, there are 289 template images and one search image available.
On the spring dataset, we have 75 540 groups for training, 40 for testing, and 144 for validation. Similarly, the summer dataset has 53 378 training groups, 29 testing groups, and 101 validation groups. For the fall dataset, the numbers are 86 160 groups for training, 48 groups for testing, and 162 groups for validation. Finally, we have 66 619 training groups, 37 testing groups, and 126 validation groups on the winter dataset. Notably, the search images are optical images, sized
Our datasets offer a diverse range of SAR images capturing various seasonal conditions paired with corresponding optical images. In addition, we provide comprehensive label information detailing the coordinates of each SAR image within its corresponding optical image. This label provision facilitates precise localization, serving as a pivotal reference point for template matching algorithm development and evaluation.
Experimental images from our datasets. (a) Spring. (b) Summer. (c) Fall. (d) Winter. (a)–(d) are heterogeneous datasets. The method can be used for heterogeneous template matching.
Dataset distribution overview. We illustrate the distribution of datasets categorized by season. The dataset is divided into training, testing, and validation sets, with the number of samples indicated for each category. The dataset is further divided by season: Spring, summer, fall, and winter.
Dataset distribution overview. We present the distribution of datasets categorized by season and dataset type. Each panel shows the pixel value distribution for optical images and accumulated SAR images for the respective season.
To facilitate seamless utilization, we furnish explicit usage guidelines and licensing information. Researchers are encouraged to download and employ the datasets, ensuring appropriate citation in their works. Our datasets are available at: Baidu Disk.1
While our dataset provides a comprehensive view of seasonal variations based on the SEN1-2 source, we recognize the need for greater diversity in data sources. To address this, future work will focus on expanding the dataset to include images from various sensors, such as SENTINEL-1/2 [44] and Landsat. This will cover a wider range of geographical locations and land cover types. In addition, we plan to incorporate challenging scenarios, including extreme weather conditions and areas experiencing rapid human-induced changes. Despite its current limitations, our dataset offers unique insights into seasonal effects on template matching, serving as a valuable complement to existing datasets in the field.
Experiments
In this section, we assess the effectiveness of the method across the spring, summer, fall, and winter template matching datasets. First, we delineate our implementation specifics concerning training and testing, as detailed in Section IV-A. Next, as Section IV-B describes, we conduct experiments on the fall dataset to evaluate how each module impacts the overall experimental results. Then, we compare the effect of different loss functions on the final result in Section IV-C. Finally, as elaborated in Section IV-D, we quantify the performance of our method across all four datasets and compare it against state-of-the-art approaches. Moreover, we conduct an in-depth analysis of different similarity measure methods to elucidate the approach's superiority in Section IV-E.
A. Implementation Details
Training and testing: For the training process, the CSTM-Net is trained on 2 NVIDIA GeForce RTX3090 GPUs. We employ the SGD optimizer with a weight decay of
Preprocessing: SAR and optical images have different resolutions and channel counts, for which we performed some preprocessing steps. Specifically, images are transformed to a standard format and, when necessary, SAR images are duplicated across channels to match the three-channel structure of optical images. In addition, we normalize the images using calculated mean and standard deviation values to ensure consistency across datasets.
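A simplified version of these preprocessing steps might look as follows; the placeholder normalization statistics are illustrative and would be replaced by values computed from each training split.

```python
import torch

def preprocess(sar: torch.Tensor, optical: torch.Tensor,
               sar_stats=(0.5, 0.25), opt_stats=(0.5, 0.25)):
    """Minimal preprocessing sketch: replicate the single SAR channel to three
    channels to match the optical images, then normalize both modalities with
    per-dataset mean/std. The (mean, std) defaults here are placeholders."""
    if sar.shape[0] == 1:                       # (1, H, W) -> (3, H, W)
        sar = sar.repeat(3, 1, 1)
    sar = (sar - sar_stats[0]) / sar_stats[1]
    optical = (optical - opt_stats[0]) / opt_stats[1]
    return sar, optical
```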
Evaluation metric: We utilize three evaluation metrics to assess the performance of different methods: the correct matching ratio (CMR), the average L2 error (AvgL2), and the root-mean-square error (RMSE). The CMR measures the proportion of correct match pairs, where a pair of images is considered a correct match if the Euclidean distance between the predicted and GT matching points is less than 3 pixels. The CMR is calculated using the following formula:
\begin{align*}
\mathrm{CMR}=\frac{N_{c}}{N_{m}} \tag{8}
\end{align*}
The AvgL2 is calculated using the predicted matching points and the corresponding GT points. It is calculated as follows:
\begin{align*}
\mathrm{AvgL2} = \frac{1}{N_{m}} \sum _{n=1}^{N_{m}}\left[\left(x^{\prime }-x\right)^{2}+\left(y^{\prime }-y\right)^{2}\right] \tag{9}
\end{align*}
The RMSE provides a measure of the average magnitude of the prediction errors and is calculated as follows:
\begin{align*}
\mathrm{RMSE} = \sqrt{\frac{1}{N_{m}} \sum _{n=1}^{N_{m}}\left[\left(x^{\prime }-x\right)^{2}+\left(y^{\prime }-y\right)^{2}\right]} \tag{10}
\end{align*}
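The three metrics can be computed together as in the sketch below, written exactly as in (8)-(10), including AvgL2 as the mean of the squared distances given in (9).

```python
import numpy as np

def matching_metrics(pred: np.ndarray, gt: np.ndarray, threshold: float = 3.0):
    """Evaluation metrics of Eqs. (8)-(10) under the 3-pixel correctness criterion.
    pred, gt: arrays of shape (N, 2) holding predicted and GT (x, y) points."""
    sq_dist = np.sum((pred - gt) ** 2, axis=1)        # (x'-x)^2 + (y'-y)^2
    dist = np.sqrt(sq_dist)                           # Euclidean error per pair
    cmr = np.mean(dist < threshold)                   # Eq. (8): N_c / N_m
    avg_l2 = np.mean(sq_dist)                         # Eq. (9)
    rmse = np.sqrt(np.mean(sq_dist))                  # Eq. (10)
    return cmr, avg_l2, rmse
```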
B. Ablations
This study comprehensively examines the impact of various components of our method. In this section, we present results on the fall dataset under different configurations by incrementally incorporating modules and assessing the influence of each component. The experiments were conducted on validation datasets. We replaced the MM with double deconvolution. Similarly, for CS, we used dot product computation instead. We employed AvgL2, CMR, and RMSE to evaluate and compare the effectiveness of the CS and MM. The effects of individual components and qualitative outcomes are detailed in Table III.
The findings presented in Table III demonstrate that, without additional components, we achieve an RMSE of 0.64 on the fall dataset. Incorporating the CS module has a beneficial effect on template matching tasks, reducing the RMSE to 0.47. Furthermore, integrating the MM enhances the network's ability to learn spatial dimensions. However, contrary to our expectations, the performance decreased. We identify several possibilities as follows.
The MM incorporates several convolution, normalization, and activation layers. These layers are advantageous for complex image generation or semantic segmentation tasks, owing to their capability to learn richer feature representations. However, for template matching, this method may be excessively complex.
A more straightforward structure that uses deconvolution alone suffices for upsampling features. The additional convolution and normalization operations in the MM might introduce unnecessary computations and potential feature loss.
Template matching tasks prioritize preserving the spatial information of features. The convolution operations in the MM could introduce undesired feature translation.
SAR and optical images belong to different modalities, each with distinct feature representation forms. An overly complex upsampling may impair the fusion of features from different modalities. Ultimately, the most favorable outcomes were achieved using CS alone, resulting in an RMSE of about 0.47.
C. Effects of Loss Functions and Heatmap Decay Factors
This section conducts a comparative analysis of pooling heatmap loss and other loss functions, examining the influence of loss functions and heatmap decay factors on the convergence process, AvgL2, RMSE, and CMR. When directly supervising matching points, using ArgMin as the regression method is precluded due to its hindrance to gradient backpropagation. Therefore, we employ the soft-ArgMin [50] to derive the predicted matching point
\begin{align*}
\hat{d}_{x} &= \sum _{d=0}^{D_{x}} d_{w} \times \sigma \left(\hat{d}, x\right) \\
\hat{d}_{y} &= \sum _{d=0}^{D_{y}} d_{w} \times \sigma \left(\hat{d}, y\right) \tag{11}
\end{align*}
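For completeness, the soft-ArgMin-style regression of (11) used in the point-loss baseline can be read as the expectation of the coordinate index under the SoftMax-normalized heatmap; the per-axis marginalization below is our interpretation of that formulation.

```python
import torch

def soft_argmax_2d(heatmap: torch.Tensor):
    """Differentiable point regression in the spirit of Eq. (11): SoftMax the
    heatmap, then take the expected coordinate index along each axis.
    heatmap: (H', W'); used only for the point-loss baseline comparison."""
    h, w = heatmap.shape
    prob = torch.softmax(heatmap.flatten(), dim=0).view(h, w)
    ys = torch.arange(h, dtype=heatmap.dtype, device=heatmap.device)
    xs = torch.arange(w, dtype=heatmap.dtype, device=heatmap.device)
    d_y = (prob.sum(dim=1) * ys).sum()   # expected row index
    d_x = (prob.sum(dim=0) * xs).sum()   # expected column index
    return d_x, d_y
```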
We compare our pooling heatmap loss with point loss and heatmap loss, utilizing the ArgMin operation mentioned previously. The experiments were conducted on validation datasets, and the results are demonstrated in Table VI and Fig. 11. The findings demonstrate that the pooling heatmap loss significantly aids in model convergence and outperforms both the heatmap loss and point loss. Optimal performance is achieved specifically through the application of pooling heatmap loss. From these results, we draw the following conclusions.
Heatmap loss significantly impacts template matching, owing to its smoother and more fluid nature compared to direct point loss functions.
Pooling heatmap loss enhances network convergence and accuracy. This enhancement is attributed to the network's heightened focus on the hottest region of the heatmap, where the correct matching point is situated.
All three loss functions can identify the correct matching point. The table presents detailed metrics including AvgL2, RMSE, CMR, and accuracy within 1, 2, and 3 pixels for each loss function on the fall dataset.
Fig. 10. Influence of the heatmap decay factor $\omega$, where $\omega$ denotes the decay weight. The heatmap becomes flatter as the decay factor increases; under a fixed factor, the lowest AvgL2 is obtained with a decay weight of 1.2. (a) $\omega = 0.1$. (b) $\omega = 5$. (c) Influence of the decay factor $\omega$.
Fig. 11. Influence of different loss functions on the convergence process. Both heatmap loss and pooling heatmap loss converge more smoothly than point loss, and pooling heatmap loss achieves better and deeper convergence.
D. Evaluations on Heterogeneous Datasets
Spring and summer: We evaluate our proposed method on the spring and summer datasets, focusing on AvgL2, RMSE, and CMR. The quantitative results for various template matching approaches are summarized in Table V and depicted in Fig. 13. This experiment, conducted on the test dataset, shows our method achieving the lowest AvgL2 and RMSE and the highest CMR. Notably, CSTM-Net leads in performance, closely followed by QATM. DMatch, RTM, and DDFN show moderate performance among the evaluated methods. In contrast, NCC and SSD, which rely on general texture features, perform worse, especially in heterogeneous datasets with considerable variation in image characteristics. The table details metrics including accuracy within 1, 2, and 3 pixels, AvgL2, RMSE, and CMR for each method across both seasons.
Predicted heatmap of our methods. We have zoomed in on the heatmap locally for clarity. Different colors represent the possibilities of correct matching. A hotter color indicates a greater likelihood of correct matching. Our method can focus on the correct matching region with smoother convergence. (a) SAR image. (b) Optical image. (c) Predict result. (d) Label. (e) SAR image. (f) Optical image. (g) Predict result. (h) Label.
Experimental results of CSTM-Net on the spring and summer datasets. (a) and (e) are SAR template images. (b) and (f) are optical search images. The CSTM-Net generates accurate matching results. (a) SAR. (b) Optical image. (c) GT. (d) Our result. (e) SAR. (f) Optical image. (g) GT. (h) Our result.
Deep CNNs' cross-modality adaptation capabilities are advantageous for dealing with template and search images from diverse imaging models. This explains the superior performance of deep learning-based methods on heterogeneous datasets. The CSTM-Net achieves RMSE of 0.57 and 0.95, with corresponding CMR of 96.49% and 95.68%. These findings effectively validate the cross-modality matching capability of the CSTM-Net.
Fall: We further evaluate CSTM-Net's performance on the fall dataset, with the CMR, AvgL2, RMSE, and detailed outcomes presented in Table VII. This experiment was conducted on the test dataset, and our method achieves state-of-the-art performance on the fall dataset. CSTM-Net achieves the highest CMR of 96.44% and the lowest RMSE of 0.72, demonstrating its exceptional accuracy in locating corresponding points. The table provides a comprehensive comparison of different template matching methods, including SSD, NCC, DMatch, RTM, DDFN, QATM, and our method. It presents detailed metrics such as matching accuracy within 1, 2, and 3 pixels, AvgL2, RMSE, and CMR for each method. Notably, our method significantly outperforms other approaches across all metrics, particularly in accuracy, with 86.41% of matches falling within 1 pixel.
Entire dataset: Extending our evaluation to the entire dataset, we present the CMR, AvgL2, and additional details in Table VIII. This experiment was conducted on the test dataset, and our proposed method demonstrates state-of-the-art performance across all the datasets we have evaluated. CSTM-Net achieves the highest CMR of 96.39% and the lowest RMSE of 0.78 across all images. The table provides a comprehensive comparison of different template matching methods, including traditional approaches, such as SSD and NCC, as well as advanced deep learning-based methods, such as CAMRI, SFcnet, DDFN, MCGF, DMatch, RTM, QATM, HIM-Net, and DC-InfoNCE. It presents detailed metrics of matching accuracy within 1, 2, 3, and 4 pixels for each method.
The superior performance of CSTM-Net, as well as other deep learning-based methods, can be attributed to the exceptional capability of deep convolutional neural networks in extracting rich spatial context information and effectively describing key features. Consequently, these deep learning approaches demonstrate superior matching accuracy compared to traditional similarity measures, such as NCC and SSD. Our proposed CSTM-Net significantly improves over other state-of-the-art methods, further solidifying its effectiveness in heterogeneous datasets.
E. Analysis of Different Similarity Measure Methods
In Table III, we compare the impact of various modules on the results. Replacing CS with the dot product leads to a decrease in precision. In addition, Tables IV and VIII show that the SSD and NCC algorithms perform poorly in heterogeneous template matching tasks. In the context of template matching for SAR and optical images, we consider SSD, NCC, and the dot product as methods used to calculate the similarity or correlation between the template and the target image. SSD and NCC are sensitive to absolute pixel value differences. On the other hand, the dot product considers both direction and magnitude. This suggests that the substantial modality differences between SAR and optical images render CS a potentially more viable metric. CS emphasizes directionality rather than absolute magnitudes, unlike SSD, NCC, and the dot product, which are inherently sensitive to absolute variations. When matching across heterogeneous imaging modalities, traditional similarity measures (such as SSD and NCC) may fail to capture the underlying associations effectively, due to the significant modality differences present. The key is considering modality variations when calculating the correlation between SAR and optical images.
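A toy numeric example (ours, not from the paper) makes the magnitude-sensitivity argument concrete: two feature vectors pointing in the same direction but differing greatly in scale serve as a crude stand-in for the SAR-optical modality gap.

```python
import numpy as np

def ssd(a, b):
    return np.sum((a - b) ** 2)

def dot(a, b):
    return np.sum(a * b)

def cosine(a, b):
    return dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two vectors with the same direction but very different magnitudes.
f_opt = np.array([1.0, 2.0, 3.0])
f_sar = 10.0 * f_opt
print(ssd(f_opt, f_sar))     # 1134.0 -- large, driven purely by the magnitude gap
print(dot(f_opt, f_sar))     # 140.0  -- grows with magnitude
print(cosine(f_opt, f_sar))  # 1.0    -- direction only, unaffected by scale
```

SSD and the dot product change drastically with the scale factor, whereas CS depends only on direction, mirroring the behavior discussed above.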
Discussion
Advantage: To validate the reliability of our proposed approach in template matching tasks, we conducted a thorough analysis of the generated heatmap. As evident in Fig. 12, our method can significantly concentrate weight on precise matching regions, thereby enhancing accurate localization. Not only does our method prioritize the intended matching area, resulting in smoother convergence, but it also offers advantages in terms of supervision. By transforming point labels into heatmaps, our method enables supervision across the entire prediction instead of at just a single point, whose limited signal would otherwise restrict effectiveness. This heatmap supervision approach, as depicted in Fig. 14, facilitates smoother convergence during training, ultimately enhancing the overall effectiveness of our proposed approach in template matching applications.
Post-SoftMax probability distributions of point and heatmap label. For clear illustration, we transform the heatmap into a vector. The red line is the matching label. Point labels can only supervise the predicted point that corresponds to them, often leading to additional peaks that pose challenges during training. Heatmap labels exhibit a distinct advantage: They do not produce extraneous peaks and enable the model to focus more intently on the correctly matched regions, thereby facilitating smoother training convergence. (a) Point prediction. (b) Heatmap prediction.
Disadvantage: However, our method also has some limitations. As shown in Fig. 15, difficulties may be encountered in accurately predicting repetitive textures and boundary regions. While the heatmap supervision method improves overall supervision and training convergence, it may still be challenged by complex texture patterns and edge details. Future work will address these shortcomings to enhance further the robustness and accuracy of our template matching approach.
Mispredicting images. (a) and (e) are SAR template images. (b) and (f) are optical search images. (c) and (g) are our prediction results. (d) and (h) are the label results. Among them, (c) and (d) are processed by local magnification. (a) SAR image. (b) Optical image. (c) Predict result. (d) Label. (e) SAR image. (f) Optical image. (g) Predict result. (h) Label.
Based on the quantitative results, the distinctive network architecture and meticulously crafted loss function of CSTM-Net effectively fulfill the template matching task. Furthermore, our method exhibits greater robustness and accuracy than other template matching methods.
Conclusion
In this article, we introduce an end-to-end template matching framework, CSTM-Net. The key innovation of our proposed method lies in utilizing the search operator and CS algorithm to construct a cost volume. Compared to existing template matching methods, our network exhibits strong performance, enabling its application to heterogeneous template matching tasks. Experimental evaluations conducted on the spring, summer, and fall datasets, as well as the entire dataset, demonstrate the superior performance of our method over competitive approaches. In particular, matching points can be achieved with an RMSE to the GT locations of 0.57, 0.95, 0.72, and 0.78, respectively, all of which are the best results. Furthermore, we achieve smoother convergence and better accuracy by designing the pooling heatmap loss function. Our datasets are derived from the spring, summer, fall, and winter folders of SEN1-2, a comprehensive large-scale image fusion dataset.
ACKNOWLEDGMENT
The authors would like to express their sincere gratitude to the computer vision community for sharing valuable open-source deep learning implementations, which have been instrumental to this research. The authors are particularly indebted to the editors and reviewers for their meticulous review and constructive feedback, whose thoughtful comments and suggestions have not only helped refine this work but also inspired deeper insights into this research area.