Introduction
Remote sensing (RS) image change detection (CD) aims to identify changed regions by comparing images of the same area acquired at different times, and it finds extensive application in urban expansion analysis, disaster assessment, military operations, vegetation cover detection, etc. [1], [2]. Currently, this field faces several challenges. First, varying lighting conditions often cause the same scene to exhibit different characteristics, and bitemporal images can hardly avoid illumination disturbances because they are acquired in different periods or even seasons [3], [4], [5]. Second, as shown in Fig. 1, the changed areas are extremely complex, and their size and quantity are often irregular: the size of changed areas varies from small to large, and multiple neighboring changed areas exhibit highly consistent features. To effectively address these issues, integrating multiscale features and extracting high-level semantic information are essential to counteract illumination disturbances and complicated scenes.
Influence of the regularization coefficients in the loss function on CD results. Clearly, a higher proportion of HEPP loss (e.g., 0:1) directly results in fewer missed detections and a higher recall rate; however, the false-positive rate also increases accordingly. To mitigate this, the comprehensive performance can be improved by adding or increasing the BCE regularization coefficient (e.g., 20:1). Experiments demonstrate that adjusting the HEPP loss coefficient effectively yields an adjustable recall rate, making the model suitable for a wider range of CD tasks.
Over the past years, researchers have developed many traditional representative CD algorithms, such as change vector analysis [6], slow feature analysis [7], and Fourier transforms [8]. However, owing to their powerful feature extraction and nonlinear representation capabilities, CD models based on deep learning have gradually become the mainstream of current research [9], [10], [11]. This is because various deep networks can overcome superficial disturbances caused by change noise, mine the high-level semantic information of the regions of interest, and thereby realize intelligent target recognition.
For the problems associated with RSCD shown in Fig. 1, we should tackle them from the following aspects. First, to realize intelligent change interpretation of RS images under different noise or illumination conditions, more discriminative semantic features should be deeply explored. Second, the integration of multiscale effective information is key to identifying changed areas (CAs) of nonuniform size. Finally, given the consistency in shape and spatial location exhibited by multiple neighboring CAs, extracting global contextual information is beneficial for capturing the correlations among diverse CAs.
In addition, the evaluation of an RSCD model is not identical across different application scenarios. In most cases, comprehensive indexes [e.g., F1, mean intersection over union (mIoU), etc.] should be emphasized; however, in some occasions with high recall requirements, such as illegal-building detection, more attention is paid to positive samples than to the comprehensive indexes. A deep network model whose recall can be adjusted according to the specific situation is therefore of greater practical value.
To identify change areas of varying sizes under illumination disturbances, this article proposes a high-recall hierarchical multiscale CD network, as illustrated in Fig. 2. The objective is to extract deep information to bridge the semantic gap caused by illumination noise and to accurately delineate change areas. Initially, hierarchical multiscale features of bitemporal images are extracted by a backbone network (such as ResNet), and the same-scale feature fusion module (SsFFM) is introduced to integrate feature tensors of the same size. Subsequently, we propose a cross-scale feature fusion module (CsFFM) with a nonsampling style, which aims to enhance the globality between adjacent features while realizing feature alignment in a compact feature space. To further strengthen global contextual connectivity, the multiscale feature fusion module (MsFFM) acquires the global dependencies of dual-temporal images through a multihead cross-attention (MHA) mechanism. Finally, a hyperexpectation push pull (HEPP) loss is developed and combined with the binary cross-entropy (BCE) loss to form a hybrid loss function. The recall rate of the model can be effectively controlled by adjusting the regularization coefficients.
Proposed deep network framework RaHFF-Net for CD in RS images. The SsFFM is designed to enhance and fuse effective features from same-layer tensors of consistent size. The CsFFM is introduced to improve information cross-flow and fusion between adjacent layers. The MsFFM effectively achieves long-distance modeling of multiscale deep features through the Transformer. Finally, a hybrid loss function composed of BCE and the proposed HEPP loss regularization term is formulated. On this foundation, the recall rate can be adjusted by altering the regularization coefficient, thereby enhancing the model's adaptability to different scenes.
The major contributions of our work can be summarized as follows.
A hierarchical multiscale RS image CD framework with a high recall rate is proposed. On the one hand, it effectively combines CNN and Transformer to realize local and global information fusion of multiscale features; on the other hand, it allows for a controllable and adjustable recall rate to a certain extent, which is particularly well-suited for scenarios that prioritize positive instances.
Three feature fusion modules for different scales are proposed. SsFFM uses the Transformer to realize global information flow across same-size deep features by sharing the query (Q); CsFFM achieves feature fusion of adjacent scales by sharing the key (K) and value (V) in a compact feature space; building on the above fused features, MsFFM further realizes global feature relationship modeling.
A HEPP loss regularization term is proposed. This loss function alleviates the data imbalance issue in CD by adjusting the predicted values (PV) of positive instances through a push and pull mechanism.
The rest of this article is organized as follows. Section II provides an overview of related work. Section III details the model framework. The experimental setup and results are introduced in Sections IV and V, respectively; the discussion is given in Section VI. Finally, Section VII concludes this article.
Related Work
A. CNN-Based CD Models
In view of the powerful local feature extraction ability of CNN, it has been widely used in many specific fields. To date, numerous researchers have completed CD tasks based on FCN and U-Net networks, including IFN [12], FCD-GN [13], E-UNet [14], and VoVNet [15]. In addition, the Siamese network is another mainstream CNN-based architecture for CD, including ECFNet [16], I3PE [17], etc. These networks employ a dual-stream architecture to extract deep information from bitemporal RS images, subsequently fusing and enhancing these features to generate CD results. To explore the temporal correlation existing in feature tensors, long short-term memory networks have been introduced into CD models, including EGRCNN [18] and ML-EDAN [19]. Significantly, to allow neural networks to focus precisely on important features and diminish unimportant ones, several studies have implemented attention mechanisms to pinpoint areas of interest in RS images, such as CADRL [20] and SAGNet [21]. These methods primarily accomplish feature reweighting through various means (channel, spatial, correlation, etc.), highlighting effective information and thereby improving CD performance.
B. Transformer-Based CD Models
Given its excellent global contextual perception, Transformer has rapidly expanded from NLP to the field of CV, influencing fields, such as image classification [22], segmentation [23], object detection [24], super-resolution [25], denoising [26], video analysis [27], and tracking [28].
Recently, studies have utilized the Transformer to achieve contextual modeling across spatial and temporal scales. Deep models such as SwinSUNet [29], Trans-MAD [30], STADE-CDNet [31], WSMsFNet [32], and LeMeVit all utilize the Transformer as their backbone to extract global features from the original images. Pang et al. [33] mined effective change information during the encoding and decoding process using the Transformer. Liu and Sun [34] utilized the Transformer to acquire accurate CD results from different categories of heterogeneous RS images. Zhang et al. [35] analyzed the relation changes in multitemporal images and proposed a cross-temporal difference attention to capture efficient changes. For other global feature extraction, Chen et al. [36] explored the potential of the Mamba architecture for RSCD tasks.
Furthermore, some works have been devoted to combining the strengths of CNN and Transformer, yielding deep networks that synthesize local structures with global information. Lei et al. [37] introduced a parallel feature extraction module that integrates CNN channel attention mechanisms with Transformer encoder. Ding et al. [38] proposed an alternating serial feature extraction backbone integrating residual attention mechanisms into Transformer models. Jiang [39] and Ding et al. [40] have developed networks that sequentially merge ResNet backbones with lightweight Transformer. Zhang et al. [41] applied Transformer to effectively search for spatial and channel information. In this article, we will continue to explore the potential of CNN and Transformer in the domain of RS image CD.
C. Loss Functions
The loss function is an important link in ongoing model refinement, as it measures the divergence between the PV and their labels. For numerous CD models, the common loss functions are cross-entropy [42], [43], contrastive loss, dice loss [44], and their linear weighted combinations [45], [46].
Beyond the standard loss functions aforementioned, researchers continually refine these functions to address the associated challenges. To attain equilibrium between the quantities of positive and negative instances, certain researchers have adopted focal loss to prioritize the small yet pivotal data. Miao et al. [47] introduced a focal loss supervision module to help the network pay more attention to the minority class samples. Cui et al. [48] combined focal loss and mean absolute error loss to achieve a class-balanced noise-tolerant CD network.
Essentially, CD still belongs to the pixel classification problem, so the similarity measurement of relevant probability distributions is crucial for guiding CD results and intermediate feature extraction. Common methods for measuring similarity include the cosine distance, Kullback–Leibler (KL) divergence, and maximum mean discrepancy (MMD). Specifically, some researchers calculated the cosine function to measure semantic similarity in vector space [49], [50], [51]. Moreover, KL divergence has been used as a similarity measure for different probability distributions (such as graph nodes [52] and multiscale feature tensors of dual-stream architectures [53]) in CD networks. For the unsupervised domain adaptation CD task, Qu et al. [54] calculated conditional MMD to measure the conditional distribution discrepancy. Peng et al. [55] integrated an IoU loss function to focus on the overall detection accuracy of the change information and the global structural feature.
Furthermore, hybrid loss functions that integrate supervised and unsupervised components have markedly improved model performance. Supervised loss concentrates on discrepancies between labeled data and predictions, whereas unsupervised loss commonly directs the extraction of intermediate features. Most importantly, despite their similar forms, researchers have ascribed distinct interpretations to loss functions, including cross-entropy and contrastive loss, etc [56], [57], [58].
However, in certain zero-tolerance situations, there is an increased focus on positive instances and a heightened requirement for the recall rate. Therefore, this article aims to conduct thorough research into loss functions capable of modulating recall rates for RSCD tasks.
Method
A. Overall Architecture
The RaHFF-Net overall adopts a Siamese CNN (SCNN) structure, as shown in Fig. 2, consisting of a feature extraction subnetwork and three auxiliary modules (i.e., SsFFM, CsFFM, and MsFFM).
First, the bitemporal RS images
Then, the corresponding feature pairs
Simultaneously, adjacent-scale features
Next, the MsFFM integrates the enhanced multiscale features to provide the Q, K, and V for the Transformer decoder, whose output is the CD result tensor
Ultimately, the convolutional fusion is applied to tensor
During the training phase, the PV and the ground truth (GT) are required to be processed through the proposed hybrid loss function, which includes binary cross-entropy (BCE) and the HEPP loss. This process calculates the loss value, which is then used to adjust the weights of the deep neural network.
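For clarity, the data flow described above can be summarized with the following PyTorch-style sketch. It is only an illustration under simplifying assumptions: the class names are hypothetical, the three fusion modules are replaced by a convolutional placeholder (their attention-based designs are detailed in Sections III-B to III-D), and it is not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class FusePlaceholder(nn.Module):
    """Stand-in for the attention-based fusion modules (SsFFM/CsFFM/MsFFM)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, *feats):
        size = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=size, mode="bilinear", align_corners=False) for f in feats]
        return self.proj(torch.cat(feats, dim=1))

class RaHFFNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        bb = resnet18(weights=None)
        self.stem = nn.Sequential(bb.conv1, bb.bn1, bb.relu, bb.maxpool)
        self.stages = nn.ModuleList([bb.layer1, bb.layer2, bb.layer3])   # 64, 128, 256 channels
        chs = [64, 128, 256]
        self.ssffm = nn.ModuleList([FusePlaceholder(2 * c, c) for c in chs])             # same scale
        self.csffm = nn.ModuleList([FusePlaceholder(2 * (chs[i] + chs[i + 1]), chs[i])   # adjacent scales
                                    for i in range(len(chs) - 1)])
        self.msffm = FusePlaceholder(sum(chs) + chs[0] + chs[1], 64)                     # multiscale
        self.cd_head = nn.Conv2d(64, 1, kernel_size=1)                                   # CD head

    def hierarchical(self, x):
        feats, x = [], self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

    def forward(self, t1, t2):
        f1, f2 = self.hierarchical(t1), self.hierarchical(t2)              # Siamese feature extraction
        same = [m(a, b) for m, a, b in zip(self.ssffm, f1, f2)]             # SsFFM per scale
        cross = [m(f1[i], f2[i], f1[i + 1], f2[i + 1])                      # CsFFM per adjacent pair
                 for i, m in enumerate(self.csffm)]
        fused = self.msffm(*same, *cross)                                   # MsFFM joint fusion
        return torch.sigmoid(self.cd_head(fused))                          # change-probability map PV

pv = RaHFFNetSketch()(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
print(pv.shape)  # (1, 1, 64, 64); the map is at 1/4 resolution and is upsampled for the final mask
```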
B. Same-Scale Feature Fusion Module
When humans observe differences between bitemporal images, they often recognize changes by directly fusing the information from both images. Inspired by this, we complete CD based on the idea of feature fusion. Given various noise, such as differing illumination, simplistic approaches (e.g., addition or subtraction) are usually ineffective for accurate CD. For this reason, both CNN and Transformer are used to extract local and global high-level semantic features of the dual-temporal images to derive more discernible characteristics.
For all deep features, coscale features exhibit the most direct correlation in terms of pixel corresponding locations. Accounting for the contextual relationships within change areas, this article further refines global information from the local feature tensors extracted by CNN backbone. As illustrated in Fig. 3, initially, the differential features between the paired inputs at layer m (i.e.,
The entire MSA process can be expressed by
\begin{align*}
Q^{m}&=f_{d=1}^{1\times 1} \left(f_{d=1}^{3\times 3} \left(|F_{t1}^{m}-F_{t2}^{m}|\right)\right)W_{q}^{m} \\
K_{ti}^{m}&=f_{d=1}^{1\times 1} \left(f_{d=1}^{3\times 3} \left(F_{ti}^{m}\right)\right)W_{k}^{m} \\
V_{ti}^{m}&=f_{d=1}^{1\times 1} \left(f_{d=1}^{3\times 3} \left(F_{ti}^{m}\right)\right)W_{v}^{m}\tag{1}
\end{align*}
\begin{equation*}
\text{MSA} \left(F_{ti}^{m}\right)=\text{Concat} \left(\text{head}^{1},{\ldots },\text{head}^{h}\right)W^{O} \tag{2}
\end{equation*}
\begin{align*}
\bar{F}_{ti}^{m}=\sigma \left(\frac{Q^{m}K_{ti}^{mT}}{\sqrt{d}}\right)V_{ti}^{m} \tag{3}
\end{align*}
To improve the stability and comprehensive expression of feature extraction, we aggregate the local features (extracted by the CNN) and global information (extracted by the Transformer) of a single image through a residual connection. Ultimately, the dual-temporal feature fusion at the same scale is realized based on a cascade operation. Formally
\begin{align*}
\bar{F}^\text{m}=f_{d=1}^{1\times 1}\left(\bigcup _{i=1}^{2} \left(F_{ti}^{m}+\bar{F}_{ti}^{m}\right)\right) \tag{4}
\end{align*}
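A minimal sketch of the shared-query same-scale fusion in (1)–(4) is given below, assuming the attention is computed over flattened spatial positions with PyTorch's built-in multihead attention; the head number and projection details are illustrative choices rather than the exact configuration.

```python
import torch
import torch.nn as nn

class SsFFMSketch(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        # conv3x3 -> conv1x1 projections applied before the learned Q/K/V matrices, as in Eq. (1)
        self.pre_q = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.Conv2d(channels, channels, 1))
        self.pre_kv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.Conv2d(channels, channels, 1))
        # multihead attention whose query is shared and built from the difference features
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)   # Eq. (4): cascade + 1x1 conv

    def forward(self, f_t1, f_t2):
        b, c, h, w = f_t1.shape
        q = self.pre_q(torch.abs(f_t1 - f_t2)).flatten(2).transpose(1, 2)   # shared Q from |F_t1 - F_t2|
        outs = []
        for f in (f_t1, f_t2):
            kv = self.pre_kv(f).flatten(2).transpose(1, 2)                  # K and V from one temporal image
            attn_out, _ = self.attn(q, kv, kv)                              # Eqs. (2)-(3)
            attn_out = attn_out.transpose(1, 2).reshape(b, c, h, w)
            outs.append(f + attn_out)                                       # residual: local + global
        return self.fuse(torch.cat(outs, dim=1))                            # fused same-scale feature

out = SsFFMSketch(64)(torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64))  # (1, 64, 64, 64)
```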
C. Cross-Scale Feature Fusion Module
Influenced by human activities and natural development, change area characteristics in RS images frequently exhibit similarities across scales, as seen with buildings of similar shape but different areas. Motivated by these observations, this article presents the CsFFM, engineered to extract potentially valuable information from neighboring layers, thus enhancing the model's comprehension of the semantic content of change features across similar scales.
When addressing multiscale feature fusion, it is typical to either downsample or upsample to match dimensions. However, interpolation may lead to information loss and blurred boundaries. To counteract this issue, we introduce a nonsampling method in Fig. 4 to facilitate the information flow across different scale features. This approach entails breaking down the feature tensor into several visual semantic tokens, with each token vector encapsulating a visual semantic unit. Subsequently, these tokens are integrated to effectively amalgamate features across varying scales.
To this end, the same-scale features are cascaded and fused through convolutional fusion to acquire the weight vector of the original feature tensor, and then the semantic tokens
\begin{align*}
T_{n}^{m}= \left(\sigma \left(f_{d = 1}^{1\times 1} \left(\left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}\right)W_{n}^{m}\right)\right)\right)^{T} (F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}) \tag{5}
\end{align*}
To enhance the information fusion of visual tokens, the semantic tokens of different scales are cascaded and arranged in order in space, and then the information interaction among the visual semantic tokens is realized through a sliding window. Finally, this process yields a new hybrid group of tokens
\begin{align*}
\tilde{T}_{n}=\text{TF} \left(T_{n}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ T_{n}^{m+1}\right) \tag{6}
\end{align*}
Next, the Transformer encoder is used to model the global interrelationships between features across adjacent scales. Considering feature dimension and other factors, the original feature tensor serves as the Q for the Transformer encoder. In addition, the fused semantic tokens are shared as both K and V for Transformer encoder, thereby enabling the cross-scale information flow throughout the encoding process.
The Transformer encoder is encapsulated by
\begin{align*}
\bar{Q}^{m}&= \left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}\right)\bar{W}_{q}^{m} \\
\bar{K}_{n}&=\tilde{T}_{n}\bar{W}_{k} \\
\bar{V}_{n}&=\tilde{T}_{n}\bar{W}_{v} \tag{7}\\
\text{MHA} \left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m},\tilde{T}_{n}\right)&=\text{Concat} \left(\text{head}^{1},{\ldots },\text{head}^{h}\right)W^{O} \tag{8}
\end{align*}
\begin{align*}
\tilde{F}_{n}^{m}=\sigma \left(\frac{\bar{Q}^{m}\bar{K}_{n}^{T}}{\sqrt{d}}\right)\bar{V}_{n}. \tag{9}
\end{align*}
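The following sketch illustrates the nonsampling cross-scale fusion of (5)–(9) under our reading of the module: each concatenated bitemporal feature map is condensed into a few semantic tokens by spatial-softmax weighting, tokens of adjacent scales are mixed by a Transformer encoder layer (standing in for the sliding-window interaction), and the original-scale features attend to the mixed tokens as the shared K/V. The token count, head number, and channel-alignment convolution are assumptions.

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Eq. (5): spatial-softmax weights turn a feature map into `num_tokens` semantic tokens."""
    def __init__(self, channels, num_tokens=8):
        super().__init__()
        self.to_weights = nn.Conv2d(channels, num_tokens, 1)

    def forward(self, feat):                                          # feat: (B, C, H, W)
        w_map = self.to_weights(feat).flatten(2).softmax(dim=-1)      # (B, N, HW), softmax over positions
        return torch.bmm(w_map, feat.flatten(2).transpose(1, 2))      # (B, N, C)

class CsFFMSketch(nn.Module):
    def __init__(self, ch_fine, ch_coarse, num_tokens=8, heads=4):
        super().__init__()
        self.proj_coarse = nn.Conv2d(2 * ch_coarse, 2 * ch_fine, 1)   # align channels without resampling space
        self.tok_fine = Tokenizer(2 * ch_fine, num_tokens)
        self.tok_coarse = Tokenizer(2 * ch_fine, num_tokens)
        self.token_mixer = nn.TransformerEncoderLayer(2 * ch_fine, heads, batch_first=True)  # Eq. (6)
        self.attn = nn.MultiheadAttention(2 * ch_fine, heads, batch_first=True)              # Eqs. (7)-(9)
        self.out = nn.Conv2d(2 * ch_fine, ch_fine, 1)

    def forward(self, f1_m, f2_m, f1_m1, f2_m1):
        fine = torch.cat([f1_m, f2_m], dim=1)                          # scale m
        coarse = self.proj_coarse(torch.cat([f1_m1, f2_m1], dim=1))    # scale m+1
        tokens = torch.cat([self.tok_fine(fine), self.tok_coarse(coarse)], dim=1)  # cascaded tokens
        tokens = self.token_mixer(tokens)                              # cross-scale token interaction
        b, c, h, w = fine.shape
        q = fine.flatten(2).transpose(1, 2)                            # Q from the original-scale features
        out, _ = self.attn(q, tokens, tokens)                          # shared K/V: mixed semantic tokens
        return self.out(out.transpose(1, 2).reshape(b, c, h, w))

out = CsFFMSketch(ch_fine=64, ch_coarse=128)(
    torch.rand(1, 64, 64, 64), torch.rand(1, 64, 64, 64),     # scale m (bitemporal)
    torch.rand(1, 128, 32, 32), torch.rand(1, 128, 32, 32))   # scale m+1 (bitemporal)
```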
D. Multiscale Global Feature Fusion Module
Taken as a whole, both the SsFFM and CsFFM modules enhance global representation within the same and adjacent layers of the dual-temporal images. However, Fig. 5 highlights two key characteristics of RS image CD: the diversity of change sizes and the contextual relevance of certain change areas. Therefore, the MsFFM is proposed to synchronously integrate multiscale information and establish their global interdependencies.
For the multiscale features
\begin{align*}
\check{F}=\bigcup _{m=1,n=1}^{M,N} \left(\text{Maxpool} \left(\bar{F}^{m},\tilde{F}_{n}^{m+1},\tilde{F}_{n}^{m}\right)\right) \tag{10}
\end{align*}
Subsequently, the multiscale features are upsampled to a uniform size (
\begin{align*}
\hat{F}=\bigcup _{m=1,n=1}^{M,N}\left(\text{Up} \left(\bar{F}^{m},\tilde{F}_{n}^{m+1},\tilde{F}_{n}^{m}\right)\right) \tag{11}
\end{align*}
The whole multiscale global MHA process can be expressed as follows:
\begin{align*}
\tilde{Q}&=\hat{F}\tilde{W}_{q} \\
\tilde{K}&=\check{F}\tilde{W}_{k} \\
\tilde{V}&=\check{F}\tilde{W}_{v} \tag{12}\\
\text{MHA} \left(\hat{F}, \check{F}\right)&=\text{Concat} \left(\overline{\text{head}}^{1},{\ldots },\overline{\text{head}}^{h}\right)\bar{W}^{O} \tag{13}
\end{align*}
The MsFFM
\begin{align*}
F_{o}=\sigma \left(\frac{\tilde{Q}\widetilde{K}^{T}}{\sqrt{d}}\right)\tilde{V}. \tag{14}
\end{align*}
The final prediction map PV can be obtained by the CD head based on
\begin{align*}
\text{PV}=f_{d=1}^{1\times 1} (F_{o}). \tag{15}
\end{align*}
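A hedged sketch of the MsFFM and the CD head in (10)–(15) is given below: every scale is max-pooled to a compact size to form K and V, upsampled to a common fine size to form Q, and a multihead cross-attention produces the output feature, from which a 1×1 convolution yields PV. The channel projection, pooling size, and upsampling size are illustrative choices, not the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MsFFMSketch(nn.Module):
    def __init__(self, in_chs, dim=64, heads=4, pool_size=16, up_size=64):
        super().__init__()
        self.projs = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_chs])   # unify channel widths
        self.attn = nn.MultiheadAttention(dim * len(in_chs), heads, batch_first=True)
        self.cd_head = nn.Conv2d(dim * len(in_chs), 1, 1)                    # Eq. (15)
        self.pool_size, self.up_size = pool_size, up_size

    def forward(self, feats):
        feats = [p(f) for p, f in zip(self.projs, feats)]
        # Eq. (10): max-pool every scale to a compact size -> K and V
        kv = torch.cat([F.adaptive_max_pool2d(f, self.pool_size) for f in feats], dim=1)
        # Eq. (11): upsample every scale to a common fine size -> Q
        q = torch.cat([F.interpolate(f, size=self.up_size, mode="bilinear", align_corners=False)
                       for f in feats], dim=1)
        b, c, h, w = q.shape
        q_seq = q.flatten(2).transpose(1, 2)                                 # (B, H*W, C)
        kv_seq = kv.flatten(2).transpose(1, 2)                               # (B, P*P, C)
        f_o, _ = self.attn(q_seq, kv_seq, kv_seq)                            # Eqs. (12)-(14)
        f_o = f_o.transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(self.cd_head(f_o))                             # PV: change-probability map

feats = [torch.rand(1, c, s, s) for c, s in zip((64, 128, 256, 64, 128), (64, 32, 16, 64, 32))]
pv = MsFFMSketch(in_chs=(64, 128, 256, 64, 128))(feats)                     # (1, 1, 64, 64)
```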
E. Loss Function
At present, most RSCD models tend to showcase their superiority through comprehensive indicators (F1 and mIoU). Nonetheless, certain scenarios necessitate a higher recall rate, particularly in fields such as medical diagnostics, financial fraud detection, and security monitoring. A high recall rate signifies that the model can identify as many positive instances as possible, whereas a low recall rate reflects a more serious situation of missed detections. For RS change interpretation, scene complexity and data imbalance may increase the miss rate. This outcome is highly undesirable for detection applications involving unauthorized constructions and vegetation damage.
To address this issue, the HEPP loss function is proposed. Traditional loss functions typically set the expected prediction for positive instances to 1 and for negative instances to 0. In contrast, the HEPP loss function adjusts the predicted probabilities of positive and negative instances toward
\begin{align*}
L_{\text{BCE}} =& - \frac{1}{N_{nc} + N_{c}} \sum _{h=1,w=1}^{H,W} \left[ (1-Y_{hw})\log (1 - P_{hw})\right. \\
&\left. +Y_{hw}\log P_{hw}\right] \tag{16}\\
L_{\text{HEPP}} =& \frac{1}{N_{nc}} \sum _{h=1,w=1}^{H,W} \left[ (1-Y_{hw}) \max (0, P_{hw} - \tau) \right] \\
& + \frac{1}{N_{c}} \sum _{h=1,w=1}^{H,W} \left[ Y_{hw} \max (0, T - P_{hw}) \right]. \tag{17}
\end{align*}
Among them, 0/1 denotes unchanged/changed pixels, respectively, and
In this loss function, by increasing the expected prediction for change pixels, denoted by T, the model is guided to focus more on learning from these change pixels. This approach mitigates the problem of data imbalance in RS image CD, leading to a notable enhancement in the model recall rate.
Nevertheless, as shown in Fig. 6, instances of false positives or false negatives, which indicate a substantial discrepancy between the PV and GT, result in larger loss values when using BCE loss function. This property is beneficial for model convergence and can potentially enhance overall performance. When the predicted results are close to the expected values, the loss calculated by the BCE function decreases sharply. At this point, the HEPP loss function can further increase the loss value by adjusting the expected predictions, thereby improving model performance. Experimental validation of the parameters
\begin{align*}
L=\lambda _{1}*L_\text{BCE}+\lambda _{2}*L_\text{HEPP}. \tag{18}
\end{align*}
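The hybrid objective in (16)–(18) can be written compactly as follows; T, τ, λ1, and λ2 follow the notation above, and the numeric defaults are placeholders rather than the tuned values reported later.

```python
import torch

def hepp_bce_loss(pred, gt, T=0.9, tau=0.1, lam1=1.0, lam2=1.0, eps=1e-6):
    """pred: change probabilities in [0, 1]; gt: binary labels (1 = changed, 0 = unchanged)."""
    pos, neg = gt, 1.0 - gt
    n_c, n_nc = pos.sum().clamp(min=1.0), neg.sum().clamp(min=1.0)
    # Eq. (16): standard BCE averaged over all pixels
    bce = -(neg * torch.log(1 - pred + eps) + pos * torch.log(pred + eps)).sum() / (n_c + n_nc)
    # Eq. (17): push unchanged predictions below tau, pull changed predictions above T
    push = (neg * torch.clamp(pred - tau, min=0)).sum() / n_nc
    pull = (pos * torch.clamp(T - pred, min=0)).sum() / n_c
    return lam1 * bce + lam2 * (push + pull)                                  # Eq. (18)

# Example: a larger lam2 (or a larger T) penalizes under-confident positives more strongly,
# which raises recall at the cost of some additional false positives.
loss = hepp_bce_loss(torch.rand(2, 1, 256, 256), torch.randint(0, 2, (2, 1, 256, 256)).float())
```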
Experimental Setup
A. Dataset
To assess the efficacy of RaHFF-Net, we performed experiments on three benchmark RS image CD datasets (LEVIR-CD, WHU-CD, and CDD). Each dataset comprises two RS images taken at different times over the same geographic region, accompanied by the corresponding CD label. Detailed information about these three datasets is provided below.
1) LEVIR-CD dataset
The LEVIR-CD is a publicly available, large-scale dataset for building CD, comprising 637 pairs of ultra-high-resolution (0.5 m/pixel) RS images, each measuring 1024×1024 pixels and covering periods from 5 to 14 years. These images originate from 20 distinct areas across various cities in Texas, USA, documenting 31 333 instances of building alterations. Notably, the LEVIR-CD encompasses a diverse range of building types that have experienced substantial land-use transformations, including detached houses, high-rise apartments, small garages, and large warehouses. We cut each image pair into 16 nonoverlapping segments, each measuring 256×256 pixels, and allocated 7120/1024/2048 image pairs for training, validation, and testing, respectively.
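For reference, the nonoverlapping 256×256 cropping used for LEVIR-CD can be reproduced with a short array-based sketch such as the one below; the inputs are dummy arrays, not the actual dataset files.

```python
import numpy as np

def split_into_patches(image, patch=256):
    """Cut an (H, W, C) array into nonoverlapping patch x patch tiles (16 tiles for 1024 x 1024)."""
    h, w = image.shape[:2]
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]

tiles = split_into_patches(np.zeros((1024, 1024, 3), dtype=np.uint8))
print(len(tiles))  # 16
```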
2) WHU-CD dataset
The WHU-CD documents the architectural changes in Christchurch, New Zealand, post the magnitude 6.3 earthquake in February 2011. Specifically, the dataset comprises a pair of aerial images from the same region, captured in 2012 and 2016, with dimensions of 32 507×15 354 pixels and a resolution of 0.2 m/pixel. We segmented the large-scale image pair into 256×256 patches and randomly allocated them into sets of 6096 for training, 762 for validation, and 762 for testing.
3) CDD dataset
CDD is a public CD dataset composed of satellite images captured across different seasons, with spatial resolutions ranging from 0.03 to 1 m/pixel. The size of change area in the CDD varies and includes features, such as buildings, roads, vehicles, and others. We cut the image into patches sized 256×256 and divided them into 10 000/2998/3000 pairs for training, validation, and testing, respectively.
B. Experimental Parameters
The RaHFF-Net model is implemented in PyTorch and trained on a single NVIDIA RTX 4090 GPU. The SGD optimizer is used for model optimization, with momentum of 0.98, weight decay of 5e-4, and an initial learning rate of 0.002. The learning rate is dynamically adjusted, decaying by a factor of 0.7 every 10 training epochs. In addition, training on each dataset is set to 150 epochs with a batch size of 24. Furthermore, validation is conducted after each training epoch, and the best model on the validation set is used to evaluate the test set. The backbone layer
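Under these settings, the optimizer and schedule can be reproduced roughly as follows; the model, data loader, and loss names are placeholders rather than the released training script.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 1)  # stand-in for RaHFF-Net; swap in the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.98, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)  # lr x 0.7 every 10 epochs

for epoch in range(150):                          # 150 epochs per dataset, batch size 24
    # for t1, t2, gt in train_loader:             # hypothetical loader of 256x256 bitemporal patches
    #     loss = hepp_bce_loss(model(t1), gt)     # hybrid BCE + HEPP objective, Eq. (18)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
    # validate after every epoch; the checkpoint with the best validation score evaluates the test set
```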
C. Evaluation Metrics
To comprehensively evaluate models, five evaluation metrics are used to quantitatively assess the CD results, namely, precision (P), recall (Re), F1, overall accuracy (OA), and mIoU.
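All five metrics derive from the binary confusion matrix; a small sketch is given below, where mIoU averages the IoU of the changed and unchanged classes (one common convention, assumed here).

```python
def cd_metrics(tp, tn, fp, fn, eps=1e-9):
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)                 # overall accuracy
    iou_change = tp / (tp + fp + fn + eps)
    iou_unchanged = tn / (tn + fp + fn + eps)
    miou = (iou_change + iou_unchanged) / 2
    return dict(P=precision, Re=recall, F1=f1, OA=oa, mIoU=miou)

print(cd_metrics(tp=900, tn=9000, fp=50, fn=50))
```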
Experimental Results
A. Comparative Methods
To verify the effectiveness of RaHFF-Net, we compared it with several state-of-the-art CD methods, including FC-EF, FC-Siam-diff, DTCDSCN, BIT, ChangeFormer, ICIFNet, DMINet, SEIFNet, and ChangeMamba. For the sake of fairness, all comparisons were performed using the code published by the authors, with parameters set according to the original literature.
Objectively, adjustable recall is a major attribute of our model. When we increase the recall regularization coefficient, many more positive examples are detected; however, this also leads to an increase in the false detection rate. Therefore, on occasions where the recall rate is critical and precision is secondary, the recall coefficient can be increased. For a fair comparison, in this article we still use the comprehensive index F1 as the model performance target.
Table I presents quantitative metrics of different models across the three datasets (i.e., LEVIR-CD, WHU-CD, and CDD). It is evident that the proposed RaHFF-Net consistently leads in recall rate, outperforming the second-best values by 2.7, 2.5, and 2.2 points, respectively. In addition, while maintaining high recall rates, our model shows significant advantages over the others in the comprehensive metrics: F1 score, OA, and mIoU. Moreover, our CNN backbone only utilizes ResNet18, without employing more complex structures, such as ResNet50, FPN, and U-Net. This advantage is likely attributed to our model's integration of multiscale spatio-temporal information and its enhanced salient feature representation through global context modeling. Although the quantitative analysis of RaHFF-Net is satisfactory, it must be acknowledged that its precision (Pre) is not optimal. This is because increasing the recall rate inevitably leads to recalling more suspicious change pixels [i.e., false positives (FP)], thereby raising the false detection rate. Fortunately, its precision value still remains within a reasonable range.
For intuitive analysis, we visually compare the experimental results of various algorithms, as shown in Figs. 7, 8, and 9. For a better view, white, black, red, and green represent true positives (TP), true negatives (TN), FP, and false negatives (FN), respectively.
The visual comparison of detection results on the LEVIR-CD dataset by various methods is shown in Fig. 7. Clearly, variations in illumination affect the bitemporal images, with change areas ranging from small to large and covering a broad scope. For small target areas, models such as FC-EF, BIT, ICIFNet, and DMINet suffer from significant missed detections. In large target areas, scene complexity affects the detection, and the errors of models such as DMINet, ChangeMamba, and SEIFNet typically occur near the edges, which may be attributed to the weaker contextual feature extraction capabilities of these models. When affected by lighting, DTCDSCN shows obvious false detections, whereas BIT and DMINet exhibit notable missed detections. Overall, RaHFF-Net usually excels in detecting change areas, achieving clear boundaries and lower rates of false and missed detections. This superiority is primarily attributable to the integration of local details and global multiscale semantic information from high-resolution RS images, which enables accurate identification of diverse changes and adaptation across various target areas.
Detection results on the LEVIR-CD using different methods. (a) T1 Image. (b) T2 Image. (c) GT. (d) FC-EF. (e) ChangeMamba. (f) FC-Siam-Diff. (g) DTCDSCN. (h) BIT. (i) Change Former. (j) ICIFNet. (k) DMINet. (l) SEIFNet. (m) Our RaHFF-Net.
The detection results of various methods on the WHU-CD dataset are displayed in Fig. 8. Missed detections under illumination disturbances commonly occur, especially for minor change areas, as seen with FC-EF, FC-Siam-Diff, BIT, and ChangeFormer. Notably, DTCDSCN is prone to severe false detections. When handling multiple CD targets, the detection results frequently suffer from false detections near the edges, especially for FC-EF and DTCDSCN. In complex scenarios with large-scale targets, these models often fail to accurately identify change areas, resulting in frequent false detections (e.g., BIT) and missed detections (e.g., DTCDSCN, DMINet, ChangeMamba, and SEIFNet). DTCDSCN and BIT perform poorly in complex CD scenarios, clearly indicating a strong relationship between scene complexity and the model's ability to extract deep semantic features. In contrast, the model presented herein demonstrates enhanced performance in maintaining target integrity and achieving boundary precision, irrespective of the size of the detection target. This may be attributed to the effective information integration across various scales by the MsFFM module.
Detection results on the WHU-CD using different methods. (a) T1 Image. (b) T2 Image. (c) GT. (d) FC-EF. (e) ChangeMamba. (f) FC-Siam-Diff. (g) DTCDSCN. (h) BIT. (i) Change Former. (j) ICIFNet. (k) DMINet. (l) SEIFNet. (m) Our RaHFF-Net.
The CDD dataset particularly emphasizes the impact of seasonal changes on CD in RS images. As seen from Fig. 9, FC-EF, FC-Siam-Diff, ChangeMamba, and SEIFNet exhibit significant omission issues under the influence of lighting and seasonal changes. Although BIT, ChangeFormer, and ICIFNet can identify prominent change areas, they often overlook smaller change regions. DMINet has made progress in improving CD performance, but its limitations in multiscale global-local modeling hinder its performance in detecting subtle changes. In addition, for larger change areas, ChangeFormer and ICIFNet display false detections at the edges of targets and omissions of unchanged objects. It is worth noting that FC-EF, FC-Siam-Diff, and ChangeMamba do not work very well for identifying large targets. Compared to the aforementioned models, RaHFF-Net not only demonstrates superior detection performance in both large and small-scale change areas but also shows exceptional performance when dealing with seasonal variations. The RaHFF-Net further validates its effectiveness in multiscale global and local feature extraction and fusion by comprehensively searching for regions with varying scales.
Detection results on the CDD using different methods. (a) T1 Image. (b) T2 Image. (c) GT. (d) FC-EF. (e) ChangeMamba. (f) FC-Siam-Diff. (g) DTCDSCN. (h) BIT. (i) Change Former. (j) ICIFNet. (k) DMINet. (l) SEIFNet. (m) Our RaHFF-Net.
The parameters as well as the FLOPs of the different methods are reported in Table II. It can be noticed that the parameter count of our model is moderate, but its FLOPs are not advantageous. This indicates that the model is not very demanding in terms of GPU memory during training but requires more floating-point computations during forward propagation. How to cut FLOPs while keeping the metrics from degrading will be the focus of future work.
B. Ablation Experiments
In this part, ablation studies were conducted to verify the effectiveness of the proposed modules (i.e., SsFFM, CsFFM, and MsFFM). As indicated in Table III, where
Since our model is not limited to any particular backbone, in Table IV we examine the effect of different backbones (including CNN, Transformer, and Mamba) on the model performance. The experiments show that the best backbone in terms of model evaluation is the CNN (here, ResNet), followed by Mamba. The subsequent modules of our model all involve the Transformer, which suggests that global and local fusion features have a significant positive impact on deep semantic information. In addition, the fusion features extracted by the hybrid model (i.e., Transformer and Mamba) outperform those of the single model (i.e., Transformer).
C. Effect of HEPP Loss
Table V reports the detection performance of the three comparison models when they utilize the proposed HEPP loss during the training process and lists the changes in each metric. It can be seen that the proposed HEPP loss improves the performance of the comparison models to different degrees. The results validate the generalizability of the HEPP loss and verify our motivation that it can help the models focus on more positive samples.
D. Parametric Analysis
1) Token Length
In this section, the deep semantic features of images are consolidated into compact token groups via the feature fusion window, establishing the token length as a crucial hyperparameter that affects model performance. Tests were conducted on the LEVIR-CD dataset using various token lengths, and their impacts on the model were analyzed, where
2) Determining the Hyperparameters T and τ
To address the instance imbalance issue in RS CD, the HEPP loss function is specifically designed to target change pixels. As Table VII illustrates, we set
3) Regularity Coefficient
To validate the effectiveness of the HEPP loss, hyperparameter experiments on the BCE loss coefficient
E. Convergence Analysis
The convergence of RaHFF-Net on the three datasets is evaluated by tracking the F1 score and loss value, as shown in Fig. 10, with the goal of showing the model's convergence during the training and validation process. It can be observed that when the epoch is less than 25, the model loss on all three datasets decreases rapidly, and the F1 score improves significantly as the epochs increase. When the epoch exceeds 75, the model loss essentially becomes stable, which indicates that our model converges stably and efficiently. This may be attributed to the model learning effective multiscale and global contextual information, which can accurately and quickly represent the change areas of interest.
Convergence analysis regarding the model training and testing on different datasets: (a) LEVIR-CD, (b) WHU-CD, and (c) CDD.
F. Network Visualization
To understand the RaHFF-Net more intuitively, a representative sample from the LEVIR-CD test set was selected to visualize the features generated at different stages, as shown in Fig. 11. Given the bitemporal images [see Fig. 11(a)], multiscale feature (
Visualization of key network modules using an instance image from the LEVIR-CD dataset. (a) Input bitemporal RS images. (b) Multilayer features
Discussion
The RaHFF-Net proposed in this article offers several advantages: multiscale CA detection, contextual information association, and adjustable recall. However, our model still has shortcomings in terms of lightweight design and robustness. As shown in Table II, the FLOPs of RaHFF-Net are as high as 33 G. Making the model lightweight while still guaranteeing the relevant performance would be more beneficial to practical applications. In addition, when a deep network is subjected to noise interference, such as lighting changes, special techniques should be adopted to improve the model's robustness.
Conclusion
In this article, we propose a novel RS image CD network, RaHFF-Net, which fully exploits the local feature extraction capabilities of the CNN and the long-distance relationship modeling of the Transformer. Initially, ResNet18 is utilized as the backbone network to thoroughly extract multiscale local features from bitemporal RS images. Subsequently, the SsFFM, CsFFM, and MsFFM modules are introduced to accomplish multiscale feature fusion. Finally, a positive instance contrastive loss regularization term is proposed to tackle missed detections, which, combined with BCE, helps achieve a high recall rate. In addition, experiments on three datasets (i.e., LEVIR-CD, WHU-CD, and CDD) show that RaHFF-Net delivers favorable results in comprehensive evaluation metrics (such as F1 and mIoU) and qualitative comparisons, highlighting its robust adaptability in detecting changes across diverse targets.