
RaHFF-Net: Recall-Adjustable Hierarchical Feature Fusion Network for Remote Sensing Image Change Detection


Abstract:

Remote sensing (RS) image change detection (CD) aims to identify areas of interest that have changed between bitemporal images. In complex scenarios (e.g., varying lighting conditions), the diverse shapes and scales of the changed areas make CD models especially prone to serious missed detections. To address this problem, we propose a high-recall multiscale feature fusion model for RS change interpretation. Initially, the RaHFF-Net extracts hierarchical multiscale features from bitemporal RS images; then, it employs CNN and Transformer to effectively merge local and global information across same-scale, cross-scale, and multiscale features. Finally, to address the issue of instance imbalance in CD, a novel hyperexpectation push-pull loss regularization term is proposed. This loss function is designed to elevate the expected predictions of positive instances across the dataset, thereby enabling the development of a deep learning model with a high recall rate.
Page(s): 176 - 190
Date of Publication: 24 October 2024


SECTION I.

Introduction

Remote sensing (RS) image change detection (CD) aims to compare differences in images of the same area at different times to identify changed regions, and it finds extensive application in urban expansion analysis, disaster assessment, military operations, vegetation cover detection, etc. [1], [2]. Currently, this field encounters several challenges. On the one hand, varying lighting conditions often lead to the same scene exhibiting different characteristics, and bitemporal images can hardly avoid the impact of illumination disturbances, since they are taken in different periods or even seasons [3], [4], [5]. On the other hand, as shown in Fig. 1, the changed areas are extremely complex, and their size and quantity are often irregular: the size of changed areas varies irregularly from small to large, and multiple neighboring changed areas exhibit highly consistent features. To effectively address these issues, integrating multiscale features and extracting high-level semantic information are essential to counteract illumination disturbances and complicated scenes.

Fig. 1.

Influence of regularization coefficients in the loss function on CD results. Clearly, a higher proportion of the HEPP loss (e.g., a 0:1 BCE-to-HEPP ratio) directly results in fewer missed detections and a higher recall rate. However, the false positive rate also increases accordingly. To mitigate this, the comprehensive performance can be improved by adding or increasing the BCE regularization coefficient (e.g., 20:1). Experiments demonstrate that adjusting the HEPP loss coefficient effectively achieves adjustable recall rates, making the model suitable for a wider range of CD tasks.

Over the past years, researchers have developed many traditional representative CD algorithms, such as change vector analysis [6], slow feature analysis [7], and Fourier transforms [8]. However, given their powerful feature extraction and nonlinear representation capabilities, CD models based on deep learning have gradually become the mainstream of current research [9], [10], [11]. This is because various deep networks can break through superficial disturbances from change noise, mine the high-level semantic information of the regions of interest, and then realize intelligent target recognition.

For the problems associated with RSCD shown in Fig. 1, we tackle them from the following aspects. First, to realize intelligent change interpretation of RS images under different noise or illumination conditions, more discriminative semantic features should be deeply explored; second, the integration of effective multiscale information is key to identifying change areas (CAs) of nonuniform size; and finally, given the consistency in shape and spatial location exhibited by multiple neighboring CAs, the extraction of global contextual information is beneficial for capturing the correlations among diverse CAs.

In addition, the evaluation criteria for RSCD models are not identical across application scenarios. In most cases, the comprehensive indexes [e.g., F1, mean intersection over union (mIoU), etc.] should be emphasized; however, in scenarios with high recall requirements, such as illegal-building detection, more attention is paid to positive samples than to the comprehensive indexes. If the recall of a model can be adjusted according to the specific situation, it becomes a deep network model with more practical value.

To identify change areas of varying sizes under illumination disturbances, this article proposes a high-recall hierarchical multiscale CD network, as illustrated in Fig. 2. The objective is to extract deep information to bridge the semantic gap caused by illumination noise and to accurately delineate change areas. Initially, hierarchical multiscale features of bitemporal images are extracted by a backbone network (such as ResNet), and the same-scale feature fusion module (SsFFM) is introduced to integrate feature tensors of the same size; subsequently, we propose a cross-scale feature fusion module (CsFFM) with a nonsampling style, which aims to enhance the globality between adjacent features on the basis of realizing feature alignment in a compact feature space. To further enhance global contextual connectivity, the multiscale feature fusion module (MsFFM) is used to acquire the global dependencies of dual-temporal images through the multihead cross-attention (MHA) mechanism. Finally, a hyperexpectation push-pull (HEPP) loss is developed and combined with the binary cross-entropy (BCE) loss to form a hybrid loss function. The recall rate of the model can be effectively controlled by adjusting the regularization coefficients.

Fig. 2.

Proposed deep network framework RaHFF-Net for CD in RS images. The SsFFM is designed to enhance and fuse effective features from same-layer tensors with consistent size. The CsFFM is introduced to improve information cross-flow and fusion between adjacent layers. The MsFFM effectively achieves long-distance modeling of multiscale deep features through the Transformer. Finally, a hybrid loss function composed of BCE and the proposed HEPP loss regularization term is formulated. On this foundation, the recall rate can be adjusted by altering the regularization coefficient, thereby enhancing the model's adaptability to different scenes.

The major contributions of our work can be summarized as follows.

  1. A hierarchical multiscale RS image CD framework with high recall rate is proposed. On the one hand, it effectively combines CNN and Transformer to realize the local and global information fusion of multiscale features; on the other hand, it allows for controllable and adjustable recall rate to a certain extent, which is particularly well-suited for scenarios that prioritize positive instances.

  2. Three feature fusion modules for different scales are proposed. SsFFM uses the Transformer to realize the global information flow of same-size deep features by sharing the query (Q); CsFFM achieves feature fusion of adjacent scales by sharing the key (K) and value (V) in a compact feature space; based on the above fused features, MsFFM further realizes global feature relationship modeling.

  3. A HEPP loss regularization term is proposed. This loss function alleviates the issue of data imbalance in CD by adjusting the predicted values (PV) of positive instances through a push-and-pull mechanism.

The rest of this article is organized as follows. Section II provides an overview of related work. Section III details the model framework. The experimental setup and results are introduced in Sections IV and V, respectively; the discussion is given in Section VI. Finally, Section VII concludes this article.

SECTION II.

Related Work

A. CNN-Based CD Models

In view of the powerful local feature extraction ability of CNNs, they have been widely used in many specific fields. To date, numerous researchers have completed CD tasks based on FCN and UNet architectures, including IFN [12], FCD-GN [13], E-UNet [14], and VoVNet [15]. In addition, the Siamese network is another mainstream CNN-based architecture for CD, including ECFNet [16], I3PE [17], etc. These networks employ a dual-stream architecture to extract deep information from bitemporal RS images, subsequently fusing and enhancing these features to generate CD results. To explore the temporal correlation that exists in feature tensors, long short-term memory networks have been introduced into CD models, including EGRCNN [18] and ML-EDAN [19]. Significantly, to allow neural networks to focus precisely on important features and diminish unimportant ones, several studies have implemented attention mechanisms to pinpoint areas of interest in RS images, such as CADRL [20] and SAGNet [21]. These methods primarily accomplish feature reweighting in various ways (channel, spatial, correlation, etc.), highlighting effective information and thereby improving CD performance.

B. Transformer-Based CD Models

Given its excellent global contextual perception, the Transformer has rapidly expanded from NLP to the field of CV, influencing tasks such as image classification [22], segmentation [23], object detection [24], super-resolution [25], denoising [26], video analysis [27], and tracking [28].

Recently, studies have utilized Transformers to achieve contextual modeling across spatial and temporal scales. Deep models, such as SwinSUNet [29], Trans-MAD [30], STADE-CDNet [31], WSMsFNet [32], and LeMeVit, all utilize the Transformer as their backbone to extract global features from original images. Pang et al. [33] mined effective change information during the encoding and decoding process using the Transformer. Liu and Sun [34] utilized the Transformer to acquire accurate CD results from different categories of heterogeneous RS images. Zhang et al. [35] analyzed the relation changes in multitemporal images and proposed a cross-temporal difference attention to capture efficient changes. For other global feature extraction, Chen et al. [36] explored the potential of the Mamba architecture for RSCD tasks.

Furthermore, some works have been devoted to combining the strengths of CNN and Transformer, yielding deep networks that synthesize local structures with global information. Lei et al. [37] introduced a parallel feature extraction module that integrates CNN channel attention mechanisms with Transformer encoder. Ding et al. [38] proposed an alternating serial feature extraction backbone integrating residual attention mechanisms into Transformer models. Jiang [39] and Ding et al. [40] have developed networks that sequentially merge ResNet backbones with lightweight Transformer. Zhang et al. [41] applied Transformer to effectively search for spatial and channel information. In this article, we will continue to explore the potential of CNN and Transformer in the domain of RS image CD.

C. Loss Functions

The loss function is an important link in ongoing model refinement, measuring the divergence between the PV and the labels. For numerous CD models, the common loss functions are cross-entropy [42], [43], contrastive loss, Dice loss [44], and their linear weighted combinations [45], [46].

Beyond the standard loss functions mentioned above, researchers continually refine these functions to address the associated challenges. To attain equilibrium between the quantities of positive and negative instances, certain researchers have adopted focal loss to prioritize the small yet pivotal data. Miao et al. [47] introduced a focal loss supervision module to help the network pay more attention to minority class samples. Cui et al. [48] combined focal loss and mean absolute error loss to achieve a class-balanced noise-tolerant CD network.

Essentially, CD is still a pixel classification problem, so the similarity measurement of relevant probability distributions is crucial for guiding CD results and intermediate feature extraction. Common methods for measuring similarity include cosine distance, Kullback–Leibler (KL) divergence, and maximum mean discrepancy (MMD). Specifically, some researchers calculated the cosine function to measure semantic similarity in vector space [49], [50], [51]. Moreover, KL divergence has been used as a similarity measure between different probability distributions (such as graph nodes [52] and multiscale feature tensors of dual-stream architectures [53]) in CD networks. For the unsupervised domain adaptation CD task, Qu et al. [54] calculated the conditional MMD to measure the conditional distribution discrepancy. Peng et al. [55] integrated an IoU loss function to focus on the overall detection accuracy of the change information and the global structural features.

Furthermore, hybrid loss functions that integrate supervised and unsupervised components have markedly improved model performance. Supervised loss concentrates on discrepancies between labeled data and predictions, whereas unsupervised loss commonly directs the extraction of intermediate features. Most importantly, despite their similar forms, researchers have ascribed distinct interpretations to loss functions, including cross-entropy and contrastive loss, etc [56], [57], [58].

However, in certain zero-tolerance situations, there is an increased focus on positive instances and a heightened requirement for the recall rate. Therefore, this article aims to conduct thorough research into loss functions capable of modulating recall rates for RSCD tasks.

SECTION III.

Method

A. Overall Architecture

The RaHFF-Net adopts an overall Siamese CNN (SCNN) structure, as shown in Fig. 2, consisting of a feature extraction subnetwork and three auxiliary modules (i.e., SsFFM, CsFFM, and MsFFM).

First, the bitemporal RS images T_{1},T_{2}\in \mathbb {R}^{H\times W\times 3} are fed into the SCNN for feature extraction. The original multiscale local features \lbrace (F_{t1}^{m},F_{t2}^{m});m\in [1,M]\rbrace are acquired in parallel through a weight-sharing Siamese backbone network (such as ResNet18), where \lbrace F_{t1}^{m},F_{t2}^{m}\rbrace \in \mathbb {R}^{H/2^{m}\times W/2^{m}\times C_{b}^{m}}.

Then, the corresponding feature pairs \lbrace F_{t1}^{m},F_{t2}^{m}\rbrace for each stage m are input into the SsFFM, where the difference features are shared as the Q for the multihead self-attention (MSA) mechanism. Meanwhile, the K and V are separately extracted from each feature tensor. The objective is to construct a comprehensive global feature representation \bar{F}^{m}\in \mathbb {R}^{H/2^{m}\times W/2^{m}\times C_{s}^{m}}, which leverages the local information extracted by the CNN.

Simultaneously, adjacent-scale features \lbrace F_{t1}^{m},F_{t2}^{m},F_{t1}^{m+1}, F_{t2}^{m+1}\rbrace are fed into the CsFFM together, which achieves information aggregation from adjacent but different scales in a nonsampling manner. This produces the feature tensor \tilde{F}_{n}^{m}\in \mathbb {R}^{H/2^{m}\times W/2^{m}\times C_{c}^{m}}, further enhancing the contextual relationships among neighboring features.

Next, the MsFFM integrates the enhanced multiscale features to provide the Q, K, and V for the Transformer decoder, whose output is the CD result tensor F_{o}\in \mathbb {R}^{H/2\times W/2\times C_{o}}.

Ultimately, the convolutional fusion is applied to tensor F_{o} to generate the predicted CD mask (PV\in \mathbb {R}^{H\times W\times 2}).

During the training phase, the PV and the ground truth (GT) are required to be processed through the proposed hybrid loss function, which includes binary cross-entropy (BCE) and the HEPP loss. This process calculates the loss value, which is then used to adjust the weights of the deep neural network.
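To make the data flow concrete, the following PyTorch sketch (our illustration, not the authors' released code) shows how a weight-sharing ResNet18 backbone can return the M=5 feature stages per temporal image with the channel widths C_{b}^{m} \in \lbrace 64,64,128,256,512\rbrace listed in Section IV-B; the stage grouping is an assumption of this sketch.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SiameseBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet18(weights=None)
        # Stages m = 1..5 with spatial sizes H/2, ..., H/32 and channels 64, 64, 128, 256, 512.
        self.stages = nn.ModuleList([
            nn.Sequential(r.conv1, r.bn1, r.relu),   # m = 1
            nn.Sequential(r.maxpool, r.layer1),      # m = 2
            r.layer2,                                # m = 3
            r.layer3,                                # m = 4
            r.layer4,                                # m = 5
        ])

    def extract(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

    def forward(self, t1, t2):
        # Weight sharing: the same stages process both temporal images.
        return self.extract(t1), self.extract(t2)

# Example with 256x256 RGB patches:
f_t1, f_t2 = SiameseBackbone()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in f_t1])  # (1, 64, 128, 128) ... (1, 512, 8, 8)
```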

B. Same-Scale Feature Fusion Module

When humans observe differences between bitemporal images, they often recognize changes by directly fusing the information from both images. Inspired by this, we complete CD based on the idea of feature fusion. Given various noise, such as different illumination, simplistic approaches (e.g., addition or subtraction) are usually ineffective for accurate CD. For this reason, both CNN and Transformer are used to extract local and global high-level semantic features of the dual-temporal images to derive more discernible characteristics.

Among all deep features, coscale features exhibit the most direct correlation in terms of corresponding pixel locations. Accounting for the contextual relationships within change areas, this article further refines global information from the local feature tensors extracted by the CNN backbone. As illustrated in Fig. 3, the differential features between the paired inputs at layer m (i.e., F_{t1}^{m} and F_{t2}^{m}) are first obtained. Here, we employ difference features to enhance the change information of the dual-temporal images. Vanilla self-attention is mainly used to extract the key information of a feature tensor itself, while cross-attention tends to enhance the similar features between two feature tensors; in the CD setting, however, what we need to enhance is the difference information between the image pair. Inspired by this, the absolute difference |F_{t1}^{m}-F_{t2}^{m}| is shared as the Q of the Transformer encoder, while its K and V are extracted from each temporal RS feature tensor. Finally, the contextual relationship of the local feature tensors from the dual-temporal images is modeled through the MSA mechanism.

Fig. 3.

Illustration of the overall structure of the module SsFFM.

The entire MSA process can be expressed by
\begin{align*} Q^{m}&=f_{d=1}^{1\times 1} \left(f_{d=1}^{3\times 3} \left(|F_{t1}^{m}-F_{t2}^{m}|\right)\right)W_{q}^{m} \\ K_{ti}^{m}&=f_{d=1}^{1\times 1} \left(f_{d=1}^{3\times 3} \left(F_{ti}^{m}\right)\right)W_{k}^{m} \\ V_{ti}^{m}&=f_{d=1}^{1\times 1} \left(f_{d=1}^{3\times 3} \left(F_{ti}^{m}\right)\right)W_{v}^{m}\tag{1} \end{align*}
where f_{d=n}^{k\times k} (\cdot) represents a convolution with a kernel size of k\times k and stride d=n
\begin{equation*} \text{MSA} \left(F_{ti}^{m}\right)=\text{Concat} \left(\text{head}^{1},{\ldots },\text{head}^{h}\right)W^{O} \tag{2} \end{equation*}
where \text{head}^{j}=\text{Att} (|F_{t1}^{m}-F_{t2}^{m}|W_{q}^{j},F_{ti}^{m}W_{k}^{j},F_{ti}^{m}W_{v}^{j}). The final output of the MSA mechanism is \bar{F}_{ti}^{m}=\text{Att} (Q^{m},K_{ti}^{m},V_{ti}^{m}). Formally
\begin{align*} \bar{F}_{ti}^{m}=\sigma \left(\frac{Q^{m}K_{ti}^{mT}}{\sqrt{d}}\right)V_{ti}^{m} \tag{3} \end{align*}
where \sigma (\cdot) denotes the softmax operation implemented in the channel dimension.

To improve the stability and comprehensive expression of feature extraction, we aggregate the local features (extracted by CNN) and global information (extracted by Transformer) of a single image through residual connection. Ultimately, the dual-temporal feature fusion at the same scale is realized based on cascade operation. Formally
\begin{align*} \bar{F}^{m}=f_{d=1}^{1\times 1}\left(\bigcup _{i=1}^{2} \left(F_{ti}^{m}+\bar{F}_{ti}^{m}\right)\right) \tag{4} \end{align*}
where \bigcup (\cdot) is the cumulative cascade function.
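A minimal sketch of this same-scale fusion path, assuming nn.MultiheadAttention for the MSA in (1)-(3) and the residual-plus-cascade fusion in (4); the projection layout and head count are our assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class SsFFM(nn.Module):
    def __init__(self, channels, heads=4):
        super().__init__()
        # f^{3x3} followed by f^{1x1}, as in (1).
        self.proj_q = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                    nn.Conv2d(channels, channels, 1))
        self.proj_kv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.Conv2d(channels, channels, 1))
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_t1, f_t2):
        b, c, h, w = f_t1.shape
        # Shared query from the absolute difference, Eq. (1).
        q = self.proj_q(torch.abs(f_t1 - f_t2)).flatten(2).transpose(1, 2)
        outs = []
        for f in (f_t1, f_t2):
            kv = self.proj_kv(f).flatten(2).transpose(1, 2)
            att, _ = self.attn(q, kv, kv)                    # Eqs. (2)-(3)
            att = att.transpose(1, 2).reshape(b, c, h, w)
            outs.append(f + att)                             # residual local + global
        return self.fuse(torch.cat(outs, dim=1))             # cascade + 1x1 conv, Eq. (4)

# fused = SsFFM(64)(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
```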

C. Cross-Scale Feature Fusion Module

Influenced by human activities and natural development, change area characteristics in RS images frequently exhibit similarities in scale, as seen with buildings of similar shape but different areas. Motivated by these observations, this article presents the CsFFM, engineered to extract potentially valuable information from neighboring layers, thus enhancing the model's comprehension of the semantic content of change features across similar scales.

When addressing multiscale feature fusion, it is typical to either downsample or upsample to match dimensions. However, interpolation may lead to information loss and blurred boundaries. To counteract this issue, we introduce a nonsampling method in Fig. 4 to facilitate the information flow across different scale features. This approach entails breaking down the feature tensor into several visual semantic tokens, with each token vector encapsulating a visual semantic unit. Subsequently, these tokens are integrated to effectively amalgamate features across varying scales.

Fig. 4.

Illustration of the overall structure of the module CsFFM.

To this end, the same-scale features are cascaded and fused through convolutional fusion to acquire the weight vector of the original feature tensor, and then the semantic tokens T_{n}^{m}\in \mathbb {R}^{l\times c} of different scale features are obtained by matrix multiplication. Formally
\begin{align*} T_{n}^{m}= \left(\sigma \left(f_{d = 1}^{1\times 1} \left(\left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}\right)W_{n}^{m}\right)\right)\right)^{T} \left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}\right) \tag{5} \end{align*}
where n denotes the nth neighborhood cross-scale feature tensor; \sigma (\cdot) denotes the softmax function, which normalizes each semantic cluster to obtain the weights for the original features.

To enhance the information fusion of visual tokens, the semantic tokens of different scales need to be cascaded and arranged in spatial order, and then the information interaction among the visual semantic tokens is realized through a sliding window. Finally, this process results in a new hybrid group of tokens \tilde{T}_{n}\in \mathbb {R}^{\tilde{l}\times 4c}. The process can be expressed by
\begin{align*} \tilde{T}_{n}=\text{TF} \left(T_{n}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ T_{n}^{m+1}\right) \tag{6} \end{align*}
where \text{TF} (\cdot) denotes the feature fusion on tokens.

Next, the Transformer encoder is used to model the global interrelationships between features across adjacent scales. Considering the feature dimension and other factors, the original feature tensor serves as the Q for the Transformer encoder. In addition, the fused semantic tokens are shared as both the K and V for the Transformer encoder, thereby enabling cross-scale information flow throughout the encoding process.

The Transformer encoder is encapsulated by
\begin{align*} \bar{Q}^{m}&= \left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}\right)\bar{W}_{q}^{m} \\ \bar{K}_{n}&=\tilde{T}_{n}\bar{W}_{k} \\ \bar{V}_{n}&=\tilde{T}_{n}\bar{W}_{v} \tag{7}\\ \text{MHA} \left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m},\tilde{T}_{n}\right)&=\text{Concat} \left(\text{head}^{1},{\ldots },\text{head}^{h}\right)W^{O} \tag{8} \end{align*}
where \text{head}^{j}=\text{Att} \left(\left(F_{t1}^{m} {{\bigcirc}\!\!\!\!{\text{c}}} \ F_{t2}^{m}\right)W_{q}^{j},\tilde{T}_{n}W_{k}^{j},\tilde{T}_{n}W_{v}^{j}\right). The final encoder output \tilde{F}_{n}^{m}=\text{Att} (\bar{Q}^{m},\bar{K}_{n},\bar{V}_{n}) represents the reshaped feature representation, namely
\begin{align*} \tilde{F}_{n}^{m}=\sigma \left(\frac{\bar{Q}^{m}\bar{K}_{n}^{T}}{\sqrt{d}}\right)\bar{V}_{n}. \tag{9} \end{align*}
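The sketch below illustrates this nonsampling token path of (5)-(9) under our own assumptions (token length l, a linear layer to align adjacent-scale token widths, and nn.MultiheadAttention in place of the paper's MHA block); it is meant to show the data flow, not to reproduce the released module.

```python
import torch
import torch.nn as nn

def tokenize(feat, conv, l):
    """Eq. (5): a softmax spatial attention map produces l semantic tokens of dim C."""
    b, c, h, w = feat.shape
    x = feat.flatten(2).transpose(1, 2)                 # (B, HW, C)
    w_attn = conv(feat).flatten(2).transpose(1, 2)      # (B, HW, l)
    w_attn = torch.softmax(w_attn, dim=1)               # normalize each semantic cluster
    return w_attn.transpose(1, 2) @ x                   # (B, l, C)

class CsFFM(nn.Module):
    def __init__(self, c_m, c_next, l=36, heads=4):
        super().__init__()
        self.tok_m = nn.Conv2d(2 * c_m, l, 1)
        self.tok_next = nn.Conv2d(2 * c_next, l, 1)
        self.align = nn.Linear(2 * c_next, 2 * c_m)      # assumed dim alignment across scales
        self.attn = nn.MultiheadAttention(2 * c_m, heads, batch_first=True)
        self.l = l

    def forward(self, f1_m, f2_m, f1_n, f2_n):
        cat_m = torch.cat([f1_m, f2_m], dim=1)           # same-scale cascade, layer m
        cat_n = torch.cat([f1_n, f2_n], dim=1)           # layer m+1
        t_m = tokenize(cat_m, self.tok_m, self.l)
        t_n = self.align(tokenize(cat_n, self.tok_next, self.l))
        tokens = torch.cat([t_m, t_n], dim=1)            # hybrid token group, Eq. (6)
        b, c, h, w = cat_m.shape
        q = cat_m.flatten(2).transpose(1, 2)             # original features as Q, Eq. (7)
        out, _ = self.attn(q, tokens, tokens)            # tokens shared as K and V, Eqs. (8)-(9)
        return out.transpose(1, 2).reshape(b, c, h, w)

# cross = CsFFM(64, 128)(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64),
#                        torch.randn(1, 128, 32, 32), torch.randn(1, 128, 32, 32))
```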

D. Multiscale Global Feature Fusion Module

Taken as a whole, the SsFFM and CsFFM modules enhance the global representation within the same and adjacent layers of the dual-temporal images. However, Fig. 5 highlights two key characteristics of RS image CD: the diversity of change sizes and the contextual relevance of certain change areas. Therefore, the MsFFM is proposed to synchronously integrate multiscale information and establish their global interdependencies.

Fig. 5.

Illustration of the overall structure of the module MsFFM.

For the multiscale features \bar{F}^{m} and \tilde{F}_{n}^{m}, the MHA is employed in the Transformer decoder to derive the predicted CD results. Initially, the multiscale features are downsampled to a uniform size (8\times 8) to serve as the K and V for the Transformer decoder, as illustrated in (10). Notably, reducing the size of K and V not only efficiently captures their primary features but also significantly reduces the computational load
\begin{align*} \check{F}=\bigcup _{m=1,n=1}^{M,N} \left(\text{Maxpool} \left(\bar{F}^{m},\tilde{F}_{n}^{m+1},\tilde{F}_{n}^{m}\right)\right) \tag{10} \end{align*}
where \text{Maxpool} (\cdot) denotes downsampling through the maximum pooling layer.

Subsequently, the multiscale features are upsampled to a uniform size (128\times 128) to serve as the Q for the Transformer decoder, as shown in (11). Expanding the size of Q achieves two goals: it enables the inclusion of more fine details, and it also aims to match the decoder output closely with the original size
\begin{align*} \hat{F}=\bigcup _{m=1,n=1}^{M,N}\left(\text{Up} \left(\bar{F}^{m},\tilde{F}_{n}^{m+1},\tilde{F}_{n}^{m}\right)\right) \tag{11} \end{align*}
where \text{Up} (\cdot) indicates the upsampling operation.

The whole multiscale global MHA process can be expressed as follows:
\begin{align*} \tilde{Q}&=\hat{F}\tilde{W}_{q} \\ \tilde{K}&=\check{F}\tilde{W}_{k} \\ \tilde{V}&=\check{F}\tilde{W}_{v} \tag{12}\\ \text{MHA} \left(\hat{F}, \check{F}\right)&=\text{Concat} \left(\overline{\text{head}}^{1},{\ldots },\overline{\text{head}}^{h}\right)\bar{W}^{O} \tag{13} \end{align*}
where \overline{\text{head}}^{j}=\text{Att} (\hat{F}\bar{W}_{q}^{j},\check{F}\bar{W}_{k}^{j},\check{F}\bar{W}_{v}^{j}).

The MsFFM output is F_{o}=\text{Att} (\tilde{Q},\tilde{K},\tilde{V}). Formally
\begin{align*} F_{o}=\sigma \left(\frac{\tilde{Q}\tilde{K}^{T}}{\sqrt{d}}\right)\tilde{V}. \tag{14} \end{align*}

The final prediction map PV is obtained by the CD head, a 1\times 1 convolution that adjusts the channel number, that is
\begin{align*} \text{PV}=f_{d=1}^{1\times 1} (F_{o}). \tag{15} \end{align*}
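A compact sketch of the decoder path in (10)-(15), with channel projection, pooling sizes, and the final upsampling to the input resolution taken as our assumptions; it only illustrates how the 8×8 K/V and 128×128 Q interact in the MHA and how a 1×1 head yields the two-channel PV.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MsFFM(nn.Module):
    def __init__(self, in_channels, dim=64, heads=4, num_classes=2):
        super().__init__()
        # Project every scale to a common channel width before pooling/upsampling.
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in in_channels])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, feats):
        small, large = [], []
        for proj, f in zip(self.proj, feats):
            f = proj(f)
            small.append(F.adaptive_max_pool2d(f, 8))                      # K, V at 8x8, Eq. (10)
            large.append(F.interpolate(f, size=128, mode="bilinear",
                                       align_corners=False))               # Q at 128x128, Eq. (11)
        kv = torch.cat([s.flatten(2) for s in small], dim=2).transpose(1, 2)
        q = torch.stack(large, dim=0).sum(0).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)                                       # Eqs. (12)-(14)
        b, _, c = out.shape
        f_o = out.transpose(1, 2).reshape(b, c, 128, 128)
        return self.head(F.interpolate(f_o, scale_factor=2, mode="bilinear",
                                       align_corners=False))               # PV, Eq. (15)

# pv = MsFFM([64, 128, 256])([torch.randn(1, 64, 64, 64),
#                             torch.randn(1, 128, 32, 32),
#                             torch.randn(1, 256, 16, 16)])
```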

E. Loss Function

At present, most RSCD models tend to showcase their superiority through comprehensive indicators (F1 and mIoU). Nonetheless, certain scenarios necessitate a higher recall rate, particularly in fields such as medical diagnostics, financial fraud detection, and security monitoring. A high recall rate signifies that the model can identify as many positive instances as possible, whereas a low recall rate reflects more serious missed detections. For RS change interpretation, scene complexity and data imbalance may increase the miss rate. This outcome is highly undesirable for detection applications involving, for example, unauthorized construction and vegetation damage.

To address this issue, the HEPP loss function is proposed. Traditional loss functions typically set the expected prediction for positive instances to 1 and for negative instances to 0. In contrast, the HEPP loss function pushes the predicted probabilities of positive and negative instances toward T and \tau, respectively. When T > 1 or \tau < 0, the predicted probabilities cannot reach these expected values; however, each iteration then produces a larger loss value, which, within certain bounds, contributes to improved model performance and accelerated convergence. The definitions of L_{\text{BCE}} and L_{\text{HEPP}} can be expressed as
\begin{align*} L_{\text{BCE}} =& - \frac{1}{N_{nc} + N_{c}} \sum _{h=1,w=1}^{H,W} \left[ (1-Y_{hw})\log (1 - P_{hw}) +Y_{hw}\log P_{hw}\right] \tag{16}\\ L_{\text{HEPP}} =& \frac{1}{N_{nc}} \sum _{h=1,w=1}^{H,W} \left[ (1-Y_{hw}) \max (0, P_{hw} - \tau) \right] + \frac{1}{N_{c}} \sum _{h=1,w=1}^{H,W} \left[ Y_{hw} \max (0, T - P_{hw}) \right]. \tag{17} \end{align*}

Among them, 0/1 denotes unchanged/changed pixels, respectively, and N_{nc}/N_{c} represents their quantities. Y_{hw} is the pixel value of the label at position (h, w), whereas P_{hw} represents the probability that the pixel at position (h, w) is predicted as changed by the model.

In this loss function, by increasing the expected prediction for change pixels, denoted by T, the model is guided to focus more on learning from these change pixels. This approach mitigates the problem of data imbalance in RS image CD, leading to a notable enhancement in the model recall rate.

Nevertheless, as shown in Fig. 6, instances of false positives or false negatives, which indicate a substantial discrepancy between the PV and GT, result in larger loss values when using the BCE loss function. This property is beneficial for model convergence and can potentially enhance overall performance. When the predicted results are close to the expected values, the loss calculated by the BCE function decreases sharply. At this point, the HEPP loss function can further increase the loss value by adjusting the expected predictions, thereby improving model performance. Experimental validation of the parameters \tau and T is discussed in Section V-D. Accordingly, the final loss function L is formulated as a combination of the BCE loss L_{\text{BCE}} and the HEPP loss L_{\text{HEPP}}
\begin{align*} L=\lambda _{1}L_{\text{BCE}}+\lambda _{2}L_{\text{HEPP}}. \tag{18} \end{align*}
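A minimal PyTorch sketch of the hybrid loss in (16)-(18); the clamping of the class counts and the epsilon for numerical stability are our additions, and the hyperparameter values follow the analysis in Section V-D.

```python
import torch

def hybrid_loss(p, y, T=2.5, tau=0.0, lambda_1=1.0, lambda_2=1.0, eps=1e-7):
    """p: predicted change probability in (0, 1); y: binary ground truth, same shape."""
    n_c = y.sum().clamp(min=1.0)                 # number of changed pixels N_c
    n_nc = (1 - y).sum().clamp(min=1.0)          # number of unchanged pixels N_nc

    # BCE, Eq. (16): averaged over all pixels.
    bce = -(y * torch.log(p + eps) + (1 - y) * torch.log(1 - p + eps)).mean()

    # HEPP, Eq. (17): pull positive predictions up toward T (> 1) and push
    # negative predictions below tau, each normalized by its class count.
    pull = (y * torch.clamp(T - p, min=0.0)).sum() / n_c
    push = ((1 - y) * torch.clamp(p - tau, min=0.0)).sum() / n_nc
    hepp = push + pull

    return lambda_1 * bce + lambda_2 * hepp

# Raising lambda_2 relative to lambda_1 (e.g., 0:1 vs. 20:1) trades precision
# for recall, as illustrated in Fig. 1:
# loss = hybrid_loss(torch.sigmoid(logits), labels.float(), lambda_1=20.0, lambda_2=1.0)
```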

Fig. 6.

Hybrid loss function curves for positive and negative instances.

SECTION IV.

Experimental Setup

A. Dataset

To assess the efficacy of RaHFF-Net, we performed experiments on three benchmark RS image CD datasets (LEVIR-CD, WHU-CD, and CDD). Each dataset comprises pairs of RS images taken at different times over the same geographic region, accompanied by the corresponding CD labels. Detailed information about these three datasets is provided below.

1) LEVIR-CD dataset

The LEVIR-CD is a publicly available, large-scale dataset for building CD, comprising 637 pairs of ultra-high-resolution (0.5 m/pixel) RS images, each measuring 1024×1024 pixels and covering periods from 5 to 14 years. These images originate from 20 distinct areas across various cities in Texas, USA, documenting 31 333 instances of building alterations. Notably, the LEVIR-CD encompasses a diverse range of building types that have experienced substantial land-use transformations, including detached houses, high-rise apartments, small garages, and large warehouses. We cut each image pair into 16 nonoverlapping segments, each measuring 256×256 pixels, and allocated 7120/1024/2048 image pairs for training, validation, and testing, respectively.

2) WHU-CD dataset

The WHU-CD documents the architectural changes in Christchurch, New Zealand, after the magnitude-6.3 earthquake in February 2011. Specifically, the dataset comprises a pair of aerial images of the same region, captured in 2012 and 2016, with dimensions of 32 507×15 354 pixels and a resolution of 0.2 m/pixel. We segmented the large-scale image pair into 256×256 patches and randomly allocated them into sets of 6096 for training, 762 for validation, and 762 for testing.

3) CDD dataset

CDD is a public CD dataset composed of satellite images captured across different seasons, with spatial resolutions ranging from 0.03 to 1 m/pixel. The sizes of the change areas in CDD vary, and the changes include features such as buildings, roads, and vehicles. We cut the images into patches of size 256×256 and divided them into 10 000/2998/3000 pairs for training, validation, and testing, respectively.
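The common preparation step for all three datasets is to cut each registered image pair into non-overlapping 256×256 patches; the following sketch (with assumed file paths) illustrates it with PIL and NumPy.

```python
import numpy as np
from PIL import Image

def crop_pair(path_t1, path_t2, patch=256):
    """Cut a registered bitemporal image pair into non-overlapping patch x patch tiles."""
    t1 = np.array(Image.open(path_t1))
    t2 = np.array(Image.open(path_t2))
    h, w = t1.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, patch):       # non-overlapping grid
        for x in range(0, w - patch + 1, patch):
            patches.append((t1[y:y + patch, x:x + patch],
                            t2[y:y + patch, x:x + patch]))
    return patches  # e.g., 16 pairs for a 1024x1024 LEVIR-CD tile
```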

B. Experimental Parameters

The RaHFF-Net model is implemented in PyTorch and trained on a single NVIDIA RTX 4090 GPU. The SGD optimizer is used for model optimization, with momentum of 0.98, weight decay of 5e-4, and an initial learning rate of 0.002. The learning rate is dynamically adjusted, decaying by a factor of 0.7 every 10 training epochs. In addition, each dataset is trained for 150 epochs with a batch size of 24. Furthermore, validation is conducted after each training epoch, and the best model on the validation set is used to evaluate the test set. The number of backbone layers M is uniformly set to 5, and C_{b}^{i}, C_{s}^{i} \in \lbrace 64,64,128,256,512\rbrace.
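The optimization setup above corresponds to the following PyTorch configuration sketch (the placeholder module stands in for a RaHFF-Net instance).

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 2, 1)  # placeholder for a RaHFF-Net instance
optimizer = torch.optim.SGD(model.parameters(), lr=2e-3,
                            momentum=0.98, weight_decay=5e-4)
# Decay the learning rate by a factor of 0.7 every 10 epochs, for 150 epochs total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)

for epoch in range(150):
    # ... train one epoch with batch size 24, validate, keep the best checkpoint
    scheduler.step()
```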

C. Evaluation Metrics

To comprehensively evaluate models, five evaluation metrics are used to quantitatively assess the CD results, namely, precision (P), recall (Re), F1, overall accuracy (OA), and mIoU.
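For reference, the five metrics can be computed from the binary confusion matrix as in the sketch below; treating mIoU as the mean of the changed and unchanged IoU is our assumption.

```python
import numpy as np

def cd_metrics(pred, gt, eps=1e-7):
    """pred, gt: binary numpy arrays of the same shape (1 = changed pixel)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou_c = tp / (tp + fp + fn + eps)          # IoU of the changed class
    iou_nc = tn / (tn + fp + fn + eps)         # IoU of the unchanged class
    return {"P": precision, "Re": recall, "F1": f1, "OA": oa,
            "mIoU": (iou_c + iou_nc) / 2}
```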

SECTION V.

Experimental Results

A. Comparative Methods

To verify the effectiveness of RaHFF-Net, we compared it with several state-of-the-art CD methods, including FC-EF, FC-Siam-diff, DTCDSCN, BIT, ChangeFormer, ICIFNet, DMINet, SEIFNet, and ChangeMamba. For the sake of fairness, all comparisons were performed using the code published by the authors, with parameters set according to the original literature.

Objectively, adjustable recall is a major attribute of our model. When the coefficient of the recall regularization term is increased, many more positive instances are noticed; however, this also increases the false detection rate. Therefore, in scenarios where a high recall rate is required and precision is less critical, the recall coefficient can be increased. For a fair comparison, in this article we still use the comprehensive index F1 as the model performance target.

Table I presents quantitative metrics of the different models across the three datasets (i.e., LEVIR-CD, WHU-CD, and CDD). It is evident that the proposed RaHFF-Net consistently leads in recall rate, outperforming the second-best values by 2.7, 2.5, and 2.2 points, respectively. In addition, while maintaining high recall rates, our model shows significant advantages over the others in the comprehensive metrics F1, OA, and mIoU. Moreover, our CNN backbone only utilizes ResNet18, without employing more complex structures such as ResNet50, FPN, or U-Net. This advantage is likely attributable to our model's integration of multiscale spatiotemporal information and its enhanced salient feature representation through global context modeling. Although the quantitative analysis of RaHFF-Net is satisfactory, it must be acknowledged that its precision (Pre) is not optimal. This is because increasing the recall rate inevitably recalls more suspicious change pixels [i.e., false positives (FPs)], thereby raising the false detection rate. Fortunately, its precision still remains within a reasonable range.

TABLE I Performance Comparison of Different Methods on Various Datasets

For intuitive analysis, we visually compare the experimental results of the various algorithms, as shown in Figs. 7-9. For a better view, white, black, red, and green represent true positive (TP), true negative (TN), FP, and false negative (FN) pixels, respectively.

The visual comparison of detection results on the LEVIR-CD dataset by various methods is shown in Fig. 7. Clearly, variations in illumination affect the bitemporal images, with change areas ranging from small to large and covering a broad scope. For small target areas, models such as FC-EF, BIT, ICIFNet, and DMINet suffer from significant missed detections. In large target areas, scene complexity affects detection, and the errors of some models (e.g., DMINet, ChangeMamba, and SEIFNet) typically occur near the edges, which may be attributed to their weaker contextual feature extraction capabilities. When affected by lighting, DTCDSCN shows obvious false detections, whereas BIT and DMINet exhibit notable missed detections. Overall, RaHFF-Net usually excels in detecting change areas, achieving clear boundaries and lower rates of false and missed detections. This superiority is primarily attributable to the integration of local details and global multiscale semantic information from high-resolution RS images, which enables accurate identification of diverse changes and adaptation across various target areas.

Fig. 7.

Detection results on the LEVIR-CD using different methods. (a) T1 Image. (b) T2 Image. (c) GT. (d) FC-EF. (e) ChangeMamba. (f) FC-Siam-Diff. (g) DTCDSCN. (h) BIT. (i) ChangeFormer. (j) ICIFNet. (k) DMINet. (l) SEIFNet. (m) Our RaHFF-Net.

The detection results of various methods on the WHU-CD dataset are displayed in Fig. 8. Missed detections under illumination disturbances occur commonly, especially for minor change areas, as seen with FC-EF, FC-Siam-Diff, BIT, and ChangeFormer. Notably, DTCDSCN is prone to severe false detections. When handling multiple CD targets, the detection results frequently contain false detections near the edges, especially for FC-EF and DTCDSCN. In complex scenarios with large-scale targets, these models often fail to accurately identify change areas, resulting in frequent false detections (e.g., BIT) and missed detections (e.g., DTCDSCN, DMINet, ChangeMamba, and SEIFNet). DTCDSCN and BIT perform poorly in complex CD scenarios, clearly indicating that performance in complex scenes depends on a model's ability to extract deep semantic features. In contrast, the model presented herein demonstrates enhanced performance in maintaining target integrity and achieving boundary precision, irrespective of the size of the detection target. This may be attributed to the effective information integration across various scales by the MsFFM module.

Fig. 8.

Detection results on the WHU-CD using different methods. (a) T1 Image. (b) T2 Image. (c) GT. (d) FC-EF. (e) ChangeMamba. (f) FC-Siam-Diff. (g) DTCDSCN. (h) BIT. (i) ChangeFormer. (j) ICIFNet. (k) DMINet. (l) SEIFNet. (m) Our RaHFF-Net.

The CDD dataset particularly emphasizes the impact of seasonal changes on CD in RS images. As seen from Fig. 9, FC-EF, FC-Siam-Diff, ChangeMamba, and SEIFNet exhibit significant omission issues under the influence of lighting and seasonal changes. Although BIT, ChangeFormer, and ICIFNet can identify prominent change areas, they often overlook smaller change regions. DMINet has made progress in improving CD performance, but its limitations in multiscale global-local modeling hinder its performance in detecting subtle changes. In addition, for larger change areas, ChangeFormer and ICIFNet display false detections at the edges of targets and omissions of unchanged objects. It is worth noting that FC-EF, FC-Siam-Diff, and ChangeMamba do not work very well for identifying large targets. Compared to the aforementioned models, RaHFF-Net not only demonstrates superior detection performance in both large- and small-scale change areas but also shows exceptional performance when dealing with seasonal variations. The RaHFF-Net further validates its effectiveness in multiscale global and local feature extraction and fusion by comprehensively searching for regions with varying scales.

Fig. 9.

Detection results on the CDD using different methods. (a) T1 Image. (b) T2 Image. (c) GT. (d) FC-EF. (e) ChangeMamba. (f) FC-Siam-Diff. (g) DTCDSCN. (h) BIT. (i) ChangeFormer. (j) ICIFNet. (k) DMINet. (l) SEIFNet. (m) Our RaHFF-Net.

The parameters as well as the FLOPs of the different methods are reported in Table II. It can be noticed that the parameter count of our model is moderate, but the FLOPs are not dominant. This indicates that the model is not very demanding in terms of GPU memory during training but requires more floating-point computations during forward propagation. Reducing FLOPs while keeping the metrics from degrading will be the focus of future work.

TABLE II FLOPs and Params Comparison of Different Methods

B. Ablation Experiments

In this part, ablation studies were conducted to verify the effectiveness of the proposed modules (i.e., SsFFM, CsFFM, and MsFFM). The results are presented in Table III, where \checkmark indicates that the corresponding module was adopted. The experiments showed that each module made its own contribution to improving CD performance, regardless of the dataset.

TABLE III Module Ablation Experiment

Since our model is not limited to any particular backbone, in Table IV we examine the effect of different backbones (including CNN, Transformer, and Mamba) on model performance. The experiments show that the best backbone in terms of model evaluation is the CNN (here, ResNet), followed by Mamba. The subsequent modules of our model all involve the Transformer, which suggests that global and local fused features have a significant positive impact on deep semantic information. In addition, the fused features extracted by the hybrid backbone (i.e., Transformer and Mamba) outperform those of the single backbone (i.e., Transformer).

TABLE IV Different Backbone on CDD Dataset

C. Effect of HEPP Loss

Table V reports the detection performance of three comparison models when they utilize the proposed HEPP loss during the training process, and lists the resulting changes in each metric. It can be seen that the proposed HEPP loss improves the performance of the comparison models to different degrees. The results validate the generalizability of the HEPP loss and verify our motivation that it can help models focus on more positive samples.

TABLE V Effect of HEPP Loss for Different Comparison Models on WHU-CD

D. Parametric Analysis

1) Token Length

In this section, deep semantic features of images are consolidated into compact token groups via the feature fusion window, establishing the token length as a crucial hyperparameter that affects model performance. Tests were conducted on the LEVIR-CD dataset using various token lengths l \in \lbrace 4, 9, 16, 25, 36, 49, 64, 81\rbrace, and their impacts on the model were analyzed. The results in Table VI indicate that when l is relatively short (e.g., 4 or 9), the token groups are insufficient to completely and efficiently characterize the features of the dual-temporal RS images; when l is excessively long (e.g., 64 or 81), redundant information may scatter the model's focus, hindering its ability to concentrate on effective features. When l = 36, the model achieves optimal comprehensive metrics in terms of F1 and mIoU.

TABLE VI Effect of Varying Token Lengths in TF

2) Determining the Hyperparameters T and \tau

To address the instance imbalance issue in RS CD, the HEPP loss function is specifically designed to target change pixels, as Table VII illustrates. We set \tau to 0 and conduct experiments to evaluate how different T values impact model performance, aiming to find the optimal threshold for adjusting positive instances. The results reveal that the model performs best overall with T=2.5. Following this, we fix T at 2.5 and experiment with various \tau values. We find that when \tau < 0, the performance metrics of the model do not improve significantly. This lack of enhancement is likely due to the large number of negative instances (unchanged pixels), where the push and pull adjustments cause the loss values to increase disproportionately relative to \tau, thereby hindering the fine-tuning of the model.

TABLE VII Impacts of Different Hyperparameters T and \tau on the Model When Using the HEPP Loss Function on the LEVIR-CD Dataset

3) Regularity Coefficient

To validate the effectiveness of the HEPP loss, hyperparameter experiments on the BCE loss coefficient \lambda _{1} and the HEPP loss coefficient \lambda _{2} were conducted on the LEVIR-CD, WHU-CD, and CDD datasets. The experimental results are shown in Table VIII. When only the HEPP loss term is used, the recall rate is far ahead of that of the other hybrid loss functions, which proves that the HEPP loss term can significantly improve the recall rate; in this case, the model is very practical for occasions with high recall requirements. As the BCE coefficient increases, the comprehensive indexes F1 and mIoU of the model gradually change. When the comprehensive indexes reach their optimum, the model is more practical for situations with higher comprehensive performance requirements. In conclusion, the model can adapt to changing scenarios and tasks by adjusting the regularization coefficients.

TABLE VIII Influence of Regularization Coefficient on Quantitative Experimental Results Regarding Different Dataset

E. Convergence Analysis

The convergence of RaHFF-Net on the three datasets is evaluated by tracking the F1 score and loss value, as shown in Fig. 10, with the goal of showing the model's convergence during the training and validation process. It can be observed that when the epoch is less than 25, the model loss on all three datasets decreases rapidly, and the F1 score improves significantly as the epochs increase. When the epoch exceeds 75, the model loss basically becomes stable, which indicates that our model converges stably and efficiently. This may be attributed to the model learning effective multiscale and global contextual information, which can accurately and quickly represent the change areas of interest.

Fig. 10.

Convergence analysis regarding the model training and testing on different datasets: (a) LEVIR-CD, (b) WHU-CD, and (c) CDD.

F. Network Visualization

To understand RaHFF-Net more intuitively, a representative sample from the LEVIR-CD test set was selected to visualize the features generated at different stages, as shown in Fig. 11. Given the bitemporal images [see Fig. 11(a)], multiscale features (F_{t1}^{m}, F_{t2}^{m}) from shallow to deep were first extracted via the CNN backbone [see Fig. 11(b) and (d)]. Then, the SsFFM was utilized to fuse the same-layer features F_{t1}^{m} and F_{t2}^{m}, generating the fused feature \bar{F}^{m} [see Fig. 11(c)], which indicates that the model effectively focused attention on the areas of interest and suppressed irrelevant background interference. Simultaneously, the adjacent layers F_{t1}^{m+1} and F_{t2}^{m+1} as well as F_{t1}^{m} and F_{t2}^{m} were channel-concatenated and fed into the CsFFM. This module condensed these features into more compact semantic tokens and facilitated information cross-flow between different scale features, resulting in the high-level semantic information \tilde{F}_{n}^{m+1} and \tilde{F}_{n}^{m} [see Fig. 11(e)]. During feature decoding, \hat{F} (Q) [see Fig. 11(f)#1] exhibited the outlines and location information of the changed buildings while ignoring the unchanged existing buildings, fully catering to the CD requirements; \check{F} (KV) [see Fig. 11(f)#2] primarily contained the potential spatial and high-level semantic information of the change regions. The change probability map [see Fig. 11(g)] after convolution showed accurate target localization and fine boundaries, demonstrating that RaHFF-Net effectively captured the change targets through feature fusion and efficient supervision.

Fig. 11.

Visualization of key network modules using an instance image from the LEVIR-CD dataset. (a) Input bitemporal RS images. (b) Multilayer features F_{t1}^{m} derived from the original images using CNN as the backbone. (c) Feature maps \bar{F}^{m} after applying the module SsFFM. (d) Multilayer features F_{t2}^{m} extracted by CNN as the backbone. (e) Feature tensors \tilde{F}_{n}^{m} obtained by the CsFFM. (f) The Transformer decoder inputs \hat{F} (Q) and \check{F} (KV) for MsFFM. (g) Change probability map. (h) Change map.

SECTION VI.

Discussion

The RaHFF-Net proposed in this article demonstrates several advantages: multiscale CA detection, contextual information association, and adjustable recall. However, our model still has limitations in terms of lightweight design and robustness. As shown in Table II, the FLOPs of RaHFF-Net are as high as 33 G. Making the model lightweight while guaranteeing the relevant performance would be more beneficial for practical applications. In addition, when a deep network is subjected to noise interference, such as illumination changes, special techniques should be adopted to improve its robustness.

SECTION VII.

Conclusion

In this article, we propose a novel RS image CD network, RaHFF-Net, which fully exploits the local feature extraction capabilities of CNN and the long-distance relationship modeling of the Transformer. Initially, ResNet18 is utilized as the backbone network to thoroughly extract multiscale local features from bitemporal RS images. Subsequently, the SsFFM, CsFFM, and MsFFM modules are introduced to accomplish multiscale feature fusion. Finally, a positive-instance contrastive loss regularization term is proposed to tackle missed detections, which, combined with BCE, is beneficial for achieving a high recall rate. In addition, extensive experiments on the LEVIR-CD, WHU-CD, and CDD datasets show that RaHFF-Net delivers favorable results in comprehensive evaluation metrics (such as F1 and mIoU) and qualitative comparisons, highlighting its robust adaptability in detecting changes across diverse targets.

References
1.
R. Zhou, “A unified deep learning network for remote sensing image registration and change detection,” IEEE Trans. Geosci. Remote Sens., vol. 62, 2024, Art. no. 5101216.
2.
H. Li, “Selective transfer based evolutionary multitasking optimization for change detection,” IEEE Trans. Emerg. Topics Comput. Intell., vol. 8, no. 3, pp. 2197–2212, Jun. 2024.
3.
M. Noman, M. Naseer, H. Cholakkal, R. M. Anwar, S. Khan, and F. S. Khan, “Rethinking transformers pre-training for multi-spectral satellite imagery,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 27811–27819.
4.
Z. Du, X. Li, J. Miao, Y. Huang, H. Shen, and L. Zhang, “Concatenated deep-learning framework for multitask change detection of optical and SAR images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 719–731, 2024.
5.
A. Ulrichsen, “Operational neural networks for parameter-efficient hyperspectral single-image super-resolution,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 1470–1484, 2024.
6.
L. Bruzzone and D. Prieto, “Automatic analysis of the difference image for unsupervised change detection,” IEEE Trans. Geosci. Remote Sens., vol. 38, no. 3, pp. 1171–1182, May 2000.
7.
C. Wu, B. Du, and L. Zhang, “Slow feature analysis for change detection in multispectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2858–2874, May 2014.
8.
H. Chen, N. Yokoya, and M. Chini, “Fourier domain structural relationship analysis for unsupervised multimodal change detection,” ISPRS J. Photogrammetry Remote Sens., vol. 198, pp. 99–114, 2023.
9.
C. Wang, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, “Introspective deep metric learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 4, pp. 1964–1980, Apr. 2024.
10.
T. Liao, X. Zhang, L. Zhao, T. Wang, and G. Xiao, “VSformer: Visual-spatial fusion transformer for correspondence pruning,” in Proc. AAAI Conf. Artif. Intell., 2024, pp. 3369–3377.
11.
W. Huang, “A general and efficient training for transformer via token expansion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 15783–15792.
12.
C. Zhang, “A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images,” ISPRS J. Photogrammetry Remote Sens., vol. 166, pp. 183–200, 2020.
13.
C. Wu, B. Du, and L. Zhang, “Fully convolutional change detection framework with generative adversarial network for unsupervised, weakly supervised and regional supervised change detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9774–9788, Aug. 2023.
14.
Z. Lv, H. Huang, W. Sun, T. Lei, J. A. Benediktsson, and J. Li, “Novel enhanced UNet for change detection using multimodal remote sensing image,” IEEE Geosci. Remote Sens. Lett., vol. 20, 2023, Art. no. 2505405.
15.
Y. Lee, J.-w. Hwang, S. Lee, Y. Bae, and J. Park, “An energy and gpu-computation efficient backbone network for real-time object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 752–760.
16.
S. Zhu, Y. Song, Y. Zhang, and Y. Zhang, “ECFNet: A siamese network with fewer FPs and fewer FNs for change detection of remote-sensing images,” IEEE Trans. Geosci. Remote Sens. Lett., vol. 20, 2023, Art. no. 6001005.
17.
H. Chen, J. Song, C. Wu, B. Du, and N. Yokoya, “Exchange means change: An unsupervised single-temporal change detection framework based on intra- and inter-image patch exchange,” ISPRS J. Photogrammetry Remote Sens., vol. 206, pp. 87–105, 2023.
18.
B. Bai, W. Fu, T. Lu, and S. Li, “Edge-guided recurrent convolutional neural network for multitemporal remote sensing image building change detection,” IEEE Geosci. Remote Sens., vol. 60, 2022, Art. no. 5610613.
19.
J. Qu, S. Hou, W. Dong, Y. Li, and W. Xie, “A multilevel encoder–decoder attention network for change detection in hyperspectral images,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5518113.
20.
W. Zhang, Q. Zhang, H. Ning, and X. Lu, “Cascaded attention-induced difference representation learning for multispectral change detection,” Int. J. Appl. Earth Observation Geoinformation, vol. 121, 2023, Art. no. 103366.
21.
H. Yin, “Attention-guided Siamese networks for change detection in high resolution remote sensing images,” Int. J. Appl. Earth Observation Geoinformation, vol. 117, 2023, Art. no. 103206.
22.
Y. Xu, B. Du, and L. Zhang, “Robust self-ensembling network for hyperspectral image classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 3, pp. 3780–3793, Mar. 2024.
23.
W. Zhao, J. Cao, and X. Dong, “Multilateral semantic with dual relation network for remote sensing images segmentation,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 506–518, 2024.
24.
J. Liu, X. Wang, M. Guo, R. Feng, and Y. Wang, “Shadow detection in remote sensing images based on spectral radiance separability enhancement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 5, pp. 3438–3449, May 2024.
25.
Y. Xiao, Q. Yuan, K. Jiang, J. He, C.-W. Lin, and L. Zhang, “TTST: A top-k token selective transformer for remote sensing image super-resolution,” IEEE Trans. Image Process., vol. 33, pp. 738–752, 2024.
26.
Y. Alkendi, R. Azzam, A. Ayyad, S. Javed, L. Seneviratne, and Y. Zweiri, “Neuromorphic camera denoising using graph neural network-driven transformers,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 3, pp. 4110–4124, Mar. 2024.
27.
Y. Su, J. Deng, R. Sun, G. Lin, H. Su, and Q. Wu, “A unified transformer framework for group-based segmentation: Co-segmentation, co-saliency detection and video salient object detection,” IEEE Trans. Multimedia, vol. 26, pp. 313–325, 2024.
28.
Z. Luo, “Exploring point-BEV fusion for 3D point cloud object tracking with transformer,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 9, pp. 5921–5935, Sep. 2024.
29.
C. Zhang, L. Wang, S. Cheng, and Y. Li, “SwinSUNet: Pure transformer network for remote sensing image change detection,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5224713.
30.
Y. Lin, “An unsupervised transformer-based multivariate alteration detection approach for change detection in VHR remote sensing images,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 17, pp. 3251–3261, 2024.
