Context and Difference Enhancement Network for Change Detection

Convolutional neural networks have achieved good performance in remote sensing image change detection. However, because convolution is local, these methods struggle to capture the global context relationships among different-level features. To alleviate this issue, we propose a context and difference enhancement network (CDENet) for change detection, which models global context relationships and enhances the change difference. Specifically, our backbone is the dual TransUNet, which is based on U-Net and equipped with a transformer block in the encoder. The dual TransUNet extracts the bitemporal features, which are then encoded as an input sequence to facilitate global context modeling. Moreover, we design the content difference enhancement module to process the dual features of each layer in the encoder. The module increases the spatial attention on difference regions to enhance the change difference features. In the decoder, we adopt a simple cross-layer feature fusion that combines the upsampled features with the high-resolution features to generate more accurate results. Finally, we adopt a novel loss to supervise the accuracy of the results at the region and pixel levels. Experiments on two public change detection datasets demonstrate that our CDENet is highly competitive and performs better than the state-of-the-art methods.


I. INTRODUCTION
Change detection (CD) is a technique for identifying land surface changes in remote sensing images. It compares two images of the same region acquired at different times and extracts the differences between them. Essentially, CD is a binary classification task that determines whether each region has changed. Each pixel in the CD result takes the value 1 or 0, indicating whether the corresponding position has changed or not. It is very important for CD to quickly and automatically extract accurate change information from massive remote sensing data. Owing to its importance and effectiveness, CD has been widely used in land use [1], [2], environmental monitoring [3], [4], damage assessment [5], [6], and many other tasks. Traditional CD methods can be divided into image algebra methods [7], [8] and transformation-based methods [9], [10]. Image algebra methods include image difference, image quantification, and so on. They usually compare the pixel values between the bitemporal images to generate difference maps. The transformation-based methods include principal component analysis (PCA) and histogram trend similarity. They distinguish the changed regions from the change information available after the transformation. In recent years, deep learning methods, especially convolutional neural networks (CNNs), have been widely used in various computer vision (CV) tasks, such as image segmentation [11], [12], [13], scene classification [14], [15], [16], object detection [17], [18], saliency detection [19], [20], image enhancement [21], [22], group detection [23], and so on. Because CNNs can learn the multi-level features and semantic features of the bitemporal images, they have also been introduced into CD to effectively describe the change information [24], [25], [26].
However, because traditional convolution is local, these methods struggle to capture the global semantic interaction and the context relationships among different-level features. Newer CNN-based methods [27], [28], [29] are therefore equipped with dilated convolutions [30], [31] or attention mechanisms [32], [33] to alleviate this problem, but they still have bottlenecks in extracting global information. Recently, the transformer [34] has been introduced into the CV field. Owing to its strong global context modeling ability, transformer-based methods have repeatedly set new best results in various CV tasks [35], [36]. Transformer-based methods are also gradually being applied to remote sensing image processing.
The existing methods have achieved great results in CD tasks, but they are still limited in extracting global information. In addition, the change features extracted by the existing methods contain much disturbing noise. Our motivation is to sufficiently capture the context relationships from different levels of bitemporal features and to enhance the change difference features by increasing the spatial attention to the difference regions. We propose a context and difference enhancement network (CDENet) for CD, which is based on U-Net and equipped with a transformer block in the encoder. It can model the global context and strongly enhance the change difference. The flow chart of our CDENet is shown in Fig. 1. Specifically, the dual-stream structure can comprehensively extract the features of bitemporal images. The transformer block in the encoder is used to model the context relationships and enhance the ability of structure representation. Then, we design the content difference enhancement module (CDEM), which extracts the change difference features between bitemporal images and enhances the content information in the difference features. Besides, the simple cross-layer feature fusion transfers the details of the difference features from each encoder layer to the corresponding decoder layer. Finally, we adopt a novel loss to supervise the results at the region and pixel levels. The proposed loss can improve the detection performance.
Our contributions are summarized as follows.
1) We propose CDENet for CD, which can strongly model global context relationships and enhance the change difference. The CDENet combines the ability of U-Net to extract spatial details and the ability of the transformer to model the context relationships. Besides, to fit the CD task, we specifically integrate the transformer block in the dual-stream network structure, which can capture the context relationships from the bitemporal features.
2) We design CDEM to process the dual features of each layer. The CDEM enhances the change difference features by increasing the spatial attention to the difference regions. It can preliminarily filter the noise and accurately locate the change regions.
3) We adopt a novel loss, which combines pixel- and region-level supervision. Therefore, it can ensure the accuracy of change results in regions and pixels.
4) We compare our CDENet with nine state-of-the-art (SOTA) methods on two public CD datasets. Visual and quantitative results demonstrate that our CDENet performs better. Besides, ablation experiments demonstrate the effectiveness of each module.
The rest of this article is organized as follows. Section II describes the related CD works. Section III introduces our CDENet. Section IV presents and analyzes the experimental results. Finally, Section V concludes this article.

II. RELATED WORK
In this section, we introduce the related work that mainly includes the traditional CD methods, CNN-based CD methods, and transformer-based methods in CV.

A. Traditional CD Methods
Among the traditional CD methods, some simple algebraic methods, such as image differencing [37] and image regression [38], were applied to the bitemporal images. They detected changes by computing the differences between the images and selecting a threshold on them. Bruzzone and Prieto [39] proposed automatic techniques to obtain the statistical distributions of the pixel changes. Deng et al. [9] adopted the PCA, which used the transformation method to distinguish the effective difference information. Krinidis and Chatzis [40] modified the fuzzy C-means algorithm for image clustering. Zhang et al. [41] used spectral clustering and difference images, which clustered the bitemporal images into two clusters to get the change maps. Li et al. [42] proposed the locality adaptive discriminant analysis to learn a representative subspace of the data. Although these methods performed well, they still had many shortcomings.

B. CNN-Based CD Methods
Benefiting from the comprehensive feature extraction ability and strong expression of CNN, many CNN-based CD methods have achieved encouraging results. Daudt et al. [25] proposed end-to-end training CD methods, including FC-EF, FC-Siam-conc, and FC-Siam-diff, which were based on the fully convolutional network (FCN) with shared weights. To model the spatial-temporal relationships, Chen and Shi [43] proposed a novel Siamese-based spatial-temporal attention network. To improve the change information extraction of bitemporal image pairs, Zhang et al. [44] presented a deeply supervised image fusion network (DSIFN). Chen et al. [45] presented the dual attentive Siamese networks, which can extract long-distance dependence to represent more discriminative features. Li and Huo [46] improved the FCN and presented FC-Siam-diff-pyramid attention (PA), which added the PA layer in the FC-Siam-diff. Based on PCANet and saliency detection, Li et al. [47] proposed the SD-PCANet. The SD-PCANet improved the antinoise performance and detection precision. To distinguish object-level features and obtain complete CD maps, Liu et al. [48] used dual-attention to capture the relationships between channels and spatial positions.
To sufficiently utilize the difference information, Lei et al. [49] proposed a difference enhancement and spatial-spectral nonlocal network in which their difference enhancement module was mainly based on channel attention. To maintain high-resolution and delicate representation, Fang et al. [50] aggregated and refined multi-level features. To alleviate pseudo-changes and noise in CD, Shi et al. [51] presented a deeply supervised attention metric-based network (DSAMNet). Besides, they created a new dataset (i.e., SYSU-CD) to overcome limitations in CD datasets. To better utilize the spatial and temporal semantic correlations, Ding et al. [52] proposed the bitemporal semantic reasoning network.

Fig. 2. Overall architecture of the CDENet, which is based on the U-Net structure and consists of the transformer block and CDEM. $F_{i1}$ denotes the features of the $i$th layer at time $T_1$, $F_{i2}$ denotes the features of the $i$th layer at time $T_2$, $O_{D_i}$ denotes the enhanced difference features of the $i$th layer, MSA represents the multihead self-attention, MLP represents the multilayer perceptron, GAP denotes the global average pooling operation, GMP denotes the global max pooling operation, "⊕" denotes the element-wise summation, "⊖" denotes the element-wise subtraction, "ⓒ" denotes the channel concatenation, and "⊗" denotes the element-wise multiplication.

C. Transformer-Based Methods in CV
The transformer was originally proposed for machine translation [34] and has been widely applied in natural language processing. It can establish global dependencies in sequence-to-sequence tasks. There have been many attempts to use transformers in CV tasks, which is now a clear trend. Dosovitskiy et al. [35] first applied the pure transformer to CV and proposed the ViT for image classification, which took the 2-D image patches with position embedding as the input. With the efforts of many researchers, the transformer-based methods performed as well as the CNN-based methods. Yuan et al. [36] presented the tokens-to-token vision transformer to progressively structurize the image. Thus, surrounding tokens can be modeled and token length can be reduced. Chen et al. [53] proposed the TransUNet for medical image segmentation, which has the merits of both transformers and U-Net. Fang et al. [54] presented the external attention-based TransUNet for crack detection. In addition, transformer-based methods were also used in remote sensing images. He et al. [55] embedded the Swin transformer into the U-Net for remote sensing semantic segmentation, which obtains global context relationships and improves feature discrimination. Ding et al. [56] proposed the wide-context network for remote sensing semantic segmentation, which used the CNNs to preserve the spatial information and used the transformer to model the semantic dependencies. Chen et al. [57] presented a bitemporal image transformer (BIT) for CD, which can efficiently model contexts and obtain semantic relationships. Bandara and Patel [58] utilized the hierarchical transformer encoder in a Siamese architecture for CD, which can capture multiscale details. Zhang et al. [59] designed a Siamese U-shaped pure transformer network (SwinSUNet) to get long-term global information in space-time.

III. PROPOSED METHOD
First, we introduce our CDENet in detail. Then, we describe the transformer block that models the global context. Next, we introduce the CDEM that enhances the change difference. Finally, we present the loss function that ensures the accuracy of change results in regions and pixels.

A. Overview
We propose a novel CDENet for CD, which can strongly model global context relationships and enhance the change difference. The overall architecture of our CDENet is shown in Fig. 2, which can better process the bitemporal images T 1 and T 2 . The green and blue boxes represent the features from the dual encoders of U-Net. The dual encoders have the same structures, but they do not share parameters (params). The final red boldface arrow represents the output of the changed pixel. The CDENet is based on the dual U-Net structure [60] and the transformer is integrated into the dual-branch encoder. Therefore, our CDENet can combine the ability of U-Net to extract spatial details and the ability of the transformer to model global context relationships. More importantly, we specially designed the CDEM to enhance the difference for bitemporal features. Besides, the cross-layer feature fusion can transfer the change information between the encoder and decoder. In general, the main parts of our network include the transformer block and the CDEM. Among them, the transformer block can better extract the global information from the bitemporal features. And the CDEM can enhance the features of bitemporal images. The size of our input image is 256×256, and the sizes of the features in each layer extracted from U-Net are 256×256×128, 128×128×256, 64×64×512, and 32×32×1024, respectively. The size of the feature processed after the transformer block is 32×32×512. After the dual-branch encoder and the transformer block, the CDEM is used to enhance and obtain the difference. Therefore, the obtained change features are more accurate and complete. Through the cross-layer feature fusion between the encoder and decoder, the network finally generates the change results. The results are the probability values of the predicted change pixels. Thus, we combine the binary cross-entropy (BCE) loss and the dice loss to supervise the CD results in pixels and regions.

B. Transformer Block
We consider that it is difficult for convolution to capture global context, and the transformer-based methods can effectively model context. To fully capture the global context relationship, we integrate the transformer block in the last layer of the encoder. The transformer was originally integrated [53] into a single-stream network for segmentation. To fit the CD task, we need to capture the context relationships from the bitemporal features. Therefore, we specifically integrate the transformer block in the dual-stream network structure, which can model the context relationships from the bitemporal features. The subbranches of the encoder in CDENet are identical to the encoder in [53]. The transformer block can better build the dependency of context features and improves the feature representation, which is conducive to the generation of the change difference features.
First, we divide the input features $x$ into a series of patches and vectorize the patches as $x_p$. Then, we map $x_p$ to a $D$-dimensional feature space via a trainable linear projection. Besides, learnable position features are added to the patch embeddings. The resulting patch embeddings are defined as
$$z_0 = \left[x_p^1\omega;\; x_p^2\omega;\; \ldots;\; x_p^N\omega\right] + \omega_{pos}$$
where $x_p^1, x_p^2, \ldots, x_p^N$ denote the vectorized patches, $N$ is the number of patches, $\omega \in \mathbb{R}^{(P^2 \cdot C) \times D}$ denotes the embedding projection, $P$ is the patch size, $C$ is the channel number, and $\omega_{pos} \in \mathbb{R}^{N \times D}$ denotes the position features.
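The patch embedding step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `patch_embed`, the toy sizes, and the random stand-ins for the trainable projection and position features are all assumptions made for illustration.

```python
import numpy as np

def patch_embed(x, patch_size, weight, pos):
    """Split an (H, W, C) feature map into P x P patches, flatten each patch,
    project it to D dimensions, and add the position features."""
    h, w, c = x.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "feature map must tile evenly into patches"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C)
    patches = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    x_p = patches.reshape(-1, p * p * c)      # (N, P*P*C) vectorized patches
    return x_p @ weight + pos                 # (N, D) patch embeddings

rng = np.random.default_rng(0)
H, W, C, P, D = 8, 8, 4, 2, 16                # toy sizes, not the paper's
x = rng.standard_normal((H, W, C))
N = (H // P) * (W // P)                       # number of patches
weight = rng.standard_normal((P * P * C, D)) * 0.02  # stand-in for the trainable projection
pos = rng.standard_normal((N, D)) * 0.02             # stand-in for learnable position features
z0 = patch_embed(x, P, weight, pos)
print(z0.shape)
```

Each row of `z0` corresponds to one patch, so the sequence length is the number of patches and the row width is the embedding dimension.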
There are $n = 12$ transformer layers in the transformer block. Each transformer layer consists of a multihead self-attention (MSA) module and a multilayer perceptron (MLP) module. The output $z_n$ of the $n$th transformer layer is defined as
$$z'_n = \mathrm{MSA}(\mathrm{LN}(z_{n-1})) + z_{n-1}$$
$$z_n = \mathrm{MLP}(\mathrm{LN}(z'_n)) + z'_n$$
where LN(·) represents the layer normalization operator, MSA represents multihead self-attention, and MLP represents multilayer perceptron. The MSA computes the global self-attention on the feature of the last layer in the encoder: it computes the attention of each head and concatenates the results as the output. The MSA is defined as
$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
where $h$ is the number of heads and $W^O$, $W^Q$, $W^K$, and $W^V$ represent the projection matrices of output, query, key, and value, respectively. Attention(·) is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where $Q$, $K$, and $V$ represent the query, key, and value, respectively, and $d_k$ denotes the channel dimension. After these operations, the representation feature $z_n$ is reshaped to the same size as the original input feature $x$.
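A single pre-norm transformer layer of the kind described here can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions: layer normalization has no learnable scale or shift, the MLP uses ReLU with a hidden width of 4D, and all weights are random stand-ins; none of these details is stated in the text.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Normalize each token over its feature dimension (no learnable affine)."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores) @ v

def msa(z, w_q, w_k, w_v, w_o, heads):
    """Multihead self-attention: per-head attention, concatenation, output projection."""
    n, d = z.shape
    d_h = d // heads
    def split(t):                              # (N, D) -> (heads, N, D/heads)
        return t.reshape(n, heads, d_h).transpose(1, 0, 2)
    out = attention(split(z @ w_q), split(z @ w_k), split(z @ w_v))
    out = out.transpose(1, 0, 2).reshape(n, d)  # concatenate the heads
    return out @ w_o

def mlp(z, w1, w2):
    return np.maximum(z @ w1, 0.0) @ w2        # ReLU chosen for brevity (assumption)

def transformer_layer(z, p):
    """Residual MSA followed by residual MLP, each applied to a normalized input."""
    z = z + msa(layer_norm(z), p["w_q"], p["w_k"], p["w_v"], p["w_o"], p["heads"])
    return z + mlp(layer_norm(z), p["w1"], p["w2"])

rng = np.random.default_rng(0)
N, D, HEADS = 16, 16, 4                        # toy sizes, not the paper's
params = {
    "heads": HEADS,
    "w_q": rng.standard_normal((D, D)) * 0.1,
    "w_k": rng.standard_normal((D, D)) * 0.1,
    "w_v": rng.standard_normal((D, D)) * 0.1,
    "w_o": rng.standard_normal((D, D)) * 0.1,
    "w1": rng.standard_normal((D, 4 * D)) * 0.1,
    "w2": rng.standard_normal((4 * D, D)) * 0.1,
}
z = rng.standard_normal((N, D))
z_out = transformer_layer(z, params)
print(z_out.shape)
```

The transformer block in the text stacks 12 such layers, each with its own weights; the sketch applies one layer to keep the example short. Because both sub-modules are residual, the output keeps the input's sequence shape.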

C. Content Difference Enhancement Module
Conventional methods usually directly subtract the bitemporal features to obtain the difference features. However, this direct acquisition of difference features will produce serious noise. To alleviate this problem, we introduce the CDEM to enhance difference for bitemporal features.
We input the bitemporal images into the encoder to extract features and generate four levels of features. The CDEM is designed to extract the difference features in each layer and emphasize the difference. The CDEM first obtains the initial difference feature by subtraction, but this feature contains a lot of noise and is not suitable as the difference result. Therefore, the CDEM generates the difference attention region from the initial difference feature, which can locate the change regions and preliminarily filter the noise. Then, the generated attention region acts on the original bitemporal features, and subtraction is performed again. Through these operations, accurate difference results can be obtained.
First, we subtract the bitemporal features to obtain the initial difference $D_i$, which is defined as
$$D_i = |F_{i1} - F_{i2}|$$
where $F_{i1}$ denotes the features of the $i$th layer at time $T_1$, $F_{i2}$ denotes the features of the $i$th layer at time $T_2$, and $|\cdot|$ represents the absolute value operation, which ensures the non-negativity of the change difference features. Then, to fully utilize the spatial relationship of the obtained difference features, we apply spatial attention to the differences. Specifically, we obtain the location of possible spatial differences in the features through max pooling and average pooling. Then, we use convolution and sigmoid operations to generate attention about the location. The spatial difference attention $A_{D_i}$ is defined as
$$A_{D_i} = \sigma\!\left(f^{7 \times 7}\!\left(\left[P_{max}(D_i);\; P_{ave}(D_i)\right]\right)\right)$$
where $P_{max}(\cdot)$ means the global max pooling operation, $P_{ave}(\cdot)$ means the global average pooling operation, $f^{7 \times 7}$ means the convolutional layer with a $7 \times 7$ kernel, which is used to generate the possible change attention, and $\sigma$ means the sigmoid operation. Finally, the spatial difference attention $A_{D_i}$ acts on the bitemporal features and generates the enhanced difference features $O_{D_i}$ through subtraction. The $O_{D_i}$ is defined as
$$O_{D_i} = |A_{D_i} \cdot F_{i1} - A_{D_i} \cdot F_{i2}|$$
where $\cdot$ represents the element-wise multiplication.
The CDEM is suitable for enhancing the difference features. It captures the detailed location information of the difference features through spatial attention and applies the spatial attention weight to the bitemporal features. The CDEM can obtain enhanced difference features from these operations.
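The CDEM steps (initial difference, spatial attention, attended subtraction) can be sketched in NumPy as below. This is a sketch under stated assumptions rather than the authors' code: the max and average pooling are taken across the channel axis, as is usual for spatial attention; the 7 × 7 kernel is a fixed hypothetical value instead of a learned one; and an absolute value is applied to the attended difference. The names `conv7x7` and `cdem` are illustrative.

```python
import numpy as np

def conv7x7(maps, kernel):
    """Zero-padded 'same' 7x7 convolution mapping (2, H, W) -> (H, W)."""
    _, h, w = maps.shape
    padded = np.pad(maps, ((0, 0), (3, 3), (3, 3)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    return out

def cdem(f1, f2, kernel):
    """Return (enhanced difference, spatial attention) for (C, H, W) features."""
    d = np.abs(f1 - f2)                                   # initial difference
    pooled = np.stack([d.max(axis=0), d.mean(axis=0)])    # channel-wise max / mean maps
    attn = 1.0 / (1.0 + np.exp(-conv7x7(pooled, kernel))) # sigmoid -> values in (0, 1)
    # attention weights both inputs before subtraction; abs() is an assumption
    return np.abs(attn * f1 - attn * f2), attn

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8                                         # toy sizes
f1 = rng.standard_normal((C, H, W))
f2 = f1.copy()
f2[:, 2:5, 2:5] += 3.0                                    # synthetic change in a 3x3 region
kernel = np.full((2, 7, 7), 0.05)                         # hypothetical kernel, learned in practice
enhanced, attn = cdem(f1, f2, kernel)
print(attn[3, 3] > attn[7, 7])
```

In this toy case the attention is larger near the synthetic change than far from it, and the enhanced difference stays exactly zero wherever the two inputs are identical.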
Besides, some detail information is lost during the transmission from the encoder to the decoder. To alleviate this problem, we introduce the simple cross-layer feature fusion, which fuses the features by the concatenation operation. The simple cross-layer feature fusion transfers the details of the difference features between each encoder layer and the corresponding decoder layer. It mainly fuses the downsampled features in the encoder with the upsampled features in the decoder to obtain more accurate context relationships.
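Fusion by concatenation can be sketched as follows. The upsampling method is not specified in the text, so nearest-neighbour 2× upsampling is used here purely as an assumption, and the function names and toy shapes are illustrative.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map (assumed method)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def cross_layer_fusion(decoder_feat, encoder_diff):
    """Upsample the decoder feature and concatenate it with the encoder
    difference feature along the channel axis."""
    up = upsample2x(decoder_feat)
    assert up.shape[1:] == encoder_diff.shape[1:], "spatial sizes must match"
    return np.concatenate([up, encoder_diff], axis=0)

decoder_feat = np.arange(8, dtype=float).reshape(2, 2, 2)  # toy low-resolution decoder feature
encoder_diff = np.ones((3, 4, 4))                          # toy high-resolution difference feature
fused = cross_layer_fusion(decoder_feat, encoder_diff)
print(fused.shape)
```

The fused tensor has the channels of both inputs, so any convolution that follows it sees the upsampled context and the high-resolution difference details side by side.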

D. Loss Function
Essentially, CD is a binary classification task, that is, the pixels in remote sensing images are divided into changed with the value of 1 and unchanged with the value of 0. Therefore, we can use classification loss to supervise our network training. Specifically, to better supervise the training of CD, we adopt a novel loss, which is composed of the BCE loss and the dice loss.
The BCE loss is the pixel-level supervision between the prediction results and the ground truth (GT). It is the point-to-point change supervision and is defined as
$$L_{bce} = -\left[Y \log X + (1 - Y) \log (1 - X)\right]$$
where $X$ denotes the prediction results and $Y$ denotes the GT.
The dice loss is the region-level supervision between the prediction results and the GT. It is the regional change supervision and is defined as
$$L_{dice} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$
where $X$ denotes the prediction results and $Y$ denotes the GT. The total loss function $L_{total}$ can be denoted as
$$L_{total} = \lambda_1 L_{bce} + \lambda_2 L_{dice}$$
where $\lambda_1$ and $\lambda_2$ are the corresponding hyperparameters of the $L_{bce}$ loss and $L_{dice}$ loss, respectively. The BCE loss supervises the prediction results at the pixel level, and the dice loss supervises the prediction results at the regional level [61]. Because the total loss combines the two, it conducts hybrid supervision on the predicted change results at both the pixel level and the regional level.
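A weighted BCE-plus-dice loss of this kind can be sketched in NumPy as below. The sketch makes a few assumptions that the text does not state: the BCE term is averaged over pixels, the dice term uses the soft (probability-weighted) overlap, and a small `eps` guards against log(0) and division by zero. The default weights 0.8 and 0.2 are the values reported in the implementation details.

```python
import numpy as np

def bce_loss(x, y, eps=1e-7):
    """Pixel-level binary cross-entropy, averaged over pixels (assumption)."""
    x = np.clip(x, eps, 1.0 - eps)             # avoid log(0)
    return float(-np.mean(y * np.log(x) + (1.0 - y) * np.log(1.0 - x)))

def dice_loss(x, y, eps=1e-7):
    """Region-level dice loss with a soft overlap between prediction and GT."""
    inter = np.sum(x * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(x) + np.sum(y) + eps))

def total_loss(x, y, lam1=0.8, lam2=0.2):
    """Weighted sum of the pixel-level and region-level terms."""
    return lam1 * bce_loss(x, y) + lam2 * dice_loss(x, y)

gt = np.array([[1.0, 0.0], [0.0, 1.0]])        # toy ground truth
perfect = gt.copy()                            # prediction identical to the GT
uncertain = np.full((2, 2), 0.5)               # prediction of 0.5 everywhere
print(total_loss(perfect, gt), total_loss(uncertain, gt))
```

A perfect prediction drives both terms to approximately zero, while the uniformly uncertain prediction is penalized by both the pixel-level and the region-level term.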

E. Prediction of the Proposed CDENet
We display the representative CD results, which are shown in Fig. 3. Note that, different from the difference enhancement module in [49], our content enhancement module is based on spatial attention rather than channel attention. That is, our module enhances spatial content differences rather than channel differences. Our content enhancement module mainly focuses on the change region, and spatial attention can enhance content information. We also replace our CDEM with the channel attention enhancement module (CAEM) and display their difference.
We can see that the change region in Fig. 3(e) is very complete, while the change region in Fig. 3(g) is missing. In addition, we also show the effect of superimposing heat maps on the T 2 image. We can see more clearly that the superimposed result in Fig. 3(f) can cover more change regions and maintain better details than Fig. 3(h). Therefore, the result of our CDENet is closer to the GT than the CAEM, which means our CDENet can better enhance the content difference.

IV. EXPERIMENTS
First, we present the datasets and evaluation metrics. Then, we clearly introduce the experimental settings. Next, we show and analyze the comparison result between the proposed CDENet and other methods. The visual and quantitative results show the great performance of our CDENet. Finally, the ablation experiments show the effectiveness of each module in the CDENet.
TP represents the number of changed pixels that are correctly classified as changed. FP represents the number of unchanged pixels that are incorrectly classified as changed. FN represents the number of changed pixels that are incorrectly classified as unchanged. TN represents the number of unchanged pixels that are correctly classified as unchanged.
Precision indicates the proportion of correctly predicted change pixels among all predicted change pixels. The precision is defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall indicates the proportion of correctly predicted change pixels among all change pixels in the GT. The recall is defined as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1 combines precision and recall. Its formula is
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
OA is the proportion of pixels correctly predicted, and its formula is
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$
IoU indicates the region overlap between the prediction result and the GT. Its formula is
$$IoU = \frac{TP}{TP + FP + FN}$$
Because the evaluation of CD needs to comprehensively consider both precision and recall, the F1, OA, and IoU reflect the performance of the methods more comprehensively.
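The five metrics follow directly from the confusion counts, as the short pure-Python sketch below shows. The function name `cd_metrics` and the toy label vectors are illustrative, and returning 0.0 for an empty denominator is a convention chosen here, not something stated in the text.

```python
def cd_metrics(pred, gt):
    """Compute precision, recall, F1, OA, and IoU from flat 0/1 label sequences."""
    tp = fp = fn = tn = 0
    for p, g in zip(pred, gt):
        if p == 1 and g == 1:
            tp += 1            # changed, predicted changed
        elif p == 1 and g == 0:
            fp += 1            # unchanged, predicted changed
        elif p == 0 and g == 1:
            fn += 1            # changed, predicted unchanged
        else:
            tn += 1            # unchanged, predicted unchanged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "oa": oa, "iou": iou}

pred = [1, 1, 0, 0, 1, 0, 0, 0]   # toy prediction (flattened)
gt   = [1, 0, 0, 0, 1, 1, 0, 0]   # toy ground truth (flattened)
m = cd_metrics(pred, gt)
print(m)
```

For this toy case TP = 2, FP = 1, FN = 1, and TN = 4, giving precision = recall = F1 = 2/3, OA = 0.75, and IoU = 0.5. Since F1 and IoU depend on the same three counts, they are related by IoU = F1 / (2 − F1).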

C. Implementation Details
Our CDENet is implemented with PyTorch, and all experiments are conducted on an NVIDIA GeForce RTX 3090 Ti GPU (24 GB memory). There are 19 120 samples in the training dataset, including 7120 samples from LEVIR-CD and 12 000 samples from SYSU-CD. The size of our input and output images is 256×256. The training details are as follows. We use the Adam optimizer [62] to train our CDENet for 200 epochs and retain the optimal model. The batch size is 12 and the initial learning rate is $5 \times 10^{-3}$. The hyperparameters in our loss are set to $\lambda_1 = 0.8$ and $\lambda_2 = 0.2$. Because the prediction results are the probability values of change pixels, in the test phase, the value of each pixel is binarized to 0 or 1 according to a threshold of 0.5.
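The test-time binarization can be written in a few lines of plain Python. The helper name `binarize` is hypothetical, and mapping a probability of exactly 0.5 to the changed class is an assumption, since the text only gives the threshold value.

```python
def binarize(prob_map, threshold=0.5):
    """Turn a 2-D list of change probabilities into a 0/1 change map.
    Values equal to the threshold are treated as changed (assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

probs = [[0.10, 0.50],
         [0.93, 0.49]]          # toy probability map
change_map = binarize(probs)
print(change_map)
```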
FC-EF [25] concatenated the bitemporal images and fed them into a U-shaped structure. FC-Siam-conc [25] was based on a Siamese U-shaped structure, which extracted multi-level features and concatenated the features of each layer. FC-Siam-diff [25] was also based on a Siamese U-shaped structure, which extracted multi-level features and utilized the difference of features for skip connection. The DSIFN [44] adopted the two-stream FCN to extract representative multi-level features from bitemporal images. Then, DSIFN performed difference discrimination on representative features to obtain the results. The DTCDSCN [48] used dual-attention to capture the relationships between channels and spatial positions. The SNUNet [50] combined the NestedUNet and the Siamese network, which can aggregate and refine multi-level features. The DSAMNet [51] integrated the CBAM module [32] to learn a change map. Besides, an auxiliary supervision module was used to enhance features, which made the features contain more spatial information. BIT [57] modeled the context relationships with the transformers, which can better identify the change regions. ChangeFormer [58] utilized the hierarchical transformer encoder in a Siamese architecture, which can capture multiscale details.
2) Experiments on the LEVIR-CD Dataset: The visual experiment results are shown in Fig. 4. For the small change object (see first row), our method can completely detect the region and structure, but the comparison methods detect incomplete regions, and some methods even miss the small object. For the large building (see second row), our method can detect the correct region and keep the boundary smooth, but the comparison methods detect wrong regions, the correct regions they detect are discontinuous, and some methods cannot detect the object at all. For sparse multiple buildings (see third row), our method can maintain the structure and details of the sparse multiple buildings, but the comparison methods often lose the structure of the object and blur the details. For dense multiple buildings (see fourth row), our method can accurately detect the dense multiple buildings and keep their boundaries clear, but the comparison methods often mistakenly detect the buildings, and there is adhesion between the multiple buildings. We can see that our method has achieved excellent performance in maintaining the complete structure and accurate details.

Fig. 4. Visual results on the LEVIR-CD dataset. (d) FC-EF [25]. (e) FC-Siam-conc [25]. (f) FC-Siam-diff [25]. (g) DSIFN [44]. (h) DTCDSCN [48]. (i) SNUNet [50]. (j) DSAMNet [51]. (k) BIT [57]. (l) ChangeFormer [58]. (m) Ours.
To directly show the performance of our CDENet, we conduct quantitative experiments. The quantitative results are given in Table I, where boldface numbers present the best value. From Table I, we can see that our method performs best in F1, OA, and IoU indexes. Specifically, in terms of the F1 index, our method is 0.8844 and is 0.98% higher than the second-best method (BIT). In terms of the OA index, our method is 0.9884 and is 0.07% higher than the second-best method (BIT). In terms of the IoU index, our method is 0.7927 and is 1.55% higher than the second-best method (BIT). Qualitative and quantitative results on the LEVIR-CD dataset show that our CDENet has achieved satisfactory results and better performances.
3) Experiments on the SYSU-CD Dataset: The visual experiment results are shown in Fig. 5. For the small building (see first row), our method can completely detect the region and structure, but the comparison methods detect the wrong regions and even some methods miss the small object. For multiple objects (see second row), our method can accurately detect the multiple objects and keep their profile, but the comparison methods often mistakenly detect the objects and blur their profile. For the large continuous vegetation (see third row), our method can correctly detect the large continuous vegetation and maintain great details and contours. But the comparison methods often detect the discontinuous regions and inaccurate contours. For the vegetation in large discontinuous regions (see fourth row), our method can correctly detect the vegetation in different regions and maintain great consistency and details. But the comparison methods cannot correctly detect the vegetation in different regions, and the continuity and details of the detected vegetation are poor. We can see that our method has achieved excellent performance in maintaining accurate regions and details.
To directly show the performance of our CDENet, we conduct quantitative experiments. The quantitative results are given in Table II, where boldface numbers present the best value. From Table II, we can see that our method performs best in F1, OA, and IoU indexes. Specifically, in terms of the F1 index, our method is 0.7593 and is 0.21% higher than the second-best method (ChangeFormer). In terms of the OA index, our method is 0.8966 and is 0.47% higher than the second-best method (ChangeFormer). In terms of the IoU index, our method is 0.6119 and is 0.27% higher than the second-best method (ChangeFormer). Qualitative and quantitative results on the SYSU-CD dataset show that our CDENet has achieved satisfactory results and better performances.

Fig. 5. Visual results on the SYSU-CD dataset. (d) FC-EF [25]. (e) FC-Siam-conc [25]. (f) FC-Siam-diff [25]. (g) DSIFN [44]. (h) DTCDSCN [48]. (i) SNUNet [50]. (j) DSAMNet [51]. (k) BIT [57]. (l) ChangeFormer [58]. (m) Ours.

Table III presents the efficiency studies of the methods on the LEVIR-CD dataset. Specifically, we conduct efficiency studies on our CDENet and the SOTA methods in multiple quantitative indicators, including the number of params, floating point operations (FLOPs), and F1. Our CDENet is based on the dual-encoder of U-Net, and they do not share the params. The bitemporal features are processed by a dual-encoder of U-Net, respectively. Ours (Siamese network) is based on one encoder of U-Net, and the bitemporal features are only processed by the single encoder of U-Net. Compared with SOTA methods, the params of our CDENet are large, and the params of the Siamese network are normal. It can be seen that the dual-flow structure causes an increase in network params. In addition, compared with SOTA methods, the FLOPs of our CDENet are medium, which means the proposed network is not very complex. And the F1 of our CDENet is the highest, which indicates that the

F. Ablation Studies
To study the factors affecting the results of the CDENet and the effectiveness of different modules, we perform ablation experiments on the LEVIR-CD dataset. The quantitative ablation results are given in Table IV, where boldface numbers present the best value.
Specifically, we study the impact of the cross-layer feature fusion (+ CLFF), the transformer block (+ Transformer Block), the CDEM (+ CDEM), the loss function (Ours), and the Siamese network [Ours (Siamese network)]. First, we adopt the dual-stream FCN with a simple feature subtraction as the baseline. We then add the simple cross-layer feature fusion to transfer the details of the changed regions from each encoder layer to the corresponding decoder layer. Compared with the baseline, + CLFF increases the F1 by 7.93% and the IoU by 11.69%, which shows that the fusion effectively transfers the details from the encoder to the decoder. To capture the global context of the bitemporal images, the transformer block is integrated into the last convolution layer of the dual-stream network. Compared with + CLFF, + Transformer Block increases the F1 by 0.47% and the IoU by 0.75%, so it better captures the global context relationships. Then, the CDEM is proposed to enhance the difference between the bitemporal features. Compared with + Transformer Block, + CDEM increases the F1 by 0.37% and the IoU by 0.59%, so it better enhances the difference between the bitemporal features. Finally, the proposed loss better supervises the training of CD. Compared with + CDEM, Ours increases the F1 by 0.22% and the IoU by 0.34%, which means that the proposed loss further improves the detection performance. Overall, Ours increases the F1 by 8.99% and the IoU by 13.37% over the baseline.
Besides, we compare the dual-stream network (Ours) with the Siamese network [Ours (Siamese network)] on our model. Ours increases the F1 by 15.97% and the IoU by 22.45% over Ours (Siamese network). In terms of complexity, the params and FLOPs of + CLFF are slightly higher than those of the baseline, those of + Transformer Block are much higher than those of + CLFF, and those of + CDEM and Ours are the same as those of + Transformer Block. The quantitative ablation results demonstrate the effectiveness of the proposed network.
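For reference, the F1 and IoU values compared in the ablation can be computed from binary change masks as in the following minimal Python sketch. It assumes the standard definitions for the changed class, F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN). The function names and the toy masks are hypothetical and are used only for illustration.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN for the changed class (label 1).

    `pred` and `target` are equally sized binary masks given as
    nested lists of 0/1 (an assumed, illustrative input format).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    return tp, fp, fn


def f1_and_iou(pred, target):
    """Return (F1, IoU) of the changed class; both are 0.0 if it never occurs."""
    tp, fp, fn = confusion_counts(pred, target)
    f1_denominator = 2 * tp + fp + fn
    iou_denominator = tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    iou = tp / iou_denominator if iou_denominator else 0.0
    return f1, iou


# Toy 2x3 masks: 2 true positives, 1 false positive, 1 false negative.
toy_pred = [[1, 1, 0],
            [0, 1, 0]]
toy_target = [[1, 0, 0],
              [0, 1, 1]]
toy_f1, toy_iou = f1_and_iou(toy_pred, toy_target)
print(round(toy_f1, 3), toy_iou)  # prints: 0.667 0.5
```

Under these definitions, IoU = F1 / (2 − F1). For F1 above about 0.59, a small gain in F1 therefore corresponds to a larger gain in IoU, which is consistent with the IoU increments in the ablation being larger than the F1 increments.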
The visual results in Fig. 6 demonstrate the effect of the transformer block. Without the transformer block, our method cannot identify the part of the
We also quantitatively compare our CDEM with the CAEM [49]. The comparison results are given in Table V, where boldface numbers indicate the best value. Ours increases the F1 by 0.89% and the IoU by 1.41% over the result of the CAEM, which shows that our CDEM performs better than the CAEM.
Besides, we provide visual results to demonstrate the performance of the proposed loss, as shown in Fig. 7. The change region in Fig. 7(d) is inaccurate and chaotic, while the change region in Fig. 7(e) is complete and continuous. The proposed loss detects more accurate regions than the BCE loss. Therefore, we conclude that the proposed loss better maintains the accuracy of both regions and pixels.
We design experiments to determine the hyperparameters of the loss function. The results for different loss parameters on the LEVIR-CD dataset are given in Table VI, where boldface numbers indicate the best value. λ1 and λ2 are the hyperparameters of the L_bce loss and the L_dice loss, respectively. The experimental results show that the performance is optimal when λ1 = 0.8 and λ2 = 0.2.
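To make the role of the two hyperparameters concrete, the following minimal Python sketch combines a BCE term and a Dice term with the weights selected in Table VI. It is an illustration under stated assumptions: the total loss is assumed to be the weighted sum λ1 · L_bce + λ2 · L_dice, the Dice term uses a common smoothed formulation, and the function names, the `smooth` and `eps` parameters, and the toy inputs are hypothetical. The exact formulation used in CDENet may differ.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Smoothed Dice loss (assumed formulation): 1 - (2|P∩T| + s) / (|P| + |T| + s)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, lambda_bce=0.8, lambda_dice=0.2):
    """Assumed weighted sum: lambda_bce * L_bce + lambda_dice * L_dice."""
    return (lambda_bce * bce_loss(probs, targets)
            + lambda_dice * dice_loss(probs, targets))


# Toy example: four pixels, two of them changed (hypothetical values).
toy_probs = [0.9, 0.2, 0.7, 0.1]
toy_targets = [1, 0, 1, 0]
print(combined_loss(toy_probs, toy_targets))
```

In this sketch, the BCE term penalizes each pixel independently, while the Dice term measures the overlap of the predicted and reference change regions as a whole, which matches the stated goal of supervising the results in both pixels and regions.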

V. CONCLUSION
In this article, we proposed CDENet for remote sensing image CD. The proposed CDENet models global context relationships and enhances the change difference. Specifically, CDENet is based on U-Net and equipped with a transformer block in the encoder. The U-Net structure makes good use of the bitemporal features, and the transformer block captures the global context relationships among different-level features. Moreover, we design the CDEM to process the dual features of each layer, which increases the attention on difference regions and thus enhances the change difference features in the encoder. With the cross-layer feature fusion, we transfer the details of the changed regions from each encoder layer to the corresponding decoder layer, which helps to transmit more effective change difference information and obtain more accurate results. Finally, the proposed loss better supervises the accuracy of the change results in regions and pixels. The visual and quantitative experimental results on two popular CD datasets demonstrate that our CDENet is effective and superior to other SOTA methods. Our CDENet mainly aims to improve the detection performance, and its limitation is that the number of params is large. In future work, we will consider reducing the params of the model while maintaining the accuracy of CD.