MDFENet: A Multiscale Difference Feature Enhancement Network for Remote Sensing Change Detection

The main task of remote sensing change detection (CD) is to identify object differences in bitemporal remote sensing images. In recent years, methods based on deep convolutional neural networks have made great progress in remote sensing CD. However, due to illumination changes and seasonal changes in images acquired by the same sensor, the problem of "pseudo change" in the change map remains difficult to solve. In this article, to reduce "pseudo changes," we propose a multiscale difference feature enhancement network (MDFENet) to extract the most discriminative features from bitemporal remote sensing images. MDFENet contains three procedures: first, multiscale bitemporal features are generated by a weight-shared Siamese encoder. Then the features of each scale are fed into a difference enhancement module to generate refined difference features. Finally, they are combined and reconstructed by a decoder to generate the change map. The difference enhancement module includes multiple layers of difference enhancement (DE) encoders and transformer decoders, which are applied to features of different scales to establish long-range relationships among pixel-level semantic changes; high-level difference features participate in the generation of low-level difference features to enhance information transmission among features of different scales, reducing "pseudo changes." Compared with state-of-the-art methods, the proposed method achieves the best performance on two datasets, with F1 scores of 81.15% on the SYSU-CD dataset and 90.85% on the LEVIR-CD dataset.


I. INTRODUCTION
In recent decades, change detection (CD) has been one of the hot topics in remote sensing, and its main task is to identify the state difference of an object or phenomenon at different times [1]. Multitemporal remote sensing images, such as satellite images and aerial images, can provide rich information to identify land use and land cover (LULC) differences in the same area at different times [2], which is useful for urban planning, environmental monitoring, agricultural surveys, disaster evaluation, and map revision [3].
With the development of high-resolution optical sensors (such as WorldView-3, GeoEye-1, and Gaofen-2), it is now easier to acquire high-resolution (HR) multitemporal remote sensing images of different areas, and such images have gradually become the main data source for CD tasks because of their wide coverage and high resolution. Due to different sensor response characteristics, images acquired by different sensors carry different information, which makes CD more challenging. Scholars have designed many effective methods [4], [5], [6], [7], [8] for the multisensor CD task. At the same time, due to illumination changes and seasonal changes, the color and appearance of the same object can differ in images collected by the same sensor at different times. As shown in Fig. 1, these problems increase the difficulty of CD, causing the model to be unable to distinguish between "real changes" and "pseudo changes." Therefore, extracting the most discriminative features from bitemporal HR remote sensing images in order to reduce "pseudo changes" is an important issue in remote sensing CD.
Traditional CD methods, such as algebra methods [9], [10], transform-based methods [11], and classification-based methods [12], can achieve effective results in some medium- and low-resolution scenes, but new solutions are needed for complex HR images. In recent years, methods based on convolutional neural networks (CNNs) have become popular in computer vision and remote sensing, and these methods have achieved remarkable success in tasks such as image classification [13], [14], image segmentation [15], [16], object detection [17], [18], GANs [19], [20], and super-resolution [22]. In CNN-based CD, there are generally two types of methods: one based on semantic segmentation [23], [24] and the other based on metric learning. The first type treats CD as a dense prediction task similar to image segmentation and adapts image segmentation models such as U-Net [15], in which the encoder extracts bitemporal features and the decoder fuses them to generate a change map. These methods [25], [26], [27], [28] exploit the similarity of the tasks, and many off-the-shelf models from image segmentation can be applied to CD with minor modifications. The other type is based on metric learning [29], [30]. After feature extraction, the change map is obtained by calculating and optimizing the distance between the bitemporal images in a high-dimensional feature space, that is, increasing the distance between features in changed areas and reducing the distance between features in unchanged areas.
However, current CNN-based CD methods cannot solve the "pseudo changes" problem well. Previous studies have mainly focused on using CNNs to mine the relationship between bitemporal images to obtain more discriminative features. If the extracted features already contain rich change information, such as the positions of changed and unchanged objects, the change map can be obtained through feature fusion or metric learning. However, since HR remote sensing images are more refined and their texture features are more complex, more effective techniques are needed to extract discriminative information from the features. Recognizing changing objects requires considering the global semantic context of the bitemporal remote sensing images, yet it is difficult for pure convolution-based structures to correlate long-range context information. Recently, transformers have been applied to computer vision and many variants with excellent performance have been developed [31], [32], [33], [34]. Compared with convolution, the transformer has a global modeling ability and can better capture the most discriminative information in HR remote sensing images.
In this article, to solve the problem of "pseudo changes" caused by illumination changes and seasonal changes in images acquired by the same sensor, we propose a multiscale difference feature enhancement network (MDFENet) that combines CNN and transformer to extract the most discriminative features from bitemporal remote sensing images. By using a multiscale difference enhancement module (MDEM) to jointly model the global context at different scales, MDFENet can better establish the representation of difference features, which is beneficial for identifying "real changes" and "pseudo changes" in the images. Specifically, a difference enhancement (DE) encoder maps features into tokens and enhances change information. The tokens generated by the DE encoder are then mapped back to features by a transformer decoder. Finally, we perform multiscale fusion of these high-dimensional features to obtain the CD result. Our method helps the model identify change information through global context and has strong robustness. The contributions of our work can be summarized as follows.
1) An MDFENet combining CNN and transformer is proposed for remote sensing image CD. We design an MDEM to model the global context of features at different scales and enhance the representation of multiscale bitemporal change features.

2) We propose a difference enhancement (DE) encoder that enables high-level tokens to participate in the generation of low-level tokens and enhances information exchange among tokens of different scales, generating more accurate difference features and reducing the "pseudo changes" problem.

3) Our method is evaluated on two public datasets, SYSU-CD and LEVIR-CD. The experimental results show the effectiveness of the method: it achieves an F1 of 81.15% on the SYSU-CD dataset and an F1 of 90.85% on the LEVIR-CD dataset, surpassing many state-of-the-art methods.

The rest of this article is organized as follows. Related work is presented in Section II. The proposed method is presented in Section III. Section IV includes the experiments and the analysis of the experimental results. The discussion is presented in Section V. Finally, Section VI concludes this article.

II. RELATED WORK

A. CNN-Based RSCD
In recent years, methods based on deep CNNs have achieved remarkable results on remote sensing image CD. Daudt et al. [37] proposed three fully convolutional models, FC-Siam-conc, FC-Siam-diff, and FC-EF, improving the accuracy of CD by adding skip connections to the Siamese neural network. Peng et al. [38] generated feature maps with higher spatial accuracy by improving U-Net++ and fused multilayer outputs to generate high-precision change maps.
Fang et al. [39] also adopted U-Net++-based dense connections and designed an ECAM module that aggregates multiple outputs through an attention mechanism to obtain a finer change map. Zhang et al. [40] proposed a deeply supervised feature fusion network that fuses deep and difference features to ensure boundary integrity and introduced deep supervision to train the model and alleviate the vanishing gradient problem. Ke et al. [41] proposed MCCRNet for CD. In recent years, multitask methods [42], [43], [44], [45], [46], [47] have become popular; they perform semantic segmentation and CD simultaneously on bitemporal remote sensing images so that the two tasks can optimize and improve each other. Metric learning is also commonly used in CD tasks. Shi et al. [48] proposed a deeply supervised attention metric-based network (DSAMNet) that learns change maps through deep metric learning, integrating a convolutional block attention module (CBAM) to provide more discriminative characteristics; in addition, a deep supervision (DS) module is introduced to improve the learning ability of the feature extractor and generate more useful features. Yan et al. [49] proposed a coupled distance metric learning (CDML) model to enhance the contrast between changed and unchanged pixels. Zhan et al. [50] performed CD by combining k-nearest neighbors and a weighted contrastive loss. In addition to spatial modeling, time series-based models [51], [52] have achieved good results by combining CNNs and LSTMs. Mou et al. [64] proposed a recurrent CNN architecture for CD in multispectral images. Bai et al. [65] proposed an LSTM-based module to address edge inaccuracy in CD results.

B. Transformer-Based Methods
Recently, transformers have been successfully applied to several tasks in computer vision [53] and achieved good results. One of the problems of applying pure transformers in computer vision is the large computation and memory requirement. ViT [31] divides the image into patches and takes the patch as the basic operation unit before feeding the transformer, which reduces the amount of computation compared with taking the pixel as the basic unit. More recently, variants of the transformer structure, such as [32], mimic the local and hierarchical design of CNNs: attention computation is limited to each window, and tokens can interact across windows through shifted windows; the hierarchical structure further reduces computation. Token-based ViTs [35], [36] combine CNN and transformer, first extracting image features with a CNN and then feeding them into the transformer, so the transformer does not need to perform extensive attention computation on the original image, which reduces the amount of calculation. In the remote sensing CD task, BIT [54] and ChangeFormer [55] proposed CD networks combining CNN and transformer that can effectively model long-range spatiotemporal context. Inspired by these works, TransCD [56] applies the Siamese network to the scene CD task. The above methods achieve good performance in the CD task, but they do not consider multiscale feature information, so they cannot accurately identify changing objects of different scales. Hybrid-TransCD [62] proposed a hybrid multiscale transformer structure that can model mixed-scale attention representations of each image. STransUNet [66] combines the transformer and UNet architectures, capturing shallow detail features and modeling the global context in high-level features. To capture the spatial and channel information of feature maps, MTCNet [67] splits CBAM into a spatial attention module (SAM) and a channel attention module (CAM), which are applied to the front and back ends of the multiscale transformer, respectively. However, these methods do not consider the fusion of multiscale tokens when using transformers to model long-range context information. Therefore, we build a multiscale structure combining CNN and transformer to fuse the semantic information of high-level and low-level tokens, which can accurately locate changing objects.

III. PROPOSED METHOD
The MDFENet proposed in this article includes three stages: 1) a feature extractor; 2) a multiscale difference enhancement module; and 3) a multiscale feature decoder. The network structure is shown in Fig. 2. The key idea of MDFENet is to use the global modeling ability of the transformer and a multilayer token fusion structure to enhance information exchange, which is conducive to identifying real changes and reducing pseudo changes. Let T1 and T2 represent a pair of images of the same area at different times, and let ChangeMap be the final output change map. The process of MDFENet can be summarized as follows.
1) First, the images T1 and T2 are input into the weight-shared CNN feature extractor to obtain two sets of hierarchical multiscale high-dimensional feature maps, and the features of the same scale are concatenated as the input of the next stage, yielding three sets of features $F_{cate}^{\delta}$, δ = 2, 3, 4.

2) Then, the three sets of features $F_{cate}^{\delta}$, δ = 2, 3, 4, are input into the multiscale difference enhancement module (MDEM), that is, into the difference enhancement (DE) encoder and transformer decoder corresponding to each layer. The refined tokens generated by the DE encoder are decoded by the transformer decoder to obtain more discriminative difference features. At the same time, low-level tokens are combined with high-level tokens to strengthen the connection between tokens at different scales and enhance the feature representation. This stage outputs three sets of refined difference features $F_{out}^{\delta}$, δ = 2, 3, 4.

3) Finally, the three sets of refined difference features $F_{out}^{\delta}$, δ = 2, 3, 4, are input to the multiscale feature decoder, which upsamples and fuses the multiscale features layer by layer and maps the result through a 1 × 1 convolution to the change map.

The above process describes the general workflow of MDFENet; the implementation details are given in the following sections, and a minimal end-to-end sketch is given below.
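To make the workflow concrete, the following is a minimal PyTorch sketch of the three-stage pipeline. All class and parameter names here (MDFENet, SiameseExtractor, DEEncoder, TransformerDecoder, MultiscaleDecoder, prev_channels) are hypothetical stand-ins for the components sketched in the following subsections, not the authors' released code:

```python
import torch.nn as nn

class MDFENet(nn.Module):
    """Three-stage sketch: Siamese extractor -> per-scale MDEM
    (DE encoder + transformer decoder) -> multiscale feature decoder."""

    def __init__(self, channels=(128, 256, 512), num_tokens=8):
        super().__init__()
        self.extractor = SiameseExtractor()
        # one DE encoder / transformer decoder pair per scale (delta = 2, 3, 4);
        # each DE encoder also receives the tokens of the next deeper scale
        self.de_enc = nn.ModuleList(
            DEEncoder(c, num_tokens, prev_channels=p)
            for c, p in zip(channels, (256, 512, None)))
        self.tr_dec = nn.ModuleList(TransformerDecoder(c) for c in channels)
        self.decoder = MultiscaleDecoder()

    def forward(self, img1, img2):
        f2, f3, f4 = self.extractor(img1, img2)  # concatenated features
        tok4 = self.de_enc[2](f4)                # deepest scale: no token fusion
        tok3 = self.de_enc[1](f3, tok4)          # fuse deeper-scale tokens
        tok2 = self.de_enc[0](f2, tok3)
        out4 = self.tr_dec[2](f4, tok4)
        out3 = self.tr_dec[1](f3, tok3)
        out2 = self.tr_dec[0](f2, tok2)
        return self.decoder(out2, out3, out4)    # B x 1 x H x W change logits
```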

A. Feature Extractor
The feature extractor adopts a weight-sharing Siamese structure to extract features from the bitemporal remote sensing images. As traditional CNNs grow deeper, they often suffer from vanishing or exploding gradients, resulting in network degradation; ResNet [13] alleviates these problems to a large extent by introducing skip connections. We use a modified ResNet-18 as the feature extractor of MDFENet: the fifth stage, the global pooling layer, and the fully connected layer are removed from the original ResNet-18. Specifically, it contains a convolutional layer with a kernel size of 7 × 7 and three residual stages. The initial parameters of the feature extraction module are loaded from ResNet-18 pretrained on ImageNet [58]. The bitemporal images are fed into the feature extractor to obtain the bitemporal features.
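A minimal sketch of this extractor follows. It assumes, based on the feature sizes reported in Section III-C (H/2, H/4, H/8 with 128, 256, 512 concatenated channels), that the stem's max-pooling layer is also dropped; treat the exact strides as an assumption:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SiameseExtractor(nn.Module):
    """Weight-sharing Siamese encoder: ResNet-18 with the fifth stage
    (layer4), global pooling, and fully connected layer removed. The
    stem's max-pooling is dropped here (inferred from the reported
    feature sizes, not stated explicitly in the paper)."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")  # ImageNet pretrained
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu)
        self.layer1 = backbone.layer1  # H/2,  64 channels
        self.layer2 = backbone.layer2  # H/4, 128 channels
        self.layer3 = backbone.layer3  # H/8, 256 channels

    def forward_single(self, x):
        x = self.stem(x)
        f2 = self.layer1(x)
        f3 = self.layer2(f2)
        f4 = self.layer3(f3)
        return f2, f3, f4

    def forward(self, t1, t2):
        # the same weights process both temporal images; same-scale features
        # are concatenated along channels (128/256/512 in total)
        fs1, fs2 = self.forward_single(t1), self.forward_single(t2)
        return [torch.cat([a, b], dim=1) for a, b in zip(fs1, fs2)]
```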

B. Multiscale Difference Enhancement Module

1) Difference Enhancement (DE) Encoder:
One of the problems with applying transformers to computer vision is the computational complexity. As noted in works such as ViT [31], dividing the image into patches and then mapping them to tokens can significantly reduce the computational complexity. However, pure transformers lack the inductive biases that CNNs have, such as translation equivariance and locality, so transformer training requires larger datasets and longer training time to achieve the same effect as CNNs. To address this problem, we combine CNN and transformer, mapping the high-dimensional features extracted by the CNN into a small number of token groups. The rationale is that the importance of each pixel in the image is not equal [35]. For example, image segmentation should focus more on the objects of interest rather than the background. In the CD task, more attention needs to be paid to the changed objects, such as building changes, wasteland changes, and so on. These changed areas usually occupy a small proportion of the whole image, so they can be described by several high-level semantic tokens. Therefore, we define a DE encoder to convert high-dimensional features into tokens. The DE encoder structure is shown in Fig. 3.
To obtain refined tokens, the DE encoder learns a set of spatial attention maps that map features into a set of tokens. Let $F_{cate}^{\delta} \in \mathbb{R}^{H \times W \times C}$ denote the hierarchical input high-dimensional feature maps, where $H$, $W$, and $C$ are the height, width, and number of channels, and let $T^{\delta} \in \mathbb{R}^{L \times C}$ denote the tokens, where $L$ and $C$ are the number of tokens and the token channel size, respectively. We apply a pointwise convolution to the high-dimensional feature $F_{cate}^{\delta}$ to obtain $L$ semantic groups, apply the softmax function over the $H$ and $W$ dimensions of the semantic groups to compute spatial attention maps, and use the attention maps to compute the weighted average of the pixels in $F_{cate}^{\delta}$, obtaining a set of tokens. Then, the tokens refined by the DE encoder of the previous (deeper) layer are added to this layer, because the high-level tokens contain richer semantic information and can participate in the generation of the low-level tokens. Note that in the structure of this article, no tokens are added when generating tokens from the deepest feature ($F_{cate}^{4}$). Formally,

$$T_{0}^{\delta} = \varphi\big(\sigma_{2}(F_{cate}^{\delta})\big)^{\mathsf{T}} F_{cate}^{\delta} + \phi\big(\sigma_{1}(T^{\delta+1})\big) \tag{1}$$

where $T_{0}^{\delta}$ represents the new tokens generated by fusing the previous-layer tokens, $\varphi$ denotes the softmax function, $F_{cate}^{\delta}$ represents the input features, $\sigma_{2}$ represents a 2-D convolution with a 1 × 1 kernel on the input features $F_{cate}^{\delta}$, $\phi$ denotes the ReLU function, $T^{\delta+1}$ represents the tokens output from the previous layer, and $\sigma_{1}$ represents a 1-D convolution with kernel size 1 on the previous-layer tokens. For the newly generated $T_{0}^{\delta}$, a set of learned position embeddings $E_{pos} \in \mathbb{R}^{L \times C}$ is added:

$$T_{in}^{\delta} = T_{0}^{\delta} + E_{pos}. \tag{2}$$

Then, the context information among these tokens is modeled with a transformer structure to enhance the token representation containing difference information and obtain more refined tokens. As in ViT [31], the structure consists of a multihead self-attention (MSA) block and a multilayer perceptron (MLP) block; layer normalization (LN) is applied before each block, and residual connections are applied after each block. The MLP contains two linear layers with a Gaussian error linear unit (GELU) activation [59]. Formally,

$$\hat{T}^{\delta} = \mathrm{MSA}\big(\mathrm{LN}(T_{in}^{\delta})\big) + T_{in}^{\delta} \tag{3}$$

$$T^{\delta} = \mathrm{MLP}\big(\mathrm{LN}(\hat{T}^{\delta})\big) + \hat{T}^{\delta} \tag{4}$$

where LN stands for layer normalization and $T^{\delta}$ represents the output of the DE encoder, that is, the refined tokens containing semantic change information, which are fed to the transformer decoder and to the next DE encoder layer to participate in token generation.
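A sketch of one DE encoder layer consistent with (1)-(4) is given below. The class name DEEncoder and the prev_channels argument are hypothetical, and the tensor shapes are assumptions based on the description above:

```python
import torch
import torch.nn as nn

class DEEncoder(nn.Module):
    """One DE encoder layer: tokenize features via learned spatial
    attention (Eq. 1), add position embeddings (Eq. 2), and refine the
    tokens with a pre-norm transformer encoder block (Eqs. 3-4)."""

    def __init__(self, channels, num_tokens=8, heads=4, prev_channels=None):
        super().__init__()
        self.group_conv = nn.Conv2d(channels, num_tokens, 1)       # sigma_2
        self.token_conv = (nn.Conv1d(prev_channels, channels, 1)   # sigma_1
                           if prev_channels else None)
        self.pos = nn.Parameter(torch.zeros(num_tokens, channels))  # E_pos
        self.ln1, self.ln2 = nn.LayerNorm(channels), nn.LayerNorm(channels)
        self.msa = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels),
                                 nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def forward(self, f_cate, t_prev=None):
        b, c, h, w = f_cate.shape
        # spatial attention maps: softmax over the H*W positions
        attn = self.group_conv(f_cate).flatten(2).softmax(dim=-1)   # B,L,HW
        tokens = attn @ f_cate.flatten(2).transpose(1, 2)           # B,L,C
        if t_prev is not None:  # fuse tokens from the deeper scale (Eq. 1)
            t_prev = self.token_conv(t_prev.transpose(1, 2)).transpose(1, 2)
            tokens = tokens + torch.relu(t_prev)
        tokens = tokens + self.pos
        x = self.ln1(tokens)
        tokens = tokens + self.msa(x, x, x)[0]        # MSA + residual
        tokens = tokens + self.mlp(self.ln2(tokens))  # MLP + residual
        return tokens
```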
2) Transformer Decoder: After encoding by the DE encoder, a set of refined tokens has been obtained. These tokens contain rich change information and can represent the real changes well. However, tokens only carry high-level semantic information, whereas vision tasks require pixel-level details that are not stored in the tokens. Therefore, it is necessary to fuse the output of the DE encoder with the feature map and use the high-level semantic information of the tokens to refine the pixel-wise representation of the feature map. As shown in Fig. 4, we build a transformer decoder. For a given feature map $F_{cate}^{\delta}$, the transformer decoder uses the relationship between each pixel and the tokens to obtain refined features $F_{out}^{\delta}$. The transformer decoder consists of a multihead attention (MHA) layer and an MLP block. In the MHA, the queries come from the image feature $F_{cate}^{\delta}$, and the keys and values come from the tokens. The other modules are implemented in the same way as in the transformer encoder of the DE encoder. The calculation process of each transformer decoder layer is as follows:

$$\hat{F}^{\delta} = \mathrm{MHA}\big(\mathrm{LN}(\sigma(F_{cate}^{\delta})), \mathrm{LN}(T^{\delta}), \mathrm{LN}(T^{\delta})\big) + \sigma(F_{cate}^{\delta}) \tag{5}$$

$$F_{out}^{\delta} = \sigma\big(\mathrm{MLP}(\mathrm{LN}(\hat{F}^{\delta})) + \hat{F}^{\delta}\big) \tag{6}$$

where $T^{\delta}$, $F_{cate}^{\delta}$, and $F_{out}^{\delta}$ represent the output of the DE encoder, the features input from the feature extractor, and the features output by the transformer decoder, respectively, and $\sigma$ denotes transpose and reshape operations.
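A minimal sketch consistent with (5)-(6), where each pixel of the feature map queries the refined tokens; the class name and shapes are assumptions:

```python
import torch.nn as nn

class TransformerDecoder(nn.Module):
    """Pixels query the refined tokens (keys/values) to recover
    pixel-level difference features, as in Eqs. (5)-(6)."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.ln_q = nn.LayerNorm(channels)
        self.ln_kv = nn.LayerNorm(channels)
        self.mha = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(nn.Linear(channels, 4 * channels),
                                 nn.GELU(),
                                 nn.Linear(4 * channels, channels))

    def forward(self, f_cate, tokens):
        b, c, h, w = f_cate.shape
        q = f_cate.flatten(2).transpose(1, 2)           # sigma: B,HW,C
        x = q + self.mha(self.ln_q(q), self.ln_kv(tokens),
                         self.ln_kv(tokens))[0]         # queries are pixels
        x = x + self.mlp(self.ln2(x))
        return x.transpose(1, 2).reshape(b, c, h, w)    # sigma: back to B,C,H,W
```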

C. Multiscale Feature Decoder
The original multiscale feature maps are converted into tokens and then encoded and decoded by the MDEM to obtain finer change features. Now we only need to fuse these feature maps of different scales to get the final change map. The multiscale feature decoder is shown in Fig. 5(a). It takes as input the multiscale feature maps $F_{out}^{\delta}$, δ = 2, 3, 4, whose sizes are H/2 × W/2 × 128, H/4 × W/4 × 256, and H/8 × W/8 × 512, respectively. The UpConv block is shown in Fig. 5(b); it contains a transposed convolution layer with a 3 × 3 kernel and a stride of 2, a batch normalization layer, a ReLU, and a transposed convolution layer with a 3 × 3 kernel and a stride of 1. The calculation process is as follows:

$$\mathrm{ChangeMap} = C_{1}\Big(\sigma_{2}\big(F_{out}^{2} + \sigma_{3}(F_{out}^{3} + \sigma_{4}(F_{out}^{4}))\big)\Big) \tag{7}$$

where $C_{1}$ represents the convolutional layer with a 1 × 1 kernel, $\sigma_{i}$, i = 2, 3, 4, represents UpConv$i$, and ChangeMap represents the final output.
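A sketch of this decoder follows. The additive fusion matches (7) as reconstructed above, and the channel widths follow the stated feature sizes; the single-logit head is an assumption:

```python
import torch.nn as nn

class UpConv(nn.Module):
    """UpConv block of Fig. 5(b): transposed conv (3x3, stride 2) + BN +
    ReLU, followed by a transposed conv (3x3, stride 1)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                               padding=1, output_padding=1),  # doubles H, W
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(out_ch, out_ch, 3, stride=1, padding=1))

    def forward(self, x):
        return self.block(x)

class MultiscaleDecoder(nn.Module):
    """Layer-by-layer upsample-and-sum fusion ending in a 1x1 conv, as in
    Eq. (7); one change logit per pixel (binary CD)."""

    def __init__(self):
        super().__init__()
        self.up4 = UpConv(512, 256)   # H/8 -> H/4
        self.up3 = UpConv(256, 128)   # H/4 -> H/2
        self.up2 = UpConv(128, 64)    # H/2 -> H
        self.head = nn.Conv2d(64, 1, kernel_size=1)  # C_1

    def forward(self, f2, f3, f4):
        x = f3 + self.up4(f4)
        x = f2 + self.up3(x)
        return self.head(self.up2(x))
```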

D. Loss Function
In CD tasks, the number of unchanged pixels is often much larger than the number of changed pixels. To reduce the influence of sample imbalance, we adopt a hybrid loss function, that is, the combination of focal loss [21] and dice loss [16], defined as

$$L = L_{focal} + L_{dice}. \tag{8}$$

We define the change probability of each pixel in the change map as $\hat{y}_{i}$, i = 1, 2, ..., N, and the value of each pixel in the ground truth as $y_{i}$, i = 1, 2, ..., N, where N represents the total number of pixels in the image. $y_{i}$ takes the value 0 or 1, representing unchanged and changed pixels, respectively, and $\hat{y}_{i} \in [0, 1]$. The focal loss can be formulated as

$$L_{focal} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ \alpha\,(1 - \hat{y}_{i})^{\gamma}\, y_{i} \log \hat{y}_{i} + (1 - \alpha)\, \hat{y}_{i}^{\gamma}\, (1 - y_{i}) \log (1 - \hat{y}_{i}) \Big] \tag{9}$$

where α and γ are set to 0.25 and 2, respectively. The dice loss can be formulated as

$$L_{dice} = 1 - \frac{2 \sum_{i=1}^{N} y_{i} \hat{y}_{i} + \epsilon}{\sum_{i=1}^{N} y_{i} + \sum_{i=1}^{N} \hat{y}_{i} + \epsilon} \tag{10}$$

where ε is a small smoothing constant.
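A sketch of the hybrid loss under a single-channel sigmoid formulation (the exact prediction head is an assumption; `logits` are raw scores and `target` is the 0/1 ground-truth mask as a float tensor):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss, Eq. (9), averaged over all pixels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)          # prob of true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    """Dice loss, Eq. (10), with a smoothing constant eps."""
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(dim=1)
    return (1 - (2 * inter + eps) / (p.sum(dim=1) + t.sum(dim=1) + eps)).mean()

def hybrid_loss(logits, target):
    """Hybrid loss, Eq. (8): focal + dice."""
    return focal_loss(logits, target) + dice_loss(logits, target)
```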

IV. EXPERIMENTS

A. Dataset and Evaluation Metrics
To evaluate our method, a series of experiments is conducted on two of the most commonly used CD datasets, SYSU-CD [48] and LEVIR-CD [60].
SYSU-CD contains 20 000 pairs of 0.5-m aerial images captured in Hong Kong from 2007 to 2014. The dataset includes multiple types of complex changes, such as vegetation changes and changes in offshore facilities. The original article randomly split the 800 raw images to build the standard CD dataset. Through data augmentation, 12 000/4000/4000 pairs of samples of size 256 × 256 are finally obtained for training/validation/testing, respectively.
LEVIR-CD is a building CD dataset collected from Google Earth. It contains 637 pairs of HR (0.5 m) remote sensing images of size 1024 × 1024. The land cover changes in this dataset focus on artificial building changes. Due to the limitation of GPU memory, the original images cannot be directly input for training, so each image is cropped into nonoverlapping patches. We end up with 7120/1024/2048 pairs of samples of size 256 × 256 for training/validation/testing, respectively.
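A minimal sketch of the nonoverlapping cropping (the helper name is hypothetical; `img` is, e.g., a NumPy array):

```python
def crop_nonoverlapping(img, patch=256):
    """Split an H x W (x C) image into nonoverlapping patch x patch tiles;
    a 1024 x 1024 LEVIR-CD image yields 16 tiles of 256 x 256."""
    h, w = img.shape[0], img.shape[1]
    return [img[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]
```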
In this article, five evaluation metrics, Precision, Recall, F1-score, IoU, and OA, are used to evaluate and compare the models. Among them, F1 considers Precision and Recall at the same time, so a higher F1 indicates better performance. IoU represents the overlap rate between the predicted change class and the ground truth. OA represents the overall accuracy of the model. We use F1 as the main metric. These metrics are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{14}$$

$$\mathrm{OA} = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

where TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively.
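These metrics follow directly from the confusion-matrix counts of the binary change maps, e.g.:

```python
import numpy as np

def evaluate(pred, gt):
    """Compute Precision, Recall, F1, IoU, and OA (Eqs. 11-15) from
    binary (0/1) prediction and ground-truth maps."""
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    tn = np.sum((pred == 0) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, iou, oa
```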

B. Implementation Details
We implement the model on a single NVIDIA RTX 3070 GPU using PyTorch. For both datasets, common data augmentation operations, such as horizontal and vertical flipping, rotation, and Gaussian blur, are used to avoid overfitting. During training, the batch size is set to 8, AdamW is used as the optimizer, and the weight decay is 0.05. The initial learning rate is set to 0.0001 and decays linearly with the number of training epochs. The number of training epochs is set to 100. The feature extractor is initialized with the weights of ResNet-18 pretrained on ImageNet [58]. The weights of the multiscale feature decoder are initialized by Xavier initialization. In the feature enhancement module, each scale has only one DE encoder and one transformer decoder, both initialized by Xavier initialization. The number of tokens is set to 8. Validation is performed after each training epoch, and the best model on the validation set is evaluated on the test set.
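A sketch of this training setup, using the hypothetical MDFENet class and hybrid_loss from the earlier sketches; `train_loader` is an assumed DataLoader yielding bitemporal image pairs and masks:

```python
import torch

model = MDFENet().cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
epochs = 100
# linear decay of the learning rate over the training epochs
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: 1.0 - e / epochs)

for epoch in range(epochs):
    model.train()
    for t1, t2, label in train_loader:
        optimizer.zero_grad()
        logits = model(t1.cuda(), t2.cuda()).squeeze(1)  # B x H x W
        loss = hybrid_loss(logits, label.cuda().float())
        loss.backward()
        optimizer.step()
    scheduler.step()
    # validate after each epoch; keep the checkpoint with the best val F1
```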

C. Comparison Methods
1) Fully Convolutional-Early Fusion (FC-EF): FC-EF is proposed based on the U-Net architecture; it concatenates the bitemporal images into a multiband image as input and uses skip connections to gradually transfer multiscale features from the encoder to the decoder.

D. Comparisons on SYSU-CD Dataset
The precision, recall, F1, IoU, and OA of all compared methods are summarized in Table I. It can be seen from Table I that FC-Siam-Diff has the highest precision, but the lowest F1, IoU, and OA. This may be because the SYSU-CD dataset contains many complex scenes, and excessive feature differencing may overfilter useful change information, resulting in more pseudo changes. FC-EF follows with an F1 of 75.07%, and the F1 of FC-Siam-Conc is 76.35%, indicating that concatenation can retain more useful change information than early fusion and differencing and pass it to the decoder. IFN, DTCDSCN, and SNUNet all design attention modules and outperform the three FCN-based structures in Recall, F1, and IoU, showing that combining attention mechanisms with convolution makes the model pay more attention to discriminative features and reduces pseudo changes. BIT uses a transformer structure based on global attention; its F1 and IoU are lower than those of the above dual attention-based methods but higher than those of the three FCN-based models. The reason is that BIT only utilizes features of a single scale and lacks the interaction of multiscale feature information, which makes it difficult to identify changes at different scales in complex scenes. Although attention mechanisms are exploited in these networks, the extracted features are still insufficient to address the pseudo-changes problem due to the underutilization of information. MSCANet uses transformers to process the features of each temporal image separately, lacking the interaction of bitemporal features. Different from BIT and MSCANet, our MDFENet concatenates the bitemporal features of each scale in the feature extraction stage and then inputs them to the MDEM for global modeling to mine the change information of the bitemporal image features. It achieves the highest Recall/F1/IoU/OA of 81.63%/81.15%/68.29%/91.06%, outperforming the dual attention-based and transformer-based models and indicating the superiority of MDFENet.

Fig. 6 shows the performance of each CD method on the SYSU-CD dataset more intuitively. FC-EF and its two variants, FC-Siam-Conc and FC-Siam-Diff, perform the worst, with pseudo-changes problems clearly observed inside and at the boundaries of changing objects. Among the three methods based on multiscale feature concatenation and attention mechanisms, IFN performs best in both large- and small-object recognition and boundary integrity, while DTCDSCN and SNUNet can detect changed regions with fewer pseudo-change pixels. Owing to its multiscale transformer structure, MSCANet has fewer false positives (red areas) than BIT, but its multiscale features are classified directly rather than fused, losing too much change information and producing more false negatives (green areas). Our MDFENet is more accurate and misses fewer detections (Rows 3 and 4), showing good robustness to the pseudo-changes problem.

E. Comparisons on LEVIR-CD Dataset
As shown in Table II, different from the performance on the SYSU-CD dataset, among the three fully convolutional models, FC-Siam-Diff achieves the highest F1, IoU, and OA on the LEVIR-CD dataset. This may be because the LEVIR-CD dataset mainly contains simple building changes, where the feature-differencing model can play to its advantage. FC-EF is the worst of the three models. IFN has the highest precision of 94.02%, but its Recall is lower than those of DTCDSCN and SNUNet, indicating that IFN may generate more pseudo changes in simpler scenarios. The F1 values of DTCDSCN and SNUNet are 87.67% and 88.16%, respectively, higher than the three FCN-based methods, again illustrating the advantage of the attention mechanism. The F1 of the transformer-based MSCANet is 89.33%, higher than the three dual attention-based methods and second only to the model in this article.

Fig. 7 further demonstrates the performance of the different methods on the LEVIR-CD dataset. Each model produces a certain degree of pseudo changes due to variations in light intensity and season. It can be seen from the third and fourth rows that the three FC-based models cannot correctly identify changes in small areas, and the boundaries are severely broken. Where the ground surface shows similar features, IFN, DTCDSCN, and SNUNet are still able to extract discriminative features and find the real changing regions while preserving boundary integrity. BIT and MSCANet can also correctly identify some subtle change areas thanks to their global receptive fields. Similar to the results on the SYSU-CD dataset, our method still generates the fewest pseudo-change pixels and performs best in terms of boundary integrity and identifying tiny regions.

F. Model Efficiency
The computational efficiency comparison of the different algorithms is shown in Table III. The three FCN-based models have the fewest parameters, but also the worst performance. Compared with SNUNet, our model adds only 15.14 M parameters but obtains better results.

G. Ablation Study
1) Effectiveness of MDEM: Ablation experiments are conducted to verify the effectiveness of the module. In the following experiments, base denotes the base model without the MDEM, which is similar in form to an FCN model, using the CNN feature extractor as the encoder and the multiscale feature decoder to fuse the multiple encoding layers. base+MDEM_i denotes adding an MDEM between the feature extractor and the feature decoder on top of the base model, where MDEM_i indicates that the MDEM contains i pairs of DE encoders and transformer decoders; base+MDEM_3 is our MDFENet. As can be seen from Table IV, the F1 of base+MDEM_3 on the SYSU-CD dataset is 81.15% and the IoU is 68.29%; the F1 and IoU on the LEVIR-CD dataset are 90.85% and 83.23%, better than the base model and the other variants. The visualization results are shown in Fig. 8. The base model still produces many pseudo changes. After adding one pair of DE encoder and transformer decoder, namely base+MDEM_1, the pseudo changes are significantly reduced. The best result is achieved with base+MDEM_3, whose change maps contain almost no pseudo changes.
2) Number of Tokens: The difference information in bitemporal images can be described by a set of tokens. The number of tokens L is an important hyperparameter, and L = 2, 4, 8, 16 are tested here to analyze its impact on model performance on the LEVIR-CD and SYSU-CD datasets. As can be seen from Fig. 9, with 8 and 16 tokens, the F1 and IoU of the model are significantly higher than with 2 and 4 tokens. This proves that an appropriate number of tokens is sufficient to represent the difference information, while too few tokens may hurt performance, because the model may lose some information related to semantic changes. With 16 tokens, the performance does not improve further and even decreases slightly. Both F1 and IoU are highest with 8 tokens, so we set the number of tokens to 8.
3) Loss Functions: The CD task has a sample imbalance problem, because unchanged pixels far outnumber changed pixels. Both focal loss and dice loss perform well under sample imbalance. In the ablation experiments, we tested the effect of using focal loss, dice loss, and focal+dice loss on the performance of MDFENet. As shown in Table V, focal loss yields higher F1 and IoU on SYSU-CD than dice loss (80.09% and 66.79%, respectively), whereas dice loss yields higher F1 and IoU on LEVIR-CD than focal loss (90.68% and 82.95%, respectively). The difference between the two datasets explains the different behavior of the loss functions: SYSU-CD contains change objects of varying size and shape, such as vegetation, buildings, and roads, where focal loss performs better; LEVIR-CD only includes building changes of similar size and shape, where dice loss performs better. Using focal+dice loss achieves the highest F1 and IoU on both datasets.

V. DISCUSSION
MDFENet reveals the rich difference tokens of complex changing objects or regions by constructing global semantic context information, and these tokens containing global information effectively enhance the model's attention, making the model focus on the changes in bitemporal remote sensing images. In addition, the multiscale difference enhancement module enhances the information exchange of tokens at different scales and learns more discriminative pixel-level semantic change information. At the same time, it reduces the offset error when low-level and high-level features are subsequently fused in the decoder, reducing the appearance of pseudo changes.
As shown in Fig. 10, we visualize the gradient-weighted class activation mapping (Grad-CAM) generated by two stages (UpConv4 and UpConv2) of MDFENet's multiscale feature decoder on the two datasets; output denotes the change map of MDFENet. Red indicates higher attention coefficients, and blue indicates lower attention coefficients. Due to differences in solar radiance and sensor pose, the same objects in Image1 and Image2 exhibit obvious color and position shifts, which increases the difficulty of CD and easily causes pseudo changes in the change map. As can be seen from Fig. 10, at the UpConv4 layer on SYSU-CD, MDFENet does not fully attend to the change area, because the dataset contains a large number of complex changes, such as forests, grasslands, and buildings, but our method can still attend to the center of the changed area. On LEVIR-CD, MDFENet can completely distinguish the changed regions, which shows that even single-layer token encoding and decoding can establish the connection between the bitemporal images and find the regions with high discrimination. After encoding and decoding through multiple layers of DE encoders and transformer decoders, at the UpConv2 layer, the model can completely separate the changed area from the unchanged area and can also distinguish subtle edges. This illustrates the effectiveness of the multiscale structure in reducing pseudo changes in the change map.
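A minimal Grad-CAM sketch of the kind used for Fig. 10, assuming the single-logit change head from the earlier sketches; the hook placement (e.g., an UpConv block) and class handling are assumptions:

```python
import torch

class GradCAM:
    """Hook a target layer, weight its activations by spatially pooled
    gradients of the change score, and ReLU the weighted sum to obtain
    an attention heatmap."""

    def __init__(self, model, layer):
        self.acts, self.grads = None, None
        layer.register_forward_hook(self._save_act)
        layer.register_full_backward_hook(self._save_grad)
        self.model = model

    def _save_act(self, module, inputs, output):
        self.acts = output.detach()

    def _save_grad(self, module, grad_in, grad_out):
        self.grads = grad_out[0].detach()

    def __call__(self, t1, t2):
        logits = self.model(t1, t2)        # B x 1 x H x W change logits
        self.model.zero_grad()
        logits.sum().backward()            # gradient of the change score
        w = self.grads.mean(dim=(2, 3), keepdim=True)  # GAP over space
        cam = torch.relu((w * self.acts).sum(dim=1))   # B x h x w heatmap
        return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
```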

VI. CONCLUSION
In this article, we propose MDFENet for the CD task in remote sensing images. MDFENet extracts difference information from complex backgrounds by mapping high-dimensional features into a few groups of tokens. The multiscale difference enhancement module enhances the difference information in the token space, which reveals the real changes in bitemporal images and reduces pseudo changes. Comparative experiments with other state-of-the-art methods on SYSU-CD and LEVIR-CD prove the superiority of MDFENet. Ablation studies demonstrate the effectiveness of the multiscale difference enhancement module and show that very few tokens can represent the difference information. In the future, we will explore more lightweight network architectures and experiment on more datasets.