Feature-Guided Multitask Change Detection Network

Change detection is the discovery of changes in remote sensing images of the same region obtained at different times. Change detection algorithms based on deep neural networks have significant advantages over traditional algorithms on high-resolution images. State-of-the-art (SOTA) change detection methods require sufficient labeled data to achieve good results, but semantic change detection requires not only binary change masks but also “from–to” change information, so large quantities of change labels are difficult to obtain. Achieving better semantic change detection accuracy with a limited number of labels remains an open problem in the remote sensing field. In this article, we propose a feature-guided multitask change detection network (MCDnet). Feature guidance is realized in three steps: first, a multitask learning network that uses Siamese encoders to learn segmentation and change detection features simultaneously, realizing mutual guidance between tasks, is designed; second, a fine-grained feature fusion module that integrates and enhances change information under the guidance of symmetrical change features is constructed; and third, a contrastive loss function based on the a priori knowledge that the features of changed regions differ while those of unchanged regions are the same is proposed. The experimental results show that MCDnet achieves SOTA results on three public change detection datasets: WHU-CD (F1: 94.46 / IoU: 89.50), LEVIR (F1: 92.11 / IoU: 85.37), and SECOND (mIoU: 73.1 / SeK: 22.8). In addition, MCDnet remains comparable to the SOTA models while using only 20% of the full training data.

I. INTRODUCTION

recent decades [2]. Singh [3] defined CD as "the process of identifying differences in the state of an object or phenomenon by observing it at different times." Many CD techniques have been proposed in the literature. However, selecting the most suitable method or algorithm for CD is not easy in practice [4]. CD methodologies can be divided into traditional CD algorithms and machine-learning-based CD algorithms [4].
The traditional CD algorithms can be summarized into the following three groups: direct comparison methods, transformation-based methods, and classification-based methods [5]. The direct comparison methods use image differencing [6], image ratioing [7], and regression analysis [8] to identify pixelwise changes. Transformation-based methods transform images to feature vectors and include change vector analysis [9] and principal component analysis [10]. Direct comparison and transformation-based methods require remotely sensed data to be acquired during the same phenological period [11]. Classification-based CD methods first determine the classification results for images and then compare them to identify the change areas [12], [13], [14], [58]. The CD accuracy of a classification-based method depends on the classification accuracy of each image used.
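The classification-based (postclassification) comparison described above can be sketched in a few lines: classify each date independently, then compare the per-pixel label maps to obtain both a binary change mask and the "from–to" transitions. This is a minimal illustration of the idea; the function name and toy class codes are ours, not from any cited method.

```python
import numpy as np

def post_classification_change(labels_t1: np.ndarray, labels_t2: np.ndarray):
    """Post-classification comparison: given per-pixel class maps for two
    dates, return a binary change mask and the "from-to" transition pairs."""
    change_mask = labels_t1 != labels_t2
    # "from-to" information: the class at t1 and at t2 for every changed pixel
    from_to = np.stack([labels_t1[change_mask], labels_t2[change_mask]], axis=1)
    return change_mask, from_to

# Toy 2x2 example: one pixel changes from class 1 (e.g., vegetation)
# to class 2 (e.g., building); all other pixels are unchanged.
t1 = np.array([[1, 1], [0, 0]])
t2 = np.array([[1, 2], [0, 0]])
mask, pairs = post_classification_change(t1, t2)
```

As the surrounding text notes, the accuracy of this scheme is bounded by the accuracy of each individual classification, since any misclassification at either date appears as a spurious change.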
As image resolution increases, it is difficult for researchers to fully characterize the complex spatial information and form a general algorithm, so data-driven algorithms dominate high-resolution CD tasks. Machine-learning-based CD algorithms, including support vector machine [15], deep neural network [16], and decision tree [17] algorithms, attempt to estimate data properties based on labeled data. Machine-learning algorithms, especially deep neural networks, have shown great accuracy advantages over the traditional methods in the CD of very high-resolution (VHR) remote sensing images. Most deep-learning CD studies have focused on model design and data characterization.
In terms of model design, a semantic segmentation network can be directly used for CD. For example, Peng et al. [18] achieved good accuracy by performing CD based on UNet++. However, CD networks also have different characteristics from semantic segmentation networks. The input of semantic segmentation is a single image, while the input of CD is composed of two images of the same region acquired at different times. Therefore, in recent years, researchers have proposed many CD models that differ from standard semantic segmentation networks. Siamese architectures [19] were first used for CD in [20]. Compared with previous CD deep neural networks [16], [21], the Siamese architecture uses dual encoders to extract image feature pyramids separately. To further enhance the feature representation and generalization abilities of the Siamese architecture, generative adversarial networks were combined with the Siamese CD network [22]. Unlike a Siamese network with dual encoders, a three-branch network was used for CD tasks [23]; its input includes bitemporal remote sensing images and their difference image. Recent research based on the combination of convolutional neural networks (CNNs) and transformers has facilitated breakthroughs in CD tasks [55], [56], [57], [59], relying on the transformer's powerful modeling capabilities to make feature fusion and extraction more robust.
In terms of data characterization, recent studies have gradually focused on exploiting the characteristics of CD data, which can be divided into two approaches. First, to reduce the number of labels needed, Zheng et al. [24] proposed single-temporal supervised learning (STAR), which can exploit object changes in unpaired images, and Chen et al. [25] proposed a data-level solution that can generate bitemporal images containing changes by leveraging generative adversarial training. Second, some work has explored feature fusion methods so that a network can learn the change information easily; e.g., Daudt et al. [20] fused the pairs of feature vectors with concat/add/cut operations to explore their performance on different data.
While most prior work has focused on model design and data characterization for CD, there is a lack of work on CD scheme design from a feature-guided perspective. Specifically, from this perspective, network design can be divided into encoder feature guidance, feature fusion guidance, and contrastive loss guidance.
For encoder feature guidance, semantic segmentation tasks can assist CD tasks [26], and a shared encoder can serve both CD and semantic segmentation. Similar to traditional postclassification CD [12], [13], [14], we further consider whether the feature extraction process of a Siamese network encoder can be guided by semantic segmentation. Therefore, we design a multitask CD network (MCDnet) with CD and semantic segmentation branches and experimentally show that it not only improves the accuracy of CD but also greatly reduces the number of CD labels needed.
For feature fusion guidance, after the Siamese encoder produces the feature pyramids, the paired features must be fused, and then a decoder is used to generate the probability map. FC-Siam-conc [20] fused features by concatenation, and FC-Siam-diff [20] fused features by subtraction. Many recent studies still use concatenation for feature fusion [18], [22], [25], [27], [28]. We found that, for small datasets, such as that in [29], the accuracy of cut is better than that of concatenation, while for recent large datasets, such as LEVIR [30], the accuracy of concatenation is better than that of cut. Summarizing the previous work, we believe that a feature fusion module should have three characteristics, i.e., symmetry, information integrity, and feature enhancement. Therefore, we redesigned the feature fusion module to improve the convergence speed and accuracy of the model without additional computational effort.
For contrastive loss guidance, most CD-related studies [18], [25], [31], [32] adopted cross-entropy loss [33]. Some studies design customized loss functions [34], and a suitable loss function can help the model achieve higher accuracy. Unlike recent CD models, this article proposes a multitask CD model that outputs the results of semantic segmentation and CD simultaneously. It can be expected that the semantic segmentation results differ in changed regions, while they are the same in unchanged regions. Thus, we design a contrastive loss function to maximize the feature distance in changed regions and minimize the feature distance in unchanged regions.
The contributions of this work can be summarized as follows. 1) We explore the feature complementarity between segmentation and CD, based on which MCDnet, with better CD performance and fewer sample requirements, is proposed. MCDnet achieves state-of-the-art (SOTA) accuracy on the WHU-CD [30], LEVIR [37], and SECOND [38] datasets. 2) A change contrast loss (CCL) function is used to perform feature alignment in CD; it is based on the property that the features of unchanged regions are similar while those of changed regions are different. 3) Through extensive research and experiments, we conjecture that the feature fusion module should be characterized by symmetry, information integrity, and change region enhancement. A new feature fusion module is proposed according to this conjecture, and its effectiveness is demonstrated experimentally.

The rest of this article is organized as follows. Section II reviews the related work. Section III illustrates the proposed method. In Section IV, experiments are designed to evaluate the proposed method. Finally, Section V concludes this article.

II. RELATED WORK

A. Deep-Learning-Based CD Methods
Sakurada and Okatani [35] were the first to introduce fully convolutional networks for the task of scene CD in computer vision. In the field of remote sensing, FC-Siam-conc and FC-Siam-diff [20] first used the end-to-end Siamese model to find land cover changes.

B. Multitask Learning
Multitask learning (MTL) aims to improve model generalization by leveraging domain-specific information contained in the training signals of related tasks. In the deep-learning era, MTL translates to designing networks capable of learning shared representations from multitask supervised signals [39]. MTL can offer several advantages relative to single-task learning, e.g., reduced calculations and improved performance. Vandenhende et al. [26] showed that MTL can be beneficial, as it allows for the acquisition of inductive bias through the inclusion of related additional tasks into a training pipeline.
A number of works [26], [29], [40] have exhibited models that are combinations of off-the-shelf backbone networks and several task-specific heads. This kind of model relies on an encoder (i.e., a backbone network) to learn a generic representation, which is then used by the task-specific heads to obtain the predictions for each task. In [40], an MTL strategy showed robustness against adversarial attacks, while Zamir et al. [41] indicated that applying cross-task consistency in MTL not only improves generalization but also allows for domain-shift detection.
In the remote sensing field, a novel multiscale and multitask deep-learning framework for automatic road extraction was proposed [42]; the model built a relationship between the learning tasks and simultaneously completed the road detection and centerline extraction tasks. Tan et al. [43] utilized road and junction segmentation cues to guide exploration and achieved better road alignment. Cipolla et al. [29] addressed the problem of preserving semantic segmentation boundaries in high-resolution satellite imagery by introducing a novel multitask loss. The loss leverages multiple output representations of the segmentation mask and guides the network to focus more on pixels near boundaries.

C. Contrastive Learning
Contrastive methods learn representations in a discriminative manner by contrasting similar (positive) data pairs against dissimilar (negative) pairs [44]. Contrastive learning is widely used in self-supervised representation learning [45], [46], [47]. Because there are no labels, positive pairs are constructed from augmented views of the same image. Recently, contrastive learning has also started to be used in supervised learning. Wang et al. [46] proposed a pixel-to-pixel contrastive learning method for semantic segmentation, which is able to discover information across images and perform feature alignment on a small batch of data. Kuang et al. [47] proposed a new segment-based video-level contrastive learning method to formulate positive pairs.
We propose a contrastive learning method for CD in which the positive and negative samples are feature maps corresponding to the unchanged and changed regions, respectively. Dual features at different scales are aligned explicitly by making the features of changed regions different and the features of unchanged regions the same.

III. PROPOSED METHOD
In this section, we first introduce the proposed MCDnet in detail. Then, we illustrate the CFFM and CCL proposed in this article.

A. MCDnet
MCDnet consists of a Siamese encoder, a change decoder, and a segmentation decoder in the full mode, as shown in Fig. 1. According to the composition of different learning tasks, MCDnet has three variants: MCDnet-mtask, MCDnet-change, and MCDnet-seg. MCDnet-mtask is the complete variant that obtains the segmentation and CD results simultaneously. MCDnet-change is the CD variant, composed of a Siamese encoder and the change decoder. MCDnet-seg is the segmentation variant, composed of a Siamese encoder and the segmentation decoder. MCDnet-change and MCDnet-seg can obtain only change or segmentation information, respectively.
The input of MCDnet is a pair of optical remote sensing images acquired before and after a change occurs, i.e., X_pre and X_post. The Siamese encoder shares weights and extracts the deep pyramid features of both X_pre and X_post. The change decoder uses the Siamese encoder features jointly to obtain the change result P_change, while the segmentation decoder uses the Siamese encoder features separately to obtain the segmentation results P_pre and P_post.
1) The encoder can be divided into two parts: a multiscale pyramid feature extractor and a switching path. The multiscale pyramid feature extractor is shown in Fig. 2(a); it extracts features layer by layer at descending scales using common backbone networks, such as ResNet [48] and EfficientNet [49]. We add a channel adaptation layer in the first layer with a 3 × 3 convolution whose input channel count equals the number of image channels and whose output channel count is 3, so that remote sensing images with more than three channels can use backbones pretrained on large-scale datasets, such as ImageNet [50]. Then, five scale features are drawn from the encoder. For example, when the input is 512 × 512 × C (C is the number of image bands), the five scale feature sizes are {256 × 256 × C_0, 128 × 128 × C_1, 64 × 64 × C_2, 32 × 32 × C_3, 16 × 16 × C_4}, where the channel numbers {C_0, ..., C_4} are determined by the backbone network used; e.g., in the case of EfficientNet-b1 [49], the numbers of channels are {C_0 = 32, C_1 = 24, C_2 = 40, C_3 = 112, C_4 = 320}.
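The channel adaptation idea above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the class name and layer arrangement are ours; the only detail taken from the text is a 3 × 3 convolution mapping an arbitrary band count to 3 channels so that ImageNet-pretrained backbones can be reused.

```python
import torch
import torch.nn as nn

class ChannelAdapter(nn.Module):
    """Maps an input with an arbitrary number of bands to 3 channels with a
    3x3 convolution, so an ImageNet-pretrained backbone can follow."""
    def __init__(self, in_bands: int):
        super().__init__()
        # padding=1 keeps the spatial size unchanged for a 3x3 kernel
        self.conv = nn.Conv2d(in_bands, 3, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

adapter = ChannelAdapter(in_bands=6)   # e.g., a 6-band multispectral image
x = torch.randn(1, 6, 64, 64)
y = adapter(x)                         # 3-channel output, same spatial size
```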
The switching path is designed based on a bidirectional feature pyramid network (BiFPN) [51] structure, as shown in Fig. 2(a); it enhances the features at different scales by constructing propagation pathways between different layers. Considering the differences in both the spatial resolution and feature size of remote sensing images, the proposed network is designed for images whose resolution ranges from 0.3 m to 2 m. For example, for building detection, spatial features still exist after five downsampling steps at 0.3-m resolution, while at 2-m resolution, spatial features survive only three downsampling steps. Although different networks could be trained for different resolutions, the feature differences caused by resolution changes affect accuracy. Therefore, we add the BiFPN structure to the classification network for feature extraction. By enhancing the exchange of information between features at different scales, the network can be adapted to detect targets of different sizes in images of different resolutions. With the above structure, when the input is a pair of 512 × 512 × C images, the output of the encoder is {256 × 256 × C_bifpn, 128 × 128 × C_bifpn, 64 × 64 × C_bifpn, 32 × 32 × C_bifpn, 16 × 16 × C_bifpn}, where C_bifpn ∈ {32, 64, 128}. In the proposed model, C_bifpn = 32, the pretemporal phase output features are {P_1, P_2, P_3, P_4, P_5}, and the post-temporal phase pyramid features are {P'_1, P'_2, P'_3, P'_4, P'_5}.

2) The segmentation decoder consists of four segmentation blocks (SBs), as shown in Fig. 3. The SB upsamples the input features, concatenates the same-scale features, fuses the features through a conv-BN-ReLU block, weights the channels using SE attention, and then extracts features using a conv-BN-ReLU block. Finally, the segmentation probability map is obtained by a segmentation head.
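The SB described above (upsample, concatenate the same-scale skip feature, conv-BN-ReLU, SE channel weighting, conv-BN-ReLU) can be sketched as follows. Channel sizes, class names, and the SE reduction ratio are illustrative assumptions, not values from the paper; only the sequence of operations follows the text.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-excitation channel weighting (standard formulation)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x).view(x.size(0), -1, 1, 1)  # per-channel weights
        return x * w

class SegmentationBlock(nn.Module):
    """Sketch of an SB: upsample, concatenate the same-scale skip feature,
    conv-BN-ReLU, SE weighting, conv-BN-ReLU."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.se = SEAttention(out_ch)
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.up(x), skip], dim=1)  # upsample, then concat skip
        return self.refine(self.se(self.fuse(x)))

blk = SegmentationBlock(in_ch=32, skip_ch=32, out_ch=32)
out = blk(torch.randn(2, 32, 16, 16), torch.randn(2, 32, 32, 32))
```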
3) The change decoder consists of five change blocks, as shown in Fig. 3, each of which is composed of a CFFM, as shown in Fig. 4, and an SB. The SB upsamples the input features, concatenates the same-scale features, fuses the features through a conv-BN-ReLU block, weights the channels using SE attention, and then extracts features using a conv-BN-ReLU block. Finally, the CD probability map is obtained by a segmentation head.

B. CFFM
The CFFM is designed to find change regions from pairs of features. Concat, add, and cut are the commonly used feature fusion methods, where concat retains all features, while add/cut can enhance features at the expense of information loss. On small datasets, such as the OSCD dataset [21], the accuracies obtained by add/cut are higher than those obtained by concat, but on larger datasets, such as LEVIR [19], the accuracy of concat is higher than that of add and cut. Through a refined design, the CFFM proposed in this article simultaneously provides symmetry, information integrity, and change region enhancement.
A detailed description is shown in Fig. 4. The pretemporal phase feature is P_n, and the post-temporal phase feature is P'_n, n ∈ {1, 2, 3, 4, 5}. First, the feature pairs are fed into three parallel pathways.
A combination of cut and abs is used to ensure the symmetry of features. The size of P_n and P'_n is (c, h, w), where c denotes the number of channels of the feature maps, and h and w are their height and width, respectively. For each pair of points x_{i,j} ∈ P_n and x'_{i,j} ∈ P'_n, where i ∈ [0, ..., w] and j ∈ [0, ..., h], each x_{i,j} and x'_{i,j} can be represented by a c-dimensional vector, so I_mse calculates the mean squared error between each x_{i,j} and x'_{i,j}. The number of channels is reduced from c to 1 by the MSE operator, and the sigmoid operator then narrows its value domain to between 0 and 1. Ultimately, the CFFM output can be expressed as follows:

I_n = Concat[I_add · I_mse + I_add, I_cut · I_mse + I_cut]. (3)
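A minimal PyTorch sketch of this fusion, under our reading of the text: I_add is the elementwise sum, I_cut is the absolute difference (the "cut and abs" combination that makes the module symmetric), and I_mse is the per-pixel sigmoid-squashed MSE used as a change gate. Function and variable names are ours, not the paper's.

```python
import torch

def cffm(p: torch.Tensor, p_prime: torch.Tensor) -> torch.Tensor:
    """Sketch of CFFM fusion following Eq. (3): symmetric add and
    |difference| pathways, gated by a per-pixel sigmoid(MSE) change score.
    Input: two (B, C, H, W) feature maps; output: (B, 2C, H, W)."""
    i_add = p + p_prime             # information-complete, symmetric pathway
    i_cut = torch.abs(p - p_prime)  # symmetric difference pathway
    # per-pixel mean squared error over channels -> (B, 1, H, W) change score
    i_mse = torch.sigmoid(((p - p_prime) ** 2).mean(dim=1, keepdim=True))
    # enhance both pathways in likely-changed regions, then concatenate
    return torch.cat([i_add * i_mse + i_add, i_cut * i_mse + i_cut], dim=1)

a, b = torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64)
out = cffm(a, b)
```

Note that every pathway is invariant to swapping the two inputs, so the fused output is identical regardless of temporal order, which is the symmetry property the text requires.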

C. CCL
The proposed MCDnet-mtask model learns both segmentation and change annotations by sharing encoders, which enables mutually beneficial accuracy promotion for both segmentation and CD. The reasons are twofold: the semantic segmentation task guides the encoder to focus more on the features of objects of interest, which lets the CD task make better use of pairwise features for change discovery; and the CD task is similar to contrastive learning, through which the features extracted by the Siamese encoders can be aligned and propagated to enhance the robustness of feature extraction.
The encoder uses the CD label for feature alignment in this study, and we propose a new CD loss function inspired by the idea of contrastive learning, namely, CCL. CCL aims to increase the feature distance in changed regions and decrease the feature distance in unchanged regions.
Given the binary change ground truth G_change, whose size is scaled to that of P_n and P'_n, the changed-region feature pairs are obtained as

P_pre_d = P_n · G_change (7)
P_post_d = P'_n · G_change (8)

where CM_s represents the MSE of the unchanged region and CM_d represents the MSE of the changed region, so when CM_Loss → 0, CM_s → 0 and CM_d → 1.

IV. EXPERIMENTS

A. Dataset
To verify the effectiveness of the method proposed in this article, three datasets were selected: LEVIR, WHU-CD, and SECOND. LEVIR contains only the binary change labels of buildings, and WHU-CD contains both the change and segmentation labels of buildings. SECOND contains the change labels as well as the multicategorical segmentation labels for the pre- and post-time phases within a changed region.
LEVIR: The LEVIR dataset consists of 637 VHR image patches collected from Google Earth, as shown in Fig. 5. The resolution of the images is 0.5 m, and the size is 1024 × 1024. It is a large-scale CD dataset and covers different kinds of buildings. We use the default dataset split, that is, 445/64/128 images for training/validation/testing, respectively.
WHU-CD: The WHU building CD dataset consists of two-period aerial images with a resolution of 0.3 m, as shown in Fig. 6. The two periods of images were obtained in 2012 and 2016. There are a variety of buildings with large-scale changes in the dataset. WHU-CD provides a standard train/validation/test split, and we use the official test division.
SECOND: The SECOND dataset has 4662 pairs of aerial images obtained from several platforms and sensors, as shown in Fig. 7. The image pairs are distributed over cities such as Hangzhou, Chengdu, and Shanghai. Each image has a size of 512 × 512 and is annotated at the pixel level. The annotation of SECOND was carried out by experts in earth vision applications, which guarantees high label accuracy. Moreover, the SECOND dataset utilizes land cover map pairs and nonchange masks to represent the change categories. We have only 2948 sets of data labels, so we use 2250 sets for training and 598 sets for validation. The accuracy of our results is evaluated by the official server.

B. Evaluation
LEVIR-CD and WHU-CD are binary category datasets. Therefore, we use precision, recall, intersection over union (IoU), and F1 score as the evaluation metrics. The values of IoU and F1 range from 0 to 1, and higher values indicate better performance. The IoU metric tends to penalize single instances of bad classification more than the F1 score, even when both agree that the instance is bad. Therefore, the fluctuations of these two metrics reflect model problems along two different dimensions. The evaluation metrics are calculated as follows:

precision = TP / (TP + FP) (14)
recall = TP / (TP + FN) (15)
F1 = 2 · precision · recall / (precision + recall) (16)
IoU = TP / (TP + FP + FN) (17)

where TP denotes the true positives, FP denotes the false positives, and FN denotes the false negatives. SECOND is a multitask dataset that is evaluated with the common mIoU metric and a proposed coefficient named SeK. Specifically, given a confusion matrix Q, mIoU is the average of IoU_1 and IoU_2, where IoU_1 measures the identification accuracy of unchanged pixels and IoU_2 evaluates the extraction accuracy of changed regions.
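The binary metrics above reduce to a few lines given the pixel counts. A small worked example (counts are invented for illustration) also shows the property noted in the text: IoU is never larger than F1, so it penalizes errors more sharply.

```python
def binary_cd_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, F1, and IoU from binary change-detection
    pixel counts (true positives, false positives, false negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

# Illustrative counts: 80 change pixels found, 10 false alarms, 10 misses.
p, r, f1, iou = binary_cd_metrics(tp=80, fp=10, fn=10)
```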
Moreover, the true positive count of the unchanged pixels (q_11) always dominates the calculation of κ. Thus, q_11 is excluded from the calculation of SeK, and SECOND also utilizes IoU_2 to further emphasize the changed pixels. Specifically, SeK is defined as

SeK = κ̂ · e^(IoU_2 − 1)

where κ̂ is the kappa coefficient computed from the confusion matrix with q_11 set to zero, based on the observed agreement ρ̂ and the row sums q̂_{j+} and column sums q̂_{+j} of that reduced matrix. The exponential form enlarges the discernibility compared with simple multiplication.
C. Implementation Details

1) Data Preprocessing: Each image in the three datasets is normalized so that the mean is 0 and the variance is 1.

2) Data Augmentation: Augmentation includes spatial, color, and noise transformations. Spatial transformations consist of random rotations, cropping, the exchange of pre/post data, and polygon shadows; color transformations consist of HSV shifts and random brightness-contrast operations; noise transformation refers to the addition of Gaussian noise.

3) Training: We use Adam [52] as the optimizer, with an initial learning rate of 0.001 and a weight decay of 0.001. The learning rate follows a poly schedule and reaches 0 at epoch 250.

4) Inference: For both the LEVIR and SECOND datasets, the original data are normalized and input directly to obtain the results. For the WHU-CD dataset, overlapping sliding prediction with a window of 1024 × 1024 and an overlap of 256 pixels is performed. The proposed MCDnet family contains MCDnet-change, MCDnet-seg, and MCDnet-mtask: MCDnet-change is evaluated on the LEVIR and WHU-CD datasets, MCDnet-mtask is evaluated on the SECOND and WHU-CD datasets, and MCDnet-seg is evaluated on the WHU-CD dataset to demonstrate that CD supervision can also benefit segmentation. All experimental code is written in Python with PyTorch for deep-learning training and inference, and the GPU is an NVIDIA RTX 3090 with 24 GB of memory.
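The poly learning-rate schedule used in training can be sketched as below. The decay exponent is not stated in the paper; 0.9 is the value commonly used with poly schedules and is an assumption here, as are the function and parameter names.

```python
def poly_lr(base_lr: float, epoch: int, max_epoch: int = 250, power: float = 0.9) -> float:
    """Poly learning-rate schedule: decays base_lr smoothly to 0 at max_epoch.
    power=0.9 is a common default; the paper's exact exponent is not stated."""
    return base_lr * (1 - epoch / max_epoch) ** power

lr_start = poly_lr(0.001, 0)    # base learning rate at the first epoch
lr_mid = poly_lr(0.001, 125)    # decayed partway through training
lr_end = poly_lr(0.001, 250)    # reaches 0 at epoch 250, as in the text
```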

D. Comparative Method
For the LEVIR-CD and WHU-CD datasets, we compare our method with several SOTA CD methods.

E. Results on the LEVIR Dataset
For the LEVIR dataset, since it contains only binary change labels, the proposed MCDnet-change is compared with SOTA CD models, as shown in Table I. MCDnet-change achieves higher F1 and IoU scores than the recently proposed CEECNetV1 model.
We show four sets of experimental results in Fig. 8. In the first set, the building color and ground color are similar, and MCDnet-change extracts buildings more completely and gives more accurate edge details; in the second and third sets, MCDnet-change performs better in extracting details of both

F. Results on the WHU-CD Dataset
The WHU-CD dataset provides both change labels and building segmentation labels, so we use it to validate both MCDnet-change and MCDnet-mtask. It is important to note that our experiments and accuracy evaluation, as shown in Table II, are carried out on an entire remote sensing image scene. MCDnet-change achieves an F1 score of 94.06% and an IoU of 88.79%, and MCDnet-mtask achieves an F1 score of 94.36% and an IoU of 89.32%. It is obvious that the introduction of the segmentation task promotes the CD task.
We show four sets of experimental results in Fig. 9. The first row shows that MCDnet-mtask can effectively detect the change region missed by MCDnet-change and CEECNet, and MCDnet-mtask outperforms MCDnet-change in detecting building edges due to the use of segmentation information. The second row shows that both MCDnet-mtask and MCDnet-change are significantly better than the recent SOTA CEECNet in scenarios where only building roof renovations have occurred and there is no significant change in the building structure. The third row shows the building extraction results, where both CEECNet and MCDnet-mtask extract the changed buildings completely. In the fourth row, MCDnet-mtask detects more medium and large buildings than CEECNet, but there are still missed detections.

G. Results on the SECOND Dataset
The accuracy on the SECOND dataset is shown in Table III, where MS denotes the multiscale test and Flip denotes the flip enhancement test. We use IoU to measure the CD accuracy and SeK to measure the classification accuracy. Table III shows that the CD and classification accuracies are promoted by 2.9% and 5.5%, respectively. The accuracy promotion fluctuates with the adoption of data augmentation; e.g., IoU and SeK are promoted by 3% and 6.1%, respectively, when using multiscale enhancement, and the IoU value reaches its peak when both multiscale and flip enhancement are adopted. MCDnet-mtask is superior to the ASN-ATL model in terms of both the accuracy of CD and changed region classification, i.e., semantic CD. Fig. 10 shows the experimental results on two groups of image pairs. The group shown in the first two rows mainly covers changes from low vegetation to buildings, changes from buildings to bare ground, and changes from fragmented trees to low vegetation. The results indicate that the proposed method has fewer missed detections than ASN-ATL and HRSCD.str4, and the small change from fragmented trees to low vegetation is also correctly detected. The group shown in the last two rows mainly contains large change areas from low vegetation to trees that are overlaid with building shadows. The result obtained by MCDnet-mtask is similar to the ground truth, while the ASN-ATL and HRSCD.str4 models fail to distinguish pseudochanges, such as shadows, from real changes.

H. Ablation Studies
To demonstrate the contributions of mtask, CFFM, and CCL, we carry out ablation studies on the WHU-CD dataset. As shown in Table IV, mtask, CCL, and CFFM bring approximately 1.15%, 0.3%, and 0.72% F1 score improvements, respectively. 1) mtask Analysis: MCDnet-change is a network trained with CD labels, while MCDnet-seg is a network trained with semantic segmentation labels. 2) Analysis of CFFM: The proposed CFFM is compared with add and concat fusion, respectively. The results indicate that implicit feature learning using CNNs is superior to directly introducing prior information, especially when the introduction may lead to information loss (add/cut), while a reasonably designed prior-information fusion method with feature guidance can better improve the accuracy.
3) Analysis of CCL: CCL is able to fully exploit the relationships between images by explicitly aligning each layer of feature pairs. As shown in Table IV, the addition of CCL yields an accuracy improvement of approximately 0.3% for both MCDnet-change and MCDnet-mtask.

I. Efficiency Comparison
There is a tradeoff between the accuracy and efficiency of deep-learning models, so in this study, efficiency comparison experiments are conducted on the WHU-CD dataset using a single 3090 graphics card, and the results are shown in Table VII. The accuracy achieved by the proposed network is clearly superior at the same model scale.

J. Number of Labels and Accuracy
Sufficient labeled images for CD are usually more difficult to obtain than those for target detection and semantic segmentation. Deep learning requires large amounts of data to learn feature representations as well as prior knowledge, which means that introducing suitable prior knowledge should reduce the need for large amounts of data. To verify this inference, we examine the relationship between the number of labels and the corresponding accuracy on the WHU-CD dataset when prior knowledge is introduced.
As shown in Table VIII, MCDnet is comparable to the SOTA model when only 20% of the full training data are used. The largest accuracy improvement comes from the addition of MTL, which means that the segmentation task can be used to assist CD, thus greatly reducing the number of CD samples needed.

V. CONCLUSION
In this article, we investigate how to more fully exploit the potential of CNN-based CD models with feature guidance. First, we design an MTL network, MCDnet-mtask, with the goal of mutually beneficial feature learning introduced by the auxiliary segmentation task. Then, we analyze the possibility of improving accuracy under the guidance of contrastive learning and design a loss function, CCL, which maximizes the feature distance in changed regions and minimizes the feature distance in unchanged regions. Finally, a novel feature fusion module, CFFM, is proposed to integrate the learned features and enhance the change information under the guidance of symmetrical change features.
The algorithm proposed in this article is tested on three datasets, LEVIR, WHU-CD, and SECOND, and achieves superior accuracy compared with the SOTA CD models. However, the proposed model also has shortcomings: it targets symmetric change scenarios, so the accuracy improvement in asymmetric change scenarios will be less significant. In the future, we will continue to design reasonable feature-guided schemes based on the proposed network and will verify their effectiveness on real-world data.