Edge-Guided Parallel Network for VHR Remote Sensing Image Change Detection

Change detection (CD) is an important research topic in the remote sensing field, and it has a wide range of applications, including resource monitoring, disaster assessment, and urban planning. Recently, deep learning (DL) has shown its advantages in CD. However, most existing DL-based methods cannot capture the complementary information between bitemporal and difference features. This article proposes an edge-guided parallel network (EGPNet) to solve this problem. First, our EGPNet extracts bitemporal and difference features simultaneously through a parallel encoding framework. During parallel encoding, we design a supplementary mechanism to enrich the difference features with bitemporal features. Second, we fuse bitemporal and difference features at each feature level to sufficiently exploit their complementarity. Finally, the edge-aware module and edge-guidance feature module are introduced to enhance the edge representation, improving the blurred edges of detection results. Benefiting from the rich change-related information in difference features and the detailed information in bitemporal features, our EGPNet can detect change regions entirely and accurately. Experimental results on the LEVIR-CD, SYSU-CD, and CDD datasets demonstrate that the proposed method outperforms several state-of-the-art approaches. In particular, our EGPNet detects more precise and sharper edges than other methods.


I. INTRODUCTION
Given a pair of coregistered images of the same region in different time phases, change detection (CD) aims to identify where changes have occurred. The change regions are assigned positive labels, whereas the unchanged regions are assigned negative labels. It is essentially a binary semantic segmentation task. As remote sensing satellite imaging technology advances by leaps and bounds, many remote sensing platforms, such as QuickBird, GeoEye, Worldview, and unmanned aerial vehicles (UAVs), can provide very high resolution (VHR) images [1]. These VHR images capture detailed ground information, making it possible to observe our Earth from a closer perspective. CD, as one of the most important applications of remote sensing image interpretation, can be used in many fields, such as resource monitoring [2], disaster assessment [3], and urban planning [4]. According to the image analysis unit, traditional CD methods can be divided into two categories. The pixel-based CD (PBCD) method takes an image pixel as the fundamental unit of analysis [5], [6], [7]. The object-based CD (OBCD) method takes an image object as the fundamental unit of analysis to explore spatial context, texture, and shape information [8], [9], [10]. Although these methods need fewer samples for training and have strong interpretability, their accuracy is not very satisfactory due to the complexity of CD.
Recently, deep learning (DL) has dominated CD methods. Among them, methods utilizing both bitemporal and difference features achieve better performance. Lei et al. [11] take difference features as input to a channel attention module to obtain attention weights for bitemporal features. Peng et al. [12] introduce a DE module to combine difference features in the input space with the final features after decoding. Zhang et al. [13] fuse difference and bitemporal features based on the attention mechanism in a difference discrimination network. These works indicate the complementarity between bitemporal and difference features. Difference features can reflect changes explicitly, but they lack details. Bitemporal features contain detailed information, but they cannot reflect changes explicitly. Both kinds of features are equally crucial to CD, so capturing their complementary information is essential. However, the difference features used by these methods are generated by construction rather than extraction. In other words, these methods lack an explicit extraction process for difference features. Following the spirit of the two-stream architecture [14], we propose a parallel encoding framework in our edge-guided parallel network (EGPNet) to extract these two kinds of features simultaneously and explicitly. To sufficiently leverage their complementarity, we fuse these two kinds of features at each feature level, generating the fused features. Benefiting from the rich change-related information in difference features and the detailed information in bitemporal features, our EGPNet can detect change regions entirely and accurately.
Besides, many state-of-the-art (SOTA) CD methods still suffer from blurred edges, as shown in Fig. 1. To address this problem, we introduce an edge-aware module (EAM) and an edge-guidance feature module (EFM) [16]. EAM integrates low-level local detail information and high-level global location information to explore edge semantics under direct edge supervision. EFM guides representation learning, enhancing the edge representation. The contributions of our work can be summarized as follows.
1) A parallel encoding framework that can effectively extract bitemporal and difference features is proposed to explore feature complementarity.
2) We design a supplementary mechanism (SM), which bridges the bitemporal encoder and the difference encoder, enriching the difference features with bitemporal features.
3) We introduce EAM and EFM to solve the edge blur problem of VHR remote sensing image CD.
The rest of this article is organized as follows. Section II reviews the related works. Section III gives the details of our proposed method. Experimental results and analysis are presented in Section IV, and finally, Section VI concludes this article.

II. RELATED WORK

A. DL-Based CD
CD research has been largely driven by advances in semantic segmentation, which is often adapted to CD. A large family of CD methods is based on bitemporal features [17], [18], [19], [20], [21], [22]. They extract bitemporal features via a Siamese network, and the concatenation of bitemporal features is fed into the decoder to identify the changes. Daudt et al. [17] propose FC-Siam-conc, which adapts Unet [23] with a Siamese architecture [24] for CD. Chen et al. [18] embed a nonlocal attention module into a Siamese Unet to increase the detection capability of the model as well as its noise suppression capability. Fang et al. [19] adapt Unet++ [25] with a Siamese architecture to retain shallow-layer information. Chen et al. [20] use two types of modality-independent structural relationships to solve the modal heterogeneity problem in unsupervised multimodal CD. Liu et al. [21] propose a multitask Siamese convolutional network combining the semantic information of the single bitemporal image. Chen et al. [22] use the semantic information of the single bitemporal image in a self-supervised learning framework to learn more discriminative features.
Other methods pay attention to difference features [15], [26], [27], [28], [29]. First, they extract bitemporal features through a Siamese network. Then, difference features are constructed from bitemporal features using a recurrent neural network (RNN) or a subtraction operation. Finally, the difference features are used to recognize the changes. Chen et al. [26] integrate the merits of both CNN and RNN: a CNN is used to extract bitemporal features, and an RNN is used to generate difference features. Zhang et al. [27] design a differential pyramid to extract multilevel difference features explicitly, and the difference features are then fed into Unet++ for further representation learning. Chen et al. [15] use a transformer to model the space-time context in a token-based space to reduce pseudochanges. Bandara et al. [28] adapt SegFormer [30] with a Siamese architecture, achieving higher performance than many models employing very large ConvNets. Chen et al. [29] design a structural relationship analysis framework in the Fourier domain to solve the modal heterogeneity problem of unsupervised multimodal CD. Some methods use the channel concatenation image as the initial input to extract difference features: Liu et al. [31] use depthwise separable convolution to extract difference features efficiently, and Peng et al. [32] feed the channel concatenation of bitemporal images into Unet++ to extract difference features for CD. In addition, metric-learning-based CD methods calculate the Euclidean distance pixelwise to generate the distance map [33], [34], [35]. They are also based on difference features. This article focuses on exploring the complementarity between bitemporal features and difference features.

B. Two-Stream Architecture
For a specific task, there is often more than one type of information, and these types are usually heterogeneous and complementary. Thus, a two-stream architecture is a natural choice for neural network design. For video action recognition, Simonyan et al. [14] propose a two-stream network composed of a spatial and a temporal network to integrate appearance and motion information. Zhou et al. [36] adopt faster R-CNN within a two-stream network for image manipulation detection: the RGB stream finds tampering artifacts such as substantial contrast differences and unnatural tampered boundaries, while the noise stream detects the noise inconsistency between authentic and tampered regions. Zhang et al. [37] propose an asymmetric two-stream architecture combining RGB information and depth information for saliency detection. Following the two-stream spirit, we design a novel parallel encoding framework to combine bitemporal and difference information for CD.

C. Edge-Guided Network
Edge cues are instrumental in many computer vision tasks, such as salient object detection [38], [39], [40] and medical image segmentation [41], [42]. Usually, there is a subnetwork for edge detection. Edges generated through a Sobel or Canny operator are involved in the calculation of the edge loss. Then, the edge and no-edge features are integrated for the final detection. Thus, the network is guided to pay more attention to edges, making them more precise and sharper. However, research on edge cues is limited in the CD research community. Cheng et al. [43] adopt deformable convolution to achieve margin maximization, clarifying the gap between changed and unchanged semantics. Bai et al. [44] propose an EGRCNN that incorporates both discriminative features and edge features to improve the edge quality of CD results. Chen et al. [45] design an edge-guided transformer block for long-range context modeling and edge feature refinement. Xia et al. [46] propose an extra edge detection branch to guide change features with edge information. Different from EGRCNN [44] and EGDE-Net [45], which simply capture and fuse edge information at the end of the network, we introduce an EAM to explore edge semantics using selected features after encoding and design an EFM to inject edges into multilevel change features to guide their representation learning.
(Caption of Fig. 2: BConv-i is the ith convolution block of the bitemporal encoder, and DConv-i is the ith convolution block of the difference encoder. SM enriches the difference feature flow with the bitemporal feature flow. EAM integrates features at levels 2 and 5 to generate the edge map, which is injected into the multilevel fused features through EFM to guide their representation learning. The decoder employs FFM to combine low-level and high-level features progressively. In addition, 1 × 1 convolutions produce change results at different feature levels, which are upsampled to 256 × 256 to provide direct supervision for intermediate layers.)

III. PROPOSED METHOD

A. Overall Architecture
In Fig. 2, we illustrate the overall framework of our proposed EGPNet, which follows an encoder-decoder architecture. Different from conventional encoders, we propose a parallel encoder made up of a bitemporal encoder and a difference encoder. In the parallel encoder, we design an SM to enrich the difference feature flow using the bitemporal feature flow. Next, the concatenation of bitemporal and difference features is fed into two convolution layers for sufficient semantic fusion to obtain the fused features. Then, we use EAM to generate edges that are injected into the fused features at each feature level through EFM for edge representation enhancement. During feature decoding, we progressively fuse different levels of feature maps and employ 1 × 1 convolution to map feature vectors to the desired number of classes. Five change results are produced at the corresponding feature levels, and feature level 1 gives the best result.

B. Parallel Encoding Framework

1) Bitemporal Encoder:
The bitemporal encoder adopts the Siamese network architecture as in [17] to obtain multilevel bitemporal features containing many details. Like the vanilla Unet [23], the bitemporal encoder includes five convolution stages. Each stage comprises one convolution block and one pooling layer. As shown in Fig. 3, the convolution block consists of 3 × 3 convolution, batch normalization, and the ReLU activation function. The first convolution layer is used to double the number of channels, and the 2 × 2 max pooling layer reduces the size of the feature maps. These convolution blocks are abbreviated as BConv-$i$, where $i \in \{1, 2, 3, 4, 5\}$. In this article, the initial channel number is set to 32; Table I gives the details of these convolution blocks. Given bitemporal images $I_1 \in \mathbb{R}^{C \times H \times W}$ and $I_2 \in \mathbb{R}^{C \times H \times W}$, passing through the five convolution stages, respectively, we obtain the multilevel bitemporal features at the corresponding stages. We denote the five stages of bitemporal feature maps as $f^i_{b1}$, $f^i_{b2}$, where $i \in \{1, 2, 3, 4, 5\}$. The numerical superscript indicates the feature level, and the subscript represents the time phase.
2) Difference Encoder: Difference information is essential for CD because the changes can be identified directly from it. Our idea is that difference features obtained by extraction are superior to difference features obtained by construction. Instead of using an RNN or subtraction, we design an independent difference encoder for the representation learning of difference information. The difference encoder also includes five convolution stages, consistent with the bitemporal encoder. We abbreviate these convolution blocks as DConv-$i$, where $i \in \{1, 2, 3, 4, 5\}$; details are given in Table I. There is no direct input for the difference encoder. We stack $I_1$ and $I_2$ in the channel dimension, producing $I_D \in \mathbb{R}^{2C \times H \times W}$, which can implicitly represent the difference information of the input space. Then, we feed $I_D$ into the difference encoder to extract multilevel difference features, denoted as $f^i_d$, where $i \in \{1, 2, 3, 4, 5\}$. The numerical superscript indicates the feature level, and the subscript $d$ means the difference.

C. Supplementary Mechanism
It is considered that the semantic information contained in $I_D$ is limited, and considering only $I_D$ may be insufficient. To extract semantic-rich difference features, we aim to enrich the flow of difference features with the flow of bitemporal features. As shown in Fig. 4, we design an SM that constructs difference features from bitemporal image features. Then, the constructed difference features are used to supplement the flow of difference features at each stage. This process can be formulated as
$$DI^{i+1} = f^i_d + \left| f^i_{b1} - f^i_{b2} \right|$$
where $DI^{i+1}$ represents the input of the difference encoder at the $(i+1)$th stage, $f^i_d$ is the output of the difference encoder at the $i$th stage, $f^i_{b1}$ and $f^i_{b2}$ are the outputs of the bitemporal encoder at the $i$th stage, and $|\cdot|$ is the absolute value operator. We construct difference features through subtraction and supplement the original difference features using addition.
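The supplement step above is a single line of tensor arithmetic. The following minimal numpy sketch (not the authors' implementation) shows the addition-supplement form of SM:

```python
import numpy as np

def supplementary_mechanism(f_d, f_b1, f_b2):
    """Addition-supplement (AS) form of SM: the constructed difference
    |f_b1 - f_b2| is added to the difference encoder's own output f_d,
    forming the next stage's input DI."""
    return f_d + np.abs(f_b1 - f_b2)
```

In the network, the result is fed to the next DConv block rather than returned directly.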

D. Bitemporal Difference Feature Fusion
To utilize the change-related information in difference features and the detailed information in bitemporal features, we first directly concatenate bitemporal features and difference features in the channel dimension. Then, the concatenation of bitemporal and difference features is fed into two convolution layers for sufficient semantic fusion, producing more powerful features. The fused features can locate the changes, especially their details, accurately. The fused features are denoted as
$$f^i_f = F_{conv3}(F_{conv3}(\mathrm{concat}(f^i_{b1}, f^i_{b2}, f^i_d)))$$
where $F_{conv3}$ denotes the 3 × 3 convolution layer and $\mathrm{concat}$ represents feature channel concatenation.

E. Edge-Aware Module
The lack of prior edge structure information leads to inaccurate detection results in the areas of building edges [44]. As shown in Fig. 5, we introduce an EAM to explore edge semantics under direct edge supervision. Low-level features contain rich edges but also many nonchange-related ones. Thus, high-level features are needed to help locate change-related edges. $f^2_f$ and $f^5_f$ are selected to explore edge semantics. First, a 1 × 1 convolution reduces the channel number of the high-level features, and upsampling aligns the spatial resolution. Then, the concatenation of $f^2_f$ and $f^5_f$ is passed through two 3 × 3 convolution layers to further integrate semantic information. Finally, a 1 × 1 convolution followed by the Sigmoid function is used to produce the edge map, denoted as $f_{edge}$.
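As a structural sketch only, the upsample-concatenate-fuse-Sigmoid pipeline of EAM can be illustrated as follows; the learned 1 × 1 and 3 × 3 convolutions are replaced here by a simple channel mean, and the ×8 upsampling factor between levels 5 and 2 is an assumption of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eam_sketch(f2, f5):
    """Structural sketch of EAM: the low-resolution feature f5 is
    upsampled (nearest neighbor, factor 8 assumed) to match f2, the two
    are concatenated along channels, fused (channel mean stands in for
    the learned convolutions), and squashed to an edge map in (0, 1)."""
    up = np.repeat(np.repeat(f5, 8, axis=1), 8, axis=2)  # nearest-neighbor upsample
    cat = np.concatenate([f2, up], axis=0)               # channel concatenation
    edge_logits = cat.mean(axis=0, keepdims=True)        # stand-in for conv layers
    return sigmoid(edge_logits)                          # single-channel edge map
```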

F. Edge-Guidance Feature Module
EFM injects the edge map $f_{edge}$ produced by EAM into the fused features $f^i_f$, guiding the representation learning and enhancing the edge representation. As shown in Fig. 6, given the fused features $f^i_f$, where $i \in \{1, 2, 3, 4, 5\}$, and the edge map $f_{edge}$, we first perform elementwise multiplication between the downsampled edge map and the fused features at the corresponding feature levels. Then, a residual connection and a 3 × 3 convolution layer are used for feature fusion. In this way, we obtain the updated fused features $f^i_{up}$ whose edges are enhanced
$$f^i_{up} = F_{conv3}((D(f_{edge}) \otimes f^i_f) \oplus f^i_f)$$
where $D$ denotes downsampling, $\otimes$ is elementwise multiplication, and $\oplus$ is addition. Finally, we apply an efficient channel attention (ECA) module [47] to achieve further feature representation enhancement. ECA can capture local cross-channel interaction using 1-D convolution. The enhanced features $f^i_{en}$ can be denoted as
$$f^i_{en} = \sigma(F^k_{1D}(\mathrm{GAP}(f^i_{up}))) \otimes f^i_{up}$$
where $\mathrm{GAP}$ represents global average pooling, $F^k_{1D}$ is the 1-D convolution whose kernel size is $k$, and $\sigma$ is the Sigmoid function. As described in [47], the kernel size $k$ can be selected adaptively.
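A minimal numpy sketch of the multiply-then-residual-add step and the ECA gating follows. The learned 3 × 3 convolution is omitted (identity stand-in), the learned 1-D convolution weights are replaced by a simple averaging kernel, and `eca_kernel_size` follows the adaptive rule reported in [47] with its default γ = 2, b = 1:

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size from [47]: nearest odd of (log2(C) + b) / gamma."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def efm_sketch(f_f, edge_map):
    """EFM sketch: f_up = (edge ⊗ f_f) ⊕ f_f, then ECA channel gating.
    f_f: (C, H, W) fused features; edge_map: (1, H, W), already at the
    matching resolution (the downsampling D is assumed done)."""
    f_up = edge_map * f_f + f_f                      # multiply, then residual add
    gap = f_up.mean(axis=(1, 2))                     # global average pooling -> (C,)
    k = eca_kernel_size(f_up.shape[0])
    w = np.ones(k) / k                               # stand-in for the learned 1-D conv
    att = sigmoid(np.convolve(gap, w, mode="same"))  # local cross-channel attention
    return f_up * att[:, None, None]                 # rescale channels
```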

G. Progressive Feature Decoding
As shown in Fig. 7, in the top-down feature fusion module (FFM), we fuse deep-layer features with shallow-layer features progressively, producing the final features $f^i_{de}$ at different feature levels. Transposed convolution is used to align the channel number and the spatial resolution of the feature maps. FFM can be formulated as
$$f^i_{de} = \mathrm{concat}(f^i_{en}, F_{trans}(f^{i+1}_{de})), \quad f^5_{de} = f^5_{en}$$
where $F_{trans}$ refers to the 3 × 3 transposed convolution layer and $\mathrm{concat}$ denotes feature channel concatenation.
H. Loss Function

1) Change Supervision: For the CD task, the distribution of difficult and easy samples is unbalanced due to the influence of shadows, light, and seasonal changes [27]. Thus, the focal loss (FL) [48], which can focus on hard examples, is adopted for change supervision.
$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$
where $p_t$ denotes the change probability. When $\gamma > 0$, the relative loss for easy examples is reduced, paying more attention to hard examples. In this article, $\gamma$ is set to 1. Besides, we adopt the deep supervision strategy to deal with the gradient vanishing problem, learning more discriminative features. As shown in Fig. 2, the change loss is
$$L_{change} = \sum_{i=1}^{5} \mathrm{FL}(p^i, g)$$
where $\mathrm{FL}$ denotes the focal loss, $g$ is the change ground truth, and $p^i$ is the detection result at feature level $i$.
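The focal loss and the deeply supervised change loss can be written compactly in numpy; this is a sketch, and the `eps` clipping for numerical stability is an assumption not stated in the article:

```python
import numpy as np

def focal_loss(p, g, gamma=1.0, eps=1e-7):
    """FL = -(1 - p_t)^gamma * log(p_t), averaged over pixels.
    p: predicted change probabilities; g: binary ground truth."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(g == 1, p, 1.0 - p)   # probability of the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def change_loss(preds, g, gamma=1.0):
    """Deep supervision: sum the focal loss over the five side outputs."""
    return sum(focal_loss(p, g, gamma) for p in preds)
```

With γ > 0, a confidently correct pixel (p_t near 1) contributes almost nothing, so training focuses on hard examples.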
2) Change Edge Supervision: EAM produces an edge map $f_{edge}$. We employ the dice loss [49], which can alleviate the strong class imbalance problem, for its supervision.
$$L_{edge} = 1 - \frac{2\sum_{(x,y)} f_{edge}(x,y)\, g_{edge}(x,y)}{\sum_{(x,y)} f_{edge}(x,y)^2 + \sum_{(x,y)} g_{edge}(x,y)^2}$$
where $g_{edge}$ represents the ground truth of the edge, extracted from the change ground truth $g$ through the Canny operator [50], and $(x, y)$ indexes the pixels in $f_{edge}$ or $g_{edge}$.
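A common squared-denominator form of the dice loss [49] is sketched below; the `smooth` term is a standard stabilizer against empty masks and is an assumption of this sketch:

```python
import numpy as np

def dice_loss(pred, gt, smooth=1.0):
    """Dice loss for the edge map: 1 - 2*|pred ∩ gt| / (|pred|^2 + |gt|^2).
    pred: predicted edge probabilities; gt: binary edge ground truth."""
    inter = float((pred * gt).sum())
    denom = float((pred ** 2).sum() + (gt ** 2).sum())
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)
```

Because the loss is computed over the overlap rather than per pixel, the vast majority of non-edge pixels cannot dominate it, which is why it suits the sparse edge target.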
3) Overall Loss: Finally, the overall loss $L_{total}$ is denoted as
$$L_{total} = L_{change} + \lambda L_{edge}$$
where $L_{edge}$ is the edge loss and $L_{change}$ is the change loss. $\lambda$ controls the contribution of $L_{edge}$ in the total loss and is set to 0.1 in this article.

IV. EXPERIMENT AND ANALYSIS
In this section, first, we give the experimental setup, including the datasets, evaluation metrics, and implementation details. Next, we conduct comparative experiments to validate the performance of the proposed method on the LEVIR-CD [33], SYSU-CD [34], and CDD [51] datasets. Then, we design ablation experiments to validate the effectiveness of each part of our EGPNet. Finally, a network visualization is presented to understand our EGPNet intuitively.
A. Dataset

1) LEVIR-CD Dataset: LEVIR-CD [33] is a public CD dataset released by the Beijing University of Aeronautics and Astronautics. The changes are mainly about construction growth. It contains 637 VHR image pairs collected from Google Earth (GE). Its spatial resolution is 0.5 m per pixel, and the image size is 1024 × 1024. Due to the GPU memory limitation, these images are cropped into smaller image patches of size 256 × 256, following the original dataset split. Consequently, we obtain 7120 pairs of image patches for training, 1024 for validation, and 2048 for testing.
2) SYSU-CD Dataset: SYSU-CD [34] is a challenging CD dataset released by Sun Yat-sen University. It covers many change types (e.g., suburban dilation, road expansion, and sea construction). It contains 20 000 pairs of labeled remote sensing images collected between 2007 and 2014 in Hong Kong. The size of each image is 256 × 256, and the spatial resolution is 0.5 m per pixel. There are 12 000 pairs of images for training, 4000 for validation, and 4000 for testing.
3) CDD Dataset: CDD [51] is a public CD dataset whose images are collected from GE. It contains 16 000 pairs of remote sensing images obtained from the same region in different seasons. It covers change objects of different sizes (e.g., cars, single trees, big constructions, and forest areas). The resolution of CDD is from 3 to 1 m per pixel, and the image size is 256 × 256. There are 10 000 pairs of images for training, 3000 for validation, and 3000 for testing.

B. Evaluation Metrics
To evaluate the performance of the proposed method, we adopt Precision, Recall, F1 score, intersection-over-union (IOU), and overall accuracy (OA), which are often used in binary classification tasks. They are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{IOU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{OA} = \frac{TP + TN}{TP + FP + TN + FN}$$
where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.
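These definitions translate directly into code; a small pure-Python helper (the function name `cd_metrics` is hypothetical):

```python
def cd_metrics(tp, fp, tn, fn):
    """Precision, Recall, F1, IOU, and OA from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)                 # ignores true negatives
    oa = (tp + tn) / (tp + fp + tn + fn)      # dominated by true negatives
    return precision, recall, f1, iou, oa
```

Note that OA is inflated by the large unchanged background, which is why F1 and IOU are the more informative metrics for CD.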

C. Implementation Details
We implement our EGPNet using the PyTorch DL library. We conduct all the experiments on a single NVIDIA GeForce RTX 3060 GPU. We initialize our EGPNet with Kaiming normalization [52]. During model training, we employ the Adam optimizer [53] for faster convergence. The initial learning rate is set to 1e-4, and a linear decay strategy is adopted to adjust the learning rate. Due to the GPU memory limitation, the batch size is set to 8, and the total number of epochs is set to 100 for both the LEVIR-CD and SYSU-CD datasets. For the CDD dataset, the total number of epochs is set to 140.
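The linear decay schedule can take several forms; one minimal interpretation (an assumption, since the article does not give the exact formula) decays the learning rate linearly from the initial value to zero over the training epochs:

```python
def linear_decay_lr(base_lr, epoch, total_epochs):
    """Linear decay from base_lr at epoch 0 to 0 at the final epoch.
    This exact form is an assumption; the article only names the strategy."""
    return base_lr * (1.0 - epoch / total_epochs)
```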

D. Comparative Experiments
1) Comparative Methods: Several SOTA methods are selected for comparison, all implemented using code published by the original authors. These include FC-Siam-conc [17], FC-Siam-diff [17], IFN [13], SNUNet [19], BIT [15], ISNet [43], ChangeFormer [28], EGRCNN [44], and EGCTNet [46]. FC-Siam-conc [17] is based on Unet [23] and concatenates the bitemporal features for the skip connection at each layer. FC-Siam-diff [17] is also based on Unet and uses subtraction to construct difference features for the skip connection. IFN [13] integrates difference and bitemporal features through the attention mechanism in a difference discrimination network. SNUNet [19] can retain fine low-level information through the dense connections between the encoder and decoder. BIT [15] uses a transformer to model the space-time context in a token-based space. ISNet [43] employs deformable convolution to achieve margin maximization. ChangeFormer [28] adapts SegFormer [30] with a Siamese architecture to extract multilevel features with long-range dependency. EGRCNN [44] introduces a DAM to produce more discriminative features and a multilevel edge detection header to capture edge semantic information. EGCTNet [46] proposes an additional edge detection branch to improve edge accuracy.
2) Experiments on the LEVIR-CD Dataset: Table II reports the quantitative comparison results on the LEVIR-CD dataset. Our EGPNet outperforms other methods in terms of F1, IOU, and OA. The F1 score improves by 0.56% compared to the suboptimal method (ChangeFormer). The $F1_{edge}$ score is not optimal because our edge guidance strategy focuses on using edges to guide the fused features, not on the edges themselves.
The results of the visual comparison are shown in Fig. 8. First, by combining the detailed information in the bitemporal features, our EGPNet can detect relatively intact change regions [e.g., Fig. 8 (1)] and small objects missed by other methods [e.g., Fig. 8 (3)]. Second, the difference encoder can explicitly capture difference information, which is useful for identifying the changes [e.g., Fig. 8 (2)]. Finally, Fig. 8 (4) shows the great advantages of our model in detecting accurate edges. Other methods identify the new buildings as a whole change region, failing to detect the small gaps. Our EGPNet can detect the edges of each building, providing more detail about the change regions. EGCTNet can also detect small gaps, but they are not as accurate as ours. This is because the edges generated by EAM are accurate and can guide the representation learning of the fused features. To display more types of changes, visual results of 1024 × 1024 images are given in Fig. 9.
3) Experiments on the SYSU-CD Dataset: Table III reports the quantitative comparison results on the SYSU-CD dataset. Our EGPNet outperforms other methods in terms of F1, IOU, and OA. In particular, the F1 score improves by 2.12% compared to the suboptimal method (EGRCNN) on this more challenging dataset, which indicates the robustness of our model. Our model performs well even in complex change scenes. Fig. 10 shows the visual comparison results on the SYSU-CD dataset. It can be seen that the proposed method achieves satisfactory performance. First, our proposed method can detect accurate and sharp edges. The results detected by our EGPNet have few error detections around the edges [e.g., Fig. 10 (1) and (2)]. Second, our EGPNet is better at avoiding false detections [e.g., Fig. 10 (3) and (4)]. Taking Fig. 10 (4) as an example, other methods misidentify the motorway as a change region due to illumination interference. Our proposed method achieves better discrimination results because the difference encoder can efficiently extract the difference features associated with the changes of interest, eliminating interfering factors.
4) Experiments on the CDD Dataset: Table IV reports the quantitative comparison results on the CDD dataset. Our EGPNet outperforms other methods in terms of F1, IOU, OA, and $F1_{edge}$. The F1 score improves by 0.56% compared to the suboptimal method (SNUNet). Fig. 11 shows the visual comparison results on the CDD dataset. Our EGPNet can detect intricate change scenarios completely and accurately [e.g., Fig. 11 (1) and (4)] because the encoder can extract semantic-rich features with the parallel encoding framework. For change objects with regular shapes [e.g., Fig. 11 (2)], our method can restore the real shape of the objects accurately with the help of the edge guidance strategy. In particular, Fig. 11 (3) shows the great advantages of the proposed method in capturing small details.
(Caption of Fig. 11: (e) SNUNet [19]. (f) BIT [15]. (g) ISNet [43]. (h) ChangeFormer [28]. (i) EGRCNN [44]. (j) EGCTNet [46]. (k) Ours. (l) Edges generated by EAM. (m) GT of edges. Colors: white for true positive, black for true negative, red for false positive, and green for false negative.)

E. Model Efficiency Analysis
For a comprehensive comparison with other SOTA methods, we implement our EGPNet with different model capacities (the initial number of channels is set to 8/16/24/32/40). We test all methods on a server equipped with an E5-1650 CPU and an RTX 3060 GPU and report the number of parameters (Params), floating point operations (FLOPs), F1 score, and IOU score of the different methods on the LEVIR-CD and SYSU-CD datasets. As shown in Table V, the F1 score of EGPNet-8 reaches 77.53% on the SYSU-CD dataset, outperforming other methods that also use a light backbone (e.g., FC-Siam-conc, FC-Siam-diff, and BIT). EGPNet-16 has fewer parameters and lower computational complexity but achieves a higher F1 score (78.77%) on the SYSU-CD dataset compared to ISNet, SNUNet, and ChangeFormer, demonstrating the efficiency of our proposed method. As the initial number of channels increases, EGPNet-32 achieves the best performance on both the LEVIR-CD and SYSU-CD datasets. EGPNet-40 may suffer from overfitting, leading to a decrease in accuracy.

F. Ablation Experiments
In this part, we perform extensive ablation studies on the LEVIR-CD and SYSU-CD datasets to validate the effectiveness of the parallel encoding framework, SM, and the edge guidance strategy. The following models are set for comparison.
1) BNet: Our base model using single bitemporal features.
2) DNet-di: Our base model using single difference features. It takes the differential image of the T1 and T2 images as input to the difference encoder.
3) DNet-ci: Our base model using single difference features. It takes the channel concatenation of the T1 and T2 images as input to the difference encoder.
4) DNet-dici: The combination of DNet-di and DNet-ci.
5) ParalNet-di: Our parallel model with an SM (the combination of BNet and DNet-di).
6) ParalNet-ci: Our parallel model with an SM (the combination of BNet and DNet-ci).
7) EGPNet: ParalNet-ci + edge guidance strategy.
1) Effect of Different Inputs for the Difference Encoder: In order to find the optimal input to the difference encoder, we try different input forms: difference input (DI), concatenation input (CI), and "DICI." DI is generated by subtraction between the bitemporal images. CI is generated by stacking the bitemporal images in the channel dimension. "DICI" is the concatenation of DI and CI in the channel dimension. Three models, DNet-di, DNet-ci, and DNet-dici, are set for comparison. As shown in Table VI, the experiments show that CI is the best choice for the input of the difference encoder. Images in the input space contain much noise, and subtraction passes the noise of the bitemporal images into the differential image, amplifying it. Therefore, the differential image is inappropriate as the input of the difference encoder.
2) Effect of Different Strategies for SM: In order to find the optimal strategy for SM, using ParalNet-ci as the base model, we consider four different strategies, namely, no supplement (NS), concatenation supplement (CS), multiplication supplement (MS), and addition supplement (AS). As shown in Fig. 12, there is no interaction between the two feature flows in NS. In CS, we use concatenation to supplement the difference feature flow, where a 1 × 1 convolution layer is used to adjust the number of channels to fit $DI^{i+1}$. In MS, we use elementwise multiplication to supplement the difference feature flow. In AS, we use addition to supplement the difference feature flow. As shown in Table VII, the experimental results show that the AS strategy gives the best result.
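The four strategies differ only in the elementwise operation used to combine the two flows; a numpy sketch (the `supplement` helper is hypothetical, and the CS variant omits the trailing 1 × 1 convolution that restores the channel count):

```python
import numpy as np

def supplement(f_d, f_cons, strategy):
    """The four SM variants from the ablation. f_cons = |f_b1 - f_b2| is
    the constructed difference feature; f_d is the difference encoder's
    own output at the same stage."""
    if strategy == "NS":   # no supplement: flows stay separate
        return f_d
    if strategy == "CS":   # concatenation supplement (channels double here)
        return np.concatenate([f_d, f_cons], axis=0)
    if strategy == "MS":   # multiplication supplement
        return f_d * f_cons
    if strategy == "AS":   # addition supplement (best in Table VII)
        return f_d + f_cons
    raise ValueError(f"unknown strategy: {strategy}")
```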
3) Ablation on the Parallel Encoding Framework: As shown in Table VIII, the parallel encoding framework brings consistent improvements in the F1 score when combining different difference encoder input forms on the two datasets. On the SYSU-CD dataset, the improvement in the F1 score is significant: the combination of BNet and DNet-ci improves the F1 score by 1.8% compared to DNet-ci. On the LEVIR-CD dataset, the combination of BNet and DNet-di improves the F1 score by 0.18% compared to BNet. These results indicate the vital importance of the parallel encoding strategy, which can explore the complementary information between bitemporal and difference features. As shown in Fig. 13, models using single features produce many false positives and false negatives [e.g., Fig. 13(c) and (d)]. Benefiting from the feature complementarity, ParalNet-ci can detect change regions entirely and accurately [e.g., Fig. 13(e)].
4) Ablation on the Edge Guidance Strategy: Table IX shows consistent and significant improvements in the F1 score on the LEVIR-CD and SYSU-CD datasets when EAM and EFM are added to ParalNet-ci. The F1 score improves by 0.62% and 1.68% on the two datasets, respectively. This indicates that the introduced edge guidance strategy can guide the representation learning of the fused features, leading to accurate edge detection results with low computational costs. As shown in Fig. 14, the edge guidance strategy can help correct edge errors (see the first row). On the other hand, it can help locate the internal change regions, leading to more intact results (see the second and third rows). Besides, we embed EAM and EFM into other models; as shown in Table X, the F1 score improves significantly, demonstrating the generality of the introduced edge guidance strategy.

G. Network Visualization
To understand our EGPNet intuitively, we visualize the activation maps at feature level 2. Given the bitemporal images, the bitemporal encoder produces the bitemporal features f^2_{b1} and f^2_{b2}, and the difference encoder produces the difference feature f^2_d at feature level 2. We then integrate f^2_{b1}, f^2_{b2}, and f^2_d to produce the fused feature f^2_f. Finally, f^2_f is passed through EFM for edge representation enhancement, producing f^2_{en}. Three representative activation maps are selected for visualization from each of f^2_{b1}, f^2_{b2}, f^2_d, and f^2_{en}. Fig. 15(b) and (c) shows the bitemporal features: they reflect details in the bitemporal images but cannot explicitly reflect the changes. Fig. 15(d) shows the difference features: they reflect the main difference between the bitemporal images but lack many details, which causes missed detections. Fig. 15(e) shows the fused features: by combining the merits of bitemporal and difference features, a relatively complete and accurate picture of the changes is obtained, demonstrating the effectiveness of the parallel encoding framework. Fig. 15(f) shows the features after EFM. Comparing Fig. 15(e) and (f), the features after EFM have stronger and sharper edges, showing that the edge guidance strategy significantly improves the edge representation.
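The per-level data flow described above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' code: feature maps are flat lists, and the fusion and edge-enhancement operations (convolutions in the real network) are replaced by placeholder arithmetic with hypothetical names.

```python
# Placeholder sketch of the level-wise pipeline: fuse the two bitemporal
# features with the difference feature, then enhance edges via EFM.

def fuse_features(f_b1, f_b2, f_d):
    # Channel-wise concatenation keeps the detail cues of the bitemporal
    # features and the change cues of the difference feature; the real
    # network would follow this with convolutions.
    return f_b1 + f_b2 + f_d

def efm_enhance(f_fused, edge_map):
    # Placeholder edge enhancement: re-weight fused responses by an edge
    # map so activations at edge locations are strengthened.
    return [f * (1 + e) for f, e in zip(f_fused, edge_map)]
```

In this toy form, `efm_enhance` doubles the response wherever the edge map fires and leaves non-edge responses unchanged, mirroring the "stronger and sharper edges" seen in Fig. 15(f).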

V. DISCUSSION
In this section, we perform extensive experiments on the LEVIR-CD dataset to find the optimal value of λ and discuss its effect. λ, which controls the proportion of the edge loss in the total loss, has a great influence on the performance of the proposed method. As shown in Figs. 16 and 17, a smaller edge loss weight causes the network to pay too much attention to internal regions, resulting in lower F1_edge and F1 scores, whereas a larger weight causes the network to pay too much attention to edges, which is also detrimental to performance. λ = 0.1 achieves the optimal balance: under this setting, the generated edges are accurate and can well guide the representation learning of the fused features.
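The role of λ can be written as a simple weighted sum. This is a hedged sketch: the function name is illustrative, and the individual loss terms stand in for the paper's actual segmentation and edge losses.

```python
# Total loss as a lambda-weighted combination of the segmentation loss
# and the edge loss (lambda = 0.1 was optimal on LEVIR-CD).

def total_loss(seg_loss, edge_loss, lam=0.1):
    # The edge loss is down-weighted so the network balances attention
    # between internal change regions and their edges.
    return seg_loss + lam * edge_loss
```

With λ too small the edge term barely contributes; with λ too large it dominates, matching the behavior seen in Figs. 16 and 17.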
Benefiting from the proposed parallel encoding framework and edge guidance strategy, our EGPNet achieves higher accuracy than several SOTA methods. However, our work is based on the commonly used Unet [17], which is not the latest semantic segmentation model. Recently, the Vision Transformer (ViT) [54] has shown advantages in CD [15], [28]. In future work, we will apply ViT models to extract bitemporal and difference features more effectively.

VI. CONCLUSION
In this article, we propose an EGPNet for VHR remote sensing image CD. To exploit the detailed information in bitemporal features and the change-related information in difference features, we propose a parallel encoding framework in which an SM enriches the difference feature flow with the bitemporal feature flow. Benefiting from this feature complementarity, EGPNet can detect change regions completely and delineate their details accurately. To enhance the edge representation, we introduce an edge guidance strategy composed of EAM and EFM. Our proposed network outperforms many SOTA methods on the LEVIR-CD, SYSU-CD, and CDD datasets, and the results detected by our EGPNet have more precise and sharper edges.