Robust Change Detection using Channel-Wise Co-Attention-Based Twin Network with Contrastive Loss Function

Change detection methods aim to identify significantly changed areas in co-registered bitemporal images of the same area. Bitemporal images are usually acquired under different environmental conditions (e.g., different weather, noise, and seasonal changes) and may contain changes irrelevant to the purpose of change detection (e.g., road changes when detecting building changes), which should not be detected as changed areas; consequently, change detection methods often suffer from the problem of pseudo-change detection. To alleviate this problem, we propose an encoder-decoder-based twin network (also known as a Siamese network) with a channel-wise co-attention module that considers the channel-wise correlations between a feature map in one image and all feature maps in the other image. By comparing the feature map in one image with the revised feature map in the other image that reflects these correlations, we can reduce the differences between the feature maps when pseudo-changes exist, thereby rendering the proposed method more robust to pseudo-changes. In addition, we apply a contrastive loss function that encourages the pairs of feature maps corresponding to unchanged regions to be similar, which helps improve the performance of change detection. We verified the performance of the proposed method through experiments on the change detection dataset (CDD) and the building change detection dataset (BCDD). In these experiments, the proposed method achieved significantly improved performance compared with existing methods in terms of recall, precision, f1-score, and overall accuracy.


I. INTRODUCTION
Change detection (CD) is the task of identifying changed areas in two co-registered images of the same location acquired at different times [1]. CD methods usually assign a binary label to each pixel in a target image (also called a T1 image) to indicate whether or not the pixel belongs to an area changed from the reference image (also called a T0 image) [2]-[5]. Although identifying changed pixels based on intensity values may seem straightforward, CD is a challenging task due to the existence of pseudo-changes, which should not be detected as genuine changes even though the intensity values of the corresponding pixels are significantly different. For example, pseudo-changes may be generated by environmental changes between the two images, such as illumination changes or seasonal changes, as shown in Fig.1. Fig.1 shows example images from the change detection dataset (CDD) [6]; Figs.1(a) and (b) show bitemporal image pairs with illumination change and seasonal change, respectively. Although the two images are very different in their pixel values, CD should determine that the areas in the two images have not changed if the difference is caused by a pseudo-change [7]. Even more challenging are application-specific pseudo-changes, which can be pseudo or genuine changes depending on the purpose of the CD. For example, in the field of urbanization monitoring, changes related to buildings are genuine changes while changes related to trees are pseudo-changes [2], [3]. On the contrary, in the field of deforestation monitoring, changes related to trees are genuine changes [8].
Many CD methods have been studied using various image processing methods [9]-[16]. Recently, with the successful application of deep learning to computer vision and remote sensing [7], [17]-[21], deep learning-based CD methods have attracted much attention [7], [22]-[29]. Some methods apply deep learning networks to extract feature maps from two images and then calculate the distance between the feature maps to generate a change map [7], [22], [23]. Other methods combine feature maps extracted from two images and then decode the combined feature maps to generate a change map [24], [26]-[28].
Many deep learning-based CD methods use a twin network (also known as a Siamese network) that contains two identical networks that share weights [7], [22], [23], [26]-[28]. Since twin network-based CD methods identify changed areas based on the difference between two images, these methods often suffer from pseudo-changes that may cause large differences in intensity values. To alleviate the problem of pseudo-change detection, some CD methods have applied attention modules that can help obtain more discriminant feature representations by capturing feature dependencies [7], [22], [23], [27], [30]-[32]. Some methods have applied self-attention modules that consider feature dependencies within a single image to better distinguish between changed areas and unchanged areas [7], [30]-[32]. However, for improved robustness to pseudo-changes, we believe that feature dependencies between bitemporal images should be considered rather than dependencies within a single image. Although several CD methods have applied co-attention modules that consider spatial-wise feature dependencies between bitemporal images, they focus more on reducing errors caused by misregistration than on reducing pseudo-change detection [22], [27]. Inspired by the above-mentioned attention modules, we propose a channel-wise co-attention-based twin network for CD, which we expect to be more robust to pseudo-changes than existing methods. The channel-wise co-attention module considers channel-wise feature dependencies between bitemporal images. The idea behind the proposed method is that if an area belongs to a pseudo-change area, there may exist a similar feature map in the other image, even though the map may be from a different channel. Although a direct comparison between feature maps can result in large differences, if we find a similar feature map in the other image and compute the differences with that similar feature map, the differences will be small.
Based on this idea, we believe that the proposed method can help reduce the detection of pseudo-changes. In addition, to improve the performance of CD, we apply a contrastive loss function that encourages the distance between features from unchanged regions to be small and the distance between features from changed regions to be larger than a specific margin.
In this paper, we quantitatively demonstrate that the proposed method can improve the performance of CD. The proposed method shows superior performance compared with existing methods for two open datasets: the change detection dataset (CDD) [6] and building change detection dataset (BCDD) [33].
The remainder of this paper is organized as follows. In Section II, we review related studies. We explain the proposed method in detail in Section III. The experimental results and conclusions are presented in Sections IV and V, respectively.

II. RELATED WORK

A. TWIN NETWORK-BASED CHANGE DETECTION
A twin network is a neural network that contains two subnetworks [19] and uses two images for input. The network extracts features from the two images in parallel using each sub-network and then considers the difference between the extracted features for image comparison [19]. By sharing the weights of the sub-networks, a twin network can identify whether similar features exist, thereby comparing the two images more effectively.
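To make the weight-sharing idea concrete, the following NumPy sketch models a twin network as a single shared encoder applied to both inputs. The one-layer "encoder" and all tensor sizes are illustrative placeholders, not the architecture used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: both branches of a twin (Siamese) network use the SAME
# parameters, so the same input always maps to the same features.
W = rng.standard_normal((8, 4))  # toy "encoder": one linear layer

def encode(x):
    """Shared sub-network applied to either input image."""
    return np.maximum(x @ W, 0.0)  # linear layer + ReLU

x0 = rng.standard_normal((1, 8))  # reference (T0) input
x1 = rng.standard_normal((1, 8))  # target (T1) input

f0, f1 = encode(x0), encode(x1)
diff = np.abs(f0 - f1)            # feature difference used for comparison

# Because the weights are shared, re-encoding the same input reproduces
# the same features, so identical inputs yield a zero difference map.
assert np.allclose(np.abs(encode(x0) - f0), 0.0)
```

In a real CD network the shared encoder is a deep convolutional network, but the comparison principle is the same: features are comparable precisely because both images pass through identical weights.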
Although several CD methods such as the fully convolutional early fusion network (FC-EF) [24] and the boundary-aware attentive network (BA2Net) [25] apply U-Net [18], recently, much research has been focused on twin network-based CD methods. Some methods extract feature maps from bitemporal images using a twin network and then calculate the distance between the feature maps to generate a change map. These methods include a dual attentive fully convolutional siamese network (DASNet) [7], a spatial-temporal attention network (STANet) [22], and a deeply supervised attention metric network (DSAMNet) [23]. DASNet [7] uses self-attention modules to obtain more discriminant feature representations and attempts to reduce the detection of pseudo-changes. In STANet [22], a basic spatial-temporal attention module (BAM) and a pyramid spatial-temporal attention module (PAM) are used to obtain illumination-invariant and misregistration-robust features. DSAMNet [23] uses a combined attention module to recognize pseudo-changes. Other methods combine feature maps extracted from bitemporal images using a twin network and then decode the combined feature maps to generate a change map. These methods include a fully convolutional siamese network with concatenation skip connections (FC-Siam-Conc) [24], a fully convolutional siamese network with difference skip connections (FC-Siam-Diff) [24], and a pyramid feature-based attention-guided siamese network (PGA-SiamNet) [27].

B. SELF-ATTENTION MODULE
Recently, some CD methods have applied self-attention modules to improve CD performance [7], [22], [30]-[32]. The self-attention module considers feature dependencies within a single image. For a feature vector at one position, the self-attention module calculates the correlations between the feature vector and all other feature vectors in the same image to generate a refined feature vector, which is computed as the linear combination of all other feature vectors using the calculated correlations [7], [21], [22], [30]-[32], [34], [35]. Since the refined feature vector may reflect features of objects belonging to the same category but with different appearances, using a refined feature vector may render CD methods robust to pseudo-changes. However, since the features of one image are compared with the features of the other image in CD tasks, it is more effective to generate refined feature vectors using feature vectors of the other image to reduce pseudo-change detection. Based on this idea, studies on spatial-wise co-attention modules have been conducted [22], [27].

C. SPATIAL-WISE CO-ATTENTION MODULE
As mentioned above, spatial-wise co-attention modules may help reduce the detection of pseudo-changes. In addition, spatial-wise co-attention modules can be helpful for reducing errors due to misregistration because a feature vector in an image is compared with a similar feature vector in the other image regardless of position [22], [27]. However, we think that the performance improvement provided by a spatial-wise co-attention module is limited because misregistration is minimal for most CD methods, since co-registered bitemporal images are used. In terms of pseudo-change detection, because the module computes the refined feature vector using similar feature vectors in the other image, if the same structure in the other image generates different feature vectors due to pseudo-changes, these pseudo-changes may be detected as genuine changes since the feature vectors are different. To alleviate this problem, we believe that the inclusion of a channel-wise co-attention module is beneficial because it considers channel-wise correlations between bitemporal images, since similar feature maps may be present in different channels due to pseudo-changes. Based on this observation, we propose a novel CD method with channel-wise co-attention, which is explained in Section III.

III. PROPOSED METHOD
In this section, we present our channel-wise co-attention-based CD method. We first explain the overall network architecture and a novel channel-wise co-attention module for CD in detail. Finally, we describe the total loss function for the proposed method.

A. NETWORK ARCHITECTURE
We propose an encoder-decoder-based twin network with a channel-wise co-attention module. Fig.2 shows the structure of the proposed network, which consists of an encoder network with an atrous spatial pyramid pooling (ASPP) module [36] to extract high-level features without compromising spatial resolution, the channel-wise co-attention module, and a decoder network. As shown in Fig.2, the encoder extracts feature maps from two images in parallel using the same convolution layers and merges the extracted feature maps to use them as input to the decoder. Then, the decoder determines whether or not a pixel belongs to a changed area using the convolution layers with the merged feature maps and feature maps from the co-attention module, which is explained later. As one may intuitively expect, the feature maps of the changed areas in the two images may have large differences, while those of the unchanged areas show only moderate differences as long as there are no pseudo-changes. However, if pseudo-changes exist, the corresponding feature maps can be significantly different because the intensity values for the areas can be significantly different.
To reduce the problem of pseudo-change detection, we apply channel-wise co-attention modules to the outputs of the 3rd and 4th convolutional blocks of the encoder and transfer the output of each co-attention module to the corresponding convolutional blocks of the decoder in the form of skip connections. Given a bitemporal image pair T0 and T1, we denote the feature maps extracted from T0 and T1 by F_0 ∈ R^(H×W×C) and F_1 ∈ R^(H×W×C), respectively, where H is the height, W is the width, and C is the number of channels. The channel-wise co-attention module computes the correlations between the i-th feature map in one image and all the feature maps in the other image and then uses the correlations as weights for computing a linear combination of the feature maps in the other image. We use the linear combination as the generated feature map, expecting that the generated feature map is similar to the i-th feature map in one image even when pseudo-changes exist. The intuition behind this expectation is that similar feature maps may exist in different channels of the other image, since the different characteristics of the two images may show similar shapes under pseudo-changes. For example, if two images are acquired under different lighting conditions, features from different color filters may have similar shapes.
The aforementioned operation of the channel-wise co-attention module is implemented as shown in Fig.3. First, we compute the affinity matrix S ∈ R^(C×C) between F_0 and F_1 using the cosine similarity as follows:

s_ij = (F̄_0^(j) · F̄_1^(i)) / (‖F̄_0^(j)‖ ‖F̄_1^(i)‖),    (1)

where s_ij is the j-th element in the i-th column of the affinity matrix S ∈ R^(C×C) and represents the degree of similarity between the j-th feature map of F_0 and the i-th feature map of F_1. F̄_0 ∈ R^(M×C) and F̄_1 ∈ R^(M×C) are reshaped feature maps from F_0 and F_1, respectively, M is the product of H and W, F̄_0^(j) denotes the j-th column of F̄_0, and F̄_1^(i) denotes the i-th column of F̄_1.
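The affinity matrix S described above can be computed with NumPy as below. The tensor sizes are toy values, and the reshape convention (each column of the reshaped map holding one flattened channel) is our assumption for illustration.

```python
import numpy as np

H, W_, C = 4, 4, 3
M = H * W_
rng = np.random.default_rng(1)

# Toy bitemporal feature maps F_0, F_1 of shape H x W x C.
F0 = rng.standard_normal((H, W_, C))
F1 = rng.standard_normal((H, W_, C))

# Reshape to M x C so that each column is one flattened feature map
# (one channel), with M = H * W.
F0_bar = F0.reshape(M, C)
F1_bar = F1.reshape(M, C)

# s_ij: cosine similarity between the j-th channel of F_0 and the
# i-th channel of F_1, so column i of S holds the similarities of all
# F_0 channels to the i-th F_1 channel.
norm0 = np.linalg.norm(F0_bar, axis=0)  # per-channel norms of F_0
norm1 = np.linalg.norm(F1_bar, axis=0)  # per-channel norms of F_1
S = (F0_bar.T @ F1_bar) / np.outer(norm0, norm1)

assert S.shape == (C, C)
# Cosine similarities are bounded by 1 in absolute value.
assert np.all(np.abs(S) <= 1.0 + 1e-9)
```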
Next, given the two reshaped feature maps F̄_0 and F̄_1, the channel-wise co-attention module generates modified feature maps considering the channel-wise correlations with feature maps from the other image as follows:

F̂_0 = F̄_0 S,    (2)

where F̂_0 represents the revised feature maps from F̄_0. F̂_0^(i) indicates the i-th column of F̂_0 and reflects the correlations between the i-th feature map of F_1 and all feature maps of F_0. Similarly, the revised feature maps F̂_1 can be computed by F̂_1 = F̄_1 Sᵀ. For a feature map in one image, if similar feature maps exist in the other image, the difference between the same-channel feature maps may be large, but the difference between the feature map in one image and the similar feature maps in the other image can be small enough to avoid detection of the pseudo-changes. Based on this, we consider the difference between the feature maps in one image and the feature maps obtained from the channel-wise co-attention as follows:

D_att = f_g([abs(F_0 − F̃_1); abs(F̃_0 − F_1)]),    (3)

where F̃_0 ∈ R^(H×W×C) and F̃_1 ∈ R^(H×W×C) are reshaped feature maps from F̂_0 and F̂_1, respectively, [;] denotes the concatenation operation, and f_g is a 1 × 1 convolution layer in which the number of filters is K. In addition to D_att, we also use the same-channel difference between the feature maps F_0 and F_1 to prevent the proposed method from missing the detection of changed areas as follows:

D = abs(F_0 − F_1).    (4)

We combine the difference maps obtained from Equations (3) and (4) to reduce the detection of pseudo-changes without compromising the performance of detecting genuine changes. The optimal combination is determined during training using trainable weights as follows:

D_out = f_h([D_att; D]),    (5)

where f_h is a 1 × 1 convolution layer in which the number of filters is L. The combined difference maps D_out are transferred to the corresponding decoder layer in the form of skip connections.
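Putting the revised feature maps and the difference maps D_att and D_out together, a minimal NumPy sketch of the module might look as follows. The matrix orientation of the revised maps and the modeling of the 1 × 1 convolutions f_g and f_h as per-pixel matrix multiplications with random (untrained) weights are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W_, C, K = 4, 4, 3, 3
M = H * W_

F0 = rng.standard_normal((H, W_, C))
F1 = rng.standard_normal((H, W_, C))
F0_bar, F1_bar = F0.reshape(M, C), F1.reshape(M, C)

# Channel-wise affinity via cosine similarity (C x C).
S = (F0_bar.T @ F1_bar) / np.outer(np.linalg.norm(F0_bar, axis=0),
                                   np.linalg.norm(F1_bar, axis=0))

# Revised maps: column i of F0_hat is a linear combination of F_0's
# channels weighted by their similarity to F_1's i-th channel, and
# symmetrically for F1_hat.
F0_hat = F0_bar @ S        # M x C, aligned with F_1's channel order
F1_hat = F1_bar @ S.T      # M x C, aligned with F_0's channel order

F0_tilde = F0_hat.reshape(H, W_, C)
F1_tilde = F1_hat.reshape(H, W_, C)

# Attention-based difference: concatenate the two cross differences and
# mix channels with a "1x1 convolution", which on an H x W x 2C tensor
# amounts to a (2C x K) matrix applied at every pixel.
Wg = rng.standard_normal((2 * C, K))   # hypothetical f_g weights
D_cat = np.concatenate([np.abs(F0 - F1_tilde),
                        np.abs(F0_tilde - F1)], axis=-1)
D_att = D_cat @ Wg

# Plain same-channel difference.
D = np.abs(F0 - F1)

# Combined difference via another "1x1 convolution" f_h (here L = C).
Wh = rng.standard_normal((K + C, C))   # hypothetical f_h weights
D_out = np.concatenate([D_att, D], axis=-1) @ Wh

assert D_out.shape == (H, W_, C)
```

In the trained network, Wg and Wh are learned jointly with the rest of the model, which is what lets the combination of D_att and D be optimized during training.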

B. LOSS FUNCTION
We also incorporate a loss function to reduce the detection of pseudo-changes. In addition to the usual cross-entropy loss function for labeled data, we add a contrastive loss function that encourages the difference between features from the unchanged areas of the two images to be small while enforcing the distance between features from the changed areas to be larger than a specific margin. The contrastive loss function was shown to be effective in improving the performance of CD in previous investigations [7], [22]. We compute the contrastive loss function using the outputs of the encoder as

L_cont = (1 / (H·W)) Σ_(i,j) [(1 − y_ij) · w · d_ij² + y_ij · (1 − w) · max(m − d_ij, 0)²],    (6)

where d_ij is the distance between the feature vectors of F_0 and F_1 at position (i, j) and m is the margin for changed feature pairs. w is used to balance the weights of the two terms in Equation (6), and y_ij is the label at position (i, j). We set the label in the changed areas to 1 and the label in the unchanged areas to 0. Therefore, the first term is zero in the changed areas while the second term is zero in the unchanged areas. By minimizing the loss function, the distance between the features from the unchanged areas is driven toward zero because the first term of Equation (6), w·d_ij², is minimized. On the contrary, the distance between the features from the changed areas is driven to be larger than the margin m because the second term of Equation (6), (1 − w)·max(m − d_ij, 0)², is minimized when d_ij is larger than m.
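A minimal NumPy implementation of the contrastive loss in Equation (6) could look like this. Averaging over pixels is our assumption, and the default values m = 2.0 and w = 0.4 follow those reported in the implementation details.

```python
import numpy as np

def contrastive_loss(d, y, m=2.0, w=0.4):
    """Pixel-wise contrastive loss.

    d : distances between encoder features of T0 and T1 (H x W)
    y : labels, 1 = changed, 0 = unchanged (H x W)
    m : margin for changed feature pairs
    w : balance weight between the two terms
    """
    # First term: pulls unchanged-pair distances toward zero.
    unchanged = (1.0 - y) * w * d ** 2
    # Second term: pushes changed-pair distances beyond the margin m.
    changed = y * (1.0 - w) * np.maximum(m - d, 0.0) ** 2
    return float(np.mean(unchanged + changed))

# Unchanged pixels at zero distance and changed pixels beyond the
# margin both incur zero loss.
d = np.array([[0.0, 3.0]])
y = np.array([[0.0, 1.0]])
assert contrastive_loss(d, y) == 0.0
```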
The total loss function of the proposed method is defined as

L = L_ce + λ·L_cont,    (7)

where L_cont is the contrastive loss function, L_ce is the weighted cross-entropy loss function between the prediction and the ground truth, and λ is the weight between the two losses.

IV. EXPERIMENT
To verify the effectiveness of the proposed method, we compared the performance of the proposed method with that of conventional CD methods such as FC-EF [24], FC-Siam-Conc [24], FC-Siam-Diff [24], DASNet [7], and STANet [22]. We conducted experiments involving the detection of changes in the well-known CDD [6] and BCDD [33] datasets, implementing the aforementioned methods.

A. DATASETS

1) CDD Dataset
The CDD dataset is a remote sensing change detection dataset that is open to the public [6]. The dataset contains 11 full-size pairs of season-varying images, of which 7 image pairs are 2,700 × 4,725 pixels and 4 image pairs are 1,000 × 1,900 pixels [6]. The spatial resolutions of the images in the dataset range from 3 cm to 100 cm per pixel. In [6], the original image pairs are cropped into images of 256 × 256 pixels to generate a cropped dataset that contains 10,000 images for training, 3,000 images for test, and 3,000 images for validation. We used the cropped datasets for this investigation. Fig.4 shows example images of the cropped CDD dataset; Figs.4(a) and (b) show bitemporal image pairs, and Fig.4(c) shows the ground truth images.

2) BCDD Dataset
The BCDD dataset covers an area that was rebuilt after the occurrence of a 6.3-magnitude earthquake in February 2011 [33]. The main purpose of the dataset is to detect changes related to buildings before and after the earthquake. Because the dataset contains only one image pair, the size of which is 15,354 × 32,507 pixels [33], we cropped the original image into small images with a size of 256 × 256 pixels for deep learning. We divided the cropped images into a training set, validation set, and test set with a ratio of 8:1:1.
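The tiling and 8:1:1 split described above can be sketched as follows. The non-overlapping cropping that discards partial edge tiles and the random shuffling are our assumptions, so the exact counts need not match those reported for the BCDD dataset.

```python
import numpy as np

def crop_tiles(image, tile=256):
    """Crop a large image into non-overlapping tile x tile patches,
    discarding any partial tiles at the right/bottom edges."""
    h, w = image.shape[:2]
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def split_8_1_1(items, seed=0):
    """Shuffle and split items into train/val/test with an 8:1:1 ratio."""
    idx = np.random.default_rng(seed).permutation(len(items))
    n_val = n_test = len(items) // 10
    n_train = len(items) - n_val - n_test
    train = [items[i] for i in idx[:n_train]]
    val = [items[i] for i in idx[n_train:n_train + n_val]]
    test = [items[i] for i in idx[n_train + n_val:]]
    return train, val, test

# A 1024 x 768 image yields a 4 x 3 grid of 256 x 256 tiles.
tiles = crop_tiles(np.zeros((1024, 768)))
assert len(tiles) == 12
```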
The cropped dataset contains 6,096 images for training, 762 images for test, and 762 images for validation. Fig.5 shows example images of the cropped BCDD dataset, Fig.5(a) and (b) show bitemporal image pairs, and Fig.5(c) shows the ground truth images.

B. IMPLEMENTATION DETAILS
The encoder of the proposed network was designed based on VGG16 [37]. We used the first four convolutional blocks of VGG16 as the first four convolutional blocks of our encoder. Additionally, we used the weights of the pre-trained VGG16 with the ImageNet data [38] as the initial weights of the modules. We set the number of filters K in Equation (3) to be the same as the number of channels C of the input for the channel-wise co-attention module, the number of filters L in Equation (5) to C, and the balance weight w for the contrastive loss function and the margin m in Equation (6) to 0.4 and 2.0, respectively. The balance weight λ between the two losses in Equation (7) was set to 0.1 for both the CDD and BCDD datasets.
We implemented DASNet and STANet using the PyTorch [39] codes provided by the authors without modifying the network structures [7], [22]. We implemented FC-EF, FC-Siam-Conc, FC-Siam-Diff, and the proposed method using the TensorFlow2 library [40]. We trained the proposed method using the Adam optimizer with a fixed learning rate of 1 × 10^-4 on an NVIDIA TITAN XP graphics card. We set the batch size to 8, the maximum number of epochs to 200, the weight decay to 1 × 10^-4, and the patience for early stopping to 30. In addition, we did not perform data augmentation for the CDD or BCDD datasets.

C. PERFORMANCE METRICS
To evaluate the performance of the CD methods, we analyzed the precision, recall, f1-score, and overall accuracy. The precision is defined as

P = TP / (TP + FP),    (8)

where TP is the number of true positives and FP is the number of false positives. The precision is the ratio of the number of pixels correctly classified as changed pixels to the number of pixels detected as changed pixels. We define the recall as

R = TP / (TP + FN),    (9)

where FN is the number of false negatives. The recall is the ratio of the number of pixels correctly classified as changed pixels to the total number of actually changed pixels. We define the f1-score as

F = 2PR / (P + R),    (10)

where F is the f1-score, P is the precision, and R is the recall. We define the overall accuracy as

OA = (TP + TN) / (TP + TN + FP + FN),    (11)

where OA is the overall accuracy and TN is the number of true negatives. We compute all the metrics in pixel units in this investigation.
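The four metrics can be computed directly from the confusion-matrix counts, as in this small sketch (the counts in the usage line are arbitrary examples):

```python
def cd_metrics(tp, fp, fn, tn):
    """Pixel-level precision, recall, f1-score, and overall accuracy
    computed from confusion-matrix counts."""
    precision = tp / (tp + fp)          # correct changed / detected changed
    recall = tp / (tp + fn)             # correct changed / actually changed
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, oa

# Example: 50 true positives, 50 false positives, 50 false negatives,
# and 850 true negatives out of 1,000 pixels.
p, r, f, oa = cd_metrics(tp=50, fp=50, fn=50, tn=850)
assert (p, r, f, oa) == (0.5, 0.5, 0.5, 0.9)
```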

D. RESULTS

1) CDD Dataset
We present the precision, recall, f1-score, and overall accuracy of each method for the CDD dataset in Table 1.
As shown in the table, the proposed method achieves the best performance with the highest recall (95.67%), precision (96.06%), f1-score (95.86%), and overall accuracy (98.95%). The recall, precision, f1-score, and overall accuracy are approximately 2.23%, 5.91%, 4.10%, and 1.09%, respectively, higher than those of STANet, which achieves the second-best performance. The proposed method demonstrates a significant performance improvement compared with other methods in terms of both recall and precision. We believe that the reason for the improvement in the precision is that the proposed method may reduce the differences in unchanged areas by comparing a feature map in one image with similar feature maps in the other image. In addition, we think that the proposed method achieves the best performance without compromising the probability of detection because the channel-wise co-attention module also considers the difference between the feature maps from the two images, as shown in Equation (5).
To further verify the effectiveness of the channel-wise co-attention module, we compared the absolute difference between the feature maps from the two images (i.e., abs(F_0 − F_1)) with the absolute differences between the feature maps from one image and the feature maps generated by the attention module (i.e., abs(F_0 − F̃_1) and abs(F̃_0 − F_1)). If the channel-wise co-attention module is effective in reducing differences between the feature maps from the unchanged areas, the absolute differences between F_0 and F̃_1 and between F̃_0 and F_1 will have smaller values than the absolute difference between F_0 and F_1. Fig.6 shows the absolute difference maps averaged in the channel direction for visualization. As can be observed, for the unchanged areas, the absolute difference maps between F_0 and F̃_1 and between F̃_0 and F_1 have smaller values compared with the absolute difference map between F_0 and F_1. From these results, we can confirm that the channel-wise co-attention module is effective in alleviating the problem of pseudo-change detection.
In addition, we think that the contrastive loss function is effective in reducing pseudo-change detection because it forces the features from the unchanged areas of the two images to be similar. The contrastive loss function was also used in previous methods such as STANet and DASNet [7], [22], which may explain why STANet and DASNet performed better than the other conventional methods.
For a more intuitive evaluation, Fig.7 compares the detection results of the proposed method with those of the other CD methods. Figs.7(a) and (b) show bitemporal image pairs, Fig.7(c) shows the ground truth images, and Figs.7(d), (e), (f), (g), (h), and (i) show the detection results of FC-EF, FC-Siam-Diff, FC-Siam-Conc, DASNet, STANet, and the proposed method, respectively. As can be observed, the result of the proposed method is the most similar to the ground truth, which means the proposed method is more robust to environmental pseudo-changes than the other methods.

2) BCDD Dataset
To further evaluate the performance of the proposed method, we also conducted experiments using the BCDD dataset. We report the precision, recall, f1-score, and overall accuracy in Table 2. As shown in Table 2, the proposed method also achieves the best performance with the highest recall (91.11%), precision (95.03%), f1-score (93.03%), and overall accuracy (99.34%). The recall, precision, f1-score, and overall accuracy are approximately 1.77%, 6.15%, 3.92%, and 0.4%, respectively, higher than those of DASNet, which achieves the second-best performance. In particular, the proposed method demonstrates a significant performance improvement compared with other methods in terms of precision, which implies that the proposed method is effective in reducing the detection of pseudo-changes.
We also examined whether the proposed method can reduce the differences between feature maps from unchanged regions of two images with environmental changes. Fig.8 compares the absolute difference between the feature maps from the two images (i.e., abs(F_0 − F_1)) with the absolute differences between the feature maps from one image and the feature maps generated by the attention module (i.e., abs(F_0 − F̃_1) and abs(F̃_0 − F_1)). As shown in Figs.8(d), (e), and (f), for the unchanged areas, the absolute differences between F_0 and F̃_1 and between F̃_0 and F_1 have smaller values than the absolute difference between F_0 and F_1. From this figure, we can observe that the proposed method reduces the difference between feature maps from unchanged areas, which is consistent with the experiments using the CDD dataset. Fig.9 illustrates the prediction results of each method on the BCDD dataset. Figs.9(a) and (b) show bitemporal images, Fig.9(c) shows the ground truth images, and Figs.9(d), (e), (f), (g), (h), and (i) show the detection results of FC-EF, FC-Siam-Diff, FC-Siam-Conc, DASNet, STANet, and the proposed method, respectively. From this figure, we also confirm that the proposed method is effective in reducing the detection of pseudo-changes.

V. CONCLUSION
In this study, we propose a channel-wise co-attention-based twin network to detect changes between high-resolution bitemporal images acquired at different times for the same location. Compared with existing change detection methods, the proposed method is more robust to pseudo-changes caused by different imaging conditions and/or changes irrelevant to the purpose of change detection. The key element of the proposed method for reducing pseudo-change detection is the channel-wise co-attention module, which considers channel-wise correlations between one feature map in an image and the feature maps of the other image to find similar feature maps in the other image. By comparing the feature map in one image with the combination of similar feature maps in the other image instead of comparing the same-channel feature maps, the proposed method reduces the detection of pseudo-changes. In addition, the contrastive loss function of the proposed method encourages the features of the two images from unchanged areas to be more similar, thereby facilitating the correct identification of unchanged areas and further alleviating the problem of pseudo-change detection. We verified that the proposed method is more robust to pseudo-changes than conventional methods through experiments using the change detection dataset (CDD) [6] and the building change detection dataset (BCDD) [33]. The proposed method achieves a significant performance improvement compared with existing methods in terms of both recall and precision, as demonstrated in the experiments.