SAUNet3+CD: A Siamese-Attentive UNet3+ for Change Detection in Remote Sensing Images

With the development of various optical sensors, change detection has become one of the most actively researched topics for remotely sensed imagery with high spatial resolution. In particular, deep learning-based change detection techniques are important in fields such as land monitoring and disaster analysis because they can outperform traditional unsupervised and supervised change detection methods. This manuscript proposes a Siamese-attentive UNet3+ for change detection (SAUNet3+CD) of multitemporal imagery with high spatial resolution. The existing UNet3+ was modified to a Siamese-based architecture, and a spatial and channel attention module was added to detect various changed areas. The proposed model was trained to effectively detect building growth and decay through the data augmentation of open datasets and a hybrid loss function. In experiments using two open datasets, the proposed deep learning model detected changed areas in multitemporal images more effectively than various other methods, such as existing Siamese-based networks and a network for semantic segmentation.

The associate editor coordinating the review of this manuscript and approving it for publication was John Xun Yang.

tical techniques [2], [3]. Traditional change detection algorithms can be classified as unsupervised or supervised methods. Generally, unsupervised change detection algorithms extract changed regions by applying thresholding to a feature map using various image processing or statistical techniques, such as image differencing, image ratioing, spectral transformation, and machine learning [4]. Change vector analysis (CVA) is a representative unsupervised change detection algorithm. CVA extracts the magnitude and direction of a changed area by differencing the corresponding channels of multitemporal images. Compressed CVA (C²VA) transforms the magnitude and direction of the image difference into the polar domain, and 2-D adaptive spectral change vector representation (ASCVR) was developed using an automatic definition of a reference vector and analysis of interactive manual change identification [5], [6]. Iteratively reweighted multivariate alteration detection (IR-MAD) based on canonical correlations has been utilized in various change detection algorithms [7], [8]. General unsupervised change detection algorithms do not require training data and have the advantage that they can be configured as an automated process. However, these algorithms have a disadvantage in that it can be challenging to analyze aspects of changes between classes of changed areas. Especially, to determine changed regions,

mentation based on encoder-decoder architecture have been conducted to detect changes in remote sensing data.

[23] modified the UNet++ model using an attention module and image differencing architecture. Chen et al. [24] utilized a transformer-based model composed of a CNN backbone, bitemporal image transformer, and prediction head for change detection.
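The change vector analysis described above can be illustrated with a minimal sketch. It assumes each pixel is a tuple of at least two band values, defines the direction as the angle between the first two band differences (one common two-band convention, not necessarily the one used in the cited works), and uses a hypothetical fixed threshold on the magnitude.

```python
import math

def change_vector(pixel_t1, pixel_t2):
    """Return (magnitude, direction) of the spectral change for one pixel.

    Assumption: direction is the angle of the first two band differences.
    """
    diff = [b2 - b1 for b1, b2 in zip(pixel_t1, pixel_t2)]
    magnitude = math.sqrt(sum(d * d for d in diff))
    direction = math.atan2(diff[1], diff[0])
    return magnitude, direction

def cva_change_mask(image_t1, image_t2, threshold):
    """Flag a pixel as changed when its change magnitude exceeds the threshold."""
    return [change_vector(p1, p2)[0] > threshold
            for p1, p2 in zip(image_t1, image_t2)]

# Two pixels with two bands each; only the second pixel changes.
before = [(10, 10), (10, 10)]
after = [(10, 10), (13, 14)]
mask = cva_change_mask(before, after, threshold=2.0)
```

In practice the threshold is usually derived from the magnitude histogram rather than fixed by hand; the constant here only keeps the sketch short.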

In particular, Siamese networks, which have two weight-shared branches, have been used to increase the performance of CNNs for supervised change detection. Daudt et al. [25]

[29] developed a fully convolutional Siamese autoencoder for change detection. Guo et al. [30] proposed a deep multiscale Siamese network using a parallel convolutional structure (PCS) and a self-attention module to integrate and generate optimal image features at various scales. Choi and Kim [31] proposed a channel-wise co-attention module and contrastive loss function to make an optimal Siamese network for change detection. A Siamese network and nested UNet for change detection (SNUNet-CD) is a Siamese-based modified UNet++ version using an ensemble channel attention module [32]. Chen and Shi applied a spatial-temporal attention module to a Siamese-based network and evaluated a new dataset for change detection for remote sensing images [33].

Most CNN-based deep learning models for change detection have used U-shaped encoder-decoder models, such as UNet. In particular, studies on CNNs based on a Siamese network that shares weights within the CNN have been conducted. Various attention modules have been analyzed to improve the performance of CNN models. Therefore, research to create an optimal model for change detection by adding an attention module to a UNet-type architecture has been important in remote sensing. On the other hand, most change detection studies use open datasets such as the LEVIR-CD, WHU, and SYSU-CD datasets [33], [34], [35]. However, since most open datasets focus mainly on change patterns caused by the creation of buildings, it is difficult to verify whether the developed deep learning models can effectively reflect changes in the actual terrain.
Therefore, in this manuscript, we propose a CNN for change detection by modifying UNet3+. UNet3+ was converted to a Siamese-based architecture, and a spatial and channel attention module was added so that the proposed model could detect changes in various building objects. In addition, we evaluate a deep learning model that can effectively detect the creation/destruction of buildings through the augmentation of open datasets and hybrid loss functions. Our contribution in remote sensing fields is to develop a novel and optimal deep learning architecture for change detection with UNet3+ as the backbone.

This manuscript is organized as follows. The background that supports our methodology is described in Section 2.

As shown in Fig. 1, UNet, which was developed for semantic segmentation, has a representative encoder-decoder structure [36]. The encoder, which is called the contracting path,

within the network. In addition, in UNet++, because skip connections of the nested network structure are performed, various semantic segmentation products having the same size as the input data can be obtained, and the final product can be obtained through their integration. Therefore, it has the advantage of effectively utilizing each stage's spatial and spectral information, and it is possible to add the characteristics of deep supervision. Fig. 2 shows the structure of UNet++.

C. UNET3+
UNet3+, with an encoder-decoder architecture similar to a general network for semantic segmentation, was developed to improve the performance of UNet and UNet++ [39]. UNet++ can improve the performance through skip connections of the nested network structure to reduce the semantic gap. However, it has the disadvantage that it cannot identify sufficient feature information from the entire input data. In addition, as the spatial information of the feature map decreases as the encoder level increases, it is necessary to utilize the spatial information remaining in the feature map at a lower level. To this end, in UNet3+, the loss of spatial information is minimized through interconnection and intraconnection, and sufficient spatial and spectral characteristics of the entire input data can be identified. In addition, since it utilizes all of the preceding feature maps through interconnection and intraconnection, it is possible to reduce computational costs compared to UNet++, enabling more

On the other hand, the general UNet3+ has a structure in which learning is carried out by simply passing the feature map obtained through the encoder process to the decoder by concatenation. To improve the performance of UNet3+ in change detection, a change attention module is required at each stage of the encoder because feature maps need to be more focused on the spatial and spectral information of a changed area. This study used a convolutional block attention module (CBAM) for the change attention module.
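To make the role of CBAM concrete, the following is a minimal pure-Python sketch of its two steps, channel attention followed by spatial attention, on a small feature map stored as nested lists (channels × rows × columns). The weights `w1`, `w2`, and `kernel` are hypothetical placeholders that a real network would learn, and the sketch does not reproduce how the module is wired into the SAUNet3+CD encoder.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vec, w1, w2):
    """Shared two-layer perceptron with ReLU (w1: hidden x C, w2: C x hidden)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feat, w1, w2):
    """One weight per channel from average- and max-pooled descriptors."""
    avg = [sum(sum(r) for r in ch) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(max(r) for r in ch) for ch in feat]
    return [sigmoid(a + m) for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]

def spatial_attention(feat, kernel):
    """One weight per pixel from a k x k convolution over the channel-wise
    average and max maps (kernel: 2 x k x k, zero padding)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    avg = [[sum(feat[n][i][j] for n in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(feat[n][i][j] for n in range(c)) for j in range(w)] for i in range(h)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for plane, kern in zip((avg, mx), kernel):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            s += kern[di][dj] * plane[y][x]
            out[i][j] = sigmoid(s)
    return out

def cbam(feat, w1, w2, kernel):
    """Rescale the feature map by channel attention, then by spatial attention."""
    ca = channel_attention(feat, w1, w2)
    feat = [[[v * ca[n] for v in row] for row in ch] for n, ch in enumerate(feat)]
    sa = spatial_attention(feat, kernel)
    return [[[v * sa[i][j] for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in feat]
```

The output has the same shape as the input, which is what allows such a module to be placed after an encoder stage without changing the rest of the network.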
where α is a weighting factor, p_t is the estimated probability of the class, and γ is a tunable focusing parameter. In this study, we use α = 0.25 and γ = 2, chosen through trial and error. Dice loss, which is used in the medical image segmentation field, is applied to address the imbalanced dataset problem [45]. The equation of Dice loss is as follows:

where p_i is the reference data, p̂_i is the value predicted by the network, and N is the number of samples. When the reference data and the predicted values are similar to each other, the general Dice coefficient has a value close to 1. Therefore, the Dice loss is defined to converge to a value close to 0, as in equation (2). The final loss function was constructed as in equation (3) to minimize the imbalanced data problem through integration of focal and Dice loss.
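As a rough illustration of such a hybrid loss, the sketch below assumes the standard binary focal loss, -α_t (1 - p_t)^γ log(p_t), a common soft Dice form, 1 - 2Σ p_i p̂_i / (Σ p_i + Σ p̂_i), and an unweighted sum of the two. The exact forms and weighting of equations (1) to (3) in this manuscript may differ, and the smoothing constant `eps` is an added assumption for numerical stability.

```python
import math

def focal_loss(targets, probs, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; targets are 0/1, probs are P(changed)."""
    total = 0.0
    for y, p in zip(targets, probs):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        p_t = min(max(p_t, eps), 1.0)           # avoid log(0)
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(targets)

def dice_loss(targets, probs, eps=1e-7):
    """1 - Dice coefficient; close to 0 when prediction matches reference."""
    intersection = sum(y * p for y, p in zip(targets, probs))
    denominator = sum(targets) + sum(probs)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def hybrid_loss(targets, probs):
    """Assumed combination: plain sum of focal and Dice loss."""
    return focal_loss(targets, probs) + dice_loss(targets, probs)
```

With γ = 2, well-classified pixels (p_t near 1) contribute almost nothing to the focal term, so the rare changed pixels dominate the gradient, while the Dice term scores the overlap of the whole predicted mask.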

In this study, the proposed network was implemented in PyTorch on a platform with two NVIDIA Quadro RTX 6000 GPUs.

In the case of the loss function, a hybrid loss, which is an integrated version of focal and Dice loss, was applied.

developed for change detection, the HRNet-v2 model was trained by integrating the data before and after change and using that as the input data for early fusion.

To evaluate the quality of the proposed network, quantitative analysis is performed using a confusion matrix, which is shown in Table 3

HRNet-v2, and SAUNet3+CD on the LEVIR-CD dataset.
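The precision, recall, and F1-score discussed in the evaluation follow directly from confusion-matrix counts. A minimal sketch, assuming binary changed/unchanged labels and using the standard definitions (the counts in the usage line are made up for illustration and are not results from this manuscript):

```python
def change_metrics(tp, fp, fn):
    """Precision, recall, and F1-score for the 'changed' class.

    tp: changed pixels correctly detected
    fp: unchanged pixels wrongly flagged as changed
    fn: changed pixels that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only.
scores = change_metrics(tp=80, fp=20, fn=20)
```

Omitted changed regions increase `fn` and therefore lower recall, while false alarms increase `fp` and lower precision.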

As shown in Table 4, the results using FC-EF and HRNet-v2, which use a form of early input data fusion, show a relatively low F1-score. Meanwhile, the result using SAUNet3+CD had the best F1-score. Fig. 9 shows an example of the result when applying each deep learning model to the test data. As shown in Fig. 9, the deep learning models detect changed areas in small houses with clear boundaries. However, most of the techniques except HRNet-v2 and SAUNet3+CD did not detect the changed area in the large building (row 1 in Fig. 9), which effectively occupies most of the image patch. This trend also occurred in the change detection results for buildings with complex shapes (rows 4 and 6 in Fig. 9). The omission of these changed regions results in low recall values. On the other hand, it can be seen that SAUNet3+CD expresses the boundary of the changed building more accurately. This is because the loss of spatial information of the boundary was minimized through the residual block and attention module applied in the encoder stage of SAUNet3+CD.

The result using HRNet-v2 shows the lowest precision, which means that changed building pixels in the HRNet-v2

Fig. 10). In addition, it was confirmed that

comparative evaluation with the proposed method, it was confirmed that the Siamese UNet3+ structure based on subtraction proposed by SiUNet3+-CD showed lower performance than the Siamese structure based on the concatenation operation adopted by SAUNet3+CD. It can be seen that it is effective to create a feature map for the multitemporal change detection dataset through the Siamese architecture and construct a deep learning model that applies the concatenation process to the data. In addition, adding an attention module to the feature map generation was effective for change detection.
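The difference between the two Siamese fusion strategies compared above can be sketched as follows. `shared_encoder` is a hypothetical toy stand-in for a weight-shared encoder branch (one weighted band sum per output channel), not the SAUNet3+CD encoder; the point is only what each fusion passes on to the decoder.

```python
def shared_encoder(image, weights):
    """Apply the same weights to either date (the Siamese weight sharing).

    image:   list of pixels, each a list of band values
    weights: one weight vector per output channel
    returns: list of channels, each holding one value per pixel
    """
    return [[sum(w * b for w, b in zip(wv, px)) for px in image] for wv in weights]

def fuse_concat(f1, f2):
    """Keep both dates side by side: C + C channels go to the decoder."""
    return f1 + f2

def fuse_subtract(f1, f2):
    """Keep only the per-channel difference: C channels go to the decoder."""
    return [[b - a for a, b in zip(c1, c2)] for c1, c2 in zip(f1, f2)]

weights = [[1, 0], [0, 1], [0.5, 0.5]]           # hypothetical shared weights
image_t1 = [[1, 2], [3, 4]]                      # two pixels, two bands
image_t2 = [[1, 2], [5, 4]]
f1 = shared_encoder(image_t1, weights)
f2 = shared_encoder(image_t2, weights)
concat_features = fuse_concat(f1, f2)            # 6 channels
diff_features = fuse_subtract(f1, f2)            # 3 channels
```

Concatenation preserves the original features of both dates and leaves it to later layers to learn how to compare them, whereas subtraction fixes the comparison in advance and discards the absolute feature values.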

We calculated the number of parameters of the model used in the ablation study for the efficiency evaluation.
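For reference, the parameter count of a convolutional model can be tallied layer by layer. The sketch below assumes plain 2-D convolutions with square kernels and an optional bias; the layer list is a made-up example, not the configuration of any model in the ablation study.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters of one 2-D convolution with a square kernel."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def count_params(layers):
    """Sum the parameters over (in_channels, out_channels, kernel_size) layers."""
    return sum(conv2d_params(*layer) for layer in layers)

# Hypothetical two-layer block: 3 -> 64 -> 64 channels with 3 x 3 kernels.
total = count_params([(3, 64, 3), (64, 64, 3)])
```

In PyTorch, the same total for an existing model is usually obtained with `sum(p.numel() for p in model.parameters())`.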