Supervised Change Detection Using Prechange Optical-SAR and Postchange SAR Data

Change detection using satellite/aerial images is used to quantify the impacts of many natural and man-made disasters. When such events occur, both prechange optical and synthetic aperture radar (SAR) images can be obtained by going back in time. However, the availability of the postchange optical image is often hindered by the presence of artifacts like clouds. To circumvent this, we propose a novel change detection data setting that uses both optical and SAR images prechange, yet only SAR imagery postchange. For this challenging scenario, we propose a Siamese network that processes the prechange and postchange SAR inputs using a shared set of weights, while the prechange optical input is processed using a network that does not share weights with the SAR inputs. The encoded features from the three networks are fused and finally decoded using a common decoder to obtain the change map. Our model effectively fuses multisensor information and can obtain satisfactory results despite the absence of the postchange optical image. Experimental results on a multisensor urban dataset demonstrate the effectiveness of the proposed approach.

Code is available at https://gitlab.lrz.de/ai4eo/cd/-/tree/main/optSarSarCd. Digital Object Identifier 10.1109/JSTARS.2022.3206898

I. INTRODUCTION
Climate change is causing more and more disasters, e.g., fire events [1], floods [2], dam disasters [3], altered vegetation response [4], landslides [5], and earthquakes [6]. Furthermore, events like wars occasionally cause large-scale destruction and human displacement [7]. Aerial and satellite image-based change detection (CD) methods [8] are used to quantify the impact of such events. Optical/multispectral images are used for most such CD applications [9], as they provide visual cues that are easily comprehensible to humans. While synthetic aperture radar (SAR)-based CD methods have also been proposed in the literature, they are generally designed to detect changes in specific objects that show structural changes, e.g., buildings [10].
Optical images are severely impacted by artifacts like clouds, fog, and smoke. Clouds are frequent in some parts of the Earth, depending on latitude and local climate [11], [12], [13]. Furthermore, some incidents (e.g., volcanic eruptions, wildfires, wartime bombing) may themselves induce smoke, thus further reducing the chance of obtaining a clear optical image. Waiting for an artifact-free postchange optical image may delay the postchange response. As an example, consider the recent armed conflict in Ukraine. The first cloud-free optical (Sentinel-2) acquisition of the city of Mariupol after the armed conflict started on February 24, 2022, was only available on March 14, 2022. However, for most applications, even if the immediate prechange optical image is cloudy, it is possible to go back further in time to obtain a suitable prechange optical image.
SAR sensors are negligibly impacted by the presence of such artifacts [14], [15]. This allows SAR sensors to be used in any weather conditions and at any time of the day [16], [17]. However, the side-looking geometry and inherent characteristics of SAR images, such as speckle and layover/foreshortening, make learning discriminative features more challenging. Thus, SAR images generally obtain lower classification accuracy than artifact-free optical images for most common remote sensing datasets [18]. Motivated by the complementary nature of their information, some existing works have discussed the importance of fusing optical and SAR data for CD [17], [19]. Saha et al. [20] and Wan et al. [21], [22] consider the case where only prechange optical and postchange SAR images are available. However, in most applications there is no impediment to collecting prechange SAR images. Thus, a more practical data scenario is to use a prechange optical image together with both prechange and postchange SAR images.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Siamese networks [23], proposed first in the context of image matching [24], [25], are popular in supervised CD [26], [27], [28]. They generally consist of two weight-sharing networks that ingest two different inputs. Provided the inputs are of the same modality, weight sharing ensures that the network produces similar feature representations for the two inputs. However, optical and SAR inputs show different characteristics that need to be processed using different networks [29]. Furthermore, in our case, the prechange input consists of two different modalities (optical and SAR), while the postchange input contains only SAR data. This poses the additional challenge of a different number of prechange and postchange inputs.
Considering these specifications, we design a modified version of the Siamese network that shares weights between the prechange and postchange SAR inputs, while the prechange optical input is processed using a different set of weights. Features learned by the sensor-specific encoders are concatenated and further processed using a decoder.
The contributions of this work are as follows.
1) We introduce the (prechange) optical-SAR and (postchange) only SAR data setting in the context of supervised CD.
2) We propose a supervised Siamese network that can handle a different number of prechange and postchange inputs.
3) We show results on a multisensor Sentinel-2/Sentinel-1 dataset [17] showing the benefits of the proposed data setting and method. We also carry out an additional analysis showing the relative importance of the three inputs (prechange optical, prechange SAR, and postchange SAR).
The rest of this article is organized as follows. Existing works related to multisensor CD are briefly reviewed in Section II. Section III describes the proposed method. We detail the experimental results in Section IV. Finally, Section V concludes this article.

II. RELATED WORK
In this section, we briefly discuss the existing works on SAR and multisensor CD methods.

A. CD in SAR Images
SAR-based CD is highly challenging owing to intrinsic speckle noise and deformation sensitivity. To cope with these issues, various approaches have been proposed. For instance, Gong et al. [30] presented an approach based on joint classification of prechange and postchange SAR images. Deep transfer learning is proposed for building CD using SAR images in [10]. Zhang et al. [31] proposed a deep spatiotemporal gray-level co-occurrence aware CNN architecture that takes the 3-D gray-level co-occurrence matrix as an auxiliary feature to better capture the neighborhood relationship. Gao et al. [32] proposed a multiscale capsule network that is able to capture the discriminative characteristics between the changed and unchanged pixels. Similarly, Wang et al. [33] proposed an end-to-end graph-level neural network architecture to robustly extract the local neighborhood for more discriminative graph learning for CD. Variants of Siamese networks have also been proposed for SAR CD [34], [35].

B. Multisensor CD
Several architectures have been proposed to perform CD with prechange and postchange images from the same sensor [36], [37]. However, there are very few approaches that can tackle the CD problem based on bitemporal multisensor inputs [20], [21], [38], [39], [40]. This is because multisensor images are affected by differences in the spatial resolution [38] and spectral characteristics of the sensors [41]. A straightforward approach to handle multisensor input is to first independently derive the classification/segmentation maps for the multisensor images and later compare these maps to extract changed regions [21]. However, such postclassification approaches are prone to errors. The other popular approach is to project the prechange and postchange images into a common feature space such that they become comparable in this new feature space [42], [43]. Such a projection into a common feature space can be achieved in different ways, e.g., by using a generative adversarial network [43], by homogeneous pixel transformation [42], and by using self-supervised learning [20]. Another approach is to learn a mapping function between the prechange and postchange images [38]. As in same-sensor CD, symmetric and Siamese deep neural networks have also been used for multisensor CD. In [39], an approximately symmetric deep neural network was used to project the images into the same feature space. Wang et al. [44] proposed a deep CNN-based Siamese network with a hybrid convolutional feature extraction module using multisensor images. Finally, Ebel et al. [17] proposed a novel Siamese architecture for fusion of SAR and optical observations for multimodal CD.

C. Siamese Networks
Due to their capability to process two inputs using the same set of weights, Siamese networks are preferred in many supervised CD applications. In one of the first works, Zhan et al. [26] proposed a Siamese network for CD in optical aerial images. A Siamese network was used for patch-based CD in [45]. The effectiveness of Siamese networks for CD in Sentinel-2 images was shown in [37]. Zhang et al. [46] proposed a Siamese network for multimodal CD. More complex networks have been designed, e.g., by combining Siamese networks with recurrent neural networks [47]. As already discussed in Sections II-A and II-B, Siamese networks have also been adopted for SAR CD and multisensor CD.
Our work is related to the works in Section II-A, as we also use SAR images as the main data source for CD. However, we also use a prechange optical image, which relates our work to Section II-B. Furthermore, we use Siamese networks, as in the works in Section II-C.

III. PROPOSED METHOD
When prechange and postchange images are acquired using the same sensor, Siamese (i.e., weight-sharing) networks are popularly used for CD [26], [45]. Given the benefits and practicality of Siamese networks in supervised CD, our method adapts them to our novel and challenging data scenario. However, to account for the differences between the multisensor (optical and SAR) inputs, only the prechange and postchange SAR inputs are processed using weight-sharing networks, as detailed in Section III-A. Optical and SAR images differ strongly, both in terms of characteristics and input bands. Thus, to better learn the sensor-specific features, the optical and SAR inputs are processed using separate encoders. Furthermore, skip connections are used between the encoder and decoder modules to propagate fine-grained input-specific details, as detailed in Section III-B. The network is trained using weighted cross-entropy, as explained in Section III-C. Test time adaptation (see Section III-D) is further applied to enhance the performance of the proposed model.

A. Modality-Specific Weight Sharing Network
Siamese networks are popular in remote sensing CD, as discussed in Section II-C. Generally, prechange and postchange inputs are images acquired using the same sensor with the same number of channels. To process such inputs, a Siamese network consists of twin weight-sharing networks. However, in our case, only the SAR data are common to the prechange and postchange inputs. In addition, the prechange input also contains optical data that are missing in the postchange input. Thus, we propose a modality-specific weight sharing scheme where prechange and postchange SAR images are processed using weight-sharing twin encoders, $E_{SAR1}$ and $E_{SAR2}$. Furthermore, the prechange optical input is processed using a separate encoder, $E_{OPT1}$. The number of features increases in each layer of the encoder, and the output of the encoder (bottleneck) is the widest part of the network. The bottleneck representations obtained by $E_{SAR1}$, $E_{SAR2}$, and $E_{OPT1}$ are concatenated to form a unified bottleneck representation that is further processed using a decoder network. Using a series of layers, the decoder network reprojects the bottleneck representation to the same spatial dimension as the input patch, with only two outputs at the final layer, corresponding to changed/unchanged.
We design each encoder using ten convolution layers. Each convolution layer uses kernels of spatial size 3×3 and a stride of 1 pixel. Postprocessing units (batch normalization, rectified linear unit, and dropout) follow the convolution layers. Furthermore, a max pooling (2×2) follows each convolution layer, thus shrinking the spatial size as we progress through the encoder. The decoder unit consists of 14 transposed convolution layers that are also followed by the same postprocessing units as in the encoder. A schematic of the proposed architecture is shown in Fig. 1.
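The weight-sharing scheme can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: `TinyEncoder` is a hypothetical stand-in that replaces the ten-layer convolutional encoder with a single random linear map plus ReLU applied to one pixel, and the channel counts (2 for SAR, 13 for optical) and the feature width are assumptions chosen only for illustration. What it shows is that one encoder object serves both SAR dates, a second object serves the optical input, and the three outputs are concatenated before decoding.

```python
import random


class TinyEncoder:
    """Hypothetical stand-in for a convolutional encoder (linear map + ReLU)."""

    def __init__(self, in_channels, out_features, seed):
        rng = random.Random(seed)
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(in_channels)]
            for _ in range(out_features)
        ]

    def __call__(self, pixel):
        # One output feature per weight row, followed by ReLU.
        return [
            max(0.0, sum(w * v for w, v in zip(row, pixel)))
            for row in self.weights
        ]


def fused_bottleneck(opt_pre, sar_pre, sar_post, enc_sar, enc_opt):
    """Encode the three inputs and concatenate their features."""
    f_sar_pre = enc_sar(sar_pre)    # plays the role of E_SAR1
    f_sar_post = enc_sar(sar_post)  # E_SAR2: the same object, hence shared weights
    f_opt_pre = enc_opt(opt_pre)    # E_OPT1: separate weights
    return f_sar_pre + f_sar_post + f_opt_pre


# Assumed channel counts: 2 SAR channels, 13 optical bands.
enc_sar = TinyEncoder(in_channels=2, out_features=4, seed=0)
enc_opt = TinyEncoder(in_channels=13, out_features=4, seed=1)

sar_pixel = [0.3, -0.1]
opt_pixel = [0.5] * 13
bottleneck = fused_bottleneck(opt_pixel, sar_pixel, sar_pixel, enc_sar, enc_opt)
```

Because the two SAR dates pass through the same weights, identical SAR inputs yield identical features, so any difference between the two SAR feature blocks reflects a difference in the data rather than in the encoders.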

B. Skip Connections From Encoder to Decoder
U-Net [48], a popular semantic segmentation architecture, uses skip connections between encoder and decoder to propagate the fine-grained details learned in the encoder part to construct an image in the decoder part. Following this, we pass the features from the three encoders ($E_{SAR1}$, $E_{SAR2}$, and $E_{OPT1}$) and connect them to the appropriate depth of the decoder. We argue that, in this fashion, we are able to retain the fine-grained input-specific details while decoding the bottleneck representation to obtain the final output map. Skip connections are shown as "concat" units in Fig. 1.

C. Weighted Loss
Generally, the unchanged pixels are significantly more frequent than the changed pixels in the training data. To account for this, we use weighted cross-entropy loss to train the model M on the training data. Weighted cross-entropy loss is a variant of the cross-entropy loss function weighted by class, varying the relative penalty of a probabilistic false negative for an individual class [49], [50]. In our case, we deal with two classes, changed and unchanged, whose weights ($\beta_{c}$ and $\beta_{nc}$) are derived from their inverse ratio in the training data. For a certain pixel x, given the reference label y indicating whether the pixel is changed and the softmax output p(x) corresponding to the changed class, the loss function takes the standard weighted cross-entropy form
$$\mathcal{L}(x) = -\beta_{c}\, y \log p(x) - \beta_{nc}\, (1 - y) \log\bigl(1 - p(x)\bigr).$$
The weighted loss is aggregated over all samples in a training minibatch [51].
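A class-weighted cross-entropy of the kind described here can be sketched in a few lines of Python. The sketch rests on assumptions that the text does not fix: the weights are taken as the inverse class frequencies (total pixels divided by pixels per class), the per-pixel losses are averaged over the minibatch, and a small `eps` guards the logarithm. The function names are hypothetical.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights (beta_c, beta_nc); labels use 1 = changed.

    Assumes both classes occur at least once in `labels`.
    """
    n_total = len(labels)
    n_changed = sum(labels)
    n_unchanged = n_total - n_changed
    return n_total / n_changed, n_total / n_unchanged


def weighted_cross_entropy(p_changed, labels, beta_c, beta_nc, eps=1e-12):
    """Mean over pixels of -[beta_c*y*log(p) + beta_nc*(1-y)*log(1-p)]."""
    total = 0.0
    for p, y in zip(p_changed, labels):
        total -= beta_c * y * math.log(p + eps)
        total -= beta_nc * (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


# Toy minibatch with one changed pixel out of four.
beta_c, beta_nc = class_weights([0, 0, 0, 1])

# A missed change (y=1, p=0.1) versus a false alarm of equal confidence (y=0, p=0.9).
missed_change = weighted_cross_entropy([0.1], [1], beta_c, beta_nc)
false_alarm = weighted_cross_entropy([0.9], [0], beta_c, beta_nc)
```

With these weights, missing a rare changed pixel costs more than raising an equally confident false alarm, which is the intended effect of the class weighting.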

D. Test Time Adaptation
The model trained in the above step may be suboptimal when applied to test regions unseen during training [52], [53]. This is even more challenging in the considered data scenario, as the postchange optical image is not seen in our case. Dong et al. [54] showed that the mean of the model activations can be used to effectively model domain differences. Inspired by this, we propose a simple test time adaptation strategy to adapt the trained model to the test cities. For each pixel, the model M produces two unnormalized predictions (also called logits), $\theta_{uc}$ and $\theta_{c}$, corresponding to the unchanged and changed class, respectively. Let the mean of $\theta_{uc}$, estimated on the entire training data, be $\theta^{Tr}_{uc}$. For a given test patch, if the mean of $\theta_{uc}$ is estimated as $\theta^{Te}_{uc}$, then $(\theta^{Tr}_{uc} - \theta^{Te}_{uc})/\theta^{Tr}_{uc}$ is added to the $\theta_{uc}$ value of each pixel in this patch. This makes the mean neural activations similar for the training and test data. The impact of this adaptation may vary depending on the size of the test patch on which the adaptation is performed. Suppose a given test scene is divided into p × p patches. A smaller value of p performs adaptation only at a global scale, while a larger value of p implies adaptation at a more local scale.
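The adaptation rule can be sketched as follows. This is an illustrative reading rather than the released code: it assumes that "p × p patches" means p equal splits along each image axis, that the training mean of the unchanged-class logit is nonzero, and that the logits of a scene are available as a 2-D list. The function name `adapt_unchanged_logits` is hypothetical.

```python
def adapt_unchanged_logits(theta_uc, train_mean, p):
    """Shift unchanged-class logits patch by patch.

    theta_uc   : 2-D list of unchanged-class logits for one test scene
    train_mean : mean unchanged-class logit over the training data (nonzero)
    p          : number of splits per axis (assumed meaning of "p x p patches")
    """
    rows, cols = len(theta_uc), len(theta_uc[0])
    adapted = [row[:] for row in theta_uc]
    for i in range(p):
        r0, r1 = i * rows // p, (i + 1) * rows // p
        for j in range(p):
            c0, c1 = j * cols // p, (j + 1) * cols // p
            values = [theta_uc[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            if not values:
                continue  # more splits than pixels along an axis
            test_mean = sum(values) / len(values)
            # Offset from the text: (train mean - test mean) / train mean.
            shift = (train_mean - test_mean) / train_mean
            for r in range(r0, r1):
                for c in range(c0, c1):
                    adapted[r][c] = theta_uc[r][c] + shift
    return adapted


scene = [[1.0, 1.0], [3.0, 3.0]]
global_result = adapt_unchanged_logits(scene, train_mean=2.0, p=1)
local_result = adapt_unchanged_logits(scene, train_mean=2.0, p=2)
```

In this toy scene the global mean already equals the training mean, so p = 1 leaves the logits untouched, whereas p = 2 shifts each single-pixel patch toward the training mean.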

IV. EXPERIMENTAL RESULTS

A. Dataset
The Onera Satellite CD (OSCD) dataset [55] is a popular urban CD dataset. While it was originally proposed for optical CD only, a multisensor version is available in [17]. The dataset uses ascending orbit Sentinel-1 SAR observations coordinate-transformed via GDAL [56] to match the coordinate system of the original optical data.
The dataset consists of 24 cities distributed across the world. Originally, the OSCD dataset [17], [55] and works using this dataset [57] split the 24 cities into a training set of 14 cities and a test set of ten cities, with no validation set. However, the use of a separate validation set to optimize the hyperparameters is generally recommended in deep learning [58]. So, departing

B. Metrics and Settings
To measure the performance, we use precision (TP/(TP+FP)), recall (TP/(TP+FN)), F1 score (harmonic mean of the precision and recall), and accuracy, where TP indicates true positive, FP indicates false positive, and FN indicates false negative.
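For reference, these metrics can be computed from binary change maps as in the short sketch below. The function name is hypothetical, accuracy is taken as the usual (TP+TN)/(all pixels), which the text does not spell out, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
def change_metrics(pred, ref):
    """Precision, recall, F1, and accuracy for binary maps (1 = changed)."""
    tp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(pred, ref) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 1)
    tn = sum(1 for p, r in zip(pred, ref) if p == 0 and r == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(ref)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Flattened toy maps: TP = 2, FP = 1, FN = 1, TN = 1.
metrics = change_metrics(pred=[1, 1, 0, 0, 1], ref=[1, 0, 0, 1, 1])
```

Because unchanged pixels dominate, accuracy can remain high even when few changes are found, which is why the F1 score is reported alongside it.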
We set the number of training epochs by inspecting the performance evolution on the validation subset (see Section IV-A). Models are trained for 50 epochs with a learning rate of 0.0001 using the Adam optimizer [60]. Results of the proposed method are reported as the average of three runs with random seeds.

C. Compared Methods and Ablation Studies
To verify the effectiveness of the proposed setting and method, we need to investigate the following two aspects: 1) whether the proposed data setting (prechange: optical+SAR, postchange: SAR) is outperformed by simpler data settings (prechange: SAR, postchange: SAR) or (prechange: optical, postchange: SAR) or (prechange: SAR, postchange: optical); 2) whether the proposed Siamese architecture can be outperformed by other popular CD architectures, e.g., fully convolutional network or Vanilla Siamese. We investigate these two aspects by comparing with the following methods (the first six address the first aspect and the remaining ones address the second aspect).
1) An early fusion fully convolutional network using only prechange and postchange SAR images. Input to the network is a stacked version of the prechange and postchange SAR data.
2) Data setting same as above, however using a Siamese network similar to the proposed approach.
3) An early fusion fully convolutional network using only prechange optical and postchange SAR images.
4) Data setting same as above, however using an encoder-decoder network-based method that projects the prechange optical image into a SAR image [59].
5) An early fusion fully convolutional network using only prechange SAR and postchange optical images.
6) Data setting same as above, however using an encoder-decoder network-based method that projects the prechange SAR image into an optical image [59].
7) Data setting same as the proposed method, i.e., prechange optical and SAR images and postchange SAR image are used. However, instead of the proposed Siamese setting, a fully convolutional network [37] is used where all three input images are stacked and fed to the network. For fairness of comparison, the fully convolutional network uses the same number of layers as the proposed method.
8) Data setting same as the proposed method, however using an encoder-decoder-based approach, similar to [59], where the input of the encoder is the stacked prechange optical and SAR images.
9) Data setting same as the proposed method, however using a Vanilla Siamese network.
10) Proposed method without test time adaptation.
In addition, we perform the following ablation studies:
1) By varying the value of p for test time adaptation.
2) By inserting a test-time dropout layer with a high dropout rate (probability 0.9) after the first convolution layer in the encoder processing the prechange optical input, the prechange SAR input, or the postchange SAR input. This study helps us to understand the relative importance of each of the three inputs. For example, if the postchange SAR input is vital for the proposed CD architecture, then applying test-time dropout to its encoder will severely impact the CD performance.
Table I shows the results for the different data settings and methods.

D. Results
Prechange: SAR, postchange: SAR: For this data setting, the Siamese network outperforms the early fusion-based method, both in terms of F1 score (difference of 0.2) and accuracy (difference of 8.61%).
Prechange: Optical, postchange: SAR: While early fusion outperforms the encoder-decoder-based method, overall this data setting performs poorly compared to the SAR-SAR setting. This shows that merely setting up correspondence between prechange and postchange images from two different modalities is difficult.
Prechange: SAR, postchange: Optical: This data setting also performs poorly compared to the SAR-SAR setting. This leads to the same conclusion as above that merely setting up correspondence between prechange and postchange images from two different modalities is challenging.
Proposed, prechange: Optical+SAR, postchange: SAR: Benefiting from the availability of both prechange optical and prechange SAR images, the proposed data setting outperforms other data settings irrespective of architecture. As an example, early fusion using proposed setting obtains an F1 score of 27.72 in comparison to 22.88 obtained using prechange SAR and postchange SAR. Similarly, the encoder-decoder-based approach obtains a better F1 score than the same approach for other data settings.
Remarkably, early fusion obtains slightly better scores than Vanilla Siamese. Early fusion obtains an F1 score of 27.72 and an accuracy of 89.70%. The proposed architecture outperforms both early fusion and Vanilla Siamese by a significant margin. The proposed method (with p = 5) obtains an F1 score of 32.14 and an accuracy of 91.28%.
Qualitative results for the early fusion and proposed architectures (both using the proposed data setting) for the city of Montpellier are shown in Fig. 2(e) and (f), respectively. It is evident that the proposed approach is less prone to false alarms than the early fusion approach.
The performance of the proposed method is better with the proposed test time adaptation (F1 score: 32.14 in comparison to 30.25). This is because, by homogenizing the mean neural activation between test and training data, the adaptation makes the test features more similar to the training features on which the model was originally trained.
Variation of p: In Table II, we show how the performance of the proposed method varies as p is varied during test time adaptation. While the best F1 score is obtained at p = 5, we observe a relatively stable performance w.r.t. variation of p. This is an advantage of the proposed method, as the value of p does not need careful tuning in practical applications.
Weighted loss: Due to the strong class imbalance between the changed and unchanged classes, we have used weighted cross-entropy loss, similar to previous works on Siamese supervised CD [55]. To further validate this, we experimented with the nonweighted version of cross-entropy loss and found that it fails to obtain a satisfactory result (F1 score of only 11.54).
Relative importance of three inputs: Table III shows the fall in performance (accuracy and F1 score) when test-time dropout is applied to the first convolution layer of the encoder of one or more of the three inputs. The most significant decrease in performance is observed when dropout is applied to the postchange SAR input. This is intuitive, as we have only one postchange input, and thus the information from it is essential for deciding about change. A similar decrease in performance is also observed if dropout is applied simultaneously to both prechange inputs. Between the prechange optical and prechange SAR inputs, the decrease in performance is larger for the prechange optical input. While dropout applied to the prechange optical input leads to a drop in F1 score of 7.04, the drop for the prechange SAR input is 4.81. This shows that, for the proposed data setting, the network extracts the prechange information more from the optical input than from the SAR input.

V. CONCLUSION
This article presents a new data setting for remote sensing CD, using prechange optical and SAR images and a postchange SAR image. Using a set of experiments, we show that this setting obtains superior performance to the SAR-only scenario. This data setting is especially useful because, when a disaster or incident occurs, it is in most cases possible to obtain an artifact-free prechange optical image, whereas it is not always practical to wait for the acquisition of an artifact-free postchange optical image. The article also presents a novel Siamese architecture for this data setting, where prechange and postchange SAR images share the same encoder, while prechange optical data are processed using a separate encoder. The proposed architecture effectively fuses information from different data modalities and processes them using a common decoder to obtain the change map. Our analysis also shows the relative importance of the three inputs for our data setting, ordered as postchange SAR, prechange optical, and prechange SAR. Our future work will focus on designing a more robust architecture for the proposed data setting. We will furthermore investigate the combination with image reconstruction methods for spatio-temporal cloud removal [61]. We would also like to extend the method to fine-grained changed object detection [62], [63].