Self-Supervised Multisensor Change Detection

Most change detection methods assume that pre-change and post-change images are acquired by the same sensor. However, in many real-life scenarios, e.g., natural disasters, it is more practical to use the latest available images before and after the incident, which may have been acquired by different sensors. In particular, we are interested in combining images acquired by optical and Synthetic Aperture Radar (SAR) sensors. SAR images appear vastly different from optical images even when capturing the same scene. In addition, change detection methods are often constrained to use only the target image pair, with no labeled data and no additional unlabeled data. Such constraints limit the scope of traditional supervised machine learning and unsupervised generative approaches for multi-sensor change detection. The recent rapid development of self-supervised learning has shown that some methods can work with only a few images. Motivated by this, in this work we propose a method for multi-sensor change detection that uses only the unlabeled target bi-temporal images to train a network in a self-supervised fashion through deep clustering and contrastive learning. The proposed method is evaluated on four multi-modal bi-temporal scenes showing change, and the benefits of our self-supervised approach are demonstrated.


I. INTRODUCTION
Our Earth is rapidly changing, due to both natural and man-made causes. Satellite image based change detection (CD) is generally used to monitor the temporal evolution of the dynamic Earth [1], [2], [3], [4], [5], [6], [7]. CD ingests bi-temporal images as input and segregates all pixels as changed/unchanged. CD is a crucial step for several applications, including disaster management, urban monitoring, forestry, glacier monitoring, and precision agriculture. Considering the variety of applications, the rarity of some change-inducing incidents (e.g., natural disasters), and the large geographic variation, it is imprudent to assume that large-scale training datasets corresponding to all such tasks can ever be collected. Thus, there is a significant inclination in the CD literature towards methods that can process the target bi-temporal region-of-interest without using any training label or any additional pool of unlabeled images. Motivated by its excellent performance in computer vision, researchers have applied deep learning to satellite image change detection [8].
To exploit the potential of deep learning without using any training label or additional unlabeled images, transfer learning based CD methods, which reuse a pre-trained network for bi-temporal feature extraction and comparison, are popular [1].
A striking feature of satellite data is its variability across sensors. Images captured using passive optical sensors are quite similar to the natural images studied in computer vision. However, images captured by active sensors, e.g., Synthetic Aperture Radar (SAR), are remarkably different from optical images [9], [10], [11]. While optical sensors use wavelengths near visible light (approx. 1 micron), SAR uses wavelengths of 1 cm to 1 m. Moreover, optical sensors rely upon natural illumination (e.g., the sun) to create the brightness observed by the sensor, while SAR sensors carry their own illumination source, in the form of radio waves transmitted by an antenna. Furthermore, satellite images are captured with different numbers of spectral bands (one to a few hundred), different spatial resolutions (a few cm/pixel to km/pixel), and different polarizations. While this vast variation provides an opportunity for detailed Earth observation, it is not trivial to use the same set of methods for images from different sensors. For this reason, most existing CD methods assume that the pre-change and post-change images are acquired by the same sensor. The temporal frequency at which the same sensor can image the same place depends on the revisit period of the satellite on which the sensor is mounted. However, the better the spatial resolution, the closer the satellite is to the Earth, and the more time it takes to revisit the same place. This hinders the use of same-sensor CD in time-bound applications, e.g., rapid response for disaster management and precision agriculture. Using different sensors may allow us to obtain temporal sequences with better temporal frequency without sacrificing spatial resolution. However, it is not trivial to process multi-sensor bi-temporal images, as they are affected by the spectral characteristics of the sensors. Moreover, different sensors capture different types of information, making their comparison challenging [12]. The difficulty of this problem is further accentuated by the fact that we are interested in detecting change without using any labeled training data or any abundant pool of unlabeled data.
The emergence of deep learning has solved many problems that were thought to be very challenging in the past [13], [14]. Self-supervised learning has shown remarkable success recently, even when only a few images are available [15]. Intrigued by this, in this paper we explore the challenging problem of change detection between optical and SAR images, the disparity between which is evident in Figure 1. We exploit recent developments in self-supervised learning and deep clustering to propose a method for the challenging SAR-optical CD setting where one of the bi-temporal images is acquired by an optical sensor, while the other is acquired by a SAR sensor.
The proposed method requires only the bi-temporal target scene (where change is to be detected), no training label, and no additional unlabeled data. The target bi-temporal scene is typically large, a few hundred pixels on each side. Smaller bi-temporal patches (e.g., 64×64) are extracted from it to train a two-branch network, similar to a Siamese network [16]. Each branch of the network has a projection module and a predictor. The projection modules learn features unique to optical and SAR data without sharing weights, while the predictors share weights. The outputs of the predictors are used to estimate a deep clustering loss for each image separately. Moreover, considering that the prior probability of changed pixels is much lower than that of unchanged ones, a temporal consistency loss is proposed that encourages pixels at the same location at the two different times to obtain the same label. To ensure that this does not lead the network to a trivial solution, a contrastive loss is used. By combining these losses, the proposed method learns useful semantic features from the multi-sensor (SAR-optical) bi-temporal target scene, and after training the network predictions can be compared for change detection.
The contributions of this paper are as follows: 1) We propose a self-supervised learning method for change detection in a bi-temporal scene where one image is captured by an optical sensor and the other by a SAR sensor. The proposed method, exploiting only the available unlabeled target scene, effectively absorbs several concepts from the recent self-supervised learning literature, e.g., deep clustering, augmented views, Siamese networks, and contrastive learning. By exploiting these concepts and modifying them appropriately for the target multi-sensor bi-temporal data, the proposed method trains a network that is further used for bi-temporal comparison and change detection. 2) We show the versatility of self-supervised learning on spatio-temporal satellite data that is very different from typical computer vision images. Even though some forms of aerial images (e.g., drone images) are often studied in computer vision, we stress that our satellite data (both optical and SAR) are significantly different from typical aerial images. 3) We experimentally show the efficacy of the proposed method on four different bi-temporal multi-sensor scenes. The rest of this paper is organized as follows. Related works are briefly discussed in Section II. Section III outlines the proposed method. Datasets and experimental results are detailed in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORK
In this section we briefly discuss existing works on unsupervised CD (with a focus on multi-sensor CD) and self-supervised learning.

A. Change Detection
Prior to the emergence of deep learning, most unsupervised change detection methods used the concept of pixelwise image differencing, i.e., change vector analysis (CVA) [17]. A number of superpixel and spatial neighborhood based variants of CVA have been proposed, e.g., Parcel Change Vector Analysis (PCVA) [18] and Robust Change Vector Analysis (RCVA) [19]. Most deep learning based unsupervised change detection methods use transfer learning. [1] proposed deep change vector analysis (DCVA), a CD framework that combines ideas from CVA with feature extraction based on pre-trained neural networks. In a nutshell, a deep model that has been trained for some other task is reused to obtain pixelwise bi-temporal deep features from the target scene. The bi-temporal deep features are then compared to obtain a deep change hypervector for each pixel in the scene, which is analyzed based on its magnitude (ℓ2 norm) to identify the changed pixels. While [20] shows that a sensor-specific pre-trained network is more suitable for transfer learning, [5] advocates models trained on ImageNet [21] for transfer learning in CD. There is another class of unsupervised CD methods that pre-classifies some pixels with high confidence as changed/unchanged using some traditional approach and further uses those confident samples for training a CD model [22]. It is not trivial to process multi-sensor bi-temporal images, as they are affected by differences in the spatial resolution and spectral characteristics of the sensors. Due to this, there are very few works that can operate in settings where the pre-change and post-change images have different spatial resolutions [23], [24] or bands with different spectral characteristics [25]. Moreover, those works deal with only minor variation in spatial or spectral characteristics. [23] proposed a cycle-consistent generative adversarial network based method to learn transcoding between multi-sensor multi-temporal domains. However, their work assumes that a large (unlabeled) area corresponding to both sensors is available as training data. [26] used a symmetric convolutional coupling network (SCCN) and [27] used a denoising autoencoder (DAE) for CD in multi-sensor images. Though those works considered optical-SAR images, they applied their methods on scenes with limited spatial complexity. While our work is strongly motivated by existing works on multi-sensor CD [23], [24], it takes them a step further by considering the challenging scenario of optical-SAR CD in complex urban scenes, and furthermore by integrating recent developments in self-supervised learning.

B. Self-supervised learning
Considering the difficulty of collecting labeled data and the abundance of unlabeled data, machine learning researchers have recently focused on developing unsupervised and self-supervised deep learning methods. Gidaris et al. [28] used image rotation as a pretext task to learn unsupervised semantic features. Several other pretext tasks have been explored in the literature, e.g., relative patch prediction [29] and image inpainting [30]. Deep clustering, i.e., joint learning of the parameters of a deep network and the cluster assignments of the resulting features, has also been shown to be effective for unsupervised representation learning [31]. Remarkably, [15] has shown that the above-mentioned unsupervised methods learn useful semantic features even with a single-image input. Contrastive methods function by bringing the representations of different views of the same image ('positive pairs') closer while spreading representations of different images ('negative pairs') apart [32], [33], [34]. Bootstrap your own latent [35] and its variant SimSiam [16] eliminate the requirement of negative pairs by using multiple views of the same image. In more detail, SimSiam [16] ingests as input two randomly augmented views of an image and processes them through a Siamese architecture. Each Siamese branch consists of an encoder and a prediction head. The encoders share weights between the two views.
The proposed method is strongly inspired by the above self-supervised methods. Like deep clustering [31], it uses the concept of simultaneous representation learning and cluster/label assignment. The bi-temporal images can be considered views of the same scene, as in SimSiam [16]. Like the contrastive methods, the proposed method uses the idea of bringing the representations of positive pairs closer and spreading apart those of negative pairs. Like [15], the proposed method works on a single scene (a pair of images capturing the same location at two different times).
Multi-temporal satellite image processing researchers have also proposed self-supervised representation learning methods, e.g., deep clustering for multi-temporal segmentation [36] and learning by rearranging randomly shuffled time-series images [37]. The proposed method is related to them, using the concept of deep clustering as in [36].

III. PROPOSED METHOD
Let X_1, Z_2 be two images of size R×C taken over the same geographical region at times t_1 and t_2, respectively. Without loss of generality, we assume that the pre-change image X_1 is acquired by an optical sensor (RGB) and the post-change image Z_2 is acquired by a SAR sensor. Since the SAR image is grayscale, its single channel is replicated thrice to make it 3-channel like the optical input. We aim to detect changes from the images X_1, Z_2 in an unsupervised manner, i.e., without using any training labels or any additional unlabeled data pool. Our goal is to divide the set of all pixels Ω into two subsets Ω_c and Ω_nc corresponding to changed and unchanged pixels, respectively. Like most existing unsupervised CD methods [1], we assume that the prior probability of change is lower than that of no change [38].
We can extract a set of bi-temporal patches of size R′ × C′ (R′ < R and C′ < C) from the images X_1, Z_2. In practice, one training iteration involves only a batch of B patches from X_1, denoted as X = {x_1^1, ..., x_1^B}, and the corresponding patches from Z_2, denoted as Z = {z_2^1, ..., z_2^B}. x_1^b and z_2^b are processed separately with a deep clustering loss, as detailed in Section III-C. Furthermore, considering that x_1^b and z_2^b represent the same location at two different times and the prior probability of change is low, a temporal consistency loss (see Section III-D) is formulated using each such pair. Furthermore, Z is shuffled to form negative samples Z′, and a contrastive loss is used between pairs from X and Z′, as outlined in Section III-E. The proposed method is outlined in Figure 2.
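As a concrete illustration, the patch extraction step can be sketched as follows. This is a minimal NumPy sketch, not the authors' code; the function name and array layout are our own assumptions, with the 64-pixel patch size and stride of 32 taken from the experimental settings.

```python
import numpy as np

def extract_patch_pairs(x1, z2, patch=64, stride=32):
    """Extract co-located bi-temporal patch pairs from a scene.

    x1, z2: (H, W, C) pre-change optical and post-change SAR images
    (the single SAR channel is assumed to have been replicated to C = 3).
    Returns two arrays of shape (B, patch, patch, C) with pairwise
    corresponding locations.
    """
    H, W = x1.shape[:2]
    xs, zs = [], []
    for i in range(0, H - patch + 1, stride):
        for j in range(0, W - patch + 1, stride):
            xs.append(x1[i:i + patch, j:j + patch])
            zs.append(z2[i:i + patch, j:j + patch])
    return np.stack(xs), np.stack(zs)
```

With an 824 × 716 scene, as for Las Vegas, this sliding window yields 504 patch pairs, matching the count reported in the experimental settings.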

A. Bi-temporal patches are multiple views of the same location
We recall from Section II-B that many self-supervised learning approaches build upon the concept of bringing closer the representations of multiple views of the same image. Different views of the same image are generally obtained by different augmentation techniques, e.g., random crops. We argue that multi-sensor bi-temporal patches x_1^b and z_2^b can similarly be thought of as multiple views of the same location. They represent augmentations of the same place, where the augmentation transformation is naturally caused by multi-sensor differences and other factors including weather conditions. Considering that the prior probability of change is low [38], most of the time such a pair of patches x_1^b and z_2^b represents the same information, but through the eyes of two different viewers (sensors).

B. Siamese representation
Since bi-temporal patches can be seen as multiple views of the same location, we argue that semantic information can be captured from them by using a Siamese-like architecture. Similar to [16], the two branches of the network have projection modules f_opt and f_sar for the optical and SAR branch, respectively. Additionally, the branches have prediction modules h_opt and h_sar for the optical and SAR branch, respectively. However, unlike [16], the projection modules f_opt and f_sar do not share weights. This is because SAR and optical images are significantly different and hence are processed by two different projection modules using different sets of weights. However, the prediction modules h_opt and h_sar share weights and are henceforth simply denoted as h.
The projection and the prediction networks consist of L_1 and L_2 (generally L_2 = 1) convolutional layers, respectively, where L = L_1 + L_2. The two projection modules compute representations from the optical and SAR images and project them into a common domain. In the ideal scenario, where the projectors have perfectly learned to project optical and SAR images into a common domain and the bi-temporal images do not show any change, the outputs generated for an input pair are expected to be identical. However, in practice, even in the absence of any change, there are differences caused by multi-sensor acquisition and other factors that are not trivial for the projection modules to mitigate. All but the last convolutional layer are followed by a ReLU activation function and a batch normalization layer. We do not use any pooling layer, hence the spatial size of the input is preserved in the output. While filters of spatial size 3 × 3 are used for all convolutional layers in the projection modules, the prediction module uses 1 × 1 filters. The kernel number of the final layer is K, and the kernels can be thought of as K different clusters/classes. Each pixel can be assigned to one of these K clusters (detailed in Section III-C). The network architecture is shown in Figure 3.
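The two-branch design can be sketched in plain NumPy as follows. This is a deliberately simplified sketch, not the paper's implementation: we use two projection layers instead of L_1 = 4, a hypothetical channel width of 8, random untrained weights, and we omit batch normalization for brevity. Only the structural points from the text are kept: unshared 3 × 3 projection modules, a shared 1 × 1 predictor with K output kernels, ReLU after all but the last layer, and preserved spatial size.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w, b):
    """Stride-1, zero-padded 'same' convolution. x: (Cin, H, W), w: (Cout, Cin, k, k)."""
    cout, cin, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    H, W = x.shape[1:]
    out = np.empty((cout, H, W))
    for o in range(cout):
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o]) + b[o]
    return out

def make_projection(n_layers=2, c_in=3, width=8):
    """Sensor-specific projection module: a stack of 3x3 conv layers (random weights)."""
    layers = []
    for _ in range(n_layers):
        layers.append((rng.normal(0.0, 0.1, (width, c_in, 3, 3)), np.zeros(width)))
        c_in = width
    return layers

def forward(x, projection, h_w, h_b):
    """Projection (3x3 convs + ReLU) followed by the shared 1x1 predictor h."""
    for w, b in projection:
        x = np.maximum(conv2d(x, w, b), 0.0)  # ReLU after each projection layer
    return conv2d(x, h_w, h_b)                # no activation on the last layer

# Unshared projections for the two sensors; shared predictor with K = 4 kernels
f_opt, f_sar = make_projection(), make_projection()
h_w, h_b = rng.normal(0.0, 0.1, (4, 8, 1, 1)), np.zeros(4)
```

Because there is no pooling and all convolutions are 'same'-padded, a (3, H, W) patch maps to a (K, H, W) output, so a pseudo-label can later be assigned to every pixel.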

C. Deep clustering
The deep clustering process involves the joint learning of the parameters of a deep network and the cluster assignments of the resulting features [31]. Deep clustering helps the network to learn discriminative features that can identify different classes/clusters in the images. Considering the processing of the two images as independent processes, deep clustering can be performed for each of them. The outputs obtained by the network for paired input patches x_1^b and z_2^b are:

y_1^b = h(f_opt(x_1^b)),  y_2^b = h(f_sar(z_2^b)).

For a generic pixel n, the K-dimensional feature y_{1,n}^b is assigned a pseudo-label c_{1,n}^b by argmax classification:

c_{1,n}^b = argmax_k y_{1,n}^b(k).

The rationale behind finding the highest activation of an input pixel is that pixels obtaining their highest activation in the same feature are likely to have similar semantics, thus belonging to the same group. While there are several possible ways to define the pseudo-label, our approach follows the ones based on argmax classification of the final layer [39], [40]. Once the pixels are assigned to the K clusters, the parameters of the deep network can be updated by using a loss between the feature y_{1,n}^b and the cluster c_{1,n}^b. We use the cross-entropy loss:

ℓ_{1,n}^b = -log( exp(y_{1,n}^b(c_{1,n}^b)) / Σ_{k=1}^{K} exp(y_{1,n}^b(k)) ).

In practice, the loss term L_1 is computed by taking the mean of ℓ_{1,n}^b over all pixels in x_1^b and all patches in the batch (b = 1, ..., B). L_1 is used to adjust the weights of h and f_opt. Similarly, L_2 is computed from z_2^b (b = 1, ..., B) and used to modulate the weights of h and f_sar.
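The pseudo-labeling and cross-entropy steps above can be sketched for a single patch output as follows. This is a NumPy sketch under our own naming; the softmax-based cross-entropy against the argmax pseudo-label is one concrete reading of the loss described in the text.

```python
import numpy as np

def deep_clustering_loss(y):
    """Pseudo-label by argmax, then softmax cross-entropy against that label.

    y: (K, H, W) network output for one patch.
    Returns (labels, mean loss): labels is the (H, W) cluster map c_{1,n}.
    """
    labels = np.argmax(y, axis=0)                    # c_{1,n}: cluster per pixel
    e = np.exp(y - y.max(axis=0, keepdims=True))     # numerically stable softmax
    p = e / e.sum(axis=0, keepdims=True)
    H, W = labels.shape
    # probability each pixel assigns to its own pseudo-label
    picked = p[labels, np.arange(H)[:, None], np.arange(W)[None, :]]
    return labels, float(-np.log(picked).mean())
```

Note that when the output is sharply peaked on one kernel, the loss is near zero; for a diffuse output it is large, so minimizing it sharpens the cluster assignments.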
While deep clustering helps to learn a representation for each sensor separately, it does not ensure that the independently learned features are aligned with each other.

Algorithm 1: Self-supervised training of the proposed network
1: Initialize the weights W_1, ..., W_L
2: for epoch ← 1 to I do
3:   for each batch (X, Z) do
4:     Obtain Z′ as random shuffling of Z
5:     for j ← 1 to J do
6:       for b ∈ B do
7:         Compute y_1^b and y_2^b (and y′_2^b from Z′)
8:       end for
9:       Compute L_1, L_2, the temporal consistency loss L_tc, and the contrastive loss L_con
10:      if epoch ≤ I_1 then
11:        Use L_1 + L_2 to modulate W_1, ..., W_L
12:      else
13:        For each 3 consecutive iterations j, use L_1 + L_2, L_tc, and L_con, respectively, to modulate W_1, ..., W_L
14:      end if
15:    end for
16:  end for
17: end for

D. Temporal consistency
Recalling from Section III-B, multi-sensor bi-temporal patches x_1^b and z_2^b are multiple views of the same location in the absence of any change. In other words, in co-registered bi-temporal images, pixels in the same spatial location generally tend to belong to the same object, as changes have a lower prior probability than the unchanged class. Thus, the features computed for the bi-temporal paired patches x_1^b and z_2^b should be similar in most cases. For each pair of input pixels x_{1,n}^b and z_{2,n}^b, we compute an absolute error (AE) loss:

ℓ_{12,n}^b = (1/K) Σ_{k=1}^{K} | y_{1,n}^b(k) - y_{2,n}^b(k) |.

The temporal consistency loss term L_tc is computed by taking the mean of ℓ_{12,n}^b over all considered pixels for all patches in the batch. The proposed temporal consistency only ensures that pixels at the same location but at two different times tend to have the same label. This may lead to a degenerate solution where all pixels simply obtain the same prediction for both times. Moreover, some bi-temporal pairs x_1^b and z_2^b may indeed be changed, yet they are penalized for producing dissimilar outputs in this step.
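The temporal consistency term reduces to a mean absolute error between the two pixelwise outputs, which can be sketched in a few lines (our own naming; the per-feature averaging is an assumption consistent with the AE loss above):

```python
import numpy as np

def temporal_consistency_loss(y1, y2):
    """Mean absolute error between pixelwise bi-temporal outputs.

    y1, y2: (K, H, W) predictor outputs for the paired optical and SAR
    patches; the mean runs over features and pixels (and, in practice,
    over all patches in the batch).
    """
    return float(np.abs(y1 - y2).mean())
```

The loss is zero exactly when the two branches produce identical outputs for the pair, which is why the contrastive term of Section III-E is needed to rule out the degenerate constant solution.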

E. Contrastive learning
While Section III-D encourages the features computed for a paired patch x_1^b and z_2^b to be similar, in this section we encourage the network to produce dissimilar features for different inputs by employing concepts inspired by contrastive learning. While we do not have negative samples in the unsupervised setting on which our work is based, we simply shuffle the batch of patches Z to Z′. Recall that X and Z have location-wise paired patches. This implies that X and Z′ have unpaired patches. Thus they should be more dissimilar than the paired patches in Section III-D. We encourage the features computed for x_1^b and z′_2^b to be dissimilar. This is achieved by computing a (negative) absolute error loss for each pair of input pixels x_{1,n}^b and z′_{2,n}^b:

ℓ′_{12,n}^b = -(1/K) Σ_{k=1}^{K} | y_{1,n}^b(k) - y′_{2,n}^b(k) |,

which takes a negative value. Ideally, ℓ′_{12,n}^b should be encouraged to be more and more negative. However, in practice we note that simply shuffling Z to Z′ does not always ensure that X and Z′ have semantically different patches. Even after shuffling, they may contain semantically paired patches, which are nevertheless penalized in this step for producing similar features. Thus, to control its impact, we penalize the network with ℓ′_{12,n}^b only when it approaches 0, i.e., when y_{1,n}^b and y′_{2,n}^b become too similar. This is achieved by computing the contrastive loss term L_con as the mean of exponentials of ℓ′_{12,n}^b over all considered pixels for all patches in the batch.
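The shuffling and the exponential damping can be sketched as follows (a NumPy sketch with our own naming; averaging the absolute error per pair before exponentiating is an assumption):

```python
import numpy as np

def contrastive_loss(Y1, Y2, rng):
    """Negative-pair loss via batch shuffling.

    Y1, Y2: (B, K, H, W) outputs for the optical batch X and SAR batch Z.
    Z is shuffled to Z' to form (mostly) unpaired patches; the loss is the
    mean of exp(negative absolute error), so it approaches 1 only when
    unpaired outputs are too similar and vanishes as they drift apart.
    """
    Y2_shuffled = Y2[rng.permutation(len(Y2))]          # outputs for Z'
    ae = np.abs(Y1 - Y2_shuffled).mean(axis=(1, 2, 3))  # per-pair mean AE
    return float(np.exp(-ae).mean())                    # exp of the negative AE
```

The exponential is what controls the impact of accidental semantic pairs after shuffling: a pair whose outputs already differ contributes almost nothing, while only near-identical unpaired outputs are strongly penalized.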

F. Overall loss and network refinement
The He initialization process [41] is used to initialize all trainable weights of the network, W_1, ..., W_L, corresponding to the L layers. For updating the weights, we use stochastic gradient descent (SGD) with momentum [42].
The training process is executed in two steps of I_1 and I_2 epochs (summing to I). For each batch of data, J iterations are performed. For the first I_1 epochs, only the sum of the deep clustering losses (L_1 + L_2) is used to modulate the network weights. For the subsequent I_2 epochs, the losses are rotated: in one training iteration L_1 + L_2 is used as the loss function, in the following iteration the temporal consistency loss L_tc is used, and in the one after that the contrastive loss L_con is used. The combination of the three loss functions yields a balanced training process, taking into account coherent cluster formation, temporal feature consistency, and feature dissimilarity for unpaired patches. Alternatively, the sum of L_1 + L_2, L_tc, and L_con can also be used as an aggregated loss function. The self-supervised mechanism for network training is shown in Algorithm 1.
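The two-phase schedule can be made explicit with a small helper (a sketch; the function and loss names are ours, not the paper's code):

```python
def loss_schedule(epoch, iteration, I1=1):
    """Which loss drives the weight update at a given (epoch, iteration).

    Warm-up epochs (epoch < I1) use the deep clustering losses only;
    afterwards the three losses are rotated across consecutive iterations.
    """
    if epoch < I1:
        return "clustering"   # L1 + L2
    return ["clustering", "temporal", "contrastive"][iteration % 3]
```

With the experimental setting I_1 = 1, I_2 = 4, each of the four later epochs cycles through the three losses roughly J/3 times each.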

G. Change detection
Once the network is trained, it can be used to detect changes between X_1 and Z_2. Since the network is fully convolutional, it enables us to obtain a pixelwise feature vector of dimension K from X_1 and Z_2. Similar to [1], the pixelwise change information is captured by taking the magnitude (ℓ2 norm) of the difference of the feature vectors computed from the pre-change and post-change pixels. Changed pixels (Ω_c) generate a higher difference magnitude than the unchanged ones (Ω_nc), and they can be distinguished by using any suitable threshold determination scheme [43].
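The comparison step can be sketched as follows; Otsu's method is used here as one concrete threshold determination scheme, since it is the one adopted in the experiments. This is a NumPy sketch with our own naming and a simple 256-bin Otsu implementation.

```python
import numpy as np

def change_map(y1, y2):
    """Pixelwise l2 magnitude of the bi-temporal feature difference,
    binarized with Otsu's threshold (maximizing between-class variance).

    y1, y2: (K, H, W) pixelwise features for the pre- and post-change
    images. Returns a boolean (H, W) map, True = changed.
    """
    mag = np.linalg.norm(y1 - y2, axis=0)               # (H, W) magnitudes
    hist, edges = np.histogram(mag, bins=256)
    hist = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = edges[1], -1.0
    for t in range(1, 256):
        w0, w1 = hist[:t].sum(), hist[t:].sum()         # class weights
        if w0 == 0 or w1 == 0:
            continue
        m0 = (hist[:t] * centers[:t]).sum() / w0        # class means
        m1 = (hist[t:] * centers[t:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2                  # between-class variance
        if var > best_var:
            best_var, best_t = var, edges[t]
    return mag >= best_t                                # True = changed
```

Any other unsupervised thresholding scheme (e.g., ISODATA) can be dropped in at the binarization step without changing the rest of the pipeline.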

B. Compared methods
To verify the effectiveness of the proposed method, we compare it to related unsupervised change detection methods: 1) Change vector analysis (CVA) [17], [45], a classical difference-based unsupervised model for change detection. 2) Robust change vector analysis (RCVA) [19], which modifies CVA by taking into account pixel neighborhood effects. 3) Parcel change vector analysis (PCVA) [18], which incorporates the notion of objects (superpixels) in CVA. 4) Deep change vector analysis (DCVA) [1], which detects change by comparing bi-temporal deep features extracted using a pre-trained network. We used the second convolutional layer of a pre-trained VGGNet [46] for feature extraction. 5) An image-to-image transfer model based on an encoder-decoder network architecture that projects the pre-change optical image into the post-change SAR domain [47]. The CD map is obtained as the difference between the simulated pre-change SAR image (obtained as a projection of the pre-change optical image) and the original post-change SAR image. 6) Denoising autoencoder (DAE) based joint feature extraction [27]. 7) SCCN [26], which first identifies some unchanged pixels and uses them to learn a coupled network. While methods 1-3 are not deep learning based, the remaining ones are. Methods 1-4 do not have any explicit adaptation for multi-sensor input, while methods 5, 6, and 7 do.

C. Experimental settings
The proposed method and the compared methods are fed with similarly pre-processed images and are post-processed similarly. For the proposed method, we use I = 5 (I_1 = 1, I_2 = 4), J = 50, K = 4, L_1 = 4, and L_2 = 1. We show the architecture of the network in Table I. A relatively simple architecture is used, considering that the number of patches available to us is very small compared to the number of images in typical computer vision datasets. Moreover, our target images have coarse resolution (10 m/pixel) compared to natural images in computer vision. The spatial complexity in such coarse images can be handled by a simpler architecture than those used in computer vision. 64 × 64 patches are used to train the model, and patches are extracted from the bi-temporal scene with a stride of 32. The actual number of training patches for a scene depends on the size of the particular scene. E.g., for the Las Vegas scene (824 × 716 pixels), the number of patches extracted is 504. For optimization, stochastic gradient descent is used with the learning rate set to 0.001. We show results in terms of sensitivity (accuracy in percentage computed over reference changed pixels) and specificity (computed over reference unchanged pixels). In more detail, given true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), sensitivity is TP/(TP+FN) and specificity is TN/(TN+FP).
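The two evaluation measures can be computed directly from a predicted and a reference change map (a small sketch with our own naming):

```python
import numpy as np

def sensitivity_specificity(pred, ref):
    """Sensitivity and specificity in percent.

    pred, ref: boolean arrays of the same shape (True = changed).
    """
    tp = np.sum(pred & ref)      # changed pixels correctly detected
    tn = np.sum(~pred & ~ref)    # unchanged pixels correctly rejected
    fp = np.sum(pred & ~ref)     # false alarms
    fn = np.sum(~pred & ref)     # missed changes
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)
```

Reporting both measures matters here because change is rare: a map predicting "unchanged" everywhere would score a high specificity yet zero sensitivity.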

D. Results
Las Vegas: The reference CD map (ground truth) for Las Vegas is shown in Figure 4(a). Figure 4(b) shows the result obtained by the proposed method. For better visualization, a false color composition between the reference map and the obtained result is shown in Figure 4(c). The proposed method detects most of the changed objects with fewer false alarms than the compared methods. In many cases, the proposed method detects a changed object only partly, thus missing some objects partially (shown in pink in Figure 4(c)). The quantitative comparison (Table II) clearly shows the superiority of the proposed method over state-of-the-art unsupervised methods. This can be attributed to the superior capability of the proposed method to ingest multi-sensor multi-temporal images.
Further studies are conducted by varying different parameters on the Las Vegas image pair.
The number of training epochs I is varied as tabulated in Table III, while setting K = 4. We observe a clear improvement in performance from I = 1 to 2. Recalling from Section III-F that for the first I_1 = 1 epochs only the deep clustering loss is used, this shows that bi-temporal deep clustering by itself is not sufficient to learn the correspondence between the two images, and the other losses (the temporal consistency loss L_tc and the contrastive loss L_con) are required. From I = 2 onwards, we observe an initial increment in performance, followed by performance saturating or dropping. Despite the variation in performance, the proposed method outperforms all compared methods for I = 3, 5, 10.
The kernel number of the last layer (K) is varied from 2 to 16 in multiplicative steps of 2 while fixing I = 5. The variation in performance is shown in Table IV. While performance improves from K = 2 to K = 4, a gradual fall in performance is observed thereafter. Increasing the value of K is equivalent to allowing the scene to be partitioned into more classes. Since the spatial area of the scene is fixed and not too large (only a few hundred pixels on each side), a large number of classes potentially leads the model to learn irrelevant classes, impacting change detection performance.
Thresholding is done using Otsu's method [43], as it is popular in unsupervised CD methods [19], [48]. However, any other suitable method can be used, as shown in Table V. Results obtained by the ISODATA method [49], [50] and the adaptive method of [1] are similar to those of Otsu's method [43].
The loss plot visualization in Figure 8 shows the interplay between the different components of the loss. L_1 consistently decreases (Figure 8(a)), except that it rises for a while after epoch 1, when the temporal consistency loss L_tc and the contrastive loss L_con are introduced to the training process. L_tc and L_con balance each other, as shown in Figure 8(b).
The projection layers f_opt and f_sar need to be modeled independently, without sharing weights between them, to capture the different semantic properties of optical and SAR patches, as hypothesized in Section III-B. Here we test this hypothesis by instead sharing the weights between f_opt and f_sar. For I = 5 and K = 4, the proposed method then fails to detect most of the changes. This shows that it is crucial to model the optical and SAR patches differently.
The computation time requirement is not high. We tested our code on a machine equipped with a Quadro T2000 GPU, which is a low-end GPU. Processing the Las Vegas dataset (training over 5 epochs) takes approximately 460 seconds. The Las Vegas scene is 824 × 716 pixels with 10 m/pixel resolution; processing it is thus equivalent to processing an approximate geographic area of 8 × 7 ≈ 56 sq. km. Same-sensor bi-temporal input can also be ingested by the proposed method, though it is designed for multi-sensor CD. For a Las Vegas pre-change optical / post-change optical input, the proposed method obtains a sensitivity of 64.74% and a specificity of 97.89%. However, we note that some characteristics of the proposed method (e.g., the temporal consistency loss) are designed to reduce the representation gap of multi-sensor input, which is less relevant for single-sensor input. So the proposed method may not be the most suitable choice for single-sensor scenarios, as there are numerous existing CD techniques particularly designed for the same-sensor scenario [1].
Chongqing and Abu Dhabi: The reference CD map (ground truth) for Chongqing is shown in Figure 5(a). Figures 5(b) and 5(c) show the result obtained by the proposed method and the false color composition between the reference map and the obtained result, respectively. The proposed method outperforms all compared methods, as can be observed in the quantitative results in Table VI. A similar result is obtained for Abu Dhabi (Figure 6 and Table VII).
Montpellier: The reference CD map (ground truth) for Montpellier is shown in Figure 7(a). The proposed method (Figure 7(b)) outperforms most of the state-of-the-art methods, including PCVA (Figure 7(d)) and DCVA (Figure 7(e)), as shown in Table VIII. However, SCCN (Figure 7(f)) outperforms the proposed method. The performance of the proposed method is relatively poor for Montpellier, which can possibly be explained by: 1) the smaller size of the Montpellier scene, which implies less data to train the proposed self-supervised network, and 2) the uniform (mostly urban) geospatial characteristics of the Montpellier scene in comparison to Las Vegas and Chongqing, which show a complex distribution consisting of both urban and non-urban areas.

V. CONCLUSIONS
This paper proposed a self-supervised learning based method for CD in multi-sensor bi-temporal images where one of the images is acquired by an optical sensor and the other is captured by a SAR sensor. The proposed method effectively utilizes several concepts from self-supervised learning, e.g., deep clustering, multiple views, Siamese networks, and contrastive learning. Comparisons with existing methods working under the unsupervised scenario show that the proposed method brings significant improvement, especially when the target scene is large. Potential improvement of the proposed method may be achieved by prior learning of clusters on unrelated domains/sensors and transferring them to the target sensors on the fly [51]. Additionally, our future work will focus on extending the method to other application domains, e.g., comparison of biomedical images.

Fig. 1. Visual contrast for Las Vegas between (a) optical image (pre-change) and (b) SAR image (post-change). Optical and SAR images emphasize different properties of the target area, thus performing CD on them is challenging.

Fig. 3. Simplified architecture of the network with $L_1 = 4$ and $L_2 = 1$. Optical and SAR inputs are processed separately and subsequently fed to a common prediction layer.

$y^{b_1}$ has the same spatial dimension $R \times C$ as $x^{b_1}$ and has kernel number (or feature dimension) $K$. The deep clustering process is performed over the pixels, i.e., each pixel is assigned to a cluster. Without loss of generality, we henceforth explain the deep clustering process in reference to a generic pixel $y^{b_1}_{n}$ from $y^{b_1}$. The dimension of $y^{b_1}_{n}$ is $K$, which can be converted to a 1-dimensional label $c^{b_1}_{n}$ by argmax classification, i.e., by selecting the kernel/feature in $y^{b_1}_{n}$ that has the maximum value. If the $k$-th feature of $y^{b_1}_{n}$ is denoted by $y^{b_1}_{n}(k)$, then the label $c^{b_1}_{n}$ is obtained as:

$$c^{b_1}_{n} = \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} \; y^{b_1}_{n}(k)$$
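The argmax classification described above takes only a few lines; the snippet below is an illustrative NumPy sketch (array layout and names are our assumptions, not the paper's code):

```python
import numpy as np

def argmax_cluster_labels(y: np.ndarray) -> np.ndarray:
    """Assign each pixel to a cluster by argmax classification.

    y: feature map of shape (K, R, C), i.e., K kernels/features
       over an R x C spatial grid (layout is illustrative).
    Returns an (R, C) array of integer cluster labels in [0, K).
    """
    return np.argmax(y, axis=0)

# Toy example: K = 3 features over a 2 x 2 grid
y = np.array([[[0.1, 0.9], [0.3, 0.2]],
              [[0.8, 0.0], [0.5, 0.7]],
              [[0.2, 0.4], [0.1, 0.6]]])
labels = argmax_cluster_labels(y)
print(labels)  # [[1 0]
               #  [1 1]]
```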


Fig. 4. CD results for Las Vegas. CD maps: (a) Reference, (b) Proposed, (c) FCC between reference and proposed (correctly detected regions in black, false alarms in green, missed alarms in pink), (d) CVA, (e) RCVA, (f) PCVA, (g) DCVA, (h) Encoder-decoder, and (i) SCCN.

Fig. 5. CD results for Chongqing. CD maps: (a) Reference, (b) Proposed, (c) FCC between reference and proposed (correctly detected regions in black, false alarms in green, missed alarms in pink), (d) PCVA, (e) DCVA, (f) SCCN.

Fig. 6. CD results for Abu Dhabi. CD maps: (a) Reference, (b) Proposed, (c) FCC between reference and proposed (correctly detected regions in black, false alarms in green, missed alarms in pink), (d) PCVA, (e) DCVA, (f) SCCN.

Fig. 7. Qualitative CD results for Montpellier. CD maps: (a) Reference, (b) Proposed, (c) FCC between reference and proposed (correctly detected regions in black, false alarms in green, missed alarms in pink), (d) PCVA, (e) DCVA, (f) SCCN.

The pre-change optical images are acquired by the Sentinel-2 sensor and are taken from the Onera Satellite Change Detection (OSCD) dataset [44]; they have 10 m/pixel spatial resolution. The OSCD dataset is originally a single-sensor dataset consisting only of Sentinel-2 images. Recalling the importance of multi-sensor CD (see Section I), we extend this dataset by collecting post-change SAR Sentinel-1 images for the nearest available date to the post-change image in the original OSCD dataset. Both Sentinel-2 and Sentinel-1 sensors are part of the European Space Agency's Copernicus program. The four scenes are collected over Las Vegas (United States), Chongqing, Abu Dhabi, and Montpellier.
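The nearest-available-date pairing used to extend OSCD with Sentinel-1 imagery can be illustrated with a small sketch; the helper and the specific dates below are hypothetical, for illustration only:

```python
from datetime import date

def nearest_acquisition(target: date, available: list) -> date:
    """Pick the SAR acquisition date closest to the optical
    post-change date (hypothetical helper; the actual OSCD
    extension procedure is described in the text)."""
    return min(available, key=lambda d: abs((d - target).days))

# Illustrative dates only
optical_post = date(2018, 4, 30)
sar_dates = [date(2018, 4, 20), date(2018, 5, 2), date(2018, 5, 14)]
print(nearest_acquisition(optical_post, sar_dates))  # 2018-05-02
```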

Fig. 8. Evolution of the loss over training iterations for Las Vegas: (a) deep clustering loss, (b) temporal consistency loss and contrastive loss.

Fig. 2. Proposed unsupervised multi-sensor (optical-SAR) CD framework. The left-hand side denotes the self-supervised training process, while the right-hand side shows the CD process using the already trained model.

TABLE I: STRUCTURE OF THE NETWORK FOR PROCESSING ONE OF THE TWO INPUTS

TABLE II: COMPARISON OF DIFFERENT METHODS ON LAS VEGAS

TABLE III: VARIATION OF RESULT FOR LAS VEGAS AS I IS VARIED