Detecting Changes by Learning No Changes: Data-Enclosing-Ball Minimizing Autoencoders for One-Class Change Detection in Multispectral Imagery

Change detection is a long-standing and challenging problem in remote sensing. Very often, features about changes are difficult to model beforehand, which makes the collection of changed samples a challenging task. In comparison, it is much easier to collect numerous no-change samples. It is therefore possible to define a change detection approach using only easily available annotated no-change samples, which we henceforth call one-class change detection. Autoencoder networks trained on no-change data are natural candidates for this task due to their superior performance compared with other one-class classification models. However, autoencoders usually suffer from overgeneralization, i.e., they tend to generalize too well and thus risk properly reconstructing changed samples. In this article, we propose a novel data-enclosing-ball minimizing autoencoder (DebM-AE) that is trained with dual objectives, a reconstruction error criterion and a minimum volume criterion. The network learns a compact latent space in which encodings of no-change samples have low intraclass variance, which in turn eases the identification of changed instances. We conducted extensive experiments on three real-world datasets. The results demonstrate the advantages of the proposed method over its competitors. We make our data and code publicly available (https://gitlab.lrz.de/ai4eo/reasoning/DebM-AE; https://github.com/lcmou/DebM-AE).


I. INTRODUCTION
WE LIVE in a dynamic world where things change all the time. Recent advances in remote sensing platforms open up new possibilities for observing dynamic changes of our planet from a bird's-eye view. New images arrive daily, e.g., the Landsat 8 satellite acquires more than 700 scenes a day, and in the same time span, Sentinel-2 produces over 4 TB of fresh imagery. Hence, multitemporal data analysis is becoming increasingly important.

In the field of multitemporal image analysis [1], [2], change detection is a long-standing research problem. It deals with changes both in natural resources and in man-made structures. In the former case, for instance, change detection enables deforestation monitoring, disaster assessment, and crop stress detection [3], [4]. In the latter, it can aid in city monitoring and planning [5], [6], [7].

There are numerous methods for detecting changes in remote sensing imagery. For the most part, they can be divided into three classes, supervised, semisupervised, and unsupervised, differing in their use of labeled data. Both supervised and semisupervised methods require well-labeled samples from changed and no-change areas; the difference between them is that the latter also exploit unlabeled samples to assist the labeled ones during training. Albeit successful, these approaches suffer from the fact that in most situations, ground-truth data, particularly for changed regions, are difficult to acquire. This is because features about changes are often unknown or difficult to model beforehand, owing to the variety of possible kinds of change. For this reason, unsupervised change detection models are conceptually of high interest and have been widely studied over the past decades [8], [9], [10], [11]. The underlying …

(This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.)

… that our method has good transferability.

The remainder of this article is organized as follows. Section II gives a brief overview of current methods for change detection in bitemporal remote sensing images, with a focus on supervised, semisupervised, and unsupervised models, as well as existing approaches for one-class change detection. Section III presents the proposed DebM-AE architecture. In Section IV, the datasets, evaluation metrics, and competitors are introduced. We then present numerical results and a discussion of the observed model performance in Section V. Finally, Section VI concludes this article.

In this section, we review the state-of-the-art change detection methods, which are divided into three categories, supervised, semisupervised, and unsupervised models, differing in their use of labeled data. One-class models are reviewed in particular in Section II-A.

A. Supervised Change Detection

The development of supervised learning algorithms in the field of machine learning provides insights into supervised change detection methods. Early efforts went into conventional classifiers, to name a few, random forest [12], [13], [14] and support vector machine (SVM) [15], [16]. In recent years, with the rising popularity of deep learning, neural networks have been applied to change detection tasks. For instance, [17] uses a long short-term memory (LSTM) network, a kind of recurrent neural network (RNN), to detect changes in bitemporal multispectral images, and this model shows good generalization results. Later, [18] proposes a recurrent convolutional neural network (CNN) that is able to significantly remove salt-and-pepper noise from change detection maps and thus obtain better results. In [19], a 2-D CNN is introduced to learn high-level features, and in [20], the authors use a recurrent 3-D CNN to extract spectral-spatial-temporal features for change detection. In [21], the authors introduce a Siamese CNN to extract features of bitemporal images and use a weighted contrastive loss to alleviate the influence of imbalanced data. For high-spatial-resolution remote sensing images, this task can be treated as a dense (pixel-wise) prediction task and solved by semantic segmentation network architectures …

… for semisupervised node classification on this graph. In [44], …

Most change detection models are unsupervised, as collecting labeled data is difficult.
In the literature, a widely used methodology is based on image algebra; the classical models are change vector analysis (CVA) [8], [9] and its variations [10], [11]. Furthermore, some image-transformation-based unsupervised change detection approaches have been proposed to learn new feature representations from the original data so as to highlight changes while suppressing no changes in the new feature spaces. For instance, [47] and [48] apply principal component analysis (PCA), a well-known subspace learning algorithm, to stacked images and difference images, respectively, for unsupervised change detection tasks. In [49], the authors introduce a multivariate statistical transformation, termed multivariate alteration detection (MAD), to identify land cover changes in satellite images; this method is invariant to linear scaling of the input data. The authors of [50] propose to make use of slow feature analysis (SFA), which is capable of learning slowly varying features from a time series, for change detection. By doing so, in the learned feature space, differences among no-change pixel pairs are suppressed so that changed instances stand out more clearly. Later, deep SFA, a combination of networks and SFA, is proposed in [51]. In [52], self-supervised multitemporal segmentation is used for unsupervised change detection. In addition, some networks, e.g., autoencoders [53], [54], [55], [56], …

… In the corresponding figure, the axes are principal components, and red and blue indicate change and no change, respectively. As can be seen, a considerable number of changed samples share the same feature space as no-change instances, demonstrating the overgeneralization problem of the autoencoder.
There are two possible reasons: 1) some changed pixel pairs share common components (e.g., spectral bands) with no-change ones and 2) the encodings of changed examples have high intraclass variance, which makes it difficult to distinguish changes from no changes.
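For reference, the classical image-algebra baseline reviewed above, CVA, reduces to computing the per-pixel magnitude of the spectral difference vector between two coregistered images and thresholding it. The following NumPy sketch illustrates this; the function name and the threshold value are illustrative and not taken from the original works.

```python
import numpy as np

def cva_change_map(img_t1, img_t2, threshold):
    """Change vector analysis (CVA) sketch for two coregistered
    multispectral images of shape (H, W, B)."""
    # Per-pixel change vectors: spectral difference between the two dates.
    diff = img_t2.astype(float) - img_t1.astype(float)
    # Magnitude of each change vector (Euclidean norm over the bands).
    magnitude = np.sqrt(np.sum(diff ** 2, axis=-1))
    # Binary change map: large magnitudes are declared "changed".
    return magnitude > threshold
```

In practice, the threshold is often chosen by inspecting the magnitude histogram or with automatic schemes such as Otsu's method.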

Our insight is that reducing the intraclass distance of the training data (i.e., labeled no-change samples) in the latent space can mitigate the overgeneralization problem of the autoencoder in one-class change detection tasks. In this work, a new autoencoder architecture is proposed with dual objectives, a traditional reconstruction error criterion and a minimum volume criterion that minimizes the volume of the latent space enclosing the encoded representations of the training samples. Fig. 2 shows the overall architecture of the proposed DebM-AE.

1) Reconstruction Error Criterion:
In this article, we make use of the $\ell_2$-norm-based mean square error (MSE), that is,

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert_2^2,$$

as the reconstruction loss, where $N$ is the number of training samples and $\hat{\mathbf{x}}_i$ denotes the reconstruction of the input $\mathbf{x}_i$. The proposed model updates the encoder and the decoder to minimize the reconstruction errors of the inputs.

2) Minimum Volume Criterion: Our aim is to learn the encoder parameters $\theta_e$ jointly with minimizing the volume of the space enclosing the training samples in the latent space $\mathcal{Z}$. Here, we make use of a hypersphere as the enclosing space, as it is simple, effective, and easy to implement. Let $\mathbf{c}$ and $R$ be the center and radius of the hypersphere, respectively. The minimum volume criterion can then be defined as follows:

$$\min_{R,\,\theta_e} \; R^2 + \frac{1}{\mu N} \sum_{i=1}^{N} \max\{0,\; \lVert \mathbf{z}_i - \mathbf{c} \rVert_2^2 - R^2\}. \quad (4)$$

Minimizing the first term of this objective, $R^2$, minimizes the volume of the hypersphere. The second term is a penalty by which encodings lying outside the sphere ($\lVert \mathbf{z}_i - \mathbf{c} \rVert_2^2 > R^2$) are penalized; it pushes the hypersphere to include as much training data as possible. The hyperparameter $\mu \in (0, 1]$ controls the trade-off between the volume and boundary violations. By optimizing (4), the encoder learns to closely map no-change pixel pairs to the hypersphere center $\mathbf{c}$. In this way, we learn a compact latent space for no change. Note that the more compact the latent space, the better the robustness and transferability of our model. After sufficient training, changed samples can be better separated, as they are expected to lie farther from $\mathbf{c}$ or outside the sphere.

Furthermore, we consider a special case in which the hypersphere collapses to a point ($R = 0$). In this case, (4) can be simplified and rewritten as follows:

$$\min_{\theta_e} \; \frac{1}{N} \sum_{i=1}^{N} \lVert \mathbf{z}_i - \mathbf{c} \rVert_2^2.$$

In the inference phase, a sample whose reconstruction loss is greater than a threshold is predicted as a changed instance.
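As a concrete illustration, the two training criteria and the thresholded inference rule described above can be sketched in a few lines of NumPy. This is a simplified sketch under our own naming (`debm_losses` and `predict_changed` are hypothetical helpers), not the authors' implementation; in practice, the encodings and reconstructions come from a neural encoder/decoder trained by gradient descent.

```python
import numpy as np

def debm_losses(x, x_hat, z, c, R, mu=0.01):
    """Compute the two DebM-AE criteria (sketch only).

    x, x_hat : (N, D) inputs and their reconstructions
    z        : (N, K) latent encodings
    c, R     : hypersphere center (K,) and radius (scalar)
    mu       : trade-off in (0, 1] between volume and boundary violations
    """
    n = x.shape[0]
    # Reconstruction error criterion: l2-norm-based MSE over N samples.
    l_rec = np.sum((x - x_hat) ** 2) / n
    # Minimum volume criterion (soft boundary): R^2 plus a penalty for
    # encodings lying outside the sphere, scaled by 1 / (mu * N).
    dist2 = np.sum((z - c) ** 2, axis=1)
    l_vol = R ** 2 + np.sum(np.maximum(0.0, dist2 - R ** 2)) / (mu * n)
    return l_rec, l_vol

def predict_changed(rec_errors, threshold):
    # Inference: flag samples whose reconstruction loss exceeds the threshold.
    return rec_errors > threshold
```

In a full training loop, the two criteria would be combined into a single loss using the trade-off weight λ mentioned later in the text.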
involve city expansion and farmland changes. Fig. 3 shows the two images as well as the training and test set maps.

TABLE I
NUMBERS OF TRAINING AND TEST SAMPLES IN THE KUNSHAN, LAKE EPPALOCK, TAIZHOU, … DATASETS

… we make use of the following metrics.

Note that we compute the F1 score, precision, and recall for both the change and no-change classes.
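These per-class metrics, together with the Kappa coefficient reported in the experiments, can be computed directly from a binary change map. The following is a generic sketch of the standard definitions, not the authors' evaluation code; `binary_metrics` is a hypothetical helper name.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Per-class precision/recall/F1 and Cohen's kappa for a binary
    no-change (0) / change (1) prediction."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    out = {}
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        prec = tp / max(np.sum(y_pred == cls), 1)   # correct among predicted cls
        rec = tp / max(np.sum(y_true == cls), 1)    # correct among actual cls
        f1 = 2 * prec * rec / max(prec + rec, 1e-12)
        out[cls] = (prec, rec, f1)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    po = np.mean(y_true == y_pred)
    pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in (0, 1))
    out["kappa"] = (po - pe) / (1 - pe) if pe < 1 else 1.0
    return out
```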

Considering that the proposed method is designed for one-class change detection, we take one-class classification algorithms as the main competing methods to evaluate the effectiveness of our approach. To verify the benefit of using labeled no-change samples, we also include unsupervised change detection models in the comparisons. Besides, given that outlier detection approaches can be applied to one-class change detection tasks, we also compare our network with several outlier detection algorithms. All methods included in the experiments are summarized as follows. …

… the proposed model. Section V-E explores the feasibility of integrating spatial information into our model, and Section V-F shows the effect of the number of polygon annotations.

We evaluate the following hyperparameters: μ in (4) and λ in (6). The former controls the relative importance of the volume of the data-enclosing ball and the boundary violations; it effectively allows us to control the ratio of outliers during the training phase. The latter, λ, is a trade-off hyperparameter between the two objectives. Fig. 7 shows that the Kappa coefficient initially increases as μ grows but decreases slightly after μ = 0.01, and that large values of λ can degrade network performance. In our experiments, we use μ = 0.01 and λ = 1.
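One way to read μ as an outlier-ratio control, borrowed from soft-boundary one-class models such as Deep SVDD and not necessarily the authors' exact procedure, is to set the radius R to the (1 − μ)-quantile of the training distances to the center c, so that roughly a μ-fraction of the encodings falls outside the sphere. The helper below is hypothetical and purely illustrative.

```python
import numpy as np

def radius_from_mu(z, c, mu):
    """Pick the hypersphere radius so that about a mu-fraction of the
    training encodings lies outside the sphere (illustrative helper)."""
    dist = np.linalg.norm(z - c, axis=1)   # distances of encodings to center
    return np.quantile(dist, 1.0 - mu)     # (1 - mu)-quantile of distances
```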

We report the numerical results in Tables III-VI. The observed performance increase from VAE/AE to DebM-VAE/DebM-AE is a central point of our article. Take AE and DebM-AE as an example: the increases in Kappa are 0.0515, 0.2981, 0.0892, and 0.0727 on the Kunshan, Lake Eppalock, Taizhou, and Beirut datasets, respectively, which represents a significant gain. The fact that DebM-VAE/DebM-AE outperforms VAE/AE can be attributed to learning a more compact data-enclosing latent space for no-change pixel pairs. Besides the quantitative results, we also visualize the learned latent spaces of VAE, AE, DebM-VAE, and DebM-AE.

The right two images of Fig. 9 show the latent spaces of the …

… The one-class classifiers that use partially labeled data (i.e., labeled no-change pixel pairs for training) perform better than most unsupervised change detection approaches on the Kunshan and Beirut datasets but do much worse on the Lake Eppalock …

TABLE IV
NUMERICAL RESULTS FOR THE EVALUATED MODELS ON THE LAKE EPPALOCK DATASET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD

TABLE V
NUMERICAL RESULTS FOR THE EVALUATED MODELS ON THE TAIZHOU DATASET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD

TABLE VI
NUMERICAL RESULTS FOR THE EVALUATED MODELS ON THE BEIRUT DATASET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD

We also analyze the model complexity by measuring floating-point …

In this section, we discuss the transferability of the proposed method. Several experiments are conducted to verify the performance of a model trained on one dataset and tested on other unseen scenes. Since both Kunshan and Taizhou involve urban changes, we use these two datasets here. The experimental settings are as follows. … (Fig. 3). In this case, the method is marked with T → K.

Table VII reports the experimental results. Overall, the proposed method shows good transfer performance. For example, the Kappa and mean F1 of DebM-AE T→K (i.e., K→T for the reverse direction) are 0.8522 and 92.61%, respectively, which are second to those of DebM-AE (in Table V) but significantly better than those of other models trained on the Taizhou dataset. …

… Table VIII shows that the use of spatial information can boost the change detection performance.

In this section, we study the effect of the number of polygon annotations by training networks with different subsets. Taking the Kunshan dataset (which includes 15 polygon annotations) as an example, we train our model on six subsets, produced by randomly retaining 9, 10, 11, 12, 13, and 14 polygon-level annotations, respectively. As shown in Fig. 8, the performance of DebM-AE gradually decreases as the number of polygon annotations is reduced. Therefore, as a compromise between classification accuracy and …

[11] S. Saha, F. Bovolo, and L. Bruzzone, "Unsupervised deep change vector analysis for multiple-change detection in VHR images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6.