SemiSiROC: Semisupervised Change Detection With Optical Imagery and an Unsupervised Teacher Model

Change detection (CD) is an important yet challenging task in remote sensing. In this article, we underline that the combination of unsupervised and supervised methods in a semisupervised framework improves CD performance. We rely on half-sibling regression for optical change detection (SiROC) as an unsupervised teacher model to generate pseudolabels (PLs) and select only the most confident PLs for pretraining different student models. Our results are robust to three different competitive student models, two semisupervised PL baselines, two benchmark datasets, and a variety of loss functions. While the performance gains are highest with a limited number of labels, a notable effect of PL pretraining persists when more labeled data are used. Further, we outline that the confidence selection of SiROC is indeed effective and that the performance gains generalize to scenes that were not used for PL training. Through the PL pretraining, SemiSiROC allows student models to learn more refined shapes of changes and makes them less sensitive to differences in acquisition conditions.


I. INTRODUCTION
CHANGE detection (CD) is the task of segmenting changing pixels over time in multitemporal Earth observation data. In the face of a changing planet, CD is at the core of many relevant monitoring tasks. It allows us to study the temporal evolution of forests [1], [2], [3], urban areas [4], [5], coastal and maritime regions [6], [7], and the effects of natural disasters [8], [9], [10], [11], [12]. CD methods face a number of hurdles related to the acquisition conditions between the different times the images are collected. This includes but is not limited to illumination conditions, clouds and shadows, acquisition angles, and the definition of what constitutes a change [13]. Despite these challenges, several trends have been beneficial for the methodological progress in CD in recent years. First, open data policies, for example, in the Copernicus program [14], increase accessibility and availability of multitemporal Earth observation data [15]. Second, technological progress results in increasing spatial and temporal resolution of satellite data with up to daily imagery [16]. Third, methodological progress in image recognition, particularly deep learning [17], has also fueled a variety of improvements in artificial intelligence for Earth observation including CD [18], [19], [20].
However, obtaining large-scale labeled data for CD remains a challenge. Unsupervised CD methods [37], [38], [39], [40], [41], therefore, learn without labeled data to circumvent this issue. Many methods also utilize the advances in deep learning for unsupervised CD. For example, Saha et al. introduce deep change vector analysis (DCVA) for high-resolution imagery, which combines ideas from classical image differencing with a deep convolutional feature extractor [37]. DCVA has also been extended in combination with self-supervised pretraining [42] and refined further for medium-resolution images [38]. A generative approach is used in [43] to model the difference image in an unsupervised fashion. Zhan et al. [44] rely on an initial classification of changing superpixels with a fully convolutional neural network (CNN). These superpixels are then categorized by uncertainty and used to train a classifier in a second step.
Still, in unsupervised CD, particularly with lower resolution, many methods reach high performance also without the use of deep features. Sibling regression for optical change detection (SiROC) [39] is inspired by exoplanet search and compares pixels against their distant neighborhood to identify changes in optical imagery. Furthermore, image differencing, also called change vector analysis [45], and its extensions [46], [47], [48] still play a role in practice.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Semisupervised approaches bridge the gap between unsupervised and supervised approaches. These methods try to combine labeled data with larger amounts of unlabeled data to support the training process. Among the first to apply semisupervised learning in CD were Bovolo et al. [49]. They use a Bayesian thresholding mechanism to set up an adequately defined binary semisupervised support vector machine (S3VM). A modified self-organizing feature map (SOFM) uses only a limited set of initial labels to compute soft labels for unlabeled additional input [50]. Chen et al. [51] rely on probabilistic Gaussian processes (GP) as a first step with labeled and unlabeled data. The outputs of the GP classifier are then refined with a Markov random field regularizer. A Laplacian regularized metric learning mechanism is used in [52] to exploit unlabeled training data at scale for hyperspectral image CD. For very high spatial resolutions, graph convolutional networks (GCNs) are also effective for semisupervised learning by encoding multitemporal images as a graph [53].
One particularly effective direction in semisupervised learning in general image recognition is student-teacher models [54]. Typically, there is a teacher model that is trained on labeled data and predicts additional labels for images where ground truth is not available. Then, a student model uses these additional labels, referred to as pseudolabels (PL), during the training. With Earth observation data, PLs have also been shown to be effective for hyperspectral image classification [55]. PLs are also related to unsupervised CD approaches for small scenes, which rely on an initial difference image or change classification and finetune this further with another unsupervised method [43], [56], [57]. This is similar to using PLs although these approaches are purely unsupervised and are applied only to single scenes instead of large-scale training. Li et al. [58] use PLs explicitly for CD in SAR images but stay in the unsupervised domain. Similarly, Gao et al. [59] train convolutional wavelet neural networks with automatically generated labels for sea ice CD with SAR images.
In many student-teacher settings, the actual labels are used at least in some capacity in the pseudolabeling. However, this can be challenging in scenarios with limited labels as in CD. Additionally, applications of methods in regions outside their training data often require some robustness to unseen regions [60]. In this article, we therefore propose SemiSiROC, where we use an unsupervised method with well-calibrated uncertainties for PL training. The uncertainty score for each prediction allows us to keep only high-quality PLs for pretraining. In the second step of the semisupervised method, we finetune student models with the actual labels to improve optical CD performance. We evaluate our results on a binary version of the DynamicEarthNet benchmark [61] as well as the OSCD dataset [24] and compare the effectiveness of our strategy with five competitive CD models as students: ChangeFormer [21], BIT [25], DTCDSCN [29], FC-Siam-Diff [23], and FC-Siam-Conc [23]. Although SemiSiROC is most effective in limited label scenarios, we also find that even with a sizeable amount of 1000 labeled image pairs, SemiSiROC notably boosts performance for all tested models. While student-teacher models themselves are not new in remote sensing, our novelty lies in the components specifically designed for CD on large-scale datasets and further validation on a global dataset of such scale. We have three main contributions.
1) We present SemiSiROC, a semisupervised CD method in optical remote sensing that combines advanced supervised models with unsupervised pseudolabeling.
2) Building on the confidence filtering of SiROC, we devise a mechanism to prioritize relevant scenes during PL filtering.
3) We propose a detailed experimental setup for CD subject to geographic disparity, based on the recently launched publicly available DynamicEarthNet dataset [61]. This experimental setup will be helpful for other researchers to pursue research in this direction.
Our experiments on this setup and the OSCD [23] benchmark show that semisupervised learning is indeed helpful.

A. SemiSiROC
Let us assume we have two different collections of images, D and U. D is a collection of $N_D$ bitemporal pairs with associated pixelwise change/no-change labels. On the other hand, U is a collection of $N_U$ unlabeled bitemporal pairs. Generally, $N_U > N_D$; however, this is not a strict assumption. U and D can be acquired over different geographic areas or continents, so they need not represent the same geographic distribution. Our goal is to exploit both D and U to learn a CD model. Toward this, we design a semisupervised pipeline that allows exploiting U for model training even if labels for it are not available. We exploit a teacher-student model where the teacher model labels the images and selects relevant samples from U. This allows its student to exploit the label space D ∪ U instead of D. Therefore, we train with PLs first before we go on to real labels. This is consistent with semisupervised literature [62] and has the underlying assumption that the model can immensely benefit from PLs as a first step of training, which can be subsequently refined with actual labels.
The PLs for pretraining are based on SiROC [39], an unsupervised method for optical CD. We average the confidence on the cube level and as a default choice use the top 25%. Then, we train a student model with the preselected locations and PLs first before finetuning with the actual labels. Since the teacher model exploits SiROC in a semisupervised setting, we call our approach SemiSiROC.
Algorithm 1 outlines SemiSiROC in pseudocode in more depth. Given the unlabeled collection U, the labeled collection D, the corresponding labels L, and a supervised CD model, the desired output is a binary change segmentation. At first, we define a collection of confidence scores C and PLs P. Then, we loop over the elements of U and obtain PLs and confidence scores with SiROC for each image pair. Before semisupervised pretraining, we filter P and U to only use the scenes with the highest confidence, which is defined as U_P. These scenes are used as input for the pretraining of the CD model before training with actual labels in the final step.

Algorithm 1
    ...
    4: C.append(C_u)
    5: P.append(P_u)
    6: end for
    7: U_P = U.top_quarter(C)
    8: P_P = P.top_quarter(C)
    9: model.train(U_P, P_P) {PL training}
    10: model.train(D, L) {Finetuning}
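The control flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: `toy_teacher` is a hypothetical stand-in for SiROC (a simple image difference), `DummyModel` is a placeholder student that only records its training stages, and confidence is averaged per image pair here rather than per cube.

```python
import numpy as np

def toy_teacher(pre, post):
    """Hypothetical stand-in for SiROC: returns (confidence map, pseudolabel map)."""
    diff = np.abs(post.astype(float) - pre.astype(float)).mean(axis=-1)
    pseudolabel = (diff > diff.mean()).astype(np.uint8)
    # Toy confidence in [0, 1]: distance of the difference from the threshold.
    spread = np.abs(diff - diff.mean())
    confidence = spread / (spread.max() + 1e-8)
    return confidence, pseudolabel

def select_top_quarter(unlabeled_pairs, teacher):
    """Label every pair in U, then keep the quarter with the highest mean confidence."""
    conf_scores, pseudolabels = [], []
    for pre, post in unlabeled_pairs:
        c_u, p_u = teacher(pre, post)
        conf_scores.append(float(c_u.mean()))  # one average score per pair
        pseudolabels.append(p_u)
    n_keep = max(1, len(unlabeled_pairs) // 4)
    order = np.argsort(conf_scores)[::-1][:n_keep]  # highest confidence first
    u_p = [unlabeled_pairs[i] for i in order]
    p_p = [pseudolabels[i] for i in order]
    return u_p, p_p

class DummyModel:
    """Placeholder student that only records the order of its training stages."""
    def __init__(self):
        self.stages = []

    def train(self, inputs, targets, stage):
        self.stages.append((stage, len(inputs)))

rng = np.random.default_rng(0)
U = [(rng.integers(0, 255, (32, 32, 3)), rng.integers(0, 255, (32, 32, 3)))
     for _ in range(8)]
D = U[:2]                                              # toy labeled collection
L = [np.zeros((32, 32), dtype=np.uint8) for _ in D]    # toy labels

U_P, P_P = select_top_quarter(U, toy_teacher)
model = DummyModel()
model.train(U_P, P_P, stage="pl_pretraining")  # PL training
model.train(D, L, stage="finetuning")          # finetuning with real labels
```

The ordering of the two `train` calls is the essential point: the student first sees only confidence-filtered PLs and is refined with the real labels afterwards.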
While the proposed SemiSiROC approach is similar to many semisupervised learning strategies [62], note that our approach is distinct in the following three ways: 1) how we generate the PLs with an unsupervised CD method; 2) how we select the samples for student training based on a well-calibrated uncertainty; 3) how we exploit them for global CD.

B. Unsupervised Teacher Model
The goal of the teacher model is to assign PLs to some samples from U with sufficient confidence that they can be used later for training the CD (student) model. Since U and D may not necessarily be from the same distribution, a teacher model trained on D may bias the distribution of PLs for U by overfitting to D. This is particularly relevant in the geo context where different locations and points in time can quickly change the data-generating distribution [63]. We argue that the teacher model should refrain from using the actual labels in any form to obtain the PLs. If the PL extraction process uses the actual labels, this would make them interdependent and hamper generalization. Thus, the teacher model should be based on unsupervised learning in this case. Additionally, semisupervised pretraining is more flexible with unsupervised PLs, and our pretrained model can serve as a starting point for other CD applications without the need to retrain the teacher model on new datasets with new labels to obtain other PLs. Therefore, we propose to use an unsupervised teacher model to incentivize more robustness to spatial generalization in the PLs. This is in contrast to many other semisupervised approaches with PLs, which rely on teacher models that have seen at least some of the actual labels [62]. As unsupervised teacher model, we employ SiROC [39]. While the method is highly performant, we pick it as a PL source or so-called teacher model mainly because it comes with a built-in, well-calibrated confidence score ranging from 0 (low) to 1 (high) with its prediction for each pixel. This allows us to filter PLs based on their confidence and only train on high-confidence labels. As this confidence score is closely connected to the quality of the PL, we hypothesize that algorithms should learn better with selected PLs only. Out of the $N_U$ total samples in U, only a subset is chosen after confidence filtering for pretraining.
In the following, we explore SiROC in more depth.
1) Sibling Regression for Optical Change (SiROC): SiROC models a pixel as a linear combination of a set of neighboring pixels n at a certain time t in a time series. At time t + 1, the value of the respective pixel is predicted based on the neighbors n at t + 1. The deviation between the actual and the predicted pixel value is interpreted as a change signal. If the difference is high, this is seen as an indication of change as the pixel seems to have undergone a change compared to its neighborhood. The comparison against the neighborhood serves to eliminate local or image-wide trends as sources of false positives for changes.
More formally, given a channel of a multispectral image I at time t and t + 1, the core of the predicted change segmentation P is based on the following equation:

$$P = \begin{cases} 1, & \text{if } \left| I_{t+1} - \hat{I}_{t+1} \right| > o \\ 0, & \text{otherwise} \end{cases}$$

where o is the Otsu threshold [64] and $\hat{I}_{t+1}$ is the predicted image at time t + 1 based on half-sibling regression. To extend this to multiple channels C, the absolute sum of the difference between the predicted and the actual image is taken as

$$\left| \sum_{c=1}^{C} \left( I_{t+1,c} - \hat{I}_{t+1,c} \right) \right|.$$

For formal details on how $\hat{I}_{t+1,c}$ is obtained given a set of neighbors, we refer to [39]. SiROC ensembles over many mutually exclusive neighborhoods and relies on majority voting between the models for its final prediction. This iterative process uses mutually exclusive sets of neighboring pixels that are increasingly more distant from the pixel of interest itself. Relevant parameters for this process are the maximum neighborhood size and the step size of the ensemble. The number of ensemble members is given as the maximum neighborhood size divided by the step size. We use SiROC with the defaults presented in [39], with one deviation: we reduce the step size of the ensemble from 8 to 2. This results in 100 models with a maximum neighborhood size of 200 and allows for more variation in the uncertainty estimates.
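As an illustration of the thresholding step only, the following sketch assumes the predicted image is already available (how SiROC obtains it is described in [39] and is not reproduced here). The function names are hypothetical, the multichannel signal is read as the absolute value of the channel-wise summed residual, and the Otsu threshold is a textbook histogram implementation rather than the one used by the authors.

```python
import numpy as np

def otsu_threshold(values, n_bins=256):
    """Textbook Otsu: pick the cut that maximizes the between-class variance."""
    hist, edges = np.histogram(values.ravel(), bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                      # pixels in bins up to and including k
    w1 = w0[-1] - w0                          # pixels in bins above k
    cum_sum = np.cumsum(hist * centers)
    mu0 = cum_sum / np.maximum(w0, 1)         # mean of the lower class
    mu1 = (cum_sum[-1] - cum_sum) / np.maximum(w1, 1)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2
    # Use the upper edge of the best bin so the cut matches the class split.
    return edges[1:][np.argmax(between)]

def change_map(actual_post, predicted_post):
    """Binary change map from an actual and a predicted image of shape (H, W, C)."""
    # Assumed reading: absolute value of the residual summed over channels.
    signal = np.abs((actual_post - predicted_post).sum(axis=-1))
    mask = (signal > otsu_threshold(signal)).astype(np.uint8)
    return mask, signal

rng = np.random.default_rng(0)
predicted = np.zeros((16, 16, 3))
actual = rng.uniform(-0.01, 0.01, size=(16, 16, 3))  # small residual noise
actual[4:8, 4:8, :] += 1.0                            # a 4 x 4 block that changed
mask, signal = change_map(actual, predicted)
```

In this toy case the residual of the changed block is far larger than the noise, so the threshold separates the two groups cleanly.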
The number of votes, as shown in [39], can be interpreted as a well-calibrated uncertainty and is used in this work as a confidence score. This is because the performance of SiROC increases with its confidence. Therefore, we use SiROC in combination with three supervised student models for CD.
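A small sketch of how ensemble votes can be turned into a prediction and a confidence score is given below. The mapping from votes to a score in [0, 1] is an assumption made for illustration (rescaled agreement with the majority); the exact definition used by SiROC is given in [39]. The ensemble size follows from the parameters stated above.

```python
import numpy as np

def ensemble_vote(member_predictions):
    """member_predictions: array of shape (n_members, H, W) with binary change maps."""
    n_members = member_predictions.shape[0]
    change_share = member_predictions.sum(axis=0) / n_members
    # Majority vote; an exact tie is resolved as "no change" in this sketch.
    prediction = (change_share > 0.5).astype(np.uint8)
    # Assumed confidence: 0 for a split vote, 1 for a unanimous vote.
    confidence = np.abs(2.0 * change_share - 1.0)
    return prediction, confidence

max_neighborhood, step_size = 200, 2
n_members = max_neighborhood // step_size  # number of ensemble members

rng = np.random.default_rng(0)
members = rng.integers(0, 2, size=(n_members, 4, 4))
members[:, 0, 0] = 1  # every member votes "change" for this pixel
members[:, 1, 1] = 0  # every member votes "no change" for this pixel
pred, conf = ensemble_vote(members)
```

With a step size of 8 instead of 2, the same neighborhood would yield only 25 members, which gives a coarser set of possible confidence values.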

C. Student Model
Once the teacher model is used to select the pseudosamples from U, in principle any machine-learning-based classifier can be used as the student model. The training involves the following two steps: 1) training with the pseudolabeled samples selected from U, obtained in Section II-B; 2) finetuning with the labeled dataset D.
To illustrate that our SemiSiROC can work with a diverse set of classifiers, we chose several competitive supervised CD architectures. They are outlined in more detail as follows.
FC-Siam-diff [23] is a fully convolutional Siamese neural network inspired by the UNet architecture [30]. Pre- and postimages are processed in two separate parallel streams with shared weights, which are only merged after the convolutional layers of the network. In contrast to a classic concatenation of features, this network takes the absolute difference of the encoding streams. This allows the model to focus on temporal differences in the image pair, which is well suited for CD tasks. These differences are passed as inputs to the upsampling steps. Allowing feature differences to be passed without further processing far into the network allows the network to handle simple decisions without unnecessary complexity.
FC-Siam-conc [23] is similar to FC-Siam-diff with one major distinction. Instead of taking feature differences of the encoding streams, the features are concatenated. This gives the model more flexibility but nudges it less directly toward a temporal comparison of features.
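The difference between the two fusion strategies can be made concrete with a toy example. The feature maps below are hypothetical arrays standing in for the outputs of the two shared-weight encoder streams; the actual networks in [23] apply this fusion at several scales inside a UNet-like decoder.

```python
import numpy as np

def fuse_diff(feat_pre, feat_post):
    """FC-Siam-diff style fusion: absolute difference of the two streams."""
    return np.abs(feat_pre - feat_post)

def fuse_conc(feat_pre, feat_post):
    """FC-Siam-conc style fusion: concatenation along the channel axis."""
    return np.concatenate([feat_pre, feat_post], axis=0)

# Hypothetical encoder features with layout (channels, height, width).
feat_pre = np.ones((16, 8, 8))
feat_post = feat_pre.copy()
feat_post[:, 2:4, 2:4] += 0.5  # only this region differs between the dates

diff_features = fuse_diff(feat_pre, feat_post)
conc_features = fuse_conc(feat_pre, feat_post)
```

The difference features are zero wherever the two dates agree, so the temporal comparison is built into the input of the decoder, whereas the concatenated features leave that comparison to be learned.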
DTCDSCN [29] stands for dual task constrained deep Siamese convolutional network. It is a convolutional model, which performs semantic segmentation and CD simultaneously. This is helpful for change detection since a prior understanding of objects and their size from semantic segmentation can be utilized for the CD task.
ChangeFormer [21] is also a Siamese network with a transformer-based encoder that reaches competitive performance on the LEVIR-CD [65] and DSIFN-CD [22] benchmarks. The hierarchical transformer encoder uses four transformer blocks with shared weights in each branch. After every transformer block, a difference module is applied to compare differences at different abstraction levels. These differences are then passed to a lightweight multilayer perceptron decoder, which upsamples the features and computes the final predicted change map.
Bitemporal image transformer (BIT) [25] also relies on self-attention rather than only deep convolutional features in a transformer framework. It has three main elements: a Siamese semantic tokenizer, a transformer encoder, and a transformer decoder. The Siamese backbone extracts convolutional features and inputs them into the semantic tokenizer. Inspired by advances in language processing, the tokenizer pools the image features into a compact set of vocabulary. The compact tokens are converted back to the pixel space and fed into a CNN prediction head. As a CNN backbone for the feature extraction, ResNet18 is used following the main paper.

III. EXPERIMENTAL VALIDATION
A. Data
1) DynamicEarthNet: We base our analysis on a modified version of the DynamicEarthNet dataset [61]. This is because it allows benchmarking CD algorithms with areas of interest (AOIs) across the globe and covers a variety of different changes that are not specific to a certain use case such as buildings or urban regions only. Both of these properties make the dataset well-tailored to binary CD in an application-agnostic way. It contains monthly, manual land cover annotations for two years with Planet imagery for 75 AOIs across the globe. The locations were selected to include a wide spectrum of land cover changes across seven classes.
We pick the labels of the first and last month of each AOI and compute a binary mask of changing land cover. This maximizes change and also ensures a certain difference in the scenes. The corresponding Planet Fusion images are highly preprocessed as an analysis-ready product, with steps such as temporal gap filling of clouds and shadow removal. Each scene is 1024 × 1024 pixels in size with a resolution of 3 m per pixel, which results in an area per scene of about 10 km². To be consistent with the image size in [21], we split each scene into 16 RGB images of 256 × 256 pixels. This results in a total of 1200 pairs of pre- and postimages taken two years apart. The class balance in the resulting dataset is about 80% no change and 20% change.
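The preprocessing described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that a pixel counts as changed whenever its land cover class differs between the first and the last month; the random label maps merely stand in for real annotations.

```python
import numpy as np

def binary_change_mask(labels_first, labels_last):
    """1 where the land cover class differs between the two dates, else 0."""
    return (labels_first != labels_last).astype(np.uint8)

def split_into_tiles(scene, tile=256):
    """Split an (H, W, ...) array into non-overlapping tile x tile patches."""
    h, w = scene.shape[:2]
    if h % tile or w % tile:
        raise ValueError("scene size must be a multiple of the tile size")
    return [scene[r:r + tile, c:c + tile]
            for r in range(0, h, tile)
            for c in range(0, w, tile)]

rng = np.random.default_rng(0)
first = rng.integers(0, 7, size=(1024, 1024))      # toy map with seven classes
last = first.copy()
last[:100, :100] = (last[:100, :100] + 1) % 7      # change the class in one corner

mask = binary_change_mask(first, last)
tiles = split_into_tiles(mask)
n_pairs = 75 * len(tiles)  # 75 AOIs, each split into the same number of tiles
```

Each 1024 × 1024 scene yields a 4 × 4 grid of tiles, which is where the total number of image pairs stated above comes from.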
Our baseline train, validation, and test split is visible in Fig. 1. Locations are available across the globe, which is relevant to test generalizability to unseen regions where all continents except Antarctica are covered. Following the DynamicEarthNet terminology, we refer to the locations also as cubes given that the 2D images also vary in time. The cubes do not only differ by their geography but also by the type of change. The dataset covers locations from coastal areas, islands, urban regions, agricultural areas, and forests. This shows the diversity of change in practical applications, which makes this dataset challenging.
The cubes based in the continental US are used as training (blue), the validation data are taken from central America (green) and we test with the remaining cubes from across the globe. This simulates label scarcity in global CD tasks where generalizability to unseen regions is a key requirement. Particularly, annotated data in low and middle-income countries are often relatively rare. However, to validate our results against this choice, we use other splits with more training data (16, 32, 64 cubes) as an ablation study below.
2) Onera Satellite Change Detection (OSCD) [23]: As a secondary dataset, we rely on OSCD, which in total contains 24 before and after pairs of Sentinel-2 images in urban areas across the globe but we only use the ten pairs in the test set. To be consistent with our training efforts on DynamicEarthNet, we only include the RGB channels and crop 256 × 256 images from the original scenes. As OSCD image pairs are not square and vary in size, we pad the images to the next multiple of 256 and mask the added points during evaluation of the change prediction.
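A minimal sketch of the padding and masking step is shown below. The function name is hypothetical, and the choices of zero padding and of padding at the bottom and right edges are assumptions; the text above only fixes the target size and the fact that padded pixels are excluded from evaluation.

```python
import numpy as np

def pad_to_multiple(image, tile=256):
    """Pad an (H, W, ...) image to the next multiple of `tile`; return image and validity mask."""
    h, w = image.shape[:2]
    pad_h = (-h) % tile
    pad_w = (-w) % tile
    pad_width = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad_width, mode="constant")  # assumed: zero padding
    valid = np.zeros(padded.shape[:2], dtype=bool)
    valid[:h, :w] = True  # True only for pixels of the original scene
    return padded, valid

image = np.ones((300, 500, 3))  # toy scene that is neither square nor a multiple of 256
padded, valid = pad_to_multiple(image)

# Evaluation only considers the valid pixels.
prediction = np.zeros(padded.shape[:2], dtype=np.uint8)
truth = np.zeros(padded.shape[:2], dtype=np.uint8)
accuracy = float((prediction[valid] == truth[valid]).mean())
```

Indexing predictions and labels with the validity mask keeps the added border from inflating the share of correctly predicted no-change pixels.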

B. Training and Evaluation
Our goal is to evaluate the effectiveness of a PL pretraining step. Therefore, we compare SiROC confidence pretraining for a variety of specifications including the aforementioned models but also different choices of training sets, PL sets, and training losses. We train each model until convergence with and without a pretraining step. For this study, experiments were conducted with a single NVIDIA Quadro P4000. We acknowledge that semisupervised pretraining requires an additional computational effort compared to finetuning. PL training for 50 epochs with the top quarter of scenes by confidence takes about 15 min with the P4000 for the FC-Siam-diff model. However, PL training has to be done only once and allows for all kinds of CD applications.
The following specifications are used for all experiments to ensure comparability. We train with Adam as an optimizer with a batch size of 32 and a starting learning rate of 0.0001 and linear weight decay. We evaluate our results based on three popular criteria: Accuracy, mean IOU (MIOU), and mean F1 Score. Formally, in terms of false positives (FP), true positives (TP), false negatives (FN), and true negatives (TN), these criteria have the following definitions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}.$$

Accuracy simply asks how often the prediction is correct relative to the total number of predictions. MIOU is the mean of the per-class IOU, with

$$\text{IOU} = \frac{TP}{TP + FP + FN}.$$

In comparison to accuracy, the IOU criterion eliminates TN from the picture per class. Similarly, the mean F1 score is the mean of the per-class F1, which balances precision and recall:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

Precision is defined as TP/(TP + FP) and recall as TP/(TP + FN). Every model is run for five different seeds and reported scores are, therefore, a mean with the respective standard deviation in brackets.
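The three criteria can be computed from per-class confusion counts as in the sketch below. The function names are hypothetical, and the sketch assumes that the mean in MIOU and mean F1 is taken over the change and no-change classes, each treated in turn as the positive class.

```python
import numpy as np

def safe_div(numerator, denominator):
    """Return 0.0 instead of dividing by zero (e.g., when a class is absent)."""
    return float(numerator) / float(denominator) if denominator else 0.0

def confusion_counts(pred, truth, positive):
    """TP, FP, FN, TN when `positive` is treated as the positive class."""
    p, t = pred == positive, truth == positive
    return (int(np.sum(p & t)), int(np.sum(p & ~t)),
            int(np.sum(~p & t)), int(np.sum(~p & ~t)))

def cd_scores(pred, truth):
    """Accuracy, mean IOU, and mean F1 over the no-change (0) and change (1) classes."""
    ious, f1s = [], []
    for cls in (0, 1):
        tp, fp, fn, _ = confusion_counts(pred, truth, cls)
        ious.append(safe_div(tp, tp + fp + fn))
        precision = safe_div(tp, tp + fp)
        recall = safe_div(tp, tp + fn)
        f1s.append(safe_div(2 * precision * recall, precision + recall))
    accuracy = float(np.mean(pred == truth))
    return accuracy, float(np.mean(ious)), float(np.mean(f1s))

truth = np.array([1, 1, 0, 0, 0, 0, 0, 0])
pred = np.array([1, 0, 1, 0, 0, 0, 0, 0])
accuracy, miou, mf1 = cd_scores(pred, truth)
```

In this toy case one of two changed pixels is found and one unchanged pixel is flagged by mistake, so accuracy stays at 0.75 while MIOU and mean F1 are pulled down by the weak change class.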

C. DynamicEarthNet Results
Table I outlines the main results of this article. Overall, we test PL pretraining with SiROC with five different competitive models. Each pair of rows for one model compares the scores with and without pretraining on the confident PLs. All specifications are run five times with different seeds to increase the robustness of the result against an unrepresentative seed. PL pretraining is done with a focal loss (FL) and uses only the top 25% of cubes based on average SiROC confidence per cube. Training with the real labels uses the split of Fig. 1 and an MIOU loss.
At first, FC-Siam-diff with PL pretraining reaches an overall accuracy of 0.7812 with a MIOU score of 0.4854 and a Mean F1 Score of 0.6029. This makes it the best model in Table I overall according to all three criteria and notably better than its counterpart without pretraining. FC-Siam-diff without SiROC pretraining is about 15 percentage points (p.p.) lower in accuracy, 7 p.p. lower in MIOU, and about 3 p.p. lower in terms of mean F1 score. Further, standard deviations of performance are visibly lower with confidence-filtered PL pretraining for FC-Siam-diff. FC-Siam-conc does not seem competitive here in comparison, with a fairly low accuracy of around 62% with PLs and 56% without them. It seems that without the explicit feature difference, the model is not incentivized to pay enough attention to temporal differences for the final change segmentation. Therefore, it has trouble distinguishing changes from nonchanges. This is improved by the use of PLs, but the issue remains large in comparison to FC-Siam-diff.
Similarly, the scores of ChangeFormer improve and stabilize notably by an even larger margin although the baseline performance is comparably bad. The general effectiveness is also confirmed when looking at BIT and DTCDSCN although the margins seem slightly lower. Given that DTCDSCN, and particularly, FC-Siam-conc seem weaker convolutional baselines than FC-Siam-diff, we focus on the latter, ChangeFormer and BIT, for the remainder of this paper for the sake of brevity. As an additional baseline, the performance of SiROC on the test set is given as a reference point.
Generally, SiROC performs decently on the dataset given that it is an unsupervised method and often even outscores the supervised baselines with few labels. The information contained in the PLs and the capacity of the methods combine effectively in our semisupervised strategy. The respective scores with PL pretraining are consistently and substantially higher than those of the SiROC baseline. Fig. 2 visualizes model predictions for eight image pairs of the models in Table I. Notably, the illumination conditions between the pre- and postimages differ slightly, which is often a challenge in CD problems [37]. The first comparison is for FC-Siam-diff with training on PLs in Fig. 2(d) and the corresponding version without it in Fig. 2(e). Fig. 2(d) was the best performing model quantitatively in Table I, which is confirmed by the visual inspection of the predictions.
The location and the shape of large changes are segmented well with limited mistakes. While the model does miss some smaller changes on the right, regions in the middle are segmented well. In comparison to Fig. 2(e) without PLs, the results are visibly better in Fig. 2(d). The plain FC-Siam-Diff is thrown off by different shades of green, which results in false positives in the middle and on the right. The PL version helps to reduce these false positives due to acquisition conditions and further seems to improve not only the location but also the shapes of segmented changes.
As also visible in Table I, the segmentation performance of ChangeFormer and BIT is generally worse in comparison to FC-Siam-Diff. SiROC PLs brought the biggest improvement for ChangeFormer in Table I, which is also visible in Fig. 2(f) and (g). The no PL version predicts change for virtually all grassland regions since it interprets the change in illumination as change. It is, therefore, too sensitive to the change class and struggles to extract meaningful change. This improves visibly with the PL training. For example, the shapes in the middle are fit notably better.
Similarly, the PLs bring improvement with BIT as shapes get more refined and there are fewer false positives on the right.
The impressions of Fig. 2 are generally confirmed when inspecting predictions for a more complex urban scene in Fig. 3. Again, the upper panels for each method show pre- and postimages as well as the ground truth. For all three models, the upper prediction with PL pretraining shows more refined shapes. This becomes particularly visible for ChangeFormer [see Fig. 3].
Table I shows that PL training is effective in addition to supervised use of labels. Table II outlines what happens when other PLs based on CVA or DCVA are used as semisupervised baselines. The training setup is identical to Table I and the scores for SiROC PL are the same. What varies is the source of the PLs in the pretraining step listed in the second column. FC-Siam-diff with SiROC PLs reaches high scores in accuracy and MIOU. Accuracy is 2-3 p.p. higher compared to other PLs, which is significant, but the MIOU edge is rather small. For MF1, it seems that CVA and DCVA PLs, although lagging behind in accuracy, reach a slightly more balanced classification with 61.21% MF1 each. For ChangeFormer and BIT, the scores are again lower on average. Compared to CVA, the ChangeFormer SiROC combination scores visibly better across all three categories (+8 p.p. accuracy, +3 p.p. MIOU, +2 p.p. MF1). ChangeFormer with SiROC PLs notably exceeds its DCVA baseline in accuracy and MIOU and obtains a similar MF1 score. The picture for BIT is similar with higher accuracy and MIOU and slightly better (CVA) or marginally worse (DCVA) F1 scores. Overall, SiROC PLs perform visibly better in accuracy and MIOU, where the edge is particularly apparent for ChangeFormer and BIT.

1) Amount of Training Data: One may be concerned that the edge of our approach is limited by the small number of training cubes with real labels. Therefore, we iteratively add more training cubes to explore differences in the edge depending on this parameter. Table III presents these scores on a harmonized test set. As we use up to 64 cubes for training and aim to keep the scores comparable, we use the same test set for all specifications in this table. All PL specifications are again pretrained with the top 25% of cubes in confidence. We use all available training cubes with FC-Siam-diff and ChangeFormer in the upper panel. Despite the increasing amount of training data, FC-Siam-diff remains better than ChangeFormer. In the lower panel, we compare FC-Siam-diff against versions with fewer training data (25% and 50% of the aforementioned training set). Interestingly, the performance of SemiSiROC increases only marginally with additional real training data. This may indicate that a large part of potential gains through additional training data could already have been exploited by the PLs. Conversely, the gap between PL and no PL gets smaller with 16 training cubes. Then, performance from 16 to 32 cubes drops slightly, which is unexpected. One reason could be that the additional training cubes are somewhat more unrepresentative of the remaining cubes on the other side of the globe compared to the previous cubes. The highest scores with and without PLs are achieved with the maximum number of training cubes of 64, which is about 85% of our dataset with over 1000 image pairs, where the rest is used for testing and validation. Still, the PL specification remains better than its baseline with a sizeable gap. Overall, the main takeaway remains unaffected. With both a few and a larger amount of labels, SemiSiROC is an effective strategy for CD on this dataset.

TABLE II: QUANTITATIVE RESULTS DYNAMICEARTHNET WITH DIFFERENT PLS
TABLE III: ABLATION STUDY: VARYING THE TRAINING SET SIZE
TABLE IV: ABLATION STUDY: ROBUSTNESS TO FINETUNING LOSS
2) Varying the Finetuning Loss: However, the edge of our strategy may be specific to the loss combination used. Therefore, we test the robustness of our results with other losses at the finetuning step in Table IV for ChangeFormer, BIT, and FC-Siam-diff. We do not vary the PL loss here as this would leave the baselines without SiROC pretraining unaffected. In total, there are six specifications per model given three loss combinations each. The scores for the MIOU loss are identical to Table I. The choice of the finetuning loss leaves SemiSiROC largely unaffected with minor differences in scores. It is marginally better in accuracy and MIOU compared to the MIOU loss and slightly lower in terms of Mean F1. The focal loss baseline with FC-Siam-diff is slightly stronger than with MIOU but still lags behind. As expected, training with a cross-entropy (CE) loss pushes the FC-Siam-diff baseline to almost exclusively predict the majority no-change class. This results in a high accuracy score of almost 0.80, which even marginally surpasses the respective SemiSiROC score although with a higher standard deviation. However, the corresponding Mean F1 score, which is comparably sensitive to large discrepancies in predictive performance across the classes, falls behind the SemiSiROC CE score by almost 7 p.p.
For the ChangeFormer model, the observations from the MIOU finetuning are confirmed. Similar to FC-Siam-diff, CE training leads to the prediction of mostly no change. The FL results are somewhat better than the MIOU results but still comparatively poor. Overall, Table IV confirms the effectiveness of our semisupervised strategy.
Lastly, the results for the BIT model mirror the aforementioned results. Pseudolabeling is highly effective across all categories with an FL or MIOU loss. With CE, the model again tends to overfit largely to the no-change class, which is why the accuracies are higher. Even though the no-PL version with CE loss reaches the highest accuracy among the BIT models, the results are visibly unbalanced. While the PL version lags behind by 3 p.p. in accuracy, it makes more balanced choices, with an MF1 that is more than 8 p.p. higher.
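The imbalance effect described for the CE loss can be illustrated with a small toy example. The sketch below assumes that MIOU and Mean F1 are plain averages over the two classes (no change and change); the function name `binary_cd_metrics` and all pixel counts are invented for illustration and are not values from Table IV.

```python
import numpy as np

def binary_cd_metrics(y_true, y_pred):
    """Accuracy, mean IoU, and mean F1 over the classes no change (0) and change (1)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    ious, f1s = [], []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        # Guard against empty denominators (class neither present nor predicted).
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    accuracy = float(np.mean(y_true == y_pred))
    return accuracy, float(np.mean(ious)), float(np.mean(f1s))

# Toy ground truth (made up): 90 no-change pixels followed by 10 change pixels.
y_true = np.array([0] * 90 + [1] * 10)
# Predictor A always outputs the majority no-change class.
pred_majority = np.zeros(100, dtype=int)
# Predictor B finds 8 of the 10 changes but raises 8 false alarms.
pred_balanced = np.array([1] * 8 + [0] * 82 + [1] * 8 + [0] * 2)

acc_a, miou_a, mf1_a = binary_cd_metrics(y_true, pred_majority)
acc_b, miou_b, mf1_b = binary_cd_metrics(y_true, pred_balanced)
print(f"majority: acc={acc_a:.2f} mIoU={miou_a:.2f} mF1={mf1_a:.2f}")
print(f"balanced: acc={acc_b:.2f} mIoU={miou_b:.2f} mF1={mf1_b:.2f}")
```

In this toy setting both predictors reach the same accuracy, yet the majority-class predictor scores far lower in mean F1, which is the pattern that makes accuracy alone a misleading criterion under class imbalance.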
3) Results on Unseen Geographic Areas: Note that for the two previous tables, we did not restrict the PLs to be outside of the test set. Although no model sees any actual labels from the test set during training, one could argue that access to the images of the test set may be advantageous for our strategy.
To ensure that our strategy is also effective on cubes that were not part of the PL training, we split the former test set in two: we use the western half from the perspective of Fig. 1 for PL training and the eastern half for testing, with FC-Siam-diff as the most effective model overall. The respective scores are reported in Table V and can no longer be directly compared to the scores of the previous tables because of the difference in the test cubes. Still, the PL step remains better by a wide margin that seems even bigger than in previous comparisons. The gap is substantial at 15 p.p. in accuracy and 7 p.p. in MIOU.
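A west/east split of this kind can be sketched as follows. This is only an illustration under the assumption that each cube has a single center longitude and that the split is made at the median; the function `split_west_east` and the coordinates are hypothetical and do not describe the actual cube locations in Fig. 1. Wrap-around at the antimeridian is ignored.

```python
import numpy as np

def split_west_east(longitudes):
    """Split cube indices at the median longitude into a western and an eastern half."""
    longitudes = np.asarray(longitudes, dtype=float)
    order = np.argsort(longitudes)  # west (small longitude) to east
    half = longitudes.size // 2
    return np.sort(order[:half]), np.sort(order[half:])

# Hypothetical cube-center longitudes in degrees (not the actual cube locations).
lon = np.array([-122.4, 13.4, 77.2, -46.6, 151.2, 2.3])
west_idx, east_idx = split_west_east(lon)

# Western cubes would only contribute PLs; eastern cubes are held out for testing.
print("PL training cubes:", west_idx.tolist())
print("test cubes:", east_idx.tolist())
```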

4) PL Filtering:
Another ablation study concerns the effectiveness of the PL filtering. Since labels are limited, the preselection discards additional information that may be useful in training. Therefore, we compare the top-25% cube selection against a random selection and against the lowest 25% in confidence. The respective results are reported in Table VI. The top 25% of cubes score best in terms of accuracy and MIOU and fall just short of the random selection in terms of MF1. Still, with a difference of almost 3 p.p. at similar MIOU and F1 values, it seems that the confidence prefiltering indeed extracts meaningful PLs, which result in more effective learning. Additionally, we notice decreasing marginal returns of adding a higher fraction of PLs in our case. Using the top half or even all cubes with their respective PLs results in a performance similar to that of using only the top quarter. Therefore, we choose the threshold of 25% for more efficient training. Even though SiROC PLs improve performance already without filtering, the confidence selection further pushes the CD performance.
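The three selection strategies compared in this ablation can be sketched as below. The sketch assumes that one scalar confidence score per cube is already available; how that score is aggregated from the SiROC output is not specified here, and the function `select_cubes` and the example scores are hypothetical.

```python
import numpy as np

def select_cubes(confidence, fraction=0.25, mode="top", seed=0):
    """Return sorted indices of the cubes kept for PL pretraining."""
    confidence = np.asarray(confidence, dtype=float)
    n_keep = max(1, int(round(fraction * confidence.size)))
    order = np.argsort(confidence)  # ascending: least confident first
    if mode == "top":
        keep = order[-n_keep:]
    elif mode == "bottom":
        keep = order[:n_keep]
    elif mode == "random":
        rng = np.random.default_rng(seed)
        keep = rng.choice(confidence.size, size=n_keep, replace=False)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.sort(keep)

# Hypothetical per-cube confidence scores (not values from the experiments).
conf = np.array([0.91, 0.42, 0.77, 0.15, 0.63, 0.88, 0.30, 0.55])
top = select_cubes(conf, 0.25, "top")        # the two most confident cubes
bottom = select_cubes(conf, 0.25, "bottom")  # the two least confident cubes
rand = select_cubes(conf, 0.25, "random")    # two cubes drawn without replacement
```

Setting `fraction` to 0.5 or 1.0 with `mode="top"` would correspond to the top-half and all-cubes variants mentioned above.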

E. OSCD Results
To further investigate the transferability and generalizability of the proposed approach, we also evaluate SemiSiROC on OSCD [23], a widely used binary CD benchmark based on Sentinel-2 with a focus on urban regions. The results of our experiments are presented in Table VII. The models used are identical to the ones in Table I. We merely apply them directly to the OSCD test set instead of the DynamicEarthNet test set to analyze the transferability of the models. Similar to Table I, [...] real DynamicEarthNet labels. Interestingly, the accuracies are in the range (94-96%) of the FC-Siam models in [23], which rely on supervised training on OSCD, whereas our approach does not use OSCD labels at all. The contrast to no PLs gets even larger for ChangeFormer, although some of the ChangeFormer models seem to tilt toward predicting mostly change on this dataset, which results in unstable average performance. Even when excluding these runs, however, the maximum performance of ChangeFormer on the OSCD test set is 74.13% accuracy, 41.86% MIOU, and 51.71% MF1, which is substantially below the average with PLs. Third, the BIT model with PLs is arguably the best model here since it is only slightly inferior to FC-Siam-diff in accuracy but achieves high scores in MIOU and MF1 with 55.85% and 64.22%, respectively. Again, the difference to no PLs is large across all categories. Overall, the OSCD results confirm the previous impression that PL pretraining with SemiSiROC can be highly effective in optical CD applications.

TABLE VII  QUANTITATIVE RESULTS OSCD TEST SET TRAINED ON DYNAMICEARTHNET AND GROUPED BY PL USE

A. Comparing Teacher and Students
The previous section outlines the effectiveness of SiROC as an unsupervised teacher model for CD with limited labels. This is because it is an effective method and can prioritize PLs based on a well-calibrated confidence. The mechanism for these improvements seems to be higher robustness to false positives because of acquisition conditions and more refined shapes of changes.
Since SiROC analyzes how much a pixel changes in comparison to its neighborhood, it seems intuitive that it would guide a student model toward higher robustness to false positives. Consider the example of Fig. 2. Grassland seems much greener in the post images, but since this affects virtually all pixels in the grassland neighborhood of a pixel, SiROC would not necessarily view this as change. The student models seem to pick up on this without modeling it explicitly. Another property of SemiSiROC seems to be more refined change shapes, which is also a strength of the initial SiROC model [39]. This may incentivize the student model to learn more about likely shapes and spatial dependencies of changes.
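The intuition behind the grassland example can be made concrete with a deliberately simplified stand-in. The sketch below is not the SiROC algorithm, which relies on half-sibling regression [39]; it merely scores how far a pixel's temporal difference deviates from the average difference in its neighborhood. The functions `box_mean` and `relative_change`, the window radius, and the synthetic single-band images are all assumptions made for illustration.

```python
import numpy as np

def box_mean(img, radius):
    """Mean over a (2*radius+1)^2 window, using edge padding at the borders."""
    padded = np.pad(img, radius, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    size = 2 * radius + 1
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / size**2

def relative_change(pre, post, radius=2):
    """Toy score: deviation of a pixel's temporal difference from its neighborhood's."""
    diff = post.astype(float) - pre.astype(float)
    return np.abs(diff - box_mean(diff, radius))

# Synthetic single-band scene (made up): uniform "grassland".
pre = np.full((9, 9), 0.30)
post_season = pre + 0.20          # everything gets greener (acquisition/season effect)
post_local = post_season.copy()
post_local[4, 4] += 0.50          # one pixel changes beyond the seasonal shift

score_season = relative_change(pre, post_season)
score_local = relative_change(pre, post_local)
print("max score, seasonal shift only:", score_season.max())
print("score at locally changed pixel:", score_local[4, 4])
```

Under these assumptions, the scene-wide shift yields scores near zero, whereas the locally changed pixel stands out, which mirrors the robustness to acquisition conditions discussed above.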

B. Relative Weakness of Transformer Models
Second, we notice that throughout our results, the two transformer models seem to perform worse than the siamese UNet. This results in large gains through PL pretraining and underlines the effectiveness of our strategy. There are several possible explanations for this relative weakness. A likely candidate is model size and label availability. ChangeFormer, in particular, is a large model, which makes it data hungry, and its success on other datasets such as Levir-CD in [21] may be related to the fact that more labels are available there. This seems plausible for Levir-CD, which has about 10x more labeled pixels than the binary DynamicEarthNet we use here.
However, DSIFN only has 25% more labeled pixels than our dataset. Therefore, another reason could be that both of these methods have been tested only in the context of urban CD with a focus on buildings. The different kinds of change applications across the globe within DynamicEarthNet may pose a challenge to these models, and the smaller siamese model may adjust to this more quickly. Nevertheless, the SemiSiROC framework is effective for all the methods we tested here and shows promise for CD applications with optical data in practice. Our model pretrained with PLs converges faster during finetuning (i.e., training with actual labels). Thus, our proposed method reduces the time requirement of the training phase with actual samples.

V. CONCLUSION
Monitoring changes of the Earth's surface over time with satellite imagery is an integral part of remote sensing. In this article, we combine unsupervised and supervised techniques in a semisupervised framework. This framework, called SemiSiROC, relies on pretraining a student model with PLs that we filter by confidence. This enables the student model to learn from additional, meaningful high-confidence examples in a pretraining step before finetuning with actual labels. We evaluate SemiSiROC with three different supervised backbones: FC-Siam-diff, ChangeFormer, and BIT. We evaluate the models with and without filtered PL pretraining on a binary version of the DynamicEarthNet benchmark that is based on Planet Fusion imagery with 3-m resolution. We pick only the cubes with the 25% highest confidence scores during pretraining. For all three models, we find a notable boost in performance for our baseline specification in Table I with eight cubes, which corresponds to 124 training scene pairs with real labels. Additionally, we outline that SemiSiROC remains competitive when compared with semisupervised student-teacher baselines based on DCVA and CVA PLs.
Further, we evaluate the SemiSiROC models on scenes not seen during PL training, which results in similar performance gains. This ensures that the learned features are not specific to scenes close to the PLs. Even with 64 training cubes with over 1000 labeled pairs, SemiSiROC is effective compared to its non-PL baseline, and the gains are still large. Additional evaluations on the OSCD benchmark confirm the effectiveness of our SemiSiROC strategy also on an urban CD dataset based on Sentinel-2. Qualitative inspections of the predictions shed light on what the teacher model seems to teach its students: compared to their no-PL counterparts, the SemiSiROC models predict more refined shapes and seem to be less sensitive to false positives.
Our results point toward several potentially promising future research directions. First, our work could be applied to related tasks such as multiclass CD or to different input sensors. Second, more experiments are necessary to understand the role of teacher models in spatial generalization in general and in CD in particular.