Filtering Specialized Change in a Few-Shot Setting

The aim of change detection in remote sensing is usually not to find all differences between the observations, but rather only specific types of change, such as urban development, deforestation, or even more specialized categories like roadwork. However, large public datasets are often not available for very fine-grained tasks, and collecting the amount of training data needed for most supervised learning methods is very costly and often prohibitive. For this reason, we formulate the problem of few-shot filtering, where we are provided with a relatively large change detection dataset and, at test time, a few instances of one particular change type that we try to “filter out” of the learned changes. For example, we might train on data of general urban change and, given some samples of building construction, aim to predict only instances of these on the test set, all without any explicit labels for buildings in the training data. We further investigate a fine-tuning approach to this problem and assess its performance on a public dataset that we adapt for use in this novel setting.


I. INTRODUCTION
Change detection, that is, segmenting a pair of images of the same region but taken at two different points in time into changed and unchanged pixels, is a well-known task in remote sensing, with many applications in disaster assessment, urban planning, forest monitoring, and other remote sensing domains [1]. Usually, for these applications we are not interested in every change that occurred between the images, but limit our attention to certain categories, such as building construction or destruction [2], deforestation [3] or flooding [4], and occasionally even finer subcategories like road construction [5], mining activities [6], or ship movement [7].
For this reason, supervised learning, where we can exactly specify our interests via annotated samples, is a natural choice for these specialized change detection tasks. However, the downside of this approach is that these methods usually require large amounts of training data, which is expensive or even prohibitive to produce. Even though there are a number of large datasets publicly available, these often are annotated with rather general categories, such as urban change [8].
The authors are with the Chair of Data Science in Earth Observation, Technical University of Munich, 85521 Ottobrunn, Germany (e-mail: martin.hermann@tum.de; sudipan.saha@tum.de; xiaoxiang.zhu.ieee@gmail.com). Code will be available at https://gitlab.lrz.de/ai4eo/cd/-/blob/main/fewShotFilteringCd. Digital Object Identifier 10.1109/JSTARS.2022.3231915
If we want to use these existing resources for more specialized tasks, such as the ones mentioned above, the resulting models detect a lot of unwanted changes in addition to those that are relevant, so filtering out the important information becomes a key step. In principle, it is possible to do so with a lot of additional data, or by manually adapting the training labels (one recent example of this can be found in the work of Li et al. [9]). However, this again requires a large amount of resources, the lack of which is often the reason for using a preexisting dataset in the first place. Hence, a solution that adapts to a specialized use case with only a handful of annotated samples of this particular type of change would be very desirable.
Pushing this idea even further, research often is an iterative process, and in many situations we might not have a clear definition of the change we are interested in from the beginning. For example, when investigating deforestation, we could realize after some time that in fact the most relevant type for our scenario is caused by wildfires, and we want to ignore, e.g., logging. On the other hand, we might decide to focus on human influences, and now look for newly built infrastructure close to the rain forest. To enable a flexible workflow and avoid long interruptions caused by retraining the network from scratch, it is useful to allow the specification of the change of interest only after training on the full dataset.
Therefore, we propose an approach to detect specialized changes that works top-down: first, we learn to classify a broader type of change in a binary classification task, for which ample training data are available, and then try to filter out one particular subcategory via only a few examples, thereby entering the realm of few-shot learning [10], [11]. As few-shot learning deals with novel classes that the machine learning model has never seen before, whereas we in contrast try to specialize and split known classes, we propose the term few-shot filtering for this task, which we describe in detail in Section III. To provide some background, we briefly present few-shot learning, data-efficient approaches to change detection, and other related work in Section II. We then describe the methods we investigate in Section IV, detail our experimental setup and dataset in Section V, present the results in Section VI, and discuss them in Section VII. Finally, Section VIII concludes this article.
In particular, our contributions are the following.
1) We formulate the problem of few-shot filtering for change detection, which differs from standard few-shot learning in that the query classes are not disjoint from the base classes during training. Instead, we focus on refinements of previously seen change.
2) We investigate a fine-tuning approach to this problem, together with two simple baselines, and compare their performance on several different few-shot tasks.
3) To evaluate these methods, we suggest a way to suitably adapt a semantic change detection dataset to this new setting of few-shot filtering and discuss its advantages and limitations.
4) We conduct a hyperparameter study to gain some insight into the effects on different types of specialized change, which is important as the conventional approach of optimizing on a validation set is not possible in the data-scarce setting.

II. RELATED WORK
In the following, we give a brief overview of the different areas of research we touch upon and highlight important related work. The four main areas are few-shot learning, specialization and subcategorization, semantic change detection, and methods to deal with data scarcity.

A. Few-Shot Learning and Few-Shot Segmentation
Few-shot learning is a very active area in machine learning in general and computer vision in particular. Similar to how a human can learn novel objects from only a few instances, it aims to adapt a network to previously unseen classes from a small number of training examples (which are known as "shots") [10].
There are several different popular approaches to this task [11], including meta-learning techniques, such as MAML [12] and metric-based ones, most notably matching- [13] and prototypical networks [14]. These methods generally use episodic learning. They simulate the few-shot setting already during training, optimizing the performance on a query set given a small amount of labeled data (the support set).
However, there is evidence that episodic learning might not be necessary for good performance [10], [15] and several methods only use support and query sets after training on the full dataset. Examples for this are the works of Gidaris and Komodakis [16] and Qi et al. [17]. There, the last layer classifies the embeddings produced by the rest of the network via cosine similarity and, during inference, the embeddings of the support set serve as a prototype for the new class in the query set. Closely related approaches are also investigated by Chen et al. [10] and Dhillon et al. [18] as competitive baselines, and like them, we also use fine-tuning on the support set as our main approach.
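To make the prototype idea concrete, the following is a minimal sketch of cosine-similarity classification against class prototypes, loosely in the spirit of the approaches described above and not the implementation of any cited work. The two-dimensional embeddings, the class names, and the helper functions are hypothetical; in practice the embeddings would be produced by the trained network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_embedding(embeddings):
    """Average a list of embeddings into a single prototype vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine(query, prototypes[name]))

# Hypothetical prototype of a class seen during base training.
prototypes = {"base": [1.0, 0.0]}

# Hypothetical embeddings of the few support samples of a new class;
# their mean serves as the prototype of that class.
support_embeddings = [[0.0, 1.0], [0.2, 0.9]]
prototypes["novel"] = mean_embedding(support_embeddings)

print(classify([0.9, 0.1], prototypes))
print(classify([0.2, 0.8], prototypes))
```

No retraining is involved in this sketch: adding a class only requires averaging the support embeddings, which is what makes this family of methods attractive when only a few shots are available.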
In general, most research focuses on few-shot classification; however, there also exists a considerable amount of literature on few-shot segmentation, both of images [19], [20], [21], [22] and of videos [23], [24]. This task is more challenging, but also closer to our problem of change detection, where labels have to be assigned to individual pixels instead of the whole image.
In summary, our setting is closely related to existing work on few-shot learning and segmentation, but we do not propose to solve it via the very common episodic training and aim for a simpler fine-tuning-based approach instead.

B. Specialization and Subcategorization in Computer Vision
Hierarchical information, such as subcategories and general "coarse-to-fine" relations, can be a valuable resource in computer vision for a range of different tasks. Learning on fine labels induces features relevant for coarse classification, as well as vice versa [25]. One subclass-specific task in semantic segmentation is, e.g., hierarchical segmentation [26], [27], where multiple levels of the hierarchy are predicted simultaneously.
Going from a broad class to a specialized subcategory in transfer learning has been explored, e.g., for object category detection [28], and recent works formalized this in a few-shot manner [29], [30]. Bukchin et al. [29] call this setting coarse-to-fine few-shot, and this is essentially the classification variant of few-shot filtering. However, they assume a set of mutually exclusive fine-grained classes, whereas we allow pixels to be relevant for different possible change types (cf. Fig. 1, where some pixels are relevant both for deforestation and roadwork). Ni et al. [31] explore the coarse-to-fine few-shot setting under the name cross-granularity few-shot with a medical application in mind, Xiang et al. [32] develop an incremental variant and, concurrently to the present work, Gong et al. [33] investigate taxonomy adaptive cross-domain semantic segmentation (e.g., also incorporating subclasses of known classes), likewise with only a few labeled shots. To the best of our knowledge, however, this is the first work exploring similar ideas specifically adapted to the context of change detection.

C. Semantic and Multiclass Change Detection
Whereas binary change detection is just interested in whether some change occurs in the given time frame or not, the goal of semantic change detection (also known as multiclass change detection) [34], [35] is to further break this change down into several classes. Often, these are identified by the land cover categories of both images, and the task is then to detect change as well as label it, e.g., as "from low vegetation to building." As Yang et al. [36] point out, there are also changes that do not affect the land cover categories (such as the replacement of one building by another), which are ignored without additional change/nonchange information. Other works also consider much more fine-grained change types, such as the construction of residential or industrial buildings or mega projects [37]. Unsupervised approaches to this task can also work without defining change classes in advance and discover different types of change, for example via deep change vector analysis [38].
While we are also concerned with different change categories, unlike supervised semantic change detection methods we do not consider them fixed a priori, and instead want to allow a flexible specification after training. Similar to the unsupervised methods, we also do not have any labels on the change categories, but we do have access to binary change information. Nevertheless, unsupervised semantic change detection is the task most closely related to our problem of few-shot filtering.

D. Data Scarcity in Change Detection
The problem of limited training data in change detection is not new. Commonly, this is approached by semi- [39], self- [40], or unsupervised methods [38], where no or only a small amount of annotated training data is needed. In contrast to few-shot learning, the focus lies on the information contained in many unlabeled images and not so much on quick adaptability to a few labeled ones.
Additionally, other techniques, such as transfer [41] or active learning [42], are employed to deal with data scarcity. How a small amount of data can affect the performance of change detection algorithms has also been investigated by Saha et al. [43]. Recently, there is also work exploring few-shot learning approaches to change detection [44], [45], showing that this is indeed a promising approach, which we hope to further expand by introducing the few-shot filtering setting. Tang et al. [46] use methods from few-shot segmentation (prototypes and masked average pooling) but apply them to a standard binary setting where a larger amount of data is available.
Finally, we also want to highlight recent work by Lenczner et al. [47], who add new classes to the segmentation of remote sensing images in a continual learning setting, resulting in what can loosely be described as the inverse of our task, although in the context of semantic segmentation instead of change detection.
All in all, while there have been various attempts to deal with data scarcity, our proposed setting is novel in its flexibility and differs from existing tasks that are designed with alternative applications in mind. We will now describe it in detail in the following section.

A. Binary Change Detection and Few-Shot Filtering
The aim of few-shot filtering for change detection is to train a model on a dataset that is annotated with a general category of change, and then adapt it to a more fine-grained task with only a few new examples (the support set). This problem is illustrated in Fig. 1, and we will now formalize this setting.
The (binary) change detection task is concerned with a pair of images $I_1, I_2 \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ are the spatial dimensions and $C$ denotes the number of channels (which might be higher than the usual 3 in standard computer vision and can also include, e.g., near-infrared bands). It is assumed that $I_1$ and $I_2$ depict the same region but are taken at different points in time, often multiple months or (e.g., for applications in urban development) years apart. Our aim then is to derive a change map $\mathcal{C}_{I_1,I_2} \in \{0,1\}^{H \times W}$, i.e., a segmentation of the input images into pixels that have changed in some meaningful way between $I_1$ and $I_2$ (denoted by a value of 1) and those that remain unchanged.
Note that this notation conceals the nature of the change: while for certain applications, a pixel belonging to a newly constructed building might be considered relevant, in other cases (such as for deforestation or agricultural domains) we might not be interested in this particular instance. Therefore, we additionally index the change map by a change type $T$ and try to produce $\mathcal{C}_{I_1,I_2,T}$. This is different from semantic change detection, where we are given $K > 1$ different categories of change and the aim is to find a map in $\{0, 1, \ldots, K\}^{H \times W}$ that assigns every pixel either to "no change" or to one of the $K$ categories. In our setting, we are still interested only in binary classification, albeit restricted to one particular category of changes $T$.
In supervised learning, we assume a train set $D^{\mathrm{train}}_{T} = \{(I_1^i, I_2^i, \mathcal{C}_{I_1^i,I_2^i,T})\}_{i=1}^{N}$ of image pairs and their corresponding change maps. This train set is usually limited to one change type $T$, such as urban development or the impact of natural disasters. A standard change detection task would now be to evaluate a model trained on $D^{\mathrm{train}}_{T}$ on some test set $D^{\mathrm{test}}_{T}$ that is sufficiently similar to the train set. For the few-shot filtering problem, however, we assume an additional support set $D^{\mathrm{supp}}_{T'}$ that is small (five 256 × 256 patches in our experiments, but in other scenarios, a moderate amount of data, which can still be labeled at low cost, might also be adequate) and contains examples of a change type $T' \subset T$, meaning that every pixel that counts as changed with respect to $T'$ also counts as changed with respect to $T$, i.e., $\mathcal{C}_{I_1,I_2,T'} \le \mathcal{C}_{I_1,I_2,T}$ pixelwise. This is, e.g., the case for $T$ denoting general urban change and $T'$ then signifying building demolition. The evaluation then happens on a query set $D^{\mathrm{query}}_{T'}$, where we are now only interested in finding change of the new, restricted type $T'$.
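The subset relation between the specialized and the general change type can be checked directly on a pair of binary change maps. The following is a minimal sketch; the function name and the small example maps are hypothetical and only serve to illustrate the condition.

```python
def is_refinement(fine, coarse):
    """Check the few-shot filtering condition on two binary change maps:
    every pixel marked as changed in `fine` must also be marked in `coarse`."""
    return all(
        f <= c
        for row_fine, row_coarse in zip(fine, coarse)
        for f, c in zip(row_fine, row_coarse)
    )

# Hypothetical 2 x 3 change maps (1 = changed, 0 = unchanged).
urban_change = [[0, 1, 1],
                [0, 1, 0]]   # general change type, as annotated for base training
demolition = [[0, 0, 1],
              [0, 1, 0]]     # specialized type: a subset of the general change
not_contained = [[1, 0, 0],
                 [0, 0, 0]]  # marks a pixel that the general map does not

print(is_refinement(demolition, urban_change))
print(is_refinement(not_contained, urban_change))
```

A map such as `not_contained` would fall outside the setting as formulated here, since it marks change that the base annotations do not cover.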

B. Relation to Other Settings
One way to compare this definition to the standard few-shot learning setting is to look at $T$ as the base class and $T'$ as the novel class. However, instead of the empty intersection between base and novel classes that is normally assumed, we in contrast are interested in a subset relation. We could also consider our formulation as a form of weakly supervised learning [48], with the broad annotations of the train set acting as weak labels for the more specialized labels of the support and query sets. However, this view does not adequately highlight the few-shot nature of the task, where the support set in practice is only available after the training and consists of just a small number of change instances. Another interpretation is to see it as a form of transfer learning [49], where we have a very strong relation between source and target task and this connection is already well understood.
One might also think that semantic change detection should solve the same problem: by breaking the binary change class into multiple subcategories, we can then simply select the one we are interested in. However, while the number of such datasets is growing, a large part of the existing resources is annotated with binary labels. More importantly, this only shifts the problem: the number of different change types in semantic change detection problems is limited and set in advance, so in realistic applications they might still not fit our needs exactly. In fact, we can even think of both approaches going hand in hand: using a semantic change detection dataset, we first choose the category that best fits our needs (such as buildings) and then use a few examples to further filter out exactly what we want (e.g., high-rise construction). Also, categories in semantic change detection are usually assumed to be mutually exclusive, while this does not have to be the case for filtered specializations. Looking at Fig. 1, we can see that there are regions where deforestation happened in order to enable roadwork, so the changed pixels are relevant for both adapted models. Achieving this with semantic categories would require either much more fine-grained categories than are currently common in semantic change detection, or hierarchical or nonexclusive labels.
To avoid confusion, we should note that we use a semantic change detection dataset in this work to simulate a few-shot scenario, as described in Section V-A. The aim here is to use the high-quality annotations of the dataset as a precise ground truth for the experiments in this work. In that section, we also briefly discuss the limitations of using the semantic categories as few-shot tasks, and the points raised there (independent pixels, no spatial structure, and limited granularity) also strengthen the argument above that semantic change detection alone cannot solve the problem that few-shot filtering addresses. The creation of a benchmark dataset designed specifically for few-shot filtering is a logical next step for further research.

C. Discussion of the Setting
Few-shot filtering is a relevant setting in cases where we have access to an annotated binary dataset, but the annotations do not exactly fit our needs, as we are only interested in a particular subset of the change that is marked. One such scenario is the use of public benchmark datasets, where we often find, e.g., urban change. If we are investigating roadwork, it will be difficult to focus on the relevant instances, as most of the detected change probably consists of constructed buildings, which distracts from the events we want to study. Here, the few-shot filtering framework applies as we want to get to the relevant information with only a limited amount of additional labeling effort.
A big advantage of the setting is the flexibility that is enabled by the separation of base training and adaptation to the few-shot samples. Typically the former will take much longer than the latter, so we can perform the adaptation quickly and repeatedly, allowing for an interactive and adaptive workflow. One such scenario was described in the introduction.
The clear limitation of the approach is that by design, the specialized change type is expected to be fully part of the original annotations used for base training, restricting the possible specialization depending on what data are available. We could extend the original formulation to allow for other relations between $T$ and $T'$, such as just requiring a nonempty intersection, moving closer to the standard few-shot setting. However, we consider these tasks to be complementary: few-shot learning helps us find new change, whereas few-shot filtering splits known change further. We will mainly focus on the latter in this work, but in practice a combination of both tasks should be very beneficial.

IV. METHODS
After defining the few-shot filtering setting, we will now investigate several methods to tackle it, which also give some insight into how the combination of different resources can be beneficial. In addition, we will discuss the practical issue of hyperparameters in this setting.

A. Learning From a Single Data Source
Base Training Only: The most straightforward approach is to just use a standard change detection model that has been trained on the base training set and apply it on the query set without any kind of adaptations and without using the support set at all. Of course, we do not expect this to be a competitive approach, as the very idea of the few-shot filtering task is to specialize, and to limit the full output to only the interesting classes.
However, by including these results in our experiments, we can gain an understanding of how much the additional information adds. In addition, it allows us to gauge the difficulty of the individual few-shot tasks and how much they vary. In general, we expect this approach to have a rather high rate of false positives, as we do not filter out the change that was relevant during training but is not relevant for the few-shot tasks.
Support Set Only: As the exact opposite of the previous method, we can also ignore the base training data (which has all changes annotated, not just the specialized type) and just train a change detection model on the (very small) support set. In general, this will likely overfit very heavily, and for better comparability, we also use the same architecture as for the other tasks, which will only exacerbate this problem. A smaller model that is more tailored to this situation might achieve better results if we intended to use this method in practice, but for our experiments, this setup is well suited.
We will call these approaches Baseline A for the base training and Baseline B for using only the support set in our experiments. Together, both baselines can give an idea of how much information is already contained in the large, but general base training set (Baseline A), as well as in the small, but specific support set (Baseline B), and how much we can gain by combining both, i.e., by fully making use of the few-shot setting.
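To illustrate why Baseline A is expected to produce many false positives, the following sketch computes what an idealized base model could reach at best on each few-shot task. It assumes, purely for illustration, a hypothetical base model that recovers the full binary change map perfectly, so that recall on the specialized task is 1 and precision equals the share of the specialized change among all change. The pixel frequencies are the test-set values reported in Section V-A; the function name is an assumption of this sketch.

```python
def baseline_a_upper_bound(freq_special, freq_general):
    """Precision and F1 of an idealized base model that predicts exactly
    the general change map, evaluated on a specialized subset of it."""
    precision = freq_special / freq_general
    recall = 1.0  # every specialized change pixel is part of the general change
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1

# Share of changed pixels on the test set in percent (see Section V-A).
ALL_CHANGE = 20.70
TASK_FREQUENCIES = {
    "surface change": 6.26,
    "deforestation": 0.49,
    "building demolition": 1.71,
    "building construction": 10.78,
}

for task, freq in TASK_FREQUENCIES.items():
    precision, f1 = baseline_a_upper_bound(freq, ALL_CHANGE)
    print(f"{task}: precision {precision:.3f}, F1 {f1:.3f}")
```

Under this idealization, a rare task such as deforestation is bounded at a precision of roughly 2%, whereas building construction reaches about 52%. A real base model will deviate from these values, but the calculation indicates how strongly the difficulty of the tasks can differ before any adaptation takes place.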

B. Fine-Tuning
Fine-tuning amounts to combining both of the above baselines to learn a specialized model. For this, we first train the model on the base training set to learn a general notion of change, and then run additional epochs on the support set to filter out the specialized type that we are interested in. Note that as the support set is small, this fine-tuning is comparatively short, and we can use the same model that has been trained once on the full dataset to adapt to different few-shot tasks separately and quickly.
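The two-stage schedule can be sketched on a toy problem. The example below is not the network used in this work: it replaces the change detection model by a hypothetical per-pixel logistic regression on two made-up features, and the data, learning rate, and epoch counts are assumptions chosen only to show the mechanism of base training followed by fine-tuning on a support set.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, x):
    """Probability that a pixel with feature vector x has changed."""
    weights, bias = params
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(params, data, epochs, lr=0.5):
    """Full-batch gradient descent on the binary cross-entropy loss.
    Starts from `params`, so it serves for base training and fine-tuning."""
    weights, bias = list(params[0]), params[1]
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in data:
            err = predict((weights, bias), x) - y  # gradient w.r.t. the logit
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical pixel features: (building evidence, road evidence).
BUILDING, ROAD, UNCHANGED = (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)

# Base training set: all urban change is labeled as change.
base_data = [(BUILDING, 1), (ROAD, 1), (UNCHANGED, 0)]
# Support set: only building change is still relevant.
support_data = [(BUILDING, 1), (ROAD, 0), (UNCHANGED, 0)]

base_model = train(([0.0, 0.0], 0.0), base_data, epochs=200)
finetuned = train(base_model, support_data, epochs=50)  # starts from base weights

for name, x in [("building", BUILDING), ("road", ROAD)]:
    print(name, round(predict(base_model, x), 3), round(predict(finetuned, x), 3))
```

In this toy case, the base model flags both buildings and roads as change, while the fine-tuned model keeps the building response and suppresses the road response. Since `base_model` is left untouched, the same starting point can be reused for a different support set.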
When learning on the support set, we retrain all layers; however, another common approach would be to freeze all but the last layer (which makes the final decision about whether an individual pixel has changed). This is known as linear probing, and some research suggests that it can for example be beneficial for out-of-distribution cases [50]. Based on initial experiments, we decided against it, but further research into the best training strategy during this phase might prove to be valuable.
We decided to investigate a fine-tuning approach instead of more sophisticated episodic techniques for several reasons. For one, it makes sense to first establish a basic performance level for this new setting, and to use a conceptually simple method to do so. Whether other methods might improve on this, and in particular whether episodic learning can be of advantage here, is only the next step, especially as the question of how much can be gained from episodic learning is under discussion in the general few-shot literature [15]. Also, from a practical point of view, we assume that we use some existing binary data source, so episodic training would require additional filtered data, which defeats the purpose of having a data-efficient method available. We could avoid this with a semantic change detection dataset, in the same way that we do for testing in this article, but this limits the usability to domains where such data exist. Therefore, we propose to tackle our problem with a fine-tuning approach.

C. Effect of Hyperparameters and Lack of Validation Data
Part of the training strategy during fine-tuning is the decision on a set of hyperparameters. The importance of these for the performance of a deep learning model is well known, and Li et al. [51] have shown that optimizing them can also be vital for fine-tuning. However, in our setting we only have a few annotated images for a new task, so we cannot simply use a validation set to determine the best values for these parameters. In addition, we also cannot expect that there is one set of parameters that will work well across all possible few-shot tasks, as they may differ in key aspects, such as how frequent the type of change is. The problem of no available validation set is also discussed, e.g., by Gulrajani and Lopez-Paz [52], albeit in the context of domain generalization. Suggested solutions in the literature include data augmentation on the support set [53].
It is desirable to gain some understanding of how different choices affect different scenarios, so that in practice we can, e.g., choose a suitable setting based on some heuristics (similar to BiT-HyperRule [54]), such as how frequent we expect the specialized change to be or how difficult the task is. We will investigate the effects of the parameters by varying them on the actual few-shot tasks. In the terms introduced by [52], this would amount to using a test-domain validation set, which they do not view as suited for benchmarking. However, our goal is not to decide on a particular set of parameters, but rather to investigate their effect across different tasks, and the size of their effect in general. Still, there is some information leaking, as of course we also choose the ranges of the different parameters that we investigate based on initial experiments on both the test and validation sets, and there is no guarantee that this will transfer identically to different tasks or datasets. However, for the present study, this approach still gives valuable insights, and we consider a more robust selection of parameters the goal of further research. We will discuss which hyperparameters are varied in Section V-B and give the exact parameters used in Section V-C.

A. Dataset and Few-Shot Tasks
The semantic change detection dataset (SECOND) [36] consists of 2968 image pairs of size 512 × 512, obtained from several Chinese cities. As it has been designed for semantic change detection, it also includes pixel-level annotations for land-cover classes of the changed areas in both images, with the categories nonvegetated ground surface, tree, low vegetation, water, buildings, and playgrounds. As these annotations are provided for both time steps and include an additional nonchange label, this also enables the description of change where the land-cover class stays the same, e.g., changes from one building to another.
As mentioned, we are not interested in semantic change detection as such. However, the structure of the dataset is very well suited to our task as well: during training, all change categories are grouped together, and we perform full binary change detection. Then, for the few-shot tasks, we can select individual change categories (e.g., from any class to buildings) and treat all others as unchanged. Fig. 2 shows an illustration of this process. A similar approach is also described by Liu et al. [55], who use the labels of the semantic change detection dataset HRSCD [34] to only select cropland changes as their binary labels.
This definition of few-shot tasks, while providing an easy way to adapt an existing dataset and make use of its high-quality labels also for the few-shot setting, nevertheless has some shortcomings. First of all, we treat every pixel independently and therefore have no way to determine whether a removed "tree" pixel was part of a large forest or an isolated roadside tree. Similarly, as there is no notion of spatial structure, we cannot, for example, easily define road construction as a task in this way. Also, we are limited in granularity by the decisions of the original dataset, which means that we cannot determine which type of "building" we see, as it can be a skyscraper, a factory hall, a residential building, or anything else.
Fig. 2. Illustration of the process for dataset preparation: Starting from the semantic annotations from SECOND, we can create both the full binary change map and different few-shot tasks by combining different sorts of transitions.
We choose four of these tasks for our experiments: surface change, which is defined by either "from n.v.g. surface to low vegetation" or "from low vegetation to n.v.g. surface," deforestation ("from tree to any class"), building demolition ("from building to any class"), and building construction ("from any class to building"). In each case, the support set was manually selected from the validation data to be representative of this change type and has a size of five 256 × 256 patches. We show the first two support images for every task in Fig. 3. The choice and quality of the support images likely have a considerable impact on the performance of the few-shot methods, and exploring this might be a valuable target for further research.
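The derivation of the binary change map and of the four few-shot tasks from the semantic annotations can be sketched as follows. This is a minimal illustration under stated assumptions rather than the preprocessing code of this work: the integer label codes are hypothetical (the actual encoding in SECOND may differ), and "any class" is interpreted here as any class label on a changed pixel, so a change from one building to another counts for both building tasks.

```python
# Hypothetical label codes for the per-pixel annotations of each time step.
NO_CHANGE, NVG, TREE, LOW_VEG, WATER, BUILDING, PLAYGROUND = range(7)

# Each task is a predicate on the (before, after) labels of a single pixel.
TASKS = {
    "surface_change": lambda b, a: (b, a) in {(NVG, LOW_VEG), (LOW_VEG, NVG)},
    "deforestation": lambda b, a: b == TREE and a != NO_CHANGE,
    "building_demolition": lambda b, a: b == BUILDING and a != NO_CHANGE,
    "building_construction": lambda b, a: a == BUILDING and b != NO_CHANGE,
}

def binary_change(before, after):
    """Full binary change map used for base training."""
    return [[int(b != NO_CHANGE and a != NO_CHANGE) for b, a in zip(rb, ra)]
            for rb, ra in zip(before, after)]

def task_change(before, after, task):
    """Change map of one few-shot task; all other change counts as unchanged."""
    selected = TASKS[task]
    return [[int(selected(b, a)) for b, a in zip(rb, ra)]
            for rb, ra in zip(before, after)]

# Hypothetical 2 x 3 annotation patches for the two time steps.
before = [[NO_CHANGE, TREE, BUILDING],
          [NVG, LOW_VEG, BUILDING]]
after = [[NO_CHANGE, BUILDING, BUILDING],
         [LOW_VEG, WATER, NVG]]

print(binary_change(before, after))
for name in TASKS:
    print(name, task_change(before, after, name))
```

In this example, the pixel that changes from tree to building is selected by both the deforestation and the building construction task, which shows that tasks defined in this way do not have to be mutually exclusive.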
The four tasks differ quite a lot, both in their difficulty and frequency of change: while it might be relatively easy to determine if a new building was constructed, the changes between nonvegetated and vegetated surface can be hard to assess even for a human. In addition, deforestation is very rare in the test set, with only 0.49% (0.86% on the validation set and 1.60% in the train set) of pixels from the images undergoing this change, compared to 1.71% (3.34% / 3.31%) for building demolition, 10.78% (6.79% / 8.52%) for building construction, and 6.26% (5.18% / 5.67%) for surface change. For reference, the amount of all changed pixels together is 20.70%, very similar to the value of 20.19% on the train set, and slightly higher than the 16.91% on the validation set. Therefore, we hope to have a somewhat broad representation of different scenarios one might want to use few-shot filtering in, and to be able to compare what works well in what circumstances.

B. Hyperparameters Under Investigation
We will now briefly introduce the parameters for the fine-tuning phase that we investigate more closely (cf. Section IV-C). The concrete values used are given in Section V-C.
Change Weight: Change detection is an inherently imbalanced task, as there are always fewer changed pixels than unchanged ones, even for the more general base training set. Usually, we can solve this problem relatively easily with weighted cross entropy, where we use a weight term (that we call change weight) to give more importance to the rarer change instances and to avoid learning a network that simply predicts "no change" for every pixel.
In the typical change detection setting, and therefore also for the base training, we can take the frequency of change in the train set as a reference point to set this weight. However, in the fine-tuning phase, the support set will usually have a much higher proportion of changed pixels. If we are interested in deforestation, for example, we will usually select images of forests or parks where some logging happened, to guide the few-shot process. This will not reflect the distribution in the actual test set, where other scenes might also be present. Therefore, we commonly have a mismatch between training and test data in terms of frequency of change during the few-shot phase.
Another issue is that what we want to learn is specialized change, which is by definition not as frequent as the general change in the base training data. This implies that learning unchanged pixels (that is, "forgetting" change) is the most important part of fine-tuning, which leads us to bias the training more toward unchanged pixels in this phase (or at least less strongly toward changed ones).
Both of these aspects suggest using a lower change weight in the fine-tuning phase. It also seems reasonable to assume that for few-shot tasks where change is very rare (in our case, deforestation is such a task), a lower weight might be more sensible than for a relatively common type of change (such as building construction). Therefore, we investigate the impact of two different change weights, both lower than the one used during base training.
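To make the role of the change weight concrete, the following is a minimal sketch of a weighted binary cross entropy in plain Python. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `weighted_bce`, the flat list representation of pixels, and the plain averaging over pixels are all hypothetical choices made for this example.

```python
import math


def weighted_bce(probs, labels, change_weight):
    """Mean weighted binary cross entropy over a flat list of pixels.

    probs: predicted probability of "change" per pixel (strictly in (0, 1)).
    labels: 1 for changed pixels, 0 for unchanged pixels.
    change_weight: factor applied to the loss of changed pixels only.
    Assumption: the loss is averaged over the number of pixels.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Missing a changed pixel costs change_weight times as much.
            total += -change_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(labels)


# A model that leans toward "no change" everywhere, on a patch with one
# changed pixel out of ten.
probs = [0.1] * 10
labels = [1] + [0] * 9
loss_high_weight = weighted_bce(probs, labels, change_weight=2.0)
loss_low_weight = weighted_bce(probs, labels, change_weight=0.1)
```

With the higher weight, the same "no change" prediction is penalized more strongly, which pushes training toward predicting change; with the lower weight, errors on unchanged pixels dominate the loss instead.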
Number of Fine-tuning Epochs: How long we train during fine-tuning should determine the influence of the (specialized) support set compared to the (general) base training data. Initially, the information the model has learned from the full dataset should dominate, but over time, it will fit more and more to the specialized and narrow category from the few new images. Balancing this is therefore very important, and in addition, it has a direct impact on runtime: using, e.g., 100 instead of 10 epochs will increase the duration of fine-tuning roughly tenfold. As with the change weight, we will investigate a smaller and a larger number of epochs in our experiments.
Learning Rate: The learning rate decides how much the model adapts in each step. Choosing a suitable value for this is necessary for the information in the support set to improve the overall performance. However, in order to keep the scope of the investigations in this study reasonable and to not impede the analysis with too many variables, we decided to keep this at a fixed value across all experiments.
Dropout: We do not use dropout during base training, as initial experiments suggested a slightly worse performance on the validation set when considering all change as relevant. However, during fine-tuning, dropout might be very valuable to avoid overfitting to the very small support set. We found that the effect differs between tasks, which is why we include it in the parameters we investigate separately. Note that this is not the same as using dropout at test time, as is often done in order to assess the uncertainties of the model [56]. During inference, it is turned off, as usual.
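The distinction between dropout during fine-tuning and dropout at inference can be illustrated with a small sketch of the common "inverted dropout" formulation in plain Python. The function name `dropout` and the list-based interface are assumptions made for this example; deep learning frameworks provide their own layers for this.

```python
import random


def dropout(values, p, training, rng=None):
    """Inverted dropout on a list of activations.

    During training, each value is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p). At inference (training=False),
    the input is returned unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]


activations = [1.0] * 1000
# Fine-tuning: dropout active (p = 0.2 is the value used where applied).
train_out = dropout(activations, p=0.2, training=True, rng=random.Random(0))
# Inference: dropout switched off, activations pass through unchanged.
test_out = dropout(activations, p=0.2, training=False)
```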

C. Setup
Backbone: We use a standard change detection backbone, namely, the FC-Siam-Conc by Daudt et al. [57], as implemented by the TorchGeo Python package [58]. We adapted it slightly from this implementation, removing the final block (which consists of a 3 × 3 transposed convolution, batch normalization, a ReLU activation, and a dropout layer) as well as the dropout from the second to last block, and replacing the removed block by a 1 × 1 convolutional layer. Also, we disabled dropout during base training, as initial experiments showed slightly better results for the base change detection task. Dropout during fine-tuning is part of the hyperparameters investigated.
Base Training: The model is trained with Adam (using standard parameters), an initial learning rate of $5 \cdot 10^{-4}$ that decays exponentially with a γ of 0.95, and a weight decay factor of $1 \cdot 10^{-4}$. The maximum number of epochs is 100, but we choose the epoch with the lowest validation loss. Each 512 × 512 image is split into 9 patches of size 256 × 256, with overlap to reduce edge effects due to padding, and the batch size is B = 32. For the change weight, we choose $0.5 \cdot (1/p_{\text{change}} - 1)$, where $p_{\text{change}} = 0.2019$ is the fraction of changed pixels in the train set images.
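The numbers in the base training setup can be worked through with a short sketch. The function names below are hypothetical, and two details are assumptions rather than statements from the text: that the learning rate decays once per epoch, and that the 9 overlapping patches form a regular 3 × 3 grid (which implies a stride of 128 pixels).

```python
def change_weight(p_change, scale=0.5):
    """Change weight scale * (1 / p_change - 1) from the change frequency."""
    return scale * (1.0 / p_change - 1.0)


def learning_rate(epoch, initial_lr=5e-4, gamma=0.95):
    """Exponentially decayed learning rate (assumed: one decay step per epoch)."""
    return initial_lr * gamma ** epoch


def patch_offsets(image_size=512, patch_size=256, patches_per_side=3):
    """Top-left (row, col) offsets of overlapping patches on a regular grid.

    Assumption: patches are evenly spaced so that the first and last
    patch touch the image borders.
    """
    stride = (image_size - patch_size) // (patches_per_side - 1)
    return [
        (row * stride, col * stride)
        for row in range(patches_per_side)
        for col in range(patches_per_side)
    ]


base_weight = change_weight(0.2019)  # roughly 1.98
offsets = patch_offsets()            # 9 offsets from (0, 0) to (256, 256)
```

Under these assumptions, the base change weight is close to 2, so that both fine-tuning weights considered in this study (1 and 0.1) are indeed lower than the one used during base training.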
Fine-tuning: For fine-tuning, we also use Adam, a learning rate of $5 \cdot 10^{-4}$, and a weight decay factor of $1 \cdot 10^{-4}$. However, as the epochs are much shorter, we do not use learning rate scheduling. The number of fine-tuning epochs, the change weight, and whether to use dropout are varied as part of the parameter studies: we use values of 25 and 75 for the epochs, 1 and 0.1 for the change weight, and a dropout probability of 0.2 where it is applied.
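The resulting parameter study can be enumerated as a small grid. The following sketch lists the eight fine-tuning configurations implied by the values above; the dictionary keys and the function name `fine_tuning_grid` are hypothetical and chosen only for illustration.

```python
from itertools import product

EPOCHS = (25, 75)
CHANGE_WEIGHTS = (1.0, 0.1)
USE_DROPOUT = (False, True)
DROPOUT_P = 0.2  # dropout probability where dropout is applied


def fine_tuning_grid():
    """All combinations of the varied fine-tuning hyperparameters."""
    configs = []
    for epochs, weight, use_dropout in product(EPOCHS, CHANGE_WEIGHTS, USE_DROPOUT):
        configs.append(
            {
                "epochs": epochs,
                "change_weight": weight,
                "dropout_p": DROPOUT_P if use_dropout else 0.0,
                # Fixed across all fine-tuning experiments.
                "learning_rate": 5e-4,
                "weight_decay": 1e-4,
            }
        )
    return configs


grid = fine_tuning_grid()
```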
Training of Baseline B: In the case of Baseline B (i.e., training directly on the support set), we decided on individual hyperparameters by tuning on the validation set. Of course, the discussion of Section IV-C remains true, and we cannot do this in practice. However, the goal of this baseline is to see how much information is contained in the small support set and can be recovered with the backbone, so it serves more as a lower bound than as a real practical suggestion. In addition, experiments suggest that this baseline is much less sensitive to hyperparameter changes, and indeed using the same settings for all tasks yields results that are almost identical. The values used are a change weight of 0.1, a learning rate of $3 \cdot 10^{-4}$, and 3000 epochs for surface change; a change weight of 2, a learning rate of $2 \cdot 10^{-4}$, and 5000 epochs for deforestation; a change weight of 5, a learning rate of $4 \cdot 10^{-4}$, and 7500 epochs for building demolition; and a change weight of 1, a learning rate of $5 \cdot 10^{-4}$, and 7500 epochs for building construction. Dropout with probability 0.2 was used for all tasks except surface change. All other aspects of the training are done as in the fine-tuning case.
Number of runs: In order to account for the stochastic nature of training in deep learning, we perform ten base training runs with different random seeds. In addition, when performing fine-tuning, we run each of these models twice to account for the variance there. Similarly, for Baseline B, we also train 20 models, to have the same number of evaluations in the end.
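The bookkeeping of runs can be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation is an assumption for this example; the text does not state which estimator is used for the reported standard deviations.

```python
from statistics import mean, stdev


def run_ids(num_base_seeds=10, fine_tune_repeats=2):
    """(base seed, fine-tuning repeat) pairs: 10 x 2 = 20 evaluations."""
    return [
        (seed, repeat)
        for seed in range(num_base_seeds)
        for repeat in range(fine_tune_repeats)
    ]


def summarize(scores):
    """Mean and sample standard deviation of a metric over all runs."""
    return mean(scores), stdev(scores)


runs = run_ids()
# Toy scores for illustration only, not results from the experiments.
toy_mean, toy_std = summarize([0.5, 0.7])
```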
Metrics: For quantitative comparison, we use the standard metrics in change detection and semantic segmentation: Intersection over Union (IoU), Precision (Prec), Recall (Rec), and F1-Score (F1), computed from the binary annotation masks and ground truth, both for the full task and for the reduced ones.
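For reference, these four metrics can be computed from the pixelwise counts of true positives, false positives, and false negatives. The sketch below operates on flattened binary masks; the function name `binary_metrics` and the convention of returning 0.0 for an empty denominator are assumptions made for this example.

```python
def binary_metrics(pred, target):
    """IoU, precision, recall, and F1 for the "change" class.

    pred, target: flat sequences of 0/1 values of equal length.
    """
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1

    def ratio(num, den):
        # Assumed convention: a metric with empty denominator is 0.0.
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


metrics = binary_metrics(pred=[1, 1, 0, 0], target=[1, 0, 1, 0])
```

Note that unchanged pixels that are correctly predicted (true negatives) do not enter any of the four metrics, which is why they remain informative even when change is very rare.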

VI. RESULTS
The main results are shown in Table I (with standard deviations in brackets) and example outputs can be found in Fig. 4. Additional parameter variations are recorded in Table II. For the main results, in each task we use the parameters that give the highest F1-score (which should be taken with a bit of care, since, as mentioned above, we cannot choose these parameters based on a validation set in practice).
The first thing we can notice is the considerable difference between individual tasks, with a very low precision and overall performance on the deforestation task being the most noticeable. The most straightforward explanation for this is the fact that this change type is very rare, which more easily leads to a higher number of false positives. Indeed, we find that ranking the tasks by their F1 scores, their precision values, and the percentage of changed pixels (cf. Section V-A) all yield the same order, suggesting a clear connection there.
Putting aside the intertask variability, we can see that in every case, fine-tuning can considerably improve the performance compared to both baselines, showing that the few-shot filtering setting does indeed provide significant benefit for the detection of specialized change. However, regarding the comparison with Baseline B, we have to note that while we optimize its hyperparameters on the validation set, for the fine-tuning approach we choose the best performance on the test set (albeit over a much less extensive parameter range), so this has to be taken into account. Still, if we look at the worst hyperparameter choice for fine-tuning (which we are unlikely to pick when using the validation set in a comparable experiment), we beat Baseline B on three out of four tasks, albeit only by a slight margin on two of them, and for "reasonably adapted" choices (e.g., only choosing whether to use dropout during fine-tuning or not), the advantage is clear for all tasks.
Regarding hyperparameters, we notice that the best performing values are different for all four tasks, even in our restricted set of eight different combinations. This suggests that indeed some decision needs to be made based on the (expected) characteristics of each specialized change. Further, the effect of dropout is mixed. Generally, it increases precision and lowers recall; however, for the task of building demolition, it hurts both measures. Also, it shows a tendency to increase the variance between model runs. Considering the other parameters, increasing the number of fine-tuning epochs and lowering the change weight both also have the effect of increasing precision and lowering recall, with some exceptions and differing size of the effect. In the case of training for more epochs, this confirms the hypothesis that we start with high recall and low precision on the base model, and that training for longer adapts better to the support set, gaining precision but losing some change instances in the process. However, as noted, there are some exceptions (all in the surface change and building demolition tasks), and a larger number of epochs can even yield the opposite result. The effect of the change weight is more consistent and can also be easily explained, as a higher value gives relatively less importance to falsely identified change instances, therefore allowing for more false positives and a higher recall, but lower precision.

VII. DISCUSSION
Considering the results described in the previous section, we see that combining two different data sources indeed performs better than either one on its own, showing that the basic assumption is reasonable. The large differences between the individual tasks, however, show that, at least for the simple fine-tuning approach investigated in this article, change categories that are very rare in the base dataset do perform worse than relatively common ones. While these low performances might be an issue for some practical applications, we should also add that pixelwise statistics are not always the only relevant metrics. For example, when deciding on whether to update maps in certain regions, only a general measure of the amount of change is needed, and there the boost in precision might already be very beneficial.
Also, we restate the impact of a smart choice of parameters for every task, as for some settings, fine-tuning even performs worse than Baseline B. As a first observation, we have already seen that rare changes naturally have a low precision in the unadapted base model; therefore, we should generally use a lower change weight, dropout, or more fine-tuning epochs in these cases. However, this is not the full picture, and, e.g., we find for all tasks that the best results are achieved with a lower number of epochs and that dropout hurts performance for the relatively rare building demolition. Investigating the interaction of the individual hyperparameters, finding reasons for heterogeneous effects, such as that of the number of epochs, and exploring methods to find good values in a low data setting are therefore interesting avenues for further research.
We also see that Baseline B can perform surprisingly well, given that it only ever uses the five support images and has no access to the base training set. We cannot even exclude the possibility that, in particular with a very good choice of hyperparameters, a well-designed training strategy, or a better suited network architecture, Baseline B can be competitive with fine-tuning, and this might be worth investigating in future work. However, we believe that even for such approaches, there might still be value in using the base training set as an additional information source in some way, and that the few-shot filtering setting therefore still is the right lens for these approaches.

VIII. CONCLUSION
In this article, we have presented a new task for specialized change detection in low data regimes, investigated a fine-tuning approach to tackle it, compared it to two simple baselines, and studied the effect of hyperparameters that are difficult to assess in the absence of validation data. In addition, we described a way to adapt existing semantic change detection datasets to act as a proxy for the few-shot tasks, enabling the use of trusted data sources for this new setting. While there is still some room for improvement regarding the overall quality of the results, the ideas here should be seen as a first concrete step in the direction of adaptive change detection using only a few samples, and we hope that our approach of filtering out change will lead to new applications in less explored domains or geographic regions for which currently no large public training corpora are available, helping uncommon use cases and underrepresented communities.