An Advanced Dirichlet Prior Network for Out-of-Distribution Detection in Remote Sensing

Remote sensing deals with a plethora of sensors, a large number of classes/categories, and a huge variation in geography. Owing to the difficulty of collecting labeled data uniformly representing all scenarios, data-hungry deep learning models are often trained with labeled data from a source domain that is limited in the above-mentioned aspects. However, during the test/inference phase, such deep learning models are often subjected to distributional shifts, i.e., out-of-distribution (OOD) samples, in the form of unseen classes, geographic differences, and multi-sensor differences. Deep learning models can behave in an unexpected manner when subjected to such distributional uncertainties. Vulnerability to OOD data severely reduces the reliability of deep learning models, and trusting such predictions in the absence of any reliability indicator may lead to wrong policy decisions or mishaps in time-bound remote sensing applications. Motivated by this, in this work we propose a Dirichlet Prior Network-based model to quantify the distributional uncertainty of deep learning-based remote sensing models. The approach seeks to maximize the representation gap between in-domain and OOD examples for better segregation of OOD samples at test time. Extensive experiments on several remote sensing image classification data sets demonstrate that the proposed model can quantify distributional uncertainty. To the best of our knowledge, this is the first work to elaborately study distributional uncertainty in the context of remote sensing. The codes are publicly available at.

in various remote sensing tasks, including classification [2], [3], hyperspectral image analysis [4], [5], [6], semantic segmentation [7], [8], [9], change detection [10], [11], [12], image retrieval [13], [14], target detection [15], [16], disaster management [17], [18], cloud detection and removal [19], [20], [21], and image fusion [22], [23], [24]. Most of these methods assume that the model is trained on a data set that has similar geographical characteristics as the target area [25], i.e., that the source data distribution is the same as the target data distribution. Moreover, they assume that the source and the target data have an identical set of classes. But in practice, remote sensing deals with a large number of sensors, operates across significantly varying geographies, and covers a large number of classes [26]. Considering this variation, the above assumptions often do not hold in remote sensing. There are a few works related to domain adaptation [27], [28], [29] that try to align the target distribution with the source distribution. However, such methods are only effective when the domain shift between the source and target is small. Moreover, they do not consider the presence of unseen/open-set classes. Deep learning models are likely to fail or behave in an unexpected way when faced with open-set classes, e.g., when a deep model trained on images from a forest area is applied to urban images consisting of residential complexes and parking lots. Similarly, deep models behave in an unexpected way when they are fed with data from seen classes but with considerable geographic variation, e.g., when a model trained on European urban areas (where skyscrapers are rare) is used to predict test images from Asian urban areas. When deep learning models fail, they do not provide sufficient clues to the user, which can have unforeseeable impacts in remote sensing applications, especially in time-bound and safety-critical ones.
As an example, consider the scenario of disaster management after an earthquake, where unreliable predictions may lead the rescue team to the wrong site, at the expense of human lives. Nevertheless, unreliable predictions may also negatively impact non-time-bound applications, e.g., a building detection model trained for Europe and used unreliably on Asia/Africa may lead to incorrect estimations of the building density and thus impact subsequent policy decisions.
Towards designing reliable deep learning models that are aware of different sources of uncertainty, predictive uncertainty estimation has recently emerged as a research topic in the machine learning community [30]. Uncertainty estimation informs users about the confidence in a prediction, thus improving the reliability of such systems. Deep learning-based classification models are prone to predictive uncertainties from three different sources [31]: model (aka epistemic) uncertainty, data (aka aleatoric) uncertainty, and distributional uncertainty. Epistemic uncertainty stems from a model's lack of knowledge (e.g., limited training data, limited model complexity, errors in the training process), while aleatoric uncertainty arises from complexities in the data distribution (e.g., class overlap). Distributional uncertainty is related to the mismatch between the training and the test data and can be seen as a special case of model uncertainty [32]. In remote sensing, distributional uncertainty may arise for various reasons, e.g., unseen classes, geographic differences, and sensor differences. Considering its high relevance in remote sensing, in this work we focus on distributional uncertainty [31].
The key contributions of this paper are as follows: 1) Introducing the concept of out-of-distribution detection in remote sensing. 2) Proposing a Dirichlet Prior Network (DPN)-based model that can quantify distributional uncertainty in the context of different remote sensing uncertainty sources. 3) Extensively experimenting on large-scale remote sensing data sets for open set recognition, sensor shift, and region shift. 4) Providing a benchmark that can facilitate further research on distributional uncertainty in remote sensing. Extensive experiments demonstrate that the proposed approach is able to detect OOD examples in remote sensing images, thus improving the reliability and robustness of deep learning-based models. To the best of our knowledge, this is the first work that extensively addresses out-of-distribution detection in remote sensing^1.
The rest of the paper is organized as follows. We briefly discuss the related works in Section II. In Section III we detail the proposed method, and in Section IV the data sets, experiments, and results are presented. A critical discussion of different distributional uncertainties in the context of our results is given in Section V. We conclude the paper and discuss the scope of future research in Section VI.

II. RELATED WORKS
Uncertainty quantification gained the attention of the remote sensing community even before the emergence of deep learning [33], [34]. Despite this, there are only a few works that explore distributional uncertainty for remote sensing and closely related topics [35]; we briefly discuss them in Section II-A. We also briefly discuss the existing Bayesian paradigms for handling uncertainty in the machine learning literature (Section II-B). Our work is not in contrast with the domain adaptation literature, as explained in Section II-C.

A. Detecting distributional shifts in remote sensing
One common form of distributional shift is the presence of new classes in the target data. This problem has also been treated as open set recognition. Silva et al. [36] proposed a method for open set aerial image segmentation. They assign a pixel whose class confidence (given by the soft-max) exceeds a threshold to that class. However, if the pixel-wise probability is below the threshold, the pixel is classified as open-set. Dang et al. [37] proposed an open set incremental learning-based method for target recognition by exploiting extreme value theory (EVT). Wu et al. [38] introduced open set recognition to hyperspectral image classification.
^1 The code for this work is available under https://gitlab.lrz.de/ai4eo/Uncertainty/-/tree/main/DPN-RS.
A few works identified that models are likely to fail if applied to new geographic locations considerably different from the training data [39], [40]. To quantify the area of applicability, Meyer and Pebesma [39] proposed a dissimilarity index based on the minimum distance to the training data in multidimensional predictor space. In [25], an applicable model is learned by using unlabeled data from each geography of interest.
Contrary to the previous works, our work tackles all forms of distributional shift (e.g., open set, spatial shift, and sensor shift) in the same framework. Moreover, contrary to previous works [36] that employ trivial solutions, our work is based on Dirichlet Prior Networks, a theoretically well-founded framework for uncertainty estimation. Our work is also a step forward towards building explainable remote sensing models [41], [42].

B. Bayesian frameworks for uncertainty
Bayesian frameworks are traditionally used to model the predictive uncertainty of a classifier. The sources of uncertainty [31] can be broadly grouped into the following three categories: 1) Epistemic uncertainty characterizes the uncertainty caused by the network's lack of knowledge, caused for example by insufficient training data, a shortage of model capacity, or an insufficient training process. 2) Aleatoric uncertainty arises from the complexity of the data distribution, e.g., class overlap and label noise; for instance, data with different values in label space may have very similar representations in the feature domain. 3) Distributional uncertainty arises from a mismatch between the distributions of the training and the test data. Distributional uncertainty is likely in remote sensing due to new classes in the target data, geographic shift, and multi-sensor differences. Data uncertainty is in general modeled as a confidence prediction by the neural network itself, e.g., by a soft-max probability vector [43], [32]. Bayesian neural network-based approaches capture the model uncertainty by modeling the network parameters as random variables. A posterior distribution over the parameters is derived from the given training data, and predictions are realized by sampling different sets of parameters from this posterior. Different ways of approximating such a posterior are available, e.g., Monte Carlo dropout [43], Laplace approximation [44], or deep ensembles [32]. However, producing such ensembles is computationally very expensive, which limits the application of existing ensemble and Bayesian approaches in this scenario. The Dirichlet Prior Network (DPN) and its variants were introduced in [31], [45] as an efficient adaptation of Bayesian networks. Our work directly derives from [31], [45], thus exploiting the benefits of Bayesian modeling while remaining computationally efficient.
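To make the sampling-based alternative concrete, the following sketch illustrates Monte Carlo dropout [43] in plain NumPy: dropout is kept active at inference time and the soft-max output is averaged over repeated stochastic passes, with the spread across passes indicating epistemic uncertainty. The tiny two-layer network and all sizes are hypothetical, chosen only for illustration; this is not the architecture used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer classifier with fixed random weights (for illustration).
W1 = rng.normal(size=(16, 64)); b1 = np.zeros(64)
W2 = rng.normal(size=(64, 4)); b2 = np.zeros(4)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, p_drop=0.5):
    """One stochastic forward pass: the dropout mask stays active at inference."""
    h = np.maximum(x @ W1 + b1, 0.0)
    mask = rng.random(h.shape) > p_drop   # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)         # inverted-dropout scaling
    return softmax(h @ W2 + b2)

def mc_dropout_predict(x, n_samples=50):
    """Average the soft-max over stochastic passes; the per-class standard
    deviation across passes reflects epistemic (model) uncertainty."""
    probs = np.stack([forward(x) for _ in range(n_samples)])
    return probs.mean(axis=0), probs.std(axis=0)

x = rng.normal(size=(8, 16))
mean_p, std_p = mc_dropout_predict(x)
```

Note that each prediction requires `n_samples` forward passes, which is exactly the computational cost that motivates the single-pass DPN alternative discussed next.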

C. Position in reference to domain adaptation
Domain adaptation [28], [46] is a branch of multi-domain learning. A model trained on a source domain is modulated by domain adaptation techniques to be applied on another target domain. However, domain adaptation assumes that either a few labeled data samples or a large unlabeled data set from the target domain is available during the training of the model. If the target domain data is completely unseen during training, most domain adaptation methods cannot mitigate the differences between domains and may eventually produce unreliable predictions. Thus, it is important to be able to identify test samples that are drawn from a distribution unseen during training. This is where out-of-distribution detection comes into play, further pushing forward the paradigm of multi-domain learning.

III. PROPOSED METHOD
Remote sensing deals with a vast set of data types, varying in geography, climate conditions, sensor properties, end applications, and target classes. It is expensive, both in terms of time and effort, to collect labeled data uniformly representing all scenarios. Thus, most deep learning models are trained with limited training samples from a source domain that is limited in the above-mentioned aspects. During test/inference, when the model is fed with data that does not follow the source domain distribution, the model predicts in an unexpected fashion. Our goal is to propose a framework that handles the above-mentioned sources of uncertainty in the same framework without any adjustment for different sources of uncertainty. Towards this, we adopt an efficient adaptation of the DPN approach [45]. The Dirichlet distribution is popularly used as a prior distribution in Bayesian learning [47]. Motivated by this, Malinin and Gales [31] proposed Dirichlet Prior Networks (DPNs) for an improved detection of OOD samples. DPNs are deterministic neural networks that efficiently mimic the behavior of Bayesian neural networks by parameterizing a Dirichlet distribution over the categorical distribution given by a soft-max classification output. Conveniently for remote sensing applications, any neural network with a soft-max activation can be considered as a DPN. Following the idea of Malinin and Gales, several other DPN-based methods for OOD detection were developed [48], [49], [45]. In this work we take inspiration from the Dirichlet distribution-based approaches and propose DPN-RS, which transfers DPNs to remote sensing settings.
Section III-A briefly introduces the Dirichlet distribution. In Section III-B, we detail Dirichlet Prior Networks (DPNs), and we briefly discuss their suitability for remote sensing data in Section III-C. Finally, we present DPN-RS in Section III-D.

A. Dirichlet distribution
In probability theory, a categorical distribution is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories [50]. In classification tasks, the popularly used soft-max activation function transforms the output of a neural network into a probability vector describing a categorical distribution.

Fig. 1. A visualization of the data environment. The in-distribution represents the data on which the network is expected to deliver accurate predictions. Out-of-distribution (OOD) covers any other kind of data that is significantly different from the training data distribution. The OOD training data set is used to train the network to handle OOD examples, but can only cover a small portion of the OOD region. Once trained, the network can handle any OOD data.

The Dirichlet distribution is the conjugate prior of the categorical distribution and can be interpreted as a distribution over categorical distributions. While the probability vector given by a soft-max function represents a single point on the underlying solution simplex, the Dirichlet distribution represents a distribution over this simplex. Hence, it can be used to represent the uncertainty in a classification network's output vector. A Dirichlet distribution over K classes is described by class concentrations {α_1, ..., α_K} > 0 and a derived precision value α_0 = Σ_{c=1}^K α_c. With this, the density of the Dirichlet distribution is given by

Dir(μ|α) = Γ(α_0) / (∏_{c=1}^K Γ(α_c)) · ∏_{c=1}^K μ_c^{α_c − 1},   (1)

where Γ is the gamma function. A higher class concentration α_c for class c shifts more probability mass towards the corner of class c. The higher the resulting precision value, the sharper the Dirichlet distribution, i.e., the lower the variety in the plausible categorical probability vectors. Fig. 2 visualizes this for a Dirichlet distribution over three classes.
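The roles of the class concentrations and the precision can be verified by sampling. The sketch below, using `scipy.stats.dirichlet` with three illustrative (not paper-specified) concentration vectors, shows that the sample mean approaches α_c/α_0 and that the spread of samples shrinks as the precision α_0 grows:

```python
import numpy as np
from scipy.stats import dirichlet

# Three 3-class concentration vectors, chosen only for illustration.
confident = np.array([20.0, 1.0, 1.0])    # mass near the corner of class 1
flat      = np.array([1.0, 1.0, 1.0])     # uniform over the whole simplex
sharp     = np.array([15.0, 15.0, 15.0])  # sharp peak at the simplex center

for alpha in (confident, flat, sharp):
    samples = dirichlet.rvs(alpha, size=5000, random_state=0)
    mean = samples.mean(axis=0)            # approaches alpha / alpha_0
    spread = samples.std(axis=0).mean()    # shrinks as alpha_0 grows
    print(alpha.sum(), np.round(mean, 2), np.round(spread, 3))
```

The `confident`, `flat`, and `sharp` vectors correspond qualitatively to the three cases visualized in Fig. 2(a)-(c).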

B. Dirichlet Prior Network
Since a Dirichlet distribution describes a distribution over categorical distributions, it can be used as a distribution over the outputs of a neural network with K outputs. For a neural network f_θ(x) with parameters θ and input x, the network outputs before the soft-max activation function are called the logits and are given by f_θ(x) = z(x) = (z_1(x), ..., z_K(x)) ∈ R^K. The logits are in general unbounded and can be both positive and negative. A Dirichlet Prior Network (DPN) uses the logit output to predict the log-concentrations of a Dirichlet distribution. With predicted logit values z_1, ..., z_K, the network parameterizes a Dirichlet distribution with (positive) concentrations α_c = exp(z_c(x)), c = 1, ..., K. Equivalently, the precision value is given by α_0 = Σ_{c=1}^K exp(z_c(x)). With this formulation, the posterior distribution p(ω|x, θ) over the possible class labels ω ∈ {1, 2, ..., K} is given by the expected value of the Dirichlet distribution,

p(ω = c|x, θ) = α_c / α_0 = exp(z_c(x)) / Σ_{k=1}^K exp(z_k(x)).   (2)

The posterior given in (2) is equivalent to applying the soft-max function to the logit values of the network. The challenge in optimizing the posterior distribution using standard neural networks with a soft-max activation function and cross-entropy loss lies in the scaling of the posterior. As evident from (2), the scaling of the concentrations (α_c) cancels in the posterior while it determines the precision (α_0). Thus, looking only at the soft-max value, one cannot draw conclusions about the precision of the Dirichlet distribution. Consequently, the network is optimized based on pointwise estimates of the posterior distribution instead of taking the uncertainty of the posterior into account. As a result, it is not possible to separate distributional and data uncertainty effectively, which makes the detection of out-of-distribution samples difficult.
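The scaling problem can be demonstrated in a few lines: adding a constant to all logits multiplies every α_c (and hence α_0) by the same factor while leaving the soft-max posterior unchanged. The logit values below are arbitrary illustrative numbers:

```python
import numpy as np

def dpn_posterior(logits):
    """Interpret logits z_c as log-concentrations: alpha_c = exp(z_c).
    The expected categorical distribution alpha_c / alpha_0 equals the
    ordinary soft-max output."""
    alpha = np.exp(logits)
    alpha0 = alpha.sum()
    return alpha, alpha0, alpha / alpha0

# Two logit vectors with identical soft-max output but very different
# precision: shifting all logits by +4 scales every alpha_c by e^4.
z_low = np.array([1.0, 0.0, 0.0])
z_high = z_low + 4.0

alpha_l, a0_l, post_l = dpn_posterior(z_low)
alpha_h, a0_h, post_h = dpn_posterior(z_high)
print(a0_l, post_l)   # small precision
print(a0_h, post_h)   # same posterior, much larger precision
```

Since the cross-entropy loss only sees the posterior, it cannot distinguish these two cases; this is the motivation for training DPNs with explicit control over the precision.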
The DPN tackles the above-mentioned challenge by designing a multi-task learning paradigm. In order to separate indistribution samples and OOD samples, the network is trained on a mixture of two sets, a set of in-distribution samples (D in ) and an additional set of OOD samples (D out ). Please note that the set D out for training is not necessarily drawn from same distribution as the OOD samples during test/evaluation (see Fig. 1). The OOD samples during training (D out ) are only used to learn a boundary on the in-distribution samples. Once trained, the network can be applied on any OOD samples, even those that have a completely different distribution than the OOD samples used during training.
The general purpose of DPNs is to predict different forms of Dirichlet distributions in order to separate the following three cases: 1) In-distribution examples where the network is certain in its prediction. 2) In-distribution examples where the network is uncertain. 3) Out-of-distribution examples. DPNs seek to differentiate between in-domain and out-of-distribution samples based on the predicted class concentrations. More explicitly, for confidently classified in-domain samples they aim to produce a uni-modal distribution at the corner of the solution simplex corresponding to the correct class (Fig. 2(a)) [31]. For in-domain samples with high data uncertainty, DPNs aim to produce a sharp distribution at the center (Fig. 2(b)), and for OOD data a flat distribution (Fig. 2(c)).
The key architecture of the deep model remains unmodified with a DPN, except for removing the soft-max activation after the final layer, i.e., outputting the logits. However, the key to achieving the desired behavior is the design of a multi-task optimization loss, i.e., a loss which simultaneously supports the network in learning the classification task for in-distribution samples and in predicting very small class concentrations for OOD examples. For that, the loss has to differentiate whether a prediction is based on a sample from D_in or D_out and hence should be of the form

L(θ) = L_in(θ) + γ L_out(θ),   (3)

where γ > 0 is a scalar balancing the impact of in-distribution and OOD samples. In order to achieve the desired behavior, Malinin and Gales [31] presented a loss function based on the Kullback-Leibler (KL) divergence between the target Dirichlet distribution Dir(μ|α_in) or Dir(μ|α_out) for a sample x and the corresponding predicted Dirichlet distribution p(μ|x, θ):

L_kl(θ; α_in, α_out) = E_{P_in}[KL(Dir(μ|α_in) || p(μ|x, θ))] + γ E_{P_out}[KL(Dir(μ|α_out) || p(μ|x, θ))].   (4)

Here, P_in and P_out describe the in- and out-distribution, and α_in and α_out represent the ground-truth target concentrations. Since the target concentrations cannot be derived from the one-hot encoding (due to the scaling described before), these values have to be chosen beforehand [31]. Based on further investigations, Malinin and Gales [48] also presented a loss function based on the reverse Kullback-Leibler divergence:

L_rkl(θ; α_in, α_out) = E_{P_in}[KL(p(μ|x, θ) || Dir(μ|α_in))] + γ E_{P_out}[KL(p(μ|x, θ) || Dir(μ|α_out))].   (5)

The reverse Kullback-Leibler divergence improved the numerical stability and OOD detection results in comparison to [31]. However, as shown by Nandy et al. [45], for in-domain examples with high aleatoric uncertainty among multiple classes, DPNs produce flat Dirichlet distributions. In practice, this can easily lead to representations which are indistinguishable from those of OOD examples.
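The KL divergence between two Dirichlet distributions, which both (4) and (5) rely on, has a closed form in terms of the log-gamma and digamma functions. The sketch below implements it with SciPy; the concentration values are hypothetical, chosen only to illustrate a sharp in-domain target, a flat OOD target, and one predicted distribution:

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_dirichlet(alpha, beta):
    """Closed-form KL divergence KL( Dir(alpha) || Dir(beta) )."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + ((alpha - beta) * (digamma(alpha) - digamma(a0))).sum())

# Hypothetical target concentrations for illustration: a sharp target
# peaked on the correct class for in-domain data versus a flat target
# for OOD data, compared against one predicted distribution.
alpha_in = np.array([100.0, 1.0, 1.0])
alpha_out = np.array([1.0, 1.0, 1.0])
alpha_pred = np.array([50.0, 2.0, 2.0])

forward_in = kl_dirichlet(alpha_in, alpha_pred)   # direction used in (4)
reverse_in = kl_dirichlet(alpha_pred, alpha_in)   # direction used in (5)
print(forward_in, reverse_in, kl_dirichlet(alpha_pred, alpha_out))
```

The two directions generally give different values, which is why the forward and reverse formulations in (4) and (5) behave differently during training.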

C. Suitability of classical DPN for remote sensing
The DPN is a suitable framework for remote sensing image classification for the following reasons: 1) Considering the variety of remote sensing data, OOD data may come in many unforeseeable forms. DPNs provide the flexibility that samples from all possible distributions do not need to be seen during the training phase. E.g., in a spatially varying system, if the in-domain training data belongs to Europe and the OOD training data belongs to Africa, the DPN model is still capable of handling OOD test data from Asia.

2) DPNs can be used without altering the key architecture of the models already used in remote sensing classification.

3) A DPN is a single deterministic neural network for which only one forward pass per evaluation has to be performed. This leads to less computation than for other approaches such as ensembles or Bayesian neural networks, an important advantage, especially for very large scale EO applications.
Due to the large number of classes with strong inter-class similarity, it is common in remote sensing for in-domain samples to have high aleatoric uncertainty among multiple classes. In such cases, DPNs produce a flatter Dirichlet distribution [45]. This leads to representations which are harder to distinguish from those of OOD samples. In other words, for remote sensing applications, a DPN may confuse aleatoric uncertainty with distributional uncertainty. This limits the practical application of traditional DPNs [48] in remote sensing. Hence, to alleviate this problem, inspired by [45], we propose DPN-RS, which can effectively segregate OOD samples from in-domain data.

D. DPN-RS
To overcome the challenges introduced in Section III-C, our approach aims at learning a sharp multi-modal distribution (α_0 << 1, see Fig. 2(d)) instead of a flat uni-modal distribution for OOD examples. The precision regularization is achieved by introducing a bounded regularization term given by the sigmoid function on the logits,

α̃_0 = Σ_{c=1}^K sigmoid(z_c(x)),

which is used as a regularizer along with the cross-entropy loss. This gives the following two loss formulations for in-domain and OOD examples:

L_in(θ) = H_ce(y, p(ω|x, θ)) − λ_in α̃_0   (6)

and

L_out(θ) = H_ce(U, p(ω|x, θ)) − λ_out α̃_0,   (7)

where U denotes the uniform distribution over all classes and H_ce denotes the cross-entropy function. With this approach, the ground truth is given by a probability vector and can therefore be directly derived from the class labels; no target concentrations have to be chosen. The precision is controlled by two hyper-parameters λ_in and λ_out [45], and the combined loss function is given by

L(θ) = E_{P_in}[L_in(θ)] + γ E_{P_out}[L_out(θ)],   (8)

where again in-domain and OOD samples are balanced by a hyper-parameter γ > 0. For the proposed approach, we use λ_in > 0 and λ_out < 0. For in-domain examples which are confidently predicted, the cross-entropy loss maximizes the logit value of the correct class. For in-domain samples with aleatoric uncertainty, the optimizer maximizes sigmoid(z_c(x)) for all classes, thus yielding a sharp distribution centered on the solution simplex. By choosing λ_out < 0, DPN-RS produces negative values z_c(x*) for an OOD example x*. This leads to α_c << 1 for all c = 1, ..., K, and thus an OOD sample yields a sharp multi-modal Dirichlet distribution with probability mass at each corner of the simplex (Fig. 2(d)). The distributions in Fig. 2(b) and Fig. 2(d) are well separated over the simplex, making the OOD samples easily distinguishable from the in-domain ones. Fig. 3 visualizes the training process of the proposed approach. Fig. 5 presents examples of a certain in-distribution prediction, an uncertain in-distribution prediction, and an out-of-distribution prediction, together with different derived measures which can be used to separate in-distribution from out-of-distribution samples.
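A minimal NumPy sketch of the DPN-RS objective is given below: cross-entropy against a one-hot target (in-domain) or the uniform distribution (OOD), minus a λ-weighted bounded precision term Σ_c sigmoid(z_c). The logit and hyper-parameter values (`lambda_in`, `lambda_out`, `gamma`) are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dpn_rs_loss(logits, target, lambda_reg):
    """Cross-entropy against `target` minus lambda_reg times the bounded
    precision term sum_c sigmoid(z_c).  Use lambda_reg > 0 for in-domain
    batches (one-hot targets) and lambda_reg < 0 for OOD batches
    (uniform targets), so that OOD logits are pushed negative."""
    p = softmax(logits)
    ce = -(target * np.log(p + 1e-12)).sum(axis=-1)
    precision = sigmoid(logits).sum(axis=-1)
    return (ce - lambda_reg * precision).mean()

K = 5
in_logits = np.array([[4.0, -1.0, -1.0, -1.0, -1.0]])  # confident sample
ood_logits = np.full((1, K), -6.0)                      # small concentrations
one_hot = np.eye(K)[[0]]
uniform = np.full((1, K), 1.0 / K)

# Hypothetical hyper-parameter values, for illustration only.
lambda_in, lambda_out, gamma = 0.5, -0.5, 1.0
l_in = dpn_rs_loss(in_logits, one_hot, lambda_in)
l_out = dpn_rs_loss(ood_logits, uniform, lambda_out)
total = l_in + gamma * l_out   # combined in-domain/OOD objective
print(l_in, l_out, total)
```

With `lambda_reg < 0` the regularizer penalizes large sigmoid(z_c), so driving all logits negative lowers the OOD loss, matching the behavior described above.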

IV. EXPERIMENTAL VALIDATION
In Section IV-A, we briefly present the data sets used in our experiments. The experimental settings are discussed in Section IV-B. The rest of this section presents the results and analyses for each experiment.

A. Data sets
We perform our experiments on three different data sets, namely the Aerial Image Data set (AID) [52], the UC Merced Land Use Data set (UCM) [51], and the So2Sat Local Climate Zone 42 (LCZ42) Data set [53]. In the following, the data sets are briefly described. An overview of the classes contained in the different data sets is given in Fig. 4.
Fig. 4. The defined classes and corresponding example patches of the UCM data set [51], the AID [52], and the So2Sat LCZ42 data set [53]. For LCZ42, only the three bands representing red, green, and blue are visualized.

1) AID Data Set:
The Aerial Image Data set (AID) [52] contains very high resolution aerial RGB images of 600 × 600 pixels. The data set covers 30 different classes, each represented by more than 300 samples. We split the data set randomly into 70% for training and 30% for testing. Furthermore, the images are cropped and resized to 256 × 256 pixels. All experiments are based on a ResNet50 neural network pretrained on ImageNet.
2) UC Merced Landcover Data set: The UC Merced (UCM) data set [51] contains high-resolution aerial RGB images with one-foot ground sampling distance and 256 × 256 pixels size. The data set covers 21 different classes, each represented by 100 samples. Again, we split the data set randomly into 70% for training and 30% for testing. All experiments are based on a ResNet50 neural network pretrained on ImageNet.
3) So2Sat LCZ42 Data set: The So2Sat LCZ42 data set [53] provides about half a million co-registered Sentinel-1 and Sentinel-2 patches. For our experiments we only use the optical Sentinel-2 images. The 32 × 32 patches are taken from 42 different regions worldwide, and for each sample a Local Climate Zone (LCZ) label is provided. The data is split into a training set of 352366 patches and validation and test sets containing 24188 and 24119 patches, respectively, sampled from regions different from the regions of the training set. An overview of the local climate zones is given in Fig. 4. We build our networks based on the network structure proposed in [54] but without multi-level fusion. For the experiments related to open-set recognition and sensor shifts, we want to avoid a region shift and therefore work only on the training set of the original data set, which we split into 70%-30% for our training and testing.

B. Experimental settings
We evaluate the performance of the presented methods on three different remote sensing tasks:
• Open Set Recognition, where the test set contains classes unseen during training.
• Channel Separation, where the test set contains images from different channels than the training images. This simulates a multi-sensor scenario.
• Location Separation, where the test set contains images from different spatial locations than the training images.
We run the experiments within single data sets and without mixing different data sets. Intuitively, it is clear that when working with different data sets, the similarity between the data set used as in-distribution and the data set used as OOD during training is crucial for the OOD detection performance. In Table X we show the results of DPN-RS on the open set problem when using UCM, AID, and LCZ42 at the same time. One can clearly see that the similar resolution of AID and UCM has a significant influence on the OOD detection performance.
We compare the proposed method to the following paradigms, whose main properties are also summarized in Table I: 1) DPN+ [45]: A DPN-based approach with precision regularizing factors λ_in > 0 and λ_out > 0. 2) DPNrev [48]: A DPN that uses the reverse Kullback-Leibler divergence as in (5) to compare the predicted and the ground-truth Dirichlet distributions. 3) DPNforw [31]: A DPN that uses the Kullback-Leibler divergence as in (4) to compare the predicted and the ground-truth Dirichlet distributions. 4) Evidential neural network [49]: The Evidential Neural Network (ENN) does not require any out-of-distribution training data. ENN is motivated by subjective logic and also interprets the logits as the parameterization of a Dirichlet distribution. But in contrast to DPNs, ENNs set the class concentrations in relation to an additional constant concentration that is interpreted as an unknown class. For ENNs, different loss functions are presented in [49]; for our analysis we use the expected cross-entropy loss. The Receiver Operating Characteristic (ROC) is popularly used to present results for binary decision problems in machine learning [55]. Accordingly, we use the Area under the ROC curve (AUROC) to present the OOD detection performance based on four popular indicators, namely maximum probability, entropy, mutual information, and α_0 [45].
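Three of these indicators and the AUROC computation can be sketched compactly. The snippet below derives negative maximum probability, predictive entropy, and negative precision from logits, and computes AUROC via the Mann-Whitney statistic rather than a library call; the synthetic logits standing in for network outputs are illustrative assumptions only:

```python
import numpy as np

def indicators(logits):
    """Uncertainty indicators for OOD detection, oriented so that larger
    values mean 'more likely OOD': negative maximum probability,
    predictive entropy, and negative precision alpha_0."""
    alpha = np.exp(logits)
    alpha0 = alpha.sum(axis=-1)
    p = alpha / alpha0[:, None]
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return {"max_prob": -p.max(axis=-1), "entropy": entropy, "alpha_0": -alpha0}

def auroc(scores_in, scores_ood):
    """AUROC as the Mann-Whitney statistic: the probability that a random
    OOD sample scores higher than a random in-distribution sample."""
    diff = scores_ood[None, :] - scores_in[:, None]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
# Synthetic logits: confident in-domain predictions (one boosted class)
# versus uniformly small OOD concentrations.
in_logits = rng.normal(2.0, 0.5, size=(300, 10)); in_logits[:, 0] += 4.0
ood_logits = rng.normal(-3.0, 0.5, size=(300, 10))

for name in ("max_prob", "entropy", "alpha_0"):
    a = auroc(indicators(in_logits)[name], indicators(ood_logits)[name])
    print(name, round(a, 3))
```

In practice one would feed the real network's logits into `indicators` and report the AUROC per indicator, as done in the tables below.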
For the approaches that make use of OOD samples at training time, we generated batches containing 50% in-distribution and 50% OOD samples. Based on preliminary experiments, we have chosen the hyper-parameters λ_in, λ_out, and γ for the losses defined in (6), (7), and (8). The targets are chosen as shown in Table I. We report all results as mean and standard deviation over seven different runs. Even though the objective of the proposed method is OOD detection, we also report in-domain classification performance: the accuracy computed over all in-domain samples (denoted "accuracy" in the tables), the accuracy computed separately per class and then averaged (denoted "average accuracy" in the tables), and Cohen's Kappa value.

C. Open Set Recognition
Open set recognition is an important problem in computer vision [56] and remote sensing [36]. To simulate open-set behavior in a remote sensing data set, we split the given data sets into three subsets. The sets of classes in each subset are disjoint, i.e. each class is part of exactly one subset. For the open set recognition problem, we use one of the subsets as in-distribution samples, one as OOD samples that are given at training time and the third one as OOD samples reserved for testing the OOD detection performance.
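The class partitioning described above can be sketched in a few lines. The helper below (a hypothetical utility, not the paper's released code) randomly partitions the class indices into three disjoint subsets serving as in-distribution, OOD-for-training, and OOD-for-testing:

```python
import numpy as np

def open_set_splits(num_classes, num_subsets=3, seed=0):
    """Randomly partition class indices into disjoint subsets: one serves
    as in-distribution, one as OOD seen at training time, and one as OOD
    reserved for testing the detection performance."""
    rng = np.random.default_rng(seed)
    classes = rng.permutation(num_classes)
    return np.array_split(classes, num_subsets)

# e.g., the 30 AID classes split into three groups of 10 classes each.
in_dist, ood_train, ood_test = open_set_splits(30)
print(sorted(in_dist.tolist()), sorted(ood_train.tolist()), sorted(ood_test.tolist()))
```

Repeating this with different seeds yields the random splits used alongside the hand-crafted split in the experiments below.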

1) Open Set Recognition on AID:
For open set recognition we split the 30 classes of the AID into three groups of 10 classes each. In order to evaluate the robustness of the considered approaches, we consider a hand-crafted split into human-built scenes and non-human-built scenes, as well as 5 random splits of classes. The resulting in-domain and OOD data sets are described in Table II. We tabulate the open-set recognition accuracy and classification accuracy in Table III and Table IV, respectively. All methods perform relatively well on this data set. While the DPN-based approaches (DPN-RS, DPN+, DPNforw, and DPNrev) achieve AUROC values above 0.9 for the OOD detection task, the ENN achieves at least 0.80 on average. Over all test cases, all DPN-based approaches are among the best performers, with no more than 1% difference.
All approaches perform satisfactorily with regard to the classification accuracy on the in-distribution samples. All DPN-based approaches obtain an average accuracy higher than 95% in all test cases.
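The OOD detection performance reported throughout is the AUROC obtained by ranking samples by their uncertainty score. As an illustration, it can be computed directly from two sets of scores via the rank-sum (Mann-Whitney) formulation; a minimal sketch, not the paper's evaluation code:

```python
import numpy as np

def auroc(scores_in, scores_ood):
    """AUROC for separating OOD (positive) from in-distribution samples.
    Higher scores are assumed to mean 'more uncertain / more likely OOD'."""
    scores = np.concatenate([scores_in, scores_ood])
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Average the ranks of tied scores (Mann-Whitney convention).
    for v in np.unique(scores):
        tied = scores == v
        ranks[tied] = ranks[tied].mean()
    n_in, n_ood = len(scores_in), len(scores_ood)
    rank_sum = ranks[n_in:].sum()
    return (rank_sum - n_ood * (n_ood + 1) / 2) / (n_in * n_ood)
```

An AUROC of 1.0 means every OOD sample is scored as more uncertain than every in-distribution sample; 0.5 corresponds to chance-level separation.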
2) Open Set Recognition on UCM: We split the 21 classes of the UCM data set into three groups of 7 classes each. In order to evaluate the robustness of the considered approaches, we consider a hand-crafted split into human-built and non-human-built scenes. Furthermore, we consider 5 random splits of classes. The resulting in-domain and OOD data sets are described in Table V. The OOD detection performance and classification results on the UCM data set are presented in Table VI and Table VII, respectively. Regarding the OOD detection task, the DPN-based approaches perform satisfactorily with AUROC values of at least 0.95. Even though DPN-RS gives the highest average AUROC score in four out of six cases, the results of the DPN-based approaches are very close to each other.
Recall from Table I the targets of the DPN-based approaches: DPN+ aims to predict high class concentrations for in-distribution samples and high and uniform class concentrations for OOD samples, and is trained with a cross-entropy loss with a precision regularization towards a uniform probability vector. DPN_rev aims to predict high class concentrations for in-distribution samples and low and uniform class concentrations for out-of-distribution samples, with target concentrations of the form (1, ..., 100, ..., 1).

The average in-domain classification accuracy is satisfactory for all approaches and all settings. The best average accuracy is above 99% with only small deviations between the different approaches.
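The target concentrations can be illustrated as follows; a sketch assuming C = 10 classes and a target precision of 100 (illustrative values, not necessarily the paper's exact hyper-parameters):

```python
import numpy as np

C = 10  # number of in-distribution classes (assumed for illustration)

def targets_in_distribution(true_class, high=100.0):
    """Sharp Dirichlet target (1, ..., 100, ..., 1): mass on the true class."""
    alpha = np.ones(C)
    alpha[true_class] = high
    return alpha

def targets_ood_dpn_plus(high=100.0):
    """DPN+: high and uniform class concentrations for OOD samples."""
    return np.full(C, high)

def targets_ood_dpn_rev():
    """DPN_rev: low and uniform class concentrations for OOD samples."""
    return np.ones(C)
```

A flat, low-precision Dirichlet (DPN_rev) places the OOD targets far from both the sharp in-distribution targets and the flat, high-precision targets that encode aleatoric uncertainty, which is the representation gap the method exploits.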
3) Open Set Recognition on LCZ42: In comparison to the AID and UCM data sets, the inter-class similarity is much stronger in the low spatial-resolution LCZ42 data set, making it a more challenging data set. For our experiments, we split the classes into urban (classes 1-10), vegetation (classes A-F), and water (class G). First, we test the performance with urban as in-domain and vegetation and water as OOD data. Secondly, we test the performance with vegetation as in-domain and urban and water as OOD data. The OOD detection performance and the classification results on the LCZ42 data set are presented in Table VIII and Table IX, respectively.
The proposed DPN-RS performs best in all test settings based on the LCZ42 data set. Not only is the average separation performance better than that of the other approaches, but the accuracy on the in-distribution classification task is also higher in 3 out of 4 settings, and the variances in the results are smaller. The setting with the urban classes as in-domain, vegetation as out-of-distribution during training, and water as out-of-distribution for testing leads to the best results over all test settings, with all considered AUROC values above 0.99 for DPN-RS.

D. Channel Separation
For the channel separation, we use the R, G, and B channels of the samples of the three data sets. All classes are considered, but each sample is separated into an in-domain channel, an OOD channel for training, and an OOD channel reserved for testing.
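The per-sample channel split can be sketched as follows; the concrete channel-to-role assignment below is illustrative, since the experiments consider several such assignments:

```python
import numpy as np

def separate_channels(image, in_idx=2, ood_train_idx=1, ood_test_idx=0):
    """Split an H x W x 3 image into three single-channel samples:
    one in-domain channel, one OOD channel for training, and one
    OOD channel reserved for testing (index assignment is illustrative)."""
    return image[..., in_idx], image[..., ood_train_idx], image[..., ood_test_idx]
```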

E. Location Separation
The location separation experiments are conducted only on the LCZ42 data set, as location information is not available for the other two data sets. We form three groups of regions contained in the LCZ42 data set. The three groups exhibit distinct characteristics: while group one contains few high-rise buildings, group two contains many high-rise buildings, and, in contrast to both, group three contains many disorganized, crowded settlements.

V. DISCUSSION

A. Open Set Recognition
The experiments on open set recognition clearly demonstrate that the proposed method, as well as the compared methods, are capable of differentiating between in-domain and OOD samples for high-resolution images with clear differences in the class representations, as seen for the AID and UCM data sets. On the low-resolution So2Sat LCZ42 data set, which contains multiple very similar classes, the proposed method clearly outperforms the other methods and is the only method that still delivers a separation between in-domain and OOD samples, with best AUROC scores between 0.88 and 0.99 on all considered test cases. This result underlines the main motivation of this method: to derive a better separation between aleatoric in-distribution uncertainty and distributional uncertainty. However, the performance is lower on this data set compared to the other two, caused by its lower resolution and poor inter-class separability. Small variations in such low-resolution data may lead to completely different predictions. Therefore, maximizing the gap between the in-distribution and out-of-distribution data is challenging on such data sets.

[Table XIV: classification accuracy and average accuracy on the sensor shift tasks on the UCM data set, given as mean and standard deviation over seven runs; the best average accuracy for each setting is highlighted.]

B. Sensor Shift
Contrary to the open-set recognition, the results indicate that OOD detection under sensor shift is easier with lower-resolution images and more challenging with higher-resolution images. Furthermore, the similarity of the different sensors strongly affects the OOD detection performance. It can be clearly observed that separating the blue channel from the green and red channels gives the best results. Furthermore, the results on the LCZ42 data set indicate that using a band more similar to the in-distribution data as OOD data for training leads to a better separation. This underlines the assumption that the similarity of the sensors strongly affects the performance of such approaches and has to be taken into account in further research in this direction. The classification performance on the in-distribution data is largely similar across the different experiments.

[Table XIX: OOD detection under region shift in the urban classes of the So2Sat LCZ42 data set, measured by 100× the area under the receiver operating characteristic curve (AUROC), given as mean and standard deviation over seven runs; the best results per approach are boldfaced and the best results per setting over all approaches are italicized. Columns correspond to the region settings R1-R2-R3, R1-R3-R2, R2-R1-R3, R2-R3-R1, R3-R1-R2, and R3-R2-R1.]

C. Region Shift
The region shift experiments show that DPN-RS is in general capable of detecting unknown city structures from other regions. Moreover, such a region-wise shift is almost equally prevalent in both the urban and the vegetation classes. Poor OOD detection performance is obtained when using group 2 as in-domain data, group 3 as OOD training data, and group 1 as OOD test data. This indicates that group 3 has a more diverse distribution than the other two, so a boundary learned on group 2 using group 3 as OOD training data is less effective at detecting group 1 as OOD.
In contrast to the sensor shift experiments, the accuracy on the in-distribution samples is competitive, even though it may change significantly between different regions. Taking the OOD detection into account is therefore a promising way to improve the classification performance by rejecting uncertain samples from new regions.
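Such a rejection scheme can be sketched as follows; a minimal selective-prediction helper (the name `selective_predict` and the thresholding rule are ours, for illustration):

```python
import numpy as np

def selective_predict(probs, uncertainty, threshold):
    """Return class predictions only for samples whose distributional
    uncertainty is at most the threshold; the rest are rejected as OOD."""
    accept = uncertainty <= threshold
    preds = probs.argmax(axis=1)
    return preds[accept], accept
```

In practice, the threshold would be calibrated on held-out data to trade coverage against the error rate on the accepted samples.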

VI. CONCLUSION
In this paper, we proposed a method for out-of-distribution detection for remote sensing data. While deep learning is currently being applied to almost all remote sensing problems, its reliability is still questionable when the test data exhibit a distributional shift from the training data. OOD detection is a crucial step towards improving the trustworthiness of deep learning models. To this end, we propose a DPN-based model that can effectively increase the gap between in-domain and OOD data. The proposed method is tested extensively on three remote sensing data sets and three different tasks, namely open set recognition, sensor shift, and region shift. The proposed method shows satisfactory performance in all of the above settings. In general, DPN-based methods perform very well on OOD detection and outperform the compared ENN approach and other baselines. Successful detection of OOD samples is a stride forward towards building reliable, trustworthy deep learning-based remote sensing models. To the best of our knowledge, our work is the first extensive study on this topic for remote sensing data. Our future work will aim at extending OOD detection to multi-temporal analysis and multimodal fusion.