Mitigating Distributional Shift in Semantic Segmentation via Uncertainty Estimation from Unlabelled Data

Knowing when a trained segmentation model is encountering data that differs from its training data is important. Understanding and mitigating the effects of such distributional shift plays an important part in deployment, from both a performance and an assurance perspective - the latter being a safety concern in applications such as autonomous vehicles (AVs). This work presents a segmentation network that can detect errors caused by challenging test domains, without any additional annotation, in a single forward pass. As annotation costs limit the diversity of labelled datasets, we use easy-to-obtain, uncurated and unlabelled data to learn to perform uncertainty estimation, by selectively enforcing consistency over data augmentation. To this end, a novel segmentation benchmark based on the SAX Dataset is used, which includes labelled test data spanning three autonomous-driving domains, ranging in appearance from dense urban to off-road. The proposed method, named Gamma-SSL, consistently outperforms uncertainty estimation and Out-of-Distribution (OoD) techniques on this difficult benchmark - by up to 10.7% in area under the receiver operating characteristic (ROC) curve and 19.2% in area under the precision-recall (PR) curve in the most challenging of the three scenarios.


I. INTRODUCTION
SEMANTIC segmentation is crucial for visual understanding, as semantic information is useful for many robotics tasks, e.g. planning, localisation, and mapping [1]-[3]. Significant progress has been made on supervised semantic segmentation, where accuracy has improved markedly over the years. However, this has mostly involved test datasets drawn from the same underlying data distribution as the training data. For data with a distributional shift from the labelled training data, it remains a significant challenge to train a segmentation network that (1) retains its accuracy on this data and (2) reports accurate uncertainty estimates.
The operational design domain (ODD) for a mobile robot is defined as the set of operating conditions under which the robot has been designed to operate safely [5]. However, due to the dynamic nature of uncontrolled outdoor environments, these operating conditions are liable to change: e.g. weather changes, dynamic objects of unknown classes or appearance are seen, illumination varies, etc. This is exacerbated by the significant cost of labelling images for semantic segmentation, as it is intractably difficult to anticipate and represent the full breadth of possible situations in the sample distribution of the labelled training dataset. Therefore, it is crucial that robots are able to verify whether they are in their ODD and can operate safely, or whether the domain - and thus the data distribution - has deleteriously changed. Standards for autonomous vehicles [5], [6] cite this as critical for safe deployment.

Fig. 1. In an image from the SAX project [4] (top), a horse (an object of unknown class) can be seen on the road. In the central image, this horse is poorly segmented, leading to a dangerous driving situation. However, the model proposed in this work expresses pixel-wise uncertainty (blacked-out pixels in the bottom image), thereby mitigating the poor segmentation and the dangerous situation more generally. Uncertainty is also expressed over unfamiliar greenery that the model struggles to consistently segment as either vegetation or terrain (classes defined in Cityscapes).

arXiv:2402.17653v1 [cs.CV] 27 Feb 2024
This work therefore answers the question: given a labelled training dataset in one domain (a.k.a. the source domain), how can the segmentation error rate be mitigated on a shifted unlabelled domain (a.k.a. the target domain)? In answering this question, a model is presented that can learn to perform high-quality uncertainty estimation from an uncurated, unlabelled dataset of the target domain, without the prohibitive cost of labelling. An example of the system working is shown in Fig. 1.
Here, the vast majority of the image pixels are segmented correctly, but the unknown class "horse" is segmented poorly. This is dangerous: e.g. if identified as some other static class, the robot may drive forward, or if identified as some other dynamic class, downstream prediction or tracking systems may be affected. Both situations mean the robot would act unsafely around a wild animal. Our model, however, accompanies this prediction with high uncertainty. Critically, this is learned from unlabelled examples in this domain. This is accomplished by training a segmentation network using a semi-supervised task, where - in lieu of labels - segmentation consistency in the target domain is selectively maximised across data augmentation. The intuition is that performance on the semi-supervised task can be considered a proxy for segmentation accuracy, which is then used to train the network to express uncertainty on regions of images with poor performance. The proposed network expresses uncertainty in feature space with a single forward pass, thus satisfying run-time requirements for a robotics deployment.
The contributions of this work are as follows:
• It proposes a training method that leverages an unlabelled dataset to learn pixel-wise uncertainty estimation alongside segmentation.
• It evaluates the robustness of the proposed method against several uncertainty estimation and Out-of-Distribution (OoD) detection techniques.
• It presents a new semantic segmentation benchmark based on images belonging to the Sense-Assess-eXplain (SAX) project [4]. This work proposes over 700 pixel-wise labels for a manually curated set of images spanning three domains, which are used for testing both semantic segmentation and uncertainty estimation.¹ In addition to the labelled test data, we also propose a set of metrics to evaluate the quality of a model's uncertainty estimates.

II. PRELIMINARIES ON UNCERTAINTY ESTIMATION
A given model's error on test data is often described as originating from two different sources: epistemic or aleatoric uncertainty. The distinction is that error due to epistemic uncertainty is reducible, meaning the modeller can reduce the model's test error, e.g. by labelling more data or improving the network architecture. In contrast, aleatoric uncertainty is irreducible, as the uncertainty is inherent in the test data. This means it is impossible to train a model that fully reduces the error on this test data, as aleatoric uncertainty is not under the control of the modeller.
Error due to distributional shift, i.e. distributional uncertainty, has two components: (1) uncertainty due to unknown classes, i.e. those not defined in the training data, and (2) uncertainty due to known classes with unfamiliar appearance.
Given the constraints of this work, it can be argued that (1) is aleatoric, as no model parameterisation will fully mitigate the error in the target domain. This is because our model estimates the probability of a pixel belonging to a fixed number of defined classes, meaning that class assignment to a novel class is impossible. However, whether a pixel does not belong to a defined class can be estimated directly from data.
Component (2) is caused by the intra-class visual dissimilarity between source and target images.This relates to epistemic uncertainty as it can be reduced by a more diverse labelled training dataset.It can however also be estimated directly from the input target image.
It is therefore true that the entirety of distributional uncertainty, i.e. components (1) and (2), can be estimated directly from the data. This motivates this work to draw upon methods for aleatoric uncertainty estimation, as discussed further in Sec. III-A and Sec. III-C.

III. RELATED WORK

A. Epistemic Uncertainty Estimation
Epistemic uncertainty estimation considers the weight posterior distribution p(w|D), with w and D denoting the model weights and training data respectively. This distribution over possible model parameterisations given the training dataset is then related to the distribution over possible segmentations for an image. However, this Bayesian analysis is intractable to perform exactly, and so approximations are made, such as Monte Carlo Dropout (MCD) [7]. Alternatively, p(w|D) can be defined using an ensemble of models [8], each trained independently but on the same labelled dataset.
For training and inference, these methods estimate uncertainty by perturbing the model weights to induce a distribution of segmentations for a given image. For this reason, they are computationally expensive at inference time, as they require multiple forward passes of a network to produce an uncertainty estimate. When considering the deployment of a segmentation network, both latency and memory usage are critical criteria.
This issue is mitigated in [9] by distilling an MCD model into a deterministic network that can estimate uncertainty in a single pass. During the training of our proposed model, a segmentation distribution is instead obtained by applying perturbations to the input data while keeping the model weights constant; at inference time, our model produces segmentation uncertainties in a single pass.
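As a concrete illustration of the cost structure, the following sketch shows how an MCD-style method turns T stochastic forward passes into a per-pixel uncertainty via predictive entropy. The arrays stand in for softmax outputs of a hypothetical dropout-perturbed model, and the entropy aggregation is one common choice, not necessarily the exact one used in [7] or [9]:

```python
import numpy as np

def mc_dropout_uncertainty(prob_samples):
    """Per-pixel predictive entropy from T stochastic forward passes.

    prob_samples: (T, K, H, W) softmax outputs from T passes with
    dropout active (a hypothetical model wrapper would produce these).
    Returns an (H, W) uncertainty map.
    """
    mean_probs = prob_samples.mean(axis=0)                       # (K, H, W)
    return -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=0)

# Toy check: identical samples -> low entropy; disagreeing samples -> high.
T, K, H, W = 4, 3, 2, 2
confident = np.zeros((T, K, H, W)); confident[:, 0] = 1.0
disagree = np.zeros((T, K, H, W))
for t in range(T):
    disagree[t, t % K] = 1.0   # each pass votes for a different class
```

The T-fold forward-pass cost is exactly what the single-pass approach described above avoids.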

B. Deterministic Uncertainty Methods
Noting the computational requirements of many epistemic uncertainty estimation techniques, Deterministic Uncertainty Methods (DUMs) design networks to estimate uncertainty in a single pass using spectral-norm layers [10], which constrain the network's Lipschitz constant, ensuring that semantic differences in the input produce proportionally-scaled differences in feature space. Uncertainty can then be estimated as distance in feature space, i.e. the semantic dissimilarity, between a given input and the labelled training data. DUMs differ in how they measure uncertainty in feature space, using either Gaussian Processes [11], a post-hoc Gaussian Mixture Model [12], or radial basis function (RBF) kernels [13].
DUMs, like this work, turn uncertainty estimation into a representation learning problem, rather than a study of the model weights. Both approaches measure uncertainty in feature space, but DUMs regularise the feature space with layer normalisation, whereas this work leverages unlabelled data.

C. Aleatoric Uncertainty Estimation
As aleatoric uncertainty is inherent to the data (rather than the model), aleatoric techniques are designed to estimate uncertainty purely as a function of the input data. This is typically achieved by supervised training: for the training images, the variability in the network output with respect to the ground-truth is approximated by a distribution [14], [15].
For a regression task in [14], with network estimate f(x^(i)) and ground-truth y^(i), the following loss for a pixel i is used to distribute the output as a Gaussian:

L^(i) = ∥y^(i) − f(x^(i))∥² / (2σ(x^(i))²) + (1/2) log σ(x^(i))²

Intuitively, this loss function gives the network two paths to minimise the loss. Its estimate can either be closer to the ground-truth, or it can mitigate a large squared error ∥y^(i) − f(x^(i))∥² by expressing a large variance σ(x^(i))². In [14], this style of objective is given the name learned loss attenuation.
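A minimal sketch of this loss, under the common reparameterisation in which the network predicts the log-variance s^(i) = log σ(x^(i))² for numerical stability; the function name and array shapes are illustrative, not taken from [14]:

```python
import numpy as np

def attenuated_loss(pred, target, log_var):
    """Per-pixel Gaussian NLL with learned loss attenuation.

    pred, target, log_var: same-shaped arrays; log_var is the
    network's predicted log sigma^2 for each pixel.
    """
    sq_err = (target - pred) ** 2
    # Large log_var discounts the squared error, at a log-variance penalty.
    return 0.5 * np.exp(-log_var) * sq_err + 0.5 * log_var
```

For a fixed large error, predicting a higher variance yields a lower loss, which is exactly the attenuation path described above.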
Furthermore, [16] approximates every layer output as a distribution, using assumed density filtering. [17] learns features for geometric matching in a self-supervised manner, while expressing uncertainty over poor geometric matches. This work and [17], unlike many techniques, do not require labels to calculate the learned loss-attenuation objective. This work differs from [17] in that it is designed to extract semantics rather than just geometry, meaning labels are required to define the semantic classes. This work also expresses uncertainty as a feature-space distance rather than as a network output.

D. Out-of-Distribution Detection
OoD detection attempts to identify instances that appear distributionally distinct from the labelled training data. The difference from uncertainty estimation is that the focus is on the data, rather than on mitigating model error. One set of techniques trains a network in a supervised manner on source data and then calculates an OoD score from the network's learned representation. The simplest method calculates the maximum softmax score [18]. [19] additionally adds adversarial perturbations to the input. In [20], the Mahalanobis distance between the input and the training data in a series of feature spaces is combined with logistic regression. In [21], a score which is a function of both the features and the logits is used.
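The max-softmax baseline [18] can be sketched directly; here logits are assumed flattened to shape (P, K), with one row per pixel (an illustrative layout of our own choosing):

```python
import numpy as np

def max_softmax_ood_score(logits):
    """Baseline OoD score: one minus the maximum softmax probability.

    logits: (P, K) per-pixel class logits. Higher score = more likely OoD.
    """
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    return 1.0 - probs.max(axis=-1)
```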
Other methods introduce proxy training tasks to generate an OoD score. In [22], a classifier learns to classify the transformation applied to the input image, using the max-softmax score as an OoD score. In [23], the task of orientation prediction is utilised, resulting in improved OoD performance.
Our work also introduces a task to learn an OoD score, but uses the task to learn a separable representation of in-distribution and OoD pixels, so the task does not have to be run at inference time. Additionally, the aforementioned works train only on in-distribution data, instead of improving robustness by leveraging OoD data. Finally, the proposed model performs pixel-wise, rather than image-wise, OoD detection.
Both [24], [25] use a curated OoD dataset to explicitly separate in-distribution and OoD data. [26], [27] generate OoD datasets containing near-distribution images, i.e. those at the edge of the training distribution, with a Generative Adversarial Network (GAN) and data augmentation respectively. Near-distribution images are used to achieve more robust OoD detection than clearly OoD images, as separating in-distribution and near-distribution images is much more challenging.
Although unlabelled, a purely OoD dataset can require costly curation. Such datasets are also less informative for pixel-wise OoD detection, where, at test time, OoD instances are contained in in-distribution scenes (and vice versa). In contrast, our work's target-domain dataset does not require curation and contains in-distribution, near-distribution, and OoD instances within the same image. For these reasons, this data represents an opportunity to learn a more robustly separable representation, at the cost of being more challenging to use.

E. Semi-Supervised Learning
Semi-supervised methods extend supervised methods by including a loss on unlabelled data, and are evaluated on their ability to reduce the test error rate.In contrast, our approach seeks to detect test errors.These approaches operate orthogonally, but both prevent unknowingly erroneous predictions.
Semi-supervised approaches often maximise the consistency of a model's representation across perturbations to the input, model, or both [28]- [30].Similar to this work, [31] uses data augmentation and a cross-entropy objective between the class assignment distributions.In contrast, we apply the objective selectively and per-pixel.To appropriately represent OoD data, we also introduce additional objectives and procedures in place of the regularisation objective seen in [31].
Prototypes: Related to [31], [32] maximises consistency using prototypes as part of its mechanism, as also seen in the few-shot learning literature [33], [34]. A prototype is calculated for each class as the centroid of all source embeddings for that class. Prototypes thus compactly represent the high-level semantics of a given class by averaging over the intra-class factors of variation. Additionally, DUMs calculate class mean features (i.e. prototypes) and measure uncertainty as a distance between features and class centres, as mentioned in Sec. III-B.

Fig. 2 (caption fragment). ...< γ, the feature is certain and assigned the class of its closest prototype (denoted by the coloured pixel overlaid on the right); if not, the feature is assigned uncertain (denoted by the question mark in the white pixel). In this way, a 'safe region of operation' is defined in feature space, where pixels inside are accurate and certain, and outside they are uncertain and inaccurate.
Unsupervised Domain Adaptation: A specific instance of semi-supervised learning is Unsupervised Domain Adaptation (UDA), which specifically targets the distributional shift between the labelled (a.k.a. source) and unlabelled (a.k.a. target) data. Still, UDA methods [35], [36] are designed purely to increase test accuracy by learning an improved representation of the target domain, and not to detect the errors arising from the shift, as in our method.
IV. SYSTEM OVERVIEW

Fig. 2 shows an overview of the proposed system. At inference, its goal is to segment each pixel of an image x ∈ R^(3×H×W) of size H × W into K known classes K = {k_1, ..., k_K}, or to flag them as uncertain. This is done by producing both a categorical distribution p(y|x) = [p(y = k_1|x), ..., p(y = k_K|x)] ∈ R^(K×H×W) and an uncertainty mask M_γ ∈ B^(H×W), which assigns to each pixel 1 for certain or 0 for uncertain through a threshold γ in feature space.

A. Segmentation Using Prototypes
An encoder, E : R^(3×H×W) → R^(F×h×w), and a projection network, g_ρ : R^(F×h×w) → R^(F×h×w), calculate embeddings z ∈ R^(F×h×w), with ∥z∥₂ = 1 per pixel, for an image x, where F is the feature length and h and w are the downsampled spatial dimensions.
Let X_S ∈ R^(N×3×H×W) be a batch of N source images and Y_S ∈ R^(N×h×w×K) their corresponding one-hot labels, downsampled to the size of the embeddings Z_S ∈ R^(N×h×w×F). Prototypes p_S ∈ R^(F×K) are then calculated for each of the K classes from the embeddings Z_S:

p_S = normalise(Z_S^⊤ Y_S)

Here, Z_S^⊤ Y_S ∈ R^(F×K) sums, for each class, the embeddings of all pixels with that label, giving us a normalised, aggregate feature per class. A segmentation p(y|x) is obtained by projecting each pixel embedding of x onto the prototypes. Specifically, the classification scores, s^(i) ∈ R^(1×K), for the i-th pixel embedding z^(i) ∈ R^(F×1) are:

s^(i) = z^(i)⊤ p_S

For this segmentation method, the scores s represent the cosine similarity between a feature and the prototype for each class. These scores over the downsampled spatial resolution, s ∈ R^(h×w×K), are then bilinearly upsampled to s ∈ R^(H×W×K). The categorical distribution over the K classes is then given by p(y|x) = σ_τ(s), where σ_τ(·) is the softmax function with temperature τ (in this work, τ = 0.07).
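A minimal numpy sketch of this prototype segmentation step, with embeddings flattened to (P, F) and prototypes stored as (K, F) (a transposed but equivalent layout); the names and toy shapes are our own illustrative choices:

```python
import numpy as np

def l2norm(v, axis=-1):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + 1e-12)

def prototypes(Z, Y):
    """Class prototypes as L2-normalised class centroids.

    Z: (P, F) unit-norm pixel embeddings (flattened over batch/space).
    Y: (P, K) one-hot labels. Returns (K, F).
    """
    return l2norm(Y.T @ Z)   # per-class feature sums, then normalised

def segment(z, protos, tau=0.07):
    """Cosine scores against prototypes and softmax with temperature tau."""
    s = z @ protos.T                                   # (P, K) similarities
    e = np.exp((s - s.max(axis=-1, keepdims=True)) / tau)
    return s, e / e.sum(axis=-1, keepdims=True)

# Toy example: two well-separated classes in a 2-D feature space.
Z = l2norm(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]))
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
P = prototypes(Z, Y)
scores, probs = segment(Z, P)
```

The low temperature (τ = 0.07) sharpens the softmax, so small differences in cosine similarity translate into confident class assignments.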

B. Uncertainty Estimation using γ
Uncertainty is expressed in this work as the probability that a pixel belongs to no known class, i.e. p(y ∉ K|x). Taking inspiration from [37], [38], we append a parameter γ to the classification scores as the (K + 1)-th score:

s_γ^(i) = s^(i) ⊕ γ

where ⊕ denotes vector concatenation. The largest score, max(s^(i)), is the cosine similarity between an embedded pixel and its closest prototype. This is a measure of the model's confidence, upon which γ operates as a threshold². The certainty mask, M_γ, is given by:

M_γ^(i) = 1[max(s^(i)) > γ]

where 1[·] is the indicator function, and so a pixel is marked uncertain (M_γ^(i) = 0) if γ is the highest score. This way, γ defines a region around each prototype, outside of which a feature is considered uncertain.
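This thresholding step can be sketched as follows, again with scores flattened to (P, K); appending γ and taking an argmax is equivalent (up to ties) to the indicator test above:

```python
import numpy as np

def certainty_mask(scores, gamma):
    """Append gamma as a (K+1)-th score; a pixel is certain (1) iff its
    best prototype similarity beats gamma.

    scores: (P, K) cosine similarities; gamma: scalar threshold.
    """
    K = scores.shape[1]
    aug = np.concatenate([scores, np.full((scores.shape[0], 1), gamma)], axis=-1)
    # argmax < K means a real class score beat gamma -> certain.
    return (aug.argmax(axis=-1) < K).astype(np.uint8)
```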

V. TRAINING OBJECTIVES
Ultimately, we aim to train a model that can express uncertainty with the uncertainty map M_γ via the threshold γ. Yet, in the absence of labels, we draw on the hypothesis that segmentation consistency over image augmentation approximates ground-truth accuracy. Consider two corresponding pixels from two augmentations: if both are assigned the same class, the assignment is likely to be accurate; otherwise, it is not. Embedding consistency, whether via feature distance [39], [40] or implicit class assignment [32], is a good proxy for classification accuracy, as shown by the success of linear-probe experiments in training classifiers from models trained with an embedding consistency objective. In contrast, we focus on learning a representation for uncertainty estimation, and use pixel-wise alignment rather than image-wise supervision.
As seen in Fig. 3, consistency between two augmented target-domain images, x′_T and x̃_T, is represented by a consistency map M_c ∈ B^(H×W), where each pixel is 1 if consistent or 0 otherwise. In this work, consistency maps are related to uncertainty maps in the following two-step training process:
1) γ is solved for such that the mean certainty of segmentations according to M_γ is equal to the mean consistency according to M_c. γ therefore separates pixels into certain and uncertain broadly according to consistency.
2) The model parameters are then updated by maximising the consistency of pixels deemed certain by M_γ, as a proxy for maximising the accuracy of the certain pixels. This both improves the estimate of M_c and separates the features of pixels assigned certain and uncertain.
This method therefore establishes a positive feedback loop, where M c and M γ continually improve each other's estimates of which pixels are segmented accurately.The following sections describe how, with care, the training dynamics can be conditioned such that this leads to simultaneous high-quality uncertainty estimation and segmentation.

A. Semi-Supervised Task
Augmentations transform one image into two with distinct appearances but the same underlying semantics. Firstly, a target image x_T is randomly cropped with transform T^G_1. This global crop is transformed by T^L_1 and T^L_2, which are sampled such that one is always a local crop-and-resize and the other an identity transform. Finally, x′_T and x̃_T are obtained by applying colour-space transforms C_1 and C_2. At the end, x′_T and x̃_T are images of the same spatial dimensions, but of different appearance, and one is an upsampled crop of a region within the other.
Both of these images are then segmented by functions f and g. Pixel-wise segmentations s′_T and s̃_T are obtained by applying the opposite local cropping transform to the one applied to the input image, as follows:

s′_T = T^L_2(f(x′_T)),    s̃_T = T^L_1(g(x̃_T))

s′_T and s̃_T are therefore pixel-wise aligned, as they are both segmentations of the region of the smaller local crop. f(·) and g(·) represent the top and bottom branches in Fig. 3 and are distinct functions that both return a segmentation for a given image. Insights into why f ≠ g are given in Sec. V-E.
Transformations are applied to x_S in the following order to obtain x̃_S: T^G_2, T^L_3, C_3, where T^G_2 is a global crop, T^L_3 is the identity transform or a local crop-and-resize, and C_3 is a colour-space transform.

B. Calculating γ: Making inconsistent pixels uncertain
Let us consider batches of target-domain images, X′_T, X̃_T ∈ R^(N×3×H×W), and segment them to obtain S′_T and S̃_T. The consistency map M_c is then given by:

M_c^(i) = 1[argmax(s′^(i)_T) = argmax(s̃^(i)_T)]

γ is then calculated such that p(certain) (i.e. the proportion of pixels that are certain according to M_γ) is equal to p(consistent) (i.e. the proportion of pixels consistent according to M_c), as detailed in Algorithm 1. Here, MaxS_T contains the largest similarity with the prototypes for each pixel.
(1 − p_c) is the proportion of inconsistent pixels in the batch. Line 9 of Algorithm 1 then chooses γ so that the proportion of certain pixels equals the proportion of consistent pixels.
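Algorithm 1 itself is not reproduced in this excerpt, but the γ-selection it describes amounts to a quantile computation over the per-pixel maximum similarities; the following is a sketch under that reading, with illustrative names:

```python
import numpy as np

def solve_gamma(max_sims, p_consistent):
    """Pick gamma so the fraction of pixels whose max prototype
    similarity exceeds gamma equals the fraction of
    augmentation-consistent pixels.

    max_sims: (P,) per-pixel max cosine similarity (MaxS_T, flattened).
    p_consistent: scalar in [0, 1], the mean of the consistency map.
    """
    # The (1 - p_consistent) quantile leaves p_consistent of pixels above it.
    return float(np.quantile(max_sims, 1.0 - p_consistent))
```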

C. Learning E: Making certain pixels consistent
Similar to learned loss attenuation, the ultimate objective is to train a model that either produces high-quality segmentations or expresses high uncertainty. To this end, a consistency objective, L_c, maximises the quality of the segmentation, but only for pixels that are deemed certain by M_γ. Segmentation quality is represented by the consistency, which in this case is calculated as the cross-entropy between the pixel-wise categorical distributions across views, as follows:

L_c = (1 / Σ_i M_γ^(i)) Σ_i M_γ^(i) H(p(y|x′_T)^(i), p(y|x̃_T)^(i))

where H is the cross-entropy function. As shown in Fig. 3, only the encoder E is updated by this loss function. L_c causes an entropy decrease in p(y|x) for certain pixels, but not for uncertain pixels. Given that p(y|x̃_T) is produced via prototype segmentation (see Sec. IV-A), this relates to an increase in the separation between the features z_T of certain and uncertain pixels. This is because certain features, i.e. those with a cosine distance less than γ to a prototype (see Fig. 2), have been pulled closer to their closest prototype, and thus further from the uncertain features.
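A numpy sketch of this masked consistency objective, with both views' categorical distributions flattened to (P, K) (an illustrative layout, not the exact batched implementation):

```python
import numpy as np

def consistency_loss(p_top, p_bottom, certain_mask):
    """Cross-entropy H(p_top, p_bottom) averaged over certain pixels only.

    p_top, p_bottom: (P, K) categorical distributions for aligned pixels
    from the two augmented views; certain_mask: (P,) array in {0, 1}.
    """
    ce = -(p_top * np.log(p_bottom + 1e-12)).sum(axis=-1)   # (P,) per-pixel CE
    denom = certain_mask.sum() + 1e-12                      # avoid div-by-zero
    return float((ce * certain_mask).sum() / denom)
```

Masking out uncertain pixels is what prevents the loss from forcing agreement on pixels the model should instead flag.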

D. Additional Objective Functions
L_c does not maximise consistency for uncertain pixels. Despite this, when only L_c is used to update the model, all features in the target domain tend to collapse onto a subset of the source prototypes, thereby achieving near-perfect consistency irrespective of the input image.
This negatively affects uncertainty estimation, as the calculated γ cannot effectively separate certain and uncertain features, and so M_γ is a poor uncertainty estimate. Additionally, few of the prototypes onto which the features collapse correspond to the correct ground-truth class, so the near-perfect consistency does not correspond to near-perfect accuracy. This results both in an inaccurate model and in an M_c that approximates accuracy very poorly.
The proposed solution to this problematic training dynamic is to softly constrain the model to distribute each batch of target features uniformly on the unit hypersphere, as presented in [41]. Uniformity prevents feature collapse by constraining the proportion of features near the prototypes, thus making the model more selective about which features are certain or uncertain. L_u is calculated in the same form as in [41]:

L_u = log E_{i≠j}[ exp(−t ∥z̃^(i)_T − z̃^(j)_T∥²₂) ]

where t = 2 and Z̃_T ∈ R^(N×F×h_u×w_u) is a batch of target features which has been downsampled by average pooling such that h_u, w_u = h/4, w/4. Average pooling reduces the number of pairwise distances calculated, reducing memory usage.

Simultaneously, another loss term, L_p [42], maximises the distance between the source prototypes, i.e. spreads them on the unit sphere, preventing L_u from concentrating the prototypes in order to maximise the distance between certain and uncertain features. For K class prototypes p_S ∈ R^(F×K):

L_p = (1/K) Σ_k max_{k′} (p_S^⊤ p_S − 2I)_{kk′}

This minimises the similarity between nearest prototypes, as p_S^⊤ p_S ∈ R^(K×K) contains the cosine similarities between prototypes, and subtracting 2I excludes the self-similarity terms.
While L u and L p both maximise the distance between features, L u applies this locally using an RBF kernel, whereas L p maximises nearest-neighbour feature distance, and thus more strongly encourages uniformity.
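Both objectives can be sketched as follows; the exact reductions and constants in [41], [42] may differ (e.g. in how pairs are weighted), so treat these as illustrative forms:

```python
import numpy as np

def uniformity_loss(Z, t=2.0):
    """Uniformity term in the style of [41]: log of the mean pairwise
    Gaussian potential over unit-norm features Z of shape (P, F).
    Lower is more uniform."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # (P, P) sq. dists
    iu = np.triu_indices(len(Z), k=1)                     # distinct pairs only
    return float(np.log(np.exp(-t * sq[iu]).mean()))

def prototype_separation_loss(P):
    """Mean similarity of each prototype to its nearest neighbour;
    subtracting 2I pushes the self-similarity diagonal below any
    off-diagonal cosine. P: (K, F) unit-norm prototypes."""
    sims = P @ P.T - 2.0 * np.eye(len(P))
    return float(sims.max(axis=-1).mean())
```

Collapsed features give a uniformity loss of 0 (all pairwise potentials equal 1), whereas spread-out features drive it negative, which is the direction the optimiser follows.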
Finally, a supervised loss is calculated using the source labels to maintain a good representation of the source domain. This objective, L_s, is used to update the encoder and segmentation head, and is calculated as the cross-entropy:

L_s = (1/HW) Σ_i H(y_S^(i), p(y|x_S)^(i))

where y_S^(i) is the ground-truth one-hot label for pixel i.

E. Asymmetric Branches
As seen in Fig. 3, the branches segmenting x′_T and x̃_T are not identical. The top branch, f : R^(3×H×W) → R^(K×H×W), segments x′_T through E and a segmentation head f_ψ rather than via prototype segmentation; therefore f(·) = f_ψ ∘ E(·), whereas the bottom branch is g(·) = g_π ∘ g_ρ ∘ E(·), with g_π denoting projection onto the prototypes. Branch asymmetry prevents training from collapsing: L_c can be trivially minimised by assigning large regions to the same incorrect class across views, similar to what is described in Sec. V-D. Similar architectures are found in self-supervised learning methods such as [40], [43], which also prevent similar failure modes - [40] using exponential moving averages of the model weights, and [43] using additional layers.
As a result, the proposed asymmetry is not susceptible to collapse. Suppose that the encoder E collapses; for L_c to be minimised by s′_T = s̃_T, the following would have to be true: f_ψ(·) = g_π ∘ g_ρ(·). This is not observed, which we attribute to the following factors: (1) g_π ∘ g_ρ and f_ψ are architecturally different; (2) neither the segmentation head f_ψ, nor the prototypes via g_π, can contribute to a degenerate solution, as neither is updated by L_c; (3) g_ρ is updated with L_u in addition to L_c.
A secondary benefit of using the segmentation head f_ψ is that it naturally produces a low-entropy p(y|x′_T). This contributes to decreasing the entropy of p(y|x̃_T), thus further separating certain and uncertain pixels in the manner described in Sec. V-C.

VI. EXPERIMENTAL SETUP
This work introduces a novel test benchmark, building on top of the Sense-Assess-eXplain (SAX) project [4]. The benchmark is composed of pixel-wise labels that annotate a set of manually curated images from three domains of the SAX project. Alongside the test labels, this benchmark also proposes test metrics to evaluate the quality of uncertainty estimation, presented in Sec. VII.

A. Data
This work uses three different types of data: (1) labelled training images, (2) unlabelled training images, and (3) labelled test images. The primary experiments in this work use Cityscapes [44] as (1), and a SAX domain provides (2) and (3). As such, unless otherwise stated, the labelled dataset used is Cityscapes. However, in order to investigate the generality of the method, Berkeley DeepDrive 10k (BDD) [45] is also considered as a source domain, and both BDD and KITTI [46] are used as target domains.
The SAX dataset comprises data from three domains defined by their location of collection: London, the New Forest (a rural region in southern England), and the Scottish Highlands, ordered by descending similarity with Cityscapes.Examples from each dataset can be seen in Fig. 4. By testing across all three domains, the effect of the magnitude of the distributional shift on uncertainty estimation can be evaluated.
Each domain contains instances that are in-distribution (e.g. cars, road, and signs that look very similar to those found in Cityscapes) and OoD (e.g. classes not defined in Cityscapes, such as horses, Scottish lochs, and gravel roads). Pixels of classes undefined in the source domain are treated such that any class assignment to them is counted as inaccurate. Importantly, each domain also contains many instances on the edge of the labelled distribution, i.e. near-distribution. These instance types are mixed within images, posing a significant challenge to uncertainty estimation and OoD detection models.
For the KITTI labelled test dataset, the 201 labelled training images were used (as only these have downloadable labels), while the unlabelled training dataset comes from the published raw data. BDD provides a labelled training dataset, a labelled test dataset, and 100,000 driving images without semantic annotation. This allows us to use BDD as both source and target domain separately. For both KITTI and BDD, care was taken to prevent any overlap between the labelled testing images and the unlabelled training images.

B. Network Architecture & Training
For every experiment, the segmentation network used has a DeepLabV3+ architecture [47] with a ResNet18 backbone [48]. More specifically, E is represented by both the ResNet18 and the ASPP module, and so the features ẑ are taken from the penultimate layer of DeepLabV3+, with just the segmentation head f_ψ to follow.³ For prototype segmentation, features ẑ are then passed through a projection network g_ρ, which is a two-hidden-layer perceptron - similar to [39], but applied to each pixel embedding independently. The feature dimensions are given by h = H/4, w = W/4, F = 256.
Before being updated by L_c, the networks E and g_ρ are pre-trained using only L_s and L_u. Firstly, this means that before semi-supervised training begins, the segmentation head f_ψ has already broadly learned the spatial distribution of classes. Secondly, after a small number of training iterations with L_c and L_p, the prototypes faithfully represent the semantic classes due to the pre-training of E. In combination, these factors mean that M_c starts as a better estimate of ground-truth segmentation accuracy, and thus the system is well-initialised for the positive feedback loop described in Sec. V.

C. Use of a Domain-based Curriculum
Models trained with the method presented in Sec. IV are given the name γ-SSL. For each domain, a separate γ-SSL model is trained, such that testing occurs in the domain of the unlabelled training data. Models named γ-SSL_iL are, however, also initialised on weights trained with unlabelled SAX London images.
γ-SSL_iL models are trained under the following hypothesis: splitting training into two chunks (source → intermediate target, intermediate target → final target) reduces the distributional gap between source and target for each chunk of training, and this improves the quality of the learned representation of the final target domain. SAX London is this intermediate domain, as it is the most similar to Cityscapes, while also sharing platform configurations with the other SAX domains.
The motivation for this comes from curriculum learning [49], in which a model's performance is improved by increasing the difficulty of training examples as training progresses. In our case, the difficulty of the curriculum is controlled by one high-level characteristic of the domain, i.e. its geographic location, but other characteristics could be used, such as rainy/dry, day/night, or sunny/overcast.
Given that robots are designed for a specific ODD, the source domain is precisely defined by its operating conditions. By considering what the ODD considers in-distribution, the curriculum can be designed to include progressively more OoD conditions. The added diversity in this curriculum naturally increases the data requirements for γ-SSL iL . However, in a robotics context, this work argues that these requirements are not difficult to satisfy. This is because collecting an uncurated, unlabelled dataset for a specific set of operating conditions (e.g. those not contained in the ODD) merely requires access to a robot and for those conditions to exist in the real world. This argument also explains why the standard γ-SSL models are not significantly more difficult to train than the benchmarks.

D. Benchmarks
This work evaluates a range of techniques, each producing a per-pixel likelihood of the model making an error. As the test data is distributionally shifted from the labelled training data, the proposed method is benchmarked against several epistemic uncertainty estimation and OoD detection techniques.
³ See https://github.com/qubvel/segmentation_models.pytorch for implementation details.
Methods are split into epistemic and representation-based methods. The distinction is that the epistemic methods consider a distribution over the model parameters, sample from that distribution, and then calculate the inconsistency in the resulting segmentations; this process requires multiple forward passes of the network. Instead, the representation-based methods solely leverage a learned representation and compute an uncertainty metric in a single forward pass, greatly reducing the computational requirements for deployment.
The epistemic methods comprise Monte-Carlo Dropout [7] (MCD) and Deep Ensembles [8] (Ens). For both, predictive entropy (PE) and mutual information (MI) are used as uncertainty measures, where MI is more often used to estimate epistemic uncertainty, as evidenced by its use in active learning [50]. The network for MCD builds on Bayesian DeepLab [51], but is adapted for a ResNet18, and is tested over a range of dropout probabilities and numbers of samples, with the best presented (0.2 and 8 respectively). Different ensemble sizes are also evaluated. The MCD model is also distilled into a deterministic network, named MCD-DSL, as per [9].
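Both uncertainty measures can be computed from the stacked softmax outputs of the stochastic forward passes. A minimal NumPy sketch (our own naming), assuming T passes over N pixels and K classes:

```python
import numpy as np

def entropy(p, eps=1e-12):
    # Shannon entropy along the last (class) axis
    return -(p * np.log(p + eps)).sum(axis=-1)

def predictive_entropy_and_mi(probs):
    """probs: (T, N, K) softmax outputs from T stochastic passes
    (MC Dropout samples or ensemble members).
    PE = H[mean_t p_t]   (total predictive uncertainty)
    MI = PE - mean_t H[p_t]   (epistemic part: sample disagreement)"""
    pe = entropy(probs.mean(axis=0))           # (N,)
    mi = pe - entropy(probs).mean(axis=0)      # (N,)
    return pe, mi

# when all samples agree, MI is ~0: no disagreement between passes
p = np.tile(np.array([[0.7, 0.2, 0.1]]), (8, 1)).reshape(8, 1, 3)
pe, mi = predictive_entropy_and_mi(p)
```

This makes the distinction concrete: PE is high whenever the averaged prediction is flat, while MI is high only when the individual passes disagree.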
As for the representation-based methods, the techniques investigated include several OoD detection methods [18]-[21]. [18] (Softmax) and [19] (Softmax A ) propose a pretrained segmentation network with tuned softmax temperature parameters, where the latter also leverages adversarially perturbed images. [20] leverages a pretrained network where the OoD score is the Mahalanobis distance in feature space (FeatDist), which can also leverage adversarial inputs (FeatDist A ). Finally, [21] (ViM) defines the OoD score as a function of both the features and the logits. The feature space chosen for [20], [21] is the same as used for the proposed method. When using adversarially perturbed inputs, evaluation takes place over a range of step-sizes ϵ. The final representation-based method is a Deterministic Uncertainty Method (DUM), presented in [12], using the official implementation of [52].
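A common formulation of the Mahalanobis feature-distance score fits class-conditional Gaussians with a shared (tied) covariance to in-distribution features and scores a test feature by its distance to the closest class mean. The sketch below illustrates that formulation with synthetic data; it is not the benchmarked implementation.

```python
import numpy as np

def fit_class_gaussians(feats, labels, num_classes):
    """Class-conditional means and a shared precision matrix,
    estimated from in-distribution features."""
    means = np.stack([feats[labels == k].mean(axis=0)
                      for k in range(num_classes)])
    centred = feats - means[labels]
    cov = centred.T @ centred / len(feats)
    # small ridge term keeps the inverse well-conditioned
    return means, np.linalg.inv(cov + 1e-6 * np.eye(feats.shape[1]))

def mahalanobis_ood_score(x, means, precision):
    """OoD score: squared Mahalanobis distance to the closest class mean."""
    d = x[:, None, :] - means[None, :, :]               # (N, K, F)
    sq = np.einsum('nkf,fg,nkg->nk', d, precision, d)   # (N, K)
    return sq.min(axis=1)

# synthetic in-distribution features from two well-separated classes
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 1, (200, 8)),
                        rng.normal(5, 1, (200, 8))])
labels = np.array([0] * 200 + [1] * 200)
means, prec = fit_class_gaussians(feats, labels, 2)
scores = mahalanobis_ood_score(np.array([[0.0] * 8, [50.0] * 8]), means, prec)
```

An in-distribution point (near a class mean) receives a low score; a far-away point a high one.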

VII. EVALUATION METRICS
This work evaluates methods on their ability to perform misclassification detection. This is a binary classification problem, whereby pixels segmented accurately with respect to labels should be classified as certain, and inaccurate pixels classified as uncertain. These states are defined in Tab. I.
Given a set of imperfect models, the best model for a binary classification problem can be selected based on a number of different metrics.For uncertainty estimation, the most appropriate metric is dependent on the context in which the uncertainty estimates and segmentation predictions are used.For this reason, this work considers a range of possible definitions of metrics and justifies each in a robotics context.

A. Metrics: Definitions
Firstly, this work considers receiver operating characteristic (ROC) curves and precision-recall (PR) curves for the evaluation of misclassification detection, based on the prior use of these metrics in [18]. ROC curves plot the true positive rate (TPR) versus the false positive rate (FPR):

TPR = TP / (TP + FN), FPR = FP / (FP + TN)

Here, TPR is the proportion of accurate pixels detected as certain, whereas FPR is the proportion of inaccurate pixels incorrectly assigned to certain. The ROC curve treats the positive and negative classes separately and is thus independent of the class distribution, i.e. the underlying proportion of pixels segmented as the correct semantic class. The ROC curve is summarised by the area under it, the AUROC.
Precision and recall are defined respectively as:

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

Interpreting misclassification detection as an information retrieval task, PR curves evaluate the ability to use uncertainty estimates to retrieve only accurate pixels (as the positive class is defined as accurate in Tab. I). As for ROC curves, PR curves can be summarised by the area under them, the AUPR.
Additionally, as an alternative to AUPR, the F β score is also considered, defined as:

F β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)

This factors precision and recall into a single metric, weighting their contribution through the scalar β. For β < 1 a stronger focus is given to precision, while for β > 1 to recall.
Finally, as in [51], the misclassification detection accuracy, A MD , is also used to evaluate uncertainty estimation. This is defined for a given uncertainty threshold as:

A MD = (TP + TN) / (TP + TN + FP + FN)

According to this metric, the best model is the one which segments the highest proportion of all pixels in one of two states: (accurate, certain) or (inaccurate, uncertain).
A MD and F 0.5 are plotted against the proportion of pixels that are in the state (accurate, certain), named p(a, c), where:

p(a, c) = TP / (TP + TN + FP + FN)

For these plots, the ideal model should maximise both the uncertainty metric (A MD , F 0.5 ) and p(a, c), i.e. better results are closer to the top-right of the plots. Therefore, the maximum values of A MD and F 0.5 , named MaxA MD and MaxF 0.5 respectively, and the value of p(a, c) at which they occur are also reported.
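All of the metrics above follow directly from the four confusion counts at a given uncertainty threshold. A worked sketch with hypothetical counts for 100 pixels:

```python
def md_metrics(tp, fp, tn, fn, beta=0.5):
    """Misclassification-detection metrics from the confusion counts,
    with the positive class = (accurate, certain) as in Tab. I."""
    tpr = tp / (tp + fn)                       # recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_beta = ((1 + beta ** 2) * precision * tpr
              / (beta ** 2 * precision + tpr))
    a_md = (tp + tn) / (tp + tn + fp + fn)     # misclassification-detection accuracy
    p_ac = tp / (tp + tn + fp + fn)            # proportion (accurate, certain)
    return tpr, fpr, precision, f_beta, a_md, p_ac

# hypothetical image of 100 pixels: 80 accurate & certain, 10 inaccurate
# & certain, 5 inaccurate & uncertain, 5 accurate & uncertain
tpr, fpr, precision, f05, a_md, p_ac = md_metrics(tp=80, fp=10, tn=5, fn=5)
```

Sweeping the uncertainty threshold and recomputing these values traces out the ROC, PR and A MD / F 0.5 versus p(a, c) curves discussed above.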

B. Discussion: AUROC, AUPR and F-scores
Given the context of this work, i.e. semantic segmentation for robotics applications, we must consider: 1) what the real-world costs are for misclassification in the cases of FP versus FN, and 2) whether the evaluation should be independent of the class distribution, p(accurate). The first is context-dependent to a large extent. In general, for safety-critical contexts such as robotics, where perception directly leads to decisions and actions in the real world, the importance of certain being accurate is higher than uncertain being inaccurate. To simplify, accidents arise when autonomous systems make confident but incorrect predictions about their surroundings. In contrast, when predictions are uncertain but accurate, the system is overly conservative and does not take action because it considers some safe actions unsafe; this is inefficient but less hazardous. It can therefore be argued that precision is more important than recall: we want FP = 0, i.e. no pixels inaccurate and certain, even if some pixels are accurate but uncertain (FN > 0).
PR and ROC curves present the performance of a model over the full range of relative misclassification costs.AUROC and AUPR assess how models perform in aggregate over this range of relative misclassification costs, and thus over a range of robotics contexts.Consequently, however, these metrics do not fully represent whether a model is appropriate in a specific context.To represent a specific context of interest, F β scores can aggregate PR curves with a preference towards precision or recall.In the effort to prioritise precision, as argued above, F 0.5 scores are presented in this work.

C. Discussion: Misclassification Detection Accuracy
Misclassification detection accuracy (named A MD to avoid confusion with segmentation accuracy) provides an intuitive understanding of misclassification detection performance by reporting the proportion of all pixels (the denominator considering the entire image grid) in either of the following 'safe' states: (accurate, certain) or (inaccurate, uncertain); no preference between FP and FN is expressed.
Segmentation networks are typically one component of a larger system, and there are some robotics contexts where FP and FN are not drastically different in effect. For example, semantic localisation can reject a FP using additional processing steps in the localisation pipeline, e.g. RANSAC in geometric refinement. Also, in semantic mapping, multiple views of a location are available, which allow for additional processing, e.g. majority voting, to reduce the effect of errors. In both cases, subsequent steps provide additional filtering, and so a FP is less detrimental to the broader system.

D. Discussion: Class Distribution
A characteristic of ROC and PR curves is their insensitivity to the class distribution, defined as the proportion of pixels that are accurate versus inaccurate.Each model will have a different semantic segmentation accuracy, and so the class distributions will vary.This means that ROC and PR curves provide helpful analysis on the ability to detect misclassification, independent of segmentation accuracy.
In the context of this work, however, if two models have the same misclassification detection performance, the model with the higher p(accurate) should be chosen.More specifically, the proportion of pixels assigned, with certainty, to the known semantic classes should be considered, i.e. p(a, c).In the interest of jointly considering misclassification detection and semantic segmentation performance, this work presents a procedure to describe both objectives intuitively.
Consider F 0.5 and A MD versus p(a, c). These plots allow us to determine the model and uncertainty threshold at which misclassification detection is performed best, and the proportion of confidently segmented pixels returned at that threshold. They intuitively describe both the 'introspectiveness' and the 'usefulness' of the model. Note that the point on the curve at maximum p(a, c) corresponds to the segmentation accuracy, as all pixels are treated as certain at this threshold, and so max[p(a, c)] = p(accurate).

VIII. EXPERIMENTAL RESULTS
This section presents test metrics for each domain, with qualitative examples discussing the benefits of our approach presented in Sec. VIII-K.

A. Source: Cityscapes, Target: SAX London
For each plot in the first row of Fig. 5, γ-SSL performs best. Its precision is higher for nearly all values of recall, with corresponding increases over the best benchmark of 19 % in AUROC and 8 % in AUPR (see Tab. II and Tab. III). It has the highest values of MaxA MD and MaxF 0.5 , returning them at a p(a, c) exceeded only by MCD 0.2 (which, however, has lower values of MaxA MD and MaxF 0.5 ).
The best-performing benchmarks are the MCD 0.2 models, with MI outperforming PE.This is because they are both more introspective as judged by AUROC and AUPR, and also have a high segmentation accuracy.The latter suggests the dropout layers allowed for greater generalisation to the SAX domains, compared with a standard segmentation network.
On average, the performance for this domain was relatively similar for the epistemic and representation-based methods in terms of AUROC and AUPR.However, while the representation-based methods exhibited similar MaxA MD and MaxF 0.5 values to epistemic methods, they do so at lower p(a, c) on average; thus, overall, they perform less well.
Note that γ-SSL iL does not appear in the results for the SAX London, KITTI and BDD domains. For SAX London, γ-SSL and γ-SSL iL are the same model, and for the latter two, it is not clear that SAX London is an intermediate domain for these target domains; see Sec. VI-C for more details.

B. Source: Cityscapes, Target: SAX New Forest
In SAX New Forest, γ-SSL iL and γ-SSL perform similarly in AUROC and AUPR, with γ-SSL iL having slightly higher AUPR (0.942 vs. 0.921). The increases from γ-SSL to γ-SSL iL for MaxA MD and MaxF 0.5 are modest at 2.4 % and 3.5 % respectively; however, the increases in the p(a, c) at which they occur are far larger, at 25.9 % and 30.7 %. This coincides with a large increase in segmentation accuracy from γ-SSL to γ-SSL iL , as seen from the maximum p(a, c) in Fig. 5. For this domain, this confirms the hypothesis that presenting unlabelled target images in a curriculum improves semi-supervised learning, both in segmentation quality and uncertainty estimation.
As in SAX London, the MCD 0.2 models were the best-performing benchmarks on each metric. In general, the epistemic methods outperformed the representation-based methods, with higher mean AUROC and AUPR. While the mean values of MaxA MD and MaxF 0.5 were not significantly different, the values of p(a, c) at which they occur were higher for epistemic methods than for representation-based methods.

C. Source: Cityscapes, Target: SAX Scotland
In this domain, the increase in performance from γ-SSL to γ-SSL iL is at its greatest, and the γ-SSL iL model far exceeds the performance of the other models, as in Fig. 5. The increases over the next best model for the metrics AUROC, AUPR, MaxA MD @ p(a, c) and MaxF 0.5 @ p(a, c) are as follows: 10.7 %, 19.2 %, 9.1 % @ 65.8 %, and 20.2 % @ 72.1 %. Once again, the performance increase in segmentation quality and uncertainty estimation can be attributed to the curriculum training procedure described in Sec. VI-C.
γ-SSL performs comparably to MCD 0.2 in this domain, as the distributional shift between SAX Scotland and Cityscapes is so significant that it is challenging to use the semi-supervised task to improve the model's representation.
On average, epistemic methods outperform representation-based ones to a larger extent in this domain, characterised by a mean increase in AUPR of 34.8 % from the latter to the former. This is because representation-based methods rely on a representation learned from Cityscapes, which is significantly distributionally shifted from SAX Scotland.

D. Effect of increasing distributional shift
As discussed in Sec. VI-A, the following domains are in order of ascending distributional shift: London, New Forest, Scotland, as evidenced by the corresponding reduction in segmentation accuracy (0.571, 0.538, 0.394) for a network trained solely on Cityscapes. Tab. X shows that the method type affects the extent to which misclassification detection performance degrades as distributional shift increases.
Epistemic Methods: For all epistemic methods, AUROC and AUPR increase from London to New Forest, by 6.0 % and 5.0 % on average, respectively. This suggests that these techniques perform better uncertainty estimation as the proportion of errors related to distributional shift increases, which aligns with the stated motivation of the methods. However, as distributional uncertainty significantly increases, not every method improves further. In fact, both AUROC and AUPR significantly decrease from New Forest to Scotland (−7.4 % and −18.3 % on average). This suggests that there is a limit beyond which these methods start to degrade significantly, as was also reported for a range of epistemic uncertainty estimation methods in [53].
Representation methods: In general, these methods are less robust to distributional shift than the epistemic ones, with a difference in AUPR of −5.4 % and −27.3 % for London to New Forest, and New Forest to Scotland respectively.The changes were less uniform for AUROC; however, the majority of models tested decreased in performance for both shifts.Additionally, these methods have a significant reduction in the p(a, c) at which MaxA MD and MaxF 0.5 occur compared with epistemic methods.
γ-SSL methods: Much like many of the representation-based methods, the performance of γ-SSL decreases from London to New Forest, and again to Scotland. γ-SSL decreases less on average in terms of AUPR, but more in terms of AUROC. Independent of the increase in segmentation accuracy, the use of a curriculum for γ-SSL iL reduces the effects of large distributional shifts on uncertainty estimation, as evidenced by γ-SSL iL having by far the smallest reduction in AUPR from New Forest to Scotland. Factoring in the increased segmentation accuracy, γ-SSL iL also exhibits a much smaller reduction in the value of p(a, c) at which MaxA MD and MaxF 0.5 occur when compared to γ-SSL.

E. Source: Cityscapes, Target: KITTI & BDD
The results for KITTI and BDD can be found in Tab. II, Tab. III, Tab. IV and Tab. V. The accuracy of γ-SSL for KITTI, SAX London, BDD and SAX New Forest is as follows: 0.817, 0.703, 0.684, 0.595; therefore KITTI is the least distributionally shifted domain w.r.t. Cityscapes, and BDD is more distributionally shifted than SAX London.
KITTI: The AUROC, MaxA MD and MaxF 0.5 metrics for our γ-SSL method exceeded those of all of the benchmarks, with only AUPR exceeded by the MCD-DSL method. Epistemic methods significantly outperformed representation-based methods on almost all of the metrics. This experiment therefore provides extra evidence that our method performs well in target domains close to the source domain.
BDD: γ-SSL is the best-performing model for each metric with BDD as the target domain, with epistemic methods on average outperforming the representation-based methods. Despite BDD being less distributionally shifted than SAX New Forest according to segmentation accuracy, each of the uncertainty estimation metrics is lower for BDD than SAX New Forest, in a break from the trend. The hypothesis for this is that there is significantly more diversity in BDD compared with SAX New Forest, and learning a representation for uncertainty estimation is more difficult in a more diverse domain.

F. Source: BDD
In order to investigate the generality of this approach, this work also conducts secondary experiments with BDD as the source domain. The full set of results can be found in Tab. VI, Tab. VII, Tab. VIII and Tab. IX. For both epistemic and representation-based methods, using BDD as the source domain improves uncertainty estimation performance across all metrics. This benefit typically increases as the domain shift increases, e.g. for MaxF 0.5 the average percentage increases from Cityscapes to BDD for the benchmarks are 9.1 %, 14.6 % and 26.2 % for SAX London, New Forest and Scotland respectively. The BDD labelled dataset is larger and more diverse, therefore leading to a more general representation of the semantic classes.
SAX London: For this target domain, γ-SSL performs the best on each metric, with a 6.5 % increase in MaxF 0.5 over the best-performing benchmark. Across the board, the results for γ-SSL with Cityscapes as the source domain are better than in this experiment. The higher uncertainty estimation metrics coincide with a higher accuracy using Cityscapes as the source domain than BDD (0.703 and 0.688 respectively), suggesting that Cityscapes could be more similar to this domain than BDD.
SAX New Forest: The accuracies of the γ-SSL models are 0.666 and 0.595 for BDD and Cityscapes respectively, suggesting a larger domain shift between Cityscapes and SAX New Forest than between BDD and SAX New Forest. In all but AUPR, both the γ-SSL and γ-SSL iL models outperform all benchmarks. The results using BDD as source are similar to or slightly better than those using Cityscapes. Better uncertainty estimation results for BDD would be predicted by the lesser domain shift; however, this is not consistently shown. This is perhaps because BDD is more diverse, and a narrower definition of the source domain allows a greater separation of source and target, and thus simpler uncertainty estimation.
SAX Scotland: The results for γ-SSL are significantly better with BDD as the source domain than with Cityscapes, accompanied by similar but smaller improvements for γ-SSL iL . Segmentation accuracies for γ-SSL of 0.495 and 0.431 for BDD and Cityscapes respectively suggest that the magnitude of the domain shifts would explain this. In this experiment, however, the epistemic methods also improve significantly, resulting in comparable performance between γ-SSL iL and these methods; which is better depends on the metric considered.
The results demonstrate that while a different source domain has an effect on the quality of uncertainty estimation, our method still exceeds or is competitive with the best benchmarks considered, while retaining the low latency required for robotic deployment. It is clear from Fig. 5 that there typically exists a threshold such that accurate and inaccurate pixels are optimally separated (according to A MD and F 0.5 ). Calculating this threshold typically requires a set of validation images from the test domain, which reduces the number of images available for testing. Additionally, if this small set of validation images is not representative of the test dataset, then the calculated threshold will result in significantly worse misclassification detection performance.

G. Optimal Threshold Calculation Testing
The effect of the number of validation examples on misclassification detection performance for the γ-SSL models is investigated by calculating the metrics discussed in Sec. VII-A on the remaining test images, for a given withheld validation set. The results are averaged over 100 trials, where for each trial the withheld validation set is selected randomly. Given that the method in this work calculates a threshold parameter during training, it is also possible to use this threshold during testing, such that zero validation examples are required. The mean and variability of these metrics are presented as a box plot in Fig. 6.
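A single trial of this procedure can be sketched as follows, with synthetic per-image uncertainty and correctness arrays standing in for real model outputs (all names, the threshold grid and the data here are illustrative):

```python
import numpy as np

def f_beta_at(unc, correct, thresh, beta=0.5):
    # pixels with uncertainty at or below the threshold are declared certain
    certain = unc <= thresh
    tp = np.sum(correct & certain)
    fp = np.sum(~correct & certain)
    fn = np.sum(correct & ~certain)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

def one_trial(unc_imgs, correct_imgs, n_val, grid, rng):
    """Withhold n_val random images, pick the threshold maximising
    F0.5 on them, then score the remaining 'test' images with it."""
    idx = rng.permutation(len(unc_imgs))
    cat = lambda ids, xs: np.concatenate([xs[i] for i in ids])
    u_val = cat(idx[:n_val], unc_imgs)
    c_val = cat(idx[:n_val], correct_imgs)
    best = max(grid, key=lambda t: f_beta_at(u_val, c_val, t))
    return f_beta_at(cat(idx[n_val:], unc_imgs),
                     cat(idx[n_val:], correct_imgs), best)

# synthetic data: correct pixels get low uncertainty, incorrect get high
rng = np.random.default_rng(0)
correct_imgs = [rng.random(100) < 0.7 for _ in range(10)]
unc_imgs = [np.where(c, rng.random(100) * 0.4, 0.6 + rng.random(100) * 0.4)
            for c in correct_imgs]
result = one_trial(unc_imgs, correct_imgs, n_val=2,
                   grid=np.linspace(0, 1, 21), rng=rng)
```

Averaging `one_trial` over many random splits reproduces the spread shown in the box plot; the zero-validation case simply replaces `best` with the threshold learned during training.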
Firstly, these plots demonstrate that the smaller the number of chosen validation images, the less well they represent the test dataset, and therefore the more variable the test performance. More importantly, they show that there is a minimal decrease in performance between using 20 validation images and calculating the threshold with none, as per Sec. V-B. This means the γ-SSL methods are able to calculate an appropriate threshold for the misclassification detection task without using any validation examples.

H. Cross-Domain Threshold Testing
The aim of this work is to propose a model that can estimate its mistakes with a feature-space distance threshold as the data distribution changes.It is therefore important that this threshold calculated for one domain is effective for all domains, rather than needing a specifically optimal threshold for each domain.
In order to investigate this (see Tab. XI), we compare the F 0.5 results for the optimal threshold value for a test domain against the value calculated from the target domain the model is trained on. For example, for γ-SSL iL -LDN (trained on unlabelled SAX London data), the threshold corresponding to the maximum F 0.5 score is used when testing on the SAX New Forest and Scotland datasets, and the corresponding F 0.5 scores are shown.
Tab. XI shows that a threshold optimal for one domain degrades performance only very slightly for another domain.

I. WildDash Results
So far in this work, the γ-SSL and γ-SSL iL models have been trained on the SAX domains, with operating conditions different to those of Cityscapes, and tested on these same SAX domains. In this section, the models are also tested on the WildDash dataset [54], in order to investigate how they generalise to a test dataset with operating conditions different to both the labelled and unlabelled training data. This dataset does not define a single domain but includes images from a diverse set of domains, spanning different weather conditions, day/night, and geographic locations from across the world. Misclassification detection performance on this dataset is therefore a measure of how well these models can detect error due to OoD instances unlike anything seen in the labelled or unlabelled training datasets, or how specific they are to the domain of the unlabelled training data.
As shown in Fig. 7, γ-SSL iL -SCOT outperforms all benchmarks, with an AUROC and AUPR of 0.852 and 0.896, compared with 0.803 and 0.868 for the best benchmark, Ens-PE 5 .
Firstly, this demonstrates that although the γ-SSL iL models have been trained to mitigate error in specific operating conditions based on geographic location, they can also effectively detect error due to never-before-seen conditions.Secondly, given that these values for γ-SSL iL -SCOT are lower than the values for the SAX test domains, it also demonstrates that the best performance is reached when an unlabelled training dataset is collected from the same domain as the test data.

J. Timing Results
In Tab. XII, the frequency at which each method can operate is presented on differing hardware, as low latency is a key characteristic for robotic perception systems. The GPU tested is an NVIDIA V100, while the CPU is the M2 Pro of a MacBook Pro. The Vanilla method is the DeepLabV3+ segmentation network [47] detailed in Sec. VI-B; it differs from Ours in that it uses a convolutional layer instead of prototype segmentation, and therefore also does not contain a projection network. The timing results for Vanilla therefore represent all of the representation-based methods described in Sec. VI-D. The MCD and Ens methods are the same as those described in Sec. VI-D (i.e. 5- and 10-member ensembles, and 8 samples for the MCD method). The superscripts LM and HM relate to different inference methods for the ensembles: the former (Low Memory) only loads one network into GPU memory at a time, while the latter (High Memory) loads every member network at once, while still performing inference sequentially.
These results demonstrate that our proposed method operates at significantly lower latency than the Monte Carlo Dropout and Ensemble methods, and is no slower than the segmentation network from which our method is built.

K. Qualitative results
Qualitative results can be found in Fig. 8. Again, labels are only available in the source domain, Cityscapes, for which segmentations are very high-quality. The segmentation performance degrades with increasing domain shift, as expected, from London → New Forest → Scotland. Correspondingly, however, uncertainty mitigates these erroneous predictions. Consider several examples from these samples:
1) the lane-obstructed street sign in the first London example, not amongst the known signs in the Cityscapes dataset;
2) the telephone box in the second New Forest example, not in the Cityscapes domain or list of known classes;
3) the pile of timber in the first Scotland example, classified as vehicle.
All of these are correctly assigned high uncertainty.

IX. ABLATION STUDIES
B. Is the target domain data required?
In this investigation, Cityscapes data is used for the semi-supervised task instead of the SAX unlabelled data (hence this model is called NoSAX). The objective is to determine whether the proposed method is leveraging the unlabelled target domain data, or whether our uncertainty estimation results are due only to the semi-supervised task and objectives.
Compared with γ-SSL and γ-SSL iL , the AUROC and AUPR results in Tab. XIIIa show a worse misclassification detection performance for NoSAX on the SAX test datasets. This confirms the utility of collecting large, diverse datasets containing near-distribution and OoD instances, as this ablation empirically shows that using such a dataset during training improves the detection of OoD instances during testing.

C. Is M γ required to learn uncertainty estimation?
This experiment, named M γ=−∞ , trains a model by maximising the consistency between all pixels. This, therefore, investigates the loss function in Eq. (8) and whether standard semi-supervised training is sufficient to learn a good representation for uncertainty estimation.
The results consistently show that the proposed γ-SSL performs better at misclassification detection on each metric.This suggests that attenuating the loss for uncertain pixels facilitates learning a representation suitable for uncertainty estimation.Additionally, the results in Fig. 9 show that the segmentation accuracy is lower for this ablation (remembering that max[p(a, c)] is the segmentation accuracy), showing it is beneficial to use M γ to filter out noise in the semi-supervised consistency task introduced by the challenging, uncurated nature of the unlabelled training images.

D. Does branch asymmetry prevent feature collapse?
Sym-Non-Param and Sym-Param produce segmentations using only non-parametric prototype segmentation and a parameterised segmentation head respectively, and are thus both symmetric, unlike our system (see Sec. V-E).
For both methods, the models nearly always suffered feature collapse, where each feature is embedded near a single class prototype (Road in this case). The exception is Sym-Non-Param on the SAX New Forest dataset. When collapse occurs, the key observation is that segmentation accuracy greatly deteriorates, even if AUROC and AUPR look acceptable. Firstly, this ablation provides evidence that the asymmetric branches successfully prevent this type of failure. Secondly, it confirms that looking at AUROC and AUPR alone is clearly insufficient to fully evaluate the model, and that integrating segmentation quality into the metrics is useful.

E. Do the additional losses provide useful regularisation?
By removing both L u and L p for NoRegL, the effect of these additional losses can be investigated.The result is not a complete feature collapse but a deterioration of misclassification-detection and segmentation performance on every metric.This suggests that by spreading out features and prototypes on the unit-hypersphere, distance to prototypes is a better measure of uncertainty, and the classes in the target domain become more separable.
F. Is data augmentation the best way of obtaining a distribution over possible segmentations?
Sec. II discusses how distributional uncertainty can be treated as inherent to the data rather than the model, thus motivating the use of data augmentation rather than model perturbation to obtain a distribution over possible segmentations of an image. This ablation, MCD-SSL, investigates training the model using dropout instead of data augmentation to calculate M c . The dropout probability used was 0.2, as this performed best for both misclassification-detection and segmentation performance for the MCD 0.2 benchmarks.
MCD-SSL does not achieve misclassification-detection performance as good as that of γ-SSL, as demonstrated on every metric in Tab. XIII. This provides evidence that data augmentation is a good method for obtaining segmentation distributions representing the likelihood of correct class assignments.

X. ADDITIONAL ABLATION STUDIES
In this section, differing training procedures are experimented with, in order to investigate their performance relative to our proposed method. The experiments use the Cityscapes dataset as the labelled dataset and SAX London data as the unlabelled dataset, with test results reported on the SAX London test dataset.
A. Does a "soft" M γ help training?
In this experiment, the certainty mask is no longer binary; instead, the confidence is expressed as the max softmax score, M (i) γ = norm(max[σ τ (s (i) )]) ∈ [0, 1], where norm is a function that normalises a batch of certainty masks such that the lowest pixel confidence is 0 and the highest is 1; see Sec. IV-A for more details.
With this soft mask, uncertainty estimation reaches a MaxF_0.5 of 0.862 at a p(a, c) of 0.421, with a segmentation accuracy of 0.576. An equivalent model using a binary-thresholded M_γ achieves 0.893 @ 0.548 for MaxF_0.5 @ p(a, c) and a segmentation accuracy of 0.703. The soft mask therefore causes a significant drop in segmentation accuracy in the target domain, as well as a drop in uncertainty estimation performance.
These results suggest that the soft M γ introduces noise into the consistency task on the unlabelled domain, and prevents the learning of a high-quality representation of this domain.
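For reference, the "soft" certainty mask from this ablation might be computed as below; the temperature τ and all variable names are assumptions for illustration, not the exact implementation:

```python
import numpy as np

def soft_certainty_mask(scores: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Soft M_gamma: per-pixel max of the temperature-scaled softmax,
    min-max normalised over the whole batch to lie in [0, 1].
    scores: (B, H, W, K) segmentation scores."""
    z = scores / tau
    z -= z.max(axis=-1, keepdims=True)          # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)          # softmax over classes
    conf = p.max(axis=-1)                       # (B, H, W) max softmax score
    lo, hi = conf.min(), conf.max()
    return (conf - lo) / (hi - lo + 1e-12)      # batch-wise min -> 0, max -> 1

rng = np.random.default_rng(1)
m_soft = soft_certainty_mask(rng.random((2, 8, 8, 19)))
```

Because every pixel then contributes a fractional weight to the consistency loss, low-confidence (and likely inconsistent) pixels are never fully excluded, which matches the noise argument above.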

B. Should each class prototype have a different threshold?
This experiment evaluates whether learned uncertainty estimation can be improved by thresholding the cosine distance between a feature and each class prototype with a per-class value, rather than a single global threshold.
The rationale is that datasets often represent different classes with different levels of occurrence and diversity, so the statistics of each class's representation may also vary, e.g. more diversely represented classes may exhibit greater variance.
To account for this, the certainty mask M_γ is calculated from per-class thresholds Γ = [γ_1, γ_2, ..., γ_K] as follows: a pixel i with segmentation scores s^(i) ∈ R^K is deemed certain if the score for its predicted class, k̂ = argmax_k [s^(i)]_k, exceeds γ_k̂.
The per-class thresholds Γ are calculated such that, for each class k, the per-class consistency [p_c]_k is equal to the per-class certainty [p_γ]_k. The uncertainty estimation results achieved with a model trained in this way are significantly worse than for our proposed method, with a MaxF_0.5 @ p(a, c) of 0.772 @ 0.432 (versus 0.893 @ 0.548 for our proposed model) and a segmentation accuracy of 0.651 (versus 0.703). This suggests that solving for per-class thresholds during training negatively affects both segmentation quality and uncertainty estimation, and it is therefore preferable to solve for a single threshold as per our method.
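The per-class masking rule evaluated in this ablation can be sketched as follows; this is an illustrative reconstruction of the ablated variant, not the authors' code:

```python
import numpy as np

def per_class_certainty_mask(scores: np.ndarray, gammas: np.ndarray) -> np.ndarray:
    """Binary M_gamma with one threshold per class: a pixel is certain
    if the score of its predicted class exceeds that class's threshold.
    scores: (H, W, K); gammas: (K,)."""
    pred = scores.argmax(axis=-1)                                    # k-hat per pixel
    top = np.take_along_axis(scores, pred[..., None], axis=-1)[..., 0]
    return top > np.asarray(gammas)[pred]

scores = np.array([[[0.9, 0.1], [0.2, 0.8]]])  # one row, two pixels, K=2
mask = per_class_certainty_mask(scores, np.array([0.5, 0.85]))
```

The second pixel is predicted as class 1 but falls below that class's stricter threshold, so it is marked uncertain even though class 0 has a looser one.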
C. Is a large batch size required for calculating the prototypes during training?
For efficiency during training, prototypes are calculated from features extracted from a batch of labelled images, whereas at test time prototypes can be calculated over the entire dataset. This raises the question of whether the training prototypes become too noisy when the batch size is too small, and whether a large amount of GPU memory is required for this method. Our proposed training procedure uses a batch size of only 12, and mitigates one aspect of this problem by reusing the most recent prototype for any class not present in the batch of labelled images (shown in Tab. XIV as "Use history?").
In order to investigate whether small batch sizes are a problem, we train a model with a smaller number of images from which to compute prototypes, while keeping the batch size for all other aspects the same. During testing, prototypes are calculated from all available labelled images in the labelled domain. We report metrics for both segmentation quality (segmentation accuracy, Seg. Acc.) and uncertainty estimation performance (MaxF_0.5 @ p(a, c)) in Tab. XIV. These results show that, while still using previous prototypes where needed, reducing the prototype batch size does not significantly affect segmentation or uncertainty estimation quality. However, when the history of prototypes is not used, uncertainty estimation quality (in the form of MaxF_0.5 @ p(a, c)) does degrade at lower batch sizes, thereby demonstrating the usefulness of this mechanism.
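The batch prototype computation with the "Use history?" fallback might look like the following sketch; function and variable names are illustrative, not the paper's implementation:

```python
import numpy as np

def batch_prototypes(feats, labels, num_classes, history):
    """Per-class mean of pixel features from one labelled batch,
    projected to the unit sphere. feats: (N, D); labels: (N,) class ids;
    history: (K, D) most recent prototypes, reused for classes that are
    absent from the batch."""
    protos = history.copy()
    for k in range(num_classes):
        mask = labels == k
        if mask.any():                                   # class present: recompute
            p = feats[mask].mean(axis=0)
            protos[k] = p / (np.linalg.norm(p) + 1e-12)  # project to unit sphere
    return protos

feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
labels = np.array([0, 1, 1])
history = np.full((3, 2), 0.5)                           # class 2 absent this batch
protos = batch_prototypes(feats, labels, 3, history)
```

Carrying forward the last known prototype for an absent class is what keeps small-batch training stable in Tab. XIV; without it, rare classes would have undefined or stale-free prototypes in many iterations.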

Fig. 2 .
Fig. 2. Depiction of simultaneous segmentation and uncertainty estimation for the model presented in this work. Pixel-wise features are extracted from an image by encoder E. Distances d_1:3 are calculated between each feature and prototypical features from each class, p_1:3, known as prototypes. If any of [d_1, d_2, d_3] < γ, the feature is certain and assigned the class of its closest prototype (denoted by the coloured pixel overlaid on the right); otherwise, the feature is assigned uncertain (denoted by the question mark in the white pixel). In this way, a 'safe region of operation' is defined in feature space: inside it, pixels are accurate and certain; outside it, they are uncertain and inaccurate.

Fig. 3 .
Fig. 3. The training regime of the proposed approach. The model parameters are updated by four losses: (a) L_c, (b) L_u, (c) L_p, (d) L_s. (a) For the pixels deemed certain by M_γ, L_c maximises the consistency, a proxy for accuracy, over the segmentations s'_T, s_T of augmented versions x'_T, x_T of the original target domain image x_T. (b) L_u softly constrains the features z_T to be uniformly distributed on the unit hypersphere. (c) L_p maximises the distance between source prototypes p_S, i.e. spreads the mean embeddings of each class in the source domain dataset uniformly over the unit sphere. (d) L_s maximises the accuracy of the segmentations of the source images x_S with respect to ground-truth labels. For each diagram, the networks coloured in aquamarine are updated by the losses, while the cross-hatched networks are not. Note that for diagrammatic clarity, the colour transforms are depicted as following x'_T, x_T, x_S, whereas in reality, as described in Sec. V-A: x'_T = C_1 ∘ T^L_1 ∘ T^G(x_T), x_T = C_2 ∘ T^L_2 ∘ T^G_1(x_T), x_S = C_3 ∘ T^L_3 ∘ T^G_2(x_S). Best viewed in color.
Fig. 4. Example images and corresponding segmentation masks from Cityscapes, the source domain, and the domains in the SAX Segmentation Test Dataset.

Fig. 5 .
Fig. 5. For each SAX domain, a row of plots describes the misclassification-detection performance of a series of benchmarks and the proposed methods, γ-SSL and γ-SSL_iL. Misclassification-detection accuracy, A_MD, and F-score, F_0.5, aggregate performance into a single metric, where a larger value of each represents a more 'introspective' model. They are plotted versus p(a, c), the proportion of pixels that are accurate and certain, as this represents the amount of accurate and useful semantic information the model can extract from images; it is also a metric maximised by the ideal model. Note that the maximum value of p(a, c) is equal to the segmentation accuracy, max[p(a, c)] = p(accurate). Best viewed in color.

Fig. 6 .
Fig. 6. Box plot representing the A_MD achieved by calculating the uncertainty threshold with varying numbers of validation examples for a γ-SSL model. The dashed lines represent the values of A_MD achieved when using the entire test dataset to calculate the optimal threshold, before then testing on it.

Fig. 7 .
Fig. 7. Misclassification detection results on the WildDash Dataset [54]. γ-SSL-LDN refers to a γ-SSL model trained on the SAX London unlabelled dataset, whereas γ-SSL_iL-NF and γ-SSL_iL-SCOT refer to γ-SSL_iL models trained on the SAX New Forest and SAX Scotland unlabelled datasets respectively, while also using SAX London as part of a curriculum. Best viewed in color.

Fig. 8 .
Fig. 8. Qualitative results for Cityscapes and the SAX domains. As the SAX RGB images (left) become more dissimilar from Cityscapes (top to bottom), the corresponding semantic segmentations (centre) decrease in quality. However, high uncertainty is largely expressed over these poorly segmented regions, shown in black (right).

Fig. 9 .
Fig. 9. Misclassification results for ablated γ-SSL models. Given that the ablations are performed on the γ-SSL models, the γ-SSL_iL models are not plotted. See Sec. IX for details. Best viewed in color.

TABLE IV: MAXIMUM ACCURACY A_MD AND p(a, c) WITH SOURCE: CITYSCAPES

TABLE V: MAXIMUM F_0.5 SCORE AND p(a, c) WITH SOURCE: CITYSCAPES

TABLE VIII: MAXIMUM ACCURACY A_MD AND p(a, c) WITH SOURCE: BDD

TABLE IX: MAXIMUM F_0.5 SCORE AND p(a, c) WITH SOURCE: BDD

TABLE X: AUROC AND AUPR PERCENTAGE CHANGE AT INCREASING DISTRIBUTIONAL SHIFT, %∆ROC AND %∆PR RESPECTIVELY

TABLE XI: CROSS-DOMAIN THRESHOLD TESTING RESULTS (a) γ-SSL_iL-LDN

TABLE XIV: RESULTS FOR VARYING TRAINING PROTOTYPE BATCH SIZE ON SAX LONDON

This work presents a segmentation network that mitigates misclassification on challenging, distributionally shifted test data via uncertainty estimation. It achieves this by learning a feature representation in which pixel embeddings corresponding to accurate and inaccurate segmentations are separable by a single global threshold around prototypical class features. By leveraging a large quantity of uncurated unlabelled data from the deployment domain, the constraint of having labelled data from that domain is relaxed, and thus a small labelled dataset from a distinct domain can be used. Secondly, it presents a novel semantic segmentation test benchmark, comprising a set of 700 pixel-wise labels from three distinct domains and metrics to measure the quality of uncertainty estimation. Upon evaluation on this challenging benchmark, the presented network outperforms epistemic uncertainty estimation and out-of-distribution detection methods, and does so without increasing the computational footprint of a standard segmentation network.