One Versus All for Deep Neural Network for Uncertainty (OVNNI) Quantification

Deep neural networks (DNNs) are powerful learning models, yet their results are not always reliable. This drawback results from the fact that modern DNNs are usually overconfident, and consequently their epistemic uncertainty cannot be straightforwardly characterized. In this work, we propose a new technique to quantify easily the epistemic uncertainty of data. This method consists in mixing the predictions of an ensemble of DNNs trained to classify One class versus All the other classes (OVA) with predictions from a standard DNN trained to perform All versus All (AVA) classification. First of all, the adjustment provided by the AVA DNN to the score of the base classifiers allows for a more fine-grained inter-class separation. Moreover, the two types of classifiers enforce mutually their detection of out-of-distribution (OOD) samples, circumventing entirely the requirement of using such samples during training. The additional cost involved by the construction of the ensemble is offset by the ease of use of our proposed strategy and by its enhanced generalization potential, as it does not bind its performance in a given context to specific OOD datasets. The extensive experiments confirm the wide applicability of our approach, and our method achieves state of the art performance in quantifying OOD data across multiple datasets and architectures while requiring little hyper-parameter tuning.


I. INTRODUCTION
Anomaly detection is the task of detecting data that deviate from the training distribution. Deep neural networks (DNNs) have reached state-of-the-art performance on machine learning [20], [45], and computer vision tasks [40], [70]. Significant progress has raised interest in adopting them in a wide range of decision-making systems, including safety-critical ones. Yet, one of the main weaknesses of these techniques is that they tend to be overconfident [22] in their decisions, even when they are wrong [22], [25], [63]. This leads to DNNs that might miss detecting anomalies. This issue is difficult to tackle, as the high inner complexity of DNNs results in a poor output explainability.
Anomaly or outlier detection is a broad research area [66]. The objective is to detect rare or corrupted data that are different from what we consider to be normal data. This research topic has multiple practical applications, such as risk management [81], safety [69] or automatic inspection and non-destructive control [38]. Anomalies can also be linked to the knowledge uncertainty [27] of the DNNs. The precise identification of anomalies in DNN predictions is crucial for improving the reliability of such models, and a key step towards their deployment in practical settings.
The associate editor coordinating the review of this manuscript and approving it for publication was Mingbo Zhao.
In order to address this important problem, we propose to leverage a finer quantification of the uncertainty of DNNs. In contrast to most Bayesian DNN techniques [4], [17], [18], [35], [56], or to frequentist techniques such as Deep Ensembles [43], our approach relies on One versus All (OVA) training. In the statistical learning community, ensembles of OVA or One versus One (OVO) base classifiers for multi-class prediction have been particularly popular in association with Support Vector Machines (SVM), due to SVM being essentially a binary classifier, and to the simplicity of the aggregation rules supported by fundamental theoretical results [37], [41], [79]. The most popular rule in case of OVA ensembles, winner-takes-all (WTA), assigns the test sample to the class for which the membership score is the highest. For a binary output, the WTA rule creates in the input space multiple unclassifiable regions, for which the class assignment is not unique, and the standard solution is to rely on continuous membership scores. In contrast to SVM-based learning, nowadays the OVA approach has been mostly discarded when training deep classifiers, in favor of All vs All (AVA) learning.
FIGURE 1.
Confidence score distributions for (a) MCP [28], (b) Deep Ensembles [43], (c) OVNNI. All runs use Resnet50 [24] trained on CIFAR-10 and tested on CIFAR-10 and SVHN. Our proposed algorithm OVNNI generates very low prediction scores for OOD data, outperforming Deep Ensembles (current state-of-the-art) and MCP (baseline) on detecting OOD data.
The predictive uncertainty of DNNs is commonly categorized into aleatoric uncertainty and epistemic uncertainty [30]. The former is related to randomness, typically due to the noise in the data. The latter concerns finite size training datasets. The epistemic uncertainty captures the uncertainty in the DNN parameters and their lack of knowledge on the model that generated the training data. In this paper, we propose to use OVA learning in order to improve the quantification of the epistemic uncertainty of the DNN. The underlying idea of our approach is that the score of a base classifier should be adjusted by a factor which approximates its local reliability in the input space from which the test sample originated. Initially for SVM learning, the reliability has been linked to the average value of the local objective function [53], which is approximated using the closest training samples belonging to the respective class. Here we propose to adjust the OVA scores by the score provided by an AVA DNN, which thus plays the role of approximating the local class-specific objective function. This strategy allows for a particularly effective detection of out-of-distribution (OOD) samples in the test data, as we can discriminate between samples belonging to unclassifiable regions equally close to some classifiable regions, and samples belonging to unclassifiable regions far from all classifiable regions. Figure 1 presents the distribution of the scores provided by the baseline, Deep Ensembles (the current state-of-the-art) and our method, respectively. The baseline is the single AVA classifier, for which the class assignment is performed based on the Maximum Class Probability (MCP) [28].
The baseline is unable to discriminate between in- and out-of-distribution samples, illustrated in blue/yellow and orange in the histograms, respectively. Deep Ensembles produces lower scores for OOD samples, but the in-distribution membership is still overestimated. Finally, OVNNI successfully assigns low scores to the OOD samples, while at the same time keeping the in-distribution scores high.
Our main contributions are the following: (1) We propose an effective non-Bayesian technique for uncertainty quantification in OOD data classification, that reaches state of the art results on calibration and on OOD data detection on a variety of datasets, and on all typical metrics.
(2) We conduct extensive evaluation experiments on multiple computer vision tasks (image classification, semantic segmentation) and datasets (MNIST [46]/Not MNIST [1], CIFAR-10 [39]/SVHN [62], Camvid [8], StreetHazards [27], BDD Anomaly [27]) and compare with strong and recent related methods. We show that OVNNI excels at detecting OOD images and objects. (3) We shed a fresh light on One vs All classifiers that have been so far rather ignored in the context of DNNs and hope to rehabilitate them for such approaches. Our conclusions are in line with less recent findings, e.g. OVA classifier aggregation [71].

II. RELATED WORK
OOD detection is not a novel problem and has been studied before the deep learning revival in various branches of machine learning under slightly different tasks: anomaly [51], outlier [7] or novelty detection [74]. In the last few years, this task has seen increased attention from different communities and has been addressed with predictive uncertainty estimation, ensemble methods, image reconstruction, etc. In the following we briefly review some of the methods related to our approach.

A. ANOMALY DETECTION
Anomalies are linked to abnormal data detection. As building datasets for anomaly detection is a difficult task, we can make different assumptions. Assumption 1: we consider that we can collect data of anomalies; in this case, we assume that normal and abnormal data are available. This case was studied in [21], [54]. We refer to this case as supervised anomaly detection. Since it is hard to collect such data, research has lately focused on unsupervised anomaly detection that does not require any labeled training data [72], [73]. Then, semi-supervised anomaly detection approaches make Assumption 2: training data contain no anomalies, and, during inference, tests are performed on normal and abnormal data. Finally, weakly supervised learning considers a set of labeled normal training data, and also has no abnormal training data. This is studied in [17], [28]. In this case, the anomalies represent the lack of knowledge of the DNN and are related to epistemic uncertainty (Assumption 1). Our paper focuses on this kind of anomaly detection problem.
VOLUME 10, 2022

B. CLASSIFICATION WITH A BACKGROUND CLASS
In multiple computer vision tasks, e.g., object detection [52], [70], it is common to use a background class in addition to the known classes to classify. This leads to a better separation of the classification space and a more discriminative classifier. While this seems to be a reasonable and straightforward approach, for OOD detection it is likely to suffer from negative dataset bias [78] and thus not generalize to other background objects not seen during training. In our approach, we also use a part of the classes as background when training the individual classifiers; however, the overlap of their decision boundaries, coupled with the AVA model, better distinguishes in- from out-of-distribution samples.

C. ANOMALY DETECTION BY RECONSTRUCTION
Anomalies can be detected by training an autoencoder [2], [12] or generative model [50], [73] on in-distribution data, and using the quality of the reconstruction as a proxy for OOD detection, as the autoencoder is unlikely to accurately decode patterns not seen during training. Training such models for accurate and robust reconstruction requires large amounts of data.

D. BAYESIAN APPROACHES
Bayesian Neural Networks (BNN) [61] are elegant, intuitive models that are easy to reason about and that can capture the epistemic uncertainty through the exploitation of the distributions of their weights. In spite of recent progress that makes them more tractable [4], they are still limited to small or medium-size networks, while most DNNs usually comprise millions of parameters. Gal and Ghahramani [18] aimed for a method to imitate BNNs. To this end they proposed Monte Carlo Dropout (MC Dropout) to estimate the posterior predictive network distribution by sampling different subsets of neurons at each forward pass during test time and aggregating their predictions. In computer vision, MC Dropout is the most popular instance of BNNs due to its speed and simplicity. It has been extended to other tasks, e.g., semantic segmentation [33], pose estimation [34]. However, the benefits of Dropout are more limited for convolutional layers, where specific architectural design choices must be made [33], [60]. Recent OOD benchmarks for semantic segmentation [26], [50] show that MC Dropout still induces many false positives.

E. ENSEMBLES
Ensemble methods are prominent techniques for measuring epistemic uncertainty. They have the potential to encapsulate a true diversity in the weights of the composing models, in contrast to the dispersion introduced by MC Dropout [16], which ultimately focuses on a single mode. Lakshminarayanan et al. [43] propose training an ensemble of DNNs with different initialization seeds. Vyas et al. [80] train an ensemble of classifiers in a self-supervised way on different subsets of the training data, using the left-out data as OOD. Izmailov et al. [32] collect weight checkpoints from local minima and average them or fit a distribution over them and sample networks [56]. Franchi et al. [17] track weight trajectories across training and compute their distributions, further used for sampling an ensemble of networks. Our approach also exploits ensembles; however, each network is specialized on a different classification task. We exploit the complementarity in this ensemble for better OOD predictions.

F. OVA/OVO ENSEMBLES
These aggregation techniques are popular for performing multi-label classification based on an ensemble of binary base classifiers. For OVO, instead of the baseline max-voting aggregation strategy, pairwise coupling [83] or ECOC [15] have been widely used, but the quadratically increasing number of base classifiers may limit significantly OVO applicability in the case of large sets of labels. In contrast, OVA fusion uses a linearly increasing number of base classifiers, and relies in most works on a Winner-Takes-All class assignment based on the maximum class response. To the best of our knowledge, these ensembling methods have not been used for estimating the epistemic uncertainty of DNNs. One-vs-all formulations have also been studied in a more recent publication [65] where an ensemble of one-vs-all DNNs is trained, with a new distance-based loss that can encode the distance of a point from the training manifold, maximizing the binary log-likelihood for the positive class and minimizing it for the negative classes. Despite the interesting results on image classification tasks, this approach does not seem scalable for computer vision tasks such as semantic segmentation.

G. DEEP OOD DETECTION
A recent line of approaches addresses OOD detection through DNN-specific heuristics. Hendrycks and Gimpel [28] established a standard baseline for OOD detection relying on the Maximum Class Probability from softmax. In [14] a confidence branch is attached to a classification network, which is trained to predict OOD samples, while ODIN [49] learns a temperature scaling for softmax values and adversarial perturbation to better distinguish OOD data. Lee et al. [48] obtain a class-conditional Gaussian distribution with respect to features that they tune on a dataset with OOD data and in-distribution data. Lambert et al. [44] attenuate uncertainty by training on a large composite dataset leading to a more robust DNN. Zendel et al. [84] propose a semantic segmentation dataset for checking the confidence score of DNNs. The authors of [3] train a DNN to predict an OOD confidence score. Lee et al. [47] train a GAN along with the classifier to produce near-distribution examples and enforce lower classifier confidence on GAN samples. Malinin and Gales [57] use Dirichlet networks to build a distribution over the prediction distributions for OOD detection. Most of these methods rely on an OOD dataset during training and are likely to specialize on specific anomalies from these data [29]. In contrast, in our approach we do not require OOD examples during training, as we leverage the multiple one-versus-all classifiers.

III. ONE VERSUS ALL FOR DEEP NEURAL NETWORK FOR UNCERTAINTY QUANTIFICATION (OVNNI)
This section focuses first on the necessary details on the traditional AVA training of a DNN. Then we describe our approach based on additional OVA training.

A. NOTATIONS
• The training and testing sets are denoted by $D_l = \{(x_i, y_i)\}_{i=1}^{n_l}$ and $D_\tau = \{(x_i, y_i)\}_{i=1}^{n_\tau}$, respectively, where $x_i$ and $y_i$ represent the observed sample and the corresponding label, respectively, with $n_l$ and $n_\tau$ the size of the training and testing sets, and $i \in \{1, \ldots, n_l\}$ or $i \in \{1, \ldots, n_\tau\}$; $x_i$ are input vectors and $y_i \in \{1, \ldots, n_{label}\}$ are class labels. Unless otherwise specified, $x_i$ and $y_i$, $i \in \{1, \ldots, n_l\}$, will refer to training data.
• $X$ is the random variable associated with observed samples and $Y$ the one associated with classes.
• The DNN is a function $f$ of the observed data $x_i$, with $i \in \{1, \ldots, n_l\}$ or $i \in \{1, \ldots, n_\tau\}$, and of the vector $\omega$ that contains the trainable weights. We call $f_\omega(x_i)$ the output of the DNN associated with the weights $\omega$ on the data $x_i$.
• $L(\omega, y_i)$ is the loss function used to measure the dissimilarity between the output $f_\omega(x_i)$ of the DNN and the expected output $y_i$. Different loss functions can be considered according to the type of task. Here we will focus on the cross-entropy that will be introduced in the next section.

B. ALL VERSUS ALL TRAINING OF DEEP NEURAL NETWORKS
For image classification, the goal of a DNN is to map the input data to a probabilistic prediction that we denote $P(Y = y^* \mid X = x_i, \omega)$ with $y^*$ a class label. During training, an optimization algorithm will improve the weights $\omega$ in order to fit as much as possible the output to the ground truth vector of class labels. The loss is expected to measure the dissimilarity between $f_\omega(x_i)$ and $y_i$. Classically we use cross-entropy defined on a batch $B$ of size $N \in \mathbb{N}$ by:
$$\frac{1}{N} \sum_{(x_i, y_i) \in B} L(\omega, y_i) = -\frac{1}{N} \sum_{(x_i, y_i) \in B} \log P(Y = y_i \mid X = x_i, \omega).$$
The minimization of this loss function is usually based on gradient methods. Computing the optimal value of each parameter involves a bin-to-bin measure of similarity, which may lead to overfitting issues.
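As an illustration of the batch cross-entropy, the following minimal sketch computes the mean negative log-likelihood of the true class over a small batch. The function name, the 0-based class indices and the clipping constant `eps` are assumptions made for this example and are not part of the method description.

```python
import math
from typing import Sequence


def batch_cross_entropy(probs: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Mean negative log-likelihood of the true class over a batch.

    probs[i][c] plays the role of P(Y = c | X = x_i, w) and labels[i] is
    the true class index y_i (0-based here, an assumption of this sketch).
    """
    eps = 1e-12  # numerical guard against log(0); an assumption
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(max(p[y], eps))
    return total / len(labels)


# Hypothetical softmax outputs for a batch of two samples and three classes.
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
batch_labels = [0, 1]
loss = batch_cross_entropy(batch_probs, batch_labels)
```

The loss only looks at the probability assigned to the true class, so it vanishes when every sample receives probability one on its label.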
A solution might be to use One Versus All training.

C. FROM ONE VERSUS ALL (OVA) TO OVNNI
The current state of the art on uncertainty estimation is Deep Ensembles [43]. This technique relies on ensembling multiple DNN models trained in parallel in order to optimize the same loss. In contrast to random forests [6], or Bagging [5] the diversity arises from the fact that different embodiments of the same model will converge towards different local optima during training. Conversely, in our approach the diversity is provided by the one-versus-all (OVA) models constructed using different labelings of the training set.
Learning to detect abnormal data means that the DNNs learn to point out: ''I do not know''. The issue is that the cross-entropy loss, in combination with other heuristics for training DNNs, e.g., BatchNorm [31], many layers and residual connections [24], leads to highly overconfident DNNs [22]. Hence we must deal with an overconfident DNN that cannot handle unknown data. Our solution is to use a DNN for learning to classify each class vs all the other classes (OVA training); thanks to this training procedure, each DNN will learn to discriminate its class as well as the corresponding boundaries: this allows the classifiers to identify what they know but also what they do not know.
The OVA strategy is conceptually simple, since at its core it involves training a binary classification DNN. One classifier is trained for each class, and prediction is then performed by running the obtained binary classifiers on the testing sample and choosing the prediction with the highest confidence score. Yet, the multiple classifiers involved will learn multiple probabilistic predictions, denoted by $P(Y_j = 1 \mid X = x_i, \omega^j)$ with $Y_j$ a binary random variable for each class $j$. We add a superscript on $\omega^j$ to indicate that 1) these weights are different from the ones trained to perform the AVA classification, which we denote $\omega$, and 2) they are also different from the weights of the other classes (different from $j$).
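The OVA construction described above can be sketched as follows: each class gets its own binary relabeling of the training set, and at test time the class of the most confident binary classifier is selected. This is only a sketch; the function names and the toy values are hypothetical, and the binary classifiers themselves are not modeled.

```python
from typing import List, Sequence


def ova_labels(labels: Sequence[int], positive_class: int) -> List[int]:
    """Binary relabeling for the OVA classifier of `positive_class`:
    1 for samples of that class, 0 for samples of all other classes."""
    return [1 if y == positive_class else 0 for y in labels]


def winner_takes_all(ova_scores: Sequence[float]) -> int:
    """Index of the class whose binary classifier is the most confident."""
    return max(range(len(ova_scores)), key=lambda j: ova_scores[j])


# Toy training labels for a 3-class problem (hypothetical values).
train_labels = [0, 2, 1, 2, 0]
binary_targets = {j: ova_labels(train_labels, j) for j in range(3)}

# Hypothetical outputs P(Y_j = 1 | x) of the three binary classifiers.
predicted = winner_takes_all([0.10, 0.85, 0.30])
```

Each entry of `binary_targets` is the training target of one binary DNN, so the number of models grows linearly with the number of classes.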
By training one class versus all the other classes, the DNN learns in some sense the out of distribution classes, however with the significant advantage of not relying on explicitly provided OOD data, in contrast to other strategies [58], [67]. Thanks to this strategy, the DNN learns to better distinguish between objects from known classes and unknown objects from classes not seen during training. Yet, OVA might reduce the performances since the DNN focuses a lot on learning to discriminate every class. Therefore, in addition to the OVA base classifiers, we also perform an All versus all training that we aggregate with the probabilities of the OVA models in the following way, as shown in Figure 2.
Let us denote by $Y$ the discrete random variable that takes its value in the list of all classes, and let us denote by $Y_j$ a binary random variable that takes values 0 or 1, with $Y_j = 1$ meaning that the sample belongs to class $j$. Hence the OVA DNN of the class $j$ provides $P(Y_j = 1 \mid X = x_i, \omega^j)$ for $j$ in $\{1, \ldots, n_{label}\}$. We consider that the final confidence score for a sample $x_i$ to belong to class $j$ is:
$$p_j(x_i) = P(Y = j \mid X = x_i, \omega) \times P(Y_j = 1 \mid X = x_i, \omega^j).$$
This score is high if both AVA and OVA are confident and low otherwise. Multiplying OVA and AVA scores also helps to increase the accuracy since AVA has lower accuracy than OVA (see Figure 3). Hence we propose to use this score as a way to quantify the confidence of the DNN.
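A minimal sketch of this aggregation, assuming the AVA softmax output and the per-class OVA probabilities are already available as plain lists; the function name and the numerical values are hypothetical and only illustrate the product of AVA and OVA scores.

```python
from typing import List, Sequence


def ovnni_scores(ava_probs: Sequence[float],
                 ova_probs: Sequence[float]) -> List[float]:
    """Class-wise product of the AVA probability P(Y = j | x, w) and the
    OVA probability P(Y_j = 1 | x, w^j)."""
    if len(ava_probs) != len(ova_probs):
        raise ValueError("AVA and OVA outputs must cover the same classes")
    return [a * o for a, o in zip(ava_probs, ova_probs)]


# Hypothetical outputs for a 3-class problem.
# In-distribution sample: AVA and the OVA model of class 0 agree.
in_dist = ovnni_scores([0.90, 0.05, 0.05], [0.95, 0.02, 0.01])
# OOD-like sample: AVA is fairly confident, but every OVA model says "not mine".
ood = ovnni_scores([0.60, 0.30, 0.10], [0.10, 0.05, 0.02])
```

In the second case the product stays small for every class even though the AVA output alone would look moderately confident.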

D. UNCERTAINTY WITH OVNNI
When we optimize a DNN on a training set, it might suffer from mainly two kinds of uncertainty. The aleatoric uncertainty is linked to the data acquisition and in general it can be learned but not reduced. It can be reduced only if we have more information about this process and we can change it, e.g., add more efficient camera sensors for low-light scenes. The epistemic uncertainty is related to the lack of knowledge of the DNN. Hence, by learning to say ''I do not know'', our strategy can better model the epistemic uncertainty.
We consider that a measure of confidence must satisfy the following properties: (1) be bounded, (2) exhibit low values for OOD data, (3) have a confidence value that is aligned with the accuracy of the algorithm, (4) get more confident if additional training samples are provided. The first point ensures that we know the maximum and minimum of the confidence. The second point ensures that OOD data are detected, which is crucial since it provides information on the reliability of the DNN on a given sample. The third point is linked to the calibration [22], which is crucial to rely on the model predictions. The last point concerns the fact that we want to reduce the uncertainty when increasing the dataset.
We use as a measure of confidence for OVNNI the quantity $\max_{j \in \{1, \ldots, n_{label}\}} p_j(x_i)$. This measure, bounded by 0 and 1, tells us how much we can rely on the DNN prediction and to which extent it can be used to model the epistemic uncertainty. Indeed in most approaches, the maximum class probability (MCP) [28] is used as a simple baseline to model uncertainty. In our case, we do not strictly evaluate the MCP, since the scores $p_j(x_i)$ do not necessarily sum to one and hence do not form a probability distribution. Yet, our confidence score is directly inspired by the MCP. Other approaches make similar assumptions, for instance the evidential model from Sensoy et al. [75] where the uncertainty is quantified from belief measures.
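The confidence measure can be illustrated with a short sketch that takes a vector of per-class scores and returns the predicted class together with the maximum score. The function name and the score values are hypothetical.

```python
from typing import Sequence, Tuple


def ovnni_confidence(scores: Sequence[float]) -> Tuple[int, float]:
    """Return (predicted class, confidence), where the confidence is the
    largest per-class score."""
    best = max(range(len(scores)), key=lambda j: scores[j])
    return best, scores[best]


# Hypothetical per-class scores for a single sample.
scores = [0.06, 0.015, 0.002]
label, confidence = ovnni_confidence(scores)

# Unlike a softmax output, the scores need not sum to one.
total_mass = sum(scores)
```

Because each score is a product of two values in [0, 1], the confidence stays bounded, which matches property (1) above.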

IV. EXPERIMENTS
We continue by illustrating the performance of OVNNI for detecting OOD data by conducting five experiments. In the rest of this section we will describe the experimental protocol, followed by the five experiments. We implemented all approaches ourselves and used for all the same learning hyper-parameters per dataset, without particular tuning. Moreover, the number of ensembles is the same for all the techniques, and corresponds to the number of classes.

A. EXPERIMENTAL PROTOCOL
The detection of OOD data can be done either by techniques that measure the uncertainty, or by techniques that detect OOD data. We first compared OVNNI to three other uncertainty estimation techniques: MC Dropout [18], Deep Ensembles [43], and TRADI [17]. The major interest of these techniques comes from the fact that, since they estimate uncertainty, they also estimate the epistemic uncertainty and therefore the OOD data. We also compared our approach to two other techniques, ODIN [49] and ConfidNet [11], that serve as references in unsupervised techniques for detecting OOD data. As a baseline algorithm, we use the maximum class probability (MCP) with an AVA-trained DNN. We denote this approach as MCP. As an additional baseline we consider the one-class Support Vector Machine [59], [64], a classic method for outlier detection. We train it on AVA logits. Note that we have not compared OVNNI to techniques trained to learn OOD such as [58], [67], since in these cases the OOD data are in the training set, so that these techniques can only detect the kind of OOD data they were trained on. To balance OVA training, which typically has more samples available for the ''All'' class, we use weighted cross-entropy to train for each class, with weights for a given class based on $1 - \tau_{class}$, where $\tau_{class}$ is the proportion of data samples of this class in the training set. In addition, for a fair comparison in all experiments we use the same number of models for ensemble and Bayesian methods. We conducted several experiments in two target applications: image classification (2 experiments) and semantic pixel segmentation (3 experiments). We considered 7 evaluation measures, in addition to accuracy. Details and results are given below.
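The class re-weighting used for OVA training can be sketched as follows. This is one possible reading of the rule, in which each of the two binary classes receives the weight one minus its own proportion; the exact normalization is not specified in the text, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def ova_class_weights(labels: Sequence[int],
                      positive_class: int) -> Tuple[float, float]:
    """Return (weight_positive, weight_negative), each equal to one minus
    the proportion of the corresponding binary class in the training set."""
    tau_pos = sum(1 for y in labels if y == positive_class) / len(labels)
    tau_neg = 1.0 - tau_pos
    return 1.0 - tau_pos, 1.0 - tau_neg


def weighted_binary_cross_entropy(probs: Sequence[float],
                                  targets: Sequence[int],
                                  w_pos: float,
                                  w_neg: float) -> float:
    """Mean binary cross-entropy with one weight per binary class."""
    eps = 1e-12  # numerical guard against log(0); an assumption
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        if t == 1:
            total -= w_pos * math.log(p)
        else:
            total -= w_neg * math.log(1.0 - p)
    return total / len(targets)


# Balanced 10-class toy set: the "One" class covers 10% of the samples.
w_pos, w_neg = ova_class_weights(list(range(10)), positive_class=3)
toy_loss = weighted_binary_cross_entropy([0.5, 0.5], [1, 0], w_pos, w_neg)
```

With ten balanced classes the rare positive class is weighted about nine times more than the ''All'' class, which compensates for the imbalance of the binary task.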

1) EVALUATION MEASURES
The evaluation should focus on several points. The first one is the error/success on predicting if the DNN model has some knowledge about specific data. This involves detecting if the data is OOD or not. For that, we use three solutions proposed in [28]. We first only used the confidence score of the OOD data and of the in distribution test data. Based on these confidence scores, and as in [26], [28], we evaluated the AUC, AUPR and the FPR-95%-TPR, that are indicators of the accuracy of detecting OOD data.
However, these measures give no information about the number of good predictions (that should be high) and of bad predictions (that should be low).
This information is crucial since, although it is important to have a low score on the OOD data, the DNN should also reach a high confidence score for well-classified data, and low confidence scores elsewhere. If the DNN does not achieve this, it might be unusable.
For that, the authors in [11] propose to use metrics similar to the one used by Hendrycks et al. [28] but rather than classifying into classes ''OOD'' or ''In distribution'', they classify as ''correctly classified'' or ''not correctly classified'' (this latter class contains both bad predictions and predictions on OOD data, see [11] for more details).
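To make the detection metrics concrete, the sketch below computes the AUC (area under the ROC curve) and the FPR-95%-TPR from two lists of confidence scores. The positive group can be the in-distribution samples, as in [28], or the correctly classified samples, as in [11]. Treating these groups as the positives is a convention assumed here, AUPR is omitted for brevity, and the function names and score values are hypothetical.

```python
import math
from typing import Sequence


def auroc(positive_scores: Sequence[float],
          negative_scores: Sequence[float]) -> float:
    """Probability that a positive sample receives a higher confidence
    than a negative one (ties count one half)."""
    wins = 0.0
    for s_pos in positive_scores:
        for s_neg in negative_scores:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def fpr_at_95_tpr(positive_scores: Sequence[float],
                  negative_scores: Sequence[float]) -> float:
    """Fraction of negatives accepted by the highest threshold that still
    keeps at least 95% of the positives."""
    ranked = sorted(positive_scores, reverse=True)
    k = math.ceil(0.95 * len(ranked))  # number of positives to keep
    threshold = ranked[k - 1]
    accepted = sum(1 for s in negative_scores if s >= threshold)
    return accepted / len(negative_scores)


# Hypothetical confidence scores.
in_dist_scores = [0.9, 0.8, 0.95, 0.7]
ood_scores = [0.2, 0.75, 0.1, 0.3]
auc_value = auroc(in_dist_scores, ood_scores)
fpr_value = fpr_at_95_tpr(in_dist_scores, ood_scores)
```

The pairwise AUC computation is quadratic in the number of samples; it is meant for illustration, and a sort-based implementation would be preferable on full test sets.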
We also used the Expected Calibration Error (ECE) [22], which uses the M-bin histograms of confidence scores and accuracy. The ECE performs a bin-to-bin difference between the two histograms, then an average over the M bins. Similarly to [22] we set M = 15. This metric, by measuring the difference between the expected accuracy and confidence, is an indicator of the quality of the confidence, and should be close to 0.
To better understand the behavior of our DNNs when facing strong shifts in the input data distribution, we propose to evaluate the Corrupted Accuracy (cA) and the Corrupted Expected Calibration Error (cE) for CIFAR10 [39].
For this scenario, we use CIFAR-10-C [27], where the authors generated various types of noise with different levels of intensity.
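The ECE computation can be sketched as below. The sketch uses equal-width confidence bins and weights each bin by its share of the samples, which is the usual definition in [22]; the function name and the toy inputs are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 15) -> float:
    """Weighted average over bins of |accuracy - mean confidence|."""
    n = len(confidences)
    bin_conf = [0.0] * n_bins
    bin_acc = [0.0] * n_bins
    bin_count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of 1.0 goes to the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        bin_conf[b] += c
        bin_acc[b] += 1.0 if ok else 0.0
        bin_count[b] += 1
    ece = 0.0
    for b in range(n_bins):
        if bin_count[b] > 0:
            accuracy = bin_acc[b] / bin_count[b]
            mean_conf = bin_conf[b] / bin_count[b]
            ece += (bin_count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy example: four predictions at confidence 0.8, three of them correct.
ece_value = expected_calibration_error([0.8, 0.8, 0.8, 0.8],
                                       [True, True, True, False])
```

In the toy example the accuracy is 0.75 for a mean confidence of 0.8, so the model is slightly overconfident and the ECE equals the gap between the two.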

2) OOD CLASSIFICATION WITH MNIST [46]
Concerning the classification, in a first experiment we used MNIST [46], a dataset composed of digit images, as the training dataset and NotMNIST [1], which contains letter images, as the OOD dataset. We first trained a classifier to recognize the images of digits, then tested it on the test sets of MNIST and NotMNIST, expecting the classifier to distinguish digits from letters. The DNN used for this experiment is fully connected and composed of 3 layers as in [17], [43]. Results are shown in Tables 1 and 2 (MNIST rows).

3) OOD CLASSIFICATION WITH CIFAR10 [39]
We also trained a network on CIFAR10, which is composed of the classes airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. We considered the SVHN dataset [62] as OOD data. Many methods [56] train on CIFAR10 and test on the test set of CIFAR10 with noise or on STL-10 [10]. It turns out that the first test aims more at measuring random uncertainty and the second one the capability to adapt to the domain. We preferred to use SVHN as the OOD dataset: it is a color image dataset of digits, which guarantees that the OOD data really come from a distribution different from that of CIFAR10. The DNN we used in this experiment is Resnet50 [24], which has the advantage of being popular in the community. Results are shown in Tables 1 and 2 (CIFAR10 rows). Table 3 reports our results under dataset shift. Our approach reaches state-of-the-art accuracy and is close to the state of the art in terms of ECE. Figure 5 shows that OVA is not adapted to aleatoric uncertainty while OVNNI maintains good performance. Hence, we have shown that our approach is robust to aleatoric uncertainty.

4) OOD SEGMENTATION WITH CAMVID [8]
We used Camvid, a dataset conventionally used in works dealing with segmentation or uncertainty theory and deep learning [11], [17], [33]. This dataset is an ''easy'' dataset but allows quickly validating results. To test the ability of OVNNI to detect OOD pixels, we trained on all Camvid classes except 3 classes (pedestrian, bicycle, and car), that we deleted, by marking the corresponding pixels as unlabeled. These three classes correspond to OOD classes. Thus this experimental protocol, proposed in [17], makes it possible to validate that the trained DNN will detect the pixels on which it has not been trained as OOD. The DNN for this experiment is Enet [68]. Results are shown in Tables 1 and 2 (Camvid rows).
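The labeling protocol above can be sketched on toy label maps: pixels of the held-out classes are replaced by an unlabeled index for training, and the same classes define the ground-truth OOD mask at test time. The ignore value 255, the class indices and the function names are assumptions made for this example.

```python
from typing import List, Sequence, Set

UNLABELED = 255  # hypothetical "ignore" index; the actual value is an assumption


def hide_classes(label_map: Sequence[Sequence[int]],
                 held_out: Set[int]) -> List[List[int]]:
    """Mark every pixel of a held-out class as unlabeled for training."""
    return [[UNLABELED if c in held_out else c for c in row]
            for row in label_map]


def ood_mask(label_map: Sequence[Sequence[int]],
             held_out: Set[int]) -> List[List[bool]]:
    """Ground-truth OOD mask at test time: True on held-out class pixels."""
    return [[c in held_out for c in row] for row in label_map]


# Toy 2x3 label map with hypothetical class indices; classes 2 and 3 are held out.
toy_labels = [[0, 1, 2], [2, 3, 1]]
train_labels = hide_classes(toy_labels, {2, 3})
test_mask = ood_mask(toy_labels, {2, 3})
```

Training with the masked labels means the network never receives a supervised signal for the held-out classes, so they behave as OOD content at test time.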

5) OOD SEGMENTATION WITH StreetHazards [26]
StreetHazards is a large-scale dataset that contains different sets of synthetic images of street scenes. More precisely, this dataset is composed of 5125 images for training and 1500 test images. The training dataset contains 13 classes and the test dataset is composed of the 13 training classes and 250 OOD classes, making it possible to test the robustness of the algorithms with all possible scenarios. For this experiment we used PSPnet [85] with the experimental protocol in [26]. The architecture used for the PSPnet is ResNet50. Results are shown in Tables 1 and 2 (StreetHazards rows).

6) OOD SEGMENTATION WITH BDD ANOMALY [26]
BDD Anomaly dataset is a subset of BDD dataset, composed of 6688 street scenes for the training set and 361 for the testing set. The training set contains 17 classes, and the test dataset is composed of the 17 training classes and 2 OOD classes. For this experiment we used PSPnet [85] with the experimental protocol in [26]. The architecture used for the PSPnet is ResNet50. Results are shown in Tables 1 and 2 (BDD Anomaly rows).

B. VISUALIZING OVA AND AVA EMBEDDING
In this subsection, we perform two experiments to determine the behavior of the representations learned by the DNN with the different techniques. For both experiments we train a simple DNN composed of 3 hidden layers followed by a batch normalization on MNIST dataset [46].
In the first experiment, we have considered as training data only the images of the digits '0', '1' and '2' (the first 3 classes). Then we perform inference on the official test set composed of images with these classes and the OOD images which are composed of other classes. We represent in Figure 8 the softmax of a classical AVA training, a deep ensemble training and the OVNNI training. We can see that in contrast to other techniques, OVNNI results do not necessarily belong to the 2-dimensional simplex. In addition, OVNNI brings the OOD data far away from the simplex vertices which highlights its potential to detect OOD data.
Results of OVNNI on BDD Anomaly. The first column is the input image, the second is the ground truth, the third is prediction and the fifth is the confidence score of OVNNI. For comparison, we add the MCP confidence score in the fourth column. We can see that OVNNI has a low score on the motorcycle on the three first rows and on the train on the last row which correspond to the OOD classes.

FIGURE 7.
Results of OVNNI on StreetHazards. The first column is the input image, the second is the ground truth, the third is prediction and the last is the confidence score of OVNNI. For comparison, we add the MCP confidence score in the fourth column. We can see that OVNNI has a low score on the chair, the seat, the rocket and the spider which correspond to the OOD classes.
In the second experiment, we performed a classical AVA training, and we also performed the OVA training. Hence for the OVA training, we have 10 DNNs (since the dataset has 10 classes which are the 10 digits). The OOD class is composed of images of the NotMNIST dataset [1]. We apply the DNNs on this test dataset. In the AVA case, we collect for each sample the features of the DNN just before the classification layer. In the OVA case, we collect the same features but from the DNN of the predicted class.
We reduce the dimension of each of these feature spaces using t-SNE [55] and Principal Component Analysis (PCA) [82], and we plot the results in Figure 9. In the AVA case, the OOD data lie in the center of Figure 9, mixed with the other classes, whereas in the OVA case they lie closer to the border, regardless of the dimensionality reduction algorithm used. This matters because it suggests that OVA learns a descriptor that separates OOD data from the training classes better than AVA does.

D. DISCUSSIONS
On MNIST we can see in Tables 1 and 2 that OVNNI has competitive results for detecting OOD data; more specifically, its calibration score (ECE) is the best. With respect to the metrics proposed by Hendrycks et al., OVNNI is the most effective in detecting OOD images, improving the best AUC by 1.4%, the best AUPR by 0.6% and the best FP by 62.0%.
On CIFAR10, Deep Ensembles also achieve good results on all the measurements except the ECE, for which OVNNI is better calibrated. This can also be seen in the histogram in Figure 4. The difference between OVNNI and Deep Ensembles on the other measurements is small, and good calibration is a crucial requirement for a DNN. Hence, we consider good calibration more important than a good AUC or AUPR. We have also represented the accuracy vs confidence curves in Figure 9. These curves are defined in [43] and are constructed by evaluating the accuracy on all data for which the confidence of the DNN reaches a given threshold. They show the performance of the OVNNI confidence index on CIFAR10. Finally, we have illustrated the OVNNI calibration on CIFAR10 with the calibration curve in Figure 9. The calibration plot is defined in [22] and is constructed by grouping the data into bins according to their confidence score. On each bin, we then evaluate the accuracy, which should ideally be comparable to the confidence score. These curves show once again the good performance of OVNNI in terms of calibration.
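As an illustration of the binning procedure behind the calibration plot, the following minimal Python sketch computes an ECE value from confidence scores and correctness flags. It assumes equal-width confidence bins and the usual form of the ECE as the bin-weighted gap between accuracy and mean confidence. The function name, the number of bins and the toy data are illustrative and are not taken from the experiments reported here.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences : confidence score of each prediction, in [0, 1]
    correct     : True/False flag, True if the prediction is right
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * num_bins), num_bins - 1)
        bins[idx].append((c, ok))

    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins do not contribute
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        # Weight the accuracy/confidence gap by the bin's share of the data.
        ece += (len(b) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data: two confident correct predictions, two uncertain ones.
toy_ece = expected_calibration_error(
    [0.95, 0.95, 0.55, 0.55], [True, True, True, False]
)
```

A perfectly calibrated model would give an accuracy equal to the mean confidence in every bin, hence an ECE of 0.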
On Camvid we note that OVNNI improves the results of the state of the art by up to 77% with regard to the metrics proposed by Corbière et al. [11], and by up to 77% for calibration as well. Concerning the metrics proposed by Hendrycks et al., OVNNI improves the measurements by a maximum of 22%.
On StreetHazards we show in Table 2 that OVNNI outperforms the state of the art, improving the best results by up to 42.8%. In Table 1, OVNNI improves the results by at least 2.6% and improves the state-of-the-art ECE by 2%. These results show the interest of using OVNNI for semantic segmentation.
Finally, on BDD Anomaly OVNNI improves the calibration by at least 48%, which is highly relevant given the importance of this metric. Concerning the other metrics, OVNNI improves the results by at least 22%. In addition, we have illustrated the accuracy vs confidence curves of several algorithms in Figure 4. These curves underline again that OVNNI reaches the best performance in terms of calibration.
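The accuracy vs confidence curves used in this section can be sketched in a few lines of Python. For each threshold, the sketch keeps the samples whose confidence reaches the threshold and reports the accuracy on them. The function name, the thresholds and the toy data are hypothetical, and returning `None` when no sample is kept is a choice made for this illustration.

```python
def accuracy_vs_confidence(confidences, correct, thresholds):
    """Return a list of (threshold, accuracy) points.

    The accuracy is computed only on samples whose confidence is
    greater than or equal to the threshold. It is None when no
    sample reaches the threshold.
    """
    curve = []
    for t in thresholds:
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        accuracy = sum(1 for ok in kept if ok) / len(kept) if kept else None
        curve.append((t, accuracy))
    return curve


# Toy data: the two most confident predictions are the correct ones.
curve = accuracy_vs_confidence(
    confidences=[0.9, 0.8, 0.6, 0.3],
    correct=[True, True, False, False],
    thresholds=[0.0, 0.5, 0.85, 0.95],
)
```

For a useful confidence score, the accuracy should increase as the threshold increases, as it does on this toy data.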
Overall, these results show that OVNNI improves the calibration of networks by bringing the confidence in their results more in line with their expected accuracy. Making DNN models more reliable is crucial, especially in areas where the model should not be overconfident. In [22] the authors show that the good accuracy of DNNs comes at a price, namely their reliability. In this work, we propose a solution that increases accuracy in most cases, while at the same time improving the calibration and the OOD detection performance. The conceptual simplicity of this solution is a significant asset for its adoption, and the results also convey the message that one vs all training can still be of interest for a finer understanding of epistemic uncertainty in DNNs.
Ensembles of OVAs and OVNNI act like Deep Ensembles, i.e., they discover and explore multiple modes [16]. Just like Deep Ensembles, they benefit from the multiple modes provided by each model, leading to better calibrated predictions [16]. However, Deep Ensembles can still become overconfident as they follow modern training heuristics [20]. In OVNNI, weighting the OVAs with the AVA softmax generally leads to less confident predictions and improves calibration (the plots in Figure 4 confirm this quantitatively). This effect is similar to temperature scaling for calibration [20]. However, we do not need an additional validation set to tune this factor. Each OVA acts as a single-class classifier for OOD detection and also learns to perform classification.
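The weighting of the OVAs by the AVA softmax can be illustrated with the following minimal Python sketch. It assumes that the weighting is an element-wise product between the AVA softmax and the sigmoid output of each OVA model; the function names and the toy logits are hypothetical. The sketch shows why the resulting scores need not sum to one and why they stay low when no OVA model recognizes the input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ovnni_scores(ava_logits, ova_logits):
    """Weight each OVA probability by the AVA softmax (assumed product).

    ava_logits : logits of the single AVA network, one per class
    ova_logits : one logit per OVA network, in the same class order
    """
    ava = softmax(ava_logits)
    ova = [sigmoid(z) for z in ova_logits]
    return [a * o for a, o in zip(ava, ova)]


# In-distribution-like case: the OVA model of class 0 agrees with AVA.
in_dist = ovnni_scores([2.0, 0.0, 0.0], [4.0, -4.0, -4.0])
# OOD-like case: AVA is still confident, but every OVA model says "no".
ood = ovnni_scores([2.0, 0.0, 0.0], [-3.0, -3.0, -3.0])
```

In both cases each score is bounded above by the corresponding AVA softmax value, so the combined confidence can only decrease, and the scores sum to less than one.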

E. LIMITATIONS OF THE APPROACH AND PERSPECTIVES
The OVA strategy is conceptually simple and straightforward to adopt and implement in most cases. The main limitation of OVA is related to cases with numerous classes. For popular datasets and tasks involving up to 10-15 classes, the computational cost is on par with ensembling approaches, for which this is a typical ensemble size [43]. However, beyond this number of classes, the approach is less appealing due to the increasing training cost. We note that in several practical settings, e.g., perception for driving assistance [9], [23], [36], [77], the number of classes is often low (less than 10), in order to avoid ambiguity and class imbalance, which are frequent drawbacks of high granularity datasets. For tasks that do not allow for a low number of classes, we indicate a few potential strategies to render OVNNI more feasible. We can consider meta-classes that group multiple classes from the training set according to visual or semantic relatedness. The taxonomy of a number of datasets is derived from an ontology (usually hierarchical), e.g., Imagenet [13], and we can use this criterion for grouping classes. Alternatively, such meta-classes can be learned [19], [86], as considered in other related approaches, e.g., Error Correcting Output Codes rely on OVA and even on OVO classifiers [42], [76]. In this work, such strategies were not necessary, as the computational cost is tractable on the considered datasets. However, we plan to explore them in future work.
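The meta-class strategy can be sketched as follows in Python. The sketch maps fine-grained labels to meta-classes and builds the binary targets of one OVA model per meta-class, so that the number of OVA models equals the number of meta-classes rather than the number of original classes. The class names, the grouping and the function names are hypothetical and only serve as an illustration; this strategy was not evaluated in this work.

```python
def to_meta_labels(labels, class_to_meta):
    """Map each fine-grained label to its meta-class."""
    return [class_to_meta[y] for y in labels]


def ova_targets_for(meta_labels, meta_class):
    """Binary targets of the OVA model dedicated to one meta-class."""
    return [1 if m == meta_class else 0 for m in meta_labels]


# Hypothetical grouping by semantic relatedness: 5 classes, 2 meta-classes.
CLASS_TO_META = {
    "cat": "animal",
    "dog": "animal",
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
}

labels = ["dog", "car", "cat", "bus"]
meta_labels = to_meta_labels(labels, CLASS_TO_META)
meta_classes = sorted(set(CLASS_TO_META.values()))

# One OVA target vector per meta-class instead of one per original class.
targets = {m: ova_targets_for(meta_labels, m) for m in meta_classes}
```

Here 2 OVA models would be trained instead of 5, at the price of a coarser class granularity for the OVA scores.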

V. CONCLUSION
In this work, we presented an approach that combines one versus all training with a modern deep learning approach. We show that this combination reaches state of the art performance on all segmentation experiments. Regarding classification tasks, OVNNI exhibits the best calibration performance. Competing approaches suffer from a lack of calibration performance on most datasets; hence the scores that they provide are overconfident, potentially leading to dangerous scenarios in critical applications. In addition to the reported performance, our approach needs little hyperparameter tuning and is easy to implement.
Future work involves first extending this strategy to new tasks such as medical image analysis. One could also use this framework for active learning, since active learning algorithms require techniques that can detect OOD data.

SÉVERINE DUBUISSON received the master's and Ph.D. degrees in system control from the University of Technology of Compiègne, in 1997 and 2000, respectively. She has been an Associate Professor with Sorbonne University. She is currently an Associate Professor with Aix-Marseille University, France. Her research interests include computer vision, visual tracking, probabilistic models for video sequence analysis, and human interaction.
ISABELLE BLOCH graduated from the Ecole des Mines de Paris, Paris, France, in 1986. She received the master's degree from the University Paris 12, Paris, in 1987, the Ph.D. degree from the Ecole Nationale Supérieure des Télécommunications (Télécom Paris), Paris, in 1990, and the Habilitation degree from the University Paris 5, Paris, in 1995. She has been a Professor at Télécom Paris, until 2020. She is currently a Professor at Sorbonne Université. Her current research interests include 3-D image understanding, computer vision, mathematical morphology, information fusion, fuzzy set theory, structural, graph-based, and knowledge-based object recognition, spatial reasoning, artificial intelligence, and medical imaging.