Reducing Overconfidence Predictions in Autonomous Driving Perception

In state-of-the-art deep learning for object recognition, Softmax and Sigmoid layers are most commonly employed as the predictor outputs. Such layers often produce overconfident predictions rather than proper probabilistic scores, which can harm the decision-making of ‘critical’ perception systems applied in autonomous driving and robotics. Given this, we propose a probabilistic approach based on distributions calculated from the Logit layer scores of pre-trained networks, which are then used to constitute new decision layers based on Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) inference. We demonstrate that the resulting ML and MAP layers are more suitable for probabilistic interpretations than Softmax and Sigmoid-based predictions for object recognition. We explore distinct sensor modalities via RGB images and LiDAR (RV: range-view) data from the KITTI and Lyft Level-5 datasets, where our approach shows promising performance compared to the usual Softmax and Sigmoid layers, with the benefit of enabling interpretable probabilistic predictions. Another advantage of the approach introduced in this paper is that the ML and MAP layers can be implemented in existing trained networks, that is, the approach benefits from the output of the Logit layer of pre-trained networks. Thus, there is no need to carry out a new training phase, since the ML and MAP layers are used in the test/prediction phase. Classification results are presented using reliability diagrams, while detection results are illustrated using precision-recall curves.


I. INTRODUCTION
Recent advances in deep learning and sensory technology (e.g., RGB cameras, LiDAR, radar, stereo, RGB-D, among others [1]-[5]) have made remarkable contributions to perception systems applied to autonomous driving [6]-[9]. Perception systems include, but are not limited to, image and point cloud-based classification and detection [8], [10]-[13], semantic segmentation [6], [14], [15], and tracking [16], [17]. Oftentimes, regardless of the type of network architecture or input modalities, most state-of-the-art CNN-based object recognition algorithms output normalized prediction scores via the Softmax layer [18], i.e., the prediction values are in the range [0, 1], as shown in Fig. 1. Furthermore, such algorithms are often implemented through deterministic neural networks, and the prediction itself does not consider the model's actual confidence for the predicted class in decision-making [19]. In fact, in most cases, the decision-making takes into account only the prediction value provided directly by a deep learning algorithm, disregarding a proper level of confidence of the prediction (which is unavailable for most networks). Therefore, evaluating the prediction confidence or uncertainty is crucial in decision-making because an erroneous decision can lead to disaster, especially in autonomous driving, where the safety of human lives is dependent on the automation algorithms.
Many works have pointed out Softmax layer overconfidence as an open issue in the field of deep learning [20]-[23]. Two main techniques have been suggested to mitigate the overconfidence in deep networks: calibration [24]-[30] and regularization [27], [28], [31]. Often, calibration techniques are defined as those that act directly on the resulting output of the network, while regularization techniques are those that aim to penalize network weights through a variety of methods, which add parameters or terms directly to the network cost/loss function [31]-[33]. However, [34] defines regularization techniques as a type of calibration. Because regularization modifies the cost/loss function, it demands that the network be retrained.
The overconfidence problem is more evident in complex networks such as Convolutional Neural Networks (CNNs), particularly when using the Softmax layer as the prediction layer, thus generating ill-distributed outputs, i.e., values close to either zero or one [26], which can be observed in Fig. 1a and Fig. 1b. We note that this is desirable when the true positives have higher scores. However, the counterpart problem is that 'overconfident networks' also generate high-score values for the objects erroneously detected or classified, i.e., false positives. Given this problem, a question arises: how can we guarantee prediction values that are 'high' for true positives and, at the same time, 'low' for false positives? This question can be answered by analyzing the output of the network's Logit layer, which provides a smoother output than the Softmax layer. This can be observed in Figs. 2a and 2b.
Following this, we can pose a new question: although normalized outputs aim to guarantee a 'probabilistic interpretation', how reliable are these predictions? Additionally, given an object belonging to a non-trained/unseen class (e.g., an unexpected object on the road), how confident is the model's prediction? These are the key research questions explored in this work, which considers the importance of having models grounded on interpretable probability assumptions to enable adequate interpretation of the outputs, ultimately leading to more reliable predictions and decisions. In terms of contributions, this paper introduces new prediction layers, designated Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) layers, for deep neural networks, which provide a more adequate solution compared to state-of-the-art (Softmax or Sigmoid) prediction layers. Both ML and MAP layers compute a single estimate, rather than a distribution. Moreover, this work contributes towards the advances of multi-sensor perception (RGB and LiDAR modalities) for autonomous perception systems [35]-[37] by proposing a probability-grounded solution that is practical in the sense that it can be used in existing (i.e., pre-trained) state-of-the-art models such as Yolo [38].
It is important to emphasize that there is no need to retrain the neural networks when the approach described in this article is employed, because the ML and MAP prediction layers produce outputs based on PDFs obtained from the Logits of already trained networks. Therefore, instead of using the traditional prediction layers (Softmax or Sigmoid) to predict the object scores on a test set, the ML and MAP nonlinearities can be used to make the predictions for the object scores. Thus, the technique proposed in this paper is practical given that a network has already been trained with Softmax (SM) or Sigmoid (SG) prediction layers. In other words, the ML and MAP layers depend on the Logit outputs of the already trained network.
In summary, the scientific contributions arising from this work are:
• An investigation of the distribution of predicted values of the Logit and Softmax layers, for both calibrated and non-calibrated networks;
• An analysis of the predicted probabilities inferred by the proposed ML and MAP formulations, both for object classification and detection;
• An investigation of the predicted score values on out-of-training distribution test data (unseen/non-trained class);
• An approach that does not require the retraining of networks;
• Experimental validation of the proposed methodology through different modalities, RGB and Range-View (3D point clouds-LiDAR), for classification (using InceptionV3) and object detection (using YoloV4).
In this paper, we report on object recognition results showing that the Softmax and Sigmoid prediction layers do indeed sometimes induce erroneous decision-making, which can be critical in autonomous driving. This is particularly evident when 'unseen' samples, i.e., out-of-training distribution test data, are presented to the network. On the other hand, the approach described here is able to mitigate such problems during the testing stage (prediction).
The rest of this article is structured as follows. The related work is presented in Section II, while the proposed methodology is developed in Section III. The experimental part and the results are reported in Section IV, the conclusion is given in Section V, Section VI presents ideas to expand the proposed research, and finally Section VII (Appendix) presents results from an extra experiment.
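The saturation effect described above can be illustrated with a minimal sketch. The logit values below are hypothetical (they are not taken from the paper's figures), and the helper `softmax` is defined here only for illustration; the point is simply that a larger spread between logits pushes the Softmax output towards zero or one.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logit scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes (e.g., pedestrian, car, cyclist).
moderate = [2.0, 1.0, 0.5]   # small spread between logits
large = [12.0, 4.0, 1.0]     # large spread between logits

for logits in (moderate, large):
    probs = softmax(logits)
    print(logits, "->", [round(p, 4) for p in probs])
```

With the small spread the top score stays well below one, whereas with the large spread it is almost exactly one, even though the logits themselves remain smoothly distributed real numbers.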

II. RELATED WORK
In this section, we review the key methodologies related to our proposed approach. We briefly discuss the uncertainties of neural networks based on the concepts of Bayesian inference, and define the types of uncertainties that can be captured by Bayesian Neural Networks (BNNs). Then, techniques for reducing the overconfidence of prediction layers are presented, in particular regularization and calibration techniques.

A. Predictive Uncertainty
Many deep learning methods used for perception systems (object detection and recognition) do not capture the network uncertainties at training and test times. The Bayesian Neural Network (BNN) is an alternative to cope with uncertainties, and it can be carried out through distinct approaches. One way is to obtain the posterior distribution using variational inference after defining a prior distribution over the network weights [32], [41], [42]. Another method is the ensemble of multiple networks with the same architecture and different training sets for estimating predictive uncertainty [43].
Currently, many studies consider aleatory and epistemic uncertainties obtained through BNNs. Aleatory uncertainty is related to the inherent noise of observations (uncertainties arising from inherent sensor noise and associated with the distance of the object to be detected, as well as object occlusion), while epistemic uncertainty explains the uncertainties in the model parameters (uncertainties of the model associated with the detection accuracy, showing the limitations of the model) [44]. The formulation of aleatory and epistemic uncertainties with the aim of presenting the confidence of predictions, which can capture the uncertainties in object recognition, can be done through BNNs, Shannon Entropy (uncertainty in the prediction output) and Mutual Information (confidence of the model in the output) to measure the uncertainty of the classification scores [45]-[47].
The uncertainty of a prediction can also be estimated through the Monte Carlo dropout strategy, which keeps the dropout layers active at test time. The predicted values then depend on the connections between neurons that are randomly kept according to the dropout rate, so the same test example (an object) forwarded several times through the network can yield different predicted values (the predictions are not deterministic). In this way, it is possible to obtain, for each example, the distribution, the average (final predicted value) and the variance (uncertainty) [48].
Differing from the aforementioned works, the approach proposed in this paper uses data obtained from the Logit layer of already trained/existing networks to employ the concepts of Bayesian inference. The methodology proposed in this paper defines a final prediction value for each object and does not need to predict repeatedly for the same object. Furthermore, the approach presented in the paper does not consider the distribution of the network weights, and thus it is an efficient and practical approach. These advantages are clear when compared to traditional Bayesian neural networks and the Monte Carlo dropout strategy, because the strategy presented here avoids a high computational cost while preserving the recognition/detection performance. Nevertheless, there is ongoing research on Bayesian neural networks that has reduced the computational cost through feature decomposition and memorization [49].
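As a rough illustration of the Monte Carlo dropout idea, the sketch below applies dropout to the inputs of a single toy linear layer and repeats the forward pass several times. Everything here is an assumption made for illustration (the function name `mc_dropout_predict`, the toy weights, the use of inverted dropout on the input features); it is not the procedure of [48] nor the method proposed in this paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mc_dropout_predict(features, weights, drop_rate=0.5, passes=100, seed=0):
    """Repeat stochastic forward passes; return per-class mean and variance."""
    rng = random.Random(seed)
    keep = 1.0 - drop_rate
    runs = []
    for _ in range(passes):
        # Sample a fresh dropout mask and rescale the kept units (inverted dropout).
        masked = [f / keep if rng.random() < keep else 0.0 for f in features]
        logits = [sum(w * f for w, f in zip(row, masked)) for row in weights]
        runs.append(softmax(logits))
    n_classes = len(weights)
    mean = [sum(r[k] for r in runs) / passes for k in range(n_classes)]
    var = [sum((r[k] - mean[k]) ** 2 for r in runs) / passes
           for k in range(n_classes)]
    return mean, var

# Hypothetical 4-dimensional feature vector and 3-class weight matrix.
features = [1.0, 0.5, -0.3, 2.0]
weights = [[0.8, 0.1, 0.0, 0.9],
           [0.2, 0.7, 0.5, -0.4],
           [-0.5, 0.3, 0.9, 0.1]]
mean, var = mc_dropout_predict(features, weights)
print("mean scores:", [round(m, 3) for m in mean])
print("variances:  ", [round(v, 4) for v in var])
```

The cost of this strategy is visible in the loop: every object requires `passes` forward evaluations, which is the overhead that a single-pass prediction layer avoids.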

B. Regularization and Calibration
Another important component for the improvement of the predicted values is regularization, which avoids overfitting and helps to reduce overconfident predictions. Examples include the transformation of network weights using L1 and L2 [50] regularization, label and model regularization by a process of pseudo-labeling and self-training [33], label smoothing [51], knowledge distillation [52], architecture development where the network has to determine whether or not an example belongs to the training set, and specific mathematical formulations of the cost function [53], [54]. Other well-known regularization techniques are Batch Normalization [55] and stochastic regularization techniques such as Dropout [56], multiplicative Gaussian noise [57], and dropConnect [58].
Alternatively, highly confident predictions can often be mitigated by calibration techniques such as temperature scaling ($TS$) [26], which multiplies all the values of the logit vector by a scalar parameter $1/TS > 0$ for all classes, where the value of $TS$ is obtained by minimizing the negative log-likelihood on the validation set; Isotonic Regression [59], which combines binary probability estimates of multiple classes, thus jointly optimizing the bin boundary and bin predictions; Platt Scaling [60], which uses classifier predictions as features for a logistic regression model; Beta Calibration [61], which uses a parametric formulation that considers the Beta probability density function; the compositional method (parametric and nonparametric approaches) [62]; and the embedding complementary networks technique [63], [64].
In this study, we reduce highly confident predictions on the test set by replacing the values predicted by the Softmax and Sigmoid layers with the values predicted by the ML and MAP nonlinearities, obtaining a smoother score distribution for new objects. Such functions depend on the output of the network's Logit layer, by means of parametric (Gaussian functions) and nonparametric (normalized histograms) modeling. This is a post-training operation, that is, the inference functions proposed in this work do not modify the weights or the cost function of the network and still provide very satisfactory results. This is an advantage over regularization techniques, since the ML and MAP layers do not require network retraining. The advantage of the approach proposed in this paper with respect to calibration techniques is that it provides a smoother distribution of the predicted values without degrading the results.
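Temperature scaling, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names (`temperature_scale`, `fit_temperature`), the grid of candidate temperatures, and the toy validation logits are all hypothetical, and a simple grid search stands in for whatever optimizer is used in [26].

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def temperature_scale(logits, ts):
    """Multiply every logit by 1/ts (ts > 0) before applying softmax."""
    if ts <= 0:
        raise ValueError("temperature must be positive")
    return softmax([z / ts for z in logits])

def negative_log_likelihood(val_logits, val_labels, ts):
    return -sum(math.log(temperature_scale(z, ts)[y])
                for z, y in zip(val_logits, val_labels))

def fit_temperature(val_logits, val_labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates,
               key=lambda ts: negative_log_likelihood(val_logits, val_labels, ts))

# Hypothetical validation set: one of four confident predictions is wrong.
val_logits = [[8.0, 0.0, 0.0], [8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [8.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 0]
best_ts = fit_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 4.0, 8.0])
print("selected temperature:", best_ts)
print("scaled scores:", [round(p, 3) for p in temperature_scale(val_logits[0], best_ts)])
```

Because a single scalar divides all logits, the ranking of the classes is unchanged; only the sharpness of the score distribution is adjusted.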

III. PROPOSED METHOD
This section presents the core of the proposed methodology, i.e., the formulations for making predictions based on the novel ML and MAP prediction layers. The development of the methodology begins with the concepts of probabilities, random variables, distribution functions, probability density functions and Bayes' theorem, i.e., the background needed to develop the methodology proposed in this paper. In the second stage, we present the proposed method through formulations of the Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) layers, as well as nonparametric and parametric mathematical modeling to define the posterior (likelihood-conditional) and prior probabilities. Finally, we present the network architectures, the diagrams for evaluating the calibration of the proposed methodology, and the datasets that have been used in the experiments.

A. A Brief Review of Probability and Density Functions
The output scores $x = \{x_1, \ldots, x_{nc}\}$ of a supervised classification system with $nc$ classes, $c = \{c_1, \ldots, c_{nc}\}$, can be formulated according to a random experiment considering a sample space $S$. The numerical outcome obtained from each element of $S$ is related to a real number defined by the random variable (rv) $x$, i.e., the output scores, which is conditioned on the rv $c$. Formally, the rv is a function that maps each element of the sample space to a real number of the set $\mathbb{R}$, which can be simply expressed as $x : S \to \mathbb{R}$. In other words, an rv is a function $x$ that outputs a real number $x(\zeta)$ for each element $\zeta \in S$ of a random experiment. From the sample space, an event (subset of $S$) can be defined and associated with a probability $P$ over the interval $[\xi, \xi + \Delta\xi]$. Such a probability is given by a distribution function, and its derivative is the probability density function (PDF) $f_x(x = \xi)$, as in (1) [65],
where $f_x(x = \xi) \geq 0 \ \forall\, \xi$, considering $\xi$ continuous. The integral of (1) represents the probability $P$ that the random variable $x$ is contained in the interval. Consequently, if the interval $[\xi, \xi + \Delta\xi]$ is sufficiently small, the probability will be $P\{\xi \leq x \leq \xi + \Delta\xi\} \approx f_x(x = \xi)\Delta\xi$, i.e., the probability of the random variable $x$ is proportional to $f_x(x = \xi)$. Thus, the probability is maximum when the interval $[\xi, \xi + \Delta\xi]$ contains the value at which $f_x(x = \xi)$ is maximum. Such a value is the most likely value of $x$.
Given the most likely value of the random variable $x$, Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) inferences can be obtained. However, the random variable $x$ depends on the variable $c$ in the formulation of ML and MAP. Therefore, the density function is conditional on $c$ [65], as formulated in (2). If the random variable is discrete, a probability mass function (PMF) is used instead of a probability density function (PDF). Assuming that the class-conditional probability $P(x|c)$ (likelihood) and the prior are known, the posterior probability $P(c|x)$ can be obtained through Bayes' rule

$$P(c|x) = \frac{P(x|c)\,P(c)}{P(x)}, \tag{3}$$

where $P(c)$ is the prior probability and $P(x) \neq 0$ is the marginal probability defined by $\int P(x|c)P(c)\,dc$, which can often be determined by the law of total probability [66]. Thus, (3) can be re-written using the per-class expression:

$$P(c_i|x) = \frac{P(x|c_i)\,P(c_i)}{\sum_{j=1}^{nc} P(x|c_j)\,P(c_j)}. \tag{4}$$

In this work, the goal is to use (4) to make inferences on the test set about the 'unknown' rv $c$ from its dependence on $x$, i.e., the value of the posterior distribution of $c$ is determined after observing the value of $x$.
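For reference, the relation between the interval probability and the density described above can be written in the following standard form. This is a sketch consistent with the surrounding text rather than a verbatim reproduction of the paper's equation; the distribution function $F_x$ and the integration variable $u$ are symbols introduced here for illustration.

```latex
f_x(x=\xi) \;=\; \frac{dF_x(\xi)}{d\xi}
          \;=\; \lim_{\Delta\xi \to 0}
                \frac{P\{\xi \le x \le \xi + \Delta\xi\}}{\Delta\xi},
\qquad
P\{\xi \le x \le \xi + \Delta\xi\}
          \;=\; \int_{\xi}^{\xi+\Delta\xi} f_x(u)\,du
          \;\approx\; f_x(x=\xi)\,\Delta\xi .
```

The approximation on the right holds when $\Delta\xi$ is small enough for the density to be treated as constant over the interval, which is why the most likely value of $x$ is the one that maximizes the density.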
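A small numeric illustration of Bayes' rule with the law of total probability in the denominator may help fix ideas. The likelihood and prior values below are hypothetical, and the function name `posterior` is chosen here only for the example.

```python
def posterior(likelihoods, priors):
    """Per-class posterior P(c_i | x) from P(x | c_i) and P(c_i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(x) via the law of total probability
    if evidence == 0:
        raise ValueError("P(x) must be non-zero")
    return [j / evidence for j in joint]

# Hypothetical class-conditional likelihoods for three classes.
likelihoods = [0.6, 0.3, 0.1]

# With a uniform prior the posterior is the normalized likelihood.
uniform = posterior(likelihoods, [1 / 3, 1 / 3, 1 / 3])
print([round(p, 3) for p in uniform])

# A non-uniform prior can change which class is most probable.
informed = posterior(likelihoods, [0.2, 0.5, 0.3])
print([round(p, 3) for p in informed])
```

In the second call the joint terms are 0.12, 0.15 and 0.03, so the second class becomes the most probable one even though the first class has the largest likelihood.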

B. ML and MAP Prediction Layers
The proposed ML and MAP layers make inferences based on PDFs obtained from the Logit layer prediction scores by using the training set. This is illustrated in Fig. 3, where the horizontal axes represent the random variable $x$ and the vertical axes are the normalized frequency of the number of objects in the classification and detection datasets. We can observe that the distribution of scores from the Logit layer is far more appropriate to represent a PDF (as shown in Fig. 2). Therefore, the ML and MAP layers are more adequate to perform probabilistic inference in regard to permitting decision-making under uncertainty, which is particularly relevant in autonomous driving and robotic perception systems.
As noted in (4), the posterior probability depends on the class-conditional probability (likelihood function) and on the prior probability, i.e., the MAP estimate depends on a distribution for both the likelihood and the prior, while ML only depends on $P(x|c)$, because $P(c)$ is usually assumed to be uniform and identically distributed. The probabilities $P(x|c)$ are modeled by means of non-parametric estimates over the predicted scores of the Logit layer for each class, as shown in the first column of Fig. 3. These estimates are obtained on the training set through normalized histograms (i.e., discrete densities defined by a single parameter, the number of bins) for each modality, as shown in Table I.
Histograms are graphical ways of summarizing or describing a variable in a simple way. In other words, histograms show how variables (in this case, the network logits) are distributed, revealing modes and bumps, as well as information about the frequencies of observations. As said by C. Bishop [66], 'we can view the histogram as a simple way to model a probability distribution given only a finite number of points drawn from that distribution'. Often, the bins of a histogram are chosen to have the same width; thus, the only parameter left is the number of bins ($nbins$). In principle, $nbins$ can be determined mathematically by means of the mean squared error (MSE, the expected value of the squared error) [67]. However, for our methodology, we have chosen $nbins$ empirically to guarantee a result very close to or better than the results provided by the SM and SG layers and, in addition, to generate a smoother distribution by adding the parameter $\lambda$. Thus, the number of bins and $\lambda$ (the additive smoothing factor) have been defined empirically by verifying which combinations would not degrade the results. These two parameters were defined empirically for each dataset/modality, as well as for each of the ML and MAP layers.
Each predicted value on the test set from the Logit layer has a score value corresponding to its bin range in the respective [...]
The normal distribution is suitable for modeling an unknown distribution because, for a given mean and variance, it has maximum entropy. Thus, the greater entropy can guarantee a more informative distribution and at the same time less confident information around the mean, that is, it contributes to the reduction of overconfident inferences. Put differently, the events most likely to happen have low information content, i.e., low entropy. Therefore, a Gaussian distribution was defined for the prior $P(c_i)$ to express a high degree of uncertainty in the value of the variable $c$ before observing the data. Furthermore, a prior distribution with high entropy is said to be a prior distribution with high variance [66].
Additionally, to avoid the 'zero' probability problem, as well as to incorporate some uncertainty level in the final prediction, the Additive Smoothing method ($\lambda$) [68]-[70] (also known as Laplace smoothing) is implemented during the ML and MAP predictions. The values assigned to the Additive Smoothing are shown in Table I and do not depend on previous information from the training set. This value was determined empirically, i.e., by observing which value would approximately preserve the 'original' distribution without compromising the final result. The probability estimates with the Additive Smoothing are shown in (5) and (6), i.e., a small correction is incorporated into the ML and MAP estimates. Consequently, no prediction will have a 'zero' probability, no matter how unlikely.
The ML layer is calculated straightforwardly by normalizing $P(x|c)$ by $P(x)$ during the prediction phase, as in (5), since the priors $P(c)$ are assumed to be uniformly and identically distributed over the set of classes $c$. Alternatively, inference using the MAP layer is given in (6). The sequential steps for calculating ML and MAP are summarized in Algorithm 1 (compute ML and MAP), where the class-conditional $P(x|c)$ is modelled by a normalized histogram. To obtain the maximum posterior probabilities (MAP), on the other hand, the priors are modelled by normal distributions $\mathcal{N}(test_{Lg} \mid \mu_{train}, \sigma^2_{train})$, where the sub-index $Lg$ indicates that the data is obtained from the Logit layer (the layer before the network prediction layer). Both the likelihood and the prior are extracted from the Logit layer using the training data.
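The steps described above can be sketched as follows. This is only one plausible reading, not the paper's Algorithm 1: it assumes that each class has its own histogram and Gaussian fitted on the training logits of that class, and that the additive smoothing takes the common form of adding $\lambda$ to every numerator and $nc \cdot \lambda$ to the denominator. All function names and the toy training logits are hypothetical.

```python
import math

def fit_histogram(samples, nbins):
    """Normalized histogram (relative frequencies) over [min, max]."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        raise ValueError("samples must span a non-zero range")
    counts = [0] * nbins
    for s in samples:
        k = min(int((s - lo) / (hi - lo) * nbins), nbins - 1)
        counts[k] += 1
    return lo, hi, [c / len(samples) for c in counts]

def hist_prob(hist, x):
    """Frequency of the bin containing x; zero outside the training range."""
    lo, hi, freqs = hist
    if x < lo or x > hi:
        return 0.0
    k = min(int((x - lo) / (hi - lo) * len(freqs)), len(freqs) - 1)
    return freqs[k]

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_class_models(train_logits_by_class, nbins):
    """One histogram (likelihood) and one Gaussian (prior) per class."""
    models = []
    for samples in train_logits_by_class:
        mu = sum(samples) / len(samples)
        var = sum((s - mu) ** 2 for s in samples) / len(samples)
        models.append({"hist": fit_histogram(samples, nbins),
                       "mu": mu, "sigma": math.sqrt(var)})
    return models

def ml_map_scores(test_logits, models, lam):
    """Smoothed ML and MAP scores for one test logit vector (lam > 0)."""
    nc = len(models)
    lik = [hist_prob(m["hist"], x) for m, x in zip(models, test_logits)]
    prior = [gaussian_pdf(x, m["mu"], m["sigma"]) for m, x in zip(models, test_logits)]
    ml = [(l + lam) / (sum(lik) + nc * lam) for l in lik]
    joint = [l * p for l, p in zip(lik, prior)]
    mp = [(j + lam) / (sum(joint) + nc * lam) for j in joint]
    return ml, mp

# Hypothetical training logits of each class's own output unit.
train = [[4.0, 5.0, 5.5, 6.0, 7.0],
         [3.0, 4.0, 4.5, 5.0, 6.0],
         [2.0, 3.0, 3.5, 4.0, 5.0]]
models = fit_class_models(train, nbins=3)
ml_seen, map_seen = ml_map_scores([5.2, -1.0, -2.0], models, lam=0.01)
ml_unseen, map_unseen = ml_map_scores([100.0, 100.0, 100.0], models, lam=0.01)
print("in-range sample :", [round(v, 3) for v in ml_seen], [round(v, 3) for v in map_seen])
print("out-of-range    :", [round(v, 3) for v in ml_unseen], [round(v, 3) for v in map_unseen])
```

Under these assumptions, a logit vector that falls outside every training histogram receives equal scores for all classes, and no score is ever exactly zero or one because of the smoothing term.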

C. CNN Architectures for Object Recognition
Experiments in [26] suggested that the greater the number of layers and neurons, the more overconfident the result will be. However, the experiments that we have conducted show that even when reducing the number of neurons and filters in the dense and convolutional layers, the network can still produce overconfident predicted values, as can be observed in Fig. 1. This conclusion was reached by training the Inception V3 CNN [71] and reducing the number of filters and neurons/units. Regarding object detection, the Yolo V4 model [38] was trained to detect cars, cyclists, and pedestrians, with predictions based on the SG layer.
The experiments reported throughout the remainder of this work were based on the premise that, after training the network, the proposed ML and MAP layers replace the SM and SG prediction layers on the test set only, according to Fig. 5.

D. Reliability Diagram
Typically, post-calibration predictions are analyzed in the form of reliability diagram representations [26], [72], which illustrate the relationship of the model's prediction scores in regard to the true correctness likelihood [73], as shown in Fig. 6. Reliability diagrams show the expected accuracy of the samples as a function of confidence, i.e., the maximum value of the prediction function.
The scores (predicted values) are grouped into $M$ bins (histogram) in the reliability diagrams. Each sample (classification score of an object) is allocated to a bin according to its maximum prediction value (prediction confidence). Each bin has a range $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$, where $m = 1, \ldots, M$. The accuracy is calculated in each range $I_m$, as well as the average confidence $conf_{average} = \frac{1}{B_M}\sum_{i} p_i$, where $p_i$ is the confidence for sample $i$ and $B_M$ is the number of objects in each $I_m$. In addition, a gap can be obtained, i.e., the difference between accuracy and average confidence in each range ($I_m$). Thus, the greater the gap, the worse the calibration result in the respective bin. Furthermore, through reliability diagrams, it is possible to obtain calibration errors, such as the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE):

$$ECE = \sum_{m=1}^{M} \frac{B_M}{n}\,\left| acc(I_m) - conf_{average}(I_m) \right|,$$

$$MCE = \max_{m \in \{1, \ldots, M\}} \left| acc(I_m) - conf_{average}(I_m) \right|,$$

where $acc(I_m)$ is the accuracy in range $I_m$ and $n$ is the number of samples. Moreover, the reliability diagrams illustrate the identity function (diagonal dashed line) that represents a perfectly calibrated output, while any deviation from the diagonal represents a calibration error [26], [72].
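The binning procedure and the two calibration errors can be sketched as below, following the usual definitions from the calibration literature (per-bin gap between accuracy and average confidence, weighted by bin occupancy for ECE and maximized for MCE). The function name `calibration_errors` and the toy inputs are assumptions made for illustration.

```python
import math

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) from top-label confidences and 0/1 correctness flags."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin m covers ((m-1)/M, m/M]; a confidence of exactly 0 goes to the first bin.
        # Values lying exactly on a bin edge are subject to floating-point rounding.
        k = min(max(math.ceil(conf * n_bins) - 1, 0), n_bins - 1)
        bins[k].append((conf, ok))
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        acc = sum(ok for _, ok in members) / len(members)
        avg_conf = sum(c for c, _ in members) / len(members)
        gap = abs(acc - avg_conf)
        ece += len(members) / n * gap
        mce = max(mce, gap)
    return ece, mce

# Hypothetical overconfident classifier: 95% confidence, 75% accuracy.
ece, mce = calibration_errors([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0])
print("ECE = %.3f, MCE = %.3f" % (ece, mce))
```

A perfectly calibrated set of predictions (for instance, 75% confidence with 75% accuracy in the same bin) yields zero for both metrics, which corresponds to the diagonal of the reliability diagram.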

E. Benchmarking Datasets
A key contribution to the growing improvement of perception systems for autonomous driving is the availability of representative datasets of different modalities, such as RGB, LiDAR, and radar [74]-[79]. In this work, we used the KITTI Vision Benchmark Suite-2D object [36] and Lyft Level-5 (LL5) Perception [80], [81] datasets. The classes of interest were pedestrians, cars, and cyclists. Table II shows the number of objects cropped from both the RGB and range-view (depth from the LiDAR modality) images. In addition, some extra objects from unseen/non-trained classes (not used during training), such as person sitting, tram, truck, van, tree, lamppost, signpost, bus, and motorcycle, were classified in the test/prediction phase to verify the erroneous overconfidence of the prediction layers of the trained networks. Such a class can be understood as an 'adversarial' class. Note that this research did not carry out any study involving adversarial network architectures. Range-view images were obtained by a coordinate transformation of the 3D point clouds onto the 2D image plane, followed by an upsampling of the projected points. The upsampling was performed using a bilateral filter with a mask size (sliding window) of 13 × 13 [37], [82]-[84] for the KITTI dataset and 23 × 23 for the LL5 dataset. Examples of these operations can be observed in Fig. 7 and Fig. 8, respectively.
As a way to validate the proposed methodology for object detection, the KITTI Vision Benchmark Suite-2D object was used.The respective dataset was divided into 3367 frames for the training dataset, 375 frames for the validation dataset and 3739 frames for the test dataset.

IV. EVALUATION AND RESULTS
The output scores of the CNN indicate a degree of certainty of the given prediction. The level of certainty can be defined as the confidence of the model and, in an object recognition problem, is represented by the maximum value within the prediction layer. However, the output scores may not always represent a reliable indication of certainty with regard to a given class, especially when unseen (non-trained) objects occur in the prediction stage. This is particularly relevant for real-world applications involving autonomous robots and vehicles, since unpredictable objects are likely to be encountered and would be misclassified by prediction layers with a high degree of certainty. With this in mind, in addition to the trained classes (pedestrian, car, and cyclist), a set of unseen objects was introduced into the classification dataset, according to Subsection III-E. Regarding object detection, the unseen classes are already contained in the dataset's own frames. Unlike the results reported on the classification dataset, the object detection results are presented by means of precision-recall curves considering the easy, moderate, and hard cases, according to the devkit tool provided by the KITTI benchmark.

A. Results on Object Classification
All classes for the training dataset were extracted directly from the aforementioned datasets, except for the tree, lamppost, and signpost classes, which were manually extracted from the data for this study. The rationale behind this is to evaluate the prediction confidence of the network on objects that do not belong to any of the trained classes, so that the consistency of the models can be assessed. Ideally, if the classifiers were perfectly consistent in terms of probability interpretation, the prediction scores would be identical (equal to 1/3) for each class in each sample of the unseen dataset. Results on the testing set are shown in Table III in terms of F-score, false positive rate ($FPR$), and the average ($Ave.Scores_{FP}$) and variance ($Var.Scores_{FP}$) of the scores of the false positives ($FP$). The average ($Ave.Scores_{unseen}$) and the variance ($Var.Scores_{unseen}$) of the predicted scores are also shown for the unseen testing set (out-of-training distribution test data).
In reference to Table III, where the results are reported based on the classification test set, it can be observed that the $FPR$, $Ave.Scores_{FP}$ and $Var.Scores_{FP}$ values obtained with the ML and MAP layers are considerably lower than the results presented by the SM layer for both sensor modalities and datasets. On the KITTI dataset, the F-scores of the proposed approach (ML and MAP) compared to the SM layer showed an average reduction of 1% (percentage point) for the RGB modality and 0.76% for the RV modality. On the LL5 dataset, the F-score showed a gain of 0.065% for the RGB modality with the MAP approach, while the F-score of the RV modality had an average reduction of 0.26%. Such reductions of the F-scores are relatively small and thus did not compromise the classification ability. Additionally, the distribution of the top-label scores on the test set comprising the objects that belong to the trained classes (in-distribution classes) is discussed in Appendix VII-A. Another way of analyzing the results of reducing overconfident predictions is through reliability diagrams, as shown in Figs. 9 and 10, considering uncalibrated, ML and MAP data. Furthermore, as a way of validating our methodology, we compared our results with those achieved by the temperature scaling calibration technique. Note that the results presented through the reliability diagrams are summarized by the MCE and ECE metrics. From these metrics we cannot say which is the best calibration technique, because one technique obtained the lowest value for the MCE, while another obtained the lowest value for the ECE. However, we show that the proposed approach contributed to reducing the calibration errors, i.e., to reducing the values of the MCE and ECE metrics when compared to the uncalibrated data. Consequently, we provide a more reliable result, as well as a contribution to reducing overconfident predictions.
Further experiments have been carried out as a complementary analysis concerning the network's overconfident behaviour on a so-called 'unseen' test set, by means of the network's average score $Ave.Scores_{unseen}$. Note that for the ML and MAP layers, the results are smaller than for the SM layer, as can be seen in Table III. This indicates that the probabilistic inferences are significantly better balanced, i.e., they enable more reliable decision-making when 'new' objects of 'non-trained' classes are presented to the CNNs, as illustrated by Fig. 11, i.e., the distribution for the unseen dataset. We can see that the aforementioned graphs show less extreme results than those provided by the SM layer.

B. Results on Object Detection
The results on the object detection dataset using the ML and MAP nonlinearities are impressive.Such results were not presented through reliability diagrams, but through normalized histograms, which showed more clearly the reduction in overconfidence in relation to objects detected as false positives without degrading the results of the true positives, as showed in Fig. 12.The results are more representative through precisionrecall curves, especially for the cyclist class (Cyc), whose areas under the curves (AUCs) are 24.03%,14.28% and 14.63% for the easy, moderate and hard cases respectively, as shown in Fig. 13 and Table IV.With respect to the car (Car) and pedestrians (Ped), the proposed approach also showed some improvement.
Note that the proposed methodology depends on the number of bins (nbins) and on the parameter λ, so the score values may vary with these parameters. For the particular case of the cyclist class, the proposed methodology achieved a clear improvement over the baseline (results in Table IV). In this paper we have chosen to use a single set of parameters for all three classes (i.e., the same values of λ and nbins for each class). We note, however, that a set of parameters tailored to each class can be used instead, since the distributions (PDFs) are computed individually for each class.
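The role of nbins and λ can be illustrated with a minimal sketch of histogram-based ML and MAP scoring on Logit-layer outputs. This is not the authors' implementation: it assumes that each class k is modeled by a smoothed, normalized histogram of logit k over the training samples labelled k (the likelihood) and by a Gaussian fitted to the same values (the prior), and that scores are normalized across classes. All function names, the treatment of out-of-range logits and the default parameter values are hypothetical.

```python
import numpy as np

def fit_class_models(logits, labels, n_classes, nbins=20, lam=1e-2):
    """Fit one histogram (likelihood) and one Gaussian (prior) per class."""
    models = []
    for k in range(n_classes):
        z = logits[labels == k, k]               # logit k of class-k samples
        counts, edges = np.histogram(z, bins=nbins)
        total = counts.sum() + lam * nbins
        models.append({
            "edges": edges,
            "probs": (counts + lam) / total,     # additive smoothing
            "floor": lam / total,                # mass of an empty bin
            "mu": z.mean(),
            "sigma": z.std() + 1e-8,
        })
    return models

def hist_likelihood(model, z):
    """Smoothed bin probability; out-of-range values get the empty-bin mass."""
    edges, probs = model["edges"], model["probs"]
    idx = np.searchsorted(edges, z, side="right") - 1
    idx = np.clip(idx, 0, len(probs) - 1)
    in_range = (z >= edges[0]) & (z <= edges[-1])
    return np.where(in_range, probs[idx], model["floor"])

def gaussian_prior(model, z):
    s = model["sigma"]
    pdf = np.exp(-0.5 * ((z - model["mu"]) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return pdf + 1e-12                           # avoid exact zeros

def ml_map_scores(models, logits):
    """Return (ML scores, MAP scores), each row normalized to sum to 1."""
    like = np.stack([hist_likelihood(m, logits[:, k])
                     for k, m in enumerate(models)], axis=1)
    prior = np.stack([gaussian_prior(m, logits[:, k])
                      for k, m in enumerate(models)], axis=1)
    ml = like / like.sum(axis=1, keepdims=True)
    post = like * prior
    return ml, post / post.sum(axis=1, keepdims=True)
```

In this sketch, `fit_class_models` would be run once on the Logit outputs of the training set and `ml_map_scores` applied at prediction time, so the pre-trained network itself is left untouched. Passing different `nbins` and `lam` values per class would correspond to the tailored parameter sets mentioned above.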
V. DISCUSSION AND CONCLUSION
In this work, a probabilistic approach for CNNs was addressed by modeling distributions on the Logit layer to better represent the classification outputs. The results reported in the experiments are promising, given that the ML and MAP layers noticeably reduced the classifier overconfidence and provided score distributions that are more suitable for probabilistic interpretation.
The improvement is not as significant when analyzing objects defined as true positives. However, our concern is to develop a methodology that reduces the scores of false positives (mainly objects of the unseen class, which may be critical in robotics and autonomous driving applications) without degrading the results achieved for true positives. Note that we have included two metrics in Table III in order to show the reduction of the score values for the 'unseen' class in particular, and also to show that the overconfidence behavior has been mitigated for TPs and FPs.
One potential way to improve the F-scores achieved by the ML and MAP layers would be to obtain a 'perfect' match between the smoothing parameter (λ) and the number of bins in the histograms. For the new results with the EfficientNetB1 network, we selected the parameters through an exhaustive search (combining several candidate values), in order to keep the F-scores of the ML and MAP layers practically identical to those achieved by the EfficientNetB1 baseline. Figures 16, 17, and 18 show reductions in the scores for objects of the 'unseen' class, which indicates that the proposed approach is effective. As a consequence of the Additive Smoothing, the score values equal to 0.0 and 1.0 are excluded from the prediction values. The influence of the λ parameter on the data distribution can be seen in the figures in Appendix VII-B, particularly with respect to objects of the 'unseen' class.
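The exclusion of the extreme scores follows directly if the smoothing is assumed to take the standard additive (Lidstone) form. In the sketch below, $c_{k,b}$ (count of training samples of class $k$ in bin $b$), $N_k$ (number of training samples of class $k$), $b_k(x)$ (bin hit by the input for class $k$) and $K$ (number of classes) are symbols introduced here for illustration only.

```latex
\hat{p}_k(b) = \frac{c_{k,b} + \lambda}{N_k + \lambda\, n_{\mathrm{bins}}},
\qquad \lambda > 0 \;\Rightarrow\; \hat{p}_k(b) > 0 \ \text{for every bin } b,
\\[4pt]
s_k(x) = \frac{\hat{p}_k\big(b_k(x)\big)}{\sum_{j=1}^{K} \hat{p}_j\big(b_j(x)\big)}
\;\Rightarrow\; 0 < s_k(x) < 1 \quad \text{for } K \ge 2 .
```

Since every term in the denominator is strictly positive, no class can receive the whole probability mass or none of it, whereas without smoothing ($\lambda = 0$) an empty bin would produce a score of exactly 0.0 and, when all competing bins are empty, a score of exactly 1.0.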
To assess the classifier's robustness, i.e., the uncertainty of the model when predicting objects of classes unseen by the network, we considered a test set comprising 'new' objects. Overall, the results are promising, since the predictions were less concentrated at the extremes than those from the SM layer. In other words, the average scores using the ML and MAP layers were significantly lower than those of the Softmax prediction layer (the baseline), and thus the CNNs are less prone to overconfidence.
The results for object classification were presented through reliability diagrams, taking into account the MCE and ECE metrics. These metrics indicate how well the predicted scores are calibrated; that is, the best calibration presents the lowest values of MCE and ECE. However, we observed that, depending on the dataset and sensor modality, our approach obtained the best result in only one of the metrics, i.e., either the lowest MCE or the lowest ECE. The same behavior can be observed with the temperature scaling calibration technique.
Another important factor that contributes to validating the proposed approach is the use of two different datasets, for both the RGB and Range-View (3D point clouds from LiDARs) modalities, since the sensors of the datasets have different resolutions, mainly the LiDAR sensor. While the KITTI dataset provides 3D point clouds obtained from a sensor with 64 beams, the LL5 dataset provides 3D point clouds with 40 beams; the proposed approach was therefore also successful with differing sensor resolutions. The proposed methodology also obtained good results for object detection: it did not degrade the results when compared to the SG prediction layer and presented better results in all cases. The improvement is more evident for the 'cyclist' class, which contains the fewest examples. This is an interesting result that could be further investigated in future work.
Regarding the formulations of probabilistic distributions, the prior modeling by a Gaussian distribution was shown to guarantee a smoother distribution for the prediction values. Unlike the prior, the likelihood function was modeled by means of a normalized histogram, i.e., by a non-parametric formulation of the probability distributions. If both the prior and the likelihood function were modeled by a uniform distribution, the final result would be similar to those achieved by the SM and SG layers, since it would not offer any smoothing for the prediction values. In fact, a uniform prior or likelihood would add a constant to the training data modeling, which would have little effect on the prediction values obtained by the ML and MAP layers.
VI. FUTURE WORK
Softmax and Sigmoid layers represent confidence measures, but they do not provide any measure of uncertainty of the predictions. In other words, both layers provide a direct measure of certainty through the maximum class probability, but they do not provide any information about the certainty that the model itself has about the predictions. Therefore, we address the issues of overconfident predictions and calibration techniques in this work with a focus on perception systems for autonomous vehicles. However, we realize that there is a lack of studies on how to quantify the certainty/uncertainty of predictions in relation to calibration techniques and reliability diagrams. We verified that the MCE and ECE metrics, which quantify the calibration through the reliability diagrams, depend on the number of bins of such diagrams; that is, changing the number of bins can yield new error values. Thus, what is the correct number of bins to ensure that a set of predictions is well calibrated? Regardless of the methodology used to reduce overconfident predictions or capture uncertainty in predictions, how should we assess the quality of estimated uncertainty independently of calibration and regularization techniques?
Faced with such questions, and based on the studies presented in the literature on computing uncertainties of predictions and on calibration and regularization techniques, we found that evaluating the quality of uncertainty estimates is still a challenge for the following reasons:
• uncertainty estimates depend on methods that are performed by means of approximations, i.e., by means of inferences;
• uncertainty estimates depend on the sample size, i.e., the sample size can provide a certain degree of confidence that such a sample is representative;
• it is not easy to obtain a ground truth about uncertainty estimates. In fact, during our study we did not verify the ground truth about uncertainty estimates;
• the quality of quantitative uncertainty metrics, such as entropy, Mutual Information, Kullback-Leibler Divergence, and predictive variance, still has to be studied and evaluated.
Based on the issues mentioned above, we intend to advance research on the quality of uncertainty estimates, including the formulation of reliability diagrams, as a way to quantify the
The proposed methodology, which is based on the ML/MAP layers, aims to reduce overconfident predictions of deep models, especially for objects classified as false positives, which sometimes receive high scores from deep networks. An ideal result would be for the network to provide lower scores for the false positives, i.e., objects misclassified by the network, and concurrently to attain higher scores for the true positives. As additional results on the test sets, we present Fig. 14 and Fig. 15, which contain the results for the pedestrian, car, and cyclist classes (columns from left to right), considering the scores of the objects as being positive and negative. They show smoother distributions of scores when compared to the results shown in Fig. 1.

B. Smoothing Parameter Influence
In addition to the results presented above, we have implemented the proposed methodology on another state-of-the-art network, EfficientNetB1. The performance achieved by EfficientNetB1 when classifying RGB images is an F-score of 98.67% using the Softmax layer (the baseline). The result achieved through the ML layer is equivalent to the baseline, i.e., F-score = 98.67%, while with the MAP layer the network achieved 98.66% (almost the same). Keeping nbins = 19 for both cases, we performed several runs with different values of λ, and the resulting F-score stabilized around 99.66%, i.e., very close to the F-score provided by the Softmax layer (baseline). One way to choose the best values for nbins and λ could be, for instance, to reduce the scores of the objects classified as false positives without degrading the results of the true positives, as illustrated by Figs. 16, 17, and 18, where the distributions in each row were obtained with a given value of the λ parameter, considering classifications from the unseen dataset. Note that as the value of λ increases, the distributions tend to move away from the extreme values (0.0 and 1.0).
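The effect of λ on the extreme scores can be seen in a small two-class toy example. The sketch below assumes standard additive smoothing of the histogram counts and a normalized ML score; the bin counts, the bin indices and the function names are hypothetical and chosen only to show the trend, not taken from the experiments.

```python
import numpy as np

def smoothed_probs(counts, lam):
    """Additive smoothing of histogram counts into bin probabilities."""
    counts = np.asarray(counts, dtype=float)
    return (counts + lam) / (counts.sum() + lam * counts.size)

def top_ml_score(counts_a, counts_b, bin_a, bin_b, lam):
    """Largest normalized ML score when the input hits bin_a / bin_b."""
    pa = smoothed_probs(counts_a, lam)[bin_a]
    pb = smoothed_probs(counts_b, lam)[bin_b]
    return max(pa, pb) / (pa + pb)

# Hypothetical training histograms (5 bins each) for two classes.
counts_a = [0, 5, 30, 5, 0]
counts_b = [0, 0, 10, 20, 10]

# The test logit falls in a well-populated bin of class A
# and in an empty bin of class B.
for lam in (1e-3, 1e-1, 1.0, 10.0):
    score = top_ml_score(counts_a, counts_b, bin_a=2, bin_b=0, lam=lam)
    print(f"lambda={lam:<6} top score={score:.4f}")
```

In this toy setting the top score is close to 1.0 for very small λ and decreases monotonically as λ grows, which mirrors the trend described above for the unseen dataset.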

Fig. 1: Graphs (a) and (b) are the Softmax prediction scores for the 'pedestrian', 'car' and 'cyclist' classes (where the positives are in orange), showing evidence of overconfidence behavior. The bar plots were obtained on an RGB image classification set from the KITTI and LL5 databases respectively.

Fig. 2: Probability density functions (PDFs), using normalized histograms, for the Logit layer data on the training sets of the KITTI (a) and LL5 (b) datasets. The graphs are organized from left to right by class (pedestrian, car and cyclist, where the positives are in orange) using the RGB modality.

Fig. 3: From left to right respectively, normalized histogram-based densities and Gaussian densities calculated on the Logit layer values, for each class, on the training set (here for the RGB modality). The 1st row shows the densities on the KITTI set, while the 2nd row shows the densities on the LL5 training set.

Fig. 4: Obtaining probability values of a normalized histogram generated with the training data of the Logit layer.

Fig. 5: Inception V3 CNN representation with Logit and Softmax layers, Maximum Likelihood (ML) and Maximum a-Posteriori (MAP) layers. The CNN was trained with the Softmax layer. After training, the Softmax layer was replaced by the ML and MAP layers, i.e., the CNN was not trained with the ML and MAP layers.

Fig. 7: Example from the KITTI dataset. Representations of a 'raw' point cloud (a) in image coordinates and the upsampled range-view (b) obtained using the bilateral filter. (a) 3D point cloud projected on the 2D image plane. (b) Range-view (RV) after upsampling the point cloud.

Fig. 8: Example from the LL5 dataset. In (a) the 3D point clouds are in pixel coordinates, and (b) shows the respective range-view after applying the bilateral filter. (a) 3D point cloud projected on the 2D image plane. (b) Range-view (RV) for a 40-channel LiDAR.

Fig. 9: The graphs, from left to right, represent uncalibrated score values, followed by score values calibrated through Temperature Scaling, then scores obtained by the ML and MAP layers respectively.

Fig. 10: Reliability diagrams, on the LL5 dataset, for the following cases (from left to right): uncalibrated scores, calibrated model using TS, and then the diagrams for the models using ML and MAP layers. (b) Reliability diagrams for RV images from the Lyft Level 5 dataset, considering the number of bins = 15 and TS = 1.90.

Fig. 11: Prediction scores on the unseen/non-trained data (comprising the classes: person sitting, tram, tree/lamppost/signpost, truck, van), using the SM layer (left side) and the proposed ML (center) and MAP (right side) layers. The graphs of the first two rows are the results on the KITTI dataset, while the last two are from the LL5 dataset.

Fig. 12: Results obtained from the Yolo V4. The columns from left to right represent the car, cyclist and pedestrian classes, as well as the distributions of the scores of the Sigmoid layer, Maximum Likelihood and Maximum a-Posteriori functions. The first row of distributions shows the scores of the true positives, while the last row shows the corresponding scores of the false positives. (a) Scores from the true positives. (b) Scores from the false positives.

Fig. 13: Precision-recall curves for Yolo V4 obtained from the Sigmoid prediction layer, ML and MAP layers on the KITTI dataset, considering the true positives. The curves were obtained for the easy, moderate and hard cases, according to the toolbox provided by KITTI.

Fig. 14: From the RGB and LiDAR (RV) modalities, the prediction scores were calculated using the ML and MAP functions on the KITTI dataset.

TABLE I: Number of bins and smoothing parameter (λ) for ML and MAP layers.

TABLE II: KITTI and LL5 datasets for classification: number of objects per class and subsets.

TABLE III: Comparison between the classifications obtained by the SM layer, ML and MAP layers in terms of average F-score and FPR (%). The performance measures on the 'unseen' dataset are the average and the variance of the prediction scores.

Reliability diagrams for RGB images from the KITTI dataset, considering the number of bins = 15 and TS = 1.31.
Reliability diagrams for RV images from the KITTI dataset, considering the number of bins = 15 and TS = 2.26.

TABLE IV: Comparison of the areas under the curves (%) between the Sigmoid layer (SG), ML and MAP layers from the precision-recall curves.
Class   SG      ML      MAP
Car     70.47   71.34   71.68
Car     62.74   63.77   63.85
Cyc     43.24   53.43   53.63
Cyc     39.70   45.31   45.37
Cyc     35.61   40.62   40.82

Reliability diagrams for RGB images from the Lyft Level 5 dataset, considering the number of bins = 15 and TS = 2.46.