Evaluating Object (Mis)Detection From a Safety and Reliability Perspective: Discussion and Measures

We argue that object detectors in the safety-critical domain should prioritize detection of objects that are most likely to interfere with the actions of the autonomous actor. This applies especially to objects that can impact the actor’s safety and reliability. To quantify the impact of object (mis)detection on safety and reliability in the context of autonomous driving, we propose new object detection measures that reward the correct identification of objects that are most dangerous and most likely to affect driving decisions. To achieve this, we build an object criticality model to reward the detection of the objects based on proximity, orientation, and relative velocity with respect to the subject vehicle. Then, we apply our model on the recent autonomous driving dataset nuScenes, and we compare nine object detectors. Results show that, in several settings, object detectors that perform best according to the nuScenes ranking are not the preferable ones when the focus is shifted to safety and reliability.


I. INTRODUCTION
The goal of object detection is to perceive and locate instances of semantic objects of a certain class [20]. A multitude of solutions have been proposed for 2D and 3D object detection, based on cameras and lidars [26], [44]. Object detection is fundamental in emerging safety-critical applications, and in particular it is a major pillar of autonomous driving applications [32].
However, we argue that current measures for object detection do not match the demands and peculiarities of autonomous vehicles and safety-critical systems in general, i.e., systems whose failure may lead to harmful consequences [3]. Evaluations based on Average Precision typically judge how well a detector detects objects, without discriminating based on the current position of these objects or on their possibility to interfere with the subject in the considered scenario. To clarify, let us consider the typical modular pipeline for autonomous driving [17]: the subject vehicle senses the surroundings to perform object detection, and the output of the object detection is used for trajectory planning. Let us now consider two other vehicles in the sensed scenario, one directed straight towards the subject vehicle, on a colliding trajectory, and one headed away from the subject vehicle at a higher speed. Clearly, for the safety of the driving task, it is critical to detect the first one, while detection of the second vehicle is not relevant at all. Unfortunately, this is not captured by the measures currently used in object detection, which consider both objects as equally relevant. Very practically, in a typical autonomous driving modular pipeline, it is first essential to detect all relevant objects; then these objects can be used for, e.g., trajectory planning. We argue that it is desirable that the object detector does not fail to detect objects on a colliding trajectory; otherwise, the output of the trajectory planner is also compromised.
In this paper, we elaborate on how to measure the performance of object detectors in the safety-critical domain, with specific contextualization to the domain of autonomous driving, and we identify the need for an object criticality model and related measures. As a key requirement, the desired measures should reward the detection of those objects that may interfere with the subject vehicle, and that are relevant for the safe and reliable execution of the driving task. Also, to be practically useful, the proposed measures have to be in a defined range, and be summarized by an overarching unifying measure. While autonomous driving is the most evident application domain, and it will be used as reference in the rest of this paper, our reasoning applies to any domain where reliability and safety of the object detection task are relevant for the success of the mission, for example in the case of navigation and collision avoidance in drone systems [42].
In more detail, we propose a set of new measures, which we refer to as the object criticality model. This model assigns a criticality score to each object, based on ground truth and estimated object distance, colliding trajectory, and time to collision. These criticality scores are used to compute two measures, named reliability-weighted precision and safety-weighted recall, that weight correct object detections and misdetections based on their impact on the safety and the reliability of the driving task. Last, a summarizing measure, named Critical Average Precision, allows ranking detectors according to these safety- and reliability-oriented measures.
The object criticality model and the related measures are exercised on the nuScenes dataset, with nine 3D object detectors. We show that, under numerous settings, the ranking we obtain differs from the one achieved using the nuScenes evaluation library, which relies on traditional measures. Among other implications, this result questions the usual approach to rate and select the most suitable object detector for the autonomous driving domain.
The rest of the paper is organized as follows. Section II presents basic notions and the related works. Section III shows the object criticality model and the measures we are introducing. Section IV describes the experiments based on nine object detectors and the nuScenes dataset. Section V illustrates the results, in which the object detectors are ranked according to our and traditional measures, and differences are discussed. Section VI concludes the paper.

II. BACKGROUND AND RELATED WORKS

A. OBJECT DETECTION AND ITS EVALUATION
We report the minimal set of notions on object detection that we require to present the choices made in our work.
To describe the spatial location and extent of a detectable object, in this paper for simplicity we only consider bounding boxes, although alternative approaches, e.g., [26], are applicable to our object criticality model as well.
Object detectors compute bounding boxes with an assigned confidence score. Then, a detection threshold is applied as a configuration parameter: all bounding boxes with a confidence score above the detection threshold are predictions. The classification of true positives (TPs), false positives (FPs), and false negatives (FNs) is based on some definition of distance between the predicted bounding boxes and the ground truth bounding boxes. In this paper, we use the distance between their center points [5]: a detected object is considered a TP if the distance between the centers of the ground truth bounding box and the detected bounding box is smaller than a distance limit.
If there is no predicted bounding box that matches this criterion, then the object is not detected and it counts as an FN. Predicted bounding boxes that are farther than the distance limit from all ground truth bounding boxes are considered FPs. True negatives (TNs) are not taken into account, because there are infinitely many bounding boxes that should not be detected within any given image [30].
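The classification above can be sketched in a few lines of Python. This is a minimal illustration, not the nuScenes implementation: the function name `match_detections`, the tuple layout, and the greedy matching in descending confidence order (one prediction per ground truth object) are assumptions made for the example.

```python
import math

def match_detections(predictions, ground_truths, dist_limit, score_threshold):
    """Classify predictions into TPs and FPs, and ground truths into FNs.

    predictions: list of (x, y, score) box centers with confidence scores.
    ground_truths: list of (x, y) box centers.
    Assumption: greedy one-to-one matching, highest confidence first.
    """
    # Keep only boxes whose confidence is above the detection threshold.
    kept = sorted((p for p in predictions if p[2] > score_threshold),
                  key=lambda p: -p[2])
    unmatched = set(range(len(ground_truths)))
    tps, fps = [], []
    for px, py, score in kept:
        best, best_dist = None, dist_limit
        for i in unmatched:
            gx, gy = ground_truths[i]
            dist = math.hypot(px - gx, py - gy)  # center distance
            if dist < best_dist:
                best, best_dist = i, dist
        if best is None:
            fps.append((px, py, score))          # no ground truth close enough
        else:
            unmatched.remove(best)
            tps.append(((px, py, score), ground_truths[best]))
    fns = [ground_truths[i] for i in sorted(unmatched)]  # missed objects
    return tps, fps, fns
```

With a distance limit of 2 meters, a prediction 0.3 meters away from a ground truth center is a TP, while a ground truth object with no prediction nearby ends up in the FN list.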
While there are several measures that can evaluate the performance of object detectors, the conventional approach to the evaluation of object detectors consists of measures that are derived from the count of TP, FP, and FN. These form the basis for our object criticality model defined in Section III, and they are briefly reviewed here [31], [30], [6]. Precision, $P = TP/(TP + FP)$, indicates how many of the selected items are relevant. If some non-relevant items are selected, this reduces precision. Precision is 1 if all the detected objects exist, and 0 in the opposite case. Conversely, Recall, $R = TP/(TP + FN)$, indicates how many of the existing relevant items are selected. If a detector has recall 1, it means it detected everything without any detection miss; in the opposite case, recall is 0. An object detector with high recall but low precision outputs many predictions, but most of them are incorrect; an object detector with high precision but low recall returns very few predictions, but most of them are correct.
Currently, the most frequently used summarizing measure is Average Precision ($AP$) [13], which summarizes the precision-recall curve as the weighted mean of precision scores achieved at different detection thresholds, using the increase in recall from the previous detection threshold as the weight. More precisely, $AP = \sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the precision and recall at the n-th detection threshold. In this paper, in agreement with [5], we calculate $AP$ only for recall and precision above or equal to 0.1: we remove cases in which recall or precision is less than 0.1 in order to minimize the impact of noise commonly seen in regions with low precision or low recall.
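As a worked illustration of these definitions, the following Python sketch computes precision, recall, and AP as a direct reading of the summation formula, skipping points where recall or precision is below 0.1. The function names and the convention of returning 0 for an empty denominator are assumptions of this sketch; the nuScenes development kit may differ in implementation details, which are not reproduced here.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when nothing was predicted (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no objects (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def average_precision(recalls, precisions, floor=0.1):
    """AP = sum_n (R_n - R_{n-1}) * P_n over the detection thresholds.

    recalls must be sorted in non-decreasing order, one value per threshold.
    Points with recall or precision below `floor` contribute nothing.
    """
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        if r >= floor and p >= floor:
            ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For example, with recalls `[0.05, 0.2, 0.5, 1.0]` and precisions `[1.0, 0.9, 0.8, 0.05]`, only the second and third points contribute, giving 0.15 · 0.9 + 0.3 · 0.8 = 0.375.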

B. RELATED WORKS ON OBJECT DETECTION IN SAFETY-CRITICAL SYSTEMS
A safety-critical (computer) system is one whose malfunction could lead to unacceptable consequences, like harm to users or to the environment. A typical example is an autonomous vehicle, whose malfunction (of whatever cause) may lead to a collision. Instead, reliability describes the continuity of correct service, which can be temporarily disrupted, for example, to avoid situations that are potentially dangerous [3].
The inclusion of object detection tasks in safety-critical systems comes with a relevant set of well-known challenges, because of the many distinguishing aspects of the problem, its complexity, and also the variety of applications [1], [38], [22]. Considering object detectors, some incorrect predictions may lead to catastrophic consequences, and therefore have the maximum impact on safety, while others may have an irrelevant impact. Further, some false positives may cause an unnecessary interruption of the service, and therefore they impact the reliability. However, to evaluate object detectors, measures from Section II-A are typically used, without considering the different impact of each detection mistake. This also applies to the wide domain of autonomous driving, and it becomes evident when considering the measures used in object detection challenges for autonomous driving. For example, in challenges for KITTI [15], CityScapes [10], Waymo [35], or nuScenes [5], evaluation measures revolve around Average Precision and the concepts summarized in Section II-A.
Up to now, very few approaches have attempted to define safety or reliability measures for object detectors; to the best of our knowledge, the few works that target a goal similar to ours focus on safety but leave aside the reliability concern, and are [4], [36], [28], [39]. Notably, they all appeared in very recent years, which underlines a recent understanding of the relevance of the subject, and they are all in the autonomous driving domain. The work in [4] ranks each object in three categories (imminent collision, potential collision, no collision), based on its collision risk. Instead, in [36] the authors define critical zones in which accurate perception is mandatory. On a similar position, the authors of [28] argue the relevance of identifying a distance up to which all pedestrians are detected. The closest approach to our work is [39], where the authors combine scores measuring detection quality, collision potential, and time needed to make the detection. This allows computing a safety score of a test scenario, in 5 classes from insufficient to excellent.
With respect to the reviewed works, the object criticality model we propose includes both safety and reliability issues of the driving task.This is important, because safety by itself (to detect everything which is potentially dangerous) can be enforced by low precision and high recall, i.e., low false negatives at the cost of many false positives.Instead, by balancing both reliability and safety issues, our object criticality model provides a standalone evaluation of object detectors.
Other works address the problem of deep neural network uncertainty in autonomous driving, where the term uncertainty should be interpreted in the broad sense of how certain an object detector is about its predictions [14]. In general, these works aim to improve object detection, but they do not target the definition of specific measures. More specifically, the work in [14] argues that object detectors should also include prediction confidence, and it presents various methods to capture uncertainties in object detection for autonomous driving. Otherwise, object detectors can only tell the human drivers what they have seen, but not how certain they are about it. The work in [27] includes information on uncertainty sources (e.g., sensor noise), the work in [18] includes uncertainty when computing the bounding box regression loss, and the work in [21] considers both the noise inherent to the observations and the uncertainty that can be explained away given enough data. Despite not focusing on object detection, the work in [16] defines safety-oriented measures by proposing that predictions with a confidence score close to the detection threshold should be treated differently and more suspiciously. Finally, the work in [8] introduces the distinction of a critical area, which is the area near the vehicle where failed detection of an object may lead to immediate safety risks. The work acknowledges that the design of a driving application is focused on both i) guaranteeing safety in such a critical area, and ii) guaranteeing high detection accuracy in the non-critical area (in order to have smooth driving). This observation leads the author to build different DNNs for the detection of objects in the two areas.
Still, the above works weight all the detected objects the same, i.e., when assessing the object detector, the usual binary (yes/no) counting of TPs, FPs, and FNs is performed. Instead, in our work we claim that i) object detectors should be evaluated depending on the ability to detect those objects that are most likely to affect the driving task, i.e., impact on safety and reliability, and ii) this can be realized by weighting the objects based on their criticality, and by building specific measures that consider such weights. Also, we remark that, in our object criticality model, measurement errors and uncertainty in the detection are inherently considered, when computing the scores assigned to each object, and when the predicted values are compared to the ground truth.

III. OBJECT CRITICALITY MODEL
Our object criticality model is based on assigning a criticality value to each object in the scene, and then computing object detection measures that consider this criticality. The description of such a model is independent of the sensors used to capture the scene (e.g., cameras or lidars) and of the type of objects.

A. REQUIREMENTS AND ASSUMPTIONS
The application of the object criticality model requires i) a subject vehicle (named ego afterwards) that captures the scene with sensors such as cameras and lidars, and ii) objects (other vehicles, pedestrians, etc.) that are within line-of-sight to ego and that are consequently captured by the sensors. This is the very typical situation of an autonomous vehicle that performs object detection.
We assume that the following ground truth information is available: i) 3D bounding boxes describing the size of the objects; ii) coordinates of ego and of the objects; iii) velocity of ego and of the objects. The most recent automotive datasets have very rich meta-data, typically including the above information; for example, in Section IV and in Section V we will use nuScenes [5], which satisfies our assumptions. Clearly, the ground truth is required only to evaluate the object detector, and not in the case of operation in a deployed setting.
Further, we assume that the object detector produces as output: i) the computed 3D bounding boxes, ii) the estimated distance of detected objects from ego, and iii) the estimated velocity of objects. In other words, the object detector is assumed to conflate detection, tracking, and dynamics: this is done in several 3D object detectors, which include the above estimates in their output. Notably, these estimates are computed in the object detection challenges of the nuScenes community, which will be our reference for the experiments in Section IV and Section V.
For simplicity of the discussion, when computing coordinates of objects and their distance from ego, in this paper we consider only the (x, y) coordinates, i.e., we ignore the vertical dimension. In other words, while extending the object criticality model to the z-dimension is definitely possible at the cost of slightly more complex geometric computations, in the following we exclude the relative altitude of the objects and the ego. From the point of view of results, this is not an issue, because the dataset we use in this paper was collected on essentially flat terrain. Also, note that ignoring possible vertical offsets of objects may only reduce their distance from ego, and it is therefore a worst-case approximation.

B. STRUCTURE OF THE OBJECT CRITICALITY MODEL
We call ego the roving vehicle that mounts the sensors and collects data from the environment, and we call object B any other object. There are no restrictions on the type of objects; for example, B can be a car, a pedestrian, a bike, etc. Note that for ego we only have ground truth values, i.e., the object detector does not predict its own velocity or position.
The construction of our object criticality model is organized in 3 steps, which are repeated for each object B within the line of sight of ego, and for both the ground truth values and the predicted values of B.
The first step (Section III-C) is the analysis of the collision scenario involving B and ego.In this step, we calculate indicators that will be later used to define the criticality of B. In particular, we calculate i) the initial distance d between ego and B, ii) the closest distance r that ego and B would reach, and iii) the time ∆t that ego and B require to reach such distance.These values are input to the following step, together with the current position and velocity of ego and B.
The second step (Section III-D) is the calculation of criticality weights that are assigned to each object B. These are $\kappa_d$, $\kappa_r$, and $\kappa_t$, and they are based, respectively, on the three values calculated in the first step.
These weights indicate the relevance of B for the driving task: weights are higher if it is more likely that B may affect the behavior of ego. Such weights are used as rewards or penalties depending, respectively, on whether the object has been detected or missed.
The third step (Section III-E) exploits the assigned criticality to construct aggregate safety and reliability measures that allow comparing different detectors.

C. ANALYTICAL CHARACTERIZATION OF THE COLLISION SCENARIO
We refer to Figure 1 for a visual representation of the collision scenario analyzed in this section.
We define $ego = (ego_x, ego_y)$ as the position of ego, and $B = (B_x, B_y)$ as the position of the object B in the captured scene. Further, we define, in vector form, $v_{ego} = (v_{ego_x}, v_{ego_y})$ as the velocity of ego, and $v_B = (v_{B_x}, v_{B_y})$ as the velocity of B. We compute the relative velocity of B with respect to ego as $v_{rel} = (v_{rel_x}, v_{rel_y}) = (v_{B_x} - v_{ego_x}, v_{B_y} - v_{ego_y})$, that is, the vector difference between the velocity of B and the velocity of ego. This allows simplifying the subsequent calculations: we can consider ego as stationary, while B is moving with the velocity resulting from the difference between the two velocity vectors $v_B$ and $v_{ego}$.
Then, we identify the shortest distance from ego at which object B will pass if both continue moving with the same velocity. This is the distance between ego and point $C = (C_x, C_y)$, with C being the point closest to ego on the trajectory of B. Point C can also be thought of as the tangent point between the line representing the trajectory of B and a circle centered on ego.
Point $C = (C_x, C_y)$ can be computed as the intersection of two lines, using basic Euclidean geometry. The line defining the direction of the relative movement of B is obtained from the general equation of a line, i.e., $y - y_0 = m(x - x_0)$. We are looking for the line passing through point $(B_x, B_y)$ and whose slope $m$ (i.e., orientation with respect to the x axis) is given by the ratio between the y and x components of the relative velocity, $m = v_{rel_y}/v_{rel_x}$ (refer again to Figure 1).
The shortest distance between such line j and the position of ego lies on the line perpendicular to j passing through ego. Point C is the intersection of these two lines.
Then, applying the Euclidean distance, we can easily compute: i) the distance $d_{egoB}$ between ego and B, ii) the distance $d_{egoC}$ between ego and C, and iii) the distance $d_{BC}$ between B and C.
Assuming that both ego and B continue moving with the same velocity, the time ∆t that B needs to reach the collision point C is then computed as the distance divided by the scalar relative speed of B, i.e., $\Delta t = d_{BC}/|v_{rel}|$, where $|v_{rel}| = \sqrt{v_{rel_x}^2 + v_{rel_y}^2}$. We recall that ego is considered to be stationary, while B moves with a relative velocity obtained as the difference of the velocities of the two objects.
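The computation of $d_{egoB}$, $d_{egoC}$, and ∆t can be sketched as follows. Instead of intersecting two lines explicitly, this sketch projects ego onto the trajectory of B, which yields the same point C and has the sum of squares of the relative velocity components as its only denominator. The names `CollisionScenario` and `analyze_collision`, and the `approaching` flag, are illustrative assumptions rather than identifiers from the released library.

```python
import math
from typing import NamedTuple, Optional, Tuple

Vec = Tuple[float, float]

class CollisionScenario(NamedTuple):
    d_ego_b: float             # current distance between ego and B
    d_ego_c: Optional[float]   # closest distance B would reach from ego
    delta_t: Optional[float]   # time for B to travel from its position to C
    approaching: bool          # True if B moves towards C, False if away

def analyze_collision(ego: Vec, b: Vec, v_ego: Vec, v_b: Vec) -> CollisionScenario:
    d_ego_b = math.hypot(b[0] - ego[0], b[1] - ego[1])
    # Relative velocity: ego is treated as stationary.
    v_rel = (v_b[0] - v_ego[0], v_b[1] - v_ego[1])
    speed_sq = v_rel[0] ** 2 + v_rel[1] ** 2
    if speed_sq == 0.0:
        # Same velocity: no relative motion, C and delta_t are undefined.
        return CollisionScenario(d_ego_b, None, None, False)
    # Parameter s such that C = B + s * v_rel is the point closest to ego.
    s = ((ego[0] - b[0]) * v_rel[0] + (ego[1] - b[1]) * v_rel[1]) / speed_sq
    c = (b[0] + s * v_rel[0], b[1] + s * v_rel[1])
    d_ego_c = math.hypot(c[0] - ego[0], c[1] - ego[1])
    d_b_c = math.hypot(c[0] - b[0], c[1] - b[1])
    delta_t = d_b_c / math.sqrt(speed_sq)
    return CollisionScenario(d_ego_b, d_ego_c, delta_t, s > 0.0)
```

For instance, with ego stationary at (0, 0) and B at (10, 5) moving with velocity (−2, 0), the closest point is C = (0, 5), so $d_{egoC}$ = 5, $d_{BC}$ = 10, and ∆t = 10/2 = 5.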
We note that including acceleration would better characterize the movement of objects; however, since acceleration contributes to the travelled distance quadratically in time, any estimation error would be greatly amplified, introducing unnecessary noise in the final measures.
Finally, note that the object criticality model exhibits some corner cases; for example, when ego and B are moving at the exact same velocity, ∆t is undefined. We treat these rare cases by skipping the object criticality model calculation and setting the criticality values to conservative fallback values.

D. COMPUTATION OF CRITICALITY WEIGHTS
The collision scenario above is used to assign criticality to objects. The idea is inspired by reliability analysis [37], in which quantities like reliability (or safety) are defined in the interval [0, 1]. However, we do not propose probabilities.
Each object B, either identified by the object detector or from the ground truth, is assigned a criticality weight $\kappa(B)$. This weight is obtained by combining three criticality values $\kappa_d(B)$, $\kappa_r(B)$, and $\kappa_t(B)$, as explained later. Note that for a given object B, its criticality $\kappa(B)$ may be different if calculated with its predicted properties (e.g., position and velocity) or the ground truth ones. Furthermore, for some objects, we may have ground truth values only (FNs) or predicted values only (FPs). When needed, we indicate with $\kappa'(B)$ the criticality weight computed with predicted properties of object B, as opposed to $\kappa(B)$ that is calculated based on the ground truth.
The Distance Criticality, $\kappa_d(B)$, is based on the distance $d_{egoB}$ between ego and the object B. This score does not depend on velocity, but only on the position of objects in the scene. We want the score to be maximum when the distance from ego to B is zero, and then decrease to zero when reaching a maximum distance $D_{max} > 0$.
We compute the weight $\kappa_d(B)$ as a second-degree equation (downward parabola) passing through points $(0, 1)$ and $(D_{max}, 0)$. That is, the maximum value is 1.0 when $d_{egoB} = 0$ and it decreases as $d_{egoB}$ increases, reaching 0 when $d_{egoB} = D_{max}$. The parabola shape allows the criticality to decrease non-linearly with respect to the distance: the decrease is slow for values close to zero (i.e., close to the vehicle), and it gets faster when approaching $D_{max}$ (i.e., far from the vehicle). We also need to enforce that $\kappa_d(B)$ is always in the interval [0, 1], and therefore the final equation, with $x = d_{egoB}$ and $Z = D_{max}$, is:

$$\kappa_d(B) = \max\left(0, \min\left(1, 1 - \frac{x^2}{Z^2}\right)\right) \qquad (2)$$

The Collision Distance Criticality, $\kappa_r(B)$, is based on the distance between ego and the potential collision point C. It is an indicator of how close to ego the object is likely to pass. $\kappa_r(B)$ is calculated using the same rationale as $\kappa_d(B)$ (Equation 2), with $x = d_{egoC}$ and $Z = R_{max}$, where $R_{max} > 0$ is the maximum considered collision distance, beyond which the corresponding criticality is zero.
Similarly, the Collision Time Criticality, $\kappa_t(B)$, is based on the time ∆t for B to reach the potential collision point. All other things being equal, this score depends on the (relative) velocity of the object B with respect to ego. This score is again calculated based on Equation 2, with $x = \Delta t$ and $Z = T_{max}$.
The final criticality $\kappa(B)$ is obtained by the combination of the three criticality scores $\kappa_d(B)$, $\kappa_r(B)$, $\kappa_t(B)$. The resulting measure is defined following four requirements: i) it should range in the interval [0, 1]; ii) it should be 0 if all values are zero; iii) it should be 1 if at least one of the values is 1; and iv) it should increase if any of the three values increases.
Inspired again by classic reliability analysis [37], our final criticality weight is then computed as:

$$\kappa(B) = 1 - (1 - \kappa_d(B))\,(1 - \kappa_r(B))\,(1 - \kappa_t(B))$$

The final criticality $\kappa(B)$ is therefore a measure of how close the object is, how close it is likely to pass in the near future, and how much time is available to react.
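A minimal sketch of the three scores and their combination is given below. Two points are assumptions of this sketch rather than statements of the original formulas: the parabola is taken with its vertex at zero, i.e., $1 - (x/Z)^2$ clamped to [0, 1], and the combination is taken in the form used for parallel components in reliability analysis, which satisfies the four requirements listed above. The function names and numeric values are illustrative.

```python
def parabola_criticality(x, z):
    """Clamped downward parabola through (0, 1) and (z, 0).

    Assumed form: 1 - (x / z)^2, limited to the interval [0, 1].
    Used for kappa_d (x = d_egoB, z = D_max), kappa_r (x = d_egoC, z = R_max),
    and kappa_t (x = delta_t, z = T_max).
    """
    if z <= 0:
        raise ValueError("the maximum value z must be positive")
    return max(0.0, min(1.0, 1.0 - (x / z) ** 2))

def combined_criticality(kappa_d, kappa_r, kappa_t):
    """Assumed combination: 1 minus the product of the complements.

    It is 0 if all inputs are 0, 1 if at least one input is 1,
    stays in [0, 1], and does not decrease when any input increases.
    """
    return 1.0 - (1.0 - kappa_d) * (1.0 - kappa_r) * (1.0 - kappa_t)

# Illustrative values only: D_max = 50 m, R_max = 10 m, T_max = 10 s.
k_d = parabola_criticality(25.0, 50.0)   # object 25 m away
k_r = parabola_criticality(5.0, 10.0)    # would pass 5 m from ego
k_t = parabola_criticality(5.0, 10.0)    # in 5 seconds
kappa = combined_criticality(k_d, k_r, k_t)
```

The slow decrease near zero is visible in the first score: at half of the maximum distance, the criticality is still 0.75.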

3) Corner Cases
The following corner cases are considered:
• When ego and B are moving at the exact same velocity (in both dimensions), the resulting relative velocity is zero, and ∆t cannot be computed. We solve this case by setting $\kappa_r$ and $\kappa_t$ to zero.
• The case in which only one component of the relative velocity is zero does not need to be treated differently. The resolution of Section III-C yields a form in which the denominator is the sum of squares of the two components of the velocity. The denominator is thus zero only when both components of the velocity are zero, which is already treated in the previous case.
• In the calculation of ∆t, we need to verify whether the object B is actually moving towards point C, and not on the same line but in the opposite direction. If B is moving in the opposite direction, $\kappa_r$ and $\kappa_t$ are again set to zero.
• In rare cases, where the collision point is particularly far away or the speed is particularly low, the calculation of ∆t may generate an overflow or a not-a-number (NaN) value: in this case $\kappa_t$ is set to 0.1. The rationale is to set it to a low value, but still greater than zero.
• The dataset may contain invalid values, or the detector may not be able to provide estimates. In particular, when we are not able to obtain the velocity of the object, we set $\kappa_r$ and $\kappa_t$ to 1 (their maximum value).
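The corner cases listed above can be sketched as a single guard function that returns $\kappa_r$ and $\kappa_t$. The function name, the argument layout (with `v_rel` set to `None` when the velocity is unavailable), and the clamped parabola $1 - (x/Z)^2$ are assumptions of this sketch; the fallback values 0, 0.1, and 1 are those stated in the list.

```python
import math

def parabola(x, z):
    """Assumed clamped parabola through (0, 1) and (z, 0)."""
    return max(0.0, min(1.0, 1.0 - (x / z) ** 2))

def collision_criticalities(v_rel, approaching, d_ego_c, delta_t, r_max, t_max):
    """Return (kappa_r, kappa_t) after applying the corner cases."""
    if v_rel is None:
        # Velocity not available: assume the maximum criticality.
        return 1.0, 1.0
    if v_rel[0] == 0.0 and v_rel[1] == 0.0:
        # Ego and B move at the exact same velocity: delta_t is undefined.
        return 0.0, 0.0
    if not approaching:
        # B moves along the same line but away from point C.
        return 0.0, 0.0
    kappa_r = parabola(d_ego_c, r_max)
    if delta_t is None or math.isnan(delta_t) or math.isinf(delta_t):
        # Overflow or NaN in delta_t: low value, but greater than zero.
        return kappa_r, 0.1
    return kappa_r, parabola(delta_t, t_max)
```

Note that a relative velocity with a single zero component needs no dedicated branch: it falls through to the regular computation.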

E. SAFETY-AND RELIABILITY-BASED MEASURES
We exploit the above criticality scores to remodel the traditional recall and precision measures, such that they are more oriented towards reflecting the safety and reliability offered by object detectors.

1) Reliability
Reliability measures the continuity of correct service [3]. For a reliable driving task, a good object detector should not predict false positives that correspond to dangerous situations, because they could lead to an interruption of the driving task. For example, false positives may cause unnecessary braking; instead, the continuity of the driving mission may require considering some risks of collision as unavoidable. This clearly conflicts with safety (which aims to minimize risks), but it is widely accepted that safety and reliability have different goals [3] and may be conflicting requirements.
For this reason, we measure the reliability of the detection task through a revised definition of precision. The idea is that false positives penalize the continuity of the driving process, with a greater impact the closer they are, or are likely to be, to ego. We weight TPs and FPs according to the criticality $\kappa(B)$ of the associated object B. In simpler words, when a non-existing object is detected, we do not add 1 to the count of FPs, but instead we add its criticality; the same applies to TPs.
For a correctly detected object we may use the criticality computed either using the ground truth ($\kappa$) or the predicted values ($\kappa'$): we use ground truth values at the numerator, and predicted values at the denominator. The idea is that the detector might detect a greater criticality (denominator) than what is actually present (numerator), which reduces reliability of the driving task. Also, clearly we do not have ground truth values for FPs, because those objects do not exist.
We can then define the reliability-weighted precision as:

$$P_R = \frac{\sum_{B \in TP^*} \kappa(B)}{\sum_{B \in TP^*} \kappa'(B) + \sum_{B \in FP^*} \kappa'(B)}$$

where $TP^*$ is the set of true positive objects, and $FP^*$ is the set of false positive objects. Note that $P_R$ may in principle rise above 1, in case the detected criticality is significantly lower than the ground truth. To be consistent with the classic definition of precision, we limit the maximum value of $P_R$ to 1.
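A minimal sketch of the reliability-weighted precision follows, under the reading given in the text: ground truth criticalities of TPs at the numerator, predicted criticalities of TPs and FPs at the denominator, and the result limited to 1. The function name, the input layout, and the value returned for an empty denominator are assumptions of the sketch.

```python
def reliability_weighted_precision(tp_pairs, fp_predicted):
    """Compute P_R from criticality weights.

    tp_pairs: list of (kappa_gt, kappa_pred) for each true positive.
    fp_predicted: list of kappa_pred for each false positive
                  (no ground truth exists for these objects).
    """
    numerator = sum(k_gt for k_gt, _ in tp_pairs)
    denominator = sum(k_pred for _, k_pred in tp_pairs) + sum(fp_predicted)
    if denominator == 0.0:
        return 0.0  # assumed convention when no criticality was predicted
    return min(1.0, numerator / denominator)
```

For example, two TPs with (ground truth, predicted) criticalities (0.8, 0.6) and (0.4, 0.5) plus one FP with predicted criticality 0.9 give 1.2 / 2.0 = 0.6. A highly critical FP therefore lowers $P_R$ more than a distant, harmless one.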

2) Safety
Safety is instead the absence of catastrophic consequences [3].
To ensure safety, the object detector must detect as many as possible of the dangerous objects, even at the cost of raising some false alarms. A safety measure should then reflect how much of the existing criticality has been detected by the object detector. The proposed measure is adapted from the recall, using the ground truth values at the denominator and the detected values at the numerator. Clearly, we do not have predicted values for FNs, which are objects that have been missed. Therefore, we define the safety-weighted recall as:

$$R_S = \frac{\sum_{B \in TP^*} \kappa'(B)}{\sum_{B \in TP^*} \kappa(B) + \sum_{B \in FN^*} \kappa(B)}$$

where $TP^*$ is the set of true positive objects and $FN^*$ is the set of false negative objects. As for $P_R$, we limit the maximum value of $R_S$ to 1.
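The safety-weighted recall can be sketched in the same style, under the reading given in the text: predicted criticalities of TPs at the numerator, ground truth criticalities of TPs and FNs at the denominator, and the result limited to 1. The function name, the input layout, and the value returned for an empty denominator are assumptions of the sketch.

```python
def safety_weighted_recall(tp_pairs, fn_ground_truth):
    """Compute R_S from criticality weights.

    tp_pairs: list of (kappa_gt, kappa_pred) for each true positive.
    fn_ground_truth: list of kappa_gt for each false negative
                     (no predicted values exist for missed objects).
    """
    numerator = sum(k_pred for _, k_pred in tp_pairs)
    denominator = sum(k_gt for k_gt, _ in tp_pairs) + sum(fn_ground_truth)
    if denominator == 0.0:
        return 0.0  # assumed convention when the scene has no criticality
    return min(1.0, numerator / denominator)
```

For example, two TPs with (ground truth, predicted) criticalities (0.8, 0.6) and (0.4, 0.5) plus one missed object with ground truth criticality 1.0 give 1.1 / 2.2 = 0.5: missing a single, highly critical object halves the score.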

3) Critical Average Precision
The proposed criticality values depend on three parameters, namely $D_{max}$, $R_{max}$, and $T_{max}$. We can compute $P_R$ and $R_S$ for different values of these parameters, to understand their evolution when different subsets of objects are considered. In analogy to the precision-recall curve (see Section II-A), this allows computing several $P_R$-$R_S$ curves, one for each combination of values $(D_{max}, R_{max}, T_{max})$; consequently, we can compute the Critical Average Precision $AP_{crit}$ from each of the $P_R$-$R_S$ curves, based on our definitions of $P_R$ and $R_S$.
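The sweep over parameter combinations can be sketched as follows. The callable `curve_fn` is hypothetical: it stands for whatever evaluation routine returns, for a given $(D_{max}, R_{max}, T_{max})$, the $R_S$ and $P_R$ values obtained at increasing recall over the detection thresholds. The average precision helper applies the summation formula of Section II-A with the 0.1 cutoff; the parameter values in the usage lines are illustrative only.

```python
from itertools import product

def average_precision(recalls, precisions, floor=0.1):
    """AP = sum_n (R_n - R_{n-1}) * P_n, skipping points below the floor."""
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        if r >= floor and p >= floor:
            ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def critical_ap_grid(curve_fn, d_max_values, r_max_values, t_max_values):
    """Compute AP_crit for every (D_max, R_max, T_max) combination.

    curve_fn(d_max, r_max, t_max) -> (rs_values, pr_values), a hypothetical
    callable returning one P_R-R_S curve for the given parameters.
    """
    results = {}
    for d_max, r_max, t_max in product(d_max_values, r_max_values, t_max_values):
        rs_values, pr_values = curve_fn(d_max, r_max, t_max)
        results[(d_max, r_max, t_max)] = average_precision(rs_values, pr_values)
    return results

def toy_curve(d_max, r_max, t_max):
    """Placeholder curve used only to show the call pattern."""
    return [0.5, 1.0], [1.0, 0.5]

grid = critical_ap_grid(toy_curve, [10, 20], [5], [4, 8])
```

Each entry of `grid` can then be used to rank detectors for the setting described by its key.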
Depending on the driving scenario and the intended system in which the object detector is deployed, different values of $D_{max}$, $R_{max}$, and $T_{max}$ may be favored. For example, an object detector which is very good on $P_R$ could be safely used on a highway under low traffic conditions; but if it is not good on $R_S$, it should not be used in an urban scenario, where cars may approach from different directions at essentially any angle.

IV. CASE STUDY ON THE NUSCENES DATASET

A. DATASETS AND SELECTED OBJECT DETECTORS
To exercise the object criticality model, we choose the nuScenes dataset for the following reasons: i) it is very recent and extensive, forged with the latest sensor technology; ii) very recent object detectors are available; iii) it includes all the necessary information to apply the object criticality model presented in Section III.
NuScenes [5] is a recent large-scale dataset for autonomous driving that reports scenes collected from a vehicle. The dataset comprises 1000 scenes, each being 20 seconds long and fully annotated with 3D bounding boxes. Keyframes are sampled every 0.5 seconds; five intermediate frames are collected between keyframes.
Following common practices in datasets of this kind [15], [35], nuScenes defines an object detection task and proposes related measures to officially rank object detectors on its website. The detection task in nuScenes consists in predicting the objects at each keyframe time t, using sensor data collected in the interval (t − 0.5, t] seconds (five intermediate frames). Detectable objects are all objects within 50 meters from ego and with line of sight. For each object, ground truth 3D bounding boxes, attributes (e.g., orientation), and velocities are provided. A detection is successful if the distance between the centers of the predicted and ground-truth bounding boxes is less than a distance limit l; four different values of l are considered, which are l ∈ {0.5, 1, 2, 4} meters. For brevity of the discussion, the only objects we consider are cars.
We select nine 3D object detectors from the zoo of mmdetection3d [9], an open-source object detection toolbox based on PyTorch for 3D detection. We present the object detectors below; each detector is matched to an acronym to easily distinguish it in the rest of the paper.
FCOS [40] and its evolution PGD [41] use visual cameras only. The backbone is a pretrained ResNet101 with deformable convolutions [11]. The neck is the Feature Pyramid Network (FPN, [24]), which generates a pyramid of feature maps. The head that produces the final predictions (object class, location, etc.) follows an approach similar to RetinaNet [25], which applies shared heads to detect multiple targets. The PGD head also includes a branch to improve depth estimation.
The other seven object detectors (see Table 1) process lidar point clouds and are based on the Pointpillars [23] network. Pointpillars is well known for both its speed and its accuracy. It exploits an encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented bounding boxes for objects. The Pointpillars network consists of three main stages: i) a feature encoder network that converts a point cloud to a structured representation, namely a sparse pseudo-image; ii) a 2D convolutional backbone that processes the pseudo-image into a high-level representation, extracting the feature maps on which the rest of the network operates; and iii) a detection head that detects and regresses 3D bounding boxes. We consider seven alternatives based on Pointpillars; essentially, they use the pillar-based method from [23] to convert the point cloud into a sparse pseudo-image, and differ from [23] by applying different backbones and, optionally, different necks and heads.

B. IMPLEMENTATION OF THE OBJECT CRITICALITY MODEL
We execute all the object detectors on the nuScenes validation set [5], which consists of 150 frame sequences of 20 seconds each, and obtain exactly the same results that their authors report at [9]. This confirms that our setup of mmdetection3d is correct. The implementation of our object criticality model exploits the development kit of nuScenes, which is available under an open-source license. For example, the ranking of object detectors available at the nuScenes website [29] is computed using the code of this library, but on a different test set, whose ground truth information is not released to the public. We extended the development kit so that it computes the measures from our object criticality model alongside the usual measures of the nuScenes object detection challenge. We also compute and plot the analogue of the precision-recall curve, but with our criticality-oriented measures P R and R S. The resulting library is released as open source at [7]. Its usage is straightforward: it is sufficient to have a working installation of nuScenes-dev and to replace the corresponding files of that installation with the files in [7]. The set of results will then appear enriched with our measures. Therefore, any object detector whose output is compatible with nuScenes can also be evaluated using our library. We used the nuScenes development kit v1.1.2, and we tested for compatibility up to v1.1.7. The release at [7] includes tutorials and a usage example, which allow repeating our experiments from the execution of the mmdetection3d object detectors to the computation of results.

V. EXPERIMENTS AND RESULTS
We execute the nine object detectors on the dataset previously described. We compute AP crit, P R and R S for different values of D max, R max, and T max. More specifically, we consider several configurations (D max, R max, T max), with D max ∈ {5, 10, . . . , 50} meters, R max ∈ {5, 10, . . . , 50} meters, and T max ∈ {2, 4, . . . , 30} seconds. Since distance is measured starting from the center of ego, a distance of 5 meters includes only vehicles very close to ego; 50 meters instead is the maximum distance from ego that is considered in the nuScenes object detection challenge, where objects farther than 50 meters from ego are ignored. Overall, this leads to 10 × 10 × 15 = 1500 configurations (D max, R max, T max), repeated for each object detector.
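The configuration grid described above can be enumerated with a few lines of Python. This is only a sketch to make the count explicit; the variable names are hypothetical and do not come from the released library.

```python
from itertools import product

# Parameter ranges as stated in the text.
d_max_values = range(5, 51, 5)   # D max in {5, 10, ..., 50} meters -> 10 values
r_max_values = range(5, 51, 5)   # R max in {5, 10, ..., 50} meters -> 10 values
t_max_values = range(2, 31, 2)   # T max in {2, 4, ..., 30} seconds -> 15 values

# Every (D max, R max, T max) triple evaluated for each object detector.
configurations = list(product(d_max_values, r_max_values, t_max_values))

print(len(configurations))  # 10 * 10 * 15 = 1500
```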

A. APCRIT AND RANKING OF OBJECT DETECTORS
First, we calculate the rankings of detectors based on AP crit for all the 1500 configurations (D max, R max, T max). Many of them produce a ranking different from the one based on AP. For example, consider l = 0.5 and l = 4. When l = 0.5, the ranking calculated with AP crit does not match the AP ranking for 567 out of 1500 configurations; for each of these 567 configurations, the difference with respect to the AP ranking is 2 or 4 positions. The whole set of object detectors may change position with respect to the AP ranking, with the exception of the detectors in the 7th and 8th positions, which are always PGD and FCOS, respectively. For l = 4, the ranking changes in 1425 out of 1500 configurations, and all the object detectors may change position, including FCOS performing better than PGD. In Table 2, we compare the AP and AP crit ranking of the nine object detectors, for exemplary configurations (D max, R max, T max). Notably, the object detector with the highest AP, REG1.6, is outperformed by SSNREG and also by others when we consider AP crit.

TABLE 2: AP and AP crit of car detection, for nine object detectors and l ∈ {0.5, 1, 2, 4}, ordered by AP. AP crit is computed with (D max, R max, T max) amongst the configurations that reported the highest differences between the AP crit and AP rankings. Ranking differences are in bold. Panel (a): l = 0.5, (20, 20, 8).
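The ranking comparison discussed above can be sketched as follows. The detector names and scores are placeholders invented for illustration (they are not values from Table 2), and counting the detectors whose position differs is only one possible reading of "difference in positions".

```python
def ranking(scores):
    """Return detector names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

def position_changes(baseline_scores, alternative_scores):
    """Positions gained (positive) or lost (negative) by each detector
    when moving from the baseline ranking to the alternative ranking."""
    baseline = ranking(baseline_scores)
    alternative = ranking(alternative_scores)
    return {name: baseline.index(name) - alternative.index(name)
            for name in baseline}

# Placeholder scores for three hypothetical detectors.
ap = {"det_a": 0.80, "det_b": 0.75, "det_c": 0.70}
ap_crit = {"det_a": 0.85, "det_b": 0.90, "det_c": 0.72}

changes = position_changes(ap, ap_crit)
# Number of detectors whose position differs between the two rankings.
changed_count = sum(1 for delta in changes.values() if delta != 0)
```

With these placeholder scores, `det_b` overtakes `det_a` under the alternative measure, so two detectors change position while `det_c` keeps its place.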
To explore trends of AP crit, we select representative examples. In Figure 2 we show the AP crit values of object detectors REG1.6 (AP = 0.874) and PGD (AP = 0.703) with l = 2 and D max = 25, for different R max and T max. The AP crit of both REG1.6 and PGD is higher than the respective AP under the considered configurations. In fact, setting D max = 25 reduces the impact of objects farther than 25 meters, which contribute significantly to misdetections.
Next, we pick the object detector REG1.6 with l = 2.0. In Figure 3 we show the AP crit when R max = 20; the figure shows that the highest AP crit values are achieved when D max is in the range [20, 30]. This is possibly because setting D max very low excludes many "easy" (i.e., close) objects from the relevant ones, thus deteriorating AP crit. Conversely, when D max becomes much greater than R max, many distant but not relevant objects are included, which are unlikely to reach a collision point closer than R max. In the lower part of the z axis, AP = 0.874 is represented as a flat grey surface in the figure. Figure 3 shows that AP crit is in general higher than AP. This is expected, because AP crit gives less weight to objects that are harder to detect, e.g., those farther from ego. In general, higher values of AP crit are achieved with low values of R max and T max; for both REG1.6 and PGD, the maximum AP crit values are obtained with (D max, R max, T max) = (25, 5, 2). Intuitively, low R max and T max reduce the number of vehicles considered in our analysis: only those that are really critical for the detection are included. Analogous observations can be derived with the other configurations and object detectors.
We remark that, while studies like Figure 2 and Figure 3 are effective to explain the proposed AP crit measure, the most suitable configuration (D max , R max , T max ) should be decided based on the requirements of the target application, and then the object detector with the highest AP crit for such configuration should be selected.

B. TRADEOFF BETWEEN PR AND RS
To discuss the relations between P R and R S, we rely on Figure 4 and Figure 5, where we use SSNREG with l ∈ {1, 4}. We compute P R as a function of R S, and P as a function of R, at steps of 0.01 starting from 0.85. Red crosses represent precision-recall pairs (P, R). Black dots represent (P R, R S) pairs; these are computed for each configuration (D max, R max, T max), thus yielding 1500 black dots for each R S value. The large blue dots are the (P R, R S) values achieved using SSNREG with the configurations from Table 2b and Table 2d, while the green triangles correspond to the configuration leading to the highest AP crit, which is (25, 5, 2).
We investigate the relations between P R and R S for high values of R S (safety-weighted recall), which are of particular interest in the reference domain of this work.This way we can study the P R (reliability-weighted precision) that we achieve when safety is enforced thanks to a high R S .This corresponds to answering the question "given a safety target on the detection, what is the possibility of driving the car with good mission reliability, i.e., without being forced to interrupt the driving continuously because of false positives?".Of course, the safest condition would be R S = 1, but P R is typically 0 in such cases; still, a very high R S is necessary to enforce safety of the detection.
When the recall R increases, the precision P quickly drops to 0. SSNREG can offer a high recall, i.e., a high ability to detect all the objects, only at the cost of many false positives: this is clearly of little or no use in practice. Instead, if we restrict the scope of the object detector thanks to our object criticality model, we reach different conclusions. For example, consider again the case l = 1 (Table 2b). Even with R S ≥ 0.9, there are some configurations in which P R > 0.8, which is a much more reassuring result: it shows that the detection can be trusted at least to some extent.
On the other hand, the best-performing triples, represented with the green triangles in Table 2b, may not be practical, because they are computed with small spatial and temporal distances of the objects from ego. In summary, when we apply the criteria of R S and P R, our conclusions on SSNREG can be very different from those we reach using P and R.

C. EXPLANATION OF DISTANCE CRITICALITY κ(B)
The objective of this analysis is to explain the inner details of the object criticality model, even if P R, R S, and AP crit are sufficient to describe the performance of the object detection. We rely on bird's-eye views of selected nuScenes frames to explain, in a very practical way, how our object criticality model computes κ(B). We consider the PGD and SEC object detectors, but all nine detectors lead to similar conclusions.
Figure 6 to Figure 9 are produced with the nuScenes development kit v1.1.2, modified to visualize the values of κ(B), κ d (B), κ r (B), and κ t (B). The axes represent distances, in meters. The ego is always located in the center at the (0, 0) coordinates and is oriented along the y-axis (heading towards the top). The other vehicles are represented as rectangles, and the front side is indicated by a small segment. The ground truth (real position and orientation of cars) is in green, while the detected cars are in blue. In the ideal case of a perfect object detector, blue and green rectangles would overlap. Both ground-truth and detected vehicles have an associated value, which is either κ(B), κ d (B), κ r (B), or κ t (B) depending on the figure. We add text labels and red circles to improve readability.
We first consider PGD with D max = 30, R max = 20, T max = 8.0.Very intuitively, this setting says that it is critical to detect vehicles that are within 30 meters, and/or that are in colliding trajectories within 20 meters in the next 8 seconds.
We start from Figure 6. A car is very close to ego, but it is not detected: it has been assigned κ(B) = 0.98. This car is located at the center of the diagram, and it is circled in red. Another one is very close, but in "a less dangerous" situation: κ(B) = 0.89. A third one is within D max = 30 meters, but headed in a different direction, so it gets a mild criticality score κ(B) = 0.60. Instead, there are other less critical missed detections in the upper and lower parts of the image. These are farther than D max = 30 meters, and are headed in non-colliding trajectories: these are irrelevant, so they are worth κ d (B) = 0.00.
Similarly, Figure 9 shows the values of κ t (B). Cars that may collide with ego within T = 10 seconds are assigned κ t (B) > 0. The velocities of ego and of each car are a determining factor in assigning the criticality κ t (B): vehicles that are relatively close and in a colliding trajectory may still have κ t (B) = 0 if they are not expected to reach the collision point within T = 10 seconds. In the red circle, there are two opposite examples of detected cars in colliding trajectories with ego, one with κ t (B) = 0 and one with κ t (B) = 0.96 (note that the latter is a false positive).

VI. CONCLUSIONS AND FUTURE WORKS
We argue that the most used measures for object detection do not match the demands and peculiarities of a safety-critical system. Within the autonomous driving domain, currently adopted measures typically describe how good an object detector is at detecting all the objects on the scene, while instead, for the purpose of an autonomous driving system, we are interested in detecting all the objects that will likely interfere with the driving task of the vehicle.
To this end, we show that the state-of-the-art evaluation of object detectors does not consider the possible role of the objects in a specific scene, and in particular with respect to the driving task of the vehicle performing the detection.
Consequently, we propose novel measures that take into account the concepts of safety (detection of dangerous objects, which require immediate reaction, should be prioritized) and reliability (misdetections should not severely disrupt the continuity of the driving task). We build and exercise an object criticality model that performs a rating of the objects, based on the distance from the subject vehicle, the possible colliding trajectory, and the expected time to collision. Amongst the main results, we show that our judgment on the performance of object detectors may be very different when we consider the detection of i) everything on the scene (as it is usually done), or ii) only the relevant items. Depending on which of the two cases is of interest, we may end up choosing different object detectors. Further, we show that object detectors with high performance under case i) can be less competitive in case ii), and vice versa.
Last, an important implication of our object criticality model is that, when safety and reliability issues are considered, the selection of the most suitable object detector strictly depends on its desired use, i.e., on the requirements of the target application. Starting from application requirements, the desired configuration of our object criticality model is identified, measures are computed, and the most suitable object detector is selected.
Notably, our analysis is not meant to prove that the evaluated object detectors are safe and reliable. Rather, it shows how the object criticality model allows establishing sound parameters that can be used to build, assess and tune object detectors for their application in safety-critical domains.
We remark that object detection in complex scenarios is still an open research topic that makes improvements every year [2], with new detectors that are proposed continuously; however, defining new object detectors, or assessing the most up-to-date object detectors, is beyond the scope of this paper.
As future work, we are working towards training an object detector whose goal is to maximize AP crit. More precisely, the objective is to train the detector to maximize R S and P R under a specific configuration, rather than R and P. Intuitively, the object detector is trained to favor the detection of objects that are relevant (close and in colliding trajectories), and it is expected instead to be far less effective in the detection of objects that are not relevant for the driving task and that do not interfere with the elaboration of the trajectory of ego. Practically, this can be realized by a proper training phase, where the usual loss measurement approach is modified according to the principles and measures established in this work.

FIGURE 1: Geometrical representation of the main elements of our object criticality model.

FIGURE 3: AP crit measured on REG1.6 with l = 4.0 and R max = 20, for the different D max and T max .

FIGURE 4: P R , R S , P and R for SSNREG when R S ≥ 0.85 and R ≥ 0.85, with l = 1.

FIGURE 5: P R , R S , P and R for SSNREG when R S ≥ 0.85 and R ≥ 0.85, with l = 4.

FIGURE 6: κ(B) = 0.98 and κ(B) = 0.89 for two dangerous missed detections from PGD. Best viewed in color.

Next, we explore the contribution of the distance criticality κ d (B). We consider SEC with D max = 15, R max = 20, T max = 10. Figure 7 shows that cars farther than 15 meters from ego are assigned κ d (B) = 0; the closer a car is to ego, the higher its κ d (B) value. The red circle has a radius of approximately 15 meters: vehicles outside the circle have κ d (B) = 0.
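The qualitative behaviour of κ d (B) described above (zero beyond D max, increasing as the object gets closer to ego) can be illustrated with a small sketch. The exact definition of κ d (B) is the one given in Section III; the linear ramp below is only a placeholder assumption chosen to reproduce that qualitative shape, and the function name `distance_criticality` is hypothetical.

```python
def distance_criticality(distance, d_max):
    """Placeholder distance criticality (ASSUMPTION: linear ramp).

    Returns 1.0 at distance 0, decreases linearly with distance,
    and is 0.0 at or beyond d_max. This is not necessarily the
    formula used in the paper; it only mimics the described behaviour.
    """
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return max(0.0, 1.0 - distance / d_max)

# With D max = 15 meters, as in the SEC example:
for d in (0.0, 7.5, 15.0, 20.0):
    print(d, distance_criticality(d, 15.0))
```

Under this assumption, any vehicle outside the 15-meter circle receives a value of 0, consistent with the description of Figure 7.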


Figure 8 shows the values of κ r (B) for the same scene and settings as Figure 7. Cars with a trajectory passing closer to

FIGURE 8: κ r (B) computed for SEC with D max = 15, R max = 20, T max = 10.The red circle is an area of approximately R max = 20 meters from ego.Best viewed in color.

TABLE 1: The seven lidar-based object detectors in use.