Introduction
Human drivers successfully complete their driving tasks by 1) being aware of their current situation, including their steering angle, speed, location, and acceleration; 2) sensing the conditions of surrounding obstacles; 3) formulating a future course of action that ensures their safety; and 4) operating the steering wheel and brakes to control the vehicle. However, human errors such as distraction, fatigue, speeding, breaking traffic laws, and poor judgment of other drivers' actions cause the majority of road accidents.
A 2018 study by the National Highway Transportation Safety Administration (NHTSA) [1] states that about 94% of all car accidents are caused by human errors. Safety experts have since disputed this statistic. Nevertheless, research has confirmed that human failure is the main cause of road accidents and that the introduction of some degree of automation helps to reduce accident rates. Assistance and partly automated systems may compensate for weaknesses in human capabilities and increase safety in routine driving through supervision, warnings, and lateral or longitudinal support [2]. For example, the number of accidents caused by driver error and skidding fell from about 2.8 (per 1000 cars) in 1998/1999 to 2.21 in 2000/2001, after Mercedes-Benz introduced Electronic Stability Control (ESC) as a standard in all cars [2].
Automating the driving task offers significant potential to reduce such errors and thereby improve road safety, collision avoidance, profitability, and traffic management [3]. The Society of Automotive Engineers (SAE International) defines six levels of driving automation, from no driving automation (Level 0) to full driving automation (Level 5) [4]. While Level 1 corresponds to basic driver assistance, such as using cruise control on highways, Levels 2 and higher include ADAS (Advanced Driver-Assistance System) features, where sensors and a computer are used to sense and analyze the surroundings and make decisions based on the proximity of objects. Within Level 3 the vehicle can handle most of the driving tasks, but the driver must still be ready to take control in certain situations. Level 4 vehicles can operate in self-driving mode without human interaction in most circumstances, but a human still has the option to manually override. The difference from Level 3 is that Level 4 vehicles can retain control in case of an anomaly or system failure. At Level 5, the vehicle does not require any human intervention, becoming a truly autonomous vehicle (AV).
Globally, these systems are being explored in order to realize their enormous potential, resorting to sensors like LiDAR (Light Detection And Ranging), cameras, ultrasound/sonar, RADAR (Radio Detection and Ranging), and GPS (Global Positioning System) to extract information from the surrounding environment [5]. Inertial measurement units (IMU) are also used to measure the vehicle's linear acceleration and angular velocity, providing information on the vehicle's current location and orientation (relative to a known starting location).
Among all these sensors, LiDAR is currently the one receiving the most attention from industry. LiDARs show fast response, high resolution and accuracy, and high surface sample density; they can be used day and night and are economically accessible. According to a report by Grand View Research [6], the global LiDAR market size was worth US
A. Data Processing and Navigation
The control systems translate this sensory data into a two- or three-dimensional representation of the environment, determine the best navigation routes after identifying other vehicles, cyclists, pedestrians, traffic signs, stop signs, and generic obstacles, and manage the vehicle's longitudinal and lateral motions simultaneously [7].
Recent developments in image processing and machine learning techniques make it simpler to implement these tasks. Object detection and tracking [8], object classification [9], semantic/instance segmentation [10], and localization [11] are among the most useful operations for the perception of the vehicle's surroundings. The perception and motion planning modules are the most challenging to implement. The main function of the perception module is to comprehend/abstract the environment by processing data from sensors [12].
Within ADAS, object detection is a computer vision approach that enables the recognition and localization of objects in an image captured using a camera and/or a LiDAR [13]. With the inclusion of features for identification and localization, object detection can be used to identify, localize, and count objects in a scene and label them appropriately. The process of tracking objects involves taking a collection of initial objects, giving each one a unique identifier (ID), and then following each object as it moves across the frames of a video while keeping its assigned ID. Object classification is a part of object detection that classifies the objects found in the image. An advanced form of image segmentation, called instance segmentation, deals with locating individual instances of objects and delineating their boundaries [14].
Object detection, tracking, and classification are all tasks performed by a perception module, which serves as the basis for driving assistance and for planning the AV's future motion. The localization of an AV requires observing its own state, including location, speed, and momentum; for an approximate state estimation, one can resort to a GPS system [15]. Perception entails monitoring the state of the nearby obstacles, such as their position, speed, momentum, and class. To identify, categorize, and track the nearby obstacles, researchers have proposed various machine learning algorithms for the analysis of data collected from LiDARs, RADARs, GPS, and cameras. In order to travel safely in a challenging environment, an AV plans its subsequent decisions using knowledge about its surroundings. An extremely difficult problem is the motion planning (or, more accurately, trajectory planning) of the vehicle, which entails determining the vehicle's future states (location, speed, and velocity) in continuously changing traffic conditions. The motion planning module uses the present and predicted future states of the surrounding obstacles to ensure the vehicle's safe and effective movement through traffic, and extreme caution must be taken to prevent accidents. The tricky task of environmental perception could be greatly simplified by wireless communication among all vehicles on the route; nevertheless, this scenario would only be possible if all vehicles on the road were interconnected through wireless links. Hence, the performance and effectiveness of the ADAS core modules determine the AV's safety.
Different AVs employ various types of perception algorithms and sensors: LiDARs are used by certain developers, while cameras are the primary sensors for others. As a result, the choice of sensors directly affects how the environment is perceived. An AV's performance is mostly dependent on the perception algorithms used to process the data provided by the sensors. To ensure public acceptance, AVs' driving behavior must resemble that of human drivers. To achieve this, the performance of the perception algorithms, which depends on diverse parameters, must be accurate; for example, object detection algorithms depend on the size of the dataset, the correctness of the respective labels, the accuracy of the sensor devices, and the model hyper-parameters, while the accuracy of semantic segmentation depends on individual pixels and their correlation with neighboring pixels. For the sake of safety, the ultimate objective is to minimize the probability of occurrence of false positives (an outcome where the model predicts the occurrence of an event that did not actually occur) and false negatives (a non-detected occurrence that actually occurred).
This paper presents a review of different metrics used to measure the performance of perception algorithms, including object detection & object tracking, semantic & panoptic segmentation, and metrics used for the evaluation of LiDAR sensors. Besides identifying the various metrics being proposed, their performance and influence on models are also assessed, after conducting new tests or reproducing the experimental results of the reference under consideration. The metrics used to assess perception algorithms can be split into the following four categories:
Point Cloud: A three-dimensional set of measurements of the vehicle's surroundings acquired by the LiDAR.
Object detection: List of detected objects where each one has been assigned a class. The measured metrics are accuracy, precision, recall, F1-score, Intersection Over Union (IOU), area under the so-called receiver-operating characteristic (ROC) curve, Planning KL-Divergence (PKL), Timed Quality Temporal Logic (TQTL), and Spatio-temporal Quality Logic (STQL).
Object tracking: Is the process of estimating each identified object’s position, dimensions, velocity, and respective class. The used metrics are multiple object tracking (MOT) accuracy and MOT precision.
Semantic Segmentation: A point cloud is segmented into subgroups to facilitate further processing or analysis of each segment. Upon segmentation, labels are assigned to pixels to identify objects, pedestrians, and other important elements in the point cloud. To assess it, Dice coefficients, precision, and recall are used.
B. Review on Perception Algorithms
Numerous studies have been conducted to date that look into different facets of autonomous vehicle technology [16]. To the best of our knowledge, none of these studies offers a comprehensive view of the metrics used to assess the performance of perception algorithms for AVs; instead, the majority of them concentrate on just one aspect of AVs. The authors of [17] provide a review of AVs in terms of hardware architectures, simulation software, deep learning models, and computational resources used up to 2023. A study of the algorithms and hardware used in AVs' visual perception systems is presented in [18]. A survey on the applications of AI techniques in the creation of AVs is given in [19], including virtual & augmented reality, high performance computing, big data, and advancements in 5G communication for AVs. Stages of development, obstacles, and trends for the practical implementation of an energy management plan for AVs based on connected and intelligent technologies are given in [20]. Security, privacy, and trust are among the most important issues in the AV domain, and a review of various technologies covering these issues, such as information and communication technology, Blockchain, and AI, is given in [21] and [22]. A thorough analysis of the literature on the factors influencing the use of AVs can be found in [23]. An analysis of recent advances in obstacle detection technologies is presented in [24]. A description of different sensors and deep learning models used for obstacle detection can be found in [25]. Overviews of sensor technologies and sensor fusion for AV perception are provided in [26] and [27], respectively. To increase road safety, AVs must have a solid, reliable perception, so the authors in [28] outline recent developments, suggest potential avenues for future research, and list the benefits and drawbacks of various sensor and localization/mapping algorithm configurations; they also discuss open issues such as detection certainty, illumination and weather conditions, and sensor fault detection, as well as other AV-related difficulties, including algorithm effectiveness, reliance on prior data, and public perception. Principles, issues, and developments in automotive LiDAR and perception systems for AVs are discussed in [29], and an examination of usual procedures and new technologies is provided in [30]. Other reviews focused on AV applications have been published on motion planning [31], object detection [32], and semantic segmentation [33] techniques, on the analysis of deep learning methods for semantic segmentation of images and videos [34], and on deep learning-based image recognition [35].
The rest of the paper is organized as follows: Section II describes the search strategy adopted to retrieve relevant sources and publications. Performance indicators that have been adopted for LiDAR devices are given in Section III. Section IV elaborates on the metrics that have been proposed to evaluate object detection. Section V provides an overview of the benefits and restrictions of the current performance indicators for object tracking. In Section VI, semantic, instance, and panoptic segmentation are introduced, and the respective metrics are described. Section VII gives a theoretical and practical explanation of the metrics with their respective models. Finally, Section VIII provides a summary of the paper and highlights the main conclusions as well as new developments to be considered.
The Adopted Search Strategy
A comprehensive literature review was made based on articles published in international journals and conferences between 2013 and 2023. This review is mainly focused on metrics for perception algorithms, looking at the critical academic publications indexed in the Science Citation Index (SCI), Science Citation Index Expanded (SCIE), and Scopus. Conference articles presented at well-known organizations, universities, and platforms under the umbrella of IEEE, Springer, and Elsevier and indexed by Scopus were taken into account. Three database sources were explored for relevant articles, mainly IEEE Xplore, Scopus, and Google Scholar. These three sources mainly cover articles published by IEEE, Inderscience, IGI Global, MDPI, Wiley, and ScienceDirect. A combination of several keywords was used to search for relevant articles, for example, "perception algorithm metrics", "autonomous vehicle metrics", "object detection metrics", "object tracking metrics", "semantic segmentation metrics", and "panoptic segmentation metrics". In addition, different keywords were used depending on the technology used for perception algorithm metrics. Some publishers reserve a few journals, books, or special issues that cover the main technologies related to autonomous vehicles. For example, ScienceDirect launched a journal in 2021 with the title "Autonomous Vehicles", Springer maintains a journal with the title "Autonomous Intelligent Systems", Wiley holds an open access book with the title "Autonomous Vehicles: Using Machine Intelligence", and IEEE publishes the "IEEE Transactions on Intelligent Vehicles" journal. With the "perception algorithm" keyword alone, 2,153 publications were found in the IEEE database, but with the combination of another keyword ("perception algorithm + metrics"), the count was reduced to 798. Applying year-wise filtering further reduced the count to 254. In this way, irrelevant and incomplete publications were filtered out. Also, book chapters, case reports, and letters were disregarded.
A. Filtering Process
Five selection criteria were used to collect relevant articles for this review. Those are:
The title and abstract of the articles were checked against the stated eligibility criteria. Duplicates and publications that did not match the basic inclusion criteria were eliminated.
To guarantee that the included articles were most relevant to today’s perception algorithm metrics, only publications from 2013 to 2023 were considered.
To attract more readers, only publications written in English were included.
Publications that were unavailable or lacked a full text or abstract were also discarded.
Publications relevant to state-of-the-art technologies were included.
Metrics for LiDAR Point Cloud
To measure a LiDAR’s point cloud performance, the point cloud distance is calculated by finding the minimum Euclidean distance between equivalent points in a reference cloud and in the captured point cloud. There are four distance metrics:
Hausdorff Distance (HD): The largest of all Euclidean distances between any two points (x, y) in different point clouds [37]. More formally, the HD between P and Q is a maximin function, defined as in eq. 1:\begin{equation*} HD(P, Q)=\max \left \{{\underset {x\in P}{\mathrm {sup}}\underset {y\in Q}{\mathrm {inf}}d(x,y), \underset {y\in Q}{\mathrm {sup}}\underset {x\in P}{\mathrm {inf}}d(x,y) }\right \} \tag{1}\end{equation*}
where x and y are points of P and Q, respectively, d(x,y) is the Euclidean distance between x and y, and 'sup' and 'inf' denote the supremum and infimum.
Modified Hausdorff Distance (MHD): A modified version of HD proposed in [38] that uses the mean of the minimum distances between the two sets of points, summed over both directions; it is less prone to outliers. The MHD was found after extensive research into 24 different distance measures and their behavior in the presence of noise.
Chamfer Distance (CD): When two point clouds are evaluated using the Chamfer Distance, the distance from each point in one cloud to its closest point in the other cloud is taken into consideration. For each point in either cloud, CD locates the closest point in the other point set and adds the square of that distance. The CD between two point clouds P and Q is given in eq. 2:\begin{align*} \mathrm {CD}\left ({P, Q}\right)&=\frac {1}{\left |{P}\right |} \sum _{x \in P} \min _{y \in Q}\|x-y\|_{2}^{2} \\ &\quad +\frac {1}{\left |{Q}\right |} \sum _{y \in Q} \min _{x \in P}\|x-y\|_{2}^{2} \tag{2}\end{align*}
where x and y are, respectively, points of P and Q.
Earth Mover's Distance (EMD): Also known as the discrete Wasserstein distance [39]. It is a technique for determining the degree to which two multi-dimensional distributions differ in a feature space, where a ground distance measures the distance between individual features. The EMD between two point clouds P and Q is calculated with eq. 3 [40]:\begin{equation*} EMD(P,Q) = \min _{\phi:P\rightarrow Q}\sum _{x \in P}\left |{ x - \phi (x) }\right |_{2} \tag{3}\end{equation*}
where \phi is a bijective function f:P \to Q, i.e., a one-to-one (injective) and onto (surjective) mapping of P to Q.
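As a practical illustration, the sketch below computes the HD, CD, and EMD defined above (the MHD is omitted) with NumPy and SciPy nearest-neighbour searches; the function names and the use of a KD-tree are our own choices rather than the cited implementations, and the EMD variant assumes the two clouds have the same number of points.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.optimize import linear_sum_assignment

def hausdorff(P, Q):
    """Symmetric Hausdorff distance (eq. 1) between two (N, 3) arrays."""
    d_pq = cKDTree(Q).query(P)[0]   # distance from each point of P to its nearest point in Q
    d_qp = cKDTree(P).query(Q)[0]
    return max(d_pq.max(), d_qp.max())

def chamfer(P, Q):
    """Chamfer distance (eq. 2): mean squared nearest-neighbour distance, both directions."""
    d_pq = cKDTree(Q).query(P)[0]
    d_qp = cKDTree(P).query(Q)[0]
    return np.mean(d_pq ** 2) + np.mean(d_qp ** 2)

def emd(P, Q):
    """Earth Mover's Distance (eq. 3) via an optimal one-to-one assignment (assumes |P| == |Q|)."""
    cost = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((100, 3))
    Q = P + np.array([0.1, 0.0, 0.0])   # shifted copy of the same cloud
    print(hausdorff(P, Q), chamfer(P, Q), emd(P, Q))
```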
A. LiDAR Accuracy Assessment
Estimating a LiDAR’s accuracy by finding the Root Mean Square Error (RMSE) between two point clouds is a typical practice. There are two different accuracy assessments: Absolute accuracy and Relative accuracy [41].
1) Absolute LiDAR Accuracy
It refers to the vertical and horizontal precision of the data collected from a LiDAR. Absolute accuracy is evaluated by comparing the collected LiDAR data with ground-surveyed checkpoints [41], with the condition that the horizontal checkpoints (ground-level features) are well defined, i.e., their horizontal placements are precisely measured in relation to the objects' geographic locations. Vertical checkpoints, on the other hand, do not have to be well defined. The term vertical accuracy refers to the vertical precision attained over the environment. There is no single correct way to choose the checkpoint distribution; it typically depends on the geographic location of the objects and the environment under evaluation.
2) Relative LiDAR Accuracy
It is a metric to measure small variations in the point cloud [41], and the LiDAR calibration has an impact on it. There are two approaches to evaluate relative accuracy: the evaluation of data acquired by an autonomous vehicle with two different LiDARs at the same location, often known as "within-swath accuracy", which reveals the LiDAR system's level of stability; and the evaluation of data obtained by an AV with two different LiDARs at different locations, often known as "swath-to-swath accuracy". In addition to these metrics, Table 1 lists other metrics together with their advantages and limitations.
Metrics for Object Detection
Autonomous vehicles require a precise 3D view of the surrounding environment, including other vehicles and all other relevant objects. Using 3D object detection, spatial path planning for obstacle avoidance and navigation is possible, as opposed to 2D detection. With more output parameters required to describe 3D-oriented bounding boxes around targets, 3D object detection is more difficult than 2D object detection, which has been extensively investigated in [47]. Moreover, the resolution of LiDAR data is often lower than that of video, which has a significant negative influence on accuracy at extended ranges. Three object detection modalities based on dataset dimensions can be found in the literature: 2D image-based, 3D point cloud-based, and fusion of both image and point cloud detection. Despite the advantage of not requiring LiDAR, 2D image-based approaches perform poorly compared to those that use point clouds; therefore, here we concentrate on the first two categories.
As illustrated in Fig. 2, 2D object detection algorithms use RGB images as input and produce 2D axis-aligned bounding boxes with confidence scores, while 3D object detection algorithms work with 3D point clouds and produce classified, oriented 3D bounding boxes with confidence scores. The 3D bounding box in the LiDAR coordinates may be precisely projected into the image plane using the calibration settings of the camera and LiDAR after a fusion process. So, metrics to be considered in the case of object detection include 3D object detection using a camera, 3D object detection using LiDAR, fusion of both, and finally object tracking. In the following section, we discuss each of these individual metrics.
A. Metrics for 3D Object Detection
To evaluate the effectiveness of object detection algorithms, the intersection over union (also known as the Jaccard index) is used to compare the predicted and ground-truth bounding boxes. As shown in Fig. 3, each ground truth box in the image is taken into consideration while calculating the IOU for each prediction. Then, using a greedy approach (the highest IOUs are matched first), predictions are matched with ground truth boxes after these IOUs have been thresholded to a certain value, often between 0.5 and 0.95. The IOU threshold then determines whether a prediction is a True Positive (TP), a False Positive (FP), or a False Negative (FN). It is crucial to keep in mind that the true negative (TN) does not apply in the domain of object detection, because there is a limitless number of bounding boxes that should not be detected in any image. With TP, FP, and FN, a confusion matrix is obtained as shown in Fig. 4. With the confusion matrix known, precision and recall can be calculated. These metrics are also used for segmentation purposes, so specificity is defined here as well, even though it is not relevant for object detection.
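The following minimal sketch illustrates the IOU computation and the greedy matching just described for axis-aligned 2D boxes given as (x1, y1, x2, y2); the helper names and the 0.5 default threshold are illustrative choices rather than the implementation of any specific benchmark.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match(preds, gts, thr=0.5):
    """Greedily match predictions to ground truths; returns (TP, FP, FN)."""
    pairs = sorted(((iou(p, g), i, j) for i, p in enumerate(preds)
                    for j, g in enumerate(gts)), reverse=True)
    used_p, used_g, tp = set(), set(), 0
    for score, i, j in pairs:
        if score < thr:
            break
        if i not in used_p and j not in used_g:
            used_p.add(i); used_g.add(j); tp += 1
    return tp, len(preds) - tp, len(gts) - tp   # TP, FP, FN

preds = [(10, 10, 50, 50), (60, 60, 90, 90)]
gts = [(12, 12, 52, 52)]
print(match(preds, gts))   # (1, 1, 0): one correct box, one spurious box
```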
1) Precision
Precision is the ratio of true positives to all positive predictions (true positives plus false positives). For instance, if the model identified 100 trees and 90 of them were accurate, the precision would be 90%.\begin{equation*} Precision = \frac {TP}{TP+FP} \tag{4}\end{equation*}
2) Recall and Specificity
Recall, also called the true positive rate or sensitivity, gives the percentage of positive voxels in the label (ground truth) that are correctly detected as positive. The specificity, or true negative rate, gives the percentage of negative voxels (background) in the ground truth that are detected as negative by the assessed detection (eq. 5).\begin{equation*} Recall = \frac {TP}{TP+FN},\qquad Specificity = \frac {TN}{TN+FP} \tag{5}\end{equation*}
3) F1-Score
The F1-score is particularly suited for imbalanced datasets. It gives the harmonic mean of precision and recall.\begin{equation*} F1-{score} = \frac {2 \times precision \times recall}{precision + recall} \tag{6}\end{equation*}
4) Accuracy
Accuracy is the proportion of valid predictions, including true positives and true negatives, among the total number of analyzed cases.\begin{equation*} Accuracy = \frac {TP+TN}{TP+TN+FP+FN} \tag{7}\end{equation*}
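The four quantities defined in eqs. 4-7 can be computed directly from the confusion counts; the small helper below is a sketch that assumes the TP, FP, FN (and, where applicable, TN) counts have already been obtained from the matching step.

```python
def detection_scores(tp, fp, fn, tn=0):
    """Precision, recall, specificity, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return dict(precision=precision, recall=recall,
                specificity=specificity, f1=f1, accuracy=accuracy)

print(detection_scores(tp=90, fp=10, fn=20))
```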
5) Mean Average Precision (mAP)
One of the issues with object detection is the diversity of classes involved, e.g., car, tanker, pedestrian, bicycle, and bus. The average precision (AP) is therefore computed separately for each class, and the mean average precision is obtained by averaging the AP over all F classes (eq. 8).\begin{equation*} mAP = \frac {\sum _{i=1}^{F}AP(i)}{F} \tag{8}\end{equation*}
6) 11-Point Interpolation (AP_{11})
The highest precision whose recall value is greater than a particular value is taken into consideration in this definition of AP, rather than the precision seen at each recall level [48]. The maximum precision values at a set of 11 equally spaced recall levels (0, 0.1, ..., 1) are averaged (eq. 9),\begin{equation*} AP_{11} = \frac {1}{11}\sum _{T\in \{0,0.1,\ldots,1\}}MP_{11}(T) \tag{9}\end{equation*}
where MP_{11}(T) denotes the maximum precision observed at any recall level greater than or equal to T.
7) All-Point Interpolation (AP_{all})
Here, the AP is generated after interpolating the precision at each recall level, using the highest precision whose recall value is greater than or equal to that level, as opposed to using the precision observed at only a few points [48].
For a better understanding of the 11-point and all-point interpolations, let’s take an example of an object detection case [48] whose precision and recall curve is shown in Fig. 5. From this figure, the obtained average precision values are 26.84% and 24.56% with the 11-point and all-point interpolations, respectively.
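The sketch below contrasts the two interpolation schemes on a toy precision/recall curve; the sample values are illustrative and do not reproduce the curve of Fig. 5.

```python
import numpy as np

def ap_11_point(recall, precision):
    """11-point interpolated AP (eq. 9)."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    ap = 0.0
    for t in np.linspace(0.0, 1.0, 11):
        mask = recall >= t
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0

def ap_all_point(recall, precision):
    """All-point interpolated AP: area under the upper precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # monotone envelope from the right
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

recall    = [0.1, 0.2, 0.3, 0.4, 0.5]
precision = [1.0, 0.8, 0.66, 0.5, 0.45]
print(ap_11_point(recall, precision), ap_all_point(recall, precision))
```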
The average precision computation has significant flaws due to the N-point interpolation techniques currently in use; these errors lead to average precision distortion and make it impossible to accurately assess the model's performance. To address these problems, an enhanced interpolation was proposed in [48], which takes the interpolation point from the middle of each interval and uses dynamic parameter selection to determine the interpolation interval's area. The authors observed that the average precision distortion is reduced by over 90%, to only 0.04%.
8) Average Recall (AR)
The aggressiveness of object detectors for a particular class is measured using another assessment metric called average recall [50]. In contrast to the average precision, the detector confidences are not included in the computation of AR. Because of this, the confidence threshold is effectively set to 0, and all detections are considered positive [51]. The AR metric is evaluated by averaging the recall values obtained for IOU thresholds in the range [0.5, 1], i.e., over a wide range of localization requirements. The lowest reasonable IOU according to most metrics is 0.5, which can be read as an imprecise positioning of an object, whereas an IOU of 1 corresponds to the exact location of the identified object. Consequently, by averaging recall values within the range [0.5, 1], the model is assessed under the presumption that object placement must be highly accurate.
9) Mean Average Recall (mAR)
Although AR is generated separately for each class, analogous to how mAP is computed, a single AR value can be determined by taking into account the mean AR across all classes [52], that is:\begin{equation*} mAR = \frac {1}{N}\sum _{i=1}^{N}AR_{i} \tag{10}\end{equation*}
B. NuScenes Detection Score (NDS)
Perhaps the most often used metric for object detection is mAP with a predefined IOU threshold [53]. However, some aspects of the nuScenes detection task, such as the estimation of position, shape, velocity, and orientation, cannot be fully measured with mAP. These aspects can be handled by specifying thresholds for each error category, as in the ApolloScape [54] 3D automobile instance challenge; in that challenge, however, the number of thresholds is 103, which makes the resulting mAP complicated, arbitrary, and unpredictable. To overcome these limitations, the nuScenes detection score was introduced in [55], which combines the various errors into a scalar value by taking the weighted sum of mAP and several True Positive Metrics (TPM), namely translation, scale, orientation, velocity, and attribute errors, which are defined as follows (a computational sketch of the combination is given after the list):
Average Scale Error (ASE): Calculated as 1 minus the IOU after aligning centers and orientation.
Average Translation Error (ATE): Euclidean center distance in 2D in meters.
Average Orientation Error (AOE): Smallest yaw angle difference between prediction and ground truth in radians. The orientation error is evaluated at 360 degrees for all classes except barriers, where it is only evaluated at 180 degrees. Orientation errors for cones are ignored.
Average Velocity Error (AVE): The absolute velocity error, measured in m/s. Velocity errors for barriers and cones are ignored.
Average Attribute Error (AAE): Calculated as 1 minus the attribute classification accuracy. Attribute errors for barriers and cones are ignored.
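The sketch below shows how the weighted combination behind the NDS can be computed once the mAP and the five mean TP errors are available; the weights (5 for mAP, 1 for each TP term, normalized by 10) follow the published nuScenes definition, while the numerical values in the example are placeholders.

```python
def nds(map_score, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors (lower errors are better).

    tp_errors: dict with mATE, mASE, mAOE, mAVE, mAAE values.
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * map_score + sum(tp_scores)) / 10.0

print(nds(0.45, {"mATE": 0.35, "mASE": 0.26, "mAOE": 0.40,
                 "mAVE": 0.30, "mAAE": 0.20}))
```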

C. Planning KL-Divergence (PKL)
The computer vision community uses variations of accuracy and precision as the gold standard to assess the performance of perception algorithms. These metrics are widely used since they are essentially task-independent and are usually aimed at driving the false positives and false negatives of an object detection algorithm to zero. Their drawback is that they ignore the objects' position, velocity, and speed; the orientation, location, and environmental characteristics are not taken into account by mAP and NDS. Philion et al. proposed a novel measure, PKL [56], for 3D object detection that integrates perception performance analysis with driving performance. The main concept of PKL is to analyze detections using a planner that has been trained to plan a driving trajectory from its semantic observations, i.e., the detections. If the perception algorithm is flawless, PKL will always return the best result when tested on the nuScenes dataset [57], which is publicly available for interested researchers. Test results demonstrated that the PKL metric induces an intuitive ranking of the importance of identifying each car in a scene and outperformed traditional metrics [57]. The authors offer a server for comparing competing object detectors using planning-based metrics, in order to encourage the creation of new perception algorithms that are more in line with the requirements of real-world autonomous driving.
PKL evaluates the discrepancy between the plans produced when the planner is provided with detections from a detector and the plans produced from human-labeled (ground-truth) detections [56]. It is usually positive, and lower detection performance is correlated with higher PKL scores; an ideal detector is one with a PKL of 0. Several nuScenes detection scenarios are used to illustrate the advantage of PKL over mAP. The planner learns how to drive through the scenarios by training on a large amount of data collected from human-driven vehicles. The local semantic map and the detected bounding boxes serve as conditions for the planner.
D. Timed Quality Temporal Logic (TQTL)
The accuracy of perception algorithms can also be examined using TQTL. It is a formal language for expressing the desired spatio-temporal properties of a perception algorithm processing a video, and it is an extension of Timed Propositional Temporal Logic (TPTL) [58]. The evaluation of a perception algorithm typically involves comparing its output against labels that represent the real world; TQTL provides an alternative metric that can give relevant information even in the absence of ground truth labels, making it a helpful tool for assessing perception quality. Linear temporal logic (LTL), also known as linear-time temporal logic (LTTL), is a temporal logic with modalities related to time; the phrases "I'm always hungry," "I'll get hungry eventually," and "I'll be hungry until I eat something" are classic examples of properties it can express. "A condition will ultimately be true", "a condition will not be true until another fact becomes true", etc., are further examples of formulae that can be encoded in LTL to describe the future of paths. TPTL adds variables that are used to measure the time intervals between two occurrences; for instance, TPTL permits specifying a time limit for the occurrence of an event.
To elaborate on the effectiveness of TQTL in our object detection problem, we consider the work in [59], in which object detection algorithms such as YOLO and SqueezeDet were trained on different datasets with the same settings [60], such as the window frame range, in order to analyze the impact of TQTL in measuring the performance of detection models. The following findings were observed when using the TQTL metric in addition to other metrics: 1) Both object detection models mistakenly label bikes as pedestrians on multiple occasions. In some cases the cyclist is oriented orthogonally to the image plane, which makes the cyclist look like a pedestrian; this might suggest that there are not enough images of cyclists taken directly in front of or behind the car in the KITTI dataset. 2) Both algorithms identify objects sporadically, i.e., they quickly lose confidence in their predictions. 3) SqueezeDet finds a number of "phantom" objects with high confidence before swiftly losing confidence in these incorrect predictions.
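As a purely illustrative example (not a TQTL implementation), the Python check below monitors one property of the kind discussed above: an object detected with high confidence should not disappear from the detections within the next few frames. The frame format and thresholds are assumptions made for the example.

```python
def persistent_detections(frames, min_score=0.7, horizon=3):
    """Flag (frame, object_id) pairs whose high-confidence detection vanishes
    from all of the next `horizon` frames.

    `frames` is a list of per-frame lists of (object_id, confidence_score).
    """
    violations = []
    for t, dets in enumerate(frames):
        for obj_id, score in dets:
            if score < min_score:
                continue
            future = frames[t + 1: t + 1 + horizon]
            if future and not any(obj_id in {i for i, _ in f} for f in future):
                violations.append((t, obj_id))  # detection disappears too quickly
    return violations

frames = [[(1, 0.9), (2, 0.8)], [(1, 0.85)], [(1, 0.9)], [(1, 0.88)]]
print(persistent_detections(frames))   # object 2 vanishes after frame 0
```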
E. Spatio-Temporal Quality Logic (STQL)
Autonomous vehicles' perception algorithms are essential to their ability to recognize and track objects in the environment as well as to comprehend the semantics of their surroundings. The results of these algorithms are then applied to decision-making in safety-critical situations, like autonomous emergency braking and accident avoidance, so it is vital to monitor these perception systems while they are in use. The outputs of perception systems are represented in high-level, sophisticated ways, making it difficult to test and validate these systems, particularly at runtime. The authors of [61] introduced PerceMon, a runtime monitoring tool that can track any specification expressed in timed quality temporal logic and its extensions with spatial operators. STQL is an extension of TQTL that includes a set of operations on, and reasoning about, high-level topological entities like bounding boxes that are present in perception data; both logics are extensions of Metric Temporal Logic (MTL) [62]. In STQL, specifications define a set of operations on the spatial artifacts, like bounding boxes, produced by vision systems, together with operators to reason about object classes and discrete IDs. The correctness properties of perception algorithms can thus be expressed using TQTL and STQL. PerceMon [61] is an effective online monitoring tool for STQL specifications, and it is integrated with the Robot Operating System (ROS) [63] and the CARLA simulation environment [64].
F. Object Detection Competitions
World-famous competitions for object detection are the VOC PASCAL challenge [65], COCO [66], ImageNet object detection challenge [67], Google open images challenge [68] and Lyft [69]. All these competitions provide their code to calculate average precision, or mean AP, but the Lyft 3D object detection for autonomous vehicles challenge uses the AP averaged over 10 different thresholds, the so-called AP@50:5:95 metric. Submissions for the COCO detection challenge are graded based on metrics divided into four primary categories.
Average Precision (AP): Several IOUs are used to evaluate the AP. It can be calculated for 10 IOUs that change in percentage by 5% increments from 50% to 95%; this value is typically stated as AP@50:5:95. It can also be assessed using just one IOU value; the most typical values are 50% and 75%, which are reported as AP50 and AP75, respectively.
AP Across Scales: The AP is calculated for objects of three sizes: small (area below 32² pixels), medium (area between 32² and 96² pixels), and large (area above 96² pixels).
Average Recall (AR): The maximum recall values for an image with a specified number of detections (1, 10, or 100) are used to estimate the AR.
AR Across Scales: The same three sizes of objects used in the AP across scales are used to determine the AR, which are typically given as AR-S, AR-M, and AR-L, respectively.
Metrics for Multi-Object Tracking (MOT)
It is the process of finding the objects of interest in a video, following them in later frames by giving each one a distinctive ID, and keeping track of these distinct IDs as the objects move around in subsequent frames, as in Fig. 6. MOT separates a single continuous video into discrete frames at a predetermined frame rate (frames per second). The results of MOT are:
Detection: Identification of the objects in each frame.
Localization: locating the objects in each frame.
Association: determination of whether items appear to be the same or different in different frames.
By comparing a tracker’s predictions to the actual set of tracking results, one may assess the performance of MOT algorithms. Metrics for MOT evaluation must have two important characteristics: 1. MOT evaluation metrics must account for five different types of MOT errors; 2. Error kinds should be distinguishable, and MOT evaluation metrics should be monotonic. The five errors are:
False negative or miss: when there is a ground truth but the prediction is wrong, the result is a false negative or miss.
False positive: if a tracker prediction exists but there is no ground truth, it is a false positive.
Merge or ID switch: when two or more object tracks are switched as they pass by one another, this is known as a merge or ID switch.
Deviation: deviation after re-initializing an object track with a changed track ID.
Fragmentation: when a track abruptly stops being tracked yet the ground truth track still exists.
In the first part of Fig. 7, an ID switch occurs when the mapping switches from the previously assigned red track to the blue one. In the second part, a track fragmentation is counted in the frame where the predicted track is interrupted even though the ground-truth track continues.
A. Localization
Localization measures the spatial alignment between a predicted detection and the actual detection [74]. The localization accuracy (LocA) is obtained by averaging the localization IOU (Loc-IOU) over the set of true-positive matches, as in eq. 13.\begin{equation*} LocA = \frac {1}{TP}\sum _{C\in TP}^{}\text {Loc-IOU}(C) \tag{13}\end{equation*}
B. Detection Accuracy (DetA)
The detection accuracy measures the alignment between the set of predicted detections and the set of all ground-truth detections, and is often expressed as the Detection IOU (Det-IOU), as in eq. 14.\begin{equation*} DetA = \text {Det-IOU} = \frac {TP}{TP+FN+FP} \tag{14}\end{equation*}
When a prediction overlaps with more than one ground truth or vice-versa, the Hungarian algorithm is used to identify a one-to-one match between the predicted detection and ground truth.
C. Association Accuracy (AssA)
The association accuracy measures the average alignment between matched trajectories; it is obtained by averaging the association IOU (Ass-IOU) over all TP detections, as in eq. 15.\begin{equation*} AssA = \frac {1}{TP}\sum _{C\in TP}^{}\text {Ass-IOU}(C) \tag{15}\end{equation*}
D. Track-mAP
It matches trajectory-level predictions with trajectory-level ground truth. It requires a trajectory similarity score between predicted and ground-truth trajectories, together with a threshold above which two trajectories are considered matched.
Sometimes Track-mAP has numerous overlapping outputs, and some of them have low confidence scores, making it difficult to understand tracking outputs with this method. As a result, the final score for each trajectory is obscured by the implicit confidence ranking, making it difficult to analyze and visualize the results.
As a result of this metric’s high threshold of 0.5 for a trajectory to be considered a positive match, it ignores significant advancements in localization, association, and detection. Any increase in detection and association is not evident in metric scores since even with the best tracking, more than half of its best guess predictions will be reported as errors in Track-mAP.
The trajectories used by Track-mAP measurements combine association, detection, and localization in a way that makes the different error types indistinguishable and non-separable.
E. Multi-Object Tracking Accuracy: MOTA
MOTA continues to be the measurement that most closely matches human visual evaluation. Matching is carried out at the detection level while calculating MOTA: if a predicted detection and a ground-truth detection overlap with an IOU above the chosen threshold, the pair is counted as a TP; unmatched predictions and unmatched ground truths are counted as FPs and FNs, respectively; and an identity switch (IDSW) is counted whenever a matched ground-truth track changes its assigned prediction ID. MOTA is then computed over all frames as in eq. 16, where gtDets is the total number of ground-truth detections.\begin{equation*} MOTA = 1-\frac {FN+FP+IDSW}{gtDets} \tag{16}\end{equation*}
F. Multi-Object Tracking Precision (MOTP)
MOTP averages the overlap between all correctly matched predictions and their ground truth. It takes the collection of TPs and averages their similarity score S (e.g., the IOU between the matched prediction and the ground truth), as in eq. 17.\begin{equation*} MOTP = \frac {1}{TP}\sum _{TP}^{}S \tag{17}\end{equation*}
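A minimal sketch of eqs. 16 and 17 is given below; it assumes the per-sequence error counts (FN, FP, IDSW) and the similarity scores of the TP matches have already been accumulated by the matching step.

```python
def mota(fn, fp, idsw, gt_dets):
    """Multi-object tracking accuracy (eq. 16) from accumulated error counts."""
    return 1.0 - (fn + fp + idsw) / gt_dets

def motp(tp_similarities):
    """Multi-object tracking precision (eq. 17): mean similarity over TP matches."""
    return sum(tp_similarities) / len(tp_similarities)

print(mota(fn=12, fp=7, idsw=3, gt_dets=200))   # 0.89
print(motp([0.82, 0.91, 0.77, 0.88]))           # mean IOU of the TP matches
```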
G. Safety Score (S_D)
To calculate the safety score of an object tracking model, one should give equal importance to precision and accuracy. The tracking safety score (S_D) is therefore obtained by averaging the multiple object detection accuracy (MODA) and the multiple object detection precision (MODP), as in eq. 18.\begin{equation*} S_{D} = \frac {MODA+MODP}{2} \tag{18}\end{equation*}
H. Identification F1-Score (IDF1)
It is used as a supplemental metric on the MOTChallenge benchmark because it places more emphasis on measuring association accuracy than detection accuracy. Unlike MOTA, which matches objects at the detection level across time, IDF1 matches predicted and ground-truth trajectories at the identity level; the identification recall, identification precision, and identification F1-score are then computed from the identity-level true positives (IDTP), false negatives (IDFN), and false positives (IDFP), as in eqs. 19-21.\begin{align*} ID-Recall &= \frac {IDTP}{IDTP+IDFN} \tag{19}\\ ID-Precision &= \frac {IDTP}{IDTP+IDFP} \tag{20}\\ IDF1 &= \frac {IDTP}{IDTP + 0.5 IDFN + 0.5 IDFP} \tag{21}\end{align*}
I. Higher Order Tracking Accuracy (HOTA)
A single unifying metric called HOTA combines the detection, association, and localization components introduced above (Det-IOU, Ass-IOU, and Loc-IOU) into one score, as expressed in eq. 22.\begin{equation*} HOTA = \text {Det-IOU} + \text {Ass-IOU} + \text {Loc-IOU} \tag{22}\end{equation*}
J. Detection Error
A detection error occurs when a tracker either predicts detections that are not present in the ground truth or fails to predict detections that are present in the ground truth. Detection errors are therefore split into detection recall errors (measured by FNs) and detection precision errors (measured by FPs).
K. Association Error
It occurs when a tracker assigns two detections of the same ground-truth object to distinct predicted identities, or merges detections of different ground-truth objects under the same predicted identity.
L. Spatio-Temporal Tube Average Precision (STT-AP)
All of the above-mentioned metrics are applied to individual images or frames. When working with videos, the predictive accuracy at the level of the entire video may be more relevant. The STT-AP is an extension of the AP metric to assess video object detection models. As with AP, the accuracy of a detection is evaluated by thresholding an IOU; however, it broadens the conventional IOU definition to consider the spatio-temporal tubes produced by the detection and the ground truth, rather than using two separate kinds of overlap (spatial and temporal). This metric is concise yet expressive because it combines spatial and temporal localization. The spatio-temporal tube IOU (STT-IOU) is the ratio of the overlap between the predicted and ground-truth spatio-temporal tubes to their union. A detection is treated as a TP if its STT-IOU is equal to or higher than a specified threshold.
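One plausible way to compute an STT-IOU, shown below as an assumption rather than the reference implementation, is to accumulate per-frame intersection and union areas of the 2D boxes over all frames where either tube exists.

```python
def stt_iou(pred_tube, gt_tube):
    """Spatio-temporal tube IOU; tubes are dicts {frame_index: (x1, y1, x2, y2)}."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    def inter(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)
    inter_vol = union_vol = 0.0
    for t in set(pred_tube) | set(gt_tube):
        p, g = pred_tube.get(t), gt_tube.get(t)
        if p and g:                       # both tubes exist in this frame
            i = inter(p, g)
            inter_vol += i
            union_vol += area(p) + area(g) - i
        else:                             # only one tube exists: pure union
            union_vol += area(p or g)
    return inter_vol / union_vol if union_vol else 0.0

pred = {0: (0, 0, 10, 10), 1: (1, 0, 11, 10)}
gt = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
print(stt_iou(pred, gt))
```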
Metrics for Semantic Segmentation
The technique of grouping point clouds into various homogeneous regions, each containing points with similar characteristics, is known as 3D point cloud segmentation. Since point cloud data has high levels of redundancy, irregular sample densities, and a lack of explicit structure, segmentation is difficult. These problems are addressed by several researchers in the field of robotics applications, including autonomous vehicles, self-driving cars, and navigation. There are three types of segmentation techniques that play a crucial role in relation to autonomous vehicles: semantic, instance, and panoptic segmentation [76]. These three differ in how they label things/countable objects (trees, cars, pedestrians, etc.) and stuff/non-countable objects (road, grass, sky, etc.) in an image. For a better understanding and visual comparison of the three, see Fig. 8.
Every pixel in an image is assigned a class label using semantic segmentation, such as person, flower, or car. Several objects belonging to the same class are treated as a single entity. Frequently employed semantic segmentation methods include the Fully Convolutional Network (FCN) [77], DeconvNet [78], U-Net [79], and SegNet [80]. In contrast, instance segmentation treats several objects belonging to the same class as unique individual instances. Frequently used instance segmentation methods include PANet [81], Faster R-CNN [82], Mask R-CNN [83], and YOLACT [84].
Each pixel in an image receives two labels from panoptic segmentation: a semantic label and an instance ID. The similarly marked pixels are regarded as being members of the same semantic class, and its instances are identified by their unique identifiers (IDs). The Mask R-CNN [83] approach is the foundation of most panoptic segmentation methods. The architectures that make up its backbone include VPSNet [85], EPSNet [86], FPSNet [87], and UPSNet [88].
A. Evaluation Metrics
Each segmentation method evaluates the predicted masks or IDs in a scene using a different set of evaluation measures. This is due to the diverse ways in which things and stuff are processed.
B. Metrics for Semantic Segmentation
The goal of establishing metrics for semantic segmentation is to score the similarity between the predicted (prediction) and annotated segmentation (ground truth). The mainly used ones are: Dice coefficient, Jaccard Index (or IOU), pixel accuracy, and mean accuracy.
1) Dice Coefficient
It is equal to two times the area of the intersection between the predicted segmentation (Pseg) and the ground-truth segmentation (GTseg), divided by the sum of their areas, as in eq. 23.\begin{equation*} Dice = 2 |Pseg\cap GTseg| / (|Pseg|+|GTseg|) \tag{23}\end{equation*}
Keep in mind that the denominator is the sum of the two areas, not the area of their union; this is what distinguishes the Dice coefficient from the IOU described next.
In general, most researchers use the IOU for object detection evaluation and the Dice coefficient for semantic segmentation, even though the two metrics are closely related; which one to use largely depends on personal preference and convention. In segmentation tasks, the Dice loss (eq. 24) is used as a loss function because it is differentiable, whereas the IOU is not. Both the IOU and the Dice coefficient can be used as metrics to assess the model's performance, but only the Dice loss is used as a loss function.\begin{equation*} Dice~loss = 1-Dice~coeff \tag{24}\end{equation*}
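The sketch below computes the Dice coefficient (eq. 23) and the Dice loss (eq. 24) for binary masks with NumPy; the small smoothing term is an implementation convenience we add so the ratio stays defined for empty masks.

```python
import numpy as np

def dice_coeff(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks (eq. 23)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def dice_loss(pred, target):
    """Dice loss (eq. 24)."""
    return 1.0 - dice_coeff(pred, target)

pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:3] = 1   # predicted 2x2 blob
gt = np.zeros((4, 4), dtype=int); gt[1:4, 1:4] = 1       # ground-truth 3x3 blob
print(dice_coeff(pred, gt), dice_loss(pred, gt))
```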
2) Jaccard Index
The Jaccard Index, which measures how close the predicted and actual masks are, is widely used in semantic segmentation. It is also commonly known as the IOU and is calculated by dividing the area of the intersection by the area of the union.\begin{equation*} Jaccard = TP / (TP + FP + FN) \tag{25}\end{equation*}
3) Mean Pixel Accuracy (mPA)
The quantity of pixels accurately categorized in the resulting segmentation mask is known as the pixel accuracy (PAseg). With TP_J the number of correctly classified pixels of class J and T_J the total number of pixels of class J, summing over the C classes gives eq. 26.\begin{equation*} PAseg = \frac {\sum _{J=1}^{C}TP_{J}}{\sum _{J=1}^{C}T_{J}} \tag{26}\end{equation*}
The mean pixel accuracy (mPA) averages the per-class pixel accuracy over the C classes (eq. 27).\begin{equation*} mPA = \frac {1}{C}\sum _{J=1}^{C}\frac {TP_{J}}{T_{J}} \tag{27}\end{equation*}
4) Average Hausdorff Distance (AHD)
It is a popular performance metric that determines the distance between two point sets. It is used to compare labels with detected or segmented images and to rank different detection/segmentation outcomes. The AHD is particularly well suited for segmentations involving complex boundaries and narrow segments. Unlike the Dice coefficient, the AHD takes voxel localization information into account. The AHD between two point clouds P and Q is given by eq. 28.\begin{equation*} AHD(P,Q)=\frac {1}{\left |{P}\right |} \sum _{x \in P} \min _{y \in Q}d(x,y)+\frac {1}{\left |{Q}\right |} \sum _{y \in Q} \min _{x \in P}d(x,y) \tag{28}\end{equation*}
C. Metrics for Instance Segmentation
The typical evaluation statistic for instance segmentation is the average precision (AP), computed from the overlap between predicted and ground-truth instances. One such overlap measure is the intersection over the predicted instance (IoS), i.e., the fraction of pixels of a predicted instance (Pinst) that also belong to the matched ground-truth instance (GTinst), where N(·) counts pixels, as in eq. 29.\begin{equation*} IoS(P, GT) = \frac {N(Pinst\cap GTinst)}{N(Pinst)} \tag{29}\end{equation*}
D. Metrics for Panoptic Segmentation
As a brand-new task, panoptic segmentation was originally put forth in [76]. In this method, background classes are segmented using semantic segmentation, while foreground classes are segmented using instance segmentation. These two categories are also known as stuff (non-countable) classes and things (countable) classes, respectively. The Panoptic Quality (PQ) metric evaluates both jointly and is computed as in eq. 30.\begin{equation*} PQ = \frac {\sum _{(popt,gopt)\in TP}^{}IOU(popt,gopt)}{\left |{ TP }\right |+\frac {1}{2}\left |{ FP }\right |+\frac {1}{2}\left |{ FN }\right |} \tag{30}\end{equation*}
The PQ can also be seen as the product of a Segmentation Quality (SQ) term and a Recognition Quality (RQ) term, as in eq. 31.\begin{align*} PQ &= SQ\times RQ = \frac {\sum _{(popt,gopt)\in TP}^{}IOU(popt,gopt)}{\left |{ TP }\right |} \\ &\quad \times {\frac {\left |{ TP }\right |}{\left |{ TP }\right |+\frac {1}{2}\left |{ FP }\right |+\frac {1}{2}\left |{ FN }\right |}} \tag{31}\end{align*}
The SQ term is the average IOU of the matched segments, while the RQ term corresponds to the F1-score commonly used for detection.
A fundamental property of PQ is that the matching between predicted and ground-truth segments is unique, which follows from two requirements:
No single pixel can simultaneously belong to two predicted segments, i.e., predictions must not overlap.
Only predicted segments whose IOU with the ground truth is greater than 0.5 can be matched with that ground truth.
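A hedged sketch of the PQ computation (eqs. 30 and 31) is given below; it assumes the unique matching described above (IOU > 0.5) has already been established, so only the matched IOUs and the FP/FN counts are needed.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ, SQ, and RQ from the IOUs of matched segments and the FP/FN counts."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                       # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)      # recognition quality
    return sq * rq, sq, rq                            # PQ, SQ, RQ

pq, sq, rq = panoptic_quality([0.9, 0.75, 0.8], num_fp=1, num_fn=2)
print(pq, sq, rq)
```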
1) Parsing Covering (PC) Metric
This metric is an extension of the covering metric [91] proposed in [92]. The covering metric is mostly useful for the evaluation of class-agnostic segmentation. In some applications, such as portrait segmentation (segmenting a person in an image from the background) or autonomous driving (where near objects are more significant than far-off ones), one should pay more attention to large objects. This inspired the authors to propose the Parsing Covering (PC) metric, defined in eqs. 32-34.\begin{align*} PC_{i}& = \frac {1}{M_{i}}\sum _{R\in s_{i}}^{}\left |{ R }\right |\max _{R'\in s_{i}'}IOU(R,R') \tag{32}\\ M_{i}&=\sum _{R\in S_{i}}^{}\left |{ R }\right | \tag{33}\\ PC &= \frac {1}{C}\sum _{i=1}^{C}PC_{i} \tag{34}\end{align*}
There is also no matching involved in PC, in contrast to PQ.
E. Dataset - Panoptic Segmentation
1) SemanticKITTI [94]
It is a sizable driving-scene dataset that can be used for panoptic and semantic point cloud segmentation [94]. The data was gathered in Germany using a Velodyne HDL-64E LiDAR and is derived from the KITTI Vision Odometry Benchmark. The dataset consists of 22 sequences, split into a training set (sequences 00 to 10, with sequence 08 used as the validation set) and a test set (sequences 11 to 21). After combining classes with varied movement statuses and ignoring classes with very few points, 19 classes remain for training and evaluation.
2) nuScenes [57]
It uses a 32-beam LiDAR sensor to gather 1,000 scenes, each 20 seconds long. It comprises 40,000 annotated frames in total, with the LiDAR sampled 20 times per second. The data is officially divided into a training set and a validation set. A total of 16 classes remain for LiDAR semantic segmentation after merging comparable classes and deleting rare ones. A cylindrical partition divides the point clouds from the two datasets into 3D representations with dimensions 32, 360, and 480, where the three dimensions denote the height, angle, and radius, respectively.
3) Cityscapes [95]
It contains 5000 images of egocentric driving scenes in urban areas (2975 for training, 500 for validation, and 1525 for testing). There are 19 classes with dense pixel annotations (97% coverage), of which 8 have instance-level segmentation.
4) ADE20k [96]
With an open-dictionary label set, it has over 25k images (20k for the training set, 2k for validation, and 3k for testing). In order to cover 89% of all pixels, 100 things and 50 stuff classes were chosen for the 2017 Places Challenge.
5) Mapillary Vistas [97]
It offers 25k street-view images at a variety of resolutions (18k for the training set, 2k for validation, and 5k for the test). The dataset has dense annotations covering 98% of the pixels, with 28 stuff and 37 things classes.
Results and Discussions
After the description of the various metrics that have been proposed for object detection, multi-object tracking, and panoptic segmentation, this section provides an evaluation of their performance and influence on models, obtained by reproducing previously published results or by carrying out new tests with trained models from GitHub on the nuScenes and KITTI datasets.
A. Complex-YOLOv4 and Complex-YOLOv3
Due to its direct connection to environmental comprehension and its role as the foundation for prediction and motion planning, LiDAR-based 3D object detection is unavoidable for autonomous vehicles. The ability to infer from highly sparse 3D data in real time is an ill-posed problem, not only for autonomous vehicles but also for many other application fields, such as augmented reality, personal robots, or industrial automation. The authors in [98] presented Complex-YOLO, a state-of-the-art real-time 3D object detection network that works exclusively on point clouds. YOLOv2 [99], a fast 2D standard object detector for RGB images, is expanded in [98] with a network that uses a sophisticated regression method to estimate multi-class 3D bounding boxes in Cartesian space. To this end, they propose a specific Euler-region proposal network that estimates the object's pose by incorporating imaginary and real terms into the regression network, which eliminates the singularities caused by single-angle estimation and results in a closed complex space. For the application to AVs, a comparison of Complex-YOLO versions 3, 4, and 5 is given in [100]. Along with the theoretical description of Complex-YOLO, we reproduced the results of trained Complex-YOLOv4 and Complex-YOLOv3 models on the KITTI dataset. Figs. 9 and 10 show the outcomes of YOLOv4 and YOLOv3 on the KITTI dataset, respectively. Keep in mind that the predictions in Fig. 9 are only based on bird's-eye-view images created from the point clouds.
B. PointPillars 3D Object Detection
PointPillars is an encoder that uses PointNets [101] to learn a representation of point clouds organized in vertical columns (pillars). PointPillars predicts oriented 3D boxes for vehicles, pedestrians, bicycles, etc., using point clouds as input. There are three primary phases: a feature encoder network first transforms the point cloud into a sparse pseudo-image, a 2D convolutional backbone then processes the pseudo-image into a high-level representation, and finally a detection head detects and regresses the 3D boxes. While any common 2D convolutional detection architecture can employ the encoded features, PointPillars uses a lean downstream network.
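A much-simplified sketch of the first stage is shown below: it scatters a point cloud into a grid of vertical pillars and pools a hand-crafted per-pillar feature (maximum height), whereas the actual PointPillars encoder learns the per-pillar features with a small PointNet; the grid and cell sizes are arbitrary example values.

```python
import numpy as np

def pillarize(points, grid=(100, 100), cell=0.5):
    """Scatter a (N, 4) point cloud of x, y, z, intensity into a 2D pseudo-image."""
    pseudo_image = np.zeros(grid, dtype=np.float32)
    ix = (points[:, 0] / cell).astype(int)
    iy = (points[:, 1] / cell).astype(int)
    keep = (ix >= 0) & (ix < grid[0]) & (iy >= 0) & (iy < grid[1])
    for x, y, z in zip(ix[keep], iy[keep], points[keep, 2]):
        pseudo_image[x, y] = max(pseudo_image[x, y], z)   # e.g. max height per pillar
    return pseudo_image

cloud = np.random.rand(1000, 4) * [50, 50, 3, 1]   # synthetic cloud in a 50 m x 50 m area
print(pillarize(cloud).shape)                      # (100, 100) pseudo-image
```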
VoxelNet [102] is one of the earliest techniques to use PointNets for object detection with LiDAR point clouds. Here, PointNets are applied to voxels, which are then processed by a set of 3D convolutional layers, a 2D backbone, and a detection head. This enables end-to-end learning, but VoxelNet is cumbersome: it takes 225 ms of inference time (4.4 Hz) for a single point cloud, which is slower than prior work [101]. This issue was addressed in Frustum PointNet [103], and the detection speed was increased further with a detector called SECOND [104].
In Tables 4 and 5, FP16 denotes the adoption of mixed precision (FP16) training. Using 8 Titan XP GPUs with a batch size of 2, PointPillars is trained on the nuScenes dataset with mixed-precision training [101]; without it, out-of-memory (OOM) errors would result. On the nuScenes dataset, the loss scale for PointPillars must be carefully calibrated to prevent the loss from becoming excessive. Experiments show that a loss scale of 32 is more stable than 512, although loss scale 32 still occasionally causes NaN problems; this is the reason for the NaN entries in Tables 4, 5 and 6 for some classes.
C. PointPillars - Feature Pyramid Network (FPN)
By using a top-down pathway and lateral connections, the FPN mixes low-resolution, semantically strong features with high-resolution, semantically weak features. In an FPN, a feature pyramid is generated quickly from a single input image scale and has rich features at all levels without sacrificing representational power, speed, or memory. Other concurrent works, such as the Deconvolutional Single Shot Detector [105], also employ this strategy. The bottom-up pathway is the feedforward computation of the backbone ConvNet, as in Fig. 11. Every stage in the FPN has its own pyramid level, and the output of the final layer of each stage serves as the reference set of feature maps for the lateral connections. The feature maps from higher pyramid levels are upsampled to produce features that are spatially coarser but semantically stronger; for ease of use, the spatial resolution is upsampled by a factor of two using nearest-neighbor interpolation. Each lateral connection merges feature maps of the same spatial size from the top-down and bottom-up pathways; to decrease the channel dimensions, a 1×1 convolution is applied to each bottom-up feature map before the merge.
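The NumPy sketch below illustrates the top-down pathway just described: each coarser map is upsampled by two with nearest neighbour and added to a laterally projected finer map; the random channel-mixing matrix merely stands in for the learned 1×1 convolution.

```python
import numpy as np

def upsample2x(x):                       # x: (C, H, W) feature map
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral(x, out_channels):            # stand-in for a learned 1x1 convolution
    w = np.random.randn(out_channels, x.shape[0]) * 0.01
    return np.tensordot(w, x, axes=1)    # (out_channels, H, W)

def fpn_top_down(bottom_up, out_channels=256):
    """bottom_up: list of (C_i, H_i, W_i) maps, finest resolution first."""
    tops = [lateral(bottom_up[-1], out_channels)]
    for feat in reversed(bottom_up[:-1]):
        tops.append(upsample2x(tops[-1]) + lateral(feat, out_channels))
    return list(reversed(tops))          # finest-resolution map first

c3 = np.random.rand(64, 32, 32)
c4 = np.random.rand(128, 16, 16)
c5 = np.random.rand(256, 8, 8)
print([p.shape for p in fpn_top_down([c3, c4, c5])])
```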
D. PointPillars - Second Feature Pyramid Network (SECFPN)
Robot vision and autonomous driving are two examples of applications that make use of RGB-D or LiDAR-based object detection. For some time, voxel-based 3D convolutional networks have been used to process LiDAR point cloud data while retaining more information, yet issues such as slow inference speed and poor orientation estimation performance persist. To considerably speed up both training and inference, [104] investigated an improved sparse convolution approach for such networks. A new type of angle loss regression was introduced to improve orientation estimation, and a new data augmentation approach was presented that boosts convergence speed and performance. The SECFPN network maintains a high inference speed while delivering state-of-the-art performance on the KITTI 3D object detection benchmark, as shown in Table 4.
Fig. 12 shows the components of the SECOND detector. A raw point cloud is fed into the detector and converted into voxel features and coordinates, to which two VFE (voxel feature encoding) [102] layers and a linear layer are applied. A sparse CNN is then used, and finally the detection is produced by a Region Proposal Network (RPN) [106]. VFE layers extract voxel-wise features: a VFE layer uses a fully connected network (FCN), composed of a linear layer, batch normalization (BatchNorm), and a rectified linear unit (ReLU), to extract point-wise features from all the points in a single voxel. The sparse convolution gathers all atomic operations relating to the convolution kernel elements and stores them as computation instructions in a rulebook. RPNs are fully convolutional networks that predict object bounds and objectness scores at each position, and the RPN is trained end-to-end to produce high-quality region proposals.
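The sketch below illustrates the VFE idea in the spirit of [102]: a point-wise linear+BatchNorm+ReLU stack followed by a per-voxel max, with the aggregated voxel feature concatenated back to every point. The layer sizes are placeholders, not the reviewed SECOND configuration.

```python
# Illustrative voxel feature encoding (VFE) layer; dimensions are assumptions.
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.linear = nn.Linear(in_ch, out_ch // 2)   # point-wise FCN
        self.bn = nn.BatchNorm1d(out_ch // 2)
        self.relu = nn.ReLU()

    def forward(self, voxel_points):
        # voxel_points: (num_voxels, max_points_per_voxel, in_ch)
        n_vox, n_pts, _ = voxel_points.shape
        x = self.linear(voxel_points)
        x = self.bn(x.reshape(-1, x.shape[-1])).reshape(n_vox, n_pts, -1)
        x = self.relu(x)
        # Element-wise max over the points gives a voxel-level feature,
        # which is concatenated back to every point-wise feature.
        voxel_feat = x.max(dim=1, keepdim=True).values.expand(-1, n_pts, -1)
        return torch.cat([x, voxel_feat], dim=-1)      # (num_voxels, n_pts, out_ch)

# Example: 8 voxels, up to 32 points each, 4 input features (x, y, z, intensity).
out = VFELayer(4, 64)(torch.randn(8, 32, 4))           # -> (8, 32, 64)
```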
The primary distinction between the shape-aware grouping heads and the original SECFPN heads is that the former group objects of comparable sizes and shapes together and design a shape-specific head for each group. Heavier heads, with larger strides and more convolutions, are designed for large objects, while lighter heads handle small objects. Note that the outputs may contain feature maps of different sizes; therefore, the solution must also include an anchor generator appropriate for each feature map.
E. Shape Signature Networks (SSN) for Multi-Class Object Detection From Point Clouds
Finding and classifying objects of different categories from point clouds is the goal of multi-class 3D object detection. Shape information can help with multi-class discrimination, but it is underutilized because point clouds are, by nature, sparse, unstructured, and noisy. The authors in [107] therefore proposed to encode 3D shape information from point clouds using a novel shape signature. By incorporating convex hull, symmetry, and Chebyshev filtering operations, the proposed shape signature is not only compact and efficient but also robust to noise, acting as a soft constraint that enhances the features' capacity for multi-class discrimination. The resulting shape signature network is composed of pyramid feature encoding, explicit shape encoding objectives, and shape-aware grouping heads for 3D object detection. In this review, we employed the shape-aware grouping heads of SSN as the backbone in PointPillars, and the results on nuScenes are given in Table 6. Finally, Table 7 shows the evaluation time and the mean of the true positive metrics obtained with SECFPN (FP16), FPN (FP16) and SSN.
F. Point Cloud Distance Metric
To assess the quality of a match between two point clouds, we used various distance measures. Open3D was employed to visualize the point clouds, and the distance metrics were implemented with standard methods from NumPy and SciPy. We randomly generated one point cloud with 100 points and shifted it along the [x, y, z] axes to generate a second point cloud, as in Fig. 13. For each point cloud, the measured nearest-neighbor distances can be shown as a distribution, as in Fig. 14. Because point clouds have varying degrees of spatial resolution, accuracy, and outlier characteristics, the distributions may differ. We opted for three shifts along the [x, y, z] axes for the calculation of distances, i. e.,
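A minimal NumPy/SciPy sketch of this experiment is given below: a random 100-point cloud, a shifted copy, and a few nearest-neighbor-based dissimilarity measures. The shift value, the choice of Chamfer and Hausdorff distances, and the 1-D earth mover's distance computed on the nearest-neighbor distributions are illustrative assumptions, not necessarily the exact metrics reported in the tables.

```python
# Sketch of the point cloud distance computations described above (plotting omitted).
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import directed_hausdorff
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
cloud_a = rng.uniform(0.0, 1.0, size=(100, 3))      # reference point cloud
cloud_b = cloud_a + np.array([0.1, 0.1, 0.1])       # copy shifted along [x, y, z]

# Nearest-neighbor distances in both directions (their distributions as in Fig. 14).
d_ab, _ = cKDTree(cloud_b).query(cloud_a)           # a -> b
d_ba, _ = cKDTree(cloud_a).query(cloud_b)           # b -> a

chamfer = d_ab.mean() + d_ba.mean()                 # symmetric Chamfer distance
hausdorff = max(directed_hausdorff(cloud_a, cloud_b)[0],
                directed_hausdorff(cloud_b, cloud_a)[0])
emd_1d = wasserstein_distance(d_ab, d_ba)           # 1-D EMD on the NN distance distributions

print(chamfer, hausdorff, emd_1d)
```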
G. Planning KL-Divergence (PKL)
To illustrate the effectiveness of PKL over NDS, we used a trained MEGVII [108] point cloud 3D object detection model. In this model, sparse 3D convolutions [109] are used to extract rich semantic features, which are subsequently fed into a class-balanced multi-head network. Class-balanced sampling and augmentation techniques address the significant class imbalance inherent in autonomous driving datasets, and balanced grouping heads improve the results for groups with comparable shapes. In the multi-group head design, classes with comparable shapes or sizes (car, bicycle, pedestrian, etc.) can cooperate with one another, whereas categories with dissimilar shapes or sizes stop interfering with one another. The 3D feature extractor is composed of sub-manifold and regular 3D sparse convolutions. Its outputs, with a 16:1 downscale ratio, are flattened along the output axis and fed into the subsequent region proposal network, which produces 8:1 feature maps, and into the multi-group head network, which produces the final predictions. The number of groups in the head is set according to the grouping specification. The main goal of PKL is to flag the false positives (FP) and false negatives (FN) of the object detection model, as in Fig. 15. In general, a falsely detected parked vehicle will not lead to dangerous maneuvers by the AV, while a FP directly in front of it will. Metrics like mAP and NDS treat both cases in the same way when ranking the MEGVII object detection model, whereas PKL penalizes them differently and ranks the model accordingly. A higher PKL value indicates worse performance, as in Fig. 16, and a lower one indicates better performance, as in Fig. 17.
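Conceptually, PKL compares the planner's output when conditioned on ground-truth objects with its output when conditioned on the detector's objects. The sketch below only shows the KL-divergence step on discretized planner distributions; the planner itself and the toy grids are hypothetical placeholders, not the actual PKL implementation.

```python
# Conceptual PKL sketch (assumption: the planner outputs a grid of probabilities
# over future ego positions; here the grids are made up for illustration).
import numpy as np

def pkl(p_gt, p_det, eps=1e-12):
    """KL(p_gt || p_det) between two discretized planner distributions."""
    p = np.clip(np.asarray(p_gt, dtype=np.float64), eps, None)
    q = np.clip(np.asarray(p_det, dtype=np.float64), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy grids: planner conditioned on ground-truth boxes (p_gt) vs. detected boxes (p_det).
p_gt = np.array([[0.1, 0.6], [0.2, 0.1]])
p_det = np.array([[0.1, 0.5], [0.3, 0.1]])
print(pkl(p_gt, p_det))   # 0 when detections leave the plan unchanged; larger = worse
```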
Fig. 15 shows some examples of pretrained planner predictions on the nuScenes test dataset; the pre-trained planner is available at the link given in the corresponding footnote. Fig. 16 shows the ground truth, the predictions, and the PKL for the sequence in which the reported MEGVII detections perform the worst according to the PKL measure. A FP appearing in front of the ego vehicle makes the objects ahead appear to be traveling backwards; because of this, the planner anticipates that the ego vehicle will halt instead of moving ahead, which incurs a severe penalty under the PKL metric. The time interval where MEGVII performs best under PKL is depicted in Fig. 17. The car to the left of the ego vehicle is consistently detected throughout the time sequence. Although there are a number of FP human detections in the scene, the task of waiting at the light is unaffected by them, so the scene still functions properly. Accurately recognizing the people on the sidewalk nevertheless remains an essential subtask for other downstream tasks in autonomous driving. Our objective is not to promote PKL as the only metric for object detector evaluation, but rather to suggest it as an alternative to task-agnostic metrics that do not take into consideration the context in which perceptual errors occur. In Fig. 17, the green object is the ego car, red denotes a FP, and pink denotes a FN.
H. Timed Quality Temporal Logic (TQTL)
In this section, the impact of TQTL on two object detection models is discussed. We assessed pre-trained weights from the code repositories maintained by the creators of the original SqueezeDet and of YOLOv3. Both models are trained on the KITTI object detection dataset with a total of nine classes (e.g., cyclist, van, misc.). Both models were trained for 1000 epochs on a GPU-equipped machine, taking 9 and 12 hours for SqueezeDet and YOLOv3, respectively. A portion of the KITTI raw dataset was used to monitor the data streams produced by these two models against the TQTL specifications [59].
One of the specifications verifies that if the object detection algorithm identifies a cyclist in a frame with high confidence, then in subsequent frames that object should not be re-classified as a pedestrian with high confidence; a second specification requires the probability of the object being identified as a cyclist or a pedestrian to remain above 60%.
The ability of TQTL to compare object characteristics across frames is demonstrated by these specifications. As seen in Fig. 19, the first specification is violated because YOLOv3 mistakenly classifies the cyclist as a pedestrian with a fair amount of confidence. Fig. 19 also shows that the second specification is violated, since the probability of the cyclist being correctly identified as a cyclist or a pedestrian drops below 60%; this explains the negative robustness observed when measuring against the second specification. SqueezeDet violates the first specification because the cyclist is incorrectly classified as a pedestrian with high confidence in the stream of Fig. 20. This shows that, like YOLOv3, the algorithm mislabels the cyclist as a pedestrian in images where the cyclist is moving nearly parallel to the camera's viewing direction. Even when the second specification is used to monitor this misclassification, the algorithm continues to violate the property. This is caused by "phantom" objects that SqueezeDet initially detects with high confidence but then unexpectedly fails to detect, as illustrated in Fig. 20. With these results, TQTL allowed us to locate interesting examples of poor-quality perception outputs localized to a specific set of frames. Such information can be very helpful when debugging a perception algorithm, especially one used in a safety-critical context.
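For intuition, the following plain-Python monitor captures the spirit of the two specifications above (it is not TQTL syntax, and the detection format, thresholds, and object IDs are assumptions for illustration).

```python
# Illustrative frame-by-frame check in the spirit of the two TQTL specifications.
def check_specifications(frames, high=0.75, low=0.60):
    """frames: list of {object_id: (class_name, confidence)} per video frame."""
    violations = []
    for t, dets in enumerate(frames):
        for obj_id, (cls, conf) in dets.items():
            if cls == "cyclist" and conf >= high:
                for t2 in range(t + 1, len(frames)):
                    later = frames[t2].get(obj_id)
                    if later is None:
                        continue
                    # Spec 1: a confidently detected cyclist must not later be
                    # labelled a pedestrian with high confidence.
                    if later[0] == "pedestrian" and later[1] >= high:
                        violations.append(("spec1", obj_id, t, t2))
                    # Spec 2: the cyclist/pedestrian confidence for that object
                    # must stay above the lower threshold.
                    if later[0] in ("cyclist", "pedestrian") and later[1] < low:
                        violations.append(("spec2", obj_id, t, t2))
    return violations

# Toy stream: object 7 is a cyclist, then becomes a low-confidence pedestrian.
stream = [{7: ("cyclist", 0.8)}, {7: ("pedestrian", 0.4)}]
print(check_specifications(stream))   # -> [('spec2', 7, 0, 1)]
```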
a) With high confidence (greater than 75%), YOLOv3 incorrectly labels the cyclist as a pedestrian; b) The probability that YOLOv3 detects the cyclist varies between 0 and 75%.
a) With high confidence (greater than 75%), SqueezeDet incorrectly labels the cyclist as a pedestrian; b) Occasionally, SqueezeDet detects an erroneous cyclist with a probability ranging from 55% to 75%.
I. Spatio-Temporal Quality Logic (STQL)
The PerceMon framework monitors and broadcasts all the data from the simulator, including information from the autonomous vehicle's cameras, using the ROS wrapper for CARLA [61], as shown in Fig. 21. Perception modules, such as the YOLO object detector [110] and the DeepSORT object tracker [111], consume the image data and broadcast processed data. The information published by these perception modules can be used by other perception modules, by controllers, and by online PerceMon monitors to follow recognized objects and possibly avoid collisions. Fig. 21 gives an overview of the architecture. The YOLO object detector produces bounding boxes for the images, DeepSORT assigns an ID to each set of detections, and, using Kalman filters and a cosine association measure, it then attempts to track each identified object across multiple frames. PerceMon detects false negatives and false positives in object detectors by means of two specifications written in STQL:
Consistent detection: If an object is far from the margins in the current frame and has a high confidence value, it must have existed in the preceding frame with a similar high confidence value.
Smooth object trajectories: Every object in the current frame must have a bounding box that overlaps with the equivalent bounding box in the previous frame by at least 30%.
With these two specifications, PerceMon monitors the above properties for the scenarios shown in Fig. 22, as well as the time taken to compute their satisfaction values. The object detector finds more objects as more passive, non-adversarial vehicles are added to each scenario. Since the runtime of the STQL monitor grows exponentially with the number of object IDs, PerceMon can thus empirically evaluate how long it takes to compute the satisfaction values.
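As a rough illustration of what these two properties check, the sketch below compares two consecutive frames of detections. The detection format, confidence threshold, image size, margin, and the use of IoU for the 30% overlap are assumptions made for the example, not the STQL formulation used by PerceMon.

```python
# Plain-Python sketch of the "consistent detection" and "smooth trajectories" checks.
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def check_frame_pair(prev, curr, conf_thresh=0.7, margin=20, width=1280, height=720):
    """prev/curr: {object_id: (box, confidence)} for two consecutive frames."""
    violations = []
    for obj_id, (box, conf) in curr.items():
        central = (box[0] > margin and box[1] > margin and
                   box[2] < width - margin and box[3] < height - margin)
        # Consistent detection: a confident, central object must also have been
        # detected with high confidence in the previous frame.
        if conf >= conf_thresh and central and (
                obj_id not in prev or prev[obj_id][1] < conf_thresh):
            violations.append(("consistent_detection", obj_id))
        # Smooth trajectories: at least 30% overlap with the previous bounding box.
        if obj_id in prev and iou(box, prev[obj_id][0]) < 0.30:
            violations.append(("smooth_trajectory", obj_id))
    return violations

# Toy example: object 3 appears confidently in the current frame only.
print(check_frame_pair({}, {3: ((100, 100, 200, 200), 0.9)}))
# -> [('consistent_detection', 3)]
```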
J. Panoptic Segmentation
In this article, we consider some of the models used for panoptic segmentation, and the respective results are tabulated in Table 9. In this table, the superscripts ‘
PSPNet [76] combines Mask R-CNN-based instance segmentation with FPN-based semantic segmentation. Surprisingly, this simple framework not only works well for instance segmentation but also provides a fast, efficient approach to semantic segmentation. RTPS [113] uses dense detection and a global self-attention mechanism; it introduces a parameter-free mask-construction technique that effectively reuses information from the object detection and semantic segmentation subtasks to significantly reduce computational complexity. Because of the network's straightforward data flow and lack of feature-map resampling, significant hardware acceleration is possible. The PanopticDepth [114] model relies on dynamic convolution to predict depth and segmentation masks per instance instead of predicting depth for all pixels at once. Panoptic-DeepLab [115] uses dual Atrous Spatial Pyramid Pooling (ASPP) modules and dual decoders for instance segmentation and semantic segmentation, respectively. To extract a denser feature map, it applies atrous convolution in the last block of the network backbone. The context module uses ASPP together with a lightweight decoder that applies a single convolution at each upsampling stage.
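For reference, the panoptic quality (PQ) metric commonly reported for such models (and referred to in the conclusions below) follows the standard definition, which factors into segmentation quality (SQ) and recognition quality (RQ) over matched true positive segment pairs (p, g):

\[
\mathrm{PQ} \;=\; \frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}
\;=\; \underbrace{\frac{\sum_{(p,g)\in \mathit{TP}} \mathrm{IoU}(p,g)}{|\mathit{TP}|}}_{\text{SQ}}
\times
\underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{RQ}}
\]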
Conclusion
In autonomous driving and advanced driver-assistance systems, perception algorithms play a significant role in observing the surrounding environment for safe, secure, and collision-free motion. The performance of these algorithms depends on several factors, and selecting the most accurate and robust one is a crucial task. Thus, after training, performance has to be defined and evaluated with metrics based on “unseen” test data. This is achieved by resorting to testing methods that compare their output against the ground truth (annotated data) included within the dataset, and provide detailed test reports including statistics, correlations, outliers, etc.
This paper presents an overview of the four main perception performance assessment approaches: point cloud quality analysis, object detection, object tracking, and panoptic segmentation. Different metrics and their advantages and disadvantages across different models are also discussed, with particular emphasis on state-of-the-art metrics used to measure the performance of object detection, object tracking, and panoptic segmentation algorithms. Object tracking is, in fact, intimately related to object detection, as tracking implies detecting the same object across frames and estimating or predicting the position and other attributes of a moving object.
The following main conclusions can be drawn from the conducted experiments:
LiDAR point cloud: The fidelity of the environment representation captured with LiDARs depends on many factors, such as the functional characteristics of the sensors used, environmental weather and lighting conditions, the speed of the ego vehicle, etc. To measure a LiDAR device's accuracy with respect to a reference LiDAR, distance accuracy is the most commonly used metric. In this review, we evaluated four different distance metrics and observed that the Earth Mover's Distance best captures the dissimilarity between two point clouds or distributions generated with two different LiDARs.
Object detection: For autonomous vehicles, the most important perception algorithm is object detection, so this review gives particular attention to metrics that measure the performance of object detection algorithms. The effectiveness of object detection models depends on many factors, such as the speed and size of the objects, the size of the dataset, class imbalance, etc. Several metrics exist in the literature to measure the performance of object detection models, each with its own advantages and disadvantages. The most commonly used metric is mean average precision (mAP), but it ignores errors in the object's position, size, orientation, and velocity. The nuScenes detection score (NDS) was therefore introduced, which takes these aspects into account when computing the model's performance. To illustrate this effect, we considered shape signature networks and PointPillars (with backbones such as the feature pyramid network and the SECOND feature pyramid network) as object detection models, and used them to show how the nuScenes detection score differs from mean average precision in measuring performance. It was also observed that mAP and NDS fail to account for the impact of the false positives and false negatives of the object detection model, so PKL was introduced and tested on the MEGVII point cloud 3D object detection model. This test showed that PKL is capable of distinguishing a falsely detected parked vehicle from a false positive directly in front of the autonomous vehicle by assigning them different penalties, while mAP and NDS treat both cases alike. We also observed that a higher PKL value indicates worse model performance (the ideal value being zero). In addition to these metrics, TQTL and STQL were introduced as object detection metrics: TQTL considers time, and STQL considers both time and space in evaluating the model's performance. To assess the impact of TQTL on the SqueezeDet and YOLOv3 models, we performed experiments on pre-trained versions of both on the KITTI dataset. It could be observed that both models fail to correctly classify cyclists and pedestrians moving directly in front of the autonomous vehicle, but TQTL identifies these failures through two specifications defined over the video's time frames. Similarly, STQL was used to monitor the performance of the YOLO object detector followed by the DeepSORT object tracker; STQL successfully detects false negatives and false positives in object detectors through two specifications defined over the spatial and temporal frames of the video.
Panoptic segmentation: It is a cascaded combination of semantic and instance segmentation; thus, metrics used for both are useful for panoptic segmentation. The most commonly used metric for semantic segmentation is the Dice coefficient. In the literature, two metrics for panoptic segmentation can be found: parsing covering and panoptic quality, the latter combining segmentation quality and recognition quality. A table of models evaluated with these metrics is presented.
Different methods exist to evaluate the performance of LiDAR data perception algorithms. The diversity and specificity of driving conditions and of the vehicle's surroundings require the rigorous application of various methods to fully evaluate the algorithms' capabilities and ensure the highest levels of dependability and safety for autonomous driving and advanced driver-assistance systems. Other methods, not reported here, exist or are being developed to address these requirements, including, for example, testing in dynamic scenarios, measuring the signal-to-noise ratio of both distance and beam intensity, and testing under moisture, mechanical, and other environmental influences. In addition, a greater diversity of datasets is needed so that the most realistic evaluation conditions are available as input.