Efficient Characterization Method for Big Automotive Datasets used for Perception System Development and Verification

The paper proposes a formal approach for describing and evaluating the datasets that are used in automotive applications for machine learning, testing, and validation purposes. Proper, that is, both qualitative and quantitative, characterization of the datasets can simplify the analysis, evaluation, and comparison of perception-based algorithms designed for highly automated vehicles. Such formalism is also needed to achieve compliance with the automotive industry safety standards that have been recently introduced. Characterization in the form of the size or type of raw data, the number of recognized and classified objects, and environmental parameters is not well suited to describing both the static and dynamic aspects of automotive datasets; therefore, another approach is required. In this paper, an efficient method based on an object tracking mechanism, a grid representation of the sensor field of view, the heatmap concept, and the Wasserstein metric is proposed. The efficiency of the method is demonstrated by its ability to handle the size, properties, and diversity of a dataset, including static and time-varying aspects. The presented description can also be used to compare different datasets and to define the amount of data to be collected.


A. MOTIVATION
The development of new-generation cars with higher levels of automation will require solving many research problems. One problem can be formulated as adding eyes and a brain to the vehicle so that it is aware of its driving environment, first to assist and then to replace the driver. This research problem is equivalent to equipping a car with a type of artificial intelligence. As special skills and the ability to properly perceive the surrounding environment are required of human beings driving cars, artificial intelligence (AI) together with machine learning (ML) is essential to achieve higher levels of autonomy. In other words, achieving higher levels of driving automation with appropriate levels of quality and reliability will not be possible without the use of AI-based methods. As a reminder, the Society of Automotive Engineers (SAE) distinguishes six levels of autonomy [1] that range from level 0, which means no automation, to level 5, which means full automation. It should be noted that most currently produced models are at level 1 or 2 on this scale.
In the design and development process of automotive control systems, one of the most important issues is the safety of the systems. This is particularly valid for systems that utilize AI components. It is clear that testing all aspects of such systems is impossible, as there is an infinite number of important road scenarios. The process of selecting just a few of the many possible scenarios is a difficult and challenging task and is currently most often based on qualitative best engineering judgment. Verification of automotive control systems requires large, variable, and diverse datasets to guarantee a proper safety level. By large datasets, we mean collections containing several dozen petabytes (PB) of data at the present state of the art. In the last few years, these requirements for the amount of collected data have changed by several orders of magnitude, which illustrates the technological progress in this area. It is also expected that without virtual simulators, it will not be possible to comprehensively verify driving algorithms for autonomous vehicles. Virtual simulators supported by AI can supplement datasets collected from real test drives. The extended datasets can then be used for learning, testing, and validation purposes. The data from many sensors recorded in a vehicle, with a distribution that should reflect normal user statistics, are named real-world user profiles (RWUPs). Virtual-world user profiles (VWUPs) represent data generated from virtual environments.
One of the most challenging problems in the development of automotive control systems is defining the performance indicators used to evaluate and compare different algorithms on a given dataset. This is particularly important for solutions based on machine learning techniques, where the data are categorized into learning, test, and validation subsets. The learning dataset is used to obtain and tune the neural network parameters. The test dataset is used to verify the learning outcomes. Validation datasets are used to assess algorithm performance. The problem is that with two different learning datasets, we can train a network substantially differently. Moreover, when the network is already trained, two validation datasets can produce different quality metrics. There is a need for an overview of different results under unified assumptions, as presented, for example, in [2]. Thus, there is a need to formally describe the datasets that are used for learning, testing, and validation purposes. With this formality in place, we can develop a measure that allows us to conclude whether datasets are similar in terms of defined criteria, so that we can compare the results of the algorithms. It is also clear that to reliably compare two different algorithms, similar datasets must be used. If the datasets are not comparable, we cannot draw any conclusions from the comparison. The authors of [3] claim that data design deserves at least the same amount of attention as model design. Their work underlines that public datasets in the academic community differ substantially from those in industrial settings. An evaluation of active driving assistance systems in the newest cars performed by researchers at AAA shows that over 4000 miles of real-world driving, some type of issue occurred on average every eight miles [4]. This discrepancy shows that validation datasets do not reflect reality correctly.
More open problems and current research challenges connected to datasets used in computer vision for autonomous vehicles are described in [5].

B. CONTRIBUTION
In this paper, we propose a formal approach for the description, evaluation, and comparison of datasets that are used in automotive applications for machine learning, testing, and validation of perception-based systems. We use the tracks of detected and classified objects as the main sources of information. The tracks are projected on a grid representing the field of view of the sensors monitoring the vehicle's surroundings. Then, the coverage of the grid elements by the tracks is used to create heatmaps. In our approach, the heatmap is considered a formal representation of the dataset. We propose the use of the Wasserstein metric to compare heatmaps. In addition, we describe track similarity measures that can be used to select unique and diverse test scenarios in the data collection process. The theoretical analysis is verified on publicly available datasets.

C. ORGANIZATION OF THE PAPER
The paper is organized as follows. The next section is a literature review focused on available methodologies of big data description that are useful for automotive data. The subsequent section contains background information related to specific aspects of automotive perception systems: sensors, object tracking, safety, grid representation, and heatmaps. In the following two sections, similarity measures for comparing tracks and test scenarios are introduced. Next, the concept of dataset characterization is presented. The final section contains comments and conclusions. The pipeline of the proposed methodology is depicted in Fig. 1.

II. RELATED WORK
Formal descriptions of datasets can be connected to the determination of different types of metadata and the building of meta-models that can characterize the datasets. Metadata and meta-models play key roles in the meta-learning process [6], which is a field of research that aims at improving learning performance. Metadata in the case of automotive datasets can be considered a supplement to the raw data logged directly from the vehicle sensors. The metadata are usually logged at the same time as the raw data by the vehicle logging system during data collection (e.g., GPS position, speed, yaw rate, weather conditions, etc.) or are added later as a result of the postprocessing of raw signals by the system functions (e.g., height of the sun, number of pedestrians in a frame, etc.) [7]. The extensive research done thus far shows that a strong correlation exists between learning datasets and classification algorithms [8]. In [9], the authors investigated such correlations and similarities between datasets; two approaches were described: clustering datasets using error correlations and clustering datasets using ranks.
In [10], complexity measures were proposed to characterize datasets and to assess the performance levels of learners. Further study on the prediction of classification performance using complexity measures was carried out in [11].

A. BIG DATA CHARACTERIZATION AND VISUALIZATION
One of the most comprehensive characterization methodologies for big data workloads is the metric importance analysis (MIA) found in [12]. It presents a methodology for choosing and presenting metrics from wide sets that represent data at multiple layers. The authors state that MIA reduces the complexity of the analysis without losing information. The most important metrics characterizing the data are then shown with Kiviat plots.
Regarding the analysis and synthesis of large datasets, much space in the literature is devoted to visualization methods. In [13], the authors presented a tool called time-tunnel for the visualization of time series multidimensional data in a web browser using 3D charts. The approach relies on the exploration and comparison of patterns to find a strong relationship between them. To do so, the data-wings (2D subchart) in time-tunnel are selected and arranged based on a genetic algorithm that produces mutations of potentially optimal solutions to find the best one.
From the automotive point of view, sensor data are classified as massive spatiotemporal data by the authors of [14]. This enables automatic geospatial analysis and visualization. Another example of big data visualization comes from [15], which describes how to use the geometry of parallel coordinates to present high-dimensional data and enable visual patterns. The author states that complex data cannot be shown in a single visualization alone, but multiple views of different types can provide different perspectives on the same data and are key for deep understanding. To emphasize this effect, it is good to coordinate multiple views. Brushing and linking is given as an example, and [16] extends the parallel-coordinates approach to the selection of scatterplot techniques. Different types of visualization of high-dimensional data can be found in [17], which shows how to decide which two variables should be selected to investigate their data distribution on the basis of a 3D correlation matrix. Custom properties can be represented as clustered 3D spheres, with radius based on the cluster size and color based on a variable independent of the clustering. Additionally, we can create cubes with sorted custom properties. The paper [18] describes a visualization method for spatially fixed multivariate volumetric data that focuses on two goals, i.e., maintaining the see-through property while providing easily interpretable color-coded scalar values. The authors find a compromise between these opposing goals by redistributing the opacity of a voxel, making part of it more opaque and the other parts more transparent. The paper [19] draws attention to the fact that dealing with multivariate big data can mean analyzing the same topic, and even the same source datasets, while different attributes are being measured.
That publication presents the Robinson-Foulds tree distance metric [20] and variation of information [21] as similarity metrics that can be adapted to compare dimensionally inconsistent multivariate data, which is a subclass of big data variety research. The authors of [22] stated that directly processing voluminous data is inefficient; instead of juxtaposing different visualization types, they proposed dividing the data into regular blocks using dimensionality reduction techniques such as linear discriminant analysis (LDA) [23], principal component analysis (PCA) [24], and multidimensional scaling (MDS) [25]. LDA is a dimensionality reduction method that attempts to find a linear combination of variables to categorize or separate two or more groups. MDS is used to visualize given high-dimensional data in a low-dimensional space by generating a configuration of the given data in a Euclidean low-dimensional space. PCA transforms the data into a new coordinate system by projecting the original data in the directions where the variance between data points is maximized. The principal components are derived from the eigenvectors of the covariance or correlation matrix of the dataset. The authors show a novel extension of the last method based on k-means block segmentation. Based on this, we can state that although ways to describe multidimensional data and spatiotemporal automotive data related to single scenarios exist, there is a lack of ability to synthesize and summarize information that is important from the point of view of learning and testing automotive detectors and trackers. The desired methodology should describe not only the quantity of data but also their quality, taking into account the diversity of the whole dataset.
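To make the PCA description above concrete, the following minimal sketch (ours, not from the cited works) derives the principal components from the eigenvectors of the sample covariance matrix using NumPy:

```python
import numpy as np

def pca(data, n_components=2):
    """Project data onto the directions of maximal variance.

    data: array of shape (n_samples, n_features).
    Returns the projected data and the principal directions.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # sort directions by variance, descending
    components = eigvecs[:, order[:n_components]]
    return centered @ components, components
```

The first projected coordinate then carries the most variance, the second the next most, which is exactly the property the block-segmentation extension in [22] builds on.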

B. SIMILARITY OF OBJECTS IN TIME
The description of the similarity between tracks with time signatures can also be carried out simultaneously with the analysis of object cubature, velocity, and other properties dynamically assigned to each position of the track generated by the object. The most natural way to do this is the method described in the DAVIS challenge [26], but other options also exist. Examples are variations of the OSPA metric [27], [28], [29], which allow the information to be divided into three components, separately focusing on localization, cardinality, and labeling errors. Weighted state correlation similarity [30], tracking difficulty [31], information theory-based metrics [32], shape association measures based on the Minkowski distance [33], and other metrics ([34], [35], [36]) are event evaluation metrics. The Jaccard index is also a tool that can be used both for local similarity and to summarize trajectories overlapping in time; it is widely used in various tasks as an evaluation metric. In [37], the authors use it to study the influence of small errors in the ground truth on the tracking algorithm output. In similar cases, precision and the possibility of decomposing information are essential. The authors in [38] showed a combined performance estimation criterion by considering the veracity, real-time demand, and hardware implementation of the tracking algorithm. In [39], we can find an algorithm that applies fuzzy rules to merge the detected bounding boxes into a unique cluster bounding box that covers a unique object. The abovementioned publications show that there are many existing methods to describe the similarity between bounding boxes, but none of them directly translates into the analysis of the trajectories created from these data. Therefore, we want to analyze methods directly aimed at trajectory analysis.

C. TRAJECTORY ANALYSIS
Motion trajectory analysis is frequently mentioned in the literature. One notable approach, presented in [40], is called TrajAlign. This method presents a methodology for the alignment of trajectories by using their representative distance matrices and Needleman and Wunsch's dynamic programming algorithm [41]. In [42], the authors optimized the trajectory distance calculation by employing the multiscale transform method and using vector fields on the manifold instead of the Euclidean distance. The work in [43] focuses on trajectory smoothing by clustering with the Euler distance and a two-dimensional wavelet transform performed on the received groups. The reconstructed trajectory can be treated as denoised data. The authors of [44] defined a novel similarity measurement that considers spatiotemporal and semantic features simultaneously, using the multidimensional semantic matrix of the spatiotemporal trajectory and singular value decomposition to mine the essential characteristics of trajectories. Spatiotemporal matching of trajectories is also considered in [45]. Reference [46] provides a review of existing approaches to trajectory similarity functions and evaluates them in terms of computational cost and memory. The authors of [47] presented a similarity measurement between trajectories based on Pearson's correlation coefficient and the coefficient of determination. In [48], the authors proposed a similarity measure for trajectories based on the longest common subsequence (LCS) method and tested it on a database consisting of tracks of moving vehicles in cities. Reference [49] presents a robust-to-noise version of the LCS method. The authors of [50] show the pose normalization process for a trajectory matching framework that is translation, rotation, and scale invariant.
All of these metrics can be useful for trajectory analysis, but our aim is to present trajectory comparison metrics regarding the position on the occupancy grid and the time that an area on the grid was occupied by a given object. Additionally, none of those methods has the ability to compare systems of trajectories, which would provide whole-scenario comparison functionality.

D. WASSERSTEIN METRIC CAPABILITIES
Reference [51] focuses on metric learning that satisfies the data geometry. The goal of a metric learning algorithm is to learn a metric that assigns a small distance to similar points and a relatively large distance to dissimilar points. Obtaining such a metric for high-dimensional data, which is often difficult even to visualize, is hard because the metric has to satisfy the data geometry. The authors propose using the Mahalanobis distance as the ground distance for the Wasserstein metric calculation and automatically learn the corresponding matrices by an alternating iterative approach. They present the ability of different methods to accurately classify information in traffic video databases, showing that their methods outperform others such as the k-nearest neighbors method using the Euclidean distance or support vector machine (SVM) models [52]. Publication [53] shows efficient algorithms for the calculation of the Wasserstein barycenter while simultaneously showing that it is a well-suited tool for the generalization of sets of 2D pictures with various translations of content. Additionally, the author of [54] constructed a two-sample test based on the Wasserstein metric that is designed to detect structural breaks in data with complex geometries. Publication [55] describes a multivehicle tracking methodology based on the Wasserstein association metric; the reported results show that the Wasserstein distance performs well in tasks that require assessing the similarity of objects of similar types and is robust to partial occlusion. The authors of [56] describe how to use the Wasserstein generative adversarial network (WGAN) to denoise and improve the quality of positron emission tomography imaging to reduce patient radiation exposure.
Using the Wasserstein distance as a loss function solves the problems of excessive smoothness and loss of detailed information associated with classic image reconstruction approaches. Reference [57] proves that the Wasserstein metric can be used to critically score the quality of reconstructed magnetic resonance imaging. All this research provides arguments that the metric is useful in tasks that involve analyzing the intensity of a given phenomenon. Trajectory data, understood as the continuous occupation of the field around the ego vehicle by objects that appeared in test scenarios, require tools that are sensitive to changes in the intensity and localization of data. Based on this, we propose a methodology for the analysis of large automotive datasets consisting of trajectories. It allows the occupancy generated by the trajectories to be described with grid-based versions of similarity metrics and the Wasserstein distance. These tools make it possible to treat sets of trajectories as single objects without losing key information through unsuitable generalization.

A. SENSORS
Cameras, radars, and lidars are currently essential elements for many automotive control systems. The raw data provided by these sensors constitute an input to the perception algorithms aimed at monitoring and interpreting what is happening outside the vehicle.
A camera sensor takes images of the vehicle's surroundings that are further processed by AI-based algorithms to detect and classify objects. The analysis is performed either in the camera (a so-called smart sensor) based on a custom SoC (system-on-chip) or within a centralized processing unit (a so-called multi-domain controller) with either dedicated or shared computing resources. The output is high-level information about the detected objects (object IDs, bounding box positions and sizes, etc.).
Radars analyze the reflections of transmitted electromagnetic waves from different object types to find detections (single-point reflections from an object). A single detection carries information not only about x, y, the position of the reflection point, but also about many other parameters, such as the target speed (from the Doppler effect). In general, there are four levels of radar data abstraction [58]: raw analog-to-digital converter (ADC) values logged from the antennas, data cube representations (CDC), the detection level, and the object (tracking) level, where multiple detections are assigned to one object. The radar systems currently used are equipped with algorithms that, in addition to clustering, tracking, and providing range and velocity information, are also able to classify some objects.
Lidar works on the same principle as radar, but instead of microwaves, it uses light pulses. Due to its high resolution, lidar can be used to classify objects as well as to measure their positions. Unfortunately, lidar technology is still not mature enough (in both cost per unit and lifetime durability) to be mounted on a mass scale in vehicles. There are some examples of successful projects using solid-state lidar technology. In most cases, however, lidars are used as reference sensors to enable precise labeling and positioning of objects around data logging vehicles. Nevertheless, the most popular devices, due to quality vs. price factors, are rotating devices rather than solid-state lidars.
It should be mentioned that the sensors provide data in different coordinate systems, for example, radars in a polar coordinate system and cameras in a planar coordinate system, and usually with different frame rates. Moreover, different sensors may have different accuracies, ranges, and angles of view.
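As a small illustration of the coordinate-system mismatch mentioned above, the sketch below (ours; the function name and frame convention are illustrative assumptions, not from the paper) converts a radar detection given in polar coordinates to the Cartesian vehicle frame in which, e.g., tracked object positions are expressed:

```python
import math

def polar_to_cartesian(r, azimuth):
    """Convert a radar detection (range in meters, azimuth in radians)
    to Cartesian coordinates in the ego-vehicle frame:
    x pointing forward, y to the left (an illustrative convention)."""
    return r * math.cos(azimuth), r * math.sin(azimuth)
```

A fusion stage would additionally have to resample the streams to a common frame rate and account for each sensor's mounting position, which this sketch omits.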
To increase the performance of vehicle perception, car manufacturers and designers increase either the number of sensors mounted in the vehicle or their resolution.

B. AUTOMOTIVE PERCEPTION SYSTEMS
The perception system analyzes the data provided by a sensor to detect and classify objects existing in the sensor field of view. Video-based motion detection is still a challenging task, and reference [59] describes a memory-efficient texture descriptor to overcome real-world problems. In works [60], [61] and [62], the authors consider this task from the point of view of background subtraction. The authors of [63] focus on various types of methods connected to human activity recognition. In addition, the system is responsible for determining the static and dynamic parameters of the detected objects. An example of the moving object detection (MOD) problem is addressed with graph convolutional neural networks in reference [64]. Typical object classes include bikes, traffic signs, barriers, tunnels, etc. A typical set of object parameters includes: position, either in 2D or in 3D space; motion parameters such as velocity, acceleration, heading, and turn rate; and parameters that determine the shape and size of the objects and the orientation of the shape. For some applications, the determination of the allowable (so-called freespace) or not allowable (so-called occupancy space) driving space is also required [65]. The perception system is exposed to many types of uncertainty; with each object that is visible and detectable in the sensor field of view, a certain probability of existence is associated. To increase the confidence of the detected objects, features from different types of sensors are fused together (Fig. 2). Recent reviews of tracking and object detection systems in automotive applications can be found in [66], [67] and [68].

C. SAFETY OF THE INTENDED FUNCTIONALITY
Verification of automotive perception systems is difficult to achieve. Such systems rely on sensing the external environment, which is uncertain and contains an infinite number of possible road scenarios. Due to the limitations imposed by the design and implementation of the perception algorithms, as well as the performance of the hardware platform executing these algorithms, there can be potentially hazardous behavior in the system. In addition, due to the high complexity of the system and the specificity of the operation of sensors and associated detection systems, the appearance of errors is inevitable. A properly designed active safety system should take into account the probability of these distortions. System architecture, redundancy, fuzzy logic elements, and fusion algorithms must be designed in a way that guarantees non-propagation and possibly quick suppression of emerging distortions. Proving that the applied methods guarantee the expected quality of operation of the entire system is a major problem. The standard ISO/PAS 21448:2019 Road vehicles - Safety of the intended functionality [69] provides examples of such an approach, including the inability of the perception function to correctly comprehend the situation and operate safely, as well as insufficient robustness of the function with respect to sensor input variations and diverse environmental conditions. The standard [69] also provides an example methodology for the definition and validation of the acceptable risk level needed to prove the safety of the system. The methodology includes: (1) partitioning of system failures; (2) modeling of hazardous events; (3) analysis of traffic statistics; and (4) definition of test scenarios.
To support this methodology, the data collection should include relevant driving situations derived from the analysis of sensor limitations and feature-specific limitations, such as those described in [70] or [71], regarding LiDAR performance verification in different conditions. In such an analysis, a formal description of the representative dataset is an essential element.

D. OBJECT TRACKING
Object tracking is defined as the problem of extracting the motion of an object from a sequence of scans and estimating its trajectory. The scans can be in the form of images or clouds of radar or lidar points. The object state, in addition to its position in the 2D or 3D plane, usually includes additional variables related to its geometric and kinematic state. For example, the state vector representing the state of the detected object on a 2D plane can have the following form (Fig. 3):

x_k^(j) = [q_k^(j), v_k^(j), l_k^(j), w_k^(j)]^T,

where q_k^(j) is the object position at time step k, and j stands for the object's identifier. The kinematic state of the object is determined by its velocity v_k^(j). The object's shape is assumed to be a rectangle with length l_k^(j) and width w_k^(j) that may vary over time due to its visibility in the sensor field of view. T_s denotes the sampling time.
Definition 1: The track T^(j) = {x_n^(j) : n = k, ..., m-1} of object j detected by the perception system is a motion trajectory extracted from a sequence of scans.
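A minimal data-structure sketch of the state vector and track defined above (all class and field names are our illustrative choices, not part of the paper's formalism):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObjectState:
    """State x_k^(j) of a detected object at one time step (2D case)."""
    x: float       # position q_k, longitudinal [m]
    y: float       # position q_k, lateral [m]
    vx: float      # velocity v_k components [m/s]
    vy: float
    length: float  # bounding-box length l_k [m]
    width: float   # bounding-box width w_k [m]

@dataclass
class Track:
    """Track T^(j): the sequence of states x_n^(j) for n = k, ..., m-1."""
    object_id: int
    states: List[ObjectState]

    def positions(self) -> List[Tuple[float, float]]:
        """Return the position part q_n of each state."""
        return [(s.x, s.y) for s in self.states]
```

The `positions()` helper extracts exactly the part of the state that the grid transform of the next subsection operates on.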

E. GRID SPACE
Let X ⊂ R^2 denote the state space consisting of all points on the 2D plane that are located in the sensor field of view. This means that any object j in the vehicle's surroundings with coordinates [x_k, y_k]^T ∈ X shall be detected by the perception system.
Definition 2: Grid G_h with resolution h = (h_x, h_y) transforms any point [x, y]^T ∈ X into the pair of integers (⌊x/h_x⌋, ⌊y/h_y⌋) ∈ Z × Z, where ⌊x/h_x⌋ is the largest integer not greater than x/h_x, ⌊y/h_y⌋ is the largest integer not greater than y/h_y, and Z stands for the set of integers. Using this definition, state space X can be transformed into space X_h, as indicated by the following definition.

Definition 3: Space X_h transformed from space X using grid G_h is defined as X_h = {G_h(q) : q ∈ X}.

The number of elements in the transformed state space X_h depends on the parameter h and can be chosen based on the system requirements. The number of states of the transformed state space X_h covered by track T^(j) can be calculated as |{G_h(q_n^(j)) : n = k, ..., m-1}|.

By a scenario S, we understand a set of tracks {T^(j) : j ∈ J ⊂ N+}.

Definition 4: For scenario S, the histogram of occupancy H : X_h → N assigns to each cell c ∈ X_h the value H(c) = Σ_{j∈J} |{n : G_h(q_n^(j)) = c}|, where |·| stands for the cardinality of the set.
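The grid transform and the histogram of occupancy described above can be sketched as follows (a minimal illustration; the function names are ours):

```python
import math
from collections import Counter

def grid_cell(x, y, hx, hy):
    """G_h: map a point of X to an integer cell index of X_h."""
    return (math.floor(x / hx), math.floor(y / hy))

def cells_covered(track, hx, hy):
    """Set of cells of X_h covered by one track, given as a list of (x, y) positions."""
    return {grid_cell(x, y, hx, hy) for x, y in track}

def occupancy_histogram(scenario, hx, hy):
    """Histogram of occupancy for scenario S: for each cell, the number of
    tracks in the scenario that cover it."""
    counts = Counter()
    for track in scenario:
        for cell in cells_covered(track, hx, hy):
            counts[cell] += 1
    return counts
```

Note that this variant counts each track at most once per cell; counting every sample that falls into a cell (closer to the per-sample sum in Definition 4) only requires dropping the set construction.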
The most convenient method for visualizing histograms and two-dimensional distributions is a heatmap. A heatmap is a data visualization technique that shows the magnitude of a phenomenon using a variable color intensity that changes with the increasing value of a function assigning a value to each two-dimensional vector. Fig. 4 shows an example heatmap of grid occupancy. In our case, black indicates that the scenario (or set of scenarios) does not contain a trajectory that interacts with a given grid cell. Cells that intersect once with any trajectory are white. The white color gradually turns red for grid cells that were occupied in total by a higher number of trajectories. The limit value for this change is 100, as depicted on the color bar.

F. DISTANCE IN THE SPACE OF SCENARIOS
The distribution of occupancy (Definition 4) is a discrete probability density function. (X_h, d) is a metric space, where d is the induced Euclidean metric.

Definition 5: For any two distributions of occupancy µ and ν, the Wasserstein distance between them is

W(µ, ν) = inf_{γ ∈ Γ(µ, ν)} Σ_{(x, y) ∈ X_h × X_h} d(x, y) γ(x, y),

where Γ(µ, ν) is the set of all probability measures defined on the space X_h × X_h such that µ and ν are their marginals.
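For small grids, the distance of Definition 5 can be evaluated exactly by solving the underlying optimal transport linear program. The sketch below is our illustration using SciPy's general-purpose `linprog` solver (a dedicated optimal transport library would be preferable at scale); it takes two occupancy distributions defined over the same list of grid cells:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein(mu, nu, cells):
    """Discrete Wasserstein distance between two occupancy distributions.

    mu, nu: probability vectors over the same list of grid cells.
    cells:  cell coordinates, shape (n, 2); the ground distance d is Euclidean.
    Solves min <C, gamma> subject to gamma having marginals mu and nu.
    """
    n = len(mu)
    cells = np.asarray(cells, dtype=float)
    # cost matrix: Euclidean ground distance between cells
    C = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
    # equality constraints on the transport plan gamma (flattened row-major)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j gamma[i, j] = mu[i]
        A_eq[n + i, i::n] = 1.0            # sum_i gamma[i, j] = nu[i]
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

Moving all mass from one cell to another therefore costs exactly the Euclidean distance between them, which matches the intuition of the metric as the minimal "work" needed to morph one occupancy heatmap into the other.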

G. AUTOMOTIVE DATASETS
A typical advanced driver-assistance system or autonomous driving (ADAS, AD) project requires logging hundreds of thousands of kilometers of driving to gather the material needed for the development and verification of the final product (forward collision warning camera, autopilot, etc.). It is a great challenge to properly log, store, and use these data due to the very large number of files and the required disk space (often exceeding tens of petabytes). Recorded material is usually considered intellectual property and is not revealed to anyone outside the project team. Some car manufacturers and universities have published their own automotive databases, which can be used by researchers to train algorithms and benchmark their performance [72]. In addition to the various raw sensor data, the databases contain object annotations in different formats (sometimes just a 2D bounding box on camera images, sometimes a 3D bounding box of objects in world coordinates). One of the most well-known databases is KITTI [73], released by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago (7481 annotated frames + 7518 unlabeled frames). Another is the Audi Autonomous Driving Dataset (12499 frames with 3D bounding box annotations + 390000 unlabeled frames). There are also many other datasets, such as the Waymo Open Dataset [74], Berkeley DeepDrive BDD100K [75], Baidu ApolloScape [76], and the Cityscapes dataset [77]. The dataset that we decided to use to test the methodology described in this article is the nuScenes dataset [78] provided by Motional (a Hyundai-Aptiv joint venture). The full dataset includes approximately 1.4 M camera images, 390 k lidar sweeps, 1.4 M radar sweeps, and 1.4 M object bounding boxes in 40 k frames. It was chosen mainly because it contains a large number of annotated scenarios from two different countries (left- and right-hand traffic).
In our work, we narrow the meaning of an automotive database, which is the collection of data logged by a vehicle with various radar, lidar, or camera sensor setups.
We understand a dataset (database) as a set of scenarios (recording chunks providing a set of tracks) that were ground-truthed (by either manual or automated labeling) so that at least the unique ID and 2D position of each object (vehicle, bicycle, pedestrian, etc.) in relation to the recording vehicle are known. Databases without annotations cannot be used for algorithm training or performance computation and as such are almost useless.
The unique IDs and positions in every frame allow the creation of a movement trajectory for every object instance (in the recording vehicle coordinate system). If the database also contains object types (classes), it is possible to further subdivide the trajectories, e.g., trajectories for vehicles and trajectories for pedestrians.
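As an illustration, grouping per-frame annotations into per-object trajectories might look as follows. This is a minimal sketch with a hypothetical tuple-based annotation format (frame index, object ID, class, 2D position), not the format of any particular published dataset.

```python
def build_trajectories(frames):
    """Group per-frame annotations into one movement trajectory per
    object instance (in the recording-vehicle coordinate system).

    frames: iterable of (frame_index, object_id, object_class, (x, y)).
    Returns {object_id: {"cls": class, "track": [(x, y), ...]}}.
    """
    tracks = {}
    for t, obj_id, cls, pos in sorted(frames, key=lambda r: r[0]):
        entry = tracks.setdefault(obj_id, {"cls": cls, "track": []})
        entry["track"].append(pos)
    return tracks

def trajectories_of_class(tracks, cls):
    """Subdivide trajectories by object class, e.g., trajectories for
    vehicles versus trajectories for pedestrians."""
    return {i: v for i, v in tracks.items() if v["cls"] == cls}
```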

IV. TRACK SIMILARITY DESCRIPTION
In this section, we provide several metrics in the form of modified distance functions and similarity measures that can be used to compare two tracks T^(i) and T^(j). State-of-the-art approaches, such as those described in [79] and others reviewed in Section II, make use of the direct positions and properties of tracked objects. Our approach instead analyzes the occupancy that objects generate on a grid around the car, and calculations are performed on this basis. The compared tracks can have different lengths and can belong to the same or different classes of detected objects. If the distance is less than a defined threshold, the tracks can be considered equivalent; otherwise, they are different from each other.

A. TRAJECTORY COMPARISON EXPERIMENT
The first definition is based on the Minkowski distance between two points in a normed vector space.
Definition 6: The distance between two tracks T^(i) and T^(j), represented as point sequences of equal length n, in the L^p-norm sense can be defined using the following formula:

d_p(T^(i), T^(j)) = ( Σ_{k=1}^{n} ||x_k^(i) − x_k^(j)||^p )^{1/p}.

Using Definition 6, we can manipulate the value of p and calculate the distance using different means: if p = 1, the distance is similar to the Manhattan distance; if p = 2, the formula reflects the Euclidean distance metric; and if p = ∞, the distance is based on the Chebyshev maximum metric. In Table 1, we calculated the distances between the example tracks presented in Fig. 5 according to this definition with p = 2. Definition 6 can also be extended to the function space L^p.
Definition 7: The distance between two tracks T^(i) and T^(j) in the L^p-norm sense can be defined using the following formula:

d_p(T^(i), T^(j)) = ( ∫ ||T^(i)(t) − T^(j)(t)||^p dt )^{1/p}.

The next definition is based on the Jaccard similarity index, a measure of similarity between two sets of data that determines which elements are shared and which are distinct. The definition also utilizes the concept of the transformed state space introduced in the previous section, where V_h(T) denotes the set of grid cells covered by track T.
Definition 8: The similarity measure between two tracks T^(i) and T^(j) in the Jaccard sense can be defined using the following formula:

J(T^(i), T^(j)) = |V_h(T^(i)) ∩ V_h(T^(j))| / |V_h(T^(i)) ∪ V_h(T^(j))|,

where |X| stands here for the cardinality of set X. Examples of comparisons with this measure are given in Table 2.
We can also apply a scheme similar to the calculation of the Hamming distance, which is a metric for comparing two binary data strings. As the Hamming distance for two strings of equal length is the number of bit positions in which the bits differ, the distance between two tracks can be the number of grid cells that are covered by exactly one of the two tracks.
Definition 9: The distance between two tracks T^(i) and T^(j) in the Hamming sense can be defined as follows:

d_H(T^(i), T^(j)) = |V_h(T^(i)) ∪ V_h(T^(j))| − |V_h(T^(i)) ∩ V_h(T^(j))|,

where |X| stands here for the cardinality of set X. Examples of comparisons with this metric are shown in Table 3.
TABLE 3. Distances between the example tracks in the Hamming sense.

Track |  1   2   3   4   5   6   7   8   9
------+------------------------------------
  1   |  0   6  36  40  33  33  21  54  52
  2   |  6   0  36  40  33  33  21  54  52
  3   | 36  36   0  32  39  39  25  60  58
  4   | 40  40  32   0  43  43  29  64  62
  5   | 33  33  39  43   0  34  24  57  55
  6   | 33  33  39  43  34   0  24  57  55
  7   | 21  21  25  29  24  24   0  45  43
  8   | 54  54  60  64  57  57  45   0  70
  9   | 52  52  58  62  55  55  43  70   0
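For intuition, the grid-based variants of these definitions can be sketched in a few lines. The function names are our own; tracks are given either as equal-length point sequences (Definition 6) or as sets of covered grid cells (Definitions 8 and 9).

```python
import numpy as np

def minkowski_track_distance(ti, tj, p=2):
    """Definition 6 style: L^p distance between two equal-length
    tracks given as sequences of 2D points (finite p only)."""
    ti, tj = np.asarray(ti, float), np.asarray(tj, float)
    return (np.linalg.norm(ti - tj, axis=1) ** p).sum() ** (1 / p)

def jaccard_similarity(cells_i, cells_j):
    """Definition 8 style: Jaccard index over the sets of grid
    cells covered by each track."""
    ci, cj = set(cells_i), set(cells_j)
    return len(ci & cj) / len(ci | cj)

def hamming_distance(cells_i, cells_j):
    """Definition 9 style: number of grid cells covered by exactly
    one of the two tracks (symmetric difference)."""
    return len(set(cells_i) ^ set(cells_j))
```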

B. APPLICATION
As the results show, these metrics indicate whether trajectories occupy similar positions in the field around the car. This can be applied to search for similar maneuvers on the road across different scenarios in order to create unique and diverse synthetic scenarios. These and other trajectory metrics can also be used to describe the diversity of scenarios. Although similarity measures exist that take into account the shape of trajectories while ignoring their relative positions, we will focus on describing the diversity of scenarios and datasets in terms of grid coverage and intensity of grid occupancy.

V. SCENARIO SIMILARITY DESCRIPTION
In this section, we propose a concept to characterize scenarios using object tracks projected on a grid space.

A. GRID COVERAGE APPROACH
A metric that represents the degree to which a set of tracks T_set = {T^(j1), T^(j2), ..., T^(jN)} belonging to N detected objects covers the whole grid in the field of view can be defined as follows:

C_h(T_set) = |V_h(T_set)| / |X_h|.    (8)

The proposed measure is defined using a partition of the system state space. The partition forms a rectangular grid and, roughly speaking, metric (8) is defined by the number of grid cells covered by the tracks. The boundedness of the space X_h assures that the numerator and denominator in Formula (8) are finite numbers. A similar concept was used in [80] to define a test coverage measure for continuous-time software systems.
In the example presented in Fig. 6, one track was recorded. This track covers 27 grid cells out of 415 in the whole field of view, which gives C_h(T_set) = 0.065 (6.5%). Full coverage of the sensor field of view by the recorded tracks is illustrated in Fig. 7. An acceptable level of coverage can be achieved by selecting tracks according to Algorithm 1.

FIGURE 6. Example of the coverage of the sensor field of view by a track.

Fig. 8 illustrates the tracks of a radar system consisting of several radar sensors that deliver a 360-degree view around the vehicle. The tracks were recorded in a real experiment and then selected according to Algorithm 1. Out of the 93 tracks recorded in that scenario, Algorithm 1 selects 72 tracks that cover the equivalent area. Additionally, out of the 4443 recorded tracks that create the occupancy heatmap in Fig. 4, Algorithm 1 selects only 2391 tracks that cover the equivalent area.
Find a new track T^(j_k) among the available tracks that covers at least one grid cell not yet covered, i.e., a cell from the set X_h \ V_h(T_set); k := j_k
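A minimal sketch of this selection step, together with the coverage metric of Formula (8), might look as follows. This is a single greedy pass in the spirit of Algorithm 1, not the published algorithm itself; the function names and the cell-set representation of tracks are our own.

```python
def coverage(cell_sets, n_grid_cells):
    """Grid-coverage metric in the sense of Formula (8): the fraction
    of grid cells covered by the union of the tracks' cell sets."""
    covered = set().union(*cell_sets) if cell_sets else set()
    return len(covered) / n_grid_cells

def select_covering_tracks(cell_sets):
    """Greedy pass: a track is kept only if it covers at least one
    grid cell that the tracks selected so far do not, so the selection
    covers exactly the same area as the full set."""
    covered, selected = set(), []
    for idx, cells in enumerate(cell_sets):
        cells = set(cells)
        if cells - covered:          # adds at least one new cell
            selected.append(idx)
            covered |= cells
    return selected
```

A single forward pass already discards every track that adds no new cell; the reductions reported above (93 to 72 tracks, and 4443 to 2391 tracks) are of this kind, with the covered area kept identical.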

B. SCENARIO COMPARISON EXPERIMENT
Based on the nuScenes database [81], we created occupancy distributions for a sample of 10 scenarios chosen from this database. Their occupancy heatmaps based on the vehicle trajectories are presented in Fig. 9. We calculated the Wasserstein distance between each pair from this set of scenes (Table 4) to compare the values of this metric with the desired sense of similarity.
The calculation of the Wasserstein metric for two distributions of occupancy µ and ν can be considered the following transportation problem: find a transport plan f that minimizes

Σ_{k∈M} Σ_{j∈M} f(k, j) d(k, j),

where d(k, j) is the squared Euclidean distance between vectors η(k) and η(j). The constraints on the function f are as follows:

∀k ∈ M: Σ_{j∈M} f(k, j) = µ(k),  ∀j ∈ M: Σ_{k∈M} f(k, j) = ν(j),  f(k, j) ≥ 0.

To solve this setup, we can apply the network simplex method described in [82] or [83]. The Wasserstein distance value can then be calculated from the following formula:

W(µ, ν) = ( Σ_{k∈M} Σ_{j∈M} f*(k, j) d(k, j) )^{1/2},

where f* is the optimal transport plan. The largest value, i.e., 28.7, is achieved for the comparison of scenarios 60 and 539. In scenario 60, most vehicles appeared in front of the host vehicle, while in scenario 539, most vehicles appeared behind it. In scenarios 87 and 202, occupancy is equally distributed around the host vehicle, and the value for this comparison is relatively low, i.e., 5.4. Scenarios 273 and 274 are based on data collected at intersections; the Wasserstein distance between them is equal to 4.9. The crossroad examples are almost twice as distant from the scenario that comes from a straight road (Scenario 305). This scenario is also significantly distant from the data based on a road with movement perpendicular to the host (Scenario 169), with a Wasserstein distance equal to 15.6. Scenarios with a form similar to that achieved in examples 256 and 403 arise when the host makes major turns; the distance between the given examples is equal to 4.9 (all values are given in Table 4). Based on this, we can state that this measure representatively describes the similarity between scenarios in terms of the intensity of occupancy and the wider context in which they are recorded.

VI. DATASET SIMILARITY DESCRIPTION
Datasets that are used in automotive systems are specific in terms of their size, properties, and diversity. A formal description should therefore take the form of a set of parameters that describe these features. Moreover, the description should include both the static and dynamic aspects that characterize the purpose of the data. Table 5 [69] illustrates a typical form of specification for data collection that is currently used in automotive applications. Although the specification contains a set of quantified parameters, it is not clear how the recommended distribution of the values of these parameters can guarantee an unambiguous and comprehensive description of the dataset.

Datasets that are used in automotive applications consist of recorded sequences of possible road scenarios. Thus, the problem is to find a minimum number of such scenarios that can represent reality accurately. Representativeness can be defined by a histogram H : X_h → N, where for each grid cell i ∈ X_h, H(i) = h_{i1 i2} ≥ 0 is the minimum number of tracks that should cross through the cell. Algorithm 2 illustrates an approach for selecting representative test scenarios.

The optical path for an automotive vision system consists of a full set of hardware, starting from the windshield, through antireflective coatings, lenses, light detectors, and gain control, to the serializer. RWUP data collected for a certain optical path are valid only for that specific setup; eventual reuse of RWUP data for different optical paths is possible for indication only. To reduce the cost of data collection and to reduce time-to-market, the test scenarios selected by the presented algorithms can be used as a seed of the corresponding virtual scenario database.
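A greedy selection in the spirit of Algorithm 2 can be sketched as follows. This is our own simplified reading, not the published algorithm: scenarios are accumulated until every grid cell i is crossed by at least H(i) tracks, if the available data allow it.

```python
from collections import Counter

def select_representative(scenarios, H):
    """Greedy selection of representative scenarios.

    scenarios: list of scenarios, each a list of tracks, where a track
               is the set of grid cells it crosses.
    H: dict mapping grid cell -> required minimum crossing count.
    Returns indices of the selected scenarios.
    """
    counts = Counter()
    selected = []
    for idx, scenario in enumerate(scenarios):
        # Keep the scenario only if it helps a still-deficient cell.
        gain = any(counts[c] < H.get(c, 0)
                   for track in scenario for c in track)
        if gain:
            selected.append(idx)
            for track in scenario:
                for c in track:
                    counts[c] += 1
        # Stop once every requirement in H is met.
        if all(counts[c] >= need for c, need in H.items()):
            break
    return selected
```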
An additional benefit of this approach is that the same seed is sufficient to generate variations in RWUP data for a given distribution, for example, the same set of scenes rendered for different countries, for different sun heights (dazzle simulation), or for different weather and illumination conditions. With this approach, many of the scenarios required by SOTIF may be virtually generated and virtually tested with relatively little human effort. Fig. 10 presents an example of the modeling of triggering events for potentially hazardous behavior of the system. The events have the form of tracks of objects that appeared in only one scan, or of objects moving in the same direction and at the same speed as the ego vehicle. Such scenarios can be classified according to the SOTIF standard as hazardous events causing the system to behave in unintended ways due to performance limitations.

In the previous section, we proposed an algorithm that makes it possible to select a set of trajectories for scenarios that represent similar levels of grid occupancy. Below, we propose a methodology for comparing two scenario datasets S and R in terms of the intensity of grid occupancy in the scenarios they contain. Each database is a set of scenarios, {S_1, ..., S_n} and {R_1, ..., R_m}, and every scenario consists of at least one trajectory. Our goal is to determine whether database R is a significant extension of database S in terms of the intensity of occupancy of grid X_h generated by each set of trajectories S_1, ..., S_n. The result should be an indication of the most influential scenarios in R and a formal description of the diversity introduced by base R in relation to base S.

A. CUMULATIVE HISTOGRAM COMPARISON
We can also calculate the Wasserstein distance between cumulative histograms of occupancy for sets of scenarios. To do so, we divided nuScenes into two databases: the first includes the scenarios that were recorded in the USA (467), and the second contains the scenarios recorded in Singapore (383). Then, we created cumulative histograms for both databases separately for pedestrian and vehicle trajectories (Fig. 11). The Wasserstein distance between the cumulative occupancies is equal to 3.0989 in the pedestrian case. For the vehicle trajectory comparison, this metric takes a value of 4.0773.
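Building such a cumulative histogram amounts to summing the per-scenario occupancy heatmaps over a subset of the database and normalizing the result into a distribution of occupancy. A minimal sketch (the function name is our own):

```python
import numpy as np

def cumulative_histogram(scenario_heatmaps):
    """Sum per-scenario occupancy heatmaps (2D count arrays defined on
    the same grid) into one cumulative histogram and normalize it into
    a distribution of occupancy."""
    total = np.sum(scenario_heatmaps, axis=0).astype(float)
    s = total.sum()
    return total / s if s > 0 else total
```

Two such normalized histograms (e.g., USA vs. Singapore) can then be compared with the Wasserstein distance exactly as in the scenario comparison experiment.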

B. SCENARIO CLUSTERING
Below, we present another, more complex method of dataset comparison. This approach can provide a more detailed description of the relation between two large datasets and makes use of the Wasserstein distance. Thanks to this metric, we can cluster scenarios from the databases with an algorithm similar to the one described in [84]. The difference is that we cluster two-dimensional distributions of occupancy rather than one-dimensional quality distributions, and we use the Dunn index [85] to choose the optimal number of groups. Let us assume that we cluster database S. As an output of the clustering algorithm, we obtain clusters of scenarios C_1, C_2, ..., C_e and the corresponding centroids ξ_1, ξ_2, ..., ξ_e. Each centroid ξ_i is a Wasserstein barycenter of the scenes from cluster C_i; in other words, centroid ξ_i is a probability distribution on space X_h that averages all occupancy distributions from cluster C_i in the Wasserstein-metric sense. The next step is clustering the union of scenes from both databases, S ∪ R. As a result, we obtain clusters of scenarios D_1, D_2, ..., D_f and the corresponding centroids (Wasserstein barycenters) of those clusters, π_1, π_2, ..., π_f. The difference between these two outputs of the clustering algorithm is the information that database R brings to database S. To describe this difference formally, we have to define an assignment of the clusters from the second output to the clusters from the first run of the clustering algorithm. Definition 10: Let I_e = {1, ..., e} and I_f = {1, ..., f}. We define a function F : I_e → I_f that assigns each cluster C_i to a cluster from the set J = {j ∈ I_f : C_i ∩ D_j ≠ ∅}.
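Given a matrix of pairwise Wasserstein distances between scenario occupancy distributions, the Dunn index used to choose the number of groups can be computed directly. A minimal sketch (the function name and the matrix-based formulation are our own):

```python
import numpy as np

def dunn_index(D, labels):
    """Dunn index from a precomputed pairwise distance matrix D:
    the smallest between-cluster distance divided by the largest
    within-cluster diameter. Assumes at least two clusters and a
    nonzero maximum diameter; higher values indicate more compact,
    better-separated clusters."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    min_between = min(D[np.ix_(labels == a, labels == b)].min()
                      for a in ids for b in ids if a < b)
    max_diam = max(D[np.ix_(labels == c, labels == c)].max()
                   for c in ids)
    return min_between / max_diam
```

In practice, one would run the clustering for several candidate numbers of groups and keep the clustering that maximizes this index.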

C. DIVERSITY COEFFICIENTS
Based on our algorithm from the previous subsection, we define the following diversity coefficients of database R in relation to database S:
Definition 11: enrichment factor of existing clusters;
Definition 12: cluster collection enrichment factors;
Definition 13: mean internal inertia change.

D. DATASET COMPARISON EXPERIMENT
Finally, we would like to show an application of this methodology. Our goal is to describe, in the terms presented above, how a set of 50 scenes from Singapore extends a dataset created from 150 scenes from the USA (both samples are taken from the nuScenes dataset). The clustering results and diversity coefficients are given in Table 6. Before the Singapore scenarios are added, the 150 USA scenarios decompose into 3 groups. The new system obtained from clustering contains only one additional group but increases the inertia of the system by 34.2%. The mean shift of centroids (w_1) is equal to 4.7, which suggests that the change in the existing centroids is similar to the difference between scenarios of the same type (Table 4) and can be interpreted as low. However, the fact that the mean size of the new groups (w_2) is larger than the sample of Singapore scenarios indicates that the new scenes expand the set of scenes that occurred in the USA but were not sufficiently represented to create a separate cluster. The low internal inertia (w_3) relative to the new final inertia suggests that the new cluster is still highly distorted and has the potential to further decay.

VII. CONCLUSIONS
In this paper, an efficient method to formally characterize automotive datasets used for perception-based system development and verification is presented. The method primarily focuses on datasets used in automotive applications, as these datasets are specific in terms of dynamic aspects that should not be omitted. However, the presented approach can be extended to other types of large datasets used in other robotic applications. By characterizing the datasets in terms of the motion trajectories of detected objects appearing in the sensor field of view during vehicle movement, we can compare and evaluate different datasets and relate them to the performance of perception algorithms. Such analytical formalism helps in understanding the data and may also improve the algorithm development process, consequently reducing effort, cost, and time, as in the automotive industry, data collection, testing, and verification activities consume the majority of the project effort. Moreover, a dataset described using mathematical notation can serve as a base to define a dataset that accurately represents the vehicle's entire surroundings in reality with a defined confidence level. Such a representative dataset can be used to validate control and perception algorithms for autonomous vehicles with the support of a strong mathematical justification. This problem remains an issue to date, and no good solution has been proposed thus far.