Virtual Testing of Automated Driving Systems. A Survey on Validation Methods

This paper surveys the state-of-the-art contributions supporting the validation of virtual testing toolchains for Automated Driving System (ADS) verification. The work builds upon the well-known limitations of physical testing while conceiving the virtual counterpart as a fundamental ingredient for the type-approval of high automation level ADS. The purpose of the research effort is to summarize computational tools, validation methodologies, and the corresponding fidelity levels delivered by state-of-the-art simulation toolchains. The ultimate goal is to establish how effectively simulation can play the role of a "virtual proving ground" for ADS certification, independently of any specific ADS implementation or effectiveness. The contribution includes classic high-level validation approaches and modern, specific computational tools that can be adopted depending on the type of data under analysis. Moreover, for the sake of completeness, the investigation covers approaches embraced both within the scientific community and in technical regulations. Ultimately, we identified two high-level validation schemes: integrated environment and subsystem-based solutions. In addition, we found that modeling and validating virtual sensors for ADS is the most lacking area from a subsystem-level perspective. On the other hand, the closed-loop interaction between the ADS and other virtual traffic participants makes it difficult to directly compare the experimental results with simulation-generated evidence, as the emergent behaviors of the ADS may amplify minor discrepancies between the environments.


I. INTRODUCTION
Virtual testing is gradually becoming part of the certification process of future Automated Driving Systems (ADS) [1], [2]. After decades in which simulation was used for technology development purposes, virtual tests are starting to complement the traditionally performed proving ground and public road experiments. Introducing a virtual component in the ADS certification pipeline brings all the well-known advantages of simulation with respect to physical testing. These include test repeatability across different ADS/vehicle combinations, the possibility of scaling up the number of tests, a safer technology assessment, and a reduction in the costs associated with the certification process. Indeed, validating a high-automation level ADS by means of physical testing only would require traveling for decades, as already pointed out in several literature works [3], [4]. Simulation is thus the ideal testing environment to investigate how an ADS performs in high-mileage tests and in edge-case driving scenarios [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Tamas Tettamanti.
Naturally, the adoption of virtual testing as an ADS certification tool shall first investigate the appropriateness of the simulation environment for such a purpose. Despite the mentioned assets of simulation, suitable validation procedures have to be established before promoting a virtual testing environment as a certification environment. In particular, a validation activity shall ensure that the simulation-generated evidence is characterized by a fidelity level that serves the needs of the certification process. A virtual testing environment fulfilling such validation criteria could be regarded as a "virtual proving ground". However, an acknowledged validation practice to accomplish the accreditation task is a research topic that is not yet regulated. Instead, a plethora of approaches exists, which depend on the specific application.
Unlike widely acknowledged scientific contributions dealing with the implementation and validation of ADSs (see for instance [6]-[9]), our effort concentrates on the (simulated) environment used to support the ADS certification task. Indeed, the objective of this work is to survey the state-of-the-art approaches adopted to validate both complete ADS virtual testing toolchains and simulation submodels involved in the virtual experiments. Such a selection of topics sets our work apart from the research stream focusing on ADS development. In fact, the validation approaches presented here can be adopted regardless of the specific ADS implementation and its safety effectiveness. Although some of the state-of-the-art surveys (for instance [7], [9]) also list simulation software, the fidelity of the tools is typically not discussed, nor are computational methods for a quantitative assessment presented. Our contribution aims to fill this research gap by presenting a review of the approaches and computational tools adopted in actual simulation validation practices, together with the corresponding acceptance correlation thresholds whenever applicable, with the aim of establishing the fidelity level that a virtual testing toolchain for ADS simulation can deliver.

A. STRUCTURE OF THE WORK
The work opens by recalling the general principles of validation in Section II, focusing on the validation of virtual models. The associated computational tools are presented in Section III. Then the discussion shifts to the ADS-specific validation approaches and challenges in Section IV. Open issues are reported in Section V, whereas conclusions are eventually drawn in Section VI.

II. VALIDATION IN ENGINEERING
The validation process, in general, aims at determining the capability of a product to fulfill its purpose and expectations for the required application. In contrast to the verification task, the process of validation takes place upon the completion of the product to ensure that the ''correct'' product was built. The verification, instead, is mainly concerned with the correct implementation of the conceptual (or mathematical) model [10] into the actual product. The considerations are graphically emphasized in Fig. 1, where the typical V-Model product development workflow is displayed.
The general definition of validation has been given a bespoke formulation for the field of Modeling and Simulation (M&S) in several literature works [12], [13]. In particular, the validation of a virtual model (product) can be defined as a procedure aimed at establishing the model's accuracy (capability) in representing the real world (purpose) from the perspective of the intended use (application).
Ensuring suitable accuracy for a virtual model is a crucial task whenever the model plays a critical role. This has become more relevant in recent years due to the widespread adoption of simulation at any design and decision-making level [14].
One of the first attempts to build up a validation framework for virtual models was carried out by Carson in [15]. According to the author, the validation should be made up of a three-step procedure: 1) face validity (i.e., answering the question: "is the model returning reasonable results?"); 2) testing the model over a range of input parameters ("stress test"); 3) comparison of the model's predictions with physical data whenever possible, in particular: a) collection of input data from real-world experiments; b) re-execution of the simulation model with the collected input; c) performance comparison with respect to the real world; d) use of statistical techniques to create confidence intervals in case multiple datasets are available. Carson's scientific contribution also mentions a list of possible modeling errors, which include: project management (e.g., missing key personnel or decision-makers at crucial meetings), data modeling (e.g., use of incorrect data collection procedures), logic model (e.g., modeling assumptions not replicating the physical behavior of the system), and experimentation errors (e.g., too few executions of the model).
In [16], a three-step validation approach is proposed, as graphically reported in Fig. 2. More in detail, the virtual model is first compared against the real one in terms of predicted output. Secondly, an interpolation/extrapolation analysis over the required domain is carried out. Thirdly, the prediction uncertainty is characterized.
A widely recognized practical approach for the validation of simulation models has been proposed in [14] and is depicted in Fig. 3. The approach foresees the distinction between the "conceptualization", i.e., the mathematical/logical representation of the model, and the "computerization", i.e., the realization of the conceptual model in terms of a programming language. Validation then takes place by determining the adoption of reasonable theories/assumptions ("conceptual validation"), the correctness of the computer implementation ("computerized model verification", i.e., static and dynamic code testing), and the accuracy of the realized virtual model ("operational validation"). Eventually, a "data validity" procedure should assess that accurate data were available to forge the conceptual model, allow its computerization, and conduct a quantitative assessment of the truthfulness.

A. VALIDATION LIMITATIONS
The cited literature works highlight, however, inherent validation caveats. The first (and major) point is related to the context-dependent nature of the validity analysis: absolute validity is generally not achievable. Thus, a simulation model will always be an approximation of the real-world phenomena and will only serve the need of the specific application it aims at replicating.
The second point is the lack of globally defined user-acceptance criteria. Despite the outstanding cited contributions in terms of validation pipeline, in the real world the validation procedures are tailored to the specific application and Operational Design Domain (ODD). This limitation leads to a plethora of validation tools and corresponding compliance thresholds, some of which are presented in the following sections whenever available. Nevertheless, to the best of the authors' knowledge, an ultimate "validation criterion" for a complex virtual testing toolchain is not available. Finally, on a practical level, obtaining validation-grade data might be very challenging, extremely costly, or even impossible (e.g., when modeling a system not yet existing), thus limiting the applicability of the validation analysis.
[Figure caption: Type-approval based on virtual testing. EU regulation schematic process view, adapted from [19].]
In order to partially account for the limitations, a credibility analysis is typically carried out in parallel to the validation [17]. The credibility investigation aims at enhancing the confidence in the developed virtual toolchain [15]. A common framework for the credibility analysis of virtual models was developed at NASA and released publicly [18]. Similarly, a credibility framework is currently being discussed at the UN/ECE level for ADS applications.

B. EXISTING REGULATIONS LEVERAGING ON VIRTUAL TESTING
Simulation models have recently started to be adopted in passenger vehicle regulations for type-approval purposes. For instance, the EU 2018/858 [19], which regulates the type-approval of motor vehicle classes $M_{1,2,3}$, $N_{1,2,3}$, and $O_{1,2,3,4}$ according to the UNECE classification scheme, allows the use of virtual testing for a list of regulatory acts which include superstructure's strength and interior visibility. Generic guidelines on how the model's validation should be carried out and its relationship to the approval process are also given and reported in the flowchart of Fig. 4.
As far as ADAS/ADS are concerned, the UN/ECE R140 [20], which regulates the Electronic Stability Control (ESC) system, allows the "sine with dwell" maneuver to be carried out in a simulation set-up provided that the corresponding virtual model accounts for a list of parameters and complies with a selection of Key Performance Indicators (KPIs). Similarly, the recently released Automated Lane Keeping System (ALKS) regulation [2] permits the usage of simulation to validate the system for a lane-keeping maneuver. However, neither of the documents gives any quantitative specification on how the validation should be carried out. Instead, only a qualitative "comparability" of the simulated and physical test results is demanded. Moreover, no requirements are given for the simulation environment used.

III. VALIDATION METHODOLOGIES
As outlined in Section II, a validation effort is typically a multi-step approach embracing several techniques. A common starting point for the validation pipeline is the conceptual validation activity. The next tier in the validation process requires assessing the simulation model's degree of discrepancy with respect to the real-world system. Eventually, the simulation model shall be investigated from an uncertainty analysis perspective. The following paragraphs provide extensive details on how the three steps can be fulfilled. Whenever possible, the literature examples provided are concerned with vehicle dynamics or ADS virtual testing to promote methods that have already found their way into the case study.

A. CONCEPTUAL VALIDATION OF THE VIRTUAL REALIZATION
Before any quantitative/qualitative assessment of the simulation results is undertaken, the validation task has to prove that the correct system theories were implemented while virtualizing the real process and that the simulation model is capable of representing the target device's structure, input/output relationships, and functioning logic [14], [15], [21], [22]. The conceptual model validation is thus mainly concerned with checking the consistency of the modeling hypotheses and level of detail with the target objective the model aims to accomplish [17].
Additionally, the conceptual validation may involve studying the validity of the data used to develop the model as in Fig. 3. At this stage, an assessor might question whether the dataset used to build the simulation model was sufficiently informative to capture all the nuances of the real-world relevant for the target application.
Concerning vehicle dynamics, an early attempt to provide a conceptual validation framework was proposed in both [23] and [24]. The works delineated the appropriateness of the modeling approach depending on the target maneuver the virtual realization is requested to replicate. The effort of establishing suitability is, however, mostly delegated to experienced engineers, thus making the overall assessment largely subjective.
Since then, the conceptual validation gradually started to recede from the scientific discussions as the increased computational capability allowed more extensive virtual experimentation and validation over a larger domain using techniques described in Section III-B.
Nonetheless, the concept was brought to new attention with the advent of complex cyber-physical systems [25], [26] and agent-based simulations [27], such as the case study. In an ADAS/ADS application, the overall system to virtualize can be functionally decomposed in layers (vertical decomposition) or submodels (horizontal decomposition). A typical functional organization of a virtual testing toolchain for ADS is shown in Fig. 10. A similar modeling abstraction framework was also presented in [28, Fig. 5] and [4, Fig. 3]. The conceptual validation of a submodels-based pipeline shall assess that each component is a reasonable virtualization of the corresponding physical counterpart for the given purpose. For example, the level of detail needed to model an automotive powertrain for a fuel-consumption application study is substantially finer than what is demanded by microscopic traffic simulation frameworks. On the other hand, the modeling philosophy of the virtual sensors plays a critical role for ADS. Thus, they require suitable conceptualization methods to ensure the required fidelity level, as summarized in Section IV-C1.

B. MODEL VALIDATION VIA RESPONSE ANALYSIS
After assessing the theories employed to develop the model, the validation shall determine that the degree of discrepancy between the simulated model and the physical realization is contained below a prescribed threshold level. Such a phase is typically accomplished by defining a selection of KPIs and the appropriate computational method to compare the recorded signals. Selecting the suitable list of signals is not a trivial task, nor is it supported by established literature for the specific case of ADS. In general, the list of variables to investigate should be large enough to cover, with sufficient confidence, the phenomena the model attempts to replicate.
Once the eligible list is determined, one has to apply a computational tool to effectively contrast the model's output to the physical system's response. Several options are offered to the designers, which are described in the continuation of the section.

1) GRAPHICAL COMPARISONS
Graphical comparisons provide an intuitive and straightforward way to compare the results. Typical ways of visually representing data include 2D/3D plots, histograms, and scatter charts. They can alternatively be supported by animations [14] to graphically represent the evolution in time of the virtual system.
In addition, the visual inspection of the model's output is a first step to accomplish the face validity task pointed out by Carson. That is particularly effective when exploring the response of the model to a combination of inputs that are not easily reproducible with the real system [14].
The drawback of graphical comparisons is the lack of objective criteria to validate the modeling assumptions, despite the method being quite widespread in actual engineering practice [29]. The corroboration can nevertheless be carried out by adopting Turing-test-like approaches [30], or by relying on Subject-Matter Experts (SMEs) [22], individuals who have long experience in the discipline and are able to tell from the charts whether the model shall be deemed valid.
24352 VOLUME 10, 2022
An example of a virtual model validation using the graphical comparison of the recorded trajectories for an experimental automated vehicle is given in Fig. 5.

2) SCALAR QUANTITIES
Scalar quantities analysis affords the possibility of introducing a quantitative assessment despite being limited in terms of the provided information with respect to other methods described later.
Scalar quantities can derive from extracting a reference value from a time-history, for instance, by applying min or max operators. A widely established validation framework exploiting scalar quantities is the ISO standard 19364 [32], where the relative difference between the yaw-rate peaks (simulation sim vs. real-world rw experiment) is computed for a reference lane change maneuver and then compared to a threshold error. Eq. (1) is also known as the "Relative Error Criterion" (REC) and is adopted in other technical standards and fields, such as the EASA certification memorandum on structural mechanics [33] for a different set of KPIs. When studying higher-level automated driving functionalities, several scalar KPIs can be computed and clustered in a table such as in the summary of results of the European Project Enable-S3 [34], here adapted in Fig. 6. Additionally, Fig. 6 demonstrates how scalar metrics can account for high-level information, such as the number of times the vehicle stopped during the experiment, and low-level information about the driving policy, such as the maximum longitudinal jerk.
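The relative error criterion described above can be written compactly. The following is a sketch in notation assumed here for illustration ($\dot{\psi}^{\max}$ for the yaw-rate peak and $\varepsilon$ for the threshold error); it is not quoted from ISO 19364:

```latex
\mathrm{REC} =
\left|
  \frac{\dot{\psi}^{\max}_{sim} - \dot{\psi}^{\max}_{rw}}
       {\dot{\psi}^{\max}_{rw}}
\right|
\le \varepsilon
```

As a purely numerical illustration, a simulated peak of 0.42 rad/s against a measured peak of 0.40 rad/s gives REC = 0.02/0.40 = 5%.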

3) TIME-HISTORIES
Time-histories relationships give extensive information about the virtual model's fidelity but require extra care when carrying out the assessment with respect to the scalar quantities. In fact, the comparison of the M&S output with respect to the corresponding real-world experiment might need data conditioning procedures such as time synchronization and resampling.
A popular method for synchronization is the Time of Arrival (ToA) [35], [36] criterion. ToA is based on time-shifting the signals until the time-histories reach, for the first time, a reference amplitude. The time-occurrence when the reference amplitude is reached is compared for each signal making up the dataset and the computed time difference used to shift the signals in the time-domain.
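The ToA idea can be sketched in a few lines of Python. The function names, the use of the absolute amplitude, and the linear interpolation between samples are assumptions made for illustration and are not prescribed by [35], [36]:

```python
def time_of_arrival(t, x, level):
    """First time at which |x| reaches `level` (linear interpolation between samples)."""
    for k in range(len(x)):
        if abs(x[k]) >= level:
            if k == 0:
                return t[0]
            # Sample k is the first one at or above the level, so a1 > a0 holds.
            a0, a1 = abs(x[k - 1]), abs(x[k])
            frac = (level - a0) / (a1 - a0)
            return t[k - 1] + frac * (t[k] - t[k - 1])
    raise ValueError("signal never reaches the reference amplitude")


def toa_shift(t_sim, x_sim, t_rw, x_rw, level):
    """Time offset to add to the simulated time axis to align it with the real-world one."""
    return time_of_arrival(t_rw, x_rw, level) - time_of_arrival(t_sim, x_sim, level)


# Hypothetical data: the simulated signal leads the measurement by one sample.
t = [0.0, 0.1, 0.2, 0.3, 0.4]
x_rw = [0.0, 0.0, 1.0, 2.0, 3.0]
x_sim = [0.0, 1.0, 2.0, 3.0, 4.0]
shift = toa_shift(t, x_sim, t, x_rw, level=1.5)
```

After shifting, the two signals would typically still need resampling onto a common time grid before any sample-by-sample comparison.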
Once the data are synchronized, a common method for validation is introducing an absolute or relative tolerance interval around the experimental reference data [37]. Tolerance intervals are representative of the measurement and model's uncertainty. For example, a validation assessment procedure can require the virtual model's output to stay within the ±5% amplitude interval of the corresponding realworld realization. A similar approach is adopted in the technical ISO standards 19364, 19365 [32], [38]. Alternatively, confidence levels can be used to define the validation intervals as displayed in Fig. 7 where the acceleration recorded on the proving ground (PG) is compared against multiple repetitions in a vehicle-hardware-in-the-loop (VeHIL) setup for a car-following study.
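A relative tolerance-band check of the kind described above could look as follows. This is a minimal sketch: the function name, the optional `abs_floor` guard for samples close to zero, and the sample data are assumptions, not part of the ISO procedures:

```python
def within_relative_band(x_sim, x_rw, rel_tol=0.05, abs_floor=0.0):
    """Check that each simulated sample lies inside a band around the reference.

    The half-width of the band is rel_tol * |x_rw[i]|, but never smaller than
    abs_floor, which avoids a zero-width band where the reference crosses zero.
    Returns (ok, indices_of_violations).
    """
    if len(x_sim) != len(x_rw):
        raise ValueError("signals must be synchronized and equally sampled")
    violations = []
    for i, (s, r) in enumerate(zip(x_sim, x_rw)):
        half_width = max(rel_tol * abs(r), abs_floor)
        if abs(s - r) > half_width:
            violations.append(i)
    return (not violations), violations


# Hypothetical, already synchronized samples.
x_rw = [1.0, 2.0, 4.0]
x_sim = [1.04, 2.0, 4.3]
ok, bad = within_relative_band(x_sim, x_rw, rel_tol=0.05)
```

Reporting the violating indices, rather than only a pass/fail flag, makes it easier to see whether the mismatch is confined to a transient or spread over the whole maneuver.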
Time-histories can also be characterized in terms of their amplitude distance [36], [40], [41] via studying the residuals. Several options are available when assessing the magnitude discrepancy, starting from the $L_p$-norm of the residual vector:

$$\| x_{sim} - x_{rw} \|_p = \left( \sum_{i=1}^{N} \left| x_{sim,i} - x_{rw,i} \right|^{p} \right)^{1/p}$$

where $N$ is the total number of samples, $x_{sim,i}$ is the generic KPI's time-sample deriving from the simulation environment, $x_{rw,i}$ the corresponding real-world evidence, and $p$ the order of the norm. Typical solutions include:
• normalized vector norms according to $N$, such as the Mean Absolute Error (MAE), which follows from the normalization of the $L_1$-norm. The normalization of the $L_2$-norm yields instead the Root Mean Square Error (RMSE);
• normalized distances, such as Theil's inequality coefficient, which lies between 0 (perfect agreement) and 1 (total incompatibility) and enables setting dimensionless validation thresholds.
The distance analysis can be further informed by exploiting more advanced tools capable of decoupling the mismatch in magnitude and phase (hence, the shape of the signal). Such techniques are known as Magnitude Phase Composite (MPC) [35]-[37]. Among the MPC methods, a widely acknowledged solution is the Sprague and Geers (S&G) criterion, which combines the phase discrepancy and the (integral) magnitude error into a "comprehensive factor".
A complementary solution to the S&G metric is Dynamic Time Warping (DTW) [43]. DTW provides a method to assess independently errors in phase, magnitude, and topology by aligning patterns (peaks and valleys as in Fig. 8) in data through time axis scaling.
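The amplitude-distance and S&G metrics named in this subsection can be sketched as follows. The formulas are the commonly used textbook definitions and are assumed, not confirmed, to match the ones intended here. Uniform sampling is also assumed, so the time step cancels out of the S&G integrals:

```python
import math


def mae(x_sim, x_rw):
    """Mean Absolute Error: L1-norm of the residual divided by N."""
    return sum(abs(s - r) for s, r in zip(x_sim, x_rw)) / len(x_rw)


def rmse(x_sim, x_rw):
    """Root Mean Square Error: L2-norm of the residual divided by sqrt(N)."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(x_sim, x_rw)) / len(x_rw))


def theil_u(x_sim, x_rw):
    """Theil's inequality coefficient: 0 = perfect agreement, 1 = worst case."""
    n = len(x_rw)
    rms_sim = math.sqrt(sum(v * v for v in x_sim) / n)
    rms_rw = math.sqrt(sum(v * v for v in x_rw) / n)
    return rmse(x_sim, x_rw) / (rms_sim + rms_rw)


def sprague_geers(x_sim, x_rw):
    """Return (magnitude error M, phase error P, comprehensive factor C)."""
    ss = sum(s * s for s in x_sim)
    rr = sum(r * r for r in x_rw)
    sr = sum(s * r for s, r in zip(x_sim, x_rw))
    m = math.sqrt(ss / rr) - 1.0
    # Clamp to [-1, 1] to protect acos against floating-point round-off.
    cos_arg = max(-1.0, min(1.0, sr / math.sqrt(ss * rr)))
    p = math.acos(cos_arg) / math.pi
    return m, p, math.hypot(m, p)


# Hypothetical signals: the simulation doubles the amplitude but keeps the shape.
x_rw = [1.0, 2.0, 3.0]
x_sim = [2.0, 4.0, 6.0]
m_err, p_err, c_err = sprague_geers(x_sim, x_rw)
```

In this toy case the magnitude error is 100% while the phase error is zero, which shows how the S&G decomposition separates an amplitude mismatch from a shape mismatch that a single RMSE value would merge.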
DTW-based validation metrics have compared favorably against SMEs' judgment in the field of simulation models for predicting the acceleration the human body is subjected to during a car accident in [40]. The proposed method, labeled Error Assessment of Response Time Histories (EARTH), exploits DTW to assess the magnitude, phase, and topology discrepancies independently. An enhanced version of EARTH (EEARTH) was recently developed, which uses linear regression to combine the three separate discrepancy components into a unique global error [44].
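To make the alignment idea concrete, the classic DTW accumulated-cost recursion is sketched below. This is only the basic algorithm with an absolute-difference local cost, chosen here as an assumption. It is not the EARTH or EEARTH metric, which build further processing on top of the warping:

```python
def dtw_distance(a, b):
    """Accumulated cost of the optimal warping path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b is held
                cost[i][j - 1],      # b advances, a is held
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical signals: same pulse shape, the second one delayed by one sample.
pulse = [0.0, 1.0, 2.0, 1.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
d = dtw_distance(pulse, delayed)
```

The delayed pulse yields a zero warped distance because the time axis is locally stretched until peaks and valleys coincide. Any residual cost after warping can therefore be attributed to magnitude or topology rather than phase.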

4) CORRELATION
Time-histories distance analysis can also be supported by the computation of the Pearson correlation coefficient $r$ to detect the extent to which the model stays in a linear relationship with the real data. The squared value of $r$ yields the coefficient of determination $R^2$, which is representative of the amount of variance predicted by the model. Both $r$ and $R^2$ are widely adopted tools to assess the validity of the model, as reported in several references [31], [39], [45].
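A minimal sketch of the correlation check is given below, using the standard sample definition of Pearson's coefficient. The function name and the data are illustrative assumptions:

```python
import math


def pearson_r(x_sim, x_rw):
    """Pearson correlation coefficient between two equally long signals."""
    n = len(x_rw)
    mean_s = sum(x_sim) / n
    mean_r = sum(x_rw) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(x_sim, x_rw))
    var_s = sum((s - mean_s) ** 2 for s in x_sim)
    var_r = sum((r - mean_r) ** 2 for r in x_rw)
    return cov / math.sqrt(var_s * var_r)


# Hypothetical signals: the simulation has a gain error (x2) and an offset (+0.5).
x_rw = [1.0, 2.0, 3.0, 4.0]
x_sim = [2.5, 4.5, 6.5, 8.5]
r = pearson_r(x_sim, x_rw)
r_squared = r ** 2
```

The example returns a perfect correlation although the simulated signal has the wrong gain and offset. Correlation is insensitive to both, so it is best read together with one of the distance metrics of Section III-B3 rather than on its own.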

5) FREQUENCY DOMAIN
Frequency domain approaches do not require time synchronization, and they are particularly suitable for the validation of physics-inspired models such as vehicle dynamics [46], [47]. When evaluating the discrepancy between a real and a simulated dataset, frequency analysis can help to disentangle the noise contribution from the true signal, since the noise will mostly appear in the high-frequency region, whereas the true signal will be concentrated in the lower frequency area. Using frequency domain techniques, the distance analysis could thus account for the discrepancy affecting only the share of the frequency spectrum that is of interest. Fig. 13 depicts, for instance, the frequency responses of two virtual models for lateral vehicle dynamics with respect to the experimental transfer function. The power spectrum of the residual is also particularly important to gain insights into the robustness of the calibrated models. Residuals mostly concentrated in the high-frequency domain are indicative of a robust calibration procedure, which yielded a model grasping the real dynamics of the system without overfitting the noise (high-frequency) components [48].
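The residual-spectrum check can be illustrated with a small, dependency-free sketch. The cut-off frequency `f_cut` separating "signal" from "noise" and the function names are assumptions for illustration. A direct O(N²) DFT is used for self-containment, whereas an FFT routine would normally be preferred for long records:

```python
import cmath
import math


def power_spectrum(x, dt):
    """One-sided power spectrum of x via a direct DFT. Returns (freqs, power)."""
    n = len(x)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        xk = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        # Interior bins also carry the mirrored negative-frequency content.
        weight = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        freqs.append(k / (n * dt))
        power.append(weight * abs(xk) ** 2 / n)
    return freqs, power


def high_frequency_share(residual, dt, f_cut):
    """Fraction of residual power (DC excluded) at frequencies >= f_cut."""
    freqs, power = power_spectrum(residual, dt)
    total = sum(power[1:])
    high = sum(p for f, p in zip(freqs[1:], power[1:]) if f >= f_cut)
    return high / total if total > 0.0 else 0.0


# Hypothetical residuals sampled at 100 Hz for 1 s.
dt, n = 0.01, 100
noise_like = [math.sin(2 * math.pi * 30.0 * i * dt) for i in range(n)]   # 30 Hz
dynamics_like = [math.sin(2 * math.pi * 2.0 * i * dt) for i in range(n)]  # 2 Hz
share_noise = high_frequency_share(noise_like, dt, f_cut=10.0)
share_dyn = high_frequency_share(dynamics_like, dt, f_cut=10.0)
```

A share close to one suggests that what the model fails to reproduce is mostly measurement noise. A share close to zero points to unmodeled low-frequency dynamics.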

6) STATISTICAL TESTING
Ultimately, the validation procedure shall enable an assessor to answer the question: "is the simulation model an accurate representation of the physical phenomenon for the intended purpose?". That is exactly the formulation of the null hypothesis ($H_0$) of a hypothesis testing [49], [50] problem. Conversely, the alternative hypothesis $H_a$ reads "the simulation model is not an accurate representation of the physical phenomenon for the intended purpose". The goal of validation can thus be seen as avoiding both Type I errors, i.e., rejecting valid virtual models (model builder's risk), and Type II errors, i.e., accepting an invalid simulation model (model user's risk) [21]. The probability of a Type I error equals the significance level α assumed in the hypothesis testing, where α = 0.05 is a common assumption.
Statistical methods might also be particularly suitable to handle data generated by non-deterministic or hybrid (more on this in Section IV-A) virtual environments. Taking into account distribution functions allows, for instance, considering simulation stochasticity with no need of averaging over the repetitions, hence preserving a higher amount of information. Such an approach was exploited in [51] to derive validation metrics for a lane-keeping application.
Several works investigated which statistical tool to exploit for the sake of model validation, among them [13], [49], [52]-[56]. The cited contributions suggest the following methods as viable candidate tools to assess the validity of a model:
• T-test: consists of checking whether two distributions have the same mean (two-sample testing) or whether a sample mean differs significantly from a reference population mean;
• F-test: similarly to the T-test, the F-test examines the consistency of the variances;
• Kolmogorov-Smirnov test: measures the maximum vertical distance between two Cumulative Distribution Functions (CDFs);
• Anderson-Darling test: evaluates if a sample comes from a population characterized by a specific distribution [57].
A downside of statistical model validation is that statistical testing is mainly concerned with stationary distributions [58], which limits its effectiveness in investigating the transient behavior of the system. For example, one can straightforwardly utilize the T-test to validate a finite element method (FEM) model for a static structural analysis against experimental evidence. The same is not true in the case of the velocity profile of a vehicle dynamics simulation model, which typically exhibits time-varying components. Furthermore, handling transients requires the adoption of aggregation operators, which will inevitably drop relevant information about the signals [39]. Devising aggregation operators capable of retaining the amount of signal information necessary for the sake of validation is still, to the best of the authors' knowledge, an open point in the scientific literature that shall be addressed in the upcoming years.
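As an illustration of the tests listed above, the two-sample Kolmogorov-Smirnov statistic can be computed directly from the empirical CDFs. The sketch below assumes that each sample collects one aggregated scalar KPI per repetition (e.g., a peak value per run), in line with the stationarity caveat. The critical value uses the usual large-sample approximation, and the function names are illustrative:

```python
import math


def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        v = min(a[i], b[j])
        # Advance both CDFs past every observation equal to v (handles ties).
        while i < na and a[i] <= v:
            i += 1
        while j < nb and b[j] <= v:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d


def ks_critical(na, nb, alpha=0.05):
    """Large-sample critical value: reject H0 when the statistic exceeds it."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((na + nb) / (na * nb))


# Hypothetical KPI samples from repeated real-world and simulated runs.
kpi_rw = [1.0, 2.0, 3.0, 4.0]
kpi_sim = [3.0, 4.0, 5.0, 6.0]
d_stat = ks_statistic(kpi_rw, kpi_sim)
```

With only four repetitions per environment the asymptotic critical value is not reliable. In practice the number of runs, or an exact small-sample table, would need to be considered before accepting or rejecting the model.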

7) STATISTICAL DISCREPANCY
Whenever multiple repetitions of the same driving scenario subjected to stochasticity are available (for the simulated models, the real world, or both), statistical tools can be exploited to appraise the distance between the generated distributions.
The statistical discrepancy can be computed exploiting one of the following tools:
• z-score and its corresponding multidimensional generalization, the Mahalanobis distance [59]

$$\sqrt{ \left( x_{rw/sim} - \mu_{sim/rw} \right)^{T} \Sigma^{-1} \left( x_{rw/sim} - \mu_{sim/rw} \right) }, \qquad (10)$$

which assess how many standard deviations a point stands apart from a given distribution, given the covariance matrix $\Sigma$;
• Kullback-Leibler divergence ($D_{KL}$), which measures the distance between two distributions [60], a method that can be helpful to analyze stochastic models such as probabilistic models for traffic behaviors [56].
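Both discrepancy measures can be sketched without external libraries. The two-dimensional closed-form inverse and the discrete (histogram-based) form of the Kullback-Leibler divergence are simplifying assumptions made here to keep the example short:

```python
import math


def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance of a 2-D point x from a distribution (mu, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form with the 2x2 inverse (1/det) * [[d, -b], [-c, a]].
    q = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    return math.sqrt(q)


def kl_divergence(p, q):
    """Discrete D_KL(p || q); q must be positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical example: one real-world KPI pair against a simulated distribution.
mu_sim = (0.0, 0.0)
cov_sim = ((4.0, 0.0), (0.0, 1.0))
dist = mahalanobis_2d((2.0, 1.0), mu_sim, cov_sim)

# Hypothetical normalized histograms of a KPI (real-world vs. simulation).
p_rw = [0.5, 0.5]
q_sim = [0.25, 0.75]
dkl = kl_divergence(p_rw, q_sim)
```

Note that the Kullback-Leibler divergence is not symmetric, so swapping the real-world and simulated histograms generally gives a different value. The direction of the comparison should therefore be stated alongside any threshold.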

C. SENSITIVITY ANALYSIS & UNCERTAINTY QUANTIFICATION
Beyond the model accuracy discussed in Section III-B, the validation should quantify the expected uncertainty of the model as resulting from input data error and modeling approximations. The combination of techniques capable of explicitly addressing the uncertainty content of a simulation model, on top of the validation methods of Section III-B, yields the so-called Verification, Validation and Uncertainty Quantification (VV&UQ) methodologies [61]. VV&UQ methods aim at improving conventional validation activities by specifically addressing some of the critical aspects associated with validation practices based on correlation thresholds only. In particular, VV&UQ allows: 1) accounting for the uncertainty in the calibration/validation dataset, which is especially critical for applications where direct measurements are difficult or expensive to obtain; 2) surmounting the binary validation outcome (i.e., the model can be either valid or invalid) with an estimation of the degree of validity; 3) studying the extrapolation capabilities in the proximity of the validation domain.

1) SENSITIVITY ANALYSIS
Sensitivity analysis examines how small perturbations of the simulation model's input quantities, both in terms of the time-invariant model's parameters and time-varying model's inputs, propagate to the output. Such an analysis is particularly beneficial to support the validation activity. In fact, the sensitivity study determines the extent to which the model satisfies the validation thresholds when it is subjected to limited variations of the parameters. Thus, the robustness and the generalization capabilities of the simulation model can be established. As an additional outcome, the model's parameters which contribute most to the end result of the simulation can be identified. Potentially, the sensitivity analysis of the virtual model output can be one of the few tools available, together with the SMEs' judgment, to validate a model in the absence of real-world data. An overview of the sampling-based sensitivity analysis techniques for model validation can be found in [62]. The survey work identifies two main approaches: "traditional" techniques, which are best suited for a small number of parameters and operating points (also known as one-factor-at-a-time methods or first-order analysis), and "modern" techniques capable of supporting hundreds of parameters and their interaction through efficient sampling operations. An example of a traditional technique within the field of vehicle dynamics is proposed in [63], where a 20-parameter tire wear model is validated against real-world experiments and the parameters are individually perturbed (±25%) to study the sensitivity. Overall, 60 model executions, which do not account for parameters' interdependence, are required using the traditional method for a relatively simple application compared to an ADS toolchain. Indeed, such an approach may not be a feasible option to globally (i.e., first-order analysis plus interactions) assess a complex ADS testing toolchain, which might have thousands of parameters; efficient sampling techniques such as Monte-Carlo or Latin Hypercube Sampling [64] shall be adopted instead.
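The one-factor-at-a-time scheme can be sketched as below. The toy model, its parameter names, and the choice of a normalized central difference as the sensitivity index are assumptions for illustration and do not reproduce the tire wear study of [63]:

```python
def oat_sensitivity(model, nominal, rel_step=0.25):
    """One-factor-at-a-time analysis around a nominal parameter set.

    Each parameter is perturbed by -rel_step and +rel_step while the others
    stay at their nominal value. Returns (normalized sensitivities, run count),
    where a sensitivity of 1 means a 1% parameter change moves the output by ~1%.
    """
    base = model(nominal)
    if base == 0.0:
        raise ValueError("nominal output is zero; normalization is undefined")
    runs = 1
    effects = {}
    for name, value in nominal.items():
        outputs = []
        for sign in (-1.0, 1.0):
            perturbed = dict(nominal)
            perturbed[name] = value * (1.0 + sign * rel_step)
            outputs.append(model(perturbed))
            runs += 1
        effects[name] = (outputs[1] - outputs[0]) / (2.0 * rel_step * base)
    return effects, runs


def toy_model(p):
    """Hypothetical model: output grows linearly with m and quadratically with k."""
    return p["m"] * p["k"] ** 2


effects, n_runs = oat_sensitivity(toy_model, {"m": 2.0, "k": 3.0})
```

The run count grows only linearly with the number of parameters, which is the appeal of the traditional approach. By construction, however, no interaction between `m` and `k` is ever observed, which is the limitation pointed out above.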
An alternative approach to sampling is to characterize a model based on differential equations using automatic differentiation [65] and first-order analysis given the higher computational efficiency of the method. An application of automatic differentiation is proposed in [66] for the sensitivity analysis of a complex multibody (18 degrees of freedom, DOFs) vehicle dynamics model.
The interdependency of parameters' variation was studied in [67] for a 9-DOF, 12-parameter vehicle model using a two-step approach. Firstly, the elementary effect of each parameter variation was investigated to reduce the model complexity. Secondly, a global sensitivity analysis (GSA) index was calculated using Sobol's method [68] based on the ANOVA theory. The GSA was found to be in agreement with the elementary analysis for most of the effects considered, with the exception of the vertical acceleration, which is underestimated for some elementary parameters' variations given the non-linear dynamics of the model. For the considered KPIs, the sprung mass and suspension damping coefficient turned out to be the most sensitive parameters. Moreover, the robustness of the study is further enhanced, given that two model parametrizations are analyzed with similar findings.
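As a hedged illustration of a variance-based index of this kind, the sketch below estimates first-order Sobol indices with a Saltelli-type Monte-Carlo estimator. It assumes independent inputs uniformly distributed on [0, 1]; the function `linear_model` is a hypothetical test function with analytically known indices (0.8, 0.2, 0) and has no relation to the vehicle model of [67].

```python
import random

def first_order_sobol(model, n_inputs, n_samples, rng):
    """Estimate first-order Sobol indices for independent inputs uniform on [0, 1]."""
    a = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    y_a = [model(x) for x in a]
    y_b = [model(x) for x in b]
    all_y = y_a + y_b
    mean = sum(all_y) / len(all_y)
    variance = sum((y - mean) ** 2 for y in all_y) / (len(all_y) - 1)
    indices = []
    for i in range(n_inputs):
        # Rows of A with only column i replaced by the corresponding column of B.
        y_ab = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        partial = sum(yb * (yab - ya) for yb, yab, ya in zip(y_b, y_ab, y_a)) / n_samples
        indices.append(partial / variance)
    return indices

def linear_model(x):
    """Hypothetical additive model: x[2] has no influence on the output."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]

indices = first_order_sobol(linear_model, 3, 20000, random.Random(1))
```

For an additive model the first-order indices sum to one; a sum clearly below one would point to interaction effects, which is the information a one-factor-at-a-time study cannot provide.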

2) UNCERTAINTY ANALYSIS
Partially related to the sensitivity study is the assessment of the uncertainty. While the sensitivity analysis is mainly concerned with establishing the properties of the model in a local portion of the input space, the uncertainty examination explores the set of outcomes' distributions. Two sources of uncertainty are typically assumed for the simulation models: aleatory (random) components and epistemic (lack-of-knowledge) factors. Ultimately, examining how the model extrapolates for input perturbations provides sounder credibility to the virtual realization.
The uncertainty of simulation models can be established by specifying ranges for the model parameters as resulting from robust calibration procedures such as bootstrapping [69] or known aleatory properties of the modeled elements [70]. The virtual realization uncertainty can then be estimated by propagating samples from the parameters' distributions through the model. This can be done by exploiting Monte-Carlo techniques [71]. The Monte-Carlo method is based on executing deterministic simulations where, at every new simulation instantiation, a random set of parameters is drawn from the parameters' distributions. Such a methodology is graphically shown in Fig. 9. The effectiveness of the uncertainty analysis depends on the total number of virtual tests carried out: the higher the number of simulations performed, the more accurate the estimation of the output distributions. Convergence criteria based on the output variance have been proposed in [72].
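The Monte-Carlo propagation described above can be sketched as follows. The braking-distance function `simulate`, the parameter distributions in `draw_parameters`, and the stopping rule (relative change of the output variance between batches) are illustrative assumptions; in particular, the stopping rule is a simple stand-in and not the criterion proposed in [72].

```python
import random
import statistics

def simulate(params):
    """Hypothetical deterministic simulation: stopping distance [m]."""
    speed, friction, delay = params
    g = 9.81
    return speed * delay + speed ** 2 / (2 * friction * g)

def draw_parameters(rng):
    """Draw one parameter set from assumed, purely illustrative distributions."""
    speed = rng.gauss(20.0, 0.5)      # m/s, aleatory scatter of the initial speed
    friction = rng.uniform(0.7, 0.9)  # epistemic interval for tire-road friction
    delay = rng.gauss(0.3, 0.02)      # s, actuation delay
    return speed, friction, delay

def monte_carlo(rng, batch=200, rel_tol=0.01, max_runs=20000):
    """Add batches of runs until the output variance changes by less than rel_tol."""
    outputs = []
    previous_var = None
    while len(outputs) < max_runs:
        outputs.extend(simulate(draw_parameters(rng)) for _ in range(batch))
        current_var = statistics.variance(outputs)
        if previous_var is not None and abs(current_var - previous_var) <= rel_tol * previous_var:
            break
        previous_var = current_var
    return outputs

outputs = monte_carlo(random.Random(42))
mean_distance = statistics.fmean(outputs)
std_distance = statistics.stdev(outputs)
```

Each call to `simulate` is deterministic; the spread of `outputs` comes only from the sampled parameters, which is what allows the output distribution to be attributed to the assumed input uncertainty.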
From an industrial perspective, an exact quantification of the uncertainty is typically not pursued due to the high computational demand in executing a large number of simulations and collecting all the simulation model inputs' variance. Conversely, safety factors are adopted to investigate worst-case scenario parametrizations [33], [61]. This is also the case for the field of vehicle dynamics, where models are typically validated without resorting to an explicit formulation of the uncertainty [46]. Instead, a tolerance envelope is defined around the experimental data: only if the output is within the interval, the model is accepted.
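A tolerance-envelope acceptance check of this kind reduces to a pointwise comparison. The sketch below is a minimal example under stated assumptions: the band is taken as the larger of a relative and an absolute tolerance, and the function name, the tolerance values, and the sample signals are all hypothetical rather than drawn from [46].

```python
def within_envelope(simulated, measured, rel_tol=0.10, abs_tol=0.0):
    """Return True only if every simulated sample lies inside the band around the measurement."""
    if len(simulated) != len(measured):
        raise ValueError("signals must be sampled on the same time grid")
    for s, m in zip(simulated, measured):
        # Band half-width: relative tolerance, with an absolute floor near zero crossings.
        half_width = max(abs_tol, rel_tol * abs(m))
        if abs(s - m) > half_width:
            return False
    return True

measured = [0.0, 1.0, 2.0, 3.0, 2.0]          # e.g. lateral acceleration [m/s^2]
good_run = [0.02, 1.05, 1.90, 3.20, 2.10]
bad_run = [0.02, 1.05, 1.90, 3.50, 2.10]

accepted = within_envelope(good_run, measured, rel_tol=0.10, abs_tol=0.05)
rejected = within_envelope(bad_run, measured, rel_tol=0.10, abs_tol=0.05)
```

The check returns a binary verdict and carries no statement about how likely the model is to stay inside the band for other maneuvers, which is the limitation the uncertainty-based approaches try to address.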
Nonetheless, some recent scientific contributions dealing with vehicle models have started introducing uncertainty assessment [73]. For instance, in [74], the uncertainty of a simulation model for fuel consumption prediction is assessed via the propagation of input and parameters uncertainties following the approach visually represented in Fig. 9.

IV. VIRTUAL VALIDATION FOR ADS
ADS virtual testing can take advantage of a huge number of simulation tools that help the design and development of autonomous driving functionalities. However, the validation of the ADS technology is still largely based on physical testing, using either scenario-based approaches [75] or the number of miles before disengagements [76]. Given the benefits of simulation discussed in the introductory Section and the limitation of real-world testing alone in terms of scenario coverage and the unmanageable number of miles to be driven, it is the authors' view that virtual testing will be a crucial pillar of future ADS certification.
This Section, building upon the general concepts presented in Section II and on the specific methods presented in Section III, thus surveys the procedures adopted to support the virtual validation of ADS functionalities. The considerations presented here start from the underlying assumption that the validation of a virtual testing toolchain is pursued regardless of the specific ADS implementation. A virtual testing environment is here considered as the overall simulation framework that allows running a simulation once an ADS is plugged in. The validation of an ADS via virtual testing has first to demonstrate the validity of the toolchain involved, which is somehow implicitly done for the physical tests only.
From the literature analysis we carried out, we can split the validation of a virtual testing environment into two methodologies: ''integrated system'' (Section IV-B), where the overall simulation toolchain is tuned to replicate a distinct maneuver, and ''submodels-based'' (Section IV-C), where each ingredient of the simulation pipeline is individually validated with respect to its physical counterpart. Hybrid solutions combining both integrated system and submodel-based validation in a cascade fashion are also foreseen, as explained in [28]. However, no practical application of such an approach is available in the literature yet.

A. VIRTUAL ENVIRONMENTS FOR ADS SIMULATION
Before presenting the validation methodologies for ADS virtual testing, a brief explanation of the simulation environments' setups provides valuable insights for the later discussion. The flexibility of modern simulation software enables the combination of hybrid simulation/real-hardware testing configurations known as ''X-in-the-Loop'' (XiL) approaches. A common classification in the scientific literature and industrial practice is to discriminate among: Further details about testing using XiL-based approaches are outside the scope of the present paper, and the reader may refer to widely acknowledged survey works such as [77]-[79].

B. INTEGRATED SYSTEM VALIDATION
This set of validation methodologies is concerned with the definition of a reference maneuver and the tuning of a simulation environment to reproduce the driving task virtually. According to this methodology, the validation is carried out for a list of KPIs which are representative of the full (closed-loop) simulation environment and not of the models making up the toolchain, such as virtual sensors and virtual vehicle models. Additionally, these techniques are typically framed in a scenario-based approach [75].
From a legislation perspective, an approach following this philosophy is currently under discussion for the AEBS virtual test [80] based on an ad-hoc devised computational method. A similar method is already implemented in the UN/ECE R140 [20] for the Electronic Stability Control (ESC) type-approval, where generic guidelines are given on the simulation model's structure and on which KPIs to use for validation, although the exact correlation threshold is not provided.
Several proposals are also found within the scientific literature, in particular in [45], [51], and [39]. In [45], the authors study the fidelity level that a Vehicle-Hardware-in-the-Loop (VeHIL) setup can deliver with respect to a MiL environment by comparing the real-world experiments against the evidence generated by simulation environments. Given that the VeHIL does not allow the steering action, the chosen KPIs are representative of the longitudinal dynamics only. In particular, the relative vehicle spacing, ego vehicle velocity, and longitudinal acceleration time-histories are compared. The correlation is carried out by analyzing the mean values, standard deviations, and Pearson correlations. Overall, the VeHIL returned significantly better fidelity than the MiL. In [51], a method is suggested which firstly checks the validation scenario coverage and then computes the correlation between the simulated and logged lateral acceleration for a lane-keeping application by also considering stochastic effects of the particular VeHIL environment. Similarly to [45], in [39] a VeHIL setup is investigated in terms of the achievable fidelity level for driving scenarios of increasing complexity using camera stimulation rather than signal injection. The paper is complemented with statistical validation methods in addition to traditional discrepancy analysis techniques. The KPI selection includes the ego vehicle velocity and longitudinal acceleration, for which 95% confidence intervals are provided as resulting from the multiple repetitions carried out in the VeHIL environment.
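The statistics mentioned in these studies (mean values, standard deviations, Pearson correlation, and 95% confidence intervals over repetitions) can be computed as in the following sketch. The velocity samples and the repeated KPI values are made-up numbers, and the confidence interval uses a normal approximation, which is an assumption of this example; for few repetitions a Student's t quantile would be more appropriate.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equally sampled signals."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def compare_time_histories(real, sim):
    """Summary discrepancy statistics between a logged and a simulated signal."""
    return {
        "mean_error": statistics.fmean(sim) - statistics.fmean(real),
        "std_ratio": statistics.stdev(sim) / statistics.stdev(real),
        "pearson": pearson(real, sim),
    }

def confidence_interval_95(repetitions):
    """Normal-approximation 95% confidence interval of the mean over repeated runs."""
    mean = statistics.fmean(repetitions)
    half = 1.96 * statistics.stdev(repetitions) / math.sqrt(len(repetitions))
    return mean - half, mean + half

real_speed = [20.0, 19.0, 17.5, 15.0, 12.0, 10.0]  # logged ego velocity [m/s]
sim_speed = [20.0, 19.2, 17.9, 15.6, 12.5, 10.3]   # simulated ego velocity [m/s]
summary = compare_time_histories(real_speed, sim_speed)

peak_decel_runs = [4.1, 3.9, 4.3, 4.0, 4.2]        # one KPI over five repetitions
ci_low, ci_high = confidence_interval_95(peak_decel_runs)
```

A high Pearson coefficient alone does not imply agreement in amplitude, which is why the mean error and the standard deviation ratio are reported alongside it.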
On the positive side, this methodology does not foresee the validation of each simulation submodel making up the virtual testing environment, thus reducing the validation effort. The main focus is instead on providing high correlation with real-world data on a global level for the specific KPIs considered. On the downside, this philosophy provides little information on how the toolchain is going to extrapolate outside the validation domain, which affects the credibility of the developed solution [28].

C. VALIDATION OF VIRTUAL SUBMODELS
The validation of the testing environment can alternatively be carried out by validating each individual simulation model making up the toolchain. The methodology is based on the functional decomposition of the virtual testing toolchain into submodels that replicate the physical counterpart. In general, a widespread approach is to adopt the decomposition shown in Fig. 10, a common solution also depicted in [28, Fig. 5], [4, Fig. 3], and in the recently published ISO/TR document [81].
In Fig. 10, the toolchain is functionally divided into four main blocks: the sensor models, the vehicle model, the virtual world model, and the ADS implementation, which are discussed in the remainder of this Section.

1) SENSOR MODELS
A preliminary investigation of virtual vs. real sensors was qualitatively carried out in [82]. The work focused on presenting the simulation environment ''Virtual Test Drive'' (VTD) but also provided nuances on how to generate synthetic data and reusable (parametric) models for sensors. In the end, several charts are reported which compare the recorded position of a static traffic obstacle with respect to the same setup replicated by the virtual toolchain. The work provides neither acceptance thresholds nor validation methodologies that could be straightforwardly generalized to other case studies. However, it highlights the strong coupling that exists between the models' fidelity level and the quality of the synthetic data generated by the virtual environment.
In [83], the authors examine the modeling approaches for sensors, from white-box modeling, i.e., based on the replication of the physical phenomena happening in the actual sensor device, to black-box methods, which are concerned only with replicating the observed statistical I/O relationship. In addition, the authors propose a third-way grey-box modeling philosophy as a trade-off solution to combine the pros of the white- and black-box approaches. Such a modeling approach defines a functional decomposition of the sensors' components rather than a physical reconstruction to allow a versatile and re-usable model parametrization.
The authors of [84] focus on the virtual validation for sensors and propose a two-step approach where the virtual model is initially assessed in terms of a direct comparison with the real sensor's output. Secondly, the next tier in the ADS chain is fed the synthetic data, and its output is compared against the outcome originating from the real input. The authors' claim is that the direct comparison of the sensors' models is necessary but not sufficient, as sensors are highly coupled with the consecutive layers in the ADS pipeline.
More recently, in [85], the lack of experience and clearly defined requirements with respect to other modeling fields (as vehicle dynamics for instance) is highlighted in addition to the difficulty in replicating trajectories in the simulation environment with a precision that matches the fidelity required by the sensor validation. The scientific effort is completed via repeating the metrics calculation in [86] for a different scenario and obtaining similar scores.
Overall, designers have several options to model sensors, although the overall classification is still under discussion [83], [85]-[87]:
• low-fidelity: also known as ''black-box'' models or ''object list''. Low-fidelity models retrieve the traffic objects' list and status directly from the simulation environment kernel. This modeling paradigm does not capture statistical aspects related to the perception, such as the false positives/negatives rate. However, low-fidelity models might account for basic sensor effects such as the Field of View (FoV) and occlusions to filter the whole object list;
• medium-fidelity: also known as ''grey-box'' [83] or ''phenomenological/data-driven'' according to the classification presented in [85]. They share the basic working principle of low-fidelity models. Nonetheless, they introduce the possibility of modeling false positives/false negatives rates, the effect of traffic objects' shape and texture on the detection, and environmental effects such as atmospheric degradation;
• high-fidelity: also known as ''white-box'' or ''physics-based''. High-fidelity models try to replicate in simulation the physical phenomena regulating the interaction between the sensor and the external environment. Typically, advanced computer graphics techniques, such as ray tracing and rasterization, are adopted to render the 3D simulation environment.
Strongly linked to the modeling philosophy is the information level at which the KPIs are calculated. The viable options depend on the modeling abstraction and can be summarized as:
• objects detection level: the highest level information provided by the sensors' models, e.g. class and size of the object.
This option is the only one available when adopting the lowest fidelity virtual sensor models, which cannot provide pixel-level detailed information;
• occupancy grid (OG): intermediate level sensor information which refers to the probability of a pixel being occupied by an obstacle;
• raw data: lowest level sensor information extracted before any tracking/classification algorithm is employed. It needs static obstacles (e.g. buildings, fences, ...) to be digitized and included in the simulation environment for the pixel-level comparability of the results.
Considering the object detection level, popular metrics to compare the generated bounding boxes (BB) are:
• the Intersection over Union (IoU), also known as the Jaccard index [88]:
$$\mathrm{IoU} = \frac{|BB_{sim} \cap BB_{real}|}{|BB_{sim} \cup BB_{real}|}$$
• the multiple object tracking accuracy (MOTA) [89]:
$$\mathrm{MOTA} = 1 - \frac{\sum_i \left(FN_i + FP_i + MM_i\right)}{\sum_i n_{obj,i}}$$
where $FN_i$ are the false negatives at time-step $i$, $FP_i$ the false positives, $MM_i$ the mismatches, and $n_{obj,i}$ the total number of objects.
Based on the OG, some of the computational tools exploited in the literature to quantitatively evaluate the fidelity level are:
• OG Pearson correlation:
$$r_{OG} = \frac{\sum_{c=1}^{N_c} (o^{sim}_c - \bar{o}^{sim})(o^{real}_c - \bar{o}^{real})}{\sqrt{\sum_{c=1}^{N_c} (o^{sim}_c - \bar{o}^{sim})^2}\,\sqrt{\sum_{c=1}^{N_c} (o^{real}_c - \bar{o}^{real})^2}}$$
where $N_c$ is the total number of cells, $o_c$ the occupancy value of cell $c$, and the bar denotes the mean over the grid;
• the occupied cells ratio (OCR).
Assuming the availability of the point cloud (PC) raw data, the following assessment criteria can be adopted:
• the minimum Euclidean distance between point clouds [90], normalized by the total number of rays per scan, $N$;
• PC Pearson correlation [84];
• PC RMSE [90].
The validation can then be carried out in:
• open-loop simulations: where only the sensors' output is obtained via the re-simulation of a previously recorded driving scenario and the validation is carried out after clustering and tracking (IF3 interface in Fig. 11);
• closed-loop simulations (CL): where the information generated by the actual virtual sensors is fed to the ADS and the ego-vehicle motion is part of the validation process, similarly to the process described in Section IV-B.
Ultimately, the validation of the sensor models is an open topic in the scientific literature and industrial practice.
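As a minimal sketch of the object-level and grid-level metrics listed above, the functions below implement the IoU for axis-aligned boxes, the MOTA from per-frame error counts, and the Pearson correlation between two occupancy grids. The box format `(x_min, y_min, x_max, y_max)`, the dictionary keys, and the sample values are assumptions made for this example; `mm` is interpreted here as the number of mismatches, following the usual MOTA definition.

```python
import math

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mota(frames):
    """MOTA = 1 - sum_i(FN_i + FP_i + MM_i) / sum_i(n_obj_i)."""
    errors = sum(f["fn"] + f["fp"] + f["mm"] for f in frames)
    objects = sum(f["n_obj"] for f in frames)
    return 1.0 - errors / objects

def grid_pearson(grid_sim, grid_real):
    """Pearson correlation between the cell values of two equally sized occupancy grids."""
    sim = [c for row in grid_sim for c in row]
    real = [c for row in grid_real for c in row]
    ms, mr = sum(sim) / len(sim), sum(real) / len(real)
    cov = sum((s - ms) * (r - mr) for s, r in zip(sim, real))
    norm = math.sqrt(sum((s - ms) ** 2 for s in sim) * sum((r - mr) ** 2 for r in real))
    return cov / norm

box_overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
tracking_score = mota([
    {"fn": 1, "fp": 0, "mm": 0, "n_obj": 4},
    {"fn": 0, "fp": 1, "mm": 0, "n_obj": 4},
    {"fn": 0, "fp": 0, "mm": 0, "n_obj": 4},
])
grid_real = [[0.1, 0.9, 0.8], [0.0, 0.2, 0.7]]
grid_sim = [[0.2, 0.8, 0.9], [0.1, 0.1, 0.6]]
grid_score = grid_pearson(grid_sim, grid_real)
```

The three metrics operate on different interfaces of the perception chain, so a sensor model can score well on one of them and poorly on another.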
Currently, no realistic correlation thresholds have been established to accept the models, and no consolidated modeling framework has been adopted.

a: RADAR
RADAR sensors are particularly demanding when it comes to accurately replicating their working principle in simulation [85]. There are, in fact, several non-trivial factors that affect the performance of RADARs, which include multi-path reflections, interference, ambiguities, clutter, ghost objects, and attenuation [91]. On one side, the replication of such phenomena in the virtual world is better accomplished using white-box modeling approaches. On the other side, the solution of the governing equations of a RADAR in real-time is an extremely challenging demand [92]. Indeed, the solution of Maxwell's equations for ≈77 GHz electromagnetic radiation (automotive RADAR operating frequency) using the Finite-Difference Time-Domain (FDTD) method [93] is not a viable option for most commercial automotive simulation environments, and modeling assumptions have to be introduced. The most widespread solution to partially replicate the physics behind the RADAR is to adopt the ray tracing framework. Ray tracing is still a computationally intensive alternative and requires a detailed 3D representation of the obstacles' geometry. Nonetheless, ray tracing can capture reflection, diffraction, and ghost objects [94].
Regardless of the modeling framework adopted for the RADARs, there exist no generally accepted evaluation criteria to validate a virtual sensor model, nor a unified testing procedure [95]. Additionally, from a modeling perspective, there is little difference between the virtual models for RADARs and those for LiDARs, given that their similar working principles lead to similar modeling techniques.
In [96], a sensitivity analysis is performed to determine the most critical aspects that contribute to realistic RADAR simulation. The analysis is carried out by comparing real sensor data against synthetic data generated using the open-source CARLA simulator [97]. Among the studied factors, the RADAR Cross Section (RCS) model and long-distance obstacles were found to be the main contributors to the dissimilarity with respect to the experimental data. A similar work is proposed in [95], which highlights the multipath propagation, separability, and sensitivity to the RCS as the major modeling challenges.
In [98], a neural network is trained to predict whether a point cloud is real or simulated. In addition, the classifier's confidence is used as a metric to determine the fidelity degree with respect to state-of-the-art methods. In a follow-up work [90], multiple RADAR sensor models are compared by means of quantitative metrics in an explicit open-loop and implicit open-loop manner. The correlation level computed for the highest fidelity model is reported in Table 1.

b: LiDAR
LiDAR sensors are based on a working principle comparable to RADARs. However, their modeling effectiveness using ray tracing is considerably higher with respect to RADARs thanks to the higher frequency range where they operate (infrared spectrum region), which reduces interference [83]. Ray tracing can thus be safely exploited for white-box modeling approaches.
In [84], a ray tracing-based LiDAR model is studied and validated both explicitly and implicitly using Pearson correlation. Additionally, the authors investigated the robustness of the validation routine by manipulating the simulation scenario through the removal of a traffic vehicle present in the real-world and verifying the worsening of the validation KPIs.
In [99], the authors investigate LiDARs' sensitivity characteristics to determine which effects are prominent for the sake of defining requirements for LiDARs' virtual replication using high-fidelity models. The most critical aspects to reproduce were identified as the temporal scan order, noise figures, and the received signal's intensity. A first contribution comparing different LiDARs' models parametrization was published in 2019 in [86]. The work also proposes a functional decomposition that helps to standardize interfaces for metrics assessment as shown in Fig. 11.
The same work also proposes a validation method based on the occupied cells ratio (OCR) and point cloud distance criteria over 250 scan repetitions. In [86], four high-fidelity sensor models are analyzed for the same set of inputs. OCR is shown to be dependent on the distance of the object, and in general poor performance was reported for the models. However, the contribution constitutes one of the first attempts to thoroughly compare different LiDAR modeling options given the same input and assessment criteria. The correlation levels obtained are reported in Table 2. From Table 2, a large variation of the fidelity can be noted, where the highest correlation level corresponds to the portion of the validation scenario where the target obstacle is closer to the ego-vehicle.
In a follow-up work [100], a high-fidelity model is presented which is compatible with the popular Open Simulation Interface (OSI) [101] and the Functional Mock-up Interface (FMI). The model is then implicitly validated against both a real LiDAR's output and the ground-truth information derived from a Real Time Kinematic (RTK) device by comparing the RMSE of the target vehicle trajectory.
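An implicit comparison of this kind boils down to a trajectory error computed after the tracking stage. The following sketch computes the RMSE of the Euclidean position error between a tracked target trajectory and a reference one; the 2D point format, the function name, and the sample coordinates are assumptions of this example and do not reproduce the processing chain of [100].

```python
import math

def trajectory_rmse(estimated, reference):
    """RMSE of the Euclidean position error between two time-aligned 2D trajectories."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("trajectories must be non-empty and time-aligned")
    squared_errors = [
        (ex - rx) ** 2 + (ey - ry) ** 2
        for (ex, ey), (rx, ry) in zip(estimated, reference)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

rtk_reference = [(10.0, 0.0), (12.0, 0.0), (14.0, 0.0)]    # ground-truth target positions [m]
tracked_target = [(10.3, 0.4), (12.0, 0.0), (14.0, -0.5)]  # positions after tracking [m]
rmse = trajectory_rmse(tracked_target, rtk_reference)
```

Running the same function once with the real sensor's tracked output and once with the virtual sensor's tracked output gives two errors against the same reference, which can then be compared.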

c: CAMERA
Differently from RADARs and LiDARs, cameras are passive sensing devices. Therefore ad-hoc modeling techniques are required to craft high-fidelity models.
A first study involving the characterization of the fidelity level attainable with state-of-the-art camera models was proposed in [102]. The cited contribution introduces the modeling framework in Fig. 12, where the camera system is made up of an optical module and a CCD sensor creating an RGB representation of the simulation engine rendered frame. The work details the implementation of the virtual optics and the virtual sensor and provides a preliminary validation activity using static reference scenarios.
Similarly to [102], in [103], the authors described the implementation of the main factors affecting cameras' behavior before presenting an explicit validation of their realized camera model against three real camera systems. However, in [103], the focus is shifted for the first time towards ADAS/ADS applications.
In [104] a phenomenological camera model is presented and implicitly validated against real-world data and an ideal sensor model using a frequency-based approach. In [89] two camera models implemented in the simulation environments Vires VTD and IPG CarMaker are compared under different lighting conditions against both proving ground tests and the corresponding ground-truth. The authors of [89] defined several implicit validation metrics to ultimately compute an overarching index representative of the fidelity level obtained, the ''Simulation-to-Reality Gap''. In [89], weak points of the simulation environment under test are reported, such as the difficulties of replicating the real-world noise levels and the limited capabilities of reproducing the weather conditions, which affect how the Deep Neural Network (DNN)-based object tracking modules perform.

2) VEHICLE MODELS
Modeling vehicle dynamics is an established activity supported by technical regulations and widely acknowledged scientific literature, including excellent textbooks.
Technical regulations such as [32], [38] provide specifications on how to virtualize and validate vehicle models for Electronic Stability Control (ESC) applications. Fidelity levels associated with the modeling approaches are also pointed out in the recent regulation [105].
Within the scientific literature, a comprehensive review of modeling and validation methods for vehicle dynamics models is given in [46]. The work effectively summarizes the state-of-the-art contributions from the dawn of vehicle modeling up to contemporary scientific efforts. Two main applications for vehicle dynamics modeling are delineated: the crafting of artifacts for driving simulator platforms and the design of simulation models to support the development of vehicular technology. Concerning the first use case, the topic has been widely discussed in [106] and goes beyond the scope of the present work given the importance that human feedback plays in validating the model for driving simulators, an aspect which is not relevant for an ADS application. Within the second class, models typically target a specific use case such as handling or ride studies, and each application is commonly backed by an ad-hoc devised validation procedure. Most of the works neither specify nor suggest correlation thresholds for validation. Instead, mainly subjective criteria based on a qualitative evaluation of a selection of charts are reported. Moreover, statistical analysis and the definition of confidence intervals for the validity of the simulation models are only rarely found. On the other side, a considerably broad literature is available on defining maneuvers for the later characterization of the vehicle model's parameters, thus accomplishing the ''data validity'' objective suggested by Sargent [14]. The authors of [46], based on the extensive literature survey carried out, convey that a top-down approach should be established where:
1) the validation maneuvers have to be defined depending on the model requirements. In particular, the dataset should include both repeated steady-state and transient open-loop maneuvers with the aim of isolating the contribution of distinct vehicle parameters and building confidence intervals;
2) the validation procedures include both time-domain and frequency-domain metrics able to account for the uncertainty;
3) the validity domain of the simulation model is ultimately defined, given that no global validation is possible. For instance, a set of lateral accelerations and/or steering input frequency intervals can be used to limit the validity domain.
In [47] two different models are validated with real-world trajectories to investigate the effect of order reduction on the fidelity delivered by the model. Interestingly, in the time domain metrics, no discernible discrepancy is observed as shown in Fig. 13b. However, when comparing the transfer functions, an apparent deficit in replicating the high-frequency component can be detected for the low-order model as visible in Fig. 13a. Such a discrepancy is due to the missing poles and zeros of the low-order model which do not allow for replicating the real system's high-frequency behavior. Nonetheless, even a simple model proves to be faithful at reproducing the steady-state response and maneuvers which do not involve frequency components higher than 1 Hz.
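The effect of missing poles on the frequency response can be reproduced with a purely illustrative pair of transfer functions. The second-order ''full'' model (natural frequency 3 Hz, damping ratio 0.3) and the first-order ''low-order'' model below are hypothetical and are not the models of [47]; they are chosen only so that both share the same static gain.

```python
import math

def gain(num, den, freq_hz):
    """Magnitude of the transfer function num(s)/den(s) evaluated at s = j*2*pi*f."""
    s = 1j * 2.0 * math.pi * freq_hz

    def poly(coeffs):
        # Coefficients are ordered from the highest power of s down to the constant.
        return sum(c * s ** k for k, c in enumerate(reversed(coeffs)))

    return abs(poly(num) / poly(den))

omega_n = 2.0 * math.pi * 3.0  # assumed natural frequency [rad/s]
zeta = 0.3                     # assumed damping ratio

# Second-order reference: omega_n^2 / (s^2 + 2*zeta*omega_n*s + omega_n^2)
full_model = ([omega_n ** 2], [1.0, 2.0 * zeta * omega_n, omega_n ** 2])
# First-order reduction keeping the static gain: 1 / (tau*s + 1), tau = 2*zeta/omega_n
low_order = ([1.0], [2.0 * zeta / omega_n, 1.0])

gap_low_freq = abs(gain(*full_model, 0.5) - gain(*low_order, 0.5))
gap_high_freq = abs(gain(*full_model, 3.0) - gain(*low_order, 3.0))
```

In this example the two gains almost coincide at 0.5 Hz, while near the resonance of the reference model the reduced model misses the amplification altogether, a discrepancy that a slow time-domain maneuver would not reveal.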
More recently, in [107], a framework was proposed to introduce the model's uncertainty analysis into the validation activity. The foundation of the framework is that the candidate model shall satisfy the validation threshold for multiple vehicles (or different vehicle configurations) following the suitable calibration of the model. In other words, the authors try to decouple the validity of the model structure from the actual parametrization to isolate the model's inherent uncertainty and increase the credibility of the developed simulation framework. A graphical representation of the proposed solution is shown in Fig. 14.
Although the approach is particularly promising, since it would allow solving some of the critical aspects of validation, namely the lack of sensitivity and uncertainty characterization in most of the validation procedures, only a formal description of the activity is presented, whereas an actual practical application is deferred to future work. Additionally, some issues might still arise from the gradual introduction of ADAS/ADS technologies, given the different driving characteristics of artificial actuators with respect to human driving. For instance, the actuators' higher bandwidth might excite higher-frequency poles of the system [31] with respect to human driving.
Finally, in [108], the impact of different modeling approaches is investigated by comparing four classes of models of increasing complexity with real-world measurements on both icy and dry roads. It is reported that even the ''simple'' model was able to predict the traveling velocity and path recorded on the dry road surprisingly accurately, whereas detailed modeling approaches were deemed necessary to predict transients at the entry and the exit of curves, especially for the low-friction scenarios.
From the literature analyzed, we can summarize the modeling approaches in three classes based on the fidelity level that they can provide:
• low-fidelity: point-mass or kinematic models. Mainly used for controller synthesis and for simulations where detailed vehicle modeling is not required, such as microscopic traffic studies;
• medium-fidelity: chassis models such as single-track, double-track, and lumped mass [47] with linear or nonlinear tires. Their range of application spans from the synthesis of model predictive controllers [31], [109] to intermediate fidelity simulations involving the testing and prototyping of ADAS/ADS functionalities;
• high-fidelity: multibody models [110], [111] including suspension geometry, chassis compliance, engine mounting stiffness and damping characteristics, driveline dynamics, and tire contact points to provide the ultimate degree of faithfulness.
Concerning VV&UQ methods, a survey addressing the estimation of the uncertainty content in vehicle dynamics simulation models was proposed in [73], which identifies four key aspects originating the uncertainty:
1) the reconstruction of the input signals used for executing the virtual tests;
2) the aleatory and epistemic uncertainty of the model's input parameters;
3) the parametrization of the model;
4) the model's output.
The work summarizes scientific contributions also discussed in [46] from the perspective of the ''degree of statistical validation'' (DSV). It highlights the potential beneficial effects of the statistical validation concept applied to the field of vehicle dynamics in addressing the mentioned limitations of conventional validation. Nonetheless, the novelty of the approach is emphasized by the fact that no literature work could be found that addressed all the uncertainty sources.
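Returning to the fidelity classification above, the low-fidelity class can be illustrated with a kinematic single-track model. The sketch below is a generic textbook formulation referenced to the rear axle and integrated with an explicit Euler step; the wheelbase, speed, steering angle, and step size are assumed values and are not taken from the cited works.

```python
import math

def kinematic_single_track_step(state, speed, steer, wheelbase, dt):
    """One explicit-Euler step of the rear-axle-referenced kinematic single-track model."""
    x, y, yaw = state
    x += speed * math.cos(yaw) * dt
    y += speed * math.sin(yaw) * dt
    yaw += speed / wheelbase * math.tan(steer) * dt
    return x, y, yaw

def simulate(state, speed, steer, wheelbase=2.5, dt=0.01, steps=100):
    """Integrate the model for a constant speed [m/s] and steering angle [rad]."""
    for _ in range(steps):
        state = kinematic_single_track_step(state, speed, steer, wheelbase, dt)
    return state

straight = simulate((0.0, 0.0, 0.0), speed=10.0, steer=0.0)
turning = simulate((0.0, 0.0, 0.0), speed=10.0, steer=0.1)
```

The model has no tire forces or load transfer, so its yaw rate depends only on speed, steering angle, and wheelbase; this is what makes it cheap and, at the same time, unsuitable for maneuvers where tire slip matters.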

3) VIRTUAL WORLD MODELS
The ''virtual world'' is representative of a broad set of models which include any virtual entity enabling the closed-loop interaction of the ego-vehicle with the virtual environment. These include: virtual road, traffic agents and weather models.

a: ROAD LAYOUT
The virtualization of the roads' layouts is well supported by the widespread ASAM modeling standards:
• OpenCRG³: which standardizes local properties of the lane in terms, for instance, of asphalt granularity and pothole locations;
• OpenDRIVE⁴: which regulates the definition of the road geometry in terms of number of lanes, curvature radius, lane marking, and slope/camber;
• OpenSCENARIO⁵: which institutes a standard way to model the behaviors of road users and traffic agents.
Although no explicit validation activities have been pursued for the listed modeling approaches, they are widely recognized as faithfully replicating road geometry for the sake of ADAS/ADS virtual testing in the actual simulation practice [4], [112].

b: TRAFFIC AGENTS
Traffic models involve agents external to the ego vehicle which participate and/or interact in the simulation. These consist of pedestrians [113] and drivers [114]. The validation of traffic models is an open point in the literature given the intrinsic stochasticity of human behavior, which makes it fundamentally impossible to obtain ground-truth information. Additionally, high-fidelity traffic models might not even be necessary in the case of scenario-based approaches where the targets' trajectories are assigned beforehand [115].

c: STATIC OBJECTS
These include buildings, guardrails, and any other object making up the virtual environment. Currently, no standard exists for modeling this class of components. Replicating the optical and reflectance properties of static obstacles is particularly important in the case of physics-based sensor models, as false positives/false negatives might arise due to the complex phenomena involved.

d: COLLISION MODELS
Whenever collisions occur, the rigid-body assumptions underlying the traffic participants and traffic objects might no longer hold. Thus, collision models might be established and assigned to the objects taking part in the virtual test depending on the use case. Based on [81], several approaches can be pursued: low-fidelity: using relative velocity and heading angle to directly estimate the damage of the collision; medium-fidelity: using the involved agents' kinematics to determine the acceleration and classify the damage accordingly; high-fidelity: using FEM to precisely compute the forces exerted during the collision.
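As a hedged illustration of the low-fidelity option above, the sketch below computes the two quantities such a model would start from: the relative speed and the heading difference of the two agents at impact. The mapping from these quantities to a damage estimate is not specified here and is deliberately left out; the function name `impact_inputs` and its argument layout are assumptions made for this example.

```python
import math


def impact_inputs(vx_a: float, vy_a: float, yaw_a: float,
                  vx_b: float, vy_b: float, yaw_b: float):
    """Return (relative speed [m/s], heading difference [rad] in [0, pi]).

    Velocities are expressed in a common world frame; yaw angles are the
    agents' headings in the same frame.
    """
    rel_speed = math.hypot(vx_a - vx_b, vy_a - vy_b)
    # Wrap the heading difference to [-pi, pi) before taking its magnitude.
    diff = (yaw_a - yaw_b + math.pi) % (2.0 * math.pi) - math.pi
    return rel_speed, abs(diff)
```

For a head-on configuration the heading difference approaches pi and the relative speed is the sum of the two speeds, whereas for a rear-end configuration the heading difference is close to zero and only the speed difference matters.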

e: WEATHER
Adverse weather conditions are known to blur and darken cameras' output [116], absorb and scatter RADAR pulses [117], and cause back-scattering and reduced surface reflectivity in the case of LiDARs [118]. Replicating such complex phenomena in the virtual environment is, at the moment of writing, an extremely challenging task due to the need to provide sensor-grade realism. Indeed, although widespread metrics exist to evaluate the quality of generated images, such as the structural similarity (SSIM) [119], they focus on human perception, and a counterpart capable of capturing sensors' perception properties is still missing. In [120], a model-based approach to introduce sensor-specific rain effects via post-processing of the rendered frames is presented. The approach uses ray tracing to model rain-induced noise for virtual cameras and an equivalent RCS for RADAR sensors. The preliminary validation analysis demonstrates that the proposed methodology achieves a higher fidelity level for all the sensor types.
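For reference, the SSIM index mentioned above combines luminance, contrast, and structure terms computed from the means, variances, and covariance of two images. The sketch below is a simplified, single-window ("global") variant operating on flat sequences of pixel intensities; the standard formulation averages the same expression over local windows, which is omitted here for brevity. The function name `global_ssim` and the use of plain Python sequences are assumptions of this example, while the stabilizing constants follow the commonly used values K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(pixels_a, pixels_b, data_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized grayscale images.

    pixels_a, pixels_b: flat sequences of pixel intensities.
    data_range: dynamic range of the pixel values (255 for 8-bit images).
    """
    if len(pixels_a) != len(pixels_b) or len(pixels_a) == 0:
        raise ValueError("images must be non-empty and of equal size")
    n = len(pixels_a)
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_a = sum(pixels_a) / n
    mu_b = sum(pixels_b) / n
    var_a = sum((p - mu_a) ** 2 for p in pixels_a) / n
    var_b = sum((q - mu_b) ** 2 for q in pixels_b) / n
    cov_ab = sum((p - mu_a) * (q - mu_b)
                 for p, q in zip(pixels_a, pixels_b)) / n
    num = (2.0 * mu_a * mu_b + c1) * (2.0 * cov_ab + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den
```

An index of this kind rewards similarity as perceived by a human observer; it does not by itself indicate whether a perception algorithm would respond identically to the real and the synthetic frame, which is the gap highlighted above.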

4) OTHER SOFTWARE COMPONENTS & VERIFICATION
The proposed division into sub-components covers most of the ingredients making up the virtual toolchain. There are, however, additional elements that need to be introduced in order to carry out the virtual test. Among them are time steps, solvers, and coupling algorithms, in addition to the craftsmanship required to set up a simulation or a co-simulation framework.
Despite the adoption of validated sub-components, the overall virtual test's outcome might be far from reality due to integration and software implementation issues, which shall be addressed using software verification techniques, namely [121]: code verification: concerned with the execution of tests demonstrating that no numerical/logical flaws affect the virtual models; calculation verification: dealing with the estimation of numerical errors affecting the M&S toolchain. Code and calculation verification methods are well-established practice in many engineering applications. For instance, in FEM/CFD (computational fluid dynamics) simulations, the spatial and time discretizations are typically refined until the result converges to a stable value [122]. Nonetheless, the same concept is commonly overlooked for ADAS/ADS simulations, where the time step is rarely discussed.
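The refinement practice described above can be sketched as a simple time-step convergence study: the step is halved until the quantity of interest changes by less than a relative tolerance. The example below is a minimal illustration under stated assumptions. The toy longitudinal model (constant commanded acceleration with linear drag), its parameter values, and the names `euler_final_speed` and `refine_time_step` are hypothetical and serve only to demonstrate the procedure.

```python
def euler_final_speed(dt: float, t_end: float = 5.0, v0: float = 0.0,
                      a_cmd: float = 2.0, drag: float = 0.5) -> float:
    """Integrate dv/dt = a_cmd - drag * v with explicit Euler up to t_end."""
    n_steps = round(t_end / dt)
    v = v0
    for _ in range(n_steps):
        v += (a_cmd - drag * v) * dt
    return v


def refine_time_step(simulate, dt0: float, rel_tol: float = 1e-3,
                     max_halvings: int = 20):
    """Halve the time step until successive results agree within rel_tol.

    simulate: callable mapping a time step to a scalar quantity of interest.
    Returns the accepted time step and the corresponding result.
    """
    dt = dt0
    prev = simulate(dt)
    for _ in range(max_halvings):
        dt /= 2.0
        cur = simulate(dt)
        if abs(cur - prev) <= rel_tol * max(abs(cur), 1e-12):
            return dt, cur
        prev = cur
    raise RuntimeError("no convergence within max_halvings")
```

Calling `refine_time_step(euler_final_speed, 0.5)` returns the first halved step at which the final speed stabilizes within the tolerance. In a co-simulation setting the same loop would wrap the full toolchain run, which is costly, and this may partly explain why the check is seldom reported.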

V. DISCUSSION
Despite the substantial scientific effort in the last years concerning the definition of validation methodologies for simulation models, several open research questions remain unsolved.
Integrated testing procedures are exceptionally delicate to handle, as the ADS agent interacts in a closed-loop fashion with other traffic participants within the virtual environment. Such a dynamic multi-agent modeling framework is indeed noticeably different from traditional engineering simulation applications, where the validation of the virtual submodels would ensure the accuracy of the overall output [123]. Moreover, most of the validation frameworks are conceived for open-loop tests where the model under analysis shows limited or null closed-loop interaction with internal state variables of the environment, such as in FEM/CFD simulation. That is not the case for ADS (especially for high-automation level technology), where the driving agents are capable of reasoning and producing emergent behaviors based on the sensor data. From a sensitivity perspective, it is also particularly hard to judge how small deficiencies in reconstructing the virtual scenarios propagate throughout the toolchain given the role played by the ADS, which can either dampen or amplify the discrepancies [39], [124]. Unfortunately, to the best of the authors' knowledge, this aspect is rarely discussed given the extremely high computational cost that a sensitivity study over such a large parameter space would imply.
Considering vehicle dynamics, most of the validation procedures are tailored to judge the validity of replicating a specific dynamical effect, e.g., vertical dynamics, comfort, or handling. This limitation results from the narrow role vehicle dynamics models have played in the technical regulations and scientific literature in comparison to what is expected from an ADS application, where the full vehicle is simulated, albeit with a degree of fidelity which is yet to be established. Nonetheless, vehicle dynamics models benefit from the support of conceptual model validity techniques [24], statistical testing [74], and sensitivity analysis methodologies [67], which increase the credibility of the developed model with respect to other simulation modules contributing to the ADS virtual tests.
Sensor models, on the other hand, are still immature. Currently, no standard modeling framework has been adopted (only an ''informal'' classification is available, as in Section IV-C1), and no validation framework has been standardized, either in terms of KPIs or by setting realistic correlation thresholds. This lack of standardization results in disconnected model validation methodologies, which complicate the comparison of virtual sensor models' fidelity. Some specific aspects are reported to be particularly critical to replicate, including noise figures [89], RCS, and the real-world weather parametrization/replication capabilities of the simulation environments. That is mainly because simulation environments commonly pursue human-eye realism rather than the sensor-grade realism that should be pursued.
Finally, the survey work has emphasized the infancy of the approaches adopted for validating ADS-related models: simple techniques and limited statistical assessment characterize most of the approaches presented in Section IV with respect to the state-of-the-art methods in Section III. This is also, unsurprisingly, reflected in the recency of the publications cited in our work. Most of the references are indeed only a few years old, in contrast with the widely recognized ASME/AIAA validation standards for FEM/CFD applications, which are grounded on decades of utilization.

VI. CONCLUSION
This work surveyed the most relevant contributions supporting the virtual validation of simulation models for ADS certification. The scientific effort pursued is agnostic to any ADS structure and functionality and was primarily aimed at establishing procedures to assess how a virtual testing toolchain could be awarded the ''virtual proving ground'' trait for the sake of ADS certification.
Our contribution started by summarizing the state-of-the-art in traditional simulation model validation to build up a benchmark of validation techniques across different engineering fields. We clustered the methodologies in three classes in Section III: conceptual validation, validation via response analysis, and sensitivity & uncertainty. We spent particular effort on the quantitative methods within the response analysis family, given the large variety of computational tools available and their widespread adoption as validation tools both in the scientific community and in technical regulations. We found conceptual analysis to have been largely superseded by other validation tools for the field of ADAS/ADS. On the other hand, VV&UQ approaches are still in an early stage of development. Nonetheless, they have huge potential to increase the overall M&S credibility, especially for complex simulation toolchains whose robustness is hard to establish via conventional threshold-based validation criteria.
We then investigated the literature contributions dealing specifically with simulation models for ADS virtual testing, after briefly presenting what is meant by ''simulation environment'' in Section IV. We found that two main approaches can be outlined: integrated tests and submodel-based solutions. For the integrated test category, we analyzed the most relevant scientific efforts, and we outlined the authors' selection of KPIs and validation methodologies. Similarly, for the submodel category, we proposed a modeling framework in Fig. 10, and we analyzed the literature concerning virtual sensors, virtual vehicles, and virtual world models with their corresponding validation strategies. Special emphasis was dedicated to the validation of sensor models given the novelty of the approaches proposed.
Finally, we summarized the open challenges in Section V. Overall, model validation for ADS virtual certification is an emerging field with enormous potential, but it is still relatively immature with respect to other simulation disciplines where widely acknowledged validation procedures have been established.

ACKNOWLEDGMENT
The authors are grateful to the JRC colleagues Maria Cristina Galassi, Sandor Vass, and Kostantinos Mattas for the exchanges within the project, and to Barnaby Simkin, Tobias Duser, Gil Amid, Siddartha Khastgir, Espedito Rusciano, and all the experts participating in the VMAD SG2 working group on Virtual Testing and Simulation for the insightful discussions on the subject. The work of Riccardo Donà has been carried out for the European Commission Joint Research Centre.