Software Verification and Validation of Safe Autonomous Cars: A Systematic Literature Review

Autonomous, or self-driving, cars are emerging as the solution to several problems primarily caused by humans on roads, such as accidents and traffic congestion. However, those benefits come with great challenges in the verification and validation (V&V) for safety assessment. In fact, due to the possibly unpredictable nature of Artificial Intelligence (AI), its use in autonomous cars creates concerns that need to be addressed through appropriate V&V processes capable of ensuring trustworthy AI and safe autonomy. In this study, the relevant research literature of recent years has been systematically reviewed and classified in order to investigate the state-of-the-art in the software V&V of autonomous cars. Using appropriate criteria, a subset of primary studies has been selected for more in-depth analysis. The first part of the review addresses certification issues against reference standards, challenges in assessing machine learning, as well as general V&V methodologies. The second part investigates more specific approaches, including simulation environments and mutation testing, corner cases and adversarial examples, fault injection, software safety cages, techniques for cyber-physical systems, and formal methods. Relevant approaches and related tools are discussed and compared in order to highlight open issues and opportunities.


I. INTRODUCTION
Accidents on roads happen so frequently that they are considered part of everyday life. However, the number of fatal accidents worldwide is not to be neglected: recent figures for road casualties exceed 1 million per year globally [1]. Fatal accidents are considered a major problem in many countries. While some countries aim to decrease the number of fatal accidents by 50% in the next 5 years, others, for example, Sweden, have set a goal to completely prevent such accidents [1]. A summary in reference [2] shows that human errors account for 75% of all road accidents, and this figure is over 90% in the United States. According to recent studies, these numbers show no sign of decline if humans are left in full control of vehicles. For human drivers, decision-making is 90% based on visual perception, and in fact, humans are not capable of being fully aware of all potential hazards in the surrounding environment [2]. Additionally, traffic jams are usually caused either by poor human decisions or by the inability of drivers to accurately monitor the behavior of all other vehicles. Traffic congestion not only leads to loss of time and resources, but also causes a rise in air pollution and greenhouse gases. These issues are of paramount importance in industrialized societies, and introducing autonomous vehicles (AVs) on roads is seen as a viable solution to these problems. However, the claim that AVs provide overall increased safety compared to human driving needs to be demonstrated.
It is rather easy to demonstrate the scale of human mistakes leading to safety issues on roads. However, switching from human drivers to AI systems is currently a challenging transition. AI systems have shown impressive results in different domains and are progressing quickly. However, their application to safety-critical domains, such as AVs, creates several V&V issues, mainly due to limited predictability. Although it is possible to statistically show the accuracy of AI systems to some extent, it is most often not possible to fully understand how they came to a specific conclusion. Therefore, these systems should be regarded as unsafe until their safety is proven according to reference standards such as ISO 26262 [3]. While ISO 26262 provides guidelines for the development of safety-critical systems and requirements applicable to the automotive industry [4], it is not yet clear whether this standard can be effectively used as-is for AI-based systems.
Further issues arise when considering more than one AV operating on roads. Even if a single vehicle performs well under test conditions with a reasonably small error rate, deploying millions of AVs may increase the total failure rate to an unacceptable level [3]. However, a transition on such a large scale is unlikely to happen at once: the control of human drivers can be gradually reduced in order to progressively reach full automation. The Society of Automotive Engineers (SAE) has defined six levels of automation for AVs. In addition to outlining the responsibilities of the human driver, those levels also help to determine technical requirements to ensure the safety of AVs [5]:
• Level 0. The vehicle system has no control over the vehicle. As full control and responsibility rest on the human operator, the vehicle is only allowed to issue a warning in hazardous situations and cannot command the actuators.
• Level 1. The human driver and the vehicle cooperate throughout the ride. The contribution of the vehicle helps to increase driving performance; however, it is limited to either steering or acceleration. The driver should always stay aware of the situation.
• Level 2. The vehicle assists the human operator by taking full control of the vehicle. Although the driver does not need to control the vehicle, he/she should maintain full situational awareness to intervene whenever needed.
• Level 3. Vehicles classified at this level have restricted decision-making abilities and perceive the environment through sensors. Although the human driver does not need to maintain complete situational awareness, he/she is still required to be ready to take control in dangerous situations. The driver's attention is less strictly required at this level.
• Level 4. At this level, although the driver can intervene, he/she is not expected to take control of the vehicle. Highly Automated Vehicles (HAV) are responsible for all tasks including safety-critical ones and can operate safely without human input in predefined modes.
• Level 5. Vehicles classified at this level can operate by themselves in any situation. According to reference [6], at this level the driver is completely out of the loop, while reference [5] suggests that completely taking control away from the human driver, i.e., removing the steering wheel and pedals, may be optional; in both references, the authors agree that the driver has no responsibility for vehicle control. If the driver cannot intervene, such vehicles should be classified with a high ASIL (Automotive Safety Integrity Level), as defined in ISO 26262, because they lack controllability [3].
Vehicle systems from Level 0 to Level 2 are usually introduced as Advanced Driver Assistance Systems (ADAS). At these levels, the human driver is held responsible for the vehicle's operation and therefore should monitor the vehicle and intervene in unsafe situations [6]. Figure 1 depicts a graphical summary of the described SAE levels. So far, the most advanced autonomous cars introduced to the public are only partially autonomous, since a human driver should still stay alert to supervise and take over control of the vehicle when needed to avoid or manage hazardous situations [4]. A fatal accident in 2018 involving a Tesla car and a truck [7] showed the necessity of human supervision in autonomous cars at the current stage of evolution. The accident happened due to the inability of the car's artificial vision system, in autopilot mode, to detect a white truck against a cloudy sky background. Such a corner case for the camera was not part of Tesla's test suite [8] and could only have been identified by a human driver.
The importance of software V&V for the safety assurance of autonomous cars is the main motivation of the study presented in this paper. The increasing complexity of software systems will raise the cost of production [4], and reliable V&V approaches are crucial, as a single bug that goes unnoticed can be extremely costly for a company [9]. Even without AI being involved, currently around 60% of vehicle recalls are due to software bugs [10].
In such a context, this paper presents a comprehensive Systematic Literature Review (SLR) addressing the state-of-the-art of safety V&V frameworks and approaches for software systems in autonomous cars.
The main objectives of this study are to review existing methodologies and tools used to tackle challenges in the safety assessment of autonomous cars, highlight open issues, and provide some hints about future research directions. This study is also important to define a baseline on which to evaluate the potential for technology transfer to other transportation domains, such as railways, where autonomous vehicles can adopt some of the techniques being pioneered in the automotive domain due to the huge and growing interest in self-driving cars. This is one of the objectives of the Horizon 2020 research project named RAILS (Roadmaps for AI Integration in the Rail Sector), funded by the Shift2Rail Joint Undertaking, a body of the European Union [11]. Please note that all literature addressing the V&V of more conventional safety-critical automotive software, such as that controlling legacy ABS (Anti-lock Braking Systems), is out of the scope of this survey.
Since this paper is an SLR, it only includes papers published in reputable sources that are publicly accessible and already indexed in reference literature repositories; for this reason, and due to the extremely competitive automotive industry landscape, it is important to highlight that the study presented in this paper cannot include any research results that are either not peer-reviewed or not made public for confidentiality reasons. Furthermore, since the safety of self-driving cars is a ''hot'' research field nowadays, works are continuously added to the reference literature. Therefore, it cannot be in the scope of this paper to provide an always up-to-date picture including all of them, which would be impossible, but rather to identify the main research trends by grouping papers into relevant categories and discussing them in terms of main challenges and opportunities.
Please note that in this paper we often use the abbreviation AV for autonomous (road) vehicle, which we mainly consider a synonym for self-driving or unmanned car, although, as explained above, road vehicles may be of different types and feature different levels of autonomy. It should also be clear that we are not considering infrastructure-based ITS (Intelligent Transportation Systems) implementations, where connected vehicles (sometimes framed in the ''Internet of Vehicles'') are controlled centrally and/or through vehicle-to-vehicle (V2V) communications, as in platooning approaches.
The rest of the paper is organized as follows: Section II describes the research methodology and provides a general classification of the papers resulting from the literature search. Section III provides the general results of the SLR, serving as background information about the life-cycle and certification of safety-critical software in autonomous cars, as well as the V&V approaches for machine learning systems currently proposed in the literature. Section IV analyses and discusses more specific SLR results, i.e., the state-of-the-art of techniques for the safety assessment of autonomous cars. Section V outlines open issues, challenges, and opportunities as they emerge from the review. Section VI briefly reports on related work in terms of similar SLRs, and finally, Section VII closes the paper by summarizing the findings and results of the study.

II. RESEARCH METHODOLOGY
This survey adopts the guidelines for Systematic Literature Review (SLR) in software engineering from references [12] and [13]. Following these guidelines, the SLR has been conducted in the two phases ''Plan Review'' and ''Conduct Review'', each one consisting of several steps.

A. PLAN REVIEW
In this phase, the review protocol, the research scope, and research questions are specified. In the following sections, these steps are described in more detail.

1) REVIEW PROTOCOL
The review protocol contains the elements of the SLR and is necessary to define the methodology, the review process, and the criteria used to filter the selected papers [12], [13]. The components of the review protocol of this SLR consist of research questions, search process, inclusion and exclusion criteria, and classification of the related work. Details of these components start in this section and continue in Section II-B.

2) RESEARCH SCOPE
The scope of this SLR encompasses reviewing research work of recent years (approximately, the last ten years) related to the V&V of safe autonomous cars, with a focus on software engineering, in order to identify the state-of-the-art and open issues. Related SLRs are discussed in Section VI.

3) RESEARCH QUESTIONS
Research questions of this SLR are defined as follows:
• RQ1: What are the most common requirements in the software V&V of safe autonomous cars?
The objective of answering this question is to identify V&V requirements according to existing safety standards and regulations. This will also give us an insight into the main objectives of the certification processes of AVs.

B. CONDUCT REVIEW

1) AUTOMATIC SEARCH
In this section, the search query used to retrieve relevant studies is defined. The following main query has been used to search in title, abstract, and keywords:

safe* AND ((self PRE/0 driving) OR (autonom* PRE/0 (vehicle* OR car* OR automobile* OR driving)))

PRE/0 is a Scopus-specific proximity operator specifying that the second word must immediately follow the first one, i.e., there should be no words between the two. The first instance of the operator is used to include both ''self-driving'' and ''self driving'' keywords. In the second instance, the operator is used instead of the AND operator, as AND would match any text including the keywords at any distance from each other; this, for example, would lead to many results about autonomous marine vehicles, autonomous aerial vehicles, or autonomous heavy vehicles. In a search engine that does not support the proximity operator, the query would need to be modified to include exact matches such as ''autonomous vehicle'', ''autonomous car'', and their variations. Using the proximity operator PRE/0 helps us exclude irrelevant studies effortlessly. Even though the focus of this SLR is on autonomous cars, studies related to other types of vehicles have been included to cover technologies also used in autonomous cars. Adversarial attacks targeting image recognition systems are an example subject, because they are not specific to autonomous cars but also affect other AVs. The general search query allowed us to explore research in different areas that could provide valuable information in support of the V&V of safe autonomous cars.
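As an aside, for a search engine without proximity operators, the main query mentioned above could be rewritten using exact phrases, for instance as follows (a sketch only; field syntax and wildcard support vary between engines):

safe* AND ( "self driving" OR "self-driving"
            OR "autonomous vehicle*" OR "autonomous car*"
            OR "autonomous automobile*" OR "autonomous driving" )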
In addition to the main query, in order to limit search results to software V&V, only studies matching the following query on their keywords have been selected:

(testing OR verification OR validati*) AND (software OR ''Artificial Intelligence'' OR ''Machine Learning'' OR ''Deep Learning'' OR ''Neural Network*'')

Although the keyword ''software'' covers most of the related work (75% of the search results), adding common keywords related to software aspects of self-driving cars has increased inclusiveness.
The results are also limited to research work published between 2009 and 2020 in the ''Computer Science'' and ''Engineering'' fields. The query returned about 200 results. Two of the results have been removed as they were duplicates of references [8] and [14]. In Figure 2, the distribution of papers over the recent years is illustrated. Starting from 2016, there has been a steep increase in the number of papers published on the reference subjects, which demonstrates the growing interest of the research community. Please consider that data for the year 2020 are incomplete, as many papers still need to be indexed; therefore, the dashed line provides a conservative estimate based on linear regression from 2016, although the growth is actually more likely to become exponential.

2) INCLUDED AND EXCLUDED STUDIES
After determining the initial set of papers to review, inclusion and exclusion criteria have been defined and applied to the results in order to remove false positives and keep only relevant studies. Inclusion and exclusion criteria have been applied mainly considering title and abstract only; however, when the situation was not clear, the paper's full text has been reviewed, as suggested by reference [13]. Inclusion and exclusion criteria are defined below.
Inclusion criteria:
• I1: Studies focusing on V&V of software systems in safe autonomous cars.
Research work included by this criterion has a specific focus on the safety of autonomous cars. In contrast with other autonomous vehicles, or autonomous road vehicles to be specific, autonomous cars interact more frequently and closely with human drivers on busy roads and with pedestrians inside cities. Thus, ensuring their safety differs, for instance, from that of autonomous heavy vehicles, which usually do not drive in areas with many pedestrians.
• I2: Studies focusing on V&V of safety-critical software systems in different types of vehicles that are also applicable to autonomous cars.
This criterion covers studies focusing on safety V&V of software used in any type of vehicle, not just cars. For example, references [14]-[17] provide useful information about adversarial attacks that can target any type of autonomous vehicle, including autonomous cars, but do not contain the keyword ''car'' anywhere in the title, abstract, or keywords, and therefore would not normally be included.
Exclusion criteria:
• E1: Studies that do not apply to autonomous cars.
Studies about AVs that cannot be applied to autonomous cars, such as approaches suitable for aircraft and marine vehicles that cannot be transferred to the automotive domain, are not in the scope of this paper. Also, papers that mention autonomous vehicles but do not provide further information related to the topic are considered false positives and are therefore excluded.
• E2: Studies that do not focus on V&V for safety assessment.
Many papers about autonomous vehicles mention the importance of their safety, and since those studies are related to software engineering, they also mention verification. However, not all of these studies are about V&V for safety assessment; those that are unrelated are excluded from our study.
• E3: Studies that do not focus on software aspects of safety V&V.
Research work that primarily focuses on hardware (e.g., sensor technologies) without sufficient information on software is not included in this SLR.
The process of applying inclusion and exclusion criteria to the 200 studies resulted in 105 primary papers. Figure 3 shows the distribution of selected studies over the years: the same considerations hold here as in the case of Figure 2. Figure 4 shows the cumulative number of citations received by the papers in the last decade: again, the graph shows a clearly increased interest in recent years, starting from 2016, with the contribution of a few seminal papers on the subject that we also mention quite often in this study, although citation data for the most recent years still need to stabilize. Figure 5 shows the geographical distribution of publications based on the affiliation of authors. The United States leads the research on this subject, not unexpectedly, due to its pioneering research on smart highways and large investments from leading companies like Tesla Motors and Google. Countries with a tradition of high-quality academic research and/or a strong automotive industry sector also rank high in the graph. The classification of studies according to inclusion and exclusion criteria is provided in Table 1. The categorization of the papers according to their main topics is outlined in Table 2, where CPS is the acronym for Cyber-Physical System. In the table, we have grouped papers into homogeneous categories according to the most recurrent topics addressed by the studies, hence following a ''bottom-up'' approach. We have avoided any classification causing ambiguities, such as the distinction between ''verification'' (i.e., checking the system against its specification) and ''validation'' (i.e., checking that the specification is appropriate for the application), because in several studies those words have been used with different meanings. It is worth mentioning that there is some overlap in the topic coverage of those categories; nevertheless, the classification has proven appropriate for the discussion presented in Section IV, where those categories are described in more detail. We attempted to use different categories following a ''top-down'' approach, based, e.g., on life-cycle stages, architectural levels, or components, but those classifications did not match well with the coverage of the papers we actually found, due to several factors, including heterogeneity in reference architectural models and the immaturity of new V&V approaches for AVs, compared to more stable sectors such as avionics or railways (e.g., automatic train control), where those classifications work better.

III. GENERAL SLR RESULTS: OVERVIEW OF SAFETY ASSESSMENT IN AUTONOMOUS CARS
In this section we provide an overview of safety assessment in autonomous cars, serving as background knowledge for more specific techniques, by exploring certification challenges as well as the most general and seminal studies on the subject, as emerged from the SLR.
AVs bring new challenges in safety assessment. There is a plethora of approaches and techniques for the V&V of conventional software systems; however, autonomous systems diverge from ordinary systems as they learn, adapt, and change according to new situations [7]. Therefore, traditional methods cannot be directly applied to autonomous systems. Reference [42] argues that the lack of a sound methodology to assess the safety of such systems may hinder the delivery of AVs to the public. In the following sections, we provide an overview of safety assessment in autonomous cars, as emerged from the SLR, including background information on the certification of AVs, challenges in the assessment of the machine learning (ML) technologies used in those systems, and approaches proposed for tackling those challenges.

A. CERTIFICATION
ISO 26262 is a widely used standard for safety-critical road vehicles including at least one electronic component. The standard conforms to the V-model for product life-cycle V&V management, and provides guidelines for the design and integration of both software and hardware components [4]. The standard provides specifications for model-based development and the usage of fault injection for software and hardware components [4]. The V-model used for software development and testing has been applied to vehicles for more than 20 years [3]. The process is shown in Figure 6. Within the V-model, the right side describes the incremental V&V process applied to the waterfall development model described on the left side. This approach relies on the assumption that system requirements and specifications can be defined correctly and exhaustively at well-defined stages. However, such a model is often considered too idealized because it deviates from actual industrial practice, and this is even worse with AVs due to their extremely dynamic engineering processes [3]. In fact, although ISO 26262 provides general guidelines and best practices for the safety assessment of road vehicles according to the V-model, AVs pose new challenges, as mapping the actual development process of AVs to the V-model creates several technical difficulties.
Scaling the classical V-model approach to AVs is especially challenging and does not seem feasible in the near future. The main reason is that in the classical approach a list of requirements should be provided; however, due to their unpredictable nature, AVs may encounter an incalculable number of possible scenarios. Therefore, not all requirements can simply be captured in plain documentation. Although it might be tolerable that AVs will not be required to handle all rare cases, the requirements should clearly show what is covered and what is not [201].
According to reference [4], the main issue in checking that a system complies with ISO 26262 is to statistically demonstrate that the system will remain in agreement with its safety goals while operating. This becomes even more challenging when considering a fleet of autonomous cars on roads. Even if testing methods show a low statistical failure rate for a single vehicle, when considering, for instance, a million vehicles on roads, the rate might reach an unacceptable level. To statistically prove that a million vehicles can be on roads one hour each day without having a fatal accident for a thousand days, a single test should run for at least 10^9 hours [3]. The ISO 26262 standard sets the limit for the acceptable failure rate at 10 FIT (Failures In Time), meaning that if the system operates for 10^9 hours, the number of failures observed should not exceed 10 [6]. Accordingly, to statistically show that the requirement is met, more than one test should be run. However, the fact that such extensive test suites cannot be physically conducted in public traffic for safety reasons poses serious obstacles to the automotive industry.
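For concreteness, the arithmetic behind the figures quoted above can be written out as follows (our reconstruction of the numbers in references [3] and [6]):

\[
\underbrace{10^{6}}_{\text{vehicles}} \times \underbrace{1\,\text{h/day}}_{\text{usage}} \times \underbrace{10^{3}}_{\text{days}} = 10^{9}\ \text{operating hours},
\qquad
10\ \text{FIT} = \frac{10\ \text{failures}}{10^{9}\ \text{h}} = 10^{-8}\ \text{failures per hour}.
\]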

B. CHALLENGES IN ASSESSING MACHINE LEARNING
The usage of ML systems in industry is mainly based on so-called Inductive Learning, which relies on training data to create models. Identifying pedestrians using camera images is an example of one of the many use cases of Inductive Learning in AVs. An extensive number of images is fed to a classifier algorithm, which ''learns'' to detect pedestrians and provides a probability of correctness. For assessment purposes, the main requirement is that the chosen images fall within the system requirements. To test how the resulting model performs, some data from the training set can be held out and later used to check the system. The process can be mapped to the V-model if the training data are associated with the left side of the model at the requirements specification level; consequently, on the corresponding right side of the model, requirements can be validated by using the subset of the initial data that was held out to validate the system. There should be no coincidental correlation between the training set and the expected result, to avoid over-fitting. Likewise, the selected validation set should be diverse and should not correlate with the training set, while conforming to the requirements, in order to detect over-fitting. There is no common way to prove, as a safety requirement, that the resulting ML model is not over-fitted [202].
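As a minimal illustration of the held-out validation set described above (a sketch in Python using scikit-learn on synthetic stand-in data; not a procedure prescribed by the cited works):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled camera data (hypothetical feature vectors).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold part of the data out of training; it later plays the role of the
# "right side" of the V-model when checking the learned model.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# A large gap between training and validation accuracy hints at over-fitting.
gap = model.score(X_train, y_train) - model.score(X_val, y_val)
print(f"train/validation accuracy gap: {gap:.3f}")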
One of the limitations of the validation of ML systems is the cost of labelling the data. Labelling is done either by a human or by some other unsupervised learning algorithm, both having their own complications. ML systems are sensitive to change: even a minor change makes it necessary to re-validate the whole system. When a problem in the training set is identified, it also adds the extra work of collecting more data to validate the system. AVs are likely to encounter many extremely rare cases that were not considered in the initial training set. Every time a new case is detected, the system should be updated and re-validated accordingly [3]. In the next sections, generating test data synthetically within a simulation and the issues that come with such approaches are examined.
Another challenge in the validation of ML is that its internal structure is incomprehensible to humans. For instance, it might not be clear why a Convolutional Neural Network (CNN) provided a certain decision or which rules it has learned while making a decision. So-called ''explainable AI'' (XAI) methodologies have been defined to cope with these issues but, in general, making ML easy to understand for humans has not yet been achieved. Being unable to predict the behaviour of ML makes it harder to assess its decision-making process and leaves us with costly brute-force techniques. Even if brute force can be applied given sufficient resources, it only validates the results within the training set and does not show the coverage or accuracy of the training data, nor how the system complies with safety standards. Referring to the example of pedestrian detection, it might be the case that an insufficient number of sample images showing pedestrians in wheelchairs is used, and as a result the algorithm will not classify those cases as pedestrians [201].
Reference [3] also suggests an alternative way to validate systems that rely on Inductive Learning. The monitor-actuator approach, described in detail in the next section, can also be used in the validation of such systems. The objective is to make the high-ASIL monitor a deductive component, while the low-ASIL actuator remains an inductive component. Here, the main validation effort is shifted to a component that does not rely on Inductive Learning, thus making the validation problem easier to solve. As before, the monitor can observe and catch faults in the functionality of the actuator within a safety envelope to achieve fail-safe operation. Consequently, the safety issue becomes an availability issue, as the system becomes unavailable but remains safe.
Reference [77] studies the use case of an object detection system that uses ML algorithms and provides an efficient approach to validate the system. For an image recognition algorithm to function properly in safety-critical automotive applications, a failure rate as small as 10^-12 should be achieved. Validating a system that needs to have such a small failure rate is costly and may take several months of testing in real life. Emulators can be used to generate synthetic data, rather than collecting real-life data through test drives, in order to speed up testing. The problem is that, when validating a system at the design level, a physical implementation of the system to rely upon does not yet exist. In this case, simulation approaches should be used, which in turn generate high costs if the model of the system is required to be accurate. To reduce the validation cost of such a system, reference [77] proposes a subset sampling approach in which the objective is to estimate the failure rate of an ML algorithm in an AV in terms of different probabilities. Experimenting with the use case of STOP sign detection, the study concludes that the proposed approach is 15.2 times faster than the traditional brute-force Monte Carlo approach without leading to lower accuracy in the result.
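To make the comparison concrete, the brute-force Monte Carlo baseline that reference [77] improves upon can be sketched as follows (a toy stand-in assuming a Boolean test oracle; it also shows why failure rates near 10^-12 are impractical to estimate by plain sampling):

import numpy as np

def monte_carlo_failure_rate(run_test, n_trials, seed=0):
    # Estimate a failure probability by sampling random scenarios;
    # run_test(rng) must return True when the system under test fails.
    rng = np.random.default_rng(seed)
    failures = sum(run_test(rng) for _ in range(n_trials))
    return failures / n_trials

# Hypothetical oracle: a detector failing on one scenario in a million.
toy_oracle = lambda rng: rng.random() < 1e-6

# 10**7 trials observe only ~10 failures; a 1e-12 rate would need ~1e13 trials.
print(monte_carlo_failure_rate(toy_oracle, n_trials=10**7))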

C. VERIFICATION AND VALIDATION METHODOLOGIES
The process of ensuring the safety of AVs is considered to be an interdisciplinary challenge [33], [84]. It is crucial to have a sound engineering process to be followed by contributors with diverse skills and areas of expertise. The V-model is widely used in the automotive industry for the development of safety-critical components [45]. In the V-model, the phases of the development process create a V-shaped diagram, as illustrated in Figure 6: on the left, the design process follows a top-down approach, while on the right the assessment is done in a bottom-up fashion. In practice, the process can be iterative and may not strictly follow the sequence. Despite having a straightforward sequence of phases, the development of ADASs and AVs can encounter issues in different development phases due to the unpredictable nature of their AI components [45].
As already mentioned, reference [3] suggests that one of the strategies to assess risk according to ISO 26262 is to implement a monitor-actuator architecture. In this architecture, the actuator module executes the main functionality and the monitor module carries out the validation process. If there is an issue in the actuator, the monitor is expected to stop the whole function, including itself, which leads to a fail-safe system. If the pairing of the two modules is done correctly, as long as the monitor module provides a high enough ASIL and can identify faults within itself, the actuator can be developed at a low ASIL. The monitor is required to catch faults within its own module in order to avoid the scenario where a fault in the monitor prevents it from detecting faults in the actuator. The main objective here is to simplify the monitor module as much as possible, as it is expected to have a high ASIL and consequently requires stricter assessment. Therefore, the complex features are implemented in the actuator, which has a low ASIL. The fail-safe operation of this pair is a great benefit; however, it has its weaknesses. Pairing two different modules, in which a monitor observes an actuator, aims to prevent the actuator from executing unsafe instructions. The problem with this approach is that if something goes wrong during operation, the actuator will become unavailable, which is not always intended (e.g., shutting down steering while the vehicle is still on the move). Additionally, implementing this architecture requires more than one pair of monitors and actuators to cross-validate. The design should also be diverse, to avoid software-related failures leading to the failure of the whole system. Reference [3] gives Ariane 5 Flight 501 as an example of such a case, in which the same exception led to the failure of both the main and the backup system, as both were unable to handle the exception in a similar way. As much as it is important to have diversity in the components, it is not easy to achieve: if the same requirements are used, it is likely that components designed according to them will share the same defects. Reference [3] argues that an approach guaranteeing fail-safe operation by detecting faults in the system needs to be adopted in safe autonomous systems, and proposes extending the monitor-actuator architecture as the main strategy for this purpose in AVs.
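A minimal sketch of the monitor-actuator pattern discussed above (class names, the safety envelope, and the commands are illustrative assumptions, not taken from reference [3]): a deliberately simple high-ASIL monitor confines a complex low-ASIL actuator to a safety envelope and forces a fail-safe stop when the envelope is violated.

FAIL_SAFE_CMD = {"throttle": 0.0, "brake": 1.0, "steering": 0.0}

class MonitorActuatorPair:
    def __init__(self, actuator, max_steering=0.5, max_throttle=0.8):
        self.actuator = actuator          # complex, low-ASIL (e.g., ML-based) component
        self.max_steering = max_steering  # the safety envelope is kept deliberately
        self.max_throttle = max_throttle  # simple so the high-ASIL monitor is easy to assess
        self.failed_safe = False

    def command(self, observation):
        if self.failed_safe:              # once tripped, stay stopped: the safety
            return FAIL_SAFE_CMD          # issue becomes an availability issue
        cmd = self.actuator(observation)
        if abs(cmd["steering"]) > self.max_steering or cmd["throttle"] > self.max_throttle:
            self.failed_safe = True       # envelope violated: stop the whole function
            return FAIL_SAFE_CMD
        return cmd

# Usage with a hypothetical actuator issuing an out-of-envelope steering command:
pair = MonitorActuatorPair(lambda obs: {"throttle": 0.3, "brake": 0.0, "steering": 2.0})
print(pair.command({"lane_offset": 0.1}))  # -> fail-safe command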
Reference [45] proposes a new approach utilizing ML and Deep Neural Networks (DNN) for assessing ADASs and AVs both in lab environments and in the real world. According to reference [45], the primary challenge of using the V-model is the lack of efficient information exchange between different phases and between real-world and lab tests. The methodology proposed in that work aims to categorize and group real-world test data according to functionalities using ML and DNN algorithms. Data can be reused to recreate real-world test cases inside a simulation. Data are also generalized and stored in a common database to be accessible in all phases of the V-model and therefore to be suitable for different original equipment manufacturers (OEMs). The paper also illustrates the use of an AI-core (an ML system) for test case and scenario generation for both laboratory and real-world environments. Reference [45] suggests that the AI-core can be used to validate high levels of safe autonomy with minimal human interference.
A novel approach from reference [42] suggests an iterative process to solve the AV validation challenges and achieve SAE Level 5 autonomy. The proposed methodology aims to limit AV decisions to those which are validated and can be guaranteed to be safe, i.e., the vehicle can execute actions only inside a validated safe space. The paper demonstrates a control architecture that can be applied to vehicles at any level of autonomy, starting from Level 1. Reference [42] suggests that an AV at Level 2, which requires full situational awareness from the human driver, can initially be released and, using the proposed approach, assess itself while operating. The deployment starts at Level 2 because the validated space is initially not complete. However, over time, as the vehicle learns, it can increase the coverage of the validation space by assessing itself in an iterative manner, with the goal of achieving Level 5, i.e., full autonomy.
Using feedback from many subsystems to cross-check the output is already common for hardware components and sensors, such as radar and vision systems. This reduces the chance of failures if the components are implemented independently. The same approach can be used to validate the decisions of AI algorithms. However, assessing such a combination of systems might need billions of tests, as the failure rates will be extremely low. Each component can be isolated and validated separately; however, the system should also be assessed to show that failures happen independently among the components [3].

IV. SPECIFIC SLR RESULTS: TECHNIQUES FOR THE SAFETY ASSESSMENT OF AUTONOMOUS CARS
In this section we address more specific techniques for the safety assessment of autonomous cars, following the main categories and related groups of papers as reported in Table 2, which will be discussed in dedicated subsections.
With the emergence of AVs, there is an increasing necessity for more effective approaches to verifying software technologies and algorithms. There are two main categories of approaches to ensure safety by checking whether the software behaves as intended. The first is simulation-based and commonly known as testing, where the real system or its model is run in a virtual environment and the output of simulation runs is used to demonstrate that the system functions as expected. However, testing cannot prove completeness: if testing does not find an issue in the system, it only means that the system operated correctly for the test cases considered; it does not imply that no issue exists when other scenarios are considered. The second approach, based on formal methods, can be used to achieve both completeness and soundness. Formal verification relies on mathematical models to prove or disprove specific specifications and properties. What separates it from testing is that formal verification is able to find an erroneous state in the system whenever one exists, and to demonstrate that it is indeed an issue in the system model [18].
Formal verification methods can be especially valuable to verify safety-critical Machine Learning (ML) systems, which are usually impractical to test with conventional testing approaches. There is ongoing research on the use of formal methods for such systems. The studies focusing on this topic can be divided into two groups: verification at the component level and at the system level. In formal verification, the state space can increase immensely as the number of parameters and the possible values they may take grow. This makes the verification of systems relying on Neural Networks (NN) especially difficult, as such architectures can have millions of parameters. Satisfiability Modulo Theories (SMT) and Integer Linear Programming (ILP) solvers offer sound solutions to the verification problem of such systems. Regarding SMT, current research consists of applying formal verification to input and output specifications of nonlinear NN architectures built from a piecewise linear function called the Rectified Linear Unit (ReLU). Given inputs, this approach can provide guarantees on the set of outputs of the Deep NN (DNN) algorithm. However, it is challenging to apply the approach to NNs, as one needs to take into consideration all the activation phases of the more than a thousand ReLUs that appear in such architectures. The ReLU function can be relaxed to reduce the number of phases by excluding those that do not break the specifications. Using semi-formal verification together with testing is also a viable approach for dealing with ML systems. The goal of these methods is to find corner cases in ML algorithms which may lead the system to an unsafe state. There are also proposals of various scalable algorithms, such as random sampling, generative adversarial networks, and node coverage, that can generate different inputs to find such cases [18].
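As a minimal sketch of the SMT-based idea, a toy single-ReLU ''network'' can be checked with the Z3 solver (real approaches handle thousands of ReLUs and far richer specifications):

from z3 import If, Real, Solver, sat

x = Real("x")
y = If(2 * x - 1 > 0, 2 * x - 1, 0)  # y = ReLU(2x - 1), a one-neuron "network"

s = Solver()
s.add(x >= 0, x <= 1)   # input specification
s.add(y > 1.5)          # negation of the desired output property y <= 1.5

if s.check() == sat:
    print("counterexample:", s.model()[x])
else:
    print("property y <= 1.5 holds for all x in [0, 1]")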

A. SIMULATION ENVIRONMENTS AND TEST SCENARIOS
In order to ensure the safety of autonomous cars, they need to be tested in different conditions and environments. Doing these tests in real life has several complications in terms of public safety, cost, and time. Additionally, it is highly unlikely to reproduce exactly the same scenario every time, hence reproducibility is also an issue. Approaches based on simulation environments are a viable solution to test autonomous cars in environments resembling the real world, where any scenario can be synthetically generated and repeated [46].
Reference [46] focuses on how different weather conditions, specifically rain that can obstruct the camera's perception, affect the accuracy of decisions in AVs. The approach proposed in the study utilizes the Continuous Nearest Neighbor search together with an R-tree data structure. Realistic 3D vision with raindrops is created using stereo images for the simulation. The approach can be applied to already existing images, so that the training and test images for the algorithms do not need to be recaptured in rainy weather but can be modified to reflect such conditions. For the experiments in the study, the Recurrent Rolling Convolution (RRC) object detection network was used with the KITTI data set from the Karlsruhe Institute of Technology. Testing on 450 images, the results show that when introducing raindrops to the scene, the measured average precision of the algorithm drops by 1.18% and the overall average accuracy drops by 0.37%. Reference [46] also demonstrates that the detection accuracy for smaller objects is more sensitive to obstruction by raindrops (the measured average precision for pedestrians drops by 4%).
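The stereo-based raindrop rendering of reference [46] is beyond a short listing, but the general idea of perturbing existing test images instead of recapturing them can be sketched with a toy occlusion model (entirely illustrative):

import numpy as np

def add_raindrops(image, n_drops=30, radius=3, seed=0):
    # Toy stand-in for raindrop rendering: darken random small disks to
    # mimic drops occluding the camera lens (uint8 image assumed).
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    for _ in range(n_drops):
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        mask = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
        out[mask] = out[mask] // 2
    return out

Comparing the detector's precision on the perturbed images against the originals then quantifies its sensitivity to such weather-induced obstructions.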
According to reference [66], one of the challenges in fully utilizing simulation in development is the lack of a test criterion to abstract the real-world environment and automatically generate test scenarios reflecting road conditions. Reference [66] investigates research directions to tackle these challenges. The study focuses on defining test criteria on environmental factors, roadway designs, and dynamic object behaviors, and then how test scenarios can be generated according to these criteria.
A prototype called AsFault to automatically create synthetic test scenarios in simulation in a systematic way is proposed in references [29], [30]. The tool uses a search-based procedural content generation technique and is demonstrated through tests performed on the lane-keeping functionality of an autonomous car program. The objective of the tool, in this case, is to create conditions that lead the car out of the roadway. The experiments done with the tool show that it is able to efficiently generate cases where the autonomous car may violate the requirements. The study explains the process of virtual road generation in detail and concludes that future work is needed to generate more physically realistic roads by using real-world roads and terrains as input. The authors also note that currently only one lane-keeping system has been evaluated, due to the lack of availability of such systems, and that more tests with various execution conditions need to be performed to further confirm the validity of the tool.
Reference [75] proposes a framework called MOBATSim that integrates fault injection and simulation. The framework is developed in MATLAB Simulink and provides ways to model different real-world scenarios in simulation. The advantage of the approach over other simulation-based approaches is that it supports injecting faults at run-time to verify whether safety requirements are violated in the simulation. The framework is tested on two sensors as a case study: the front sensor (used for platooning) and the speed sensor. First, a simulation is run without injecting any faults to create a reference point. Then, measurement noise is introduced for the front sensor, and stuck-at faults for the speed sensor value. In the case study, 400 simulations are run with faults, increasing the level of faults for each sensor to measure the vehicle's fault tolerance, and a detailed comparison of the results is presented. The authors aim to extend the framework into a full safety evaluation tool in accordance with the ISO 26262 standard.
Reference [40] presents a Domain-Specific Language (DSL) called GeoScenario to reduce the cost of creating test scenarios for simulation. According to the study, the main motivation behind designing a DSL for this purpose is that the current simulation tools require engineers to learn different tools and then program everything by themselves. Reference [40] argues that having a DSL that is not dependent on any tool makes it much easier to transform existing scenarios to formats understood by different simulation tools. The language relies on the Open Street Map standard and can be extended based on requirements. GeoScenario is then used in an autonomous car project to show its usage in practice and the authors hope that using a common language will increase the usability of test scenarios designed by different researchers. A similar work carried out in reference [51] focuses on the use of ontologies to achieve the same goal of simplifying and standardizing test case generation for simulation of autonomous cars.

B. TEST CASE DEFINITION AND GENERATION
One of the major challenges in the testing of AVs is that the system behavior is strongly dependent on its environment: as all possible cases that might happen in the real world cannot be predicted, the system will constantly learn and adapt during its lifetime. Therefore, it is essential to have a runtime testing approach to test the system continuously [25]. Reference [25] describes the theory and application of a framework to achieve real-time testing in autonomous cars. The study especially focuses on the Internet of Things (IoT) and mentions the importance of real-time testing and its challenges within IoT and autonomous systems. During operation, autonomous cars may have a very small amount of time to do testing when experiencing a new situation. The authors demonstrate how Combinatory Logic can be used to generate new test scenarios for the extended system from test scenarios designed for the initial system.
Another approach dealing with performance and variety is proposed in reference [81]. Initially, training inputs representing different traffic situations are defined. The objective is then to define test cases by considering the actions of other objects in traffic. Abstract definitions are used to classify these actions and scenarios. This way, test cases can be compared in order to eliminate identical cases from the test suite. However, the approach requires manual effort when identifying relevant scenarios; thus, it needs to be fully automated using ML to become practical. The authors note that the effectiveness of the approach needs to be evaluated after its automation.
As previously mentioned, the number of possible scenarios in the real world is virtually unlimited, thus it is not possible to test every possible case. To cope with this problem and reduce the testing space, reference [17] proposes an approach to determine the circumstances that might push the system outside its safety limits. Reference [17] argues that although Hazard Based Testing (HBT) approaches are able to discover test cases that make the system fail, they cannot show the specific parameters that cause the failure. In the proposed approach, the combinations of test-case parameters that cause the failure of the system are identified with Bayesian Optimization. The approach is then implemented in a simulation model to demonstrate how finding multiple combinations of parameters helps to discover critical testing conditions more easily than random testing methods.
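A minimal sketch of this idea, using scikit-optimize's gp_minimize and a toy closed-form stand-in for the simulator (the scenario parameters and the time-to-collision surrogate are purely illustrative, not those of reference [17]):

from skopt import gp_minimize

def time_to_collision(params):
    # Toy surrogate for a simulation run: braking distance grows with speed
    # and shrinks with road friction; TTC shrinks as braking nears the gap.
    speed, gap, friction = params
    braking = speed ** 2 / (2 * 9.81 * friction)
    return max(gap - braking, 0.0) / speed

# Bayesian Optimization searches for parameter combinations minimizing TTC,
# i.e., the most safety-critical scenarios, using far fewer simulation runs
# than random testing.
result = gp_minimize(
    time_to_collision,
    dimensions=[(5.0, 40.0),   # ego speed (m/s)
                (2.0, 50.0),   # initial gap to the obstacle (m)
                (0.2, 1.0)],   # road friction coefficient
    n_calls=40,
    random_state=0,
)
print("critical scenario:", result.x, "TTC:", result.fun)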
Simulation, especially 3D simulation with realistic details, requires extensive computation power and is therefore costly. In order to reduce the testing space, and consequently the number of tests to be run, search-based testing algorithms can be employed [67]. As in the previous approaches, the objective is to find only those scenarios that lead to safety violations. Reference [67] compares two search algorithms, namely the genetic algorithm and simulated annealing, to evaluate their effectiveness in the domain of testing autonomous cars. Both algorithms are run in a simulation framework, and the time-to-collision is measured and used as a termination condition. The results of the study show that, using the genetic algorithm, safety-critical cases can be generated with a lower number of tests than with simulated annealing. Furthermore, it had previously been shown in reference [36] that the genetic algorithm is also more effective than random selection in generating such cases.
Reference [19] proposes an approach to generate test cases for features that are independent of each other in terms of functionality yet interact with one another. In order to identify feature interactions that lead to violation of safety requirements, a search-based algorithm, named FITEST, is developed. A case study and experiment settings are provided together with the results. The results from two systems and the discussion with the system developers show that failures found using FITEST were not identified by the testers before and were indeed related to feature interactions [19].
Reference [23] proposes an approach that uses a DNN's sentiments as a way to prioritize inputs that can lead to erroneous behavior. Three sentiments are considered in the study: confidence, uncertainty, and surprise. These measures are evaluated to demonstrate their efficiency in identifying cases that can lead to failures. The results from the experiments on MNIST models show that the approach can identify safety-critical cases with an accuracy varying from 88% to 94.8%, and that the sentiments can indeed be used as a way to prioritize inputs. Reference [23] notes that the method is not only suitable to reduce testing cost during development, but can also be implemented to predict safety-critical cases and activate corresponding protection mechanisms in real time.
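A minimal sketch of such prioritization, using predictive entropy as a simple uncertainty measure (the actual metrics and models of reference [23] differ):

import numpy as np

def prioritize_by_uncertainty(probs):
    # probs: softmax outputs of shape (n_inputs, n_classes). High-entropy
    # (uncertain) inputs are ranked first, as they are more likely to expose
    # erroneous behavior than confidently handled ones.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(-entropy)  # input indices, most uncertain first

# Usage with hypothetical predictions for four inputs:
probs = np.array([[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]])
print(prioritize_by_uncertainty(probs))  # -> [3 1 2 0]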

C. CORNER CASES AND ADVERSARIAL EXAMPLES
Deep Learning (DL) systems refer to systems that comprise one or more DNN elements or consist entirely of DNNs. Lately, the accuracy of tasks that utilize DL, such as object detection and speech recognition, has improved significantly, to a level where DL can even outperform humans in some cases. This progress has contributed to the use of DL in safety-critical components of autonomous cars on a larger scale [8].
However, despite those improvements, the behavior and decisions of DL algorithms are sometimes unpredictable, or even wrong. Such corner cases might exist due to issues in the model or in the training data, for instance, overfitting or underfitting of the system model, or bias or lack of diversity in the training data [8]. Additionally, DL systems can be vulnerable to adversarial attacks that make minimal changes to the input in order to cause, for instance, an image classification algorithm to wrongly classify a traffic sign, leading to a wrong decision that might result in an accident. Such perturbations of images may be invisible to humans and need to be discovered during testing [38]. Sometimes such corner cases happen without someone intentionally changing the input, but due to weather conditions or differences in lighting degrading the perception of the camera [32]. Therefore, when working in a safety-critical domain, it is essential to have a sound methodology to test for corner cases in order to avoid accidents that may result in great damage [8].
As noted in references [8], [38], currently the common ways of testing DL components comprise collecting a vast amount of real-world data and labeling it manually to create a test set, or generating data synthetically using simulation. In addition to being expensive [38], these common practices do not take into account the inner mechanisms of DL algorithms. Considering the number of possible scenarios in the real-world, such testing methods might not cover the whole input space and will certainly miss the majority of corner cases. Therefore, even if all tests pass successfully, the result cannot be used to prove that there is no issue in the system, as previously mentioned in reference [18].
To cope with the aforementioned issues, reference [8] proposed the design, implementation, and evaluation of the first white-box testing framework for real-world DL systems, named DeepXplore. The framework aims to achieve systematic testing of real-world DL applications. The approach introduces neuron coverage, a metric that measures the proportion of neurons activated in the DNN, in order to increase the effectiveness of tests. The neuron coverage concept can be compared to conventional code coverage: code coverage measures the parts of the code invoked by the input, while neuron coverage measures the activated neurons. Although the implementation of DL itself is software, the model performing the main task is not developed by a programmer but is a result of the training data. Reference [8] notes that it is rather easy to reach 100% code coverage while the neuron coverage does not exceed 10%. Later in the paper, several DL components with matching features are used together to cross-check the decisions, which eliminates the need to check the results manually. Reference [8] also demonstrates a methodology to find corner case inputs that might activate incorrect behaviors, and to increase neuron coverage by using gradient-based search approaches. According to reference [8], DeepXplore can effectively discover corner case inputs in the most advanced and complex DL systems (fifteen of them are tested in the paper). The average performance of DeepXplore on a regular notebook shows that it is capable of finding one corner case per DL model within a second. It is also shown that the generated corner cases can be reused as a training set to retrain the DL system and increase its accuracy by about 3% [8].
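A minimal sketch of the neuron coverage metric, under our simplified reading of the definition above (a neuron counts as covered when some test input activates it above a threshold; DeepXplore's implementation details differ):

import numpy as np

def neuron_coverage(layer_activations, threshold=0.0):
    # layer_activations: one array per layer, shaped (n_inputs, n_neurons).
    covered = sum(int(np.sum((layer > threshold).any(axis=0)))
                  for layer in layer_activations)
    total = sum(layer.shape[1] for layer in layer_activations)
    return covered / total

# Usage with hypothetical activations of a two-layer network on three inputs:
acts = [np.array([[0.2, 0.0], [0.0, 0.0], [0.9, 0.0]]),                  # 1 of 2 covered
        np.array([[0.0, 0.4, 0.0], [0.0, 0.1, 0.7], [0.0, 0.0, 0.0]])]  # 2 of 3 covered
print(neuron_coverage(acts))  # -> 0.6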
Despite the promising results of the DeepXplore framework, there are two major threats to its validity. One is pointed out in reference [38]: cross-referencing multiple models is not a reliable approach, as attackers are currently able to generalize an adversarial example across these models to trick the system. The other criticism towards the framework, and the usage of neuron coverage in general, has emerged from experiments in recent studies: reference [58] notes that even achieving 100% neuron coverage is not enough to verify the safety of DNN applications, as maximum neuron coverage is proved to be reachable by merely using specific input vectors from the training data.
Following DeepXplore [8], another testing framework called DLFuzz, which also utilizes the differential testing technique, is proposed in reference [31] to find corner cases in DL systems. DLFuzz iteratively mutates the input to increase neuron coverage and tries to maximize the difference between the results (predictions) per input in order to find rare inputs. After finding inputs that lead to an increase in neuron coverage, the process continues to find an invisible perturbation by mutating these inputs. The advantage of DLFuzz over DeepXplore is that DLFuzz does not require manually labeling the test set or cross-checking the results with other DL systems. As a result, DLFuzz outperforms DeepXplore's white-box testing by finding 338.59% more adversarial examples with 89.82% smaller perturbations. DLFuzz also increases neuron coverage by 2.86% more while taking 20.11% less time. However, reference [31] observed that in tests with the ResNet50 neural network, which has a much higher number of neurons, DeepXplore was faster, as DLFuzz had to spend more time selecting neurons. Nevertheless, the adversarial examples found by DLFuzz contain perturbations too small to notice with the human eye, while those generated by DeepXplore can be clearly visible.
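The core loop shared by these coverage-guided fuzzers can be sketched as follows (illustrative Python; none of the names correspond to DLFuzz's actual interfaces):

def fuzz_step(predict, coverage, mutate, seed_input, n_mutants=50):
    # One fuzzing step: apply small, semantics-preserving mutations to a seed
    # and keep mutants that either change the prediction (a potential
    # adversarial example) or increase coverage.
    base_cov = coverage([seed_input])
    base_label = predict(seed_input)
    kept = []
    for _ in range(n_mutants):
        mutant = mutate(seed_input)
        if predict(mutant) != base_label:
            kept.append(("adversarial", mutant))
        elif coverage([seed_input, mutant]) > base_cov:
            kept.append(("new-coverage", mutant))
    return kept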
Another differential testing framework for DNN systems, called DeepHunter, is proposed in reference [14]. Similar to DLFuzz, it aims to preserve the semantics of the input while mutating it into an adversarial example. Additionally, the paper proposes a seed selection approach that couples diversity and recentness. Diversity in seed selection is achieved by decreasing the probability of selecting tests that have been fuzzed before, while recent tests are given more priority to balance the recentness property. Reference [14] demonstrates that, while finding defects in the DNN system, DeepHunter is capable of preserving the semantics of the initial tests with a validity rate of 98%. The authors also conclude that seed selection giving more priority to diversity has more influence on increasing neuron coverage than recentness-based selection. It is also illustrated that DeepHunter surpasses the DeepTest and TensorFuzz approaches in coverage and in the number and variety of adversarial examples found. A recent work in reference [55] presents a new framework called DeepSmartFuzzer and argues that it outperforms both DeepHunter and TensorFuzz. The experiments in the study use DeepXplore's neuron coverage as well as four other coverage criteria to compare the frameworks. Source code is also provided to enable researchers to reproduce the experimental results.
Reference [38] proposes a different approach to solve the oracle problem (the need for humans to manually label inputs and check results, which limits the scalability of testing) without relying on cross-referencing of results. The approach utilizes a metamorphic testing methodology, in which the correctness of the test is assessed by the change in the output induced by a controlled change in the input. This relation between input and output is referred to as a metamorphic relation. Using metamorphic relations to solve the oracle problem also eliminates the need to run ML algorithms, which are expensive to execute, to find adversarial images. Reference [38] argues that using metamorphic relations is the most efficient way of finding adversarial inputs and can achieve significant results, with an accuracy of about 90% on average and 96.85% in the best case for image classification.
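A minimal sketch of a metamorphic test for an image classifier is given below, assuming a function classify(img) that returns a label; the metamorphic relation used here (label invariance under a mild brightness shift) is one illustrative example of the kind of relation discussed above, and the stand-in classifier is an assumption.

import numpy as np

def classify(img):
    # Stand-in classifier: separates "bright" from "dark" images.
    return "bright" if img.mean() > 0.5 else "dark"

def metamorphic_test(img, shift=0.05):
    follow_up = np.clip(img + shift, 0.0, 1.0)
    # Relation: a small brightness change must not change the label.
    return classify(img) == classify(follow_up)

img = np.full((8, 8), 0.49)
print(metamorphic_test(img))  # False: 0.49 -> 0.54 crosses the decision
                              # boundary, exposing a fragile classification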
In reference [32], an approach that relies on Satisfiability Modulo Theories (SMT) to verify neural network systems is presented. Although the study concentrates on image classification, reference [32] argues that the approach can be used in different applications that utilize neural networks. The approach aims to ensure the safety of the classification algorithm by finding manipulations that would leave the image in the same category for a human observer while leading the algorithm to a different decision. Examples of such perturbations include rain, fog, and different lighting angles or conditions. The results of the framework are promising, as it can find adversarial inputs in a few seconds. However, the complexity of verifying the algorithm increases exponentially with the number of features. According to reference [32], the performance issues of the framework can be addressed using parallel computation.
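The following toy sketch conveys the flavor of SMT-based robustness checking using the Z3 solver (Python bindings), applied to a single ReLU neuron rather than a full network; the tiny ''network'', the decision threshold, and the perturbation radius are assumptions for illustration only.

from z3 import Real, Solver, If, sat

x0, x1 = Real("x0"), Real("x1")
# Tiny "network": y = relu(0.6*x0 - 0.8*x1); classified positive if y > 0.3.
y = If(0.6 * x0 - 0.8 * x1 > 0, 0.6 * x0 - 0.8 * x1, 0)

s = Solver()
# Original input (0.9, 0.1) is classified positive; search for a perturbation
# within radius 0.15 (clipped to [0, 1]) that flips the decision.
s.add(x0 >= 0.75, x0 <= 1.0, x1 >= 0.0, x1 <= 0.25)
s.add(y <= 0.3)  # misclassification condition
if s.check() == sat:
    print("adversarial point:", s.model())
else:
    print("robust within the given radius")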
Many algorithms used to verify DNNs require knowledge about the inner workings of the system. Reference [47] proposes a black-box method to verify image classification algorithms. Key features are extracted from the images using an object detection algorithm, and the pixels in those parts are prioritized according to their contribution to the overall visual meaning of the image for the human eye. The process is implemented as a two-player game in which each player takes turns and aims to achieve a specific goal. The first player tries to decrease the distance between the original input and an adversarial one by manipulating the image, while the other player either helps the first achieve the objective or opposes it. Reference [47] argues that, in theory, a game-based approach in this fashion can eventually find the smallest perturbation possible to achieve an adversarial image. The authors also show cases for Lipschitz networks in which safety guarantees can be made about the absence of adversarial images. According to the paper, utilizing a Monte Carlo search, the black-box approach can perform as well as the most advanced and complex white-box approaches.
Reference [48] extends the work in reference [47] by developing a tool called DeepGame that explores two problems concerning pointwise robustness, namely the maximum safety radius and feature robustness. In the first problem, the objective is to compute the minimal distance (in terms of pixels for an image) between the initial input and an adversarial one, and hence a safety distance from the original input within which no adversarial examples are present. In the second problem, the goal is to find a safety radius within which the occurrence of adversarial examples can be controlled by limiting perturbations to selected features. Both problems are computed using a two-player game, in which the first player picks features and the second player applies manipulations to the input in order to decrease the distance to an adversarial example, iteratively changing the input. As in reference [47], the second player can either cooperate with the first player or oppose it. When the players cooperate, the reward of the first player is the maximum absolute safety distance; when they oppose each other, the reward of the first player is measured by feature robustness.
As mentioned previously, adversarial examples do not only include intentional manipulations but also situations that frequently occur in real life, such as extreme weather conditions and infrastructure issues on roads. While these might not significantly affect a human driver's perception, they can degrade the performance of DNN image classification systems considerably. Reference [49] proposes a framework called DeepRoad that uses unsupervised learning techniques to synthetically generate real-world scenarios for the camera. DeepRoad is able to add different weather filters to the image, such as rain, fog, and snow, and can also create fake holes in the roads. As in reference [38], DeepRoad also utilizes the metamorphic testing approach. The experiments with the framework are run on the Udacity data set and show its ability to find thousands of inconsistencies. However, the authors note that the rather small size of this data set can be one of the threats to the validity of the framework's effectiveness. Additionally, the capabilities of the Generative Adversarial Network used for producing synthetic images should also be kept in mind, as it sometimes does not preserve the semantics of some objects in the scene [49].
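A very simple perturbation of this kind can be sketched as follows, assuming images are float arrays in [0, 1]; DeepRoad itself relies on GAN-based scene transformation, which this naive fog blend only mimics, and the steer() check mentioned in the comment is hypothetical.

import numpy as np

def add_fog(img, density=0.4):
    fog = np.ones_like(img)                 # uniform white haze
    return (1 - density) * img + density * fog

road = np.random.default_rng(1).random((64, 64, 3))
foggy = add_fog(road)
# Metamorphic check: a steering model should behave consistently, e.g.
# abs(steer(road) - steer(foggy)) below a tolerance (steer() is hypothetical).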
DeepRoad is later used as a baseline approach for input validation in reference [89]. The authors of this study present a new framework called SelfOracle. Their experiments show that SelfOracle is more than twice as effective as DeepRoad at discovering misbehaviors, both in terms of false positive and true positive rates. They also argue that SelfOracle has a significantly lower computational cost than DeepRoad and provide the source code of the experiments for reproducibility.
Another approach that enables automatic detection of erroneous behaviors in DNNs without manual labeling effort is proposed in reference [41]. The study focuses on object detection systems and builds on the hypothesis that two inputs falling in the same category should yield consistent outputs, using the difference between outputs to identify false negatives. Two specific cues are used in the proposed system: temporal and stereo. CNN-based object detection algorithms sometimes fail to identify objects; however, region-based trackers are able to trace objects across continuous frames. Using previous frames, objects missing in subsequent frames can be detected, which enables engineers to find temporal inconsistencies in the algorithm. Stereo inconsistencies refer to different outputs from the algorithm for the objects in two similar images. Using a mapping between the images, the location of missing objects can be found. Since the approach can be applied to unlabeled data, and the object detectors required to implement the framework are already part of autonomous cars, the framework can add an extra layer of verification for finding false negatives at minimal cost. Although the proposed framework was able to identify the mistakes of the most advanced object detectors in the experiments, the authors note that if no object is detected in the scene, or the scene is too crowded for the object detector, the system will not be able to identify the errors. Thus, the framework is currently limited by the capabilities of the object detection algorithm it uses.
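The temporal cue can be sketched as follows, assuming per-frame detections are available as sets of tracked object identifiers; the matching logic of the actual system is considerably more involved.

def temporal_inconsistencies(frames):
    """frames: list of sets of tracked object IDs detected in each frame."""
    missed = []
    for t in range(1, len(frames) - 1):
        # An object seen before and after frame t but not in it is a
        # likely false negative of the detector at frame t.
        ghosts = (frames[t - 1] & frames[t + 1]) - frames[t]
        missed.extend((t, obj) for obj in ghosts)
    return missed

print(temporal_inconsistencies([{1, 2}, {1}, {1, 2}]))  # [(1, 2)]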

D. FAULT INJECTION
AVs combine software and hardware components, which makes verifying their safety more challenging. Since events that can lead to a system failure may happen randomly, it is hard not only to predict them but also to (re)produce these scenarios [15]. To meet the 10 FIT requirement of ISO 26262, not only permanent faults but also transient faults, due to, e.g., electromagnetic interference and cosmic radiation, should be detected [6]. There are many in-depth approaches, such as the one in reference [8], to detect permanent faults, while approaches dealing with transient faults are rather immature in the AV domain and need to operate in real time [6]. In order to test, and possibly increase, the coverage of fault tolerance, Fault Injection (FI) can be used. FI inserts faults in the places where fault/error handling happens [15]. Within safety assessment, FI can be applied at several abstraction levels, even to validate safety arguments. Irregular inputs are given to the system to find its weaknesses, i.e., cases where the system behaves unexpectedly. Fault injection can be applied at each phase of the V-model in ISO 26262.
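As a minimal illustration of a transient-fault model, the sketch below injects a single bit flip into a 32-bit float (e.g., a sensor reading or a network weight); real FI tools operate at many more abstraction levels, and the function is an assumption for illustration.

import struct, random

def flip_bit(value, bit=None):
    # Reinterpret the float as a 32-bit integer, flip one bit, convert back.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bit = random.randrange(32) if bit is None else bit
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

weight = 0.75
print(flip_bit(weight, bit=30))  # flipping a high exponent bit: 0.75 -> huge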
Reference [73] proposes a methodology utilizing FI at the software level according to ISO 26262. The main objective of the approach is to validate system performance not with synthetically generated data but with real-world samples. The study concentrates on inserting time-based, sensor-data-based, and signal-data-based faults. The approach is then applied to an object detection algorithm utilizing a camera sensor to see how system reliability changes when faulty inputs are introduced. In the specific use case, ''salt and pepper'' noise is added to evaluate its effects on system performance.
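A brief sketch of such ''salt and pepper'' injection is shown below, assuming grayscale camera frames as float arrays in [0, 1]; the corruption rate is an illustrative parameter, not the one used in [73].

import numpy as np

def salt_and_pepper(img, rate=0.02, rng=np.random.default_rng(0)):
    noisy = img.copy()
    mask = rng.random(img.shape) < rate
    # Corrupted pixels saturate randomly to black (0) or white (1).
    noisy[mask] = rng.integers(0, 2, size=int(mask.sum()))
    return noisy

frame = np.full((4, 4), 0.5)
print(salt_and_pepper(frame))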
The DriveFI framework is able to modify the state of software and hardware components to demonstrate the effects of faults in a simulation environment, and uses ML-based Bayesian FI to find faults with a high chance of violating safety requirements [15]. The study uses three models to demonstrate the efficiency of the proposed approach. Reference [15] shows that random FI over 98,400 faults would take 615 days to complete, while the proposed method took only 4 hours. Additionally, running in real time, DriveFI could detect 561 faults posing a safety risk, while the random method could not find any in several weeks.
Another FI framework, which relies on Systems-Theoretic Process Analysis, is introduced in reference [74]. The study demonstrates the framework using the Openpilot agent in various weather conditions, with faults occurring at the sensors of the system. Hazard analysis is performed to produce test cases that might violate safety requirements, in order to improve the coverage of testing. The experimental results show that the proposed approach (with 35.78% hazard coverage) is more effective than random fault injection (with 28.97% hazard coverage). The authors provide detailed experimental data; however, they suggest that more experimentation is needed before drawing conclusions from the results.
TensorFI, a TensorFlow-based framework for measuring the fault resilience of ML systems, is proposed in reference [37]. Designed with flexibility and portability in mind, the framework helps to discover the correlation between various parameters or algorithms and the error resilience of ML systems. The experiments in the study demonstrate that the resilience of the system may vary significantly depending on the algorithms used and the inputs. Reference [37] introduces a metric called prediction accuracy drop, which shows the difference between the original prediction accuracy of the trained model and the accuracy when TensorFI is used; it also demonstrates the relation between the accuracy drop and the number of classes in the output data. The authors suggest that exploring such correlations can help with the design of ML systems requiring higher resilience.
Reference [6] extends the work in reference [37] with an effective framework called BinFI that finds safety-critical bits in ML systems built on TensorFlow. Reference [6] argues that most common ML algorithms consist of monotonic computations; hence, the error propagation of the application can be approximated by a monotonic function. The approach relies on binary search to efficiently find the faults that lead to safety violations. The framework is compared to approaches using random FI and shows significantly better results in terms of performance and cost, with 99.56% of the faults detected with 99.63% accuracy. The performance gain is five times that of random FI; however, the authors note that although the framework does not currently cover 100% of safety-critical faults, its accuracy is within a tolerable 0.5% range. Reference [37] argues that it is a fair trade-off for FI approaches to give up a small margin of accuracy in exchange for efficiency advantages.
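The core binary-search idea can be conveyed by the following conceptual sketch, assuming a monotone predicate causes_violation(fault_value) (larger injected deviations are at least as harmful), so that the critical threshold can be located with a logarithmic number of injections; the names and the toy fault model are assumptions, not BinFI's implementation.

def find_critical_threshold(causes_violation, lo=0.0, hi=1.0, tol=1e-3):
    # Invariant: lo is safe, hi causes a safety violation.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if causes_violation(mid):
            hi = mid
        else:
            lo = mid
    return hi  # smallest fault magnitude (within tol) that is critical

# Toy system: a braking command tolerates sensor deviations below 0.37.
print(round(find_critical_threshold(lambda d: d >= 0.37), 3))  # ~0.37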
Reference [6] argues that BinFI has an advantage over DeepXplore, proposed in reference [8], in that it covers significantly more errors and additionally addresses transient faults, while DeepXplore only considers systematic failures.

E. MUTATION TESTING
According to reference [4], while FI is an efficient way to detect faults in the system that violate safety goals, mutation testing is required to statistically demonstrate that the system conforms to ISO 26262 safety requirements; FI can still be useful to identify test cases leading to system failures, thus reducing the testing space for mutation testing. Mutation testing is an approach used to evaluate the adequacy of test cases. The technique involves creating faulty duplicates of the original system, named mutants, by injecting reproducible faults. It gives engineers a metric corresponding to the ratio of identified versus missed mutants, in order to evaluate the adequacy of test suites [4]. A mutant is identified if the output of the mutated software differs from that of the original one [58]. Reference [4] states that mutation testing is an efficient method to mimic real faults. Mutation testing relies on two basic assumptions. The first is the Competent Programmer Hypothesis, the assumption that programmers develop software that is close to its correct form. This assumption translates to the mutation testing context as follows: mutants are close to real faults, and if the test suite is able to identify the mutants, it will also be able to identify the real faults. The second is the Coupling Effect hypothesis, which states that large erroneous behaviors of the software are tightly coupled with small bugs. This assumption suggests that simple mutants are closely related to the real bugs in the software, and if it is possible to detect the mutants, it is also possible to detect critical bugs [4].
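A minimal sketch of mutation analysis is given below, assuming mutants are callables that the test suite should distinguish from the original function; the mutation operators are hand-written for illustration.

def original(a, b):
    return a + b

mutants = [lambda a, b: a - b,      # operator replacement
           lambda a, b: a + b + 1,  # constant perturbation
           lambda a, b: a + b]      # an equivalent mutant: undetectable

tests = [(1, 2), (0, 0), (-1, 1)]

def killed(mutant):
    # A mutant is identified (killed) if any test distinguishes it.
    return any(mutant(a, b) != original(a, b) for a, b in tests)

score = sum(killed(m) for m in mutants) / len(mutants)
print(f"mutation score: {score:.2f}")  # 0.67: the equivalent mutant survives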
Reference [58] focuses on assessing the effectiveness of current mutation tools for DNNs. During the tests in the study, a threshold (1%) is set to check deviations in the accuracy of the decisions of DNN algorithms. Reference [58] demonstrates that it is possible to detect mutants with a deviation exceeding the threshold, and there was no case in which no mutant was detected. To show the effectiveness of the tests, a criterion called mutation score, the ratio of detected mutants to the remaining mutants, is used. Reference [58] notes that such a criterion for the quality of test suites is important to show that safety-critical systems such as autonomous cars meet the Modified Condition/Decision Coverage criterion recommended in ISO 26262. Experiments in the study show that the mutation score of the tested algorithms varies from 40% to 65%. The results also suggest that mutation operators related to learning behavior seem to perform better than the others.

F. SOFTWARE SAFETY CAGES
A software safety cage is a safety mechanism that monitors the system behavior and performs appropriate actions if a malfunction is detected. Reference [68] suggests the usage of safety cages to reduce the strictness of requirements on NN-related applications in AVs. Since DNNs are usually seen as black boxes during the verification process, it is challenging to guarantee safety goals. However, using safety cages, the behavior of the AV can be limited to a safe set of states. As an example, the decision to accelerate a vehicle can be considered safe only if there is no vehicle in front. Therefore, the environment should be monitored in real time and, depending on the situation, the system's limits should be controlled dynamically so that it always operates inside safety envelopes. The advantage of implementing safety cages is that the way they work is explainable, and conventional verification methods can be applied to them. They also do not require any knowledge about the inner workings of the ML system they are integrated with. The study in reference [68] focuses on safety cages created to avoid forward collisions. Experiments run in simulation show that safety cages can effectively prevent collisions from happening. It is also demonstrated that the safety cages do not interfere with the system controllers if there is no safety-critical situation; thus, they do not reduce the performance of the vehicle under normal circumstances. Additionally, the authors suggest that identified safety cage violations can further be used to train and improve the network.
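A minimal sketch of a forward-collision safety cage is shown below, assuming access to the ego speed, the commanded acceleration from the (unverified) ML controller, and the gap to the lead vehicle; all constants and the headway rule are illustrative assumptions, not the design in [68].

def safety_cage(ml_accel, ego_speed, gap, min_gap=5.0, max_brake=-6.0):
    # If the remaining gap is unsafe for the current speed, override the
    # ML command with a verified safe braking action.
    safe_gap = min_gap + 0.5 * ego_speed  # crude headway rule
    if gap < safe_gap:
        return max_brake   # cage intervenes
    return ml_accel        # normal operation: cage is transparent

print(safety_cage(ml_accel=1.5, ego_speed=20.0, gap=12.0))  # -6.0 (override)
print(safety_cage(ml_accel=1.5, ego_speed=20.0, gap=40.0))  # 1.5 (pass-through)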
Another methodology to verify software systems in an iterative manner within the autonomous systems domain is proposed in reference [7]. The study introduces dependability cages to check, during the development stage, situations where the system can behave outside its defined specifications. In unexpected scenarios, dependability cages are able to interfere with the system configuration and collect information about these situations as feedback for further development. In this approach, similarly to the monitor-actuator architecture, the functions or the whole system are deactivated to reach a safe state when unexpected events occur.

G. TECHNIQUES FOR CYBER-PHYSICAL SYSTEMS
In the context of AVs, Cyber-Physical Systems (CPSs) are defined as systems in which the physical behavior of the vehicle is controlled by computer-based algorithms without human intervention [39]. Although we have defined this separate category due to the number of studies specifically focusing on CPS V&V paradigms that can be transferred to/from multiple domains (e.g., rail transportation, manufacturing, etc.), the issues addressed by those studies largely overlap with the ones already discussed in other categories. In fact, utilizing AI in CPSs creates new challenges in safety assessment due to the unpredictable nature of AI algorithms. Currently, CPS development relies on the assumption that all requirements for the system are complete and well-defined; however, considering the huge number of situations that might happen in the real world, the requirements are unlikely to cover all of them [7]. Therefore, new approaches should be adopted to assess the safety of smart and adaptive CPSs.
The safety of conventional CPSs can be verified using formal methods [39], leveraging the heritage of research on Real-Time and Embedded Systems. The main challenges emerge when AI algorithms are involved. Extending the study in reference [28], reference [39] explores ways of combining formal verification and AI in CPSs. The study proposes an approach that uses differential dynamic logic to design CPS models with mathematical accuracy. The ModelPlex method is then used to guarantee that the verification results from the CPS model remain relevant for the actual implementation of the system. Lastly, the VeriPhy verification pipeline is used to ensure that safety verification results are preserved while the model changes through learning. It should be noted that, while safe control of CPSs can be proven using formal methods, this holds under the assumption that sensors and actuators work properly and provide correct information about the state of the system [39].
Reference [34] proposes an approach to formally verify the non-functional safety properties of CPSs. The study is then extended in reference [65] to verify both functional and non-functional properties using Simulink Design Verifier and UPPAAL-SMC for model checking. In reference [65], verification models are provided and explained step by step. A running example of a traffic sign recognition system is modeled and verified against 12 properties related to time and energy constraints.
Reference [20] investigates ways to improve the efficiency of testing CPSs in simulation. The study focuses on finding a correlation between modifications to the CPS software and its decisions, in order to reduce the number of test cases to be executed. The objective of the approach is to choose only test cases that can show different results upon changes in the system. Experiments with a lane-following algorithm show that the approach could eliminate around 11% of the test cases, reducing the time required for testing [20]. The further research plan in reference [21] includes the use of namespace separation and virtual machines to reduce the execution time of testing in CPSs.

H. FORMAL METHODS
Testing is the main approach used by engineers to find inconsistencies and defects in software; however, as mentioned earlier, testing can only prove the presence of errors, not their absence. In other words, testing can help to decrease the number of defects, but considered alone it is not sufficient to prove compliance against certification requirements [79]. With the aim of achieving both correctness and completeness, more formal approaches can and should be applied to AV software. For instance, according to reference [9], model-based testing is currently the most viable approach for confirming the quality of safety-critical systems such as autonomous cars.
More generally, together with fully formal approaches such as model checking and theorem proving, which can be extremely challenging when dealing with complex and adaptive systems, several hybrid or semi-formal approaches have been proposed where formality actually refers to the methodology with which testing is performed. For instance, reference [79] proposes a formal methodology to verify the safety of embedded software in AVs. The proposed approach utilizes black-box testing techniques and can be used in real time, so that verification can be done online without human intervention. In the verification process, a model that conforms to the same requirements as the real system is designed. The same input is fed into both the model (as a digital signal) and the system (as a physical signal), and their outputs are compared to generate a test report stating whether the test cases failed or passed.
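Schematically, such an online check could look as follows, assuming the reference model and the system under test expose step functions over the same inputs; the tolerance and the toy signals are illustrative assumptions, not the setup of [79].

def online_verify(model_step, system_step, inputs, tol=1e-2):
    report = []
    for t, u in enumerate(inputs):
        # Feed the same input to the verified model and the real system,
        # then compare their outputs within a tolerance.
        expected, actual = model_step(u), system_step(u)
        report.append((t, "pass" if abs(expected - actual) <= tol else "fail"))
    return report

model = lambda u: 2.0 * u                                # reference behavior
system = lambda u: 2.0 * u + (0.05 if u > 3 else 0.0)    # drifts on large inputs
print(online_verify(model, system, [1.0, 2.0, 4.0]))
# [(0, 'pass'), (1, 'pass'), (2, 'fail')]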
While a holistic approach is always needed in safety assessment, one way to manage complexity, and thus enable more extensive usage of formal methods, is to divide the system into simpler components that can be isolated and checked on their own. Components in AVs can be divided into two categories in terms of safety: low-level and high-level. Low-level components are the ones for perception and actuation, while high-level components are those responsible for making decisions [9]. The study in reference [9] focuses on verifying the safe navigation of the vehicle, considering obstacle avoidance and obstacle selection (when there is no way to avoid an obstacle). While formal verification can be applied to different software components of autonomous cars, the study uses the method to verify whether the vehicle makes correct decisions. LTL (Linear Temporal Logic) formulas that represent the desired decisions and state of the system are defined as properties and checked by the AJPF tool. However, the authors note that the approach does not yet consider real-world scenarios in depth, and future improvements are necessary.
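As an illustration, a safety property of this kind can be evaluated over a finite execution trace, as in the following sketch; the states and the property G(obstacle -> X decision = avoid) are assumptions, and a model checker such as AJPF would explore the state space exhaustively rather than check a single trace.

def holds_globally_next(trace):
    """G(obstacle -> X decision == 'avoid') over a finite trace of dicts."""
    for s, s_next in zip(trace, trace[1:]):
        if s["obstacle"] and s_next["decision"] != "avoid":
            return False
    return True

trace = [{"obstacle": False, "decision": "cruise"},
         {"obstacle": True,  "decision": "cruise"},
         {"obstacle": False, "decision": "avoid"}]
print(holds_globally_next(trace))  # True: avoidance follows detection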
Formal methods are based on sound mathematical models and hence are very appropriate for the V&V of safe AVs. However, one essential requirement is that the models being checked conform to the real systems. Although carefully designed models may hold the properties of interest, they are not always complete. It is particularly difficult and expensive to build complete models of CPSs in the context of AVs, as these systems are expected to operate in open environments, which makes inconsistencies between the model and reality unavoidable. For instance, some reinforcement learning approaches can be applied without reference models, and, although the learning algorithms are effective, their safe behavior cannot be proven [27]. Therefore, reference [27] proposes a technique combining two approaches: optimized learning algorithms with high exploration capabilities are coupled with formal verification methods that can ensure system safety. The main objective of the proposed approach is to verify learning at run time. As long as the reinforcement learning algorithm does not violate the requirements, the learning process continues efficiently and the verification results are saved. When a violation occurs, the efficient way of learning gives way to a different algorithm that tries to reach states conforming to the model and requirements. Reference [27] compares the approach to other state-of-the-art approaches and shows its computational advantages, as it performs verification during learning.
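A compact sketch of this safe-learning scheme is given below: the agent explores freely while a runtime monitor checks each proposed action and falls back to a verified recovery policy upon violations; everything here is a toy stand-in for the approach in [27].

import random

def safe_action(state, proposed, is_safe, recovery_policy):
    if is_safe(state, proposed):
        return proposed, False            # efficient learning continues
    return recovery_policy(state), True   # verified fallback takes over

is_safe = lambda s, a: abs(s + a) <= 10   # stay within a verified safe set
recovery = lambda s: -1 if s > 0 else 1   # steer back toward the set's center

state, violations = 0, 0
for _ in range(100):
    action, overridden = safe_action(state, random.choice([-3, 3]), is_safe, recovery)
    violations += overridden
    state += action
print(abs(state) <= 10, "overrides:", violations)  # safety invariant holds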
In complex systems such as AVs, problems can also arise from the interaction between hardware and software. Amongst other things, the software system needs to recognize and manage transient faults in the hardware. To that aim, a standard architecture called AUTOSAR is used in the automotive industry. Within this architecture, software applications can be deployed on several electronic control units (ECUs) [82]. In such a context, reference [82] proposes a verification approach for software systems running on hardware platforms subject to transient faults. To formally verify the model, the AUTOSAR model is converted into timed automata. The method has been implemented and evaluated on an autonomous car, and the effectiveness of the approach has been demonstrated. Table 3 summarizes the main tools supporting the software V&V of safe autonomous cars, as emerged from the review reported in this section. Most of them are available on GitHub (www.github.com) and support fault injection approaches.

V. OPEN ISSUES, CHALLENGES AND OPPORTUNITIES
The SLR performed in this paper has highlighted a rising interest in the research area of software V&V for safe autonomous cars. In this section, the main open issues and challenges are pointed out and promising opportunities in the V&V of safe autonomous cars are outlined to support future research directions.
The biggest challenge in the V&V of safe autonomous cars stems from the fact that achieving autonomy requires complex ML algorithms integrated into the control software. While these algorithms are capable of providing highly accurate results, their behavior is generally unpredictable. The decision-making process of some algorithms may be explainable in human terms; however, how to make these algorithms ''legible'' for humans is still an open issue [3].
Specific black-box testing approaches have been demonstrated to be a viable option for testing ML algorithms against adversarial examples. The most effective black-box testing approach for DNNs mentioned in this SLR is the one presented in reference [47]. The authors have evaluated their approach on two state-of-the-art networks and concluded that they have not yet found a network that was safe in all conditions. This result suggests that current DNN solutions are not mature enough to be relied upon in safety-critical systems such as autonomous cars.
Two fuzz testing frameworks, DLFuzz [31] and DeepSmartFuzzer [55], both showed high performance and accuracy in detecting adversarial examples. However, their effectiveness cannot be measured only in terms of achieved neuron coverage, since it is noted in references [8], [58] that 100% neuron coverage can easily be achieved using certain input vectors from the training set. The approach shown in reference [38] does not rely on neuron coverage and can be an efficient alternative, with a very promising 96.85% accuracy. Reference [38] suggests further research and experiments (with popular datasets such as MNIST) on using affine transformations to find adversarial examples. Although there are extensive research efforts on how to find adversarial examples in DNNs, how to defend against these attacks is still an open issue [24].
Furthermore, the performance of ML algorithms hugely depends on the quality of the training data. As mentioned in Section III, the training data should be diverse and the resulting model should not suffer from over-fitting. However, there is currently no general way either to assess whether the data is sufficient in terms of quantity and diversity, or to prove that the model is not over-fitted. Reference [3] argues that even creating a set of requirements for collecting the data would not constitute progress towards a solution, since the adequacy of those requirements would then need to be validated in the same way, merely shifting the problem one layer up.
While valuable, conventional testing methods are limited in terms of completeness, and that is especially true in the autonomous cars domain [3]. Formal methods have been demonstrated to be a viable option to achieve completeness; however, they do not scale up well to complex systems, and they still rely on the correctness of sensors and perception algorithms. Reference [39] mentions that even though sensors can be wrong, their output can still be partly trusted, and it is thus necessary to establish a bound on sensor errors, which is an open issue. Additionally, how formal properties for perception-related algorithms can be defined is so far an unanswered question [28].
One promising option to handle unreliable event detection, in combination with sensor redundancy, diversity, and self-adaptation techniques, is to adopt probabilistic XAI approaches based on Bayesian Networks, enabling run-time ASIL estimation and assurance against quantitative safety targets such as the hazard rate [203].
As mentioned in reference [79], one concern in model-based formal verification is the assumption that the model being checked reflects the properties of the actual system. Similarly, reference [28] mentions that it is possible to ensure the safety of learning algorithms using formal properties only if the environmental model is correct. Reference [28] suggests that future research directions in this area will be towards ensuring safety when the state or action space is continuous and when the environmental model is not available or is not accurately defined.
Regarding the mitigation of transient faults in AVs, we have mentioned formal approaches for CPSs [82] and fault injection (BinFI [6]). In particular, BinFI covers not only transient errors but also systematic errors, with better error coverage than DeepXplore [8]. To further demonstrate the performance of BinFI, it might be useful to compare BinFI with DLFuzz, presented in reference [31], as DLFuzz has been shown to perform better than DeepXplore.
Regarding simulation-based approaches, developments in MOBATSim [75] should be seriously considered due to its potential to become an advanced safety analysis tool for AVs. The authors plan to add a report generation feature conforming to ISO 26262 specifications, which might be valuable for the industry. Additionally, a recent prototype, AsFault, presented in reference [30], has been demonstrated to be an effective approach to automatically generate test cases, and future work in that direction seems very promising.
Another research direction to consider is the usage of safety cages, which might help to reduce the assessment burden of ML algorithms. Since these algorithms are challenging to assess, safety cages can be used to limit the behavior of the vehicle to a safe envelope, preventing hazardous scenarios from arising under unexpected conditions. In this way, even if the system cannot be completely verified, safe operation can be achieved. Furthermore, reference [68] suggests utilizing safety cages not only to achieve safe operation but also to find corner cases in ML algorithms, which seems another promising opportunity.
It is important to underline that there are several other aspects connected to human factors, ethics, and the safety of AVs in combination with specific SAE levels, which are currently being investigated by the research community and could possibly have an impact on the future evolution of reference standards as well as on the software design of driver-machine interfaces (DMIs); however, the open issues, challenges, and opportunities related to those important aspects of safe AVs, but not specifically to their software V&V, were not in the scope of this SLR.

VI. RELATED WORK
A parallel SLR on the testing and verification of NN-based safety-critical control software has recently been performed and published in reference [204]. The review covered 83 studies published between 2011 and 2018, and mainly focused on NNs and CPSs. Although the review covered several important papers in the safety-critical systems domain, only 13 of them were connected to the automotive field. Since ISO 26262 is an adaptation of the IEC 61508 standard with a specific focus on road vehicles, reference [204] adopts IEC 61508 as a reference standard rather than ISO 26262. Therefore, the papers selected in reference [204] do not completely represent the domain considered in this SLR, due to different search criteria and topic coverage. Reference [205] very recently provided a useful overview of the factors influencing the adoption of AVs, including safety challenges. A total of 14 factors have been highlighted in that study, out of 85 articles. Compared to the study performed in this paper, reference [205] only focused on identifying general industry challenges without addressing technical details and the state-of-the-art of relevant research. A valuable systematic literature review on the specific topic of coverage-based testing for self-driving autonomous vehicles is presented in reference [88]. The review classifies the available literature based on the coverage criteria used in the studies. Out of 89 studies related to the V&V of AVs, 26 studies that use coverage-based testing were selected after applying inclusion and exclusion criteria. One important conclusion of the study is that the terminology of coverage criteria used in V&V processes is not standardized throughout the literature. Therefore, the authors suggest a unification of terminologies to accelerate the progress of research in that field. We have addressed testing coverage criteria in Section IV-C.

VII. CONCLUSION
This paper provided a systematic literature review on software V&V of safe autonomous cars, covering the research work in approximately the last ten years. Three research questions have been defined to provide an in-depth analysis of current challenges, the state-of-the-art, and future research directions:
• RQ1: What are the most common requirements in the software V&V of safe autonomous cars?
• RQ2: What are the main challenges in performing software V&V of safe autonomous cars?
• RQ3: What are the open issues and opportunities in software V&V of safe autonomous cars?
In order to answer those questions, we have identified the main V&V requirements according to safety standards such as ISO 26262, and provided a summary of the most important certification objectives (RQ1). Furthermore, we have classified the main approaches available today to check that software in AVs meets given safety requirements, together with a comparison of the shortcomings and potential of relevant methods (RQ2). Finally, we have investigated the state-of-the-art in structured categories and described open issues, future opportunities, and directions (RQ3).
In total, more than 200 relevant papers have been investigated in the SLR, and more than half of them have been further selected as a basis for discussion in this study. Throughout the study, it has been observed that the current trend in the V&V of safe autonomous cars is towards extending already existing standards and processes, such as ISO 26262 and its V-model. Likewise, progress in V&V processes is mostly achieved by adapting and integrating diverse approaches such as formal verification and fault injection, also considering novel techniques needed to tackle the emerging challenges of smart-CPS paradigms. One somewhat expected result of the study is that, although state-of-the-art approaches seem promising, they are currently inadequate to ensure the safety of fully autonomous cars in all operating conditions. On the software side, this is primarily due to the lack of approaches to make ML algorithms fully explainable and predictable, also considering adversarial attacks and training data validation issues. Nevertheless, the usage of safety/dependability cages as well as monitor-actuator approaches has demonstrated great potential to compensate for incomplete or immature V&V processes.
In conclusion, we hope that, due to its quite extensive topic coverage, this paper can serve as a useful compendium for the many engineers and researchers who are starting to investigate these highly topical and challenging subjects related to the software safety of autonomous road vehicles. Although V&V for AVs is a less mature area compared to other safety-critical domains, we believe that the pioneering research and rapid progress in this sector, driven by the need to quickly reach higher levels of autonomy and pushed by private investments, can also be very useful in the perspective of technology transfer towards other transport sectors such as smart railways.
NIJAT RAJABLI was born in Gakh, Azerbaijan, in 1997. He received the bachelor's degree in computer engineering from Khazar University, Baku, Azerbaijan, in 2018. He is currently pursuing the master's degree in software technology with Linnaeus University, Växjö, Sweden. From 2016 to 2017, he attended the Middle East Technical University, Ankara, Turkey, as a selected exchange student in computer engineering. His industrial experience with software engineering and AI includes an internship at Ericsson in 2020 and his current work at Softwerk, where he is researching applications of AI to timber log tracing. His research interests include AI and the verification of deep learning systems.
FRANCESCO FLAMMINI has been working, since 2018, as an Associate Professor with Linnaeus University, Sweden, where he has chaired the Cyber-Physical Systems (CPS) environment. Since 2020, he has also been a Professor of computer science with Mälardalen University, Sweden. He has had leadership roles in more than ten research projects and has edited or authored more than ten books and 130 publications. His research interests include resilient CPSs and trustworthy autonomous systems. He is also an ACM Distinguished Speaker and a member of several IEEE societies, including the Intelligent Transportation Systems Society.
ROBERTO NARDONE was born in Napoli, Italy, in 1986. He received the master's degree in computer engineering and the Ph.D. degree in computer and automation engineering from the University of Naples Federico II, Italy, in 2009 and 2013, respectively. Since 2018, he has been an Assistant Professor with the Mediterranean University of Reggio Calabria, Italy. He currently serves as an Associate Editor of distinguished journals, such as IEEE ACCESS and the Journal of Universal Computer Science, and has coauthored more than 60 research articles published in peer-reviewed international journals and conference proceedings. His research interests include the quantitative evaluation of non-functional properties by means of model-driven techniques.
VALERIA VITTORINI was born in Napoli, Italy, in 1965. She received the master's degree in mathematics and the Ph.D. degree in computer and automation engineering from the University of Napoli Federico II, Italy, in 1991 and 1995, respectively. Since 2005, she has been an Associate Professor with the University of Napoli Federico II. She has coedited proceedings volumes and coauthored more than 80 research papers published in international journals and conference proceedings. Her research interests include model-driven engineering, formal modeling, and the verification of critical systems. She has recently been investigating the application of artificial intelligence to the transportation sector, and in particular to railway systems. She currently serves as the Coordinator of the H2020 Shift2Rail project RAILS.