A Survey of Autonomous Driving: Common Practices and Emerging Technologies

Automated driving systems (ADSs) promise a safe, comfortable and efficient driving experience. However, fatalities involving vehicles equipped with ADSs are on the rise. The full potential of ADSs cannot be realized unless the robustness of the state of the art is improved further. This paper discusses unsolved problems and surveys the technical aspects of automated driving. Studies regarding present challenges, high-level system architectures, emerging methodologies and core functions including localization, mapping, perception, planning, and human-machine interfaces were thoroughly reviewed. Furthermore, many state-of-the-art algorithms were implemented and compared on our own platform in a real-world driving setting. The paper concludes with an overview of available datasets and tools for ADS development.


I. INTRODUCTION
According to a recent technical report by the National Highway Traffic Safety Administration (NHTSA), 94% of road accidents are caused by human errors [1]. Automated driving systems (ADSs) are being developed with the promise of preventing accidents, reducing emissions, transporting the mobility-impaired and reducing driving related stress [2]. Annual social benefits of ADSs are projected to reach nearly $800 billion by 2050 through congestion mitigation, road casualty reduction, decreased energy consumption and increased productivity caused by the reallocation of driving time [3].
Eureka Project PROMETHEUS [4] was carried out in Europe between 1987 and 1995, and it was one of the earliest major automated driving studies. The project led to the development of VITA II by Daimler-Benz, which succeeded in automatically driving on highways [5]. The DARPA Grand Challenge, organized by the US Department of Defense in 2004, was the first major automated driving competition, and all of the attendees failed to finish the 150-mile off-road course. The difficulty of the challenge was due to the rule that no human intervention at any level was allowed during the finals. A similar DARPA Grand Challenge was held in 2005. This time five teams managed to complete the off-road track without any human interference [6].
Fully automated driving in urban scenes was seen as the biggest challenge of the field since the earliest attempts. During the DARPA Urban Challenge [7], held in 2007, many different research groups around the globe tried their ADSs in a test environment that was modeled after a typical urban scene. Six teams managed to complete the event. Even though this competition was the biggest and most significant event up to that time, the test environment lacked certain aspects of a real-world urban driving scene such as pedestrians and cyclists. Nevertheless, the fact that six teams managed to complete the challenge attracted significant attention. After the DARPA Urban Challenge, several more automated driving competitions such as [8]-[10] were held in different countries.
Common practices in system architecture have been established over the years. Most ADSs divide the massive task of automated driving into subcategories and employ an array of sensors and algorithms on various modules. More recently, end-to-end driving systems started to emerge as an alternative to modular approaches. Deep learning models have become dominant in many of these tasks [11]. A high level classification of ADS architectures is given in Figure 1.
The accumulated knowledge in vehicle dynamics, breakthroughs in computer vision caused by the advent of deep learning [12] and availability of new sensor modalities, such as lidar [13], catalyzed ADS research and industrial implementation. Furthermore, an increase in public interest and market potential precipitated the emergence of ADSs with varying degrees of automation. However, robust automated driving in urban environments has not been achieved yet [14]. Accidents caused by immature systems [15]-[18] undermine trust, and furthermore, they cost lives. As such, a thorough investigation of unsolved challenges and the state-of-the-art is deemed necessary here.
The Society of Automotive Engineers (SAE) refers to hardware-software systems that can execute dynamic driving tasks (DDT) on a sustainable basis as ADS [19]. There are also vernacular alternative terms such as "autonomous driving" and "self-driving car" in use. Although these terms are common, SAE advises against using them because they are unclear and misleading. In this paper we follow SAE's convention.
The present paper attempts to provide a structured and comprehensive overview of the hardware-software architectures of state-of-the-art ADSs. Moreover, emerging trends such as end-to-end driving and connected systems are discussed in detail. There are overview papers on the subject, some of which cover several core functions [20], [21] and others of which concentrate on the motion planning aspect [22], [23]. However, a survey that covers present challenges, available and emerging high-level system architectures, and individual core functions such as localization, mapping, perception, planning, vehicle control, and human-machine interface altogether does not exist. The aim of this paper is to fill this gap in the literature with a thorough survey. In addition, a detailed summary of available datasets, software stacks, and simulation tools is presented here. Another contribution of this paper is the detailed comparison and analysis of alternative approaches through implementation. We implemented the state-of-the-art in our platform using open-source software. A comparison of existing overview papers and our work is shown in Table I.
The remainder of this paper is organized in eight sections. Section II is an overview of present challenges. Details of automated driving system components and architectures are given in Section III. Section IV presents a summary of state-of-the-art localization techniques, followed by Section V, an in-depth review of perception models. Assessment of the driving situation and planning are discussed in Sections VI and VII, respectively. In Section VIII, current trends and shortcomings of human-machine interfaces are introduced. Datasets and available tools for developing automated driving systems are given in Section IX.

II. PROSPECTS AND CHALLENGES

A. Social impact
Widespread usage of ADSs is not imminent, yet it is still possible to foresee its potential impact and benefits to a certain degree:
1) Problems that can be solved: preventing traffic accidents, mitigating traffic congestion, reducing emissions
2) Arising opportunities: reallocation of driving time, transporting the mobility impaired
3) New trends: consuming Mobility as a Service (MaaS), logistics revolution
Widespread deployment of ADSs can reduce the societal loss caused by erroneous human behavior such as distraction, driving under the influence and speeding [3].
Globally, the elderly population (over 60 years old) is growing faster than younger age groups [29]. Increasing the mobility of the elderly with ADSs can have a huge impact on the quality of life and productivity of a large portion of the population.
A shift from personal vehicle ownership towards consuming Mobility as a Service (MaaS) is an emerging trend. Currently, ride-sharing costs less than vehicle ownership for annual mileages under 1000 km [30]. The ratio of owned to shared vehicles is expected to be 50:50 by 2030 [31]. Large scale deployment of ADSs can accelerate this trend.

B. Challenges
ADSs are complicated robotic systems that operate in indeterministic environments. As such, there are myriad scenarios with unsolved issues. This section discusses the high level challenges of driving automation in general. More minute, task-specific details are discussed in corresponding sections.
The Society of Automotive Engineers (SAE) defined six levels of driving automation, numbered zero to five, in [19]. In this taxonomy, level zero stands for no automation at all. Primitive driver assistance systems such as adaptive cruise control, anti-lock braking systems and stability control start with level one [32]. Level two is partial automation, in which advanced assistance systems such as emergency braking or collision avoidance [33], [34] are integrated. With the accumulated knowledge in the vehicle control field and the experience of the industry, level two automation became a feasible technology. The real challenge starts above this level.
Level three is conditional automation: the driver can focus on tasks other than driving during normal operation, but must respond quickly to an emergency alert from the vehicle and be ready to take over. In addition, level three ADSs operate only in limited operational design domains (ODDs) such as highways. Audi claims to have built the first production car to achieve level three automation in limited highway conditions [35]. However, the manual takeover of control from the automated mode raises another issue. Recent studies [36], [37] investigated this problem and found that the takeover situation increases the collision risk with surrounding vehicles. The increased likelihood of an accident during a takeover is a problem that is yet to be solved.
Human attention is not needed to any degree at levels four and five. However, level four can only operate in limited ODDs where special infrastructure or detailed maps exist. In the case of departure from these areas, the vehicle must stop the trip by automatically parking itself. The fully automated system, level five, can operate in any road network and any weather condition. No production vehicle is capable of level four or level five driving automation yet. Moreover, Toyota Research Institute stated that no one in the industry is even close to attaining level five automation [38].
Level four and above driving automation in urban road networks is an open and challenging problem. The environmental variables, from weather conditions to surrounding human behavior, are highly indeterministic and difficult to predict. Furthermore, system failures lead to accidents: in the Hyundai competition one of the ADSs crashed because of rain [15], Google's ADS hit a bus while changing lanes because it failed to estimate the speed of the bus [16], and Tesla's Autopilot failed to recognize a white truck and collided with it, killing the driver [17].
Fig. 1: A high level classification of automated driving system architectures
Fig. 2: A generic end-to-end system information flow diagram
Fatalities [17], [18] caused by immature ADSs undermine trust. According to a recent survey [30], the majority of consumers question the safety of the technology, and they want a significant amount of control over the development and use of ADSs. On the other hand, extremely cautious ADSs are also making a negative impression [39].
Ethical dilemmas pose another set of challenges. In an inevitable accident situation, how should the system behave [40]? Experimental ethics were proposed regarding this issue [41].
Risk and reliability certification is another task yet to be solved. As in aircraft, ADSs need to be designed with high redundancy that will minimize the chance of a catastrophic failure. Even though there are promising projects in this regard such as DeepTest [42], the design-simulation-test-redesign-certification procedure has still not been established by either the industry or the rule-makers.
Finally, various optimization goals such as time to reach the destination, fuel efficiency, comfort, and ride-sharing optimization increase the complexity of an already difficult problem. As such, carrying out all of the dynamic driving tasks safely under strict conditions outside a well defined, geofenced area has not been achieved yet and remains an open problem.
1) Ego-only systems: The ego-only approach is to carry out all of the necessary automated driving operations on a single self-sufficient vehicle at all times, whereas a connected ADS may or may not depend on other vehicles and infrastructure elements given the situation. Ego-only is the most common approach amongst the state-of-the-art ADSs [20], [43], [47]-[52], [52]-[54]. We believe this is due to the practicality of having a self-sufficient platform for development and the additional challenges of connected systems.
2) Modular systems: Modular systems, referred to as the mediated approach in some works [55], are structured as a pipeline of separate components linking sensory inputs to actuator outputs [11]. Core functions of a modular ADS can be summarized as: localization and mapping, perception, assessment, planning and decision making, vehicle control, and human-machine interface. Typical pipelines [20], [43], [47]-[52], [52]-[54] start with feeding raw sensor inputs to localization and object detection modules, followed by scene prediction and decision making. Finally, motor commands are generated at the end of the stream by the control module [11], [64].
Developing individual modules separately divides the challenging task of automated driving into an easier-to-solve set of problems [65]. These sub-tasks have their corresponding literature in robotics [66], computer vision [67] and vehicle dynamics [32], which makes the accumulated know-how and expertise directly transferable. This is a major advantage of modular systems. In addition, functions and algorithms can be integrated or built upon each other in a modular design. For example, a safety constraint [68] can be implemented on top of a sophisticated planning module to enforce some hard-coded emergency rules without modifying the inner workings of the planner. This enables designing redundant but reliable architectures.
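The idea of layering a hard-coded rule over an unmodified planner can be illustrated with a minimal sketch. The `Command` type, the stand-in `planner`, the two-second time-gap rule and the deceleration value are all hypothetical choices made for illustration; they are not the constraint proposed in [68].

```python
from dataclasses import dataclass


@dataclass
class Command:
    """Longitudinal command produced by a planner (hypothetical interface)."""
    acceleration: float  # m/s^2, negative values brake


def planner(gap_m: float, speed_mps: float) -> Command:
    """Stand-in for an arbitrary, possibly learned, planning module."""
    return Command(acceleration=1.0)  # this toy planner always speeds up


def safety_layer(cmd: Command, gap_m: float, speed_mps: float,
                 min_time_gap_s: float = 2.0,
                 emergency_decel: float = -6.0) -> Command:
    """Override the planner when the time gap to the lead vehicle is too small.

    The rule and its thresholds are illustrative assumptions.
    """
    if speed_mps > 0.0 and gap_m / speed_mps < min_time_gap_s:
        return Command(acceleration=emergency_decel)
    return cmd


def drive_step(gap_m: float, speed_mps: float) -> Command:
    """Run the planner, then pass its output through the safety layer."""
    return safety_layer(planner(gap_m, speed_mps), gap_m, speed_mps)
```

Because the safety layer only inspects the planner's output and the current state, the planner itself can be replaced without touching the emergency rule.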
The major disadvantages of modular systems are being prone to error propagation [11] and over-complexity. In the unfortunate Tesla accident, an error in the perception module, misclassification of a white trailer as sky, propagated down the pipeline until failure, causing the first ADS related fatality [42].
3) End-to-end systems: End-to-end driving systems, referred to as direct perception in some studies [55], generate ego-motion directly from sensory inputs. Ego-motion can be either the continuous operation of the steering wheel and pedals or a discrete set of actions, e.g., accelerating and turning left. There are three main approaches for end-to-end driving: direct supervised deep learning [55]-[59], neuroevolution [62], [63] and the more recent deep reinforcement learning [60], [61]. The flow diagram of a generic end-to-end driving system is shown in Figure 2 and a comparison of the approaches is given in Table II.
The earliest end-to-end driving system dates back to ALVINN [56], where a 3-layer fully connected network was trained to output the direction that the vehicle should follow. An end-to-end system for off-road driving was introduced in [57]. With the advances in artificial neural network research, deep convolutional and temporal networks became feasible for automated driving tasks. A deep convolutional neural network that takes an image as input and outputs steering was proposed in [58]. A spatiotemporal network, an FCN-LSTM architecture, was developed for predicting ego-vehicle motion in [59]. DeepDriving is another convolutional model that tries to learn a set of discrete perception indicators from the image input [55]. This approach is not entirely end-to-end, though: the proper driving actions have to be generated from the perception indicators by another module. All of the mentioned methods follow direct supervised training strategies. As such, ground truth is required for training. Usually, the ground truth is the ego-action sequence of an expert human driver and the network learns to imitate the driver. This raises an important design question: should the ADS drive like a human?
A novel deep reinforcement learning model, Deep Q Networks (DQN), combined reinforcement learning with deep learning [69]. In summary, the goal of the network is to select a set of actions that maximize cumulative future rewards. A deep convolutional neural network was used to approximate the optimal action reward function. Actions are first generated with random initialization. Then, the network adjusts its parameters with experience instead of direct supervised learning. An automated driving framework using DQN was introduced in [60], where the network was tested in a simulation environment. The first real world run with DQN was achieved on a countryside road without traffic [61]. DQN based systems do not imitate the human driver; instead, they learn the "optimum" way of driving.
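The learning rule behind this family of methods can be sketched on a toy problem. The example below is plain tabular Q-learning, not DQN: a small table stands in for the deep convolutional network, and the five-cell corridor environment, its reward and all hyperparameters are hypothetical. It only illustrates how action values are adjusted from experience towards the reward plus the discounted best future value.

```python
import random

N_STATES = 5          # positions 0..4 on a toy corridor; state 4 is the goal
ACTIONS = (-1, +1)    # move left or right
GAMMA = 0.9           # discount factor for future rewards
ALPHA = 0.5           # learning rate


def step(state, action):
    """Toy environment: reward 1 on reaching the goal, 0 otherwise."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table instead of a network
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy: explore at random, otherwise act greedily
            # (ties between equal values are broken at random).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(q[s])
                a = rng.choice([i for i in range(2) if q[s][i] == best])
            s2, r, done = step(s, ACTIONS[a])
            # Target: immediate reward plus discounted best future value.
            target = r if done else r + GAMMA * max(q[s2])
            q[s][a] += ALPHA * (target - q[s][a])
            s = s2
    return q
```

After training, the greedy action in every non-goal state is "move right", which was never demonstrated by a supervisor; it emerges from the reward signal alone.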
Neuroevolution refers to using evolutionary algorithms to train artificial neural networks [70]. End-to-end driving with neuroevolution is not as popular as DQN and direct supervised learning. To the best of our knowledge, real world end-to-end driving with neuroevolution has not been achieved yet. However, some promising simulation results were obtained [62], [63]. ALVINN was trained with neuroevolution and outperformed the direct supervised learning version [62]. An RNN was trained with neuroevolution in [63] using a driving simulator. The biggest advantage of neuroevolution is the removal of backpropagation and, hence, of the need for direct supervision.
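A minimal sketch of gradient-free training is given below. It uses a simple elitist evolution strategy, which is only one of many evolutionary algorithms and is not the method of [62] or [63]. The two-weight linear controller, the toy lane-keeping dynamics and all constants are hypothetical.

```python
import random


def rollout(weights, steps=50):
    """Toy lane-keeping task: return the negative accumulated lateral error."""
    w_offset, w_heading = weights
    offset, heading = 1.0, 0.0          # start 1 m off the lane centre
    cost = 0.0
    for _ in range(steps):
        # A two-weight linear "network" maps the state to a steering command.
        steer = w_offset * offset + w_heading * heading
        steer = max(-0.5, min(0.5, steer))
        heading += 0.1 * steer
        offset += 0.5 * heading
        cost += abs(offset)
    return -cost


def evolve(generations=60, population=20, sigma=0.3, seed=0):
    """Mutate the current best weights and keep a child only if it is fitter."""
    rng = random.Random(seed)
    best = [0.0, 0.0]
    best_fit = rollout(best)
    for _ in range(generations):
        for _ in range(population):
            child = [w + rng.gauss(0.0, sigma) for w in best]
            fit = rollout(child)
            if fit > best_fit:          # selection: no gradients are computed
                best, best_fit = child, fit
    return best, best_fit
```

The only feedback the controller receives is the scalar fitness of a whole rollout, so neither labeled steering commands nor backpropagation is involved.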
End-to-end driving is promising; however, it has not been implemented in real-world urban scenes yet, except for limited demonstrations. The biggest shortcomings of end-to-end systems in general are the lack of hard coded safety measures and interpretability [65]. In addition, DQN and neuroevolution have one major disadvantage compared to direct supervised learning: these networks must interact with the environment online and fail many times to learn the desired behavior. In contrast, direct supervised networks can be trained offline with labeled data.

4) Connected systems: There is no operational connected ADS in use yet; however, some researchers [44]-[46] believe this emerging technology will be the future of driving automation. With the use of Vehicular Ad hoc NETworks (VANETs), the basic operations of automated driving can be distributed amongst agents. V2X is a term that stands for "vehicle to everything." From mobile devices of pedestrians to stationary sensors on a traffic light, an immense amount of data can be accessed by the vehicle with V2X [26]. By sharing detailed information of the traffic network amongst peers [71], shortcomings of the ego-only platforms such as sensing range, blind spots, and computational limits may be eliminated. More V2X applications that will increase safety and traffic efficiency are expected to emerge in the foreseeable future [72].
VANETs can be realized in two different ways: conventional IP based networking and Information-Centric Networking (ICN) [44]. For vehicular applications, large amounts of data have to be distributed amongst agents over intermittent and less than ideal connections while maintaining high mobility [46]. Conventional IP-host based Internet protocol cannot function properly under these conditions. On the other hand, in information-centric networking, vehicles stream query messages to an area instead of a direct address and they accept corresponding responses from any sender [45]. Since vehicles are highly mobile and dispersed on the road network, the identity of the information source becomes less relevant. In addition, local data often carries more crucial information for immediate driving tasks such as avoiding a rapidly approaching vehicle in a blind spot.
Early works, such as the CarSpeak system [73], proved that vehicles can utilize each other's sensors and use the shared information to execute some dynamic driving tasks. However, without reducing the huge amounts of continuous driving data, sharing information between hundreds of thousands of vehicles in a city is not feasible. A semiotic framework that integrates different sources of information and converts raw sensor data into meaningful descriptions was introduced in [74] for this purpose. In [75], the term Vehicular Cloud Computing (VCC) was coined and its main advantages over conventional Internet cloud applications were introduced. Sensors are the primary cause of the difference. In VCC, sensor information is kept on the vehicle and only shared if there is a local query from another vehicle. This potentially saves the cost of uploading/downloading a constant stream of sensor data to the web. Besides, the high relevance of local data increases the feasibility of VCC. Regular cloud computing was compared to vehicular cloud computing and it was reported that VCC is technologically feasible [76]. The term "Internet of Vehicles" (IoV) was proposed for describing a connected ADS [44] and the term "vehicular fog" was introduced in [45].
Establishing an efficient VANET with thousands of vehicles in a city is a huge challenge. For an ICN based VANET, some of the challenging topics are security, mobility, routing, naming, caching, reliability and multi-access computing [77].
Fig. 3: Ricoh Theta V panoramic images collected using our data collection platform, in Nagoya University campus. Note some distortion still remains on the periphery of the image.
In summary, even though the potential benefits of a connected system are huge, the additional challenges increase the complexity of the problem to a significant degree. As such, there is no operational connected system yet.

B. Sensors and hardware
State-of-the-art ADSs employ a wide selection of onboard sensors. High sensor redundancy is needed in most of the tasks for robustness and reliability. Hardware units can be categorized into five groups: exteroceptive sensors for perception, proprioceptive sensors for internal vehicle state monitoring tasks, communication arrays, actuators, and computational units.
Exteroceptive sensors are mainly used for perceiving the environment, which includes dynamic and static objects, e.g., drivable areas, buildings, pedestrian crossings. Camera, lidar, radar and ultrasonic sensors are the most commonly used modalities for this task. A detailed comparison of exteroceptive sensors is given in Table III.
1) Monocular Cameras: Cameras can sense color and are passive, i.e., they do not emit any signal for measurements. Sensing color is extremely important for tasks such as traffic light recognition. Furthermore, 2D computer vision is an established field with remarkable state-of-the-art algorithms. Moreover, a passive sensor does not interfere with other systems since it does not emit any signals. However, cameras have certain shortcomings. Illumination conditions affect their performance drastically, and depth information is difficult to obtain from a single camera. There are promising studies [83] to improve monocular camera based depth perception, but modalities that are not negatively affected by illumination and weather conditions are still necessary besides cameras for dynamic driving tasks. Other camera types gaining interest for ADS include flash cameras [78], thermal cameras [80], [81], and event cameras [79].
2) Omnidirectional Cameras: For 360° 2D vision, omnidirectional cameras are used as an alternative to camera arrays. They have seen increasing use, with increasingly compact and high performance hardware being constantly released. Panoramic view is particularly desirable for applications such as navigation, localization and mapping [84].
3) Event Cameras: Event cameras are among the newer sensing modalities that have seen use in ADS [85]. Event cameras record data asynchronously for individual pixels with respect to visual stimulus. The output is therefore an irregular sequence of data points, or events triggered by changes in brightness. The response time is on the order of microseconds [86]. The main limitation of current event cameras is pixel size and image resolution. For example, the DAVIS40 image shown in Figure 4 has a pixel size of 18.5 × 18.5 µm and a resolution of 240 × 180. Recently, a driving dataset with event camera data has been published [85].
4) Radar: Radar, lidar and ultrasonic sensors are very useful in covering the shortcomings of cameras. Depth information, i.e., distance to objects, can be measured effectively to retrieve 3D information with these sensors, and they are not affected by illumination conditions. However, they are active sensors. Radars emit radio waves that bounce back from objects and measure the time of each bounce. Emissions from active sensors can interfere with other systems. Radar is a well established technology that is both lightweight and cost effective. For example, radars can fit inside side-mirrors. Radars are cheaper and can detect objects at longer distances than lidars. However, lidars are more accurate.
5) Lidar: Lidar operates on a principle similar to that of radar, but it emits infrared light waves instead of radio waves. It has much higher accuracy than radar under 200 meters. Weather conditions such as fog or snow have a negative impact on the performance of lidar. Another aspect is the sensor size. Smaller sensors are preferred on the vehicle because of limited space and aerodynamic constraints. Lidar is larger than radar.
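The ranging principle shared by radar and lidar, timing an emitted signal until its echo returns, reduces to a one-line computation. The sketch below assumes an idealized pulsed sensor and propagation at the vacuum speed of light; the function name is illustrative, and production sensors may rely on other schemes such as frequency-modulated waveforms.

```python
SPEED_OF_LIGHT_MPS = 299_792_458.0  # radio and infrared waves, vacuum value


def range_from_round_trip(round_trip_s: float) -> float:
    """Distance to a target from the round-trip time of an emitted pulse."""
    if round_trip_s < 0.0:
        raise ValueError("round-trip time must be non-negative")
    # The signal travels to the object and back, hence the division by two.
    return SPEED_OF_LIGHT_MPS * round_trip_s / 2.0
```

An echo that arrives one microsecond after emission corresponds to a target roughly 150 m away, which shows the timing resolution such sensors require.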
In [87], human sensing performance is compared to that of ADSs. One of the key findings of this study is that even though human drivers are still better at reasoning in general, the perception capability of ADSs with sensor-fusion can exceed that of humans, especially in degraded conditions such as insufficient illumination.
6) Proprioceptive sensors: Proprioceptive sensing is another crucial category. Vehicle states such as speed, acceleration and yaw must be continuously measured in order to operate the platform safely with feedback. Almost all modern production cars are equipped with adequate proprioceptive sensors. Wheel encoders are mainly used for odometry, Inertial Measurement Units (IMU) are employed for monitoring the velocity and position changes, tachometers are utilized for measuring speed and altimeters for altitude. These signals can be accessed through the CAN protocol of modern cars.
Besides sensors, an ADS needs actuators to manipulate the vehicle and advanced computational units for processing and storing sensor data.
7) Full size cars: There are numerous instrumented vehicles introduced by different research groups, such as Stanford's Junior [20], which employs an array of sensors with different modalities for perceiving external and internal variables. Boss [43] won the DARPA Urban Challenge with an abundance of sensors. RobotCar [49] is a cheaper research platform aimed at data collection. In addition, different levels of driving automation have been introduced by the industry; Tesla's Autopilot [88] and Google's self driving car [89] are some examples. Bertha [53] was developed by Daimler and has four 120° short-range radars, two long-range radars on the sides, a stereo camera, a wide-angle monocular color camera on the dashboard, and another wide-angle camera for the back. Our vehicle is shown in Figure 5. A detailed comparison of sensor setups of 10 different full-size ADSs is given in Table IV.
8) Large vehicles and trailers: The earliest intelligent trucks were developed for the PATH program in California [98], which utilized magnetic markers on the road. Fuel economy is an essential topic in freight transportation and methods such as platooning have been developed for this purpose. Platooning is a well-studied phenomenon; it reduces drag and therefore fuel consumption [99]. In semi-autonomous truck platooning, the lead truck is driven by a human driver, and several automated trucks follow it, forming a semi-autonomous road train as defined in [100]. The Sartre European Union project [101] introduced such a system that satisfies three core conditions: using the already existing public road network, sharing the traffic with non-automated vehicles and not modifying the road infrastructure. A platoon consisting of three automated trucks was formed in [99] and significant fuel savings were reported.
The tractor-trailer setup poses an additional challenge for automated freight transport. Conventional control methods such as feedback linearization [102] and fuzzy control [103] were used for path tracking without considering the jackknifing constraint. The possibility of jackknifing, the collision of the truck and the trailer with each other, increases the difficulty of the task [104]. A control safety governor design was proposed in [104] to prevent jackknifing while reversing.

IV. LOCALIZATION AND MAPPING
Localization is the task of finding ego-position relative to a reference frame in an environment [106], and it is fundamental to any mobile robot. It is especially crucial for ADSs [25]; the vehicle must use the correct lane and position itself in it accurately. Furthermore, localization is an elemental requirement for global navigation.
Fig. 6: We used NDT matching [97], [105] to localize our vehicle in the Nagoya University campus. White points belong to the offline map and the colored ones were obtained from online scans. The objective is to find the best match between colored points and white points, thus localizing the vehicle.
The remainder of this section details the three most common approaches that use solely on-board sensors: Global Positioning System and Inertial Measurement Unit (GPS-IMU) fusion, Simultaneous Localization And Mapping (SLAM), and state-of-the-art a priori map-based localization. Readers are referred to [106] for a broader localization overview. A comparison of localization methods is given in Table V.

A. GPS-IMU fusion
The main principle of GPS-IMU fusion is correcting the accumulated errors of dead reckoning at intervals with absolute position readings [107]. In a GPS-IMU system, changes in position and orientation are measured by the IMU, and this information is processed for localizing the robot with dead reckoning. There is a significant drawback of the IMU, and of dead reckoning in general: errors accumulate with time and often lead to failure in long-term operations [108]. With the integration of GPS readings, the accumulated errors of the IMU can be corrected at intervals.
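The interplay of drifting dead reckoning and periodic absolute corrections can be sketched in one dimension. The example below uses a scalar Kalman-style filter, which is one common choice and not necessarily the formulation of [107]. The velocity bias, the noise variances and the noise-free GPS readings (chosen so that the result is repeatable) are all illustrative assumptions.

```python
def predict(x, p, velocity, dt, process_var):
    """Dead reckoning step: integrate measured motion; uncertainty grows."""
    return x + velocity * dt, p + process_var


def correct(x, p, gps_position, gps_var):
    """Absolute fix: blend in a GPS reading; uncertainty shrinks."""
    gain = p / (p + gps_var)
    return x + gain * (gps_position - x), (1.0 - gain) * p


def run(steps=100, gps_every=10, use_gps=True):
    """Return the final absolute position error of the estimate in metres."""
    true_x, x, p = 0.0, 0.0, 0.0
    true_v, bias = 1.0, 0.1      # the inertial velocity estimate is biased
    dt = 0.1
    for k in range(1, steps + 1):
        true_x += true_v * dt
        x, p = predict(x, p, true_v + bias, dt, process_var=0.01)
        if use_gps and k % gps_every == 0:
            # GPS reading taken as noise-free here purely for repeatability.
            x, p = correct(x, p, gps_position=true_x, gps_var=0.25)
    return abs(x - true_x)
```

Calling `run(use_gps=False)` shows the error growing linearly with time, whereas `run(use_gps=True)` keeps it bounded because each fix pulls the estimate back towards the absolute position.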
GPS-IMU systems by themselves cannot be used for vehicle localization as they do not meet the performance criteria [109]. In the 2004 DARPA Grand Challenge, the Red Team from Carnegie Mellon University [92] failed the race because of a GPS error. The accuracy required for automated driving in urban scenes cannot be realized with the current GPS-IMU technology. Moreover, in dense urban environments, the accuracy drops further, and the GPS stops functioning from time to time because of tunnels [107] and high buildings.
Fig. 7: Creating a 3D pointcloud map with congregation of scans. We used Autoware [110] for mapping.
Even though GPS-IMU systems by themselves do not meet the performance requirements and cannot be utilized except for high-level route planning, they are used for initial pose estimation in tandem with lidar and other sensors in state-of-the-art localization systems [109].

B. Simultaneous localization and mapping
Simultaneous localization and mapping (SLAM) is the act of online map making and localizing the robot in it at the same time. A priori information about the environment is not required in SLAM. It is a common practice in robotics, especially in indoor environments. However, due to the high computational requirements and environmental challenges, running SLAM algorithms outdoors is less efficient than localization with a pre-built map [111].
Team MIT used a SLAM approach in the DARPA Urban Challenge [112] and finished in 4th place. In contrast, the winner, Carnegie Mellon's Boss [43], and the runner-up, Stanford's Junior [20], both utilized a priori information. In spite of not having the same level of accuracy and efficiency, SLAM techniques have one major advantage over a priori methods: they can work anywhere.
SLAM based methods have the potential to replace a priori techniques if their performance can be increased further [24]. We refer the readers to [25] for a detailed SLAM survey in the intelligent vehicle domain.

C. A priori map-based localization
The core idea of a priori map-based localization techniques is matching: localization is achieved through the comparison of online readings to the information on the detailed a priori map and finding the location of the best possible match [109]. Often an initial pose estimation, for example with a GPS, is used at the beginning of the matching process. There are various approaches to map building and preferred modalities.
Fig. 8: Annotating a 3D point cloud map with topological information. A large number of annotators were employed to build the map shown on the right-hand side. The point-cloud and annotated maps are available on [113].
Changes in the environment negatively affect the performance of map-based methods. This effect is especially prevalent in rural areas, where the map can deviate from the actual environment because of changes in roadside vegetation and construction [114]. Moreover, this method requires an additional map-making step.
There are two different map-based approaches: landmark search and matching.
1) Landmark search: Landmark search is computationally less expensive than point cloud matching. It is a robust localization technique as long as a sufficient number of landmarks exist. In an urban environment, poles, curbs, signs and road markers can be used as landmarks.
A road marking detection method using lidar and Monte Carlo Localization (MCL) was used in [94]. In this method, road markers and curbs were matched to an offline 3D map to find the location of the vehicle. A vision-based road marking detection method was introduced in [115]. Road markings detected by a single front camera were compared and matched to a low-volume digital marker map with global coordinates. Then, a particle filter was employed to update the position and heading of the vehicle with the detected road markings and GPS-IMU output. A road marking detection based localization technique using two cameras directed towards the ground, GPS-IMU dead reckoning, odometry, and a precise marker location map was proposed in [116]. Another vision-based method with a single camera and geo-referenced traffic signs was presented in [117].
This approach has one major disadvantage: landmark dependency makes the system prone to failure where there are too few landmarks.
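The predict, weight and resample cycle behind particle-filter (MCL) localization against mapped landmarks can be illustrated with a minimal one-dimensional sketch. It is not the method of [94] or [115]: the single landmark, the noise levels, the coarse GPS-like prior and all function names are assumptions chosen for illustration.

```python
import math
import random


def gaussian_likelihood(x, mean, sigma):
    """Unnormalized Gaussian likelihood; the constant cancels when resampling."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def mcl_step(particles, motion, measured_range, landmark_x, rng,
             motion_sigma=0.2, range_sigma=0.5):
    """One predict-weight-resample cycle of a 1D particle filter."""
    # Predict: apply odometry to every particle and add motion noise.
    moved = [p + motion + rng.gauss(0.0, motion_sigma) for p in particles]
    # Weight: compare the measured range with the range each particle expects.
    weights = [gaussian_likelihood(measured_range, landmark_x - p, range_sigma) + 1e-300
               for p in moved]
    # Resample: draw particles in proportion to their weights.
    return rng.choices(moved, weights=weights, k=len(moved))


def run_demo(seed=0, steps=20, n_particles=500):
    """Simulate a vehicle driving toward one mapped landmark (all values assumed)."""
    rng = random.Random(seed)
    landmark_x = 50.0  # landmark position stored in the hypothetical map
    true_x = 0.0
    # Coarse prior, e.g. from GPS: particles spread +/- 10 m around the start.
    particles = [rng.uniform(-10.0, 10.0) for _ in range(n_particles)]
    for _ in range(steps):
        true_x += 1.0  # the vehicle advances 1 m per step
        measured = landmark_x - true_x + rng.gauss(0.0, 0.5)
        particles = mcl_step(particles, 1.0, measured, landmark_x, rng)
    estimate = sum(particles) / len(particles)
    return true_x, estimate
```

Calling `run_demo()` returns the simulated true position and the particle mean. With no landmark in range, the weights carry no information and the estimate drifts with odometry, which is the failure mode noted above.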
2) Point cloud matching: State-of-the-art localization systems use multi-modal point cloud based approaches. In summary, the online-scanned point cloud, which covers a smaller area, is translated and rotated around its center iteratively to be compared against the larger a priori point cloud map. The position and orientation that give the best match/overlap of points between the two point clouds give the localized position of the sensor relative to the map. For initial pose estimation, GPS is commonly used alongside dead reckoning. We used this approach to localize our vehicle. The matching process is shown in Fig. 6 and the map-making in Fig. 7 and Fig. 8.
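The matching idea summarized above can be sketched in 2D as an exhaustive search around an initial guess. This is a simplified illustration under stated assumptions, not the pipeline used on our platform: the brute-force grid search, the nearest-neighbor cost, the toy L-shaped map and all names are hypothetical.

```python
import math


def transform(points, x, y, yaw):
    """Move scan points from the sensor frame into the map frame."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * px - s * py + x, s * px + c * py + y) for px, py in points]


def to_sensor_frame(points, x, y, yaw):
    """Inverse of transform(); used here only to simulate a scan."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * (px - x) + s * (py - y), -s * (px - x) + c * (py - y))
            for px, py in points]


def match_cost(scan_in_map, map_points):
    """Sum of squared distances from each scan point to its nearest map point."""
    return sum(min((sx - mx) ** 2 + (sy - my) ** 2 for mx, my in map_points)
               for sx, sy in scan_in_map)


def localize(scan, map_points, guess, span=4, step_xy=0.25, step_yaw=0.05):
    """Try translations and rotations around the initial guess; keep the best."""
    gx, gy, gyaw = guess
    best_pose, best_cost = guess, float("inf")
    for i in range(-span, span + 1):
        for j in range(-span, span + 1):
            for k in range(-span, span + 1):
                pose = (gx + i * step_xy, gy + j * step_xy, gyaw + k * step_yaw)
                cost = match_cost(transform(scan, *pose), map_points)
                if cost < best_cost:
                    best_pose, best_cost = pose, cost
    return best_pose, best_cost


# Toy a priori map: two perpendicular walls sampled every metre (assumed data).
MAP_POINTS = [(float(i), 0.0) for i in range(10)] + [(0.0, float(j)) for j in range(1, 10)]
TRUE_POSE = (2.0, 1.0, 0.1)
# The online scan sees only part of the map, expressed in the sensor frame.
SCAN = to_sensor_frame([p for p in MAP_POINTS if p[0] <= 4 and p[1] <= 4], *TRUE_POSE)
```

For example, `localize(SCAN, MAP_POINTS, guess=(1.5, 1.5, 0.0))` searches poses around a rough initial estimate, such as one from GPS and dead reckoning. Practical systems replace the exhaustive search with iterative optimization such as ICP or NDT, discussed next.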
In the seminal work of [109], a point cloud map collected with lidar was used to augment inertial navigation and localization. A particle filter maintained a three-dimensional vector of 2D coordinates and the yaw angle. A multi-modal approach with probabilistic maps was utilized in [96] to achieve localization in urban environments with less than 10 cm RMS error. Instead of comparing two point clouds point by point and discarding the mismatched reads, the variance of all observed data was modeled and used for the matching task. A matching algorithm for lidar scans using multiresolution Gaussian Mixture Maps (GMM) was proposed in [118]. Iterative Closest Point (ICP) was compared against Normal Distribution Transform (NDT) in [105], [119]. In NDT, accumulated sensor readings are transformed into a grid in which each cell/voxel is represented by the mean and covariance of the scanned points that fall into it. NDT proved to be more robust than point-to-point ICP matching. An improved version of 3D NDT matching was proposed in [97], and [114] augmented NDT with road marker matching. An NDT-based Monte Carlo Localization (MCL) method that utilizes an offline static map and a constantly updated short-term map was developed by [120]. In this method, an NDT occupancy grid was used for the short-term map, and it was utilized only when and where the static map failed to give sufficient explanations.
Making and maintaining maps is time- and resource-consuming. Therefore, some researchers such as [95] argue that methods with a priori maps are not feasible given the size of road networks and rapid changes.
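The NDT grid representation described above can be sketched in 2D as follows. The cell size, the minimum point count, the covariance regularization `eps` and the function names are assumptions for illustration; real implementations work in 3D and optimize the pose over the summed scores.

```python
import math
from collections import defaultdict


def build_ndt(points, cell_size=1.0, min_points=3, eps=1e-3):
    """Summarize 2D points by a mean and covariance per grid cell."""
    buckets = defaultdict(list)
    for x, y in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        buckets[key].append((x, y))
    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        if n < min_points:
            continue  # too few points for a meaningful covariance
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        # Sample covariance; eps keeps the matrix invertible (assumed value).
        sxx = sum((p[0] - mx) ** 2 for p in pts) / (n - 1) + eps
        syy = sum((p[1] - my) ** 2 for p in pts) / (n - 1) + eps
        sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
        cells[key] = ((mx, my), (sxx, sxy, syy))
    return cells


def ndt_score(point, cells, cell_size=1.0):
    """Gaussian score of a scan point under the cell it falls into (0 if empty)."""
    key = (math.floor(point[0] / cell_size), math.floor(point[1] / cell_size))
    if key not in cells:
        return 0.0
    (mx, my), (sxx, sxy, syy) = cells[key]
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy * sxy
    # Squared Mahalanobis distance using the closed-form 2x2 inverse.
    mahalanobis_sq = (dx * dx * syy - 2.0 * dx * dy * sxy + dy * dy * sxx) / det
    return math.exp(-0.5 * mahalanobis_sq)
```

A candidate pose is scored by summing `ndt_score` over all transformed scan points. Because each cell is a smooth distribution rather than a set of discrete points, the score changes smoothly with the pose.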
3) 2D to 3D matching: Matching online 2D readings to a 3D a priori map is an emerging technology. This approach requires only a camera on the ADS-equipped vehicle instead of the more expensive lidar. The a priori map still needs to be created with a lidar, though.
A monocular camera was used to localize the vehicle in a point cloud map in [121]. With an initial pose estimate, 2D synthetic images were created from the offline 3D point cloud map and compared to the online camera images using normalized mutual information. This method increases the computational load of the localization task. Another vision matching algorithm was introduced in [122], where a stereo camera setup was utilized to compare online readings to synthetic depth images generated from a 3D prior.
Camera-based localization approaches may become popular in the future, as their hardware is cheaper than that of lidar-based systems.
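As an illustration of the image comparison step, the sketch below computes normalized mutual information (NMI) between two small images given as nested lists of quantized intensities. It uses one common definition, NMI = (H(A) + H(B)) / H(A, B); whether [121] uses exactly this form is not confirmed here, and the handling of constant images is an assumption.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy in bits of a histogram given as raw counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def normalized_mutual_information(img_a, img_b):
    """NMI of two equally sized images of quantized intensity values."""
    flat_a = [v for row in img_a for v in row]
    flat_b = [v for row in img_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    n = len(flat_a)
    h_a = entropy(Counter(flat_a).values(), n)
    h_b = entropy(Counter(flat_b).values(), n)
    # Joint entropy from the histogram of co-occurring intensity pairs.
    h_ab = entropy(Counter(zip(flat_a, flat_b)).values(), n)
    if h_ab == 0.0:
        return 1.0  # both images constant: treated as uninformative (assumption)
    return (h_a + h_b) / h_ab
```

Under this definition the score ranges from 1 for statistically independent images to 2 for identical ones. A localizer would render one synthetic image per candidate pose and keep the pose with the highest score.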

V. PERCEPTION
Perceiving the surrounding environment and extracting information that may be critical for safe navigation is a key objective. A variety of tasks, using different sensing modalities, fall under the umbrella of perception. Building on decades of computer vision research, cameras are the most commonly used sensor for perception, with 3D vision becoming a strong alternative/supplement.
The remainder of this section is divided into core perception tasks. We discuss image-based object detection in Section V-A1, semantic segmentation in Section V-A2, 3D object detection in Section V-A3, object tracking in Section V-B and road and lane detection in Section V-C.
A. Detection
1) Image-based Object Detection: Object detection refers to identifying the location and size of objects of interest. Both static objects, from traffic lights and signs to road crossings, and dynamic objects such as other vehicles, pedestrians or cyclists are of concern to ADSs. Generalized object detection has a long-standing history as a central problem in computer vision, where the goal is to determine whether or not objects of specific classes are present in an image, then to determine their size via a rectangular bounding box. This section mainly discusses state-of-the-art object detection methods, as they represent the starting point of several other tasks in an ADS pipeline, such as object tracking and scene understanding.
Object recognition research started more than 50 years ago, but only recently, in the late 1990s and early 2000s, has algorithm performance reached a level of relevance for driving automation. In 2012, the deep convolutional neural network (DCNN) AlexNet [12] won the ImageNet image recognition challenge [132] by a wide margin. This resulted in a near complete shift of focus to supervised learning and in particular deep learning for object detection. There exist a number of extensive surveys on general image-based object detection [133]- [135]. Here, the focus is on the state-of-the-art methods that could be applied to ADS.
While state-of-the-art methods all rely on DCNNs, there currently exists a clear distinction between them:
1) Single stage detection frameworks use a single network to produce object detection locations and class prediction simultaneously.
2) Region proposal detection frameworks have two distinct stages, where general regions of interest are first proposed, then categorized by separate classifier networks.
Region proposal methods are currently leading detection benchmarks, but at the cost of requiring high computation power and generally being difficult to implement, train and fine-tune. Meanwhile, single stage detection algorithms tend to have fast inference time and low memory cost, which is well-suited for real-time driving automation. YOLO (You Only Look Once) [136] is a popular single stage detector, which has been improved continuously [123], [137]. The network uses a DCNN to extract image features on a coarse grid, significantly reducing the resolution of the input image. A fully-connected neural network then predicts class probabilities and bounding box parameters for each grid cell and class. This design makes YOLO very fast, with the full model operating at 45 FPS and a smaller model operating at 155 FPS for a small accuracy trade-off. More recent versions of this method, YOLOv2, YOLO9000 [137] and YOLOv3 [123], briefly took over the PASCAL VOC and MS COCO benchmarks while maintaining low computation and memory cost. Another widely used algorithm, even faster than YOLO, is the Single Shot Detector (SSD) [138], which uses standard DCNN architectures such as VGG [131] to achieve competitive results on public benchmarks. SSD performs detection on a coarse grid similar to YOLO, but also uses higher resolution features obtained early in the DCNN to improve detection and localization of small objects.
Considering both accuracy and computational cost is essential for detection in ADS; the detection needs to be reliable, but also operate better than real-time, to allow as much time as possible for the planning and control modules to react to those objects. As such, single stage detectors are often the detection algorithms of choice for ADSs. However, as shown in Table VI, region proposal networks (RPN), used in two-stage detection frameworks, have proven to be unmatched in terms of object recognition and localization accuracy, and their computational cost has improved greatly in recent years. They are also better suited for other tasks related to detection, such as semantic segmentation, as discussed in Section V-A2. Through transfer learning, RPNs achieving multiple perception tasks simultaneously are becoming increasingly feasible for online applications [124]. RPNs may replace single stage detection networks for ADS applications in the near future.
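To make the grid-based prediction concrete, the following sketch decodes the output of a single grid cell into an image-space bounding box. It assumes the parameterization of the original YOLO formulation (box center relative to the cell, width and height relative to the whole image, one box per cell for brevity). The function name and tuple layout are hypothetical and do not correspond to any particular implementation.

```python
def decode_cell(row, col, prediction, grid_size, img_w, img_h):
    """Convert one grid-cell prediction into (box, class index, score).

    prediction = (x_off, y_off, w_rel, h_rel, confidence, class_probs), where
    x_off and y_off lie in [0, 1] relative to the cell and w_rel and h_rel
    are fractions of the full image size (assumed layout).
    """
    x_off, y_off, w_rel, h_rel, confidence, class_probs = prediction
    # Box center in pixels: cell index plus offset, scaled by the cell size.
    cx = (col + x_off) / grid_size * img_w
    cy = (row + y_off) / grid_size * img_h
    w = w_rel * img_w
    h = h_rel * img_h
    # The most likely class and its class-specific confidence score.
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    score = confidence * class_probs[best]
    box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    return box, best, score
```

With a 7 x 7 grid on a 448 x 448 input, the network emits only 49 such predictions per box slot. This is why the approach is fast, and also why small objects that share a cell are hard to separate.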
Omnidirectional Cameras: 360 degree vision, or at least panoramic vision, is necessary for higher levels of automation. This can be achieved through camera arrays, though precise extrinsic calibration between each camera is then necessary to make image stitching possible. Alternatively, omnidirectional cameras can be used, or a smaller array of cameras with very wide angle fisheye lenses. These are, however, difficult to calibrate intrinsically; the spherical images are highly distorted and the camera model used must account for mirror reflections or fisheye lens distortions, depending on the camera model producing the panoramic images [139], [140]. The accuracy of the model and calibration dictates the quality of the undistorted images produced, on which the aforementioned 2D vision algorithms are used. An example of fisheye lenses producing two spherical images that are then combined into one panoramic image is shown in Figure 3. Some distortions inevitably remain, but despite these challenges in calibration, omnidirectional cameras have been used for many applications such as SLAM [141] and 3D reconstruction [142].
Fig. 9: An urban scene near Nagoya University, with camera and lidar data collected by our experimental vehicle and object detection outputs from state-of-the-art perception algorithms. (a) A front facing camera's view, with bounding box results from YOLOv3 [123] and (b) instance segmentation results from MaskRCNN [124]. (c) Semantic segmentation masks produced by DeepLabv3 [125]. (d) The 3D Lidar data with object detection results from SECOND [126]. Amongst the four, only the 3D perception algorithm outputs range to detected objects.
Event Cameras: Event cameras are a fairly new modality which output asynchronous events usually caused by movement in the observed scene, as shown in Figure 4. This makes the sensing modality interesting for dynamic object detection. The other appealing factor is their response time on the order of microseconds [86], as frame rate is a significant limitation for high-speed driving. The sensor resolution remains an issue, but new models are rapidly improving.
They have been used for a variety of applications closely related to ADS. A recent survey outlines progress in pose estimation and SLAM, visual-inertial odometry and 3D reconstruction, as well as other applications [143]. Most notably, a dataset for end-to-end driving with event cameras was recently published, with preliminary experiments showing that the output of an event camera can, to some extent, be used to predict car steering angle [85].
Poor Illumination and Changing Appearance: The main drawback of using cameras is that changes in lighting conditions can significantly affect their performance. Low light conditions are inherently difficult to deal with, while changes in illumination due to shifting shadows, inclement weather, or seasonal changes can cause algorithms to fail, in particular supervised learning methods. For example, snow drastically alters the appearance of scenes and hides potentially key features such as lane markings. An easy alternative is to use another sensing modality for perception, but lidar also has difficulties with some weather conditions like snow [144], and radars lack the necessary resolution for many perception tasks [47]. A sensor fusion strategy is often employed to avoid any single point of failure [145]. Thermal imaging through infrared sensors is also used for object detection in low light conditions, and is particularly effective for pedestrian detection [146]. Camera-only methods which attempt to deal with dynamic lighting conditions directly have also been developed. Both extracting lighting-invariant features [147] and assessing the quality of features [148] have been proposed. Pre-processed, illumination-invariant images have been applied to ADS [149] and were shown to improve localization, mapping and scene classification capabilities over long periods of time. Still, dealing with the unpredictable conditions brought forth by inadequate or changing illumination remains a central challenge preventing the widespread implementation of ADS.
2) Semantic Segmentation: Beyond image classification and object detection, computer vision research has also tackled the task of image segmentation. This consists of classifying each pixel of an image with a class label. This task is of particular importance to driving automation as some objects of interest are poorly defined by bounding boxes, in particular roads, traffic lines, sidewalks and buildings. A segmented scene in an urban area can be seen in Figure 9. As opposed to semantic segmentation, which labels pixels based on a class, instance segmentation algorithms further separate instances of the same class, which is important in the context of driving automation. In other words, objects which may have different trajectories and behaviors must be differentiated from each other. We used the COCO dataset [150] to train the instance segmentation algorithm Mask R-CNN [124], with a sample result shown in Figure 9.
Segmentation has recently started to become feasible for real-time applications. Developments in this field closely parallel progress in general image-based object detection. The aforementioned Mask R-CNN [124] is a generalization of Faster R-CNN [151]. The multi-task network can achieve accurate bounding box estimation and instance segmentation simultaneously and can also be generalized to other tasks like pedestrian pose estimation with minimal domain knowledge. Running at 5 fps, it is approaching real-time use for ADS.
Unlike the CNN of Mask R-CNN, which is more akin to those used for object detection through its use of region proposal networks, segmentation networks usually employ a combination of convolutions for feature extraction, followed by deconvolutions, also called transposed convolutions, to obtain pixel resolution labels [152], [153]. Feature pyramid networks are also commonly used, for example in PSPNet [154], which also introduced dilated convolutions for segmentation. This idea of sparse convolutions was then used to develop DeepLab [155], with the most recent version being the current state of the art for object segmentation [125]. We employed DeepLab with our ADS and a segmented frame is shown in Figure 9.
While most segmentation networks are as of yet too slow and computationally expensive to be used in ADS, it is important to notice that many of these segmentation networks are initially trained for different tasks, such as bounding box estimation, then generalized to segmentation networks. Furthermore, these networks were shown to learn universal feature representations of images, and can be generalized for many tasks. This suggests the possibility that single, generalized perception networks may be able to tackle all the different perception tasks required for an ADS.
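The effect of dilation can be shown with a one-dimensional sketch. A dilated convolution applies the same kernel weights to inputs spaced `dilation` samples apart, which enlarges the receptive field without adding parameters. The function names are hypothetical. As in most deep learning frameworks, the kernel is not flipped and no padding is applied.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D dilated convolution (cross-correlation form)."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(signal) - span)]


def receptive_field(kernel_size, dilation):
    """Number of input samples covered by one output value."""
    return (kernel_size - 1) * dilation + 1
```

For example, a three-tap kernel covers 3 input samples with `dilation=1` and 5 samples with `dilation=2`, while still using only three weights.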
3) 3D Object Detection: Given their affordability, availability and widespread research, the camera is used by nearly all algorithms presented so far as the primary perception modality. However, cameras have limitations that are critical to ADS. Aside from illumination, which was previously discussed, camera-based object detection occurs in the projected image space and therefore the scale of the scene is unknown. To make use of this information for dynamic driving tasks like obstacle avoidance, it is necessary to bridge the gap from 2D image-based detection to the 3D metric space. Depth estimation is therefore necessary, which is in fact possible with a single camera [156], though stereo or multi-view systems are more robust [157]. These algorithms necessarily need to solve an expensive image matching problem, which adds a significant amount of processing cost to an already complex perception pipeline.
A relatively new sensing modality, the 3D lidar, offers an alternative for 3D perception. The 3D data collected inherently solves the scale problem, and since lidars have their own emission source, they are far less dependent on lighting conditions and less susceptible to inclement weather. The sensing modality collects sparse 3D points representing the surfaces of the scene, as shown in Figure 10, which are challenging to use for object detection and classification. The appearance of objects changes with range, and after some distance, very few data points per object are available for detection. This poses some challenges for detection, but since the data is a direct representation of the world, it is more easily separable. Traditional methods often used Euclidean clustering [158] or region-growing methods [159] for grouping points into objects. This approach has been made much more robust through various filtering techniques, such as ground filtering [160] and map-based filtering [161]. We implemented a 3D object detection pipeline to get clustered objects from raw point cloud input. An example of this process is shown in Figure 10.
As with image-based methods, machine learning has also recently taken over 3D detection methods. These methods have also notably been applied to RGB-D cameras [162], which produce similar, but colored, point clouds; due to their limited range and unreliability outdoors, RGB-D cameras have not been used for ADS applications. A 3D representation of point data through a 3D occupancy grid, called a voxel grid, was first applied for object detection in RGB-D data [162]. Shortly thereafter, a similar approach was used on point clouds created by lidars [163]. Inspired by image-based methods, 3D CNNs are used, despite being computationally very expensive.
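The traditional pipeline of ground filtering followed by Euclidean clustering can be sketched as below. This is not the pipeline implemented on our platform. The flat-ground height threshold is a crude stand-in for the ground filtering of [160], the naive O(n^2) neighbor search replaces the k-d tree a practical system would use, and all names and thresholds are assumptions.

```python
import math


def remove_ground(points, ground_z=0.2):
    """Drop points at or below an assumed flat ground height (metres)."""
    return [p for p in points if p[2] > ground_z]


def euclidean_clusters(points, tolerance=0.5, min_size=2):
    """Group points whose chained neighbor distance is within the tolerance.

    Returns sorted lists of point indices; clusters smaller than min_size
    are discarded as noise.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        queue, cluster = [seed], [seed]
        while queue:
            i = queue.pop()
            # Naive neighbor search over all points not yet assigned.
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= tolerance]
            for j in near:
                unvisited.remove(j)
            queue.extend(near)
            cluster.extend(near)
        if len(cluster) >= min_size:
            clusters.append(sorted(cluster))
    return sorted(clusters)
```

Because distant objects return very few points, `min_size` trades missed far-away objects against false clusters from noise. This is the range-dependent sparsity issue noted above.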
The first convincing results for point cloud-only 3D bounding box estimation were produced by VoxelNet [164]. Instead of hand-crafting input features computed during the discretization process, VoxelNet learned an encoding from raw point cloud data to voxel grid. Their voxel feature encoder (VFE) uses a fully connected neural network to convert the variable number of points in each occupied voxel to a feature vector of fixed size. The voxel grid encoded with feature vectors was then used as input to an RPN, as described above, for multi-class object detection. This work was then improved both in terms of accuracy and computational efficiency by SECOND [126] by exploiting the natural sparsity of lidar data. We employed SECOND and a sample result is shown in Figure 9. Several algorithms have been produced recently, with accuracy constantly improving as shown in Table VII, yet the computational complexity of 3D convolutions remains an issue for real-time use.
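The sketch below illustrates how a variable number of points per voxel can be reduced to a fixed-size vector. It is only a stand-in for the VFE of [164]: the learned fully connected layers are omitted, and element-wise max pooling is applied directly to simple per-point features (coordinates and offsets to the voxel centroid). The voxel size and function names are assumptions.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size=(0.5, 0.5, 0.5)):
    """Group 3D points by the index of the voxel they fall into."""
    voxels = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / s) for c, s in zip(p, voxel_size))
        voxels[key].append(p)
    return dict(voxels)


def encode_voxel(pts):
    """Fixed-size (6-value) descriptor for any number of points in a voxel.

    Per-point feature: (x, y, z, x - cx, y - cy, z - cz) with (cx, cy, cz)
    the voxel centroid; the points are then merged by element-wise max.
    A learned encoder would transform the per-point features first.
    """
    n = len(pts)
    centroid = [sum(p[a] for p in pts) / n for a in range(3)]
    per_point = [tuple(p) + tuple(p[a] - centroid[a] for a in range(3))
                 for p in pts]
    return tuple(max(f[k] for f in per_point) for k in range(6))
```

Only occupied voxels appear in the dictionary returned by `voxelize`. Methods such as SECOND [126] exploit this sparsity to avoid computing on empty space.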
Another option for lidar-based perception is 2D projection of point cloud data. There are two main representations of point cloud data in 2D, the first being a so-called depth image shown in Figure 14, largely inspired by camera-based methods that perform 3D object detection through depth estimation [165] and methods that operate on RGB-D data [166]. The VeloFCN network [167] proposed using a single-channel depth image as input to a shallow, single-stage convolutional neural network which produced 3D vehicle proposals, with many other algorithms adopting this approach. Another use of depth images was shown for semantic classification of lidar points [168].
The other 2D projection that has seen increasing popularity, in part due to the new KITTI benchmark, is projection to a bird's eye view (BV) image. This is a top-view image of point clouds, as shown in Figure 15. As such, bird's eye view images necessarily discretize space purely in 2D, so lidar points which vary in height alone necessarily occlude each other. The MV3D algorithm [169] used camera images, depth images, as well as multi-channel BV images, with each channel corresponding to a different range of heights so as to minimize these occlusions. Several other works have reused camera-based algorithms and trained efficient networks for 3D object detection on 2D BV images [170]- [173]. State-of-the-art algorithms are currently being evaluated on the KITTI dataset [174] and nuScenes dataset [175] as they offer labeled 3D scenes. Table VII shows the leading methods on the KITTI benchmark, alongside detection times. 2D methods are far less computationally expensive, but recent methods that take point sparsity into account [126] are real-time viable and rapidly approaching the accuracies necessary for integration in ADS.
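A multi-channel BV projection can be sketched as follows. Each channel covers one height slice, and each cell stores the largest height above the lower edge of its slice. The ranges, the cell size, the slice edges and the choice of stored value are assumptions for illustration and are not the exact encoding of [169].

```python
def bev_image(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0), cell=1.0,
              z_edges=(0.0, 1.0, 2.0)):
    """Project 3D points to a multi-channel bird's eye view grid.

    Returns image[channel][row][col]; channel k covers heights in
    [z_edges[k], z_edges[k + 1]) and stores the maximum height above the
    lower edge of that slice.
    """
    cols = int((x_range[1] - x_range[0]) / cell)
    rows = int((y_range[1] - y_range[0]) / cell)
    channels = len(z_edges) - 1
    image = [[[0.0] * cols for _ in range(rows)] for _ in range(channels)]
    for x, y, z in points:
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue  # outside the region covered by the image
        col = int((x - x_range[0]) / cell)
        row = int((y - y_range[0]) / cell)
        for ch in range(channels):
            if z_edges[ch] <= z < z_edges[ch + 1]:
                height = z - z_edges[ch]
                image[ch][row][col] = max(image[ch][row][col], height)
    return image
```

Two points above the same ground cell but in different height slices land in different channels. This is how the multi-channel layout reduces the occlusion caused by a purely 2D discretization.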
Radar: Radar sensors have already been used for various perception applications, in different types of vehicles, with different models operating at complementary ranges. While not as accurate as lidar, radar can detect objects at long range and estimate their velocity [112]. The lack of precision in estimating the shape of objects is a major drawback when it is used in perception systems [47]; the resolution is simply too low. As such, it can be used for range estimation to large objects like vehicles, but it is challenging for pedestrians or static objects.
Another issue is the very limited field of view of most radars, which forces the use of a complicated array of radar sensors to cover the full field of view. Nevertheless, radars have seen widespread use as an ADAS component, for applications including proximity warning and adaptive cruise control [144]. While radar and lidar are often seen as competing sensing modalities, they will likely be used in tandem in fully automated driving systems. Radars offer very long range, low cost and robustness to poor weather, while lidars offer precise object localization capabilities, as discussed in Section IV.
Sonar devices are similar to radar, though their extremely short range of < 2 m and poor angular resolution limit their use to very near obstacle detection [144].

B. Object Tracking
Object tracking is also often referred to as multiple object tracking (MOT) [180] and detection and tracking of multiple objects (DATMO) [181]. For fully automated driving in complex and high speed scenarios, estimating location alone is insufficient. It is necessary to estimate dynamic objects' heading and velocity so that a motion model can be applied to track the object over time and predict future trajectory to avoid collisions. These trajectories must generally be estimated in the vehicle frame to be used by planning, so range information must be obtained through multiple camera systems, lidars or radar sensors. 3D lidars are often used for their precise range information and large field of view, allowing tracking over longer periods of time. To better cope with the limitations and uncertainties of different sensing modalities, a sensor fusion strategy is often used for tracking [43].
Fig. 11: A scene with several tracked pedestrians and a cyclist with a basic particle filter on an urban road intersection. Past trajectories are shown in white with current heading and speed shown by the direction and magnitude of the arrow, sample collected by our data collection platform.
Commonly used object trackers rely on simple data association techniques followed by traditional filtering methods. When objects are tracked in 3D space at high frame rate, nearest neighbor methods are often sufficient for establishing associations between objects. Image-based methods, however, need to establish some appearance model, which may consider the use of color histograms, gradients and other features such as KLT to evaluate the similarity [182]. Point cloud based methods may also use similarity metrics such as point density and Hausdorff distance [161], [183]. Since association errors are always a possibility, multiple hypothesis tracking algorithms [184] are often employed, which ensures tracking algorithms can recover from poor data association at any single time step. Using occupancy maps as a frame for all sensors to contribute to and then doing data association in that frame is common, especially when using multiple sensors [185]. To obtain smooth dynamics, the detection results are filtered by traditional Bayes filters. Kalman filtering is sufficient for simple linear models, while the extended and unscented Kalman filters [186] are used to handle nonlinear dynamic models [187]. We implemented a basic particle filter based object-tracking algorithm, and an example of tracked pedestrians in contrasting camera and 3D lidar perspectives is shown in Figure 11.
Physical models for the object being tracked are also often used for more robust tracking. In that case, non-parametric methods such as particle filters are used, and physical parameters such as the size of the object are tracked alongside dynamics [188]. More involved filtering methods such as Rao-Blackwellized particle filters have also been used to keep track of both dynamic variables and vehicle geometry variables for an L-shape vehicle model [189]. Various models have been proposed for vehicles and pedestrians, while some models generalize to any dynamic object [190].
Finally, deep learning has also been applied to the problem of tracking, particularly for images. Tracking in monocular images was achieved in real-time through a CNN-based method [191], [192]. Multi-task networks which estimate object dynamics are also emerging [193], which further suggests that generalized networks tackling multiple perception tasks may be the future of ADS perception.
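For the simple linear case, the filtering step can be illustrated with a one-dimensional constant-velocity Kalman filter that estimates position and velocity from position measurements only. This is a generic textbook sketch, not the particle filter tracker implemented on our platform. The noise values `q` and `r`, the initial covariance and the function names are assumptions.

```python
def kf_predict(state, cov, dt, q):
    """Constant-velocity prediction for state (position, velocity)."""
    p, v = state
    (a, b), (c, d) = cov
    new_state = (p + v * dt, v)
    # F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
    new_cov = ((a + dt * (b + c) + dt * dt * d + q, b + dt * d),
               (c + dt * d, d + q))
    return new_state, new_cov


def kf_update(state, cov, z, r):
    """Correct the prediction with a position measurement z of variance r."""
    p, v = state
    (a, b), (c, d) = cov
    innovation = z - p
    s = a + r                  # innovation variance
    k_p, k_v = a / s, c / s    # Kalman gain for H = [1, 0]
    new_state = (p + k_p * innovation, v + k_v * innovation)
    # (I - K H) P
    new_cov = (((1 - k_p) * a, (1 - k_p) * b),
               (c - k_v * a, d - k_v * b))
    return new_state, new_cov


def track(measurements, dt=0.1, q=0.01, r=0.25):
    """Filter a sequence of position measurements; returns the final state."""
    state = (measurements[0], 0.0)           # velocity initially unknown
    cov = ((10.0, 0.0), (0.0, 10.0))         # large initial uncertainty
    for z in measurements[1:]:
        state, cov = kf_predict(state, cov, dt, q)
        state, cov = kf_update(state, cov, z, r)
    return state
```

The velocity is never measured directly. The filter infers it from successive positions, and that estimate is what allows a tracked object's future trajectory to be predicted.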

C. Road and Lane Detection
Bounding box estimation methods previously covered are useful for defining some objects of interest but are inadequate for continuous surfaces like roads. Determining the drivable surface is critical for ADS and has been specifically researched as a subset of the detection problem. While the drivable surface can be determined through semantic segmentation, automated vehicles need to understand road semantics to properly negotiate the road. An understanding of lanes, and how they are connected through merges and intersections, remains a challenge from the perspective of perception. In this section, we provide an overview of current methods used for road and lane detection, and refer the reader to in-depth surveys of traditional methods [194] and the state-of-the-art methods [195], [196].
This problem is usually subdivided into several tasks, each unlocking some level of automation. The simplest is determining the drivable area from the perspective of the ego-vehicle. The road can then be divided into lanes, and the vehicle's host lane can be determined. Host lane estimation over a reasonable distance allows ADAS technology such as lane departure warning, lane keeping and adaptive cruise control [194], [198]. Even more challenging is determining other lanes and their direction [199], and finally understanding complex semantics, as in their current and future direction, or merging and turning lanes [43]. These ADAS or ADS technologies have different criteria in terms of task, detection distance and reliability rates, but fully automated driving will require a complete, semantic understanding of road structures and the ability to detect several lanes at long ranges [195]. Annotated maps as shown in Figure 8 are extremely useful for understanding lane semantics.
Current methods on road understanding typically first rely on exteroceptive data preprocessing. When cameras are used, this usually means performing image color corrections to normalize lighting conditions [200]. For lidar, several filtering methods can be used to reduce clutter in the data such as ground extraction [160] or map-based filtering [161]. For any sensing modality, identifying dynamic objects which conflict with the static road scene is an important pre-processing step. Then, road and lane feature extraction is performed on the corrected data. Color statistics and intensity information [201], gradient information [202], and various other filters have been used to detect lane markings. Similar methods have been used for road estimation, where the usual uniformity of roads and elevation gap at the edge allows for region growing methods to be applied [203]. Stereo camera systems [204], as well as 3D lidars [201], have been used to determine the 3D structure of roads directly. More recently, machine learning-based methods which either fuse maps with vision [196] or use fully appearance-based segmentation [205] have been used.
Fig. 12: Assessing the overall risk level of driving scenes. We employed an open-source (see footnote 1) deep spatiotemporal video-based risk detection framework [197] to assess the image sequences shown in this figure.
Once surfaces are estimated, model fitting is used to establish the continuity of the road and lanes. Geometric fitting through parametric models such as lines [206] and splines [201] has been used, as well as non-parametric continuous models [207]. Models that assume parallel lanes have been used [198], and more recently models integrating topological elements such as lane splitting and merging were proposed [201].
Temporal integration completes the road and lane segmentation pipeline. Here, vehicle dynamics are used in combination with a road tracking system to achieve smooth results. Dynamic information can also be used alongside Kalman filtering [198] or particle filtering [204] to achieve smoother results.
Road and lane estimation is a well-researched field and many methods have already been integrated successfully for lane keeping assistance systems. However, most methods remain riddled with assumptions and limitations, and truly general systems which can handle complex road topologies have yet to be developed. Through standardized road maps which encode topology and emerging machine learning-based road and lane classification methods, robust systems for driving automation are slowly taking shape.
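The simplest parametric case, fitting a straight line to detected lane-marking points, can be sketched with closed-form least squares. The coordinate convention (longitudinal distance `s` ahead of the vehicle, lateral offset `d`) and the function name are assumptions. Curved lanes would require the spline or non-parametric models cited above.

```python
def fit_lane_line(points):
    """Least-squares fit of d = slope * s + intercept.

    points: iterable of (s, d) pairs, with s the longitudinal distance and
    d the lateral offset of a lane-marking detection (assumed convention).
    """
    pts = list(points)
    n = len(pts)
    if n < 2:
        raise ValueError("at least two points are required")
    mean_s = sum(s for s, _ in pts) / n
    mean_d = sum(d for _, d in pts) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in pts)
    if var_s == 0.0:
        raise ValueError("points must differ in longitudinal distance")
    slope = sum((s - mean_s) * (d - mean_d) for s, d in pts) / var_s
    return slope, mean_d - slope * mean_s
```

Fitting the left and right markings separately and comparing the two slopes gives a quick check of the parallel-lane assumption made by models such as [198].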

VI. ASSESSMENT
A robust ADS should constantly evaluate the overall risk level of the situation and predict the intentions of the human drivers and pedestrians around it. The lack of an acute assessment mechanism can lead to accidents. This section discusses assessment under three subcategories: overall risk and uncertainty assessment, human driving behavior assessment, and driving style recognition.

A. Risk and uncertainty assessment
Overall assessment can be summarized as quantifying the uncertainties and the risk level of the driving scene. It is a promising methodology that can increase the safety of ADS pipelines [11].
Using Bayesian methods to quantify and measure the uncertainties of deep neural networks was proposed in [208]. A Bayesian deep learning architecture was designed for propagating uncertainty throughout an ADS pipeline, and its advantage over conventional approaches was shown in a hypothetical scenario [11]. In summary, each module conveys and accepts probability distributions instead of exact outcomes throughout the pipeline, which increases the overall robustness of the system.
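The idea of passing distributions between modules can be sketched with a small Monte Carlo example. A perception module is assumed to report the distance to a lead vehicle as a Gaussian, and a downstream module turns it into a distribution over time-to-collision rather than a single number. The Gaussian model, the function name and all numeric values are illustrative assumptions; this is not the Bayesian deep learning architecture of [11].

```python
import random
import statistics


def time_to_collision_distribution(mean_dist, std_dist, closing_speed,
                                   n_samples=10000, seed=0):
    """Propagate a Gaussian distance estimate through a time-to-collision model.

    Returns the sample mean and standard deviation of time-to-collision in
    seconds. Distances are clipped at zero because a negative gap is not
    physically meaningful.
    """
    rng = random.Random(seed)
    samples = [
        max(rng.gauss(mean_dist, std_dist), 0.0) / closing_speed
        for _ in range(n_samples)
    ]
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical scene: 20 m gap with 2 m standard deviation, closing at 10 m/s.
ttc_mean, ttc_std = time_to_collision_distribution(20.0, 2.0, 10.0)
```

A planner that receives both values can act more conservatively when the spread is large, which a point estimate alone would not reveal.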
An alternative approach is to assess the overall risk level of the driving scene separately, i.e., outside the pipeline. Sensory inputs were fed into a risk inference framework in [74], [209] to detect unsafe lane change events using Hidden Markov Models (HMMs) and language models. Recently, a deep spatiotemporal network that infers the overall risk level of a driving scene was introduced in [197]. An open-source implementation of this method is available (footnote 1). We employed this method to assess the risk level of a lane change as shown in Figure 12.

B. Surrounding driving behavior assessment
Understanding the intentions of surrounding human drivers is most relevant to medium- to long-term prediction and decision making. In order to increase the prediction horizon of surrounding object behavior, human traits should be considered and incorporated into the prediction and evaluation steps. Understanding surrounding driver intention from the perspective of an ADS is not yet common practice in the field; as such, the state of the art has not been established.
In [210], a target vehicle's future behavior was predicted with a hidden Markov model (HMM) and the prediction time horizon was extended by 56% by learning human driving traits. The proposed system tagged observations with predefined maneuvers. Then, the features of each type were learned in a data-centric manner with HMMs. Another learning-based approach was proposed in [211], where a Bayesian network classifier was used to predict maneuvers of individual drivers on highways. A framework for long-term driver behavior prediction using a combination of a hybrid state system and HMM was introduced in [212]. Surrounding vehicle information was integrated with ego-behavior through a symbolization framework in [74], [209]. Detection of dangerous cut-in maneuvers was achieved with an HMM framework that was trained on safe and dangerous data in [213]. Lane change events were predicted 1.3 seconds in advance with support vector machines (SVM) and Bayesian filters [214].
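HMM-based maneuver recognition of this kind can be sketched as follows: one HMM is kept per maneuver, an observation sequence is scored under each model with the forward algorithm, and the most likely maneuver is selected. The two models, their probabilities and the two-symbol observation alphabet are hypothetical values chosen for illustration and are not taken from [210] or [213].

```python
def forward_likelihood(obs, start, trans, emit):
    """Likelihood of a discrete observation sequence under an HMM (forward algorithm)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical observation symbols: 0 = small lateral drift, 1 = large lateral drift.
# Hypothetical two-state models; in practice the parameters are learned from data.
LANE_KEEP = dict(start=[0.9, 0.1],
                 trans=[[0.9, 0.1], [0.5, 0.5]],
                 emit=[[0.95, 0.05], [0.6, 0.4]])
LANE_CHANGE = dict(start=[0.5, 0.5],
                   trans=[[0.6, 0.4], [0.1, 0.9]],
                   emit=[[0.7, 0.3], [0.1, 0.9]])


def classify_maneuver(obs):
    """Return the maneuver whose HMM assigns the highest likelihood to obs."""
    scores = {
        "lane_keep": forward_likelihood(obs, **LANE_KEEP),
        "lane_change": forward_likelihood(obs, **LANE_CHANGE),
    }
    return max(scores, key=scores.get)
```

For long sequences the forward recursion is usually carried out in log space or with per-step scaling to avoid numerical underflow.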
The main challenges are the short observation window for understanding the intention of humans and real-time high-frequency computation requirements. Most of the time the ADS can observe a surrounding vehicle only for seconds. Complicated driving behavior models that require longer observation periods cannot be utilized under these circumstances.

C. Driving style recognition
In 2016, Google's self-driving car collided with an oncoming bus [16] during a lane change. The ADS assumed that the bus driver was going to yield and let the self-driving car merge in. However, the bus driver accelerated instead. This accident could have been prevented if the ADS had understood this particular bus driver's individual driving style and predicted the driver's behavior accordingly.
Driving style is a broad term without an established common definition. A thorough review of the literature can be found in [215] and a survey of driving style recognition algorithms for intelligent vehicles is given in [216]. Readers are referred to these papers for a complete review.
Typically, driving style is defined with respect to either aggressiveness [217]-[221] or fuel consumption [222]-[226]. For example, [227] introduced a rule-based model that classified driving styles with respect to jerk. This model decides whether a maneuver is aggressive or calm by a set of rules and jerk thresholds. Drivers were categorized with respect to their average speed in [228]. In conventional methods, the total number and meaning of driving style classes are predefined. The vast majority of the driving style recognition literature uses two or three classes. Driving style was categorized into three classes in [229]-[231] and into two classes in [74], [209], [217], [218], [222]. Representing driving style in a continuous domain is uncommon, but there are some studies. In [232], driving style was depicted as a continuous value between -1 and +1, which stand for mild and active respectively. Details of the classification are given in Table VIII.
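A rule-based jerk classifier of the kind described above can be sketched in a few lines. Jerk is approximated by finite differences of sampled longitudinal acceleration, and a maneuver is labeled aggressive when the peak absolute jerk exceeds a threshold. The threshold of 2.0 m/s^3 and the function name are assumptions for illustration; they are not the rules or values used in [227].

```python
def classify_by_jerk(accelerations, dt, jerk_threshold=2.0):
    """Label a maneuver 'aggressive' or 'calm' from sampled acceleration.

    accelerations: longitudinal acceleration samples in m/s^2
    dt: sampling period in seconds
    jerk_threshold: hypothetical limit in m/s^3 separating the two classes
    """
    jerks = [abs(b - a) / dt for a, b in zip(accelerations, accelerations[1:])]
    peak_jerk = max(jerks, default=0.0)
    return "aggressive" if peak_jerk > jerk_threshold else "calm"


gentle_braking = classify_by_jerk([0.0, 0.1, 0.2, 0.1], dt=0.1)
harsh_braking = classify_by_jerk([0.0, 0.5, 1.5], dt=0.1)
```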
More recently, machine learning-based approaches have been utilized for driving style recognition. Principal component analysis was used and five distinct driving classes were detected in an unsupervised manner in [233], and a GMM-based driver model was used to identify individual drivers with success in [234]. Car-following and pedal operation behavior were investigated separately in the latter study. Another GMM-based driving style recognition model was proposed for electric vehicle range prediction in [235]. In [217], aggressive event detection with dynamic time warping was presented, where the authors reported a high success score. Bayesian approaches were utilized in [236] for modeling driving style on roundabouts and in [237] to assess critical braking situations. Bag-of-words and K-means clustering were used to represent individual driving features in [238]. A stacked autoencoder was used to extract unique driving signatures from different drivers, and then macro driving style centroids were found with clustering [239]. Another autoencoder network was used to extract road-type specific driving features [240]. Similarly, driving behavior was encoded in a 3-channel RGB space with a deep sparse autoencoder to visualize individual driving styles [241].
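Dynamic time warping, mentioned above for aggressive event detection, compares two signals that may be stretched or compressed in time. The sketch below is the textbook dynamic programming formulation on scalar sequences with an absolute-difference cost; it is not the specific pipeline of [217], and the template-matching use is an assumption for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two scalar sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

A recorded acceleration trace could, for example, be compared against stored templates of aggressive events and flagged when the distance to any template falls below a chosen limit.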
A successful integration of driving style recognition into a real-world ADS pipeline has not been reported yet. However, these studies are promising, and they point to a possible new direction in ADS development.

VII. PLANNING AND DECISION MAKING
A brief overview of planning and decision making is given here. Readers are referred to [22], [27], [244] for more detailed studies of this topic.

A. Global planning
Planning can be categorized into two sub-tasks: global route planning and local path planning. The global planner is responsible for finding the route on the road network from the origin to the final destination. The user usually defines the final destination. Global navigation is a well-studied subject, and high performance has been an industry standard for more than a decade. Almost all modern production cars are equipped with navigation systems that utilize GPS and offline maps to plan a global route.
Route planning is formulated as finding the point-to-point shortest path in a directed graph, and conventional methods are examined under four categories in [244]: goal-directed, separator-based, hierarchical and bounded-hop techniques. A* search [245] is a standard goal-directed path planning algorithm and has been used extensively in various fields for almost 50 years.
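As a concrete illustration of goal-directed search, the following is a minimal A* sketch on a small directed graph. The graph, node coordinates and edge weights are hypothetical. The straight-line distance to the goal is used as the heuristic, which assumes that every edge weight is at least the Euclidean distance between its endpoints so that the heuristic never overestimates.

```python
import heapq
import math


def a_star(graph, coords, start, goal):
    """A* shortest path on a directed graph.

    graph: dict mapping node -> list of (neighbor, edge_weight)
    coords: dict mapping node -> (x, y), used for the straight-line heuristic
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    def heuristic(node):
        return math.dist(coords[node], coords[goal])

    open_heap = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while open_heap:
        _, cost, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, math.inf):
            continue  # stale queue entry, a cheaper route was found already
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                heapq.heappush(open_heap, (new_cost + heuristic(neighbor),
                                           new_cost, neighbor, path + [neighbor]))
    return None, math.inf


# Hypothetical road graph: intersections with planar coordinates.
COORDS = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1)}
GRAPH = {"A": [("B", 1.0), ("C", 2.5)],
         "B": [("C", 1.0), ("D", 1.5)],
         "C": [("D", 1.0)]}
route, route_cost = a_star(GRAPH, COORDS, "A", "D")
```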
The main idea of separator-based techniques is to remove a subset of vertices [246] or arcs from the graph and compute an overlay graph over it. Using the overlay graph to calculate the shortest path results in faster queries.
Fig. 13: Global plan and the local paths. The annotated vector map shown in Figure 8 was utilized by the planner. We employed OpenPlanner [243], a graph-based planner, here to illustrate a typical planning approach.
TABLE IX: Local planning techniques
Sampling-based: RPP [252], RRT [253], PRM [254]. Fast solution but jerky.
Curve interpolation: clothoids [255], polynomials [256], Bezier [257], splines [100]. Smooth but slow.
Numerical optimization: numerical non-linear optimization [258], Newton's method [259]. Increases computational cost but improves quality.
Deep learning: FCN [260], segmentation network [261]. High imitation performance, but no hard-coded safety measures.
Hierarchical techniques take advantage of the road hierarchy. For example, the road hierarchy in the US can be listed from top to bottom as freeways, arterials, collectors and local roads. For a route query, the importance of hierarchy increases as the distance between origin and destination gets longer. The shortest path may no longer be the fastest or the most desirable route. Moving away from the destination to take the closest highway ramp, thus making the route a bit longer, may result in a shorter travel time than following the shortest path over local roads. The Contraction Hierarchies (CH) method was proposed in [247] for exploiting road hierarchy.
Precomputing distances between selected vertices and utilizing them at query time is the basis of bounded-hop techniques. Precomputed shortcuts can be utilized partly or exclusively for navigation. However, the naive approach of precomputing the routes between every pair of vertices is impractical in most cases with large networks. One possible solution is to use hub labeling (HL) [248]. This approach also requires preprocessing. A label associated with a vertex consists of nearby hub vertices and the distances to them. These labels satisfy the condition that at least one shared hub vertex must exist between the labels of any given two vertices. HL has the fastest query time among route planning algorithms [244], at the expense of high storage usage.
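The query side of hub labeling can be sketched briefly. Assuming the labels have already been precomputed, the distance between two vertices is the minimum, over the hubs they share, of the two stored distances added together. The toy labels below describe a hypothetical three-vertex path graph with symmetric distances; on directed road networks each vertex would carry separate forward and backward labels, which this sketch omits.

```python
def hub_label_query(labels, source, target):
    """Distance between two vertices from precomputed hub labels.

    labels: dict mapping vertex -> {hub: distance between vertex and hub}
    Returns inf if the two labels share no hub.
    """
    src_label, dst_label = labels[source], labels[target]
    shared_hubs = src_label.keys() & dst_label.keys()
    return min((src_label[h] + dst_label[h] for h in shared_hubs),
               default=float("inf"))


# Hypothetical labels for the path graph a -(2)- b -(3)- c, with b acting as hub.
LABELS = {
    "a": {"a": 0, "b": 2},
    "b": {"b": 0},
    "c": {"c": 0, "b": 3},
}
distance_a_to_c = hub_label_query(LABELS, "a", "c")
```

The query touches only two small labels and never explores the graph, which is why queries are fast, while storing a label for every vertex accounts for the high storage usage.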
Combinations of the above algorithms are popular in state-of-the-art systems. For example, [249] combined a separator with a bounded-hop method and created the Transit Node Routing with Arc Flags (TNR+AF) algorithm. Modern route planners can answer a query in milliseconds.

B. Local planning
The objective of the local planner is to execute the global plan without failing. In other words, in order to complete its trip, the ADS must find trajectories that avoid obstacles and satisfy optimization criteria in the configuration space (C-space), given a starting and a destination point. A detailed local planning review is presented in [23], where the taxonomy of motion planning was divided into four groups: graph-based planners, sampling-based planners, interpolating curve planners and numerical optimization approaches. After a summary of these conventional planners, the emerging deep learning-based planners are introduced at the end of this section.
Graph-based local planners use the same techniques as graph-based global planners, such as Dijkstra [250] and A* [245], which output discrete paths rather than continuous ones. This can lead to jerky trajectories [23]. A more advanced graph-based planner is the state lattice algorithm. Like all graph-based methods, the state lattice discretizes the decision space. High-dimensional lattice nodes, which typically encode 2D position, heading and curvature [251], are used to create a grid first. Then, the connections between the nodes are precomputed with an inverse path generator to build the state lattice. During the planning phase, a cost function, which usually considers proximity to obstacles and deviation from the goal, is utilized for finding the best path with the precomputed path primitives. State lattices can handle high dimensions and are well suited to local planning in dynamic environments; however, the computational load is high and the discretization resolution limits the planner's capacity [23].
Fig. 14: A depth image produced from synthetic lidar data, generated in the CARLA [264] simulator.
A detailed overview of Sampling Based Planning (SBP) methods can be found in [262]. In summary, SBP tries to build the connectivity of the C-space by randomly sampling it. The Randomized Potential Planner (RPP) [252] is one of the earliest SBP approaches, where random walks are generated for escaping local minima. The probabilistic roadmap method (PRM) [254] and the rapidly-exploring random tree (RRT) [253] are the most commonly used SBP algorithms. PRM first samples the C-space during its learning phase and then makes a query with the predefined origin and destination points on the roadmap. RRT, on the other hand, is a single-query planner. The path between the start and goal configurations is incrementally built with random tree-like branches. RRT is faster than PRM and both are probabilistically complete [253], which means that, if a path satisfying the given conditions exists, the probability of finding it approaches one as the runtime increases. The main disadvantage of SBP is, again, the jerky trajectories [23].
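The incremental tree growth of RRT can be sketched in a 2D workspace with circular obstacles. This is a bare-bones illustration: the workspace bounds, step size, goal bias and obstacle layout are assumptions, only each new node is collision-checked (a real planner would also check the connecting edge), and a vehicle planner would additionally respect kinematic constraints.

```python
import math
import random


def rrt(start, goal, obstacles, bounds=(0.0, 10.0), step=0.5,
        goal_tolerance=0.5, goal_bias=0.1, max_iters=5000, seed=0):
    """Grow a rapidly-exploring random tree from start until it reaches goal.

    obstacles: list of (center_x, center_y, radius) circles
    Returns a list of (x, y) points from start to near goal, or None.
    """
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}

    def collides(point):
        return any(math.dist(point, (cx, cy)) <= r for cx, cy, r in obstacles)

    for _ in range(max_iters):
        # Sample the goal occasionally to pull the tree toward it.
        if rng.random() < goal_bias:
            sample = goal
        else:
            sample = (rng.uniform(*bounds), rng.uniform(*bounds))
        nearest = min(range(len(nodes)),
                      key=lambda i: math.dist(nodes[i], sample))
        gap = math.dist(nodes[nearest], sample)
        if gap == 0.0:
            continue
        # Steer from the nearest node toward the sample by at most one step.
        scale = min(step, gap) / gap
        new = (nodes[nearest][0] + scale * (sample[0] - nodes[nearest][0]),
               nodes[nearest][1] + scale * (sample[1] - nodes[nearest][1]))
        if collides(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = nearest
        if math.dist(new, goal) <= goal_tolerance:
            path, index = [], len(nodes) - 1
            while index is not None:
                path.append(nodes[index])
                index = parent[index]
            return path[::-1]
    return None
```

The returned path is a chain of short straight segments, which illustrates why sampling-based results are usually smoothed before being used as a trajectory.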
Interpolating curve planners fit a curve to a known set of points [23], e.g., waypoints generated from the global plan or a discrete set of future points from another local planner. The main obstacle avoidance strategy is to interpolate new collision-free paths that first deviate from, and then re-enter, the initially planned trajectory. The new path is generated by fitting a curve to a new set of points: an exit point from the currently traversed trajectory, newly sampled collision-free points, and a re-entry point on the initial trajectory. The resultant trajectory is smooth; however, the computational load is usually higher compared to other methods. Various curve families are commonly used, such as clothoids [255], polynomials [256], Bezier curves [257] and splines [100].
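The exit and re-entry strategy can be sketched with a cubic Bezier curve, one of the curve families listed above. The exit point and the re-entry point are the end control points, and two intermediate control points pull the path sideways around an obstacle. The coordinates are hypothetical, and collision checking of the resulting curve is left out of the sketch.

```python
def cubic_bezier(p0, p1, p2, p3, n_points=20):
    """Sample n_points along the cubic Bezier curve defined by four control points."""
    points = []
    for k in range(n_points):
        t = k / (n_points - 1)
        u = 1.0 - t
        # Bernstein basis polynomials of degree three.
        b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
        x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
        y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
        points.append((x, y))
    return points


# Hypothetical avoidance maneuver: leave the lane center line at x = 0 m,
# swing up to about 1.1 m sideways, and rejoin the center line at x = 6 m.
detour = cubic_bezier((0.0, 0.0), (2.0, 1.5), (4.0, 1.5), (6.0, 0.0))
```

The curve starts and ends exactly on the original trajectory; its tangent direction at each end is set by the neighboring control point, which is how the planner can influence how smoothly the detour leaves and rejoins the initial path.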
Optimization-based motion planners improve the quality of already existing paths with optimization functions. A* trajectories were optimized with numeric non-linear functions in [258]. The Potential Field Method (PFM) was improved in [259] by solving its inherent oscillation problem with Newton's method while obtaining C1 continuity.
Recently, Deep Learning (DL) and reinforcement learning-based local planners have started to emerge as an alternative. Fully convolutional 3D neural networks can generate future paths from sensory input such as lidar point clouds [260]. An interesting take on the subject is to segment image data with path proposals using a deep segmentation network [261]. Planning a safe path in occluded intersections was achieved in a simulation environment using deep reinforcement learning in [263]. The main difference between end-to-end driving and deep learning-based local planners is the output: the former outputs direct vehicle control signals such as steering and pedal operation, whereas the latter generates a trajectory. This enables DL planners to be integrated into conventional pipelines [28].
Deep learning-based planners are promising, but they are not widely used in real-world systems yet. The lack of hard-coded safety measures, generalization issues and the need for labeled data are some of the issues that need to be addressed.

VIII. HUMAN MACHINE INTERFACE
Vehicles communicate with their drivers/passengers through their HMI module. The nature of this communication greatly depends on the objective, which can be divided into two: primary driving tasks and secondary tasks. The interaction intensity of these tasks depends on the automation level. Whereas a manually operated, level zero conventional car requires constant user input for operation, a level five ADS may need user input only at the beginning of the trip. Furthermore, the purpose of interaction may affect intensity. A shift from executing primary driving tasks to monitoring the automation process raises new HMI design requirements.
There are several investigations, such as [265], [266], of automotive HMI technologies, mostly from the distraction point of view. Manual user interfaces for secondary tasks are more desired than their visual counterparts [265]. The main reason is that vision is absolutely necessary, with no alternative, for primary driving tasks. Visual interface interactions require glances with durations between 0.6 and 1.6 seconds, with a mean of 1.2 seconds [265]. As such, secondary task interfaces that require vision are distracting and detrimental to driving.
Auditory User Interfaces (AUI) are good alternatives to visually taxing HMI designs. AUIs are omni-directional: even if the user is not attending, the auditory cues are hard to miss [267]. The main challenge of audio interaction is automatic speech recognition (ASR). ASR is a very mature field. However, the vehicle domain poses additional challenges, such as low performance caused by uncontrollable cabin conditions like wind and road noise [268]. Beyond simple voice commands, conversational natural language interaction with an ADS is still an unrealized concept with many unsolved challenges [269].
The biggest HMI challenge is at level three and four automation. The user and the ADS need to have a mutual understanding; otherwise, they will not be able to grasp the intentions of each other [266]. The transition from manual to automated driving and vice versa is prone to failure in the state of the art. Recent research showed that drivers exhibit low cognitive load when monitoring automated driving compared to doing a secondary task [284]. Even though some experimental systems can recognize driver activity with a driver-facing camera based on head and eye tracking [285], and prepare the driver for handover with visual and auditory cues [286] in simulation environments, a real-world system with an efficient handover interaction module does not exist at the moment. This is an open problem [287], and future research should focus on delivering better methods to inform and prepare the driver in order to ease the transition [37].

IX. DATASETS AND AVAILABLE TOOLS

A. Datasets and Benchmarks
Datasets are crucial for researchers and developers because most of the algorithms and tools have to be trained and tested before going on the road.
Typically, sensory inputs are fed into a stack of algorithms with various objectives. A common practice is to test and validate these functions separately on annotated datasets. For example, the output of cameras, 2D vision, can be fed into an object detection algorithm to detect surrounding vehicles and pedestrians. Then, this information can be used in another algorithm for planning purposes. Even though these two algorithms are connected in the stack of this example, the object detection part can be worked on and validated separately during the development process. Since computer vision is a well-studied field, there are annotated datasets for object detection and tracking specifically. The existence of these datasets accelerates the development process and enables interdisciplinary research teams to work with each other much more efficiently. For end-to-end systems, the dataset has to include additional ego-vehicle signals, chiefly steering and longitudinal control signals.
As learning approaches emerged, so did training datasets to support them. The PASCAL VOC dataset [288], which grew from 2005 to 2012, was one of the first datasets featuring a large amount of data with classes relevant to ADS. However, the images often featured single objects, in scenes and at scales that are not representative of what is encountered in driving scenarios. In 2012, the KITTI Vision Benchmark [174] remedied this situation by providing a relatively large amount of labeled driving scenes. It remains one of the most widely used datasets for applications related to driving automation. Yet in terms of quantity of data and number of labeled classes, it is far inferior to generic image databases such as ImageNet [132] and COCO [150]. While generic image databases are no doubt useful for training, they lack the adequate context to test the capabilities of ADS. UC Berkeley DeepDrive [271] is a recent dataset with annotated image data. The Oxford RobotCar [49] was used to collect over 1000 km of driving data with six cameras, lidar, GPS and INS in the UK. The dataset is not annotated, though. ApolloScape is a very recent dataset that is not fully public yet [274]. Cityscapes [270] is commonly used as a benchmark set for computer vision algorithms. Mapillary Vistas is a large image dataset with annotations [272]. The TorontoCity benchmark [282] is a very detailed dataset; however, it is not public yet. The nuScenes dataset is the most recent urban driving dataset with lidar and image sensors [175]. Comma.ai has released a part of their dataset [289], which includes 7.25 hours of driving. In DDD17 [85], around 12 hours of driving data is recorded. The LiVi-Set [277] is a new dataset that contains lidar, image and driving behavior data.
Naturalistic driving data is another type of dataset that concentrates on the individual element of driving: the driver. SHRP2 [279] includes driving data from over 3000 volunteer participants over a 3-year collection period. Other naturalistic driving datasets are the 100-Car study [280], euroFOT [281] and NUDrive [278]. Table X shows the comparison of these datasets.

B. Open-source frameworks and simulators
Open-source frameworks are very useful for both researchers and the industry. These frameworks can "democratize" ADS development. Autoware [110], Apollo [290], Nvidia DriveWorks [291] and openpilot [292] are amongst the most used software stacks capable of running an ADS platform in the real world. We utilized Autoware [110] to realize core automated driving functions in this study.
Simulations also have an important place in ADS development. Since the instrumentation of an experimental vehicle still has a high cost and conducting experiments on public road networks is highly regulated, a simulation environment is beneficial for developing certain algorithms and modules before road tests. Furthermore, highly dangerous scenarios such as a collision with a pedestrian can be tested in simulations with ease. CARLA [264] is an urban driving simulator developed for this purpose. TORCS [293] was developed for race track simulation. Some researchers even used computer games such as Grand Theft Auto V [294]. Gazebo [295] is a common simulation environment for robotics. For traffic simulations, SUMO [296] is a widely used open-source platform. Different concepts for integrating real-world measurements into the simulation environment were proposed in [297].

X. CONCLUSIONS
In this survey on automated driving systems, we outlined some of the key innovations as well as existing systems. While the promise of automated driving is enticing and already marketed to consumers, this survey has shown that there remain clear gaps in the research. Several architecture models have been proposed, from fully modular to completely end-to-end, each with its own shortcomings. The optimal sensing modality for localization, mapping and perception is still disagreed upon, algorithms still lack accuracy and efficiency, and the need for a proper online assessment has become apparent. Less than ideal road conditions are still an open problem, as is dealing with inclement weather. Vehicle-to-vehicle communication is still in its infancy, while centralized, cloud-based information management has yet to be implemented due to the complex infrastructure required. Human-machine interaction is an under-researched field with many open problems.
The development of automated driving systems relies on advancements in both scientific disciplines and new technologies. As such, we discussed the recent research developments that are likely to have a significant impact on automated driving technology, either by overcoming the weaknesses of previous methods or by proposing an alternative. This survey has shown that through inter-disciplinary academic collaboration and support from industries and the general public, the remaining challenges can be addressed. With directed efforts towards ensuring robustness at all levels of automated driving systems, safe and efficient roads are just beyond the horizon.

Fig. 4: DAVIS240 events, overlaid on the image (left) and the corresponding RGB image from a different camera (right), collected by our data collection platform at a road crossing near Nagoya University. The motion of the cyclist and vehicle causes brightness changes which trigger events.

Fig. 5: The ADS-equipped Prius of Nagoya University. We have used this vehicle to perform core automated driving functions.

Fig. 10: Outline of a traditional method for object detection from 3D pointcloud data. Various filtering and data reduction methods are used first, followed by clustering. The resulting clusters are shown by the different colored points in the 3D lidar data of pedestrians collected by our data collection platform.

TABLE I: Comparison of survey papers

TABLE II: End-to-end driving architectures

TABLE IV: Onboard sensor setup of ADS equipped vehicles

TABLE V: Localization techniques

TABLE VI: Comparison of 2D bounding box estimation architectures on the test set of ImageNet1K, ordered by Top 5% error. The number of parameters (Num. Params) and the number of layers (Num. Layers) hint at the computational cost of the algorithm.

TABLE VII: Average Precision (AP) in % on the KITTI 3D object detection test set car class, ordered based on moderate category accuracy. These algorithms only use pointcloud data.

TABLE VIII: Driving style categorization

TABLE IX: Local planning techniques