Vision-Based Semantic Segmentation in Scene Understanding for Autonomous Driving: Recent Achievements, Challenges, and Outlooks

Scene understanding plays a crucial role in autonomous driving by utilizing sensory data for contextual information extraction and decision making. Beyond modeling advances, the enabler for vehicles to become aware of their surroundings is the availability of visual sensory data, which expands vehicular perception and realizes vehicular contextual awareness in real-world environments. Research directions for scene understanding pursued by related studies include person/vehicle detection and segmentation, their transition analysis, and lane change and turn detection, among many others. Unfortunately, these tasks seem insufficient to develop fully autonomous vehicles, i.e., vehicles achieving level-5 autonomy and travelling just like human-controlled cars. This latter statement is among the conclusions drawn from this review paper: scene understanding for autonomous cars using vision sensors still requires significant improvements. With this motivation, this survey defines, analyzes, and reviews the current achievements of the scene understanding research area, which mostly relies on computationally complex deep learning models. Furthermore, it covers the generic scene understanding pipeline, investigates the performance reported by the state of the art, reports a time complexity analysis of avant-garde modeling choices, and highlights major triumphs and noted limitations encountered by current research efforts. The survey also includes a comprehensive discussion of the available datasets and of the challenges that, even if lately confronted by researchers, remain open to date. Finally, our work outlines future research directions to welcome researchers and practitioners to this exciting domain.


I. INTRODUCTION
AUTONOMOUS Driving (AD) relies on processed information from numerous sensors installed on the vehicle, which perceive the surroundings and help to understand traffic scenes and control the movements of the vehicle [1], hence playing the role of its eyes and ears. These sensors mostly include high-resolution cameras, radar, and Light Imaging Detection and Ranging (LiDAR) [2], which classify objects via feature extraction and measure the distance to surrounding objects via radio waves and illumination, so as to eventually yield a 3D view of the environment. To avoid collision with on-road obstacles, various other types of sensors have also been deployed in autonomous vehicles, including infrared, sonar, micro radar, ultrasonic, and short-distance sensors. Similarly, vision sensors equip autonomous vehicles with the ability to understand the visuals of the surrounding environment, which includes road lane detection, traffic light analysis, road sign detection and recognition, vehicle detection and tracking, pedestrian detection (both on-road and off-road), and short-term traffic prediction [3]. Visual scene representation and understanding for AD include lane detection, traffic light analysis, traffic sign recognition, detection of surrounding pedestrians and cars, and many other tasks. Accumulating this information provides enhanced and safer instructions for automated actions of the vehicle, such as turning manoeuvres, lane changing, or braking [4].
Fig. 1. Sample segmented images for an autonomous vehicle, helping in scene parsing: (a) exemplifies instance segmentation, where each object from similar classes is segmented into a different color with its own boundary pixels; (b) depicts a semantically segmented image, where objects of similar classes are highlighted in an individual color, without any differentiation.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Among the various sources of information gathered for vehicular decision making, vision sensors data [5] are arguably considered as the most reliable ones [6]. Therefore, this research domain has been extensively studied and widely applied in Intelligent Transportation Systems (ITSs) [7], mostly from a machine learning perspective and by resorting to deep Convolutional Neural Networks (CNNs). Deep CNNs embody a special flavor of neural networks with several functional layers suitable to process images by repetitively extracting model features from the input image, towards optimally achieving better representations. Scene understanding from vision data operates likewise, applying a deep CNN over real-time frames to e.g. interpret a pedestrian location and its distance from the autonomous car. Beyond this simplified generic computer vision-based scene understanding, complex models proposed nowadays are able to generate multiple labelled outputs (e.g. pedestrians and vehicles), as well as their localization.
Scene understanding primarily refers to context extraction from visual data based on different features such as the shapes of objects, their distance from the vehicle, and many other cues, including the size of the objects and their approaching speed. A scene analysis can be achieved by accumulating this information and building a complete scenario of the scene around the vehicle, so that vehicular systems can be informed of, e.g., the presence of humans in front of the car and their distance from the autonomous vehicle. When assessed together, this information supports the actions taken by the autonomous vehicle, where the distinction among various humans, vehicles, buildings, traffic signs, turns, etc. is essential for proper decision making, as visualized in Figure 1. Traditionally, these information streams were extracted in isolation using separate computer vision algorithms [11], which have recently been replaced by CNN-based segmentation mechanisms. A segmentation mechanism annotates the boundaries of various types of objects and assigns different colors to pixels identified as belonging to different objects. Pixel-level labeling may refer to semantic or instance segmentation, as shown in Figure 1: instance-level segmentation assigns a different color to each object, even within the same class (e.g. vehicles), whereas pixel-level semantic segmentation assigns the same color label to all objects of the same class. Among many traditional segmentation strategies [12], the most widely used category is semantic segmentation using deep CNNs, which partitions an ongoing scene into different meaningful elements such as road, cars, pedestrians, and trees, besides other elements present in the vehicular context.

A. Background and Related Works
Semantic segmentation is widely used in AD applications up to the point where proper scene understanding [13] demands a clear distinction between two objects of the same class. For example, all surrounding cars receive the same label in semantic segmentation networks, which still conveys a clear understanding of the scene for further decision making. However, at some point, AD needs instance-level segmentation to deal with various types of traffic stakeholders and their levels of engagement. Traditionally, there are three representative types of semantic segmentation networks, represented in Figure 2: fully convolutional networks (FCNs), the deep fully convolutional neural network architecture for semantic pixel-wise segmentation known as SegNet, and the so-called DeepLab strategy [14], which we briefly revisit next before arriving at the purpose of this manuscript.
To begin with, the FCN [8] architecture is structured in an encoder-decoder formation to extract deep discriminative features for later localization and segmentation tasks. The encoder part comprises standard convolutional and down-sampling layers typically used in CNNs for classification problems, whereas the decoder part uses transposed convolutional layers to up-sample the coarse output feature maps from the bottleneck layers of the architecture, as shown in Figure 2(a). The up-sampling process can be applied at coarse and finer levels of FCNs, i.e., instead of relying only on the traditional last layer output, feature maps can be passed through transposed convolution layer(s) that help produce prediction maps of the same size as the input frame. On the other hand, SegNet is built upon a series of deconvolutional layers that transform the extracted features into class score prediction maps with the same frame size as the input. A SegNet network comprises two functional modules: the first extracts features from the input frame using a CNN, and the second constructs the prediction maps with class scores via a series of transposed convolutions and un-pooling layers [15], ultimately producing pixel-wise segmentation results. This kind of segmentation strategy is also known as an encoder-decoder strategy. Finally, the DeepLab strategy for semantic segmentation utilizes convolutional layers with an up-sampled filter, known as atrous convolutions, together with bilinear interpolation to obtain prediction maps of the same size as the input frame.
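To make the notion of an up-sampled (atrous) filter more concrete, the following minimal NumPy sketch applies a single-channel kernel whose taps are spaced `rate` pixels apart. It is an illustrative assumption of ours, not the DeepLab implementation: the function name `atrous_conv2d`, the absence of padding, and the toy input are all hypothetical choices. The sketch shows that a 3 × 3 kernel with rate 2 covers a 5 × 5 region of the input without adding any parameters.

```python
import numpy as np

def atrous_conv2d(image, kernel, rate=1):
    """'Valid' 2D cross-correlation whose kernel taps are `rate` pixels apart."""
    kh, kw = kernel.shape
    # Effective kernel extent once (rate - 1) holes are inserted between taps.
    eff_h = kh + (kh - 1) * (rate - 1)
    eff_w = kw + (kw - 1) * (rate - 1)
    out_h = image.shape[0] - eff_h + 1
    out_w = image.shape[1] - eff_w + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            # Tap (i, j) reads the input shifted by (i * rate, j * rate).
            out += kernel[i, j] * image[i * rate:i * rate + out_h,
                                        j * rate:j * rate + out_w]
    return out

image = np.arange(49, dtype=float).reshape(7, 7)   # toy 7x7 "frame"
kernel = np.ones((3, 3))                           # toy 3x3 filter

dense = atrous_conv2d(image, kernel, rate=1)   # 3x3 receptive field
atrous = atrous_conv2d(image, kernel, rate=2)  # 5x5 receptive field, same 9 weights
print(dense.shape, atrous.shape)
```

Since no padding is used here, the output shrinks as the rate grows; practical networks typically pad the input so that the spatial size is preserved.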
Recently, a plethora of new semantic segmentation methods [16], [17], [18], [19], [20], [21], [22], [23], [24] for visual scene understanding has emerged in the literature, eliciting impressive results. For instance, Nesti et al. [19] presented a method that evaluates the robustness of semantic segmentation approaches for autonomous vehicles. They introduced a novel loss function to analyze the effectiveness of existing semantic segmentation methods against real-world adversarial attacks in autonomous driving environments. Natan et al. [21] proposed a compact yet efficient multi-task learning semantic segmentation method to deal with different modes of data. Their method has the ability to perform various tasks in a unified approach that includes depth estimation, semantic segmentation, ranging (LiDAR data) segmentation, and light detection. To analyze the model uncertainty problem for semantic segmentation, Zhao et al. [22] presented a pyramid Bayesian approach, which evaluates the uncertainty of semantic segmentation models for autonomous driving. They examined the performance of a semantic segmentation model (SegNet) after replacing its dropout layers with a pyramid pooling layer, and claimed an improvement in the model's performance. Baran et al. [24] introduced a unique approach for understanding road view semantics through onboard Bird's Eye View (BEV) camera visuals. They analyzed the understanding of road scenery from three different perspectives: image-level, BEV-level, and aggregated temporal road scene understanding. Traditionally, neural networks are trained using powerful graphics processing units (GPUs) and large server computers, whereas inference is performed over embedded systems in self-driving cars. Lately, the computational complexity has been reduced significantly by some deep models such as SqueezeNet [25], which achieves AlexNet-level accuracy with 50 times fewer parameters. Following SqueezeNet, ENet [26] achieved real-time semantic segmentation over embedded devices. More recently, semantic segmentation achieved significant milestones from the perspective of time complexity, as reported in [25] and [26]. To help readers, a graphical overview of how the major architectures are distributed across the literature on semantic segmentation driven scene understanding for autonomous driving is depicted in Figure 2.

B. Challenges and Motivation
Self-driving cars have to react instantly to their surroundings, and in real-world circumstances they are likely to encounter new types of events that put the car in a difficult situation. Furthermore, the inherent uncertainty associated with unknown situations increases the probability that the model issues erroneous decisions, endangering the lives of passengers and other nearby road users. The inference of a trained model installed in a self-driving car needs to be dynamic in nature: making real-time decisions, being aware of the confidence in its own outputs, learning from new events, and updating the parameters of the model. Similarly, decisions made by self-driving cars are mostly generated by black-box neural models, leaving many open questions regarding the explainability and accountability of decisions made by an autonomous car. Moreover, predicting the future location of pedestrians and vehicles with truly actionable accuracy is still to be achieved in AD. Complex driving scene understanding and visual scene perception in adverse weather conditions are also open challenges yet to be covered in the AD domain. All these challenges must be addressed before driverless cars can move safely in urban areas. Unfortunately, despite prior efforts [29], [30], the community lacks a consolidated, unified, single point of reference for ascertaining the current level of maturity of semantic segmentation techniques for vehicular scene understanding.
Considering the aforementioned challenges and importance of vision sensors-based semantic segmentation in accurate scene understanding and parsing, we accumulate the existing research contributions and outcomes in this survey. The main research questions that are highlighted in this survey are given as follows. 1) Do the available datasets possess generalization potentials for scene understanding in complex visual scenes? 2) Can the current methods segment complex visual scenes containing uncertainties such as fog and rain, and segment the unstructured information including rough roads and non-smooth pathways for pedestrians? 3) Do the current methods attentively learn from the ongoing scenes and contain events-based scene understanding potentials?

C. Contributions
This survey takes a step ahead in this regard by critically examining the recent state-of-the-art in visual scene understanding using segmentation techniques, with the following main contributions to the ITS community: 1) A thorough introduction to scene understanding, which defines the generic pipeline and explains each of its steps individually. This helps newcomers to the field grasp prior knowledge from all aspects of scene understanding for AD. 2) A discussion and critical analysis of the most relevant papers and datasets arising from the notable research activity on scene understanding witnessed during the last decade. 3) A performance study of current state-of-the-art methods by considering their consumed computational resources and the platforms for which these methods are developed. In the existing literature, some contributions provide their open-source implementations. This review leverages them by executing, analyzing, and comparing the resources consumed by each method. This study allows expanding the target audience of this review towards industrialists with interest in better scene understanding strategies that are functional in real-world environments. 4) A reasoned derivation of future research guidelines based on the analyzed literature, identifying open problems and challenges in this domain, as well as research opportunities that can be explored to address them effectively.

D. Review Methodology
The research articles discussed in our survey were retrieved using different keywords such as scene understanding in autonomous vehicles, vision-based semantic segmentation in autonomous vehicles, and multi-class scene understanding in autonomous driving. Most of the retrieved articles were directly relevant, with some exceptions: multi-modality methods [31], [32], articles with weak relevance to the investigated topic such as point cloud systems [33], and some outdated articles with relatively old deep learning strategies [34]. Furthermore, the aforementioned keywords were searched in multiple repositories, including the Web of Science and Google Scholar, to ensure the retrieval of relevant contents. The inclusion criteria ensure that a paper is recognized among AD experts, as measured by its number of citations. We also analyzed the usage statistics in the Web of Science and the classification of citations, e.g., checking whether the concerned paper is mostly cited as support, as background, or in general discussion. Figure 3 provides the overall distribution, where the statistics indicate that the trending publisher in the ITS domain from the semantic segmentation perspective is IEEE, followed by non-reviewed pre-prints in ArXiv repositories.
The rest of the manuscript is split into five main sections. Section II highlights the role of segmentation for AD, and explains some featured methods from the related literature. Section III explains the evaluation metrics in use for segmentation tasks, several loss functions designed for special purposes, and a time complexity analysis of representative methods from the segmentation literature. A list of widely used segmentation datasets is enumerated and described in Section IV, along with an explanatory discussion of their drawbacks and the challenges they pose. Section V exposes open challenges for scene understanding methods in the AD domain using segmentation modules, and outlines research directions to address them. Finally, in Section VI, we conclude this review with the conclusions drawn from the whole article and an outlook.

II. SEMANTIC SEGMENTATION FOR SCENE UNDERSTANDING IN AD
The primary objective of semantic segmentation is to annotate each pixel of an input image with one of a range of predefined classes used during training, i.e., to define the boundaries of individual entities inside an ongoing scene, which assists many applications [45]. The dictionary of possible classes varies depending on the dataset and the segmentation task under consideration. Nevertheless, basic objects that are common in most databases used in the semantic segmentation literature for AD include humans/pedestrians, different types of vehicles (car, bike, etc.), traffic lights, and many more [46], [47]. Segmenting different types of objects assists the decision making of the autonomous vehicle. For instance, if a nearby pedestrian is accurately segmented by a deep neural model, the vehicle can instantly initiate the braking mechanism by considering the distance between the vehicle and the pedestrian. This requires an accurate segmentation technique that draws clear boundaries of the pedestrian against other objects, contributing to real-time decision making.
Semantic segmentation for scene understanding is mostly performed via RGB cameras. More recently, LiDAR sensors-based methods have achieved significant results in segmenting an outdoor scene for autonomous vehicles [36]. There are major fusion-based techniques that allow RGB data and LiDAR point clouds to interact in a single network for semantic segmentation [40]. However, in this article we specifically focus on RGB sensors-based semantic segmentation methods due to their lower computation cost, high level of applicability, and large field of view. A concise summary about the literature on LiDAR and multi-modalities semantic segmentation is given in Table I. Furthermore, interested readers can refer to a very recent survey on 3D LiDAR data for semantic segmentation available in [48].
We now discuss some prominent segmentation methods featured for AD. Segmentation is widely used in scene parsing, whereas some methods only focus on specific kinds of objects, such as pedestrians, cars, bicyclists, and lanes, to reflect their importance for AD in streets. In order to attribute the desired level of importance to such objects, RAPNet [73] contains an importance-aware feature selection method to automatically nominate important features for the predicted labels. By contrast, other mainstream methods [60], [65] focus on general object segmentation without granting any particular importance to objects on the road or in zebra crossing areas. In some methods, scene understanding is performed using segmentation techniques that are functional in diverse environments with unstructured roads [74], challenging weather [75], complex outdoor conditions [76], and varying illumination [77]. A detailed description of the featured segmentation methods is given in Table II. There exist several survey contributions from the computer vision research community that cover major challenges, provide tutorials, and offer future research directions in various subdomains of AD. These surveys are summarized in Table III. As can be observed in this table, scene understanding has not been considered to the level its importance in AD deserves, and surveys related to scene segmentation are very scarce. For instance, Xue et al. covered scene understanding methods based on event reasoning in their baseline survey [29]. This is the survey most related to our topic, but it concentrates on events and intention prediction of pedestrians and vehicles rather than on scene parsing and related paradigms. Another recent survey broadly covers road segmentation methods, but without any focus on their challenges or on future research directions in the AD domain [78]. To the best of our knowledge, this survey is the first of its kind in the AD literature and responds to a need of the community working on autonomous vehicles, given the acknowledged importance of scene parsing in this domain.

III. PERFORMANCE EVALUATION OF SEMANTIC SEGMENTATION
The performance evaluation of different semantic segmentation models used in the AD domain is discussed in this section. Herein, we explain the evaluation metrics and different types of objective functions, analyze the computational complexity, and finally provide quantitative comparisons of deep models. The nomenclature of the variables used is given in Table IV.

A. Evaluation Metrics
Building only a predictive deep segmentation model is not a wise and trustworthy decision for safe AD unless it is tested on unseen data. Most models evaluate their performance on a disjoint set of the same dataset that is used for training, but still the test data are totally new for the trained model. Recently, deep models are being developed with more generalized potentials for unseen data [89]. Deep models for segmentation are evaluated using some common metrics to assess the optimal results against ground truth. Based on the difference between instance and semantic segmentation, different types of evaluation metrics can be used for these tasks, which we review next as follows.

1) Intersection Over Union (IoU):
The IoU metric [90], [91] computes the overlap between the mask predicted by the model, $predMask_{input}$, and the ground truth mask $GT$. It is the simplest metric, essentially counting the number of common pixels using intersection and union as per Equation (1):
$$IoU = \frac{|predMask_{input} \cap GT|}{|predMask_{input} \cup GT|} \tag{1}$$
where $predMask_{input}$ is the mask of labels predicted for each pixel of the input image, and $GT$ is the ground truth mask that should be predicted by an ideal segmentation model. In case of multiple classes (as often occurs in the related literature), the IoU score is computed for each class individually and then averaged over all classes, giving rise to the so-called mean IoU (mIoU). Since IoU coincides with the Jaccard similarity coefficient, and is closely related to the Dice coefficient, it is also referred to as the Jaccard index.
Computing IoU over the output of instance segmentation models is complicated, as it produces multiple masks for each object inside an input image. Therefore, it becomes similar to object detection evaluation with the only difference being the bounding boxes comparison in the object detection problem, which is replaced by the masks comparison in instance segmentation.
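As an illustration of the per-class IoU and its class average, the following minimal NumPy sketch evaluates a toy 3 × 4 label map with three classes. The function names (`iou`, `mean_iou`) and the toy masks are our own illustrative assumptions; classes absent from both prediction and ground truth are skipped in the average, which is one common convention rather than a rule stated in the surveyed works.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """IoU between two binary masks; NaN when the class is absent from both."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return float("nan")
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(pred_labels, gt_labels, num_classes):
    """Average of per-class IoU scores, ignoring classes with undefined IoU."""
    scores = [iou(pred_labels == c, gt_labels == c) for c in range(num_classes)]
    return float(np.nanmean(scores)), scores

# Toy label maps (hypothetical): 0 = road, 1 = car, 2 = pedestrian.
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1],
               [2, 2, 2, 2]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [2, 2, 2, 0]])

miou, per_class = mean_iou(pred, gt, num_classes=3)
print(per_class, miou)  # per-class IoU: 0.6, 0.8, 0.75
```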
2) Pixel Accuracy for Semantic Segmentation: Another commonly used metric is the pixel accuracy $PixelAcc$ [57], which reports the percentage of correctly classified pixels in an input image when compared to the ground truth mask, as formulated in Equation (2):
$$PixelAcc = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
where $TP$, $TN$, $FP$, and $FN$ respectively denote the number of true positives, true negatives, false positives, and false negatives measured over the image, assuming that pixels of the label under consideration are given value 1 and 0 otherwise. As with IoU, it is computed individually for every class, and globally for all classes of a given dataset. When a class covers only a comparatively small portion of the image, this metric is biased, as it is dominated by the correct identification of pixels where the (positive) class is not present.
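The bias of pixel accuracy for small classes can be seen in a short numerical sketch. The example below is a hypothetical toy case of ours (a 10 × 10 image in which the positive class covers 5 pixels): a predictor that outputs an empty mask still reaches 95% pixel accuracy, while its IoU for that class is zero.

```python
import numpy as np

def pixel_accuracy(pred_mask, gt_mask):
    """(TP + TN) / (TP + TN + FP + FN) for one class encoded as a binary mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.sum(pred & gt)      # class pixels correctly detected
    tn = np.sum(~pred & ~gt)    # background pixels correctly rejected
    fp = np.sum(pred & ~gt)     # background pixels wrongly labeled as class
    fn = np.sum(~pred & gt)     # class pixels that were missed
    return float((tp + tn) / (tp + tn + fp + fn))

# Hypothetical image: the positive class covers only 5 of 100 pixels.
gt = np.zeros((10, 10), dtype=bool)
gt[0, :5] = True
empty_pred = np.zeros_like(gt)  # a model that never predicts the class

acc_empty = pixel_accuracy(empty_pred, gt)
iou_empty = float(np.sum(empty_pred & gt) / np.sum(empty_pred | gt))
print(acc_empty, iou_empty)  # high accuracy despite zero overlap
```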

B. Special Loss Functions for Semantic Segmentation
In general, various factors may affect the learning potential of a given Machine Learning model. The loss function is among the most important ones in neural computation, as it quantitatively evaluates the model's predictions during training and drives performance improvements via gradient updates and backpropagation until the specified number of epochs is reached. There are multiple loss functions for segmentation tasks. Furthermore, some research works have proposed to improve the segmentation performance further by defining modified/hybrid versions of these loss functions. Common loss functions can be found in [92], whereas more advanced loss functions are given below with their respective mathematical definitions: 1) Weighted Binary Cross Entropy: It is a variant of the cross entropy loss function that is widely used in many computer vision problems. It is defined as a measure of the difference between two probability distributions ($y$ and $\hat{y}$) of corresponding inputs [93], [94]. In this case, $\beta$ is used for balancing false positives and false negatives:
$$L_{wBCE}(y, \hat{y}) = -\left(\beta\, y \log \hat{y} + (1 - y) \log (1 - \hat{y})\right) \tag{3}$$
where $\hat{y}$ is the output of the segmentation network for a given pixel, $y$ is its ground truth, and the image and label weights are computed using zeros and ones.
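2) Balanced Cross Entropy: In this alternative formulation of the loss function [95], [96], positive and negative samples are weighted as follows (with $\hat{y}$ the network output and $y$ the ground truth):
$$L_{wbBCE}(y, \hat{y}) = -\left(\beta\, y \log \hat{y} + (1 - \beta)(1 - y) \log (1 - \hat{y})\right) \tag{4}$$
hence inserting a complementary weight $(1 - \beta)$ for negative samples.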
3) Focal Loss: The focal loss $FL$ is a well-established loss function that can be used in the case of imbalanced data [97], which occurs frequently in segmentation problems. Following the previous notation, the focal loss is given by:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log (p_t) \tag{5}$$
where $p_t$ is the probability that the model predicts for the ground truth class, $\gamma > 0$ is a parameter that controls how strongly well-classified examples are down-weighted relative to misclassified ones, and $\alpha_t \in [0, 1]$ is set to account for the presence of class imbalance or, alternatively, tuned as another hyper-parameter of the overall model.
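The three cross-entropy variants above can be compared side by side with a small NumPy sketch. This is an illustration under our own assumptions rather than the implementation of any cited work: the losses are averaged over pixels, predictions are clipped by a hypothetical `EPS` constant to avoid taking the logarithm of zero, and the default values `gamma=2.0` and `alpha=0.25` are merely commonly used settings.

```python
import numpy as np

EPS = 1e-7  # clipping constant (assumption) to keep the logarithms finite

def weighted_bce(y, y_hat, beta):
    """Weighted BCE: only the positive term is scaled by beta."""
    y_hat = np.clip(y_hat, EPS, 1 - EPS)
    return float(np.mean(-(beta * y * np.log(y_hat)
                           + (1 - y) * np.log(1 - y_hat))))

def balanced_bce(y, y_hat, beta):
    """Balanced BCE: positives weighted by beta, negatives by (1 - beta)."""
    y_hat = np.clip(y_hat, EPS, 1 - EPS)
    return float(np.mean(-(beta * y * np.log(y_hat)
                           + (1 - beta) * (1 - y) * np.log(1 - y_hat))))

def focal_loss(y, y_hat, gamma=2.0, alpha=0.25):
    """Focal loss: the factor (1 - p_t)^gamma down-weights easy pixels."""
    y_hat = np.clip(y_hat, EPS, 1 - EPS)
    p_t = np.where(y == 1, y_hat, 1 - y_hat)        # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)    # class-dependent weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# Toy ground truth and predicted probabilities for four pixels (hypothetical).
y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.1, 0.6, 0.4])

plain_bce = weighted_bce(y, y_hat, beta=1.0)  # beta = 1 recovers plain BCE
print(plain_bce, balanced_bce(y, y_hat, beta=0.7), focal_loss(y, y_hat))
```

Setting `beta=1.0` in the weighted variant, or `gamma=0.0` in the focal loss, recovers (a scaled version of) the plain binary cross entropy, which is a convenient sanity check when implementing these losses.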

4) Others:
There are many other types of loss functions¹ used in specific cases for segmentation problems, such as region-based losses [98]. Among them we underscore the prevalence of studies using the Dice loss, which is inspired by the Sørensen-Dice coefficient (namely, a measure of similarity between images); the Tversky loss, which extends the Dice loss with a β coefficient to weight false negatives and false positives differently; the shape-aware loss, for better addressing the segmentation of challenging objects; the Hausdorff distance loss [99], [100]; and the combo loss, which blends the binary cross-entropy loss, for its curve smoothing effect, with the Dice loss, for class balancing problems. We again refer to [92] for a detailed mathematical compendium of these loss functions.
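For the two region-based losses highlighted above, a minimal NumPy sketch may help to see how the β coefficient shifts the penalty between false positives and false negatives. The formulation below is one common variant and should be read as an assumption of ours: false positives are weighted by β and false negatives by (1 − β), so that β = 0.5 reduces the Tversky loss to the Dice loss; the `smooth` term is a hypothetical stabilizer against empty masks.

```python
import numpy as np

def dice_loss(y, y_hat, smooth=1e-6):
    """1 - Dice coefficient computed on soft (probabilistic) predictions."""
    intersection = np.sum(y * y_hat)
    return float(1.0 - (2.0 * intersection + smooth)
                 / (np.sum(y) + np.sum(y_hat) + smooth))

def tversky_loss(y, y_hat, beta=0.5, smooth=1e-6):
    """1 - Tversky index; beta weights FP and (1 - beta) weights FN."""
    tp = np.sum(y * y_hat)            # soft true positives
    fp = np.sum((1 - y) * y_hat)      # soft false positives
    fn = np.sum(y * (1 - y_hat))      # soft false negatives
    return float(1.0 - (tp + smooth)
                 / (tp + beta * fp + (1 - beta) * fn + smooth))

# Toy ground truth and predicted probabilities for four pixels (hypothetical).
y = np.array([1.0, 1.0, 0.0, 0.0])
y_hat = np.array([0.8, 0.4, 0.3, 0.1])

d = dice_loss(y, y_hat)
t_sym = tversky_loss(y, y_hat, beta=0.5)   # matches the Dice loss
t_fn = tversky_loss(y, y_hat, beta=0.3)    # penalizes missed pixels more
print(d, t_sym, t_fn)
```

In this toy case the prediction misses more object mass than it hallucinates, so lowering β (which raises the weight on false negatives) increases the loss.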

C. Time Complexity Analysis
In order to illustrate the current performance levels of segmentation models used for AD, we now report the results of some featured deep segmentation models. The running times of these models are reported in Table V. For some models, the time complexity indicators are taken from their original publications, whereas in other cases we have run the models from their publicly available repositories using our experimental resources. The CPU configuration comprises an Intel(R) Core i7-7700 CPU @ 3.60 GHz running the Windows 10 operating system, while the GPU used in the experiments is an NVIDIA GeForce GTX 1060 with 6 GB of graphics memory. Table V also shows the predictive performance of the models (when available) over three different datasets, as well as the size of the trained models (measured in MB). The world's leading AV chips, including Intel Ponte Vecchio, NVIDIA A100, Tesla D1, Huawei Ascend 910, and Google TPU (v1, v2, v3), have achieved mass production for applications such as 2D/3D fusion annotation and semantic segmentation training [101]; however, the time complexity of the analyzed methods running over CPU indicates that current neural architectures still need to focus on lowering time complexity and energy consumption. The highest frame rate, measured in frames per second (FPS), among these methods on CPU is 3 FPS, achieved by [57]. When deployed over a GPU, the best score is 81.9 FPS, achieved by [102]. In real-world environments [103], devices such as Raspberry Pi, Jetson Nano, and Google Board are severely resource-restricted [104]. Executing such large models over these devices is a challenging task. Therefore, much attention is required in terms of time complexity towards enabling the execution of these models over energy-limited devices functional in Internet of Things setups [105], [106].
¹ https://cnvrg.io/semantic-segmentation/ (accessed on April 21st, 2021).
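For readers who wish to reproduce this kind of running-time comparison, the following sketch shows one simple way to estimate FPS for an arbitrary inference function. It does not reproduce the protocol used to obtain Table V: the helper `measure_fps`, the warm-up and repetition counts, and the pure-Python `dummy_segmenter` standing in for a real network are all hypothetical.

```python
import time

def measure_fps(infer, frame, warmup=3, runs=20):
    """Average frames per second of `infer` over `runs` timed calls."""
    for _ in range(warmup):
        infer(frame)                      # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(runs):
        infer(frame)
    elapsed = time.perf_counter() - start
    return runs / elapsed

def dummy_segmenter(frame):
    """Hypothetical stand-in for a model: thresholds every pixel."""
    return [[1 if px > 127 else 0 for px in row] for row in frame]

# Synthetic 32 x 64 gray-scale frame (hypothetical input).
frame = [[(x * y) % 256 for x in range(64)] for y in range(32)]

fps = measure_fps(dummy_segmenter, frame)
print(f"{fps:.1f} FPS")
```

When timing a real model on a GPU, frameworks that launch kernels asynchronously typically require an explicit synchronization before the clock is read; otherwise the measured FPS can be overly optimistic.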

D. Quantitative Analysis of Scene Segmentation Methods for AD
This section elaborates on the quantitative empirical analysis of the road scene segmentation methods surveyed in this paper. From Table VI, it can be noticed that, among all methods evaluated over the Cityscapes dataset, the approaches proposed in [26], [27], and [102] attain a balanced trade-off between accuracy (in terms of mIoU, Pixel Accuracy, and mAP) and efficiency for real-time applications (in terms of FPS). By contrast, the reported results over the CamVid17 dataset evince a better segmentation performance of the methods contributed in [27] and [102]. Among these three focused methods, [27] scores best in terms of mIoU and mAP values, with superior FPS, which are 31.30, 65.50, and 65.5, respectively. The reported results over the SemanticKITTI dataset indicate a better performance of the method in [51], achieving well-balanced mIoU and FPS scores, i.e., 52.20 and 92, respectively. Finally, the method in [65] performs comparatively better than the one in [107], offering the best values of the mIoU, mAP, and FPS scores (75.70, 83.60, and 19.5, respectively).

IV. DATASETS
Many datasets are nowadays available for segmentation tasks, some of them related to semantic segmentation and others introduced for instance segmentation. Representative datasets in the segmentation literature, particularly those designed for AD, are discussed in detail in the subsequent sections, and their detailed statistics are given in Table VII.

A. KITTI
KITTI [46] is a 3D vision benchmark dataset containing outdoor stereo images of road scenery along with the corresponding 3D laser scans. The 3D image data is acquired by two high-resolution stereo cameras (gray scale and color) and an advanced OXTS RT 3003 localization system that combines global positioning system (GPS), global navigation satellite system (GLONASS), inertial measurement unit (IMU), and real time kinematic (RTK) correction signals. The setup also contains a Velodyne HDL-64E laser scanner, mounted on top of the vehicle, to produce 3D points for the captured scenes in real time. The deployed stereo cameras are first calibrated and then synchronized with the localization system and the laser scanner to generate accurate ground truth data.
The dataset comprises a total of 14999 RGB stereo image pairs (including both the image and its corresponding ground truth), with a resolution of 1240 × 376 pixels. The entire dataset is partitioned into a training set (7481 samples) and a test set (7518 samples), which together account for the 14999 pairs. The training set is further split into two subsets, namely, a train subset (3712 samples) and a held-out subset (3769 samples), the latter being used mainly for validation purposes.

B. SemanticKITTI
SemanticKITTI [47] is a large-scale outdoor scene dataset constructed for point cloud semantic and panoptic segmentation of road scenery, including residential areas, city traffic, and highways. It comprises a total of 43,552 point-wise re-annotated 3D scans generated with an automotive LiDAR sensor for the KITTI Vision Odometry Benchmark dataset [46]. This dataset has a total of 22 distinct sequences split into training-validation and test subsets. The training-validation set consists of 23,201 3D scans from sequences 0 to 10, while the test set comprises 20,351 3D scans from sequences 11 to 21.
Unlike Paris-Lille-3D [113] and Wachtberg [114] datasets, which only contain the aggregated 3D scans of the complete sequence captured with the same type of sensors, SemanticKITTI provides the individual point cloud of the entire captured sequence of road scenery. Thus, it enables the performance evaluation of semantic segmentation based on multiple consecutive scans.
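The sequence-level split described above can be encoded directly, which is useful when building data loaders. The following is a minimal sketch; the function name `subset_of` and the subset labels are our own illustrative choices and are not part of any official tooling.

```python
# Sequence ranges as stated in the text; names below are hypothetical.
TRAINVAL_SEQUENCES = range(0, 11)   # sequences 0..10
TEST_SEQUENCES = range(11, 22)      # sequences 11..21

# Scan counts quoted in the text for each subset.
SCANS = {"trainval": 23_201, "test": 20_351}


def subset_of(sequence_id: int) -> str:
    """Return the subset that a SemanticKITTI sequence belongs to."""
    if sequence_id in TRAINVAL_SEQUENCES:
        return "trainval"
    if sequence_id in TEST_SEQUENCES:
        return "test"
    raise ValueError(f"unknown sequence id: {sequence_id}")


def total_scans() -> int:
    """Sum of the per-subset scan counts."""
    return sum(SCANS.values())
```

Summing the two subset counts reproduces the 43,552 scans reported for the full dataset.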

C. HighD
The HighD dataset [55] contains around 110,000 refined trajectories of different vehicles, including cars and trucks. Those trajectories are captured from drone videos recorded at a resolution of 4096 × 2160 pixels and 25 FPS over German highways. For each particular vehicle trajectory, the dataset provides trajectory ID, speed, acceleration, longitudinal coordinate, distance to the leader, and ID of the current leader. These trajectories are widely used to analyze the driving behavior of car-following drivers using computer vision algorithms. The dataset includes 60 videos of 17 minutes on average captured in 6 different locations, depicting a road portion of around 420 meters in length. All videos are captured in sunny and clear weather conditions, from 8 AM to 5 PM, thereby minimizing the efforts required for video stabilization and other post-processing operations.
The dataset includes four different files for each captured video: three CSV files and the aerial view of the highway. The first file contains information about traffic signs, driving lanes, the speed limit on each specific lane, and the location of the site. The vehicle class, vehicle dimensions, mean speed, and driving direction are given in the second file. The third file provides detailed per-frame information such as speeds, lane position, accelerations, and a description of adjacent vehicles.
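As an illustration of how per-vehicle quantities such as speed and distance to the leader support car-following analysis, the sketch below computes the time gap between a follower and its leader. The record layout and field names are hypothetical and do not reproduce the actual HighD CSV headers; units are assumed to be metres and metres per second.

```python
from dataclasses import dataclass


@dataclass
class TrackSample:
    # Hypothetical field names; units assumed to be m and m/s.
    vehicle_id: int
    speed: float               # follower speed
    distance_to_leader: float  # gap to the preceding vehicle


def time_gap(sample: TrackSample) -> float:
    """Seconds the follower needs to cover the current gap at its current speed."""
    if sample.speed <= 0.0:
        return float("inf")  # a stationary follower never closes the gap
    return sample.distance_to_leader / sample.speed
```

For instance, a follower travelling at 25 m/s with a 50 m gap has a time gap of 2 s.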

D. CityScapes
CityScapes [50] is a high-quality pixel-level semantic segmentation dataset for urban street scene understanding, collected in around 50 cities in Germany and neighboring countries. The dataset provides 5,000 pixel-level annotated images of resolution 1024 × 2048, depicting complex urban scenes captured in different weather conditions, varying background, and scene layout. As compared to other benchmark datasets for street scene understanding [46], [47], [55], the CityScapes dataset surpasses the previous efforts in terms of variety, size, scene complexity, and annotation richness.
To discriminate the semantic representation of each particular object in the captured image, data are annotated with 30 different categories. For the semantic segmentation task, the entire dataset is split into four separate subsets: 2,975 training images, 500 validation images, 1,525 test images, and roughly 20,000 auxiliary images. The training, validation, and test image sets have finely refined annotations, while the auxiliary set of images contains coarse annotations.

E. NuScenes
NuScenes [53] is a large-scale 3D object detection dataset recently introduced for driving scene understanding in AD. The dataset was collected in Boston (South Boston and Seaport) and Singapore (Holland Village, Queenstown, and One North) using a moving car equipped with a suite of specially designed sensors. The car-mounted suite includes 13 sensors: 6 RGB cameras with 1600 × 900 resolution and 12 Hz capture frequency, 5 long-range radar sensors operating at 77 GHz with 13 Hz capture frequency, 1 LiDAR sensor with 20 Hz capture frequency, and an IMU sensor. All sensors are precisely synchronized with each other to obtain high-quality data and better cross-modality between visual and sequential data.
The dataset consists of 1,000 driving sequences, each 20 seconds long. Data are annotated by experts into 23 object classes (e.g., Car, Truck, Human, and Bicycle), where each object class is further categorized into 10 different sequence classes based on the semantic differences between the sequences. For training and inference, the dataset is divided into 700, 150, and 150 annotated sequences for training, validation, and testing, respectively. Each sequence comprises 40 frames, offering a 360° view of the surrounding scenery.

F. Mapillary Vistas
Mapillary Vistas [62] is one of the largest and most challenging street-level scene segmentation datasets for pedestrian and traffic-related scene analysis. The dataset contains 25,000 high-quality (8.6 megapixels) outdoor scene images of resolution 1920 × 1080 captured all over the world under different conditions concerning lighting, season, weather, and time of day. Images are captured by pedestrians on the sidewalk as well as from moving cars, with various image acquisition devices including smartphone cameras, action cameras, tablets, and professional cameras. To prepare the data for supervised learning-based scene segmentation, data are annotated into 66 distinct object categories, with additional instance-specific labels for 37 of these classes.
The Mapillary Vistas dataset is 5 times larger than the benchmark CityScapes dataset [50], providing fine-grained annotated data generated by 69 expert annotators, who used polygon-style annotations to delineate each specific object in the image. For the semantic segmentation learning task, the dataset is split into three subsets of images, namely training, validation, and testing, with a total of 18,000, 2,000, and 5,000 annotated images, respectively.

G. ApolloScape
ApolloScape [115] is an extensive street-level road scene dataset recently released for a variety of self-driving applications including car instance segmentation, 3D map construction, self-localization, scene parsing, lane segmentation, scene trajectories, and detection-tracking. The dataset contains 143,906 frames of resolution 3384 × 2710 pixels, with good-quality ground-truth data comprising pixel-level semantic segmentation, pose information, and 3D point clouds of the captured scene. Compared to existing publicly available datasets (e.g., KITTI [46] or Mapillary Vistas [62]), ApolloScape comprises almost 15 times more data, with rich labeling in terms of holistic dense semantic points for each scene.
The images and depth data in the dataset are acquired with car-mounted sensors deployed over various cities of China under different weather (cloudy and sunny), lighting (day, night, noon), and traffic conditions (rush-hour and non-rush-hour traffic, with pairs of stereo images). The suite of car-mounted sensors includes one VMX-CS6 camera system with two front cameras having a resolution of 3384 × 2710 pixels, two VUX-1HA laser scanners with a range of 1.2 m to 420 m and a 360° FOV, and a measuring head device with IMU/GNSS (heading accuracy 0.015°, position accuracy 20∼50 mm, and roll and pitch accuracy 0.005°). During data recording, the vehicle drives at a speed of 30 km per hour, whereas the mounted cameras are triggered every 1 meter.
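The distance-based triggering described above implies a temporal capture rate that follows from simple arithmetic. A minimal sketch is given below; the function name is hypothetical and the calculation assumes a constant driving speed.

```python
def trigger_rate_hz(speed_kmh: float, spacing_m: float) -> float:
    """Camera triggers per second for distance-based triggering at constant speed."""
    if spacing_m <= 0.0:
        raise ValueError("spacing must be positive")
    speed_ms = speed_kmh * 1000.0 / 3600.0  # km/h -> m/s
    return speed_ms / spacing_m
```

At 30 km per hour (about 8.33 m/s) with one trigger per meter, the cameras therefore fire roughly 8.3 times per second.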

H. Berkeley Deep Drive
The Berkeley Deep Drive dataset [116] is a large-scale dataset composed of diverse driving videos and GPS/IMU data for road scene understanding, including drivable area segmentation, road object detection, instance segmentation, and lane mark detection. The dataset includes around 10,000 hours of driving footage depicting towns, highways, and rural areas of the San Francisco Bay Area, New York, and other cities of the USA in varying weather and lighting conditions. Besides the video data, the dataset also provides GPS/IMU driving trajectories for location tracking, recorded with GPS, IMU, gyroscope, and magnetometer sensors. The dataset provides image-level annotations for a variety of driving scene understanding tasks. Object detection annotations include traffic light, traffic sign, bus, person, motor, bike, truck, car, train, and rider. The instance segmentation annotations contain car, road, pedestrian, person, footpath, and traffic boards, among others.

I. COCO
The Common Objects in Context (COCO) dataset [66] is one of the predominant databases released by Microsoft, widely used for object detection, semantic and object instance segmentation, and object captioning. The dataset contains 330,000 images, of which more than 200,000 are labeled, 250,000 persons annotated with keypoints for human pose estimation, and 1,500,000 object instances categorized into 80 distinct classes. The image data are collected from different sources, including relevant object images from the PASCAL VOC dataset [67] and images uploaded to the Flickr site by amateur photographers with searchable keywords. The entire dataset is collected and annotated for object detection, instance segmentation, and image captioning using an interface specifically designed for hired expert annotators.
The COCO dataset was originally released in two parts: the first part in 2014 and the second part in 2015. The first part comprises three subsets of images, including 82,783 training, 40,504 validation, and 40,775 testing images. Likewise, the second release of the dataset comprises 165,482, 81,208, and 81,434 images for training, validation, and testing, respectively.

J. VOC (2007 and 2012)
The PASCAL VOC (Visual Object Classes) dataset [67] is one of the most challenging datasets publicly available and is used for image classification, object detection, and image segmentation. Similar to the COCO [66] dataset, the VOC dataset is released in two parts: VOC 2007 and VOC 2012. The VOC 2007 release contains a total of 9,962 images and their corresponding annotations split into three subsets: 2,501 training, 2,510 validation, and 4,951 testing images. The VOC 2012 release includes 22,531 images divided into three subsets of 5,717, 5,823, and 10,991 images for training, validation, and testing, respectively. The dataset is collected from two different sources (the Flickr photo-sharing website and the Microsoft Research Cambridge database).
All images of the VOC 2007 and VOC 2012 datasets are annotated with two distinct attributes, i.e., object class and bounding box, which denote the object type and the coordinates of the object location. Both datasets contain 20 classes, where each class contains a varying number of images. However, each class contains at least 500 images, depicting common objects such as cat, dog, person, car, and bike. For each of these categories, a comprehensive set of images is supplied, each having semantic richness and significant variability concerning object size, illumination, pose, occlusion, orientation, and position.

V. SCENE UNDERSTANDING IN AD: CHALLENGES AND DIRECTIONS
The datasets introduced above cover a wide variety of objects, some of which, such as sky and buildings, are of little importance to the decision making of an autonomous vehicle. Mainstream research nowadays is centered on favorable daytime scenes for semantic segmentation, with sufficient illumination and supportive weather conditions. Many car companies and Original Equipment Manufacturers in industry have access to a high volume of data; however, they are not keen to share their data publicly, mainly due to IP, industrial competition, and General Data Protection Regulation (GDPR) concerns. Consequently, the lack of sufficient labelled data for accurate scene understanding in dynamic weather conditions with varied illumination, such as night time [117], smoggy situations, and edge cases, remains a challenge for AD research.
This research niche is among the challenges that are still insufficiently addressed by the community to date. In this section we offer our critical views on the current status of scene understanding in AD, summarizing them in a set of challenges together with a prescription of the research directions that can help the community step further and overcome them effectively.

A. Open Challenges
Although significant research has been done and the AD industry is growing rapidly, several open challenges still stand in the way of perfectly intelligent AD and demand researchers' attention. These challenges are discussed individually below, with supporting references from the related literature.

1) Salient Objects Consideration:
While much work has been done in the field of segmentation, very little attention has been paid to distinguishing objects based on safety levels or priorities. For instance, a segmentation model only segments humans in an ongoing scene without any consideration of their location or movement speed, information that can be useful to control the autonomous vehicle and avoid accidents. Considering an object's location during segmentation raises several challenges. One is the distance of the object from the autonomous vehicle: the closest objects can be segmented with the highest risk level, so that the vehicle takes action accordingly. Similarly, a pedestrian using a zebra crossing and another one walking on the roadside can be prioritized differently [73]. Furthermore, the motion of objects [118] away from or towards the autonomous vehicle is also an open issue to be faced by future deep learning models for scene segmentation: an object moving towards the autonomous vehicle at a higher speed demands quicker actions, and vice versa.
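To make the idea of distance- and motion-aware prioritization concrete, the following sketch assigns a risk level to a segmented object from its distance and closing speed via a time-to-collision estimate. This is our own illustrative heuristic, not a method from the cited works; the function names and the 2 s and 5 s thresholds are assumptions chosen only for the example.

```python
import math


def time_to_collision(distance_m: float, closing_speed_ms: float) -> float:
    """Seconds until contact; closing_speed_ms > 0 means the object approaches."""
    if closing_speed_ms <= 0.0:
        return math.inf  # object is static relative to us or moving away
    return distance_m / closing_speed_ms


def risk_level(distance_m: float, closing_speed_ms: float,
               high_s: float = 2.0, medium_s: float = 5.0) -> str:
    """Map an object to a priority label (thresholds are illustrative assumptions)."""
    ttc = time_to_collision(distance_m, closing_speed_ms)
    if ttc < high_s:
        return "high"
    if ttc < medium_s:
        return "medium"
    return "low"
```

Under these assumed thresholds, an object 10 m away closing at 10 m/s is labeled "high", whereas the same object moving away is labeled "low".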
2) Coarse-Structured Information: Most of the datasets introduced in the AD literature for segmentation are recorded in the regular, well-structured infrastructure of advanced cities. Currently developed deep learning models may achieve their best results² over structured datasets [50], but generalize poorly to many unstructured environments, as illustrated by the sample scenario in Figure 4. For instance, the online challenge NCVPRIPG-2019 focused on unstructured road data recorded in India.³ The highest mean IoU achieved so far in this competition is 0.6276 over the testing set, which reflects the enormous difficulty of obtaining models with good generalization properties in complex scenes. This aspect of AD demands further attention in terms of data collection, as well as the inclusion of new and effective representation mechanisms in deep learning models.
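For reference, the mean IoU figure quoted above averages the per-class intersection over union between predicted and ground-truth labels. The sketch below is a minimal pure-Python version operating on flat label sequences; it assumes that classes absent from both prediction and ground truth are skipped, which is a common convention but may differ from the exact protocol of a given benchmark.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred, target: equal-length sequences of integer class ids (one per pixel).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union > 0:  # skip classes that appear in neither labeling
            ious.append(inter[c] / union)
    if not ious:
        raise ValueError("no labeled pixels")
    return sum(ious) / len(ious)
```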
3) Uncertainty-Aware Decisions: A largely overlooked aspect of scene understanding, and of the AD decision making built on it, is the confidence with which models issue their predictions over the input data. The fact that vehicular surroundings are inherently uncertain ecosystems seems not to have persuaded the community to delve into this matter and to step away from current methodological trends centered exclusively on predictive scores. Fortunately, confidence estimation has recently caught the attention of the community (see e.g. [119], [120], [121] and references therein). Nevertheless, elements from evidential deep learning [122], Bayesian formulations of deep neural networks [123], simpler mechanisms to approximate the output confidence of neural networks (e.g. Monte Carlo dropout [124] or ensembles [125]), and other assorted methods for uncertainty quantification [126] should be progressively incorporated as an additional yet crucial criterion for decision making. This is especially important when dealing with complex environments, in which the lack of data that can fully represent any possible scene induces a large amount of epistemic uncertainty in the output of the model. Without confidence being considered as an additional factor for AD, or with current studies focused solely on predictive and/or computational efficiency aspects, there will be no guarantee that new scene segmentation models emerging in the scientific community are of practical use and can be transferred to industry.
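As a minimal sketch of how sampling-based approaches such as Monte Carlo dropout or ensembles yield a per-pixel confidence measure, the code below averages several class-probability vectors for one pixel and computes the entropy of the mean. The probability vectors are assumed to come from repeated stochastic forward passes or from ensemble members; how they are produced is outside the scope of this example, and the function names are ours.

```python
import math


def mean_prediction(samples):
    """Average T class-probability vectors predicted for the same pixel."""
    num_classes = len(samples[0])
    return [sum(s[c] for s in samples) / len(samples) for c in range(num_classes)]


def predictive_entropy(samples):
    """Entropy (in nats) of the averaged prediction; higher means less confident."""
    mean = mean_prediction(samples)
    return -sum(p * math.log(p) for p in mean if p > 0.0)
```

When all passes agree on a class, the entropy is zero; when two passes vote for different classes with full confidence, the entropy rises to ln 2, flagging the pixel as uncertain.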

B. Future Directions
The aforementioned challenges and our literature analysis suggest a number of research opportunities for advancing the current state of the art in vision-based semantic segmentation-assisted scene understanding for AD. We herein offer our envisioned directions:
1) Explainable AD: Deep segmentation models emerging from the AD literature generate their output without providing any explanation of how a given action was decided during the drive, which makes the model's decisions difficult to scrutinize. If an unexplained decision of an autonomous vehicle leads to erroneous behavior, causing accidents or traffic irregularities, it would be problematic from a legal perspective. Explanations of the model's decisions are necessary for AI-based decisions to be verifiable, interpretable, and accountable.
Recently, many deep models have been proposed to explain the generated output [127], which could be applied in the AD domain to explain the contribution of a model to a driving decision. Considering the achievements so far in explainable Artificial Intelligence (XAI, [128]), AD can harness the myriad of post-hoc XAI techniques available for generating explanations. However, such explanations may not suffice in practice, as their limited scope may not convey an overall interpretation of a model, but rather provide a correspondence between what the model observes in an input and the output it predicts. Extensive further research is needed in this direction to produce an enriched narrative connecting vehicular perception to automated actions, as we gradually move towards realizing the highest level of AD.
³ https://cvit.iiit.ac.in/ncvpripg19/idd-challenge/ (accessed April 8th, 2021).
2) Towards Video Segmentation for AD: Semantic segmentation using frame-based visual data has received considerable attention, with major improvements in the last two years. Although there are significantly robust techniques for frame-level segmentation, they are still mostly designed for achieving better accuracy levels, compromising their computational efficiency. Therefore, when image-based segmentation is employed in AD, it results in large processing latencies that are unaffordable for adoption in real on-board vehicular hardware. Beyond this issue, some scenes encountered while driving exhibit overlap and occlusion across consecutive frames, which hampers frame-based segmentation for scene understanding. Video-based segmentation is a contemporary option in this regard, which should ensure faster processing and better practicality for AD applications.
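One way to see why exploiting temporal redundancy can lower latency is a keyframe scheme, assumed here purely for illustration: a heavy segmentation model runs on every k-th frame, and a cheaper propagation step handles the frames in between. The function name and all timings below are hypothetical and are not measurements from the surveyed methods.

```python
def amortized_latency_ms(keyframe_ms: float, propagation_ms: float,
                         keyframe_interval: int) -> float:
    """Average per-frame latency when only every k-th frame is fully segmented."""
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be at least 1")
    # One full pass plus (k - 1) cheap propagation steps per group of k frames.
    total = keyframe_ms + (keyframe_interval - 1) * propagation_ms
    return total / keyframe_interval
```

With an assumed 100 ms full pass, a 10 ms propagation step, and a keyframe every 5 frames, the average cost drops to 28 ms per frame, compared to 100 ms for purely frame-based processing.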
3) Object's Predicted Locations Segmentation: A significantly vibrant research activity can lately be noted around the estimation of the future location of pedestrians and other moving objects in the scene, such as vehicles [60]. Notwithstanding its highly challenging nature, the task of future location estimation assists the decision making of an autonomous vehicle by providing estimated future trajectories of persons and vehicles. Unfortunately, research on segmenting future locations is scarce and, to the best of our knowledge, no study has yet segmented or drawn segmentation maps of the future locations of pedestrians or other objects. This area is very challenging, yet within reach for scene understanding. Recently, several methods [129], [130] have achieved accurate bounding-box prediction of pedestrians for the upcoming 10 to 15 frames. These methods can be considered as the baseline for future research in this valuable direction.
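To illustrate what a segmentation map of future locations could look like, the sketch below extrapolates a bounding box under a constant-velocity assumption and rasterizes the predicted boxes into a binary mask. This naive baseline is our own simplification and is not the approach of [129] or [130]; the function names and the box format (x_min, y_min, x_max, y_max) in pixels are assumptions.

```python
def extrapolate_boxes(prev_box, curr_box, horizon):
    """Predict `horizon` future boxes assuming constant per-frame displacement."""
    velocity = [c - p for p, c in zip(prev_box, curr_box)]
    return [
        tuple(c + v * step for c, v in zip(curr_box, velocity))
        for step in range(1, horizon + 1)
    ]


def rasterize(boxes, width, height):
    """Binary mask (list of rows) marking pixels covered by any predicted box."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image before filling it.
        for y in range(max(0, int(y0)), min(height, int(y1))):
            for x in range(max(0, int(x0)), min(width, int(x1))):
                mask[y][x] = 1
    return mask
```

The resulting mask marks the image region an object is expected to sweep over the prediction horizon, which could serve as a crude "future occupancy" segment.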

4) Hybrid Methods and Multi-Modalities:
Besides the broad coverage of RGB data generated by vision sensors, there are other modalities [131] and sensors with quite informative patterns and points for scene analysis and understanding. For instance, point clouds [132], [133], meshes [134], and depth data [135], together with RGB data, can yield an improved 3D scene understanding for an autonomous vehicle [136]. These data are generated by various sensors, including LiDAR among many other options. Hybrid models are widely used in many domains [137], [138], with successful results in terms of reduced vulnerability, and can be implemented in the ITS domain as well. As vehicles are equipped with more sensors, we envision many opportunities for research on multi-modal information fusion, further stimulated by other non-embarked sources of related information (e.g., floating car data, wherein the cell phones of drivers and passengers act as additional traffic probes, or social network data).
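Among the many possible fusion strategies, a simple late-fusion scheme combines the class probabilities predicted independently from each modality. The sketch below is a hypothetical illustration of that idea for a single pixel or point; the weighting scheme and function names are assumptions and do not correspond to any of the cited hybrid models.

```python
def fuse_probabilities(rgb_probs, lidar_probs, rgb_weight=0.5):
    """Weighted average of two per-class probability vectors, renormalized."""
    if len(rgb_probs) != len(lidar_probs):
        raise ValueError("both modalities must predict the same classes")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    fused = [rgb_weight * r + (1.0 - rgb_weight) * l
             for r, l in zip(rgb_probs, lidar_probs)]
    total = sum(fused)
    return [f / total for f in fused]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)
```

When the two modalities disagree, the fused decision follows the more confident one under equal weights; setting `rgb_weight` to 1.0 recovers the RGB-only prediction.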

5) Active and Incremental Learning:
Active learning [139] in machine learning refers to the self-adaptability and learning of a model over time from new data encountered during the testing phase, even after its deployment [140]. In real-world environments, a vehicle may encounter dynamic scenes with rarely occurring animals or objects, such as a kangaroo or a self-engineered dump and cargo truck, and may rely on the AI model's decision for further actions such as braking or accelerating. Thus, an AI-based scene understanding mechanism should interactively allow queries over every type of data and structure, in the form of unlabeled data instances that are labeled by a human annotator during the process, thereby involving a human in the training loop [141]. There are different types of active learning techniques, such as membership query synthesis [142], where synthetic data are generated and their parameters can be tuned [143] based on the structure of objects derived from the base species of the dataset. On the other hand, the capability of segmentation models to update their captured knowledge with new data in an incremental fashion is key for their sustainability and continuous improvement. We foresee that these two capabilities of segmentation models for scene understanding will grow in importance in prospective studies.
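The human-in-the-loop querying described above can be sketched with pool-based uncertainty sampling, a common active learning strategy that differs from the membership query synthesis mentioned in the text: the model scores unlabeled samples by predictive entropy, and the most uncertain ones are sent to the annotator. The function names, the pool layout, and the example identifiers are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(pool, budget):
    """Return ids of the `budget` most uncertain samples.

    pool: dict mapping sample id -> predicted class probabilities.
    """
    ranked = sorted(pool, key=lambda sid: entropy(pool[sid]), reverse=True)
    return ranked[:budget]
```

For example, a frame whose prediction is split evenly between two classes is queried before a frame predicted with 98% confidence.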
6) Complex Driving Scenes Understanding: Semantic segmentation with applications to scene understanding primarily assigns objects to a single category without any consideration of the importance of their location. For instance, a pedestrian walking on a sidewalk is classified simply as a pedestrian. This approach has two disadvantages: an algorithm needs extra time to verify the object's location before the vehicle can decide its actions, and there are no specific safety levels for pedestrians (or, likewise, for cyclists and other objects), so that all objects are treated as belonging to a similar safety level. Therefore, for complex driving scenes with abundant human subjects, there is a need for priority-driven systems that segment pedestrians on vehicular lanes in one category, and pedestrians at a large distance or on the sidewalk in another. A baseline work dealing with this problem recently introduced a pedestrian location perception network with location inference of each semantic map corresponding to a human [56]. This work can be advanced in terms of more objects identified in scenes characterized by a higher complexity and diversity.
7) Adverse Weather Conditions: When operating in real-world environments, autonomous vehicles may encounter adverse weather conditions such as snow, fog, rain, or dark areas, among other phenomena [144], [145]. Existing models are highly accurate for normal cases with sufficient illumination and other favorable conditions. However, models need to be adaptable to unfavorable weather scenarios. For instance, a dataset for night-time segmentation is introduced by Xin et al. in [146]. Furthermore, preprocessing techniques for haze [147] and fog [148] removal ensure effective semantic segmentation.
At the same time, if deep segmentation models were designed with built-in capabilities to account for weather-related uncertainties, or proved to be effective in such cases, the computation time required for the aforementioned preprocessing steps would be saved. Representative results of existing models under adverse weather are reported in Figure 5, whereas a baseline work for scene understanding has developed a deep model and the Foggy Cityscapes dataset [149]. The segmentation maps generated by these models clearly outline a long road ahead in this direction. Current models seem to have insufficient generalization potential for challenging scenarios such as rainy, snowy, and cloudy environments. Despite the presence of some challenging datasets for adverse weather conditions such as fog [150], [151], night time and dark scenarios [152], [153], and wild environments [154], current methods still lack a focus on end-to-end deep models that handle complex weather scenarios effectively. There also exist some generalized datasets with multiple challenges [116], [155], but the amount of data labelled for semantic segmentation in most of these datasets is very limited, with the number of annotated instances ranging from 40 [151] to at most 4,006 [155] samples.
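Fog and haze are commonly described with the standard atmospheric scattering model, in which the observed intensity blends the clear-scene intensity with a constant airlight according to a depth-dependent transmission t = exp(-βd). The sketch below applies this model to a single pixel; it is a generic illustration under that assumption, not a reproduction of the pipeline behind [149], and the default attenuation coefficient and airlight value are arbitrary.

```python
import math


def add_fog(intensity: float, depth_m: float,
            beta: float = 0.01, airlight: float = 255.0) -> float:
    """Foggy intensity of one pixel under the atmospheric scattering model.

    intensity: clear-scene pixel value; depth_m: distance to the scene point;
    beta: attenuation coefficient (denser fog for larger values).
    """
    transmission = math.exp(-beta * depth_m)
    return intensity * transmission + airlight * (1.0 - transmission)
```

Nearby pixels are left almost unchanged, while distant pixels converge to the airlight value, which is why far-away objects lose contrast first and become hardest to segment.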
The use of advanced driving simulators such as VituoCity [156] to create photo-realistic synthetic datasets, without the need for expensive and high-risk driving in the real world, is also among the current approaches to compensate for the lack of experiments in adverse weather conditions.
8) Events-Based Scene Understanding: So far, scene understanding has been primarily approached by using segmentation techniques. Nonetheless, the focus can be shifted towards higher levels of vehicular cognition, such as events-based scene understanding [157]. For instance, analyzing events for scene parsing is a promising direction, where surrounding events such as a bicyclist on the vehicle lane or a pedestrian crossing the road, among many other common events, can support more informed decision making of autonomous vehicles [158]. The main point here is not to rely only on segmentation for scene understanding, but rather to explore other metrics and to discover relationships between identified objects over space and time [159]. It is our belief that this augmented contextual awareness will be a major breakthrough towards the accountability of decisions made by autonomous vehicles.

9) Replacing CNNs With Vision Transformers:
Dense prediction models, such as those for semantic segmentation and saliency detection, are mostly inspired by convolutional architectures. In particular, the backbones of semantic segmentation methods mainly rely on convolutional operations. These networks progressively downsample input images and acquire features at multiple scales, thus allowing for increased receptive fields. These mechanisms for feature refining, i.e., for transitioning from low-level to high-level descriptors, are computationally complex and have certain limitations for many computer vision tasks, particularly for dense prediction tasks. For instance, the granularity of the features, as well as their resolution, are lost gradually as the layers go deeper, producing inadequate representations for subsequent decoder layers and losing information that cannot be recovered during the decoding procedure. Training at higher input resolutions demands a higher computational budget, whereas the use of dilated convolutions increases receptive fields quickly without downsampling. Other similar techniques can be applied to mitigate the loss of feature granularity. Unfortunately, such techniques still suffer from bottlenecks due to the involvement of convolutional operations over the hierarchical neural structure of the model.
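The effect of downsampling and dilation on the receptive field can be checked with the standard recursive formula for stacked convolutions, where a kernel of size k with dilation d covers d(k − 1) + 1 input positions. The sketch below is a generic illustration; the layer configurations used in the accompanying remark are hypothetical and not taken from any specific backbone.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per axis) of a stack of conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf
```

Three plain 3 × 3 layers see 7 pixels, whereas either two stride-2 layers followed by a plain one, or three stride-1 layers with dilations 1, 2, and 4, see 15 pixels. The dilated stack thus reaches the same receptive field without reducing the feature resolution.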
In contrast, transformers (as encoders) have better image representation capabilities [160], [161], which mainly hinge on representing images as a bag-of-words and passing them through various transformer layers to extract features at several resolutions. Then, they progressively integrate these multi-resolution representations to finally address the dense prediction task of interest. When trained over large-scale datasets, vision transformers [162], [163], [164], [165], [166], [167], [168], [169] perform well for dense prediction tasks. For instance, Ranftl et al. [162] establish an unprecedented state-of-the-art level of performance by introducing vision transformers in the semantic segmentation domain. A similar approach is observable in the saliency detection domain, where the authors in [163] applied vision transformers with multi-level token fusion and a new token upsampling strategy based on transformers. Liu et al. [165] introduced a transformer-based weakly supervised semantic segmentation method named WegFormer, which encapsulates three different components to generate high-quality segmentation masks. WegFormer first generates attention maps using deep Taylor decomposition (DTD) and then uses a soft erasing mechanism to smooth the computed attention maps. Finally, it filters the noisy activation maps using the proposed efficient potential object mining strategy. Ruiping et al. [166] presented a knowledge distillation driven transformer for efficient semantic segmentation of road scenes. They retrained a shallow transformer by transferring the knowledge learned by a large transformer network trained on a large volume of image data. The knowledge distillation strategy allowed their method to achieve the same level of segmentation performance with a faster inference time due to reduced computational complexity.
Lin et al. [168] proposed a multi-scale transformer for efficient semantic segmentation, which extracts multi-level features from an image and then aggregates the extracted features using a feature selection technique. The aggregated features are then used to determine the salient regions of the given image, resulting in fine-quality semantic segmentation. So far, these methods have achieved unrivaled performance levels in these specific domains, unleashing manifold future research directions and opportunities for semantic segmentation tasks.
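To give a sense of scale for the token-based image representation used by vision transformers, the sketch below counts the tokens obtained when an image is split into fixed-size, non-overlapping square patches, together with the number of pairwise interactions in one global self-attention map. Non-overlapping patches and a patch size of 16 are assumptions made for illustration; individual architectures cited above may tokenize differently.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_entries(tokens: int) -> int:
    """Entries of one global self-attention map (quadratic in the token count)."""
    return tokens * tokens
```

A 1024 × 2048 street-scene image with 16 × 16 patches yields 8,192 tokens and about 67 million attention entries per head and layer, which illustrates why efficiency is a recurring concern for transformer-based segmentation at high resolution.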
10) Towards More Accurate and Efficient Semantic Segmentation Methods for AD: The qualitative performance of currently employed semantic segmentation techniques is shown in Table VI, where we notice that only a few methods are able to balance the trade-off between the accuracy and the inference latency of their model. The experimental results reported for these methods indicate that further work is required to alleviate their computational burden while maintaining their unparalleled performance. Furthermore, we tested the most well-known semantic segmentation models in a few challenging scenarios, as reported in Figure 5. We have found that these models should also be evaluated in terms of knowledge transferability and generalization across different datasets [89]. Moreover, the time complexity reported in Table VI suggests that some of these methods are functional in real time when deployed on GPU devices. In any case, the focus of semantic segmentation methods should also be shifted towards computational complexity, given the stringently limited computational resources available in today's AD in-vehicle telematics.
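The accuracy-latency trade-off discussed above can be made explicit by keeping only the models that are not dominated, i.e., for which no other model is at least as accurate and at least as fast while being strictly better in one of the two. The sketch below is our own illustration; the model names and numbers in the accompanying remark are placeholders and are not values from Table VI, and the 30 FPS real-time target is an assumption.

```python
def pareto_front(models):
    """Names of models not dominated in (higher mIoU, lower latency).

    models: list of (name, miou, latency_ms) tuples.
    """
    front = []
    for name, miou, latency in models:
        dominated = any(
            m2 >= miou and l2 <= latency and (m2 > miou or l2 < latency)
            for _, m2, l2 in models
        )
        if not dominated:
            front.append(name)
    return front


def is_real_time(latency_ms: float, target_fps: float = 30.0) -> bool:
    """True if one frame is processed within the frame period of target_fps."""
    return latency_ms <= 1000.0 / target_fps
```

For placeholder models A (0.80 mIoU, 120 ms), B (0.75, 30 ms), C (0.70, 40 ms), and D (0.60, 10 ms), model C is dominated by B, while only B and D meet the assumed 30 FPS budget.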

VI. CONCLUDING REMARKS AND OUTLOOK
Vision sensors' data are a key component of autonomous vehicles, playing a significant role in an autonomous vehicle's decision making. Vision sensory data are analyzed using Computational Intelligence techniques for effective outputs such as sign board detection, drivable area selection, and traffic lights perception. In doing so, an autonomous vehicle senses the surroundings using vision sensory data. Segmentation extracts pixel values of various objects inside an input image and individuates them from one another using distinct colors. The segmentation of various objects into their respective classes helps dramatically in parsing scene information for the vehicle. Although complementary options can be found to derive data from other sensors for decision making, vision sensors have undoubtedly a major role in the current vehicular panorama.
Segmentation for scene understanding of autonomous vehicles has been in play for many years, but a consolidated, summarized analysis is absent from the existing literature. In this survey we have discussed the strengths of existing segmentation methods in clear environments and their weaknesses when facing challenging scenarios. Our main conclusion is that the scene understanding literature has not achieved perfection yet, as many limitations remain in the current methods, which we have thoroughly covered in our review and followed with relevant suggestions and outlooks. We have covered baseline works dealing with deep learning models, their hierarchy for segmentation tasks, and the challenges associated with each model category. Furthermore, performance evaluation strategies suited for segmentation models, special loss functions, and datasets widely used in the AD domain have also been tackled in depth. We have rounded off the work by exposing open challenges for scene understanding, together with future research directions with stimulating baseline references from the recent literature.
On a closing note, it is undeniable that experts from the ITS community are continuously striving towards better scene understanding strategies to utilize the vision sensors' data effectively. Mainstream research gravitates towards improving the model's accuracy through the capabilities of its neural layers. However, there exist other challenges to be covered in order to achieve reliable, trustworthy, and safe AD. Challenges from the scene understanding perspective demand robust models with prioritization levels for segmented objects, coarse-structure information processing capabilities, and risk categorization. Furthermore, current deep segmentation models are confined to handling a single information modality, while point cloud data [133], [170] have recently been studied extensively for complex tasks related to AD. These are open opportunities to utilize multi-modal data such as 3D LiDAR [171] and vision sensors, and to move from a single deep neural network to more elaborate fusion models, capable of accomplishing complicated yet more informative learning tasks for autonomous vehicles. These opportunities (if well and timely leveraged) can advance ITS research and bring scene segmentation to a new level, where driverless vehicles can be deployed in real-world environments and support safer and more reliable travel services.