Explaining a Deep Reinforcement Learning (DRL)-Based Automated Driving Agent in Highway Simulations

As deep learning models have become increasingly complex, it is critical to understand their decision-making, particularly in safety-relevant applications. In order to support a quantitative interpretation of an autonomous agent trained through Deep Reinforcement Learning (DRL) in the highway-env simulation environment, we propose a framework featuring three types of views for analyzing data: (i) episode timeline, (ii) frame by frame, and (iii) aggregated statistical analysis, also including heatmaps for better spatial understanding. Our methodology allowed a novel, consistent description of the behavior of the agent. The main motivator for the taken action is typically the longitudinal distance from the second-closest and, to a lesser extent, third-closest vehicle. During overtakes, the agent's lane position also becomes relevant. The analysis identified interesting patterns and an issue in the last frames of an episode, when the agent is unable to overtake the last two vehicles, arguably because of the lack of reference vehicles ahead. We observed a clear differentiation between attention and SHAP values (which estimate the importance of each feature for each decision), reflecting the architecture of the neural network, where the first layer implements the attention mechanism, while the deeper ones make the actual decision. Attention focuses on the proximity of the ego, while the decision is taken on a wider horizon, denoting a valuable anticipation capability. To support research, the proposed framework is released as open source.


I. INTRODUCTION
With the rapid advancement in perception and advanced driving assistance systems (ADAS), the importance of high-level behavioral planning has emerged in the automated driving (AD) research panorama [1]. A cutting-edge framework for learning sequential decision-making policies under uncertainty is given by neural network (NN)-based deep reinforcement learning (DRL) methods (e.g., [2], [3]). As NNs are black-box modules, explainability [5] has become a key requirement for a robust validation procedure [4], especially in safety-relevant applications, as it aims at ensuring that the network has learned from the correct features, not from artifacts in the data [6]. There are two main types of explainability approaches in machine learning (ML). The first one relies on inherently interpretable models [7] (aka white or transparent boxes, such as linear models, decision trees, Bayesian networks, and Tensorflow Lattice [8]). The second one exploits techniques that process the inputs and the outputs of a black-box model in order to gain insights into it (post-hoc explanations). The most representative members of this second group include SHapley Additive exPlanations (SHAP) [9], Local Interpretable Model-agnostic Explanations (LIME) [10], and DIverse Counterfactual Explanations (DICE) [11]. Exploitation of explainable artificial intelligence (XAI) is debated in the academic community, since advantages may be offset by possible disadvantages. On the one hand, it has been argued that inherent XAI may limit the functionality of AI more broadly [12]; on the other hand, post-hoc models increase the overall system complexity and may not be able to completely reproduce/assess the computations of an entirely separate model [13]. Thus, better insights are needed.

FIGURE 1. RL-SHAP diagram for the test in the OpenAI Gym LongiControl environment. The first diagram presents the state of the agent (speed and covered distance), together with the speed limits. The other diagrams show two features (speed and current speed limit), where the color encodes the SHAP values (red indicates a positive effect on the acceleration, blue a negative effect, and gray indifference). The figure is re-elaborated from [14].
In the DRL field, the SHAP method has been recently proposed as the basis for a methodology to explain how a trained agent selects its action in a particular situation in the OpenAI Gym LongiControl environment [14]. SHAP, based on game theory, is a unified framework for interpreting predictions post-hoc [9]. The framework unifies six existing methods, including LIME [10], DeepLIFT [15], and Layer-Wise Relevance Propagation [16], which have been shown to all use the same explanation model [17]. For each provided prediction, SHAP assigns each feature an importance measurement and can be used to quantitatively explain the output of any ML model [9]. In the mentioned case of longitudinal control of a vehicle [14], SHAP values were computed for the different input features, and the effect of each feature on the selected action (i.e., level of acceleration) was shown on a timeline, in a novel RL-SHAP diagram representation. The color-code-based RL-SHAP representation shows which state features have a positive, negative, or negligible influence on the action. Analyzing the agent's behavior on a test trajectory (Fig. 1), the authors showed that the contributions of the different state features can be logically explained given some domain knowledge.
A key mechanism of state-of-the-art NNs is attention, which originated in the natural language processing (NLP) area but has quickly spread across domains. The concept corresponds to the human tendency to focus on the distinctive parts when processing large amounts of information [19]. It has been argued that the attention layer(s) in a NN provides a way to reason about the model behind its predictions, thus naturally lending itself to interpretability (e.g., [20], [21]), but some research works have also highlighted its opacity (e.g., [22], [23]). In the AD domain, attention has been used in pedestrian (e.g., [24], [25]) and car (e.g., [26], [27]) trajectory prediction, and in decision making (e.g., [2], [18]). Reference [18] exploited an attention-based NN architecture to implement an agent able to cross an intersection in the well-established highway-env OpenAI Gym 2D environment [28], [29]. As shown in Fig. 2, the authors display two colored beams (one for each attention head) in each frame, whose width is proportional to the per-vehicle attention values. Commenting on a complete simulated intersection-crossing episode, the authors show that the values of the attention layer are consistent with the amounts of attention a real driver would pay to the different vehicles around, thus also stressing interpretability as an advantage of the proposed neural architecture.
The novelty of our research consists in going beyond reasoning on a single episode and making a more detailed, quantitative interpretability analysis of behavioral planning in highways. Particularly, we are interested in understanding the relationship between attention and interpretability. We utilize SHAP, given its solid theoretical foundation in game theory ([9], [30]) and its wide adoption [17], also in our specific area [14]. We are also interested in understanding how to analyze and represent attention and SHAP values in a 2D spatial highway environment (limiting our analysis to single frames, thus ignoring temporal correlations). Another addressed question is whether abstract information from the neural model (i.e., attention and SHAP values) is sufficient for interpretability, or whether it should be complemented with an in-depth domain-specific analysis in order to guarantee a proper functional verification.
As the simulation environment, we use the above-mentioned open source, PyGame-based highway-env, which employs the Bicycle Kinematic Model [31] for vehicle motion, a linear acceleration model based on the Intelligent Driver Model (IDM) [32], and a lane-changing behavior based on the MOBIL model [33]. While this platform is simpler than popular vehicle simulators, such as Carla [34] or Sumo [35], it is well-established for assessing novel DRL-based decision-making control policies (e.g., [2], [3], [36], [37]), and we believe that it is well suited for setting up an environment for interpreting high-level decision-making in highways. highway-env natively supports DRL.
The remainder of the paper is organized as follows. Section II presents related work and background, and Section III the experimental environment setup. Section IV shows and discusses the results, while Section V draws conclusions on the work done.

II. RELATED WORK AND BACKGROUND
Automated driving requires decision-making in dynamic and uncertain environments, and targeting higher levels of automation requires implementing strategies for higher-level decisions [38]. The driving task can be formalized as a Partially Observable Markov Decision Process (POMDP), taking into account both the stochasticity of the behavior of the various road actors and the uncertainty of the perception systems [39].
A key factor for decision-making in behavioral planning is trajectory prediction. Common approaches to trajectory prediction focus on spatial interaction modeling. The pioneering model can be traced back to the Social Force model [40], which superimposes attractive forces towards a goal with repulsive forces from other vehicles. Attention-based architectures trained through DRL are gaining ever more interest (e.g., [2], [41], [42]). In the pioneering work in the area of decision-making involving social interactions, [18] argues that such architectures allow dealing with a varying number of nearby vehicles and support invariance to the ordering chosen to describe them, even when using a list-of-features representation. They also naturally account for interactions between the ego-vehicle and any other traffic participant. A Hierarchical Spatio-Temporal Attention (HSTA) architecture has recently been proposed [43], which weights spatial interactions differently and jointly considers the temporal interactions across the time steps of all agents. Reference [2] proposes a hierarchical control structure, whose high-level decision-making integrates two attention modules into a dueling double deep Q network (D3QN-DA), achieving a higher safety rate and average exploration distance.
Similarly to [2], [3] presents a hierarchical control framework, in which the upper level manages the driving decisions in a highway environment, and the lower level manages speed and acceleration. The dueling deep Q-network DRL algorithm is applied to learn the highway decision-making strategy, improving convergence rate and control performance.

Reinforcement Learning (RL) is a branch of ML in which an agent learns a policy through trial and error, deciding which action to take in each state. The policy is trained to maximize the potential future rewards [44]. The learning process happens in an environment providing positive and/or negative rewards for each decision taken by the agent. The learning objective is framed as the optimal control of a Markov Decision Process. There are two main approaches to determine the optimal action: value-based algorithms (find the action, typically discrete, with the maximum expected overall value) and policy-based algorithms (find the maximum-reward policy) [45]. The optimal action-value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ satisfies the Bellman Optimality Equation: $Q^*(s, a) = \mathbb{E}\left[ R(s, a) + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right]$, where $s$ is a state, $a$ an action, the prime denotes the next step, $R$ is the reward, and $\gamma$ the discount factor [46]. The Q-learning algorithm [47] computes $Q^*$ iteratively by applying a sampling version of this equation to a batch of collected experience. The Deep Q-Network (DQN) algorithm [48] addresses the issue of continuous state spaces by using a NN model to represent the action-value function $Q$. Several other algorithms have been proposed in the literature to train DRL models, particularly addressing issues such as convergence and stability. These include, among others: Double DQN (DDQN) [49]; Dueling DQN [50]; Dueling Double DQN (D3QN) [51].
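To make the Bellman update concrete, the following minimal PyTorch sketch (our illustrative code, not taken from the cited works; the network and batch handling are assumptions) computes the DQN regression target and loss for a batch of transitions:

```python
import torch
import torch.nn.functional as F

def dqn_target(rewards, next_q, dones, gamma=0.99):
    # Bellman optimality target: r + gamma * max_a' Q(s', a'),
    # with bootstrapping disabled on terminal transitions.
    return rewards + gamma * (1.0 - dones) * next_q.max(dim=1).values

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) for the actions actually taken in the batch.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # targets are treated as constants
        targets = dqn_target(rewards, target_net(next_states), dones, gamma)
    return F.mse_loss(q_sa, targets)
```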
Reference [52] describes the foundations of interpretable data science for decision making, which analyzes data summarizing domain relationships to produce knowledge that is readily understandable by human decision makers. The paper indicates as the most popular model-agnostic approaches LIME [10], SHAP values [9], partial dependence plots (PDP) [53], and permutation feature importance scores [54]. Similarly, two recent XAI reviews ([17] and [55]) highlight the popularity in the literature of LIME, SHAP and, again, PDP as methods for visualizing feature interactions and feature importance. Reference [17] also stresses that ''white-box highly performing models are very hard to create, especially in computer vision and natural language processing, where the gap in performance against deep learning models is unbridgeable''. Two other systematic reviews of the XAI field have been published recently ([56] and [57]). Reference [58] highlights the several open research points in the area, starting from the lack of agreement on the definition of explanation itself. Very useful, in this respect, is [59], which warns of many general pitfalls of ML model interpretation, such as using interpretation techniques in the wrong context, interpreting models that do not generalize well, ignoring feature dependencies, interactions, uncertainty estimates and issues in high-dimensional settings, or making unjustified causal interpretations.
As anticipated, in this work we focus on SHAP, a method to explain individual predictions by computing the contribution of each feature to the prediction. It leverages the idea of Shapley values [60] for model feature influence scoring. A Shapley value is the ''average marginal contribution of a feature value over all possible coalitions'' (i.e., groups of features) [30]. Having to consider all possible predictions for an instance using all possible combinations of inputs, SHAP can guarantee properties like consistency, missingness, and local accuracy, but has a high computational cost. Finally, SHAP is an additive feature attribution method, meaning that the sum of the SHAP values (one for each input feature) results in the prediction of the model (actually, in the difference between the prediction and a base value given by the average prediction, as shown in Fig. 3). SHAP guarantees that the effects (i.e., the prediction) are fairly distributed among the features (efficiency property).
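The additivity (local accuracy) property can be checked directly, as in this minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical per-feature SHAP values for a single prediction.
shap_values = np.array([0.12, -0.05, 0.30, 0.01])
base_value = 0.45   # average model output over the background dataset
prediction = 0.83   # model output for this particular instance

# Local accuracy: the base value plus the sum of per-feature
# contributions reconstructs the prediction.
assert np.isclose(base_value + shap_values.sum(), prediction)
```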

III. EXPERIMENTAL ENVIRONMENT
In order to achieve the research goals stated in the Introduction, we have trained a DRL agent through a simple DQN algorithm in the highway-env environment. We modified the open-source code base in order to extract from each episode the values to analyze (namely: state log, action log, attention matrix values, and Q matrix values).
The agent model architecture revolves around one ego-attention layer, which captures ego-to-vehicle dependencies. The layer takes one vector input per vehicle, obtained by embedding each vehicle's feature list through an encoder made of two 64-unit linear dense layers, while one overall decoder (constituted by two dense layers as well) produces the final output (Fig. 4) [18]. We use a single ego-attention head, both for simplicity and because of the one-way highway environment.
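A minimal sketch of such a network, assuming the layer sizes stated above (the single-head scaled dot-product ego-attention and all remaining details are illustrative, not taken from [18]), could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EgoAttentionDQN(nn.Module):
    """Illustrative ego-attention value network: per-vehicle encoder,
    single attention head queried by the ego, two-layer decoder."""
    def __init__(self, n_features=7, d=64, n_actions=5):
        super().__init__()
        # Per-vehicle encoder: two 64-unit dense layers shared across vehicles.
        self.encoder = nn.Sequential(nn.Linear(n_features, d), nn.ReLU(),
                                     nn.Linear(d, d), nn.ReLU())
        self.q_proj = nn.Linear(d, d)   # query from the ego embedding
        self.k_proj = nn.Linear(d, d)   # keys from all vehicle embeddings
        self.v_proj = nn.Linear(d, d)
        # Decoder: two dense layers mapping the attended embedding to Q values.
        self.decoder = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                     nn.Linear(d, n_actions))

    def forward(self, obs):                   # obs: (batch, n_vehicles, n_features)
        h = self.encoder(obs)                 # (batch, n_vehicles, d)
        q = self.q_proj(h[:, 0:1])            # single query from the ego (v0)
        k, v = self.k_proj(h), self.v_proj(h)
        scores = q @ k.transpose(1, 2) / (h.shape[-1] ** 0.5)
        attn = F.softmax(scores, dim=-1)      # per-vehicle attention distribution
        context = (attn @ v).squeeze(1)       # attended embedding, (batch, d)
        return self.decoder(context), attn.squeeze(1)
```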
We have conducted our analysis on a standard 3-lane highway-env environment with its default values, apart from: a 0.6 (medium) traffic density; an 8-vehicle observation (in order to limit the computational complexity and reduce the training times); a 1 Hz policy (i.e., decision) frequency, in order to give the agent enough time to see the effects of its decisions; and an 80-frame episode duration (i.e., 80 s, since the term frame refers to a simulation step in which a decision is taken), which we considered a sufficient time horizon to verify the agent's behavior. In highway-env, the observed vehicles are sorted frame by frame by distance from the ego vehicle: v0 is the ego vehicle, v1 the closest to the ego, v2 the second closest, etc. Thus, there is no episode-unique identifier per vehicle (apart from the ego), but vehicles are identified on a frame-by-frame basis. Each episode involved a total of 15 vehicles, which is the default value. Each observation (i.e., the input to the DRL model) consists of 7 features per vehicle, namely: presence (as a vehicle may not be present in the current frame, since the observation area around the ego vehicle is limited), x, y, longitudinal and lateral speed, and the two trigonometric headings. Observations are normalized (100 m for longitudinal distance, 80 m/s for speed; each lane is 3.5 m wide) and are absolute for the ego (the ego's x is always 0) and relative to the ego for the other vehicles.
The ego vehicle velocity has three possible levels: 20, 25, and 30 m/s, generally higher than that of the other vehicles, which travel at around 22 m/s. The agent can choose at each frame one among five actions: right, left, idle, faster, or slower. The decided actions are then processed by the system taking the context into account. For instance, a slower action at 20 m/s actually results in an idle, and the same applies to a left action when the ego vehicle is already in the leftmost lane. For this reason, in the following, we distinguish between action (i.e., the decision of the model) and actual action (i.e., the action performed by the ego vehicle considering the context limitations).
The agent was trained with a 0-1 dense reward for speed and a -2 sparse reward for collisions, which also terminate the episode. For simplicity, we have removed the right-lane reward, as the original model does not penalize overtaking on the right.
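For reference, a hedged sketch of this setup using highway-env's configuration dictionary (the key names follow the library's documented schema, but the exact values here simply mirror the text, and defaults and the reset API may vary across library versions):

```python
import gym
import highway_env  # noqa: F401 -- registers the highway-v0 environment

env = gym.make("highway-v0")
env.configure({
    "lanes_count": 3,
    "vehicles_count": 15,        # total vehicles per episode
    "vehicles_density": 0.6,     # medium traffic density
    "duration": 80,              # frames, i.e., 80 s at 1 Hz decisions
    "policy_frequency": 1,       # one decision per second
    "observation": {
        "type": "Kinematics",
        "vehicles_count": 8,     # 8 observed vehicles, including the ego
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
    },
    "collision_reward": -2,      # sparse penalty; collisions end the episode
    "right_lane_reward": 0,      # removed, as described above
})
obs = env.reset()
```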
After training, the final success rate (successfully completed episodes) is 89%, with 2.3 km driven per episode. In general, in several attempts with different hyperparameters, we noticed that training is effective only in the first (between 400 and 1,200) episodes, after which performance degrades without any significant recovery, even after around 35,000 episodes (48 hours of training on an Nvidia DGX system). This pattern is known in the literature as catastrophic forgetting. Longer training periods tend to make the agent more conservative, probably because of accidents due to bad behavior of the other vehicles. The Tensorboard diagram of the training is reported in Fig. 5. Observing the resulting episodes, we also notice cases in which the ego is not able to overtake the last few vehicles. However, while not optimal, we argue that the achieved model is appropriate for quantitatively explaining its behavior, which is the goal of this paper.
After an agent has been sufficiently trained, it can be analyzed. SHAP values are obtained by running a certain number of episodes and fitting the model through the SHAP Python library. Following the SHAP documentation and library [61], we instantiated a DeepExplainer, passing to it the agent's DQN model (the value network) and the observations from a set of 20 training episodes (i.e., a total of 1,600 samples), which should enable the module to produce a very good estimation of the SHAP values for each input feature. Once fit, the SHAP model can be fed with test values to produce the estimations. At every frame, SHAP values are computed on the DRL's value network, which outputs the Q value of each possible action, that is, the overall expected reward assuming that the agent is in the observed state, performs the action, and then continues playing until the end of the episode following some policy π. Thus, we have SHAP values for each possible action, even if in this paper we mostly focus on the values for the selected action only.
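A sketch of this step with the SHAP library (the variable names value_net, background, test_obs, and chosen_action are our placeholders, not identifiers from the released code):

```python
import shap

# value_net: the trained DQN value network; background: the flattened
# observations collected from 20 training episodes (1,600 samples).
explainer = shap.DeepExplainer(value_net, background)

# For multi-output models, shap_values() returns one array of per-feature
# SHAP values for each output, i.e., for each action's Q value.
shap_values = explainer.shap_values(test_obs)
selected = shap_values[chosen_action]  # contributions for the selected action
```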
Despite the exponential complexity of the algorithm [9], SHAP execution times were quite limited, on the order of a few seconds.
For the attention values, we take the output of the attention layer, which is a probability distribution across vehicles. Thus, the sum is always equal to 1, and high max attention values indicate that the attention is focused on a specific vehicle, while low max attention values indicate that the attention is distributed across more vehicles.
Comparing attention and SHAP requires an adjustment, since attention values are per vehicle, while SHAP values are per feature. Following the explanations of simpler problems, such as [14], we considered it a good approximation to define the SHAP value of a vehicle as the SHAP value of its most important feature: $\mathrm{SHAP}(V_i) = \max_{f \in \mathcal{F}(V_i)} \mathrm{SHAP}(f)$, where $\mathcal{F}(V_i)$ denotes the set of features of vehicle $V_i$. Such per-vehicle SHAP values are then converted into probabilities through a softmax. We define the max SHAP vehicle (MSV) as the vehicle with the highest per-vehicle SHAP value in a frame. Similarly, the max attention vehicle (MAV) is the vehicle with the highest attention value.
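In code, this aggregation and the resulting MSV/MAV definitions reduce to a few lines (a sketch under the stated approximation; the array names are ours):

```python
import numpy as np
from scipy.special import softmax

def per_vehicle_shap(shap_frame):
    """shap_frame: (n_vehicles, n_features) SHAP values for one frame.
    Each vehicle is assigned the SHAP value of its most important feature;
    the per-vehicle values are then turned into a probability distribution."""
    return softmax(shap_frame.max(axis=1))

def msv_mav(shap_frame, attention):
    msv = int(np.argmax(per_vehicle_shap(shap_frame)))  # max SHAP vehicle
    mav = int(np.argmax(attention))                     # max attention vehicle
    return msv, mav
```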

IV. EXPERIMENTAL RESULTS
We organized our analysis into three main aspects. The first one involves considering the timeline of every single episode. The second studies the SHAP and attention values at each decision frame. The last one consists in computing statistical values from the aggregation of several episodes. Accordingly, we have prepared a Jupyter notebook to perform the analysis. For clarity of presentation, we name the three aspects Episode View, Frame View, and Aggregated View, respectively. While, for the sake of simplicity, we will present our analysis sequentially, it is important to highlight that several of the presented considerations actually come from an iterative synergy of the different views. In the following, for space reasons, we provide a subset of the results obtained from our analysis. All the data, together with the source code, are available at: https://github.com/Elios-Lab/explaindrl-highway.
A first step of the analysis consists in analyzing the evolution of one or more whole episodes on a timeline, which is the Episode View. For instance, Fig. 6 reports a selection of the most representative quantities in episode 40, which we have chosen as a significant example. The view shows some characteristics that we deem important to support the analysis (e.g., action, actual action, ego lane) and plots their values against the timeline. The view also shows the MAV and MSV. Vehicles are identified, frame by frame, by distance ranking: v1 is the closest vehicle to the ego, v2 the second closest, etc.; v0 is the ego. The type of the most important feature (i.e., the feature with the maximum SHAP value) is reported as well. According to the highway-env convention, lanes are numbered 0-2 from left to right. Features are numbered as follows: 0 presence; 1 longitudinal distance (x) from the ego; 2 lateral distance (y); 3 relative longitudinal velocity; 4 relative lateral velocity; 5 and 6 trigonometric headings.
Looking at the lane plot, we see that the ego starts driving in the rightmost lane, then goes to the leftmost lane (around frame 24), and comes back to the rightmost one slightly after frame 30. A similar pattern (which, as we will see, is due to a double simultaneous overtake) is executed two more times. Then, the ego finishes the episode proceeding down the center lane. In this episode, it appears that max attention is never on the ego vehicle, unlike the max SHAP. Moreover, there is a clear difference between max attention and SHAP, with the MAV typically closer to the ego than the MSV. It could thus seem that attention is more general, while SHAP is more specific to the decision. There are only three exceptions, in which the MAV is v3 (i.e., farther away) and the MSV is v2 (i.e., closer). We briefly analyze them with the help of another snapshot from the Episode View, the one reporting the SHAP values for the input features (Fig. 7, which again shows only a selection of all the available observations). While frame 5 could be a transitory spike just after an overtake (v1 x getting to 0 in Fig. 6), in frames 52-54 v2 and v3 have very similar longitudinal distances and, as in frames 14-15, v3 (MAV) is in the ego lane, unlike v2 (MSV). This suggests that attention may be slightly more focused on the ego lane than on the others. Considering the max (i.e., most important) feature diagram, it appears that the most important feature is almost always the longitudinal distance from the ego; in some frames it is the lane position (in these frames the MSV is the ego) and, in the episode's last frames, the trigonometric heading (in one case, the lane position), when the MSV is v4 or v3. But if we look at the presence feature of these two vehicles (not reported in Fig. 7 for space reasons) in these last frames, we notice that they are not present. Moreover, trigonometric heading x is equal to 0 (default value) when the vehicle is absent (while it is 1 or close to 1 when a vehicle is present), so we argue that trigonometric heading x is interpreted by the RL model as an indicator of presence. The absence of vehicles towards the end of an episode is due to the fact that there is a limit of 15 vehicles per episode, and most of them were previously overtaken. Observing all the timelines in Fig. 6, however, the final frames (72-79) show a peculiar pattern, that is, a sequence of actual brakes and accelerations, with the max feature being the trigonometric heading of the lowest-order absent vehicle (i.e., v3 or v4). This deserves further investigation in the following. We also notice that the agent always travels at the highest speed, apart from these final frames.
MSV is the ego in frames 22 and 59 only. This corresponds to two left lane actual actions (i.e., overtakes), while the most common actual action is idle.
Looking at the chart showing the distance from the ego of the MAV and MSV strengthens the impression that attention is usually paid closer to the ego. There is also a ''step & staircase'' pattern, with attention steps happening almost always in proximity of the overtakes by the ego (overtakes can be recognized as the longitudinal distance of v1 approaches 0), while the max SHAP tends to ''jump'' (i.e., perform the ''step'' in the pattern) to a farther vehicle before the overtake. In a few cases, the ''step'' is delayed until the actual overtake. The pattern seems to suggest that a lane change decision (overtaking or coming back to the right lane) is usually based on a higher-order (i.e., farther away from the ego) reference vehicle. This implies that the closest vehicles do not have a major influence on the decision, as they are already being overtaken: the agent has already processed them and now focuses on a larger horizon.
Apart from the mentioned exceptions, the MAV is always the closest or second-closest vehicle, while the MSV has more variety (it also includes the ego) and longer sight, and is never the closest vehicle. As SHAP concerns the final decision of the NN (while attention is a shallower layer), this seems to show a kind of anticipation ability in decision taking (which is a characteristic of good drivers [62]), while attention is generally kept on close vehicles, so as to avoid ''bad surprises'' (the highway-env model involves rare random lane changes by some vehicles, which may result in accidents). In some frames, we see an overlap of max attention and SHAP. This is due to the fact that the MAV ''reaches'' the MSV (in other words, the ego gets closer to the MSV, which also becomes the MAV). But then the MSV rapidly becomes a farther vehicle, which is a ''delayed step'' in the above-mentioned pattern.
We finally observe that the sum of SHAP values in a frame looks like a significant indicator of the expected benefit of the taken action. We observe jumps in this value after a successful lane change decision (i.e., start or close an overtaking), and a more or less slight decrease as new vehicles get or stay close to the ego.
Our proposal of Episode View also includes the RL-SHAP diagrams proposed by [14]. Interpretation is considerably more complicated than in the original case, given the number of timelines (one for each feature per vehicle, for a total of 56, compared to the original 7), even if only about a tenth of them look relevant. In order to improve the readability of the charts, we normalize the color code (representing the SHAP values of the single features, as in Fig. 1) at each episode, with two different slopes for positive (red) and negative (blue) values. The v1 x (i.e., longitudinal distance) diagram clearly shows a sequence of 11 overtakes (the v1 distance time-series gradually falling from about 0.2 to below 0). Similar behaviors can be seen in the longitudinal distance feature of higher-order vehicles (i.e., v2, v3, etc.) as well, with the difference that they are farther away from the ego and, apart from the case of simultaneous overtakes, their series terminate before reaching a distance of 0 (because vehicles, getting closer to the ego, decrease their order as the previous vehicles are overtaken). From the charts, some clear explainable patterns also emerge. For instance, the main motivator for the taken action is typically the longitudinal distance from v2 and, to a lesser extent, v3. For v2, the greater the distance, the stronger the SHAP values, typically with an idle action. On the other hand (considering both Fig. 6 and 7), v3 seems a major motivator for overtaking/coming back to the right lane. Thus, a distant second vehicle seems a good motivator for continuing at high speed in the same lane, while a farther away vehicle (v3 is more distant than v2) may be needed as a reference point for lane change decisions. As we will see in the Frame View (e.g., Fig. 8) and can argue from the timeline of the v1 x SHAP values (Fig. 7), v1 is frequently in a different lane than the ego and is already being overtaken. The latitudinal position of the ego looks relevant for the lane changes (the SHAP values of this feature look higher just before the lane-change decision frames). All this suggests that the agent takes a lane change decision mostly by checking its lane position and the longitudinal distance of the second and third vehicles, which could be interpreted as references for overtaking v1 or v2, respectively. The fact that SHAP values for the longitudinal distance of v2 tend to decrease as the vehicle gets closer to the ego seems to indicate that the proximity of v2 to the ego makes the expected benefit of an idle actual action lower (as there is less free space for the ego). We also notice that speed information does not look important, even for speed change decisions.
As the distance diagram in Fig. 6 refers to present vehicles only, in the final frames, in which MSV is an absent vehicle, the distance goes to 0 as the present vehicle with the highest SHAP value becomes the ego.
As anticipated, the interpretation of such final frames, which tend to occur in other episodes as well, required a more in-depth investigation. This spurred us to implement a second display modality, namely the Frame View, which we describe in the following.
The Frame View of an episode is a quantitative view reporting, for each frame, the whole observation, organized as a table. In frames 39 and 40, the agent's decision is to go faster. Since the vehicle is already traveling at the maximum speed, the actual action is idle. In frame 39, max attention and max SHAP are on different vehicles (v1 and v2, respectively), while in the next frame they converge to v2. At a glance, it appears that attention values are quite evenly distributed across vehicles, proportionally to longitudinal distance, particularly for vehicles in the same lane as the ego (y=0), which is in the central lane. SHAP values, instead, are focused on some specific features (the longitudinal distance of v2 and v3). The ticks in the color scale at the right side of each frame give an idea of the actual SHAP values. In these frames there is an overall prevalence of positive values (the sum of the SHAP values of all the features is 13.0 and 11.3 in the two frames, respectively), indicating that the Q value is higher than the average, probably because of the available space in front of and near the vehicle. Positive rewards, in fact, are given only for speed, and negative ones for collisions. In frame 39, the highest-attention vehicle would discourage the faster action chosen by the model, and the closest vehicle (v1) would do the same in the next frame. Being in a different lane, v1 falls just short of being the max attention vehicle in this frame. This seems to confirm our previous impression that the model's decisions are more long-sighted than the MAV (i.e., they depend more on vehicles that are farther away).
Frames 22 to 24 (Fig. 9) concern a double lane change (to the leftmost lane) in view of the overtaking of v2, to which most of the attention is devoted, being the closest vehicle in the ego lane (frame 22). We can see that the left action is discouraged by the presence of v1 in lane 0 (i.e., the leftmost), while it is supported by the fact that v0 is in lane 2 (the rightmost) and the first vehicle in lane 1 (v3) is not so close. Actually, the longitudinal distance of v3 is a slight deterrent. The agent decides to move left. In any case, it is not an ''easy'' decision, as it appears by looking at the sum of SHAP values, which is -2.8 and -11.4 in frames 22 and 23, respectively. In frame 23, the former v1 has fallen behind (thus disappearing from the observation, and the former v2 has become v1) and another left decision is taken, because of a close vehicle in the same lane (v2) and, overall, the fact that the first vehicle in the left lane (v3) is quite far. Also in this case, max attention goes to the first vehicle in the same lane as the ego. In frame 24, the agent chooses a faster action, because the first vehicle in the same lane is not close. This is translated into an idle actual action, since the ego is already at the highest speed. Looking at the actual SHAP values, it is interesting to note that in this last frame we have an overall prevalence of positive values (their sum is 0.8), given the lower risk of collision. Thus, the sum of SHAP values increased from 0.2 in frame 21 to 0.8 in frame 24, passing through deeply negative values in between, which indicates a risky but useful maneuver. Also in these frames, the SHAP values of the closest/max attention vehicle do not support the decision taken by the agent. It is interesting to note, observing other episodes as well, that v1 is never in agreement with the agent's decision. Moreover, looking, frame by frame, also at the actions different from those actually taken, we see that v1 has almost always negative SHAP values, for all the actions. This confirms the impression that, while the closest vehicle (v1) is important for the attention layer, the model decision has a longer, and totally complementary, horizon. v1 is often a vehicle that has already almost been overtaken, so it does not look so important for the evolution of the ego's trajectory.
As anticipated, the final frames (73-75) of the episode are quite challenging. Here we see that the ego is always in the middle lane and there is no vehicle in its lane, so it would be easy for it to just continue straight at the highest speed. Instead, it is slowed down by the presence of two close-by vehicles in the lateral lanes, to which all the attention is devoted (they are the only two in the scene). The max SHAP values are given to the trigonometric headings of three vehicles that are not in the scene. As anticipated, we argue that this is interpreted by the model as an indicator of presence. But we also notice that the actual presence feature of these vehicles is a strong deterrent to the chosen action. We argue that this discrepancy highlights the difficulty of the agent in facing this situation. In fact, after the two brakes in frames 73 and 75, the agent accelerates in frame 78 and then stays idle in 79, without going at the maximum speed. Taking into account our previous considerations, we see that the agent's decisions are usually taken based on a reference vehicle (traveling in the lane targeted by the decision) that is far enough from the ego. In these frames (which appear at the end of each simulation, and thus are not frequent in training), there is no such vehicle, and the model is not able to properly tackle this situation. As a matter of fact, only in very few cases is the agent able to complete the episode overtaking all 15 vehicles. Observing the Frame View for the non-chosen actions as well, we see that the trigonometric headings of v3, v4, and v5 are motivators for all of them, which confirms the uncertainty. On the other hand, if we restrict our analysis to present vehicles, we notice that the ego y has the main importance, as we already observed in other situations where a non-idle action was actually performed (i.e., in situations requiring a change in the ego state).
The above-presented episode analysis enables us to make useful considerations. However, a more complete statistical analysis over a set of episodes would be useful, possibly also segmenting some particular driving situations. Moreover, the above analysis almost entirely misses 2D spatial considerations. This is the reason for our third approach, namely the Aggregated View, for which we present results from a set of 150 test episodes. For the analysis, we drop the failed episodes, so as to avoid polluting the interpretation with bad behaviors.
The distribution of the decided actions is as follows: 2% idle, 21% right, 10% left, 60% faster, and 7% slower. In terms of actual actions, the distribution becomes 70% idle, 9% right, 9% left, 6% faster, and 6% slower. The ego is at the highest speed for 85% of the frames, and at the lowest for 7%. The agent travels 2.3 km per episode (average speed: 104 km/h). The agent tends to prefer traveling in the right lane (45% of the time; center: 28%, left: 28%), even without a specific rule/reward for this. We argue that this is due to the fact that the central lane could be more dangerous, as it could involve vehicle cut-ins from both the left and the right side. The preference for the right or left lane may be randomly learned during training, because of the traffic conditions, since in some training runs we recorded a preference for the left side instead. The preference for a lateral lane also explains the human-like ''overtake'' patterns that we could observe by looking at the ego lane chart in Fig. 6 (Episode View). This has some implications on perception specialization, as we will see later.
Max attention (Fig. 11) is concentrated on v2 and v1, while max SHAP is mostly on v2, sometimes on v0, v3, and v4, and almost never on the others. We also computed that max attention and SHAP are on the same vehicle only 27% of the time (and only in 31% of the cases is each of the two maxima within the second max of the other), confirming a clear differentiation between the attention layer and the outcome of the network (i.e., the decision).
The distribution of max values (Fig. 12) looks almost normal for attention (though with a non-negligible spike at the right end), while it is highly skewed for SHAP, suggesting that in about one-third of the cases the action is strongly determined by a feature of a single vehicle.
The average attention and max SHAP value of each vehicle (i.e., v0, v1, v2, etc.) is shown in Fig. 13. Attention entropy is slightly higher than SHAP entropy (1.6 vs 1.5). Entropies of attention and SHAP are not correlated, according to a Pearson test (r ≈ 2e-4, p = 0.98).
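These statistics follow from standard entropy and correlation computations, e.g. (a sketch; attn and shap_prob are assumed to be the per-frame probability distributions over vehicles):

```python
from scipy.stats import entropy, pearsonr

# attn, shap_prob: arrays of shape (n_frames, n_vehicles), each row a
# probability distribution (attention output, softmaxed per-vehicle SHAP).
attn_entropy = entropy(attn, axis=1)
shap_entropy = entropy(shap_prob, axis=1)

r, p = pearsonr(attn_entropy, shap_entropy)
print(f"mean entropy: attention {attn_entropy.mean():.2f}, "
      f"SHAP {shap_entropy.mean():.2f}; Pearson r={r:.1e} (p={p:.2f})")
```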
Max SHAP features are the longitudinal distance (76% of the time), lateral position (i.e., lane) (15%), and trigonometric heading (9%), which we consider as an indicator of the absence of the vehicle in the scene, as said before.
The average distance of the max attention vehicle is 31.8 m (15.2 std), which is considerably less than for the max SHAP vehicle (47.9 m, 25.0 std). In both cases, we exclude the ego. The actual distributions (including the ego) are reported in Fig. 14. The average time headway of the ego vehicle is 1.24 s, and the space headway 37 m; the distributions are reported in Fig. 15. Excluding the ego cases, max attention is more frequently on a vehicle in the same lane as the ego (61% of the times), while max SHAP only in 46% of the times. This seems to suggest that the differentiation between attention and SHAP also involves the latitudinal dimension.
The differentiation between max attention and max SHAP vehicles appears even more clearly from a spatial analysis on heatmaps (Fig. 16). In the grid view, the ego vehicle is positioned in the (2, 1) cell (third row, second column), where the rows indicate the relative lane with respect to the ego (e.g., the fifth row is the second lane to the right of the vehicle, which may not even exist in a given frame). Cell width is 5 m. The grid represents the position of the MAV and MSV (Fig. 16 a) and c), respectively). Fig. 16 b) and d) represent the same values, normalized by the traffic in the cell. The figures show that max attention is mostly paid in front of the vehicle, without a significant lateral spread. Normalization highlights that whenever a vehicle is in the three/four cells ahead of the ego, it is always given the maximum attention. Max SHAP values (when not placed on the ego vehicle) are at a farther distance. Considering normalization, we see that max SHAP values are more spread across the lanes. These results confirm the intuition of the previous analyses, that the network decision is taken more on the basis of the ego state (actually, its lane) or, more frequently, of a vehicle in the target lane, than of the close-by vehicle in the same lane (to which, however, most of the attention is devoted). This vehicle seems to act as a reference for the decision. It is important to highlight that SHAP (and attention, and the DRL neural network itself) always refers to vehicles, not to empty road segments, as could be more intuitive for a person, especially if there is little traffic. We also notice a slight bias towards the right, which corresponds to the already mentioned preference for the rightmost lane.
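The heatmaps can be built by binning, for each frame, the MAV (or MSV) offset relative to the ego into the lane/distance grid, roughly as follows (a sketch; the record format is our assumption):

```python
import numpy as np

def heatmap(offsets, cell_width=5.0, n_rows=5, n_cols=30):
    """offsets: iterable of (dx, dlane) MAV or MSV positions relative to
    the ego. Rows are relative lanes (ego row = 2), columns are 5 m
    longitudinal cells (ego column = 1), as in Fig. 16."""
    grid = np.zeros((n_rows, n_cols))
    for dx, dlane in offsets:
        row, col = 2 + dlane, 1 + int(dx // cell_width)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row, col] += 1
    return grid / max(grid.sum(), 1.0)  # frequencies over all frames
```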
In a similar perspective, looking at the distance-rank normalized heatmap (Fig. 17), we see that attention focuses almost exclusively on the first vehicle in each lane, particularly the ego lane. In 65% of the cases in which there is a vehicle ahead in the same lane, max attention is devoted to it. The percentage drops to 28% and 34% for the two adjacent lateral lanes. Max SHAP values, instead, are more distributed across the first vehicles in the lateral lanes and sometimes concern also the second vehicle, particularly in the lateral lanes. On the one hand, this confirms the longer-term perspective of the network decision; on the other hand, this is a human-like pattern, and it is not obvious that it was learned: the kinematic observation does not prevent observing the second-next vehicle in the ego lane, but only rarely is the decision mostly based on that, while it is more common that a decision is based on a second vehicle in a lateral lane (as it is not hidden by a preceding car), again acting as a reference.
Observing the traffic-normalized heatmaps for the most important feature (Fig. 18), we see that longitudinal distance is key in a middle area, while lane information is important farther away, most probably in view of a lane change. The second figure confirms the already discussed bias towards the right.
As a summary, computing, for each position in the grid, the most important SHAP feature (Fig. 19), we get confirmation that, for the ego vehicle, the lane is the most important information. Very close to the ego, there are no significant max SHAP cases, while longitudinal distance becomes the most important feature from 10 m ahead onwards. We notice that speed is never a max SHAP feature, and we argue that this is due to the fact that the ego is almost always at the highest speed (30 m/s), while non-ego vehicles generally travel slower (around 22 m/s), without significant differences.
The above-presented analysis encompasses all the frames, with no filtering (condition=NONE). However, it is interesting to segment the timeline according to different conditions. Conditions can be associated, for instance, with the decided action, max attention/SHAP values, the type of feature with max SHAP, traffic/vehicle conditions, etc. For space reasons, we limit our discussion to a selection of the conditions implemented in our framework, namely those listed in the headers of Tab. 1-4, where they are grouped in homogeneous clusters; a sketch of their implementation is shown below. The rows of those tables report the values of some significant quantities.
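In our framework, such conditions are simply boolean masks over the per-frame records, along these lines (a sketch; the DataFrame column names are illustrative placeholders, not those of the released code):

```python
import pandas as pd

# df: one row per frame, with columns such as "actual_action",
# "mav_distance", and "shap_sum" (illustrative names).
conditions = {
    "NONE":           pd.Series(True, index=df.index),  # no filtering
    "IDLE":           df["actual_action"] == "idle",
    "NON_IDLE":       df["actual_action"] != "idle",
    "FAR_ATT":        df["mav_distance"] > 60,           # thresholds from the text
    "VERY_CLOSE_ATT": df["mav_distance"] < 15,
}
# Per-condition statistics of a quantity of interest, e.g., the SHAP sum.
summary = pd.DataFrame({name: df.loc[mask, "shap_sum"].describe()
                        for name, mask in conditions.items()})
```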
Comparing the conditions in the frames where the actually performed action is idle and where it is not (first group in Tab. 1; the NONE case is reported for reference, as it poses no condition), we see several factors indicating a ''difficult'' decision in the second case. Attention entropy is similar in the two cases, but the entropy of the per-vehicle SHAP probabilities is much higher in the second case, indicating a more distributed decision. This is also confirmed by observing the corresponding distribution across vehicles of the max SHAP values and the per-vehicle average SHAP value. Also, the sum of SHAP values has a different sign in the two cases, which indicates a lower total reward expectation by the agent for a non-idle decision. Max attention and SHAP are on the same vehicle more than twice as often in the idle case as in the non-idle case. In the idle case, the most important feature is the longitudinal position, by far, while in the non-idle case the trigonometric heading doubles its frequency and the latitudinal position triples it. In the non-idle case, the frequency with which max SHAP falls on an absent vehicle doubles. Interestingly, max attention is much more frequently on a vehicle in the same lane in the non-idle case, which again indicates a situation requiring a change in the state of the ego vehicle. The time headway (THw) of the ego is much shorter in the non-idle case as well (and the speed difference with the max attention and max SHAP vehicles is lower). Similarly, the average distance of the MAV is much shorter. These observations seem to suggest that max attention goes to a vehicle that indicates the need for changing the state of the ego, while the actual action decision is taken based on the features of vehicles farther away (or of the ego). This gives a quantitative confirmation to the observation, from the Episode and Frame views, of the importance of having a reference vehicle for taking a decision.
Differently from the MAV, the average distance of the MSV is higher in the non-idle case than in the idle one, which looks odd. Exploring the data, we observed that, in some cases, all three lanes are occupied by close-by vehicles for several frames. In these cases, the model learns to alternate decelerations and accelerations (which allows keeping a relatively high speed, but would not be acceptable for passengers), basing its decision mostly on the lane of the ego (e.g., episode 36, frames 51 to 61, in Figs. 21 and 22). This, however, does not influence the average distance of the MSV, which considers only the cases where the MSV is present and different from the ego. But, in a small subset of these cases (e.g., episode 37, after frame 42, in Fig. 23), the MAV is v3, close to and in the same lane as the ego, and the MSV is v4, which is farther away and heavily contributes to the increase of the MSV average distance. We would expect the agent to manage all these cases based on the position of the three close-by vehicles, but our SHAP analysis suggests that the agent learns a different, maybe simpler but probably weaker, pattern.
Taking into account the spatial aspect (second group in Tab. 1), we notice that max attention is quite frequently close or even very close to the ego. Here we use the following thresholds: far attention: 60 m; far SHAP: 80 m; close attention: 30 m; very close attention: 15 m; close SHAP: 30 m. The sum of the SHAP values is largely positive in both the far SHAP and far attention cases (where we also have low SHAP entropy and a high max SHAP mean), and quite negative in the very close attention case (where we have high SHAP entropy). When the MAV is far, the vast majority of actions are idle, while when the MAV is very close the action mix becomes much more balanced; a close MSV is associated with idle, right lane changes (we argue that a left change, i.e., starting an overtake, requires a longer space margin), and no speed changes. MSV and MAV are always different when attention is very close. We argue that a very close MAV requires a change in the ego state (typically, the lane, which allows keeping a high speed), and the agent takes a higher-order vehicle as the reference point for the maneuver. On the other hand, MSV and MAV frequently coincide when attention is far and, overall, when SHAP is far. When the MAV is very close, the MSV may be absent (but there are absent-MSV cases also when attention is far), and not rarely it is the ego (max feature: lateral position, as it appears from the feature map in Fig. 20). We argue that the very close MAV case happens in three scenarios: (i) close overtaking (action: lane change); (ii) when three close vehicles occupy the lanes ahead (the max feature is usually the ego lateral position, and the action slower/faster); and (iii) the typical situation in the final frames (when the max feature is the trigonometric heading of an absent vehicle and the action slower/faster), when the ego is not able to overtake the last vehicles because of the lack of reference vehicles for taking the decision.
Focusing the episode-aggregated analysis on the value of the MSV presence feature (third group in Tab. 1), we observe that absence corresponds to frames with low attention entropy and high vehicle SHAP value entropy (also, the mean value of max attention is high, while the mean value of max SHAP is low), denoting a close-by danger (the one/two vehicles that the ego is unable to overtake, according to our previous observations, which are confirmed also by the analysis presented in this paragraph) and a decision uncertainty, based on several factors (which actually are the several absent vehicles). This is confirmed by the fact that the sum of SHAP values is largely negative in the absence case. This condition is characterized by a relatively high frequency of faster and slower actions, again showing uncertainty on the agent's part. Almost the only relevant feature for the absence case is the trigonometric heading. Attention focuses on v1/v2, while max SHAP is mostly on v4 and v5 (which are absent). The MAV and MSV never match (attention is never on an absent vehicle). In the absence case, the THw is lower than usual, and so are the ego speed and, particularly, the relative velocity difference with the MAV.
Quickly considering the feature importance (first group in Tab. 2), we see that when the max SHAP feature is the longitudinal distance (which is the most common case) we have low SHAP value entropy, while we have high SHAP value entropy when the max SHAP feature is the latitudinal position (MSLP) or the trigonometric heading (MSTH). In this last case, we also have low attention entropy (and a high max attention average), arguably indicating situations difficult to manage, as confirmed by the SHAP sums and by the different mix of performed actions (particularly faster and, even more, slower). Uncertainty about the decision is confirmed by the very low average value of max SHAP. In these cases, MAV and MSV never coincide (we argue that the ego is exploring the space in order to overcome a danger, as already seen), the THw is quite short, and the speed difference with the MAV is lower. But there are also some differences between the MSLP and MSTH conditions. MSLP is mostly associated with MSV=v0, while MSTH completely corresponds to an absent vehicle. The distance of the MAV is much smaller in the MSLP case (arguably because the MSTH case is actually characterized by the agent's difficulty in managing the last overtakes). Looking at the performed actions (and the ego lane), we see that MSLP is characterized by frequent left lane changes (starting from the rightmost lane), slightly more accelerations, and fewer decelerations. This confirms our intuition that the MSLP cases typically concern a possible overtake, while MSTH regards cases that the agent is less able to manage, arguably because it lacks a reference vehicle for the overtake.
Another group of conditions concerns the frames with high and low max attention and max SHAP values (second group in Tab. 2). We use the following thresholds: attention (low: 0.3, high: 0.75), SHAP (low: 0.3, high: 0.9). We recall that, at each frame, the sum of the attention values over the vehicles is 1, so high max attention values essentially indicate a focus on a single vehicle (the same applies to the per-vehicle SHAP values). Looking at the results, we notice a clear trend in the average sum of SHAP values. Particularly, in frames with high max SHAP values, the decision corresponds to a higher expected total reward (and vice versa for low SHAP), while in frames with high max attention, the decision corresponds to a lower sum of SHAP values (and vice versa for low max attention). This confirms the intuition of attention focus as an indicator of possible danger, while confident decisions (high SHAP values) correspond to significant expected benefits. SHAP entropy seems to increase with max attention, hinting at a ''difficult'' decision, taken considering more vehicles, in a dangerous situation (due to a single vehicle). On the other hand, average attention values are more concentrated (on v1 and v2) in the high max SHAP case, leading to a slight decrease in attention entropy with max SHAP. We argue that this is due to the different discretization adopted for the max SHAP and attention levels. Observing the distributions across vehicles of the max values, we see that in the high attention case the attention focus is on v1 (while max SHAP is distributed), and in the low attention case the attention is more distributed (while max SHAP is focused on v2). In the high max SHAP case, the MSV is almost always v2, while the MAV is v2 and, more frequently, v1. In the low max SHAP case, max SHAP values are distributed among v0, v3, and v4, while max attention shifts towards v3 (this looks like the already discussed case of occupied lanes ahead).
We notice a certain coincidence of MSV and MAV in the high max SHAP case, and a divergence in the low SHAP and high max attention conditions, which indicate more ''difficult'' cases. Looking at the features, we see that the longitudinal position is by far the most important one in the medium-high max SHAP and medium-low max attention cases (i.e., the ones with an ''easy'' decision), while the trigonometric heading gets very important with high attention. Actually, in 60% of the high-attention cases the MSV is absent, and in 79% there are only two vehicles apart from the ego. Considering the actual action, a significant difference is that the high max attention and, particularly, the low max SHAP cases correspond much more frequently to a slower action than the low max attention cases. The high max SHAP condition is characterized by idle actions. Not surprisingly, THw looks like an interesting indicator, as max attention decreases with it, while max SHAP (decision confidence) increases. Considering spatial aspects, we observe that the distance of the MAV increases with max SHAP (and the MAV is frequently in the ego lane when max SHAP is low, indicating a necessity of overtaking). The average MAV distance decreases between low and medium attention but increases with high attention (while the speed difference between the ego and the MAV continues to decrease), which corresponds to the mentioned weakness of the trained model (i.e., its inability to overtake the last vehicles, which go slightly faster than the others because of the lack of traffic). At high attention, speed is quite high (29 m/s).
Considering the conditions of coincidence (MAV = MSV) and divergence (the MAV is not even the second MSV and vice versa), the latter being considerably more frequent, Tab. 3 (first group) shows that the first case is characterized by low SHAP entropy, a high mean value of max SHAP, and a largely positive sum of SHAP values. Coincidence is characterized by a large prevalence of idle actions and a lack of changes in speed. MAV and MSV are almost exclusively on v2, while in the divergence case attention shifts towards v1 (but also v3), and max SHAP towards higher-order (i.e., farther) vehicles. Coherently, THws are smaller in the divergence case. In the coincidence condition, the MAV/MSV tends to be in the ego lane (but farther from the ego, as confirmed by the higher THw). Divergence spreads the MAV and, particularly, the MSV to different lanes. All these considerations suggest that coincidence happens in ''easy'' traffic conditions (the ego speed is also higher), while traffic difficulties are managed by diverging attention and decision, particularly by changing the lane (i.e., the MAV is the obstacle to overcome, the MSV is the reference vehicle in the target lane).
Looking at the sign of the sum of the SHAP values (Tab. 3, second group), we notice that positive sums come with low SHAP entropy, suggesting a decision driven by a clear factor, as confirmed by the matching of MAV and MSV, which is much more frequent in the positive sum case. The positive case mostly concerns MSV=v2 and the longitudinal distance feature, while in the negative case also v0 (with the latitudinal position) and higher-order vehicles (e.g., v4) are important, as we have already seen. Max attention is slightly shifted towards higher-order vehicles in the negative case, probably indicating more traffic. The distance of the MAV and the THw are both higher in the positive case, indicating easier traffic conditions. It is important to highlight that a positive SHAP sum is never associated with max SHAP on an absent vehicle, while in 20% of the negative sum cases the MSV is absent.
Considering the ego speed (Tab. 3, third group), we notice that only rarely does the ego drive at the minimum speed, being almost always at the maximum. While attention entropy does not differ in the two cases, the SHAP vehicle value entropy is much higher at low speeds, hinting at a more uncertain decision. At low speed, the most frequent action is faster, highlighting again the strength of the RL rewards, which draw the agent towards the highest speed. The sum of SHAP values is quite low in this condition, hinting at a low expected benefit. At minimum speed, the MAV and MSV almost never coincide. Low speed is characterized by the MAV being almost always in the same lane and traveling slightly faster, but very close. As already seen in similar contexts, both the latitudinal position of the ego and the trigonometric heading of absent vehicles are relevant features in the low-speed case (i.e., respectively, the ego is arguably attempting an overtake, since in 34% of the left-lane frames the max feature is the latitudinal position, or is in the final frames of the episode).
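A minimal sketch of how such conditional statistics could be aggregated, assuming a pandas DataFrame with one row per frame; the column names and the toy values below are purely illustrative, not data from our experiments.

import numpy as np
import pandas as pd

# Toy frame table with hypothetical column names (one row per frame).
df = pd.DataFrame({
    "shap_sum":     [0.8, -0.4, 0.5, -0.9],
    "shap_entropy": [0.3,  1.2, 0.4,  1.5],
    "mav_eq_msv":   [1,    0,   1,    0],     # 1 if MAV coincides with MSV
    "mav_distance": [35.0, 12.0, 40.0, 9.0],  # meters
    "thw":          [2.1,  0.7, 1.9,  0.5],   # seconds
})
df["shap_sum_sign"] = np.where(df["shap_sum"] >= 0, "positive", "negative")
summary = df.groupby("shap_sum_sign").agg(
    mean_shap_entropy=("shap_entropy", "mean"),
    mav_msv_match_rate=("mav_eq_msv", "mean"),  # fraction of coincidence frames
    mean_mav_distance=("mav_distance", "mean"),
    mean_thw=("thw", "mean"),
)
print(summary)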
Considering traffic conditions, we analyze four cases (Tab. 4, first group): the presence of a close vehicle (i.e., within 20 m) in the same lane as the ego (CV_SL); free time ahead (THw > 1.5 s) (FTA); more THw in a lateral lane (MTLL); blocked road (the max THw in the ego and lateral lanes is lower than 0.8 s) (BR); these predicates are formalized in the sketch after this paragraph. The attention entropy is similar in the four cases, but the differences are clear in the SHAP entropy and, even more, in the sum of SHAP values, with CV_SL and, even more, BR being difficult situations to manage (negative sum). Looking at the action mix, CV_SL shows an ample variety (never seen in the other conditions, hinting at the ability of the model to deal with a difficulty in a variety of ways), while the BR case almost removes the lane change option, since traffic is close also in the lateral lanes. In both cases, anyway, the high frequency of accelerations and decelerations highlights the challenge. The analysis of MAV and MSV coincidence, of the average max SHAP values, and of the absence and max SHAP features leads to considerations similar to those already seen for other conditions. We highlight the very short distance, in the CV_SL and BR cases, of the MAV, which is almost always in the ego lane. The MSV is farther away (also because of the already spotted issue in the model's decision process) and more frequently in another lane, again highlighting that the agent tries to overcome the danger. FTA and MTLL, on the other hand, present the typical features of ''easy'' driving conditions. Lastly, we consider two conditions in which the ego could show better behavior (Tab. 4, second group): could go faster (CGF) (the ego is not already traveling at the highest speed) and could overtake (CO) (the ego is not performing an overtake it could do). These conditions happen very rarely, which indicates the ability of the system to avoid falling into such inefficient situations. In the CGF case, the average THw is much higher than usual, but the selected action is either idle or even slower. Here we clearly have an imperfect functioning of the agent, testified by the large negative sum of SHAP values and by the fact that the MSV is absent in 86% of the cases. Thus, we argue that this condition happens only in the last frames of an episode, when the agent is unable to overtake the last few cars.
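The four traffic predicates named above could be implemented along the following lines; the frame record, its field names, and the per-lane minimum time headways are assumed to be precomputed by the analysis pipeline, so this is a sketch under those assumptions rather than the released tool's code.

CLOSE_M = 20.0       # CV_SL distance threshold [m]
FREE_THW_S = 1.5     # FTA time headway threshold [s]
BLOCKED_THW_S = 0.8  # BR time headway threshold [s]

def traffic_conditions(frame):
    """Per-frame traffic condition flags (sketch with assumed field names)."""
    ego_lane = frame["ego_lane"]
    thw_ego = frame["thw_by_lane"][ego_lane]                  # min THw ahead, ego lane
    thw_lat = [t for lane, t in frame["thw_by_lane"].items()  # lateral lanes
               if lane != ego_lane]
    return {
        "CV_SL": frame["min_dist_same_lane"] < CLOSE_M,    # close vehicle, same lane
        "FTA":   thw_ego > FREE_THW_S,                     # free time ahead
        "MTLL":  max(thw_lat, default=0.0) > thw_ego,      # more THw in a lateral lane
        "BR":    max([thw_ego, *thw_lat]) < BLOCKED_THW_S, # blocked road
    }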
Compared to CGF, the CO case has a higher attention entropy (even if lower than average) and a less negative sum of SHAP values. The MSV is an absent vehicle in almost half of the cases. The MAV is always in the same lane and close to the ego (it is actually the vehicle that could be overtaken). We argue that this case happens in the final frames of an episode (when the MSV is an absent vehicle), but also in other situations where, probably relying wrongly on the longitudinal position of a far vehicle, the ego brakes instead of safely starting an overtaking maneuver.
In order to obtain a quantitative assessment of the relationships between the variables, we computed the Spearman correlation among the analyzed time series, over all the episodes. Table 5 reports the detected Spearman correlations (all with p < 10⁻³). The reported results do not really add new information, but they confirm and quantify the strength of the relationships spotted in the previous analyses. The table omits some trivial correlations (e.g., between the max SHAP values and the max SHAP high condition).
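For reference, this correlation screening can be reproduced with scipy along the following lines, keeping only the pairs that pass the p < 10⁻³ significance threshold used in Tab. 5; the function and the series naming are a sketch, not the released tool's code.

from itertools import combinations
from scipy.stats import spearmanr

def correlation_table(series, alpha=1e-3):
    """Pairwise Spearman correlations over named per-frame time series,
    keeping only the statistically significant ones (p < alpha)."""
    rows = []
    for (name_a, a), (name_b, b) in combinations(series.items(), 2):
        rho, p = spearmanr(a, b)
        if p < alpha:
            rows.append((name_a, name_b, round(float(rho), 3)))
    return rows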

V. CONCLUSION
Deep learning models have become increasingly complex, so it is critical to understand how their decision-making emerges from data [19]. Interpretation is challenging, and our experience has shown the importance of relying on a methodology and a framework for a quantitative interpretation of the agent's decisions, also considering the attention mechanism. The novel solution proposed in this paper implements three types of views for analyzing data: (i) episode timeline, (ii) frame by frame, and (iii) aggregated statistical analysis, also including heatmaps for a better spatial understanding. The framework relies on domain knowledge (e.g., different traffic conditions), which was necessary to analyze the different operating conditions. Our analytical tool allowed a consistent description of the behavior of the agent and its decision factors. For instance, we observed that the ego vehicle travels preferentially in a lateral lane, to reduce the risk of dangerous cut-ins. The main motivator for the taken action is typically the longitudinal distance from the second and, to a lower extent, the third vehicle, as the first one has already been processed. In accordance with the reward, the agent manages to keep a high speed. In the overtakes, it takes a lane change decision mostly by checking its lane position and the longitudinal distance of the second and third vehicles, which could be interpreted as references for overtaking the first or the second vehicle, respectively. Given the nature of its input, the DRL neural network always refers to vehicles, not to empty road segments, as would be more intuitive for a person, especially if there is little traffic. We argue that the per-vehicle SHAP value entropy and the sum of SHAP values (interpreting the outcome of a Q-value network) may be considered indicators of expected benefit. We also see that state change decisions typically happen in low-benefit (difficult) conditions, and lead to better states.
The interpretability analysis also revealed a non-humanlike pattern, which is the strict sequence of accelerations and decelerations, due to the speed reward, when all the lanes ahead are occupied by a close-by vehicle. In the same condition, we spotted a rare artifact, which consists in taking a decision based on a farther-away vehicle. The tool also spotted the model's reliance on the trigonometric heading feature of an absent vehicle in the last frames of an episode, when the agent is unable to overtake the last two vehicles, arguably because of the lack of reference vehicles ahead.
We observed a clear differentiation between attention and SHAP values, reflecting the architecture of the network, where the first layer implements the attention mechanism, while the deeper ones make the decision. Attention focuses on the proximity of the ego, while the decision, as captured by SHAP, is taken on a wider horizon, denoting a valuable anticipation capability. Since both these values are the result of a softmax, high attention is not necessarily synonymous with high danger, but with focused attention; similarly, a high per-vehicle max SHAP indicates a decision based on a single feature.
Our research advances the state of the art as it provides an extensive statistical analysis of an automated driving agent's behavior. We conclude that the use of SHAP and its integration within DRL is helpful to explain the decision-making process of the agent. The proposed framework and methodology allow observing, at every time step, the relevance of each feature to the decision. Statistical analysis over several episodes is also implemented, allowing a quantitative assessment of the causes and effects.
However, much research remains to be done in this field. The article discusses only a subset of the results, which are available at: https://github.com/Elios-Lab/explain-drlhighway. At the same link, the whole framework is available as open source, to support further research. Extensions could involve, for instance, considering other measurements and conditions (e.g., on traffic).
Our analysis focused on single frames; temporal correlations could be considered as well in the future. More complex vehicle models and simulation environments, as well as different DRL training methods and types of networks (e.g., multi-head attention, recurrent networks, temporal attention), might also be analyzed.
New research could also involve incident analysis, specifically considering the few frames prior to an accident and possibly correlating them with the spotted artifacts, so as to improve the agent's training.
ALESSIO CAPELLO received the B.Sc. degree in electronic engineering and information technologies and the M.Sc. degree (cum laude) in electronic engineering from the University of Genoa, Genoa, Italy, in 2017 and 2021, respectively, where he is currently pursuing the Ph.D. degree in science and technology for electronic and telecommunication engineering (STIET), ELIOS Laboratory, Department of Electrical, Electronic, Telecommunication Engineering and Naval Architecture. His main research interests include big data management and deep learning.
MARIANNA COSSU received the bachelor's degree in electronic engineering and information technologies, in 2019, and the master's degree in electronic engineering, in 2021. She is currently pursuing the Ph.D. degree in science and technology for electronic and telecommunication engineering (STIET) with the ELIOS Laboratory, DITEN, University of Genoa. Her main research interests include machine learning, deep learning, and automated driving.
ALESSANDRO DE GLORIA is a Full Professor of electronic engineering with the University of Genoa. He is the leader of the ELIOS Laboratory, DITEN, University of Genoa. He has led and/or participated in more than 20 research projects in the last ten years, in the fields of technology-enhanced learning and automotive. His main research interests include automated driving, machine learning, the IoT, and computer graphics. He has authored more than 260 papers in peer-reviewed international journals and conferences on the above-cited topics.
RICCARDO BERTA (Member, IEEE) is an Associate Professor with the ELIOS Laboratory, DITEN, University of Genoa. He has authored about 120 papers in international journals and conferences. His main research interest includes applications of electronic systems, in particular, in the fields of machine learning, the Internet of Things (IoT), and serious gaming. He is a founding member of the Serious Games Society. He is a Publications Chair of the International Conference Series GALA (Game and Learning Alliance) and the Applications in Electronics Pervading Industry, Environment and Society (ApplePies) International Conference. He is an Associate Editor of the International Journal of Serious Games.
Open Access funding provided by 'Università degli Studi di Genova' within the CRUI CARE Agreement.