An ML-aided Reinforcement Learning Approach for Challenging Vehicle Maneuvers

Abstract—The richness of information generated by today's vehicles fosters the development of data-driven decision-making models, with the additional capability to account for the context in which vehicles operate. In this work, we focus on Adaptive Cruise Control (ACC) in the case of such challenging vehicle maneuvers as cut-in and cut-out, and leverage Deep Reinforcement Learning (DRL) and vehicle connectivity to develop a data-driven cooperative ACC application. Our DRL framework accounts for all the relevant factors, namely, passengers' safety and comfort as well as efficient road capacity usage, and it properly weights them through a two-layer learning approach. We evaluate and compare the performance of the proposed scheme against existing alternatives through the CoMoVe framework, which realistically represents vehicle dynamics, communication, and traffic. The results, obtained in different real-world scenarios, show that our solution provides excellent vehicle stability, passengers' comfort, and traffic efficiency, and highlight the crucial role that vehicle connectivity can play in ACC. Notably, our DRL scheme improves road usage efficiency by keeping the headway inside the desired range for 69% and 78% of the time in the cut-out and cut-in scenarios (resp.), whereas the alternatives respect the desired range only for 15% and 45% of the time (resp.). We also validate the proposed solution through a hardware-in-the-loop implementation, and demonstrate that it achieves performance similar to that obtained through the CoMoVe framework.


I. INTRODUCTION
A recent report by the World Health Organization (WHO) indicates that nearly 1.35 million people die in road accidents every year, and approximately 20-50 million people suffer nonfatal injuries. Also, traffic congestion takes a substantial toll on public health and the economy because of polluted air, commuting time, and fuel consumption [1], [2].
In this context, Connected Autonomous Vehicles (CAVs) can play an essential role, as they can mitigate traffic externalities, especially in terms of safety and traffic efficiency. Both vehicles and road infrastructure are increasingly equipped with sensing and computational equipment to assist the driver, as well as with vehicle-to-everything (V2X) communication devices to facilitate data exchange. As a result, a CAV can gather an enormous amount of data, promoting the development of Machine Learning (ML) models to further improve passengers' safety and comfort.
Among the Advanced Driver Assistance Systems, Adaptive Cruise Control (ACC) is one of the most popular applications in new vehicle generations, and it seemingly performs well under most car-following scenarios. However, there are a few challenging scenarios where the human has to stay alert and take control of the vehicle from the ACC to perform a safe maneuver [3]. One such scenario is given by lane-change maneuvers, which are common on roads and responsible for 7.6% of car crashes in the US [4]. To overcome such limitations of traditional ACC, we propose an ML-based ACC application that leverages the information collected through both sensors and communication devices. Such an application can improve not only safety but also comfort and traffic efficiency, since it substantially reduces the traffic shock waves that usually occur during challenging maneuvers. More specifically, the framework we propose, called 2-Layer Learning Cooperative ACC (2LL-CACC), accounts for CAVs' road efficiency, safety, and comfort, as follows. Efficiency is measured by the headway metric, a proxy for the inter-vehicle distance in a traffic stream [5]-[7]. Safety is expressed in terms of the longitudinal slip ratio and the Time-To-Collision (TTC), where the former is the amount of slip experienced by pneumatic tires on the road surface, while the latter represents the time it takes for two vehicles to collide. Finally, comfort is measured through the jerk metric, defined as the rate of change of the vehicle's acceleration.

As sketched in Fig. 1, 2LL-CACC aims at finding the best tradeoff among road efficiency, safety, and comfort by using Deep Reinforcement Learning (DRL), where the reward function is an ML-driven weighted sum of the three metrics. The top layer hosts a Random Forest Classifier [8] to assess the current contextual information, while the lower one includes a Deep Deterministic Policy Gradient (DDPG) [9] algorithm that aims to maximize the cumulative reward by mapping states to actions through an optimal policy. Thanks to such a 2-layer ML-based approach, 2LL-CACC can adapt to the operational context and effectively select the acceleration to adopt, thus overcoming the limitations of traditional ACC in coping with challenging traffic situations. To demonstrate this, we primarily address road traffic scenarios where the ego vehicle follows a short-distance/low-velocity lead vehicle, as typically occurs during cut-in and cut-out scenarios.
Our main contributions can thus be summarized as follows: (i) We present 2LL-CACC, an ML-aided DRL framework that employs a two-layered learning strategy to accomplish road efficiency, safety, and comfort objectives. The two layers host the Context Recognition Model and the DRL model, respectively. The role of the Context Recognition Model is to recognize the current contextual information and appropriately weigh the reward components, i.e., road efficiency, safety, and comfort. Subsequently, the weighted reward components assist the DRL model convergence by providing valuable feedback during the learning process. (ii) To achieve the above objectives and adequately represent the environment, the DRL states exploit information about the lead vehicle and its relation to the ego vehicle, namely, the lead vehicle's acceleration, the headway, and the relative velocity. Furthermore, stability-related states, such as the longitudinal slip and the road friction coefficient, are used to evaluate the vehicle's stability. As rewards, we use headway as a traffic efficiency indicator, jerk to assess comfort, and slip to ensure vehicle stability. The reward components are modeled to provide positive/negative reinforcement to the agent as feedback. (iii) Specifically, for aggressive driving scenarios, we introduce a V2X-supported gradual-switching technique that allows the ego vehicle to shift its focus to the lane-changing vehicle safely and steadily. Unlike the car-following scenario, gradual switching is crucial for the early identification of lane-changing vehicles and for a smooth transition between the vehicles, so as to prevent the deterioration of the target key performance indicators. (iv) We present a detailed process flow of the Hardware-In-the-Loop (HIL) implementation that enables the real-time deployment of the PyTorch-based DRL agent in the dSPACE SCALEXIO AutoBox through the MathWorks environment. The HIL validation also demonstrates that 2LL-CACC can actually be implemented in a real-world vehicle and that it achieves an outcome similar to that of the CoMoVe simulations. Overall, the proposed system uses a context recognition model to assess the contextual information, the DRL model to drive the ego vehicle in an efficient, safe, and comfortable way, and, finally, the gradual switching to identify lane-changing vehicles and adequately manipulate the DRL states so as to take suitable decisions.
The rest of the paper is organized as follows: Sec. II discusses relevant previous work and highlights our novel contributions. Sec. III describes the 2LL-CACC framework and explains how V2X communication is exploited, while Sec. IV and Sec. V detail, respectively, the integration with the CoMoVe framework and the process flow of the HIL implementation. Sec. VI presents the performance of 2LL-CACC against state-of-the-art alternatives. Finally, Sec. VII draws our conclusions and discusses future work.

II. RELATED WORK
The Adaptive Cruise Control application efficiently controls the longitudinal speed of the vehicle in simple car-following scenarios, while complex conditions like cut-in or cut-out maneuvers can be highly challenging [3], [10], as the inter-vehicle distance may change dramatically. In particular, a defensive response to the cut-in/cut-out vehicles may greatly affect traffic efficiency [11], while an overly aggressive reaction leads to a collision with very high probability [4]. It is thus critical that automated/autonomous vehicles overcome the current limitations to ensure safety. To assist vehicles in such complex situations, ML techniques are widely adopted. In particular, (D)RL algorithms have been preferred to other ML approaches, since they effectively deal with uncertain and partially observable environments [12]. Several works [13]-[17] have explored the usage of (D)RL-based algorithms to improve vehicle performance in complex scenarios. In particular, [13] leverages a DRL-based CACC algorithm that exploits information from the vehicle's RADAR and vehicle-to-vehicle (V2V) communication to maintain the desired headway with the lead vehicle. Even though V2V communication can help identify lane-changing scenarios beforehand, [13] only focuses on optimizing the headway, thus overlooking passenger comfort and vehicle stability. Traditional ACC also suffers from similar inadequacies, as it does not consider environmental factors while controlling the longitudinal vehicle movements. The DRL-based framework in [15] addresses some drawbacks of [13] by using a multi-objective reward function to optimize the vehicle's safety, comfort, and efficiency. It also considers a continuous action space, unlike the DRL framework in [13], which can only select an action from a pre-defined discrete action space. However, [15] only considers a linear model to simulate the vehicle behavior, which is often not suitable to represent a vehicle in real-world conditions. Furthermore, prior art has not considered vehicle stability under different road conditions as an objective, although it is an integral part of the passengers' safety. To address this limitation, our framework utilizes the longitudinal slip ratio and the road friction coefficient to ensure the vehicle's stability. Even though we obtain these parameters from the simulation models, one can estimate them in real-life situations by leveraging the estimation techniques proposed in, e.g., [18]-[20].

Looking at the cut-in scenario, [16] presents a DRL framework tailored to deal with cut-in events and car-following scenarios. [16] uses a two-step process: (i) a deep neural network trained to predict the cut-in maneuver, and (ii) a Double Deep Q Network (DDQN) to train the DRL model for the cut-in scenario. As part of the second step, the authors develop an Experience Screening, a pre-training process where multiple DRL simulations are performed for a set of pre-defined scenarios, and the best experiences (states, actions, rewards, transition states) of each scenario are stored in an experience pool. Later, the DDQN samples the data from the experience pool for faster training convergence and generalization across different scenarios. We take this study as one of the benchmarks in our performance evaluation.

III. THE 2LL-CACC FRAMEWORK

As depicted in Fig. 1, 2LL-CACC comprises two layers: the top one hosts an ML model to assess the current context and scenario characteristics; the lower one focuses on the DRL agent attributes to learn an optimal policy. At any given time t, the ego vehicle traveling through a road traffic scenario provides information about the environment, specifically, neighboring vehicles and road conditions, to the DRL agent and the Context Recognition model as state s(t) ∈ S and context c(t) ∈ C (resp.). Given s(t), the role of the DRL framework is to attain efficient, safe, and comfortable driving by maintaining the optimal speed, according to headway, slip, and jerk values, through the agent's decision-making policy. Based on the input state s(t) ∈ S, the DRL agent takes an action (A) to change the behavior of the ego vehicle by either accelerating or decelerating it. As a response to the action, the agent gets a reward from the environment. The representation of states, actions, and rewards in the DRL framework assists the agent in learning the optimal policy.
In our study, the reward comprises three components, namely, headway, slip, and jerk, to model efficient, safe, and comfortable driving. However, equally weighted reward components may not provide optimal feedback to the DRL agent, as, depending on the situation experienced by the ego vehicle, one component may be more important and hence should be weighted more. Examples include the case where road pavement conditions are particularly slippery and vehicle stability has to be ensured with the highest priority, or the case where the ego vehicle is driving at high speed and should maintain a sufficient headway. Thus, each reward component should be weighted depending upon the current context, as the latter impacts the learning process directly. To do so, it is necessary to derive the relation between the features that impact the reward components and the corresponding weights. To this end, we introduce the Context Recognition Model, which leverages a Random Forest Classifier to infer such a relationship and determine the weight to be associated with each reward component, based on the current context c(t). Subsequently, the predicted weights are used to regulate their corresponding reward components, and the sum of the weighted rewards facilitates the DRL model in learning an optimal policy. Fig. 2 depicts an overview of the proposed 2LL-CACC framework.
In the following, Sec. III-A details the top layer, hosting the Context Recognition model, while Sec. III-B presents the lower layer, hosting the DRL framework. The notations used in Sec. III-A and Sec. III-B are summarized in Tab. I.

A. Top Layer: Context Recognition Model
As mentioned above, we use an ML model to predict the weight of each reward component, based on the current situation. Specifically, we use a Random Forest Classifier (RFC) [8], [31] and train it to interpret the current contextual information through selected input features (C), and to output the optimal weight class for the headway (l_h), stability (l_s), and comfort (l_c) reward components.

1) Preliminaries:
In general, the RFC is a powerful ensemble algorithm that has proved to handle high-dimensionality problems efficiently. It suits the problem at hand particularly well, since we have a set of input features that must be mapped onto discrete output classes through an ensemble of decision trees. Essentially, a decision tree aims to split the data into homogeneous branches to determine the outcome. Each tree includes two types of nodes: (i) a decision node, splitting the data into two subsets (branches), and (ii) a leaf node, representing an outcome decision. With the help of the Gini Index (GI) [31], each decision node determines a splitting criterion based on a specific feature and a threshold, which results in fewer samples of heterogeneous classes in each subset H. For each subset, the GI is calculated as:
$$GI(H) = 1 - \sum_{i=1}^{n} p_i^2,$$
where n is the total number of classes, and p_i is the number of samples in subset H that belong to the i-th class, normalized to the cardinality of H. Then, the weighted average of each subset's GI is used to identify the best criterion to split the data. Given subsets H_1 and H_2, the weighted GI is given by:
$$GI_w = \frac{n_1}{n} GI(H_1) + \frac{n_2}{n} GI(H_2),$$
where n_1 and n_2 represent the number of samples in subsets H_1 and H_2, respectively, and n is the total number of samples in the parent set. GI_w is similarly calculated for the different splitting criteria, and the criterion with minimum GI_w is used to split the data. Indeed, a smaller GI value means a better splitting criterion, with a higher percentage of homogeneous classes in the subsets. The decision node continues to split the data until all values in each subset are homogeneous, i.e., belong to the same class. However, the splitting is controlled by the maximum tree depth parameter to tackle overfitting. In the decision-making (i.e., inference) phase, the input data traverse the decision tree from the decision nodes to a leaf node, where the majority class in the leaf node is predicted as the output label. The final output label is decided based on the majority vote of all decision trees.
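To make the splitting criterion concrete, the following minimal Python sketch computes the Gini Index of a subset and the weighted GI of a candidate split; the function and variable names are illustrative and not part of the paper's implementation.

```python
from collections import Counter

def gini(labels) -> float:
    """Gini Index of a subset H: 1 - sum_i p_i^2, with p_i the fraction
    of samples in H belonging to class i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right) -> float:
    """Weighted average of the two subsets' GI; the candidate split with
    the minimum value is selected by the decision node."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return (n1 / n) * gini(left) + (n2 / n) * gini(right)

# A split with a homogeneous branch scores lower (better) than a mixed one:
print(weighted_gini([1, 1, 2], [2, 2, 2]))   # ~0.222
print(weighted_gini([1, 2, 2], [1, 2, 2]))   # ~0.444
```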

2) Context Recognition Model:
The context recognition model employs the Random Forest Classifier to predict the weights of the reward components based on the current contextual information. Thus, the RFC takes input features that concern the ego vehicle's objectives and chooses the corresponding labels for the reward components. We have three reward components, representing headway (r_h), stability (r_s), and comfort (r_c). To support the classification model, we discretize the weight values into 20 bins and numerically label each bin (e.g., label 1 represents [0.0, 0.05], label 2 [0.05, 0.1], etc.). We chose classification rather than regression algorithms because the predicted values vary considerably in regression, causing the DRL model to map a similar state-action pair to different rewards and, hence, slowing down convergence.
At a given time t, the input features headway (ϑ(t)), jerk (j(t)), and longitudinal slip (ξ(t)) play a crucial role, as they represent the three main objectives of the ego vehicle. In particular, the headway is given by $\vartheta(t) = \Delta P_{lead}(t)/V_{ego}(t)$, where ∆P_lead(t) is the relative distance between the lead and ego vehicles and V_ego(t) is the ego vehicle's velocity; the jerk is the rate of change of the ego vehicle's acceleration; and the longitudinal slip captures the mismatch between the wheel circumferential speed and the vehicle's longitudinal speed. In addition, we use the Time-To-Collision (TTC) to identify potential collision situations, representing the time it takes for two vehicles to collide if their speed is not modified. The TTC at time t is formulated as $TTC(t) = \Delta P_{lead}(t)/\nu(t)$, where ν(t) represents the relative (closing) velocity between the lead and ego vehicles at time t. Therefore, the safety indicator at time t is $\chi(t) = 1$ if TTC(t) > 4 s, and 0 otherwise, where the 4-s threshold is set based on [32]. Furthermore, the road friction coefficient (µ(t)) at time t is directly related to vehicle stability, where an abrupt acceleration change often leads to instability on low-friction roads; the road condition (ψ(t)) is defined accordingly, as a discretization of µ(t). Finally, ω(t) represents the road traffic scenario at time t, as the ego vehicle's behavior may significantly vary, e.g., from car-following to cut-in and cut-out situations. In summary, the context c(t) includes:
• the headway (ϑ(t)), jerk (j(t)), and longitudinal slip (ξ(t)), representing the ego vehicle's main objectives;
• the safety indicator (χ(t)), derived from the TTC;
• the road friction coefficient (µ(t)) and the road condition (ψ(t));
• the road network (ω(t)), representing the current road traffic scenario.
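For illustration, the sketch below assembles these features from raw measurements; the function signature and the discrete-time jerk approximation are assumptions made for the example, while the 4-s TTC threshold and the 100-ms sampling interval follow the text.

```python
def context_features(dp_lead: float, v_ego: float, v_lead: float,
                     accel: float, accel_prev: float,
                     slip: float, mu: float, tau: float = 0.1):
    """Assemble part of the context c(t) from raw measurements.
    The discrete-time jerk uses the sampling interval tau = 100 ms."""
    headway = dp_lead / max(v_ego, 1e-3)             # time headway [s]
    jerk = (accel - accel_prev) / tau                # rate of change of acceleration
    nu = v_lead - v_ego                              # relative velocity [m/s]
    ttc = dp_lead / -nu if nu < 0 else float("inf")  # defined while closing in
    chi = 1 if ttc > 4.0 else 0                      # safety indicator, 4-s threshold
    return headway, jerk, slip, chi, mu
```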
At every time-step, the RFC model takes the current context (c(t)) as input and predicts the optimal weight label for the headway, stability, and comfort reward components. Then, the labels (l_h, l_s, l_c) are converted into values (x_h, x_s, x_c) according to their discretized bins. For example, with reference to the above example about the bins' labels, if the RFC model predicts 1 as the headway label l_h, the corresponding headway weight is drawn uniformly at random between 0 and 0.05.
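As an illustration of this label-to-weight conversion, the sketch below (with assumed helper names and hypothetical RFC outputs) draws a weight uniformly from the bin associated with a predicted label:

```python
import random

BIN_WIDTH = 0.05   # 20 bins over [0, 1], as described above

def label_to_weight(label: int) -> float:
    """Map a predicted class label (1..20) to a weight drawn uniformly
    from its bin, e.g., label 1 -> [0.0, 0.05]."""
    low = (label - 1) * BIN_WIDTH
    return random.uniform(low, low + BIN_WIDTH)

# Hypothetical RFC outputs for (headway, stability, comfort):
l_h, l_s, l_c = 1, 12, 5
x_h, x_s, x_c = (label_to_weight(l) for l in (l_h, l_s, l_c))
```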

B. Bottom Layer: The DRL Model
We now provide a detailed description of the DRL model, starting with some preliminaries on DRL and then introducing the solution we designed.

1) Preliminaries:
The goal of an RL model is to learn an optimal decision-making strategy by repeatedly interacting with an environment that provides positive or negative feedback as a reward for the current behavior. Its main components are: the state space (S), i.e., a representation of the environment; the action space (A), a set of actions an agent can take to interact with the environment; the rewards (R), numerical feedback from the environment; the policy (π(s)), a decision-making strategy that characterizes the mapping from states to actions; and the value function (Q_π(s, a)), which indicates the expected future return from a state-action pair. The policy and value function enable the agent to take a sequence of actions that maximizes the cumulative discounted reward received from the environment.
The RL problem is generally modeled as a Markov Decision Process (MDP). At any given time step t, the MDP is represented through a quintuple ⟨s(t) ∈ S, a(t) ∈ A, K, r(s(t), a(t)) ∈ R, γ⟩, where K is the state transition probability matrix and γ ∈ [0, 1] is a discount factor for future rewards. K specifies the probability of being in s(t+1) due to action a(t) taken at state s(t). However, it is difficult to model the state transitions for complex problems such as vehicle dynamics; thus, we adopt an actor-critic method, which is model-free and exhibits low computational complexity.
In the actor-critic framework, the critic uses a function approximator to learn the value function parameters β, optimizing the value function Q(s, a|β), while the actor adopts a function approximator as well, to update the policy parameters η in the direction suggested by the critic, so as to optimize π(s|η). In general, RL algorithms employ deep neural networks as function approximators to achieve the optimal solution, and such techniques are collectively called Deep Reinforcement Learning (DRL) methods.
In our work, we use the Deep Deterministic Policy Gradient (DDPG) [9], a DRL algorithm that follows the actor-critic framework to learn both the value function and the policy. The critic network with parameters β takes care of the value function estimation (Q(s, a|β)), while the actor network with parameters η represents the agent's policy (π(s|η)). Notably, the DDPG algorithm supports a continuous action space and thus suits the 2LL-CACC scheme, which has to learn the optimal ego vehicle acceleration profile. As the name signifies, DDPG learns a deterministic policy, which predicts the action directly rather than a probability distribution over the action space A. Since the policy is deterministic, standard normal noise with zero mean and a standard deviation of 0.1 is added to the predicted action value during training, to ensure the continued exploration of the action space.
Further, DDPG leverages the experience replay buffer and target network techniques to ensure a stable and efficient learning process. The experience replay buffer (Z) stores the agent's experience samples as a tuple (s(t), a(t), s(t + 1), r(s(t), a(t)), d(t)) at every step, where d(t) is a binary value indicating whether state s(t + 1) is a terminal state. In the replay buffer Z, the next state (s(t + 1)) is represented as s′, as the buffer holds transitions from several time steps.
The algorithm then randomly draws experience samples from the buffer during the learning process. Since the replay buffer allows using the same transitions multiple times, it improves sample efficiency, while random sampling removes the correlation between consecutive transitions.
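As an aside, a minimal sketch of such a buffer is given below; the capacity and batch size are illustrative assumptions, not our training hyperparameters (those are reported in Tab. III).

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer Z: stores (s, a, s', r, d) tuples
    and returns uniformly sampled mini-batches, breaking the temporal
    correlation between consecutive transitions."""
    def __init__(self, capacity: int = 100_000):   # capacity is an assumption
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r, d) -> None:
        self.buffer.append((s, a, s_next, r, d))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling allows reusing transitions many times.
        return random.sample(list(self.buffer), batch_size)
```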
The target network, instead, helps stabilize the learning. In general, the value function is trained to minimize the Mean-Squared Bellman Error (MSBE), which indicates the difference between the current value function and the value function under the greedy policy (taking actions with maximum expected return).
It is represented as:
$$L(\beta) = \mathbb{E}\big[\big(Q(s(t), a(t)|\beta) - y_t\big)^2\big], \quad y_t = r(s(t), a(t)) + \gamma \max_{a} Q(s(t+1), a|\beta). \quad (9)$$
Note that both terms in (9) depend on the same value function parameters β. Eventually, this causes instability in the learning process, as both terms in (9) keep changing. Thus, the structure of the main actor and critic networks is cloned into target actor and critic networks (Q′ and π′) with different parameters (β*, η*) to overcome training instability. The target network parameters are used in (10)-(11) to calculate y_t and, later, the MSBE. As the training progresses, the main network parameters (β, η) are gradually tracked by the target network parameters (β*, η*) through the Polyak averaging technique. Essentially, the critic network is trained to minimize the mean square error between the target network's expected return and the value predicted by the critic network, while the actor network aims to maximize the critic network's mean value for the actions predicted by the actor. Subsequently, the model learns to predict the actions with maximum critic value for the current state.

DRL-based Acceleration Control: The DRL-based ACC application we develop seeks to optimally determine the ego vehicle's acceleration through system state information gathered from the ego vehicle's sensors and the neighboring vehicles. The pseudo-code of the proposed scheme is presented in Algorithm 1.

Algorithm 1 DRL-based Acceleration Control
  Randomly initialize critic network Q(s, a|β) and actor π(s|η) with weights β and η
  Initialize target networks Q′ and π′ with weights β* ← β and η* ← η
  Initialize replay buffer Z
  for episode = 1, M do
    for t = 1, T do
      Select action a(t) according to the current policy and exploration noise
      Execute a(t), observe next state s(t + 1)
      Get the weights (x_h, x_s, x_c) from the Context Recognition Model
      Calculate the reward r(s(t), a(t)) based on the weights and their reward components
      Store transition (s(t), a(t), s(t + 1), r(s(t), a(t)), d(t)) in Z
      Sample a random mini-batch of N transitions from Z
      Update the critic by minimizing the loss
      Update the actor policy using the sampled policy gradient
      Update the target networks
    end for
  end for

States and Action: At a certain time-step t, the state space of the environment is represented by: (i) the lead vehicle acceleration α(t), (ii) the headway ϑ(t), (iii) the headway derivative ∆ϑ(t), (iv) the longitudinal slip ξ(t), (v) the friction coefficient µ(t), and (vi) the relative velocity ν(t). In the state space, the preceding vehicle's acceleration is obtained through V2X communication, which is simulated with the help of the CoMoVe framework, and we assume the road friction coefficient is provided by an external estimation method running in the ego vehicle. The headway (ϑ(t)) and longitudinal slip (ξ(t)) variables are formulated through Eq. 3 and Eq. 5, respectively. The remaining state variables are formulated as:
$$\Delta\vartheta(t) = \vartheta(t) - \vartheta(t-1), \qquad \nu(t) = V_{lead}(t) - V_{ego}(t),$$
where ϑ(t) and ϑ(t − 1) are the headway values at time t and t − 1 (resp.), and V_ego(t) and V_lead(t) represent the ego and lead vehicle's velocity (resp.) at time t. In our model, and in contrast to prior art [13], [15], we also account for the wheel longitudinal slip ratio and the road friction coefficient, to represent the vehicle stability.

Since our DRL model aims to control the ego vehicle's acceleration, action a(t) ∈ A is defined as a continuous variable. Further, the action values are bounded, in regular conditions, between [-2, 1.47] m/s² to provide a comfortable travel experience [33]. The DRL agent receives a numerical value from the environment as feedback on its behavior, a numerical reward that motivates the DRL agent to satisfy the desired objective. The sampling interval of our framework is τ = 100 ms long; the state observation and action decision routine are performed every τ seconds.

Reward Components: The reward function comprises three components: headway (representing traffic flow efficiency), stability (representing safety), and comfort, each component's value ranging in [-1, 1]. More formally, we have:
$$r(s(t), a(t)) = x_h\, r_h(s(t), a(t)) + x_s\, r_s(s(t), a(t)) + x_c\, r_c(s(t), a(t)),$$
where x_h, x_s, x_c are the weight coefficients obtained from the Context Recognition Model.

Headway reward component: The headway represents the inter-vehicle distance with respect to the lead vehicle [13], [15], and it can be calculated using Eq. 3. Following [13], we set the ideal headway, which secures a safe and efficient inter-vehicle distance, to 1.3 s, while headway values lower than 0.5 s imply a possible risky situation between the ego and the lead vehicle. Traffic efficiency is further ensured by adding the relative velocity between the ego and lead vehicles to the headway term, as in (17). The headway term remains unchanged if the ego and lead vehicles travel at the same speed. If instead the ego vehicle travels faster or slower than the lead vehicle, its velocity affects the relative distance, hence the headway. Thus, the addition of the relative velocity helps regulate the ego vehicle's acceleration proactively. Compared to our preliminary work [24], the addition of the relative velocity to the state space (S) and to the reward calculation assists the DRL model in effectively handling neighboring vehicles traveling at different velocities. The headway reward component (r_h(s(t), a(t))) is modeled as a Log-Normal distribution function with mean ϵ and variance σ, equal to 0.285 and 0.15, respectively, where M_1 and M_2 are the parameters of the headway reward component, whose values are defined in Tab. IV. Such a headway reward function reaches +1 for φ(t) = 1.3 s, and -1 for φ(t) = 0.5 s with the specified parameter values.

Comfort reward component: It is associated with the rate of change of acceleration with time, i.e., the jerk j(t). According to [33], the best comfort is observed when the absolute jerk value is below 0.9 m/s³, while values above 1.3 m/s³ indicate aggressive driving. Therefore, the reward function decreases gradually as the jerk value rises from 0.6 m/s³ to 2 m/s³, and it saturates at the minimum reward of -1. To satisfy the desired jerk reward trend, the comfort reward component is modeled using Polynomial Curve Fitting. It is worth noting that the passengers' safety supersedes the comfort factor during critical situations. Thus, we consider the TTC as a safety indicator to identify dangerous situations, and the comfort reward is neglected when TTC ≤ 4 s, to prioritize safety. In the comfort reward formulation, M_3 is the polynomial degree parameter specified in Tab. IV, and χ(t) is the safety indicator defined in (14), which is used to discount comfort in case of danger.
Stability reward component: It is valued in terms of the slip, which determines the maximum tractive force of a pneumatic tire on the road surface. Based on experimental data, an absolute longitudinal slip value below 0.2 is considered a stable condition. Thus, the stability reward gives a maximum reward of +1 for zero slip, and a negative reward for slip values over 0.2, indicating that the vehicle is not in the stable region. The stability reward is given by a tanh function of the longitudinal slip, where M_4 and M_5 are scaling parameters whose respective values are reported in Tab. IV.
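To illustrate how these terms combine into the feedback signal, the following minimal Python sketch reproduces the qualitative shapes described above (a tanh stability term, a TTC-gated comfort term, and the context-weighted sum). The numeric constants are illustrative assumptions, not the M_i values of Tab. IV.

```python
import math

# Assumed placeholder constants; the actual M_i values are listed in Tab. IV.
M4, M5 = 10.0, 0.1

def stability_reward(slip: float) -> float:
    """tanh-shaped term: close to +1 at zero slip, negative once |slip| > 0.2."""
    return math.tanh(M4 * (0.2 - abs(slip)) + M5)

def comfort_reward(jerk: float, ttc: float) -> float:
    """Comfort term gated by the safety indicator: neglected when TTC <= 4 s.
    The polynomial shape below is an assumed stand-in for the fitted curve."""
    if ttc <= 4.0:           # chi(t) = 0: safety supersedes comfort
        return 0.0
    j = min(abs(jerk), 2.0)  # saturate at 2 m/s^3
    return max(-1.0, 1.0 - (j / 1.3) ** 3)

def total_reward(r_h: float, r_s: float, r_c: float,
                 x_h: float, x_s: float, x_c: float) -> float:
    """Context-weighted sum of the three components fed back to the agent."""
    return x_h * r_h + x_s * r_s + x_c * r_c
```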
Simulation Environment: To learn the desired behavior, the DRL agent has to interact with an environment that simulates the neighboring vehicles' behavior and the road conditions. Our study uses CoMoVe, a comprehensive simulation environment that can accurately simulate the vehicles' dynamics and sensor arrays, V2X communication, and road conditions to facilitate the DRL agent's learning process. Sec. IV explains in detail the integration of the DRL model with the CoMoVe simulation framework.

Importance of V2X Communication:
The role of communication in the 2LL-CACC scheme is crucial, as it is responsible for collecting the lead vehicle's acceleration to form the state space (S) of the DRL model. Furthermore, in the cut-in and cut-out scenarios, the lead vehicle often falls in the blind spot of the ego vehicle's sensor array, resulting in a late detection of the lead vehicle's presence and in uncomfortable maneuvers to avoid a potential collision. Through V2X communications, instead, the ego vehicle can periodically receive information on the lead vehicle's movements (e.g., yaw rate and position) and recognize in advance its intention to change lane, even before the sensor array can perceive it.
To fully benefit from such additional information, we introduce a gradual switching technique that allows the ego vehicle to gradually switch its attention to the cut-in vehicle, or to the vehicle ahead of the cut-out vehicle, so as to perform moderate evasive maneuvers without hindering the passengers' safety and comfort. Denoting with Y_e and Y_c, respectively, the lateral position of the ego vehicle and that of the generic lane-changing vehicle, we define $\Delta Y_c = |Y_c - Y_e|$. Then we let the ego vehicle trigger a lead-vehicle switch whenever ∆Y_c crosses a certain threshold. Specifically, in the cut-out scenario, the ego vehicle switches to the new lead vehicle when ∆Y_c > M_6 · Y_0, with Y_0 being the lane width (i.e., 3.3 m) and M_6 a scaling factor. In the cut-in scenario, instead, the ego vehicle takes as the new lead vehicle the one cutting in when ∆Y_c < M_7 · Y_0, with M_7 being a scaling factor. The values of the scaling factors we used are presented in Tab. IV. Since gradual switching means slowly shifting the focus from one vehicle to another, we further introduce a normalized variable ℘, scaled between 0 and 1 according to the specified thresholds, which suits well our methodology and actual implementation. As long as ∆Y_c is below the threshold value in the cut-out scenario, the headway fed to the DRL model is a ℘-weighted combination of the headway to the cut-out vehicle and the headway to the new lead vehicle n, identified by the ego vehicle based on the yaw-rate values received from its neighbors through V2X communication. Similarly, as long as ∆Y_c is above the threshold in the cut-in scenario, the headway input to the DRL model is a ℘-weighted combination of the headway to the previous lead vehicle p and the headway to the cut-in vehicle.

Specifically, for cut-in situations, the gradual switching technique incorporates an adaptation of the Automated Lane Keeping System (ALKS), UN Regulation No. 157 [34]. As per the regulation's suggestions, the gradual switching technique is refined to wait for at least 0.72 s before reacting to the cut-in vehicle, to avoid considering any temporary lateral position change of the social vehicle. Subsequently, if the social vehicle's lateral position continues to change for more than the specified threshold, the proposed switching technique gradually changes the focus to the lane-changing vehicle, considering it a cut-in situation. In addition, we monitor the ego vehicle's Time-To-Collision (TTC) with respect to the social vehicle, as well as the social vehicle's lateral position during the lane-changing phase, to handle aggressive cut-in situations. The ego vehicle switches its focus entirely to the lane-changing social vehicle if any of the following conditions is met:
• the TTC becomes lower than the $TTC_{LaneIntrusion}$ [34] threshold, defined as $TTC_{LaneIntrusion} = \nu/(2 M_8) + M_9$, where ν is the relative velocity between the lane-changing social vehicle and the ego vehicle, while M_8 and M_9 are scaling factors accounting for the maximum deceleration rate and the reaction time, respectively;
• the TTC is less than 4 s [32];
• the social vehicle is 30 cm [34] inside the ego vehicle's lane.
The relative velocity between the ego vehicle and the lead vehicle (ν) is computed similarly, and further used by the DRL framework to control the ego vehicle's movements. In summary, the gradual switching technique is specifically introduced to handle challenging vehicle maneuvers such as cut-in and cut-out. In fact, it influences the DRL model variables so as to advise the agent to accommodate lane-changing maneuvers efficiently. The results reported in Sec. VI-B further validate the importance of V2X communication in the proposed framework.
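A minimal sketch of this blending logic is shown below. The scaling factors, and the direction in which ℘ ramps, are illustrative assumptions (the actual factors appear in Tab. IV).

```python
Y0 = 3.3            # lane width [m], as stated above
M6 = M7 = 0.5       # assumed scaling factors; actual values are in Tab. IV

def switching_progress(delta_yc: float, cut_in: bool) -> float:
    """Map the lane-changer's lateral offset to a 0..1 progress value p,
    used to gradually shift the DRL focus between two lead vehicles."""
    frac = delta_yc / Y0
    p = 1.0 - frac if cut_in else frac   # assumed ramp direction
    return min(max(p, 0.0), 1.0)

def blended_headway(h_current: float, h_target: float, p: float) -> float:
    """Convex combination of the headways to the current and target lead."""
    return (1.0 - p) * h_current + p * h_target
```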

IV. INTEGRATING THE DRL MODEL IN COMOVE
The CoMoVe framework [30], depicted in Fig. 3, combines widely used simulators in each domain (mobility, communication, and vehicle dynamics) and makes them interact efficiently. It combines: (i) SUMO, a traffic simulator for vehicle mobility; (ii) ns-3, a network simulator to model V2X communications; (iii) a MATLAB/Simulink module modeling the vehicle dynamics and the vehicle's on-board sensors, while the Driving Scenario Designer converts the vehicle information from SUMO to the MATLAB format to support on-board sensing; (iv) a Python Engine acting as a middle-man to handle the information flow between the modules and to host the control strategies.

CoMoVe leverages SUMO's TraCI library, ns-3's Python bindings, and MATLAB's Python Engine to write complete Python simulation scripts and ensure efficient interactions between the simulators. Consequently, the Python Engine is CoMoVe's core: it can access information from each simulator and hosts the 2LL-CACC framework to control the ego vehicle's movement. As for the DRL state components, the lead vehicle's acceleration (α(t)) is received through the ns-3 V2X communication model, while the vehicle sensor model output helps calculate the headway (ϑ(t)), headway derivative (∆ϑ(t)), and relative velocity (ν(t)) values. The longitudinal slip (ξ(t)) and friction coefficient (µ(t)) are obtained through the Simulink Vehicle Dynamics model. The DRL model's action (the desired acceleration) is used as a reference signal for the ego vehicle's lower-level controller in the Vehicle Dynamics model. A pure electric vehicle with a 14-Degree-of-Freedom (DoF) mathematical model and rear in-wheel motors is utilized to characterize the vehicle dynamics. Using the CoMoVe framework, in Sec. VI we show how 2LL-CACC provides a safe, comfortable, and efficient driving experience in challenging road scenarios.

V. HARDWARE-IN-THE-LOOP IMPLEMENTATION

Testing and validating ADAS subsystems in assembled vehicles incurs significant overhead in terms of time, safety, and cost. Thus, HIL simulations have emerged as a convenient way to virtually validate a system in a wide range of test scenarios during the vehicle development process. In general, HIL simulations validate control algorithms through a real-time virtual environment encompassing the vehicle's functionalities. To validate our approach, we perform HIL simulations using dSPACE real-time systems, comprising modular and robust platforms for testing autonomous driving. Notably, HIL simulations demonstrate the deployable nature of the proposed controller, with a similar outcome expected in an actual vehicle. More specifically, in the proposed framework, the implementation of the HIL simulation involves two main steps: (i) the conversion of the pre-trained Python DRL model into a MATLAB/Simulink-supported DRL model, and (ii) the generation of the DRL agent's real-time code. Since the vehicle sensor and dynamics models are simulated in the MathWorks environment, the Python-based DRL agent must be converted into a MATLAB-supported DRL agent for automatic code generation. Note that the network simulator (ns-3) and the traffic mobility model (SUMO) are not part of the HIL implementation, as they do not support the automatic code generation process. Instead, the HIL simulation uses the mobility traces of the lead vehicles, and the ego vehicle is assumed to be equipped with a V2X communication On-Board Unit (OBU) to receive the lead vehicle's information.

As specified in Sec. III, we embedded the DDPG algorithm into the CoMoVe framework through the Python Engine. Specifically, the PyTorch machine learning framework is used to build and train the DDPG algorithm's neural network model. In general, the Open Neural Network Exchange (ONNX) format is used to achieve interoperability between different ML frameworks like TensorFlow, PyTorch, and MATLAB. However, the support of the ONNX format in MATLAB is limited to 3D input layers, i.e., images, so the direct usage of the PyTorch model in MATLAB is unattainable. As a workaround, we replicated the PyTorch neural network structure in MATLAB and transferred the learnable parameter values to the MATLAB model. In essence, the learnable parameters are the optimized weights and biases of the neural network that are learned to achieve the desired outcome. Then, the MATLAB DRL model is converted into a function to evaluate the learned policy of the DRL agent. At a given time step t, the generated function can predict the action (a(t)) based on the state (s(t)), as per the trained optimal policy. Subsequently, the function is integrated into the Simulink model through the "MATLAB Function" block, so that it can directly predict the control action for the ego vehicle in Simulink. Notice that the generated function does not support further learning and can only be used to perform inference.
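For illustration, the sketch below shows one possible way to carry out such a parameter transfer (the file name and the scipy-based export are assumptions; the exact export toolchain is not specified here).

```python
import torch
from scipy.io import savemat

def export_learnables(net: torch.nn.Module, path: str = "ddpg_actor.mat") -> None:
    """Dump a trained network's learnable parameters (weights and biases)
    to a .mat file, so they can be loaded into the replicated MATLAB
    network. Dots in parameter names are replaced, since MATLAB variable
    names cannot contain them."""
    params = {name.replace(".", "_"): tensor.detach().cpu().numpy()
              for name, tensor in net.state_dict().items()}
    savemat(path, params)
```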
Finally, we validated the consistency of the PyTorch and MATLAB DRL agents by transferring multiple model parameters from PyTorch to MATLAB. Tab. II presents the observed Root Mean Square Error (RMSE) of the headway parameter for the PyTorch and MATLAB DRL agents. The validation results indicate that the effect of the model conversion on the output values is negligible, thus firmly confirming the correct transfer of the PyTorch DRL model to MATLAB. In the second step, dSPACE's Real-Time Interface (RTI) links the Simulink software with the dSPACE hardware. In particular, the RTI extends Simulink's C code generator to execute the Simulink software model on real-time hardware. Later, the generated C code is loaded into the dSPACE SCALEXIO AutoBox to perform the HIL simulation.

VI. PERFORMANCE EVALUATION
This section first introduces the realistic settings under which we derive the performance of 2LL-CACC, as well as the state-of-the-art technique that we consider as a benchmark (Sec. VI-A). Then, using both the CoMoVe framework and the HIL implementation, it presents the performance results obtained in relevant, practical scenarios (Sec. VI-B).

A. Reference scenario and test cases
We explore two highly challenging highway driving scenarios where a lead vehicle cuts in and out of its current lane, exposing the ego vehicle to unclear or critical situations. In both scenarios, as shown in Fig. 5, the lead vehicle (LV), i.e., LV-A in the cut-in and LV-B in the cut-out scenario, is in the sensor array's blind spot. Thanks to V2X communication, the ego vehicle becomes aware of the LV's lateral movements and employs the gradual switching technique to change its focus between the two lead vehicles. Note that the ego vehicle's sensor array takes over the perception control only once the new LV is in its field of view. Fig. 6 illustrates the lateral movement of the LVs in the considered mobility scenario.
In addition, we compare the proposed framework to the state-of-the-art method in [16], whose implementation cannot be entirely reproduced because their dataset is not publicly available. We therefore consider the vanilla DDQN algorithm [35], which is the core component of the method used in [16]. Tab. III shows the hyperparameter values we used for the DDPG and DDQN algorithms, and Tab. IV presents the parameter values of the different reward components. Sec. VI-B discusses the behavior of the ego vehicle equipped with the 2LL-CACC, and its relative performance with respect to the case where the bottom layer uses the state-of-the-art vanilla DDQN algorithm instead of the proposed DRL model, in the challenging cut-in and cut-out maneuvers. For brevity, in the plots shown in the following we refer to the considered benchmark as DDQN.

B. Results
We start by discussing the performance of 2LL-CACC in the cut-out scenario. The top left and top right plots of Fig. 7 present the velocity and acceleration profiles of the vehicles. The bottom left and right plots show instead the headway and jerk trends, i.e., the efficiency and comfort factors of the objectives. In Fig. 7, the black line indicates the ego vehicle's desired operating range to maintain a safe inter-vehicle distance, provide adequate comfort, and improve road usage efficiency. As mentioned, in the cut-out scenario LV-A changes lane at the last moment, to avoid collision with the slow-moving vehicle in front of it. As the ego vehicle monitors the LVs' lateral movements, it responds to the lane-changing behavior by gradually switching its focus to LV-B.
Notice that the ego vehicle's DRL model is designed to maintain a headway of 1.3 s, but the headway increases as the ego vehicle gradually switches its focus to LV-B. Initially, the ego vehicle speeds up to compensate for the rise in headway; however, it also keeps track of the TTC with LV-A to ensure it does not collide with it before the lane-changing maneuver is completed. Once LV-A's lateral position is far enough, LV-B becomes the primary focus of the ego vehicle. Subsequently, the ego vehicle decelerates to maintain zero relative velocity with LV-B, as the latter travels at a lower velocity.
From the bottom left plot of Fig. 7, we can see that 2LL-CACC maintains the headway inside the desired range for about 69% of the simulation time. Also, although the ego vehicle cannot keep the headway inside the desired range during the cut-out maneuver, the left plot of Fig. 8 shows that the TTC never drops below the critical threshold of 4 s, thus always guaranteeing safety. For better visualization, Fig. 8 presents the TTC with an upper bound of 100 s and highlights the lower critical threshold of 4 s with a black horizontal line.
In terms of comfort, one can notice a few spikes in the jerk trend in the bottom right plot of Fig. 7; these, however, are necessary to keep the TTC between the vehicles in a safe range, which clearly has higher priority. In contrast to the DDQN benchmark, 2LL-CACC learns an optimal policy within 340 episodes and delivers better results.

We now move to the cut-in scenario. The top plots of Fig. 10 show the velocity (left) and acceleration (right) trends of the vehicles involved in the scenario. The lead vehicle (LV-A), traveling at higher speed, overtakes the ego vehicle and then starts a cut-in maneuver to enter the ego vehicle's lane. Also, LV-A decelerates in order to squeeze into the gap between the ego vehicle and LV-B. As before, the ego vehicle promptly recognizes the lane-changing maneuver thanks to V2X communications, and it starts to monitor the lateral movement of the social vehicle. Once the gradual switching determines it is a cut-in situation, the ego vehicle starts decelerating, as the distance between the two vehicles reduces rapidly. Note that in this situation the distance between the ego vehicle and LV-A does not represent the gap between them. Instead, the relative distance is calculated according to the turning angle and the length of LV-A, as it also incorporates the ego vehicle's collision point on LV-A. The calculated relative distance gives additional time to the ego vehicle to handle the maneuver effectively. The camera sensor quickly identifies LV-A's presence and takes over in defining the relative distance between the cars. Nevertheless, the role of V2X communication is still crucial, as it recognizes the maneuver proactively and allows the ego vehicle to decelerate gradually.

As for the headway, the bottom left plot of Fig. 10 shows that 2LL-CACC can maintain such a metric within the desired range for a more extended time period compared to the DDQN (78% versus 45% of the total simulation time). Also, the right plot of Fig. 8 shows that the TTC never drops below the critical safety threshold of 4 s: this shows that, even in the closer cut-in situation, the gradual switching can assist the ego vehicle in ensuring safety and better road usage efficiency.

The bottom right plot of Fig. 10 underlines that the jerk is temporarily outside the desired range during the cut-in maneuver, but this is again inevitable, as the scenario demands such a response to maintain a safe distance between the cars for the whole simulation period. In this scenario, the DDQN model cannot handle the cut-in maneuver as efficiently as 2LL-CACC. Furthermore, the ego vehicle's longitudinal slip presented in Fig. 11 shows that the vehicle remains stable with both the DDPG- and DDQN-based models, given that the cut-in scenario is carried out in dry road conditions.

In addition, we have executed the cut-in scenario in the dSPACE SCALEXIO AutoBox and verified the HIL system's performance. As can be seen in Fig. 12, the HIL implementation achieves performance similar to that obtained with the standard model-in-the-loop setup. This result further strengthens 2LL-CACC's case, as it validates the deployable nature of the trained DRL model in actual vehicles. Finally, Tab. V presents the Root Mean Square Error (RMSE) of the obtained results to highlight the quantitative performance of the proposed framework. 2LL-CACC achieves very good results with respect to all objectives, and notably outperforms the DDQN-based model in all scenarios. In terms of comfort, 2LL-CACC has higher jerk RMSE values; however, this is inevitable in order to promptly react to new conditions, as passengers' safety has to be prioritized over comfort in these critical scenarios. Still, 2LL-CACC achieves an excellent trade-off among the three objectives, providing safe inter-vehicle distance, improved road usage efficiency, and satisfactory comfort to the passengers.

VII. CONCLUSION
We addressed Adaptive Cruise Control (ACC) for connected autonomous vehicles in such challenging traffic scenarios as cut-in and cut-out maneuvers. We proposed a 2-layer, ML-assisted deep reinforcement learning (DRL) approach that properly weighs the target metrics, namely, headway, jerk, and longitudinal wheel slip, and achieves the best tradeoff among the safety, road efficiency, and comfort objectives. When compared with state-of-the-art alternatives, our framework provides substantially better performance. Notably, it keeps the headway within the desired range for 54% and 33% more of the time than its alternatives, thus ensuring better traffic flow efficiency. In particular, V2X communication enables the ego vehicle to timely and gradually switch its focus to the neighboring vehicles, significantly boosting the safety performance. While we have considered roads to be straight (hence, a lane-changing maneuver only influences the lead vehicle's yaw rate), future work will leverage ADAS applications like lane-change detectors and extend the proposed framework to curved roads.






Fig. 4 shows the process flow of our HIL implementation (left) and the structure of the HIL simulation platform in dSPACE SCALEXIO (right).





TABLE I: Notations


TABLE II: Model conversion validation

TABLE IV: Parameter values

TABLE V: Comparison between 2LL-CACC and DDQN in terms of RMSE for headway, jerk, and slip