A Hierarchical Framework for Multi-Lane Autonomous Driving Based on Reinforcement Learning

This paper proposes a hierarchical framework integrating deep reinforcement learning (DRL) and rule-based methods for multi-lane autonomous driving. We define an instantaneous desired speed (IDS) as an intermediate action that mimics the common motivation for higher speed across different traffic situations. High-level DRL generates the IDS directly, while the low-level rule-based policies, including a car-following (CF) model and a three-stage lane-changing (LC) model, are governed by the common goal of the IDS. The hierarchy not only captures the coupling between CF and LC behaviors but also combines the benefits of both DRL and rule-based methods, such as learning ability and greater interpretability. Owing to the decomposition and the combination with rule-based models, traffic flow operations can be taken into account for individually controlled automated vehicles (AVs) through a traffic flow adaptive (TFA) extension acting on the exposed critical parameters. A comprehensive evaluation of the proposed framework is conducted from both the individual and the system perspective, in comparison with a pure DRL model and the widely used rule-based combination of IDM and MOBIL. The simulation results demonstrate the effectiveness of the proposed framework.


I. INTRODUCTION
Autonomous vehicles (AVs) that drive themselves without the need for human intervention hold enormous potential to increase the safety and efficiency of the transportation system and to provide better mobility services for people and goods. One of the most challenging tasks in autonomous driving is decision making for safe, comfortable, and efficient vehicle maneuvers on a multi-lane highway [1]. What makes multi-lane driving significantly more challenging is that multi-vehicle interaction happens both laterally and longitudinally and requires coordination between lateral and speed control [2], [3], [4]. Coupled control in a multi-lane environment must account for more potential vehicles in multiple lanes, which enlarges the complexity.
In the traffic flow research community, longitudinal (car-following) and lateral (lane-changing) behaviors are usually modeled separately with rule-based methods. Numerous car-following (CF) models have been formulated as ordinary differential equations, such as Newell's model [5], Gipps' model [6], the Intelligent Driver Model (IDM) [7], and the Optimal Velocity Model (OVM) [8]. Relatively few studies focus on modeling lane-changing (LC) behavior due to its complexity [2], and even fewer efforts have been made to integrate CF and LC models for multi-lane driving scenarios [9]. Although these rule-based models are easy to interpret and calibrate, with physically meaningful terms, and are mathematically tractable for safety guarantees, they generally do not scale to complex traffic scenarios involving a variety of interactive agents that affect the subject vehicle's driving behavior.
With recent advances in machine learning, more studies have adopted learning-based approaches for multi-lane driving decision making [10], [11], [12], [13], [14], [15] rather than conventional rule-based approaches, as learning-based approaches can achieve better fitting or optimization performance. Among these, deep reinforcement learning (DRL) is an emerging approach that learns an optimal driving policy by accumulating rewards through interaction with the traffic environment [16]. Many task-specific DRL applications exist for CF or LC behavior modeling (see the review in [17]), and DRL-based multi-lane driving models that output CF and LC decisions directly have also been proposed (see the review in [18]). However, these models are found to lack robustness in the presence of multiple interactive vehicles and due to the lack of a causal model [19]. Another critical deficiency of learning-based approaches is the lack of interpretability [20], which is essential for studying individual behavior, and AV driving in particular, for reasons such as the trustworthiness and safety of AVs and transparency in governance. Moreover, current DRL-based AV models for multi-lane driving usually consider the optimal performance of the subject vehicle [18] and the performance of the traffic system separately, where the optimization of the traffic system adopts centralized control (i.e., all AVs are controlled to optimize the traffic flow regardless of their individual performance) rather than decentralized control (i.e., each AV achieves optimal performance and jointly contributes to traffic flow optimization). Given the negative impacts on the traffic system discussed in previous AV studies [21], [22], [23], an efficient model capable of considering traffic flow operation with individually optimized vehicles is thus of significance.
Considering the challenges of rule-based and DRL-based models for multi-lane AV driving in terms of generalization, interpretation, and robustness, we aim to leverage the strengths of both DRL and rule-based methods to address them. To this end, we develop a hierarchical decision-making framework for multi-lane driving that considers both individual and system performance optimization. Specifically, this study makes four main contributions. 1) We combine a DRL model and rule-based policies into a hierarchical multi-lane driving framework that inherits the advantages of both methods, with DRL at the high level and rule-based CF and LC models at the low level.
2) We define the instantaneous desired speed (IDS) as the intermediate action that synergizes DRL and rule-based methods. It is the direct output of the DRL agent, depicts the inherent pursuit of speed, and motivates both longitudinal and lateral movements.
3) We extend the hierarchical framework with a traffic flow adaptive (TFA) strategy based on the parameters exposed by the framework, to optimize mixed traffic flow.
4) The performance is demonstrated in several typical traffic scenarios, including ring roads and an on-ramp bottleneck, in comparison with a classical DRL model and the widely used rule-based combination of IDM and MOBIL.
The rest of the paper is structured as follows: Section II reviews the related literature; Section III introduces the hierarchical framework; Section IV presents the training results of the framework; Section V applies the proposed framework for traffic flow optimization and discusses the evaluation results; Section VI presents the conclusions of this work.

II. LITERATURE REVIEW
With the recent rapid progress of DRL, an increasing number of DRL-based studies have emerged in the domain of autonomous driving. This section first introduces DRL-based AV studies, followed by system-optimal DRL modeling. Lastly, we review studies on hybrid models combining learning-based and rule-based methods.

A. DRL MODELS FOR AV DRIVING
Most DRL models for AVs in the literature concentrate on single-task driving, e.g., CF [24], LC [25], overtaking [26], ramp merging [10], intersection crossing [27], and single-lane trajectory planning [28]. For multi-lane driving, although CF and LC decisions are usually generated simultaneously, the coupling between them cannot be guaranteed, even though CF and LC are strongly related according to previous studies [29], [30].
By creating a policy hierarchy, hierarchical RL (HRL) boosts both learning efficiency and solution quality [31]. Generally, HRL can be categorized into option-based [32] and goal-based [33] frameworks. For multi-lane driving, option-based HRL is applied in [34], where two DRL algorithms are adopted to select lane-changing actions in the upper layer and to process car-following decisions in the lower layer, respectively. Nonetheless, only one option (e.g., left LC, right LC, or lane keeping) can be trained at a time, which makes training inefficient, and the mutual interdependence of CF and LC is not fully utilized. Goal-based HRL uses a common goal to coordinate subtasks, issued by a higher-level manager and pursued by a lower-level worker. It suits multi-lane driving because CF and LC are both motivated by the inherent pursuit of optimal speed. Thus, we adopt the basic ideas of goal-based HRL in this work and combine them with rule-based models. For more detailed DRL models for multi-lane driving, recent surveys are available in [17], [18].

B. DRL MODELS FOR TRAFFIC FLOW OPTIMIZATION
Many studies have addressed traffic flow optimization through cooperative AVs based on multi-agent reinforcement learning (MARL), where the movement of AVs is managed by a central controller to collectively regulate traffic flow and smooth out traffic jams. The series of studies by the FLOW project team are representative system-optimal MARL models [35], [36], [37], [38], [39], [40]. However, centralized control overlooks the performance of individual vehicles, and it can only be realized in the far future with a high penetration rate of connected automated vehicles. With elaborately designed reward functions comprising multiple objectives such as traffic flow stability, distributed control based on DRL has been applied to account for both traffic flow operation and the driving performance of a single AV [41], [42], [43], [44], [45]. However, these studies generally neglect LC behavior for the sake of simplicity, even though it is one of the most basic driving behaviors and is more likely to disturb traffic flow [46].

C. HYBRID MODELS COMBINING LEARNING AND RULE-BASED METHODS
This paper aims to combine DRL with rule-based methods for improved multi-lane driving modeling. Such hybrid models can exploit prior knowledge stemming from our observational, empirical, physical, or mathematical understanding of the world as rules to enhance the performance of a learning algorithm. Despite their advantages, only limited effort has been devoted to hybrid driving behavior models. For example, to connect with classical rule-based CF models, neural networks of diverse architectures have been employed [47], and a rule-based CF model can be encoded into a neural network to achieve higher prediction accuracy [48]. For DRL, common ways to embed domain knowledge include designing new reward terms [43], injecting rule-based constraints as safeguards [49], and learning the parameters of rules with DRL [50]. However, the coupling relationship between CF and LC is hard to express as a reward function.
In summary, this paper explores a novel hierarchical multi-lane driving framework based on DRL, which uses goal-based HRL and rule-based methods to capture the coupled CF and LC behaviors and to realize both individual-level and system-level improvements, adapting to different traffic scenarios and conditions through the parameters exposed by the framework.

III. HIERARCHICAL DECISION-MAKING FRAMEWORK

A. PRELIMINARIES OF DRL
DRL is the integration of reinforcement learning (RL) and deep neural networks, and its core idea is to find the optimal policy for an agent through interaction with the environment [16]. The Markov Decision Process (MDP) is the theoretical basis of RL and can be expressed as a tuple {S, A, R, P, ρ_0}, where: S is a set of states; A is a set of actions; R is the reward function, with reward r_t = R(s_t, a_t, s_{t+1}); P is the transition probability function, with P(s′|s, a) being the probability of transitioning into state s′ when starting from s and taking action a; and ρ_0 is the starting state distribution.
A policy is a rule set/network used by an agent to decide which action to take; a stochastic policy can be denoted as a_t ∼ π_θ(·|s_t), where θ is the parameter set of the policy. The expected return of a stochastic policy is defined as

J(π) = E_{τ∼π}[Σ_{t=0}^{∞} ε^t r_t],  (1)

where ε is the discount factor and τ is a sequence of states and actions. The goal of DRL is to find a policy that maximizes the expected return,

π* = arg max_π J(π),  (2)

where π* is the optimal policy. Moreover, value functions are estimated to assess the quality of the current state-action pair or state rather than waiting for the long-term result:

Q^π(s, a) = E_{τ∼π}[Σ_{t=0}^{∞} ε^t r_t | s_0 = s, a_0 = a],  (3)

V^π(s) = E_{τ∼π}[Σ_{t=0}^{∞} ε^t r_t | s_0 = s].  (4)

To obtain the optimal policy, policy optimization is one of the most commonly used and powerful methods in modern DRL; it optimizes the parameter set θ directly by gradient ascent on the performance objective J(π_θ),

θ_{k+1} = θ_k + α ∇_θ J(π_θ)|_{θ_k},  (5)

where α is the learning rate.

B. HIERARCHICAL FRAMEWORK WITH TRAFFIC-ADAPTIVE EXTENSION
Although we seek to combine DRL with rule-based models to introduce inductive bias in the paradigm of physics-informed machine learning (PIML), how to design the new architecture remains an open question. Based on the basic idea of HRL, a hierarchical framework composed of DRL and rule-based models is desired. Since learning-based techniques specialize in automatically extracting features from large volumes of multi-dimensional data, DRL should be placed at the higher level to recognize and react to different traffic situations by processing the observed states of the environment. Meanwhile, rule-based models have explicit rules, which enables embedding hard constraints to ensure safety (e.g., responsibility-sensitive safety), and some rule-based models are even collision-free by construction (e.g., the well-known safety-distance model of Gipps), whereas serious variance problems hinder imposing hard constraints within DRL algorithms [51], which may lead to collisions. Thus, it is appropriate to place rule-based models at the lower level to generate realistic vehicle movements. Specifically, a hierarchical policy for multi-lane driving is proposed, as shown in Fig. 1. It consists of a collection of low-level rule-based decisions, including the CF policy and the three-stage LC policy, and a high-level DRL strategy that outputs the intermediate goal governing them. The instantaneous desired speed (IDS) is proposed as the intermediate goal to depict the common motivation of CF and LC. In summary, it is a Hierarchical framework centered on the Instantaneous Desired Speed for multi-lane driving based on DRL (referred to as HIDS-DRL hereafter).
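To make the control flow of the hierarchy concrete, the following minimal Python sketch illustrates a single decision step; the function and attribute names (drl_policy, cf_model, lc_model, observe, ego.v) are hypothetical placeholders rather than the paper's implementation.

```python
def hids_drl_step(drl_policy, cf_model, lc_model, observe, ego, env):
    """One decision step of the hierarchical policy (illustrative sketch).

    The high-level DRL policy maps the observed traffic state to the IDS;
    the low-level rule-based models turn the IDS into concrete actions.
    """
    s_t = observe(ego, env)           # state: gaps/relative speeds of neighbors
    v_star = drl_policy(s_t)          # high level: DRL outputs the IDS
    accel = cf_model(ego.v, v_star)   # low level: CF policy tracks the IDS
    lcd = lc_model(ego, env, v_star)  # low level: LC triggered against the IDS
    return accel, lcd                 # final actions applied to the environment
```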
With the decomposition into a hierarchical structure and the combination with rule-based methods, critical parameters with physical meaning (e.g., the IDS) are exposed. A traffic flow adaptive strategy is further designed to account for traffic flow operations by adjusting these exposed parameters for different traffic states. Note that although the TFA strategy is not included in the learning framework, it is an imperative extension of the proposed HIDS-DRL model and is strongly connected with it. As mentioned in the Introduction, given the limitation of existing DRL-based AV studies that individual AV performance and traffic flow are optimized separately and sometimes contradict each other, one main purpose of this study is to develop an efficient model capable of jointly considering individually optimal driving performance and traffic flow optimization. This is feasible thanks to the critical parameters exposed by the rule-based CF and LC models in the well-designed hierarchical structure of HIDS-DRL, whereas other DRL models are hard to adapt.
Overall, by combining DRL and domain knowledge of traffic flow theory, i.e., a simple DRL agent with intuitive CF and LC models and a TFA extension, we demonstrate that the HIDS-DRL model can achieve competitive individual performance and optimized system performance simultaneously (see details in Sections IV and V).

C. HIERARCHICAL MODEL FOR MULTI-LANE DRIVING

1) HIGH-LEVEL MOTIVATION WITH IDS
The desired-speed concept assumes that each driver has a desired driving speed and seeks to minimize the difference between the actual and desired speeds. It has been widely adopted in many well-known CF models and their variants, such as the IDM and the OVM. However, the conventional desired speed in the IDM, and the parameters for calculating the desired speed in other CF models (e.g., the desired time headway), are generally fixed; hence it cannot adapt to different traffic situations. The optimal velocity in the OVM class of models, by contrast, is usually a function of the spacing headway and changes constantly. Yet the optimal velocity is only valid for CF behavior and is not comparable with other speeds, since it considers only the spacing headway relative to the leading vehicle in the subject lane.
Generally, the pursuit of higher speed motivates both CF and LC behavior. We thus propose the IDS, the time-varying desired speed at each time step that drives both acceleration and lane-change motivations, considering the joint impacts of the surrounding vehicles. Longitudinal acceleration is generated to diminish the deviation from the IDS, and LC is triggered when the speed attainable in an adjacent lane exceeds the IDS. Under this assumption that the IDS activates both CF and LC behaviors, the ideal IDS can be learned by DRL with the CF and LC models included in the training loop, as in the hierarchical policy of Fig. 1.
The intermediate goal, the IDS v*_t at time t, is generated directly by the DRL algorithm, i.e., a_t = v*_t. The state of the DRL agent is defined by the gaps and relative speeds of the surrounding vehicles together with the subject AV's own speed on a typical multi-lane expressway. Specifically, we identify the closest preceding and following vehicles in the current lane and the adjacent lanes within a certain range L as the surrounding vehicles, as shown in Fig. 2. The state s_t of the AV at time t is formulated as

s_t = (g^LL_t, Δv^LL_t, ..., g^SF_t, Δv^SF_t, v_t),

where g^i_t and Δv^i_t are the gap and relative speed between surrounding vehicle i and the subject AV, respectively; i ∈ {LL, LF, RL, RF, SL, SF} indexes the leading and following vehicles in the left adjacent, right adjacent, and subject lanes (for example, LL denotes the leading vehicle in the left adjacent lane); and v_t is the speed of the subject AV. Note that the state fed into the DRL agent is produced by the interaction between the final actions of the low-level policies and the environment, where the final actions comprise a continuous longitudinal acceleration and a discrete lane-changing decision. Overall, the high-level DRL takes the attributes of surrounding vehicles as input states and outputs the IDS, which is passed to the low-level controller.
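For illustration, a minimal sketch of this state construction is given below; the Vehicle container, the neighbor search, and the padding of absent neighbors with (L, 0) are assumptions made for completeness, not the paper's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Vehicle:
    x: float    # longitudinal position [m]
    v: float    # speed [m/s]
    lane: int   # lane index

# First letter: lane (Left/Right/Subject); second letter: Leader/Follower.
SLOTS = ["LL", "LF", "RL", "RF", "SL", "SF"]

def find_neighbor(ego, vehicles, slot, max_range):
    """Return the closest vehicle occupying the given slot, or None."""
    lane = ego.lane + {"L": -1, "R": 1, "S": 0}[slot[0]]
    ahead = slot[1] == "L"
    candidates = [u for u in vehicles
                  if u.lane == lane
                  and (u.x > ego.x if ahead else u.x < ego.x)
                  and abs(u.x - ego.x) <= max_range]
    return min(candidates, key=lambda u: abs(u.x - ego.x)) if candidates else None

def build_state(ego, vehicles, L=100.0):
    """Assemble s_t = (g_i, dv_i for each slot, own speed) as a flat vector."""
    feats = []
    for slot in SLOTS:
        nb = find_neighbor(ego, vehicles, slot, L)
        # Pad absent neighbors with the sensing range and zero relative speed.
        feats += [L, 0.0] if nb is None else [abs(nb.x - ego.x), nb.v - ego.v]
    feats.append(ego.v)
    return np.array(feats, dtype=np.float32)
```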

2) LOW-LEVEL DECISION WITH RULE-BASED POLICIES
With the IDS, the final actions can be obtained from the low-level policies, which consist of a rule-based CF model and a three-stage LC model tailored to the IDS. Specifically, the final actions of the AV agent are (v̇, lcd), where v̇ denotes the AV's longitudinal acceleration and lcd denotes the AV's lane-changing decision from the lateral decision set {left, keep, right}.
CF policy: Inspired by the IDM and the OVM, an intuitive asymmetric CF policy based on the IDS is developed, which responds actively to the IDS and yields the corresponding acceleration under dynamic and comfort constraints, with the mathematical form given in (6), where: v̇_t and v_t are the acceleration and current speed of the subject vehicle at time t, respectively; v*_t is the IDS at time t; δ is the exponent reflecting the degree of response to the IDS; i indicates acceleration (i = a) or deceleration (i = b); and k_i is the acceleration multiplier, with k_a the maximum acceleration and k_b the comfortable braking deceleration.
Intuitively, the magnitude of the speed increment or decrease is driven by the deviation of the current speed from the IDS. A different reaction exponent δ is used for acceleration and deceleration, and, since the feasible ranges of acceleration and deceleration differ, k_i is introduced as the acceleration multiplier.
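Since (6) is not reproduced here, the following minimal Python sketch shows one plausible realization of such an asymmetric IDS-tracking rule, assuming an IDM-like power-law response; the functional form and the default parameter values are assumptions, not the paper's exact model.

```python
def cf_acceleration(v, v_star, k_a=2.0, k_b=3.0, delta_a=4.0, delta_b=2.0, eps=1e-6):
    """Asymmetric IDS-tracking acceleration (assumed IDM-like stand-in for (6)).

    Below the IDS the vehicle accelerates with multiplier k_a; above it,
    it brakes with the comfortable multiplier k_b. The exponents delta
    control how sharply the response saturates as v approaches v_star.
    """
    ratio = v / max(v_star, eps)
    if v <= v_star:
        return k_a * (1.0 - ratio ** delta_a)  # i = a: acceleration branch
    return -k_b * (ratio ** delta_b - 1.0)     # i = b: deceleration branch
```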
LC policy: The whole LC decision process can be divided into three steps, i.e., motivation generation, selection of the target lane, and gap acceptance [52], [53]. As mentioned before, the motivation for an LC is likewise the pursuit of higher speed. The IDS can be regarded as a potential target speed that the subject vehicle will soon reach in the subject lane, while the average speed of neighboring vehicles is stable in the short term. Therefore, an LC is triggered when the average speed of vehicles in an adjacent lane exceeds the IDS, as in (7),
where: U is the total utility of the adjacent lane; v_adj is the average speed of vehicles in the adjacent lane, e.g., v_adj = (v_LL + v_LF)/2; g_L is the gap to the preceding vehicle in the adjacent lane; c_v and c_g are the weights of speed and gap; and v_coef and g_coef are coefficients normalizing speed and gap. After selecting the lane with the larger utility, the gap in that lane is verified to ensure a safe LC. We use 1 s as the safety time-headway threshold, meaning that an LC is executed only if the time gaps from the subject vehicle to the vehicles ahead and behind both exceed 1 s [55]. As we mainly focus on decision making in this paper, LC execution is simplified to a lateral movement of fixed duration [56].
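A minimal Python sketch of this three-stage logic follows; the linear utility, the placement of the threshold factor γ in the motivation test, and the lane dictionary keys are assumptions consistent with the verbal description of (7) and (8).

```python
def lc_decision(v_ego, ids, left, right, gamma=1.0, c_v=0.7, c_g=0.3,
                v_coef=33.3, g_coef=100.0, t_safe=1.0):
    """Three-stage LC sketch: motivation -> lane choice -> gap acceptance.

    `left`/`right` are dicts {"v_adj", "g_lead", "g_follow", "v_follow"}
    describing an adjacent lane, or None at a road edge.
    """
    def motivated(lane):  # stage 1: assumed form of the trigger in (7)
        return lane is not None and lane["v_adj"] > gamma * ids

    def utility(lane):    # stage 2: assumed linear utility for (8)
        return c_v * lane["v_adj"] / v_coef + c_g * lane["g_lead"] / g_coef

    options = {d: l for d, l in (("left", left), ("right", right)) if motivated(l)}
    if not options:
        return "keep"
    direction = max(options, key=lambda d: utility(options[d]))
    lane = options[direction]
    # Stage 3: accept only if the front and rear time gaps both exceed 1 s.
    safe = (lane["g_lead"] > t_safe * max(v_ego, 1e-6)
            and lane["g_follow"] > t_safe * max(lane["v_follow"], 1e-6))
    return direction if safe else "keep"
```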

D. TRAFFIC FLOW ADAPTIVE STRATEGY
To take traffic system operation into consideration, we incorporate the heuristic TFA strategy of [57] to extend the capability of the HIDS-DRL model for traffic system optimization.
Specifically, the principles of the TFA strategy at the individual-vehicle level are as follows: 1) decelerate slowly and suppress lane changing when approaching a jam, to avoid forming a backward shockwave and propagating it laterally; 2) keep an agile driving style when leaving congested traffic or entering a bottleneck, to reduce the possibility of capacity drop and thereby increase throughput. Harnessing the exposed parameters of the HIDS-DRL model, the TFA strategies are applied by scaling those parameters to adjust the driving behavior.
To implement the TFA strategy, the real-time local traffic state must first be estimated, after which driving styles matched to the traffic state can be applied. Following [58], the exponential moving average (EMA) of speed, v_EMA(t), is adopted to smooth short-term fluctuations of the subject vehicle's speed v(t), so that different local traffic states can be identified from the averaged velocity:

v_EMA(t) = (1/ξ) ∫_{−∞}^{t} e^{−(t−t′)/ξ} v(t′) dt′,  (9)

which is updated with relaxation time ξ according to

dv_EMA(t)/dt = (v(t) − v_EMA(t))/ξ.  (10)

The vehicle is approaching a jam when v(t) − v_EMA(t) < −v_th, and leaving a jam when v(t) − v_EMA(t) > v_th, where the threshold v_th = 10 km/h and ξ = 5 s [58].
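Discretizing (10) with the simulation step gives a simple update rule; the sketch below uses the 0.1 s resolution and the thresholds quoted above.

```python
def update_ema(v_ema, v, dt=0.1, xi=5.0):
    """Explicit-Euler discretization of dv_EMA/dt = (v - v_EMA)/xi, eq. (10)."""
    return v_ema + (dt / xi) * (v - v_ema)

def local_traffic_state(v, v_ema, v_th=10.0 / 3.6):
    """Classify the local traffic state from the deviation of v from its EMA."""
    if v - v_ema < -v_th:
        return "approaching_jam"
    if v - v_ema > v_th:
        return "leaving_jam"
    return "default"
```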
For the identification of entering a bottleneck, the position of the bottleneck (x_s, x_e) can be obtained from a digital map; the vehicle is in the bottleneck when its position x(t) fulfills the spatial criterion x_s < x(t) < x_e. The driving strategies are encoded by the critical parameters k_a (an increasing value denotes increasing agility/responsiveness), k_b (an increasing value denotes increasing braking aggressiveness), v* (an increasing value denotes increasing aggressiveness in CF and a more conservative style in LC), and γ (an increasing value denotes a more conservative LC style). For each state, the driving policies are parameterized in compliance with the TFA driving strategies, as presented in Table 1. The multipliers λ_{k_a}, λ_{k_b}, λ_{v*}, and λ_γ of the parameters k_a, k_b, v*, and γ, respectively, apply in each situation (e.g., the new deceleration multiplier k_b′ equals k_b multiplied by λ_{k_b}, namely k_b′ = λ_{k_b} × k_b). For example, when approaching a jam, the strategy reduces the comfortable deceleration to 70% and enlarges the LC threshold factor by 1.5 times, so that the vehicle decelerates gently to a stop and changes lane less often, aiming to dampen the backward-forming shockwave across multiple lanes. In particular, since the focus of this study lies in the feasibility of the HIDS-DRL model to improve traffic flow with traffic-adaptive strategies, the specific parameter values are determined empirically as an example and are not necessarily optimal.
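As an illustration of how the exposed parameters are scaled, consider the sketch below; only the approaching-jam multipliers for k_b (0.7) and γ (1.5) come from the example above, while the remaining entries are placeholders, since Table 1 is not reproduced here.

```python
# Multipliers per traffic state; only the "approaching_jam" values for k_b
# and gamma (0.7 and 1.5) are taken from the text, the rest are placeholders.
TFA_MULTIPLIERS = {
    "approaching_jam": {"k_a": 1.0, "k_b": 0.7, "v_star": 1.0, "gamma": 1.5},
    "leaving_jam":     {"k_a": 1.5, "k_b": 1.0, "v_star": 1.1, "gamma": 1.0},
    "bottleneck":      {"k_a": 1.5, "k_b": 1.0, "v_star": 1.1, "gamma": 1.0},
}

def apply_tfa(params, state):
    """Scale the exposed HIDS-DRL parameters for the detected traffic state."""
    mult = TFA_MULTIPLIERS.get(state, {})
    return {name: value * mult.get(name, 1.0) for name, value in params.items()}

# Example: soften braking and suppress LC when approaching a jam.
# apply_tfa({"k_a": 2.0, "k_b": 3.0, "v_star": 25.0, "gamma": 1.0}, "approaching_jam")
```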

IV. TRAINING AND EVALUATION OF HIDS-DRL MODEL

A. MODEL TRAINING SETUP
Training environment: Following the setup in previous studies [59], we build a three-lane expressway with a length of 1 km in the commonly used microscopic traffic simulator VISSIM. Referring to the design code for urban road engineering, the traffic flow in each episode is randomly set within the range of low to middle levels of service (LOS) to produce realistic CF and LC behaviors, since lower flow may lead to more free-flow driving and heavier flow may result in a traffic jam. The simulation resolution is 0.1 s, and the subject AV enters the road after a warm-up time of 100 s. A new episode begins if a collision occurs or the AV leaves the road.
DRL algorithm: In general, any model-free DRL approach can be used to train the high-level policy. In this paper, we adopt trust region policy optimization (TRPO), which has proved reliable for both discrete and continuous tasks [60], [61]; by bounding the size of the policy update and the changes in state distributions, it guarantees improvement of the policy.
Reward function: In this study, different rewards are assigned to different conditions. Specifically, for safety we penalize a risky time headway (less than the safe time headway T_safe) with a value of −1 and further penalize dangerous spacing (less than the minimum safety distance) with a value of −10. Regarding driving efficiency, we reward the AV's high speed with the normalized speed value and penalize deceleration outside the CF range with the negative normalized gap. As for comfort, a negative normalized reward is given when the AV exceeds the perceptible jerk J_p (1.5 m/s³), and a reward of −1 is given when the AV exceeds the acceptable comfort jerk J_a (5 m/s³). The reward for each condition is generally bounded between −1 and 1, except in the worst case where the spacing is less than the minimum safety distance; the complete reward function is listed in Table 2.
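A minimal Python sketch of this piecewise reward is given below; only the −1/−10 safety penalties and the jerk thresholds are taken from the text, while the normalized shapes are assumptions, and the deceleration-out-of-CF-range penalty is omitted for brevity (the complete definition is in Table 2).

```python
def step_reward(thw, spacing, v, v_max, jerk,
                t_safe=1.5, g_min=2.0, j_p=1.5, j_a=5.0):
    """Piecewise per-step reward following the verbal description above."""
    r = 0.0
    if spacing < g_min:    # dangerous spacing: strongest penalty
        r -= 10.0
    elif thw < t_safe:     # risky time headway
        r -= 1.0
    r += v / v_max         # efficiency: normalized speed reward
    if abs(jerk) > j_a:    # beyond the acceptable comfort jerk
        r -= 1.0
    elif abs(jerk) > j_p:  # perceptible jerk: normalized penalty (assumed shape)
        r -= (abs(jerk) - j_p) / (j_a - j_p)
    return r
```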
Parameters: All parameters used in the model are summarized in Table 3.

B. MODEL TRAINING RESULTS
The total reward per episode [62] and the entropy of the policy [63], both commonly used in previous studies, are adopted to evaluate the convergence and stability of model training. As shown in Fig. 3, the average total reward per episode first increases and then converges after about 1,000 episodes. The policy entropy, a measure of uncertainty in information theory that assesses the dispersion of a probability distribution, gradually decreases and stabilizes after about 1,000 episodes. The training results suggest that a regular strategy with a convergent total reward has been learned.

C. RESULTS OF IDS
With a stably convergent DRL model, we investigate the characteristics of the core element, the IDS. The distribution of the IDS is shown in Fig. 4; it has a mean of about 17 m/s, while its maximum equals the predetermined free-flow speed of 33.3 m/s (120 km/h). The distribution of the IDS demonstrates the diversity and feasibility of the proposed hierarchical policy.
To further analyze the relationship between the IDS and its influencing factors, the Pearson correlation coefficients between the output IDS and all input states are shown in Fig. 5. The speed difference Δv_SL, the subject vehicle's speed v, and the gap g_SL are, as expected, the most correlated factors, followed by the attributes of the following vehicle in the subject lane. The attributes of vehicles in the adjacent lanes have smaller correlation coefficients, with almost symmetrical impacts.

D. BASELINE MODELS
To further quantify the model performance for a single AV, we first adopt a pure rule-based model comprising the widely used CF model IDM and the LC model MOBIL (Minimizing Overall Braking Induced by Lane Changes) [64], referred to as IDM-MOBIL below. The IDM is given by

v̇_t = α [1 − (v_t/v_0)^4 − (g*_t/g_t)^2],  (11)

g*_t = g_0 + v_t T + v_t Δv_t / (2√(αβ)),  (12)

where v_0 is the desired speed, g*_t is the desired spacing gap, g_0 is the minimum gap at standstill, T is the desired time gap, α is the maximum acceleration, and β is the comfortable deceleration. Other variables have the same meanings as those introduced before. The parameters of the IDM are calibrated using the reconstructed NGSIM dataset, with the recommended normalized root mean squared error (NRMSE) adopted as the objective function [65]. The calibrated parameter values are: v_0 = 20.9 m/s, T = 1.37 s, α = 0.97 m/s², β = 1.85 m/s², and g_0 = 2.14 m.
For the MOBIL model, an LC is motivated when the safety criterion (13) and the incentive criterion (14) are satisfied; the remaining LC steps, including lane choice, gap acceptance, and LC execution, are the same as in the model proposed in this paper:

ã_new ≥ −b_max,  (13)

ã − a + p [(ã_new − a_new) + (ã_old − a_old)] > a_th,  (14)

where ã_j and a_j are the accelerations of vehicle j after and before the LC, with j = new denoting the new follower in the target lane and j = old the old follower in the subject lane; ã and a are the accelerations of the subject vehicle after and before the LC. All post-LC accelerations are predicted with the calibrated IDM above. We use typical values for the other parameters to obtain realistic LC behavior, as per [66]: maximum deceleration b_max = 8 m/s², politeness factor p = 0.5, and threshold a_th = 0.2 m/s². Thus, the calibrated IDM-MOBIL represents human drivers' behavior. Additionally, a pure DRL model for multi-lane driving [59] is chosen as a second baseline (referred to as Pure-DRL below). Using the same input states of surrounding vehicles as in this paper, it is trained in the same environment with a similar reward setting to achieve safety, high efficiency, and driving comfort, but its DRL algorithm outputs the final longitudinal acceleration and lateral LC decision directly. Note that other parameters of Pure-DRL, such as the maximum acceleration, are kept consistent with HIDS-DRL to ensure a fair and reliable comparison, while the parameters of IDM-MOBIL are calibrated with human-driving data so that it serves, as a whole, as the baseline of a human-like AV.
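Since (13) and (14) follow the standard MOBIL formulation, they translate directly into code; in the sketch below the `_tilde` arguments are the post-LC accelerations predicted with the calibrated IDM, and the argument names are our own.

```python
def mobil_accepts(a_ego_tilde, a_ego, a_new_tilde, a_new, a_old_tilde, a_old,
                  b_max=8.0, p=0.5, a_th=0.2):
    """Standard MOBIL test: safety criterion (13), then incentive criterion (14)."""
    if a_new_tilde < -b_max:  # (13): new follower must not be forced to brake hard
        return False
    incentive = (a_ego_tilde - a_ego
                 + p * ((a_new_tilde - a_new) + (a_old_tilde - a_old)))
    return incentive > a_th   # (14): own gain plus politeness-weighted impact
```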

E. PERFORMANCE OF HIDS-DRL MODEL
All the models are evaluated on the simulation results in the training environment over 500 episodes for three aspects, i.e., safety, comfort, and efficiency. We first assess the safety risk with time-to-collision (TTC). For clarity, only TTC values between 0 and 50 s are included, as shown in Fig. 6. The proposed HIDS-DRL model and the IDM-MOBIL model show satisfactory safety performance, and the HIDS-DRL model has even larger TTCs, most of which exceed 10 s. Although no collision occurs for any of the three models, the Pure-DRL model shows the worst safety performance, with many small TTCs (0∼10 s). Given that higher TTC values correspond to lower crash risks, the TTC results show that the proposed HIDS-DRL model is safer than Pure-DRL and comparable to IDM-MOBIL.
Comfort is evaluated by jerk in this study, as shown in Fig. 7. The jerks of all three models achieve acceptable ride comfort. Only 0.2% of the jerks in IDM-MOBIL exceed the perceptible value, versus 5.7% for Pure-DRL and 2.8% for HIDS-DRL. The jerk results indicate that, thanks to the combination with rule-based methods, the HIDS-DRL model maintains better comfort than the Pure-DRL model, though just below the pure rule-based IDM-MOBIL model. The higher jerk of the DRL-based models compared with IDM-MOBIL may be due to their neural-network policies, which are highly nonlinear and complicated; consequently, the IDS output and the acceleration calculated from it in HIDS-DRL, or the acceleration directly output by Pure-DRL, are less smooth than those of analytic models, leading to slightly higher jerk.
For efficiency, the distributions of the average velocity per episode are presented in Fig. 8 for the different models. Fig. 8 shows that the Pure-DRL model behaves more aggressively, with a wider range of velocities, while the IDM-MOBIL model yields more low velocities. The HIDS-DRL model has the largest overall average velocity of 17.73 m/s, which is 4.5% higher than the IDM-MOBIL model (16.87 m/s) and slightly higher than Pure-DRL (17.71 m/s). ANOVA reveals that the difference in average velocity between HIDS-DRL and IDM-MOBIL is statistically significant (P = 0.000), while that between HIDS-DRL and Pure-DRL is not (P = 0.895). Therefore, HIDS-DRL is considered as efficient as Pure-DRL and better than IDM-MOBIL. In addition, we calculate the average LC frequency for the three models: 0.79 veh/km/ln for HIDS-DRL, higher than 0.20 veh/km/ln for IDM-MOBIL but far lower than 3.62 veh/km/ln for Pure-DRL. This implies that the LC decisions generated by HIDS-DRL are more effective than those of Pure-DRL, owing to the clear LC motivation activated by the IDS.
Moreover, from the distributions of time headway shown in Fig. 9, the average time headway of HIDS-DRL is 1.5 s, compared with 2 s for IDM-MOBIL, which is expectedly more conservative as the human-like baseline, and 1.2 s for Pure-DRL. This further corroborates the efficiency results of the three models.
In a nutshell, HIDS-DRL learns satisfactory driving strategies within the hierarchical framework integrating rule-based models with DRL. It is safer and more comfortable than the Pure-DRL model while remaining as efficient as Pure-DRL, unlike the conservative IDM-MOBIL.

V. TRAFFIC FLOW OPTIMIZATION USING HIDS-DRL
As some studies have shown that AVs may impose negative impacts on the traffic system [21], [22], [23], how to alleviate these potential system-level impacts through a user-oriented driving model remains unresolved, since centralized control is not applicable at least in the near future, and AVs are equipped with varying algorithms and belong to different companies. We thus extend the HIDS-DRL model with the traffic flow adaptive strategy to further accommodate and optimize the traffic flow. Three typical test scenarios are built in VISSIM, and traffic operation is evaluated under the strategies listed in Table 4. Note that the traffic adaptive strategy for the proposed model is the same as in Table 1, while for Pure-DRL it reduces to adjusting only the output acceleration with a multiplier factor, as no other parameters are exposed; this limitation is one motivation of this work to bring adaptability and flexibility to DRL. For the IDM-MOBIL model, we implement the adaptive strategy as in [58], because of its validated performance and for consistency. The common factors (λ_{k_a}, λ_{k_b}) are identical for all models.

A. TYPICAL TRAFFIC SCENARIOS
We develop three typical traffic scenarios, i.e., a single-lane ring road, a two-lane ring road, and an on-ramp bottleneck, to evaluate the performance of the models in optimizing traffic flow. The single-lane ring road is commonly used to explore the stability of CF models. As early as 2008, Japanese researchers conducted on-site human driving experiments verifying that a single-lane road at a certain density produces phantom traffic jams [67]: on a ring with a circumference of 230 m, 22 vehicles were placed and drivers were instructed to cruise at 30 km/h, and stop-and-go waves emerged due to the reaction time of human drivers and the vehicles' limited acceleration characteristics. We therefore choose the single-lane ring road to assess CF behavior. Furthermore, the single-lane ring road is extended to two lanes, with twice the number of vehicles, to test the combined CF and LC decisions.
Although the ring road is a simple closed system for evaluation, it is not fully realistic. To attain a more accurate assessment of real-world traffic flow, an open road network with a bottleneck is built. Specifically, this paper uses a 1 km two-lane expressway with an on-ramp bottleneck as the open test system. AVs are placed on the main road to test whether their CF and LC behavior can eliminate bottleneck congestion. As per [68] and the definition of LOS, the input flow of the two-lane expressway is set to 3500 pcu/h and that of the single-lane on-ramp to 200 pcu/h. The first-in-first-out (FIFO) principle is applied in the conflict area, meaning that the priority of leaving is determined by the order of arriving, to accelerate the onset of congestion.
All three scenarios are implemented in VISSIM, as shown in Fig. 10. By randomly selecting a certain number of AVs in each episode (the AV proportion in the mixed traffic is fixed at a relatively low penetration rate of 5%), the average system performance is assessed through repeated experiments, where for each experiment the data are collected from stable periods (100∼600 s for the ring roads, since closed systems converge quickly, and 100∼3700 s for the on-ramp bottleneck).
With respect to the operational efficiency of the whole traffic system, we use the average cumulative delay per vehicle as the evaluation indicator. The cumulative delay is defined as the difference between the actual travel time and the ideal travel time [69], where the ideal travel time is the potential minimum travel time given the speed limits of the road links (30 km/h for the single- and double-lane ring roads and 50 km/h for the on-ramp bottleneck in the simulations), regardless of the vehicles' desired speeds.
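For concreteness, a minimal sketch of this indicator follows; the function name is our own.

```python
def cumulative_delay(actual_travel_time, link_length, speed_limit):
    """Cumulative delay = actual travel time minus the ideal travel time,
    where the ideal time assumes driving at the link speed limit [69]."""
    return actual_travel_time - link_length / speed_limit

# Example: a vehicle taking 90 s over a 1 km link limited to 50 km/h
# (ideal time 72 s) accumulates 18 s of delay.
```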

B. EVALUATION RESULTS
The traffic system optimization performance of all the methods in Table 4 is evaluated, and the percentage change relative to the pure human-driven vehicle (HDV) system is calculated for better comparison. For the single-lane ring road in Fig. 11, the proposed HIDS-DRL has the best performance (−1.9%), followed by Pure-DRL (−0.9%) and IDM-MOBIL (−0.3%). With the traffic adaptive strategy, the cumulative delay drops further for all the models, with the largest additional fall (−0.9%) for HIDS-DRL.
For the double-lane ring road, as shown in Fig. 12, what stands out is that Pure-DRL slightly deteriorates the operational efficiency of the traffic system even with the traffic adaptive strategy. HIDS-DRL still performs best (−1.8%), while IDM-MOBIL remains almost the same as the HDV system (−0.4%). The traffic adaptive strategy significantly improves HIDS-DRL (from −1.8% to −4.9%), whereas little enhancement is obtained for IDM-MOBIL. At the on-ramp bottleneck, as shown in Fig. 13, HIDS-DRL shows the largest fall in cumulative delay, while IDM-MOBIL makes nearly no difference and Pure-DRL is even worse than the HDV system. Extended by the TFA strategy, HIDS-DRL obtains an 8.5% decrease in delay with only 5% AVs, followed by IDM-MOBIL with a 3.4% decrease (note that the improvement of IDM-MOBIL in our study is not comparable with that in [58], mainly because the on-ramp simulation settings of the two studies differ considerably in road layout, traffic demand, and traffic composition, e.g., trucks were included in [58]) and Pure-DRL with a 1.6% decrease. It is also worth noting that even without the TFA strategy, the HIDS-DRL model still achieves fairly good performance in optimizing the whole traffic system (equivalent to IDM-MOBIL with the TFA strategy), which demonstrates its inherent capability for both individual behavior and traffic flow optimization.
We also point out that the experiments are based on one typical set of TFA strategies, and the optimization performance may vary with different parameters. Without loss of generality, we tested another parameter set for the TFA strategy at the on-ramp bottleneck, where the HIDS-DRL model still achieves the largest fall of 7.5% in average cumulative delay relative to HDV, compared with decreases of 3% and 2.6% for the Pure-DRL and IDM-MOBIL models, respectively. This further demonstrates the superiority of our proposed model. However, it is worth noting that the Pure-DRL model outperforms the IDM-MOBIL model with the new parameter set, which partially indicates the importance of finding optimal parameters when comparing TFA strategies across different driving models (this is beyond the scope of this work and can be explored in future research).

C. DISCUSSION
In summary, no significant reduction in cumulative delay is found with the IDM-MOBIL model compared with the HDV system. This is expected, because the IDM is calibrated with human-driving data and the MOBIL parameters are chosen for realistic behavior; thus the IDM-MOBIL model behaves like a human driver. The Pure-DRL model improves the single-lane ring road traffic but negatively impacts the double-lane ring road and the on-ramp bottleneck. The heavier congestion in the double-lane ring road than at the on-ramp bottleneck leads to its deteriorated results there even with the traffic adaptive strategy. This multi-lane performance underlines the generalization problem of the Pure-DRL model in LC decisions and again reflects its aggressive CF behavior and occasional improper LCs. With the HIDS-DRL model, LC decision making is built upon the IDS to pursue higher speed, and this transparent and clear LC logic results in a more robust DRL model. Meanwhile, the IDS generated by the high-level DRL algorithm exploits the learning capability of DRL to better understand the observations, rather than relying on manually constructed rules as in the pure rule-based IDM-MOBIL. All the results justify that applying the HIDS-DRL model leads to more efficient traffic flow than the other models.
Furthermore, by incorporating the TFA strategy, traffic flow efficiency is further enhanced for all the models, which shows that the traffic adaptive strategy is applicable to DRL-based models through tuning critical parameters. Moreover, the greatest progress is achieved by HIDS-DRL with the TFA strategy, for two reasons. First, the hierarchical framework combined with rule-based policies exposes more critical parameters for further adjustment; in contrast, only acceleration multipliers can be adjusted for the Pure-DRL model, a condition common to other DRL models. Second, LC behavior is integrated into the framework with the aim of achieving higher speed given the surrounding traffic environment. This explicit LC motivation and its physically meaningful parameters, such as the LC threshold factor γ, make it easy to adapt the LC behavior for traffic flow optimization. The Pure-DRL model, in contrast, only yields the final LC decision, so only full compliance with or rejection of an LC is possible, which is neither flexible nor effective. Given that the original traffic adaptive strategy for the IDM-MOBIL model only acts on the CF module, it is expected that the TFA strategy provides more room for improvement with HIDS-DRL than with IDM-MOBIL. Overall, the proposed novel hierarchical framework is credited for the successful application of the traffic adaptive strategy.

VI. CONCLUSION
In this study, we developed a novel multi-lane autonomous driving decision-making framework combining DRL with rule-based policies in a hierarchy, where the IDS is proposed to activate CF and LC behaviors integrally. Moreover, the TFA driving strategy is incorporated to take traffic flow operation into consideration. We then evaluated the proposed models in three typical scenarios. The main conclusions of this paper are as follows.
1) With the hierarchical decomposition and the IDS defined to integrate CF and LC in the pursuit of higher speed, the HIDS-DRL model can be learned efficiently and interpretably for multi-lane driving, benefiting from both DRL and rule-based methods.
2) Compared with the Pure-DRL model and the rule-based IDM-MOBIL model, the proposed HIDS-DRL model achieves higher speed under the premise of safety and comfort, verifying its ability to handle multi-lane driving with satisfactory efficiency.
3) By applying the traffic adaptive strategy via the exposed critical parameters, the HIDS-DRL model further improves traffic efficiency, with a maximum 8.5% decrease in cumulative delay at only 5% AV penetration, outperforming the other baseline models. The results validate that the proposed framework is effective for both individual behavior and traffic flow optimization.