Hierarchical Reinforcement Learning for Autonomous Decision Making and Motion Planning of Intelligent Vehicles

Autonomous decision making and motion planning in complex dynamic traffic environments, such as left-turn without traffic signals and multi-lane merging from side-ways, are still challenging tasks for intelligent vehicles. It is difficult to generate optimized behavior decisions while considering the motion capabilities and dynamic properties of intelligent vehicles. To address these problems, this article proposes a hierarchical reinforcement learning approach for autonomous decision making and motion planning in complex dynamic traffic scenarios. The proposed approach consists of two layers. At the higher layer, a kernel-based least-squares policy iteration algorithm with an uneven sampling and pooling strategy (USP-KLSPI) is presented for solving the decision-making problems. The motion capabilities of the ego vehicle and the surrounding vehicles are evaluated with a high-fidelity dynamic model in the decision-making layer. By doing so, consistency between the decisions generated at the higher layer and the operations in the lower planning layer can be well guaranteed. The lower layer addresses the motion-planning problem in the lateral direction using a dual heuristic programming (DHP) algorithm learned in a batch-mode manner, while the velocity profile in the longitudinal direction is inherited from the higher layer. Extensive simulations are conducted in complex traffic conditions, including the left-turn without traffic signals and multi-lane merging from side-ways scenarios. The results demonstrate the effectiveness and efficiency of the proposed approach in realizing optimized decision making and motion planning in complex environments.


I. INTRODUCTION
Autonomous driving technology (ADT) has received extensive attention in the past decade and has made considerable progress. ADT is an important part of the future intelligent transportation system, which promises to improve transportation safety, efficiency, and energy conservation. However, in complex traffic conditions such as left-turn without traffic signals and multi-lane merging from side-ways, there are still many challenges in achieving fully autonomous driving [1]. Among them, behavioral decision making and motion planning for complex dynamic traffic scenarios are two major challenges.

(The associate editor coordinating the review of this manuscript and approving it for publication was Michail Makridis.)
In a typical decision-making and motion-planning system, the decision-making module is in charge of generating high-level orders, such as slow down and speed up, while the motion planner computes the detailed trajectory profile that can be followed by the vehicle's low-level controllers. There have been fruitful works contributed to improving decision-making behaviors in complex environments and motion-planning performance in terms of mobility and smoothness; see for instance [2], [3]. These works addressed the decision-making and motion-planning problems separately. In fact, the performance of decision making and motion planning are mutually coupled. As a specific example, a decision-making module that does not consider the motion capability of the vehicle might produce a policy that violates the limits of the vehicle dynamics. In that case, the trajectory requested of the motion planner might not be reachable by the low-level controllers, which leads to inconsistency issues.
In this article, we develop a hierarchical structure based on reinforcement learning to address this issue. In contrast to rule-based approaches that might generate conflicting decisions in complex environments, we propose a reinforcement learning-based algorithm for making decisions in continuous state spaces. Also, a high-fidelity dynamical model is utilized for collecting state samples of the ego and surrounding vehicles, which are then used for generating consistent decision-making policies via off-line training. Note that, in [4], a time-efficient motion-planning algorithm was developed for maneuvering in urban scenarios. The difference of our paper lies in the following two aspects: i) the decision-making and motion-planning policies are learned and improved in a fully data-driven way; ii) the online computational cost is not a major concern since the policies can be learned off-line in a batch-mode manner.
Hence, the main contributions can be summarized as follows:
• We propose a hierarchical reinforcement learning algorithm for decision making and motion planning of intelligent vehicles. At the higher layer, a sample-efficient kernel-based least-squares policy iteration algorithm with uneven sampling and pooling strategy (USP-KLSPI) is presented for decision making, while the motion planner in the lower layer utilizes a batch-mode dual heuristic programming (DHP) to generate the trajectory in the lateral direction.
• The motion capabilities of the ego vehicle and the surrounding vehicles are considered at the higher layer in the training process using a high-fidelity vehicle dynamic model. In this way, the consistency between the decisions generated at the higher layer and the operations learned in the lower planning layer can be well guaranteed.
• Extensive simulations on decision making and motion planning are conducted in complex traffic conditions, including the left-turn without traffic signals and multi-lane merging from side-ways scenarios. The results demonstrate the effectiveness and efficiency of the proposed approach in realizing optimized decision making and motion planning.

The remainder of this article is arranged as follows. Related works and research background are given in Section II. The framework of HRL, decision making in the two complex scenarios, and motion planning for obstacle avoidance are presented in Section III, while Section IV provides the simulations, analyses, and discussion. Finally, conclusions and future work are drawn in Section V.

II. RELATED WORKS AND RESEARCH BACKGROUND

A. RELATED WORKS
In the aspect of decision making, rule-based methods were generally utilized in earlier research on intelligent vehicles. Among them, the most typical rule-based decision-making method is the finite state machine (FSM) [5], [6]. Rule-based methods were widely applied to the decision-making problems of intelligent vehicles because of their clear logic and strong interpretability. However, problems such as decision conflicts or discontinuous decisions may occur due to overlapping judgment conditions or discontinuous state divisions. Besides, rule-based decision-making methods rarely have the ability to deal with unseen scenarios.
In recent years, applications of machine learning methods to behavioral decision making have received much attention. Ngai et al. proposed a tabular reinforcement learning (RL) framework to learn lane-changing and overtaking behaviors [7]. However, tabular Q-learning converges slowly, and it incurs heavy computational and storage burdens for behavioral decisions in continuous spaces. In [8], a kernel-based approximate policy iteration (API) algorithm was utilized to deal with overtaking problems on the highway; compared with traditional tabular RL methods, it improves the learning efficiency. In [9], the extreme learning machine (ELM) was introduced into batch-mode RL for fast and efficient feature reconstruction, and Liu et al. verified its effectiveness on the lane-changing decision-making problem. The kernel-based API in [8] and ELM-API in [9] dealt with simple overtaking and lane-changing tasks without considering trajectory planning in the presence of obstacles. In [10]-[14], deep learning approaches were used to train on collected raw sensor samples in specific scenarios, mapping images to actions. The advantages of deep neural networks include automatic feature extraction and feature representation. However, complex dynamic scenarios such as left-turn without traffic signals and multi-lane merging from side-ways were not considered in these works. In addition, it is still difficult to generate optimized behavior decisions while considering the motion capabilities and dynamic properties of intelligent vehicles.
As a connection between behavioral decision making and lower-level control, the motion-planning module of intelligent vehicles needs to plan kinematically or dynamically feasible trajectories for tracking control. Previous motion-planning methods can be classified into four categories [15]: graph-search based [16]-[21], sampling based [22]-[24], interpolation curve [25]-[28], and numerical optimization [29]-[31]. In [29], Dolgov et al. proposed using A* search to obtain a kinematically feasible trajectory and then applying a numerical optimization method to smooth the previously calculated trajectory. However, A* search is computationally expensive and is mainly suitable for known, unstructured, low-speed scenarios. As a kind of sequential optimization decision approach, RL has potential in solving the motion-planning problems of nonlinear systems. In [32], Lian et al. proposed the use of heuristic dynamic programming (HDP) to avoid obstacles; experimental tests on wheeled mobile robots verified the effectiveness of the method. In [33], Al Dabooni et al. proposed the Dyna-HDP algorithm for vehicle planning tasks, which can find approximately optimal trajectories. The HDP in [32] and Dyna-HDP in [33] were utilized to complete motion-planning tasks, and these works are mainly concerned with optimization in the planning layer. Nevertheless, since there is a close coupling between behavioral decision making and motion planning, conflicts may arise between the two layers in complex dynamic scenarios. In a previous work [34], Qian et al. proposed a decision-making algorithm that uses a planning feature in the training process to improve performance. However, the planning feature used cannot completely represent the motion capability of the vehicles.

B. RESEARCH BACKGROUND
Markov decision processes (MDPs) can provide efficient mathematical frameworks for RL. As the system model is unknown or partially known in complex dynamic scenarios, it is difficult to calculate an accurate solution to solve the optimal decision-making problems. In RL, approximate policy iteration and actor-critic algorithms have been studied to solve MDPs with unknown model and large state spaces.

1) THE MDP MODEL FOR RL
An MDP model can be represented as a four-tuple [S, A, P_sa, R], where S and A are the sets of states and actions, respectively, and P_sa is the probability of transitioning from state s_t to s_{t+1} after performing action a_t. When the state transitions from s_t to the next state s_{t+1} after taking an action a_t ∈ A, the corresponding reward R(s_t, a_t) is obtained, i.e., R : S × A → ℝ.
In RL, the state-action value function for a given MDP model is defined as

Q^π(s_t, a_t) = E_π [ Σ_{i=0}^{∞} γ^i r(s_{t+i}, a_{t+i}) ],   (1)

where E_π(·) denotes the expectation under the stationary policy π, which maps the state space to the action space, that is, π : S → A; r(s_t, a_t) ∈ ℝ is the reward and γ ∈ (0, 1] is the discount factor. The equation above can also be described in the recursive Bellman form

Q^π(s_t, a_t) = r(s_t, a_t) + γ E_π [ Q^π(s_{t+1}, a_{t+1}) ].   (2)
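As a concrete illustration of the definition above, the discounted return can be evaluated for a finite reward sequence (a minimal sketch; the function name and the truncation to a finite horizon are our own, since the true return is an infinite expectation):

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Discounted sum of a finite reward sequence: sum_i gamma^i * r_i."""
    weights = np.power(gamma, np.arange(len(rewards)))
    return float(np.dot(weights, rewards))
```

For instance, three unit rewards with gamma = 0.5 accumulate to 1 + 0.5 + 0.25.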

2) APPROXIMATE POLICY ITERATION
In [35], Lagoudakis et al. combined the advantages of least-squares temporal-difference (LSTD) learning and approximate policy iteration (API), yielding the least-squares policy iteration (LSPI) algorithm. LSPI uses a linear weighted combination of k basis functions to approximate the state-action value function Q^π [35]:

Q̂^π(s_t, a_t) = Σ_{i=1}^{k} φ_i(s_t, a_t) w_i = φ(s_t, a_t) w,   (3)

where w ∈ ℝ^{k×1} is the weight vector.
Substituting the Bellman equation (2) and letting a_{t+1} = π(s_{t+1}), with φ(s_{t+1}, a_{t+1}) ∈ ℝ^{1×k}, equation (2) can be rewritten in matrix form as

Φ w^π ≈ R + γ Φ' w^π,   (4)

where Φ and Φ' stack the feature vectors φ(s_t, a_t) and φ(s_{t+1}, a_{t+1}) of all samples, and R stacks the rewards. According to [36], the weighted least-squares solution is

w^π = A^{-1} b,  with  A = Φ^T (Φ − γ Φ'),  b = Φ^T R.   (5)

According to [35], A and b can be approximated using collected samples. For convenience, the state-action pair (s_i, a_i) is denoted as s̃_i, and (s_{i+1}, a_{i+1}) as s̃_{i+1}. Hence, the approximated values of A and b are written as

Ã = Σ_{k=1}^{T} φ(s̃_k) (φ(s̃_k) − γ φ(s̃_{k+1}))^T,   (6)

b̃ = Σ_{k=1}^{T} φ(s̃_k) r_k,   (7)

where s̃_k represents the training samples and T is the number of samples. LSPI then utilizes the greedy strategy to choose the action that maximizes the state-action value function Q^π.
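The LSTD-Q solve described above fits in a few lines; the following is a generic sketch (the feature matrices and names are our own, not the authors' code), assuming features have already been evaluated for each sample and its successor:

```python
import numpy as np

def lstdq_weights(phi, phi_next, rewards, gamma=0.95):
    """Solve A w = b with A ~ Phi^T (Phi - gamma Phi') and b ~ Phi^T R.

    phi      : (T, k) features of the sampled (s_i, a_i)
    phi_next : (T, k) features of (s_{i+1}, pi(s_{i+1}))
    rewards  : (T,)  observed rewards
    """
    A = phi.T @ (phi - gamma * phi_next)  # sample-based estimate of A
    b = phi.T @ rewards                   # sample-based estimate of b
    return np.linalg.solve(A, b)          # weight vector w of length k
```

With orthonormal features and zero successor features, the weights reduce to the rewards themselves, which makes the routine easy to sanity-check.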
To improve the learning efficiency and convergence speed, Xu et al. proposed a kernel-based least-squares policy iteration algorithm (KLSPI), which uses the approximate linear dependence (ALD) technique [37]. The decision-making policy can be trained by KLSPI with pre-collected samples. Owing to its non-linear feature representation, KLSPI has an advantage in dealing with decision-making problems with large state and action spaces; even when the number of samples is large, it can still learn an approximately optimal policy. Compared with LSPI, the KLSPI algorithm has better convergence properties and can obtain an approximately optimal solution.
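A minimal sketch of the ALD sparsification used by KLSPI might look as follows, assuming a Gaussian (RBF) kernel and a novelty threshold mu (the function names and parameter values are illustrative, not from the paper):

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    """Gaussian kernel between two state vectors."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def ald_dictionary(samples, mu=0.1, sigma=1.0):
    """Greedy ALD: keep a state as a kernel center only if it is not
    approximately linearly dependent on the current centers."""
    centers = []
    for s in samples:
        if not centers:
            centers.append(s)
            continue
        K = np.array([[rbf(c1, c2, sigma) for c2 in centers] for c1 in centers])
        k_vec = np.array([rbf(c, s, sigma) for c in centers])
        # delta = k(s,s) - k^T K^{-1} k: residual of approximating phi(s)
        # with the features of the existing centers
        delta = rbf(s, s, sigma) - k_vec @ np.linalg.solve(K, k_vec)
        if delta > mu:
            centers.append(s)
    return centers
```

Duplicated or near-duplicated states yield delta close to zero and are discarded, which keeps the dictionary (and hence c) small.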

3) HRL FOR OPTIMIZING DECOMPOSED SUBTASKS
When intelligent vehicles drive in complex dynamic environments, decision making and motion planning have to be optimized at the same time. One advantage of hierarchical reinforcement learning is solving complex tasks through task decomposition. In [38], a hierarchical-DQN framework was proposed, in which the higher layer learns a policy over intrinsic goals and the lower layer learns a policy over atomic actions that achieves the chosen goal. In [39], Ding et al. proposed a hierarchical framework in which the upper layer generates control actions via a hidden Markov model and the lower layer realizes planning for agents' navigation by deep reinforcement learning. To the best of our knowledge, there is little work on hierarchical reinforcement learning for autonomous decision making and motion planning of intelligent vehicles in complex traffic conditions such as left-turn without traffic signals and multi-lane merging from side-ways.

III. THE PROPOSED HRL APPROACH FOR OPTIMIZED DECISION MAKING AND MOTION PLANNING
From the task decomposition perspective, we propose a hierarchical reinforcement learning approach for solving decision-making and motion-planning problems, as depicted in Fig.1.
At the higher layer, we utilize the USP-KLSPI to make decisions. The process for obtaining decision-making policy includes four parts: MDP modeling of the decision-making tasks, uneven sampling, sample pooling strategy, and the KLSPI algorithm.
At the lower layer, real-time obstacle avoidance is solved by a planning policy trained via DHP in a batch-mode manner. Finally, the expected longitudinal speed v_expect and the steering angle δ are transferred to the low-level controller.
In both layers, the data samples used for training are collected using a high-fidelity 14-degree-of-freedom (14-DOF) dynamics model, following the previous works in [8], [40]. The vehicle model consists of three modules: the steering system, the body suspension model, and the motion dynamics. The fidelity of the model with respect to real vehicle dynamics has also been verified.
The main steps of HRLDP are shown in Algorithm 1. The algorithm includes the higher decision-making layer and the lower motion-planning layer.

Algorithm 1 HRLDP (Online Execution)
1: \\ The higher layer.
Require: The trained decision-making policy weight w^π.
Ensure: The decision containing the longitudinal speed information.
2: Obtain the states of the ego vehicle and the rule-based surrounding vehicles, i.e., s_t;
3: Compute Q̂^π at the state s_t by K(s_t) w^π;
4: Generate the final decision by arg max_{a_t} Q̂^π(s_t, a_t);
5: \\ The lower layer.
Require: The decision containing the longitudinal speed information, the trained motion-planning policy W_a.
Ensure: Planning results to low-level control.
6: Detect obstacles and obtain the current vehicle state s;
7: Generate the front steering angle for lateral trajectory planning, i.e., δ = W_a K(s);
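One online step of the two-layer policy can be illustrated with a small sketch, assuming RBF kernel features and that the higher layer's state-action feature is formed by appending the discrete action index to the state (an assumption of ours; the paper does not specify this detail, and all names here are hypothetical):

```python
import numpy as np

def rbf_features(s, centers, sigma=20.0):
    """Kernel feature vector K(s) against the learned dictionary of centers."""
    return np.exp(-np.sum((centers - s) ** 2, axis=1) / (2.0 * sigma ** 2))

def hrldp_step(s_high, s_low, w_pi, W_a, centers_high, centers_low, n_actions=3):
    """Higher layer: greedy discrete decision; lower layer: steering angle."""
    # Evaluate Q(s, a) for each discrete action and take the argmax (eq.-style
    # greedy selection over the kernel value-function approximation).
    q = [rbf_features(np.append(s_high, a), centers_high) @ w_pi
         for a in range(n_actions)]
    decision = int(np.argmax(q))
    # Lateral steering from the trained planner: delta = W_a^T K(s).
    delta = float(W_a @ rbf_features(s_low, centers_low))
    return decision, delta
```

The decision layer only runs an argmax over a handful of kernel evaluations, which is consistent with the low online cost discussed in Remarks 3 and 5.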

A. LEARNING-BASED DECISION MAKING USING USP-KLSPI IN COMPLEX TRAFFIC ENVIRONMENTS

1) USP-KLSPI
To improve the sampling efficiency and reduce the training time, this article proposes the USP-KLSPI algorithm for decision making. Uneven sampling means collecting more samples from the intervals of interest while sampling only coarsely in uninteresting areas. This sampling method is similar to the human attention mechanism: for instance, when pedestrians cross the road, they tend to pay attention to the vehicles that are very close. Assume that the distance ranges between the ego vehicle and the social vehicles are

D_f = [0, d_{f max}],  D_r = [0, d_{r max}],   (8)

where d_{f max} and d_{r max} are related to the maximum perception distance of the ego vehicle. The sampling ranges are composed by

D_{fi} = [0, ω_{fi} d_{f max}],  D_{ri} = [0, ω_{ri} d_{r max}],   (9)

where ω_{fi}, ω_{ri} are importance factors that can be adjusted by hand according to heuristic information. We initialize the distances of the social vehicles within D_{fi} and D_{ri} while collecting the i-th sub-sample set. Each data sample, of the form [s_t, a_t, s_{t+1}, r_t], is collected using an exploration policy. The above process is repeated until the desired number of samples has been collected for all N sub-sample sets.
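The uneven drawing of initial distances can be sketched as follows; the default d_max and importance factors are illustrative stand-ins, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def uneven_sampling(d_max=150.0, weights=(1.0, 0.55), n_per_set=5):
    """Draw initial front-vehicle distances for N sub-sample sets.

    Each importance factor w_i shrinks the sampling interval to
    [0, w_i * d_max], so more samples land in the near, high-interest range."""
    return [rng.uniform(0.0, w * d_max, size=n_per_set) for w in weights]
```

Choosing one factor equal to 1 keeps a full-range sub-sample set, while smaller factors concentrate additional samples close to the ego vehicle, mirroring the two sub-sample sets used later in the experiments.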
To obtain representative information, we use a pooling strategy to process the elements in each sample. The intuitive pooling process is shown in Fig. 2. As seen from it, d(n) and v(n) represent the sums of n intervals:

d(n) = Σ_{i=1}^{n} d_i,  v(n) = Σ_{i=1}^{n} v_i,   (10)

where d_i and v_i represent the i-th interval sizes. For the states, we assume the small intervals in d_q and v_z have constant sizes h_d and h_v. The elements d and v in a sample can then be processed in a max-pooling manner:

pool(d) = (⌊d / h_d⌋ + 1) h_d,  pool(v) = (⌊v / h_v⌋ + 1) h_v,   (11)

i.e., each element is replaced by the upper bound of the interval that contains it. With the processed samples, we train the KLSPI algorithm to obtain a decision-making policy. According to [37], KLSPI uses kernel feature representation to approximate the state-action value function Q^π. For a state s_i, the vector of basis functions can be expressed as

K(s_i) = [k(s_i, s_1), ..., k(s_i, s_c)]^T,   (12)

where k(·, ·) is the kernel function and K(s_i) ∈ ℝ^{c×1}. Note that the ALD method produces a kernel dictionary with multiple center points C = {s_1, ..., s_c} from the sample set, where c is the number of center points. According to equation (5), the following matrix update rule can be obtained:

Ã = Σ_{k=1}^{T} K(s̃_k) (K(s̃_k) − γ K(s̃_{k+1}))^T,  b̃ = Σ_{k=1}^{T} K(s̃_k) r_k,   (13)

where s̃_k represents the training samples and T is the number of samples.
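The interval pooling of the distance and speed elements described above can be sketched as follows (a hedged reconstruction: we read "max-pooling" as replacing each value by the upper bound of its interval):

```python
import math

def pool(value, h):
    """Pool a scalar over fixed-size intervals: every value inside
    [q*h, (q+1)*h) maps to the interval's upper bound (q+1)*h."""
    return (math.floor(value / h) + 1) * h

def pool_sample(d, v, h_d=10.0, h_v=5.0):
    """Process the distance and speed elements of one sample."""
    return pool(d, h_d), pool(v, h_v)
```

Larger h_d and h_v collapse more raw states onto the same pooled state, which is what shortens the training time in the comparison of Table 4.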
With the least-squares solution w^π obtained from (5), the approximate state-action value function at state s_t is

Q̂^π(s_t, a_t) = K(s_t)^T w^π.   (14)

Hence, the policy that maximizes the state-action value function is taken as the (approximately) optimal solution:

π(s_t) = arg max_{a_t} Q̂^π(s_t, a_t).   (15)

Remark 1: It is highlighted that the time cost of the sampling process of Algorithm 2 is T · t, where t is the preset time for completing one sampling process and T = N · N_num.
Remark 2: The flops required for the computations in equations (13) and (5) during the off-line training process of Algorithm 2 are about 2Tc² + (T + 1)c, which can be reduced approximately to Tc². This means the algorithm complexity grows linearly with the number of samples T, and the computation time is also related to the number of kernel centers c.

Algorithm 2 USP-KLSPI
1: \\ Sampling process
2: for i ← 1 to N do
3: Initialize the distances of the social vehicles within D_{fi} and D_{ri};
4: Randomize each state s_t and collect N_num samples [s_t, a_t, s_{t+1}, r_t] by using an exploration policy;
5: end for
1: \\ Training process
Require: The training samples, initialized policy weight w^π_1.
Ensure: Policy weight w^π for decision making.
2: \\ N_iter is the preset number of iterative learning steps;
3: \\ ε is a preset small threshold;
4: Use the sample pooling strategy to process the samples by equations (10) and (11);
5: for m ← 2 to N_iter do
6: Update the weight w^π_m by KLSPI;
7: if ‖w^π_m − w^π_{m−1}‖₂ ≥ ε then
8: Continue;
9: else
10: \\ Converged;
11: Output the converged policy weight w^π;
12: end if
13: end for
Remark 3: When the trained policy is deployed for real-time decision making, the required flops for computing the optimal action (i.e., equations (14) and (15)) are n_a(2c − 1), where n_a = 3 is the number of optional actions. The flops count can be reduced approximately to c. The number of kernel centers c is on the order of 10² in scenario-A and about 10³ in scenario-B. Hence the computation time is low, and the approach is feasible for real-time decision-making problems.
The USP-KLSPI algorithm applied for obtaining the decision-making policy of intelligent vehicles is presented in Algorithm 2.

2) DECISION MAKING IN LEFT-TURN WITHOUT TRAFFIC SIGNALS
With the development of vehicle communication technology, the information on surrounding vehicles can be easily obtained. This article assumes that, within the autonomous vehicle's perception range, we can obtain the speed, position, and current lane information of the social vehicles.
The scenario of left-turn without traffic signals is shown in Fig. 3. The autonomous vehicle needs to cross while considering the social vehicles coming from the front and rear directions. The turn at the intersection follows an arc with a radius of about 13 m (R ≈ 13 m). Taking into account driving safety and the constraints of vehicle dynamics, a reasonable longitudinal speed of around 10 m/s is adopted.
Before sampling, an MDP model needs to be established. As shown in Fig. 3, in the process of driving through each lane, the state in the MDP is defined as

s_t = [i, V_fi, d_fi, V_ri, d_ri, V_Ego],   (16)

where i is the lane index, V_ri/V_fi represent the speeds of the nearest rear/front social vehicles, and d_ri and d_fi are their distances from the autonomous vehicle, respectively. The optional actions of the autonomous vehicle in this scenario can be expressed as

a_t ∈ {slow down, keep speed, accelerate},   (17)

i.e., slowing down, keeping the current speed, and accelerating, respectively. Denote V_Ego as the current longitudinal speed of the ego vehicle, which is set within the range [0, 13 m/s], and L_Ego as the current lane index of the ego vehicle. A collision is defined as (L_Ego == i && d_fi ≤ 5 m) || (L_Ego == i && d_ri ≤ 15 m). A successful decision-making process means that no collision occurs before X_Ego < −25 m && L_Ego == 2.
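The collision condition above reduces to a simple predicate, sketched here with the thresholds stated in the text (function and argument names are our own):

```python
def is_collision(l_ego, lane_i, d_f, d_r):
    """Left-turn scenario collision check: the ego vehicle is in lane i and
    either the front gap is at most 5 m or the rear gap is at most 15 m."""
    return l_ego == lane_i and (d_f <= 5.0 or d_r <= 15.0)
```

Note that both disjuncts share the lane-membership test, so the condition simplifies to L_Ego == i && (d_fi ≤ 5 m || d_ri ≤ 15 m).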
To bridge the gap between simulated and real environments, the social vehicles are also modeled with the 14-DOF vehicle dynamics [40]. Each has a different desired velocity and a corresponding safe tracking distance d_follow. The longitudinal planning policy of the τ-th social vehicle can be expressed as

v_τ = { v_{τ_expect},  if d_{τj} ≥ d_follow;  v_{τ_brake},  if d_{τj} < d_follow },   (18)

where v_{τ_expect} is the expected velocity and v_{τ_brake} is the speed to which the social vehicle should slow down; d_{τj} is the distance between the two social vehicles. Assuming the τ-th car is following the j-th car, d_follow can be expressed as

d_follow = v_τ T + v_τ² / (2 a_τ) − v_j² / (2 a_j),   (19)

where T is the reaction time of the drivers, v_τ and v_j are the current speeds, and a_τ and a_j are the corresponding braking decelerations. Hence, d_follow serves as guidance for planning the longitudinal speeds of the social vehicles. The detailed parameter settings are given in Table 1. In the case of multiple forward/backward vehicles in the i-th lane, only the nearest forward/backward vehicle is considered in the state s_t. When there is no forward/backward vehicle, d_fi, V_fi and d_ri, V_ri are set to default values determined by equation (20).
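The rule-based longitudinal policy and the safe tracking distance can be sketched as follows (a hedged reading of the piecewise policy; the reaction time and all parameter values are illustrative):

```python
def safe_follow_distance(v_tau, v_j, a_tau, a_j, T=1.5):
    """Safe tracking distance: reaction-time gap plus the difference of the
    two vehicles' braking distances (hedged reconstruction of d_follow)."""
    return v_tau * T + v_tau ** 2 / (2.0 * a_tau) - v_j ** 2 / (2.0 * a_j)

def longitudinal_speed(d_tau_j, d_follow, v_expect, v_brake):
    """Rule-based longitudinal policy of a social vehicle: keep the expected
    speed when the gap is safe, otherwise brake to v_brake."""
    return v_expect if d_tau_j >= d_follow else v_brake
```

When leader and follower have equal speeds and decelerations, the braking terms cancel and d_follow reduces to the pure reaction-time gap.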
Both the surrounding vehicles and the ego vehicle are simulated with the 14-DOF vehicle dynamics [40]. After the ego vehicle takes a randomized action a_t, one sampling process terminates when the ego vehicle crosses lane 1 or waits at the intersection for 1 s, thereby obtaining the next state s_{t+1}. The simulation runs for this preset time mainly because we have computed the approximate time needed to complete a sampling process. The reward function is designed as in equation (21), where ξ is an adjustable coefficient, i.e., the penalty on the magnitude of the velocity, t is the completion time of each task, and Y_Ego is the ordinate value. The process of obtaining a sample [s_t, a_t, s_{t+1}, r_t] is as follows: in the initial state s_t, a randomized action a_t is taken, a reward r_t is received, and the resulting state is s_{t+1}. We then train the KLSPI algorithm on these samples to obtain a decision-making policy.
One thing to emphasize is that the autonomous vehicle needs to make continuous decisions in the areas A1 and A2 shown in Fig. 3.

3) DECISION MAKING IN MULTI-LANE MERGING FROM SIDE-WAYS
As shown in Fig. 4, the scenario is composed of three lanes. The autonomous vehicle has to consider the social vehicles' motion in all three lanes, especially those in lane 2 and lane 1, and has to choose a proper occasion to merge into lane 1. Meanwhile, the longitudinal speed is also given by the decision-making layer. The detailed parameter settings are given in Table 2. In the case of multiple forward/backward vehicles in the i-th lane, only the nearest forward/backward vehicle is considered in the state s_t. When there is no forward/backward vehicle, d_fi, V_fi and d_ri, V_ri are determined by equation (20). Before sampling, an MDP model needs to be established. The 14-dimensional state s_t, consisting of the front/rear speeds and distances in each of the three lanes together with the ego speed and lane index, is defined as

s_t = [V_f1, d_f1, V_r1, d_r1, V_f2, d_f2, V_r2, d_r2, V_f3, d_f3, V_r3, d_r3, V_Ego, L_Ego].   (22)

The action a_t is defined as

a_t ∈ {accelerate on the ramp, slow down on the ramp, change to lane 1}.   (23)

As (23) shows, the autonomous vehicle has a total of three optional actions in this scenario: accelerating on the ramp, slowing down on the ramp, and changing to lane 1. A collision is defined as (L_Ego == i && d_fi < 5 m) || (L_Ego == i && d_ri < 20 m). In Fig. 5, we use the notations A_actions, B_actions, C_actions to represent the optional actions at the states A, B, C, respectively. Under the premise of ensuring driving safety, the driving policies should conform to the driving habits of human drivers as much as possible; the rule-based policies of the social vehicles are presented in Table 3. Through the above settings, the simulation environment becomes increasingly high-fidelity. After the ego vehicle takes a randomized action a_t, one sampling process terminates when the ego vehicle merges into lane 1 or drives on the ramp for 5 s, thereby obtaining the next state s_{t+1}. The reward function, a quantitative evaluation of the behavioral decision, is designed as in equation (24), where X_Ego represents the current abscissa of the autonomous vehicle. Due to the limited length of the ramp, a greater penalty is given when the autonomous vehicle has a larger abscissa value. V_d represents the desired maximum speed of the ego vehicle, and the corresponding term is used to force the autonomous vehicle to reach the desired speed faster. Samples of the form [s_t, a_t, s_{t+1}, r_t] are collected, and the KLSPI algorithm is trained for a decision-making policy.

B. THE BATCH-MODE DHP ALGORITHM FOR MOTION PLANNING
At the higher layer, we model the complex dynamic scenarios and propose the USP-KLSPI algorithm to learn the decision-making policy. In solving the decision-making problems, the motion data samples of the ego vehicle and the rule-based surrounding vehicles are collected with a high-fidelity dynamic model for the learning process. Hence, inconsistency between the decisions generated at the higher layer and the operations in the lower planning layer can be avoided. Since the higher layer already determines the longitudinal speed policy, the planning work in the lower layer is greatly reduced: we only need to plan the trajectory in the lateral direction. Generally speaking, the high-fidelity 14-DOF vehicle model could be employed directly in the learning process; however, this would bring a heavy computational load. We therefore reduce the 14-DOF model to its lateral dynamics and utilize the DHP algorithm, learned in a batch-mode manner, to realize motion planning. The simplified lateral dynamic model has the four-dimensional state s = [e_Lat, ė_Lat, φ, φ̇], where e_Lat is the lateral error between the reference point and the c.g. of the vehicle, φ is the heading error between the vehicle's yaw angle and the expected yaw angle, and ė_Lat, φ̇ are their rates of change.

The sampling process for motion planning is shown in Algorithm 3, where N_Sam represents the desired number of episodes and the reference trajectory is y = ψ(x). The first step is to initialize the state within each state-dimension range, s_max = [1, 1, π/2, π/2]; then a steering angle within the preset range (u_max = π/6) is applied to the 14-DOF dynamics to obtain the next state.

Algorithm 3 Sampling for Motion Planning
1: for episode ← 1 to N_Sam do
2: Initialize the state s_1 within the range [−s_max, s_max];
3: for n ← 1 to 1000 do
4: Randomize u_n in the range [−u_max, u_max];
5: Obtain the next state s_{n+1} with the 14-DOF vehicle dynamics;
6: if s_{n+1} ∈ [−s_max, s_max] then
7: Save s_{n+1};
8: end if
9: end for
10: end for

In this way, we obtain the desired samples simulating the dynamics of the real vehicle.
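The Algorithm-3-style sampling loop can be sketched as follows, with a stand-in step_fn in place of the 14-DOF dynamics (an assumption made for self-containedness; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_lateral_states(step_fn, n_episodes=3, n_steps=1000,
                          s_max=(1.0, 1.0, np.pi / 2, np.pi / 2),
                          u_max=np.pi / 6):
    """Start each episode from a random state inside the box |s_i| <= s_max[i],
    apply a random steering angle in [-u_max, u_max], and keep the successor
    states that remain inside the box."""
    s_max = np.asarray(s_max)
    kept = []
    for _ in range(n_episodes):
        s = rng.uniform(-s_max, s_max)
        for _ in range(n_steps):
            u = rng.uniform(-u_max, u_max)
            s_next = step_fn(s, u)            # stand-in for 14-DOF dynamics
            if np.all(np.abs(s_next) <= s_max):
                kept.append((s.copy(), u, s_next.copy()))
            s = s_next
    return kept
```

With a contracting toy dynamics every successor stays inside the box, so all transitions are kept, which makes the bookkeeping easy to verify.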
The learning-based motion planning in this article utilizes a kernel-based DHP [41] to train for a planner. The next state is obtained by applying the control action to the 14-DOF vehicle dynamics in the simulated environment.
The reward function is defined as

r(s, u) = s^T Q s + R u² + p · e^{−e_Lat},   (25)

where the adjustable parameter p is the penalty factor and is set as 10; when there are no obstacles, p is set as 0. Here, Q = diag{1, 1, 5, 1} and R = 2. The term p · e^{−e_Lat} makes the cost smaller as the distance to the obstacle increases. In practical applications, however, it is not suitable for the ego car to stay too far away from the obstacle (a large lateral error e_Lat does not meet the requirements of the planning task), so we seek a balance. One solution is to set a proper penalty matrix Q: the term s^T Q s prevents the lateral error from growing too large. Another idea is to set the penalty factor p = 0 and adopt a reference-trajectory-changing strategy to realize obstacle avoidance on structured roads. The batch-mode DHP planner could be trained with data collected from real cars or from high-fidelity software such as CarSim; however, collecting training data in these ways requires substantial resources. Therefore, this article uses the 14-DOF vehicle dynamics for sampling in the simulation environment. After collecting about 30000 samples generated by the 14-DOF dynamic model, we run Algorithm 4 to train a planning policy. When the policy is deployed to real-time planning tasks, the lateral steering angle can be expressed as

u = W_a^T K(s),   (26)

where u is the front steering angle, i.e., δ, W_a is the weight of the actor module, and K(s) is computed in the format of equation (12). Based on the 14-DOF vehicle dynamics and the trained motion-planning policy, we can plan a trajectory of the vehicle motion.

Remark 4: In this work, we construct the features in the format of equation (12). Denote the numbers of samples and kernel centers as T_m and c_m, respectively. Through approximation, the computational complexity of the training process is T_m c_m².

Remark 5: When the off-line trained policy is applied to real-time motion planning, the lateral planning result can be computed by equation (26). As W_a ∈ ℝ^{c_m×1} and K(s) ∈ ℝ^{c_m×1}, the computational complexity is about 2c_m − 1, which can be reduced approximately to c_m, where c_m is on the order of 10. Hence it has a low computational complexity and is feasible for real-time planning problems.
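The one-step planning cost described above (quadratic state and control terms plus the obstacle penalty p · e^{−e_Lat}) can be sketched directly from the stated Q, R, and p (a hedged reconstruction; the function name and state layout are our own):

```python
import numpy as np

Q = np.diag([1.0, 1.0, 5.0, 1.0])  # state penalty matrix from the text
R = 2.0                            # control penalty from the text

def planning_cost(s, u, p=10.0):
    """One-step cost: s^T Q s + R u^2 + p * exp(-e_lat), where e_lat = s[0].

    The exponential term pushes the vehicle away from the obstacle; p = 0
    when no obstacle is present."""
    e_lat = s[0]
    return float(s @ Q @ s + R * u ** 2 + p * np.exp(-e_lat))
```

At the origin with zero steering, only the obstacle term remains, so the cost equals p, which is a convenient sanity check when tuning the penalty factor.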

IV. SIMULATION AND ANALYSIS
In this section, the proposed HRLDP algorithm is applied to the left-turn without traffic signals and multi-lane merging from side-ways scenarios. Corresponding simulation results and analyses are described in the following sections. Furthermore, the discussion is also presented.

A. LEFT-TURN WITHOUT TRAFFIC SIGNALS
The simulation is performed in the Matlab environment on a desktop with an Intel i7-8700K CPU @ 3.7 GHz, 16 GB RAM, and the Windows 10 operating system. We obtain the average time cost of our algorithms over 100 computations: t_de = 7.32 × 10⁻⁴ s and t_pl = 2.98 × 10⁻⁴ s, where t_de and t_pl are the average computation times of the decision-making and motion-planning algorithms, respectively. The simulation results are shown in Fig. 6, which shows the positions of the vehicles in a time series, forming an x-y-t diagram. To better present the decision-making and motion-planning processes, Fig. 7 shows screenshots at different moments. As seen from the two figures, the autonomous vehicle first waits for the social vehicles at the intersection. Just after the social vehicle (green diamond in Fig. 6 and green cube in Fig. 7) passes, the ego vehicle starts to cross the intersection. Under normal circumstances, the ego vehicle drives to lane 2 safely. In Fig. 6 and Fig. 7, as the red social vehicle slows down suddenly, the autonomous vehicle calls the motion-planning module to avoid the obstacle. From the trajectories in Fig. 6, the autonomous vehicle finally accomplishes the task successfully.
This article then compares the performance of KLSPI and USP-KLSPI in the left-turn without traffic signals scenario. To figure out the influence of the parameters h_d, h_v on the decision-making task, we select several representative pooling intervals, i.e. h_d ∈ {1, 10, 20} and h_v ∈ {1, 5, 10}, to obtain an approximately optimal combination, and then test them accordingly, as seen in Table 4. When both h_d and h_v are 10, the combination performs better in training time and obstacle avoidance assistance rate (OAAR). In USP-KLSPI, the samples are composed of two parts: the initial states of 20000 samples are within the intervals of Table 1, i.e. D_f1 ∈ [0, 150] and D_r1 ∈ [0, 150]; the initial states of 10000 samples obtained by uneven sampling are within the relatively small intervals D_f2 ∈ [0, 80] and D_r2 ∈ [0, 80]. Hence a total of 30000 samples consisting of two sub-sample sets are collected. In addition, ξ in the reward function is set as 10. Table 5 shows the simulation results; the kernel widths in the different algorithms are the same, i.e. σ = 20, for a fair comparison. In terms of OAAR and average time cost (ATC) while completing this task, the two algorithms are similar. Furthermore, USP-KLSPI performs well in reducing the training time. Under the same test set, we compare the USP-KLSPI algorithm with the expert policy (Table 6). The corresponding results are shown in Table 7. As seen from it, the completion rate of the expert policy is 82.2%. For USP-KLSPI, about 0.7% of the tests need obstacle avoidance assistance and the completion rate is 99.6%. In terms of ATC, USP-KLSPI performs better, owing to the conservativeness of the expert policy. That is to say, an autonomous vehicle using our USP-KLSPI algorithm is more flexible in complex dynamic environments.
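The two-part sample set described above can be sketched as follows. This is only an illustration of the uneven-sampling idea under the stated intervals; the function name, the use of uniform draws, and the random seed are assumptions, not the authors' sampling code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (assumed)

def uneven_sampling(n_wide, n_narrow, wide=(0.0, 150.0), narrow=(0.0, 80.0)):
    """Two-part sample set in the style of USP-KLSPI training:
    n_wide initial distances drawn over the full interval, plus
    n_narrow extra samples concentrated in the smaller interval
    where interactions with other vehicles are more critical."""
    d_wide = rng.uniform(*wide, size=n_wide)
    d_narrow = rng.uniform(*narrow, size=n_narrow)
    return np.concatenate([d_wide, d_narrow])

samples = uneven_sampling(20000, 10000)  # 30000 samples in total
```

The effect is that the short-distance region is over-represented relative to uniform sampling, which is what lets the learned policy see more of the safety-critical states for a given sample budget.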

B. MULTI-LANE MERGING FROM SIDE-WAYS
As in scenario A, an extensive simulation is conducted in the multi-lane merging from side-ways scenario. The simulation environment is the same as in the previous scenario. We obtain the average time cost of our algorithm over 100 computations, i.e. t_de = 4.4 × 10^−3 s and t_pl = 2.84 × 10^−4 s. The corresponding results are shown in Fig. 8, which presents the decision-making and motion-planning processes from several views. To present the two processes better, we only highlight the vehicles that directly influence the motion of the autonomous vehicle, i.e. the red triangle and the green square in Fig. 8.
As seen from Fig. 8, the autonomous vehicle first continues to drive on the original ramp. After the red vehicle passes, the autonomous vehicle merges into the first lane and accomplishes the merging task. To present the processes better, Fig. 9 shows screenshots of different moments with the state information attached.
This article then compares the performance of KLSPI and USP-KLSPI in this scenario. To figure out the influence of the parameters h_d, h_v on the decision-making task, we select several representative pooling intervals, i.e. h_d ∈ {1, 10, 20} and h_v ∈ {1, 5, 10}, to obtain an approximately optimal combination, and then test them accordingly, as seen in Table 8. When h_d and h_v are both set as 1, the combination performs better in training time and ATC. In USP-KLSPI, the samples are composed of two parts: the initial states of 20000 samples are within the intervals of Table 2, i.e. D_f1 ∈ [0, 150] and D_r1 ∈ [0, 100]; the initial states of 10000 samples obtained by uneven sampling are within the relatively small intervals D_f2 ∈ [0, 80] and D_r2 ∈ [0, 60]. Thereby a total of 30000 samples consisting of two sub-sample sets are collected for training. The decision-making policies trained by the two algorithms are each tested on a total of 1000 cases. Table 9 shows the simulation results; the kernel widths in the different algorithms are the same, i.e. σ = 50, for a fair comparison. The OAAR and ATC of the two algorithms while completing this task are very close. Furthermore, the proposed USP-KLSPI is useful in reducing the training time. Compared with KLSPI, its OAAR is slightly lower and its ATC slightly longer. The major reason is that a relatively conservative policy has been learned: while reducing the OAAR, it also sacrifices ATC to complete the tasks.
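The parameter search over pooling intervals used in both scenarios amounts to an exhaustive sweep of the (h_d, h_v) grid. The sketch below shows the structure of such a sweep; the selection criterion (lowest training time) is a simplification, and the `evaluate` callback is a hypothetical stand-in for training and testing a USP-KLSPI policy with the given intervals.

```python
import itertools

def select_pooling_intervals(evaluate, h_d_grid=(1, 10, 20), h_v_grid=(1, 5, 10)):
    """Test every (h_d, h_v) pooling combination and keep the one with the
    lowest score returned by `evaluate` (e.g. training time)."""
    best = None
    for h_d, h_v in itertools.product(h_d_grid, h_v_grid):
        score = evaluate(h_d, h_v)
        if best is None or score < best[0]:
            best = (score, h_d, h_v)
    return best[1], best[2]

# mock evaluation: pretend coarser pooling trains faster, up to saturation
mock = lambda h_d, h_v: 100.0 / (min(h_d, 10) * min(h_v, 5))
h_d, h_v = select_pooling_intervals(mock)
```

With only nine combinations, the exhaustive sweep is cheap; in practice each `evaluate` call would involve retraining the policy, which is where the pooling strategy's training-time savings matter.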
Because the green vehicle suddenly slows down ahead, the autonomous vehicle calls the motion-planning module to avoid the obstacle. The corresponding results are shown in Fig. 10. As seen from it, a maneuver into the second lane is generated and the autonomous vehicle avoids the obstacle by lateral trajectory planning.
Comparisons with the expert policy (Table 10) were also performed, and the corresponding results are shown in Table 11. The results show that the ATC of the USP-KLSPI method is much shorter than that of the expert policy. In terms of the completion rate, the expert policy has more failures than our proposed method.

C. DISCUSSION
The algorithms in the two layers are both trained offline and then deployed to real-time environments. From the analyses of computing performance in Remarks 3 and 5, the proposed approach has the potential to realize real-time decision making and motion planning. To bridge the gap between the simulated and real environments, the data samples used for training are collected using the high-fidelity 14-DOF dynamics in both layers, so consistency can be guaranteed. In summary, our algorithm is feasible for dealing with decision-making and motion-planning problems in real environments.
As to applications in real environments, we first need to obtain the state information, i.e. s_t as in equation (16) or (23). We then use the trained policy w_π to make real-time decisions by equations (14) and (15). For motion planning, as the decisions already include the longitudinal speed information, we only need to consider lateral trajectory planning by equation (26), i.e. W_a^T K(s). The real-time decision-making and motion-planning processes are depicted in the lower part of Fig. 1 and in Algorithm 1.
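One step of this deployed two-layer loop can be sketched as follows. The sketch is schematic, not the authors' implementation: Gaussian RBF features are assumed for the kernel vector, w_π is modeled as one weight row per discrete longitudinal action, and all variable names, centers, and weights are hypothetical toy values.

```python
import numpy as np

def rbf(s, centers, sigma):
    """Assumed Gaussian RBF feature vector for state s."""
    return np.exp(-np.sum((centers - s)**2, axis=1) / (2.0 * sigma**2))

def decide(s_t, w_pi, centers, sigma, actions):
    """Higher layer: score each candidate longitudinal action with the
    learned kernel value function and act greedily (eqs. (14)-(15) style)."""
    q = w_pi @ rbf(s_t, centers, sigma)  # one Q-value per discrete action
    return actions[int(np.argmax(q))]

def plan_lateral(s_t, W_a, centers, sigma):
    """Lower layer: lateral steering angle u = W_a^T K(s) (eq. (26) style)."""
    return float(W_a @ rbf(s_t, centers, sigma))

# one step of the loop with toy weights and two kernel centers
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
actions = ["keep_speed", "decelerate"]      # hypothetical action set
w_pi = np.array([[1.0, 0.0], [0.0, 1.0]])   # hypothetical decision weights
s_t = np.array([0.1, 0.1])                  # hypothetical measured state
a = decide(s_t, w_pi, centers, 1.0, actions)
u = plan_lateral(s_t, np.array([0.3, -0.1]), centers, 1.0)
```

The point of the structure is that both layers reduce online to a feature evaluation plus a dot product, which is consistent with the sub-millisecond average time costs reported in the simulations.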

V. CONCLUSION
In this article, a hierarchical reinforcement learning approach is proposed for autonomous decision making and motion planning in complex dynamic traffic scenarios. The motion data-samples of the ego vehicle and the rule-based surrounding vehicles are collected by a high-fidelity 14-DOF dynamics for the learning process in the decision-making problems. In addition, the decision-making layer considers the optimization of longitudinal speeds. The motion capabilities of the ego vehicle and the surrounding vehicles are evaluated with a high-fidelity dynamic model in the decision-making layer. Therefore, the consistency between the decisions generated at the higher layer and the operations in the lower planning layer can be well guaranteed.
The simulation results on two complex traffic scenarios show the effectiveness and efficiency of the proposed hierarchical reinforcement learning approach. Moreover, comparisons with KLSPI were conducted, and the test results indicate that the proposed USP-KLSPI can significantly reduce the training time. The corresponding results also indicate that an autonomous vehicle using the USP-KLSPI algorithm is more flexible in complex dynamic environments. Future work will focus on efficient feature learning, and we will pay more attention to handling integrated decision and planning problems under complex constraints.