Anytime Tree-Based Trajectory Planning for Urban Driving

The personal mobility of the future will be changed significantly by autonomous driving. To realize this vision, the complex task of trajectory planning needs to be solved. In this article, a novel planning concept, CarPre trajectory planning, based on Monte-Carlo tree search, is presented. Using a speed-dependent steering angle transformation, the state space of a kinematic single track model is discretized. The planner can then choose between different actions, each consisting of a discrete-value pair of an acceleration and a steering rate. With this, an equitemporal search tree is created to compute the future trajectory. Using Monte-Carlo simulations, the influence of short-term actions of the vehicle can be evaluated over a longer planning horizon. Thus, the current best solution can be accessed at any point during computation, enabling real-time applications. Furthermore, the discretized search tree enables easy checking of complex constraints dependent on binary or continuous variables. The concept is verified on a real test vehicle in a lane keeping maneuver. Through initial testing, a pleasant driving experience is perceived, which indicates future acceptance of the real-time capable algorithm.


I. INTRODUCTION
A UTONOMOUS vehicles will substantially change the personal mobility of the future. This is inter alia because of the reduction of traffic accidents, savings in energy, parking space as well as people's time and the increased access to mobility [1]. But until this vision comes reality, first more research and development work needs to be done in order to solve the most difficult traffic scenarios. These challenging situations occur especially in urban areas because of limited available space, area-specific traffic rules and high density of other traffic participants.
To act autonomously, a vehicle needs three main capabilities: perception, planning and control [2]. In perception, the vehicle collects information about its environment and localizes itself. In planning, the future movement of the vehicle (i.e., trajectory) is determined to achieve a higher order The review of this article was arranged by Associate Editor Abel C. H. Chen. goal such as reaching a specified target region without colliding with other obstacles. Finally, in the control task the planned action is executed. Therefore, planning can be seen as the core competency of the autonomous vehicle. Here, all previously calculated information are combined to one future plan of action. This plan of action needs to be executable by the controller, comfortable, traffic rule compliant and riskminimizing. Furthermore, it is essential for the acceptance of autonomous vehicles since its result is directly perceived by its passengers. Here, "it should be borne in mind that people are more likely to forgive mistakes made by other people than mistakes made by machines" [1, p. 6].
Solving the complex planning problem in real-time is challenging. To tackle this complex problem, a common planning architecture is chosen as presented in [3] and shown in Fig. 1. In the first layer, the route planning, waypoints through a street network are planned, similar to a navigation systems in current vehicles. In the second layer, the behavior planning, a high-level behavior (also: maneuver) for the autonomous vehicle is planned. Such a maneuver can be e. g. to follow/change the lane or to turn left/right at intersections. In the third layer, the motion planning, the high-level behavior will be transformed into a precise plan of motion, the calculated trajectory. Finally, in the fourth layer, the feedback control, the calculated trajectory will be executed using a local feedback controller.
Therefore, trajectory planning can be seen as the combination of maneuver and motion planning. This separation brings the benefit of reducing the planning complexity, since the motion planning can focus on a specific maneuver. But it may lead to synchronization problems between those two layers, it needs to be ensured that the trajectory reflects the planned maneuver. Additionally, the question arises how the discrete set of maneuvers needs to look like in order to cover all possible scenarios. Especially, in urban driving with narrow roads, limited space and possibly endless maneuvers, it is hard to find a complete set of maneuvers. Therefore, solutions exists solving the planning task of both layers in a single step [4], [5]. Because of the increased complexity, efficient algorithms are needed for real-time execution.
Current planning algorithms all have different strengths and weaknesses, so the search continues for new possible solutions that may surpass existing algorithms. One promising approach is the Monte-Carlo tree search (MCTS) [6]. It is used for solving discrete decision problems, especially in complex combinatorial games. Here, the game Go is considered the most challenging because of its high branching factor. By combining MCTS with reinforcement learning, AlphaGo [7] succeeded in defeating a Go world champion, a milestone for artificial intelligence.
MCTS has been used for behavior planning [8], [9], [10] and end-to-end learning [11]. The idea of this paper is to apply the algorithm to the problem of trajectory planning. Therefore, in this paper we present the CarPre trajectory planner (Monte-Carlo Model Predictive Trajectory Planner) which is based on MCTS, uses a kinematic motion model, and is real-time capable. Since this is an anytime algorithm, it can be guaranteed that a solution is available within the given calculation time. This is especially important for safety-critical applications such as trajectory planning. With possible options, such as e. g. the incorporation of explicit behavioral rules [12], the incorporation of discrete variables into the decision process as well as a variety of possible adaptations, MCTS offers advantages for trajectory planning. Before presenting the novel CarPre trajectory planner, a short overview over the related work is given.

A. RELATED WORK
For the motion planning, three major solution classes exist [3]: optimization, graph-search, and incremental search methods. In the first class, the planning problem is formulated as continuous optimization as in model predictive control (MPC). Using a motion model, the vehicle kinematics can be taken into account as optimization constraints. Common examples are [13], [14], [15], [16]. The advantage of these methods is the (mostly) fast solution in the continuous planning space. Disadvantages are that a solution is only available when the optimization has converged. This can be problematic, especially in complex, critical situations, when an individual optimization requires significantly more computing time than expected (e. g. peaks in Fig. 12 from [15]). Another disadvantage is the convergence to a local minimum. Thus, a higher planning level, such as behavior planning, is mandatory for the optimization to converge to the desired local minimum. Moreover, discrete states (e. g. a traffic light that decides between stop and drive) cannot be directly incorporated.
For the second class, a graph is generated by discretizing the planning space into a finite number of states. Then, graph search algorithms, like Dijkstra [17] or A* [18], can calculate the optimal trajectory in the discretized graph. To generate such a graph different methods exist. The most common variant is to sample the planning space, resulting in a tree structure [19], [20] or a lattice structure [5], [21]. The advantage of graph-based approaches is that the discretization allows global searching of the graph. By performing a global search, the planning does not get stuck in local minima, but the (discretized) global minimum is found. Moreover, no explicit enumeration of planning constraints is needed as for the continuous optimization methods. Instead, the discretization allows checking of very complex constraints for possible trajectories or for each trajectory point [22]. The disadvantage of this is the discretization: if it is chosen too large, the smaller planning space simplifies the planning problem, however, no solution may be found, although one exists in the continuous space (e. g. a narrow passage which cannot be passed due to a too coarse discretization). On the other hand, if the discretization is chosen too small, the graph explodes due to the curse of dimension [23].
The third class, the incremental search methods, tries to circumvent the disadvantage of a fixed discretization as in graph-based approaches. This is achieved by incrementally sampling the planning space with finer and finer discretization steps and simultaneously trying to find a solution [3]. Examples include RRT [24], RRT* [25], or CL-RRT [22]. The advantage of these approaches is that a solution is found if it exists (cf. graph-based approaches). However, the computation time required for finding a solution is unbounded, i.e., it is possible that the solution will be found only after far too much time. This usually excludes a real-time application.

B. CONTRIBUTION AND OVERVIEW
Our contribution in this article is two-folded. First, we derive a discretization of possible vehicle movements. Second, MCTS is adapted to the problem of trajectory planning.
The remainder of this paper is organized as follows: First, the used motion model is introduced in Section II which will be the basis for our further examinations. Then, MCTS is introduced in Section III, before the CarPre trajectory planner is presented in Section IV. For this, the state space is first discretized before adapting the MCTS algorithm to the problem of trajectory planning. Finally, the novel trajectory planner is evaluated in Section V in a real test vehicle before concluding the paper in Section VI.

II. MOTION MODEL
A basic requirement for the planning of motion sequences, like a trajectory of car-like vehicles, is a motion model. This motion model describes the relationship between a current state and possible future states. With this, physical movement restrictions can be taken into account in the planning task.
In the literature of trajectory planning, multiple motion models exist. This can be e. g., a simple model of constant turn rate [26], which considers the position and orientation of the vehicle or a single track model [14], [27], [28], which represents the non-holonomic motion of the vehicle. While the kinematic single track model assumes slip-free driving, the dynamic single track model [29], [30] considers the forces between the tires and the road surface. Therefore, it is especially useful for driving at the vehicle's limit.
To decide which motion model should be used, the specific use cases have to be considered. The trajectory planning of a vehicle, which is supposed to drift or to drive best times on a race track, requires a different motion model than e. g. an autonomous cab service. Since the focus of this work is on trajectory planning in urban scenarios, a dynamic vehicle model can be omitted. Instead, ease of use is needed. Since the kinematic single track model is the most simple model which considers the non-holonomic movement of a car-like vehicle, it is chosen. In addition, Polack et al. [31] proved that the kinematic single track model can be used for consistent planning if limiting the lateral acceleration to 0.54μg.
As model inputs, the acceleration a as well as the steering rate ω is chosen. With this, the state vector , consisting of the 2D position W p ref in a world coordinate frame, the orientation ψ, the speed v and the steering angle δ, can be calculated using the following ODE: With this model, δ can change for a constant input u = [ a, ω ] which leads to trajectories with variable curvature. The reference point of the motion model, i.e., the origin of the vehicle-fixed coordinate frame, is chosen on the front axle as discussed in [32]. Because of this, the only parameter of the model is the wheelbase l, which defines the length between the rear and the front axle. All states, inputs and parameters are shown in Fig. 2.

III. MONTE CARLO TREE SEARCH
In this section, a short introduction to Monte Carlo Tree Search (MCTS) is given. MCTS [6] is a search algorithm to solve discrete decision problems. In discrete decision problems, one can choose between a finite number of actions.
Examples include board games such as tic-tac-toe, Go, or chess. To solve such a problem, the algorithm builds a search tree, which consists of nodes (states of the search space) and edges (possible actions in a state). For each state, the number of previous visits n v is stored, as well as the sum of its reward estimates R s . By trying out different actions, the most promising one is searched, i.e., for the action with the highest mean reward estimate R m = R s /n v . An iteration of the algorithm consists of four steps, which are shown in Fig. 3: In this phase, a child node is selected to be explored in this iteration. It is strategically chosen between actions which are so far most promising (exploitation) and between less promising but also not often investigated actions (exploration). The latter should also be investigated, as they may have given poor results so far due to random evaluation, although they are actually promising. To solve the exploitation/exploration problem, the UCT formula (upper confidence bound applied to trees) is often used: where n v is the number of visits to the current node, n v,j and R s,j are the visits and the summed reward estimate of the child node by selecting the jth action, respectively. Changing the parameter c > 0, the focus of the selection can be put more on exploitation or more on exploration. Starting from the root node of the tree, the subsequent state to the action with the highest UCT value is selected. This scheme is iteratively executed through all levels of the tree until an action is selected for which no child node yet exists. This approach is called tree policy. II Expansion: The matching child node is created for the selected action from step I. III Simulation: The reward of the selected action is estimated by a simulation. According to a defined default policy (e. g. a random selection of possible actions), actions are selected and simulated until a state is reached where no further actions are available (end of game). This end state is evaluated and passed to the next step. IV Backpropagation: The estimated reward R of the simulation is fed back through the tree starting at the selected node and ending at the root node. To do this, at each of these nodes, the number of visits n v and the sum of the reward estimates are updated as follows: As soon as a defined time t calc has elapsed or a certain number of iterations has been executed, the algorithm ends. Then, the most promising action of the root node is chosen. For a summary of the properties as well as possible modifications, please refer to [33].
In the following, the modifications of the CarPre-trajectory planning are presented.

IV. CARPRE TRAJECTORY PLANNER
In this section, our novel planning approach, the CarPre trajectory planning, is presented. Based on MCTS, a trajectory is calculated using a discretized search tree. Each tree node represents a vehicle state x in a given environment. Using the kinematic single track model of Section II, a state x k can be transformed into a new state x k+1 by applying a chosen input u k for a sampling time T in . Therefore, our approach can be categorized as a graph search method (cf. Section I-A) which creates an equitemporal graph, i.e., a graph with the same time difference between two adjacent tree levels. In the following, the MCTS adaptions for the CarPre planner are presented.

A. DISCRETIZATION OF THE PLANNING SPACE
The trajectory planning problem is in general a continuous planning problem. However, MCTS solves discrete decision problems. Therefore, the action as well as the state space is discretized in the following. To ensure that only physically possible trajectories will be planned, the vehicle motion is constrained by the kinematic single track model (cf. (1)-(5)), so that an action consists of an acceleration value as well as a steering rate. Building an equitemporal search tree, the following discretization depends on the chosen sampling time T in . In the following, it is assumed that the vehicle can only move forward (v ≥ 0).
First, the longitudinal states (i.e., acceleration and speed) are discretized equidistantly. The discrete set of acceleration values is calculated according to the discretization distance a as well as the minimum a min and the maximum acceleration a max as Because of the fixed sampling time T in as well as a d as multiples of a, the set of equidistant speeds results in until a maximum speed v max is reached. By an appropriate choice of a, common behavioral patterns such as emergency braking, comfort braking, coasting down, constant speed, comfort acceleration, and emergency acceleration [34, p. 116] can be represented. For the lateral motion states (i.e., steering angle and steering rate), the speed-dependent steering angle transformation from [35] is used: where κ max is the maximum drivable path curvature, a lat,max is the maximum lateral acceleration, and l is the wheelbase of the vehicle. This transformation was derived from human driven trajectories, limits the steering angle with the help of its two variable parameters and is shown in Fig. 4 as red dotted lines. For low driving speeds, the maximum drivable path curvature κ max of the vehicle limits the steering angle. For higher speeds, the maximum lateral acceleration a lat,max takes over the limitation. This ensures that lateral accelerations do not become too large and planned trajectories can be executed by the feedback controller at all times. The speeddependent steering angle δ d (v d ) with n δ ∈ {2p − 1 | p ∈ N} discretized states is set to and |δ max (v d )| taken from (11). An example discretization is shown in Fig. 4. The discrete steering rate With this, a state space symmetrical around zero is obtained for the lateral motion. Furthermore, the discretization leads to a state lattice, i.e., solving the equations of motion (4) and (5)   performed with a discretization error. This error can only be avoided if the respective states remain in the continuous range of values. In general, this is possible because MCTS needs only discrete actions and no discrete states. Using the discretization described above, a search tree can be constructed as described in Section III. When selecting the discretization parameters, it should be noted that they determine the complexity of the planning space. Especially the number of inputs n in = n a n ω as well as the sampling time T in are crucial, since they determine the number of possible action combinations n in,total within a planning horizon T hor ∈ {p· T in | p ∈ N}. The number of combinations increase exponentially (cf. curse of dimension [23]): n in,total = (n a n ω ) To mitigate this exponential growth, the number of possible actions can be limited depending on the last selected action. An example for this is the limitation of the acceleration a k depending on the last acceleration a k−1 : so that the number of possible acceleration actions n a is reduced to three. This restriction resembles a limitation of the jerk, thus increasing the driving comfort and reducing the planning complexity at the same time. A resulting example discretization of the acceleration is presented in Fig. 5.

B. END OF GAME
The end of game specifies states at which no further action can be selected. This applies to states in the decision tree as well as states in the simulation for the reward estimation. In board games, such an end of game is determined by the rules of the game. In the case of trajectory planning, reaching a goal state can be the end of game. However, this goal state is usually farther away and therefore only tracked by route planning (cf. Fig. 1). Thus, additional end states must be defined for the trajectory planner: 1) Reaching a defined time horizon t = T hor (also: planning horizon). 2) Leaving the area for which environmental information is available, so that no further planning is possible. 3) Causing a collision with other objects. 4) Stopping of the vehicle for t > 0. The limitation of the planning horizon 1) is similar to the principle of the receding horizon control [36], which is used in MPC. Thus, planning considers only a limited time window to determine the next trajectory. After a trajectory is computed, the first part of it is executed before a new schedule is started with a shifted time window.

C. DEFAULT POLICY
In order to evaluate a chosen action, a reward value is estimated by means of simulation. To increase the algorithm's time of convergence, random combinations of acceleration and steering rate values for the simulation are avoided. Instead, a default policy is presented below, which is derived from human behavior. This behavior is based on an analysis of the inD dataset [37]. In general, two main behavior patterns can be identified in the dataset. First, human drivers try to follow their lane (cf. Fig. 6). Second, they usually accelerate only for a longer period of time (cf. Fig. 5). Therefore, the default policy is defined as follows: • Acceleration: The previous acceleration value is chosen again for the next k acc time steps (constant acceleration model). Subsequently, acceleration values around zero (i.e., a d ∈ {− a; a}) are set to a d = 0. • Steering rate: The steering rate is set to zero for the next k ω actions. Subsequently, the steering rate is selected based on a direction map so that the vehicle follows the course of the road as best as possible (lane keeping). Such a direction map is shown in Fig. 7.

D. HEURISTIC FOR INITIAL NODE REWARD
According to the UCT formula (6), any action that has never been selected is preferred over an already explored action because of This means that all possible actions of a node are explored first before a node of the next tree level can be explored. With limited computational time, this can lead to the search tree not reaching the required depth for planning a good trajectory. To speed up the convergence of the algorithm, an initialization of the node reward values R s,j = R ini is done. Similarly, the initial number of visits of the node is set to n v,j = 1. Using the direction map, the best steering rate ω best for lane keeping is determined for the current state, and accordingly the initialization R ini is set to 1 for ω k = ω best and a k = a k−1 ; R ini,2 for ω k = ω best and a k = a k−1 ; 0 else, where 0 ≤ R ini,2 < R ini,1 ≤ 1. With this choice, an initial preference of future actions is set, so that actions with equal accelerations as well as steering rates to follow the road are preferred during exploration. When selecting a particular action, the default policy is used to simulate to the end of game and the reward estimate is backpropagated. By averaging, the initial value loses influence with increasing iterations and converges to the real reward value.

E. REWARD FUNCTION
If an end of game is reached, the planned trajectory must be evaluated to get an estimation of the chosen action. In order to establish a clear prioritization of the planning goals, a reward function R(x k ), which differs for each reward term by one order of magnitude, is chosen for each state x k of the trajectory: with the m reward terms R i (x k ) ∈ [0, 1] and the basis of the order of magnitude b R . The overall reward value of one trajectory R traj consists of the normalized sum of all n traj = T hor /T in possible states: where p ≤ n traj describes the number of trajectory points to be evaluated. If the simulation is terminated early (p < n traj ) by the end of game described in Section IV-B, the state evaluation of the final state x p is repeated until n traj points have been incorporated into the trajectory evaluation. This is important so that each state x k is weighted equally in each trajectory evaluation. An exemplary reward term is the term for reaching a target speed v target . It can be chosen to As seen in this example, the trajectory discretization allows checking of complex constraints dependent on discrete events. In this case, the target speed is changed to zero if the vehicle collides or leaves the allowed drivable area, since this is the desired behavior. These properties, i.e., the evaluation of complex reward terms with discrete decision variables combined with the weighting of different order of magnitudes, lead to a flexible specification of the desired behavior. Therefore, a behavior-semantic scenery description [12] can be integrated in the CarPre trajectory planner to implement the behavioral rules of the road.

F. EXTRACTION OF THE PLANNED TRAJECTORY
After the defined calculation time t calc has elapsed, the newly planned trajectory k can be extracted from the created search tree. To avoid small variations between two successive trajectories, the actions of the last planned trajectory k−1 are used, if available. Starting at the root node of the recomputed search tree, the j best action is selected in each node, which has the largest mean reward R m,best . If the trajectory extracted so far is still equal to the previous one, the current reward value R m,best,prev of the previous best j best,prev action is also evaluated. If these two reward values are within a tolerable deviation R , the previous best action is preferred over the current best action. This results in j best = ⎧ ⎨ ⎩ j best,prev for |R m,best − R m,best,prev | < R and so far extracted k equals k−1 ; j best else.
This procedure is executed, starting at the root node, until either n traj trajectory points are extracted or a leaf node, i.e., a node without further child nodes, is reached.

V. EVALUATION
In order to evaluate the presented planning approach, a lane following maneuver is chosen. Here, the planner's goal is to follow its own lane with (if possible) the target speed. For this, the algorithm is implemented in C++ and executed on an Intel Xeon E5-2698 v4 processor. To speed-up the exploration progress, the algorithm is parallelized on 10 threads using tree parallelization with local mutexes (cf. [38]). The planning algorithm is then put into operation on a real test  vehicle which is shown in Fig. 8 using the minimal software architecture presented in Fig. 9. In this architecture, planning is only based on a road model calculated from an HD map. The tests are executed on the available test-track, the August-Euler airfield (cf. Fig. 10) with the used parameters listed in Table 1. The reward function of the algorithm includes terms for avoiding collisions, reaching the target speed, following the own lane and minimizing accelerations. Because of safety reasons, the vehicle is only allowed to drive 8.4 m/s = 30.24 km/h. Furthermore, a steering torque limitation exists, so that a safety driver can overrule the planned trajectory at all times. First, a single planning step of a right turn is analyzed (Fig. 11). Here, the vehicle has already reached the target speed v target and is approaching the curve. However, the curve is too tight so that the vehicle has to brake. Due to the speeddependent steering angle discretization, the vehicle can only plan within the given physical constraints, i.e., if the vehicle does not brake, it would miss the turn (cf. red/orange dots in the right of Fig. 11a). Without reducing v target , the algorithm searches for the fastest possible speed to safely drive through the curve. Due to the initialization of the possible actions (cf. Section IV-D) as well as the limited computation time, the vehicle slows down more than necessary (cf. light blue dots in the right of Fig. 11b). This is because of the exploration characteristic of the algorithm, the actions in the near future (i.e., directly in front of the vehicle) are well estimated by the high number of simulations. As a result of the branching of the tree, the behavior in the far future is largely determined by the initialization of the action values and the default strategy. However, using the receding horizon control, only the first planned action is executed before the planning for the next time step begins.
In order to analyze the planning over multiple planning steps and the driving comfort, the test vehicle drives on the test track for ∼45 minutes (one planning cycle every 0.2 s, in total 13513 cycles). The goal is to follow the route shown in Fig. 10 with the given target speed of 8.4 m/s = 30.24 km/h. During the measurements, the speed-dependent steering torque limit of the vehicle is exceeded several times, although the dynamics of the vehicle were not at the limit (five times in curve 5, and once in curve 2). In this case, the automation stops, the vehicle is brought to a standstill by the safety driver and the automation is enabled again. Such an overrun is not desirable, but a later fully automated system will not include such a limit. Fig. 12 shows an almost complete lap on the test track, only the long, straight part between curves 1 and 5 (lower straight line in Fig. 10) is not shown. In this example, the steering torque limit is exceeded in curve 5, which is marked in red.
In general, the feedback controller, in combination with the inertia of the actuators, smoothes the planned trajectory consisting of the discrete-time inputs shown in blue. Individual small outliers in the longitudinal acceleration a of the scheduled trajectory such as between 30 and 40 s or 130 and 140 s are filtered out. Although such high-frequency fluctuations are undesirable and should be avoided by further adjustments, they are not noticed by the driver.
The lateral acceleration a lat of the vehicle oscillates slightly around zero when driving straight ahead, but this is not perceived in the vehicle. In the curves, the steering angle discretization limits the maximum lateral acceleration, which is maintained by the vehicle. Additionally, the lateral deviation d lat of the vehicle to a virtual center line is shown. Since d lat is not part of the cost function, it will not be optimized. The deviations in curve 2 and 5 are high, because these curves are on intersections with wide lane segments. Maximizing the speed with limited lateral accelerations results in driving along the inner edge of the curve and thus, high deviations from the center line. The observed behavior is similar to driving a racing line without leaving the own lane. Furthermore, the vehicle already starts to brake before a curve, comparable to human driving behavior. The required speed is then reached within the curve before accelerating again towards the end of the curve. Fig. 13 shows the G-G diagram of the planned and driven trajectory. Due to the physically constrained discretization, only actions within the selected boundaries can be chosen. Therefore, the circle, representing 1/3 G, is not reached. All in all, this creates a natural driving experience.

VI. CONCLUSION
In this article we present CarPre trajectory planning, a novel planning algorithm based on MCTS. Using the kinematic single track model in combination with a speed-dependent steering angle transformation, possible movements of the vehicle are described and limited by physical constraints. With this, a discrete action space, consisting of discretevalue pairs of accelerations and steering rates, is derived which contains only drivable motions. This action space is then used to create an equitemporal search tree. Monte-Carlo simulations in combination with heuristics estimate the long-term effect of each short-term action. Thus, high-level maneuvers can be indirectly incorporated in the algorithm and do not need to be calculated beforehand by a behavior planning module. Furthermore, the algorithm offers great  flexibility. Being able to limit and evaluate trajectories at discrete time steps, complex cost functions or certain behavioral constraints can be easily implemented. The practical feasibility of the novel concept is demonstrated by running the algorithm on a real test vehicle. Here, comfortable, foresighted driving suggests future acceptance of the real-time capable algorithm.