Design of DDPG-Based Extended Look-Ahead for Longitudinal and Lateral Control of Vehicle Platoon

This paper presents a novel Deep Deterministic Policy Gradient (DDPG) algorithm with extended look-ahead approach for longitudinal and lateral control of vehicle platooning. The DDPG algorithm is adapted due to its ability to fit nonlinear system and to handle continuous control environment. Moreover, the dynamic input inversion is introduced to reduce domain of the action space from DDPG output. The existing look-ahead approach is considered as a cost-effective approach since it uses the available information from on-board sensors and is effective against the loss of lane markings. However, the approach is known to suffer from cutting-corner phenomenon. To address cutting-corners, we introduce the extended look-ahead approach and derive the true-local error states using the already available information from lidar and V2V communication. The robustness and performance of DDPG-based extended look-ahead controller is investigated by means of simulations and validated through experiments on a Donkey Car platform. The simulations and experiments with Donkey Car show that the DDPG-based extended look-ahead algorithm can provide an efficient control strategy for longitudinal and lateral maneuvers without the requirement of path information.


I. INTRODUCTION
The rapid development of automotive industries and the needs of mobility nowadays brings benefits and also simultaneously brings challenges and problems to the transportation. The increase of vehicles that is not balanced with the increase of road capacity has led to traffic accidents, congestion, and environmental problem. To cope with this problem, one of the solution proposed is to increase the capacity of existing roads and infrastructure. However, the development of roads and infrastructure is considered costly, arduous, and The associate editor coordinating the review of this manuscript and approving it for publication was Junho Hong . time-inefficient. Thus, an alternative solution of increasing the traffic flow is proposed. One way to increase the traffic flow is by reducing the distance between vehicles by means of vehicle platooning, which can be defined as grouping several vehicles that formed a compact formation and drive at a small inter-vehicle distance. An autonomous vehicle is then developed to enables a vehicle to drive at a closer inter-vehicle distance than a human driver could. Adaptive Cruise Control (ACC), as a part of Advanced Driver Assistance System (ADAS), is first invented as a comfort system that enables a vehicle to automatically regulate its speed to maintain a safe distance from its preceding vehicle. ACC system in a vehicle utilizes radar, laser, or a camera to detect the vehicle ahead, allowing it to brake when the inter-vehicle distance is too near, or accelerate when the inter-vehicle distance is too far. ACC technology is regarded as one of the key component of autonomous or intelligent vehicles. Along its development, the functionality of ACC was then extended to Cooperative Adaptive Cruise Control (CACC), that allows a vehicle to receive information from other neighboring vehicles using a Vehicle-to-Vehicle (V2V) communication. CACC realizes automated longitudinal vehicle control by adding the preceding vehicle's velocity, acceleration, or steering angle that are used in a feed-forward loop [1]. The invention of ACC/CACC has led to the realization of automated longitudinal control. However, to achieve a level 2 autonomous vehicle, in which both steering and acceleration/deceleration are automated, the lateral control of vehicle has to be taken into account.
One proposed approach to realize both longitudinal and lateral control of autonomous vehicle is to separate the control problem into two independent subsystems [2]. In this approach, the ACC/CACC system regulates the longitudinal control [3], while a lane keeping assist system regulates the lateral control [4]. Typically, a lane keeping controller is developed based on the path following method, where a vehicle detect a reference path (e.g., lane markings or magnetic markings embedded in roads) using magnetic sensors or cameras. The control objective is then to calculate a steering input that minimizes the distance from the position of vehicle to a reference path. In this lane keeping system, the clarity of lane markings is important for the accuracy of path following. The drawback of this path following method is when the lane markings are of bad quality, or even not available, resulting in the difficulty of tracking the correct lane which then may lead to accidents. From the platooning perspective, when a small inter-vehicle distance is desired, an accurate measurement of lane markings is not always possible due to the obstructed view from the preceding vehicle. Moreover, by treating the longitudinal and lateral motion independently, the application of this decomposed approach is limited to situations with small steering angles or low speeds [5].
To achieve a combined control for longitudinal and lateral motion that does not depend on lane markings, a vehicle following method is adapted. In this method, a lookahead approach is usually used by a vehicle to follow its preceding vehicle [6]. By utilizing the already available information from the ACC/CACC setup, this approach is considered as a cost-effective and feasible solution. However, the implementation of the look-ahead approach results in cutting-corner phenomenon for the lateral movement of vehicles. Authors in [7] and [8] proposed an extended look-ahead method that compensates the cutting-corners, where the method is then applied on the vehicle platoon with kinematic vehicle model. However, the kinematic vehicle model is considered inaccurate for a dynamic environment in longitudinal and lateral maneuvers. Moreover, these two papers considered the needs of global and semi-local positions of vehicles, which in some situation cannot be accurately measured. In [9], the trajectory estimation of the preceding vehicle is designed to compensate the cornercutting problem. Using a dynamic vehicle model, a model predictive control is applied to fulfill the longitudinal and lateral control. The extended look-ahead approach is also adapted in [10], where the nonlinear dynamics of the vehicle is taken into account. However, the physical parameters of vehicles are difficult to be measured and estimated, and the performance of the proposed controller in these papers for a heterogeneous vehicle platoon was not yet investigated.
The rapid development of artificial intelligence technology, especially Reinforcement Learning (RL), has brought benefits to autonomous driving [5], [11], [12]. The application of artificial neural networks for tracking and autonomous vehicles can be traced back to 1990s, see, e.g., [13] and [14]. In these papers, data are provided by ultrasonic sensors and used as an input to a feedforward network. Another hybrid approach, which combines RL and fuzzy logic, for longitudinal vehicle control is proposed in [15]. In [16], the authors propose an actor-critic algorithm, which use a model-free value-and policy-based RL algorithm to solve longitudinal control problem. Q-learning algorithm, as a subfield of RL, is a model-free algorithm that learn value of an action in a particular state. In [17], a deep Q-learning algorithm, which uses experience replay from a random sample of prior action instead of the most recent action, is proposed to control braking as a part of collision avoidance system. The authors in [18] propose a combined approach of supervised learning and deep Q-learning algorithm. The lateral and longitudinal control are then treated as a dependent system and the objective is formulated as a velocity control and lane keeping control. In [19], the authors propose a model-based policy iteration algorithm to solve the Hamilton-Jacobi-Bellman equation for a linearized dynamic vehicle model, and adaptive dynamic programming is implemented for the data-driven policy iteration. It is also important to note that the automated driving can be classified as a control problem for a continuous system, and the drawback of these proposed algorithms is that they can only be implemented to discrete environments. The process of discretization would lead to the inability to overcome dynamic environment in the vehicle platooning system [20]. Deep Deterministic Policy Gradient (DDPG) algorithm, which is based on a direct policy search, is then proposed to handle continuous control environment by directly output continuous action [25]. In [5], the authors propose an improved DDPG algorithm based on the double critic networks to overcome large cumulative errors. The approach is applied to a lane following method in autonomous vehicle. A DDPG-based proportional integral derivative (PID) longitudinal controller is proposed in [23] to improve the performance regarding speed tracking error. However, the proposed algorithms have not yet incorporated a vehicle dynamic model, and the longitudinal and lateral motion are still treated independently. VOLUME 11, 2023 96649 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply. Several methods to control a vehicle platoon have been proposed in literature, see Table 1. Most of the studies focused on using Reinforcement Learning for the control design of vehicle platooning. However, most studies assume that the path information for the follower vehicle to follow is available, which is not true in some cases [26]. The real-time control design of vehicle platooning also needs to consider the coupled nonlinear longitudinal and lateral dynamics of vehicle. Therefore, the main contribution of this paper is the design of DDPG-based algorithm using extended look-ahead approach for a dynamic vehicle model, which can be divided into several sub-contributions. First, we introduce the control problem formulation in vehicle platooning and a nonlinear dynamic single-track vehicle model. To reduce the domain of the action space from DDPG output, we propose a dynamic input inversion method that transforms the output acceleration from DDPG algorithm to the longitudinal force input of the dynamic single-track model. With this domain reduction, the computational load of the algorithm can be reduced. To overcome cutting-corner in the look-ahead approach, we adapted the extended look-ahead approach and derive true-local error states. By this approach, the error states can be calculated using the already available information from radar and V2V communication and a global positioning information is not required. In comparison to the existing results of other RL-based controllers for vehicle following, our DDPG-based extended look-ahead controller takes the coupled dynamics of longitudinal and lateral motion into account and handles the cutting-corners without the need of path information.
The next part of this paper is formulated as follows. Section II describes the problem formulation and system model. In this section, the nonlinear dynamic single-track vehicle model, the dynamic input inversion are introduced, and the adapted extended look-ahead error is derived. In Section III, we propose the design of DDPG-based extended look-ahead for longitudinal and lateral control. Section IV presents the training of the proposed algorithm in MATLAB, training results, and simulations. In Section V, we presents the implementation of the designed algorithm in a Donkey Car platform for further validation. Finally, the last section presents the summary of conclusions and future works.

II. PROBLEM FORMULATION AND SYSTEM MODEL A. FORMULATION OF VEHICLE PLATOONING CONTROL PROBLEM
Consider a platoon of of m ∈ N vehicles with a predecessorfollowing topology, as depicted in Fig. 1, where S m = {i ∈ N|1 ≤ i ≤ m} denotes the set of all vehicles in the platoon. The first vehicle in the platoon (with index i = 1) is assumed to be controlled by a human driver, i.e., the vehicle can be directly velocity and steering controlled. The velocity of vehicle i is denoted by v i , and the actual distance between vehicle i − 1 and vehicle i is denoted by d a,i . With a predecessor-following topology, the main objective of vehicle i is to follow vehicle i − 1 at a desired distance, d i .
The desired distance between vehicle, also known as the spacing policy, can be designed as a constant, a constant timegap, or a variable time-gap spacing policy [27]. In this paper, the desired distance d i is formulated as a constant time-gap spacing policy as follows where r i > 0 is the standstill distance, h is the time-gap, and v i is the velocity of vehicle i. Using this spacing policy, the inter-vehicle distance is increasing if the velocity of the follower vehicle is increasing. Based on a predecessor-following control structure, an ego vehicle can can receive the state information (such as position, velocity, acceleration, yaw rate, and steering angle) through on-board sensors and communicate with its neighbor vehicles through a vehicle-to-vehicle (V2V) communication.
The relative distance to the preceding vehicle is measured by radar, lidar, or cameras.
For a higher degree of automation in vehicle platooning, an integration of longitudinal and lateral control have to be considered. Based on the vehicle platooning control framework in Fig. 2, the DDPG controller uses the information obtained from on-board sensors and V2V communication to calculate the desired acceleration and steering angle. For a vehicle with a large acceleration and varying yaw rate, the kinematic model of vehicle is no longer valid since the lateral forces generated by the tires affect the longitudinal and lateral motion [2]. Therefore, in this paper we consider a dynamic vehicle model, and introduce the transformation of the input of a dynamic vehicle model from the output of DDPG algorithm. The transformation is implemented using a dynamic input inversion, which was first investigated in [28], and later extended to a certain class of nonlinear systems in [29] and [30]. In order to arrive at a suitable dynamic input inversion, the vehicle dynamic modeling is defined first. To fulfill the longitudinal and lateral control objective of vehicle platooning, the desired distance (1) needs to be adapted to a two-dimensional distance. In the next subsection, we design the dynamic input inversion, which then used for the formulation of the error derivation based on the two-dimensional constant time-gap spacing policy.

B. DYNAMIC MODELING OF VEHICLE
We consider the single-track vehicle model, which is commonly used to model the lateral and longitudinal dynamics of a vehicle under a normal driving conditions [2], [31]. The dynamic single-track model with a front-wheel drive is given byv where v x , v y , andψ denote the longitudinal, lateral, and yaw velocity, respectively. Furthermore, m i denotes the mass of the vehicle, I i denotes the moment of inertia, and l j,i , with j ∈ {f , r}, denotes the distance between the axle of front and rear tires, respectively, and the center of gravity of the vehicle. The front-wheel traction force of the engine is denoted by F lf ,i and the steering angle is denoted by δ i , which act as the inputs to the system. The cornering forces that generated by the front and rear tire, are denoted by F cf ,i and F cr,i , respectively, and depend on the lateral tire slip angle α f ,i and α r,i of the respective tire, see Fig. 3. The mapping from the lateral tire slip angle α j,i to its respective cornering force F cj,i , with j ∈ {f , r}, depends on the tire model, road condition, driving condition, and can vary in complexity. In vehicle platooning, we consider a regular driving condition with a relatively low longitudinal acceleration, steering angle rate, and small slip angle. Therefore, a linear tire model in which the cornering forces depend proportionally on their respective slip angles can be used [32], and are defined as where C αj,i , j ∈ {f , r}, denote the cornering stiffness coefficient of the respective tire.  is given byẊ where (v x,i , v y,i , ψ i ) dynamics are as described in (2), (3), and (4). Using (7) and (8), the velocity of the CoG, v i , can be defined as The longitudinal acceleration of the CoG, a i , is obtained by differentiating (9) with respect to time as followṡ By substituting (2) and (3) into (10), the longitudinal force input F lf ,i is eventually obtained as , r}, is as described in (5), (6), and v x,i > 0 such that ξ i ̸ = 0. Therefore, any control outputs of the form longitudinal acceleration and steering angle can be applied directly into the single-track model through the dynamic inversion (11). Using the new states (7), (8), and the transformed input (11), the error derivation is discussed in the next section.

D. TRUE-LOCAL EXTENDED LOOK-AHEAD ERROR DERIVATION
Using the input inversion defined in the previous section, the control to the dynamic model of vehicle i is transformed To overcome the cutting-corner problem, we adapt the extended look-ahead approach in [8]. Let P i denotes the posture of the vehicle i, P i,0 denotes the desired posture of where the vehicle i should be, and Fig. 4. The desired posture P i,0 is defined as a function of the preceding vehicle i−1 posture, P i−1 , and the angle of the circular arc formed by P i−1 and P i,0 . The angle, denoted by α i−1 , is defined as where d i is the desired inter-vehicle distance as defined in (1). By using trigonometric identities in Fig. 4, we also have From Fig. 4, it can be observed that the length of P i,0 P i−1 is equal to the desired spacing distance d i . Thus, the posture P i,0 in a global Cartesian coordinate is defined as With this desired posture derivation, the control objective can be defined as bringing the posture P i to the desired posture P i,0 . We define the error state components as By substituting (14) and (15) into (16), and by using half and double angle formulae, we eventually obtain the error state components as where cos(α i−1 /2) and sin(α i−1 /2) are as defined in (13), and e ψ,i is the orientation error of vehicle i. It can be observed directly that the first term of the right-hand side of (17) denotes the relative longitudinal and lateral error, respectively, that can be obtained from lidar. The relative orientation ψ i−1 − ψ i , and the yaw rate of vehicle i − 1,ψ i−1 , can be measured by an IMU (Inertial Measurement Unit) and lidar, while v i−1 is measured using on-board sensors and communicated through V2V. Using these error definitions, in the next section we define the design of DDPG-based extended look-ahead controller.

III. DESIGN OF DDPG-BASED EXTENDED LOOK-AHEAD LONGITUDINAL AND LATERAL CONTROLLER FOR VEHICLE PLATOONING
As a subfield of Reinforcement Learning (RL), Deep Deterministic Policy Gradient (DDPG) is an algorithm that uses an actor critic reinforcement to search for an optimal 96652 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
policy that maximizes the expected cumulative long-term rewards. To understand how a DDPG algorithm can be applied in a standard control problem, we can view the policy inside of DDPG as a controller. Hence, the action produced by the policy acts as the control input to the system, while the states of the environment in DDPG act as the observed output of the system. In standard optimal control problems, the reward function designed in DDPG can be viewed as the objective function that the policy, or the controller, is trying to maximize. Therefore, DDPG can be seen as an algorithm that finds the solution for optimization problem of the reward function, and produces the optimal policy as an optimal control law for the system [33]. The algorithm of DDPG itself, which is an off-policy algorithm, inherits an actor-critic framework [34]. The actor is responsible for a policy, which receives the states of the environment as the input and generates an action. The critic, on the other hand, estimates the action value function, which is utilized to assess how good the actor is. The algorithm uses two deep neural networks for each actor and critic, which is powerful enough to represent control problem for a highly non-linear dynamical system [35].
To ensure the convergence of the DDPG algorithm, choosing an appropriate state space and a reward function are critical. The states of the environment should be related to the motion states of the vehicles, while the reward function should be designed to minimize the longitudinal and lateral error. For the longitudinal and lateral control problem of vehicle platooning, the observation state at time step t, s t , are designed as a state with nine elements as follows where e v,i = v i−1 −v i is the velocity error, e 1,i and e 2,i denotes the longitudinal and lateral error, as defined in (17).
The actor network of DDPG then approximates a behavior policy, µ, with respect to the observation state (19). The output of the actor network is given by [36] a t = µ s t |θ µ , where θ µ denotes the parameter of the strategy network of making the determined action. On the other hand, the critic network approximates a value function as where θ Q denotes the parameter of the strategy critic network, and N t represents the exploration noise. The objective of the DDPG algorithm is to find the optimal strategy such that the cumulative reward is maximized, see Algorithm 1 [25]. The design of the reward function is critical for the DDPG-based extended look-ahead algorithm and needs to take several aspects into account: firstly, the velocity of the follower vehicle v i must be positive to satisfy the condition of v x,i > 0, which then implies that Algorithm 1 DDPG algorithm [25] Initiate random actor network µ(s|θ µ ) with weight θ µ and critic network Q(s, a|θ Q ) with weight θ Q Initiate target network µ ′ with weight θ µ ′ ← θ µ and Q ′ with weight θ Q ′ ← θ Q Initiate replay buffer for i = 1 to G do Initiate random process N for action exploration Obtain initial observation state s 1 for t = 1 to T do Evaluate action a t = µ(s t |θ µ )+N t with respect to the current policy and exploration noise Implement action a t , observe reward R t and new state s t+1 Save transition state (s t , a t , R t , s t+1 ) in replay buffer Sample a random mini batch of N transition states (s i , a i , R i , s i+1 ) from replay buffer Set Update the critic network by minimizing the loss function Update the actor policy using sampled policy gradient: Update the target networks:

end for end for
ξ i ̸ = 0 as a necessary condition for (11); secondly, there should to be no collision between vehicles; and thirdly, the follower vehicle needs to be responsive to follow the maneuvers of its preceding vehicle. Therefore, as one of the main contributions in this paper, we propose the reward function R t at every time step t, which is designed as where R e = w e 50e 2 1,i (t) + 50e 2 2,i (t)  andȧ i (t − 1) andδ i (t − 1) are the derivative of acceleration and steering input of the previous time step t − 1. Negative term of reward function R e eliminates errors to achieve the longitudinal and lateral control objective. The negative reward function R u is designed to minimize the chattering in control effort. The reward function R t is a logical value that terminates the training if the lateral error is too high, if the velocity of the follower vehicle is approaching zero, or if the relative inter-vehicle distance is negative. The reward functions R 1 and R 2 are logical values that encourage the agent to make longitudinal and lateral error, respectively, small. The weighting coefficients w j , with j = {e, u, t, 1, 2}, are appropriately tuned in order to balance the importance of the components [37]. To fulfill the important aspects of vehicle following objectives, it is critical to avoid the condition of the non-zero or negative velocity and to minimize the lateral and longitudinal position errors. Hence, the weight w t , w 1 , and w 2 are chosen as 10. On the other hand, we want to avoid the aggressive control action of accelerating/decelerating, while also maintaining a smooth chattering in control effort. Hence, we set the weight w e and w u equals to 0.001.

IV. TRAINING OF DDPG-BASED EXTENDED LOOK-AHEAD CONTROLLER A. TRAINING AND SIMULATION ENVIRONMENT
For the longitudinal and lateral vehicle platooning control system as described in the previous section, we utilized the actor and critic networks within the DDPG algorithm. The overall network architecture is shown in Fig. 5. The actor network consists of six layers: one input layer, one output scaling layer, and four fully connected hidden layers. Each connected hidden layer has 100 neurons with the activation function of rectified linear unit (ReLU) to accelerate convergence [38]. To handle multi-layer neural networks, a tanh activation function is used in the output scaling layer.  The critic network consists of two input layers (state and action), one output layer, and three fully connected hidden layers. Both action and state path merge into the output layer through addition layer. The hidden layers have 100 neurons each with ReLU activation function, while the linear activation function is used in the addition layer. To eliminate the dimensional influence between the data, we use batch normalization that can transform the input data to a normal distribution. To improve the efficiency of exploration, the Ornstein-Uhlenbeck process noise is applied with variance 0.6 and 0.1 for acceleration and yaw rate input, respectively. The decay rate of the noise is chosen as 10 −5 . Other training parameters are as listed in Table 2.
The DDPG-based extended look-ahead algorithm is trained using RL Toolbox in MATLAB. We trained the algorithm for a platoon with two vehicles, modeled as a nonlinear dynamic single-track vehicle. Vehicle 1 acts as a leader and is directly controlled with predefined acceleration and yaw rate profile. Vehicle 2 acts as a follower vehicle and is controlled by the DDPG-based extended look-ahead algorithm. The parameters for vehicles are as given in Table 3.

B. ALGORITHM TRAINING AND SIMULATION RESULTS
In this subsection, the training result of the DDPG-based extended look-ahead algorithm and the simulations are presented. The platoon consists of leader and follower vehicles which are modeled using a nonlinear dynamic single-track vehicle model as in (2)- (4). The platoon is simulated in three scenarios: the circular path, eight-shaped path, and snake path. The circular path can be considered as the simplest scenario, consisting of a straight path and a circular path with a constant curvature. The eight-shaped path consists of a zero, varying, and constant curvature, where the robustness of the designed controller is evaluated against an aggressive curvature trajectory. The snake path consists of a straight path and a wave path with varying curvature and acceleration. Within this path, the robustness of the controller algorithm is evaluated simultaneously against both varying curvature and acceleration. Fig. 6 shows the acceleration and yaw rate profile of each scenario. In the training process, the leader and the follower vehicle are initiated with the velocity of 24 m/s (equals to 86.4 km/h) and 22 m/s (equals to 79.2 km/h), respectively, and the initial positions of both vehicles and the reference path are randomly changed before the beginning of each episode to prevent over-fitting of the model to a certain scenario, while also ensure the reliability of the agent for all scenarios. In other words, each episode has different initial longitudinal and lateral spacing errors.
The total and last five average reward per episode are as shown in Fig. 7. Training performance is considered good if the reward value is high. As expected, the value of cumulative reward increases with the increase of episode. The designed DDPG-based extended look-ahead algorithm converges after training for 2975 episode with the last five average reward equals to 5251. Due to the random change of scenarios, it can be observed that total reward is declining between episode 2000 and 2500. However, the proposed algorithm shows its robustness against changing scenario after episode 2500.

C. SIMULATION RESULTS
A vehicle platoon typically maneuvers at normal driving conditions, i.e, low acceleration/deceleration, and low yaw rate. To demonstrate the potency of the DDPG-based extended look-ahead controller, we simulated the trained agent for a platoon with 4 vehicles in the three different scenarios: circular path, eight-shaped path, and snake path scenario. The circular path is composed of a straight and circular path to assess the robustness of the designed controller against a delicate curvature changes. The eight-shaped path is generated by half circles and quintic polynomial functions, and is useful to evaluated the robustness of the controller against a varying curvature. The snake path is composed of a sinusoidal yaw rate with varying frequency and varying velocity, and used to demonstrate the robustness of the controller in handling varying curvature and velocity at the same time. It should be noted that the second and third scenario are not typical maneuvers of a vehicle platoon, but they were conducted to evaluate the performance of our designed controller to handle challenging maneuvers.
To model all vehicles, we used a dynamic single-track model, as in (2)-(4). Vehicle 1 acts as a leader and is controlled directly using respective acceleration and yaw rate profile as in Fig. 6. Vehicle i, with i = {2, 3, 4}, is controlled by the DDPG-based extended look-ahead algorithm with the preceding vehicle i − 1 as its reference. For all scenarios, vehicle 1 starts at a random initial position (50 + x, 0) m, | x| ≤ 2, with initial orientation π/4 rad, while vehicle i, with i = {2, 3, 4}, starts at a random initial position (50 − 15(i − 1) + x, −5 + y) m, | x| ≤ 2, | y| ≤ 0.5, with initial orientation 0 rad. VOLUME 11, 2023 96655 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.  For comparison purposes, we simulated the regular look-ahead controller designed in [7] with identical vehicle parameters and initial conditions with the DDPG-based extended look-ahead controller. This regular look-ahead controller adapts the concept of pure-pursuit controller, in which the follower vehicle does not have any path or trajectory information of its preceding vehicle, and can only track the preceding vehicle by measuring the distance from the front of the follower to the rear of the preceding vehicle. Fig. 8 shows the trajectory of vehicles for the circular path scenario. The figure on the left shows the trajectory of vehicles using the regular look-ahead controller. As expected, the follower vehicle can only track its preceding vehicle's trajectory on a straight path. Without any path or trajectory information of the preceding vehicle, the follower vehicle suffers cutting-corner, as evidenced by the difference in the path curvature of the follower and its preceding. The difference in curvature can be explained by observing Fig. 9, where all follower vehicles with the regular look-ahead controller have lower turning velocity than the ones with the DDPG-based extended look-ahead controller. On the other hand, as depicted in Fig. 8(right), the follower vehicles with the DDPG-based extended look-ahead controller can track their predecessor's trajectory for the circular path, i.e., the path with constant velocity and varying yaw rate.   The overall response of the first scenario is similar to one of the second scenario, the eight-shaped path, as shown in Fig. 10. For the DDPG-based extended look-ahead controller, it can be observed from Fig. 11(right) that the difference between the velocity and the yaw rate of the follower vehicles and vehicle 1 is relatively small, thus ensuring that all vehicles drive on the same path. From both first and second scenario, we can conclude that the proposed DDPG-based extended look-ahead algorithm handles varying yaw rate and constant velocity in a satisfactory manner.
The third scenario is conducted to assess the performance of the design DDPG-based algorithm against both varying yaw rate and velocity. From t = 0 s until t = 10 s, vehicle 1 drives at a straight path with constant velocity. From t = 10 s onward, a sinusoidal yaw rate with magnitude 0.4 rad/s and frequency 0.5 Hz is applied as input to the vehicle. From t = 15 s until t = 25 s, a one cycle sinusoidal acceleration with magnitude 1.8 m/s 2 is applied. As shown by Fig. 12(right), all follower vehicles with the DDPG-based  extended look-ahead controller respond quickly to follow their respective preceding vehicle on a varying yaw rate and varying acceleration. The difference between the velocity of the followers and their preceding vehicle from t = 15 s onward indicates a varying deceleration (see Fig. 13(right), which implies a varying desired distance due to the constant time-gap spacing policy (1). Therefore, it can be concluded from these three scenarios that our DDPG-based extended look-ahead controller can be applied to multiple vehicles in a platoon, where vehicle i − 1 acts as a target tracking for vehicle i, further confirming the scalability of the controller. The simulation results also confirm that the DDPG-based controller can be implemented in a nonlinear dynamic model, which takes coupled longitudinal and lateral dynamics into account, using the designed dynamic input inversion. Moreover, our designed controller is proven to be effective against cutting-corner in the situation where the preceding vehicle's path is unknown, justified by the acceptable error tolerance for various road conditions with varying yaw rate and velocity.

V. EXPERIMENT WITH DONKEY CARS A. EXPERIMENTAL SETUP
In this section, we describe the experimental setup for the implementation of DDPG-based extended look-ahead controller. Additionally, we present the experimental results to validate theoretical and simulation results. We implemented the DDPG-based extended look-ahead controller in a platoon consisting of two Donkey Cars. A Donkey Car is a fourwheel small-scale car with a modular design, and is developed using open source libraries based on Python. A Donkey Car is typically modeled by a kinematic single-track model instead of a nonlinear dynamic single-track model that we used on the simulation By using Donkey Cars, we further verified the generalization capability of our controller to different vehicle models. In this experiment, we used two starter kit Donkey Cars (HSP 94186 model) with servo drivers PCA 9685 to control the throttle and steering, and additional components as follows (see Fig. 14): • Raspberry Pi 4 Model B: a single-board computer (SBC) that acts as the brain of the car, attached to each leader and follower vehicle. The DDPG-based controller algorithm is embedded into the SBC of the follower vehicle, while the SBC of the leader vehicle controls the throttle and steering and also transmits all necessary signals for the controller through Wi-Fi protocols.
• SLAMTEC RPLIDAR-A2: a small lidar with 0.2 -16 m measuring range, 16K sampling frequency, and 5 -10 Hz rotational speed. The lidar is attached to the follower vehicle to measure the inter-vehicle distance.
• InvenSense MPU-6050: a micro electro-mechanical system with a 3-axis gyroscope and 3-axis accelerometer to measure linear and angular velocities and accelerations, attached to each leader and follower vehicle.
• Extra battery: since the original Donkey Car kit uses 1,100 mAh battery, we replace it with a 2,400 mAh battery in each vehicle to power Donkey Car, Raspberry Pi, and lidar. The DDPG-algorithm that is trained in MATLAB in the previous section is exported to TensorFlow using the built-in function exportNetworkToTensorFlow. The function exports the deep learning network and saves it as a TensorFlow model in the Python package.
In order to validate the simulation results, we define a circular trajectory for the leader vehicle. The leader vehicle is directly controlled, while the follower vehicle is controlled by the DDPG-based extended look-ahead algorithm. Since the longitudinal motion of a Donkey Car is controlled by providing a throttle input, we designed a PID controller for the leader vehicle to achieve a constant velocity of 2 m/s. On the other hand, the lateral motion of a Donkey Car is directly-controlled by providing a constant steering angle input of 0.2479 rad. The standstill distance parameter for the spacing policy is chosen as r i = 0.3 m, and the time-gap is chosen as h i = 0.1 s. The lidar is used to measure the inter-vehicle distance and angle of an object relative to the lidar orientation. The Inertial Measurement Unit (IMU) utilizes a low pass filter to reduce noise effect and increase sensor's accuracy. The necessary signals from the leader vehicle (such as, velocity and relative orientation) with a sample rate of 60 Hz are then transmitted to the follower vehicle through a Wi-Fi communication, while the relative inter-vehicle distance is measured using the lidar. The DDPG-based controller then calculates the inputs needed for the follower vehicle.

B. EXPERIMENTAL RESULTS
For the experiment, we placed the follower vehicle behind the leader vehicle with a relatively small initial longitudinal and lateral errors. The orientation of the follower is placed in the same direction as the leader vehicle. This positioning is done to ensure that the follower vehicle moves forward with a small initial orientation error as possible. The leader vehicle maneuvers on a circular path with a constant velocity of 2 m/s and a constant steering angle of 0.2479 rad. The velocity and steering angle of both leader and follower vehicle are shown in Fig. 15. As observed, the velocity convergence of the follower vehicle is apparent after t = 3 s, subject to the inter-vehicle longitudinal spacing policy. Fig 16(left) shows the desired longitudinal distance, based on the constant time-gap spacing policy (1), and the actual inter-vehicle distance. It can be observed that the desired distance varies due to the varying velocity of the follower vehicle. During t = 0 s until t = 2 s, the desired distance is bigger than the actual distance due to a high velocity of the follower vehicle. From t = 4 s onward, the actual distance converges to 0.5 m. On the other hand, although the measured steering angle suffers from a bigger noise than the velocity measurement, it is shown that the steering angle of the follower also converges to the steering angle of the leader vehicle. To analyze the runtime efficiency of the proposed algorithm, we measured the execution time in Since the controller agent is already trained in MATLAB, the computational time is relatively fast and does not interfere with the control purpose, which evidently shown by the error plot in Fig. 16(right). However, from the figure, it should be noted that lateral error e 2 is relatively larger than the longitudinal error e 1 . The large lateral error might be caused by several factors: the inaccuracy of the inertial measurement unit, delay of the actuators, and the non-deterministic delay in the Wi-Fi communication. Despite of these factors, the average errors are smaller than 0.1 m, which then show satisfactory results of the proposed DDPG-based extended look-ahead algorithm.

C. REMARKS AND DISCUSSION
The experiment with Donkey Cars showed the potential of the DDPG-based extended look-ahead controller implementation, which also can be seen as a first step towards the implementation in real vehicles. It should be noted that although the dynamics of a Donkey Car is different than the one of a real vehicle, in a normal driving condition in a platoon (i.e., low acceleration/deceleration and low yaw rate) a real vehicle can be modeled as a linear dynamic single-track model, or even a kinematic single-track model. To adapt our controller to a real vehicle model, we can design a secondary controller that controls the acceleration or speed, instead of throttle in Donkey Car, internally compensating for the vehicle parameters.
In the simulation, it is assumed that all vehicles can measure the inter-vehicle distance accurately, and no delays are involved in wireless communication. However, it should be noted that the experiment with Donkey Cars and future implementation of the DDPG-based extended look-ahead controller in real vehicles have to take sensors' accuracy, reliability, and potential failures into account. In [39], a failureresilient platooning system is proposed to handle potential failures of sensors. In this system, the failed sensors is emulated by collectively utilizing other sensors in the platoon. The control system then can instantaneously reconfigured the cooperative mode using only the live sensors. Authors in [40] proposed levels of failure with immediate stop of the vehicle to address sensor malfunctions. To address the problem where some signals cannot be accurately measured or are unavailable due to sensors malfunction, an observer-based secure control as proposed in [21] can be adapted. In the experiment with Donkey Cars, we used IEEE.802.11 wireless protocol which works well for our controller. On the other hand, the Vehicle-to-Everything (V2X) communication uses ETSI ITS-G5 and DSRC/WAVE technology based on IEEE 802.11p, which is a more robust protocol against fading and multipath propagation effects of signals in a vehicular environment. In the situation where the communication is delayed or even failed, we can adapt a failure tolerance system as the one designed in [41]. To address a deliberate manipulation of the system through cyber-attacks, authors in [42] proposed a reliable trust-based platoon service. To handle the aforementioned sensors failures, our DDPG-based controller can be used in conjunction with other controllers or systems designed in those studies.

VI. CONCLUSION
In this paper, we propose a Deep Deterministic Policy Gradient (DDPG) algorithm with extended look-ahead approach for the control of longitudinal and lateral vehicle platooning. This method combines the ability of DDPG to handle nonlinear system and uncertainties with the advantage of cost-effectiveness and feasibility of the lookahead approach. Without the information of reference paths or lane markings, the application of the look-ahead approach can cause cutting-corner phenomenon. The cutting-corners escalate upstream in the platoon, thus affecting the lateral tracking performance of all vehicles in the platoon. Moreover, a nonlinear dynamics vehicle model must be considered for dynamic longitudinal and lateral maneuvers. To solve the cutting-corner problem, we adapted the extended look-ahead approach that redefined the tracking objective point into a nonlinear dynamic vehicle model. With an input inversion, the original input of the nonlinear dynamic vehicle model can be obtained from acceleration and steering angle input from DDPG algorithm. The effectiveness of the proposed DDPG-based extended look-ahead approach is evaluated in both simulation and experimental environments. The results demonstrate that the DDPG-based extended look-ahead algorithm can meet the requirements of longitudinal and lateral tracking by reducing cutting-corners under different road, velocity, and yaw rate conditions. The future work would be concentrated on the implementation of the algorithm into a platoon with more than two vehicles, and on the design optimization of the DDPG algorithm to improve the performance, error minimization, and convergence speed. Moreover, the following research can incorporate the string stability analysis of the platoon and can focus on the implementation on real vehicles to further verify the reliability of the DDPG-based extended look-ahead algorithm.