Research on ATO Control Method for Urban Rail Based on Deep Reinforcement Learning

To address the punctuality, stopping accuracy and energy consumption of urban rail train operation, an intelligent control method for automatic train operation (ATO) based on the deep Q network (DQN) is proposed. A train dynamics model is established subject to the safety principles and the various constraints of automatic urban rail train driving. Considering the inter-station rules and sequences of operating-condition transitions, the agent in the DQN algorithm acts as the train controller, adjusting the automatic driving strategy in real time according to the train's operating state and environment and optimizing the generation of the automatic driving curve. Taking the Beijing Yizhuang Subway line as an example, simulation results show that the DQN urban rail train control method reduces energy consumption by 12.32% compared with the traditional PID control method while improving punctuality and stopping accuracy; at the same time, the DQN automatic driving control method adjusts the train's running state dynamically in real time and shows good adaptability and robustness to changes in the operating-environment parameters.


I. INTRODUCTION
Urban rail transit has become an important mode of travel in modern society, and as it has grown rapidly, its huge energy consumption has become an increasingly prominent issue. According to 2020 statistics, the total electrical energy consumption of urban rail transit in China was 17.24 billion kWh, of which traction energy consumption was 8.4 billion kWh, accounting for 48.7% of the total energy consumption [1]. The actual traction energy consumed in train operation depends on the control strategy used by the ATO system. The ATO system is a key part of the automatic train control system and is used to control automatic train departure, acceleration, cruising, braking, precise stopping, and automatic turnback [2]. The control method of a conventional ATO system can be divided into two parts: generation of the optimal target speed profile and tracking of that profile. The ATO system takes infrastructure data, vehicle data, timetable data and train dynamics parameters as inputs to calculate the recommended inter-station target speed curve [3], [4], and tracks the recommended curve accurately in actual operation through traction, coasting, braking and other operations [5]. Traditional ATO research usually considers only the generation of the optimal target speed profile or only how to track that profile accurately. In actual operation, however, the train deviates from the precalculated optimal speed owing to changes in traction/braking performance or power supply problems, so an offline-calculated optimal target speed curve can hardly meet the punctuality and accuracy requirements of train operation. The classical control algorithm represented by PID was the first applied to ATO control systems; it regulates the deviation between the set value and the output value through proportional, integral and derivative terms so as to track the preset target speed curve, and PID tracking control is still widely used on Beijing Metro trains because of its applicability [6], [7].
With the rapid development of artificial intelligence, more intelligent control algorithms are being applied to the field of ATO control. Wang et al. [8] applied iterative learning control to the ATO system, making full use of the information obtained from previous running cycles to adjust the current driving strategy so that the train drives itself according to the given target speed curve. Yang [9] considered the characteristics of regenerative-braking electric locomotives, analyzed the influence of the optimal operating curve and operating time on operating energy consumption, and proposed a target speed curve approximation algorithm that effectively reduces the energy consumption of train operation. To improve the control performance of conventional ATO control algorithms on suspended permanent-magnet maglev trains, Liu et al. [10] proposed a weighted predictive fuzzy proportional-integral-derivative control algorithm, which effectively improves on-time performance and comfort and reduces operation energy consumption and stopping errors. Yin et al. [11] proposed an integrated mixed integer linear programming (MILP) formulation for the passenger-oriented coordination of train schedules in urban railway networks during peak hours. In recent years, Deep Reinforcement Learning (DRL), an artificial intelligence approach that combines the decision-making capability of reinforcement learning with the perceptual capability of deep learning, has come to be considered one of the most promising techniques in machine learning [12], [13]. DRL uses the neural networks of deep learning to handle the high-dimensional state spaces of reinforcement learning, allowing it to overcome uncertainty in the environment when approximating the target value. At the same time, DRL does not require accurate model information, which improves the scalability of the control system. DRL has therefore been used for optimal control problems in many different fields. For example, Wang et al. [14] proposed a DRL-based dynamic resource allocation method for network slicing, using DRL to extract knowledge from experience by interacting with the network and dynamically allocating resources to each slice so as to maximize resource utilization. To solve the automatic train parking control problem, Chen et al. [15] proposed an online learning strategy that uses reinforcement learning based on the accurate position data provided by transponders to reduce the parking error effectively and dynamically.
Zhu et al. [16] introduced two definitions of train operation states related to travel time and energy distribution, proposed a Q-learning method to determine the optimal energy allocation strategy, and verified the effectiveness of the approach in both a deterministic and a stochastic environment. Yin et al. [17] proposed two intelligent train operation (ITO) algorithms based on an expert system and on reinforcement learning; without accurate train model information or an offline-optimized speed curve, these algorithms improve the punctuality of train operation and reduce its energy consumption.
Previous research on and application of deep reinforcement learning algorithms has achieved great success, providing an effective solution for complex automatic train driving control. However, most current studies do not consider how the control strategy should be adjusted dynamically in real time after the train's operating state changes. The main goal of this paper is to reduce traction energy consumption while meeting the inter-station requirements of safe, punctual and accurately stopped operation by reasonably planning the traction, braking and coasting control strategies. As analyzed above, the traditional ATO strategy of tracking a precalculated optimal speed curve struggles to meet the real-time and accuracy requirements of train operation once the train deviates from that curve. This paper therefore uses the Deep Q Network (DQN) algorithm [18] to train an agent interactively against an urban rail train operation model and to adjust the control strategy dynamically during operation, adapting to the complex operating environment and to changing planned travel times. The research in this paper can provide a theoretical basis for the transformation of automatic driving control of urban rail trains from automation to intelligence.

II. URBAN RAIL TRAIN DYNAMICS MODEL
Considering the train as a rigid body, a single-mass-point model is used to study the traction and braking characteristics of the urban rail train, and the single-mass-point dynamics model of the train is constructed according to Newtonian mechanics [19]:

$$\frac{dx}{dt} = v, \qquad M\frac{dv}{dt} = F \tag{1}$$
where x is the train's running position; t is the running time; v is the running speed; F is the resultant force on the train; and M is the train mass. Train operation generally consists of traction, coasting and braking conditions; for the different operating conditions, the resultant force F on the train can be expressed as

$$F = \begin{cases} f(u_0, v) - R_c(v) - G(x), & \text{traction } u_0 \\ -R_c(v) - G(x), & \text{coasting } u_1 \\ -f(u_2, v) - R_c(v) - G(x), & \text{braking } u_2 \end{cases} \tag{2}$$
where $f(u_l, v)$ is the traction/braking force acting on the train and $\{u_l\}_{l=0,1,2}$ represents the train operating condition: $u_0$ is the traction condition, $u_1$ the coasting condition and $u_2$ the braking condition. The train running resistance can be divided into the basic running resistance $R_c(v)$ and the ramp resistance $G(x)$.
where $R_c(v)$ and $G(x)$ are expressed respectively as follows [20]:

$$R_c(v) = a + bv + cv^2, \qquad G(x) = M g\, i(x) \tag{3}$$
where a, b and c are empirical constants; g is the acceleration of gravity at the train's operating position x; and i is the line slope at position x. The train operation model is driven by the DQN algorithm in iteration steps of sampling interval $\Delta t$, which are used to compute the train's running position x and speed v along the line; the iterative formulas for x and v are shown in Eqs. (4) and (5):

$$x_{t+\Delta t} = x_t + v_t \Delta t \tag{4}$$
$$v_{t+\Delta t} = v_t + \frac{F_t}{M}\Delta t \tag{5}$$
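For illustration, the point-mass model of Eqs. (1)-(3) and the Euler iteration of Eqs. (4)-(5) can be written as a short simulation step. This is a minimal sketch: the mass, Davis coefficients and force values below are placeholders, not the paper's calibrated vehicle parameters.

    # Placeholder train parameters (illustrative only, not the paper's values)
    M = 194_000.0                      # train mass, kg (assumed)
    A, B, C_COEF = 2000.0, 60.0, 0.8   # Davis coefficients a, b, c (assumed)
    G_ACC = 9.81                       # gravitational acceleration, m/s^2

    def basic_resistance(v):
        """Davis equation R_c(v) = a + b*v + c*v^2, Eq. (3)."""
        return A + B * v + C_COEF * v ** 2

    def grade_resistance(x, slope_of):
        """Ramp resistance G(x) = M*g*i(x), with i(x) the slope at position x."""
        return M * G_ACC * slope_of(x)

    def step(x, v, f_applied, slope_of, dt=0.1):
        """One Euler iteration of Eqs. (4)-(5).

        f_applied > 0 is traction, f_applied < 0 is braking,
        f_applied == 0 is coasting (the three cases of Eq. (2)).
        """
        F = f_applied - basic_resistance(v) - grade_resistance(x, slope_of)
        x_next = x + v * dt
        v_next = max(0.0, v + (F / M) * dt)  # the train cannot roll backwards
        return x_next, v_next

A flat line can be modeled as slope_of = lambda x: 0.0, after which repeated calls to step() trace out a speed-distance trajectory under a chosen force sequence.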
With full consideration of train operation safety and passenger comfort, punctuality, stopping accuracy and operation energy consumption are treated as a multi-objective problem. The inter-station operation indices can be expressed as follows.
The error between the actual running time T′ of the train and the planned running time T measures punctuality:

$$|T' - T| \le e_t \tag{6}$$
The error between the actual distance S′ travelled by the train and the full line length S measures stopping accuracy:

$$|S' - S| \le \varepsilon_s \tag{7}$$
The inter-station operation energy consumption of the train can be expressed as follows [21]:

$$E = \int_0^{S} f(u_0, v(x))\, dx \tag{8}$$
where v(0) is the initial train speed; v(S) is the final train speed; v(x) is the actual speed of the train along the line; $v_{\lim}(x)$ is the line speed limit, with $x \in [0, S]$; $e_t$ is the on-time error limit of the train, in s; and $\varepsilon_s$ is the stopping accuracy error limit, in m. In essence, the multi-objective train optimization problem is to find the optimal sequence $\{u_l\}_{l=0,1,2}$ of traction, coasting and braking conditions, together with the changeover points, that yields the lowest energy consumption under the constraints of Eqs. (6) and (7). The multi-objective optimization of punctuality, stopping accuracy and energy consumption can therefore be expressed as

$$\min E \quad \text{s.t.} \quad v(0) = 0,\; v(S) = 0,\; v(x) \le v_{\lim}(x),\; |T' - T| \le e_t,\; |S' - S| \le \varepsilon_s \tag{9}$$
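The three indices of Eqs. (6)-(8) translate directly into code. The sketch below assumes a completed trajectory is available as aligned lists of times, positions, speeds and applied traction forces sampled every dt seconds; it approximates the energy integral of Eq. (8) with dx = v*dt.

    def evaluate_run(ts, xs, vs, traction_forces, T_plan, S_line, dt=0.1):
        """Compute punctuality error, stopping error and traction energy."""
        e_t = abs(ts[-1] - T_plan)      # |T' - T|, Eq. (6)
        eps_s = abs(xs[-1] - S_line)    # |S' - S|, Eq. (7)
        # E = integral of traction force over distance, Eq. (8),
        # counting only steps where positive (traction) force is applied
        E = sum(f * v * dt for f, v in zip(traction_forces, vs) if f > 0)
        return e_t, eps_s, E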

III. DESIGN OF DEEP REINFORCEMENT LEARNING

A. PRINCIPLE OF DQN CONTROL ALGORITHM
Reinforcement learning is a method for solving decision problems in nonlinear stochastic systems. The agent interacts with the environment as follows: given the state $s_t$ of the environment at the current moment t, the agent selects an action $a_t$ according to the current policy $\pi(a \mid s_t)$. After the action $a_t$ is executed, the environment changes and the state $s_t$ transitions to a new state $s'$. The agent receives an immediate reward $R_t$ and adjusts its policy in the current state according to the reward obtained, so that the accumulated reward is maximized when the round reaches the termination state, finally yielding an optimal policy $\pi^*$:

$$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{i=1}^{k} R_i\right] \tag{10}$$
where k is the number of state samples in the current round and $R_i$ is the reward obtained at moment i.
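The interaction cycle described above can be sketched as a generic episode loop. The env and agent objects here are assumed interfaces (reset/step and select_action/observe), not APIs defined in the paper.

    def run_episode(env, agent):
        """One round of agent-environment interaction: observe s_t, select
        a_t ~ pi(a|s_t), receive R and s', and accumulate reward until the
        round reaches the termination state."""
        s = env.reset()
        total_reward, done = 0.0, False
        while not done:
            a = agent.select_action(s)      # current policy pi(a|s_t)
            s_next, r, done = env.step(a)   # environment transition and reward
            agent.observe(s, a, r, s_next, done)
            total_reward += r
            s = s_next
        return total_reward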

B. TRAIN OPERATING CONDITIONS CONVERSION RULES
The train's inter-station energy-efficient driving strategy usually consists of two or more condition transition sequences and the corresponding transition points, typically expressed as the MA-C-MB sequence or the MA-(C-MA pairs)-MB sequence (MA: maximum traction, C: coasting, MB: maximum braking), as shown in Figure 1. When the train speed approaches the line speed limit, the train can only apply the coasting condition to decelerate before switching back to the traction condition, which appears as the ''C-MA'' condition change sequence repeating in pairs. In essence, the train ATO optimal control problem is to solve analytically for the optimal condition transition sequence and transition points, obtaining the multiple optimal ''C-MA'' combination sequences between stations so that $e_t$, $\varepsilon_s$ and E are as small as possible. By designing a fitness function, the multi-objective inter-station optimization problem is normalized into a single-objective one, which can be expressed as

$$\text{fitness} = \omega_1 e_t + \omega_2 \varepsilon_s + \omega_3 (\lambda E + \mu) \tag{11}$$
where $\omega_1$, $\omega_2$ and $\omega_3$ are the weighting coefficients for punctuality $e_t$, stopping accuracy $\varepsilon_s$ and energy consumption E, respectively, and $\lambda$ and $\mu$ are the scale coefficients of the energy consumption E.
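As a sketch of this weighted-sum scalarization, assuming the scale coefficients enter as an affine scaling of E (the text does not spell this out) and reusing the reward weights 20, 30 and 1 given later as defaults:

    def fitness(e_t, eps_s, E, w1=20.0, w2=30.0, w3=1.0, lam=1e-6, mu=0.0):
        """Weighted-sum normalization of the three objectives, Eq. (11).
        lam/mu rescale the energy term so its magnitude is comparable to
        the time and distance errors (assumed affine form)."""
        return w1 * e_t + w2 * eps_s + w3 * (lam * E + mu)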

C. DESIGN OF DQN CONTROLLER

1) DEFINITION OF DQN TRAIN CONTROLLER
The DQN algorithm uses a neural network (the agent in the reinforcement learning model) as the train controller and trains the agent on historical train operation data. Given the current running state $s_i(x, v, t)$, the agent outputs the Q value of every condition-transition action $a_i(x_i, u_l)$ available in that state, and selects the next condition transition sequence $u_l$ and the position $x_i$ of the transition point according to the maximum Q value. θ denotes the network parameters of the neural network at the current moment. The structure of the neural network model is shown schematically in Figure 2.
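A minimal PyTorch sketch of such a Q network follows, using the 3-dimensional state (x, v, t), the two hidden layers of 24 neurons given in the parameter settings of Section IV, and one output per discrete action; the action count of 11 (five traction notches, coasting, five braking notches) follows the action-space design below.

    import torch
    import torch.nn as nn

    STATE_DIM = 3     # state s_i = (x, v, t)
    N_ACTIONS = 11    # 5 traction notches + coasting + 5 braking notches

    def build_q_network():
        """Two hidden layers of 24 neurons; one Q value per discrete action."""
        return nn.Sequential(
            nn.Linear(STATE_DIM, 24), nn.ReLU(),
            nn.Linear(24, 24), nn.ReLU(),
            nn.Linear(24, N_ACTIONS),
        )

    q_net = build_q_network()
    target_net = build_q_network()
    target_net.load_state_dict(q_net.state_dict())  # theta^- = theta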

2) DESIGN OF CONTROLLERS
On the MATLAB simulation platform, the agent determines the optimal ATO control strategy by interacting with a simulation environment of the train operation line. As the interaction process in Figure 3 shows, the state space, action space, reward function and experience pool all take part in the real-time update of the DQN algorithm; the key elements of the ATO control process are designed as follows.

State space: from the multi-objective optimal operation model of urban rail trains above, the train's position, speed and running time on the line at the current moment reflect its current running state and can therefore serve as the state space, defined as $s_i(x, v, t)$.
Action space: during automatic train operation, the train is controlled according to its operating state, i.e. its position, speed and running time, combined with the line conditions, by changing the transition sequence $u_l$ of traction, coasting and braking conditions at the train's current position $x_i$, so that the train meets the requirements of safety, punctuality, energy saving and accurate stopping in its current state. Since the DQN control algorithm operates on a discrete action space, the traction, coasting and braking conditions are discretized; to ensure control accuracy, the action $a_i$ consisting of the condition transition sequence $u_l$ and the transition point $x_i$ is designed as

$$a_i = (x_i, u_l), \quad i = 1, 2, \cdots, k-1 \tag{12}$$
$$u_l \in \{u_0^1, u_0^2, u_0^3, u_0^4, u_0^5,\; u_1,\; u_2^1, u_2^2, u_2^3, u_2^4, u_2^5\} \tag{13}$$
where $x_i$ is the location of the changeover point along the train operation line, $x_i \in [0, S]$, $i = 1, 2, \cdots, k-1$; $a_i$ is the action; $u_0^1, u_0^2, u_0^3, u_0^4, u_0^5$ are the five traction-force levels under traction condition $u_0$; $u_1 = 0$ is the coasting condition with zero traction force; and $u_2^1, u_2^2, u_2^3, u_2^4, u_2^5$ are the five braking-force levels under braking condition $u_2$.

Reward function: the ATO control problem is a typical sparse-reward problem, i.e. when minimizing the energy consumption of train operation under the conditions of punctual operation and precise stopping, it is difficult for the agent to obtain any reward at the initial stage. To ensure that the ATO system can learn the optimal control sequence with good transferability and little computation, the reward function should be as simple as possible; following the principle of ''light reward and heavy punishment'', the reward at time t is defined as

$$r_t = \omega_1 r_1 + \omega_2 r_2 + \omega_3 r_3 + r_4 \tag{14}$$

where $r_1$ and $r_2$ are the rewards for punctuality and stopping accuracy during the run; $r_3$ is the reward for the energy consumption of state transfers within the round, i.e. each state-transfer step is rewarded as a function negatively correlated with its energy consumption; and $r_4$ is the penalty for early termination of the round, a large negative reward obtained when training terminates early or ends unexpectedly. The role of $r_3$ is to guide the ATO control strategy toward energy-efficient operation; to avoid early termination of rounds, a large early-termination penalty $r_4$ is set. $\omega_1$, $\omega_2$ and $\omega_3$ are the reward coefficients for punctuality, stopping accuracy and energy consumption, set to 20, 30 and 1, respectively. To match actual operating conditions and avoid the safety impact of excessive speed, the train's running speed must always remain below the line speed limit $v_{\lim}$; the error $e_t$ between the planned running time T and the actual running time T′ must be less than the specified time error; the error $\varepsilon_s$ between the full line length S and the actual stopping distance S′ must be less than the specified stopping error; and the energy consumption of each state-transfer step must be less than the single-step maximum E. The end condition of a training round is therefore set as

$$done = \begin{cases} 1, & v > v_{\lim} \ \text{or}\ |e_t| > 1 \ \text{or}\ E > 10 \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

where $v_{\lim}$ is the speed limit of the train operating line and E is the maximum energy consumption of a single-step state transfer, 10 MJ.
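The reward of Eq. (14) and the termination test of Eq. (15) might be coded as follows. This is a sketch: the shapes of $r_1$ and $r_2$ are supplied by the caller, and the size of the early-termination penalty (-100 here) is an assumption consistent with the ''heavy punishment'' principle, not a value given in the text.

    W1, W2, W3 = 20.0, 30.0, 1.0   # reward coefficients from the text
    E_STEP_MAX = 10.0              # max single-step energy consumption, MJ
    E_T_MAX = 1.0                  # punctuality error limit, s

    def terminated(v, v_lim, e_t, e_step):
        """Round end condition of Eq. (15)."""
        return v > v_lim or abs(e_t) > E_T_MAX or e_step > E_STEP_MAX

    def reward(r1, r2, e_step, done_early):
        """Eq. (14): weighted punctuality/stopping rewards, an energy term
        negatively correlated with per-step consumption, and a large
        early-termination penalty (assumed value)."""
        r3 = -e_step                       # negative correlation with energy use
        r4 = -100.0 if done_early else 0.0 # 'heavy punishment' term
        return W1 * r1 + W2 * r2 + W3 * r3 + r4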
Experience pool: to reduce the correlation and non-stationary distribution problems of the consecutive train operation states $s(x, v, t)$ fed to the neural network, the transfer samples $(s, a, r, s')$ obtained from the agent's interaction with the environment at each time step are stored in an experience pool. Minibatches are drawn from the pool at random for training, which breaks the sample correlation and improves sample utilization, avoiding the gradient non-convergence caused by the DQN algorithm performing gradient descent in the same direction over a consecutive period. The framework of the DQN control algorithm for automatic train driving is shown in Figure 3.
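The experience pool with uniform minibatch sampling can be sketched with a bounded deque; the capacity D = 15000 and sample size SM = 400 follow the parameter settings in Section IV.

    import random
    from collections import deque

    class ReplayPool:
        """Experience pool storing (s, a, r, s', done) transfer samples."""
        def __init__(self, capacity=15_000):
            self.buffer = deque(maxlen=capacity)  # old samples are evicted

        def store(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=400):
            # Uniform random sampling breaks the temporal correlation of
            # consecutive train states fed to the network.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)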

IV. UPDATE OF CONTROLLER PARAMETERS
The DQN algorithm uses a neural network with weight parameters θ as the Q-network model of the action-value function; the action-value function $Q^{\pi}(s, a)$ is approximated by this neural network model $Q(s, a, \theta)$:

$$Q(s, a, \theta) \approx Q^{\pi}(s, a) \tag{16}$$
The DQN algorithm uses another neural network model $Q(s, a, \theta^-)$ as the target Q network to calculate the target Q value (Target Q), and then defines the objective function with the mean square error as the loss function of the neural network:

$$\text{Target } Q = r + \gamma \max_{a'} Q(s', a', \theta^-) \tag{17}$$
$$\text{loss}(\theta) = \mathbb{E}\left[\left(\text{Target } Q - Q(s, a, \theta)\right)^2\right] \tag{18}$$
where $s'$ and $a'$ are the train running state and the action to be performed at the next time step, r is the reward obtained by the agent at each step, and γ is the discount factor. The parameters of the Q network are updated with the stochastic gradient descent algorithm:

$$\theta \leftarrow \theta + \alpha \left[\text{Target } Q - Q(s, a, \theta)\right] \nabla_{\theta} Q(s, a, \theta) \tag{19}$$

where α is the Q-network learning rate. Finally, the target network parameters are updated according to the loss function: every C iterations the Q-network parameters are copied to the target network, $\theta^- = \theta$, so that the target network remains relatively stable and the correlation between the outputs of the two networks is reduced. After training, the intelligent controller outputs the action with the maximum value, $\arg\max_a Q(s_i, a, \theta)$, according to the train operation state, while the controller continues to be optimized as the operation scenarios are enriched.

Training parameters: maximum capacity of the experience replay pool D = 15000; target network update period (in steps) C = 120; size of a single random sample drawn from the replay pool SM = 400; training frequency of the neural network 30 steps; update frequency of the target Q network 120 steps.
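Putting Eqs. (17)-(19) together, one gradient step on a sampled minibatch might look as follows (a PyTorch sketch reusing q_net and target_net from the earlier block; in practice the framework's optimizer carries out the hand-written update of Eq. (19), and the learning rate here is an assumed value).

    import torch
    import torch.nn.functional as F_nn

    GAMMA = 0.5   # discount factor from the parameter settings below
    optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # alpha assumed

    def train_step(batch):
        """batch: list of (s, a, r, s', done) transitions from the pool."""
        s, a, r, s_next, done = zip(*batch)
        s = torch.tensor(s, dtype=torch.float32)
        a = torch.tensor(a, dtype=torch.int64)
        r = torch.tensor(r, dtype=torch.float32)
        s_next = torch.tensor(s_next, dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32)
        # Target Q, Eq. (17): y = r + gamma * max_a' Q(s', a'; theta^-)
        with torch.no_grad():
            y = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)
        # Mean squared error loss, Eq. (18)
        q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = F_nn.mse_loss(q, y)
        # Gradient step on theta, Eq. (19)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()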
Neural network parameters: the neural network has 2 hidden layers with 24 neurons each; the discount factor γ is 0.5.
The training procedure of the DQN train controller is as follows:

1) Initialize the Q network with random parameters θ, and initialize the target network parameters θ⁻ = θ.
2) For each training round, initialize the train state $s_1$ and set i = 1.
3) While the round has not terminated:
   a) Select and execute an action $a_i$ according to $Q(s_i, a; \theta)$, and observe the reward $r_i$ and the next state $s'_i$.
   b) Store the state-transfer tuple $(s_i, a_i, r_i, s'_i)$ in the experience replay pool.
   c) Randomly sample a set of SM state-transfer tuples $(s_j, a_j, r_j, s'_j)$ from the experience replay pool.
   d) Set $y_j = r_j$ if the round terminates at step j + 1; otherwise set $y_j = r_j + \gamma \max_{a'} Q(s'_j, a'; \theta^-)$.
   e) Perform a gradient descent step on $(y_j - Q(s_j, a_j; \theta))^2$ with respect to the network parameters θ.
   f) Every C steps, reset θ⁻ = θ.
   g) Set i = i + 1.
4) End while; end for.
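The procedure above maps onto a compact training loop, sketched below by tying together the earlier blocks (q_net, target_net, ReplayPool, train_step, N_ACTIONS). The env object with reset()/step() is an assumed interface to the line simulation, and epsilon-greedy exploration with a fixed rate and the episode count are assumptions, since the text does not state the exploration scheme.

    import random
    import torch

    C_SYNC = 120       # target network update period, from Section IV
    TRAIN_EVERY = 30   # Q-network training frequency, from Section IV
    BATCH = 400        # minibatch size SM
    EPSILON = 0.1      # epsilon-greedy exploration rate (assumed)

    pool = ReplayPool(capacity=15_000)
    step_count = 0

    def select_action(s):
        """Epsilon-greedy action selection over Q(s, a; theta)."""
        if random.random() < EPSILON:
            return random.randrange(N_ACTIONS)
        with torch.no_grad():
            return int(q_net(torch.tensor(s, dtype=torch.float32)).argmax())

    for episode in range(2000):            # number of training rounds assumed
        s, done = env.reset(), False
        while not done:
            a = select_action(s)
            s_next, r, done = env.step(a)
            pool.store(s, a, r, s_next, done)
            step_count += 1
            if len(pool) >= BATCH and step_count % TRAIN_EVERY == 0:
                train_step(pool.sample(BATCH))
            if step_count % C_SYNC == 0:   # every C steps, theta^- = theta
                target_net.load_state_dict(q_net.state_dict())
            s = s_next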

V. SIMULATION ANALYSIS
To verify the effectiveness and robustness of the DQN algorithm for intelligent automatic driving control of urban rail trains, the section from Jiugong Station to Yizhuang Bridge Station on the Beijing Yizhuang Subway line is taken as an example. The inter-station length is 1,982 m, and the speed limit and gradient information of the line [21] are shown in Figure 4.
Real vehicle parameters are used to build the train kinematics model and the simulation operating environment. To meet the discretization requirement of the DQN algorithm's control space, the traction and braking forces of the train are discretized into five-level traction and braking commands, as shown in Table 1.
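The five-level discretization can be represented as a lookup from the action index to the applied force, consistent with the action space of Eqs. (12)-(13). The maximum forces and the equal-spaced level fractions below are placeholders, since Table 1's actual values are not reproduced in the text.

    F_TRACTION_MAX = 310e3   # maximum traction force, N (placeholder)
    F_BRAKE_MAX = 260e3      # maximum braking force, N (placeholder)

    # Action index -> applied force: 0-4 are traction notches u_0^1..u_0^5,
    # 5 is coasting u_1, 6-10 are braking notches u_2^1..u_2^5.
    LEVELS = (0.2, 0.4, 0.6, 0.8, 1.0)

    def action_to_force(a_idx):
        if a_idx < 5:
            return LEVELS[a_idx] * F_TRACTION_MAX   # traction
        if a_idx == 5:
            return 0.0                              # coasting
        return -LEVELS[a_idx - 6] * F_BRAKE_MAX     # braking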

A. SIMULATION EXPERIMENT 1
In this simulation experiment, the operating energy consumption, punctuality and stopping accuracy of the DQN control algorithm and the traditional PID control algorithm are compared on the same line with the same planned time, set to 125 s. The speed-distance curves of the two algorithms are shown in Figure 5.
As the figure shows, although the traditional PID control algorithm can track the given speed-distance curve, there is a deviation between the curve tracked by the PID algorithm and the target speed-distance curve, and a certain error between the train's actual running time and stopping position and the planned ones, which increases the train's energy consumption. The DQN algorithm, by contrast, dynamically adjusts the train driving strategy according to the train's current running state and the line conditions, achieving punctual and accurate inter-station operation while reducing inter-station energy consumption. Table 2 shows the simulation results of the DQN algorithm and the traditional PID algorithm for punctuality, stopping accuracy and operating energy consumption. As Table 2 shows, with a sampling step of 0.1 s under both control strategies, the PID control method gives an actual running time of 125.7 s, an actual running distance of 1982.26 m, and an inter-station energy consumption of 104.04 MJ. The train control strategy based on the DQN algorithm is more flexible: its actual running time is 125.5 s, its actual running distance is 1982.20 m, and its inter-station energy consumption is 91.22 MJ. The experimental results show that, compared with the traditional PID algorithm, the DQN algorithm improves the punctuality and stopping accuracy of the train control strategy and reduces energy consumption by 12.32%.

B. SIMULATION EXPERIMENT 2
To verify the adaptability of the DQN algorithm to the train operating environment and its robustness to parameter variation, the planned inter-station running time is set to 129 s and the basic running resistance of the train is increased by 30%, with other operating conditions unchanged. The speed-distance curves obtained by the DQN algorithm before and after the change in running resistance are shown in Figure 6, and the control performance comparison is shown in Table 3. As Table 3 shows, the punctuality and stopping accuracy errors of the train under the DQN control strategy, both before and after the change in the operating environment, are far below the standard error limits. The DQN algorithm's control of the train operation strategy is a real-time, dynamic adjustment process rather than a fixed control strategy: even in a stochastic operating environment, the train still arrives on time and stops accurately according to the planned travel time, while a corresponding energy-saving control strategy is formulated for the current operating environment.

C. SIMULATION EXPERIMENT 3
The urban rail transit system shows significant time-of-day fluctuations in passenger flow, so the inter-station running times in the timetable can be adjusted reasonably according to dynamic passenger demand. On the premise of ensuring the level of passenger service, dynamically adjusting the inter-station running time across different passenger-flow periods can further reduce the energy consumption of train operation. To verify the effectiveness and adaptability of the DQN train control algorithm under different planned inter-station running times, the running time is set to 125 s, 129 s and 132 s according to actual planned inter-station running times. The train speed-distance and acceleration-distance curves for the three planned running times are shown in Figures 7 and 8. The punctuality error, stopping accuracy error and operating energy consumption corresponding to the three planned inter-station times are shown in Table 4.
According to the simulation analysis of Figures 7 and 8 and Table 4, under the three different planned inter-station running times the train accelerates with large acceleration to near the critical line speed limit, adopts a combination of traction and coasting conditions in the middle of the run, and applies the braking condition when approaching the station. The shorter the planned inter-station running time, the higher the frequency of dynamic adjustment of the train's operating conditions; the more generous the planned travel time, the lower the condition-switching frequency and the more stable the operation. Figures 7 and 8 show that, on the same line and under the same operating conditions, the energy-saving operation strategies for the different planned running times are basically the same, with the same condition transition sequence and transition points. As the planned inter-station time increases, the average inter-station speed decreases and so does the energy consumption: the 129 s running time reduces operating energy consumption by 7.50% relative to 125 s, and the 132 s running time reduces it by a further 6.40% relative to 129 s. Inter-station operating energy consumption is tied to the inter-station energy-saving driving strategy and is negatively correlated with running time, which lays a foundation for energy-saving operation with dynamically planned inter-station travel times.

VI. CONCLUSION
Aiming at the optimization of urban rail train operation indices and the speed curve, this paper proposes a DQN-based automatic driving control method for urban rail trains and optimizes the train operation process. The simulation tests support the following conclusions: a) under the same operating conditions, the DQN control algorithm reduces train operation energy consumption by 12.32% compared with the traditional PID control algorithm, while improving the punctuality of train operation and the accuracy of stopping; b) when the basic running resistance or the planned inter-station time changes, the DQN control algorithm shows good adaptability and robustness to the changes in operating parameters, ensuring safe and punctual train operation while reducing energy consumption; c) compared with the traditional automatic driving control strategy, DQN-based automatic driving control dynamically adjusts the train operation strategy in real time according to the train's current running state and operating environment, ensuring safety and stopping accuracy while reducing the energy consumption of train operation. The research in this paper can provide a theoretical basis for the transformation of automatic driving control of urban rail trains from automation to intelligence.