Time Difference Penalized Traffic Signal Timing by LSTM Q-Network to Balance Safety and Capacity at Intersections

The conflict between limited road resources and rapidly growing car ownership makes traffic signal timing a pivotal challenge. Emerging studies have been carried out on adaptive signal timing, but most of them still focus on the throughput of intersections, leaving safety and travel experience unconsidered. This paper proposes a time-difference-penalized traffic signal timing method based on reinforcement learning to balance safety and throughput capacity in a traffic control system. Firstly, a microscopic state representation is proposed to integrate the dynamics of both traffic lights and road vehicles, including the driver behaviors of lane changing and car-following, the previous phase of the traffic light, and its duration. Secondly, an action space of 8 signal phases and a behavior-aware reward function are designed to resist red-light overflow. Finally, a partial long short-term memory (LSTM) network is trained to balance traffic efficiency and travel experience. In the network training, a parallel sampling method is adopted to obtain experience from multiple environments to accelerate training convergence in practical applications. Experimental results show that the proposed method improves intersection efficiency by up to 14.28% compared with fixed signal timing and by 5.26% compared with DQN, while eliminating red-light overflow time.


I. INTRODUCTION
Today, global car ownership has exceeded one billion, which has led to a series of serious problems, such as traffic congestion, environmental pollution and energy waste [1], [2]. The intersection, as a key node in the urban traffic network, plays a pivotal role in traffic management and its optimization. Growing attention in academic and industrial fields has been paid to traffic light control at intersections [3]. However, the limitation of road resources and the complexity of dynamic traffic flow make it hard to optimize signal timing to improve the capacity of an intersection while keeping it safe [4]. Intelligent traffic signal control is a measure to deal with space conflicts, which not only reduces environmental pollution and energy waste [5], but also improves driving efficiency. How to make full use of the limited resources at intersections is becoming a severe challenge in intelligent traffic management [6].
Signal timing is a collaborative optimization of spatio-temporal resources. State-of-the-art methods are generally carried out under a predetermined scheme, with which the intersection capacity is usually hard to exploit fully [7]. Few of them have focused on conflict avoidance while improving the capacity of intersections [8], [9].
The blooming development of artificial intelligence has brought growing efforts in the field of adaptive signal control [10]. Through reinforcement learning (RL), an agent interacts with the environment and explores reasonable actions. RL has been employed as a promising solution to optimize traffic signal timing and control without relying on manual phase planning. Various signal timing methods have been proposed to explore more efficient signal timing, but most of the existing works focus on the single goal of maximizing the throughput of intersections [11]. As reported by the U.S. Department of Transportation, the overflow length of a red light is a crucial factor for safety at intersections [12]. If the traffic flow at an intersection is unbalanced, the busy lane tends to get unlimited priority to maximize the reward, and the free lane encounters an unbearable waiting time, which may lead drivers to drive illegally through red lights. On the contrary, a short signal cycle is a barrier to improving the efficiency of intersections [13]. Therefore, an adaptive signal control system faces a difficult trade-off between safety and capacity. Worse still, existing methods have not considered lane changing, which has a negative effect on intersection efficiency [14].
Aiming at these issues, we propose a novel intelligent signal timing method named Rep-DRQN to perform signal control in this work. After a short overview of the system framework, a state representation with lane-changing sensibility and an action space of full flexibility are proposed to capture the traffic state at intersections and take the most reasonable actions. Then, a novel reward function is designed with the ReLU (Rectified Linear Unit) to deal with intolerable waiting times for a green light. Finally, to achieve efficient processing in complex traffic scenarios, we adopt a parallel learning scheme to accelerate the training process.
This paper is organized as follows: related works on signal control are reviewed in Section II. Then, we present the system modeling and methodology for intelligent signal timing through deep reinforcement learning in parallel environments in Section III. Finally, we present the experimental results and discussion in Section IV and conclude this work in Section V.

II. RELATED WORKS
Various RL methods have been proposed for traffic signal control. In early studies, due to limited computing resources, researchers had to simplify the situation at intersections as much as possible [15]. For example, Thorpe used four elements in the state space, including vehicle count, fixed distance, variable distance and duration [16], and then the performance of two SARSA schemes was analyzed [17]. Abdulhai et al. defined the state as the queue lengths on the four approaches and the elapsed phase time, and the reward function as the total delay of the four approaches [18]; then Q-learning was used to select whether to keep or change the phase. These methods simplify the traffic scene to make real-time calculation possible. However, the simplified methods lose some important features of the traffic state.
With the rapid development of computing devices, several new RL methods emerged. Konda and Tsitsiklis [19] proposed the Actor-Critic method, in which the actor is used to choose the action, while the critic is used to evaluate the chosen action. Under this framework, Aslani et al. [20] designed an actor-critic adaptive traffic signal controller (A-CAT controller) and applied it to the city center of Tehran. Li et al. [21] applied stacked auto-encoders (SAE) to estimate the Q-function with a deep neural network; the action space includes two signal phases, and the average delay can be reduced by about 14%. Gao et al. [22] further used a convolutional neural network to learn the optimal policy, and the algorithm reduces vehicle delay by up to 47% and 86% compared with two other popular traffic signal control algorithms, the longest-queue-first algorithm and fixed-time control, respectively. These methods learn from traffic scenes and make decisions automatically to improve traffic efficiency at intersections.
To guarantee training convergence, researchers have paid attention to maximizing the profit from refined modeling and intelligent control. Xu et al. [23] used a deep recurrent Q-network (DRQN) to fit the state and Q-value. The state was defined as a five-dimensional vector, including the number of vehicles and their average speed at the current intersection, the number of vehicles and average speed at the neighboring intersection, and the signal state; the action included phase change and phase invariance, and the reward was the average delay. Casas [24] employed the deep deterministic policy gradient (DDPG) to receive the information of all detectors and generate the timing of all traffic light phases. Genders and Razavi [25] used the asynchronous n-step Q-learning algorithm to reduce the average vehicle total delay by 40% without affecting throughput. The state-of-the-art adaptive traffic signal control methods are summarized in Table 1.
The methods above have achieved considerable improvement in road throughput, but little research focuses on traffic compliance at the intersection and drivers' travel experience [26], which is crucial for modern traffic management. In this work, we propose a reward-penalty traffic signal timing method with deep reinforcement learning. The state space is modeled with multiple dimensions, including lane-changing behavior and red-light waiting time. A full action space includes all possible choices. The reward function responds to excessive waiting at a red light, and we construct a partial memorial neural network to balance traffic efficiency and travel safety.

III. SYSTEM MODELING AND METHODOLOGY
A. PROBLEM DEFINITION
Definition 1 (Signal-phase Vector (SV)): A typical four-way intersection in right-hand traffic is studied in this paper, and it is assumed that each way has three entry lanes, in which the right one is for turning right, the middle one for going straight and the left one for turning left. All 12 entry lanes in the intersection are tagged from L1 to L12, in the order shown in Fig. 3, and the signal statuses of all entry lanes are denoted as the vector <L1, L2, . . . , L12>, which is named the Signal-phase Vector (SV).
The signal status is represented as a color, including Red, Green and Yellow, so SV is a color series. For example, <GGGGRRGRRGRR> means that the signals in lanes L1 to L4, L7 and L10 are green and the others are red. It is noteworthy that the four right-turn lanes, L1, L4, L7 and L10, are assumed to be permanently green because there is no conflict between them and the others.
Definition 2 (Action Vector (AV)): The signal action is to set the signal color in all lanes, so, like SV, the action is represented as a vector of the destination color series. This vector is named the Action Vector (AV). It is noteworthy that a color change in a lane automatically inserts a fixed-length Yellow phase between the former and the latter colors, so there is no Yellow phase in any action vector.
Definition 3 (Road-segment State vector (RSV)): Each lane in the intersection is divided into equal segments by a length of 2 meters, and each segment is described as a three-dimensional state vector. This state vector is named as Road-segment State Vector (RSV), in which <0, 0, 0> indicates that there is no vehicle on the road segment; <1, 0, 0> indicates that there is a vehicle with left steering-lamp on in the segment; <0, 1, 0> indicates that there is a vehicle traveling straight, and <0, 0, 1> indicates that there is a vehicle with right steering-lamp on.
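As a minimal illustration of the RSV encoding of Definition 3, the sketch below maps a list of vehicles in one lane onto per-segment one-hot vectors. The helper names, the vehicle tuple format and the 50-segment assumption are ours, not the paper's:

```python
# Sketch of the RSV encoding (Definition 3): each 2 m segment gets a
# 3-dim one-hot vector [left-signal, straight, right-signal].
SEGMENT_LEN = 2.0   # meters per segment, as defined in the paper
NUM_SEGMENTS = 50   # assumed approach length of 100 m per lane

def encode_lane(vehicles, num_segments=NUM_SEGMENTS):
    """vehicles: list of (distance_to_stopline_m, turn), turn in {'L','S','R'}."""
    rsv = [[0, 0, 0] for _ in range(num_segments)]
    turn_index = {'L': 0, 'S': 1, 'R': 2}
    for dist, turn in vehicles:
        seg = int(dist // SEGMENT_LEN)       # which 2 m segment the car occupies
        if 0 <= seg < num_segments:
            rsv[seg][turn_index[turn]] = 1
    return rsv
```

For instance, a vehicle 3 m from the stop line with its left steering lamp on sets segment 1 to <1, 0, 0>, while empty segments stay <0, 0, 0>.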
Definition 4 (Bearable Red-light Duration (BRD)): In unbalanced traffic, keeping a green light in some lanes might increase the intersection's throughput, but it may lead to an endless red light in some other lanes. To avoid an unbearable red light, the Bearable Red-light Duration (BRD) is set as a custom parameter describing travelers' tolerance of red-light duration, and we set BRD to 300 seconds in this work.
Definition 5 (Excessive-duration Marking Vector (EMV)): To make the signal timing system sensitive to long red-light durations at each entry lane, each of the 12 lanes is marked according to whether its red-light duration exceeds BRD, and these Boolean markings form a 12-dimensional vector, named the Excessive-duration Marking Vector (EMV).

Definition 8 (Episode): Unlike the game of Chess, there is no terminal state in a signal timing system, so in order to evaluate and update the timing policy periodically, the Episode is defined as a custom parameter and is set to 6000 seconds in this work.
B. SYSTEM FRAMEWORK
An ideal urban traffic signal control system should be able to respond to dynamic traffic flow and perform online optimization in time [27]. In this work, RL is employed to learn from the traffic flow at an intersection and then optimize signal timing online. The RL framework in this work, as shown in Fig. 1, can be divided into two parts: the system agent and its environment. The environment represents the real traffic scene, simulated by SUMO, the most popular traffic simulation platform, developed by Krajzewicz et al. [28]. The agent is the signal controller, which observes the environment and takes actions on its own.
The basic idea of RL is that the agent takes an action a_n according to the current strategy in state s_n of the environment. After the action is executed, the environment transfers to the next state s_{n+1}. At the same time, the reward r_n of this action is fed back to the agent, from which the agent can evaluate the effect of the last action and improve the strategy while exploring the environment. The agent constantly adjusts its strategy until the optimal strategy is eventually found. In an RL system, the state space, action space and reward function should be defined elaborately. In intelligent traffic signal control, the state space should be full of information reflecting the traffic flow patterns that decide different signal timings, the action space should include all kinds of signal phases without conflict, and the reward function should evaluate each action taken by the agent with a reasonable value. With these three key models defined, a deep neural network with a suitable structure is designed to find the relationship between states and actions, i.e., f(state) = Q-value.

C. SYSTEM MODELING FOR REP-DRQN 1) STATE REPRESENTATION
In an intelligent signal system, the control agent can choose reasonable actions based on observed states only if it is sensitive to environmental changes in the state representation. That is why the state representation needs to include all key potential features of the traffic flow at the intersection.
These key features include the vehicles' state, expressed as the spatial distribution of vehicles. Several modeling methods for vehicle distribution have been proposed, e.g., a spatial matrix. Liang et al. proposed a modeling method that generates a matrix representing the vehicle distribution at the intersection, containing vehicle position and velocity information; a convolutional neural network is then used to obtain the traffic flow features [29]. However, this modeling transforms all the space covering the intersection into grids to form a matrix, whether it is in a lane or off the lane. As a consequence, it generates lots of redundant data, which decreases the training efficiency and convergence speed.
In this work, we pay attention not only to the vehicles' position distribution but also to the drivers' lane-changing behavior. We construct a matrix with the vehicle locations labeled by lane and distance to the intersection. Besides, lane-changing behavior is also taken into consideration: we observe each vehicle's steering lamp and include it in the matrix, which enables the intelligent agent to understand drivers' behavior.
Each lane is divided into several segments according to the distance and expressed as RSVs. When there is a vehicle at a certain distance in the lane, the corresponding element in the RSV is marked as 1, otherwise 0. However, when a vehicle changes lanes or first appears on the road network, the occupied lane segment changes. In this case, the status of the vehicle's steering lamp is fed to the neural network to indicate the future direction of the vehicle. In detail, we consider the RSVs as the first state S_t^(1). The specific representation is also shown in Fig. 2.

Besides making full use of the resources of the intersection, the psychology of drivers should also be considered to improve traffic safety. As reported by the Federal Highway Administration of the U.S. Department of Transportation, if a driver waits too long at an intersection, his patience might be exhausted and he may then violate traffic regulations [12]. Therefore, we propose a method to make the neural network sensitive to drivers' waiting time and then tune a suitable waiting time to balance traffic efficiency and safety. We construct a vector S_t^(2) from the EMV, which is the red-light duration beyond BRD in each lane. If the current red-light duration is less than BRD, the corresponding entry of EMV is defined as zero. The formula of EMV is shown in (1):

EMV_i = max(L_i − BRD, 0)    (1)

where EMV_i is the excessive duration of lane i, L_i is the current red-light duration of lane i, and BRD is the bearable duration of waiting for a green light, which can be adjusted according to practical conditions. These states are transmitted to the neural network, and the network responds by updating its weights and biases.
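The per-lane excessive-duration computation can be sketched in a few lines; the function name is ours, and the 300-second BRD is the value chosen in this work:

```python
# Sketch of the EMV computation: per-lane red-light time beyond BRD.
BRD = 300  # bearable red-light duration in seconds, as set in this work

def compute_emv(red_durations, brd=BRD):
    """red_durations: current red-light duration (s) of each of the 12 lanes;
    returns the excessive duration of each lane, zero if within BRD."""
    return [max(d - brd, 0) for d in red_durations]
```

A lane that has shown red for 310 s thus contributes an excess of 10, while any lane at or under 300 s contributes 0.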
Additionally, considering that the current signal is also an important factor in the decision, we add the current signal phase as the third part of the state representation in the system, denoted as S_t^(3). It is a vector encoded with one-hot encoding [30] for the current signal phase, which is encoded from the 8 non-conflicting signal phases shown as actions in Table 2. It is worth noting that the yellow light is only an isolation phase between different phases, which plays no role in relieving traffic flow, so the yellow light is skipped when taking the previous signal phase.

2) ACTION SPACE
In order to improve the efficiency of the intersection, the intelligent agent needs to choose a proper action from the action space, so the flexibility of the action space has a considerable effect on the feasibility of the agent. In this work, we manage to construct the most flexible action space for the intelligent signal light control scheme. First of all, we assume that each lane has its own direction and label, as shown in Fig. 3.
We enumerated and checked all possible signal phases for each lane in the intersection. With two lanes (excluding the right-turn lanes) being green at the same time, there are 8 non-conflicting signal phase pairs, which are represented as color series in Table 2 and employed as the 8 actions of the DQN. The collection of these 8 actions forms the action space, and as shown in Table 2, these 8 actions are marked as a1∼a8. It is noteworthy that we take no account of the right-turn lanes because they have no conflict with the others and are always accessible.
In Table 2, the direction of traffic flow is defined in two parts, i.e., the traffic entrance direction and the allowed vehicle directions. The color series of the signal lights corresponds to the lane labels L1∼L12 (as defined in Fig. 3) in turn. For example, for action a2, the direction of traffic flow is E-SL, which means that vehicles entering from the east (E) can go straight (S) and turn left (L), and the corresponding color series is <GRRGGGGRRGRR>.
Through these signal phase combinations, 8 actions can be selected at each timestep according to the current state. Besides, an 8-dimensional vector is built with one-hot encoding to represent the current signal phase, which is the third part of the state input, symbolized as S_t^(3). For example, the phase status S_t^(3) of W-SL can be represented by the vector <0, 0, 0, 1, 0, 0, 0, 0>.
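The one-hot phase encoding above is straightforward to sketch, assuming actions a1∼a8 are indexed 0 to 7 in the order of Table 2 (the indexing convention is ours):

```python
# Sketch of the one-hot encoding of the current phase for S_t^(3).
NUM_PHASES = 8  # the 8 non-conflicting phases a1..a8

def one_hot_phase(phase_index, n=NUM_PHASES):
    """phase_index: 0-based index of the current phase among a1..a8."""
    v = [0] * n
    v[phase_index] = 1
    return v
```

With this convention, the W-SL example corresponds to index 3 and yields <0, 0, 0, 1, 0, 0, 0, 0>.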
When an action is the same as the previous one, it prolongs the current phase. We set the timestep to a small value for adjusting the phase with the most flexibility, which is one second in this work. Note that when the signal phase changes, the environment inserts a yellow light before the new signal light.

3) REWARD WITH A PENALTY FOR RED-LIGHT OVERFLOW TIME
The reward function is employed to evaluate the selected action at each timestep. A reward function lacking full consideration usually leads to a bad training result. In some research, the reward function was designed to deal with the issue of traffic efficiency. Liang et al. set the reward as the cumulative waiting time between two neighboring intersections [29]. Mousavi et al. set the reward as the total delay of the vehicles between two adjacent moments [31]. In this work, we use the throughput of the intersection as the reward function. However, the observation duration is short, just one second in this work, so the throughput of the intersection in one timestep is too small to be meaningful. Therefore, we consider the throughput of the intersection within a sliding time window, as shown in (2):

R_t = Σ_{i=t−w+1}^{t} T_i    (2)

where T_i represents the throughput of the intersection at timestep i and w is the length of the sliding time window. Since the traffic lights do not affect right-turning vehicles, the reward excludes the number of right-turning vehicles. The reward function is thus represented as the vehicle throughput at the intersection within the sliding time window. Notably, during phase switching, the throughput may still increase, so the window should include the duration of the yellow light. In addition, as discussed above, avoiding excessively long red-light durations is critical to balancing traffic efficiency and safety at the intersection, so we design a corresponding penalty item in the reward function, which accumulates the red-light duration exceeding BRD in each lane and takes its square root to avoid a sharp switch. The final reward function is shown in (3):

r_t = Σ_{i=t−w+1}^{t} T_i − sqrt( Σ_j max(L_jt − BRD, 0) )    (3)

where L_jt is the red signal duration of lane j at time t. The square root avoids a sharp cut-off and keeps the change smoother for safety.
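The penalized reward, combining windowed throughput with the square-root overflow penalty, can be sketched as follows; the function and argument names are ours:

```python
# Sketch of the penalized reward: windowed throughput minus the square root
# of the accumulated red-light overflow beyond BRD.
import math

BRD = 300  # bearable red-light duration (s), as set in this work

def reward(throughputs, red_durations, brd=BRD):
    """throughputs: per-timestep throughput T_i over the sliding window;
    red_durations: current red-light duration L_j of each lane."""
    gain = sum(throughputs)
    overflow = sum(max(l - brd, 0) for l in red_durations)
    return gain - math.sqrt(overflow)
```

The square root tempers the penalty: four seconds of accumulated overflow subtract only 2 from the reward, so phase switches are encouraged gradually rather than abruptly.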

D. METHODOLOGY 1) REWARD-PENALTY DRQN DESIGN
The optimal signal phase can be found through Q-learning, with the state space, action space and reward function defined above. The Q-value is updated by (4):

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]    (4)

where Q(s_t, a_t) is the Q-value of action a_t at timestep t, α is the learning rate, r_t is the reward of the action a_t, and γ is the discount factor, generally between 0.8 and 1. With a larger γ, the agent pays more attention to subsequent rewards, while with a smaller γ, it pays more attention to the current reward.
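As a minimal tabular illustration of this Q-learning update (the dictionary-backed table and parameter defaults are ours):

```python
# Tabular sketch of the Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q: dict mapping (state, action) -> value; returns the updated Q(s,a)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)  # TD target
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]
```

Starting from an all-zero table, one reward of 1.0 moves Q(s0, a) to alpha * 1.0 = 0.1, and repeated visits converge toward the discounted return.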
If the state space of the environment is tremendous, original Q-learning becomes a nightmare to deal with, because it is hard to converge. The Deep Q-Network (DQN), a combination of deep learning and RL, was proposed to fit the Q-values, using a loss function to adjust the weights and biases of the neural network. Deep RL has now been widely used in various fields, such as the game of Go, self-driving and signal timing. In this work, we employ DQN to train the intelligent agent at the intersection with its environments, given the continuous state. The loss function is shown in (5):

L(θ_i) = E_{(s,a,r,s′)∼U(D)} [ ( r + γ max_{a′} Q(s′, a′; θ_i^−) − Q(s, a; θ_i) )^2 ]    (5)

where θ_i is the parameter of the neural network at the i-th iteration, U(D) denotes uniform sampling from the experience set D collected from multiple environments, and θ_i^− is the parameter of the target network. In this way, the parameters of the neural network are updated iteratively through the interaction between the agent and the environment, as shown in Fig. 4. Finally, the Q-value produced by the neural network approximates the real Q-value.
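The TD-target loss over one sampled minibatch can be sketched with plain Python lists standing in for the network outputs; q_next_target and q_taken_online are assumed to be the rows predicted by the target and online networks respectively (names ours):

```python
# Sketch of the DQN loss on a minibatch: mean squared error between the
# TD target y = r + gamma * max_a' Q(s',a'; theta^-) and Q(s,a; theta).
def dqn_loss(rewards, q_next_target, q_taken_online, gamma=0.99):
    """rewards[k]: reward of sample k; q_next_target[k]: target-network
    Q-values over actions for the next state; q_taken_online[k]:
    online-network Q-value of the action actually taken."""
    loss = 0.0
    for r, q_next, q_sa in zip(rewards, q_next_target, q_taken_online):
        y = r + gamma * max(q_next)   # TD target with frozen parameters
        loss += (y - q_sa) ** 2
    return loss / len(rewards)
```

In a real implementation the gradient of this quantity with respect to the online parameters drives the optimizer step; the frozen target parameters are refreshed only periodically.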
We add a partial long short-term memory (LSTM) network, since traditional recurrent neural networks raise the problem of gradient vanishing or explosion. We characterize the three parts of the input data according to their features and endow the road situation state in the first part with memory ability, because of the heavy correlation between successive road situations. The neural network proposed in this work is shown in Fig. 5, in which the state S_t^(1) is propagated to the LSTM network to avoid the problem of gradient vanishing or explosion.

2) MODEL TRAINING AND LEARNING
In the DQN setting, the experience data gained from multiple environments are collected and stored in a replay memory D for the agent's sampling. The replay memory is an allocated memory that stores experience data in DQN and allows us to reuse these data later; each experience item includes the last state, the last action, the current reward for the last action, and the current state (S_t, a, r, S_{t+1}) generated from the last action. After the replay memory has reached a certain capacity, the agent randomly extracts a batch of experiences to break the correlation between them and calculates the temporal difference error (TD-error). Finally, the Q-value from the neural network approximates the target Q-value.
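A bounded replay memory with uniform minibatch sampling is a standard DQN component; the sketch below uses an eviction queue, with the capacity value chosen for illustration only:

```python
# Sketch of a bounded replay memory D with uniform random sampling.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest items evicted when full

    def push(self, s, a, r, s_next):
        """Store one transition (S_t, a, r, S_{t+1})."""
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniform random minibatch, breaking temporal correlation."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Uniform sampling across a large buffer is what decorrelates consecutive transitions before the TD-error is computed.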
However, only one experience item (S_t, a, r, S_{t+1}) can be generated in each iteration, while the DQN algorithm requires lots of experiences, making its time complexity tremendous. In 2016, Google DeepMind proposed an asynchronous method [32] to generate experiences (S_t, a, r, S_{t+1}) simultaneously with multiple environments. In this work, we similarly generate experiences by interacting with multiple environments in parallel.
In RL, enhancing data independence is an important way to improve data productivity and training efficiency [33]. Unfortunately, in urban traffic signal control, although experiences can be generated by multiple environments at the same time, there are various uncertainties in the environment, and the experiences may still be correlated. Therefore, we propose an asynchronous sampling method to achieve quicker convergence and a better guarantee of irrelevance between experiences. After multiple environments have gained their own experiences, all the experiences are gathered together and put into the replay memory D for the agent's sampling, as shown in Algorithm 1. It is worth noting that we do not set a termination signal in Algorithm 1, because there is usually no termination feature in a traffic signal control system.
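The gather-then-merge pattern can be sketched as below, with a thread pool standing in for the paper's multiple SUMO instances and env_step as a hypothetical worker stub (a real worker would drive a simulator and an exploration policy):

```python
# Sketch of parallel experience gathering: several environment workers each
# produce a transition, and all transitions are merged into one memory.
from concurrent.futures import ThreadPoolExecutor

def env_step(env_id, state):
    # Stub worker: a real one would step SUMO, pick an action, observe.
    action, reward, next_state = env_id % 8, 1.0, state + 1
    return (state, action, reward, next_state)

def gather_experience(num_envs, states, memory):
    """Run num_envs workers in parallel and append all transitions to memory."""
    with ThreadPoolExecutor(max_workers=num_envs) as pool:
        for transition in pool.map(env_step, range(num_envs), states):
            memory.append(transition)
    return memory
```

Because the merged pool mixes transitions from independently evolving environments, a uniform sample from it is less correlated than consecutive steps of any single environment.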
As shown in Algorithm 1, the neural network is trained on the experience data saved in the memory D. In this method, a probability factor ε is provided to avoid falling into a local optimum through partial randomness. We randomly generate a value from 0 to 1; when this value is less than ε, we randomly choose an action, otherwise we select the action corresponding to the maximum Q-value. In Algorithm 2, the state S_t is sampled from the parallel generations of multiple simulation environments, the action a_t is selected by the agent according to the state S_t, and the environment feeds back a reward r_t and then generates the next state S_{t+1}. All of these data are saved as a vector (S_t, a_t, r_t, S_{t+1}) in the replay memory.

Algorithm 1 Training of Rep-DRQN
Input: replay memory D with capacity N
Output: parameters θ of the network
1: repeat
2:   if the length of the replay memory D has reached a certain capacity then
3:     sample a random minibatch (S_t, a_t, r_t, S_{t+1}) from D
4:     set y_t = r_t + γ · max_{a′} Q(S_{t+1}, a′; θ^−)
5:     perform a gradient descent step on (y_t − Q(S_t, a_t; θ))^2 with respect to the network parameters θ

After enough experience has been generated, the training module of Algorithm 1 starts to train the neural network, while Algorithm 2 works simultaneously to generate experiences consecutively.

IV. EXPERIMENTAL RESULTS AND DISCUSSION
With the system modeling and algorithm design introduced above, the system implementation and experimental results are presented in detail in this section.

A. EXPERIMENTAL ENVIRONMENT
In this work, the microscopic traffic simulation software SUMO is employed to construct the experimental traffic environment, and TensorFlow is used as the deep learning framework to build the neural network. The operational environment for the simulation experiments is a 64-bit computer with 16GB memory, equipped with an NVIDIA GeForce GTX 1080Ti GPU with 11GB memory for training the neural network.

B. SYSTEM EXPERIMENTS 1) STATE GENERATING
As introduced in Section III above, the state data input to the neural network consist of three parts: the state of the road and vehicles, driving behavior, and the colors of the traffic lights in the last phase.
Assuming that each lane of every leg at the intersection indicates a specific turning, we assign the directions of the lanes as left turn, straight ahead and right turn in order, as shown in Fig. 3, and we then divide each lane into 50 equal segments of 2 meters each to code the road states and position the vehicles. Owing to this gridded vehicle positioning, the state S_t^(1) is a tensor of shape (20, 12, 50, 3), which covers the traffic status in the last 20 seconds, 50 segments in all 12 lanes, and 3 kinds of states in each road segment.
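Assembling the 20-second stack, including the zero-padding used before enough history exists, can be sketched with a fixed-length queue (helper names ours):

```python
# Sketch of the (20, 12, 50, 3) state stack S_t^(1): a rolling window of the
# last 20 one-second frames, pre-filled with zeros at episode start.
from collections import deque

HISTORY, LANES, SEGMENTS, CHANNELS = 20, 12, 50, 3

def make_history():
    """Start with 20 all-zero frames so the stack is full from step one."""
    zero_frame = [[[0] * CHANNELS for _ in range(SEGMENTS)]
                  for _ in range(LANES)]
    return deque([zero_frame] * HISTORY, maxlen=HISTORY)

def push_frame(history, frame):
    history.append(frame)   # the oldest second is evicted automatically
    return list(history)    # nested list of shape (20, 12, 50, 3)
```

The first 20 seconds of an episode therefore mix real frames with zero padding, matching the paper's choice to fill missing history with zeros and skip storing those steps in the replay memory.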
As mentioned above, the second and third parts of the state, S_t^(2) and S_t^(3), are the current red-light durations and the colors of the last phase. Among them, S_t^(2) is the red signal duration of the 12 lanes corresponding to the traffic lights. Since the right-turn signals are always green, the corresponding values of S_t^(2) for these lanes are always 0. In the simulation, we set the total traffic simulation time to 6000 seconds to strike a moderate balance between the experiment duration and its effect. It is worth noting that within the first 20 seconds there is not enough state data as input, so we fill the lacking states in the first 20 seconds with zeros and do not save them into the replay memory. For the third part, we generate the state by saving the last action, since S_t^(3) is equal to a_{t−1}. Then, we employ 12 simulation environments to generate transitions simultaneously.

2) DEEP NEURAL NETWORK AND RL
Since only S_t^(1) has a strong spatial-temporal relation, we design an LSTM network with 2,560 cells for it. The other two parts are connected to the fully connected layer directly. The deep network has two fully connected layers with 10,240 and 2,560 neurons respectively, and the output is the Q-values of the 8 signal phases. We randomly extract 64 transitions from the replay memory as input to the neural network, and then use the Adam optimizer to optimize the network with a learning rate of 10^−6. We set the discount factor γ to 0.99 and randomly select an action with a probability ε decreasing linearly from 0.01 to 0.0007. Meanwhile, we set the initial ε to a low value, since the color alternation is not reasonable before training. Then, we calculate the reward by counting the vehicles going straight and turning left. Considering that vehicles can change lanes, we exclude the lane-changing vehicles and calculate the throughput as shown in (6):

T_t = SL_t − RSL_{t+1}    (6)

where T_t is the number of vehicles passing through the intersection, excluding right turns, at time t; SL_t represents the total number of vehicles on the straight and left-turn entry lanes at time t; and RSL_{t+1} represents the vehicles on the straight, left-turn and right-turn entry lanes at time t + 1. Finally, we calculate the vehicle throughput during the latest 20 seconds as a part of the reward function.
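One way to realize this throughput count, assuming the simulator exposes per-lane vehicle-ID sets (the set-difference formulation and names are ours), is:

```python
# Sketch of the throughput count: vehicles that were queued on the
# straight/left entry lanes at t and are no longer on any entry lane at t+1
# are counted as having passed through.
def throughput(sl_t, rsl_t1):
    """sl_t: vehicle IDs on straight/left entry lanes at time t;
    rsl_t1: vehicle IDs on all entry lanes (incl. right turn) at t+1.
    A vehicle that merely changed into a right-turn lane is still present
    in rsl_t1 and is therefore not counted."""
    return len(set(sl_t) - set(rsl_t1))
```

Including the right-turn lanes in the t+1 set is what excludes lane-changing vehicles from the count, as the text above requires.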
In order to prevent a long-lasting red light, a negative reward is given when the red signal duration exceeds BRD, as shown in (7):

P_t = − sqrt( Σ_j max(L_jt − BRD, 0) )    (7)

where L_jt represents the red signal duration of lane j at time t. It is worth noting that this penalty function for red-light overflow differs from a fixed maximal phase length, because the penalty function helps to change phases smoothly and balances traffic safety and the capacity of the intersection [12].

C. RESULTS AND DISCUSSION
Focusing on traffic efficiency and safety at the intersection, we mainly capture statistical information on vehicle throughput, waiting number and travel time to evaluate the system performance. The data and their analysis are presented in the following subsections.

1) TRAFFIC EFFICIENCY
As two typical methods, DQN and fixed signal timing (FST) are employed as baselines in this work. FST is a traditional method for signal timing that is still widely used, while DQN is the main method recently employed for adaptive signal timing. As a typical setting, each phase duration of FST is set to 30 seconds; its actions are limited to 4 phases, namely a1, a2, a3 and a4 in Table 2, and each of its phase changes inserts a yellow light for 3 seconds. After every episode, we counted the number of vehicles going straight and turning left through the intersection, and then compared it with the DQN and FST baselines. After 250 episodes, the throughput and waiting number of vehicles are shown in Fig. 6 and Fig. 7 respectively.
As shown in Fig. 6, the throughput with FST fluctuates around 3,500. With DQN, the throughput rises to 3,800 after 100 epochs, while with Rep-DRQN it mounts up to 4,000 after 100 epochs and converges after 150 epochs.
The comparison of waiting queues is shown in Fig. 7, where the waiting number is the average number of vehicles waiting for a green light in the lanes. As shown in the figure, FST's waiting number is about 65; DQN's decreases to 58 after 100 epochs, and Rep-DRQN's decreases to 40 after 150 epochs.

Further, the halting time and travel time are compared among these methods. As shown in Fig. 8, FST's halting time is about 75 seconds and its travel time about 105 seconds; DQN's halting time decreases to 60 seconds after 100 epochs and its travel time to 80 seconds. In contrast, Rep-DRQN's halting time decreases to 49 seconds after 150 epochs and its travel time to 60 seconds. With the proposed Rep-DRQN method, vehicles pass through the intersection on average about 45 seconds faster than with FST and 20 seconds faster than with DQN, which significantly improves the travel experience and contributes to energy saving.

2) TRAVEL EXPERIENCE AND SAFETY
Besides the efficiency of the intersection, safety is another challenge for the control strategy of intelligent signal timing. Travel experience focuses on individual behavior, which is often affected by the driving environment, so travel experience is one way to reflect safety at intersections. From the driver's perspective, an excessively long red light is the major factor that exhausts drivers' patience and drives them to break traffic rules. Therefore, we analyze the pattern of red light duration to evaluate safety. After training for 250 episodes, we test the trained neural network and record its red light durations over one episode. In the experiment, 3,961 vehicles go straight or turn left; the distribution of red light duration over all lanes is shown in Fig. 9.
From the distribution of red light duration, it can be seen that few red phases last beyond 300 seconds. In addition, a penalty is applied to the agent's reward if a red light lasts more than 300 seconds, and the longer the threshold is exceeded, the greater the penalty, so the excess time does not grow large. As a result, the red light duration of each lane is kept within 300 seconds as much as possible while ensuring the throughput of the intersection.
If all three parts of the state are set to 0, the neural network outputs one Q-value per action, and the negative Q-values accelerate the signal's switching to interrupt a red-light overflow.
In this result, the first Q-value, 177.25, is noticeably larger than the others, which means that Rep-DRQN tends to take action a1 to change the red signal in L2 and end its overflowed red time. Meanwhile, since an action is selected every second for precise control, a reasonable signal phase should still last long enough. So, we further analyze the transition characteristics between different phases with a transition matrix, a common tool for analyzing state changes. As shown in Table 3, we generate a transition matrix of action selections between two adjacent phases: we feed each of the 8 actions into the third part of the state, S_t^(3), and record the output Q-values as the entries of the transition matrix in Table 3.
The transition matrix shows that, in general, every action has the biggest Q-value for a transition to itself, which allows each phase to last for a suitable duration. It is also remarkable that the most likely transition is from action a6 to action a1. Our further analysis of this special case shows that actions a1 and a6 share an overlapping movement, as shown in Table 2, and that the action transition does not fully depend on the previous phase.
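The construction of Table 3 can be sketched as follows. The function `q_of_phase` is a stand-in for a forward pass of the trained network; the dummy Q-function in the usage below is a hypothetical placeholder, not the trained Rep-DRQN.

```python
def transition_matrix(q_of_phase, n_actions=8):
    """Row i holds the Q-values the network outputs when the
    previous-phase part of the state, S_t^(3), encodes phase i
    (with the other state parts zeroed). `q_of_phase` stands in
    for the trained network's forward pass."""
    return [list(q_of_phase(i)) for i in range(n_actions)]

def most_likely_next(matrix):
    """Most likely successor of each phase: the argmax of each row."""
    return [row.index(max(row)) for row in matrix]
```

In the paper's result, every row's argmax is its own index except that a6 most often transitions to a1, consistent with the two actions sharing an overlapping movement.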

V. CONCLUSION
Balancing traffic efficiency and safety at intersections is still a pivotal challenge in intelligent traffic management and smart city building. In this work, we propose a signal timing method named Rep-DRQN, in which we refine state modeling with multiple dimensions, including lane-changing behavior and drivers' waiting time. We construct a full action space with all possible choices and define a reward function that penalizes excessive waiting at red lights. Finally, we design a partial LSTM neural network to balance traffic efficiency and travel safety.
The experimental results show that the traffic capacity at the intersection with Rep-DRQN is improved by 14.28% compared with FST and 5.26% compared with DQN. Moreover, the maximal red-time phase length is dynamically maintained, which brings drivers a better travel experience and more safety while reducing both energy consumption and environmental pollution. This benefits practical intelligent signal timing by shifting the focus from one-sided pursuit of efficiency to practical requirements.
LYUCHAO LIAO received the Ph.D. degree in traffic information engineering and its control from Central South University, in 2015. From 2016 to 2018, he worked as a Postdoctoral Researcher with Tsinghua University. He currently works with the Fujian University of Technology and is currently a visiting scholar at the University of Essex, U.K. His research interests are primarily in the fields of big data and artificial intelligence in the transportation domain. In particular, he is interested in deep learning and its applications, such as driving behavior analysis, traffic state prediction, and traffic road-network optimization.
JIERUI LIU is currently pursuing the master's degree in transportation engineering with the Fujian University of Technology. He is currently dedicated to traffic signal timing and collaborative reinforcement learning for regional signal control. His research interests are mainly in reinforcement learning and deep learning.
XINKE WU is currently pursuing the master's degree in electrical engineering with the Fujian University of Technology. Her research interests are focused on reinforcement learning and deep learning. She is currently dedicated to collaborative reinforcement learning for traffic signal timing and area signal control based on special vehicles and pedestrians.

MAOLIN ZHANG received the M.A. degree in transportation engineering from the Fujian University of Technology. His research interests are mainly in the fields of machine vision and artificial intelligence in the transportation domain. He is currently dedicated to vehicle video processing, deep learning, and their applications.