A Deep Adaptive Traffic Signal Controller With Long-Term Planning Horizon and Spatial-Temporal State Definition Under Dynamic Traffic Fluctuations

This study proposes a new adaptive traffic signal control scheme to effectively manage dynamically fluctuating traffic flows through intersections. A spatial-temporal representation of the traffic state at an intersection is designed to efficiently identify traffic patterns from complex intersection environments, and a deep neural network (long short-term memory network, LSTM) is used to determine look-ahead signal control decisions based on the estimated long-term feedback from a given traffic state. The actor-critic algorithm, a reinforcement learning algorithm, is adopted to obtain the parameters of the LSTM network through repeated interactions between a simulated environment and the adaptive traffic signal controller. A realistic model environment with a 24-hour time-varying traffic demand, covering both rush hour and non-rush hour situations, served as the basis for traffic generation in the numerical experiments used to confirm the effectiveness of the proposed scheme. The results show that, compared to an optimized fixed-time plan (Synchro), the proposed scheme can reduce waiting times at intersections by 50%, with the consequential benefits of reduced fuel consumption, emissions, queue lengths, and vehicle delays and increased mean speeds.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Sabah Mohammed.
It is generally accepted that the efficient management of traffic flows to reduce travel delay, especially through intersections, is an essential objective in traffic management. Adaptive traffic control pursues this objective; its advantage is that it can take real-time and stochastic traffic demand into consideration and provide traffic light control decisions based on a wide range of algorithmic designs, including dynamic programming, fuzzy logic and reinforcement learning, as reported in various studies [1]-[5]. However, an examination of these previous studies shows that the design of an adaptive traffic control algorithm is plagued by three crucial problems, outlined as follows.
Firstly, an effective method is needed to represent, as accurately as practically possible, the time-varying traffic flow situation (usually referred to as the traffic state) at a target intersection, including additional operating information pertinent to the particular intersection [5]. To preserve as much information as possible, it would seem convenient to use microscopic measures (such as the position and velocity) of individual vehicles rather than macroscopic measures (such as traffic flow). However, as an intersection is a complex system involving the positions and velocity profiles of multiple vehicles, such an approach can incur high computation costs (the curse of dimensionality). Previous studies avoided these costs by representing the traffic state with aggregate (macroscopic) measures such as flow rate, flow speed and vehicle queue length instead of the individual (microscopic) measures of vehicle positions and velocities [3], [6], [7].
Although these aggregate measures simplify traffic state representation, specific vehicular information at a target intersection is inevitably lost. On the other hand, even if individual vehicle positions and velocities are considered, delay information is lacking for each vehicle as well as for the intersection system resulting in the inability to effectively minimize the total delay for each traffic light control decision.
Secondly, although in recent years a considerable number of studies have employed macroscopic traffic flows in their underlying traffic flow models, these models cannot precisely reflect the actual traffic flow characteristics of a target intersection, which hampers the evaluation of the real-time performance of control policies in adaptive traffic signal control algorithms [8]-[11]. As already mentioned in the statement of the first problem, there is good reason for using macroscopic traffic flow models: their computational costs are relatively low, so they can easily be implemented in adaptive control algorithms. This computational benefit was a contributory factor in the implementation of some well-known adaptive traffic control systems, such as SCOOT [12], SCATS [13] and COP [14]. Nevertheless, the loss of precision at the level of the target intersection remains.
The third problem relates to the modelling of traffic demand at an intersection, which normally involves using vehicle arrival rates as the basis for such modelling. However, previous studies mainly assume that the arrival rate is constant throughout the day [3], [7], [15], [16], treating rush hour and non-rush hour traffic as the same homogeneous conditions. Clearly, this assumption can lead to the poor performance of traffic signal control schemes, especially during rush hours. Furthermore, although the Annual Average Daily Traffic (AADT), obtained from historical observation data, is used to provide aggregate information for traffic signal control, we found that few algorithms can exactly take into account the 24-hour time-varying traffic demands, arguably needed to further enhance the design of traffic signal control schemes.
Addressing these three problems, the research underpinning this study establishes a fundamental decision-making framework which uses a Reinforcement Learning (RL) approach supported by a deep neural network to implement an adaptive traffic control algorithm [17], [18]. The main reason behind this approach is the recognition that RL has recently contributed to effective decision-making in several other areas such as gaming and robotics, as well as in traffic control [19]- [24]. Furthermore, previous studies have confirmed that RL can effectively work with microscopic traffic flow models for adaptive traffic control and can make lookahead control decisions for an intersection system [4], [25].
This study focuses on reducing the total delay of vehicles through the design of a signal controller that can identify both the spatial and temporal patterns in real-time traffic based on microscopic information so as to reduce information loss. To reflect spatial traffic situations as realistically as possible, individual vehicle delays are defined as the basic element of the traffic state, and an intersection is partitioned into cells that represent individual delays. Moreover, to capture the temporal traffic dynamics, we employ a series of spatial observations to enhance the representation of the traffic state, which is then used as input into a neural network to determine the control decisions at different time intervals. Note that the type of neural network used in this study is the LSTM network because it is especially suited to time sequence problem modelling. This type of network, therefore, provides the essential basis for dealing with complex traffic states as represented by microscopic vehicular and other operational information without suffering from the curse of dimensionality [26], [27]. The proposed adaptive signal controller enables control decisions based on the guidance of the trained neural network after it has learnt an optimal control policy through multiple trial-and-error interactions between the controller and the intersection environment.
Reinforcement learning (RL) is employed to determine the parameters of the LSTM network under a microscopic traffic simulation environment given dynamic traffic demand scenarios [28]. It should be noted that in previous studies, traffic demand scenarios for training procedures were usually generated assuming a constant vehicle arrival rate [7], [15], [16]. In this study, however, vehicle arrivals are generated following a Poisson distribution for a 24-hour time-varying traffic demand curve obtained from historical data. This method allows the training procedure to take into account daily traffic dynamics and reflects the stochastic nature of traffic demand.
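The demand generation described above can be illustrated with a minimal sketch. It assumes a piecewise-constant hourly demand curve and draws exponential inter-arrival gaps within each hour, which yields Poisson-distributed counts per hour. The function name `generate_arrivals` and the values in `demand` are hypothetical and are not taken from the historical data used in this study.

```python
import random


def generate_arrivals(hourly_rates, seed=0):
    """Return arrival times in seconds for a piecewise-constant Poisson process.

    hourly_rates[h] is the expected number of vehicles arriving in hour h.
    All rates used below are illustrative assumptions.
    """
    rng = random.Random(seed)
    arrivals = []
    for hour, rate_per_hour in enumerate(hourly_rates):
        if rate_per_hour <= 0:
            continue  # no arrivals are generated in an hour with zero demand
        rate_per_second = rate_per_hour / 3600.0
        start, end = hour * 3600.0, (hour + 1) * 3600.0
        # Restarting the exponential clock at each hour boundary is valid
        # because the exponential distribution is memoryless.
        t = start + rng.expovariate(rate_per_second)
        while t < end:
            arrivals.append(t)
            t += rng.expovariate(rate_per_second)
    return arrivals


# Hypothetical 24-hour demand curve (vehicles per hour) with two rush-hour peaks.
demand = [50] * 6 + [400, 800, 900, 500] + [300] * 6 + [700, 900, 600] + [150] * 5
arrivals = generate_arrivals(demand)
```

A constant arrival rate corresponds to the special case in which every entry of `demand` is identical, which is the assumption the time-varying curve is meant to relax.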
To ensure the convergence of the RL algorithm, we employ an actor-critic strategy in the RL model to optimize the parameters of the LSTM network. Moreover, we use a multistep bootstrapping technique and a clipped surrogate objective technique to enhance the efficiency and robustness of the algorithm. Finally, we provide an RL framework that is specially designed for adaptive traffic control.
The contributions of this paper can be summarized as follows: Firstly, previous RL studies usually ignored individual vehicular delay information in the processing performed by traffic controllers. Addressing this omission, we propose a novel traffic state definition that identifies both spatial and temporal patterns using microscopic traffic delay information. This method reduces information loss and provides individual vehicular delay information for each traffic light control decision.
Secondly, we have designed an RL algorithm framework to determine the parameters of the LSTM network in a microscopic simulation environment. This framework can guarantee the convergence of the learning process under complex traffic states.
Thirdly, the RL algorithm executes the learning process based on a 24-hour time-varying traffic demand, which incorporates both rush hour and non-rush hour situations. This enables realistic observations of commuting traffic in practice and can provide substantial practical benefits in the implementation of adaptive traffic control.
Following on from this introduction, the remainder of this paper is arranged as follows: Section II presents a literature review. Section III provides background information for the traffic signal control scheme, involving delay, cost and state. In Section IV, we describe the optimization algorithm, an actor-critic algorithm combined with a multistep bootstrapping technique and a clipped surrogate objective technique. In Section V, we describe the setup of the numerical experiment and present the results to demonstrate the performance of the proposed method. In Section VI, we conclude our work on the proposed method and provide some suggestions for future work.

II. LITERATURE REVIEW
There are currently two basic types of traffic signal control: fixed-time signal control and adaptive signal control. In the case of fixed-time control, the controller utilizes historical traffic data to determine signal timing off-line [29]. Fixed-time control has been widely used in several well-known signal timing systems, such as the TRANSYT [30], SYNCHRO [31] and MAXBAND [32] systems. Fixed-time control performs stably if the traffic demand follows a fixed pattern; however, it cannot respond to stochastic traffic conditions, especially in situations where there is a sudden buildup of traffic.
Adaptive signal control utilizes real-time data to determine an optimal signal timing to maximize a defined objective function and in recent decades it has gradually gained popularity due to its adaptability and flexibility. The controller managing adaptive signal control can be classified according to the type of control it provides, namely: responsive signal control, online optimization control, or revising frequency control [33]. Each of these control classifications is discussed as follows:

A. RESPONSIVE SIGNAL CONTROL
In the case of responsive signal control, each signal in a controller seeks a decision to extend the current green phase or not, based on the upstream actuated traffic demand. A typical signal control system is the modernized optimized vehicle actuation (MOVA) system [34]. However, a disadvantage of this system is that it fails to optimize globally because the control decision only considers the traffic demand in the current green direction whilst ignoring all other directions.

B. ONLINE OPTIMIZATION CONTROL
The online optimization algorithm utilizes model predictive control [35], traffic flow model control [10], [36] and Petri nets model control [37] to make control decisions considering detected and predicted future traffic. Typical existing systems using online optimization include the SCOOT [12] and SCATS [13] systems, which have shown significant improvements in the performance of traffic signal control. However, almost all of these systems were developed using macroscopic traffic flow models, which implies the loss of detailed information about individual vehicle movements at an intersection and, therefore, results in poor performance regarding control decisions.

C. REVISING FREQUENCY CONTROL
The revising frequency control approach uses a rolling horizon and starts an optimization every few seconds to maximize an objective function over the planning period. The main feature of this approach is that the traffic signal timing is optimized at a rather fast pace, with a resolution that can be as short as 0.5 seconds [38]. This method makes it possible to control traffic signals effectively in many practical applications, such as the PRODYN [39], OPAC [40], RHODES [41] and COP [14] systems. The techniques used to solve the rolling horizon problem include dynamic programming and reinforcement learning (RL), with the latter gaining popularity in recent research because of its computational feasibility and its adaptability to complex problems [38]. Extensive research has therefore been conducted on traffic signal control using RL [5]. In one pilot study, a multiagent traffic signal control scheme with model-based RL was developed to minimize the overall waiting time of vehicles. That study confirmed the effectiveness of its RL-based adaptive signal control algorithm by comparing the performance of the RL controller with that of nonadaptive traffic signal controllers. Another study employed Q-learning to minimize the number of waiting vehicles at an isolated intersection, using aggregate state information such as the queue length on its four approaches [42]. The advantage of RL in the revising frequency control method was further confirmed through a comparison of RL with function approximation against dynamic programming [38]. RL can be divided into three categories: value-based RL, policy-based RL and actor-critic RL. It has been shown that actor-critic RL, a combination of value-based RL and policy-based RL, outperforms the other two categories and offers benefits in terms of robustness, training speed and generalization to new traffic scenarios [7].
Recent studies have also combined deep neural networks with RL to implement traffic control and have shown that such a combination can significantly improve the robustness and generalization ability of RL [7], [43]. In addition, a deep neural network enables RL to handle higher dimensions of traffic state representations more efficiently including the complex and stochastic characteristics of real-world traffic systems [25].

III. TRAFFIC SIGNAL CONTROL SCHEME
VOLUME 8, 2020
For a typical intersection (Section III-A), traffic efficiency is largely influenced by the traffic signal control scheme, and thus the traffic signal controller plays a critical role in building a safe, efficient, and environmentally friendly driving environment in city traffic conditions. In this study, we adopt vehicle delay to evaluate traffic efficiency and construct the objective function by minimizing the total delay $J$ of the surrounding vehicles at the intersection (as introduced in Section III-B). Fig. 1 shows the proposed control scheme. We divide the time horizon into several discrete time intervals; each time interval is indexed by $t$ and has a duration of $\Delta t$. For each time interval $t$, signal control is fulfilled in three stages: perception, decision-making, and execution. In the perception stage, the controller first detects the positions and speeds of the surrounding vehicles at time interval $t$ through several types of smart sensors, such as millimeter-wave traffic radar and computer vision-based traffic monitors; the controller then estimates the state $s_t$ using the method in Section III-C, which extracts representative information while limiting information loss. In the decision-making stage, the controller chooses an action $a_t$ between two options, extending the current phase ($a_t = 0$) or changing to the next phase ($a_t = 1$), based on the observed state $s_t$. The basic logic is that a trained neural network predicts the optimal action $a_t$ using state $s_t$ as input, where its internal parameters are obtained using RL through multiple interactions between the controller and the simulation environment (as introduced in Section IV). Finally, in the execution stage, the traffic light carries out the planned action $a_t$ for $\Delta t$ seconds (for $a_t = 0$) or $t_{yellow} + \Delta t$ seconds (for $a_t = 1$), observes the control feedback cost $j_t$ and begins the next iteration at time interval $t + 1$.
Fig. 2 shows a typical single intersection with legs in four directions. Each leg consists of multiple approaching links and departure links. Together, the approaching links form the approaching link set. Furthermore, we divide each approaching link into multiple cells to collect useful information about vehicle movements, where the cells are labeled $0, 1, 2, \dots, i, \dots, N_i$. Since the minimum space headway between two successive vehicles is 7.5 m, each cell spans 7.5 m. This high-resolution space division supports a precise state representation and reduces information loss.

B. DELAY AND COST FUNCTION
When approaching an intersection, vehicles might slow down as a result of catching up to the vehicles in front, or they may have to stop because of a red light. We define the vehicle delay $d_t(k)$ as the amount of extra time incurred by vehicle $k \in U_t$ during time interval $t$:
$$d_t(k) = \Delta t \left(1 - \frac{v_t(k)}{v_{free}}\right)$$
where $\Delta t$ is the duration of time interval $t$, $v_t(k)$ is the average speed of vehicle $k$ during time interval $t$, $v_{free}$ is a constant that indicates the speed of a vehicle passing through the intersection under free-flow conditions and $U_t$ is the set of vehicles located on the approaching links of the intersection at the beginning of time interval $t$. Note that for vehicle $k$, the average speed $v_t(k)$ depends on the phase decision $a_t$ and the current state $s_t$. If the corresponding phase is green, the vehicle may travel at a high speed; if the corresponding phase is red, the vehicle has to slow down and stop.
For each time interval $t$, the controller's performance can be evaluated by the cost $j_t$:
$$j_t = \sum_{k \in U_t} d_t(k)$$
where the cost $j_t$ is the total delay of the surrounding vehicles $U_t$. Furthermore, based on the cost of each time interval, the total cost function of the controller can be derived as:
$$J = \sum_{t=0}^{T} j_t$$
where $T$ is the total number of time intervals in the planning horizon. Minimizing the cost $J$ is consistent with the aim of the controller, which is to improve traffic efficiency by deciding whether or not to extend the current phase over the planning horizon $[0, 1, 2, \dots, T]$.
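As a worked illustration of the delay and cost definitions, the following minimal sketch assumes that the delay takes the common form $\Delta t\,(1 - v_t(k)/v_{free})$. The function names and the values of `DT` and `V_FREE` are hypothetical and chosen only for the example.

```python
DT = 5.0       # assumed duration of one time interval, in seconds
V_FREE = 15.0  # assumed free-flow speed, in metres per second


def vehicle_delay(avg_speed, v_free, dt):
    """Delay of one vehicle in one interval, assuming d = dt * (1 - v / v_free)."""
    return dt * (1.0 - avg_speed / v_free)


def interval_cost(avg_speeds, v_free, dt):
    """Cost j_t: the sum of the delays of all vehicles on the approaching links."""
    return sum(vehicle_delay(v, v_free, dt) for v in avg_speeds)


def total_cost(interval_costs):
    """Total cost J: the sum of the interval costs over the planning horizon."""
    return sum(interval_costs)


# Three vehicles: one at free-flow speed, one at half speed, one stopped.
speeds = [15.0, 7.5, 0.0]
delays = [vehicle_delay(v, V_FREE, DT) for v in speeds]
j_t = interval_cost(speeds, V_FREE, DT)
```

Under these assumptions a vehicle moving at free-flow speed contributes no delay, while a stopped vehicle contributes the full interval duration.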

C. STATE DEFINITION
The traffic controller makes action decisions based on the representative state of the target intersection. We propose a method that collects multidimensional information to represent the traffic state and so reduces the information loss caused by partial observability. In previous works, the traffic state is usually represented simply by aggregate information, such as the average queue length or waiting time of the vehicles located at the intersection. However, such a representation ignores individual differences and spatial information. This study proposes a method that takes into account the travel time delay of each vehicle at the intersection and keeps the dimension of the input constant even though the number of vehicles at an intersection is time-dependent. Since the size of $U_t$ is time-dependent, it is not appropriate to use $d_t(k)$ directly as the representation of the state. We define the delay $D_t^i$ for cell $i$ as follows:
$$D_t^i = \begin{cases} \dfrac{1}{|U_t^i|} \sum_{k \in U_t^i} d_t(k), & |U_t^i| \neq 0 \\ 0, & |U_t^i| = 0 \end{cases}$$
where $U_t^i$ denotes the set of vehicles located in cell $i$ at time $t$. If $|U_t^i| = 0$, no vehicle is detected in cell $i$ at time $t$ and therefore $D_t^i$ is equal to 0. If $|U_t^i| \neq 0$, we use the mean delay of the vehicles in the cell as the delay value of cell $i$ at time $t$. We further use $D_t$ to denote the delay information for the intersection as follows:
$$D_t = [D_t^1, D_t^2, \dots, D_t^{N_i}]$$
where $N_i$ is the total number of cells on the approaching links of the target intersection. $D_t^i$ can be obtained by observing the speeds of the vehicles in cell $i$ during time interval $t$. Our proposed method does not require tracking the trajectory of each vehicle at an intersection. Therefore, the proposed state representation method is adaptable and easily implemented with several types of smart sensors, such as millimeter-wave traffic radar and computer vision-based traffic monitors.
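The cell-based aggregation can be sketched as follows for a single approaching link. The sketch assumes that each detected vehicle is described by its distance to the stop line and its average speed, that cells are indexed by that distance in 7.5 m steps, and that the per-vehicle delay is $\Delta t\,(1 - v/v_{free})$; the function name `cell_delays` and the numbers used are hypothetical.

```python
CELL_LENGTH_M = 7.5  # cell length taken from the minimum space headway


def cell_delays(vehicles, num_cells, v_free, dt):
    """Return the mean delay per cell for one approaching link.

    vehicles is a list of (distance_to_stop_line_m, avg_speed_mps) pairs.
    A cell without vehicles gets a delay of 0, so the output always has
    num_cells entries regardless of how many vehicles are present.
    """
    sums = [0.0] * num_cells
    counts = [0] * num_cells
    for distance, speed in vehicles:
        i = int(distance // CELL_LENGTH_M)
        if 0 <= i < num_cells:  # ignore vehicles outside the observed range
            sums[i] += dt * (1.0 - speed / v_free)
            counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Two vehicles share the first cell; a third travels at free-flow speed.
D = cell_delays([(3.0, 0.0), (5.0, 7.5), (10.0, 15.0)],
                num_cells=4, v_free=15.0, dt=5.0)
```

Only per-cell speeds are needed, so the computation does not rely on vehicle identities persisting from one interval to the next.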
Furthermore, to discourage rapid flipping between phases, we use $c_t$ to denote the cumulative number of repetitions of the current phase up to the end of the previous time interval $t - 1$, which can be expressed as:
$$c_t = \begin{cases} 0, & a_{t-1} = 1 \\ c_{t-1} + 1, & a_{t-1} = 0 \end{cases}$$
That is, $c_t$ is reset to 0 when the signal controller switches to the next phase ($a_{t-1} = 1$), and $c_t = c_{t-1} + 1$ when the signal controller decides to remain in the current phase ($a_{t-1} = 0$) at the previous time interval $t - 1$.
In conclusion, we introduce the spatial observation $x_t$ and the state $s_t$ to capture the spatial and temporal dynamics in a complex traffic environment:
$$x_t = [D_t, \varphi_t, c_t]$$
$$s_t = [x_{t-\xi+1}, x_{t-\xi+2}, \dots, x_t]$$
where $\varphi_t \in \{0, 1, 2, 3\}$ refers to the traffic signal phase at time $t$ (see Fig. 3) and $\xi$ is the number of observed time steps. An example is provided to illustrate the evolution of the proposed state representation. As shown in Fig. 3, the controller executes the signal phases $\varphi_t$ sequentially from phase 0 to phase 3. The value of $\xi$ is set to 10 for illustration purposes. We assume that time interval $t - 1$ begins with phase $\varphi_{t-1} = 1$ and phase repetition number $c_{t-1} = 10$. After $\Delta t$ seconds, the controller is required to make a decision based on state $s_t$ at the beginning of time interval $t$. Because of action $a_{t-1} = 0$, the controller obtains $x_t$ with $\varphi_t = 1$, $c_t = 11$ and $D_t = [3.86, 3.12, \dots, 0.00]$. Then, based on the evaluation of the state $s_t = [x_{t-9}, x_{t-8}, \dots, x_t]$, the controller selects an action $a_t = 0$, receives feedback $j_t$ and begins the next iteration at time interval $t + 1$. Subsequently, the controller chooses to change the phase ($a_{t+1} = 1$) at the beginning of time interval $t + 1$ based on $s_{t+1}$, where $\varphi_{t+1} = \varphi_t = 1$ and $c_{t+1} = c_t + 1 = 12$ because of the phase extension decision at time interval $t$ ($a_t = 0$). After $t_{yellow} + \Delta t$ seconds, state $s_{t+2}$ for time interval $t + 2$ is obtained, where $\varphi_{t+2}$ and $c_{t+2}$ take the values 2 and 0 as a result of the phase change decision ($a_{t+1} = 1$). Note that, for safety reasons, the yellow phase $t_{yellow}$ cannot be omitted when a phase change is decided ($a_t = 1$).
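The bookkeeping in this example can be sketched in a few lines. The sketch assumes that the phases cycle from 3 back to 0, that each observation concatenates the cell delays, the phase and the repetition count in that order, and that the state keeps the most recent observations in a fixed-length buffer. The names `next_phase_and_count` and `StateBuilder` are hypothetical.

```python
from collections import deque


def next_phase_and_count(phase, count, action, num_phases=4):
    """Apply an action: 1 switches to the next phase, 0 extends the current one."""
    if action == 1:
        return (phase + 1) % num_phases, 0  # cyclic phase order is an assumption
    return phase, count + 1


class StateBuilder:
    """Keeps the latest `horizon` spatial observations as the state."""

    def __init__(self, horizon):
        self.buffer = deque(maxlen=horizon)  # old observations drop out automatically

    def push(self, cell_delays, phase, count):
        # Observation layout [D_t, phase, count] is assumed for illustration.
        self.buffer.append(list(cell_delays) + [phase, count])

    def state(self):
        return list(self.buffer)


# Start from phase 1 with 10 repetitions, extend eleven times, then switch.
builder = StateBuilder(horizon=10)
phase, count = 1, 10
for action in [0] * 11 + [1]:
    phase, count = next_phase_and_count(phase, count, action)
    builder.push([0.0] * 4, phase, count)
state = builder.state()
```

With a horizon of 10 and 4 cells, `state` holds 10 observations of length 6, which matches the input matrix size of the number of observed time steps multiplied by the number of cells plus two.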

IV. REINFORCEMENT LEARNING FOR TRAFFIC SIGNAL CONTROL
The decision-making stage is achieved through a trained neural network that determines the best action $a_t$ for state $s_t$. Two neural networks are constructed for this purpose: a critic network $\hat{V}(s_t; w)$ to predict the expected cumulative cost and an actor network $\hat{\pi}(a_t \mid s_t; \theta)$ to calculate the optimal action directly (Section IV-A). In terms of network structure, we adopt the LSTM network to build both the critic network and the actor network (Section IV-B) so as to capture the temporal dynamics. Then, as the parameters are initialized randomly, we adopt an actor-critic framework, one type of RL-based algorithm, to calculate the parameters $w$ and $\theta$ so as to reduce the estimation error of the critic network and the future cost of the actor network (Section IV-C). The main logic is to gradually obtain a set of optimal parameters $w$ and $\theta$ that guide the decision-making process, using training data collected through multiple interactions between the decision-making agent and the simulated environment. Furthermore, we adopt two techniques to enhance the efficiency and robustness of the algorithm: multistep bootstrapping (Section IV-D) and the clipped surrogate objective (Section IV-E). Finally, the overall algorithm is given in Section IV-F to illustrate the training details.

A. BASIC COMPONENTS
RL, a type of machine learning technique, enables the controller to obtain optimal parameters for the decision-making agent through trial-and-error interactions between the controller and the simulated environment [18]. Two core concepts in RL are the actor and the critic: the actor guides the choice of action $a_t$ based on state $s_t$ at each time interval $t$, and the critic predicts the cumulative cost of state $s_t$ to estimate the long-term performance of the current state.
The actor can be defined as a conditional probability, $\pi(a \mid s) = P(a_t = a \mid s_t = s)$, which maps the system state $s$ to a probability distribution over the action set $a_t \in \{0, 1\}$.
The critic estimates the expected cumulative cost from state $s_t$ over the remaining planning horizon $[t, t + 1, \dots, \tau, \dots, T]$:
$$V(s_t) = \mathbb{E}_{\pi}\left[\sum_{\tau=t}^{T} j_\tau \,\middle|\, s_t\right]$$
Furthermore, based on the Bellman equation, we can obtain $V(s_t)$ iteratively by using the cost $j_t$ and the next state value $V(s_{t+1})$:
$$V(s_t) = \mathbb{E}_{\pi}\left[j_t + V(s_{t+1}) \,\middle|\, s_t\right]$$
Note that the true values of $V(s_t)$ and $\pi(a_t \mid s_t)$ are difficult to observe in practice. Usually, we adopt the function approximations $\hat{V}(s_t; w)$ and $\hat{\pi}(a_t \mid s_t; \theta)$ to estimate $V(s_t)$ and $\pi(a_t \mid s_t)$, where $w$ and $\theta$ are the parameters of the function approximations (shown in Fig. 4).

B. NEURAL NETWORK STRUCTURE USING THE LSTM
To capture the temporal dynamics of the sequences, a long short-term memory (LSTM) network is employed to construct the function approximation. This study adopts the many-to-one structure to map a sequence of vectors $[x_1, x_2, \dots, x_\tau]$ to a vector $y_\tau$ [44], as shown in Fig. 5. The LSTM network is denoted by $L$:
$$L : [x_1, x_2, \dots, x_\tau] \mapsto y_\tau$$
The function approximation architecture adopted in this study is shown in Fig. 5. For each time interval $t$, the state $s_t = [x_{t-\xi+1}, x_{t-\xi+2}, \dots, x_t]$ is obtained by stacking a series of spatial observations from different time intervals, where $\xi$ is the length of the time series and the length of each $x_t$ is $N_i + 2$ (see Section III-C). Using state $s_t$ as the input, the input matrix size is $\xi \cdot (N_i + 2)$. First, we employ a fully connected layer $C$ with 64 neurons and rectifier nonlinear activation functions to compress each observation $x_\tau$, $\tau \in [t - \xi + 1, t - \xi + 2, \dots, t]$. Then, we pass the stacked output to the LSTM layer $L$, which is composed of 64 units and unrolled for a $\xi$-step input. Finally, a second fully connected layer is used as the output layer, with no activation function (linear output) for the critic and a softmax activation function for the actor; it has 1 neuron for the critic and 2 neurons for the actor, corresponding to their numbers of outputs. During the training process, the algorithm uses the Adam optimizer as the gradient descent algorithm with a learning rate of 0.0002 for the critic and 0.0001 for the actor. The critic and actor networks use the same structure except for the output layer: each applies $C$ to every observation in $s_t$, passes the resulting sequence through $L$, and feeds the final LSTM output to its own output layer.

C. PARAMETER OPTIMIZATION USING ACTOR-CRITIC ALGORITHM
The actor-critic algorithm, one of the RL-based algorithms, is adopted in this study to determine the controller's internal parameters ($w$ and $\theta$) through the interaction between the traffic signal controller and the simulation environment. The reasons for adopting the actor-critic algorithm are that the converged results of critic-only algorithms tend to be biased and that the convergence of actor-only algorithms is usually rather slow. The actor-critic algorithm, by contrast, provides an appealing trade-off between optimality and convergence speed. Its two basic components are the actor $\hat{\pi}(a_t \mid s_t; \theta)$ and the critic $\hat{V}(s_t; w)$. During the training process, the actor is employed to determine an action $a_t \sim \hat{\pi}(a_t \mid s_t; \theta)$ for the current state $s_t$; the controller then executes the command $a_t$, receives the feedback $j_t$ and collects the training sample $(s_t, a_t, j_t, s_{t+1})$. Based on the collected sample, the critic estimates the long-term performance $\hat{V}(s_t; w)$ and takes part in the parameter optimization process to help both the actor and the critic achieve better performance in the future [45]. The actor-critic algorithm optimizes the internal parameters $\theta$ and $w$ as follows: (1) the actor chooses action $a_t \sim \hat{\pi}(a_t \mid s_t; \theta)$ at the beginning of time interval $t$; (2) the controller receives the feedback $j_t$ and $s_{t+1}$ at the end of time interval $t$; (3) the controller collects the training sample $(s_t, a_t, j_t, s_{t+1})$; (4) the critic evaluates the temporal difference (TD) error according to the sample; (5) the algorithm optimizes the internal parameters $\theta$ and $w$ of the actor and the critic, respectively, according to the TD error.
In the following, we will show the details for the parameter optimization process.
For the critic, the parameter $w$ is optimized to decrease the difference between the approximation $\hat{V}(s_t; w)$ and the true value $V(s_t)$ by applying gradient descent to the mean-square loss function $L(w)$:
$$L(w) = \frac{1}{2}\left(j_t + \hat{V}(s_{t+1}; w) - \hat{V}(s_t; w)\right)^2 \tag{12}$$
where $j_t + \hat{V}(s_{t+1}; w)$ is an approximate estimate of the true value $V(s_t)$ following the Bellman equation, and $\delta_t = j_t + \hat{V}(s_{t+1}; w) - \hat{V}(s_t; w)$ is the temporal difference error (TD error), which expresses the difference between the approximation and the true value. Optimizing the critic can be regarded as a regression problem that builds the mapping between state $s_t$ and value $V(s_t)$ using the training sample $(s_t, a_t, j_t, s_{t+1})$, where the input is $s_t$ and the target is $j_t + \hat{V}(s_{t+1}; w)$. The batch stochastic gradient descent technique (BGSD) is employed to obtain an optimized parameter $w$: $w \leftarrow w + \alpha_w \delta_t \nabla_w \hat{V}(s_t; w)$, where $\alpha_w$ is the learning rate of the critic. For the actor, the parameter $\theta$ is adjusted to minimize the total cost $J$ by applying gradient ascent to the cross-entropy loss $L(\theta)$:
$$L(\theta) = \delta_t \log \hat{\pi}(a_t \mid s_t; \theta) \tag{13}$$
Eq. 13 can be used to optimize $\theta$ with the BGSD technique following $\theta \leftarrow \theta + \alpha_\theta \delta_t \nabla_\theta \log \hat{\pi}(a_t \mid s_t; \theta)$, where $\alpha_\theta$ is the learning rate of the actor. For further theoretical explanation, readers can refer to the work by Sutton [18]. The whole procedure of the actor-critic algorithm is summarized in Algorithm 1.

5:      Execute the action $a_t$, receive the feedback $j_t$, and obtain the next state $s_{t+1}$
6:      // Parameter optimizing
7:      Compute TD error $\delta_t$ according to the training sample $(s_t, a_t, j_t, s_{t+1})$
8:      Optimize parameters $w$ and $\theta$ using Eqs. 12 and 13
9:    end for
10: end for
11: return Parameters $w$ and $\theta$
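The critic part of the parameter optimization step can be illustrated with a minimal sketch. It replaces the LSTM critic with a linear value function over hand-made features, so the gradient of the value with respect to $w$ is simply the feature vector, and it adds a terminal flag so that the value beyond the end of the horizon is treated as zero. Both choices, the function names and the two-state example are assumptions made for illustration only; the action is omitted from the sample because the critic update does not use it.

```python
def value(w, features):
    """Linear stand-in for the critic: V(s; w) = w . features(s)."""
    return sum(wi * fi for wi, fi in zip(w, features))


def critic_update(w, sample, alpha):
    """One TD step: w <- w + alpha * delta * grad_w V(s; w).

    sample is (features_s, cost_j, features_s_next, terminal).
    """
    s, j, s_next, terminal = sample
    target = j + (0.0 if terminal else value(w, s_next))
    delta = target - value(w, s)  # TD error
    new_w = [wi + alpha * delta * fi for wi, fi in zip(w, s)]
    return new_w, delta


# Hypothetical two-state chain: A costs 2 and leads to B; B costs 3 and ends.
A, B = [1.0, 0.0], [0.0, 1.0]
episode = [(A, 2.0, B, False), (B, 3.0, None, True)]
w = [0.0, 0.0]
for _ in range(500):
    for sample in episode:
        w, delta = critic_update(w, sample, alpha=0.1)
```

In this toy chain the weights approach the cumulative costs of 5 for A and 3 for B, which is the behaviour the TD error is meant to drive.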

D. MULTISTEP BOOTSTRAPPING TECHNIQUE
Since the batch size is too small (only one training sample) for parameter optimizing process, we are inspired by the multistep bootstrapping technique to increase batch size when optimizing parameters. This can further improve the convergency speed and algorithm stability in the above algorithm (Algorithm 1).
The original method updates parameters w and θ using only one sample (s t , a t , j t , s t+1 ) at the end of each time interval. Inspired by the multistep bootstrapping technique, this study updates parameters w and θ at the beginning of the time interval t = qn for q = 0, 1, 2, . . . , Q, where n is a prespecified number (step) of time intervals. Consider the environment is at time interval t = qn, the algorithm progresses as follows: the controller selects actions a t ∼ π(a|s t ; θ qn ) during the intervals t = qn, qn+1, . . . , qn+n−1 and executes parameter optimizing until the end of the time interval t = qn+n−1. The RL algorithm collects the samples (s t , a t , j t , s t+1 ) from t = qn to qn + n − 1 and saves these samples into training buffer B [qn,(q+1)n] : , a qn+1 , j qn+1 , s qn+2 ) . . .
Therefore, we can obtain a batch of $n$-step TD errors at $t = qn+n-1$ based on $B_{[qn,(q+1)n]}$ using the $n$-step return (Eq. 14). Using Eq. 14, we obtain $\delta_{qn}, \delta_{qn+1}, \ldots, \delta_{qn+n-1}$ and, from them, the expanded training buffer $\tilde{B}_{[qn,(q+1)n]}$. We then draw random samples uniformly from the expanded training buffer $\tilde{B}_{[qn,(q+1)n]}$ and adopt these samples to construct a training batch to optimize the parameters $w$ and $\theta$ using the BGSD technique based on Eqs. 12 and 13. Compared to optimizing with only a single training sample (see Algorithm 1), batch optimizing uses a series of TD errors $[\delta_{qn}, \delta_{qn+1}, \ldots, \delta_{qn+n-1}]$ to update parameters $w$ and $\theta$.
The process of the multistep bootstrapping technique is visualized in Fig. 6.
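As an illustration of how a batch of $n$-step TD errors could be computed from one training buffer, the sketch below uses the standard undiscounted $n$-step form, $\delta_t = \sum_{k=t}^{qn+n-1} j_k + V(s_{qn+n}; w) - V(s_t; w)$. This form is an assumption: it is chosen because the one-step target $j_t + V(s_{t+1}; w)$ above carries no discount factor, and it may differ in detail from Eq. 14. The function name and argument layout are hypothetical.

```python
import numpy as np


def n_step_td_errors(costs, values, bootstrap_value):
    """Undiscounted n-step TD errors for one buffer (assumed form, see lead-in).

    costs[i]        : j_{qn+i}        for i = 0..n-1
    values[i]       : V(s_{qn+i}; w)  for i = 0..n-1
    bootstrap_value : V(s_{qn+n}; w), or 0.0 if s_{qn+n} is terminal
    """
    costs = np.asarray(costs, dtype=float)
    values = np.asarray(values, dtype=float)
    # tail[i] = j_{qn+i} + j_{qn+i+1} + ... + j_{qn+n-1}
    tail = np.cumsum(costs[::-1])[::-1]
    return tail + bootstrap_value - values
```

For the last sample in the buffer the expression reduces to the one-step TD error, so the single-sample update is recovered when $n = 1$.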

E. CLIPPED SURROGATE OBJECTIVE TECHNIQUE
Although the actor-critic algorithm serves as an effective framework for obtaining optimized parameters in the RL agent, an extremely large loss (Eqs. 12 and 13) may cause the parameters to change unstably between two successive parameter optimizing processes. This may have a negative impact on the stability and convergence speed of the algorithm [46]. Therefore, to alleviate this problem, this study further employs a clipped surrogate objective technique to improve algorithm robustness. The clipped surrogate objective technique uses clipped probability ratios to enhance the performance of parameter updating.
After the clipped surrogate objective technique is applied, the actor loss function $L(\theta)$ is modified as in Eq. (15), where the clip ratio is defined as $\varepsilon_t(\theta) = \frac{\pi(a_t | s_t; \theta)}{\pi(a_t | s_t; \theta_{\mathrm{old}})}$ and $\epsilon$ is the clip rate. $\theta_{\mathrm{old}}$ is the previous value of the policy neural network parameter. This indicates a conservative attitude in parameter optimizing [47]. The clip function bounds its first argument from below and above by its second and third arguments. Therefore, $\mathrm{clip}(\varepsilon_t(\theta), 1-\epsilon, 1+\epsilon)$ guarantees that the effective $\varepsilon_t(\theta)$ lies in the interval $[1-\epsilon, 1+\epsilon]$, which enables the algorithm to avoid the trap of excessive parameter changes. Finally, the minimum of $\delta_t \varepsilon_t(\theta)$ and the clipped $\delta_t \varepsilon_t(\theta)$ is adopted in calculating the loss function of parameter $\theta$, which provides a pessimistic bound for the loss function $L(\theta)$ and avoids the problem of oscillation caused by large parameter changes.
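The per-sample quantity described above, the minimum of the unclipped and clipped products of the TD error and the probability ratio, can be sketched as follows. This is a minimal illustration rather than the exact form of Eq. (15): any averaging over the batch or sign convention applied afterwards is not shown, and the function name and the default clip rate of 0.2 are hypothetical choices.

```python
import numpy as np


def clipped_surrogate(prob_new, prob_old, delta, clip_rate=0.2):
    """Per-sample clipped surrogate term (assumed form, see lead-in).

    prob_new : pi(a_t | s_t; theta)      under the current parameters
    prob_old : pi(a_t | s_t; theta_old)  under the previous parameters
    delta    : TD error delta_t for each sample
    """
    ratio = np.asarray(prob_new, dtype=float) / np.asarray(prob_old, dtype=float)
    # Restrict the ratio to [1 - clip_rate, 1 + clip_rate].
    clipped = np.clip(ratio, 1.0 - clip_rate, 1.0 + clip_rate)
    delta = np.asarray(delta, dtype=float)
    # The element-wise minimum gives the pessimistic bound described in the text.
    return np.minimum(delta * ratio, delta * clipped)
```

Taking the minimum means that clipping only takes effect when the unclipped term would be larger than the clipped one, so the objective never rewards moving the ratio far outside the clipping interval.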

F. OVERALL ALGORITHM
In this section, we summarize the above content and provide an overall algorithm for the training of the controller internal parameters w and θ. The overall algorithm, shown in Algorithm 2, is an actor-critic algorithm combined with a multistep bootstrapping technique and a clipped surrogate objective technique.
In Algorithm 2, the time interval $t$ is the minimum traffic signal control unit in the planning period, with a duration of $\Delta t$. We define the operator $*$ to denote the real time of time interval $t$ in the environment, where time intervals $t+1$ and $t$ are related by $(t+1)^* = t^* + \Delta t$ if $a_t = 0$, or $(t+1)^* = t^* + t_{\mathrm{yellow}} + \Delta t$ if $a_t = 1$. The controller consists of two parts: an actor $\pi(a_t | s_t; \theta)$ and a critic $V(s_t; w)$. Moreover, we adopt a deep neural network to construct the functions $V(s_t; w)$ and $\pi(a_t | s_t; \theta)$ (see Section IV-B).
The control process is as follows: at the beginning of each time interval $t$, the controller observes the state $s_t$, selects an action $a_t$ and sends it to the traffic light; after $\Delta t$ (for $a_t = 0$) or $t_{\mathrm{yellow}} + \Delta t$ (for $a_t = 1$) seconds, the controller receives the feedback $j_t$ and observes the next state $s_{t+1}$ to begin the next iteration. The simulation of one day terminates when the actual time $T^*$ reaches the duration of the planning period. As a common practice in RL, the state value of the terminal state $V(s_T)$ is equal to 0. The action $a_t$ is a binary variable, where the value 0 denotes remaining in the current phase and the value 1 denotes switching to the next phase.

5:       // Training sample collecting
6:       Sample action $a_t \sim \pi(a_t | s_t; \theta)$
7:       Execute the action $a_t$, receive the feedback $j_t$ and obtain the next state $s_{t+1}$
8:       Store sample $(s_t, a_t, j_t, s_{t+1})$ to the training buffer $B_{[qn,(q+1)n]}$
9:       // Parameter optimizing
10:      if $t + 1 = (q+1)n$ or the simulation of one day ends then
11:         Compute a batch of $n$-step TD errors $[\delta_{qn}, \delta_{qn+1}, \ldots, \delta_{qn+n-1}]$ and construct the expanded training buffer $\tilde{B}_{[qn,(q+1)n]}$ based on (14).
12:         Optimize parameters $w$ and $\theta$ using the expanded training buffer $\tilde{B}_{[qn,(q+1)n]}$ based on (12) and (15).
13:      end if
15:   end for
16: end for
17: return Parameters $w$ and $\theta$
Using a multistep bootstrapping technique, we can divide the planning period into a training sample collecting period (interaction process between traffic environment and controller) and a parameter optimizing period (the parameter optimizing process to optimize parameters θ and w).
(1) Training sample collecting period (time intervals $t = qn, qn+1, \ldots, qn+n-1$): For each time interval $t$, the controller selects an action $a_t \sim \pi(a | s_t; \theta_{qn})$ and sends the command $a_t$ to the traffic light to control the movement of the vehicles; the controller then receives the feedback $j_t$ and proceeds to $s_{t+1}$ after $\Delta t$ or $t_{\mathrm{yellow}} + \Delta t$ seconds. Then, the controller collects the sample $(s_t, a_t, j_t, s_{t+1})$ and stores it in the training buffer $B_{[qn,(q+1)n]}$. The controller starts at time interval $qn$ and repeats the above interaction process $n$ times until time interval $qn+n-1$.
Finally, the controller optimizes parameters w and θ using the BGSD technique.
The traffic signal control algorithm is summarized in Algorithm 2.
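The alternation between the sample collecting period and the parameter optimizing period within one simulated day can be sketched as a simple loop. This is a schematic under stated assumptions, not the code used in the experiments: `env_step`, `select_action` and `update` are hypothetical callables standing in for the SUMO interaction, the actor, and the batch optimization, respectively.

```python
def run_day(initial_state, env_step, select_action, update,
            n, dt, t_yellow, horizon):
    """Run one simulated day, optimizing parameters every n time intervals.

    env_step(action)     -> (cost j_t, next state s_{t+1})   [hypothetical]
    select_action(state) -> 0 (keep phase) or 1 (switch)     [hypothetical]
    update(buffer)       -> optimizes w and theta in place   [hypothetical]
    """
    state, clock, buffer, n_updates = initial_state, 0.0, [], 0
    while clock < horizon:
        # Training sample collecting period.
        action = select_action(state)
        cost, next_state = env_step(action)
        buffer.append((state, action, cost, next_state))
        # Real time advances by dt when staying, or t_yellow + dt when switching.
        clock += dt if action == 0 else t_yellow + dt
        state = next_state
        # Parameter optimizing period: every n samples, or when the day ends.
        if len(buffer) == n or clock >= horizon:
            update(buffer)
            buffer = []
            n_updates += 1
    return n_updates, clock
```

Because switching phases consumes more real time than staying, the number of time intervals in a day, and hence the number of updates, depends on the actions taken.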

V. EXPERIMENTS AND RESULTS
In this section, we provide a series of numerical examples to demonstrate the performance of the proposed traffic signal control method. The microscopic traffic simulator SUMO (Simulation of Urban Mobility) was used as the simulation environment in our experiments [48]. The vehicle agent communicates with the traffic environment through the TraCI package of SUMO. We implemented the LSTM networks using TensorFlow 1.5 to approximate the functions $V(s_t; w)$ and $\pi(a_t | s_t; \theta)$ [49]. The tests were executed on a desktop PC with a 4.20 GHz i7-7700 CPU and 32 GB of RAM, running Windows 10. The whole procedure was implemented in Python 3.5.

A. STUDY AREA
This numerical example is based on the test bed introduced by Jeffrey Glick [50]. The selected intersection is that of Palm Drive and Arboretum Road, in Stanford, CA 94305, USA. The corresponding SUMO configurations were exported from OpenStreetMap.org, as shown in Fig. 7. The given intersection has four legs, and each approaching and departure leg comprises two links. On each approaching link, vehicles in movement $m$ turn left, go through or turn right in accordance with their route $U$.

B. TRAFFIC DEMAND AND VEHICLE GENERATION
In this section, we introduce a vehicle arrival generation method based on realistic traffic demands (hourly traffic volume data of different vehicular movements). In practice, it is common to collect the hourly traffic volume data (vehicles per hour, vph) for 12 movements (4 left-turn movements, 4 through movements and 4 right-turn movements), as shown in Fig. 7. The 24-hour traffic volume data for the 12 movements are used for the test, as shown by the gray points in Fig. 8. These traffic flow data were sourced from the GitHub project by Jeffrey Glick [50]. To describe the time-dependent characteristics of the traffic flow, we adopt 12 fitted polynomial functions $[\rho_0(t), \rho_1(t), \ldots, \rho_m(t), \ldots, \rho_{11}(t)]$ to approximate the time-dependent traffic demand of the 12 traffic movements. The black curve in Fig. 8 represents the time-related characteristics of the traffic demand $\rho_m(t)$. Therefore, we can insert the vehicles into the network dynamically and stochastically to reflect the morning and evening peaks, which creates a realistic and varying flow pattern for the agent to control, as opposed to using a constant hourly demand.
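A polynomial demand curve of this kind can be fitted with an ordinary least-squares polynomial fit, as in the sketch below. Several elements are assumptions made for illustration only: the hourly volumes are synthetic placeholders (not the data from [50]), the polynomial degree of 6 is arbitrary since the degree used in this study is not stated here, the hours are rescaled to days to keep the fit well conditioned, and negative fitted values are clamped to zero.

```python
import numpy as np

# Synthetic placeholder volumes (vph) for one movement; NOT the data from [50].
hours = np.arange(24)
volume_vph = (200.0
              + 150.0 * np.exp(-0.5 * ((hours - 8) / 1.5) ** 2)
              + 180.0 * np.exp(-0.5 * ((hours - 18) / 1.5) ** 2))


def fit_demand(hours, volume_vph, degree=6):
    """Fit rho_m(t) as a polynomial in time of day; returns a function of the hour."""
    # Rescale hours to fractions of a day for a better-conditioned fit.
    coeffs = np.polyfit(np.asarray(hours, dtype=float) / 24.0, volume_vph, degree)
    poly = np.poly1d(coeffs)
    # Clamp at zero so the fitted arrival rate is never negative (assumption).
    return lambda t: np.maximum(poly(np.asarray(t, dtype=float) / 24.0), 0.0)


rho_m = fit_demand(hours, volume_vph)
```

The returned function can be evaluated at any time of day in hours, for example `rho_m(8.5)`, to obtain an arrival rate in vph.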
To simulate a series of stochastic traffic scenarios in SUMO, we assume that vehicles arrive upstream of each approach to the intersection following a Poisson distribution with a time-dependent arrival rate of $\lambda = \rho_m(t)$. Vehicles $k \in U_m$ follow the direction of movement $m$, where $U_m$ denotes all the vehicles in movement $m$. Therefore, for vehicles $k \in U_m$ in movement $m$, the headway $\zeta$ between two adjacent vehicles $k+1$ and $k$ follows the negative exponential distribution with rate $\lambda$, and $\zeta(k)$ indicates the departing time of vehicle $k$. The arrival rate $\lambda$ is estimated using the fitted function $\rho_m(t)$ (see the black curve in Fig. 8).

Table 1.
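One common way to draw departure times from a Poisson process whose rate varies over the day is thinning: candidate headways are drawn from a negative exponential distribution at a constant upper-bound rate, and each candidate is kept with probability equal to the ratio of the current rate to that bound. The sketch below illustrates this; the sampling procedure actually used in this study is not specified here, so the method, the function name and its arguments are assumptions.

```python
import numpy as np


def sample_departures(rate_vph, horizon_s, rate_max_vph, rng):
    """Sample departure times (s) for one movement by thinning.

    rate_vph(hour) : arrival rate in vph at a given time of day (in hours);
                     must never exceed rate_max_vph
    horizon_s      : length of the simulated period in seconds
    rng            : a numpy.random.Generator
    """
    lam_max = rate_max_vph / 3600.0            # upper-bound rate in veh/s
    t, departures = 0.0, []
    while True:
        # Candidate headway from a negative exponential distribution.
        t += rng.exponential(1.0 / lam_max)
        if t >= horizon_s:
            break
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.uniform() * rate_max_vph < rate_vph(t / 3600.0):
            departures.append(t)
    return np.array(departures)
```

With a constant rate equal to the bound, every candidate is accepted and the headways are exactly exponential, which recovers the homogeneous case described in the text.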

D. BENCHMARK GENERATED BY SYNCHRO
We evaluate the effectiveness of the algorithm relative to the fixed-time controller generated by Synchro [31]. Synchro has been widely recognized as a practical software tool for developing a fixed-time schedule for traffic signal control. Based on historical demand, the software calculated the best cycle length to be 60 seconds. To reflect fluctuations in the traffic demand, we divide the 24-hour demand into 7 subsets; the configuration for each subset is shown in Table 2. Additionally, a yellow transition of 4 seconds between two different phases is included at the end of each phase.

E. RESULTS ANALYSIS 1) CONVERGENCE VALIDATION
The performance of the agent during the training process is shown in Fig. 9, where the cost decreases with growing training days. The results confirm that our algorithm can learn from experience and optimize the policy gradually.
To describe the above results in detail, Fig. 9 (b) visualizes the cost $j_t$ during one day for different training days, providing insight into the noise and performance changes as the algorithm progresses. The horizontal axis represents the simulation time for one day, while the vertical axis is the cost $j_t$ obtained per time interval. Note that we use the moving average technique to illustrate the trend of the costs during a day. Fig. 9 (b) shows that the cost increases significantly during the morning and evening peaks of all the training days because of commuting traffic, which is consistent with the tendency of the traffic demand in Fig. 8. However, the magnitude of the increased delay during commuting periods decreases as the training days progress, as illustrated by the comparison between days 0, 49, 99, 149 and 199. Therefore, we can conclude that the influence of the commuting traffic is significantly weakened using our algorithm.
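For reference, a simple moving average of a per-interval cost series can be computed as sketched below. The window length used for Fig. 9 (b) is not stated in this section, so the window argument here is left to the caller, and the function name is hypothetical.

```python
import numpy as np


def moving_average(x, window):
    """Unweighted moving average; returns len(x) - window + 1 values."""
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fully overlaps x.
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="valid")
```

A longer window suppresses more of the interval-to-interval noise in $j_t$ at the price of smearing short peaks.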
Moreover, Fig. 9 (b) shows the convergence of the proposed RL algorithm. Initially, on training day 0 (day 0), the agent explores the environment and selects actions randomly most of the time. Therefore, the performance of the agent is poor, and the magnitude of the oscillations is very large. As the training days progress, the agent learns more from experience and selects more exploitative actions with a lower exploratory rate. Thus, the agent achieves a lower cost, lower variance and a near-optimal and stable performance to control the environment. Finally, after training day 83, the agent begins to converge to an optimal policy as indicated in Fig. 9 (a).

2) PERFORMANCE COMPARISON
We compare the RL controller's performance to that of the fixed-time plan generated by Synchro. As the RL controller has a different cycle length from the fixed-time plan, we collect data every 5 seconds under the RL controller and evaluate the fixed-time plan on the same test bench. To test the effectiveness of the RL controller, we chose 5 additional indicators: fuel consumption (ml/s), emissions rate (mg/s), mean speed (m/s), queue length (veh), and waiting time (seconds). Note that waiting time is different from delay, in that waiting time refers to the stopping time as a result of a red phase, while delay time refers to the extra time needed because of the control policies. With respect to the emissions rate, we only consider hydrocarbon (HC) emissions. In the experiment, we chose 10 random samples to compare the two control methods (see Table 3). Moreover, we establish performance comparisons by analyzing day 0 (the 1st sample) in detail. In Table 4 and Fig. 10, we analyze the performance of the four directions under the Synchro and RL controllers. In Fig. 11, we analyze the delay evolution over 24 h, during the morning peak from 8:45 to 9:00 and during the evening peak from 17:45 to 18:00.
As listed in Table 3, for 10 random samples, the RL controller can effectively reduce fuel consumption by 8.85%, emission rates by 17.92%, vehicle queue lengths by 33.09% and waiting times by 50.27%, and increase mean speeds by 8.31%. Therefore, we can conclude that the RL controller performs better than the fixed-time plan because it can adapt to real-time changes in traffic flows at a higher resolution. Table 4 and Fig. 10 show that, compared with the fixed-time plan in Synchro, the RL controller can significantly improve the performance of traffic signal control in terms of the five indicators. Fig. 10 shows that the RL controller can enhance the performance of the north direction in the morning peak and the performance of the east direction in the evening. Compared to the fixed-time plan, RL can reduce the queue length from 4.25 to 1.31 and from 2.60 to 1.26 in the north and east directions, respectively (see Table 4). The above result is consistent with the fact that the north traffic (flows 3, 4 and 5) and east traffic (flows 9, 10 and 11) occupy a large proportion of the traffic in the morning and evening peaks (see Fig. 8). The result shows that the RL controller can address the unbalanced time-space problem caused by commuting traffic.
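To put the directional queue lengths quoted above on the same footing as the percentage figures, the relative reduction can be computed directly from the two pairs of values. The short sketch below does this arithmetic; only the four queue lengths come from the text, and the helper name is hypothetical.

```python
def percent_reduction(baseline, controlled):
    """Relative reduction of `controlled` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - controlled) / baseline


north = percent_reduction(4.25, 1.31)   # north direction, fixed-time vs RL
east = percent_reduction(2.60, 1.26)    # east direction, fixed-time vs RL
print(f"north: {north:.1f}%, east: {east:.1f}%")
# prints "north: 69.2%, east: 51.5%"
```

Both directional reductions on day 0 are therefore larger than the 33.09% average queue-length reduction over the 10 samples, which is consistent with the RL controller helping most on the heavily loaded approaches.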
The delay evolution is illustrated in Fig. 11 (a), and the results show that the RL controller performs better than the Synchro plan because of its rapid and flexible phase changes. As Fig. 11 (a) shows, the delay under both controllers increases with increasing traffic demand in the morning and evening peaks, whereas the delay under the RL controller increases more slowly than under the fixed-time controller, especially in the morning peak. Fig. 11 (b) shows that the RL controller can increase the duration of phase 1 (time duration from 31600 to 31900) and phase 3 (time duration from 32000 to 32400) to accommodate the traffic demand in the morning peak. However, the Synchro plan is a fixed-time plan, which can cause an oversaturated traffic flow at the intersection, and delays under the Synchro plan can increase significantly (i.e., the time duration from 31600 to 31900). Additionally, Fig. 11 (c) shows that the RL controller outperforms the Synchro plan by giving more priority to left-turn traffic (phase 0 and phase 2) in the evening peak. We can conclude that the RL controller can outperform the fixed-time plan because of its adaptability and rapid responsiveness.

VI. CONCLUSION
In this paper, we propose a new adaptive traffic signal control scheme to produce optimized traffic control policies in order to minimize the delay of vehicles passing through intersections. The scheme employs an enhanced algorithm which uses spatial-temporal network information to define the traffic state, where individual vehicle delay is used as a basic measure rather than the aggregate measures of flow rate, flow speed and vehicle queue length as used in previous studies. The proposed method to identify traffic patterns can reduce information loss (such as vehicle delay) when characterizing high-dimensional features in the definition of traffic state. Furthermore, we adopted a deep neural network (the LSTM) to construct a decision-making agent whose intrinsic parameters are determined through an RL framework, thus optimizing the ability of a traffic controller to decide whether to extend the current phase or switch to the next phase. Specifically, the RL framework uses an actor-critic algorithm to obtain a balance between a biased convergence result (critic-based RL algorithms) and a high variance result (actor-based RL algorithms). Additionally, we modified this algorithm with a multistep technique and a clipped surrogate objective technique to improve its performance.
In the simulations, we built experiments based on a representation of the intersection of Palm Drive and Arboretum Road, where simulated vehicles entered the network based on the assumption that vehicle arrivals follow a Poisson distribution whose rate was derived from the 24-hour traffic flow history for this intersection. Therefore, this vehicle generation plan reflects the prevailing time-varying daily commuting traffic, which characterizes the flow peak in rush hours and the instability of traffic flow during non-rush hour periods. Regarding the primary aim of this study, to reduce vehicle delay times at intersections, it was shown in the numerical examples for 10 random samples that, compared to the optimized fixed-time plans obtained using Synchro, the proposed method can reduce vehicle waiting times by over 50%. This significant reduction also has additional knock-on effects: fuel consumption was reduced by over 8%, emission rates by over 17% and vehicle queues by over 33%, whilst mean speeds were increased by over 8%. The results, therefore, strongly indicate that the proposed scheme should be effective for traffic signal control at isolated intersections where significant traffic fluctuations are prevalent.
The proposed scheme in this study essentially focused on an isolated intersection. However, in future work the scheme could be extended to control a regional network similar to that of the OPAC and PRODYN systems. To overcome the significant computational load, a distributed RL framework should also be considered to speed up the training process.

NOTATION
In this section, the notation is given in Table 5 to clarify the whole scheme.