PlanLight: Learning to Optimize Traffic Signal Control With Planning and Iterative Policy Improvement

Intelligent traffic signal control (TSC) is essential for transportation efficiency in modern road networks. There is an emerging trend of using deep reinforcement learning techniques to train TSC models in simulators to reduce trial-and-error in real-world scenarios, and recent studies have shown promising results. The target of TSC is to minimize the average travel time in a given area. However, it is impractical to optimize this target directly by setting the average travel time as the reward function, due to its delayed feedback and the difficulty of credit assignment. Existing methods often define the reward function heuristically, which may cause a biased optimization of the real target since they only optimize the accumulated reward. In this work, we propose PlanLight, a novel planning-based TSC algorithm that learns, through behavior cloning, from the demonstrations of a rollout algorithm that obtains suboptimal control on the given target. We show the effectiveness and efficiency of the rollout algorithm in the multi-intersection control scenario. Moreover, we achieve further policy optimization by iteratively improving the base policy in the rollout procedure. Through comprehensive experiments, we demonstrate that PlanLight outperforms both conventional transportation approaches and existing learning-based methods on traffic datasets of various sizes. Furthermore, we empirically show the potential of PlanLight to serve as a general algorithm for improving future state-of-the-art TSC methods.


I. INTRODUCTION
Traffic congestion has become a severe and growing problem due to the rapid increase of vehicles in urban areas, driven by economic and technological development [5]. Meanwhile, this development also offers opportunities to upgrade current traffic signal control (TSC) systems into much smarter ones, utilizing widespread surveillance cameras and high-performance computing [16], [43].
Most modern traffic intersections rely on rule-based TSC systems such as SCATS [24] and SCOOT [18], which require extensive manual adjustment for different traffic situations and weather conditions. The emergence of supervised machine learning, however, has enabled computers to learn a given strategy and control traffic signals without human intervention [45], [46]. Unfortunately, such methods require demonstration data labeled by human experts, yet it is difficult to determine the best decision under a given traffic situation. Supervised learning methods are therefore not well suited to the TSC task.
Recently, researchers have started to investigate reinforcement learning (RL) techniques [6] for the intelligent control of traffic signals that optimize the average travel time of all passing vehicles [9], [12] in a given area. Different from supervised learning models that learn from given labeled samples, RL algorithms only require reward signals that measure the quality of single-step TSC decisions. An RL algorithm can interact with the environment and learn to maximize the accumulated reward based on the experience of its decisions and feedback. However, two main problems have long plagued RL algorithms in this practical field: 1) RL agents learn from interactions with the environment in a trial-and-error manner, which is risky to apply directly in the real world since it may cause real traffic congestion and safety problems. 2) It is non-trivial to define a proper reward function that directly optimizes the final target (i.e., average travel time), due to its delayed feedback and the difficulty of credit assignment. The former is typically solved by building a high-fidelity simulator and training a policy that performs well in the simulator [31], [32]. The latter problem, however, remains unsolved.
Early RL-based TSC methods often define the reward function in a heuristic way [16], which may result in a biased optimization of the final target. Recently, some researchers have tried to uncover the relationship between the reward and the target. LIT [37] proves that using the queue length as the reward function is equivalent to optimizing average travel time, but the statement only holds for a single intersection. PressLight [13] borrows from the max-pressure (MP) [7] algorithm, one of the state-of-the-art methods in the transportation field: it uses pressure [7] as the reward function, and minimizing pressure is provably equivalent to maximizing throughput, which in turn minimizes the average travel time. However, these relations require strong assumptions about the traffic environment (e.g., no complex acceleration or deceleration dynamics while following the front car or crossing intersections). Real situations often violate these assumptions and thus introduce a gap between reward and final target. We show this gap in Table 1, where we run multiple experiments using the same traffic flow on a 1 × 1 single-intersection road network and a 4 × 4 multi-intersection road network, recording the average reward and average travel time of some typical episodes. FRAP [12] uses negative queue length as the reward, and PressLight uses negative pressure. As can easily be observed, the final target (average travel time) does not necessarily decrease as the acquired average reward increases. The main reason that previous deep RL-based TSC methods rely heavily on short-term rewards is that they belong to the temporal-difference (TD) model-free reinforcement learning framework, which offers a straightforward way to train value functions or policies from a large amount of experience regardless of the environment dynamics [29].
Inspired by the success of AlphaGo [33] and other planning-based RL algorithms in game playing [4], [38], in this paper we explore planning-based methods for the TSC problem in order to eliminate the gap between reward and final target. Specifically, we start from a suboptimal base control algorithm and use rollout [17], an effective method that computes an improved policy based on any given base policy through a look-ahead mechanism. Rollout is guaranteed to find a policy at least as good as the base policy given a predefined (long-term) target, thus bypassing the reward-target inconsistency problem. However, applying rollout-based methods to the TSC problem introduces two main challenges. First, the rollout algorithm requires massive computation for each action, which prevents its use as a real-time decision-making model in the real world. Second, in real-world TSC tasks, there is more than one agent (traffic signal controller) to optimize and execute [15]. We need to ensure that the rollout scheme still maintains the policy improvement property under the multi-agent scenario. Moreover, the exploration space in the multi-agent case grows exponentially [2], which calls for a computationally efficient algorithm for TSC tasks.
In our work, we propose PlanLight, a planning-based TSC framework that learns from the demonstrations of a suboptimal control algorithm, is independent of any definition of the reward function, and is capable of optimizing the long-term target. Specifically, PlanLight is trained in an iterative manner, where every iteration includes a policy improvement stage and a learning stage. In the policy improvement stage, we use rollout to find a policy at least as good as the base policy, and we prove that this property still holds in the multi-agent TSC problem. In the learning stage, we utilize behavior cloning to train a policy on the demonstrations of the rollout policy. The learned policy supports real-time decision making and thus can be deployed to real-world TSC tasks. Further, at each iteration we set the trained policy as the base policy for the next policy improvement stage to achieve further optimization.
To sum up, our work makes the following key contributions:
• A novel planning-based TSC framework that directly optimizes average travel time, instead of a biased objective induced by the heuristic design of an immediate reward function. We propose PlanLight, a planning-based RL method with a rollout scheme for the TSC problem. We design a fast rollout strategy for the multi-agent TSC scenario, and the policy improvement property proved in Section III-C2 shows its ability to minimize the target given an arbitrary base policy. By setting the learned policy as the new base policy, we achieve a further performance boost with iterative policy improvement. The final learned policy not only performs well but also supports real-time control in real-world applications. To the best of our knowledge, PlanLight is the first planning-based RL method solving the TSC problem.
• A state-of-the-art TSC method in both single- and multiple-intersection (agent) scenarios. Through comprehensive experiments on synthetic and real-world datasets, we show that PlanLight outperforms existing methods from both the transportation engineering and RL fields on traffic datasets of various sizes.
• A training strategy that achieves policy improvement on all existing methods. By setting the initial base policy to different traditional and RL-based methods, we show that PlanLight achieves a considerable performance boost regardless of the initial policy. Moreover, PlanLight shows its potential to be a generic method for further improving any future state-of-the-art TSC methods.
The rest of this paper is organized as follows. In Section II, we discuss the related works. In Section III, the PlanLight framework for both single-and multiple-intersection control tasks is presented. We conduct and analyze extensive experiments in Section IV. We finally conclude this paper in Section V.

II. RELATED WORKS

A. CLASSIC TRANSPORTATION METHODS FOR TSC
We first introduce the main categories of TSC methods in the transportation engineering field. FixedTime control [26] decides phases according to predefined signal plans, so the signal timing cannot adapt to real-time changes. Actuated methods, including MaxPressure [7], [19] and Self-Organizing Traffic Light Control (SOTL) [23], define a set of rules, and the traffic signal is triggered according to both the rules and the real-time traffic situation. MaxPressure [7] is a state-of-the-art method in this field that greedily chooses the phase with maximum pressure (a predefined metric based on upstream and downstream queue lengths). SOTL adaptively regulates traffic lights based on a hand-tuned threshold on the number of waiting vehicles; it is widely deployed in today's TSC systems, yet it does not perform well enough in many situations. Commonly used systems include SCATS [24] and SCOOT [18], which take some signal plans as input and iteratively select among them according to a few predefined performance measurements. Nevertheless, all the above methods require rules or plans predefined by human experts and often perform unsatisfactorily in complicated real-world situations.
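To make the greedy rule concrete, the following is a minimal Python sketch of MaxPressure-style phase selection; the lane names, phase layout, and helper functions are illustrative assumptions, not the reference implementation of [7].

```python
# Illustrative sketch of the MaxPressure rule: pick the phase whose enabled
# movements have the largest total (upstream queue - downstream queue).

def phase_pressure(movements, queue):
    """Pressure of a phase: sum over its (upstream, downstream) movements."""
    return sum(queue[up] - queue[down] for up, down in movements)

def max_pressure_phase(phases, queue):
    """Greedily choose the phase with maximum pressure."""
    return max(phases, key=lambda p: phase_pressure(phases[p], queue))

# Hypothetical example: two phases, each enabling two movements.
phases = {
    "NS_through": [("N_in", "S_out"), ("S_in", "N_out")],
    "EW_through": [("E_in", "W_out"), ("W_in", "E_out")],
}
queue = {"N_in": 8, "S_in": 5, "E_in": 2, "W_in": 1,
         "N_out": 0, "S_out": 1, "E_out": 0, "W_out": 2}
print(max_pressure_phase(phases, queue))  # -> "NS_through"
```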
Optimization-based adaptive control approaches decide the traffic signal plans completely according to the observed data instead of human priors. These approaches typically formulate TSC as an optimization problem under a particular traffic flow model. However, to make the optimization problems tractable, strong assumptions (e.g., no complex acceleration or deceleration dynamics while following the front car or crossing intersections) about the model are often made, and these assumptions might not hold in real-world cases [25].

B. RL-BASED METHODS FOR TSC
RL-based methods formulate the TSC problem as a Markov Decision Process and optimize the cumulative discounted reward over time. These methods have achieved promising performance and shown great potential for further improvement [16]. PG-AC-TLC [41] directly takes the state as input and learns to set the joint actions of all intersections at the same time, which might suffer from the curse of dimensionality as the number of agents grows. Van der Pol et al. [1] consider explicit coordination mechanisms between learning agents by factorizing the global Q-function as a linear combination of neighbors' sub-problems. DemoLight [42] learns from the demonstrations of classic transportation methods to accelerate the training of RL models. IntelliLight [9] applies the Deep Q-Network (DQN) [39] algorithm to the TSC problem, with many innovative designs for the representations of state and reward; it also uses a phase gate combined with a memory palace in the Q-network to distinguish the decision process for different phases and prevent decisions from favoring certain actions. FRAP [12] proposes a modified network structure to capture the phase competition relation between different traffic movements and speed up the training process. CoLight [15], the current state-of-the-art TSC method in the multi-intersection scenario, applies graph attention over agents' observations to achieve cooperation. MPLight [22] incorporates several recent state-of-the-art techniques into its framework and conducts experiments on city-level TSC in a parameter-sharing manner.
Despite the superior performances of the above works, the reward-target inconsistency remains unsolved. The true reward function, which will be discussed in the next section, is hard to derive in closed form. To approximate it, some methods use a weighted sum of several measurements of the traffic situation (e.g., the number of vehicles, delay) [1], [9], [20] as the reward, while LIT [37] proves that using the queue length as the reward is equivalent to minimizing average travel time, an idea followed by later works [12], [15]. However, this statement only holds in the single-intersection scenario. Recent work justifies pressure as an eligible reward function [13] for optimizing travel time, but this justification also rests on strong hypotheses. The real world is much more complex and therefore introduces a gap between reward and final target, as shown in Table 1. In this paper, we use rollout to optimize the final target directly and alleviate this reward-target inconsistency.

C. PLANNING-BASED RL METHODS
Planning-based methods, which optimize the policy value (or utility) of the given system, have shown great success in various fields [33], [34], [38]. Monte Carlo Tree Search (MCTS) is an effective planning-based method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search tree according to the results [35]. However, it is challenging to scale MCTS up to the multi-agent case due to the exponential growth of the action space. Although Zerbel et al. [36] extended MCTS to a multi-agent version in theory, they did not show its ability to scale to complicated systems. Bertsekas proposed the rollout algorithm [30] and its multi-agent extension [17], which obtain policy improvement given an arbitrary base policy. Yet the time complexity of rollout cannot meet the real-time requirements of real-world decision-making, and no experiments have shown how much performance boost it achieves. In this paper, we apply the rollout algorithm to the multi-agent TSC problem. PlanLight learns an improved policy from the demonstrations of rollout over a base policy, and the learned policy is then set as the new base policy for a further performance boost; the procedure continues iteratively. The final learned policy network not only performs well but is also capable of real-time control.

III. THE PLANLIGHT FRAMEWORK

A. PRELIMINARIES
In this section, we briefly introduce some key concepts that are necessary for the formulation of a TSC problem [16] and some basics of RL.

1) KEY CONCEPTS IN TSC
• Approach/Lane. A roadway meeting at an intersection is referred to as an approach, consisting of an incoming approach and an outgoing approach. An approach is composed of a set of lanes. Figure 1(a) shows a typical intersection consisting of 4 approaches.
• Traffic Movement. A traffic movement refers to the movement of vehicles from an incoming approach to an outgoing approach. A traffic movement can be generally categorized as through, left turn and right turn.
• Phase. As illustrated in Figure 1(b), a phase is a combination of movement signals. A phase is valid only when it contains no conflicting movement signals (e.g., northbound through and westbound through).

2) BASICS OF RL
RL problems are defined based on the concept of Markov Decision Processes (MDPs), which frame the problem of learning from interaction to achieve a goal [6]. The learner is called the agent, and the thing it interacts with is called the environment.
The agent and the environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \ldots$. At each time step $t$, the agent selects an action $a_t \in \mathcal{A}$ based on some representation of the environment's state $s_t \in \mathcal{S}$ received from the environment. One step later, the agent receives a numerical reward $r_{t+1} \in \mathbb{R}$ and a new state $s_{t+1}$ as a consequence of its action. The function $P(s_{t+1}, r_{t+1} \mid s_t, a_t)$ defines the dynamics of the MDP [6].
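For concreteness, the following is a minimal, self-contained sketch of this interaction loop; the toy environment and random agent are illustrative stand-ins, not a TSC model.

```python
import random

# Minimal agent-environment loop over discrete time steps t = 0, 1, ..., T-1.

class ToyEnv:
    def reset(self):
        self.t = 0
        return 0                               # initial state s_0

    def step(self, a):
        self.t += 1
        r = -abs(a - 1)                        # toy reward r_{t+1}
        return self.t % 4, r                   # next state s_{t+1}, reward

class RandomAgent:
    def select_action(self, s):
        return random.choice([0, 1, 2, 3])     # a_t in A

env, agent, T = ToyEnv(), RandomAgent(), 10
s = env.reset()
for t in range(T):
    a = agent.select_action(s)                 # a_t = mu_t(s_t)
    s, r = env.step(a)                         # feedback from the environment
```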

B. PlanLight FOR SINGLE-INTERSECTION CONTROL
In this section, we introduce the PlanLight framework for single-intersection control. As discussed before, the design of a proper reward function is challenging, and existing reward functions cannot be used to directly optimize the average travel time. Thus, we adopt imitation learning (specifically, behavior cloning) [3] as the learning framework, so no specific short-term reward function is needed. Figure 2 provides an overview of the PlanLight framework. We first formulate the problem in Section III-B1, then illustrate the procedure of generating expert trajectories in Section III-B2, and elaborate on the learning framework in Section III-B3.

1) PROBLEM DEFINITION
The single-intersection control problem can be formulated as an MDP, where the agent periodically receives the traffic condition as its state and decides a phase as its action, in order to minimize the average travel time of all vehicles that pass through. Specifically, the problem is characterized by the major components $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, P, T \rangle$, where
• the state $s_t \in \mathcal{S}$ is defined as the number of vehicles $V_i$ on each lane $i$;
• the action $a_t \in \mathcal{A}$ is the selection of a certain phase $P_t$ for the next time interval $\delta_t$;
• $P$ is the state transition function, where the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ is sampled given the action $a_t$ taken in the current state $s_t$;
• $T$ is the time horizon.
The goal is to minimize the average travel time within this time horizon. Note that as we train agents with behavior cloning, no explicit reward function is needed in our definition of the MDP (we will later introduce a latent reward, which helps us prove the effectiveness of the rollout algorithm; this reward function, however, does not require an explicit formulation).
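As an illustration of this formulation, the sketch below encodes the state as a vector of per-lane vehicle counts and the action as a phase index; the lane names and dimensions are assumptions for a four-approach intersection with three lanes per approach, not the exact feature layout used later.

```python
import numpy as np

NUM_LANES, NUM_PHASES = 12, 8   # assumed: 4 approaches x 3 lanes, 8 phases

def encode_state(lane_vehicle_count):
    """State s_t: the number of vehicles V_i on each incoming lane i."""
    return np.array([lane_vehicle_count[lane]
                     for lane in sorted(lane_vehicle_count)], dtype=np.float32)

# Action a_t: an integer phase index in {0, ..., NUM_PHASES - 1} kept for
# the next decision interval delta_t.
lane_vehicle_count = {f"lane_{i}": np.random.randint(0, 20)
                      for i in range(NUM_LANES)}
s_t = encode_state(lane_vehicle_count)
assert s_t.shape == (NUM_LANES,)
```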
In order to better illustrate the PlanLight training framework, we define the policy, cost function and optimal policy as follows.
Definition 1: A policy $\pi$ is defined as
$$\pi = \{\mu_0, \mu_1, \ldots, \mu_T\},$$
where $\mu_t$ maps states $s_t$ into actions $a_t = \mu_t(s_t)$.
Definition 2: The average travel time incurred after a series of decisions at an intersection, i.e., the cost function that we want to optimize, is defined as $J(s_0, a_0, \ldots, s_T, a_T)$, and the expected average travel time of a policy can thus be defined as
$$J_\pi(s_0, T) = \mathbb{E}\big[J(s_0, \mu_0(s_0), \ldots, s_T, \mu_T(s_T))\big].$$
The target in a TSC problem is to find a policy $\pi^*$ that minimizes the average travel time within the time horizon: $\pi^* = \arg\min_\pi J_\pi(s_0, T)$.
Definition 3: Similar to the common formulation in traditional RL methods, the cost can alternatively be written as
$$J_\pi(s_0, T) = \mathbb{E}\Big[\sum_{t=0}^{T} -r_t(s_t, a_t)\Big],$$
where $r_t(s, a)$ is a latent reward function. As $J_\pi(s_0, T)$ is hard to formulate since it is directly given by the simulator, the reward function is also difficult to express explicitly.

2) ROLLOUT ALGORITHM
The aim of rollout [30] is policy improvement. Given a base policy $\pi = \{\mu_0, \mu_1, \ldots, \mu_T\}$, we can define a rollout policy $\hat{\pi} = \{\hat{\mu}_0, \hat{\mu}_1, \ldots, \hat{\mu}_T\}$, where
$$\hat{\mu}_t(s_t) = \arg\min_{a_t \in \mathcal{A}} \mathbb{E}\big[-r_t(s_t, a_t) + J_\pi(s_{t+1}, T)\big], \tag{3}$$
which satisfies $J_{\hat{\pi}}(s_t, T) \le J_\pi(s_t, T)$ for all $t, s_t$ (the statement follows directly from the definitions; we give the proof for multi-agent rollout in the next section).
Note that although the term $r_t(s_t, a_t)$ appears here, we still do not need an explicit reward function to compute the sum $\mathbb{E}[-r_t(s_t, a_t) + J_\pi(s_{t+1}, T)]$. Specifically, at each step $t$, the rollout policy computes the cost of the decisions $\{\hat{\mu}_0(s_0), \ldots, \hat{\mu}_{t-1}(s_{t-1}), a_t, \mu_{t+1}(s_{t+1}), \ldots, \mu_T(s_T)\}$ for each $a_t \in \mathcal{A}$ by Monte-Carlo simulation (where the final cost is given by the simulator) and selects the action with the lowest cost.
As such, the rollout algorithm can generate an expert policy $\hat{\pi}$ that provably performs no worse than an arbitrary given base policy $\pi$. Figure 3 illustrates the rollout process.
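The following sketch illustrates one rollout decision step under these definitions, assuming a simulator that supports saving and restoring its state (as CityFlow does, see Section IV-A); the environment interface (save_snapshot, load_snapshot, step, average_travel_time) is hypothetical.

```python
# One rollout decision step: try every candidate action, continue with the
# base policy by Monte-Carlo simulation, and keep the lowest-cost action.

def rollout_action(env, base_policy, actions, horizon):
    """Pick the action whose continuation under the base policy yields the
    lowest final cost (average travel time), as in Equation (3)."""
    snapshot = env.save_snapshot()            # freeze the current state s_t
    best_a, best_cost = None, float("inf")
    for a in actions:
        env.load_snapshot(snapshot)
        env.step(a)                           # try candidate action a_t
        for _ in range(horizon):              # follow base policy thereafter
            env.step(base_policy(env.observe()))
        cost = env.average_travel_time()      # cost J, given by the simulator
        if cost < best_cost:
            best_a, best_cost = a, cost
    env.load_snapshot(snapshot)               # restore the true state
    return best_a
```

Each decision therefore costs $|\mathcal{A}|$ forward simulations of the remaining horizon, which is exactly why Section III-B3 distills the rollout policy into a fast learned policy.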

3) LEARNING FRAMEWORK
As the rollout algorithm requires the environment to be reversible, which is unrealistic in real-world scenes, and demands massive computation before every action selection, it cannot be used as a real-time decision-making method in the real world. Thus, we utilize imitation learning to train a policy that learns to perform TSC tasks from the demonstrations of the rollout policy.
As we can easily obtain a vast number of demonstrations from the expert, i.e., the rollout policy, we employ behavior cloning as the imitation learning model, which is shown to perform decently given abundant demonstrations [3]. Given a set of trajectories demonstrated by the expert and a policy network $\mu_\theta$ that maps a state $s_t$ to an action $a_t = \mu_\theta(s_t)$, we obtain the imitation policy $\mu_{\hat{\theta}}$ by
$$\hat{\theta} = \arg\min_\theta \sum_{(s_t, a_t) \in B} L\big(\mu_\theta(s_t), a_t\big), \tag{4}$$
where $B$ represents the expert trajectory buffer that stores the state-action pairs of the rollout policy and $L$ is the cross-entropy loss function [14]. Thus, given an arbitrary base policy $\pi$, we can obtain a trained policy $\pi_\theta = \{\mu_\theta, \ldots, \mu_\theta\}$ through the procedure of deriving the rollout policy $\hat{\pi}$ and imitating $\hat{\pi}$. Through comprehensive experiments, we show that given enough demonstrations from $\hat{\pi}$, $\pi_\theta$ tends to perform better than $\pi$. Hence, $\pi_\theta$ can be regarded as the new base policy used for generating the next learned policy (which can again be used as a new base policy, and so on), as shown in Algorithm 1.
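A condensed sketch of this iteration, with the behavior-cloning update of Equation (4), might look as follows in PyTorch; `collect_rollout_demos` is a stand-in for the rollout procedure of Section III-B2, stubbed here with random data so the sketch runs end to end.

```python
import torch
import torch.nn as nn

NUM_LANES, NUM_PHASES = 12, 8
policy = nn.Sequential(nn.Linear(NUM_LANES, 64), nn.ReLU(),
                       nn.Linear(64, NUM_PHASES))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                      # L in Equation (4)

def collect_rollout_demos(base_policy, n=256):
    """Stand-in for rollout: would return the (s_t, a_t) pairs stored in B;
    stubbed with random tensors so this sketch is runnable."""
    states = torch.rand(n, NUM_LANES)
    actions = torch.randint(0, NUM_PHASES, (n,))
    return states, actions

for iteration in range(3):                           # outer loop of Algorithm 1
    states, actions = collect_rollout_demos(policy)  # improvement stage
    for epoch in range(10):                          # learning stage (BC)
        logits = policy(states)                      # mu_theta(s_t)
        loss = loss_fn(logits, actions)
        opt.zero_grad(); loss.backward(); opt.step()
# The trained policy becomes the base policy for the next iteration.
```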

C. PlanLight FOR MULTI-INTERSECTION CONTROL
In this section, we extend PlanLight to the multi-intersection scenario; an overview of the framework is shown in Figure 4. In the policy network, we use graph attention to model the interaction between intersections. A major issue in the multi-intersection scenario is that rollout becomes extremely time-consuming: the exploration space grows exponentially as the number of intersections increases. We introduce a modified but equivalent one-agent-at-a-time rollout controller that largely reduces the time complexity.

Algorithm 1 PlanLight Iteration Algorithm
Result: Optimized policy $\mu_\theta$
initialize policy parameter $\theta$;
repeat
    Initialize expert trajectory buffer $B$;
    Define base policy $\pi \leftarrow \{\mu_\theta, \ldots, \mu_\theta\}$;
    Perform rollout according to Equation (3);
    Save state-action pairs generated by the rollout policy to $B$;
    Update $\theta$ according to Equation (4);
until converged;
return $\theta$;

Similar to the previous section, we first define the problem and then illustrate the multi-agent rollout algorithm and the learning framework, respectively.

1) PROBLEM DEFINITION
We present the problem of multi-intersection TSC as a Markov game, where each of the $n$ intersections in the system is controlled by an agent. The problem can thus be characterized by $\langle \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{R}, P, T \rangle$, where:
• the state $s_t \in \mathcal{S}$ represents the traffic situation of all intersections, which is not fully observable by each agent;
• $a_t = \{a_t^1, \ldots, a_t^n\} \in \mathcal{A}$ is the set of actions, where $a_t^i$ is the selection of phase $P_t^i$ by agent $i$ for the next time interval.
• $P$ is the state transition function, where $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
Similar to the previous section, we have the following definition:
Definition 4: The joint policy can be defined as
$$\pi = \{\pi^1, \pi^2, \ldots, \pi^n\}, \tag{5}$$
where $\pi^i = \{\mu_0^i, \mu_1^i, \ldots, \mu_T^i\}$ and
$$\mu_t = \{\mu_t^1, \mu_t^2, \ldots, \mu_t^n\}. \tag{6}$$
Note that $\mu_t$ maps joint observations into joint actions, $a_t = \mu_t(o_t)$, and $\mu_t^i$ maps observations into actions, $a_t^i = \mu_t^i(o_t^i)$.

2) MULTI-AGENT ROLLOUT ALGORITHM
Note that a major issue in rollout is the minimization over $a_t \in \mathcal{A}$ in Equation (3), which becomes extremely time-consuming when the action space gets large [17]. In particular, in the multi-agent case where $a_t = (a_t^1, \ldots, a_t^n)$, the time complexity of this minimization is exponential in $n$. In this case, we can reformulate the problem by breaking the collective decision $a_t$ down into $n$ individual component decisions, thereby reducing the complexity of the action space while increasing the complexity of the state space.
We introduce a modified but equivalent problem, involving one-agent-at-a-time control selection [17]. At the generic state $s_t$, we break the joint action $a_t$ down into the sequence of $n$ decisions $a_t^1, \ldots, a_t^n$. Between $s_t$ and the next state $s_{t+1}$, we introduce artificial intermediate states $(s_t, a_t^1), (s_t, a_t^1, a_t^2), \ldots, (s_t, a_t^1, \ldots, a_t^{n-1})$ and the corresponding transitions. The whole process is illustrated in Figure 5. As such, given a base policy $\pi = \{\mu_0, \ldots, \mu_T\}$, we can define a rollout policy $\hat{\pi} = \{\hat{\mu}_0, \hat{\mu}_1, \ldots, \hat{\mu}_T\}$. Under this definition of state, the algorithm involves a minimization over only one action component at the state $s_t$ and at each intermediate state. In particular, for each stage $t$, the algorithm requires a sequence of minimizations, one over each of the agent actions $\{a_t^1, \ldots, a_t^n\}$, with the past actions determined by the rollout policy and the future actions determined by the base policy $\pi$.
Thus, $\hat{\mu}_t^i(o_t^i)$ in the true state is defined as
$$\hat{\mu}_t^i(o_t^i) = \arg\min_{a_t^i \in \mathcal{A}} \mathbb{E}\big[-r_t(s_t, \bar{a}_t) + J_\pi(s_{t+1}, T)\big], \tag{7}$$
where
$$\bar{a}_t = \big(\hat{\mu}_t^1(o_t^1), \ldots, \hat{\mu}_t^{i-1}(o_t^{i-1}), a_t^i, \mu_t^{i+1}(o_t^{i+1}), \ldots, \mu_t^n(o_t^n)\big). \tag{8}$$
Note that we write $o_t^i(s_t)$ as $o_t^i$ for simplicity. The cost improvement property can be shown analytically by induction [17]. For simplicity, we give the proof for the case of just two agents, i.e., $n = 2$, and write $\hat{\mu}_t^i(o_t^i(s_t))$ as $\hat{\mu}_t^i$.
$$J_{\hat{\pi}}(s_t, T) \le J_\pi(s_t, T), \quad \forall t, s_t. \tag{9}$$
Proof: Clearly, (9) holds for $t = T$. Assuming that it holds for index $t + 1$, for all $s_t$ we have
$$\begin{aligned}
J_{\hat{\pi}}(s_t, T) &= \mathbb{E}\big[-r_t(s_t, \hat{\mu}_t^1, \hat{\mu}_t^2) + J_{\hat{\pi}}(s_{t+1}, T)\big] \\
&\le \mathbb{E}\big[-r_t(s_t, \hat{\mu}_t^1, \hat{\mu}_t^2) + J_\pi(s_{t+1}, T)\big] \\
&\le \mathbb{E}\big[-r_t(s_t, \hat{\mu}_t^1, \mu_t^2) + J_\pi(s_{t+1}, T)\big] \\
&\le \mathbb{E}\big[-r_t(s_t, \mu_t^1, \mu_t^2) + J_\pi(s_{t+1}, T)\big] = J_\pi(s_t, T),
\end{aligned}$$
where the first inequality uses the induction hypothesis and the last two follow from the minimizations in Equation (7), which completes the proof.
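The sketch below illustrates the one-agent-at-a-time selection of Equations (7)-(8): agent $i$ minimizes over its own action while earlier agents use their already-chosen rollout actions and later agents follow the base policy, so each stage costs $n \cdot |\mathcal{A}|$ simulations instead of $|\mathcal{A}|^n$. The simulator interface is the same hypothetical one as in the sketch of Section III-B2.

```python
# One stage of one-agent-at-a-time rollout: agents are optimized in sequence.

def multi_agent_rollout_step(env, base_policies, actions, horizon, n_agents):
    """Return the joint action (a_1, ..., a_n) chosen one agent at a time."""
    chosen = [None] * n_agents
    snapshot = env.save_snapshot()
    for i in range(n_agents):                  # minimize over agent i only
        best_a, best_cost = None, float("inf")
        for a in actions:
            env.load_snapshot(snapshot)
            obs = env.observe()                # per-agent observations o_t^j
            trial = chosen[:i] + [a]           # past agents: rollout actions
            trial += [base_policies[j](obs[j])  # future agents: base policy
                      for j in range(i + 1, n_agents)]
            env.step(trial)
            for _ in range(horizon):           # continue with the base policy
                obs = env.observe()
                env.step([base_policies[j](obs[j]) for j in range(n_agents)])
            cost = env.average_travel_time()
            if cost < best_cost:
                best_a, best_cost = a, cost
        chosen[i] = best_a
    env.load_snapshot(snapshot)                # restore the true state
    return chosen
```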

3) LEARNING FRAMEWORK
Inspired by CoLight [15], we integrate a Graph Attention Network [40] into the behavior cloning network to achieve coordination among agents during imitation learning. Specifically, to model the overall influence of neighbors on the target intersection, the representations of several source intersections are combined according to their respective importance as
$$h(o_i) = \sigma\Big(W_q \cdot \sum_{j \in \mathcal{N}_i} \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} (W_c h_j) + W_b\Big), \tag{10}$$
where $h_j$ is the embedding of the observation of intersection $j$, $W_c \in \mathbb{R}^{m \times c}$ is the weight parameter for the source intersection embedding, $W_q$ and $W_b$ are trainable variables, and $e_{ij}$ is given by
$$e_{ij} = (W_t h_i)^\top (W_s h_j), \tag{11}$$
with $W_s, W_t \in \mathbb{R}^{m \times n}$ representing the embedding parameters for the source and target intersections, respectively. With the weighted sum of neighborhood representations $h(o_i)$ and a policy network $\mu_\theta$, the phase $p_t^i$ of intersection $i$ is selected by $a_t^i = \mu_\theta(h(o_i))$. Note that the policy network $\mu_\theta$ is shared among all agents in our setting.
Therefore, the minimization over $\theta$ in the behavior cloning stage is obtained by
$$\hat{\theta} = \arg\min_\theta \sum_{(o_t, a_t) \in B} \sum_{i=1}^{n} L\big(\mu_\theta(h(o_t^i)), a_t^i\big), \tag{12}$$
and the iteration algorithm for policy improvement with Equations (7) and (12) is similar to Algorithm 1 and is thus omitted here.
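A minimal PyTorch sketch of the attention aggregation in Equations (10)-(11) is given below; the dimensions and parameter names are illustrative rather than the exact PlanLight implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborAttention(nn.Module):
    """Aggregate neighbor observation embeddings by learned importance."""
    def __init__(self, obs_dim, embed_dim):
        super().__init__()
        self.W_s = nn.Linear(obs_dim, embed_dim, bias=False)  # source, Eq. (11)
        self.W_t = nn.Linear(obs_dim, embed_dim, bias=False)  # target, Eq. (11)
        self.W_c = nn.Linear(obs_dim, embed_dim, bias=False)  # content, Eq. (10)
        self.W_q = nn.Linear(embed_dim, embed_dim)            # output projection

    def forward(self, h_i, h_neighbors):
        # e_ij: compatibility between target i and each source j, Eq. (11)
        e = (self.W_t(h_i) * self.W_s(h_neighbors)).sum(-1)   # [n_neighbors]
        alpha = F.softmax(e, dim=-1)                          # attention weights
        # Importance-weighted sum of source representations, Eq. (10)
        mixed = (alpha.unsqueeze(-1) * self.W_c(h_neighbors)).sum(0)
        return torch.relu(self.W_q(mixed))                    # h(o_i)

attn = NeighborAttention(obs_dim=12, embed_dim=32)
h_i, h_nbrs = torch.rand(12), torch.rand(4, 12)               # 4 neighbors
print(attn(h_i, h_nbrs).shape)                                # torch.Size([32])
```

The shared policy head $\mu_\theta$ then takes $h(o_i)$ as input for every intersection, which keeps the parameter count independent of the number of agents.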

IV. EXPERIMENTS
In this section, we conduct extensive experiments to answer the following questions: 1) How does our proposed method perform compared with other state-of-the-art methods? 2) Do the iterations of rollout and imitation contribute to policy improvement? 3) How much policy improvement can PlanLight obtain given different initial base policies? Our code has been released in an open-source repository.1

A. ENVIRONMENT
The experiments are conducted on CityFlow2 [32], an open-source traffic simulator that supports large-scale TSC. Given a road network definition and a flow setting, CityFlow simulates the dynamics of each vehicle. It also provides information about the environment, which serves as the state observed by the agents, and agents can control traffic signals through control APIs. In addition, CityFlow provides APIs to save and load snapshots of the environment, which makes rollout possible.
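A typical control loop looks as follows; the method names follow the CityFlow Python API (e.g., `set_tl_phase`, `next_step`, `snapshot`, `load`) but should be checked against the installed version, and `choose_phase` is a placeholder decision rule of ours.

```python
import cityflow

eng = cityflow.Engine("config.json", thread_num=1)

def choose_phase(counts):
    return 0  # placeholder: a real agent decides from the observation

for step in range(3600):
    counts = eng.get_lane_vehicle_count()        # observation: vehicles per lane
    if step % 20 == 0:                           # one decision per 20 s interval
        eng.set_tl_phase("intersection_1_1", choose_phase(counts))
    eng.next_step()                              # advance one simulation second

# Snapshots make rollout possible: freeze, branch, and restore the state.
archive = eng.snapshot()
# ... simulate candidate actions forward from here ...
eng.load(archive)                                # restore the saved state
print(eng.get_average_travel_time())             # the final target J
```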

B. DATASETS 1) SYNTHETIC DATA
Synthetic data is used in both scenarios. In the single-intersection scenario, the road network contains a four-approach intersection with three lanes per approach, and the two traffic flows tested are generated uniformly with arrival rates of 400 and 600 vehicles/hour/lane. In the multi-intersection scenario, we use three sizes of road networks (2 × 2, 3 × 3 and 6 × 6), where each intersection has four approaches with three lanes each (left-turn, through, and right-turn, respectively). The traffic flows are generated uniformly with an arrival rate of 500 vehicles/hour/lane and turning ratios of 10% (turning left), 80% (going straight), and 10% (turning right).

2) REAL-WORLD DATA
Real-world traffic data from Hangzhou and Jinan is also used in our experiments and is available from the TSC Benchmark.3 Statistics of the traffic flows in these datasets are shown in Table 2.

C. COMPARED METHODS
In the following experiments, we compare our model with the following two categories of methods: Transportation engineering methods and RL methods.

1) TRANSPORTATION ENGINEERING METHODS
• FixedTime [26]: This method determines phases according to a pre-determined plan for cycle length and phase time.
• SOTL [23]: This method adaptively regulates traffic lights based on a hand-tuned threshold on the number of waiting vehicles.

2) RL METHODS
• Individual RL (DQN) [29]: This method leverages a DQN framework for TSC. Each intersection is controlled by an individual agent, and the agents update their networks independently.
• OneModel [28]: This method uses the same framework as Individual RL. Different from individual RL, all the agents share the same network parameters.
• PressLight [13]: An RL-based method which incorporates pressure in the observation and reward design for the RL model.
• FRAP [12]: This method uses a modified network structure to capture the phase competition relation between different traffic movements and speed up training.
• CoLight [15]: A state-of-the-art multi-intersection control method which utilizes graph attention network on agents' observations to achieve network-level cooperation between neighbouring agents with shared parameters.
• FreeFlow: A lower bound on the average travel time for a given road network and flow. It is computed assuming every vehicle in the flow travels from origin to destination without being stopped by red lights or blocked by other vehicles.

D. EXPERIMENT SETTINGS
The initial base policies are FRAP [12] and CoLight [15] for the single- and multiple-intersection scenarios, respectively.
Following previous work [15], each green signal is followed by a three-second yellow signal and a two-second all-red time. The decision interval of every agent is set to 20 seconds, and the time horizon is 3600 seconds.
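Under this convention, one decision interval can be executed as in the sketch below, where we assume the yellow and all-red periods are taken out of the 20-second interval on a phase switch; the engine interface is the CityFlow-style one from Section IV-A, and the phase identifiers are hypothetical.

```python
YELLOW, ALL_RED, INTERVAL = 3, 2, 20   # seconds, per the settings above

def apply_decision(eng, inter_id, prev_phase, new_phase,
                   yellow_phase, all_red_phase):
    """Execute one 20 s decision: insert yellow + all-red on a switch."""
    if new_phase != prev_phase:                  # transition only on a switch
        eng.set_tl_phase(inter_id, yellow_phase)
        for _ in range(YELLOW):
            eng.next_step()
        eng.set_tl_phase(inter_id, all_red_phase)
        for _ in range(ALL_RED):
            eng.next_step()
        green_time = INTERVAL - YELLOW - ALL_RED
    else:
        green_time = INTERVAL                    # keep the current green
    eng.set_tl_phase(inter_id, new_phase)
    for _ in range(green_time):
        eng.next_step()
```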

E. EVALUATION METRIC
Following existing studies, we use the average travel time to evaluate the performance of different models for TSC: the average time (in seconds) that vehicles spend between entering and leaving the area, which is the most frequently used performance measure for traffic signal control in the transportation field. Notice that the percentage improvement is calculated with respect to FreeFlow using Equation (13):
$$\text{Improvement} = \frac{t_{\text{baseline}} - t_{\text{method}}}{t_{\text{baseline}} - t_{\text{FreeFlow}}} \times 100\%. \tag{13}$$
Since no TSC method can perform better than the FreeFlow lower bound, this relative improvement serves as a more accurate measurement of the true performance boost.
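The metric and the relative improvement can be computed as in the sketch below, which uses our reconstruction of Equation (13) above (the exact normalization is an assumption on our part).

```python
# Average travel time and FreeFlow-relative improvement, per Eq. (13) as
# reconstructed above; inputs are per-vehicle enter/leave timestamps.

def average_travel_time(enter_times, leave_times):
    """Mean time (seconds) each vehicle spends inside the area."""
    return sum(l - e for e, l in zip(enter_times, leave_times)) / len(enter_times)

def relative_improvement(t_baseline, t_method, t_freeflow):
    """Percentage improvement with respect to the FreeFlow lower bound."""
    return (t_baseline - t_method) / (t_baseline - t_freeflow) * 100.0
```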

F. PERFORMANCE COMPARISON
We show the performance of transportation methods as well as RL models on single-intersection data in Table 3 and multi-intersection data in Table 4. We have the following observations.

1) IN SINGLE INTERSECTION SCENARIO
PlanLight achieves consistent performance improvements over state-of-the-art transportation and RL-based methods across diverse road networks and traffic patterns, as shown in Table 3, with an average improvement of 0.33% on synthetic datasets and 20.7% on real-world datasets. Like other RL-based methods, PlanLight outperforms all transportation engineering methods thanks to its ability to learn and adapt to complicated traffic situations. As the synthetic data for a small traffic area is relatively simple, existing RL-based methods already obtain excellent performance, so PlanLight does not show its superiority there. However, on real-world traffic flows, which are much more complicated, PlanLight significantly outperforms all RL-based methods due to its direct optimization of the target. Notice that FRAP outperforms the other baselines on real-world data; this is mainly due to the sophisticated design of its Q-network, which incorporates prior knowledge such as movement-phase relations, phase demand, and phase-pair competition into the network structure. PlanLight uses a simple multilayer perceptron as the policy network but still achieves the best results.

2) IN MULTI-INTERSECTION SCENARIO
PlanLight outperforms all traditional and RL-based methods in all experimental settings, with an average improvement of 4.54% on synthetic data and 8.03% on real-world data. PlanLight outperforms all existing methods on every traffic flow and road network setting due to its superiority in finding cooperative policies, based on the combination of direct optimization of travel time and the design of the policy network. There are three main reasons why the improvement PlanLight brings is smaller here than in the single-intersection experiments. First, as discussed in Section III-C2, in order to reduce the time complexity of rollout, we perform rollout operations sequentially over agents, which may reduce the improvement over the base policy. Second, since the chance that a vehicle meets a red light increases with the number of intersections it traverses, the lower bound becomes harder to approach as the road network grows larger; thus, the relative improvement decreases in the multi-intersection scenario. Third, the interaction between agents becomes more important in decision making. CoLight, the best among all baselines, uses a well-designed graph attention network (GAT) [40] to model the communication between neighboring intersections. We borrow the idea of GAT but do not fine-tune the structure (e.g., multi-head attention); thus, behavior cloning might not learn the optimal policy from the demonstrations of the rollout process. We leave this as future work because it is independent of the learning framework.

G. POLICY IMPROVEMENT OVER ITERATIONS
Figure 6 shows the performance curves with respect to the number of iterations. As defined in Section III-B3, a rollout policy is the policy improved upon a given base policy by performing rollout, and a learned policy is the policy trained via behavior cloning, which is randomly initialized in this experiment. The results demonstrate a tendency of continuous improvement for both policies during the iterations: the rollout policy and the learned policy improve each other and finally converge to a decent result. Through these iterations of mutual optimization, our training framework remedies the limitation of rollout that the policy improvement of a single round might be minor.

H. IMPROVEMENT ON DIFFERENT BASE POLICIES
Figure 7 shows the improvement PlanLight obtains based on different existing policies. The TSC results are rescaled to clearly show the performance boost achieved by PlanLight in the chart. The results show that, initialized with an arbitrary existing TSC method, PlanLight can improve the policy after a few iterations of rollout and imitation learning, regardless of the size of the road network. Based on this result, we can claim that PlanLight has the potential to achieve further policy improvement on future state-of-the-art methods.

I. CASE STUDY
Figure 8 shows a typical case from the real-world traffic data of D_HangZhou. To better illustrate the situation, we re-draw the north-west part of the original road network.
Figure 8(a) shows the traffic situation at the north-west corner of D_HangZhou at time t = 910 s. Intersection A is suffering congestion in both the north-to-south and west-to-east directions, and it is difficult for the agent to decide the best phase considering only current and local information. Intersection C has large pressure in the east-west direction, while few vehicles are waiting at intersections B and D. At this step, CoLight decides to set the green light for the southbound direction, while PlanLight sets the green light for the eastbound direction. Figure 8(b) shows the traffic situation at time t = 1070 s, with the intersections controlled by CoLight since t = 910 s. At time t = 910 s, the optimal strategy for intersection C is to set a long green light for the eastbound direction, as few vehicles are waiting at intersection D at that time. However, to minimize the reward (queue length) of intersection A, CoLight sets the green light for the southbound direction, which causes many vehicles to wait on the southbound lane at intersection C. Intersection C therefore sets the green light for the southbound direction a few decision steps later and misses the chance to clear the waiting eastbound vehicles while intersection D still has light traffic pressure. A few steps later, when the traffic from intersection D increases, intersection C faces a dilemma in which both directions carry a large amount of traffic, which inevitably leads to congestion.
On the other hand, Figure 8(c) shows the traffic controlled by PlanLight since t = 910 s, which is much smoother than in Figure 8(b). With the help of planning, PlanLight is able to directly optimize the final target (average travel time). It learns from historical experience that the C-D arterial is likely to see a surge in traffic and chooses to hold the southbound traffic of intersection A, trading temporary congestion for a long-term improvement in average travel time.

V. CONCLUSION
In this paper, we propose PlanLight, a novel planning-based TSC algorithm. PlanLight learns to control traffic signals from the demonstrations of a suboptimal control method and uses rollout to achieve policy improvement with regard to the direct optimization of average travel time, eliminating the reward-target bias of previous works. We design an effective rollout strategy for the multi-agent TSC problem and prove its effectiveness. The final policy is learned via behavior cloning, which enables real-time decision making. We conduct extensive experiments on synthetic and real-world data and show the superior performance of our proposed method over state-of-the-art methods. To the best of our knowledge, PlanLight is the first planning-based RL method solving the TSC problem. Besides, we demonstrate the ability of our framework to obtain a performance boost over given methods by setting them as the initial base policy of the learning procedure.
We would like to point out several future directions to make the method more applicable and general. First, the policy improvement stage can be designed more efficiently, as agents share the network parameters and might have similar state-action trajectories; fewer rollout steps would be needed with a more elaborate mechanism for handling similar trajectories among different agents. Second, as the rollout policy is relatively complicated in a large-scale traffic network, it might be difficult to directly learn a cooperative policy without an effective communication mechanism. Integrating the policy network with better communication methods, for example by taking into consideration the spatial-temporal properties of the traffic flows between different intersections, might help to learn a superior policy. Third, our approach can be integrated within Smart Cities (SC) [44] as a part of Intelligent Transport Systems (ITS). The Internet of Things (IoT) [43] paradigm can help collect more data, such as traffic speed and volume, with improved sensing capabilities, and the integration of fifth-generation (5G) networks enables reliable, high-speed, high-capacity communication between agents. In turn, our intelligent TSC algorithm can help improve the efficiency of the city as well as provide environmental benefits (e.g., less CO2 emission, better air quality, and more sustainable cities).