CoTV: Cooperative Control for Traffic Light Signals and Connected Autonomous Vehicles Using Deep Reinforcement Learning

Reducing travel time alone is insufficient to support the development of future smart transportation systems. To align with the United Nations Sustainable Development Goals (UN-SDG), further reductions in fuel consumption and emissions, improvements in traffic safety, and ease of infrastructure deployment and maintenance should also be considered. Unlike existing work that optimizes the control of either traffic light signals (to improve intersection throughput) or vehicle speed (to stabilize traffic), this paper presents a multi-agent Deep Reinforcement Learning (DRL) system called CoTV, which Cooperatively controls both Traffic light signals and Connected Autonomous Vehicles (CAV). CoTV can therefore balance the reduction of travel time, fuel consumption, and emissions. CoTV also scales to complex urban scenarios by cooperating with only the one CAV nearest to the traffic light controller on each incoming road. This avoids costly coordination between traffic light controllers and all possible CAVs, thus leading to stable training convergence of CoTV in large-scale multi-agent scenarios. We describe the system design of CoTV and demonstrate its effectiveness in a simulation study using SUMO under various grid maps and realistic urban scenarios with mixed-autonomy traffic.


I. INTRODUCTION
Developing the next-generation Intelligent Transportation Systems (ITS) is one of the key ways to achieve the United Nations Sustainable Development Goals (UN-SDG) [1]. In particular, firstly, sustainable traffic requires higher efficiency to reduce the enormous monetary losses caused by excessive traffic delays. Secondly, more eco-friendly driving should be encouraged to decrease fuel consumption and gas emissions (mainly CO2). Thirdly, traffic safety is inherently one of the key indicators of sustainable traffic, and it should be enhanced by avoiding potential collisions to save lives. Last but not least, to achieve those sustainable traffic goals, easy-to-deploy ITS infrastructure is critical.
Most existing research in sustainable urban traffic control adjusts either traffic light signals or vehicle speed. Traffic light signal controllers dynamically select the best timing plan according to the real-time traffic. As shown in Fig. 1, this can directly increase intersection throughput, thus reducing travel time as well as energy consumption and emissions. A CAV can proactively control its acceleration, as shown in Fig. 1, to achieve more stable nearby traffic with relatively higher driving velocity (i.e., lower fuel consumption and gas emissions) and keep a safe distance [2] from the surrounding traffic (i.e., longer time-to-collision). Recent research from the transportation domain attempts to explore the potential of joint control of both traffic light signals and vehicle speed. Methodologies used in such research include mixed-integer linear programming [3], the enumeration method, and the pseudo-spectral method [4]. However, these methods may not perform well in realistic traffic scenarios because their deterministic traffic control decisions are insufficient to deal with a fast-changing urban environment [5].

Jiaying Guo is with the School of Computer Science, University College Dublin, Ireland. E-mail: jiaying.guo@ucdconnect.ie. Long Cheng is with the School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China. E-mail: lcheng@ncepu.edu.cn. Shen Wang is with the School of Computer Science, University College Dublin, Ireland. E-mail: shen.wang@ucd.ie.

Fig. 1. The illustration of the motivation and goals of our proposed system CoTV. Traditionally, traffic light controllers can increase intersection throughput, thus reducing travel time and fuel consumption, while a CAV adjusts its speed to reduce fuel consumption and maintain a safe time gap to its surrounding traffic. CoTV coordinates these two different types of agents to achieve a more comprehensive set of sustainable traffic goals.
Unlike the aforementioned traditional methods, many researchers nowadays have demonstrated the great potential of DRL in solving traffic control challenges under complex urban scenarios. For instance, inspired by the traditional traffic signal control method MaxPressure [6], PressLight [7] achieves even better traffic efficiency improvements under various urban scenarios using DRL. Moreover, DRL-based traffic signal control can also reduce the waiting time of specific vehicles in emergency situations in which traffic conditions vary quickly [8]. On the other hand, efficient and effective CAV speed control can stabilize traffic in many complex and changing road scenarios using DRL [9], which is traditionally infeasible with optimization-based controllers. However, there is a lack of research using DRL for the joint control of both urban intersection signals and vehicle speed. This DRL-based joint control is challenging due to the difficulty of designing a proper cooperation scheme for two different agent types (i.e., traffic light controllers and CAVs). Moreover, the unpredictability of urban mixed-autonomy traffic makes it even harder to converge within a reasonable number of training iterations.

arXiv:2201.13143v2 [cs.AI] 14 Feb 2023
To overcome the limitations mentioned above, we propose CoTV: a multi-agent DRL-based system that cooperatively controls traffic light signals and CAVs. CoTV balances the advantages of both traffic light controllers and CAVs to achieve more sustainable traffic, as shown in Fig. 1. Concretely, the contributions of our work are as follows:

• Effective cooperation schemes between CAVs and traffic light controllers. Different from the Multi-Agent Reinforcement Learning (MARL) literature on traffic control, instead of an action-dependent design [10] (i.e., the action of one agent depends on the actions of other agents in the shared environment), our cooperation schemes rely on the exchange of states between agents within the range of one intersection, including the traffic light controller and the approaching CAVs. This so-called "action-independent MARL" [11] works in CoTV because the objectives of the traffic light controller and the CAV for traffic improvement are inherently complementary (i.e., rather than overlapping: all improving travel time or all reducing fuel). Thus, CoTV takes advantage of the simplicity of the action-independent MARL design in DRL training while remaining effective in improving traffic under various scenarios. The cooperation schemes of CoTV are shown to facilitate training convergence, which is challenging for independent MARL that does not include any cooperation (in either state or action). Specifically, CoTV using Proximal Policy Optimization (PPO) [12] obtains up to a 30% reduction in travel time, fuel consumption, and CO2 emissions under varying CAV penetration rates.

• Scalable to complex urban scenarios by avoiding cooperation with excessive CAV agents. Compared with controlling all possible CAVs using MARL, the traffic light controller in CoTV selects the CAV closest to the intersection on each incoming road as the CAV agent. This idea is inspired by the observation that platooning can increase intersection throughput [13], as the leading vehicle on a road has great potential to form a platoon with the remaining vehicles on the same road. We also demonstrate that, compared with coordinating all CAVs (CoTV*), CoTV does not compromise efficiency improvement while significantly reducing training time and resources.

This paper extends our previous work [15] on controlling traffic light signals and CAVs cooperatively using DRL. The improvements include: 1) The system framework of CoTV is designed to address scalability issues, significantly reducing the number of CAV agents controlled. 2) The state and reward of agents are simplified by removing redundant traffic information; the amount of information exchanged among agents is therefore reduced, easing the deployment of CoTV. 3) The testing scenarios are extended from a small grid map to more realistic urban scenarios. 4) We demonstrate the robustness of CoTV under different CAV penetration rates. 5) As an important requirement for achieving sustainable traffic, the effectiveness of CoTV in enhancing traffic safety is evaluated by time-to-collision [16]. 6) Two other common MARL approaches, action-dependent and independent, are compared with the action-independent MARL of CoTV in terms of policy training and traffic improvements.

II. RELATED WORKS
This section overviews the recent related work and highlights the gaps that CoTV attempts to fill. In particular, it first focuses on research in either traffic light signal control or vehicle speed control. Secondly, it discusses recent research in the joint control of both. Lastly, it summarizes the practicability of deploying existing work in mixed-autonomy traffic and its impact on traffic efficiency and safety.

A. Control for Either Traffic Light Signals or Vehicle Speed
Most existing research in sustainable urban traffic control adjusts either traffic light signals or vehicle speed. The Sydney Coordinated Adaptive Traffic System (SCATS) [17] is one of the earliest and most widely applied traffic light signal control systems. It can dynamically select the best signal plan from a list of pre-defined candidates that can potentially achieve better intersection throughput by improving green time efficiency. Varaiya [6] proposed a traffic light signal control scheme named MaxPressure, which was proven to maximize the throughput of the entire road network with each traffic light controller receiving only local traffic information. On the other hand, the field experiments in [18] prove that the speed control of CAVs can stabilize traffic and is beneficial for reducing braking times and fuel consumption. The Green Light Optimal Speed Advisory (GLOSA) system guides a CAV to adjust its speed according to the current traffic signal phase and the remaining distance to its approaching intersection [19]. The resulting smoother acceleration/deceleration of CAVs can further reduce fuel consumption and CO2 emissions. However, these traffic control optimization approaches rely on deterministic formulations to make dynamic traffic problems tractable. These deterministic formulations remain static in ever-changing traffic and thus may not be flexible enough to improve realistic traffic.
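To make the MaxPressure idea mentioned above concrete, the following is a minimal sketch of its "pressure" computation; the data layout (queue-count pairs per phase) and all names are illustrative assumptions, not the notation of the original paper.

```python
# A minimal sketch of the "pressure" idea behind MaxPressure [6].

def movement_pressure(incoming_queue, outgoing_queue):
    """Pressure of one traffic movement: upstream demand minus downstream occupancy."""
    return incoming_queue - outgoing_queue

def max_pressure_phase(phases):
    """Pick the phase whose released movements carry the largest total pressure.

    `phases` maps a phase id to the (incoming_queue, outgoing_queue) pairs
    of the movements that phase releases.
    """
    return max(
        phases,
        key=lambda p: sum(movement_pressure(i, o) for i, o in phases[p]),
    )

# Example: a two-phase intersection with heavier north-south demand.
phases = {
    "NS_green": [(8, 2), (5, 1)],  # total pressure (8-2) + (5-1) = 10
    "EW_green": [(3, 0), (2, 4)],  # total pressure (3-0) + (2-4) = 1
}
print(max_pressure_phase(phases))  # -> NS_green
```

Because each controller only needs the queue counts of its own incoming and outgoing roads, the decision is entirely local, which is the property that makes the scheme attractive for network-wide deployment.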
DRL has been used to cope with complex traffic environments, promising better urban traffic. PressLight [7] is a DRL-based model using Deep Q-learning (DQN). It collects local real-time traffic information inspired by the traditional method MaxPressure [6], while achieving greater traffic efficiency improvements than MaxPressure. Wu et al. [9] extended the field experiments in [18] using the Trust Region Policy Optimization (TRPO) method to train CAVs in a simulated experiment; their DRL-based vehicle speed controller surpasses traditional optimization controllers in traffic improvement. Various scenarios using CAVs were tested in [20], including road merging and unsignalized intersections, where DRL-based speed control of CAVs can optimize the vehicle trajectory over the whole trip and reduce the risk of collision throughout. Compared to traditional optimization methods with deterministic solutions, the DRL methods used in our proposed CoTV learn from trial-and-error interaction with the environment to train optimal policies for various traffic scenarios, making them more capable of performing adaptive traffic control and generalizing well under fast-changing urban road scenarios.

B. Joint Control for Traffic Light Signals and Vehicle Speed
Traditional optimization-based methods have attempted to jointly control traffic light signals and vehicle speed. Yu et al. [3] developed mixed-integer linear programming to optimize vehicle trajectories and traffic signals simultaneously at isolated intersections; the phase sequence and duration of traffic light signals are coordinated with vehicle arrival times at the intersections. A two-level model for traffic light controllers and CAVs was proposed using the enumeration method and the pseudo-spectral method [4]: the first level coordinates CAVs and traffic light controllers to minimize travel time, and the second level regulates CAV trajectories to reduce fuel consumption. The same system targets were adopted in the cooperative optimization model of [21], which uses a mixed-integer non-linear program and thus inherently has high computational complexity.
To the best of our knowledge, DRL methods for the joint control of traffic light signals and CAVs have not been well studied. Joint control using DRL faces several challenges common in multi-agent systems [22]: (1) Every agent, whether a traffic light controller or a CAV, proactively interacts with the same environment simultaneously, creating a non-stationary environment that brings more uncertainty to training convergence. (2) A large number of agents causes scalability issues due to an exponential increase in the computational cost of joint actions. (3) The reward of agents can assess the system at different spatial scales in the environment: individual, regional, or global. Reward design is critical for DRL agents due to its high correlation with achieving system goals; for example, traffic light controllers explicitly coordinate traffic around intersections, while each CAV mainly affects its surrounding traffic. The model proposed in this paper attempts to overcome these difficulties and utilize the advantages of DRL methods to control traffic light signals and CAVs cooperatively.

C. Efficiency and Safety for Mixed-Autonomy Traffic
The development of CAVs is thriving in both academia and industry and is expected to improve traffic. However, deployment must pass through a gradual mixed transition from introductory, to established, to prevalent [23] as the CAV penetration rate grows. Existing work shows that mixing CAVs into traffic still brings uncertainty. Mixed-autonomy experiments on motorways were conducted in [24], simplified from intersections with conflicting traffic movements. Similar work was tested on single-lane facilities, where CAVs can enhance traffic safety by keeping a larger gap from surrounding vehicles [25]. However, a low penetration rate (less than 10%) causes more conflicts in urban scenarios [26]. On the other hand, CAVs have the potential to improve traffic efficiency but cannot guarantee a higher average speed than traditional vehicles, depending on the network type and traffic conditions. Experiments in a ring scenario [27] show that a CAV penetration rate greater than 20% allows all vehicles to reach higher speeds and stabilizes the flow, and a penetration rate between 20% and 40% is likely to yield near-maximum improvements [26]. Overall, a high CAV penetration rate can improve both traffic efficiency and safety in mixed-autonomy traffic across various scenarios.
This work advances the state-of-the-art in assessing DRL-based mixed-autonomy control under dynamic urban road scenarios with multiple intersections. Moreover, our system CoTV chooses only a small fraction of CAVs to cooperate with traffic light controllers, which have great potential to guide the rest of the vehicles. This makes the deployment of CoTV practical and easy to scale.

III. SYSTEM OVERVIEW
This section explains the design of our system CoTV. Firstly, we outline the system design goals. Then, the system components (i.e., traffic light controllers and CAVs) are presented with the design of their action, state, and reward; the cooperation schemes between the two types of agents using Vehicle-To-Everything (V2X) communications are elaborated as well. Thirdly, we explain the training process of CoTV using PPO, during which parameter sharing is applied to all agents of the same type to perform the learned policy. Additionally, we present the considerations for ease of deployment in designing CoTV. The code of this study is open-sourced.

A. System Design Goals
The proposed system CoTV aims to achieve the following goals, which are also shown in Fig. 1:

• Reduced travel time: Travel time is the metric that end road users care about most. Our system should reduce the travel time of a vehicle with a given route. This goal is traditionally achieved by traffic light signal control that can increase intersection throughput.

• Lower fuel consumption and CO2 emissions: Sustainable traffic goals encourage eco-friendly driving behaviors. This goal is traditionally achieved by the speed control of CAVs that can stabilize traffic flow. Smart traffic light control can also partly contribute to this goal by reducing the number of stop-and-go events.

• Longer time-to-collision: Safety is a crucial consideration in sustainable traffic system design. Reducing the risk of collision can be achieved by maintaining a longer time-to-collision [16], with sufficient time to decelerate moderately. A CAV can proactively keep a safe distance from the surrounding traffic; thus, ITS using CAVs has the potential to achieve higher traffic safety.

• Easy to deploy: CoTV requires a V2X communication infrastructure to support the information exchange underlying cooperative control. Meanwhile, scalability issues should be addressed as the number of agents grows. Efficient communication schemes among traffic light controllers and a reduced number of CAV agents are the key to achieving this goal.

B. System Components
Our proposed system assumes that all vehicles are connected, including CAVs and Human-Driven Vehicles (HDV) (details can be found in Table II). The V2X communication is also assumed to be perfect, with no packet loss and no latency. The main components of CoTV are traffic light controllers and CAVs, as shown in Fig. 2. The design of action, state, and reward for each is described as follows, while the V2X communication schemes involved are shown in Fig. 3:

1) Traffic light controller:

• Action: We limit the action of the traffic light controller to a binary set, where "1" represents switching to the next phase at the next timestep, while "0" means keeping the current phase unchanged. As opposed to other common action definitions in the literature, such as phase selection [11], the phase switch [28] we choose is more manageable for the model training process.
We also illustrate the above reward in Fig. 4. The reward function is formulated to reduce travel time, one of the system goals, by increasing intersection throughput.
Minimizing intersection pressure encourages vehicles to pass through the intersection quickly while considering the remaining capacity of the outgoing roads, thus improving green light efficiency and throughput [6]. We also simplify the calculation of intersection pressure by not considering traffic movements (the correspondence between incoming and outgoing roads), compared with [6], [7]. Therefore, CoTV can be easily applied in various urban scenarios with multi-directional roads. Besides, we avoid other common reward definitions in the current literature, such as queue length and waiting time [10], [28], which are precarious under different traffic flow conditions even without the influence of traffic lights.

2) CAV:

• Action: The action is set to be consistent with the literature [29]: a continuous action space representing the CAV acceleration in the range [-3 m/s^2, 3 m/s^2].

• State: The state explicitly includes the speed and acceleration of the CAV itself and of its immediately preceding vehicle, the distances to the preceding vehicle and to the approaching intersection, and the current signal status of the approaching traffic light controller. The CAV agent can receive information from the vehicles on the same road and from the approaching traffic light controller using V2V and V2I communication, as shown in Fig. 3.

• Reward: The reward is penalized by the deviation of the average speed v̄ from the maximum speed limit v*, plus the Euclidean norm of the accelerations a after normalization by the vehicle's maximum acceleration a*, as shown in Fig. 5.
The speeds and accelerations in the reward are those of all K vehicles located on the same road as the CAV agent. The reward r_n of a CAV agent n, Eq. (2), becomes:

r_n = r_1 + r_2 = -|v̄ - v*| / v* - ||â||_2 / a*,  where â_j = 0 if a_j < 0, and â_j = a_j if a_j ≥ 0.   (2)

The first term of the reward function, r_1, encourages a higher average vehicle velocity but keeps it within the maximum speed limit. In this speed range, higher speed increases fuel economy, and potential collisions due to excessive speed can be avoided. Moreover, collisions can generally be avoided because they often lead to a significant decrease in the speed of the many following vehicles blocked by them (i.e., such training episodes will be discarded due to low reward values). The second term, r_2, stabilizes acceleration to reduce fuel consumption while also inducing a large time gap between adjacent vehicles [25], enabling high-speed collision-free driving. Our reward function for CAV agents encourages better speed control, thus facilitating the cooperative control of CoTV to reduce fuel consumption and CO2 emissions and to improve traffic safety. These CAV agents have the potential to increase intersection throughput by forming a platoon with the remaining vehicles on the same road [13]. The communication exchange for the cooperative control of CoTV occurs when an agent receives its state, in Line 13 of Algorithm 1. Traffic light controllers and CAVs exchange information with each other, involving the current signal status of the traffic light and the speed, acceleration, and location of the CAV.
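The CAV reward above can be sketched as follows. This is a minimal illustration assuming the structure of Eq. (2); the constants V_MAX and A_MAX are taken from the scenario speed limit and the action range, and the equal weighting of the two terms is an assumption.

```python
import math

V_MAX = 15.0  # maximum speed limit v* (m/s), as in the grid scenarios
A_MAX = 3.0   # maximum acceleration a* (m/s^2), matching the CAV action range

def cav_reward(speeds, accels):
    """Reward over all K vehicles on the same road as the CAV agent."""
    k = len(speeds)
    # r1: penalize deviation of the average speed from the speed limit
    r1 = -abs(sum(speeds) / k - V_MAX) / V_MAX
    # r2: penalize the Euclidean norm of accelerations, with negative
    # accelerations zeroed as in the piecewise term of Eq. (2)
    clipped = [a if a >= 0 else 0.0 for a in accels]
    r2 = -math.sqrt(sum(a * a for a in clipped)) / A_MAX
    return r1 + r2

# A platoon cruising at the limit with no positive acceleration is ideal:
print(cav_reward([15.0, 15.0, 15.0], [0.0, -0.5, 0.0]))  # equals 0.0
```

Note that zeroing negative accelerations means braking is not directly penalized; the penalty falls on the speed drop that braking causes, which is how unnecessary stop-and-go is discouraged.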
We choose the PPO algorithm [12] for the following reasons. PPO has the advantage of being easy to implement and achieving monotonic reward improvement. DQN is the common algorithm for training traffic light controllers [7], [28] and is efficient for discrete actions (e.g., a binary set of signal phase adjustments), whereas DQN does not perform well on continuous actions (e.g., vehicle acceleration taking any real value within a certain range) [30]. In contrast, PPO can be applied to scenarios with either discrete or continuous actions. On the other hand, compared with traffic light signals that have a pre-defined phase sequence, the initial driving behavior of a DRL-controlled CAV involves a lot of unreasonable stop-and-go and standstill. The constrained policy update of PPO aims to improve the reward monotonically, which makes it more stable for training CAVs and better than the Asynchronous Advantage Actor-Critic used in [11]. Although TRPO can also constrain the policy update, PPO is easier to implement and simpler for sampling data, which helps the cooperation of traffic light controllers and CAVs.
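The constrained policy update referred to above is PPO's clipped surrogate objective [12], stated here for completeness:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
    \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Clipping the probability ratio r_t(θ) to [1 - ε, 1 + ε] bounds how far each update can move the policy from the previous one, which yields the stable, near-monotonic improvement that makes PPO suitable for the erratic early behavior of DRL-controlled CAVs.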
When interacting with the environment, CoTV applies parameter sharing [9] to all agents of the same type in the multi-agent DRL system, which speeds up training convergence and benefits from shared experience, especially in large-scale applications [31].
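Per-type parameter sharing can be sketched as follows; the Policy class is an illustrative stand-in for a PPO network, not the CoTV implementation.

```python
# All agents of one type act through a single shared policy object, so every
# agent's experience updates the same weights.

class Policy:
    def __init__(self, agent_type):
        self.agent_type = agent_type
        self.updates = 0  # stands in for the shared network weights

    def act(self, state):
        # A real policy would map the state to a discrete or continuous action.
        return f"{self.agent_type}-action"

    def update(self, batch):
        self.updates += 1

# One policy per agent *type* (traffic light, CAV), not per agent instance.
shared = {"TL": Policy("TL"), "CAV": Policy("CAV")}
agents = [("tl_0", "TL"), ("tl_1", "TL"), ("cav_3", "CAV")]

# Every traffic light agent queries the identical policy object ...
actions = {aid: shared[atype].act(state=None) for aid, atype in agents}

# ... and an update driven by tl_0's experience is immediately visible to tl_1.
shared["TL"].update(batch=["tl_0 trajectories"])
print(shared["TL"].updates)  # -> 1, shared by both traffic light agents
```

This is why adding more intersections or CAVs grows the amount of experience per policy rather than the number of policies to train.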

D. Considerations for "easy-to-deploy"
Firstly, CoTV is designed to be deployed at the major junctions of urban scenarios, which requires minimal upgrades to existing adaptive traffic light systems (e.g., SCATS, SCOOT). This deployment strategy covers the broader arterial roads that carry the majority of traffic with the minimum possible number of intersection controllers. Lane-changing operations are not considered in the action space of the CoTV agent design, for simplicity; however, lane-changing operations are permitted in the evaluation of CoTV shown in Section V. Secondly, compared with controlling all possible CAVs with DRL, the traffic light controller of CoTV selects only the CAV closest to the intersection on each incoming road to cooperate with, which significantly reduces training time and resources and thus alleviates scalability issues. Meanwhile, the cooperation schemes among agents (i.e., the traffic light controller and the approaching CAV agents) rely only on the exchange of states, not actions. This means the action of a certain agent is selected independently of other agents' actions. Therefore, CoTV avoids the exponentially increasing complexity of joint actions in MARL using an action-dependent design [22].

Algorithm 1
Require:
1: Obtain the set of traffic light agents to control, M
2: Set the number of episodes in parallel to E, and the time horizon for each episode to H
3: Initialize the policy parameters for each agent type, θ_TL for traffic light controllers and θ_CAV for CAVs, through parameter sharing
4: Initialize the sample batch B = ∅
5: Set the number of epochs for mini-batch updates in one iteration to K
Ensure:
6: for iteration = 1, 2, ..., I do
7:   for episode = 1, 2, ..., E do in parallel
8:     for timestep h = 0, 1, ..., H do
9:       for each traffic light agent m in M do
...
12:      for each agent i in M + N do
13:        Run policy π_TL or π_CAV in the environment
14:        Collect trajectories τ = (s_{h-1,i}, a_{h,i}, s_{h,i}, r_{h,i})
...
     Update θ_TL and θ_CAV in the policies π_TL, π_CAV using advantage estimates Â, with K epochs of mini-batches sampled from B, and then reset B = ∅
21: end for

Besides, the amount of information exchanged in CoTV is low compared with high-dimensional transmission data (i.e., image representations describing traffic features) [28], [32]. Specifically, as shown in Fig. 3, the information of a CAV involves its speed, acceleration, and location; its size is estimated to be approximately 40 Bytes if encoded using floating-point numbers. Traffic light controllers send their current signal phase, which is about 8 Bytes if encoded using integer numbers. This information, plus headers, still requires less than 100 Kbps, a transmission demand met by the V2I and V2V communication infrastructure [14] using IEEE 802.11p, which provides between 3 and 20 Mbps. Additionally, all the information exchanged over the vehicular network occurs within the range of a single intersection (i.e., the single-hop range of about 300 meters), which improves the robustness of CoTV by avoiding heavy reliance on large-scale (i.e., multi-hop) network conditions [11].
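The payload sizes quoted above can be checked directly. The field choices (x, y for location, 64-bit encodings, 10 Hz message rate) are illustrative assumptions:

```python
import struct

# CAV -> traffic light: speed, acceleration, location (x, y) as 64-bit floats
cav_payload = struct.pack("!4d", 13.9, 0.8, 512.0, 307.5)
print(len(cav_payload))  # 32 bytes; ~40 with a small header

# Traffic light -> CAV: current signal phase as a 64-bit integer
tl_payload = struct.pack("!q", 2)
print(len(tl_payload))   # 8 bytes

# Even at 10 messages/s per CAV, the load stays far below IEEE 802.11p rates.
bits_per_second = (len(cav_payload) + 8) * 8 * 10  # payload + 8-byte header
print(bits_per_second)   # 3200 bps per CAV, vs. 3-20 Mbps available
```

Even with dozens of CAVs in the single-hop range of one intersection, the aggregate demand remains orders of magnitude below the channel capacity, supporting the "easy-to-deploy" claim.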

IV. EVALUATION METHODOLOGY
In this section, we introduce the evaluation methodology, which includes the simulation settings, the metrics used for evaluation, and an overview of the methods compared against our proposed CoTV.

A. Simulation Scenarios
The simulation platform used in this work is Simulation of Urban MObility (SUMO), one of the most widely used open-source microscopic traffic simulators. Our model design and implementation are based on FLOW, which provides DRL-related APIs to work with SUMO dynamically.
We clarify some concepts relating to the time horizons. We set 1 simulation timestep equal to 1 simulation second. One episode refers to a full run of a single simulation scenario and is set to 720 simulation timesteps. At the end of each iteration, after 18 episodes have run in parallel, CoTV updates the parameters of the PPO algorithm. In total, we terminate the training process of CoTV after 150 iterations.
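The training budget implied by these settings works out as follows:

```python
STEPS_PER_EPISODE = 720   # 720 simulation seconds, at 1 s per timestep
EPISODES_PER_ITER = 18    # episodes run in parallel per iteration
ITERATIONS = 150

steps_per_iteration = STEPS_PER_EPISODE * EPISODES_PER_ITER
total_steps = steps_per_iteration * ITERATIONS
print(steps_per_iteration)  # -> 12960 environment steps per PPO update
print(total_steps)          # -> 1944000 environment steps in total
```

That is, each PPO update is computed from 12,960 environment steps, and the full training run covers roughly 1.94 million simulated seconds.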
For the testing scenarios, we first demonstrate the effectiveness of CoTV on a simple 1x1 grid map with a single intersection. Then, we show that CoTV scales to more consecutive intersections on a 1x6 grid map. Lastly, we validate the effectiveness of CoTV using a subset of a realistic urban scenario of Dublin city, Ireland. Table I summarizes the traffic settings of each scenario.

1) 1x1 grid map: In our 1x1 grid map, each edge has two roads in opposite directions. To make this map closer to a real urban scenario, we set the road length to 300 meters and the maximum speed limit to 15 m/s (= 54 km/h). As shown in Fig. 6, we generate different go-straight traffic flows in four directions: N→S (from north to south), S→N, W→E (from west to east), and E→W. This traffic generation method is inspired by [11]. The origin and destination of each vehicle are at the ends of the roads on the perimeter of the network. The vehicle generation duration for each flow is approximately 300 seconds. The traffic flows N→S and W→E are relatively heavier than the other two. Specifically, the traffic flow rates, in vehicles per hour per road, are 288 (N→S), 240 (W→E), 192 (E→W), and 120 (S→N). The two traffic flows S→N and W→E are generated at the beginning of each episode. The N→S flow vehicles then start to enter the network sequentially at the 45th second, and after one minute the E→W traffic flow appears. The speed of each vehicle when entering the network is random. In total, 70 vehicles are generated in this single-intersection scenario.

2) 1x6 grid map: The 1x6 grid scenario, shown in Fig. 7, contains six intersections, extending the 1x1 grid map with 5 more consecutive intersections. The road settings and traffic flow configurations are similar to those of the 1x1 grid scenario, and the additional vertical (N→S and S→N) roads are allocated corresponding traffic flows. A total of 240 vehicles are generated in this scenario.

3) Dublin map: Fig. 8 illustrates the selected area of six signalized intersections in the city of Dublin. These are the main intersections connected by arterial roads, maximizing traffic improvement while considering "easy-to-deploy" with minimal infrastructure upgrades, as mentioned in Section III-D. A variety of roads are included: exclusive go-straight, exclusive turn, and multi-directional roads. Meanwhile, the intersections come in different shapes and sizes: one three-leg intersection with four signal phases (the rightmost one in Fig. 8); four-leg intersections in the majority, three with four phases and one with six phases; and the most complex, a five-leg intersection with six phases (the third from the left in Fig. 8). The scenario is extracted from the open data in [26] to simulate real-world traffic in Dublin city. We extracted dynamic traffic generated from 10 AM for 400 seconds, consisting of 275 vehicles allowed to drive straight or turn left or right at intersections. Each vehicle has a dedicated trip.
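The demand figures for the 1x1 grid above can be cross-checked: the four flow rates over the roughly 300-second generation window reproduce the stated total of 70 vehicles.

```python
# Flow rates in vehicles per hour per road, from the 1x1 grid description.
flows_veh_per_hour = {"N->S": 288, "W->E": 240, "E->W": 192, "S->N": 120}
GENERATION_SECONDS = 300  # approximate vehicle generation window per flow

total = sum(rate * GENERATION_SECONDS / 3600 for rate in flows_veh_per_hour.values())
print(int(total))  # -> 70 vehicles in the single-intersection scenario
```

The per-flow counts (24, 20, 16, and 10 vehicles) also confirm that N→S and W→E are the heavier flows.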
Fig. 8. The selected area of six signalized intersections in the city of Dublin (a regional road, R111, in South Dublin). The highlighted roads constitute our selected testing scenario (the six intersections are highlighted with red circles).

B. Evaluation Metrics
We evaluate the sustainable traffic improvements of each scenario using the following metrics:
• Travel time (seconds): The travel time of a vehicle is the time spent in the road network until finishing its designated trip. The average travel time is calculated over the vehicles completing their trips in a scenario, which is the common measure of traffic efficiency [10].
• Delay (seconds): Delay is the difference between the actual travel time and the ideal travel time (i.e., the time spent when driving at the maximum permitted speed) for each trip. This value indicates the room left to further optimize traffic efficiency towards its upper bound, and can reflect efficiency improvements more noticeably than travel time [24].
• Fuel consumption (l/100km): Fuel consumption is the average amount of fuel consumed in liters per 100 kilometers traveled. The closer the vehicle speed is to the maximum speed limit we set, and the gentler the changes in acceleration, the less fuel is likely to be consumed [33]. In our experiments, fuel consumption, as well as the CO2 emissions described later, is calculated using the HBEFA3/PC_G_EU4 model (i.e., a gasoline-powered Euro norm 4 passenger car modeled using HBEFA3 [34]), which is the default vehicle emission model in SUMO4. This model mainly considers the instantaneous speed and acceleration of a vehicle.
• CO2 emissions (g/km): CO2 emissions are measured as the average amount of carbon dioxide emitted in grams per kilometer traveled by all vehicles. As the primary component of greenhouse gas emissions, CO2 emissions must be reduced to achieve sustainable traffic.
• Time-To-Collision (TTC): TTC is a widely-used safety indicator [16] estimating the time required for a car to hit its preceding one. We use the default TTC threshold in SUMO, 3 seconds5, meaning a possible collision is recognised when the time gap between two adjacent cars is less than 3 seconds. The reported TTC value is the total number of such possible rear-end collisions over a given time horizon.
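As a concrete illustration of the delay and TTC metrics above, the following minimal sketch computes both from per-vehicle quantities; the function and parameter names are illustrative, not taken from the paper's implementation:

```python
def delay(actual_travel_time, trip_length, max_speed):
    """Delay = actual travel time minus the ideal travel time,
    i.e., the time the trip would take at the maximum permitted speed."""
    ideal = trip_length / max_speed
    return actual_travel_time - ideal

def count_ttc_conflicts(gaps, closing_speeds, threshold=3.0):
    """Count rear-end conflicts: events where the time-to-collision
    (gap / closing speed) falls below the threshold (SUMO default: 3 s)."""
    conflicts = 0
    for gap, dv in zip(gaps, closing_speeds):
        # Only pairs that are actually approaching (dv > 0) can collide.
        if dv > 0 and gap / dv < threshold:
            conflicts += 1
    return conflicts
```

For example, a 1500 m trip taking 120 s under a 15 m/s limit has a delay of 20 s (120 − 1500/15).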

All vehicles are connected.

HDV (Non-CAV):
• Can NOT be controlled by CoTV
• IDM car-following model [35]

CAV:
• Can be controlled by CoTV1
• IDM, if not controlled by CoTV

CAV penetration rate = |CAV| / (|HDV| + |CAV|) × 100%

1 Denoted as CoTV* when all CAVs are controlled by our system. This scenario of 100% penetration rate is a different case, as CoTV only controls the CAV that is closest to the intersection on each incoming road.
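For context, the IDM car-following rule [35] used for uncontrolled vehicles can be sketched as below; the parameter values are illustrative defaults, not the paper's exact simulation settings:

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=15.0,    # desired speed (m/s), illustrative
                     T=1.0,      # desired time headway (s)
                     a_max=1.0,  # maximum acceleration (m/s^2)
                     b=1.5,      # comfortable deceleration (m/s^2)
                     s0=2.0,     # minimum standstill gap (m)
                     delta=4):   # acceleration exponent
    """Intelligent Driver Model: acceleration of a follower given its
    speed v, the leader's speed v_lead, and the bumper-to-bumper gap."""
    dv = v - v_lead  # closing speed (positive when approaching the leader)
    s_star = s0 + max(0.0, v * T + v * dv / (2 * math.sqrt(a_max * b)))
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)
```

The model accelerates on a free road, eases off as speed approaches the desired speed, and brakes as the gap to the leader shrinks below the dynamically desired gap s*.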

C. Compared Methods
To evaluate the effectiveness of our system CoTV, the compared methods are described as follows:
• Baseline: This method is the baseline against which the improvement of the others is evaluated. Traffic light signals have a static timing plan that does not change with the varying traffic, and thus do not require V2X communications to collect vehicle information. All vehicles are HDVs simulated by the IDM car-following model [35], as shown in Table II, which is also used for simulating HDVs in [29]. In PressLight, the state of a traffic light controller includes the number of vehicles on the incoming and outgoing roads. The reward design utilizes the "pressure" to improve intersection throughput, inspired by [6]. All vehicles are HDVs and connected, as shown in Table II; they periodically broadcast their up-to-date status (e.g., location, speed, acceleration), and any agent within communication range can aggregate these messages into real-time traffic information.
• GLOSA: This is an optimization-based method for jointly controlling traffic light signals and CAVs. The GLOSA system6 can adjust CAV speed considering the current traffic light phase and the current status of the CAV. In our experiments, we combine it with adaptive traffic light controllers7 to achieve joint control. Thus, phase switching is actuated after detecting a sufficient time gap between successive vehicles, resulting in varying phase durations.
It is worth noting that all vehicles in this scenario are CAVs.
• I-CoTV: I-CoTV combines independent policy training on the two types of agents, a common and straightforward way to develop MARL. There is no cooperation design between agents in either state or action, distinct from CoTV (action-independent MARL with cooperation schemes in the state exchange). Hence, the state of a traffic light controller involves two parts: its current signal phase and the traffic on the roads it coordinates, without any instantaneous vehicle information, in contrast to CoTV. Correspondingly, the state of a CAV agent only consists of the speed, acceleration, and location of itself and its preceding vehicle, without the current signal of the approaching traffic light obtained from agent communication.

As shown in Table III, PressLight and GLOSA improve traffic safety as well. However, there is a great difference in TTC between PressLight and CoTV under the 1x6 grid scenario, and the result of CoTV under the Dublin scenario is much better than the other two methods. Conversely, FlowCAV hurts traffic safety under the grid maps but not in the Dublin scenario. The more realistic urban scenario brings explicit complexity to enhancing safety. This also highlights the advantages of CoTV using DRL-based methods for cooperative control.
2) Robustness to varying CAV penetration rates: Fig. 10 shows that the travel time of CoTV tends to decrease as the CAV penetration rate increases under the 1x1 and 1x6 grid maps and the Dublin scenario. Even under a 0% CAV penetration rate (i.e., the ratio of CAVs to all vehicles as defined in Table II), the travel time that CoTV achieves is still less than that of Baseline and PressLight. Specifically, CoTV with a 0% CAV penetration rate implies no vehicle speed control. In general, CoTV with CAV speed control obtains better results, which demonstrates the effectiveness of cooperative traffic control. Similar results are observed for the other metrics; we omit them to save space. This demonstrates the practicability of CoTV when deployed in a realistic mixed-autonomy scenario.
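As context for the PressLight comparisons above, a pressure-style traffic light reward can be sketched as follows; this is a deliberate simplification using per-road vehicle counts, not the exact formulation of PressLight [7]:

```python
def pressure(incoming_counts, outgoing_counts):
    """Pressure of an intersection: vehicles queued on incoming roads
    minus vehicles on outgoing roads (a PressLight-style quantity)."""
    return sum(incoming_counts) - sum(outgoing_counts)

def reward(incoming_counts, outgoing_counts):
    """Negative absolute pressure: a balanced in/out flow scores highest,
    which pushes the controller to maximize intersection throughput."""
    return -abs(pressure(incoming_counts, outgoing_counts))
```

Minimizing |pressure| discourages letting queues build on incoming roads while outgoing roads stay empty.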
3) Comparison with other MARL methods: To further demonstrate the effectiveness of the CoTV system design on cooperative control, we compare CoTV with two other common MARL methods, I-CoTV (independent, without any cooperation schemes) and M-CoTV (action-dependent, with cooperation schemes in action and state). Results under the Dublin scenario with full-autonomy traffic are shown in Table IV. CoTV achieves the best results, while I-CoTV suffers from convergence issues, resulting in the worst traffic performance. M-CoTV fails to overcome the high complexity arising from the consideration of other agents' actions, which limits its traffic improvements. In particular, the performance changes in fuel consumption and travel time are inconsistent in M-CoTV compared with I-CoTV. The training time of M-CoTV also increases by about 50%. In addition, referring to Table III, M-CoTV and I-CoTV perform better than Baseline and FlowCAV but do not surpass PressLight and GLOSA.

In summary, CoTV achieves the first three system goals: reduced travel time, lower fuel consumption and CO2 emissions, and longer time-to-collision. The cooperation schemes between CAVs and traffic light controllers, the first contribution of this paper, can overcome the difficulties of DRL-based joint control in complex urban traffic scenarios.

B. Scalability Improvement
The second contribution of CoTV is the improvement of multi-agent system scalability by reducing the number of CAV agents controlled. Compared with CoTV*, which trains all possible CAVs, the results in Table V indicate that CoTV can reduce the training time by up to 44% while still achieving comparable (sometimes slightly better) improvements in both traffic efficiency and safety under the Dublin scenario. Although CoTV* obtains better results than CoTV under the two grid maps, it is worth noting that CoTV achieves its results by cooperating only with the closest CAV on each incoming road of the traffic light controller. The closest CAV has great potential to increase intersection throughput, similar to controlling only the leading vehicle to improve the traffic efficiency of a platoon [13]. Once the CAV acting as the leading vehicle is well controlled by CoTV, all its following vehicles subsequently self-adjust. Moreover, Fig. 11 indicates that both agent types of CoTV, traffic light controllers and CAVs, converge to a higher reward with a smaller standard deviation than at the start after about 60 training iterations. Thus, CoTV alleviates the scalability issues without compromising traffic improvement, achieving the last system design goal: easier deployment.
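The agent-selection rule described above (cooperate only with the CAV closest to the intersection on each incoming road) can be sketched as follows; the vehicle record fields are hypothetical, not the paper's data structures:

```python
def select_cav_agents(vehicles):
    """Pick, per incoming road, the CAV closest to the intersection.
    `vehicles` is a list of dicts with (hypothetical) keys:
    id, road, dist_to_stop, is_cav."""
    nearest = {}
    for v in vehicles:
        if not v["is_cav"]:
            continue  # HDVs are never controlled by CoTV
        road = v["road"]
        if road not in nearest or v["dist_to_stop"] < nearest[road]["dist_to_stop"]:
            nearest[road] = v
    # Map each incoming road to the single CAV agent CoTV controls there.
    return {road: v["id"] for road, v in nearest.items()}
```

This caps the number of controlled agents at the number of incoming roads, regardless of how many CAVs are present.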

C. Discussion
To further explore the deployment options of CoTV, we conduct experiments in a relatively large and dense urban scenario in Dublin city centre, which traditionally requires sophisticated coordination between adjacent traffic light controllers. The selected area covers nearly 1 km² with 31 signalized intersections, as shown in Fig. 12. These intersections, with different road shapes and traffic light signal cycles/phases, are all controlled by CoTV. Table VI shows the traffic performance under this dense Dublin scenario at a 100% CAV penetration rate. Although CoTV converges and obtains the best results in all evaluation metrics, which shows that CoTV can be deployed at both major and minor junctions, further studies are needed to find the optimal selection of key intersections to control, avoiding costly deployment at all urban junctions. As future work, we plan to improve the robustness of our CoTV system in more practical scenarios. Firstly, we will relax the assumption that all vehicles are connected via V2X communications. Secondly, we will improve CoTV to be resilient to varying vehicular network conditions (e.g., latency, packet loss, bandwidth). Our long-term goal is to tackle the scalability issues of applying cooperative MARL algorithms (e.g., COMA) in complex urban traffic scenarios.

Fig. 2. The DRL design of CoTV. Two types of agents, traffic light controllers and CAVs, interact with the environment according to the state information exchanged via V2X communications.

Fig. 3. V2X communication schemes in CoTV, showing how traffic light controllers and CAVs use V2I and V2V to implement state exchange and cooperative control. CAV agents of CoTV are highlighted in blue.

Fig. 4. Illustration of the reward rm of a traffic light controller, assuming the maximum road capacity c = 40.

Fig. 5. Illustration of the CAV reward rn, assuming the maximum speed limit v* = 15 and the vehicle's maximum acceleration a* = 9. CAV agents of CoTV are highlighted in blue.

10: Add the closest CAV n to the intersection on each incoming road to the CAV agent set N

1 In an a×b grid, a is the number of rows and b is the number of columns.

Fig. 6. The settings of traffic generation for the 1x1 grid scenario. For example, W→E (#20, 1st sec) means that 20 vehicles are sequentially generated in the W→E direction starting from the first simulation second.

Fig. 7. The settings of traffic generation for the 1x6 grid scenario. The same settings (the number of vehicles generated and the simulation second at which generation starts) apply to traffic flows in the same direction.
The Baseline scenario simulates most existing urban scenarios, which do not have any traffic light controllers or CAVs controlled by DRL. A cycle of the static traffic light signal plan contains four phases in order: Green-NS (green light for the flows N→S and S→N), Yellow-NS, Green-WE, and Yellow-WE. The duration of the green light is 40 seconds (the default value in SUMO). The yellow light typically lasts from 3 to 6 seconds [10], so we set 3 seconds as the yellow light duration, which is also the default setting in SUMO. Thus, the length of a cycle is 86 seconds (40+3+40+3). Besides, the Baseline of the Dublin scenario adopts the original traffic light signal plans, whose specific settings vary across intersections: green light phases range from 37 to 42 seconds, and the yellow light phase lasts 3 seconds. Some intersections have a short 6-second green phase for right turns.
• FlowCAV: FlowCAV [29] is a state-of-the-art DRL-based model that controls the speed of a CAV to improve fuel efficiency and reduce emissions. Each CAV observes its preceding vehicle and then regulates its speed. The reward of a single CAV is evaluated globally by the average speed and acceleration of all vehicles. In this scenario, all traffic light signals are static. There is only one CAV agent per road, which leads the following vehicles on the same road.
• PressLight: PressLight [7] is a state-of-the-art DRL-based model that controls traffic light signals to improve intersection throughput.
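The static Baseline signal plan described above (40 s green, 3 s yellow, an 86-second cycle) can be sketched as a simple time-to-phase lookup:

```python
# Static four-phase plan of the Baseline: (phase name, duration in seconds).
PHASES = [("Green-NS", 40), ("Yellow-NS", 3), ("Green-WE", 40), ("Yellow-WE", 3)]
CYCLE = sum(d for _, d in PHASES)  # 86 seconds (40+3+40+3)

def phase_at(t):
    """Return the active phase name at simulation second t."""
    t = t % CYCLE  # the plan repeats every cycle
    for name, duration in PHASES:
        if t < duration:
            return name
        t -= duration
```

Such a plan never reacts to traffic, which is exactly what the adaptive and DRL-based methods above improve upon.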

Fig. 9. Travel time distributions for the three test scenarios, comparing four methods. CoTV reduces the travel time of all vehicles, yielding distributions densely concentrated at lower values than the other methods.

Fig. 10. The average travel time of CoTV under different CAV penetration rates in both the grid and Dublin scenarios. The average travel time obtained by Baseline and PressLight is also given for comparison. Travel time tends to decrease as the CAV penetration rate increases.

Fig. 11. Evolution of the average episode reward for traffic light controllers (TL) and CAV agents of CoTV under the Dublin scenario. The shade represents the standard deviation. After DRL training, the rewards for both types of agents converge to higher values and smaller standard deviations than in the initial stage.

Fig. 12. The selected dense urban scenario in the city centre of Dublin. There are 119 intersections in total, including 31 signalized intersections. 321 vehicles are generated from 10 AM over 400 seconds.
• Efficient communication exchange schemes between CAVs and traffic light controllers. The amount of state information exchanged between CAVs and traffic light controllers is low. As shown in Fig. 2, the communication schemes are designed to exchange the speed, acceleration, and location of CAVs and the current signal phase of traffic light controllers with each other.

TABLE I: TRAFFIC SETTINGS IN THE THREE TEST SCENARIOS.

TABLE II: SIMULATION SETTINGS OF DIFFERENT VEHICLE TYPES.
Introducing I-CoTV aims to demonstrate that the efficient cooperation schemes of CoTV facilitate training convergence. Firstly, we compare CoTV with state-of-the-art methods to show the effectiveness of CoTV on the cooperative control. Secondly, we show whether CoTV can be efficiently deployed by comparing it with CoTV* in training time and traffic improvements. All numerical results shown are averaged over eighteen episodes. 1) Comparison with state-of-the-art methods: Table III shows the traffic improvements of CoTV under a 100% CAV penetration rate, the same as for FlowCAV and GLOSA (while a 0% CAV penetration rate is used for PressLight and the Baseline scenario, as they involve no vehicle speed control).

TABLE III: COMPARISON OF COTV AGAINST BASELINE AND STATE-OF-THE-ART METHODS. PERCENTAGE CHANGES SHOWN ARE COMPARED TO BASELINE. THE BEST ACHIEVED MEASUREMENTS ARE IN BOLD.

Travel time & delay: As shown in Table III, CoTV achieves the shortest travel time, with up to a 30% reduction compared to Baseline. PressLight and GLOSA achieve over 24% and 23% reductions, respectively. However, FlowCAV does not reduce travel time due to its static traffic light plan and the absence of current traffic light signals in its state, and its results on the grid road maps are much worse than Baseline. The further improvement of CoTV demonstrates the advantages of cooperative traffic control compared with controlling traffic light signals only.

TABLE IV: COMPARISON BETWEEN I-COTV (INDEPENDENT, WITHOUT ANY COOPERATION SCHEMES), M-COTV (ACTION-DEPENDENT, WITH COOPERATION SCHEMES IN ACTION AND STATE), AND COTV, WITH FULL-AUTONOMY TRAFFIC UNDER THE DUBLIN SCENARIO.

TABLE V: COMPARISON BETWEEN COTV AND COTV* (CONTROLLING ALL POSSIBLE CAVS) UNDER FULL-AUTONOMY TRAFFIC.

TABLE VI: TRAFFIC PERFORMANCE UNDER A DENSE DUBLIN SCENARIO. PERCENTAGE CHANGES SHOWN ARE COMPARED TO BASELINE. THE BEST ACHIEVED MEASUREMENTS ARE IN BOLD.

CoTV cooperates with only one CAV (the leader of a platoon) on each incoming road to alleviate the scalability issue of multi-agent systems. This also eases the deployment and allows the training process to converge within a moderate number of iterations. Experiments in various grid maps and realistic urban scenarios demonstrate the effectiveness of CoTV. Compared to the Baseline, CoTV can save up to 28% in fuel consumption and CO2 emissions while reducing travel time by up to 30%. The robustness of CoTV is also validated under different CAV penetration rates.