
Analytically Guided Reinforcement Learning for Green IT and Fluent Traffic



Abstract:

This study investigates various methods for autonomous traffic signal control. We look into different types of control methods, including fixed time, adaptive, analytic, and reinforcement learning approaches. Machine learning approaches are compared with the “analytic” approach, which is used as a “gold standard” for performance assessment. We find that conventional machine learning approaches outperform the analytic approach, but require considerably more computing power. We, therefore, introduce a novel hybrid method called “analytically guided reinforcement learning” or, in short, “$\alpha$-RL”. This approach is implemented in our “GuidedLight agent” and tends to outperform both classical machine learning and the analytic approach, while converging considerably faster. This method is therefore suited as a “green IT” solution that improves environmental impact in a two-fold way: by reducing (i) traffic congestion and (ii) the processing power needed for the learning and operation of the traffic light control algorithm.
Published in: IEEE Access ( Volume: 10)
Page(s): 96348 - 96358
Date of Publication: 05 September 2022
Electronic ISSN: 2169-3536



SECTION I.

Introduction

Traffic congestion is one of the most widespread problems of cities today, leading to losses in productivity, avoidable CO2 emissions, environmental pollution, and reduced quality of life. Along with the world’s population growth and progressive urbanization, these problems are expected to amplify further.

While in the long term, a technological shift to less problematic forms of transportation is likely, traffic congestion will remain a challenge for the foreseeable future.

A. Contribution

Traffic light control is a complex optimization problem, which is NP-hard [1], i.e. not exactly solvable in real time once problems reach realistic sizes. Hence, approximation methods are required. Such methods comprise, among others, fixed (time) scheduling [2], analytic methods [3], [4], [5], [6], [7], [8], adaptive methods [9], [10], [11], [12], [13], [14], [15], [16], and genetic algorithms [17], [18]. Moreover, there exist methods which assume that all or some of the vehicles in the network are autonomous [19], [20]. Recently, with the spread of powerful machine learning (ML) and artificial intelligence (AI) applications, reinforcement learning (RL) approaches have attracted much interest [21], [22], [23], [24], [25]. However, while the feasibility of the RL approach has received much attention, the related issues and limitations have not yet been investigated in full [3], [10]. Furthermore, the potential benefits of “hybrid” approaches, which combine analytic knowledge and RL methods, have not yet been explored in depth. Also, the assessment of the ecological footprint of machine learning approaches has often been neglected. Therefore, the main contributions of this paper are to:

  • highlight the performance and limitations of machine learning approaches considering ecological issues,

  • propose an improved, hybrid machine learning approach called “analytically guided reinforcement learning” or “$\alpha$-RL”, which converges much more quickly than conventional machine learning methods.

In the following sections, we will present the background of the field and the current state of the art, focusing on the comparison between adaptive and learning methods. We will also propose an analytic benchmark for machine learning methods. Finally, we will discuss the potential benefits of combining reinforcement learning (RL) and analytic approaches in a hybrid method (“$\alpha$-RL”).

SECTION II.

Background

One of the simplest methods of traffic signal control is fixed time scheduling [2], which is usually predefined and operated in a periodic way. For the sake of simplicity, it is sometimes furthermore assumed that the same amount of green time is assigned to each phase. This approach is obviously quite limited, but often serves as a baseline against which the performance of various traffic light control approaches is compared. An adaptive extension of fixed time scheduling is able to select a traffic plan from a predefined list of plans in response to the respective traffic conditions [26], but the assumption of repetitive service patterns is usually still applied. In contrast, fully adaptive approaches are also possible, which respond to data from induction loops placed before and after intersections that detect arriving and departing vehicles [27]. Such approaches do not rely on predefined plans, but rather adapt in real time to the particular local traffic conditions. They, however, often lack coordination among intersections. Recently, a lot of interest has also been paid to employing data-driven machine learning approaches for traffic light control [28].

In the following section, we will introduce the traffic light control problem in more detail, along with typical solution approaches including fixed time, analytic, and reinforcement learning methods. For a comprehensive survey of different traffic control approaches, we refer the reader, for example, to [29].

A. Glossary of Terms

Here we provide definitions of the key terms used in formalizing traffic intersections.

  • Approach: a road crossing other roads at an intersection. There are “incoming approaches”, i.e. the ones through which cars arrive at the intersection, and “outgoing approaches” through which cars depart.

  • Lane: a single approach can be subdivided into lanes. The lanes on the incoming approach are referred to as “incoming lanes” (short: “in-lanes”), the ones on the outgoing approach as “outgoing lanes” (short: “out-lanes”).

  • Movement: consists of an incoming approach and an outgoing approach, through which vehicles can move from in-lane(s) to out-lane(s). Usually three types of movements are considered: left turns, right turns and moving straight (“through traffic”).

  • Movement signal: signal indicating whether the given movement is allowed (green) or not (red). The yellow signal indicates the change from green to red and, depending on national law, may allow or block the movement (in this work we assume the yellow signal allows for movement). A conditional green signal is usually assumed for the right turn, allowing for movement when there is currently no conflicting traffic.

  • Phase: a combination of movement signals. A phase may only consist of non-conflicting movement signals. Two movement signals conflict if their corresponding movements cross each other (see the sketch below).
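To make the glossary concrete, the following sketch encodes movements and phases as simple Python data structures. It is only an illustration: the `Movement` fields and the externally supplied `conflict_pairs` set (e.g. derived from the intersection geometry) are our own assumptions, not part of the paper.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Movement:
    """A movement connects an incoming approach to an outgoing approach."""
    in_approach: str   # e.g. "N", "S", "E", "W"
    out_approach: str
    kind: str          # "left", "through", or "right"

def is_valid_phase(movements, conflict_pairs):
    """A phase may only combine movements whose signals do not conflict.

    conflict_pairs is a hypothetical set of frozensets of Movement pairs
    that cross each other, supplied from the intersection geometry.
    """
    return not any(frozenset((a, b)) in conflict_pairs
                   for a, b in combinations(movements, 2))
```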

B. Problem Description

In this paper we are looking for methods to change traffic lights at intersections such that the resulting traffic performance is as high as possible. To assess the performance, one often studies quantities such as the throughput and average travel time. The methods we are interested in should work for different traffic intensities. They should also work for a large number of intersections with a reasonable computational effort. In this connection, an important distinction to make is whether one attempts to optimize traffic flow locally on the level of single intersections or over extended parts of the entire road network. A network-wide approach requires much more computational resources than an intersection-based approach and is often practically intractable. In this paper, we will focus on local control approaches due to the focus on green IT and for the sake of comparability with previous publications such as [22], [23]. Note, however, that this does not exclude the possibility of coordination between neighboring intersections.

SECTION III.

Related Work

In this section we will discuss relevant related work.

A. Fixed Time Control

A classical method of traffic control is to generate centralized schedules, which are imposed on all intersections in the city [2]. In its simplest form, each intersection cycles through all its phases with no off-sets. At any given time, all intersections display the same phase, and each phase is given the same amount of green time. We refer to this simplistic method as Fixed Time Control. More advanced versions of this method include the implementation of different green time periods for each phase and suitably calibrated off-sets [2].
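As an illustration only (not the authors' implementation), such a controller can be written in a few lines; the default green and clearing durations, as well as the convention of returning -1 during clearing, are assumptions of this sketch.

```python
def fixed_time_phase(step, n_phases, green=10, clearance=2):
    """Phase index to display at a given simulation step (in seconds).

    Every intersection cycles through the same phases with equal green
    time and no off-sets; -1 marks the clearing period between phases.
    """
    cycle_length = n_phases * (green + clearance)
    t = step % cycle_length
    phase, t_in_phase = divmod(t, green + clearance)
    return phase if t_in_phase < green else -1
```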

B. Adaptive Methods

A typical adaptive method is able to select the next phase based on the current state of the intersection controlled. One of the simplest adaptive methods is “demand-based” control. This approach adapts its actions based on the “demand of a phase”, which is defined as the sum of the demands of all movements belonging to the phase. The “demand of a movement” corresponds to the number of cars that are present on all incoming lanes belonging to the movement.
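A minimal sketch of this rule is given below; `phase_to_in_lanes` (mapping each phase to the incoming lanes of its movements) and `lane_counts` (mapping lane IDs to vehicle counts) are hypothetical inputs that would come from the intersection's sensors.

```python
def phase_demand(phase, phase_to_in_lanes, lane_counts):
    """Demand of a phase: sum of vehicles on the incoming lanes of its movements."""
    return sum(lane_counts[lane] for lane in phase_to_in_lanes[phase])

def demand_based_choice(phases, phase_to_in_lanes, lane_counts):
    """Select the phase with the highest current demand."""
    return max(phases, key=lambda p: phase_demand(p, phase_to_in_lanes, lane_counts))
```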

C. Self-Organization

An important aspect in local traffic optimization is the avoidance of negative interactions between neighboring intersections. In general, a decision that is optimal at one intersection may cause sub-optimal traffic flows at neighboring intersections, for example, due to spill-over effects. To address this problem, the concept of self-organized traffic light control has been developed, which promotes a coordination among neighboring intersections [30].

A self-organizing system is one in which adjacent elements interact in a way that gives rise to collective behavior. This can be coordinated behavior over the entire system or extended parts of it. If the interactions are well chosen, the resulting self-organized system dynamics can perform extremely well. The emphasis is therefore on making the interactions between the individual system elements mutually positive (synergistic). In [10], it is demonstrated that a method called “Self-Organizing Traffic Light” (SOTL), based on the above concepts, can reach significant improvements even over state-of-the-art methods designed to produce green waves, which attempt global traffic flow optimization by synchronizing traffic lights and supporting vehicle platoons that rarely need to stop [31].

D. Analytic Approach

Analytic methods rely on models and formulas derived from a theory (e.g. queuing theory or traffic physics) and focus on showing that the proposed control scheme locally optimizes the selected performance criterion.

A very effective analytic, adaptive approach, which relies on concepts from traffic physics as well as self-organization principles, has been proposed in [5]. The method consists of two elements: an optimization rule and a stabilization rule. The optimization rule (see Appendix) is based on the short-term anticipation of future arrivals of vehicles to the queue and on calculating the green time needed to clear the expected queue. A priority score is used by the optimization rule to select the movement or phase that needs to be switched to.

The stabilization rule overrides the optimization rule in situations when a queue has grown too large or some phases have not been activated for a long time [6]. This helps to prevent spill-over effects at neighboring intersections.

The short-term anticipation of this analytic approach promotes a self-organized coordination between flows and traffic lights at neighboring intersections. Due to the resulting self-organization, the two rules lead to a spontaneous emergence of green waves, much like in [30]. The method has been successfully implemented in real life settings in the cities of Dresden, Germany, and Lucerne, Switzerland [32], [33]. In the following, for the sake of simplicity, our implementation of the analytic method will use the optimization rule only, while the stabilization rule will be neglected, possibly at the cost of losing some performance. (We will focus on its role in a follow-up study.)

E. Reinforcement Learning (RL)

Due to the complexity of traffic light optimization, many recent publications have proposed to use machine learning approaches. Instead of deriving analytic models, these approaches use an iterative, neural-network-based learning method, often called a “black box”, which is fed with large amounts of data. Significant success has been demonstrated by multi-agent deep reinforcement learning models, which we discuss below. We focus on models which, like the previously described approaches, optimize traffic flows locally at the level of a single intersection, mainly for the sake of comparison with previously published results [22], [23].

In the machine learning models, an “agent” represents an intersection of the road network. The agent is fed with data from observations of the environment and takes actions based on them. The agent is also given rewards that reflect the desirability of the actions it had taken [34]. The data included in the observations as well as the choice of the reward function may have a strong influence on the efficiency of the learning process.

In [22], a learning algorithm called “IntelliLight” uses the queue length, number of vehicles, waiting time and an image representation of the intersection as its state.

In [35], an analysis of the reward and state design in reinforcement learning is applied to traffic light control. Moreover, the “LIT” method is proposed to simplify the state description.

In [24], the authors propose “CoLight”, which uses graph attentional networks to facilitate communication between traffic lights. The method considers a spatial and temporal interaction of neighboring agents.

The state representation is studied in depth in [36] and a “FRAP” model is proposed. The model addresses the problem of limited adaptive potential of most learning approaches (e.g. a model trained with morning traffic may not adapt well to evening traffic, because the prevailing direction of traffic is reversed). It decides the competition between alternative phases based on demand. FRAP is able to achieve invariance to rotation and flipping. Moreover, FRAP can be applied to intersections with different numbers of incoming lanes as well as a different number of possible phases. FRAP shows very good performance (in terms of average travel times) for a simple, single intersection setting. However, in a realistic setting with many intersections its performance deteriorates.

Another learning algorithm is described in [23]. “PressLight” simplifies the state to consist only of cars on incoming and outgoing lanes and the current phase. The reward is the “pressure” at an intersection [3], which is explained in detail in subsection IV-B.

The PressLight method outperforms both IntelliLight and LIT in both synthetic and realistic scenarios in terms of average travel time. PressLight outperforms the FRAP model in scenarios with more than one intersection as well. PressLight’s performance appears to be comparable with CoLight although no direct comparison has yet been published.

The publications mentioned above achieve convincing results. With the help of computer simulations, it is shown that reinforcement learning has great potential to help mitigate the problem of traffic congestion. It is less clear, however, how the machine learning approaches perform compared to previous adaptive approaches, also in terms of the computational resources needed. Similarly, the environmental costs of training the RL models are often left unreported. This will be the focus of our further investigation.

SECTION IV.

Methods

In this section we will specify the design of the GuidedLight agent implementing “analytically guided reinforcement learning” (short: “$\alpha$-RL”). We will specifically describe the basis of the $\alpha$-RL approach which, as we will see, combines the benefits of the analytic approach with those of machine learning.

A. Deep Q-Learning

In the approach called $Q$-learning, the agents’ decisions are guided by a $Q$-function, which takes the current state and an action as arguments and maps them to the reward space. The mapping is updated according to the Bellman equation based on the expected future rewards, as in Equation 1, where $Q^{new}$ is the $Q$-value after the update for the given state-action pair ($S_{t}, A_{t}$ at time $t$ in this case); $Q$ is the old $Q$-value for the same state-action pair; $l$ represents the learning rate; $\gamma$ weights the importance of long-term vs. short-term gains; $R_{t}$ is the reward at time $t$; and $\max_{A}Q(S_{t+1}, A)$ is the estimate of the optimal future value, i.e. the highest $Q$-value that can be obtained starting from state $S_{t+1}$ and taking optimal actions. The term in the square brackets is also known as the temporal difference [34].\begin{equation*} Q^{new}(S_{t}, A_{t}) = Q(S_{t}, A_{t}) + l \left[ R_{t} + \gamma \max_{A} Q(S_{t+1}, A) - Q(S_{t}, A_{t}) \right] \tag{1}\end{equation*}
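For reference, Equation 1 translates directly into a tabular update rule. The dictionary-based Q-table below is only a didactic stand-in for the deep network actually used in the paper; the parameter names are ours.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, lr=0.1, gamma=0.99):
    """One Q-learning step following Equation 1.

    Q maps (state, action) pairs to values; the bracketed term in the
    update is the temporal difference.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)   # max_A Q(S_{t+1}, A)
    td = r + gamma * best_next - Q[(s, a)]               # temporal difference
    Q[(s, a)] += lr * td
    return Q

# Example: Q = defaultdict(float) starts all unseen state-action values at 0.
```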

In deep $Q$ -learning, the $Q$ -function is approximated using a deep neural network. For most complex problems, it is impossible to specify the $Q$ -function explicitly (that is why one speaks of a “black box” approach). The $Q$ -function is learned by the agent as the training process advances. In modern implementations, the agent stores the previous action $A$ , current state $S$ , and reward $R$ in its memory. Mini-batches of these data triplets are sampled from the memory at intervals and used to learn the $Q$ -function [37].

In deep-$Q$ -learning, a neural network referred to as Deep $Q$ -Network (DQN) is used to approximate the $Q$ -function that estimates the reward, given a state-action pair.

Our GuidedLight implements a more advanced version of the DQN known as Double Deep Q-Network (DDQN), to avoid the overestimation of the action values. This is done by leveraging two parallel DQNs, which are updated with a different frequency using “soft updates” (see [38] for details).

We implement the memory replay [37] and train the DDQN periodically with mini-batches sampled from the memory. While the data is generated on the individual level of every intersection, the memory is shared between all agents to speed up convergence, by increasing the number of training samples. The details of the DDQN and memory implementations can be found in Appendix.
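The two ingredients mentioned above can be sketched as follows, assuming PyTorch; the class and parameter names (e.g. `tau`) are ours, and the sketch omits the networks and the training loop themselves.

```python
import random
from collections import deque

import torch

class SharedReplayBuffer:
    """One buffer filled by all intersection agents to increase the number of samples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        return random.sample(list(self.buffer), batch_size)

def soft_update(online_net, target_net, tau=1e-4):
    """Move the target network a small step towards the online network."""
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)
```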

B. Pressure-Based Learning

Drawing on the good results of [23], [25], and the theoretical background of [3], we incorporate a “pressure” concept in the reward design of “GuidedLight”. Intuitively, the pressure can be interpreted as an imbalance in the distribution of vehicles over the incoming and outgoing lanes of an intersection.

Specifically, the pressure of an intersection is defined in Equation 2, where $i$ denotes the intersection, $l$ the incoming lane of a given movement, and $o$ the outgoing lane of the same movement. $w(l,o)$ represents the pressure of a single movement from lane $l$ to lane $o$. The pressure of a single movement is simply the difference between the number of cars on the incoming lane $l$ and the outgoing lane $o$, weighted by the maximum number of cars possible on the corresponding lanes. This is summarized in Equation 3, where $x(a)$ denotes the number of cars on lane $a$ and $x_{\max}(a)$ denotes the maximum possible number of cars on that lane:\begin{equation*} P_{i} = \Big|\sum_{(l, o) \in i} w(l,o)\Big|, \tag{2}\end{equation*} where \begin{equation*} w(l,o) = \frac{x(l)}{x_{\max}(l)} - \frac{x(o)}{x_{\max}(o)}. \tag{3}\end{equation*}
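Equations 2 and 3 can be evaluated directly from lane counts. In the sketch below, `movements` is a hypothetical list of (in_lane, out_lane) pairs of the intersection, and `count`/`capacity` map lane IDs to the current and maximum possible numbers of vehicles.

```python
def movement_pressure(l, o, count, capacity):
    """w(l, o) from Equation 3: difference of normalized lane occupancies."""
    return count[l] / capacity[l] - count[o] / capacity[o]

def intersection_pressure(movements, count, capacity):
    """P_i from Equation 2: absolute value of the summed movement pressures."""
    return abs(sum(movement_pressure(l, o, count, capacity) for l, o in movements))
```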

Based on the results in [3], we conjecture that optimizing the pressure at the level of individual intersections also leads, under certain constraints, to an optimization of the global throughput. Thus, we expect coordination to emerge between the intersections as long as each of them optimizes its individual pressure.

C. Analytic Component

Our main goal is to build on the benefits of the analytic approach in order to improve the efficiency and accuracy of our learning method. An area that can benefit from analytic insights is the exploration strategy chosen by our agent. In reinforcement learning, exploration is a key concept that allows the agent to learn more about its environment and avoid getting stuck in local optima. The learning methods mentioned in subsection III-E rely on the epsilon-greedy ($\epsilon $ -greedy) exploration method [34] explained in the following.

1) Epsilon-Greedy Exploration

In this approach, every time the agent acts, with probability $\epsilon $ it chooses a random action rather than the action suggested by its $Q$ -function. This probability is usually relatively high in the beginning, in order to allow the agent to thoroughly explore the state-action space and generate enough experience to train its deep $Q$ -network. However, as the training progresses, the value of $\epsilon $ is gradually decreased to some minimal value, which is typically greater than 0 in order to provide a chance of further exploration even to trained agents. The value of $\epsilon $ , its decrease rate and minimum value must be specified in a problem-dependent way, see Appendix.

2) Analytic Exploration

In this paper we propose an alternative, analytic exploration process, where the exploration of the agent is guided by the results of some analytic method. The design extends the epsilon-greedy approach. Every time the agent acts, there is a probability $\epsilon$ of deviating from the action suggested by its $Q$-function. However, with probability $\alpha$ the chosen action is not random but the action that would be selected by some analytic approach. Hence, the overall probability of deviating from the action proposed by the $Q$-function is $\epsilon$, and within these cases the action is analytically derived with probability $\alpha$ (and random otherwise).
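The resulting action-selection logic can be written compactly. In the sketch below, `analytic_action` stands for the phase proposed by the analytic optimization rule (see Appendix) and `q_values` for the DDQN output; both are assumed to be supplied by the surrounding agent code.

```python
import random

def alpha_rl_action(q_values, analytic_action, epsilon, alpha, n_actions):
    """Epsilon-greedy selection with analytically guided exploration.

    epsilon: probability of deviating from the greedy Q action;
    alpha:   probability that the deviation follows the analytic rule
             rather than being random.
    """
    if random.random() < epsilon:               # explore
        if random.random() < alpha:             # analytically guided exploration
            return analytic_action
        return random.randrange(n_actions)      # blind (random) exploration
    return max(range(n_actions), key=lambda a: q_values[a])  # exploit
```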

The intuition behind this approach is that we inject knowledge from the analytic approach in order to guide our exploration into areas of the state-action space that are likely to perform highly. The analytic approach is based on implications of the physical laws underlying traffic flows (“traffic physics”), which can be expressed by precise mathematical formulas. In comparison, the data-driven reinforcement learning approach is only able to provide approximate relationships.

By injecting precise analytic knowledge into the exploration, we hope to accelerate the convergence of our method as compared to alternative, “blind” (i.e. unguided) learning methods. We expect that, with our new approach, the agent needs to explore fewer states to find the optimal state-action pairs. Nevertheless, we still allow random exploration to make sure the agent does not get stuck in local optima. The $\alpha$ value, much like $\epsilon$, can also be specified to decrease over time. The interplay between the two values might have a significant influence on the convergence. Furthermore, by relying on analytic exploration, the agent could even be allowed to safely explore in deployment, as most of the exploring actions would not be random, but analytically motivated (and thus more likely to be efficient).

Note that the analytic exploration can be understood in terms of the heuristic-exploration paradigm [39]. In our case, the exploration uses a problem-specific heuristic, namely that of an analytic model. In that sense, it can be considered a concrete application of the general heuristic exploration approach, which has been shown to achieve good results for many problems [39].

The details of the analytic approach used for the analytically guided exploration in this paper can be found in Appendix. Note that, for simplicity, we have restricted ourselves to the optimization rule of the analytic self-control approach proposed in [5], while the stabilization rule has been neglected here (which may lead to higher densities, as we will see).

D. GuidedLight AGENT

In this subsection, we summarize the design of the “GuidedLight” agent implementing the analytically guided exploration paradigm ($\alpha$-RL).

  • Agent: An agent is a decision-making entity that represents a single intersection in the traffic network and controls the traffic lights at that intersection.

  • State: The state of the agent, also referred to as “observations” according to [23], consists of the percentage coverage of vehicles on the incoming lanes. We use the percentage coverage of the lane, as it implicitly accounts for the length of the respective lane: three cars of approximately 5 meters each on a 30-meter lane should be treated differently from the same cars on a 300-meter lane. Furthermore, each incoming lane is divided into 3 segments of equal length: closest to the intersection, middle, and furthest. Such an approach has been shown to give superior results compared to the unsegmented approach [23]. Moreover, the state includes the percentage coverage of cars on each of the outgoing lanes and the current phase at the intersection (a sketch of the resulting observation vector follows this list).

  • Actions: The actions, from which the agent selects, consist of the possible phases for the given intersection.

  • Reward: The reward $R_{i}$ uses the pressure concept [3] and is equal to the negative of the pressure $P_{i}$ defined in Equation 2:\begin{equation*} R_{i} = -P_{i}, \tag{4}\end{equation*} where $i$ represents the specific intersection. The negative is taken because we aim to minimize the pressure, which corresponds to maximizing its negative.
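The observation and reward described above could be assembled roughly as follows. The segmentation into three equal segments per in-lane, the percentage coverage, and the reward of Equation 4 follow the text; the helper signatures, the 5-meter vehicle length, and the one-hot phase encoding are assumptions of this sketch.

```python
import numpy as np

def lane_segment_coverage(vehicle_positions, lane_length, vehicle_length=5.0, n_segments=3):
    """Fraction of each lane segment (closest, middle, furthest) covered by vehicles.

    vehicle_positions are assumed to be distances from the stop line.
    """
    seg_len = lane_length / n_segments
    coverage = np.zeros(n_segments)
    for pos in vehicle_positions:
        seg = min(int(pos // seg_len), n_segments - 1)
        coverage[seg] += vehicle_length / seg_len
    return np.clip(coverage, 0.0, 1.0)

def build_state(in_lane_positions, in_lane_lengths, out_lane_coverage, phase_id, n_phases):
    """Observation: segmented in-lane coverage, out-lane coverage, one-hot current phase."""
    parts = [lane_segment_coverage(p, length)
             for p, length in zip(in_lane_positions, in_lane_lengths)]
    phase_onehot = np.eye(n_phases)[phase_id]
    return np.concatenate(parts + [np.asarray(out_lane_coverage), phase_onehot])

def reward(pressure):
    """Equation 4: the reward is the negative intersection pressure."""
    return -pressure
```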

SECTION V.

Simulation Experiments

The goal of our computer-based simulation experiments is to test the machine learning approaches described above against the fixed time and the analytic approach. We will, therefore, conduct simulation experiments in several virtual city environments and compare the results to each other. We will also specifically evaluate the number of learning episodes needed for the learning approaches to achieve convergence.

The specific details of all parameters used in our experiments can be found in Appendix.

A. Methods Compared

Our simulation experiments will compare the following methods:

  • Fixed Time: A fixed traffic light schedule, where we give 10 seconds to each phase with a 2-second clearing phase in between the phases, as described in [2]. The same order of phases is followed by all agents. Hence, at a given time, all intersections display the same phase. This is obviously a low baseline which, however, has repeatedly been used to compare the relative performance of various reinforcement learning approaches.

  • Demand: A simple adaptive method, which always chooses the phases with the highest demand as expressed by the number of cars on the incoming lanes.

  • Analytic: A state-of-the-art analytic approach relying on the optimization rule described in [4] and [5]. The method calculates both the phase to be chosen and the amount of green time to be given, following the details in Appendix.

  • PressLight: A popular reinforcement learning approach [23] with a reward based on pressure [3] and an action-state space similar to the description in subsection IV-D, but considering the number of vehicles instead of the percentage coverage. For the purpose of this study, the “PressLight agent” was re-implemented. The results obtained by our implementation were compared with the open-sourced PressLight implementation and found to be consistent. Small differences might occur due to the use of a larger neural network, a newer version of the CityFlow simulator, smaller set-up times, as well as a larger number of phases (actions) available to the agent.

  • GuidedLight: A reinforcement learning approach using analytic insights for exploration, as proposed in section IV of this paper.

All the agents in all scenarios have 8 actions to select from. The actions correspond to all 8 non-conflicting phases available at a 12-movement intersection. Both learning methods (GuidedLight and PressLight) explore the environment with the same probability $\epsilon$. The main difference is that GuidedLight chooses an analytically derived action with probability $\alpha$, while PressLight always chooses a random action when exploring.

B. Computer Simulations

As simulation environment we use CityFlow [40] due to the availability of a large number of synthetic and realistic scenarios as well as the higher computational efficiency compared to SUMO [41].

1) Scenarios

In our experiments we compare the aforementioned methods in a variety of scenarios. Four of them are based on synthetic configurations specified in Table 1, which follow the research design in [23]. The first setting is a 4 by 4 artificial road grid with 16 intersection agents. The distances between the intersections are assumed to be 100 meters.

TABLE 1. Traffic Flows Assumed in the Four Synthetic Simulation Scenarios.

At each intersection, the share of vehicles turning left is set to 10%, the share going straight to 60%, and the share turning right to 30%. The specification of the synthetic traffic data follows [25].

The second simulation scenario is based on real-world traffic and a real-world road network: a 16 by 1 grid with 16 agents based on 8th Avenue in Manhattan, New York. The road network is extracted from OpenStreetMap, and the flow data is based on open-sourced taxi trip data, as presented in [23]. The arrival rate is 1.886 vehicles/second with a standard deviation of 0.009.

The third setting is also based on Manhattan, New York. However, it consists of 196 intersections of the Upper East Side. The vehicle flow, also based on taxi trip data, is set at 0.803 vehicles/second with a standard deviation of 0.0336. Since the taxi data provides only origin-destination data, the shortest path between two points is generated following [24].

In all scenarios, the vehicles arrive at the terminal edges of the road network. Moreover, the action frequency for the demand agents and learning agents is set to 10 simulation steps, where 1 simulation step corresponds to 1 second. If the phase is changed, a clearing phase is initiated first for a fixed time period of 2 seconds. During that time, only right turns are allowed, which is possible in all phases, following the custom in many countries. The traffic is bidirectional in all scenarios.

The synthetic scenarios are included for better comparability with previously published learning methods [23]. The performance in the NY196 scenario is of greater interest to us, as it is realistically complex both in terms of the road network and the traffic flows.

Each scenario is run for 1800 seconds, that is 30 minutes of real-world time. An “episode” is a full run of a simulation for the entire period of 1800 seconds, which corresponds to 1800 simulation steps.
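For orientation, one such episode can be driven roughly as shown below. The CityFlow method names (`Engine`, `next_step`, `get_lane_vehicle_count`, `set_tl_phase`, `get_average_travel_time`) reflect the CityFlow Python API as we know it and should be checked against the installed version; `choose_phase` is a placeholder for any of the controllers compared in this paper, and the 2-second clearing phase is left to the agent logic.

```python
import cityflow

def run_episode(config_path, choose_phase, intersection_ids,
                episode_steps=1800, action_interval=10):
    """Run one 1800-step (30 min) episode, acting every 10 simulation steps."""
    eng = cityflow.Engine(config_path, thread_num=1)
    for step in range(episode_steps):
        if step % action_interval == 0:
            counts = eng.get_lane_vehicle_count()   # lane_id -> number of vehicles
            for i in intersection_ids:
                eng.set_tl_phase(i, choose_phase(i, counts))
        eng.next_step()                             # 1 simulation step == 1 second
    return eng.get_average_travel_time()
```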

C. Performance Metrics

The main performance metrics that we use for comparison are the average travel time (in seconds) and the throughput (in number of vehicles over the entire simulation period). Both the average travel time and the throughput are calculated using the methods available in the CityFlow simulator [40].

For the machine learning methods, we present the minimum of the average travel time and maximum of the throughput along with the standard deviations in the last ten episodes of training, which can be treated as an indicator of the methods’ stability. We also provide data on the number of episodes needed for convergence of the learning methods. The learning methods are trained for 150 episodes.
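These summary statistics can be computed from per-episode logs in a few lines; `travel_times` and `throughputs` are hypothetical lists with one entry per training episode.

```python
import numpy as np

def summarize(travel_times, throughputs, window=10):
    """Best values plus standard deviations over the last `window` episodes,
    the latter serving as a simple indicator of training stability."""
    return {
        "min_travel_time": float(np.min(travel_times)),
        "max_throughput": float(np.max(throughputs)),
        "travel_time_std_last": float(np.std(travel_times[-window:])),
        "throughput_std_last": float(np.std(throughputs[-window:])),
    }
```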

D. Further Analysis

In addition to studying the performance of different methods in different scenarios we also include an ablation study, where we validate the benefits of analytic exploration. We further investigate the influence of the $\alpha $ parameter on the performance of the GuidedLight method.

Furthermore, we analyze the action space induced by the three best performing methods. We present a histogram of the actions taken by the agent controlling the intersection to compare the similarity of the action space for different methods.

SECTION VI.

Simulation Results

In this section we present the results of our simulation experiments described in the previous section.

A. Average Travel Time and Throughput

In Table 2 we can see the performance of the various traffic light control methods in terms of the average travel time and throughput for the four configurations of the synthetic scenario (I-IV) and the two real-world scenarios. As can be seen in the table, the GuidedLight method achieves the best results for all configurations.

TABLE 2. Average Travel Time and Throughput of Different Traffic Signal Control Methods for the Four Configurations of the $4\times4$ Synthetic Scenarios and Two Real-World Scenarios. The Values in Brackets are Standard Deviations. Best Results are Presented in Bold.

If we compare the different approaches in Figure 4, we can see that the differences between the analytic and the GuidedLight approaches are especially significant for the synthetic scenarios and NY196. Interestingly, for the NY16 scenario, the simple demand-based method is able to reach results comparable to the learning and analytic methods. Furthermore, the standard deviations of the PressLight method are higher than those of GuidedLight for the two realistic scenarios, suggesting that the training is less stable for PressLight than for GuidedLight.

FIGURE 1. Representation of an intersection with four approaches: North, West, East, South. There are 3 separate lanes on each approach: one for through traffic, one for turning left, and one for turning right. Here, the traffic lights are assumed to be in phase 1 as per the numbering introduced in Figure 2. Green arrows indicate movements that are allowed, while red arrows indicate movements that are disallowed in the current phase.

FIGURE 2. Possible phases to be selected from by a control mechanism, here, for intersections with four approaches: North, West, East, South. For all the phases, a right turn from each approach is also assumed to be possible when there are no conflicting traffic flows.

FIGURE 3. The five road networks used in the experiments; blue dots indicate intersections, black lines indicate roads.

FIGURE 4. Throughput of various traffic light control methods relative to the throughput of fixed time control for the four configurations of the $4\times 4$ synthetic scenarios and two real-world scenarios. Higher values are better.

Similarly, by consulting Figure 5 we find that the travel time improvement over the Fixed Time method is significant for all methods tested. GuidedLight gives the best ratio of improvement in all scenarios. It is also worth noting that, for all methods, the travel time improvement over the Fixed Time method is lowest in the NY196 scenario.

FIGURE 5. Average travel time of various traffic light control methods relative to the average travel time for fixed time scheduling for the four configurations of the $4\times 4$ synthetic scenarios and two real-world scenarios. Lower values are better.

B. Convergence of Learning Methods

Here, we compare the convergence of the machine learning approach PressLight to the $\alpha$-RL method GuidedLight. As can be seen in Figure 6, GuidedLight converges to a stable result significantly faster than PressLight in all the tested settings. This showcases the benefits of the analytically guided exploration ($\alpha$-RL). It is worth noting that the training of PressLight becomes unstable in most scenarios (I-IV), unlike that of GuidedLight. This is likely due to the overestimation of the action values, which is avoided in GuidedLight by using the Double Deep Q-Network.

FIGURE 6. Throughput achieved as a function of the number of learning episodes for the conventional machine learning method PressLight and the analytically guided method GuidedLight.

FIGURE 7. Histograms of actions taken by a single agent in the NY196 scenario according to each of the three methods. Different colors indicate different actions.

C. Ablation Study

To validate the benefits of $\alpha$-exploration, we perform an ablation study. We compare GuidedLight without $\alpha$-exploration, PressLight with $\alpha$-exploration, and the standard GuidedLight and PressLight. In Table 3 we can see that GuidedLight without $\alpha$-exploration indeed performs much worse than GuidedLight with it. Similarly, PressLight with $\alpha$-exploration is superior to PressLight, but not as good as GuidedLight, due to differences in the DQN and state-space design (the use of a DDQN instead of a DQN and the use of percentage coverage instead of the absolute number of vehicles in the state description).

TABLE 3. Results of the Ablation Study on the NY196 Scenario Run to Validate the Benefits of $\alpha$-Exploration (With Standard Deviations Determined Over the Last 10 Training Epochs). “$-\alpha$-Exploration” Indicates a Model Using Random Exploration.

D. $\alpha$ Parameter Study

In Table 4, we present a study of the effects of different values of the $\alpha$ parameter, which controls the frequency of analytically guided exploration, on the results achieved by GuidedLight. We compare different starting and end values for the $\alpha$ parameter, as well as different rates of change. It appears beneficial to initialize $\alpha$ at 1 and decrease it gradually to 0. Also, decreasing the $\alpha$ parameter slowly appears to be favorable. Setting $\alpha$ to 1 and not decreasing the parameter corresponds to $\epsilon$ exploration and obtains the worst results among the settings compared.

TABLE 4. Throughput Achieved by GuidedLight in the NY196 Scenario With Different Starting and End Values and Changing Rates of the $\alpha$ Parameter (With the Standard Deviations Over the Last 10 Training Epochs).

E. Action Space Analysis

In order to further understand the differences and characteristics of the compared methods, we study the action space of the reinforcement learning and analytic methods. Actions correspond to the possible phases as indicated in Figure 2. The agents using the analytic method favor action 7, PressLight appears to favor actions 6 and 7 heavily, while GuidedLight favors actions 2 and 6. Furthermore, the analytic approach appears to select a greater variety of actions compared to PressLight and GuidedLight. It is important to note that the analytic method selects more actions, as it adjusts the green time given to each action.

SECTION VII.

Summary, Conclusions, Discussion, and Outlook

In this paper, we have compared different performance indicators of various adaptive traffic light control approaches and some alternative reinforcement learning methods. It turns out that the analytic method performs well, especially in real-world inspired scenarios, and can thus serve as a benchmark for novel reinforcement learning (RL) methods. We also note that the analytic method becomes less effective in highly congested traffic, as in the synthetic scenarios, at least if the stabilization rule is neglected. Our results further show that $\alpha$-RL methods can significantly outperform the analytic approach after a sufficient number of learning episodes. Even though these results have been gained for somewhat idealized traffic scenarios, we expect qualitatively similar findings for irregular road networks and more complex traffic scenarios (as is to be shown in follow-up work).

The performance of the analytic method results from the use of mathematical formulas derived from traffic physics, which allow one to determine the green time needed to clear the entire vehicle queue, considering the arrivals of further vehicles based on a sophisticated short-term prediction. This mechanism also promotes coordinated traffic flows and emergent green waves, while not being restricted to repetitive service patterns.

Reinforcement learning lacks analytic insight into the physical laws underlying traffic dynamics; it has to infer the dynamics from traffic patterns that occurred in the past. $\alpha$-RL combines the strengths of RL and the analytic method, hence outperforming both of these methods.

In summary we find that:

  • In order to find superior solutions, one needs a “hybrid” approach, where the scientific knowledge behind the analytic approach is fed into the machine learning approach.

  • Therefore, even in the age of Artificial Intelligence, analytic approaches remain important, but hybrid approaches are best.

A. Green IT

A recently highlighted issue in connection with the UN Sustainable Development Goals (SDGs) is the energy consumption and environmental footprint of technologies. While digital technologies contributed just about 3-5% of the world’s electricity consumption some years ago, this share is expected to grow beyond 20% by the year 2030 [42]. In some cities, the share of electricity spent on data centers is already higher than that.

These developments have caused a call for “green IT”, i.e. Information Technology solutions that have a low environmental footprint. This is of particular importance for machine learning methods [43], which are computationally quite expensive. Deep learning, including deep reinforcement learning, relies heavily on deep neural networks. This often requires vast amounts of GPU processing time, which translates into significant amounts of energy consumed.

It is, therefore, relevant to consider the ecological impact of reinforcement learning (RL) models used for traffic light control, since one of its goals is reducing emissions. It would be questionable to employ models to solve a problem if they actually exacerbated that problem. Some of the models we have mentioned take dozens of hours of training time on a state-of-the-art computational architecture until they converge [24], only to then reach a performance that the analytic approach achieves from the very beginning. Unfortunately, this is combined with limited generalization abilities to modified scenarios, for example, involving accidents or temporary building sites. Such typical disruptions of the regular operation would call for frequent retraining in order to avoid sub-optimal performance. The related ecological footprint should, hence, be taken into account, particularly considering the fact that highly performing analytic approaches exist, which are computationally cheap and environment-friendly.

It therefore seems pressing to work on novel methods that use analytic knowledge, speed up convergence, and improve the ability to generalize. For these reasons, we have proposed a novel, hybrid machine learning method called “GuidedLight”, which combines the benefits of machine learning and analytic approaches through analytically guided exploration. We have shown that the proposed “$\alpha$-RL” method achieves considerably faster convergence than conventional reinforcement learning methods, leading to decreased training times. This, in turn, is expected to reduce the environmental footprint. Moreover, we were able to show that GuidedLight performs better than the analytic approach and better than the other reinforcement learning approaches studied in this paper. This applies particularly to the performance measures of average travel time and throughput. We believe that analytically guided machine learning would have benefits in many other application areas as well, which are to be explored in the future.

Appendix

Model Parameters

For both PressLight and GuidedLight, we use a fully connected neural network with two hidden layers of 128 and 64 hidden units, respectively. We use a learning rate of 0.0005, a batch size of 64, a starting $\epsilon$ of 1, and a minimum $\epsilon$ of 0.01; $\epsilon$ is decreased by 0.00005 with each action taken by an agent. The size of the memory buffer is set to 100000, the discount factor to 0.999, and the soft-update parameter to 0.0001. The network update frequency is set to 10. For GuidedLight, the value of $\alpha$ is set to the value of $\epsilon$ in each episode. Finally, for the analytic approach’s stabilization strategy we use $T = 180$ and $T_{\max} = 240$.
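Collected into a single (assumed) configuration dictionary, these settings read as follows; the key names are ours, and the linear interpretation of the $\epsilon$ decrease is an assumption.

```python
HYPERPARAMS = {
    "hidden_layers": (128, 64),          # fully connected DDQN
    "learning_rate": 5e-4,
    "batch_size": 64,
    "epsilon_start": 1.0,
    "epsilon_min": 0.01,
    "epsilon_decay_per_action": 5e-5,    # subtracted after each action (assumed)
    "memory_size": 100_000,
    "discount_factor": 0.999,
    "soft_update_tau": 1e-4,
    "network_update_frequency": 10,
    "T": 180,                            # stabilization strategy of the analytic approach
    "T_max": 240,
}
# For GuidedLight, alpha is set to the current value of epsilon in each episode.
```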

Appendix

Analytic Method

In the study presented here, the analytic approach used to guide the exploration is the optimization rule proposed in [5]. We have selected this method because of its superior results among all analytic approaches we have tested. Specifically, the analytic approach selects the next phase based on a priority score, which itself is related to the green time $\hat{g}$ required to clear a given lane [5]. This value is derived from Equation 5, where $N^{exp}(t)$ represents the expected number of cars that arrive at the intersection by time $t$, $N^{out}(t)$ the number of vehicles that depart from the upstream intersection by time $t$, $q^{max}$ is the maximum flow of the movement (“saturation rate”), and $\tau$ the set-up time needed to switch phases:\begin{equation*} N^{exp}(t + \tau + \hat{g}) = N^{out}(t) + \hat{g}\, q^{max}. \tag{5}\end{equation*}

The additional data needed to perform the analytic computations consist of the arrival and departure rates. This data is readily available to any intersection equipped with cameras, induction loops, or other suitable sensors. The overhead of performing the analytic computations is therefore negligible due to their low complexity, the availability of the required data, and the low frequency at which they are performed.
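Since $\hat{g}$ appears on both sides of Equation 5, it can be obtained numerically, for instance by the simple fixed-point iteration sketched below; this iteration is our illustrative choice, not necessarily the procedure used in [5].

```python
def required_green_time(n_exp, n_out_t, q_max, t, tau,
                        g_init=0.0, tol=1e-3, max_iter=100):
    """Solve Equation 5 for the green time needed to clear the expected queue.

    n_exp:   callable giving the expected cumulative arrivals N^exp(t)
    n_out_t: cumulative departures N^out(t) at the current time t
    q_max:   saturation flow of the movement
    tau:     set-up time needed to switch phases
    Convergence of the fixed-point iteration is assumed, not proven here.
    """
    g = g_init
    for _ in range(max_iter):
        g_new = max(0.0, (n_exp(t + tau + g) - n_out_t) / q_max)
        if abs(g_new - g) < tol:
            break
        g = g_new
    return g
```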
