Multi-Agent Reinforcement Learning Based on Representational Communication for Large-Scale Traffic Signal Control

Traffic signal control (TSC) is a challenging problem within intelligent transportation systems and has been tackled using multi-agent reinforcement learning (MARL). While centralized approaches are often infeasible for large-scale TSC problems, decentralized approaches provide scalability but introduce new challenges, such as partial observability. Communication plays a critical role in decentralized MARL, as agents must learn to exchange information using messages to better understand the system and achieve effective coordination. Deep MARL has been used to enable inter-agent communication by learning communication protocols in a differentiable manner. However, many deep MARL communication frameworks proposed for TSC allow agents to communicate with all other agents at all times, which can add to the existing noise in the system and degrade overall performance. In this study, we propose a communication-based MARL framework for large-scale TSC. Our framework allows each agent to learn a communication policy that dictates “which” part of the message is sent “to whom”. In essence, our framework enables agents to selectively choose the recipients of their messages and exchange variable length messages with them. This results in a decentralized and flexible communication mechanism in which agents can effectively use the communication channel only when necessary. We designed two networks, a synthetic $4 \times 4$ grid network and a real-world network based on the Pasubio neighborhood in Bologna. Our framework achieved the lowest network congestion compared to related methods, with agents utilizing $\sim 47-65 \%$ of the communication channel. Ablation studies further demonstrated the effectiveness of the communication policies learned within our framework.


I. INTRODUCTION
Rapid urbanization in recent years [1] has given rise to a growing problem of traffic congestion [2]. Recent trends also show a huge rise in ride-hailing and e-commerce services, which have contributed significantly towards the increasing number of vehicles on the road [3], [4]. The impacts of traffic congestion include increased delays and wasted fuel, in addition to the impact on the environment and public health [5], [6]. Traffic signal control (TSC) is one of the challenging bottlenecks in reducing traffic congestion. The goal of TSC is to dynamically and intelligently control signal timings to reduce the number of vehicles halted on the road.
Recent advances in machine learning have opened up a wide range of opportunities for developing intelligent transportation systems solutions, including traffic signal control. Deep learning based architectures provide flexibility in processing data from various sensory inputs [7] and additionally serve as a useful tool for multimodal data fusion [8]. Deep reinforcement learning (RL) uses deep neural networks (DNNs) to map inputs to actions. Deep RL frameworks have shown tremendous progress in learning effective policies directly from raw sensory inputs [9]. Following these advances, deep MARL has emerged as one of the promising tools to develop effective frameworks for network-wide TSC, where each traffic light is treated as an agent that learns to select appropriate phases to minimize congestion within the network.
A straightforward way to carry the framework of deep RL over to the MARL setting is to treat all the agents as a collective entity. One can then use a function approximator, such as a DNN, to map the state into joint actions. However, the action space of this approach grows exponentially with the number of agents, and such centralized control often proves impractical for large-scale applications. Furthermore, centralized approaches require access to the global state of the environment, which may not always be feasible. TSC is a large-scale problem for which decentralized execution becomes crucial. Several deep MARL frameworks have been proposed for independently controlling traffic signals [10]-[37]. However, to apply these methods to real-world applications such as TSC, one must consider potential limitations of communication, such as bandwidth availability [38]. In addition, allowing unrestricted communication can be disadvantageous for several reasons. First, the system incurs additional communication overhead, and when the messages received by an agent are unhelpful, excess communication can reduce performance by adding unnecessary noise. Second, it leaves the system vulnerable to adversarial attacks. Potential solutions to these problems are (1) compressing the information into a small number of bits [39]-[42], (2) communicating only when necessary [43]-[49], and/or (3) communicating with selective agents [43], [45], [48]-[51]. The majority of studies that proposed message-passing mechanisms focused extensively on improving the content of the messages by leveraging techniques from DL (e.g., attention mechanisms [52], graph neural networks [53], and variational inference [54]). The lines of work that focused on deciding when to communicate or whom to communicate with involved heuristics-based frameworks [43], [45] or gating mechanisms [44], [46], [49]-[51].

A. CONTRIBUTION
In this paper, we propose an alternate framework for learning communication protocols that builds upon the existing Q-MIX [55] and NDQ [45] frameworks, which leverage the paradigm of centralized training and decentralized execution (CTDE) by learning a global action-value function. The global action-value is monotonically decomposed into individual action-values for decentralized execution. Facilitating communication among agents results in better action-value estimates [45]. Within our framework, QRC-TSC, agents learn how to effectively compress their environmental perception and action intentions into a message and determine which part of the message needs to be transmitted to another agent for effective coordination. This decision is made independently for each available recipient, making the communication framework flexible. We utilize the variational inference deep learning framework [54], [56]-[58] to maximize the mutual information between the message sent by the sender and the actions taken by the recipient [45], which is an effective metric for measuring communication performance [59]. Specifically, we model the message space as a joint distribution of the generated message and the communication policy (whether to send each bit of the message). Through our formulation of the communication objective, we also encourage exploration over the communication policy space. We used the SUMO simulator [60] to design two traffic networks, a 4 × 4 synthetic grid network with variable traffic flow and a real-world network based on the Pasubio neighborhood of Bologna. We demonstrated the efficacy of our framework in reducing the congestion level of network-wide traffic by comparing it with some of the leading communication-based MARL frameworks. We also conducted ablation studies on the communication mechanism by comparing the results of our framework with several baseline communication strategies, including full communication, no communication, and random communication. We observed that traffic signals on the network were able to dynamically adjust the number of bits they send in their messages while maximizing performance.
The rest of the paper is organized as follows. Section III provides an overview of the relevant work done in MARL which serves as the basis of our framework. Section IV discusses our framework in detail and also describes the formulation of the TSC problem within our framework. In Section V we provide the experimental setup, compare the results of QRC-TSC with other frameworks, and perform ablation studies. Finally, Section VI concludes the paper and discusses potential future research directions.

II. RELATED WORK

A. INTER-AGENT COMMUNICATION IN MARL
Recently proposed algorithms (e.g., DIAL [39] and CommNet [40]) have made it possible to learn communication protocols through a feedback mechanism by leveraging DL techniques. DIAL is an extension of Independent Q-learning (IQL), where each agent generates both action-values and a message vector. The message vector is then passed as input to the other agent networks in the next time step, thus obtaining feedback from the receiver agents in the form of gradients.
The most relevant work to our problem is the Nearly Decomposable Q-function (NDQ) [45], which combines the communication framework of DIAL with the general learning framework of Q-MIX by utilizing the variational inference [54] technique from DL. In addition to learning communication through feedback, NDQ proposes an objective function that maximizes the mutual information (MI) between the sender's message and the recipient's action. The main idea is for agents to learn to capture the most relevant information in as few bits of message as possible. A similar metric, causal influence of communication (CIC), was proposed [59], [61] to improve communication performance without impeding the general learning process. However, NDQ uses a threshold-based heuristic to filter out unhelpful messages in its communication framework. In our work, we extend the work done in NDQ and develop a communication framework that learns to effectively select the important bits of messages.

B. DEEP MULTI-AGENT REINFORCEMENT LEARNING IN TRAFFIC SIGNAL CONTROL
The problem of TSC has been studied through the lens of MARL [10], [62], [63] by treating the traffic signal as an agent and rewarding it based on a metric that is inversely proportional to the level of congestion (queue length). Recently, with the advent of deep MARL, many proposed solutions to the problem of TSC [12], [13], [18], [19], [21], [28], [32], [37], [64] were effective in extracting richer information from more sophisticated sensor inputs for the decision-making process [20], [25]. Communication mechanisms have also been part of the progress in applying MARL to TSC domains. Several methods proposed for TSC [20], [27], [64]-[66] implemented a variety of communication mechanisms to train the traffic signals to send and receive messages from neighboring traffic signals. However, the aforementioned methods fail to avoid the pitfall of unrestricted communication. TSC is a large-scale problem where communication between traffic signals has to be wireless, which comes at the cost of limited bandwidth and requires the utilization of additional resources. Hence, the communication mechanism must be efficient, allowing traffic signals to exchange relevant information only when it is beneficial.

III. BACKGROUND

1) Deep reinforcement learning
Reinforcement learning (RL) aims at learning the optimal policy through repeated interaction with the environment. A standard RL problem can be formulated as a Markov decision process (MDP). At each time step $t$, the agent observes the state of the environment $s_t \in S$ and takes an action $a_t \in A$ according to a policy $\pi$. Based on this action, the agent receives feedback from the environment in the form of a reward $r_t$ and transitions to the next state $s_{t+1}$. The objective is to maximize the total expected discounted reward $R = \sum_{t=1}^{T} \gamma^t r_t$, where $\gamma \in [0, 1]$ is the discount factor. Deep Q-Networks (DQN) learn the action-value function $Q_\theta(s, a)$, where $\theta$ represents the parameters of the Q-network. The action-value function can be trained recursively by minimizing the loss $\mathcal{L}(\theta) = \mathbb{E}\big[(y - Q_\theta(s, a))^2\big]$, where $y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$ and $\theta^-$ represents the parameters of the target network. The agent selects the action that maximizes the Q-value with probability $1 - \epsilon$ or acts randomly with probability $\epsilon$. The parameters $\theta^-$ of the target network are updated at regular time intervals by copying over the parameters $\theta$ from the primary network. Double DQN [67] modifies DQN to add stabilization and avoid overestimation: the target action-value is indexed from the output of the target network based on the greedy action selected by the primary network, $y = r + \gamma Q_{\theta^-}\big(s', \arg\max_{a'} Q_\theta(s', a')\big)$. Both DQN and Double DQN assume fully observable MDPs. In partially observable settings, however, an agent conditions its action-value function on the action-observation history. DRQN [68] achieves this by using recurrent neural networks. At each time step, the Q-network takes as input the observation $o_t$ and the hidden state $h_{t-1}$ to approximate the action values $Q_\theta(o_t, h_{t-1}, a_t)$. This enables the agent to integrate past information to make decisions.
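The Double DQN target computation described above can be sketched as follows (a minimal NumPy sketch; the function name and array-based interface are our own, not from the paper):

```python
import numpy as np

def double_dqn_target(reward, next_q_primary, next_q_target, gamma=0.99, done=False):
    """Double DQN target: the primary network selects the greedy action,
    while the target network evaluates it, reducing overestimation bias."""
    if done:
        return reward  # terminal states have no bootstrap term
    greedy_action = int(np.argmax(next_q_primary))  # argmax from primary net
    return reward + gamma * float(next_q_target[greedy_action])  # value from target net
```

For example, with primary Q-values `[1.0, 2.0]`, target Q-values `[0.5, 0.3]`, reward 1.0, and gamma 0.9, the primary network picks action 1 and the target network supplies its value, giving `1.0 + 0.9 * 0.3 = 1.27`.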

A. COOPERATIVE DEEP MULTI-AGENT DEEP REINFORCEMENT LEARNING
One approach to modeling multi-agent systems as RL problems is to treat the whole system as a single agent. The agent observes the true state of the environment and selects joint actions for all the agents. This approach, however, scales poorly, as the joint-action search space increases exponentially with the number of agents in the system. A more feasible approach is to enable each agent to act independently. Thus, one could formulate the problem as a decentralized partially observable Markov decision process (Dec-POMDP), which extends the framework of MDPs to multi-agent scenarios with partial observability [69]. It is defined by a tuple $M = \langle S, A, P, \Omega, O, r, N, \gamma \rangle$, where $s \in S$ is the global state space and $i \in N \equiv \{1, \cdots, n\}$ indexes the finite set of agents. At time step $t$, each agent $i$ selects an action $a_i \in A$, resulting in a joint action vector $\mathbf{a} \in \mathbf{A} \equiv A^n$. The transition dynamics of the environment state are given by $P(s' \mid s, \mathbf{a})$. All agents receive a shared reward according to the reward function $r(s, \mathbf{a})$, and $\gamma \in [0, 1)$ is the discount factor. Each agent receives an observation $o_i \in \Omega$ according to the observation function $O(s, i)$. Each agent has an action-observation history $\tau_i \in T \equiv (\Omega \times A)^*$ on which it conditions its individual policy $\pi_i(a_i \mid \tau_i)$. The joint policy $\pi = \langle \pi_1, \cdots, \pi_n \rangle$ induces a joint action-value function $Q^{\pi}(s, \mathbf{a})$. Some studies propose that each agent learn the global action-value [70]. Recent works have demonstrated better performance with monotonic factorization of the global action-value [55], [71]. Q-MIX [55], specifically, leverages the CTDE paradigm to learn a monotonic mapping between individual utilities $Q_i(\tau_i, a_i)$ and the global action-value $Q_{total}$ by utilizing a mixing network. The weights of the mixing network $\theta_{mixer}$ are generated by a set of hypernetworks, conditioned on the state $s_t$, with an absolute-value activation function to ensure monotonicity, $\frac{\partial Q_{total}}{\partial Q_i} \ge 0$.
The decomposition allows for decentralized action selection during execution, since the mixing network is only used for training. Thus, the mixing network can be conditioned on additional information available during training. Recent works improved performance on complex multi-agent environments by combining Q-MIX with communication frameworks [43], [45]. Therefore, we utilize Q-MIX as the base framework for our proposed communication mechanism.
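The monotonic mixing idea can be sketched in a few lines of NumPy (a minimal sketch with linear hypernetworks and ReLU in place of the activation used in the original Q-MIX; all names and dimensions are illustrative):

```python
import numpy as np

def q_total(agent_qs, state, w1, b1, w2, b2):
    """Mix per-agent utilities into Q_total. The hypernetwork outputs pass
    through abs(), which guarantees dQ_total/dQ_i >= 0 (monotonicity)."""
    n = agent_qs.shape[0]
    W1 = np.abs(state @ w1).reshape(n, -1)  # state-conditioned, non-negative weights
    W2 = np.abs(state @ w2).reshape(-1, 1)
    hidden = np.maximum(agent_qs @ W1 + state @ b1, 0.0)  # monotone activation
    return (hidden @ W2 + state @ b2).item()

# Illustrative dimensions: 2 agents, hidden size 4, state of size 3.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 8)), rng.normal(size=(3, 4))
w2, b2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))
s = rng.normal(size=3)
low = q_total(np.array([0.1, 0.5]), s, w1, b1, w2, b2)
high = q_total(np.array([0.9, 0.5]), s, w1, b1, w2, b2)
assert high >= low  # raising one agent's utility never lowers Q_total
```

Because the mixing weights are non-negative and the activation is monotone, increasing any individual utility can never decrease $Q_{total}$, which is exactly what allows each agent to act greedily on its own utility during decentralized execution.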
We extend the framework of Dec-POMDP to incorporate inter-agent communication. We formulate the traffic signal network as an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. Each node represents an agent (traffic signal) and each edge represents the connectivity between agents. The neighborhood of a node $v$ is defined as $N(v) = \{u \in V \mid (v, u) \in E\}$. We design the communication framework such that each agent is only allowed to communicate with its neighbors.
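This neighborhood restriction amounts to a simple adjacency map over the signal graph (a sketch; the intersection IDs are made up for illustration):

```python
from collections import defaultdict

def build_neighborhoods(edges):
    """N(v) = {u | (v, u) in E}: each agent may only message its graph neighbors."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)  # undirected graph: both directions
    return dict(nbrs)

# A 2x2 grid of traffic signals, for example:
grid_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
neigh = build_neighborhoods(grid_edges)
assert neigh[0] == {1, 2}  # agent 0 communicates only with agents 1 and 2
```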
We set up the problem of TSC as a Dec-POMDP, where each traffic signal in the network is treated as an agent and the central goal of the system is to reduce network-wide congestion. The traffic signals make decisions using information about incoming vehicles, which is assumed to be accessible through sensors located near the signals.
The traffic signals control the flow of traffic through the intersection by selecting a phase from the available set of phases. We discuss the details of our formulation below.

1) Observation representation
Each traffic signal has a limited vision range of 50 meters, within which it can obtain information related to the traffic flow. This is equivalent to the sensory information that can be obtained from common practical sensors. We implement observation collection in the environment by placing a laneAreaDetector of length 50 meters on each incoming lane to capture the traffic information, which can be seen in the boxes highlighted in grey in Fig. 2. The observation for each traffic signal consists of: the number of vehicles $\{n_l\}_{l=1}^{L_i}$, the average normalized speed of the vehicles $\{s_l\}_{l=1}^{L_i}$, the number of halted vehicles (queue lengths) $\{q_l\}_{l=1}^{L_i}$, and the current phase ID of the traffic signal, where $L_i \subseteq L$ are the incoming lanes for traffic signal $i$ and $L$ is the set of all lanes in the network.
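As a sketch, the per-signal observation vector might be assembled as follows (in practice the per-lane readings would come from SUMO's laneAreaDetector via TraCI; the helper and its plain-list interface here are hypothetical):

```python
def build_observation(lane_counts, lane_speeds, lane_queues, phase_id, n_phases,
                      max_speed=14.0):
    """Concatenate per-lane vehicle counts, normalized mean speeds, and queue
    lengths with a one-hot encoding of the current phase ID."""
    obs = list(lane_counts)
    obs += [s / max_speed for s in lane_speeds]  # normalize speeds to [0, 1]
    obs += list(lane_queues)
    phase = [0.0] * n_phases
    phase[phase_id] = 1.0  # one-hot current phase ID
    return obs + phase

# Two incoming lanes, four available phases:
obs = build_observation([3, 1], [7.0, 14.0], [2, 0], phase_id=1, n_phases=4)
assert len(obs) == 2 * 3 + 4  # 3 features per lane + phase one-hot
```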

2) Action Representation
For each traffic signal $i$, we define its action $a_i$ as choosing one green phase from a list of available phases. As an example, Fig. 3 shows the list of phases that are available for a traffic signal in a 4 × 4 grid network. A traffic signal can select any green phase from its list or keep its current one, but it must then follow the next yellow phase, which is enforced by the environment. The action selection interval and the yellow phases are fixed for a duration of 5 simulation seconds.

3) Reward
Various metrics are used as rewards in traffic signal control settings. In our study, we chose queue length $q_l$ as the performance metric of the traffic signal controller due to its simple nature and its property of representing an instantaneous feedback signal. We define the objective as minimizing the number of vehicles stopped throughout the network, with the global reward given by $r_t = -\sum_{l \in L} q_{l,t}$, where $r_t \in \mathbb{R}$ is the global reward and $l \in L$ indexes the lanes in the network.
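The reward computation above amounts to a one-line sketch:

```python
def global_reward(queue_lengths):
    """r_t = -(total halted vehicles over all lanes), so maximizing return
    corresponds to minimizing network-wide congestion."""
    return -sum(queue_lengths)

assert global_reward([2, 0, 3]) == -5  # 5 halted vehicles -> reward -5
```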

B. OVERALL FRAMEWORK
In this section, we present a detailed design of QRC-TSC in the context of multi-agent Q-learning (Fig. 5). We adopt the CTDE paradigm and use Q-MIX [55] as the base learning framework. Training takes place in a centralized manner, assuming that global state information is available. Each agent $i$ has access to an agent network whose parameters are shared across all agents. This approach has been shown to accelerate learning and enhance scalability in deep MARL settings. The agent network takes as inputs the agent's action-observation history and the incoming messages from other agents to generate action-values. The agent uses its own action-values to select an action during decentralized execution. Each agent also has a communication network that takes in the agent's action-observation history and generates a message vector $m_{ij}$ and a communication policy $c_{ij}$ for each available recipient agent $j \in N(i)$. This can be seen in the communication module in Fig. 4.

C. COMMUNICATION IN QRC-TSC
In our framework, each agent learns communication protocols through feedback from the recipient agents. Feedback is received in the form of gradients during backpropagation [39], [40]. Thus, the entire network architecture can be trained from a single objective function. Our goal in this work is to train agents to quickly and effectively learn communication protocols. Therefore, agents must learn the communication policy and send messages that reduce the uncertainty in the recipient's policy. To this end, we aim to maximize the mutual information between the sender's message and the recipient's policy, similar to NDQ [45]. This metric was previously proposed [59] as one of the key metrics to measure communication performance; it therefore makes sense to integrate such a metric into the objective function and explicitly maximize it. First, we model outgoing messages as a joint distribution $p(m_{ij}, c_{ij})$ of the message $m_{ij}$ generated by agent $i$ for agent $j$ and its communication actions $c_{ij}$ [58]. Specifically, each agent $i$ generates a shared latent message distribution (a multivariate Gaussian) from which a message vector $m_i$ is sampled, and a discrete communication policy distribution (encoded as a Bernoulli) that decides which bits of the message are to be sent to agent $j$. This decision $c_{ij}$ is made independently for each agent $j \in N(i)$ in the neighborhood. Similar to the approach proposed in [56], [57], we use Gumbel-sigmoid as a continuous approximation of the categorical variables. The Gumbel-max trick allows for differentiable sampling and does not suffer from high variance like the REINFORCE algorithm [56]. Thus, our framework is end-to-end differentiable.
The communication action $c_{ij}$ acts as a mask over the messages during execution. To achieve this, we use a differentiable relaxation of categorical/discrete variables [56], [57]. Gumbel-sigmoid can be considered a continuous relaxation of the Bernoulli distribution and can be written as $c = \sigma\big((\log \frac{\pi}{1 - \pi} + g_l - g_m) / \lambda\big)$, where $g_l$ and $g_m$ are samples from the Gumbel(0, 1) distribution and $\lambda$ is the temperature parameter.
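A minimal NumPy sketch of the Gumbel-sigmoid gate and the resulting message mask (forward pass only; in the actual framework this would be a differentiable layer inside the communication network, and the logit parameterization here is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_sigmoid(logits, temperature=1.0):
    """sigmoid((logits + g_l - g_m) / lambda) with g_l, g_m ~ Gumbel(0, 1):
    a continuous relaxation of Bernoulli sampling."""
    u1 = rng.uniform(1e-9, 1.0, size=np.shape(logits))
    u2 = rng.uniform(1e-9, 1.0, size=np.shape(logits))
    g1, g2 = -np.log(-np.log(u1)), -np.log(-np.log(u2))
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g2) / temperature))

def masked_message(message, gate_logits, temperature=0.5):
    """The communication action c_ij soft-masks the message bits m_ij."""
    c = gumbel_sigmoid(gate_logits, temperature)
    return message * c, c

m, c = masked_message(np.ones(5), np.zeros(5))
assert m.shape == (5,) and np.all((c > 0) & (c < 1))
```

Lower temperatures push the gate values toward hard 0/1 decisions, while higher temperatures keep them soft and easier to train through.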
Next, we discuss the objective function $J_c(\theta_c)$ for learning communication. We maximize the mutual information between the sender's message and the recipient's policy:

$J_c(\theta_c) = I\big(a_j;\, \bar{m}_{ij} \mid \tau_j, \bar{m}_{in}^{-(i)j}\big) - \beta\, I\big(\tau_i;\, m_i\big), \qquad (4)$

where $\pi_j(\cdot \mid \tau_j, m_{in}^j) = \mathrm{softmax}(Q(\cdot\,; \tau_j, m_{in}^j))$ represents the policy of agent $j$ conditioned on its action-observation history and incoming messages, $\bar{m}_{ij} = m_{ij} \odot c_{ij}$ is the resulting outgoing message from agent $i$ to agent $j$, $\theta_c$ is the set of parameters of the communication network, and $\beta$ is a scaling factor that controls the tradeoff between the expressiveness and compressiveness of the messages. Since our objective is to maximize the mutual information, it is sufficient to derive the objective as a lower bound. The lower bound [45], [54] for the mutual information objective, the first term in (4), can be given as

$I\big(a_j;\, \bar{m}_{ij} \mid \tau_j, \bar{m}_{in}^{-(i)j}\big) \ge \mathbb{E}_{\boldsymbol{\tau} \sim \mathcal{D}}\Big[-\mathrm{CE}\big[\pi_j(\cdot \mid \tau_j, m_{in}^j) \,\big\|\, q_{\theta_r}(\cdot\,; \tau_j, \hat{m}_{in}^j)\big]\Big] + \mathrm{const}, \qquad (5)$

where $\boldsymbol{\tau}$ is the joint local action-observation history of the agents sampled from the replay memory $\mathcal{D}$ and $\mathrm{CE}$ is the cross-entropy. The posterior estimates are given by $q_{\theta_r}(\cdot\,; \tau_j, \hat{m}_{in}^j) = q_{\theta_r}(\cdot\,; \tau_j, (m \odot c)_{in}^{-(i)j}) = q_{\theta_r}(\cdot\,; \tau_j, m_{in}^{-(i)j}, c_{in}^{-(i)j})$, and the parameters $\theta_r$ are shared across all the agents.
The second term in (4), analogous to the variational bottleneck objective in [54], is the mutual information between the agent's action-observation history $\tau_i$ and the generated messages $m_i$. Bounding it variationally and adding an entropy regularizer over the communication policy gives

$J_c(\theta_c, \theta_r) \ge \mathbb{E}_{\boldsymbol{\tau} \sim \mathcal{D}}\Big[\big(-\mathrm{CE}\big[\pi_j \,\|\, q_{\theta_r}\big] - \beta_m D_{KL}\big[p(m_i \mid \tau_i) \,\big\|\, \mathcal{N}(0, I)\big]\big) + \beta_c H\big[p(c_{ij} \mid \tau_i)\big]\Big]. \qquad (6)$

The first term in (6) controls the tradeoff between maximizing the mutual information between the message $\bar{m}_{ij}$ and agent $j$'s policy $\pi_j(\cdot\,; \tau_j, \bar{m}_{ij})$ and being compressive about the action-observation history $\tau_i$. The second term in (6) regularizes the communication policy. This encourages exploration over varied communication policies, which can be controlled by $\beta_c$.
Combining equations (5) and (6), we can write the loss function for the communication objective as

$L_C(\theta_r, \theta_c) = \mathbb{E}_{\boldsymbol{\tau} \sim \mathcal{D}}\Big[\mathrm{CE}\big[\pi_j \,\|\, q_{\theta_r}\big] + \beta_m D_{KL}\big[p(m_i \mid \tau_i) \,\big\|\, \mathcal{N}(0, I)\big] - \beta_c H\big[p(c_{ij} \mid \tau_i)\big]\Big]. \qquad (7)$

Thus, the final loss function for training can be given as

$L(\theta) = L_{TD}(\theta) + L_C(\theta_r, \theta_c), \qquad (8)$

where

$L_{TD}(\theta) = \mathbb{E}_{\mathcal{D}}\Big[\big(r + \gamma \max_{\mathbf{a}'} Q_{total}(\boldsymbol{\tau}', \mathbf{a}', s'; \theta^-) - Q_{total}(\boldsymbol{\tau}, \mathbf{a}, s; \theta)\big)^2\Big] \qquad (9)$

is the TD loss, $\theta^-$ is the set of parameters of the target network, $\theta$ is the set of parameters for all the networks combined, and $L_C(\theta_r, \theta_c)$ is the total communication loss.

A. EXPERIMENTAL SETUP
We built a synthetic 4 × 4 grid network and a real-world network of Pasubio, Bologna, as proposed by Bieker et al. [72]. Trips are generated with origin-destination pairs on the fringe edges. For both networks, we generated variable hourly traffic, similar to [18], as shown in Fig. 6d, where the solid lines represent the high flow rates and the dotted lines represent the low flow rates. Flow rates are varied in 5-minute intervals, within which vehicles are inserted uniformly into the network at the specified flow rate. The peak flow rate is 900 veh/hr. For convenience and representation purposes, we broke down the traffic flow into two types: (1) east-west/west-east (red lines), which starts at the beginning of the hour, and (2) north-south/south-north (blue lines), which starts after 15 minutes. Both flows last for 35 minutes. Flows from the opposite direction, represented by dotted lines in Fig. 6d, are scaled down by a factor of 0.6. Every hour, a random direction is selected as the opposite direction.
1) 4×4 grid network: We built a two-lane synthetic 4×4 grid network of homogeneous agents.We simulated two traffic flow scenarios, one of which is selected randomly at the beginning of each simulation hour.
For the first scenario (Fig. 6a), we simulated high traffic on the external edges of the network, whereas in the second scenario (Fig. 6b) the internal edges of the network received the higher bulk of traffic flow. To induce a level of randomness in the traffic flow, a random direction was selected at the beginning of each simulation hour to have a high flow rate. Traffic flow settings in the synthetic version were not tethered to reality but were designed to test the robustness of the learning algorithm. The speed limit on all the lanes was around 14 m/s.

2) Pasubio network: We used the real-world network of Pasubio, Bologna. The neighborhood has a hospital and includes common routes to the football stadium, and is therefore prone to congestion. The network has 7 traffic lights, some of which control multiple junctions. Three traffic signals have 8 phases, and the rest have 4, 10, 14, and 16 phases. The heterogeneity of the real-world network made it a more challenging environment than the synthetic one. We tried to replicate the traffic flow settings from [72]. The maximum allowable speed on each lane was set to 14 m/s.
Further, we adopt the average number of stops (queue length) as the metric to measure the performance of the algorithms on the traffic network.

B. BASELINES
In this work, we are interested in teaching the agents efficient communication policies. Specifically, our goal is to show that agents do not need to communicate all the time to be able to coordinate. Instead, agents can establish an optimal communication policy that tells the agent which parts of the message are worth sending and to which agent. To this end, we set Q-MIX [55] as the baseline framework for learning the action-value function and DIAL [39] as the baseline framework for communication. To make fair comparisons, we implemented DIAL by extending Q-MIX.
We also compared our framework to NDQ [45], a state-of-the-art method for learning communication, which uses thresholds to filter out unnecessary messages. Thus, all the methods we compared our framework to differed only in the type of communication mechanism: (i) Q-MIX can be seen as the base method without communication, (ii) Q-MIX + DIAL enables learning communication via a feedback mechanism, (iii) Q-MIX + TarMAC adds an attention mechanism to messages, and (iv) NDQ can be seen as an extension of Q-MIX + DIAL, which maximizes the mutual information between the sender's message and the recipient's policy.

C. TRAINING SETTINGS
We trained all the algorithms on the grid network and the Pasubio environment for 1.8 million and 3 million simulation steps, respectively. At the end of each episode, which lasted 90 steps or 360 simulation seconds, we ran a training iteration.
To evaluate the robustness of the algorithm, we ran 10 evaluation episodes with each agent selecting its actions greedily after every 200 training episodes.

D. RESULTS
To ensure a fair comparison, we used Q-MIX as the baseline centralized training algorithm for all the communication-based algorithms. The learning curves of the algorithms are illustrated in Fig. 7. The solid lines represent the hourly average queue length of an intersection for each scenario.
Queue length, which represents the number of vehicles stopped in the incoming lanes of the traffic signal, is a key metric for evaluating the performance of a traffic signal network. Evaluations were conducted after every 200 training episodes, and the results were averaged over 15 independent runs. Additionally, we compare our algorithm to some traditional traffic signal control approaches (Fixed time [73], Self-Organizing Traffic Lights (SOTL) [74], and Max pressure [75]). For the fixed-time algorithm, the duration of green phases was set to 30 seconds. In both network scenarios, QRC-TSC performed consistently better than the other frameworks. While Q-MIX uses a centralized training mechanism to factorize the action-values, the agents operate in a completely decentralized way during execution. Purely decentralized policies can hinder the performance of the system, since traffic flow can be highly dynamic at times. On the other hand, in DIAL, the agents communicate all the time, which can decrease performance, as communication is often unnecessary and acts as additional noise. The performance of Q-MIX + DIAL, Q-MIX + TarMAC, and Q-MIX was relatively similar, and all three significantly underperformed in the Pasubio scenario. The performance of NDQ and QRC-TSC was similar in the Pasubio network (Fig. 7a); however, NDQ performed poorly in the grid network (Fig. 7b). When considering average queue length, QRC-TSC consistently outperformed the other frameworks in both network scenarios and learned relatively stable policies, as can be seen in Fig. 7.

E. COMMUNICATION

1) Learned message representations
Within our framework, each agent learns to generate messages conditioned on its action-observation history. Thus, messages can be interpreted as compressed representations of the agent's inputs and its action intentions. The message space is analogous to the latent space in variational autoencoders, where each variable in the latent space is independent of the others. Thus, each bit in the message represents unique information from the sender's action-observation history. Fig. 8 shows an example t-SNE plot [76] of message embeddings learned by our algorithm, collected over 100 evaluation episodes. In the first three plots from left to right, the color gradients represent features of the agents' inputs averaged over the number of incoming lanes: mean speed ($\frac{1}{L}\sum_l s_l$), mean density ($\frac{1}{L}\sum_l n_l$), and mean queue length ($\frac{1}{L}\sum_l q_l$), respectively. These images show that the message distribution learned by the agents was correlated with their inputs, confirming that the agents learned to send meaningful information from their observations. The color labels in the fourth plot represent the actions taken by the agents, which indicates that the agents were able to effectively convey their action intentions through the message space. A key observation from this figure is that mean density and mean queue length are often correlated with each other, and hence the agent can eliminate redundant information.
2) Learned communication policies

In our framework, the agents are allowed to send 5-bit messages at each time step. Therefore, the communication policy can be seen as an action of selecting which bits of the message to send to each recipient. This makes visualizing the communication policy for each agent in a reduced space almost infeasible. To evaluate the effectiveness of the communication policy, we compared the communication policy learned by QRC-TSC with (1) a random policy, (2) full communication, and (3) no communication. During the evaluation stage, we ran 3 additional independent tests in which we manually altered the communication policies. Since this was done during the execution stage, altering the communication policies did not affect the training of QRC-TSC.
Fig. 9 illustrates the performance of the communication policies learned by our framework. We selected a few key metrics (queue length, wait time, and mean speed) from traffic signal control theory to showcase the effectiveness of the learned policies. All metrics were averaged over 100 test episodes and across five runs. The performance of QRC-TSC (in blue) was the best across all metrics in both network scenarios. By choosing which bits to send, the agents were effectively able to balance the performance between no communication and full communication. It is interesting to note that the performance in the Pasubio network with full communication is the worst, which can also be seen in Fig. 7a, where DIAL performs the worst among all the algorithms. This strongly indicates that constant communication can impede the performance of the system, likely due to redundancy in the input information (from incoming messages). This leads us to conclude that the agents only need limited information about the action-observation history of the other agents to take optimal actions.

F. HYPERPARAMETERS
We based our framework on the PyMARL library [77] and used its default parameters for all experiments. We experimented with different message sizes and found that a message length of 5 performed best. For the additional hyperparameters within the QRC-TSC framework, we conducted a coarse grid search to find the best settings. We set both $\beta_m$ and $\beta_c$ to $10^{-5}$ across all environments. We also tried linearly annealing the values of $\beta_m$ and $\beta_c$ over 50k iterations, but the change in overall performance was negligible. We trained our models on an NVIDIA GeForce RTX 2080 using experience sampled from 8 parallel environments.
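For reference, the reported settings can be collected in a single configuration block (the key names are our own shorthand, not actual PyMARL configuration keys):

```python
# Illustrative summary of the hyperparameters reported above; key names
# are our own shorthand, not actual PyMARL configuration keys.
QRC_TSC_CONFIG = {
    "message_length": 5,    # bits per message; best value in the sweep
    "beta_m": 1e-5,         # weight on the message (mutual-information) term
    "beta_c": 1e-5,         # weight on the communication-entropy term
    "n_parallel_envs": 8,   # parallel environments used to sample experience
}
```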

VI. CONCLUSION
In this paper, we propose a novel communication mechanism that enables agents to effectively learn (i) which part of the message is worth sending, (ii) when to send a message, and (iii) to whom the message should be sent. This can be especially beneficial for problems with communication constraints (e.g., limited bandwidth). Furthermore, our proposed framework is differentiable, which allows for end-to-end training. Its advantage is that the agents can act in a completely decentralized manner while exchanging the bits of information necessary to maintain coordination. The framework is versatile and could be extended to a large number of applications. We tested our framework, QRC-TSC, on the real-world problem of traffic signal control by building two different traffic signal network scenarios (a synthetic and a real-world network). We compared QRC-TSC with several state-of-the-art frameworks involving communication and demonstrated that it maintains the least congestion throughout the network while keeping the utilization of the communication channel within $\sim 47-65\%$. Some real-world problems have additional constraints, for example the cost of communication. Although this study did not address a constrained problem, we believe that QRC-TSC can be extended to include additional parameters, such as cost. One drawback of QRC-TSC is that the maximum message length must be set a priori; one possible solution is to allow multiple communication passes. Future work will address how to establish the maximum message length.

FIGURE 1: The highlighted circles represent the communication range of each traffic light, i.e., each traffic light can communicate with its immediate neighbors or within a range of 500 meters.

FIGURE 2: Prototype of a traffic signal network with two intersections. The highlighted zones on the incoming lanes of each traffic light represent the range within which the traffic light can access information about the vehicles.

FIGURE 3: An example of the phases available to an intersection in a 4 × 4 grid network from the SUMO simulator. The colored lines (red, yellow, and green) together indicate the phase of the traffic signal. The first phase (from the left) is an all-green phase, in which vehicles are allowed to go straight and/or make turns. Each agent controls the traffic signals by selecting one of these phases.
The message is then gated as $\bar{m}_{ij} = (m \odot c)_{ij}$ based on the communication action $c_{ij}$. The parameters of the communication network are also shared across agents. The mixing network combines the individual action-values of the agents, $Q_i(\tau_i, a_i, \bar{m}_{ij}; \theta_{agent})$, to compute the joint action-value function $Q_{total}$. The weights of the mixing network are generated by a set of hypernetworks conditioned on the state $s$. We use DIAL [39] as the base communication framework and improve on it in the following ways: 1) we use variational inference to maximize the mutual information between the sent messages (including the communication action) and the recipient's action; 2) we introduce an entropy regularization term for the communication policies, enabling controlled exploration in the communication action space; and 3) the communication policies are differentiable, allowing for end-to-end training.
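The hypernetwork-conditioned mixing step can be sketched as follows. This is a simplified, NumPy-only illustration of a QMIX-style monotonic mixer; the layer sizes and the ReLU nonlinearity are our simplifications, not the exact architecture used in the paper:

```python
import numpy as np

def mix(q_values, state, W1, b1, W2, b2):
    """Combine per-agent action-values into Q_total. The mixing weights are
    produced by hypernetworks conditioned on the global state; taking their
    absolute value keeps Q_total monotonically non-decreasing in each Q_i."""
    n = len(q_values)
    w1 = np.abs(state @ W1).reshape(n, -1)            # (n_agents, hidden) weights
    h = np.maximum(q_values @ w1 + state @ b1, 0.0)   # hidden layer (ReLU)
    w2 = np.abs(state @ W2)                           # (hidden,) output weights
    return float(h @ w2 + state @ b2)                 # scalar Q_total
```

The absolute value on the hypernetwork outputs is what enforces monotonicity: raising any individual $Q_i$ can never decrease $Q_{total}$, so the decentralized argmax of each agent remains consistent with the joint argmax.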

FIGURE 4: Example of the proposed communication framework. Agent A generates a message $m_A$ and communication actions $c_{AB}$ and $c_{AC}$ for agents B and C, respectively. The message is then gated based on the communication action and sent to the respective agents.

FIGURE 5: Architecture of QRC-TSC with two agents. Each agent uses a communication network (shown in the communication block) in addition to the agent network. The communication network takes the action-observation history $(o_t^i, a_{t-1}^i)$ of agent $i$ as input and outputs both the message $m_t^{ij}$ and a communication action $c_t^{ij}$ for the recipient $j$ at time $t$.
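The communication network's forward pass can be sketched as below. The weight matrices `W_m` and `W_c` are illustrative placeholders, and thresholding the gate logits stands in for the differentiable discrete sampling used during training:

```python
import numpy as np

def comm_forward(history, W_m, W_c, n_recipients):
    """Sketch of the communication network. From an encoded action-observation
    history vector it produces a 5-dimensional message and one binary
    communication action (gate) per message bit and recipient."""
    m = np.tanh(history @ W_m)                          # message m_t^{ij}, length 5
    logits = (history @ W_c).reshape(n_recipients, -1)  # gate logits per recipient
    c = {j: (logits[j] > 0).astype(float) for j in range(n_recipients)}
    return m, c
```

Each recipient $j$ thus receives its own gate vector $c_t^{ij}$, so the same underlying message can be pruned differently for different neighbors.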

FIGURE 6: (a) and (b) show the flow scenarios for the 4 × 4 grid network. (c) shows the flow in the Pasubio network, and (d) shows the hourly flow distribution for both networks. The dotted lines represent flow from the opposite direction whenever bidirectional flows are simulated. The red and blue lines represent the outer and inner network flows, respectively.

FIGURE 7: Average queue length throughout training (lower is better). The x-axis represents simulation steps (in millions). The solid lines show the mean over 5 runs, and the shaded regions represent 95% confidence intervals.

FIGURE 8: Message representation: a t-SNE plot of the message representations learned by an agent in the 4 × 4 grid network. The color scale in the first three plots, starting from the left, represents a feature (averaged across all incoming lanes) of the observations received by the agent. The color scale in the final plot represents the actions taken by the agent. The agent learns to embed messages in the latent space based on its inputs and action intentions.
FIGURE 9: Comparison of the performance of communication policies, averaged across 100 test episodes, for (a) the 4 × 4 grid network and (b) the Pasubio network. QRC-TSC (in blue) represents the performance of the communication policies learned by our framework.
ROHIT BOKADE received his Master's degree in Operations Research from Northeastern University, where he is currently pursuing a Ph.D. degree in Industrial Engineering. His current research interests involve exploring the potential of advanced machine learning techniques, such as reinforcement learning, deep learning, and optimization, to improve industrial engineering practice and solve real-world problems.

XIAONING JIN (Member, IEEE) received the Ph.D. degree in Industrial and Systems Engineering from the University of Michigan, Ann Arbor, MI, USA, in 2012. She is currently an Assistant Professor of Mechanical and Industrial Engineering with the College of Engineering at Northeastern University, Boston, USA. She is the recipient of a National Science Foundation CAREER Award in 2020. She has over 50 papers in fully refereed international journals and conferences. Her research interests include predictive analytics and decision making, data analytics, fault diagnostics and prognostics, and artificial intelligence in various engineering applications. Prof. Jin currently serves as the Vice-Chair of the Manufacturing Systems Technical Committee of the ASME Manufacturing Science and Engineering Division. She received the 2016 Outstanding Young Manufacturing Engineer Award from the Society of Manufacturing Engineers (SME).

CHRISTOPHER AMATO is an Assistant Professor at Northeastern University, where he leads the Lab for Learning and Planning in Robotics. Before joining Northeastern, Dr. Amato was a Research Scientist at Aptima, Inc.
and a Postdoc and Research Scientist at MIT, as well as an Assistant Professor at the University of New Hampshire. He has published many papers in leading artificial intelligence, machine learning, and robotics conferences (including winning a best paper prize at AAMAS-14 and being nominated for the best paper at RSS-15, AAAI-19, AAMAS-21, and MRS-21). He has also won several awards, such as Amazon Research Awards and an NSF CAREER Award. His research focuses on reinforcement learning and planning in partially observable and multi-agent/multi-robot systems.

A. ALGORITHM FOR QRC-TSC
Algorithm 1 Training procedure for QRC-TSC
1: Initialize the agent network with parameters $\theta$, the target network with parameters $\theta^-$, the replay buffer $D$ with capacity $N_D$, and the batch size $N_B$
2: for each training episode e do
3:
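The recoverable steps of Algorithm 1 (episode rollout, replay buffer with oldest-episode replacement, uniform batch sampling, joint loss minimization) suggest a standard episodic training loop. A runnable skeleton, where `env`, `agents`, and `learn_step` are our stand-ins for the paper's components:

```python
import random
from collections import deque

def train(env, agents, n_episodes, buffer_size, batch_size, learn_step):
    """Skeleton of the QRC-TSC training loop reconstructed from the
    fragments of Algorithm 1; interfaces are illustrative, not the
    paper's implementation."""
    buffer = deque(maxlen=buffer_size)  # oldest episode replaced when full
    for _ in range(n_episodes):
        episode, obs, done = [], env.reset(), False
        while not done:
            # Execute the joint action in the environment
            actions = [agent(o) for agent, o in zip(agents, obs)]
            next_obs, reward, done = env.step(actions)
            episode.append((obs, actions, reward, next_obs))
            obs = next_obs
        buffer.append(episode)
        if len(buffer) >= batch_size:
            batch = random.sample(list(buffer), batch_size)  # ~ Uniform(D)
            # Minimize the total loss L = L_TD + communication loss L_C
            learn_step(batch)
```

The `deque(maxlen=...)` reproduces the buffer rule in the fragments: appending when `|D| >= N_D` silently discards the oldest stored episode.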

TABLE 1: Performance results of various algorithms on the 4 × 4 grid and Pasubio networks.
Execute the joint action $a_t = \{a_t^1, \ldots, a_t^n\}$ in the environment
12: Obtain the global reward $r_{t+1}$, the next observation $o_{t+1}^i$ for each agent $i$, and the next global state $s_{t+1}$
Store the episode in the buffer $D$ such that the oldest episode is replaced if $|D| \geq N_D$
Sample a batch of $N_B$ episodes $\sim$ Uniform($D$)
17: Calculate the communication loss $L_C(\theta_r, \theta_c)$ according to (7) and the TD loss $L_{TD}(\theta)$ as in (9), and set the total loss as in (8)
18: Update $\theta$ by minimizing the total loss $L(\theta)$