I. Introduction
Many real-life applications involve interaction among multiple intelligent systems, such as collaborative robot teams [1], internet-of-things devices [2], agents in cooperative or competitive games [3], and traffic management devices [4]. Reinforcement learning (RL) [5] is an effective tool for optimizing the behavior of intelligent agents in such applications based on reward signals obtained through interaction with the environment. Traditional RL algorithms, such as Q-learning [6] and policy gradient [3], can be scaled to multiple agents by applying them to each agent independently. However, independent learning performs poorly because, from the perspective of a single agent, the environment is non-stationary due to the actions of the other agents [3], [7].

Multi-agent reinforcement learning (MARL) [6] mitigates these challenges, for example by conditioning the Q function on the policy parameters of the other agents [8] or by relying on importance sampling [9]. Yang et al. [10] propose a mean-field Q-learning algorithm, whose Q functions are defined only over an agent's own action and the actions of its neighbors, rather than the actions of all agents. The multi-agent deep deterministic policy gradient (MADDPG) algorithm [3] extends the deep deterministic policy gradient (DDPG) algorithm [11] to a multi-agent setting. MADDPG uses a centralized Q function that depends on the observations and actions of all agents, together with local control policies, each defined over the observation and action of a single agent.

A key challenge faced by MARL approaches is that the computational complexity of training scales with the number of agents in the environment. For large-scale MARL applications, the traditional centralized training mechanism, which runs on a single compute node, could thus be cost-prohibitive.
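As a sketch of the centralized-critic, decentralized-actor structure described above, the per-agent policy update in MADDPG can be written roughly as follows, with notation adapted from [3] rather than taken from the present paper: $o_i$, $a_i$, and $\theta_i$ denote agent $i$'s observation, action, and policy parameters, $\mathbf{x} = (o_1, \dots, o_N)$ collects the observations of all $N$ agents, and $\mathcal{D}$ is a replay buffer of past transitions.

\[
\nabla_{\theta_i} J(\mu_i) \approx \mathbb{E}_{\mathbf{x}, \mathbf{a} \sim \mathcal{D}} \Big[ \nabla_{\theta_i} \mu_i(o_i) \, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \dots, a_N) \big|_{a_i = \mu_i(o_i)} \Big]
\]

The critic $Q_i^{\mu}$ is trained with access to the observations and actions of all agents, whereas each deterministic policy $\mu_i$ acts on local information only; it is this centralized critic that causes the training cost to grow with the number of agents.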