The paper develops QD-learning, a distributed version of reinforcement Q-learning, for multi-agent Markov decision processes (MDPs); the agents have no prior information on the global state transition statistics or on the local agent cost statistics. The network agents minimize a network-averaged infinite-horizon discounted cost by local processing and by collaborating through mutual information exchange over a sparse (possibly stochastic) communication network. The agents respond differently (depending on their instantaneous one-stage random costs) to a global controlled state and the control actions of a remote controller. When each agent is aware only of its local online cost data and the inter-agent communication network is weakly connected, we prove that QD-learning, a consensus + innovations algorithm with mixed time-scale stochastic dynamics, converges asymptotically almost surely to the desired value function and to the optimal stationary control policy at each network agent.
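To make the "consensus + innovations" structure concrete, the following is a minimal sketch of a generic local update at agent n for a state-action pair (i, u) with observed successor state j; the symbols Ω_n(t) (the communication neighborhood of agent n at time t), c_n (the local one-stage cost), γ (the discount factor), and the weight sequences α_t, β_t are illustrative, and the paper's exact recursion and weight conditions may differ:

% Sketch of a consensus + innovations Q-update (assumed generic form,
% not quoted from the paper):
\[
Q^{n}_{t+1}(i,u) \;=\; Q^{n}_{t}(i,u)
  \;-\; \beta_t \sum_{l \in \Omega_n(t)} \bigl( Q^{n}_{t}(i,u) - Q^{l}_{t}(i,u) \bigr)
  \;+\; \alpha_t \Bigl( c_n(i,u) + \gamma \min_{v} Q^{n}_{t}(j,v) - Q^{n}_{t}(i,u) \Bigr)
\]

The β_t term pulls neighboring agents' Q-estimates toward agreement (consensus), while the α_t term is the usual temporal-difference correction driven by new local cost observations (innovations); the "mixed time-scale" aspect refers to α_t and β_t decaying at different rates, with the consensus weight asymptotically dominating the innovation weight.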