A Cooperative Multi-Agent Reinforcement Learning Method Based on Coordination Degree

Multi-agent reinforcement learning (MARL) has become a prevalent method for solving cooperative problems owing to its tractable implementation and task distribution. The goal of the MARL algorithms for fully cooperative scenarios is to obtain the optimal joint strategy that maximizes the expected common cumulative reward for all agents. However, to date, the analysis of MARL dynamics has focused on repeated games with few agents and actions. To this end, we propose a cooperative MARL algorithm based on the coordination degree (CMARL-CD) and analyze its dynamics in more general cases in which repeated games with more agents and actions are considered. Theoretical analysis shows that if the component action of every optimal joint action is unique, all optimal joint actions are asymptotically stable critical points. The CMARL-CD algorithm realizes coordination among agents without the need to estimate the global Q-value function. Each agent estimates the coordination degree of its own action, which represents the potential of being the optimal action. The efficacy of the CMARL-CD algorithm is studied through repeated games and stochastic games.


I. INTRODUCTION
Reinforcement learning [1] (RL) is a prevalent method for optimizing an agent's behavior so that the best response from the environment can be obtained. Typically, RL is used to solve a Markov decision process (MDP). In an MDP, the state transition depends on a single agent's action. An agent can perceive states, execute an action, and receive a numerical reward from the environment. The goal is to obtain the maximum expected cumulative reward. However, some problems in the real world are naturally modeled as multi-agent systems (MASs), such as urban traffic signal control for multiple intersections [2], multiple automatic guided vehicle path planning [3], and mobile traffic for wireless networks [4]. In multi-agent settings, the state transition is determined by joint actions. The Markov property does not hold for any single agent in such settings, which is one of the major concerns when designing new multi-agent reinforcement learning (MARL) algorithms [5], [6].
The goal of MARL is determined by the type of task. The goal in a general-sum game is to converge to some type of equilibrium [7] or socially optimal outcomes [8], [9]. This type of learning is known as equilibrium-based MARL (EMARL). The goal in a cooperative game is to maximize the expected common cumulative reward of all agents [10]. We name this type of learning cooperative MARL (CMARL). The goal in a zero-sum game is to maximize the expected reward of each agent [11], [12].
Non-stationarity and the rapidly growing joint action space are the two main challenges for MARL. First, for an independent learner, the individual Q-function of each agent is influenced by the other agents' actions. The analysis of independent learners has focused on repeated games with few agents and actions [13]-[18]. Second, to alleviate the non-stationarity problem, centralized learning is employed to estimate the global Q-function. In this framework, the joint action space grows exponentially with the number of agents, which limits the scalability of joint-action learners (JALs).
To this end, we propose a new CMARL algorithm known as CMARL based on the coordination degree (CMARL-CD). The CMARL-CD algorithm does not need to learn the global Q-value function of the joint actions. Each agent records the maximal reward obtained in history and updates the coordination degree of its own action during the learning stage. The main contribution is the analysis of the CMARL-CD model in repeated games with more than two agents and actions. It has been proven that all optimal joint actions are asymptotically stable critical points if the component action of every optimal joint action is unique.
The remainder of this paper is organized as follows. Section II briefly reviews the different types of MARL algorithms. Section III presents the preliminaries. Section IV elaborates on the CMARL-CD algorithm, and provides theoretical analysis in repeated games. Section V studies the efficacy of CMARL-CD in two stochastic games: the distributed sensor network (DSN) task and the blood battlefield task. Finally, Section VI draws the conclusions.

II. RELATED WORK
Two characteristics of MARL algorithms are considered in this section. The first is whether the MARL algorithm belongs to JALs or independent learners. A JAL requires estimating the global Q-function of joint actions, while an independent learner requires each agent to estimate the individual Q-function of its own actions. The second characteristic, which is used to categorize the MARL algorithms reviewed in this paper, is whether the MARL algorithm belongs to CMARL or EMARL.
CMARL aims to optimize some performance index in a cooperative task. FMRQ [19], EAQR [20], and WRFMR [21] use the frequency of receiving the maximum reward. SOoN [22] utilizes the farsighted frequency together with the frequency used in FMRQ and EAQR. LA-OCA [23] is a learning automata-based algorithm that introduces a variable to indicate whether the maximum reward is achieved. LA-OCA has demonstrated excellent performance in some cooperative tasks. All of the above CMARL algorithms are independent learners.
Recently, deep learning has been incorporated into CMARL [24], [25]. One of the prevalent paradigms is centralized training with decentralized execution (CTDE), which attenuates both the problems of non-stationarity and the exponentially growing joint action space. MADDPG [26] uses decentralized critic networks for each agent, but the selected joint action is still required in centralized learning. COMA [27] uses a central critic network to estimate the global Q-function and uses distributed actor networks to select actions for each agent. COMA requires on-policy learning, which could be inefficient. To this end, a variety of Q-function decomposition methods have sprung up. Value decomposition networks (VDNs) [28] approximate the Q-function of joint actions by the sum of Q-functions of individual actions. Furthermore, QMIX [29] uses the mixing network to realize the individual-global-max (IGM) principle and account for the influence of the global state. To overcome the restriction on the structure of the critic used in QMIX, Qatten [30], QTRAN [31], and Q-value path decomposition (QPD) [32] have been proposed. Qatten uses multi-head attention to formulate the decomposition with theoretical foundations. QTRAN employs a gap function to satisfy the IGM and uses a fully centralized critic to guide the training of the individual Q-functions. QPD decomposes the Q-value function of joint actions along the state transition trajectories for credit assignments among agents and uses integrated gradients to approximate the Q-values.
The goal of most EMARL algorithms is to converge to a Nash equilibrium (NE). Some EMARL algorithms attempt to accomplish this goal by employing the gradient method. These algorithms include but are not limited to infinitesimal gradient ascent (IGA) [33], WoLF-PHC [34], WPL [35], and PGA-APP [36]. Other EMARL algorithms have their own strategy updating rules. Nash-Q [37] searches for an NE in each state using quadratic programming. LRI [38], [39] is a learning automata-based algorithm. It has been proven that LRI converges to a pure NE in general-sum repeated games. Of the aforementioned EMARL algorithms, Nash-Q is a JAL, and the other EMARL algorithms are independent learners.
CMARL-CD is distinguished from the aforementioned cooperative independent learners as follows. First, unlike FMRQ, EAQR, and WRFMR, CMARL-CD does not rely on frequency information, whose estimation error can degrade performance. Second, unlike LA-OCA, the analysis of the dynamics of CMARL-CD takes the normalization of the action probabilities into account.

III. PRELIMINARIES
A. STOCHASTIC GAMES
In a stochastic game [40], [41], the state transition depends on the joint action. Let $S$ denote the set of valid states, $A_i(s)$ the action set of agent $i$ at state $s \in S$ for $i = 1, 2, \ldots, n$, and $A(s) = A_1(s) \times A_2(s) \times \cdots \times A_n(s)$ the set of joint actions at state $s \in S$. The state transition function $T: S \times A(s) \times S \to [0, 1]$ gives the probability of reaching state $s'$ when joint action $a \in A(s)$ is performed at state $s$, and the reward function $r_i: S \times A(s) \to \mathbb{R}$ determines the immediate reward received by agent $i$. In a cooperative stochastic game, the global immediate reward is $r = \sum_{i=1}^{n} r_i$, and the goal is to maximize the global discounted cumulative reward

$$R = \sum_{k=0}^{K} \gamma^k r(k),$$

where $\gamma \in (0, 1)$ is the discount factor and $K$ is the ending time of an episode.
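For illustration, a minimal Python sketch of the objective above is given below; it computes the global discounted cumulative reward from per-agent immediate rewards along one episode. The variable names and episode structure are illustrative assumptions rather than part of the original formulation.

```python
# Minimal sketch (assumed structure): compute the global discounted
# cumulative reward R = sum_k gamma^k * r(k), where r(k) is the sum of
# the n agents' immediate rewards at step k.

def global_discounted_return(per_agent_rewards, gamma=0.9):
    """per_agent_rewards: list over time steps, each an iterable of the
    n agents' immediate rewards at that step."""
    ret = 0.0
    for k, rewards_k in enumerate(per_agent_rewards):
        r_k = sum(rewards_k)          # global immediate reward r(k)
        ret += (gamma ** k) * r_k     # discounting by gamma^k
    return ret

# Example: two agents, a three-step episode.
print(global_discounted_return([(1, 0), (0, 2), (3, 3)], gamma=0.9))
```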

B. REPEATED GAMES
A repeated game [42], [43] is a one-stage game played repeatedly by a finite number of agents with finite actions. This study focuses on fully cooperative games. The payoff matrix determines the reward received by each agent. The payoff matrix of a two-agent two-action cooperative game is shown in Fig. 1. If agent 1 chooses the first action (the first row) and agent 2 chooses the second action (the second column), both agents obtain a reward of 2. A strategy is a probability distribution over action selection. Under a pure strategy, some action is always selected. Under a mixed strategy, each action is assigned a probability.
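Since Fig. 1 is not reproduced here, the following minimal Python sketch shows a hypothetical payoff matrix of that shape; only the entry for (row 1, column 2) is taken from the text, and the remaining values are illustrative assumptions.

```python
import numpy as np

# Hypothetical 2x2 cooperative payoff matrix (rows: agent 1's actions,
# columns: agent 2's actions). Only payoff[0, 1] = 2 is stated in the text;
# the other entries are made up for illustration.
payoff = np.array([[1.0, 2.0],
                   [0.0, 1.0]])

a1, a2 = 0, 1                # agent 1 picks row 1, agent 2 picks column 2
reward = payoff[a1, a2]      # both agents receive this common reward
print(reward)                # -> 2.0
```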

IV. COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING BASED ON COORDINATION DEGREE
A. FORMULATION OF THE ALGORITHM
To facilitate cooperation among agents, we propose the concept of coordination degree to evaluate the optimality of an action. In a fully cooperative repeated game, if the maximum global reward is obtained, the coordination degree of the selected action of each agent is increased, while the coordination degrees of the other actions are decreased. Otherwise, no updates are required. After the learning stage, each agent selects the action with the maximum coordination degree.
The pseudocode of CMARL-CD is shown in Algorithm 1.

[Algorithm 1 (fragment): $r^i_{\max} \leftarrow r^i$; update $c^i$ according to (2)-(4); update $p^i$ according to (5); end if; end for each agent; repeat until the strategy of each agent becomes pure; return $c^i$ for each agent.]

Each agent $i$ updates its coordination degree vector $c^i = (c^i_1, \ldots, c^i_{|A^i|})$ according to (2) and (3), where $a^i_j$ denotes the $j$-th action of agent $i$, $c^i_j$ denotes the coordination degree of $a^i_j$, $I^i \in \{0, 1\}$ is an indicator variable, and $\delta \in (0, 1)$ is the learning rate. The value of $I^i$ is determined by (4), where $r^i(k)$ is the global immediate reward received by agent $i$ in the $k$-th game, and $r^i_{\max}(k)$ is the maximal global immediate reward received by agent $i$ up to the $k$-th game. The learning rate $\delta$ should be set to a small positive value.
The strategy $p^i = (p^i_1, \ldots, p^i_{|A^i|})$ is updated according to (5), where $p^i_j$ is the probability of selecting the $j$-th action of agent $i$, and $T$ is the temperature parameter that balances exploration and exploitation.
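A minimal Python sketch of one CMARL-CD update for a single agent in a repeated game follows. Since equations (2)-(5) are not reproduced in the text, the exact forms below (raising the selected action's coordination degree, lowering the others, and a Boltzmann/softmax strategy over coordination degrees with temperature $T$) are assumptions consistent with the description above, not the paper's definitive equations.

```python
import numpy as np

def update_agent(c, p, chosen, r, r_max, delta=0.04, T=1.0):
    """One learning step for agent i (assumed forms of (2)-(5)).
    c: coordination degrees, p: action probabilities, chosen: index of the
    selected action, r: global immediate reward in this game, r_max: maximal
    global immediate reward observed so far."""
    indicator = 1 if r >= r_max else 0          # I^i = 1 only when the best-known reward is achieved
    if indicator:
        for j in range(len(c)):
            if j == chosen:
                c[j] += delta * (1.0 - c[j])    # increase the selected action's coordination degree
            else:
                c[j] -= delta * c[j]            # decrease the other actions' coordination degrees
    # Boltzmann (softmax) strategy over coordination degrees (assumed form of (5)).
    z = np.exp(np.asarray(c) / T)
    p[:] = z / z.sum()
    return max(r, r_max)                        # updated r^i_max

# Example: three actions; action 1 achieved the best-known global reward.
c = [0.5, 0.5, 0.5]
p = np.ones(3) / 3
r_max = update_agent(c, p, chosen=1, r=10.0, r_max=10.0)
print(c, p, r_max)
```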

B. ANALYSIS OF CMARL-CD IN REPEATED GAMES
The updating rule of the coordination degrees can be written in the general form of (6), where $P(k) = (p^1(k), p^2(k), \ldots, p^n(k))$ is the joint strategy, $a(k)$ is the joint action, $r(k)$ is the global immediate reward, and $\Delta c^i_j(P(k), a(k), r(k))$ represents the updating term of (2) and (3). According to Theorem 3.1 in [38], if $\delta$ is infinitesimally small, the CMARL-CD model in repeated games can be represented by the ordinary differential equation (7); a sketch of one plausible form of (7) is given below. Theorem 1 below presents the characteristics of CMARL-CD in repeated games.
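Since (6) and (7) are not reproduced in the text, the following LaTeX sketch shows one plausible form of the small-$\delta$ approximation, following the standard weak-convergence argument behind Theorem 3.1 in [38]; the exact expressions in the paper may differ.

```latex
% Sketch (assumed form). Discrete update (cf. (6)):
%   c^i_j(k+1) = c^i_j(k) + \delta\, \Delta c^i_j\bigl(P(k), a(k), r(k)\bigr).
% As \delta \to 0, the learning process is approximated by the ODE (cf. (7)):
\begin{equation*}
  \frac{\mathrm{d} c^i_j}{\mathrm{d} t}
  = \mathbb{E}\!\left[\Delta c^i_j\bigl(P(k), a(k), r(k)\bigr) \,\middle|\, c\right],
  \qquad i = 1,\ldots,n,\; j = 1,\ldots,|A^i|.
\end{equation*}
```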
Theorem 1: Each agent uses the CMARL-CD algorithm to play a cooperative repeated game with n (n ≥ 2) agents and m (m ≥ 2) optimal joint actions. If the component action of every optimal joint action is unique, all the optimal joint actions are asymptotically stable critical points.
Proof: (i) Let $a_{ij}$ be agent $i$'s component action in the $j$-th optimal joint action, $p_{ij}$ be the probability of selecting $a_{ij}$, and $c_{ij}$ be the coordination degree of $a_{ij}$, for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. Because the component action of every optimal joint action is unique, (7) can be written as (8). The coordination degree of any component action that does not constitute an optimal joint action decreases over time, so by (5) the probabilities of such actions decrease to zero over time. According to (5) and (8), we obtain (9). Any critical point of (9) must satisfy (10); it is clear that all joint actions are critical points. Note that $\sum_{j=1}^{m} p_{ij} = 1$ for $i = 1, 2, \ldots, n$ once the probabilities of the actions that do not constitute any optimal joint action have decreased to zero. Thus, we perform the transformation (11), which yields (12). The stability of each of the $m$ optimal joint actions is determined by the eigenvalues of the Jacobian matrix $J \in \mathbb{R}^{(m-1)n \times (m-1)n}$ given in (13). This shows that each eigenvalue of $J$ is $-\frac{2}{T}$. Thus, all the optimal joint actions are asymptotically stable critical points of (9).

C. EMPIRICAL STUDIES IN REPEATED GAMES
The efficacy of CMARL-CD in repeated games with n agents and m actions is investigated by empirical studies. Two situations are considered.
Case 1: The component action of every optimal joint action is unique.
Case 2: The component action of at least one optimal joint action is not unique.
The simulation is performed for 50 runs. In each run, a game with a random payoff matrix is played repeatedly. A run is counted as successful if the strategy of each agent becomes pure and the resulting joint action obtains the maximum reward. Learning ends when, for each agent, the probability of selecting some action exceeds 0.999. The temperature $T$ is 1.0, and $\delta$ is 0.04.
In case 1, each game contains $m$ optimal joint actions. Tab. 1 shows that the success rate is 98% when $n = 7$, $m = 4$, and 100% otherwise. The failure to reach 100% when $n = 7$, $m = 4$ occurs because the maximal global reward is never obtained during the learning stage, so the joint strategy converges to a local optimum.
In case 2, each game contains $[0.1m^n]$ optimal joint actions, where $[\cdot]$ rounds to the nearest integer. Tab. 2 shows that a success rate of 100% is obtained in all games.
The empirical results show that the CMARL-CD algorithm can converge to one of the optimal joint actions in cooperative repeated games.

D. CMARL-CD FOR STOCHASTIC GAMES
CMARL-CD can be applied to stochastic games. The framework is illustrated in Fig. 2. Each agent receives the global state as input, evaluates the coordination degree of each of its actions, and executes an action, as indicated by the arrows with solid lines. Each agent independently updates its coordination degrees using each trajectory, as indicated by the arrows with dotted lines. The pseudocode is presented in Algorithm 2. The coordination degree vector of agent $i$, $c^i(s) = (c^i_1(s), \ldots, c^i_{|A^i|}(s))$, is updated according to (14) and (15), where $c^i_j(s)$ denotes the coordination degree of $a^i_j$ at state $s$, and $I^i(s) \in \{0, 1\}$ is an indicator variable. The value of $I^i(s)$ is determined by whether $R^i(s)$, the global cumulative reward obtained by agent $i$ from state $s$, reaches $R^i_{\max}(s)$, the maximal global cumulative reward obtained from state $s$ in history. To maintain exploration, the value of $c^i_j(s)$ is confined to $[c_{\min}, c_{\max}]$, where $c_{\min}$ is positive.
The probability of selecting action $a^i_j$ at state $s$ is updated analogously to (5), using the coordination degrees at state $s$.
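A minimal Python sketch of the per-state update for stochastic games follows. Since (14)-(15) and the indicator are not reproduced in the text, the concrete forms below (raising/lowering coordination degrees as in the repeated-game case, clipping to $[c_{\min}, c_{\max}]$, and a per-state Boltzmann strategy) are assumptions consistent with the description above.

```python
import numpy as np

def update_state(c_s, chosen, R_s, R_max_s, delta=0.05,
                 c_min=0.5, c_max=3.0, T=1.0):
    """c_s: coordination degrees at state s, chosen: selected action index,
    R_s: global cumulative reward obtained from state s on this trajectory,
    R_max_s: maximal global cumulative reward from state s in history."""
    if R_s >= R_max_s:                               # indicator I^i(s) = 1
        for j in range(len(c_s)):
            if j == chosen:
                c_s[j] += delta * (c_max - c_s[j])   # increase the selected action
            else:
                c_s[j] -= delta * (c_s[j] - c_min)   # decrease the other actions
    # Keep coordination degrees within [c_min, c_max] to maintain exploration.
    c_s[:] = np.clip(c_s, c_min, c_max)
    # Per-state Boltzmann strategy (assumed analogue of (5)).
    z = np.exp(np.asarray(c_s) / T)
    return z / z.sum(), max(R_s, R_max_s)

c_s = np.full(5, 1.5)                                # initial coordination degrees at state s
p_s, R_max_s = update_state(c_s, chosen=2, R_s=42.0, R_max_s=40.0)
print(p_s, R_max_s)
```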

V. EMPIRICAL STUDIES FOR STOCHASTIC GAMES
The efficacy of CMARL-CD in stochastic games is studied empirically through the DSN task and the blood battlefield task. The states and actions of both tasks take discrete values. The tasks differ in that the first task is fully cooperative, while the second task involves competition between two teams of cooperative agents. CMARL-CD is compared with LA-OCA, VDN, and QMIX to demonstrate its efficacy. For fairness, global state information is used during the learning stage for all algorithms.

A. TASK 1: DISTRIBUTED SENSOR NETWORK
The DSN task [44] requires the sensors (agents) to cooperate to capture the targets. As shown in Fig. 3, 12 sensors are distributed within a grid, and the two targets walk randomly within the six cells. Both the number and positions of the targets can be sensed by all sensors. The sensors must cooperate at each of the 42 states (excluding one absorbing state). Each sensor can choose to do nothing or to focus on one of its adjacent cells. The joint action space contains 291,600 elements. All sensors execute their actions simultaneously; the targets then move sequentially at each time step. If a target moves to a cell that is not empty, it fails to move. If three or four sensors focus on a target, a hit is made. If a target receives three hits, it is captured. A captured target does not occupy any cell. An episode lasts 40 time steps unless both targets are captured.
The reward assignment obeys the following rules. If a target is captured, a reward of 10 is obtained by each sensor involved in the capture. If the target is captured by four sensors, only the sensors with the top three indices are rewarded. Focusing on a cell is rewarded with −1, and doing nothing is rewarded with 0. Each sensor shares its immediate reward with the other agents at each step. The optimal joint strategy obtains a cumulative reward of 42 in three steps.
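A minimal Python sketch of the per-step reward assignment described above is given below; the data structures (focus lists, hit counters) and the exact tie-breaking among sensor indices are illustrative assumptions.

```python
# Minimal sketch of the per-step DSN reward assignment, following the rules above.

def step_rewards(focus, hits, n_sensors=12):
    """focus: dict mapping target id -> sorted list of indices of sensors
    focusing on that target this step; hits: dict target id -> hits so far."""
    rewards = [0.0] * n_sensors
    for i in range(n_sensors):
        if any(i in sensors for sensors in focus.values()):
            rewards[i] -= 1.0                  # cost of focusing on a cell; doing nothing costs 0
    for target, sensors in focus.items():
        if len(sensors) in (3, 4):             # three or four focusing sensors make a hit
            hits[target] += 1
            if hits[target] == 3:              # third hit: the target is captured
                for i in sensors[:3]:          # only three of the involved sensors are rewarded
                    rewards[i] += 10.0
    global_reward = sum(rewards)               # immediate rewards are shared among all agents
    return rewards, global_reward

rewards, g = step_rewards({0: [1, 2, 3]}, hits={0: 2})
print(rewards, g)
```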
The simulation is performed for 50 runs, each of which contains L episodes for learning and 50,000 episodes for evaluation. In the learning stage, all agents' strategies are updated. In the evaluation stage, all agents' strategies are fixed.
The CMARL-CD algorithm uses the parameters $\delta = 0.05$, $T = 1.0$, $c_{\min} = 0.5$, $c_{\max} = 3.0$, $\gamma = 0.9$, and initial coordination degrees $c^i_j(s) = 1.5$ for all state-action pairs. To increase exploration near local optima, the CMARL-CD algorithm resets the coordination degrees of all actions at state $s$ to 1.5 and sets $\delta$ to 0.5 when the cumulative reward obtained from state $s$ is larger than the maximum cumulative reward in history. The LA-OCA algorithm uses the parameters in [23].
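The exploration-reset heuristic just described can be sketched as follows; the function and variable names are illustrative assumptions, not part of the original pseudocode.

```python
# Minimal sketch: when the cumulative reward obtained from state s exceeds the
# best value seen so far, reset that state's coordination degrees to their
# initial value and temporarily enlarge the learning rate.

def maybe_reset(c_s, R_s, R_max_s, init_value=1.5, boosted_delta=0.5, delta=0.05):
    if R_s > R_max_s:                        # a new best cumulative reward from state s
        c_s[:] = [init_value] * len(c_s)     # reset coordination degrees at state s
        return boosted_delta, R_s            # temporarily enlarged learning rate, new best
    return delta, R_max_s                    # otherwise keep the usual learning rate

c_s = [1.1, 2.3, 0.8]
print(maybe_reset(c_s, R_s=42.0, R_max_s=40.0), c_s)
```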
The parameter settings for QMIX and VDN are as follows. Each agent network is an MLP with one hidden layer of 49 neurons. The size of the replay buffer is 40,000, and the size of each batch is 400. Parameter updating begins after one batch of tuples is available. The estimation networks are updated every 200 time steps using the Adam optimizer with an initial learning rate of 0.001. The target networks are cloned from the estimation networks every 2,000 time steps. The $\varepsilon$-greedy policy is used to select actions during the learning stage. The exploration rate $\varepsilon$ is annealed from $\varepsilon_{\mathrm{ini}} = 1.0$ to $\varepsilon_{\mathrm{fin}} = 0$ as the number of elapsed episodes $n$ increases. For QMIX, the mixing network contains one hidden layer of 70 neurons with exponential linear units (ELUs). The success rate, cumulative reward, and number of steps are selected as performance indices. If a cumulative reward of 42 is obtained within three steps in an evaluation episode, the episode is counted as a success. The success rate in a run is defined as the number of successful evaluation episodes divided by the total number of evaluation episodes (50,000).
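The exact annealing schedule for $\varepsilon$ is not reproduced in the text; the sketch below assumes linear annealing over the L learning episodes purely for illustration.

```python
# Minimal sketch (assumed linear schedule): anneal epsilon from eps_ini to
# eps_fin over the L learning episodes.

def epsilon(n, L, eps_ini=1.0, eps_fin=0.0):
    """n: number of elapsed learning episodes, L: total learning episodes."""
    frac = min(n / L, 1.0)
    return eps_ini + (eps_fin - eps_ini) * frac

print(epsilon(50_000, 100_000))   # -> 0.5
```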
The success rates are listed in Tab. 3. CMARL-CD achieves the highest learning speed and the highest success rate. Both QMIX and VDN achieve a success rate of less than 8% for all values of L. After L = 100,000 learning episodes, LA-OCA achieves a success rate of 46.57%, while CMARL-CD achieves 90.90%. Although LA-OCA, QMIX, and VDN improve given more learning episodes, their success rates remain lower than that of CMARL-CD, which already reaches a 100% success rate after 300,000 learning episodes. These results indicate that CMARL-CD learns much faster than the other algorithms.
The average cumulative rewards and steps are listed in Tab. 4 and Tab. 5, respectively. Taking the entry for CMARL-CD with L = 100,000 in Tab. 4 as an example, '41.90|0.10' represents an average cumulative reward of 41.90 with a standard deviation of 0.10. CMARL-CD obtains the maximum cumulative reward and uses fewer time steps than any of the other algorithms for each value of L. Both QMIX and VDN obtain high cumulative rewards with L = 500,000; however, most of the time they fail to capture the targets within three steps, which leads to a low success rate. The worst cases are shown in Tab. 6 and Tab. 7, which indicate that CMARL-CD is more reliable than the other algorithms.
To verify the effectiveness of CMARL-CD, we visualize the joint strategy obtained by CMARL-CD with L = 500,000, as shown in Fig. 4. Each sensor selects an action with the maximal coordination degree. It can be seen that the optimal joint action is selected under each of the 42 states (not including the absorbing state).

B. TASK 2: BLOOD BATTLEFIELD
Blood battlefield, developed by us, is a strategy game in which each player commands a troop to fight against the opposing player's troop.
Each side has four marines and two gunners. The property values are presented in Fig. 5. No unit can move, like the units in Hearthstone that have already been deployed on the battlefield. Unlike most turn-based strategy games, the units on both sides act simultaneously in each turn. Every live unit must attack a live opponent unit in each turn and may take damage afterward. The true damage depends on the attacker's attack damage (AD) and hit rate (HR). For example, suppose a gunner with 2 HP (hit points) is attacked by a gunner and a marine from the other side; the marine misses and the gunner hits, so the true damage received by the target is 0 + 2 = 2 and its HP becomes 2 − 2 = 0. A dead unit can never become a target. A game ends with one side beating the other or with a tie. A tie occurs if neither side wins within 100 turns.
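The damage-resolution rule just described can be sketched in Python as follows. The AD and HR values come from Fig. 5, which is not reproduced here, so the numbers in the sketch are illustrative assumptions; only the resolution logic follows the rules in the paragraph above.

```python
import random

# Assumed property values (Fig. 5 is not reproduced here).
UNIT_STATS = {"marine": {"ad": 1, "hr": 0.7},
              "gunner": {"ad": 2, "hr": 0.9}}

def resolve_attacks(target_hp, attackers, rng=random.random):
    """attackers: list of unit types attacking the same target simultaneously."""
    damage = 0
    for unit in attackers:
        stats = UNIT_STATS[unit]
        if rng() < stats["hr"]:        # the attack hits with probability HR
            damage += stats["ad"]      # a missed attack contributes 0 damage
    return max(target_hp - damage, 0)  # a unit reduced to 0 HP is destroyed

# Example matching the text: the marine misses, the gunner hits -> 2 damage, 0 HP left.
print(resolve_attacks(2, ["marine", "gunner"], rng=lambda: 0.8))
```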
The state vector is $s = [hp_1, \ldots, hp_6, s_1, \ldots, s_6]^T$, where $hp_i$ is the HP value of opponent $i$, and $s_j \in \{\text{ALIVE}, \text{DESTROYED}\}$ is the state of the $j$-th teammate (including the unit itself). The state space contains $4^4 \times 7^2 \times 2^5 = 401{,}408$ elements. Each unit can select only a live opponent unit as its target. The reward assignment obeys the following rules: if an opponent unit is destroyed, all units of the other side (alive or not) are rewarded with 2; the winner obtains a reward of 10 and the loser a reward of −10; a tie brings a reward of 0 to both sides.
The game involves both coordination and competition. The units of each side need to coordinate to eliminate the opponent units to survive the war. Because the game is not a sequential decision problem, each player needs to consider only its own fire deployment. Four algorithms including CMARL-CD, LA-OCA, QMIX, and VDN are compared in a tournament.
The CMARL-CD algorithm uses the parameters $c_{\max} = 2.5$ and $T = 0.8$; the other parameters are the same as those used in the DSN task. The LA-OCA algorithm uses the same parameters as in the DSN task. The parameter settings for QMIX and VDN are as follows. Each agent network is an MLP with one hidden layer of 49 neurons. The size of the replay buffer is 100,000, and the size of each batch is 720. Parameter updating begins after one batch of tuples is available. The estimation networks are updated every 240 time steps using the Adam optimizer with an initial learning rate of 0.001. The target networks are cloned from the estimation networks every 2,400 time steps. The $\varepsilon$-greedy policy is used during the learning stage, and the exploration rate $\varepsilon$ is annealed from 1.0 to 0.0 as L increases. For QMIX, the mixing network contains one hidden layer of 64 neurons with ELUs.
The tournament has 30 rounds. Each round includes 4 + 4 × (4 − 1)/2 = 10 matches. In each match, one algorithm plays against another algorithm (which may be itself) for 500,000 episodes in the learning stage and then for another 500,000 episodes in the evaluation stage, during which the learned strategies are fixed. All results are averaged over the 30 rounds. Fig. 6 shows the win rate of each algorithm (a tie does not count toward the win rate, which explains why the win rates of two opponents do not sum to 100%). CMARL-CD has a higher win rate against any of the other algorithms. Both QMIX and VDN have a clear advantage over LA-OCA, but each has a win rate of less than 40% against CMARL-CD. Fig. 7 shows the win steps of each algorithm, defined as the average number of steps used in each winning episode; fewer win steps indicate that the algorithm has learned a better strategy to defeat its opponent. CMARL-CD needs fewer win steps than the other algorithms.

VI. CONCLUSION
This paper proposes an MARL algorithm known as CMARL-CD for fully cooperative scenarios. Empirical studies support the theoretical analysis in repeated games. The simulation results on the DSN task and the blood battlefield task demonstrate that the CMARL-CD algorithm can converge to the optimal joint strategy in stochastic games. In the future, we will study the dynamics of the CMARL-CD algorithm in stochastic games and incorporate deep learning into CMARL-CD to improve its scalability.