WRFMR: A Multi-Agent Reinforcement Learning Method for Cooperative Tasks

Multi-agent reinforcement learning (MARL) for cooperative tasks has been extensively studied in recent years. The balance of exploration and exploitation is crucial to a MARL algorithm's performance in terms of the learning speed and the quality of the obtained strategy. To this end, we propose an algorithm known as the weighted relative frequency of obtaining the maximal reward (WRFMR), which uses a weight parameter and the action probability to balance exploration and exploitation and accelerate convergence to the optimal joint action. For the WRFMR algorithm, each agent needs to share the state and the immediate reward but does not need to observe the actions of the other agents. Theoretical analysis of the model of WRFMR in cooperative repeated games shows that each optimal joint action is an asymptotically stable critical point if the component action of every optimal joint action is unique. The box-pushing task, the distributed sensor network (DSN) task, and a strategy game known as blood battlefield are used for empirical studies. Both the DSN task and the box-pushing task involve full cooperation, while blood battlefield comprises both cooperation and competition. The simulation results show that the WRFMR algorithm outperforms the other algorithms regarding the success rate and the learning speed.


I. INTRODUCTION
Reinforcement learning (RL) is a prevalent method to optimize a single agent's strategy in a Markov Decision Process (MDP). An agent can perceive the state with sensors, make decisions, and execute actions through actuators. Some tasks are naturally modeled as multi-agent systems (MASs), in which the Markov property still holds from the view of centralized learning [1]. However, the joint action space grows exponentially as the number of agents increases. Independent learning [2]-[4], which does not require any agent to observe the actions of the other agents, has been proposed to alleviate the curse of dimensionality of the joint action space. In independent learning, each agent maintains a Q-value function that evaluates the benefit of its own actions. In this article, we consider only independent learning algorithms.
The purpose of a MARL algorithm depends on the nature of the task. In zero-sum games, the goal is to maximize each agent's reward while treating the other agents pessimistically [5]. In general-sum games, the goal is to converge to the Nash equilibrium (NE) [6], [7]. In fully cooperative games, the goal is to maximize the sum of all agents' rewards [8]-[12]. In addition, some algorithms can be applied to mixed tasks [13]-[16]. In this article, we focus on algorithms for fully cooperative tasks.
Two factors have to be considered when designing an independent MARL algorithm for fully cooperative tasks. First, convergence to the optimal joint strategy is crucial for an effective algorithm. Most of the existing convergence analyses are limited to repeated games with two agents and two actions; theoretical results on the convergence of MARL in repeated games with an arbitrary finite number of agents and actions are scarce. Second, the learning speed is vital for an efficient algorithm. For an RL-based algorithm, a well-designed exploration and exploitation policy can improve the learning speed: exploitation uses the current information to generate a better solution, while exploration searches the solution space more thoroughly to avoid falling into local optima. We propose an algorithm known as the weighted relative frequency of obtaining the maximal reward (WRFMR). The main contributions are as follows. First, the WRFMR algorithm does not need any agent to observe the actions of the other agents; a joint action learner has to estimate the Q-value of each joint action, whereas an independent learner only estimates the Q-value of each of its own component actions. Second, a decreasing weight parameter and the action probability are used to balance exploration and exploitation and improve the learning speed. Third, we analyze the characteristics of the WRFMR algorithm in repeated games with an arbitrary finite number of agents and actions. Theoretical analysis shows that each optimal joint action is an asymptotically stable critical point if the component action of every optimal joint action is unique. Empirical studies on repeated games and stochastic games are also presented. The efficacy of the WRFMR algorithm is studied through three fully cooperative tasks: the distributed sensor network (DSN) task, the box-pushing task, and a strategy game known as blood battlefield. The results show that the WRFMR algorithm outperforms the other algorithms in terms of the success rate and the learning speed.
The remainder of this article is organized as follows. Section II reviews related work on MARL algorithms. Section III introduces repeated games and stochastic games. Section IV describes the WRFMR algorithm in detail and presents a theoretical analysis of its characteristics in repeated games with an arbitrary number of agents and actions. Section V studies the efficacy of the WRFMR algorithm against other MARL algorithms in three fully cooperative tasks: the distributed sensor network task, the box-pushing task, and a strategy game known as blood battlefield. Section VI concludes the article.

II. PREVIOUS WORK
In this section, the MARL algorithms for fully cooperative games and general-sum games and the multi-agent deep reinforcement learning (MDRL) algorithms are reviewed respectively. MARL algorithms can be classified as joint action learners or independent learners. For a joint action learner, each agent can perceive the actions of all the other agents. For an independent learner, each agent cannot observe the actions of the other agents.
In fully cooperative games, the goal is to optimize the joint strategy to obtain the maximal sum of all agents' rewards. Team Q-learning [17] avoids an explicit coordination mechanism by assuming that the optimal joint action is unique. Joint action learner (JAL) [18] learns the Q-value of each joint action and requires each agent to construct models of its teammates to promote coordination. Optimal adaptive learning (OAL) [19] also requires each agent to construct models of its teammates and uses these models to obtain the optimal joint action of the virtual game built on top of each stage of the stochastic game. The probability of maximum reward based on estimated gradient ascent (PMR-EGA) [20] uses the gradient of the probability of obtaining the maximal reward with respect to each agent's strategy; the gradient information is estimated from the Q-values of the joint actions. PMR-EGA has been proven to converge to the optimal joint action in repeated games with two optimal joint actions that have different component actions. Team Q-learning, JAL, OAL, and PMR-EGA are joint action learners. Q-learning with aggregation (QA-Learning) [21] reduces the complexity of tasks with large state spaces by decomposing the task into more manageable sub-tasks and distributing agents among these sub-tasks, which promotes efficiency and enhances parallelization. The frequency of the maximal reward Q-learning (FMRQ) [22] uses the frequency of obtaining the maximal reward to update the strategy of each agent, and uses stability theory to analyze the convergence of the algorithm in some specific repeated games. QA-Learning and FMRQ are independent learners.
In general-sum games, the goal is to converge to the Nash equilibrium. Some algorithms use gradient information to update each agent's strategy, such as infinitesimal gradient ascent (IGA) [23], win or learn fast IGA (WoLF-IGA) [24], generalized IGA (GIGA) [25], and GIGA-WoLF [26]. Convergence with Model Learning and Safety (CMLES) [27] ensures targeted optimality for memory-bounded agents and safety for any other set of agents. These algorithms are joint action learners. The win or learn fast policy hill climbing (WoLF-PHC) [24] converges to the Nash equilibrium in two-agent, two-action repeated games by using the "Win or Learn Fast" rule to update each agent's strategy. The exponential moving average (EMA) Q-learning [28] algorithm uses the exponential moving average mechanism to update each agent's strategy. The policy gradient ascent with approximate policy prediction (PGA-APP) [29] augments the basic gradient ascent method through approximate policy prediction, and performs better than GIGA-WoLF in some stochastic games. The max or minimax Q-learning (M-Qubed) [30] balances best-response, optimistic, and cautious learning biases to make profitable compromises in general-sum games. WoLF-PHC, EMA Q-learning, PGA-APP, and M-Qubed are independent learners. Table 1 and Table 2 show the classification of MARL algorithms according to the two fundamental classes (independent learner and joint action learner) and the nature of the scenarios (cooperative scenarios and general-sum scenarios), respectively.
MDRL has become an emerging research area in the RL community. In multi-agent deep deterministic policy gradient (MADDPG) [31], each actor uses local observations to select actions, and each critic uses the global state to evaluate the Q-value conditioned on the joint action. Counterfactual multi-agent policy gradients (COMA) [35] uses a centralized critic and addresses multi-agent credit assignment with a counterfactual baseline. Value-decomposition networks (VDN) [32] use a linear value-decomposition method in which the global Q-function is approximated by a sum of local Q-functions. QMIX [33] uses a mixing network to approximate the global Q-function by combining the local Q-functions and the global state in a non-linear way. Lenient-DQN (LDQN) [34] applies a leniency mechanism with decaying temperature values to regulate policy updates. To overcome the non-stationarity problem, the fingerprint method [36] disambiguates the age of the samples drawn from the replay memory by conditioning on a fingerprint.
We propose the WRFMR algorithm for cooperative agents. It has the following characteristics. First, compared with joint action learners such as OAL and PMR-EGA, the WRFMR algorithm does not need each agent to observe the actions of the other agents and therefore mitigates the curse of dimensionality of the action space. Second, compared with FMRQ, the WRFMR algorithm uses the weight parameter and the action probabilities to accelerate convergence to the optimal joint action.

III. PRELIMINARIES
A. STOCHASTIC GAMES
A stochastic game is a tuple $\langle S, A_1, \ldots, A_n, T, r_1, \ldots, r_n \rangle$, where n is the number of agents, S is the set of states, $A_i$ is the set of agent i's actions, T is the state transition function, and $r_i$ is the immediate reward of agent i. The set of joint actions is denoted by $A = A_1 \times A_2 \times \cdots \times A_n$, which consists of the actions of all agents. The state transition function gives the probability of transiting to the next state $s'$ given the current state $s$ and the executed joint action $a$. The immediate reward of agent i, $r_i: S \times A_1 \times A_2 \times \cdots \times A_n \times S \to \mathbb{R}$, is determined by the state $s$, the joint action $a$, and the next state $s'$. The global immediate reward is the sum of each agent's immediate reward and is denoted by $r = \sum_{i=1}^{n} r_i$. The goal of a fully cooperative stochastic game is to maximize the following discounted cumulative reward at each time t:
$$R(t) = \sum_{k=t}^{K-1} \gamma^{k-t}\, r(k+1), \qquad (1)$$
where $\gamma$ is the discount factor used to weigh the importance of future rewards, K is the ending time of an episode, and $r(t+1)$ is the global immediate reward received at time $t+1$.
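As a quick illustration of (1), the discounted cumulative global reward of an episode can be computed backwards over the recorded global immediate rewards. The sketch below uses an arbitrary reward sequence chosen only for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    # rewards[k] is the global immediate reward r(t + k + 1) received after step t + k.
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: r = [-1, -1, 10] gives -1 + 0.9*(-1) + 0.81*10 = 6.2
print(discounted_return([-1.0, -1.0, 10.0]))
```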

B. REPEATED GAMES
A repeated game is formed by repeated plays of the same stage game. In each stage game, the agents choose their own actions and a joint action is formed. According to the selected joint action, each agent receives a local immediate reward. The global immediate reward is the sum of each agent's local immediate reward. In cooperative repeated games, each agent receives the global immediate reward in each stage game and optimizes its strategy to obtain the maximal global immediate reward. The strategy of an agent is pure if the probability of some action is one; otherwise, the strategy is mixed. The joint strategy is pure if the strategy of each agent is pure. In this article, our aim is to obtain the optimal pure joint strategy for cooperative repeated games. The payoff matrix of a cooperative repeated game with two agents and three actions is shown in Fig.1. In the payoff matrix, each row represents an action of agent A, each column represents an action of agent B, and each numerical value represents the global immediate reward received by both agents. If agent A selects the second action (the second row) and agent B selects the first action (the first column), both of them receive a global immediate reward of 2. The goal in each game is to obtain the maximal global immediate reward, marked with parentheses in Fig.1.

IV. WRFMR ALGORITHM FOR COOPERATIVE AGENTS
A. FORMULATION OF THE WRFMR ALGORITHM
The WRFMR algorithm is proposed to optimize the performance indices of fully cooperative tasks. For WRFMR, each agent needs to observe the state and the immediate rewards of the other agents, but does not need to observe the actions of any other agent. The pseudo code of the WRFMR algorithm for repeated games is shown in Algorithm 1. Each agent selects an action according to the Boltzmann distribution
$$p_{ij}(t+1) = \frac{\exp(Q_{ij}(t)/T)}{\sum_{g=1}^{|A_i|} \exp(Q_{ig}(t)/T)}, \qquad (2)$$
where $p_{ij}(t+1)$ represents the probability of agent i selecting its j-th action, $Q_{ij}(t)$ is the Q-value of the j-th action of agent i, $|A_i|$ is the number of agent i's actions, and T is the temperature parameter. After each game, the frequency of obtaining the maximal immediate reward and the Q-function of each agent are updated. The Q-value updating rule is given by (3), where α ∈ (0, 1) is the learning rate, β ∈ [0, 0.5) is the weight parameter, and $u_{ij}(t)$ is the relative frequency of agent i selecting its j-th action. The weight parameter and the action probability are used to balance exploration and exploitation. The relative frequency $u_{ij}(t)$ is defined by (4), where $f_{ij}(t)$ represents the frequency of obtaining the maximal global immediate reward when agent i selects its j-th action. The value of this frequency is small during the early learning stage, so the relative frequency is used to speed up the learning process. The frequency of obtaining the maximal global immediate reward is estimated according to (5) and (6).
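To make the update loop concrete, the sketch below implements Boltzmann action selection and a frequency-based Q-value update for a single agent in a repeated game. The bodies of (3)-(6) are not reproduced in this text, so the weighted combination `(1 - beta) * u + beta * p`, the normalization used for the relative frequency, and the two-rate frequency estimate are plausible assumptions consistent with the description above, not the paper's exact equations.

```python
import numpy as np

class WRFMRAgent:
    """Sketch of one WRFMR agent for a repeated game.

    The update rules below are assumed forms consistent with the textual
    description of (3)-(6); they are not the paper's verbatim equations.
    """

    def __init__(self, n_actions, T=10.0, alpha=0.01, alpha_l=0.1, alpha_h=0.6, beta=0.1):
        self.n_actions = n_actions
        self.T = T              # temperature parameter
        self.alpha = alpha      # Q-value learning rate
        self.alpha_l = alpha_l  # low learning rate for non-selected actions
        self.alpha_h = alpha_h  # high learning rate for the selected action
        self.beta = beta        # weight parameter in [0, 0.5)
        self.Q = np.zeros(n_actions)
        self.f = np.zeros(n_actions)   # frequency of obtaining the maximal reward
        self.r_max = -np.inf           # maximal immediate reward observed so far

    def action_probabilities(self):
        # Boltzmann (softmax) distribution over Q-values, equation (2).
        z = np.exp(self.Q / self.T)
        return z / z.sum()

    def select_action(self):
        return np.random.choice(self.n_actions, p=self.action_probabilities())

    def update(self, action, reward):
        # Track the maximal immediate reward obtained in history.
        self.r_max = max(self.r_max, reward)
        hit = 1.0 if reward >= self.r_max else 0.0
        # Frequency estimate: fast update for the selected action, slow decay
        # for the other actions (assumed form of (5)-(6)).
        for g in range(self.n_actions):
            if g == action:
                self.f[g] += self.alpha_h * (hit - self.f[g])
            else:
                self.f[g] += self.alpha_l * (0.0 - self.f[g])
        # Relative frequency (assumed form of (4)): normalized so it is informative
        # even when the raw frequencies are still small early in learning.
        u = self.f / max(self.f.max(), 1e-8)
        # Q-value update (assumed form of (3)): the weight parameter beta mixes
        # the relative frequency with the current action probability.
        p = self.action_probabilities()
        target = (1.0 - self.beta) * u + self.beta * p
        self.Q += self.alpha * (target - self.Q)
```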

Algorithm 1
The WRFMR Algorithm for Repeated Games
1: Initialize Q_ij(t), u_ij(t), and f_ij(t) for agent i to zero, for i = 1, 2, . . . , n, j = 1, 2, . . . , |A_i|.
2: repeat
3:   for each agent i do
4:     Select an action according to (2).
5:   end for
6:   for each agent i do
7:     Observe the reward r_i.
8:     if r_i ≥ r_i_max then
9:       r_i_max = r_i.
10:    end if
11:    for each action j do
12:      Evaluate u_ij(t) and f_ij(t) according to (4)-(6).
13:      Update Q_ij(t) according to (3).
14:    end for
15:  end for
16: until the predefined number of games have been played
17: return the Q-value function of each agent

The frequency estimates in (5) and (6) use different learning rates for the selected action a_ij at time step t and for each of the other actions a_ig (g ≠ j) at time step t. Here, a_ij represents the j-th action of agent i, α_h and α_l (α_l < α_h) are learning rates, r_ij(t) is the immediate reward when agent i selects a_ij, and r_i_max(t) is the maximal immediate reward obtained by agent i in history.
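A minimal driver loop for Algorithm 1 is sketched below. It assumes the WRFMRAgent sketch given earlier and a hypothetical 2-agent, 3-action payoff matrix chosen only for illustration (the matrix of Fig.1 is not reproduced here); the optimal joint action of this hypothetical matrix is (2, 2).

```python
import numpy as np

# Hypothetical 3x3 cooperative payoff matrix (not the matrix of Fig.1):
# rows index agent A's action, columns index agent B's action.
payoff = np.array([[3.0, 2.0, 1.0],
                   [2.0, 1.0, 0.0],
                   [1.0, 0.0, 5.0]])   # unique maximal global reward at (2, 2)

agents = [WRFMRAgent(n_actions=3), WRFMRAgent(n_actions=3)]

for game in range(200_000):                 # predefined number of games
    actions = [agent.select_action() for agent in agents]
    r = payoff[actions[0], actions[1]]      # global immediate reward shared by both agents
    for agent, a in zip(agents, actions):
        agent.update(a, r)

# Inspect the learned (mixed) strategies of both agents.
print([np.round(agent.action_probabilities(), 3) for agent in agents])
```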

B. ANALYSIS OF THE WRFMR ALGORITHM
Theorem 1: In a cooperative repeated game with n (n ≥ 2) agents and m (m ≥ 2) optimal joint actions, each agent adopts the WRFMR algorithm. If the component action of every optimal joint action is unique, then all of the m optimal joint actions are asymptotically stable critical points.
Proof: Let $p_{ij}$ denote the probability of agent i selecting its component action of the j-th optimal joint action, i = 1, 2, . . . , n, j = 1, 2, . . . , m, and let $Q_{ij}$ denote the Q-value of agent i's component action of the j-th optimal joint action. According to (2) and (3), the probabilities of the component actions that can never obtain the maximal global reward gradually decrease to zero. When the learning rate α is infinitesimally small, the Q-value updating process of the WRFMR algorithm can be approximated by a system of differential equations in $\dot{Q}_{ij}$. Applying the total derivative formula to this system yields the model of the WRFMR algorithm given by (8). Any critical point of the system described by (8) must satisfy the corresponding stationarity conditions, and it is obvious that any pure joint strategy is a critical point. After a change of variables, the model is transformed into a reduced system whose Jacobian matrix $J \in \mathbb{R}^{(m-1)n \times (m-1)n}$ at any of the m optimal joint actions has all eigenvalues equal to $(2\beta - 1)/T$. Since β is strictly less than 0.5, these eigenvalues are negative, and therefore each of the m optimal joint actions is an asymptotically stable critical point.
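The Jacobian itself is not reproduced above. A form consistent with the stated eigenvalues, assumed here only for illustration, is a scalar multiple of the identity, from which local asymptotic stability follows directly by Lyapunov's indirect method:
$$J = \frac{2\beta - 1}{T}\, I_{(m-1)n}, \qquad \lambda(J) = \frac{2\beta - 1}{T} < 0 \quad \text{for } \beta \in [0, 0.5),\ T > 0.$$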

C. EMPIRICAL STUDIES ON REPEATED GAMES
The convergence of the WRFMR algorithm in repeated games is verified experimentally in this section. The aim is to obtain the maximal global reward in repeated games with n = 2, 3, 4, 5, 6, 7 agents and m = 2, 3, 4, 5 actions. The results are averaged over 100 runs. The payoff matrix is randomly generated for each run under the assumption of Theorem 1 (one possible generation procedure is sketched below). A run is successful if each agent converges to a pure strategy and the joint strategy obtains the maximal global immediate reward. An agent's strategy is considered pure if the probability of choosing some action is no less than 0.999. The WRFMR algorithm uses the parameters T = 10, α = 0.01, α_l = 0.1, α_h = 0.6, and β = 0.1. As shown in Table 3, the success rate is 100% in all cases. The WRFMR algorithm converges to one of the optimal joint actions for all values of n and m. The simulation results are consistent with Theorem 1.
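The paper does not describe the exact generation procedure, so the sketch below is only one way to build a random global payoff tensor that satisfies the uniqueness assumption of Theorem 1: every optimal joint action attains the maximal reward, and no two optima share a component action for any agent.

```python
import numpy as np

def random_cooperative_game(n_agents, n_actions, n_optima, r_max=10.0, rng=None):
    """Random global payoff tensor with n_optima optimal joint actions whose
    component actions are distinct per agent (assumption of Theorem 1).
    Illustrative only; not the paper's generation procedure."""
    assert n_optima <= n_actions
    rng = np.random.default_rng() if rng is None else rng
    # Non-optimal payoffs are kept strictly below r_max.
    payoff = rng.uniform(0.0, 0.9 * r_max, size=(n_actions,) * n_agents)
    # For each agent, pick n_optima distinct component actions, one per optimum.
    components = [rng.permutation(n_actions)[:n_optima] for _ in range(n_agents)]
    for k in range(n_optima):
        joint = tuple(components[i][k] for i in range(n_agents))
        payoff[joint] = r_max            # each optimum attains the maximal reward
    return payoff

# Example: 3 agents, 4 actions each, 2 optimal joint actions.
game = random_cooperative_game(n_agents=3, n_actions=4, n_optima=2)
print(game.shape, game.max())
```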

Algorithm 2
The WRFMR Algorithm for Stochastic Games
1: Initialize Q_ij(s), u_ij(s), and f_ij(s) for agent i to zero, for i = 1, 2, . . . , n, j = 1, 2, . . . , |A_i|.
2: repeat
3:   repeat
4:     for each agent i do
5:       Select an action according to (13).
6:     end for
7:     for each agent i do
8:       Observe the next state s' and the immediate reward r_i.
9:       Record the tuple <s, a_i, s', r_i>.
10:    end for
11:  until the episode is ended
12:  for each agent i do
13:    for each visited state s in the last episode do
14:      Evaluate R_i(s) by the recorded tuples and (1).
15:      if R_i(s) ≥ R_i_max(s) then R_i_max(s) = R_i(s)
16:      end if
17:      for each action j do
18:        Evaluate u_ij(s) and f_ij(s) according to (15)-(17).
19:        Update Q_ij(s) according to (14).
20:      end for
21:    end for
22:  end for
23: until the predefined number of episodes have been played
24: return the Q-value function of each agent

The WRFMR algorithm can be extended to solve cooperative stochastic games. The pseudo code is shown in Algorithm 2. Each agent selects an action at state s according to (13), where p_ij(s) represents the probability of agent i selecting its j-th action at state s and Q_ij(s) is the Q-value of the j-th action of agent i at state s. When an episode ends, the cumulative global reward in each visited state is evaluated by (1). Then the frequency of obtaining the maximal cumulative global reward and the Q-value of each visited state-action pair (s, a_i), for i = 1, 2, . . . , n, are updated. The Q-value updating rule is given by (14), where u_ij(s) represents the relative frequency of agent i selecting its j-th action at state s and is defined by (15), in which f_ij(s) represents the frequency of obtaining the maximal cumulative reward when agent i selects its j-th action at state s. This frequency is estimated according to (16) and (17), which use different learning rates for the selected action a_ij at state s and for each of the other actions a_ig (g ≠ j) at state s. Here, R_ij(s) is the cumulative global reward when agent i selects a_ij at state s, and R_i_max(s) is the maximal cumulative reward obtained by agent i at state s in history.
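The stochastic-game variant mainly adds per-state bookkeeping and an episode-level return evaluation. The fragment below sketches only the data layout and the backward return evaluation of Algorithm 2 (steps 13-14); the per-state update rules (14)-(17) are assumed to mirror the repeated-game forms and are not reproduced here.

```python
from collections import defaultdict
import numpy as np

class WRFMRStateTables:
    """Per-state bookkeeping sketch for Algorithm 2 (data layout only).
    States must be hashable (e.g., tuples). Update rules are assumed, not quoted."""

    def __init__(self, n_actions, T=8.0, gamma=0.9):
        self.n_actions = n_actions
        self.T = T
        self.gamma = gamma
        self.Q = defaultdict(lambda: np.zeros(n_actions))   # Q_ij(s)
        self.f = defaultdict(lambda: np.zeros(n_actions))   # f_ij(s)
        self.R_max = defaultdict(lambda: -np.inf)            # R_i_max(s)

    def select_action(self, s):
        # Equation (13): Boltzmann distribution over the Q-values at state s.
        z = np.exp(self.Q[s] / self.T)
        return np.random.choice(self.n_actions, p=z / z.sum())

    def episode_returns(self, trajectory):
        # trajectory: list of (s, a_i, s_next, r) tuples recorded during one episode.
        # Returns the discounted cumulative global reward R_i(s) at each visited
        # state, evaluated backwards as in (1).
        returns, G = [], 0.0
        for s, a, s_next, r in reversed(trajectory):
            G = r + self.gamma * G
            returns.append((s, a, G))
        returns.reverse()
        return returns
```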

V. EMPIRICAL STUDIES FOR COOPERATIVE TASKS
In this section, the efficacy of the WRFMR algorithm is studied in three cooperative tasks: the box-pushing task, the DSN task, and a game known as blood battlefield. The differences among these tasks are as follows. First, the box-pushing task is a stochastic game with a deterministic transition function, while both the DSN task and blood battlefield are stochastic games with probabilistic transition functions. Second, the box-pushing task and the DSN task involve only cooperation, while blood battlefield comprises both cooperation and competition.

A. TASK 1: BOX-PUSHING
The box-pushing task [37] is shown in Fig.2. Six boxes are placed on six of the 12 vertices of a polygon. Each agent is responsible for pushing one box. Each agent can choose to push its box to one of the adjacent vertices or keep it still. The goal of this task is to coordinate the six agents so that the boxes are distributed evenly. At the beginning of each episode, the boxes are located at random positions. The state variables are the five positions relative to box 1. The number of states is $C_{11}^{5} \times 5! = 462 \times 120 = 55440$. If all boxes are distributed evenly, each agent obtains a reward of 10; otherwise, each agent obtains a reward of −1.
The rules of the box-pushing task are as follows. First, all agents push the boxes simultaneously. Second, if two or more boxes collide with each other, they stay still; otherwise, they move successfully. A collision occurs when a box moves onto another box that chooses to stay still or has failed to move, when two boxes move to the same position, or when two adjacent boxes move towards each other. A sketch of this collision rule is given below.
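The following sketch resolves one simultaneous move according to the rules above. The exact simulator used in the paper is not described, so details such as the fixed-point cascade (a box failing because its target is held by another box that failed) are assumptions made only to illustrate the rules.

```python
def resolve_moves(positions, moves, n_vertices=12):
    # positions: distinct vertex indices of the boxes; moves: -1, 0, or +1 per box.
    # Returns the new positions; every box involved in a collision stays still.
    n = len(positions)
    intent = [(positions[i] + moves[i]) % n_vertices for i in range(n)]
    ok = [moves[i] != 0 for i in range(n)]
    # Pairwise collisions between two moving boxes: same target vertex, or an
    # adjacent head-on swap; both boxes involved stay still.
    for i in range(n):
        for j in range(i + 1, n):
            if moves[i] != 0 and moves[j] != 0:
                same_cell = intent[i] == intent[j]
                head_on = intent[i] == positions[j] and intent[j] == positions[i]
                if same_cell or head_on:
                    ok[i] = ok[j] = False
    # Cascade: a box moving onto a vertex held by a box that stays still or has
    # already failed to move also fails; repeat until no more boxes fail.
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if ok[i]:
                for j in range(n):
                    if j != i and not ok[j] and intent[i] == positions[j]:
                        ok[i] = False
                        changed = True
                        break
    return [intent[i] if ok[i] else positions[i] for i in range(n)]
```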
FMRQ [22], WoLF-PHC [24], EMA Q-learning [28], and EAQR [12] are selected as comparison algorithms. The results are averaged over 100 runs, and each run includes L learning episodes and 300,000 evaluation episodes. For the WRFMR algorithm, the discount factor γ is set to 0.9, the learning rates are α = 0.01, α_l = 0.15, and α_h = 0.6, and the temperature parameter T follows a schedule in which n represents the current episode and L represents the total number of learning episodes. The weight parameter β follows a decreasing schedule. For EMA Q-learning, γ = 0.9, η_w = 0.1, η_l = 0.001η_w, ε = 0.6, and α follows (18) with α_ini = 0.2. For WoLF-PHC, γ = 0.9, δ_w = 0.003, δ_l = 0.01, ε = 0.1, and α follows (18) with α_ini = 0.6. For FMRQ, α = 0.5, γ = 1.0, and T follows its own schedule. For EAQR, γ = 0.9, α = 0.7, and ε follows a schedule.

The performance metrics include the average success rate, the average number of steps, and the standard deviation. An episode is successful if the task is completed in the minimal number of steps; we developed a program to compute the actual minimal number of steps for each episode. The average number of steps and the standard deviation are shown in Table 4. Taking the WRFMR algorithm with L = 5,000,000 as an example, the entry 1.55 | 0.01 represents an average number of steps of 1.55 and a standard deviation of 0.01.

TABLE 4. Average steps and standard deviation for the 6-agent-12-vertex box pushing task (runs = 100).

TABLE 5. Success rate for the 6-agent-12-vertex box pushing task (runs = 100).

The performance of all algorithms improves as L increases. The WRFMR algorithm and WoLF-PHC perform well for all values of L. WoLF-PHC achieves the smallest average number of steps and the smallest standard deviation among all algorithms when L is 5,000,000. As L increases, the WRFMR algorithm outperforms WoLF-PHC in terms of average steps: after 10,000,000 learning episodes, the WRFMR algorithm uses 1.48 average steps per episode. The average steps per episode in the box-pushing task are shown in Fig.3. In the initial learning stage, the learning speed of WRFMR is lower than that of WoLF-PHC, EMA Q-learning, and EAQR. However, after four million episodes, only WRFMR continues to improve its performance. The average steps for WRFMR drop off at the 4,000,000-th learning episode because the joint strategy becomes more greedy as T changes at that episode. Table 5 shows the success rate. The success rate of FMRQ with L = 5,000,000 is higher than the success rates of the other algorithms. The WRFMR algorithm has the highest learning speed and obtains the highest success rate when L = 10,000,000. EMA Q-learning obtains the lowest success rate for all values of L.

B. TASK 2: DISTRIBUTED SENSOR NETWORK
The goal of the DSN [38] task is to coordinate the sensors to capture two targets in the minimal number of time steps. Fig.4 shows a DSN with twelve sensors and two targets. Each sensor is viewed as an agent. The action set of each sensor is shown in Table 6. The number of joint actions is 291,600. At each step, each target moves in one of the four directions (up, down, left, and right) or stays still, each with equal probability. If a target tries to move out of the grid or into a cell that is occupied by another target, it fails and stays still. Each cell can be occupied by at most one target.
At the beginning of an episode, each target has an energy value of three. The energy of a target is reduced by one if the target is focused on by at least three sensors. If a target's energy decreases to zero, it is captured and removed from the grid. The state variables contain the number of uncaptured targets and the positions of both targets. The state space contains 43 elements. An episode ends if both targets are captured or 300 time steps have elapsed.
The reward function is defined as follows. If a target is captured, each of the three sensors involved in the capture is rewarded with 10. If four sensors capture the target, the sensor with the largest index receives a reward of 0. The focus action produces an immediate reward of −1, and not focusing produces an immediate reward of 0. For the DSN task, the actual maximal cumulative reward is 42, and the minimal number of time steps is 3. A success is obtained if a cumulative reward of 42 is obtained in an evaluation episode.
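One accounting consistent with this reward function: exactly three sensors focus on each target for the three steps needed to exhaust its energy, so the two captures pay $2 \times 3 \times 10 = 60$ while the focusing costs $2 \times 3 \times 3 \times 1 = 18$, giving
$$60 - 18 = 42,$$
and the minimal number of steps is 3 because each target starts with energy 3 and can lose at most one unit of energy per step.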
The results are averaged over 100 runs, and each run includes L learning episodes and 50,000 evaluation episodes. The WRFMR algorithm uses the parameters γ = 0.9, α = 0.01, α_l = 0.1, α_h = 0.8, and T = 8, and β follows a decreasing schedule. For FMRQ, γ = 0.9, α = 0.2, and T follows a schedule. For EMA Q-learning, γ = 0.9, η_w = 0.1, η_l = 0.001η_w, k = 2, ε = 0.2, and α follows (18) with α_ini = 0.7. For WoLF-PHC, γ = 0.9, δ_w = 0.003, δ_l = 0.01, ε = 0.2, and α follows (18) with α_ini = 0.7. For EAQR, γ = 0.9, α = 0.2 (following (18)), and ε follows a schedule. Table 7 shows the success rate. Compared with the other algorithms, the WRFMR algorithm obtains a higher success rate; after 3,000,000 episodes, it reaches a success rate of 100%. The success rate of FMRQ rises as L increases, but it learns more slowly than the WRFMR algorithm. Besides, learning speed is a major advantage of the WRFMR algorithm. The average number of steps per episode in the DSN task is shown in Fig.5, from which we can see that WRFMR converges more quickly than the other algorithms. The average steps for WRFMR do not drop off at the 4,000,000-th learning episode, because the optimal joint strategy has already been obtained before T changes.
The average cumulative reward is shown in Table 8. All the algorithms except WoLF-PHC obtain more cumulative reward as L increases. The WRFMR algorithm is the only one that obtains the optimal average cumulative reward of 42, which is consistent with the success rate results; the worst case is presented in Table 9.
The average number of steps is presented in Table 10, and the worst case is presented in Table 11. The WRFMR algorithm clearly outperforms the other algorithms in terms of the number of steps. After 3,000,000 episodes, the WRFMR algorithm needs an average of 3.01 steps to capture both targets.
To verify the statistical significance of the results in Table 8 and Table 10, we conduct a one-way analysis of variance (ANOVA). As shown in Table 12, the p-values are less than 0.05 and F is greater than F-critical, which indicates that there is a statistically significant difference between the results of the groups. In addition, we conduct a one-tailed t-test at the 0.05 significance level for the average steps and the average cumulative reward. The p-values for the t-tests between WRFMR and each of the other algorithms are presented in Table 13. A p-value less than 0.05 indicates a statistically significant difference between the WRFMR algorithm and the corresponding algorithm.
The comparison MARL algorithms do not achieve the best performance for several reasons. FMRQ and EAQR learn slowly because estimating the probability of obtaining the maximal reward consumes too much time. Their relatively poor performance in average steps might be because the punishment induced by the discount factor is not strong enough. As for WoLF-PHC and EMA Q-learning, they might converge to the Nash equilibrium at each stage, but that equilibrium is not the optimal one that obtains the maximal cumulative reward.

C. TASK 3: BLOOD BATTLEFIELD
Blood battlefield is a strategy game that involves both cooperation and competition. In this game, each player commands a team to fight against another player who commands the opposing team. Each team has two gunners and four riflemen. The goal is to eliminate all the units of the opponent team and survive the battle.
The attribute values of both unit types are presented in Table 14. A live unit can attack only one live opponent unit at each step. The damage caused by a unit is determined by the unit's AD (attack damage) and HR (hit rate). All units of both sides act simultaneously at each step. If the HP (hit point) value of a unit drops below zero, the unit is eliminated. A player wins the battle if at least one of its units survives and all units of the other player are eliminated. A tie happens if all units of both sides are eliminated at the same step. An episode ends if one player wins the game, a tie happens, or 100 time steps have elapsed.
The state vector is expressed as s = [h_1, . . . , h_6, k_1, . . . , k_6]^T, where h_i represents the HP of the i-th opponent unit and k_l ∈ {live, dead} represents the status of the player's l-th teammate. The number of states is $4^4 \times 7^2 \times 2^6 = 802,816$. The number of available actions for each unit depends on the number of live opponent units. The reward function is described as follows: if a unit is eliminated, the opponent player receives a reward of 2; when an episode ends, the winner receives a reward of 10 while the loser receives a reward of −10; if a tie occurs or 100 time steps have elapsed, both players receive a reward of 0. The WRFMR algorithm uses the parameters α = 0.01, α_l = 0.05, α_h = 0.8, T = 0.5, γ = 1, and β follows a decreasing schedule. FMRQ and EAQR require more memory than our computing resources provide and cannot perform this task on our computer in their original form, so the approximate frequency of obtaining the maximal cumulative reward is also used for FMRQ and EAQR to avoid recording large amounts of data. For FMRQ, α = 0.1, α_l = 0.1, α_h = 0.8, T = 0.6, and γ = 1. For EMA Q-learning, α = 0.1, ε = 0.2, η_w = 0.1, η_l = 0.001η_w, and γ = 0.9. For WoLF-PHC, δ_w = 0.003, δ_l = 0.01, γ = 0.9, and α = 0.1.

The results are averaged over 100 runs, and each run includes 1,000,000 learning episodes and 1,000,000 evaluation episodes. Fig.6 shows the win rates of the algorithms against each other; ties are not counted. The WRFMR algorithm gains the highest win rate against FMRQ, EMA Q-learning, WoLF-PHC, and EAQR. The WRFMR algorithm performs only slightly better than FMRQ overall, but its win rate against FMRQ is clearly higher than the win rate of FMRQ against WRFMR. The win rates of both EMA Q-learning and WoLF-PHC are not high because the Nash equilibrium is not the optimal strategy in this game.
To verify the efficacy of the WRFMR algorithm, we visualize a match between it (team 1) and FMRQ (team 2). As shown in Fig.7, the actions of the units are indicated by arrows: an arrow with a solid line represents a hit from team 1, and an arrow with a dotted line represents a hit from team 2. For example, the leftmost rifleman of team 2 is attacked by two riflemen and a gunner of team 1 at step 0, and only one rifleman hits successfully; thus the total damage taken by the target is 0 + 0 + 2 = 2, and the target's HP is reduced by 2.

FIGURE 7. A game played by the WRFMR algorithm (team 1) and the FMRQ algorithm (team 2) (after 1,000,000 learning episodes).

At each step, several units of team 1 concentrate their fire on the same opponent unit. By contrast, the fire of team 2 is scattered over more opponent units. Team 2 is outnumbered after step 2 and is finally defeated by team 1 at step 5. The WRFMR algorithm performs well in all three stochastic games, which indicates that it can converge to a joint strategy achieving the optimal cumulative global reward in many cases.

VI. CONCLUSION
This article aims to solve the coordination problem in fully cooperative MASs. We propose the WRFMR algorithm to obtain the maximal global reward. The decreasing weight parameter and the action probability are used to balance exploration and exploitation to improve the learning speed. Theoretical analysis on the model of the WRFMR algorithm for cooperative repeated games indicates that if the component action of every optimal joint action is unique, then all optimal joint actions are asymptotically stable critical points. The efficacy of the WRFMR algorithm is also studied empirically through three cooperative tasks. The results show that the WRFMR algorithm performs better than other algorithms in the box-pushing task, the DSN task, and blood battlefield in terms of the success rate and the learning speed. In the future, the convergence of the algorithm will be further studied.